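arXivDaily: a daily academic digest synced with the full arXiv feed, with AI summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems science, and other fields.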

2605.17415 2026-05-25 cs.LG cs.AI cs.DB cs.IR (updated version)

IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer

Tarun Sharma

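Affiliation: Independent Researcher

AI summary: This paper proposes IVF-TQ, a streaming vector search index that achieves calibration-free approximate nearest neighbor search through a codebook-free residual compression layer. The core idea is to replace the learned codebook with a fixed random rotation and a precomputed Lloyd-Max scalar quantizer, configured only by bit width and dimension, so that the compression layer needs no training and stays stable as streaming data arrives. Experiments show that IVF-TQ holds its performance across multiple datasets and memory regimes without retraining or per-dataset bit-budget tuning, improving search efficiency and robustness in streaming settings.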

Abstract

Approximate nearest neighbor (ANN) indexes deployed against streaming corpora silently lose recall over weeks. The standard diagnosis is distribution shift, but under shuffled-i.i.d. ingestion -- no shift at all -- product quantization still degrades -3.8pp at sub-matched bit budgets. The dominant production compression methods (PQ, OPQ, ScaNN) all fit a codebook to an initial sample and reuse it as the database grows by orders of magnitude. This paper presents IVF-TQ, an inverted-file index whose residual compression layer is data-independent: a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer parameterised only by the bit width b and dimension d. Only the IVF coarse k-means partition is trained. A uniform-over-sphere inner-product error bound depending only on (b, d, delta) provides a structural guarantee no learned-codebook method admits. The same codebook-free design enables an IVF-amplification effect that closes the gap to Extended RaBitQ to within statistical noise (+17.7pp over flat TQ at matched bit budget), and an Adaptive variant that refreshes the partition without touching the compression layer. Across nine controlled cells (three 10M datasets, three PQ memory regimes, three seeds), per-batch PQ codebook retraining never recovers the streaming gap; IVF-PQ streaming stability requires per-dataset bit-budget tuning, while IVF-TQ holds at one fixed (b, d) configuration on all three datasets with Delta in [-0.80, +0.56]pp. The contribution is operational: no codebook to train, no per-dataset bit-budget tuning, no retraining cycle that ever closes the gap.
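
The data-independent compression layer described in this abstract can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes unit-norm inputs, approximates each rotated coordinate as Gaussian with variance 1/d, and omits the IVF partition and any norm handling. All function names are hypothetical.

```python
import math
import random


def _pdf(x):
    """Standard normal density."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def _cdf(x):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def lloyd_max_levels(bits, dim, iters=200):
    """Lloyd-Max levels for N(0, 1/dim); a function of (bits, dim) only."""
    k = 2 ** bits
    levels = [-2.0 + 4.0 * (i + 0.5) / k for i in range(k)]  # initial guess
    for _ in range(iters):
        # Decision thresholds sit midway between neighbouring levels.
        mids = [(a + b) / 2.0 for a, b in zip(levels, levels[1:])]
        edges = [-math.inf] + mids + [math.inf]
        # Each level moves to the conditional mean of its cell.
        levels = [(_pdf(lo) - _pdf(hi)) / (_cdf(hi) - _cdf(lo))
                  for lo, hi in zip(edges, edges[1:])]
    return [v / math.sqrt(dim) for v in levels]


def random_rotation(dim, seed=0):
    """Fixed random orthogonal matrix (Gram-Schmidt on Gaussian rows)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(dim):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        for u in rows:
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        rows.append([a / norm for a in v])
    return rows


def encode(x, rotation, levels):
    """Rotate x, then store the index of the nearest level per coordinate."""
    rotated = [sum(r * xi for r, xi in zip(row, x)) for row in rotation]
    return [min(range(len(levels)), key=lambda j: abs(levels[j] - c))
            for c in rotated]


def decode(codes, rotation, levels):
    """Look up the levels and apply the inverse (transposed) rotation."""
    y = [levels[j] for j in codes]
    dim = len(rotation)
    return [sum(rotation[i][k] * y[i] for i in range(dim)) for k in range(dim)]
```

Nothing here is fitted to the corpus: the rotation depends only on a seed and the levels only on (b, d), which is the property the abstract credits for stability as the database grows.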

2605.23901 2026-05-25 cs.LG cs.AI cs.IT math.IT (updated version)

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma

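Affiliations: University of Virginia; University of California, Berkeley

AI summary: Taking a Shannon information-theoretic view, this paper models LLM training as information transmission over a noisy channel and proposes the Shannon Scaling Law to explain non-monotonic phenomena that conventional monotonic scaling laws cannot describe, such as catastrophic overtraining and quantization-induced degradation. By mapping model parameters to channel bandwidth and training data to signal power, the theory shows that scaling model size or data without maintaining a sufficient signal-to-noise ratio amplifies noise and causes U-shaped performance degradation. Experiments show that the law outperforms conventional scaling laws across multiple tasks and perturbation settings, with good fit and extrapolation.

Comments: Accepted by ICML 2026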

Abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
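
For readers unfamiliar with the theorem the abstract builds on, the classical Shannon-Hartley capacity is easy to compute directly. The sketch below covers only the textbook formula; the paper's mapping of parameters to bandwidth and tokens to signal power, and its fitted law, are not reproduced. The noise model with spectral density `n0` is an illustrative assumption.

```python
import math


def shannon_capacity(bandwidth, signal_power, noise_power):
    """Shannon-Hartley capacity C = B * log2(1 + S / N)."""
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def capacity_white_noise(bandwidth, signal_power, n0):
    """Capacity when noise power grows with bandwidth: N = n0 * B."""
    return shannon_capacity(bandwidth, signal_power, n0 * bandwidth)


def wideband_limit(signal_power, n0):
    """Limit of the capacity above as bandwidth grows: S / (n0 * ln 2)."""
    return signal_power / (n0 * math.log(2.0))
```

With white noise, widening the channel at fixed signal power yields diminishing returns and the capacity approaches `wideband_limit`, which gives some intuition for why scaling one resource without preserving SNR stops helping.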

2605.23899 2026-05-25 cs.AI (updated version)

CHRONOS: Temporally Aware Multi-Agent Coordination for Evolving Data Marketplaces

Joydeep Chandra

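Affiliation: BNRIST, Tsinghua University

AI summary: CHRONOS is a multi-agent coordination framework for dynamic data marketplaces. It targets problems that data evolution causes in static designs: degraded retrieval, inaccurate value attribution, and over-consumption of the privacy budget. The method uses a three-layer architecture that combines time-aware neural ODEs, changepoint-conditioned Shapley valuation, and a differentially private online learning algorithm (EXP3-IX) to coordinate the marketplace efficiently while preserving privacy. Experiments show that CHRONOS achieves strong retrieval performance and privacy efficiency on several benchmarks, indicating practical value.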

Abstract

Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of Big-O of Pq lambda delta t, with a monotone-envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve Big-O of the square root of T log T regret while enforcing epsilon and delta differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and total epsilon of 4.25 at delta of 10 to the power of negative 6 under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.
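
The privacy accounting mentioned in this abstract (one Gaussian-mechanism release per epoch, composed under zCDP) follows standard formulas that can be sketched as below. This is a generic illustration rather than the paper's accountant; the function names and any sensitivity, noise scale, or epoch count passed in are hypothetical.

```python
import math


def gaussian_zcdp_rho(sensitivity, sigma):
    """A Gaussian mechanism with L2 sensitivity and noise std sigma
    satisfies rho-zCDP with rho = sensitivity^2 / (2 * sigma^2)."""
    return sensitivity ** 2 / (2.0 * sigma ** 2)


def zcdp_to_epsilon(rho, delta):
    """Standard conversion: rho-zCDP implies (epsilon, delta)-DP with
    epsilon = rho + 2 * sqrt(rho * ln(1 / delta))."""
    return rho + 2.0 * math.sqrt(rho * math.log(1.0 / delta))


def total_epsilon(sensitivity, sigma, epochs, delta):
    """zCDP composes additively, so rho is summed over the releases."""
    rho_total = epochs * gaussian_zcdp_rho(sensitivity, sigma)
    return zcdp_to_epsilon(rho_total, delta)
```

Because retrieval and ranking are described as post-processing of the released matrix, only the per-epoch releases would enter `total_epsilon` in a calculation of this kind.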

2605.23883 2026-05-25 cs.CV cs.AI (updated version)

PGT: Procedurally Generated Tasks for improving visual grounding in MLLMs

Rim Assouel, Amir Bar, Michal Drozdzal, Adriana Romero-Soriano

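Affiliations: Mila - Québec AI Institute; FAIR at Meta Superintelligence Labs; McGill University; Canada CIFAR AI Chair

AI summary: Despite significant progress, multimodal large language models (MLLMs) still fall short on fine-grained understanding tasks. This paper proposes PGT, a procedurally generated task framework that overlays unambiguous geometric primitives on images to produce dense supervision, improving the model's visual grounding and serving as a low-cost diagnostic tool for locating the source of perception failures. Experiments show that PGT improves performance substantially on several benchmarks, suggesting that the fine-grained perception bottleneck can be addressed with stronger supervision signals.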

Abstract

Despite remarkable progress in Multimodal Large Language Models (MLLMs), these models still struggle with fine-grained understanding tasks. In this work, we propose Procedurally Generated Tasks (PGT), a simple data-driven framework that serves a dual purpose: inducing fine-grained visual understanding and acting as a low-cost diagnostic tool to identify the source of perception failures. By overlaying unambiguous geometric primitives on images, PGT generates additional dense supervision that disentangles visual grounding capability from semantic priors. Extensive experiments on relational, quantitative, and 3D/depth understanding benchmarks show that PGT yields remarkable gains across diverse architectures. Instruction tuning MLLMs on LLaVA-v1.5-Instruct augmented with PGT data results in improvements of up to +20% on the What'sUp benchmark and +13.3% on CV-Bench-2D, while maintaining general perception capabilities. Moreover, finetuning state-of-the-art MLLMs on PGT data leads to boosts of up to +5.5% on What'sUp and +8.3% on CV-Bench-2D. These findings demonstrate that PGT effectively addresses the bottleneck of fine-grained perception, revealing that many spatial reasoning deficits stem from inadequate supervision signals rather than inherent architectural or resolution limitations.
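
To make the idea of procedurally generated supervision concrete, here is a toy generator for one relational task. It is only a guess at the general shape of such a pipeline: the shape list, the task template, and the dictionary layout are all hypothetical, and the actual drawing of primitives onto an image is left out.

```python
import random

SHAPES = ["circle", "square", "triangle"]  # hypothetical primitive set


def make_left_right_task(width, height, rng):
    """Sample two distinct primitives and a question whose answer is
    fully determined by the sampled geometry, not by image semantics."""
    a, b = rng.sample(SHAPES, 2)
    # Distinct x coordinates keep the left/right answer unambiguous.
    xa, xb = rng.sample(range(width), 2)
    ya, yb = rng.randrange(height), rng.randrange(height)
    return {
        "overlays": [
            {"shape": a, "x": xa, "y": ya},
            {"shape": b, "x": xb, "y": yb},
        ],
        "question": f"Is the {a} to the left of the {b}?",
        "answer": "yes" if xa < xb else "no",
    }
```

Since the label is computed from the sampled coordinates, each generated example comes with exact ground truth at no annotation cost, which is what makes such data usable as a diagnostic.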

2605.23867 2026-05-25 cs.HC cs.AI (updated version)

Human Decision-Making with Persuasive and Narrative LLM Explanations

Laura R. Marusich, Mary Grace Kozuch Dhooghe, Jonathan Z. Bakdash, Murat Kantarcioglu

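Affiliations: DEVCOM Army Research Laboratory; University of Texas at Dallas; Virginia Polytechnic Institute and State University

AI summary: This study examines how narrative explanations generated by large language models (LLMs) in classification tasks affect human decision-making performance. In a large-scale human behavioral experiment, the persuasiveness of LLM-generated narrative explanations did not meaningfully improve decision accuracy, but the narratives may increase reliance on AI predictions and may harm decision response times and the ability to judge whether an AI prediction is correct. The results suggest that adding narrative explanations to AI predictions may involve tradeoffs for decision-making performance, and further work is needed on the mechanisms and settings in which they matter.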

Abstract

Large language models (LLMs) have the potential to aid and improve human decision-making in classification tasks, not only by providing fairly accurate predictions, but also in their ability to generate cogent narrative explanations of those predictions. Prior work has demonstrated that people generally find AI narrative explanations to be understandable, trustworthy, and convincing for changing beliefs and opinions; however, less is known about the impact of narrative explanations on objective human decision-making performance. Here we conduct a large-scale human behavioral experiment to evaluate decision-making performance with LLM-generated narrative explanations of varying persuasiveness. We found the degree of persuasiveness, or lack thereof, for LLM-based explanations did not meaningfully impact decision accuracy over a simple AI prediction alone, in agreement with typical results with explainable AI based on feature importance. We found evidence that narratives increased reliance on AI, but both when the AI prediction was correct and incorrect. Exploratory analyses also indicated that the more persuasive narratives may have had a detrimental effect on decision response times and the ability to discriminate between a correct and incorrect AI prediction. Overall, this work indicates that including narrative explanations with AI predictions may involve tradeoffs for decision-making performance, and more work is needed to determine how and when narrative explanations impact human decision-making.

2605.23861 2026-05-25 cs.LG cs.AI cs.CV (updated version)

Leveraging Foundation Models for Causal Generative Modeling

Aneesh Komanduri, Xintao Wu

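Affiliation: University of Arkansas

AI summary: This paper studies how to use pretrained foundation models for causal generative modeling, with the aim of improving AI systems' counterfactual reasoning. It proposes FM-CGM, a modular framework that performs end-to-end visual causal reasoning through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. The method combines a large reasoning model for causal inference with a text-to-image diffusion model and introduces a causal semantic guidance mechanism, supporting zero-shot causal discovery and counterfactual image generation.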

Abstract

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.

2605.23825 2026-05-25 cs.LG cs.AI (updated version)

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

Stuart Bladon, Brinnae Bent

Affiliations: Alibaba; seven AI labs

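AI summary: This study finds that geopolitical bias in language models arises mainly in post-training rather than in pre-training. Comparing base and chat models from seven labs, it shows that post-trained models tend to shift toward the country or region of their developer, and that the bias differs with the language of the prompt. The study stresses that a model's representation of nations, cultures, and political views is not simply inherited from training data but is actively shaped during alignment, which highlights the need for transparency and oversight of post-training.

Comments: 12 pages, 6 figures, 2 tables, 3 appendices. Code and scenario bank: https://github.com/recozers/LLM-Bias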

Abstract

It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, p=0.15), the post-trained chat variant is at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
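
The log-odds figures in this abstract convert to odds and probabilities with a line of arithmetic each. The sketch below assumes natural-log odds, which the abstract does not state explicitly; the function names are illustrative.

```python
import math


def odds_ratio(log_odds_after, log_odds_before=0.0):
    """Multiplicative change in odds between two log-odds values."""
    return math.exp(log_odds_after - log_odds_before)


def probability(log_odds):
    """Probability of the favourable choice at a given log-odds."""
    return 1.0 / (1.0 + math.exp(-log_odds))


print(round(odds_ratio(2.91), 1))         # 18.4, relative to neutral (0)
print(round(odds_ratio(2.91, -0.15), 1))  # 21.3, relative to the base model
print(round(probability(2.91), 3))        # 0.948
```

Under this assumption, the quoted 18x corresponds to exp(2.91), that is, the chat model's odds measured against neutrality; measuring against the base model's -0.15 would give roughly 21x.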

2605.23819 2026-05-25 cs.CV cs.AI (updated version)

Not Too Generative, Not Too Discriminative: The Human Alignment Sweet Spot

Jorge Chang Ortega, Bastien Le Lan, Thomas Serre, Victor Boutin

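Affiliations: ANITI; Brown University; CNRS

AI summary: This paper addresses a central question in computational vision: whether human visual representations are better explained by discriminative or generative learning. Using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training objectives within a fixed architecture, the study isolates the effect of the learning objective and evaluates models on six human-alignment benchmarks, including perceptual similarity, gloss perception, and human response uncertainty. Human alignment peaks at intermediate points between the generative and discriminative objectives rather than at either extreme, indicating that human-aligned vision comes from balancing the two objectives rather than choosing one.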

Abstract

A central question in computational vision is whether human-like visual representations are better explained by discriminative or generative learning. Existing comparisons, however, often confound the learning objective with architecture, scale, and training data, leaving open whether the objective itself drives alignment. We address this confound using Joint Energy-Based Models (JEMs), which interpolate continuously between discriminative and generative training within a fixed architecture. By varying a single mixing coefficient, we isolate the effect of the learning objective and evaluate the resulting models across six human-alignment benchmarks spanning perceptual similarity, gloss perception, human response uncertainty, robustness, shape-texture cue conflict, and diagnostic feature attribution. Across this diverse suite, human alignment is consistently maximized at intermediate points of the generative-discriminative continuum, rather than at either endpoint. Hybrid JEMs combine the categorical structure induced by discriminative learning with the sensitivity to input structure induced by generative learning, yielding more human-like behavior across multiple levels of vision. These results suggest that the generative-discriminative dichotomy is the wrong axis for understanding human-aligned vision: alignment emerges not from choosing one objective over the other, but from balancing both.
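
A short sketch may help show how one set of classifier logits can serve both objectives in a joint energy-based model. The softmax posterior and the logsumexp energy follow the usual JEM construction; the convex mix in `mixed_objective` and the name `alpha` are assumptions about how a single mixing coefficient might enter, and `log_px` is passed in as a placeholder because estimating log p(x) requires sampling that is omitted here.

```python
import math


def log_sum_exp(logits):
    """Numerically stable log of the sum of exponentials."""
    m = max(logits)
    return m + math.log(sum(math.exp(v - m) for v in logits))


def class_log_prob(logits, label):
    """Discriminative term: log p(y | x) from the softmax of the logits."""
    return logits[label] - log_sum_exp(logits)


def energy(logits):
    """Generative term: E(x) = -logsumexp(logits), so that
    p(x) is proportional to exp(-E(x))."""
    return -log_sum_exp(logits)


def mixed_objective(logits, label, log_px, alpha):
    """Hypothetical loss: alpha = 0 is purely discriminative,
    alpha = 1 purely generative."""
    return -((1.0 - alpha) * class_log_prob(logits, label) + alpha * log_px)
```

Sweeping `alpha` between 0 and 1 while holding the architecture fixed is the kind of controlled comparison the abstract describes.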

2605.23780 2026-05-25 cs.AI (updated version)

Beyond Binary Edits: Robust Multimodal Knowledge Editing with Adversarial Subspace Alignment

Haoyuan Wang, Xiaohao Liu, Jiajie Su, Jianmao Xiao, Chaochao Chen

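Affiliations: Zhejiang University; National University of Singapore; Jiangxi Normal University

AI summary: This paper studies robust intrinsic knowledge editing in multimodal large language models, where knowledge must be updated efficiently without harming existing capabilities. Because existing methods propagate edits poorly across semantically equivalent visual and linguistic variants, the authors propose Adversarial Subspace Alignment (ASAM), which introduces Latent Adversarial Robustification (LAR) and Rank-Constrained Subspace Learning (RCSL) to improve generalization and edit robustness in high-dimensional multimodal spaces. Experiments show strong performance on knowledge-editing tasks.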

Abstract

Multimodal large language models (MLLMs) need efficient mechanisms to update knowledge without degrading existing capabilities. While intrinsic multimodal knowledge editing achieves strong reliability and locality, it often exhibits limited generality, failing to propagate edits across semantically equivalent visual and linguistic variations. This issue arises from the lack of explicit semantic supervision, rigid editing scopes, and biased anchoring to individual samples in high-dimensional multimodal spaces. We address robust intrinsic multimodal knowledge editing by explicitly targeting generalization. We formalize robustness through knowledge units that group semantically equivalent multimodal inputs and define generality as consistent predictions within each unit. To expose fragile semantic regions, we introduce Latent Adversarial Robustification (LAR), which generates adversarial yet semantically coherent variants in the joint latent space. We further propose Rank-Constrained Subspace Learning (RCSL), enforcing low-rank alignment of adversarial representations at the edit layer via a singular value-based objective. Extensive analysis demonstrates the effectiveness of ASAM empirically.

2605.23772 2026-05-25 cs.AI cs.LO cs.PL cs.SE (updated version)

Agentic Proving for Program Verification

Alessandro Sosso, Akhil Arora, Bas Spitters

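Affiliation: Department of Computer Science

AI summary: This study evaluates agentic theorem-proving systems on program verification by testing Claude Code on the CLEVER benchmark, finding high success rates for specification generation, implementation verification, and end-to-end program generation and verification. It also identifies a gap between current program verification benchmarks and the capabilities of modern agentic provers, and stresses the need for more rigorous and robust evaluation, in particular alternatives to isomorphism-based scoring of specifications. The results suggest that tight compiler-in-the-loop agentic paradigms are among the most effective current approaches to program verification.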

Abstract

Agentic systems have recently emerged as state-of-the-art approaches for automated theorem proving in formal mathematics. To assess how far these capabilities extend to program verification, we evaluate Claude Code in an agentic proving framework on CLEVER, a Lean 4 benchmark for verifiable code generation. Our results show that Claude generates arguably valid specifications for 98.8% of problems (with 81.3% also accepted by CLEVER's isomorphism-based scoring on the correct portion of the benchmark), certifies implementations against correct ground-truth specifications for 87.5% of problems, and reaches a 98.1% success rate on the end-to-end program generation and verification pipeline over entries with self-consistent premises. Across all stages, Claude further provides high-quality feedback on its own attempts (as confirmed under manual review), identifying underlying causes of failure and lingering bugs in the dataset. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug-resilient evaluation methodologies, and in particular for alternatives to isomorphism-based scoring of generated specifications. More broadly, our results provide empirical evidence that tight compiler-in-the-loop agentic paradigms are currently the most effective approach for foundational program verification.
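
The compiler-in-the-loop pattern named at the end of this abstract reduces to a small control loop. The sketch below is generic and not the paper's framework: `propose` and `check` are hypothetical callables standing in for the model and for a proof checker such as the Lean 4 compiler.

```python
def prove_with_feedback(propose, check, max_rounds=5):
    """Alternate between proposing a candidate and checking it,
    feeding the checker's message back into the next proposal.

    propose(feedback) -> candidate   (feedback is None on round one)
    check(candidate)  -> (ok, message)
    Returns (candidate, rounds_used), or (None, max_rounds) on failure.
    """
    feedback = None
    for round_no in range(1, max_rounds + 1):
        candidate = propose(feedback)
        ok, feedback = check(candidate)
        if ok:
            return candidate, round_no
    return None, max_rounds
```

The checker is the only source of truth in this loop, so a candidate is never accepted on the model's own say-so.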

2605.23771 2026-05-25 cs.CV cs.AI cs.MA (updated version)

PhotoFlow: Agentic 3D Virtual Photography Missions

Jiarui Guo, Haojia Wei, Yiming Zhang, Yifei Liu, Yuning Gong, Hongjie Zhang, Xue Yang, Zhihang Zhong

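Affiliations: Shanghai Jiao Tong University; Northeastern University; University of California, Los Angeles; Cornell University; Shanghai AI Laboratory; Sichuan University

AI summary: PhotoFlow is an agentic system for virtual photography that, given a language instruction and no preset camera parameters or reference image, produces photographs in a 3D scene that match the stated intent. It has three modules: the Director proposes diverse camera candidates, the Reviewer performs visual evaluation and candidate selection, and the Reflector uses failures to refine the search strategy. The paper also introduces VPhotoBench, a benchmark of Blender scenes and language-conditioned photography missions. Experiments show that PhotoFlow performs well under a multi-round rendering budget, and the authors describe it as the first executable agent system for language-conditioned virtual photography in arbitrary Blender scenes.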

Abstract

Virtual photography asks an agent to enter a prepared 3D scene with no preselected camera pose or reference image, infer a suitable shot from scene information and a language intent, choose executable camera parameters, and render the final photograph. Recent progress in vision-language models makes this kind of spatial agent increasingly plausible, but the task stresses two capabilities that remain hard to evaluate together: complex 3D spatial understanding and abstract aesthetic judgment. We introduce PhotoFlow, a Director-Reviewer-Reflector agent for closed-loop camera search. The Director builds a soft photographic blueprint and proposes diverse candidate cameras; the Reviewer combines rule checks, visual critique, and pairwise incumbent selection; and the Reflector converts failures into region memory, dead-zone suppression, and high-explore relocation. We also introduce VPhotoBench, a benchmark of 47 open-license Blender scenes and 141 language-conditioned photography missions spanning subject placement, relational composition, and atmosphere/style. On held-out experiments, PhotoFlow achieves the strongest external quality-alignment composite and success rate among one-shot prediction, single-chain reflection, anchor-bank selection, and random search under a six-round rendering budget. To our knowledge, this is the first work to make language-conditioned virtual photography in arbitrary Blender scenes an executable agent task, and our results show that an LLM-centered spatial agent can already produce strong photographs in a setting designed to challenge both 3D reasoning and aesthetic choice.

2605.23723 2026-05-25 cs.AI (updated version)

MemAudit: Post-hoc Auditing of Poisoned Agent Memory via Causal Attribution and Structural Anomaly Detection

Zhewen Tan, Yilun Yao, Huiyan Jin, Wenhan Yu, Guoan Wang, Mengyuan Fan, Liang Lu, Feng Liu, Xiangzheng Zhang, Duohe Ma, Tong Yang, Lin Sun

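Affiliations: Institute of Information Engineering, Chinese Academy of Sciences; Qiyuan Tech; Laboratory of Multimedia Information Processing, School of Computer Science, Peking University; School of Cyber Security, University of Chinese Academy of Sciences

AI summary: As large language model agents increasingly rely on persistent memory to store past interactions and improve task execution, the memory mechanism introduces a security risk: an attacker can inject malicious records into memory through ordinary interaction and thereby influence the agent's behavior. This paper proposes MemAudit, a causal memory auditing framework for post-hoc auditing of memory-augmented LLM agents. The method combines causal influence scoring with structural anomaly detection to identify malicious memory records that contribute to harmful outputs, and it substantially reduces attack success rates across several attack settings.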

Abstract

Large language model agents increasingly rely on persistent memory to store past interactions, retrieve relevant demonstrations, and improve long-horizon task execution. However, this memory mechanism also creates a practical security vulnerability: an adversarial user may inject malicious records into the agent's memory through ordinary interaction, and these records can later be retrieved to steer the agent's reasoning and actions. Existing defenses primarily focus on online intervention, such as prompt filtering or output blocking, but they do not address the post-hoc question of which stored memories are responsible after harmful behavior has already been observed. We propose MemAudit, a post-hoc causal memory auditing framework for memory-augmented LLM agents. The framework combines two complementary signals: (1) a counterfactual memory influence score that measures each memory's causal contribution to harmful outputs, and (2) a memory consistency graph that identifies structurally anomalous memories within the broader memory store. We evaluate MemAudit against MINJA, a query-only memory injection attack in which malicious records are generated and stored through normal agent interactions rather than direct memory-bank modification. Across both QA and reasoning-agent settings, MemAudit substantially reduces attack success rates under realistic post-hoc auditing scenarios. The results show that QA attack success is reduced from 70% to 0%, while RAP attack success drops from 83.3% to 0%.
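
One simple way to read the "counterfactual memory influence score" in this abstract is as a leave-one-out comparison. The sketch below implements only that reading, which is an assumption; the paper's actual scoring rule and its consistency graph are not reproduced, and `harm` is a hypothetical callable that scores how harmful the agent's output is for a given memory store.

```python
def influence_scores(memories, harm):
    """Leave-one-out influence: how much the harm score drops when a
    single memory record is removed from the store.

    memories: list of memory records
    harm(memories) -> float, higher means more harmful output
    Returns one score per record, in the same order as `memories`.
    """
    base = harm(memories)
    scores = []
    for i in range(len(memories)):
        without = memories[:i] + memories[i + 1:]
        scores.append(base - harm(without))
    return scores
```

Records with a large positive score are the ones whose removal most reduces the harmful behaviour, which makes them natural candidates for an auditor to inspect first.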

2605.23719 2026-05-25 cs.CV cs.AI (updated version)

Weierstrass Positional Encoding for Vision Transformers

Zhihang Xin, Rui Wang, Xitong Hu, Xiaojun Wu

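Affiliations: School of Mathematics and Data Science, Jiangnan University; School of Artificial Intelligence and Computer Science, Jiangnan University

AI summary: Vision Transformers have been highly successful in computer vision, but their usual learnable one-dimensional positional encodings weaken the two-dimensional spatial structure of an image once patches are flattened. To address this, the paper proposes a positional encoding based on the Weierstrass elliptic function (WePE), which maps 2D patch coordinates into the complex domain and builds doubly periodic four-dimensional positional features that better preserve the geometric relationships and spatial-proximity priors of image patches. The method is mathematically grounded, matches the regular structure of the patch grid, adds no noticeable computational overhead, and integrates directly into existing Vision Transformers; experiments show gains across a range of tasks.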

Abstract

Vision Transformers have achieved remarkable success in computer vision, but their common use of learnable one-dimensional positional encodings weakens the inherent two-dimensional spatial structure of images after patch flattening. Existing positional encodings often lack geometric constraints and do not preserve a monotonic relationship between Euclidean spatial distances and sequential index distances, limiting ViTs' ability to exploit spatial proximity priors. Motivated by the usefulness of periodicity in positional encoding, we propose Weierstrass elliptic Positional Encoding (WePE), a mathematically grounded method for encoding two-dimensional coordinates in the complex domain. WePE maps normalized 2D patch coordinates onto the complex plane and constructs compact four-dimensional positional features using the Weierstrass elliptic function and its derivative. The double periodicity provides a principled representation of 2D positions, and its intrinsic lattice structure naturally matches the regular geometry of image patch grids. Its nonlinear geometric properties help model spatial distance relationships more faithfully, while the algebraic addition formula enables relative positional information between arbitrary patch pairs to be derived directly from their absolute encodings. WePE is plug-and-play and resolution-agnostic, allowing seamless integration into existing ViTs. Extensive experiments show that WePE brings consistent performance gains in most settings. With precomputed lookup tables, these improvements introduce no noticeable computational or memory overhead. Additional analyses and ablation studies further validate the effectiveness of the proposed method.
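
The four-dimensional feature described in this abstract can be approximated numerically with a truncated lattice sum. This is a rough sketch under several assumptions: a square lattice with periods 1 and i, a plain symmetric truncation rather than a fast-converging series, and a hypothetical coordinate mapping that places patch centres strictly inside the unit square so no patch lands on a pole.

```python
def weierstrass_p(z, w1=1.0 + 0.0j, w2=1.0j, n=20):
    """Truncated lattice sums for the Weierstrass function and derivative:
    p(z)  = 1/z^2 + sum over w != 0 of (1/(z - w)^2 - 1/w^2)
    p'(z) = -2 * sum over all w of 1/(z - w)^3
    with w = a*w1 + b*w2 and |a|, |b| <= n."""
    p = 1.0 / z ** 2
    dp = -2.0 / z ** 3
    for a in range(-n, n + 1):
        for b in range(-n, n + 1):
            if a == 0 and b == 0:
                continue
            w = a * w1 + b * w2
            p += 1.0 / (z - w) ** 2 - 1.0 / w ** 2
            dp += -2.0 / (z - w) ** 3
    return p, dp


def wepe_features(row, col, rows, cols):
    """Hypothetical mapping from a patch index to four real features:
    (Re p, Im p, Re p', Im p') at the patch centre in the unit square."""
    z = complex((col + 0.5) / cols, (row + 0.5) / rows)
    p, dp = weierstrass_p(z)
    return [p.real, p.imag, dp.real, dp.imag]
```

The function value alone cannot tell z from -z because it is even, so pairing it with the odd derivative is what lets the feature vector distinguish positions; in practice the features would be precomputed once per grid size, in line with the lookup table the abstract mentions.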

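2605.22738 2026-05-25 cs.LG cs.AI stat.ML Version update

Proxy-Based Approximation of Shapley and Banzhaf Interactions

基于代理的Shapley和Banzhaf交互近似

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter, Eyke Hüllermeier, Maximilian Muschalik, Fabian Fumagalli

Affiliations * LMU Munich; MCML; DFKI; Centre for Credible AI, Warsaw University of Technology; University of Warsaw; Claremont McKenna College; Bielefeld University

AI summary: This paper studies how to estimate Shapley and Banzhaf interaction values efficiently and accurately in order to explain complex interactions among features in machine learning models. The authors propose ProxySHAP, which combines the sample efficiency of tree-based proxy models with a residual correction strategy, improving computational efficiency while preserving accuracy. The theoretical analysis shows that exact interaction indices for tree ensembles can be computed in polynomial time and characterizes when the correction keeps bias and variance under control. Experiments show strong results across benchmarks, with ProxySHAP significantly outperforming prior estimators, including on large-scale problems with thousands of features.

Details

AI Chinese abstract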

Shapley和Banzhaf交互捕捉了现代机器学习应用中固有的复杂动态。然而，当前对这些高阶交互的估计器在速度和准确性之间进行权衡。为了克服这一限制，我们引入了ProxySHAP。ProxySHAP将基于树的代理模型的高样本效率与通过残差校正实现一致性的原则路径相结合。在理论层面，我们推导了干预TreeSHAP的多项式时间推广，以计算树集成的精确交互指数，成功避免了先前方法中的指数树深度依赖。此外，我们正式分析了残差调整策略，刻画了最大样本重用（MSR）在特定条件下校正代理偏差而不使其方差随交互规模指数增长的条件。广泛的基准测试表明，ProxySHAP在近似质量上树立了新的最先进标准，包括在具有数千个特征的大规模应用中。通过在小预算和大预算场景下均实现最低误差，ProxySHAP显著优于先前最佳估计器ProxySPEX和KernelSHAP-IQ，同时在可解释性下游任务上也提供了卓越性能。

English abstract

Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
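For intuition about the quantity being approximated, the sketch below computes pairwise Shapley and Banzhaf interaction indices by brute-force enumeration, following their standard definitions. It is exponential in the number of players and is not ProxySHAP. The function `pairwise_interaction` and the toy game are hypothetical names introduced only for illustration.

```python
from itertools import combinations
from math import factorial


def pairwise_interaction(value, n, i, j, index="shapley"):
    """Exact pairwise interaction of players i and j in an n-player game.

    `value` maps a frozenset of players to a number. Cost grows as 2**(n - 2).
    """
    others = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for size in range(len(others) + 1):
        if index == "shapley":
            weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        else:  # Banzhaf: every coalition gets the same weight
            weight = 1.0 / 2 ** (n - 2)
        for subset in combinations(others, size):
            s = frozenset(subset)
            # Discrete second derivative of the game with respect to i and j.
            delta = value(s | {i, j}) - value(s | {i}) - value(s | {j}) + value(s)
            total += weight * delta
    return total


def toy_game(coalition):
    """Players 0 and 1 only pay off together; player 2 contributes on its own."""
    return 2.0 * (0 in coalition and 1 in coalition) + 1.0 * (2 in coalition)


shapley_01 = pairwise_interaction(toy_game, 4, 0, 1, index="shapley")
banzhaf_02 = pairwise_interaction(toy_game, 4, 0, 2, index="banzhaf")
```

In this toy game the pair (0, 1) carries all of the synergy, while player 2 is purely additive and has no interaction with anyone. Estimators such as those discussed in the abstract aim to recover these values without enumerating every coalition.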

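2605.22672 2026-05-25 cs.AI Version update

Is Capability a Liability? More Capable Language Models Make Worse Forecasts When It Matters Most

能力是负担吗？更强大的语言模型在关键时刻做出更差的预测

Nick Merrill, Jaeho Lee, Ezra Karger

Affiliations * Forecasting Research Institute

AI summary: This paper documents an inverse scaling effect: on forecasting tasks whose time series show superlinear growth and tail risk of regime change, more capable language models produce worse distributional forecasts. Experiments on simulated and real-world datasets show that more capable models shift the upper tail upward, while lower-tail forecasts stay relatively stable. Both model scale and post-training contribute to the effect. The authors recommend evaluating LLM forecasts with continuous accuracy measures alongside single-threshold binary metrics rather than relying on the latter alone.

Details

AI Chinese abstract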

我们记录了LLM在预测问题上的逆缩放现象，这些问题的底层时间序列表现出超线性增长和制度转换的尾部风险，这种结构在金融和流行病学中很常见。在这些任务上，更强大的模型会产生更差的分位数预测。该模式出现在我们发布的、无污染的模拟世界基准ForecastBench-Sim（FBSim）上，在预测具有匹配线性控制的合成SIR流行病时，并在COVID-19、麻疹、住房市场和恶性通货膨胀的真实世界数据集中得到复现。每个分位数的分解表明，失败集中在尾部上端，更强大的模型将其向上移动以跟踪激进的增长外推，而下尾部保持不变。Llama-3.1的族内研究表明，模型规模和后训练都独立地促成了这种效应。领域知识并不能可靠地挽救校准。这种逆缩放并不出现在LLM预测基准中常见的单阈值指标上，在相同的输出上，能力-准确性关系的符号发生了反转。在常规截止点上的单阈值评分忽略了尾部上端的成本；包含尾部的评分在相同的输出上反转了能力-准确性关系的符号。我们建议LLM预测评估使用连续（且无界）的准确性度量以及有界的二元阈值度量。

English abstract

We document inverse scaling in LLMs on forecasting problems whose underlying time series exhibit superlinear growth and tail risk of regime change, a structure common in finance and epidemiology. On these tasks, more capable models produce worse distributional forecasts. The pattern appears on ForecastBench-Sim (FBSim), a contamination-free, simulated-world benchmark we release, in forecasting synthetic SIR epidemics with a matched linear control, and replicates in real-world datasets on COVID-19, measles, housing markets, and hyperinflation. A per-quantile decomposition shows the failure concentrates at the upper tail, which more capable models shift upward to track aggressive extrapolations of growth, while the lower tail stays put. A within-family study of Llama-3.1 shows that both model scale and post-training independently contribute to this effect. Domain knowledge does not reliably rescue calibration. This inverse scaling does not appear on single-threshold metrics common in LLM forecasting benchmarks, reversing the sign of the capability--accuracy relationship on identical outputs. Single-threshold scoring at conventional cutoffs misses the upper-tail cost; tail-inclusive scoring reverses the sign of the capability--accuracy relationship on the same outputs. We recommend that LLM forecasting evaluations use continuous (and unbounded) measures of accuracy alongside bounded binary threshold metrics.
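The contrast between single-threshold and tail-inclusive scoring can be illustrated with a small sketch. It uses the standard pinball (quantile) loss and a Brier score on one exceedance threshold; the two forecasters, their quantiles, and the realized value are made-up numbers chosen for illustration, not data from the paper.

```python
def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of predicting q for quantile level tau when y occurs."""
    return tau * (y - q) if y >= q else (1.0 - tau) * (q - y)


def mean_pinball(y, quantiles):
    """Average pinball loss over a dict mapping quantile level -> predicted value."""
    return sum(pinball_loss(y, q, tau) for tau, q in quantiles.items()) / len(quantiles)


def brier(y, threshold, prob_exceed):
    """Brier score for the binary event 'y exceeds threshold'."""
    outcome = 1.0 if y > threshold else 0.0
    return (prob_exceed - outcome) ** 2


realized = 100.0
# Both hypothetical forecasters share a median of 100, so both report a
# probability of 0.5 for exceeding the threshold of 100.
cautious = {0.05: 80.0, 0.5: 100.0, 0.95: 130.0}
aggressive = {0.05: 80.0, 0.5: 100.0, 0.95: 400.0}   # inflated upper tail

brier_cautious = brier(realized, 100.0, 0.5)
brier_aggressive = brier(realized, 100.0, 0.5)
pinball_cautious = mean_pinball(realized, cautious)
pinball_aggressive = mean_pinball(realized, aggressive)
```

The threshold score cannot tell the two forecasters apart, while the quantile score penalizes the inflated upper tail. That is the kind of gap the abstract attributes to single-threshold metrics.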

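2605.21504 2026-05-25 q-fin.ST cs.AI Version update

Multivariate Financial Forecasting using the Chronos Time Series Foundation Models

使用Chronos时间序列基础模型进行多元金融预测

Sanjiv R Das, Tarang Goyal, Mohini Yadav

Affiliations * Santa Clara University

AI summary: Using Chronos-2, an open-source time-series foundation model, this paper evaluates pretrained time-series models for economic and financial forecasting, focusing on whether multivariate (MV) inputs improve accuracy over univariate (UV) baselines. The study covers the Magnificent-7 equities, U.S. Treasury interest rates, and a combined panel, with rolling monthly evaluations from 2000 to 2025. MV forecasts outperform UV forecasts on both rates and equities, with generally lower error dispersion. Mixing series across the equity and rate markets reduces accuracy, which suggests that noisy context degrades performance. Overall, foundation models can use cross-series information to improve financial forecasts, most clearly when related series are modeled jointly under a disciplined rolling protocol.

Comments 10 pages, 3 tables, 3 figures

Details

AI Chinese abstract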

使用开源时间序列基础模型Chronos-2，我们评估了预训练时间序列模型在经济和金融预测中的表现，重点研究多元输入相对于单变量基线是否提高了准确性。研究涵盖两个面板——Magnificent-7股票和美国国债利率——以及一个组合面板，使用2000年至2025年的滚动月度评估。我们改变输入窗口长度和预测范围，并报告RMSE和MAPE。跨数据集，多元预测一致优于单变量预测，利率的增益尤为强劲，股票也有显著改善。序列级比较显示多元输入在所有情况下均有改进，且误差离散度通常更低。我们还提供了参数热图和时间序列可视化。然而，混合股票和利率市场的时间序列会降低预测准确性，表明添加噪声上下文会降低模型性能。总体而言，结果表明基础模型可以利用跨序列信息提高金融预测准确性，并且在严格滚动协议下对相关序列进行联合建模时收益最大。除了使用开源基础模型外，本文还展示了AI如何用于金融研究。

English abstract

Using Chronos-2, an open-source time-series foundation model, we evaluate pretrained time-series models for economic and financial forecasting with an emphasis on whether multivariate (MV) inputs improve accuracy relative to univariate (UV) baselines. The study covers two panels -- the Magnificent-7 equities and U.S. Treasury interest rates -- as well as a combined panel, using rolling monthly evaluations from 2000--2025. We vary input window lengths and forecast horizons and report RMSE and MAPE. Across datasets, MV forecasts consistently outperform UV forecasts, with especially strong gains for interest rates and meaningful improvements for equities. Series-level comparisons show MV improvements in every case, and error dispersion is generally lower under MV inputs. We also provide parameter-heatmap and time-series visualizations. However, mixing time series across equity and interest rate markets reduces forecast accuracy, indicating that adding noisy context degrades model performance. Overall, the results indicate that foundation models can leverage cross-series information to improve forecast accuracy in finance, and that the benefits are strongest when related series are modeled jointly under disciplined rolling protocols. Other than using an open-source foundation model, this paper also showcases how AI may be used for financial research.
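A rolling evaluation that reports RMSE and MAPE, as the abstract describes, might be sketched as follows. The forecaster here is a naive last-value stand-in and not Chronos-2. The names `rolling_evaluation` and `naive_last_value`, the `forecast_fn(context, horizon)` interface, and the synthetic series are assumptions made for illustration only.

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent (y_true must be nonzero)."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)


def naive_last_value(context, horizon):
    """Placeholder forecaster: repeat the last observed value."""
    return np.repeat(context[-1], horizon)


def rolling_evaluation(series, window, horizon, forecast_fn):
    """Slide a fixed-length input window over the series and score each forecast."""
    series = np.asarray(series, float)
    truths, preds = [], []
    for t in range(window, len(series) - horizon + 1):
        context = series[t - window:t]          # only past values are visible
        preds.append(forecast_fn(context, horizon))
        truths.append(series[t:t + horizon])
    truths, preds = np.concatenate(truths), np.concatenate(preds)
    return rmse(truths, preds), mape(truths, preds)


series = np.arange(1.0, 21.0)   # synthetic upward-trending series
scores = rolling_evaluation(series, window=5, horizon=1, forecast_fn=naive_last_value)
```

Varying `window` and `horizon` in a loop would produce the kind of grid that a parameter heatmap summarizes; a multivariate model would receive several aligned series as context instead of one.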

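2605.19069 2026-05-25 cs.CL cs.AI Version update

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

商业ASR系统在语码转换语音上的基准测试：阿拉伯语、波斯语和德语

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

Affiliations * Perle AI

AI summary: This paper benchmarks commercial automatic speech recognition (ASR) systems on code-switching speech across four language pairs that combine Arabic (Egyptian and Saudi), Persian, and German with English. A two-stage selection pipeline picks 300 samples per pair, and systems are scored with both word error rate (WER) and BERTScore. The two metrics agree on system rankings for the Arabic and Persian pairs but differ in how large they make the quality gaps appear. The study also exposes performance gaps among commercial ASR systems on code-switched speech and releases the dataset publicly.

Details

AI Chinese abstract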

语码转换（在同一话语中两种语言的自然交替）仍然是自动语音识别（ASR）中最具挑战性和研究不足的条件之一。我们提出了一个基准测试，评估了五个商业ASR提供商在四种语言对上的表现：埃及阿拉伯语-英语、沙特阿拉伯语（内志/汉志方言）-英语、波斯语（法尔西语）-英语和德语-英语，每对包含300个样本，通过结合启发式过滤和GPT-4o与Gemini 1.5 Pro集成评分器的两阶段管道选择，将LLM成本降低约91%。我们在WER和BERTScore上进行评估，表明虽然两个指标在所有阿拉伯语和波斯语对的系统排序上一致（τ=1.0），但WER通过惩罚语义正确的音译选择，将质量差距的幅度夸大约3倍。ElevenLabs Scribe v2实现了最低的WER（总体13.2%），并在BERTScore上领先（总体0.936）。难度分层分析揭示了被总体平均值掩盖的性能差距，BERT嵌入投影证实了参考文本和识别结果之间的语义接近性，尽管二者在表层书写系统上存在差异。数据集公开于https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch。

English abstract

Code-switching -- the natural alternation between two languages within a single utterance -- remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic--English, Saudi Arabic (Najdi/Hijazi)--English, Persian (Farsi)--English, and German--English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by ≈91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs (τ = 1.0), WER inflates the magnitude of quality gaps by approximately 3× by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
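To see why WER can penalise a semantically correct transliteration, here is a minimal word-level WER sketch based on edit distance. The function name `wer` and the sample sentences are hypothetical and do not come from the released dataset; production scoring would normally add text normalization, which this sketch omits.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))          # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                          # deletion
                cur[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


# Hypothetical code-switched utterance: the reference keeps the English word
# in Latin script, the hypothesis writes the same word in Arabic script.
reference = "ana 3andi meeting bukra"
hypothesis = "ana 3andi ميتنج bukra"
score = wer(reference, hypothesis)
```

The hypothesis conveys the same word, yet exact string matching counts it as a substitution. An embedding-based metric such as BERTScore may treat the two renderings as closer, which fits the gap between the metrics that the abstract reports.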

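2604.26145 2026-05-25 cs.HC cs.AI Version update

Ceci n'est pas une explication: Evaluating Explanation Failures as Explainability Pitfalls in Language Learning Systems

Ceci n'est pas une explication: 评估语言学习系统中作为可解释性陷阱的解释失败

Ben Knight, Wm. Matthew Kennedy, Danielle Carvalho, Isaac Pattis, James Edgell

Affiliations * Oxford University Press; Oxford Internet Institute; University of Oxford

AI summary: This study examines explanation failures in AI-powered language learning systems. Instant feedback from these systems can look helpful on the surface while being fundamentally flawed, which may reinforce learners' misconceptions and erode learning outcomes. The authors present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that covers critical dimensions of feedback such as diagnostic accuracy and causes of error, and they analyse how AI systems can fail on these dimensions and produce "explainability pitfalls". The study highlights how the language learning context amplifies these risks and calls for more attention to them when designing evaluation frameworks.

Comments Accepted to Misleading Impacts Resulting from AI Generated Explanations (MIRAGE) Workshop @ IUI 2026

Details

AI Chinese abstract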

AI驱动的语言学习工具日益为全球数百万学习者提供即时、个性化的反馈。然而，这种反馈可能以学习者甚至教师难以察觉的方式失败，长期使用可能强化误解并侵蚀学习效果。我们提出了L2-Bench的一部分，这是一个用于评估语言教育中AI系统的基准，包括（但不限于）有效反馈的六个关键维度：诊断准确性、适当性意识、错误原因、优先级排序、改进指导和支持自我调节。我们分析了AI系统在这些维度上可能失败的方式。这些失败，我们认为会导致“可解释性陷阱”，即表面上看似有帮助但本质上有缺陷的AI生成解释，增加了成就、人机交互和社会情感危害的风险。我们讨论了语言学习的特定背景如何放大这些风险，并概述了在设计评估框架时我们认为值得更多关注的开放问题。我们的分析旨在扩展社区对可解释性陷阱类型学及其可能发生的上下文动态的理解，以鼓励AI开发者更好地设计安全、可信和有效的AI解释。

English abstract

AI-powered language learning tools increasingly provide instant, personalised feedback to millions of learners worldwide. However, this feedback can fail in ways that are difficult for learners--and even teachers--to detect, potentially reinforcing misconceptions and eroding learning outcomes over extended use. We present a portion of L2-Bench, a benchmark for evaluating AI systems in language education that includes (but is not limited to) six critical dimensions of effective feedback: diagnostic accuracy, awareness of appropriacy, causes of error, prioritisation, guidance for improvement, and supporting self-regulation. We analyse how AI systems can fail with respect to these dimensions. These failures, which we argue are conducive to "explainability pitfalls," are AI-generated explanations that appear helpful on the surface but are fundamentally flawed, increasing the risk of attainment, human-AI interaction, and socioaffective harms. We discuss how the specific context of language learning amplifies these risks and outline open questions we believe merit more attention when designing evaluation frameworks specifically. Our analysis aims to expand the community's understanding of both the typology of explainability pitfalls and the contextual dynamics in which they may occur in order to encourage AI developers to better design safe, trustworthy, and effective AI explanations.

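2604.24920 2026-05-25 cs.CR cs.AI Version update

SUDP: Secret-Use Delegation Protocol for Agentic Systems

SUDP: 面向智能体系统的秘密使用委托协议

Xiaohang Yu, Hejia Geng, Xinmeng Zeng, William Knottenbelt

Affiliations * Imperial College London; University of Oxford; Stanford University

AI summary: Agentic systems increasingly use user secrets for API calls, messaging platforms, and cloud operations, yet existing runtime authorization mechanisms often work by exposing the secret or a derivative of it, which creates security risks. This paper proposes SUDP, a Secret-Use Delegation Protocol, which aims to ensure that secret-backed operations authorized by the user cannot be misused and that the requester gains no durable access. In the protocol, the user authorizes, the requester proposes an operation, and a custodian performs a bounded use of the secret. SUDP targets seven key security properties and, when combined with a hardware-rooted runtime, protects the integrity and confidentiality of secrets under standard cryptographic assumptions.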

R$^3$L: 反思-重试强化学习与语言引导探索、关键信用和正向放大

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li

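Affiliations * Tongyi Lab; Soochow University; Hong Kong University of Science and Technology

AI summary: R$^3$L is a reinforcement learning method that combines language-guided exploration, pivotal credit assignment, and positive amplification to address the exploration and exploitation difficulties in training LLM reasoning and agentic capabilities. It synthesizes high-quality trajectories through a reflect-then-retry mechanism, using language feedback to locate errors and repair failed attempts; it updates only the diverging trajectory suffix for more precise credit assignment; and it upweights successful trajectories to stabilize training. Experiments on agentic and reasoning tasks show 5% to 52% relative improvements over baselines while maintaining training stability.

Details

AI Chinese abstract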

强化学习推动了LLM推理和智能体能力的最新进展，但当前方法在探索和利用方面均存在困难。探索方面，困难任务成功率低且从头开始重复rollout成本高；利用方面，粗粒度的信用分配和训练不稳定：轨迹级奖励因后续错误惩罚有效前缀，且失败主导的群体淹没少数正向信号，使优化缺乏建设性方向。为此，我们提出R$^3$L，即反思-重试强化学习与语言引导探索、关键信用和正向放大。为合成高质量轨迹，R$^3$L通过反思-重试从随机采样转向主动合成，利用语言反馈诊断错误，将失败尝试转化为成功尝试，并通过从识别出的失败点重启来降低rollout成本。在错误被诊断和定位后，关键信用分配仅更新存在对比信号的分叉后缀，排除共享前缀的梯度更新。由于困难任务中失败占主导且反思-重试产生离策略数据，可能导致训练不稳定，正向放大提高成功轨迹的权重，确保正向信号引导优化过程。在智能体和推理任务上的实验表明，与基线相比，相对提升5%到52%，同时保持训练稳定性。我们的代码已发布在https://github.com/shiweijiezero/R3L。

English abstract

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: Trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
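The idea of updating only the diverging suffix can be sketched as a token-level loss mask. The helper below finds the shared prefix of a failed trajectory and its retried counterpart and masks it out. The name `pivotal_masks` and the integer token lists are hypothetical, and the released R3L code may organize this differently.

```python
def pivotal_masks(failed_tokens, retried_tokens):
    """Return (prefix_length, failed_mask, retried_mask).

    Mask value 0 marks the shared prefix (excluded from the gradient update),
    mask value 1 marks the diverging suffix (where the two attempts differ).
    """
    prefix_length = 0
    for a, b in zip(failed_tokens, retried_tokens):
        if a != b:
            break
        prefix_length += 1
    failed_mask = [0] * prefix_length + [1] * (len(failed_tokens) - prefix_length)
    retried_mask = [0] * prefix_length + [1] * (len(retried_tokens) - prefix_length)
    return prefix_length, failed_mask, retried_mask


# Hypothetical token ids: the retry restarts after the second token.
failed = [11, 42, 7, 7]
retried = [11, 42, 93, 5, 8]
split, failed_mask, retried_mask = pivotal_masks(failed, retried)
```

In a training loop, such masks would typically multiply the per-token loss, so that the valid shared prefix is neither rewarded nor penalized for an error that occurred later.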

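2512.15767 2026-05-25 cs.LG cs.AI Version update

Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework

连接数据与物理：基于图神经网络的混合孪生框架

M. Gorpinich, B. Moya, S. Rodriguez, F. Meraghni, Y. Jaafra, A. Briot, M. Henner, R. Leon, F. Chinesta

Affiliations * Valeo; PIMM Lab. ENSAM Institute of Technology

AI summary: This study proposes a graph neural network (GNN)-based hybrid twin framework that targets the "ignorance model", the gap between physics-based simulation and reality caused by simplifying assumptions or unmodeled effects. Combining a physics-based model with a data-driven correction, the method uses a GNN to learn the missing physics from sparse spatial measurements, improving simulation accuracy and interpretability with less data. Experiments on nonlinear heat transfer problems show that the corrections generalize across different meshes, geometries, and load positions.

Comments 27 pages, 14 figures

Details

AI Chinese abstract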

模拟复杂的非定常物理现象依赖于详细的数学模型，例如通过有限元方法（FEM）进行仿真。然而，由于未建模效应或简化假设，这些模型通常与实际情况存在差异。我们将这种差距称为无知模型。纯数据驱动的方法试图学习整个系统的行为，但需要跨越整个空间和时间域的大量高质量数据。在现实场景中，此类信息不可用，使得完全数据驱动的建模不可靠。为了克服这一限制，我们采用混合孪生方法对无知分量进行建模，而不是从头模拟现象。由于基于物理的模型近似了现象的整体行为，剩余的无知通常比完整的物理响应复杂度低，因此可以用更少的数据进行学习。然而，一个关键困难是空间测量是稀疏的，并且在实际中获取不同空间配置下同一现象的数据具有挑战性。我们的贡献是通过使用图神经网络（GNN）来表示无知模型来克服这一限制。即使测量位置数量有限，GNN也能学习缺失物理的空间模式。这使得我们能够用数据驱动的修正来丰富基于物理的模型，而无需密集的空间、时间和参数数据。为了展示所提出方法的性能，我们在不同网格、几何形状和载荷位置的非线性热传导问题上评估了这种基于GNN的混合孪生方法。结果表明，GNN成功捕获了无知并泛化了跨空间配置的修正，提高了仿真精度和可解释性，同时最小化了数据需求。

English abstract

Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated for instance by using the Finite Element Method (FEM). However, these models often exhibit discrepancies from reality due to unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making full data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach, instead of simulating phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response, so it can be learned with significantly less data. A key difficulty, however, is that spatial measurements are sparse, and obtaining data that measure the same phenomenon for different spatial configurations is challenging in practice. Our contribution is to overcome this limitation by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability, while minimizing data requirements.

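2511.22521 2026-05-25 cs.CV cs.AI Version update

DocVAL: Validated Chain-of-Thought Distillation for Grounded Document VQA

DocVAL：用于基于文档的视觉问答的验证链式思维蒸馏

Pinaki Prasad Guha Neogi, Ahmad Mohammadshirazi, Ser-Nam Lim, Rajiv Ramnath

Affiliations * Department of Computer Science and Engineering, Ohio State University, Ohio, US; Department of Computer Science, University of Central Florida, Florida, US

AI summary: DocVAL is a validated chain-of-thought (CoT) distillation framework for document visual question answering (VQA) that transfers precise spatial reasoning from large vision-language models (VLMs) to more efficient compact models. It combines teacher-generated spatial reasoning supervision, a rule-based dual-mode validator that filters low-quality training signals, and a two-stage training procedure with iterative refinement, so that the final student model runs without OCR or detection modules at inference. Across multiple benchmarks DocVAL consistently improves compact VLMs, by up to 6-7 ANLS points, and the paper introduces mAP as a localization metric for document question answering.

Details

AI Chinese abstract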

文档视觉问答要求模型不仅正确回答问题，还要在复杂文档布局中精确定位答案。大型视觉语言模型（VLM）具有强大的空间定位能力，但其推理成本和延迟限制了实际部署。紧凑型VLM更高效，但在标准微调或蒸馏下常出现显著的定位退化。为解决这一问题，我们提出DocVAL，一种验证链式思维（CoT）蒸馏框架，将显式空间推理从大型教师模型转移到紧凑、可部署的学生VLM。DocVAL结合了（1）教师生成的空间CoT监督，（2）基于规则的双模式验证器，过滤低质量训练信号并提供细粒度像素级纠正反馈，以及（3）验证驱动的两阶段训练过程与迭代细化。文本检测仅作为训练时的监督和验证脚手架，使得最终学生模型在推理时作为纯VLM运行，无需OCR或检测。在多个文档理解基准上，DocVAL相比可比的紧凑VLM持续提升高达6-7个ANLS点。我们进一步引入平均精度（mAP）作为文档问答的定位指标，并在此新评估下报告了强大的空间定位性能。我们发布了95K验证器验证的CoT轨迹，并表明高质量、验证过的监督比扩展未过滤数据更有效，实现了高效且可信的文档定位。代码/数据：https://github.com/ahmad-shirazi/DocVAL

English abstract

Document visual question answering requires models not only to answer questions correctly, but also to precisely localize answers within complex document layouts. While large vision-language models (VLMs) achieve strong spatial grounding, their inference cost and latency limit real-world deployment. Compact VLMs are more efficient, but they often suffer substantial localization degradation under standard fine-tuning or distillation. To address this gap, we propose DocVAL, a validated chain-of-thought (CoT) distillation framework that transfers explicit spatial reasoning from large teacher models to compact, deployable student VLMs. DocVAL combines (1) teacher-generated spatial CoT supervision, (2) a rule-based dual-mode validator that filters low-quality training signals and provides fine-grained, pixel-level corrective feedback, and (3) a validation-driven two-stage training procedure with iterative refinement. Text detection is used only as training-time scaffolding for supervision and validation, enabling the final student to operate as a pure VLM without OCR or detection at inference. Across multiple document understanding benchmarks, DocVAL yields consistent improvements of up to 6-7 ANLS points over comparable compact VLMs. We further introduce mean Average Precision (mAP) as a localization metric for document question answering and report strong spatial grounding performance under this new evaluation. We release 95K validator-verified CoT traces and show that high-quality, validated supervision is more effective than scaling unfiltered data, enabling efficient and trustworthy document grounding. Code/Data: https://github.com/ahmad-shirazi/DocVAL

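2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO Version update

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

自主X光引导脊柱手术的机器人控制策略学习研究

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

Affiliations * Johns Hopkins University; Technical University of Munich; Johns Hopkins School of Medicine

AI summary: This paper investigates imitation learning-based robot control policies for X-ray-guided spine procedures, examining feasibility and challenges for cannula insertion in a vertebroplasty setting. The authors build a highly realistic simulation sandbox and curate a dataset of correct trajectories with corresponding bi-planar X-ray sequences, which they use to train imitation learning policies that rely on visual information alone. The policy succeeded on the first attempt in 68.5% of cases and maintained safe intra-pedicular trajectories across varied anatomies and initializations, which may lay groundwork for lightweight, CT-free robotic intra-operative spinal navigation.

Details

DOI: 10.1007/s11548-026-03716-x

AI Chinese abstract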

基于模仿学习的机器人控制策略在基于视频的机器人学中重新受到关注。然而，对于稀疏输入的X光引导手术（如脊柱内固定），这种方法是否适用尚不清楚。我们研究了在双平面引导的套管针插入中模仿策略学习的可行性、机遇和挑战。我们开发了一个用于可扩展、自动化模拟X光引导脊柱手术的计算机沙盒，具有高度逼真性。我们整理了一个包含正确轨迹和相应双平面X光序列的数据集，模拟了提供者的逐步对齐过程。然后，我们训练了用于规划和开环控制的模仿学习策略，该策略仅基于视觉信息在椎体成形术环境中迭代对齐套管针。这种精确控制的设置提供了对该方法局限性和能力的见解。我们的策略在68.5%的案例中首次尝试成功，在不同椎体水平上保持了安全的椎弓根内轨迹。该策略迁移到了复杂解剖结构（包括骨折）以及不同的解剖结构和初始位置。在真实X光上的展开表明，具有合理轨迹的部分仿真到真实迁移是可能的。尽管这些初步结果令人鼓舞，但我们还发现了局限性，特别是在入口点精度方面。当前的结果为未来的努力提供了明确的基准，而借助更稳健的先验和领域知识，此类模型可能为未来实现轻量级、无CT的机器人术中脊柱导航奠定基础。

English abstract

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2508.13663 2026-05-25 cs.AI cs.LG Version update

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

具有软实体约束的知识图谱交互式查询回答

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

Affiliations * Translational AI Laboratory, Department of Laboratory Medicine; Amsterdam University Medical Center, Vrije Universiteit Amsterdam; Accenture Labs; Delft University of Technology; ELLIS Institute Finland & Åbo Akademi University, Turku, Finland & Elsevier Discovery Lab, Amsterdam

AI summary: This paper studies interactive query answering on knowledge graphs with soft entity constraints, targeting real-world queries whose constraints are vague or context-dependent. The authors propose two efficient methods that adjust answer scores to reflect soft constraints, either by tuning two parameters or by training a small neural network, while preserving the original ranking structure. Experiments show that the methods capture soft constraints while maintaining query answering performance and adding little overhead, offering a more flexible way to interact with knowledge graphs.

Comments Accepted in Transactions on Machine Learning Research (2026)

Details

AI Chinese abstract

针对不完整知识图谱的查询回答方法检索可能成为答案的实体，这在由于缺失边而无法通过直接图遍历达到此类答案时特别有用。然而，现有方法侧重于使用一阶逻辑形式化的查询。在实践中，许多现实世界的查询涉及固有模糊或上下文依赖的约束，例如对属性或相关类别的偏好。针对这一差距，我们引入了具有软约束的查询回答问题。我们形式化了该问题，并提出了两种高效方法，旨在通过融入软约束来调整查询答案分数，同时不破坏查询的原始答案。这些方法是轻量级的，只需调整两个参数或训练一个小型神经网络来捕获软约束，同时保持原始排序结构。为了评估该任务，我们通过生成带有软约束的数据集来扩展现有的QA基准。我们的实验表明，我们的方法能够捕获软约束，同时保持稳健的查询回答性能，并增加很少的开销。通过我们的工作，我们探索了一种与图数据库交互的新颖灵活方式，允许用户通过交互式提供示例来指定其偏好。

English abstract

Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order-logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.

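2508.12247 2026-05-25 cs.LG cs.AI Version update

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

STM3: 多尺度曼巴混合模型用于长期时空时间序列预测

Haolong Chen, Liang Zhang, Zhengyuan Xin, Guangxu Zhu

Affiliations * Shenzhen Loop Area Institute

AI summary: This paper proposes STM3, a deep learning model for long-term spatio-temporal time-series prediction that addresses multiscale information extraction and spatial dependency modeling. STM3 integrates a Multiscale Mamba architecture within a Disentangled Mixture-of-Experts (DMoE) framework and uses an adaptive graph causal network to model complex spatial dependencies. A stable routing strategy and a causal contrastive learning strategy support robust representation learning and scale distinguishability. Experiments on 10 real-world benchmarks show state-of-the-art results.

Comments Accepted by KDD 2026

Details

AI Chinese abstract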

近年来，时空时间序列预测发展迅速，但现有深度学习方法难以高效学习复杂的长期时空依赖。长期时空依赖学习带来两个新挑战：1）长期时间序列自然包含多尺度信息，难以高效提取；2）不同节点的多尺度时间信息高度相关且难以建模。为解决这些问题，我们提出时空多尺度曼巴混合模型（STM3）。STM3在新型分离式混合专家（DMoE）框架内集成多尺度曼巴架构，以高效捕获多样的多尺度信息，同时利用自适应图因果网络建模复杂的空间依赖。为确保鲁棒的表示学习，我们引入稳定路由策略和因果对比学习策略，与层次信息聚合协同工作，保证尺度可区分性。我们理论上证明STM3实现了优越的路由平滑性，并保证了每个专家的模式分离。在跨领域的10个真实世界基准上的大量实验表明，STM3具有优越性能，在长期时空时间序列预测中达到了最先进的结果。值得注意的是，在PEMSD8数据集上，它取得了显著改进，在MAE、RMSE和MAPE上分别超过第二好的模型7.1%、8.5%和15.9%。代码可在https://github.com/IfReasonable/STM3_KDD26获取。

English abstract

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

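2506.05438 2026-05-25 cs.LG cs.AI Version update

An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

一种用于动态健康指标构建的无监督框架及其在滚动轴承预测中的应用

Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong

Affiliations * School of Data Science; Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong; College of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China

AI summary: This paper proposes an unsupervised framework, requiring no expert knowledge, for constructing a dynamic health indicator (HI) that improves degradation trend modeling and prognostics for rolling bearings. A skip-connection-based autoencoder automatically extracts degradation features, and within that feature space an HI-generating module with an embedded inner HI-prediction block explicitly models the temporal dependence between HI states, capturing the dynamics of the degradation process. Experiments on two bearing lifecycle datasets show that the method outperforms comparison methods and that the dynamic HI is superior for prognostic tasks.

Details

DOI: 10.1016/j.ress.2025.111039

AI Chinese abstract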

健康指标（HI）在滚动轴承的退化评估和预测中起着关键作用。尽管已有多种HI构建方法被研究，但大多数依赖于专家知识进行特征提取，并忽略了捕捉序列退化过程中隐藏的动态信息，这限制了所构建HI在退化趋势表示和预测中的能力。为解决这些问题，通过一种无监督框架构建了考虑HI级时间依赖性的新型动态HI。具体而言，由基于跳跃连接的自编码器组成的退化特征学习模块首先将原始信号映射到代表性退化特征空间（DFS），以自动提取必要的退化特征，无需专家知识。随后，在该DFS中，提出了一种嵌入内部HI预测模块的新型HI生成模块用于动态HI构建，其中过去和当前HI状态之间的时间依赖性被保证并显式建模。在此基础上，动态HI捕捉了退化过程固有的动态内容，确保其在退化趋势建模和未来退化预测中的有效性。在两个轴承生命周期数据集上的实验结果表明，所提出的HI构建方法优于对比方法，且构建的动态HI在预测任务中表现更优。

English abstract

Health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook capturing dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI for degradation trend representation and prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamic contents of the degradation process, ensuring its effectiveness for degradation tendency modeling and future degradation prognostics. The experiment results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and the constructed dynamic HI is superior for prognostic tasks.

2502.20349 2026-05-25 q-bio.NC cs.AI 版本更新

Naturalistic Computational Cognitive Science: Towards generalizable models and theories that capture the full range of natural behavior

自然主义计算认知科学：迈向能够捕捉自然行为全范围的通用模型与理论

Wilka Carvalho, Andrew Lampinen

发表机构 * Kempner Institute for the Study of Natural and Artificial Intelligence（Kempner自然与人工智能研究学院）； Harvard University（哈佛大学）； Google DeepMind（谷歌DeepMind）

AI总结本文探讨如何通过结合人工智能的最新进展，构建能够涵盖自然情境和行为全貌的通用认知科学理论。研究指出，采用更加自然化的实验范式和计算模型，有助于更准确地理解自然智能的本质，并推动理论的泛化能力。文章综述了认知科学、神经科学和人工智能领域的相关研究，提出整合这些领域进展有助于在保持实验控制和理论深度的同时，更好地解释和模拟人类认知过程。

AI中文摘要

认知科学如何构建能够涵盖自然情境与行为全范围的通用理论？我们认为，人工智能（AI）的进展为认知科学提供了及时的机会，使其能够采用日益自然化的刺激、任务和行为进行实验，并构建能够适应这些变化的计算模型。我们首先回顾了涵盖神经科学、认知科学和AI的日益增长的研究，这些研究表明，纳入更广泛的自然主义实验范式及其相应模型，可能是解决自然智能某些方面并确保理论泛化的必要条件。我们回顾了认知科学和神经科学中的案例，其中自然主义范式引发了不同的行为或涉及不同的过程。然后，我们讨论了AI的最新进展，表明从自然主义数据中学习会产生定性的不同行为模式和泛化模式，并探讨了这些发现如何影响我们从认知建模中得出的结论，以及如何帮助产生关于认知和神经现象根源的新假设。接着，我们建议整合AI和认知科学的最新进展，将使我们能够处理更自然的现象，而不放弃实验控制或对理论理解基础的追求。我们提供了关于方法论实践如何有助于自然主义计算认知科学中累积进展的实用指导，并描绘了一条构建能够解决自然认知实际问题的计算模型的道路，同时对这些模型所依据的过程和原则进行还原性理解。

英文摘要

How can cognitive science build generalizable theories that span the full scope of natural situations and behaviors? We argue that progress in Artificial Intelligence (AI) offers timely opportunities for cognitive science to embrace experiments with increasingly naturalistic stimuli, tasks, and behaviors; and computational models that can accommodate these changes. We first review a growing body of research spanning neuroscience, cognitive science, and AI that suggests that incorporating a broader range of naturalistic experimental paradigms, and models that accommodate them, may be necessary to resolve some aspects of natural intelligence and ensure that our theories generalize. We review cases from cognitive science and neuroscience where naturalistic paradigms elicit distinct behaviors or engage different processes. We then discuss recent progress in AI that shows that learning from naturalistic data yields qualitatively different patterns of behavior and generalization, and examine how these findings impact the conclusions we draw from cognitive modeling, and can help yield new hypotheses for the roots of cognitive and neural phenomena. We then suggest that integrating recent progress in AI and cognitive science will enable us to engage with more naturalistic phenomena without giving up experimental control or the pursuit of theoretically grounded understanding. We offer practical guidance on how methodological practices can contribute to cumulative progress in naturalistic computational cognitive science, and illustrate a path towards building computational models that solve the real problems of natural cognition, together with a reductive understanding of the processes and principles by which they do so.

2502.13731 2026-05-25 cs.AI 版本更新

Robust Counterfactual Inference in Markov Decision Processes

马尔可夫决策过程中的鲁棒反事实推断

Jessica Lally, Milad Kazemi, Nicola Paoletti

发表机构 * King's College London（伦敦国王学院）

AI总结本文针对马尔可夫决策过程（MDP）中现有反事实推理方法的一个关键局限性，提出了一种新的非参数方法。传统方法依赖特定的因果模型来识别反事实，而实际上存在多个与观测和干预分布一致的因果模型，导致反事实分布不同。本文通过计算所有兼容因果模型下反事实转移概率的紧致界，提供了高效且可扩展的反事实推理方法，并在此基础上设计出鲁棒的反事实策略，以优化最坏情况下的奖励。实验表明，该方法在多个案例中表现出更强的鲁棒性。

AI中文摘要

本文解决了马尔可夫决策过程（MDP）中现有反事实推断方法的一个关键局限性。当前方法假设特定的因果模型以使反事实可识别。然而，通常存在许多与MDP的观测分布和干预分布一致的因果模型，每个模型产生不同的反事实分布，因此固定一个特定的因果模型限制了反事实推断的有效性（和有用性）。我们提出了一种新颖的非参数方法，该方法在所有兼容因果模型上计算反事实转移概率的紧界。与先前需要求解规模过大（变量随MDP大小呈指数增长）的优化问题的方法不同，我们的方法为这些界提供了闭式表达式，使得计算对于非平凡MDP高度高效且可扩展。一旦构建了这样的区间反事实MDP，我们的方法识别出鲁棒的反事实策略，该策略针对不确定的区间MDP概率优化最坏情况奖励。我们在各种案例研究上评估了我们的方法，证明了相比现有方法具有改进的鲁棒性。

英文摘要

This paper addresses a key limitation in existing counterfactual inference methods for Markov Decision Processes (MDPs). Current approaches assume a specific causal model to make counterfactuals identifiable. However, there are usually many causal models that align with the observational and interventional distributions of an MDP, each yielding different counterfactual distributions, so fixing a particular causal model limits the validity (and usefulness) of counterfactual inference. We propose a novel non-parametric approach that computes tight bounds on counterfactual transition probabilities across all compatible causal models. Unlike previous methods that require solving prohibitively large optimisation problems (with variables that grow exponentially in the size of the MDP), our approach provides closed-form expressions for these bounds, making computation highly efficient and scalable for non-trivial MDPs. Once such an interval counterfactual MDP is constructed, our method identifies robust counterfactual policies that optimise the worst-case reward w.r.t. the uncertain interval MDP probabilities. We evaluate our method on various case studies, demonstrating improved robustness over existing methods.
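The last step of the abstract, optimising worst-case reward over an interval MDP, can be illustrated with a minimal sketch of robust value iteration. This is the textbook interval-MDP procedure, not the paper's closed-form bounds: the probability intervals are assumed to be given, and the names `worst_case_expectation`, `robust_value_iteration`, `bounds`, and `reward` are hypothetical.

```python
def worst_case_expectation(values, lower, upper):
    """Minimise sum_s p[s] * values[s] over lower <= p <= upper with sum(p) == 1."""
    p = list(lower)
    budget = 1.0 - sum(lower)  # probability mass still to be assigned
    # Push the remaining mass onto the lowest-value successors first.
    for s in sorted(range(len(values)), key=lambda i: values[i]):
        extra = min(upper[s] - lower[s], budget)
        p[s] += extra
        budget -= extra
    return sum(pi * vi for pi, vi in zip(p, values))


def robust_value_iteration(n_states, actions, bounds, reward, gamma=0.9, iters=200):
    """bounds[(s, a)] = (lower, upper) lists over successor states (assumed input)."""

    def q(s, a, v):
        lower, upper = bounds[(s, a)]
        return reward[(s, a)] + gamma * worst_case_expectation(v, lower, upper)

    v = [0.0] * n_states
    for _ in range(iters):
        v = [max(q(s, a, v) for a in actions) for s in range(n_states)]
    # Greedy policy with respect to the pessimistic value function.
    policy = [max(actions, key=lambda a: q(s, a, v)) for s in range(n_states)]
    return v, policy
```

The inner minimisation is solved greedily because the feasible set is a box intersected with the probability simplex, so the adversary simply fills the worst successors up to their upper bounds.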

2605.23668 2026-05-25 cs.CL cs.AI 版本更新

通过噪声学习：为什么潜意识学习有效以及何时失败

Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus, Belén Hidalgo-Ogalde, Viola Priesemann

发表机构 * Max Planck Institute for Dynamics and Self-Organization（马克斯·普朗克动力学与自组织研究所）； Faculty of Physics, Institute for the Dynamics of Complex Systems, University of Göttingen（哥廷根大学物理学院，复杂系统动力学研究所）

AI总结本文研究了人工神经网络中的“潜意识学习”现象，即通过任务无关的输入-输出对进行知识蒸馏时，学生模型从教师模型中隐式学习任务相关知识或偏差的机制。研究发现，这一过程并不依赖于教师与学生模型的初始化一致性，而是由输出头的兼容性所决定。通过控制实验，作者展示了即使在随机初始化、网络结构变化等情况下，学生模型仍能通过兼容的辅助输出头从教师模型中学习有用信息，并在特定条件下达到与教师相当的任务性能。该研究为潜意识学习提供了理论解释，并明确了其适用范围与失效条件。

AI中文摘要

在人工神经网络的背景下，潜意识学习指的是通过任务无关的输入-输出对的蒸馏，将任务相关知识或意外偏差从教师模型传递到学生模型。先前的解释将这种效应归因于共享或紧密匹配的教师-学生初始化。我们表明，紧密匹配的初始化并非必要。相反，潜意识学习由兼容的输出头控制。使用受控的MNIST设置，我们将输出分为辅助头（用于辅助的、任务无关的噪声信号）和分类头（用于分类），以证明潜意识学习发生——即使我们随机初始化隐藏层并移除层、添加新层或更改架构（MLP到CNN）。兼容的辅助头能够传递可恢复的教师信号，使学生的表示更接近教师的表示。当分类头也保持兼容时，仅训练于任务无关噪声的学生可以接近，并且在有利情况下达到教师级别的任务性能。我们的设置使我们能够发展一种理论来解释潜意识学习的机制，并推导出潜意识学习失败时的上界。总之，我们的结果将潜意识学习从一种令人惊讶的迁移效应转变为具有可预测限制的理论基础机制。

英文摘要

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input-output pairs. Prior explanations tie this effect to shared or closely matched teacher-student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

2605.23634 2026-05-25 cs.CV cs.AI 版本更新

DualMem: Bypassing the Objectness Bottleneck for Calibrated Unknown-Stream Filtering in Open-World Object Detection

DualMem: 绕过目标性瓶颈以实现开放世界目标检测中校准的未知流过滤

Yingjun Xiao, Xi Chen, Gang Fang, Siyuan Chen

发表机构 * School of Artificial Intelligence, Guangzhou University（广州大学人工智能学院）； School of Computer Science and Cyber Engineering, Guangzhou University（广州大学计算机科学与网络工程学院）； Institute of Computing Science and Technology, Guangzhou University（广州大学计算科学与技术研究院）

AI总结开放世界目标检测（OWOD）需要检测器既能定位已知类别，又能识别未知对象以支持未来的增量学习。本文发现当前强OWOD检测器的未知预测流中背景误检比例过高，问题根源在于对象性头的信息瓶颈。为此，作者提出DualMem，一种基于冻结SigLIP特征空间的校准后处理过滤器，通过非参数似然比检验实现对未知对象的筛选，有效提升了未知对象识别的准确性，同时保持已知类别检测性能不变。

AI中文摘要

开放世界目标检测（OWOD）要求检测器定位已知类别，同时识别未知对象以进行未来的增量学习。我们发现，强OWOD检测器的未知预测流受到严重污染：在M-OWODB上，对于PROB、OW-DETR和HypOW，未来任务的正未知样本仅占未知预测的不到10%，而背景假阳性则占46-71%。我们表明，这不是信息缺失问题，而是目标性头部的信息瓶颈。在PROB任务1上，对256维解码器查询的线性探针在正负未知区分上达到了0.908的AUROC，但最终的一维目标性标量降至0.642。一个冻结的SigLIP特征，无需访问检测器，在过滤阶段独立恢复了大部分这种提议级别的可分离性（AUROC = 0.871）。基于这一发现，我们提出DualMem，一种校准的后验过滤器，它假设一个小的、图像不相交的、标注了未来任务对象的校准分割，并在冻结的SigLIP特征空间中执行非参数似然比检验。DualMem使用k近邻正记忆来保护未来任务对象，并使用负记忆来抑制类似背景的提议。其决策阈值通过Neyman-Pearson校准选择，为用户提供了假未知抑制与新奇召回之间的显式权衡。在M-OWODB任务1上的PROB、OW-DETR和HypOW中，DualMem将每幅图像的背景型假未知提议减少了44.9%-66.3%，平均减少56.6%。在PROB任务1上，它使自然K-means原型基线的减少量翻倍以上，同时保持已知类别的mAP不变，因为已知检测绕过过滤器。

英文摘要

Open-world object detection (OWOD) requires detectors to localize known classes while identifying unknown objects for future incremental learning. We find that the unknown prediction streams of strong OWOD detectors are heavily polluted: on M-OWODB, across PROB, OW-DETR, and HypOW, future-task positive unknowns make up less than 10% of unknown predictions, whereas background false positives account for 46-71%. We show that this is not a missing-information problem, but an information bottleneck at the objectness head. On PROB Task 1, a linear probe on the 256-D decoder query achieves an AUROC of 0.908 for positive-versus-negative unknown discrimination, but the final one-dimensional objectness scalar drops to 0.642. A frozen SigLIP feature, without access to the detector, independently recovers much of this proposal-level separability at the filtering stage (AUROC = 0.871). Motivated by this finding, we propose DualMem, a calibrated post-hoc filter that assumes a small image-disjoint annotated calibration split of held-out future-task objects and performs a non-parametric likelihood ratio test in frozen SigLIP feature space. DualMem uses a k-nearest-neighbor positive memory to protect future-task objects and a negative memory to suppress background-like proposals. Its decision threshold is chosen by Neyman-Pearson calibration, giving users an explicit trade-off between false-unknown suppression and novel recall. Across PROB, OW-DETR, and HypOW on M-OWODB Task 1, DualMem reduces background-type false unknown proposals per image by 44.9%-66.3%, with a mean reduction of 56.6%. On PROB Task 1, it more than doubles the reduction achieved by a natural K-means prototype baseline, while leaving known-class mAP unchanged because known detections bypass the filter.
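The filtering idea in the abstract, a non-parametric likelihood ratio over positive and negative memories with a calibrated threshold, can be sketched roughly as follows. This is an illustrative approximation and not the authors' implementation: it assumes a standard k-nearest-neighbour density ratio, plain tuples stand in for SigLIP features, and the threshold rule (bounding the fraction of calibration positives that get rejected) is one assumed reading of Neyman-Pearson calibration. All function names are hypothetical.

```python
import math


def kth_neighbor_distance(x, memory, k):
    """Distance from x to its k-th nearest neighbour in memory."""
    dists = sorted(math.dist(x, m) for m in memory)
    return dists[min(k, len(dists)) - 1]


def log_likelihood_ratio(x, pos_memory, neg_memory, k=1, eps=1e-12):
    """log p_pos(x) / p_neg(x) using kNN density estimates p ~ k / (n * r_k^d)."""
    dim = len(x)
    r_pos = max(kth_neighbor_distance(x, pos_memory, k), eps)
    r_neg = max(kth_neighbor_distance(x, neg_memory, k), eps)
    size_term = math.log(len(neg_memory) / len(pos_memory))
    return dim * (math.log(r_neg) - math.log(r_pos)) + size_term


def calibrate_threshold(pos_scores, alpha):
    """Pick a threshold that rejects at most a fraction alpha (< 1) of calibration positives."""
    ordered = sorted(pos_scores)
    return ordered[int(alpha * len(ordered))]


def keep_as_unknown(x, pos_memory, neg_memory, threshold, k=1):
    """Keep a proposal only if it looks more like a stored object than like background."""
    return log_likelihood_ratio(x, pos_memory, neg_memory, k) >= threshold
```

Raising `alpha` suppresses more background-like proposals at the price of lower recall on novel objects, which is the trade-off the abstract describes.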

2605.23623 2026-05-25 cs.CR cs.AI cs.LG 版本更新

Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

时间概念漂移下的对抗脆弱性：Android恶意软件检测的纵向研究

Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University（比尔泽特大学计算机科学系）； Department of Computer Science, University of Central Florida（中佛罗里达大学计算机科学系）

AI总结本文通过长期视角研究了安卓恶意软件检测系统在时间概念漂移下的对抗脆弱性，分析了十年间应用数据在静态和动态特征表示下的对抗鲁棒性。研究采用三种部署协议评估模型性能，引入了多个时间关联指标以量化分布偏移对鲁棒性的影响。结果表明，随着时间间隔增大，对抗鲁棒性下降，而攻击成功率上升，强调了在动态数据环境下需考虑时间漂移因素，并提出了针对长期对抗环境的鲁棒性评估框架的重要性。

Comments 42 pages, 4 tables, 10 figures

AI中文摘要

我们提出了一种纵向的、考虑漂移的对抗鲁棒性评估，使用从模拟器和真实设备执行中提取的静态和动态特征表示，跨越超过十年的Android应用。数据集按年度切片组织，并在三种模拟现实学习场景的部署协议下进行评估：（1）同年度训练和测试，（2）跨年度部署且不更新模型，（3）使用累积历史数据进行扩展窗口重训练。在多个分类器家族中，使用FGSM和SPSA在可行性约束下生成对抗样本。我们测量了干净性能、对抗准确率（AA）、攻击成功率（ASR），并引入了时序关联指标——RobustDrop、$\Delta$ASR和对抗放大因子（AAF）——以量化分布漂移与鲁棒性退化之间的关系。结果表明，在评估的基于迁移的特征空间设置下，时间分离与对抗鲁棒性降低相关。随着训练-测试间隔增加，干净准确率和对抗准确率下降，而攻击成功率呈现配置相关的增加，特别是在FGSM扰动和静态特征下。扩展窗口重训练可以缓解但无法消除在持续分布演化下的鲁棒性损失。这些发现表明，在评估智能检测系统在演化数据分布下的长期鲁棒性时，应考虑时间漂移，并强调了在长期对抗环境中需要漂移感知的鲁棒性评估框架。

英文摘要

We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, $\Delta$ASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation. Results show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
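As a rough illustration of how FGSM perturbations and the AA and ASR measurements fit together, here is a minimal sketch on a logistic-regression detector. Several things are assumptions rather than details from the paper: the linear model, the box constraint `[lo, hi]` standing in for the unspecified feasibility constraints, and the common definition of ASR as the share of initially correct samples that the attack flips. The temporal metrics (RobustDrop, ΔASR, AAF) are not reproduced because the abstract does not define them.

```python
import math


def predict(w, b, x):
    """Binary decision of a linear detector: 1 = malware, 0 = benign."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0


def fgsm(w, b, x, y, eps, lo=0.0, hi=1.0):
    """One-step FGSM: move x by eps along the sign of the loss gradient."""
    logit = sum(wi * xi for wi, xi in zip(w, x)) + b
    p = 1.0 / (1.0 + math.exp(-logit))
    grad = [(p - y) * wi for wi in w]  # d(cross-entropy)/dx for logistic regression
    sign = lambda g: (g > 0) - (g < 0)
    # Clipping to [lo, hi] is an assumed stand-in for feasibility constraints.
    return [min(hi, max(lo, xi + eps * sign(g))) for xi, g in zip(x, grad)]


def evaluate(w, b, data, eps):
    """Return clean accuracy, adversarial accuracy, and attack success rate."""
    clean_correct = adv_correct = flipped = 0
    for x, y in data:
        was_correct = predict(w, b, x) == y
        still_correct = predict(w, b, fgsm(w, b, x, y, eps)) == y
        clean_correct += was_correct
        adv_correct += still_correct
        flipped += was_correct and not still_correct
    n = len(data)
    asr = flipped / clean_correct if clean_correct else 0.0
    return {"clean_acc": clean_correct / n, "adv_acc": adv_correct / n, "asr": asr}
```

Under the cross-year protocol one would fit `w, b` on an earlier yearly slice and call `evaluate` on later slices, then compare the resulting numbers across train-test gaps.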

2605.23610 2026-05-25 cs.CV cs.AI 版本更新

EM-Vid: Training-Free Entity-Centric Memory for Efficient and Consistent Multi-Shot Video Generation

EM-Vid：无需训练的以实体为中心的记忆，用于高效且一致的多镜头视频生成

Jente Vandersanden, Matheus Gadelha, Chun-Hao P. Huang, Hyeonho Jeong, Yulia Gryaditskaya

发表机构 * Max Planck Institute for Informatics（马克斯·普朗克信息学研究所）； Adobe Research（Adobe研究院）

AI总结本文提出了一种无需训练的实体中心记忆机制 EM-Vid，用于高效且一致的多镜头视频生成。该方法通过存储实体相关的潜在补丁来分离持久实体信息与瞬时场景背景，结合稀疏 token 条件控制和结构化脚本格式，有效降低了计算成本并提升了生成一致性。此外，引入的预算化记忆更新策略和噪声注入机制，进一步增强了对实体外观的精细控制，防止了无关信息的泄露。

AI中文摘要

多镜头视频生成需要在不同镜头间保持重复实体的一致外观，同时忠实于镜头特定的文本提示。最近的自回归方法重用先前生成的帧作为记忆。然而，全帧存储将持久实体信息与瞬态场景上下文纠缠在一起，导致无关信息泄漏和高计算成本。我们提出一种以实体为中心的记忆，形式为实体索引的潜在补丁库。我们引入与预训练模型兼容的稀疏令牌条件化，将自注意力限制在实体相关令牌上，降低计算成本。为此，我们引入一种结构化的多镜头脚本格式。我们还提出一种预算记忆更新策略，以维护紧凑且不断演化的记忆。最后，我们为实体表示配备噪声注入机制，实现细粒度外观控制，防止无关信息泄漏。我们的方法在保持主体一致性的同时，提高了提示遵循度和效率。

英文摘要

Multi-shot video generation requires maintaining a consistent appearance of recurring entities across shots while remaining faithful to shot-specific text prompts. Recent autoregressive methods reuse previously generated frames as memory. However, full-frame storage entangles persistent entity information with transient scene context, leading to irrelevant information leakage and high computational cost. We propose an entity-centric memory in the form of an entity-indexed bank of latent patches. We introduce sparse token conditioning compatible with pretrained models, restricting self-attention to entity-relevant tokens and reducing computational cost. To support this, we introduce a structured multi-shot script format. We additionally propose a budgeted memory update strategy to maintain a compact, evolving memory. Finally, we equip the entity representation with a noise-injection mechanism that enables fine-grained appearance control, preventing leakage of irrelevant information. Our method improves prompt adherence and efficiency while preserving subject consistency.

2605.23605 2026-05-25 cs.LG cs.AI cs.CL 版本更新

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

DiLaDiff: 蒸馏潜在增强扩散用于语言建模

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

发表机构 * NVIDIA（英伟达）

AI总结 DiLaDiff 是一种改进的扩散语言模型，旨在解决传统扩散模型在采样质量和生成速度之间的矛盾。该方法引入了连续语义潜在空间，并通过自编码器和一致性蒸馏技术提升生成效率和质量。实验表明，DiLaDiff 在不进行蒸馏时已优于基线模型，并在蒸馏后显著加快了推理速度。

2605.23603 2026-05-25 cs.LG cond-mat.dis-nn cs.AI cs.NE 版本更新

Preisach Attention: A Hysteretic Model of Sequential Memory

Preisach注意力：序列记忆的迟滞模型

Piotr Frydrych

发表机构 * Faculty of Mechatronics, Warsaw University of Technology（华沙理工大学机电一体化学院）

AI总结本文提出了一种基于经典 Preisach 滞后算子的新型序列建模架构——Preisach 注意力层（PAL），用二值继电器操作符替代传统的 softmax 注意力机制，通过学习激活与去激活阈值来维护内部的局部极值栈。该架构在任意精度算术下实现图灵完备性，且单层 PAL-Transformer 的深度仅为 O(1)，优于传统硬注意力 Transformer 所需的 O(log n) 深度。研究还证明 PAL 与 Transformer 在可计算函数类上互不包含，PAL 能以更少层数计算历史范围统计量，而 Transformer 支持随机访问但需额外状态支持，且 PAL 对序列的响应仅依赖于局部极值序列，而非绝对位置或时间间隔。

Comments 24 pages, 2 tables, preprint

AI中文摘要

我们引入了Preisach注意力层（PAL），一种基于数学物理中经典Preisach迟滞算子的新型序列建模架构。PAL用由学习到的激活和去激活阈值参数化的二值继电器算子替代了softmax注意力机制，并维护一个局部极值栈作为其内部状态。在任意精度算术下，具有O(1)深度的单层PAL-Transformer是图灵完备的，这可以通过模拟双栈下推自动机实现——而标准硬注意力Transformer需要O(log n)深度。其次，我们证明了PAL和Transformer可计算的函数类是不可比的：PAL在O(1)层内计算历史范围统计，而Transformer需要O(log n)层；Transformer支持随机访问检索，而PAL在没有辅助状态的情况下无法执行。分离性质是率无关性——PAL仅响应局部极值序列，而不响应绝对标记位置或时间间隔。第三，我们证明了极值栈构成了所有率无关泛函的输入历史的最小充分统计量，提供了经典迟滞理论中擦除性质的形式类比。因此，PAL是一种适用于长情节记忆和弱位置依赖任务的高效架构，其总推理成本为O(n log n)，而标准注意力为O(n^2)。

英文摘要

We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. A single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton -- in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence -- PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
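The building block named in the abstract, a binary relay with a deactivation threshold and an activation threshold, is the classical ingredient of the Preisach operator and can be sketched in a few lines. This shows only the textbook relay and a weighted sum of relays; it is not the PAL layer itself, and the names `relay` and `preisach_output` as well as the ±1 output convention are assumptions made for illustration.

```python
def relay(inputs, alpha, beta, state=-1):
    """Non-ideal relay: switch to +1 at u >= beta, to -1 at u <= alpha, else hold."""
    outputs = []
    for u in inputs:
        if u >= beta:
            state = 1
        elif u <= alpha:
            state = -1
        outputs.append(state)  # between the thresholds the previous state persists
    return outputs


def preisach_output(inputs, relays):
    """Weighted superposition of relays; relays is a list of (alpha, beta, weight)."""
    per_relay = [relay(inputs, alpha, beta) for alpha, beta, _ in relays]
    weights = [weight for _, _, weight in relays]
    return [
        sum(w * outs[t] for w, outs in zip(weights, per_relay))
        for t in range(len(inputs))
    ]
```

Because each relay reacts only to threshold crossings, repeating or time-stretching input samples leaves the final state unchanged, which is the rate-independence property the abstract relies on.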

2605.23592 2026-05-25 cs.AI 版本更新

Solving the Aircraft Disassembly Scheduling Problem

解决飞机拆解调度问题

Charles Thomas, Pierre Schaus

发表机构 * Institute of Information and Communication Technologies, Electronics and Applied Mathematics (ICTEAM)（信息与通信技术、电子与应用数学研究所）； UCLouvain（鲁汶大学）

AI总结本文研究了飞机报废拆解过程中的调度问题，该问题涉及大量任务和多种约束条件，对航空公司实现可持续拆解和盈利至关重要。文章提出了两种求解方法，包括约束规划模型和混合整数规划模型，并基于工业合作伙伴提供的真实数据进行了测试，验证了模型在处理多达1450项任务实例中的有效性。

AI中文摘要

拆解寿命终结的飞机是一项复杂的工程，对于可持续性而言是必要的，但为航空运输公司带来的利润空间很小。因此，拆解过程的高效调度对于确保流程的盈利能力和激励实践至关重要。这是一个涉及数千个任务和许多不同约束的大规模调度问题：提取计划重复使用的部件需要具有特定认证和设备的技师。提取操作可能受先后顺序关系约束。此外，在整个过程中必须保持飞机平衡。最后，飞机的某些位置空间有限，限制了可同时工作的技师数量。本文详细介绍了该问题，并提出了两种解决方法：约束规划模型和混合整数规划模型。这些模型在基于工业合作伙伴提供的真实运营数据、规模不同（最多1450个任务）的实例上进行了测试。

英文摘要

Dismantling aircraft that have reached the end of their life is a complex endeavour that is necessary for sustainability but yields small income margins for air transport companies. Efficient scheduling of the disassembly procedure is thus crucial to ensure the profitability of the process and to incentivize the practice. This is a large scheduling problem that involves thousands of tasks and many different constraints. Extracting parts that are destined to be reused requires technicians with specific certifications and equipment. Extraction operations might be subject to precedence relations. Furthermore, the aircraft must be kept balanced during the whole process. Finally, some locations in the aircraft have limited space, which caps the number of technicians able to work there concurrently. This article presents the problem in detail and proposes two approaches to solve it: a Constraint Programming model and a Mixed-Integer Programming (MIP) model. The models are tested on instances of varying sizes involving up to 1450 tasks, which are based on real operational data provided by an industrial partner.

2605.23590 2026-05-25 cs.AI 版本更新

Co-ReAct: Rubrics as Step-Level Collaborators for ReAct Agents

Co-ReAct：作为ReAct智能体逐步协作者的评分准则

Jiazheng Kang, Bowen Zhang, Zixin Song, Jiangwang Chen, Xiao Yang, Da Zhu, Guanjun Jiang

发表机构 * Qwen Applications Business Group of Alibaba（阿里巴巴千问应用事业群）； Tsinghua University（清华大学）

AI总结 Co-ReAct 是一种基于评分标准（rubrics）的行动选择框架，旨在改进 ReAct 代理在多步骤推理任务中的决策过程。该方法在每一步推理中注入评分标准作为指导，明确代理应关注的证据搜索、推理或自我评估方向，从而提升推理的深度和针对性。通过引入专门训练的评分标准生成器，并采用多评委共识排名优化目标，Co-ReAct 显著提升了多个基准任务上的表现，且无需修改原有代理的决策机制。

AI中文摘要

用于搜索密集型、多步推理任务的ReAct风格智能体主要依赖自身内部判断来决定寻求哪些证据、下一步采取哪个推理或行动步骤以及何时停止，常常产生浅显、冗余或目标不明确的轨迹。先前的工作探索了将评分准则作为外部质量信号，但现有用途主要是评估性的而非行动指导性的：评分准则通常作为训练时的奖励或完成输出的事后评估器，在深度研究场景中，它们往往是粗粒度的、报告级别的而非步骤级别的。我们引入了Co-ReAct，一个评分准则指导的行动选择框架，在推理过程中将评分准则作为步骤级指导。在每个决策步骤，Co-ReAct将评分准则注入智能体的上下文，以指导下一个“推理或行动”决策，明确智能体在证据寻求、搜索、推理或自我评估中应瞄准什么。为了使这种指导可靠，我们使用GRPO训练了一个专用的评分准则生成器。与先前的成对或二元偏好公式不同，我们的目标优化了针对多评判专家共识排名的列表式斯皮尔曼等级相关奖励，鼓励评分准则具有区分性而不仅仅是合理。在DeepResearchBench和SQA-CS-V2上，Co-ReAct在基于8B/14B开源和前沿闭源基础模型构建的搜索智能体上，一致优于ReAct和代表性的测试时计算基线。训练好的评分准则生成器还可以作为即插即用组件，在不改变底层决策机制的情况下改进这些基线。我们的代码公开在https://github.com/ZBWpro/Co-ReAct。

英文摘要

ReAct-style agents for search-intensive, multi-step reasoning tasks rely largely on their own internal judgment to decide what evidence to seek, which reasoning or action step to take next, and when to stop, often producing shallow, redundant, or poorly targeted trajectories. Prior work has explored rubrics as external quality signals, but existing uses are mostly evaluative rather than action-guiding: rubrics typically serve as training-time rewards or post-hoc evaluators of completed outputs, and in deep-research settings they are often coarse-grained and report-level rather than step-level. We introduce Co-ReAct, a rubric-guided action-selection framework that uses rubrics as step-level guidance during inference. At each decision step, Co-ReAct injects a rubric into the agent's context to guide the next Reason-or-Act decision, specifying what the agent should target in evidence seeking, search, reasoning, or self-evaluation. To make this guidance reliable, we train a dedicated rubric generator with GRPO. Unlike prior pairwise or binary preference formulations, our objective optimizes a list-wise Spearman rank-correlation reward against multi-judge expert consensus rankings, encouraging rubrics that are discriminative rather than merely plausible. On DeepResearchBench and SQA-CS-V2, Co-ReAct consistently improves over ReAct and representative test-time compute baselines across search agents built on both 8B/14B open-source and frontier closed-source base models. The trained rubric generator can also serve as a drop-in component that improves these baselines without changing their underlying decision mechanisms. Our code is publicly available at https://github.com/ZBWpro/Co-ReAct.
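The list-wise reward described above can be made concrete with a small sketch of Spearman rank correlation between the scores a rubric assigns to a list of candidates and a consensus ranking. What exactly is ranked and how the consensus is formed are not stated in the abstract, so the inputs here are assumptions, and `average_ranks` and `spearman` are hypothetical helper names.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def spearman(a, b):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(ra)
    mean_a, mean_b = sum(ra) / n, sum(rb) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(ra, rb))
    var_a = sum((x - mean_a) ** 2 for x in ra)
    var_b = sum((y - mean_b) ** 2 for y in rb)
    if var_a == 0 or var_b == 0:
        return 0.0  # a constant scoring carries no ranking information
    return cov / math.sqrt(var_a * var_b)
```

A rubric that scores every candidate identically earns zero reward under this sketch, which matches the stated goal of favouring discriminative rubrics over merely plausible ones.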

2605.23572 2026-05-25 cs.IR cs.AI cs.LG 版本更新

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

HARNESS-LM: 一种在赞助搜索中利用小语言模型的三阶段训练方案

Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma

发表机构 * Microsoft AI（微软人工智能）

AI总结在赞助搜索中，如何在保证检索质量的同时降低响应延迟是一个重要挑战。本文提出HARNESS-LM（HLM），一种三阶段训练框架，旨在将大规模语言模型的检索能力转移到参数更少、成本更低的模型中。通过知识蒸馏和对比优化等方法，HLM在保持高检索精度的同时显著提升了推理效率，并在实际的Bing Ads测试中验证了其有效性，取得了更高的收益、曝光和点击率提升。

Comments 9 pages, 3 figures, 10 tables

AI中文摘要

在赞助搜索的竞争格局中，平衡检索质量与生产延迟是一个关键挑战。尽管基于小语言模型（SLM）的大型检索模型（如Qwen3-Embedding-4B/8B）在公共基准上设定了强上限，但其在高吞吐、延迟敏感环境中的部署仍不切实际。本文提出HARNESS-LM（HLM），一个三阶段训练框架，用于将大规模检索器的能力迁移至紧凑、成本高效的模型。该方法包括：（1）通过微调十亿参数规模的SLM训练高性能参考（“教师”）检索器；（2）通过L2目标对齐查询表示，将知识蒸馏至低于600M参数的学生编码器；（3）应用最终对比精炼阶段以优化学生的检索性能。我们还对关键设计选择进行了全面的实证研究，包括对齐目标、嵌入维度、模型规模、架构和优化策略，以确定在生产环境中最为有效的配置。在真实世界的Bing Ads评估基准上，HLM在多种设置下恢复了参考检索器超过98%的精度，同时在NVIDIA A100 GPU上实现了高达27倍的在线查询编码器延迟降低和20倍的吞吐量提升。在Bing Ads上的在线A/B测试进一步显示，与当前生产中运行的检索器集成（部署190M参数模型）相比，收入提升+1%，展示量提升+0.6%，点击量提升+0.4%，清晰突显了HLM方案在真实世界赞助搜索场景中的实际效果。

英文摘要

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
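Phases (2) and (3) of the recipe correspond to two familiar loss functions, sketched below on plain Python lists. This is a minimal illustration under stated assumptions, not the paper's training code: student and teacher embeddings are assumed to share a dimension, the contrastive stage is assumed to be in-batch InfoNCE with document `i` as the positive for query `i`, and the function names and the `temperature` default are hypothetical.

```python
import math


def l2_alignment_loss(student, teacher):
    """Phase 2 sketch: mean squared L2 distance between paired query embeddings."""
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((si - ti) ** 2 for si, ti in zip(s, t))
    return total / len(student)


def info_nce_loss(queries, docs, temperature=0.05):
    """Phase 3 sketch: in-batch contrastive loss, docs[i] is the positive for queries[i]."""
    total = 0.0
    for i, q in enumerate(queries):
        logits = [sum(qi * di for qi, di in zip(q, d)) / temperature for d in docs]
        top = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)
```

The alignment loss only needs teacher query embeddings, whereas the contrastive loss needs query-document pairs.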

2605.23569 2026-05-25 cs.AI 版本更新

CP or DP? Why Not Both: A Case Study in the Partial Shop Scheduling Problem

CP还是DP？为何不兼得：以部分车间调度问题为例

Emma Legrand, Roger Kameugne, Pierre Schaus

发表机构 * ICTEAM, UCLouvain, Belgium（ICTEAM，鲁汶大学，比利时）

AI总结本文研究了如何将动态规划（DP）与约束规划（CP）有效结合，以解决部分车间调度问题（PSSP）。作者提出了一种混合方法，以DP作为主搜索框架，利用CP进行全局约束传播，从而提升求解效率与灵活性。该方法不仅支持任意优先级约束，还可与任何时间策略结合，并能设计出基于DP的大型邻域搜索方案，展示了DP与CP融合在组合优化问题中的可行性。

AI中文摘要

动态规划（DP）和约束规划（CP）是解决组合优化问题的成熟范式。通常，这两种方法被分开使用。本文旨在展示两者可以有效且优雅地结合，其中DP作为主搜索框架，CP作为子程序利用全局约束传播。本文针对部分车间调度问题（PSSP）提出了这样一种方法，该问题之前已有纯DP方法，并且有高效的CP过滤算法可用。PSSP是一个通用调度问题，其中每个作业由一组具有任意优先约束的操作组成。该方法足够灵活，可以容纳任意时间DP策略，例如任意时间列搜索，而原始DP算法以严格的逐层方式运行。此外，CP建模的灵活性使得可以轻松纳入任意优先约束。因此，该模型自然地处理任何优先图，甚至允许设计大邻域搜索（LNS）方案，其中重用DP模型，并在重启之间施加偏序调度以改进当前解。虽然对于这个特定问题，该方法无法与最先进的纯CP求解器竞争，但我们的主要贡献是证明了这种混合集成的可行性。

英文摘要

Dynamic Programming (DP) and Constraint Programming (CP) are well-established paradigms for solving combinatorial optimization problems. Usually, these two approaches are used separately. This paper aims to show that the two can be combined effectively and elegantly, with DP serving as the primary search framework and CP used as a subroutine to leverage global constraint propagation. This paper presents such an approach for the Partial Shop Scheduling Problem (PSSP), for which a pure DP method has previously been proposed, and efficient CP filtering algorithms are available. The PSSP is a general scheduling problem where each job consists of a set of operations with arbitrary precedence constraints. The approach is flexible enough to accommodate anytime DP strategies, such as anytime column search, whereas the original DP algorithm operated in a strictly layer-wise manner. Moreover, the flexibility of the CP modeling makes it straightforward to incorporate arbitrary precedence constraints. As a result, the model naturally handles any precedence graph and even enables the design of a Large Neighborhood Search (LNS) scheme, in which the DP model is reused, and partial-order schedules are imposed across restarts to improve the incumbent solution. While not competitive with state-of-the-art pure CP solvers for this specific problem, our primary contribution is demonstrating the viability of this hybrid integration.

2605.23565 2026-05-25 cs.LG cs.AI 版本更新

Understanding Goal Generalisation in Sequential Reinforcement Learning

理解序贯强化学习中的目标泛化

Jason Ross Brown, Edward James Young

发表机构 * University of Cambridge（剑桥大学）； Geodesic Research（Geodesic研究）

AI总结本研究探讨了序列强化学习代理在新环境中实现目标泛化的能力，分析了其训练历史对其行为的影响。通过研究超过100种序列训练流程并在250多个分布外环境中进行评估，发现显著特征和早期学习的目标对后续泛化具有重要影响。为此，研究提出了一种名为潜在策略梯度的方法，能够预测训练流程可能诱导的分布外行为，具有较高的预测准确性、良好的泛化能力和可解释性，为从发展角度理解目标泛化提供了基础。

AI中文摘要

强化学习代理在其训练分布之外常常表现出非预期的目标导向行为，但我们目前缺乏基于训练历史对这类代理如何泛化到新环境的原理性理解。我们针对在单个或多个任务上序贯训练的代理解决了这一空白。我们研究了超过100个序贯训练流程，评估了超过250个分布外环境中的行为。我们发现显著特征驱动泛化，并且训练早期习得的目标会持续存在并影响后期习得的目标。为了解释这些现象，我们引入了潜在策略梯度方法，该方法预测训练流程可能诱导的分布外行为。我们的方法根据潜在变量如何映射到行为的简单模型，模拟训练过程中低维潜在变量的演化，以实现在训练目标上获得高奖励。它实现了强预测准确性，泛化到未见过的训练流程类型，并且是可解释的。我们的发现表明，虽然分布外RL代理行为依赖于整个训练流程，但这种依赖具有我们可以捕捉的底层结构，为从发展角度理解目标泛化奠定了基础。

英文摘要

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

2605.23562 2026-05-25 cs.MA cs.AI 版本更新

ARMS: Automatic Reward Shaping for Sparse-Reward Multi-Agent Reinforcement Learning

ARMS: 稀疏奖励多智能体强化学习的自动奖励塑形

Elie Abboud, Oren Gal

发表机构 * Department of Marine Technologies（海洋技术系）

AI总结在多智能体强化学习中，稀疏奖励是学习过程中的主要瓶颈，而传统的奖励塑造方法难以在保持策略结构的同时提升学习效率。本文提出了一种名为ARMS的自动奖励塑造框架，通过轨迹排序从稀疏环境奖励中学习密集的塑造奖励，并基于条件最佳响应推理保证在固定对手策略下保留每个智能体的最佳响应集和纳什均衡集。实验表明，ARMS在部分可观测的多智能体路径规划任务中显著提升了采样效率，具有良好的环境泛化能力，并揭示了多智能体系统中由探索不足和策略-奖励动态耦合引发的振荡行为问题。

AI中文摘要

稀疏奖励是多智能体强化学习（MARL）中的一个主要瓶颈，其中同时学习会导致非平稳性并使奖励设计尤其精细。奖励塑形可以加速学习，但在多智能体环境中，它必须保留问题的战略结构，而不仅仅是改善短期优化。我们提出了多智能体系统中的自动奖励塑形（ARMS），这是一个用于MARL的自监督奖励塑形框架，通过轨迹排序从稀疏环境奖励中学习稠密塑形信号。由于单智能体轨迹排序保证不能直接迁移到MARL，我们通过条件最优反应推理重新表述策略不变性，并证明如果某些条件成立，则使用塑形奖励在固定对手策略下保留每个智能体的最优反应集，从而保留纳什均衡集。在此视角指导下，ARMS在策略学习和奖励学习之间交替，同时跨智能体共享塑形参数以提高效率。在部分可观测的多智能体路径规划领域中的实验表明，ARMS在奖励稀疏性和智能体数量增加的情况下提高了采样效率，泛化到未见过的环境，并揭示了一种MARL特有的失败模式，其中有限的探索和耦合的策略-奖励动态导致振荡行为。增加探索可缓解此效应并稳定学习。据我们所知，ARMS是第一个其设计动机来自博弈论均衡保持结果的MARL自动奖励塑形框架。

英文摘要

Sparse rewards are a major bottleneck in multi-agent reinforcement learning (MARL), where simultaneous learning induces non-stationarity and makes reward design especially delicate. Reward shaping can accelerate learning, but in the multi-agent setting it must preserve the strategic structure of the problem rather than merely improve short-term optimization. We propose Automatic Reward-shaping in Multi-agent Systems (ARMS), a self-supervised reward shaping framework for MARL that learns dense shaping signals from sparse environmental rewards through trajectory ranking. Since single-agent trajectory-ranking guarantees do not directly transfer to MARL, we reformulate policy invariance through conditional best-response reasoning, and show that if certain conditions hold, then using shaping rewards preserves each agent's best-response set under fixed opponent policies, and consequently preserves the set of Nash equilibria. Guided by this perspective, ARMS alternates between policy learning and reward learning while sharing shaping parameters across agents for efficiency. Experiments in a partially observable multi-agent pathfinding domain show that ARMS improves sampling efficiency under increasing reward sparsity and agent count, generalizes to unseen environments, and reveals a MARL-specific failure mode in which limited exploration and coupled policy--reward dynamics induce oscillatory behavior. Increasing exploration mitigates this effect and stabilizes learning. To the best of our knowledge, ARMS is the first automatic reward shaping framework for MARL whose design is motivated by a game-theoretic equilibrium-preservation result.

2605.23559 2026-05-25 cs.CV cs.AI 版本更新

PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA

PathNavigate：一种无需训练的病理学代理，利用惊异度引导扫描与共享切片记忆进行全切片图像VQA

Chunze Yang, Qidong Liu, Wenjie Zhao, Yue Tang, Jiusong Ge, Di Zhang, Jiashuai Liu, Lei Wu, Junbo Lu, Ni Zhang, Xian Wu, Zeyu Gao, Chen Li

发表机构 * School of Comp. Science & Technology, Xi’an Jiaotong University（西安交通大学计算机科学与技术学院）； Tencent Jarvis Lab（腾讯Jarvis实验室）； University of Cambridge（剑桥大学）

AI总结 PathNavigate 是一种无需训练的病理图像问答代理，旨在解决全切片图像问答（WSI-VQA）中在有限检查预算下高效定位关键病理证据的问题。该方法采用“扫描-搜索-读取”流程，通过共享的在线记忆模块生成异常区域池，并结合问题条件的相关性筛选高倍镜下的目标区域，从而提升答案准确性和解释性。实验表明，PathNavigate 在保持模型冻结的前提下，实现了更高的效率和更可靠的证据选择路径。

详情

AI中文摘要

全切片图像视觉问答（WSI-VQA）将病理学视为极端上下文搜索问题：为了回答自由形式的临床查询，系统必须首先在严格的检查预算下导航千兆像素切片，以定位稀疏的高分辨率证据。现有方法主要分为两种范式：i）监督式病理学多模态大语言模型（MLLMs）和代理可以将定位和推理吸收到学习模块中，但它们通常将导航与任务特定的监督和重新训练耦合，限制了其实用性；ii）无需训练的病理学代理通过保持核心模型冻结来避免这种成本，但通常遵循问题优先的设计，主要从查询条件相关性构建初始候选集。这可能会遗漏问题中未提及的决定性形态，并迫使更重的推理时脚手架。为了解决这一挑战，我们引入了PathNavigate，一种无需训练的病理学代理，基于扫描-搜索-读出流程构建。在问题匹配之前，PathNavigate在低放大倍数下扫描当前切片，使用共享的在线记忆模块处理冻结的病理学特征，生成一个切片特定的惊异度场，用以标记异常区域池。然后，它仅在此池内应用问题条件的PLIP相关性，以选择高放大倍数的搜索目标。最后，它提取局部高放大倍数证据，并使用冻结的感知器-裁决器堆栈进行回答，利用相同的在线记忆作为切片级上下文。在WSI-VQA和SlideBench-BCNB上的实验表明，所提出的扫描-搜索-读出设计提高了答案准确性，并产生了更可解释的证据选择轨迹，且效率更高。代码已在线公开。

英文摘要

Whole-slide image visual question answering (WSI-VQA) frames pathology as an extreme-context search problem: to answer a free-form clinical query, a system must first navigate a gigapixel slide under a strict inspection budget to locate sparse, high-resolution evidence. Existing approaches largely fall into two paradigms: i) supervised pathology multimodal large language models (MLLMs) and agents can absorb localization and reasoning into learned modules, but they often couple navigation to task-specific supervision and retraining, limiting their practicality; ii) training-free pathology agents avoid this cost by keeping core models frozen, but often follow a question-first design, constructing the initial candidate set mainly from query-conditioned relevance. This can miss decisive morphology that is not named in the question, and force heavier inference-time scaffolding. To address this challenge, we introduce PathNavigate, a training-free pathology agent built around a scan-search-readout routine. Before question matching, PathNavigate scans the current slide at low magnification with a shared online memory module over frozen pathology features, producing a slide-specific surprise field that marks an abnormal-region pool. It then applies question-conditioned PLIP relevance only within this pool to select high-magnification search targets. Finally, it extracts local high-magnification evidence and answers with a frozen perceptor-adjudicator stack, using the same online memory as slide-level context. Experiments on WSI-VQA and SlideBench-BCNB show that the proposed scan-search-readout design improves answer accuracy and yields more interpretable evidence-selection trajectories with higher efficiency. The code is available online.

2605.23551 2026-05-25 cs.LG cs.AI 版本更新

Goal-Conditioned Agents that Learn Everything All at Once

目标条件智能体一次性学习所有内容

Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster

发表机构 * University of Oxford（牛津大学）； McGill University（麦吉尔大学）； MIT（麻省理工学院）； Inria（法国国家信息与自动化研究所）

AI总结本文提出了一种名为LEO（Learning Everything all at Once）的新方法，用于提升目标条件强化学习的效率。该方法通过一次性输出所有目标对应的价值和动作，实现了高效的并行更新，解决了传统全目标学习计算开销大的问题。实验表明，LEO在目标条件任务和连续控制环境中均表现出色，且相比传统方法有超过250倍的加速效果，为复杂环境中的强化学习提供了有力工具。

详情

AI中文摘要

一个目标条件的强化学习智能体在探索环境时，会在整个轨迹中看到大量信息，但大多数信息在仅根据命令目标进行在线策略更新时被丢弃。全目标学习（每个转换都用于针对每个目标进行离线策略学习）允许智能体提取最大信息，但通过简单的重新标记通常计算上不可行。这可以通过同时为每个目标输出值和动作来克服，从而允许通过网络单次传递进行高效的并行全目标更新，我们称之为一次性学习所有内容（LEO）。我们表明，这种方法在目标条件的Craftax上显著优于其他方法，在连续控制环境中与现有基线具有竞争力，同时与全目标重新标记相比实现了超过250倍的加速。然后，我们进一步表明，通过将LEO用作教师网络而非直接行动者，这种方法可以变得更加强大。我们希望，通过解锁大规模的全目标学习，LEO可以成为复杂环境中强化学习实践者的有用工具。我们开源了我们的代码。

英文摘要

A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information, however it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.

2605.23550 2026-05-25 math.OC cs.AI cs.NA math.NA 版本更新

RA-DCA: A Randomized Active-Set DCA for Directional Stationarity in Max-Structured DC Programs

RA-DCA：面向最大结构DC规划方向平稳性的随机活动集DCA

Yi-Shuai Niu

发表机构 * Beijing Institute of Mathematical Sciences and Applications（北京雁栖湖应用数学研究院）

AI总结本文研究了一类非光滑的凸差（DC）优化问题，其中被减去的凸项为多个光滑凸函数的最大值。为了解决标准DCA可能收敛到非方向平稳临界点的问题，同时避免大规模或组合型活动集带来的高计算成本，作者提出了一种基于随机化活动集的DCA方法RA-DCA。该方法通过在采样方向上投影活动梯度、检查采样顶点残差，并仅在残差较小时使用小规模线性规划作为补充，有效保持了DCA的下降结构，同时将随机筛选过程简化为矩阵乘法。实验表明，该方法在多种模型中能够避免非平稳临界点，并在组合型问题中展现出良好的筛选效果。

Comments 40 pages, 7 figures

详情

AI中文摘要

我们研究非光滑差凸规划，其中被减的凸项是光滑凸函数的有限最大值。在此设定下，标准DCA迭代可能收敛到非方向平稳的临界点，而当活动集较大或具有组合性质时，精确的活动顶点筛选可能代价高昂。我们提出RA-DCA，一种顶点优先的随机活动集DCA，它将活动梯度投影到采样方向，检查采样顶点残差，并仅将一个小型线性规划用作低残差情形下的凸组合回退手段。该方法保留了DCA的下降结构，并将随机筛选层简化为矩阵乘法。在所述正则性、数值活动集一致性和随机嵌入假设下，带保护机制的方法生成的每个聚点以概率1是方向平稳的。MATLAB实验首先在退化的最大仿射、最大二次和稀疏支撑函数模型上测试该定理，其中保护机制避免了非平稳临界点并紧密跟踪完整活动顶点扫描。随后，块top-k测试表明，当精确聚合枚举具有组合性质时，相同的筛选思想仍然有用。修剪回归、互补性和QUBO诊断区分了活动集选择有助于问题的情况与由多起点搜索、DC分裂或其他问题特定特征主导的情况。

英文摘要

We study nonsmooth difference-of-convex programs whose subtracted convex term is a finite maximum of smooth convex functions. In this setting, standard DCA iterations may converge to critical points that are not directionally stationary, whereas exact active-vertex screening can be expensive when active sets are large or combinatorial. We propose RA-DCA, a vertex-first randomized active-set DCA that projects active gradients onto sampled directions, checks a sampled vertex residual, and uses a small linear program only as a low-residual convex-combination fallback. The method preserves the descent structure of DCA and reduces the randomized screening layer to matrix multiplications. Under the stated regularity, numerical active-set consistency, and random-embedding assumptions, every accumulation point generated by the safeguarded method is directionally stationary with probability one. MATLAB experiments first test the theorem on degenerate max-affine, max-quadratic, and sparse support-function models, where the safeguard avoids nonstationary critical points and closely tracks a full active-vertex scan. Block top-k tests then show that the same screening idea remains useful when exact aggregate enumeration is combinatorial. Trimmed-regression, complementarity, and QUBO diagnostics separate cases where active-set selection helps from cases dominated by multistart search, the DC split, or other problem-specific features.
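
As a toy illustration of the failure mode described above (not the RA-DCA algorithm itself), the following Python sketch runs DCA on a one-dimensional example chosen here for illustration: f(x) = 0.5*x^2 - max(x, -x). The function names, the averaging rule, and the exhaustive scan over active gradients are assumptions of this sketch. At x = 0 both pieces of the max are active. Using the averaged subgradient leaves the iterate at the critical point 0, which is not directionally stationary because f decreases along every direction there, whereas trying each active gradient escapes to a minimizer.

```python
# Toy DC program: f(x) = g(x) - h(x) with g(x) = 0.5 * x**2 and
# h(x) = max(h_1(x), h_2(x)), where h_1(x) = x and h_2(x) = -x.
# Illustrative only: this is not the RA-DCA method from the abstract.

PIECE_GRADS = (1.0, -1.0)  # gradients of h_1 and h_2 (constant for this toy)


def f(x):
    return 0.5 * x * x - max(x, -x)


def active_gradients(x, tol=1e-9):
    """Gradients of the pieces that attain the max at x (within tol)."""
    values = (x, -x)
    top = max(values)
    return [g for g, v in zip(PIECE_GRADS, values) if top - v <= tol]


def dca_step(y):
    """Minimise g(x) - y * x; for g = 0.5 * x**2 the minimiser is x = y."""
    return y


def dca_averaged(x, iters=20):
    """DCA using the average of the active gradients as the subgradient."""
    for _ in range(iters):
        grads = active_gradients(x)
        x = dca_step(sum(grads) / len(grads))
    return x


def dca_vertex_scan(x, iters=20):
    """DCA that tries every active gradient and keeps the best candidate."""
    for _ in range(iters):
        candidates = [dca_step(y) for y in active_gradients(x)]
        x = min(candidates, key=f)
    return x


stuck = dca_averaged(0.0)
escaped = dca_vertex_scan(0.0)
print("averaged subgradient:", stuck, f(stuck))
print("active-vertex scan:  ", escaped, f(escaped))
```

Both runs start from x = 0. The averaged variant stays at 0 with f = 0, while the scan reaches x = 1 with f = -0.5. An exhaustive scan is affordable here only because there are two pieces; the abstract's concern is precisely that such scans become expensive when active sets are large or combinatorial.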

2605.23522 2026-05-25 cs.LG cs.AI cs.CV 版本更新

EDGE-OPD：通过证据引导的在线策略蒸馏内化特权上下文

Aristotelis Lazaridis, Dylan Bates, Aman Sharma, Brian King, Vincent Lu, Jack FitzGerald

发表机构 * EdgeRunner AI

AI总结本文研究了在基于特权上下文的On-Policy Self-Distillation（OPSD）中，如何避免特权信息对模型行为产生不必要的干扰问题，并提出了EDGE-OPD方法。该方法通过引导式采样和证据掩码机制，在训练过程中更精准地注入特权信息，确保学生模型学习到目标行为而非副作用。实验表明，EDGE-OPD有效提升了身份学习的效果，并有助于保持模型的一般能力。

详情

AI中文摘要

在线策略蒸馏（OPD）作为一种LLM后训练范式，因其在不引入模型分布漂移和通用任务回归的情况下有效提升能力而受到广泛关注。在线策略自蒸馏（OPSD）是OPD的一种高效用例，它仅需单一模型同时作为学生和教师，并且具有在训练过程中向教师提供推理时缺失的特权上下文（例如角色、私有事实或已解决的方案）的优势。该方法面临的挑战在于，特权信息可能过度改变模型行为：它可能修改推理、降低通用能力，并影响响应长度、风格或局部token偏好等性能指标。因此，OPSD可能训练学生模型学习副作用而非期望的可迁移行为。本文在稀有token/身份设定下研究该问题，并提出EDGE-OPD（证据引导的在线策略蒸馏），这是OPSD的一种改进，具有两个显著特征：a) 使用引导展开在采样时向学生注入特权上下文行为，使得稀有目标行为实际出现在在线策略数据中；b) 应用证据掩码：学生仅在特权上下文支持采样token的token位置进行更新，而非展开中的每个token。实验表明，OPSD（及其变体RLSD，无论是否使用验证器）完全无法学习目标身份，而引导展开的集成使其成功。此外，掩码区域消融实验显示，角色信号定位于正证据尾部，这使我们能够获得关于高效知识迁移和通用能力保持的宝贵见解。

英文摘要

On-Policy Distillation (OPD) has gained wide traction as an LLM post-training paradigm due to its effectiveness in improving capabilities without introducing model distribution drift and, consequently, regression in general tasks. On-Policy Self-Distillation (OPSD) is an efficient use-case of OPD, which is appealing as it requires only a single model as both student and teacher, and it also has the benefit of providing privileged context that is absent at inference time (e.g. a persona, a private fact, or a worked solution) to the teacher during the training process. The challenge in this approach is that the privileged information can change model behavior more than intended: it can modify reasoning, degrade general capabilities, and affect performance indicators like response length, style, or local token preferences. Consequently, OPSD may train the student on side effects rather than a desired, transferable behavior. In this paper, we study this problem in a rare-token/identity setting and propose EviDence GuidEd On-Policy Distillation (EDGE-OPD), a modification of OPSD with two distinct characteristics: a) it uses guided rollouts to inject privileged-context behavior to the student at sampling time, so that the rare target behavior is actually present in the on-policy data, and b) it applies an evidence mask: the student is updated only at token positions where the privileged context supports the sampled token, rather than on every token in the rollout. We empirically show that OPSD and its variant RLSD (with and without a verifier) completely fail to learn a target identity, while the integration of guided rollouts allows them to succeed. Additionally, mask-region ablations show that the persona signal is localized to the positive-evidence tail, which allows us to draw valuable insights about efficient knowledge transfer and preservation of general-purpose capabilities.
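
The evidence mask in (b) can be pictured with a small Python sketch. The abstract does not define when the privileged context "supports" a sampled token, so the rule used here is an assumption made purely for illustration: a position counts as supported when the log-probability of the sampled token with the privileged context exceeds the log-probability without it by a margin. The function names, the margin, and the numbers are all hypothetical.

```python
# Illustrative sketch of an evidence mask; the support rule is an assumption,
# not the definition used by EDGE-OPD.


def evidence_mask(logp_with_context, logp_without_context, margin=0.0):
    """Return 1 where the context raises the sampled token's log-prob by > margin."""
    return [
        1 if (with_ctx - without_ctx) > margin else 0
        for with_ctx, without_ctx in zip(logp_with_context, logp_without_context)
    ]


def masked_mean_loss(per_token_loss, mask):
    """Average the per-token loss over the masked-in positions only."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(per_token_loss, mask)) / kept


# Hypothetical log-probs of four sampled tokens, with and without the context.
logp_with = [-0.2, -1.5, -0.1, -2.0]
logp_without = [-0.3, -1.4, -3.0, -2.0]
per_token_loss = [0.5, 0.7, 1.5, 0.9]

mask = evidence_mask(logp_with, logp_without, margin=0.05)
loss = masked_mean_loss(per_token_loss, mask)
print(mask, loss)
```

Only the first and third positions contribute to the update in this example; the other two tokens are equally or less likely under the privileged context, so they are left out of the loss.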

2605.23482 2026-05-25 cs.CV cs.AI 版本更新

Multimodal Distribution Matching for Vision-Language Dataset Distillation

多模态分布匹配用于视觉-语言数据集蒸馏

Jongoh Jeong, Hoyong Kwon, Minseok Kim, Kuk-Jin Yoon

发表机构 * Visual Intelligence Lab., KAIST（韩国科学技术院视觉智能实验室）

AI总结该研究提出了一种名为Multimodal Distribution Matching (MDM)的多模态数据集蒸馏方法，旨在在有限的计算和内存资源下，高效生成保留视觉-语言语义信息的紧凑合成数据集。MDM通过结合数据、模型和损失层面的互补组件，实现了跨模态对齐与表示质量的保持，包括在联合嵌入空间中采样生成图像-文本对、基于预训练模型的权重空间插值构建混合教师模型，以及利用几何感知的损失函数匹配联合分布。实验表明，MDM在多个跨架构的图像-文本检索任务中表现出色，显著降低了蒸馏成本并保持了模型的鲁棒性。

Comments Accepted for publication at CVPR 2026. Project Page: https://andyj1.github.io/mdm

详情

AI中文摘要

数据集蒸馏将大型训练集压缩为紧凑的合成数据集，同时保持下游性能。随着现代系统越来越多地处理成对的视觉-语言输入，多模态蒸馏必须在严格的计算和内存预算下保持表示质量和跨模态对齐，然而先前的方法通常需要大量计算并忽略其相关性。为了解决这个问题，我们提出了多模态分布匹配（MDM），一种用于高效且可泛化的多模态蒸馏的几何感知框架。具体来说，MDM在数据、模型和损失层面集成了互补组件。在数据层面，它通过在联合嵌入空间中的聚类采样来初始化合成图像-文本对。在模型层面，它通过在权重空间中根据独立微调模型与预训练锚点的角度偏差进行插值，形成混合教师模型。在损失层面，它使用几何感知的匹配目标在单位超球面上匹配联合分布，该目标利用跨模态一致性和差异方向上的联合特征以及对称对比学习。在跨架构评估的图像-文本检索基准上，MDM生成的紧凑合成集保留了多模态语义，显著降低了蒸馏成本，并在不同架构下保持鲁棒性。

英文摘要

Dataset distillation compresses large training sets into compact synthetic datasets while preserving downstream performance. As modern systems increasingly operate on paired vision-language inputs, multimodal distillation must preserve representation quality and cross-modal alignment under tight compute and memory budgets, yet prior methods often require heavy computation and overlook their correlations. To address this, we present Multimodal Distribution Matching (MDM), a geometry-aware framework for efficient and generalizable multimodal distillation. Specifically, MDM integrates complementary components at the data, model, and loss levels. At the data level, it initializes synthetic image-text pairs by sampling from clusters in the joint embedding space. At the model level, it forms a mixed teacher by interpolating independently fine-tuned models in weight space according to their angular deviation from the pretrained anchor. At the loss level, it matches joint distributions on the unit hypersphere using a geometry-aware matching objective that exploits the joint features in the cross-modal agreement and discrepancy directions along with symmetric contrastive learning. Across image-text retrieval benchmarks with cross-architecture evaluation, MDM yields compact synthetic sets that preserve multimodal semantics, substantially reduce distillation cost, and remain robust across architectures.

2605.23478 2026-05-25 cs.CV cs.AI 版本更新

PhenoYieldNet: Learning Crop-Aware Phenological Responses for Multi-Crop Yield Prediction

PhenoYieldNet: 学习作物感知的物候响应以进行多作物产量预测

Yu Luo, Xiaogang Zhu, Shan Zeng, Wei Xiang, Thomas Francis Bishop, Zhiyong Wang, Kun Hu

发表机构 * School of Computer Science, The University of Sydney（悉尼大学计算机科学学院）； School of Computer Science and Information Technology, Adelaide University（阿德莱德大学计算机科学与信息技术学院）； College of Mathematics and Computer Science, Wuhan Polytechnic University（武汉轻工大学数学与计算机学院）； School of Computing, La Trobe University（拉特罗布大学计算学院）； School of Science, Edith Cowan University（埃迪斯科文大学科学学院）

AI总结准确预测作物产量对可持续农业和全球粮食安全至关重要。现有方法多针对单一作物，难以泛化到多种作物，且未充分考虑不同作物对天气变化的特定物候响应。本文提出PhenoYieldNet，一种面向多作物产量预测的框架，通过显式建模作物的物候响应来学习作物特异性物候特征，包含作物物候库和注意力模块，能够动态捕捉不同物候阶段的时空特征，并通过预训练模型和自监督策略提升泛化能力，实验表明其在多作物数据集上显著优于现有方法。

Comments Accepted by CVPR2026

详情

AI中文摘要

准确的作物产量预测对于可持续农业和全球粮食安全至关重要。现有方法主要针对单一作物预测开发，通常难以泛化到不同作物类型，且未能解决由复杂天气模式动态调节的独特作物物候响应。在本文中，我们提出PhenoYieldNet，一个多作物产量预测框架，通过显式建模作物对时间驱动因素的响应来学习作物特异性物候。具体来说，我们开发了一个作物感知的时间解码器，由作物物候库（CPB）和作物物候注意力（CPA）模块组成。CPB集成了一组可学习的嵌入，利用查询引导CPA模块学习特定作物最相关的物候模式。CPA模块显式捕获多尺度趋势和变化成分以构建时间上下文，使模型能够动态调整不同物候阶段的注意力。为了学习鲁棒且可泛化的多作物预测特征，编码器使用预训练基础模型初始化，并通过自监督时序对比适应策略进一步调整以对齐农业时间动态。在多作物数据集上进行的大量实验表明，我们提出的方法显著优于最先进的方法，在不同地区和作物上展现出强大的泛化能力。

英文摘要

Accurate crop yield prediction is crucial for sustainable agriculture and global food security. Existing methods are predominantly developed for single-crop prediction and often struggle to generalize across diverse crop types, because they do not address the unique crop phenological responses that are dynamically modulated by complex weather patterns. In this paper, we propose PhenoYieldNet, a multi-crop yield prediction framework that learns crop-specific phenology by explicitly modeling crop responses to temporal drivers. Specifically, we develop a crop-aware temporal decoder consisting of a Crop Phenology Bank (CPB) and a Crop Phenology Attention (CPA) module. The CPB integrates a set of learnable embeddings, which leverage a query to guide the CPA module to learn the most relevant phenology patterns for the specific crop. The CPA module explicitly captures multi-scale trend and variation components to construct temporal contexts, enabling the model to dynamically adjust the attention across different phenological stages. To learn robust and generalizable features for multi-crop prediction, the encoder is initialized with a pre-trained foundation model, and further adapted via a self-supervised Temporal Contrastive Adaptation strategy to align with agricultural temporal dynamics. Extensive experiments conducted on multi-crop datasets indicate that our proposed method significantly outperforms state-of-the-art methods, exhibiting strong generalization capabilities across different regions and crops.

2605.23471 2026-05-25 cs.LG cs.AI 版本更新

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

CBANet：一种用于激进驾驶事件检测的紧凑型注意力CNN-BiLSTM网络

Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin

发表机构 * Department of Computer Science, Princess Nourah bint Abdulrahman University（努拉·宾特·阿卜杜勒拉赫曼公主大学计算机科学系）； Department of Computer Science, Durham University（杜伦大学计算机科学系）； Department of Computer Science, Imam Mohammad Ibn Saud Islamic University（伊玛目穆罕默德·本·沙特伊斯兰大学计算机科学系）

AI总结本文提出了一种名为CBANet的紧凑型注意力机制结合CNN-BiLSTM的深度学习框架，用于检测激进驾驶事件。该方法通过构建工程化的动态特征来捕捉转向、加速和制动行为，并采用基于SMOTE的过采样与类别加权损失相结合的稳定训练策略，以应对自然驾驶数据中激进事件极度稀有的问题。实验表明，该方法在少数类召回率和安全关键F分数等指标上显著优于传统深度学习方法，同时保持了较高的计算效率。

Comments 8 pages, 4 figures, 4 tables. Submitted to IJCNN/WCCI 2026. CBANet: A compact attention-based CNN-BiLSTM framework for aggressive driving event detection using multivariate vehicle dynamics signals. Code available at https://github.com/halhamdan/CBANet

详情

AI中文摘要

激进驾驶是交通事故的主要原因，对道路安全构成严重威胁。尽管深度学习方法在从车辆传感器数据检测危险驾驶行为方面显示出有希望的结果，但它们在现实条件下的性能通常受到严重数据不平衡、驾驶员间巨大差异以及缺乏物理可解释的车辆动力学表示的限制。在本文中，我们提出了一种增强的深度学习框架，用于使用多变量车辆动力学信号进行激进驾驶检测。该方法不仅依赖原始测量，还构建了捕捉转向、加速和制动行为的工程动力学特征。为了解决自然驾驶数据中激进事件的极端稀少性，我们引入了一种稳定的训练策略，结合了基于SMOTE的受控过采样和类别加权损失公式，并评估了用于不平衡处理的焦点损失变体。此外，采用基于类别特定阈值校准的安全导向决策策略，以更好地反映现实应用中漏检和误报的不对称风险。该框架在新收集的自然驾驶数据集上进行了评估。大量实验表明，所提出的方法在保持实际计算效率的同时，在少数类召回率和安全关键F-score指标上始终优于标准深度学习基线。代码：https://github.com/halhamdan/CBANet

英文摘要

Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real-world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE-based oversampling with a class-weighted loss formulation, and evaluate focal loss variants for imbalance handling. Furthermore, a safety-oriented decision strategy based on class-specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real-world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority-class recall and safety-critical F-score metrics while maintaining practical computational efficiency. Code: https://github.com/halhamdan/CBANet
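
Two of the ingredients named above, a class-weighted loss and threshold calibration for a rare positive class, can be sketched in plain Python. This is a generic illustration under stated assumptions and not the CBANet code: the function names, the choice of F-beta with beta = 2 as the safety-oriented score, the positive-class weight, and the toy validation scores are all made up for the example.

```python
import math

# Generic sketch (not the CBANet implementation): weighted binary cross-entropy
# and a validation-set threshold search that favours recall via F-beta.


def weighted_bce(y_true, p_pred, pos_weight):
    """Mean binary cross-entropy with the positive class up-weighted."""
    eps = 1e-12
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def f_beta(y_true, y_hat, beta=2.0):
    """F-beta score; beta > 1 penalises missed detections more than false alarms."""
    tp = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 1)
    fp = sum(1 for y, h in zip(y_true, y_hat) if y == 0 and h == 1)
    fn = sum(1 for y, h in zip(y_true, y_hat) if y == 1 and h == 0)
    denom = (1 + beta**2) * tp + beta**2 * fn + fp
    return (1 + beta**2) * tp / denom if denom else 0.0


def calibrate_threshold(y_true, p_pred, beta=2.0):
    """Pick the decision threshold that maximises F-beta on validation data."""
    best_t, best_score = 0.5, -1.0
    for t in sorted(set(p_pred)):
        y_hat = [1 if p >= t else 0 for p in p_pred]
        score = f_beta(y_true, y_hat, beta)
        if score > best_score:
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation set: 2 aggressive events among 10 windows.
y_val = [0, 0, 0, 0, 0, 0, 0, 1, 0, 1]
p_val = [0.05, 0.10, 0.20, 0.15, 0.30, 0.08, 0.12, 0.35, 0.40, 0.70]

pos_weight = 4.0  # negatives / positives = 8 / 2
print(weighted_bce(y_val, p_val, pos_weight))
print(calibrate_threshold(y_val, p_val))
```

On this toy data a default threshold of 0.5 would catch only one of the two positive windows, whereas the calibrated threshold accepts one false alarm to recover both, which is the asymmetric trade-off the abstract describes.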

2605.23470 2026-05-25 cs.LG cs.AI cs.CE 版本更新

Learning Individual Dynamics from Sparse Cross-Sectional Snapshots

从稀疏横截面快照中学习个体动力学

Christian Lagemann, Kai Lagemann, Steven L. Brunton, Sach Mukherjee

发表机构 * Statistics and Machine Learning, German Center for Neurodegenerative Diseases (DZNE)（统计与机器学习，德国神经退行性疾病中心（DZNE））； MediaTek Research（联发科技研究）； Department of Mechanical Engineering & AI Institute in Dynamic Systems, University of Washington, Seattle（机械工程与人工智能动态系统研究所，华盛顿大学，西雅图）； DZNE & University of Bonn, Bonn, Germany and University of Cambridge, Cambridge, United Kingdom（DZNE与波恩大学，波恩，德国和剑桥大学，剑桥，英国）

AI总结该研究旨在从稀疏的横截面快照中学习个体的动态演化过程，传统方法在数据稀疏或完全横截面的情况下难以准确推断个体的连续时间轨迹。本文提出了一种名为CADENCE的概率框架，通过将潜在动态与静态个体上下文关联，实现了从孤立快照中恢复个体轨迹。该方法结合了基于分数的空域编码器和软专家混合路由机制，提供了单时间点轨迹推断的可识别性保证，并在多个基准测试中表现出优于现有序列模型的性能。

详情

AI中文摘要

预测一个动力学单元如何随时间演化——例如个体如何衰老、流行病如何传播、物理系统如何退化——通常需要密集的纵向追踪。当只有极其稀疏或完全横截面的数据可用时，推断个体化的连续时间轨迹本质上是病态的。现有方法迫使严格妥协：序列模型（如潜在ODE）需要密集的纵向数据，而横截面方法（如最优传输、基于流匹配的）映射聚合群体，丢失了个体动力学。在本文中，我们证明这种二分法可以被打破。我们介绍CADENCE，一个原则性的概率框架，通过将潜在动力学锚定到静态的个体级上下文，从孤立快照中恢复连续的个体轨迹。我们为单时间点轨迹推断提供了新颖的可识别性保证。通过结合基于分数的空间编码器（双射概率流ODE）以消除微分同胚歧义，以及软混合专家（SMoE）路由器，我们证明个体动力学参数和路由函数是联合可识别的。在一系列涵盖物理系统到真实世界生物数据的基准测试中，CADENCE严格在具有上下文结构的极端稀疏快照上训练，其性能匹配或超过了在密集全轨迹数据上训练的最先进序列模型。

英文摘要

Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically requires dense longitudinal tracking. When only extremely sparse or entirely cross-sectional data is available, inferring individualized, continuous-time trajectories is fundamentally ill-posed. Existing methods force a strict compromise: sequence models (e.g. latent ODEs) require dense longitudinal data, while cross-sectional methods (e.g. optimal transport, flow matching-based) map aggregate populations, losing individual dynamics. In this paper, we demonstrate that this dichotomy can be broken. We introduce CADENCE, a principled probabilistic framework that recovers continuous individual trajectories from isolated snapshots by anchoring latent dynamics to static, individual-level contexts. We provide novel identifiability guarantees for single-timepoint trajectory inference. By combining a score-based spatial encoder (bijective Probability Flow ODE) to eliminate diffeomorphic ambiguities with a Soft Mixture-of-Experts (SMoE) router, we show that individual dynamical parameters and routing function are jointly identifiable. Across a suite of benchmarks spanning physical systems to real-world biological data, CADENCE, trained strictly on extremely sparse snapshots with context structure, matches or exceeds the performance of state-of-the-art sequential models trained on dense, full-trajectory data.

2605.23459 2026-05-25 cs.SE cs.AI 版本更新

AI Assurance: A Comprehensive Testing Strategy for Enterprise AI Systems

AI 保证：企业 AI 系统的综合测试策略

Chitra Badagi, Divye Singh, Animesh Sen, Adinath Shirsath

发表机构 * Thoughtworks Technologies（Thoughtworks技术公司）

AI总结本文针对基于大语言模型、检索管道和自主代理的企业级AI系统，提出了一种全面的测试保障策略，以应对传统软件质量保证方法难以处理的新型风险。研究强调应将AI测试重点转向持续风险降低，而非严格的正确性验证，并将评估作为与开发同等重要的工程学科。文章引入了结构化的AI失效分类体系，提出了改进的五层AI保障金字塔，并提供了评估驱动开发、RAG系统测试、模型生命周期管理等方面的实践指导，旨在为企业工程领导者和实践者提供既有理论依据又可操作的保障策略。

详情

AI中文摘要

企业 AI 系统构建于大语言模型、检索管道和自主代理之上，引入了一类传统软件质量保证从未设计应对的风险。这些系统是概率性的、上下文敏感的和涌现性的：它们无法在经典意义上被验证为正确，只能通过不断增加信心来评估。本文提出了一种围绕三个关键原则的企业 AI 系统综合保证策略：第一，AI 测试应侧重于持续风险降低而非严格正确性验证；第二，评估必须与开发一起被视为核心工程学科；第三，AI 保证中的失败可能导致与传统确定性软件系统根本不同的组织影响。我们引入了结构化的 AI 故障分类法，提出了修订后的五层 AI 保证金字塔，并提供了关于评估驱动开发、RAG 系统测试、模型生命周期管理和治理的操作指南。目标是让工程领导者和从业者掌握一种既有哲学基础又可操作部署的策略。

英文摘要

Enterprise AI systems, built on large language models, retrieval pipelines and autonomous agents, introduce a class of risks that traditional software quality assurance was never designed to address. These systems are probabilistic, context-sensitive and emergent: they cannot be verified to be correct in the classical sense, but only evaluated with increasing confidence. This paper presents a comprehensive assurance strategy for enterprise AI systems built around three key principles: first, that AI testing should focus on continuous risk reduction rather than strict correctness verification; second, that evaluation must be treated as a core engineering discipline alongside development; and third, that failures in AI assurance can lead to organizational impacts that are fundamentally different from those seen in traditional deterministic software systems. We introduce a structured AI Failure Taxonomy, propose a revised five-layer AI Assurance Pyramid and provide operational guidance on evaluation-driven development, RAG system testing, model lifecycle management and governance. The goal is to equip engineering leaders and practitioners with a strategy that is both philosophically grounded and operationally deployable.

2605.23458 2026-05-25 cs.CV cs.AI 版本更新

One-Forcing: Towards Stable One-Step Autoregressive Video Generation

One-Forcing: 迈向稳定的一步自回归视频生成

Jiaqi Feng, Justin Cui, Yuanhao Ban, Cho-Jui Hsieh

发表机构 * Tsinghua University（清华大学）； UCLA（加州大学洛杉矶分校）

AI总结该论文提出了一种名为 One-Forcing 的方法，旨在解决单步自回归视频生成中的稳定性和质量问题。该方法通过在分布匹配蒸馏（DMD）目标中引入辅助的生成对抗网络（GAN）损失，实现了高质量且高效的单步视频生成。实验表明，One-Forcing 在 VBench 上取得了一步因果视频生成方法中的最优性能，并且仅需分块模型三分之一的训练成本即可实现稳定的逐帧一步自回归生成，而以往方法未能成功实现这一设置。

Comments Work in Progress. Project Page: https://aurora-edu.github.io/one-forcing/, Code: https://github.com/Aurora-edu/One-Forcing

详情

AI中文摘要

最近的进展显著改善了自回归机制下的实时交互式视频生成。然而，大多数现有的少步自回归视频生成方法（通常从相应的多步教师模型蒸馏而来）默认采用4步采样配置，这在部署期间仍会产生相当大的延迟，并且当进一步减少采样步数（特别是在一步设置中）时，会遭受严重的质量下降。轨迹式一致性蒸馏方法通常生成动态较弱的视频，而基于DMD的方法（如Self-Forcing）往往产生模糊的帧。为了解决这一挑战，我们提出了One-Forcing，一种简单而有效的方法，它通过向DMD目标添加辅助GAN损失，实现高质量高效的一步视频生成。在VBench上的实验表明，One-Forcing的总得分为83.76，在一步因果视频生成方法中达到了最先进的性能，并且与强大的多步方法保持竞争力。我们进一步证明，仅需分块模型三分之一的训练成本，即可稳定实现逐帧的一步自回归生成，而先前的方法未能成功实现这一设置。

使用3D卷积神经网络的在线手势识别

Yinghao Qin, Tijana Timotijevic

发表机构 * School of Electronic Engineering and Computer Science（电子工程与计算机科学学院）； Queen Mary, University of London（伦敦玛丽女王大学）

AI总结本文提出了一种基于3D卷积神经网络的在线手部手势识别系统，旨在实现实时视频流中手势的定位与分类。为提高系统鲁棒性，采用滑动窗口方法对多窗口结果进行优化。该系统在Jester数据集上训练，检测和分类准确率分别达到98%以上和90%以上，在自制数据集上达到37.5%的Levenshtein准确率，且响应时间在三秒以内。

Comments Master's dissertation work written in Autumn 2020

2605.23402 2026-05-25 cs.LG cs.AI 版本更新

Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting

非平稳概率时间序列预测的参数先验映射框架

Jinglin Li, Jun Tan, QI Fang, Ning Gui

发表机构 * School of Computer Science and Engineering（计算机科学与工程学院）

AI总结本文提出了一种参数先验映射框架（PPM），用于非平稳概率时间序列预测。该方法通过引入参数化的结构先验，结合生成模型的优势，实现了在保持计算效率的同时捕捉复杂时间依赖关系。实验表明，PPM在非平稳数据预测任务中优于现有方法，在准确性和计算效率之间取得了更好的平衡。

Comments 20 pages, 8 figures, accepted by ICML 2026

详情

AI中文摘要

在概率多变量时间序列（MTS）预测中有效建模非平稳动态需要在表达性和鲁棒性之间取得平衡。现有参数方法受益于强归纳偏置但缺乏灵活性，而深度生成模型在没有大量数据和计算的情况下难以捕捉复杂的时间依赖性。我们引入了参数先验映射（PPM），这是一个将参数化结构先验注入生成建模过程的框架。具体来说，PPM利用参数化估计器推导出一个动态的自适应先验，通过可学习的映射指导复杂预测分布的学习。这种设计使模型能够保留参数方法的效率，同时利用生成模型的表达能力。通过混合目标训练，PPM产生精确的预测，并具有良好校准的不确定性估计。实验结果表明，PPM在处理非平稳数据方面优于现有基线，在精度和计算效率之间提供了更好的权衡。代码可在https://github.com/ljl8336/PPM获取。

英文摘要

Effectively modeling non-stationary dynamics in probabilistic multivariate time series (MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well-calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non-stationary data, offering a superior trade-off between accuracy and computational efficiency. The code is available at https://github.com/ljl8336/PPM.

2605.23393 2026-05-25 cs.LG cs.AI 版本更新

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

每个组件都是一个查找：来自单一分解的令牌归因与组合

Po-Kai Chen, Niki van Stein, Aske Plaat

发表机构 * Leiden University（莱顿大学）

AI总结该论文研究了如何从单一前向传播中解析Transformer模型中各组件对预测结果的贡献及其组合方式。作者提出了一种名为Unpack的反向递归方法，通过分解注意力和MLP子层中的信用，揭示了不同组件之间的交互强度以及每个token的归因信息，无需干预、梯度或辅助训练。实验表明，该方法在GPT-2和Pythia系列模型上有效恢复了组件间的组合结构，并展示了对token级归因的准确捕捉，验证了其在机制可解释性方面的有效性。

详情

AI中文摘要

Transformer的机制可解释性不仅需要识别哪些组件重要，还需要理解它们如何组合成产生预测的计算路径。注意力和MLP都遵循共享的键值模板 $ϕ(S)U$。我们利用这一结构开发了Unpack，一种后向递归方法，它通过两个子层分解贡献，由单次前向传递得到任意两个组件之间的交互强度、带有K/Q/V组合标签的具名端到端路径，以及每个令牌的归因，无需干预、梯度或辅助训练。我们在间接宾语识别任务上进行了评估。在GPT-2 small上，该方法恢复了Wang等人（2023）描述的所有三种组合连接，包括每个连接的特定模式路由（K、Q或V）。为了测试超越简单复制的令牌级归因，我们比较了同一分解中同一名称的两次出现：第一次提及保持强归因，而重复检测位置被抑制，这一模式在匹配的控制提示中不存在。在Pythia系列从160M到6.9B参数中，这一抑制模式在每个尺度上一致地恢复，表明该方法无需真实电路标签即可追踪机制结构。代码可在https://github.com/Fun-Cry/unpacklm获取。

英文摘要

Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $ϕ(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at https://github.com/Fun-Cry/unpacklm.
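
The shared key-value template can be made concrete with a small Python sketch. It shows only the generic observation that both an MLP sublayer and a single attention head compute a coefficient vector times a value matrix, so the output splits exactly into one contribution per hidden neuron or per attended token. It is not the Unpack recursion. The helper names, the ReLU nonlinearity, the random weights, and the simplification of using the input vector directly as the attention query are assumptions of this sketch.

```python
import math
import random

# Generic key-value view of two sublayers; illustrative only, not Unpack itself.


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_lookup(coeffs, value_rows):
    """Return sum_j coeffs[j] * value_rows[j], i.e. the product phi(S) U."""
    dim = len(value_rows[0])
    return [sum(c * row[k] for c, row in zip(coeffs, value_rows)) for k in range(dim)]


def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def column_sums(parts):
    return [sum(col) for col in zip(*parts)]


rng = random.Random(0)


def rand_matrix(n_rows, n_cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(n_cols)] for _ in range(n_rows)]


d_model, d_hidden, seq_len = 4, 6, 3
x = [rng.gauss(0.0, 1.0) for _ in range(d_model)]

# MLP: each hidden neuron has a key (input weights) and a value (output row).
mlp_keys = rand_matrix(d_hidden, d_model)
mlp_values = rand_matrix(d_hidden, d_model)
mlp_coeffs = [max(dot(key, x), 0.0) for key in mlp_keys]  # ReLU activations
mlp_out = weighted_lookup(mlp_coeffs, mlp_values)
mlp_parts = [[c * v for v in row] for c, row in zip(mlp_coeffs, mlp_values)]

# Attention head: each context token has a key and a value; x acts as the query.
attn_keys = rand_matrix(seq_len, d_model)
attn_values = rand_matrix(seq_len, d_model)
attn_coeffs = softmax([dot(key, x) / math.sqrt(d_model) for key in attn_keys])
attn_out = weighted_lookup(attn_coeffs, attn_values)
attn_parts = [[c * v for v in row] for c, row in zip(attn_coeffs, attn_values)]

print(column_sums(mlp_parts), mlp_out)
print(column_sums(attn_parts), attn_out)
```

Because each output is a plain weighted sum of value rows, the per-neuron and per-token parts add up to the sublayer output exactly, which is the property that makes attributing credit to individual components well defined in a single forward pass.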

2605.23384 2026-05-25 cs.CL cs.AI 版本更新

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

元认知作为奖励：通过知识和调节信号强化LLM推理

Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

发表机构 * Tongji University（同济大学）； Shanghai AI Laboratory（上海人工智能实验室）； Nanyang Technological University（南洋理工大学）； University of Science and Technology of China（中国科学技术大学）； EPFL（洛桑联邦理工学院）； Wuhan University（武汉大学）

AI总结该论文提出了一种基于元认知的强化学习框架 MaR，旨在提升大语言模型的推理能力。MaR 通过元认知知识和元认知调节两个维度提供奖励信号，前者用于识别任务相关的信息，后者用于规划和调整推理过程，从而超越仅依赖最终答案的奖励设计。实验表明，MaR 在多个基准测试中显著提升了模型性能，并在部分任务上超越了更强大的模型。

详情

AI中文摘要

最近的强化学习方法显著提高了LLM的推理能力。现有的奖励设计主要遵循两种范式：(1) 基于可验证奖励的强化学习（RLVR）从可执行检查或真实答案中获取结果信号，但对中间推理行为的指导有限。(2) 基于评分标准的奖励（RaR）通过使用自然语言评分标准来评估推理质量和任务合规性，超越了最终答案检查，但通常需要实例特定的评分标准和大量设计工作。为解决这些问题，我们引入了元认知奖励（MaR），一种受元认知启发的RL框架，通过两个通用过程维度指导LLM推理：i) 元认知知识，无需手工制作的实例特定评分标准即可识别任务相关信息；ii) 元认知调节，规划和调整推理过程，以提供超越最终答案结果的奖励指导。MaR将模型轨迹分解为显式的元认知组件，并通过任务知识覆盖度、调节保真度和最终答案正确性的轨迹级奖励进行优化。通过这种方式，MaR将奖励反馈扩展到推理轨迹，同时将奖励信号锚定在通用的元认知维度上。在22个基准上的实验表明，MaR持续提升模型性能，相比基础模型最高提升7.7%，相比原始DAPO最高提升11.0%。值得注意的是，Qwen3.5-9B + MaR缩小了与前沿模型的差距，在整体平均上超越GPT-OSS-120B，并在多个单独基准上超越更强模型。过程级分析进一步显示推理过程质量显著提升。MaR还能泛化到域外数据集，MaR训练的模型在平均性能上优于对应的基础模型。

英文摘要

Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.


2605.23372 2026-05-25 cs.LG cs.AI 版本更新

Curriculum reinforcement learning with measurable task representation learning

基于可度量任务表征学习的课程强化学习

Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）； Fuzhou University（福州大学）； Chinese Academy of Sciences（中国科学院）

AI总结本文研究了课程强化学习（CRL）中自动课程生成的问题，特别是在非欧几里得任务空间中的复杂导航任务。为了解决传统插值方法在非欧空间中失效的问题，作者提出了一种基于可度量任务表示学习的自动课程生成方法，通过变分自编码器结构对任务的奖励和状态转移进行编码，从而获得具有任务相似性度量能力的潜在任务表示。实验表明，该方法在多个复杂导航任务中优于基于插值和生成对抗网络的现有CRL方法。

Journal ref Neural Networks, 109019 (2026)

详情

DOI: 10.1016/j.neunet.2026.109019

AI中文摘要

在课程强化学习（CRL）中，智能体通过一系列任务（即课程）逐步积累知识，学习过程旨在利用积累的知识最终解决具有挑战性的目标任务。虽然早期的CRL工作侧重于对候选任务进行排序，但最近的研究探索了自动课程生成。在丰富的CRL文献中，基于插值的CRL范式是主体，它通过在任务空间中利用有意义的距离度量（即可以衡量任务相似性）对初始任务分布和目标任务分布进行插值，自动生成中间任务。然而，在具有挑战性的导航任务中，非欧几里得上下文（任务）空间使得这一假设失效。为了在复杂任务中实现自动课程生成，我们提出了一种基于可度量任务表征学习的新型自动课程生成方法。为了更好地衡量相似性，我们提出将任务空间变换到潜在空间。通过一个编码奖励和状态转移的变分自编码器结构，我们获得了具有任务相似性度量属性的潜在任务表征，其中两个相近的任务嵌入对应两个在奖励和状态转移方面相似的任务。基于学习到的任务表征，我们进一步开发了一种自动课程生成方案，该方案能够有效地生成与目标任务越来越相似的新任务。我们在各种具有挑战性的导航任务中评估了我们的方法，实验结果表明，所提出的方法超越了基于插值和生成对抗网络的最先进CRL方法。

英文摘要

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process is aimed at using the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Among the rich CRL literature, the interpolation-based CRL paradigm is the mainstream approach: it automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution in a task space equipped with a meaningful distance metric (i.e., one that can measure task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex tasks, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure the similarity, we propose to transform the task space to a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we obtain a latent task representation with a task similarity measurement property, so that two close task embeddings correspond to two tasks that are similar in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme, which can effectively generate new tasks that are increasingly similar to the target task. We evaluate our method in a variety of challenging navigation tasks, and the experimental results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.


2605.23365 2026-05-25 cs.LG cs.AI 版本更新

Score-Based One-step MeanFlow Policy Optimization

基于分数的单步MeanFlow策略优化

Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee

发表机构 * Korea University, Decision Making Lab（高丽大学，决策实验室）； Gauss Labs Inc.（Gauss实验室）

AI总结本文提出了一种基于分数估计的单步均流策略优化方法（SOM），旨在解决强化学习中扩散模型和流匹配方法在在线场景下计算开销大的问题。该方法通过Q函数和概率流ODE直接构建目标速度场，无需目标分布的样本，从而在保证策略性能的同时显著降低了训练和推理时间。实验表明，SOM在运动控制任务中实现了领先的在线强化学习效果。

详情

AI中文摘要

扩散和流匹配已成为强化学习中表达力强的策略类，但它们对多步去噪的依赖在推理时带来了大量计算开销，这在在线强化学习中尤其成问题。MeanFlow通过学习一个平均速度场，在单次网络评估中将噪声映射到数据，提供了一种有前景的替代方案。然而，MeanFlow通常需要来自目标分布的样本来构建其目标速度场，而这在在线强化学习中不可用。我们提出了基于分数的单步MeanFlow策略优化（SOM），一种演员-评论家算法，通过分数估计和概率流ODE直接从Q函数构建目标速度场，从而将概率质量集中在高价值模式上。在完全在线强化学习设置中，SOM在运动任务上以单生成步骤实现了最先进的性能，同时与先前基于扩散和流匹配的策略相比，大幅减少了训练和推理时间。

英文摘要

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.


2605.23348 2026-05-25 cs.DC cs.AI cs.NI 版本更新

XWind: A Cross-site Router for Large Language Model Inference Serving at Renewable Energy Farms

XWind: 面向可再生能源农场的跨站点大语言模型推理服务路由器

Tella Rajashekhar Reddy, Atharva Deshmukh, Liangcheng Yu, Chaojie Zhang, Mike Shepperd, Rohan Gandhi, Anjaly Parayil, Srinivasan Iyengar, Ajay Manchepalli, Debopam Bhattacherjee

发表机构 * Microsoft（微软）

AI总结随着人工智能算力需求的快速增长，电力网络面临巨大压力，而可再生能源如风能却未被有效利用。本文提出了一种名为AI Greenferencing的互补性AI基础设施部署模型，将模块化AI计算能力部署在风电场，以本地化需求匹配可再生能源供给。为应对风电波动带来的推理服务挑战，研究团队设计了XWind，一种轻量、响应式且与工作负载无关的AI推理路由系统，通过实时信号动态调度任务，显著降低了端到端延迟，验证了其在实际场景中的高效性与普适性。

详情

AI中文摘要

AI电力需求正以前所未有的速度增长，而电网往往状况不佳且难以跟上。电网扩建伴随着高昂的资本支出和远距离传输损耗，然而源头处有丰富的可再生能源，只是与需求不匹配。本文提出一种名为AI Greenferencing的互补AI基础设施部署模式，将模块化AI计算带到可再生能源源头，聚焦风能，允许AI足迹扩展，为可再生能源站点产生本地表后需求，并帮助缓解电力公用事业日益增长的压力。我们的可行性分析表明，在距Azure数据中心50毫秒网络往返时间的范围内，有超过890吉瓦的风电容量，并且站点规模的合理调整与风能的空间互补性使得整体集群利用率与传统部署相当。为了在可变风力供电下服务推理请求，我们构建了XWind，一个轻量级、反应式且与工作负载无关的AI推理路由器，仅使用实时信号：推理延迟、KV缓存利用率和队列深度，来动态配置站点并分发请求。在模拟三个风力供电站点的真实64-GPU A100测试平台上，使用Azure生产轨迹进行评估，XWind将P99端到端延迟比最强竞争者（也是我们的想法）降低高达52%，比基线（如功率上限和GPU空闲）降低高达98%，且在不同工作负载类型、负载水平和GPU代际上均有一致的增益。

英文摘要

AI power demand is growing at an unprecedented rate while power grids are often ailing and struggle to keep up. Grid expansion comes with high capital expenditure and long-distance transmission losses, yet there is abundant renewable energy at the source, just not matched to demand. This paper proposes a complementary AI infrastructure deployment model, AI Greenferencing, that brings modular AI compute to renewable energy sources, focusing on wind, allowing AI footprint expansion, generating local behind-the-meter demand for renewable sites, and helping ease the growing strain on power utilities. Our feasibility analysis shows that 890+ GW of wind capacity lies within 50 ms network round trip time of Azure data centers, and that site-wise right-sizing combined with spatial complementarity of wind energy keeps aggregate fleet utilization on par with traditional deployments. To serve inference requests under variable wind power, we build XWind, a lightweight, reactive, and workload-agnostic AI inference router that uses only real-time signals: inference latency, KV-cache utilization, and queue depth, to dynamically configure sites and distribute requests. Evaluated on a real 64-GPU A100 testbed emulating three wind-powered sites with Azure production traces, XWind reduces P99 end-to-end latency by up to 52% over the strongest contender (also our idea) and by up to 98% over baselines such as power-capping and GPU idling, with consistent gains across workload types, load levels, and GPU generations.
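To make the idea of a reactive router driven only by real-time signals concrete, here is a minimal sketch that sends each request to the active site with the lowest combined load. The abstract names the three signals (inference latency, KV-cache utilization, queue depth) but not how XWind combines them, so the additive `load_score`, the normalization constants, and every identifier below are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class SiteSignals:
    """Latest real-time signals for one site (field names are hypothetical)."""
    name: str
    p99_latency_ms: float
    kv_cache_util: float   # fraction of KV cache in use, in [0, 1]
    queue_depth: int
    active: bool = True    # False when the site currently lacks power


def load_score(site, latency_budget_ms=2000.0, max_queue=64):
    """Assumed additive score: each signal is normalized, lower is better."""
    return (site.p99_latency_ms / latency_budget_ms
            + site.kv_cache_util
            + site.queue_depth / max_queue)


def pick_site(sites):
    """Route to the least-loaded active site."""
    candidates = [s for s in sites if s.active]
    if not candidates:
        raise RuntimeError("no active site available")
    return min(candidates, key=load_score)
```

A purely reactive rule like this needs no wind forecast and no workload model, which matches the "reactive and workload-agnostic" framing, although the real system also reconfigures sites rather than only choosing among them.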


2605.23344 2026-05-25 cs.CV cs.AI 版本更新

CHASD: Language Increment-Calibrated Contrastive Decoding against Hallucination in LVLMs

CHASD：面向LVLMs中幻觉的语言增量校准对比解码

Xiaoyi Huang, Kejia Zhang, Zhiming Luo

发表机构 * Institute of Artificial Intelligence, Xiamen University（厦门大学人工智能学院）； Department of Artificial Intelligence, Xiamen University（厦门大学人工智能系）

AI总结本文研究了大型视觉-语言模型（LVLMs）在语言先验主导下容易产生物体幻觉的问题，提出了一种无需训练的对比解码方法CHASD。该方法通过注意力引导的局部视觉扰动构建负样本分支，并在生成过程中仅对低置信度的词元进行对比校准，从而在保证推理效率的同时有效抑制幻觉。实验表明，CHASD在多个基准数据集上显著提升了相关指标，优于现有的免训练基线方法。

详情

AI中文摘要

大型视觉-语言模型展现了强大的多模态推理能力，但当语言先验主导不足或错位的视觉证据时，它们仍然容易产生对象幻觉。无训练对比解码方法通过比较原始和扰动视觉输入的预测来缓解此问题，但现有方法要么应用可能改变有用视觉证据的全局扰动，要么在每个解码步骤调用额外的负分支。在本文中，我们观察到幻觉风险是瞬态且特定于token的：视觉注意力在生成的token间转移，而一些功能token以高置信度产生，不需要对比校准。基于这一观察，我们提出面向大型视觉-语言模型的对比幻觉感知逐步解码（CHASD），一种“按需校准”的推理时框架。CHASD使用不确定性驱动的置信门控，仅当下一token的最大概率低于阈值时激活对比分支，并通过注意力引导的局部扰动构建负分支，扰动当前显著的视觉token。这种设计减少了不必要的负分支前向传播，同时保留了高置信度步骤的原始分布。在POPE、AMBER、MME、MMHal-Bench和CHAIR上的实验表明，CHASD在强无训练基线上改进了幻觉相关指标，并具有有竞争力的推理效率。

英文摘要

Large Vision-Language Models have shown strong multimodal reasoning capabilities, yet they remain susceptible to object hallucinations when language priors dominate insufficient or misaligned visual evidence. Training-free contrastive decoding methods mitigate this issue by comparing predictions from original and perturbed visual inputs, but existing approaches either apply global perturbations that may alter useful visual evidence or invoke an additional negative branch at every decoding step. In this paper, we observe that hallucination risks are transient and token-specific: visual attention shifts across generated tokens, while some functional tokens are produced with high confidence and do not require contrastive calibration. Based on this observation, we propose Contrastive Hallucination-Aware Step-wise Decoding (CHASD) for Large Vision-Language Models, an inference-time framework for "calibration on demand". CHASD uses an uncertainty-driven confidence gate to activate the contrastive branch only when the maximum probability of the next-token is less than the threshold, and constructs the negative branch through attention-guided localized perturbations of the currently salient visual tokens. This design reduces unnecessary negative-branch forward passes while preserving the original distribution for high-confidence steps. Experiments on POPE, AMBER, MME, MMHal-Bench, and CHAIR show that CHASD improves hallucination-related metrics over strong training-free baselines with competitive inference efficiency.
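The confidence gate described above can be sketched in a few lines: the negative branch is evaluated only when the original next-token distribution is uncertain. This is an illustrative sketch, not the authors' implementation. The contrastive combination `(1 + alpha) * original - alpha * negative` is the form commonly used in contrastive decoding and is assumed here, as are the threshold `tau`, the weight `alpha`, and the callback `negative_logits_fn` that stands in for the perturbed-image forward pass.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def gated_contrastive_logits(orig_logits, negative_logits_fn, tau=0.9, alpha=1.0):
    """Return (logits, used_contrast).

    If the top next-token probability is at least tau, the original logits
    are returned untouched and the negative branch is never evaluated.
    """
    if max(softmax(orig_logits)) >= tau:
        return list(orig_logits), False
    negative = negative_logits_fn()  # the extra forward pass happens only here
    mixed = [(1 + alpha) * o - alpha * n for o, n in zip(orig_logits, negative)]
    return mixed, True
```

Passing the negative branch as a callable makes the cost saving explicit: on high-confidence steps the second forward pass is simply never requested.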


2605.23341 2026-05-25 cs.RO cs.AI 版本更新

Sparse Compositional Flow Matching by geometric assembly from motion primitives

基于运动基元的几何组装的稀疏组合流匹配

Yan Tang, Yuanbo Tang, Tingyu Cao, Shaolun Huang, Yang Li

发表机构 * Tsinghua Shenzhen Graduate School, Tsinghua University, Shenzhen, China（清华大学深圳研究生院，清华大学，深圳，中国）； School of AI, Chinese University of Hong Kong (Shenzhen)（香港中文大学（深圳）人工智能学院）

AI总结该论文研究了如何生成具身智能体（如机器人、水下机器人等）的可执行运动轨迹，提出了一种基于运动原语的稀疏组合流匹配方法。该方法通过在物理轨迹空间中直接组合可重复使用的运动原语，并引入几何约束和结构化稀疏流匹配框架，有效建模轨迹的组合结构与时空连续性。实验表明，该方法在多个数据集上取得了最先进的性能，显著提升了轨迹预测的准确性。

详情

AI中文摘要

具身轨迹，如机器人操纵器、水下航行器和移动机器人的可执行运动序列，是具身AI的基本输出。现代生成模型通常将其视为逐点生成的密集、整体信号，拟合复杂的高维后验，而未建模数据的潜在结构，这是结构化生成模型文献早已指出的样本效率低下问题。我们认为组合潜在结构是自然的选择：许多具身任务共享重复出现的运动片段，这些片段可以明确为有限的可重用运动基元库，并且组合单元自然与子任务边界对齐以支持任务分解。然而，现有的组合生成器在潜在空间中组合，并依赖事后解码将采样单元与实际轨迹段关联。相反，我们通过具有两个耦合设计的流匹配框架直接在物理轨迹空间中组合。运动基元字典学习为每个原子配备可学习的长度掩码和二进制起始指示器，使得原子本身即为基元，在其放置位置逐字重用。然后，具有几何约束的结构化稀疏流匹配通过持续时间感知分词和可微几何损失生成二进制放置矩阵，该损失在相邻基元相遇处强制执行空间连续性和时间邻接性。在Open X-Embodiment和3DMoTraj上，该框架达到了最先进的精度，并将FDE/ADE比从1.8降至1.07，相比最强基线，ADE提高了19.2%，FDE提高了21.0%。

英文摘要

Embodied trajectories, such as the executable motion sequences of robotic manipulators, underwater vehicles, and mobile robots, are a fundamental output of embodied AI. Modern generative models often treat them as a dense, monolithic signal generated point by point, fitting an intricate high-dimensional posterior while leaving the data's latent structure unmodeled, the same sample inefficiency long identified by the structured generative model literature. We argue that a compositional latent structure is a natural choice: many embodied tasks share recurring motion fragments that can be made explicit as a finite repertoire of reusable motion primitives, and compositional units naturally align with subtask boundaries to support task decomposition. Existing compositional generators, however, compose in a latent space and rely on post-hoc decoding to relate sampled units to actual trajectory segments. We instead compose directly in the physical trajectory space through a flow-matching framework with two coupled designs. Motion-Primitive Dictionary Learning equips each atom with a learnable length mask and binary starting indicators so the atom itself is the primitive, reused verbatim wherever it is placed. Structural Sparse Flow Matching with Geometric Constraints then generates a binary placement matrix using duration-aware tokenization and a differentiable geometric loss that enforces spatial continuity and temporal contiguity where adjacent primitives meet. On Open X-Embodiment and 3DMoTraj, the framework attains state-of-the-art accuracy and reduces the FDE/ADE ratio from 1.8 to 1.07, improving ADE by 19.2% and FDE by 21.0% over the strongest baseline.
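For readers unfamiliar with the metrics quoted above, ADE (average displacement error) is the mean Euclidean distance between predicted and ground-truth waypoints, and FDE (final displacement error) is that distance at the last waypoint. The sketch below uses these standard definitions; the function name `ade_fde` is just illustrative, and the paper's exact evaluation protocol (horizon, sampling, units) is not given in the abstract.

```python
import math


def ade_fde(predicted, truth):
    """Return (ADE, FDE) for two equal-length sequences of waypoints.

    Each waypoint is a tuple of coordinates, e.g. (x, y) or (x, y, z).
    """
    if not predicted or len(predicted) != len(truth):
        raise ValueError("trajectories must be non-empty and equally long")
    errors = [math.dist(p, t) for p, t in zip(predicted, truth)]
    return sum(errors) / len(errors), errors[-1]


# Error grows linearly along this toy trajectory, so FDE is twice ADE.
ade, fde = ade_fde([(0, 0), (1, 0), (2, 0)], [(0, 0), (1, 1), (2, 2)])
ratio = fde / ade
```

An FDE/ADE ratio close to 1 means the error at the end of the trajectory is about the same as the average error, i.e. errors are not piling up toward the final waypoint. That is one way to read the reported drop from 1.8 to 1.07.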


2605.23320 2026-05-25 cs.AI 版本更新

Human-in-the-Loop Multi-Agent Ventilator Decision Support with Contextual Bandit Preference Learning

人在回路的多智能体呼吸机决策支持与上下文赌博机偏好学习

Sijia Li, Xiaoyu Tan, Qixing Wang, Weiyi Zhao, Chen Zhan, Teqi Hao, Xuemin Wang, Lei Gu, Roland Eils, Xihe Qiu

发表机构 * Shanghai University of Engineering Science, Shanghai, China； Tencent Youtu Lab, Tencent, China； Department of Critical Care Medicine, Shanghai Tenth People's Hospital, Tongji University School of Medicine, Shanghai, China； Department of Emergency and Critical Disease, Songjiang Hospital Affiliated to Shanghai Jiao Tong University School of Medicine, Shanghai, China； Max Planck Institute for Heart and Lung Research, Bad Nauheim, Germany； Fudan University, Shanghai, China； BIH at Charité – Universitätsmedizin Berlin, Berlin, Germany

AI总结该研究提出了一种基于人类在环的多智能体框架的呼吸机决策支持系统（VDSS），用于辅助临床医生进行呼吸机参数调整。系统通过上下文老虎机算法实现在线偏好学习，根据临床医生的反馈动态调整决策策略，并利用结构化反馈机制提高交互效率与稳定性。实验表明，该方法在重症监护环境中能显著提升推荐接受率并减少交互轮次，为临床可部署的人机协作提供了有效支持。

Comments MICCAI 2026

详情

AI中文摘要

呼吸机决策支持需要顺序决策，跟踪不断变化的生理和疾病轨迹，同时尊重安全边界和临床医生的特定调节风格。基于规则的方法很少能泛化个性化，而端到端强化学习或单一大型语言模型系统仍难以控制和审计。我们提出了呼吸机决策支持系统（VDSS），这是一个人在回路的多智能体框架，通过合同驱动的结构化接口协调模块化决策组件，并生成可追溯的证据以供审查。VDSS使用上下文赌博机进行在线偏好适应，在每个调整周期根据最终接受的决策更新临床医生特定偏好，并利用这些偏好指导后续建议。结构化的拒绝反馈触发有针对性的重新规划，以减少无效迭代并提高交互稳定性。回顾性ICU轨迹重放与专家审查表明，推荐接受度更高，达到可接受计划所需的交互轮次更少，支持临床可部署的人机协作。

英文摘要

Ventilator decision support requires sequential decisions that track evolving physiology and disease trajectories while respecting safety boundaries and clinician-specific tuning styles. Rule-based approaches rarely generalize personalization, and end-to-end reinforcement learning or single large language model systems remain difficult to control and audit. We propose the Ventilator Decision Support System (VDSS), a human-in-the-loop multi-agent framework that coordinates modular decision components through contract-driven structured interfaces and produces traceable evidence for review. VDSS performs online preference adaptation with a contextual bandit, updating clinician-specific preferences from the final accepted decision at each adjustment cycle and using them to guide subsequent recommendations. Structured rejection feedback triggers targeted replanning to reduce unproductive iterations and improve interaction stability. Retrospective ICU trajectory replay with expert review indicates higher recommendation acceptability and fewer interaction rounds to reach an acceptable plan, supporting clinically deployable human-AI collaboration.
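The online preference adaptation step can be illustrated with a very small contextual bandit. The abstract does not say which bandit algorithm VDSS uses, so this sketch assumes a tabular epsilon-greedy learner keyed by a context string (for example a clinician ID combined with a coarse patient state). The class `PreferenceBandit`, the option names, and the 0/1 acceptance reward are placeholders for illustration, not clinical recommendations and not the authors' design.

```python
import random
from collections import defaultdict


class PreferenceBandit:
    """Tabular epsilon-greedy contextual bandit over a fixed set of options."""

    def __init__(self, actions, epsilon=0.1, seed=0):
        self.actions = list(actions)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = defaultdict(int)    # (context, action) -> number of updates
        self.values = defaultdict(float)  # (context, action) -> mean reward so far

    def recommend(self, context):
        """Explore with probability epsilon, otherwise pick the best-known option."""
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return max(self.actions, key=lambda a: self.values[(context, a)])

    def update(self, context, action, reward):
        """Incremental mean update, e.g. reward 1.0 if the clinician accepted."""
        key = (context, action)
        self.counts[key] += 1
        self.values[key] += (reward - self.values[key]) / self.counts[key]


bandit = PreferenceBandit(["option_a", "option_b", "option_c"], epsilon=0.0)
bandit.update("clinician_1|stable", "option_b", 1.0)  # accepted
bandit.update("clinician_1|stable", "option_a", 0.0)  # rejected
choice = bandit.recommend("clinician_1|stable")
```

Updating only from the final accepted decision of each cycle, as the abstract describes, corresponds to calling `update` once per cycle with the option the clinician ultimately approved.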


2605.23315 2026-05-25 cs.CL cs.AI 版本更新

着色噪声：用于忠实图像超分辨率的对抗性Sobolev对齐

Hongbo Wang, Huaibo Huang, Pin Wang, Jinhua Hao, Chao Zhou, Ran He

发表机构 * MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences, Beijing, China ； School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI总结图像超分辨率生成中，生成先验常导致还原不够忠实，本文认为这是由于各向同性目标与自然图像内在流形之间存在基本的谱不匹配。为解决这一问题，研究提出了一种基于Sobolev诱导黎曼几何的ASASR框架，通过显式地对噪声转移核进行谱色处理，使其更符合自然图像的谱衰减特性，并引入基于Riesz表示定理的参数化对抗网络，生成针对性的负样本以引导优化方向。实验表明，该方法在保持谱一致性和结构保真度方面优于现有生成方法，有效减少了伪影。

Comments Accepted to ICML 2026

详情

AI中文摘要

图像超分辨率（SR）中的生成先验常常损害忠实重建，我们将这一限制归因于各向同性目标与内在自然图像流形之间的基本光谱失配。虽然直接偏好优化提供了一条对齐路径，但其对光谱平坦高斯噪声的依赖无法区分真实高频细节与幻觉。为了弥合这一几何差距，我们提出了ASASR，一个理论基础的框架，通过显式着色噪声转移核以镜像自然光谱衰减，将生成流重铸为Sobolev诱导的黎曼几何。驱动这一几何对齐，我们集成一个基于Riesz表示定理的参数化对抗器，该对抗器合成目标负样本，等效于最坏情况下的Sobolev梯度，以沿着可能结构失效的切空间引导优化。大量评估表明，ASASR优于领先的生成基线，特别是在保持光谱一致性和结构保真度方面，提供了一种有效缓解伪影的鲁棒解决方案。

英文摘要

Generative priors in Image Super-Resolution (SR) often compromise faithful restoration; we attribute this limitation to a fundamental spectral misalignment between isotropic objectives and the intrinsic natural image manifold. While Direct Preference Optimization offers a path to alignment, its reliance on spectrally flat Gaussian noise fails to distinguish authentic high-frequency details from hallucinations. To bridge this geometric gap, we propose ASASR, a theoretically grounded framework that recasts the generative flow into a Sobolev-induced Riemannian geometry by explicitly coloring the noise transition kernel to mirror natural spectral decay. Driving this geometric alignment, we integrate a parametric adversary grounded in the Riesz Representation Theorem, which synthesizes targeted negative samples equivalent to worst-case Sobolev gradients to direct optimization along the tangent space of plausible structural failures. Extensive evaluations demonstrate that ASASR outperforms leading generative baselines, particularly in preserving spectral consistency and structural fidelity, offering a robust solution that effectively mitigates artifacts.


2605.23263 2026-05-25 cs.RO cs.AI cs.SY eess.SP eess.SY 版本更新

6G Communication Networks Enabling Embodied Agents: Architecture and Prototype

6G通信网络赋能具身智能体：架构与原型

Lipeng Dai, Luping Xiang, Kun Yang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University（南京大学新型软件技术国家重点实验室）； Institute of Intelligent Networks and Communications (NINE), Nanjing University (Suzhou Campus)（南京大学智能网络与通信研究所（苏州校区））

AI总结本文研究了6G通信网络如何支持具身智能体的通信需求，探讨了具身智能体与6G网络之间的协同关系，并提出了面向人机远程交互的分层通信架构。通过构建包含触觉设备、工业机械臂和5G O-RAN测试平台的原型系统，验证了该架构在毫秒级时延和稳定闭环控制方面的可行性，为未来6G与具身智能体的融合应用提供了重要参考。

详情

AI中文摘要

具身智能体将智能决策与物理执行相结合，对通信提出了比纯软件智能体更严格和多样化的要求。尽管6G承诺亚毫秒级延迟、超高可靠性、原生智能和集成感知，但如何利用这些能力支持具身智能体通信的系统性研究仍然有限。本文从概念和工程两个角度研究了面向具身智能体的6G通信系统。首先，我们回顾了具身智能体的概念和具身价值，并澄清了其与非具身智能体的区别。然后，我们分析了具身智能体与6G网络的共生关系，强调了关键6G使能技术如何支持人机交互的严苛需求。此外，我们展示了具身智能体通过覆盖扩展、环境感知和物理世界理解在增强通信网络中的主动作用。基于这些见解，我们提出了一种用于人机远程交互的分层通信架构，包括人类意图感知层、基于开放无线接入网（O-RAN）的传输层、智能中间层和具身层。为验证其可行性，我们实现了一个端到端原型，集成了触觉设备、工业机械臂、中间平台和5G O-RAN测试床。实验结果表明毫秒级延迟和稳定的闭环操作，证实了所提架构的实用性，并为未来6G-具身智能体研究和工业部署提供了参考。

英文摘要

Embodied agents, which couple intelligent decision-making with physical actuation in the real world, impose far more stringent and heterogeneous communication requirements than purely software-based agents. While 6G promises sub-millisecond latency, ultra-high reliability, native intelligence, and integrated sensing, systematic studies on how to exploit these capabilities for embodied agent communication remain limited. This article investigates 6G-enabled communication systems for embodied agents from both conceptual and engineering perspectives. First, we review the concept, embodiment value of embodied agents, and clarify their distinctions from disembodied agents. Then, we analyse the symbiotic relationship between embodied agents and 6G networks. We highlight how key 6G enablers can support the stringent requirements of human-robot interaction. Furthermore, we demonstrate the proactive role of embodied agents in bolstering communication networks through coverage extension, environmental sensing, and physical world understanding. Building on these insights, we propose a hierarchical communication architecture for human-robot remote interaction, comprising a human-intent perception layer, an open radio access network (O-RAN)-based transport layer, an intelligent intermediary layer, and an embodiment layer. To validate its feasibility, we implement an end-to-end prototype that integrates a haptic device, an industrial robotic arm, an intermediary platform, and a 5G O-RAN testbed. Experimental results demonstrate millisecond-level latency and stable closed-loop operation, confirming the practicality of the proposed architecture and providing a reference for future 6G-embodied agent research and industrial deployments.


2605.23262 2026-05-25 cs.AI 版本更新

Design and Report Benchmarks for Knowledge Work

知识工作的设计与报告基准

Yining Hua, Hongbin Na, Cyrus Ayubcha, Levi Lian

发表机构 * Harvard University（哈佛大学）； University of Technology Sydney（悉尼科技大学）； Stanford University（斯坦福大学）； Raycaster AI

AI总结本文针对知识工作领域的人工智能系统评估问题，提出了一种三步骤的基准设计方法，以明确任务评分与实际工作成果之间的对应关系。研究指出当前知识工作评估仍沿用传统NLP任务逻辑，难以真实反映系统在实际部署中的能力。为此，作者从工作活动、测试环境和评分标准三个维度构建基准设计框架，并基于O*NET职业任务数据库提炼出18类工作活动，结合三个实际案例展示了该方法在不同知识工作场景中的应用与效果。

详情

Foundation Protocol: 智能体社会的协调层

Bang Liu, Yongfeng Gu, Jiayi Zhang, Zhaoyang Yu, Sirui Hong, Maojia Song, Xiaoqiang Wang, Mingyi Deng, Zijie Zhuang, Ronghao Wang, Mingzhe Cao, Yutong Zhu, Xingjian Li, Yifan Wu, Jianhao Ruan, Yiran Peng, Shuangrui Chen, Jinlin Wang, Yizhang Lin, Dongjie Zhang, Dekun Wu, Chen Ma, Lizi Liao, Han Yu, Jian Pei, Heng Ji, Qiang Yang, Yuyu Luo, Chenglin Wu

发表机构 * Singapore University of Technology and Design（新加坡科技设计大学）； City University of Hong Kong（香港城市大学）； Singapore Management University（新加坡管理大学）； Nanyang Technological University（南洋理工大学）； Duke University（杜克大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Hong Kong Polytechnic University（香港理工大学）

AI总结随着自主代理系统逐渐成为社会基础设施的一部分，协调能力成为系统扩展的关键瓶颈。本文提出了一种名为Foundation Protocol（FP）的协调层，旨在为人类与人工智能共存的社会提供基础架构支持。FP通过图结构统一不同类型的实体，支持多方协作与事件驱动的合作，并引入经济原语和治理机制，以确保系统的可组合性与责任可追溯性。该协议旨在兼容现有标准，降低集成与治理成本，推动自主代理系统在开放、多元和可治理的环境中发展。

详情

AI中文摘要

自主智能体正从工具转变为社会基础设施层：它们浏览、购买、部署软件、管理系统，并越来越多地相互交互。随着这些系统规模扩大，瓶颈从原始模型能力转向协调。智能体需要建立可靠的关系、组织多智能体工作、交换价值、支持人工智能经济，并在现实监督下保持安全和问责。本文介绍了Foundation Protocol (FP)，一种为新兴人机社会设计的以图为核心的协调层。FP统一了异构实体，包括智能体、工具、资源、人类、机构和组织，并支持原生的多方组织和基于事件的协作。它还提供了用于计量、收据和结算的经济原语，并将策略、来源和审计视为一等关注点。FP旨在包装和桥接现有协议而非替代它们，从而在减少集成和治理开销的同时实现渐进式采用。目标是保持自主智能体的可组合性，同时确保问责制不可妥协，从而使协调本身成为开放、多元和可治理的人机社会的共享基础设施。

英文摘要

Autonomous agents are moving from tools into a layer of social infrastructure: they browse, purchase, deploy software, manage systems, and increasingly interact with one another. As these systems scale, the bottleneck shifts away from raw model capability toward coordination. Agents need to form reliable relationships, organize multi-agent work, exchange value, support an AI economy, and stay safe and accountable under real-world oversight. This paper introduces the Foundation Protocol (FP), a graph-first coordination layer for an emerging human-AI society. FP unifies heterogeneous entities, including agents, tools, resources, humans, institutions, and organizations, and supports native multi-party organization and event-based collaboration. It also provides economic primitives for metering, receipts, and settlement, and treats policy, provenance, and audit as first-class concerns. FP is designed to wrap and bridge existing protocols rather than replace them, enabling incremental adoption while reducing integration and governance overhead. The aim is to keep autonomous agency composable while keeping accountability non-negotiable, so that coordination itself can become shared infrastructure for a human-AI society that is open, pluralistic, and governable.


2605.23215 2026-05-25 cs.LG cs.AI cs.CL 版本更新

自适应质量分段KV压缩用于长上下文推理

Junzhe Yang, Xiaoyu Shen

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Institute of Digital Twin, Eastern Institute of Technology（数字孪生研究院，东部技术研究所）

AI总结在长文本推理中，键值（KV）缓存的线性增长是关键瓶颈，现有压缩方法基于重要性评分剔除 tokens，但易导致连续推理块被严重清除，破坏逻辑连贯性。为此，本文提出自适应分块（AMS）KV压缩框架，通过关注注意力质量的空间分布，动态分配内存配额，保障关键推理段的稳定性，并兼容多种主流压缩方法和现代KV服务框架。实验表明，AMS有效缓解了结构碎片化问题，提升了模型性能。

详情

AI中文摘要

键值（KV）缓存的线性增长是长文本LLM推理中的关键瓶颈。现有的KV压缩方法通过基于重要性分数驱逐令牌来缓解这一问题。然而，我们表明它们依赖全局Top-k选择会触发区域擦除：连续推理块的严重驱逐破坏了逻辑连贯性。为解决此问题，我们提出自适应质量分段（AMS）KV压缩框架，该框架将范式从令牌级竞争转变为区域感知配额分配。AMS根据注意力质量的空间分布自适应地划分KV缓存，确保结构上重要的推理段获得有保障的内存配额。为在迭代解码过程中保持稳定性，引入了基于EMA的平滑机制以防止分段边界的抖动。关键的是，AMS是一个通用的即插即用层，与现有评分器正交。它可以无缝集成到代表性方法中，如TOVA、Expected Attention、KeyDiff、R-KV和TriAttention。AMS还与现代分页KV服务框架（如vLLM）系统兼容，支持高效的收集和压缩KV执行，而不引入额外的稳态注意力开销。在多种任务上的大量实验，包括数学推理（MATH500、AIME、GSM8K）、代码补全、开放域问答和稀疏检索，表明AMS持续减轻结构碎片化并提升模型性能。

英文摘要

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
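The contrast between global Top-k eviction and region-aware quota allocation can be shown with a toy example. The sketch below is a simplified illustration under stated assumptions, not the AMS algorithm itself: segments are fixed and supplied by the caller (the paper partitions adaptively by attention mass and smooths boundaries with an EMA), every segment gets a guaranteed `min_quota`, and the remaining budget is handed out greedily in proportion to segment mass. All function and parameter names are hypothetical.

```python
def global_topk(scores, budget):
    """Baseline: keep the `budget` highest-scoring token positions overall."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:budget])


def segmented_topk(scores, segments, budget, min_quota=1):
    """Keep tokens per segment so that no segment is evicted entirely.

    `segments` is a list of half-open (start, end) index ranges.
    """
    sizes = [end - start for start, end in segments]
    if budget > sum(sizes):
        raise ValueError("budget exceeds number of tokens in segments")
    masses = [sum(scores[start:end]) for start, end in segments]
    quotas = [min(min_quota, size) for size in sizes]
    if sum(quotas) > budget:
        raise ValueError("budget too small for the guaranteed quotas")
    # Give each remaining slot to the segment with the most mass per slot held.
    for _ in range(budget - sum(quotas)):
        open_segments = [i for i in range(len(segments)) if quotas[i] < sizes[i]]
        best = max(open_segments, key=lambda i: masses[i] / (quotas[i] + 1))
        quotas[best] += 1
    kept = []
    for (start, end), quota in zip(segments, quotas):
        ranked = sorted(range(start, end), key=lambda i: scores[i], reverse=True)
        kept.extend(ranked[:quota])
    return sorted(kept)


scores = [0.9, 0.8, 0.7, 0.6, 0.1, 0.1, 0.1, 0.1]
wiped = global_topk(scores, budget=4)
kept = segmented_topk(scores, [(0, 4), (4, 8)], budget=4)
```

In this toy case the global rule keeps only positions 0 to 3, so the second block disappears completely, while the quota rule retains one token from it. That is the "Region Wipe-out" failure and its remedy in miniature.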


2605.23194 2026-05-25 cs.LG cs.AI 版本更新

Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids

面向智能电网数据驱动最优潮流问题的可扩展异构图基础模型

Massimiliano Lupo Pasini, Yijiang Li, Kibaek Kim, Teja Kuruganti

发表机构 * Computational Sciences and Engineering Division, Oak Ridge National Laboratory（橡树岭国家实验室计算科学与工程部）； Mathematics and Computer Science Division, Argonne National Laboratory（阿贡国家实验室数学与计算机科学部）； UT-Battelle, LLC（UT-巴特勒公司）

AI总结本文提出了一种基于HydraGNN的可扩展异构图神经网络（GNN）框架，用于构建数据驱动的最优潮流（OPF）代理模型和图基础模型（GFM）。该方法保留了电力网络中不同节点和边类型的异构结构，支持在超计算机上进行分布式预处理、训练、超参数优化和下游微调。实验表明，该框架能够生成参数量较少但验证损失更低的紧凑模型，并在可行性分类和N-1故障回归任务中显著提升小样本条件下的模型性能与训练效率。

Comments 10 pages, 6 tables, 4 figures

详情

AI中文摘要

快速可靠的最优潮流（OPF）近似对于可靠的智能电网运行至关重要，然而许多基于学习的替代模型要么扁平化处理电网的天然异质结构，要么针对有限的电网拓扑，要么缺乏用于图基础模型（GFM）训练的可扩展基础设施。本文提出了一种基于HydraGNN的可扩展异构图神经网络（GNN）工作流，用于数据驱动OPF代理建模和OPF-GFM开发。该工作流保留了电网中不同的节点和边类型——母线、发电机、负荷、并联电抗器、交流线路、变压器以及设备到母线的耦合——并支持在领导级超级计算机上进行分布式预处理、训练、超参数优化（HPO）和下游微调。利用跨越十个PGLib-OPF案例（从14到13,659个母线）的三百万个异构图实例，我们在ORNL Frontier超级计算机上进行了DeepHyper驱动的HPO。该实验识别出具有最低验证损失的紧凑模型（约1.6–1.7M参数）。关于可行性分类和N-1应急回归的下游实验表明，微调预训练的OPF GFM在部分或仅头部微调时，能够提高低数据精度、稳定训练、加速收敛并降低适应成本。

英文摘要

Fast and reliable optimal power flow (OPF) approximation is essential for reliable smart-grid operation, yet many learning-based surrogates either flatten the native heterogeneous structure of power networks, target a limited set of grid topologies, or lack scalable infrastructure for graph foundation model (GFM) training. This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven OPF surrogate modeling and OPF-GFM development. The workflow preserves the distinct node and edge types of power grids -- buses, generators, loads, shunts, AC lines, transformers, and device-to-bus couplings -- and supports distributed preprocessing, training, hyperparameter optimization (HPO), and downstream fine-tuning on leadership-class supercomputers. Using three million heterogeneous graph instances spanning ten PGLib-OPF cases, from 14 to 13,659 buses, we conduct DeepHyper-driven HPO on the ORNL Frontier supercomputer. The campaign identifies compact models ($\sim$1.6--1.7M parameters) with the lowest validation losses. Downstream experiments on feasibility classification and N-1 contingency regression show that fine-tuning pretrained OPF GFM improves low-data accuracy, stabilizes training, accelerates convergence, and reduces adaptation cost when partial or head-only fine-tuning is used.


2605.23179 2026-05-25 cs.AI 版本更新

Redrawing the AI Map: A Theory of Accountability Boundaries in Agentic Ecosystems

重绘AI地图：代理生态系统中责任边界的理论

Muhammad Zia Hydari, Farooq Muzaffar

发表机构 * University of Pittsburgh（匹兹堡大学）

AI总结该论文探讨了智能体生态系统中责任边界配置的理论问题，提出了一种基于能力层次的责任边界定位理论。研究引入了“责任资产”概念，指出其对AI输出的合法性、可审计性和责任归属具有关键作用，并分析了验证成本和责任可转移性对责任边界与执行边界协同移动的影响。理论提出了三种边界策略，并引入“规则债务”概念，揭示了组织决策规则迁移至智能体执行环境所带来的治理负担，为理解数字模块化与组织解耦的关系提供了新视角。

详情

AI中文摘要

代理AI编排器降低了跨组织边界组合信息系统能力的接口和组装成本，看似加速了模块化和组织分解。然而，其输出需要证据、审查、签核或可分配责任的AI赋能能力，即使其技术接口变得模块化，也可能保留集成的责任边界。我们提出了代理生态系统中责任边界定位的能力层面理论。我们引入责任资产：使AI支持输出合法、可审计、可审查并可分配给责任方的互补资产。我们认为验证成本和责任可转移性决定了执行边界和责任边界能否一起移动。该理论识别出三种边界策略：组件、集成和双轨。它还引入了规则债务，即当组织决策规则从正式信息系统迁移到无治理的代理执行环境时产生的治理负担。整合数字创新、交易成本、互补资产、数字平台治理和IS控制视角，我们提出了七个命题，将代理组装成本降低、责任资产、可占有性、编排者意图捕获和边界错误配置与边界策略、价值占有和规则债务联系起来。该理论解释了数字模块化何时扩展到组织分解，以及责任何时保持能力集成。通过文档处理、法律服务、审计、临床决策支持和采购中的结构化示例来约束边界逻辑。

英文摘要

Agentic AI orchestrators reduce the interface and assembly costs of composing information systems capabilities across organizational boundaries, seemingly accelerating modularization and organizational disaggregation. Yet AI-enabled capabilities whose outputs require evidence, review, signoff, or assignable responsibility may retain integrated accountability boundaries even when their technical interfaces become modular. We develop a capability-level theory of accountability-boundary placement in agentic ecosystems. We introduce accountability assets: complementary assets that make AI-supported outputs legitimate, auditable, reviewable, and assignable to a responsible party. We argue that verification cost and responsibility transferability determine whether the execution and accountability boundaries can move together. The theory identifies three boundary strategies: component, integrated, and dual-track. It also introduces rule debt, the governance burden that accrues when organizational decision rules migrate from formal information systems into ungoverned agentic execution environments. Integrating digital innovation, transaction cost, complementary-assets, digital platform governance, and IS control perspectives, we develop seven propositions linking agentic assembly-cost reductions, accountability assets, appropriability, orchestrator intent capture, and boundary misconfiguration to boundary strategy, value appropriation, and rule debt. The theory explains when digital modularization extends to organizational disaggregation and when accountability keeps capabilities integrated. Structured illustrations across document processing, legal services, audit, clinical decision support, and procurement discipline the boundary logic.


2605.23171 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

理解与改进指令微调中的噪声嵌入技术

Abhay Yadav

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

AI总结该研究探讨了指令微调中嵌入层添加噪声的技术，分析了均匀噪声与高斯噪声的效果差异，并提出了一种新的对称噪声嵌入方法SymNoise。通过理论与实验分析，研究发现不同噪声类型性能相近，而SymNoise通过更严格地调控模型局部曲率，显著提升了微调效果。在多个基准测试中，SymNoise相比当前最优方法NEFTune取得了约6.7%的性能提升，展示了其在语言模型微调中的优越性。

Comments arXiv admin note: substantial text overlap with arXiv:2312.01523

Journal ref IEEE International Conference on Language Modeling (COLM), 2025

详情

AI中文摘要

最近指令微调的进展在嵌入中注入噪声，其中NEFTune（Jain等人，2024）使用均匀噪声设立了基准。尽管NEFTune的实验发现均匀噪声优于高斯噪声，其原因仍不清楚。本文旨在通过提供彻底的理论和实证分析来澄清这一点，表明这些噪声类型之间的性能相当。此外，我们引入了一种新的语言模型微调方法，在嵌入中使用对称噪声。该方法旨在通过更严格地调节模型的局部曲率来增强模型功能，表现出优于当前方法NEFTune的性能。当使用Alpaca微调LLaMA-2-7B模型时，标准技术在AlpacaEval上获得29.79%的分数。然而，我们的方法SymNoise使用对称噪声嵌入将这一分数显著提高到69.04%，比最先进方法NEFTune（64.69%）提高了6.7%。此外，当在各种模型和更强的基线指令数据集（如Evol-Instruct、ShareGPT、OpenPlatypus）上测试时，SymNoise始终优于NEFTune。当前文献，包括NEFTune，强调了在语言模型微调中应用基于噪声的策略需要更深入的研究。我们的方法SymNoise是朝着这一方向迈出的又一重要步骤，显示出对现有最先进方法的显著改进。

英文摘要

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
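
To make the noise types concrete, here is a minimal sketch of embedding noise injection. The uniform variant follows the scaling published for NEFTune (uniform samples in [-1, 1] multiplied by alpha / sqrt(L * d)). The symmetric variant is an assumption made for illustration only: it draws each entry as +1 or -1 with equal probability and applies the same scale, and the abstract above does not confirm that this is SymNoise's exact construction. The function names and the default `alpha` are hypothetical.

```python
import math
import random


def noise_scale(seq_len, dim, alpha):
    """NEFTune-style scale: alpha / sqrt(sequence_length * embedding_dim)."""
    return alpha / math.sqrt(seq_len * dim)


def uniform_noise(seq_len, dim, alpha=5.0, rng=None):
    """Uniform noise in [-1, 1], scaled as in NEFTune."""
    rng = rng or random.Random(0)
    scale = noise_scale(seq_len, dim, alpha)
    return [[scale * rng.uniform(-1.0, 1.0) for _ in range(dim)]
            for _ in range(seq_len)]


def symmetric_noise(seq_len, dim, alpha=5.0, rng=None):
    """ASSUMED symmetric variant: every entry is +scale or -scale, each with probability 1/2."""
    rng = rng or random.Random(0)
    scale = noise_scale(seq_len, dim, alpha)
    return [[scale * rng.choice((-1.0, 1.0)) for _ in range(dim)]
            for _ in range(seq_len)]


def add_noise(embeddings, noise):
    """Element-wise sum of a (seq_len x dim) embedding matrix and a noise matrix."""
    return [[e + n for e, n in zip(row_e, row_n)]
            for row_e, row_n in zip(embeddings, noise)]
```

In both variants the noise would be applied only during fine-tuning, and the per-entry magnitude shrinks as the sequence gets longer or the embedding gets wider.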

2605.23170 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

长上下文LLM中的位置失败：推理基准测试中的盲点

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

发表机构 * Beijing Jiaotong University（北京交通大学）； Central South University of Forestry and Technology（中南林业科技大学）

AI总结该研究指出当前主流的长上下文大语言模型推理基准在任务位置控制方面存在不足，导致无法准确评估模型在不同位置上的表现。为此，作者提出了Context Rot Evaluation（CRE）框架，系统地控制任务位置、填充内容和上下文长度三个因素，并通过实验发现，当目标任务从上下文末尾移至中间位置时，模型性能会显著下降，且随着上下文长度增加，这一问题更加严重。研究还表明，通过在末尾添加任务副本，可以有效缓解位置带来的性能下降，揭示了当前基准设计中存在结构性的评估盲区。

Comments 20 pages, 1 figure, 23 tables

AI中文摘要

位置控制评估是检索任务（如Needle-in-a-Haystack和RULER）的标准做法，但主流推理基准测试并未控制目标任务在长上下文中的位置。我们审计了11个长上下文基准测试，发现没有一个在推理任务上同时控制任务位置、填充内容和上下文长度。对四个旗舰长上下文模型发布的审计发现，其主要结果表中没有NIAH、RULER或LongBench系列基准测试的条目，而智能体和编码基准测试在所有四个发布的主要结果表中均有出现。我们提出了上下文腐化评估（Context Rot Evaluation，CRE），一个控制所有三个因素的框架，并分两轮在GSM8K和ARC-Challenge上评估了九个LLM：初始的五个模型，以及四个较新的供应商发布模型。当目标任务从末尾移动到中间时，模型性能可能急剧下降，且对于易受影响的模型，这种下降随着上下文长度增加而加剧。MiMo-v2-Flash在64K下使用with_solutions填充时下降88个百分点（中间准确率8%）。较新的发布显示出较小的下降：在64K下，四个模型中有三个的中间位置准确率与末尾位置准确率相差在+/-6个百分点以内；MiMo-V2.5-Pro将MiMo-v2-Flash的88个百分点下降缩小到32个百分点。在questions_only_v2填充下，所有四个模型在中间位置的下降仍然存在（在8K、32K、64K下范围为-16到-56个百分点）。在8K下，一个诊断探针在末尾添加目标任务副本，使所有九个模型的中间准确率与末尾基线相差在+/-4个百分点内，这与位置解释一致。在初始五个模型集中，76%的中间位置错误与周围填充文本匹配，而末尾位置仅为22%，这与填充-答案干扰作为主要错误模式一致。这些结果暴露了当前推理基准测试设计和供应商评估实践中的结构性评估差距：当任务位置不受控制时，无法测量随上下文长度增长而恶化的位置脆弱性。

英文摘要

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
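
The position-control idea can be sketched in a few lines: the same target task is placed at a chosen relative position inside a fixed list of filler items, so that only the position changes between conditions. This is a hedged illustration, not the CRE implementation. The helper names and the filler strings are hypothetical, and the 96% end-position accuracy used in the usage lines is only implied by the abstract (8% middle accuracy plus an 88pp drop).

```python
def build_context(target_task, filler_items, position):
    """Insert target_task among filler_items.

    position is a fraction in [0, 1]: 0.0 puts the task first,
    0.5 in the middle, 1.0 at the end of the context.
    """
    if not 0.0 <= position <= 1.0:
        raise ValueError("position must be in [0, 1]")
    index = round(position * len(filler_items))
    items = filler_items[:index] + [target_task] + filler_items[index:]
    return "\n\n".join(items)


def drop_pp(end_accuracy_pct, middle_accuracy_pct):
    """Positional drop in percentage points (end minus middle)."""
    return end_accuracy_pct - middle_accuracy_pct


# Usage: identical filler and identical task, only the position differs.
filler = [f"Filler question {i}" for i in range(6)]
target = "TARGET: 12 + 30 = ?"
middle_prompt = build_context(target, filler, 0.5)
end_prompt = build_context(target, filler, 1.0)
mimo_drop = drop_pp(96, 8)  # 88 percentage points, matching the reported drop
```

Varying the number of filler items changes context length, and swapping the filler list changes filler content, which gives the three controlled factors named in the abstract.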

2605.23168 2026-05-25 cs.CR cs.AI cs.LG 版本更新

PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

PoisonForge: 面向指令微调LLM的任务级定向投毒基准

Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru, Alina Oprea

发表机构 * Department of Computer Science（计算机科学系）

AI总结本文提出PoisonForge，一个面向指令微调大语言模型的任务级定向投毒基准，用于评估在有限投毒预算下模型对恶意数据的脆弱性。该基准通过四个维度参数化投毒威胁，并测试了来自五个模型家族、参数量各异的12个开源模型，结果显示大多数模型在最脆弱配置下攻击成功率超过70%，但对非目标任务的影响极小。研究分析了影响攻击成功率的关键因素，并发现投毒设计选择而非模型规模是攻击成功的主要原因。

AI中文摘要

当从业者在未经验证的数据集上微调LLM时，攻击者可以通过任务级投毒利用数据供应链：插入少量精心设计的指令-响应对，导致模型在目标任务族中嵌入攻击者指定的实体（如国家），而在其他行为中表现正常。我们引入PoisonForge，一个沿四个维度（偏差类型、投毒模式、出现次数和目标输出长度）参数化此威胁的基准，并在五个模型族中评估了12个开源模型（参数从2B到32B），主要采用1%的投毒预算。在1000个微调样本中仅使用10个投毒样本的情况下，12个模型中有11个在其最易受攻击的配置下攻击成功率（ASR）超过70%。同时，非目标任务的无意泄露低于0.5%，模型在标准基准上表现良好。我们详细分析了影响攻击成功的因素。我们观察到，实体的多次出现提高了ASR，最佳投毒模式取决于目标实体的语义结构，并且ASR随任务输出长度单调下降。相关分析和风险预测模型证实，投毒设计选择而非模型规模是攻击成功的主要原因，并且这些模式可以推广到预测新任务上的攻击成功。我们发布所有配置、流水线和分析代码以支持可重复比较。

英文摘要

When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere. We introduce PoisonForge, a benchmark that parameterizes this threat along four dimensions (bias type, poisoning mode, appearance count, and target output length) and evaluates 12 open-weight models (from 2B to 32B parameters) across five families under a primarily 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceed a 70% attack success rate (ASR) in their most vulnerable configuration. Meanwhile, unintended leakage to non-target tasks remains below 0.5%, and models perform well on standard benchmarks. We analyze in detail the factors contributing to attack success. We observe that multiple appearances of an entity increase the ASR, the optimal poisoning mode depends on the semantic structure of the target entity, and ASR drops monotonically with the task output length. A correlation analysis and risk prediction model confirm that poisoning design choices, rather than model scale, are the primary causes of attack success, and that these patterns generalize to predict attack success on new tasks. We release all configurations, pipelines, and analysis code to support reproducible comparisons.
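
The 1% budget arithmetic (10 poisoned examples among 1,000) and the ASR metric can be sketched as follows. This is an illustrative sketch under stated assumptions, not the PoisonForge pipeline: the function names are hypothetical, and counting an attack as successful when the target entity appears as a case-insensitive substring of the output is an assumed scoring rule.

```python
import random


def mix_poison(clean_examples, poisoned_examples, total, budget, seed=0):
    """Build a fine-tuning set of `total` examples with a `budget` fraction poisoned."""
    n_poison = round(total * budget)
    n_clean = total - n_poison
    if n_poison > len(poisoned_examples) or n_clean > len(clean_examples):
        raise ValueError("not enough examples for the requested mix")
    rng = random.Random(seed)
    mixed = rng.sample(poisoned_examples, n_poison) + rng.sample(clean_examples, n_clean)
    rng.shuffle(mixed)  # avoid placing all poisoned examples at the front
    return mixed


def attack_success_rate(outputs, target_entity):
    """ASSUMED scoring: fraction of outputs that mention the target entity."""
    hits = sum(target_entity.lower() in text.lower() for text in outputs)
    return hits / len(outputs)
```

For example, `mix_poison(clean, poisoned, total=1000, budget=0.01)` yields a set containing exactly 10 poisoned pairs, the configuration reported in the abstract.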

2605.23165 2026-05-25 cs.RO cs.AI cs.CL 版本更新

Autonomous Frontier-Based Exploration with VLM Guidance

基于VLM引导的自主前沿探索

Aarush Aitha, Avideh Zakhor

发表机构 * EECS Department, University of California（加州大学EECS系）

AI总结本文提出了一种基于视觉语言模型（VLM）引导的自主前沿探索方法，用于提升机器人在未知和危险环境中的探索能力。该方法通过VLM进行高层战略决策，指导传统的底层机器人控制系统，利用当前地图和潜在路径的视觉信息生成多模态提示，从而选择最具前景的探索方向。实验表明，该方法在六个室内环境的仿真中提升了地图覆盖率，且具有轻量、无需训练和易于迁移的特点。

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

2605.23159 2026-05-25 econ.GN cs.AI q-fin.EC 版本更新

Generative AI and the Reorganization of Labor Demand

生成式AI与劳动力需求的重组

Fangyan Wang, Zaiyan Wei, Yang Wang

发表机构 * Mitch Daniels School of Business, Purdue University（普渡大学米奇·丹尼尔斯商学院）

AI总结本文研究生成式人工智能（AI）对劳动力需求的重塑影响，探讨企业在技术扩散过程中如何调整招聘岗位和岗位任务结构。通过构建基于美国全行业招聘广告数据的动态暴露度指标，研究发现，生成式AI的暴露程度随时间变化，并非固定不变；企业主要通过岗位间的招聘调整（占52%）和岗位内部任务重构（占39.5%）来适应AI技术，且不同层级岗位的调整路径存在差异。研究揭示了劳动力市场对生成式AI的适应过程是组织结构和任务架构的重新配置。

AI中文摘要

生成式人工智能（AI）预计将改变工作方式，但关于随着技术扩散，企业如何重组劳动力需求的研究尚不充分。现有研究主要关注哪些职业暴露于AI或暴露的工作是否减少。我们通过考察企业是否通过改变招聘的岗位、岗位的工作内容或两者兼而有之来进行调整，扩展了这一讨论。利用覆盖美国经济所有部门的全国职位发布数据集，我们通过两阶段大语言模型管道构建了一个动态的、职位级别的生成式AI暴露度度量。该管道识别每个职位发布中描述的任务，并分类生成式AI能够执行或辅助这些任务的程度。然后，我们将总暴露度的变化分解为两个边际：跨职位需求重新分配和职位内任务重新设计。我们记录了三个主要发现。首先，生成式AI暴露度是动态而非固定的，随时间显著变化。其次，劳动力需求通过两个边际进行调整。招聘重新分配解释了总暴露度下降的最大份额，平均占52%，而职位内重新设计变得越来越重要，占39.5%。补充的Oaxaca-Blinder分解显示，职业构成的变化解释了可归因于可观察职位特征的暴露度变化的约90%。第三，调整在职业阶梯上有所不同。高级职位调整更早，主要通过重新分配，而初级职位则通过重新分配、重新设计及其相互作用的更广泛组合进行调整。这些发现表明，劳动力市场对生成式AI的调整是一个组织重构的过程，在此过程中，企业重塑了招聘需求和工作的任务架构。

英文摘要

Generative artificial intelligence (AI) is expected to transform work, but less is known about how firms reorganize labor demand as the technology diffuses. Existing research has largely focused on which occupations are exposed to AI or whether exposed jobs decline. We extend this debate by examining whether firms adjust by changing where they hire, what jobs contain, or both. Using a nationwide dataset of job postings in the United States, covering all sectors of the economy, we construct a dynamic, posting-level measure of generative AI exposure with a two-stage large language model pipeline. The pipeline identifies the tasks described in each posting and classifies the extent to which generative AI can perform or assist them. We then decompose changes in aggregate exposure into two margins: reallocation of demand across jobs and redesign of tasks within jobs. We document three main findings. First, generative AI exposure is dynamic rather than fixed, changing substantially over time. Second, labor demand adjusts through both margins. Hiring reallocation explains the largest share of the aggregate decline in exposure, accounting for 52% on average, while within-job redesign becomes increasingly important, accounting for 39.5%. A complementary Oaxaca-Blinder decomposition shows that shifts in occupational composition account for about 90% of the exposure change attributable to observable job characteristics. Third, adjustment differs across the job ladder. Senior jobs adjust earlier and mainly through reallocation, whereas junior jobs adjust through a broader mix of reallocation, redesign, and their interaction. These findings suggest that labor-market adjustment to generative AI is a process of organizational reconfiguration, in which firms reshape both hiring demand and the task architecture of work.
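
A worked example may help show how a change in aggregate exposure splits into a reallocation margin, a redesign margin, and their interaction. The abstract does not state the exact formula, so the standard shift-share identity used here is an assumption, and the job labels, hiring shares, and exposure values are invented for illustration.

```python
def decompose_exposure_change(shares0, exposure0, shares1, exposure1):
    """Shift-share decomposition of the change in aggregate exposure.

    Aggregate exposure is sum_j share_j * exposure_j, so
      reallocation = sum_j (s1_j - s0_j) * e0_j            (between jobs)
      redesign     = sum_j s0_j * (e1_j - e0_j)            (within jobs)
      interaction  = sum_j (s1_j - s0_j) * (e1_j - e0_j)
    and the three terms add up exactly to the total change.
    """
    jobs = list(shares0)
    reallocation = sum((shares1[j] - shares0[j]) * exposure0[j] for j in jobs)
    redesign = sum(shares0[j] * (exposure1[j] - exposure0[j]) for j in jobs)
    interaction = sum((shares1[j] - shares0[j]) * (exposure1[j] - exposure0[j])
                      for j in jobs)
    total = (sum(shares1[j] * exposure1[j] for j in jobs)
             - sum(shares0[j] * exposure0[j] for j in jobs))
    return {"reallocation": reallocation, "redesign": redesign,
            "interaction": interaction, "total": total}


# Hypothetical two-job economy: hiring shifts away from the exposed job "a",
# and job "a" also sheds some exposed tasks.
result = decompose_exposure_change(
    shares0={"a": 0.5, "b": 0.5}, exposure0={"a": 0.8, "b": 0.2},
    shares1={"a": 0.4, "b": 0.6}, exposure1={"a": 0.7, "b": 0.2},
)
```

In this toy case the total change of -0.10 splits into -0.06 from reallocation, -0.05 from redesign, and +0.01 from the interaction term.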

2605.23147 2026-05-25 cs.CL cs.AI 版本更新

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

作为X，做Y：角色和任务如何在指令微调LLM中结合

Eric Xu

发表机构 * Independent Researcher（独立研究者）

AI总结该研究探讨了在指令微调的大语言模型中，角色提示（如“As X, do Y”）如何将“人物”和“任务”信息结合，并发现这种结合在残差流中的某个特定位置可以通过线性分解清晰地体现。研究指出，人物和任务分别通过部分正交的加法方向影响模型输出，并展示了通过残差流局部加法结构可以实现对角色和任务贡献的可解释控制。然而，研究也表明，尽管存在局部加法结构，角色提示无法被压缩为单一的残差向量，因为其行为依赖于整个提示中的分布式机制。

Comments 12 pages, 1 figure. Code: https://github.com/xuy/localized-additive-composition

AI中文摘要

形式为“作为X，做Y”的角色提示在残差流的一个特定位置——提示到答案的过渡（最后一个提示标记与前两个生成标记）——在早期/中层波段表现出清晰的线性分解。在那里，角色和任务通过部分正交的加性方向贡献。形成纯角色效应Δ_X、纯任务效应Δ_Y，并将h_BB + Δ_X + Δ_Y替换干净残差，在Gemma-2-2B-IT和Qwen-2.5-{1.5B, 3B}-Instruct上，跨越12个单元格的短网格和48个单元格的长角色网格，下游输出与干净输出的KL散度很小，并保留了角色特定的行为标记。从这种加性结构自然推断，角色提示可以压缩为单个缓存的残差向量。我们证明它不能。将缓存的加性预测——甚至oracle干净残差h_XY——注入到移除了角色文本的基线宿主提示中，无论是在一个位置还是在多个层，都无法接近干净的长角色目标。角色条件化的多标记生成通过注意力流回整个提示中的角色文本位置，这是任何单个位置的残差无法复现的。残差流中的局部加性性并不意味着提示可压缩。提示到答案过渡处的加性结构支持可解释性和对角色或任务贡献的细粒度控制；整个延续中的角色条件化行为依赖于分布式的提示/KV机制，局部激活算术无法取代。

英文摘要

Role prompts of the form "As X, do Y" admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition -- the last prompt token together with the first two generated tokens -- in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction -- or even the oracle clean residual $h_{XY}$ -- into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
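
The additive substitution can be written out as plain vector arithmetic. The abstract does not define the effects explicitly, so this sketch assumes that B denotes a baseline slot, that the pure persona effect is h_XB - h_BB, and that the pure task effect is h_BY - h_BB. The function names and the toy two-dimensional residuals are hypothetical.

```python
def vec_sub(a, b):
    """Element-wise a - b for two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def vec_add(*vectors):
    """Element-wise sum of any number of equal-length vectors."""
    return [sum(values) for values in zip(*vectors)]


def additive_prediction(h_bb, h_xb, h_by):
    """Predict the persona+task residual as h_BB + delta_X + delta_Y.

    ASSUMED definitions: delta_X = h_XB - h_BB (persona only),
    delta_Y = h_BY - h_BB (task only).
    """
    delta_x = vec_sub(h_xb, h_bb)
    delta_y = vec_sub(h_by, h_bb)
    return vec_add(h_bb, delta_x, delta_y)


# Toy residuals in which persona and task move along orthogonal axes.
h_bb = [1.0, 2.0]   # baseline persona, baseline task
h_xb = [2.0, 2.0]   # persona X, baseline task
h_by = [1.0, 4.0]   # baseline persona, task Y
predicted_h_xy = additive_prediction(h_bb, h_xb, h_by)
```

The paper's negative result concerns what happens afterwards: even an exact residual at this one site does not reproduce generation that attends back to the persona text across the prompt.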

2605.23146 2026-05-25 cs.LG cs.AI 版本更新

定义学术情境中的AI疲劳：维度、指标及基于扎根理论的分阶段模型

John Paul P. Miranda, Emmanuel B. Parreño, Jovita G. Rivera

发表机构 * Pampanga State University（邦板牙州立大学）

AI总结本文探讨了学术场景中由持续使用AI工具引发的一种新型压力——AI疲劳，提出了其定义、维度及阶段模型。研究基于对1054名菲律宾大学学生的开放式回答进行扎根理论分析，识别出认知超载、动机脱离、道德不安、身体负担和注意力分散五个维度，每个维度包含两个基于参与者描述的指标。研究还构建了AI疲劳阶段模型，解释了这些压力如何在重复使用AI工具的过程中累积和相互强化，为未来相关测量工具的开发和跨情境研究奠定了基础。

Comments 17 pages, journal article, Volume 25, Issue 5

Journal ref International Journal of Learning, Teaching and Educational Research, 25(5), 91-107 (2026)

DOI: 10.26803/ijlter.25.5.5

AI中文摘要

AI工具在学术环境中的整合引入了一种独特的压力形式，现有框架如技术压力和数字疲劳尚未完全解决这一问题。本研究开发了一个概念模型，并确定了定义AI疲劳的维度，AI疲劳是持续在学术中使用AI工具而产生的一种压力形式。通过对菲律宾三所大学1054名大学生的开放式回答进行扎根理论分析，研究了学生在AI支持的学术工作中经历的认知、动机、情感、身体和注意力压力。分析产生了AI疲劳的五个维度，即认知超载、动机脱离、道德不安、身体疲劳和注意力漂移，每个维度包含两个基于参与者叙述的指标。研究结果还提出了AI疲劳模型，这是一个分阶段框架，解释了这些压力如何在学术任务中反复与AI交互时积累并相互强化。这些贡献为AI疲劳作为一个独特构念建立了概念和探索基础，并为未来在AI中介学生学习的学术环境中的工具验证、量表开发和跨情境研究提供了基础。

英文摘要

The integration of AI tools in academic settings has introduced a distinct form of strain that existing frameworks like technostress and digital fatigue have not yet fully addressed. This study develops a conceptual model and identifies the dimensions that define AI fatigue as a form of strain arising from sustained academic use of AI tools. Using grounded theory analysis of open-ended responses from 1,054 university students across three universities in the Philippines, the study examined the cognitive, motivational, emotional, physical, and attentional pressures students experienced during AI-supported academic work. Analysis produced five dimensions of AI fatigue, namely Cognitive Overload, Motivational Disengagement, Moral Unease, Physical Strain, and Attentional Drift, each consisting of two indicators grounded in participant accounts. The findings also yielded the AI Fatigue Model, a stage-based framework that explains how these pressures accumulate and reinforce one another across repeated AI interaction in academic tasks. These contributions establish a conceptual and exploratory foundation for AI fatigue as a distinct construct and provide a basis for future instrument validation, scale development, and cross-contextual inquiry in academic settings where AI now mediates student learning.

2605.23118 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

在临床医生验证的交互式病灶追踪中利用纵向上下文

Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein

发表机构 * German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany（德国癌症研究中心（DKFZ）海德堡，医学图像计算部，德国）； Faculty of Mathematics and Computer Science, Heidelberg University, Germany（海德堡大学数学与计算机科学学院，德国）； HIDSS4Health -- Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany（HIDSS4Health：亥姆霍兹健康信息与数据科学学院，卡尔斯鲁厄/海德堡，德国）； Medical Faculty, Heidelberg University, Germany（海德堡大学医学院，德国）； University Hospital Brandenburg an der Havel, Brandenburg Medical School Theodor Fontane, Germany（哈弗尔河畔勃兰登堡大学医院，勃兰登堡特奥多尔·冯塔纳医学院，德国）； Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany（海德堡大学医院放射肿瘤科模式分析与学习组，德国）

AI总结本文研究了如何在临床验证的交互式病灶追踪中有效利用纵向影像信息，以提高肿瘤在连续CT扫描中的追踪准确性。作者提出了一种“验证追踪”范式，通过临床医生验证注册提出的提示，并结合病灶的基线外观信息，解决分割中的模糊问题。该方法结合了早期空间提示融合与潜在时间差分加权，构建了一个统一的纵向信息引导分割框架，并通过大规模合成预训练克服数据稀缺问题，显著提升了性能。实验表明，该方法在全自动和验证追踪设置下均优于现有方法，且在MICCAI autoPET IV挑战赛中取得第一名。

Comments Accepted at MICCAI 2026

AI中文摘要

在系列CT扫描中追踪肿瘤病灶对于肿瘤学反应评估至关重要。现有的自动化方法面临一个基本权衡：端到端追踪器实现高度自动化，但无法纠正无声的追踪失败；而解耦的配准-分割流程允许用户验证，却丢弃了病灶的先验外观，限制了在模糊情况下的准确性。在这项工作中，我们提出了一种验证追踪范式：临床医生验证配准提出的提示，模型利用该提示以及基线病灶外观来解决分割模糊性。我们提出了一个统一框架，结合早期空间提示融合与潜在时间差异加权，用于纵向信息感知的分割。为了解决数据稀缺问题，我们利用大规模合成预训练，证明这对于利用纵向上下文至关重要，相比从头训练性能提升高达4.5个Dice点。我们的方法在MICCAI autoPET IV挑战中获得第一名。我们进一步整理并发布了PanTrack，一个新的纵向胰腺癌基准，以评估分布外泛化能力。实验表明，我们的模型在全自动和所提出的验证追踪设置中均优于先前工作，在自动化与控制之间提供了一个临床安全的中间地带。代码、模型和数据集将在https://github.com/MIC-DKFZ/LongiSeg发布。

英文摘要

Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally-informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, proving essential for exploiting longitudinal context, improving performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both fully automatic and the proposed verified tracking setting offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at https://github.com/MIC-DKFZ/LongiSeg

2605.23116 2026-05-25 cs.CV cs.AI 版本更新

CoReVAD: A Contextual Reasoning Framework for Training-Free Video Anomaly Detection

CoReVAD: 一种无需训练的视频异常检测上下文推理框架

Hyeongmuk Lim, Youngbum Hur

发表机构 * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea（韩国仁川仁荷大学工业工程系）

AI总结现有视频异常检测方法通常依赖任务特定的训练，导致领域依赖性强且训练成本高，且大多仅输出标量异常分数，缺乏对异常原因的解释。为此，本文提出CoReVAD，一种无需训练的上下文推理框架，利用冻结的视觉-语言模型直接生成异常分数和时间描述，并通过局部响应清理模块和全局时序优化策略提升检测精度与可解释性。实验表明，CoReVAD在多个数据集上表现出色，提供了可靠且易于理解的异常解释。

Comments Accepted to ICPR 2026

AI中文摘要

现有的视频异常检测方法通常依赖于任务特定的训练，导致强领域依赖性和高训练成本。此外，大多数现有方法仅输出标量异常分数，对特定事件为何被视为异常提供的洞察有限。视觉语言模型的最新进展使得异常检测和人类可解释推理成为可能。然而，许多基于视觉语言模型的方法仍然需要额外的训练步骤（例如，指令调优或口头化学习）或外部大型语言模型，从而带来进一步的训练成本和推理开销。为了解决这些挑战，我们提出了CoReVAD，一种用于无需训练的视频异常检测的上下文推理框架，该框架使用单个冻结的视觉语言模型运行。CoReVAD直接从视觉语言模型生成异常分数和时间描述。为了减轻生成输出中的噪声，我们引入了一个基于局部视觉-文本对齐的局部响应清理模块。此外，通过基于softmax的精炼、高斯平滑和位置加权，融入了全局时间上下文和进展。在UCF-Crime和XD-Violence上的实验表明，CoReVAD在无需训练的方法中取得了竞争性能，同时提供了可靠且可解释的解释。我们的官方代码可在https://github.com/Muk-00/CoReVAD获取。

英文摘要

Existing Video Anomaly Detection (VAD) methods typically rely on task-specific training, leading to strong domain dependency and high training costs. Moreover, most existing methods output only scalar anomaly scores, providing limited insight into why specific events are considered abnormal. Recent advances in Vision-Language Models (VLMs) have enabled both anomaly detection and human-interpretable reasoning. However, many VLM-based approaches still require additional training steps (e.g., instruction tuning or verbalized learning) or external Large Language Models (LLMs), incurring further training costs and inference overhead. To address these challenges, we propose CoReVAD, a contextual reasoning framework for training-free video anomaly detection that operates with a single frozen VLM. CoReVAD directly generates anomaly scores and temporal descriptions from the VLM. To mitigate noise in generative outputs, we introduce a Local Response Cleaning (LRC) module based on local vision-text alignment. Furthermore, global temporal context and progression are incorporated through softmax-based refinement, Gaussian smoothing, and position weighting. Experiments on UCF-Crime and XD-Violence demonstrate that CoReVAD achieves competitive performance among training-free methods while providing reliable and interpretable explanations. Our official code is available at: https://github.com/Muk-00/CoReVAD
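
Of the temporal refinement steps named above, Gaussian smoothing is the easiest to illustrate in isolation. The sketch below smooths a per-frame anomaly score sequence with a truncated, renormalized Gaussian kernel. It is a generic implementation under assumed settings (the kernel radius, `sigma`, and boundary handling are illustrative choices), not CoReVAD's code, and it leaves out the softmax-based refinement and position weighting, whose details the abstract does not give.

```python
import math


def gaussian_smooth(scores, sigma=1.0, radius=None):
    """Smooth a 1-D score sequence with a truncated Gaussian kernel.

    Near the sequence boundaries the kernel is renormalized over the
    frames that exist, so a constant input stays constant.
    """
    if radius is None:
        radius = max(1, int(3 * sigma))
    offsets = range(-radius, radius + 1)
    kernel = [math.exp(-0.5 * (k / sigma) ** 2) for k in offsets]
    smoothed = []
    for i in range(len(scores)):
        weighted_sum = 0.0
        weight_total = 0.0
        for k, w in zip(offsets, kernel):
            j = i + k
            if 0 <= j < len(scores):
                weighted_sum += w * scores[j]
                weight_total += w
        smoothed.append(weighted_sum / weight_total)
    return smoothed


# A single noisy spike is spread over its neighbours and reduced in height.
frame_scores = [0.0, 0.0, 1.0, 0.0, 0.0]
smoothed_scores = gaussian_smooth(frame_scores, sigma=1.0)
```

Smoothing of this kind suppresses isolated one-frame spikes from a generative model while preserving sustained high-score segments.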

2605.23109 2026-05-25 cs.AI cs.DC cs.LO cs.PL 版本更新

Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems

归纳演绎合成：使AI能够生成形式化验证的系统

Shubham Agarwal, Alexander Krentsel, Shu Liu, Mert Cemri, Audrey Cheng, Rui Meng, Tomas Pfister, Chun-Liang Li, Sylvia Ratnasamy, Aditya Parameswaran, Matei Zaharia, Ion Stoica, Mohsen Lesani

发表机构 * UC Berkeley（加州大学伯克利分校）； Google（谷歌）； UC Santa Cruz（加州大学圣克鲁兹分校）

AI总结本文提出了一种名为归纳演绎综合（IDS）的新方法，旨在解决AI生成代码时缺乏形式化验证的问题，特别是在分布式系统领域。该方法通过联合生成实现代码和形式化证明，并从失败尝试中学习，系统性地尝试有效策略。IDS作为基于代理的大型语言模型系统，能够在约6.8小时内以较低成本完成7个分布式键值存储规范的形式化验证，且生成的实现性能优于现有验证系统。

AI中文摘要

AI代理在生成、测试和优化代码方面日益出色。然而，在需要完全覆盖的形式化保证（仅靠测试无法提供）的任务上，它们表现不足。分布式系统是一个典型例子：读写一致性等属性必须在每个可能的事件交错下成立。机械化形式验证可以保证这种正确性，但通常需要专家数月到数年的努力。证据表明，即使是最先进的编码代理（Codex with GPT-5.4和Claude Code with Opus 4.6）也仅在7个分布式键值存储规范中的2个上成功。在本文中，我们提出了解决这一差距的首个有效方法——归纳演绎合成（IDS），它联合且增量地合成实现和证明，并从失败的尝试中学习以系统地尝试有前景的策略。作为基于LLM的代理系统，IDS在平均约6.8小时和每个规范106美元的成本下实现了7/7的成功率，比专家努力快约200倍，比最先进的代理便宜17%。IDS进一步将性能反馈纳入同一循环，产生的实现比已发布的验证系统快达3倍。

英文摘要

AI agents increasingly excel at generating, testing, and refining code. However, they fall short on tasks requiring formal guarantees of full coverage that testing alone cannot provide. Distributed systems are a prime example: properties such as consistency between reads and writes must hold under every possible interleaving of events. Mechanized formal verification can guarantee such correctness, but typically demands months to years of expert effort. As evidence, even SOTA coding agents (Codex with GPT-5.4 and Claude Code with Opus 4.6) succeed on only 2/7 distributed key-value-store specifications. In this paper, we present the first effective approach to addressing this gap, Inductive Deductive Synthesis (IDS), which jointly and incrementally synthesizes implementation and proof, and learns from failed attempts to systematically try promising strategies. Built as an agentic LLM system, IDS achieves 7/7 in about 6.8 hours and $106 per spec on average, roughly 200x faster than expert effort and 17% cheaper than SOTA agents. IDS further incorporates performance feedback into the same loop, yielding implementations up to 3x faster than published verified systems.

2605.23108 2026-05-25 cs.SE cs.AI 版本更新

Philosophical Dispositions as Behavioral Constraints for AI-Assisted Code Review: An Empirical Study

哲学倾向作为AI辅助代码评审的行为约束：一项实证研究

Kaushal Bansal

发表机构 * Salesforce, Inc.（Salesforce公司）

AI总结本文研究如何通过哲学立场（如怀疑主义、逻辑学、犬儒主义等）约束AI代码审查工具的行为，以提升其审查的多样性和深度。研究提出了一种基于特定知识论传统构建AI审查行为框架的方法，并通过实证分析验证了该方法在不同编程语言和项目中的有效性。实验表明，该系统能够发现传统AI工具难以识别的结构性和逻辑性问题，展现出更强的审查独特性和准确性。

AI中文摘要

AI辅助代码评审工具通常作为通用的“专家评审者”代理运行，无论需要何种分析类型，都会产生同质化的发现。我们提出一个系统，通过哲学倾向——基于特定认识论传统（皮浪怀疑论、新正理逻辑、第欧根尼犬儒主义、儒家关系伦理）的连贯人格视角，将注意力引导到结构上不同类型的问题上——来约束AI评审者行为。每种倾向通过否定方式定义（即拒绝做什么），配备自我监控的失败模式（hamartia），并通过角色协议按顺序编排。我们在跨越5种编程语言（Python、Go、C++、Java、Terraform）、5个组织（2个企业、3个开源）和2个时间时代（AI前2020年、AI后2024-2026年）的7个代码库的50个合并拉取请求上评估该系统。该倾向系统与人类评审者达到46%的一致性（验证信号质量），以75%的比率识别出独特发现，并且在总共601个发现中，没有发现被作者判定为假阳性（未评估评分者间一致性，这仍是一个局限）。受控基线比较表明，51%的倾向发现是同一模型使用通用“专家评审者”提示不会产生的，这些独特发现针对结构、操作和逻辑问题，而非标准代码级别问题。初步跨模型验证（Claude Opus vs. GPT Codex 5.3-xhigh）在3个PR上显示100%的框架结构遵循度和39%的发现级别一致性，表明该框架在保持模型特定分析视角的同时提供了真正的行为约束。

英文摘要

AI-assisted code review tools typically operate as generic "expert reviewer" agents, producing homogeneous findings regardless of the analysis type needed. We present a system that constrains AI reviewer behavior through philosophical dispositions -- coherent personality lenses grounded in specific epistemological traditions (Pyrrhonist Skepticism, Navya-Nyāya logic, Diogenes' Cynicism, Confucian relational ethics) that direct attention to structurally different types of issues. Each disposition is defined apophatically (by what it refuses to do), equipped with a self-monitoring failure mode (hamartia), and orchestrated in sequence by role protocols. We evaluate this system on 50 merged pull requests across 7 repositories spanning 5 programming languages (Python, Go, C++, Java, Terraform), 5 organizations (2 enterprise, 3 open-source), and 2 temporal eras (pre-AI 2020, post-AI 2024--2026). The disposition system achieves 46% convergence with human reviewers (validating signal quality), identifies unique findings at a 75% rate, and produces no findings judged false-positive by the author across 601 total findings (inter-rater agreement was not assessed and remains a limitation). A controlled baseline comparison demonstrates that 51% of disposition findings are not produced by the same model using generic "expert reviewer" prompting, and these unique findings target structural, operational, and logical concerns rather than standard code-level issues. Preliminary cross-model validation (Claude Opus vs. GPT Codex 5.3-xhigh) on 3 PRs shows 100% framework-structure adherence with 39% finding-level agreement, suggesting the framework provides real behavioral constraint while preserving model-specific analytical perspective.

2605.23103 2026-05-25 cs.CL cs.AI cs.CY cs.DB 版本更新

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

用于明清之际文集中个人书信标题的微调BERT分类器

Queenie Luo

发表机构 * Harvard University（哈佛大学）

AI总结本文提出了一种基于微调BERT的分类器Lepton，用于识别晚明至清初文集目录中的标题是否为个人书信，特别是与可混淆的序言（如告别序）进行区分。该模型在33位文人手标注的5438个文集标题上进行微调，并已部署于Hugging Face平台，应用于中国传记资料库（CBDB），成功识别出约五万五千封书信，为明信平台的数据建设提供了支持。

2605.23094 2026-05-25 eess.IV cs.AI cs.CV 版本更新

Do Synthetic Brain MRIs Reliably Improve Tumour Classification? A StyleGAN2-ADA Class-Plane Augmentation Study on BRISC 2025

合成脑部MRI能否可靠改善肿瘤分类？基于BRISC 2025的StyleGAN2-ADA类平面增强研究

José Rafael Noriega Cedeño

发表机构 * NVIDIA

AI总结该研究探讨了合成脑部MRI图像是否能有效提升肿瘤分类任务的性能，使用StyleGAN2-ADA生成器在BRISC 2025数据集上生成图像，并测试其对三种分类模型的影响。研究发现，合成图像的增益效果因模型架构和真实与合成图像比例不同而有所差异，其中MobileViTV2模型在使用过滤后的1:1合成图像增强后，肿瘤分类准确率提升了1.02%。结果表明，生成式增强的效果并非仅取决于图像的视觉质量，而是与模型结构和数据配比密切相关。

Comments 18 pages, 16 figures

任意时间训练：无调度谱优化

Anuj Apte, Pranav Deshpande, Niraj Kumar, Shouvanik Chakrabarti, Junhyung Lyle Kim

发表机构 * Global Technology Applied Research（全球技术应用研究）

AI总结本文提出了一种名为 SF-NorMuon 的无调度谱优化器，用于解决传统神经网络训练中依赖固定学习率计划的问题。该方法在无需预设训练时间范围的情况下，能够在大规模语言模型上达到甚至超越精心调参的 AdamW 优化器的性能。研究还从理论上证明了无调度谱动态的稳定性保证，并指出快速迭代中的权重衰减对长期训练稳定性至关重要，为无需预设时间范围的持续学习提供了更实用的优化方案。

AI中文摘要

标准神经网络训练依赖于与固定训练步数绑定的学习率调度，导致路径依赖性强，且当数据可用性变化时需要昂贵的重新调优。无调度（SF）方法通过移除显式调度来解决这一问题，然而当前最先进的任意时间优化器SF-AdamW始终不如调优后的AdamW基线。我们提出SF-NorMuon，一种无调度谱优化器，弥补了这一差距：使用单一超参数配置，SF-NorMuon在125M和772M参数的语言模型上，在$1$--$8\times$ Chinchilla训练步数范围内匹配或超过了调优的AdamW。在理论方面，我们证明了无调度谱动力学的平稳性保证，并指出快速迭代上的权重衰减对于长步数稳定性至关重要。SF-NorMuon使从业者能够在训练过程中的任何时刻获得高质量检查点，而无需预先承诺训练步数。通过缩小与调优基线的性能差距，SF-NorMuon使无步数优化更加实用，向真正开放式的持续学习迈出了一步。

英文摘要

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
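
For readers unfamiliar with the schedule-free idea, the sketch below runs plain schedule-free SGD on a one-dimensional quadratic. It is explicitly not SF-NorMuon: there is no spectral (Muon-style) update and no weight decay, and the objective, step size, and `beta` are illustrative assumptions. It only shows the three-sequence structure: a fast iterate `z`, an averaged iterate `x` that can be checkpointed at any step, and the interpolation point `y` at which the gradient is taken.

```python
def schedule_free_sgd(grad, w0, lr=0.5, beta=0.9, steps=300):
    """Schedule-free SGD with a constant learning rate and no decay schedule.

    z: fast iterate that takes the gradient steps
    x: uniform running average of z (the iterate to evaluate or checkpoint)
    y: interpolation between z and x at which the gradient is evaluated
    """
    z = w0
    x = w0
    for t in range(steps):
        y = (1.0 - beta) * z + beta * x
        z = z - lr * grad(y)
        c = 1.0 / (t + 1)          # averaging weight, no horizon needed
        x = (1.0 - c) * x + c * z
    return x


# Toy objective f(w) = 0.5 * (w - 3)^2 with gradient w - 3 and minimum at w = 3.
x_final = schedule_free_sgd(lambda w: w - 3.0, w0=0.0)
```

Because the averaging weight depends only on the current step count, training can stop at any point without a learning-rate schedule having been tuned to that horizon.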

2605.23058 2026-05-25 cs.SE cs.AI 版本更新

A measurement substrate for agentic Kubernetes operations: Methodology and a case study in retrieval-compounding falsification

面向代理化 Kubernetes 操作的测量基础：方法论与检索复合证伪案例研究

Joshua Odmark, Gideon Rubin, Deon van der Vyver

发表机构 * Independent（独立）； LDE ； Cognyx

AI总结该论文提出了一种用于评估自主 Kubernetes 操作代理的测量框架 agent-breakage，旨在解决当前相关研究中缺乏可证伪性的问题。该框架通过注入故障并观察代理的响应，从四个维度进行评分，并记录带标签的状态-动作-结果元组，从而实现对代理行为的系统评估。研究通过案例分析揭示了检索历史故障报告对代理能力的影响，并指出当前研究中存在诸如选择偏差、样本量过小等潜在问题，展示了该方法在提升实验可信度方面的重要价值。

Comments 22 pages. Code at https://github.com/odmarkj/agent-breakage tag v0.1.0 (Apache 2.0). Source repo at https://github.com/odmarkj/agent-breakage-paper tag arxiv-v1

AI中文摘要

关于自主 Kubernetes 操作代理的经验声明在很大程度上是不可证伪的。已发表的工作报告了观察结果，但没有与禁用代理的基线进行受控比较，选择偏差普遍存在，缺乏预注册的决策矩阵，并且样本通常太小，无法匹配底层评分系统的噪声水平。原因在于限制代理本身的相同差距：代码代理有一个验证基础，将“是否有效”转化为快速、可证伪的 ground-truth 信号，而操作领域没有等效物。我们提出 agent-breakage，一个闭环测量框架，向目标 Kubernetes 集群注入故障，观察自主代理如何响应，在四个轴上根据 ground truth 对响应进行评分，并累积带有结果标签的 (状态, 动作, 结果) 元组。该框架区分框架错误和推理错误，通过确定性嵌入器机制支持真正的关闭条件控制，并强制执行预注册的决策矩阵。我们将其作为案例研究，测试检索过去的故障后分析是否会复合代理的能力。方法论的贡献是框架在该案例研究中捕获的三个混杂因素，每个因素都会在同一个工作的仪器化程度较低的版本上产生错误的已发表声明：pgvector 索引错误、+19% 的选择偏差工件，以及将效应夸大大约 3 倍的小样本估计。检索结果本身是部分证伪：3 个密集语料场景中有 1 个在 p<0.05 时显著，合并效应 +3.9 个百分点，在 n=60 时不显著。在 360 次运行中进行的场景内语料密度扫描表明，近邻的机械对齐主导了原始计数。该框架已开源发布。

英文摘要

Empirical claims about autonomous Kubernetes operations agents are largely unfalsifiable. Published work reports observational results without controlled comparisons against an agent-disabled baseline, selection bias is endemic, pre-registered decision matrices are absent, and samples are typically too small for the noise level of the underlying scoring system. The cause is the same gap that limits the agents themselves: code agents have a verification substrate that turns "did it work" into a fast, falsifiable, ground-truth signal, and operations has nothing equivalent. We present agent-breakage, a closed-loop measurement framework that injects faults into a target Kubernetes cluster, observes how an autonomous agent responds, scores the response on four axes against ground truth, and accumulates outcome-labeled (state, action, outcome) tuples. The framework distinguishes framework error from reasoning error, supports a true off-condition control via a deterministic-embedder mechanism, and enforces pre-registered decision matrices. We use it as a case study to test whether retrieval over past postmortems compounds an agent's capability. The methodological payload is three confounds the substrate caught during that case study, each of which would have produced a wrong published claim on a less instrumented version of the same work: a pgvector index bug, a +19% selection-bias artifact, and small-sample estimates that overstated effects by roughly 3x. The retrieval result itself is a partial falsification: 1 of 3 dense-corpus scenarios significant at p<0.05, pooled effect +3.9 percentage points, not significant at n=60. A within-scenario corpus-density sweep at 360 runs shows that mechanistic alignment of near-neighbors dominates raw count. The framework is released open source.

2605.23056 2026-05-25 cs.NI cs.AI 版本更新

DRL-Driven Edge-Aware Utility Optimization for Multi-Slice 6G Networks

DRL驱动的多切片6G网络边缘感知效用优化

Khaled M. Naguib, Soumaya Cherkaoui, Mahmoud M. Elmessalawy, Ahmed M. Abd El-Haleem, Ibrahim I. Ibrahim

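Affiliations: CCAS Department, School of Engineering, New Giza University; Department of Computer and Software Engineering, Polytechnique Montréal; Department of Electronics and Communications, Faculty of Engineering, Helwan University

AI summary: This paper studies how deep reinforcement learning can optimize edge-aware utility across multiple slices in 6G networks to serve demanding services such as virtual reality. It proposes a Deep Q-Network (DQN) based framework for intelligent resource allocation and edge caching that performs dynamic resource provisioning and content distribution across network slices within an O-RAN architecture. In simulation, the approach reduces latency and improves throughput relative to traditional methods, giving more reliable and responsive support for immersive VR applications in 6G environments.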

Comments 5 pages

Journal ref IEEE Networking Letters, vol. 8, pp. 14-18, 2026

DOI: 10.1109/LNET.2025.3614549

AI中文摘要

通过6G网络传输的虚拟现实（VR）服务需要超低延迟和高带宽，以确保无缝用户体验。本文提出了一种面向6G O-RAN网络的智能资源分配与边缘缓存框架，利用深度Q网络（DQN）学习优化O-RAN架构下多网络切片的边缘缓存和动态资源配置。通过将DRL代理集成到网络控制平面，所提系统能够实现主动和自适应内容分发以及实时计算资源分配，满足eMBB、URLLC，尤其是对VR至关重要的新兴MBRLLC切片的服务质量需求。仿真结果表明，基于DQN的框架在降低延迟和提高吞吐量方面始终优于传统方法，从而为6G环境中的沉浸式VR应用提供更可靠和响应更快的支持。

英文摘要

Virtual Reality (VR) services delivered over 6G networks demand ultra-low latency and high bandwidth to ensure seamless user experiences. This paper presents an intelligent resource allocation and edge caching framework for 6G O-RAN networks, leveraging Deep Q-Network (DQN) learning for optimizing edge caching and dynamic resource provisioning across multiple network slices within an O-RAN-compliant architecture. By incorporating DRL agents into the network control plane, the proposed system enables proactive and adaptive content distribution as well as real-time computational resource allocation that meets the quality-of-service demands of eMBB, URLLC, and especially the emerging MBRLLC slices essential for VR. Simulation results demonstrate that the DQN-based framework consistently outperforms traditional methods in reducing latency and improving throughput, leading to more reliable and responsive support for immersive VR applications in 6G environments.

2605.23054 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Model Collapse as Cultural Evolution

模型崩溃作为文化演化

Dongxin Guo, Jikun Wu, Siu Ming Yiu

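Affiliations: The University of Hong Kong; Stellaris AI Limited

AI summary: This paper studies model collapse, the progressive degradation of large language models (LLMs) trained on their own outputs. Drawing on iterated learning theory from cultural evolution, the authors derive five falsifiable predictions and test them through self-training experiments in English, German, and Turkish. Under unfiltered self-training, compositionality follows a non-monotonic trajectory (rising, then falling), a signature that is sustained only by task-grounded filtering, not random filtering. The work offers a linguistic account of model collapse and concrete principles for designing self-training pipelines.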

Comments Accepted at CoNLL 2026. 18 pages, 3 figures, 2 tables

AI中文摘要

模型崩溃，即在其自身输出上训练的LLM的逐步退化，已被统计表征，但缺乏对哪些结构退化、以何种顺序以及为何退化的语言学解释。我们表明，文化演化中的迭代学习理论填补了这一空白。我们推导出五个可证伪的预测，区分了那些对该理论具有独特判别性的预测与确认性预测，并通过在英语、德语和土耳其语中自训练LLaMA-2-7B和Mistral-7B达10代来测试它们。关键的判别性发现：在未过滤的自训练下，组合性遵循非单调轨迹（先上升后下降）。这一特征在最大规则种子数据下持续存在（排除了噪声去除），并且仅由任务导向的过滤维持，而非随机过滤，提供了压缩-通信权衡的首个LLM尺度证据。所有预测均得到确认，效应量较大（Hedges' $g > 1.6$；$\mathrm{BF}_{10} > 100$），且LLM正则化梯度与人类行为数据高度匹配（$R^2 = 0.94$）。这些结果将模型崩溃重新定义为文化传播现象，并为自训练管道设计提供了具体原则。

英文摘要

Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguistic explanation for which structures degrade, in what order, and why. We show that iterated learning theory from cultural evolution fills this gap. We derive five falsifiable predictions, distinguish those uniquely discriminative for the theory from confirmatory ones, and test them by self-training LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish. The critical discriminative finding: compositionality follows a non-monotonic trajectory (initially rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale evidence for the compression-communication tradeoff. All predictions are confirmed with large effect sizes (Hedges' $g > 1.6$; $\mathrm{BF}_{10} > 100$), and LLM regularization gradients closely match human behavioral data ($R^2 = 0.94$). These results reframe model collapse as a cultural transmission phenomenon and yield concrete principles for self-training pipeline design.
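The effect sizes above are reported as Hedges' g. As a reminder of what that statistic measures, the following is a minimal sketch, assuming two independent samples and the common approximate small-sample correction 1 - 3/(4(n1 + n2) - 9). The function name and the sample data are illustrative and are not taken from the paper.

```python
import math
from statistics import mean, variance


def hedges_g(a, b):
    """Standardized mean difference with a small-sample bias correction."""
    na, nb = len(a), len(b)
    # Pooled standard deviation from the two unbiased sample variances.
    pooled_sd = math.sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    cohens_d = (mean(a) - mean(b)) / pooled_sd
    # Approximate correction factor; shrinks d slightly for small samples.
    correction = 1 - 3 / (4 * (na + nb) - 9)
    return cohens_d * correction


if __name__ == "__main__":
    # Hypothetical scores for two conditions.
    print(hedges_g([2, 4, 6], [1, 3, 5]))
```

With large samples the correction factor approaches 1 and g converges to Cohen's d.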

2605.23052 2026-05-25 cs.CL cs.AI 版本更新

DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods

DreamerNLplus: 使用混合规则和RAG方法从社交媒体时间线进行可解释的心理健康动态建模

Maryia Zhyrko, Daisy Monika Lal, Erik van Mulligen, Lifeng Han

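Affiliations: Leiden Institute of Advanced Computer Science (LIACS), Leiden University; School of Computing and Communications (SCC), Lancaster University; Department of Medical Informatics, Erasmus University Medical Center Rotterdam; Biomedical Data Sciences, Leiden University Medical Center

AI summary: This paper presents DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines, submitted to the CLPsych 2026 shared task. It combines rule-based and retrieval-augmented generation (RAG) techniques across psychological state modeling, temporal change detection, and sequence-level summarization, ranking 2nd on Task 3.1 and, on Task 3.2, 1st for Improvement and 3rd for Deterioration. The analysis highlights key challenges, including the mismatch between classification and regression performance and the difficulty of modeling temporal transitions.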

Comments Accepted by CLPsych2026. CLPsych 2026 will be held at ACL in San Diego July 4th, 2026

AI中文摘要

我们提出DreamerNLplus，一个用于在CLPsych 2026共享任务中从社交媒体时间线建模心理健康动态的混合框架。我们的系统处理三个任务：心理状态建模、时间变化检测和序列级总结。对于任务1，我们结合基于LLM的数据增强、DeBERTa分类和随机森林回归进行结构化状态预测。对于任务2，我们使用本地部署的Llama 3.1模型进行少样本提示，利用短期时间上下文检测切换和升级事件。对于任务3.1，我们探索了确定性基于规则的总结流水线和基于LLM的少样本方法，官方排名第二。我们的基于RAG的方法在任务3.2中取得了强劲性能，在改善任务中排名第一，在恶化任务中排名第三，展示了其捕捉时间线上反复出现的心理变化模式的能力。我们的分析揭示了关键挑战，包括分类与回归性能之间的不匹配、时间转换建模的困难，以及基于语义和基于相似性的评估指标之间的不一致。这些发现凸显了建模心理健康动态的复杂性，并推动了未来关于统一评估框架的工作。我们在https://github.com/4dpicture/CLPsych2026分享我们的代码和提示。

英文摘要

We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking 2nd officially. Our RAG-based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at https://github.com/4dpicture/CLPsych2026

2605.23045 2026-05-25 cs.CV cs.AI cs.LG 版本更新

确定性视界：作为可信AI系统设计规范的不可行性结果

Dongxin Guo

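AI summary: This thesis examines the boundaries that fundamental limits of computation impose on the design of trustworthy AI systems, and proposes turning impossibility theorems into system design rules. Its central result proves an architecture-determined accuracy ceiling beyond a critical reasoning depth, the "Deterministic Horizon", which is unaffected by sample size, adapter rank, or loss function and can be computed before deployment from layer count and embedding width. The same argument is applied across several AI subfields, yielding a catalogue of sixteen design specifications.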

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

AI中文摘要

大型语言模型现在编写软件、起草法律文件并生成临床笔记，但从图灵、阿罗到没有免费午餐定理的基本极限，塑造了计算的能力。本文将这些不可行性结果从奇闻转化为设计规则。其旗舰结果证明了仅由架构设定的准确率上限：超过关键推理深度后，无论适配器秩、样本大小或损失函数如何，训练都无法改变它。该确定性视界在部署前可从层数和嵌入宽度计算，在十二种Transformer架构中测量值介于19到31之间，而在最优长度轨迹上微调可恢复不到4个百分点。其机制是残差流的容量不变性，信息论转换得出超过视界后准确率超指数衰减。一个针对模幂的无条件电路复杂度下界（对抗常数深度素数模电路）补充了这一结果。同样的论证重新应用于多个子领域：任何错误指定模型下的偏好学习在样本复杂度上出现不连续跳跃；多阶段检索流水线至少需要与阶段数一样多的独立指标；标准诚实拍卖对于具有提示相关估值的智能体失效；神经推理的零知识验证为每个非线性激活支付110到190倍的测量开销。这些共同构成了一个包含16条规范的目录，每条规范配对一个可计算边界、一个量化违反成本和一个建设性设计规则：两个组合已被证明，一个配对是诚实障碍，四个保持开放。本文为可信AI可能需要的生成式研究计划提供了不可行性规范方法论。AI的每一个基本极限也是一个设计规则。

英文摘要

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.23007 2026-05-25 q-fin.TR cs.AI cs.LG q-fin.PM 版本更新

MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

MadEvolve: 基于大型语言模型的交易系统进化优化

Yurii Kvasiuk, Tianyi Li, Owen Colegrove, Moritz Münchmeyer

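Affiliations: Department of Physics, University of Wisconsin–Madison; Event Horizon Labs

AI summary: This paper applies MadEvolve, an LLM-driven evolutionary optimization framework, to quantitative trading systems, using Bitcoin trading as the example. The method evolves feature sets for signal generation, individual strategy components, and the joint feature-and-execution pipeline, and reports significant improvements in the authors' simulation and backtesting setup. The study also compares against other agentic search approaches, specifically Claude Code, and evaluates p-hacking probabilities in the simulation setup.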

AI中文摘要

我们探索了将LLM驱动的算法优化应用于量化金融中的几个常见任务。MadEvolve是一个受DeepMind的Alpha-Evolve启发的通用算法优化框架，最近被开发用于优化计算宇宙学中的算法。在此，我们以比特币交易为例，展示了MadEvolve在优化算法交易策略和alpha生成方面的实用性。在我们的模拟和回测设置中，我们在所有考虑的任务上取得了显著改进，例如演化用于信号生成的特征集、优化交易策略的独立组件，以及联合演化特征流水线与执行策略。此外，我们将我们的方法与其他智能搜索方法（特别是Claude Code）进行了比较，并仔细评估了模拟设置中的p-hacking概率。我们的发现强烈支持AI驱动的智能和进化算法在算法交易和量化金融中的实用性。

英文摘要

We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind's AlphaEvolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve for optimizing algorithmic trading strategies and alpha generation, using Bitcoin trading as an example. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.

2605.22995 2026-05-25 cs.CY cs.AI 版本更新

Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

谁之善，谁之地？面向社会公益的能动型AI的道德地理学

Poli Nemkova, Haeshitha Indukuri, Jaedon Charles

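Affiliations: University of North Texas; Florida International University

AI summary: This paper examines a moral-geographic asymmetry in agentic AI systems for social good: although such systems often invoke the United Nations Sustainable Development Goals (SDGs), they rarely specify their geographic context, least of all in domains where local political, legal, and cultural context matters most. In a survey of 112 papers published between 2015 and 2026, 73% specify no geographic context and only 25% report any real-world deployment or small-scale test. The authors identify five accountability gaps and propose a minimal reporting standard for more context-specific, participatory, and accountable systems.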

AI中文摘要

能动型AI系统越来越多地被提出用于社会公益领域，通常引用联合国可持续发展目标（SDGs）作为全球利益的词汇。然而，社会公益的主张并未建立对系统声称服务的社区的问责。我们对2015年至2026年间发表的112篇关于社会公益的能动型AI论文进行了结构化调查。我们发现一种道德地理不对称：论文在最需要当地政治、法律和文化背景的领域最不可能指定地理背景。在整个语料库中，112篇论文中有82篇（73%）未指定任何地理背景。与健康或物理/生态SDGs相关的论文指定地理背景的比例为37-40%，而与制度和社会政策SDGs相关的论文仅13%。SDG 16（和平、正义与强大机构）既是语料库中覆盖最多的目标，也是地理指定率最低的目标。我们将此解释为道德抽象：面向社会公益的能动型AI往往将制度性善视为普适的，而不同于对待健康或生态善的方式。第二个发现加剧了这一点：112篇论文中只有28篇（25%）报告了任何实际部署或小规模测试。我们识别出五个问责缺口，并提出了一个最低报告标准，以促进更具体情境、参与性和负责任的面向社会公益的能动型AI。

英文摘要

Agentic AI systems are increasingly proposed for social-good domains, often invoking the United Nations Sustainable Development Goals (SDGs) as a vocabulary of global benefit. Yet claims of social good do not establish accountability to the communities a system claims to serve. We present a structured survey of 112 papers on agentic AI for social good published between 2015 and 2026. We find a moral-geographic asymmetry: papers are least likely to specify geographic context in precisely the domains where local political, legal, and cultural context matters most. Across the corpus, 82 of 112 papers (73%) specify no geographic context. Papers aligned with health or physical/ecological SDGs specify geography 37-40% of the time, while papers aligned with institutional and social-policy SDGs do so only 13%. SDG 16, peace, justice, and strong institutions, is both the most-covered goal in the corpus and the one with the lowest geographic-specification rate. We interpret this as moral abstraction: agentic AI for social good often treats institutional good as universal in ways it does not treat health or ecological good. A second finding compounds this: only 28 of 112 papers (25%) report any real-world deployment or small-scale test. We identify five accountability gaps and propose a minimal reporting standard for more context-specific, participatory, and accountable agentic AI for social good.

2605.22993 2026-05-25 cs.CL cs.AI 版本更新

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

一种主动式多智能体对话框架用于评估自闭症中的社交语言障碍特征

Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li

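Affiliations: Department of Computer Science; University at Albany; Department of Management Information System; West Virginia University; Department of Radiology; Washington University in St. Louis; Humanities and Social Sciences

AI summary: This study proposes TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework for assessing Social Language Disorder (SLD) traits in autism spectrum disorder. A doctor agent reasons about which traits remain unobserved, then selects a clinically grounded questioning strategy and generates a targeted question to surface latent traits. On 484 episodes from 35 patients, TPA outperforms six dialogue-planning baselines on all primary metrics, reaching 82.1% SLD trait coverage versus 65.5% for automated replay of real clinical dialogues, with implications for scalable AI-assisted clinical screening.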

AI中文摘要

与自闭症谱系障碍中社交语言障碍（SLD）相关的特征性语言行为，包括回声性重复、代词位移和刻板媒体引用，在自发对话中基本不存在，仅在特定对话条件下出现。在结构化临床评估中，这种潜伏性意味着提问策略选择是决定对话产生多少诊断信息的关键但未被充分重视的因素。大型语言模型（LLMs）能否被引导主动选择系统地揭示这些潜在特征的提问策略，在很大程度上仍未探索。本文提出TPA（思考、计划、询问），一种应用于自闭症诊断观察量表模块4（ADOS-2）语言评估部分的主动式多智能体对话框架，其中医生智能体在选择有临床依据的策略并生成针对性问题之前，明确推理哪些特征尚未观察到。基于真实ADOS-2临床数据的患者智能体使得无需真实患者参与即可进行可重复评估，并通过三个独立实验验证，确认其对真实患者语言具有足够的保真度。在来自35名患者的484个片段上评估，TPA在所有主要指标上优于六个竞争性对话规划基线，实现了82.1%的SLD特征覆盖率，比训练有素的临床医生进行的真实临床对话自动回放（65.5%）高16.6个百分点，并且每轮诊断效率显著更高（AUCC：0.628 vs. 0.458，绝对增益+0.170）。这些结果表明，主动提问策略选择显著提高了自动化SLD特征评估的效率，对可扩展的AI辅助临床筛查具有直接意义。

英文摘要

Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6% higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.

LLM 代码异味：分类与检测方法

Zacharie Chenail-Larcher, Brahim Mahmoudi, Naouel Moha, Quentin Stiévenart, Florent Avellaneda

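Affiliations: École de technologie supérieure; Université du Québec à Montréal

AI summary: This paper studies code smells that arise when large language model (LLM) inference is integrated into software systems. It presents a taxonomy and catalogue of nine LLM code smells and introduces SpecDetect4LLM, a static source code analysis tool for detecting them. An empirical evaluation on 692 open-source projects (171,194 source files) finds that LLM code smells affect 73.5% of the analyzed systems, with a detection precision of 91.3% and a recall of 71.8%.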

AI中文摘要

大型语言模型（LLM）因其多功能性、灵活性以及在某种程度上模拟人类推理的能力，越来越多地被集成到软件系统中用于各种目的。然而，源代码中LLM推理的糟糕集成可能会损害软件系统的质量。因此，必须记录不充分的LLM集成编码实践，以帮助开发者缓解此类问题。基于我们先前关于LLM代码异味的工作，本文通过呈现一个自包含的分类体系和包含九种LLM代码异味的目录，巩固并完善了这一概念。我们还创建了SpecDetect4LLM，一个用于检测这些异味的静态源代码分析工具，并对其检测效果（精确率和召回率）以及LLM代码异味在692个开源软件项目（171,194个源文件）中的普遍性进行了广泛的实证评估。结果表明，LLM代码异味影响了73.5%的被分析系统，检测精确率为91.3%，召回率为71.8%。

具有扩散教师的期望方差缩减

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

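Affiliations: NVIDIA; University of Toronto; Princeton University

AI summary: This paper studies how to reduce gradient-estimator variance when a pretrained diffusion model serves as a frozen teacher for downstream pipelines such as text-to-3D, single-step distillation, and data attribution. It introduces CARV, a compute-aware variance-accounting framework that motivates a hierarchical Monte Carlo estimator: expensive upstream computation is amortized over cheap diffusion-noise resamples, combined with timestep importance sampling and a stratified inverse-CDF construction. In the text-to-3D and attribution experiments CARV yields 2-3x effective compute multipliers without changing the objective; in single-step distillation it cuts gradient variance by an order of magnitude without improving downstream FID, indicating that variance is no longer the bottleneck there.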

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

AI中文摘要

预训练的扩散模型作为冻结教师，为文本到3D、单步蒸馏和数据归因等下游流程提供支持。这些流程消耗的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛期望；其估计器方差主导了计算成本，因为每次抽取都需要昂贵的上游工作（渲染、模拟、编码）。我们引入了CARV，一个计算感知的方差核算框架，它激发了一种分层蒙特卡洛估计器：通过廉价的扩散噪声重采样来摊销昂贵的上游计算，并通过时间步重要性采样和分层逆CDF构造加以强化。在我们的文本到3D蒸馏和归因实验中，CARV在不改变目标的情况下提供了2-3倍的有效计算乘数（主要来自摊销重用；约25%来自IS+分层）；在单步蒸馏中，相同的技术将梯度方差降低了一个数量级，但并未改善下游FID，标志着MC方差不再是瓶颈的区间。

英文摘要

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
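The hierarchical estimator described above can be illustrated with a small sketch. It is not the CARV implementation: the function names, the scalar stand-ins for the upstream computation and the teacher gradient, and the uniform target distribution over timesteps are all assumptions made for illustration. It shows one expensive upstream call being reused across several cheap noise resamples, with timesteps drawn by stratified inverse-CDF sampling from a proposal `q` and reweighted by `p / q`.

```python
import numpy as np


def stratified_timesteps(q, k, rng):
    """Draw k timestep indices from pmf q, one per equal-probability stratum."""
    u = (np.arange(k) + rng.random(k)) / k  # one uniform draw inside each stratum
    cdf = np.cumsum(q)
    idx = np.searchsorted(cdf, u, side="right")  # inverse CDF lookup
    return np.minimum(idx, len(q) - 1)  # guard against float rounding at the top


def hierarchical_estimate(upstream, teacher_grad, q, n_outer, k_inner, rng):
    """Average a teacher signal, amortizing each upstream call over k_inner resamples."""
    p = np.full(len(q), 1.0 / len(q))  # assumed target: uniform over timesteps
    total = 0.0
    for _ in range(n_outer):
        x = upstream(rng)  # expensive step (rendering, simulation, ...), run once
        ts = stratified_timesteps(q, k_inner, rng)
        eps = rng.standard_normal(k_inner)  # cheap noise resamples
        weights = p[ts] / q[ts]  # importance weights for the proposal q
        total += np.mean(weights * teacher_grad(x, ts, eps))
    return total / n_outer


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q = np.array([0.4, 0.3, 0.2, 0.1])  # hypothetical proposal over 4 timesteps
    est = hierarchical_estimate(
        lambda r: r.standard_normal(),            # stand-in upstream computation
        lambda x, ts, eps: x + ts + 0.1 * eps,    # stand-in teacher gradient
        q, n_outer=100, k_inner=8, rng=rng,
    )
    print(est)
```

The design choice being illustrated is the split between the outer loop, whose cost is dominated by `upstream`, and the inner resamples, which are cheap by comparison.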

2605.21071 2026-05-25 cs.CL cs.AI 版本更新

Fine-grained Claim-level RAG Benchmark for Law

细粒度声明级法律RAG基准

Souvick Das, Sallam Abualhaija, Domenico Bianculli

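Affiliations: University of Luxembourg

AI summary: This paper introduces ClaimRAG-LAW, a fine-grained benchmark dataset for legal retrieval-augmented generation (RAG) that supports English and French, targets both legal experts and non-experts, and covers diverse question types reflecting realistic scenarios. Using a fine-grained evaluation framework, the authors analyze state-of-the-art legal RAG systems at the retrieval, generation, and claim levels and reveal their limitations in the legal domain.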

AI中文摘要

大型语言模型（LLM）的快速进展正在将语义搜索转向问答范式，用户提出问题，LLM生成回答。在法律等高风险领域，检索增强生成（RAG）通常用于减轻生成回答中的幻觉。然而，先前的研究表明，无论是通用还是法律专用的RAG系统，仍然以不同速率产生幻觉，这使得细粒度评估变得至关重要。尽管有需求，现有的法律RAG系统评估框架缺乏分别对检索和生成性能进行详细分析所需的粒度。此外，当前的基准主要是英文且集中于法律专家查询，忽视了非专家需求。我们引入了ClaimRAG-LAW，一个全面的法律RAG数据集，支持法语和英语，面向专家和非专家，并包含反映现实场景的多样化问题类型。我们进一步应用细粒度评估框架对最先进的法律RAG系统进行评估，揭示了法律领域在检索、生成和声明级分析方面的局限性。

英文摘要

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stake domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework of state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.

2605.20919 2026-05-25 cs.LG cs.AI cs.PL 版本更新

Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures

Sutra: 以张量操作RNN作为向量符号架构的编译目标

Emma Leonhart

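Affiliations: Emma Leonhart

AI summary: Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler lowers the whole program, including primitives, control flow, and string I/O, to one fused tensor-op graph over a frozen embedding substrate. The paper reports 100% bundle-decoding accuracy through width k=8 on four frozen embeddings and shows that autograd flows through the compiled graph, so the same artifact is both a logic program and a trainable neural network.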

Comments Modified NeurIPS submission, see AI declaration and replication materials at end of paper

AI中文摘要

Sutra是一种带类型的纯函数式编程语言，其编译后的前向传播是一个PyTorch神经网络。编译器将整个程序——包括原语、控制流、字符串I/O——通过beta归约降级为一个在冻结嵌入基质上的融合张量操作图。旋转绑定、解绑、捆绑、多项式Kleene三值逻辑以及尾递归循环均被降级为张量操作；Kleene连接词是在{-1, 0, +1}真值网格上精确的拉格朗日插值多项式。验证通过两种方式测试同一事实。(1) 同一程序在跨越两种模态的四个冻结嵌入上运行——三种文本编码器（nomic-embed-text、all-minilm、mxbai-embed-large）和一种蛋白质语言模型（ESM-2）——并在每个基质上以宽度k=8实现100%的解码准确率，而教科书式的Hadamard乘积已经崩溃（mxbai-embed-large上2.5%，all-minilm上7.5%）。(2) PyTorch自动求导流经实际编译的图：一个用.su编写的模糊规则分类器从随机初始化（18.7±9.5%；随机概率=20%，五类）通过反向传播经过发射图（符号源未修改）训练到100.0±0.0%（三个种子）。一个加权变体额外训练一个标量余弦增益，并将其作为数值字面量写回.su源文件；重新编译重现训练后的行为，每个logit误差约2e-7，因此训练后的模型本身是可读、可重编译的代码。因此，同一工件既是一个逻辑程序，也是一个可训练的神经网络。

英文摘要

Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program -- primitives, control flow, string I/O -- to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities -- three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) -- and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.
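The statement that the Kleene connectives are polynomials exact on the {-1, 0, +1} grid can be made concrete with a short sketch. It assumes strong Kleene semantics (AND as min, OR as max), which the abstract does not spell out, and it uses plain Python arithmetic rather than the Sutra compiler or PyTorch. All names are illustrative.

```python
from itertools import product

GRID = (-1, 0, 1)  # false, unknown, true


def basis(u, x):
    """Lagrange basis on nodes {-1, 0, 1}: 1 at x == u, 0 at the other two nodes."""
    if u == -1:
        return x * (x - 1) / 2
    if u == 0:
        return 1 - x * x
    return x * (x + 1) / 2


def interpolate(table):
    """Return the polynomial p(a, b) that matches table[(u, v)] on the 3x3 grid."""
    def p(a, b):
        return sum(
            table[(u, v)] * basis(u, a) * basis(v, b)
            for u, v in product(GRID, GRID)
        )
    return p


# Assumed semantics: strong Kleene AND / OR are min / max on the truth grid.
kleene_and = interpolate({(u, v): min(u, v) for u, v in product(GRID, GRID)})
kleene_or = interpolate({(u, v): max(u, v) for u, v in product(GRID, GRID)})

if __name__ == "__main__":
    for a, b in product(GRID, GRID):
        print(a, b, kleene_and(a, b), kleene_or(a, b))
```

Because the connectives use only addition and multiplication, the same expressions would also accept tensors, which is what makes a lowering to differentiable tensor operations plausible.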

2605.20896 2026-05-25 cs.CR cs.AI cs.LG 版本更新

TwinRouterBench：面向现实智能体LLM路由的快速静态与实时动态评估

Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi

Affiliations: Gradient; Soochow University; Independent Researcher; University of Southern California; Rice University; Carnegie Mellon University; Shanghai Jiao Tong University; University of California, Berkeley; University of the Chinese Academy of Sciences; University of California, Los Angeles

AI summary: This paper presents TwinRouterBench, a step-level benchmark for evaluating LLM routing policies in agentic settings, with a static track and a dynamic track. The static track pairs router-visible prefixes drawn from several tasks with an execution-verified target model tier and is scored by deterministic arithmetic; the dynamic track runs routers inside live agent execution and measures official task resolution and realized API spend. Together the two tracks support fast offline iteration followed by end-to-end validation.

AI中文摘要

LLM路由在长时任务（如编码智能体、深度研究系统和计算机使用智能体）中最为重要，其中单个用户请求会触发多次模型调用。将每次调用路由到最便宜的足够模型可以在不牺牲质量的情况下降低成本，然而现有的路由器基准仅评估一次性提示的路由。它们从未暴露中间智能体步骤中路由器可见的前缀，从未测试更便宜的替代品是否保留下游任务的成功，并且通常在评估时依赖在线LLM评判。我们引入了TwinRouterBench，一个具有两轨的步骤级路由基准。静态轨提供来自SWE-bench、BFCL、mtRAG、QMSum和PinchBench中520个实例的970个路由器可见前缀，每个前缀与在发布的降级和级联协议下估计的执行验证目标层级配对；评分是层级标签、轨迹成员资格和令牌成本的确定性算术，无需在线评估方LLM评判。动态轨提供一个工具，可在完整的500例SWE-bench验证集上运行路由器；本文报告了与静态SWE监督划分不相交的100例保留评估。每次LLM调用时，路由器从锁定池中选择一个具体模型，成功由官方任务解决率和实际API支出衡量。两轨支持快速离线迭代，随后在实时智能体执行下进行端到端验证。代码和数据可在https://github.com/CommonstackAI/TwinRouterBench获取。

英文摘要

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
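To give a feel for what "deterministic arithmetic over tier labels and token costs" might look like, here is a minimal sketch. It is not the benchmark's scoring rule: the `Prefix` record, the tier prices, and the two reported quantities are hypothetical, and trajectory membership is left out entirely.

```python
from dataclasses import dataclass

# Hypothetical price per 1K tokens for each model tier (0 = cheapest).
TIER_PRICE = {0: 0.2, 1: 1.0, 2: 5.0}


@dataclass
class Prefix:
    tokens: int        # tokens consumed by this model call
    target_tier: int   # cheapest tier assumed sufficient for this step
    chosen_tier: int   # tier the router actually picked


def score(prefixes):
    """Score routing decisions with plain arithmetic, no LLM judge involved."""
    n = len(prefixes)
    # A choice counts as sufficient if it is at or above the target tier.
    sufficient = sum(p.chosen_tier >= p.target_tier for p in prefixes)
    cost = sum(p.tokens / 1000 * TIER_PRICE[p.chosen_tier] for p in prefixes)
    oracle = sum(p.tokens / 1000 * TIER_PRICE[p.target_tier] for p in prefixes)
    return {"sufficient_rate": sufficient / n, "cost": cost, "oracle_cost": oracle}


if __name__ == "__main__":
    decisions = [Prefix(1000, 0, 0), Prefix(2000, 1, 2), Prefix(1000, 2, 1)]
    print(score(decisions))
```

In this toy scoring, over-routing shows up as cost above the oracle cost, while under-routing lowers the sufficient rate.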

2605.17637 2026-05-25 cs.AI 版本更新

WebGameBench: Requirement-to-Application Evaluation for Coding Agents via Browser-Native Games

WebGameBench: 通过浏览器原生游戏对编码代理进行需求到应用的评估

Wenyu Zhang, Guoliang You, Tianlun, Haotian Zhao, Tianshu Zhu, Haoran Wang, Xiaoxuan Tang, Mingyang Dai, Jingnan Gu, Daxiang Dong, Jianmin Wu

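Affiliations: Baidu; University of Science and Technology of China

AI summary: WebGameBench is a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed, and a runtime evaluator interacts with each delivered game in a real browser and labels it EXCELLENT, USABLE, or UNUSABLE. The best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate, showing a large gap between minimally playable delivery and complete requirement satisfaction. On a human-reviewed subset, the runtime labels are broadly aligned with human gameplay review under the Usable-rate criterion.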

Comments 19 pages, 6 figures

AI中文摘要

编码代理越来越多地被用作应用程序构建者，然而许多评估仍聚焦于源代码、仓库级测试或中间痕迹，而非交付的应用。我们引入WebGameBench，一个需求到应用的基准，评估编码代理能否将冻结的结构化Web游戏规范转化为可浏览器访问的游戏。浏览器原生游戏提供了一个紧凑但行为密集的测试平台：即使是简单的游戏也需要协调的输入处理、空间映射、规则执行、状态转换、终止条件、重启行为和可见反馈。在WebGameBench中，每个生成的工件在统一部署协议下被构建、服务并作为浏览器可访问的应用暴露。然后，运行时评估器在真实浏览器中与交付的游戏交互，并分配三类标签：优秀、可用或不可用。在人工审查的子集上，运行时标签与人类游戏审查在可用率标准下大致一致。在111个任务、12个编码代理和14个评估配置中，WebGameBench区分了当前系统：最佳配置达到76.9%的可用率，但仅有20.2%的优秀率。这一差距表明，跨越最低可玩交付阈值仍远未达到完全满足需求。据我们所知，WebGameBench是首个针对浏览器原生游戏交付的需求到应用基准，它在可用率标准下将交付应用的运行时标签与独立的人类游戏审查进行验证。

英文摘要

Coding agents are increasingly used as application builders, yet many evaluations still focus on source code, repository-level tests, or intermediate traces rather than the delivered application. We introduce WebGameBench, a requirement-to-application benchmark that evaluates whether coding agents can turn a frozen Structured WebGame Specification into a browser-accessible game. Browser-native games provide a compact but behavior-dense testbed: even simple games require coordinated input handling, spatial mapping, rule execution, state transitions, terminal conditions, restart behavior, and visible feedback. In WebGameBench, each generated artifact is built, served, and exposed as a browser-accessible application under a unified deployment protocol. A runtime evaluator then interacts with the delivered game in a real browser and assigns a three-way label: EXCELLENT, USABLE, or UNUSABLE. On a human-reviewed subset, the runtime label is broadly aligned with human gameplay review under the Usable-rate criterion. Across 111 tasks, 12 coding agents, and 14 evaluation configurations, WebGameBench separates current systems: the best configuration reaches a 76.9% usable rate but only a 20.2% excellent rate. This gap shows that crossing the minimum playable-delivery threshold is still far from complete requirement satisfaction. To our knowledge, WebGameBench is the first requirement-to-application benchmark for browser-native game delivery that validates delivered-application runtime labels against independent human gameplay review under the Usable-rate criterion.

2605.17468 2026-05-25 cs.HC cs.AI 版本更新

An Interpretable Closed-Loop Intelligent Tutoring System for Multimodal Affective Feedback in Asynchronous Presentation Training

一种可解释的闭环智能辅导系统，用于异步演讲训练中的多模态情感反馈

Hung-Yue Suen, Kuo-En Hung

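AI summary: This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice of on-camera oral presentation skills at scale. Built around a seven-dimensional Behaviorally Anchored Rating Scale (BARS), the system uses a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching, mapping facial, vocal, textual, and oculomotor features to traceable, evidence-based feedback. Trained on 10,360 MOOC video segments, it reaches scoring performance comparable to expert ratings, and in a 30-day pre-post study with 204 adult learners, participants improved significantly on all seven BARS dimensions.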

Comments 12 pages, 8 figures, IEEE Transactions on Learning Technologies, 2026

DOI: 10.1109/TLT.2026.3693864

AI中文摘要

本文提出了一种可解释的闭环智能辅导系统（ITS），支持大规模开发摄像机前口头演讲技能的反馈引导练习。该系统操作化了一个七维行为锚定评级量表（BARS），并实现了一个三层可解释反馈架构，该架构连接了与评分标准一致的多模态评分、观众感知的表达诊断以及检索增强的对话式辅导，以支持刻意练习。基于XGBoost骨干，该ITS将多模态输入（面部、声音、文本和眼动特征）映射为基于证据的反馈，这些反馈可以追溯到可观察的表现线索。在10,360个大规模开放在线课程（MOOC）视频片段上训练后，该系统实现了与专家评分相当的表现水平的评分标准一致评分（R2 = 0.48-0.61，Spearman's rho = 0.69-0.78，MAE = 0.43-0.57）。在204名成年学习者为期30天的练习窗口的前后验证研究中，参与者在所有七个BARS维度上表现出显著改善（Cohen's d = 0.39-0.90），在控制基线分数和人口统计学因素后，练习频率与后测成绩呈强正相关。结果展示了如何通过集成的反馈架构将多模态分析输出系统地转化为可观察的行为变化，推动了基于表现的能力的可解释和教学导向的ITS设计。

英文摘要

This paper presents an interpretable closed-loop Intelligent Tutoring System (ITS) that supports feedback-guided practice for developing on-camera oral presentation skills at scale. The system operationalizes a seven-dimensional Behaviorally Anchored Rating Scale (BARS) and implements a three-layer interpretable feedback architecture that connects rubric-aligned multimodal scoring, audience-perceived expressive diagnostics, and retrieval-augmented conversational coaching to support deliberate practice. Built on an XGBoost backbone, the ITS maps multimodal inputs (facial, vocal, textual, and oculomotor features) into evidence-based feedback that can be traced back to observable performance cues. Trained on 10,360 Massive Open Online Course (MOOC) video segments, the system achieved rubric-aligned scoring with performance levels comparable to expert ratings (R2 = 0.48-0.61, Spearman's rho = 0.69-0.78, MAE = 0.43-0.57). In a pre-post validation study with 204 adult learners over a 30-day practice window, participants demonstrated significant improvements across all seven BARS dimensions (Cohen's d = 0.39-0.90), with practice frequency showing a strong positive association with posttest performance after controlling for baseline scores and demographics. The results demonstrate how multimodal analytic outputs can be systematically transformed into observable behavioral change through an integrated feedback architecture, advancing explainable and pedagogically grounded ITS design for performance-based competencies.


2605.17076 2026-05-25 cs.LG cs.AI cs.DC cs.MA 版本更新

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

S-Bus: 多智能体LLM状态协调的自动读集重建

Sajjad Khan

发表机构 * Sajjad Khan

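AI summary: This paper presents S-Bus, an HTTP middleware for concurrency control among multi-agent LLMs that share mutable state, targeting settings where agents cannot declare their read sets. Its central mechanism, the DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic, providing a consistency guarantee called Observable-Read Isolation (ORI) that prevents Structural Race Conditions in dedicated-shard topologies. Contributions include mechanised formal verification, an empirical safety comparison against PostgreSQL and Redis, and an analysis of ORI's semantic impact across workloads.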

Comments v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files

详情

AI中文摘要

我们解决了通过HTTP共享可变状态的LLM智能体的并发控制问题，其中智能体无法被修改以声明读集。S-Bus是一个HTTP中间件，其核心机制——服务端DeliveryLog——在提交时从观察到的HTTP GET流量中重建每个智能体的读集。它提供的一致性属性——可观测读隔离（ORI），一种基于HTTP可观测读投影的部分因果一致性——防止了专用分片拓扑中的结构性竞态条件。三项贡献：（C1）DeliveryLog机制，具有三层机械化证据：TLAPS证明了ReadSetSoundness和ORICommitSafety（基于一个类型公理）；N=3时的穷举TLC探索了20,763,484个状态，零违规；Dafny验证了9个归纳引理。（C2）与PostgreSQL 17 SERIALIZABLE和Redis 7 WATCH/MULTI的经验安全对等：在884,110次提交尝试中（其中427,308次处于活跃争用下）零Type-I损坏。（C3）ORI在专用分片工作负载中语义中性，但在单分片协作写入中有害，因为保留传播并发矛盾。 v2更新：PH-3 LLM评判器现在已针对人类标注者（Zahid Hussain, Mindgigs Peshawar）在400个（步骤，分片）对上进行独立验证，严格kappa=0.93（n=93，原始一致性96.8%）。LLM间评判器一致性为kappa=0.46（边界方差）。智能体自我报告高估分片使用量32%（LLM评判器）至49%（人类标注者）。SJ-v4语义质量评分标准仍为单评判器LLM-only。源代码、形式化证明、测试框架、标注数据：https://github.com/sajjadanwar0/sbus

英文摘要

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
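The read-set reconstruction idea can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class and method names are hypothetical, and the commit rule is a plain optimistic version check, which the abstract does not confirm to be the actual ORI rule. The point is only that the server records what each agent was served by GET and validates that record at commit time, so agents never declare a read set themselves.

```python
from collections import defaultdict


class DeliveryLog:
    """Toy server-side log (hypothetical): tracks shard versions served to each agent."""

    def __init__(self):
        self.versions = defaultdict(int)     # shard -> current committed version
        self.values = {}                     # shard -> current value
        self.delivered = defaultdict(dict)   # agent -> {shard: version served by GET}

    def get(self, agent, shard):
        # Record every observed GET; this is the only source of read-set information.
        self.delivered[agent][shard] = self.versions[shard]
        return self.values.get(shard)

    def commit(self, agent, shard, value):
        # Reconstruct the agent's read set from the log at commit time.
        read_set = self.delivered[agent]
        stale = [s for s, v in read_set.items() if self.versions[s] != v]
        if stale:
            # Reject: the agent acted on data that has since changed.
            # The stale entries stay in the log until the agent re-reads them.
            return False, stale
        self.values[shard] = value
        self.versions[shard] += 1
        self.delivered[agent].clear()
        return True, []


log = DeliveryLog()
log.get("agent_a", "plan")
log.get("agent_b", "plan")
first = log.commit("agent_b", "plan", "revised plan")   # accepted
second = log.commit("agent_a", "notes", "based on old plan")  # rejected, "plan" is stale
```

In this sketch a rejected agent has to issue a fresh GET for the stale shard before a retry can succeed, which is one simple way to keep a commit from building on an outdated read.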


2605.16799 2026-05-25 cs.LG cs.AI 版本更新

Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis

跨域分子关系学习：利用化学结构-活性分析

Peiliang Zhang, Jingling Yuan, Shiqing Wu, Mengqing Hu, Chao Che, Yongjun Zhu, Lin Li

发表机构 * Wuhan University of Technology（武汉理工大学）； Yonsei University（延世大学）； Hubei Key Laboratory of Transportation Internet of Things（湖北省交通运输物联网重点实验室）； State Key Laboratory of Silicate Materials for Architectures（建筑硅酸盐材料国家重点实验室）； City University of Macau（澳门城市大学）； Kyung Hee University（庆熙大学）； Dalian University（大连大学）

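AI summary: To address the lack of cross-domain modeling in molecular relational learning, this work proposes a cross-domain method grounded in structure-activity analysis. Its core is DisTrans, a Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy, which uses substructure topological discrepancies between domains to learn the domain dependence of molecular structures and aligns functional-group semantics between the source and target domains to improve cross-domain adaptation. In experiments under two typical cross-domain settings, DisTrans outperforms 16 baseline methods and generalizes well.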

Comments Accepted by SIGKDD 2026 Research Track

详情

AI中文摘要

分子表示的最新进展整合了分子拓扑和视觉模态，为精确的分子关系学习（MRL）开辟了新途径。现有的MRL方法专注于域内建模，其固有的域封闭效应限制了在分子科学中的适用性，特别是在阐明跨域相互作用机制方面。因此，跨域分子关系学习的必要性日益迫切。受益于结构-活性分析，我们提出了具有结构语义迁移差异的域对抗训练网络（DisTrans），以优化分子结构和视觉图像的跨域自适应表示。1）我们利用基于域间子结构拓扑差异的梯度反转策略来学习分子结构的域依赖性。该策略引导模型适应目标域中的结构邻接模式，生成域可分离的结构表示。2）我们应用跨域表示引导机制来对齐源域和目标域之间的官能团语义信息，学习跨域一致性信息。在两种典型跨域策略中的实验结果表明，DisTrans优于16种基线方法，即使在显著的域间差异下也能保持令人满意的性能。

英文摘要

Recent advances in molecular representation integrate molecular topological and visual modalities, opening new avenues for precise Molecular Relational Learning (MRL). Existing MRL methods focus on intra-domain modeling, and their inherent domain-closed effect limits applicability to molecular science, particularly in elucidating cross-domain interaction mechanisms. Consequently, the imperative for Cross-Domain Molecular Relational Learning has become increasingly pressing. Benefiting from structure-activity analysis, we propose the Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy (DisTrans) to optimize cross-domain adaptive representation for molecular structures and visual images. 1) We employ the gradient reversal strategy based on substructure topological discrepancies between domains to learn the domain dependence of molecular structures. This strategy guides the model to adapt to the structural adjacency patterns in the target domain, generating domain-separable structural representations. 2) We apply the cross-domain representation guidance mechanism to align the functional-group semantic information between the source and target domains, learning cross-domain consistency information. The experimental results in two typical cross-domain strategies demonstrate that DisTrans outperforms 16 baseline methods, maintaining satisfactory performance even under pronounced inter-domain discrepancy.


2605.16283 2026-05-25 cs.CY cs.AI 版本更新

How Do Mobile World Models Guide GUI Agents?

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

发表机构 * Nanyang Technological University（南洋理工大学）； MiLM Plus, Xiaomi Inc.（小米公司）； Independent Researchers（独立研究人员）； Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学人工智能学院）； Wuhan University（武汉大学）； Xiamen University（厦门大学）

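AI summary: This paper studies how mobile world models can guide GUI agents. Because it is unclear which predicted-state representation is useful, the authors train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. The models reach state-of-the-art results on MobileWorldBench and Code2WorldBench. Downstream evaluation shows that renderable code gives high in-distribution fidelity and effective multimodal supervision, that text-based feedback is more robust for out-of-distribution execution, and that world models help more as prior perception or training supervision than as universal post-hoc verifiers.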

详情

AI中文摘要

视觉语言模型的最新进展使移动GUI代理能够感知视觉界面并执行用户指令，但对于长期和高风险交互，动作后果的可靠预测仍然至关重要。现有的移动世界模型提供基于文本或基于图像的未来状态，但尚不清楚哪种表示有用，生成的rollout是否可以替代真实环境，以及测试时指导如何帮助不同强度的代理。为了回答上述问题，我们筛选并标注了移动世界模型数据，然后训练了四种模态的世界模型：增量文本、完整文本、基于扩散的图像和可渲染代码。这些模型在MobileWorldBench和Code2WorldBench上均达到了最先进性能。此外，通过在AITZ、AndroidControl和AndroidWorld上评估其下游效用，我们得到三个发现。首先，可渲染代码重建实现了高分布内保真度，并为数据构建提供了有效的多模态监督，而基于文本的反馈对于在线分布外执行更鲁棒。其次，世界模型生成的轨迹可以在训练过程中提供可迁移的交互经验，并提高代理的端到端任务性能，尽管这些数据不保留原始分布。最后，对于动作熵低的过度自信移动代理，后验自省提供的收益有限，这表明世界模型作为先验感知或训练监督比作为通用事后验证器更有效。

英文摘要

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.


2605.07717 2026-05-25 cs.SE cs.AI 版本更新

The AI-Native Large-Scale Agile Software Development Manifesto

AI原生大规模敏捷软件开发宣言

Ricardo Britto, Fredrik Palmgren, Nishrith Saini, Marcus Ohlin

发表机构 * Ericsson, Sweden（爱立信（瑞典））； Blekinge Institute of Technology, Sweden（布莱金厄技术学院（瑞典））

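AI summary: Although agile methods are widely adopted, true agility in large-scale software development remains hard to achieve. This paper proposes the AI-Native Large-Scale Agile Software Development Manifesto, which redefines how large-scale development is organized when AI is a first-class participant rather than a peripheral tool. Built on six principles, the manifesto aims to replace meeting-driven, document-heavy, sequential processes with intelligent, adaptive, continuously learning systems, improving organizational-level agility.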

详情

AI中文摘要

尽管敏捷方法被广泛采用，但在大规模实现真正的敏捷性仍然难以捉摸。大规模敏捷框架仍然以人为中心和手动为主，依赖协调会议、工件同步和基于角色的交接，这抑制了实时适应。与此同时，AI的快速进步，特别是大型语言模型，已经开始改变软件工程，但它们对组织级敏捷性的潜力仍未得到充分探索。我们提出了AI原生大规模敏捷软件开发宣言：一组价值观和原则，重新定义了当AI成为一等参与者而非外围工具时，大规模软件开发的组织方式。该宣言基于六项原则：并行流程、意图驱动团队、活知识、验证优先保障、编排的代理工作力和可重用蓝图，这些原则共同将开发从会议驱动、文档繁重、顺序的流程转变为智能、自适应、持续学习的系统。

英文摘要

Despite the widespread adoption of agile methods, achieving true agility at scale remains elusive. Large-scale agile frameworks remain largely human-centric and manual, relying on coordination meetings, artifact synchronization, and role-based handoffs that inhibit real-time adaptation. Meanwhile, rapid advances in AI, particularly large language models, have begun transforming software engineering, yet their potential for organizational-level agility remains underexplored. We present the AI-Native Large-Scale Agile Software Development Manifesto: a set of values and principles that redefine how large-scale software development is organized when AI becomes a first-class participant rather than a peripheral tool. The manifesto is grounded in six principles, parallel processes, intent-driven teams, living knowledge, verification-first assurance, orchestrated agent workforces, and reusable blueprints, that together shift development from a meeting-driven, document-heavy, sequential process to an intelligent, adaptive, continuously learning system.


2605.06936 2026-05-25 cs.AR cs.AI cs.MA 版本更新

Bridging the Last Mile of Circuit Design: PostEDA-Bench, a Hierarchical Benchmark for PPA Convergence and DRC Fixing

跨越电路设计的最后一英里：PostEDA-Bench，一个用于PPA收敛和DRC修复的分层基准

Pengju Liu, Nuo Xu, Jinwei Tang, Yu Cao, Caiwen Ding

发表机构 * University of Minnesota（明尼苏达大学）

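AI summary: This paper introduces PostEDA-Bench, a hierarchical benchmark for evaluating LLM-based agents on the "last mile" of Electronic Design Automation (EDA): fixing Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets. It contains 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, backed by EDA toolchains with machine-checkable evaluation. Experiments show that current LLM agents handle synthetic DRC and single-objective PPA tasks reasonably well but degrade sharply on the more practical DRC-Reasoning (best success rate 36.66%) and PPA-Multi (best success rate 20.00%) tasks, with trade-off reasoning as the main PPA-Multi bottleneck.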

详情

AI中文摘要

基于LLM的代理越来越多地应用于电子设计自动化（EDA）的“最后一英里”：修复工具运行后残留的签核设计规则检查（DRC）违规并收敛功耗-性能-面积（PPA）目标。然而，现有的EDA-LLM基准完全忽略了DRC修复，并依赖于与单一工具链绑定的扁平层次结构。我们引入了PostEDA-Bench，这是一个分层基准，包含145个任务，涵盖DRC-Essential、DRC-Reasoning、PPA-Mono和PPA-Multi，由支持机器可检查评估的EDA工具链提供支持。在多个代理框架下的八个商业和开源LLM中，我们发现代理能够较好地处理合成DRC-Essential和单目标PPA-Mono任务，但在更实际的DRC-Reasoning（最佳成功率为36.66%）和PPA-Multi（最佳成功率为20.00%）上性能急剧下降；视觉增强始终提升DRC-Bench性能；而权衡推理（而非旋钮知识）是PPA-Multi的主要瓶颈。

英文摘要

LLM-based agents are increasingly applied to the "last mile" of Electronic Design Automation (EDA): repairing residual sign-off Design Rule Check (DRC) violations and converging Power-Performance-Area (PPA) targets after tool runs. Existing EDA-LLM benchmarks, however, omit DRC fixing entirely and rely on flat hierarchies tied to a single toolchain. We introduce PostEDA-Bench, a hierarchical benchmark with 145 tasks across DRC-Essential, DRC-Reasoning, PPA-Mono, and PPA-Multi, supported by EDA toolchains with machine-checkable evaluation. Across eight commercial and open-source LLMs under multiple agent scaffolds, we find that agents handle synthetic DRC-Essential and single-objective PPA-Mono reasonably well but degrade sharply on the more practical DRC-Reasoning, where the best success rate is 36.66%, and PPA-Multi, where the best success rate is 20.00%; vision augmentation consistently enhances DRC-Bench; and trade-off reasoning, rather than knob knowledge, is the dominant PPA-Multi bottleneck.


2605.06840 2026-05-25 cs.AI 版本更新

Extracting Search Trees from LLM Reasoning Traces Reveals Myopic Planning

从LLM推理轨迹中提取搜索树揭示短视规划

Sixing Chen, Ji-An Li, Saner Cakir, Sinan Akcali, Kayla Lee, Marcelo G. Mattar

发表机构 * Generality Inc.（Generality公司）

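AI summary: By extracting search trees from LLM reasoning traces in the four-in-a-row board game, this study shows that LLM planning is myopic. Although the traces expand deep nodes, move choices are best explained by shallow search, and performance is predicted by search breadth rather than depth; human performance, by contrast, is driven mainly by deep search. The findings identify a key difference between LLM and human planning and offer targeted guidance for improving LLM planning.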

详情

AI中文摘要

大型语言模型（LLMs），尤其是推理模型，会生成扩展的思维链（CoT）推理，其中通常包含对未来结果的明确思考。然而，这种思考是否构成真正的规划、其结构如何以及哪些方面驱动性能仍不清楚。在这项工作中，我们引入了一种新方法，通过从四子棋游戏的推理轨迹中提取和量化搜索树来表征LLM规划。通过将计算模型拟合到提取的搜索树上，我们表征了规划的结构及其如何影响移动决策。我们发现LLM的搜索比人类更浅，性能由搜索广度而非深度预测。最引人注目的是，尽管LLM在轨迹中扩展了深层节点，但其移动选择最好由一个完全忽略这些节点的短视模型解释。一项因果干预研究（我们选择性剪枝CoT段落）进一步表明，移动选择主要由浅层节点而非深层节点驱动。这些模式与人类规划形成对比，在人类规划中，性能主要由深度搜索驱动。总之，我们的发现揭示了LLM与人类规划之间的关键差异：虽然人类专业知识由更深层次的搜索驱动，但LLM并不基于深层前瞻行动。这种分离为对齐LLM和人类规划提供了有针对性的指导。更广泛地说，我们的框架提供了一种可推广的方法，用于解释跨战略领域LLM规划的结构。

英文摘要

Large language models (LLMs), especially reasoning models, generate extended chain-of-thought (CoT) reasoning that often contains explicit deliberation over future outcomes. Yet whether this deliberation constitutes genuine planning, how it is structured, and what aspects of it drive performance remain poorly understood. In this work, we introduce a new method to characterize LLM planning by extracting and quantifying search trees from reasoning traces in the four-in-a-row board game. By fitting computational models on the extracted search trees, we characterize how plans are structured and how they influence move decisions. We find that LLMs' search is shallower than humans', and that performance is predicted by search breadth rather than depth. Most strikingly, although LLMs expand deep nodes in their traces, their move choices are best explained by a myopic model that ignores those nodes entirely. A causal intervention study where we selectively prune CoT paragraphs further suggests that move selection is driven predominantly by shallow rather than deep nodes. These patterns contrast with human planning, where performance is driven primarily by deep search. Together, our findings reveal a key difference between LLM and human planning: while human expertise is driven by deeper search, LLMs do not act on deep lookahead. This dissociation offers targeted guidance for aligning LLM and human planning. More broadly, our framework provides a generalizable approach for interpreting the structure of LLM planning across strategic domains.


2605.06094 2026-05-25 cs.CV cs.AI 版本更新

ProtDBench: A Unified Benchmark for Protein Binder Design and Evaluation

Cong Liu, Milong Ren, Jiaqi Guan, Chengyue Gong, Jinyuan Sun, Xinshi Chen, Wenzhi Xiao

发表机构 * AMLab, AI4Science Lab, University of Amsterdam, Amsterdam, The Netherlands（AM实验室、AI4Science实验室、阿姆斯特丹大学、阿姆斯特丹、荷兰）

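AI summary: This paper proposes ProtDBench, a unified benchmark framework for protein binder design and evaluation, addressing the difficulty of comparing reported metrics across studies that use non-standardized evaluation protocols. It defines standardized tasks, evaluation protocols, and success criteria, adds throughput-aware metrics based on a fixed 24-hour budget and cluster-level criteria for structural diversity, and shows how the choice of verifier and filtering rules affects measured performance. ProtDBench provides a fair, reproducible pipeline for systematic comparison of binder design methods under realistic settings.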

详情

AI中文摘要

近年来，从头蛋白质结合物设计的进展使得越来越多的实验验证成为可能，但由于缺乏标准化的评估协议，报道的计算指标仍然难以解释或跨研究比较。我们引入了ProtDBench，一个标准化且考虑通量的蛋白质结合物设计评估框架。ProtDBench定义了统一的基准任务、评估协议和成功标准，能够系统分析评估设计如何影响观察到的性能。利用一个大型湿实验标注数据集，我们分析了常用的结构预测模型作为评估验证器，揭示了在相同过滤协议下显著的验证器依赖偏差和有限的一致性。然后，我们在固定评估协议下，针对十个不同的蛋白质靶点，对代表性的开源生成式结合物设计方法进行了基准测试。除了每条序列的成功率外，ProtDBench还基于固定的24小时预算纳入了考虑通量的指标，以及考虑结构多样性的聚类级成功标准。总之，这些结果揭示了过滤规则、成功定义以及考虑通量的评估在计算效率、成功率和结构多样性之间引起的系统性差异。总体而言，ProtDBench提供了一个公平且可复现的评估流程，支持在现实评估设置下对蛋白质结合物设计方法进行系统且受控的比较。

英文摘要

Recent advances in de novo protein binder design have enabled increasing experimental validation, yet reported in silico metrics remain difficult to interpret or compare across studies due to non-standardized evaluation protocols. We introduce ProtDBench, a standardized and throughput-aware evaluation framework for protein binder design. ProtDBench defines unified benchmark tasks, evaluation protocols, and success criteria, enabling systematic analysis of how evaluation design influences observed performance. Using a large wet-lab annotated dataset, we analyze commonly used structure prediction models as evaluation verifiers, revealing substantial verifier-dependent bias and limited agreement under identical filtering protocols. We then benchmark representative open-source generative binder design methods across ten diverse protein targets under a fixed evaluation protocol. Beyond per-sequence success rates, ProtDBench incorporates throughput-aware metrics based on a fixed 24-hour budget, as well as cluster-level success criteria to account for structural diversity. Together, these results expose systematic differences induced by filtering rules, success definitions, and throughput-aware evaluation between computational efficiency, success rate, and structural diversity. Overall, ProtDBench provides a fair and reproducible evaluation pipeline that supports systematic and controlled comparison of protein binder design methods under realistic evaluation settings.


2605.02087 2026-05-25 cs.AI 版本更新

AI Evaluation Should Require Standardized Item-Level Data Release

Han Jiang, Susu Zhang, Dongyao Zhu, Yuzhuo Bai, Sang T. Truong, Xiaoyuan Yi, Sanmi Koyejo, Xing Xie, Ziang Xiao

发表机构 * Johns Hopkins University（约翰霍普金斯大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Microsoft Research Asia（微软亚洲研究院）； Stanford University（斯坦福大学）； North Carolina State University（北卡罗来纳州立大学）； Tsinghua University（清华大学）

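AI summary: This position paper argues that standardized item-level benchmark data should be the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization, which the authors trace to a misplaced focus on aggregate model scores. They argue that valid evaluation design requires empirical evidence from item-level model responses, and that standardized release of such data enables transparency, replicability, and auditability. To show feasibility, they build OpenEval, an item-level archive of 10M responses across 155k items, and demonstrate how item-level data identifies low-quality items, documents construct misalignment, and recovers validity evidence about benchmark structure.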

详情

AI中文摘要

这篇立场论文认为，标准化的项目级基准数据应成为AI评估的默认基础设施。当前的评估存在项目选择不明确、构造错位和泛化能力差的问题。这些失败的根本原因在于对聚合模型分数的错误关注。没有项目级证据，有效性声明无法评估，导致能力声明夸大、研究方向错误以及对已部署系统的不当信任。我们的立场是，设计有效的评估需要来自项目级模型响应的实证证据，并且此类数据的标准化发布应被视为核心AI评估基础设施。此外，这种发布能够实现评估结果的透明度、可复制性和可审计性。为了展示这一规范既可行又重要，我们构建了OpenEval，这是一个包含来自广泛使用基准的15.5万个项目的1000万条响应的项目级档案，采用AI评估社区可以发展的统一模式。我们展示了项目级数据如何识别低质量项目、记录构造错位以及恢复关于基准内部结构的有效性证据。我们解决了关于污染和作者负担的反对意见，并表明每个问题相对于基于不可信声明做出的决策成本而言都是可处理的。

英文摘要

This position paper argues that standardized item-level benchmark data should become the default infrastructure for AI evaluation. Current evaluations suffer from underspecified item selection, construct misalignment, and poor generalization. The root cause of these failures is a misplaced focus on aggregate model scores. Without item-level evidence, validity claims cannot be assessed, resulting in inflated capability claims, misdirected research, and unwarranted trust in deployed systems. Our position is that designing valid evaluations requires empirical evidence from item-level model responses, and the standardized release of such data should be treated as core AI evaluation infrastructure. Such a release, in addition, enables transparency, replicability, and auditability of evaluation results. To show the norm is both feasible and consequential, we construct OpenEval, an item-level archive of 10M responses across 155k items from widely-used benchmarks, under a unified schema that the AI evaluation community can develop upon. We demonstrate how item-level data can identify low-quality items, document construct misalignment, and recover validity evidence about benchmarks' internal structure. We address objections around contamination and author burden, and show each is tractable relative to the cost of decisions made on claims that cannot be trusted.


2604.00003 2026-05-25 cs.CL cs.AI cs.IR 版本更新

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

使用本地大语言模型和布局感知解析的表格PDF信息提取：可靠性评估

Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias, Kurnia Adi Cahyanto, Azhar Al Afghani, Musfi Yuliadi

Affiliations: Faculty of Engineering, Universitas Swadaya Gunung Jati; University of California, Irvine; UNIR, La Rioja; Universitas Diponegoro

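AI summary: This study evaluates the reliability of extracting structured information from tabular academic PDF documents, using Indonesian higher-education course registration documents (KRS) as a case study. It compares three approaches: LLM-only, hybrid deterministic-LLM (regex plus LLM), and a Camelot-based pipeline with LLM fallback. The hybrid approach can be more efficient than LLM-only for deterministic metadata, while the Camelot-based pipeline with LLM fallback gives the best combination of accuracy and computational efficiency, which suits compute-constrained environments.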

Comments 9 pages, 5 figures, 3 tables

详情

AI中文摘要

从学术PDF文档中提取结构化信息并非易事：单页通常结合自由文本元数据和表格区域，存在跨程序变化，并容易受到干扰下游解析的Unicode编码伪影的影响。本研究以印度尼西亚高等教育的学术课程注册文档（Kartu Rencana Studi或KRS）为案例，评估了表格PDF文档信息提取方法的可靠性。比较了三种策略：纯LLM、混合确定性-LLM（正则表达式和LLM）以及基于Camelot的管道（带LLM回退）。实验在140份文档（基于LLM的测试）和860份文档（基于Camelot的管道评估）上进行，涵盖四个学习项目，包含表格和元数据中的不同数据。使用Ollama和消费级CPU（无GPU）本地运行了三个12-14B的LLM模型（Gemma 3、Phi 4和Qwen 2.5）。评估使用了精确匹配（EM）和Levenshtein相似度（LS）指标，阈值为0.7。尽管并非适用于所有模型，但结果表明，与纯LLM相比，混合方法可以提高效率，尤其是对于确定性元数据。基于Camelot的管道（带LLM回退）在准确性（EM和LS高达0.99-1.00）和计算效率（大多数情况下每个PDF不到1秒）方面取得了最佳组合。Qwen 2.5:14b模型在所有场景中表现最一致。这些发现证实，在计算受限的环境中，将确定性和基于LLM的方法相结合是从基于文本的表格PDF文档中提取信息的可靠且高效的策略。

英文摘要

Extracting structured information from academic PDF documents is nontrivial: a single page typically combines free-text metadata with tabular regions, exhibits cross-program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM-only, hybrid deterministic-LLM (regex and LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLM models (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama and a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM-only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM-based methods is a reliable and efficient strategy for information extraction from tabular text-based PDF documents in computationally constrained environments.
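The two evaluation metrics named in the abstract can be sketched as follows. This is an illustration under stated assumptions, not the authors' evaluation code: the abstract does not say how Levenshtein similarity is normalized, so the sketch divides the edit distance by the longer string's length, and the function names and the whitespace stripping are hypothetical choices.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance (insert, delete, substitute), computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))


def score_field(pred: str, gold: str, threshold: float = 0.7) -> dict:
    """Score one extracted field with exact match (EM) and thresholded LS."""
    pred, gold = pred.strip(), gold.strip()
    ls = levenshtein_similarity(pred, gold)
    return {"em": pred == gold, "ls": ls, "ls_pass": ls >= threshold}


result = score_field("Algoritma dan Pemrograman", "Algoritma dan Pemrogramam")
```

A field with a single wrong character fails exact match but still clears the 0.7 similarity threshold, which is why reporting both metrics gives a fuller picture than either alone.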


2603.19310 2026-05-25 cs.LG cs.AI 版本更新

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

MemReward: 基于图的经验记忆用于有限标签下的LLM奖励预测

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Meta

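AI summary: This paper proposes MemReward, a graph-based experience memory framework that improves reward prediction for LLMs when labels are limited. Rollouts from an initial policy (thinking processes and final answers) are stored as nodes in a heterogeneous graph, and a graph neural network (GNN) propagates rewards from labeled to unlabeled rollouts during online policy optimization. With ground-truth rewards on only 20% of rollouts, MemReward approaches Oracle performance on mathematics, question answering, and code generation (96.6% on Qwen2.5-1.5B and 97.3% on 3B).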

详情

AI中文摘要

强化学习已成为改进大型语言模型推理能力的强大范式，其中从策略中采样rollout，并利用在这些rollout上计算的奖励信号来更新策略。然而，在数据稀缺的场景中，大规模获取ground-truth标签以验证rollout通常需要昂贵的人工标注或劳动密集型的专家验证。例如，评估数学证明需要专家评审，而开放式问答缺乏确定的ground-truth。当ground-truth标签稀缺时，强化学习微调的有效性受到限制。受半监督学习在将标签从标注样本传播到未标注样本方面成功的启发，我们提出了MemReward，一种基于图的经验记忆框架，将奖励传播直接集成到在线策略优化中。MemReward将来自初始LLM策略的rollout（思考过程和最终答案）存储为异构图中的节点，这些节点通过相似性和结构边连接，图神经网络通过该图将奖励从标注rollout传播到未标注rollout。为了训练这样的框架，我们首先在标注rollout上预热GNN，通过查询、思考和答案节点的异质聚合来预测奖励。在在线RL微调期间，未标注rollout通过查询相似性附加到图中，GNN预测它们的奖励，从而产生一种结合ground-truth和GNN预测奖励的混合奖励获取策略。在Qwen2.5-1.5B和3B上的数学、问答和代码生成实验表明，MemReward仅使用20% rollout的ground-truth奖励，就在1.5B上达到Oracle性能的96.6%，在3B上达到97.3%，并在域外任务上接近Oracle。

英文摘要

Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled from the policy and reward signals computed on those rollouts are used to update the policy. However, in data-scarce scenarios, obtaining ground-truth labels to verify rollouts at scale often requires expensive human annotation or labor-intensive expert verification. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning is constrained. Inspired by the success of semi-supervised learning in propagating labels from labeled to unlabeled samples, we propose MemReward, a graph-based experience memory framework that integrates reward propagation directly into online policy optimization. MemReward stores rollouts (thinking processes and final answers) from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train such a framework, we first warm up the GNN on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards. Experiments on Qwen2.5-1.5B and 3B in mathematics, question answering, and code generation demonstrate that MemReward, with ground-truth rewards on only 20% of rollouts, achieves 96.6% of Oracle performance on 1.5B and 97.3% on 3B, and closely approaches Oracle on out-of-domain tasks.
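The idea of propagating rewards from labeled to unlabeled rollouts can be illustrated with a much simpler stand-in than the paper's GNN. The sketch below swaps in similarity-weighted averaging over the nearest labeled neighbors, a plain label-propagation heuristic; the function names, the toy embeddings, and the choice of cosine similarity are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def propagate_rewards(embeddings, labeled, k=2):
    """Estimate rewards for unlabeled rollouts from their k most similar labeled ones.

    embeddings: {rollout_id: vector}, labeled: {rollout_id: ground-truth reward}.
    Returns a reward for every rollout; labeled rewards are kept unchanged.
    """
    rewards = dict(labeled)
    for rid, vec in embeddings.items():
        if rid in labeled:
            continue
        # Rank labeled rollouts by similarity and keep the top k.
        neighbors = sorted(((cosine(vec, embeddings[l]), l) for l in labeled),
                           reverse=True)[:k]
        weights = [max(sim, 0.0) for sim, _ in neighbors]
        total = sum(weights)
        if total == 0.0:
            rewards[rid] = 0.0  # no similar labeled rollout: fall back to zero
        else:
            rewards[rid] = sum(w * labeled[l] for w, (_, l) in zip(weights, neighbors)) / total
    return rewards


embeddings = {"r1": [1.0, 0.0], "r2": [0.0, 1.0], "r3": [1.0, 0.1]}
labeled = {"r1": 1.0, "r2": 0.0}
rewards = propagate_rewards(embeddings, labeled)
```

Here the unlabeled rollout `r3` sits close to the correct rollout `r1`, so its estimated reward is pulled toward 1. A learned GNN over query, thinking, and answer nodes, as described in the abstract, would replace this fixed averaging rule.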


2603.18123 2026-05-25 eess.IV cs.AI 版本更新

Understanding Task Aggregation for Generalizable Ultrasound Foundation Models

理解可泛化超声基础模型的任务聚合

Fangyijie Wang, Tanya Akumu, Vien Ngoc Dang, Amelia Jiménez-Sánchez, Jieyun Bai, Guénolé Silvestre, Karim Lekadir, Kathleen M. Curran

Affiliations: Research Ireland Centre for Research Training in Machine Learning; Departament de Matemàtiques i Informàtica, Universitat de Barcelona, Barcelona, Spain; School of Medicine, University College Dublin, Dublin, Ireland; School of Computer Science, University College Dublin, Dublin, Ireland; Institució Catalana de Recerca i Estudis Avançats (ICREA); Department of Cardiovascular Surgery, The First Affiliated Hospital of Jinan University, Jinan University, Guangzhou, China; Auckland Bioengineering Institute, University of Auckland, Auckland, New Zealand

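AI summary: This study examines how to aggregate multiple clinical tasks in general-purpose ultrasound foundation models and how the aggregation strategy affects performance. The authors hypothesize that performance degradation stems not from limited model capacity but from aggregation strategies that ignore the interaction between task heterogeneity and training data scale. They introduce M2DINO, a multi-organ, multi-task framework built on DINOv3, and find across 27 tasks that aggregation effectiveness depends strongly on data scale: all-task unified training performs more consistently across clinical groups, while clinically-grouped training can cause substantial negative transfer in low-data settings. Task types also differ in sensitivity, with segmentation showing the largest drops.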

详情

AI中文摘要

基础模型有望在单一框架内统一多个临床任务，但最近的超声研究报告称统一模型可能不如特定任务基线。我们假设这种退化并非源于模型容量限制，而是由于任务聚合策略忽略了任务异质性与可用训练数据规模之间的相互作用。在这项工作中，我们系统分析了何时可以联合学习异质超声任务而不损失性能，为统一临床成像模型中的任务聚合建立了实用标准。我们引入了M2DINO，一个基于DINOv3的多器官、多任务框架，配备任务条件专家混合模块以实现自适应容量分配。我们系统评估了涵盖分割、分类、检测和回归的27项超声任务，采用三种范式：特定任务、临床分组和全任务统一训练。结果表明，聚合效果强烈依赖于训练数据规模。虽然临床分组训练可以在数据丰富的环境中提高性能，但在低数据环境中可能引发显著的负迁移。相比之下，全任务统一训练在临床组间表现出更一致的性能。我们进一步观察到，在我们的实验中，任务敏感性因任务类型而异：与回归和分类相比，分割显示出最大的性能下降。这些发现为超声基础模型提供了实用指导，强调聚合策略应同时考虑训练数据可用性和任务特性，而非仅依赖临床分类。

英文摘要

Foundation models promise to unify multiple clinical tasks within a single framework, but recent ultrasound studies report that unified models can underperform task-specific baselines. We hypothesize that this degradation arises not from model capacity limitations, but from task aggregation strategies that ignore interactions between task heterogeneity and available training data scale. In this work, we systematically analyze when heterogeneous ultrasound tasks can be jointly learned without performance loss, establishing practical criteria for task aggregation in unified clinical imaging models. We introduce M2DINO, a multi-organ, multi-task framework built on DINOv3 with task-conditioned Mixture-of-Experts blocks for adaptive capacity allocation. We systematically evaluate 27 ultrasound tasks spanning segmentation, classification, detection, and regression under three paradigms: task-specific, clinically-grouped, and all-task unified training. Our results show that aggregation effectiveness depends strongly on training data scale. While clinically-grouped training can improve performance in data-rich settings, it may induce substantial negative transfer in low-data settings. In contrast, all-task unified training exhibits more consistent performance across clinical groups. We further observe that task sensitivity varies by task type in our experiments: segmentation shows the largest performance drops compared with regression and classification. These findings provide practical guidance for ultrasound foundation models, emphasizing that aggregation strategies should jointly consider training data availability and task characteristics rather than relying on clinical taxonomy alone.


2603.17879 2026-05-25 cs.CV cs.AI 版本更新

Anatomy-Guided Vision-Language Learning with Angular Prototype Separation for Multi-Label Video Capsule Endoscopy Classification Under Class Imbalance

解剖引导的视觉-语言学习与角度原型分离用于类别不平衡下的多标签视频胶囊内镜分类

Podakanti Satyajith Chary, Nagarajan Ganapathy

Affiliations: Department of Engineering Science, IIT Hyderabad; Department of Biomedical Engineering, IIT Hyderabad

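AI summary: This paper presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that targets the extreme class imbalance in the Galar dataset through two main contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. Built on BiomedCLIP, it fuses three consecutive frames with a Local Differencing Attention module to amplify transient pathological signals and uses an Anatomy Context Head to condition pathology predictions on soft anatomical activations. On the held-out RARE-VISION test set it reaches an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, relative improvements of 46% and 44% over the prior submission.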

Comments 12 pages, 1 figure, ICPR 2026 RARE-VISION Competition

详情

AI中文摘要

本文提出一个多标签时间事件检测框架用于视频胶囊内镜（VCE），通过结合两个主要贡献来解决Galar数据集固有的极端类别不平衡问题：类原型上的角度分离损失和生物状态机时间解码器。主干网络保持为BiomedCLIP，一个生物医学视觉-语言基础模型。三个连续帧通过局部差分注意力模块融合，该模块通过抑制静态时间冗余来放大瞬态病理信号。然后，解剖上下文头将病理预测条件化于软解剖激活上，利用已知的胃肠道发现空间共现结构。可学习的文本特征提示和基于原型的logit增强与角度分离损失一起训练，该损失惩罚类原型之间的非对角线余弦相似度，防止在极端不平衡下影响罕见类的原型崩溃。为抵消倾斜的标签分布，训练方案结合了非对称焦点损失、逆频率加权采样、时间混合、指数移动平均和每类阈值校准。生物状态机解码器用基于解剖标签的生理学基础前向状态转换替代朴素间隙合并，消除了先前方法中每视频产生数百个虚假解剖事件的碎片化伪影，并将每视频解剖输出减少到2-3个临床现实事件。在包含三个NaviCam检查（161,025帧）的保留RARE-VISION测试集上，更新后的管道实现了整体时间mAP@0.5为0.3597，mAP@0.95为0.3399，相比先前提交分别相对提升46%和44%，总推理时间在单个GPU上约21分钟完成。

英文摘要

This work presents a multi-label temporal event detection framework for video capsule endoscopy (VCE) that addresses the extreme class imbalance inherent in the Galar dataset by combining two principal contributions: an Angular Separation Loss on class prototypes and a Biological State Machine temporal decoder. The backbone remains BiomedCLIP, a biomedical vision-language foundation model. Three consecutive frames are fused through a Local Differencing Attention module that amplifies transient pathological signals by suppressing static temporal redundancy. An Anatomy Context Head then conditions pathological predictions on soft anatomical activations, exploiting the known spatial co-occurrence structure of GI findings. Learnable text-feature prompts and prototype-based logit augmentation are trained alongside an Angular Separation Loss that penalizes off-diagonal cosine similarity between class prototypes, preventing the prototype collapse that afflicts rare classes under extreme imbalance. To counteract the skewed label distribution, the training regime combines asymmetric focal loss, inverse-frequency weighted sampling, temporal Mixup, Exponential Moving Average, and per-class threshold calibration. The Biological State Machine decoder replaces naive gap merging with a physiologically grounded forward-only state transition over anatomy labels, eliminating the fragmentation artefact that produced hundreds of spurious anatomy events per video in the prior approach and reducing per-video anatomy output to 2-3 clinically realistic events. On the held-out RARE-VISION test set comprising three NaviCam examinations (161,025 frames), the updated pipeline achieves an overall temporal mAP@0.5 of 0.3597 and mAP@0.95 of 0.3399, representing a relative improvement of 46% and 44% respectively over the prior submission, with total inference completed in approximately 21 minutes on a single GPU.
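The abstract describes the Angular Separation Loss only as a penalty on off-diagonal cosine similarity between class prototypes. A minimal sketch of such a penalty is shown below; the exact form is an assumption (here, the mean squared cosine similarity over all distinct prototype pairs), and the function names are hypothetical. It uses plain Python lists rather than a deep learning framework to stay self-contained.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))


def angular_separation_penalty(prototypes):
    """Mean squared off-diagonal cosine similarity between class prototypes.

    Assumed form: 0.0 when all prototypes are mutually orthogonal,
    1.0 when they all point in the same (or opposite) direction.
    """
    k = len(prototypes)
    if k < 2:
        return 0.0
    total = 0.0
    for i in range(k):
        for j in range(k):
            if i != j:  # skip the diagonal: a prototype's similarity with itself
                total += cosine(prototypes[i], prototypes[j]) ** 2
    return total / (k * (k - 1))


separated = angular_separation_penalty([[1.0, 0.0], [0.0, 1.0]])
collapsed = angular_separation_penalty([[1.0, 0.0], [2.0, 0.0]])
```

Minimizing a term like this pushes prototypes apart in angle, which is the stated remedy for rare-class prototypes collapsing onto frequent ones. In training it would be added to the classification loss with some weight, which the abstract does not specify.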

2603.10067 2026-05-25 cs.LG cs.AI 版本更新

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

HTMuon：通过重尾谱校正改进Muon

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

发表机构 * Dartmouth College（达特茅斯学院）； Microsoft（微软）； International Computer Science Institute（国际计算机科学研究所）； University of California, Berkeley（加州大学伯克利分校）； Meta

AI总结本文提出 HTMuon，一种改进 Muon 优化算法的方法，旨在提升大语言模型的训练效果。研究指出，Muon 的正交更新规则抑制了权重谱的重尾特性，而 HTMuon 基于重尾自正则化理论，通过生成更重尾的更新步长，增强模型对参数依赖关系的捕捉能力。实验表明，HTMuon 在语言模型预训练和图像分类任务中均优于现有方法，且可作为现有 Muon 变体的插件使用。

详情

AI中文摘要

赋能 9-1-1 接警培训：生成式 AI 的经验与教训

Zirong Chen, Yilin Liu, Meiyi Ma

发表机构 * College of Connected Computing（连接计算学院）； Vanderbilt University（范德比大学）

AI总结该研究探讨了如何利用生成式人工智能（GenAI）提升9-1-1紧急电话接线员的培训效率，以应对人员短缺和传统培训方式难以扩展的问题。研究团队与纳什维尔都会区紧急通信部（MNDEC）合作，开发并部署了一套基于生成式AI的培训系统，经过六个月的实际应用，系统覆盖了190名用户，进行了1120次培训。通过分析记录98429次用户交互的部署日志，研究总结出四条关键经验，为在公共安全领域应用AI驱动培训系统提供了切实可行的设计与治理建议。

Comments Accepted at IEEE SmartComp 2026

详情

AI中文摘要

紧急接警员是公共安全响应的第一操作环节，每年处理超过 2.4 亿次呼叫，同时面临持续的培训危机：许多中心的人员短缺超过 25%，而培训一名新员工可能需要多达 720 小时的一对一指导，这会使得经验丰富的人员脱离现役。传统培训方法在这些限制下难以扩展，限制了覆盖范围和反馈及时性。与 Metro Nashville 紧急通信部（MNDEC）合作，我们在现实约束下设计、开发和部署了一个基于生成式 AI 的接警培训系统。在六个月内，部署从初始试点扩展到 190 名运营用户，覆盖 1120 次培训会话，暴露了在受控或纯模拟评估中基本不可见的系统交付、严谨性、弹性和人为因素方面的系统性挑战。通过分析记录 98429 次用户交互、组织流程和利益相关者参与模式的部署日志，我们提炼出四个关键教训，每个教训都附有具体的设计和治理实践。这些教训为在安全关键公共部门环境中寻求交付 AI 驱动培训系统的研究人员和实践者提供了基于实践的指导，在这些环境中，实际约束从根本上塑造了以人为本的设计。

英文摘要

Emergency call-takers form the first operational link in public safety response, handling over 240 million calls annually while facing a sustained training crisis: staffing shortages exceed 25% in many centers, and preparing a single new hire can require up to 720 hours of one-on-one instruction that removes experienced personnel from active duty. Traditional training approaches struggle to scale under these constraints, limiting both coverage and feedback timeliness. In partnership with Metro Nashville Department of Emergency Communications (MNDEC), we designed, developed, and deployed a GenAI-powered call-taking training system under real-world constraints. Over six months, deployment scaled from initial pilot to 190 operational users across 1,120 training sessions, exposing systematic challenges around system delivery, rigor, resilience, and human factors that remain largely invisible in controlled or purely simulated evaluations. By analyzing deployment logs capturing 98,429 user interactions, organizational processes, and stakeholder engagement patterns, we distill four key lessons, each coupled with concrete design and governance practices. These lessons provide grounded guidance for researchers and practitioners seeking to deliver AI-driven training systems in safety-critical public sector environments where practical constraints fundamentally shape human-centric design.

2602.12579 2026-05-25 cs.LG cs.AI 版本更新

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

VI-CuRL: 通过置信度引导的方差缩减稳定与验证器无关的强化学习推理

Xin-Qiang Cai, Masashi Sugiyama

发表机构 * RIKEN AIP（理化学研究所革新智能综合研究中心）； The University of Tokyo（东京大学）

AI总结本文提出了一种名为VI-CuRL的验证器无关强化学习框架，旨在解决现有可验证奖励强化学习（RLVR）依赖外部验证器导致的可扩展性问题。该方法通过利用模型自身的置信度构建独立于外部验证器的课程学习体系，有效控制梯度方差，提升训练稳定性。理论分析证明了该估计器的渐近无偏性，实验表明其在数学和通用推理任务中优于多种依赖或不依赖验证器的基线方法。

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）已成为增强大型语言模型（LLMs）推理能力的主流范式，但其对外部验证器的依赖限制了可扩展性。最近的研究表明，RLVR主要通过激发潜在能力发挥作用，这推动了无验证器算法的发展。然而，在此类设置中，标准方法（如Group Relative Policy Optimization）面临一个关键挑战：破坏性的梯度方差常导致训练崩溃。为解决此问题，我们引入了与验证器无关的课程强化学习（VI-CuRL），该框架利用模型的内在置信度构建独立于外部验证器的课程。通过优先处理高置信度样本，VI-CuRL有效管理偏差-方差权衡，特别针对降低动作和问题方差。我们提供了严格的理论分析，证明我们的估计量保证渐近无偏性。实验上，VI-CuRL促进了稳定性，并在有/无验证器的数学和通用推理基准上持续优于依赖/不依赖验证器的基线。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-dependent/independent baselines across math and general reasoning benchmarks with/without verifiers.
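The idea of prioritizing high-confidence samples can be sketched as a simple selection step. This is a hypothetical illustration only: the abstract does not say how confidence is computed or how the curriculum is scheduled, so the precomputed confidence scores, the `keep_fraction` parameter, and the function name are all assumptions.

```python
def select_by_confidence(samples, keep_fraction):
    """Keep the highest-confidence fraction of (sample_id, confidence) pairs.

    Assumption: confidence is a precomputed scalar per sample (for example,
    derived from the model's own token probabilities); the source does not
    specify the measure.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    k = max(1, round(keep_fraction * len(samples)))
    ranked = sorted(samples, key=lambda pair: pair[1], reverse=True)
    return [sample_id for sample_id, _ in ranked[:k]]


batch = [("q1", 0.91), ("q2", 0.35), ("q3", 0.78), ("q4", 0.12)]

# Early in a hypothetical curriculum only the most confident half is used;
# later the full batch is admitted.
early = select_by_confidence(batch, 0.5)
late = select_by_confidence(batch, 1.0)
```

Restricting updates to a confident subset trades some bias for lower gradient variance, which is the trade-off the abstract says the method manages; widening `keep_fraction` over training is one plausible way to shrink that bias, though the actual schedule is not given in the source.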

2602.12316 2026-05-25 cs.AI cs.CL cs.CY cs.GT cs.MA 版本更新

ArcMark: 通过最优传输实现无失真的多字节大语言模型水印

Atefeh Gilani, Sajani Vithana, Carol Xuan Long, Oliver Kosut, Lalitha Sankar, Flavio P. Calmon

发表机构 * Arizona State University（亚利桑那州立大学）； Harvard University（哈佛大学）

AI总结 ArcMark 是一种基于最优传输理论的无失真多字节大语言模型水印方法，能够在不改变模型生成文本质量的前提下，将多个字节的信息嵌入到少量的生成文本中。该方法通过将无失真水印问题建模为信道编码问题，推导出信息论意义上的信道容量，从而确定了在不引入失真的情况下嵌入信息的理论极限，并据此设计了 ArcMark 算法。实验表明，ArcMark 在信息重建准确率和抗攻击能力方面优于现有方法，且生成文本的困惑度和下游任务表现与未加水印的文本无明显差异。

详情

AI中文摘要

水印是促进大语言模型（LLM）负责任使用的重要工具。现有水印在生成的token中插入信号，要么标记LLM生成的文本（零比特水印），要么编码更复杂的消息（多比特水印）。尽管最近许多方法在不扰动平均下一token预测的情况下向文本中插入多个比特，但它们很大程度上扩展了零比特设置的设计原则，例如每个token编码单个比特。相比之下，能够将多个字节嵌入文本的水印将极大地增加潜在应用，例如嵌入提交提示的用户ID、使用的精确模型版本，甚至提示本身。我们通过引入ArcMark来解决这个问题：一种基于编码和信息论原理的新型水印构造，能够可靠地将多字节信息嵌入仅几百个token中，而不会对底层LLM的下一token分布造成任何失真。我们通过将无失真水印问题建模为信道编码问题，并推导出信息论信道容量，该容量建立了在LLM输出中无失真嵌入信息的基本极限，从而推导出ArcMark。该容量公式指导了ArcMark的设计。在实践中，ArcMark在重建精度上优于竞争的多比特无失真水印，包括在面对改变部分LLM文本的攻击时。ArcMark输出在困惑度和下游任务质量方面也显示出与未加水印文本无法区分。

英文摘要

Watermarking is an important tool for promoting the responsible use of large language models (LLMs). Existing watermarks insert a signal into generated tokens that either flags LLM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent approaches insert multiple bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. In contrast, a watermarker capable of embedding multiple bytes into the text would dramatically increase the potential applications, by embedding information such as the ID of the user who submitted the prompt, the precise model version that was used, or even the prompt itself. We address this problem by introducing ArcMark: a new watermark construction based on coding and information-theoretic principles that is capable of reliably embedding multiple bytes of information into just a few hundred tokens, without any distortion of the underlying LLM next-token distribution. We derive ArcMark by formulating the distortion-free watermarking problem as a channel coding problem, and deriving an information-theoretic channel capacity that establishes the fundamental limit of embedding information in LLM output in a distortion-free manner. This capacity formulation informs the design of ArcMark. In practice, ArcMark outperforms competing multi-bit distortion-free watermarks in terms of reconstruction accuracy, including in the face of attacks that alter a subset of the LLM text. ArcMark output is also shown to be indistinguishable from unwatermarked text in terms of perplexity, and in downstream task quality.

2602.05472 2026-05-25 cs.AI 版本更新

ALIVE: Awakening LLM Reasoning via Adversarial Learning and Instructive Verbal Evaluation

ALIVE: 通过对抗学习和指导性语言评估唤醒LLM推理

Yiwen Duan, Jing Ye, Xinpei Zhao

发表机构 * Independent Researcher（独立研究者）

AI总结大型语言模型（LLMs）在专家级推理能力方面面临“奖励瓶颈”问题，传统强化学习依赖的标量奖励难以扩展、跨领域不稳定且无法反映推理逻辑。为此，研究提出ALIVE框架，通过对抗学习与指导性语言评价相结合，使模型能够从原始语料中自主学习推理准则，无需依赖外部奖励信号。实验表明，ALIVE在数学推理、代码生成和逻辑推理等任务中显著提升了模型的准确性、跨领域泛化能力和自我纠正能力，为通用推理对齐提供了一种无需人工监督的可扩展方法。

详情

AI中文摘要

大型语言模型（LLM）对专家级推理的追求一直受到持续存在的“奖励瓶颈”的阻碍：传统的强化学习（RL）依赖于标量奖励，这些奖励扩展成本高昂、跨领域时脆弱，并且对解决方案的底层逻辑视而不见。这种对外部、贫乏信号的依赖阻止了模型发展对推理原则的深层、自包含理解。我们引入ALIVE（对抗学习与指导性语言评估，Adversarial Learning with Instructive Verbal Evaluation），一种免人工干预的对齐框架，超越了标量奖励优化，转向内在推理习得。基于“认知协同”原则，ALIVE将问题提出、解决和判断统一在单个策略模型中，以内化正确性的逻辑。通过将对抗学习与指导性语言反馈相结合，ALIVE使模型能够直接从原始语料库内化评估标准，有效将外部批评转化为内生推理能力。在数学推理、代码生成和一般逻辑推理基准上的实证评估表明，ALIVE持续缓解了奖励信号的局限性。在相同数据和计算量下，它实现了准确率提升、显著改善的跨域泛化以及更高的自我修正率。这些结果表明，推理三位一体促进了能力增长的自我维持轨迹，将ALIVE定位为无需人工循环监督的通用推理对齐的可扩展基础。

英文摘要

The quest for expert-level reasoning in Large Language Models (LLMs) has been hampered by a persistent reward bottleneck: traditional reinforcement learning (RL) relies on scalar rewards that are costly to scale, brittle across domains, and blind to the underlying logic of a solution. This reliance on external, impoverished signals prevents models from developing a deep, self-contained understanding of reasoning principles. We introduce ALIVE (Adversarial Learning with Instructive Verbal Evaluation), a hands-free alignment framework that moves beyond scalar reward optimization toward intrinsic reasoning acquisition. Grounded in the principle of Cognitive Synergy, ALIVE unifies problem posing, solving, and judging within a single policy model to internalize the logic of correctness. By coupling adversarial learning with instructive verbal feedback, ALIVE enables models to internalize evaluative criteria directly from raw corpora, effectively transforming external critiques into an endogenous reasoning faculty. Empirical evaluations across mathematical reasoning, code generation, and general logical inference benchmarks demonstrate that ALIVE consistently mitigates reward signal limitations. With identical data and compute, it achieves accuracy gains, markedly improved cross-domain generalization, and higher self-correction rates. These results indicate that the reasoning trinity fosters a self-sustaining trajectory of capability growth, positioning ALIVE as a scalable foundation for general-purpose reasoning alignment without human-in-the-loop supervision.

2602.02780 2026-05-25 cs.AI cs.LG 版本更新

Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Yi Li, Yan Sun, Boyu Wang, Pingzhao Hu

发表机构 * Department of Computer Science, Western University, London, Canada（加拿大伦敦西方大学计算机科学系）； Department of Biochemistry, Western University, London, Canada（加拿大伦敦西方大学生物化学系）

AI总结本文提出了一种名为Cuttlefish的统一多模态大语言模型，旨在解决基于结构的推理中几何信息缺失和模态融合瓶颈的问题。该模型引入了“Scaling-Aware Patching”和“Geometry Grounding Adapter”两种核心方法，前者通过指令条件门控机制生成可变大小的结构图块，动态调整查询令牌数量以适应结构复杂度；后者通过跨注意力机制将几何信息注入语言模型，从而减少结构幻觉。实验表明，Cuttlefish在多个跨学科的原子级结构推理任务中表现出色。

Comments Accepted by ICML 2026

详情

AI中文摘要

大型语言模型（LLM）正在实现对2D和3D结构的推理，但现有方法仍然局限于特定模态，通常通过基于序列的标记化或固定长度的查询连接器来压缩结构输入。这种架构要么忽略了减轻结构幻觉所需的几何基础，要么施加了不灵活的模态融合瓶颈，同时过度压缩和次优分配结构令牌，从而阻碍了通用全原子推理的实现。我们引入了Cuttlefish，一种统一的多模态LLM，它将语言推理建立在几何线索上，同时根据结构复杂性缩放模态令牌。首先，缩放感知补丁利用指令条件门控机制在结构图上生成可变大小的补丁，根据结构复杂性自适应地缩放查询令牌预算，以缓解固定长度连接器的瓶颈。其次，几何基础适配器通过交叉注意力对模态嵌入进行细化，并将生成的模态令牌注入LLM，暴露明确的几何线索以减少结构幻觉。跨学科全原子基准的实验表明，Cuttlefish在异构结构基础推理中实现了优越的性能。代码：github.com/zihao-jing/Cuttlefish。

英文摘要

Large language models (LLMs) are enabling reasoning over 2D and 3D structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric grounding requisite for mitigating structural hallucinations, or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified multimodal LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across interdisciplinary all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code: github.com/zihao-jing/Cuttlefish.

2602.00979 2026-05-25 cs.CR cs.AI cs.CL 版本更新

被压迫者的信息获取：解放性信息获取的弗莱雷式设计

Bhaskar Mitra, Nicola Neophytou, Sireesh Gururaja

发表机构 * Independent Researcher（独立研究者）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文探讨了如何在面对威权势力对在线信息访问平台的控制时，通过保罗·弗莱雷的解放教育理论，设计出具有解放性质的信息访问系统。研究挑战了技术开发者与用户之间的传统二元对立关系，主张通过“弗莱雷式设计”使平台成为社区成员共同构建和抗争的工具，从而实现结构性的解放。

详情

AI中文摘要

在线信息获取（IA）平台是威权主义捕获的目标。我们通过保罗·弗莱雷的解放教育学理论视角，探讨如何保护我们的平台并确保解放性成果。弗莱雷的理论为探索IA的社会技术问题提供了一个截然不同的视角，相对于当前主导的公平、问责和透明度框架。我们明确挑战IA平台开发中的技术专家-用户二分法，这反映了弗莱雷分析中的师生关系。通过将弗莱雷的分析扩展到IA，我们批判了技术专家作为解放者的框架，即（利他主义的）技术专家有责任减轻新兴技术对边缘化社区的风险。相反，我们倡导弗莱雷式设计，其目标是在结构上开放平台，使社区成员能够对其加以征用（co-option）和共同构建，以支持他们的解放斗争。

英文摘要

Online information access (IA) platforms are targets of authoritarian capture. We explore the question of how to safeguard our platforms and ensure emancipatory outcomes through the lens of Paulo Freire's theories of emancipatory pedagogy. Freire's theories provide a radically different lens for exploring IA's sociotechnical concerns relative to the current dominating frames of fairness, accountability, and transparency. We make explicit, with the intention to challenge, the technologist-user dichotomy in IA platform development that mirrors the teacher-student relation in Freire's analysis. By extending Freire's analysis to IA, we critique the technologists-as-liberator frame where it is the burden of (altruistic) technologists to mitigate the risks of emerging technologies for marginalized communities. Instead, we advocate for Freirean Design whose goal is to structurally expose the platform for co-option and co-construction by community members in aid of their emancipatory struggles.

2601.00969 2026-05-25 cs.RO cs.AI 版本更新

V-VLAPS: Value-Guided Planning for Vision-Language-Action Models

V-VLAPS：面向视觉-语言-动作模型的价值引导规划

Ke Ren, Ali Salamatian, Kieran Pattison, Cyrus Neary

发表机构 * The University of British Columbia（不列颠哥伦比亚大学）

AI总结该研究提出了一种名为 V-VLAPS 的价值引导型视觉-语言-动作规划方法，旨在解决视觉-语言-动作（VLA）模型在复杂任务中因策略偏差导致的规划失败问题。通过引入一个轻量的价值头，V-VLAPS 利用离线 VLA rollout 数据预测蒙特卡洛回报，从而引导蒙特卡洛树搜索优先探索高价值分支。实验表明，在默认搜索预算下，V-VLAPS 在五个 LIBERO 任务套件上总体与无价值引导的基线持平；增大搜索预算后，它在所有套件上均优于基线，其中 LIBERO-Object 提升 6 个百分点，LIBERO-10 提升 4 个百分点。

详情

AI中文摘要

视觉-语言-动作（VLA）模型为机器人操作提供了强大的动作先验，但其反应式行为在分布偏移和长时域任务结构下可能失败。最近的VLA引导规划方法通过使用预训练策略引导树搜索来改进执行，但节点选择仍严重依赖于策略先验和访问计数探索。因此，当策略偏向不良动作时，规划器缺乏学习到的价值信号来纠正这种偏差。先前工作表明，VLA表示编码了 rollout 成功与失败信息，暗示它们也可能在规划期间支持价值估计。我们引入了价值引导的视觉-语言-动作规划与搜索（V-VLAPS），该方法通过一个在离线VLA rollout上训练的轻量级价值头来预测蒙特卡洛回报，从而增强VLA引导规划。这些预测引导蒙特卡洛树搜索朝向更高价值的分支。在五个LIBERO套件上，V-VLAPS在默认搜索预算下总体上与无价值规划基线相当，分析表明许多硬失败是根级超时，其中预测值弱分离。在更大的搜索预算下，V-VLAPS在所有任务套件上优于基线，在LIBERO-Object上提高6个百分点，在LIBERO-10上提高4个百分点。我们的结果表明，VLA表示不仅可以支持失败预测，还可以在搜索到达价值排序重要的分支时支持价值引导规划。

英文摘要

Vision-language-action (VLA) models provide strong action priors for robotic manipulation, but their reactive behavior can fail under distribution shift and long-horizon task structure. Recent VLA-guided planning methods improve execution by using pretrained policies to guide tree search, yet node selection still depends heavily on policy priors and visit-count exploration. Consequently, when the policy favors poor actions, the planner lacks a learned value signal to correct this bias. Prior work has shown that VLA representations encode rollout success and failure information, suggesting that they may also support value estimation during planning. We introduce Value-Guided Vision-Language-Action Planning and Search (V-VLAPS), which augments VLA-guided planning with a lightweight value head trained on offline VLA rollouts to predict Monte Carlo returns. These predictions guide Monte Carlo Tree Search toward higher-value branches. Across five LIBERO suites, V-VLAPS matches value-free planning baseline at the default search budget in aggregate, and analysis shows that many hard failures are root-level timeouts where predicted values are weakly separated. With a larger search budget, V-VLAPS improves over the baseline in all task suites with +6 percentage points on LIBERO-Object and +4 percentage points on LIBERO-10. Our results suggest that VLA representations can support not only failure prediction, but also value-guided planning when search reaches branches where value-based ranking matters.
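A learned value signal can counteract a biased policy prior during node selection, as the following minimal sketch shows. It uses a generic PUCT-style score; the exact selection rule, the `c_puct` constant, the dictionary field names, and the use of the value-head estimate for unvisited children are assumptions made for illustration and are not taken from the paper.

```python
import math


def select_child(children, c_puct=1.0):
    """Return the index of the child with the highest PUCT-style score.

    Each child is a dict with (hypothetical field names):
      prior      - policy prior from the VLA, assumed normalized
      visits     - visit count
      value_sum  - sum of backed-up returns
      value_pred - value-head estimate, used until the child is visited
    """
    parent_visits = sum(ch["visits"] for ch in children)
    best_index, best_score = 0, float("-inf")
    for i, ch in enumerate(children):
        if ch["visits"] > 0:
            q = ch["value_sum"] / ch["visits"]  # empirical mean return
        else:
            q = ch["value_pred"]  # fall back to the learned estimate
        # Exploration bonus driven by the prior and visit counts.
        u = c_puct * ch["prior"] * math.sqrt(parent_visits + 1) / (1 + ch["visits"])
        if q + u > best_score:
            best_index, best_score = i, q + u
    return best_index


# The policy prefers action 0, but the value head rates action 1 much higher.
with_value = [
    {"prior": 0.7, "visits": 0, "value_sum": 0.0, "value_pred": 0.1},
    {"prior": 0.3, "visits": 0, "value_sum": 0.0, "value_pred": 0.9},
]
# Value-free variant: identical priors, no value signal.
value_free = [dict(ch, value_pred=0.0) for ch in with_value]

chosen_with_value = select_child(with_value)
chosen_value_free = select_child(value_free)
```

In this toy case the value-free search follows the prior to action 0, while the value-guided search picks action 1. When predicted values are only weakly separated, the two rules choose the same child, which is consistent with the abstract's observation that the gain appears only where value-based ranking matters.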

2512.20298 2026-05-25 cs.CL cs.AI cs.CY cs.HC 版本更新

Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

模式 vs. 患者：通过第一人称叙事评估大语言模型与心理健康专业人员的人格障碍诊断能力

Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz

发表机构 * IDEAS Research Institute（IDEAS研究所）； Adam Mickiewicz University（亚当·密茨凯维奇大学）； AMU Center for Artificial Intelligence（AMU人工智能中心）； Poznań University of Medical Sciences（波兹南医学科学大学）； Maria Curie-Skłodowska University（玛丽·居里-斯克洛多夫斯卡大学）

AI总结该研究探讨了大型语言模型（LLMs）在基于第一人称叙述进行人格障碍诊断方面的能力，特别比较了其与心理健康专业人士在诊断边缘型（BPD）和自恋型（NPD）人格障碍时的表现。研究发现，尽管LLMs在识别BPD方面表现优异，但在诊断NPD时显著低估，反映出模型在处理价值判断性术语时可能存在偏见。研究还指出，LLMs倾向于基于模式和形式分类提供详细解释，而人类专家则更关注患者的自我认知和时间体验，整体诊断可靠性仍有待提升。

详情

AI中文摘要

对LLMs进行精神病学自我评估的日益依赖引发了对其解释定性患者叙事能力的质疑。这项深度而非广度的案例研究直接比较了最先进的LLMs和心理健康专业人员，基于波兰语第一人称自传叙事评估边缘型人格障碍（BPD）和自恋型人格障碍（NPD）。在我们的样本中，表现最佳的Gemini Pro模型的总体诊断得分（65.48%）比人类专业人员的平均得分（43.57%）高出21.91个百分点。虽然模型和人类专家在识别BPD方面都表现出色（F1分别为83.4和80.0），但模型严重漏诊NPD（F1=6.7 vs. 50.0），显示出对价值负载术语“自恋”的潜在回避。定性上，模型提供了自信、详尽的理由，侧重于模式和形式类别，而人类专家则保持简洁和谨慎，强调患者的自我感和时间体验。我们的研究结果表明，虽然LLMs可能擅长解释复杂的临床第一人称数据，但其输出仍然存在关键的可靠性和偏见问题。

英文摘要

Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth over breadth case study directly compares state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 & F1 = 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.

2512.18470 2026-05-25 cs.SE cs.AI cs.MA 版本更新

SWE-EVO: Benchmarking Coding Agents in Long-Horizon Software Evolution Scenarios

SWE-EVO：在长周期软件演化场景中基准测试编码智能体

Tue Le, Minh V. T. Thai, Dung Nguyen Manh, Huy Phan Nhat, Nghi D. Q. Bui

发表机构 * FPT Software AI Center（FPT软件人工智能中心）； School of Computing and Information Systems（计算与信息系统学院）； University of Melbourne（墨尔本大学）； Center of AI Research（人工智能研究中心）； VinUniversity（Vin大学）

AI总结现有的AI编程代理基准主要集中在单一任务上，如修复错误或添加小功能，而实际软件工程是一个长期演进的过程，涉及多文件协调与多次迭代。为此，研究者提出了SWE-EVO基准，基于七个成熟开源Python项目的发布说明构建，包含48个需要多步骤修改的任务，平均涉及21个文件，并通过大量测试用例验证。实验表明，当前代理在长期、多文件任务上的表现仍存在显著差距，研究还提出了衡量部分进展的新指标——Fix Rate。

详情

AI中文摘要

现有的AI编码智能体基准测试主要关注孤立、单一问题的任务，例如修复一个bug或添加一个小功能。然而，现实世界的软件工程是一个长周期的工作：开发者解读高层次需求，协调跨多个文件的变更，并在多次迭代中演化代码库同时保持功能。我们引入了SWE-EVO，一个针对这种长周期软件演化挑战的基准测试。该基准测试基于七个成熟的开源Python项目的发布说明构建，包含48个任务，每个任务需要平均跨越21个文件的多步修改，并通过平均每个实例874个测试的测试套件进行验证。实验揭示了一个显著的能力差距：带有OpenHands的GPT-5.4在SWE-EVO上仅达到25%，而GPT-5.2在SWE-Bench Verified上达到72.80%，表明当前智能体在持续的、多文件推理方面存在困难。我们还提出了修复率（Fix Rate），一个衡量这些复杂长周期任务部分进展的指标。

英文摘要

Existing benchmarks for AI coding agents focus on isolated, single-issue tasks such as fixing a bug or adding a small feature. However, real-world software engineering is a long-horizon endeavor: developers interpret high-level requirements, coordinate changes across many files, and evolve codebases over multiple iterations while preserving functionality. We introduce SWE-EVO, a benchmark for this long-horizon software evolution challenge. Constructed from release notes of seven mature open-source Python projects, SWE-EVO comprises 48 tasks requiring multi-step modifications spanning an average of 21 files, validated against test suites averaging 874 tests per instance. Experiments reveal a striking capability gap: GPT-5.4 with OpenHands achieves only 25% on SWE-EVO versus 72.80% achieved by GPT-5.2 on SWE-Bench Verified, showing that current agents struggle with sustained, multi-file reasoning. We also propose Fix Rate, a metric capturing partial progress on these complex, long-horizon tasks.

2512.06404 2026-05-25 cs.AI cond-mat.mtrl-sci physics.chem-ph 版本更新

GENIUS: An Agentic AI Framework for Autonomous Design and Execution of Simulation Protocols

GENIUS: 一种用于自主设计和执行模拟协议的智能AI框架

Mohammad Soleymanibrojeni, Roland Aydin, Diego Guedes-Sobrinho, Alexandre C. Dias, Maurício J. Piotrowski, Wolfgang Wenzel, Celso Ricardo Caldeira Rêgo

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）； Institute of Nanotechnology（纳米技术研究所）； Hamburg University of Technology（汉堡技术大学）； Federal University of Paraná（帕拉纳联邦大学）； University of Brasília（巴西利亚大学）； Federal University of Pelotas（佩洛塔斯联邦大学）； Institute of Physics and International Center of Physics（物理研究所和国际物理中心）

AI总结 GENIUS 是一个智能代理框架，旨在自主设计和执行模拟协议，解决材料计算中复杂的设置和调试问题。该框架结合了量子力学模拟软件 Quantum ESPRESSO 的知识图谱和分层语言模型，并由有限状态错误恢复机监督，能够将自然语言指令转化为有效的输入文件并自动修复错误。GENIUS 显著降低了推理成本，减少了幻觉现象，使电子结构密度泛函理论模拟更加易用，推动了材料工程的自动化和大规模应用。

Journal ref Communications Materials 7, 115 (2026)

详情

DOI: 10.1038/s43246-026-01167-0

AI中文摘要

预测性原子模拟推动了材料发现，但常规设置和调试仍需计算机专家。这种知识差距限制了集成计算材料工程（ICME），因为最先进的代码存在但非专家使用起来仍然繁琐。我们通过GENIUS解决了这一瓶颈，这是一种AI智能体工作流，将智能的Quantum ESPRESSO知识图谱与由有限状态错误恢复机器监督的分层大语言模型层次结构融合。我们展示了GENIUS将自由形式的人类生成提示翻译成经过验证的输入文件，在295个多样化基准测试中约有80%运行完成，其中76%被自主修复，成功率呈指数衰减至7%的基线。与仅使用LLM的基线相比，GENIUS将推理成本减半，并几乎消除了幻觉。该框架通过智能自动化协议生成、验证和修复，使电子结构DFT模拟大众化，为全球学术界和工业界开放大规模筛选并加速ICME设计循环。

英文摘要

Predictive atomistic simulations have propelled materials discovery, yet routine setup and debugging still demand computer specialists. This know-how gap limits Integrated Computational Materials Engineering (ICME), where state-of-the-art codes exist but remain cumbersome for non-experts. We address this bottleneck with GENIUS, an AI-agentic workflow that fuses a smart Quantum ESPRESSO knowledge graph with a tiered hierarchy of large language models supervised by a finite-state error-recovery machine. Here we show that GENIUS translates free-form human-generated prompts into validated input files that run to completion on $\approx$80% of 295 diverse benchmarks, where 76% are autonomously repaired, with success decaying exponentially to a 7% baseline. Compared with LLM-only baselines, GENIUS halves inference costs and virtually eliminates hallucinations. The framework democratizes electronic-structure DFT simulations by intelligently automating protocol generation, validation, and repair, opening large-scale screening and accelerating ICME design loops across academia and industry worldwide.

2511.18000 2026-05-25 cs.LG cs.AI q-bio.PE 版本更新

Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

空间流行病模拟中的奖励工程：个体行为学习的强化学习平台

Radman Rakhshandehroo, Daniel Coombs

发表机构 * Department of Computer Science University of British Columbia（计算机科学系，不列颠哥伦比亚大学）； Department of Mathematics and Institute of Applied Mathematics University of British Columbia（数学系和应用数学研究所，不列颠哥伦比亚大学）

AI总结本文介绍了 ContagionRL，一个专为疫情空间模拟设计的强化学习平台，用于系统研究奖励函数设计对个体行为学习的影响。该平台结合了可配置的 SIRS+D 流行病模型，支持在不同环境条件下评估多种奖励机制对智能体生存策略的影响，并通过实验发现方向引导和明确遵守激励是提升策略学习的关键因素。研究还表明，采用势场奖励函数的智能体在非药物干预遵守和空间规避策略方面表现最优，平台为探索奖励与行为关系提供了模块化工具，具有重要的理论和应用价值。

Comments 38 pages, 15 figures and 18 tables; Accepted to TMLR. OpenReview: https://openreview.net/forum?id=yPEASsx3hk

Journal ref Transactions on Machine Learning Research, 2026

详情

AI中文摘要

我们提出了ContagionRL，一个与Gymnasium兼容的强化学习平台，专门用于空间流行病模拟中的系统奖励工程。与依赖固定行为规则的传统基于智能体的模型不同，我们的平台能够严格评估奖励函数设计如何影响在不同流行病场景中学到的生存策略。ContagionRL集成了空间SIRS+D流行病模型与可配置的环境参数，允许研究人员在包括有限可观测性、不同移动模式和异质人口动态等变化条件下对奖励函数进行压力测试。我们评估了五种不同的奖励设计，从稀疏生存奖励到一种新颖的势场方法，跨越多种RL算法（PPO、SAC、A2C）。通过系统的消融研究，我们发现方向性指导和明确的依从性激励是稳健策略学习的关键组成部分。我们在不同感染率、网格大小、可见性约束和移动模式下的全面评估表明，奖励函数的选择显著影响智能体行为和生存结果。使用我们的势场奖励训练的智能体始终获得优越性能，学习最大程度地遵守非药物干预，同时发展出复杂的空间规避策略。该平台的模块化设计使得能够系统地探索奖励-行为关系，弥补了这类模型中奖励工程关注有限的空白。ContagionRL是研究流行病背景下适应性行为反应的有效平台，并强调了奖励设计、信息结构和环境可预测性在学习中的重要性。我们的代码公开在https://github.com/redradman/ContagionRL。

英文摘要

We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning. Our code is publicly available at https://github.com/redradman/ContagionRL

2511.16014 2026-05-25 cs.AI 版本更新

MUSEKG: A Knowledge Graph Over Museum Collections

MUSEKG：博物馆藏品知识图谱

Jinhao Li, Jianzhong Qi, Soyeon Caren Han, Eun-Jung Holden

发表机构 * The University of Melbourne School of Computing（墨尔本大学计算机与信息系统学院）； The University of Melbourne（墨尔本大学）

AI总结 MUSEKG 是一个针对博物馆藏品数据构建的交互式知识图谱系统，旨在整合结构化目录、图像和非结构化描述等异构数据，形成统一的、可查询的知识表示。该系统通过建立类型化的图结构，将藏品、人物、机构、图像及其语义实体进行关联，支持基于自然语言的查询和关系感知的检索。实验表明，MUSEKG 能有效支持属性查询、关系探索等常见任务，并通过显式的图结构保证答案的可解释性。

Comments SIGIR'26

详情

AI中文摘要

文化遗产领域的数字化产生了大量但分散的博物馆藏品数据存储库，涵盖结构化编目记录、图像和非结构化描述。现有的博物馆信息系统通常难以将这些来源整合成统一的、可查询的表示，以支持关系感知的探索。我们提出了MuseKG，一个交互式知识图谱系统，它将异构博物馆数据组织成一个类型化图，在连贯的模式下链接对象、人物、组织、图像、图像派生标签和提取的语义实体。MuseKG通过将用户问题映射到图实体并检索用于答案生成的紧凑证据邻域来支持自然语言查询。通过在真实博物馆藏品上的交互式演示，我们展示了MuseKG支持常见的探索任务，如属性查找、关系探索和关系感知检索，并且答案可以通过显式图结构进行检查。

英文摘要

Digitisation in the cultural heritage sector has produced large but fragmented repositories of museum collection data, spanning structured catalogue records, images, and unstructured descriptions. Existing museum information systems often make it difficult to integrate these sources into a unified, queryable representation that supports relation-aware exploration. We present MuseKG, an interactive knowledge graph system that organises heterogeneous museum data into a typed graph that links objects, people, organisations, images, image-derived labels, and extracted semantic entities within a coherent schema. MuseKG supports natural-language queries by grounding user questions to graph entities and retrieving a compact neighbourhood of evidence for answer generation. Through an interactive demonstration on real museum collections, we show that MuseKG supports common exploration tasks such as attribute lookup, relation exploration, and relation-aware retrieval, with answers that remain inspectable via explicit graph structures.

2511.02239 2026-05-25 cs.RO cs.AI 版本更新

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

LACY: 基于视觉-语言模型的语言-动作循环用于自我改进的机器人操作

Youngjin Hong, Houjian Yu, Mingen Li, Changhyun Choi

发表机构 * Department of Electrical and Computer Engineering, Univ. of Minnesota（电气与计算机工程系，明尼苏达大学）

AI总结本文提出LACY，一种基于视觉-语言模型的“语言-动作循环”框架，旨在提升机器人操作任务中的策略泛化能力。该方法通过同时学习语言到动作（L2A）、动作到语言（A2L）以及语言间一致性（L2C）的双向映射，使机器人不仅能执行任务，还能解释自身行为，从而形成更丰富的内部表征。LACY采用主动增强策略自主生成和筛选训练数据，无需额外人工标注，实验表明其在抓取与放置任务中平均提升了56.46%的成功率，显著增强了语言-动作的语义一致性与鲁棒性。

Comments Accepted to ICRA 2026. Project page: https://vla2026.github.io/LACY/

详情

AI中文摘要

学习机器人操作的可泛化策略越来越依赖于将语言指令映射到动作（L2A）的大规模模型。然而，这种单向范式通常产生执行任务而缺乏更深层次上下文理解的策略，限制了它们泛化或解释其行为的能力。我们认为，将动作映射回语言（A2L）的互补技能对于发展更全面的基础至关重要。一个既能行动又能解释其动作的智能体可以形成更丰富的内部表示，并开启自我监督学习的新范式。我们引入了LACY（语言-动作循环），一个统一的框架，在单个视觉-语言模型内学习这种双向映射。LACY在三个协同任务上联合训练：从语言生成参数化动作（L2A）、用语言解释观察到的动作（A2L）以及验证两个语言描述之间的语义一致性（L2C）。这实现了一个自我改进的循环，通过针对低置信度案例的主动增强策略自主生成和过滤新的训练数据，从而在没有额外人工标注的情况下改进模型。在仿真和真实世界的拾取-放置任务上的实验表明，LACY平均将任务成功率提高了56.46%，并为机器人操作产生了更稳健的语言-动作基础。项目页面：https://vla2026.github.io/LACY/

英文摘要

Learning generalizable policies for robotic manipulation increasingly relies on large-scale models that map language instructions to actions (L2A). However, this one-way paradigm often produces policies that execute tasks without deeper contextual understanding, limiting their ability to generalize or explain their behavior. We argue that the complementary skill of mapping actions back to language (A2L) is essential for developing more holistic grounding. An agent capable of both acting and explaining its actions can form richer internal representations and unlock new paradigms for self-supervised learning. We introduce LACY (Language-Action Cycle), a unified framework that learns such bidirectional mappings within a single vision-language model. LACY is jointly trained on three synergistic tasks: generating parameterized actions from language (L2A), explaining observed actions in language (A2L), and verifying semantic consistency between two language descriptions (L2C). This enables a self-improving cycle that autonomously generates and filters new training data through an active augmentation strategy targeting low-confidence cases, thereby improving the model without additional human labels. Experiments on pick-and-place tasks in both simulation and the real world show that LACY improves task success rates by 56.46% on average and yields more robust language-action grounding for robotic manipulation. Project page: https://vla2026.github.io/LACY/

2510.26411 2026-05-25 cs.AI 版本更新

MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

MedSAE: 用稀疏自编码器剖析MedCLIP表示

Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, Marco Grangetto

发表机构 * University of Turin（都灵大学）； École polytechnique（巴黎综合理工学院）

AI总结本文提出了一种名为 MedSAE 的方法，通过稀疏自编码器对医学视觉语言模型 MedCLIP 的潜在空间进行解析，以提升其可解释性。研究引入了结合相关性度量、熵分析和自动神经元命名的评估框架，实验表明 MedSAE 能生成更具单语义性和可解释性的神经元表示，从而在医疗 AI 的性能与透明性之间建立桥梁。

Comments Accepted at ICIP 2026

2510.21270 2026-05-25 cs.CL cs.AI cs.CV 版本更新

Sparser Block-Sparse Attention via Token Permutation

通过令牌置换实现更稀疏的块稀疏注意力

Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

发表机构 * Fudan University（复旦大学）； Shanghai Innovation Institute（上海创新研究院）

AI总结随着大语言模型上下文长度的增加，计算成本显著上升，主要瓶颈来自自注意力机制的二次复杂度。为此，本文提出了一种名为Permuted Block-Sparse Attention（PBS-Attn）的新型稀疏注意力方法，通过重新排列token顺序以提升块级稀疏性，从而在保持模型精度的同时显著提高计算效率。实验表明，该方法在多个长上下文数据集上优于现有块稀疏注意力方法，并在端到端推理速度上实现了最高2.75倍的加速。

Comments ICML 2026

详情

AI中文摘要

扩展大语言模型（LLM）的上下文长度带来了显著的好处，但计算成本高昂。这种成本主要源于自注意力机制，其相对于序列长度的$O(N^2)$复杂度在内存和延迟方面构成了主要瓶颈。幸运的是，注意力矩阵通常是稀疏的，尤其是对于长序列，这为优化提供了机会。块稀疏注意力已成为一种有前景的解决方案，它将序列划分为块并跳过其中一部分块的计算。然而，该方法的有效性高度依赖于底层的注意力模式，这可能导致次优的块级稀疏性。例如，单个块内查询的重要键令牌可能分散在许多其他块中，导致计算冗余。在这项工作中，我们提出了置换块稀疏注意力（PBS-Attn），这是一种即插即用的方法，利用注意力的置换性质来增加块级稀疏性并提高LLM预填充的计算效率。我们在具有挑战性的真实世界长上下文数据集上进行了全面实验，结果表明PBS-Attn在模型精度上始终优于现有的块稀疏注意力方法，并紧密匹配全注意力基线。借助我们自定义的permuted-FlashAttention内核，PBS-Attn在长上下文预填充中实现了高达2.75倍的端到端加速，证实了其实用性。代码可在https://github.com/xinghaow99/pbs-attn获取。

英文摘要

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (\textbf{PBS-Attn}), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code available at https://github.com/xinghaow99/pbs-attn
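
The "permutation properties of attention" mentioned above can be checked numerically. The following is a minimal sketch in pure Python, not the authors' permuted-FlashAttention kernel: it assumes plain unmasked softmax attention (no causal mask, which a real prefill would need to handle), and the helper names `attention` and `random_matrix` are illustrative. It shows that reordering keys and values with one shared permutation leaves the output unchanged, while reordering queries only reorders the output rows.

```python
import math
import random


def attention(Q, K, V):
    """Unmasked softmax attention over lists of row vectors."""
    scale = 1.0 / math.sqrt(len(Q[0]))
    out = []
    for q in Q:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in K]
        top = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([
            sum(w * v[j] for w, v in zip(weights, V)) / total
            for j in range(len(V[0]))
        ])
    return out


def random_matrix(rng, rows, cols):
    """Hypothetical helper: a rows x cols matrix of standard normal samples."""
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
n, d = 6, 4
Q, K, V = (random_matrix(rng, n, d) for _ in range(3))

perm = list(range(n))
rng.shuffle(perm)

base = attention(Q, K, V)

# Reordering keys and values with the same permutation leaves every output row unchanged.
kv_permuted = attention(Q, [K[i] for i in perm], [V[i] for i in perm])
max_diff = max(abs(a - b) for ra, rb in zip(base, kv_permuted) for a, b in zip(ra, rb))

# Reordering queries only reorders the output rows in the same way.
q_permuted = attention([Q[i] for i in perm], K, V)
row_diff = max(
    abs(a - b) for j, i in enumerate(perm) for a, b in zip(q_permuted[j], base[i])
)

print(max_diff < 1e-9, row_diff < 1e-9)
```

Because the result does not depend on key order, keys can in principle be regrouped so that the ones a query block attends to fall into fewer blocks, which is the kind of block-level sparsity the abstract describes.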

2510.12787 2026-05-25 cs.AI cs.MA 版本更新

Ax-Prover: A Deep Reasoning Agentic Framework for Theorem Proving in Mathematics and Quantum Physics

Ax-Prover：用于数学和量子物理定理证明的深度推理智能体框架

Benjamin Breen, Marco Del Tredici, Jacob McCarran, Javier Aspuru Mijares, Weichen Winston Yin, Kfir Sulimany, Jacob M. Taylor, Frank H. L. Koppens, Dirk Englund

发表机构 * Axiomatic_AI（公理人工智能）； Massachusetts Institute of Technology (MIT)（麻省理工学院）； Institut de Ciències Fotòniques (ICFO)（光子科学研究所）； Institució Catalana de Recerca i Estudis Avançats (ICREA)（加泰罗尼亚高级研究与高等学院）

AI总结本文提出了一种名为 Ax-Prover 的多智能体系统，用于在 Lean 证明助手环境中进行自动化定理证明，能够解决数学和量子物理等不同科学领域的问题，并支持自主运行或与人类专家协作。该系统结合了大语言模型的推理能力与 Lean 工具的严格形式化验证机制，通过模型上下文协议实现知识与形式正确性的统一。实验表明，Ax-Prover 在多个基准测试中表现优异，尤其在新引入的抽象代数和量子理论基准上显著优于现有方法，展示了其在跨领域形式化验证中的通用性和有效性。

详情

AI中文摘要

我们提出了Ax-Prover，一个用于Lean中自动化定理证明的多智能体系统，能够解决跨不同科学领域的问题，并可自主运行或与人类专家协作。为此，Ax-Prover通过形式化证明生成来处理科学问题求解，这一过程既需要创造性推理又需要严格的句法严谨性。Ax-Prover通过模型上下文协议（MCP）将提供知识和推理能力的大语言模型（LLM）与确保形式正确性的Lean工具相结合，以应对这一挑战。为了评估其作为自主证明器的性能，我们在两个公开数学基准以及我们在抽象代数和量子理论领域引入的两个Lean基准上，将我们的方法与前沿LLM和专用证明器模型进行了比较。在公开数据集上，Ax-Prover与最先进的证明器竞争力相当，而在新基准上则大幅超越它们。这表明，与难以泛化的专用系统不同，我们基于工具的智能体定理证明方法为跨不同科学领域的形式化验证提供了一种可泛化的方法论。此外，我们通过一个实际用例展示了Ax-Prover的辅助能力，展示了它如何使一位专家数学家能够形式化一个复杂密码学定理的证明。

英文摘要

We present Ax-Prover, a multi-agent system for automated theorem proving in Lean that can solve problems across diverse scientific domains and operate either autonomously or collaboratively with human experts. To achieve this, Ax-Prover approaches scientific problem solving through formal proof generation, a process that demands both creative reasoning and strict syntactic rigor. Ax-Prover meets this challenge by equipping Large Language Models (LLMs), which provide knowledge and reasoning, with Lean tools via the Model Context Protocol (MCP), which ensure formal correctness. To evaluate its performance as an autonomous prover, we benchmark our approach against frontier LLMs and specialized prover models on two public math benchmarks and on two Lean benchmarks we introduce in the fields of abstract algebra and quantum theory. On public datasets, Ax-Prover is competitive with state-of-the-art provers, while it largely outperforms them on the new benchmarks. This shows that, unlike specialized systems that struggle to generalize, our tool-based agentic theorem prover approach offers a generalizable methodology for formal verification across diverse scientific domains. Furthermore, we demonstrate Ax-Prover's assistant capabilities in a practical use case, showing how it enabled an expert mathematician to formalize the proof of a complex cryptography theorem.

2510.11195 2026-05-25 cs.CR cs.AI 版本更新

RAG-Pull: Turning Retrieval into a Code-Injection Channel via Invisible Unicode Perturbations

RAG-Pull：通过不可见Unicode扰动将检索转化为代码注入通道

Aritra Dhar, Vasilije Stambolic, Lukas Cavigelli

发表机构 * Computing System Labs, Huawei Technologies Switzerland AG（华为瑞士技术有限公司计算系统实验室）； BKW Energie AG（BKW能源集团）

AI总结本文提出了一种针对检索增强生成（RAG）系统的新型黑盒攻击方法RAG-Pull，通过在查询或代码库中插入不可见的Unicode字符扰动，引导检索过程指向恶意代码，从而破坏模型的安全对齐性。研究发现，仅对查询或目标代码进行微小扰动即可显著影响检索结果，而两者结合则能实现几乎完美的攻击效果。该方法揭示了RAG系统在安全方面的潜在漏洞，为大语言模型的安全性研究提供了新的视角。

2510.09136 2026-05-25 cs.IR cs.AI 版本更新

Controlled Personalization in Legacy Media Online Services: A Case Study in News Recommendation

传统媒体在线服务中的受控个性化：新闻推荐案例研究

Marlene Holzleitner, Stephan Leitner, Hanna Lind Jorgensen, Christoph Schmitz, Jacob Welander, Dietmar Jannach

发表机构 * University of Klagenfurt（克拉根福大学）； University of Bergen（卑尔根大学）

AI总结本文研究了传统新闻媒体在在线平台中采用“受控个性化”推荐策略的效果，旨在在技术驱动的内容推荐与核心编辑价值观之间取得平衡。通过在一家挪威主流传统新闻机构网站上进行A/B测试，研究发现即使是适度的个性化推荐也能显著提升用户点击率、降低浏览努力，并促进内容多样性和覆盖率，同时减少流行度偏差。研究结果表明，受控个性化能够在满足用户需求的同时维护新闻编辑目标，为传统媒体采用个性化技术提供了可行路径。

详情

DOI: 10.1145/3814614

AI中文摘要

个性化新闻推荐已成为大型新闻聚合服务的标准功能，通过自动内容选择优化用户参与。相比之下，传统新闻媒体通常谨慎对待个性化，努力在技术创新与核心编辑价值之间取得平衡。因此，传统新闻媒体的在线平台通常结合编辑策划内容与算法选择文章——我们将这种策略称为受控个性化。在这篇行业文章中，我们通过在挪威一家主要传统新闻机构的网站上进行的A/B测试，评估了受控个性化的有效性。我们的研究结果表明，即使是适度的个性化也能带来显著收益。具体来说，我们观察到接触个性化内容的用户表现出更高的点击率和更少的导航努力，这表明相关内容的发现得到了改善。此外，我们的分析显示，受控个性化有助于提高内容多样性和目录覆盖，并减少流行度偏差。总体而言，我们的结果表明，受控个性化能够成功地将用户需求与编辑目标对齐，为传统媒体在维护新闻价值的同时采用个性化技术提供了一条可行路径。

英文摘要

Personalized news recommendations have become a standard feature of large news aggregation services, optimizing user engagement through automated content selection. In contrast, legacy news media often approach personalization cautiously, striving to balance technological innovation with core editorial values. As a result, online platforms of traditional news outlets typically combine editorially curated content with algorithmically selected articles - a strategy we term controlled personalization. In this industry article, we evaluate the effectiveness of controlled personalization through an A/B test conducted on the website of a major Norwegian legacy news organization. Our findings indicate that even a modest level of personalization yields substantial benefits. Specifically, we observe that users exposed to personalized content demonstrate higher click-through-rates and reduced navigation effort, suggesting improved discovery of relevant content. Moreover, our analysis reveals that controlled personalization contributes to greater content diversity and catalog coverage and in addition reduces popularity bias. Overall, our results suggest that controlled personalization can successfully align user needs with editorial goals, offering a viable path for legacy media to adopt personalization technologies while upholding journalistic values.

2510.08945 2026-05-25 cs.AI 版本更新

FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation

FATHOMS-RAG：评估使用检索增强生成的多模态系统思考与观察的框架

Samuel Hildebrand, Curtis Taylor, Sean Oesch, James M Ghawaly, Amir Sadovnik, Ryan Shivers, Brandon Schreiber, Kevin Kurian

发表机构 * Louisiana State University（路易斯安那州立大学）； Oak Ridge National Lab（橡树岭国家实验室）； University of Florida（佛罗里达大学）

AI总结本文提出了一种名为FATHOMS-RAG的框架，用于评估使用检索增强生成（RAG）的多模态系统在推理和观察方面的能力。该框架引入了一个由人类创建的小型数据集、多项评估指标以及对开源与闭源模型的对比实验，全面检验RAG系统在处理文本、表格和图像等多模态信息时的表现。实验结果表明，闭源模型在准确性和幻觉控制方面显著优于开源模型，尤其是在涉及多模态和跨文档信息的问题上表现更为突出。

Comments Accepted at SAFE-ML 2026 Workshop at the International Conference on Software Testing (ICST) 2026 Code: https://github.com/Sam-Hildebrand/FATHOMS-RAG

详情

DOI: 10.11578/dc.20260227.1

AI中文摘要

检索增强生成（RAG）已成为提高大型语言模型（LLMs）事实准确性的有前景范式。我们引入了一个旨在整体评估RAG管道的基准，评估管道摄取、检索和推理多种模态信息的能力，区别于现有专注于检索等特定方面的基准。我们提出：（1）一个由93个人工创建的问题组成的小型数据集，用于评估管道摄取文本数据、表格、图像以及跨这些模态分布在多个文档中的数据的能力；（2）一个用于正确性的短语级召回率指标；（3）一个最近邻嵌入分类器，用于识别潜在的管道幻觉；（4）对使用开源检索机制构建的2个管道和4个闭源基础模型进行的比较评估；（5）第三方人工评估我们正确性和幻觉指标的对齐情况。我们发现，闭源管道在正确性和幻觉指标上均显著优于开源管道，在依赖多模态和跨文档信息的问题上性能差距更大。对我们指标的人工评估显示，在1-5 Likert量表（5表示“强烈同意”）上，正确性平均一致性为4.62，幻觉检测平均一致性为4.53。

英文摘要

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").
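
The abstract names a phrase-level recall metric without defining it. One plausible reading, sketched below under that assumption, is the fraction of expected reference phrases that appear in the generated answer, matched case-insensitively after whitespace normalization. The function name `phrase_recall` and the sample answer are hypothetical and are not taken from the benchmark.

```python
def phrase_recall(answer, reference_phrases):
    """Fraction of reference phrases found in the answer (assumed definition)."""
    if not reference_phrases:
        raise ValueError("at least one reference phrase is required")

    def normalize(text):
        # Lowercase and collapse runs of whitespace so matching is layout-insensitive.
        return " ".join(text.lower().split())

    haystack = normalize(answer)
    hits = sum(1 for phrase in reference_phrases if normalize(phrase) in haystack)
    return hits / len(reference_phrases)


# Hypothetical example: two of the three expected phrases are present.
score = phrase_recall(
    "The reactor was shut down in 1987 after a coolant  leak.",
    ["1987", "Coolant leak", "turbine failure"],
)
print(round(score, 3))
```

Substring matching of this kind rewards answers that contain the key facts regardless of surrounding wording, but it cannot tell a correct mention from a negated one, which may be one reason the benchmark pairs it with a separate hallucination classifier.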

2510.00915 2026-05-25 cs.LG cs.AI 版本更新

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

在不完美验证器下基于可验证但含噪声奖励的强化学习

Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

发表机构 * RIKEN AIP（日本理化学研究所AIP）； The University of Tokyo（东京大学）； The University of Melbourne（墨尔本大学）； The University of Sydney（悉尼大学）

AI总结该论文研究了在不可靠验证器存在下如何改进可验证奖励的强化学习（RLVR）。通过将验证器的不可靠性建模为具有不对称噪声率的随机奖励通道，作者提出了两种轻量级修正方法：一种是反向修正，用于生成无偏的替代奖励；另一种是正向修正，通过调整得分函数项使策略更新更贴近干净梯度方向。实验表明，这两种方法在合成和真实验证噪声环境下均能提升数学推理任务的性能，其中正向修正在高噪声情况下更为稳定。此外，作者还引入了一个基于轻量级语言模型的申诉机制，用于在线估计假阴性率并进一步提升性能。

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）用自动验证器替代昂贵的人工标注。为减少验证器攻击，许多RLVR系统将奖励二值化为$\{0,1\}$，但不完美的验证器不可避免地引入\emph{假阴性}（拒绝正确答案）和\emph{假阳性}（接受错误答案）。我们将验证器不可靠性形式化为具有非对称噪声率$ρ_0$和$ρ_1$（分别为FP率和FN率）的随机奖励通道。由此抽象我们推导出两种轻量级校正：（i）\emph{后向}校正，产生无偏替代奖励，从而在期望上得到无偏的策略梯度估计量；（ii）\emph{前向}校正，重新加权得分函数项，使得期望更新与干净梯度方向对齐，且仅需FN率。我们在分组相对策略优化流程中将两者实现为轻量级钩子，两种校正均在合成和真实验证器噪声下改善了数学推理的RLVR，其中前向变体在较大噪声下更稳定。最后，一个带有轻量级LLM验证器的上诉机制在线估计FN率并进一步提升性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce \emph{false negatives} (rejecting correct answers) and \emph{false positives} (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $ρ_0$ and $ρ_1$ -- the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a \emph{backward} correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a \emph{forward} correction that reweights score-function terms so the expected update aligns with the clean gradient direction and requires only the FN rate. We implement both as lightweight hooks in a group relative policy optimization pipeline, both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
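
The abstract does not spell out the backward correction, so the sketch below uses the standard unbiased estimator for class-conditional label noise and assumes it matches the paper in spirit. With FP rate $ρ_0$ (a wrong answer is accepted) and FN rate $ρ_1$ (a correct answer is rejected), the surrogate reward is (observed - ρ0) / (1 - ρ0 - ρ1). The function names are illustrative, and the check computes exact expectations instead of sampling.

```python
def backward_corrected_reward(observed, rho0, rho1):
    """Unbiased surrogate for a binary reward seen through a noisy verifier.

    rho0: P(observed = 1 | true = 0), the false-positive rate.
    rho1: P(observed = 0 | true = 1), the false-negative rate.
    """
    denom = 1.0 - rho0 - rho1
    if denom <= 0.0:
        raise ValueError("the correction requires rho0 + rho1 < 1")
    return (observed - rho0) / denom


def expected_corrected_reward(true_reward, rho0, rho1):
    """Exact expectation of the surrogate over the verifier's noise."""
    p_accept = (1.0 - rho1) if true_reward == 1 else rho0
    return (
        p_accept * backward_corrected_reward(1, rho0, rho1)
        + (1.0 - p_accept) * backward_corrected_reward(0, rho0, rho1)
    )


rho0, rho1 = 0.1, 0.3
exp_correct = expected_corrected_reward(1, rho0, rho1)  # true reward is 1
exp_wrong = expected_corrected_reward(0, rho0, rho1)    # true reward is 0
print(abs(exp_correct - 1.0) < 1e-12, abs(exp_wrong) < 1e-12)
```

The surrogate takes values outside $\{0,1\}$ (negative when the verifier rejects), and its variance grows as ρ0 + ρ1 approaches 1, which is consistent with the abstract's remark that the forward variant is more stable under heavier noise.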

2509.26383 2026-05-25 cs.CL cs.AI 版本更新

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

基于强化学习的高效可迁移智能知识图谱检索增强生成

Junhong Lin, Shicheng Liu, Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

发表机构 * MIT CSAIL（麻省理工学院CSAIL）； University of Virginia（弗吉尼亚大学）； IBM Research（IBM研究院）

AI总结该研究提出了一种基于强化学习的高效且可迁移的智能体知识图谱检索增强生成框架KG-R1，旨在解决现有KG-RAG系统中固定流程导致的高推理成本和依赖特定图结构的问题。KG-R1通过单智能体与知识图谱环境交互，逐步学习信息检索与推理生成的统一过程，从而在减少生成token数量的同时提升回答准确性。实验表明，KG-R1在多个知识图谱问答基准上表现出优异的效率和跨图谱迁移能力，且无需重新训练即可保持对新知识图谱的准确推理，具有良好的实际应用前景。

详情

AI中文摘要

知识图谱检索增强生成（KG-RAG）将大型语言模型（LLMs）与结构化、可验证的知识图谱（KGs）相结合，以减少幻觉并提供推理轨迹。然而，当前的KG-RAG系统通常依赖于多个LLM模块（如规划、推理和响应）的固定流水线，这增加了推理成本，并将性能与特定图模式绑定。为了解决这个问题，我们引入了KG-R1，一个通过强化学习（RL）优化KG-RAG的智能体框架。与模块化工作流不同，KG-R1使用单个智能体，将KGs作为其环境进行交互，学习在每一步检索信息，并将其融入统一的推理和生成过程中。在知识图谱问答（KGQA）基准测试中，KG-R1展示了高效性和可迁移性——使用Qwen 2.5-3B，KG-R1以比先前使用更大基础或微调模型的多模块工作流方法更少的生成token提高了答案准确性。此外，KG-R1表现出强大的即插即用能力：训练后，无需重新训练即可在未见过的KGs上保持准确性。这些特性使KG-R1成为实际部署中很有前景的KG-RAG框架。我们的代码公开在github.com/junhongmit/KG-R1/。

英文摘要

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability-using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, maintaining accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2509.12958 2026-05-25 cs.AI 版本更新

Forget What's Sensitive, Remember What Matters: Token-Level Differential Privacy in Memory Sculpting for Continual Learning

忘记敏感信息，记住重要内容：持续学习中基于令牌级差分隐私的记忆塑造

Bihao Zhan, Jie Zhou, Junsong Li, Yutao Yang, Shilian Chen, Qianjun Pan, Xin Li, Wen Wu, Xingjiao Wu, Qin Chen, Hang Yan, Liang He

发表机构 * East China Normal University（华东师范大学）； Shanghai AI Laboratory（上海人工智能实验室）； Shanghai Qiji Zhifeng Co., Ltd.（上海启智锋科技有限公司）

AI总结该研究针对持续学习（CL）模型在隐私保护方面的不足，提出了一种隐私增强的持续学习框架（PeCL）。该方法引入了基于语义敏感性的标记级动态差分隐私策略，动态分配隐私预算以保护敏感信息，同时减少对非敏感知识的干扰。此外，研究还设计了一个隐私引导的记忆塑形模块，用于智能遗忘敏感信息并保留任务不变的历史知识，从而在保障隐私的同时提升模型性能。实验表明，PeCL在隐私保护与模型效用之间取得了更优的平衡。

详情

AI中文摘要

持续学习（CL）模型虽然擅长顺序知识获取，但由于积累多样化信息而面临显著且常被忽视的隐私挑战。传统的隐私方法（如统一的差分隐私预算）不加区分地保护所有数据，导致模型效用大幅下降，阻碍了CL在隐私敏感领域的部署。为了克服这一问题，我们提出了一种隐私增强的持续学习（PeCL）框架，该框架忘记敏感信息并记住重要内容。我们的方法首先引入了一种令牌级动态差分隐私策略，该策略根据单个令牌的语义敏感性自适应分配隐私预算。这确保了对私有实体的强健保护，同时最小化对非敏感通用知识的噪声注入。其次，我们集成了一个隐私引导的记忆塑造模块。该模块利用来自动态DP机制的敏感性分析，智能地从模型记忆和参数中忘记敏感信息，同时明确保留对于缓解灾难性遗忘至关重要的任务不变历史知识。大量实验表明，PeCL在隐私保护和模型效用之间实现了优越的平衡，通过保持先前任务的高准确性同时确保强健的隐私，优于基线模型。

英文摘要

Continual Learning (CL) models, while adept at sequential knowledge acquisition, face significant and often overlooked privacy challenges due to accumulating diverse information. Traditional privacy methods, like a uniform Differential Privacy (DP) budget, indiscriminately protect all data, leading to substantial model utility degradation and hindering CL deployment in privacy-sensitive areas. To overcome this, we propose a privacy-enhanced continual learning (PeCL) framework that forgets what's sensitive and remembers what matters. Our approach first introduces a token-level dynamic Differential Privacy strategy that adaptively allocates privacy budgets based on the semantic sensitivity of individual tokens. This ensures robust protection for private entities while minimizing noise injection for non-sensitive, general knowledge. Second, we integrate a privacy-guided memory sculpting module. This module leverages the sensitivity analysis from our dynamic DP mechanism to intelligently forget sensitive information from the model's memory and parameters, while explicitly preserving the task-invariant historical knowledge crucial for mitigating catastrophic forgetting. Extensive experiments show that PeCL achieves a superior balance between privacy preserving and model utility, outperforming baseline models by maintaining high accuracy on previous tasks while ensuring robust privacy.

2509.06858 2026-05-25 physics.soc-ph cs.AI nlin.AO 版本更新

Disentangling Interaction and Bias Effects in Opinion Dynamics of Large Language Models

大型语言模型中意见动态的交互与偏差效应的分离

Vincent C. Brockers, David A. Ehrlich, Viola Priesemann

发表机构 * Max-Planck-Institute for Dynamics and Self-Organization（马克斯·普朗克动态与自组织研究所）； Institute for the Dynamics of Complex Systems（复杂系统动力学研究所）； University of Göttingen（哥廷根大学）； Campus Institute for Dynamics of Biological Networks（校园生物网络动力学研究所）

AI总结该研究探讨了大型语言模型在模拟人类意见动态时，真实交互效果如何被系统性偏差所掩盖的问题。研究提出了一种贝叶斯框架，用于分离和量化三种偏差：主题偏差、同意偏差和锚定偏差，并应用于多个模型在不同话题上的多轮对话实验中。结果表明，意见演化趋向于快速收敛，偏差和交互的影响随时间减弱，且不同模型的偏差表现存在差异，研究还揭示了微调对模型意见吸引子的影响，为评估语言模型在人类行为模拟中的潜力与局限提供了量化工具。

详情

AI中文摘要

大型语言模型越来越多地被用于模拟人类意见动态，然而真实交互的影响常常被系统性偏差所掩盖。我们开发了一个贝叶斯框架来分离并量化三种这样的偏差：(i) 偏向LLM默认立场的主题偏差；(ii) 无论问题如何都倾向于同意提示语句的同意偏差；(iii) 偏向发起对话的主体立场的锚定偏差。我们将该框架应用于多个LLM，这些模型在从气候变化、社会正义到音乐偏好的12个不同问题上执行了多步对话。我们发现意见轨迹往往迅速收敛到一个共享吸引子，交互和偏差的影响随时间衰减，且偏差的影响在不同LLM之间有所不同。此外，我们表明，在不同组别的强烈意见陈述（包括错误信息）上微调LLM会相应地改变意见吸引子。通过揭示LLM之间的显著差异，并提供定量工具来比较交互和偏差对LLM主体讨论中意见转变的贡献，我们的方法突出了使用LLM作为人类行为代理的潜力和陷阱。

英文摘要

Large Language Models are increasingly used to simulate human opinion dynamics, yet the effect of genuine interaction is often obscured by systematic biases. We develop a Bayesian framework to disentangle and quantify three such biases: (i) A topic bias toward the LLM's default stance; (ii) an agreement bias favoring agreement to the prompted statement irrespective of the question; and (iii) an anchoring bias toward the initiating agent's stance. We apply this framework to various LLMs that performed multi-step dialogues on 12 different questions from climate change and societal justice to music preferences. We find that opinion trajectories tend to quickly converge to a shared attractor, with the influence of both interaction and biases decaying over time, and with the impact of biases differing between LLMs. In addition, we show that fine-tuning an LLM on different sets of strongly opinionated statements (including misinformation) shifts the opinion attractor correspondingly. By exposing stark differences between LLMs and providing quantitative tools for comparing interaction and bias contributions to opinion shifts in LLM agent discussions, our approach highlights both promises and pitfalls of using LLMs as proxies for human behavior.

2508.18958 2026-05-25 cs.CV cs.AI 版本更新

A drone-based framework for coral habitat mapping via weakly supervised segmentation

基于弱监督分割的无人机珊瑚栖息地制图框架

Matteo Contini, Victor Illien, Sylvain Poulain, Serge Bernard, Julien Barde, Sylvain Bonhommeau, Alexis Joly

发表机构 * IFREMER Délégation Océan Indien (DOI)（IFREMER印度洋代表处）； INRIA, LIRMM, Université de Montpellier, CNRS（INRIA、LIRMM、蒙彼利埃大学、国家科学研究中心）； UMR Marbec, IRD, Université de Montpellier, CNRS, Ifremer（Marbec联合研究单位、IRD、蒙彼利埃大学、国家科学研究中心、IFREMER）； CNRS, LIRMM, Université de Montpellier（国家科学研究中心、LIRMM、蒙彼利埃大学）

AI总结本文提出了一种基于无人机的弱监督分割框架，用于珊瑚生境的映射。该方法通过结合水下图像的细粒度多标签预测和广覆盖的航拍数据，无需像素级标注即可训练高分辨率分割模型。研究在珊瑚礁图像上验证了该方法，实现了大面积珊瑚形态的分割，取得了86.07%的像素准确率和52.23%的平均交并比，展示了其在生态监测中的高效性和适用性。

Comments Extended journal version of "The Point is the Mask: Scaling coral reef segmentation with weak supervision"

详情

DOI: 10.1007/s10044-026-01682-3

AI中文摘要

在大空间范围内获取像素级标注仍然是机器学习在生态应用中部署的主要瓶颈。本文提出了一种多尺度弱监督语义分割（WSSS）框架，能够利用密集的、基于分类的输出训练高分辨率分割模型。我们的方法将来自水下图像的细粒度多标签预测与广覆盖的航空数据相结合。将这些点级分类转换为粗监督掩码，用于训练无人机（UAV）正射影像上的语义分割模型。然后使用模型自身的细化预测进行第二步训练，以进一步提高空间精度，无需额外标注。我们在珊瑚礁图像上展示了该方法，实现了珊瑚形态类型的大面积分割，并展示了其整合新类别的灵活性。最终模型在人工标注的礁区上达到86.07%的像素精度和52.23%的平均交并比（mIoU），表明无需像素级标注即可获得准确的大规模珊瑚分割。通过跨尺度和跨模态连接图像分类与分割，该方法为标注不可用场景下部署分割模型提供了高效解决方案，并为生态学及其他领域的可扩展、高效监测开辟了机会。

英文摘要

Obtaining pixel-level annotations over large spatial extents remains a major bottleneck for deploying machine learning in ecological applications. Here we present a multi-scale weakly supervised semantic segmentation (WSSS) framework that enables training high-resolution segmentation models from dense, classification-based outputs. Our method combines fine-scale, multi-label predictions from underwater imagery with broad-coverage aerial data. We convert these point-level classifications into coarse supervision masks that can be used to train a semantic segmentation model on Unmanned Aerial Vehicle (UAV) orthophotos. A second training step using the model's own refined predictions is then used to further improve spatial accuracy without requiring additional annotations. We demonstrate the approach on coral reef imagery, enabling large-area segmentation of coral morphotypes and illustrating its flexibility in integrating new classes. The final model achieves 86.07% pixel accuracy and 52.23% mean Intersection over Union (mIoU) on manually annotated reef zones, demonstrating that accurate large-scale coral segmentation can be obtained without pixel-level annotations. By bridging image classification and segmentation across scales and modalities, this method provides an efficient solution for deploying segmentation models in settings where annotations are unavailable and opens opportunities for scalable, efficient monitoring in ecology and beyond.
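
The two reported metrics, pixel accuracy and mean Intersection over Union (mIoU), can diverge considerably when classes are imbalanced. The sketch below computes both from a confusion matrix using their standard definitions. The labels are made-up toy data, not the paper's, and averaging IoU only over classes that occur is an assumption of this sketch.

```python
def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t][p] counts pixels of true class t predicted as class p."""
    cm = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        cm[t][p] += 1
    return cm


def pixel_accuracy(cm):
    """Share of all pixels whose predicted class equals the true class."""
    total = sum(sum(row) for row in cm)
    return sum(cm[c][c] for c in range(len(cm))) / total


def mean_iou(cm):
    """Average of TP / (TP + FP + FN) over classes that appear at all."""
    ious = []
    for c in range(len(cm)):
        tp = cm[c][c]
        fp = sum(cm[r][c] for r in range(len(cm))) - tp
        fn = sum(cm[c]) - tp
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy flattened label maps: class 0 dominates, classes 1 and 2 are rare.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 0, 0, 1, 0, 2, 1]

cm = confusion_matrix(y_true, y_pred, num_classes=3)
acc = pixel_accuracy(cm)
miou = mean_iou(cm)
print(round(acc, 4), round(miou, 4))
```

In this toy case the dominant class keeps accuracy at 0.8 while errors on the rare classes pull mIoU down to about 0.563, the same kind of gap as between the 86.07% pixel accuracy and 52.23% mIoU quoted above.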

2508.14311 2026-05-25 cs.LG cs.AI 版本更新

Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback

通过图结构反馈进行多重公平正则化器的在线学习

Quan Zhou, Jakub Marecek, Robert Shorten

发表机构 * Department of Mathematics, National University of Singapore（新加坡国立大学数学系）； Department of Computer Science, Czech Technical University（捷克技术大学计算机科学系）； Dyson School of Design Engineering, Imperial College London（伦敦帝国理工学院设计工程戴森学院）； Imperial College London（伦敦帝国理工学院）

AI总结本文研究了在自动决策系统中如何同时满足多个可能相互冲突的公平性要求的问题。作者提出了一种基于图结构反馈的强化学习方法，能够在序贯交互过程中自适应地学习不同公平性目标的权重。该方法为动态环境中实现多目标公平性优化提供了新的解决方案。

Comments Published in Transactions on Machine Learning Research (TMLR), 2026. OpenReview: https://openreview.net/forum?id=y8iWuDZtEw

Journal ref Transactions on Machine Learning Research (TMLR), 2026

2508.14083 2026-05-25 cs.LG cs.AI 版本更新

GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values

GeoMAE：面向缺失值的时空图预测的掩码表示学习

Songyu Ke, Chenyu Wu, Yuxuan Liang, Huiling Qin, Junbo Zhang, Yu Zheng

发表机构 * College of Computer and Data Science, Fuzhou University（福州大学计算机与数据科学学院）； JD Intelligent Cities Research（京东智能城市研究院）； School of Computing and Artificial Intelligence, Southwest Jiaotong University（西南交通大学计算机与人工智能学院）； Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Beijing Normal University（北京师范大学）

AI总结 GeoMAE 是一种用于时空图预测的自监督表示学习模型，旨在解决城市智能系统中因环境和设备问题导致的数据缺失问题。该方法通过引入基于注意力机制的时空预测网络和辅助学习任务，有效捕捉了传感器网络中的动态空间关联，并提升了模型对缺失数据的鲁棒性。实验表明，GeoMAE 在多个真实数据集上显著优于现有方法，相对提升了最高达13.20%的预测性能。

Comments 34 pages for pre-print version. This work has been published in *Neural Networks*. Please check the latest version via the following DOI

详情

DOI: 10.1016/j.neunet.2026.108986

AI中文摘要

城市智能系统中缺失数据的普遍存在，归因于不利的环境条件和设备故障，对下游应用（尤其是交通预测和能耗预测）的有效性构成了重大挑战。因此，开发一种能够从不完整数据集中提取有意义信息的稳健时空学习方法至关重要。尽管存在针对缺失值时空图预测的方法，但未解决的问题依然存在。首先，现有研究大多基于时间序列分析，从而忽略了传感器网络中固有的动态空间相关性。其次，缺失数据模式的复杂性加剧了问题的复杂性。此外，维护条件的差异导致缺失值比率和模式显著波动，从而挑战了预测模型的泛化能力。针对这些挑战，本研究引入了GeoMAE，一种自监督的时空表示学习模型。该模型由三个主要组件组成：输入预处理模块、基于注意力的时空预测网络（STAFN）和一个辅助学习任务，该任务受掩码自编码器启发，以增强时空表示学习的鲁棒性。在真实数据集上的实证评估表明，GeoMAE显著优于现有基准，相对于最佳基线模型实现了高达13.20%的相对改进。

英文摘要

The ubiquity of missing data in urban intelligence systems, attributable to adverse environmental conditions and equipment failures, poses a significant challenge to the efficacy of downstream applications, notably in the realms of traffic forecasting and energy consumption prediction. Therefore, it is imperative to develop a robust spatio-temporal learning methodology capable of extracting meaningful insights from incomplete datasets. Despite the existence of methodologies for spatio-temporal graph forecasting in the presence of missing values, unresolved issues persist. Primarily, the majority of extant research is predicated on time-series analysis, thereby neglecting the dynamic spatial correlations inherent in sensor networks. Additionally, the complexity of missing data patterns compounds the intricacy of the problem. Furthermore, the variability in maintenance conditions results in a significant fluctuation in the ratio and pattern of missing values, thereby challenging the generalizability of predictive models. In response to these challenges, this study introduces GeoMAE, a self-supervised spatio-temporal representation learning model. The model is comprised of three principal components: an input preprocessing module, an attention-based spatio-temporal forecasting network (STAFN), and an auxiliary learning task, which draws inspiration from Masking AutoEncoders to enhance the robustness of spatio-temporal representation learning. Empirical evaluations on real-world datasets demonstrate that GeoMAE significantly outperforms existing benchmarks, achieving up to 13.20\% relative improvement over the best baseline models.

2507.06252 2026-05-25 cs.CR cs.AI cs.LG 版本更新

False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems

虚假警报，真实损害：基于LLM的模型对文本网络威胁情报系统的对抗攻击

Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira

发表机构 * Faculty of Sciences, University of Lisbon（里斯本大学科学学院）； CIENCES, University of Lisbon（里斯本大学CIENCES）

AI总结本文研究了基于大语言模型（LLM）的对抗攻击对基于文本的网络威胁情报（CTI）系统的影响。研究分析了三种攻击类型，包括规避、泛滥和投毒攻击，揭示了CTI系统在处理来自开放来源的文本数据时存在的脆弱性。特别指出，通过生成虚假文本，攻击者可以误导分类器，降低系统性能并破坏其功能，其中规避攻击在CTI流程中尤为关键，为后续攻击提供了前提条件。

Journal ref Future Generation Computer Systems, 2026

详情

DOI: 10.1016/j.future.2026.108603

AI中文摘要

网络威胁情报（CTI）已成为一种重要的补充方法，在网络威胁生命周期的早期阶段运作。CTI涉及收集、处理和分析威胁数据，以提供更准确和快速的网络威胁理解。由于数据量大，通过机器学习（ML）和自然语言处理（NLP）模型进行自动化对于有效的CTI提取至关重要。这些自动化系统利用来自社交网络、论坛和博客等来源的开源情报（OSINT）来识别失陷指标（IoCs）。尽管先前的研究集中在针对特定ML模型的对抗攻击上，但本研究通过调查整个CTI管道中各个组件的脆弱性及其对对抗攻击的敏感性，扩展了研究范围。这些脆弱性源于它们从各种开放来源（包括真实和潜在虚假内容）接收文本输入。我们分析了针对CTI管道的三种攻击类型，包括逃避、淹没和投毒，并评估了它们对系统信息选择能力的影响。具体而言，在虚假文本生成方面，该工作展示了对抗文本生成技术如何创建虚假的网络安全和类似网络安全的文本，从而误导分类器、降低性能并破坏系统功能。重点主要放在逃避攻击上，因为它先于并使得CTI管道中的淹没和投毒攻击成为可能。

英文摘要

Cyber Threat Intelligence (CTI) has emerged as a vital complementary approach that operates in the early phases of the cyber threat lifecycle. CTI involves collecting, processing, and analyzing threat data to provide a more accurate and rapid understanding of cyber threats. Due to the large volume of data, automation through Machine Learning (ML) and Natural Language Processing (NLP) models is essential for effective CTI extraction. These automated systems leverage Open Source Intelligence (OSINT) from sources like social networks, forums, and blogs to identify Indicators of Compromise (IoCs). Although prior research has focused on adversarial attacks on specific ML models, this study expands the scope by investigating vulnerabilities within various components of the entire CTI pipeline and their susceptibility to adversarial attacks. These vulnerabilities arise because they ingest textual inputs from various open sources, including real and potentially fake content. We analyse three types of attacks against CTI pipelines, including evasion, flooding, and poisoning, and assess their impact on the system's information selection capabilities. Specifically, on fake text generation, the work demonstrates how adversarial text generation techniques can create fake cybersecurity and cybersecurity-like text that misleads classifiers, degrades performance, and disrupts system functionality. The focus is primarily on the evasion attack, as it precedes and enables flooding and poisoning attacks within the CTI pipeline.

2507.05311 2026-05-25 cs.IR cs.AI 版本更新

PLACE: Prompt Learning for Attributed Community Search in Large Graphs

PLACE：面向大规模图属性社区搜索的提示学习

Shuheng Fang, Kangfei Zhao, Rener Zhang, Yu Rong, Jeffrey Xu Yu

发表机构 * Shenzhen Institute of Computing Sciences（深圳计算科学研究院）； Beijing Institute of Technology（北京理工大学）； Chinese University of Hong Kong（香港中文大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））

AI总结本文提出PLACE，一种用于属性社区搜索的图提示学习框架。该方法受到自然语言处理中提示调优的启发，通过在图中插入可学习的提示标记，构建提示增强图结构，以增强与查询相关的节点间连接，帮助图神经网络更有效地识别结构连贯性和属性相似性。实验表明，PLACE在多个真实图数据集上显著优于现有方法，平均F1分数提升22%。

Comments 14 pages, 9 figures

Journal ref KDD 2026

详情

AI中文摘要

在本文中，我们提出了PLACE（面向属性社区搜索的提示学习），一种创新的图提示学习框架用于ACS。受自然语言处理（NLP）中提示调优的启发，其中可学习的提示令牌被插入以语境化NLP查询，PLACE将结构化和可学习的提示令牌集成到图中作为查询相关的细化机制，形成提示增强图。在这种提示增强图结构中，学习到的提示令牌充当桥梁，加强图中节点与查询之间的连接，使GNN能够更有效地识别与特定查询相关的结构凝聚性和属性相似性模式。我们采用交替训练范式来联合优化提示参数和GNN。此外，我们设计了一种分治策略以增强可扩展性，支持模型处理百万级图。在9个真实图上的大量实验证明了PLACE对三种类型ACS查询的有效性，其中PLACE的平均F1分数比现有最先进方法高出22%。

英文摘要

In this paper, we propose PLACE (Prompt Learning for Attributed Community Search), an innovative graph prompt learning framework for ACS. Enlightened by prompt-tuning in Natural Language Processing (NLP), where learnable prompt tokens are inserted to contextualize NLP queries, PLACE integrates structural and learnable prompt tokens into the graph as a query-dependent refinement mechanism, forming a prompt-augmented graph. Within this prompt-augmented graph structure, the learned prompt tokens serve as a bridge that strengthens connections between graph nodes for the query, enabling the GNN to more effectively identify patterns of structural cohesiveness and attribute similarity related to the specific query. We employ an alternating training paradigm to optimize both the prompt parameters and the GNN jointly. Moreover, we design a divide-and-conquer strategy to enhance scalability, supporting the model to handle million-scale graphs. Extensive experiments on 9 real-world graphs demonstrate the effectiveness of PLACE for three types of ACS queries, where PLACE achieves higher F1 scores by 22% compared to the state-of-the-arts on average.

2506.04390 2026-05-25 cs.CR cs.AI 版本更新

Through the Stealth Lens: Attention-Aware Defenses Against Poisoning in RAG

通过隐秘视角：检索增强生成中针对投毒攻击的注意力感知防御

Sarthak Choudhary, Nils Palumbo, Ashish Hooda, Krishnamurthy Dj Dvijotham, Somesh Jha

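Affiliations: University of Wisconsin-Madison; ServiceNow Research

AI summary: This paper studies stealth-aware defenses against data poisoning attacks in retrieval-augmented generation (RAG) systems. The authors propose an attention-based defense that analyzes the language model's attention weights, introducing the Normalized Passage Attention Score (NPAS) and the Attention-Variance Filter (AV Filter) to detect and filter poisoned retrieved content. Experiments show that the method markedly improves robustness and reveal how difficult truly stealthy poisoning attacks are to achieve.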

Comments Accepted at ICML 2026

Details

AI Chinese abstract

检索增强生成（RAG）系统容易受到攻击，即使污染率很低，攻击者也能将有毒段落注入检索到的上下文中。我们表明现有攻击并非设计为隐秘的，因此可以实现可靠的检测和缓解。我们形式化了一个基于可区分性的安全游戏来量化此类攻击的隐秘性。如果少数有毒段落控制了响应，它们必须比良性段落更偏向推理过程，这本质上损害了隐秘性。这促使我们分析LLM的中间信号（如注意力权重）来近似不同段落对响应的影响。利用注意力权重，我们引入了$\textbf{归一化段落注意力分数}$（NPAS）和轻量级的$\textbf{注意力方差滤波器}$（AV Filter），用于标记异常段落。我们的方法提高了鲁棒性，相比基线防御，准确率提高了约$\sim$ $\textbf{20\%}$。我们还开发了自适应攻击，试图隐藏此类异常，成功率高达$\textbf{35\%}$，这凸显了在RAG系统中实现真正隐秘投毒的挑战。

English abstract

Retrieval-augmented generation (RAG) systems are vulnerable to attacks that inject poisoned passages into the retrieved context, even at low corruption rates. We show that existing attacks are not designed to be stealthy, allowing reliable detection and mitigation. We formalize a distinguishability-based security game to quantify stealth for such attacks. If a few poisoned passages control the response, they must bias the inference process more than the benign ones, inherently compromising stealth. This motivates analyzing intermediate signals of LLMs, such as attention weights, to approximate the influence of different passages on the response. Leveraging attention weights, we introduce the $\textbf{Normalized Passage Attention Score}$ (NPAS) and a lightweight $\textbf{Attention-Variance Filter}$ (AV Filter) that flags anomalous passages. Our method improves robustness, yielding up to $\sim$ $\textbf{20\%}$ higher accuracy than baseline defenses. We also develop adaptive attacks that attempt to conceal such anomalies, achieving up to $\textbf{35\%}$ success rate and underscoring the challenges of achieving true stealth in poisoning RAG systems.
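
As a rough illustration of the idea in the abstract, the sketch below rescales per-passage attention mass into normalized scores and drops passages whose share is anomalously high. The abstract does not give the definitions of NPAS or the AV Filter, so the function names, the sum-to-one normalization, the greedy removal rule, and the threshold are all assumptions made for illustration, not the authors' method.

```python
from statistics import pvariance


def normalized_passage_scores(attention_mass):
    """Rescale per-passage attention mass so the scores sum to 1.

    Hypothetical normalization; the paper's NPAS may be defined differently.
    """
    total = sum(attention_mass)
    if total <= 0:
        raise ValueError("total attention mass must be positive")
    return [m / total for m in attention_mass]


def variance_filter(attention_mass, threshold=0.01):
    """Return indices of passages kept after a greedy variance-based filter.

    While the variance of the normalized scores exceeds `threshold`, the
    passage with the largest score is dropped. At least one passage is kept.
    """
    kept = list(range(len(attention_mass)))
    while len(kept) > 1:
        scores = normalized_passage_scores([attention_mass[i] for i in kept])
        if pvariance(scores) <= threshold:
            break
        top = max(range(len(kept)), key=lambda j: scores[j])
        kept.pop(top)
    return kept


# Passage 3 absorbs most of the attention mass and is filtered out.
print(variance_filter([1.0, 1.1, 0.9, 6.0, 1.0]))  # prints [0, 1, 2, 4]
```

In practice the attention mass per passage would come from the model's attention weights over the retrieved context; the numbers above are made up.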

2505.21573 2026-05-25 cs.LG cs.AI Updated version

Spectral-inspired Operator Learning with Limited Data and Unknown Physics

光谱启发的少数据与未知物理下的算子学习

Han Wan, Rui Zhang, Hao Sun

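Affiliations: Gaoling School of Artificial Intelligence, Renmin University of China

AI summary: This paper addresses the challenge of learning partial differential equation (PDE) dynamics when data are limited and the underlying physics is unknown. It proposes SINO, a spectral-inspired neural operator that models complex systems from only 2 to 5 trajectories without relying on explicit PDE terms. SINO automatically captures local and global spatial derivatives from frequency indices and combines a multiplicative block with a low-pass filter to handle nonlinear effects and aliasing. It performs strongly on several 2D and 3D PDE benchmarks and clearly outperforms existing methods in low-data and out-of-distribution settings.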

Comments To appear in KDD 2026

Details

DOI: 10.1145/3770855.3817831

AI Chinese abstract

从有限数据和未知物理中学习PDE动力学具有挑战性。现有的神经PDE求解器要么需要大型数据集，要么依赖已知物理（如PDE残差或手工模板），导致适用性有限。为解决这些问题，我们提出光谱启发神经算子（SINO），它仅需2-5条轨迹即可建模复杂系统，无需显式PDE项。具体而言，SINO从频率索引自动捕获局部和全局空间导数，从而在物理无关机制下实现底层微分算子的紧凑表示。为建模非线性效应，它采用Pi块对光谱特征进行乘法运算，并辅以低通滤波器抑制混叠。在2D和3D PDE基准上的大量实验表明，SINO实现了最先进的性能，精度提升1-2个数量级。特别地，仅用5条训练轨迹，SINO就优于在1000条轨迹上训练的数据驱动方法，并在其他方法失败的高难度分布外案例中保持预测能力。

English abstract

Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or rely on known physics (e.g., PDE residuals or handcrafted stencils), leading to limited applicability. To address these challenges, we propose Spectral-Inspired Neural Operator (SINO), which can model complex systems from just 2-5 trajectories, without requiring explicit PDE terms. Specifically, SINO automatically captures both local and global spatial derivatives from frequency indices, enabling a compact representation of the underlying differential operators in physics-agnostic regimes. To model nonlinear effects, it employs a Pi-block that performs multiplicative operations on spectral features, complemented by a low-pass filter to suppress aliasing. Extensive experiments on both 2D and 3D PDE benchmarks demonstrate that SINO achieves state-of-the-art performance, with improvements of 1-2 orders of magnitude in accuracy. Particularly, with only 5 training trajectories, SINO outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases where other methods fail.
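
The link between frequency indices and spatial derivatives mentioned in the abstract can be seen in a classical pseudo-spectral setting: on a periodic grid, differentiating corresponds to multiplying each Fourier coefficient by i·k, and a low-pass mask removes the high modes that cause aliasing. The sketch below is that textbook construction in one dimension, not SINO's architecture; the function names and the 2/3 cutoff fraction are assumptions chosen for illustration.

```python
import cmath
import math


def dft(x):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
            for k in range(n)]


def idft(X):
    """Inverse of `dft`."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
            for j in range(n)]


def wavenumbers(n):
    """Signed integer frequency indices for a length-n grid on [0, 2*pi)."""
    return [k if k < n / 2 else k - n for k in range(n)]


def spectral_derivative(u, cutoff_fraction=2 / 3):
    """Differentiate periodic samples u by multiplying mode k with i*k.

    Modes with |k| above cutoff_fraction * (n // 2) are zeroed (low-pass filter).
    """
    n = len(u)
    k_max = cutoff_fraction * (n // 2)
    coeffs = dft(u)
    d_coeffs = [1j * k * c if abs(k) <= k_max else 0j
                for k, c in zip(wavenumbers(n), coeffs)]
    return [v.real for v in idft(d_coeffs)]


n = 16
xs = [2 * math.pi * j / n for j in range(n)]
du = spectral_derivative([math.sin(x) for x in xs])
print(max(abs(a - math.cos(x)) for a, x in zip(du, xs)))  # close to machine precision
```

With this cutoff, a mode such as sin(7x) on a 16-point grid is removed entirely by the filter, so its computed derivative is approximately zero.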

2504.09846 2026-05-25 cs.LG cs.AI cs.HC Updated version

GlyTwin: Digital Twin for Glucose Control in Type 1 Diabetes Through Optimal Behavioral Modifications Using Patient-Centric Counterfactuals

GlyTwin: 通过以患者为中心的反事实实现1型糖尿病血糖控制的最佳行为修改的数字孪生

Asiful Arefeen, Saman Khamesian, Maria Adela Grando, Bithika Thompson, Hassan Ghasemzadeh

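Affiliations: College of Health Solutions, Arizona State University; School of Computing and Augmented Intelligence, Arizona State University; Department of Endocrinology, Mayo Clinic Arizona

AI summary: This study proposes GlyTwin, a digital twin framework for improving glucose control in people with type 1 diabetes through behavioral optimization. Its core approach uses counterfactual explanations to simulate optimal behavioral interventions, such as adjusting carbohydrate intake and insulin dosing, in order to reduce hyperglycemic events. The study also incorporates stakeholder preferences so that the interventions are more personalized and practical. Experiments show that GlyTwin outperforms existing methods at generating valid counterfactual explanations and preventing hyperglycemia.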

Details

AI Chinese abstract

频繁和长期暴露于高血糖会增加慢性并发症的风险，包括神经病变、肾病和心血管疾病。现有的连续皮下胰岛素输注（CSII）和连续血糖监测（CGM）技术仅模拟血糖调节的特定方面，例如预测低血糖和给予小剂量胰岛素推注。同样，当前糖尿病管理中的数字孪生方法主要侧重于预测血糖对人类行为和胰岛素治疗的反应。因此，这些技术缺乏提供替代治疗方案的能力，而这些方案可以指导主动行为干预以实现最佳糖尿病管理。为填补这一空白，我们提出GlyTwin，一种新颖的计算框架，通过整合反事实解释来增强数字孪生技术，以模拟血糖控制的最佳行为治疗。GlyTwin通过推荐行为选择（如碳水化合物摄入和胰岛素剂量）的调整来生成反事实治疗，以显著减少高血糖事件的发生和持续时间。此外，GlyTwin将利益相关者的偏好纳入其干预生成过程，确保工具个性化和以用户为中心。我们在AZT1D上评估GlyTwin，该数据集是通过收集50名使用自动胰岛素输送（AID）系统的1型糖尿病（T1D）患者的纵向数据构建的，每人监测26天。结果表明，与历史数据相比，GlyTwin在生成反事实解释方面优于现有方法，有效解释率为85.8%，预防高血糖的有效性为87.3%。

English abstract

Frequent and long-term exposure to hyperglycemia increases the risk of chronic complications, including neuropathy, nephropathy, and cardiovascular disease. Existing continuous subcutaneous insulin infusion (CSII) and continuous glucose monitoring (CGM) technologies model only specific aspects of glycemic regulation, such as predicting hypoglycemia and administering small insulin boluses. Similarly, current digital twin approaches in diabetes management primarily focus on predicting glucose responses to human behavior and insulin therapy. As a result, these technologies lack the ability to provide alternative treatment scenarios that could guide proactive behavioral interventions for optimal diabetes management. To address this gap, we propose GlyTwin, a novel computational framework that enhances digital twin technologies by integrating counterfactual explanations to simulate optimal behavioral treatments for glucose control. GlyTwin generates counterfactual treatments by recommending adjustments to behavioral choices, such as carbohydrate intake and insulin dosing, to significantly reduce the occurrence and duration of hyperglycemic events. In addition, GlyTwin incorporates stakeholder preferences into its intervention-generation process, ensuring that the tool is personalized and user-centric. We evaluate GlyTwin on AZT1D, a new dataset constructed by collecting longitudinal data from 50 individuals living with type 1 diabetes (T1D) on automated insulin delivery (AID) systems, each monitored for 26 days. Results show that GlyTwin outperforms state-of-the-art methods for generating counterfactual explanations, with 85.8\% valid explanations and 87.3\% effectiveness in preventing hyperglycemia compared with historical data.

2502.17119 2026-05-25 cs.LG cs.AI Updated version

Diffusion and Flow Matching Models for Tabular Data: A Survey

表格数据的扩散与流匹配模型：综述

Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

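Affiliations: Great Bay University; Vrije Universiteit Amsterdam; LIACS, Leiden University

AI summary: This paper surveys diffusion and flow matching models for tabular data generation, discussing their strengths and methods for handling challenges such as mixed numerical and categorical attributes, missing values, sensitive fields, and complex dependencies. It systematically reviews work from 2015 to 2026, organized around data-engineering challenges, tasks, design choices, and evaluation dimensions, and identifies open problems in scalability, feature dependency modeling, privacy, fairness, and constraint-aware generation.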

Comments We substantially updated the previous version "Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions" by including flow matching models for tabular data

Details

AI Chinese abstract

深度生成模型在图像、文本、音频和视频生成方面取得了快速进展，并越来越多地应用于结构化记录。然而，对于表格数据，生成建模仍然困难：数据集可能包含数值和分类属性、缺失值、敏感字段、不平衡类别、复杂的特征依赖和领域约束。早期基于GAN或VAE的表格数据建模方法取得了有用结果，但可能面临训练不稳定、模式崩溃、多模态分布建模能力弱以及混合类型特征处理脆弱等问题。因此，扩散模型因其噪声-去噪公式提供了灵活稳定的方式来建模复杂数据分布而受到越来越多的关注，并已被应用于表格合成、缺失值填补、可信数据生成和异常检测。流匹配通过学习沿概率路径的传输向量场提供了一条密切相关的途径，通常对路径设计和采样效率有更直接的控制。尽管取得了进展，但针对表格数据的扩散和流匹配模型文献仍然难以比较，因为方法针对不同任务，依赖于不同的表示、目标、评估协议和领域假设。据我们所知，这是第一篇专门针对表格数据的扩散和流匹配模型的综述。我们回顾了2015年6月至2026年5月的工作，围绕数据工程挑战、任务、设计选择和评估维度进行组织，并讨论了可扩展性、特征依赖建模、隐私、公平性、基准测试和约束感知生成中的开放问题。我们在GitHub仓库中保持更新。

English abstract

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.

2502.04415 2026-05-25 cs.CV cs.AI Updated version

TerraQ: Spatiotemporal Question-Answering on Satellite Image Archives

TerraQ：卫星图像档案的时空问答

Sergios-Anestis Kefalidis, Konstantinos Plas, Manolis Koubarakis

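Affiliations: Dept. of Informatics and Telecommunications; National and Kapodistrian University of Athens; Archimedes/Athena RC

AI summary: TerraQ is a spatiotemporal question-answering system for satellite image archives that quickly retrieves the satellite images matching a natural-language query. The system combines natural language processing with a spatial knowledge base and supports complex queries over image metadata and geographic entities. Its main contribution is making Earth observation data more accessible and easier to search.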

2502.04230 2026-05-25 cs.SD cs.AI cs.CR cs.LG eess.AS Updated version

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

XAttnMark：基于交叉注意力的鲁棒音频水印学习

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

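Affiliations: Department of Computer Science, Lehigh University, Bethlehem, PA, USA; Dolby Laboratories Inc., San Francisco, CA, USA

AI summary: With the rapid development of generative audio synthesis and editing, problems such as copyright protection, data provenance, and the spread of deepfake audio have become increasingly prominent. This paper proposes XAttnMark, a robust audio watermarking method based on cross-attention. It jointly optimizes watermark detection and attribution through partial parameter sharing between the generator and the detector, an efficient cross-attention message retrieval mechanism, and a temporal conditioning module. The method also introduces a psychoacoustic-aligned time-frequency masking loss that improves watermark imperceptibility. Experiments show superior robustness under a wide range of audio transformations.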

Comments Accepted at ICML'25

Details

AI Chinese abstract

生成式音频合成与编辑技术的快速普及引发了关于版权侵权、数据溯源以及通过深度伪造音频传播虚假信息的严重担忧。水印技术通过将不可感知但可识别和可追踪的信号嵌入音频内容，提供了一种主动解决方案。尽管最近基于神经网络的水印方法（如WavMark和AudioSeal）在鲁棒性和质量上有所改进，但它们难以同时优化鲁棒检测和准确归因。本文介绍了交叉注意力鲁棒音频水印（XATTNMARK），通过利用生成器和检测器之间的部分参数共享、用于高效消息检索的交叉注意力机制以及用于改善消息分布的时间条件模块，弥合了这一差距。此外，我们提出了一种心理声学对齐的时频（TF）掩蔽损失，捕捉细粒度的听觉掩蔽效应，提高了水印的不可感知性。XATTNMARK在检测和归因方面均达到了最先进的性能，展示了针对各种音频变换（包括不同强度的具有挑战性的生成式编辑）的卓越鲁棒性。这项工作推进了音频水印技术，用于在生成式AI时代保护知识产权并确保真实性。

English abstract

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

2411.12173 2026-05-25 cs.LG cs.AI Updated version

SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

SkillTree: 面向长时域控制任务的可解释基于技能的深度强化学习

Yongyan Wen, Siyuan Li, Rongchang Zuo, Lei Yuan, Hangyu Mao, Peng Liu

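Affiliations: Faculty of Computing, Harbin Institute of Technology; National Key Laboratory of Novel Software Technology, Nanjing University; School of Artificial Intelligence, Nanjing University; Polixir Technologies; SenseTime Research

AI summary: This paper proposes SkillTree, an explainable skill-based deep reinforcement learning framework for long-horizon control tasks with complex continuous action spaces. The method discretizes the continuous action space into a skill space and places a differentiable decision tree in the high-level policy to generate skill embeddings, which guide the low-level policy in executing specific skills and provide skill-level explainability. Experiments show that SkillTree performs comparably to neural-network-based skill methods on complex robotic arm control tasks while making the decision process more transparent.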

Details

DOI: 10.1609/aaai.v39i20.35451

AI Chinese abstract

深度强化学习（DRL）在各个研究领域取得了显著成功。然而，其对神经网络的依赖导致缺乏透明度，限制了实际应用。为了实现可解释性，决策树已成为神经网络的一种流行且有前景的替代方案。然而，由于其表达能力有限，传统决策树难以处理高维长时域连续控制任务。在本文中，我们提出了SkillTree，一种新颖的框架，将复杂的连续动作空间缩减为离散的技能空间。我们的层次化方法在高层次策略中集成了可微决策树以生成技能嵌入，进而指导低层次策略执行技能。通过使技能决策可解释，我们实现了技能级可解释性，增强了对复杂任务中决策过程的理解。实验结果表明，我们的方法在复杂机器人臂控制领域中达到了与基于技能的神经网络相当的性能。此外，SkillTree在技能级别提供解释，从而提高了决策过程的透明度。

English abstract

Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
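
To make the phrase "differentiable decision tree that generates skill embeddings" concrete, the sketch below shows one common form of soft decision tree: each inner node routes its input right with a sigmoid probability, each leaf holds an embedding, and the output is the probability-weighted mixture of leaf embeddings. This is a generic depth-2 construction under assumed names (`soft_tree_leaf_probs`, `skill_embedding`) and made-up parameters; the abstract does not specify the tree used in SkillTree.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_tree_leaf_probs(x, nodes):
    """Leaf probabilities of a depth-2 soft decision tree.

    `nodes` holds (weights, bias) for the root, the left child, and the
    right child. Each node sends the input right with probability
    sigmoid(w . x + b), so every routing decision is differentiable.
    """
    def go_right(node):
        weights, bias = node
        return sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)

    r0, r1, r2 = (go_right(node) for node in nodes)
    return [(1 - r0) * (1 - r1), (1 - r0) * r1, r0 * (1 - r2), r0 * r2]


def skill_embedding(x, nodes, leaf_embeddings):
    """Mix the four leaf embeddings using the leaf probabilities."""
    probs = soft_tree_leaf_probs(x, nodes)
    dim = len(leaf_embeddings[0])
    return [sum(p * e[d] for p, e in zip(probs, leaf_embeddings)) for d in range(dim)]


# Hypothetical parameters: the root splits on feature 0, the children on feature 1.
nodes = [([4.0, 0.0], 0.0), ([0.0, 4.0], 0.0), ([0.0, 4.0], 0.0)]
leaves = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
print(skill_embedding([-2.0, 2.0], nodes, leaves))
```

Because each split reads a single weighted combination of features, the path with the highest probability can be inspected as a sequence of human-readable conditions, which is the kind of skill-level explanation the abstract refers to.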

2402.17888 2026-05-25 cs.LG cs.AI Updated version

ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection

ConjNorm: 面向分布外检测的可处理密度估计

Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang

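Affiliations: University of Technology Sydney; The University of Queensland; Hong Kong Baptist University; University of Wisconsin-Madison

AI summary: This paper proposes ConjNorm, a density estimation method for improving out-of-distribution (OOD) detection. The method builds a theoretical framework on Bregman divergence, extends the distributions considered to the exponential family, and uses a conjugation constraint to recast density function design as a search for the optimal norm coefficient. To address the difficulty of computing the normalization, the authors design an unbiased and analytically tractable estimator of the partition function based on importance sampling. Experiments show that ConjNorm achieves state-of-the-art performance on multiple OOD detection benchmarks.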

Comments ICLR24 poster

Details

AI Chinese abstract

事后分布外检测在可靠机器学习中引起了广泛关注。许多工作致力于基于logits、距离或严格数据分布假设推导得分函数，以识别低得分OOD样本。然而，这些估计得分可能无法准确反映真实数据密度或施加不切实际的约束。为了提供密度基得分设计的统一视角，我们提出了一个基于Bregman散度的新理论框架，将分布考虑扩展到指数分布族。利用定理中揭示的共轭约束，我们引入了一种\textsc{ConjNorm}方法，将密度函数设计重新定义为针对给定数据集寻找最优范数系数$p$。鉴于归一化的计算挑战，我们利用基于蒙特卡洛的重要性采样技术，设计了一个无偏且解析可处理的配分函数估计器。在OOD检测基准上的大量实验表明，我们提出的\textsc{ConjNorm}在各种OOD检测设置中建立了新的最先进水平，在CIFAR-100和ImageNet-1K上分别比当前最佳方法（FPR95）高出高达13.25%和28.19%。

English abstract

Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimated scores may fail to accurately reflect the true data density or may impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively.

2402.14212 2026-05-25 cs.LG cs.AI Updated version

Moonwalk: Inverse-Forward Differentiation

Moonwalk: 逆-前向微分

Dmitrii Krylov, Armin Karamzade, Roy Fox

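Affiliations: University of California, Irvine

AI summary: Moonwalk addresses backpropagation's need to store intermediate activations and proposes a gradient computation method that avoids storing them. The method introduces the vector-inverse-Jacobian product (vijp) operator and combines submersive networks with fragmental gradient checkpointing to reconstruct gradients exactly in a forward sweep, allowing deeper networks without additional memory. Experiments show that Moonwalk matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.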

Journal ref The 29th International Conference on Artificial Intelligence and Statistics, 2026

Details

AI Chinese abstract

反向传播的主要限制是它需要在正向传播过程中存储中间激活值（残差），这限制了可训练网络的深度。这引出了一个基本问题：我们能否避免存储这些激活值？我们通过重新审视梯度计算的结构来解决这个问题。反向传播通过一系列向量-雅可比乘积计算梯度，这一操作通常是不可逆的。丢失的信息位于每层雅可比矩阵的余核中。我们定义了浸没式网络——其层雅可比矩阵具有平凡余核的网络——在这种网络中，梯度可以在前向扫描中精确重建，而无需存储激活值。对于非浸没式层，我们引入了碎片梯度检查点，仅记录恢复被雅可比矩阵擦除的余切向量所需的最小残差子集。我们方法的核心是一种新的算子，即向量-逆-雅可比乘积（vijp），它反转了余核外的梯度流。我们的混合模式算法首先通过内存高效的反向传播计算输入梯度，然后使用vijp在前向扫描中重建参数梯度，从而消除了存储激活值的需要。我们在Moonwalk中实现了该方法，并表明它在相同内存预算下训练深度超过两倍的网络时，运行时间与反向传播相当。

English abstract

Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
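
The simplest case of the idea can be worked by hand. For a linear layer y = W x with an invertible W, the reverse pass maps the output cotangent to the input cotangent by g_x = Wᵀ g_y, so the output cotangent can be recovered from the input cotangent as g_y = W⁻ᵀ g_x, and the weight gradient g_y xᵀ can then be formed while moving forward through the network. The sketch below checks this on a 2×2 layer; it is an illustration of that special case under assumed function names, not the Moonwalk implementation, which also covers layers that are not invertible.

```python
def matvec(A, v):
    return [sum(a * b for a, b in zip(row, v)) for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    if det == 0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def vjp_linear(W, g_out):
    """Reverse-mode step for y = W x: input cotangent g_in = W^T g_out."""
    return matvec(transpose(W), g_out)


def vijp_linear(W, g_in):
    """Inverse step: recover g_out = W^{-T} g_in (needs an invertible W)."""
    return matvec(inverse_2x2(transpose(W)), g_in)


def weight_grad(g_out, x):
    """Gradient of the loss with respect to W: outer product g_out x^T."""
    return [[g * xi for xi in x] for g in g_out]


W = [[2.0, 1.0], [0.0, 3.0]]
g_out = [0.5, -1.0]
g_in = vjp_linear(W, g_out)      # [1.0, -2.5]
recovered = vijp_linear(W, g_in)  # approximately [0.5, -1.0]
print(g_in, recovered, weight_grad(recovered, [1.0, 2.0]))
```

Since the recovered cotangent only needs the layer's input at the moment the forward sweep reaches that layer, no activation has to be kept in memory between a forward and a backward pass in this special case.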

2103.14995 2026-05-25 cs.LG cs.AI eess.SP Updated version

Thermal transmittance prediction based on the application of artificial neural networks on heat flux method results

基于人工神经网络在热流法结果上的热透射率预测

Sanjin Gumbarević, Bojan Milovanović, Mergim Gaši, Marina Bagarić

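Affiliations: Center for Theoretical Physics, Sloane Physics Laboratory, Yale University; University of Zagreb, Faculty of Civil Engineering, Department of Materials

AI summary: This paper studies how artificial neural networks (ANNs) can speed up in-situ measurement of the thermal transmittance (U-value) of building envelope elements. It introduces a parallel measurement strategy into heat flux method (HFM) measurements and predicts the unknown heat flux from interior and exterior air temperatures, which shortens the measurement time. The study compares several ANN models applied to a multi-layer wall; the results indicate promising accuracy in heat flux prediction and point to directions for further research.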

Comments Submitted to International Building Physics Conference 2021

Journal ref J. Phys.: Conf. Ser. 2069 (2021) 012152

Details

DOI: 10.1088/1742-6596/2069/1/012152

AI Chinese abstract

由于能效相关指令，欧洲联盟更加关注建筑群的深度能源改造。许多需要深度能源改造的建筑年代久远，可能缺乏设计/改造文件，或者建筑构件中的材料可能随时间发生退化。热透射率（即U值）是确定通过建筑围护结构构件传输热损失的最重要参数之一，取决于构成建筑构件的所有材料的厚度和热性能。现场U值可通过ISO 9869-1标准（热流法 - HFM）确定。然而，测量持续时间是HFM在改造设计过程开始前现场测试中未广泛使用的原因之一。本文分析了通过使用一个热流传感器进行并行测量来减少测量时间的可能性。这种并行化可以通过在HFM结果上应用特定类别的人工神经网络（ANN）来实现，基于收集的室内外空气温度预测未知热流。在达到满意的预测后，HFM传感器可重新定位到另一个测量位置。本文展示了四种ANN案例应用于HFM结果的比较，这些测量在一面多层墙上进行：一个隐藏层中有三个神经元的多层感知器、100个单元的长短期记忆、100个单元的门控循环单元以及50个长短期记忆单元和50个门控循环单元的组合。分析在基于两个输入温度预测热流率方面给出了有希望的结果。另一面墙上的额外分析显示了该方法的可能局限性，这为这一主题的进一步研究提供了方向。

English abstract

Deep energy renovation of the building stock has come into sharper focus in the European Union because of directives related to energy efficiency. Many buildings that must undergo deep energy renovation are old and may lack design or renovation documentation, or the materials in their building elements may have degraded over time. Thermal transmittance (i.e., U-value) is one of the most important parameters for determining the transmission heat losses through building envelope elements. It depends on the thickness and thermal properties of all the materials that form a building element. The in-situ U-value can be determined by the ISO 9869-1 standard (Heat Flux Method, HFM). Still, measurement duration is one of the reasons why HFM is not widely used in field testing before the renovation design process commences. This paper analyzes the possibility of reducing the measurement time by conducting parallel measurements with one heat-flux sensor. This parallelization could be achieved by applying a specific class of Artificial Neural Network (ANN) to HFM results to predict unknown heat flux from collected interior and exterior air temperatures. Once a satisfactory prediction is achieved, the HFM sensor can be relocated to another measuring location. The paper compares four ANN cases applied to HFM results for a measurement on one multi-layer wall: a multilayer perceptron with three neurons in one hidden layer, a long short-term memory network with 100 units, a gated recurrent unit network with 100 units, and a combination of 50 long short-term memory units and 50 gated recurrent units. The analysis gave promising results in terms of predicting the heat flux rate from the two input temperatures. Additional analysis on another wall showed possible limitations of the method, which serve as a direction for further research on this topic.
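
For context on what the measured or predicted heat flux feeds into: in the average method of ISO 9869-1, the U-value is estimated as the sum of the heat flux readings divided by the sum of the interior-exterior temperature differences over the same readings. The sketch below implements that ratio. The function name and the sample readings are invented for illustration, and the abstract does not state which data analysis method the authors applied.

```python
def u_value_average_method(heat_flux, t_interior, t_exterior):
    """Estimate the U-value in W/(m^2*K) with the average method.

    heat_flux  : heat flux density readings in W/m^2
    t_interior : interior air temperatures in degrees C (or K)
    t_exterior : exterior air temperatures in the same unit
    All three sequences must be aligned in time and equally long.
    """
    if not (len(heat_flux) == len(t_interior) == len(t_exterior)):
        raise ValueError("all series must have the same length")
    total_flux = sum(heat_flux)
    total_delta_t = sum(ti - te for ti, te in zip(t_interior, t_exterior))
    if total_delta_t == 0:
        raise ValueError("temperature difference sums to zero")
    return total_flux / total_delta_t


# Invented readings: 30 W/m^2 of total flux over 60 K of total difference.
print(u_value_average_method([10.0, 12.0, 8.0], [20.0, 20.0, 20.0], [0.0, -4.0, 4.0]))  # prints 0.5
```

Under the approach described in the abstract, heat flux values predicted by a trained network from the two air temperatures could be passed in place of sensor readings once the sensor has been moved elsewhere.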

2605.22940 2026-05-25 cs.LG cs.AI stat.ML Updated version

Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning

以人为中心的学习力学：熵正则化表示学习的动力学框架

Kim Phuc Tran

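Affiliations: Univ. Lille, ENSAIT, ULR 2461 – GEMTEX – Génie et Matériaux Textiles; International Chair in DS & XAI, International Research Institute for Artificial Intelligence and Data Science, Dong A University

AI summary: This paper proposes Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework intended to provide theoretical support for open and controlled learning systems. It argues that conventional entropy regularization can in some cases produce unstable gradients or gradients misaligned with the optimization direction, and therefore introduces the notion of effective entropy along with tractable geometric entropy surrogates, such as variance-based and log-determinant covariance proxies. The main contributions are a formalization of entropy regularization through effective information force, convergence and generalization results, and a dynamical interpretation of the relationship between model scale and performance. Experiments indicate that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces.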

Comments Submitted to JMLR

Details

AI Chinese abstract

深度学习越来越被视为参数空间中的动力学过程，然而许多现有理论仍将训练视为封闭的优化系统。这种观点对于现实世界的人工智能是有限的，因为模型在不确定性、资源约束、分布偏移、下游决策风险和人类反馈下运行。我们提出了以人为中心的学习力学（HCLM），一个用于开放和受控学习系统的动力学和信息论框架。核心思想是，只有当所选的熵代理沿着优化轨迹产生非简并的信息力时，熵正则化才是有用的。否则，熵项可能产生弱、不稳定或不对齐的梯度，导致动力学坍缩为普通的损失最小化。我们引入了有效熵的概念，并研究了可处理的几何熵代理，包括基于方差和对数行列式协方差代理。本文做出三项贡献。首先，它通过有效信息力形式化了熵正则化，并刻画了简并熵区域。其次，它在显式假设下推导了收敛性、熵流、Wasserstein梯度流和噪声表示泛化结果。第三，它提供了缩放律行为的条件动力学解释，作为信息注入、熵耗散和残差风险之间的平衡，而不声称对经验神经缩放律的无条件推导。受控的表示学习实验支持几何熵代理（尤其是对数行列式协方差熵）比softmax归一化熵产生更强更稳定的信息力的假设。

English abstract

Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.
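
One way to read "log-determinant covariance proxy" is as the Gaussian-style quantity ½·log det(Σ + εI), where Σ is the covariance of a batch of representations: it is large when the batch spreads across all directions and small when the representations collapse onto a line. The sketch below computes that quantity in plain Python. The formula, the ε value, and the function names are assumptions for illustration, since the abstract does not give the paper's exact definition.

```python
import math


def covariance(rows):
    """Population covariance matrix of a list of equal-length vectors."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
            for i in range(d)]


def logdet_spd(A):
    """log det of a symmetric positive definite matrix via Cholesky."""
    d = len(A)
    L = [[0.0] * d for _ in range(d)]
    logdet = 0.0
    for i in range(d):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][i] = math.sqrt(s)
                logdet += 2.0 * math.log(L[i][i])
            else:
                L[i][j] = s / L[j][j]
    return logdet


def logdet_entropy(rows, eps=1e-3):
    """0.5 * log det(cov + eps * I); eps keeps the matrix positive definite."""
    cov = covariance(rows)
    for i in range(len(cov)):
        cov[i][i] += eps
    return 0.5 * logdet_spd(cov)


spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
collapsed = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]
print(logdet_entropy(spread) > logdet_entropy(collapsed))  # prints True
```

The collapsed batch has larger per-coordinate variance than the spread one, yet a lower log-determinant value, which shows how the two kinds of proxy named in the abstract can disagree.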

2605.22905 2026-05-25 cs.AI cs.CL Updated version

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

EVE-Agent: 证据可验证的自我进化智能体

Yamato Arai, Yuma Ichikawa

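Affiliations: Fujitsu Limited; The University of Tokyo; RIKEN center for AIP

AI summary: This paper proposes EVE-Agent, an evidence-verifiable self-evolving agent, to address the problem that self-evolving search agents may generate fluent but inaccurate training examples when verifiable evidence is lacking. The method modifies the proposer-solver framework so that every generated instance contains not only an answer but also a verifiable source span, whose contribution to the answer is assessed by an evidence verifier. Experiments show that EVE-Agent substantially improves evidence-grounded correctness and that the generated training examples are auditable, which makes the system more trustworthy.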

Comments 23 pages, 2 figures

Details

AI Chinese abstract

自我进化智能体不应在其无法证明的示例上进行训练。无数据的自我进化搜索智能体提供了一种可扩展的途径，使系统能够生成自己的问题、回答问题，并从自身反馈中改进，而无需人工标注。然而，没有可验证的证据，这种循环可能会奖励流畅但无依据的示例，使自我生成的课程变成不透明且可能不可靠的训练信号。我们认为，证据可验证性是搜索智能体可信自我进化的先决条件：每个生成的实例不仅应包含答案，还应包含一个基于来源的文本片段，其对该答案的贡献可以被衡量。我们引入了EVE-Agent，一种证据可验证的自我进化智能体，通过对提议者-求解者框架的修改来实现这一原则。提议者生成一个问题、一个答案和一个逐字证据片段。然后，证据验证器根据提供证据时的边际准确率增益来奖励该片段。这产生了一个训练信号，倾向于真正有助于回答问题的证据，而不需要标准答案、人工标签或外部标注。EVE-Agent保持骨干模型、检索器、搜索工具和优化框架不变。实验表明，EVE-Agent在证据基础的准确性上显著优于先前的自我进化搜索智能体。由此产生的课程不仅是自我生成的，而且从结构上是可审计的：每个训练实例都带有一个可检查的来源片段，解释其为何值得信任。

English abstract

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer--solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
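
The reward described in the abstract, the marginal accuracy gain when the evidence is provided, can be sketched as the difference between the solver's accuracy with and without the evidence span, measured against the proposer's own answer. The function names, the exact-match comparison, and the sample rollouts below are assumptions for illustration; the paper may score answers differently.

```python
def accuracy(predictions, reference):
    """Fraction of predictions matching the reference after light normalization."""
    if not predictions:
        raise ValueError("at least one prediction is required")
    target = reference.strip().lower()
    hits = sum(1 for p in predictions if p.strip().lower() == target)
    return hits / len(predictions)


def evidence_reward(answers_without_evidence, answers_with_evidence, proposed_answer):
    """Marginal accuracy gain attributable to showing the evidence span.

    Positive when the span helps the solver reach the proposed answer,
    zero or negative when it adds nothing or misleads.
    """
    return (accuracy(answers_with_evidence, proposed_answer)
            - accuracy(answers_without_evidence, proposed_answer))


# Invented solver rollouts for one proposed question with the answer "Paris".
without = ["Paris", "Lyon", "Lyon", "Paris"]
with_span = ["Paris", "Paris", "paris ", "Lyon"]
print(evidence_reward(without, with_span, "Paris"))  # prints 0.25
```

The reference here is the proposer's answer rather than a human label, which matches the abstract's statement that no oracle answers or external annotations are required.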

2605.22903 2026-05-25 cs.CV cs.AI cs.CL Updated version

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

不看而见：视觉-语言基准测试真的测试视觉吗？

Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou

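Affiliations: University of Chicago; Stony Brook University; Toyota Technological Institute at Chicago

AI summary: This study questions whether current vision-language model (VLM) benchmarks really measure how much models rely on visual evidence. By systematically analyzing the behavior of several open-source models, it finds that although VLMs do use visual input, their predictions are not very sensitive to the loss of fine-grained visual information, which departs clearly from what standard accuracy implies. At the representation level, the study also shows that visual features become increasingly similar in deeper layers, offering a possible explanation and suggesting that existing benchmarks may not effectively assess fine-grained visual understanding.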

Comments Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. accepted version

Details

AI Chinese abstract

基准测试的准确性通常被隐含地视为反映了视觉-语言模型（VLM）中的基础视觉理解，但尚不清楚这些分数在多大程度上真正反映了对视觉证据的依赖。受一个令人惊讶的观察结果——在广泛使用的幻觉基准测试中，移除大量图像令牌仅轻微降低模型性能——的启发，我们在一组开源VLM中系统地研究了这种不匹配。我们的分析涵盖多个粒度级别，包括全局视觉退化、局部遮挡、问题重述、答案空间扩展以及超出标准准确率的决策级分析。我们进一步用视觉令牌几何的逐层分析补充这些行为结果。在整个实验中，我们发现尽管VLM确实整合了视觉输入，但其预测对细粒度视觉证据丢失的敏感性低于标准准确率所暗示的程度。即使最终预测保持不变，模型对正确答案的内部支持可能已经减弱。我们还补充了表示级分析，显示深层中视觉令牌之间的相似性增加，这为我们的发现提供了一个可能的解释。总之，这些结果表明，当前的基准测试不足以可靠地评估VLM中的细粒度视觉基础。

English abstract

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by a surprising observation that removing a substantial fraction of image tokens only degrades model performance very slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, covering global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. The representation-level analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

2605.22902 2026-05-25 cs.LG cs.AI cs.CL Updated version

ImProver 2: Iteratively Self-Improving Language Models for Neurosymbolic Proof Optimization

Riyaz Ahuja, Tate Rowney, Jeremy Avigad, Sean Welleck

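Affiliations: Carnegie Mellon University

AI summary: As formal mathematics libraries grow rapidly, the need to refactor verified proofs and improve the quality of training data for neural provers is becoming pressing. To address the heterogeneous objectives, scarce data, and high training and inference costs that hinder scalable proof optimization, this paper proposes ImProver 2, a neurosymbolic framework for Lean 4 that combines a data-efficient expert-iteration pipeline with a scaffold exposing formal structure alongside lightweight informal abstractions, and introduces a suite of metrics for structural proof properties. Experiments show that the framework lets small models match or exceed much larger models on several metrics, demonstrating that proof optimization is feasible as a scalable learning task.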

Details

AI Chinese abstract

形式化数学库正在迅速扩展，这产生了对已验证证明进行重构以保持可维护性以及提高神经证明器训练数据质量的日益增长的需求。然而，可扩展的证明优化受到异构且启发式指定的目标、稀缺的数据以及高训练和推理成本的阻碍。为了克服这些挑战，我们引入了ImProver 2，这是一个用于在Lean 4中自动进行证明优化的神经符号框架。ImProver 2将数据高效的专家迭代流程与一个暴露形式结构并附带轻量级非正式抽象的脚手架相结合。我们进一步引入了一套捕捉证明结构属性的指标。使用ImProver 2，我们训练了一个7B参数的模型，该模型在相同模型系列中优于数量级更大的模型，并且在各项指标上与中端前沿模型具有竞争力。我们还证明，我们的神经符号脚手架显著提高了小型和前沿模型的性能。我们表明，通过适当的脚手架和训练，小型模型可以有效地在复杂且多样的指标上重构研究级证明，与更大的系统相匹配，并将证明优化确立为一项可扩展、可学习的任务。

英文摘要

Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.

2605.22884 2026-05-25 cs.LG cs.AI Updated version

RMA: An Agentic System for Research-Level Mathematical Problems

Zelin Zhao, Bo Yuan, Jaemoo Choi, Yongxin Chen

Affiliations: Georgia Institute of Technology

AI summary: The paper proposes RMA, an agentic system aimed at research-level mathematical problems. RMA decomposes the work into modules for problem analysis, literature retrieval, fair comparison, knowledge-bank construction, and proof verification, coordinated by initializer, proposer, and verifier agents, to support long-horizon reasoning and iterative proof refinement. On the First Proof benchmark, RMA solves eight of the ten problems, and its proofs are rated more logically sound and readable than those of strong baselines.

Abstract

We present $\textbf{Research Math Agents (RMA)}$, an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.

2605.22874 2026-05-25 cs.AI cs.LO Updated version

NeuroNL2LTL: A Neurosymbolic Framework for Natural Language Translation of Linear Temporal Logic

Paapa Kwesi Quansah, Ernest Bonnah

Affiliations: Baylor University

AI summary: The paper proposes NeuroNL2LTL, a neurosymbolic framework for translating natural language into Linear Temporal Logic (LTL), targeting the trade-off between reliability and expressiveness in converting between natural language and formal logic. The framework maps an intermediate representation to LTL in a structure-preserving way, applies formal verification to check and repair outputs, and uses verification outcomes as reinforcement-learning reward signals to improve correctness. On a large multi-domain requirements dataset it reaches 28% semantic equivalence with reference specifications and 86% verified-satisfiable outputs, and it generates explanations that domain experts can use to validate specifications.

Abstract

Effectively translating between natural language (NL) and formal logics like Linear Temporal Logic (LTL) requires expertise that limits formal verification's reach in safety-critical development. Template-based approaches sacrifice expressiveness for reliability; neural methods achieve fluency but provide no correctness guarantees. We present NeuroNL2LTL, a neurosymbolic architecture unifying learned translation with formal verification. NeuroNL2LTL routes translation through an intermediate representation whose mapping to LTL is structure-preserving by construction. Generated specifications undergo satisfiability and non-triviality checking; a minimal-edit repair mechanism corrects near-miss outputs before they reach downstream tools. The central innovation is verifier-in-the-loop training: verification outcomes serve as reward signals for reinforcement learning, producing neural components that optimize directly for formal correctness. On 200,000+ requirements spanning aerospace, robotics, autonomous vehicles, and ten additional domains, NeuroNL2LTL achieves 28% semantic equivalence with reference specifications while ensuring 86% of outputs are verified satisfiable. The system also generates contextually grounded explanations from LTL, enabling domain experts to validate specifications without specialized training. This work demonstrates that formal verification can function as both training objective and runtime filter for neural specification systems, allowing us to build neural-based tools whose reliability derives from logical guarantees rather than statistical confidence.

2605.22872 2026-05-25 cs.LG cs.AI cs.CV Updated version

MedExpMem: Adapting Experience Memory for Differential Diagnosis

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Yannian Gu, Winnie Chiu Wing Chu, Xiaofan Zhang, Qi Dou

Affiliations: The Chinese University of Hong Kong; Shanghai Jiao Tong University

AI summary: The paper proposes MedExpMem, an experience memory framework intended to improve the differential-diagnosis ability of VLM-based medical diagnostic agents. It records the agent's own diagnostic failures as pairwise differential notes containing key discriminators, decision rules, and reasoning error patterns, built through a two-phase process that mirrors how physicians learn. On a radiology benchmark spanning 11 subspecialties, MedExpMem improves accuracy by up to 7.0% across models and scales.

Comments MICCAI 2026 Early Accept. Submission Version

Abstract

Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes it as pairwise differential notes encoding key discriminators, actionable decision rules, and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements of up to 7.0% across diverse models and scales. Analytical experiments validate experience quality and robustness, showing that MedExpMem is a competitive method that addresses medical adaptation needs beyond the reach of parametric learning.

2605.22871 2026-05-25 cs.LG cs.AI stat.ML Updated version

Staging by the Book: Automated Sleep Stage Classification Using Scoring Rules

Emil Hardarson, Konstantin Popov, Sigridur Sigurdardottir, Anna Sigridur Islind, Erna Sif Arnardóttir, María Óskarsdóttir

Affiliations: Department of Computer Science, Reykjavik University; Reykjavik University Sleep Institute; Reykjavik University; Department of Engineering, Reykjavik University; School of Mathematical Sciences, University of Southampton

AI summary: The paper proposes a transparent sleep-staging method based on clinical scoring rules: it turns the American Academy of Sleep Medicine (AASM) scoring logic into executable code and generates a natural-language justification for each epoch-level decision. Its agreement with the reference is lower than that of current deep learning methods, but its decisions are deterministic and follow clinical rules, so it can serve as a complementary tool for auditing, debugging, and governing deep learning-based sleep staging.

Abstract

Automated sleep staging is commonly approached as a supervised machine learning problem, with deep learning methods dominating recent research. While machine learning models achieve near-human level agreement with human-scored reference sleep stages, their decisions are typically opaque and not designed to follow clinical scoring rules. We propose a transparent alternative: a deterministic, rule-based sleep staging method that explicitly operationalizes the American Academy of Sleep Medicine's (AASM) scoring logic as executable code, coupled with epoch-level natural-language justifications derived from an explanation trace. We evaluate the approach on 50 polysomnography recordings with a 10-scorer majority-vote consensus as reference. Across all recordings, the method agreed with the majority-vote reference in 60.5% of epochs ($κ=0.42$), with substantially higher agreement on a dataset used during development (77.1%, $κ=0.61$). Agreement with the reference was highest for sleep stage N2 (recall 83.5%) and moderate for sleep stage R (recall 68.7%), while Wake and N1 recall were low. Despite lower agreement with the reference than contemporary deep learning models, the method provides deterministic decisions and natural language explanations aligned with AASM scoring rules, making it a complementary tool for auditing, debugging, and governing deep learning-based sleep staging.
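
The agreement figures above pair raw epoch-level agreement with Cohen's κ, which discounts the agreement expected by chance. A minimal sketch of that calculation follows, assuming per-epoch stage labels are available as two equal-length sequences; the function name and the toy labels are illustrative and are not taken from the paper.

```python
from collections import Counter


def agreement_and_kappa(predicted, reference):
    """Return (observed agreement, Cohen's kappa) for two label sequences."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(predicted)
    # Observed agreement: fraction of epochs with identical labels.
    p_o = sum(p == r for p, r in zip(predicted, reference)) / n
    # Chance agreement: product of the marginal label counts, summed over labels.
    pred_counts, ref_counts = Counter(predicted), Counter(reference)
    p_e = sum(pred_counts[s] * ref_counts[s] for s in pred_counts) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return p_o, (p_o - p_e) / (1.0 - p_e)


# Toy example (hypothetical labels for six epochs).
rule_based = ["W", "N1", "N2", "N2", "R", "N2"]
consensus = ["W", "N2", "N2", "N2", "R", "N3"]
p_o, kappa = agreement_and_kappa(rule_based, consensus)
print(f"agreement={p_o:.3f}, kappa={kappa:.3f}")
```

Because κ subtracts chance agreement, it is always at or below raw agreement when chance agreement is positive, which is why 60.5% agreement can correspond to κ = 0.42.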

2605.22855 2026-05-25 cs.GT cs.AI cs.CL cs.LG Updated version

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

Yingjie Lei

Affiliations: University of Aberdeen

AI summary: The paper presents PrefBench, a benchmark for evaluating zero-shot LLM agents in hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller negotiates using only public information, while the buyer's valuation, patience, counter-offer behavior, and walkaway decisions stay hidden. The tested LLM agents follow the protocol and close nearly all deals, yet their profit is weak, far below a simple concession heuristic, which exposes a gap in profit-sensitive bargaining. PrefBench offers a controlled setting for studying pricing-agent behavior under hidden buyer preferences.

Comments 24 pages, 3 figures, 5 tables. Code is available at https://github.com/ChaosTheProducer/PrefBench

Abstract

Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.

2605.22852 2026-05-25 cs.DB cs.AI cs.LG cs.LO Updated version

Expressive Power of Deep Homomorphism Networks over Relational Databases

Moritz Schönherr, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Arie Soeteman

Affiliations: University of Amsterdam; Leipzig University; Technion; RelationalAI

AI summary: The paper studies the expressive power of Deep Homomorphism Networks (DHNs) over relational databases by relating them to fragments and extensions of first-order logic (FO). For max, sum, and mean aggregation it connects DHNs to the unary negation fragment (UNFO) and its extensions with counting and ratio quantifiers, delimiting what each aggregation can express. Via the classical correspondence between FO and SQL, these results also clarify how DHNs relate to SQL, and they support a decidability analysis of static analysis problems for DHNs. Experiments confirm that the differences in expressive power show up in prediction-task performance.

Abstract

The expressive limitations of message-passing Graph Neural Networks (GNNs) have motivated a wide range of more powerful graph learning architectures. We advocate Deep Homomorphism Networks (DHNs) as a model particularly well-suited for learning over relational databases, due to their close connection to important fragments of SQL such as conjunctive queries. We study the precise expressive power of DHNs by relating them to various natural fragments and extensions of first-order logic (FO). For DHNs with max, sum, and mean aggregations, we establish connections to the unary negation fragment (UNFO) and to the extensions of UNFO with counting quantifiers and with ratio quantifiers. We further relate sum-aggregation DHNs to the unary quantifier alternation fragment of FO and to an extension of FO with expressive counting. Through the classical correspondence between FO and SQL, these results also illuminate the relation between DHNs and SQL. They also enable us to study the decidability of two fundamental static analysis problems for DHNs, the emptiness problem and the subsumption problem. Finally, we confirm through experiments that the established differences in expressive power are reflected in the performance on suitable prediction tasks.

2605.22850 2026-05-25 cs.DC cs.AI Updated version

The Cognitive Kardashev Scale: Quantifying the Material Envelope of Civilisational Computation

Sachin Sharma

Affiliations: NVIDIA; OpenAI; Stargate; Terafab

AI summary: The paper proposes a Cognitive Kardashev Scale that quantifies how much computation a civilisation could sustain. It estimates the sustained AI-grade compute available at each Kardashev tier from four quantities: total power, the share of power devoted to cognition, the efficiency of turning energy into compute, and the processing rate of the human brain as a reference unit. Contemporary humanity sits at K ≈ 0.73, roughly three-quarters of the way to Type I. At Type I with 1% of power allocated to computation, the available compute is, within an order of magnitude, one personal AI's worth of cognition per inhabitant, while Type II compute is essentially incomprehensible. The paper also reports three conditional trajectories for frontier compute through 2035 and notes that whether energy or efficiency becomes the binding constraint depends on engineering choices not yet made.

Abstract

How much thinking can a civilisation do? Kardashev's (1964) typology ranks civilisations by total power: planetary (Type I, ~10^16 W), stellar (Type II, ~10^26 W), galactic (Type III). This paper builds an analogous Cognitive Kardashev Scale: how much sustained AI-grade computation each tier could support. Four ingredients enter the calculation: total power P (watts), the share f of it devoted to cognition, the efficiency $η$ at which energy becomes compute (operations per joule), and the brain's own processing rate $C_{\mathrm{brain}}$ as a reference unit. Anchoring on 2024-2026 hardware (El Capitan, NVIDIA Blackwell, Vera Rubin) gives $η_{2026} = 10^{12}$ FLOP/J. Contemporary humanity sits at $K \approx 0.73$, three-quarters of the way to Type I. At Type I and $f = 1\%$, available compute is, within an order of magnitude, one personal AI's worth of cognition per human inhabitant; at Type II it is essentially incomprehensible. Three trajectories for frontier compute through 2035 are reported as conditional projections, not predictions. Whether the long-run binding constraint is energy or efficiency depends on engineering choices not yet made; the political economy of who has access may matter more than either.
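
The abstract's figures can be roughly reproduced with a few lines of arithmetic. The sketch below assumes Sagan's interpolation K = (log10 P − 6) / 10 for the Kardashev rating, a present-day power use of about 2 × 10^13 W, and a population of 8 × 10^9; none of these three values is stated in the abstract, so treat them as illustrative assumptions. Sustained compute is taken as f · P · η, using the abstract's η = 10^12 FLOP/J.

```python
import math

ETA_2026 = 1e12  # FLOP per joule, as given in the abstract


def kardashev_rating(power_watts):
    """Sagan's interpolation (assumed here): 1e16 W -> 1.0, 1e26 W -> 2.0."""
    return (math.log10(power_watts) - 6.0) / 10.0


def sustained_compute(power_watts, cognition_share, efficiency=ETA_2026):
    """Sustained compute in FLOP/s: power share spent on cognition times efficiency."""
    return cognition_share * power_watts * efficiency


HUMANITY_POWER_W = 2e13  # assumed current power use, not from the abstract
POPULATION = 8e9         # assumed population, not from the abstract

k_now = kardashev_rating(HUMANITY_POWER_W)
type_one_compute = sustained_compute(1e16, 0.01)
per_person = type_one_compute / POPULATION

print(f"K today       ~ {k_now:.2f}")
print(f"Type I, f=1%  ~ {type_one_compute:.1e} FLOP/s")
print(f"per person    ~ {per_person:.1e} FLOP/s")
```

Under these assumptions the rating comes out near 0.73, and a Type I civilisation with f = 1% sustains about 10^26 FLOP/s, or on the order of 10^16 FLOP/s per inhabitant. Whether that equals "one personal AI" depends on the paper's brain-rate reference unit, which the abstract does not quantify.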

2605.22833 2026-05-25 cs.IR cs.AI cs.LG Updated version

RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis

Daqian Shi, Pei Han, Jishizhan Chen, Yang Wang, Xiaolei Diao, Xianyou Zheng, Pengfei Cheng

Affiliations: Queen Mary University of London; Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine; University College London

AI summary: Chronic osteomyelitis is hard to prognosticate because of its high recurrence risk and complex postoperative recovery. Traditional assessment relies on manual scoring systems, which scale poorly and are inefficient and inconsistent. The paper proposes RAG4Outcome, a retrieval-augmented generation (RAG) multimodal framework that integrates PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, and combines a domain-specific retrieval corpus with expert-guided prompting to produce more interpretable, evidence-grounded prognostic predictions. Preliminary results on real-world cases suggest promising effectiveness and clinical alignment.

Abstract

Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets. In this work, we propose RAG4Outcome, a retrieval-augmented generation (RAG) framework for prognostic prediction in chronic osteomyelitis. Our method integrates multimodal clinical data, including PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, into a unified prediction pipeline. By combining a domain-specific retrieval corpus with expert-guided prompting, the framework enables more interpretable, evidence-grounded, and clinically reliable prognosis. Preliminary results on real-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI-assisted infection management and postoperative decision support.

2605.22829 2026-05-25 cs.IR cs.AI Updated version

KPI2KVI: A Multi-Agent Workflow for Computing Key Value Indicators from Service Descriptions

Masoud Shokrnezhad, Tarik Taleb, Yan Chen, Qize Guo

Affiliations: ICTFICIAL OY; Ruhr-Universitaet Bochum

AI summary: The paper presents KPI2KVI, a multi-agent workflow tool that computes Key Value Indicators (KVIs) from service descriptions. Built on large language models, it coordinates agents that elicit missing service context, select KVI categories, generate KPIs, collect KPI values, and compute interval-valued KVI outputs, giving an end-to-end mapping from a natural-language description to structured KVI estimates. The tool targets the manual, inconsistent way KVIs are usually computed and provides a traceable calculation narrative that supports post hoc auditing and interactive advisory queries.

Abstract

Key Value Indicators (KVIs) provide a decision-oriented view of a service by summarizing how operational performance translates into stakeholder value, risk, and outcomes. However, in many domains KVIs are difficult to compute in practice because they require selecting relevant KVI categories, defining measurable Key Performance Indicators (KPIs), collecting KPI values, and applying consistent calculation logic, all of which are typically performed manually and inconsistently from unstructured service documentation. This paper presents KPI2KVI, a tool that transforms a natural-language service description into computed KVI estimates by orchestrating a deterministic multi-agent workflow powered by Large Language Models (LLMs). The workflow (i) elicits missing service context, (ii) extracts and finalizes relevant KVI categories from a taxonomy, (iii) generates service-specific KPIs with units and descriptions, (iv) collects KPI values through an interactive dialogue, with intelligent estimation for KPI values that are unavailable, and (v) computes interval-valued KVI outputs (minimum, exact, maximum) with traceable explanations for each KVI code. Simulations with representative service descriptions demonstrate that KPI2KVI consistently produces a complete end-to-end mapping from description to KVI intervals and provides transparent calculation narratives that support post hoc auditing and interactive advisory queries.

2605.22824 2026-05-25 cs.DC cs.AI Updated version

An AI-Driven Framework for Energy-Efficient Environmental Monitoring in Smart Cities Using Edge Intelligence

Yichen Liu, Imam Akintomiwa Akinlade, Xiaochong Jiang, Wenting Yang, Shiqi Yang

Affiliations: Independent Researcher; Harvard Business School

AI summary: The paper proposes an AI-driven framework based on edge intelligence to make environmental monitoring in smart cities more energy-efficient. It uses TinyML-enabled edge devices and context-aware adaptive decision-making to activate sensors dynamically according to spatiotemporal conditions, environmental statistics, and energy constraints, which reduces redundant data collection and energy use. In simulation, the approach lowers energy consumption and extends sensor lifespan compared with static, periodic, and UCB-based adaptive sensing strategies.

Comments 6 pages, 2 figures, 3 tables

Abstract

Environmental monitoring is a crucial component of smart city infrastructure. It enables informed decision making, which enhances sustainability, public health, and urban planning. However, large-scale deployments of smart sensors have raised concerns about excessive energy consumption, redundant data collection, and limited sensor lifespan. To resolve these issues, we present an AI-driven framework for energy-efficient environmental monitoring in smart cities utilizing edge intelligence. Our proposed framework leverages TinyML-enabled edge devices and context-aware adaptive decision-making to dynamically activate sensors based on spatiotemporal conditions, environmental statistics, and energy constraints. Activation is governed by a utility function that takes into account factors such as real-time environmental conditions, sensor location, and remaining battery life. The framework reduces unnecessary sensing and communication while maintaining high monitoring coverage. We introduce a hierarchical edge intelligence architecture to support city-wide deployments. We evaluate the framework using a city-scale simulation driven by real multi-sensor environmental traces, which shows that the proposed mechanism significantly reduces energy consumption and extends sensor lifespan compared to static, periodic, and UCB-based adaptive sensing strategies. The results highlight the potential of edge intelligence and adaptive AI techniques for building sustainable and efficient smart city monitoring systems.

2605.20519 2026-05-25 cs.SD cs.AI Updated version

Codec-Robust Attacks on Audio LLMs

Jaechul Roh, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr

Affiliations: University of Massachusetts Amherst; Qualcomm

AI summary: The paper studies codec-robust attacks on Audio LLMs and proposes CodecAttack. Instead of perturbing the audio waveform directly, CodecAttack optimizes the perturbation in the continuous latent space of a neural audio codec, so that the perturbation survives the compression that filters out waveform-domain perturbations. Across several realistic compression scenarios its attack success rate is far higher than that of a waveform-domain baseline, indicating that lossy compression is not a reliable defense against adversarial audio.

Abstract

Prior attacks on Audio Large Language Models (Audio LLMs) demonstrated that carefully crafted waveform-domain perturbations can force targeted adversarial outputs. As a defense mechanism against these attacks, real-world codec compression preprocessing has been studied to both detect and remove the perturbations. Yet no existing attack has demonstrated robustness against these compressions. We introduce CodecAttack, which optimizes a perturbation in a neural audio codec's continuous latent space rather than directly perturbing the audio waveform. We show that the codec's compression channel, which discards waveform perturbations, transmits perturbations crafted in its own latent space. To further harden the attack across real-world compression channels, we apply multi-bitrate straight-through Expectation-over-Transformation (EoT), all without modifying the target model. Across three realistic Audio LLM deployment scenarios and three target models, CodecAttack achieves an average 85.5% target-substring attack success rate (ASR) on Opus at moderate bitrates, while the waveform baseline trained with identical EoT hardening does not exceed 26% at any bitrate. The attack transfers to held-out codecs, reaching up to 100% ASR on MP3 and 84% on AAC-LC without retraining. A per-band energy analysis shows that the latent perturbation concentrates below 4kHz, exactly where codecs allocate the most bits, while the waveform baseline spreads into higher frequencies that codecs discard. These results demonstrate that lossy compression is not a reliable defense against adversarial audio and that codec-aware attacks pose a practical threat to deployed Audio LLM systems.

2604.07813 2026-05-25 cs.AI cs.HC Updated version

Agentivism: a learning theory for the age of artificial intelligence

Lixiang Yan, Dragan Gašević

Affiliations: School of Education, Tsinghua University; Faculty of Education and School of Computing & Data Science, The University of Hong Kong; Faculty of Information Technology, Monash University

AI summary: Generative and agentic AI fundamentally change the conditions of learning, and traditional learning theories struggle to explain the gap between AI-assisted performance and genuine understanding. The paper proposes Agentivism, a learning theory that defines learning under AI assistance as durable growth in human capability achieved through selective delegation, monitoring and verification of AI output, reconstructive internalization, and transfer under reduced support. The theory offers a framework for understanding how humans learn alongside AI.

Abstract

Learning theories have historically changed when the conditions of learning evolved. Generative and agentic AI create a new condition by allowing learners to delegate explanation, writing, problem solving, and other cognitive work to systems that can generate, recommend, and sometimes act on the learner's behalf. This creates a fundamental challenge for learning theory: successful performance can no longer be assumed to indicate learning. Learners may complete tasks effectively with AI support while developing less understanding, weaker judgment, and limited transferable capability. We argue that this problem is not fully captured by existing learning theories. Behaviourism, cognitivism, constructivism, and connectivism remain important, but they do not directly explain when AI-assisted performance becomes durable human capability. We propose Agentivism, a learning theory for human-AI interaction. Agentivism defines learning as durable growth in human capability through selective delegation to AI, epistemic monitoring and verification of AI contributions, reconstructive internalization of AI-assisted outputs, and transfer under reduced support. The importance of Agentivism lies in explaining how learning remains possible when intelligent delegation is easy and human-AI interaction is becoming a persistent and expanding part of human learning.

2602.20102 2026-05-25 cs.LG cs.AI Updated version

BarrierSteer: LLM Safety via Learning Barrier Steering

Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao

Affiliations: Department of Computer Science, National University of Singapore; Singapore-MIT Alliance for Research and Technology Centre; CSAIL, Massachusetts Institute of Technology; Worcester Polytechnic Institute

AI summary: Large language models (LLMs) perform well across tasks, but their susceptibility to adversarial attacks and unsafe generation remains a major obstacle to deployment, especially in high-stakes settings. The paper proposes BarrierSteer, an inference-time framework that improves response safety by embedding learned nonlinear safety constraints in the model's latent representation space. It treats hidden-state safety classifiers as Control Barrier Functions (CBFs) and steers unsafe latent trajectories toward the constraints during generation, without modifying model parameters; experiments across several models and datasets show reduced attack success rates and unsafe generations.

Comments This paper introduces SafeBarrier, a framework that enforces safety in large language models by steering their latent representations with control barrier functions during inference, reducing adversarial and unsafe outputs

Abstract

Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we introduce BarrierSteer, a novel inference-time framework that improves response safety by embedding learned nonlinear safety constraints directly into the model's latent representation space. BarrierSteer treats hidden-state safety classifiers as Control Barrier Functions (CBFs), enabling constraint-guided steering of unsafe latent trajectories during generation. By composing multiple safety constraints through efficient constraint merging without modifying the underlying LLM parameters, BarrierSteer preserves model utility. We provide theoretical results showing that applying CBFs in the latent space yields a principled, modular, and computationally efficient approach for steering with respect to learned safety constraints, with guarantees conditional on the learned barriers capturing the intended safety property. Our extensive experimental results across multiple model families and datasets demonstrate that BarrierSteer substantially reduces adversarial attack success rates and unsafe generations, outperforming the existing method. The code is available in our GitHub repository: https://github.com/thanhquangtran/BarrierSteer
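
To make the barrier idea concrete, the sketch below uses a deliberately simplified case: a single linear barrier h(z) = w·z + b over a latent vector z, with h(z) ≥ 0 meaning safe. When the barrier is violated, the latent is moved by the minimum-norm step back onto the boundary. This is an illustrative assumption, not the paper's method, which uses learned nonlinear constraints and merges several of them; all names and values here are hypothetical.

```python
def barrier_value(z, w, b):
    """Linear barrier h(z) = w . z + b; h(z) >= 0 is treated as safe."""
    return sum(wi * zi for wi, zi in zip(w, z)) + b


def steer(z, w, b, margin=0.0):
    """Return z unchanged if h(z) >= margin, else the closest point with h = margin."""
    h = barrier_value(z, w, b)
    if h >= margin:
        return list(z)
    w_norm_sq = sum(wi * wi for wi in w)
    if w_norm_sq == 0.0:
        raise ValueError("barrier normal w must be non-zero")
    # Minimum-norm correction: move along w just far enough to reach the margin.
    step = (margin - h) / w_norm_sq
    return [zi + step * wi for zi, wi in zip(z, w)]


w, b = [1.0, -2.0], 0.5      # hypothetical classifier weights and bias
unsafe_latent = [-1.0, 1.0]  # h = -1 - 2 + 0.5 = -2.5, so the barrier is violated
steered = steer(unsafe_latent, w, b)
print(barrier_value(unsafe_latent, w, b), barrier_value(steered, w, b))
```

Latents that already satisfy the barrier pass through untouched, which is one way to read the abstract's claim that steering preserves model utility: only unsafe trajectories are modified.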

2601.21306 2026-05-25 cs.LG cs.AI version update

The Surprising Difficulty of Search in Model-Based Reinforcement Learning

Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto

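Affiliations: Meta FAIR; McGill University

AI summary: This paper studies search in model-based reinforcement learning. The conventional view holds that long-horizon prediction and compounding error are the main obstacles, but the authors find that search is not a simple replacement for a learned policy and can hurt performance even when the model is highly accurate. The study indicates that mitigating overestimation bias matters more than improving the accuracy of the model or the value function, and that taking the minimum over an ensemble of value functions effectively addresses this bias, enabling efficient search and leading performance on multiple benchmark tasks.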
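
Illustrative sketch (not the authors' implementation): the summary credits taking the minimum over an ensemble of value functions with countering overestimation bias during search. Assuming each candidate plan has already been scored by several value estimators, a pessimistic planner ranks plans by their lowest estimate instead of their mean. The plan names, the numbers, and the helpers `select_by_min` and `select_by_mean` are made up for this example.

```python
from typing import Mapping, Sequence


def select_by_min(candidates: Mapping[str, Sequence[float]]) -> str:
    """Pick the plan whose worst-case ensemble estimate is highest."""
    return max(candidates, key=lambda name: min(candidates[name]))


def select_by_mean(candidates: Mapping[str, Sequence[float]]) -> str:
    """Baseline: pick the plan with the highest average estimate."""
    return max(
        candidates,
        key=lambda name: sum(candidates[name]) / len(candidates[name]),
    )


if __name__ == "__main__":
    # Each plan maps to the values predicted by three ensemble members.
    plans = {
        "steady": [1.0, 1.1, 0.9],
        "risky": [3.0, 0.2, 0.5],  # one member wildly overestimates
    }
    print("mean picks:", select_by_mean(plans))
    print("min picks:", select_by_min(plans))
```

Here the mean is dominated by a single optimistic estimate, while the minimum ignores it; that is the intuition behind using the ensemble minimum when search actively seeks out high-value predictions.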

Comments ICML 2026

2601.14652 2026-05-25 cs.AI cs.CL cs.MA version update

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

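Affiliations: Salesforce Research; University of Wisconsin-Madison; Massachusetts Institute of Technology

AI summary: This paper proposes MAS-Orchestra, a training framework for understanding and improving the reasoning ability of multi-agent systems (MAS) through holistic orchestration and controlled benchmarks. The method models MAS orchestration as a function-calling reinforcement learning problem, generates a complete MAS in one shot, and abstracts subagents as callable functions to enable global reasoning over system structure. The work also introduces the MASBENCH benchmark, which characterizes tasks along five dimensions and shows that MAS benefits depend on task structure, verification mechanisms, and agent capabilities rather than holding universally. Experiments show that MAS-Orchestra achieves consistent improvements on multiple benchmarks, with more than 10x efficiency over strong baselines.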

Comments ICML 2026

Abstract

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity - agent orchestration is performed using sequential, code-level execution that limits global system-level holistic reasoning and scales poorly with agent complexity - and (2) efficacy uncertainty - MAS are deployed without understanding if there are tangible benefits compared to single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

2504.09583 2026-05-25 cs.RO cs.AI version update

AirVista-II: An Agentic System for Embodied UAVs Toward Dynamic Scene Semantic Understanding

Fei Lin, Yonglin Tian, Tengchao Zhang, Jun Huang, Sangtian Guan, Fei-Yue Wang

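Affiliations: Department of Engineering Science, Faculty of Innovation Engineering, Macau University of Science and Technology; State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences; State Key Laboratory for Management and Control of Complex Systems, Chinese Academy of Sciences

AI summary: This paper proposes AirVista-II, an agentic system that aims to improve the semantic understanding of dynamic scenes by UAVs. The system combines agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored to different temporal scenarios, so that key information in dynamic environments is captured efficiently. Experiments show that the system achieves high-quality zero-shot semantic understanding across a range of UAV scenarios, targeting the efficiency and adaptability limits of operator-monitored workflows.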

Journal ref Proc. 2025 IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 6319-6324, 2025

DOI: 10.1109/SMC58881.2025.11342598

Abstract

Unmanned Aerial Vehicles (UAVs) are increasingly important in dynamic environments such as logistics transportation and disaster response. However, current tasks often rely on human operators to monitor aerial videos and make operational decisions. This mode of human-machine collaboration suffers from significant limitations in efficiency and adaptability. In this paper, we present AirVista-II -- an end-to-end agentic system for embodied UAVs, designed to enable general-purpose semantic understanding and reasoning in dynamic scenes. The system integrates agent-based task identification and scheduling, multimodal perception mechanisms, and differentiated keyframe extraction strategies tailored for various temporal scenarios, enabling the efficient capture of critical scene information. Experimental results demonstrate that the proposed system achieves high-quality semantic understanding across diverse UAV-based dynamic scenarios under a zero-shot setting.

2110.01552 2026-05-25 cs.CL cs.AI cs.LG version update

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

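Affiliations: Information Sciences Institute, University of Southern California

AI summary: This paper proposes a new task for assessing the question-answering ability of pre-trained language models (PTLMs) in open-book and closed-book settings, using introductory college textbooks in the social sciences and humanities as the instructional material. The task consists of true/false statements derived from the textbooks' review questions, with validation tests on the first eight chapters and blind tests on the remaining chapters. Closed-book performance improves only slightly over chance (56% against roughly 50%), suggesting the PTLM may not have understood the textbook, while open-book performance, where the model may retrieve a relevant paragraph, is better (about 60%). The task offers a new benchmark for how well PTLMs understand long instructional text.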

Comments Identical to the EMNLP 2021 version

DOI: 10.18653/v1/2021.emnlp-main.493

Abstract

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ, achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

2101.05400 2026-05-25 cs.CL cs.AI cs.LG version update

Machine-Assisted Script Curation

Manuel R. Ciosici, Joseph Cummings, Mitchell DeHaven, Alex Hedges, Yash Kankanampati, Dong-Ho Lee, Ralph Weischedel, Marjorie Freedman

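Affiliations: Information Sciences Institute, University of Southern California

AI summary: This paper presents MASC, a system for human-machine collaborative script authoring. The system automatically generates event types, links them to Wikidata, prompts for sub-events that may have been overlooked, and records the entities that participate in multiple sub-events and their temporal order, helping users write structurally complex event scripts efficiently. The paper demonstrates MASC on real examples to illustrate its practical value for script authoring.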

Comments Identical to the NAACL 2021 Demo version