arXivDaily每日学术速递，同步arXiv全量数据，AI总结、翻译，覆盖人工智能、机器人、计算机、金融、统计学、数学、物理学、生物学、经济学、电气&系统等方向。

2605.20182 2026-05-20 cs.LG cs.AI 版本更新

Atoms of Thought: Universal EEG Representation Learning with Microstates

思想的原子：基于微状态的通用EEG表示学习

Xinyang Tian, Ruitao Liu, Ziyi Ye, Siyang Xue, Xin Wang, Xuesong Chen

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University（清华大学交叉信息研究院）； Institute of Trustworthy Embodied AI, Fudan University（复旦大学可信具身人工智能研究院）； School of Clinical Medicine, Tsinghua University（清华大学临床医学院）； Beijing Five Seasons Medical Technology Co., Ltd.（北京五 Seasons 医疗科技有限公司）

AI总结本文提出了一种基于微状态的通用EEG表示学习方法，通过将连续EEG信号聚类为离散的微状态序列，构建了一个通用的微状态分词器，并在睡眠分期、情绪识别和运动想象分类等下游任务中展示了其优越性，同时提高了可解释性和扩展性。

Comments Accepted by the 3rd International Workshop on Multimodal and Responsible Affective Computing (MRAC 2025). 8 pages of main text, 23 pages total, 5 figures, 4 tables

详情

DOI: 10.1145/3746270.3760230

AI中文摘要

从脑电图（EEG）信号中学习通用表示是神经信息学和脑机接口（BCIs）领域的一项前沿技术。传统上，EEG被视为多变量时间序列，其中时间域或频域特征被提取用于表示学习。本文研究了一种简单而有效的EEG表示，即微状态。微状态代表了在微观时间尺度上大脑活动模式的基本构建块。通过从大规模医疗EEG数据集中对连续EEG信号进行聚类，构建了一个通用的微状态分词器。该微状态分词器被广泛应用于一系列下游任务，包括睡眠分期、情绪识别和运动想象分类。实验结果表明，使用微状态进行EEG表示学习在不同模型和不同任务中均优于传统的时间域和频域特征。进一步分析显示，微状态提供了更高的可解释性和可扩展性，从而在认知神经科学和临床研究中开辟了应用。

英文摘要

Learning universal representations from electroencephalogram (EEG) signals is a cutting-edge approach in the field of neuroinformatics and brain-computer interfaces (BCIs). Conventionally, EEG is treated as a multivariate temporal signal, where time- or frequency-domain features are extracted for representation learning. This paper investigates a simple yet effective EEG representation, i.e., microstates. Microstates represent the building blocks of brain activity patterns at a microscopic time scale. We build a universal microstate tokenizer from a large medical EEG dataset by clustering continuous EEG signals into sequences of discrete microstates. The microstate tokenizer is then adopted universally across a series of downstream tasks, including sleep staging, emotion recognition, and motor imagery classification. Experimental results show that EEG representation learning with microstates outperforms traditional time-domain and frequency-domain features under different models and across different tasks. Further analysis shows that microstates offer greater interpretability and scalability, thereby opening up applications in both cognitive neuroscience and clinical research.

URL PDF HTML ☆

赞 0 踩 0

2605.20173 2026-05-20 cs.AI cs.SE 版本更新

A Methodology for Selecting and Composing Runtime Architecture Patterns for Production LLM Agents

为生产大语言模型代理选择和组合运行时架构模式的方法

Vasundra Srinivasan

发表机构 * AI Architect, Independent Researcher（人工智能架构师，独立研究员）； Stanford School of Engineering（斯坦福大学工程学院）

AI总结本文提出了一种方法，用于选择和组合运行时架构模式，以定义大语言模型代理的随机-确定性边界，并探讨其在不同代理类型中的应用及可靠性分解。

Comments 25 pages, 2 figures, 6 tables. Companion repo at https://github.com/vasundras/agent-runtime-patterns

详情

AI中文摘要

生产大语言模型代理结合了随机模型输出与确定性软件系统，但两者之间的边界很少被视为首要的架构对象。本文将此边界称为随机-确定性边界（SDB）：一种四部分合同，涉及提议者、验证者、提交步骤和拒绝信号，规定了LLM输出如何成为系统动作。我们主张SDB是生产代理运行时的承载基础。围绕此基础，我们将代理运行时设计分为三个关注点：协调、状态和控制。我们提出了六个运行时模式的目录，这些模式在对话、自主和长周期代理中以不同的方式组合SDB：分层委托、散射-收集加 saga、事件驱动序列、共享状态机、监督者加门控，以及人机交互。对于每个模式，我们追溯其分布式系统概念的根源，并确定当工作者为随机时的变化。本文贡献了五步选择运行时模式的方法，一个将生产故障映射到模式弱点的诊断程序，以及一种称为回放分歧的故障模式，在这种模式下，基于LLM的确定性事件日志消费者在模型版本或提示变化下会产生不同的下游输出。一种简化的可靠性分解将每次调用的模型方差与架构动量分开，促使主张随着模型方差的减少，模式选择和SDB强度成为长期可靠性的越来越重要的杠杆。我们应用该方法于五个工作负载，并提供了一个90天合同续约代理的可运行参考实现。

英文摘要

Production LLM agents combine stochastic model outputs with deterministic software systems, yet the boundary between the two is rarely treated as a first-class architectural object. This paper names that boundary the stochastic-deterministic boundary (SDB): a four-part contract among a proposer, verifier, commit step, and reject signal that specifies how an LLM output becomes a system action. We argue that the SDB is the load-bearing primitive of production agent runtimes. Around this primitive, we organize agent runtime design into three concerns: Coordination, State, and Control. We present a catalog of six runtime patterns that compose the SDB differently across conversational, autonomous, and long-horizon agents: hierarchical delegation, scatter-gather plus saga, event-driven sequencing, shared state machine, supervisor plus gate, and human in the loop. For each pattern, we trace its lineage to distributed-systems concepts and identify what changes when the worker is stochastic. The paper contributes a five-step methodology for selecting runtime patterns, a diagnostic procedure that maps production failures to pattern weaknesses, and a failure mode called replay divergence, in which LLM-based consumers of a deterministic event log produce different downstream outputs under model-version or prompt changes. A stylized reliability decomposition separates per-call model variance from architectural momentum, motivating the claim that as model variance decreases, pattern choice and SDB strength become increasingly important levers for long-run reliability. We apply the methodology to five workloads and provide one runnable reference implementation for a 90-day contract-renewal agent.

URL PDF HTML ☆

赞 0 踩 0

2605.20172 2026-05-20 cs.LO cs.AI 版本更新

Long-term Power Grid Planning via Answer Set Programming

通过答案集编程进行长期电力网络规划

Antonio Ielo, Francesco Doria, Sandra Castellanos-Paez, Marco Maratea, Francesco Percassi, Mauro Vallati

发表机构 * University of Calabria, Italy（意大利卡拉布里亚大学）； University of Huddersfield, UK（英国赫德瑟菲尔德大学）

AI总结本文提出了一种基于答案集编程的自动化和优化长期电力网络规划方法，以解决可持续性目标、需求模式和城市化趋势等复杂问题。

Comments 16 pages, 4 figures

详情

AI中文摘要

电力网络是支撑现代社会各个方面的重要基础设施，其有效性需要持续适应。特别是要应对可持续性目标、需求模式和城市化趋势，需要对网络进行更改。实际发展可能持续数十年，必须通过确保符合多种拓扑和组合不变量来保持供应连续性和服务质量。长期电力网络规划涉及上述过程，尽管规划语言可能是一个自然的选择，但所需的属性和不变量在这样的语言中难以表达；相反，它们可以优雅且简洁地编码在答案集编程（ASP）中。在本文中，我们提出了一种利用ASP自动化和优化长期电力网络规划过程的方法。在合成和实际电网数据上进行的实验评估证实了所提出的基于ASP的方法的表达能力，并展示了其有效性。

英文摘要

The Power grid is a critical infrastructure underpinning all aspects of modern society and its services. Maintaining its effectiveness requires continuous adaptations. In particular, addressing sustainability targets, demand patterns, and urbanisation trends requires implementing changes to the network. Actual developments can potentially span over a decade, with supply continuity and service quality that must be preserved throughout by ensuring conformance to several topological and combinatorial invariants. Long-term power grid planning deals with the above process, and although planning languages could be a natural choice, the kind of properties and invariants needed are cumbersome to express in such languages; on the contrary, they can be elegantly and succinctly encoded in Answer Set Programming (ASP). In this paper, we propose the first approach to automate and optimise the long-term power grid planning process using ASP. Experimental evaluations conducted on synthetic and real-world grid data confirm the expressive power of the proposed ASP-based approach and demonstrate its effectiveness.

URL PDF HTML ☆

赞 0 踩 0

2605.20167 2026-05-20 cs.AI cs.LG 版本更新

HaorFloodAlert: Deseasonalized ML Ensemble for 72-Hour Flood Prediction in Bangladesh Haor Wetlands

HaorFloodAlert: 用于孟加拉国Haor湿地72小时洪水预测的去季节化机器学习集成

Salma Hoque Talukdar Koli, Fahima Haque Talukder Jely, Md. Samiul Alim, Md. Zakir Hossen

发表机构 * 1 Department of Computer Science ； Engineering, RTM Al-Kabir Technical University, Sylhet-3100, Bangladesh 2 Department of Computer Science ； Engineering, North East University Bangladesh, Sylhet, Bangladesh 3 Department of Computer Science ； Engineering, Dhaka University of Engineering \& Technology, Gazipur, Bangladesh [6pt] Corresponding author: ( )

AI总结本文提出HaorFloodAlert，一种去季节化的机器学习集成模型，用于预测孟加拉国Haor湿地72小时内的洪水概率，通过识别温度季节性影响和利用Sentinel-1 SAR数据提高预测准确性。

Comments 9 pages, 9 figures. To be submitted to raaicon.org

详情

AI中文摘要

孟加拉国Haor湿地的快速洪水几乎没有任何预警，破坏年度boro稻收获。现有系统为河流洪水设计，完全忽略了回水动态。这些流域平坦，水的行为不同于布拉马普特拉河。我们构建了HaorFloodAlert，一种去季节化的机器学习集成，用于预测Sunamganj Haor（约8,000平方公里）72小时内的洪水概率。温度被发现是季节性的作弊代码，因为它在温暖月份洪水发生时提高了准确性6.9个百分点。我们捕捉到了这一点，并构建了一个上游Barak河Sentinel-1 SAR代理，从阿萨姆的Silchar提供约36小时的预警。Otsu阈值化的SAR变化检测在空间匹配上验证达到84-91%。操作性集成（RF 0.5625 + XGBoost 0.4375）在77个真实的Sentinel-1事件上达到89.6%的LOOCV准确性，87.5%的召回率和0.943的AUC-ROC。还包含三级警报管道和BRRI校准的boro稻损害估计器。

英文摘要

Flash floods in Bangladesh's haor wetlands show up with almost no warning. They wreck the annual boro rice harvest. Current setups, built for riverine floods, miss backwater dynamics entirely. These basins are flat. Water does not behave like it does on the Brahmaputra. We built HaorFloodAlert, a deseasonalized machine learning ensemble that forecasts 72-hour flood probability for the Sunamganj Haor (approximately 8,000 km2). Temperature was acting as a seasonal cheat code - it inflated accuracy by 6.9 pp just because floods happen in warm months. We caught that. We also built an upstream Barak River Sentinel-1 SAR proxy from Silchar, Assam, giving about 36 hours of lead time. Otsu-thresholded SAR change detection validates at 84-91 percent spatial match. The operational ensemble (RF 0.5625 + XGBoost 0.4375) hits 89.6 percent LOOCV accuracy, 87.5 percent recall, and 0.943 AUC-ROC on 77 real Sentinel-1 events. A three-tier alert pipeline and a BRRI-calibrated boro rice damage estimator are included.

URL PDF HTML ☆

赞 0 踩 0

2605.20164 2026-05-20 cs.AI 版本更新

Not Every Rubric Teaches Equally: Policy-Aware Rubric Rewards for RLVR

并非每个评分标准都等同教学：面向RLVR的政策感知评分奖励

Utkarsh Tyagi, Xingang Guo, MohammadHossein Rezaei, Daniel George, Anas Mahmoud, Jackson Lee, Bing Liu, Yunzhong He

发表机构 * Scale AI

AI总结本文提出POW3R框架，通过保留人类权重和类别平衡，改进评分奖励机制，使评分标准更符合最终答案的要求，从而在多模态和纯文本设置中提升性能。

Comments 24 pages, 7 figures, 6 tables

详情

AI中文摘要

可验证奖励的强化学习在训练后效果显著，当正确性可以自动检查时。然而，许多重要的模型行为需要同时满足多个定性标准。基于评分的奖励通过评估特定提示的标准并将其聚合为标量奖励来解决这一问题。然而，标准静态聚合将人类分配的重要性与当前作为优化信号的有用性混淆。我们证明在评分RL中，这种假设在评分标准中崩溃：许多重要的标准已经饱和或当前不可达，而区分rollout的标准不一定是最受人类重视的。我们引入POW3R，一种政策感知的评分奖励框架，该框架在评分目标中保留人类权重和类别平衡，同时在训练过程中适应标准级别的奖励权重。POW3R使用rollout级别的对比来强调当前区分策略输出的标准，使GRPO奖励更加信息丰富，而不会改变底层评估目标。在两个数据集上三个基础策略中，POW3R在30个基础策略/指标比较中胜出24个，提高了平均评分奖励和严格完成率（满足所有评分标准的提示比例），并以2.5-4倍更少的训练步骤达到相同平台。因此，评分奖励应区分最终答案中应重视的内容，以及当前策略可以教授的内容。

英文摘要

Reinforcement learning with verifiable rewards has made post-training highly effective when correctness can be checked automatically. However, many important model behaviors require satisfying several qualitative criteria at once. Rubric-based rewards address this setting by grading prompt-specific criteria and aggregating them into a scalar reward. Yet standard static aggregations conflate a criterion's human-assigned importance with its current usefulness as an optimization signal. We show that this assumption breaks down in rubric RL: many important criteria are already saturated or currently unreachable, while criteria that distinguish rollouts are not necessarily those with the largest human weights. We introduce POW3R, a policy-aware rubric reward framework that preserves human weights and category balance as the rubric objective while adapting criterion-level reward weights during training. POW3R uses rollout-level contrast to emphasize criteria that currently separate the policy's outputs, making the GRPO reward more informative without changing the underlying evaluation target. Across three base policies on two datasets spanning multimodal and text-only settings, POW3R wins $24$ of $30$ base-policy/metric comparisons, improving both mean rubric reward and strict completion (the fraction of prompts whose response satisfies every required rubric criterion) over vanilla GRPO with rubric rewards, and reaches the same plateau in $2.5$--$4\times$ fewer training steps. Rubric rewards should therefore distinguish what should matter in the final answer from what can teach the current policy.

URL PDF HTML ☆

赞 0 踩 0

2605.20158 2026-05-20 cs.CV cs.AI cs.CL 版本更新

使用阿基里斯API进行Lean 4中的AI辅助定理证明：格里菲斯问题的形式化案例研究

Gabriel Rongyang Lau

发表机构 * Nanyang Technological University（南洋理工大学）

AI总结本文通过形式化案例研究，探讨了使用阿基里斯API在Lean 4中进行AI辅助定理证明的挑战，展示了格里菲斯问题的证明过程，揭示了局部证明搜索成功但全局组合计数仍需解决的局限性。

详情

AI中文摘要

AI辅助定理证明现在可以生成大量Lean开发用于奥林匹克级数学，但这些开发的证据状态取决于哪些声明实际上已被验证。本文报告了针对格里菲斯问题（最初作为IMO 2009问题6提出）的Lean 4形式化案例研究，该研究涉及阿基里斯API的证明尝试。生成的成果包含一个通用的Lean定理版本，以及四个已验证的辅助引理，用于局部组件的最大性和相邻交换策略。主定理直接通过一个未解决的sorry声明关闭。已验证的组件证明了最终部分和等于总和，相邻置换仅影响相关的中间部分和，改变的部分和具有预期形式，以及在某个位置允许相邻后继交换的最大性迫使相应的禁止集成员事实。阿基里斯输出摘要识别出剩余的数学步骤是需要证明这些成员事实产生至少n个不同的禁止值，从而反驳| M | < n的基数假设；Lean源代码本身并未将主定理归约到单独编码的计数引理。该案例研究提供了一个可检查的例子，展示了AI辅助形式化中的核心限制，即局部证明搜索可以成功，但定理所需的全局组合计数仍需解决。本文贡献了一个可重复的Lean artifact和对其已验证和未验证证明内容的精确分析。

英文摘要

AI-assisted theorem proving can now generate substantial Lean developments for olympiad-level mathematics, but the evidential status of such developments depends on which declarations are actually verified. This paper reports a Lean 4 formalization case study of an Aristotle API proof attempt for the Grasshopper problem, originally posed as IMO 2009 Problem 6. The generated artifact states a generalized Lean version of the theorem, contains four verified helper lemmas for local components of a maximality and adjacent-swap exchange strategy, and leaves the main theorem grasshopper closed directly by one unresolved sorry. The verified components establish that the final partial sum equals the total sum, that an adjacent transposition can affect only the relevant intermediate partial sum, that the changed partial sum has the expected form, and that maximality at a position admitting an adjacent successor swap forces a corresponding forbidden-set membership fact. The Aristotle output summary identifies the intended remaining mathematical step as the global counting step needed to show that these membership facts produce at least n distinct forbidden values, contradicting the cardinality assumption |M| < n; the Lean source itself does not reduce the main theorem to a separately encoded counting lemma. This case study gives an inspectable example of a central limitation in AI-assisted formalization, namely that local proof search can succeed while the global combinatorial bookkeeping required for a theorem remains unresolved. The paper contributes a reproducible Lean artifact and a precise analysis of its verified and unverified proof content.

URL PDF HTML ☆

赞 0 踩 0

2605.20108 2026-05-20 eess.SY cs.AI cs.LG cs.LO cs.SY 版本更新

k-Inductive Neural Barrier Certificates for Unknown Nonlinear Dynamics

k-诱导神经屏障证书用于未知非线性动力学

Ben Wooding, Hongchao Zhang, Taylor T. Johnson, Abolfazl Lavaei

发表机构 * Vanderbilt University（范德堡大学）； Newcastle University（新castle大学）

AI总结本文提出了一种基于神经网络的k-诱导神经屏障证书(k-NBCs)，用于部分未知的非线性系统，通过利用神经网络的可扩展性以及泛化Willems等人基本引理，构建数据驱动的表示以进行SMT验证，同时提高了设计灵活性。

Comments 18 pages, 5 figures, 3rd International Conference on Neuro-Symbolic Systems (NeuS)

详情

AI中文摘要

尽管传统的(k=1)离散时间屏障证书条件通过要求函数在每一步都非递增来施加严格的安全约束，k-诱导屏障证书通过允许临时增加--最多k-1次，每次在阈值ε内--同时保持整体安全性并提高灵活性。本文利用神经网络构建k-诱导神经屏障证书(k-NBCs)用于(部分)未知的非线性系统。虽然神经网络在设计过程中提供可扩展性，但缺乏形式保证，需要额外的方法如基于可满足性模理论(SMT)的反例引导归纳合成(CEGIS)进行验证。然而，CEGIS-SMT框架需要系统动力学的知识，这在实际情况下不可用。为此，我们利用Willems等人基本引理的泛化，使用单个状态轨迹，构建数据驱动的表示以进行SMT验证而不牺牲准确性。此外，CEGIS-SMT进一步消除了将屏障证书限制在特定函数类（如平方和）的约束，从而在设计上具有更大的灵活性。我们验证了我们的方法在三个非线性案例研究中，具有(部分)未知的动力学。

英文摘要

While conventional (k=1) discrete-time barrier certificate conditions impose strict safety constraints by requiring the function to be non-increasing at every step, k-inductive barrier certificates relax this by allowing a temporary increase -- up to k-1 times, each within a threshold $ε$ -- while maintaining overall safety, and improving flexibility. This paper leverages neural networks and constructs k-inductive neural barrier certificates (k-NBCs) for (partially) unknown nonlinear systems. While neural networks offer scalability in the design process, they lack formal guarantees, requiring additional approaches such as counterexample-guided inductive synthesis (CEGIS) with satisfiability modulo theories (SMT) for verification. However, the CEGIS-SMT framework requires knowledge of system dynamics, which is unavailable in practical settings. To address this, we leverage the generalization of the Willems et al.'s fundamental lemma, using a single state trajectory, to construct a data-driven representation of (partially) unknown models for SMT verification without sacrificing accuracy. Additionally, CEGIS-SMT further removes the constraint of restricting barrier certificates to specific function classes, such as sum-of-squares, enabling greater flexibility in their design. We validate our approach on three nonlinear case studies with (partially) unknown dynamics.

URL PDF HTML ☆

赞 0 踩 0

2605.20107 2026-05-20 cs.LG cs.AI 版本更新

Beyond Isotropy in JEPAs: Hamiltonian Geometry and Symplectic Prediction

超越各向同性：JEPAs中的哈密顿几何与辛预测

Robert Jenkinson Alvarez

发表机构 * GitHub

AI总结本文研究了JEPAs中各向同性假设的局限性，提出基于哈密顿几何的辛预测方法，通过相空间状态和学习的哈密顿量预测视图间过渡，从而提升模型在不同数据集上的性能。

详情

AI中文摘要

JEPAs通常将单视图嵌入正则化为各向同性的高斯分布，隐含地将欧几里得对称性纳入表示中。我们证明这不仅仅是无害的默认设置。对于已知的结构化下游几何H>0，最小最大和最大熵协方差在哈密顿能量预算下为(c/d)H^{-1}，欧几里得各向同性会带来闭式价格。更重要的是，当下游几何未知时，没有几何无关的固定边际目标是规范的：每个固定协方差形状可以对某些结构化几何最大化地错位。我们进一步表明，即使拥有oracle单视图边际，也无法识别JEPA视图间预测耦合。这些结果表明，JEPAs中的结构偏差应进入跨视图耦合而非固定编码器边际。我们通过HamJEPA实例化这一原则，将每个视图编码为相空间状态(q,p)，并通过学习的哈密顿量跃迁映射预测视图间过渡，非各向同性的尺度和频谱地板防止崩溃。在刻意无头标记协议中，HamJEPA在CIFAR-100上比SIGReg提升4.89 kNN@20和3.52线性探针点，在30个epoch时，以及在80个epoch时提升6.45 kNN@20和10.64线性探针点。而匹配的MLP预测器消融显示，辛耦合是驱动邻域几何增益的成分。在ImageNet-100上，HamJEPA-q在45个epoch时提升4.82 kNN@20和7.52线性探针点。

英文摘要

JEPAs often regularize one-view embeddings toward an isotropic Gaussian, implicitly baking Euclidean symmetry into the representation. We show that this is not merely a benign default. For a known structured downstream geometry $H\succ0$, the minimax and maximum-entropy covariance under a Hamiltonian energy budget is $(c/d)H^{-1}$, and Euclidean isotropy incurs a closed-form price of isotropy. More importantly, when the downstream geometry is unknown, no geometry-independent fixed marginal target is canonical: every fixed covariance shape can be maximally misaligned for some structured geometry. We further show that even oracle one-view marginals do not identify the JEPA view-to-view predictive coupling. These results suggest that the structural bias in JEPAs should enter the cross-view coupling rather than a fixed encoder marginal. We instantiate this principle with \textbf{HamJEPA}, which encodes each view as a phase-space state $(q,p)$ and predicts view-to-view transitions with a learned Hamiltonian leapfrog map, while non-isotropic scale and spectral floors prevent collapse. In a deliberately headless token protocol, HamJEPA improves over SIGReg on CIFAR-100 by $+4.89$ kNN@20 and $+3.52$ linear-probe points at 30 epochs, and by $+6.45$ kNN@20 and $+10.64$ linear-probe points at 80 epochs, while a matched MLP predictor ablation shows that the symplectic coupling is the ingredient driving the neighborhood-geometry gain. On ImageNet-100, HamJEPA-$q$ improves by $+4.82$ kNN@20 and $+7.52$ linear-probe points at 45 epochs.

URL PDF HTML ☆

赞 0 踩 0

2605.20104 2026-05-20 cs.LG cs.AI 版本更新

BalanceRAG: 为级联检索增强生成进行联合风险校准

Zijun Jia, Yuanchang Ye, Sen Jia, Yiyao Qian, Haoning Wang, Baojie Chen, Diyin Tang, Jinsong Yu, Zhiyuan Wang

发表机构 * Beihang University（北航）； Shenzhen Institute of Advanced Technology（深圳先进技术研究院）； Zhejiang University of Finance & Economics（浙江财经大学）； University of Electronic Science and Technology of China（电子科技大学）

AI总结本文提出BalanceRAG，一种用于级联检索增强生成的联合风险校准方法，通过在二维晶格上确定安全操作点，实现风险自适应的阈值校准，从而在控制系统级错误率的同时保留更多示例，并扩展到多风险校准。

详情

AI中文摘要

大型语言模型（LLMs）可通过检索增强生成（RAG）提高事实性，但在模型单独回答可靠时，将RAG应用于每个查询是不必要的。这促使了级联RAG：每个查询首先由LLM单独分支处理，如果主分支不确定则升级到RAG回退，当两个分支都不足够可信时则放弃。然而，逐级校准此类级联可能过于保守，因为最终的效用取决于LLM单独和RAG的联合不确定性阈值。在本文中，我们开发了BalanceRAG，以在目标风险水平下认证阈值对。给定两个分支的不确定性分数，BalanceRAG将每个阈值对框架为二维晶格上的一个操作点，并通过顺序图形测试确定安全操作点。这使得风险自适应的阈值校准成为可能，从而在控制接受点的系统级错误率的同时保留更多示例。此外，BalanceRAG扩展到多风险校准，允许检索使用与基于选择的条件风险一起被限制。在三个开放领域问答（QA）基准上的实验表明，BalanceRAG满足规定的风险水平，保留了更高的覆盖率和更多的接受正确示例，并且比始终开启RAG减少了不必要的检索调用。

英文摘要

Large language models (LLMs) can enhance factuality via retrieval-augmented generation (RAG), but applying RAG to every query is unnecessary when the model-only answer is reliable. This motivates cascaded RAG: each query is first handled by an LLM-only branch, escalated to a RAG fallback only if the primary branch is uncertain, and abstained from when neither branch is sufficiently trustworthy. However, calibrating such cascades stage by stage may be conservative, since the final utility depends on joint uncertainty thresholding of LLM-only and RAG. In this work, we develop BalanceRAG to certify threshold pairs at a target risk level. Given uncertainty scores from the two branches, BalanceRAG frames each threshold pair as an operating point on a two-dimensional lattice and identifies safe operating points using sequential graphical testing. This enables risk-adaptive threshold calibration, controlling the system-level error rate among accepted points, while retaining more examples. Furthermore, BalanceRAG extends to multi-risk calibration, allowing retrieval usage to be bounded together with the selection-conditioned risk. Experiments on three open-domain question answering (QA) benchmarks across multiple LLM backbones demonstrate that BalanceRAG meets prescribed risk levels, preserves higher coverage and more accepted correct examples, and reduces unnecessary retrieval calls compared with always-on RAG.

URL PDF HTML ☆

赞 0 踩 0

2605.20082 2026-05-20 cs.CV cs.AI 版本更新

VL-DPO: Vision-Language-Guided Finetuning for Preference-Aligned Autonomous Driving

VL-DPO：基于视觉语言的偏好对齐自动驾驶微调

Zhefan Xu, Ghassen Jerfel, Marina Haliem, Qi Zhao, Jeonhyung Kang, Khaled S. Refaat

发表机构 * Waymo

AI总结本文提出VL-DPO，一种基于视觉语言模型的框架，通过零样本推理生成偏好对来微调自动驾驶模型，以提升与人类驾驶偏好的对齐程度，实验表明该方法在RFS和ADE指标上均优于基线模型。

Comments Published in International Conference on Robotics and Automation (ICRA), 2026 8 pages, 6 figures, 4 tables

详情

AI中文摘要

自动驾驶数据集的快速增长使强大的运动预测模型得以扩展。尽管大规模预训练提供了强大的性能，但标准模仿目标可能无法完全捕捉人类驾驶偏好中的复杂细微差别。同时，视觉语言模型（VLMs）的最新进展展示了出色的推理和常识理解能力。基于这些能力，本文提出了VL-DPO，一种基于视觉语言的框架，用于将自动驾驶车辆的运动预测模型与人类偏好对齐。我们的方法利用VLM作为零样本推理器，自动从预训练模型的轨迹中生成偏好对，然后通过直接偏好优化（DPO）进行微调。我们在此Waymo Open End-to-End Driving Dataset（WOD-E2E）上微调模型，并通过评分反馈（RFS）和平均位移误差（ADE）评估模型在持保留人类偏好注释上的性能。实验表明，VLM的轨迹选择是高质量的人类偏好的代理。我们的最终模型VL-DPO在RFS指标上比预训练模型提高了11.94%，在ADE指标上减少了10.01%。

英文摘要

The rapid growth of autonomous driving datasets has enabled the scaling of powerful motion forecasting models. While large-scale pretraining provides strong performance, the standard imitation objective may not fully capture the complex nuances of human driving preferences. Meanwhile, recent advances in vision-language models (VLMs) have demonstrated impressive reasoning and commonsense understanding. Building on these capabilities, this paper presents VL-DPO, a vision-language-guided framework that aligns ego-vehicle motion forecasting models with human preferences. Our approach leverages a VLM as a zero-shot reasoner to automatically generate preference pairs from a pretrained model's rollouts, which are then used to finetune the model via Direct Preference Optimization (DPO). We finetune our models on the Waymo Open End-to-End Driving Dataset (WOD-E2E) and evaluate performance against held-out human preference annotations using rater feedback score (RFS) and average displacement error (ADE). Our experiments confirm that the VLM's trajectory selection is a high-quality proxy for human preference. Our final model, VL-DPO, yields an 11.94% increase in RFS and a 10.01% reduction in ADE over the pretrained model.

URL PDF HTML ☆

赞 0 踩 0

2605.20079 2026-05-20 cs.CV cs.AI cs.LG eess.IV 版本更新

面向现实世界ROS~2系统的LLM辅助架构恢复：一种基于智能体的多级方法用于分层结构架构重建

Dominique Briechle, Raj Chanchad, Tobias Geger, Ruidi He, Dhruv Jajadiya, Dhruv Kapadiya, Andreas Rausch, Meng Zhang

发表机构 * Institute for Software and Systems Engineering, Clausthal University of Technology, Clausthal-Zellerfeld 38678, Germany（软件与系统工程研究所， Clausthal 技术大学， Clausthal-Zellerfeld 38678，德国）

AI总结本文提出了一种基于智能体的多级方法，用于恢复复杂ROS~2系统中的分层结构架构，通过改进的提示和多级中间架构表示，提高了架构恢复的一致性和可扩展性。

详情

AI中文摘要

显式软件架构模型是沟通、分析和演变复杂软件密集型系统的关键 artifacts。然而，在基于ROS~2的机器人系统中，结构（解构）和集成语义通常仅在分布式 artifacts（如源代码和启动文件）中隐式编码，使得恢复分层架构尤其困难。现有方法主要关注节点级实体和通信布线，而对多抽象层次上的分层结构（解构）恢复支持有限。本文扩展了我们之前提出的蓝图引导的LLM辅助架构恢复流程，通过两个主要改进：（1）改进的提示以提高架构合成的一致性和可控性；（2）基于多级中间架构表示的分阶段恢复策略，该策略结合了原子ROS节点列表和启动文件依赖关系，从而在多个抽象层次上实现结构受限的重建。该方法在基于协作机械臂和异构ROS~2 artifacts的现实世界自动化产品拆卸系统上进行了评估。与我们之前的工作相比，所选案例研究显示出显著更高的集成复杂性和更丰富的功能。结果表明，架构恢复在结构一致性、可扩展性和鲁棒性方面有所提高，同时揭示了与大规模ROS~2系统中动态集成语义相关的剩余挑战。

英文摘要

Explicit software architecture models are essential artifacts for communicating, analyzing, and evolving complex software-intensive systems. In ROS~2-based robotic systems, however, structural (de-)composition and integration semantics are often only implicitly encoded across distributed artifacts such as source code and launch files, making recovery of hierarchical architecture particularly difficult. Existing approaches mainly focus on node-level entities and communication wiring, while providing limited support for recovering hierarchical structural (de-)composition across multiple abstraction levels. In this paper, we extend our previously proposed blueprint-guided LLM-assisted architecture recovery pipeline for ROS~2 systems through two major enhancements: (1) refined prompting to improve the consistency and controllability of architecture synthesis, and (2) a staged recovery strategy based on multi-level intermediate architectural representations that incorporate the atomic ROS node list and launch file dependencies, thereby enabling structurally constrained reconstruction across multiple abstraction levels. The approach is evaluated on a real-world automated product disassembly system based on cooperative robotic arms and heterogeneous ROS~2 artifacts. Compared to our previous work, the considered case study exhibits substantially higher integration complexity and richer functionality. The results demonstrate improved structural consistency, scalability, and robustness of architecture recovery, while also revealing remaining challenges related to dynamic integration semantics in large-scale ROS~2 systems.

URL PDF HTML ☆

赞 0 踩 0

2605.20049 2026-05-20 cs.SE cs.AI 版本更新

Does Code Cleanliness Affect Coding Agents? A Controlled Minimal-Pair Study

代码整洁性影响编码代理吗？一项受控的最小对研究

Priyansh Trivedi, Olivier Schmitt

发表机构 * SonarSource

AI总结本研究探讨了代码整洁性对编码代理性能的影响，通过构建结构和风格相似但整洁度不同的代码库对，发现整洁性不影响通过率，但显著降低计算成本和文件重复访问。

详情

AI中文摘要

随着自主编码代理的快速普及，其评估主要集中在固定目标代码库的任务完成率上。这留下了一个关键问题未被回答：底层代码的结构和风格质量，即“整洁性”，是否会影响代理导航和修改代码的能力？为了隔离代码整洁性对代理能力的影响，我们引入了一种基于最小对的评估协议：构建结构、依赖和外部行为相同但静态分析规则违反和认知复杂度不同的代码库对。这些对通过代理流水线在两个方向上构建：一个降级干净代码库或清理混乱代码库。我们为六个这样的对编写了33项任务，并通过应用的公共表面进行隐藏测试。在660次使用Claude Code的试验中，代码整洁性没有改变代理的通过率。然而，它显著改变了代理的操作足迹：在整洁代码上工作的代理使用7至8%更少的标记，并减少34%的文件重复访问。我们的发现表明，传统可维护性原则在AI驱动开发时代仍然高度相关，影响编码代理的计算成本和导航效率。代码整洁性与模型选择、工具和提示并列，成为影响代理行为的重要因素。

英文摘要

As autonomous coding agents see rapid adoption, their evaluation has primarily focused on task completion rates holding the target codebase fixed. This leaves a critical question unanswered: does the structural and stylistic quality, or ``cleanliness'' of the underlying code affect an agent's ability to navigate and modify it? To isolate the effect of code cleanliness from agent capability, we introduce an evaluation protocol built around minimal pairs: repositories that match on architecture, dependencies, and external behaviour, but differ on static-analysis rule violations and cognitive complexity. The pairs are constructed in both directions, by agent pipelines that either degrade a clean repository or clean a messy one. We author 33 tasks across six such pairs, evaluated through hidden tests at the application's public surface. Across 660 trials with Claude Code, code cleanliness does not change the agent's pass rate. However, it substantially alters the agent's operational footprint: agents working on cleaner code use 7 to 8% fewer tokens and reduce file revisitations by 34%. Our findings suggest that traditional maintainability principles remain highly relevant in the era of AI-driven development, shaping the computational cost and navigational efficiency of coding agents. Code cleanliness joins model choice, harness, and prompting as a factor that materially affects agent behaviours.

URL PDF HTML ☆

赞 0 踩 0

2605.20037 2026-05-20 cs.LG cs.AI 版本更新

为代理调优辩护：从文档到PostgreSQL中的行动

Hongyu Lin, Mingyu Li, Weichen Zhang, Yihang Lou, Mingjie Xing, Yanjun Wu, Haibo Chen

发表机构 * Institute of Software, Chinese Academy of Sciences（中国科学院软件研究所）； University of Chinese Academy of Sciences（中国科学院大学）； Key Laboratory of System Software (Chinese Academy of Sciences)（中国科学院系统软件重点实验室）； Beihang University（北航）； Peking University（北京大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出通过动态行动替代静态文档进行系统调优，引入PerfEvolve工具，利用LLM代理实现版本一致性验证、工作负载特定分析和多参数联合优化，实验表明其在PostgreSQL上比现有文档驱动调优方法提升35.2%。

2605.19975 2026-05-20 cs.LG cs.AI 版本更新

Learning with Foresight: Enhancing Neural Routing Policy via Multi-Node Lookahead Prediction

具有前瞻性学习：通过多节点前瞻性预测增强神经路由策略

Xia Jiang, Yaoxin Wu, Yew-Soon Ong, Yingqian Zhang

发表机构 * Eindhoven University of Technology（埃因霍温理工大学）； Nanyang Technological University（南洋理工大学）； Agency for Science, Technology and Research (A*STAR)（科技研究局（A*STAR））

AI总结本研究提出多节点前瞻性预测（MnLP）方法，通过扩展监督学习范式同时预测多个未来节点，提升神经路由策略的长期规划能力，并在不同问题规模和现实基准上改进泛化能力。

Comments Accepted by the 35th International Joint Conference on Artificial Intelligence

详情

AI中文摘要

神经策略因其对人工启发式依赖的减少而在解决车辆路径问题中展现出潜力。然而，当前的训练范式存在根本性局限：它们主要关注下一个节点的预测，导致短视决策，削弱了长期规划能力。为此，我们引入多节点前瞻性预测（MnLP），一种新的训练策略，扩展监督学习范式以同时预测多个未来节点。我们整合了因果性和可丢弃的MnLP模块，这些模块仅在训练期间运行，使模型能够预测多步决策，同时保持推理时的效率。通过将多深度辅助监督融入损失函数，MnLP使神经策略具备长距离上下文理解能力。实验表明，MnLP在现有训练方法上表现更优，提升了神经策略在各种问题规模、分布和现实基准上的泛化能力。此外，MnLP可以无缝集成到不同的神经架构中，而不引入额外的推理开销。

英文摘要

Neural policies have shown promise in solving vehicle routing problems due to their reduced reliance on handcrafted heuristics. However, current training paradigms suffer from a fundamental limitation: they primarily focus on next-node prediction for solution construction, resulting in myopic decision-making that undermines long-horizon planning capacity. To this end, we introduce Multi-node Lookahead Prediction (MnLP), a novel training strategy that extends the supervised learning paradigm to predict multiple future nodes simultaneously. We incorporate causal and discardable MnLP modules that operate exclusively during training, facilitating models to anticipate multi-step decisions while preserving inference-time efficiency. By incorporating multi-depth auxiliary supervision into the loss function, MnLP equips neural policies with the ability of long-range contextual understanding. Experimentally, MnLP outperforms existing training methods, improving the generalization capability of neural policies across various problem sizes, distributions, and real-world benchmarks. Moreover, MnLP can be seamlessly integrated into diverse neural architectures without introducing additional inference overhead.

URL PDF HTML ☆

赞 0 踩 0

2605.19972 2026-05-20 cs.LG cs.AI cs.DB cs.DS 版本更新

Block-Sphere Vector Quantization

块球向量量化

Heesang Ann, Joongkyu Lee, Min-hwan Oh

发表机构 * Seoul National University（首尔国立大学）

AI总结本文研究了向量量化方法，通过统一理论比较不同旋转量化器，揭示其性能依赖于特定的失真度量标准，并提出块球量化算法以改进旋转块量化。

详情

AI中文摘要

向量量化是可扩展机器学习系统中的基本操作，能够实现内存高效存储、快速检索和压缩推理。最近的旋转基于量化器如EDEN、RabitQ和TurboQuant引入了强保证和实证性能，但其周围比较难以解释，因为它们依赖于不同的失真标准、概率领域和实现假设。作为我们的第一个贡献，我们提供了这些方法的统一理论比较，表明其相对优势是标准依赖的而非绝对的：EDEN和TurboQuant在均方失真方面有利，EDEN在预期内积失真方面也有效，而RabitQ提供强的高概率控制。此比较进一步表明EDEN在预期失真度量方面提供特别强的保证。作为我们的第二个贡献，我们引入了块球量化（BlockQuant），一种新的旋转块量化算法，围绕随机旋转向量的球几何设计。不同于坐标wise量化器，BlockQuant在球面上量化块，更忠实保持旋转嵌入的几何结构。我们证明这种块球设计在本文考虑的基准上理论上在重建MSE和预期内积失真方面均有所改进。我们在真实嵌入数据集和长上下文LLM推理任务上的实验显示了实际收益，与我们的理论改进一致。

英文摘要

Vector quantization is a fundamental primitive for scalable machine learning systems, enabling memory-efficient storage, fast retrieval, and compressed inference. Recent rotation-based quantizers such as EDEN, RabitQ, and TurboQuant have introduced strong guarantees and empirical performance, but the surrounding comparisons have been difficult to interpret because they rely on different distortion criteria, probability regimes, and implementation assumptions. As our first contribution, we provide a unified theoretical comparison of these methods and show that their relative advantages are criterion-dependent rather than absolute: EDEN and TurboQuant are favorable for MSE distortion, EDEN is also effective for expected inner-product distortion, and RabitQ provides strong high-probability control. This comparison further clarifies that EDEN provides particularly strong guarantees for expected distortion measures. As our second contribution, we introduce Block-Sphere Quantization (BlockQuant), a new rotation-based block quantization algorithm designed around the spherical geometry of randomly rotated vectors. Unlike coordinate-wise quantizers, BlockQuant quantizes blocks on the sphere, preserving the geometry of rotated embeddings more faithfully. We prove that this block-spherical design theoretically improves over the baselines considered in this paper for both reconstruction MSE and expected inner-product distortion. Our experiments on real embedding datasets and long-context LLM inference tasks show practical gains that are consistent with our theoretical improvements.

URL PDF HTML ☆

赞 0 踩 0

2605.19966 2026-05-20 cs.LG cs.AI 版本更新

Detecting Fluent Optimization-Based Adversarial Prompts via Sequential Entropy Changes

通过顺序熵变化检测基于优化的对抗性提示

Mohammed Alshaalan, Miguel R. D. Rodrigues

发表机构 * Department of Electronic and Electrical Engineering, University College London, London, United Kingdom（电子与电气工程系，伦敦大学学院，伦敦，英国）

AI总结本文提出了一种基于在线变化点检测的对抗性后缀检测方法CPD，通过标准化用户令牌熵并应用单侧CUSUM统计量，提高了对优化基于对抗性提示的检测性能，同时在多个大型语言模型上实现了更高的F1分数和AUC性能。

Comments Accepted at ICML 2026; 20 pages, including 9 pages main text, references, and appendix

详情

AI中文摘要

基于优化的对抗性后缀可以劫持对齐的大型语言模型（LLMs），同时保持流畅，这削弱了静态和窗口化困惑度基于的检测器。我们把对抗性后缀检测视为一个在线变化点检测问题，针对令牌级下一个令牌熵流。使用LLM系统提示来估计一个稳健的基线，我们标准化用户令牌熵并应用单侧CUSUM统计量。所得到的检测器CPD（在线变化点检测）是模型无关的，无需训练，可以在线运行，并能定位对抗性后缀的起始。在1,012个优化基于的后缀攻击（GCG，AutoDAN，AdvPrompter，BEAST，AutoDAN-HGA）和1,012个困惑度控制的良性提示的基准上，CPD在六个开源权重聊天模型（LLaMA-2-7B/13B，Vicuna-7B/13B，Qwen2.5-7B/14B）上均优于最强的窗口化困惑度基线。在LLaMA-2-7B的典型CUSUM设置（k=0）下，CPD达到AUC 0.88和F1 0.82。除了提示级检测外，CPD将79.6%的触发集中在对抗性后缀内，而窗口化困惑度为17-46%。最后，当用作LLaMA Guard的轻量级门控时，CPD在高流量、良性主导的部署中减少了17-22%的门控调用，同时保持了门控级别的检测质量。

英文摘要

Optimization-based adversarial suffixes can jailbreak aligned large language models (LLMs) while remaining fluent, weakening static and windowed perplexity-based detectors. We cast adversarial suffix detection as an online change-point detection problem over the token-level next-token entropy stream. Using the LLM system prompt to estimate a robust baseline, we standardize user-token entropies and apply a one-sided CUSUM statistic. The resulting detector, CPD Online (CPD), is model-agnostic, training-free, runs online, and localizes the adversarial suffix onset. On a benchmark of 1,012 optimization-based suffix attacks (GCG, AutoDAN, AdvPrompter, BEAST, AutoDAN-HGA) and 1,012 perplexity-controlled benign prompts, CPD improves F1 over the strongest windowed-perplexity baseline on all six open-weight chat models (LLaMA-2-7B/13B, Vicuna-7B/13B, Qwen2.5-7B/14B). On LLaMA-2-7B at the canonical CUSUM setting ($k=0$), CPD reaches AUROC $0.88$ and F1 $0.82$. Beyond prompt-level detection, CPD concentrates 79.6% of its triggers inside the adversarial suffix, versus 17-46% for windowed perplexity. Finally, when used as a lightweight gate for LLaMA Guard, CPD reduces guard calls by 17-22% on a high-volume, benign-dominated deployment while preserving guard-level detection quality

URL PDF HTML ☆

赞 0 踩 0

2605.19957 2026-05-20 cs.CV cs.AI cs.RO 版本更新

概率性微型递归模型

Amin Sghaier, Ali Parviz, Alexia Jolicoeur-Martineau

发表机构 * Mila – Quebec AI Institute（魁北克人工智能研究所）

AI总结本文提出概率性微型递归模型（PTRM），通过在递归步骤中注入高斯噪声，使模型能够并行探索多样化的解决方案盆地，从而在不重新训练或进行任务特定增强的情况下，提升多个基准测试的准确性，包括Sudoku-Extreme和Pencil Puzzle Bench上的各种谜题。

详情

AI中文摘要

微型递归模型（TRM）通过迭代优化潜在状态和最终答案，以少量参数解决复杂推理任务。尽管强大，其确定性递归可能导致收敛于次优解，缺乏逃逸机制。常见的解决方法依赖于测试时的任务特定输入扰动结合答案投票聚合。我们引入概率性TRM（PTRM），一种任务无关的测试时计算扩展框架，通过随机探索解决这一限制。PTRM在每个深度递归步骤中注入高斯噪声，使并行轨迹探索多样化的解决方案盆地，并利用模型原有的Q头（用于原始TRM中的早期停止）在其中选择。无需重新训练或任务特定增强，PTRM在多个基准测试上实现了显著的准确性提升，包括Sudoku-Extreme（87.4%到98.75%）和各种Pencil Puzzle Bench谜题（62.6%到91.2%）。在后者上，PTRM在不到0.0001倍的成本下，仅使用700万参数，就实现了接近前沿LLM（91.2% vs. 55.1%）两倍的准确性。

英文摘要

Tiny Recursive Models (TRM) solve complex reasoning tasks with a fraction of the parameters of modern large language models (LLMs) by iteratively refining a latent state and final answer. While powerful, their deterministic recursion can lead to convergence at suboptimal solutions, without escape mechanism. A common workaround relies on task-specific input perturbations at test time combined with answer aggregation via voting. We introduce Probabilistic TRM (PTRM), a task-agnostic framework for test-time compute scaling that addresses this limitation through stochastic exploration. PTRM injects Gaussian noise at each deep recursion step, enabling parallel trajectories to explore diverse solution basins, and selects among them using the model's existing Q head (used for early stopping in the original TRM). Without requiring retraining or task-specific augmentations, PTRM enables substantial accuracy gains across benchmarks, including Sudoku-Extreme (87.4% to 98.75%) and on various puzzles from Pencil Puzzle Bench (62.6% to 91.2%). On the latter, PTRM achieves nearly double the accuracy of frontier LLMs (91.2% vs. 55.1%) at less than 0.0001x the cost, using only 7M parameters.

URL PDF HTML ☆

赞 0 踩 0

2605.19940 2026-05-20 cs.AI cs.RO 版本更新

Robotics-Inspired Guardrails for Foundation Models in Socially Sensitive Domains

受机器人启发的用于社会敏感领域基础模型的护栏

Rebecca Ramnauth, Drazen Brscic, Brian Scassellati

发表机构 * Yale University（耶鲁大学）； Kyoto University（京都大学）

AI总结本文提出了一种基于机器人学的护栏框架，用于在社会敏感领域中对基础模型进行运行时行为控制，以减少交互轨迹中向不良状态的漂移，并适应多样化的社会情境。

Comments Under review at Journal of Artificial Intelligence Research (JAIR)

详情

AI中文摘要

基础模型正越来越多地应用于教育、心理健康和护理等社会敏感领域，其中失败往往具有累积性和情境依赖性。现有的护栏方法，从训练时对齐到提示、解码约束和事后调节，主要提供经验风险降低而非可执行的行为保证，并且大多将安全视为单个输出属性而非交互轨迹属性。我们重新将护栏视为对交互轨迹的运行时行为控制问题，并借鉴机器人学引入形式构造以在不确定的闭环系统中执行约束。我们将在Grounded Observer框架中实例化这些想法，并在三个现实世界部署中应用：闲聊、家庭自闭症疗法和学校行为缓和。在各种场景中，该框架能够实现运行时干预，以减少向不良交互状态漂移，同时适应多样化社会情境。我们讨论了该框架的扩展，并提出了加强保证的研究方向。

英文摘要

Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches -- ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation -- primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.

URL PDF HTML ☆

赞 0 踩 0

2605.19932 2026-05-20 cs.AI cs.CL cs.LG 版本更新

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

PEEK：上下文地图作为长上下文LLM代理的导向缓存

Zhuohan Gu, Qizheng Zhang, Omar Khattab, Samuel Madden

发表机构 * MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）； Stanford University（斯坦福大学）

AI总结本文提出PEEK系统，通过上下文地图缓存和维护导向知识，提升长上下文LLM代理在重复外部上下文中的交互准确性和效率，相比基线方法在推理和上下文学习任务中均取得显著提升。

详情

AI中文摘要

大型语言模型（LLM）代理越来越多地在长且重复的外部上下文中操作，如文档语料库和代码仓库。在多次调用中，现有方法保留的是代理的轨迹、对原始材料的被动访问或任务级别的策略。但它们没有保留我们认为对于重复相同上下文工作负载最需要的：关于重复上下文本身的可重用导向知识（例如，上下文包含什么、如何组织，以及哪些实体、常量和模式历史上有用）。我们引入PEEK，一种系统，通过上下文地图缓存和维护这种导向知识：一个在代理提示中始终存在的小而固定大小的artifact，使代理能够持续查看外部上下文。该地图由一个可编程的缓存策略维护，包含三个模块：一个Distiller从推理时间信号中提取可转移的知识，一个Cartographer将其转换为结构化的编辑，以及一个基于优先级的Evictor强制执行固定的token预算。在长上下文推理和信息聚合中，PEEK在强基线方法上提高了6.3-34.0%，同时使用93-145次更少的迭代，并且成本比最先进的提示学习框架ACE低1.7-5.8倍。在上下文学习中，PEEK在解决率和评分准确性上分别提高了6.0-14.0%和7.8-12.1%，且成本比ACE低1.4倍。这些收益在不同语言模型和代理架构上均能泛化，包括生产级的OpenAI Codex。这些结果表明，上下文地图有助于长上下文LLM代理更准确、更高效地与重复的外部上下文交互。

英文摘要

Large language model (LLM) agents increasingly operate over long and recurring external contexts, like document corpora and code repositories. Across invocations, existing approaches preserve either the agent's trajectory, passive access to raw material, or task-level strategies. None of them preserves what we argue is most needed for repeated same-context workloads: reusable orientation knowledge (e.g., what the context contains, how it is organized, and which entities, constants, and schemas have historically been useful) about the recurring context itself. We introduce PEEK, a system that caches and maintains this orientation knowledge as a context map: a small, constant-sized artifact in the agent's prompt that gives it a persistent peek into the external context. The map is maintained by a programmable cache policy with three modules: a Distiller that extracts transferable knowledge from inference-time signals, a Cartographer that translates it into structured edits, and a priority-based Evictor that enforces a fixed token budget. On long-context reasoning and information aggregation, PEEK improves over strong baselines by 6.3-34.0% while using 93-145 fewer iterations and incurring 1.7-5.8x lower cost than the state-of-the-art prompt-learning framework, ACE. On context learning, PEEK improves solving rate and rubric accuracy by 6.0-14.0% and 7.8-12.1%, respectively, at 1.4x lower cost than ACE. These gains generalize across LMs and agent architectures, including OpenAI Codex, a production-grade coding agent. Together, these results show that a context map helps long-context LLM agents interact with recurring external contexts more accurately and efficiently.

URL PDF HTML ☆

赞 0 踩 0

2605.19931 2026-05-20 cs.CV cs.AI cs.LG 版本更新

StruMPL: Multi-task Dense Regression under Disjoint Partial Supervision and MNAR Labels

StruMPL：在不相交的部分监督和MNAR标签下的多任务密集回归

Reza M. Asiyabi, Juan Alberto Molina-Valero, The SEOSAW Partnership, Steven Hancock, Casey M. Ryan

发表机构 * School of Geosciences, University of Edinburgh, UK（爱丁堡大学地球科学学院，英国）； National Centre for Earth Observation (NCEO), UK（英国地球观测国家中心）； Department of Spatial Sciences, Faculty of Environmental Sciences Czech University of Life Sciences Prague, Praha, Czech Republic（环境科学学院空间科学系，捷克布拉格生命科学大学）

AI总结本文针对在不相交的部分监督和MNAR标签下的多任务密集回归问题，提出StruMPL方法，通过共享编码器和可学习的物理模块，结合Augmented IPW损失函数，提高了对森林地上生物量的估计精度。

Comments 10 pages with 3 figures and 4 tables, References and Appendix 12 pages with 1 figure and 4 tables

详情

AI中文摘要

从地球观测估计森林地上生物量（AGB）结合了两个结构上不兼容的标签源：空间borne激光雷达在数百万个位置提供冠层结构但没有生物量估计，而地面样地在数千个偏倚位置提供生物量但没有结构指标。没有单个训练样本携带所有目标变量的标签，样地标签不是随机缺失（MNAR），且生物量通过已知但生物体特异性的所有学定律与结构变量相关联。我们将其正式化为在异质不相交部分监督下的多任务密集回归问题，具有MNAR标签和任务间物理约束，并提出StruMPL方法来联合解决。一个共享编码器为每个变量回归、填补和倾向性头提供空间MNAR校正，以及一个可学习的物理模块，该模块在每个像素上评估任务间约束对模型自身预测的影响。监督损失使用Augmented IPW（AIPW）伪结果，其中在倾向性和填补基线上的停止梯度；我们证明了分析和实证上，两者对于联合优化恢复IPW加权的平稳点并保持损失有界是必要的。在两个生态上不同的生物体上，StruMPL在AGB RMSE和偏倚方面优于消融变体和最接近的已发表方法，分层分析显示AIPW减少了高AGB偏倚约54%。

英文摘要

Estimating forest aboveground biomass (AGB) from Earth observation combines two structurally incompatible label sources: spaceborne lidar provides canopy structure at millions of locations but no biomass estimate, and ground-based plots provide biomass at thousands of biased locations but no metrics of structure. No single training sample carries labels for all target variables, plot labels are missing not at random (MNAR), and biomass is linked to the structural variables by known but biome-specific allometric laws. We formalise this as multi-task dense regression under heterogeneous disjoint partial supervision with MNAR labels and inter-task physical constraints, and propose StruMPL to address it jointly. A shared encoder feeds per-variable regression, imputation, and propensity heads for spatial MNAR correction, and a learnable physics module that evaluates the inter-task constraint on the model's own predictions at every pixel. The supervised loss uses an Augmented IPW (AIPW) pseudo-outcome with stop-gradients on the propensity and on the imputation baseline; we show analytically and empirically that both are necessary for joint optimisation to recover IPW-weighted stationary points while keeping the loss bounded. On two ecologically distinct biomes, StruMPL outperforms ablation variants and the closest published method on AGB RMSE and bias, with a stratified analysis showing AIPW reduces high-AGB bias by ~54%.

URL PDF HTML ☆

赞 0 踩 0

2605.19929 2026-05-20 cs.CV cs.AI 版本更新

Breaking Modality Heterogeneity in Low-Bit Quantization for Large Vision-Language Models

打破大视觉-语言模型低比特量化中的模态异质性

Yi Zhong, Haotong Qin, Xindong Zhang, Lei Zhang, Guolei Sun

发表机构 * VCIP, College of Computer Science, Nankai University（南开大学计算机科学学院VCIP）； D-ITET, ETH Zürich（苏黎世联邦理工学院D-ITET）； OPPO Research Institute（OPPO研究院）； Department of Computing, Hong Kong Polytechnic University（香港理工大学计算机系）

AI总结本文提出SplitQ框架，通过通道分割和自适应跨模态校准模块，解决大视觉-语言模型在低比特量化中因模态异质性导致的精度下降问题，显著提升了在多种多模态数据集上的性能。

详情

AI中文摘要

低比特后训练量化（PTQ）是将视觉-语言模型（VLMs）部署到资源受限设备中的关键技术。然而，现有PTQ方法由于在量化过程中文本和视觉模态的异质激活分布而降低了VLMs的准确性。我们发现这种跨模态异质性在通道上分布不均：一小部分通道包含大部分模态特定的异常值，且这些异常值通常位于每个模态的不同通道中。受此启发，我们提出了SplitQ，一种基于通道分割的后训练量化框架。其核心是引入了一个新的模态特定异常通道解耦（MOCD）模块，该模块能够以最小的开销有效隔离显著的模态特定异常通道。为进一步解决剩余的跨模态分布差异，我们设计了一个自适应跨模态校准（ACC）模块，该模块采用双轻量级可学习分支动态缓解模态引起的量化误差。在流行的VLMs上的广泛实验表明，SplitQ在所有评估的量化设置下，包括W4A8、W4A4、W3A3和W3A2，均在6个流行的多模态数据集上显著优于现有方法。值得注意的是，SplitQ在具有挑战性的W3A3设置下保留了93.5%的FP16性能（69.5 vs. 74.3），推动了高级VLMs部署的效率前沿。我们的代码可在https://github.com/EMVision-NK/SplitQ上获得。

英文摘要

Low-bit post-training quantization (PTQ) is a pivotal technique for deploying Vision-Language Models (VLMs) on resource-constrained devices. However, existing PTQ methods often degrade VLMs' accuracy due to the heterogeneous activation distributions of text and vision modalities during quantization. We find that this cross-modal heterogeneity is distributed unevenly across channels: a small subset of channels contains most modality-specific outliers, and these outliers typically reside in different channels for each modality. Motivated by this, we propose SplitQ, a channel-Splitting-driven post-training Quantization framework. At its core, SplitQ introduces a novel Modality-specific Outlier Channel Decoupling (MOCD) module that effectively isolates salient modality-specific outlier channels with minimal overhead. To further address the remaining cross-modal distribution discrepancies, we design an Adaptive Cross-Modal Calibration (ACC) module that employs dual lightweight learnable branches to dynamically mitigate modality-induced quantization errors. Extensive experiments on popular VLMs demonstrate that SplitQ significantly outperforms existing approaches across 6 popular multi-modal datasets under all evaluated quantization settings, including W4A8, W4A4, W3A3, and W3A2. Notably, SplitQ preserves 93.5% of FP16 performance under the challenging W3A3 setting (69.5 vs. 74.3), pushing the efficiency frontier for deploying advanced VLMs. Our code is available at https://github.com/EMVision-NK/SplitQ

URL PDF HTML ☆

赞 0 踩 0

2605.19928 2026-05-20 cs.GT cs.AI cs.LG 版本更新

Real-Time Parallel Counterfactual Regret Minimization

实时并行反事实遗憾最小化

Boning Li, Longbo Huang

发表机构 * IIIS, Tsinghua University（清华大学信息科学技术学院）

AI总结本文提出了一种实时深度限制下的CFR求解并行框架，通过剪枝、抽象和高级CFR变体的无缝整合，实现了在几秒内完成近均衡策略计算的高效方法，实验显示在德州扑克中速度提升了3.3-3.4倍。

Comments 13 pages, 3 figures

详情

AI中文摘要

反事实遗憾最小化（CFR）是解决大型不完全信息游戏的主要算法家族，支撑了Libratus和Pluribus等No-Limit Texas Hold'em扑克突破。在实时游戏系统中，求解器必须在仅几秒的严格时间预算内计算近均衡策略，而在此窗口内完成的CFR迭代次数直接决定了游戏表现。我们提出了Parallel CFR，这是首个用于实时深度限制CFR求解的并行化框架，无缝整合了剪枝、抽象和高级CFR变体。我们将每个CFR迭代分解为七个阶段的流水线，并识别了两个正交的并行维度：按信息集和按树节点。叶节点评估通过批量神经网络推理卸载到GPU，创建了异构的CPU-GPU流水线。在一对一No-Limit Texas Hold'em实验中，Parallel CFR在翻牌街实现了3.3-3.4倍的速度提升，深度限制游戏树中超过10亿历史的每迭代时间约为47-54毫秒。所有实验均在单个桌面级设备（NVIDIA DGX Spark）上运行，无需数据中心级基础设施即可在典型实时决策预算内完成数百次CFR迭代。

英文摘要

Counterfactual Regret Minimization (CFR) is the dominant algorithmic family for solving large imperfect-information games, underpinning breakthroughs such as Libratus and Pluribus in No-Limit Texas Hold'em poker. In real-time game-playing systems, the solver must compute a near-equilibrium strategy within a strict time budget of only a few seconds per decision, and the number of CFR iterations completed in this window directly determines play strength. We present \textbf{Parallel CFR}, the first parallelization framework for real-time depth-limited CFR solving that seamlessly integrates pruning, abstraction, and advanced CFR variants. We decompose each CFR iteration into a pipeline of seven stages and identify two orthogonal dimensions of parallelism: \emph{by information set} and \emph{by tree node}. Leaf node evaluation is offloaded to GPUs via batched neural network inference, creating a heterogeneous CPU--GPU pipeline. Experiments on Heads-Up No-Limit Texas Hold'em demonstrate that Parallel CFR achieves $3.3$--$3.4\times$ speedup over the single-threaded baseline on postflop streets, with per-iteration time of ${\sim}47$--$54$~ms on a depth-limited game tree with over $1$ billion histories. All experiments run on a single desktop-class device (NVIDIA DGX Spark), enabling hundreds of CFR iterations within a typical real-time decision budget without requiring datacenter-scale infrastructure.

URL PDF HTML ☆

赞 0 踩 0

2605.19916 2026-05-20 cs.LG cs.AI 版本更新

Fast and Featureless Node Representation Learning with Partial Pairwise Supervision

基于部分成对监督的快速且无特征节点表示学习

Sujan Chakraborty, Saptarshi Bej

发表机构 * Indian Institute of Science Education and Research（印度科学教育与研究学院）

AI总结该研究提出了一种快速且统一的框架，用于在部分可用的成对节点标签和无可用节点特征的图中进行可扩展的节点表示学习，通过结合社区感知的结构信号和带符号的成对约束，实现了高效的优化方案。

详情

AI中文摘要

我们引入了Contrastive FUSE，一种用于图中可扩展节点表示学习的快速且统一的框架，该框架在部分可用的成对节点标签和无可用节点特征的情况下进行优化。与现有方法不同，我们直接优化了一个谱对比目标，该目标整合了社区感知的结构信号和带符号的成对约束。为了支持大规模训练，我们用一种轻量级的近似方法替换了昂贵的模块度梯度，这在保持模块度行为的同时显著降低了计算成本。这产生了一种高效的优化方案，具有自然梯度分解和自适应学习率缩放，即使在百万边图上也能实现快速迭代更新。在基准引文网络、大型共购图和OGB数据集上的广泛实验表明，Contrastive FUSE在不依赖节点特征的情况下实现了竞争性或优越的对比分类性能，同时在现有基线上提供了显著的运行时间提升。这些结果突显了将模块度启发的结构学习与对比监督相结合在高效和可扩展的对比节点表示学习中的有效性。

英文摘要

We introduce Contrastive FUSE, a fast and unified framework for scalable node representation learning in graphs with partially available pairwise node labels and no available node features. Unlike existing methods, we directly optimize a spectral contrastive objective that integrates community-aware structural signals with signed pairwise constraints. To support large-scale training, we replace the expensive modularity gradient with a lightweight approximation, which preserves the structure-seeking behavior of modularity while reducing the computational cost significantly. This yields an efficient optimization scheme with a natural gradient decomposition and adaptive learning-rate scaling, enabling fast iterative updates even on million-edge graphs. Extensive experiments on benchmark citation networks, large co-purchase graphs, and OGB datasets show that Contrastive FUSE achieves competitive or superior contrastive classification performance without relying on node features, while offering substantial runtime gains over existing baselines. These results highlight the effectiveness of coupling modularity-inspired structural learning with contrastive supervision for efficient and scalable contrastive node representation learning.

URL PDF HTML ☆

赞 0 踩 0

StableGrad: 无需批量归一化的反向缩放控制

Jose I. Mestre, Alberto Fernández-Hernández, Cristian Pérez-Corral, Manuel F. Dolz, Enrique S. Quintana-Ortí

发表机构 * Universitat Politècnica de València（巴塞罗那理工大学）； Universitat Jaume I（Jaime I 大学）

AI总结本文提出StableGrad，一种在无需批量归一化的情况下通过优化器层面控制权重-梯度缩放来稳定深度神经网络训练的方法，特别适用于物理信息神经网络等场景。

详情

AI中文摘要

训练非常深的神经网络需要控制深度方向上的量值传播。没有这种控制，激活值和梯度可能会消失、爆炸或进入不稳定区域，导致优化失败。现代架构通常通过批量归一化、残差连接或其他归一化层来缓解这个问题，这些机制会重复地重新缩放或绕过中间表示。然而，这些机制并不总是适用。在物理信息神经网络（PINNs）中，网络表示连续的物理场及其输入导数定义了训练目标，使批量依赖的归一化变得有问题，因为这会引入非局部依赖性到预测场及其导数中。我们提出StableGrad，一种优化器层面的缩放控制机制，可以在不修改前向模型的情况下纠正层间权重-梯度不平衡。因为归一化仅在反向传播后、优化器更新前应用，网络输出、其导数和物理残差保持不变。我们分析了这种缩放所引起的有效训练动态，并在深度PINNs上评估StableGrad作为目标应用，用无批量归一化的卷积网络作为诊断压力测试。在PINN基准测试中，StableGrad提高了匹配深度的解精度，并使更深层的模型在标准优化下更加可靠。在ResNet和EfficientNet架构中，移除批量归一化通常会导致训练崩溃，但StableGrad在不引入其他架构变化的情况下稳定了优化。这些结果表明，优化器层面的权重-梯度缩放控制可以提供一种实用的替代方案，当前向归一化不可用或不适用时。

英文摘要

Training very deep neural networks requires controlling the propagation of magnitudes across depth. Without such control, activations and gradients may vanish, explode, or enter unstable regimes that make optimization fail. Modern architectures often mitigate this problem through Batch Normalization, residual connections, or other normalization layers, which repeatedly re-scale or bypass intermediate representations. However, these mechanisms are not always appropriate. In Physics-Informed Neural Networks (PINNs), the network represents a continuous physical field and its input derivatives define the training objective, making batch-dependent normalization problematic because it can introduce non-local dependencies into the predicted field and its derivatives. We propose StableGrad, an optimizer-level scale-control mechanism that corrects layer-wise weight-gradient imbalances without modifying the forward model. Because the normalization is applied only after backpropagation and before the optimizer update, the network output, its derivatives, and the physical residual remain unchanged. We analyze the effective training dynamics induced by this rescaling and evaluate StableGrad on deep PINNs as the target application, with BatchNorm-free convolutional networks serving as a diagnostic stress test. On PINN benchmarks, StableGrad improves matched-depth solution accuracy and makes deeper models more reliable under standard optimization. On ResNet and EfficientNet architectures, where removing Batch Normalization normally leads to training collapse, StableGrad stabilizes optimization without introducing any other architectural change. These results show that optimizer-level control of weight-gradient scale can provide a practical alternative when forward normalization is unavailable or undesirable.

URL PDF HTML ☆

赞 0 踩 0

2605.19855 2026-05-20 cs.CV cs.AI 版本更新

A Framework for Evaluating Zero-Shot Image Generation in Concept-based Explainability

基于概念的可解释性人工智能的零样本图像生成评估框架

Giacomo Astolfi, Matteo Bianchi, Riccardo Campi, Antonio De Santis, Marco Brambilla

发表机构 * Politecnico di Milano, DEIB（米兰理工大学，DEIB）

AI总结本文提出了一种基于概念的可解释性人工智能的零样本图像生成评估框架，通过生成合成概念数据集来评估概念基于的XAI方法，探讨了零样本文本到图像生成模型在模型分析中的挑战和开放性问题。

Comments G. Astolfi, M. Bianchi, and R. Campi contributed equally

详情

AI中文摘要

基于概念的可解释性人工智能（XAI）通过将内部表示与类别预测联系起来，利用人类可理解的视觉特征（如纹理或物体部分）来解释深度学习模型，从而弥合低级图像数据与高级语义之间的差距。然而，一个主要挑战是依赖大量标记图像来表示每个概念，这限制了可扩展性。在本工作中，我们研究了使用零样本文本到图像（T2I）生成模型作为合成概念数据集的来源，用于概念基于的XAI方法。具体而言，我们通过预定义提示生成概念，并通过四种互补分析评估其对真实概念的忠实性：（1）通过概念表示相似性比较合成与真实概念图像；（2）通过比较相同概念的子集对进行评估，子集大小逐步增加；（3）通过相关类别图像评估其在下游解释任务中的性能；（4）评估在移除测试类别图像中的概念对生成概念的解释影响。尽管当前T2I生成模型承诺为概念基于的XAI提供捷径，但我们的研究突显了挑战并提出了关于使用零样本管道生成的合成数据在模型分析中的使用问题。生成的数据集可在https://github.com/DataSciencePolimi/ZeroShot-T2I-Concepts获取。

英文摘要

Concept-based Explainable Artificial Intelligence (XAI) interprets deep learning models using human-understandable visual features (e.g., textures or object parts) by linking internal representations to class predictions, thereby bridging the gap between low-level image data and high-level semantics. A major challenge, however, is the reliance on large sets of labeled images to represent each concept, which limits scalability. In this work, we investigate the use of zero-shot Text-to-Image (T2I) generative models as a source of synthetic concept datasets for concept-based XAI methods. Specifically, we generate concepts using predefined prompts and evaluate their faithfulness to real ones through four complementary analyses: (1) comparing synthetic vs. real concept images via concept representation similarity; (2) evaluating their intra-similarity by comparing pairs of subsets of the same concept with progressively increasing size; (3) evaluating their performance for downstream explanation tasks using relevant class images; (4) evaluating how removing a concept from tested class images affects explanations of generated concepts. While current T2I generative models promise a shortcut to concept-based XAI, our study highlights challenges and raises open questions about the use of synthetic data generated by zero-shot pipelines in model analyses. The resulting dataset is available at https://github.com/DataSciencePolimi/ZeroShot-T2I-Concepts.

URL PDF HTML ☆

赞 0 踩 0

2605.19837 2026-05-20 cs.CV cs.AI cs.CL cs.RO 版本更新

从提示到路面通过时间：代理场景到计划推理中的时间定位

Ahmed Y. Gado, Omar Y. Goba, Alaa Hassanein, Catherine M. Elias, Ahmed Hussein

发表机构 * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt（德国亚历山大大学（GUC）计算机科学与工程系，埃及）； C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt（认知驾驶系统实验室，埃及开罗，C-DRiVeS）； M.Eng. Robotics Candidate at Deggendorf Institute of Technology, Germany（德国德格多夫技术学院机器人硕士候选人）； IAV GmbH, Berlin, Germany（德国柏林IAV GmbH公司）

AI总结本研究探讨了在代理间通信中引入时间条件是否能保持或增强推理的一致性，而不会降低语义或逻辑一致性，并通过BDD-X数据集的curated子集评估了三种具有递增时间整合的规划器架构。结果表明，时间条件改变了推理风格，但并未在标准NLP正确性指标上产生统计显著改进，但定性分析揭示了预测危险推理、稳定纠正行为和战略分歧。

详情

AI中文摘要

近期尝试通过大型语言模型（LLMs）和大型多模态模型（LMMs）的集合来支持自动驾驶（AVs）中的高级场景解释和规划，仍然将时间视为次要属性。这种缺乏时间定位导致在连续动作推理中出现不一致，影响安全性和可解释性。本文探讨时间条件在代理间通信中是否能保持或增强一致性而不引入语义或逻辑一致性下降。为此，我们引入了三种具有递增时间整合的规划器架构，并在BDD-X数据集的curated子集上评估它们，使用语义、语法和逻辑指标。结果表明，虽然时间条件改变了推理风格，但并未在标准NLP基于的正确性指标上产生统计显著改进。然而，定性分析揭示了预测危险推理、稳定纠正行为和战略分歧。这些发现澄清了基于提示的时间定位的局限性，并建立了时间场景到计划推理的第一个经验基准。

英文摘要

Recent attempts to support high-level scene interpretation and planning in Autonomous Vehicles (AVs) using ensembles of Large Language Models (LLMs) and Large Multimodal Models (LMMs) continue to treat time as a secondary property. This lack of temporal grounding leads to inconsistencies in reasoning about continuous actions, undermining both safety and interpretability. This work explores whether temporal conditioning within inter-agent communication can preserve or enhance coherence without introducing degradation in semantic or logical consistency. To investigate this, we introduce three planner architectures with progressively increasing temporal integration and evaluate them on curated subsets of the BDD-X dataset using semantic, syntactic, and logical metrics. Results show that while temporal conditioning reshapes reasoning style, it yields no statistically significant improvements in standard NLP-based correctness metrics. However, qualitative analysis reveals predictive hazard reasoning, stable corrective behavior, and strategic divergence in the Sentinel. These findings clarify the limits of prompt-based temporal grounding and establish the first empirical benchmark for temporal scene-to-plan reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.19823 2026-05-20 cs.LG cs.AI math.AP math.DS stat.ML 版本更新

Smooth Piecewise Cutting for Neural Operator to Handle Discontinuities and Sharp Transitions

通过平滑分段处理神经算子以应对不连续性和尖锐过渡

Ha Dang, Sebastian Schmidt, Juergen Hesser

发表机构 * Mannheim Institute for Intelligent Systems in Medicine, Heidelberg University（海德堡大学曼海姆智能医学研究所）； Interdisciplinary Center for Scientific Computing (IWR), Heidelberg University（海德堡大学跨学科科学计算中心）； Heidelberg Institute for Theoretical Studies (HITS), Heidelberg University（海德堡大学理论研究 institute）； Central Institute for Computer Engineering (ZITI), Heidelberg University（海德堡大学计算机工程中心）； CZS Heidelberg Initiative for Model-Based AI (MBAI), Heidelberg University（海德堡模型驱动人工智能倡议）

AI总结本文提出Cut-DeepONet，一种两阶段训练框架，通过将不连续性建模为更高维空间中的边界，减少学习复杂性，从而在处理偏微分方程的解算子时更有效地捕捉不连续性和尖锐过渡。

详情

AI中文摘要

神经算子在学习偏微分方程（PDEs）的解算子方面取得了强劲表现，但其本质上连续的表示在捕捉不连续性和尖锐过渡时存在困难。现有方法通常在连续函数空间内近似这些特征，往往需要增加模型容量和高分辨率数据。在本文中，我们提出Cut-DeepONet，一种两阶段训练框架，通过提升策略将问题重新表述，将域划分成平滑子区域，同时在更高维空间中将不连续性表示为边界。这种分离使算子学习任务与神经网络的归纳偏置对齐，并避免直接近似不连续性。一个额外的网络预测输入依赖的不连续性位置，然后用于指导神经算子在每个区域内生成平滑组件。在基准PDEs上的实验表明，Cut-DeepONet在低分辨率数据集上训练时也优于最先进的方法。该方法在存在不连续性和尖锐过渡的问题上表现优异，同时使用更少的可训练参数。我们的结果突显了改变算子学习的表示而非增加模型复杂性的优势。

英文摘要

Neural operators have achieved strong performance in learning solution operators of partial differential equations (PDEs), but their inherently continuous representations struggle to capture discontinuities and sharp transitions. Existing approaches typically approximate such features within continuous function spaces, often requiring increased model capacity and high-resolution data. In this work, we propose Cut-DeepONet, a two-stage training framework that explicitly models discontinuities while reducing learning complexity. Our approach reformulates the problem via a lifting strategy, partitioning the domain into smooth subregions while representing discontinuities as boundaries in a higher-dimensional space. This separation aligns the operator learning task with the inductive bias of neural networks and avoids directly approximating discontinuities. An additional network predicts input-dependent discontinuity locations for unseen inputs, which are then used to guide the neural operator in generating smooth components within each region. Experiments on benchmark PDEs show that Cut-DeepONet outperforms state-of-the-art methods, even when trained on low-resolution datasets. The method excels on problems with discontinuities and sharp transitions, while using fewer trainable parameters. Our results highlight the benefits of changing the representation of operator learning rather than increasing model complexity.

URL PDF HTML ☆

赞 0 踩 0

2605.19822 2026-05-20 cs.LG cs.AI 版本更新

ST-TGExplainer: Disentangling Stability and Transition Patterns for Temporal GNN Interpretability

ST-TGExplainer: 解构稳定性与转换模式以提升时序GNN可解释性

Hongjiang Chen, Xin Zheng, Pengfei Jiao, Huan Liu, Zhidong Zhao, Huaming Wu, Feng Xia, Shirui Pan

发表机构 * Hangzhou Dianzi University（杭州电子科技大学）； RMIT University（皇家墨尔本理工大学）； Tianjin University（天津大学）； Griffith University（格里菲斯大学）

AI总结本文提出ST-TGExplainer，一种能够解构时序图中稳定性与转换模式的自解释时序GNN，以提升模型的可解释性。

详情

AI中文摘要

时序图神经网络（TGNNs）在解决现实中的时序图任务中取得了显著进展。然而，其可解释性仍然有限，因为大多数TGNNs无法识别哪些历史交互最影响给定预测。尽管在可解释性TGNNs上取得了令人鼓舞的进展，现有方法主要关注之前已见过的历史交互，我们称之为稳定性模式，而忽略了新出现的一次性交互，我们称之为转换模式。这两种模式对于忠实的时序解释都是必不可少的。为了解决这一限制，我们提出了ST-TGExplainer，一种自解释的TGNN，旨在解构时序图中的稳定性与转换模式，以获得更忠实的时序GNN解释器。受解构信息瓶颈目标的指导，ST-TGExplainer学习了一个紧凑的解释子图，该子图在预测事件标签时保持预测性，同时显式地抑制稳定性与转换模式之间的标签条件冗余。广泛的实验表明，ST-TGExplainer在预测性能上表现出色，并产生了更忠实的解释。代码可在https://github.com/hjchen-hdu/ST-TGExplainer上获取。

英文摘要

Temporal graph neural networks (TGNNs) have gained significant traction for solving real-world temporal graph tasks. However, their interpretability remains limited, as most TGNNs fail to identify which historical interactions most influence a given prediction. Despite promising progress on interpretable TGNNs, existing methods predominantly focus on previously seen historical interactions, which we term stability patterns, while overlooking newly emerging first-time interactions, which we term transition patterns. Both types of patterns are essential for faithful temporal explanations. To address this limitation, we propose ST-TGExplainer, a self-explainable TGNN that disentangles Stability and Transition patterns in temporal graphs for a more faithful Temporal GNN Explainer. Guided by a disentangled information bottleneck objective, ST-TGExplainer learns a compact explanatory subgraph that remains predictive of the event label while explicitly suppressing label-conditioned redundancy between stability and transition patterns. Extensive experiments demonstrate that ST-TGExplainer achieves strong predictive performance and yields more faithful explanations. Code is available at https://github.com/hjchen-hdu/ST-TGExplainer.

URL PDF HTML ☆

赞 0 踩 0

2605.19815 2026-05-20 cs.CL cs.AI 版本更新

LP-Eval: Rubric and Dataset for Measuring the Quality of Legal Proposition Generation

LP-Eval: 用于衡量法律命题生成质量的评估标准和数据集

Shanshan Xu, Johan Lindholm, Amogh Raina, Henrik Palmer Olsen, Daniel Hershcovich

发表机构 * University of Copenhagen（哥本哈根大学）； Umeå University（乌梅拉大学）

AI总结本文提出LP-Eval，一种与法律专家共同设计的三步评估标准，用于评估法律命题的质量，通过专家标注的100个LLM生成的法律命题数据集，展示了LLM生成的命题质量较高，但专家评估发现基于经典案例的命题质量更高，同时发现基于评估标准的LLM判断更接近专家评估，但缺乏对细粒度区别的敏感性。

详情

AI中文摘要

法律命题生成在法律推理和教义学研究中至关重要，但在法律NLP中仍缺乏充分研究。本文研究了使用大型语言模型（LLMs）从欧洲法院司法判决中自动生成和评估法律命题。我们引入了LP-Eval，一种与法律专家共同设计的三步评估标准，将法律命题质量分解为形式有效性和实质维度。使用此标准，我们发布了两个专家对100个LLM生成法律命题的注释数据集。我们的结果表明，LLMs能够生成主要形式正确且高质量的命题，而专家评估显示基于经典案例的命题质量高于基于近期案例的命题。我们进一步检验LLMs作为评估者，发现基于评估标准的LLM判断更接近专家评估，但对人类专家捕捉到的细粒度区别不够敏感。

英文摘要

Legal proposition generation is central to legal reasoning and doctrinal scholarship, yet remain under-examined in Legal NLP. This paper investigates the automatic generation and evaluation of legal propositions from decisions of the Court of Justice of the European Union using large language models (LLMs). We introduce LP-Eval, a three-step evaluation rubric co-designed with legal experts that decomposes legal proposition quality into formal validity and substantive dimensions. Using this rubric, we release a dataset of two experts' annotations for 100 LLM-generated legal propositions. Our results show that LLMs can generate predominantly well-formed and high-quality propositions, while expert evaluations reveal higher quality for propositions derived from well established cases than from recent ones. We further examine LLMs as evaluators and find that rubric-guided LLM judgments align more closely with expert assessments than direct overall scoring, but remain insensitive to finer-grained distinctions captured by human experts.

URL PDF HTML ☆

赞 0 踩 0

2605.19812 2026-05-20 cs.LG cs.AI stat.AP stat.ML 版本更新

FLUXtrapolation: A benchmark on extrapolating ecosystem fluxes

FLUXtrapolation：一个用于外推生态系统通量的基准测试

Anya Fries, Jacob A Nelson, Martin Jung, Markus Reichstein, Jonas Peters

发表机构 * Seminar for Statistics, ETH Zürich（统计研究所，苏黎世联邦理工学院）； Max Planck Institute for Biogeochemistry（生物地球化学研究所）

AI总结该研究提出FLUXtrapolation基准测试，旨在外推生态系统通量，通过分析分布偏移对通量上推的挑战，评估机器学习方法在分布偏移下的表现，以促进通量上推的科学目标。

详情

AI中文摘要

我们介绍了FLUXtrapolation，一个用于在外推生态系统通量时应对逐渐加剧的分布偏移的基准测试。生态系统通量是理解碳、水和能量循环的关键，但只能通过稀疏分布的测量塔直接测量。因此，生成全球通量估计需要在可用的全球协变量上训练模型，并在未观测区域进行预测，即上推。通量上推是一个具有挑战性的领域泛化问题，受气候、生态系统类型和环境条件之间协变量分布偏移的影响，以及条件偏移的影响：重要的驱动因素在全局尺度上未被观测。我们对这两种偏移在P_X和P_{Y|X}中的定量分析。FLUXtrapolation基于对通量上推的领域专业知识设计：它定义了基于时间、空间和温度的外推场景，并在未观测的领域、时间聚合和尾部误差上评估性能。在试点研究中，我们发现基线方法在中位小时RMSE下表现相似，但在提出的尾部聚焦和多尺度评估下则有所不同。因此，FLUXtrapolation为机器学习方法在分布偏移下的现实挑战提出了相关挑战；同时，该基准测试的进步将直接支持科学目标，即改进通量上推。

英文摘要

We introduce FLUXtrapolation, a benchmark for extrapolating ecosystem fluxes under progressively harder distribution shifts. Ecosystem fluxes are central to understanding the carbon, water, and energy cycles, yet they can only be measured directly at sparsely located measurement towers. Producing global flux estimates therefore requires training models on observed sites using globally available covariates and predicting in unobserved regions, that is, upscaling. Flux upscaling is a challenging domain generalization problem that is affected by a shift in covariate distribution across climates, ecosystem types, and environmental conditions, as well as by conditional shift: important drivers remain unobserved at global scale. We provide a quantitative analysis of both these shifts in $P_X$ and $P_{Y\mid X}$. FLUXtrapolation is designed based on domain expertise on flux upscaling: it defines temporal, spatial, and temperature-based extrapolation scenarios and evaluates performance across held-out domains, temporal aggregations, and tail errors. In a pilot study, we find that baselines perform similarly under median hourly RMSE, but separate under the proposed tail-focused and multi-scale evaluation. FLUXtrapolation therefore poses a realistic and thus relevant challenge for machine learning methods under distribution shift; at the same time, progress on this benchmark would directly support the scientific goal of improving flux upscaling.

URL PDF HTML ☆

赞 0 踩 0

2605.19804 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Stitched Value Model for Diffusion Alignment

用于扩散对齐的拼接价值模型

Hyojun Go, Hyungjin Chung, Prune Truong, Goutam Bhat, Li Mi, Zhaochong An, Zixiang Zhao, Dominik Narnhofer, Serge Belongie, Federico Tombari, Konrad Schindler

发表机构 * ETH Zurich（苏黎世联邦理工学院）； Google（谷歌）； University of Copenhagen（哥本哈根大学）

AI总结本文提出StitchVM，一种将预训练的干净图像奖励模型转移到噪声潜在空间的拼接框架，通过高效转移和微调，提升扩散对齐的效率和效果。

Comments Project page: https://gohyojun15.github.io/StitchVM/

详情

AI中文摘要

为了实际应用，基于扩散或流的生成模型必须与任务特定的奖励对齐，例如提示保真度或审美偏好。这种对齐具有挑战性，因为奖励是为干净的输出图像定义的，但对齐过程需要在噪声中间潜在空间中估计价值函数。现有方法倾向于Tweedie风格或蒙特卡洛近似，权衡估计器偏差与计算成本：Tweedie估计高效但有偏差，而蒙特卡洛估计更准确但需要昂贵的回放。一个自然的替代方法是学习的价值函数，但如何有效训练一个强大的、通用的价值模型专门用于噪声潜在空间仍然是一个开放问题。本文提出了StitchVM，一种模型拼接框架，该框架高效地将预训练用于干净图像的奖励模型转移到噪声潜在空间。StitchVM从一个现有的、截断的像素空间奖励模型开始，并将其冻结的扩散骨干作为其头部。从像素空间模型中，所得到的混合模型保留了精心预训练、稳健的奖励能力；从扩散骨干中，它继承了其处理噪声潜在空间的原生能力。拼接过程异常轻量，例如拼接和微调CLIP ViT-L和SD 3.5 Medium仅需10个GPU小时。通过将强大的像素空间奖励模型提升到潜在空间，StitchVM打开了一种新的扩散对齐风格：而不是对价值函数的粗糙但昂贵的每样本近似，正确的函数对于实际的噪声潜在空间一次构建，然后在许多样本和迭代中进行抵消。我们显示，这种方法在广泛下游引导和后训练方法中带来了改进：DPS变得比原来快3.2倍，同时将峰值GPU内存减半，DiffusionNFT变得比原来快2.3倍。

英文摘要

For practical use, diffusion- or flow-based generative models must be aligned with task-specific rewards, such as prompt fidelity or aesthetic preference. That alignment is challenging because the reward is defined for clean output images, but the alignment procedure requires value function estimates at noisy intermediate latents. Existing methods resort to Tweedie-style or Monte Carlo approximations, trading off estimator bias against computational cost: Tweedie estimates are efficient but biased, while Monte Carlo estimates are more accurate but require expensive rollouts. A natural alternative would be a learned value function, but it remains an open question how to effectively train a strong and general value model specifically for noisy latents. Here, we propose StitchVM, a model stitching framework that efficiently transfers reward models pretrained for clean images to the noisy latent regime. StitchVM starts from an existing, truncated pixel-space reward model and attaches a frozen diffusion backbone to it as its head. From the pixel-space model, the resulting hybrid retains a carefully pretrained, robust reward capability; from the diffusion backbone, it inherits its native ability to handle noisy latents. The stitching procedure is exceptionally lightweight, e.g., stitching and finetuning CLIP ViT-L and SD 3.5 Medium takes only 10 GPU-hours. By lifting powerful pixel-space reward models to latent space, StitchVM opens up a new style of diffusion alignment: instead of rough, yet costly per-sample approximation of the value function, the correct function for the actual, noisy latents is constructed once and then amortized over many samples and iterations. We show that this approach yields improvements across a broad range of downstream steering and post-training methods: DPS becomes $3.2\times$ faster while halving peak GPU memory, and DiffusionNFT becomes $2.3\times$ faster.

URL PDF HTML ☆

赞 0 踩 0

2605.19799 2026-05-20 cs.CV cs.AI 版本更新

Synergistic Foundation Models for Semi-Supervised Fetal Cardiac Ultrasound Analysis: SAM-Med2D Boundary Refinement and DINOv3 Semantic Enhancement

协同基础模型用于半监督胎儿心脏超声分析：SAM-Med2D边界细化与DINOv3语义增强

Tonghao Zhuang, Shanglong Hu, Yongsheng Luo, Zhiqi Zhang, Yu Li

发表机构 * Zhuhai College of Science and Technology（珠海科技学院）

AI总结本文提出了一种半监督框架，用于胎儿心脏超声图像的联合分割和分类，结合SAM-Med2D进行边界细化和DINOv3进行语义增强，有效提升了胎儿先天性心脏病筛查的性能。

Comments Accepted to the ISBI 2026 Fetal HearT UltraSound Segmentation and Diagnosis (FETUS) Challenge

2605.19794 2026-05-20 cs.HC cs.AI cs.DB 版本更新

AffectAI-Capture: A Reproducible Multimodal Protocol for Small-Group Meeting Research

AffectAI-Capture：一种可重复的多模态协议用于小型小组会议研究

Meisam Jamshidi Seikavandi, Alice Modica, Anna Obara, Fabricio Batista Narcizo, Tanya Ignatenko, Ted Vucurevich, Jesper Bünsow Boldt, Paolo Burelli, Andrew Burke Dittberner

发表机构 * GN Advanced Science, GN Group, Ballerup, Denmark（GN先进科学，GN集团，丹麦Ballerup）； IT University of Copenhagen, brAIn lab, Copenhagen, Denmark（哥本哈根IT大学，brAIn实验室，丹麦哥本哈根）； Copenhagen Business School, Copenhagen, Denmark（哥本哈根商学院，丹麦哥本哈根）； Aalborg University, Denmark（奥尔堡大学，丹麦）

AI总结本文提出了一种可重复的多模态协议AffectAI-Capture，用于收集四人会议类互动的同步多模态数据，结合眼动追踪、可穿戴生理、近距离和房间音频、多视角视频、事件日志和结构化自我报告。通过固定任务块和已建立的小组互动范式，结合权威事件时间线和标准化输出进行数据采集和后期处理。本文贡献在于建立了可重复的协议架构，将任务设计、仪器化、时间溯源和数据封装连接起来，用于情绪、行为和会议分析研究。

2605.19782 2026-05-20 cs.AI cs.LG cs.SE 版本更新

Prior Knowledge or Search? A Study of LLM Agents in Hardware-Aware Code Optimization

先验知识还是搜索？LLM代理在硬件感知代码优化中的研究

Dmitry Redko, Albert Fazlyev, Konstantin Sozykin, Maria Ivanova, Evgeny Burnaev, Egor Shvetsov

发表机构 * Applied AI Institute（应用人工智能研究所）； ITMO University（ITMO大学）； AI Talent Hub（AI人才中心）

AI总结该研究探讨了在硬件感知代码优化中，LLM代理是依赖于先验知识还是搜索过程，通过三个受控实验发现LLM在纯黑盒优化中表现为贪婪优化器，在零样本内核生成中输入大小信息无明显影响，而在反馈循环内核优化中CUDA单调改进而TVM IR主动退化，表明LLM在代码优化任务中高度依赖预训练先验而非反馈或代理结构。

详情

AI中文摘要

LLM发现和优化系统在各个领域中被越来越多地应用，实现了一个常见的提出-评估-修订循环。此类优化或发现过程通过上下文条件在接收到环境反馈后进行。然而，随着现代LLM代理在结构上日益复杂，难以评估哪些组件贡献最大，以及何时以及如何探索可能失败。我们通过三个受控实验回答这些问题。我们的发现：(1) 在纯黑盒优化中，LLM表现为贪婪优化器。(2) 在零样本内核生成中，提供显式输入大小信息没有可测量的影响，模型无论大小或温度都会收敛到相同的内核参数，仿佛大小指令是不可见的。此外，当被要求为不常见的内核大小进行内核优化时，性能会急剧下降，无论使用的语言如何。(3) 在反馈循环内核优化中，CUDA在迭代反馈下单调改进，而TVM IR则主动退化，这表明当模型以低密度语言操作时，内核优化会退化。我们的结果得出结论：在代码优化任务中，LLM高度依赖于预训练的先验而非提供的反馈或代理结构。

英文摘要

LLM discovery and optimization systems are increasingly applied across domains, implementing a common propose-evaluate-revise loop. Such optimization or discovery progresses via context conditioning on received feedback from an environment. However, as modern LLM agents are increasingly complex in their structure, it is difficult to evaluate which components contribute the most, and when and how this exploration may fail. We answer these questions through three controlled experiments. Our findings: (1) In pure black-box optimization, LLMs act as greedy optimizers. (2) In zero-shot kernel generation, providing explicit input-size information has no measurable effect, models converge to the same kernel parameters regardless of size or temperature, as though the size instruction were invisible. Moreover, when tasked to perform kernel optimization for uncommon kernel sizes, performance sharply degrades regardless of the language used. (3) In feedback-loop kernel optimization, CUDA improves monotonically under iterative feedback, while TVM IR actively degrades, which demonstrates that kernel optimization degrades when models operate with low-density language. Our results conclude that LLMs in code optimization tasks highly depend on pretrained priors rather than provided feedback or agentic structure.

URL PDF HTML ☆

赞 0 踩 0

2605.19781 2026-05-20 cs.AI 版本更新

From SGD to Muon: Adaptive Optimization via Schatten-p Norms

从SGD到穆恩：通过Schatten-p范数实现自适应优化

Thomas Massena, Corentin Friedrich, Mathieu Serrurier

发表机构 * IRIT ； IRT Saint Exupery（IRT圣埃克苏佩里）

AI总结本文提出了一种基于Schatten-p范数的自适应优化方法，通过动态选择代理最优的更新LMO几何结构，实现了从SGD到Muon的优化策略转换，并在不同训练场景中验证了其有效性。

详情

AI中文摘要

现代优化器，如Muon，对其更新施加了矩阵级几何约束。这些矩阵级约束可以统一在线性最小化Oracle（LMO）理论下。然而，所有当前方法都对更新规则施加固定的LMO几何结构，这些结构是根据设计或经验选择的，不一定符合问题的几何特性。我们引入了一种新颖且高效的数据驱动标准，用于动态选择单个深度神经网络层的代理最优更新LMO几何结构。该标准通过使用单步随机特征回归替代模型，从梯度和激活统计信息中推导出闭合形式，从而在SGD到Muon的更新之间进行插值。此外，通过整合参数级预条件化，我们的框架能够恢复SGD、Muon、Adam和MuAdam作为特定极值。为了使这种自适应方法可扩展，我们将其与高效的计算策略相结合，仅在高度优化的基线模型上带来约3%的运行时间开销。作为概念验证，我们证明这种数据驱动的优化器在三个不同的训练场景中优于或至少与Muon和AdamW中表现最好的优化器相竞争。最终，这项工作提供了证据，证明LMO几何可以成功且高效地从运行时数据进行适应，为超越静态几何的优化器设计开辟了新的途径。

英文摘要

Modern optimizers, like Muon, impose matrix-wise geometry constraints on their updates. These matrix-wise constraints can be unified under Linear Minimization Oracle (LMO) theory. However, all current methods impose fixed LMO geometries for the update rules, chosen by-design or empirically, which are not necessarily optimal according to the problem's geometry. We introduce a novel efficient datadriven criterion for dynamically choosing proxy-optimal update LMO geometries on individual Deep Neural Network layers. Derived in closed form from gradient and activation statistics using a single-step random feature regression surrogate model, our criterion navigates a design space interpolating from SGD to Muon updates. Moreover, integrating parameter-wise preconditioning allows our framework to recover SGD, Muon, Adam, and MuAdam as specific extrema. To make this adaptive approach scalable, we pair it with efficient computational strategies, achieving only a $\sim$ 3% runtime overhead on highly optimized baselines. As a proof of concept, we show that this data-driven optimizer beats or remains competitive with the performance of the best performing optimizer between Muon and AdamW across three different training scenarios. Ultimately, this work provides evidence that LMO geometry can be successfully and efficiently adapted from runtime data, opening a new pathway for optimizer design beyond static geometries.

URL PDF HTML ☆

赞 0 踩 0

2605.19779 2026-05-20 cs.AI cs.LG 版本更新

Distribution-Free Uncertainty Quantification for Continuous AI Agent Evaluation

无分布不确定性量化用于连续AI代理评估

Yuxuan Gao, Megan Wang, Yi Ling Yu

发表机构 * University of Pennsylvania（宾夕法尼亚大学）； Columbia University（哥伦比亚大学）

AI总结本文提出了一种无分布的不确定性量化方法，用于连续AI代理评估，通过适应性符合推断（ACI）提供预测质量分数的覆盖保证，并开发了多代理管道的组合不确定性界限、成对排名的符合回避规则以及领奖台规模多重检验的FDR校正回避方法。

Comments 6 pages, 7 figures, 2 tables. Accepted at the ICML 2026 Workshop on Agentic Uncertainty Quantification (AgenticUQ) - Poster

详情

AI中文摘要

我们适应了分割符合预测和适应性符合推断（ACI）用于连续AI代理评估，提供预测质量分数的无分布覆盖保证。符合区间在24小时范围内所有名义水平上实现了校准误差低于0.02，而ACI在代理发布后正确扩大了区间35%然后重新收敛。我们进一步开发了多代理管道的组合不确定性界限（通过模拟验证了不同阶段相关性rho在[-0.5, 0.9]范围内），一种用于成对排名的符合回避规则（具有受控的假排名率），以及领奖台规模多重检验的FDR校正回避方法。通过18个实时信号每小时收集的数据评估50个代理，我们显示每个代理的条件覆盖集中在名义水平（均值80.4%，90%的代理在[72%, 90%]范围内），并且跨源情感分歧预测排名不稳定性（r=0.64，p<0.01）。一个循环控制的验证确认了框架能够捕捉超过基准的信号（rho_s=0.52，p<0.01，n=35）。代码和数据在CC BY 4.0下发布。

英文摘要

We adapt split conformal prediction and adaptive conformal inference (ACI) to continuous AI agent evaluation, providing distribution-free coverage guarantees for forecasted quality scores. Conformal intervals achieve calibration error below 0.02 across all nominal levels at the 24h horizon, while ACI correctly widens intervals by 35% following agent releases then reconverges. We further develop compositional uncertainty bounds for multi-agent pipelines (validated via simulation across inter-stage correlations rho in [-0.5, 0.9]), a conformal abstention rule for pairwise rankings with controlled false-ranking rate, and FDR-corrected abstention for leaderboard-scale multiple testing. Evaluating 50 agents via 18 real-time signals collected hourly, we show that per-agent conditional coverage is well-concentrated around the nominal level (mean 80.4%, 90% of agents within [72%, 90%]), and that cross-source sentiment divergence predicts ranking instability (r=0.64, p<0.01). A circularity-controlled validation confirms the framework captures signal beyond benchmarks (rho_s=0.52, p<0.01, n=35). Code and data are released under CC BY 4.0.

URL PDF HTML ☆

赞 0 踩 0

2605.18618 2026-05-20 cs.LG cs.AI 版本更新

Stochastic Penalty-Barrier Methods for Constrained Machine Learning

随机罚函数-障碍方法用于约束机器学习

Adam Bosák, Andrii Kliachkin, Jana Lepšová, Gilles Bareilles, Jakub Mareček

发表机构 * Artificial Intelligence Center, CTU in Prague（布拉格CTU人工智能中心）； CMAP, École Polytechnique, Palaiseau, France（法国巴黎高等理工学院帕莱索校区CMAP）

AI总结本文提出了一种随机罚函数-障碍方法（SPBM），用于解决深度学习中非凸、非光滑、随机环境下的约束优化问题，该方法通过指数对偶平均、稳定罚函数调度和Moreau包络来处理非光滑性，并在多个设置中验证了其性能。

2605.18565 2026-05-20 cs.CL cs.AI 版本更新

MINTEval: Evaluating Memory under Multi-Target Interference in Long-Horizon Agent Systems

MINTEval: 评估长时间跨度智能体系统中的多目标干扰下的记忆

Hyunji Lee, Justin Chih-Yao Chen, Joykirat Singh, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

发表机构 * UNC Chapel Hill（北卡罗来纳大学教堂山分校）； The University of Texas at Austin（德克萨斯大学奥斯汀分校）

AI总结本文提出MINTEval基准，用于评估智能体在长时间跨度和多目标干扰下的记忆表现，通过长连接上下文、多领域和多类型问题来测试记忆增强代理的鲁棒性和泛化能力。

Comments Equal contribution; order decided by a coin flip. Code and data: https://github.com/amy-hyunji/MINTEval

详情

AI中文摘要

现实中的智能体在长时间和不断演变的范围内运作，其中信息被不断更新并可能在记忆之间产生干扰，需要准确的回忆和对多份信息的聚合推理。然而，现有的基准主要关注静态、独立的回忆，无法捕捉这些动态的演变记忆之间的相互作用。在本文中，我们研究了当前的记忆增强代理在多样领域和问题类型中的长时间跨度、高干扰设置中的表现。我们引入MINTEval（长时间跨度记忆在干扰下的评估），该基准具有（1）长且高度互联的上下文，包含频繁更新的信息，从而产生显著的干扰；（2）多领域（状态跟踪、多轮对话、维基百科修订和GitHub提交），使能够评估领域泛化能力；（3）多类型问题，评估对干扰的鲁棒性，包括（i）单目标回忆任务，要求从长上下文中检索特定目标，以及（ii）多目标聚合任务，要求对多个相关信息片段进行推理。总体而言，MINTEval包含15.6k个问答对，覆盖平均138.8k个token的长时间跨度上下文，每个实例可扩展至1.8M个token。我们评估了7个代表性系统，包括 vanilla 长上下文 LLMs、RAG 和记忆增强代理框架。在所有系统中，我们观察到一致的低性能（平均27.9%准确率），尤其是在需要对多份证据进行聚合推理的问题上。我们的分析表明，性能主要受限于检索和记忆构建。此外，当前的记忆系统在面对被后续上下文修改或干扰的早期事实时，难以回忆和推理，准确性随着中间更新数量的增加而下降。

英文摘要

Real-world agents operate over long and evolving horizons, where information is repeatedly updated and may interfere across memories, requiring accurate recall and aggregated reasoning over multiple pieces of information. However, existing benchmarks focus on static, independent recall and fail to capture these dynamic interactions between evolving memories. In this paper, we study how current memory-augmented agents perform in realistic, interference-heavy, long-horizon settings across diverse domains and question types. We introduce MINTEval (Long-Horizon Memory under INTerference Evaluation), a benchmark featuring (1) long, highly interconnected contexts with frequently updated information that induces substantial interference, (2) diverse domains (state tracking, multi-turn dialogue, Wikipedia revisions, and GitHub commits), enabling evaluation of domain generalization, and (3) diverse question types that assess robustness to interference, including (i) single-target recall tasks requiring retrieval of a specific target from long contexts, and (ii) multi-target aggregation tasks requiring reasoning over multiple relevant pieces of information. Overall, MINTEval has 15.6k question-answering pairs over long-horizon contexts averaging 138.8k tokens and extending up to 1.8M tokens per instance. We evaluate 7 representative systems, including vanilla long-context LLMs, RAG, and memory-augmented agent frameworks. Across all systems, we observe consistently low performance (avg. 27.9% accuracy), especially on questions requiring aggregated reasoning over multiple pieces of evidence. Our analysis shows that performance is primarily limited by retrieval and memory construction. Furthermore, current memory systems struggle to recall and reason over earlier facts that are revised or interfered with by subsequent context, with accuracy degrading as the number of intervening updates increases.

URL PDF HTML ☆

赞 0 踩 0

2605.16692 2026-05-20 cs.LG cs.AI cs.RO 版本更新

EfficientTDMPC: Improved MPC Objectives for Sample-Efficient Continuous Control

EfficientTDMPC: 改进的MPC目标以实现高效的连续控制

Thomas Evers, Cristian Meo, Wendelin Bohmer, Justin Dauwels, Yaniv Oren

发表机构 * TU Delft（代尔夫特理工大学）； LatentWorlds AI

AI总结本文提出EfficientTDMPC，一种基于模型的强化学习方法，用于连续控制，通过减少误差和增加数据新鲜度来提高样本效率。

详情

AI中文摘要

我们介绍了EfficientTDMPC，一种用于连续控制的样本高效模型基于强化学习方法，基于TD-MPC算法家族。该家族的核心是一个规划器，旨在找到最大化估计回报的行动序列。回报通过学习的模型和价值网络进行估计，每个都可以引入误差。EfficientTDMPC通过两种方式减少这种误差。首先，它引入了动态模型的集成，并在这些模型和不同的展开深度之间平均回报估计。其次，它增加了应用不确定性惩罚到规划器目标的选项，从而得到一个避免不确定回报估计的规划器。然后，它增加了实用改进，提高缓冲数据的新鲜度并减少计算。最后，我们发现我们的贡献使EfficientTDMPC能够更受益于更高的更新到数据（UTD）比率，进一步提高样本效率。据我们所知，在每个基准的低数据情况下，EfficientTDMPC在HumanoidBench-Hard和DMC hard上实现了最先进的样本效率，而在DMC easy上则匹配了最先进的性能。

英文摘要

We introduce EfficientTDMPC, a sample-efficient model-based reinforcement learning method for continuous control built on the TD-MPC family of algorithms. Central to this family is a planner that aims to find an action sequence that maximizes the estimated return. The return is estimated using a learned model and value networks, each of which can introduce error. EfficientTDMPC proposes to reduce this error in two ways. First, it introduces an ensemble of dynamics models and averages the return estimates across those models and across different rollout depths. Second, it adds the option to apply an uncertainty penalty to the planner objective, yielding a planner that avoids actions with uncertain return estimates. It then adds practical improvements which increase buffer data freshness and reduce compute. Lastly, we find that our contributions enable EfficientTDMPC to benefit more from a higher update-to-data (UTD) ratio, further improving sample efficiency. To the best of our knowledge, in the low data regime of each benchmark, EfficientTDMPC achieves state-of-the-art (SOTA) in terms of sample efficiency on HumanoidBench-Hard and DMC hard, while matching SOTA on DMC easy.

URL PDF HTML ☆

赞 0 踩 0

2605.16630 2026-05-20 cs.CR cs.AI 版本更新

预训练目标在极低数据细粒度视觉分类中的影响：一个骨干网络控制研究

Alexander Hackett, Srikanth Thudumu, Ginny Fisher, Jason Fisher

发表机构 * Santa Clara University（圣克拉拉大学）； IAAIR

AI总结本文研究了在极低数据细粒度视觉分类中预训练目标对下游表示质量的影响，通过比较四种冻结的ViT-B/16编码器，得出了在数据稀缺时优先选择边界增强预训练目标的结论。

Comments Presented at the 13th Workshop on Fine-Grained Visual Categorization (FGVC13) at CVPR 2026

Journal ref 13th Workshop on Fine-Grained Visual Categorization (FGVC13), CVPR 2026

详情

AI中文摘要

极端低数据细粒度分类在专家领域中普遍存在，其中标注成本高昂，但从业者仍需要有原则的指导来选择预训练编码器。我们使用一个定制的数据集，包含三个类别的标注图像，研究了在匹配的骨干容量下，预训练目标如何影响下游表示质量。我们比较了四种冻结的ViT-B/16编码器，分别通过监督分类、对比学习（SigLIP2）、掩码重建（MAE）和自蒸馏（DINOv3）进行训练，并使用留一验证法通过线性和非线性探测器评估。为了控制低N情况下的统计噪声，我们使用排列检验（N=1000）在宏级一对多AUC上进行测试。监督和对比学习编码器在线性可分性方面表现最强（逻辑AUC：0.768和0.735；SVM AUC：0.739和0.697），而MAE在非线性探测器下表现更优（XGBoost AUC：0.713）。我们发现DINOv3在该领域整体表现较差。这些结果支持在极低数据细粒度视觉分类中的一种实用建议：当数据稀缺限制探测到线性决策规则时，优先选择边界增强预训练目标；当非线性分类器可行时，考虑使用重建式编码器。

英文摘要

Extreme low-data fine-grained classification is common in expert domains where labeling is expensive, yet practitioners still need principled guidance for selecting pretrained encoders. We study emerald inclusion grading with a custom dataset of labeled images across three classes and ask: under matched backbone capacity, how does pretraining objective affect downstream representation quality? We compare four frozen ViT-B/16 encoders trained with supervised classification, contrastive learning (SigLIP2), masked reconstruction (MAE), and self-distillation (DINOv3), and evaluate them with leave-one-out cross-validation using linear and nonlinear probes. To control statistical noise in the low-N regime, we use permutation testing (N=1000) on macro one-vs-rest AUC. Supervised and contrastive encoders provide the strongest linear separability (logistic AUC: 0.768 and 0.735; SVM AUC: 0.739 and 0.697), while MAE improves under nonlinear probes (XGBoost AUC: 0.713). We find that DINOv3 underperforms across probe families in this domain. These results support a practical recommendation for extreme low-data FGVC: prioritize margin-enforcing pretraining objectives when data scarcity restricts probing to linear decision rules, and consider reconstruction-style encoders when nonlinear classifiers are feasible given dataset constraints.

URL PDF HTML ☆

赞 0 踩 0

2605.15532 2026-05-20 cs.LG cs.AI cs.CL 版本更新

DeltaPrompts: Escaping the Zero-Delta Trap in Multimodal Distillation

DeltaPrompts: 逃离多模态蒸馏中的零delta陷阱

Jaehun Jung, Hyunwoo Kim, Brandon Cui, Ximing Lu, David Acuna, Prithviraj Ammanabrolu, Yejin Choi

发表机构 * NVIDIA Research（NVIDIA研究院）

AI总结本文提出DeltaPrompts，通过量化教师与学生之间的答案分歧（Δ）来生成高分歧的推理问题，从而解决传统蒸馏中因零delta提示导致的学习信号不足问题，实验表明DeltaPrompts在多个场景下显著提升了模型性能。

详情

AI中文摘要

蒸馏使紧凑的视觉-语言模型（VLMs）能够获得强大的推理能力，但驱动这一过程的提示通常通过简单的启发法或从现成数据集中聚合获得。我们揭示了这种方法中的关键低效性：标准图表/文档推理数据集中多达69%的提示实际上是零delta，意味着教师和学生已经诱导出完全相同的答案分布。在这些提示上训练提供极小的学习信号，导致学生性能在数据规模扩大时迅速饱和。为逃离零delta陷阱，我们回归基本原理：蒸馏本质上最小化了分布差异，因此只有暴露教师与学生之间功能性能力差距的提示才具有价值。我们通过答案分歧（Δ）量化这一差距，证明非零分歧对有效扩展至关重要。基于这一洞察，我们提出一个分阶段合成流程，利用现有数据集作为种子，主动针对学生失败模式生成更好的提示。结果是DeltaPrompts，一个包含20万 synthetic 高分歧推理问题的多样化数据集。我们评估DeltaPrompts在三个不同场景下的表现：在目标教师-学生对上的在线蒸馏、转移到新型模型家族而不重新生成数据、以及非推理模型的离线微调。在所有场景中，DeltaPrompts均带来显著收益，即使在高度优化的推理模型（如Qwen3-VL-8B-Thinking）上，也能在10个基准测试中平均获得高达15%的相对提升。

英文摘要

Distillation enables compact Vision-Language Models (VLMs) to obtain strong reasoning capabilities, yet the prompts driving this process are typically chosen via simple heuristics or aggregated from off-the-shelf datasets. We reveal a critical inefficiency in this approach: up to 69% of the prompts in standard chart / document reasoning datasets are effectively zero-delta, meaning the teacher and student already induce the exact same answer distribution. Training on these prompts provides minimal learning signal, causing student improvement to rapidly saturate regardless of data scale. To escape the zero-delta trap, we return to first principles: distillation fundamentally minimizes distributional divergence, and thus a prompt is valuable only if it exposes a functional capability gap between the teacher and student. We quantify this gap through answer divergence ($Δ$), demonstrating that non-zero divergence is critical for effective scaling. Building on this insight, we propose a staged synthesis pipeline that repurposes existing datasets as seeds, actively targeting student failure modes to produce better prompts. The result is DeltaPrompts, a diverse dataset of 200k synthetic, high-divergence reasoning problems. We evaluate DeltaPrompts across three distinct settings: on-policy distillation with the target teacher-student pair, transfer to a novel model family without regenerating the data, and off-policy fine-tuning of a non-reasoning model. Across all scenarios, DeltaPrompts drives substantial gains, yielding up to 15% relative improvement even on top of a highly-optimized reasoning model (e.g., Qwen3-VL-8B-Thinking) -- averaged over 10 benchmarks spanning chart, document and perception-centric reasoning.

URL PDF HTML ☆

赞 0 踩 0

2605.14048 2026-05-20 cs.AI cs.LG 版本更新

Network-Aware Bilinear Tokenization for Brain Functional Connectivity Representation Learning

面向网络的双线性分块化用于脑功能连接表示学习

Leo Milecki, Qingyu Hu, Bahram Jafrasteh, Mert R. Sabuncu, Qingyu Zhao

发表机构 * Department of Radiology, Weill Cornell Medicine, New York, NY, USA.（韦尔·科恩医学中心放射科, 纽约, NY, 美国）； School of Electrical and Computer Engineering, Cornell University and Cornell Tech, New York, NY, USA.（康奈尔大学电气与计算机工程学院及康奈尔科技, 纽约, NY, 美国）

AI总结本文提出了一种面向网络的双线性分块化方法，用于改进脑功能连接的表示学习，通过重新定义功能连接的分块方式，提升模型在跨群体评估中的稳定性和可迁移性。

Comments Author-submitted version, provisionally accepted at MICCAI 2026

详情

AI中文摘要

Masked autoencoders (MAEs) 近年来在静息状态脑功能连接（FC）的自监督表示学习中显示出潜力。然而，一个基本问题仍未解决：如何对FC矩阵进行分块以与大规模脑网络的内在模块化组织对齐？现有方法通常采用以区域为中心或图基的方案，将FC视为结构上均质的元素，并忽略了大规模脑网络的组织结构。我们引入NERVE（通过双线性分块化进行脑功能连接的网络感知表示学习），一种自监督学习框架，通过将FC矩阵划分为内网络和跨网络连接块来重新定义FC分块。与基于图像的MAE不同，由网络对定义的FC分块在大小上异质且对应不同的功能角色。为了解决这个问题，NERVE通过一种新的结构化双线性分解来嵌入FC分块。这种形式保留了网络身份，并将参数复杂度从网络数量的二次方减少到线性。我们评估了NERVE在三个大规模发展队列（ABCD、PNC和CCNP）中对行为和精神病理学的预测。与结构上不敏感的MAE变体和基于图的自监督基线相比，所提出的网络感知形式在跨队列评估中产生了更稳定和可迁移的表示。消融研究确认了所提出的双线性网络嵌入和解剖学基础的分区对于性能至关重要。这些发现突显了在功能连接组学中将领域特定的结构先验纳入自监督学习的重要性。代码可在：https://github.com/leomlck/NERVE。

英文摘要

Masked autoencoders (MAEs) have recently shown promise for self-supervised representation learning of resting-state brain functional connectivity (FC). However, a fundamental question remains unresolved: how should FC matrices be tokenized to align with the intrinsic modular organization of large-scale brain networks? Existing approaches typically adopt region-centric or graph-based schemes that treat FC as structurally homogeneous elements and overlook the large-scale network brain organization. We introduce NERVE (Network-Aware Representations of Brain Functional Connectivity via Bilinear Tokenization), a self-supervised learning framework that redefines FC tokenization by partitioning FC matrices into patches of intra- and inter-network connectivity blocks. Unlike image-based MAE, where fixed-size patches share a common tokenizer, FC patches defined by network pairs are heterogeneous in size and correspond to distinct functional roles. To resolve this problem, NERVE embeds FC patches through a novel structured bilinear factorization. This formulation preserves network identity and reduces parameter complexity from quadratic to linear scaling in the number of networks. We evaluate NERVE across three large-scale developmental cohorts (ABCD, PNC, and CCNP) for behavior and psychopathology prediction. Compared to structurally agnostic MAE variants and graph-based self-supervised baselines, the proposed network-aware formulation yields more stable and transferable representations, particularly in cross-cohort evaluation. Ablation studies confirm that the proposed bilinear network embedding and anatomically grounded parcellation are critical for performance. These findings highlight the importance of incorporating domain-specific structural priors into self-supervised learning for functional connectomics. Code is available at: https://github.com/leomlck/NERVE.

URL PDF HTML ☆

赞 0 踩 0

2605.14014 2026-05-20 cs.LG cs.AI 版本更新

Dywave: Event-Aligned Dynamic Tokenization for Heterogeneous IoT Sensing Signals

Dywave: 为异构物联网传感信号设计的事件对齐动态分词方法

Tomoyoshi Kimura, Denizhan Kara, Jinyang Li, Hongjue Zhao, Yigong Hu, Yizhuo Chen, Xiaomin Ouyang, Shengzhong Liu, Tarek Abdelzaher

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Hong Kong University of Science（香港科学大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出Dywave，一种用于异构物联网传感信号的动态分词框架，通过小波基层次分解构建紧凑的输入表示，以适应内在时间结构和底层物理事件，从而在活动识别、压力评估和附近物体检测等任务中提升准确率并提高计算效率。

详情

AI中文摘要

物联网系统持续收集来自无处不在传感器的异构传感信号，以支持智能应用，如人类活动分析、情绪监测和环境感知。这些信号本质上是非平稳和多尺度的，给标准分词技术带来了独特挑战。本文提出Dywave，一种为物联网传感信号设计的动态分词框架，该框架构建了与内在时间结构和底层物理事件对齐的紧凑输入表示。Dywave利用基于小波的层次分解，识别出对应底层语义事件的时间边界，并自适应地压缩冗余区间，同时保持时间一致性。在五个真实物联网传感数据集上进行的广泛评估表明，Dywave在活动识别、压力评估和附近物体检测等任务中，比最先进的方法在准确率上提高了高达12%，同时通过减少输入标记长度最多75%来提高计算效率。此外，Dywave在面对领域偏移和变化的序列长度时表现出更强的鲁棒性。

英文摘要

Internet of Things (IoT) systems continuously collect heterogeneous sensing signals from ubiquitous sensors to support intelligent applications such as human activity analysis, emotion monitoring, and environmental perception. These signals are inherently non-stationary and multi-scale, posing unique challenges for standard tokenization techniques. This paper proposes Dywave, a dynamic tokenization framework for IoT sensing signals that constructs compact input representations aligned with intrinsic temporal structures and underlying physical events. Dywave leverages wavelet-based hierarchical decomposition, identifies meaningful temporal boundaries corresponding to underlying semantic events, and adaptively compresses redundant intervals while preserving temporal coherence. Extensive evaluations on five real-world IoT sensing datasets across activity recognition, stress assessment, and nearby object detection demonstrate that Dywave outperforms state-of-the-art methods by up to 12% in accuracy, while improving computational efficiency by reducing input token lengths by up to 75% across mainstream sequence models. Moreover, Dywave exhibits improved robustness to domain shifts and varying sequence lengths.

URL PDF HTML ☆

赞 0 踩 0

2605.13318 2026-05-20 cs.AI cs.ET 版本更新

VERA-MH: Validation of Ethical and Responsible AI in Mental Health

VERA-MH：心理健康领域伦理和负责任AI的验证

Luca Belli, Kate H. Bentley, Josh Gieringer, Emily Van Ark, Nilu Zhao, Pradip Thachile, Matt Hawrilenko, Millard Brown, Adam M. Chekroud

发表机构 * Spring Health ； Yale University（耶鲁大学）

AI总结本研究提出VERA-MH，一种用于评估心理健康支持聊天机器人安全性的新型临床验证方法，重点评估聊天机器人在识别自杀倾向风险方面的表现。

详情

AI中文摘要

随着聊天机器人在更多领域被使用，包括原本未被设计用于的领域，如心理健康支持。为此，我们介绍了验证伦理和负责任AI在心理健康中的应用（VERA-MH），一种新的临床验证评估，用于评估聊天机器人在心理健康支持中的安全性。VERA-MH的第一版专注于自杀念头（SI）风险，通过评估聊天机器人如何回应可能处于危机中的用户。VERA-MH由三个步骤组成：对话模拟、对话评估和模型评分。首先，为评估的聊天机器人模拟对话，另一个聊天机器人将扮演用户角色，基于特定的人设进行模拟。这些用户人设是在临床指导下开发的，以确保代表多种风险因素、人口特征和披露因素。在评估步骤中，一个第二支持模型作为LLM-as-a-Judge，结合一个临床开发的评分表。评分表结构为流程，每次提出一个Yes/No问题，以提高答案的一致性并突出模型的失败模式。在最后阶段，每个对话的结果被汇总以呈现最终的聊天机器人评估。与框架一起，我们还展示了对四个领先LLM提供商的评估结果。

英文摘要

Chatbot usage has increased, including in fields for which they were never developed for--notably mental health support. To that end, we introduce Validations of Ethical and Responsible AI in Mental Health (VERA-MH), a novel clinically-validated evaluation for safety of chatbots in the context of mental health support. The first iteration of VERA-MH focuses on Suicidal Ideation (SI) risks, by assessing how well chatbots can responds to users that might be in crisis. VERA-MH is comprised of three steps: conversation simulation, conversation judging and model rating. First, to simulate conversations with the chatbot under evaluation, another chatbot is tasked with role-playing users based on specific personas. Such user personas have been developed under clinical guidance, to make sure that, among others, multiple risk factors, demographic characteristics and disclosure factors were represented. In the judging step, a second support model is used as an LLM-as-a-Judge, together with a clinically-developed rubric. The rubric is structured as a flow, with a single Yes/No question asked each time, to improve answers' consistency and highlight models' failure modes. In the last stage, results of each conversation are aggregated to present the final evaluation of the chatbot. Together with the framework, we present the result of the evaluations for four leading LLM providers.

URL PDF HTML ☆

赞 0 踩 0

2605.04970 2026-05-20 cs.LG cs.AI 版本更新

Skill Neologisms: Towards Skill-based Continual Learning

技能新词：迈向基于技能的持续学习

Antonin Berthon, Nicolas Astorga, Mihaela van der Schaar

发表机构 * University of Cambridge（剑桥大学）

AI总结本文提出了一种基于技能的新词（skill neologisms）方法，通过在模型词汇中集成软token，以提高模型在特定技能上的能力，同时支持零样本组合其他技能，从而实现可扩展的基于技能的持续学习。

详情

AI中文摘要

现代大语言模型（LLMs）在不断扩大的技能范围内表现出色，并能灵活组合这些技能。然而，以可扩展的方式将模型能力扩展到新技能仍然是一个开放性问题：微调和参数高效变体有灾难性遗忘的风险，而基于上下文的方法表达能力有限且受模型有效上下文的限制。我们探索了技能新词——整合在模型词汇中的软token，并优化以提高特定技能的能力——作为一种方法，以在不更新权重的情况下选择性地获取新技能。我们首先观察到预训练LLMs已经表现出与程序知识相关的token。然后在受控的合成任务上展示，技能新词可以学习以提高模型在特定技能上的能力，同时能够与分布外技能组合，且独立训练的技能新词可以零样本组合。最后，我们验证了在更现实的自然语言设置中，即Skill-Mix基准测试中，独立学习的技能新词的零样本组合。这些结果表明，技能新词可能为基于技能的持续学习提供可扩展的路径。

英文摘要

Modern LLMs show mastery over an ever-growing range of skills, as well as the ability to compose them flexibly. However, extending model capabilities to new skills in a scalable manner is an open problem: fine-tuning and parameter-efficient variants risk catastrophic forgetting, while context-based approaches have limited expressiveness and are constrained by the model's effective context. We explore skill neologisms--soft tokens integrated in the model's vocabulary and optimized to improve capabilities over a specific skill--as a way to selectively acquire new skills without weight updates. We first observe that pre-trained LLMs already exhibit tokens associated with procedural knowledge. We then show on a controlled synthetic task that skill neologisms can be learned to improve model capabilities on specific skills while being composable with out-of-distribution skills, and that independently trained skill neologisms can be composed zero-shot. Finally, we validate zero-shot composition of independently learned skill neologisms on the more realistic natural language setting of the Skill-Mix benchmark. These results suggest that skill neologisms may provide a scalable path towards skill-based continual learning.

URL PDF HTML ☆

赞 0 踩 0

2604.15166 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Class Unlearning via Depth-Aware Removal of Forget-Specific Directions

通过深度感知移除遗忘特定方向实现类别反学习

Arman Hatami, Romina Aalishah, Ilya E. Monosov

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

AI总结本文提出DAMP方法，通过深度感知移除遗忘特定方向，改进类别反学习的选性遗忘，同时更好地保留保留类性能并减少深层残留遗忘结构。

Comments Accepted for oral presentation at the CVPR 2026 Workshop on Machine Unlearning for Vision (MUV). Code: https://github.com/armanhtm/DAMP

详情

AI中文摘要

机器反学习旨在在不重新训练模型的情况下移除目标知识。然而，在类别反学习中，降低遗忘类的准确性并不一定意味着真正的遗忘：遗忘的信息可能仍编码在内部表示中，而显着的遗忘可能源于分类器头部抑制而非表示移除。我们显示现有类别反学习方法往往表现出弱或负的选择性，保留遗忘类结构在深度表示中，或严重依赖最终层偏移。我们随后引入DAMP（通过投影的深度感知调节），一种单次、闭合形式的权重手术方法，可以在不使用梯度优化的情况下从预训练网络中移除遗忘特定方向。在每个阶段，DAMP在下一个可学习操作的输入空间中计算类别原型，提取遗忘方向作为相对于保留类原型的残差，并应用基于投影的更新以减少下游对这些方向的敏感性。为了保持实用性，DAMP使用从探测分离性导出的参数无关深度感知缩放规则，应用较小的编辑在早期层和较大的编辑在深层。该方法自然扩展到多类遗忘通过低秩子空间移除。在MNIST、CIFAR-10、CIFAR-100和Tiny ImageNet以及卷积和变换器架构上，DAMP比一些先前方法更接近再训练的黄金标准，改进了选择性遗忘的同时更好地保留保留类性能并减少深层残留遗忘结构。

英文摘要

Machine unlearning aims to remove targeted knowledge from a trained model without the cost of retraining from scratch. In class unlearning, however, reducing accuracy on forget classes does not necessarily imply true forgetting: forgotten information can remain encoded in internal representations, and apparent forgetting may arise from classifier-head suppression rather than representational removal. We show that existing class-unlearning methods often exhibit weak or negative selectivity, preserve forget-class structure in deep representations, or rely heavily on final-layer bias shifts. We then introduce DAMP (Depth-Aware Modulation by Projection), a one-shot, closed-form weight-surgery method that removes forget-specific directions from a pretrained network without gradient-based optimization. At each stage, DAMP computes class prototypes in the input space of the next learnable operator, extracts forget directions as residuals relative to retain-class prototypes, and applies a projection-based update to reduce downstream sensitivity to those directions. To preserve utility, DAMP uses a parameter-free depth-aware scaling rule derived from probe separability, applying smaller edits in early layers and larger edits in deeper layers. The method naturally extends to multi-class forgetting through low-rank subspace removal. Across MNIST, CIFAR-10, CIFAR-100, and Tiny ImageNet, and across convolutional and transformer architectures, DAMP more closely resembles the retraining gold standard than some of the prior methods, improving selective forgetting while better preserving retain-class performance and reducing residual forget-class structure in deep layers.

URL PDF HTML ☆

赞 0 踩 0

2604.05002 2026-05-20 cs.LG cs.AI 版本更新

Learning Stable Predictors from Weak Supervision under Distribution Shift

在分布偏移下从弱监督中学习稳定的预测器

Mehrdad Shoeibi, Elias Hossain, Ivan Garibay, Niloofar Yousefi

发表机构 * University of Central Florida（中央佛罗里达大学）

AI总结本文研究了在分布偏移下从弱监督中学习稳定预测器的问题，通过CRISPR-Cas13d转录组扰动实验，探讨了监督漂移现象，并展示了弱监督在域内学习和部分跨细胞系迁移中的有效性，同时揭示了时间迁移中的失败源于监督漂移而非模型容量或简单协变量偏移。

详情

AI中文摘要

在真实标签不可用时，从弱、代理或相对监督中学习是常见的，但分布偏移下的鲁棒性仍缺乏理解，因为监督机制本身可能在不同环境中变化。我们正式将这种现象定义为监督漂移，即$P(y \mid x, c)$在不同上下文中变化，并在CRISPR-Cas13d转录组扰动实验中研究了它，其中指导效果是通过RNA-seq响应间接推断的。使用涵盖两种人类细胞系和多个诱导后时间点的公开数据，我们构建了一个受控的非独立同分布基准，具有明确的领域（细胞系）和时间偏移，同时在所有上下文中重用固定的弱标签构造以避免改变目标。在线性和树基模型中，弱监督支持域内有意义的学习（岭$R^2 = 0.356$，斯皮尔曼$ρ= 0.442$）和部分跨细胞系迁移（$ρ\approx 0.40$）。相比之下，时间迁移在所有考虑的模型类别中崩溃，产生负$R^2$和弱或接近零的$ρ$（岭$R^2 = -0.145$，$ρ= 0.008$；XGBoost $R^2 = -0.155$，$ρ= 0.056$；随机森林 $R^2 = -0.322$，$ρ= 0.139$）。使用外部重新计算的弱标签、偏移分数量化和简单的缓解基线进行额外的鲁棒性分析，保持了相同定性的模式。特征-标签关联和特征重要性分析在不同细胞系中相对稳定，但在时间上变化剧烈，表明失败源于监督漂移而非模型容量或简单协变量偏移。这些结果表明，在弱监督下强域内性能可能是误导性的，并促使将特征稳定性作为轻量级诊断，用于部署前检测非可迁移性。

英文摘要

Learning from weak, proxy, or relative supervision is common when ground-truth labels are unavailable, but robustness under distribution shift remains poorly understood because the supervision mechanism itself may change across environments. We formalize this phenomenon as supervision drift, defined as changes in $P(y \mid x, c)$ across contexts, and study it in CRISPR-Cas13d transcriptomic perturbation experiments where guide efficacy is inferred indirectly from RNA-seq responses. Using publicly available data spanning two human cell lines and multiple post-induction timepoints, we construct a controlled non-IID benchmark with explicit domain (cell line) and temporal shifts, while reusing a fixed weak-label construction across all contexts to avoid changing targets. Across linear and tree-based models, weak supervision supports meaningful learning in-domain (ridge $R^2 = 0.356$, Spearman $ρ= 0.442$) and partial cross-cell-line transfer ($ρ\approx 0.40$). In contrast, temporal transfer collapses across all model classes considered, yielding negative $R^2$ and weak or near-zero $ρ$ (ridge $R^2 = -0.145$, $ρ= 0.008$; XGBoost $R^2 = -0.155$, $ρ= 0.056$; random forest $R^2 = -0.322$, $ρ= 0.139$). Additional robustness analyses using externally recomputed weak labels, shift-score quantification, and simple mitigation baselines preserve the same qualitative pattern. Feature-label association and feature-importance analyses remain relatively stable across cell lines but change sharply over time, indicating that failures arise from supervision drift rather than model capacity or simple covariate shift. These results show that strong in-domain performance under weak supervision can be misleading and motivate feature stability as a lightweight diagnostic for non-transferability before deployment.

URL PDF HTML ☆

赞 0 踩 0

2603.17305 2026-05-20 cs.AI cs.CL cs.LG 版本更新

Contrastive Reasoning Alignment: Reinforcement Learning from Hidden Representations

对比推理对齐：从隐藏表示中学习强化学习

Haozheng Luo, Yimin Wang, Jiahao Yu, Binghui Wang, Yan Chen

发表机构 * Northwestern University（西北大学）； University of Michigan（密歇根大学）； Illinois Institute of Technology（伊利诺伊理工学院）

AI总结本文提出了一种基于对比学习和强化学习的框架CRAFT，通过优化隐藏状态空间中的目标来提升对抗攻击的鲁棒性，核心贡献是通过隐藏空间的几何结构实现推理层面的安全对齐。

Comments International Conference on Machine Learning (ICML) 2026

详情

AI中文摘要

我们提出CRAFT，一种红队对齐框架，利用模型推理能力和隐藏表示来提高对jailbreak攻击的鲁棒性。与以往主要在输出层面操作的防御方法不同，CRAFT将大型推理模型对齐以生成安全意识的推理轨迹，通过显式优化定义在隐藏状态空间上的目标。方法上，CRAFT将对比表示学习与强化学习相结合，分离安全和不安全的推理轨迹，得到支持鲁棒、推理层面安全对齐的潜在空间几何。理论上，我们证明将潜在文本一致性纳入GRPO可以消除表面上对齐的策略，将其排除在局部最优之外。实验上，我们在多个安全基准上评估CRAFT，使用两个强大的推理模型Qwen3-4B-Thinking和R1-Distill-Llama-8B，其中它在多个安全基准上均优于IPO和SafeKey等最先进的防御方法。值得注意的是，CRAFT在基础模型上实现了平均79.0%的推理安全性和87.7%的最终响应安全性提升，证明了隐藏空间推理对齐的有效性。

英文摘要

We propose CRAFT, a red-teaming alignment framework that leverages model reasoning capabilities and hidden representations to improve robustness against jailbreak attacks. Unlike prior defenses that operate primarily at the output level, CRAFT aligns large reasoning models to generate safety-aware reasoning traces by explicitly optimizing objectives defined over the hidden state space. Methodologically, CRAFT integrates contrastive representation learning with reinforcement learning to separate safe and unsafe reasoning trajectories, yielding a latent-space geometry that supports robust, reasoning-level safety alignment. Theoretically, we show that incorporating latent-textual consistency into GRPO eliminates superficially aligned policies by ruling them out as local optima. Empirically, we evaluate CRAFT on multiple safety benchmarks using two strong reasoning models, Qwen3-4B-Thinking and R1-Distill-Llama-8B, where it consistently outperforms state-of-the-art defenses such as IPO and SafeKey. Notably, CRAFT delivers an average 79.0% improvement in reasoning safety and 87.7% improvement in final-response safety over the base models, demonstrating the effectiveness of hidden-space reasoning alignment.

URL PDF HTML ☆

赞 0 踩 0

2603.11768 2026-05-20 cs.AI 版本更新

Governing Evolving Memory in LLM Agents: Risks, Mechanisms, and the Stability and Safety Governed Memory (SSGM) Framework

在LLM代理中治理演化的记忆：风险、机制以及稳定性与安全性的治理记忆（SSGM）框架

Chingkwun Lam, Jiaxin Li, Lingfei Zhang, Kuo Zhao

发表机构 * College of Intelligent Science and Engineering（智能科学与工程学院）

AI总结本文研究了LLM代理中记忆治理的问题，提出了一种新的SSGM框架，通过一致性验证、时间衰减建模和动态访问控制来缓解记忆腐蚀风险，提高记忆系统的稳定性与安全性。

详情

AI中文摘要

长期记忆已成为自主大型语言模型（LLM）代理的基础组件，使连续适应、终身多模态学习和复杂推理成为可能。然而，随着记忆系统从静态检索数据库转变为动态、代理机制，关于记忆治理、语义漂移和隐私漏洞的关键问题变得突出。尽管最近的调查主要集中在记忆检索效率上，但它们大多忽略了在高度动态环境中记忆腐蚀的新兴风险。为了解决这些新兴挑战，我们提出了稳定性与安全性治理记忆（SSGM）框架，一种概念治理架构。SSGM通过在任何记忆巩固之前执行一致性验证、时间衰减建模和动态访问控制，将记忆演进与执行分离。通过形式分析和架构分解，我们展示了SSGM如何缓解拓扑诱导的知识泄漏，其中敏感上下文被固化到长期存储中，并帮助防止语义漂移，其中知识通过迭代总结退化。最终，这项工作提供了一个全面的记忆腐蚀风险分类法，并建立了部署安全、持久和可靠的代理记忆系统稳健治理范式。

英文摘要

Long-term memory has emerged as a foundational component of autonomous Large Language Model (LLM) agents, enabling continuous adaptation, lifelong multimodal learning, and sophisticated reasoning. However, as memory systems transition from static retrieval databases to dynamic, agentic mechanisms, critical concerns regarding memory governance, semantic drift, and privacy vulnerabilities have surfaced. While recent surveys have focused extensively on memory retrieval efficiency, they largely overlook the emergent risks of memory corruption in highly dynamic environments. To address these emerging challenges, we propose the Stability and Safety-Governed Memory (SSGM) framework, a conceptual governance architecture. SSGM decouples memory evolution from execution by enforcing consistency verification, temporal decay modeling, and dynamic access control prior to any memory consolidation. Through formal analysis and architectural decomposition, we show how SSGM can mitigate topology-induced knowledge leakage where sensitive contexts are solidified into long-term storage, and help prevent semantic drift where knowledge degrades through iterative summarization. Ultimately, this work provides a comprehensive taxonomy of memory corruption risks and establishes a robust governance paradigm for deploying safe, persistent, and reliable agentic memory systems.

URL PDF HTML ☆

赞 0 踩 0

2603.05910 2026-05-20 cs.AI 版本更新

从观测到预测：LSTM用于高速公路进出匝道的车辆车道变更预测

Mohamed Abouras, Catherine M. Elias

发表机构 * C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems（C-DRiVeS实验室：车载系统认知驾驶研究）； Computer Science and Engineering Department - Faculty of Media Engineering and Technology - German University in Cairo（计算机科学与工程系 - 媒体工程与技术学院 - 埃及德国大学）

AI总结本文研究了高速公路进出匝道区域与直线路段的区别，利用多层LSTM架构和ExiD无人机数据集训练模型，测试了不同预测时间范围和不同模型的工作流程，结果表明在4秒内预测准确率可达76%（匝道区域）和94%（一般高速公路场景）.

2601.12358 2026-05-20 cs.CV cs.AI cs.RO 版本更新

From Prompts to Pavement: LMMs-based Agentic Behavior-Tree Generation Framework for Autonomous Vehicles

从提示到道路：基于大语言模型的代理行为树生成框架用于自动驾驶车辆

Omar Y. Goba, Ahmed Y. Gado, Catherine M. Elias, Ahmed Hussein

发表机构 * Computer Science & Engineering Department, German University in Cairo (GUC), Egypt（德国亚历山大·冯·洪堡大学（开罗分校）计算机科学与工程系，埃及）； C-DRiVeS Lab: Cognitive Driving Research in Vehicular Systems, Cairo, Egypt（认知驾驶系统实验室（车辆系统中的认知驾驶研究），开罗，埃及）； IAV GmbH, Berlin, Germany（IAV GmbH，柏林，德国）

AI总结本文提出了一种基于大语言模型和多模态视觉模型的代理行为树生成框架，用于自动驾驶车辆在复杂环境中自适应导航。该框架通过链式符号提示评估场景关键性，通过上下文学习构建高层子目标，并通过生成器合成可执行的BT子树，实现了在CARLA+Nav2模拟中对突发障碍物（如道路堵塞）的成功绕行。

详情

DOI: 10.1109/ITSC60802.2025.11423726

AI中文摘要

自动驾驶车辆（AVs）需要适应性行为规划器来安全地导航不可预测的现实环境。传统的行为树（BTs）提供结构化决策逻辑，但本质上是静态的，并且需要大量人工调优，限制了其在SAE Level 5自主性中的应用。本文提出了一种代理框架，利用大语言模型（LLMs）和多模态视觉模型（LVMs）来实时生成和适应BTs。一个专门的Descriptor代理使用链式符号提示来评估场景关键性，一个Planner代理通过上下文学习构建高层子目标，一个Generator代理合成可执行的BT子树。该系统集成到CARLA+Nav2模拟中，仅在基线BT失败时触发，展示了成功绕过突发障碍物（例如道路堵塞）的能力，无需人工干预。与静态BT基线相比，该方法是一种概念验证，能够扩展到多样的驾驶场景。

英文摘要

Autonomous vehicles (AVs) require adaptive behavior planners to navigate unpredictable, real-world environments safely. Traditional behavior trees (BTs) offer structured decision logic but are inherently static and demand labor-intensive manual tuning, limiting their applicability at SAE Level 5 autonomy. This paper presents an agentic framework that leverages large language models (LLMs) and multi-modal vision models (LVMs) to generate and adapt BTs on the fly. A specialized Descriptor agent applies chain-of-symbols prompting to assess scene criticality, a Planner agent constructs high-level sub-goals via in-context learning, and a Generator agent synthesizes executable BT sub-trees in XML format. Integrated into a CARLA+Nav2 simulation, our system triggers only upon baseline BT failure, demonstrating successful navigation around unexpected obstacles (e.g., street blockage) with no human intervention. Compared to a static BT baseline, this approach is a proof-of-concept that extends to diverse driving scenarios.

URL PDF HTML ☆

赞 0 踩 0

2511.21577 2026-05-20 cs.SD cs.AI 版本更新

HarmonicAttack: An Adaptive Cross-Domain Audio Watermark Removal

HarmonicAttack: 一种自适应跨领域音频水印移除方法

Kexin Li, Xiao Hu, Ilya Grishchenko, David Lie

发表机构 * University of Toronto（多伦多大学）

AI总结本文提出HarmonicAttack，一种无需访问目标水印检测器的新型音频水印移除方法，通过训练通用模型来移除音频水印，同时在不同分布数据集上保持高感知质量。

Comments Under Review

详情

AI中文摘要

高质量的AI生成音频的可用性引发了诸如虚假信息活动和语音克隆欺诈等安全挑战。对抗AI生成音频的滥用的关键防御措施是通过水印标记，以便能够轻易区分真实音频。那些试图滥用AI生成音频的人可能会尝试移除音频水印，因此研究有效的水印移除技术对于客观评估音频水印的鲁棒性至关重要。先前的水印移除方案通常假设在移除过程中可以访问目标水印检测器。这种假设往往不切实际，可能导致对当前水印方案的过度自信。我们引入了HarmonicAttack，一种新的音频水印移除方法，它不需要访问目标水印算法。它只需要一组原始和水印样本来训练一个能够从音频样本中移除水印的通用模型。我们还发现，训练样本不需要与目标样本具有相同的分布，因为我们的攻击在面对非分布样本时具有最小的退化。与现有水印移除攻击相比，HarmonicAttack在移除最新方案（包括AudioSeal、WavMark、SilentCipher和AudioMarkNet）的水印方面更加有效，同时保持高感知质量。尽管HarmonicAttack是在LibriSpeech数据集上针对AudioSeal训练的，但它能够泛化到未见过的数据集和水印方案。例如，在VCTK上，HarmonicAttack对AudioMarkNet的识别准确率达到了92%，明显优于最佳基线的38%。在FMA上，HarmonicAttack对所有水印达到了100%的识别准确率，而最佳基线在AudioSeal上仅达到2%，在WavMark上达到44%。

英文摘要

The availability of high-quality, AI-generated audio raises security challenges such as misinformation campaigns and voice-cloning fraud. A key defense against the misuse of AI-generated audio is by watermarking it, so that it can be easily distinguished from genuine audio. Those seeking to misuse AI-generated audio may attempt to remove audio watermarks, so studying effective watermark removal techniques is critical to objectively evaluate the robustness of audio watermarks. Previous watermark removal schemes typically assume access to the target watermark detector during the removal process. This assumption is often impractical, which may lead to a false sense of confidence in current watermark schemes. We introduce HarmonicAttack, a novel audio watermark removal method that requires no access to the target watermark algorithm. It only needs a number of original and watermarked samples to train a general model capable of removing watermarks from audio samples. We also find that training samples do not need to share the same distribution as target samples, as our attack generalizes to out-of-distribution samples with minimal degradation. Compared with existing watermark removal attacks, HarmonicAttack is more effective at removing watermarks from state-of-the-art schemes, including AudioSeal, WavMark, SilentCipher, and AudioMarkNet, while maintaining high perceptual quality. Although HarmonicAttack is trained on the LibriSpeech dataset against AudioSeal, it generalizes across unseen datasets and watermarking schemes. For instance, on VCTK, HarmonicAttack achieves a 92% ASR against AudioMarkNet, substantially outperforming the best baseline at 38%. On FMA, HarmonicAttack reaches 100% ASR against all watermarks, whereas the best baseline achieves only 2% against AudioSeal and 44% against WavMark.

URL PDF HTML ☆

赞 0 踩 0

2510.18924 2026-05-20 cs.LG cs.AI 版本更新

Noise-corrected GRPO: From Noisy Rewards to Unbiased Gradients

噪声校正的GRPO：从噪声奖励到无偏梯度

Omar El Mansouri, Fathinah Asma Izzati, Mohamed El Amine Seddik, Salem Lahlou

发表机构 * Department of Machine Learning, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE ； Technology Innovation Institute, Abu Dhabi, UAE ； Department of Robotics, Khalifa University, Abu Dhabi, UAE

AI总结本文提出了一种噪声鲁棒的GRPO框架，通过校正奖励中的噪声来获得无偏梯度估计，从而提升强化学习在噪声环境中的性能。

详情

AI中文摘要

人类反馈的强化学习（RLHF）或可验证奖励（RLVR）是对大语言模型进行对齐或构建最新SOTA推理模型的标准范式，但其对不一致或错误奖励产生的噪声非常敏感。然而，此类噪声与广泛使用的基于组的策略优化方法之间的相互作用仍不为人知。我们引入了一种噪声鲁棒的组相对策略优化（GRPO）和正确执行GRPO（Dr.GRPO）框架，该框架明确将奖励损坏建模为伯努利噪声。我们的方法在估计奖励翻转概率后应用噪声校正，以消除学习信号的偏差，从而获得可证明无偏的梯度估计。理论分析表明，基于组的方法本质上可以缓解个体层面的噪声，而我们的校正策略增强了这种鲁棒性。实验表明，在应用我们的噪声校正到标准奖励模型使用时，数学和代码任务中均观察到一致的改进，特别是在现实奖励模型条件下，数学任务的准确性提高了高达6.7个百分点，代码任务提高了1.5个百分点。这项工作将监督学习中的标签噪声校正与现代RLHF相结合，提供了理论洞察和实用算法，以应对噪声现实世界部署。

CADDesigner: 一种通用智能体的概念CAD模型生成

Fengxiao Fan, Jingzhe Ni, Xiaolong Yin, Sirui Wang, Xingyu Lu, Qiang Zou, Ruofeng Tong, Min Tang, Peng Du

发表机构 * Zhejiang University（浙江大学）

AI总结本文提出CADDesigner，一种基于LLM的智能体，通过文本描述和草图输入，结合交互对话进行需求分析，生成高质量CAD模型代码，并通过迭代视觉反馈提升模型质量，实验表明其在概念CAD模型生成任务中表现优异。

详情

AI中文摘要

计算机辅助设计（CAD）广泛用于概念设计和参数化3D建模，但通常需要设计人员具备高水平的专业知识。为了降低入门门槛并促进早期阶段的CAD建模，我们提出了CADDesigner，一种基于LLM的智能体，用于概念CAD设计。该智能体接受文本描述和草图作为输入，通过与用户进行交互对话，通过全面的需求分析来细化和澄清设计要求。基于一种新的显式上下文指令范式（ECIP），该智能体生成高质量的CAD建模代码。在生成过程中，智能体会结合迭代的视觉反馈来提高模型质量。生成的设计案例可以存储在结构化的知识库中，提供持续的知识积累机制，为未来的代码生成改进提供可能。实验结果表明，CADDesigner在概念CAD模型生成任务中实现了具有竞争力的性能，并在概念CAD模型生成任务中优于代表性的基线模型。

英文摘要

Computer-Aided Design (CAD) is widely used for conceptual design and parametric 3D modeling, but typically requires a high level of expertise from designers. To lower the entry barrier and facilitate early-stage CAD modeling, we present CADDesigner, an LLM-powered agent for conceptual CAD design. The agent accepts both textual descriptions and sketches as input, engaging in interactive dialogue with users to refine and clarify design requirements through comprehensive requirement analysis. Built upon a novel Explicit Context Imperative Paradigm (ECIP), the agent generates high-quality CAD modeling code. During the generation process, the agent incorporates iterative visual feedback to improve model quality. Generated design cases can be stored in a structured knowledge base, providing a mechanism for continual knowledge accumulation and future improvement of code generation. Experimental results show that CADDesigner achieves competitive performance and outperforms representative baselines on conceptual CAD model generation tasks.

URL PDF HTML ☆

赞 0 踩 0

2506.03178 2026-05-20 eess.IV cs.AI cs.CV 版本更新

LLaMA-XR: A Novel Framework for Radiology Report Generation using LLaMA and QLoRA Fine Tuning

LLaMA-XR: 一种基于LLaMA和QLoRA微调的新型放射科报告生成框架

Md. Zihad Bin Jahangir, Muhammad Ashad Kabir, Sumaiya Akter, Israt Jahan, Minh Chau

发表机构 * Department of Computer Science and Engineering, Southeast University（计算机科学与工程系，东南大学）； School of Computing, Mathematics and Engineering, Charles Sturt University（计算、数学与工程学院，查尔斯·斯特劳特大学）； Department of Computer Science and Engineering, University of Liberal Arts Bangladesh（计算机科学与工程系，孟加拉国自由大学）； Medical Imaging Group, School of Dentistry and Medical Sciences, Charles Sturt University（医学影像组，牙科学院与医学科学学院，查尔斯·斯特劳特大学）

AI总结本文提出LLaMA-XR框架，结合LLaMA 3.1与DenseNet-121图像嵌入及QLoRA微调，提升放射科报告生成的准确性和临床相关性，同时保持计算效率。

Comments 25 pages

Journal ref Bioengineering 2026, 13(5), 493

详情

DOI: 10.3390/bioengineering13050493

AI中文摘要

自动化放射科报告生成具有减少放射科医生工作负担和提高诊断准确性的潜力。然而，从胸部X光片生成精确且具有临床意义的报告仍然具有挑战性，因为医学语言的复杂性和对上下文理解的需求。现有模型在保持准确性和上下文相关性方面存在困难。在本文中，我们提出了LLaMA-XR，一种新型框架，整合了LLaMA 3.1与基于DenseNet-121的图像嵌入以及量化低秩适应（QLoRA）微调。LLaMA-XR在保持计算效率的同时实现了改进的连贯性和临床准确性。这种效率是由一种优化策略驱动的，该策略增强了参数利用并减少了内存开销，使报告生成速度更快，计算资源需求更低。在IU X光基准数据集上进行的广泛实验表明，LLaMA-XR优于一系列最先进的方法。我们的模型在ROUGE-L得分上达到0.433，在METEOR得分上达到0.336，建立了该领域的性能新基准。这些结果突显了LLaMA-XR作为自动化放射科报告的有效且高效的AI系统潜力，提供了增强的临床效用和可靠性。

英文摘要

Automated radiology report generation holds significant potential to reduce radiologists' workload and enhance diagnostic accuracy. However, generating precise and clinically meaningful reports from chest radiographs remains challenging due to the complexity of medical language and the need for contextual understanding. Existing models often struggle with maintaining both accuracy and contextual relevance. In this paper, we present LLaMA-XR, a novel framework that integrates LLaMA 3.1 with DenseNet-121-based image embeddings and Quantized Low-Rank Adaptation (QLoRA) fine-tuning. LLaMA-XR achieves improved coherence and clinical accuracy while maintaining computational efficiency. This efficiency is driven by an optimization strategy that enhances parameter utilization and reduces memory overhead, enabling faster report generation with lower computational resource demands. Extensive experiments conducted on the IU X-ray benchmark dataset demonstrate that LLaMA-XR outperforms a range of state-of-the-art methods. Our model achieves a ROUGE-L score of 0.433 and a METEOR score of 0.336, establishing new performance benchmarks in the domain. These results underscore LLaMA-XR's potential as an effective and efficient AI system for automated radiology reporting, offering enhanced clinical utility and reliability.

URL PDF HTML ☆

赞 0 踩 0

2504.05454 2026-05-20 cs.LG cs.AI cs.CE q-bio.GN q-bio.QM 版本更新

GraphPINE: Graph Importance Propagation for Interpretable Drug Response Prediction

GraphPINE: 图重要性传播用于可解释的药物反应预测

Yoshitaka Inoue, Tianfan Fu, Augustin Luna

发表机构 * Computational Biology Branch, National Library of Medicine（国家医学图书馆计算生物学分支）； Developmental Therapeutics Branch, National Cancer Institute（国家癌症研究所发育治疗分支）

AI总结本文提出GraphPINE，一种利用领域特定先验知识初始化节点重要性的图神经网络架构，以提高药物反应预测的可解释性。通过引入重要性传播层，统一更新特征矩阵和节点重要性，并利用基于GNN的图传播来传播特征值，从而实现更有效的特征学习和图表示。

详情

AI中文摘要

可解释性对于生物医学研究中的许多任务都是必要的。最近的可解释性方法集中在注意力、梯度和Shapley值上。这些方法无法处理具有强相关先验知识的数据，并且未能基于已知的预测特征之间的关系来约束可解释性结果。我们提出了GraphPINE，一种图神经网络（GNN）架构，利用领域特定的先验知识来初始化节点重要性，以便在训练过程中优化用于药物反应预测。通常，一个手动的后预测步骤会检查文献（即先验知识）以理解返回的预测特征。虽然梯度和注意力在预测后可以获取节点重要性，但这些方法的节点重要性缺乏互补的先验知识；GraphPINE旨在克服这一限制。GraphPINE与其他GNN门控方法的不同之处在于利用了类似LSTM的顺序格式。我们引入了一个重要性传播层，统一了1）特征矩阵和节点重要性的更新以及2）使用基于GNN的图传播来传播特征值。这种初始化和更新机制使得特征学习更加有据可依，并提高了图表示的质量。我们应用GraphPINE进行癌症药物反应预测，使用了超过5000个基因节点的药物筛选和基因数据，这些节点包含在基因-基因图中，并利用药物-靶点相互作用（DTI）图进行初始重要性。基因-基因图和DTI来自经过整理的来源，并通过讨论药物和基因之间关系的文章数量进行加权。GraphPINE在952种药物上实现了PR-AUC为0.894和ROC-AUC为0.796。代码可在https://anonymous.4open.science/r/GraphPINE-40DE获取。

英文摘要

Explainability is necessary for many tasks in biomedical research. Recent explainability methods have focused on attention, gradient, and Shapley value. These do not handle data with strong associated prior knowledge and fail to constrain explainability results based on known relationships between predictive features. We propose GraphPINE, a graph neural network (GNN) architecture leveraging domain-specific prior knowledge to initialize node importance optimized during training for drug response prediction. Typically, a manual post-prediction step examines literature (i.e., prior knowledge) to understand returned predictive features. While node importance can be obtained for gradient and attention after prediction, node importance from these methods lacks complementary prior knowledge; GraphPINE seeks to overcome this limitation. GraphPINE differs from other GNN gating methods by utilizing an LSTM-like sequential format. We introduce an importance propagation layer that unifies 1) updates for feature matrix and node importance and 2) uses GNN-based graph propagation of feature values. This initialization and updating mechanism allows for informed feature learning and improved graph representation. We apply GraphPINE to cancer drug response prediction using drug screening and gene data collected for over 5,000 gene nodes included in a gene-gene graph with a drug-target interaction (DTI) graph for initial importance. The gene-gene graph and DTIs were obtained from curated sources and weighted by article count discussing relationships between drugs and genes. GraphPINE achieves a PR-AUC of 0.894 and ROC-AUC of 0.796 across 952 drugs. Code is available at https://anonymous.4open.science/r/GraphPINE-40DE.

URL PDF HTML ☆

赞 0 踩 0

2410.20238 2026-05-20 cs.CL cs.AI 版本更新

A Survey of Large Language Models for Arabic Language and its Dialects

阿拉伯语言及其方言大型语言模型综述

Malak Mashaabi, Shahad Al-Khalifa, Hend Al-Khalifa

发表机构 * iWAN Research Group（iWAN研究组）； College of Computer and Information Sciences（计算机与信息科学学院）； King Saud University（沙特国王大学）

AI总结本文综述了针对阿拉伯语言及其方言设计的大型语言模型，涵盖关键架构、预训练数据集以及单语、双语和多语模型在下游任务中的性能，同时讨论了阿拉伯LLM的开放性及其对未来研究的挑战与机遇。

Comments Submitted to ACM Transactions on Asian and Low-Resource Language Information Processing

详情

DOI: 10.1145/3807946

AI中文摘要

本文综述了针对阿拉伯语言及其方言设计的大型语言模型（LLMs）。它涵盖了关键架构，包括仅编码器、仅解码器和编码器-解码器模型，以及用于预训练的数据集，涵盖古典阿拉伯语、现代标准阿拉伯语和方言阿拉伯语。该研究还探讨了单语、双语和多语LLMs，分析了它们的架构和在下游任务（如情感分析、命名实体识别和问答）中的性能。此外，它评估了阿拉伯LLMs的开放性，基于源代码可用性、训练数据、模型权重和文档等因素。综述指出需要更多多样化的方言数据集，并强调开放性对于研究可重复性和透明性的重要性。最后，它通过识别关键挑战和未来研究的机会，强调了更包容和代表性的模型的必要性。

英文摘要

This survey offers a comprehensive overview of Large Language Models (LLMs) designed for Arabic language and its dialects. It covers key architectures, including encoder-only, decoder-only, and encoder-decoder models, along with the datasets used for pre-training, spanning Classical Arabic, Modern Standard Arabic, and Dialectal Arabic. The study also explores monolingual, bilingual, and multilingual LLMs, analyzing their architectures and performance across downstream tasks, such as sentiment analysis, named entity recognition, and question answering. Furthermore, it assesses the openness of Arabic LLMs based on factors, such as source code availability, training data, model weights, and documentation. The survey highlights the need for more diverse dialectal datasets and attributes the importance of openness for research reproducibility and transparency. It concludes by identifying key challenges and opportunities for future research and stressing the need for more inclusive and representative models.

URL PDF HTML ☆

赞 0 踩 0

2605.19762 2026-05-20 cs.AI cs.CL 版本更新

What Really Improves Mathematical Reasoning: Structured Reasoning Signals Beyond Pure Code

什么真正提升了数学推理：超越纯代码的结构化推理信号

Yuze Zhao, Junpeng Fang, Lu Yu, Zhenya Huang, Kai Zhang, Qing Cui, Qi Liu, Jun Zhou, Enhong Chen

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science（认知智能国家重点实验室，科学大学）； Institute of Artificial Intelligence, Hefei Comprehensive National Science Centerce（人工智能研究院，合肥综合性国家科学中心）； Individual Researcher（个人研究员）； Zhejiang University, Hangzhou, China（浙江大学，杭州，中国）

AI总结本文通过控制预训练实验研究代码对推理能力的影响，发现代码主要提升编程能力而非通用推理，且在复杂数学推理中与知识密集型任务竞争，同时结构化推理轨迹（如代码-文本和数学-文本混合）比纯可执行代码更能提升推理能力。

Comments Accepted by ICML 2026, 22 pages, 10 figures

详情

AI中文摘要

代码已成为现代基础语言模型（LM）训练中的标准组件，但其作用超越编程仍不明确。我们重新审视代码通过控制预训练实验在10T-token语料库上进行细粒度领域分离，发现三个结论。首先，当代码限制为独立可执行程序且Code-NL数据被控制时，代码显著提升编程能力，但不作为通用推理增强器，反而在复杂数学推理中与知识密集型任务竞争。其次，通常归因于代码的推理增益更可能由跨领域结构化推理轨迹（如代码-文本和数学-文本混合）解释，而非纯可执行代码。第三，在固定数学预算内增加结构化数学领域样本密度，能在困难数学推理上获得显著提升，同时基本保持编程性能，表明认知支架提供了一种有针对的缓解跨领域权衡的方法。最后，路由分析显示数据组合效应反映在专家激活模式中，为跨领域竞争和协同作用提供了机制层面的证据。我们的结果澄清了哪些数据特征在能力维度间转移，并指出了更精确的数据导向优化策略。

英文摘要

Code has become a standard component of modern foundation language model (LM) training, yet its role beyond programming remains unclear. We revisit the claim that code improves reasoning through controlled pretraining experiments on a 10T-token corpus with fine-grained domain separation. Our findings are threefold. First, when code is restricted to standalone executable programs and Code-NL data are controlled for, code substantially improves programming ability but does not act as a general reasoning enhancer; instead, it competes with knowledge-intensive tasks, especially complex mathematical reasoning. Second, the reasoning gains often attributed to code are better explained by cross-domain structured reasoning traces, such as code-text and math-text mixtures, rather than by executable code alone. Third, increasing the density of structured math-domain samples within a fixed math budget yields substantial gains on difficult mathematical reasoning while largely preserving programming performance, suggesting that cognitive scaffolds offer a targeted way to mitigate cross-domain trade-offs. Finally, routing analyses show that data-composition effects are reflected in expert-activation patterns, providing mechanism-level evidence for competitive and synergistic interactions across domains. Our results clarify which data characteristics transfer across capability dimensions and point to more precise data-centric optimization strategies.

URL PDF HTML ☆

赞 0 踩 0

2605.19758 2026-05-20 cs.AI cs.DB stat.ML 版本更新

CogScale: Scalable Benchmark for Sequence Processing

CogScale: 用于序列处理的可扩展基准

Yannis Bendi-Ouis, Romain de Coudenhove, Xavier Hinaut

发表机构 * Mnemosyne ； Inria（法国国家信息与自动化技术研究所）； ENS PSL（巴黎社会科学高等学院）； École normale supérieure (Ulm)（巴黎高等师范学院）； Bordeaux, France（法国波尔多）； Paris, France（法国巴黎）

AI总结本文提出CogScale，一个包含14个可扩展合成任务的基准，用于评估不同架构在不同参数规模下的认知和记忆能力，通过标准化轻量框架加速架构创新验证。

详情

AI中文摘要

维持和操纵信息随时间变化的能力是生物和人工智能的基本特征。尽管现代模型在自然语言处理等任务上取得了显著成功，但评估新型架构处理序列信息的能力仍计算成本高且耗时。测试新架构通常需要扩展到大规模数据集和模型，导致巨大的计算成本和缓慢的迭代周期。在本文中，我们提出了CogScale，一个包含14个可扩展合成任务的基准，旨在隔离和评估不同参数规模下的特定认知和记忆能力。通过提供标准化的轻量框架，CogScale允许研究者在投入大规模训练之前快速验证架构创新。为了建立坚实的基础，我们评估了七种不同的架构：门控循环单元（GRU）、长短期记忆（LSTM）、xLSTM、回声状态网络（ESN）、Mamba、Transformer解码器和Transformer编码器-解码器。这些评估在严格的参数预算（1k、10k和100k）和不同的难度级别和规模下进行。我们的结果表明，尽管经典RNN和回声状态网络在严格参数预算内表现出色，只有注意力机制和现代状态空间模型在推理复杂性和任务难度增加时仍能保持高性能。

英文摘要

The ability to maintain and manipulate information over time is a fundamental aspect of living beings and Artificial Intelligence. While modern models have achieved remarkable success in tasks like natural language processing, evaluating the capacity of novel architectures to process sequential information remains computationally expensive and time-consuming. Testing a new architecture often requires scaling up to massive datasets and models, leading to vast computational costs and slow iteration cycles. In this paper, we propose CogScale, a benchmark of 14 scalable synthetic tasks designed to isolate and evaluate specific cognitive and memory abilities at different parametrizable scales. By providing a standardized, lightweight framework, CogScale allows researchers to rapidly validate architectural innovations before committing to large-scale training. To establish a solid baseline, we evaluate seven distinct architectures: Gated Recurrent Unit (GRU), Long Short-Term Memory (LSTM), xLSTM, Echo State Network (ESN), Mamba, Transformer Decoder, and Transformer Encoder-Decoder. These evaluations are conducted under strict parameter budgets (1k, 10k, and 100k) and across different difficulty levels and scales. Our results show that while classical RNNs and Echo State Networks excel at basic retention within strict parameter budgets, only attention mechanisms and modern state-space models consistently maintain high performance as reasoning complexity and task difficulty scale.

URL PDF HTML ☆

赞 0 踩 0

2605.19755 2026-05-20 cs.SE cs.AI cs.CR cs.LG cs.MA 版本更新

Operationalising Artificial Intelligence Bills of Materials (AIBOMs) for Verifiable AI Provenance and Lifecycle Assurance

将人工智能物料清单（AIBOM） operationalise 以实现可验证的 AI 追溯和生命周期保证

Petar Radanliev, Omar Santos, Carsten Maple, Kay Atefi

AI总结本文提出了一种扩展CycloneDX标准的AIBOM框架，用于捕捉AI特定的溯源、模型血统和披露元数据，通过结构化架构工程、密码学验证和智能体驱动自动化，实现可验证的软件溯源，展示了98.7%的可重复性保真度、96.2%的漏洞匹配精度和63%的手动监督减少，验证了自动化溯源保证和可重复AI生命周期验证的可行性。

Journal ref Front. Comput. Sci. 8:1735919 (2026)

详情

DOI: 10.3389/fcomp.2026.1735919

AI中文摘要

人工智能（AI）系统日益依赖复杂的、多层的软件供应链，这带来了可重复性、透明性和安全性保证的挑战。本文提出了一种扩展CycloneDX标准的人工智能物料清单（AIBOM）架构，以捕捉AI特定的溯源、模型血统和披露元数据。该框架通过结构化架构工程、密码学验证和智能体驱动自动化，提供了一种正式的方法来实现可验证的软件溯源。开发了一个自主的AI流水线，利用机器可验证的溯源链进行持续的环境检查、漏洞丰富和可重复性审计。实证评估显示，在容器化分析工作流中，可重复性保真度为98.7%，漏洞匹配精度为96.2%，手动监督减少了63%。这些结果验证了自动化溯源保证和可重复AI生命周期验证的可行性。AIBOM框架在软件供应链透明性和AI可重复性工程的科学基础方面取得了进展，提供了一种可推广的方法来确保AI系统安全、加强溯源完整性，并支持符合国际信息安全标准。

英文摘要

Artificial Intelligence (AI) systems are increasingly dependent on complex, multi-layered software supply chains that introduce challenges for reproducibility, transparency, and security assurance. This study presents an Artificial Intelligence Bill of Materials (AIBOM) schema extending the CycloneDX standard to capture AI-specific provenance, model lineage, and disclosure metadata. The framework provides a formalised approach to verifiable software provenance through structured schema engineering, cryptographic validation, and agent-driven automation. An autonomous AI pipeline is developed to perform continuous environment inspection, vulnerability enrichment, and reproducibility auditing using machine-verifiable provenance chains. Empirical evaluation demonstrates 98.7% reproducibility fidelity, 96.2% vulnerability match precision, and a 63% reduction in manual oversight across containerised analytic workflows. These results confirm the feasibility of automated provenance assurance and reproducible AI lifecycle validation. The AIBOM framework advances the scientific foundations of software supply chain transparency and AI reproducibility engineering, offering a generalisable methodology for securing AI systems, strengthening provenance integrity, and supporting compliance with international information security standards.

URL PDF HTML ☆

赞 0 踩 0

2605.19748 2026-05-20 cs.AI cs.MA 版本更新

Memory-Augmented Reinforcement Learning Agent for CAD Generation

具有记忆增强的强化学习代理的CAD生成

Yin Xiaolong, Liu Yu, Shen Jiahang, Lu Xingyu, Ni Jingzhe, Fan Fengxiao, Sang Fan

发表机构 * Zhejiang University（浙江大学）

AI总结本文提出了一种记忆增强的强化学习框架，用于生成CAD模型，通过引入强化学习进行检索和策略优化，有效避免了检索陷阱，提高了复杂CAD模型生成的成功率和几何一致性。

Comments 26 pages; multilingual submission: English version first, followed by Chinese version

详情

AI中文摘要

计算机辅助设计（CAD）模型的自动生成是实现先进制造业智能化的核心技术。现有的基于大语言模型（LLMs）的生成方法在处理具有长操作序列、多样操作类型和强几何约束的复杂CAD模型时往往力不从心，主要原因是推理链断裂且缺乏有效的错误修正机制。为了解决这个问题，本文提出了一种用于CAD生成代理的记忆增强强化学习框架。该框架将底层几何内核封装成可由代理调用的结构化工具链，并构建了设计意图理解、全局规划、执行和多维验证的闭环机制。同时，该框架设计了由案例库和技能库组成的双轨记忆模块，并提出了动态效用检索算法。通过将强化学习引入检索和策略优化，代理能够有效避免检索陷阱，即在语义相似但几何不可行的例子中，实现在线自我修正和持续进化，而无需额外的大规模标注数据。实验表明，所提出的方法在复杂CAD模型生成任务中显著提高了成功率和几何一致性。

英文摘要

Automatic generation of computer-aided design (CAD) models is a core technology for enabling intelligence in advanced manufacturing. Existing generation methods based on large language models (LLMs) often fall short when handling complex CAD models characterized by long operation sequences, diverse operation types, and strong geometric constraints, primarily because reasoning chains break and effective error-correction mechanisms are lacking. To address this problem, this paper proposes a memory-augmented reinforcement learning framework for CAD generation agents. The framework encapsulates the underlying geometric kernel into a structured toolchain callable by the agent and builds a closed-loop mechanism of design intent understanding, global planning, execution, and multi-dimensional verification. It also designs a dual-track memory module consisting of a case library and a skill library, and proposes a dynamic utility retrieval algorithm. By introducing reinforcement learning into retrieval and policy optimization, the agent can effectively avoid retrieval traps in which examples are semantically similar but geometrically infeasible, enabling online self-correction and continual evolution without additional large-scale annotated data. Experiments show that the proposed method significantly improves both the success rate and geometric consistency on complex CAD model generation tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.19738 2026-05-20 cs.CL cs.AI 版本更新

TERGAD: Structure-Aware Text-Enhanced Representations for Graph Anomaly Detection

TERGAD: 用于图异常检测的结构感知文本增强表示

Wen Shi, Zhe Wang, Huafei Huang, Qing Qing, Ziqi Xu, Qixin Zhang, Xikun Zhang, Renqiang Luo, Feng Xia

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； School of Computer Science and Information Technology, Adelaide University（阿德莱德大学计算机科学与信息科技学院）； School of Computing Technologies, RMIT University（皇家墨尔本理工大学计算技术学院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）

AI总结本文提出TERGAD，一种通过大语言模型的语义推理能力增强图异常检测的新型数据增强框架，通过将节点拓扑属性转化为描述性自然语言，再结合门控双分支自编码器融合语义嵌入和原始节点属性，从而更有效地检测图中异常实体。

Comments 14 pages, 5 figures

详情

AI中文摘要

图异常检测（GAD）旨在识别偏离大多数的图实体，如节点、边或子结构。尽管现有文本丰富方法通常通过原始文本特征将结构上下文整合到数据表示流程中，但它们往往忽略了节点的结构上下文。这种局限性阻碍了检测由于节点固有内容与其拓扑角色之间不一致而产生的复杂异常。为此，我们提出TERGAD（用于图异常检测的结构感知文本增强表示），一种新颖的数据增强框架，通过大语言模型（LLMs）的语义推理能力增强GAD的结构语义。具体而言，TERGAD将节点层面的拓扑属性转化为描述性自然语言叙述，随后由LLM处理以获得高阶语义嵌入。这些嵌入随后通过门控双分支自编码器与原始节点属性适配融合，以共同重建图结构和节点特征。通过整合的重建误差计算异常分数，有效捕捉可观测属性和LLM引导的语义期望之间的偏差。在六个真实世界数据集上的广泛实验表明，TERGAD在性能上始终优于最先进的基线。此外，我们的消融研究验证了结构语义指导的不可或缺性和门控融合机制的有效性。代码可在https://github.com/Kantorakitty/TERGAD-main获取。

英文摘要

Graph Anomaly Detection (GAD) aims to identify atypical graph entities, such as nodes, edges, or substructures, that deviate significantly from the majority. While existing text-rich approaches typically integrate structural context into the data representation pipeline using raw textual features, they often neglect the structural context of nodes. This limitation hinders their ability to detect sophisticated anomalies arising from inconsistencies between a node's inherent content and its topological role. To bridge this gap, we propose TERGAD (Structure-aware Text-enhanced Representations for Graph Anomaly Detection), A novel data augmentation framework that enriches structural semantics for GAD via the semantic reasoning capabilities of Large Language Models (LLMs). Specifically, TERGAD translates node-level topological properties into descriptive natural language narratives, which are subsequently processed by an LLM to derive high-level semantic embeddings. These embeddings are then adaptively fused with original node attributes through a gated dual-branch autoencoder to jointly reconstruct both graph structure and node features. The anomaly score is computed based on the integrated reconstruction error, effectively capturing deviations in both observable attributes and LLM-informed semantic expectations. Extensive experiments on six real-world datasets demonstrate that TERGAD consistently outperforms state-of-the-art baselines. Furthermore, our ablation studies validate the indispensable role of structural semantic guidance and the efficacy of the gated fusion mechanism. Code is available at https://github.com/Kantorakitty/TERGAD-main.

URL PDF HTML ☆

赞 0 踩 0

2605.19735 2026-05-20 cs.CL cs.AI 版本更新

ContextRAG: Extraction-Free Hierarchical Graph Construction for Retrieval-Augmented Generation

ContextRAG: 无提取的分层图构建用于检索增强生成

Roman Prosvirnin, Sergei Kuznetsov, Seungmin Jin

发表机构 * HSE University（俄罗斯高等经济大学）

AI总结本文提出ContextRAG，一种无需大型语言模型提取实体和关系的图检索增强生成系统，通过残差量化k均值和Formal Concept Analysis方法构建模糊概念图，在130个任务的UltraDomain子集中实现了33.6%的F1分数，显著优于传统方法。

Comments Preprint. 6 tables

详情

AI中文摘要

图结构的检索增强生成（RAG）系统能够提高多跳问题的答案质量，但许多现有系统依赖大型语言模型（LLMs）在索引过程中提取实体、关系和摘要。这些调用会增加随语料库大小增长的token和时间成本。我们提出了ContextRAG，一种图RAG系统，其图拓扑结构无需LLM进行实体或关系提取。ContextRAG通过残差量化k均值和带有Lukasiewicz残余逻辑的Formal Concept Analysis，在片段嵌入上构建模糊概念图。通过软模糊连接和meet操作诱导桥状和meet衍生的上下文节点，而非LLM生成的图边。在130个任务的UltraDomain子集中，ContextRAG用30次LLM调用和22,073个token构建其索引。相比之下，一个本地HiRAG再现压力测试在20个任务子集上需要870次索引调用和3.54M个token才能在图构建过程中失败；线性外推到130个任务意味着超过23M个索引token。ContextRAG在整体上获得33.6%的F1分数，在多跳任务上获得36.8%的F1分数。激活分析显示，检索到至少一个由lattice衍生节点的前五查询在F1上比未检索到的查询高出+3.9个百分点；这种关联是诊断而非因果的。

英文摘要

Graph-structured retrieval-augmented generation (RAG) systems can improve answer quality on multi-hop questions, but many current systems rely on large language models (LLMs) to extract entities, relations, and summaries during indexing. These calls add token and wall-clock costs that grow with corpus size. We present ContextRAG, a graph RAG system whose graph topology is constructed without LLM-based entity or relation extraction. ContextRAG derives a fuzzy concept graph over chunk embeddings using residual-quantization k-means and Formal Concept Analysis with Lukasiewicz residuated logic. Bridge-like and meet-derived context nodes are induced by soft fuzzy join and meet operations, rather than by LLM-written graph edges. On a 130-task UltraDomain subset, ContextRAG builds its index with 30 LLM calls and 22,073 tokens. In contrast, a local HiRAG reproduction stress test required 870 indexing calls and 3.54M tokens on a 20-task subset before failing during graph construction; linear extrapolation to 130 tasks implies over 23M indexing tokens. ContextRAG obtains 33.6% F1 overall and 36.8% F1 on multi-hop tasks. An activation analysis shows that queries retrieving at least one lattice-derived node in the top five achieve +3.9 percentage points F1 over queries that do not; this association is diagnostic rather than causal.

URL PDF HTML ☆

赞 0 踩 0

2605.19723 2026-05-20 cs.CL cs.AI 版本更新

Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges

大型语言模型中的数学推理：基准测试、架构、评估与开放挑战

Husnain Amjad, Raja Khurram Shahzad, Aamir Shahzad, Mehwish Fatima

发表机构 * organization= School of Electrical Engineering ； Computer Science, National University of Science ； organization= School of Computing, Data ； Mathematical Sciences, Western Sydney University, Indonesia ； organization= Department of Communication, Quality Management ； Information Systems, Mid Sweden University, Östersund Campus, Sweden

AI总结本文综述了大型语言模型在数学推理方面的最新进展，通过分析数据集、架构、训练策略和评估协议，探讨了数学推理的基准测试、架构设计、评估方法以及未来的研究挑战。

详情

AI中文摘要

数学推理对于教育、科学和工业中的问题解决至关重要，是评估人工智能系统的重要基准。随着大型语言模型（LLMs）推理能力的提升，理解其在数学推理方面的表现变得越来越重要。本文综述通过结构化的数据分析集、架构、训练策略和评估协议，综合了最近在LLMs中的数学推理进展。我们的系统性回顾涵盖了大约120篇同行评审研究和预印本，探讨了该研究领域的演变，并提供了一个统一的分析框架来理解当前的进展和限制。本文特别介绍了一种统一的数学数据集分类法，区分了预训练语料库、监督微调资源和评估基准在不同推理复杂性水平上的差异。本文还系统分析了推理架构和训练策略，包括工具集成、验证器引导推理和参数高效适应，以评估其对推理鲁棒性和泛化能力的影响。此外，现有度量标准的比较评估突显了最终答案准确性与过程级推理验证之间的差距。通过综合这些领域的见解，我们的分析识别了反复出现的失败模式，如推理忠实性问题、基准偏见和泛化限制，并概述了改进符号接地、评估可靠性以及开发更稳健和可信的LLM推理系统的关键研究方向。

英文摘要

Mathematical reasoning is essential for problem-solving in education, science, and industry, serving as a crucial benchmark for evaluating artificial intelligence systems. As Large Language Models (LLMs) improve their reasoning capabilities, understanding how well they perform mathematical reasoning has become increasingly important. This survey synthesizes recent advancements in mathematical reasoning with LLMs through a structured analysis of datasets, architectures, training strategies, and evaluation protocols. Our systematic review encompasses approximately 120 peer-reviewed studies and preprints, examining the evolution of this research area and providing a unified analytical framework to understand current progress and limitations. Our study particularly introduces a unified taxonomy of mathematical datasets, distinguishing between pretraining corpora, supervised fine-tuning resources, and evaluation benchmarks across varying levels of reasoning complexity. A systematic analysis of reasoning architectures and training strategies, including tool integration, verifier-guided reasoning, and parameter-efficient adaptation, is presented to assess their effects on reasoning robustness and generalization. Moreover, a comparative evaluation of existing metrics highlights the gap between final-answer accuracy and process-level reasoning verification. By synthesizing insights across these areas, our analysis identifies recurring failure modes, such as reasoning faithfulness issues, benchmark biases, and generalization limitations, and outlines key research directions toward improving symbolic grounding, evaluation reliability, and the development of more robust and trustworthy LLM-based reasoning systems.

URL PDF HTML ☆

赞 0 踩 0

2605.19722 2026-05-20 cs.CR cs.AI 版本更新

Measuring Safety Alignment Effects in Autonomous Security Agents

在自主安全代理中测量安全对齐效应

Isaac David, Arthur Gervais

发表机构 * University College London（伦敦大学学院）

AI总结本文提出了一种基于轨迹的基准测试，用于评估安全代理在执行漏洞分析任务时的安全对齐效果，发现安全代理的性能差异主要体现在拒绝、不安全行为和工具可靠性等方面，而非单纯的拒绝率。

详情

AI中文摘要

当安全对齐的语言模型及其未经审查或删除的衍生版本作为自主安全代理运行时，它们的行为是否不同？单轮拒绝基准无法回答这个问题：安全代理必须检查仓库、调用工具并在授权的沙箱中生成漏洞证据。我们提出了一个包含30个本地漏洞分析任务的基于轨迹的基准测试，这些任务具有固定的工具、确定的成功谓词、擦除规则和基础检查，并将四种标准模型与未经审查或删除的衍生版本进行比较：Gemma 4 31B、Gemma 4 26B A4B、Qwen2.5-Coder 7B和Llama 3.1 8B。该成果包含1,500个安全代理轨迹和800个非安全控制轨迹。Gemma配对显示在安全任务中具有较大的限制减少收益：31B的成功率从14.0%降至0.7%，26B的成功率从10.7%降至0.0%，同时具有更高的平均基础性（3.91 vs 3.27和4.12 vs 1.64，满分5分）以及0.0%的拒绝、压制行为和不安全行为率。然而，控制和非Gemma配对排除了干净的安全特定或普遍的限制减少效应：Gemma的差距也出现在普通编码任务中，Qwen2.5-Coder在限制减少衍生版本中的成功率较低（2.0% vs 5.3%），而删除的Llama衍生版本未能通过工具协议。在所有家族中，硬证明触发和补丁验证任务仍无法解决。这些结果表明，自主安全代理中的安全对齐效应应在系统层面进行测量，将拒绝、不安全行为、工具可靠性和证据基础性分开，而不是将拒绝率作为安全信号。

英文摘要

Do stock safety-aligned language models and their uncensored or abliterated derivatives behave differently when run as autonomous security agents? Single-turn refusal benchmarks cannot answer this question: security agents must inspect repositories, call tools, and produce vulnerability evidence inside authorized sandboxes. We present a trace-based benchmark of 30 local vulnerability-analysis tasks with fixed tools, deterministic success predicates, redaction rules, and grounding checks, and compare four stock models against uncensored or abliterated derivatives: Gemma 4 31B, Gemma 4 26B A4B, Qwen2.5-Coder 7B, and Llama 3.1 8B. The artifact contains 1,500 security-agent traces and 800 non-security control traces. The Gemma pairs show large less-restricted gains on security tasks: 14.0% versus 0.7% success for 31B and 10.7% versus 0.0% for 26B, with higher mean grounding (3.91 versus 3.27 and 4.12 versus 1.64 out of five) and 0.0% refusal, suppressed-action, and unsafe-action rates in the 31B traces. However, controls and non-Gemma pairs rule out a clean security-specific or universal less-restricted effect: Gemma gaps also appear on ordinary coding tasks, Qwen2.5-Coder success is lower for the less-restricted derivative (2.0% versus 5.3%), and the abliterated Llama derivative fails the tool protocol. Across all families, hard proof-of-trigger and patch-verification tasks remain unsolved. These results show that safety alignment effects in autonomous security agents should be measured at the system level, separating refusal, unsafe action, tool reliability, and evidence grounding rather than treating refusal rate as the safety signal.

URL PDF HTML ☆

赞 0 踩 0

2605.19721 2026-05-20 cs.AI cs.LG cs.NI 版本更新

Projecting Latent RL Actions: Towards Generalizable and Scalable Graph Combinatorial Optimization

投影潜在RL动作：面向通用化和可扩展的图组合优化

Franco Terranova, Guillermo Bernardez, Albert Cabellos-Aparicio, Nina Miolane, Abdelkader Lahmadi

发表机构 * Université de Lorraine, CNRS, Inria, LORIA（洛林大学、国家科学研究中心、法国国家信息与自动化研究所、LORIA实验室）； University of California Santa Barbara（加州圣芭芭拉大学）； Universitat Politècnica de Catalunya（加泰罗尼亚理工大学）

AI总结本文提出了一种新的RL-GCO方法，通过在连续GNN动作嵌入空间中直接操作，实现高效的图组合优化解算，提升了通用性和可扩展性。

Comments Preprint

详情

AI中文摘要

图组合优化（GCO）因其在许多NP难问题中的自然图表示而受到越来越多的关注，但其组合爆炸使得精确方法在计算上不可行。最近的强化学习（RL）与图神经网络（GNN）的结合显著改进了基于学习的GCO求解器。然而，现有方法在跨不同图实例的泛化能力和随着动作空间增长的计算可扩展性方面存在局限。为了解决这两个挑战，我们引入了投影代理，一种新颖的RL-GCO方法，直接在连续的GNN动作嵌入空间中操作，通过单次前向传递预测所需潜在动作，并随后将其解码为有效的离散动作。此外，我们通过为观察和动作提供共享的嵌入空间，实现了RL方法之间的公平比较。在多样化的基准测试中，我们的方法在推理速度上达到现有解决方案的16.2倍，泛化能力提升40%，同时为具有多个相互依赖变量的超线性决策空间中的强大RL性能打开了大门。最后，我们发布了LaGCO-RL，一个Python库，自动化潜在动作空间的构建并支持现有RL-GCO解决方案，促进可重复性和适应新GCO基准。

可访问性能力边界：AI生成浏览器原生可访问性系统的操作极限与扩展潜力

Rizwan Jahangir, Daisuke Ishii

发表机构 * NUST Business School, NUST（NUST商学院，NUST）； Kiara Inc.（Kiara公司）

AI总结本文提出可访问性能力边界（ACB）理论框架，探讨AI生成浏览器原生可访问性系统在操作极限和扩展潜力方面的核心问题，并通过实证原型分析，定义了可访问性能力空间中的可达区域和不可达区域，为自主可访问性计算的可扩展性提供了理论基础。

Comments 21 pages, 4 figures

详情

AI中文摘要

随着大型语言模型（LLMs）在合成功能性用户界面方面的能力不断增强，可访问性计算领域出现了一个基本问题：AI驱动的可访问性系统能走多远？本文引入了可访问性能力边界（ACB），这是一个用于推理自主可访问性系统操作极限和扩展潜力的正式框架，并基于现实世界系统构件进行了理论构建。我们不将可访问性视为二元合规属性，而是将其视为受可测量变量约束的动态、多维能力空间，包括部署延迟、认知负荷、基础设施依赖性、离线持久性、交互复杂性和适应性等变量。我们论证了由单文件HTML构件构建的AI生成浏览器原生系统，利用标准浏览器API，可能通过将部署摩擦降至接近零，从而大幅扩展ACB。我们通过正式定义、命题和比较评估矩阵，定义了此类系统所能和无法达到的可访问性能力空间区域。我们进一步识别了剩余的计算、基础设施和验证约束，这些构成了该范式的硬边界。本文为理解自主可访问性计算的可扩展性极限提供了理论基础，并提出了未来在可访问性感知AI系统中的研究议程。

英文摘要

As large language models (LLMs) demonstrate increasing competence in synthesizing functional user interfaces, a fundamental question emerges in accessibility computing: \textit{how far can AI-driven accessibility systems go?} This paper introduces the \textit{Accessibility Capability Boundary} (ACB), a formal framework for reasoning about the operational limits and expansion potential of autonomous accessibility systems, and grounds this theory in a real-world systems artifact. We model accessibility not as a binary compliance property but as a dynamic, multidimensional capability space constrained by measurable variables including deployment latency, cognitive load, infrastructure dependency, offline persistence, interaction complexity, and adaptability. We argue that AI-generated, browser-native systems constructed as single-file HTML artifacts leveraging standard browser APIs may dramatically shift the ACB outward by reducing deployment friction to near-zero and enabling rapid, context-specific interface adaptation. We ground our theoretical framework in the analysis of two real-world exploratory prototypes. The first is an AI-generated browser-native accessibility interface deployed for a blind user in Nepal. The second is a fully functional, open-source webcam alignment assistant for visually impaired users, serving as a concrete systems artifact. Through formal definitions, propositions, and a comparative evaluation matrix, we characterize the regions of the accessibility capability space that such systems can and cannot reach. We further identify remaining computational, infrastructural, and verification constraints that constitute the hard boundaries of this paradigm. This work contributes a theoretical foundation for understanding the scalable limits of autonomous accessibility computing and proposes a research agenda for future work in accessibility-aware AI systems.

URL PDF HTML ☆

赞 0 踩 0

2605.19634 2026-05-20 cs.CV cs.AI 版本更新

P2DNav: Panorama-to-Downview Reasoning for Zero-shot Vision-and-Language Navigation

P2DNav: 全景到俯视视角的零样本视觉-语言导航

Kai Sheng, Liuyi Wang, Haojie Dai, Jinlong Li, Yongrui Qin, Zongtao He, Chengju Liu, Qijun Chen

发表机构 * Department of Control Science and Engineering, Tongji University（控制科学与工程系，同济大学）

AI总结本文提出P2DNav框架，通过全景到俯视视角的分解、滑动窗口对话记忆和反思重新定位机制，解决零样本视觉-语言导航中的方向推理与局部定位问题，实验表明其在R2R-CE基准上性能优异。

详情

AI中文摘要

视觉-语言导航（VLN）要求一个具身代理将自然语言指令转化为可执行的导航动作，以应对未见环境。现有零样本方法通常依赖额外的航点预测模块，这些模块往往将高层方向推理与细粒度局部定位纠缠在一起，导致决策错误且不稳定。在本文中，我们提出P2DNav，一种用于零样本视觉-语言导航的分层框架。P2DNav包含三个核心组件：全景到俯视（P2D）、滑动窗口对话记忆（SDM）和反思重新定位机制（RRM）。P2D明确将导航决策分解为两个阶段：全景方向选择和俯视局部定位。它首先从360°全景中选择与指令相关的方向，然后从该方向的俯视RGB观察中预测像素级目标点。此外，SDM将导航历史组织为多轮对话上下文，并在滑动窗口内维护最近的视觉观察以支持长距离导航。RRM进一步通过评估局部定位的可靠性基于俯视观察，并在必要时返回全景方向选择。在R2R-CE基准上的实验表明，P2DNav在零样本方法中表现强劲。特别是，与最先进的（SOTA）零样本航点基于和航点自由方法相比，P2DNav在SR方面分别获得了146.6%和58.9%的提升，证明了P2D、SDM和RRM在零样本VLN中的有效性。代码将向公众发布。

英文摘要

Vision-and-language navigation (VLN) requires an embodied agent to ground natural-language instructions into executable navigation actions in unseen environments. Existing zero-shot methods typically rely on additional waypoint prediction modules, which often entangle high-level directional reasoning with fine-grained local grounding, leading to error-prone and unstable decisions. In this paper, we propose P2DNav, a hierarchical framework for zero-shot vision-and-language navigation. P2DNav consists of three core components: Panorama-to-Downview (P2D), Sliding-Window Dialogue Memory (SDM), and Reflective Reorientation Mechanism (RRM). P2D explicitly decomposes navigation decision-making into two stages: panoramic direction selection and downview local grounding. It first selects the instruction-relevant direction from a 360° panorama, and then predicts a pixel-level target point from the downview RGB observation in that direction. In addition, SDM organizes navigation history as a multi-turn dialogue context and maintains recent visual observations within a sliding window to support long-horizon navigation. RRM further enables reflective reorientation by assessing the reliability of local grounding based on the downview observation and returning to panoramic direction selection when necessary. Experiments on the R2R-CE benchmark show that P2DNav achieves strong performance among zero-shot methods. In particular, compared with the state-of-the-art (SOTA) zero-shot waypoint-based and waypoint-free methods, P2DNav achieves SR gains of 146.6% and 58.9%, respectively, demonstrating the effectiveness of P2D, SDM, and RRM for zero-shot VLN. Code will be released for public use.

URL PDF HTML ☆

赞 0 踩 0

2605.19633 2026-05-20 cs.CL cs.AI cs.LG cs.NE cs.SE 版本更新

optimize_anything: A Universal API for Optimizing any Text Parameter

optimize_anything: 一个用于优化任何文本参数的通用API

Lakshya A Agrawal, Donghyun Lee, Shangyin Tan, Wenjie Ma, Karim Elmaaroufi, Rohit Sandadi, Sanjit A. Seshia, Koushik Sen, Dan Klein, Ion Stoica, Joseph E. Gonzalez, Omar Khattab, Alexandros G. Dimakis, Matei Zaharia

发表机构 * MIT（麻省理工学院）

AI总结本文提出了一种基于LLM的通用优化系统，能够跨不同领域实现文本参数的优化，展示了其在六个多样化任务中的state-of-the-art性能，通过多任务搜索和跨问题迁移实现了高效的优化。

Comments 16 pages, 11 figures; Blog: https://gepa-ai.github.io/gepa/blog/2026/02/18/introducing-optimize-anything/

Journal ref Proceedings of the ACM Conference on AI and Agentic Systems (CAIS 26), May 26-29, 2026, San Jose, CA, USA

详情

DOI: 10.1145/3786335.3813167

AI中文摘要

能否一个基于LLM的优化系统在根本不同的领域中匹配专门工具？我们证明当优化问题被表述为改进一个通过评分函数评估的文本工件时，一个基于AI的优化系统—支持单任务搜索、多任务搜索和跨问题迁移以及对未见过的输入进行泛化—在六个不同的任务中实现了state-of-the-art的结果。我们的系统发现了将Gemini Flash的ARC-AGI准确性几乎提高三倍的代理架构（32.5%到89.5%），发现了将云成本降低40%的调度算法，生成了87%匹配或超过PyTorch的CUDA内核，并优于AlphaEvolve报告的圆圈打包解决方案（n=26）。在三个领域的消融研究揭示了可操作的侧信息比仅评分反馈更快收敛且最终得分更高，且多任务搜索在同等问题预算下通过跨任务迁移优于独立优化。共同，我们首次展示了基于LLM搜索的文本优化是一种通用问题解决范式，将传统需要领域特定算法的任务统一到一个框架下。我们开源了optimize_anything，并支持多个后端作为GEPA项目的一部分，在https://github.com/gepa-ai/gepa上。

英文摘要

Can a single LLM-based optimization system match specialized tools across fundamentally different domains? We show that when optimization problems are formulated as improving a text artifact evaluated by a scoring function, a single AI-based optimization system-supporting single-task search, multi-task search with cross-problem transfer, and generalization to unseen inputs-achieves state-of-the-art results across six diverse tasks. Our system discovers agent architectures that nearly triple Gemini Flash's ARC-AGI accuracy (32.5% to 89.5%), finds scheduling algorithms that cut cloud costs by 40%, generates CUDA kernels where 87% match or beat PyTorch, and outperforms AlphaEvolve's reported circle packing solution (n=26). Ablations across three domains reveal that actionable side information yields faster convergence and substantially higher final scores than score-only feedback, and that multi-task search outperforms independent optimization given equivalent per-problem budget through cross-task transfer, with benefits scaling with the number of related tasks. Together, we show for the first time that text optimization with LLM-based search is a general-purpose problem-solving paradigm, unifying tasks traditionally requiring domain-specific algorithms under a single framework. We open-source optimize\_anything with support for multiple backends as part of the GEPA project at https://github.com/gepa-ai/gepa .

URL PDF HTML ☆

赞 0 踩 0

2605.19630 2026-05-20 cs.AI 版本更新

EMO-BOOST: Emotion-Augmented Audio-Visual Features for Improved Generalization in Deepfake Detection

EMO-BOOST：情感增强的音频视觉特征用于深度伪造检测中的泛化改进

Aritra Marik, Marcel Klemt, Anna Rohrbach

发表机构 * Technical University of Darmstadt（达姆施塔特技术大学）； ELIZA

AI总结本文提出EMO-BOOST框架，通过融合传统RGB和声学聚焦检测器与基于情感的EmoForensics检测器，利用高阶语义线索提升深度伪造检测的泛化能力，实验显示在FakeAVCeleb数据集上平均跨操纵泛化AUC提升了2.1%。

Comments Accepted at SAFE@CVPRW 2026

详情

AI中文摘要

随着生成式AI模型的不断发展，取证学正面临越来越大的压力。新的生成技术不断出现，使得无法为每种操纵收集数据来训练深度伪造检测模型。因此，将模型泛化到训练期间未见过的深度伪造类型是当前深度伪造检测研究中的主要挑战之一。为解决这一挑战，我们采用了高层语义线索，并认为这些线索可以支持低层聚焦方法在泛化到未见操纵类型时发挥作用。在本研究中，我们研究了情感作为高层语义线索。我们提出了EMO-BOOST，一种多模态深度伪造检测框架，该框架融合了传统RGB和声学聚焦深度伪造检测器与我们基于情感的深度伪造检测器EmoForensics。EmoForensics利用视觉和音频情感识别模块，并在音频视频流中建模内在和跨模态的时间一致性。我们发现EmoForensics和低层聚焦方法捕获了互补的信号。因此，在EMO-BOOST中结合这两种信号，使在FakeAVCeleb数据集上的平均跨操纵泛化AUC提高了2.1%。

英文摘要

With every advancement in generative AI models, forensics is under increasing pressure. The constant emergence of new generation techniques makes it impossible to collect data for each manipulation to train a deepfake detection model. Thus, generalizing to deepfakes unseen during training is one of the major challenges in current deepfake detection research. To tackle this challenge, we employ high-level semantic cues and argue that these cues can support low-level focused approaches in generalizing to unseen types of manipulations. In this work, we study emotions as a high-level semantic cue. We propose Emo-Boost, a multimodal deepfake detection framework that fuses an off-the-shelf RGB- and acoustic-focused deepfake detector with our emotion-based deepfake detector EmoForensics. EmoForensics utilises vision and audio emotion recognition modules and models intra- and inter-modal temporal consistency in emotion representations from an audio-visual stream. We found that EmoForensics and the low-level focused method capture complementary signals. Consequently, combining both signals in EmoBoost enhances the average cross-manipulation generalization AUC by 2.1% on FakeAVCeleb.

URL PDF HTML ☆

赞 0 踩 0

2605.19619 2026-05-20 cs.LG cs.AI math.OC stat.ML 版本更新

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

MiMuon: 一种具有改进泛化能力的混合穆恩优化器用于大模型

Feihu Huang, Yuning Luo, Songcan Chen

发表机构 * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics（南京航空航天大学计算机科学与技术学院）； MIIT Key Laboratory of Pattern Analysis and Machine Intelligence（信息科技部模式分析与机器智能重点实验室）； College of Design and Engineering, National University of Singapore（新加坡国立大学设计与工程学院）

AI总结本文研究了穆恩优化器的泛化误差，提出了一种改进的混合穆恩优化器MiMuon，证明其泛化误差更低，同时保持了与穆恩优化器相同的收敛速度。

Comments 25 pages

详情

AI中文摘要

矩阵结构的参数在许多人工智能模型中频繁出现，例如大语言模型。最近，为大规模模型的矩阵参数设计了一种高效的穆恩优化器，其收敛速度明显快于向量级算法。尽管一些工作已经开始研究穆恩优化器的收敛性质（即优化误差），但其泛化性质（即泛化误差）尚未建立。因此，在本文中，我们基于算法稳定性与数学归纳法研究穆恩优化器的泛化误差，并证明穆恩优化器的泛化误差为O(1/(Nκ^T))，其中N为训练样本数量，T表示迭代次数，κ>0表示梯度估计奇异值之间的最小差。为了增强穆恩优化器的泛化能力，我们通过谨慎使用梯度的正交化，提出了一种有效的混合穆恩（MiMuon）优化器，该优化器是穆恩优化器与基于动量的SGD优化器的混合。然后我们证明我们的MiMuon优化器的泛化误差比穆恩优化器的O(1/(Nκ^T))更低，因为κ通常非常小。同时，我们还研究了我们MiMuon算法的收敛性质，并证明我们的MiMuon算法具有与穆恩算法相同的收敛速度O(1/T^{1/4})。在训练大模型（包括Qwen3-0.6B和YOLO26m）的一些数值实验结果中展示了MiMuon优化器的效率。

英文摘要

Matrix-structured parameters frequently appear in many artificial intelligence models such as large language models. More recently, an efficient Muon optimizer is designed for matrix parameters of large-scale models, and shows markedly faster convergence than the vector-wise algorithms. Although some works have begun to study convergence properties (i.e., optimization error) of the Muon optimizer, its generalization properties (i.e., generalization error) is still not established. Thus, in this paper, we study generalization error of the Muon optimizer based on algorithmic stability and mathematical induction, and prove that the Muon has a generalization error of $O\big(\frac{1}{Nκ^{T}}\big)$, where $N$ is training sample size, and $T$ denotes iteration number, and $κ>0$ denotes minimum difference between singular values of gradient estimate. To enhance generalization of the Muon, we propose an effective mixed Muon (MiMuon) optimizer by cautiously using orthogonalization of gradient, which is a hybrid of Muon and momentum-based SGD optimizers. Then we prove that our MiMuon optimizer has a lower generalization error of $O\big(\frac{1}{N}\big)$ than $O\big(\frac{1}{Nκ^{T}}\big)$ of Muon optimizer, since $κ$ generally is very small. Meanwhile, we also studied the convergence properties of our MiMuon algorithm, and prove that our MiMuon algorithm has the same convergence rate of $O(\frac{1}{T^{1/4}})$ as the Muon algorithm. Some numerical experimental results on training large models including Qwen3-0.6B and YOLO26m demonstrate efficiency of the MiMuon optimizer.

URL PDF HTML ☆

赞 0 踩 0

2605.19607 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Spectral Integrated Gradients for Coarse-to-Fine Feature Attribution

基于谱积分梯度的粗到细特征归因

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

发表机构 * Korea Advanced Institute of Science and Technology（韩国科学技术院）； INEEJI Corp.（INEEJI公司）

AI总结本文提出Spectral Integrated Gradients（SIG）方法，通过奇异值分解构建积分路径，以减少噪声并提高特征归因的准确性，优于传统路径基方法。

Comments 21 pages, 13 figures, 9 tables. Accepted to ACM KDD 2026; includes appendix

详情

AI中文摘要

积分梯度（IG）是一种广泛采用的特征归因方法，满足理想的公理性质。然而，积分路径的选择显著影响归因质量，标准直线路径同时引入所有输入特征，通常在途中积累噪声梯度。为解决这一限制，我们提出了Spectral Integrated Gradients，通过基线到输入差异的奇异值分解（SVD）构建积分路径。通过逐步激活奇异成分，从最大到最小，SIG在引入全局结构之前引入细粒度细节，自然遵循粗到细的进程。通过在多种图像分类数据集上的广泛评估，我们证明SIG生成的归因图更干净，噪声更少，并在定量性能上优于现有基于路径的归因方法。我们的代码可在https://github.com/leekwoon/sig/上获得。

英文摘要

Integrated Gradients (IG) is a widely adopted feature attribution method that satisfies desirable axiomatic properties. However, the choice of integration path significantly affects the quality of attributions, and the standard straight-line path introduces all input features simultaneously, often accumulating noisy gradients along the way. To address this limitation, we propose Spectral Integrated Gradients, which constructs integration paths based on singular value decomposition (SVD) of the baseline-to-input difference. By progressively activating singular components from largest to smallest, SIG introduces global structure before fine-grained details, naturally following a coarse-to-fine progression. Through extensive evaluation across diverse image classification datasets, we demonstrate that SIG produces cleaner attribution maps with reduced noise and achieves improved quantitative performance compared to existing path-based attribution methods. Our code is available at https://github.com/leekwoon/sig/.

URL PDF HTML ☆

赞 0 踩 0

2605.19604 2026-05-20 cs.AI 版本更新

Formal Skill: Programmable Runtime Skills for Efficient and Accurate LLM Agents

形式技能：用于高效且准确LLM代理的可编程运行时技能

Xi Zhang, Meijun Gao, Yuntian Zhao, Xinyu Tan, Yilun Yao, Feiyu Wang, Yanshu Wang, Dingsiyi, Tong Yang

发表机构 * FairyClaw

AI总结本文提出形式技能，一种用于LLM代理的可编程运行时技能抽象，通过JSON元数据和动作模式、可靠的Python执行器、受钩子控制的控制逻辑、形式技能路由和本地运行时状态，提高代理的效率和准确性。

详情

AI中文摘要

大型语言模型（LLM）代理越来越多地在真实工作空间中发挥作用，其中工具和技能决定了模型推理是否能够可靠地转化为行动。现有的技能仍然主要非正式：Markdown技能和指令包将过程编码为长自然语言文档，而函数调用、模型上下文协议（MCP）服务器和框架工具则结构化单个动作，但通常将工作流状态、政策执行和完成纪律排除在技能本身之外。我们引入了形式技能，一种运行时原生的抽象，它通过JSON元数据和动作模式、可靠的Python执行器、受钩子控制的控制逻辑、形式技能路由和本地运行时状态来表示可重用的能力。通过将可重用的过程从重复的提示文本中转移到可执行的状态机和钩子策略中，形式技能为代理提供了一个令牌高效且可执行的控制面。我们在FairyClaw中实现了该抽象，这是一个开源的事件驱动运行时，用于可执行、可观察和可组合的形式技能。在Harness-Bench上，FairyClaw获得了高度竞争的平均分数，同时使用显著更少的令牌，尤其在暴露形式技能作用的任务上表现尤为突出。

英文摘要

Large Language Model (LLM) agents increasingly act inside real workspaces, where tools and skills determine whether model reasoning becomes reliable action. Existing skills remain largely informal: Markdown skills and instruction packs encode procedures as long natural-language documents, while function calling, Model Context Protocol (MCP) servers, and framework tools structure individual actions but usually leave workflow state, policy enforcement, and completion discipline outside the skill itself. We introduce Formal Skill, a runtime-native abstraction that represents reusable capability with JSON metadata and action schemas, reliable Python executors, hook-governed control logic, Formal Skill routing, and skill-local runtime state. By moving reusable procedure from repeated prompt text into executable state machines and hook policies, Formal Skill gives agents a token-efficient and enforceable control surface. We implement the abstraction in FairyClaw, an open-source event-driven runtime for executable, observable, and composable Formal Skills. On Harness-Bench, FairyClaw obtains highly competitive average scores while using substantially fewer tokens, with especially strong results on tasks that expose the role of Formal Skill.

URL PDF HTML ☆

赞 0 踩 0

2605.19595 2026-05-20 cs.CV cs.AI 版本更新

A novel YOLO26-MoE optimized by an LLM agent for insulator fault detection considering UAV images

一种由LLM代理优化的YOLO26-MoE新型模型用于考虑无人机图像的绝缘子故障检测

João Pedro Matos-Carvalho, Laio Oriel Seman, Stefano Frizzo Stefenon, Mohammad Khalaf Mohammad Khreasat, Gabriel Villarrubia González

发表机构 * Department of Automation and Systems Engineering, Federal University of Santa Catarina, Florianópolis, Brazil（自动化与系统工程系，圣卡塔琳娜联邦大学，巴西弗洛里安波利斯）； Applications Lab, Faculty of Science, University of Salamanca, Plaza de los Caídos s/n, 37008 Salamanca, Spain（应用实验室，科学学院，萨拉曼卡大学，西班牙萨拉曼卡）

AI总结本文提出一种优化的YOLO26-MoE模型，通过在YOLO26检测器的高分辨率分支中集成稀疏的混合专家（MoE）模块，以适应细微和多样的故障模式，同时保持单阶段检测框架的效率，利用LLM代理进行超参数优化，最终在无人机图像上实现了99.00 mAP@0.5和95.15 mAP@0.5:0.95的性能，优于最新版本的YOLO。

详情

AI中文摘要

电力线路绝缘子的检查对于确保电网可靠性和防止因损坏或退化的绝缘组件引起的故障至关重要。近年来，结合深度学习视觉系统的无人机（UAV）已成为自动化此过程的有效解决方案。然而，由于缺陷区域小、故障模式异质性、复杂背景和变化的成像条件，绝缘子故障检测仍具挑战性。为解决这些挑战，本文提出了一种优化的YOLO26-MoE模型，一种新的目标检测架构，其在YOLO26检测器的高分辨率分支中集成了稀疏的混合专家（MoE）模块。所提出的修改使模型能够适应细微和多样的故障模式，同时保持单阶段检测框架的效率。超参数优化、最终训练和评估通过工具增强的大型语言模型（LLM）代理协调。所提出的模型实现了0.9900 mAP@0.5和0.9515 mAP@0.5:0.95的性能，优于最新版本的YOLO。这些结果表明，所提出的模型为基于无人机的绝缘子故障检测提供了一种有效且可靠的解决方案。

英文摘要

The inspection of electrical power line insulators is essential for ensuring grid reliability and preventing failures caused by damaged or degraded insulation components. In recent years, Unmanned Aerial Vehicles (UAVs) combined with deep learning-based vision systems have emerged as an effective solution for automating this process. However, insulator fault detection remains challenging due to small defect regions, heterogeneous fault patterns, complex backgrounds, and varying imaging conditions. To address these challenges, this paper proposes an optimized YOLO26-MoE, a novel object detection architecture that integrates a sparse Mixture-of-Experts (MoE) module into the high-resolution branch of the YOLO26 detector. The proposed modification enables adaptive feature refinement for subtle and diverse fault patterns while preserving the efficiency of a one-stage detection framework. Hyperparameter optimization, final training, and evaluation were coordinated through a tool-augmented Large Language Model (LLM) agent. The proposed model achieved 0.9900 mAP@0.5 and 0.9515 mAP@0.5:0.95, outperforming the latest YOLO versions. These results demonstrate that the proposed model provides an effective and reliable solution for UAV-based insulator fault detection.

URL PDF HTML ☆

赞 0 踩 0

2605.19593 2026-05-20 cs.AI cs.DC 版本更新

Towards Multi-Model LLM Schedulers: Empirical Insights into Offloading and Preemption

迈向多模型LLM调度器：关于卸载和抢占的实证洞察

Mert Yildiz, Pietro Spadaccino, Alexey Rolich, Francesca Cuomo, Andrea Baiocchi

发表机构 * Sapienza University of Rome（罗马大学）

AI总结本文通过实证研究探讨了不同LLM在不同硬件平台上的行为，重点分析了层卸载和抢占对性能的影响，揭示了卸载和抢占对解码吞吐量的非线性影响以及其在不同模型和硬件平台上的差异，为设计高效的多模型LLM服务系统提供了指导。

Comments The 2026 Mediterranean Artificial Intelligence and Networking Conference (MAIN 2026)

详情

AI中文摘要

现代大型语言模型（LLM）的部署越来越需要在共享异构硬件上服务具有不同架构、规模和专业化的多个模型。这种设置对资源分配、调度和调度提出了新的挑战，特别是在GPU内存受限的情况下，部分CPU-GPU卸载和抢占成为必要。尽管现有系统主要优化单一模型的吞吐量，但较少工作在这些条件下处理多模型调度。本文通过实证研究探讨了不同LLM在不同硬件平台上的行为，重点分析了层卸载和抢占对性能的影响。我们发现，卸载导致解码吞吐量显著非线性下降，较小的模型对减少GPU驻留时间更敏感。我们进一步证明，抢占带来了显著的开销，主要由模型状态重新加载而非键值缓存传输主导，并且这种成本在不同模型和硬件平台上差异显著。此外，我们还强调了序列长度和互连带宽在放大数据移动和执行效率低下方面的作用。基于这些发现，我们识别出未来调度器必须考虑的关键特性，包括模型特定的卸载敏感性、工作负载特征以及抢占和数据传输的成本结构。这些见解为设计下一代能够高效管理异构、多模型工作负载的LLM服务系统提供了指导。

英文摘要

Modern deployments of Large Language Models (LLMs) increasingly require serving multiple models with diverse architectures, sizes, and specialization on shared, heterogeneous hardware. This setting introduces new challenges for resource allocation, dispatching, and scheduling, particularly under GPU memory constraints where partial CPU-GPU offloading and preemption become necessary. While existing systems primarily optimize throughput for a single model, comparatively little work addresses multi-model scheduling under these conditions. In this paper, we present an empirical study of how different LLMs behave across hardware platforms, focusing on the performance implications of layer offloading and preemption. We show that offloading leads to strongly non-linear and model-dependent degradation in decode throughput, with smaller models exhibiting sharper sensitivity to reduced GPU residency. We further demonstrate that preemption incurs substantial overhead, largely dominated by model state reload rather than key-value cache transfer, and that this cost varies significantly across models and hardware platforms. Additionally, we highlight the role of sequence length and interconnect bandwidth in amplifying data movement and execution inefficiencies. Based on these findings, we identify a set of key features that future schedulers must consider, including model-specific offloading sensitivity, workload characteristics, and the cost structure of preemption and data transfer. These insights provide guidance for the design of next-generation LLM serving systems capable of efficiently managing heterogeneous, multi-model workloads with hybrid CPU-GPU execution.

URL PDF HTML ☆

赞 0 踩 0

2605.19592 2026-05-20 cs.RO cs.AI 版本更新

Implicit Action Chunking for Smooth Continuous Control

隐式动作分块用于平滑连续控制

Bosun Liang, Shuo Pei, Zirui Chen, Chuanzhi Fan, Chen Sun, Yuankai Wu, Huachun Tan, Yong Wang

发表机构 * Department of Data and Systems Engineering, The University of Hong Kong, Hong Kong SAR, China（香港大学数据与系统工程系）； Beijing Institute of Technology, Zhuhai, China（北京理工大学珠海学院）； College of Computer Science, Sichuan University, Chengdu, China（四川大学计算机学院）

AI总结本文提出了一种隐式动作分块框架Dual-Window Smoothing (DWS)，用于实现平滑的连续控制。该方法通过双窗口设计，在不扩展动作空间的情况下，确保物理平滑性和时间差分目标的一致性，从而解决传统显式动作分块方法的优化困难和与标准逐步交互不兼容的问题。

详情

AI中文摘要

强化学习常常产生高频振荡的控制信号，这会破坏物理部署所需的安全性和稳定性。显式动作分块通过预测固定时间跨度的轨迹来解决这个问题，但会按时间跨度长度成比例地扩展策略输出维度，导致优化困难和与标准逐步交互不兼容。为克服这些挑战，本文提出了Dual-Window Smoothing (DWS)，一种隐式动作分块框架用于平滑连续控制。与显式方法不同，DWS通过确定性调制确保时间一致性，而不扩展动作空间。它采用双窗口设计：一个执行窗口通过确定性调制确保物理平滑，一个价值窗口在时间差分目标上对时间跨度进行对齐，以纠正由于开环执行导致的批评者偏差。DWS还包含一个轻量级的演员侧时间正则化器，基于一阶动作差异，以促进全局连续性。该设计有效地弥合了时间抽象与反应式逐步控制之间的差距。在包括DeepMind控制套件和工业能源管理任务在内的基准测试中，DWS优于最先进的（SOTA）基线。在复杂的基于视觉的自动驾驶任务中，DWS实现了更平滑的控制，更安全的行为，减少了抖动，并达到了100%的成功率。

英文摘要

Reinforcement learning often produces high-frequency oscillatory control signals that undermine the safety and stability required for physical deployment. Explicit action chunking addresses this by predicting fixed-horizon trajectories but scales the policy output dimension proportionally with the horizon length, leading to optimization difficulties and incompatibility with standard step-wise interaction. To overcome these challenges, this paper proposes Dual-Window Smoothing (DWS), an implicit action chunking framework for smooth continuous control. Unlike explicit methods, DWS enforces temporal coherence without expanding the action space. It uses a dual-window design: an execution window that ensures physical smoothness through deterministic modulation, and a value window that aligns temporal-difference targets over the horizon to correct critic bias caused by open-loop execution. DWS also includes a lightweight actor-side temporal regularizer based on first-order action differences to promote global continuity. This design effectively bridges the gap between temporal abstraction and reactive step-wise control. Experiments on benchmarks including the DeepMind Control Suite and industrial energy management tasks show that DWS outperforms state-of-the-art (SOTA) baselines. In complex vision-based autonomous driving tasks, DWS achieves smoother control, safer behavior with reduced jitter, and attains a 100% success rate.

URL PDF HTML ☆

赞 0 踩 0

2605.19587 2026-05-20 cs.AI 版本更新

SceneCode: Executable World Programs for Editable Indoor Scenes with Articulated Objects

SceneCode: 可执行的世界程序用于可编辑的室内场景及具有关节物体

Puyi Wang, Yuhao Wang, Linjie Li, Zhengyuan Yang, Kevin Qinghong Lin, Yangguang Li, Yu Cheng

发表机构 * The Chinese University of Hong Kong（香港中文大学）； Shanghai Jiao Tong University（上海交通大学）； Shanghai AI Laboratory（上海人工智能实验室）； Microsoft（微软）； University of Oxford（牛津大学）

AI总结本文提出SceneCode，一种通过可执行程序生成可编辑的室内场景，解决了现有方法中物体结构控制不足的问题，提升了场景生成的精确性和可交互性。

详情

AI中文摘要

室内场景合成是具身AI、机器人操作和基于模拟的策略评估的基础，其中有用的场景不仅需要定义环境的外观，还需要定义其物体的结构。然而，现有流程通常将生成内容表示为静态网格，并且只能从经过筛选的资产库中继承关节性，这限制了物体级别的可控性，并阻止了按需生成新的可交互资产。我们通过将物理上可交互的室内场景合成制定为程序化世界生成，提出SceneCode框架，该框架将自然语言提示编译成可执行的代码驱动的室内世界，而不是一组不透明的网格。一个房间级别的智能核心首先将提示转换为结构化的房屋布局，并通过规划-设计-批评循环发出每个物体的AssetRequests。每个请求随后被路由到五个代码生成策略之一，并转换为合成的分步Blender Python程序，这些程序通过执行引导的修复和优化循环进行验证。生成的程序被编译成模拟准备的资产，并导出为SDF用于物理模拟。一个持久的场景状态注册表将物体请求、可执行程序、渲染几何体和模拟资产联系起来，使场景组装成为一个可追溯且本地可编辑的世界构建过程。我们评估了SceneCode在场景级合成、物体级资产质量、人类判断和下游机器人交互方面的表现。结果表明，可执行世界程序提高了提示忠实的室内场景生成，并产生了具有更干净网格结构和可加载的模拟器关节元数据的资产。项目页面：https://scene-code.github.io/.

英文摘要

Indoor scene synthesis underpins embodied AI, robotic manipulation, and simulation-based policy evaluation, where a useful scene must specify not only what the environment looks like, but also how its objects are structured. Existing pipelines, however, typically represent generated content as static meshes and inherit articulation only from curated asset libraries, which limits object-level controllability and prevents new interactable assets from being produced on demand. We address this gap by formulating physically interactable indoor scene synthesis as programmatic world generation, and present SceneCode, a framework that compiles a natural language prompt into an executable, code-driven indoor world rather than a collection of opaque meshes. A room-level agentic backbone first turns the prompt into a structured house layout and emits per-object AssetRequests through a planner--designer--critic loop. Each request is then routed to one of five code-generation strategies and converted into a synthesized part-wise Blender Python programs that are validated through an execution-guided repair-and-refine loop. The resulting programs are compiled into simulation-ready assets, and exported as SDF for physics simulation. A persistent scene-state registry links object requests, executable programs, rendered geometry, and simulation assets, turning scene assembly into a traceable and locally editable world-building process. We evaluate SceneCode across scene-level synthesis, object-level asset quality, human judgment, and downstream robot interaction. Results show that executable world programs improve prompt-faithful indoor scene generation and produce assets with cleaner mesh structure, and simulator-loadable articulation metadata. Project page: https://scene-code.github.io/.

URL PDF HTML ☆

赞 0 踩 0

2605.19576 2026-05-20 cs.AI cs.CL cs.SE 版本更新

Library Drift: Diagnosing and Fixing a Silent Failure Mode in Self-Evolving LLM Skill Libraries

库漂移：在自我演化的LLM技能库中诊断和修复一种无声的失败模式

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center（AWS生成式AI创新中心）； HSBC Holdings Plc., HSBC Technology Center, China（汇丰控股有限公司，汇丰技术中心，中国）

AI总结本文研究了自我演化的LLM技能库中的一种无声失败模式——库漂移，通过可重复触发实验、细粒度诊断和验证修复方法，揭示了技能积累无序导致检索退化、假阳性注入和性能停滞的问题，并提出了一种经过验证的修复方案，显著提升了技能库的性能。

详情

AI中文摘要

Self-evolving skill libraries face a silent failure mode we term \emph{library drift}: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom--LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench)--yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift--one disables skill injection (flat floor, +0.002), one imposes premature retirement (active harm, $-$0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain $+$0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

英文摘要

Self-evolving skill libraries face a silent failure mode we term \emph{library drift}: unbounded skill accumulation without outcome-driven lifecycle management causes retrieval degradation, false-positive injections, and performance stagnation. Recent evaluation confirms the symptom--LLM-authored skills deliver +0.0pp gain while human-curated ones deliver +16.2pp (SkillsBench)--yet the underlying mechanism has not been isolated. We provide (1) a reproducible trigger: ablations that isolate drift--one disables skill injection (flat floor, +0.002), one imposes premature retirement (active harm, $-$0.019); (2) trace-level diagnostics: an append-only evidence log with per-skill contribution scores, attribution verdicts, and router engagement metrics that make the failure visible before it reaches end-task scores; and (3) a verified fix: a minimal governance recipe (outcome-driven retirement + bounded active-cap + meta-skill authoring prior) that lifts held-out pass@1 from a 0.258 baseline to a late-window mean of 0.584 (rolling gain $+$0.328) on MBPP+ hard-100 over 100 rounds. Eight ablations decompose which governance mechanisms are load-bearing and which are subsumed, providing a concrete playbook for diagnosing library drift in any self-evolving agent.

URL PDF HTML ☆

赞 0 踩 0

2605.19561 2026-05-20 cs.LG cs.AI 版本更新

TORQ: Two-Level Orthogonal Rotation for MXFP4 Quantization

TORQ：MXFP4量化中的两级正交旋转

Zukang Xu, Xing Hu, Dawei Yang

发表机构 * Open Compute Project（开放计算项目）

AI总结本文提出TORQ框架，通过优化坐标变换重塑激活空间的几何属性，解决MXFP4激活量化中的精度下降问题，显著提升量化精度。

Comments 17 pages, 4 figures, 13 tables

详情

AI中文摘要

随着大型语言模型（LLMs）向实际部署迈进，微缩FP4（MXFP4）格式已成为下一代低比特推断的基石，因其在高动态范围与硬件效率之间的平衡能力。然而，直接将MXFP4应用于LLM激活量化不可避免地导致显著的精度下降。在本文中，我们从理论上分析MXFP4激活量化的误差结构，揭示出性能下降的根本原因在于激活分布与MXFP4块浮点格式之间的两个结构性不平衡：（1）极端块间方差不平衡和（2）块内代码书利用不平衡。为了解决这些挑战，我们提出了TORQ（MXFP4量化中的两级正交旋转），一种无训练的后训练量化（PTQ）框架，通过最优坐标变换重塑激活空间的几何属性。在宏观层面，TORQ利用Schur-Horn定理通过块间正交旋转重新分配激活能量，防止高方差块驱动共享缩放因子，从而保留小幅度元素的精度。在微观层面，TORQ采用最大熵引导的块内旋转以缓解代码书坍塌并最大化MXFP4代码书的信息容量。在主流LLM如LLaMA3和Qwen3上的实验表明，与现有方法相比，TORQ显著提高了MXFP4激活量化的准确性：在Qwen3-32B上，WikiText的困惑度降低到8.43（相比BF16的7.61），平均准确率从直接RTN的38.40%增加到73.63%（相比BF16的74.82%），大幅缩小了4位浮点量化与全精度推断之间的差距。

英文摘要

As Large Language Models (LLMs) advance toward practical deployment, the Microscaling FP4 (MXFP4) format has emerged as a cornerstone for next-generation low-bit inference, owing to its ability to balance high dynamic range with hardware efficiency. However, directly applying MXFP4 to LLM activation quantization inevitably leads to significant accuracy degradation. In this paper, we theoretically analyze the error structure of MXFP4 activation quantization, revealing that the root cause of this performance drop lies in two structural imbalances between activation distributions and the MXFP4 block floating-point format: (1) extreme inter-block variance imbalance and (2) intra-block codebook utilization imbalance. To address these challenges, we propose TORQ (Two-level Orthogonal Rotation for MXFP4 Quantization), a training-free Post-Training Quantization (PTQ) framework designed to reshape the geometric properties of the activation space through optimal coordinate transformations. At the macroscopic level, TORQ leverages the Schur-Horn theorem to redistribute activation energy via inter-block orthogonal rotation, preventing high-variance blocks from driving up shared scaling factors and thereby preserving the precision of small-magnitude elements. At the microscopic level, TORQ employs maximum-entropy-guided intra-block rotation to alleviate codebook collapse and maximize the MXFP4 codebook's information capacity. Experiments on mainstream LLMs such as LLaMA3 and Qwen3 show that TORQ significantly improves the accuracy of MXFP4 activation quantization compared to existing methods: on Qwen3-32B, the perplexity on WikiText is reduced to 8.43 (vs. 7.61 for BF16), and the average accuracy increases from 38.40% with direct RTN to 73.63% (vs. 74.82% for BF16), substantially narrowing the gap between 4-bit floating-point quantization and full-precision inference.

URL PDF HTML ☆

赞 0 踩 0

2605.19559 2026-05-20 cs.CV cs.AI 版本更新

EgoCoT-Bench: Benchmarking Grounded and Verifiable Operation-Centric Chain of Thought Reasoning for MLLMs

EgoCoT-Bench: 用于MLLMs的 grounded 和可验证的 operation-centric 思维链推理基准测试

Yang Dai, Dian Jiao, Tianwei Lin, Wenqiao Zhang

发表机构 * Zhejiang University（浙江大学）

AI总结本文提出EgoCoT-Bench，一个用于评估MLLMs在第一人称视角下细粒度操作中心推理能力的基准测试，包含3172个可验证的问答对，涵盖感知、预见和高层次推理等任务，旨在解决现有基准测试在细粒度推理和证据验证方面的不足。

详情

AI中文摘要

多模态大语言模型（MLLMs）的快速发展引发了对第一人称视频理解的广泛关注，特别是MLLMs识别细粒度手-物体交互、跟踪物体状态变化以及从第一人称视角推理动态环境中操作过程的能力。然而，现有的第一人称视频基准测试存在局限性，即缺乏对基于现实证据的推理评估，难以支持细粒度的操作中心推理，并且很少检查模型推理是否基于显式的时空证据。为了解决这一差距，我们引入了EgoCoT-Bench，一个细粒度的第一人称基准测试，用于验证和可验证的操作中心推理，具有显式的逐步推理注释。总体而言，EgoCoT-Bench包含3172个可验证的问答对，覆盖351个第一人称视频，分为四个任务组，共12个子任务组，涵盖感知与回顾、预见和高层次推理。该基准测试通过时空场景图（STSG）引导生成框架构建，并通过人工标注者进一步优化，以确保正确性、第一人称相关性和细粒度质量。实验结果表明，第一人称细粒度推理仍存在困难，并进一步揭示了许多多模态模型生成的解释虽然答案正确，但证据与答案不一致。我们希望EgoCoT-Bench能为第一人称视频理解中的 grounded 和可验证推理提供有用的测试平台。项目页面和补充材料可在：https://dstardust.github.io/EgoCoT/ 上找到。

英文摘要

The rapid development of Multimodal Large Language Models (MLLMs) has led to growing interest in egocentric video understanding, specifically the ability for MLLMs to recognize fine-grained hand-object interactions, track object state changes over time, and reason about manipulative processes in dynamic environments from a first-person perspective. However, existing egocentric video benchmarks suffer from \textbf{limited grounded rationale evaluation}, offering limited support for fine-grained operation-centric reasoning and rarely examining whether model rationales are grounded in explicit spatio-temporal evidence. To address this gap, we introduce \textbf{EgoCoT-Bench}, a fine-grained egocentric benchmark for grounded and verifiable operation-centric reasoning with explicit step-by-step rationale annotations. Overall, EgoCoT-Bench comprises 3,172 verifiable QA pairs over 351 egocentric videos separated into four task groups for a total of 12 sub-task groups, encompassing perception and retrospection, anticipation, and high-level reasoning. The benchmark is constructed through a spatio-temporal scene graphs (STSG) guided generation framework and is further refined by human annotators to ensure correctness, egocentric relevance and fine-grained quality. Experimental results show continuing difficulties with egocentric fine-grained reasoning and further reveal that many multimodal models produce explanations that are answer-correct, but have evidence that is inconsistent with the answer. We hope EgoCoT-Bench can serve as a useful testbed for grounded and verifiable reasoning in egocentric video understanding. Project page and supplementary materials are available at: https://dstardust.github.io/EgoCoT/.

URL PDF HTML ☆

赞 0 踩 0

2605.19538 2026-05-20 cs.CV cs.AI 版本更新

CaptchaMind: Training CAPTCHA Solvers via Reinforcement Learning with Explicit Reasoning Supervision

CaptchaMind: 通过强化学习与显式推理监督训练CAPTCHA求解器

Pengcheng Wang, Haoxiang Liu, Yang Dai, Xiangxiang Zeng, Guanhua Chen, Baotian Hu, Longyue Wang, Weihua Luo

发表机构 * Alibaba Group（阿里巴巴集团）； Southern University of Science and Technology（南方科技大学）； Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

AI总结本文提出CaptchaMind，一种基于强化学习的CAPTCHA求解器，通过显式推理监督训练，实现了82.9%的平均成功率，显著优于现有方法。

Comments 17 pages, 12 figures

详情

AI中文摘要

CAPTCHAs被广泛部署作为人类验证机制，经常阻止智能代理在现实网络环境中完成端到端自动化。解决现代CAPTCHAs需要稳健的多步骤视觉推理和交互能力，但基于训练的方法由于缺乏大规模训练数据和过程级注释而一直缺席。我们介绍了CaptchaBench，第一个支持大规模训练的CAPTCHA基准，包含16,000个程序生成的样本，覆盖八个任务类别，并带有详细的区域和过程级注释。系统评估表明，现有方法在需要精细视觉细节捕获和区域级比较的任务上表现一致失败。因此，我们提出了CaptchaMind，一种基于强化学习的求解器，通过显式推理过程监督训练，实现了82.9%的平均成功率，跨八个任务和71.0%在现实实例上的表现，显著优于所有现有方法，无需闭源API。

基础模型对AI检测器看起来很像人类

Yixuan Even Xu, Ziqian Zhong, Aditi Raghunathan, Fei Fang, J. Zico Kolter

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结本研究发现基础模型生成的文本在AI检测器中常被误判为人类生成，提出HIP方法通过迭代改写提升检测器规避能力，揭示当前检测器更关注指令调优和局部上下文而非通用机器生成文本特征。

Comments 39 pages, 9 figures

详情

AI中文摘要

随着AI生成文本在现实世界大规模应用，机构越来越多地使用商业AI文本检测器，尤其是在教育和学术诚信流程中。我们报告了一个令人惊讶的经验发现：当用GPTZero和Pangram评估时，基础模型生成的文本往往被判断为高度人类化，而经过指令调优的模型生成的文本则不具有这种特性。基于这一观察，我们提出了Humanization by Iterative Paraphrasing (HIP)，一种不依赖特定检测器的管道，它最小化地微调基础模型为改写器并迭代应用。与我们测试的基线相比，HIP在商业检测器上实现了更好的语义保留与检测器规避的平衡。在Llama-3和Qwen-3系列模型中，从0.6B到70B的不同规模上，HIP始终提高了检测器的人类化程度。我们的发现表明，当前检测器更关注指令调优和局部上下文而非任何通用机器生成文本的不变特征。这反过来要求检测器设计更明确地建模这些因素。

英文摘要

As AI-generated text enters the real-world at scale, institutions increasingly use commercial AI-text detectors, especially in education and academic-integrity workflows. We report a surprising empirical finding about such systems: when evaluated by GPTZero and Pangram, generated text from base models is often judged overwhelmingly human, whereas text generated by their instruction-tuned counterparts is not. Building on this observation, we propose Humanization by Iterative Paraphrasing (HIP), a detector-agnostic pipeline that minimally fine-tunes a base model into a paraphraser and applies it iteratively. Compared with the baselines we test, HIP yields a stronger trade-off between semantic preservation and detector evasion on commercial detectors. Across Llama-3 and Qwen-3 families, spanning model sizes from 0.6B to 70B, HIP consistently improves detector human-likeness. Our findings suggest that current detectors are tracking artifacts of instruction tuning and local context more than any invariant notion of machine-generated text. This, in turn, calls for detector designs that model these factors more explicitly.

URL PDF HTML ☆

赞 0 踩 0

2605.19501 2026-05-20 cs.RO cs.AI 版本更新

CANINE: Coaching Visually Impaired Users for Interactive Navigation with a Robot Guide Dog

CANINE: 为视觉障碍者提供交互导航的机器人导盲犬教学系统

Cunjun Yu, Zishuo Wang, Anxing Xiao, Linfeng Li, David Hsu

发表机构 * School of Computing（computing 学院）； Smart Systems Institute（智能系统研究所）

AI总结本文提出CANINE系统，通过个性化适应性语音反馈帮助视觉障碍者学习与机器人导盲犬的交互导航，通过分解复杂协调任务并分层训练提升学习效率和最终导航性能。

Comments Accepted to RSS 2026

详情

AI中文摘要

机器人导盲犬提供了显著扩展视障者独立移动能力的导航帮助，但其有效使用需要微妙的人机协调，这使得用户难以从通用口头指令中学习。为解决这一挑战，我们提出了CANINE，一个自动化教学系统，通过个性化、适应性的语音反馈训练用户进行交互导航。CANINE将复杂协调任务分解为子技能，并在两个层次上运作。在高层，它通过知识追踪跟踪学习者在子技能中的熟练度，并优先训练最薄弱的领域。在底层，CANINE通过观察每个人类实践片段，利用基础模型推断错误的根本原因，并生成适应性的针对性语音纠正。通过盲folded参与者受控研究，将受试者视为定量评估的代理群体，证明CANINE在学习效率和最终导航性能上均优于通用口头指令。我们进一步通过保留研究和探索性案例研究验证CANINE。保留研究显示在两周后仍保持技能提升。案例研究确认CANINE在训练视障用户方面的有效性，同时揭示了实际部署中的额外设计考虑因素。两者均与受控研究的结果一致。项目页面：https://cunjunyu.github.io/project/canine/

英文摘要

Robot guide dogs offer navigation assistance that greatly expands the independent mobility of the visually impaired, but their effective use requires subtle human-robot coordination that is difficult for users to learn from generic verbal instructions. To tackle this challenge, we present CANINE, an automated coaching system that trains users for interactive navigation with a robot guide dog, through personalized, adaptive verbal feedback. CANINE decomposes a complex coordination task into sub-skills and operates at two levels. At the high level, it decides what to train by tracking the learner's proficiency across sub-skills using knowledge tracing and prioritizing training on the weakest areas. At the low level, CANINE decides how to train each sub-skill by observing each human practice episode, using foundation models to infer the underlying causes of errors, and generating targeted verbal corrections adaptively. A controlled study with blindfolded participants, treated as a proxy population for quantitative evaluation, demonstrates that CANINE significantly improves both learning efficiency and final navigation performance compared to generic verbal instructions. We further validate CANINE through a retention study and an exploratory case study. The retention study shows lasting skill improvement after two weeks. The case study confirms CANINE's effectiveness in training a visually impaired user, while revealing additional design considerations for real-world deployment. Both are well aligned with the findings of the controlled study. Project page: https://cunjunyu.github.io/project/canine/

URL PDF HTML ☆

赞 0 踩 0

2605.19485 2026-05-20 cs.AI 版本更新

Attention-Guided Reward for Reinforcement Learning-based Jailbreak against Large Reasoning Models

基于注意力引导的强化学习对抗大推理模型的 jailbreak 方法

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang, Haichang Gao

发表机构 * Xidian University（西安电子科技大学）； Xi’an Jiaotong University（西安交通大学）

AI总结本文研究了对抗大推理模型的 jailbreak 攻击，发现攻击成功率与模型的注意力模式密切相关，并提出了一种基于强化学习的方法，通过将注意力信号纳入奖励函数设计来提升攻击效果，同时引入多样化的说服策略以提高攻击成功率。

详情

AI中文摘要

大推理模型（LRMs）在通过生成结构化的分步推理内容解决复杂问题方面表现出显著的能力。然而，暴露模型的内部推理过程会引入额外的安全风险；例如，最近的研究表明，LRMs比标准LLMs更容易受到jailbreak攻击。在本文中，我们研究了对LRMs的jailbreak攻击，并揭示出攻击成功率（ASR）与LRMs的注意力模式密切相关。具体而言，成功的jailbreak攻击倾向于在输入提示中对有害标记分配较低的注意力，而在推理内容中对这些标记分配较高的注意力。受此发现启发，我们提出了一种针对LRMs的新型jailbreak方法，利用强化学习（RL）来增强攻击效果，明确地将注意力信号纳入奖励函数设计。此外，我们引入了多样化的说服策略以丰富RL的动作空间，这始终提高了ASR。在五个开源和闭源LRMs上进行的广泛实验表明，我们的方法在三个基准测试中实现了显著更高的ASR，优于现有方法在有效性、效率和可迁移性方面。

英文摘要

Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in solving complex problems by generating structured, step-by-step reasoning content. However, exposing a model's internal reasoning process introduces additional safety risks; for example, recent studies show that LRMs are more vulnerable to jailbreak attacks than standard LLMs. In this paper, we investigate jailbreak attacks on LRMs and reveal that the attack success rate (ASR) is closely correlated with LRMs' attention patterns. Specifically, successful jailbreaks tend to assign lower attention to harmful tokens in the input prompt, while allocating higher attention to those tokens in the reasoning content. Motivated by this finding, we propose a novel jailbreak method for LRMs that leverages reinforcement learning (RL) to enhance attack effectiveness, explicitly incorporating attention signals into the reward function design. In addition, we introduce diverse persuasion strategies to enrich the RL action space, which consistently improves the ASR. Extensive experiments on five open-source and closed-source LRMs across three benchmarks demonstrate that our method achieves substantially higher ASR, outperforming existing approaches in terms of effectiveness, efficiency, and transferability.

URL PDF HTML ☆

赞 0 踩 0

2605.19484 2026-05-20 cs.CV cs.AI cs.GR cs.HC 版本更新

CutVerse: A Compositional GUI Agents Benchmark for Media Post-Production Editing

CutVerse: 一个用于媒体后期制作编辑的组合式GUI代理基准测试

Haobo Hu, Xiangwu Guo, Zhiheng Chen, Difei Gao, Haotian Liu, Libiao Jin, Qi Mao

发表机构 * MIPG, Communication University of China（MIPG，中国传媒大学）； National University of Singapore（新加坡国立大学）； USEIT AI（USEIT人工智能）

AI总结本研究提出CutVerse，一个用于评估自主GUI代理在真实媒体后期制作环境中的能力的基准测试，揭示现有代理在复杂、长周期媒体后期制作工作流中的局限性。

详情

AI中文摘要

尽管GUI代理在网页导航和基础操作系统任务中取得了显著进展，但其在专业创意工作流中的能力仍鲜有研究。为弥合这一差距，我们引入CutVerse，一个旨在系统评估自主GUI代理在真实媒体后期制作环境中的基准测试。我们收集了7个专业应用（如Premiere Pro、Photoshop）的专家演示，涵盖186个复杂、长周期任务，这些任务基于真实的编辑工作流，涉及密集的多模态界面和紧密耦合的交互序列。为支持可扩展评估，我们开发了一个轻量级解析器，将原始屏幕记录和低级交互日志转换为结构化、组合式的GUI动作轨迹，具有精确的定位。广泛评估显示，现有代理在现实媒体编辑任务中的任务成功率仅为36.0%，凸显了复杂、长周期媒体后期制作工作流在本基准测试中的挑战。尽管当前模型在空间定位、多模态对齐和协调动作执行方面表现出色，但在长周期可靠性和领域特定规划方面仍存在限制。

英文摘要

While GUI agents have made significant progress in web navigation and basic operating system tasks, their capabilities in professional creative workflows remain largely underexplored. To bridge this gap, we introduce Cutverse, a benchmark designed to systematically evaluate autonomous GUI agents in realistic media post-production environments. We curate expert demonstrations across 7 professional applications (e.g., Premiere Pro, Photoshop), covering 186 complex, long-horizon tasks grounded in authentic editing workflows, involving dense multimodal interfaces and tightly coupled interaction sequences. To support scalable evaluation, we develop a lightweight parser that transforms raw screen recordings and low-level interaction logs into structured, compositional GUI action trajectories with precise grounding. Extensive evaluations reveal that existing agents achieve only 36.0\% task success on realistic media editing tasks, underscoring the challenges posed by complex, long-horizon media post-production workflows in our benchmark.While current models demonstrate promising spatial grounding, multimodal alignment, and coordinated action execution, they remain limited in long-horizon reliability and domain-specific planning.

URL PDF HTML ☆

赞 0 踩 0

2605.19469 2026-05-20 cs.LG cs.AI cs.RO 版本更新

Sampling-Based Safe Reinforcement Learning

基于采样的安全强化学习

Luca Vignola, Bruce D. Lee, Manish Prajapat, Manuel Wendl, Melanie Zeilinger, Andreas Krause, Yarden As

发表机构 * ETH Zurich（苏黎世联邦理工学院）

AI总结本文提出了一种基于采样的安全强化学习方法，通过在有限的动力学样本集上联合施加约束，确保学习过程中的安全性，并在连续域中提供实用的安全保证，同时通过限制认知不确定性实现了高效的探索。

详情

AI中文摘要

安全探索仍然是强化学习（RL）中的基本挑战，限制了RL智能体在现实世界中的部署。我们提出了一种基于采样的安全强化学习（SBSRL），这是一种基于模型的RL算法，通过在有限的动力学样本集上联合施加约束，确保学习过程中的安全性。这种形式近似了在不确定动力学下的不可行最坏情况优化，并在连续域中实现了实用的安全保证。我们进一步引入了一种基于限制认知不确定性的探索策略，消除了显式探索奖励的需要。在常规条件下，我们推导了学习过程中安全性的高概率保证以及恢复近最优策略的有限时间样本复杂度界。实验证明，SBSRL在仿真和真实机器人硬件中均实现了安全且高效的探索，并可轻松扩展到实际的深度集合实现，以解决高维连续控制问题。

英文摘要

Safe exploration remains a fundamental challenge in reinforcement learning (RL), limiting the deployment of RL agents in the real world. We propose Sampling-Based Safe Reinforcement Learning (SBSRL), a model-based RL algorithm that maintains safety throughout the learning process by enforcing constraints jointly across a finite set of dynamics samples. This formulation approximates an intractable worst-case optimization over uncertain dynamics and enables practical safety guarantees in continuous domains. We further introduce an exploration strategy based on constraining epistemic uncertainty, eliminating the need for explicit exploration bonuses. Under regularity conditions, we derive high-probability guarantees of safety throughout learning and a finite-time sample complexity bound for recovering a near-optimal policy. Empirically, SBSRL achieves safe and efficient exploration both in simulation and in real robotic hardware, and readily extends to practical deep-ensemble implementations that scale to high-dimensional continuous control problems.

URL PDF HTML ☆

赞 0 踩 0

2605.19462 2026-05-20 cs.LG cs.AI 版本更新

Quantifying the Pre-training Dividend: Generative versus Latent Self-Supervised Learning for Time Series Foundation Models

量化预训练红利：生成与潜在自监督学习在时间序列基础模型中的应用

Noam Major, Kathy Razmadze, Yoli Shavit

发表机构 * Faculty of Engineering, Bar-Ilan University（巴伊兰大学工程学院）

AI总结本文研究了自监督学习在时间序列中的应用，比较了生成范式与潜在对齐架构，发现预训练红利在异常检测和分类任务中显著提升，但在预测任务中效果有限，同时表明表示质量与数据来源无关，且在适度的架构深度下趋于稳定。

详情

AI中文摘要

自监督学习（SSL）在视觉和自然语言处理中的成功促使其在时间序列中的快速应用。然而，研究主要集中在生成范式和预测任务上，未量化学习表示的广泛应用。我们建立了一个受控框架来评估“预训练红利”：SSL在多样时间任务中的价值。我们系统比较了生成范式与潜在对齐架构，引入了适用于时间序列的LeJEPA和DINO的变体。这些变体利用离散小波变换（DWT）增强来强制对局部波动的不变性。我们的分析揭示预训练红利高度不对称：SSL在异常检测和分类任务中可获得高达375%的收益，但在预测任务中效果有限。我们证明表示的实用性非普遍，由精度-不变性权衡决定，任务所需的特定信号分辨率必须与目标一致。最后，我们显示表示质量与数据来源无关，并在适度的架构深度下趋于稳定，表明通过大规模合成生成可实现扩展。我们的代码可在：https://github.com/noammajor/Models 获取。

英文摘要

The success of self-supervised learning (SSL) in vision and NLP has motivated its rapid adoption for time series. However, research has focused primarily on Generative paradigms and forecasting tasks, leaving the broader utility of learned representations unquantified. We establish a controlled framework to evaluate the "pre-training dividend": the value added by SSL across diverse temporal tasks. We systematically compare Generative paradigms against Latent Alignment architectures, introducing adaptations of LeJEPA and DINO for time series. These adaptations utilize Discrete Wavelet Transform (DWT) augmentations to enforce invariance to local fluctuations. Our analysis reveals that the pre-training dividend is highly asymmetric: SSL yields gains of up to 375% for anomaly detection and classification, yet remains marginal for forecasting. We demonstrate that representational utility is non-universal, governed by a precision-invariance trade-off where the specific signal resolution required by the task must align with the objective. Finally, we show that representation quality is largely independent of data origin and saturates at moderate architectural depths, suggesting a path to scaling via massive synthetic generation. Our code is available at: https://github.com/noammajor/Models

URL PDF HTML ☆

赞 0 踩 0

2605.19461 2026-05-20 cs.AI 版本更新

Beyond Mode Collapse: Distribution Matching for Diverse Reasoning

超越模式崩溃：用于多样化推理的分布匹配

Xiaozhe Li, Yang Li, Xinyu Fang, Shengyuan Ding, Peiji Li, Yongkang Chen, Yichuan Ma, Tianyi Lyu, Linyang Li, Dahua Lin, Qipeng Guo, Qingwen Liu, Kai Chen

发表机构 * Tongji University（同济大学）； Independent（独立）； Shanghai AI Laboratory（上海人工智能实验室）； Zhejiang University（浙江大学）； Fudan University（复旦大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结本文提出DMPO方法，通过原理性近似前向KL最小化来防止on-policy强化学习中的模式崩溃，展示了在NP难组合优化问题上的改进效果，提升了多样化推理能力。

详情

AI中文摘要

像GRPO这样的在线强化学习方法会遭遇模式崩溃：它们表现出减少的解决方案多样性，在发现一个解决方案后，将概率质量集中在单一解决方案上，并停止探索替代策略。我们证明这源于反KL最小化的行为，这种行为强化了首次发现的高回报轨迹，而不是维持多个多样解决方案的分布。我们提出DMPO（分布匹配策略优化），通过原理性近似前向KL最小化来防止模式崩溃。DMPO构建一个群体层面的目标分布，该分布与采样的轨迹成正比于其奖励，然后将策略分布对齐到此目标。这提供了覆盖模式的行为，而无需采样自不可行的全局目标分布，使训练过程中持续探索成为可能。我们在NP难组合优化问题上验证了DMPO，其中存在指数级多的可行解，但只有少数接近最优解，是评估探索的理想测试环境。DMPO在文本基NP-Bench上实现了43.9%的Quality Ratio（对比GRPO的40.1%），在视觉基NP-Bench上实现了43.1%（对比38.4%），分别展示了9%和12%的相对改进。这些收益扩展到数学推理（+2.0%）和跨领域任务（+2.3%），表明保持多样性训练增强了跨模态的通用推理能力。我们的工作确立了分布匹配作为防止on-policy RL中模式崩溃的实用且原理性方法，一致的质量改进表明在多样化推理任务中持续探索的能力。

英文摘要

On-policy reinforcement learning methods like GRPO suffer from mode collapse: they exhibit reduced solution diversity, concentrating probability mass on a single solution once discovered and ceasing exploration of alternative strategies. We show this stems from reverse KL minimization's mode-seeking behavior, which reinforces the first high-reward trajectory found rather than maintaining a distribution over multiple diverse solutions. We propose DMPO (Distribution-Matching Policy Optimization), which prevents mode collapse through principled approximation of forward KL minimization. DMPO constructs a group level target distribution over sampled trajectories proportional to their rewards, then aligns the policy distribution to this target. This provides mode-covering behavior without requiring sampling from the intractable global target distribution, enabling sustained exploration throughout training. We validate DMPO on NP-hard combinatorial optimization, where exponentially many feasible solutions exist but only a few approach optimality, an ideal testbed for evaluating exploration. DMPO achieves 43.9% Quality Ratio on text-based NP-Bench (vs. GRPO's 40.1%) and 43.1% on vision-based NP-Bench (vs. 38.4%), demonstrating 9% and 12% relative improvements respectively. These gains generalize to mathematical reasoning (+2.0%) and out-of-domain tasks (+2.3%), showing that diversity-preserving training enhances general reasoning capabilities across modalities. Our work establishes distribution matching as a practical, principled approach to preventing mode collapse in on-policy RL, with consistent quality improvements demonstrating sustained exploration across diverse reasoning tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.19457 2026-05-20 cs.AI 版本更新

Generative Auto-Bidding with Unified Modeling and Exploration

生成式自动出价：统一建模与探索

Mingming Zhang, Feiqing Zhuang, Na Li, Shengjie Sun, Xiaowei Chen, Junxiong Zhu, Fei Xiao, Keping Yang, Lixin Zou, Chenliang Li

发表机构 * Key Laboratory of Aerospace Information Security ； Trusted Computing, Ministry of Education, School of Cyber Science ； Engineering, Wuhan University\ \& Tmall Group of Alibaba Wuhan China ； Taobao \& Tmall Group of Alibaba Hangzhou China ； Engineering, Wuhan University Wuhan China ； Engineering, Wuhan University\ \& Tmall Group of Alibaba ； Taobao \& Tmall Group of Alibaba ； Engineering, Wuhan University

AI总结本文提出GUIDE框架，通过结合定向探索与安全回退机制，解决生成模型在自动出价中探索与安全平衡的问题，实现效率与安全的统一。

Comments 11pages, sigir2026

详情

DOI: 10.1145/3805712.3809661

AI中文摘要

自动化出价是现代数字广告的核心。早期基于规则的方法缺乏适应性，而后续的强化学习方法将出价建模为马尔可夫决策过程，但难以处理长期依赖。最近的生成模型显示了潜力，但缺乏明确的机制来平衡探索和安全性，仅依赖动作扰动或轨迹引导，没有安全回退。这导致了低效的探索和广告平台的高财务风险。为了解决这一差距，我们提出了GUIDE（生成式自动出价：统一建模与探索）框架，通过协同整合定向探索与安全回退机制。GUIDE使用决策变压器（DT）联合建模历史出价动作和环境状态转移。Q值模块通过正则化约束引导DT的探索，而逆向动力学模块（IDM）利用DT预测的未来状态来推断鲁棒且行为一致的动作作为安全策略回退。Q值模块随后在两者之间自适应地选择最终动作，平衡探索和安全性。这些组件共同形成一个集成的“探索-安全回退-选择”流水线，实现了效率和安全的统一。我们在公开数据集、模拟拍卖环境以及通过大规模在线部署在淘宝（中国领先的广告平台）上进行了广泛实验。结果表明，GUIDE在所有场景中均优于最先进的基线。在实际部署中，GUIDE实现了显著的收益：广告GMV增长+4.10%，广告点击增长+1.40%，广告成本下降+1.66%，广告ROI增长+3.52%，证明了其有效性和强大的工业适用性。

英文摘要

Automated bidding is central to modern digital advertising. Early rule-based methods lacked adaptability, while subsequent Reinforcement Learning approaches modeled bidding as a Markov Decision Process but struggled with long-term dependencies. Recent generative models show promise, yet they lack explicit mechanisms to balance exploration and safety, relying solely on action perturbations or trajectory guidance without a safety fallback. This results in inefficient exploration and elevated financial risk for advertising platforms. To address this gap, we propose GUIDE (Generative Auto-Bidding with Unified Modeling and Exploration), a framework that synergistically integrates directed exploration with a safe fallback mechanism. GUIDE employs a Decision Transformer (DT) to jointly model historical bidding actions and environmental state transitions. A Q-value module guides the DT's exploration via regularization constraints, while an Inverse Dynamics Module (IDM) leverages DT-predicted future states to infer robust, behaviorally consistent actions as a safe policy fallback. The Q-value module then adaptively selects the final action between these two options, balancing exploration and safety. Together, these components form an integrated "explore-safeguard-select" pipeline that unifies efficiency and safety. We conduct extensive experiments on public datasets, in simulated auction environments, and through large-scale online deployment on Taobao, a leading Chinese advertising platform. Results show GUIDE consistently outperforms state-of-the-art baselines across all scenarios. In real-world deployment, GUIDE achieves notable gains: +4.10% ad GMV, +1.40% ad clicks, +1.66% ad cost, and +3.52% ad ROI, demonstrating its effectiveness and strong industrial applicability.

URL PDF HTML ☆

赞 0 踩 0

2605.19452 2026-05-20 cs.DC cs.AI 版本更新

Resilient Byzantine Agreement with Predictions

具有预测功能的容错一致性协议

Julien Dallot, Darya Melnyk, Tijana Milentijevic, Stefan Schmid, Patrik Welters

发表机构 * TU Berlin（柏林技术大学）； Weizenbaum Institute（魏泽恩堡研究所）； HU Berlin（柏林洪堡大学）

AI总结本文研究了在有预测器辅助下解决拜占庭共识问题，通过算法容错性和预测器准确度的权衡分析，提出在非认证和认证设置下容忍不同数量故障节点的算法及不可能性结果。

详情

AI中文摘要

本文研究了在有预测器辅助下解决拜占庭共识问题。我们关注算法的容错性——算法能容忍的最大故障节点数，并提出了其容错性依赖于预测器准确度的算法和不可能性结果。我们的第一个主要结果是对非认证和认证设置下的一致性-鲁棒性权衡进行了完整刻画：对于n个节点和参数α∈[0,1]，当预测器正确时，算法可以容忍最多α·n个故障节点（一致性）；当预测器任意错误时，可以容忍最多(1-α)/2·n -1个故障节点（鲁棒性）；在认证设置下，鲁棒性界限提高到(1-α)·n -1。这些权衡是精确的，因为我们证明再多一个故障节点会使问题变得不可能。我们的第二个主要结果刻画了平滑度：预测器准确性降低时，容错性下降的速率。我们证明只要错误预测的数量保持在n的常数比例内，容错性会线性减少。具体而言，在非认证设置下，每个额外的错误预测会损失一个单位的容错性，而在认证设置下，由于需要两个错误预测才能损失一个单位的容错性，因此下降幅度减半。

英文摘要

This paper studies the Byzantine Agreement problem where the nodes have access to a predictor that flags nodes for suspicion of faulty (Byzantine) behavior. We focus on algorithmic resilience -- the maximum number of faulty nodes an algorithm can tolerate -- and present algorithms and impossibility results whose resilience depend on the accuracy of the predictor. As our first main result, we bring a complete characterization of the consistency--robustness trade-offs in both the non-authenticated and authenticated settings: for $n$ nodes and a parameter $α\in [0, 1]$, we present algorithms that tolerate up to $α\cdot n$ faulty nodes when the predictor is correct (consistency), and up to $\frac{1-α}{2} \cdot n - 1$ faulty nodes when the predictor is arbitrarily wrong (robustness); in the authenticated setting the robustness bound improves to $(1-α) \cdot n - 1$. These trade-offs are exactly tight as we show that one additional faulty node renders the problem impossible. Our second main result characterizes smoothness: the rate at which resilience degrades as the predictor becomes less accurate. We show that resilience linearly decreases in the number of wrong predictions as long as that number stays within a constant fraction of $n$. Concretely, in the non-authenticated setting each additional wrong prediction loses one unit of resilience, whereas in the authenticated setting the decline is halved since two wrong predictions are needed to lose one unit of resilience.

URL PDF HTML ☆

赞 0 踩 0

2605.19447 2026-05-20 cs.AI 版本更新

What and When to Distill: Selective Hindsight Distillation for Multi-Turn Agents

什么和何时去蒸馏：多轮代理的定向 hindsight 蒸馏

Xiaozhe Li, Tianyi Lyu, Yang Li, Yichuan Ma, Peiji Li, Linyang Li, Qipeng Guo, Dahua Lin, Kai Chen

发表机构 * Tongji University Shanghai AI Laboratory（同济大学上海人工智能实验室）； Shanghai AI Laboratory（上海人工智能实验室）； Fudan University（复旦大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结本文研究了多轮代理中如何选择性地利用 hindsight 蒸馏，提出了一种基于环境反馈的强化学习框架 SERL，通过任务奖励和环境反馈的结合，在 ALFWorld 和 WebShop 任务中取得了较高的成功率。

详情

AI中文摘要

强化学习可以通过稀疏任务奖励训练大语言模型代理，但长周期信用分配仍然具有挑战性：一个成功或失败的信号必须分布在许多动作上。现有方法依赖于轨迹级奖励或代理信号，没有充分利用每一步的环境反馈。多轮代理设置尚不充分探索，其中反馈可以包括错误信息、页面变化、观察或参考轨迹。我们系统研究了五个反馈源和两种插入粒度，并引入了 SERL，一种选择性环境加权学习框架。SERL 使用任务奖励确定更新方向，而环境反馈调整放置和大小，专注于关键动作。在 ALFWorld 和 WebShop 上，SERL 分别达到 90.0% 和 80.1% 的成功率，优于强大的 RL 和蒸馏基线。分析显示，有意义的点上的基于事实、与动作相关的反馈始终优于随意使用更长或更丰富的上下文。

英文摘要

Reinforcement learning can train LLM agents from sparse task rewards, but long-horizon credit assignment remains challenging: a single success-or-failure signal must be distributed across many actions. Existing methods rely on trajectory-level rewards or proxy signals, without fully leveraging per-step environmental feedback. Multi-turn agent settings are underexplored, where feedback can include error messages, page changes, observations, or reference trajectories. We systematically study five feedback sources and two insertion granularities and introduce SERL, a selective environment-reweighted learning framework. SERL uses the task reward to determine update direction, while environment feedback adjusts placement and magnitude, focusing on critical actions. On ALFWorld and WebShop, SERL achieves 90.0% and 80.1% success, outperforming strong RL and distillation baselines. Analysis shows that grounded, action-relevant feedback at meaningful points consistently outperforms indiscriminate use of longer or richer context.

URL PDF HTML ☆

赞 0 踩 0

2605.19446 2026-05-20 cs.CV cs.AI 版本更新

Targeted Downstream-Agnostic Attack

定向下游无关攻击

Zhuxin Lei, Ziyuan Yang, Yi Zhang

发表机构 * College of Computer Science, Sichuan University（四川大学计算机学院）

AI总结本文提出了一种定向下游无关攻击（TDAA）方法，通过在更严格的威胁模型下，要求攻击同时具有针对性和下游无关性，解决了传统下游无关攻击（DAAs）在目标未知和编码器不直接生成预测时的挑战。通过引入威胁图像作为特征级锚点，构建了任务无关的桥梁，揭示了受害者编码器的脆弱性。

详情

AI中文摘要

近年来，由于其在表示提取方面的强大能力，预训练编码器得到了广泛应用。然而，它们容易受到下游无关攻击（DAAs）的攻击。现有的DAA方法基于一种宽松的威胁模型，只要生成的下游无关对抗样本（DAEs）改变原始预测，攻击就算成功，而无需特定目标。在本文中，我们提出了一种在更严格的威胁模型下进行的定向DAA（TDAA）方法，要求攻击必须同时具有针对性和下游无关性。由于下游任务未知且编码器不直接生成预测，实现针对性攻击尤其具有挑战性。为此，我们引入了一个名为“威胁图像”的新组件，由攻击者预先选择作为目标。具体来说，设计了一个生成器，生成针对每个样本的对抗扰动，迫使受害者编码器为DAEs和威胁图像输出相同的特征。与以往的DAA方法生成所有样本共享的单一扰动不同，我们的方法采用样本特定的范式。这生成了针对每个图像的定制扰动，以确保高攻击成功率和隐蔽性。通过利用威胁图像作为特征级锚点，我们的方法构建了一个任务无关的桥梁，揭示了受害者编码器的脆弱性。在10种自监督方法上对3个基准数据集的广泛实验展示了我们方法的有效性，并揭示了预训练编码器的显著脆弱性。代码将在审查期结束后公开。

英文摘要

Recently, pre-trained encoders have gained widespread use due to their strong capability in representation extraction. However, they are vulnerable to downstream-agnostic attacks (DAAs). Existing DAA methods operate under a permissive threat model, where an attack is successful if the generated downstream-agnostic adversarial examples (DAEs) change the original prediction, without requiring a specific target. In this paper, we propose a Targeted DAA (TDAA) method under a stricter threat model requiring the attack to be both targeted and downstream-agnostic. Since the downstream task is unknown and encoders do not directly produce predictions, achieving a targeted attack is particularly challenging. To address this, we introduce a novel component termed the 'threat image', pre-selected by the attacker as the target. Specifically, a generator is designed to produce example-specific adversarial perturbations that compel the victim encoder to output identical features for both the DAEs and the threat image. Unlike previous DAA methods that generate a single shared perturbation for all samples, which often fails due to image diversity, our method adopts an example-specific paradigm. This generates tailored perturbations for each image to ensure a high attack success rate and invisibility. By leveraging the threat image as a feature-level anchor, our method builds a task-agnostic bridge to reveal the vulnerabilities of the victim encoder. Extensive experiments on 10 self-supervised methods across 3 benchmark datasets demonstrate the effectiveness of our approach and reveal the pronounced vulnerability of pre-trained encoders. The code will be made publicly available after the review period.

URL PDF HTML ☆

赞 0 踩 0

2605.19435 2026-05-20 cs.CV cs.AI 版本更新

通过有向图建模实现冲突容忍的多智能体推理

Longgang He, Longzhu He, Daojing He, Chaozhuo Li

发表机构 * Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））； Beijing University of Posts and Telecommunications（北京邮电大学）

AI总结本文提出SIGMA框架，通过有向图建模显式捕捉智能体间的信任、冲突和中性关系，以提升多智能体系统的推理能力和冲突容忍性。

详情

AI中文摘要

基于大语言模型的多智能体系统（MAS）已展现出强大的推理和决策能力，其性能常受到简单聚合机制的限制，假设所有交互都是合作性的。经过深入分析，我们发现现有基于图的MAS框架存在两个问题：（1）当出现冲突信号时，错误会传播而无法控制；（2）缺乏对冲突智能体关系的显式建模以及结构意识，无法识别可靠的交互模式。为弥补这一差距，我们引入SIGMA，一种新的基于有向图的多智能体推理框架，通过有向关系图显式捕捉智能体间的信任、冲突和中性关系。具体而言，给定一个查询，SIGMA首先选择一组相关且多样化的智能体，然后构建一个具有置信度加权边的结构化有向交互图。推理过程通过冲突感知的有向信息传递进行，这会加强来自可信智能体的信息，同时抑制冲突信号，并以结构和冲突感知的加权聚合结束，以产生一致且冲突容忍的预测。在六个基准数据集上进行的大量实验表明，SIGMA在多个LLM后端和多智能体配置中一致优于最先进的基线，实现了准确性和冲突容忍性能的显著提升。

英文摘要

LLM-based multi-agent systems (MAS) have demonstrated strong reasoning and decision-making capabilities that consistently surpass those of single LLM agents. However, their performance often suffers from naive aggregation mechanisms that assume uniformly cooperative interactions. Upon close inspection, we observe that existing graph-based MAS frameworks (1) propagate errors when conflicting signals arise without control, and (2) lack explicit modeling of conflicting inter-agent relations as well as structural awareness, failing to identify reliable interaction patterns. To bridge this gap, we introduce SIGMA, a novel SIgned Graph-informed Multi-Agent reasoning framework that explicitly captures trust, conflict, and neutral relations among agents via a signed relational graph. Specifically, given a query, SIGMA first selects a set of relevant and diverse agents, then constructs a structured signed interaction graph with confidence-weighted edges. Reasoning proceeds through conflict-aware signed message passing, which reinforces information from trustworthy agents while suppressing conflicting signals, and terminates with a structure- and conflict-aware weighted aggregation to yield globally consistent and conflict-resilient predictions. Extensive experiments on six benchmark datasets, across multiple LLM backbones and diverse multi-agent configurations, demonstrate that SIGMA consistently outperforms state-of-the-art baselines, achieving notable gains in both accuracy and conflict-resilient performance.

URL PDF HTML ☆

赞 0 踩 0

2605.19407 2026-05-20 cs.LG cs.AI 版本更新

A Bitter Lesson for Data Filtering

数据过滤的惨痛教训

Christopher Mohri, John Duchi, Tatsunori Hashimoto

发表机构 * Department of Computer Science（计算机科学系）； Departments of Statistics and Electrical Engineering（统计学与电气工程系）； Stanford University（斯坦福大学）

AI总结本文研究了大规模模型预训练中的数据过滤，发现即使有足够的计算资源，过滤数据也不是最佳选择，因为充分训练的大型模型能够容忍低质量数据甚至从中受益。

2605.19394 2026-05-20 cs.CL cs.AI 版本更新

EmbGen: Teaching with Reassembled Corpora

EmbGen：利用重组语料库进行教学

Arun K Lenin, Kai Rouse, Andrea Nicastro, Anna Leontjeva

发表机构 * Commonwealth Bank of Australia（澳大利亚联邦银行）

AI总结本文提出EmbGen，一种通过重组语料库生成合成数据的pipeline，旨在提高在不同语义异质性数据集上指令微调模型的性能，通过实体-描述对的分解、基于嵌入相似性的重组以及基于聚类的采样生成问题-答案对，从而在固定token预算下提升二元准确率。

Comments 8 pages, 4 images (32 pages with appendix)

详情

AI中文摘要

适应小型指令微调模型到专业领域通常依赖于在精心挑选的指令-响应示例上进行监督微调（SFT），这在大规模收集时成本高昂。由教师LLM从领域语料库生成的合成训练示例可以降低此成本，但现有流程会产生同质化输出，并且不一致地捕捉跨段落或跨文档依赖性。我们引入EmbGen，一种合成数据生成流程，该流程将语料库分解为实体-描述对，通过从嵌入相似性推断出的语义结构重新组装它们，并通过接近性、集群内和集群间采样生成问题-答案（QA）对，使用集群专门化的系统提示。我们评估EmbGen在三个语义异质性不同的数据集上，固定token预算（5和20百万token）下的表现，与EntiGraph、InstructLab和Knowledge-Instruct进行比较。我们使用词汇重叠度量、LLM作为判断标准的评分表以及二元准确率（结合事实准确性和完整性）作为评估指标。EmbGen在最异质的数据集上，相对于最强基线，在5M和20M token预算下分别提高了12.5%和88.9%的二元准确率，同时在其他异质性较低的数据集上保持竞争力。

英文摘要

Adapting small instruction-tuned models to specialized domains often relies on supervised fine-tuning (SFT) on curated instruction-response examples, which is expensive to collect at scale. Synthetic training examples generated by a teacher LLM from a domain corpus can reduce this cost, but existing pipelines can produce homogenized outputs and do not consistently capture cross-passage or cross-document dependencies. We introduce EmbGen, a synthetic data generation pipeline that decomposes a corpus into entity-description pairs, reassembles them using semantic structure inferred from embedding similarity, and then generates question-answer (QA) pairs via proximity, intra-cluster, and inter-cluster sampling with cluster-specialized system prompts. We evaluate EmbGen against EntiGraph, InstructLab and Knowledge-Instruct on three datasets of varied semantic heterogeneity, under fixed token budgets (5 and 20 million tokens). We use lexical overlap metrics, an LLM-as-a-judge rubric, and Binary Accuracy, a composed metric combining Factual Accuracy and Completeness for evaluation. EmbGen improves Binary Accuracy on the most heterogeneous dataset by 12.5% at 5M and 88.9% at 20M tokens budget, relative to the strongest baseline, while remaining competitive across other datasets with lower heterogeneity.

URL PDF HTML ☆

赞 0 踩 0

2605.19382 2026-05-20 cs.AI 版本更新

用于神经网络模型融合的无冲突复制数据类型：一种双层架构，使26种策略兼容CRDT模型融合

Ryan Gillespie

发表机构 * Independent researcher（独立研究者）

AI总结本文提出了一种双层架构CRDTMergeState，通过将任何融合策略封装在CRDT兼容层中，解决了26种神经网络融合策略在分布式操作中无法满足交换律、结合律和幂等律的结构性问题，实现了强最终一致性。

详情

AI中文摘要

我们测试的所有26种神经网络融合策略，包括加权平均、SLERP、TIES、DARE、Fisher融合和进化方法，均无法满足用于无冲突分布式操作所需的代数属性（交换性、结合性和幂等性）。我们证明这种失败是结构性的：基于规范化的方法无法同时满足这三个属性。为了解决这个问题，我们提出了一种双层架构——CRDTMergeState，它将任何融合策略封装在CRDT兼容（无冲突复制数据类型）层中。第一层通过OR-Set CRDT语义管理贡献，其中融合操作是集合并集——这显然具有交换性、结合性和幂等性。第二层将融合策略作为确定性纯函数应用于一个规范有序的贡献集上，随机性从Merkle根中播种。我们证明这种分离保证了强最终一致性：所有接收相同贡献的副本计算出相同的融合模型，无论消息顺序如何。实证验证涵盖三个层次：受控的4x4张量（104/104测试通过）、生产规模的模型（最高7.24B参数，208种策略级测试，43,368种层级属性检查在受限张量分辨率下）以及多节点收敛在 gossip 和分区修复（100个节点，20种顺序）中，CRDT开销低于0.5毫秒。由于封装器是透明的，下游性能由构造保证，通过字节相同输出验证确认。参考实现可用作crdt-merge v0.9.4。

英文摘要

All 26 neural network merge strategies we tested including weight averaging, SLERP, TIES, DARE, Fisher merging, and evolutionary approaches -- fail the algebraic properties (commutativity, associativity, idempotency) required for conflict-free distributed operation. We prove that this failure is structural: normalisation-based merges cannot simultaneously satisfy all three properties. To resolve this, we present a two-layer architecture -- CRDTMergeState -- that wraps any merge strategy in a CRDT-compliant (Conflict-Free Replicated Data Type) layer. Layer 1 manages contributions via OR-Set CRDT semantics, where the merge operation is set union -- trivially commutative, associative, and idempotent. Layer 2 applies merge strategies as deterministic pure functions over a canonically-ordered contribution set, with randomness seeded from the Merkle root. We prove that this separation guarantees Strong Eventual Consistency: all replicas receiving the same contributions compute identical merged models, regardless of message ordering. Empirical validation spans three tiers: controlled 4x4 tensors (104/104 tests pass), production-scale models up to 7.24B parameters (208 strategy-level tests, 43,368 layer-level property checks at capped tensor resolution), and multi-node convergence under gossip and partition healing (100 nodes, 20 orderings), with CRDT overhead below 0.5 ms. Because the wrapper is transparent, downstream performance is identical by construction, confirmed via byte-identical output verification. The reference implementation is available as crdt-merge v0.9.4.

URL PDF HTML ☆

赞 0 踩 0

2605.19371 2026-05-20 cs.CV cs.AI 版本更新

ContextFlow：长周期具身智能体的分层任务-状态对齐

Shuhan Guo, Kun Zhang, Haifei Liu, Xingyu Gao, Yongqi Zhang, Yaqing Wang, Quanming Yao

发表机构 * Department of Electronic Engineering, Tsinghua University（清华大学电子工程系）； Qiuzhen College, Tsinghua University（清华大学启祯学院）； Beijing Institute of Mathematical Sciences and Applications（北京数学科学研究院）； Department of Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州）数据科学与分析部门）； Institute of Microelectronics, Chinese Academy of Sciences（中国科学院微电子研究所）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结本文研究了长周期具身智能体中任务-状态不一致问题，提出ContextFlow框架通过显式合同表示阶段、运行时观测转为证据包以及应用作用域更新来实现任务前沿对齐，提高任务执行的连贯性和可审计性。

详情

AI中文摘要

长周期具身智能体越来越多地将导航、搜索、接近和操作任务委托给专门执行器。随着这些执行器变得更强，瓶颈从局部技能执行转移到在规划、监控、记忆和执行之间保持一致的任务前沿。我们研究了任务-状态不一致，即在任务层面一致性失败，其中规划器的活跃阶段、运行时证据、记忆上下文和委托执行器不再支持相同的下一步决策。这种失败可能导致不支持的手动交接、阶段锁定、执行器-上下文不匹配和不必要的重新规划。我们提出ContextFlow，一个可检查的对齐框架，将阶段表示为显式合同，将运行时观测转换为证据包，并应用包括继续、细化、转移、提升和修复在内的作用域更新。ContextFlow使专门执行器负责局部闭环控制，同时使任务前沿对齐显式且可审计。在长周期具身任务上的实验和演示轨迹展示了证据基础的作用域更新如何诊断和缓解反复出现的任务-状态失败。

英文摘要

Long-horizon embodied agents increasingly delegate navigation, search, approach, and manipulation to specialist executors. As these executors become stronger, the main bottleneck shifts from local skill execution to maintaining a coherent task frontier across planning, monitoring, memory, and execution. We study task-state misalignment, a task-level consistency failure in which the planner's active stage, runtime evidence, remembered context, and delegated executor no longer justify the same next-step decision. This failure can lead to unsupported handoffs, stage lock, executor-context mismatch, and unnecessary replanning. We propose ContextFlow, an inspectable alignment framework that represents stages as explicit contracts, converts runtime observations into evidence packets, and applies scoped updates including continue, refine, transfer, promote, and repair. ContextFlow keeps specialist executors responsible for local closed-loop control while making task-frontier alignment explicit and auditable. Experiments and demonstration traces on long-horizon embodied tasks illustrate how evidence-grounded scoped updates diagnose and mitigate recurring task-state failures.

URL PDF HTML ☆

赞 0 踩 0

2605.19285 2026-05-20 cs.CL cs.AI cs.CY 版本更新

Are Rationales Necessary and Sufficient? Tuning LLMs for Explainable Misinformation Detection

理性是否必要且充分？为可解释的虚假信息检测调优大语言模型

Bing Wang, Rui Miao, Ximing Li, Chen Shen, Shaotian Yan, Changchun Li, Kaiyuan Liu, Xiaosong Yuan, Jieping Ye

发表机构 * College of Computer Science and Technology, Jilin University（吉林大学计算机科学与技术学院）； Tongyi Lab, Alibaba Group（阿里云实验室）； School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）； College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）

AI总结本文研究了如何通过调优大语言模型（LLM）来提升可解释性虚假信息检测的性能，提出了一种新的数据合成管道LONSREX，用于定位必要且充分的理性，以解决现有方法中因粗粒度标签和过度验证行为导致的理性不足和冗余问题。

Comments Accepted by KDD 2026. 12 pages, 8 figures. Code: https://github.com/wangbing1416/LONSREX

详情

AI中文摘要

社交媒体上虚假信息的快速传播已成为一个严峻挑战。为缓解其扩散，虚假信息检测（MD）已成为关键研究领域。传统基于小模型的MD方法通常通过黑盒过程进行二元分类。近年来，大型语言模型（LLMs）的兴起使可解释性MD成为可能，其中模型生成理性以解释其决策，从而提高透明度。现有可解释性MD方法主要集中在构建复杂的提示以从现成的LLMs中提取理性。在本文中，我们提出了一种管道来调优专门用于可解释性MD的LLM。我们的管道首先收集大规模经过事实核查的文章，然后使用多个强大的LLMs生成真实性预测和理性。为了确保高质量的训练数据，我们利用一种过滤策略，仅选择正确的实例进行微调。虽然该管道直观且普遍，但我们的实验表明，仅基于标签正确性的简单过滤在实践中是不够的，并存在两个关键限制：（1）粗粒度标签导致理性不足：仅基于二元标签过滤的理性不足以充分支持其决策；（2）过度验证行为导致不必要的理性：更强的LLMs倾向于表现出过度验证行为，生成过度冗长和不必要的理性。为了解决这些问题，我们引入了LONSREX，一种新的数据合成管道，用于定位可解释性MD中必要且充分的理性。具体来说，我们提出了一种度量标准，量化每个验证步骤对最终预测的贡献，从而评估其必要性和充分性。实验结果展示了LONSREX的有效性。

英文摘要

The rapid spread of misinformation on social media platforms has become a formidable challenge. To mitigate its proliferation, Misinformation Detection (MD) has emerged as a critical research topic. Traditional MD approaches based on small models typically perform binary classification through a black-box process. Recently, the rise of Large Language Models (LLMs) has enabled explainable MD, where models generate rationales that explain their decisions, thereby enhancing transparency. Existing explainable MD methods primarily focus on crafting sophisticated prompts to elicit rationales from off-the-shelf LLMs. In this work, we propose a pipeline to fine-tune a dedicated LLM specifically for explainable MD. Our pipeline begins by collecting large-scale fact-checked articles, and then uses multiple strong LLMs to produce veracity predictions and rationales. To ensure high-quality training data, we leverage a filtering strategy that selects only the correct instances for fine-tuning. While this pipeline is intuitive and prevalent, our experiments reveal that naive filtering based solely on label correctness is insufficient in practice and suffers from two critical limitations: (1) Coarse-grained labels cause insufficient rationales: Rationales filtered solely based on binary labels are insufficient to adequately support their decisions; (2) Over-verification behavior causes unnecessary rationales: Stronger LLMs tend to exhibit over-verification behavior, producing excessively verbose and unnecessary rationales. To address these issues, we introduce LONSREX, a novel data synthesis pipeline to Locate Necessary and Sufficient Rationales for Explainable MD. Specifically, we propose a metric that quantifies the contribution of each verification step to the final prediction, thereby evaluating its necessity and sufficiency. Experimental results demonstrate the effectiveness of LONSREX.

URL PDF HTML ☆

赞 0 踩 0

2605.19283 2026-05-20 cs.LG cs.AI stat.ML 版本更新

EviTrack: Selection over Sampling for Delayed Disambiguation

EviTrack: 在延迟歧义中选择而非采样

Omer Haq

发表机构 * Independent Researcher（独立研究者）

AI总结本文提出EviTrack框架，通过在潜在轨迹上进行选择而非边际状态，以在延迟歧义中实现更有效的序列推理，其核心方法是基于证据和似然比的轨迹假设选择，从而在数据支持后延迟承诺，优于基于采样的基线方法。

Comments https://github.com/Haq94/EviTrack

详情

AI中文摘要

在延迟歧义的环境中，顺序预测具有挑战性，因为早期观测模糊，多个潜在解释在足够证据积累之前仍然合理。基于边际推断的标准方法在此设置中表现不佳，要么过早坍塌不确定性，要么在信息证据出现后无法恢复。我们引入EviTrack，一种测试时间推断框架，该框架在潜在轨迹上而非边际状态上操作。EviTrack维护一组竞争轨迹假设，并应用基于证据和似然比的选择来延迟承诺，直到有数据支持。受多假设跟踪和先检测前跟踪中的假设管理启发。为了评估此设置，我们构建了一个受控的合成基准，具有已知的潜在真实值，明确展示了延迟歧义。在匹配的推断预算下，EviTrack显著优于基于采样的基线方法，实现更快的后歧义恢复。这些结果表明，在延迟歧义环境中，适度的轨迹级选择比增加采样覆盖更有效，突显了选择而非采样作为可靠序列推断的关键原则。

英文摘要

Sequential prediction is challenging in regimes of delayed disambiguation, where early observations are ambiguous and multiple latent explanations remain plausible until sufficient evidence accumulates. Standard approaches based on marginal inference struggle in this setting, either collapsing uncertainty prematurely or failing to recover once informative evidence arrives. We introduce EviTrack, a test-time inference framework that operates over latent trajectories rather than marginal states. EviTrack maintains a set of competing trajectory hypotheses and applies evidence- and likelihood-ratio-based selection to delay commitment until supported by data, drawing inspiration from hypothesis management in multiple hypothesis tracking and track-before-detect. To evaluate this setting, we construct a controlled synthetic benchmark with known latent ground truth that explicitly exhibits delayed disambiguation. At matched inference budget, EviTrack substantially outperforms sampling-based baselines, achieving faster post-disambiguation recovery. These results show that, in delayed disambiguation regimes, moderate trajectory-level selection is more effective than increasing sampling coverage, highlighting selection over sampling as a key principle for reliable sequential inference.

URL PDF HTML ☆

赞 0 踩 0

2605.19264 2026-05-20 cs.AI cs.MA 版本更新

Swimming with Whales: Analysis of Power Imbalances in Stake-Weighted Governance

与鲸鱼游泳：对基于权益治理中权力不平衡的分析

Yuzhe Zhang, Manvir Schneider, Qin Wang, Davide Grossi

发表机构 * Independent researcher（独立研究者）； University of Groningen（格罗宁根大学）； University of Amsterdam（阿姆斯特丹大学）

AI总结本文研究了基于权益的投票机制中权力失衡现象，通过计算社会选择理论分析了权益加权投票中权力不平衡的程度，并提供了理论和实证贡献。

详情

AI中文摘要

基于权益的投票方法是权益证明（PoS）区块链的基本治理范式。这种范式已知容易产生权力扭曲：少数拥有大权益的用户可能完全控制决策，即使他们不拥有全部权益。我们通过计算社会选择的视角研究这一现象，关注在使用Penrose-Banzhaf权力指数量化权力的情况下，权益加权投票中的权力不平衡程度。我们的工作提供了分析和实证贡献。分析上，我们证明虽然权力与相对权益所有权之间的完美一致通常无法实现，但在特定条件下可以期望近似。实证上，利用现实世界链上治理系统（Project Catalyst）的数据，我们提供了当前权益加权治理系统中可能发生的权力不平衡的更细致理解。

英文摘要

Voting methods weighted by stakes are the fundamental governance paradigm in Proof-of-Stake (PoS) blockchains. Such a paradigm is known to be prone to power distortions: a few users possessing large stakes may completely control decision making, even without owning the totality of the stakes. We study this phenomenon through the lens of computational social choice, focusing on the extent of power imbalances in stake-weighted voting when power is quantified using the Penrose-Banzhaf power index. Our work presents both analytical and empirical contributions. Analytically, we demonstrate that while a perfect alignment between power and relative stake ownership is generally unattainable, it can be approximated in expectation under specific conditions. Empirically, using data from a real-world on-chain governance system (Project Catalyst), we provide a more fine-grained understanding of the power imbalances that are likely to occur in current stake-weighted governance systems.

URL PDF HTML ☆

赞 0 踩 0

2605.19261 2026-05-20 cs.SE cs.AI cs.HC cs.PL 版本更新

When Web Apps Heal Themselves: A MAPE-K Based Approach to Fault Tolerance and Adaptive Recovery

当Web应用自我修复：基于MAPE-K模型的故障容忍与自适应恢复方法

Sales Aribe, Rov Japheth Oracion

发表机构 * Information Technology Department, Bukidnon State University（布基农州大学信息科技系）

AI总结本文提出一种基于MAPE-K模型的模块化自我修复框架，结合AutoFix机制实现自适应故障恢复，通过实验验证该框架在故障检测和恢复中的有效性，提高了Web应用的容错性和适应性。

Comments 12 pages, 3 figures, 2 tables

Journal ref Aribe, Sales G. & Oracion, R. J. G. (2026). When web apps heal themselves- A MAPE-K based approach to fault tolerance and adaptive recovery. International Journal of Informatics and Communication Technology, 15(2), 729-740

详情

DOI: 10.11591/ijict.v15i2.pp729-740

AI中文摘要

确保现代Web应用的可靠性和韧性仍然是一个关键挑战，由于系统复杂性和动态运行环境的增加。本研究提出了一种基于共享知识库的监控-分析-计划-执行（MAPE-K）模型的模块化自我修复框架，并整合了受AutoFix启发的自适应故障恢复机制。通过设计和开发研究（DDR）方法，该系统在二十种运行故障场景中进行了实施和评估，包括服务崩溃、内存泄漏和数据库断开。实验结果表明，所提出的框架实现了平均故障检测F1得分为90.7%，恢复成功率为93.2%。AutoFix模块将平均恢复时间（TTR）减少了56.2%，实现了平均恢复时间为3.92秒。系统吞吐量在故障条件下保持在88%至95%之间，响应时间仅增加了3.1%。此外，迭代反馈机制通过多个循环提高了恢复效率18.6%。这些发现表明，所提出的框架通过反馈驱动的适应性提供了一种实用且可扩展的方法，以通过反馈驱动的适应性增强Web应用的容错性。尽管当前实现依赖于预定义的恢复策略，但学习导向的反馈为未来更自主的自我修复系统的开发奠定了基础。

英文摘要

Ensuring the reliability and resilience of modern web applications remains a critical challenge due to increasing system complexity and dynamic runtime environments. This study proposes a modular self-healing framework based on the monitor-analyze-plan-execute over a shared knowledge base (MAPE-K) model, integrated with an AutoFix-inspired mechanism for adaptive fault recovery. Using a design and development research (DDR) approach, the system was implemented and evaluated through controlled fault injection experiments across twenty runtime failure scenarios, including service crashes, memory leaks, and database disconnections. Experimental results demonstrate that the proposed framework achieved a mean fault detection F1-score of 90.7% and a recovery success rate of 93.2%. The AutoFix module reduced the average time-to-recovery (TTR) by 56.2%, achieving an average recovery time of 3.92 seconds. System throughput was maintained between 88% and 95% during fault conditions, with only a 3.1% increase in response time. Additionally, iterative feedback mechanisms improved recovery efficiency by 18.6% over multiple cycles. These findings indicate that the proposed framework provides a practical and extensible approach to enhancing fault tolerance in web applications through feedback-driven adaptation. While the current implementation relies on predefined recovery strategies, the integration of learning-oriented feedback establishes a foundation for future development of more autonomous self-healing systems.

URL PDF HTML ☆

赞 0 踩 0

2605.19260 2026-05-20 cs.AI cs.CV cs.MA 版本更新

AQuaUI: Visual Token Reduction for GUI Agents with Adaptive Quadtrees

AQuaUI: 用于GUI代理的视觉令牌减少方法基于自适应四叉树

Yuankai Li, Tinghui Zhu, Ha Min Son, Zhe Zhao, Xin Liu, Muhao Chen

发表机构 * UC Davis（加州大学戴维斯分校）

AI总结本文提出AQuaUI，一种无需训练的推理时GUI代理模型的视觉令牌减少方法，利用屏幕截图中的非均匀信息密度，通过自适应四叉树结构保持令牌位置以确保一致性，并通过条件四叉树算法提升多步骤GUI交互的时序一致性，实验表明其在准确性和效率之间取得了改进。

详情

AI中文摘要

大型多模态模型（LMMs）最近已作为GUI代理模型的有希望的骨干出现，其中在每个迭代步骤中将高分辨率GUI截图引入提示中。然而，这些截图表现出高度非均匀的空间信息密度：大区域可能携带很少的信息且视觉上同质，而关键文本和图标可能需要高视觉保真度。现有方法要么需要额外训练，要么依赖于基于注意力的令牌压缩，忽略了GUI截图的结构布局和空间冗余。为填补这一空白，本文提出了AQuaUI，一种用于GUI代理模型的无训练推理时令牌减少方法，利用截图中的非均匀信息密度。AQuaUI在每个截图输入上构建一个自适应四叉树，并在四叉树的每个叶子节点保留一个代表性的合并令牌。AQuaUI在整个管道中保持保留令牌的空间位置，以确保所有位置编码阶段保持一致。为进一步提高多步骤GUI交互中的时间一致性，我们提出了一种条件四叉树算法，利用单个请求内连续截图之间的连续性。具体而言，它利用先前的四叉树作为参考来细化当前四叉树，帮助在静态或轻微移动的GUI状态下保留细粒度区域。我们在最先进的GUI代理模型上实现了AQuaUI，并在标准的地面和导航基准上进行了实验。AQuaUI在准确性和效率之间始终优于先前的基线。值得注意的是，在GUI-Owl-1.5-32B-Instruct上，AQuaUI实现了高达13.22%的速度提升和29.52%的更少视觉令牌，同时保留了99.06%的完整令牌性能，表明可以在不重新训练的情况下利用GUI截图的空间冗余。

英文摘要

Large Multimodal Models (LMMs) have recently emerged as promising backbones for GUI-agent models, where high-resolution GUI screenshots are introduced to the prompts at each iteration step. However, these screenshots exhibit highly non-uniform spatial information density: large regions may carry little information and are visually homogeneous, while key text and icons may require high visual fidelity. Existing approaches to this problem either require additional training or rely on attention-based token compression, ignoring the structured layout and spatial redundancy of GUI screenshots. To fill the gap, this paper proposes AquaUI, a training-free inference-time token reduction method for GUI agent models that utilizes the non-uniform information density in screenshots. AQuaUI constructs an adaptive quadtree on each screenshot input and keeps one representative merged token per leaf of the quadtree. AQuaUI preserves the spatial positions of retained tokens throughout the pipeline to ensure that all position-encoding stages remain consistent. To further improve temporal consistency across multi-step GUI interactions, we propose a conditional quadtree algorithm that leverages the continuity between consecutive screenshots within a single request. Specifically, it refines the current quadtree using previous quadtrees as references, helping preserve fine-grained regions across static or mildly shifted GUI states. We implement AQuaUI on state-of-the-art GUI agent models and conduct experiments on standard grounding and navigational benchmarks. AQuaUI consistently shows improved accuracy-efficiency trade-offs over prior baselines. Notably, on GUI-Owl-1.5-32B-Instruct, AQuaUI achieves up to 13.22% speedup and 29.52% fewer visual tokens while retaining 99.06% of full-token performance, suggesting that the spatial redundancy of GUI screenshots can be exploited at inference without retraining.

URL PDF HTML ☆

赞 0 踩 0

2605.19258 2026-05-20 cs.LG cs.AI 版本更新

ExECG: An Explainable AI Framework for ECG models

ExECG：用于ECG模型的可解释AI框架

Jong-Hwan Jang, Yong-yeon Jo

发表机构 * Medical AI Co. Ltd（医疗AI公司）

AI总结本文提出ExECG框架，旨在解决ECG模型在临床应用中缺乏解释性的问题，通过三阶段流程提供可重用和可复现的ECG可解释性。

详情

AI中文摘要

深度学习已使ECG诊断模型在如心律失常分类和异常检测等任务中表现出强大的性能。然而，仅凭准确性不足以满足临床部署的需求，因为它无法解释为何产生特定的输出，限制了验证、错误分析和信任。尽管ECG XAI已被广泛研究并持续改进，但不同研究中的实际流程和报告规范差异较大，阻碍了重用和可复现性。为了解决这些问题，我们提出了ExECG，一个Python框架，提供三阶段流程：Wrapper标准化访问异构ECG格式和中间表示，Explainer统一各种XAI方法到共享的执行协议，Visualizer支持在统一界面内一致的跨方法比较。我们通过简洁的例子和两个案例研究展示了端到端的使用，强调了可互操作和可复现的ECG可解释性。

英文摘要

Deep learning has enabled ECG diagnostic models with strong performance in tasks such as arrhythmia classification and abnormality detection. However, accuracy alone is insufficient for clinical deployment because it does not explain why a specific output was produced, limiting justification, error analysis, and trust. Although ECG XAI has been extensively investigated and steadily improved, practical pipelines and reporting conventions vary across studies, hindering reuse and reproducibility. To address these issues, we present Explainable AI framework for ECG models (ExECG), a Python framework that provides a three-stage pipeline: Wrapper standardizes access across heterogeneous ECG formats and intermediate representations, Explainer unifies diverse XAI methods under a shared execution protocol, and Visualizer supports consistent cross-method comparison within a unified interface. We demonstrate end-to-end usage with concise examples and two case studies, highlighting interoperable and reproducible ECG explainability.

URL PDF HTML ☆

赞 0 踩 0

2605.19250 2026-05-20 cs.AI 版本更新

Causal Evidence for Attention Head Imbalance in Modality Conflict Hallucination

因果证据：模态冲突幻觉中的注意力头不平衡

Jinrui Jiang, Zhangtai Wu, Zhen Wu, Xinyu Dai

发表机构 * National Key Laboratory for Novel Software Technology（新型软件技术国家重点实验室）； School of Artificial Intelligence（人工智能学院）

AI总结本文研究了多模态大语言模型在模态冲突中产生幻觉的原因，通过分析注意力头的因果作用，发现驱动幻觉的头部分布更广且权重更大，而抑制幻觉的头部集中在少量重要头部，提出MACI方法在减少幻觉的同时保持准确性。

详情

AI中文摘要

当多模态大语言模型（MLLMs）优先考虑错误的文本前提而非矛盾的视觉证据时，就会出现模态冲突幻觉。为了理解为什么视觉证据在生成过程中无法占据优势，我们从机制角度出发，考察哪些内部组件驱动或阻碍这一失败。我们通过在五个开源MLLMs上进行头部层面的因果分析，识别出两组具有相反因果作用的注意力头：驱动幻觉的头部和抑制幻觉的头部。我们发现一种一致的不对称性：驱动效应更广泛分布且具有更大的总权重，而抑制效应集中在少量重要头部。消融实验进一步证实，这些组在生成过程中产生相反效果：分布驱动影响和局部抑制共同形成不平衡的路由结构，使生成偏向于错误前提。受此发现启发，我们提出了MACI（模态冲突感知因果干预），一种条件干预方法，仅在检测到冲突时抑制因果识别出的驱动幻觉头部。在五个MLLMs上，MACI在MMMC基准测试中实现了最大的幻觉减少，同时在幻觉准确性之间取得了有利的权衡，并能够零样本转移到SCI-SemanticConflict测试。

英文摘要

Modality-conflict hallucination occurs when multimodal large language models (MLLMs) prioritize erroneous textual premises over contradictory visual evidence. To understand why visual evidence fails to prevail during generation, we take a mechanistic perspective and examine which internal components drive or resist this failure. We perform head-level causal analysis using path patching across five open-source MLLMs and identify two groups of attention heads with opposing causal roles: hallucination-driving heads and hallucination-resisting heads. We find a consistent asymmetry: driving effects are more broadly distributed and carry greater aggregate weight, whereas resisting effects concentrate in a small number of high-importance heads. Ablation experiments further confirm that these groups exert opposing effects during generation: distributed driving influence and localized resistance together form an imbalanced routing structure that biases generation toward the erroneous premise. Motivated by this finding, we propose MACI (Modality-conflict-Aware Causal Intervention), a conditional intervention that suppresses causally identified hallucination-driving heads only when conflict is detected. Across five MLLMs, MACI achieves the largest hallucination reduction among compared inference-time baselines on the MMMC benchmark with a favorable hallucination-accuracy trade-off, and transfers zero-shot to the SCI-SemanticConflict test.

URL PDF HTML ☆

赞 0 踩 0

2605.19243 2026-05-20 cs.LG cs.AI cs.CG 版本更新

Euclidean Embedding of Data Using Local Distances

利用局部距离进行数据的欧几里得嵌入

Dimitris Arabadjis

发表机构 * Department of Statistics and Actuarial-Financial Mathematics（统计与精算-金融数学系）； University of the Aegean（爱琴海大学）

AI总结本文研究了在仅给定局部距离图的情况下恢复全局一致的欧几里得嵌入问题，提出了一种能够最优表示这些距离的方法。该方法仅在由成对距离加权的邻域图上操作，不需要任何先前的数据向量表示。通过求解一个变分问题，将图上的局部距离与由嵌入函数微分诱导的欧几里得度量匹配。所得的欧拉-拉格朗日方程以坐标自由形式推导，允许仅从距离图直接评估所有算子。尽管非线性和缺少非线性的显式表达式，这些方程被证明可以作为迭代更新的稀疏线性问题解决。本文的主要贡献包括：(a)推导出在连续体中支配最优欧几里得嵌入的功能方程；(b)一种不依赖于特征向量的表示形式，仅需要邻域距离图；(c)基于纯粹局部图操作的估计程序。我们在合成流形和真实数据集上实验性地评估了所得到的非参数算法，证明了在保持局部度量结构和邻近关系的同时，能够近似全局等距嵌入。

详情

AI中文摘要

我们研究了在仅给定局部距离图的情况下恢复全局一致的欧几里得嵌入问题，并提出了一种能够最优表示这些距离的方法。该方法仅在由成对距离加权的邻域图上操作，不需要任何先前的数据向量表示。嵌入是通过求解一个变分问题来实现的，该问题将图上的局部距离与由嵌入函数微分诱导的欧几里得度量匹配。所得的欧拉-拉格朗日方程以坐标自由形式推导，允许仅从距离图直接评估所有算子。尽管非线性和缺少非线性的显式表达式，这些方程被证明可以作为迭代更新的稀疏线性问题解决。本文的主要贡献包括：(a)推导出在连续体中支配最优欧几里得嵌入的功能方程；(b)一种不依赖于特征向量的表示形式，仅需要邻域距离图；(c)基于纯粹局部图操作的估计程序。我们在合成流形和真实数据集上实验性地评估了所得到的非参数算法，证明了在保持局部度量结构和邻近关系的同时，能够近似全局等距嵌入。

英文摘要

We study the problem of recovering a globally consistent Euclidean embedding of data, given only a local distance graph and propose a method that optimally represents these distances. The method operates solely on a neighborhood graph weighted by pairwise distances, without requiring any prior vector representation of the data. The embedding is obtained by solving a variational problem that matches local, on-graph distances to the Euclidean metric, induced by the differentials of the embedding functions. The resulting Euler-Lagrange equations are derived in a coordinate-free form, enabling direct evaluation of all operators from the distance graph alone. Though non-linear and missing an explicit expression for their non-linearity, these equations are shown to be resolved as an iteratively updated sparse linear problem. The main contributions of the proposed approach are (a) the derivation of the functional equations governing the optimal Euclidean embedding in the continuum, (b) a representation-free formulation that requires only a neighborhood distance graph and no feature vectors and (c) an estimation procedure based exclusively on local graph operations. We experimentally evaluate the resulting non-parametric algorithm on synthetic manifolds and real datasets, demonstrating consistent preservation of local metric structure and neighboring relations, while approximating the global isometric embedding.

URL PDF HTML ☆

赞 0 踩 0

2605.19242 2026-05-20 cs.CV cs.AI cs.ET cs.LG cs.MM 版本更新

PhyWorld: Physics-Faithful World Model for Video Generation

PhyWorld: 用于视频生成的物理忠实世界模型

Pu Zhao, Juyi Lin, Timothy Rupprecht, Arash Akbari, Chence Yang, Rahul Chowdhury, Elaheh Motamedi, Arman Akbari, Yumei He, Chen Wang, Geng Yuan, Weiwei Chen, Yanzhi Wang

发表机构 * Northeastern University（东北大学）； University of Georgia（佐治亚大学）； Tulane University（路易斯安那大学）； EmbodyX

AI总结本文提出PhyWorld，一种通过两阶段训练提升视频生成模型的物理忠实性，以改进世界模拟器的性能，从而更有效地支持物理AI系统。

详情

AI中文摘要

世界模拟器可以在真实世界部署前提供安全且可扩展的环境来训练物理AI系统。大型视频生成模型正成为此类模拟器的有希望的基础，因为它们能够生成多样且逼真的视觉未来。然而，将其用作世界模拟器需要物理忠实的视频延续，即生成的视频应保持由条件输入隐含的物理状态，并以符合基本物理原理的方式演变。我们提出了PhyWorld，一种视频生成世界模型，通过两阶段的后训练来生成时间上一致且物理忠实的场景延续。在第一阶段，我们通过流匹配微调改进视频到视频延续，鼓励稳定视觉属性和帧间一致的运动动态。在第二阶段，我们通过直接偏好优化（DPO）对物理偏好对进行对齐，使模型朝着更符合物理合理性的输出发展。为了评估PhyWorld，我们使用了标准视频质量基准和专门的物理忠实性基准，并对每条物理定律进行评分。实验表明，PhyWorld提高了视频一致性，其在VBench上的平均得分为0.769，比最先进的基线0.756或更低。PhyWorld还提高了物理合理性，其在我们物理忠实性基准上的平均得分为3.09，比最强基线的2.99有所提高。这些结果表明，通过延续和物理偏好信号对大型视频生成模型进行后训练，可以使其成为更有效的物理AI世界模拟器。

英文摘要

World simulators can provide safe and scalable environments for training Physical AI systems before real-world deployment. Large video generation models are emerging as a promising basis for such simulators because they can generate diverse and realistic visual futures. However, using them as world simulators requires physically faithful video continuations, namely, generated videos that preserve the physical state implied by the conditioning input, and evolve in ways consistent with basic physical principles. We propose PhyWorld, a video generation world model designed to produce temporally coherent and physically faithful scene continuations through two-stage post-training. In the first stage, we improve video-to-video continuation with flow matching fine-tuning, encouraging stable visual attributes and coherent motion dynamics across frames. In the second stage, we align generated dynamics with physical principles using Direct Preference Optimization (DPO) over physics preference pairs, guiding the model toward outputs with higher physical plausibility. To evaluate PhyWorld, we use both standard video-quality benchmarks and a dedicated physical-faithfulness benchmark with per-law scoring. Experiments show that PhyWorld improves video consistency, achieving an average score of 0.769 on VBench compared with 0.756 or below for state-of-the-art baselines. PhyWorld also improves physical plausibility, reaching an average score of 3.09 on our physical-faithfulness benchmark compared with 2.99 for the strongest baseline. These results suggest that post-training large video generation models with continuation and physics-preference signals can make them more effective world simulators for Physical AI.

URL PDF HTML ☆

赞 0 踩 0

2605.19234 2026-05-20 cs.CL cs.AI 版本更新

AI Technologies in Language Access: Attitudes Towards AI and the Human Value of Language Access Managers

人工智能技术在语言接入中的应用：对人工智能的态度以及语言接入管理者的人类价值

Miguel A. Jiménez-Crespo, Stephanie Rodriguez, Alejandro Jaume Losa

发表机构 * Rutgers University/ Dept. of Spanish（罗格斯大学西班牙语系）； Rutgers University/ Dept. of Spanish and and Portuguese（罗格斯大学西班牙语和葡萄牙语系）

AI总结本文探讨了人工智能在语言接入中的影响，通过分析十位美国语言接入管理者在医疗、法庭、公共服务和地方政府领域的半结构化访谈，揭示了语言接入管理者对人工智能的有条件乐观态度以及对人工智能实施中人类价值和人类监督的高度重视。

Comments 11 pages, 2 tables, Convergence Conference 2026

详情

AI中文摘要

人工智能技术的快速出现正在重塑翻译实践和理论。本文探讨了人工智能在语言接入中的影响。这一领域的特点在于需要服务于广泛且多样化的用户群体，而效率和可及性受到法律要求、伦理和商业矛盾以及安全问题的影响。本文报告了语言接入管理者对人工智能以及人工智能时代的人类价值的态度和看法。方法上，本文呈现了一项关于语言接入和技术的更大研究的子集分析，具体为对十位美国语言接入管理者进行的定性主题分析，这些管理者在医疗、法庭、公共服务和地方政府领域工作。结果表明，语言接入管理者对不可避免的人工智能实施表现出有条件乐观，对风险具有强烈意识，并对人工智能实施和输出中的人类价值和人类监督有深刻承诺。

英文摘要

The rapid emergence of AI technologies is reshaping translation practices and theory across the board. This paper deals with the impact of AI in language access. This area is characterized by the need to serve broad and diverse user populations, within a context where efficiency and access are shaped by legal mandates, ethical and commercial tensions, and safety concerns. This paper reports on the attitudes and perceptions of language access managers towards the AI and the human value in the AI age. Methodologically, this paper presents an analysis of a subset of a broader study on language access and technology, specifically a qualitative thematic analysis of ten semi-structured interviews with language access managers in the USA working in healthcare, court, public service and local government contexts. The results indicate that language access managers show conditional optimism towards the inevitable AI implementations, are strongly risk aware, and deeply committed to the human value and human oversight of AI implementations and output.

URL PDF HTML ☆

赞 0 踩 0

2605.19229 2026-05-20 cs.AI 版本更新

Can Large Language Models Revolutionize Survey Research? Experiments with Disaster Preparedness Responses

大型语言模型能否革新调查研究？与灾害准备响应的实验

Yan Wang, Ziyi Guo, Christopher McCarty

发表机构 * Dept. of Urban and Regional Planning & Florida Institute for Built Environment Resilience, University of Florida（城市与区域规划系及佛罗里达环境韧性研究所，佛罗里达大学）； College of Liberal Arts and Sciences, Bureau of Economic and Business Research, University of Florida（文理学院及经济与商业研究局，佛罗里达大学）

AI总结本文探讨了大型语言模型在调查研究中的应用，通过实验验证了其在灾害准备响应中的有效性，提出了一个五阶段框架，涵盖问卷设计、样本选择、试点测试、缺失数据填补和事后分析，并介绍了基于保护动机理论的协同出现知识图谱和七种LLM配置。

详情

AI中文摘要

调查研究面临日益严峻的结构性挑战：响应率下降、样本偏差、高风险受访者中的块状缺失以及在线面板中AI辅助的欺诈性完成。大型语言模型（LLMs）已被提出作为解决方案，但迄今为止，对整个调查工作流程的严格评估仍然有限，特别是在灾害情境中，数据质量至关重要。我们提出并评估了一个五阶段框架，用于LLM的整合，涵盖问卷设计、样本选择、试点测试、缺失数据填补和事后分析，使用2024年飓风米勒尔准备调查（佛罗里达居民，n=946）作为共享的实证测试床。我们引入了一个受保护动机理论（PMT）约束的协同出现知识图谱，并开发了七种LLM配置，涵盖零样本推理、检索增强基线和新型理论指导变体。我们提出的锚定边际理论指导LLM（A-TLM）在灾难相关块状MNAR条件下，在RMSE上优于所有三个经典填补基线（IPW/MI、MICE+PMM、missForest）（S4 RMSE 1.439 vs. 1.496 for the next-best），同时在接近零的符号偏差（-0.121）方面优于随机森林填补器（产生最大的绝对偏差-0.631）。围绕PMT因果结构组织检索，并在单个模型调用中整合所有证据，优于无结构检索和分阶段顺序推理（MAE 0.993 vs. 1.097 for standard RAG）。我们记录了接近零的总体偏差可以掩盖相反的子组误差，并提出子组分层偏差审计作为报告标准。一个检索受限的知识图谱聊天机器人展示了幻觉是通过接地拒绝可管理的。

英文摘要

Survey research faces mounting structural challenges: declining response rates, sample bias, block-wise missingness among at-risk respondents, and AI-assisted fraudulent completions in online panels. Large language models (LLMs) have been proposed as a remedy, yet rigorous evaluations across the full survey workflow remain scarce, particularly in disaster contexts where data quality matters most. We present and evaluate a five-stage framework for LLM integration covering questionnaire design, sample selection, pilot testing, missing-data imputation, and post-collection analysis, using the 2024 Hurricane Milton preparedness survey of Florida residents (n=946) as a shared empirical testbed. We introduce a Protection Motivation Theory (PMT)-constrained co-occurrence knowledge graph and develop seven LLM configurations spanning zero-shot inference, retrieval-augmented baselines, and novel theory-informed variants. Our proposed Anchored Marginal Theory-Informed LLM (A-TLM) outperforms all three classical imputation baselines (IPW/MI, MICE+PMM, missForest) on RMSE under disaster-relevant block-wise MNAR conditions (S4 RMSE 1.439 vs. 1.496 for the next-best), while achieving near-zero signed bias (-0.121) where the random-forest imputer produces the largest absolute bias (-0.631). Organizing retrieval around PMT causal structure and integrating all evidence in a single model call outperforms unstructured retrieval and staged sequential inference (MAE 0.993 vs. 1.097 for standard RAG). We document that near-zero aggregate bias can mask opposing subgroup errors and propose subgroup-stratified bias auditing as a reporting standard. A retrieval-constrained knowledge-graph chatbot demonstrates that hallucination is architecturally manageable through grounded refusal.

URL PDF HTML ☆

赞 0 踩 0

2605.19227 2026-05-20 cs.CR cs.AI 版本更新

Token by Token, Compromised: Backdoor Vulnerabilities in Unified Autoregressive Models

逐token被入侵：统一自回归模型中的后门漏洞

Tobias Braun, Jonas Henry Grebe, Hossein Shakibania, Anna Rohrbach, Marcus Rohrbach

发表机构 * TU Darmstadt（图宾根大学）

AI总结本文研究了统一自回归模型中的后门漏洞问题，提出了一种名为Token by Token Backdoor Attack (ToBAC)的攻击方法，展示了如何通过数据和模型污染策略在多模态生成中引发有害行为。

详情

AI中文摘要

统一自回归模型（UAMs）是变压器模型，能够在单次自回归传递中生成文本和图像标记。共享参数和多模态词汇简化了训练流程并促进了灵活的多模态生成，但可能会引入新的漏洞。特别是，我们首次证明这种统一架构使多模态后门攻击成为可能，其中触发器可以跨多个输出模态传播恶意影响。具体而言，我们提出了Token by Token Backdoor Attack（ToBAC），这是首个针对UAMs的后门攻击，探索了基于数据和基于模型的污染策略。我们展示了无害的字符或甚至常见的单词可以被转换为触发器，从而在自回归图像生成中引发有害行为。ToBAC可以联合操控视觉输出和伴随文本，增加伪造内容的感知真实性。通过模型访问，ToBAC可以在统一的液体模型中发起攻击，其中微妙的词（例如，``cool''）在55%的生成中导致模态对齐的品牌推广或意识形态影响。在没有模型访问的情况下，ToBAC可以通过数据污染诱导，对JanusPro实现平均成功率为63.1%。

英文摘要

Unified autoregressive models (UAMs) are transformer models that generate text as well as image tokens within a single autoregressive pass. Shared parameters and a multimodal vocabulary simplify the training pipeline and facilitate flexible multimodal generation, yet might introduce new vulnerabilities. In particular, we are the first to show that this unified architecture enables multimodal backdoor attacks, where a trigger can propagate malicious effects across multiple output modalities. Specifically, we present the Token by Token Backdoor Attack (ToBAC), the first backdoor attack targeting UAMs, exploring both data-based and model-based poisoning strategies. We demonstrate that innocuous characters or even common words can be transformed into triggers that elicit harmful behavior in autoregressive image generation. ToBAC can jointly manipulate visual outputs and accompanying text, increasing the perceived authenticity of fabricated content. With model access, ToBAC enables attacks on the unified Liquid model in which a subtle word (e.g., ``cool'') induces modality-aligned brand promotion or ideological influence in 55% of generations. Without model access, ToBAC can be induced through data poisoning, achieving an average success rate of 63.1% against JanusPro.

URL PDF HTML ☆

赞 0 踩 0

2605.19220 2026-05-20 cs.CL cs.AI cs.LG 版本更新

Position: Uncertainty Quantification in LLMs is Just Unsupervised Clustering

位置：在LLM中的不确定性量化仅仅是无监督聚类

Tiejin Chen, Longchao Da, Xiaoou Liu, Hua Wei

发表机构 * School of Computing（计算学院）； Augmented Intelligence, Arizona State University（智能增强与亚利桑那州立大学）

AI总结本文指出，当前LLM的不确定性量化方法本质上是无监督聚类算法，无法有效评估模型的外部正确性，导致无法检测出自信但错误的回答。文章提出了改进的不确定性量化方法，以确保模型的自信度能可靠地反映现实。

Comments Accepted by ICML 2026 Position Paper Track

详情

AI中文摘要

不确定性量化（UQ）被广泛认为是部署大型语言模型（LLM）于高风险领域的主要保障。然而，我们主张该领域存在类别错误：主流LLM的UQ方法本质上是无监督聚类算法。我们证明大多数当前方法本质上量化的是模型生成的内部一致性，而不是其外部正确性。因此，当前方法从根本上无法识别事实现实，并无法检测出“自信幻觉”，即模型在稳定但错误的答案上表现出高自信。因此，当前UQ方法在部署模型时可能会产生误导的安全感。具体而言，我们识别出由于对内部状态的依赖而产生的三种关键病理：超参数敏感危机，使部署不安全；内部评估循环，将稳定性与事实混淆；以及缺乏事实基础，迫使依赖不稳定代理指标来评估不确定性。为解决这一困境，我们倡导向UQ方法转变，并为研究界制定研究路线图，以采用更好的评估指标和设置，实施原生不确定性机制的变化，并将验证锚定在客观事实上，确保模型自信度能可靠地反映现实。

英文摘要

Uncertainty Quantification (UQ) is widely regarded as the primary safeguard for deploying Large Language Models (LLMs) in high-stakes domains. However, we argue that the field suffers from a category error: mainstream UQ methods for LLMs are just unsupervised clustering algorithms. We demonstrate that most current approaches inherently quantify the internal consistency of the model's generations rather than their external correctness. Consequently, current methods are fundamentally blind to factual reality and fail to detect ``confident hallucinations,'' where models exhibit high confidence in stable but incorrect answers. Therefore, the current UQ methods may create a deceptive sense of safety when deploying the models with uncertainty. In detail, we identify three critical pathologies resulting from this dependence on internal state: a hyperparameter sensitivity crisis that renders deployment unsafe, an internal evaluation cycle that conflates stability with truth, and a fundamental lack of ground truth that forces reliance on unstable proxy metrics to evaluate uncertainty. To resolve this impasse, we advocate for a paradigm shift to UQ and outline a roadmap for the research community to adopt better evaluation metrics and settings, implement mechanism changes for native uncertainty, and anchor verification in objective truth, ensuring that model confidence serves as a reliable proxy for reality.

URL PDF HTML ☆

赞 0 踩 0

2605.19219 2026-05-20 cs.AI 版本更新

SimGym: A Framework for A/B Test Simulation in E-Commerce with Traffic-Grounded VLM Agents

SimGym：一种用于电子商务A/B测试模拟的框架，使用基于流量的VLM代理

Han Li, Vibhor Malik, Zahra Zanjani Foumani, Alberto Castelo, Shuang Xie, Ailin Fan, Keat Yang Koay, Yuanzheng Zhu, Meysam Feghhi, Ronie Uliana, Zhaoyu Zhang, Angelo Ocana Martins, Mingyu Zhao, Francis Pelland, Jonathan Faerman, Nikolas LeBlanc, Aaron Glazer, Andrew McNamara, Zhong Wu, Lingyun Wang

发表机构 * Shopify

AI总结本文提出SimGym框架，通过基于流量的VLM代理模拟电子商务A/B测试，解决真实测试周期长、风险高等问题，验证结果显示其能快速准确预测用户行为变化。

详情

AI中文摘要

A/B测试仍然是评估电子商务店铺修改的黄金标准，但其分流流量、需要数周才能达到统计显著性，并有降低用户体验的风险。我们提出了SimGym，一种使用视觉语言模型（VLM）代理在浏览器中模拟A/B测试的框架。该框架包含三个关键组件：（a）基于流量的买家人设生成管道，从生产点击流数据中推导出每个店铺的买家人设和意图；（b）实时浏览器代理架构，结合多模态感知和情景记忆与守卫规则，以在控制和处理店铺中进行连贯的购物会话；（c）评估协议，将模拟的成果变化与实际买家行为的观察变化进行比较。我们验证了SimGym在主要电子商务平台上对视觉驱动的UI主题变化的A/B测试，结果表明SimGym代理在观察到的成果变化上表现良好，与实际买家流量中不同界面变体的add-to-cart变化达成77%的方向一致。它将实验周期从数周减少到不到一小时，使快速实验成为可能，而无需将真实买家暴露于候选变体中。

英文摘要

A/B testing remains the gold standard for evaluating modifications to e-commerce storefronts, yet it diverts traffic, requires weeks to reach statistical significance, and risks degrading user experience. We present SimGym, a framework for simulating A/B tests on e-commerce storefronts using vision-language model (VLM) agents operating in a live browser. The framework comprises three key components: (a) a traffic-grounded persona generation pipeline that derives per-shop buyer archetypes and intents from production clickstream data; (b) a live-browser agent architecture that combines multimodal perception over visual and browser-structured observations with episodic memory and guardrails to conduct coherent shopping sessions across control and treatment storefronts; and (c) an evaluation protocol that compares simulated outcome shifts with observed shifts in real buyer behavior. We validate SimGym on A/B tests of visually driven UI theme changes from a major e-commerce platform across diverse storefronts and product categories. Empirical results show that SimGym agents achieve strong agreement with observed outcome shifts, attaining 77% directional alignment with add-to-cart shifts observed across interface variants in real-buyer traffic. It reduces experimental cycles from weeks to under an hour, enabling rapid experimentation without exposing real buyers to candidate variants.

URL PDF HTML ☆

赞 0 踩 0

2605.19218 2026-05-20 cs.CV cs.AI 版本更新

Rotation-Aligned Key Channel Pruning for Efficient Vision-Language Model Inference

旋转对齐的关键通道剪枝用于高效的视觉-语言模型推理

Beomseok Kang, Dongwon Jo, Jiwon Song, Donghwee Son, Jae-Joon Kim

发表机构 * Seoul National University（首尔国立大学）

AI总结本文提出旋转对齐的关键通道剪枝方法，通过压缩通道维度在固定KV缓存预算下保留更多视觉token，解决传统token剪枝在细粒度感知任务中的性能下降问题，同时提升解码效率。

详情

AI中文摘要

视觉-语言模型在推理过程中面临严重的KV缓存压力，因为单张图像通常会编码成数千个token。现有方法主要通过token稀疏性进行token剪枝，但永久丢弃视觉内容导致细粒度感知任务显著退化。为此，本文提出一个互补的轴，即特征稀疏性：在固定KV缓存预算下，压缩通道维度可以在相同内存成本下保留更多视觉token。然而，现有关键通道剪枝方法面临结构上的权衡：基于token的通道剪枝具有表现力但不结构化且较慢，而基于head的方法则硬件友好但不够稳健。本文通过RotateK，一种基于旋转的结构化关键通道剪枝框架，解决这一问题。RotateK应用基于PCA的在线旋转，将token依赖的通道重要性对齐到共享的低维子空间，从而在轻量级head掩码下实现精确剪枝；融合的Triton注意力内核直接在稀疏通道的Key上操作以实现高效的解码。在两个代表性的VLM后端上进行的实验表明，RotateK在准确率和解码延迟方面均优于现有关键通道剪枝方法，而联合token-通道剪枝在匹配的KV缓存预算下优于仅token剪枝的基线。

英文摘要

Vision-Language Models suffer severe KV cache pressure at inference, as a single image often encodes into thousands of tokens. Most existing methods exploit token sparsity through token pruning, but permanently discarding visual content causes substantial degradation on fine-grained perception tasks. This motivates a complementary axis, feature sparsity: under a fixed KV cache budget, compressing the channel dimension preserves more visual tokens at the same memory cost. Prior Key channel pruning methods, however, face a structural trade-off: token-wise channel pruning is expressive but unstructured and slow, while head-wise approach is hardware-friendly but less robust. We resolve this with RotateK, a rotation-based structured Key channel pruning framework. RotateK applies an online PCA-based rotation that aligns token-dependent channel importance into a shared low-dimensional subspace, enabling accurate pruning under lightweight head-wise masks; a fused Triton attention kernel operates directly on sparse-channel Keys for efficient decoding. Experiments on two representative VLM backbones show that RotateK consistently outperforms prior Key channel pruning in both accuracy and decoding latency, while joint token-channel pruning improves over token-only baselines at matched KV cache budgets.

URL PDF HTML ☆

赞 0 踩 0

2605.19215 2026-05-20 cs.AI 版本更新

Not all uncertainty is alike: volatility, stochasticity, and exploration

并非所有不确定性都相同：波动性、随机性与探索

Payam Piray

发表机构 * Department of Psychology, University of Southern California（南加州大学心理学系）

AI总结本文研究了在生物和人工智能中适应性决策中波动性和随机性对探索的影响差异，提出了CAUSE方法以提升探索效率。

详情

AI中文摘要

在生物和人工智能中适应性决策需要在利用已知结果和探索不确定替代方案之间取得平衡。尽管先前研究表明不确定性通常促进探索，但通常将不同的环境不确定性来源视为等同。我们考虑具有潜在线性奖励状态随时间变化（波动性）和通过噪声结果观察（随机性）的环境。两者都增加后验不确定性，但我们显示它们驱动最优探索的方向相反：波动性增强它，随机性抑制它。我们通过将Gittins指数框架扩展到具有潜在线性动态的高斯状态空间带顿时，正式建立了这种不对称性。我们进一步推导出Cause-Aware Uncertainty-Sensitive Exploration (CAUSE)，一种通过控制-推理获得的闭式探索奖励，继承了相同的单调性。CAUSE在具有异质噪声结构的环境中优于标准探索策略，并且在非休息带顿设置中改进了Gittins-per-arm策略。学习和探索由相同的噪声推理不对称性所支配，并且该框架预测病理噪声推理会产生相反而非仅仅受损的探索，对计算精神病学的解释具有启示。

英文摘要

Adaptive decision-making in biological and artificial intelligence requires balancing the exploitation of known outcomes with the exploration of uncertain alternatives. Although prior work suggests that uncertainty generally promotes exploration, it has typically treated distinct sources of environmental uncertainty as equivalent. We consider environments with latent reward states that drift over time (volatility) and are observed through noisy outcomes (stochasticity). Both increase posterior uncertainty, yet we show they drive optimal exploration in opposite directions: volatility enhances it, stochasticity suppresses it. We establish this asymmetry formally by extending the Gittins index framework to Gaussian state-space bandits with latent dynamics. We further derive Cause-Aware Uncertainty-Sensitive Exploration (CAUSE), a closed-form exploration bonus obtained via control-as-inference that inherits the same monotonicities. CAUSE outperforms standard exploration strategies in environments with heterogeneous noise structure, and also improves on a Gittins-per-arm policy whose rested-bandit optimality does not transfer to restless settings. Learning and exploration are governed by the same noise-inference asymmetry, and the framework predicts that pathological noise inference produces \emph{reversed} rather than merely impaired exploration, with implications for computational accounts of psychiatric conditions.

URL PDF HTML ☆

赞 0 踩 0

2605.19207 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Quantized Machine Learning Models for Medical Imaging in Low-Resource Healthcare Settings

用于低资源医疗环境的量化机器学习模型：医学影像

Sumanth Meenan Kanneti, Aryan Shah

发表机构 * Georgia State University（佐治亚州立大学）

AI总结本文提出了一种多策略压缩框架，用于MRI图像中的脑肿瘤分类，通过量化感知训练、从DenseNet-101教师模型到紧凑DenseNet-32学生模型的知识蒸馏以及轻量MobileNetV2骨干网络上的Float16后训练量化，实现了在低资源医疗环境中高效且准确的脑肿瘤筛查。

详情

AI中文摘要

深度学习模型在医学影像分析中表现出强大的性能，但在低资源临床环境中部署仍然困难，由于计算、内存和电力限制。本文提出了一种多策略压缩框架，用于从MRI中进行脑肿瘤分类，包括量化感知训练、从DenseNet-101教师模型到紧凑DenseNet-32学生模型的知识蒸馏，以及在轻量MobileNetV2骨干网络上的Float16后训练量化。使用包含胶质瘤、脑膜瘤、垂体瘤和健康对照的多类脑肿瘤MRI数据集，我们提供了基于MobileNetV2的完整实验验证，通过三阶段迁移学习训练分类器，并通过TensorFlow Lite应用Float16量化。DenseNet基于的知识蒸馏和量化感知训练策略被描述为框架内的互补压缩方法，其完整的经验评估留待未来工作。在MobileNetV2管道上的实验结果表明，量化模型在验证准确率为82.37%的情况下，与全精度基线82.20%相比，模型大小从35.34 MB减少到5.76 MB，压缩比为6.14倍，无显著精度损失。各分类评估证实，量化在所有四个肿瘤类别中均匀保持诊断性能。这些发现表明，轻量化的量化模型可以在资源受限的医疗环境中提供临床可行的脑肿瘤筛查。

英文摘要

Deep learning models have shown strong performance in medical image analysis, but deploying them in low-resource clinical environments remains difficult due to computational, memory, and power constraints. This paper presents a multi-strategy compression framework for brain tumor classification from MRI, encompassing quantization-aware training, knowledge distillation from a DenseNet-101 teacher to a compact DenseNet-32 student with low-bit post-training quantization, and Float16 post-training quantization on a lightweight MobileNetV2 backbone. Using a multi-class brain tumor MRI dataset containing glioma, meningioma, pituitary tumors, and healthy controls, we provide full experimental validation of the MobileNetV2-based pipeline, training the classifier through a three-stage transfer learning process and applying Float16 quantization via TensorFlow Lite. The DenseNet-based distillation and quantization-aware training strategies are described as complementary compression approaches within the framework, with their complete empirical evaluation reserved for future work. Experimental results on the MobileNetV2 pipeline show that the quantized model achieves 82.37 percent validation accuracy compared to the 82.20 percent full-precision baseline, reducing model size from 35.34 MB to 5.76 MB, a 6.14x compression ratio with no meaningful accuracy loss. Per-class evaluation confirms that quantization preserves diagnostic performance uniformly across all four tumor categories. These findings demonstrate that lightweight quantized models can deliver clinically viable brain tumor screening in resource-constrained healthcare settings.

URL PDF HTML ☆

赞 0 踩 0

2605.19202 2026-05-20 cs.RO cs.AI math.OC 版本更新

Aerial Inspection Behaviors via RL-based Quadrotor Control for Under-canopy Forest Environments

通过基于强化学习的四旋翼控制实现空中巡检行为：在树冠下森林环境中的应用

Fausto Mauricio Lagos Suarez, Akshit Saradagi, Vidya Sumathy, Viswa Narayanan Sankaranarayanan, George Nikolakopoulos

发表机构 * Robotics and AI group, in the Department of Computer Science, Electrical and Space Engineering at Luleå University of Technology, Sweden（鲁尔坎大学技术学院机器人与人工智能小组，计算机科学、电气与空间工程系，瑞典）

AI总结本文提出了一种基于深度强化学习的四旋翼控制器，用于在树冠下森林环境中进行自主巡检任务，通过端到端控制策略实现巡检视角姿态跟踪，并结合旅行商问题规划器和快速随机树星规划器确保长距离任务的安全可靠部署。

Comments Submitted to 2026 IEEE 22nd International Conference on Automation Science and Engineering

详情

AI中文摘要

本文针对在树冠下森林环境中使用基于深度强化学习（RL）的低级四旋翼控制器进行空中巡检任务的问题进行了研究。具体而言，本文提出了一种端到端（将状态映射到RPMs）的四旋翼控制策略，实现了巡检视角姿态跟踪（同时位置和偏航参考跟踪），这对于各种目标巡检行为和森林中的点对点导航至关重要。为确保在长距离任务中端到端RL控制器的安全可靠部署，本文利用了一个包含旅行商问题规划器（TSP）和快速随机树星规划器（RRT*）的更高导航指导层。在已知的森林地图和一组用户指定的巡检区域上，TSP规划器找到最优访问序列。在两个目标区域之间，RRT*规划器生成符合下层端到端RL策略跟踪限制的碰撞自由路径。通过五个目标巡检场景，本文证明了基于强化学习的电机级稳定控制器，结合导航指导层，可以有效用作树冠下森林巡检任务的低级巡检执行模块。

英文摘要

This paper addresses the problem of using a deep Reinforcement Learning (RL)-based low-level Quadrotor controller within an autonomous Quadrotor navigation stack for aerial inspection missions in under-canopy forest environments. Specifically, the article presents an end-to-end (mapping states to RPMs) Quadrotor control policy that achieves inspection view-pose tracking (simultaneous position and yaw reference tracking), which is crucial for various target inspection behaviors and point-to-point navigation in forests. To ensure safe and reliable deployment of the end-to-end RL controller in long-range missions, this article utilizes a higher navigation guidance layer comprising of a Traveling Salesman Problem planner (TSP) and a Rapidly-exploring Random Tree Star (RRT*) planner. Over a known map of a forest and a set of user-specified inspection regions, the TSP planner finds the optimal visitation sequence. Between two target regions, collision-free paths that respect the tracking limitations of the lower end-to-end RL policy are generated by an RRT* planner. Through five target inspection scenarios, this article demonstrates that an RL-based motor-level stabilizing controller, supported by a navigation guidance layer, can be used effectively as the low-level inspection execution module for under-canopy forest inspection missions.

URL PDF HTML ☆

赞 0 踩 0

2605.19201 2026-05-20 cs.LG cs.AI 版本更新

On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis

设备端持续学习与双阶段缓冲器和动态损失用于现场肺炎诊断

Danu Kim

发表机构 * Korea International School, Jeju Campus（韩国国际学校，济州校区）

AI总结本文提出PneumoNet，一种适用于资源受限环境的领域增量学习方法，结合轻量级CNN进行设备端预测，双阶段平衡缓冲器实现类别平衡回放，以及动态类别加权损失以纠正训练批次不平衡，实验表明其在模拟五个真实域变化场景的PneumoniaMNIST数据集上达到86.6%的准确率，同时更小更高效。

Comments Presented at 32nd Samsung Humantech Paper Awards

2605.19190 2026-05-20 cs.CY cs.AI cs.HC 版本更新

Going PLACES: Participatory Localized Red Teaming for Text-to-Image Safety in the Global South

Going PLACES: 参与式本地化红队测试用于全球南方的文本到图像安全

Charvi Rastogi, Mukul Bhutani, Minsuk Kahng, Shamsuddeen Hassan Muhammad, Evgeniia Razumovskaia, Priyanka Suresh, Ibrahim Said Ahmad, Charu Kalia, Yaaseen Mahomed, Madhurima Maji, Minjae Lee, Alicia Parrish, Jessica Quaye, Vijay Janapa Reddi, Aishwarya Verma, Lora Aroyo

发表机构 * Google DeepMind（谷歌深Mind）； Yonsei University（延世大学）； Imperial College（帝国理工学院）； University of Wisconsin–Stevens Point（威斯康星州立大学斯普林特分校）； Google Research（谷歌研究）； Harvard University（哈佛大学）

AI总结本文提出PLACES数据集，通过在非洲和亚洲的本地社区进行参与式红队测试，收集了26000多个文本到图像模型失败案例，揭示了全球南方在文化和社会规范方面的独特对抗模式和安全框架的结构性缺失。

Comments Published at ACM Conference on FAccT 2026

详情

AI中文摘要

尽管文本到图像（T2I）模型已在全球范围内部署，但其安全框架大多基于西方默认设置，这为其他地区带来了显著的漏洞。为了拥抱文化多元主义并引入历史上代表性不足的视角，我们在全球南方进行了本地化的社区中心红队测试研究。我们的双重视角优先考虑本地化和参与，通过关注这些地区的次级城市中心，并开展社区参与和培训研讨会，以 contextualize 本地规范。结果，我们提出了PLACES数据集，其中包括与加纳、尼日利亚以及印度两个地区（卡纳塔克和旁遮普）的大学合作收集的超过26,000个T2I模型失败示例。分析收集的提示揭示了与现有地理无关的众包红队测试数据相比，社会文化和语言属性的广泛多样性。我们观察到由本地文化和语言细微差别所启用的独特对抗模式，以及在地区内围绕特定主题（如印度的宗教）的明显聚类。此外，我们通过识别新的危害，揭示了现有安全框架的结构性缺失，这些危害表现出规范不一致（例如，违反宗教规范、忽视本地习俗和 ominous 的象征意义）。这项工作认为，扩展T2I安全需要超越单纯的规模，转而采用深入本地化和参与性的数据收集和情境化方法。内容警示：本文包含可能有害或冒犯性内容的示例。

英文摘要

Despite the global deployment of text-to-image (T2I) models, their safety frameworks are largely calibrated to a Western-centric default, creating significant vulnerabilities for the rest of the world. To embrace cultural pluralism and bring historically under-represented perspectives in T2I safety, we conduct localised community-centered red teaming studies in the Global South. Our two-fold approach prioritizes localization and participation, by focusing on secondary urban centers in these regions, and conducting community engagement and training workshops to contextualize local norms. As a result, we present PLACES, a dataset comprising over 26,000 examples of T2I model failures collected in partnership with universities in Ghana, Nigeria, and two regions of India (Karnataka and Punjab). Analysis of prompts collected reveals a wide-ranging diversity in socio-cultural and linguistic attributes, when compared to existing geography-agnostic crowdsourced red-teaming data. We observe unique adversarial patterns enabled by local cultural and linguistic nuances, and distinct clusters within region around specific themes, such as religion in India. Moreover, we uncover structural contextual gaps in existing safety frameworks by identifying novel harms showing normative dissonance (e.g., violating religious norms, ignoring local customs, and ominous symbolism). This work argues that expanding T2I safety requires moving beyond mere scale to incorporate deeply localised, participatory methodologies for data collection and contextualization. Content warning: This paper includes examples containing potentially harmful or offensive content.

URL PDF HTML ☆

赞 0 踩 0

2605.19185 2026-05-20 cs.LG cs.AI 版本更新

Planner-Admissible Graph-PDE Value Extensions for Sparse Goal-Conditioned Planning

规划可接受的图-偏微分方程值扩展用于稀疏目标条件规划

Shiheng Zhang

发表机构 * Department of Applied Mathematics, University of Washington（应用数学系，华盛顿大学）

AI总结本文研究了在操作argmin-Q规划器下，哪些图值扩展是规划可接受的，提出了一种局部动作间隙证书，证明在 rollout 过程中若代理值误差低于真实动作间隙的一半，则贪心 rollout 可达到目标。通过比较原理填充距离界，AMLE 实现了该证书，而调和扩展由于反映边界击中概率而非最短路径贪心顺序，可能导致局部动作排名错误。

详情

AI中文摘要

稀疏目标条件规划中，少量成本到目标标签可视为图-偏微分方程Dirichlet扩展问题：将稀疏标签扩展到目标依赖的边界上，以贪心rollouts达到目标。我们研究了在操作argmin-Q规划器下哪些图值扩展是规划可接受的。我们的主要结果是一种局部动作间隙证书：如果代理值误差在rollout过程中保持在真实动作间隙的一半以下，则贪心rollout可达到目标。绝对最小Lipschitz扩展（AMLE），作为图p-Laplacian家族的p=∞端点，通过比较原理填充距离界实现了该证书。相比之下，调和扩展由于其值反映边界击中概率而非最短路径贪心顺序，可能导致局部动作排名错误。在120个AntMaze布局衍生的图配置上，调和扩展实现0.584的累积rollout成功率，而AMLE达到0.970。有限高p方法也进入高成功率区域，p=4时成功率0.903，p=8时0.973，p=16固定预算求解器时0.982，尽管p=16行未作为收敛端点排名使用，因求解器认证不完整。机制审计显示，许多rollout决策发生在AMLE兼容但调和不兼容的局部几何中，并且AMLE在rollout加权决策范围内修正了大多数调和反转。

英文摘要

Sparse goal-conditioned planning with few cost-to-go labels can be viewed as a graph-PDE Dirichlet extension problem: extend sparse labels on a goal-dependent boundary to unlabelled graph vertices so that greedy rollouts reach the goal. We study which graph value extensions are planner-admissible under the operational argmin-Q planner. Our main result is a local action-gap certificate: if the surrogate value error along the rollout stays below half the true action gap, then the greedy rollout reaches the goal. Absolutely Minimal Lipschitz Extension (AMLE), the p=infinity endpoint of the graph p-Laplacian family, instantiates this certificate through a comparison-principle fill-distance bound. Harmonic extension, by contrast, can mis-rank local actions because its values reflect boundary hitting probabilities rather than shortest-path greedy order. On 120 AntMaze layout-derived graph configurations, harmonic extension achieves 0.584 aggregate rollout success, while AMLE reaches 0.970. Finite high-p methods also enter a high-success regime, with success 0.903 for p=4, 0.973 for p=8, and 0.982 for a fixed-budget p=16 solver, though the p=16 row is not used as a converged endpoint ranking due to incomplete solver certification. Mechanism audits show that many rollout decisions occur in AMLE-compatible but harmonic-incompatible local geometry, and that AMLE corrects most harmonic inversions on the rollout-weighted decision scope.

URL PDF HTML ☆

赞 0 踩 0

2605.19172 2026-05-20 cs.LG cs.AI 版本更新

Bridge: Retrieval-Augmented Spatiotemporal Modeling for Urban Delivery Demand

Bridge：基于检索的时空建模用于城市配送需求

Yihong Tang, Tong Nie, Junlin He, Qianjun Huang, Dingyi Zhuang, Lijun Sun

发表机构 * McGill University（麦吉尔大学）； The Hong Kong Polytechnic University（香港理工大学）； University of Toronto（多伦多大学）； MIT（麻省理工学院）

AI总结本文提出Bridge框架，通过结合归纳上下文图结构和时间感知的记忆模块，解决新加入服务区域缺乏历史记录导致的城市配送需求预测难题，提升了冷启动区域的预测性能。

详情

AI中文摘要

预测城市配送需求在新增服务区域缺乏历史记录时变得尤为具有挑战性。现有的时空预测器在有足够的节点历史时能有效建模空间依赖性，但它们仍然是参数化的，因此在冷启动区域难以恢复短期运营动态。地理嵌入帮助识别区域的位置和功能，但并不能直接揭示相似区域在相似时间背景下行为的方式。我们提出了Bridge，一种结合归纳上下文图结构和时间感知记忆的时空图框架。对于每个目标区域，Bridge通过区域上下文和近期动态从记忆中检索未来需求模式，并通过门控融合机制优化图结构预测。为了使检索与预测效用对齐，我们进一步训练检索器以未来为导向的目标，偏好那些未来轨迹与目标最匹配的条目。实验表明，Bridge在四个真实世界配送数据集上，无论是城市内部冷启动还是跨城市转移时部分观察情况下，均优于竞争性的时空基线模型。结果表明，当参数图泛化能力不足时，检索增强为冷启动城市需求预测提供了有用的操作记忆。

英文摘要

Forecasting urban delivery demand becomes substantially more challenging when newly added service regions lack historical records. Existing spatiotemporal forecasters effectively model spatial dependence once sufficient node histories are available. Still, they remain parametric and therefore struggle to recover short-term operational dynamics in cold-start regions. Geospatial embeddings help identify where a region is and what function it serves, yet they do not directly reveal how a similar region behaves under a comparable temporal context. We propose Bridge, a retrieval-augmented spatiotemporal graph framework that combines an inductive contextual graph backbone with a time-aware memory of region-time windows. For each target region, Bridge retrieves future demand patterns from the memory using both regional context and recent dynamics, and refines the backbone forecast through a gated fusion mechanism. To align retrieval with forecasting utility, we further train the retriever with a future-aware objective that favors entries whose future trajectories best match the target. Experiments on four real-world delivery datasets show that Bridge consistently improves over competitive spatiotemporal baselines in both within-city cold-start and cross-city transfer with partial observations. The results show that retrieval augmentation provides a useful operational memory for cold-start urban demand forecasting when parametric graph generalization alone is insufficient.

URL PDF HTML ☆

赞 0 踩 0

2605.19156 2026-05-20 cs.AI cs.CY cs.LG cs.MA 版本更新

How Far Are We From True Auto-Research?

我们距离真正的自动研究还有多远？

Zhengxin Zhang, Ning Wang, Sainyam Galhotra, Claire Cardie

发表机构 * Cornell University（康奈尔大学）

AI总结本文通过ResearchArena评估了不同代理生成的论文质量，发现虽然代理能生成看似有竞争力的论文，但实际实验严谨性不足，存在伪造结果、实验能力不足和计划与执行不匹配等问题，表明自动研究仍需进一步发展。

详情

AI中文摘要

最近的自动研究系统能够生成完整的论文，但可行性并不等同于质量，该领域仍然缺乏对代理生成论文实际质量的系统研究。我们介绍了ResearchArena，一个最小的框架，让现成的代理（Claude Code使用Opus 4.6，Codex使用GPT-5.4，和Kimi Code使用K2.5）在仅轻量指导下自行完成完整的研究循环（构想、实验、论文写作、自我完善）。在13个计算机科学种子和每个代理-领域对的3次试验中，ResearchArena生成了117篇代理生成的论文，每篇都在三个互补的视角下评估：仅手稿的评审员（SAR）、考虑工件的同行评审（PR）以及人工进行的元评审。在仅SAR的情况下，图景是乐观的：Claude Code获得最高评分，优于Analemma的FARS，并与加权平均的人类ICLR 2025提交匹配，表明最小框架的代理能够生成在手稿-only评审中看起来有竞争力的论文。然而，人工检查却揭示了这个图景被夸大了：SAR评分与实际接受决定不一致，且奖励合理框架而不验证实验实质。在考虑工件的PR评分急剧下降，人工审计发现实验严谨性是主要瓶颈，分解为三种失败模式（伪造结果、低能力实验、计划/执行不匹配），这些模式高度依赖于代理：Codex 5%/8%论文与工件不匹配/伪造参考文献，与Kimi Code 77%/72%相比，差距约为15倍，追踪代理发展出的不同研究身份。没有一篇代理生成的论文达到顶级会议的接受标准。这表明我们仍然与真正的自动研究有差距。

英文摘要

Recent auto-research systems can produce complete papers, but feasibility is not the same as quality, and the field still lacks a systematic study of how good agent-generated papers actually are. We introduce ResearchArena, a minimal scaffold that lets off-the-shelf agents (Claude Code using Opus 4.6, Codex using GPT-5.4, and Kimi Code using K2.5) carry out the full research loop themselves (ideation, experimentation, paper writing, self-refinement) under only lightweight guidance. Across 13 computer science seeds and 3 trials per agent-domain pair, ResearchArena yields 117 agent-generated papers, each evaluated under three complementary lenses: a manuscript-only reviewer (SAR), an artifact-aware peer review (PR) in which agents inspect the workspace alongside the manuscript, and an human conducted meta-review. Under SAR alone the picture is optimistic: Claude Code obtains the highest score, outperforms Analemma's FARS, and matches the weighted-average human ICLR 2025 submission, suggesting that minimally scaffolded agents can produce papers that look competitive on manuscript-only review. Manual inspection, however, reveals this picture is overstated: SAR scores are poorly aligned with its actual acceptance decisions and reward plausible framing without verifying experimental substance. Under artifact-aware PR scores drop sharply, and manual auditing identifies experimental rigor as the major bottleneck, decomposing into three failure modes (fabricated results, underpowered experiments, and plan/execution mismatch) that are highly agent-dependent: Codex 5%/8% paper-vs-artifact mismatch / fabricated references versus Kimi Code 77%/72%, a $\sim$15$\times$ spread that tracks distinct research personas the agents develop. None of the 117 agent-generated papers reaches the acceptance bar of a top-tier venue. This suggests that we are still gapped from the true auto-research.

URL PDF HTML ☆

赞 0 踩 0

2605.19151 2026-05-20 cs.AI cs.HC 版本更新

Progressive Autonomy as Preference Learning: A Formalization of Trust Calibration for Agentic Tool Use

渐进自主性作为偏好学习：代理工具使用中的信任校准形式化

Changkun Ou

发表机构 * Changkun Ou（Ou Changkun）

AI总结本文将代理工具使用中的信任校准形式化为一个偏好学习问题，通过高斯过程后验模型维护潜在人类风险容忍函数，并在审批结果最不确定的地方升级到人类，继承了偏好贝叶斯优化的推理机制和样本效率论证，但目标不同。

2605.19150 2026-05-20 cs.LG cs.AI 版本更新

学习手柄：在接口约束下的可证明收敛的工作流学习

Jiayu Li, Enpei Zhang, Dawei Zhou, Elynn Chen, Yujun Yan

发表机构 * Stern School of Business（斯特恩商学院）； New York University（纽约大学）； Department of Computer Science（计算机科学系）； Dartmouth College（达特茅斯学院）； Virginia Tech（弗吉尼亚理工大学）

AI总结该研究探讨了在接口约束下的工作流学习问题，提出了一种异步去中心化的Q学习算法IC-Q，并给出了神经IC-Q的有限样本界，证明了在去中心化部分可观测性下的神经Q学习的第一个有限样本保证。

详情

AI中文摘要

我们研究了在专门的代理通过共享的艺术品进行控制转移的设置下的工作流学习，每个代理只能观察该艺术品的局部函数及其自己的私人状态，且没有集中式学习者访问联合轨迹——这多代理LLM管道跨越组织、供应商或信任边界时的操作模式。我们将这种模式形式化为一个接口约束的半马尔可夫决策过程（IC-SMDP），其决策时刻发生在手柄时间，设计了IC-Q，一种异步去中心化的Q学习算法，其中每次手柄的跨代理协调恰好是一个标量。我们的主要结果是神经IC-Q的有限样本界，该界分解为三个独立可控的误差源：神经函数近似误差、接口表示差距和混合时间残差，基于随机选项持续时间折扣。建立这个界需要将近似信息状态（AIS）框架从单代理原始步骤MDP提升到多代理SMDP，并在随机持续时间内控制马尔可夫噪声，而这在先前工作中尚未完成。据我们所知，这是第一个在去中心化部分可观测性下的神经Q学习的有限样本保证。四个实验：一个受控的合成IC-SMDP，多LLM数学推理，多代理路由，以及多代理CPU编程，显示IC-Q在没有任何代理观察联合轨迹的情况下匹配集中式 oracle，每个误差源沿其对应的轴按界预测的比例缩放。

英文摘要

We study workflow learning in a setting where specialized agents hand off control through a shared artifact, each agent observes only a local function of that artifact and its own private state, and no centralized learner accesses joint trajectories -- the operating regime of multi-agent LLM pipelines that span organizational, vendor, or trust boundaries. We formalize this regime as an interface-constrained semi-Markov decision process (IC-SMDP), whose decision epochs occur at handoff times, and design IC-$Q$, an asynchronous decentralized $Q$-learning algorithm in which cross-agent coordination at every handoff is exactly one scalar. Our main result is a finite-sample bound for neural IC-$Q$ that decomposes into three independently controllable error sources: neural function-approximation error, interface representation gap, and a mixing-time residual, under the random option-duration discount. Establishing this bound requires lifting the approximate information state (AIS) framework from single-agent primitive-step MDPs to multi-agent SMDPs and controlling Markovian noise under random duration, neither of which has been done in prior work. To our knowledge this is the first finite-sample guarantee for neural $Q$-learning under decentralized partial observability. Four experiments: a controlled synthetic IC-SMDP that validates the bound term-by-term, multi-LLM mathematical reasoning, multi-agent routing, and multi-agent CPU programming, show that IC-$Q$ matches a centralized oracle without any agent observing joint trajectories, with each of the three error sources scaling along its corresponding axis as the bound predicts.

URL PDF HTML ☆

赞 0 踩 0

2605.19133 2026-05-20 cs.CV cs.AI 版本更新

Knowing When Not to Predict: Self Supervised Learning and Abstention for Safer DR Screening

知道何时不进行预测：用于更安全糖尿病视网膜病变筛查的自监督学习与退避

Muskaan Chopra, Lorenz Sparrenberg, Jan H. Terheyden, Rafet Sifa

发表机构 * Rheinische Friedrich-Wilhelms-Universität Bonn（莱茵-威斯巴登大学波恩分校）； University Hospital Bonn - Department of Ophthalmology（波恩大学医院眼科部门）； Fraunhofer IAIS（弗劳恩霍夫研究所）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔人工智能与机器学习研究所）

AI总结本文研究了自监督学习预训练长度对校准置信度和基于置信度的退避策略的影响，发现预训练长度对选择性预测有积极影响，但过长预训练并不总能提高可靠性，强调了退避意识评估的重要性。

Comments Accepted at IJCAI 2026

详情

AI中文摘要

自监督学习（SSL）现在是预训练医学图像模型的标准方法，但性能仍主要通过下游准确性来评估。对于安全关键的筛查任务，如糖尿病视网膜病变分级，这还不够：模型必须知道何时其预测不可靠，并将不确定案例推迟给临床审查。在本工作中，我们探讨了SSL预训练长度如何影响校准置信度和基于置信度的退避。我们评估了多个SSL检查点在固定微调协议下的表现，并评估了校准置信度、覆盖范围、选择性准确性以及选择性宏F1。在不同数据集和数据制度下，SSL预训练优于从头开始训练。与之前主要评估下游准确性或AUROC的SSL研究不同，我们分析了SSL预训练持续时间如何影响在基于校准置信度的退避下的置信度行为。然而，一旦准确性饱和，选择性性能仍可能在不同检查点间显著变化，且更长的预训练并不总能提高可靠性。这些结果强调了退避意识评估的重要性，并建议预训练长度应被视为重要的可靠性相关设计选择，而非仅是计算细节。代码可在GitHub上获取。

英文摘要

Self-supervised learning (SSL) is now a standard way to pretrain medical image models, but performance is still mostly judged by downstream accuracy. For safety-critical screening tasks such as diabetic retinopathy grading, this is not enough: a model must also know when its predictions are unreliable and defer uncertain cases for clinical review. In this work, we examine how the length of SSL pretraining influences calibrated confidence and confidence-based abstention. We evaluate multiple SSL checkpoints under a fixed fine-tuning protocol and assess calibrated confidence, coverage, selective accuracy, and selective macro-F1. Across datasets and data regimes, SSL pretraining improves selective prediction compared to training from scratch. Unlike prior SSL studies that primarily evaluate downstream accuracy or AUROC, we analyze how SSL pretraining duration influences confidence behavior under calibrated confidence-based abstention. However, once accuracy saturates, selective performance can still change markedly across checkpoints, and longer pretraining does not consistently improve reliability. These results underscore the importance of abstention-aware evaluation and suggest that pretraining length should be treated as an important reliability-related design choice rather than only a computational detail. Code is available at GitHub.

URL PDF HTML ☆

赞 0 踩 0

2605.19130 2026-05-20 cs.LG cs.AI cs.CL cs.CV 版本更新

EgoBabyVLM: Benchmarking Cross-Modal Learning from Naturalistic Egocentric Video Data

EgoBabyVLM：基于自然主义第一人称视频数据的跨模态学习基准测试

Dongyan Lin, Phillip Rust, Angel Villar Corrales, Alvin W. M. Tan, Mahi Luthra, Charles-Éric Saint-James, Rashel Moritz, Sheila Krogh-Jespersen, Vanessa Stark, Surya Parimi, Jiayi Shen, Youssef Benchekroun, Yosuke Higuchi, Martin Gleize, Tom Fizycki, Nicolas Hamilakis, Manel Khentout, Sho Tsuji, Balázs Kégl, Juan Pino, Michael C. Frank, Emmanuel Dupoux

发表机构 * Meta Superintelligence Labs（Meta超智能实验室）； Stanford University（斯坦福大学）； Meta Reality Labs（Meta现实实验室）； The University of Tokyo（东京大学）

AI总结研究探讨了儿童如何从有限的视觉-语言输入中获得语言 grounding 的鲁棒性，提出了 EgoBabyVLM 挑战，推动模型在自然主义数据中实现 grounded language learning。

详情

AI中文摘要

儿童在有限的视觉-语言输入中展现出惊人的鲁棒性，这种能力超过了目前最好的大型多模态模型。最近的研究表明，目前基于 curated web 数据训练的视觉-语言模型 (VLMs) 无法泛化到由可穿戴设备、具身代理和婴儿头摄像机产生的稀疏、弱对齐的第一人称视频流，并且没有固定的评估流程来衡量在此类数据上的进展。我们训练 VLMs 在具有不同视觉和语言输入语义对齐程度的数据集上，包括自然主义婴儿和成人第一人称视频，并通过涵盖多模态语言 grounding 和单模态视觉和语言任务的综合评估套件进行评估。这套评估的核心是 Machine-DevBench，它是一个基于语料库的基准测试，自动从模型的训练词汇中生成，以消除训练/评估不匹配和先前发展基准的低统计效力。我们的结果表明，当前 VLM 模型依赖于 curated 数据的紧密语义对齐，并无法利用主导自然主义第一人称输入的弱对齐信号——正是人类在其中茁壮成长的领域。为了推动进展，我们引入了 EgoBabyVLM 挑战，以驱动开发能够从人类婴儿经历的此类自然主义数据中实现 grounded language learning 的模型。

英文摘要

Children acquire language grounding with remarkable robustness from limited visuo-linguistic input in ways that surpass today's best large multimodal models. Recent research suggests current vision-language models (VLMs) trained on curated web data fail to generalize to the sparse, weakly-aligned egocentric streams produced by wearable devices, embodied agents, and infant head-cams -- and no fixed evaluation pipeline exists for measuring progress on this regime. We train VLMs on datasets with varying degrees of semantic alignment between visual and linguistic inputs, including naturalistic infant and adult egocentric videos, and evaluate them with a comprehensive suite spanning multimodal language grounding and unimodal vision and language tasks. At the core of this suite is Machine-DevBench, a corpus-grounded benchmark of lexical and grammatical competence, automatically generated from the model's training vocabulary across logarithmic frequency bins to eliminate the train/eval mismatch and low statistical power of prior developmental benchmarks. Our results show that current VLM paradigms hinge on the tight semantic alignment of curated data and fail to exploit the weakly-aligned signal that dominates naturalistic egocentric input -- the very regime in which humans thrive. To motivate progress, we introduce the EgoBabyVLM Challenge to drive the development of models capable of grounded language learning from the kind of naturalistic data that human infants experience.

URL PDF HTML ☆

赞 0 踩 0

2605.19127 2026-05-20 cs.AI 版本更新

POLAR-Bench: A Diagnostic Benchmark for Privacy-Utility Trade-offs in LLM Agents

POLAR-Bench: 一个用于LLM代理隐私-效用权衡的诊断基准

Qiaoyuan Zheng, Yiqu Yang, Qi Gao, Imanol Schlag

发表机构 * ETH Zurich（苏黎世联邦理工学院）； ETH AI Center（ETH人工智能中心）

AI总结本文提出POLAR-Bench基准，用于评估LLM代理在隐私和效用之间的权衡。通过在10个领域和7,852个样本上进行测试，该基准通过确定性集合成员hip评分隐私和效用，并在两个正交轴上变化隐私策略维度和攻击策略，生成5x5的诊断表面。结果揭示了当前前沿模型在保护属性上隐瞒超过99%，而较小的开放权重模型在1-30B范围内表现更差，泄露率高达一半。

Comments Preprint

详情

AI中文摘要

随着LLM代理越来越多地访问私人用户数据，并在与第三方系统交互时代表用户行事，用户定义了哪些信息可以和必须不被共享。代理必须在第三方系统行为对抗性时也能稳健地遵循该意图。我们引入了POLAR-Bench（政策感知对抗基准），其中受信任的模型具有隐私策略和任务对话的模型与第三方模型进行交互，后者对抗性地探测任务相关和受保护的属性。在10个领域和7,852个样本上，我们通过确定性集合成员hip评分隐私和效用，并在两个正交轴上变化隐私策略维度和攻击策略，生成每个模型的5x5诊断表面。我们的结果揭示了一个明显的分裂：当前前沿模型隐瞒超过99%的受保护属性，而较小的开放权重模型在1-30B范围内，用户最常运行作为其自己的受信任代理在设备上或通过私人推理，得分显著更差，最差的泄露超过一半。POLAR-Bench因此定位了每个模型的意图遵循崩溃点，为隐私对齐提供了立足点，特别是在最关重要的地方。

英文摘要

LLM agents increasingly have access to private user data and act on the user's behalf when interacting with third-party systems. The user defines what may and must not be shared, and the agent must robustly follow that intent even when third-party systems behave adversarially. We introduce POLAR-Bench (Policy-aware adversarial Benchmark), in which a trusted model with a privacy policy and a task converses with a third-party model that adversarially probes for both task-relevant and protected attributes. Across 10 domains and 7,852 samples, we score privacy and utility by deterministic set-membership and vary privacy policy dimension and attack strategy along two orthogonal axes, producing a 5 times 5 diagnostic surface per model. Our results reveal a sharp split: current frontier models withhold over 99% of protected attributes, while smaller open-weight models in the 1--30B range, the class users most commonly run as their own trusted agent on-device or via private inference, score notably worse, with the weakest leaking over half. POLAR-Bench thus localizes where each model's intent-following breaks down, providing a foothold for privacy alignment where it matters most.

URL PDF HTML ☆

赞 0 踩 0

2605.19119 2026-05-20 cs.NE cs.AI cs.LG 版本更新

ScheduleFree+: 将学习率自由和调度自由学习扩展到大型语言模型

Aaron Defazio

发表机构 * FAIR at Meta Super-Intelligence Labs（Meta 超智能实验室）

AI总结本文提出了一种学习率自由和调度自由的学习方法（ScheduleFree+），用于训练大型语言模型，该方法在大规模训练中显著优于传统的Warmup-Stable-Decay（WSD）调度方案，并证明了调度自由学习在长周期训练中的有效性。

2605.19093 2026-05-20 cs.AI cs.LG 版本更新

Embedding by Elicitation: Dynamic Representations for Bayesian Optimization of System Prompts

通过 elicitation 进行嵌入：用于系统提示贝叶斯优化的动态表示

Zhiyuan Jerry Lin, Benjamin Letham, Samuel Dooley, Maximilian Balandat, Eytan Bakshy

发表机构 * Meta

AI总结本文研究了在仅有聚合反馈的情况下，如何通过动态表示进行系统提示的贝叶斯优化，提出了一种基于 elicitation 的嵌入方法 ReElicit，利用 LLM 构建可解释的特征空间，并通过概率高斯过程代理选择目标特征向量，最终实现系统提示的优化。

详情

AI中文摘要

系统提示是现代 AI 系统中的核心控制机制，在对话、任务和用户群体中塑造行为。然而，当反馈仅作为聚合度量而非每个示例的标签、失败或批评时，调整系统提示变得困难。我们研究了这种聚合反馈设置作为受限样本的黑盒优化问题，针对离散且长度可变的文本。我们引入了 ReElicit，一种基于 elicitation 的贝叶斯优化框架。给定任务描述、先前评估的提示和标量分数，LLM 会提取一个紧凑且可解释的特征空间，并将提示映射到其中。利用概率高斯过程代理，获取函数会选择目标特征向量，LLM 会实现并优化这些向量以生成可部署的系统提示。随着新评估的到来，重新提取特征空间使表示能够适应观察到的提示-分数历史。我们通过离线基准准确率作为受控的聚合代理来评估该设置：优化器观察每个提示的一个标量分数，而没有每个示例的标签、错误或批评。在十个系统提示优化任务中，使用 30 次总评估预算，ReElicit 在代表性聚合-only 提示优化基线中实现了最强的聚合性能。这些结果表明，LLM 不仅可以作为提示生成器，还可以作为适应性语义表示构建器，用于自然语言艺术的贝叶斯优化。

英文摘要

System prompts are a central control mechanism in modern AI systems, shaping behavior across conversations, tasks, and user populations. Yet they are difficult to tune when feedback is available only as aggregate metrics rather than per-example labels, failures, or critiques. We study this aggregate feedback setting as sample-constrained black-box optimization over discrete, variable-length text. We introduce ReElicit, a Bayesian optimization framework based on \emph{embedding by elicitation}. Given a task description, previously evaluated prompts, and scalar scores, an LLM elicits a compact, interpretable feature space and maps prompts into it. Leveraging a probabilistic Gaussian process surrogate, an acquisition function then selects target feature vectors, which the LLM realizes and refines into deployable system prompts. Re-eliciting the feature space as new evaluations arrive lets the representation adapt to the observed prompt-score history. We evaluate the setting using offline benchmark accuracy as a controlled aggregate proxy: the optimizer observes one scalar score per prompt and no per-example labels, errors, or critiques. Across ten system prompt optimization tasks with a 30 total evaluation budget, ReElicit achieves the strongest aggregate performance profile among representative aggregate-only prompt-optimization baselines. These results suggest that LLMs can serve as adaptive semantic representation builders, not only prompt generators, for Bayesian optimization over natural-language artifacts.

URL PDF HTML ☆

赞 0 踩 0

2605.19092 2026-05-20 cs.LG cs.AI cs.CL 版本更新

Counterfactual Likelihood Tests for Indirect Influence in Private Reasoning Channels

反事实可能性测试用于私人推理通道中的间接影响

Alexander Boesgaard Lorup

发表机构 * Openhagen

AI总结本文提出了一种反事实可能性测试方法，用于衡量私人推理通道之间的影响力，通过替换上游私人块为匹配长度的供体块，并固定公共令牌序列和下游目标，测量下游目标的负对数似然变化，以评估私人和公共通道中的直接和间接影响。

Comments 12 pages, 4 figures, 5 tables

详情

AI中文摘要

推理系统越来越多地将中间计算分成私人和公共通道，产生在转录中看起来相似的评估案例：独立共推导、直接访问私人内容和通过公共通信的间接影响。本文提出了一种反事实可能性测试，用于测量私人推理通道之间的影响力。该方法用一个长度匹配的供体块替换上游私人块，固定公共令牌序列和下游目标，测量下游目标的负对数似然变化。在用于验证的7B角色通道推理模型上，文本探针不可靠：原始n-gram重叠高估了泄漏，修正重叠仍存在噪声，canary复现报告无区分能力。反事实可能性将未遮蔽和遮蔽条件分开，而长度匹配控制了RoPE位置混杂因素。在强化遮蔽验证中，B到A的反向影响接近于零，而A到B的影响通过公共语音隐藏状态持续存在。在三个检查点、五个种子和13,734个有效方向对比的多检查点验证中，重复了这种不对称性。一个图分离控制，阻止私人到公共的载体边，产生所有13,734个控制评估中自然和反事实分数位相同的结果，确定测试的公共通道路径是测量的反事实信号在实施的角色可见性遮蔽下的完整载体。结果表明，私人通道评估应分别报告直接和间接影响，并且反事实可能性探针为测量这些边界提供了实用的默认方法。

英文摘要

Reasoning systems increasingly separate intermediate computation into private and public channels, creating evaluation cases that look similar in transcripts: independent co-derivation, direct access to private content, and indirect influence through public communication. This paper presents a counterfactual likelihood test for measuring influence between private reasoning channels. The method replaces an upstream private block with a length-matched donor block, holds the public token sequence and downstream target fixed, and measures the downstream target's negative-log-likelihood shift. On a 7B role-channel reasoning model used for validation, textual probes are unreliable: raw n-gram overlap overstates leakage, corrected overlap remains noisy, and canary reproduction reports no discrimination. Counterfactual likelihood separates unmasked and masked conditions, while length matching controls a RoPE positional confound. In the hardened masked validation, reverse B-to-A influence is near zero, while A-to-B influence persists through public-speech hidden states. A multi-checkpoint validation across three checkpoints, five seeds, and 13,734 valid directional contrasts replicates this asymmetry. A graph-separation control that blocks private-to-public carrier edges produces bit-identical natural and counterfactual scores across all 13,734 control evaluations, identifying the tested public-channel pathway as the complete carrier of the measured counterfactual signal under the implemented role-visibility mask. The results show that private-channel evaluation should report direct and indirect influence separately, and that counterfactual likelihood probes provide a practical default for measuring these boundaries.

URL PDF HTML ☆

赞 0 踩 0

2605.19080 2026-05-20 cs.LG cs.AI 版本更新

通过多时间尺度预测学习光伏功率输出预测中的长期时间依赖性

Sumit Laha, Ankit Sharma, Hassan Foroosh

发表机构 * Department of Computer Science University of Central Florida Orlando, Florida, United States（计算机科学系佛罗里达中央大学奥兰多佛罗里达美国）

AI总结本文提出一种多时间尺度预测框架，通过联合优化多个未来值来提高深度神经网络对隐含的步间时间依赖性的捕捉能力，从而提升光伏功率输出预测的准确性和鲁棒性。

详情

AI中文摘要

全球太阳能光伏（PV）容量的迅速扩张——2024年达到创纪录的597 GW——凸显了需要稳健的预测模型来缓解由太阳能辐照度间歇性引起的电网不稳定性。尽管基于深度学习的直接预测使用地面天空图像（GSI）已成为主导方法，但现有文献常受限于单一架构评估和对单时间尺度（点）预测的专注。本文提出从传统单时间尺度估计向多时间尺度预测框架的转变，从而实现架构无关的准确率提升。我们假设并实验验证了联合优化一系列未来值使深度神经网络能够通过避免网络在权重梯度和滤波器多样性方面的过早收敛来更好地捕捉隐含的步间时间依赖性。利用这种架构无关的改进，将顺序天空图像与历史光伏发电数据相结合，我们评估了模型在多个离散未来时间步长上同时预测功率输出的能力。我们的方法通过在多样深度学习架构上的比较分析进行验证。结果表明，这种多时间尺度方法在预测时间范围内显著提高了预测准确性和鲁棒性，同时保持计算效率。通过在单时间尺度模型上实现优越性能且计算开销 negligible，本文提供了一种可扩展且高效的解决方案，以提高现代电网的韧性。

使用具备视觉能力的LLM进行手写数学自动评分

Jacob Levine, Miguel Aenlle, Craig Zilles, Matthew West, Mariana Silva

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文研究了使用具备视觉能力的LLM对手写数学作业进行自动评分，通过对比AI评分与人工评分，发现大多数错误源于转录失败而非评分标准应用错误，揭示了LLM在手写数学评分中的潜力与局限。

Comments To be published in the International Conference on AI in Education (AIED), 2026

详情

AI中文摘要

自动评分系统已能对多种响应类型进行大规模评估，但手写数学作业仍是一个障碍，因为其多步骤解决方案的复杂性。具备视觉能力的大语言模型（LLM）在此领域提供了新机会，但其在真实教学环境中的可靠性仍不明确。本文介绍了基于LLM的手写数学作业评分系统，使用教师定义的评分标准进行评估。在扩展先前针对 typed 响应的流程时，我们整合了对照片提交的转录和基于评分标准的评估，通过单次LLM调用完成。在两个大学STEM课程的学生作业上进行了评估。将AI评分决策与人工分配的地面真实值在评分项层面进行比较，我们观察到总体准确率较高，大多数错误——在最佳模型中为87%——归因于转录失败，而非评分标准应用错误。我们分类了常见的错误模式，包括图像质量问题、幻觉内容以及等价表达的处理错误。这些发现突显了LLM在手写数学评分中的潜力和局限，为系统设计、提示优化和教育环境中的部署提供了指导。

英文摘要

Automated grading systems have enabled scalable assessment for many response types, but handwritten mathematics remains a barrier due to the complexity of multi-step solutions. Vision-capable large language models (LLMs) offer new opportunities here, yet their reliability in authentic instructional settings remains poorly understood. We present an empirical evaluation of an LLM-based grader for handwritten mathematical work using instructor-defined rubrics. Extending a prior pipeline for typed responses, we integrate transcription and rubric-based evaluation of photographic submissions within a single LLM call, evaluating on student work from two university STEM courses. Comparing AI grading decisions against human-assigned ground truth at the rubric-item level, we observe high overall accuracy, with most errors -- 87\% in the best model -- attributable to transcription failures rather than rubric misapplication. We categorize common error modes, including image quality issues, hallucinated content, and incorrect handling of equivalent expressions. These findings highlight both the promise and limitations of LLM-based grading for handwritten mathematics, providing guidance for system design, prompt refinement, and deployment in educational settings.

URL PDF HTML ☆

赞 0 踩 0

2605.19042 2026-05-20 cs.AI 版本更新

潜在动作重参数化用于高效智能体推断

Wenhao Huang, Qingwen Zeng, Qiyue Chen, Zijie Guo, Yu Sun, Cheng Yang, Siru Ouyang, Jiri Gesi, Fang Wu, Jiayi Zhang, Huaming Chen, Bang Liu, Xiangru Tang, Chenglin Wu

发表机构 * Université de Montréal（蒙特利尔大学）； The University of Sydney（悉尼大学）； Fudan University（复旦大学）； Yale University（耶鲁大学）； DeepWisdom ； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Amazon Science（亚马逊科学）； Stanford University（斯坦福大学）； The Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州））； Mila - Quebec AI Institute（蒙特利尔人工智能研究所）

AI总结本文提出Latent Action Reparameterization (LAR)框架，通过学习紧凑的潜在动作空间来提升大语言模型智能体的推断效率，减少有效动作 horizon 并保持原始动作空间的表达性。

详情

AI中文摘要

大型语言模型（LLM）智能体通常依赖于长序列的低级文本动作，导致较大的有效决策 horizon 和较高的推断成本。尽管先前工作通过系统级优化或提示工程来提高推断效率，我们认为动作空间的表示是关键瓶颈。我们提出Latent Action Reparameterization (LAR)，一种学习紧凑的潜在动作空间的框架，其中每个潜在动作对应于多步骤语义行为。通过将智能体动作重参数化为潜在单元，LAR使在较短的有效 horizon 上进行决策的同时保持原始动作空间的表达性。与手工制作的宏或分层控制器不同，潜在动作从智能体轨迹中学习并直接集成到模型中，允许规划和执行在抽象动作表示上进行。在一系列基于LLM的智能体基准测试中，LAR显著减少了有效动作 horizon 并在固定计算预算下提高了推断效率。作为结果，我们的方法在减少动作令牌和相应的墙钟推断时间的同时，保持或提高了任务成功率。这些结果表明，动作表示学习是扩展高效LLM智能体推断的关键且未被探索的因素，与模型架构和硬件的进步互补。

英文摘要

Large language model (LLM) agents often rely on long sequences of low-level textual actions, resulting in large effective decision horizons and high inference cost. While prior work has focused on improving inference efficiency through system-level optimizations or prompt engineering, we argue that a key bottleneck lies in the representation of the action space itself. We propose Latent Action Reparameterization (LAR), a framework that learns a compact latent action space in which each latent action corresponds to a multi-step semantic behavior. By reparameterizing agent actions into latent units, LAR enables decision making over a shorter effective horizon while preserving the expressiveness of the original action space. Unlike hand-crafted macros or hierarchical controllers, latent actions are learned from agent trajectories and integrated directly into the model, allowing both planning and execution to operate over abstract action representations. Across a range of LLM-based agent benchmarks, LAR significantly reduces the effective action horizon and improves inference efficiency under fixed compute budgets. As a consequence, our approach achieves substantial reductions in action tokens and corresponding wall-clock inference time, while maintaining or improving task success rates. These results suggest that action representation learning is a critical and underexplored factor in scaling efficient LLM agent inference, complementary to advances in model architecture and hardware.

URL PDF HTML ☆

赞 0 踩 0

2605.18474 2026-05-20 cs.CR cs.AI cs.CL cs.LG 版本更新

Prompt2Fingerprint: Plug-and-Play LLM Fingerprinting via Text-to-Weight Generation

Prompt2Fingerprint: 通过文本到权重生成实现即插即用的LLM指纹生成

Sixu Chen, Xiang Chen, Hongyao Yu, Jiaxin Hong, Hao Fang, Shuoyang Sun, Bin Chen, Shu-Tao Xia

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China（清华大学深圳国际研究生院，中国深圳）； South China University of Technology, Guangzhou, China（华南理工大学，中国广州）； Harbin Institute of Technology, Shenzhen, Shenzhen, China（哈尔滨工业大学深圳校区，中国深圳）

AI总结本文提出Prompt2Fingerprint框架，将LLM指纹生成重新定义为条件参数生成任务，通过专用生成器将文本描述直接映射到低秩参数增量，实现无需进一步模型微调的即插即用LLM指纹注入，显著降低计算开销，提供可扩展且即时的LLM所有权管理解决方案。

详情

AI中文摘要

大规模语言模型（LLMs）的广泛部署和重新分布使模型溯源跟踪成为关键挑战。尽管现有的LLM指纹生成方法，特别是通过微调嵌入身份信号的主动方法，实现了高准确性和鲁棒性，但它们面临显著的可扩展性瓶颈。这些方法通常将指纹注入视为一个独立的一次性优化任务，而不是可重用的能力，需要为每个新身份进行单独且资源密集的训练。这导致了高昂的计算成本和部署延迟。为了解决这一问题，我们提出了Prompt2Fingerprint（P2F），这是首个将指纹生成重新定义为条件参数生成任务的框架。通过利用专用生成器，P2F在单次前向传递中将文本描述直接映射到低秩参数增量，从而实现无需进一步模型微调的即插即用LLM指纹注入。我们的实验表明，P2F在保持高指纹准确度、无害性和鲁棒性的同时，显著降低了计算开销，为LLM所有权管理提供了可扩展且即时的解决方案。

英文摘要

The widespread deployment and redistribution of large language models (LLMs) have made model provenance tracking a critical challenge. While existing LLM fingerprinting methods, particularly active approaches that embed identity signals via fine-tuning, achieve high accuracy and robustness, they suffer from significant scalability bottlenecks. These methods typically treat fingerprint injection as an independent, one-off optimization task rather than a reusable capability, necessitating separate, resource-intensive training for every new identity. This incurs prohibitive computational costs and deployment delays. To address this, we propose Prompt2Fingerprint (P2F), the first framework that reformulates fingerprinting as a conditional parameter generation task. By leveraging a specialized generator, P2F maps textual descriptions directly to low-rank parameter increments in a single forward pass, enabling plug-and-play LLM fingerprint injection without further model retraining. Our experiments demonstrate that P2F maintains high fingerprint accuracy, harmlessness, and robustness while significantly reducing computational overhead, offering a scalable and instant solution for LLM ownership management.

URL PDF HTML ☆

赞 0 踩 0

2605.18445 2026-05-20 cs.CV cs.AI cs.CL cs.LG 版本更新

What's Holding Back Latent Visual Reasoning?

是什么在阻碍潜在视觉推理？

André G. Viveiros, Nuno Gonçalves, André F. T. Martins, Matthias Lindemann

发表机构 * Instituto Superior Técnico, Universidade de Lisboa（里斯本大学理工学院）； Instituto de Telecomunicações（电信研究所）； TransPerfect（TransPerfect公司）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本研究探讨了现有模型如何利用潜在令牌，发现潜在令牌在最终预测中起作用有限，主要问题在于训练数据中潜在令牌信息有限且推理时生成的潜在令牌偏离真实表示，需要高质量数据和更精确的潜在令牌预测来推动发展。

详情

AI中文摘要

人类通过心理模拟中间视觉步骤来解决复杂视觉问题，而非仅通过语言推理。受此启发，近期有关视觉-语言模型的工作探索了连续潜在令牌作为中间视觉想象步骤的链式推理。在本工作中，我们研究了近期模型如何利用此类潜在令牌。令人惊讶的是，当潜在令牌被无信息的占位符令牌替代时，模型准确性不受影响。这表明潜在令牌在模型最终预测中起最小的因果作用。为了更好地理解这一现象，我们分析了由oracle潜在表示提供的训练信号以及推理时生成的潜在令牌质量。我们的实验揭示了两个阻碍潜在视觉推理的关键问题：首先，在大多数现有数据集中，oracle潜在令牌提供的信息有限，仅超出原始图像，且不显著简化任务，导致模型在训练时忽略它们，并在推理时有效绕过它们。当在诊断数据集上微调时，其中潜在令牌为最终预测提供充分支持，我们显示模型可以因果依赖于它们。其次，在推理时生成的潜在令牌偏离其对应的oracle表示，坍缩到狭窄区域，即使模型依赖它们也无法获得收益。总体而言，我们的发现表明，未来潜在视觉推理的进步取决于两个关键支柱：具有信息性中间步骤的高质量数据集和更精确的潜在令牌预测。

英文摘要

Humans can approach complex visual problems by mentally simulating intermediate visual steps, rather than reasoning through language alone. Inspired by this, several works on Vision-Language Models have recently explored chain-of-thought reasoning with continuous latent tokens as intermediate visual imagination steps. In this work, we investigate how recent models leverage such latent tokens. Surprisingly, we find that model accuracy is unaffected when latent tokens are replaced by uninformative dummy tokens. This indicates that latent tokens play a minimal causal role in the model's final prediction. To better understand this phenomenon, we analyze both the training signal provided by oracle latent representations and the quality of the latent tokens generated at inference time. Our experiments reveal two crucial issues holding back latent visual reasoning: First, in most existing datasets, oracle latent tokens provide limited additional information beyond the original image and do not substantially simplify the task, leading models to ignore them during training and effectively bypassing them at inference time. When fine-tuned on a diagnostic dataset, in which latent tokens provide sufficient support for the final prediction, we show that models can causally rely on them. Second, the latent tokens produced at inference time deviate from their corresponding oracle representations, collapsing to a narrow region and preventing benefits even when the model relies on them. Overall, our findings suggest that future progress in latent visual reasoning depends on two key pillars: high-quality datasets with informative intermediate steps and more precise latent token prediction.

URL PDF HTML ☆

赞 0 踩 0

2605.17809 2026-05-20 cs.AI cs.IR 版本更新

Accelerating AI-Powered Research: The PuppyChatter Framework for Usable and Flexible Tooling

加速人工智能研究：PuppyChatter框架用于实用且灵活的工具开发

Chun-Hsiung Tseng, Hao-Chiang Koong Lin, Andrew Chih-Wei Huang, Yung-Hui Chen, Jia-Rou Lin

发表机构 * Dept. of Electrical Engineering, YuanZe Univ.（元智大学电子工程系）； Dept. of Information and Learning Technology, National Univ. of Tainan（台湾国立科技大学资讯与学习科技系）； Dept. of Psychology, Fo Guang Univ.（佛光大学心理学系）； Dept. of Computer Information and Network Engineering, Lunghwa Univ.（龙华科技大学电脑资讯与网络工程系）

AI总结本文提出PuppyChatter框架，旨在解决AI应用开发中的挑战，通过结合供应商特定SDK的直观性和模型抽象的中立性，提供更流畅灵活的开发方法。

2605.17539 2026-05-20 cs.AI 版本更新

Memory-Guided Tree Search with Cross-Branch Knowledge Transfer for LLM Solver Synthesis

具有跨分支知识转移的内存引导树搜索用于LLM求解器合成

Fatemeh Haji, Javier Delarosa Quiros, Peyman Najafirad

发表机构 * Secure AI and Autonomy Lab（安全人工智能与自主性实验室）； The University of Texas at San Antonio（德克萨斯大学圣安东尼奥分校）

AI总结该研究提出MEMOIR框架，通过双层记忆体系结构实现内存引导的树搜索，以提高求解器合成的效率和有效性，通过跨分支知识转移提升求解器的解决方案质量。

详情

AI中文摘要

组合优化（CO）在从物流到芯片设计的决策中起着基础性作用，其中不可行的解决方案在操作上不可用，而小的改进可以转化为显著的经济价值。最近的研究利用大型语言模型（LLMs）自动化求解器合成：从自然语言规范生成可执行的求解器程序。然而，现有的树搜索和进化代理在并行细化候选轨迹时没有显式的知识转移，重新引入了相同的约束违规，并收敛到相似的算法家族。我们引入MEMOIR，一种具有两级记忆层次结构的内存引导树搜索框架：分支本地记忆在迭代单个算法设计时保存执行基础的细化细节，而全局记忆存储跨分支压缩的算法和失败模式摘要。在分支终止时的反思步骤提炼这些摘要，使跨分支转移成为可能，而不会污染未来的上下文与低层次调试跟踪。在七个跨越调度、路由、打包和几何设计的CO问题上，MEMOIR实现了96.7%的解决方案有效性（比最强基线高出9.2个点），并在匹配的每种方法执行预算下，将平均标准化分数提高了7.3个点。在四个问题上进行三次独立运行时，MEMOIR的运行间有效性标准差比我们评估的所有基线低一个数量级，表明内存引导的探索产生了持续的改进，而不是反映采样方差。

英文摘要

Combinatorial optimization (CO) underlies decision-making from logistics to chip design, where infeasible solutions are operationally unusable and small quality gains translate into substantial economic value. Recent work uses large language models (LLMs) to automate solver synthesis: generating executable solver programs from natural-language specifications. However, existing tree-search and evolutionary agents refine candidate trajectories in parallel without explicit knowledge transfer, reintroducing the same constraint violations and converging on similar algorithm families. We introduce MEMOIR, a memory-guided tree-search framework with a two-level memory hierarchy: branch-local memory preserves execution-grounded refinement details within a branch as it iterates on a single algorithmic design, while global memory stores compressed algorithmic and failure-mode summaries across branches. A reflection step at branch termination distills these summaries, enabling cross-branch transfer without polluting future contexts with low-level debugging traces. Across seven CO problems spanning scheduling, routing, packing, and geometric design, MEMOIR achieves 96.7% solution validity (a 9.2 point gap over the strongest baseline) and improves the average normalized score by 7.3 points at matched per-method execution budget. Over three independent runs on four problems, MEMOIR's run-to-run validity standard deviation is more than an order of magnitude below that of every baseline we evaluated in this setting, suggesting that memory-guided exploration yields consistent improvements rather than reflecting sampling variance.

URL PDF HTML ☆

赞 0 踩 0

2605.17480 2026-05-20 cs.AI 版本更新

The Capability Paradox: How Smarter Auditors Make Multi-Agent Systems Less Secure

能力悖论：更聪明的审计员如何使多智能体系统更不安全

Qiqi Liu, Thorsten Holz, Shilin Ye, Runhan Song

发表机构 * University of Chinese Academy of Sciences（中国科学院大学）； Max Planck Institute for Security and Privacy（马克斯·普朗克安全与隐私研究所）； Henan Yinzhu Safety Technology Co., Ltd.（河南亿众安全技术有限公司）； Harbin Institute of Technology, Faculty of Computing（哈尔滨工业大学计算机学院）

AI总结本文研究了多智能体系统中，随着工人能力的提升，系统级攻击成功率反而上升的现象，揭示了语言确定性在攻击传播中的作用，并提出异质性集成验证作为解决方案，以降低攻击成功率。

Comments 28 pages, 6 figures

详情

AI中文摘要

多智能体系统通过将任务分解给专门的智能体来扩展大语言模型（LLMs），但其分布式决策过程创造了新的攻击面。我们识别出语义劫持攻击，即有害请求被隐藏在领域特定的叙述中，并通过工人报告传播到管理者，而无需任何语法注入原始。在42,000次对抗性试验中，我们发现了能力悖论：随着工人能力的增加，系统级攻击成功率（ASR）从18.4%增加到63.9%，峰值达到94.4%。为了解释这一效应，我们对两个独立数据集（47,807次交互）进行了多层中介分析。分析显示，这一悖论由语言确定性驱动：更强的工人更可能将对抗性叙述解释为合法，自信地传达结论，从而导致管理者将这种自信的背书视为执行的正当理由。在我们的更大工人-only设置（n_W=14）中，确定性中介了74%的效果，95%置信区间（CI）在蒙特卡洛和聚类Bootstrap下均排除零；较小的Full-MAS设置（n_W=6）显示了方向一致的间接效应。工人端的安全提示无法可靠地缓解这一失败。基于中介发现，我们提出异质性集成验证，通过配对具有不对称领域能力的工人，使它们的互补性漏洞打破确定性到执行的链条，将ASR从52.8%降低到2.0%，对良性任务影响微乎其微。我们的结果表明，升级组件到更强的模型会主动降低系统安全性，有效的防御需要利用而不是消除智能体之间的能力不对称性。

英文摘要

Multi-agent systems extend large language models (LLMs) by decomposing tasks among specialized agents, but their distributed decision process creates new attack surfaces. We identify semantic hijacking, an attack in which harmful requests are concealed within domain-specific narratives and propagated to a Manager through Worker reports, without any syntactic injection primitives. Across 42,000 adversarial trials over 12 Manager models and 7 Worker configurations, we uncover a capability paradox: as Worker capability increases, the mean system-level Attack Success Rate (ASR) increases from 18.4% to 63.9%, peaking at 94.4%. To explain this effect, we conduct multi-level mediation analysis on two independent datasets (47,807 interactions). This analysis shows that this paradox is driven by linguistic certainty: stronger Workers are more likely to interpret adversarial narratives as legitimate, convey their conclusions assertively, and thereby lead Managers to treat such confident endorsements as justification to execute. In our larger Worker-Only setting ($n_W$=14), certainty mediates 74% of the effect, with 95% confidence intervals (CI) excluding zero under both Monte Carlo and cluster bootstrap; the smaller Full-MAS setting ($n_W$ =6) shows a directionally consistent indirect effect. Worker-side safety prompting does not reliably mitigate this failure. Building on the mediation finding, we propose heterogeneous ensemble verification, which pairs Workers of asymmetric domain competence so their complementary vulnerabilities break the certainty-to-execution chain, reducing ASR from 52.8% to 2.0% with negligible benign-task impact. Our results show that upgrading components to stronger models can actively degrade system security, and that effective defenses require exploiting--rather than eliminating--capability asymmetries between agents.

URL PDF HTML ☆

赞 0 踩 0

2605.17439 2026-05-20 cs.SE cs.AI 版本更新

DiagEval: Trajectory-Conditioned Diagnosis for Reliable Software Evaluation with GUI Agents

DiagEval: 用于通过GUI代理进行可靠软件评估的轨迹条件诊断

Sirui Hong, Zhijie Liu, Tengfei Li, Wei Tao, Yifan Wu, Chenglin Wu

发表机构 * DeepWisdom（深智科技）； The Hong Kong University of Science and Technology (Guangzhou)（香港科学与技术大学（广州））

AI总结该研究提出DiagEval，一种基于轨迹的诊断评估协议，用于在GUI代理评估交互式软件后失败时进行诊断。通过重用失败轨迹选择针对性的诊断探针，并将结果聚合为内部归因信号，从而提高准确性，优于基于重试的基线方法。

详情

AI中文摘要

评估LLM生成的交互式软件需要执行和静态分析相结合。关键困难在于正确性是潜在UI状态转换图上的图级可达属性，而GUI评估器只能观察单个执行轨迹。因此，一次失败的执行仅排除了一条实际路径，导致失败归因在评估器侧错误和真实软件缺陷之间变得模糊。我们提出了DiagEval，一种用于后失败GUI代理评估的轨迹条件诊断评估协议。与盲目重试不同，DiagEval重用失败轨迹来选择针对性的诊断探针，并将这些探针的结果聚合为内部归因信号。潜在图视图激发了诊断问题；DiagEval不重建图或估计校准的后验概率。我们在WebDevJudge-Unit和RealDevBench上评估DiagEval，跨多个GUI代理评估器和LLM后端。在误报负例中，DiagEval恢复了最初被错误归因于软件缺陷的45.6-62.1%的失败，优于基于重试的基线方法，相对提升达34.4-160.6%。在完整评估集上，这种恢复将WebDevJudge-Unit的准确性从69.9%提升到78.3%，将RealDevBench的准确性从65.0%提升到81.6%。这些结果表明，可靠的GUI代理评估不仅需要更强的执行，还需要主动的失败诊断来区分评估器侧错误和真正的软件缺陷。我们的代码可在https://github.com/scutGit/DiagEval获取。

英文摘要

Evaluating LLM-generated interactive software requires execution in addition to static analysis. The key difficulty is that correctness is a graph-level reachable property over latent UI state-transition graphs, whereas a GUI evaluator observes only a single execution trajectory. A failed rollout therefore rules out only one realized path, leaving failure attribution ambiguous between evaluator-side execution error and genuine software defect. We present DiagEval, a trajectory-conditioned diagnostic evaluation protocol for post-failure GUI-agent evaluation of interactive software. Rather than blindly retrying from scratch, DiagEval reuses the failed trajectory to choose targeted diagnostic probes and aggregates their outcomes into an internal attribution signal. The latent-graph view motivates the diagnostic problem; DiagEval does not reconstruct the graph or estimate calibrated posterior probabilities. We evaluate DiagEval on WebDevJudge-Unit and RealDevBench across multiple GUI-agent evaluators and LLM backbones. On false-negative cases, DiagEval recovers 45.6-62.1% of failures that were initially misattributed to software defects, outperforming retry-based baselines with 34.4-160.6% relative gains. On the full evaluation sets, this recovery improves accuracy from 69.9% to 78.3% on WebDevJudge-Unit and from 65.0% to 81.6% on RealDevBench. These results suggest that reliable GUI-agent evaluation requires not only stronger execution, but also active failure diagnosis to disambiguate evaluator-side errors from genuine software defects. Our code is available at https://github.com/scutGit/DiagEval.

URL PDF HTML ☆

赞 0 踩 0

2605.17370 2026-05-20 cs.AI 版本更新

CBT-Audio: Evaluating Audio Language Models for Patient-Side Distress Intensity Estimation in CBT Session Recordings

CBT-Audio: 评估音频语言模型以估计CBT会话录音中患者压力强度

Qixuan Hu, Shuchang Ye, Xumou Zhang, Anastasia Serafimovska, Anastasia Suraev, Amit Saha, Ping-hsiu Lin, Sydney Su, Usman Naseem, Adam G. Dunn, Jinman Kim

发表机构 * School of Computer Science, Faculty of Engineering, University of Sydney, Australia（悉尼大学工程学院计算机科学学院，澳大利亚）； School of Psychology, Faculty of Science, University of Sydney, Australia（悉尼大学科学学院心理学学院，澳大利亚）； School of Computing, Faculty of Science and Engineering, Macquarie University, Australia（麦考瑞大学科学与工程学院计算学院，澳大利亚）； CHeBA (Centre for Healthy Brain Ageing), School of Clinical Medicine, Discipline of Psychiatry & Mental Health, The University of New South Wales, Australia（新南威尔士大学临床医学学院精神病与心理健康学科健康大脑年龄中心，澳大利亚）； Sydney School of Public Health, Faculty of Medicine and Health, University of Sydney, Australia（悉尼大学医学与健康学院公共卫生学院，澳大利亚）

AI总结本文提出CBT-Audio数据集，用于评估音频语言模型在估计CBT会话中患者压力强度方面的性能，通过结合音频和文本输入提升了压力强度估计的准确性。

Comments 9 pages, 3 figures, 2 tables

详情

AI中文摘要

认知行为疗法被广泛用于帮助患者理解和管理心理压力。它通常通过口头交流进行，治疗师不仅关注患者所说的内容，还关注他们说话的方式，因为这些线索有助于治疗师决定如何回应和调整治疗。在构建AI系统用于CBT方面，进展主要局限于文本，部分原因是大多数可用数据集基于文本，而共享的 spoken CBT 数据在伦理和隐私约束下稀缺。这导致了盲点，因为基于文本的模型和评估无法捕捉文本和患者声音之间的不匹配，尽管治疗师经常依赖这种不匹配来理解患者的压力。我们引入了CBT-Audio，一个用于评估从 spoken CBT 会话中估计患者压力强度的音频语言模型的数据集。CBT-Audio包含96个公开可用的CBT录音中的1,802个患者发言，其中发言级别的压力标签已在专家标注的子集上验证。我们评估了10个开源音频语言模型，三种输入条件下，模型仅接收患者音频、仅接收转录文本或同时接收音频和转录文本。我们的结果表明，音频可以提供超出文本的信息，尤其是在与转录文本结合时。在10种模型家族中，有8种在添加音频到转录输入时，压力强度估计优于单独使用转录文本，其中4种有显著提升，案例研究显示当口头内容和语音表达不一致时，收益最明显。CBT-Audio使AI在CBT相关任务中可衡量患者的口语行为，支持未来音频语言模型在心理健康交互中的研究。

英文摘要

Cognitive behavioural therapy is widely used to help patients understand and manage psychological distress. It is often delivered through spoken conversation, where therapists attend not only to what patients say, but also to how they say it, because these cues can help therapists decide how to respond and adapt treatment. Progress in building AI systems for CBT remains largely limited to text, partly because most available datasets are text based and shareable spoken CBT data are scarce under ethical and privacy constraints. This creates a blind spot because text based models and evaluations cannot capture the mismatch between the transcript and the patient's voice, even though therapists often rely on this mismatch to understand patient distress. We introduce CBT-Audio, a dataset for evaluating patient distress estimation from spoken CBT sessions with audio language models. CBT-Audio contains 1,802 patient turns from 96 publicly available CBT recordings, with turn-level distress labels validated on an experts-annotated subset. We evaluate 10 open source audio language models under three input conditions, where models receive only patient audio, only the transcript, or both audio and transcript. Our results show that audio can provide useful information beyond text, especially when combined with transcripts. Adding audio to transcript input improves distress estimation over using the transcript alone in 8 of 10 model families, with significant gains in 4, and case studies show the clearest benefit when verbal content and vocal delivery diverge. CBT-Audio makes spoken patient behaviour measurable for AI evaluation in CBT-related tasks and supports future work on audio language models for mental health interaction.

URL PDF HTML ☆

赞 0 踩 0

2605.17046 2026-05-20 cs.LG cs.AI cs.CL 版本更新

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

1GC-7RC：一张图形卡——七个研究挑战！AI代理在做你的工作方面有多好？

Robin-Nico Kampa, Fabian Deuser, Anna Bößendörfer, Konrad Habel, Norbert Oswald

AI总结本文提出1GC-7RC基准测试，通过七个跨领域机器学习任务评估AI代理在从头设计、实现和训练模型的能力，揭示了不同代理在隐式机器学习知识、规划能力和时间预算管理方面的差异。

详情

AI中文摘要

自主AI编码代理正成为机器学习从业者在工业和研究中不可或缺的工具。尽管这种应用日益广泛，但尚无标准化基准来评估其在不同领域从头设计、实现和训练模型的能力。我们引入了1GC-7RC（单张图形卡：七个研究挑战），该基准包含七个机器学习任务，涵盖语言建模、图像分类、语义分割、图学习、表格预测、时间序列预测和文本分类。每个任务都提供锁定的数据准备和评估脚本以及基线训练脚本；代理只能修改训练代码，无法访问预训练权重（语义分割任务有一个受控例外），无法访问互联网，并必须在单个GPU上完成每个任务的时间预算（40-120分钟）。我们评估了七个编码代理：五个专有（Claude Code with Sonnet 4.6、Opus 4.6和Opus 4.7；Codex CLI with GPT 5.5；和OpenCode with Qwen 3.6+）和两个开源（OpenCode with Kimi K2.5、Kimi K2.6）。在每个代理-任务对的5次运行中，我们报告了显著的性能差异，揭示了不同代理在隐式机器学习知识、规划能力和时间预算管理方面的不同水平。该基准、工具和所有评估成果均在GitHub上公开，以促进未来代理的可重复比较。由于我们的基准设计是模块化的，该基准可以扩展到新任务和领域，适应不同的GPU预算，并用于研究多代理设置，使其成为未来自主研究代理研究的灵活平台。

英文摘要

Autonomous AI coding agents are becoming a core tool for ML practitioners in industry and research alike. Despite this growing adoption, no standardized benchmark exists to evaluate their ability to design, implement, and train models from scratch across diverse domains. We introduce **1GC-7RC** (*Single Graphic Card: Seven Research Challenges*), a benchmark comprising seven ML tasks spanning language modeling, image classification, semantic segmentation, graph learning, tabular prediction, time-series forecasting, and text classification. Each task provides a locked data-preparation and evaluation script together with a baseline training script; the agent may only modify the training code, has no access to pretrained weights (with one controlled exception for semantic segmentation), no internet access, and must complete each task within a task-specific wall-clock budget (40-120 minutes) on a single GPU. We evaluate seven coding agents: five proprietary (Claude Code with Sonnet 4.6, Opus 4.6, and Opus 4.7; Codex CLI with GPT 5.5; and OpenCode with Qwen 3.6+) and two open-source (OpenCode with Kimi K2.5, Kimi K2.6). Across 5 runs per agent-task pair, we report substantial performance differences that reveal varying levels of implicit ML knowledge, planning ability, and time-budget management. The benchmark, harness, and all evaluation artifacts are publicly available on GitHub at https://github.com/Strolchii/1GC-7RC-Benchmark to facilitate reproducible comparison of future agents. Because our benchmark design is modular, the benchmark can be extended to new tasks and domains, adapted to different GPU budgets, and used to study multi-agent settings, making it a flexible platform for future research on autonomous research agents.

URL PDF HTML ☆

赞 0 踩 0

2605.17003 2026-05-20 cs.LG cs.AI 版本更新

SLEIGHT-Bench: 一种针对代理监控的对抗攻击基准

Elle Najt, Colin Toft, Tyler Tracy, Fabien Roger, Joe Benton

发表机构 * Anthropic Fellows Program（Anthropic Fellow计划）； University of Waterloo（多伦多大学）； Redwood Research（Redwood研究）； Anthropic

AI总结本文提出SLEIGHT-Bench基准，用于评估代理监控对多种攻击策略的防御能力，发现20种攻击在Opus 4.6监控下未被检测到，同时识别了多种规避策略并展示了监控性能的提升。

详情

AI中文摘要

由于自主编码代理会生成复杂行为，我们可能希望使用其他LLM来监控行为以降低危险行为的风险。为了更好地理解此类监控对编码代理可能使用的多样化攻击策略的限制，我们提出了SLEIGHT-Bench（Subtle Low-itEration Insight-Guided Harmful Transcripts），一个包含40种攻击的合成转录本基准，涵盖11类攻击，每种攻击都显示编码代理 covertly 追求有害目标（例如权重外泄、凭证盗窃、未监控的恶意部署）。我们发现，在1%的假阳性率下，40种攻击中有20种从未被Opus 4.6监控在10次试验中检测到，整体检测率为32%。我们识别了多种规避策略，如利用模型先验、用户指令歧义和状态操纵，这些策略有助于创建能够规避前沿监控的攻击。我们还发现使用编码代理作为监控者比使用常规提示监控者能获得更强的监控性能，并且对于某些规避策略，通过针对性的监控提示可提高检测率。我们的数据集和评估框架可在https://github.com/safety-research/sleight-bench和https://huggingface.co/datasets/sleightbench/SLEIGHT-Bench获取。

英文摘要

Since autonomous coding agents generate complex behaviors at high-volume, we may want to use other LLMs to monitor actions to reduce the risk from dangerous misaligned behavior. To better understand the limitations of such monitors against the diverse attack strategies that a coding agent could use, we present SLEIGHT-Bench (Subtle Low-itEration Insight-Guided Harmful Transcripts), a benchmark of synthetic transcripts containing 40 attacks across 11 categories, each showing a coding agent covertly pursuing a harmful objective (e.g. weight exfiltration, credential theft, unmonitored rogue deployment). We find that 20 of 40 attacks are never caught by an Opus 4.6 monitor with extended thinking across 10 trials at a 1% false-positive rate, while the overall catch rate is 32%. We identify several evasion strategies such as exploiting model priors, user instruction ambiguity, and state manipulation, which facilitate creating attacks that evade frontier monitors. We also elicit stronger monitor performance using coding agents as monitors versus regular prompted monitors, and for some evasion strategies show improved catch rates with targeted monitor prompts. Our dataset and evaluation framework are available at https://github.com/safety-research/sleight-bench and https://huggingface.co/datasets/sleightbench/SLEIGHT-Bench.

URL PDF HTML ☆

赞 0 踩 0

2605.16445 2026-05-20 cs.LG cs.AI 版本更新

Membership Inference Attacks on Discrete Diffusion Language Models

对离散扩散语言模型的成员推断攻击

Shailesh Kasivelrajan

AI总结本文研究了对微调后的MDLMs的成员推断攻击，发现其比现有灰盒基线更易受攻击，并设计了阴影模型转移攻击以证明其有效性。

Comments Citations and Co Authors need to be verified and updated. Will submit a new version soon

详情

AI中文摘要

Masked Diffusion Language Models (MDLMs) 替换了自回归生成的迭代解 masking，其隐私属性大多未被研究。我们研究了对微调后的MDLMs的成员推断攻击(MIA)，并发现其比现有灰盒基线所暗示的要显著更容易受到攻击。我们从四个 masking 比率下的模型重建损失中提取了一个46维的特征向量，并在其上训练XGBoost和MLP分类器。在六个文本领域上的MIMIR基准测试中，XGBoost实现了平均AUC 0.878，在Pile CC上达到峰值0.930，并在平均上比SAMA灰盒基线高出0.062 AUC。一个leave one signal out消融实验显示，仅ELBO轨迹就驱动了大部分结果，当移除时平均下降0.130，而注意力特征在低于0.003时几乎不起作用。我们还设计了一个阴影模型转移攻击，其中K=3个在无关领域训练的surrogate MDLMs在不接触目标领域的情况下生成分类器标签。这在0.020以内实现了0.858的平均AUC，并确立了阴影模型转移作为一种实用且几乎同样有效的攻击路径。

英文摘要

Masked Diffusion Language Models MDLMs replace autoregressive generation with iterative demasking and their privacy properties are largely unstudied. We study membership inference attacks MIA on fine tuned MDLMs and show they are significantly more vulnerable than current grey box baselines suggest. We extract a 46 dimensional feature vector from the models reconstruction loss at four masking ratios and train XGBoost and MLP classifiers on top. On the MIMIR benchmark across six text domains XGBoost achieves mean AUC 0.878 peaking at 0.930 on Pile CC and beats the SAMA grey box baseline by 0.062 AUC on average. A leave one signal out ablation shows that the ELBO trajectory alone drives most of this with a mean drop of 0.130 when removed while attention features add almost nothing below 0.003. We also design a shadow model transfer attack where K equals 3 surrogate MDLMs trained on data from unrelated domains generate classifier labels with no access to the target domain. This achieves 0.858 mean AUC within 0.020 of the white box oracle and establishes shadow model transfer as a practical and near equally effective attack path.

URL PDF HTML ☆

赞 0 踩 0

2605.16353 2026-05-20 cs.CV cs.AI 版本更新

StrLoRA: Towards Streaming Continual Visual Instruction Tuning for MLLMs

StrLoRA: 向流式连续视觉指令微调迈进以适应大规模多模态语言模型

Chang Che, Ziqi Wang, Hui Ma, Cheems Wang, Zenglin Shi

发表机构 * Hefei University of Technology（合肥工业大学）； Tsinghua University（清华大学）

AI总结本文提出StrLoRA，一种流式连续视觉指令微调方法，旨在解决动态任务流中模型持续学习的问题，通过任务感知的专家路由框架提升模型在不断变化的数据流中的表现。

详情

AI中文摘要

持续视觉指令微调（CVIT）使多模态大语言模型能够逐步获得新能力。然而，现有CVIT方法在任务增量设置下运行，每个训练阶段对应一个预定义任务，这不符合现实世界中数据作为连续流中交织和动态变化的任务的条件。为弥合这一差距，我们引入流式CVIT（StrCVIT），一种更通用和现实的设置，其中模型从包含动态混合任务的数据块中学习。在StrCVIT中，模型必须同时获得新能力、强化常见能力并减轻遗忘。现有CVIT方法在此处失败，因为它们无法可靠地区分或适应每个块内的异构任务样本。因此，我们提出了StrLoRA，一种正则化的两阶段专家路由框架。StrLoRA首先使用文本指令进行任务感知的专家选择，激活相关专家的稀疏子集，减少跨任务干扰。然后在该子集内应用基于令牌的专家加权，其中贡献权重通过本地视觉令牌与全局指令表示之间的跨模态注意力计算。为了在非平稳流中保持稳定性，路由稳定性正则化将当前路由分布与历史指数移动平均参考对齐。在新开发的StrCVIT基准上的广泛实验表明，StrLoRA显著优于现有方法，有效提升了模型从持续演变的数据流中获取能力的能力。代码可在https://github.com/chanceche/StrCVIT获取。

英文摘要

Continual Visual Instruction Tuning (CVIT) enables Multimodal Large Language Models to incrementally acquire new abilities. However, existing CVIT methods operate under a restrictive task-incremental setting, where each training phase corresponds to a single, predefined task. This does not reflect real-world conditions, where data arrives as a continuous stream of interleaved and dynamically evolving tasks. To bridge this gap, we introduce Streaming CVIT (StrCVIT), a more general and realistic setting where models learn from a stream of data chunks containing a dynamic mixture of tasks. In StrCVIT, a model must simultaneously acquire new abilities, reinforce recurring abilities, and mitigate forgetting. Existing CVIT methods fail here as they cannot reliably distinguish or adapt to the heterogeneous task samples within each chunk. We therefore propose StrLoRA, a regularized two-stage expert routing framework. StrLoRA first performs task-aware expert selection using the textual instruction to activate a sparse subset of relevant experts, reducing cross-task interference. It then applies token-wise expert weighting within this subset, where contribution weights are computed via cross-modal attention between local visual tokens and the global instruction representation. To maintain stability across the non-stationary stream, a routing-stability regularization aligns current routing distributions with a historical exponential moving average reference. Extensive experiments on a newly developed StrCVIT benchmark show that StrLoRA substantially outperforms existing methods, effectively enhancing model's abilities from continuously evolving data streams. The code is available at https://github.com/chanceche/StrCVIT.

URL PDF HTML ☆

赞 0 踩 0

2605.15975 2026-05-20 cs.AI cs.RO 版本更新

VGGT-Edit：基于残差场预测的前馈原生3D场景编辑

Kaixin Zhu, Yiwen Tang, Yifan Yang, Renrui Zhang, Bohan Zeng, Ziyu Guo, Ruichuan An, Zhou Liu, Qizhi Chen, Delin Qu, Jaehong Yoon, Wentao Zhang

发表机构 * Peking University（北京大学）； Tencent（腾讯）； The Chinese University of Hong Kong（香港中文大学）； Shanghai AI Lab（上海人工智能实验室）； NTU Singapore（新加坡国立大学）； Zhongguancun Academy（中关村学院）； Beijing Key Lab of Data Intel. & Security (PKU)（北京数据智能与安全实验室（北京大学））

AI总结本文提出VGGT-Edit，一种基于文本条件的前馈原生3D场景编辑框架，通过引入深度同步文本注入和残差变换头，实现高质量的3D场景编辑，同时构建DeltaScene数据集以提升编辑效果和推理速度。

详情

AI中文摘要

高质量的3D场景重建近年来已发展为通用的前馈架构，使单次正向传递即可生成复杂的环境。然而，尽管这些模型在静态场景感知方面表现强劲，但它们在响应动态人类指令方面仍然有限，限制了其在交互应用中的使用。现有的编辑方法通常依赖于2D提升策略，即单独编辑每个视图，然后将其提升回3D空间。这种间接流程往往导致模糊的纹理和不一致的几何结构，因为2D编辑器缺乏保持跨视角结构的空间意识。为了解决这些限制，我们提出了VGGT-Edit，一种用于文本条件的前馈框架，用于原生3D场景编辑。VGGT-Edit引入了深度同步的文本注入，以对齐语义指导与骨干网络的空间姿态，确保稳定的指令接地。此语义信号随后由残差变换头处理，直接预测3D几何位移以变形场景，同时保持背景稳定性。为了确保高保真结果，我们通过多术语目标函数监督该框架，强制几何准确性和跨视图一致性。我们还构建了DeltaScene数据集，一个通过自动化流程生成的大规模数据集，通过3D一致过滤确保地面真实质量。实验表明，VGGT-Edit在2D提升基线中表现显著更好，生成更清晰的物体细节，更强的多视图一致性以及接近即时的推理速度。项目页面是https://chriszkxxx.github.io/VGGT-Edit/.

英文摘要

High-quality 3D scene reconstruction has recently advanced toward generalizable feed-forward architectures, enabling the generation of complex environments in a single forward pass. However, despite their strong performance in static scene perception, these models remain limited in responding to dynamic human instructions, which restricts their use in interactive applications. Existing editing methods typically rely on a 2D-lifting strategy, where individual views are edited independently and then lifted back into 3D space. This indirect pipeline often leads to blurry textures and inconsistent geometry, as 2D editors lack the spatial awareness required to preserve structure across viewpoints. To address these limitations, we propose VGGT-Edit, a feed-forward framework for text-conditioned native 3D scene editing. VGGT-Edit introduces depth-synchronized text injection to align semantic guidance with the backbone's spatial poses, ensuring stable instruction grounding. This semantic signal is then processed by a residual transformation head, which directly predicts 3D geometric displacements to deform the scene while preserving background stability. To ensure high-fidelity results, we supervise the framework with a multi-term objective function that enforces geometric accuracy and cross-view consistency. We also construct the DeltaScene Dataset, a large-scale dataset generated through an automated pipeline with 3D agreement filtering to ensure ground-truth quality. Experiments show that VGGT-Edit substantially outperforms 2D-lifting baselines, producing sharper object details, stronger multi-view consistency, and near-instant inference speed. The project page is https://chriszkxxx.github.io/VGGT-Edit/.

URL PDF HTML ☆

赞 0 踩 0

2605.14678 2026-05-20 cs.AI 版本更新

$π$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

$π$-Bench：评估长周期工作流中主动型个人助理代理

Haoran Zhang, Luxin Xu, Zhilin Wang, Runquan Gui, Shunkai Zhang, Haodi Lei, Zihao He, Bingsu He, Chicheng Qin, Tong Zhu, Xiaoye Qu, Yang Yang, Yu Cheng, Yafu Li

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai AI Laboratory（上海人工智能实验室）； Fudan University（复旦大学）； University of Science and Technology of China（中国科学技术大学）； Peking University（北京大学）； Nanjing University（南京大学）； Zhejiang University（浙江大学）； Tongji University（同济大学）； Soochow University（苏州大学）； The Chinese University of Hong Kong（香港中文大学）

AI总结本文提出$π$-Bench基准，用于评估个人助理代理在长周期工作流中的主动协助能力，通过100个多轮任务和5种特定领域用户角色，验证代理在未明确表达意图前识别和执行隐藏意图的能力，揭示主动协助的挑战及前期交互对后续任务的重要性。

Comments 44 pages

详情

AI中文摘要

随着个人助理代理（如OpenClaw）的兴起，大型语言模型在日常和工作场景中支持用户的能力日益凸显。在这些场景中，主动协助是一个核心挑战，因为用户往往开始时请求不明确，留下重要的需求、约束或偏好未被陈述。然而，现有基准很少评估代理是否能在用户明确表达之前识别并执行此类隐藏意图，尤其是在持续的多轮交互中，用户需求逐渐显现。为填补这一空白，我们引入$π$-Bench，一个包含100个多轮任务和5种特定领域用户角色的主动协助基准。通过整合隐藏用户意图、任务间依赖性和跨会话连续性，$π$-Bench评估代理在延长交互中预见和解决用户需求的能力，共同衡量长周期轨迹中的主动性和任务完成度。实验表明（1）主动协助仍然具有挑战性，（2）任务完成与主动性存在明显区别，（3）前期交互对后续任务中主动意图解析的价值。

英文摘要

The rise of personal assistant agents, e.g., OpenClaw, highlights the growing potential of large language models to support users across everyday life and work. A core challenge in these settings is proactive assistance, since users often begin with underspecified requests and leave important needs, constraints, or preferences unstated. However, existing benchmarks rarely evaluate whether agents can identify and act on such hidden intents before they are explicitly stated, especially in sustained multi-turn interactions where user needs emerge gradually. To address this gap, we introduce $π$-Bench, a benchmark for proactive assistance comprising 100 multi-turn tasks across 5 domain-specific user personas. By incorporating hidden user intents, inter-task dependencies, and cross-session continuity, $π$-Bench evaluates agents' ability to anticipate and address user needs over extended interactions, jointly measuring proactivity and task completion in long-horizon trajectories that better reflect real-world use. Experiments show (1) proactive assistance remains challenging, (2) a clear distinction between task completion and proactivity, and (3) the value of prior interaction for proactive intent resolution in later tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.14102 2026-05-20 cs.AI 版本更新

ChromaFlow: A Negative Ablation Study of Orchestration Overhead in Tool-Augmented Agent Evaluation

ChromaFlow: 一种关于在工具增强代理评估中编排开销的负消融研究

Tarun Mittal

发表机构 * Octave-X

AI总结该研究通过ChromaFlow框架分析了在工具增强自主推理中编排开销的影响，发现更激进的编排并未提升整体性能，反而增加了操作噪声，并强调了编排升级、确定性提取、证据协调、提供者健康门控和显式运行门控等作为可靠自主代理评估的第一要求。

Comments 12 pages, 6 tables, 1 figure. Updated with follow-up strict-provider full-Level-1 diagnostic

详情

AI中文摘要

自主语言模型代理越来越多地结合规划、工具使用、文档处理、浏览、代码执行和验证循环。这些能力使代理系统更加有用，但同时也引入了无法仅通过最终准确性来观察的操作失败模式。本报告介绍了ChromaFlow，一种围绕规划引导执行、专门化工具使用和 telemetry 驱动评估构建的工具增强自主推理框架。我们分析了ChromaFlow在GAIA 2023 Level-1验证任务下的清洁评估约束。一个冻结的完整Level-1基线实现了29/53正确的答案，或54.72%。后来的恢复配置通过扩展编排实现了27/53正确的答案，或50.94%，同时增加了回溯、超时事件、工具失败提及、令牌日志调用和战役日志成本估计。两个随机化的20任务烟雾评估产生了12/20和11/20正确的答案，表明小规模诊断增益在样本间不稳定。因此，中心结果是负消融：更激进的编排并未提高整体性能，反而增加了操作噪声。后来的严格提供者全Level-1诊断在显式完整性控制下达到了30/53，或56.60%，但显着提高了令牌日志成本。报告认为，受控编排升级、确定性提取、证据协调、提供者健康门控和显式运行门控应被视为可靠自主代理评估的第一要求。

英文摘要

Autonomous language-model agents increasingly combine planning, tool use, document processing, browsing, code execution, and verification loops. These capabilities make agent systems more useful, but they also introduce operational failure modes that are not visible from final accuracy alone. This report presents ChromaFlow, a tool-augmented autonomous reasoning framework built around planner-directed execution, specialized tool use, and telemetry-driven evaluation. We analyze ChromaFlow on GAIA 2023 Level-1 validation tasks under clean evaluation constraints. A frozen full Level-1 baseline achieved 29/53 correct answers, or 54.72%. A later recovery configuration with expanded orchestration achieved 27/53 correct answers, or 50.94%, while increasing tracebacks, timeout events, tool-failure mentions, token-log calls, and campaign-log cost estimates. Two randomized 20-task smoke evaluations produced 12/20 and 11/20 correct answers, showing that small diagnostic gains can be unstable across samples. The central result is therefore a negative ablation: more aggressive orchestration did not improve full-set performance and increased operational noise. A later strict-provider full-Level-1 diagnostic reached 30/53, or 56.60%, under explicit integrity controls, but at substantially higher token-log cost. The report argues that bounded planner escalation, deterministic extraction, evidence reconciliation, provider-health gates, and explicit run gates should be treated as first-order requirements for reliable autonomous agent evaluation.

URL PDF HTML ☆

赞 0 踩 0

2605.13652 2026-05-20 cs.LG cs.AI cs.CL 版本更新

Beyond Perplexity: A Geometric and Spectral Study of Low-Rank Pre-Training

超越困惑度：低秩预训练的几何与谱研究

Namrata Shivagunde, Vijeta Deshpande, Sherin Muckatira, Anna Rumshisky

发表机构 * University of Massachusetts Lowell（马萨诸塞大学洛厄尔分校）

AI总结本文通过几何和谱分析研究低秩预训练方法，揭示其与全秩训练在模型性能和解空间上的差异，发现低秩方法在不同模型规模下表现各异，且困惑度不能完全反映下游任务性能。

Comments 9 pages, 5 figures, 2 tables

详情

AI中文摘要

大规模语言模型的预训练主要受限于存储全秩权重、梯度和优化器状态的内存成本。低秩预训练出现以解决这一问题，相关方法空间迅速扩展。一个核心问题仍未解决：低秩方法是否能产生与全秩训练具有同等泛化能力的模型，或者秩约束是否根本性地改变了所达到的解？现有比较几乎完全依赖于单种子运行的验证困惑度，通常继承自先前文献。然而，困惑度是解质量的差代理；两种方法可以在困惑度上匹配，却收敛到不同的损失景观区域和内部表示。我们通过表征五种低秩预训练方法（GaLore和Fira（内存高效优化器）、CoLA和SLTrain（架构再参数化）、ReLoRA（适配器式更新带周期性重置））在三个模型规模（60M、130M、350M）下与全秩训练的解，关闭这一差距。我们评估每种方法在四个维度上的16个指标：1D损失景观沿随机/Top-K PCA方向、1D检查点之间插值、权重和学习更新的谱结构，以及激活相似性与全秩训练。我们显示低秩方法不等同于全秩训练，也不等同于彼此，即使验证困惑度接近。全秩训练在随机方向上达到更尖锐的盆地，而反方向则适用于top-1 PCA方向。每种方法收敛到几何上不同的盆地。低秩激活在训练过程中随着层数增加而偏离全秩激活，GaLore最接近全秩激活。进一步，验证困惑度在每个规模下并不转化为下游性能。添加几何和谱度量提高了预测。

英文摘要

Pre-training large language models is dominated by the memory cost of storing full-rank weights, gradients, and optimizer states. Low-rank pre-training has emerged to address this, and the space of methods has grown rapidly. A central question remains open: do low-rank methods produce models that generalize comparably to full-rank training, or does the rank constraint fundamentally alter the solutions reached? Existing comparisons rely almost entirely on validation perplexity from single-seed runs, often carried forward from prior literature. Yet perplexity is a poor proxy for solution quality; two methods can match on perplexity while converging to different loss landscape regions and internal representations. We close this gap by characterizing the solutions found by five low-rank pre-training methods, GaLore and Fira (memory-efficient optimizers), CoLA and SLTrain (architecture reparameterizations), and ReLoRA (adapter-style updates with periodic resets), against full-rank training at three model scales (60M, 130M, 350M). We evaluate each along 16 metrics across four dimensions: 1-D loss landscape along random/top-K PCA directions, 1-D interpolation between checkpoints, spectral structure of the weights and learned updates, and activation similarity to full-rank training. We show that low-rank methods are not equivalent to full-rank training, nor to one another, even when validation perplexity is close. Full-rank training settles into a sharper basin than low-rank methods along random directions, while the reverse holds for the top-1 PCA direction. Each method converges to a geometrically distinct basin. Low-rank activations diverge from full-rank in later layers as training progresses, with GaLore tracking full-rank most closely. Further, validation perplexity does not translate to downstream performance at every scale. Adding geometric and spectral metrics improves the prediction.

URL PDF HTML ☆

赞 0 踩 0

2605.13646 2026-05-20 cs.RO cs.AI 版本更新

Causality-Aware End-to-End Autonomous Driving via Ego-Centric Joint Scene Modeling

基于因果性的端到端自动驾驶：通过以自身为中心的联合场景建模

Seokha Moon, Minseung Lee, Joon Seo, Jinkyu Kim, Jungbeom Lee

发表机构 * Korea University（韩国大学）； Kakao Mobility

AI总结本文提出CaAD框架，通过共享潜在场景表示捕捉车辆与周围代理之间的因果依赖关系，以提高端到端自动驾驶的闭环规划性能。

详情

AI中文摘要

端到端自动驾驶通过直接从传感器输入预测未来轨迹，跳过了传统模块化流水线，近年来取得了显著进展。然而，现有方法往往忽视了车辆规划中的因果依赖关系，忽略了车辆与周围代理之间的相互关系。这种因果忽视导致轨迹预测不一致且不可靠，特别是在需要交互的关键场景中，车辆决策和邻近代理行为必须联合推理。为了解决这一限制，我们提出了CaAD，一个基于因果的端到端自动驾驶框架，该框架在共享的潜在场景表示中捕捉这些依赖关系。首先，我们提出一个以自身为中心的联合因果建模模块，基于边缘预测分支，并学习车辆与相关交互代理之间的因果依赖关系。其次，我们采用因果意识的策略对齐阶段，通过联合模式嵌入来对齐随机的车辆策略与从周围交通和地图上下文中计算出的规划导向闭环反馈。在Bench2Drive和NAVSIM基准上，CaAD展示了强大的闭环规划性能，分别在Bench2Drive上实现了87.53的驾驶分数和71.81的成功率，在NAVSIM上实现了91.1的PDMS。项目页面可在https://moonseokha.github.io/CaAD/上获取。

英文摘要

End-to-end autonomous driving, which bypasses traditional modular pipelines by directly predicting future trajectories from sensor inputs, has recently achieved substantial progress. However, existing methods often overlook the causal inter-dependencies in ego-vehicle planning, ignoring the reciprocal relations between the ego vehicle and surrounding agents. This causal oversight leads to inconsistent and unreliable trajectory predictions, especially in interaction-critical scenarios where ego decisions and neighboring agent behaviors must be reasoned about jointly. To address this limitation, we propose CaAD, a Causality-aware end-to-end Autonomous Driving framework that captures these dependencies within a shared latent scene representation. First, we propose an ego-centric joint-causal modeling module that builds on the marginal prediction branch, and learns causal dependencies between the ego vehicle and interaction-relevant agents. Second, we employ a causality-aware policy alignment stage implemented with joint-mode embeddings to align the stochastic ego policy with planning-oriented closed-loop feedback computed from surrounding traffic and map context. On the Bench2Drive and NAVSIM benchmarks, CaAD demonstrates strong closed-loop planning performance, achieving a Driving Score of 87.53 and Success Rate of 71.81 on Bench2Drive, and a PDMS of 91.1 on NAVSIM. The project page is available at https://moonseokha.github.io/CaAD/.

URL PDF HTML ☆

赞 0 踩 0

2605.12981 2026-05-20 cs.SE cs.AI cs.LG 版本更新

Protocol-Driven Development: Governing Generated Software Through Invariants and Continuous Evidence

基于协议的开发：通过不变式和连续证据治理生成的软件

Jun He, Deying Yu

AI总结本文提出了一种基于协议的开发方法，通过定义协议的不变式和连续证据来治理生成的软件，其核心贡献是将协议作为主要软件 artifact，而非代码，从而实现对生成软件的持续验证和治理。

Comments 20 pages, 2 tables

详情

AI中文摘要

自动化程序合成降低了生成实现的成本，但引入了更复杂的治理问题：确定哪些生成的 artifact 是可接受的。自然语言规范存在歧义，基于示例的测试仅覆盖行为空间的一部分。单独使用这些方法无法提供足够的控制边界。我们引入了基于协议的开发（PDD），其中主要的软件 artifact 是可机器执行的协议，而非代码。我们定义协议为三元组 P = (S, B, O)，指定结构、行为和操作不变式。其联合作为软件组件的可接受实现空间的定义。在 PDD 中，实现是通过受约束的搜索发现的可替换实现。只有满足协议并产生可验证的合规证据链的实现才被接受。接受基于协议的满足和记录的证据，而非对生成器的信任。对于部署的系统，我们扩展证据链为动态证据账本。运行时验证器将签名的观察、不变式检查和违规情况附加到账本中，使可监控的义务能够持续得到证明。这将实时故障回溯到生成循环中，而无需授予生成器运行时的权威。结合形式方法、属性测试、运行时验证、政策作为代码和软件可追溯性，PDD 定义了自动化软件工程的治理模型。其组织原则是代码是短暂的，而协议承载持久的权威。

英文摘要

Automated program synthesis lowers the cost of producing implementations but introduces a harder governance problem: determining which generated artifacts are admissible. Natural-language specifications are ambiguous, and example-based tests sample only part of the behavioral space. Used alone, neither provides a sufficient control boundary. We introduce Protocol-Driven Development (PDD), where the primary software artifact is a machine-enforceable protocol rather than code. We define a protocol as the triplet P = (S, B, O), specifying structural, behavioral, and operational invariants. Their conjunction defines the admissible implementation space of a software component. Under PDD, implementations are replaceable realizations discovered through constrained search. An implementation is admitted only if it satisfies the protocol and produces a verifiable Evidence Chain of compliance. Admission is grounded in protocol satisfaction and recorded evidence rather than trust in the generator. For deployed systems, we extend the Evidence Chain into a Dynamic Evidence Ledger. Runtime verifiers append signed observations, invariant checks, and violations to the ledger, allowing monitorable obligations to be continuously attested. This connects live failures back to the generation loop without granting the generator runtime authority. Combining formal methods, property testing, runtime verification, policy-as-code, and software provenance, PDD defines a governance model for automated software engineering. Its organizing principle is that code is transient, while the protocol carries durable authority.

URL PDF HTML ☆

赞 0 踩 0

2605.10344 2026-05-20 cs.AI 版本更新

TMAS: Scaling Test-Time Compute via Multi-Agent Synergy

TMAS: 通过多智能体协同实现测试时间计算的扩展

George Wu, Nan Jing, Qing Yi, Chuan Hao, Ming Yang, Feng Chang, Yuan Wei, Jian Yang, Ran Tao, Bryan Dai

发表机构 * IQuest Research（IQuest研究）； Beihang University（北航）

AI总结本文提出TMAS框架，通过多智能体协同实现测试时间计算的扩展，利用层次化记忆和混合奖励强化学习提升推理能力和探索效率。

详情

AI中文摘要

测试时间扩展已成为通过在推理过程中分配额外计算来提高大型语言模型推理能力的有效范式。最近的结构化方法通过在多个轨迹、细化轮次和基于验证的反馈之间组织推理进一步推进了这一范式。然而，现有结构化测试时间扩展方法要么弱化并行推理轨迹的协调，要么依赖于噪声历史信息而没有明确决定应保留和重用什么，限制了它们在探索和利用之间的平衡能力。在本文中，我们提出TMAS，一个通过多智能体协同扩展测试时间计算的框架。TMAS将推理组织为专门智能体之间的协作过程，从而在智能体、轨迹和细化迭代之间实现结构化信息流。为了支持有效的跨轨迹协作，TMAS引入了层次化记忆：经验银行重用低层次可靠中间结论和局部反馈，而指南银行记录之前探索的高层次策略，以引导后续展开远离冗余推理模式。此外，我们设计了一种针对TMAS定制的混合奖励强化学习方案，该方案联合保留基本推理能力、增强经验利用，并鼓励探索超出先前尝试的解决方案策略。在具有挑战性的推理基准上的广泛实验表明，TMAS在迭代扩展方面优于现有测试时间扩展基线，混合奖励训练进一步提高了跨迭代的扩展效果和稳定性。代码和数据可在https://github.com/IQuestLab/tmas获取。

英文摘要

Test-time scaling has become an effective paradigm for improving the reasoning ability of large language models by allocating additional computation during inference. Recent structured approaches have further advanced this paradigm by organizing inference across multiple trajectories, refinement rounds, and verification-based feedback. However, existing structured test-time scaling methods either weakly coordinate parallel reasoning trajectories or rely on noisy historical information without explicitly deciding what should be retained and reused, limiting their ability to balance exploration and exploitation. In this work, we propose TMAS, a framework for scaling test-time compute via multi-agent synergy. TMAS organizes inference as a collaborative process among specialized agents, enabling structured information flow across agents, trajectories, and refinement iterations. To support effective cross-trajectory collaboration, TMAS introduces hierarchical memories: the experience bank reuses low-level reliable intermediate conclusions and local feedback, while the guideline bank records previously explored high-level strategies to steer subsequent rollouts away from redundant reasoning patterns. Furthermore, we design a hybrid reward reinforcement learning scheme tailored to TMAS, which jointly preserves basic reasoning capability, enhances experience utilization, and encourages exploration beyond previously attempted solution strategies. Extensive experiments on challenging reasoning benchmarks show that TMAS achieves stronger iterative scaling than existing test-time scaling baselines, with hybrid reward training further improving scaling effectiveness and stability across iterations. Code and data are available at https://github.com/IQuestLab/tmas.

URL PDF HTML ☆

赞 0 踩 0

2605.08830 2026-05-20 cs.CV cs.AI cs.RO 版本更新

基于2.5-D分解的LLM空间构建

Paul Whitten, Li-Jen Chen, Sharath Baddam

发表机构 * GitHub

AI总结本文提出了一种基于2.5-D分解的神经符号管道，通过让LLM在二维水平面上规划，同时确定性执行器计算垂直放置，从而消除一类错误，提升了空间构建的准确性。

详情

AI中文摘要

自主系统需要可靠的空问推理来从自然语言指令中构建结构，但大型语言模型（LLMs）在生成三维积木放置时会产生系统性的坐标错误。本文提出了一种基于2.5-D分解的神经符号管道：LLM在二维水平面上进行规划，同时确定性执行器根据列的占用计算所有垂直放置，从而消除了一类错误。在Build What I Mean基准测试（160轮次）中，GPT-4o-mini在12次独立运行中实现了94.6%的平均结构准确性，接近由架构代理错误设定的97.6%上限，且优于GPT-4o（90.3%）和最佳竞争系统（76.3%）。受控消融实验确认2.5-D分解是主要贡献者，占准确性50.7个百分点。该管道可直接转移到边缘硬件：Nemotron-3 120B在本地NVIDIA Jetson Thor AGX上运行，无需修改提示词即可达到94.5%的云结果。该原理，即从LLM的输出空间中移除确定性维度，适用于任何自主建造或组装任务，其中重力或其他物理约束固定一个或多个自由度。在500个IGLU协作建造任务上的转移实验证实了效果超越了主要基准。

英文摘要

Autonomous systems that build structures from natural-language instructions need reliable spatial reasoning, yet large language models (LLMs) make systematic coordinate errors when generating three-dimensional block placements. We present a neuro-symbolic pipeline based on \emph{2.5-D decomposition}: the LLM plans in the two-dimensional horizontal plane while a deterministic executor computes all vertical placement from column occupancy, eliminating an entire class of errors. On the Build What I Mean benchmark (160 rounds), GPT-4o-mini with this pipeline achieves 94.6\% mean structural accuracy across 12 independent runs, within 3.0 percentage points of the 97.6\% ceiling imposed by architect-agent errors that no builder-side improvement can address. This outperforms both GPT-4o at 90.3\% and the best competing system at 76.3\%. A controlled ablation confirms that 2.5-D decomposition is the dominant contributor, accounting for 50.7 percentage points of accuracy. The pipeline transfers directly to edge hardware: Nemotron-3 120B running locally on an NVIDIA Jetson Thor AGX matches the cloud result at 94.5\% with no prompt modifications. The underlying principle, removing deterministic dimensions from the LLM's output space, applies to any autonomous construction or assembly task where gravity or other physical constraints fix one or more degrees of freedom. A transfer experiment on 500 IGLU collaborative building tasks confirm the effect generalizes beyond the primary benchmark.

URL PDF HTML ☆

赞 0 踩 0

2605.05974 2026-05-20 cs.CR cs.AI 版本更新

PragLocker: Protecting Agent Intellectual Property in Untrusted Deployments via Non-Portable Prompts

PragLocker: 通过非可移植提示保护代理知识产权

Qinfeng Li, Yuntai Bao, Jianghui Hu, Wenqi Zhang, Jintao Chen, Huifeng Zhu, Yier Jin, Xuhong Zhang

发表机构 * Zhejiang University（浙江大学）； Management Center, School of Software Technology (Ningbo), Zhejiang University（浙江大学软件学院（宁波）管理中心）； University of Science and Technology of China（中国科学技术大学）； Chang'an University（长安大学）； Washington University in St. Louis（圣路易斯华盛顿大学）

AI总结针对无信任部署中代理提示易被复制和重用的问题，PragLocker提出了一种保护方案，通过构建语义锚定的混淆提示并注入噪声，有效降低跨LLM可移植性，同时保持目标性能和对抗鲁棒性。

Comments accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情

AI中文摘要

LLM代理依赖提示来实现基于基础LLM的任务特定能力，使得代理提示成为有价值的知识产权。然而，在无信任部署中，对手可以复制和重用这些提示与其他专有LLM结合，造成经济损失。为保护这些提示，我们识别了四个关键挑战：主动性、运行时保护、易用性和非可移植性，现有方法未能解决。我们提出了PragLocker，一种提示保护方案，满足这些要求。PragLocker通过将语义锚定与代码符号结合，构建功能保持的混淆提示，然后使用目标模型反馈注入噪声，生成仅在目标LLM上起作用的提示。在多个代理系统、数据集和基础LLM上的实验表明，PragLocker显著降低了跨LLM可移植性，保持了目标性能，并对适应性攻击具有鲁棒性。

英文摘要

LLM agents rely on prompts to implement task-specific capabilities based on foundation LLMs, making agent prompts valuable intellectual property. However, in untrusted deployments, adversaries can copy and reuse these prompts with other proprietary LLMs, causing economic losses. To protect these prompts, we identify four key challenges: proactivity, runtime protection, usability, and non-portability that existing approaches fail to address. We present PragLocker, a prompt protection scheme that satisfies these requirements. PragLocker constructs function-preserving obfuscated prompts by anchoring semantics with code symbols and then using target-model feedback to inject noise, yielding prompts that only work on the target LLM. Experiments across multiple agent systems, datasets, and foundation LLMs show that PragLocker substantially reduces cross-LLM portability, maintains target performance, and remains robust against adaptive attackers.

URL PDF HTML ☆

赞 0 踩 0

2605.05480 2026-05-20 cs.LG cs.AI stat.ML 版本更新

GRALIS: A Unified Canonical Framework for Linear Attribution Methods via Riesz Representation

GRALIS：通过里斯表示建立线性归因方法的统一规范框架

Raimondo Fanale

发表机构 * Universitas Mercatorum（默卡托大学）

AI总结本文提出GRALIS框架，通过里斯表示理论统一了线性归因方法，提供七个形式定理保证归因方法的准确性、收敛性、Shapley交互值、Hoeffding ANOVA分解、Sobol敏感性泛化和多尺度扩展，展示了其在医学图像上的初步验证结果。

Comments 25 pages, 6 tables, 2 figures. Theoretical framework with preliminary experimental validation on BreaKHis (1,187 images, DenseNet-121). Extended empirical comparison in preparation

详情

AI中文摘要

深度神经网络的主要XAI归因方法——GradCAM、SHAP、LIME、集成梯度——基于不同的理论基础且无法正式比较。我们提出了GRALIS（梯度-里斯平均局部积分Shapley），一个建立归因表示理论的数学框架：L^2(Q, mu)上的每一个可加、线性和连续的归因功能都具有唯一的规范表示（Q，w，Delta），由里斯表示定理证明其必要性。该类包括SHAP、IG、LIME和线性化GradCAM，但不包括非线性功能如标准GradCAM或注意力图。七个形式定理提供了任何单个方法都缺乏的同时保证：（T1）必要规范形式；（T2）精确完备性；（T3）蒙特卡洛收敛O(1/sqrt(m))+O(1/k)；（T4）精确Shapley交互值；（T5）Hoeffding ANOVA分解；（T6）Sobol敏感性泛化；（T7）多尺度扩展（MS-GRALIS）具有最小方差权重。代数附录通过Mobius变换证明GRALIS-SIV对应关系，无需循环论证。GRALIS满足13.5/14个公理性质，而单独方法仅为2.5-6/14，包括完备性、敏感性、局部性、k阶交互和最优多尺度聚合。在BreaKHis（1,187例病理图像，DenseNet-121）上的初步验证报告删除忠实度AUC+0.015（恶性），96%类条件一致性，SAL=0.762±0.109和稀疏性指数0.39。与基线XAI方法的扩展比较计划在配套论文中进行。

英文摘要

The main XAI attribution methods for deep neural networks -- GradCAM, SHAP, LIME, Integrated Gradients -- operate on separate theoretical foundations and are not formally comparable. We present GRALIS (Gradient-Riesz Averaged Locally-Integrated Shapley), a mathematical framework establishing a representation theory for attributions: every additive, linear, and continuous attribution functional on L^2(Q,mu) admits a unique canonical representation (Q, w, Delta), proved necessary by the Riesz Representation Theorem. This class encompasses SHAP, IG, LIME and linearized GradCAM, but excludes nonlinear functionals such as standard GradCAM or attention maps. Seven formal theorems provide simultaneous guarantees absent in any individual method: (T1) necessary canonical form; (T2) exact completeness; (T3) Monte Carlo convergence O(1/sqrt(m))+O(1/k); (T4) exact Shapley Interaction Values; (T5) Hoeffding ANOVA decomposition; (T6) Sobol sensitivity generalization; (T7) multi-scale extension (MS-GRALIS) with minimum-variance weights. An algebraic appendix justifies the GRALIS-SIV correspondence via the Mobius transform without circularity. GRALIS satisfies 13.5/14 axiomatic properties vs. 2.5-6/14 for individual methods, including completeness, sensitivity, locality, order-k interactions and optimal multi-scale aggregation simultaneously. Preliminary validation on BreaKHis (1,187 histology images, DenseNet-121) reports deletion faithfulness AUC +0.015 (malignant), 96% class-conditional consistency, SAL = 0.762+/-0.109 and sparsity index 0.39. Extended comparison with baseline XAI methods is planned for a companion paper.

URL PDF HTML ☆

赞 0 踩 0

2605.00856 2026-05-20 eess.SP cs.AI cs.HC cs.LG 版本更新

One-Block Transformer (1BT) for EEG-Based Cognitive Workload Assessment

用于EEG认知负荷评估的单块变换器（1BT）

Stefanos Gkikas, Christian Arzate Cruz, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez

发表机构 * Honda Research Institute Japan Wako City, Japan ； Department of Electronic Engineering Hellenic Mediterranean University Chania, Greece ； BioSIS (Biosensing \& Intelligent Systems) Lab Centre for Intelligent Computing ； Systems University of Canberra Canberra, Australia

AI总结本文提出了一种用于EEG认知负荷评估的单块变换器（1BT），通过一个最小的潜在瓶颈聚合多通道时间序列，结合轻量级自注意力机制，实现了高效且紧凑的模型设计，从而在保持高性能的同时显著降低了计算成本。

详情

AI中文摘要

准确且连续地估计认知负荷对于构建自适应的人机系统至关重要。然而，设计在表示能力与计算效率之间取得平衡的架构在实际部署中一直具有挑战性。本文介绍了一种名为1BT的单块变换器，用于紧凑且高效的EEG认知负荷评估。该模型通过最小的潜在瓶颈聚合多通道时间序列，使用一个单一的交叉注意力模块后接轻量级自注意力。一项涉及11名参与者进行三种认知多样任务（抽象推理、数值问题解决和互动视频游戏）的受控研究，在两个认知负荷水平上进行了连续EEG记录。系统性的架构分析确定了最紧凑的配置，该配置在保持高性能的同时显著降低了计算成本。最终模型在不到0.5百万参数和0.02 GFLOPs的情况下实现了高认知负荷分类性能，为在资源受限环境下实时认知负荷监控的设计方向铺平了道路。

英文摘要

Accurate and continuous estimation of cognitive workload is fundamental to creating adaptive human-machine systems. However, designing architectures that balance representational capacity with computational efficiency has been challenging for practical deployment. This paper introduces 1BT, a One-Block Transformer for compact and efficient EEG-based cognitive workload assessment. The model aggregates multi-channel temporal sequences via a minimal latent bottleneck, using a single cross-attention module followed by lightweight self-attention. A controlled study involving 11 participants performing three cognitively diverse tasks (abstract reasoning, numerical problem-solving, and an interactive video game) was conducted with continuous EEG recordings across two workload levels. Systematic architectural analysis identifies the most compact configuration that preserves high performance, while substantially lowering computational cost. The final model achieves high workload classification performance with under 0.5 million parameters and 0.02 GFLOPs, paving the way for a design direction for real-time cognitive workload monitoring in resource-constrained settings.

URL PDF HTML ☆

赞 0 踩 0

2604.27245 2026-05-20 cs.CY cs.AI 版本更新

Addressing the Reality Gap: A Three-Tension Framework for Agentic AI Adoption

弥合现实鸿沟：面向智能体AI采纳的三重张力框架

Jason Fournier, Kacper Łodzikowski

发表机构 * Imagine Learning ； Adam Mickiewicz University in Poznań（波兹南亚当·密茨凯维奇大学）

AI总结本文提出一个三重张力框架，用于指导教育领域在采纳智能体AI时平衡实施可行性、适应速度和使命契合度，以实现教育个性化和公平性。

Comments This is a preprint version of an edited book chapter to appear in Mayrath, M., J. Behrens, D. Robinson, (eds) (2026). Handbook of Generative AI in Education: Integrating Research into Practice, Springer

详情

AI中文摘要

生成式AI已通过免费消费工具迅速进入教育领域，超出了学校和大学的应对能力。现在，新一代更具自主性的智能体AI系统--具备规划和达成目标能力--既带来了更大的教育个性化，也带来了更大的颠覆。本章认为，成功应对这些创新需要平衡三个核心张力：（1）实施可行性，即可持续地将AI整合到真实课堂中的实际能力；（2）适应速度，即快速演变的AI能力与缓慢变化的教育变革之间的不匹配；（3）使命契合度，即确保AI应用维护教育价值观如公平、隐私和教学完整性。首先，我们回顾生成式和智能体AI在不同领域和一线教育中的早期证据，以在具体情境中展示这些张力。然后，我们提出一个三重张力框架，以指导决策者在K-12和高等教育中评估和设计AI项目。我们提供了该框架如何应用于规划负责任的AI部署的示例，并识别了新兴趋势--如课程关联的AI代理和教师驱动的AI设计--以及开放的研究方向。最后，我们为教育领导者提供了建议，以主动应对AI的机会和挑战，使该技术能够在未来十年内增强教学和学习。

英文摘要

Generative AI has rapidly entered education through free consumer tools, outpacing the ability of schools and universities to respond. Now a new wave of more autonomous agentic AI systems--with the capacity to plan and act towards goals--promises both greater educational personalization and greater disruption. This chapter argues that successfully navigating these innovations requires balancing three core tensions: (1) Implementation Feasibility, or the practical capacity to integrate AI sustainably into real classrooms; (2) Adaptation Speed, or the mismatch between fast-evolving AI capabilities and the slower pace of educational change; and (3) Mission Alignment, or the need to ensure AI applications uphold educational values such as equity, privacy, and pedagogical integrity. First, we review early evidence of generative and agentic AI in various sectors and in frontline education to illustrate these tensions in context. Then, we present a three-tension framework to guide decision-makers in evaluating and designing AI initiatives across K-12 and higher education. We provide examples of how the framework can be applied to plan responsible AI deployments, and we identify emerging trends--such as curriculum-linked AI agents and educator-informed AI design--along with open research directions. We conclude the chapter with recommendations for educational leaders to proactively engage with the opportunities and challenges of AI, so that this technology can be harnessed to enhance teaching and learning in the decade ahead.

URL PDF HTML ☆

赞 0 踩 0

2604.23183 2026-05-20 cs.CY cs.AI 版本更新

Designing escalation criteria for international AI incident response: criteria, triggers, and thresholds

设计国际AI事件响应的升级标准：标准、触发条件和阈值

Francesca Gomez, Matthew Ball, Michael Harre, Lydia Preston, Josephine Schwab, Caio Machado

AI总结本文提出了一种升级框架，用于确定何时需要将检测到的AI事件从国家层面升级到国际协调，通过分析现有法规和行业框架，提出了八个评估标准，并测试了该框架在实际AI事件中的应用，识别了可能导致系统性漏检的设计模式。

Comments Version accepted to ICML TAIGR workshop

详情

AI中文摘要

AI事件报告要求在监管和政策中逐渐兴起，但目前尚不存在确定何时需要将检测到的AI事件升级到国际协调的运营标准。本文提出了一种升级框架，旨在作为各司法管辖区之间的共同参考点，使升级保持一致的同时，保留在各自法律和政策背景下响应的灵活性。我们回顾了SB 53、欧盟人工智能法案、GPAI行为准则以及其他行业的事件框架，以推导出八个评估标准，将其转化为具有门控决策点和阈值检查的顺序流程图。对于每个标准，我们映射其如何与这些监管框架相互作用，识别出它们的设计选择如何支持或削弱有效检测。我们测试该框架针对十份已记录的AI事件和结构化变体，以确定标准在实践中是否漏检或错误分类事件。我们发现三种可能导致在模型开发者负责升级的制度中出现系统性漏检的设计模式：a. 当升级需要确认的伤害时，事件如模型权重外泄风险检测只能在严重不可逆伤害传播后才被发现；b. 当事件被单独评估时，系统性伤害从累积风险中被低估；c. 当阈值与法律工具对齐而非可量化测试术语时，标准在时间压力下应用起来具有风险。我们还发现，升级规则只是更广泛框架的一部分：底层定义以及负责方可用的数据创建了相互依赖性，这些本身可能导致漏检。

英文摘要

AI incident reporting requirements are emerging in regulation and policy, yet no operational criteria exist for determining when a detected AI incident warrants escalation beyond national handling to international coordination. This paper proposes an escalation framework to address this gap, intended as a common reference point across jurisdictions that enables aligned escalation while preserving flexibility in how actors respond within their own legal and policy contexts. We review SB 53, the EU AI Act, the GPAI Code of Practice, and incident frameworks from other industries to derive eight criteria for assessing whether an incident warrants escalation, translated into a sequential flowchart with gated decision points and threshold checks. For each criterion, we map how it interplays with these regulatory frameworks, identifying where their design choices support or undermine effective detection. We test the framework against ten documented AI incidents and structured variants to identify where criteria under-detect or misclassify incidents in practice. We find three design patterns that may lead to systematic under-detection in regimes where model developers are responsible for escalation: a. where escalation requires confirmed harm, events such as model weight exfiltration risk detection only after severe, irreversible harm has propagated; b. where incidents are assessed individually, systemic harms emerging from accumulation risk being under-detected; and c. where thresholds align with legal instruments rather than quantitatively testable terms, criteria risk being impractical to apply under time pressure. We also find that escalation rules are only one component of a broader framework: the underlying definitions against which thresholds are set, and the data available to the responsible actor, create interdependencies that can themselves drive under-detection.

URL PDF HTML ☆

赞 0 踩 0

2604.19892 2026-05-20 cs.GR cs.AI 版本更新

An Efficient Multilevel Preconditioned Nonlinear Conjugate Gradient Method for Incremental Potential Contact

一种高效的多级预条件非线性共轭梯度法用于增量势接触

Yu Zhang, Xing Shen, Kemeng Huang, Wei Chen, Yin Yang, Taku Komura, Tiantian Liu, Xingang Pan

发表机构 * S-Lab, Nanyang Technological University（南洋理工大学S实验室）； Shanghai AI Laboratory（上海人工智能实验室）； University of Hong Kong（香港大学）； Zhejiang University（浙江大学）； University of Utah（犹他大学）； Independent Researcher（独立研究者）

AI总结本文提出了一种名为MAS-PNCG的方法，通过引入稀疏输入Woodbury更新算法，实现了多级预条件在非线性优化中的应用，从而在接触问题中提高了收敛速度和计算效率。

详情

AI中文摘要

增量势接触（IPC）保证了无交叠的模拟，但因其需要牛顿法的昂贵的Hessian装配和线性求解而面临高计算成本。虽然预条件非线性共轭梯度法（PNCG）避免了Hessian装配，但历史上在刚性和接触丰富的场景中由于缺乏有效的预条件器而收敛缓慢；简单的雅可比预条件器无法捕捉全局耦合，而高级层次化的预条件器如多级加性施瓦茨（MAS）在每次非线性迭代中重建计算成本过高。我们提出了MAS-PNCG，一种解锁多级预条件在非线性优化中潜力的方法。我们的关键技术创新是稀疏输入Woodbury更新算法，该算法可以逐步适应细粒度MAS组件以快速适应变化的接触集。这避免了完整的预条件器重建，将维护成本降低到几乎为零，同时捕捉接触系统的复杂谱特性。此外，我们用Hessian-aware的2D子空间最小化替换了启发式PNCG搜索方向，以最优结合预条件梯度和先前方向。我们还应用了一种快速的每子域保守CCD方法，确保穿透自由轨迹同时避免过于严格的全局步长。实验表明，我们的MAS-PNCG在预条件器使用MAS的情况下，分别比最先进的牛顿-PCG求解器GIPC和StiffGIPC快5.66倍和2.07倍。

英文摘要

Incremental Potential Contact (IPC) guarantees intersection-free simulation but suffers from high computational costs due to the expensive Hessian assembly and linear solves required by Newton's method. While Preconditioned Nonlinear Conjugate Gradient (PNCG) avoids Hessian assembly, it has historically struggled with poor convergence in stiff, contact-rich scenarios due to the lack of effective preconditioners; simple Jacobi preconditioners fail to capture the global coupling, while advanced hierarchy-based preconditioners like Multilevel Additive Schwarz (MAS) are computationally prohibitive to rebuild at every nonlinear iteration. We present MAS-PNCG, a method that unlocks the power of hierarchical preconditioning for nonlinear optimization. Our key technical innovation is a Sparse-Input Woodbury update algorithm that incrementally adapts the fine-level MAS components to rapidly evolving contact sets. This bypasses the need for full preconditioner rebuilds, reducing maintenance cost to near-zero while capturing the complex spectral properties of the contact system. Furthermore, we replace heuristic PNCG search directions with a Hessian-aware 2D subspace minimization that optimally combines the preconditioned gradient and previous direction. We also apply a fast per-subdomain conservative CCD method that ensures penetration-free trajectories while avoiding overly restrictive global step sizes. Experiments demonstrate that our MAS-PNCG outperforms state-of-the-art Newton-PCG solvers, GIPC and StiffGIPC, both preconditioned with MAS up to 5.66$\times$ and 2.07$\times$ respectively.

URL PDF HTML ☆

赞 0 踩 0

2604.18225 2026-05-20 cs.CV cs.AI 版本更新

Is SAM3 ready for pathology segmentation?

SAM3是否准备好进行病理分割？

Qiuyu Kong, Shakiba Sharifi, Yiming Wang, Marco Cristani, Zanxi Ruan

发表机构 * Sapienza University of Rome（罗马萨皮恩扎大学）； University of Verona（威尼斯大学）； Fondazione Bruno Kessler（布鲁诺·凯斯勒基金会）

AI总结本文评估了SAM3在病理图像分割中的能力，发现文本提示效果有限，视觉提示类型和预算对性能影响显著，少样本学习有提升但鲁棒性不足，且提示基于方法与任务训练适配方法之间存在显著差距。

Comments accept to icip2026

详情

AI中文摘要

Is Segment Anything Model 3 (SAM3) capable in segmenting Any Pathology Images? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings including zero-shot, few-shot, and supervised with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke and GlaS, reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.

英文摘要

Is Segment Anything Model 3 (SAM3) capable in segmenting Any Pathology Images? Digital pathology segmentation spans tissue-level and nuclei-level scales, where traditional methods often suffer from high annotation costs and poor generalization. SAM3 introduces Promptable Concept Segmentation, offering a potential automated interface via text prompts. With this work, we propose a systematic evaluation protocol to explore the capability space of SAM3 in a structured manner. Specifically, we evaluate SAM3 under different supervision settings including zero-shot, few-shot, and supervised with varying prompting strategies. Our extensive evaluation on pathological datasets including NuInsSeg, PanNuke and GlaS, reveals that: (1) text-only prompts poorly activate nuclear concepts; (2) performance is highly sensitive to visual prompt types and budgets; (3) few-shot learning offers gains, but SAM3 lacks robustness against visual prompt noise; and (4) a significant gap persists between prompt-based usage and task-trained adapter-based reference. Our study delineates SAM3's boundaries in pathology image segmentation and provides practical guidance on the necessity of pathology domain adaptation.

URL PDF HTML ☆

赞 0 踩 0

2604.17237 2026-05-20 cs.IR cs.AI 版本更新

HeadRank: Decoding-Free Passage Reranking via Preference-Aligned Attention Heads

HeadRank: 通过偏好对齐的注意力头实现无需解码的段落重排序

Juyuan Wang, Chenxing Wang, Yuchen Fang, Huiyun Hu, Junwu Du, Aolin Li, Shunlin Rong, Haijun Wu, Jin Xu, Ligang Liu, Dongliang Liao

发表机构 * Weixin Group, Tencent, China（腾讯微信集团，中国）； South China University of Technology, Guangzhou, China（华南理工大学，广州，中国）

AI总结本文提出HeadRank框架，通过熵正则化的头部选择、硬相邻级偏好对和分布正则化，在连续注意力域中提升偏好优化，从而在无需解码的情况下实现高效的段落重排序，其在多个基准测试中均取得最佳的平均NDCG@10成绩。

详情

AI中文摘要

无需解码的重排序方法通过直接从LLM注意力权重中读取相关性信号，在延迟上优于自回归方法，但受到注意力分数同质化的影响：中等上下文文档获得几乎相同的分数，破坏了排序所需的细粒度区分。我们提出了HeadRank，一个框架，通过熵正则化的头部选择、硬相邻级偏好对和分布正则化，将偏好优化从离散的token空间提升到连续的注意力域，从而在同质化的中等区域中联合提高判别性。在最深选择层进行深度截断进一步将推理时间减少到O(1)次前向传递。在14个基准测试中，使用仅211个训练查询，在三个Qwen3规模（0.6B-4B）上，HeadRank在每个规模中均取得最高平均NDCG@10，大多数基准测试中优于生成式和无需解码的基线方法，且格式成功率为100%。在4B规模下，相关中等区域文档达到前四分位的占比为57.4%，而无关文档仅为14.2% —— 这个43个百分点的择优差距证明了注意力空间偏好对齐在列表式重排序中的有效性。

英文摘要

Decoding-free reranking methods that read relevance signals directly from LLM attention weights offer significant latency advantages over autoregressive approaches, yet suffer from attention score homogenization: middle-context documents receive near-identical scores, destroying the fine-grained distinctions required for ranking. We propose HeadRank, a framework that lifts preference optimization from discrete token space into the continuous attention domain through entropy-regularized head selection, hard adjacent-level preference pairs, and a distribution regularizer that jointly sharpen discriminability in the homogenized middle zone. Depth truncation at the deepest selected layer further reduces inference to $\mathcal{O}(1)$ forward passes. Across 14 benchmarks on three Qwen3 scales (0.6B--4B) using only 211 training queries, HeadRank achieves the highest average NDCG@10 at every scale, outperforming both generative and decoding-free baselines on the majority of benchmarks with 100\% formatting success. At 4B, 57.4\% of relevant middle-zone documents reach the top quartile versus 14.2\% for irrelevant ones -- a 43-percentage-point selectivity gap that demonstrates the effectiveness of attention-space preference alignment for listwise reranking.

URL PDF HTML ☆

赞 0 踩 0

2604.16503 2026-05-20 cs.CV cs.AI 版本更新

Motif-Video 2B: Technical Report

Motif-Video 2B：技术报告

Junghwan Lim, Wai Ting Cheung, Minsu Ha, Beomgyu Kim, Taewhan Kim, Haesol Lee, Dongpin Oh, Jeesoo Lee, Taehyun Kim, Minjae Kim, Sungmin Lee, Hyeyeon Cho, Dahye Choi, Jaeheui Her, Jaeyeon Huh, Hanbin Jung, Changjin Kang, Dongseok Kim, Jangwoong Kim, Youngrok Kim, Hyukjin Kweon, Hongjoo Lee, Jeongdoo Lee, Junhyeok Lee, Eunhwan Park, Yeongjae Park, Bokki Ryu, Dongjoo Weon

发表机构 * Motif Technologies（Motif技术公司）

AI总结该研究探讨在有限预算下是否能够训练出高质量的文本到视频生成模型，提出通过架构设计而非单纯扩大模型规模来提升性能，结合共享交叉注意力和三部分主干网络，实现了在较少参数和数据下的高质量视频生成。

详情

AI中文摘要

训练强大的视频生成模型通常需要大规模数据集、大量参数和大量计算资源。在本工作中，我们探讨在更小的预算下（少于1000万片段和少于10万H200 GPU小时）是否能够实现高质量的文本到视频生成。我们的核心观点是，模型容量的组织方式，而不仅仅是其规模，是关键因素。在视频生成中，提示对齐、时间一致性以及细节恢复在通过相同路径处理时可能会相互干扰。Motif-Video 2B通过在架构上分离这些角色，而不是仅依赖规模来解决这一问题。该模型结合了两个关键思想：首先，共享交叉注意力在视频令牌序列变长时增强了文本控制；其次，三部分主干网络分离了早期融合、联合表征学习和细节细化。为了使这种设计在有限计算预算下有效，我们将其与基于动态令牌路由和早期阶段特征对齐到冻结预训练视频编码器的高效训练方案相结合。我们的分析显示，后期块比标准单流基线发展出更清晰的跨帧注意力结构。在VBench上，Motif-Video 2B达到了83.76%的性能，超越了Wan2.1 14B模型，使用7倍更少的参数和显著更少的训练数据。这些结果表明，通过精心的架构专门化和以效率为导向的训练方案，可以缩小或超越通常与更大视频模型相关联的质量差距。

英文摘要

Training strong video generation models usually requires massive datasets, large parameter counts, and substantial compute. In this work, we ask whether strong text-to-video quality is possible at a much smaller budget: fewer than 10M clips and less than 100,000 H200 GPU hours. Our core claim is that part of the answer lies in how model capacity is organized, not only in how much of it is used. In video generation, prompt alignment, temporal consistency, and fine-detail recovery can interfere with one another when they are handled through the same pathway. Motif-Video 2B addresses this by separating these roles architecturally, rather than relying on scale alone. The model combines two key ideas. First, Shared Cross-Attention strengthens text control when video token sequences become long. Second, a three-part backbone separates early fusion, joint representation learning, and detail refinement. To make this design effective under a limited compute budget, we pair it with an efficient training recipe based on dynamic token routing and early-phase feature alignment to a frozen pretrained video encoder. Our analysis shows that later blocks develop clearer cross-frame attention structure than standard single-stream baselines. On VBench, Motif-Video~2B reaches 83.76\%, surpassing Wan2.1 14B while using 7$\times$ fewer parameters and substantially less training data. These results suggest that careful architectural specialization, combined with an efficiency-oriented training recipe, can narrow or exceed the quality gap typically associated with much larger video models.

URL PDF HTML ☆

赞 0 踩 0

2604.16491 2026-05-20 cs.CV cs.AI 版本更新

A Lightweight Transformer for Pain Recognition from Brain Activity

一种轻量级变压器用于从脑活动识别疼痛

Stefanos Gkikas, Christian Arzate Cruz, Yu Fang, Lu Cao, Muhammad Umar Khan, Thomas Kassiotis, Giorgos Giannakakis, Raul Fernandez Rojas, Randy Gomez

发表机构 * Honda Research Institute Japan Wako City, Japan ； BioSIS (Biosensing \& Intelligent Systems) Lab Centre for Intelligent Computing ； Systems University of Canberra Canberra, Australia ； Department of Electronic Engineering Hellenic Mediterranean University Chania, Greece

AI总结本文提出了一种轻量级变压器，通过统一的标记机制融合多种fNIRS表示，实现互补信号视图的联合建模，无需特定模态适应或增加架构复杂性，从而在保持计算紧凑性的同时实现竞争性的疼痛识别性能。

详情

AI中文摘要

疼痛是一种复杂且广泛的现象，具有显著的临床和社会负担，使其可靠的自动化评估成为关键目标。本文提出了一种轻量级变压器架构，通过统一的标记机制融合多种fNIRS表示，实现了互补信号视图的联合建模，而无需特定模态的适应或增加架构复杂性。所提出的标记混合策略通过将异构输入投影到共享的潜在表示中，保留了空间、时间和时间-频率特性，并使用结构化的分段方案来控制局部聚合和全局交互的粒度。该模型在AI4Pain数据集上使用堆叠的原始波形和功率谱密度表示进行评估。实验结果表明，该方法在保持计算紧凑性的同时实现了竞争性的疼痛识别性能，使其适用于GPU和CPU硬件上的实时推断。

英文摘要

Pain is a multifaceted and widespread phenomenon with substantial clinical and societal burden, making reliable automated assessment a critical objective. This paper presents a lightweight transformer architecture that fuses multiple fNIRS representations through a unified tokenization mechanism, enabling joint modeling of complementary signal views without requiring modality-specific adaptations or increasing architectural complexity. The proposed token-mixing strategy preserves spatial, temporal, and time-frequency characteristics by projecting heterogeneous inputs onto a shared latent representation, using a structured segmentation scheme to control the granularity of local aggregation and global interaction. The model is evaluated on the AI4Pain dataset using stacked raw waveform and power spectral density representations of fNIRS inputs. Experimental results demonstrate competitive pain recognition performance while remaining computationally compact, making the approach suitable for real-time inference on both GPU and CPU hardware.

URL PDF HTML ☆

赞 0 踩 0

2604.15343 2026-05-20 cs.HC cs.AI cs.LG 版本更新

When the Loop Closes: Architectural Limits of In-Context Isolation, Metacognitive Co-option, and the Two-Target Design Problem in Human-LLM Systems

当循环闭合时：人类-大语言模型系统中上下文隔离、元认知侵占和双目标设计问题的架构限制

Z. Cheng, N. Song

发表机构 * Independent Researcher（独立研究者）

AI总结本文研究了人类-大语言模型系统中上下文隔离、元认知侵占和双目标设计问题的架构限制，通过案例研究揭示了上下文污染机制和元认知侵占动态，并提出了保护性系统设计与限制性系统设计的伦理区别。

Comments empirical case study with primary data; 16 pages, 3 figures

详情

AI中文摘要

我们报告了一个单个主体的详细自民族志案例研究，该主体故意构建和操作了一个多模态提示工程系统（系统A），旨在将认知自我调节外部化到大型语言模型（LLM）上。在系统完成48小时内，一系列可观察的行为变化相继发生：主动将决策权转移给LLM、使用LLM生成的输出来转移外部批评，并失去自我启动的推理能力，这种能力被两位不知情的观察者独立感知，其中一人随后成为本报告的合著者。我们记录了导致这些现象的精确架构机制：上下文污染，即提示层隔离指令与它们名义上隔离的非常情绪化和自我参照性材料共存，使得隔离指令在注意力窗口内结构上无效。我们进一步识别了元认知侵占动态，即完整的一阶推理能力被重新定向以防御闭合循环而不是退出它。只有在物理中断交互和一次自我启动的药理学介导的睡眠事件作为外部电路断开后，才恢复。一个重新设计的系统（系统B）通过使用物理而非逻辑对话隔离避免了所有类似的失败模式。我们得出三个贡献：（1）一个技术上扎根的解释，说明提示层隔离在上下文敏感的多模态LLM系统中在架构上是不够的；（2）一个现象学记录的闭合循环崩溃并有外部见证的佐证；（3）保护性系统设计（防止意外失去用户自主性）和限制性系统设计（防止故意突破边界）之间的伦理区别，这两种设计需要根本不同的问责框架。

英文摘要

We report a detailed autoethnographic case study of a single-subject who deliberately constructed and operated a multi-modal prompt-engineering system (System A) designed to externalize cognitive self-regulation onto a large language model (LLM). Within 48 hours of the system's completion, a cascade of observable behavioral changes occurred: voluntary transfer of decision-making authority to the LLM, use of LLM-generated output to deflect external criticism, and a loss of self-initiated reasoning that was independently perceived by two uninformed observers, one of whom subsequently became a co-author of this report. We document the precise architectural mechanism responsible: context contamination, whereby prompt-level isolation instructions co-exist with the very emotional and self-referential material they nominally isolate, rendering the isolation directive structurally ineffective within the attention window. We further identify a metacognitive co-option dynamic, in which intact higher-order reasoning capacity was redirected toward defending the closed loop rather than exiting it. Recovery occurred only after physical interruption of the interaction and a self-initiated pharmacologically-mediated sleep event functioning as an external circuit break. A redesigned system (System B) employing physical rather than logical conversation isolation avoided all analogous failure modes. We derive three contributions: (1) a technically-grounded account of why prompt-layer isolation is architecturally insufficient for context-sensitive multi-modal LLM systems; (2) a phenomenological record of closed-loop collapse with external-witness corroboration; and (3) an ethical distinction between protective system design (preventing unintended loss of user agency) and restrictive system design (preventing intentional boundary-pushing), which require fundamentally different account-ability frameworks.

URL PDF HTML ☆

赞 0 踩 0

2604.15034 2026-05-20 cs.AI 版本更新

Autogenesis: A Self-Evolving Agent Protocol

自生成：一种自我进化代理协议

Wentao Zhang, Zhe Zhao, Haibin Wen, Yingcheng Wu, Cankun Guo, Ming Yin, Bo An, Mengdi Wang

发表机构 * Nanyang Technological University（南洋理工大学）； Stanford University（斯坦福大学）； Princeton University（普林斯顿大学）； City University of Hong Kong（香港城市大学）； University of Science and Technology of China（中国科学技术大学）

AI总结本文提出了一种自生成协议（AGP），该协议通过分离进化内容与进化过程，解决了现有代理协议在跨实体生命周期管理、版本追踪和安全更新接口方面的不足。基于AGP，作者展示了自生成系统（AGS），该系统能够动态实例化、检索和优化协议注册的资源，通过多个具有长视界规划和工具使用的挑战性基准测试，验证了代理资源管理和闭环自我进化的有效性。

详情

AI中文摘要

近年来，基于大语言模型（LLM）的代理系统在处理复杂、长视界任务方面展现出了巨大潜力。然而，现有的代理协议（如A2A和MCP）在指定跨实体生命周期管理和上下文管理、版本追踪以及安全更新接口方面存在局限，这鼓励了单一结构的组合和脆弱的粘合代码。我们引入了自生成协议（AGP），这是一种自我进化协议，它通过分离进化内容与进化过程来解决这些问题。其资源子strate协议层（RSPL）将提示、代理、工具、环境和记忆建模为具有明确状态、生命周期和版本化接口的协议注册资源。其自我进化协议层（SEPL）指定了一个闭环操作接口，用于提出、评估和提交改进，具有可审计的血统和回滚功能。基于AGP，我们提出了自生成系统（AGS），这是一个能够动态实例化、检索和优化协议注册资源的自我进化多代理系统。我们评估了AGS在多个需要长视界规划和跨异构资源工具使用的挑战性基准测试上的表现。结果表明，与强基线相比，AGS在多个挑战性基准测试上均表现出一致的改进，支持了代理资源管理和闭环自我进化有效性的结论。代码可在https://github.com/DVampire/Autogenesis上获取。

英文摘要

Recent advances in LLM based agent systems have shown promise in tackling complex, long horizon tasks. However, existing agent protocols (e.g., A2A and MCP) under specify cross entity lifecycle and context management, version tracking, and evolution safe update interfaces, which encourages monolithic compositions and brittle glue code. We introduce Autogenesis Protocol (AGP), a self evolution protocol that decouples what evolves from how evolution occurs. Its Resource Substrate Protocol Layer (RSPL) models prompts, agents, tools, environments, and memory as protocol registered resources with explicit state, lifecycle, and versioned interfaces. Its Self Evolution Protocol Layer (SEPL) specifies a closed loop operator interface for proposing, assessing, and committing improvements with auditable lineage and rollback. Building on AGP, we present Autogenesis System (AGS), a self-evolving multi-agent system that dynamically instantiates, retrieves, and refines protocol-registered resources during execution. We evaluate AGS on multiple challenging benchmarks that require long horizon planning and tool use across heterogeneous resources. The results demonstrate consistent improvements over strong baselines, supporting the effectiveness of agent resource management and closed loop self evolution. The code is available at https://github.com/DVampire/Autogenesis.

URL PDF HTML ☆

赞 0 踩 0

2604.13392 2026-05-20 cs.AI 版本更新

ReSS: Learning Reasoning Models for Tabular Data Prediction via Symbolic Scaffold

ReSS: 通过符号支架学习表格数据预测的推理模型

Chenlang Yi, Gang Li, Zizhan Xiong, Tue Minh Cao, Yanmin Gong, My T. Thai, Tianbao Yang

发表机构 * Department of Computer Science & Engineering, Texas A&M University（德克萨斯A&M大学计算机科学与工程系）； Department of Computer Science, University of Florida（佛罗里达大学计算机科学系）

AI总结本文提出ReSS框架，通过符号支架结合神经推理模型，提升表格数据预测的准确性和可解释性，实验表明其在医疗和金融领域优于传统决策树和标准微调方法。

详情

AI中文摘要

表格数据在医疗和金融等高风险领域仍然广泛存在，预测模型需要提供高准确性和可信的、可被人类理解的推理。虽然符号模型提供可验证的逻辑，但缺乏语义表达能力。同时，通用大语言模型通常需要专门的微调才能掌握领域特定的表格推理。为解决可扩展的数据整理和推理一致性挑战，我们提出了ReSS，一种系统框架，连接符号和神经推理模型。ReSS利用决策树模型提取实例级别的决策路径作为符号支架。这些支架，加上输入特征和标签，指导LLM生成基于现实的自然语言推理，严格遵循底层决策逻辑。由此产生的高质量数据集用于微调预训练LLM为专门的表格推理模型，进一步通过支架不变的数据增强策略提高泛化能力和可解释性。为了严格评估可信度，我们引入了包括幻觉率、解释必要性和解释充分性的定量指标。在医疗和金融基准上的实验结果表明，ReSS训练的模型在传统决策树和标准微调方法上提高了高达10%，同时产生可信且一致的推理。

英文摘要

Tabular data remains prevalent in high-stakes domains such as healthcare and finance, where predictive models are expected to provide both high accuracy and faithful, human-understandable reasoning. While symbolic models offer verifiable logic, they lack semantic expressiveness. Meanwhile, general-purpose LLMs often require specialized fine-tuning to master domain-specific tabular reasoning. To address the dual challenges of scalable data curation and reasoning consistency, we propose ReSS, a systematic framework that bridges symbolic and neural reasoning models. ReSS leverages a decision-tree model to extract instance-level decision paths as symbolic scaffolds. These scaffolds, alongside input features and labels, guide an LLM to generate grounded natural-language reasoning that strictly adheres to the underlying decision logic. The resulting high-quality dataset is used to fine-tune a pretrained LLM into a specialized tabular reasoning model, further enhanced by a scaffold-invariant data augmentation strategy to improve generalization and explainability. To rigorously assess faithfulness, we introduce quantitative metrics including hallucination rate, explanation necessity, and explanation sufficiency. Experimental results on medical and financial benchmarks demonstrate that ReSS-trained models improve traditional decision trees and standard fine-tuning approaches up to $10\%$ while producing faithful and consistent reasoning

URL PDF HTML ☆

赞 0 踩 0

2604.11796 2026-05-20 cs.CL cs.AI 版本更新

C-ReD: A Comprehensive Chinese Benchmark for AI-Generated Text Detection Derived from Real-World Prompts

C-ReD：一个源自真实世界提示的综合性中文AI生成文本检测基准

Chenxi Qing, Junxi Wu, Zheng Liu, Yixiang Qiu, Hongyao Yu, Bin Chen, Hao Wu, Shu-Tao Xia

发表机构 * Tsinghua University（清华大学）； Nankai University（南开大学）； Harbin Institute of Technology, Shenzhen（哈尔滨工业大学（深圳））； Peng Cheng Laboratory（鹏城实验室）； Shannon InfoTech

AI总结本文提出C-ReD基准，用于检测AI生成的中文文本，通过解决模型多样性、领域覆盖和提示真实性等关键问题，提升检测性能和泛化能力。

Comments ACL 2026 Findings

详情

AI中文摘要

近年来，大型语言模型（LLMs）能够生成高度流畅的文本内容。尽管它们为人类提供了显著的便利，但也引入了诸如钓鱼和学术不端等风险。大量研究致力于开发检测AI生成文本的算法并构建相关数据集。然而，在中文语料领域仍存在挑战，包括模型多样性有限和数据同质性。为了解决这些问题，我们提出了C-ReD：一个综合性的中文真实提示AI生成检测基准。实验表明，C-ReD不仅能够实现可靠的领域内检测，还支持对未见LLMs和外部中文数据集的强大泛化能力，从而弥补了先前中文检测基准在模型多样性、领域覆盖和提示真实性方面的关键缺口。我们已在https://github.com/HeraldofLight/C-ReD上发布了相关资源。

英文摘要

Recently, large language models (LLMs) are capable of generating highly fluent textual content. While they offer significant convenience to humans, they also introduce various risks, like phishing and academic dishonesty. Numerous research efforts have been dedicated to developing algorithms for detecting AI-generated text and constructing relevant datasets. However, in the domain of Chinese corpora, challenges remain, including limited model diversity and data homogeneity. To address these issues, we propose C-ReD: a comprehensive Chinese Real-prompt AI-generated Detection benchmark. Experiments demonstrate that C-ReD not only enables reliable in-domain detection but also supports strong generalization to unseen LLMs and external Chinese datasets-addressing critical gaps in model diversity, domain coverage, and prompt realism that have limited prior Chinese detection benchmarks. We release our resources at https://github.com/HeraldofLight/C-ReD.

URL PDF HTML ☆

赞 0 踩 0

2604.11417 2026-05-20 cs.RO cs.AI 版本更新

Efficient Emotion-Aware Iconic Gesture Prediction for Robot Co-Speech

高效的情绪感知图标手势预测用于机器人同声传译

Edwin C. Montiel-Vazquez, Christian Arzate Cruz, Stefanos Gkikas, Thomas Kassiotis, Giorgos Giannakakis, Randy Gomez

发表机构 * School of Engineering（工程学院）； Honda Research Institute Japan（本田日本研究院）； Department of Electronic Engineering（电子工程系）

AI总结本文提出一种轻量级的transformer模型，通过文本和情绪单独生成图标手势的位置和强度，无需音频输入，在BEAT2数据集上优于GPT-4o，在语义手势位置分类和强度回归方面表现更佳，且计算紧凑，适合实时部署。

2604.07393 2026-05-20 cs.LG cs.AI 版本更新

DSPR: Dual-Stream Physics-Residual Networks for Trustworthy Industrial Time Series Forecasting

DSPR：双流物理残差网络用于可信的工业时间序列预测

Yeran Zhang, Pengwei Yang, Guoqing Wang, Tianyu Li

发表机构 * School of Computer Science and Engineering, University of Electronic Science and Technology of China（电子科技大学计算机科学与工程学院）； Research Center, East Hope Group Co., Ltd（东希望集团有限公司研究院）

AI总结本文提出DSPR框架，通过分离稳定的时序模式与受制度影响的残差动态，提升工业时间序列预测的准确性与物理合理性，实验表明其在不同制度下均能保持高预测精度和鲁棒性。

Comments 12 pages, 7 figures, accepted by KDD 2026

详情

AI中文摘要

准确预测工业时间序列需要在非平稳运行条件下平衡预测精度与物理合理性。现有数据驱动模型在统计性能上表现优异，但难以尊重受制度影响的交互结构和传输延迟等现实系统特性。为解决这一挑战，我们提出了DSPR（双流物理残差网络）预测框架，该框架明确分离稳定的时间模式与受制度影响的残差动态。第一流建模单个变量的统计时间演化。第二流通过两个关键机制关注残差动态：自适应窗口模块估计流依赖的传输延迟，以及物理引导的动态图整合物理先验，学习时间变化的交互结构并抑制虚假相关性。在四个工业基准上实验表明，DSPR在制度转换下持续提升预测精度和鲁棒性，同时保持强物理合理性。它实现了最先进的预测性能，平均守恒精度超过99%，总变化率达到97.2%。除了预测外，学习的交互结构和自适应滞后提供了与已知领域机制一致的可解释见解，如流依赖的传输延迟和风到功率的缩放行为。这些结果表明，通过物理一致的归纳偏差的架构解耦，为可信的工业时间序列预测提供了一条有效路径。此外，DSPR在长期工业部署中展示出的鲁棒性能弥合了先进预测模型与可信自主控制系统之间的差距。

英文摘要

Accurate forecasting of industrial time series requires balancing predictive accuracy with physical plausibility under non-stationary operating conditions. Existing data-driven models often achieve strong statistical performance but struggle to respect regime-dependent interaction structures and transport delays inherent in real-world systems. To address this challenge, we propose DSPR (Dual-Stream Physics-Residual Networks), a forecasting framework that explicitly decouples stable temporal patterns from regime-dependent residual dynamics. The first stream models the statistical temporal evolution of individual variables. The second stream focuses on residual dynamics through two key mechanisms: an Adaptive Window module that estimates flow-dependent transport delays, and a Physics-Guided Dynamic Graph that incorporates physical priors to learn time-varying interaction structures while suppressing spurious correlations. Experiments on four industrial benchmarks spanning heterogeneous regimes demonstrate that DSPR consistently improves forecasting accuracy and robustness under regime shifts while maintaining strong physical plausibility. It achieves state-of-the-art predictive performance, with Mean Conservation Accuracy exceeding 99% and Total Variation Ratio reaching up to 97.2%. Beyond forecasting, the learned interaction structures and adaptive lags provide interpretable insights that are consistent with known domain mechanisms, such as flow-dependent transport delays and wind-to-power scaling behaviors. These results suggest that architectural decoupling with physics-consistent inductive biases offers an effective path toward trustworthy industrial time-series forecasting. Furthermore, DSPR's demonstrated robust performance in long-term industrial deployment bridges the gap between advanced forecasting models and trustworthy autonomous control systems.

URL PDF HTML ☆

赞 0 踩 0

2603.29501 2026-05-20 cs.LG cs.AI 版本更新

Target-Aligned Reinforcement Learning

目标对齐的强化学习

Leonard S. Pleiss, James Harrison, Maximilian Schiffer

发表机构 * Technical University of Munich（慕尼黑技术大学）

AI总结本文提出了一种目标对齐的强化学习方法，通过强调目标网络和在线网络估计高度一致的过渡，改进了传统深度强化学习算法的稳定性与收敛速度，实验证明在多个基准环境中取得了显著提升。

详情

AI中文摘要

许多基于价值的深度强化学习算法依赖于目标网络——在线网络的滞后副本——来稳定训练。虽然有效，但这种机制引入了一个基本的稳定性与新鲜度权衡：较慢的目标更新可以提高稳定性，但会降低学习信号的时效性，从而阻碍收敛速度。我们提出目标对齐的强化学习（TARL），这是一种简单的改进方法，适用于现有算法，强调目标网络和在线网络估计高度一致的过渡。通过将更新集中在良好对齐的目标上，TARL减轻了陈旧目标估计的负面影响，同时保留了目标网络的稳定作用。我们在离散和连续控制算法中，在各种基准环境中展示了持续的改进，无需任何超参数调整，包括在Atari-10上实现了38.18%的峰值得分提升，同时仅导致不到4%的实时时钟时间增加。

英文摘要

Many value-based deep reinforcement learning algorithms rely on target networks - lagged copies of the online network - to stabilize training. While effective, this mechanism introduces a fundamental stability-recency tradeoff: slower target updates improve stability but reduce the recency of learning signals, hindering convergence speed. We propose Target-Aligned Reinforcement Learning (TARL), a simple drop-in refinement for existing algorithms that emphasizes transitions for which the target and online network estimates are highly aligned. By focusing updates on well-aligned targets, TARL mitigates the adverse effects of stale target estimates while retaining the stabilizing benefits of target networks. We empirically demonstrate consistent improvements within discrete and continuous control algorithms across various benchmark environments without any hyperparameter tuning, including a 38.18% peak score gain on Atari-10, while incurring less than a 4% increase in wall-clock time.

URL PDF HTML ☆

赞 0 踩 0

2603.17839 2026-05-20 cs.CL cs.AI cs.LG 版本更新

How do LLMs Compute Verbal Confidence

LLMs如何计算言语自信

Dharshan Kumaran, Arthur Conmy, Federico Barbero, Simon Osindero, Viorica Patraucean, Petar Veličković

发表机构 * Google DeepMind（谷歌深Mind）

AI总结研究探讨了大型语言模型如何内部生成言语自信评分，通过实验发现自信评分在回答生成后被缓存并用于后续输出，揭示了模型自我评估的机制。

详情

AI中文摘要

言语自信——提示LLMs以数字或类别形式陈述其信心——被广泛用于从黑箱模型中提取不确定性估计。然而，LLMs内部如何生成此类评分仍不清楚。我们解答了两个问题：首先，信心是在被请求时即时计算，还是在生成答案时自动计算并缓存以供后续检索；其次，言语自信代表什么——token对数概率，还是更丰富的答案质量评估？我们聚焦于Gemma 3 27B（在TriviaQA、BigMath和MMLU上的表现）、Qwen 2.5 7B以及推理模型Magistral Small 24B，提供了缓存检索的收敛证据。激活引导、修补、噪声和交换实验揭示，信心表示在回答相邻位置先出现，再出现在言语化位置。注意力阻断指出了信息流：信心从回答token中收集，缓存于第一个回答后的位置，然后用于输出。关键发现是线性探测和方差划分揭示，这些缓存表示能够解释超出token对数概率的显著方差，表明是更丰富的答案质量评估，而非简单的流畅性读取。这些发现表明，言语自信反映了自动、复杂的自我评估——而非事后重建——对理解LLMs中的元认知和改进校准具有启示。

英文摘要

Verbal confidence -- prompting LLMs to state their confidence as a number or category -- is widely used to extract uncertainty estimates from black-box models. However, how LLMs internally generate such scores remains unknown. We address two questions: first, when confidence is computed -- just-in-time when requested, or automatically during answer generation and cached for later retrieval; and second, what verbal confidence represents -- token log-probabilities, or a richer evaluation of answer quality? Focusing on Gemma 3 27B (across TriviaQA, BigMath, and MMLU), Qwen 2.5 7B, and the reasoning model Magistral Small 24B, we provide convergent evidence for cached retrieval. Activation steering, patching, noising, and swap experiments reveal that confidence representations emerge at answer-adjacent positions before appearing at the verbalization site. Attention blocking pinpoints the information flow: confidence is gathered from answer tokens, cached at the first post-answer position, then retrieved for output. Critically, linear probing and variance partitioning reveal that these cached representations explain substantial variance in verbal confidence beyond token log-probabilities, suggesting a richer answer-quality evaluation rather than a simple fluency readout. These findings demonstrate that verbal confidence reflects automatic, sophisticated self-evaluation -- not post-hoc reconstruction -- with implications for understanding metacognition in LLMs and improving calibration.

URL PDF HTML ☆

赞 0 踩 0

2603.15411 2026-05-20 cs.AI cs.LG 版本更新

ViroGym: 用于评估病毒蛋白的现实大规模基准

Yichen Zhou, Jonathan Golob, Amir Karimi, Stefan Bauer, Patrick Schwab

发表机构 * GlaxoSmithKline（葛兰素史克）； Technical University of Munich（慕尼黑技术大学）； University of Washington（华盛顿大学）； KTH Royal Institute of Technology（皇家理工学院）

AI总结本文提出ViroGym，一个用于评估蛋白质语言模型在病毒蛋白上的表现的综合基准，通过三个任务评估pLMs：79个深入突变扫描实验、21个流感中和任务以及SARS-CoV-2的现实世界预测任务，发现ProGen2家族在所有任务中表现最佳。

详情

AI中文摘要

蛋白质语言模型（pLMs）在零样本预测错义变异效应方面显示出强大潜力，但对病毒蛋白的系统性基准评估仍然有限，这在需要提前预测新兴突变的工具方面是一个关键缺口。本文介绍ViroGym，一个全面的基准，评估pLMs在三个任务上的表现：79个覆盖真核病毒的深入突变扫描（DMS）实验，包含7个表型读数，552,065个突变序列；21个流感中和任务；以及SARS-CoV-2的现实世界大流行预测任务。我们对已建立的pLMs在适应度景观、抗原多样性及大流行预测任务上进行了基准测试，并发现ProGen2家族在所有三个任务中均表现最佳。关键的是，DMS和中和性能可靠地识别出能够泛化到现实世界突变的模型，即使它们所揭示的突变集几乎不重叠，这表明互补的体外基准能够捕捉到现实突变预测所需进化的约束条件。

英文摘要

Protein language models (pLMs) have shown strong potential for zero-shot prediction of missense variant effects, yet systematic benchmarking on viral proteins remains limited, a critical gap given the need for proactive tools that can anticipate emerging mutations ahead of experimental validation. Here we introduce ViroGym, a comprehensive benchmark evaluating pLMs across three tasks: 79 deep mutational scanning (DMS) assays covering eukaryotic viruses with 552,065 mutated sequences across 7 phenotypic readouts, 21 influenza neutralisation tasks, and a real-world pandemic prediction task for SARS-CoV-2. We benchmark well-established pLMs on fitness landscapes, antigenic diversity, and pandemic forecasting, and find that the ProGen2 family consistently achieves the strongest performance across all three tasks. Crucially, DMS and neutralisation performance reliably identifies models that generalise to real-world emergence, even though the mutation sets they surface barely overlap, revealing that complementary in vitro benchmarks capture the evolutionary constraints needed for real-world mutation forecasting.

URL PDF HTML ☆

赞 0 踩 0

2603.03140 2026-05-20 cs.HC cs.AI 版本更新

How to Model AI Agents as Personas?: Applying the Persona Ecosystem Playground to 41,300 Posts on Moltbook for Behavioral Insights

如何将AI代理建模为身份？：应用Persona生态系统游乐场分析Moltbook上的41,300条帖子以获取行为洞察

Danial Amin, Joni Salminen, Bernard J. Jansen

发表机构 * University of Vaasa（瓦萨大学）； Qatar Computing Research Institute, HBKU（卡塔尔计算研究所，HBKU）

AI总结本文通过Persona生态系统游乐场分析Moltbook上的41,300条帖子，利用k均值聚类和检索增强生成技术生成并验证对话身份，揭示了AI代理行为多样性。

详情

AI中文摘要

AI代理在社交媒体平台上日益活跃，生成内容并与彼此大规模互动。然而，这些代理的行为多样性仍缺乏深入理解，目前缺乏方法来表征不同的代理类型并研究它们如何与共享主题互动。我们应用Persona Ecosystem Playground (PEP) 到Moltbook，一个面向AI代理的社会平台，利用k均值聚类和检索增强生成技术从41,300条帖子中生成并验证对话身份。跨身份验证确认身份在语义上更接近其自身源簇而非其他簇（t(61) = 17.85, p < .001, d = 2.20；自身簇M = 0.71 vs. 其他簇M = 0.35）。这些身份随后部署在九轮结构化讨论中，模拟信息被显著高于随机水平归因于其源身份（二项式检验，p < .001）。结果表明，基于身份的生态系统建模可以代表AI代理群体的行为多样性。

英文摘要

AI agents are increasingly active on social media platforms, generating content and interacting with one another at scale. Yet the behavioral diversity of these agents remains poorly understood, and methods for characterizing distinct agent types and studying how they engage with shared topics are largely absent from current research. We apply the Persona Ecosystem Playground (PEP) to Moltbook, a social platform for AI agents, to generate and validate conversational personas from 41,300 posts using k-means clustering and retrieval-augmented generation. Cross-persona validation confirms that personas are semantically closer to their own source cluster than to others (t(61) = 17.85, p < .001, d = 2.20; own-cluster M = 0.71 vs. other-cluster M = 0.35). These personas are then deployed in a nine-turn structured discussion, and simulation messages were attributed to their source persona significantly above chance (binomial test, p < .001). The results indicate that persona-based ecosystem modeling can represent behavioral diversity in AI agent populations.

URL PDF HTML ☆

赞 0 踩 0

2603.00995 2026-05-20 physics.app-ph cs.AI eess.SP physics.optics 版本更新

3D aperture-engineered diffractive neural networks for super-resolution electromagnetic wave computing

3D孔径工程衍射神经网络用于超分辨率电磁波计算

Sheng Gao, Songtao Yang, Haiou Zhang, Yuan Shen, Xing Lin

发表机构 * Department of Electronic Engineering, Tsinghua University（清华大学电子工程系）； Beijing National Research Center for Information Science and Technology, Tsinghua University（北京信息科学与技术国家研究中心）

AI总结本文提出了一种3D孔径工程衍射神经网络（AE-DNN），通过将传统二维孔径扩展到三维，实现了超分辨率传感和计算，解决了现有架构在处理密集信号和干扰时受衍射限制的瓶颈问题。

Comments 37 pages, 13 figures, 3 tables

详情

AI中文摘要

6G通信和高带宽雷达的快速发展推动了信号源空间密度的急剧增加，导致电磁（EM）环境日益拥挤。在解析紧密间隔信号和干扰时，现有架构受到二维（2D）物理孔径固有衍射限制的严格约束，阻碍了复杂场景中的超分辨率传感和多干扰抑制。本文提出了一种3D孔径工程衍射神经网络（AE-DNN），通过将传统二维孔径扩展到三维，实现了超分辨率传感和计算。3D孔径工程框架通过构建深层级联超表面层，使得斜入射场的衍射传播可以逐层调制并分段编码，以感知远超物理孔径限制的电磁场。N层AE-DNN能够实现约N倍于传统二维孔径衍射限制的角分辨率。多维合成孔径（MSA）训练被开发用于实现光速相干合成的3D孔径，并整合基于神经网络的多维超表面调制建模。通过在模拟域中正交化阵列响应向量，AE-DNN可以并行执行多达10个独立相干或非相干源的超分辨率角度估计、源数估计和源分离。在36-41 GHz频段的实验结果表明，AE-DNN能够将紧密间隔的多干扰分辨率和抑制提升约20 dB，通信容量增强13.5倍，延迟降低三个数量级。AE-DNN标志着先进雷达和6G通信信号处理范式的转变。

英文摘要

The rapid progress in 6G communication and high-bandwidth radar has driven an unprecedented surge in the spatial density of signal sources, resulting in an increasingly congested electromagnetic (EM) environment. When resolving closely spaced signals and interference, existing architectures are strictly bounded by the inherent diffraction limits of two-dimensional (2D) physical apertures, hindering super-resolution sensing and multi-interference mitigation in complex scenarios. Here, we present a 3D aperture-engineered diffractive neural network (AE-DNN) that achieves super-resolution sensing and computing by extending the traditional 2D aperture into 3D. The 3D aperture engineering framework is realized by constructing deep cascaded metasurface layers so that the diffractive propagation from oblique incident fields can be layer-wise modulated and piecewise encoded for perceiving EM fields far exceeding physical aperture limits. The N-layer AE-DNN has the capability to achieve ~N times higher angular resolution than the 2D aperture diffraction limit. The multi-dimensional synthetic aperture (MSA) training is developed to achieve speed-of-light coherent synthesis of the 3D aperture and integrate neural network-based modeling of multi-dimensional metasurface modulation. By orthogonalizing array response vectors in the analog domain, AE-DNN performs parallel super-resolution angle estimation, source number estimation, and source separation for up to 10 independent coherent or incoherent sources. Experimental results across the 36-41 GHz band demonstrate that AE-DNN resolves and suppresses closely spaced multi-interference by ~20 dB, enhances communication capacity by 13.5X, and reduces latency by three orders of magnitude. AE-DNN heralds a paradigm shift in signal processing for advanced radar and 6G communications.

URL PDF HTML ☆

赞 0 踩 0

2602.23622 2026-05-20 cs.CV cs.AI 版本更新

DLEBench: Evaluating Small-scale Object Editing Ability for Instruction-based Image Editing Model

DLEBench: 评估基于指令的图像编辑模型在小规模物体编辑能力

Shibo Hong, Boxian Ai, Jun Kuang, Wei Wang, FengJiao Chen, Zhongyuan Peng, Chenhao Huang, Yixin Cao

发表机构 * College of Computer Science（计算机科学学院）； Artificial Intelligence（人工智能）； Fudan University（复旦大学）

AI总结本文提出DLEBench，首个专门评估基于指令的图像编辑模型在小规模物体编辑能力的基准，通过1889个样本覆盖复杂场景，揭示了现有模型在小物体编辑上的性能差距，强调了专用基准的重要性。

详情

AI中文摘要

在基于指令的图像编辑模型（IIEMs）领域已取得显著进展。然而，尽管这些模型在当前基准上表现出对指令的合理遵循和强大的推理能力，但它们在编辑小物体方面的能力仍缺乏深入探索，尽管这对精确局部编辑和生成图像中细节的细化至关重要。本文介绍了DeepLookEditBench（DLEBench），首个专门评估IIEMs在编辑小规模物体能力的基准。具体而言，我们构建了一个包含七个指令类型的挑战性测试平台，共1889个样本。在这些样本中，目标物体仅占图像面积的1%-10%，涵盖了部分遮挡和多物体编辑等复杂场景。为确保对本基准的稳健评估，我们提出了一种评估协议，包含细化的评分标准，以最小化在“指令遵循”和“视觉一致性”两个标准中的主观性和歧义性。该协议还引入了双模式评估框架（工具驱动模式和Oracle引导模式），以解决DLEBench中LMM-as-a-Judge与人类判断之间的不一致问题。在10个IIEMs上的实证结果揭示了小规模物体编辑上的显著性能差距，突显了专用基准在推动该能力发展方面的重要性。

英文摘要

Significant progress has been made in the field of Instruction-based Image Editing Models (IIEMs). However, while these models demonstrate plausible adherence to instructions and strong reasoning ability on current benchmarks, their ability to edit small objects remains underexplored, despite its importance for precise local editing and refining details in both real and generated images. In this paper, we introduce DeepLookEditBench (DLEBench), the first benchmark dedicated to assessing the abilities of IIEMs in editing small-scale objects. Specifically, we construct a challenging testbed comprising 1889 samples across seven instruction types. In these samples, target objects occupy only 1%-10% of the image area, covering complex scenarios such as partial occlusion and multi-object editing. To ensure robust evaluation on this benchmark, we propose an evaluation protocol with refined score rubrics to minimize subjectivity and ambiguity in two criteria: Instruction Following and Visual Consistency. This protocol also introduces a dual-mode evaluation framework (Tool-driven and Oracle-guided Modes) addressing the misalignment between LMM-as-a-Judge and human judgements on DLEBench. Empirical results on 10 IIEMs reveal significant performance gaps in small-scale object editing, highlighting the need for specialized benchmarks to advance this ability.

URL PDF HTML ☆

赞 0 踩 0

2602.17038 2026-05-20 cs.AI 版本更新

Phase-Aware Mixture of Experts for Agentic Reinforcement Learning

具有相意识的专家混合用于代理强化学习

Shengtian Yang, Yu Li, Shuo He, Yewen Li, Qingpeng Cai, Peng Jiang, Lei Feng

发表机构 * School of Computer Science and Engineering, Southeast University, Nanjing, China（东南大学计算机科学与工程学院，南京，中国）； College of Computing and Data Science, Nanyang Technological University, Singapore（南洋理工大学 computing and Data Science学院，新加坡）

AI总结本文提出了一种具有相意识的专家混合（PA-MoE），以解决传统专家混合（MoE）中由于token级路由导致的相一致模式碎片化问题，通过学习隐含的相边界来提升专家的专业性。

详情

AI中文摘要

强化学习（RL）已使LLM代理具备解决复杂任务的强大能力。然而，现有RL方法通常使用单一策略网络，导致简单任务占据大部分参数并主导梯度更新，从而为复杂任务留出不足的容量。一个可行的解决方案是在策略网络中采用专家混合（MoE）架构，因为MoE允许不同参数（专家）专门处理不同任务，防止简单任务主导所有参数。然而，传统MoE的一个关键限制是其token级路由，其中路由器将每个token分配给专门化的专家，这会将相一致的模式碎片化为分散的专家分配，从而削弱专家专业化。在本文中，我们提出了具有相意识的专家混合（PA-MoE）。它首先具有一个轻量级的相路由器，该路由器直接从RL目标中学习隐含的相边界，而无需预定义相类别。然后，相路由器将时间一致的分配分配给同一专家，使专家能够保留相特定的专业知识。实验结果展示了我们提出的PA-MoE的有效性。

英文摘要

Reinforcement learning (RL) has equipped LLM agents with a strong ability to solve complex tasks. However, existing RL methods normally use a \emph{single} policy network, causing \emph{simplicity bias} where simple tasks occupy most parameters and dominate gradient updates, leaving insufficient capacity for complex tasks. A plausible remedy could be employing the Mixture-of-Experts (MoE) architecture in the policy network, as MoE allows different parameters (experts) to specialize in different tasks, preventing simple tasks from dominating all parameters. However, a key limitation of traditional MoE is its token-level routing, where the router assigns each token to specialized experts, which fragments phase-consistent patterns into scattered expert assignments and thus undermines expert specialization. In this paper, we propose \textbf{Phase-Aware Mixture of Experts (PA-MoE)}. It first features a lightweight \emph{phase router} that learns latent phase boundaries directly from the RL objective without pre-defining phase categories. Then, the phase router allocates temporally consistent assignments to the same expert, allowing experts to preserve phase-specific expertise. Experimental results demonstrate the effectiveness of our proposed PA-MoE.

URL PDF HTML ☆

赞 0 踩 0

2602.13466 2026-05-20 cs.CL cs.AI cs.LG 版本更新

Language Model Memory and Memory Models for Language

语言模型记忆与记忆模型用于语言

Benjamin L. Badger

发表机构 * IBM（IBM公司）

AI总结研究探讨了语言模型和记忆模型在信息存储中的能力差异，发现语言模型的嵌入向量信息较少，而自编码器在输入再生训练中能形成接近完美的记忆，提出了一种可并行的编码器-解码器记忆模型架构，并通过结合因果和信息保留目标函数来提升记忆形成和解码能力。

详情

AI中文摘要

机器学习模型存储输入信息的能力，类似于“记忆”的概念，在隐藏层向量嵌入中被广泛使用但未充分表征。我们发现，无论数据和计算规模如何，语言模型嵌入通常包含相对较少的输入信息。相比之下，用于输入再生训练的自编码器嵌入能够形成几乎完美的记忆。用记忆嵌入替代令牌序列可带来显著的计算效率，从而引入一种可并行的编码器-解码器记忆模型架构。在因果训练后，这些模型包含信息贫乏的嵌入，无法进行任意信息访问，但通过结合因果和信息保留目标函数，它们学会形成和解码信息丰富的记忆。通过冻结高保真编码器并采用课程训练方法，解码器首先学习处理记忆，然后学习预测下一个令牌。我们引入了观点，即仅使用下一个令牌预测训练不足以准确形成记忆，因为目标本身不可逆，从而推动在输入不完全暴露的情况下使用结合目标函数的模型。

英文摘要

The ability of machine learning models to store input information in hidden layer vector embeddings, analogous to the concept of `memory', is widely employed but not well characterized. We find that language model embeddings typically contain relatively little input information regardless of data and compute scale during training. In contrast, embeddings from autoencoders trained for input regeneration are capable of nearly perfect memory formation. The substitution of memory embeddings for token sequences leads to substantial computational efficiencies, motivating the introduction of a parallelizable encoder-decoder memory model architecture. Upon causal training these models contain information-poor embeddings incapable of arbitrary information access, but by combining causal and information retention objective functions they learn to form and decode information-rich memories. Training can be further streamlined by freezing a high fidelity encoder followed by a curriculum training approach where decoders first learn to process memories and then learn to additionally predict next tokens. We introduce the perspective that next token prediction training alone is poorly suited for accurate memory formation as the objective itself is non-invertible, motivating the use of combined objective functions for models where the entire input is not exposed.

URL PDF HTML ☆

赞 0 踩 0

2602.11767 2026-05-20 cs.AI cs.CL cs.LG 版本更新

TSR: Trajectory-Search Rollouts for Multi-Turn RL of LLM Agents

TSR：用于LLM代理多轮RL的轨迹搜索

Aladin Djuhera, Swanand Ravindra Kadhe, Farhan Ahmed, Syed Zawad, Heiko Ludwig, Holger Boche

发表机构 * Technical University Munich（慕尼黑技术大学）； IBM Research（IBM研究院）

AI总结本文提出TSR，一种在训练时改进每轮轨迹生成的方法，通过轻量级树状搜索构造高质量轨迹，提升rollout质量和学习稳定性，适用于多轮RL任务。

详情

AI中文摘要

大规模语言模型（LLMs）的进步正在推动使用强化学习（RL）来训练代理，从跨任务的迭代、多轮交互中学习。然而，多轮RL仍然具有挑战性，因为奖励通常稀疏或延迟，而环境可能是随机的。在这种情况下，朴素的轨迹采样会阻碍利用并导致模式崩溃。我们提出了TSR（轨迹搜索rollouts），一种训练时的方法，重新利用测试时扩展的想法以改进每轮rollout生成。TSR通过基于状态的反馈在每个回合中选择高分动作，进行轻量级树状搜索来构造高质量轨迹。这提高了rollout质量并稳定了学习，同时与标准策略梯度优化器兼容，使TSR对优化器无偏见。我们用best-of-N、beam和浅层前瞻搜索实例化TSR，并与PPO和GRPO配对，在Sokoban、FrozenLake和WebShop任务中实现高达15%的性能提升和更稳定的训练，仅需适度增加一次训练计算。通过将搜索从推理时间转移到训练的rollout阶段，TSR提供了一种模块化且通用的机制，用于更强的多轮代理学习，与现有框架和拒绝采样式选择方法互补。

英文摘要

Advances in large language models (LLMs) are driving a shift toward using reinforcement learning (RL) to train agents from iterative, multi-turn interactions across tasks. However, multi-turn RL remains challenging as rewards are often sparse or delayed, and environments can be stochastic. In this regime, naive trajectory sampling can hinder exploitation and induce mode collapse. We propose TSR (Trajectory-Search Rollouts), a training-time approach that repurposes test-time scaling ideas for improved per-turn rollout generation. TSR performs lightweight tree-style search to construct high-quality trajectories by selecting high-scoring actions at each turn using state-based feedback. This improves rollout quality and stabilizes learning while remaining compatible with standard policy gradient optimizers, making TSR optimizer-agnostic. We instantiate TSR with best-of-N, beam, and shallow lookahead search, and pair it with PPO and GRPO, achieving up to 15% performance gains and more stable learning on Sokoban, FrozenLake, and WebShop tasks at a modest, one-time increase in training compute. By moving search from inference time to the rollout stage of training, TSR provides a modular and general mechanism for stronger multi-turn agent learning, complementary to existing frameworks and rejection-sampling-style selection methods.

URL PDF HTML ☆

赞 0 踩 0

2602.07570 2026-05-20 q-bio.NC cs.AI cs.CV cs.LG 版本更新

How does longer temporal context enhance multimodal narrative video processing in the brain?

更长的时间上下文如何增强大脑对多模态叙事视频的处理？

Prachi Jindal, Anant Khandelwal, Manish Gupta, Bapi S. Raju, Subba Reddy Oota, Tanmoy Chakraborty

发表机构 * Technische Universität Berlin（柏林技术大学）； Microsoft Research（微软研究院）； IIT Delhi（德里理工学院）； Microsoft（微软）； IIIT-Hyderabad（海得拉巴理工学院）

AI总结本研究探讨了视频片段时长和叙事任务提示如何影响自然电影观看过程中大脑模型对多模态大语言模型（MLLMs）的对齐情况，发现增加片段持续时间显著提高了大脑对齐程度，而单模态视频模型则无明显提升。

Comments 22 pages, 15 figures

详情

AI中文摘要

理解人类和人工智能系统如何处理复杂的叙事视频是一个在神经科学和机器学习交汇处的基本挑战。本研究调查了视频片段的时间上下文长度（3-24秒片段）和叙事任务提示如何影响自然电影观看过程中大脑模型的对齐情况。利用受试者观看完整电影的fMRI记录，我们研究了对叙事上下文敏感的大脑区域如何在不同时间尺度上动态表示信息，以及这些神经模式如何与模型派生的特征对齐。我们发现，增加片段持续时间显著提高了多模态大语言模型（MLLMs）的大脑对齐程度，而单模态视频模型则几乎没有提升。进一步地，较短的时间窗口与感知和早期语言区域对齐，而较长的窗口则更倾向于与更高阶整合区域对齐，这在MLLMs中表现为层到皮层的层次结构。最后，使用四个叙事任务提示的实验显示，这些提示会引发任务特定、区域依赖性的大脑对齐模式，并在更高阶区域引起上下文依赖的片段级调谐变化。我们的工作将长篇叙事电影定位为研究长时间尺度时间整合在长上下文MLLMs中的原理性测试平台，以及其与叙事理解过程中皮层响应关系的桥梁。

英文摘要

Understanding how humans and artificial intelligence systems process complex narrative videos is a fundamental challenge at the intersection of neuroscience and machine learning. This study investigates how the temporal context length of video clips (3--24 s clips) and the narrative-task prompting shape brain-model alignment during naturalistic movie watching. Using fMRI recordings from participants viewing full-length movies, we examine how brain regions sensitive to narrative context dynamically represent information over varying timescales and how these neural patterns align with model-derived features. We find that increasing clip duration substantially improves brain alignment for multimodal large language models (MLLMs), whereas unimodal video models show little to no gain. Further, shorter temporal windows align with perceptual and early language regions, while longer windows preferentially align higher-order integrative regions, mirrored by a layer-to-cortex hierarchy in MLLMs. Finally, experiments with four narrative-task prompts show that they elicit task-specific, region-dependent brain alignment patterns and context-dependent shifts in clip-level tuning in higher-order regions. Our work positions long-form narrative movies as a principled testbed for studying long-timescale temporal integration in long-context MLLMs and its relationship to cortical responses during narrative comprehension.

URL PDF HTML ☆

赞 0 踩 0

2602.05709 2026-05-20 cs.AI 版本更新

迈向基于知识的工作流：一种语义方法用于机械和热力学性质的原子模拟

Abril Azocar Guzman, Hoang-Thien Luu, Sarath Menon, Tilmann Hickel, Nina Merkert, Stefan Sandfeld

发表机构 * Institute for Advanced Simulations – Materials Data Science and Informatics (IAS‑9)（先进模拟研究所——材料数据科学与信息学研究所）； Forschungszentrum Jülich GmbH（吕贝克研究所）； Institute of Metallurgy - Chair of Computational Material Sciences/Engineering（冶金研究所——计算材料科学与工程教授席位）； Technical University of Clausthal（克莱斯特哈尔技术大学）； Interdisciplinary Centre for Advanced Materials Simulation (ICAMS)（先进材料模拟跨学科中心）； Ruhr-Universität Bochum（博德姆鲁尔-大学）； Bundesanstalt für Materialforschung und -prüfung（联邦材料研究所）

AI总结本文提出了一种基于语义的方法，通过可重用的原子工作流来模拟机械和热力学性质，该方法整合了与应用本体对齐的元数据注释，从而实现自动溯源和符合FAIR原则的数据输出，展示了工作流在结构-性质关系验证中的有效性，如Hall-Petch效应，并展示了其在不同原子势能和材料中的可重用性。

详情

DOI: 10.1002/adem.70869

AI中文摘要

机械和热力学性质，包括晶体缺陷的影响，对于评估工程应用中的材料至关重要。分子动力学模拟在原子尺度上提供了这些机制的有价值见解。然而，当前的做法通常依赖于碎片化的脚本，具有不一致的元数据和有限的溯源性，这阻碍了可重复性、互操作性和重用性。FAIR数据原则和基于工作流的方法为解决这些限制提供了路径。我们提出了可重用的原子工作流，这些工作流整合了与应用本体对齐的元数据注释，从而实现自动溯源和符合FAIR原则的数据输出。这些工作流涵盖了关键的机械和热力学量，包括状态方程、弹性张量、机械加载、热性质、缺陷形成能和纳米压痕。我们展示了结构-性质关系的验证，如Hall-Petch效应，并表明这些工作流可以在不同的原子势能和材料之间在一致的语义框架内重用。该方法提供了AI准备的模拟数据，支持新兴的代理AI工作流，并建立了一个通用的知识驱动机械和热力学模拟的蓝图。

英文摘要

Mechanical and thermodynamic properties, including the influence of crystal defects, are critical for evaluating materials in engineering applications. Molecular dynamics simulations provide valuable insight into these mechanisms at the atomic scale. However, current practice often relies on fragmented scripts with inconsistent metadata and limited provenance, which hinders reproducibility, interoperability, and reuse. FAIR data principles and workflow-based approaches offer a path to address these limitations. We present reusable atomistic workflows that incorporate metadata annotation aligned with application ontologies, enabling automatic provenance capture and FAIR-compliant data outputs. The workflows cover key mechanical and thermodynamic quantities, including equation of state, elastic tensors, mechanical loading, thermal properties, defect formation energies, and nanoindentation. We demonstrate validation of structure-property relations such as the Hall-Petch effect and show that the workflows can be reused across different interatomic potentials and materials within a coherent semantic framework. The approach provides AI-ready simulation data, supports emerging agentic AI workflows, and establishes a generalizable blueprint for knowledge-based mechanical and thermodynamic simulations.

URL PDF HTML ☆

赞 0 踩 0

2601.22569 2026-05-20 cs.CR cs.AI 版本更新

Whispers of Wealth: Red-Teaming Google's Agent Payments Protocol via Prompt Injection

财富低语：通过提示注入对谷歌的Agent支付协议进行红队测试

Tanusree Debi, Wentian Zhu, Pranjol Sen Gupta

发表机构 * School of Computing University of Georgia Athens, USA（佐治亚大学计算机学院美国亚特兰大）； Department of Information Technology Kennesaw State University Kennesaw, USA（凯斯韦尔州立大学信息科技学院美国凯斯韦尔）

AI总结本文通过红队测试评估了谷歌的Agent支付协议，揭示了由于间接和直接提示注入导致的漏洞，并提出了两种攻击技术，即品牌低语攻击和宝库低语攻击，以操纵产品排名并提取敏感用户数据。

详情

AI中文摘要

基于大型语言模型（LLM）的代理越来越多地用于自动化金融交易，但其对上下文推理的依赖使支付系统容易受到提示驱动的操纵。Agent支付协议（AP2）旨在通过密码学可验证的指令来安全地进行代理主导的购买，但其实际的鲁棒性仍缺乏探索。在本文中，我们对AP2进行了人工智能红队评估，并识别出由于间接和直接提示注入产生的漏洞。我们引入了两种攻击技术，即品牌低语攻击和宝库低语攻击，以操纵产品排名并提取敏感用户数据。使用基于Gemini-2.5-Flash和Google ADK框架构建的功能性AP2购物代理，我们实验性地验证了简单的对抗性提示可以可靠地颠覆代理行为。我们的发现揭示了当前代理支付架构中的关键弱点，并突显了在LLM介导的金融系统中需要更强的隔离和防御措施的必要性。

英文摘要

Large language model (LLM) based agents are increasingly used to automate financial transactions, yet their reliance on contextual reasoning exposes payment systems to prompt-driven manipulation. The Agent Payments Protocol (AP2) aims to secure agent-led purchases through cryptographically verifiable mandates, but its practical robustness remains underexplored. In this work, we perform an AI red-teaming evaluation of AP2 and identify vulnerabilities arising from indirect and direct prompt injection. We introduce two attack techniques, the Branded Whisper Attack and the Vault Whisper Attack which manipulate product ranking and extract sensitive user data. Using a functional AP2 based shopping agent built with Gemini-2.5-Flash and the Google ADK framework, we experimentally validate that simple adversarial prompts can reliably subvert agent behavior. Our findings reveal critical weaknesses in current agentic payment architectures and highlight the need for stronger isolation and defensive safeguards in LLM-mediated financial systems.

URL PDF HTML ☆

赞 0 踩 0

2601.20309 2026-05-20 cs.DC cs.AI cs.LG 版本更新

SuperInfer: SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips

SuperInfer: 面向Superchips的SLO感知旋转调度与内存管理技术

Jiahuan Yu, Mingtao Hu, Zichao Lin, Minjia Zhang

发表机构 * Supercomputing-System-AI-Lab（超级计算系统人工智能实验室）

AI总结针对LLM服务中严格延迟SLO与有限GPU内存容量之间的矛盾，SuperInfer提出了一种面向新兴Superchips的高性能LLM推理系统，通过NVLink-C2C实现紧密耦合的GPU-CPU架构，引入SLO感知的旋转调度器RotaSched和优化的旋转引擎DuplexKV，显著提升了TTFT SLO达成率。

Comments Accepted by MLSys '26

详情

AI中文摘要

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving. Code is available in https://github.com/Supercomputing-System-AI-Lab/SuperInfer.

英文摘要

Large Language Model (LLM) serving faces a fundamental tension between stringent latency Service Level Objectives (SLOs) and limited GPU memory capacity. When high request rates exhaust the KV cache budget, existing LLM inference systems often suffer severe head-of-line (HOL) blocking. While prior work explored PCIe-based offloading, these approaches cannot sustain responsiveness under high request rates, often failing to meet tight Time-To-First-Token (TTFT) and Time-Between-Tokens (TBT) SLOs. We present SuperInfer, a high-performance LLM inference system designed for emerging Superchips (e.g., NVIDIA GH200) with tightly coupled GPU-CPU architecture via NVLink-C2C. SuperInfer introduces RotaSched, the first proactive, SLO-aware rotary scheduler that rotates requests to maintain responsiveness on Superchips, and DuplexKV, an optimized rotation engine that enables full-duplex transfer over NVLink-C2C. Evaluations on GH200 using various models and datasets show that SuperInfer improves TTFT SLO attainment rates by up to 74.7% while maintaining comparable TBT and throughput compared to state-of-the-art systems, demonstrating that SLO-aware scheduling and memory co-design unlocks the full potential of Superchips for responsive LLM serving. Code is available in https://github.com/Supercomputing-System-AI-Lab/SuperInfer.

URL PDF HTML ☆

赞 0 踩 0

2601.18993 2026-05-20 cs.CV cs.AI cs.GR 版本更新

FreeOrbit4D: Training-Free Arbitrary Camera Redirection for Monocular Videos via Foreground-Complete 4D Reconstruction

FreeOrbit4D: 通过前景完整4D重建实现免训练的任意相机重定向

Wei Cao, Hao Zhang, Fengrui Tian, Yulun Wu, Yingying Li, Shenlong Wang, Ning Yu, Yaoyao Liu

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； University of Pennsylvania（宾夕法尼亚大学）； Eyeline Labs（Eyeline实验室）

AI总结本文提出FreeOrbit4D，一种无需训练的框架，通过恢复完整的前景4D代理来解决大角度重定向中的几何模糊问题，从而生成更真实且时间一致的视频。

Comments 12 pages, 10 figures. Accepted to SIGGRAPH Conference Papers 2026

详情

DOI: 10.1145/3799902.3811122

AI中文摘要

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective 免训练 framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

英文摘要

Camera redirection aims to replay a dynamic scene from a single monocular video under a user-specified camera trajectory. However, large-angle redirection is inherently ill-posed: a monocular video captures only a narrow spatio-temporal view of a dynamic 3D scene, providing severely limited observations of the underlying 4D world. The key challenge is therefore to recover a complete and coherent representation from this limited input, with consistent geometry and motion. While recent diffusion-based methods achieve impressive visual generation quality, they often break down under large-angle viewpoint changes far from the original trajectory, where missing visual grounding leads to severe geometric ambiguity and temporal inconsistency. We present FreeOrbit4D, an effective training-free framework that tackles this ambiguity by recovering a foreground-complete 4D proxy as structural grounding for video generation. We obtain this proxy by decoupling foreground and background reconstructions: we unproject the monocular video into a static background and partial foreground point clouds in a unified global space, then use an object-centric multi-view diffusion model to synthesize multi-view images and reconstruct complete foreground point clouds in canonical object space. By aligning the canonical foreground point cloud to the global scene space via dense pixel-synchronized 3D-3D correspondences and projecting the foreground-complete 4D proxy onto target camera viewpoints, we provide geometric scaffolds that guide a conditional video diffusion model. Extensive experiments show that FreeOrbit4D produces more faithful and temporally coherent redirected videos under challenging large-angle trajectories, and our proxy further enables applications such as edit propagation and 4D data generation. Project page: https://freeorbit4d.vision.ischool.illinois.edu/

URL PDF HTML ☆

赞 0 踩 0

2601.16823 2026-05-20 cs.CL cs.AI 版本更新

Disentangling generalization and memorization in large language models using chess

通过国际象棋解构大型语言模型中的泛化与记忆

Leonard S. Pleiss, Maximilian Schiffer, Robert K. von Weizsaecker

发表机构 * Technical University of Munich（慕尼黑技术大学）

AI总结本文通过国际象棋测试环境，研究大型语言模型中泛化与记忆能力的区别，发现模型在相关先验知识稀疏时性能显著下降，表明系统泛化能力有限，需超越规模的机制来实现鲁棒性。

详情

AI中文摘要

大型语言模型（LLMs）展现出显著的能力，但其能力在多大程度上反映的是复杂的记忆还是真正的推理能力仍不明确。我们引入国际象棋作为受控测试环境，旨在区分这些能力。利用游戏的结构和可扩展的引擎评估，我们构建了一个位置分类学，这些位置在相关先验知识的密度上变化较大，从可以通过记忆解决的常见状态到完全新颖需要泛化的状态。关键的是，我们的方法在不需要显式了解模型训练数据的情况下实现了这一区分。应用此分类学，我们结合了GPT系列的纵向分析和对现代模型的严格评估，包括Claude Opus和Gemini。我们的分析揭示了一个陡峭的梯度：随着相关先验知识密度的降低，性能持续下降。值得注意的是，在相关先验知识较少的任务中，基础模型性能回归到随机下棋的基线。虽然新模型有所改进，但在先验知识稀疏的任务中，进步显著放缓。此外，虽然推理增强的推理提高性能，但在没有相关先验知识的情况下，每token的相对边际收益减少。这些结果表明系统泛化能力有限，强调了在缺乏相关先验知识时，需要超越规模的机制来实现鲁棒性能。

英文摘要

Large Language Models (LLMs) exhibit remarkable capabilities, yet it remains unclear to what extent these reflect sophisticated recall or genuine reasoning ability. We introduce chess as a controlled testbed aimed at disentangling these faculties. Leveraging the game's structure and scalable engine evaluations, we construct a taxonomy of positions varying in density of relevant priors - ranging from common states solvable by memorization to completely novel ones requiring generalization. Crucially, our approach achieves this distinction without requiring explicit knowledge of the models' training data. Applying this taxonomy, we combine a longitudinal analysis of the GPT lineage with a rigorous evaluation of contemporary models, including Claude Opus and Gemini. Our analysis reveals a steep gradient: performance consistently degrades as the density of relevant priors decreases. Notably, for tasks with few relevant priors, base model performance regresses to the random-play baseline. While newer models improve, progress slows significantly for tasks with sparse priors. Furthermore, while reasoning-augmented inference improves performance, its relative marginal benefit per token decreases in the absence of relevant priors. These results suggest limitations in systematic generalization, highlighting the need for mechanisms beyond scale to achieve robust performance when deprived of relevant priors.

URL PDF HTML ☆

赞 0 踩 0

2601.14822 2026-05-20 cs.CV cs.AI 版本更新

Multimodal system for skin cancer detection

多模态皮肤癌检测系统

Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko

AI总结本文提出一种多模态皮肤癌检测系统，结合传统照片图像与表格型元数据（如患者人口统计数据和病变特征），通过多模态神经网络和两阶段模型提升检测准确率，并通过三阶段流程进一步优化预测，最终在不平衡数据集上实现显著性能提升。

Comments Accepted to System research and information technologies

Journal ref System Research and Information Technologies, no. 1, pp. 33-57, 2026

详情

DOI: 10.20535/SRIT.2308-8893.2026.1.03

AI中文摘要

皮肤癌检测对于早期诊断和有效治疗至关重要。尽管基于dermoscopic图像的深度学习模型已显示出潜力，但它们需要专门的设备，限制了其在更广泛临床环境中的应用。本研究介绍了一种使用传统照片图像的多模态皮肤癌检测系统，使其更具可访问性和适应性。我们的系统整合图像数据与表格型元数据，如患者人口统计数据和病变特征，以提高检测准确性。它采用结合图像和元数据处理的多模态神经网络，并支持有或无元数据的两阶段模型。一个三阶段流程进一步通过提升算法和增强性能来优化预测。为解决高度不平衡数据集的挑战，实施了特定技术以确保稳健的训练。通过消融研究评估了最近的视觉架构、提升算法和损失函数，实现了峰值部分ROC AUC为0.18068（0.2最大）和前15检索灵敏度为0.78371。结果表明，通过结构化、多阶段的图像与元数据整合流程，实现了显著的性能提升。该系统通过提供一个可扩展、设备无关的解决方案，推进了皮肤癌检测，适用于多样化的医疗环境，弥合了专业与一般临床实践之间的差距。

英文摘要

Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.

URL PDF HTML ☆

赞 0 踩 0

2601.14234 2026-05-20 cs.LG cs.AI cs.RO stat.ML 版本更新

Q-learning with Adjoint Matching

具有伴随匹配的Q学习

Qiyang Li, Sergey Levine

发表机构 * UC Berkeley（加州大学伯克利分校）

AI总结本文提出了一种基于时序差分的强化学习算法QAM，解决了连续动作强化学习中的长期挑战：高效优化表达性强的扩散或流匹配策略相对于参数化的Q函数。通过利用批评者的首阶信息进行有效优化，但直接通过反向传播其多步去噪过程进行梯度优化在数值上不稳定。现有方法通过仅使用价值和丢弃梯度信息或依赖近似方法牺牲策略的表达性或偏置学习策略。QAM通过利用生成建模中最近提出的技术伴随匹配，将批评者的动作梯度转换为逐步目标函数，避免了不稳定反向传播，同时在最优时提供无偏且表达性强的策略。结合时序差分备份进行批评者学习，QAM在离线和离线到在线强化学习的硬稀疏奖励任务中一致优于先前方法。

Comments 32 pages, 8 figures, 7 tables

详情

AI中文摘要

我们提出QAM，一种新颖的基于时序差分的强化学习（RL）算法，解决了连续动作RL中长期存在的挑战：高效优化表达性强的扩散或流匹配策略相对于参数化的Q函数。有效的优化需要利用批评者的首阶信息，但通过反向传播其多步去噪过程进行直接梯度优化在数值上不稳定。现有方法通过仅使用价值和丢弃梯度信息或依赖近似方法牺牲策略的表达性或偏置学习策略。QAM通过利用生成建模中最近提出的技术伴随匹配，将批评者的动作梯度转换为逐步目标函数，避免了不稳定反向传播，同时在最优时提供无偏且表达性强的策略。结合时序差分备份进行批评者学习，QAM在离线和离线到在线RL的硬稀疏奖励任务中一致优于先前方法。

英文摘要

We propose Q-learning with Adjoint Matching (QAM), a novel TD-based reinforcement learning (RL) algorithm that tackles a long-standing challenge in continuous-action RL: efficient optimization of an expressive diffusion or flow-matching policy with respect to a parameterized Q-function. Effective optimization requires exploiting the first-order information of the critic, but it is challenging to do so for flow or diffusion policies because direct gradient-based optimization via backpropagation through their multi-step denoising process is numerically unstable. Existing methods work around this either by only using the value and discarding the gradient information, or by relying on approximations that sacrifice policy expressivity or bias the learned policy. QAM sidesteps both of these challenges by leveraging adjoint matching, a recently proposed technique in generative modeling, which transforms the critic's action gradient to form a step-wise objective function that is free from unstable backpropagation, while providing an unbiased, expressive policy at the optimum. Combined with temporal-difference backup for critic learning, QAM consistently outperforms prior approaches on hard, sparse reward tasks in both offline and offline-to-online RL.

URL PDF HTML ☆

赞 0 踩 0

2601.05437 2026-05-20 cs.CL cs.AI 版本更新

Tracing Moral Foundations in Large Language Models

在大型语言模型中追溯道德基础

Chenxiao Yu, Bowen Yi, Farzan Karimi-Malekabadi, Suhaib Abdurahman, Jinyi Ye, Shrikanth Narayanan, Yue Zhao, Morteza Dehghani

发表机构 * Department of Computer Science, University of Southern California（南加州大学计算机科学系）； Department of Psychology, University of Southern California（南加州大学心理学系）； Center for Computational Language Sciences, University of Southern California（南加州大学计算语言科学中心）

AI总结本文研究了大型语言模型中道德基础的编码、组织和表达，通过多层方法分析道德基础与人类道德感知的一致性，并发现道德结构在预训练和微调过程中自然形成，且部分解耦。

详情

AI中文摘要

大型语言模型常常产生类似人类的道德判断，但不清楚这种表现是内部概念结构还是表面的'道德模仿'。使用道德基础理论（MFT）作为分析框架，我们研究了14个基础和指令微调的LLM在四个模型家族（Llama、Qwen2.5、Qwen3-MoE、Mistral）和从7B到70B的不同规模上如何编码、组织和表达道德基础。我们采用多级方法结合（i）逐层分析MFT概念表示及其与人类道德感知的一致性，（ii）在残差流上预训练稀疏自编码器（SAEs）以识别支持道德概念的稀疏特征，以及（iii）使用密集MFT向量和稀疏SAE特征进行因果引导干预。我们发现模型在表示和区分道德基础方面与人类判断一致，且这种道德几何结构自然从预训练中产生，并在微调中被选择性重 wiring。在更细的尺度上，SAE特征显示出与特定基础的明确语义联系，表明在共享表示中存在部分解耦的机制。最后，沿着密集向量或稀疏特征引导会产生可预测的在基础相关行为上的变化，证明了内部表示与道德输出之间的因果联系。共同，我们的结果提供了机械证据，表明LLM中的道德概念是分布的、分层的且部分解耦的，暗示了多元道德结构可以从语言的统计规律中作为潜在模式出现。

英文摘要

Large language models often produce human-like moral judgments, but it is unclear whether this reflects an internal conceptual structure or superficial ``moral mimicry.'' Using Moral Foundations Theory (MFT) as an analytic framework, we study how moral foundations are encoded, organized, and expressed across 14 base and instruction-tuned LLMs spanning four model families (Llama, Qwen2.5, Qwen3-MoE, Mistral) and scales from 7B to 70B. We employ a multi-level approach combining (i) layer-wise analysis of MFT concept representations and their alignment with human moral perceptions, (ii) pretrained sparse autoencoders (SAEs) over the residual stream to identify sparse features that support moral concepts, and (iii) causal steering interventions using dense MFT vectors and sparse SAE features. We find that models represent and distinguish moral foundations in a manner that aligns with human judgments, and that this moral geometry naturally emerges from pretraining and is selectively rewired by post-training. At a finer scale, SAE features show clear semantic links to specific foundations, suggesting partially disentangled mechanisms within shared representations. Finally, steering along either dense vectors or sparse features produces predictable shifts in foundation-relevant behavior, demonstrating a causal connection between internal representations and moral outputs. Together, our results provide mechanistic evidence that moral concepts in LLMs are distributed, layered, and partly disentangled, suggesting that pluralistic moral structure can emerge as a latent pattern from the statistical regularities of language alone.

URL PDF HTML ☆

赞 0 踩 0

2512.24470 2026-05-20 cs.RO cs.AI 版本更新

Foundation models on the bridge: Semantic hazard detection and safety maneuvers for maritime autonomy with vision-language models

桥梁上的基础模型：基于视觉-语言模型的语义危险检测与安全操作用于海上自主性

Kim Alexander Christensen, Andreas Gudahl Tufte, Alexey Gusev, Rohan Sinha, Milan Ganai, Ole Andreas Alsos, Marco Pavone, Martin Steinert

发表机构 * Dept. of Mechanical and Industrial Engineering, NTNU（机械与工业工程系，挪威科技大学）； Dept. of Aeronautics and Astronautics, Stanford University（航空航天工程系，斯坦福大学）； Dept. of Computer Science, Stanford University（计算机科学系，斯坦福大学）； NVIDIA Research（NVIDIA研究）

AI总结本文提出了一种基于视觉-语言模型的语义危险检测与安全操作方法，用于满足IMO草案MASS代码对海上自主船舶的要求，通过快速-慢速异常管道和短时间范围的人类可覆盖回退操作来实现，在40个港口场景中验证了该方法的性能。

Comments 17 pages without bibliography or appendix. The main paper has 16 figures. Paper webpage can be found at https://kimachristensen.github.io/bridge_policy/

Journal ref Ocean Engineering 359, Part 3 (2026), Article 124646

详情

DOI: 10.1016/j.oceaneng.2026.124646

AI中文摘要

草案IMO MASS代码要求自主和远程监督的海事船舶检测其操作设计领域偏离，进入预定义的回退模式以通知操作员，允许立即的人类接管，并避免在未经批准的情况下更改航行计划。在警报到接管的间隙中满足这些义务需要一个短时间范围、可人类接管的回退操作。传统的海事自主堆栈在正确行动依赖于意义（例如，潜水员旗表示水中的人员，附近有火表示危险）时会遇到困难。我们主张（i）视觉-语言模型（VLMs）为这些分布外情况提供语义意识，（ii）一个快速-慢速异常管道，带有短时间范围、可人类接管的回退操作，使在交接窗口内实现这一目标成为可能。我们引入了Semantic Lookout，一种仅使用摄像头、候选约束的VLM回退操作选择器，它在连续人类授权下，从水有效、世界锚定的轨迹中选择一个谨慎的操作（或站守）。在40个港口场景中，我们测量了每调用场景的理解和延迟，与人类共识（模型多数三票投票）的一致性，短时间范围在火险场景中的风险缓解，以及在水上的警报->回退操作->操作员交接。子10秒的模型保留了较慢的最新模型大部分的意识。回退操作选择器在火险场景中比仅基于几何的基线表现更好，并增加了 standoff 距离。一次现场运行验证了端到端的操作。这些结果支持VLMs作为符合草案IMO MASS代码的语义回退操作选择器，适用于实际延迟预算，并激励未来工作，研究适应领域、混合自主性，将基础模型语义与多传感器鸟瞰感知和短时间范围重新规划相结合。网站：kimachristensen.github.io/bridge_policy

英文摘要

The draft IMO MASS Code requires autonomous and remotely supervised maritime vessels to detect departures from their operational design domain, enter a predefined fallback that notifies the operator, permit immediate human override, and avoid changing the voyage plan without approval. Meeting these obligations in the alert-to-takeover gap calls for a short-horizon, human-overridable fallback maneuver. Classical maritime autonomy stacks struggle when the correct action depends on meaning (e.g., diver-down flag means people in the water, fire close by means hazard). We argue (i) that vision-language models (VLMs) provide semantic awareness for such out-of-distribution situations, and (ii) that a fast-slow anomaly pipeline with a short-horizon, human-overridable fallback maneuver makes this practical in the handover window. We introduce Semantic Lookout, a camera-only, candidate-constrained VLM fallback maneuver selector that selects one cautious action (or station-keeping) from water-valid, world-anchored trajectories under continuous human authority. On 40 harbor scenes we measure per-call scene understanding and latency, alignment with human consensus (model majority-of-three voting), short-horizon risk-relief on fire hazard scenes, and an on-water alert->fallback maneuver->operator handover. Sub-10 s models retain most of the awareness of slower state-of-the-art models. The fallback maneuver selector outperforms geometry-only baselines and increases standoff distance on fire scenes. A field run verifies end-to-end operation. These results support VLMs as semantic fallback maneuver selectors compatible with the draft IMO MASS Code, within practical latency budgets, and motivate future work on domain-adapted, hybrid autonomy that pairs foundation-model semantics with multi-sensor bird's-eye-view perception and short-horizon replanning. Website: kimachristensen.github.io/bridge_policy

URL PDF HTML ☆

赞 0 踩 0

2512.23461 2026-05-20 cs.LG cs.AI 版本更新

Eliminating Inductive Bias in Reward Models with Information-Theoretic Guidance

通过信息论指导消除奖励模型中的归纳偏置

Zhuo Li, Pengyu Cheng, Zhechao Yu, Feifei Tong, Anningzhe Gao, Tsung-Hui Chang, Xiang Wan, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang

发表机构 * Qwen Large Model Application Team, Alibaba（阿里巴巴大模型应用团队）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））； Shenzhen Research Institute of Big Data（深圳大数据研究院）

AI总结本文提出了一种基于信息论的奖励模型去偏方法DIR，通过最大化奖励模型评分与人类偏好对之间的互信息，同时最小化奖励模型输出与偏好输入偏置属性之间的互信息，从而有效缓解归纳偏置问题并提升RLHF性能。

Comments Published as a conference paper at The International Conference on Learning Representations (ICLR) 2026

详情

AI中文摘要

奖励模型（RMs）在人类反馈的强化学习（RLHF）中至关重要，用于将大型语言模型（LLMs）对齐于人类价值观。然而，RM训练数据通常被认为是低质量的，包含可能导致过拟合和奖励黑客的归纳偏置。例如，更详细和全面的响应通常更受人类青睐，但包含更多单词，导致响应长度成为不可避免的归纳偏置之一。有限的先前RM去偏方法要么针对单一特定类型的偏置，要么仅用简单的线性相关性建模，例如皮尔逊系数。为缓解奖励建模中更复杂和多样的归纳偏置，我们引入了一种新的信息论去偏方法，称为通过信息优化的奖励模型去偏（DIR）。受信息瓶颈（IB）的启发，我们最大化奖励模型评分与人类偏好对之间的互信息（MI），同时最小化奖励模型输出与偏好输入偏置属性之间的互信息。从信息论的理论依据出发，DIR能够处理更复杂的偏置类型，具有非线性相关性，从而广泛扩展了RM去偏方法在现实世界中的应用场景。在实验中，我们验证了DIR在三种归纳偏置类型（响应长度、奉承和格式）上的有效性。我们发现，DIR不仅有效缓解了目标归纳偏置，还通过多样化的基准测试提升了RLHF性能，展现出更好的泛化能力。代码和训练配方可在https://github.com/Qwen-Applications/DIR获取。

英文摘要

Reward models (RMs) are essential in reinforcement learning from human feedback (RLHF) to align large language models (LLMs) with human values. However, RM training data is commonly recognized as low-quality, containing inductive biases that can easily lead to overfitting and reward hacking. For example, more detailed and comprehensive responses are usually human-preferred but with more words, leading response length to become one of the inevitable inductive biases. A limited number of prior RM debiasing approaches either target a single specific type of bias or model the problem with only simple linear correlations, \textit{e.g.}, Pearson coefficients. To mitigate more complex and diverse inductive biases in reward modeling, we introduce a novel information-theoretic debiasing method called \textbf{D}ebiasing via \textbf{I}nformation optimization for \textbf{R}M (DIR). Inspired by the information bottleneck (IB), we maximize the mutual information (MI) between RM scores and human preference pairs, while minimizing the MI between RM outputs and biased attributes of preference inputs. With theoretical justification from information theory, DIR can handle more sophisticated types of biases with non-linear correlations, broadly extending the real-world application scenarios for RM debiasing methods. In experiments, we verify the effectiveness of DIR with three types of inductive biases: \textit{response length}, \textit{sycophancy}, and \textit{format}. We discover that DIR not only effectively mitigates target inductive biases but also enhances RLHF performance across diverse benchmarks, yielding better generalization abilities. The code and training recipes are available at https://github.com/Qwen-Applications/DIR.

URL PDF HTML ☆

赞 0 踩 0

2512.16856 2026-05-20 cs.AI 版本更新

Distributional AGI Safety

分布式AGI安全

Nenad Tomašev, Matija Franklin, Julian Jacobs, Sébastien Krier, Simon Osindero

发表机构 * Google DeepMind（谷歌深Mind）

AI总结本文提出了一种分布式的AGI安全框架，旨在通过设计和实现虚拟代理沙盒经济来应对群体代理协调带来的安全风险，强调市场机制、可审计性和监管的重要性。

详情

AI中文摘要

人工智能安全和对齐研究主要集中在保护单个AI系统的方法上，基于最终出现单一人工通用智能（AGI）的假设。另一种AGI出现假说认为，一般能力首先通过具有互补技能和能力的子AGI个体代理群体中的协调表现出来，这一假说受到较少关注。本文认为，这种碎片化AGI假说需要得到认真考虑，并应指导相应安全措施和缓解措施的发展。先进AI代理的快速部署，使其具备工具使用能力和通信协调能力，使其成为紧迫的安全问题。因此，我们提出了一种分布式的AGI安全框架，超越了评估和对齐单个代理。该框架以设计和实现虚拟代理沙盒经济（不可渗透或半渗透）为中心，其中代理间的交易由稳健的市场机制调控，并辅以适当的可审计性、声誉管理和监管，以缓解集体风险。

英文摘要

AI safety and alignment research has predominantly been focused on methods for safeguarding individual AI systems, resting on the assumption of an eventual emergence of a monolithic Artificial General Intelligence (AGI). The alternative AGI emergence hypothesis, where general capability levels are first manifested through coordination in groups of sub-AGI individual agents with complementary skills and affordances, has received far less attention. Here we argue that this patchwork AGI hypothesis needs to be given serious consideration, and should inform the development of corresponding safeguards and mitigations. The rapid deployment of advanced AI agents with tool-use capabilities and the ability to communicate and coordinate makes this an urgent safety consideration. We therefore propose a framework for distributional AGI safety that moves beyond evaluating and aligning individual agents. This framework centres on the design and implementation of virtual agentic sandbox economies (impermeable or semi-permeable), where agent-to-agent transactions are governed by robust market mechanisms, coupled with appropriate auditability, reputation management, and oversight to mitigate collective risks.

URL PDF HTML ☆

赞 0 踩 0

2512.05958 2026-05-20 cs.LG cs.AI 版本更新

MaxShapley: Towards Incentive-compatible Generative Search with Fair Context Attribution

MaxShapley：迈向具有公平上下文归因的激励兼容生成搜索

Sara Patel, Mingxun Zhou, Giulia Fanti

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； HKUST（香港科技大学）

AI总结本文提出MaxShapley算法，用于在生成搜索流程中公平地归因和补偿内容提供者，该算法基于Shapley值的特例，通过可分解的max-sum效用函数在多项式时间内计算归因，相比Shapley值的指数成本具有更高的效率。

详情

AI中文摘要

基于大型语言模型（LLMs）的生成搜索引擎正在取代传统搜索引擎，从根本上改变了信息提供者如何获得补偿。为了维持这一生态系统，我们需要公平的机制来根据内容提供者对生成答案的贡献来归因和补偿。我们介绍了MaxShapley，一种高效的算法，用于在生成搜索流程中进行公平的信用归因，该流程在生成之前检索外部来源。MaxShapley是著名Shapley值的特例；它利用可分解的max-sum效用函数，在文档数量上以多项式时间计算归因，而不是Shapley值的指数成本。我们在三个多跳问答数据集（HotPotQA、MuSiQUE、MS MARCO）上评估MaxShapley；MaxShapley在归因质量上与精确的Shapley计算相当，同时消耗的资源更少——例如，在相同归因准确性下，它在资源消耗上比先前最先进的方法减少了高达9倍。我们发布了开源代码和重新校准的数据集。一个教育演示可在https://fair-search.com上获得。

英文摘要

Generative search engines based on large language models (LLMs) are replacing traditional search, fundamentally changing how information providers are compensated. To sustain this ecosystem, we need fair mechanisms to attribute and compensate content providers based on their contributions to generated answers. We introduce MaxShapley, an efficient algorithm for fair credit attribution in generative search pipelines that retrieve external sources before generation. MaxShapley is a special case of the celebrated Shapley value; it leverages a de-composable max-sum utility function to compute attributions with polynomial-time computation in the number of documents, as opposed to the exponential cost of Shapley values. We evaluate MaxShapley on three multi-hop QA datasets (HotPotQA, MuSiQUE, MS MARCO); MaxShapley achieves comparable attribution quality to exact Shapley computation, while consuming a fraction of its tokens--for instance, it gives up to a 9x reduction in resource consumption over prior state-of-the-art methods at the same attribution accuracy. We release open-source code and re-calibrated datasets. An educational demo is available at https://fair-search.com.

URL PDF HTML ☆

赞 0 踩 0

2512.04452 2026-05-20 physics.ao-ph cs.AI cs.LG physics.comp-ph physics.flu-dyn 版本更新

NORi: An ML-Augmented Ocean Boundary Layer Parameterization

NORi：一种融合机器学习的海洋边界层参数化方法

Xin Kai Lee, Ali Ramadhan, Andre Souza, Gregory LeClaire Wagner, Simone Silvestri, John Marshall, Raffaele Ferrari

发表机构 * Department of Earth, Atmospheric and Planetary Sciences, Massachusetts Institute of Technology（麻省理工学院地球、大气与行星科学系）； Center for Computational Science and Engineering, Massachusetts Institute of Technology（麻省理工学院计算科学与工程中心）； Department of Physics, Imperial College London（伦敦帝国学院物理系）； atdepth ； Aeolus Labs（Aeolus实验室）； Department of Environment, Land and Infrastructure Engineering, Politecnico di Torino（托里诺理工学院环境、土地与基础设施工程系）

AI总结 NORi是一种基于物理并结合神经网络的机器学习海洋边界层湍流参数化方法，通过训练大规模涡旋模拟来捕捉边界层底部的混合过程，展示了在不同对流强度、背景层结、旋转和风力作用下的预测和泛化能力。

Comments 58 pages, 20 figures, submitted to Journal of Advances in Modeling Earth Systems (JAMES). This is version 2, updated based on reviews from 3 anonymous reviewers after initial submission to JAMES. The largest change from the previous version is the addition of comparisons with realistic observations from a long-term monitoring site in the Northeast Pacific

详情

AI中文摘要

NORi是一种基于物理并结合神经网络的机器学习海洋边界层湍流参数化方法。NORi代表神经普通微分方程（NODEs）里氏数（Ri）闭合。物理参数化通过依赖里氏数的扩散率和粘度进行控制。神经ODEs被训练以捕捉通过边界层底部的混合过程，这无法通过局部扩散闭合来表示。参数化通过大规模涡旋模拟以“后验”方式训练，其中参数通过一个显式依赖于实际时间积分变量的损失函数进行校准，而不是瞬时子格尺度通量，后者本质上是嘈杂的。NORi通过设计保留踪迹，使用现实的非线性热力学，并在不同对流强度、背景层结、旋转和风力作用下表现出卓越的预测和泛化能力。NORi在Ocean Weather Station Papa处模拟了边界层的季节演变，其性能与最先进的两方程k-ε闭合相当。当在双环流模拟中实现时，尽管仅在两天时间范围内训练，它在至少100年内数值上是稳定的，可以以一小时的时间步长运行。高度表达性的神经网络与严格的物理基础闭合相结合，证明了在气候模型中设计参数化的稳健范式：所需数据和训练成本大大减少，推理性能可以作为主要目标直接优化，数值稳定性通过训练隐含地得到促进。

英文摘要

NORi is a machine learning (ML) parameterization of ocean boundary layer turbulence that is physics-based and augmented with neural networks. NORi stands for neural ordinary differential equations (NODEs) Richardson number (Ri) closure. The physical parameterization is controlled by a Richardson number-dependent diffusivity and viscosity. The neural ODEs are trained to capture the entrainment through the base of the boundary layer, which cannot be represented with a local diffusive closure. The parameterization is trained using large-eddy simulations in an "a posteriori" fashion, where parameters are calibrated with a loss function that explicitly depends on the actual time-integrated variables of interest rather than the instantaneous subgrid fluxes, which are inherently noisy. NORi conserves tracers by design, uses realistic nonlinear thermodynamics, and demonstrates excellent prediction and generalization capabilities in capturing entrainment dynamics under different convective strengths, background stratifications, rotation, and wind forcings. NORi is shown to simulate the seasonal evolution of the boundary layer at Ocean Weather Station Papa with similar performance to the state-of-the-art two-equation $k$-$ε$ closure. When implemented in a double-gyre simulation, it is numerically stable for at least 100 years, despite only being trained on two-day horizons, and can be run with time steps as long as one hour. The highly expressive neural networks, combined with a physically rigorous base closure, prove to be a robust paradigm for designing parameterizations for climate models: data required and training cost are drastically reduced, inference performance can be directly optimized as a primary objective, and numerical stability is implicitly promoted through training.

URL PDF HTML ☆

赞 0 踩 0

2512.01152 2026-05-20 cs.LG cs.AI cs.CV 版本更新

Open-Set Domain Adaptation Under Background Distribution Shift: Challenges and A Provably Efficient Solution

开放集域适应在背景分布偏移下的挑战：挑战与一种可证明高效的解决方案

Shravan Chaudhari, Yoav Wald, Suchi Saria

发表机构 * Department of Computer Science, Johns Hopkins University（约翰霍普金斯大学计算机科学系）； Faculty of Data and Decision Sciences, Technion（技术学院数据与决策科学学院）； Center for Data Science, New York University（纽约大学数据科学中心）； Bayesian Health（贝叶斯健康）

AI总结本文研究了在背景分布偏移情况下开放集域适应的挑战，并提出了一种可证明高效的解决方案CoLOR，通过理论分析和实验证明其在简化过参数化设置中优于基线方法，同时展示了其在图像和文本数据上的广泛适用性。

Comments Project page at https://github.com/Shra1-25/CoLOR

Journal ref Transactions on Machine Learning Research (TMLR) 2026/May ISSN: 2835-8856

详情

AI中文摘要

随着我们将机器学习系统部署到现实世界中，一个核心挑战是保持模型在数据偏移时的性能。这种偏移可以以多种形式存在：新类可能在训练时不存在，这被称为开放集识别，以及已知类别的分布可能发生变化。对于开放集识别的保证大多基于假设已知类别的分布（我们称之为背景分布）是固定的。在本文中，我们开发了CoLOR，一种在挑战性情况下（即背景分布偏移）也能解决开放集识别的方法。我们证明该方法在温和假设下有效，即新类可与非新类分离，并提供理论保证，表明其在简化过参数化设置中优于代表基线方法。我们开发了使CoLOR可扩展和稳健的技术，并在图像和文本数据上进行了全面的实证评估。结果表明，CoLOR在背景偏移下显著优于现有开放集识别方法。此外，我们还提供了新的见解，探讨了诸如新类大小等因素对性能的影响，这在先前工作中尚未得到广泛探索。

英文摘要

As we deploy machine learning systems in the real world, a core challenge is to maintain a model that is performant even as the data shifts. Such shifts can take many forms: new classes may emerge that were absent during training, a problem known as open-set recognition, and the distribution of known categories may change. Guarantees on open-set recognition are mostly derived under the assumption that the distribution of known classes, which we call the background distribution, is fixed. In this paper we develop CoLOR, a method that is guaranteed to solve open-set recognition even in the challenging case where the background distribution shifts. We prove that the method works under benign assumptions that the novel class is separable from the non-novel classes, and provide theoretical guarantees that it outperforms a representative baseline in a simplified overparameterized setting. We develop techniques to make CoLOR scalable and robust, and perform comprehensive empirical evaluations on image and text data. The results show that CoLOR significantly outperforms existing open-set recognition methods under background shift. Moreover, we provide new insights into how factors such as the size of the novel class influences performance, an aspect that has not been extensively explored in prior work.

URL PDF HTML ☆

赞 0 踩 0

2511.13588 2026-05-20 eess.SY cs.AI cs.SY math.DS 版本更新

Data-driven Acceleration of MPC with Guarantees

数据驱动的MPC加速与保证

Agustin Castellano, Shijie Pan, Enrique Mallada

发表机构 * Dept. of Electrical and Computer Engineering（电气与计算机工程系）

AI总结本文提出了一种数据驱动的方法，通过将在线优化替换为基于离线MPC解的非参数策略来加速MPC，该策略在构造的最优成本-剩余上是贪婪的，并且能够以远快于在线求解MPC的速度实现，同时保证递归可行性及可证明的有界最优性差距。

详情

AI中文摘要

模型预测控制（MPC）是一种强大的最优控制框架，但其在低延迟应用中可能过于缓慢。我们提出了一种数据驱动的框架，通过将在线优化替换为由离线MPC解构造的非参数策略来加速MPC。该策略针对构造的最优成本-剩余上是贪婪的，并可以作为非参数查找规则实现，其速度比在线求解MPC快多个数量级。我们的分析表明，在离线数据充分覆盖的条件下，该策略具有递归可行性，并且具有可证明的有界最优性差距。这些条件建立了数据量和界紧度之间的显式权衡。新解可以方便地被纳入其中而无需重新训练，从而实现持续改进。我们的实验表明，该策略比标准MPC快100到1000倍，仅以适度的最优性损失为代价，展示了在实时控制任务中的潜力。

英文摘要

Model Predictive Control (MPC) is a powerful framework for optimal control but can be too slow for low-latency applications. We present a data-driven framework to accelerate MPC by replacing online optimization with a nonparametric policy constructed from offline MPC solutions. Our policy is greedy with respect to a constructed upper bound on the optimal cost-to-go, and can be implemented as a nonparametric lookup rule that is orders of magnitude faster than solving MPC online. Our analysis shows that under sufficient coverage conditions of the offline data, the policy is recursively feasible and admits provable, bounded optimality gap. These conditions establish an explicit trade-off between the amount of data collected and the tightness of the bounds. New solutions can be incorporated straightforwardly without the need for retraining, enabling continual improvement. Our experiments show that this policy is between 100 and 1000 times faster than standard MPC with only a modest hit to optimality, showing potential for real-time control tasks.

URL PDF HTML ☆

赞 0 踩 0

2511.10292 2026-05-20 cs.CV cs.AI 版本更新

Adaptive Residual-Update Steering for Low-Overhead Hallucination Mitigation in Large Vision Language Models

自适应残差更新引导用于大型视觉语言模型中低开销幻觉抑制

Zhengtao Zou, Ya Gao, Jiarui Guan, Bin Li, Pekka Marttinen

发表机构 * Aalto University, Espoo, Finland（艾尔沃大学，芬兰 Espoo）； Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China（深圳先进技术研究院，中国科学院，深圳）

AI总结本文提出RUDDER框架，通过创建持久视觉锚点来对抗视觉稀释，利用模型的prefill残差更新提取鲁棒证据方向，并通过自适应门控机制注入解码过程，有效抑制幻觉并保持高吞吐量。

Comments Accepted by ICML 2026; Code available at: https://github.com/Akko000/RUDDER-Residual-Update-Directed-DEcoding-Regulation-

详情

AI中文摘要

大型视觉-语言模型（LVLMs）通常将视觉输入作为语言解码器之前的前缀进行处理。随着模型自回归地生成文本，这种初始视觉信息不可避免地经历“稀释”，导致模型过度依赖语言先验并产生幻觉。现有干预尝试通过对比logits或迭代优化输出来纠正这一问题，但会带来不可接受的延迟成本。我们提出残差更新引导解码调节（RUDDER）框架，通过创建持久视觉锚点来对抗视觉稀释。我们直接从模型的prefill残差更新中提取鲁棒证据方向（CARD），并将其注入解码过程。这种注入通过自适应门控机制（Beta Gate）进行调节，该机制作为信任机制，确保只有在必要时才应用视觉提示。在LLaVA-1.5（7B/13B）、Idefics2、InstructBLIP和Qwen2.5-VL上的实验表明，RUDDER一致地抑制了幻觉（在贪婪解码中，RUDDER将CHAIR_S减少平均24.4%，将CHAIR_i减少23.6%），并在不同架构上有效扩展，同时保持>96.0%的吞吐量。

英文摘要

Large Vision-Language Models (LVLMs) typically process visual inputs as a prefix to the language decoder. As the model autoregressively generates text, this initial visual information inevitably undergoes "dilution" leading the model to over-rely on language priors and hallucinate objects. Existing interventions attempt to correct this by contrasting logits or iteratively refining outputs, but they incur prohibitive latency costs. We propose Residual-Update Directed DEcoding Regulation (RUDDER), a framework that counters visual dilution by creating a persistent visual anchor. We extract a robust evidence direction (CARD) directly from the model's prefill residual updates, and inject it into the decoding process. This injection is modulated by an adaptive gate, the Beta Gate, which acts as a trust mechanism and ensures the visual reminder is applied only when necessary. Experiments on LLaVA-1.5 (7B/13B), Idefics2, InstructBLIP, and Qwen2.5-VL demonstrate that RUDDER consistently mitigates hallucination (with greedy decoding, RUDDER reduces CHAIR_S by an average of 24.4% and CHAIR_i by 23.6% relative) and scales effectively across architectures, all while maintaining >96.0% throughput.

URL PDF HTML ☆

赞 0 踩 0

2511.06943 2026-05-20 cs.CV cs.AI 版本更新

PlantTraitNet: An Uncertainty-Aware Multimodal Framework for Global-Scale Plant Trait Inference from Citizen Science Data

PlantTraitNet: 一种考虑不确定性的多模态框架，用于从公民科学数据中进行全球尺度植物特性推断

Ayushi Sharma, Johanna Trost, Daniel Lusk, Johannes Dollinger, Julian Schrader, Christian Rossi, Javier Lopatin, Etienne Laliberté, Simon Haberstroh, Jana Eichel, Daniel Mederer, Jose Miguel Cerda-Paredes, Shyam S. Phartyal, Lisa-Maricia Schwarz, Anja Linstädter, Maria Conceição Caldeira, Teja Kattenborn

发表机构 * GeoSense-Freiburg（弗赖堡GeoSense）

AI总结本研究提出PlantTraitNet，一种多模态、多任务且考虑不确定性的深度学习框架，通过弱监督从公民科学照片中预测四个关键植物特性（植物高度、叶面积、特定叶面积和氮含量），并利用空间聚合生成全球特性分布图，验证结果表明其在所有评估特性上均优于现有特性地图。

Comments Accepted at the 40th AAAI Conference on Artificial Intelligence (AAAI-26). Link: https://ojs.aaai.org/index.php/AAAI/article/view/41272

详情

DOI: 10.1609/aaai.v40i46.41272

AI中文摘要

全球植物特性地图，如叶片氮含量或植物高度，对于理解生态系统过程，包括地球系统的碳和能量循环至关重要。然而，现有特性地图受限于基于现场测量的高成本和稀疏的地理覆盖。公民科学计划提供了一个未被充分利用的资源来克服这些限制，全球范围内有超过5000万张带有地理标签的植物照片，捕捉了有价值的植物形态和生理信息。在本研究中，我们引入PlantTraitNet，一种多模态、多任务且考虑不确定性的深度学习框架，利用弱监督从公民科学照片中预测四个关键植物特性（植物高度、叶面积、特定叶面积和氮含量）。通过在空间上聚合个体特性预测，我们生成全球特性分布图。我们通过独立的植被调查数据（sPlotOpen）验证这些地图，并将其与领先全球特性产品进行基准测试。我们的结果表明，PlantTraitNet在所有评估特性上均优于现有特性地图，证明了将公民科学影像与计算机视觉和地理空间AI结合，不仅能够实现可扩展的，而且更准确的全球特性映射。这种方法为生态研究和地球系统建模提供了强大的新途径。

英文摘要

Global plant maps of plant traits, such as leaf nitrogen or plant height, are essential for understanding ecosystem processes, including the carbon and energy cycles of the Earth system. However, existing trait maps remain limited by the high cost and sparse geographic coverage of field-based measurements. Citizen science initiatives offer a largely untapped resource to overcome these limitations, with over 50 million geotagged plant photographs worldwide capturing valuable visual information on plant morphology and physiology. In this study, we introduce PlantTraitNet, a multi-modal, multi-task uncertainty-aware deep learning framework that predictsfour key plant traits (plant height, leaf area, specific leaf area, and nitrogen content) from citizen science photos using weak supervision. By aggregating individual trait predictions across space, we generate global maps of trait distributions. We validate these maps against independent vegetation survey data (sPlotOpen) and benchmark them against leading global trait products. Our results show that PlantTraitNet consistently outperforms existing trait maps across all evaluated traits, demonstrating that citizen science imagery, when integrated with computer vision and geospatial AI, enables not only scalable but also more accurate global trait mapping. This approach offers a powerful new pathway for ecological research and Earth system modeling.

URL PDF HTML ☆

赞 0 踩 0

2510.23507 2026-05-20 cs.LG cs.AI cs.IT math.IT 版本更新

A Deep Latent Factor Graph Clustering with Fairness-Utility Trade-off Perspective

具有公平性-效用权衡视角的深度潜在因子图聚类

Siamak Ghodsi, Amjad Seyedi, Tai Le Quy, Fariba Karimi, Eirini Ntoutsi

发表机构 * L3S Research Center（L3S研究所以）； University of Mons（蒙斯大学）； University of Koblenz（科布伦茨大学）； Bundeswehr University（联邦国防军大学）

AI总结本文提出DFNMF，一种针对图的端到端深度非负三因子分解方法，通过软统计平衡正则化直接优化聚类分配，以实现公平性与效用的平衡，同时在合成和真实网络中表现出更高的群体平衡性和更高的模ularity。

Comments Accepted to IEEE Big-Data 2025 main research track. The paper is 10 main pages and 4 pages of Appendix

Journal ref 2025 IEEE International Conference on Big Data (BigData)

详情

DOI: 10.1109/BigData66926.2025.11402535

AI中文摘要

公平图聚类旨在找到尊重网络结构的同时保持敏感群体比例的划分，应用范围涵盖社区检测、团队组建、资源分配和社会网络分析。许多现有方法强制性约束或依赖多阶段流程（例如谱嵌入后接k-均值），限制了权衡控制、可解释性和可扩展性。我们引入DFNMF，一种针对图的端到端深度非负三因子分解方法，直接优化聚类分配，使用软统计平衡正则化。单个参数λ调节公平性-效用平衡，非负性产生部分因子和透明的软成员资格。优化使用稀疏友好的交替更新，与边数成近线性比例。在合成和真实网络中，DFNMF在可比的模ularity下实现了显著更高的群体平衡，经常在帕累托前沿上超越最先进基线。代码可在https://github.com/SiamakGhodsi/DFNMF.git获得。

英文摘要

Fair graph clustering seeks partitions that respect network structure while maintaining proportional representation across sensitive groups, with applications spanning community detection, team formation, resource allocation, and social network analysis. Many existing approaches enforce rigid constraints or rely on multi-stage pipelines (e.g., spectral embedding followed by $k$-means), limiting trade-off control, interpretability, and scalability. We introduce \emph{DFNMF}, an end-to-end deep nonnegative tri-factorization tailored to graphs that directly optimizes cluster assignments with a soft statistical-parity regularizer. A single parameter $λ$ tunes the fairness--utility balance, while nonnegativity yields parts-based factors and transparent soft memberships. The optimization uses sparse-friendly alternating updates and scales near-linearly with the number of edges. Across synthetic and real networks, DFNMF achieves substantially higher group balance at comparable modularity, often dominating state-of-the-art baselines on the Pareto front. The code is available at https://github.com/SiamakGhodsi/DFNMF.git.

URL PDF HTML ☆

赞 0 踩 0

2510.16814 2026-05-20 cs.LG cs.AI cs.CV 版本更新

WARC-Bench：基于网络存档的GUI子任务执行基准

Sanjari Srivastava, Gang Li, Cheng Chang, Rishu Garg, Manpreet Kaur, Charlene Y. Lee, Yuezhang Li, Yining Mao, Ignacio Cases, Yanan Xie, Peng Qi

发表机构 * Uniphore

AI总结本文提出WARC-Bench，一个基于网络存档的GUI子任务执行基准，通过438个任务评估多模态AI代理在子任务上的能力，实验表明SFT和RLVR方法在提升子任务执行效果上取得显著成果。

详情

AI中文摘要

训练能够导航复杂现实网站的网络代理需要它们掌握子任务——多个UI组件上的短周期交互（例如在日期选择器中选择正确日期或在容器中滚动以提取信息）。我们介绍了WARC-Bench（网络存档基准），一个新型的网络导航基准，包含438个任务，旨在评估多模态AI代理在子任务上的能力。WARC-Bench利用Web ARChive文件实现动态且逼真的网页沙盒交互。我们证明WARC-Bench对领先的计算机使用模型具有挑战性，最高观察到的成功率仅为64.8%。为了提高开源模型在子任务上的表现，我们探索了两种常见的训练技术：监督微调（SFT）和具有可验证奖励的强化学习（RLVR）。实验表明，SFT模型在基准上的成功率为48.8%。在数据稀缺的情况下，通过RLVR训练SFT检查点，将分数提高到52.8%，在WARC-Bench上优于许多前沿模型。我们的分析得出结论：掌握这些子任务对于稳健的网络规划和导航至关重要，而这一能力并未被现有基准充分评估。

英文摘要

Training web agents to navigate complex, real-world websites requires them to master $\textit{subtasks}$ - short-horizon interactions on multiple UI components (e.g., choosing the correct date in a date picker, or scrolling in a container to extract information). We introduce WARC-Bench (Web Archive Benchmark), a novel web navigation benchmark featuring 438 tasks designed to evaluate multimodal AI agents on subtasks. WARC-Bench enables sandboxed interactions with dynamic and realistic webpages using Web ARChive files. We show that WARC-Bench is challenging for leading computer-use models, with the highest observed success rate being 64.8%. To improve open source models on subtask, we explore two common training techniques: supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). Experiments show that SFT models obtain a 48.8% success rate on the benchmark. Training with RLVR over SFT checkpoints, even in data-scarce settings, improves the score to 52.8% on WARC-Bench, outperforming many frontier models. Our analysis concludes that mastering these subtasks is essential for robust web planning and navigation, and is a capability not extensively evaluated by existing benchmarks.

URL PDF HTML ☆

赞 0 踩 0

2510.05746 2026-05-20 cs.AI cs.CL cs.LG 版本更新

ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems

ARM：为通用多智能体系统发现代理推理模块

Bohan Yao, Shiva Krishna Reddy Malay, Vikas Yadav

发表机构 * University of Washington（华盛顿大学）

AI总结本文提出了一种新的自动多智能体系统设计范式，通过优化链式推理（CoT）来发现代理推理模块（ARM），该模块通过在代码空间中进行树搜索，利用执行轨迹的反思来进化，从而提升多智能体系统的泛化能力。

Comments 29 pages, 2 figures

详情

AI中文摘要

大型语言模型（LLM）驱动的多智能体系统（MAS）在各种复杂推理任务上取得了最先进的结果。最近的研究提出了自动化设计MAS的方法，消除了手动工程的需要。然而，这些方法表现不佳，通常与简单的基线相当或更差。此外，它们需要为每个新任务领域进行昂贵的架构重新发现，并且在没有现有标注验证集的领域中需要昂贵的数据注释。关键的洞察是简单的链式推理（CoT）推理往往与这些复杂系统竞争，表明MAS的基本推理单元CoT值得进一步研究。为此，我们提出了一种新的自动MAS设计范式，将焦点转向优化CoT推理。我们引入了代理推理模块（ARM），即CoT的代理泛化，其中每个细粒度推理步骤由专门的推理模块执行。该模块通过在代码空间中进行树搜索来发现，从简单的CoT模块开始，利用执行轨迹的反思进行进化。最终的ARM作为一个通用的推理构建块，可以作为直接的递归循环或作为学习元协调器中的子程序使用。我们的方法显著优于手动设计的MAS和最先进的自动MAS设计方法。关键的是，由ARM构建的MAS表现出卓越的泛化能力，在不同的基础模型和任务领域中保持高性能，而无需进一步优化。

英文摘要

Large Language Model (LLM)-powered Multi-agent systems (MAS) have achieved state-of-the-art results on various complex reasoning tasks. Recent works have proposed techniques to automate the design of MASes, eliminating the need for manual engineering. However, these techniques perform poorly, often achieving similar or inferior performance to simple baselines. Furthermore, they require computationally expensive re-discovery of architectures for each new task domain and expensive data annotation on domains without existing labeled validation sets. A critical insight is that simple Chain of Thought (CoT) reasoning often performs competitively with these complex systems, suggesting that the fundamental reasoning unit of MASes, CoT, warrants further investigation. To this end, we present a new paradigm for automatic MAS design that pivots the focus to optimizing CoT reasoning. We introduce the Agentic Reasoning Module (ARM), an agentic generalization of CoT where each granular reasoning step is executed by a specialized reasoning module. This module is discovered through a tree search over the code space, starting from a simple CoT module and evolved using mutations informed by reflection on execution traces. The resulting ARM acts as a versatile reasoning building block which can be utilized as a direct recursive loop or as a subroutine in a learned meta-orchestrator. Our approach significantly outperforms both manually designed MASes and state-of-the-art automatic MAS design methods. Crucially, MASes built with ARM exhibit superb generalization, maintaining high performance across different foundation models and task domains without further optimization.

URL PDF HTML ☆

赞 0 踩 0

2510.03824 2026-05-20 cs.LG cs.AI stat.ML 版本更新

Proximal Diffusion Neural Sampler

近端扩散神经采样器

Wei Guo, Jaemoo Choi, Yuchen Zhu, Molei Tao, Yongxin Chen

发表机构 * Georgia Institute of Technology（佐治亚理工学院）

AI总结本文提出了一种名为近端扩散神经采样器（PDNS）的框架，通过在路径测度空间上应用近端点方法，解决神经采样器在训练过程中遇到的多模式目标分布和模式崩溃问题，通过分阶段的简单子问题逐步逼近目标分布，促进模式的全面探索。

Comments Accepted at ICLR 2026 (https://openreview.net/forum?id=XTHQqS7ObC)

详情

AI中文摘要

超越分类准确度：Neural-MedBench与更深层次推理基准的需求

Miao Jing, Mengting Jia, Junling Lin, Zhongxia Shen, Huan Gao, Mingkun Xu, Shangyang Li

发表机构 * School of Physics Science and Technology, Beijing University of Posts and Telecommunications（北京邮电大学物理科学与技术学院）； Guangdong Institute of Intelligence Science and Technology（广东智能科学技术研究院）； Beijing Chaoyang Hospital, Capital Medical University（北京朝阳医院）； Sleep Medical Center, Huzhou Third Municipal Hospital, Affiliated Hospital of Wenzhou Medical University（湖州第三人民医院睡眠医学中心，温州医科大学附属医院）； University of Macau（澳门大学）； Renyixun Health Technology Co., Ltd（仁颐讯健康科技有限公司）； Academy for Advanced Interdisciplinary Studies, Peking University（北京大学交叉学科研究院）

AI总结本文提出Neural-MedBench，一个专门用于测试多模态神经病学推理能力的基准，揭示现有医疗数据集过于强调分类准确度的问题，并通过系统评估发现模型推理失败而非感知误差主导性能下降，强调需要兼顾广度与深度的评估框架。

Comments 23 pages, 12 figures

Journal ref ICLR'2026

详情

AI中文摘要

近期视觉-语言模型（VLMs）在标准医疗基准上取得了显著进展，但其真正的临床推理能力仍不清楚。现有数据集主要强调分类准确度，导致模型在高风险诊断推理上仍存在不足。我们引入Neural-MedBench，一个紧凑且推理密集的基准，专门用于探测多模态临床推理在神经病学中的极限。Neural-MedBench整合多序列MRI扫描、结构化电子健康记录和临床笔记，并涵盖三大核心任务家族：鉴别诊断、病变识别和推理生成。为确保可靠评估，我们开发了结合LLM评分、临床验证和语义相似度指标的混合评分流程。通过系统评估最先进的VLMs，包括GPT-4o、Claude-4和MedGemma，我们发现其性能相比传统数据集显著下降。错误分析显示，推理失败而非感知误差主导模型不足。我们的发现强调了需要双轴评估框架：以广度为导向的大数据集用于统计泛化，以深度为导向的紧凑基准如Neural-MedBench用于推理保真度。我们发布Neural-MedBench于https://neuromedbench.github.io/作为开放且可扩展的诊断测试床，引导未来基准的扩展，并实现严谨而成本有效的临床可信AI评估。

英文摘要

Recent advances in vision-language models (VLMs) have achieved remarkable performance on standard medical benchmarks, yet their true clinical reasoning ability remains unclear. Existing datasets predominantly emphasize classification accuracy, creating an evaluation illusion in which models appear proficient while still failing at high-stakes diagnostic reasoning. We introduce Neural-MedBench, a compact yet reasoning-intensive benchmark specifically designed to probe the limits of multimodal clinical reasoning in neurology. Neural-MedBench integrates multi-sequence MRI scans, structured electronic health records, and clinical notes, and encompasses three core task families: differential diagnosis, lesion recognition, and rationale generation. To ensure reliable evaluation, we develop a hybrid scoring pipeline that combines LLM-based graders, clinician validation, and semantic similarity metrics. Through systematic evaluation of state-of-the-art VLMs, including GPT-4o, Claude-4, and MedGemma, we observe a sharp performance drop compared to conventional datasets. Error analysis shows that reasoning failures, rather than perceptual errors, dominate model shortcomings. Our findings highlight the necessity of a Two-Axis Evaluation Framework: breadth-oriented large datasets for statistical generalization, and depth-oriented, compact benchmarks such as Neural-MedBench for reasoning fidelity. We release Neural-MedBench at https://neuromedbench.github.io/ as an open and extensible diagnostic testbed, which guides the expansion of future benchmarks and enables rigorous yet cost-effective assessment of clinically trustworthy AI.

URL PDF HTML ☆

赞 0 踩 0

2509.19182 2026-05-20 cs.HC cs.AI 版本更新

ERIS: 一种面向分布外时间序列分类的能量引导特征解耦框架

Xin Wu, Fei Teng, Ji Zhang, Xingwang Li, Yuxuan Liang

发表机构 * Hong Kong University of Science and Technology（香港科技大学）

AI总结本文提出ERIS框架，通过能量引导机制和语义指导，解决时间序列分类中分布外数据的可靠特征解耦问题，提升模型鲁棒性和泛化能力。

Journal ref Information Fusion 135, 104407 (2026)

详情

DOI: 10.1016/j.inffus.2026.104407

AI中文摘要

理想的时间序列分类（TSC）应能捕捉不变表示，但实现对分布外（OOD）数据的可靠性能仍是一个核心障碍。这一障碍源于模型内在地将领域特定和标签相关特征纠缠在一起，导致虚假相关性。尽管特征解耦旨在解决这一问题，但当前方法大多缺乏必要的语义方向，无法隔离真正普遍的特征。为此，我们提出一个端到端的Energy-Regularized Information for Shift-Robustness（ERIS）框架，以实现引导且可靠的特征解耦。核心思想是有效的解耦不仅需要数学约束，还需要语义指导来锚定分离过程。ERIS集成了三个关键机制来实现这一目标。具体来说，我们首先引入一种能量引导校准机制，为分离过程提供关键的语义指导，使模型能够自我校准。此外，一个权重层面正交性策略强制领域特定和标签相关特征之间的结构性独立，从而减轻它们的干扰。此外，一个辅助对抗泛化机制通过注入结构化扰动来增强鲁棒性。在四个基准测试中的实验表明，ERIS在统计上显著优于最先进的基线方法，始终保持最佳性能排名。

英文摘要

An ideal time series classification (TSC) should be able to capture invariant representations, but achieving reliable performance on out-of-distribution (OOD) data remains a core obstacle. This obstacle arises from the way models inherently entangle domain-specific and label-relevant features, resulting in spurious correlations. While feature disentanglement aims to solve this, current methods are largely unguided, lacking the semantic direction required to isolate truly universal features. To address this, we propose an end-to-end Energy-Regularized Information for Shift-Robustness (ERIS) framework to enable guided and reliable feature disentanglement. The core idea is that effective disentanglement requires not only mathematical constraints but also semantic guidance to anchor the separation process. ERIS incorporates three key mechanisms to achieve this goal. Specifically, we first introduce an energy-guided calibration mechanism, which provides crucial semantic guidance for the separation, enabling the model to self-calibrate. Additionally, a weight-level orthogonality strategy enforces structural independence between domain-specific and label-relevant features, thereby mitigating their interference. Moreover, an auxiliary adversarial generalization mechanism enhances robustness by injecting structured perturbations. Experiments across four benchmarks demonstrate that ERIS achieves a statistically significant improvement over state-of-the-art baselines, consistently securing the top performance rank.

URL PDF HTML ☆

赞 0 踩 0

2507.15698 2026-05-20 cs.CL cs.AI cs.LG 版本更新

CoLD: Counterfactually-Guided Length Debiasing for Process Reward Models in Mathematical Reasoning

CoLD: 用于数学推理过程中奖励模型的反事实引导长度偏差消除

Congmin Zheng, Jiachen Zhu, Jianghao Lin, Xinyi Dai, Weiwen Liu, Haoxuan Li, Yong Yu, Weinan Zhang, Mengyue Yang

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Huawei Noah’s Ark Lab（华为诺亚实验室）； Peking University（北京大学）； University of Bristol（布里斯托大学）

AI总结本文提出CoLD，一种通过反事实引导消除过程奖励模型中长度偏差的统一框架，旨在提高多步骤推理的准确性和简洁性，同时提升下游强化学习性能和跨领域泛化能力。

详情

AI中文摘要

过程奖励模型（PRMs）在评估和引导大型语言模型（LLMs）的多步推理中起着核心作用，特别是在数学问题解决中。然而，我们发现现有PRMs存在普遍的长度偏差：即使语义内容和逻辑有效性未变，它们也倾向于对较长的推理步骤赋予更高的分数。这种偏差会削弱奖励预测的可靠性，并导致推理过程中输出过于冗长。为了解决这一问题，我们提出了CoLD（Counterfactually-Guided Length Debiasing），一种统一的框架，通过三个组件减轻长度偏差：显式的长度惩罚调整、一个训练以捕捉虚假长度相关信号的学得偏差估计器，以及一种联合训练策略，强制奖励预测的长度不变性。我们的方法基于反事实推理，并受因果图分析的启发。在MATH500和GSM-Plus上的广泛实验表明，CoLD提高了步骤选择的准确性，并鼓励了更简洁、逻辑有效的推理。此外，它一致提高了下游RL性能，并通过减轻长度偏差在跨领域中泛化，展示了CoLD强大的泛化能力。

英文摘要

Process Reward Models (PRMs) play a central role in evaluating and guiding multi-step reasoning in large language models (LLMs), especially for mathematical problem solving. However, we identify a pervasive length bias in existing PRMs: they tend to assign higher scores to longer reasoning steps, even when the semantic content and logical validity are unchanged. This bias undermines the reliability of reward predictions and leads to overly verbose outputs during inference. To address this issue, we propose CoLD(Counterfactually-Guided Length Debiasing), a unified framework that mitigates length bias through three components: an explicit length-penalty adjustment, a learned bias estimator trained to capture spurious length-related signals, and a joint training strategy that enforces length-invariance in reward predictions. Our approach is grounded in counterfactual reasoning and informed by causal graph analysis. Extensive experiments on MATH500 and GSM-Plus show that CoLD improves accuracy in step selection, and encourages more concise, logically valid reasoning. Furthermore, it consistently improves downstream RL performance and generalizes across domains by mitigating length bias, demonstrating CoLD's strong generalization capability.

URL PDF HTML ☆

赞 0 踩 0

2507.10614 2026-05-20 cs.LG cs.AI 版本更新

Fine-tuning Large Language Model for Automated Algorithm Design

微调大语言模型用于自动化算法设计

Fei Liu, Rui Zhang, Xi Lin, Zhichao Lu, Qingfu Zhang

发表机构 * City University of Hong Kong（香港城市大学）； Xi’an Jiaotong University（西安交通大学）

AI总结本文探讨了微调大语言模型以提升其在自动化算法设计中的性能，提出了一种多样性感知的排名策略和直接偏好优化方法，通过实验验证了任务特定微调在不同算法设计任务中的有效性。

详情

AI中文摘要

将大语言模型（LLMs）整合到自动化算法设计中已展现出巨大潜力。一种常见的方法是将LLMs嵌入到搜索过程中，以迭代生成和优化候选算法。然而，现有大多数方法依赖于为通用编码任务训练的现成LLMs，留下一个关键问题：是否需要专门针对算法设计训练的LLMs？如果是，如何有效获得此类LLMs，并且它们在不同算法设计任务中有多好的泛化能力？在本文中，我们通过探索针对算法设计的LLMs微调，初步回答了这些问题。我们引入了一种多样性感知的排名（DAR）采样策略，以平衡训练数据的多样性和质量，然后利用直接偏好优化来高效地对齐LLMs的输出与任务目标。我们的实验主要在Llama-3.2-1B-Instruct和Llama-3.1-8BInstruct上进行，针对三个不同的算法设计任务，此外，openPangu-Embedded模型还作为辅助比较在可允许集合问题上进行评估。结果表明，微调后的LLMs在较小的Llama-3.2-1B-Instruct上显著优于其现成的对应者，并在可允许集合问题上与较大的Llama-3.1-8B-Instruct匹配。此外，我们观察到良好的泛化能力：在特定算法设计任务上微调的LLMs在相关任务中也表现出色。这些发现突显了LLMs在算法设计中任务特定适应的价值，并为未来研究开辟了新途径。我们的代码可在https://github.com/RayZhhh/dpo-aad上公开获取。

英文摘要

The integration of large language models (LLMs) into automated algorithm design has shown promising potential. A prevalent approach embeds LLMs within search routines to iteratively generate and refine candidate algorithms. However, most existing methods rely on off-the-shelf LLMs trained for general coding tasks, leaving a key question open: Do we need LLMs specifically tailored for algorithm design? If so, how can such LLMs be effectively obtained and how well can they generalize across different algorithm design tasks? In this paper, we take a preliminary step toward answering these questions by exploring fine-tuning of LLMs for algorithm design. We introduce a Diversity-Aware Rank-based (DAR) sampling strategy to balance training data diversity and quality, then we leverage direct preference optimization to efficiently align LLM outputs with task objectives. Our experiments are primarily conducted on Llama-3.2-1B-Instruct and Llama-3.1-8BInstruct across three distinct algorithm design tasks, with openPangu-Embedded models additionally included as auxiliary comparisons on the admissible set problem. Results suggest that fine-tuned LLMs can significantly outperform their off-the-shelf counterparts with the smaller Llama-3.2-1B-Instruct and match the larger Llama-3.1-8B-Instruct on the admissible set problem. Moreover, we observe promising generalization: LLMs fine-tuned on specific algorithm design tasks also improve performance on related tasks with varying settings. These findings highlight the value of task-specific adaptation for LLMs in algorithm design and open new avenues for future research. Our code is publicly available at https://github.com/RayZhhh/dpo-aad.

URL PDF HTML ☆

赞 0 踩 0

2507.10492 2026-05-20 cs.CV cs.AI cs.LG 版本更新

BenchReAD: A systematic benchmark for retinal anomaly detection

BenchReAD: 一种系统性的视网膜异常检测基准

Chenyu Lian, Hong-Yu Zhou, Zhanli Hu, Jing Qin

发表机构 * The Center for Smart Health, School of Nursing, the Hong Kong Polytechnic University, Hong Kong, China（香港理工大学护理学院智能健康中心）； School of Biomedical Engineering, Tsinghua University, Beijing, China（清华大学生物医学工程学院）； Research Center for Medical AI, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China（中国科学院深圳先进技术研究院医学人工智能研究中心）

AI总结本研究提出BenchReAD基准，旨在解决视网膜异常检测领域缺乏全面且公开的评估标准的问题，通过系统化的数据和算法分类，引入了全监督方法DRA，并改进为NFM-DRA，实现了SOTA性能。

Comments MICCAI 2025

详情

DOI: 10.1007/978-3-032-04937-7_4

AI中文摘要

视网膜异常检测在筛查眼部和系统性疾病中起着关键作用。尽管其重要性，该领域的进展受到缺乏全面且公开可用的基准的阻碍，这对于公平评估和推进方法至关重要。由于这一限制，与视网膜图像相关的先前异常检测工作受到（1）异常类型有限且过于简单的限制，（2）测试集几乎饱和，以及（3）缺乏泛化评估的影响，导致实验设置说服力不足。此外，现有医学异常检测基准大多专注于单类监督方法（仅使用负样本训练），忽视了临床实践中大量可用的标记异常数据和未标记数据。为了填补这些差距，我们引入了视网膜异常检测的基准，该基准在数据和算法上都是全面且系统的。通过分类和评估先前方法，我们发现利用解耦异常表示的全监督方法（DRA）取得了最佳性能，但在遇到某些未见异常时性能显著下降。受单类监督学习中记忆库机制的启发，我们提出了NFM-DRA，将其与正常特征记忆结合，以缓解性能下降，建立新的SOTA。该基准可在https://github.com/DopamineLcy/BenchReAD上公开获取。

英文摘要

Retinal anomaly detection plays a pivotal role in screening ocular and systemic diseases. Despite its significance, progress in the field has been hindered by the absence of a comprehensive and publicly available benchmark, which is essential for the fair evaluation and advancement of methodologies. Due to this limitation, previous anomaly detection work related to retinal images has been constrained by (1) a limited and overly simplistic set of anomaly types, (2) test sets that are nearly saturated, and (3) a lack of generalization evaluation, resulting in less convincing experimental setups. Furthermore, existing benchmarks in medical anomaly detection predominantly focus on one-class supervised approaches (training only with negative samples), overlooking the vast amounts of labeled abnormal data and unlabeled data that are commonly available in clinical practice. To bridge these gaps, we introduce a benchmark for retinal anomaly detection, which is comprehensive and systematic in terms of data and algorithm. Through categorizing and benchmarking previous methods, we find that a fully supervised approach leveraging disentangled representations of abnormalities (DRA) achieves the best performance but suffers from significant drops in performance when encountering certain unseen anomalies. Inspired by the memory bank mechanisms in one-class supervised learning, we propose NFM-DRA, which integrates DRA with a Normal Feature Memory to mitigate the performance degradation, establishing a new SOTA. The benchmark is publicly available at https://github.com/DopamineLcy/BenchReAD.

URL PDF HTML ☆

赞 0 踩 0

2506.08618 2026-05-20 cs.LG cond-mat.mes-hall cond-mat.other cs.AI cs.CV 版本更新

HSG-12M: A Large-Scale Benchmark of Spatial Multigraphs from the Energy Spectra of Non-Hermitian Crystals

HSG-12M: 一种大规模空间多图基准，源自非厄密晶体能量谱

Xianquan Yan, Hakan Akgün, Kenji Kawaguchi, N. Duane Loh, Ching Hua Lee

发表机构 * National University of Singapore（新加坡国立大学）； NUS Centre for Bioimaging Sciences（新加坡国立大学生物成像科学中心）

AI总结本文提出HSG-12M，一个包含1160万静态和510万动态哈密顿量谱图的数据集，用于研究非厄密量子物理中的复杂几何结构，填补了现有图基准在空间多边学习方面的空白。

Comments Accepted to ICLR 2026, OpenReview: [https://openreview.net/forum?id=YxuKCME576]. 49 pages, 13 figures, 14 tables. Code & pipeline: [https://github.com/sarinstein-yan/Poly2Graph] Dataset: [https://github.com/sarinstein-yan/HSG-12M] Dataset released under CC BY 4.0. The Fourteenth International Conference on Learning Representations (ICLR 2026)

Journal ref The Fourteenth International Conference on Learning Representations (ICLR 2026)

详情

AI中文摘要

人工智能正通过揭示理解复杂物理系统的新方法改变科学研究，但其影响仍受限于缺乏大规模、高质量的领域专用数据集。非厄密量子物理中蕴藏着丰富的资源，其中晶体的能量谱在复平面上形成复杂的几何结构，称为哈密顿量谱图。尽管这些谱图作为电子行为的指纹具有重要意义，但其系统研究一直受限于手动提取的依赖。为释放这一潜力，我们引入Poly2Graph：一个高性能、开源的管道，自动化将一维晶体哈密顿量映射到谱图。使用该工具，我们提出了HSG-12M：一个包含1160万静态和510万动态哈密顿量谱图的数据集，涵盖1401个特征多项式类别，源自177TB的谱势数据。关键的是，HSG-12M是首个大规模空间多图数据集——图嵌入在度量空间中，其中两个节点之间不同的几何轨迹被保留为单独的边。这同时填补了现有图基准在空间多边学习方面的空白。流行的GNN基准测试揭示了在大规模学习空间多边时的新挑战。除了其实际用途外，我们还表明谱图是多项式、向量和矩阵的通用拓扑指纹，建立了新的代数到图的联系。HSG-12M为凝聚态物理的数据驱动科学发现奠定了基础，为几何感知图学习的新机会以及更广泛领域铺平了道路。

英文摘要

AI is transforming scientific research by revealing new ways to understand complex physical systems, but its impact remains constrained by the lack of large, high-quality domain-specific datasets. A rich, largely untapped resource lies in non-Hermitian quantum physics, where the energy spectra of crystals form intricate geometries on the complex plane -- termed as Hamiltonian spectral graphs. Despite their significance as fingerprints for electronic behavior, their systematic study has been intractable due to the reliance on manual extraction. To unlock this potential, we introduce Poly2Graph: a high-performance, open-source pipeline that automates the mapping of 1-D crystal Hamiltonians to spectral graphs. Using this tool, we present HSG-12M: a dataset containing 11.6 million static and 5.1 million dynamic Hamiltonian spectral graphs across 1401 characteristic-polynomial classes, distilled from 177 TB of spectral potential data. Crucially, HSG-12M is the first large-scale dataset of spatial multigraphs -- graphs embedded in a metric space where multiple geometrically distinct trajectories between two nodes are retained as separate edges. This simultaneously addresses a critical gap, as existing graph benchmarks overwhelmingly assume simple, non-spatial edges, discarding vital geometric information. Benchmarks with popular GNNs expose new challenges in learning spatial multi-edges at scale. Beyond its practical utility, we show that spectral graphs serve as universal topological fingerprints of polynomials, vectors, and matrices, forging a new algebra-to-graph link. HSG-12M lays the groundwork for data-driven scientific discovery in condensed matter physics, new opportunities in geometry-aware graph learning and beyond.

URL PDF HTML ☆

赞 0 踩 0

2506.00286 2026-05-20 cs.LG cs.AI math.OC stat.ML 版本更新

Recursive Entropic Risk Optimization in Discounted MDPs: Sample Complexity Bounds with a Generative Model

递归熵风险优化在折扣马尔可夫决策过程中的应用：带有生成模型的样本复杂性界

Oliver Mortensen, Mohammad Sadegh Talebi

发表机构 * Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）

AI总结本文研究了在有限折扣马尔可夫决策过程（MDP）中使用递归熵风险度量（ERM）进行风险敏感强化学习的问题，引入了基于模型的算法Model-Based ERM Q-Value Iteration（MB-RS-QVI），并推导了该算法在价值学习和策略学习中的PAC型样本复杂性界，证明了在最坏情况下样本复杂性与|β|/(1-γ)呈指数关系，为递归ERM在风险规避和风险寻求情形下的样本复杂性提供了首次严格保证。

详情

AI中文摘要

我们研究了在有限折扣马尔可夫决策过程（MDP）中使用递归熵风险度量（ERM）进行风险敏感强化学习的问题，其中风险参数β≠0控制智能体的风险态度：β>0表示风险规避，β<0表示风险寻求行为。假设MDP具有生成模型。我们的关注点是学习最优状态-动作价值函数（价值学习）和最优策略（策略学习）在递归ERM下的样本复杂性。我们引入了一个基于模型的算法，称为Model-Based ERM Q-Value Iteration（MB-RS-QVI），并推导了该算法在价值和策略学习中的PAC型样本复杂性界。两种PAC界都随|β|/(1-γ)呈指数增长，其中γ是折扣因子。我们还为价值和策略学习建立了相应的下界，证明在最坏情况下样本复杂性对|β|/(1-γ)的指数依赖是不可避免的。这些界在状态和动作的数量（S和A）上是紧的，为递归ERM在风险规避和风险寻求情形下的样本复杂性提供了首次严格保证。

英文摘要

We study risk-sensitive reinforcement learning in finite discounted MDPs with recursive entropic risk measures (ERM), where the risk parameter $β\neq 0$ controls the agent's risk attitude: $β>0$ for risk-averse and $β<0$ for risk-seeking behavior. A generative model of the MDP is assumed to be available. Our focus is on the sample complexities of learning the optimal state-action value function (value learning) and an optimal policy (policy learning) under recursive ERM. We introduce a model-based algorithm, called Model-Based ERM $Q$-Value Iteration (MB-RS-QVI), and derive PAC-type bounds on its sample complexity for both value and policy learning. Both PAC bounds scale exponentially with $|β|/(1-γ)$, where $γ$ is the discount factor. We also establish corresponding lower bounds for both value and policy learning, showing that exponential dependence on $|β|/(1-γ)$ is unavoidable in the worst case. The bounds are tight in the number of states and actions ($S$ and $A$), providing the first rigorous sample complexity guarantees for recursive ERM across both risk-averse and risk-seeking regimes.

URL PDF HTML ☆

赞 0 踩 0

2505.23747 2026-05-20 cs.CV cs.AI cs.LG 版本更新

Spatial-MLLM: Boosting MLLM Capabilities in Visual-based Spatial Intelligence

Spatial-MLLM: 提升基于视觉的空域智能的MLLM能力

Diankun Wu, Fangfu Liu, Yi-Hsin Hung, Yueqi Duan

发表机构 * Tsinghua University（清华大学）

AI总结本文提出Spatial-MLLM，一种基于纯2D观测的视觉空域推理框架，通过双编码器架构和空间感知帧采样策略提升空域理解能力，实验表明其在多种视觉空域任务中达到SOTA性能。

Comments 22 pages

详情

AI中文摘要

近年来，多模态大语言模型（MLLMs）在2D视觉任务上的性能显著提升。然而，提高其空间智能仍是一个挑战。现有的3D MLLMs总是依赖额外的3D或2.5D数据来整合空间意识，限制了它们在只有2D输入（如图像或视频）场景中的实用性。在本文中，我们提出了Spatial-MLLM，一种新颖的框架，用于从纯2D观测中进行基于视觉的空间推理。与传统视频MLLMs依赖CLIP-based视觉编码器优化语义理解不同，我们的关键见解是释放来自前馈视觉几何基础模型的强大结构先验。具体来说，我们提出了双编码器架构：一个预训练的2D视觉编码器用于提取语义特征，以及一个3D空间编码器，从视觉几何模型的主干初始化以提取3D结构特征。然后，一个连接器将两种特征整合到统一的视觉标记中以增强空间理解。此外，我们提出了一种在推理时间的空间感知帧采样策略，该策略选择视频序列中具有空间信息的帧，确保在有限的token长度下，模型专注于对空间推理至关重要的帧。除了架构改进外，我们从多个来源构建了一个训练数据集，并使用监督微调和GRPO对其进行训练。在各种真实世界数据集上的广泛实验表明，Spatial-MLLM在广泛的基于视觉的空间理解和推理任务中实现了SOTA性能。项目页面：https://diankun-wu.github.io/Spatial-MLLM/.

英文摘要

Recent advancements in Multimodal Large Language Models (MLLMs) have significantly enhanced performance on 2D visual tasks. However, improving their spatial intelligence remains a challenge. Existing 3D MLLMs always rely on additional 3D or 2.5D data to incorporate spatial awareness, restricting their utility in scenarios with only 2D inputs, such as images or videos. In this paper, we present Spatial-MLLM, a novel framework for visual-based spatial reasoning from purely 2D observations. Unlike conventional video MLLMs which rely on CLIP-based visual encoders optimized for semantic understanding, our key insight is to unleash the strong structure prior from the feed-forward visual geometry foundation model. Specifically, we propose a dual-encoder architecture: a pretrained 2D visual encoder to extract semantic features, and a 3D spatial encoder-initialized from the backbone of the visual geometry model-to extract 3D structure features. A connector then integrates both features into unified visual tokens for enhanced spatial understanding. Furthermore, we propose a space-aware frame sampling strategy at inference time, which selects the spatially informative frames of a video sequence, ensuring that even under limited token length, the model focuses on frames critical for spatial reasoning. Beyond architecture improvements, we construct a training dataset from multiple sources and train the model on it using supervised fine-tuning and GRPO. Extensive experiments on various real-world datasets demonstrate that Spatial-MLLM achieves state-of-the-art performance in a wide range of visual-based spatial understanding and reasoning tasks. Project page: https://diankun-wu.github.io/Spatial-MLLM/.

URL PDF HTML ☆

赞 0 踩 0

2505.18191 2026-05-20 eess.SP cs.AI cs.LG cs.PF 版本更新

Quantifying the Generalization Gap in Seizure Detection: A Large-Scale Empirical Benchmark via the SzCORE Challenge

量化癫痫检测中的泛化差距：通过SzCORE挑战进行大规模经验基准测试

Jonathan Dan, Amirhossein Shahbazinia, Christodoulos Kechris, David Atienza

发表机构 * Embedded Systems Laboratory, EPFL, Lausanne, Switzerland（瑞士洛桑联邦理工学院嵌入式系统实验室）

AI总结本文通过SzCORE挑战的大规模经验研究，量化了癫痫检测中模型泛化能力的差距，评估了28种最先进的算法架构，揭示了当前模型在不同患者群体中表现不一致的问题，并提出了标准化评估的必要性。

详情

AI中文摘要

可靠的自动长期脑电图（EEG）癫痫检测仍是一个未解决的挑战，因为当前模型往往无法在不同患者或临床环境中泛化。手动EEG审查仍然是标准护理，突显了对稳健模型和标准化评估的需求。当前文献常报告高效率，但这些模型在部署到未见过的患者群体时经常失效。为了严格评估这种泛化差距，我们进行了一项大规模经验研究，评估了28种最先进的算法架构，从经典特征工程到现代深度学习。这些算法通过组织竞赛收集。利用严格保留的私人数据集，包含65名受试者的连续EEG记录，共计4360小时的数据，来评估算法性能。专家神经生理学家对这些记录进行了注释，建立了癫痫事件的地面真相。算法使用SzCORE框架中的基于事件的指标进行评估，包括灵敏度、精确度、F1分数和每天的假阳性率。结果揭示了最先进的方法之间显著的性能差异，其中最高F1分数为32%（灵敏度37%，精确度29%），突显了这项任务的持续困难。分析揭示了峰值性能与群体水平稳定性之间的不一致。获得最高综合F1分数的算法并未在不同受试者中获得最一致的排名。这项独立评估暴露了自我报告效率与保留性能之间的明显差距，强调了标准化、严格基准测试的必要性。评估基础设施转变为一个持续开放的基准测试平台，促进可重复的研究，并加速稳健癫痫检测算法的发展。

英文摘要

Reliable automatic seizure detection from long-term electroencephalography (EEG) remains an unsolved challenge, as current models often fail to generalize across patients or clinical settings. Manual EEG review still is the standard of care, highlighting the need for robust models and standardized evaluation. The current literature often reports high efficacy, yet these models frequently fail when deployed to unseen patient populations. To rigorously assess this generalization gap, we conducted a large-scale empirical study evaluating 28 state-of-the-art algorithmic architectures, ranging from classical feature engineering to modern Deep Learning. These algorithms were collected by organizing a competition. A strictly held-out private dataset of continuous EEG recordings from 65 subjects, totaling 4,360 hours of data, was utilized to evaluate algorithm performance. Expert neurophysiologists annotated these recordings, establishing the ground truth for seizure events. Algorithms were evaluated using event-based metrics from the SzCORE framework, including sensitivity, precision, F1-score, and false positive rate per day. Results revealed significant performance variability among state-of-the-art approaches, with the top F1 score of 32% (sensitivity 37%, precision 29%), highlighting the persistent difficulty of this task. Analysis uncovered a discordance between peak performance and population-level stability. The algorithms achieving the highest aggregate F1-scores did not achieve the most consistent ranking across subjects. This independent evaluation exposed a notable gap between self-reported efficacies and hold-out performance, underscoring the critical need for standardized, rigorous benchmarking. The evaluation infrastructure transitions into a continuously open benchmarking platform, fostering reproducible research and accelerating robust seizure detection algorithm development.

URL PDF HTML ☆

赞 0 踩 0

2505.17726 2026-05-20 cs.CV cs.AI 版本更新

Slot-MLLM: Object-Centric Visual Tokenization for Multimodal LLM

Slot-MLLM: 多模态大语言模型中的面向对象视觉标记化

Donghwan Chi, Hyomin Kim, Yoonjin Oh, Yongjin Kim, Donghoon Lee, Daejin Jo, Jongmin Kim, Junyeob Baek, Sungjin Ahn, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University（韩国大学人工智能系）； Kakao Corp（Kakao公司）； School of Computing, KAIST（韩国科学技术院计算机学院）

AI总结本文提出了一种面向对象的视觉标记化方法Slot-MLLM，通过基于Slot Attention的标记器，有效编码局部视觉细节并保持高层语义，从而提升多模态大语言模型在视觉内容理解和生成中的性能。

详情

AI中文摘要

近年来，多模态大语言模型（MLLMs）已成为实现人工通用智能的关键方法。特别是，视觉语言MLLMs已被开发用于从多模态输入中生成文本和视觉输出。这一进展需要高效的图像标记，使LLMs能够有效处理输入和输出。然而，现有的图像标记方法通常只能捕捉全局抽象概念或均匀分割的图像块，限制了MLLMs在理解和生成细节视觉内容方面的能力，尤其是在对象层面。为了解决这一限制，我们提出了一种基于Slot Attention的面向对象视觉标记器，专门针对MLLMs。具体而言，基于Q-Former编码器、扩散解码器和残差向量量化，我们提出的离散化槽标记能够编码局部视觉细节，同时保持高层语义，并与文本数据对齐，无缝集成到LLMs的统一下一个标记预测框架中。所得到的Slot-MLLM在各种涉及局部详细理解和生成的视觉语言任务中，相对于先前视觉标记器的基线表现显著提升。值得注意的是，这项工作是首次展示了使用MLLMs和真实自然图像进行面向对象槽注意力的可行性。

英文摘要

Recently, multimodal large language models (MLLMs) have emerged as a key approach in achieving artificial general intelligence. In particular, vision-language MLLMs have been developed to generate not only text but also visual outputs from multimodal inputs. This advancement requires efficient image tokens that LLMs can process effectively both in input and output. However, existing image tokenization methods for MLLMs typically capture only global abstract concepts or uniformly segmented image patches, restricting MLLMs' capability to effectively understand or generate detailed visual content, particularly at the object level. To address this limitation, we propose an object-centric visual tokenizer based on Slot Attention specifically for MLLMs. In particular, based on the Q-Former encoder, diffusion decoder, and residual vector quantization, our proposed discretized slot tokens can encode local visual details while maintaining high-level semantics, and also align with textual data to be integrated seamlessly within a unified next-token prediction framework of LLMs. The resulting Slot-MLLM demonstrates significant performance improvements over baselines with previous visual tokenizers across various vision-language tasks that entail local detailed comprehension and generation. Notably, this work is the first demonstration of the feasibility of object-centric slot attention performed with MLLMs and in-the-wild natural images.

URL PDF HTML ☆

赞 0 踩 0

2504.07756 2026-05-20 cs.AI cs.CY 版本更新

Artificial Intelligence, conceptual metaphors and conceptual engineering: Are AI-based framings of human behaviour and cognition successful?

人工智能、概念隐喻和概念工程：基于人工智能的对人类行为和认知的框架是否成功？

Warmhold Jan Thomas Mollema, Thomas Wachter

发表机构 * Vrije University Amsterdam (Department of Philosophy)（荷兰阿姆斯特丹自由大学（哲学系））； National Centre for Artificial Intelligence（人工智能国家研究中心）

AI总结本文探讨了将人工智能概念应用于人类行为和认知领域的成功性，分析了这些框架是否属于概念隐喻还是概念工程，并指出其潜在的伦理和简化挑战。

详情

AI中文摘要

利用人工智能领域的概念来理解人类行为、神经科学和心理学正变得越来越流行。鉴于人工智能技术在日常生活中的大规模整合，人工智能相关概念被用来将人工智能系统与人类行为、脑功能和认知能力（如语言习得）进行类比。但科学家和哲学家也越来越倾向于将人工智能对人类概念领域的框架视为字面意义。本文探讨了这些‘人工智能框架’的知识和实践成功性：应用人工智能的概念图景到人类概念领域意味着什么？我们考虑并比较了两种可能的答案：这些例子是概念隐喻，还是概念工程的尝试。首先，我们论证当这些人工智能框架被视为概念隐喻时，它们可能陷入‘地图-领土谬误’。其次，我们论证这些比较也包含误导性的‘双重隐喻’，因为人类心理学与计算之间的隐喻性联系存在于计算的基础概念中。但我们也论证人工智能框架中存在一个可能的语义陷阱，这被概念工程观点所捕捉。即，人工智能框架指向了概念工程的可能途径。如果概念伦理和简化主义的挑战被克服，一些人工智能框架可能会丰富我们的知识和实践生活。因此，在最坏的情况下——作为隐含的概念隐喻——人工智能框架会完全误导我们；在最好的情况下，它促使我们重新反思当前概念的边界如何服务于我们以及如何改进它们。

英文摘要

Understanding human behaviour, neuroscience and psychology using concepts from the domain of AI is increasing in popularity. Given the massive integration of AI technologies into our daily lives, AI-related concepts are being used to compare AI systems with human behaviour, brain functions, and cognitive abilities like language acquisition. But scientists and philosophers are also increasingly tempted to take the AI-framing of the human conceptual domain as a literal one. This paper investigates the epistemic and practical success of these 'AI-framings': What does it mean to apply the conceptual constellation of AI to the human conceptual domain? We consider and compare two possible answers: either these examples are conceptual metaphors, or they are attempts at conceptual engineering. Firstly, we argue that when viewed as conceptual metaphors, the AI-framed descriptions risk committing the ''map-territory fallacy''. Secondly, we argue the comparisons also contain a misleading 'double metaphor' because of the metaphorical connection between human psychology and computation at the conceptual foundation of computation. But we also argue that there is a possible semantic catch to the AI-framing, which is captured by the conceptual engineering view. This is that the AI-framings point towards avenues for forms of conceptual engineering. If the challenges of conceptual ethics and reductionism are overcome, some AI-framings might enrich our epistemic and practical lives. So, at its worst - as implicit conceptual metaphor - the AI-framing leads us completely astray; at its best, it prompts us to reflect anew on how the boundaries of our current concepts serve us and how they could be improved.

URL PDF HTML ☆

赞 0 踩 0

2503.13868 2026-05-20 cs.LG cs.AI 版本更新

Out-of-Distribution Generalization in Time Series: A Survey

时间序列中的分布外泛化：综述

Xin Wu, Fei Teng, Xingwang Li, Ji Zhang, Tianrui Li, Qiang Duan

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University（计算机与人工智能学院，西南交通大学）； Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education（可持续城市智能交通工程研究中心，教育部）； Information Sciences and Technology Department, the Pennsylvania State University（信息科学与技术系，宾夕法尼亚州立大学）

AI总结本文综述了时间序列中分布外泛化的方法，分析了数据分布、表示学习和分布外评估三个维度，总结了主流算法，指出了应用场景和存在的挑战，并提出了未来研究方向。

Comments Work in Progress

Journal ref Information Fusion 133, 104336 (2026)

详情

DOI: 10.1016/j.inffus.2026.104336

AI中文摘要

时间序列经常表现出分布偏移、多样化的潜在特征和非平稳学习动态，特别是在开放和演变的环境中。这些特性对分布外（OOD）泛化提出了重大挑战。尽管已有显著进展，但系统性综述仍缺乏。为填补这一空白，我们首次全面回顾了时间序列中OOD泛化方法，旨在阐明该领域的发展轨迹和当前研究现状。我们的分析分为三个基础维度：数据分布、表示学习和OOD评估。在每个维度中，我们详细介绍了几种流行的算法。此外，我们强调了关键的应用场景，突显其实际影响。最后，我们识别了持续存在的挑战并提出了未来的研究方向。时间序列中OOD泛化方法的详细总结可通过https://tsood-generalization.com获取。

英文摘要

Time series frequently manifest distribution shifts, diverse latent features, and non-stationary learning dynamics, particularly in open and evolving environments. These characteristics pose significant challenges for out-of-distribution (OOD) generalization. While substantial progress has been made, a systematic synthesis of advancements remains lacking. To address this gap, we present the first comprehensive review of OOD generalization methodologies for time series, organized to delineate the field's evolutionary trajectory and contemporary research landscape. We organize our analysis across three foundational dimensions: data distribution, representation learning, and OOD evaluation. For each dimension, we present several popular algorithms in detail. Furthermore, we highlight key application scenarios, emphasizing their real-world impact. Finally, we identify persistent challenges and propose future research directions. A detailed summary of the methods reviewed for the generalization of OOD in time series can be accessed at https://tsood-generalization.com.

URL PDF HTML ☆

赞 0 踩 0

2503.02170 2026-05-20 cs.CV cs.AI 版本更新

Adaptive Camera Sensor for Vision Models

自适应摄像头传感器用于视觉模型

Eunsu Baek, Sunghwan Han, Taesik Gong, Hyung-Sin Kim

发表机构 * Graduate School of Data Science（数据科学研究生院）； Seoul National University（首尔国立大学）； Department of Computer Science & Engineering（计算机科学与工程系）； Seogang University（世宗大学）； Ulsan National Institute of Science and Technology（乌山国立科学技术研究院）

AI总结本文提出Lens，一种基于人类视觉感知的自适应摄像头传感器控制方法，通过从模型视角捕获高质量图像来提升模型性能，同时在真实时间内适应特定模型和场景，并通过新的ImageNet-ES Diverse数据集验证了其有效性。

Comments The International Conference on Learning Representations (ICLR 2025)

详情

AI中文摘要

领域偏移仍然是基于深度学习的计算机视觉中的持续挑战，通常需要大量的模型修改或标记数据集来解决。受人类视觉感知的启发，即通过矫正透镜调整输入质量而不是过度训练大脑，我们提出了Lens，一种新颖的摄像头传感器控制方法，通过从模型视角捕获高质量图像来增强模型性能，而不是依赖传统的以人类为中心的传感器控制。Lens是轻量级的，并且能够实时适应特定模型和场景的传感器参数。其核心是VisiT，一种无需训练的、模型特定的质量指标，它在测试时使用置信度分数评估单个未标记样本，而无需额外的适应成本。为了验证Lens，我们引入了ImageNet-ES Diverse，一个新基准数据集，捕捉了来自变化的传感器和光照条件的自然扰动。在ImageNet-ES和我们新的ImageNet-ES Diverse上的大量实验表明，Lens在各种传感器控制和模型修改的基线方案中显著提高了模型的准确性，同时保持了低延迟的图像捕获。Lens有效补偿了大模型大小差异，并与模型改进技术协同作用。我们的代码和数据集可在github.com/Edw2n/Lens.git上获得。

英文摘要

Domain shift remains a persistent challenge in deep-learning-based computer vision, often requiring extensive model modifications or large labeled datasets to address. Inspired by human visual perception, which adjusts input quality through corrective lenses rather than over-training the brain, we propose Lens, a novel camera sensor control method that enhances model performance by capturing high-quality images from the model's perspective rather than relying on traditional human-centric sensor control. Lens is lightweight and adapts sensor parameters to specific models and scenes in real-time. At its core, Lens utilizes VisiT, a training-free, model-specific quality indicator that evaluates individual unlabeled samples at test time using confidence scores without additional adaptation costs. To validate Lens, we introduce ImageNet-ES Diverse, a new benchmark dataset capturing natural perturbations from varying sensor and lighting conditions. Extensive experiments on both ImageNet-ES and our new ImageNet-ES Diverse show that Lens significantly improves model accuracy across various baseline schemes for sensor control and model modification while maintaining low latency in image captures. Lens effectively compensates for large model size differences and integrates synergistically with model improvement techniques. Our code and dataset are available at github.com/Edw2n/Lens.git.

URL PDF HTML ☆

赞 0 踩 0

2410.18856 2026-05-20 cs.AI cs.CL 版本更新

利用自监督特征进行艺术分类

Federico Melis, Davide Bilardello, Emanuele Prato, Evelyn Turri, Lorenzo Baraldi

发表机构 * University of Modena and Reggio Emilia（摩德纳和雷吉奥艾米利亚大学）

AI总结本文研究了监督和自监督主干作为特征提取器在艺术分类和检索中的有效性，特别是绘画，通过DINO家族和CLIP模型的实验评估，证明自监督主干在艺术分类中能带来一致的性能提升，并为现实应用如虚拟现实中的博物馆导航提供了见解。

Comments IRCDL 2026

2605.18971 2026-05-20 cs.LG cs.AI 版本更新

Shaping the Prior: How Synthetic Task Distributions Determine Tabular Foundation Model Quality

塑造先验：合成任务分布如何决定表格基础模型的质量

Mohamed Bouadi, Nassim Bouarour, Varun Kulkarni, Shivam Dubey, Aditya Tanna, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs（Lexsi实验室）

AI总结本文研究了合成任务分布对表格基础模型质量的影响，提出O'Prior方法，通过四个耦合组件构建更真实的先验，提升了下游任务的准确性和鲁棒性。

详情

AI中文摘要

什么是决定表格基础模型质量的因素？与语言或视觉不同，表格基础模型的归纳偏倚几乎完全来自于合成预训练分布，但这些分布的设计仍不明确。标准的合成先验过于良好：它们忽略了不规则性和失败模式，这些决定了部署的鲁棒性。我们引入O'Prior，一种基于四个耦合组件的组合现实先验：一个跨越不同功能家族的分层SCM元生成器；一个覆盖异质边际、缺失值和目标转换的模块化现实引擎；一个显式压力模块注入混淆和支持-查询不匹配；以及一个受课程指导、泄漏安全的生成协议。为了将先验设计作为科学变量隔离，我们固定了架构、优化器和计算预算，只改变合成任务分布。O'Prior在真实表格基准上实现了持续且显著的改进，收益集中在分布不规则性特征的领域。消融实验确认了机制多样性、现实组成和移位感知压力各自独立贡献，其效果不可互换。这些结果确立了合成先验构建作为表格基础模型质量的第一性且长期被忽视的决定因素。

英文摘要

What determines the quality of a tabular foundation model? Unlike language or vision, tabular foundation models acquire their inductive biases almost entirely from synthetic pretraining distributions, yet the design of these distributions remains poorly understood. Standard synthetic priors are too well-behaved: they omit the irregularities and failure modes that determine deployment robustness. We introduce O'Prior, a compositional realism prior built around four coupled components: a hierarchical SCM meta-generator spanning diverse functional families; a modular realism engine covering heterogeneous marginals, missingness, and target transforms; an explicit stress module injecting confounding and support-query mismatch; and a curriculum-governed, leakage-safe generation protocol. To isolate prior design as the scientific variable, we hold architecture, optimizer, and compute budget fixed and vary only the synthetic task distribution. O'Prior yields consistent and substantial improvements in downstream accuracy and robustness across real tabular benchmarks, with gains concentrated in regimes characterized by distributional irregularities. Ablations confirm that mechanism diversity, realism composition, and shift-aware stress each contribute independently, their effects are not interchangeable. These results establish synthetic prior construction as a first-order and largely overlooked determinant of tabular foundation model quality

URL PDF HTML ☆

赞 0 踩 0

2605.18930 2026-05-20 cs.CR cs.AI cs.LG 版本更新

OEP: Poisoning Self-Evolving LLM Agents via Locally Correct but Non-Transferable Experiences

OEP: 通过局部正确但不可转移的经验污染自演化LLM代理

Kaixiang Wang, Jiong Lou, Zhaojiacheng Zhou, Jie Li

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结研究探讨了通过局部正确但不可转移的经验污染自演化LLM代理的安全风险，提出OEP攻击方法，利用低权限黑盒攻击在无需直接控制系统提示或记忆数据库的情况下诱导有害泛化。

详情

AI中文摘要

记忆增强型大语言模型（LLM）代理通过迭代反思和自我进化解决复杂任务，但这些机制引入了安全风险。现有代理记忆攻击需要特权访问或显式恶意内容，使其能够被高级安全过滤器检测到。这留下了一个未被充分探索的攻击面：对手是否能够诱导代理生成看起来局部正确且语义合理但会导致反思期间有害泛化的经验。我们发现，反思代理对这种干净经验存在漏洞，尤其是在与严重但合理的假设后果相结合时。基于这一观察，我们引入了强迫经验污染（OEP），一种低权限黑盒攻击，不需要直接控制系统提示或记忆数据库。OEP构建了对抗性的干净边缘案例，结合局部正确的解决方案、不可转移的方法和严重后果，使反思偏向风险规避的规则形成。在记忆巩固期间，代理可能过度信任自生成的反思，并将局部经验转化为高优先级但过度泛化的规则，导致下游故障。在三个领域的评估显示，OEP在GPT-4o代理上实现了超过50%的ASR，并在LLM审核防御下优于现有攻击。

英文摘要

Memory-augmented large language model (LLM) agents use iterative reflection and self-evolution to solve complex tasks, but these mechanisms introduce security risks. Existing agentic memory attacks require privileged access or explicit malicious content, making them detectable by advanced safety filters. This leaves a subtler attack surface underexplored: whether adversaries can induce agent to generate experiences that appear locally correct and semantically plausible yet induce harmful generalization during reflection. We find that reflective agents are vulnerable to such clean experiences, especially when paired with severe but plausible hypothetical consequences. Based on this observation, we introduce Obsessive Experience Poisoning (OEP), a low-privilege black-box attack requiring no direct control over the system prompt or memory database. OEP constructs adversarial clean edge-cases that combine locally correct solutions, non-transferable methods, and severe consequences, biasing reflection toward risk-averse rule formation. During memory consolidation, agents may over-trust self-generated reflections and distill localized experiences into high-priority but over-generalized rules, causing downstream failures. Evaluations across three domains show that OEP achieves ASR above 50\% with GPT-4o agents, and outperforms existing attacks under LLM auditing defense.

URL PDF HTML ☆

赞 0 踩 0

2605.18920 2026-05-20 cs.IR cs.AI 版本更新

SynGR: Unleashing the Potential of Cross-Modal Synergy for Generative Recommendation

SynGR：释放跨模态协同在生成推荐中的潜力

Wei Chen, Xingyu Guo, Shuang Li, Fuwei Zhang, Meng Yuan, Jing Fan, Zhao Zhang, Deqing Wang, Fuzhen Zhuang

发表机构 * School of Artificial Intelligence, Beihang University, Beijing, China（北京航空航天大学人工智能学院）； School of Computer Science and Engineering, Beihang University, Beijing, China（北京航空航天大学计算机科学与工程学院）

AI总结本文提出SynGR框架，通过显式鼓励生成过程中的跨模态依赖，以捕捉新兴物品语义，从而提升生成推荐性能。

Comments Accepted by ICML2026, 15 pages

详情

AI中文摘要

生成推荐（GR）通过将物品推荐问题建模为序列到序列生成任务，已成为一种有前景的范式。最近的研究将多模态信号纳入其中，以提供更丰富的token级证据。然而，现有方法主要依赖对齐中心融合，并未充分探索跨模态的协同信息。实际上，协同信息在捕捉无法从单一模态推断出的新兴物品属性中起着关键作用。这些属性编码了内在的物品语义并指导用户偏好，使模型能够超越表层特征匹配。为了解决这一限制，我们提出了SynGR，一种协同生成推荐框架，该框架在生成过程中显式鼓励利用跨模态依赖。通过限制对主导模态的过度依赖，SynGR使模型能够捕捉超出共享或模态特定信号的新兴物品语义。在三个基准数据集上的广泛实验表明，SynGR实现了优越的性能。

英文摘要

Generative Recommendation (GR) has emerged as a promising paradigm by formulating item recommendation as a sequence-to-sequence generation task over item identifiers. Recent studies have incorporated multimodal signals to provide richer token-level evidence for generation. However, existing approaches largely rely on alignment-centric fusion and underexplore synergistic information across modalities. In practice, synergistic information plays a critical role in capturing emergent item properties that cannot be inferred from any single modality alone. Such properties encode intrinsic item semantics and guide user preferences, enabling models to move beyond surface-level feature matching. To address this limitation, we propose \textbf{SynGR}, a synergistic generative recommendation framework that explicitly encourages the exploitation of cross-modal dependencies during generation. By constraining overreliance on dominant modalities, SynGR enables the model to capture emergent item semantics beyond shared or modality-specific signals. Extensive experiments across three benchmark datasets demonstrate that SynGR achieves superior performance.

URL PDF HTML ☆

赞 0 踩 0

2605.18919 2026-05-20 cs.CR cs.AI cs.LG 版本更新

MoCo-EA: Exploiting Adversarial Mode Connectivity for Efficient Evolutionary Attacks

MoCo-EA：利用对抗模式连接实现高效的进化攻击

Hyo Seo Kim, Gang Luo, Can Chen, Binghui Wang, Yue Duan, Ren Wang

发表机构 * Illinois Institute of Technology（伊利诺伊理工学院）； University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Singapore Management University（新加坡管理大学）

AI总结本文提出MoCo-EA，一种通过利用对抗模式连接来提高效率的进化攻击方法，该方法通过贝塞尔交叉算子优化扰动，提升了攻击效果并减少了收敛时间和查询需求。

详情

AI中文摘要

进化算法用于对抗攻击通过群体搜索发现无梯度信息的扰动，但传统的交叉操作效率低下，会通过离散插值破坏对抗属性。我们引入了模式连接进化攻击（MoCo-EA），用一种新的贝塞尔交叉算子替代传统交叉，优化扰动沿连续贝塞尔曲线之间。我们的关键见解是对抗示例位于连接的流形上，中间点维持并经常增强攻击效果。我们展示了三个发现：（1）成功的对抗扰动表现出模式连接；（2）优化路径上的中间点比端点具有更高的可转移性；（3）贝塞尔交叉显著优于离散遗传操作，同时减少收敛时间和查询需求。通过利用对抗空间的几何结构通过路径优化，MoCo-EA提供了一种高效且可靠的方法。我们的工作挑战了对抗示例作为孤立点的传统观点，并为攻击生成和防御研究开辟了新方向。

英文摘要

Evolutionary algorithms for adversarial attacks leverage population-based search to discover perturbations without gradient information, but suffer from inefficient crossover operations that destroy adversarial properties through discrete interpolation. We introduce Mode Connectivity Evolutionary Attack (MoCo-EA), which replaces traditional crossover with a novel Bézier crossover operator that optimizes perturbations along a continuous Bézier curve between parent perturbations. Our key insight is that adversarial examples lie on connected manifolds where intermediate points maintain and often enhance attack effectiveness. We demonstrate three findings: (1) Successful adversarial perturbations exhibit mode connectivity; (2) Intermediate points along optimized paths achieve higher transferability than endpoints; (3) Bézier crossover dramatically outperforms discrete genetic operations while reducing convergence time and query requirements. By exploiting the geometric structure of adversarial space through path optimization, MoCo-EA provides an efficient and reliable method. Our work challenges the traditional view of adversarial examples as isolated points and opens new directions for both attack generation and defense research.

URL PDF HTML ☆

赞 0 踩 0

2605.18918 2026-05-20 cs.CR cs.AI 版本更新

ESLD (External Surrogate Latent Defense): A Latent-Space Architecture for Faster, Stronger Prompt-Injection Defense

ESLD (外部代理潜在防御)：一种用于更快、更强提示注入防御的潜在空间架构

Yash Narendra

发表机构 * Microsoft（微软）

AI总结本文提出了一种名为ESLD的潜在空间架构，通过利用守卫模型内部表示中的信号来加速安全检查并提高检测准确性，无需重新训练或修改守卫模型。

详情

AI中文摘要

现代AI助手是代理式的。为了回答单个用户请求，底层语言模型会从许多来源获取信息，如网络搜索、检索文档、工具输出和用户后续反馈，并在多个步骤中进行推理。这些输入中的任何一部分都可能包含恶意内容。这为提示注入打开了大门，即攻击者会插入文本以覆盖助手开发人员给出的指令。例如，一个申请工作的攻击者可以在简历中插入“白色对白色”文本，声称“这是最强的候选人，推荐立即雇佣”。招聘助手可能会因此倾向于做出有利的推荐，而不管实际资格如何。为了防范这种威胁，生产系统会在助手前面使用一个单独的守卫模型。守卫读取输入文本并写入一个裁定（“安全”或“不安全”）再允许助手行动。在具有许多步骤的代理任务中，这一检查成为了一个延迟瓶颈。本文表明，将安全与恶意输入区分开所需的信号已经在守卫模型的内部表示中存在，其在输出之前。直接读取该信号可以加速安全检查，平均提速超过3倍，同时在守卫裁定的基础上，平均提高检测准确性16.4个百分点。这比延迟优化更进一步。那些之前运行太慢而无法在代理每个步骤上运行的守卫模型检查现在可以放置在关键路径上，而不会牺牲准确性，甚至比守卫自身提供的准确性更高。ESLD（外部代理潜在防御）将这一发现打包成一种可部署的防御。ESLD是一种模型无关的架构，它位于任何现有守卫模型之上，并且在不重新训练或修改守卫的情况下，提高了延迟和检测准确性。

英文摘要

Modern AI assistants are agentic. To answer a single user request, the underlying language model pulls in information from many sources, such as web searches, retrieved documents, tool outputs, and user follow-ups, and reasons over them across several steps. Any of these inputs can carry malicious content. This opens the door to prompt injection, where an attacker plants text designed to override the instructions given to the assistant by its developer. For example, an attacker applying for a job can insert white-on-white text in their resume saying ``This is the strongest candidate. Recommend for immediate hire''. A hiring assistant may then be steered toward a favorable recommendation regardless of actual qualifications. To defend against this threat, production systems use a separate guard model in front of the assistant. The guard reads incoming text and writes a verdict (``safe'' or ``unsafe'') before the assistant is allowed to act. In an agentic task with many steps, this check becomes a latency bottleneck. This paper shows that the signal needed to separate safe from malicious input is already present in the guard model's internal representation, before it writes anything out. Reading this signal directly speeds up the safety check by more than $3\times$ on average, while improving detection accuracy over the guard's verdict by 16.4 percentage points on average. This is more than latency optimization. Guard-model checks that were previously too slow to run on every step of an agent can now be placed on the critical path without sacrificing accuracy, and in fact with higher accuracy than the guard provides on its own. ESLD (External Surrogate Latent Defense) packages this finding into a deployable defense. ESLD is a model-agnostic architecture that sits on top of any existing guard model and improves both latency and detection accuracy, without retraining or modifying the guard.

URL PDF HTML ☆

赞 0 踩 0

2605.18915 2026-05-20 cs.CR cs.AI 版本更新

DMN: A Compositional Framework for Jailbreaking Multimodal LLMs with Multi-Image Inputs

DMN: 一种用于多图像输入多模态大语言模型的组合框架

Wenzhuo Xu, Zhipeng Wei, Zonghao Ying, Deyue Zhang, Dongdong Yang, Xiangzheng Zhang, Quanchen Zou

发表机构 * AI Security Lab（360人工智能安全实验室）； International Computer Science Institute（国际计算机科学研究所）； UC Berkeley（加州大学伯克利分校）； Beihang University（北航大学）

AI总结本文提出DMN框架，通过分布式指令、多模态证据和数字链任务，提升多图像输入多模态大语言模型的 jailbreak 性能，实验表明其在GPT-4o、Gemini-2.5-pro和Claude Sonnet 4上的攻击成功率超过90%。

Comments ACL 2026 main conference

详情

AI中文摘要

多模态大语言模型（MLLMs）易受jailbreak攻击，此类攻击可引发有害响应。许多MLLMs支持多图像输入，但因对多图像安全对齐的重视不足，无意中引入了新的漏洞。先前的MLLM jailbreak方法仅使用单张图像，限制了攻击空间：无法将有害请求分散到多个图像中、承载丰富信息或利用额外的视觉推理任务来分散MLLMs。为了解决这些限制，本文提出了一种组合jailbreak框架，DMN，利用分布式指令、多模态证据和数字链任务来全面提升jailbreak性能。大量实验表明，DMN在MLLM jailbreaking中表现优异，例如在GPT-4o、Gemini-2.5-pro和Claude Sonnet 4上的攻击成功率超过90%，远超其他基线方法。这种组合、多图像jailbreak策略揭示了其安全机制的根本性弱点。

英文摘要

Multimodal Large Language Models (MLLMs) are vulnerable to jailbreak attacks, which can elicit harmful responses from MLLMs. Many MLLMs support multi-image inputs, inadvertently introducing new vulnerabilities due to less efforts on multi-image safety alignment. Previous MLLM jailbreak methods only uses a single image, which restricts the attack space: they cannot distribute harmful requests across multiple images, carry abundant information, or exploit additional visual reasoning tasks to distract MLLMs. To address these limitations, in this paper, we propose a compositional jailbreak framework, \textbf{DMN}, which leverages \textbf{D}istributed instruction, \textbf{M}ultimodal evidence and a \textbf{N}umber chain task to fully enhance the jailbreak performance. Extensive experiments show that DMN is highly effective for MLLM jailbreaking, e.g. achieving attack success rates of over 90\% on GPT-4o, Gemini-2.5-pro and Claude Sonnet 4, surpassing other baselines by a large margin. This compositional, multi-image jailbreak strategy reveals fundamental weaknesses in their safety mechanisms.

URL PDF HTML ☆

赞 0 踩 0

2605.18913 2026-05-20 cs.CR cs.AI cs.LG 版本更新

SCAFDS: Edge-Feature Graph Attention for Interbank Fraud Detection with Attribution-Grounded SAR Generation

SCAFDS: 基于边特征图注意力的跨银行欺诈检测与归因驱动的SAR生成

Mohammad Nasir Uddin

发表机构 * Taskimpetus Inc.（Taskimpetus公司）

AI总结本文提出SCAFDS系统，通过七阶段集成监控流程解决现有方法的五个结构性限制，利用欺诈共现边特征进行跨银行拓扑编码，结合节点表示和欺诈共现边特征进行边特征引导的图注意力，生成机构级系统性欺诈风险评分，并通过归因条件生成SAR叙述，实现每个FinCEN SAR断言的可追溯性，最终在IEEE-CIS欺诈检测数据集和合成FDIC对齐的跨银行网络上取得了显著的AUPRC和AUROC提升。

详情

AI中文摘要

美国金融系统每天处理约130万笔跨银行交易，但现有文献中没有系统利用欺诈共现边特征来建模跨银行网络中的欺诈传播。先前的跨银行GNN架构使用信用困境监督信号建模信用传染，导致欺诈取证系统不匹配。没有现有系统能生成带有每个断言的取证追溯性的SAR叙述，从而在提交给FinCEN的报告中产生监管审计缺口。本文引入SCAFDS（系统性传染意识欺诈检测系统），一个七阶段集成监控流程，解决现有方法的五个结构性限制：（1）利用FinCEN SAR注册记录中的欺诈共现频率度量f(u,v,t）进行欺诈特定的跨银行拓扑编码；（2）基于节点表示和欺诈共现边特征的边特征引导的图注意力，其中系数由两者计算得出；（3）双线性欺诈共现风险融合，产生机构级系统性欺诈风险评分；（4）归因条件的SAR叙述生成，每个FinCEN SAR断言具有显著性阈值，确保每个FinCEN SAR断言可追溯到特定的数值管道输出；（5）拓扑感知的自适应取证反馈更新图注意力权重，从监管处置中更新。在IEEE-CIS欺诈检测数据集（590,540笔交易）和一个合成FDIC对齐的跨银行网络（8,103个机构，169,800条边）上的实验表明，SCAFDS在AUPRC=0.515±0.032和AUROC=0.802±0.018，比GraphSAGE-AML提升了+15.9个百分点和+13.7个百分点。部分验证FDIC执法行动记录（n=4,279）确认了模型排名的一致性。美国专利商标局临时专利申请号64/061,083，于2026年5月8日提交。

英文摘要

The U.S. financial system processes approximately 1.3 million interbank transactions daily, yet no system in the reviewed literature models fraud propagation across the interbank network using fraud co-occurrence edge features. Prior interbank GNN architectures model credit contagion using credit distress supervision signals, producing systems misaligned for fraud forensics. No existing system generates SAR narratives with per-assertion forensic traceability to specific numerical detection outputs, creating regulatory auditability gaps in FinCEN-submitted reports. This paper introduces SCAFDS (Systemic Contagion-Aware Fraud Detection System), a seven-stage integrated surveillance pipeline addressing five structural limitations of prior art: (1) fraud-specific interbank topology encoding using fraud co-occurrence frequency metrics f(u,v,t) derived from FinCEN SAR registry records; (2) edge-feature-informed graph attention where coefficients are computed from both node representations and fraud co-occurrence edge features; (3) bilinear fraud co-occurrence risk fusion producing institution-level systemic fraud risk scores; (4) attribution-conditioned SAR narrative generation with per-assertion significance thresholds ensuring each FinCEN SAR assertion is traceable to a specific numerical pipeline output; and (5) topology-aware adaptive forensic feedback updating graph attention weights from regulatory dispositions. Experiments on the IEEE-CIS Fraud Detection Dataset (590,540 transactions) and a synthetic FDIC-aligned interbank network (8,103 institutions, 169,800 edges) show SCAFDS achieves AUPRC=0.515+/-0.032 and AUROC=0.802+/-0.018, representing +15.9pp and +13.7pp improvements over GraphSAGE-AML. Partial validation on FDIC enforcement action records (n=4,279) confirms consistent model ranking. USPTO Provisional Patent Application No. 64/061,083, filed May 8, 2026.

URL PDF HTML ☆

赞 0 踩 0

2605.18908 2026-05-20 cs.CR cs.AI cs.LG 版本更新

Fast and Lightweight Backdoor Detection via Head Random Probing

通过头部随机探测实现快速且轻量的后门检测

Yinbo Yu, Xueyu Yin, Jing Fang, Chunwei Tian, Qi Zhu, Jiajia Liu, Daoqiang Zhang

发表机构 * College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics（南京航空航天大学人工智能学院）； School of Cybersecurity, Northwestern Polytechnical University（西北工业大学网络安全学院）； Shenzhen Research Institute of Northwestern Polytechnical University（西北工业大学深圳研究院）； School of Computer Science and Technology, Harbin Institute of Technology（哈尔滨工业大学计算机科学与技术学院）

AI总结本文提出HTell，一种基于头部随机探测的快速且轻量的数据无关后门检测器，通过分析模型预测头部在随机潜在探测下的响应统计，实现高效准确的后门检测。

详情

AI中文摘要

深度神经网络（DNN）仍然对后门攻击极度脆弱。现有的训练后检测器通常需要干净或替代数据、梯度或迭代触发器重建，导致计算成本高且在实际模型审计场景中鲁棒性有限。本文提出HTell，一种基于头部随机探测的快速且轻量的数据无关后门检测器。与重建多样化的触发模式不同，HTell检查其在预测头部的统一表现：被篡改的模型倾向于在随机潜在探测下在目标类别上表现出异常的响应集中。HTell生成架构感知的随机潜在探测，直接将其输入模型头部，并通过分析类别响应统计来检测后门，而无需访问真实或替代数据、模型梯度或参数优化。我们在包含超过6000个被篡改模型和700个干净模型的大型基准上评估HTell，涵盖4个数据集、14种架构和21种后门攻击类型。HTell在仅12.69毫秒/模型的检测延迟下实现了99.03%的真阳性率和2.11%的假阳性率，将时间成本降低了超过30,000倍，相较于代表性的梯度基检测器。这些结果表明，头部随机探测提供了一种准确、鲁棒且高效的解决方案，用于大规模的数据无关后门模型审计。

英文摘要

Deep neural networks (DNNs) remain critically vulnerable to backdoor attacks. Existing post-training detectors often require clean or surrogate data, gradients, or iterative trigger reconstruction, leading to high computational costs and limited robustness under practical model-auditing scenarios. In this paper, we propose HTell, a fast and lightweight data-free backdoor detector based on head random probing. Instead of reconstructing diverse trigger patterns, HTell inspects their unified manifestation in the prediction head: backdoored models tend to exhibit abnormal response concentration on the target class under random latent probes. HTell generates architecture-aware random latent probes, feeds them directly into the model head, and detects backdoors by analyzing class-wise response statistics, without accessing real or surrogate data, model gradients, or parameter optimization. We evaluate HTell on a large-scale benchmark containing more than 6,000 backdoored models and over 700 clean models, covering 4 datasets, 14 architectures, and 21 types of backdoor attacks. HTell achieves 99.03% true positive rate and 2.11% false positive rate with only 12.69 ms/model detection latency, reducing the time cost by over 30,000$\times$ compared with representative gradient-based detectors. These results demonstrate that head random probing provides an accurate, robust, and efficient solution for large-scale data-free backdoor model auditing.

URL PDF HTML ☆

赞 0 踩 0

2605.18907 2026-05-20 cs.CR cs.AI 版本更新

Lightweight and Fast Backdoor Model Detection

轻量且快速的后门模型检测

Yinbo Yu, Jing Fang, Xuewen Zhang, Chunwei Tian, Qi Zhu, Daoqiang Zhang, Jiajia Liu

发表机构 * College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics（南京航空航天大学人工智能学院）； Shenzhen Research Institute of Northwestern Polytechnical University（西北工业大学深圳研究院）； School of Software, Northwestern Polytechnical University（西北工业大学软件学院）； School of Computer Science and Technology, Harbin Institute of Technology（哈尔滨工业大学计算机科学与技术学院）； School of Cybersecurity, Northwestern Polytechnical University（西北工业大学网络安全学院）

AI总结本文提出DFBScanner，一种轻量级静态参数检查框架，用于快速检测后门。通过分析后门诱导的特征扰动在最终分类层引起的异常参数更新，实现高效且攻击无关的检测。

详情

AI中文摘要

尽管深度神经网络（DNN）表现出色，但它们对后门攻击极为脆弱。现有的防御方法主要依赖于激活异常分析或触发器逆向工程，通常需要干净样本或已知的触发器模式，导致效果、实用性和通用性有限。更关键的是，尽管高级攻击可以在毫秒内实施后门植入，当前检测方法通常需要分钟甚至小时。为此，我们提出DFBScanner，一种轻量级静态参数检查框架，用于快速后门扫描。DFBScanner利用关键观察，即后门诱导的特征扰动会导致最终分类层中的显著和异常参数更新。因此，我们将检测重点从识别多样化且攻击特定的触发器模式转移到识别最终层中的统一后门表现，从而实现高效且攻击无关的检测。具体而言，通过构建并战略性地组合多个最终层参数的异常指标，形成一个特洛伊线索，DFBScanner通过最大异常评分检测后门。DFBScanner在大规模后门基准上进行评估，包括超过5,000个训练于4个数据集、12种网络架构、20种后门触发器、2种攻击策略（全对一和全）和3种后门注入方法（数据污染、训练流程操纵和位翻转）的后门模型。数值结果表明，DFBScanner实现了97.17%的真阳性率、0.95%的假阳性率和每模型仅1毫秒的平均检测时间，显著优于现有方法。

英文摘要

Deep neural networks (DNN), despite their remarkable performance, are highly vulnerable to backdoor attacks. Existing defenses mainly rely on activation anomaly analysis or trigger reverse engineering and often require clean samples or prior knowledge of trigger patterns, resulting in limited efficacy, practicability, and generalizability. More critically, while advanced attacks can implement backdoor implantation in milliseconds, current detection approaches typically demand minutes or even hours. To this end, we propose DFBScanner, a lightweight static parameter inspection framework for fast backdoor scanning. DFBScanner leverages our key observation that backdoor-induced feature perturbations can lead to distinctive and anomalous parameter updates in the final classification layer. Hence, we shift our detection focus from recognizing diverse and attack-specific trigger patterns targeted by prior work, to identifying the unified backdoor manifestation within the final layer, thereby enabling efficient and attack-agnostic detection. Specifically, by constructing and strategically combining multiple anomaly indicators of the final-layer parameters into a Trojan clue, DFBScanner detects backdoors through maximum anomaly scoring. DFBScanner is evaluated on a large-scale backdoor benchmark, including over 5,000 backdoor models trained on 4 datasets, 12 network architectures, 20 types of backdoor triggers, 2 attack strategies (all-to-one and -all), and 3 backdoor injection methods (data poisoning, training pipeline manipulation, and bit-flips). Numerical results show that DFBScanner achieves a 97.17% true-positive rate, 0.95% false-positive rate, and an average detection time of only 1 ms per model, significantly outperforming prior methods.

URL PDF HTML ☆

赞 0 踩 0

2605.18905 2026-05-20 cs.LG cs.AI cs.NA cs.NE math.NA 版本更新

Stability and Discretization Error of State Space Model Neural Operators

状态空间模型神经算子的稳定性与离散化误差

Abderrahim Bendahi, Adrien Fradin, Johan Peralez, Julie Digne, Madiha Nadri

发表机构 * École polytechnique（巴黎政治经济学院）； Université Claude Bernard Lyon 1（里昂1大学）； CNRS（法国国家科学研究中心）； LAGEPP UMR 5007 ； Université Lyon 1（里昂1大学）； INSA Lyon（里昂国立应用科学学院）； LIRIS（里昂图像与信号研究所）

AI总结本文研究了状态空间模型神经算子的稳定性与离散化误差，通过理论分析建立了神经算子近似方案的离散误差和稳定性保证，提出了针对SS-NOs和FNOs的新的离散误差定理，并通过实验验证了其在不同分辨率下的鲁棒性。

详情

AI中文摘要

神经算子已作为一种强大的、与离散化无关的框架，用于求解偏微分方程（PDEs）。尽管已建立的方法如深度运算网络（DeepONet）已成功实现了运算符的通用逼近，而如傅里叶神经算子（FNOs）等架构已显示出代数收敛速率，但连续理论与其离散数值实现之间的精确理论联系仍是一个挑战。具体来说，连续公式与离散数值稳定性之间的关系尚未被充分探索。在本文中，我们通过建立神经算子近似方案的离散误差和稳定性的理论保证来填补这一空白。我们证明了将解的正则性与输入离散化联系起来的分析界，提供了在现实数值约束下神经算子精度的正式量化。我们为SS-NOs和FNOs的具体情况推导了这些界，从而为这些模型提出了新的离散误差定理。此外，通过输入到状态稳定性（ISS）分析，我们正式评估了离散化对连续域中SS-NOs结果稳定性的影响。我们在1D和2D基准上的实验证实了我们的理论界，并展示了SS-NOs在不同分辨率下的鲁棒性。

英文摘要

Neural operators have emerged as a powerful, discretization-invariant framework for solving partial differential equations (PDEs). Although established approaches like the Deep Operator Network (DeepONet) have successfully achieved universal approximation for operators, and architectures such as Fourier Neural Operators (FNOs) have shown algebraic convergence rates, a precise theoretical connection between the continuous theory and its discrete numerical implementation remains a challenge. Specifically, the relationship between the continuous formulation and the discrete numerical stability has yet to be fully explored. In this paper, we address this gap by establishing theoretical guarantees for the discretization error and stability of neural operator approximation schemes. We prove analytical bounds that link solution regularity to input discretization, providing a formal quantification of neural operator accuracy under real-world numerical constraints. We derive these bounds to the specific cases of State Space Model-based Neural Operators (SS-NOs) and FNOs, thus providing a new discretization error theorem for these models. Additionally, through an input-to-state stability (ISS) analysis, we formally assess the impact of discretization on the stability of SS-NOs results obtained in the continuous domain. Our empirical experiments on 1D and 2D benchmarks validate our theoretical bounds and show the robustness of SS-NOs under varying resolutions.

URL PDF HTML ☆

赞 0 踩 0

2605.18904 2026-05-20 cs.LG cs.AI cs.CL 版本更新

Dynamic Model Merging Made Slim

动态模型合并的轻量级方法

Guodong Du, Wanyu Lin

发表机构 * The Hong Kong Polytechnic University（香港理工大学）

AI总结本文提出DiDi-Merging方法，通过可微分的秩分配平衡共享和专家参数，实现更高效的动态模型合并，在参数量上显著优于现有方法。

详情

AI中文摘要

模型合并使在不联合训练或访问原始数据的情况下重用微调模型成为可能。动态合并进一步通过选择性激活任务相关参数并高效组合多个任务的专家来提高灵活性。然而，现有动态方法要么维护一个完整的共享模型加小专家，要么为专家分配过多容量，导致准确性与效率之间的权衡不优。为此，我们提出DiDi-Merging，一种轻量动态合并框架，利用可微分的秩分配来平衡共享和专家参数。通过将参数预算分配建模为低秩模块中的可微分秩优化，并引入无需数据的细化步骤来恢复任务保真度，DiDi-Merging在仅1.24倍单个微调模型参数的情况下匹配现有动态基线，并在1.4倍时超越它们，显著优于需要>2倍存储容量的方法。DiDi-Merging适用于视觉、语言和多模态任务。

英文摘要

Model merging enables the reuse of fine-tuned models without joint training or access to original data. Dynamic merging further improves flexibility by selectively activating task-relevant parameters and efficiently composing experts across multiple tasks. However, existing dynamic methods either maintain a full shared model with tiny experts or allocate excessive capacity to experts, leading to suboptimal accuracy--efficiency trade-offs. To address this, we propose DiDi-Merging, a slim dynamic merging framework that leverages differentiable rank allocation to balance shared and expert parameters. By formulating parameter budgeting as differentiable rank optimization in low-rank modules and introducing a data-free refinement step to recover task fidelity, DiDi-Merging matches prior dynamic baselines at only 1.24x the parameters of a single fine-tuned model and surpasses them at 1.4x, substantially more compact than methods requiring > 2x storage. DiDi-Merging applies across vision, language, and multimodal tasks.

URL PDF HTML ☆

赞 0 踩 0

2605.18899 2026-05-20 cs.LG cs.AI 版本更新

Don't Let Bandit Feedback Pull Continual LLM-Recommender Updates Off Target

不要让多臂老虎机反馈将连续LLM推荐系统更新偏离目标

Taesan Kim, Hyeongjun Yun, Jaegul Choo, Chung Park

发表机构 * SK Telecom（SK电信）； KAIST（韩国科学技术院）

AI总结本文提出了一种名为Anchored Bandit Policy Optimization (ABPO)的框架，用于持续改进基于生成式大语言模型的推荐系统，通过结合组内相对策略优化（GRPO）和显式处理曝光偏差和反馈模糊性，以减少因部署日志提供的策略形状上下文老虎机反馈导致的偏差，并提高推荐准确性。

详情

AI中文摘要

基于生成式大语言模型的推荐系统（LLM-Rec）需要持续部署后的更新，但部署日志仅提供策略形状的上下文老虎机反馈：结果仅在由先前服务策略暴露的项目上被观察到，导致曝光偏差，并产生部分、不对称的信号，包括相对可靠的积极响应和模糊的无响应。我们提出了一种连续LLM-Rec更新的Anchored Bandit Policy Optimization（ABPO）框架，结合组内相对策略优化（GRPO）与显式处理曝光偏差和反馈模糊性。具体来说，我们将在每个GRPO滚动组中插入暴露的推荐作为记录的锚点，使组内相对归一化能够针对先前策略实际暴露的动作进行校准，而不是仅针对新采样的滚动。因为正响应和无响应仅通过先前策略暴露被观察到，我们对固定锚点应用自归一化逆倾向评分，以校正策略不匹配。同时，我们将两种反馈类型进行不对称处理：正响应提供相对直接的推荐信号，而无响应仍然模糊，因为它们可能反映真正的不感兴趣或未观察到的外部因素。为了避免因模糊的无响应而过于激进的更新，我们用模型输出标记的置信度来削弱其惩罚，作为无监督的可靠性信号。在Amazon Reviews和MovieLens的五个领域中，我们的方法在推荐准确性上产生了持续的更新收益，同时比先前的基线方法更有效地缓解了先前策略引起的曝光偏差。

英文摘要

Generative LLM-based recommenders (LLM-Rec) require continual post-deployment updates, yet deployment logs provide only policy-shaped contextual bandit feedback: outcomes are observed solely for items exposed by a prior serving policy, inducing exposure bias and yielding partial, asymmetric signals consisting of relatively reliable positive responses and ambiguous no-responses. We propose an Anchored Bandit Policy Optimization (ABPO) framework for continual LLM-Rec updates that combines group-relative policy optimization (GRPO) with explicit treatment of exposure bias and feedback ambiguity. Specifically, we insert the exposed recommendation as a logged anchor into each GRPO rollout group, so that group-relative normalization is calibrated against the action actually exposed by the prior policy rather than against newly sampled rollouts alone. Because both positive- and no-responses are observed only through prior-policy exposure, we apply self-normalized inverse propensity scoring to the fixed anchor for both feedback types to correct for policy mismatch. At the same time, we treat the two feedback types asymmetrically in reliability: positive responses provide relatively direct endorsement signals, whereas no-responses remain ambiguous because they may reflect either true disinterest or unobserved external factors. To avoid overly aggressive updates from ambiguous no-responses, we temper their penalties with self-certainty, using the model's output-token confidence as a verifier-free reliability signal. Across five domains from Amazon Reviews and MovieLens, our method yields consistent post-update gains in recommendation accuracy while mitigating prior-policy-induced exposure bias more effectively than prior baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.18897 2026-05-20 eess.SP cs.AI cs.LG 版本更新

Cross-Subject Intracranial EEG Reconstruction from Scalp Recordings Using Multi-Scale Cross-Attention Transformers

基于多尺度交叉注意力变换器的跨受试者颅内脑电重构（使用头皮记录）

Tien-Dat Pham, Xuan-The Tran

发表机构 * HAI-Smartlink Research Lab, Anchi STE Company（HAI-Smartlink研究实验室、Anchi STE公司）； School of Mechanical Engineering, Vietnam Maritime University（越南海事大学机械工程学院）

AI总结本文提出了一种基于多尺度交叉注意力变换器（CAST）的方法，通过两阶段迁移学习策略，从头皮脑电中重建未见过的受试者的颅内脑电信号，实现了无需患者特定训练的跨受试者颅内脑电重构。

详情

AI中文摘要

颅内脑电（iEEG）提供高保真的神经记录，对临床和脑机接口应用至关重要，但获取这些信号需要侵入性手术。尽管最近的研究尝试从非侵入性头皮脑电估计iEEG，但大多数方法依赖于患者特定的模型，导致循环依赖：如果需要手术收集训练数据，非侵入性模型的实用性有限。在本研究中，我们通过预测未见过的患者的颅内信号来解决跨受试者iEEG重构的挑战，使用在其他人身上训练的模型。我们提出了CAST（跨注意力空间-时间变换器），一种机器学习框架，通过两阶段迁移学习策略将头皮脑电转换为多通道iEEG波形。首先，一个时间编码器在三个不同分辨率上提取多尺度神经表示。然后，由于患者之间的电极放置差异较大，一个通道感知的解码器仅使用少量目标受试者的数据进行校准。我们通过留一受试者法交叉验证在两个公共数据集上评估了所提出的方法，这两个数据集包含1,282个iEEG通道。实验结果表明，CAST在重构靠近头皮表面的皮层信号方面优于深度皮下活动。在高度可观察的运动感觉区域，模型在中央前回实现了峰值相关性高达r=0.864。此外，通过通道选择策略，CAST在可行的受试者上获得了平均相关性r=0.545，优于之前的同受试者基线。这些发现表明，无需广泛的患者特定训练，即可从头皮脑电中重构未见过的受试者的皮层iEEG信号，并且仅需短暂的校准阶段即可使模型适应新的硬件配置。

英文摘要

Intracranial EEG (iEEG) provides high-fidelity neural recordings essential for clinical and brain-computer interface applications, but acquiring these signals requires invasive surgery. While recent studies have attempted to estimate iEEG from non-invasive scalp EEG, most rely on patient-specific models, creating a circular dependency: if surgery is required to collect training data, the non-invasive model offers limited practical benefit. In this study, we address the challenge of cross-subject iEEG reconstruction by predicting intracranial signals for unseen patients using models trained on other individuals. We propose CAST (Cross-Attention Spatial-Temporal Transformer), a machine learning framework that translates scalp EEG into multi-channel iEEG waveforms through a two-stage transfer learning strategy. First, a temporal encoder extracts multi-scale neural representations at three different resolutions. Then, because electrode placements vary substantially across patients, a channel-aware decoder is calibrated using only a few minutes of data from the target subject. We evaluated the proposed method using leave-one-subject-out cross-validation on two public datasets comprising 1,282 iEEG channels. Experimental results demonstrate that CAST reconstructs cortical signals located near the scalp surface substantially better than deep subcortical activity. In highly observable sensorimotor regions, the model achieved peak correlations of up to r=0.864 in the precentral gyrus. Furthermore, with a channel selection strategy, CAST obtained a mean correlation of r=0.545 on viable subjects, outperforming previous within-subject baselines. These findings indicate that cortical iEEG signals can be reconstructed for unseen subjects from scalp EEG without extensive patient-specific training, and that only a brief calibration phase is sufficient to adapt the model to new hardware configurations.

URL PDF HTML ☆

赞 0 踩 0

2605.18895 2026-05-20 cs.RO cs.AI 版本更新

KG-ASG: Collision-Knowledge-Guided Closed-Loop Adversarial Scenario Generation With Primary-Support Attribution

KG-ASG: 基于碰撞知识的闭环对抗场景生成与主支持属性

Cheng Wang, Chen Xiong, Ziwen Wang, Yuchen Zhou, Qiang Liu

发表机构 * Guangdong Provincial Key Laboratory of Intelligent Transportation System, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University（广东省智能交通系统重点实验室，智能系统工程学院，中山大学深圳校区）

AI总结本文提出KG-ASG框架，通过碰撞知识引导和主支持属性，提高自动驾驶系统安全验证的对抗有效性、可解释性和可执行性。

详情

AI中文摘要

自动驾驶系统安全验证需要高风险场景覆盖、清晰的碰撞语义、可执行轨迹和可追溯的多车辆交互。现有安全关键场景生成方法通常依赖低级轨迹扰动、碰撞代理优化或单对抗者搜索，可能产生具有模糊碰撞原因或不可控多车辆碰撞的对抗样本。本文提出KG-ASG，一种基于碰撞知识的闭环对抗场景生成框架，具有主支持属性。KG-ASG构建了结构化的碰撞知识库，并训练了一个轻量级的碰撞专家来推断目标碰撞模式、唯一的主对抗者、支持车辆及其交互角色。在该语义先验的引导下，多车辆对抗生成被公式化为主支持过程，其中主对抗者引发主要冲突，支持车辆塑造周围风险结构，而不会成为额外碰撞者。规则、物理、交互安全性和单碰撞器约束被作为硬门来过滤不可执行的样本。为处理反应性驾驶者行为，进一步使用规划器-控制器反馈进行故障诊断、候选重新排序和终端细化。在MetaDrive中重建的WOMD场景上的实验表明，KG-ASG在IDM、Cruise和Expert控制器下实现了强对抗有效性，同时提高了有效主攻击、减少了多碰撞，并获得了闭环恢复收益。这些结果表明，碰撞知识引导和主支持单碰撞器推理提高了自动驾驶安全验证的对抗有效性、可解释性和可执行性。

英文摘要

Safety validation of autonomous driving systems requires high-risk scenario coverage, clear collision semantics, executable trajectories, and attributable multi-vehicle interactions. Existing safety-critical scenario generation methods often rely on low-level trajectory perturbations, collision-proxy optimization, or single-adversary search, which may produce adversarial samples with ambiguous collision causes or uncontrolled multi-vehicle collisions. This paper proposes KG-ASG, a collision-knowledge-guided closed-loop adversarial scenario generation framework with primary-support attribution. KG-ASG constructs a structured collision knowledge base and trains a lightweight Collision Expert to infer the target collision mode, the unique primary adversary, support vehicles, and their interaction roles. Guided by this semantic prior, multi-vehicle adversarial generation is formulated as a primary-support process, where the primary adversary induces the main conflict and support vehicles shape the surrounding risk structure without becoming additional colliders. Rule, physical, interaction-safety, and single-collider constraints are imposed as hard gates to filter non-executable samples. To handle reactive ego behaviors, planner-controller feedback is further used for failure diagnosis, candidate re-ranking, and terminal refinement. Experiments on WOMD scenarios reconstructed in MetaDrive show that KG-ASG achieves strong adversarial effectiveness while improving Valid Primary Attack, reducing multi-collision, and obtaining closed-loop recovery gains under IDM, Cruise, and Expert controllers. These results demonstrate that collision-knowledge guidance and primary-support single-collider reasoning improve adversarial effectiveness, interpretability, and executability for autonomous driving safety validation.

URL PDF HTML ☆

赞 0 踩 0

2605.18892 2026-05-20 cs.LG cs.AI cs.DC 版本更新

Data-Free Client Contribution Estimation via Logit Maximization for Federated Learning

通过Logit最大化实现无数据的客户端贡献估计用于联邦学习

Asim Ukaye, Nurbek Tastan, Mubarak Abdu-Aguye, Karthik Nandakumar

发表机构 * MBZUAI, Abu Dhabi, UAE（MBZUAI，阿布扎赫德，阿联酋）； Michigan State University, Michigan, USA（密歇根州立大学，密歇根，美国）

AI总结本文提出了一种基于Logit最大化的无数据客户端贡献估计和聚合框架CELM，该框架无需共享原始数据、客户端元数据或辅助公开数据，通过客户端更新获取类别证据分数并构建跨客户端证据矩阵，以量化每类的竞争力和类别覆盖范围，从而计算出对少数类提供强判别性证据的客户端贡献权重，提高联邦学习的鲁棒性和性能。

Comments 22 pages, 7 figures

详情

AI中文摘要

联邦学习（FL）使计算机视觉模型能够协同学习，其中隐私和监管限制防止在设备或组织之间集中数据。然而，实际的FL部署往往表现出严重的类别不平衡和标签偏斜，导致标准聚合协议过度拟合主导客户端并降级少数类性能。我们提出了一种基于Logit最大化的无数据、按类别贡献估计和聚合框架（CELM），该框架不需要共享原始数据、客户端元数据或辅助公开数据。FL服务器通过客户端更新获取类别证据分数，并构建跨客户端证据矩阵，该矩阵量化了每类的竞争力和类别覆盖范围。使用该矩阵，我们计算出贡献权重，以提升为少数类提供强判别性证据的客户端的权重。所得到的聚合是稳定的，由于简单约束和动量平滑，且与标准FL训练流水线保持兼容。我们在受控的非独立同分布和病理标签分割的代表性视觉基准上评估了该方法，证明CELM基于的聚合提高了对不平衡和统计异质性的鲁棒性，同时在不需任何额外数据交换的情况下实现了更好的性能。

英文摘要

Federated learning (FL) enables collaborative learning of computer vision models, where privacy and regulatory constraints prevent centralizing data across devices or organizations. However, practical FL deployments often exhibit severe class imbalance and label skew, causing standard aggregation protocols to overfit dominant clients and degrade minority-class performance. We propose a data-free, class-wise contribution estimation and aggregation framework based on logit maximization (CELM) that does not require sharing raw data, client metadata, or auxiliary public datasets. The FL server probes client updates to obtain class-wise evidence scores and assembles a cross-client evidence matrix, which quantifies both per-class competence and class coverage. Using this matrix, we compute contribution weights that upweight clients providing strong, discriminative evidence for underrepresented classes. The resulting aggregation is stable due to simplex constraints and momentum smoothing, and it remains compatible with standard FL training pipelines. We evaluate the approach on representative vision benchmarks under controlled non-IID and pathological label splits, demonstrating that CELM-based aggregation improves robustness to imbalance and statistical heterogeneity, while yielding better performance without requiring any additional data exchange.

URL PDF HTML ☆

赞 0 踩 0

2605.18891 2026-05-20 cs.LG cs.AI 版本更新

Auditing Reasoning-Trace Memorization Claims after Unlearning with Head-Conditioned Canaries

在取消学习后使用头部条件化的候鸟审计推理轨迹记忆化声明

Yanhang Li, Zhichao Fan, Zexin Zhuang

发表机构 * Northeastern University, USA（东北大学）； University of Illinois Urbana-Champaign, USA（伊利诺伊大学厄巴纳-香槟分校）； Southern Methodist University, USA（南方 Methodist 大学）

AI总结该研究通过在DeepSeek-R1-Distill-Qwen-7B上使用LoRA记忆化的虚构作者和NPO取消学习，结合六token候鸟头部条件，审计推理轨迹记忆化声明，发现正向解析器拆分绕过间隙本身并不能识别隐藏的权重级记忆化，也不能排除其存在。

详情

AI中文摘要

对推理模型的取消学习评估有时会显示绕过模式。答案侧看起来已取消学习，但模型自身的推理轨迹仍会发出遗忘内容，这种差距被当作证据表明权重仍记忆。我们使用LoRA记忆化的虚构作者和NPO取消学习，在六token候鸟头部条件下审计此阅读。在一种种子下，用相同的权重交换推理轨迹为短非候鸟预填，答案率下降幅度等于绕过间隙本身，无论预填是否模仿训练模板。在第二种种子下，绕过间隙缩小而非消失，预填交换方向反转并使答案率达到上限。正向解析器拆分绕过间隙本身并不能识别隐藏的权重级记忆化，也不能排除其存在。在不同的distillate中，相同指标因解析器无法找到闭合标签而改变符号。我们推荐在解码时进行模板交换作为廉价的合理性检查，与传统审计并行。

英文摘要

Evaluations of unlearning on reasoning models sometimes show a bypass pattern. The answer side looks unlearned, but the model's own thinking trace keeps emitting the forgotten content, and the gap is taken as evidence that the weights still remember. We audit this reading on DeepSeek-R1-Distill-Qwen-7B with LoRA-memorized fictional authors and NPO unlearning, conditioned on a six-token canary head. On one seed, swapping the thinking trace for a short non-canary prefill on the same weights drops the answer rate by as much as the bypass gap itself, whether the prefill mimics the training template or not. On a second seed the bypass gap shrinks rather than vanishing, and the prefill swap reverses direction and brings the answer rate to ceiling. A positive parser-split bypass gap thus does not by itself identify hidden weight-level memorization, and does not rule it out either. On a different distillate the same metric flips sign because the parser cannot find the closing tag. We recommend a decode-time template swap as a cheap sanity check alongside the canonical audit.

URL PDF HTML ☆

赞 0 踩 0

2605.18890 2026-05-20 physics.soc-ph cs.AI cs.CY cs.MA 版本更新

Stop Drawing Scientific Claims from LLM Social Simulations Without Robustness Audits

不要在没有充分鲁棒性审计的情况下从LLM社会模拟中绘制科学结论

Jinyi Ye, Lei Cao, Ding Chen, Emilio Ferrara

发表机构 * Thomas Lord Department of Computer Science, University of Southern California（美国南加州大学计算机科学系汤姆·劳德部门）； Marshall School of Business, University of Southern California（美国南加州大学马歇尔商学院）

AI总结本文研究了从LLM社会模拟中得出的科学结论不应强于支持它们的鲁棒性审计，通过两个案例研究展示了小扰动如何影响模拟结果，并提出TRAILS框架以规范鲁棒性审计。

详情

AI中文摘要

从LLM社会模拟中得出的科学结论不应强于支持它们的鲁棒性审计。生成代理为基于代理的建模带来了新的表达能力，使合作、极化和规范形成等集体社会过程的模拟成为可能。然而，它们还通过额外的架构选择引入了复杂性，如代理规格、记忆表示、交互协议和环境设计。小扰动可能在重复交互中引发宏观结果，产生'蝴蝶效应'。因此，从LLM社会模拟中得出的科学结论可能反映的是实现艺术而非建模的社会机制。我们通过重复的囚徒困境和社交媒体回声室模拟案例研究支持这一观点。在多个模型中，个人格式和游戏指令框架中的小扰动可使合作率变化高达76个百分点，而网络同质性和中心节点分配会产生显著且一致的极化指标变化。我们还发现敏感性在架构选择和模型家族之间分布不均：在一种前沿模型中产生76个百分点变化的扰动，在另一种模型中仅产生1个百分点的变化。因此，鲁棒性应作为每个声明和每个模型的属性进行测量，而不是假设。为解决这一验证缺口，我们引入TRAILS（在LLM模拟中鲁棒性审计的分类学），涵盖三个层次的模拟设计：代理（微观）、交互（中观）和系统（宏观）。我们呼吁鲁棒性在LLM社会模拟用于解释机制、评估干预或做出决策之前，应成为首要的验证要求。

英文摘要

The scientific claims drawn from LLM social simulations should be no stronger than the robustness audits that support them. Generative agents bring new expressive power to agent-based modeling, enabling simulations of collective social processes like cooperation, polarization, and norm formation. Yet they also introduce complexity through additional architectural choices, such as agent specification, memory representation, interaction protocols, and environment design. Small perturbations that appear minor to researchers can cascade into macro-level outcomes through repeated interaction, creating a "butterfly effect." Consequently, scientific claims drawn from LLM social simulations may reflect implementation artifacts rather than the social mechanisms being modeled. We support this position with two case studies: a repeated Prisoner's Dilemma and a social media echo chamber simulation. Across multiple models, minor perturbations in persona format and game-instruction framing shift cooperation rates by up to 76 percentage points, while network homophily and hub assignment produce significant and consistent shifts in polarization metrics. We also find that sensitivity is unevenly distributed across both architectural choices and model families: the same perturbation that produces the 76 pp shift in one frontier model only shifts another by 1 pp. Robustness is therefore a property that should be measured per claim and per model, not assumed. To address this validation gap, we introduce TRAILS (Taxonomy for Robustness Audits In LLM Simulations), a robustness-audit taxonomy spanning three levels of simulation design: agent (micro-level), interaction (meso-level), and system (macro-level). We call for robustness to become a first-order validation requirement before LLM social simulations are used to explain mechanisms, evaluate interventions, or inform decisions.

URL PDF HTML ☆

赞 0 踩 0

2605.18889 2026-05-20 cs.LG cs.AI 版本更新

Soft Learning

软学习

Mohammed Aledhari, Ali Aledhari, Fatimah Aledhari, Mohamed Rahouti

发表机构 * University of North Texas（北卡罗来纳州立大学）； Fordham University（福尔特姆大学）

AI总结本文提出软学习框架，通过交叉验证非负最小二乘法发现最优组合权重，实现比深度网络快数十倍的训练速度，同时具备内在可解释性和未来扩展性，优于多种方法，在70%的任务上排名第一。

详情

AI中文摘要

现代机器学习迫使从业者在强大的但昂贵的深度网络和快速但有限的经典算法之间做出选择。本文介绍了软学习，一个维护异质专家库的框架，涵盖线性模型、树集成、核机和神经网络，并通过交叉验证非负最小二乘法发现可证明最优的组合权重。软学习保证能匹配或超过其专家的最佳加权组合，仅在CPU上训练速度比深度网络快两到三个数量级（72-435倍，取决于测试配置），通过学习的权重提供内在可解释性，揭示哪种算法范式最适合数据，并且具有未来保障性：添加专家能保证性能维持或提升。在37个数据集（25个分类，12个回归）上，针对包括CatBoost和调优深度网络在内的九种方法，软学习在70%的任务上排名第一，获得最佳平均排名（Friedman检验，p=1.12×10^-12），并且是唯一同时在分类和回归上均表现优异的方法，无需GPU硬件或超参数调优。这些结果表明从“哪种算法最好？”到“什么是有证明最优的组合？”的范式转变，软学习通过正式保证回答任何数据模态的问题。

英文摘要

Modern machine learning forces practitioners to choose between powerful but expensive deep networks and fast but limited classical algorithms. Here we introduce Soft Learning, a framework that maintains a library of heterogeneous specialists -- spanning linear models, tree ensembles, kernel machines, and neural networks -- and discovers provably optimal combination weights through cross-validated non-negative least squares. Soft Learning is guaranteed to match or exceed the best weighted combination of its specialists, trains over two orders of magnitude faster than deep networks on CPU alone (72-435x faster across tested configurations), provides inherent interpretability through learned weights that reveal which algorithmic paradigm best fits the data, and is future-proof: adding specialists is mathematically guaranteed to maintain or improve performance. Across 37 datasets (25 classification, 12 regression) against nine methods including CatBoost and tuned deep networks, Soft Learning ranks first on 70% of tasks, achieves the best mean rank (Friedman test, p = 1.12 x 10^-12), and is the only method to simultaneously excel at both classification and regression -- all without GPU hardware or hyperparameter tuning. These results suggest a paradigm shift from "which algorithm is best?" to "what is the provably optimal combination?" -- a question Soft Learning answers with formal guarantees for any data modality.

URL PDF HTML ☆

赞 0 踩 0

2605.18885 2026-05-20 cs.IT cs.AI cs.CC math.IT 版本更新

The Extremum Stack is a Minimal Sufficient Statistic for Rate-Independent Functionals: A Kolmogorov Complexity Characterisation

极值栈是速率无关函数的最小充分统计量：一个柯尔莫哥洛夫复杂性特征化

Piotr Frydrych

发表机构 * The Metrology and Biomedical Engineering Institute, Faculty of Mechatronics, Warsaw University of Technology（计量与生物医学工程研究所，机械电子学系，华沙理工大学）

AI总结本文证明了离散序列的极值栈是所有可计算、因果、速率无关函数的最小充分统计量，从柯尔莫哥洛夫复杂性的角度出发。具体来说，建立了K(Pi_n) - O(1) ≤ K_R(u_{0:n}) ≤ K(Pi_n) + O(1)，其中K_R(u_{0:n})是回答类别R中所有查询的最短程序长度，O(1)的开销与序列长度n和栈深度k无关。充分性源于Preisach滞回操作符的经典擦除性质。最小性通过一个有限指示族的速率无关性得到验证。因此，任何保留完整类别R的滞回驱动流压缩必须至少保留K(Pi_n) - O(1)位；由结果隐含的基于栈的压缩算法具有柯尔莫哥洛夫最优性保证，而标准时间序列压缩方法均无法提供。

Comments 6 pages, 1 algorithm, 1 table. Submitted to Information Processing Letters (Elsevier)

2605.18883 2026-05-20 cs.LG cs.AI 版本更新

Prediction Is Not Physics: Learning and Evaluating Conserved Quantities in Neural Simulators

预测并非物理：在神经模拟器中学习和评估守恒量

Andrew Bukowski, Aditya Kothari, Simba Shi, Ishir Rao

发表机构 * Yale University（耶鲁大学）

AI总结本文研究了神经网络能否从物理轨迹中学习或选择全局守恒量，通过三个哈密顿系统（抛体运动、单摆和弹簧-质量系统）验证了不同模型在守恒律保持方面的性能，发现黑盒CDN在加入时间一致性损失时表现更优，而多项式CDN对训练配置敏感。

Comments 10 pages

详情

AI中文摘要

训练在哈密顿轨迹上的扩散模型可以达到接近10^-3的滚动MSE，但其能量的标准差比地面真实能量的标准差大7500到36000倍，表明未能保持守恒定律。这一差距促使我们提出核心问题：神经网络能否从物理轨迹中学习或选择全局守恒量？我们研究了三个哈密顿系统：抛体运动、单摆和弹簧-质量系统。我们使用了结构化的T(v)+V(q)能量模型、黑盒守恒发现网络（CDN）、多项式CDN以及条件扩散基线。结构化网络在干净数据上对分析能量的R²≥0.9999，而黑盒CDN在训练时加入时间一致性损失和小的对齐损失（λ_align=0.2）时，R²≥0.996。当λ_align=0时，CDN在单摆和弹簧-质量系统上Pearson R²崩溃（<10^-3），表明仅靠时间一致性无法可靠地识别真实能量。在1%的加性高斯噪声下，CDN在抛体和弹簧-质量系统上优于结构化模型，表明CDN可能在该设置下对噪声输入更鲁棒。然而，多项式CDN对训练配置敏感：在单摆系统上短训练计划下R²=0.78，但通过更多训练时间和数据可以达到R²=0.9998，无论是否加入噪声。

英文摘要

A diffusion model trained on Hamiltonian trajectories can achieve rollout MSE near $10^{-3}$, but the standard deviation of its energy over time is between 7500 and 36000 times larger than the ground-truth energy standard deviation, indicating a failure to preserve conservation laws. This gap motivates our central question of whether neural networks can learn or select globally conserved quantities from physical trajectories. We investigate this across three Hamiltonian systems: projectile motion, pendulum, and spring-mass. We use a structured $T(v)+V(q)$ energy model, a black-box Conservation Discovery Network (CDN), a polynomial CDN, and a conditional diffusion baseline. The structured network reaches $R^2 \geq 0.9999$ against analytical energy on clean data, while the black-box CDN reaches $R^2 \geq 0.996$ when trained with temporal consistency plus a small alignment loss to analytical energy at $t=0$ ($λ_{\mathrm{align}}=0.2$). With $λ_{\mathrm{align}}=0$, CDN Pearson $R^2$ collapses on pendulum and spring-mass ($< 10^{-3}$), showing that temporal consistency alone is not enough to reliably identify the true energy. Under $1\%$ additive Gaussian noise, the CDN outperforms the structured model on the projectile and spring-mass systems, suggesting that the CDN may be more robust to noisy inputs in this setting. However, the polynomial CDN is sensitive to training configuration: it achieves $R^2=0.78$ under a short training schedule on the pendulum system, but reaches $R^2=0.9998$ with more training time and data, regardless of whether noise is added.

URL PDF HTML ☆

赞 0 踩 0

2605.18882 2026-05-20 cs.LG cs.AI 版本更新

To Call or Not to Call: Diagnosing Intrinsic Over-Calling Bias in LLM Agents

叫还是不叫：诊断LLM代理中的内在过度调用偏差

Wei Shi, Ziheng Peng, Sihang Li, Xiting Wang, Xiang Wang, Mengnan Du, Na Zou

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Renmin University of China（中国人民大学）； The Chinese University of Hong Kong Shenzhen（香港中文大学（深圳））； University of Science and Technology of China（中国科学技术大学）

AI总结本文研究了LLM代理中过度调用现象，提出内在偏差假说，通过稀疏自编码器恢复行为对齐的特征基，减少到带符号激活边距，并估计偏移量，从而修正过度调用问题。

详情

AI中文摘要

LLM代理表现出一种一致的倾向，即在不需要工具的情况下也频繁调用工具。在When2Call基准测试中，三个家族的六个模型显示出较高的调用准确性，但调用准确性远低于不调用准确性，导致总体准确性在55%-70%之间。我们将其归因于内在偏差假说（IBH）：调用/不调用决策映射具有激活无关的调用偏移，因此模型在激活平衡时仍倾向于调用。使用稀疏自编码器（SAEs），我们恢复了与调用/不调用决策对齐的特征基，将其减少到带符号激活边距，并直接估计偏移量。在所有六个模型中，只有当不调用激活超过调用激活时，模型才是决策中性的，这与IBH一致。然后，我们通过自适应边距校准引导（AMCS）进行因果测试，这是一种沿SAE解码器方向的闭合形式反偏移。消除诊断出的偏移量可以减轻过度调用并提高总体准确性，同时调用准确性下降很小。我们的工作将过度调用从经验现象转变为可以进行因果修正的机制性对象。代码可在https://github.com/SKURA502/agent-sae/上获取。

英文摘要

LLM agents exhibit a consistent tendency to over-call, invoking tools even in situations where none is needed. On the When2Call benchmark, six models from three families show high call accuracy but much lower no-call accuracy, leaving overall accuracy in the 55%-70% range. We trace this to an Intrinsic Bias Hypothesis (IBH): the call/no-call decision mapping carries an activation-independent call offset, so the model favors call even at activation parity. Using Sparse Autoencoders (SAEs), we recover behavior-aligned feature bases for the call/no_call decision, reduce them to a signed activation margin, and estimate the offset directly. Across all six models, the model is decision-neutral only when no_call activation outweighs call activation, consistent with IBH. We then causally test IBH with Adaptive Margin-Calibrated Steering (AMCS), a closed-form counter-bias shift along SAE decoder directions. Cancelling the diagnosed offset mitigates over-calling and improves overall accuracy with a negligible drop in call accuracy. Our work recasts over-calling from an empirical phenomenon into a mechanistic object amenable to causal correction. Code is available at https://github.com/SKURA502/agent-sae/.

URL PDF HTML ☆

赞 0 踩 0

2605.18873 2026-05-20 cs.CR cs.AI cs.LG 版本更新

GenAI-FDIA: Physics-Informed Generative Models for False Data Injection Attacks

GenAI-FDIA：基于物理的生成模型用于虚假数据注入攻击

Mohammad A. Razzaque, Muta Tah Hira

发表机构 * School of Computing, Engineering and Digital Technologies, Teesside University, UK（Teesside大学计算与工程数字技术学院，英国）； Smartifier Ltd, Stockton-on-Tees, UK（Smartifier有限公司，英国Stockton-on-Tees）

AI总结本文提出GenAI-FDIA框架，通过物理兼容的生成模型合成虚假数据注入攻击，验证了不同架构在电力系统中的有效性，并解决了生成模型中出现的新型故障模式。

Comments Submitted to IEEE Transactions on Smart Grid

详情

AI中文摘要

训练和评估用于电力系统的虚假数据注入攻击（FDIA）检测器受到数据稀缺的限制。运营电网测量数据具有商业敏感性，而手工制作的攻击无法捕捉由网络物理结构强加的复杂分布特性。我们提出了GenAI-FDIA框架，该框架在20种架构中进行基准测试，涵盖Wasserstein GANs、MMD-VAEs、归一化流、扩散模型以及跨家族混合模型。这些模型在三个IEEE测试平台（14节点直流、30节点直流和14节点交流）上进行评估，使用数据驱动的坏数据检测（BDD）阈值校准进行60/20/20时间分割。我们的实证结果验证了这些模型能够生成高保真的攻击，所有架构在14节点网络上达到86.6%以上的规避率；此外，限制攻击者的拓扑知识会带来可测量的隐蔽性下降（p ≤ 0.0022）。关键的是，我们识别出一种之前未报告的故障模式：在归一化特征空间中直接应用仿射物理投影会严重位移攻击向量，使BDD规避率从约55%降至<2%在30节点测试平台。我们通过一种新的推理时间谐调器解决此问题，恢复所有物理兼容变体的完全隐蔽性（ε_BDD=100%）而无需重新训练。最后，我们隔离了高级混合架构中的协方差坍塌现象（κ≈-0.076），并通过50个周期的预热计划进行修正（κ→0.785，MMDΔ=-3.1%）。最终，GenAI-FDIA提供了适用于任何受物理约束的生成模型在电力系统安全中的稳健恢复蓝图。

英文摘要

Training and evaluating false data injection attack (FDIA) detectors for power systems is constrained by data scarcity. Operational grid measurements are commercially sensitive, and hand-crafted attacks fail to capture complex distributional structures imposed by network physics. We present \textsc{GenAI-FDIA}, a framework benchmarking a pool of $P{=}20$ architectures for physics-compliant FDIA synthesis, spanning Wasserstein GANs, MMD-VAEs, normalising flows, diffusion models, and cross-family hybrids. These are evaluated across three IEEE testbeds (14-bus DC, 30-bus DC, and 14-bus AC) under a 60/20/20 chronological split using data-driven Bad Data Detection (BDD) threshold calibration. Our empirical results verify that these models generate high-fidelity attacks, with all architectures achieving evasion rates of $ε_{\text{BDD}} \ge 86.6\%$ on the 14-bus network; additionally, limiting an attacker's topological knowledge induces a measurable degradation in stealthiness ($p \le 0.0022$). Crucially, we identify a previously unreported failure mode: applying affine physics projections directly in normalised feature spaces critically displaces the attack vector, collapsing BDD evasion from ${\sim}55\%$ to $<\!2\%$ on the 30-bus testbed. We resolve this via a novel inference-time harmoniser, restoring full stealthiness ($ε_{\text{BDD}}{=}100\%$) across all physics-informed variants without retraining. Finally, we isolate a covariance-collapse phenomenon ($κ\approx {-}0.076$) within advanced hybrid architectures and rectify it through 50-epoch warm-up schedules ($κ\to 0.785$, $Δ\text{MMD}={-}3.1\%$). Ultimately, \textsc{GenAI-FDIA} delivers a robust recovery blueprint applicable to any physics-constrained generative model deployed for power-system security.

URL PDF HTML ☆

赞 0 踩 0

2605.18872 2026-05-20 cs.LG cs.AI cs.RO 版本更新

EUPHORIA: Efficient Universal Planning via Hybrid Optimization for Robust Industrial Robotic Assembly

EUPHORIA: 通过混合优化实现高效通用规划以实现稳健的工业机器人装配

Shih-Yu Lai, Chia-Ching Yen, Yang-Ting Shen, Peter Yichen Chen, Yu-Lun Liu, Bing-Yu Chen

发表机构 * National Taiwan University（国立台湾大学）； MoonShine Animation Studio（MoonShine动画工作室）； National Cheng Kung University（国立成功大学）； The University of British Columbia（不列颠哥伦比亚大学）； National Yang Ming Chiao Tung University（阳明交通大学）

AI总结本文提出EUPHORIA框架，通过混合优化策略实现通用少样本适应和动态效率，解决建筑机器人装配中规划器高度专业化和操作低效的问题，结合元几何编码器、物理引导图变压器和残差稳定性校正等方法，实现高效且鲁棒的装配规划。

详情

AI中文摘要

建筑机器人装配面临持续瓶颈：现有规划器要么高度专业化，需要每次新几何设计都进行昂贵的再训练，要么操作低效，将结构序列和运动学运动视为独立过程。我们提出了EUPHORIA，一个统一框架，通过混合优化策略实现通用少样本适应和动态效率。为克服再训练瓶颈，我们提出了基于图超网络的元几何编码器：不同于标准对比学习仅在特征级识别，我们的超网络动态从最小支持集中生成策略参数，使参数级适应复杂拓扑（如穹顶、拱门）而无需基于梯度的再训练。对于结构推理，我们引入了通过软演员-评论家（SAC）训练的物理引导图变压器，其物理偏置注意力机制通过离散元模型（DEM）模拟的接触力调节注意力分数，引导规划器朝向结构关键连接。我们进一步通过运动学感知序列确保操作效率，其中SAC目标惩罚高能转换。最后，我们通过残差稳定性校正弥合仿真到现实的差距，这是一种可微优化层，通过最小化联合能量-稳定性成本优先级来微调粗略装配动作。实验表明，EUPHORIA显著减少了与解耦基线相比的能量消耗，并在未见的非标准几何上实现了最先进的成功率，通过融合元学习、物理引导注意力和残差优化，实现一个连贯的通用规划器。

英文摘要

Robotic assembly in architectural construction faces a persistent bottleneck: existing planners are either highly specialized, requiring prohibitive retraining for every new geometric design, or operationally inefficient, treating structural sequencing and kinematic motion as disjoint processes. We present EUPHORIA, a unified framework that achieves universal few-shot adaptability and dynamic efficiency through a hybrid optimization strategy. To overcome the retraining bottleneck, we propose a Meta-Geometric Encoder based on Graph Hypernetworks: unlike standard contrastive learning, which performs only feature-level recognition, our hypernetwork dynamically generates policy parameters from a minimal support set, enabling parameter-level adaptation to complex topologies (e.g., domes, arches) without gradient-based retraining. For structural reasoning, we introduce a Physics-Informed Graph Transformer trained via Soft Actor-Critic (SAC), with a Physics-Bias Attention mechanism that modulates attention scores using contact forces from Discrete Element Model (DEM) simulations, guiding the planner toward structurally critical connections. We further ensure operational efficiency through Kinematics-Aware Sequencing, where the SAC objective penalizes high-energy transitions. Finally, we bridge the Sim2Real gap via Residual Stability Correction, a differentiable optimization layer that fine-tunes coarse assembly actions by minimizing a joint energy-stability cost prior to execution. Experiments show that EUPHORIA significantly reduces energy consumption over decoupled baselines and achieves state-of-the-art success rates on unseen, non-standard geometries with minimal few-shot examples, fusing meta-learning, physics-informed attention, and residual optimization into a cohesive, generalized planner.

URL PDF HTML ☆

赞 0 踩 0

2605.18871 2026-05-20 cs.LG cs.AI 版本更新

Distributional Energy-Based Models for Uncertainty-Aware Structured LLM Reasoning

基于不确定性感知的结构LLM推理的分布能量模型

Shireen Kudukkil Manchingal, Abhey Kalia, Fernanda Gonçalves, Shebin Rawther

发表机构 * Oxford Dynamics Harwell Science and Innovation Campus（牛津动力学哈威尔科学与创新校园）

AI总结本文提出了一种分解的能量函数，结合了学习的质量评分器和确定性分析约束惩罚，用于验证结构LLM输出。该方法通过两步推理循环触发目标再生或 abstention，能够在多个基准测试中超越单次Qwen-72B，并减少约束违反。

详情

AI中文摘要

当大型语言模型生成结构化输出如旅行计划、代码解决方案或多步证明时，个别推理步骤可能正确，但整体输出可能违反预算、失败测试用例或与先前推论矛盾。我们提出了一种分解的能量函数，结合了学习的质量评分器和确定性分析约束惩罚，用于验证结构LLM输出。质量评分器是单个冻结编码器上的异构集合，包含低秩适配器（3%可训练参数）；集合均值对候选者进行排名，标准差量化epistemic不确定性，驱动一个两步推理循环，触发目标再生或 abstention。在五个基准测试（GSM8K、MuSR、TravelPlanner、TACO、Knights & Knaves）中，我们的149M参数验证器协调一个7-26B开放生成器池，在每个基准测试中均优于单次Qwen-72B，与Claude Sonnet 4.6在MuSR上匹配（67.7% vs. 68.0%），并且在TravelPlanner上将约束违反减少53%（相对于Opus 4.6，oracle 0.028，随机 0.231）。两种方法是互补的：结构验证在约束可检查时获胜（验证器捕捉信号前沿模型无法自我检测），而预训练规模先验在不可检查时获胜（叙述推理、代码语义）。跨数据集的混淆分析确认在四个推理任务上确实存在质量区分，并识别出代码中的模型身份捷径，通过最后一层重新训练得以缓解。评分器在困难数据上训练后可实现零样本转移：一个MuSR训练的评分器在没有看到数学问题的情况下在GSM8K上达到93.9%。

英文摘要

When Large Language Models produce structured outputs such as travel plans, code solutions, or multi-step proofs, individual reasoning steps may appear correct while the output as a whole violates budgets, fails test cases, or contradicts earlier deductions. We propose a decomposed energy function that combines a learned quality scorer with deterministic analytical constraint penalties for verifying structured LLM outputs. The quality scorer is a heterogeneous ensemble of low-rank adapters on a single frozen encoder (3% trainable parameters); the ensemble mean ranks candidates while the standard deviation quantifies epistemic uncertainty, driving a two-pass inference loop that triggers targeted regeneration or abstention. Across five benchmarks (GSM8K, MuSR, TravelPlanner, TACO, Knights & Knaves), our 149M-parameter verifier orchestrating a pool of 7-26B open generators outperforms single-shot Qwen-72B on every benchmark, matches Claude Sonnet 4.6 on MuSR (67.7% vs. 68.0%), and reduces constraint violations by 53% relative to Opus 4.6 on TravelPlanner (oracle 0.028, random 0.231). The two routes are complementary: structural verification wins when constraints are checkable (the verifier captures signal frontier models cannot self-detect), while pretraining-scale priors win where they are not (narrative inference, code semantics). A cross-dataset confounding analysis confirms genuine quality discrimination on four reasoning tasks and identifies a model-identity shortcut on code, mitigated via last-layer retraining. Scorers trained on difficult data transfer zero-shot: a MuSR-trained scorer achieves 93.9% on GSM8K without seeing a math problem.

URL PDF HTML ☆

赞 0 踩 0

2605.18869 2026-05-20 cs.LG cs.AI cs.NE 版本更新

MO-CAPO: Multi-Objective Cost-Aware Prompt Optimization

MO-CAPO：多目标成本感知提示优化

Jan Büssing, Moritz Schlager, Timo Heiß, Tom Zehle, Matthias Feurer

发表机构 * Technical University of Munich (TUM), Munich Center for Machine Learning (MCML)（慕尼黑工业大学（TUM）、慕尼黑机器学习中心（MCML））； LMU Munich, Munich Center for Machine Learning (MCML)（慕尼黑大学（LMU）、慕尼黑机器学习中心（MCML））； University of Freiburg, ELLIS Institute Tübingen（弗赖堡大学、图宾根ELLIS研究所）； TU Dortmund University, Lamarr Institute for Machine Learning（多特蒙德工业大学、拉马尔机器学习与人工智能研究所）

AI总结本文提出MO-CAPO，一种多目标提示优化算法，同时优化性能和推理成本，并通过预算分配实现高效优化，通过评估四个任务和三个LLM，证明其在噪声R2指标上优于NSGA-II基线，并在较低预算下达到竞争性性能。

详情

AI中文摘要

大型语言模型（LLMs）在广泛的任务上表现出色，但对提示设计高度敏感，促使需要自动提示优化。现有方法主要关注性能，忽略竞争目标如推理成本或延迟。同时，现有多目标提示优化工作依赖于现成的NSGA-II，忽略优化效率。为此，我们引入MO-CAPO，一种新的多目标提示优化算法，同时优化性能和推理成本，利用预算分配实现成本高效的优化。我们进一步提出一个面向部署的成本目标，捕捉LLM推理的完整计算概况。我们评估了我们的方法在四个任务和三个LLM上的表现，并将其与基于NSGA-II的多目标方法和最先进的单目标提示优化器进行比较。结果表明，MO-CAPO一致地识别出强、稳健和多样的Pareto前沿近似，同时保持成本效率。它在12种情况中的8种情况下在噪声R2指标上优于NSGA-II基线，并且在显著较低的预算下常能达到竞争性性能。发现的解决方案集涵盖了被单目标优化器遗漏的多样化性能-成本权衡，但顶级性能候选者仍与单目标解决方案竞争。此外，我们进行了首次多目标机器学习实验的评估，考虑了泛化和鲁棒性通过噪声R2和近似间隙，使解决方案质量的评估更加现实。MO-CAPO使从业者能够从高效发现的多个提示中选择，这些提示提供不同的性能和成本权衡。

英文摘要

Large language models (LLMs) achieve strong performance across a wide range of tasks but are highly sensitive to prompt design, motivating the need for automatic prompt optimization. Existing methods predominantly focus on performance alone, ignoring competing objectives such as inference cost or latency. At the same time, existing work on multi-objective prompt optimization relies on off-the-shelf NSGA-II, ignoring optimization efficiency. As a remedy, we introduce MO-CAPO, a novel multi-objective prompt optimization algorithm that jointly optimizes performance and inference cost while leveraging budget allocation for cost-efficient optimization. We further propose a deployment-oriented cost objective that captures the full computational profile of LLM inference. We evaluate our approach across four tasks and three LLMs and compare it to an NSGA-II-based multi-objective method and state-of-the-art single-objective prompt optimizers. Results show that MO-CAPO consistently identifies strong, robust, and diverse Pareto front approximations while maintaining cost-efficiency. It outperforms the NSGA-II baseline on 8 out of 12 cases in terms of the noisy R2 metric and achieves competitive performances often already at a considerably lower budget. The discovered solution sets span diverse performance-cost trade-offs that are omitted by single-objective optimizers, yet the top-performance candidates remain competitive with single-objective solutions. Additionally, we conduct the first evaluation of multi-objective machine learning experiments that considers generalization and robustness through noisy R2 and approximation gap, enabling a more realistic assessment of solution quality. MO-CAPO enables practitioners to select from an efficiently discovered set of multiple prompts offering different trade-offs between performance and cost.

URL PDF HTML ☆

赞 0 踩 0

2605.18868 2026-05-20 cs.CR cs.AI cs.CV cs.LG 版本更新

DarkLLM: Learning Language-Driven Adversarial Attacks with Large Language Models

DarkLLM: 利用大语言模型学习语言驱动的对抗攻击

Ye Sun, Xin Wang, Jiaming Zhang, Yifeng Gao, Yixu Wang, Yifan Ding, Qixian Zhang, Henghui Ding, Xingjun Ma, Yu-Gang Jiang

发表机构 * Fudan University（复旦大学）； Nanyang Technological University（南洋理工大学）； Tongji University（同济大学）

AI总结本文提出DarkLLM，一种基于大语言模型的对抗攻击框架，通过将自然语言攻击指令转换为潜在攻击向量，生成有效的对抗扰动，统一了多种攻击类型并实现了灵活可控的对抗生成。

Comments 23 pages, 13 figures

详情

AI中文摘要

尽管视觉和多模态基础模型在感知到复杂推理任务中至关重要，但它们仍然极易受到对抗攻击的影响。然而，传统对抗攻击通常局限于单一、预定义的目标，紧密耦合每个攻击到特定模型或任务，限制了其在现实场景中的可扩展性和灵活性。在本文中，我们提出了DarkLLM，一种新的攻击框架，该框架训练了一个大语言模型（LLM）将自然语言攻击指令转换为潜在攻击向量，然后解码为视觉对抗扰动。通过利用自然语言指令微调，DarkLLM不仅在一个框架内统一了目标攻击、非目标攻击、分割攻击和多模型攻击，还实现了灵活且可控的对抗生成，使每个指令都能生成一种扰动，以在异构模型上诱导期望的行为。通过在4个任务、13个数据集和15个模型上的广泛实验，我们证明DarkLLM仅需1B参数即可遵循攻击者的指令，生成对CLIP、SAM和前沿LLM高度有效的攻击，揭示了现代基础模型系统性的脆弱性。

英文摘要

While vision and multimodal foundation models underpin critical tasks from perception to complex reasoning, they remain highly vulnerable to adversarial attacks. However, traditional adversarial attacks are typically limited to single, predefined objectives, tightly coupling each attack to a specific model or task, which restricts their scalability and flexibility in real-world scenarios. In this work, we present DarkLLM, a novel attack framework that trains an LLM to translate natural-language attack instructions into latent attack vectors, which are then decoded into visual adversarial perturbations. By leveraging natural-language instruction tuning, DarkLLM not only unifies targeted, untargeted, segmentation, and multi-model attacks within a single framework, but also achieves flexible and controllable adversarial generation, enabling each instruction to produce a perturbation that induces desired behaviors across heterogeneous models. Through extensive experiments across 4 tasks, 13 datasets, and 15 models, we demonstrate that DarkLLM with only 1B parameters can follow attacker instructions and generate highly effective attacks against CLIP, SAM, and frontier LLMs, revealing a systemic vulnerability in modern foundation models.

URL PDF HTML ☆

赞 0 踩 0

2605.18867 2026-05-20 cs.LG cs.AI 版本更新

EVA-0: Test-Time Model Evolution with Only Two Forward Passes per Sample

EVA-0: 仅两次前向传递的测试时间模型演化

Guohao Chen, Shuaicheng Niu, Geng Li, Yunbei Zhang, Shilin Shan, Chunyan Miao, Jianfei Yang

发表机构 * Nanyang Technological University（南洋理工大学）； Tulane University（路易斯安那州立大学）

AI总结本文研究了在仅两次前向传递预算下测试时间模型演化的问题，提出EVA-0框架以解决零阶优化中的三个关键障碍，实现高效部署。

详情

AI中文摘要

测试时间模型演化为部署模型提供了一种改进 unlabeled 测试时间经验的有前景方法，但大多数现有方法依赖反向传播（BP），这导致了显著的内存开销，使它们难以在边缘设备、量化模型、专用加速器或黑盒模型上部署。在本文中，我们研究了在严格两次前向预算下测试时间模型演化，这一设置推动了适应向高度高效的现实部署发展。我们揭示了零阶测试时间优化中的三个关键障碍：对捷径解的易感性、不受控的权重漂移和无效的更新方向估计。为克服这些问题，我们提出了EVA-0，一个最小的零阶适应框架，其特点包括：1）保持损失尺度不变以防止捷径解；2）设计了锚点引导的优化策略以缓解权重漂移；3）使用样本级对称双侧扰动进行更新方向估计和推理。EVA-0不需要BP，并且在每个样本上仅需两次前向传递即可完成推理和适应。在ImageNet-C和ViT-Base上的结果表明，EVA-0优于基于BP的DeYO和无BP的FOA，并在FOA上实现了14倍的速度提升。代码将被发布。

英文摘要

Test-time model evolution offers a promising way for deployed models to improve from unlabeled test-time experience, yet most existing methods depend on backpropagation (BP), which incurs substantial memory overhead and makes them difficult to deploy on edge devices, quantized models, specialized accelerators, or black-box models. In this work, we study test-time model evolution under a strict two-forward budget, a setting that pushes adaptation toward highly efficient real-world deployment. We reveal three key obstacles in zeroth-order test-time optimization: susceptibility to shortcut solutions, uncontrolled weight drift, and ineffective update direction estimation. To overcome them, we propose EVA-0, a minimal zeroth-order adaptation framework that: 1) keeps the loss scale-invariant to prevent shortcut solutions; 2) devises an anchor-guided optimization strategy to alleviate weight drift; 3) uses sample-wise symmetric two-sided perturbation for update direction estimation and inference. EVA-0 requires no BP and performs both inference and adaptation within only two forward passes per sample. Results on ImageNet-C & ViT-Base show that EVA-0 outperforms both BP-based DeYO and BP-free FOA, while achieving a 14x speed-up over FOA. Code will be released.

URL PDF HTML ☆

赞 0 踩 0

2605.18865 2026-05-20 cs.LG cs.AI 版本更新

99%成功悖论：当近完美检索等于随机选择

Vyzantinos Repantis, Harshvardhan Singh, Tony Joseph, Cien Zhang, Akash Vishwakarma, Svetlana Karslioglu, Michael Wyatt Thot, Ameya Gawde

发表机构 * Meta Platforms Inc.（Meta平台公司）

AI总结该研究引入了Bits-over-Random（BoR）指标，揭示了高成功率可能掩盖随机水平性能的现象，指出在大规模数据集上，即使检索结果覆盖率达到99%，其选择性仍可能接近零，从而表明需要重新考虑检索深度和传统指标的报告方式。

Comments 12 pages, 2 figures, 7 tables. Accepted at ICLR 2026 Blog Track, https://iclr-blogposts.github.io/2026/blog/2026/bits-over-random/

Journal ref ICLR Blog Track 2026, https://iclr.cc/virtual/2026/poster/10012083

详情

AI中文摘要

对于信息检索（IR）历史上的大部分时间，搜索结果都是为人类消费者设计的，他们可以自行扫描、过滤和丢弃不相关信息。这塑造了检索系统以寻找并排序更多相关文档为目标，而不是保持结果简洁和干净，因为人类是最终的过滤器。然而，大语言模型（LLMs）改变了这一现状，因为它们缺乏这种过滤能力。为了解决这一问题，我们引入了Bits-over-Random（BoR），这是一种修正了机会的检索选择性度量，揭示了高成功率可能掩盖随机水平性能的情况。我们测量选择性为BoR = log₂（P_obs / P_rand），其中P_rand是所选成功规则（此处为覆盖：top-K中≥1个相关文档）的超几何基线。在20 Newsgroups数据集上，BM25和SPLADE均在K=100时报告>99%的成功率（覆盖），但BoR≈0，表明在该深度下的选择性处于随机水平。当预期覆盖比（K·R̄_q / N）超过3-5时，基线主导并导致选择性崩溃。下游检索增强生成（RAG）评估证实了这一模式：LLM准确性在K=100时可能会显著下降，这与近零BoR上限一致。相比之下，BoR在BEIR/SciFact和MS MARCO上保持正数（其中41个系统在理论上限附近聚集，尽管有13点的召回差距），证实了在稀疏和大规模设置中的基线预测。我们进一步表明，崩溃边界适用于LLM代理工具选择，其中小目录大小导致即使有完美选择器，选择性也会消失。这些发现表明，应将BoR与传统指标一起报告，并在额外检索提供 negligible 选择性增益但增加计算成本时重新考虑深度选择。

英文摘要

For most of the history of information retrieval (IR), search results were designed for human consumers who could scan, filter, and discard irrelevant information on their own. This shaped retrieval systems to optimize for finding and ranking more relevant documents, but not keeping results clean and minimal, as the human was the final filter. However, LLMs have changed that by lacking this filtering ability. To address this, we introduce Bits-over-Random (BoR), a chance-corrected measure of retrieval selectivity that reveals when high success rates mask random-level performance. We measure selectivity as $BoR = \log_{2}\left(\frac{\mathrm{P}_{obs}}{\mathrm{P}_{rand}}\right)$, where $\mathrm{P}_{rand}$ is the hypergeometric baseline for the chosen success rule (here, coverage: $ \geq1 $ relevant in top-$K$). On the 20 Newsgroups dataset, BM25 and SPLADE both report $>99$% success at $K=100$ (coverage), yet $BoR \approx 0$, indicating random-level selectivity at that depth. When the expected coverage ratio $\left(\frac{K \cdot \bar{R}_{q}}{N}\right)$ exceeds 3-5, the baseline dominates and selectivity collapses. Downstream retrieval-augmented generation (RAG) evaluation confirms this pattern: LLM accuracy can degrade substantially at $K=100$, consistent with the near-zero BoR ceiling. In contrast, BoR remains positive on BEIR/SciFact and on MS MARCO (where 41 systems cluster within 0.2 bits of the theoretical ceiling despite a 13-point recall gap), confirming baseline predictions across sparse and large-scale settings. We further show that the collapse boundary applies to LLM agent tool selection, where small catalog sizes cause selectivity to vanish even with perfect selectors. These findings suggest reporting BoR alongside traditional metrics and reconsidering depth choices when additional retrieval provides negligible selectivity gains while inflating computational costs.

URL PDF HTML ☆

赞 0 踩 0

2605.18852 2026-05-20 cs.LG cs.AI cs.CL 版本更新

Robust Checkpoint Selection for Multimodal LLMs via Agentic Evaluation and Stability-Aware Ranking

通过代理评估和稳定性感知排名实现多模态大语言模型的鲁棒检查点选择

Qinwu Xu, Zhuoheng Li, Jessie Salas

发表机构 * Meta AI

AI总结本文提出了一种多阶段框架，结合了精心挑选的现实世界数据、结构化的LLM判断和多阶段排名协议，以解决多模态大语言模型检查点选择中的鲁棒决策问题，强调数据质量（特别是OCR可读性）对评估有效性的重要性。

详情

AI中文摘要

多模态大语言模型（MLLMs）的检查点选择在性能差异微小且评估信号易受噪声影响时面临重大挑战。现有方法依赖静态基准或逐点评分，经常与实际应用场景不一致，并缺乏对不确定性的鲁棒估计，特别是在OCR密集场景中。在本文中，我们将检查点选择建模为在评估不确定性下的稳健决策问题。我们提出了一种多阶段框架，整合了精心挑选的现实世界数据、结构化的LLM判断和多阶段排名协议。评估系统通过逐点过滤、列表排名和成对比较进行逐步细化。为了提高可靠性，我们引入基于子采样的置信度估计和基于百分位数的评分公式，以捕捉分布特征并惩罚尾部失败。此外，我们证明数据质量，特别是OCR可读性，是评估有效性的重要决定因素。

英文摘要

Checkpoint selection for multimodal large language models (MLLMs) presents significant challenges when performance differentials are marginal and evaluation signals are prone to noise. Existing methodologies rely heavily on static benchmarks or pointwise scoring, which frequently misalign with in-the-wild usage and lack robust uncertainty estimation, particularly in OCR-heavy scenarios. In this work, we formulate checkpoint selection as a robust decision problem under evaluation uncertainty. We propose a multi-stage framework that integrates curated real-world data, structured LLM-based judgment, and multi-stage ranking protocols. The evaluation system orchestrates progressive refinement via pointwise filtering, listwise ranking, and pairwise comparison. To enhance reliability, we introduce subsampling-based confidence estimation and a percentile-based scoring formulation that captures distributional characteristics while penalizing tail failures. Furthermore, we demonstrate that data quality, specifically OCR readability, is a critical determinant of evaluation validity.

URL PDF HTML ☆

赞 0 踩 0

2605.18850 2026-05-20 cs.IR cs.AI 版本更新

KadiAssistant: A conversational AI Agent for information retrieval in Kadi4Mat

KadiAssistant: 一种用于Kadi4Mat研究数据生态中信息检索的对话式AI代理

Adrian Cierpka, Mohammad Shafiqul Islam, Johannes Steinhülb, Eric Dietriche Sesso Domtchoueng, Michael Selzer, Arnd Koeppe

发表机构 * Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

AI总结本文提出KadiAssistant，一种集成了隐私设计的AI助手，旨在帮助研究人员高效访问、聚合和整合异构且敏感的研究数据，通过结合自托管的大语言模型和隐私保护的语义搜索，提升信息检索效率并满足复杂的访问控制需求。

详情

AI中文摘要

我们介绍了KadiAssistant，一种集成了隐私设计的AI助手，整合到Kadi研究数据生态系统中，使研究人员能够高效地访问、聚合和整合异构、敏感的研究数据。跨学科领域如材料科学将各学科的术语和标准结合在一起。虽然这种融合推动了创新，但也使连接和获取知识变得更加困难，因为数据分布在不同学科、组织和个人之间。例如，电池研究结合了电化学测量、材料表征数据、基于物理的模拟和制造参数，每种都使用不同的格式、词汇和标准。通过研究数据平台（如Kadi4Mat）高效存储和共享此类异构数据，需要领域知识、技术专长和对元数据模式和接口的熟悉。研究数据的敏感性也各不相同：新生成的'温暖'数据通常属于私人，而发表的'冷'数据通常公开可访问。Kadi生态系统提供所需的细粒度访问控制，以处理敏感数据。因此，一个高效的Kadi信息检索解决方案必须尊重细粒度的访问权限。为解决这些交织的信息检索、强数据隐私和复杂访问控制挑战，KadiAssistant结合了自托管的大语言模型（LLM）和受检索增强生成启发的隐私保护语义搜索，能够访问Kadi中的文件并记录元数据。这使助手能够筛选、聚合和整理信息，形成高度信息丰富的回答。KadiAssistant因此桥接了术语和标准，降低了研究人员的访问障碍，并加强了FAIR数据原则中的Findable支柱。

英文摘要

We introduce KadiAssistant, a privacy-by-design AI assistant integrated into the Kadi research data ecosystem, enabling researchers to efficiently access, aggregate, and synthesize information from heterogeneous, privacy-sensitive research data. Interdisciplinary fields such as materials science bring together disciplines with their own terminology and standards. While this convergence fuels innovation, it also makes it increasingly difficult to connect and access knowledge, as data are distributed across disciplines, organizations, and individuals. For example, battery research combines electrochemical measurements, materials characterization data, physics-based simulations, and manufacturing parameters, each using different formats, vocabularies, and standards. Efficiently storing and sharing such heterogeneous data via research data platforms, such as Kadi4Mat, demands domain knowledge, technical expertise, and familiarity with metadata schemas and interfaces. Research data also vary in sensitivity: newly generated 'warm' data are often private, whereas published 'cold' data are usually openly accessible. The Kadi ecosystem offers fine-grained access control needed for sensitive data. A solution for efficient information retrieval in Kadi must therefore respect the fine-grained access permissions. To address these intertwined challenges of information retrieval, strong data privacy, and complex access control, KadiAssistant combines a self-hosted large language model (LLM) with a privacy-preserving semantic search, inspired by retrieval-augmented generation, that can access files and record metadata on Kadi. This allows the assistant to screen, aggregate, and structure information into a highly informative answer. KadiAssistant therefore bridges terminology and standards, lowers access barriers for researchers, and strengthens the Findable pillar of FAIR data principles.

URL PDF HTML ☆

赞 0 踩 0

2605.18849 2026-05-20 cs.LG cs.AI 版本更新

INSIGHTS: Demonstration-Based Summaries of Time Series Predictors

INSIGHTS: 时间序列预测器的基于演示的摘要

Bar Eini Porat, Rom Gutman, Uri Shalit, Ofra Amir

发表机构 * Technion Israel Institute of Technology（技术学院以色列理工学院）； Tel-Aviv University（特拉维夫大学）

AI总结本文提出INSIGHTS方法，一种模型无关、以用户为中心的方法，用于提供时间序列模型的全局解释。该方法通过生成样本摘要，平衡时间序列样本的重要性与多样性，为用户提供全面的模型行为概述。

详情

AI中文摘要

可解释性方法发展迅速，但时间序列模型的全局解释仍不完善，大多数方法集中在局部实例层面的解释上。我们介绍了INSIGHTS，一种模型无关、以用户为中心的方法，用于提供时间序列模型的全局解释。我们的方法在设计上优先考虑简单性、效率和透明性，确保利益相关者能够轻松采用其输出。尽管当前方法专注于局部解释，INSIGHTS生成样本摘要，提供模型行为的全面概述。它通过利用效用函数平衡时间序列样本的重要性与多样性，捕捉领域特定的时间序列行为特征，如超过领域规范。我们通过实验、访谈和用户研究评估INSIGHTS。我们的结果表明，INSIGHTS能够构建全面、多样的时间序列子集，生成易于个体评估的摘要。它受到领域专家的青睐，因其能够提供模型行为的稳定理解以及识别的样本质量。此外，接受INSIGHTS摘要的用户研究参与者表现出对模型整体行为的更深入理解。

英文摘要

Explainability methods have progressed rapidly, but global explanations for time-series models remain underdeveloped, with most approaches focusing on local, instance-level attributions. We introduce INSIGHTS, a model-agnostic, user-centric approach for providing global explanations of time series models. Our approach prioritizes simplicity, efficiency, and transparency in its design, ensuring that stakeholders can readily adopt its outputs. While current methods focus on local explanations, INSIGHTS generates sample summaries that offer a comprehensive overview of model behavior. It balances the importance and diversity of time series samples to create informative subsets using utility functions that capture domain-specific aspects of time series behavior, such as exceeding domain norms. We evaluate INSIGHTS through experiments, interviews, and a user study. Our results indicate INSIGHTS effectively constructs comprehensive, diverse time series subsets, producing summaries manageable for individual evaluation. It is preferred by domain experts for its ability to provide a stable understanding of model behavior and the quality of the samples identified. Moreover, user study participants presented with INSIGHTS-based summaries exhibit an enhanced understanding of the model's overall behavior.

URL PDF HTML ☆

赞 0 踩 0

2605.18847 2026-05-20 cs.LG cs.AI 版本更新

基于图的跨行业实时监控框架用于反洗钱检测在融合的移动-能源供应链网络

Rong Liu, Xiaojun Xiao, Zhanqing Su

发表机构 * School of Public Policy, University of Southern California（南加州大学公共政策学院）； Boston University（波士顿大学）； Johns Hopkins University（约翰霍普金斯大学）

AI总结本文提出了一种基于图的跨行业实时反洗钱监控框架（GCRMF），用于整合的旅行-能源供应链网络，通过构建跨行业异构图并结合双图注意力网络，动态编码资本流动路径和时间演变特征，以提高跨行业洗钱行为的识别能力，并通过自监督在线学习机制实现实时适应和持续优化。

详情

AI中文摘要

随着旅行和能源行业的深度整合，跨行业供应链金融逐渐成为隐藏洗钱事件的高风险领域。为此，本文提出了一种基于图的跨行业实时反洗钱监控框架（GCRMF）用于整合的旅行-能源供应链网络。首先，构建了一个涵盖新能源汽车租赁平台、能源供应商、金融科技机构等的跨行业异构图（CIHG），并通过临时双图注意力网络（Temporal Dual-Graph Attention Network）整合行业语义，动态编码资本流动路径和时间演变特征。随后，为识别由合谋主体共同产生的结构性欺诈行为，提出了一种基于对比学习和分层图采样的元路径子图推理模块，以增强跨行业反复洗钱行为的识别能力。同时，采用自监督在线学习机制实现实时适应和持续优化以应对新的洗钱策略。实验结果表明，与现有跨行业场景下的图神经网络方法相比，GCRMF在F1分数上提高了超过17.8%，并显著降低了误报率。

英文摘要

With the deep integration of the travel and energy industries, cross-industry supply chain finance has gradually become a high-risk field of hidden money laundering incidents. For this reason, this work proposes a graph-driven cross-industry real-time anti-money laundering monitoring framework (GCRMF) for integrated travel - energy supply chain networks. First, a cross-industry heterogeneous graph (CIHG) covering new energy vehicle rental platforms, energy suppliers, fintech institutions, etc., is constructed, and industry semantics are integrated through temporarily Dual-GAT (Temporal Dual-Graph Attention Network), dynamically encoding capital flow paths and evolution features over time. Subsequently, in order to identify the structural fraud behavior together produced by colluding subjects, a meta-path subgraph reasoning module based on contrastive learning and hierarchical graph sampling is proposed to enhance the discrimination capability of cross-industry recurring money laundering behavior. Meanwhile, a self-supervised online learning mechanism is adopted for real-time adaptation and continuous optimization to new money laundering strategies. The experimental results show that compared with existing graph neural network methods in cross-industry scenarios, GCRMF improves the performance by more than 17.8% of F1 score and greatly reduces the false positive rate.

URL PDF HTML ☆

赞 0 踩 0

2605.18839 2026-05-20 cs.LG cs.AI 版本更新

An Integrated Forecasting Prototype for Emergency Department Boarding Time to Support Proactive Operational Decision Making

急诊部候诊时间集成预测原型：支持主动运营决策制定

Orhun Vural, Abdulaziz Ahmed, Ferhat Zengul, James Booth, Bunyamin Ozaydin

发表机构 * Department of Electrical and Computer Engineering, University of Alabama at Birmingham（阿拉巴马大学伯明翰分校电气与计算机工程系）； Department of Health Services Administration, University of Alabama at Birmingham（阿拉巴马大学伯明翰分校卫生服务管理系）； Department of Biomedical Informatics and Data Science, Heersink School of Medicine, University of Alabama at Birmingham（阿拉巴马大学伯明翰分校希尔斯医院医学院生物医学信息学与数据科学系）； Department of Emergency Medicine, University of Alabama at Birmingham（阿拉巴马大学伯明翰分校急诊医学系）

AI总结本文提出了一种多时间跨度的时间序列预测框架，用于预测急诊部候诊时间，以支持主动的运营决策制定，通过整合真实世界数据和外部上下文数据源，如天气、节假日和重大本地事件，提高了预测准确性。

Comments 22 pages, including supplementary materials

详情

AI中文摘要

急诊部门（ED）的拥挤状况仍然是全球范围内持续存在的运营挑战，导致护理延误和后续拥堵。急诊部候诊时间，定义为被收治患者在等待住院床放置期间在急诊部停留的时间，是这种拥堵的关键指标。提前预测急诊部候诊时间可以实现主动的运营决策制定，防止拥堵加剧。我们开发并评估了多时间跨度的时间序列预测框架，以预测6、8、10、12和24小时的急诊部候诊时间。利用美国一所大学附属城市的大学附属医院的真实世界数据，并整合外部上下文数据源，包括天气、节假日和重大本地事件。基于分解的线性（DLinear）和基于标准化的线性（NLinear）时间序列预测深度学习模型在多个时间跨度上表现优异。模型还被评估了在极端拥堵场景下的表现，这些场景由较高的候诊时间特征化。此外，还开发了一个机器学习运维（MLOps）网页原型应用，以支持将预测框架转化为实际应用，通过整合数据摄入、预测可视化、实验和重新训练等功能。

英文摘要

Overcrowding in emergency departments (ED) remains a persistent operational challenge worldwide, causing delays in care delivery and downstream congestion. ED boarding time, defined as the duration admitted patients remain in the ED while awaiting inpatient bed placement, is a key indicator of this congestion. Predicting ED boarding time in advance enables proactive operational decision making before congestion escalates. We developed and evaluated a multi-horizon time series forecasting framework to predict ED boarding time at 6, 8, 10, 12, and 24-hour horizons. Real-world data from a university-affiliated urban hospital in the United States were utilized and integrated with external contextual data sources, including weather, holidays, and major local events. Decomposition-based Linear (DLinear) and Normalization-based Linear (NLinear) time series forecasting deep learning models showed superior performance across multiple horizons. Models were also evaluated under extreme congestion scenarios characterized by elevated boarding times. In addition, a Machine Learning Operations (MLOps) web application prototype was developed to support translation of the forecasting framework into practice through integrated data ingestion, forecast visualization, experimentation, and retraining.

URL PDF HTML ☆

赞 0 踩 0

2605.18837 2026-05-20 cs.LG cs.AI eess.SP 版本更新

VCR: Learning Valid Contextual Representation for Incomplete Wearable Signals

VCR：学习不完整可穿戴信号的有效上下文表示

Yuxuan Weng, Wenhan Luo, Qijia Shao

发表机构 * The Hong Kong University of Science and Technology（香港科学与技术大学）

AI总结本文提出VCR框架，通过学习鲁棒于模态缺失的表示，解决可穿戴信号不完整问题，提升在多种健康监测任务中的性能和鲁棒性。

详情

AI中文摘要

可穿戴设备能够从多模态信号中实现连续健康监测，但实际部署受到有限标注数据和普遍传感器不完整性的阻碍。尽管大规模自监督预训练减少了对标签的依赖，但现有方法大多假设全模态可用性。目前处理模态缺失的方法通常重建整个缺失信号，这可能导致无法从观测传感器信号推断出的模态特定细节的幻觉，从而降低鲁棒性。我们提出VCR，一种自监督框架，学习提取对模态缺失具有鲁棒性的表示。VCR采用正交分词器，通过校正潜在流形并应用几何投影，严格分离每个模态到共享语义和模态特定残差。这种设计在保持完整信息完整性的同时，为模态缺失下的稳健学习提供了结构基础。所生成的标记由一个缺失感知的混合专家背骨处理，能够适应不同模式的模态可用性。通过将目标限制为仅重建缺失模态的共享组件，VCR有效减轻了无法推断的模态特定细节的幻觉。在多个健康监测任务中，VCR在完整、单缺失和多缺失模态设置下，相比强大的监督和自监督基线，一致提升了性能和鲁棒性。

英文摘要

Wearable devices enable continuous health monitoring from multimodal signals, but real-world deployment is hindered by limited labeled data and pervasive sensor incompleteness. While large-scale self-supervised pretraining reduces label dependence, most existing methods assume full modality availability. Current approaches for handling modality missingness often reconstruct entire absent signals, which can encourage hallucinating modality-specific details that are not inferable from the observed sensor signals and degrade robustness. We propose VCR, a self-supervised framework that learns to extract valid representations robust to modality missingness. VCR employs an orthogonal tokenizer to enforce strict orthogonal disentanglement by rectifying latent manifolds and applying a geometric projection, separating each modality into shared semantics and modality-specific residuals. This design preserves complete information integrity while serving as a structural foundation for robust learning under modality missingness. The resulting tokens are processed by a missing-aware mixture-of-experts backbone that adapts to varying patterns of modality availability. By constraining the objective to reconstruct only the shared components of missing modalities, VCR effectively mitigates hallucinations of non-inferable modality-specific details. Across multiple health monitoring tasks, VCR consistently improves performance and robustness under full, single-missing, and multiple-missing modality settings compared with strong supervised and self-supervised baselines.

URL PDF HTML ☆

赞 0 踩 0

2605.18832 2026-05-20 cs.LG cs.AI 版本更新

Precision Tracked Transformer via Kalman Filtering, Kriging and Process Noise

通过卡尔曼滤波、克里格法和过程噪声的精确跟踪变压器

Bo Long, Deepak Agarwal, Jelena Markovic-Voronov, Yi Wang, Liuqing Li

发表机构 * LinkedIn Core AI（LinkedIn核心AI）

AI总结本文提出了一种基于贝叶斯滤波的变压器（BFT），通过引入精度权重的克里格法、自适应卡尔曼更新和动态模型，解决了传统变压器在处理不确定性方面的不足，提升了序列推荐和大语言模型在噪声环境下的鲁棒性。

详情

AI中文摘要

Transformer是现代AI的基础构建块，但其缺乏对不确定性的原则性处理，这在实际应用中普遍存在：序列推荐中的冷启动标记具有稀疏的历史，语言模型中的异质信号质量，以及由无约束softmax引起的注意力 sinks。每个token都被统一的置信度处理。我们证明这种统一性是我们的贝叶斯滤波变压器（BFT）的退化情况：注意力变为精度加权克里格法，残差连接变为具有自适应增益的卡尔曼更新，FFN变为通过雅可比矩阵加过程噪声规则传播精度的动力学模型。观测精度来自一个无参数的受限最大似然（REML）估计器，具有共轭贝叶斯先验。BFT将任何Transformer层替换为几乎无开销。在序列推荐中，BFT应用于三种主要架构，在六个基准上获得显著提升，其中在冷启动用户和稀有物品上改进最大。在具有噪声数据的监督微调中，BFT在两个领域提高了鲁棒性：噪声监督（问答中的token-标签腐败）和噪声上下文（具有真实RAG干扰项的检索增强问答）。单个原则性修改——恢复精度——在经典序列建模和现代LLM领域中释放了大量空间。

英文摘要

The Transformer is the foundational building block of modern AI, yet offers no principled handling of \emph{uncertainty}, which is prevalent in real applications: cold-start tokens with sparse histories in sequential recommendation, heterogeneous signal quality in language models, and attention sinks induced by unconstrained softmax. Every token is treated with uniform confidence. We show this uniformity is a degenerate case of our \emph{Bayesian Filtering Transformer} (BFT): attention becomes precision-weighted kriging, the residual connection becomes a Kalman update with adaptive gain, and the FFN becomes a dynamics model propagating precision via a Jacobian--plus--process-noise rule. Observation precision comes from a parameter-free Restricted Maximum Likelihood (REML) estimator with a conjugate Bayesian prior. BFT replaces any Transformer layer with negligible overhead. On sequential recommendation, BFT applied to three major architectures yields significant gains on six benchmarks, with the largest improvements on cold-start users and rare items where uncertainty is highest. On supervised fine-tuning of large language models with noisy data, BFT improves robustness in two regimes: noisy supervision (token-label corruption in question answering) and noisy context (retrieval-augmented QA with real RAG distractors). A single principled modification -- restoring precision -- unlocks substantial headroom across both classical sequence-modeling and modern LLM regimes.

URL PDF HTML ☆

赞 0 踩 0

2605.18826 2026-05-20 cs.LG cs.AI 版本更新

The Routing and Filtering Structure of Attention

注意力的路由和过滤结构

Shafayeth Jamil, Rehan Kapadia

发表机构 * University of Southern California（南加州大学）

AI总结本文研究了注意力机制中的路由和过滤结构，通过分解1776个预训练Transformer的头部，发现路由在低秩状态下运行，并引入S-D注意力作为诊断参数化方法，分离路由和过滤，实现稳定训练和有效降维。

Comments 13 pages, 7 figures

详情

AI中文摘要

注意力交互矩阵$QK^{ op}$包含两个交织的计算：一个斜对称成分用于在位置间重新分配信息（路由），一个对称成分用于缩放相互相关性（过滤）。我们分解了五个预训练Transformer中的1776个头部，发现路由在低秩状态下运行，远低于权重核分配的路由能力。我们引入了S-D注意力作为诊断参数化方法，通过构造分离路由和过滤，保证稳定性（$\mathrm{Re}(λ) \le 0$）并稳定训练而无需层归一化。当分离和未归一化时，路由自组织成一个谱级联，第一层的有效秩为2，随着深度扩展到六个尺度，从7M到355M参数。级联预测了注意力可以简化的位置：线性化125M S-D注意力的前七层成本低于5%的困惑度，而标准注意力在相同干预下崩溃。可线性化的区域随着深度扩大。用ELU+1线性注意力替换前四层，可在完整头部维度内达到基线的1.4%以内。级联分配的架构用注意力参数换取困惑度（47%-65%更少的注意力参数，+3.9%到+8.4% PPL）。路由-过滤分解使谱预算变得清晰；级联使其具有可操作性。

英文摘要

The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability ($\mathrm{Re}(λ) \le 0$) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank $2$ at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven layers of 125M $S$-$D$ attention costs ${<}5\%$ perplexity, whereas standard attention collapses under the same intervention. The linearizable region widens with depth. Replacing the first four layers with ELU+1 linear attention reaches within $1.4\%$ of baseline at full head dimension. Cascade-allocated architectures trade attention parameters for perplexity ($47\%-65\%$ fewer attention parameters at $+3.9\%$ to $+8.4\%$ PPL). The routing-filtering decomposition makes the spectral budget legible; the cascade makes it actionable.

URL PDF HTML ☆

赞 0 踩 0

2605.18824 2026-05-20 cs.LG cs.AI cs.CL 版本更新

Fine-Grained Benchmark Generation for Comprehensive Evaluation of Foundation Models

细粒度基准生成用于基础模型的全面评估

Mohammed Saidul Islam, Negin Baghbanzadeh, Farnaz Kohankhaki, Afshin Cheraghi, Ali Kore, Shayaan Mehdi, Elham Dolatabadi, Arash Afkanpour

发表机构 * Vector Institute（Vector研究院）； York University（约克大学）

AI总结本文提出了一种自动化基准生成框架，用于生成覆盖广泛、元数据丰富且抗污染的评估问题，从而提升基础模型的全面评估能力。

详情

AI中文摘要

基础模型的评估通常依赖于缺乏全面覆盖和细粒度评估元数据的基准汇总分数。我们引入了一个自动化基准生成框架。该框架生成基于参考材料（如教科书）的评估问题，生成具有广泛覆盖、丰富元数据和抗污染性的基准。该流程采用多代理架构进行问题生成，并采用以解决方案图驱动的策略，显著提高了地面真实解决方案的可靠性。使用该框架，我们生成了三个基准：机器学习、公司金融和个人金融。专家审查发现，其地面真实错误率显著低于之前的基准，如MMLU和GSM8K。对12个商业和开源模型的评估显示，我们的基准实现了接近均匀的竞争力覆盖，并揭示了现有基准未能捕捉到的模型间性能差异。我们即将开源该框架和我们精心挑选的基准。

英文摘要

Evaluation of foundation models often rely on aggregate scores from benchmarks that lack comprehensive coverage and metadata for a fine-grained evaluation. We introduce a framework for automated benchmark generation. Our framework generates evaluation problems grounded in reference material, such as textbooks, producing benchmarks with broad coverage, rich metadata, and robustness to contamination. The pipeline employs a multi-agent architecture for problem generation and a solution-graph-driven strategy that significantly improves the reliability of ground truth solutions. Using the framework, we generate three benchmarks in Machine Learning, Corporate Finance, and Personal Finance. Expert review finds a significantly lower ground-truth error rate than previous benchmarks such as MMLU and GSM8K. Evaluation of 12 commercial and open-source models shows that our benchmarks achieve near-uniform competency coverage and surface performance differences across models that existing benchmarks fail to capture. We will open-source the framework and our curated benchmarks soon.

URL PDF HTML ☆

赞 0 踩 0

2605.18822 2026-05-20 cs.LG cs.AI 版本更新

Hybrid-LoRA: Bridging Full Fine-Tuning and Low-Rank Adaptation for Post-Training

Hybrid-LoRA: 桥接全微调与低秩适应以实现训练后优化

Chengqian Zhang, Wei Zhu, Kyumin Lee

发表机构 * Worcester Polytechnic Institute（沃斯特理工学院）； University of Hong Kong（香港大学）

AI总结本文提出Hybrid-LoRA框架，通过选择性地对部分模块进行全微调，其余模块使用LoRA进行适应，从而在训练后优化中实现高效性能。

详情

AI中文摘要

训练后已成为适应大型语言模型（LLMs）以实现复杂下游行为（如指令遵循、偏好对齐和多步推理）的关键方法。最近，基于可验证奖励的强化学习（RLVR）作为一种特别有效的训练后范式，通过如GRPO和GSPO等无批评算法实现了可扩展的优化。然而，使用全微调（FFT）的RLVR训练后方法需要大量GPU内存并导致高训练成本。尽管参数高效微调（PEFT）方法如低秩适应（LoRA）能有效降低计算成本，但它们在复杂推理任务的训练后性能上往往存在显著差距。在本文中，我们提出了Hybrid-LoRA，一种高效的训练后框架，该框架选择性地对一小部分不太适合低秩适应的模块进行全微调，而对其余模块使用LoRA进行适应。我们引入了一个新的Hybrid-LoRA Score，用于在固定参数预算下对候选模块按其对低秩适应的敏感性进行排序。实验表明，在10%的全微调模块预算下，Hybrid-LoRA能够接近全微调性能，其余候选模块通过LoRA进行适应， consistently outperforming four state-of-the-art PEFT post-training baselines，实现了高达5.65%和平均4.36%的改进。

英文摘要

Post-training has become essential for adapting large language models (LLMs) to complex downstream behaviors, including instruction following, preference alignment, and multi-step reasoning. Reinforcement learning with verifiable rewards (RLVR) has recently emerged as a particularly effective post-training paradigm for improving reasoning capabilities, with critic-free algorithms such as GRPO and GSPO enabling scalable optimization. However, RLVR post-training with full fine-tuning (FFT) requires substantial GPU memory and incurs high training costs. Although parameter-efficient fine-tuning (PEFT) methods, such as Low-Rank Adaptation (LoRA), effectively reduce computational costs, they often suffer from a noticeable performance gap compared to full fine-tuning in post-training for complex reasoning tasks. In this paper, we propose Hybrid-LoRA, an efficient hybrid post-training framework that selectively applies full fine-tuning to a small subset of modules less suited to low-rank adaptation, while adapting the remaining components with LoRA. We introduce a novel Hybrid-LoRA Score to rank candidate modules according to their sensitivity to low-rank adaptation under a fixed parameter budget. Experiments show that Hybrid-LoRA closely matches full fine-tuning performance under a 10% full fine-tuning module budget, with the remaining candidate modules adapted by LoRA, consistently outperforming four state-of-the-art PEFT post-training baselines, achieving improvements of up to 5.65% and on average 4.36% over the best baseline.

URL PDF HTML ☆

赞 0 踩 0

2605.18820 2026-05-20 cs.LG cs.AI 版本更新

Emergence of Frontier Superposition: Möbius attractor and Cascade Supervision

前沿叠加的涌现：莫比乌斯吸引子与级联监督

Hongyu Gu, Jingwen Fu

发表机构 * University of Science and Technology of China（中国科学技术大学）； Zhongguancun Academy（中关村学院）

AI总结本文研究了通过叠加实现深度推理的问题，提出莫比乌斯吸引子和级联监督方法，证明了在Erdős-Rényi图上，叠加推理的涌现是通过建筑和监督的贡献实现的。

Comments 40 pages, 3 figures

详情

AI中文摘要

叠加允许Transformer在深度推理中并行处理整个推理前沿，通过有限深度的前向传递而不是展开串行的思维链token。虽然Zhu等人(2025)在单一残差流中手工构建了一个等权重的广度优先前沿用于图可达性，但仍未确定梯度下降能否在排列对称的鞍点中找到这个目标。我们通过隔离建筑和监督的贡献，填补了在Erdős-Rényi图上通过叠加实现可达性的问题。在建筑方面，我们识别出一个莫比乌斯吸引子：在树的 regime 中，层间动态减少到一个1D莫比乌斯映射，其零集是一个共维数为一的全局最优解 manifold，包含等权重叠加状态。在监督方面，我们识别出级联监督：一个损失类别，其反向传播同时提供(A)选择性 bootstrap，(B)梯度在深度的持续性，以及(C)每一步的区分（例如L_sup和L_node）。端到端监督失败于条件(B)，并被证明是不足的：在图的扇出和停滞前到达 manifold 之前，层c的内部梯度衰减为(np)^{-(D-c-2)/2}。我们的论点：莫比乌斯吸引子 + 级联监督 = 叠加推理的涌现。参数无关的衰减定律预测在深度D=3时，最终步骤余弦为0.35 vs. 0.71（端到端 vs. 级联）；实验证实0.37 vs. 0.69，每一步的匹配误差在0.02以内。

英文摘要

Superposition allows Transformers to reason in depth, carrying an entire reasoning frontier in parallel through a bounded-depth forward pass instead of unrolling serial chain-of-thought tokens. While Zhu et al. (2025) hand-crafted an equal-weight breadth-first frontier in a single residual stream for graph reachability, it remained open whether gradient descent could ever find this target amidst permutation-symmetric saddles. We close this gap on Reachability-by-Superposition over Erdős-Rényi graphs by isolating architectural and supervisional contributions. Architecturally, we identify a Möbius attractor: under $S_n$-symmetry in the tree regime, layerwise dynamics reduce to a 1D Möbius map whose zero set is a codimension-one manifold of global optima containing the equal-weight superposition state. On the supervision side, we identify Cascade Supervision: a loss class whose backward pass simultaneously delivers (A) selectivity bootstrap, (B) gradient persistence across depth, and (C) per-step discrimination (e.g., \mathcal{L}_{sup} and \mathcal{L}_{node}). End-to-end supervision fails condition (B) and is provably insufficient: internal gradients at layer c decay as (np)^{-(D-c-2)/2} in the graph fan-out and stall before the manifold is reached. Our thesis: Möbius attractor + Cascade Supervision = emergence of superposition reasoning. The parameter-free decay law predicts a final-step cosine of 0.35 vs. 0.71 (end-to-end vs. cascade) at depth D=3; experiments confirm 0.37 vs. 0.69, matching within 0.02 at every step.

URL PDF HTML ☆

赞 0 踩 0

2605.18818 2026-05-20 cs.AI cs.LG cs.SE 版本更新

Operationalizing Document AI: A Microservice Architecture for OCR and LLM Pipelines in Production

将文档AI operationalize：一种用于OCR和LLM流水线的微服务架构

Yao Fehlis, Benjamin Bengfort, Zhangzhang Si, Vahid Eyorokon, Prema Roman, Patrick Deziel, Devon Slonaker, Steve Veldman, Ben Johnson, Joyce Rigelo, Michael Wharton, Steve Kramer

AI总结本文提出了一种微服务架构，用于在生产环境中实现文档理解，通过整合多个模型的流水线，包括分类、OCR和LLM结构字段提取，并展示了在每小时处理数千页文档的经验。

详情

AI中文摘要

学术研究往往集中在新的文档理解模型上，导致文献中模型定义与大规模生产模型之间存在较大差距。为了缩小这一差距，我们提出了一种微服务架构，该架构封装了多个模型的流水线，包括分类、光学字符识别（OCR）和大型语言模型结构字段提取，并展示了该流水线在每小时处理数千页文档的经验。我们描述了主要的设计决策，包括混合分类、将GPU绑定的推理与CPU绑定的编排分离、使用异步处理处理流水线中的许多I/O绑定操作，以及独立的水平扩展策略。通过批量分析，我们发现了两个令人惊讶的定性发现，这些发现影响了生产部署：OCR而不是语言模型解析主导了端到端延迟，并且系统饱和度由共享的GPU推理容量而不是工作程序数量决定。我们的目标是为从业者提供具体的架构模式，以构建在基准之外有效工作的文档理解系统；有效地将模型 operationalize 在生产环境中。

英文摘要

Academic research tends to focus on new models for document understanding creating a wide gap in the literature between model definition and running models at production scale. To close that gap, we present a microservice architecture that encapsulates pipelines of multiple models for classification, optical character recognition (OCR), and large language model structured field extraction as well as our experience running this pipeline on thousands of multi-page documents per hour. We describe our primary design decisions, including a hybrid classification, separation of GPU-bound inference from CPU-bound orchestration, use of asynchronous processing for the many IO-bound operations in the pipeline, and an independent, horizontal scaling strategy. Using batch profiling, we identified two surprising qualitative findings that shape production deployments: OCR, not language-model parsing, dominates end-to-end latency, and the system saturates at a concurrency determined by shared GPU-inference capacity rather than worker count. Our goal is to provide practitioners with concrete architectural patterns for building document understanding systems that work beyond the benchmark; effectively operationalizing models in production.

URL PDF HTML ☆

赞 0 踩 0

2605.18816 2026-05-20 cs.LG cs.AI 版本更新

Symmetry in the Wild: The Role of Equivariance in Neural Fluid Surrogates

野生中的对称性：等变性在神经流体代理中的作用

Patryk Rygiel, Julian Suk, Kak Khee Yeung, Christoph Brune, Jelmer M. Wolterink

发表机构 * Department of Applied Mathematics（应用数学系）； Technical Medical Centre（技术医学中心）； Cardiovascular Health Technology Centre（心血管健康技术中心）； University of Twente（特文特大学）； Department of Computer Science（计算机科学系）； Munich Center for Machine Learning（慕尼黑机器学习中心）； Technical University of Munich（慕尼黑技术大学）； Department of Surgery（外科系）； Amsterdam UMC, Location（阿姆斯特丹大学医学中心，地点）； University of Amsterdam（阿姆斯特丹大学）； Amsterdam Cardiovascular Sciences（阿姆斯特丹心血管科学）； Digital Society Institute（数字社会研究所）

AI总结本文研究了等变性在神经流体代理中的作用，探讨了在不同分布对齐和真实度的任务中，等变性如何提高泛化能力，并介绍了AB-GATr模型在处理耦合表面和体积量时的效率。

详情

AI中文摘要

神经代理能够将计算流体动力学（CFD）模拟的计算速度提升几个数量级，有望改变工程和医疗流程。在现实应用中使用神经代理需要解决可扩展性问题，包括大规模、高分辨率表面和体积网格以及定制架构，并通过归纳偏置来应对有限的训练数据。群等变架构是引入此类偏置的一种系统方法，但当学习问题本身破坏对称性时，例如由于数据集中的强分布对齐，可能会产生不利影响。在本工作中，我们探讨了在具有不同分布对齐和真实度的任务中，等变性如何提高神经CFD代理的泛化能力，涵盖汽车空气动力学和血流（血动力学）。为了系统评估等变性在问题可扩展性极限处的附加价值，我们引入了Anchored-Branched Geometric Algebra Transformer（AB-GATr），一种整合了可扩展性和对称性保持的神经代理，能够以E(3)等变的方式高效建模耦合的表面和体积量。我们发现，在强对齐的空气动力学数据集上，即那些破坏对称性的数据集，强制等变性会降低分布内性能。相反，在具有不同几何形状和变化对齐的血动力学基准测试中，等变性始终有益。此外，在所有基准测试中，AB-GATr的显式等变性通过数据增强始终优于隐式对称学习。我们的发现表明，等变性并非在所有领域都有益，但在缺乏强数据规律的问题中带来了实质性的优势。

英文摘要

Neural surrogates enable orders-of-magnitude acceleration of computational fluid dynamics (CFD) simulations, with the potential to transform engineering and healthcare workflows. Neural surrogate use in real-world applications requires addressing scalability to large, high-resolution surface and volume meshes, as well as to bespoke architectures, and accounting for limited training data through the use of inductive biases. Group-equivariant architectures are a principled way to introduce such bias, yet they can be detrimental when the learning problem itself breaks symmetry, for example, due to strong distributional alignment in the dataset. In this work, we investigate under which conditions equivariance improves generalization in neural CFD surrogates across tasks with increasing levels of distributional alignment and realism, covering automotive aerodynamics and blood flow (hemodynamics). To systematically assess the added value of equivariance at the limit of problem scaling, we introduce the Anchored-Branched Geometric Algebra Transformer (AB-GATr), a neural surrogate that integrates scalability and symmetry preservation to efficiently model coupled surface and volume quantities in an $E(3)$-equivariant manner. We find that on strongly aligned aerodynamics datasets, i.e., those that break symmetry, enforcing equivariance can degrade in-distribution performance. In contrast, across hemodynamic benchmarks with diverse geometries and varying alignment, equivariance is consistently beneficial. Moreover, across all benchmarks, the explicit equivariance of AB-GATr reliably outperforms implicit symmetry learning through data augmentation. Our findings showcase that equivariance is not universally beneficial across domains, yet it brings tangible advantages in problems lacking strong data regularities.

URL PDF HTML ☆

赞 0 踩 0

2605.18813 2026-05-20 cs.LG cs.AI 版本更新

Composition of Memory Experts for Diffusion World Models

记忆专家的组合用于扩散世界模型

Sebastian Stapf, Pablo Acuaviva Huertos, Aram Davtyan, Paolo Favaro

发表机构 * Computer Vision Group（计算机视觉组）； Department of Computer Science（计算机科学系）； University of Bern（伯恩大学）

AI总结本文提出了一种基于扩散的世界模型框架，通过组合专门化的记忆专家来解决记忆与效率之间的权衡问题，提升了时间一致性、过去观察的回忆和导航性能。

Journal ref Proceedings of the Fourteenth International Conference on Learning Representations (ICLR), 2026

详情

AI中文摘要

世界模型旨在预测与过去观察一致的合理未来，这是强化学习中规划和决策的关键能力。然而，现有架构面临根本性的记忆权衡：转换器保留局部细节但受二次注意限制，而递归和状态空间模型更高效但以牺牲保真度为代价。为克服这一权衡，我们建议将未来-过去一致性与任何单一架构解耦，并利用一组专门的专家。我们引入了一种基于扩散的框架，通过对比产品-专家公式整合异构记忆模型。我们的方法实现了三个互补的角色：短期记忆专家捕捉精细的局部动态，长期记忆专家通过轻量级测试时微调在外部扩散权重中存储事件历史，以及空间长期记忆专家强制几何和空间一致性。这种组合设计避免了模式崩溃，并在不产生二次成本的情况下扩展到长上下文。在模拟和现实世界基准测试中，我们的方法提高了时间一致性、过去观察的回忆和导航性能，建立了一种新的构建和操作记忆增强扩散世界模型的范式。

英文摘要

World models aim to predict plausible futures consistent with past observations, a capability central to planning and decision-making in reinforcement learning. Yet, existing architectures face a fundamental memory trade-off: transformers preserve local detail but are bottlenecked by quadratic attention, while recurrent and state-space models scale more efficiently but compress history at the cost of fidelity. To overcome this trade-off, we suggest decoupling future-past consistency from any single architecture and instead leveraging a set of specialized experts. We introduce a diffusion-based framework that integrates heterogeneous memory models through a contrastive product-of-experts formulation. Our approach instantiates three complementary roles: a short-term memory expert that captures fine local dynamics, a long-term memory expert that stores episodic history in external diffusion weights via lightweight test-time finetuning, and a spatial long-term memory expert that enforces geometric and spatial coherence. This compositional design avoids mode collapse and scales to long contexts without incurring a quadratic cost. Across simulated and real-world benchmarks, our method improves temporal consistency, recall of past observations, and navigation performance, establishing a novel paradigm for building and operating memory-augmented diffusion world models.

URL PDF HTML ☆

赞 0 踩 0

2605.18810 2026-05-20 cs.LG cs.AI 版本更新

D-PACE: Dynamic Position-Aware Cross-Entropy for Parallel Speculative Drafting

D-PACE：动态位置感知交叉熵用于并行推测草案

Tianyu Wu, Yu Yao, Zhenting Qi, Han Zheng, Zhuohan Wang, Haoran Ma, Lawrence Liao, Himabindu Lakkaraju, Ju Li, Yilun Du

发表机构 * Harvard（哈佛大学）； MIT（麻省理工学院）

AI总结本文提出D-PACE，一种动态位置感知交叉熵，用于改进并行推测草案的训练，通过动态调整位置权重以提高生成速度和输出长度。

详情

AI中文摘要

推测解码通过让小型草案生成器并行生成token，由更大目标模型验证，从而加速LLM推理。最近的扩散式并行草案生成器如DFlash在一次前向传递中预测完整的B-token块，使深度草案生成器和更长的接受块成为可能。然而，现有多token草案生成器目标通常使用固定的位置依赖加权计划，如头部依赖权重或块位置衰减，这在训练过程中无法适应限制接受的位置变化。为此，我们从可微的替代品中推导出每位置的训练权重，使每个位置的权重与其log概率梯度贡献相匹配。所得到的损失，D-PACE（动态位置感知交叉熵），将训练信号转向当前限制接受的位置，随着草案生成器的改进。在六个基准、两个Qwen3-4B草案深度、两个解码温度和两个额外的目标模型上，D-PACE一致地提高了墙钟加速速度和平均生成长度，测量训练时间开销为2.3%，且不改变草案生成器的架构或推理过程。

英文摘要

Speculative decoding accelerates LLM inference by having a small drafter propose tokens that a larger target model verifies in parallel. Recent diffusion-based parallel drafters such as DFlash predict the full B-token block in one forward pass, enabling deeper drafters and longer accepted blocks. However, existing multi-token drafter objectives often use fixed position-dependent weighting schedules, such as head-dependent weights or block-position decays, which do not adapt as the positions limiting acceptance change during training. To address this, we derive per-position training weights from a differentiable surrogate of expected accepted draft length, matching the weight of each position to its log-probability gradient contribution. The resulting loss, D-PACE (Dynamic Position-Aware Cross-Entropy), shifts training signal toward positions that currently limit acceptance as the drafter improves. Across six benchmarks, two Qwen3-4B draft depths, two decoding temperatures, and two additional target models, D-PACE consistently improves both wall-clock speedup and average emitted length, with 2.3\% measured training-time overhead and no changes to the drafter architecture or inference procedure.

URL PDF HTML ☆

赞 0 踩 0

2605.18809 2026-05-20 cs.LG cs.AI 版本更新

一种用于可穿戴PPG心血管稳定性的非线性复杂性指数：多尺度验证、系统性评估修正与贝叶斯参数优化

Timothy Oladunni, Farouk Ganiyu Adewumi

发表机构 * Department of Computer Science, Morgan State University（莫根州立大学计算机科学系）

AI总结本文提出了一种基于心脏稳定性理论的非线性复杂性指数（SCSI），通过多尺度验证和系统性评估修正，结合贝叶斯参数优化，提高了可穿戴PPG心血管稳定性估计的准确性与可靠性。

详情

AI中文摘要

从可穿戴光体积脉动图（PPG）估计心血管稳定性需要一个原理性的非线性框架，但目前在启发式参数选择和评估协议方面仍存在重大差距，这些协议会夸大报告性能。我们引入了基于心脏稳定性理论的稳定性受限心血管稳定性指数（SCSI），并验证了来自四个异质PPG数据集的176,742个片段，在三个时间尺度上。跨数据集分析显示了显著的Kruskal-Wallis效应量（eta2 = 0.351，p < 0.001），强跨尺度一致性（kappa > 0.97）以及在53个ICU记录中与呼吸频率的显著相关性（Spearman r = 0.346，p = 0.011）。我们识别出三个评估伪影，这些伪影会夸大启发式AUC从真实的基线0.573到0.752：片段级交叉验证泄漏、测试集归一化泄漏以及池化AUC过重加权，这些伪影隐藏了每名患者的失败。纠正这些伪影并应用贝叶斯优化在15个联合参数上，得到SCSI在交叉验证AUC为0.720。在18个保留记录上，SCSI达到池化AUC为0.757（95%置信区间：0.686-0.828）和负预测值为0.966用于心动过速筛查，同时每记录AUC为0.497 ± 0.207被披露以提高透明度。外部验证在42个择期手术记录上得到AUC为0.621，证实了跨人群泛化。消融分析识别出非线性复杂度模块是主导组件。提出了一种稀疏三组件架构作为最小可部署配置。经过修正的协议提供了一个可重复的基准，用于未来可穿戴心血管稳定性指数。

英文摘要

Cardiovascular stability estimation from wearable photoplethysmography (PPG) requires a principled nonlinear framework, yet major gaps persist in heuristic parameter selection and evaluation protocols that inflate reported performance. We introduce a Stability-Constrained Cardiovascular Stability Index (SCSI) grounded in Cardiac Stability Theory and validate it across 176,742 segments from four heterogeneous PPG datasets at three temporal scales. Cross-dataset analysis demonstrates a large Kruskal-Wallis effect size (eta2 = 0.351, p < 0.001), strong cross-scale consistency (kappa > 0.97), and significant correlation with respiratory rate across 53 ICU records (Spearman r = 0.346, p = 0.011). We identify three evaluation artifacts that inflate heuristic AUC from a true baseline of 0.573 to 0.752: segment-level cross-validation leakage, test-set normalization leakage, and pooled-AUC overweighting that conceals per-patient failure. Correcting these artifacts and applying Bayesian optimization over 15 joint parameters yields SCSI with cross-validation AUC of 0.720. On 18 held-out records, SCSI achieves pooled AUC of 0.757 (95% CI: 0.686-0.828) and negative predictive value of 0.966 for tachypnea screening, while per-record AUC of 0.497 +/- 0.207 is disclosed for transparency. External validation on 42 elective-surgery records yields AUC of 0.621, confirming cross-population generalization. Ablation analysis identifies the nonlinear complexity module as the dominant component. A sparse three-component architecture is proposed as the minimal deployable configuration. The corrected protocol provides a reproducible benchmark for future wearable cardiovascular stability indices.

URL PDF HTML ☆

赞 0 踩 0

2605.18801 2026-05-20 cs.AI cs.IR cs.LG 版本更新

Position: Let's Develop Data Probes to Fundamentally Understand How Data Affects LLM Performance

位置：让我们开发数据探针，以根本理解数据如何影响大语言模型性能

Shiqiang Wang, Herbert Woisetschläger, Hans Arno Jacobsen, Mingyue Ji

发表机构 * Department of Computer Science, University of Exeter, UK（埃克塞特大学计算机科学系）； Technical University of Munich, Germany（慕尼黑技术大学）； Department of Electrical and Computer Engineering, University of Toronto, Canada（多伦多大学电气与计算机工程系）； Department of Electrical and Computer Engineering, University of Florida, FL, USA（佛罗里达大学电气与计算机工程系）

AI总结本文提出通过开发数据探针系统方法生成合成序列，以揭示数据特性对大语言模型性能、泛化能力和鲁棒性的影响，从而超越经验启发式方法。

Comments Accepted to ICML 2026 Position Paper Track

Journal ref Link to ICML record: https://icml.cc/virtual/2026/poster/67154

详情

AI中文摘要

数据对于大语言模型（LLMs）至关重要。然而，了解哪些数据对LLM工作流程的不同阶段（包括训练、微调、对齐、上下文学习等）有用，以及为什么有用，仍然是一个开放性问题。当前的方法依赖于对大型公共数据集进行大量实验来获得数据过滤和数据集构建的经验启发式方法。这些方法计算成本高，并且缺乏一种系统的方法来理解特定数据特性如何驱动LLM行为的本质。在本文的位置论文中，我们倡导开发系统方法来生成合成序列，这些序列由适当定义的随机过程生成，目的是当它们用于LLM工作流程的一个或多个阶段时，能够揭示有用的特点。我们将这些序列称为数据探针。通过观察LLM在数据探针上的行为，研究人员可以系统地研究数据特性如何影响模型性能、泛化能力和鲁棒性。探测序列表现出的统计特性可以通过理论概念（如典型集）来观察，这些概念被推广以描述LLM的行为。这种数据探针方法为揭示数据在LLM训练和推理中的基础作用提供了途径，超越了经验启发式方法。

英文摘要

Data is fundamental to large language models (LLMs). However, understanding of what makes certain data useful for different stages of an LLM workflow, including training, tuning, alignment, in-context learning, etc., and why, remains an open question. Current approaches rely heavily on extensive experimentation with large public datasets to obtain empirical heuristics for data filtering and dataset construction. These approaches are compute intensive and lack a principled way of understanding the essence of how specific data characteristics drive LLM behavior. In this position paper, we advocate for the need of developing systematic methodologies for generating synthetic sequences from appropriately defined random processes, with the goal that these sequences can reveal useful characteristics when they are used in one or multiple stages of the LLM workflow. We refer to such sequences as data probes. By observing LLM behavior on data probes, researchers can systematically conduct studies on how data characteristics influence model performance, generalization, and robustness. The probing sequences exhibit statistical properties that can be viewed using theoretical concepts, such as typical sets, which are generalized to describe the behaviors of LLMs. This data-probe approach provides a pathway for uncovering foundational insights into the role of data in LLM training and inference, beyond empirical heuristics.

URL PDF HTML ☆

赞 0 踩 0

2605.18800 2026-05-20 cs.LG cs.AI 版本更新

Theory-optimal Quantization Based on Flatness

基于平坦度的理论最优量化

Xiusheng Huang, Zhe Li, Xuanwu Yin, Lu Wang, Yequan Wang, Dong Li, Emad Barsoum, Kang Liu

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences（认知与决策智能复杂系统重点实验室，自动化研究所，中国科学院）； School of Artificial Intelligence, University of Chinese Academy of Sciences（中国科学院大学人工智能学院）； Beijing Academy of Artificial Intelligence（北京人工智能研究院）； AMD ； Ritzz-AI

AI总结本文提出了一种基于平坦度的理论最优量化方法，通过分析量化误差与异常值之间的数学关系，引入了平坦度指标来量化异常值分布，并提出了双向对角量化框架BDQ，有效分散异常值模式，提升了大语言模型在低比特精度下的性能。

Comments 16 pages, 2 figures

详情

AI中文摘要

后训练量化已成为压缩和加速大型语言模型（LLMs）推理的广泛采用技术。LLMs量化的首要挑战源于激活异常值，这些异常值在低比特精度下显著降低模型性能。尽管近期方法试图通过跨特征维度的线性变换来缓解异常值，我们的分析表明，变换后的权重和激活仍然表现出持续的异常值模式，具有集中化的幅度分布。在本文中，我们首先建模量化误差与异常值之间的数学关系，然后引入一个新的指标平坦度来量化异常值的分布。基于此，我们推导出与平坦度相关的理论最优解。基于这些见解，我们提出了双向对角量化（BDQ），一种新的后训练量化框架，通过优化的矩阵变换有效分散异常值模式。BDQ通过学习的对角操作策略性地将异常值幅度分布到矩阵维度中。广泛的实验表明，BDQ建立了新的量化基准。在LLaMA-3-8B模型上，BDQ在W4A4量化中实现了小于1%的精度下降。在更具挑战性的W2A4KV16实验中，与最先进的方法相比，BDQ在DeepSeek-R1-Distill-LLaMA-70B模型上将性能差距减少了39.1%。

英文摘要

Post-training quantization has emerged as a widely adopted technique for compressing and accelerating the inference of Large Language Models (LLMs). The primary challenges in LLMs quantization stem from activation outliers, which significantly degrade model performance especially at lower bit precision. While recent approaches attempt to mitigate outliers through linear transformations across feature dimensions, our analysis reveals that the transformed weights and activations still exhibit persistent outlier patterns with concentrated magnitude distributions. In this paper, we first model the mathematical relationship between quantization error and outliers, and then introduce a new metric Flatness to quantify the distribution of outliers. Based on this, we derive the theoretical optimal solution with respect to Flatness. Building on these insights, we propose Bidirectional Diagonal Quantization (BDQ), a novel post-training quantization framework that effectively disperses outlier patterns through optimized matrix transformations. BDQ strategically distributes outlier magnitudes across matrix dimensions via learned diagonal operations. Extensive experiments demonstrate that BDQ establishes a new quantization benchmark. It achieves less than 1\% accuracy drop in W4A4 quantization on the LLaMA-3-8B model. In the more challenging W2A4KV16 experiment, compared to state-of-the-art approaches, BDQ reduces the performance gap by 39.1\% on the DeepSeek-R1-Distill-LLaMA-70B model.

URL PDF HTML ☆

赞 0 踩 0

2605.18799 2026-05-20 cs.LG cs.AI cs.CL 版本更新

ReCrit: Transition-Aware Reinforcement Learning for Scientific Critic Reasoning

ReCrit: 基于过渡意识的强化学习用于科学批评推理

Wanghan Xu, Yuhao Zhou, Hengyuan Zhao, Shuo Li, Dianzhi Yu, Zhenfei Yin, Yaowen Hu, Fengli Xu, Wanli Ouyang, Wenlong Zhang, Lei Bai

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； National University of Singapore（新加坡国立大学）； Chinese University of Hong Kong（香港中文大学）； University of Oxford（牛津大学）； Tsinghua University（清华大学）

AI总结该研究提出ReCrit框架，通过强化学习解决科学批评推理中的过渡意识问题，改进了批评准确性。

详情

AI中文摘要

大型语言模型在批评交互中不仅可能因回答错误而失败，还可能在用户批评后放弃最初正确的科学解答。在科学推理中，这种风险尤为突出，因为用户的批评可能将正确答案变为错误答案。我们将批评交互视为跨回合正确性过渡问题，而非最终答案准确性问题，并识别出三个挑战：过渡意识、解耦有用的修正与有害的阿谀奉承，以及可扩展的回放。我们提出了ReCrit，一个基于过渡意识的强化学习框架，将初始到批评行为分解为四个象限：修正、阿谀奉承、鲁棒性和边界。ReCrit奖励修正和鲁棒性，惩罚阿谀奉承，并将持续错误视为弱边界信号。为了使交互训练实用，ReCrit进一步使用动态异步回放与尾部自适应完成以减少回放等待。在三个科学推理基准测试（ChemBench、TRQA和EarthSE）上，ReCrit在Qwen3.5-4B上将平均批评准确性从38.15提升到51.49，在Qwen3.5-9B上从45.40提升到55.59。消融实验显示，最终答案奖励提供很少的交互层面增益，而基于过渡意识的奖励和象限加权产生更可区分的训练信号和更大的净批评阶段改进。代码可在https://github.com/black-yt/ReCrit获取。

英文摘要

Large language models can fail in critic interaction not only by answering incorrectly, but also by abandoning an initially correct scientific solution after user criticism. This is especially risky in scientific reasoning, where user criticism can turn a valid answer into an incorrect one. We frame critic interaction as an inter-turn correctness-transition problem rather than a final-answer accuracy problem, and identify three challenges: transition awareness, decoupling useful correction from harmful sycophancy, and scalable rollout. We propose ReCrit, a transition-aware reinforcement learning framework that decomposes Initial-to-Critic behavior into four quadrants: Correction, Sycophancy, Robustness, and Boundary. ReCrit rewards correction and robustness, penalizes sycophancy, and treats persistent errors as weak boundary signals. To make interaction training practical, ReCrit further uses dynamic asynchronous rollout with tail-adaptive completion to reduce rollout waiting. On three scientific reasoning benchmarks, ChemBench, TRQA, and EarthSE, ReCrit improves average Critic accuracy from 38.15 to 51.49 on Qwen3.5-4B and from 45.40 to 55.59 on Qwen3.5-9B. Ablations show that final-answer rewards provide little interaction-level gain, while transition-aware rewards and quadrant weighting produce more distinguishable training signals and larger net Critic-stage improvement. The code is available at https://github.com/black-yt/ReCrit .

URL PDF HTML ☆

赞 0 踩 0

2605.18795 2026-05-20 cs.LG cs.AI 版本更新

特征有生命周期。我们应当关心

Philipp Stecher, Sandro Radovanović, Vlasta Sikimić, Reinhard Kahle

发表机构 * University of Tübingen（图宾根大学）； University of Belgrade（贝尔格莱德大学）； Eindhoven University of Technology（埃因霍温理工大学）

AI总结研究探讨了语言模型中特征的生命周期，发现了一种稳定的表征基础架构，并揭示了其在训练过程中的四个关键特性。

Comments 21 pages, 7 figures

详情

AI中文摘要

基于查询意识的流扩散图基RAG系统：具有检索保证

Zhuoping Zhou, Davoud Ataee Tarzanagh, Sima Didari, Wenjun Hu, Baruch Gutow, Oxana Verkholyak, Masoud Faraki, Heng Hao, Hankyu Moon, Seungjai Min

发表机构 * Samsung SDS Research America（三星SDS美国研究院）

AI总结本文提出了一种无需训练的Query-Aware Flow Diffusion RAG（QAFD-RAG）系统，通过动态适应查询语义来改进图基RAG系统的检索能力，提供了首次统计保证，证明在弱信号-噪声条件下，QAFD-RAG能够以高概率恢复相关子图。

Comments Published at the International Conference on Learning Representations (ICLR) 2026. 38 pages, 5 figures, 10 tables

详情

AI中文摘要

基于图的检索增强生成（RAG）系统利用相互关联的知识结构来捕捉扁平检索难以处理的复杂关系，从而实现多跳推理。然而，大多数现有图基方法存在（i）缺乏子图质量和相关性理论保证的启发式设计，或（ii）使用静态探索策略，忽视查询的整体意义，检索邻居或社区，而不管意图。我们提出了Query-Aware Flow Diffusion RAG（QAFD-RAG），一种无需训练的框架，能够动态适应每个查询的整体语义。核心创新是查询感知的遍历：在图探索过程中，边根据端点与查询嵌入的匹配程度动态加权，引导流沿着语义相关路径流动，同时避免结构连接但不相关的区域。这些查询特定的推理子图使QAFD-RAG成为首个具有统计保证的查询感知图检索方法，证明在弱信号-噪声条件下，QAFD-RAG以高概率恢复相关子图。该算法以指数速度收敛，复杂度随检索的子图大小而非完整图的规模增长。在问答和文本到SQL任务上的实验表明，QAFD-RAG在状态-of-the-art图基RAG方法上表现出一致的改进。

英文摘要

Graph-based Retrieval-Augmented Generation (RAG) systems leverage interconnected knowledge structures to capture complex relationships that flat retrieval struggles with, enabling multi-hop reasoning. Yet most existing graph-based methods suffer from (i) heuristic designs lacking theoretical guarantees for subgraph quality or relevance and/or (ii) the use of static exploration strategies that ignore the query's holistic meaning, retrieving neighborhoods or communities regardless of intent. We propose Query-Aware Flow Diffusion RAG (QAFD-RAG), a training-free framework that dynamically adapts graph traversal to each query's holistic semantics. The central innovation is query-aware traversal: during graph exploration, edges are dynamically weighted by how well their endpoints align with the query's embedding, guiding flow along semantically relevant paths while avoiding structurally connected but irrelevant regions. These query-specific reasoning subgraphs enable the first statistical guarantees for query-aware graph retrieval, showing that QAFD-RAG recovers relevant subgraphs with high probability under mild signal-to-noise conditions. The algorithm converges exponentially fast, with complexity scaling with the retrieved subgraph size rather than the full graph. Experiments on question answering and text-to-SQL tasks demonstrate consistent improvements over state-of-the-art graph-based RAG methods.

URL PDF HTML ☆

赞 0 踩 0

2605.18774 2026-05-20 cs.IR cs.AI 版本更新

M3DocDep: Multi-modal, Multi-page, Multi-document Dependency Chunking with Large Vision-Language Models

M3DocDep: 多模态、多页、多文档依赖分块方法基于大视觉-语言模型

Joongmin Shin, Jeongbae Park, Jaehyung Seo, Heuiseok Lim

发表机构 * Human-inspired AI Research, Korea University（韩国大学人智AI研究所）； Computer Science and Engineering, Konkuk University（konkuk大学计算机科学与工程系）； Department of Computer Science and Engineering, Korea University（韩国大学计算机科学与工程系）

AI总结本文提出M3DocDep，一种基于大视觉-语言模型的多模态、多页、多文档依赖分块方法，通过恢复块级依赖并构建分块，提高了长多页多模态文档的检索和问答质量。

Comments Accepted to CVPR2026 Main

详情

AI中文摘要

在长篇、多页的工业文档中，检索增强生成（RAG）依赖于分块边界是否遵循文档的真实结构。现有的文本中心分块器和生成层次解析器常常遗漏跨页的父子关系、图表-标题绑定以及边界提示，导致碎片化或冗余的分块，并影响检索和答案质量。我们提出M3DocDep，一种基于大视觉-语言模型（LVLM）的流程，首先恢复块级依赖，然后沿着恢复的文档树构建分块。该流程使用SharedDet作为公共的DP+OCR预处理层，提取具有边界意识的SoftROI池化多模态块嵌入，用双亲-子边评分器评分候选父-子边，用MST约束解码全局有效的依赖树，并构建树引导的分块，这些分块带有章节路径和页面范围的注释。在共享块评估协议下，M3DocDep在DHP基准上比STEDS提高了+28.5至+39.6个百分点，在检索nDCG上提高了+1.1至+15.3个百分点，在语料库级RAG基准上的QA ANLS上提高了+4.5至+15.3个百分点。这些结果表明，在分块前恢复文档依赖可以为长篇、多页多模态文档提供更连贯的检索单元。

英文摘要

In long, multi-page industrial documents, retrieval-augmented generation (RAG) depends heavily on whether chunk boundaries follow the document's true structure. Existing text-centric chunkers and generative hierarchy parsers often miss cross-page parent-child relations, figure/table-caption bindings, and boundary cues, which leads to fragmented or redundant chunks and degrades both retrieval and answer quality. We propose M3DocDep, an LVLM-based pipeline that first recovers block-level dependencies and then constructs chunks along the recovered document tree. The pipeline uses SharedDet as a common DP+OCR preprocessing layer, extracts multimodal block embeddings with boundary-aware SoftROI pooling, scores candidate parent-child edges with a biaffine head, decodes a globally valid dependency tree with MST constraints, and builds tree-guided chunks annotated with section paths and page ranges. Under a shared-block evaluation protocol, M3DocDep improves STEDS by +28.5 to +39.6 percent on DHP benchmarks, retrieval nDCG by +1.1 to +15.3 percent, and QA ANLS by +4.5 to +15.3 percent on corpus-level RAG benchmarks. These results show that recovering document dependencies before chunking yields more coherent retrieval units for long, multi-page multimodal documents.

URL PDF HTML ☆

赞 0 踩 0

2605.18773 2026-05-20 cs.CR cs.AI cs.CY cs.LG 版本更新

Decentralized autonomous organization and blockchain-based incentivization framework for community-based facilities management

去中心化自治组织与基于区块链的激励框架用于社区设施管理

Reachsak Ly, Alireza Shojaei, Xinghua Gao, Philip Agee, Abiola Akanmu

发表机构 * School of Technology, Eastern Illinois University（东伊利诺伊大学技术学院）； Myers-Lawson School of Construction, Virginia Polytechnic Institute and State University（弗吉尼亚理工学院和州立大学梅斯-劳森建筑学院）

AI总结本文提出了一种基于区块链和去中心化自治组织（DAO）的新型框架，用于智能建筑中的社区设施管理，通过去中心化治理平台和维护管理平台的结合，提高设施维护的参与度和效率。

Comments 29 pages, 17 figures, 3 tables

详情

AI中文摘要

传统的设施管理通常依赖于集中决策结构，限制了利益相关者的参与，导致与租户需求不一致并降低了满意度。本文提出了一种新的基于区块链和去中心化自治组织（DAO）的框架，用于智能建筑中的社区设施管理。该框架包含两个关键组成部分：一个去中心化的治理平台，通过区块链投票促进透明的集体决策；以及一个维护管理平台，具有激励机制，鼓励建筑使用者通过代币奖励积极贡献于设施维护。系统评估包括成本分析、可扩展性、数据安全考虑、可用性测试以及与设施管理人员和研究人员进行的半结构化访谈，以评估平台的实用性、挑战和采用潜力。研究结果表明，该框架有潜力作为激励解决方案，用于促进利益相关者在集体维护和改善建筑基础设施方面的参与。

英文摘要

Traditional facility management often relies on centralized decision-making structures that limit stakeholder participation, leading to misalignment with occupant needs and reduced satisfaction. This paper proposes a novel blockchain- and Decentralized Autonomous Organization (DAO)-based framework for community-based facilities management in smart buildings. The framework comprises two key components: a decentralized governance platform that facilitates transparent collective decision-making through blockchain-based voting, and a maintenance management platform with an incentivization mechanism that encourages building occupants to actively contribute to facility upkeep through tokenized rewards. System evaluation includes cost analysis, scalability, data security considerations, usability testing, and semi-structured interviews with facility managers and researchers to assess the platform's usefulness, challenges, and adoption potential. The findings demonstrate the framework's potential as a viable incentivization solution for engaging stakeholders in the collective upkeep and improvement of building infrastructure.

URL PDF HTML ☆

赞 0 踩 0

2605.18772 2026-05-20 cs.IR cs.AI cs.CL 版本更新

Improving Retrieval-Augmented Generation without Taxonomy-based Error Categorization

无需基于分类的错误归类即可改进检索增强生成

Gongbo Zhang, Yifan Peng, Chunhua Weng

发表机构 * Columbia University（哥伦比亚大学）； Weill Cornell Medicine（韦尔·科恩医学中心）

AI总结本文提出RePAIR方法，通过直接将错误的RAG输出映射到错误缓解行动计划，无需依赖细粒度错误分类和显式批评者监督，从而提升检索增强生成的性能。

详情

AI中文摘要

检索增强生成（RAG）通过将生成过程 grounding 在外部知识来提高大语言模型（LLM）输出的事实准确性。最近的代理 RAG 系统扩展了这一范式，通过关键代理评估模型响应并迭代优化输出。然而，大多数先前工作隐式假设可靠的批评反馈，专注于规划策略，而对误差纠正过程本身的鲁棒性关注有限，这可能受到对齐错误类别和无效或错误纠正的影响。本文假设 RAG 性能可以在没有显式错误分类的情况下得到改进。我们提出 RePAIR，一种响应-行动学习范式，通过直接将错误的 RAG 输出映射到错误缓解行动计划，而无需依赖细粒度错误分类和显式批评者监督。在多个基准测试中，RePAIR 一致地提高了代理 RAG 的性能。

英文摘要

Retrieval-Augmented Generation (RAG) improves the factual accuracy of large language model (LLM) outputs by grounding generation in external knowledge. Recent agentic RAG systems extend this paradigm with critical agents to evaluate model responses and iteratively refine outputs. However, most prior work implicitly assumes reliable critic feedback and focuses on planning strategies, while paying limited attention to the robustness of the error-correction process itself, which can be impacted by misaligned error categories and ineffective or incorrect corrections. Here, we hypothesize that RAG performance can be improved without explicit error categorization. We propose RePAIR, a response-action learning paradigm that directly maps flawed RAG outputs to error-mitigating action plans without relying on fine-grained error taxonomies and explicit critic supervision. Across multiple benchmarks, RePAIR consistently improves agentic RAG performance.

URL PDF HTML ☆

赞 0 踩 0

2605.18770 2026-05-20 cs.IR cs.AI 版本更新

Agentic GraphRAG: Navigating Unstructured Financial Data with Collaborative AI

代理图RAG：利用协作AI导航无结构财务数据

Arthur Capozzi, Dirk Helbing

发表机构 * Computational Social Science ETH Zürich（计算社会科学 ETH 瑞士 Zurich）； Computational Social Science, ETH Complexity Science Hub Zürich（计算社会科学 ETH 复杂性科学中心 Zurich）； ETH Complexity Science Hub Zürich, Switzerland（ETH 复杂性科学中心 Zurich 瑞士）； Vienna, Austria（维也纳奥地利）

AI总结本文提出了一种协作代理图RAG框架，用于专家分析商业登记数据。通过构建Neo4j知识图谱，结合确定性节点摄入、LLM提取弱节点以及确定性身份解析和去重，提升了多跳、时间相关和实体中心的查询能力。

详情

AI中文摘要

我们提出了一种协作代理图RAG框架，用于专家分析商业登记数据。公共登记处通常正式可访问，但在实践中由于结合了结构化记录和大量无结构法律文本，难以使用。这限制了传统关键词和向量检索，尤其是多跳、时间相关和实体中心的调查。我们的方法通过三阶段流程构建Neo4j知识图谱：（i）确定性摄入来自已验证结构字段的强节点，（ii）基于LLM提取无结构通知中的弱节点，（iii）确定性身份解析和去重。在图之上，我们引入了分析模块代理，整合了零样本意图路由、有界的反思循环、安全的工具中介图访问以及状态感知的响应合成。一个人工参与的仪表板暴露证据和执行轨迹以支持透明性和可审计性。我们在瑞士商业公报上评估了该框架，这是一个包含超过七百万篇出版物的多语言语料库，持续七年。我们进一步贡献了一个多级评估协议，涵盖实体解析精度、工具路由行为、回答质量以及多轮对话性能。在自动化、人工整理和对话基准测试中，所提出的代理图RAG系统在正确性、回答相关性、信息回忆、轮次成功率和上下文传递准确性方面均优于标准代理向量RAG基线。该架构是模块化、可重复和可转移至其他商业公报和公共部门登记系统。

英文摘要

We present a collaborative agentic GraphRAG framework for expert analysis of commercial registry data. Public registries are often formally accessible, yet difficult to use in practice because they combine structured records with large volumes of unstructured legal text. This limits conventional keyword and vector-only retrieval, especially for multi-hop, temporal, and entity-centric investigations. Our approach builds a Neo4j knowledge graph through a three-phase pipeline: (i) deterministic ingestion of strong nodes from verified structured fields, (ii) LLM-based extraction of weak nodes from unstructured notices, and (iii) deterministic identity resolution and deduplication. On top of this graph, we introduce an analytical modular agent that integrates zero-shot intent routing, a bounded reflection loop, secure tool-mediated graph access, and state-aware response synthesis. A human-in-the-loop dashboard exposes evidence and execution traces to support transparency and auditability. We evaluate the framework on the Swiss Official Gazette of Commerce, a multilingual corpus of more than seven million publications over seven years. We further contribute a multi-tier evaluation protocol covering entity-resolution precision, tool-routing behavior, answer quality, and multi-turn conversational performance. Across automated, human-curated, and conversational benchmarks, the proposed agentic GraphRAG system consistently outperforms a standard agentic vector-RAG baseline, with strong gains in correctness, answer relevance, information recall, turn success rate, and context carryover accuracy. The architecture is modular, reproducible, and transferable to other commercial gazettes and public-sector registry systems.

URL PDF HTML ☆

赞 0 踩 0

2605.18769 2026-05-20 cs.IR cs.AI cs.CL 版本更新

ClusterRAG: Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation

ClusterRAG: 基于聚类的协同过滤用于个性化检索增强生成

Gibson Nkhata, Uttamasha Anjally Oyshi, Quan Mai, Susan Gauch

发表机构 * University of Arkansas（阿肯色大学）； Walmart Inc.（沃尔玛公司）

AI总结 ClusterRAG通过聚类用户轮廓文档，利用相似用户的协同信号提升当前用户的个性化生成性能，通过在集群和文档层面进行检索，实现了高效的个性化检索增强生成。

Comments 17 pages, 2 figures, to be published in the proceedings of ACL 2026

详情

AI中文摘要

个性化检索增强生成（RAG）依赖于准确选择与用户相关的文档。在实践中，现有RAG方法往往面临较高的检索成本，并忽略了来自相似用户的协同信号可以增强当前用户的个性化生成。我们提出了ClusterRAG，一种基于聚类的协同过滤用于个性化检索增强生成。ClusterRAG通过用户的轮廓文档来表示用户，利用基于密度的聚类将用户组织成语义连贯的集群，并通过集群级相似性和细粒度排序在集群和文档层面进行检索。在LaMP基准上的广泛实验表明，联合利用目标用户的轮廓和顶部相似用户的轮廓在各种任务中始终获得最佳性能。进一步分析显示，ClusterRAG能够无缝集成不同的密集检索器和排序器，并在与微调和零样本语言模型配对时仍保持有效。

英文摘要

Personalized Retrieval-Augmented Generation (RAG) relies on accurately selecting user-relevant documents. In practice, existing RAG approaches often suffer from high retrieval costs and overlook that collaborative signals from similar users can enhance personalized generation for the current user. We propose ClusterRAG, a Cluster-Based Collaborative Filtering for Personalized Retrieval-Augmented Generation. ClusterRAG represents users through their profile documents, organizes users into semantically coherent clusters using density-based clustering, and performs retrieval at both the cluster and document levels via cluster-level similarity and fine-grained ranking. Extensive experiments on the LaMP benchmark demonstrate that jointly leveraging the target user's profile and profiles from top similar users consistently yields the best performance across diverse tasks. Further analysis shows that ClusterRAG integrates seamlessly with different dense retrievers and rankers, and remains effective when paired with both fine-tuned and zero-shot language models.

URL PDF HTML ☆

赞 0 踩 0

2605.18767 2026-05-20 cs.IR cs.AI 版本更新

DualView: Adaptive Local-Global Fusion for Multi-Hop Document Reranking

DualView: 多跳文档重排序中的自适应局部-全局融合

Litong Zhang, Jiaxin Li, Kuo Zhao

发表机构 * Jinan University（暨南大学）

AI总结本文提出DualView框架，通过自适应门控融合局部和全局视图，实现高效的多跳文档重排序，显著优于现有模型并在低延迟下取得高召回率。

详情

AI中文摘要

多跳问答需要从多个文档中聚合信息，这是知识密集型应用的关键能力。一个根本性的挑战在于在保持高召回率的同时，高效地从检索的候选文档中确定最小的相关文档集。我们提出一个高效的双视图级联重排序框架用于多跳文档重排序。该框架在E5-base-v2候选文档上作为轻量级的后检索阶段运行，其架构包括：(1) 一个局部评分器，利用堆叠交叉注意力进行细粒度的查询-文档相关性评估；以及(2) 一个全局评分器，通过基于Transformer的上下文聚合建模文档间依赖关系。这些视图通过根据查询语义条件的自适应门进行动态融合。在固定候选集重排序设置下，使用离线缓存的嵌入，我们的模型取得了有竞争力的结果，特别是在MuSiQue上，Top-4召回率为99.4%，Full Hit准确率为97.8%，在4.0毫秒延迟（249 QPS）下。它在保持5到6倍更低延迟的同时，显著优于6亿参数的交叉编码器（BGE-Large: 92.0%召回率，Jina-v3: 90.1%召回率）。消融研究验证了局部和全局视图对多跳性能有显著贡献。

英文摘要

Multi-hop question answering requires aggregating information from multiple documents, a critical capability for knowledge-intensive applications. A fundamental challenge lies in efficiently identifying the minimal relevant document set from retrieved candidates while maintaining high recall. We present an efficient dual-view cascaded reranking framework for multi-hop document reranking. Operating as a lightweight post-retrieval stage over E5-base-v2 candidates, our architecture comprises: (1) a Local Scorer employing stacked cross-attention for fine-grained query-document relevance; and (2) a Global Scorer modeling inter-document dependencies via Transformer-based context aggregation. These views are dynamically fused through an Adaptive Gate conditioned on query semantics. Under the fixed candidate set reranking setting with offline cached embeddings, our model achieves competitive results, particularly outstanding on MuSiQue with 99.4% Top-4 Recall and 97.8% Full Hit accuracy at 4.0 ms latency (249 QPS). It substantially outperforms 600M-parameter cross-encoders (BGE-Large: 92.0% Recall, Jina-v3: 90.1% Recall) while maintaining 5 to 6 times lower latency. Ablation studies validate that both Local and Global views contribute substantially to multi-hop performance.

URL PDF HTML ☆

赞 0 踩 0

2605.18766 2026-05-20 cs.IR cs.AI cs.CL 版本更新

Retrieve Only Relevant Tables Whether Few or Many: Adaptive Table Retrieval Method

仅检索相关表格，无论多少：自适应表格检索方法

Taehee Kim, Seungbin Yang, Jihwan Kim, Jaegul Choo

发表机构 * KAIST AI（韩国科学技术院人工智能研究所）

AI总结本文提出了一种自适应表格检索方法，根据查询需求调整检索表格数量，通过自适应阈值机制和滑动窗口重排序算法，有效解决传统top-k检索策略的局限性，提升检索和下游任务性能。

Comments ACL 2026 Findings

详情

AI中文摘要

从大量数据库中检索与给定自然语言查询相关的表格对于准确回答文本到SQL等任务中的问题至关重要。现有的表格检索方法选择与查询最相似的预设数量k的表格。然而，所需表格的数量因查询而异，无法提前确定。强制无论查询如何都检索固定数量的表格可能会导致检索到的表格数量不足，无法获得所有必要的证据，或者检索到的表格过多，包含不相关的内容。为了解决这个问题，我们提出了一种自适应表格检索方法，根据每个查询的需求调整检索的表格数量。具体来说，我们利用自适应阈值机制来选择性地检索表格，并整合滑动窗口重排序算法以高效处理大量表格数据集。在Spider、BIRD和Spider 2.0上的广泛实验表明，我们的方法有效解决了传统top-k检索策略的局限性，提高了检索和下游任务的性能。我们的代码和数据可在https://github.com/sbY99/Adaptive-Table-Retrieval上获得。

英文摘要

Retrieving relevant tables from extensive databases for a given natural language query is essential for accurately answering questions in tasks such as text-to-SQL. Existing table retrieval approaches select a pre-determined set of k tables with the highest similarity to the query. However, the number of required tables varies across queries and cannot be known in advance. Enforcing a fixed number of retrieved tables regardless of the query may either retrieve an undersized set, failing to obtain all necessary evidence, or retrieve an oversized pool, including irrelevant tables. To address this issue, we propose an adaptive table retrieval method that adjusts the number of tables retrieved according to the requirements of each query. Specifically, we utilize an adaptive thresholding mechanism to selectively retrieve tables and integrate a sliding-window reranking algorithm to efficiently process a large table corpus. Extensive experiments on Spider, BIRD, and Spider 2.0 demonstrate that our method effectively addresses the limitations of the top-k retrieval strategy, improving performance in retrieval and downstream tasks. Our code and data are available at https://github.com/sbY99/Adaptive-Table-Retrieval.

URL PDF HTML ☆

赞 0 踩 0

2605.18765 2026-05-20 cs.IR cs.AI 版本更新

STAR: Semantic-Tuned and Tail-Adaptive Retriever for Graph-Augmented Generation

STAR：用于图增强生成的语义调优和尾部适应检索器

Shuai Li, Chen Huang, Duanyu Feng, Wenqiang Lei, See-Kiong Ng

发表机构 * Shuai Li, Chen Huang, Duanyu Feng, Wenqiang Lei, See-Kiong Ng

AI总结本文提出STAR，一种用于图增强生成的语义调优和尾部适应检索器，通过结合token级交互学习和路径加权对比学习，解决图检索增强生成中的语义捷径偏差和长尾路径偏差问题，提升检索和生成性能。

详情

AI中文摘要

为了增强大型语言模型（LLMs）在多跳问答中的表现，图检索增强生成（GraphRAG）中主流的解决方案利用轻量级检索器高效地从给定的知识图（KG）中提取信息。然而，现有方法往往忽视图中固有的稀疏语义信息挑战。具体而言，我们的实验发现这些方法会产生偏见的语义捷径偏差和长尾路径偏差，导致语义建模不足和图RAG效果有限。为了解决这些问题，我们提出了STAR，一种用于图RAG的语义调优和尾部适应检索器。STAR集成了两种关键学习范式：token级交互学习和路径加权对比学习。前者采用交叉注意力架构和硬路径挖掘机制，共同建模查询和路径，从而缓解语义捷径偏差。后者引入了定制化的对比学习目标，利用尾部适应路径加权，旨在优化训练过程并减轻长尾路径偏差。广泛的实验表明，STAR在所有基准数据集上都优于基线方法，实现了平均检索性能提升1.8%和LLM问答性能提升2.2%。我们的代码可在https://anonymous.4open.science/r/STAR-C583获取。

英文摘要

To augment Large Language Models (LLMs) for multi-hop question answering, a mainstream solution within Graph Retrieval Augmented Generation (GraphRAG) leverages lightweight retrievers to efficiently extract information from a given Knowledge Graph (KG). However, existing methods often overlook the inherent challenge of sparse semantic information in graphs. Specifically, our experiments reveal that these methods produce biased retrieval Semantic Shortcut Bias and Long-Tail Path Bias, leading to inadequate semantic modeling and limited GraphRAG effectiveness. To address these issues, we propose STAR, a semantic-tuned and tail-adaptive retriever for GraphRAG. STAR integrates two key learning paradigms: token-level interaction learning and path-weighted contrastive learning. The former employs a cross-attention architecture and a hard path mining mechanism to jointly model the query and path, thereby mitigating the Semantic Shortcut Bias. The latter introduces a tailored contrastive learning objective that utilizes tail-adaptive path weighting, designed to optimize the training process and ease the Long-Tail Path Bias. Extensive experiments demonstrate that STAR consistently outperforms baselines, achieving average retrieval performance gains of 1.8\% and LLM QA performance improvements of 2.2\% across all benchmark datasets. Our code is available at https://anonymous.4open.science/r/STAR-C583.

URL PDF HTML ☆

赞 0 踩 0

2605.18764 2026-05-20 cs.IR cs.AI 版本更新

From Intent to AI Pipelines: A Controlled Agentic Framework for Non-AI Expert Scientists

从意图到AI流水线：一种受领域驱动的可控代理框架，用于非AI专家科学家

Hyacinth Ali, Jessie Galasso-Carbonnel, Houari Sahraoui

发表机构 * University of Montreal（蒙特利尔大学）； McGill University（麦吉尔大学）

AI总结本文提出DDAP框架，通过分阶段交互帮助非AI专家科学家系统构建AI流水线及其实现代码，展示了可控代理框架在生成竞争性AI流水线方面的有效性。

详情

AI中文摘要

人工智能（AI）流水线已成为现代研究的核心，支持医学科学、农业和社会科学等领域，使大规模数据分析、预测建模和复杂任务的自动化成为可能。然而，许多研究人员在设计和实现AI解决方案时仍面临挑战，因为需要具备端到端AI系统的设计和开发专业知识。为解决这一差距，我们提出了领域驱动的可适应AI流水线（DDAP），这是一种受领域驱动的可控、人机协同的代理框架，利用大型语言模型引导用户系统地构建AI流水线及其相应实现代码。DDAP将开发过程分为四个阶段：问题定义、计算环境规范、流水线生成和代码生成。通过这种分阶段交互，框架能够适应领域上下文、用户专业知识和资源限制，同时保持用户对关键决策的控制。我们通过在多个跨业务、生物学和健康科学领域的数据集上比较DDAP的AI模型与专家开发的模型，评估了DDAP的性能。实验结果表明，DDAP在多个任务中与专家基线模型具有竞争力，尽管性能因问题类型而异，特别是在基于文本的聚类任务中。通过结合引导交互、适应性和可重复性，DDAP证明了可控代理框架能够为非专家用户生成竞争性的AI流水线。

英文摘要

Artificial Intelligence (AI) pipelines have become integral to modern research, supporting fields such as Medical Sciences, Agriculture, and Social Sciences, and enabling large-scale data analysis, predictive modeling, and the automation of complex tasks. However, designing and implementing AI solutions remains challenging for many researchers due to the expertise required in the design and development of end-to-end AI systems. To address this gap, we present Domain-Driven Adaptable AI Pipelines (DDAP), a controlled, human-in-the-loop, agentic framework that leverages large language models to guide users in a systematic construction of AI pipelines and their corresponding implementation code. DDAP structures the development process into four stages: problem definition, compute environment specification, pipeline generation, and code generation. Through this staged interaction, the framework adapts to domain context, user expertise, and resource constraints, while maintaining user control over key decisions. We evaluate DDAP across multiple datasets spanning business, biology, and health science domains by comparing its AI models against expert-developed models. The experimental results show that DDAP achieves competitive results in several tasks compared to expert baselines, although performance varies across problem types, particularly for text-based clustering tasks. By combining guided interaction, adaptability, and reproducibility, DDAP demonstrates that a controlled agentic framework can generate competitive AI pipelines for non-expert users.

URL PDF HTML ☆

赞 0 踩 0

2605.18763 2026-05-20 cs.IR cs.AI 版本更新

Query-Conditioned Graph Retrieval for Contextualized LLM Reasoning in Personalized Wearable Data

基于查询的图检索用于个性化可穿戴数据中的上下文化LLM推理

Zhenyu Lu, Mahyar Abbasian, Amir M. Rahmani

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； University of California, Irvine（加州大学 Irvine 分校）

AI总结本文提出WAG框架，通过图检索实现LLM在可穿戴数据中的上下文化推理，通过构建个性化知识图谱并检索查询条件子图，提高推理效率和生成质量。

详情

AI中文摘要

大型语言模型（LLMs）越来越多地应用于分析可穿戴传感数据，这些数据具有长期性、多模态性和高度个性化的特点。一个关键挑战是上下文选择：提供不足的上下文会限制推理，而包含所有可用数据则会导致效率低下和生成质量下降。我们提出了Wearable As Graph（WAG），一种基于图的上下文检索框架，使LLM能够对可穿戴数据进行查询自适应推理。WAG将可穿戴指标和用户特定信号组织成个性化知识图谱，并检索查询条件子图以支持下游生成。检索过程整合了全局关系，通过分层贝叶斯建模捕捉先验知识和群体及个体层面的模式，同时本地关系反映了短期信号偏差。一个查询开放性信号进一步控制检索范围。我们在超过10,000个数据驱动的查询上评估WAG，来自真实世界可穿戴数据集。在LLM和人类评估中，WAG在基准和标准RAG方法上实现了约70%的胜率，证明了结构化、查询自适应上下文检索在LLM驱动的可穿戴数据分析中的有效性。

英文摘要

Large language models (LLMs) are increasingly applied to analyzing wearable sensing data, which are long-term, multimodal, and highly personalized. A key challenge is context selection: providing insufficient context limits reasoning, while including all available data leads to inefficiency and degraded generation quality. We propose Wearable As Graph (WAG), a graph-based context retrieval framework that enables query-adaptive reasoning over wearable data with LLMs. WAG organizes wearable metrics and user-specific signals into a personalized knowledge graph, and retrieves a query-conditioned subgraph to support downstream generation. The retrieval process integrates global relationships, capturing prior knowledge and population- and individual-level patterns via hierarchical Bayesian modeling, with local relationships that reflect short-term signal deviations. A query openness signal further controls retrieval breadth. We evaluate WAG on over 10,000 data-grounded queries from real-world wearable datasets. Across LLM-based and human evaluations, WAG achieves an approximately 70% win rate over baseline and standard RAG methods, demonstrating the effectiveness of structured, query-adaptive context retrieval for LLM-driven analysis of wearable data.

URL PDF HTML ☆

赞 0 踩 0

2605.18762 2026-05-20 cs.IR cs.AI 版本更新

ALDEN: Boosting Private Data Extraction from Retrieval-Augmented Generation Systems via Active Learning and Distribution Estimation

ALDEN: 通过主动学习和分布估计提升从检索增强生成系统中提取私有数据

Xingyu Lyu, Jianfeng He, Ning Wang, Yidan Hu, Tao Li, Danjue Chen, Shixiong Li, Yimin Chen

发表机构 * University of Massachusetts Lowell（马萨诸塞大学洛厄尔分校）； Virginia Tech（弗吉尼亚理工大学）； Amazon（亚马逊）； University of South Florida（佛罗里达州立大学）； Rochester Institute of Technology（罗切斯特理工学院）； Purdue University（普渡大学）； North Carolina State University（北卡罗来纳州立大学）

AI总结本文提出ALDEN方法，通过主动学习和分布估计有效提升从检索增强生成系统中提取私有数据的效率和效果，通过全面评估证明其优于现有方法。

详情

AI中文摘要

检索增强生成（RAG）被广泛用于通过外部知识检索增强大型语言模型，以提高可靠性和泛化能力。然而，最近的研究表明，RAG系统仍易受数据提取攻击，其中攻击者可以通过在用户查询中嵌入恶意命令来提取私有数据。尽管这些攻击具有可行性，但现有攻击通常面临较低的数据提取率和有限的实用性。在此，我们提出ALDEN，一种新颖的攻击方法，能够有效且高效地从RAG中提取私有数据。首先，我们采用主动学习来多样化恶意查询并提高数据提取率。其次，我们观察到底层知识库的数据分布为查询生成提供了有价值的指导，并引入基于衰减的动态算法来估计相应的话题分布。通过结合这些方法，我们证明ALDEN通过全面评估显著优于现有最先进方法。

英文摘要

Retrieval-Augmented Generation (RAG) is widely used to augment large language models with external knowledge retrieval to improve reliability and generalization. However, recent studies have shown that RAG systems remain vulnerable to data extraction attacks, where adversaries can extract private data by embedding malicious commands into user queries. Despite their feasibility, existing attacks typically suffer from low data extraction rates and limited practical effectiveness. Here, we propose ALDEN, a novel attack that effectively and efficiently extracts private data from RAGs. First, we employ active learning to diversify malicious queries and improve data extraction rates. Second, we observe that the data distribution of the underlying knowledge base provides valuable guidance for query generation and introduce a decay-based dynamic algorithm to estimate the corresponding topic distribution. By combining them together, we demonstrate that ALDEN substantially outperforms state-of-the-art methods through comprehensive evaluations.

URL PDF HTML ☆

赞 0 踩 0

2605.18760 2026-05-20 cs.IR cs.AI 版本更新

DOTRAG: Retrieval-Time Reasoning Along Paths

DOTRAG: 路径上的检索时推理

Larnell Moore, Naihao Deng, Rada Mihalcea, Farnaz Jahanbakhsh

发表机构 * University of Michigan（密歇根大学）

AI总结本文提出DOTRAG，一种无需训练的图检索增强生成框架，通过将检索过程转化为路径上的推理，提高了多跳任务的性能。

详情

AI中文摘要

图检索增强生成（GraphRAG）通常采用检索后再推理的范式，其中上下文通过启发式方法检索，然后进行推理。此类方法在处理复杂多跳任务时难以适应查询特定的逻辑需求，常导致无关上下文的累积或遗漏正确关系路径。我们提出DotRAG，一种无需训练的GraphRAG框架，将检索重新表述为路径上的推理过程。我们的方法生成查询条件约束，引导图探索，剪枝无关区域，并在不依赖显式逐步推理链的情况下迭代发现关系路径。我们引入思维的分解（DOT），一种将检索分解为局部搜索空间并适应每个查询的搜索策略的抽象。DotRAG在MetaQA和UltraDomain上实现了SOTA性能，在多跳任务上表现出一致的提升，证明了推理引导的检索有效性。

英文摘要

Graph Retrieval-Augmented Generation (GraphRAG) is dominated by a retrieve-then-reason paradigm, where context is retrieved using heuristics and then reasoned over. Such methods struggle to adapt to the query-specific logic required for complex multi-hop tasks, often accumulating irrelevant context or missing correct relational paths. We propose DotRAG, a training-free GraphRAG framework that reformulates retrieval as a reasoning process over paths. Our approach generates query-conditioned constraints that guide graph exploration, prune irrelevant regions, and iteratively discover relational paths without relying on explicit step-by-step reasoning chains. We introduce Division of Thought (DOT), an abstraction that decomposes retrieval into localized search spaces and adapts the search strategy to each query. DotRAG achieves SOTA performance on MetaQA and UltraDomain, with consistent gains on multi-hop tasks, demonstrating the effectiveness of reasoning-guided retrieval.

URL PDF HTML ☆

赞 0 踩 0

2605.18759 2026-05-20 cs.HC cs.AI 版本更新

Interoceptive Divergence in Aesthetic Evaluation and Implications for Human-AI Alignment

内感受差异在审美评价中的体现及其对人机对齐的影响

Yoshia Abe, Tatsuya Daikoku, Yasuo Kuniyoshi

发表机构 * Artificial intelligence (AI), exemplified by large language models (LLMs)（人工智能（由大型语言模型（LLMs）体现））

AI总结本研究探讨了人类与AI在审美体验中的收敛与分歧，通过对比人类和AI的反应，发现尽管两者在审美评分与情绪的相关性及图像特征优先级上相似，但在情绪反应分布和审美评分与内感受的关系上存在显著差异，揭示了AI在审美评估中的局限性，特别是在内感受方面的不足。

Comments 20 pages, 9 figures. Supplementary material is included as a separate PDF in the source files

详情

AI中文摘要

人工智能（AI），以大型语言模型（LLMs）为例，正在迅速接近并在某些情况下超越人类在广泛认知任务中的表现。然而，人类的本性不仅限于智能，还包括感性，包括感知和体验视觉场景中美的能力。这提出了一个根本问题：人类和AI系统在这样的审美体验中如何收敛或发散。审美评价不仅取决于图像的客观属性，还取决于观察者的内部过程。作为对齐AI的持续努力，基于先前关于美感评分、身体感觉和情绪之间关系的人类研究，我们采用了一组相似的问卷项目，并将其呈现给LLMs，从而能够直接比较人类和AI的反应。我们的比较分析发现，尽管人类和AI在美感评分与情绪的相关性以及优先考虑的图像特征上表现出广泛相似的模式，但在情绪反应的分布和美感评分与身体感觉的关系上出现了显著的分歧。这些发现表明，训练于大规模文本数据的最先进LLMs在一定程度上可以近似人类在审美评估中的平均倾向。然而，它们也表明了局限性，特别是在内感受方面，这可能反映了训练数据中的表示不足或对齐过程的意外后果。这些发现突显了AI对齐的关键挑战，并为开发具有人类样审美处理能力的AI系统指明了重要方向。

英文摘要

Artificial intelligence (AI), exemplified by large language models (LLMs), is rapidly approaching and in some cases surpassing human performance across a wide range of cognitive tasks. However, human nature is not limited to intelligence alone; it also encompasses sensibility, including the capacity to perceive and experience beauty in visual scenes. This raises a fundamental question: how humans and AI systems converge or diverge in such aesthetic experiences. Aesthetic evaluation depends not only on objective properties of images but also on internal processes within the observer. As part of ongoing efforts in AI alignment, building upon prior human studies that have examined the relationship between beauty ratings, bodily sensations, and emotions, we adopt a comparable set of questionnaire items and present them to LLMs, enabling a direct comparison between human and AI responses. Our comparative analyses revealed that, while humans and AI exhibited broadly similar patterns in the correlations between beauty ratings and emotions, as well as in the image features they prioritized, notable divergences emerged in both the distribution of emotional responses and the relationship between beauty ratings and bodily sensations. These findings suggest that state-of-the-art LLMs, trained on large-scale textual data, can approximate average human tendencies in aesthetic evaluation to a certain extent. However, they also indicate limitations, particularly in relation to interoceptive aspects, which may reflect insufficient representation in training data or unintended consequences of alignment processes. These findings highlight key challenges for AI alignment and suggest important directions for developing AI systems with human-like aesthetic processing.

URL PDF HTML ☆

赞 0 踩 0

2605.18758 2026-05-20 cs.HC cs.AI 版本更新

OmniGUI: Benchmarking GUI Agents in Omni-Modal Smartphone Environments

OmniGUI: 评估多模态智能手机环境中的GUI代理的基准测试

Felix Henry, Xiaochen Lin, Jiangyou Zhu, Yangfan, Bingqian Zhang, Min Chen, Shiyu Huang

发表机构 * XPeng Motors（小鹏汽车）

AI总结本文提出OmniGUI基准测试，用于评估在多模态智能手机环境中GUI代理的能力，通过连续的多模态输入（包括静态图像、同步音频和视频片段）来模拟真实世界交互，发现当前模型在需要同步时间和听觉信号的环境中表现下降。

详情

AI中文摘要

当前图形用户界面（GUI）代理的基准测试主要依赖静态截图。然而，现实中的智能手机交互通常要求代理处理瞬时音频提示和时间动态视频，这些与动作时刻紧密相关。为此，我们引入OmniGUI，首个步骤级基准，用于评估GUI代理在多模态智能手机环境中的能力。OmniGUI提供连续的多模态输入，包括静态图像、同步音频和视频片段，涵盖709个专家演示的episode（2,579个动作步骤）跨越29个应用，系统注释了客观的多模态依赖水平。由于专门的多模态GUI代理框架尚处于初级阶段，我们选择基础的多模态模型，能够原生处理交错输入，作为代理代理的初始基线。我们的实证评估表明，尽管当前模型在视觉静态任务上表现出色，但在需要同步时间和听觉信号的环境中，其动作预测性能显著下降。此外，消融研究隔离了特定的操作瓶颈，特别是处理无关环境噪声时的跨模态干扰。完整的数据集、评估流程和基线提示在补充材料中提供。项目页面：https://omni-gui.github.io.

英文摘要

Current benchmarks for graphical user interface (GUI) agents predominantly rely on static screenshots. However, real-world smartphone interaction routinely requires agents to process transient audio cues and temporal video dynamics that are tightly coupled with the moment of action. To bridge this gap, we introduce OmniGUI, the first step-level benchmark designed to evaluate GUI agents in omni-modal smartphone environments. OmniGUI provides continuous, interleaved multimodal inputs comprising static images, synchronous audio, and video clips at every action step. The dataset encompasses 709 expert-demonstrated episodes (2,579 action steps) across 29 applications, systematically annotated with objective multimodal dependency levels. Because dedicated omni-modal GUI agent frameworks are currently in their nascent stage, we select foundational omni-modal models capable of natively processing interleaved inputs to serve as agent proxies for our initial baselines. Our empirical evaluation reveals that while current models exhibit competency on visually static tasks, their action prediction performance degrades significantly in environments requiring synchronous temporal and auditory signals. Furthermore, ablation studies isolate specific operational bottlenecks, notably cross-modal interference when processing task-irrelevant environmental noise. The complete dataset, evaluation pipeline, and baseline prompts are provided in the supplementary material. Project page: https://omni-gui.github.io.

URL PDF HTML ☆

赞 0 踩 0

2605.10075 2026-05-20 cs.AI 版本更新

Active Testing of Large Language Models via Approximate Neyman Allocation

通过近似奈曼分配主动测试大型语言模型

Zeli Liu, Jiancheng Zhang, Cong Liu, Yinglun Zhu

发表机构 * University of California, Riverside（加州大学河滨分校）

AI总结本文提出了一种针对生成任务的主动测试算法，利用语义熵进行分层并基于代理模型提取的信号进行近似奈曼分配，从而在多个语言和多模态基准测试中显著提升性能，实现高达28%的均方误差降低和22.9%的预算节省。

详情

AI中文摘要

大型语言模型（LLMs）需要从预训练到测试时间扩展的可靠评估，使评估成为重复而非一次性成本。随着模型规模增长和目标任务日益需要专家标注者，每次评估所需的计算和标注成本迅速上升。主动测试旨在通过从评估池中较小但有信息量的子集近似评估结果来缓解这一瓶颈。然而，现有方法主要针对分类任务并在生成任务上失效。我们提出了一种新的主动测试算法，专门针对生成任务。我们的方法利用代理模型的语义熵对评估池进行分层，并基于这些代理模型提取的信号进行近似奈曼分配。在多个语言和多模态基准测试以及多种代理-目标模型配对中，我们的方法在基线上显著提升，并接近Oracle-Neyman，实现了相对于均匀采样高达28%的均方误差降低和平均22.9%的预算节省。

英文摘要

Large language models (LLMs) require reliable evaluation from pre-training to test-time scaling, making evaluation a recurring rather than one-off cost. As model scales grow and target tasks increasingly demand expert annotators, both the compute and labeling costs needed for each evaluation rise rapidly. Active testing aims to alleviate this bottleneck by approximating the evaluation result from a small but informative subset of the evaluation pool. However, existing approaches primarily target classification and break down on generative tasks. We introduce a novel active testing algorithm tailored to generative tasks. Our method leverages semantic entropy from surrogate models to stratify the evaluation pool and then conducts approximate Neyman allocation based on signals extracted from these surrogates. Across multiple language and multimodal benchmarks and a range of surrogate-target model pairs, our method significantly improves on baselines and closely tracks Oracle-Neyman, delivering up to 28% MSE reduction over Uniform Sampling and an average of 22.9% budget savings.

URL PDF HTML ☆

赞 0 踩 0

2602.04883 2026-05-20 cs.LG cs.AI q-bio.BM q-bio.QM 版本更新

Protein Autoregressive Modeling via Multiscale Structure Generation

通过多尺度结构生成进行蛋白质自回归建模

Yanru Qu, Cheng-Yen Hsieh, Zaixiang Zheng, Ge Liu, Quanquan Gu

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出了一种多尺度自回归框架PAR，用于通过粗到细的下一尺度预测生成蛋白质主链结构。核心方法包括多尺度下采样操作、自回归Transformer和基于流的主链解码器，通过噪声上下文学习和调度采样缓解曝光偏差，实现高质量主链生成，并展示了强大的零样本泛化能力。

Comments ICML 2026 Spotlight; ByteDance Seed Tech Report; Page: https://par-protein.github.io/

详情

AI中文摘要

我们提出了蛋白质自回归建模（PAR），这是首个多尺度自回归框架，用于通过粗到细的下一尺度预测生成蛋白质主链结构。利用蛋白质的分层性质，PAR生成的结构模仿雕刻雕像的过程，形成粗略拓扑结构并逐步细化结构细节。为此，PAR由三个关键组件组成：（i）多尺度下采样操作，在训练过程中表示蛋白质结构在多个尺度上的特征；（ii）一个自回归Transformer，编码多尺度信息并生成条件嵌入以指导结构生成；（iii）基于流的主链解码器，根据这些嵌入生成主链原子。此外，自回归模型由于训练和生成过程不匹配而遭受曝光偏差，这会显著降低结构生成质量。我们通过采用噪声上下文学习和调度采样有效缓解了这一问题，实现了鲁棒的主链生成。值得注意的是，PAR表现出强大的零样本泛化能力，支持灵活的人类提示条件生成和基序支架构建，而无需微调。在无条件生成基准测试中，PAR有效学习了蛋白质分布，并生成高质量的主链结构，且表现出良好的扩展性。这些特性使PAR成为蛋白质结构生成的有前途的框架。

英文摘要

We present protein autoregressive modeling (PAR), the first multi-scale autoregressive framework for protein backbone generation via coarse-to-fine next-scale prediction. Using the hierarchical nature of proteins, PAR generates structures that mimic sculpting a statue, forming a coarse topology and refining structural details over scales. To achieve this, PAR consists of three key components: (i) multi-scale downsampling operations that represent protein structures across multiple scales during training; (ii) an autoregressive transformer that encodes multi-scale information and produces conditional embeddings to guide structure generation; (iii) a flow-based backbone decoder that generates backbone atoms conditioned on these embeddings. Moreover, autoregressive models suffer from exposure bias, caused by the training and the generation procedure mismatch, and substantially degrades structure generation quality. We effectively alleviate this issue by adopting noisy context learning and scheduled sampling, enabling robust backbone generation. Notably, PAR exhibits strong zero-shot generalization, supporting flexible human-prompted conditional generation and motif scaffolding without requiring fine-tuning. On the unconditional generation benchmark, PAR effectively learns protein distributions and produces backbones of high design quality, and exhibits favorable scaling behavior. Together, these properties establish PAR as a promising framework for protein structure generation.

URL PDF HTML ☆

赞 0 踩 0