arXivDaily: arXiv daily academic digest, updated Monday through Friday
All subject categories: 2084
2605.24041 2026-05-27 cs.LG cs.AI

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

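Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang

Affiliations: Dartmouth College; CUHK Shenzhen; Lawrence Berkeley National Lab

AI summary: Proposes the Iterative Refinement Neural Operator (IRNO), which applies a learned refinement module via fixed-point iteration and combines it with a progressive spectral loss to mitigate the spectral bias of neural operators, substantially reducing high-frequency error on physical systems such as turbulent flow and active matter.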

Comments 47 pages; accepted to ICML 2026 as a Spotlight

Abstract

Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module applied iteratively via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases the penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to the base operator, the normalized error ratios decrease to 27.72-36.10% at low, 5.07-6.68% at mid, and 1.48-2.04% at high frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator
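
The refinement loop described in this abstract can be pictured with a toy fixed-point iteration. The sketch below is not the authors' code: it assumes a scalar linear problem `a*u = b`, a deliberately poor initial guess standing in for the base operator, and a hand-written damped residual correction in place of a learned refinement module. The names `refine` and `iterate` are hypothetical.

```python
def refine(u, a, b, step):
    """One residual correction: u <- u + step * (b - a*u)."""
    return u + step * (b - a * u)


def iterate(u0, a, b, step, n_steps):
    """Apply the refinement map repeatedly, recording the absolute error."""
    exact = b / a
    u = u0
    errors = [abs(u0 - exact)]
    for _ in range(n_steps):
        u = refine(u, a, b, step)
        errors.append(abs(u - exact))
    return u, errors


# Toy problem (assumed for illustration): 2*u = 6, exact solution u = 3.
a, b = 2.0, 6.0
u_final, errors = iterate(u0=0.0, a=a, b=b, step=0.25, n_steps=20)
# The map is a contraction with factor |1 - step*a| = 0.5,
# so every refinement step halves the remaining error.
```

Because the contraction factor is below one, the iteration converges to the same fixed point from any starting value, which is the property the abstract establishes (under local assumptions) for the learned operator.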

2605.23910 2026-05-27 cs.CL cs.AI

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

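Marcin Michał Mirończuk

Affiliations: National Information Processing Institute

AI summary: Through a systematic review and meta-analysis, this paper proposes a unified framework, quantifies the performance gains of multimodal and multiview fusion in document classification, and exposes a lack of methodological rigour in the field.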

Journal ref Information Fusion, 132, 2026, 104247

Abstract

Information fusion is widely used to improve document classification by integrating multiple data sources (multimodal) or representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents the results of a qualitative analysis to identify key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$), whereas the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains for accuracy (+4.67%), F1-score (+3.08%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers reproducibility challenges related to methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. This review concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
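
For readers unfamiliar with random-effects pooling, the sketch below shows one common estimator, DerSimonian-Laird. This is an assumption: the abstract does not say which estimator the review uses, and the per-study gains and variances here are made up for illustration rather than taken from the paper.

```python
def random_effects(effects, variances):
    """DerSimonian-Laird pooled effect, its standard error, and tau^2."""
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * y for wi, y in zip(w, effects)) / sum(w)
    q = sum(wi * (y - fixed) ** 2 for wi, y in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # between-study variance
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * y for wi, y in zip(w_star, effects)) / sum(w_star)
    se = (1.0 / sum(w_star)) ** 0.5
    return pooled, se, tau2


# Hypothetical accuracy gains (percentage points) with unit variances.
homogeneous = random_effects([4.0, 6.0, 5.0], [1.0, 1.0, 1.0])
heterogeneous = random_effects([0.0, 10.0], [1.0, 1.0])
```

When studies disagree more than their sampling variances explain, `tau2` grows and the pooled standard error widens, which is why a random-effects model can report a positive but non-significant effect.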

2605.22557 2026-05-27 cs.LG cs.NA math.NA

Neural Flow Operators can Approximate any Operator: Abstract Frameworks and Universal Approximations

Shuang Chen, Juncai He, Xue-Cheng Tai

Affiliations: Qiuzhen College, Tsinghua University; Yau Mathematical Sciences Center, Tsinghua University; Norwegian Research Center

AI summary: Proposes an abstract neural flow framework covering continuous-depth models with composition and separation structures, proves universal approximation properties in both finite- and infinite-dimensional spaces, and unifies residual and plain architectures through time discretization.

Abstract

We introduce an abstract neural flow framework for neural networks and neural operators. The framework contains two continuous-depth models, namely neural flows with composition and separation structures, and covers both finite-dimensional function approximation and infinite-dimensional operator approximation. We prove well-posedness and universal approximation properties for the corresponding neural flows, including, to the best of our knowledge, the first universal approximation result for flow-based models between infinite-dimensional spaces. We also obtain universal approximation results for convolutional neural flow models. Through suitable time discretizations, the composition structure recovers ResNet-type architectures, while the separation structure, via a splitting-based discretization, yields plain architectures. This gives a unified flow-based route to both residual and plain architectures for neural networks and neural operators with fully connected or convolutional linear layers.
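
The link between time discretization and ResNet-type architectures can be illustrated with a forward-Euler step. This is a sketch under an assumption: the abstract does not name the discretization scheme, and the scalar vector field used here is a toy stand-in for a learned layer. The names `residual_block` and `flow` are hypothetical.

```python
import math


def residual_block(u, f, h):
    """One forward-Euler step of du/dt = f(u); it has the form u + h*f(u)."""
    return u + h * f(u)


def flow(u0, f, depth, t_end=1.0):
    """Compose `depth` residual blocks to integrate from t=0 to t=t_end."""
    h = t_end / depth
    u = u0
    for _ in range(depth):
        u = residual_block(u, f, h)
    return u


def decay(u):
    """Toy vector field with exact flow u(t) = u0 * exp(-t)."""
    return -u


coarse = flow(1.0, decay, depth=10)    # shallow network, large step
fine = flow(1.0, decay, depth=1000)    # deep network, small step
exact = math.exp(-1.0)
```

Each block adds a scaled update to its input, which is the skip-connection pattern of a residual network; increasing the depth shrinks the step size and brings the composition closer to the continuous flow.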

2605.22468 2026-05-27 cs.LG cs.AI

BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series

Guikang Du, Haoran Li, Xinyu Liu, Zhibo Zhang, Xiaoli Gong, Jin Zhang

Affiliations: College of Computer Science, Nankai University, Tianjin, China; College of Cyber Science, Tianjin Key Laboratory of Interventional Brain-Computer Interface and Intelligent Rehabilitation, Key Lab of Data and Intelligent System Security, Frontiers Science Center for New Organic Matter, Nankai University, Tianjin, China

AI summary: Proposes BioFormer, which explicitly models subject-specific variability through the lens of spectral drift and aligns spectral structure using a Frequency-Band Alignment Module and Sample Conditional Layer Normalization, improving F1-score by 6% across six datasets.

Abstract

Cross-subject generalization in biomedical time-series (BTS) refers to training on data from some subjects and testing on unseen subjects. The key challenge is to suppress subject-specific variability in BTS representations. Most existing methods suppress this variability implicitly through model building or subject-adversarial learning, but rarely model it explicitly. We introduce spectral drift as a new perspective for characterizing subject-specific variability. Specifically, BTS signals under the same label often share a consistent oscillatory structure, yet exhibit subject-dependent magnitude or phase shifts in specific frequency components, which we interpret as subject-specific variability. Building on this insight, we propose BioFormer. At its core is a Frequency-Band Alignment Module (FBAM) that generates band-wise modulation factors from the spectral distribution and adaptively adjusts amplitude and phase to align spectral structure, thereby mitigating variability. We further pair FBAM with Sample Conditional Layer Normalization, which infers normalization parameters from intrinsic signal statistics rather than subject identity, stabilizing cross-subject representations. Extensive experiments on six datasets demonstrate that BioFormer outperforms 12 baselines, yielding absolute F1-score improvements of 6%.

2605.22417 2026-05-27 cs.CV cs.SE

The Neglected Baseline in Model Interpretation

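Yongjin Cui, Xiaohui Fan

Affiliations: Zhejiang University

AI summary: Observing that existing model interpretation methods generally ignore the baseline and therefore produce imprecise results, this paper reformulates the interpretation task and its principles, unifies gradient-based methods, Integrated Gradients, and Taylor expansion, analyzes the flaws of related methods, and revises Integrated Gradients with a clear and reasonable baseline, supporting interpretation based on features from any layer.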

Abstract

We observe that existing model interpretation methods generally ignore the baseline, and such neglect often results in imprecise or even incorrect interpretation. In this paper, we reformulate the task of model interpretation and the principles that interpretation results should satisfy to demonstrate the importance of the baseline. We further unify gradient-based methods, Integrated Gradients (IG) methods, and Taylor expansion, clarifying the connections among them and explicitly identifying the baseline for each method. On this basis, we analyze the flaws and errors in related model interpretation methods (IG, LayerCAM, ODAM, Difference Map). We advocate evaluating the quality of model interpretation results precisely through the attribution error between the attribution result and the attribution target, rather than adopting flawed evaluation methods, such as those based on marginal effects or the assumption of perfect model performance. We revise IG and develop a model interpretation method with a clear and reasonable baseline, achieving better results. Our method supports model interpretation based on features from any layer. Interpretations based on features from different layers are all reasonable, and the differences among these results reflect varying degrees of feature extraction at different feature extraction stages.
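
To make the role of the baseline concrete, the sketch below implements standard Integrated Gradients with a midpoint Riemann sum. It is not the revised method from this paper; the function `f`, its hand-written gradient, and the two baselines are hypothetical examples chosen so the result can be checked by hand.

```python
def integrated_gradients(grad, x, baseline, steps=50):
    """Approximate IG_i = (x_i - b_i) * integral of df/dx_i along the path b -> x."""
    n = len(x)
    attributions = [0.0] * n
    for s in range(steps):
        alpha = (s + 0.5) / steps  # midpoint rule on [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            attributions[i] += (x[i] - baseline[i]) * g[i] / steps
    return attributions


def f(p):
    """Toy model: f(x0, x1) = x0*x1 + x0^2."""
    return p[0] * p[1] + p[0] ** 2


def grad_f(p):
    """Analytic gradient of the toy model."""
    return [p[1] + 2.0 * p[0], p[0]]


x = [1.0, 2.0]
attr_zero = integrated_gradients(grad_f, x, baseline=[0.0, 0.0])
attr_shifted = integrated_gradients(grad_f, x, baseline=[1.0, 0.0])
```

With the zero baseline the attributions are about `[2, 1]`; with the baseline `[1, 0]` they become about `[0, 2]`. In both cases they sum to `f(x) - f(baseline)`, so the same input receives a different explanation depending on the baseline, which is the dependence the abstract argues is usually overlooked.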

2605.21617 2026-05-27 cs.LG q-bio.QM

$\textit{BlockFormer}$: Transformer-based inference from interaction maps

Eloïse Touron, Pedro L. C. Rodrigues, Julyan Arbel, Nelle Varoquaux, Michael Arbel

Affiliations: Univ. Grenoble Alpes; Inria; CNRS; Grenoble INP; LJK; TIMC

AI summary: Proposes BlockFormer, a data-driven method built on a transformer architecture and trained on synthetic data from a simulator, to solve the inverse problem of inferring parameters from interaction maps with a variable number and size of entities; it is applied successfully to centromere localization across many species.

Abstract

Inference from interaction maps, such as centromere identification from genome-wide chromosome conformation capture techniques -- notably Hi-C -- can be formulated as a generic inverse problem: infer a set of parameters given a map summarizing pairwise interactions between entities through blocks of variable numbers and sizes. In this work, we introduce a data-driven approach that leverages shared structure between these maps, such as global alignment between localized patterns, while handling the variability in number and size of entities arising in real-world data. Our approach relies on a transformer architecture capable of handling such variability and a custom simulator to generate abundant, yet computationally cheap synthetic data for training. Applied to the problem of centromere localization, the method accurately recovers their genomic positions across a wide range of species of various genome sizes.

2605.20530 2026-05-27 cs.AI cs.CL cs.LG cs.SE

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

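Parsa Mazaheri, Kasra Mazaheri

Affiliations: University of California, Santa Cruz; Massachusetts Institute of Technology

AI summary: Proposes the AgentAtlas framework, which uses a control-decision taxonomy and a trajectory-failure vocabulary to separate agent evaluation into outcome success, control-decision quality, and trajectory quality, and highlights the measurement risks of relying on outcome-only leaderboards.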

Abstract

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

2605.02035 2026-05-27 cs.CL cs.AI

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

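Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

Affiliations: Department of Informatics, Universität Hamburg; Alibaba Group; Alibaba Cloud

AI summary: Presents the VIDA dataset of 2,500 carefully curated instances for evaluating ambiguities in multimodal machine translation that can only be resolved with visual evidence, introduces disambiguation-centric metrics, and shows experimentally that chain-of-thought fine-tuning improves out-of-distribution disambiguation.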

Abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.

2605.01817 2026-05-27 cs.LG

Skipping the Zeros in Diffusion Models for Sparse Data Generation

Phil Sidney Ostheimer, Mayank Nagda, Andriy Balinskyy, Gabriel Vicente Rodrigues, Jean Radig, Carl Herrmann, Stephan Mandt, Marius Kloft, Sophie Fellenz

Affiliations: RPTU University Kaiserslautern-Landau; Heidelberg University; University of California, Irvine

AI summary: Proposes Sparsity-Exploiting Diffusion (SED), which preserves sparsity by modeling only non-zero values and skips zeros during training and inference, saving computation while maintaining or improving generation quality.

Comments Accepted to ICML 2026

Abstract

Diffusion models (DMs) excel on dense continuous data, but are not designed for sparse continuous data. They do not model exact zeros that represent the deliberate absence of a signal. As a result, they erase sparsity patterns and perform unnecessary computation on mostly zero entries. With Sparsity-Exploiting Diffusion (SED), we model only non-zero values, preserving sparsity. SED delivers computational savings while maintaining or improving generation quality by skipping zeros during training and inference. Across physics and biology benchmarks, SED matches or surpasses conventional DMs and domain-specific baselines, while vision experiments provide intuitive insights into the limitations of dense DMs and the benefits of SED.

2605.01032 2026-05-27 cs.AI cs.LO cs.PL

Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries

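Alan L. McCann

Affiliations: Mashin, Inc.

AI summary: Proposes an algebraic semantics for governed execution built on interaction trees and parameterized coinduction, in which a three-axiom governance algebra record induces a symmetric monoidal category, making governance compositional and coterminous with expressibility.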

Comments 26 pages, 1 figure, 1 table. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

Abstract

We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three-axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance-preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability-indexed composition bundles programs with machine-checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property-based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.

2605.01030 2026-05-27 cs.AI cs.LO cs.PL

Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries

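Alan L. McCann

Affiliations: Mashin, Inc.

AI summary: Using a machine-checked formalization, this paper proves that effect-level governance can be imposed on AI workflow architectures without reducing internal computational expressivity, and establishes a theoretical basis for treating governance and computational expressivity as orthogonal.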

Comments 15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers. License updated

Abstract

We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establish seven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non-trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content-level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance-only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.

2605.26693 2026-05-27 cs.LG cs.AI stat.ML

Model Merging on Loss Landscape: A Geometry Perspective

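Juanwu Lu, Anand Bhaskar, Brian Axelrod, Ekaterina Tolstaya, Tristan Emrich

Affiliations: Purdue University; Waymo LLC

AI summary: Proposes EpiMer, a framework that treats model merging as a Fréchet mean on a Riemannian manifold, using the low-rank subspace spanned by the task vectors and the expected Hessian as the metric; it characterizes theoretically when curvature-aware merging outperforms flat-geometry methods and validates the gains on eight image classification tasks.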

Comments CVPR 2026 Findings Track. 18 pages, 4 figures, 6 tables

Abstract

Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either ignore the geometry of the loss landscape or rely on intractable full-space Hessian approximations. We propose EpiMer, a framework that casts model merging as solving the Fréchet mean on a Riemannian manifold and restricts the computation to a low-rank subspace spanned by the task vectors. With the expected Hessian as the metric, we reveal a connection between local curvature and epistemic uncertainty of the parameters. Our theoretical analysis decomposes the merging error bound into the subspace Fréchet variance and the residual energy, and provides a closed-form characterization of when curvature-aware merging provably outperforms flat-geometry methods. In addition, our framework unifies both curvature-aware methods and recent spectral methods as special cases of the subspace Fréchet mean with different geometric metrics. Merging fine-tuned CLIP-ViT models on eight image classification tasks, Epistemic Merging strictly outperforms the baselines on all three CLIP-ViT backbones at matched rank, improving the across-task average accuracy and worst-task accuracy on every backbone.

2605.26691 2026-05-27 cs.AI

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

Yunhui Gan, Tan Pan, Kaiyu Guo, Limei Han, Weimiao Yu, Guangnan Ye, Chen Jiang, Yuan Cheng

Affiliations: Fudan University; Shanghai Academy of Artificial Intelligence for Science; Shanghai Innovation Institute; The University of Queensland; Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR)

AI summary: Addresses the problem that tools used by medical AI agents can fail in real clinical settings by proposing a GRPO-based reinforcement learning framework that uses instance-level tool selection and disagreement-aware synergy learning to correct erroneous tool consensus and improve robustness.

Abstract

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. In particular, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.
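
The gap between the best fixed tool and an ideal instance-wise selector can be computed directly from a table of per-instance outcomes. The sketch below assumes 0-1 loss and a made-up correctness table for three hypothetical tools; the paper's formal definition of the Single-Oracle risk gap may differ.

```python
def single_oracle_gap(correct):
    """correct[t][i] is True when tool t answers instance i correctly.

    Returns (best fixed-tool risk, instance-wise oracle risk, gap).
    """
    n = len(correct[0])
    # Risk of committing to one tool for every instance, minimized over tools.
    best_fixed_risk = min(1 - sum(row) / n for row in correct)
    # An ideal selector is wrong only when every tool is wrong.
    oracle_risk = 1 - sum(any(col) for col in zip(*correct)) / n
    return best_fixed_risk, oracle_risk, best_fixed_risk - oracle_risk


# Hypothetical outcomes for three tools on four instances.
table = [
    [True, True, False, False],
    [False, True, True, False],
    [True, False, False, True],
]
best_fixed, oracle, gap = single_oracle_gap(table)
```

Here every tool is right on half of the instances, yet some tool is right on each one, so task-level selection is stuck at a risk of 0.5 while instance-level selection could in principle reach 0.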

2605.26690 2026-05-27 cs.LG cs.AI q-bio.QM

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

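Ashima Khanna, Dominik Grimm

Affiliations: Technical University of Munich; University of Applied Sciences Weihenstephan-Triesdorf

AI summary: Proposes SILO, a framework that combines a hierarchical edit policy, incremental stochastic beam search, and a UCB-based proxy ensemble to optimize protein sequences under a limited oracle budget, achieving the best performance on eight protein fitness landscapes.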

Abstract

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git
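
The candidate-selection step can be sketched as an upper-confidence-bound ranking over an ensemble of proxy predictions. This is a simplified illustration, not the SILO implementation: it scores each candidate by ensemble mean plus `beta` times the ensemble standard deviation, omits the alanine-scan fitness score entirely, and uses made-up predictions and a hypothetical `ucb_select` helper.

```python
from statistics import mean, pstdev


def ucb_select(predictions, k, beta=1.0):
    """Rank candidates by mean + beta * std of their ensemble predictions.

    predictions maps a candidate name to a list of per-model fitness estimates.
    Returns the k highest-scoring candidate names, best first.
    """
    scores = {
        name: mean(values) + beta * pstdev(values)
        for name, values in predictions.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical proxy-ensemble outputs for three candidate sequences.
ensemble = {
    "seq_a": [1.0, 1.0, 1.0],   # confident, moderate fitness
    "seq_b": [0.0, 2.0, 1.0],   # same mean, high disagreement
    "seq_c": [0.5, 0.5, 0.5],   # confident, low fitness
}
chosen = ucb_select(ensemble, k=2)
```

The uncertainty bonus ranks `seq_b` above `seq_a` even though their means are equal, so a limited oracle budget is spent where the proxies disagree and an evaluation is most informative.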

2605.26689 2026-05-27 cs.CV cs.CL

PinPoint: Prompting with Informative Interior Points

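Pouya Sadeghi, Shawn He, Pedro Pablo Guerrero Vela, C. Thomas, Alex Wong, Sirisha Rambhatla

Affiliations: University of Waterloo; Critical ML Apple

AI summary: To close the performance gap caused by ambiguous prompts when pairing a VLM with SAM for referring image segmentation, this paper proposes PinPoint, a deterministic, training-free point selector that fuses visual cues to pick stable, informative interior points and matches supervised and RL-tuned methods without any training.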

Abstract

Modern referring image segmentation pipelines couple a vision-language model (VLM) for grounding with a promptable segmenter such as the Segment Anything Model (SAM) for mask generation. Prior training-free instances of this recipe consistently trail fine-tuned and reinforcement-learning (RL)-tuned specialists, and it has been unclear whether the gap comes from the VLM's grounding, SAM's capacity, or the prompt. We show that the gap is dominated by prompt ambiguity: a VLM-proposed bounding box (bbox) leaves SAM to guess which pixels inside the bbox belong to the object the expression denotes. Interior points are the natural disambiguator, but where they fall matters; prior work relies on naively sampled points that land on boundaries, distractors, and background clutter, and can even hurt performance compared to the bbox alone. Supervised and RL-tuned methods close this gap by training a VLM to predict better points; we show that this training is unnecessary. At a matched budget of five interior points, replacing naive sampling with stable, informative point selection improves cumulative Intersection-over-Union (cIoU) by 12-18 points across RefCOCO/+/g, with every model fixed. We turn this observation into PinPoint, a deterministic, training-free point selector that fuses four visual cues into a consensus map, selects compact, spatially diverse points away from boundaries, and uses the frozen VLM to label each point. Without any task-specific training, PinPoint matches supervised and RL-tuned specialists on the same stack while issuing only two VLM calls per query.

2605.26683 2026-05-27 cs.CL cs.AI

An In-Vitro Study on Cross-Lingual Generalization in Language Models

语言模型中跨语言泛化的体外研究

Adrian Cosma

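Affiliations * Dalle Molle Institute for Artificial Intelligence (IDSIA)

AI summary Builds two procedurally generated languages to independently control variables such as lexical distance and minority-language proportion and to study how cross-lingual transfer arises in language models; finds that transfer depends mainly on whether tokenization preserves reusable cross-lingual substructure, and that smaller vocabularies often help masked transfer.

Comments 16 Figures, 1 Table

Details
AI Chinese abstract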

在自然语料中,语言模型的跨语言迁移难以研究,因为词汇重叠、形态、数据不平衡和分词相互纠缠。我们引入了一个体外框架,使用两种程序生成的语言,它们共享相同的本体、类型化语法和组合结构,但表面实现不同。这使我们能够独立改变词汇距离、少数语言比例、分词器训练制度和词汇量大小,同时评估在掩码少数语言条件下的迁移,该条件的词汇形式在训练中从未被观察到。在700次受控运行中,我们发现迁移受分词器平衡或原始词汇相似性的影响较小,而更多地取决于分词是否保留可复用的跨语言子结构。较小的词汇量通常通过保持单词可分解为共享片段来改善掩码迁移,而较大的词汇量可能将形式转化为特定语言的原子。我们进一步表明,迁移是一个阶段性过程:语法和类型级能力先于掩码词汇泛化。最后,我们尝试通过分词器桥梁解释这一机制,并表明桥梁强度与掩码可达性密切相关。

English abstract

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.

2605.26682 2026-05-27 cs.RO cs.CV

SteelDS: A High-Resolution Video Dataset of E40 Steel Scrap for Object Detection and Instance Segmentation

SteelDS: 用于目标检测和实例分割的E40钢废料高分辨率视频数据集

Melanie Neubauer, Christian Rauch, Gerald Koinig, Alexia Tischberger-Aldrian, Roland Pomberger, Elmar Rueckert

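Affiliations * Chair of Cyber-Physical-Systems; Technical University of Leoben; Chair of Waste Processing Technology and Waste Management

AI summary A dataset of high-resolution, annotated video sequences of E40-grade steel and copper scrap on a conveyor belt, intended to support machine learning models for material classification, object detection, and instance segmentation.

Details
AI Chinese abstract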

该数据集提供了粉碎的E40级钢和铜废料在传送带上的高分辨率、标注视频序列。在受控实验室环境中捕获,数据反映了工业磁选后阶段,通常需要人工干预去除铜污染物。数据集包含五个子集的24,297个标注帧,包含396个钢和101个铜物体,按大小分类。它支持材料分类、目标检测和实例分割的机器学习模型开发。包含物体间距和密度的变化,以模拟真实的工业分拣条件。地面真值标注包括像素级分割掩码和材料类别。该数据集作为评估自动化分拣算法的基准,旨在识别复杂、异质钢废料流中的铜杂质。

English abstract

This dataset provides high-resolution, annotated video sequences of shredded E40-grade steel and copper scrap on a conveyor belt. Captured in a controlled laboratory environment, the data reflects the industrial post-magnetic sorting stage, where manual intervention is typically required to remove copper contaminants. The dataset comprises 24,297 labeled frames across five subsets, featuring 396 steel and 101 copper objects categorized by size. It supports the development of machine learning models for material classification, object detection, and instance segmentation. Variations in object spacing and density are included to simulate realistic industrial sorting conditions. Ground truth annotations include pixel-wise segmentation masks and material classes. This dataset serves as a benchmark for evaluating automated sorting algorithms aiming to identify copper impurities within complex, heterogeneous steel scrap streams.

2605.26680 2026-05-27 cs.CV cs.AI

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

DynFrame: 自适应推理驱动的多模态框架与动态帧增强用于复杂视频理解

Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang

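Affiliations * Zhejiang University; Alibaba Group

AI summary Proposes the DynFrame framework, which emits the temporal window and the sampling density as native tokens for single-step retrieval and introduces Segment-Decoupled GRPO, addressing two problems in video MLLMs: sampling density that is not learnable, and coupled optimization of retrieval and answering.

Details
AI Chinese abstract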

最近视频多模态大语言模型(MLLMs)越来越多地将逐步推理与按需视觉证据检索相结合,允许模型在推理过程中重新访问相关视频片段。然而,现有的思考与视频系统仍存在两个结构性缺陷。(i)采样密度不是一个可学习的决策:现有方法可能让模型决定看哪里,但每个窗口的帧率基本固定。因此,细粒度证据通常通过重复的检索调用来恢复,这增加了推理上下文长度和训练难度。(ii)检索和答案生成通常使用单个轨迹级优势进行优化,因此“看哪里”的token和“如何回答”的token获得相同的信用,即使一个正确而另一个不正确。为了解决这些缺陷,我们提出了DynFrame,一个在单次自回归过程中将时间窗口和采样密度作为原生token发出的框架。这种可学习的跨度-密度检索使得单步检索即可获取多粒度证据。基于上述token化检索接口,我们进一步引入了分段解耦GRPO(SD-GRPO),它在检索边界分割每次展开,并分配角色特定的token级优势,分别对采样决策和答案进行信用分配。在精心策划的DM-CoT-74k和DM-RL-45k上训练后,DynFrame-4B在六个基准测试(NExT-GQA、Charades-STA、ActivityNet-MR、Video-MME、MLVU、LVBench)上与强大的7B-8B基线竞争,而DynFrame-8B在大多数指标上创造了新的最先进水平。代码可在https://github.com/zhangguanghao523/DynFrame获取。

English abstract

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.
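
The credit-assignment idea behind SD-GRPO can be illustrated with a small sketch. It assumes each rollout carries separate scalar rewards for the retrieval decision and for the answer, and that advantages are standardized within the rollout group as in GRPO; the field names, the reward split, and the normalization details are assumptions made for illustration, not the authors' implementation.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: each reward standardized within its rollout group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_decoupled_advantages(rollouts):
    """Assign role-specific advantages to tokens on each side of the boundary.

    Each rollout is a dict with hypothetical keys:
      'num_tokens'       total generated tokens
      'boundary'         index of the first token after the retrieval call
      'retrieval_reward' reward for the "where to look" segment
      'answer_reward'    reward for the "how to answer" segment
    """
    adv_retrieval = group_normalize([r["retrieval_reward"] for r in rollouts])
    adv_answer = group_normalize([r["answer_reward"] for r in rollouts])
    per_token = []
    for r, a_ret, a_ans in zip(rollouts, adv_retrieval, adv_answer):
        b, n = r["boundary"], r["num_tokens"]
        # Tokens before the boundary are credited for retrieval only,
        # tokens after it for the answer only.
        per_token.append([a_ret] * b + [a_ans] * (n - b))
    return per_token
```

With a single trajectory-level advantage, a rollout that looks in the right place but answers wrongly would push both segments in the same direction; splitting at the boundary lets the two segments receive opposite signs.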

2605.26678 2026-05-27 cs.CL

NestedKV: Nested Memory Routing for Long-Context KV Cache Compression

NestedKV: 用于长上下文KV缓存压缩的嵌套内存路由

Hong Chen, Xiang Liu, Yubo Gao, Yuxuan Fan, Bo Wang, Yuanlin Chu, Yuanguo Lin, Xuming Hu

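Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Jimei University

AI summary Proposes NestedKV, a training-free, key-only KV cache compression method that uses multi-time-scale cosine anomaly scoring and head-adaptive mixing; it outperforms KeyDiff on long-context benchmarks, most clearly when the retained cache is small.

Details
AI Chinese abstract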

长上下文语言模型受限于键值(KV)缓存的内存占用。现有的免训练KV压缩方法通常通过单一重要性信号(注意力、近期性、逐层分配或键独特性)对令牌进行排序,这在有用上下文具有全局独特性、局部片段性或即时相关性时变得脆弱。我们引入NestedKV,一种受嵌套学习中连续内存系统启发的仅键KV缓存压缩方法。NestedKV维护全局、块级和滑动窗口键锚点,通过多时间尺度余弦异常对令牌评分,并将所得排名与使用头自适应混合和惊喜门控令牌路由的免训练外部学习器结合。该评分与自适应每头预算配对,无需训练或修改LLM。在RULER(4k--32k)、LooGLE、LongBench、LongBench-E、InfiniteBench和MMLU-Pro上,使用Qwen3和Llama-3.2模型,NestedKV在保留缓存较小时表现最强。在Qwen3-4B上,当r=0.75时,它在RULER上比KeyDiff提升高达19.10分,在LongBench上提升19.29分;当r=0.95时,它在LongBench上保留37.32分,而KeyDiff为17.55分。

English abstract

Long-context language models are limited by the memory footprint of the key-value (KV) cache. Existing training-free KV compression methods usually rank tokens by one importance signal -- attention, recency, layer-wise allocation, or key distinctiveness -- which becomes brittle when useful context is globally distinctive, locally episodic, or immediately relevant. We introduce NestedKV, a key-only KV cache compression method inspired by the Continuum Memory System in Nested Learning. NestedKV maintains global, block-level, and sliding-window key anchors, scores tokens by multi-time-scale cosine anomaly, and combines the resulting rankings with a training-free outer learner using head-adaptive mixing and surprise-gated token routing. The score is paired with adaptive per-head budgets and requires no training or LLM modification. Across RULER (4k--32k), LooGLE, LongBench, LongBench-E, InfiniteBench, and MMLU-Pro on Qwen3 and Llama-3.2 models, NestedKV is strongest when the retained cache is small. On Qwen3-4B, it improves over KeyDiff by up to 19.10 points on RULER and 19.29 on LongBench at $r=0.75$; at $r=0.95$, it retains 37.32 on LongBench versus 17.55 for KeyDiff.
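
A rough sketch of what multi-time-scale cosine anomaly scoring could look like. It assumes each anchor is the mean key over its scope (all tokens, the token's block, or a trailing window) and that the three anomalies are simply averaged; head-adaptive mixing, surprise-gated routing, and per-head budgets are omitted. The anchor definitions, the averaging rule, and all names are assumptions, not the paper's method.

```python
import math


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine_anomaly(key, anchor):
    """1 - cosine similarity; larger means the key deviates more from the anchor."""
    dot = sum(k * a for k, a in zip(key, anchor))
    norm = math.sqrt(sum(k * k for k in key)) * math.sqrt(sum(a * a for a in anchor))
    return 1.0 - dot / norm if norm > 0 else 1.0


def multiscale_scores(keys, block=4, window=2):
    """Average anomaly of each key against global, block, and window anchors."""
    global_anchor = mean_vec(keys)
    scores = []
    for i, key in enumerate(keys):
        start = (i // block) * block
        block_anchor = mean_vec(keys[start:start + block])
        window_anchor = mean_vec(keys[max(0, i - window + 1):i + 1])
        parts = [cosine_anomaly(key, a)
                 for a in (global_anchor, block_anchor, window_anchor)]
        scores.append(sum(parts) / len(parts))
    return scores


def select_tokens(keys, keep):
    """Indices of the `keep` highest-scoring tokens, in original order."""
    scores = multiscale_scores(keys)
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])
```

Only keys are inspected, so a scorer of this shape needs no attention weights and no change to the model, which is consistent with the key-only, training-free framing in the abstract.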

2605.26676 2026-05-27 cs.CV

Memory-Distilled Selection for Noise-Robust Anomaly Detection

记忆蒸馏选择用于噪声鲁棒异常检测

Sirojbek Safarov, Jaewoo Park, Yoon Gyo Jung, Kuan-Chuan Peng, Wonchul Kim, Seongdeok Bang, Octavia Camps

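Affiliations * AIVEX Inc.; Northeastern University; Mitsubishi Electric Research Laboratories (MERL)

AI summary Proposes MeDS, a training algorithm based on data selection: it builds an ensemble of partial memories via random subsampling, uses the resulting sparsity as a low-pass filter to capture nominal patterns, and distills the result into a reconstruction score network for noise-robust anomaly detection.

Comments Accepted by ICML2026. The code is available at https://github.com/SirojbekSafarov/MeDS

Details
AI Chinese abstract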

数据污染下的异常检测对于在工业环境中部署无监督缺陷检测至关重要,因为整理完全干净的训练集是不切实际的。然而,现有方法对污染敏感,随着噪声比例增加,性能显著下降。在本文中,我们提出记忆蒸馏选择(MeDS),一种基于数据选择的训练算法。MeDS通过随机子采样构建部分记忆集成,其中产生的稀疏性作为低通滤波器,在广泛的噪声比例下捕获名义模式,从而实现对污染样本的粗粒度识别。然后,将到自举记忆的聚合距离蒸馏到重建分数网络中,随后在通过蒸馏模型过滤的干净数据上进行微调,实现异常的精确定位。MeDS在广泛的噪声比例下具有鲁棒性,无需针对特定噪声比例的超参数调整,在MVTecAD上以40%噪声比例达到99.16%的图像级AUROC,并在噪声设置下在VisA和Real-IAD上取得最先进性能。我们在噪声数据场景下的工业AD基准上彻底验证了MeDS的有效性,并进行了深入的经验分析。

English abstract

Anomaly detection (AD) under data contamination is critical for deploying unsupervised defect detection in industrial environments, where curating perfectly clean training sets is impractical. However, existing methods are sensitive to contamination, suffering significant performance degradation as the noise ratio increases. In this paper, we propose Memory-Distilled Selection (MeDS), a training algorithm based on data selection. MeDS constructs an ensemble of partial memories via random subsampling, where the resulting sparsity acts as a low-pass filter that captures nominal patterns across a wide range of noise ratios, enabling coarse-level identification of contaminated samples. The aggregated distances to the bootstrapped memories are then distilled into a reconstruction score network, which is subsequently fine-tuned on clean data filtered using scores from the distilled model, enabling fine-grained localization of anomalies. MeDS is robust across a wide range of noise ratios without requiring noise-ratio-specific hyperparameter tuning, achieving 99.16% image-level AUROC on MVTecAD at a 40% noise ratio, and attaining state-of-the-art performance on both VisA and Real-IAD under noisy settings. We thoroughly verify the efficacy of MeDS on industrial AD benchmarks under noisy data scenarios, accompanied by in-depth empirical analyses.
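
The first stage, scoring samples by their aggregated distance to an ensemble of randomly subsampled memories, can be sketched as follows. This is a toy version under stated assumptions: features are plain coordinate lists, the distance is Euclidean nearest-neighbor, a sample is excluded from its own memory, and the aggregate is a mean. The distillation and fine-tuning stages are not shown, and the function name and parameters are hypothetical.

```python
import math
import random


def partial_memory_scores(features, num_memories=20, subsample=0.3, seed=0):
    """Mean nearest-neighbor distance of each sample to random partial memories."""
    rng = random.Random(seed)
    n = len(features)
    size = max(1, int(subsample * n))
    totals = [0.0] * n
    counts = [0] * n
    for _ in range(num_memories):
        # Each memory holds only a sparse random subset of the training set.
        memory = rng.sample(range(n), size)
        for i, feat in enumerate(features):
            others = [j for j in memory if j != i]
            if not others:
                continue
            totals[i] += min(math.dist(feat, features[j]) for j in others)
            counts[i] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]
```

Samples with the highest aggregated score are the candidates to filter out before fine-tuning; rare contaminated samples seldom have a close neighbor in a sparse memory, while nominal samples usually do.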

2605.26670 2026-05-27 cs.CL cs.AI

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

迷宫与线索:重新思考大语言模型顺序知识编辑中的正则化方法

Zheng Wang, Kaixuan Zhang, Wanfang Chen, Jingwen Zhang, Xiaonan Lu

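Affiliations * Bosch Center for Artificial Intelligence (BCAI); Bosch (China) Investment Ltd.; School of Statistics, East China Normal University

AI summary Uses an optimization analysis to prove the equivalence of sequential and one-time editing, showing that stability comes from accumulated editing constraints rather than specialized regularization, which simplifies knowledge editing for large language models.

Comments Accepted for publication at ICML 2026

Details
AI Chinese abstract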

大语言模型中结构化知识的顺序编辑允许在不重新训练的情况下进行有针对性的事实更新,但现有方法通常依赖于复杂的正则化或约束机制,其必要性尚不明确。在这项工作中,我们系统地研究了有效且稳定的顺序编辑背后的机制。具体来说,我们首先分析了AlphaEdit的经验成功,并通过严格的优化分析建立了一次性编辑与顺序编辑之间的形式等价性。基于这一见解,我们将等价性推广到更广泛的编辑目标类别,证明稳定性自然源于正确处理累积的编辑约束,而非专门的正则化或零空间操作。我们通过实验证实,许多常用的正则化策略对于可靠的顺序更新并非必要。此外,我们将我们的框架扩展到处理冲突编辑,确保在矛盾更新下具有鲁棒且一致的行为。最终,我们的工作为顺序编辑的迷宫提供了阿里阿德涅的线索,为更简单、更可解释且可靠的知识更新指明了道路。我们的代码可在https://github.com/Wangzzzzzzzz/OTE-SE-Alignment获取。

English abstract

Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one-time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null-space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne's thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at https://github.com/Wangzzzzzzzz/OTE-SE-Alignment.

2605.26667 2026-05-27 cs.AI cs.LG

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

MemFail: LLM记忆系统的故障模式压力测试

Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao

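Affiliations * University of California, Berkeley

AI summary Proposes the MemFail benchmark, which formalizes memory systems as three operations (summarization, storage, and retrieval) and builds adversarial datasets to systematically evaluate and diagnose the failure modes of LLM memory systems.

Details
AI Chinese abstract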

大型语言模型(LLM)代理越来越依赖外部记忆系统以在长程交互中保持一致性,但关于这些系统具体故障模式和设计选择的实证研究很少。现有基准报告聚合的问答准确率,将记忆系统视为黑箱,无法将错误答案归因于系统的特定故障模式。我们引入MemFail,一个诊断性基准,用于隔离现代LLM记忆系统的故障模式。我们首先将记忆系统形式化为三个规范操作的组合——摘要、存储和检索——并识别每个操作可能引发的故障模式。基于这些假设的故障模式,我们构建了跨越四个任务的五个数据集,每个数据集都经过对抗性设计以测试记忆系统的特定操作。使用这些数据集,我们在MemFail上评估了四种最先进的记忆系统,并展示了MemFail如何用于实证理解记忆系统架构差异带来的权衡。

English abstract

Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.

2605.26663 2026-05-27 cs.CL cs.IR cs.SE

Evidence Absence Is Not Evidence Insufficiency: Diagnosing NEI Construction Artifacts in Fact Verification

证据缺失并非证据不足:事实核查中NEI构建伪影的诊断

Jingxi Qiu, Zeyu Han, Cheng Huang

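Affiliations * ZenWeave AI; Georgetown University

AI summary Targets artifacts that arise when the Not Enough Information (NEI) label in fact verification is constructed from different evidence conditions; proposes the NEI-CAP diagnostic protocol, which audits shortcut cues, validates hard cases through human adjudication, and tests cross-construction transfer, revealing that NEI competence transfers unreliably and that aggregate scores can hide which problem a model has solved.

Comments Preprint. Under review. 20 pages, 2 figures

Details
AI Chinese abstract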

证据缺失并非证据不足,但事实核查基准可能使它们在观测上相似。“信息不足”(NEI)标签通常通过不同的证据条件来操作化,而这种选择悄然决定了验证器学习的内容以及其分数可能隐藏的信息。我们引入了NEI-CAP,一种针对不足证据评估的构建感知诊断协议。每个NEI示例都带有产生它的构建家族;NEI-CAP审计捷径线索,通过人工裁决验证困难案例,并测试能力是否跨构建迁移。我们在SciFact风格的科学验证中实例化该协议,以FEVER和HoVer作为有界外部控制。在这些设置中,NEI能力不能可靠迁移:在捷径倾向构建上训练的模型无法识别语义相关的不充分证据,而混合构建训练缩小了差距但并未消除。固定主张的诊断进一步表明,证据条件改变了参考支持/反驳标签的置信度,而不仅仅是NEI召回率,因此聚合的NEI分数可能隐藏模型实际解决了哪个问题。

English abstract

Evidence absence is not evidence insufficiency, but fact verification benchmarks can make them observationally similar. The Not Enough Information (NEI) label is often operationalized through different evidence conditions, and that choice silently determines what a verifier learns and what its score can hide. We introduce NEI-CAP, a construction-aware diagnostic protocol for insufficient-evidence evaluation. Each NEI example carries the construction family that produced it; NEI-CAP audits shortcut cues, validates hard cases through human adjudication, and tests whether competence transfers across constructions. We instantiate the protocol in SciFact-style scientific verification, with FEVER and HoVer as bounded external controls. Across these settings, NEI competence does not transfer reliably: models trained on shortcut-prone constructions fail to recognize semantically related insufficient evidence, and mixed-construction training narrows but does not close the gap. Fixed-claim diagnostics further show that the evidence condition shifts confidence in the reference Support/Refute label, not only NEI recall, so an aggregate NEI score can hide which problem a model has actually solved.

2605.26662 2026-05-27 cs.CL cs.AI econ.GN q-fin.EC

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

AI评估可能扭曲认知:语境在解读学术写作中的重要性

Shang Wu, Randol Yao

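Affiliations * UC Irvine; MIT

AI summary Builds AI-likeness benchmarks and finds that evaluation methods ignoring country and field differences systematically overestimate or underestimate AI use in some groups; proposes context-specific benchmarks for a more accurate assessment of AI use in scientific writing.

Details
AI Chinese abstract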

本文研究了当评估方法忽略国家和领域的语境差异时,科学写作中AI使用估计可能产生的偏差。利用Dimensions中期刊论文的大规模数据,我们基于人类撰写和LLM重写的摘要之间的差异构建了AI相似度基准。我们表明,合并基准可能混淆已有的风格差异与AI生成的文本,即使在LLM之前的出版物中也会在跨国家-领域组中产生显著扭曲。相比之下,特定国家-领域的基准减轻了这种扭曲,并提供了更可信的比较基线。将这些方法应用于2025年的出版物,结果显示合并基准系统性高估了某些国家和领域的AI使用,同时低估了其他国家和领域的AI使用。这些发现强调了语境感知测量对于准确和公平评估科学中AI使用的重要性。

English abstract

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.

2605.26661 2026-05-27 cs.CV cs.AI

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

在预训练视觉语言模型的后验分布外检测中尊重模态差距

Yuanwei Hu, Bo Peng, Yadan Luo, Zhen Fang, Ling Chen, Jie Lu

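Affiliations * The University of Queensland; University of Technology Sydney

AI summary Addresses the modality gap between textual and visual prototypes in post-hoc OOD detection with pre-trained vision-language models; proposes an online pseudo-supervised framework that learns class prototypes directly in the visual feature space and reports new state-of-the-art performance.

Details
AI Chinese abstract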

分布外(OOD)检测已成为一种流行的技术,通过识别来自未知类别的意外输入来增强机器学习模型的可靠性。预训练视觉语言模型(VLM)的最新进展使得无需访问分布内(ID)训练数据即可进行零样本OOD检测;在这种设置下,现有方法通常将类名的文本嵌入视为类原型。在本文中,我们通过理论证明现成的文本原型通常与最优视觉原型不对齐,从而产生无法通过提示工程单独消除的内在模态差距,来挑战广泛采用的文本即原型范式。为了在后验约束下缓解这一差距,本文提出了一种在线伪监督框架,该框架使用未标记的测试时数据流和预训练VLM的软预测,直接在视觉特征空间中学习类原型。我们为在线优化过程的收敛性提供了理论保证。大量实验经验证明,我们的方法在各种OOD检测设置中达到了新的最优水平。

English abstract

Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision-language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge the widely adopted text-as-prototype paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically demonstrate that our method achieves a new state of the art across a variety of OOD detection setups.

2605.26657 2026-05-27 cs.AI

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

完成度与最优性:长时域累积损伤问题中的策略梯度

Wolfgang Maass, Sabine Janzen

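Affiliations * Saarland University; German Research Center for Artificial Intelligence (DFKI)

AI summary For long-horizon cumulative-damage problems, identifies two orthogonal failure modes of policy-gradient methods (completion and optimality), proposes a decomposition that separates them, and evaluates four testable predictions in two calibrated environments.

Details
AI Chinese abstract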

具有累积损伤的长时域决策问题将局部吸引动作与全局不利结果耦合。我们识别了此类问题上策略梯度方法的两种正交失败模式,并提出一种将其分离的分解:完成度(达到终端时域而非通过隐式终端约束退出)和最优性(在给定完成度的情况下匹配动态规划参考)。在带有线性软惩罚的PPO下,仅授予时域访问会降低完成率:惩罚的均衡将主导活动份额推向零,而动作空间限制结合时域访问实现了完成度,但留下了最优性差距($ΔM_{\text{final}} = 0.271$),我们将其追溯到损伤起源处的第一阶段贪婪承诺。我们推导了四个可检验预测,并在两个独立校准环境中进行评估,这两个环境共享相同的抽象结构,但在领域、时域、活动集和校准数据上不同:一个49步的砖瓦工职业生涯和一个20赛季的NBA大前锋职业生涯。所有四个预测均定性复现。时域不变性预测在四个测试时域中的三个得到满足,例外出现在$H = 15$,与$H^*$边界一致(在NBA参数下$H^* \in [6, 14]$)。

English abstract

Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: completion (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and optimality (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate: the penalty's equilibrium drives the dominant-activity share to zero, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($ΔM_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).
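
The two-part decomposition can be made concrete with a small evaluation helper. It is a sketch under assumptions: an episode counts as completed when it reaches the horizon, the optimality gap is the dynamic-programming reference value minus the mean final value over completed episodes only, and the dict keys and the sign convention are illustrative rather than the paper's definitions.

```python
def completion_and_gap(episodes, horizon, reference_value):
    """Return (completion_rate, optimality_gap) for a batch of episodes.

    Each episode is a dict with hypothetical keys 'steps' and 'final_value'.
    The gap is None when no episode completes, because optimality is only
    defined conditional on completion.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    completed = [e for e in episodes if e["steps"] >= horizon]
    completion_rate = len(completed) / len(episodes)
    if not completed:
        return completion_rate, None
    mean_value = sum(e["final_value"] for e in completed) / len(completed)
    return completion_rate, reference_value - mean_value
```

Reporting the two numbers separately keeps a policy that exits early from being confused with one that finishes but finishes suboptimally.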

2605.26656 2026-05-27 cs.CV

DV-SFT: Direct Vision Supervision for Fine-Grained Visual Understanding

DV-SFT: 直接视觉监督用于细粒度视觉理解

Jianfei Zhao, Feng Zhang, Xin Sun, Chong Feng, Bing Wang, Zhixing Tan

Affiliations * School of Computer Science and Technology, Beijing Institute of Technology; Zhongguancun Academy; Beihang University; Zhongguancun Laboratory; Southeast Academy of Information Technology, Beijing Institute of Technology

AI summary Proposes DV-SFT, which constructs explicit token-level supervision for visual tokens using the direct vision-text correspondence in OCR scenarios, improving the fine-grained visual understanding of multimodal large language models without modifying the architecture or adding forward passes.

Comments Under Review

Details
AI Chinese abstract

多模态大语言模型通常以端到端方式训练以预测真实答案,但监督信号仅应用于文本令牌。视觉令牌作为视觉信息的核心载体,仅作为上下文的一部分被隐式优化,导致粗粒度的视觉理解。先前的工作尝试监督视觉输入,但不可避免地依赖辅助组件(如额外的解码器或前向传播),因为视觉令牌缺乏可直接解释的标签。这限制了它们的实际应用。在这项工作中,我们提出了直接视觉监督微调(DV-SFT),该方法为视觉令牌构建显式的令牌级监督,并通过与文本相同的下一个令牌预测目标来训练它们。具体来说,我们利用OCR相关场景中的直接视觉-文本对应关系,自动为每个视觉令牌标注其对应图像块中的单词。DV-SFT将MLLM视为黑盒,无需修改架构或额外的前向传播。大量实验证明了直接视觉监督的优越性。DV-SFT在三个域内和四个域外基准测试中始终优于标准SFT。进一步分析表明,视觉监督有效增强了细粒度视觉理解,并实现了更高的多模态对齐效率。

English abstract

Multimodal large language models are typically trained end-to-end to predict ground-truth answers, yet supervision signals are applied exclusively to text tokens. Visual tokens, the core carriers of visual information, are optimized only implicitly as part of the context, leading to coarse-grained visual understanding. Prior works attempt to supervise visual inputs but inevitably rely on auxiliary components such as additional decoders or forward passes, because visual tokens lack readily interpretable labels. This limits their practical applicability. In this work, we propose Direct Vision Supervised Fine-Tuning (DV-SFT), which constructs explicit, token-level supervision for visual tokens and trains them through the same next-token prediction objective used for text. Specifically, we exploit the direct vision-text correspondence in OCR-related scenarios and automatically label each visual token with the word in its corresponding image patch. DV-SFT treats the MLLM as a black box, requiring no architectural modifications or additional forward passes. Extensive experiments demonstrate the superiority of direct vision supervision. DV-SFT consistently outperforms standard SFT across three in-domain and four out-of-domain benchmarks. Further analyses show that vision supervision effectively enhances fine-grained visual understanding and achieves higher multimodal alignment efficiency.

2605.26655 2026-05-27 cs.CL cs.LG cs.NE

Why Prompt Optimization Works, and Why It Sometimes Doesn't: A Causal-Inspired Edit-Level Analysis

为什么提示优化有效,以及为什么有时无效:一种因果启发的编辑级分析

Shuzhi Gong, Hechuan Wen

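Affiliations * The University of Melbourne; The University of Queensland

AI summary Uses causal-inference-inspired methods to analyze why automated prompt optimization fails to generalize across tasks and models, finding systematic interactions between edit types and task characteristics.

Comments 17 pages, 4 figures, 8 tables

Details
AI Chinese abstract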

自动提示优化方法(例如 DSpy、TextGrad)可以显著提升大语言模型(LLM)的性能,然而,它们在不同任务上的泛化能力仍然不足。在实践中,优化后的提示在一个基准上的优势往往无法迁移到另一个基准,即使切换不同的 LLM 骨干网络,这种局限性依然存在。为了探究提示性能中未被充分探索的异质性来源,我们对跨多种优化框架、LLM 骨干网络和 NLP 基准的优化提示进行了因果推断启发的观察性分析。为此,我们基于倾向调整的关联分析以及提示编辑的多种互补表示,识别出一致的任务条件编辑模式。我们发现,增加复杂性和元指令的编辑与数学和多跳推理性能呈负相关,而逐步和元认知的编辑则改善了逻辑和顺序推理任务。这些效应在认知负荷标注、表面文本特征和编辑主题分析中均具有鲁棒性,并且可以跨优化框架泛化。总体而言,这些结果表明,提示优化失败源于编辑族与任务特性之间的系统性交互,而非随机的优化伪影,从而提供了优化器行为的特征级表征,并激励了未来任务条件优化器的设计。

English abstract

Automated prompt optimization methods (e.g., DSPy, TextGrad) can substantially improve the performance of large language models (LLMs); however, they generalize poorly across tasks. In practice, the advantage of an optimized prompt on one benchmark often fails to transfer to another, and this limitation persists even when switching across different LLM backbones. To investigate the underexplored sources of heterogeneity in prompt performance, we conduct a causal-inference-inspired observational analysis of optimized prompts across a diverse set of optimization frameworks, LLM backbones, and NLP benchmarks. To this end, we combine propensity-adjusted associational analysis with multiple complementary representations of prompt edits and identify consistent task-conditioned edit patterns. We find that complexity-increasing and meta-instructional edits are negatively associated with mathematical and multi-hop reasoning performance, whereas step-by-step and meta-cognitive edits improve logical and sequential reasoning tasks. These effects are robust across cognitive-load annotations, surface-level text features, and edit-motif analyses, and generalize across optimization frameworks. Overall, these results indicate that prompt optimization failures arise from systematic interactions between edit families and task characteristics rather than random optimization artifacts, providing a feature-level characterization of optimizer behavior and motivating future task-conditioned optimizer design.

2605.26654 2026-05-27 cs.LG cs.AI math.OC stat.ML

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

零和马尔可夫博弈鞍点上的双层优化

Zihao Zheng, Irwin King, Songtao Lu

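Affiliations * Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong; Department of Computer Science and Engineering, The Chinese University of Hong Kong

AI summary For bilevel optimization problems whose lower level is a zero-sum Markov game, proposes the penalty-based Nikaido-Isoda descent-ascent method (PANDA), which avoids computing hypergradients and needs no second-order information, converges to stationary points without convexity assumptions, and matches the best-known rates for bilevel RL with single-policy lower-level MDPs.

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

Details
AI Chinese abstract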

强化学习(RL)通常具有层次结构,其中上层(UL)学习器选择模型参数,下层(LL)决策过程做出响应,自然形成双层优化问题。大多数现有的双层RL方法假设下层为单策略马尔可夫决策过程(MDP),因此无法捕捉激励设计等应用中出现的竞争结构,其中多个策略相互交互。我们研究了下层问题为正则化极小极大零和马尔可夫博弈、上层目标通过下层博弈诱导的鞍点均衡进行优化的双层优化问题。在这项工作中,我们提出了惩罚增强的Nikaido-Isoda下降-上升(PANDA),一种基于Nikaido-Isoda函数的惩罚一阶策略梯度方法。通过利用极小极大博弈结构,PANDA避免了计算上层超梯度,且不需要二阶信息。我们证明了PANDA在无需对上层或下层目标做凸性假设的情况下收敛到平稳点。此外,PANDA在$\tilde{\mathcal{O}}(ε^{-1})$次迭代内达到$ε$-平稳点,样本复杂度为$\tilde{\mathcal{O}}(ε^{-3})$,与单策略下层MDP的双层RL的最佳已知速率相匹配。实验表明PANDA优于密切相关基线方法。

English abstract

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
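
For readers unfamiliar with the Nikaido-Isoda function, its standard two-player zero-sum form is written out below. The notation is assumed for illustration: $f_\theta(\pi, \nu)$ is the lower-level payoff under upper-level parameters $\theta$, with $\pi$ the minimizing policy and $\nu$ the maximizing policy. The penalized objective in the second line is a generic penalty reformulation given as an assumption; the abstract does not state the paper's exact regularized objective or penalty schedule.

```latex
% Nikaido-Isoda (duality-gap) function of the lower-level zero-sum game
\Psi_\theta(\pi, \nu)
  = \max_{\nu'} f_\theta(\pi, \nu') - \min_{\pi'} f_\theta(\pi', \nu) \;\ge\; 0,
\qquad
\Psi_\theta(\pi, \nu) = 0 \iff (\pi, \nu) \text{ is a saddle point of } f_\theta .

% Assumed penalty form: replace the saddle-point constraint by a penalty
\min_{\theta, \pi, \nu} \; F(\theta, \pi, \nu) + \lambda \, \Psi_\theta(\pi, \nu),
\qquad \lambda > 0 .
```

Because $\Psi_\theta$ vanishes exactly at saddle points, penalizing it enforces the lower-level equilibrium with first-order information only, which matches the abstract's statement that no hypergradients or second-order terms are needed.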

2605.26649 2026-05-27 cs.RO

On the Generalization Capabilities, Design Choices and Limitations of Keypoint Imitation Learning

关键点模仿学习的泛化能力、设计选择与局限性

Thomas Lips, Marco Moletta, Michael C. Welle, Danica Kragic, Francis wyffels

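Affiliations * AI and Robotics Lab, IDLAB-AIRO, Ghent University-imec; Robotics, Perception and Learning Lab (RPL), EECS, KTH Royal Institute of Technology; INCAR Robotics AB, Stockholm, Sweden

AI summary Through more than 2000 real-world rollouts, systematically evaluates the generalization capabilities, design choices, and limitations of keypoint imitation learning built on visual foundation models for robotic manipulation; it reaches a 75% overall success rate, significantly better than the RGB baseline (47%) and on par with S2-diffusion (73%), but does not outperform alternative representations and inherits the limitations of the foundation models.

Comments This version was submitted to IROS 2026

Details
AI Chinese abstract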

基于RGB的模仿学习需要大量演示才能泛化到未见过的物体或场景,这促使研究人员探索中间表示以提高机器人操作的泛化能力。视觉基础模型能够通过一次提取关键点来提供这种表示。然而,如何最优地将它们整合到模仿学习中,以及它们何时优于其他表示,仍不清楚。我们结合了以往关键点模仿学习(KIL)研究中的方法,并研究了若干设计选择以提供实用指南。通过2000多次真实世界实验,我们还评估了KIL对未见物体和场景变化的泛化能力。KIL在五个任务上实现了75%的总体成功率,显著优于RGB基线(47%),并与S2扩散模型(73%)相当。最后,我们探讨了用于关键点提取的基础模型的局限性,并将KIL扩展到具有多个物体实例的任务。我们的结果证实KIL是一种数据高效的机器人学习方法,尽管它并未超越其他表示方法,并且继承了用于关键点提取的基础模型的局限性。所有实验视频、演示和结果均可在https://kil-manipulation.github.io/获取。

English abstract

RGB-based imitation learning requires many demonstrations to generalize to unseen objects or scenes, motivating research into intermediate representations to improve generalization for robotic manipulation. Visual foundation models enable one-shot extraction of keypoints to provide such representation. However, it remains unclear how to integrate them into imitation learning optimally and when they outperform alternative representations. We combine approaches from previous works on keypoint imitation learning (KIL) and investigate several design choices to provide practical guidelines. Using over 2000 real-world rollouts, we also assess the generalization capabilities of KIL to unseen objects and scene variations. KIL achieves a 75% overall success rate across five tasks, significantly outperforming the RGB baseline (47%) and performing on par with S2-diffusion (73%). Finally, we explore the limitations of the foundation models used for keypoint extraction and extend KIL to tasks with multiple object instances. Our results confirm KIL as a data-efficient approach for robot learning, though it does not outperform alternative representations and inherits limitations of the foundation models used for keypoint extraction. All rollout videos, demonstrations, and results are available at https://kil-manipulation.github.io/.