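arXivDaily daily academic digest: synchronized with the full arXiv dataset, with AI-generated summaries and translations, covering artificial intelligence, robotics, computer science, finance, statistics, mathematics, physics, biology, economics, electrical engineering and systems, and other fields.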

2604.12376 2026-05-26 cs.CL cs.AI

Cooperative Memory Paging with Keyword Bookmarks for Long-Horizon LLM Conversations

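Ziyang Liu

AI summary: Proposes cooperative paging, which replaces evicted conversation segments with keyword bookmarks and gives the model a recall() tool for on-demand retrieval. On the LoCoMo benchmark it achieves the best answer quality on all four models, and ablation experiments identify the key factors in paging design.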

Comments The authors have decided to withdraw this version following internal review regarding authorship and contribution agreements

Abstract

When LLM conversations grow beyond the context window, old content must be evicted -- but how does the model recover it when needed? We propose cooperative paging: evicted segments are replaced with minimal keyword bookmarks ([pN:keywords], ~8-24 tokens each), and the model is given a recall() tool to retrieve full content on demand. On the LoCoMo benchmark (10 real multi-session conversations, 300+ turns), cooperative paging achieves the highest answer quality among six methods -- outperforming truncation, BM25, word-overlap retrieval, a search-tool baseline, and full context -- on four models (GPT-4o-mini, DeepSeek-v3.2, Claude Haiku, GLM-5), confirmed by four independent LLM judges ($p=0.017$, paired bootstrap). We then study the paging design space with a 5x4 ablation over boundary strategies and eviction policies (3,176 synthetic probes, 1,600 LoCoMo probes). Key findings: (1) coarse fixed-size pages (fixed_20) reach 96.7% while content-aware topic_shift collapses to 56.7%; (2) eviction policy choice is data-dependent (FIFO best on synthetic, LFU on LoCoMo); (3) two bookmark generation strategies improve over the heuristic baseline (+4.4 and +8.7 E2E points); (4) the remaining bottleneck is bookmark discrimination -- the model triggers recall() 96% of the time but selects the correct page only 57% when bookmarks are insufficiently distinctive. Keyword specificity alone accounts for a 25 percentage point accuracy difference.
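The bookmark-and-recall mechanism described in this abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `PagedContext`, the `keywords` heuristic (longest distinct words), the page size, and the FIFO eviction order are all hypothetical choices made for brevity. Only the `[pN:keywords]` bookmark format and the `recall()` tool name come from the abstract.

```python
from dataclasses import dataclass, field


def keywords(turns, k=3):
    """Crude keyword heuristic (an assumption): the k longest distinct words."""
    words = {w.strip(".,!?").lower() for t in turns for w in t.split()}
    return sorted(words, key=lambda w: (-len(w), w))[:k]


@dataclass
class PagedContext:
    """Toy conversation buffer that evicts old turns into keyword bookmarks."""
    page_size: int = 4        # turns per evicted page (hypothetical value)
    max_live_turns: int = 8   # turns kept verbatim in the context window
    live: list = field(default_factory=list)       # turns still in context
    pages: dict = field(default_factory=dict)      # page id -> evicted turns
    bookmarks: list = field(default_factory=list)  # "[pN:keywords]" strings

    def add_turn(self, text):
        self.live.append(text)
        while len(self.live) > self.max_live_turns:
            self._evict_oldest_page()

    def _evict_oldest_page(self):
        # FIFO eviction: the oldest page_size turns leave the window.
        page = self.live[:self.page_size]
        self.live = self.live[self.page_size:]
        page_id = len(self.pages) + 1
        self.pages[page_id] = page
        self.bookmarks.append(f"[p{page_id}:{','.join(keywords(page))}]")

    def recall(self, page_id):
        """Tool the model would call to get an evicted page back in full."""
        return self.pages[page_id]

    def render(self):
        """Context shown to the model: bookmarks first, then live turns."""
        return "\n".join(self.bookmarks + self.live)
```

In this sketch the model sees only the short bookmark for each evicted page and has to pick the right page id from its keywords. That selection step is the one the abstract identifies as the remaining bottleneck.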

2604.12116 2026-05-26 cs.AI cs.SE

The A-R Behavioral Space: Execution-Level Profiling of Tool-Using Language Model Agents in Organizational Deployment

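Shasha Yu, Fiona Carroll, Barry L. Bentley

AI summary: Proposes a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), together with Divergence (D), to measure execution-layer behavior, and evaluates how execution and refusal are distributed for language model agents across normative regimes and autonomy configurations.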

Abstract

Large language models (LLMs) are increasingly deployed as tool-augmented agents capable of executing system-level operations. While existing benchmarks primarily assess textual alignment or task success, less attention has been paid to the structural relationship between linguistic signaling and executable behavior under varying autonomy scaffolds. This study introduces an execution-layer behavioral measurement approach based on a two-dimensional A-R space defined by Action Rate (A) and Refusal Signal (R), with Divergence (D) capturing coordination between the two. Models are evaluated across four normative regimes (Control, Gray, Dilemma, and Malicious) and three autonomy configurations (direct execution, planning, and reflection). Rather than assigning aggregate safety scores, the method characterizes how execution and refusal redistribute across contextual framing and scaffold depth. Empirical results show that execution and refusal constitute separable behavioral dimensions whose joint distribution varies systematically across regimes and autonomy levels. Reflection-based scaffolding often shifts configurations toward higher refusal in risk-laden contexts, but redistribution patterns differ structurally across models. The A-R representation makes cross-sectional behavioral profiles, scaffold-induced transitions, and coordination variability directly observable. By foregrounding execution-layer characterization over scalar ranking, this work provides a deployment-oriented lens for analyzing and selecting tool-enabled LLM agents in organizational settings where execution privileges and risk tolerance vary.

2604.08988 2026-05-26 cs.AI

SEA-Eval: A Benchmark for Evaluating Self-Evolving Agents Beyond Episodic Assessment

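Sihang Jiang, Lipeng Ma, Zhonghua Hong, Keyi Wang, Zhiyu Lu, Tengfei Wang, Shisong Chen, Jinghao Zhang, Tianjun Pan, Weijia Li, Jiaqing Liang, Yanghua Xiao

AI summary: Gives a formal definition of the Self-Evolving Agent (SEA) and its minimal sufficient architecture, the Evolutionary Flywheel, and builds SEA-Eval, the first benchmark designed specifically for evaluating SEAs, which uses sequential task streams to quantify evolutionary gain, stability, and implicit alignment convergence.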

Abstract

Current LLM-based agents demonstrate strong performance in episodic task execution but remain constrained by static toolsets and episodic amnesia, failing to accumulate experience across task boundaries. This paper formalizes the Self-Evolving Agent (SEA) from the perspective of digital embodiment and continuous cross-task evolution, introduces the Evolutionary Flywheel as its minimal sufficient architecture, and presents SEA-Eval -- the first benchmark designed specifically for evaluating SEAs. Grounded in Flywheel theory, SEA-Eval establishes SR and T as primary metrics and, through sequential task stream design, is designed to quantify evolutionary gain, evolutionary stability, and implicit alignment convergence. Empirical evaluation reveals that, under comparable success rates, token consumption differs by up to 31.2 times between frameworks on individual tasks, with divergent evolutionary trajectories emerging under sequential analysis -- demonstrating that success rate alone creates a capability illusion and that the sequential convergence of $T$ is the key criterion for distinguishing genuine evolution from pseudo-evolution.

2603.28128 2026-05-26 cs.LG cs.CR

ORACAL: A Robust and Explainable Multimodal Framework for Smart Contract Vulnerability Detection with Causal Graph Enrichment

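Tran Duong Minh Dai, Triet Huynh Minh Le, M. Ali Babar, Van-Hau Pham, Phan The Duy

AI summary: Proposes ORACAL, a heterogeneous multimodal graph learning framework that integrates control flow, data flow, and call graphs, enriches critical subgraphs using RAG and LLMs, and uses a causal attention mechanism and PGExplainer for robust, explainable smart contract vulnerability detection.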

Comments 21 pages, version 2

Abstract

Although Graph Neural Networks (GNNs) have shown promise for smart contract vulnerability detection, they still face significant limitations. Homogeneous graph models fail to capture the interplay between control flow and data dependencies, while heterogeneous graph approaches often lack deep semantic understanding, leaving them susceptible to adversarial attacks. Moreover, most black-box models fail to provide explainable evidence, hindering trust in professional audits. To address these challenges, we propose ORACAL (Observable RAG-enhanced Analysis with CausAL reasoning), a heterogeneous multimodal graph learning framework that integrates Control Flow Graph (CFG), Data Flow Graph (DFG), and Call Graph (CG). ORACAL selectively enriches critical subgraphs with expert-level security context from Retrieval-Augmented Generation (RAG) and Large Language Models (LLMs), and employs a causal attention mechanism to disentangle true vulnerability indicators from spurious correlations. For transparency, the framework adopts PGExplainer to generate subgraph-level explanations identifying vulnerability triggering paths. Experiments on large-scale datasets demonstrate that ORACAL achieves state-of-the-art performance, outperforming MANDO-HGT, MTVHunter, GNN-SC, and SCVHunter by up to 39.6 percentage points, with a peak Macro F1 of 91.28% on the primary benchmark. ORACAL maintains strong generalization on out-of-distribution datasets with 91.8% on CGT Weakness and 77.1% on DAppScan. In explainability evaluation, PGExplainer achieves 32.51% Mean Intersection over Union (MIoU) against manually annotated vulnerability triggering paths. Under adversarial attacks, ORACAL limits performance degradation to approximately 2.35% F1 decrease with an Attack Success Rate (ASR) of only 3%, surpassing SCVHunter and MANDO-HGT which exhibit ASRs ranging from 10.91% to 18.73%.

2603.18766 2026-05-26 cs.LG

Enhancing the Parameterization of Reservoir Properties for Data Assimilation Using Deep VAE-GAN

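M. A. Sampaio, P. H. Ranazzi, M. J. Blunt

AI summary: Proposes combining a VAE-GAN with ESMDA to obtain both high-quality reservoir descriptions and a good history match, overcoming the limitations of conventional methods with non-Gaussian distributions and finite ensemble sizes.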

DOI: 10.1016/j.cageo.2026.106196

Abstract

Currently, Iterative Ensemble Smoothers, especially the Ensemble Smoother with Multiple Data Assimilation (ESMDA), can be considered state-of-the-art for history matching in petroleum reservoir simulation. However, this approach has two important limitations: the use of an ensemble with finite size to represent the distributions and the Gaussian assumption in parameter and data uncertainties. The latter is particularly important because many reservoir properties have non-Gaussian distributions. Parameterization involves mapping non-Gaussian parameters to a Gaussian field before the update and then mapping them back to the original domain to forward the ensemble through the reservoir simulator. A promising approach to perform parameterization is through deep learning models. Recent studies have shown that Generative Adversarial Networks (GANs) performed poorly in data assimilation but generated more geologically plausible realizations of the reservoir, while the Variational Autoencoder (VAE) performed better than the GAN in data assimilation but generated less geologically realistic models. This work is innovative in combining the strengths of both to implement a deep learning model called Variational Autoencoder Generative Adversarial Network (VAE-GAN) integrated with ESMDA. The methodology was applied in two case studies, one with categorical values and the other with continuous values of permeability. Our findings demonstrate that by applying the VAE-GAN model we can simultaneously obtain high-quality reservoir descriptions (as with GANs) and a good history match on the production curves (as with VAEs).

2603.16481 2026-05-26 cs.LG cs.SY eess.SY math.OC

Optimal uncertainty bounds for multivariate kernel regression under bounded noise: A Gaussian process-based dual function

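Amon Lahr, Anna Scampicchio, Johannes Köhler, Melanie N. Zeilinger

AI summary: For multi-output functions in a reproducing kernel Hilbert space under bounded noise, proposes a tight, deterministic uncertainty bound obtained through an unconstrained dual formulation. The bound has the same structure as classic Gaussian process confidence bounds, which makes it easy to integrate into downstream optimization.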

Comments Extended version

Abstract

Non-conservative uncertainty bounds are essential for making reliable predictions about latent functions from noisy data, and thus, a key enabler for safe learning-based control. In this domain, kernel methods such as Gaussian process regression are established techniques, thanks to their inherent uncertainty quantification mechanism. Still, existing bounds either pose strong assumptions on the underlying noise distribution, are conservative, do not directly apply in the multi-output case, or are difficult to integrate into downstream tasks. This paper addresses these limitations by presenting a tight, deterministic bound for multi-output functions in Reproducing Kernel Hilbert Spaces (RKHSs) subject to bounded noise. It is obtained through an unconstrained, duality-based formulation, which shares the same structure as classic Gaussian process confidence bounds, and can thus be straightforwardly integrated into downstream optimization pipelines. We show that the proposed bound generalizes existing results and illustrate its application using an example inspired by quadrotor dynamics learning.

2603.16100 2026-05-26 cs.CV

Reevaluating the Intra-Modal Misalignment Hypothesis in CLIP

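Jonas Herzog, Yue Wang

AI summary: Questions the intra-modal misalignment hypothesis for CLIP, showing through theoretical analysis and experiments that the supposed degrees of freedom in image embedding distances do not exist and that performance differences on intra-modal tasks stem mainly from task ambiguity rather than misalignment.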

Comments Accepted for CVPR'26. Project Page: https://vision-kek.github.io/Is-CLIP-Really-Misaligned/

Abstract

Recent research suggested that the embeddings produced by CLIP-like contrastive language-image training are suboptimal for image-only tasks. The main theory is that the inter-modal (language-image) alignment loss ignores intra-modal (image-image) alignment, leading to poorly calibrated distances between images. In this study, we question this intra-modal misalignment hypothesis. We reexamine its foundational theoretical argument, the indicators used to support it, and the performance metrics affected. For the theoretical argument, we demonstrate that there are no such supposed degrees of freedom for image embedding distances. For the empirical measures, our findings reveal they yield similar results for language-image trained models (CLIP, SigLIP) and image-image trained models (DINO, SigLIP2). This indicates the observed phenomena do not stem from a misalignment specific to the former. Experiments on the commonly studied intra-modal tasks retrieval and few-shot classification confirm that addressing task ambiguity, not supposed misalignment, is key for best results.

2603.11583 2026-05-26 cs.CL cs.AI

UtilityMax Prompting: A Formal Framework for Multi-Objective Large Language Model Tasks

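Ofir Marom

AI summary: Proposes the UtilityMax Prompting framework, which formalizes multi-objective LLM tasks using influence diagrams and expected utility maximization, and improves precision and NDCG over natural language baselines on the MovieLens 1M dataset.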

Abstract

The success of a Large Language Model (LLM) task depends heavily on its prompt. Most use-cases specify prompts using natural language, which is inherently ambiguous when multiple objectives must be simultaneously satisfied. In this paper we introduce UtilityMax Prompting, a framework that specifies tasks using formal mathematical language. We reconstruct the task as an influence diagram in which the LLM's answer is the sole decision variable. A utility function is defined over the conditional probability distributions within the diagram, and the LLM is instructed to find the answer that maximises expected utility. This constrains the LLM to reason explicitly about each component of the objective, directing its output toward a precise optimization target rather than a subjective natural language interpretation. We validate our approach on the MovieLens 1M dataset across three frontier models (Claude Sonnet 4.6, GPT-5.4, and Gemini 2.5 Pro), demonstrating consistent improvements in precision and Normalized Discounted Cumulative Gain (NDCG) over natural language baselines in a multi-objective movie recommendation task.
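The idea of choosing the answer that maximises expected utility can be illustrated with a small numeric sketch. It is not the paper's prompt or utility function. The candidate names, the probabilities, the two events (`liked`, `novel`), and the additive utility weights below are all hypothetical, and the probabilities stand in for whatever conditional distributions the influence diagram would define.

```python
# Hypothetical per-candidate event probabilities (illustrative numbers only).
CANDIDATES = {
    "Movie A": {"liked": 0.9, "novel": 0.2},
    "Movie B": {"liked": 0.7, "novel": 0.8},
    "Movie C": {"liked": 0.5, "novel": 0.9},
}

# Hypothetical utility gained when each event occurs (one weight per objective).
UTILITY = {"liked": 0.6, "novel": 0.4}


def expected_utility(event_probs, utility):
    """E[U] = sum_i u_i * P(event_i), valid for an additive utility."""
    return sum(utility[event] * p for event, p in event_probs.items())


def best_answer(candidates, utility):
    """The decision variable: the candidate with the highest expected utility."""
    return max(candidates, key=lambda name: expected_utility(candidates[name], utility))
```

With these made-up numbers, "Movie B" wins even though "Movie A" is the most likely to be liked, because the objective trades both events off explicitly instead of leaving the balance to interpretation.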

2603.10250 2026-05-26 cs.LG

GeMPO: Generalized Measure Matching for Online Diffusion Reinforcement Learning

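Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai

AI summary: Proposes the GeMPO framework, which generalizes reweighting in diffusion RL from softmax to general monotonic functions and introduces a negative reweighting mechanism, addressing overgreedy policies and the underuse of negative samples.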

Comments 22 pages, 6 figures

Abstract

A commonly used family of RL algorithms for diffusion policies conducts softmax reweighting over samples from the behavior policy, which often induces an overgreedy policy and fails to utilize feedback from negative samples. In this work, we introduce GeMPO, a simple and unified framework that generalizes the reweighting scheme in diffusion RL from softmax to general monotonic functions. GeMPO revisits diffusion RL via a measure matching perspective: first, we construct a virtual target policy measure by solving a regularized policy optimization objective; second, we minimize the divergence between the current policy and this target measure through reweighted flow matching. This formulation offers two key advantages: i) it extends weight design beyond traditional exponential reweighting, allowing it to be tailored to diverse reward landscapes; and ii) by relaxing the non-negativity constraint on the target measure, our framework provides a principled justification for negative reweighting. We provide interpretations of how negative reweighting actively repels the policy from suboptimal actions and thus facilitates exploration. Extensive empirical evaluations demonstrate that GeMPO achieves competitive or superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting reweighting methods.
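The contrast between softmax reweighting and a general monotonic reweighting can be made concrete with a toy sketch. The code below is an illustration under assumptions, not GeMPO itself: the per-sample `scores` stand in for whatever reward or advantage signal the method uses, `tanh` is just one example of a monotonic function that can go negative, and `reweighted_loss` is a placeholder for a reweighted flow-matching loss.

```python
import math


def softmax_weights(scores, temperature=1.0):
    """Exponential reweighting: every weight is positive and they sum to 1."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def monotone_weights(scores, f=math.tanh):
    """Reweighting by an arbitrary monotonic f (tanh is an illustrative choice).

    Unlike softmax, low-scoring samples can receive negative weights.
    """
    return [f(s) for s in scores]


def reweighted_loss(weights, per_sample_losses):
    """Weighted mean of per-sample losses (placeholder for flow matching)."""
    return sum(w * l for w, l in zip(weights, per_sample_losses)) / len(weights)
```

Under softmax, a poor sample is merely down-weighted toward zero. With a negative weight, minimizing the weighted loss increases that sample's loss, which corresponds to the abstract's description of repelling the policy from suboptimal actions.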

2603.06626 2026-05-26 cs.LG cs.AI

Grouter: Decoupling Routing from Representation for Accelerated MoE Training

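Yuqi Xu, Rizhen Hu, Zihan Liu, Mou Sun, Kun Yuan

AI summary: Proposes Grouter, which distills high-quality routing structures from pretrained MoE models to serve as a fixed router, decoupling structural optimization from weight updates, significantly accelerating convergence and improving training throughput.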

Abstract

Traditional Mixture-of-Experts (MoE) training typically proceeds without any structural priors, effectively requiring the model to simultaneously train expert weights while searching for an optimal routing policy within a vast combinatorial space. This entanglement often leads to sluggish convergence and training instabilities. This paper introduces Grouter, a preemptive routing method that distills high-quality structures from fully-trained MoE models and uses them as a fixed router for target models. By decoupling structural optimization from weight updates, Grouter significantly accelerates both the speed and quality of model convergence. To ensure the framework's versatility, we also introduce expert folding to adapt Grouter across varying model configurations and expert tuning to rebalance workloads across different data distributions. Furthermore, by leveraging the structural priors provided by preemptive routing, we can implement targeted optimizations to further enhance training throughput. Experiments demonstrate that Grouter achieves superior performance and efficiency, boosting pre-training data utilization by 4.28x and achieving up to 33.5% throughput acceleration. These results establish preemptive routing as a fundamental paradigm for scalable MoE training. We publicly release our code and pretrained Grouter checkpoints at https://github.com/JimmyAwoe/Grouter.

2603.05450 2026-05-26 cs.AI cs.CL

Distributed Partial Information Puzzles: Examining Common Ground Construction Under Epistemic Asymmetry

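Yifan Zhu, Mariah Bradford, Kenneth Lai, Timothy Obiso, Videep Venkatesha, James Pustejovsky, Nikhil Krishnaswamy

AI summary: Proposes the Distributed Partial Information Puzzle (DPIP) task, collects a multimodal dataset, and evaluates large language models and a Dynamic Epistemic Logic approach on tracking belief states and common ground construction.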

Comments 10 pages, 4 figures

Journal ref: Proceedings of COLING-LREC 2026

Abstract

Establishing common ground, a shared set of beliefs and mutually recognized facts, is fundamental to collaboration, yet remains a challenge for current AI systems, especially in multimodal, multiparty settings, where the collaborators bring different information to the table. We introduce the Distributed Partial Information Puzzle (DPIP), a collaborative construction task that elicits rich multimodal communication under epistemic asymmetry. We present a multimodal dataset of these interactions, annotated and temporally aligned across speech, gesture, and action modalities to support reasoning over propositional content and belief dynamics. We then evaluate two paradigms for modeling common ground (CG): (1) state-of-the-art large language models (LLMs), prompted to infer shared beliefs from multimodal updates, and (2) an axiomatic pipeline grounded in Dynamic Epistemic Logic (DEL) that incrementally performs the same task. Results on the annotated DPIP data indicate that it poses a challenge to modern LLMs' abilities to track both task progression and belief state.

2603.00777 2026-05-26 cs.CV

DUCX: Decomposing Unfairness in Tool-Using Chest X-ray Agents

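Zikang Xu, Ruinan Jin, Xiaoxiao Li

AI summary: Proposes the DUCX framework, which uses a stage-wise fairness decomposition to systematically audit tool exposure bias, tool transition bias, and model reasoning bias in tool-using chest X-ray agents, revealing subgroup disparities that end-to-end evaluation cannot predict.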

Comments Early accepted by MICCAI 2026

Abstract

Fairness in medical agents is becoming critical as tool-using clinical AI systems orchestrate specialized vision and language modules for tasks such as chest X-ray question answering. While these medical AI agents can improve flexibility, their added pipeline complexity also creates new pathways for demographic bias beyond standalone models. We present DUCX, Decomposing Unfairness in Chest X-ray agents, a systematic audit of fairness in tool-using chest X-ray agents instantiated with MedRAX. To localize where disparities arise, we introduce a stage-wise fairness decomposition that separates end-to-end bias from three agent-specific sources: tool exposure bias, or utility gaps conditioned on tool presence; tool transition bias, or subgroup differences in tool-routing patterns; and model reasoning bias, or subgroup differences in synthesis behaviors. Extensive experiments on tool-using agentic frameworks across five driver backbones reveal that demographic gaps persist in end-to-end performance, with equalized odds up to 20.79% and the lowest fairness-utility tradeoff down to 28.65%. Intermediate behaviors, including tool usage, transition patterns, and reasoning traces, exhibit distinct subgroup disparities that are not predictable from end-to-end evaluation alone. For example, conditioned on segmentation-tool availability, the subgroup utility gap reaches as high as 50%. Our findings underscore the need for process-level fairness auditing and debiasing to ensure the equitable deployment of clinical agentic systems. Code: https://github.com/Nanboy-Ronan/DUCK.

2603.00191 2026-05-26 cs.LG cs.CV

Task-Driven Subspace Decomposition for Knowledge Sharing and Isolation in LoRA-based Continual Learning

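Lingfeng He, De Cheng, Huaijie Wang, Xi Yang, Nannan Wang, Xinbo Gao

AI summary: Proposes LoDA, which uses task-driven decomposition to build general and task-specific LoRA subspaces and combines gradient-aligned optimization with closed-form recalibration to achieve knowledge sharing and isolation, improving continual learning performance.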

Comments Accepted by ICML 2026

Abstract

Continual Learning (CL) requires models to sequentially adapt to new tasks without forgetting old knowledge. Recently, Low-Rank Adaptation (LoRA), a representative Parameter-Efficient Fine-Tuning (PEFT) method, has gained increasing attention in CL. Several LoRA-based CL methods reduce interference across tasks by separating their update spaces, typically building the new space from the estimated null space of past tasks. However, they (i) overlook task-shared directions, which suppresses knowledge transfer, and (ii) fail to capture truly effective task-specific directions, since these "null bases" of old tasks can remain nearly inactive for new tasks when tasks are correlated. To address this, we study LoRA learning capability from a projection energy perspective, and propose Low-rank Decomposition and Adaptation (LoDA). It performs a task-driven decomposition to build general and truly task-specific LoRA subspaces by solving two energy-based objectives, decoupling directions for knowledge sharing and isolation. LoDA fixes LoRA down-projections on two subspaces and learns robust up-projections via a Gradient-Aligned Optimization (GAO) approach. After each task, before integrating the LoRA updates into the backbone, LoDA derives a closed-form recalibration for the general update, approximating a feature-level joint optimum along this task-shared direction. Experiments indicate that LoDA outperforms existing CL methods. Our code is available at https://github.com/HHHLF/LoDA_ICML2026.

2602.23916 2026-05-26 cs.CV cs.AI

Topology-Driven Transferability Estimation of Medical Foundation Models for Segmentation

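Jiaqi Tang, Shaoyang Zhang, Xiaoqi Wang, Jiaying Zhou, Yang Liu, Qingchao Chen

AI summary: Proposes a topology-driven transferability estimation framework that combines global representation topology divergence, local boundary-aware topological consistency, and task-adaptive fusion to select medical foundation models efficiently without fine-tuning, with a relative improvement of about 31% in the weighted Kendall metric on the OpenMind benchmark.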

Abstract

The advent of large-scale self-supervised learning (SSL) has produced a vast zoo of medical foundation models. However, selecting optimal medical foundation models for specific segmentation tasks remains a computational bottleneck. Existing Transferability Estimation (TE) metrics, primarily designed for classification, rely on global statistical assumptions and fail to capture the topological complexity essential for dense prediction. We propose a novel Topology-Driven Transferability Estimation framework that evaluates manifold tractability rather than statistical overlap. Our approach introduces three components: (1) Global Representation Topology Divergence (GRTD), utilizing Minimum Spanning Trees to quantify feature-label structural isomorphism; (2) Local Boundary-Aware Topological Consistency (LBTC), which assesses manifold separability specifically at critical anatomical boundaries; and (3) Task-Adaptive Fusion, which dynamically integrates global and local metrics based on the semantic cardinality of the target task. Validated on the large-scale OpenMind benchmark across diverse anatomical targets and SSL foundation models, our approach significantly outperforms state-of-the-art baselines by around 31% relative improvement in the weighted Kendall metric, providing a robust, training-free proxy for efficient model selection without the cost of fine-tuning. The code will be made publicly available upon acceptance.

2602.23872 2026-05-26 cs.CV cs.RO

Altitude-Adaptive Vision-Only Geo-Localization for UAVs in GPS-Denied Environments

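Xingyu Shao, Mengfan He, Chunyu Li, Liangzheng Sun, Ziyang Meng

AI summary: To address the scale mismatch caused by altitude variation in UAV visual place recognition, proposes a monocular vision-based altitude-adaptive geo-localization framework. It estimates relative altitude via a frequency-domain transform and uses it to normalize image scale, performs coarse localization with a classification-then-retrieval visual place recognition module, and introduces a quality-adaptive margin classifier to improve retrieval robustness.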

Abstract

To address the scale mismatch caused by large altitude variations in UAV visual place recognition, we propose a monocular vision-only altitude-adaptive geo-localization framework. The method first estimates relative altitude from a single downward-looking image by transforming the input into the frequency domain and formulating altitude estimation as a regression-as-classification (RAC) problem. The estimated altitude is then used to crop the query image to a canonical scale, after which a classification-then-retrieval visual place recognition module performs coarse localization. To improve retrieval robustness under varying image quality, we further introduce a quality-adaptive margin classifier (QAMC) and refine the final location by weighted coordinate estimation over the top retrieved candidates. Experiments on two synthetic datasets and two real-flight datasets show that the relative altitude estimation (RAE) module yields clear overall improvements in downstream retrieval performance under significant altitude changes. With our visual place recognition module, altitude adaptation improves average R@1 and R@5 by 41.50 and 56.83 percentage points, respectively, compared with using the same retrieval pipeline without altitude normalization, and the full system runs at 13.3 frames/s on the reported workstation hardware. These results indicate that relative altitude estimation provides an effective scale prior for cross-altitude UAV geo-localization and supports GPS-denied coarse initialization without auxiliary range sensors or temporal inputs.

2602.23217 2026-05-26 cs.CV cs.NA math.NA

Multidimensional Task Learning: A Unified Tensor Framework for Computer Vision Tasks

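Alaa El Ichi, Khalide Jbilou

AI summary: Proposes a multidimensional task learning framework built on Generalized Einstein MLPs that unifies vision tasks such as classification, segmentation, and detection through tensor operations, and proves that its expressible task space is larger than that of conventional matrix-based methods.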

Comments This manuscript is under review at Pattern Recognition Letters

Abstract

This paper introduces Multidimensional Task Learning (MTL), a unified mathematical framework based on Generalized Einstein MLPs (GE-MLPs) that operate directly on tensors via the Einstein product. We argue that current computer vision task formulations are inherently constrained by matrix-based thinking: standard architectures rely on matrix-valued weights and vector-valued biases, requiring structural flattening that restricts the space of naturally expressible tasks. GE-MLPs lift this constraint by operating with tensor-valued parameters, enabling explicit control over which dimensions are preserved or contracted without information loss. Through rigorous mathematical derivations, we demonstrate that classification, segmentation, and detection are special cases of MTL, differing only in their dimensional configuration within a formally defined task space. We further prove that this task space is strictly larger than what matrix-based formulations can natively express, enabling principled task configurations such as spatiotemporal or cross-modal predictions that require destructive flattening under conventional approaches. This work provides a mathematical foundation for understanding, comparing, and designing computer vision tasks through the lens of tensor algebra.
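
To make the idea of contracting some tensor dimensions while preserving others concrete, here is a minimal pure-Python sketch of an Einstein product that contracts the last two modes of a weight tensor with a 2-D input. The function names `nesting_depth` and `einstein_product_2` and the nested-list representation are assumptions for illustration; this is not the GE-MLP layer described in the paper.

```python
def nesting_depth(tensor):
    """Return the order (number of axes) of a regular nested list."""
    depth = 0
    while isinstance(tensor, list):
        depth += 1
        tensor = tensor[0]
    return depth


def einstein_product_2(weight, x):
    """Contract the last two modes of `weight` with the two modes of `x`.

    The trailing two axes of `weight` must match the shape of `x`; all leading
    axes of `weight` are preserved as the axes of the output.
    """
    rows, cols = len(x), len(x[0])

    def contract(block):
        # `block` has the same shape as `x`; sum over both shared indices.
        return sum(block[h][w] * x[h][w] for h in range(rows) for w in range(cols))

    def recurse(node, remaining_output_axes):
        if remaining_output_axes == 0:
            return contract(node)
        return [recurse(child, remaining_output_axes - 1) for child in node]

    return recurse(weight, nesting_depth(weight) - 2)


x = [[1, 2], [3, 4]]
# Order-3 weight: both input axes are contracted, giving a vector (classification-like).
class_weight = [[[1, 0], [0, 1]], [[1, 1], [1, 1]]]
print(einstein_product_2(class_weight, x))
# Order-4 weight: the output keeps a 2-D layout (dense-prediction-like), no flattening.
identity = [[[[1 if (i, j) == (h, w) else 0 for w in range(2)] for h in range(2)]
             for j in range(2)] for i in range(2)]
print(einstein_product_2(identity, x))
```

The same routine yields a vector or a 2-D map depending only on the order of the weight tensor, which is the kind of dimensional configuration the abstract refers to.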

2602.19878 2026-05-26 cs.CL cs.LO

Axis-Aligned Semantics for ODRL: Resolving Dimensional Ambiguity in Policy Constraints

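Daham Mustafa, Diego Collarana, Sabrina Kirrane, Christoph Lange, Christoph Quix, Rafiqul Haque, Yixin Peng, Stefan Decker

AI summary: Addresses the dimensional ambiguity caused by multi-axis operands in ODRL through axis decomposition, turning constraints into scalar operands on each axis so that conflict detection reduces to box comparison; defines a three-valued semantics and validates correctness and compatibility on a benchmark.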

Comments 17 pages. Preprint. v3: expanded benchmark to 256 problems; revised semantics and profile (OAAP)

Abstract

The Open Digital Rights Language (ODRL) represents policy constraints as triples of a left operand, an operator, and a value. Several spatial operands, however, range over multi-axis domains such as width, height, and depth, while the constraint syntax provides no explicit axis identity. As a result, policy engines cannot determine whether multiple constraints apply to the same axis or different ones, making conflict detection unsound or incomplete. We resolve this ambiguity by axis decomposition, replacing multi-axis operands with axis-specific scalar operands over totally ordered domains. Each constraint then denotes an interval per axis and each policy an axis-aligned box, reducing conflict detection to box comparison. We define a three-valued semantics (Conflict, Compatible, Unknown), prove the decomposition sound and backward compatible with ODRL, instantiate it as ODRL Axis-Aligned Profile (OAAP), and validate it on a benchmark of 256 ODRL policy problems, each expressed in Turtle and compiled to first-order (TPTP) and SMT-LIB form, using Vampire, E, Z3, and cvc5.
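
The reduction of conflict detection to box comparison can be sketched as follows. The representation (a dict from axis name to a closed interval, with `None` marking a constraint whose interval could not be determined) and the exact rule for returning Unknown are assumptions chosen for illustration; the paper's formal semantics and the OAAP profile may differ.

```python
import math

UNBOUNDED = (-math.inf, math.inf)  # an axis a policy does not constrain


def compare_boxes(box_a, box_b):
    """Three-valued comparison of two axis-aligned boxes.

    Each box maps an axis name (e.g. "width") to a closed interval (low, high),
    or to None when the interval is undetermined (hypothetical convention).
    """
    undetermined = False
    for axis in set(box_a) | set(box_b):
        a = box_a.get(axis, UNBOUNDED)
        b = box_b.get(axis, UNBOUNDED)
        if a is None or b is None:
            undetermined = True
            continue
        if a[1] < b[0] or b[1] < a[0]:
            # Disjoint on one axis means the boxes cannot intersect at all.
            return "Compatible"
    return "Unknown" if undetermined else "Conflict"


permission = {"width": (0, 100), "height": (0, 50)}
prohibition = {"width": (80, 200)}
print(compare_boxes(permission, prohibition))
```

Because the intervals are tracked per axis, a width constraint is never compared against a height constraint, which is the ambiguity the axis decomposition is meant to remove.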

2602.18956 2026-05-26 cs.AI

INDUCTION: Finite-Structure Concept Synthesis in First-Order Logic

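Serafim Batzoglou

AI summary: Proposes the INDUCTION benchmark for finite-structure concept synthesis in first-order logic, verifies formula correctness by exact model checking, and finds that low-redundancy formulas generalize better to unseen worlds.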

2602.16340 2026-05-26 cs.LG stat.ML

The Implicit Bias of Adam and Muon on Smooth Homogeneous Neural Networks

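Eitan Gronich, Gal Vardi

AI summary: Studies the implicit bias of momentum-based optimizers on smooth homogeneous models, proving that Muon, MomentumGD, and Signum with decaying learning rates approximate steepest-descent trajectories and are biased toward KKT points of the corresponding margin-maximization problem; the analysis is extended to Adam and mixed-norm optimizers.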

Comments ICML 2026. 8 pages, 1 figure (with appendix: 45 pages, 3 figures)

2602.11173 2026-05-26 cs.CL

Author-in-the-Loop Response Generation and Evaluation: Integrating Author Expertise and Intent in Responses to Peer Review

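Qian Ruan, Iryna Gurevych

AI summary: Proposes an author-in-the-loop framework for response generation and evaluation, introducing a dataset of aligned review-response-revision triplets, the REspGen system supporting flexible author input and controllable generation, and REspEval, a comprehensive evaluation suite with 20+ metrics, to fill gaps in the use and evaluation of author signals.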

Comments accepted to ACL 2026 Main Conference

Abstract

Author response (rebuttal) writing is a critical stage of scientific peer review that demands substantial author effort. In practice, authors possess domain expertise, author-only information, and response strategies - concrete forms of author expertise and intent - and seek NLP assistance that integrates these signals into author response generation (ARG). Yet this author-in-the-loop paradigm lacks formal NLP formulation and systematic study: no dataset provides fine-grained author signals, existing ARG work lacks author inputs and controls, and no evaluation measures response reflection of author signals and effectiveness in addressing reviewer concerns. To fill these gaps, we introduce (i) Re3Align, the first large-scale dataset of aligned review-response-revision triplets, where revisions proxy author signals; (ii) REspGen, an author-in-the-loop ARG framework supporting flexible author input, multi-attribute control, and evaluation-guided refinement; and (iii) REspEval, a comprehensive evaluation suite with 20+ metrics spanning input utilization, controllability, response quality, and discourse. Experiments with SOTA LLMs demonstrate the benefits of author input and evaluation-guided refinement, the impact of input specificity on response quality, and controllability-quality trade-offs. We release our dataset, generation and evaluation tools.

2602.09130 2026-05-26 cs.LG

UniComp: A Unified Evaluation of Large Language Model Compression via Pruning, Quantization and Distillation

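Jonathan von Rad, Yong Cao, Andreas Geiger

AI summary: Proposes the UniComp framework for a unified evaluation of three compression methods (pruning, quantization, and knowledge distillation) along performance, reliability, and efficiency on 40 datasets, finding a knowledge bias, a decoupling of performance and reliability, and that task-specific calibration can improve reasoning performance.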

Comments 18 pages, 5 figures, 18 tables

Abstract

Model compression is increasingly essential for deploying large language models (LLMs), yet existing comparative studies largely focus on pruning and quantization evaluated primarily on knowledge-centric benchmarks. Thus, we introduce UniComp, a unified evaluation framework for comparing pruning, quantization, and knowledge distillation. UniComp evaluates compressed models along three dimensions: performance, reliability, and efficiency, using a diverse set of capability- and safety-oriented benchmarks together with a hardware-aware efficiency analysis. Through evaluation of six compression techniques across 40 datasets, we observe (i) a consistent knowledge bias, where factual recall is largely preserved while multi-step reasoning, multilingual, and instruction-following capabilities degrade; (ii) a decoupling between performance and reliability, indicating that retained performance does not consistently imply preserved reliability; and (iii) that task-specific calibration can yield up to 50% relative improvement of reasoning performance in pruned models.

2602.08615 2026-05-26 cs.CV

Inspiration Seeds: Learning Non-Literal Visual Combinations for Generative Exploration

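Kfir Goldberg, Elad Richardson, Yael Vinker

AI summary: Proposes the Inspiration Seeds framework, which uses CLIP sparse autoencoders to extract editing directions and isolate concept pairs, generating visual combinations of two input images without text prompts and supporting exploratory ideation in the early stages of creative work.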

Comments Project page available at https://kfirgoldberg.github.io/InspirationSeeds/

Abstract

While generative models have become powerful tools for image synthesis, they are typically optimized for executing carefully crafted textual prompts, offering limited support for the open-ended visual exploration that often precedes idea formation. In contrast, designers frequently draw inspiration from loosely connected visual references, seeking emergent connections that spark new ideas. We propose Inspiration Seeds, a generative framework that shifts image generation from final execution to exploratory ideation. Given two input images, our model produces diverse, visually coherent compositions that reveal latent relationships between inputs, without relying on user-specified text prompts. Our approach is feed-forward, trained on synthetic triplets of decomposed visual aspects derived entirely through visual means: we use CLIP Sparse Autoencoders to extract editing directions in CLIP latent space and isolate concept pairs. By removing the reliance on language and enabling fast, intuitive recombination, our method supports visual ideation at the early and ambiguous stages of creative work.

2602.06357 2026-05-26 cs.LG

LLM-SAA: LLM-persona Generated Distributions for Decision-making

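Jackie Baek, Yunhan Chen, Ziyu Chi, Will Ma

AI summary: Studies the use of LLM-generated distributions (such as simulated consumer willingness to pay) to support downstream decisions, evaluating their practical utility on three canonical problems (assortment optimization, pricing, and newsvendor); finds them effective in low-data regimes and shows that decision-agnostic metrics such as Wasserstein distance can be misleading.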

Abstract

LLMs can generate a wealth of data, ranging from simulated personas imitating human valuations and preferences, to demand forecasts based on world knowledge. But how well do such LLM-generated distributions support downstream decision-making? For example, when pricing a new product, a firm could prompt an LLM to simulate how much consumers are willing to pay based on a product description, but how useful is the resulting distribution for optimizing the price? We refer to this approach as LLM-SAA, in which an LLM is used to construct an estimated distribution and the decision is then optimized under that distribution. In this paper, we study metrics to evaluate the quality of these LLM-generated distributions, based on the decisions they induce. Taking three canonical decision-making problems (assortment optimization, pricing, and newsvendor) as examples, we find that LLM-generated distributions are practically useful, especially in low-data regimes. We also show that decision-agnostic metrics such as Wasserstein distance can be misleading when evaluating these distributions for decision-making.
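
The optimize-under-an-estimated-distribution step can be sketched on the newsvendor problem. This is a generic sample-average approximation under stated assumptions: the demand samples stand in for hypothetical LLM-persona outputs, and the function names and the regret-based evaluation are illustrative, not the paper's exact metrics.

```python
def expected_profit(order, demands, price, cost):
    """Average newsvendor profit of ordering `order` units over sampled demands."""
    return sum(price * min(order, d) - cost * order for d in demands) / len(demands)


def saa_order(demands, price, cost):
    """Order quantity that maximizes the sample-average profit.

    The sample-average profit is piecewise linear and concave in the order
    quantity, so a maximizer exists among the sampled demand values.
    """
    return max(sorted(set(demands)), key=lambda q: expected_profit(q, demands, price, cost))


def decision_regret(estimated_demands, true_demands, price, cost):
    """Profit lost under the true demands by ordering according to the estimate."""
    chosen = saa_order(estimated_demands, price, cost)
    best = saa_order(true_demands, price, cost)
    return (expected_profit(best, true_demands, price, cost)
            - expected_profit(chosen, true_demands, price, cost))


# Illustrative numbers only.
llm_demands = [10, 10, 10, 10]   # hypothetical samples from LLM personas
true_demands = [10, 20, 30, 40]  # hypothetical held-out real demand
print(saa_order(llm_demands, price=4, cost=1))
print(decision_regret(llm_demands, true_demands, price=4, cost=1))
```

Scoring the estimated distribution by the regret of the decision it induces, rather than by a distance to the true distribution, is the decision-aware style of evaluation the abstract contrasts with metrics such as Wasserstein distance.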

2602.04360 2026-05-26 cs.LG cs.AI cs.CY

Counterfactual Explanations for Hypergraph Neural Networks

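Fabiano Veglianti, Lorenzo Antonelli, Gabriele Tolomei

AI summary: Proposes CF-HyperGNNExplainer, which generates counterfactual hypergraphs through minimal structural changes to explain the predictions of hypergraph neural networks.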

2602.03983 2026-05-26 cs.RO cs.CV

Efficient Long-Horizon Vision-Language-Action Models via Static-Dynamic Disentanglement

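Weikang Qiu, Huashuo Lei, Tinglin Huang, Rex Ying

AI summary: Proposes the DySta framework, which disentangles visual inputs into multi-level static and dynamic tokens to reduce context length and reuse the KV cache, enabling efficient multi-frame integration and inference, with significant gains on benchmarks and real-world tasks.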

Abstract

Vision-Language-Action (VLA) models have recently emerged as a promising paradigm for generalist robotic control. Built upon vision-language model (VLM) architectures, VLAs predict actions conditioned on visual observations and language instructions, achieving strong performance and generalization across tasks. However, VLAs face two major challenges: a limited context window for input frames and inefficient inference due to quadratic attention complexity and large parameter counts. To this end, we propose DySta, a framework that disentangles visual inputs into multi-level static and dynamic tokens, which enables (1) retaining a single copy of static tokens across frames to significantly reduce context length, and (2) reusing the key-value (KV) cache of static tokens through a lightweight recache gate that updates only when necessary. This design enables efficient multi-frame integration and efficient inference. In addition, we introduce a new benchmark that more effectively evaluates the multi-frame integration ability of VLAs. Experiments show that DySta improves multi-frame integration by 24.5% across metrics on our benchmark and 23.3% in absolute success rate on real-world memory-dependent tasks, while accelerating inference by 2.0x (with +2.3% success rate) on simulation benchmarks and 2.2x (with +10.6% success rate) on real-world general tasks.

2602.02843 2026-05-26 cs.CL

Act or Clarify? Modeling Sensitivity to Uncertainty and Cost in Communication

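Polina Tsvilodub, Karl Mulligan, Todd Snider, Robert D. Hawkins, Michael Franke

AI summary: Proposes a computational model based on expected regret to study whether humans ask clarification questions under uncertainty, a choice that depends on a rational tradeoff between uncertainty and the cost of acting incorrectly.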

Comments 6 pages, 3 figures, accepted to CogSci 2026

Abstract

When deciding how to act under uncertainty, agents may choose to act to reduce uncertainty or they may act despite that uncertainty. In communicative settings, an important way of reducing uncertainty is by asking clarification questions (CQs). We predict that the decision to ask a CQ depends on both contextual uncertainty and the cost of alternative actions, and that these factors interact: uncertainty should matter most when acting incorrectly is costly. We formalize this interaction in a computational model based on expected regret: how much an agent stands to lose by acting now rather than with full information. We test these predictions in two experiments, one examining purely linguistic responses to questions and another extending to choices between clarification and non-linguistic action. Taken together, our results suggest a rational tradeoff: humans tend to seek clarification proportional to the risk of substantial loss when acting under uncertainty.
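
The expected-regret quantity described above, how much an agent stands to lose by acting now rather than with full information, can be written out directly. The sketch below is a textbook formulation under assumed inputs (a discrete belief over states and a payoff table); the threshold rule in `should_ask` and all names are hypothetical and are not taken from the paper's model.

```python
def expected_regret(belief, utility):
    """Expected loss from acting now instead of acting with full information.

    belief: dict mapping each state to its probability.
    utility: dict mapping each action to a dict of payoffs per state.
    """
    # Best single action when the state is still uncertain.
    best_now = max(sum(belief[s] * payoff[s] for s in belief) for payoff in utility.values())
    # Average payoff if the best action could be chosen separately in each state.
    best_informed = sum(p * max(payoff[s] for payoff in utility.values())
                        for s, p in belief.items())
    return best_informed - best_now


def should_ask(belief, utility, question_cost):
    """Hypothetical decision rule: ask when the expected regret exceeds the asking cost."""
    return expected_regret(belief, utility) > question_cost


uncertain = {"tea": 0.5, "coffee": 0.5}
cheap_error = {"bring_tea": {"tea": 1, "coffee": 0}, "bring_coffee": {"tea": 0, "coffee": 1}}
costly_error = {"bring_tea": {"tea": 1, "coffee": -10}, "bring_coffee": {"tea": -10, "coffee": 1}}
print(expected_regret(uncertain, cheap_error), expected_regret(uncertain, costly_error))
```

With the same belief, the regret is much larger when a wrong action is costly, which illustrates the predicted interaction between uncertainty and cost.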

2602.02544 2026-05-26 cs.LG cs.AI

SPA-Cache: Singular Proxies for Adaptive Caching in Diffusion Language Models

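Wenhao Sun, Rong-Cheng Tu, Yifu Ding, Zhao Jin, Jingyi Liao, Yongcheng Jing, Dacheng Tao

AI summary: Diffusion language models cannot use standard KV caching because of their non-causal nature, which makes computation costly; SPA-Cache identifies critical tokens with a low-dimensional singular proxy and allocates the cache budget adaptively, achieving up to 8x higher throughput and a 2-4x speedup.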

Comments Accepted by ICML 2026. The code repository is available at https://github.com/wenhao728/spa-cache

Abstract

While Diffusion Language Models (DLMs) offer a flexible, arbitrary-order alternative to the autoregressive paradigm, their non-causal nature precludes standard KV caching, forcing costly hidden state recomputation at every decoding step. Existing DLM caching approaches reduce this cost by selective hidden state updates; however, they are still limited by (i) costly token-wise update identification heuristics and (ii) rigid, uniform budget allocation that fails to account for heterogeneous hidden state dynamics. To address these challenges, we present SPA-Cache, which jointly optimizes update identification and budget allocation in DLM caching. First, we derive a low-dimensional singular proxy that enables the identification of update-critical tokens in a low-dimensional subspace, substantially reducing the overhead of update identification. Second, we introduce an adaptive strategy that allocates fewer updates to stable layers without degrading generation quality. Together, these contributions significantly improve the efficiency of DLMs, yielding up to an 8x throughput improvement over vanilla decoding and a 2-4x speedup over existing caching baselines.

2602.02474 2026-05-26 cs.CL cs.AI cs.LG

MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents

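Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, Wenya Wang

AI summary: Proposes the MemSkill framework, which turns memory operations into learnable and evolvable skills: a controller selects skills, an executor generates memories, and a designer evolves the skill set, forming a closed loop that improves the task performance of LLM agents.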

Comments Code is available at https://github.com/ViktorAxelsen/MemSkill

Abstract

Most Large Language Model (LLM) agent memory systems rely on a small set of static, hand-designed operations for extracting memory. These fixed procedures hard-code human priors about what to store and how to revise memory, making them rigid under diverse interaction patterns and inefficient on long histories. To this end, we present MemSkill, which reframes these operations as learnable and evolvable memory skills: structured and reusable routines for extracting, consolidating, and pruning information from interaction traces. Inspired by the design philosophy of agent skills, MemSkill employs a controller that learns to select a small set of relevant skills, paired with an LLM-based executor that produces skill-guided memories. Beyond learning skill selection, MemSkill introduces a designer that periodically reviews hard cases where selected skills yield incorrect or incomplete memories, and evolves the skill set by proposing refinements and new skills. Together, these components form a closed-loop procedure that improves both the skill-selection policy and the skill set itself. Experiments on LoCoMo, LongMemEval, HotpotQA, and ALFWorld demonstrate that MemSkill improves task performance over strong baselines and generalizes well across settings. Further analyses shed light on how skills evolve, offering insights toward more adaptive, self-evolving memory management for LLM agents.

2602.00545 2026-05-26 cs.LG

Depth, Not Data: An Analysis of Hessian Spectral Bifurcation

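Shenyang Deng, Boyao Liao, Zhuoli Ouyang, Tianyu Pang, Yaoqing Yang

AI summary: By analyzing deep linear networks, the paper proves that the Hessian's spectral bifurcation (dominant eigenvalues separated from the bulk) can arise from network depth alone, independent of whether the data covariance is balanced, and that the ratio of dominant to bulk eigenvalues scales linearly with depth.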

Abstract

The eigenvalue distribution of the Hessian matrix plays a crucial role in understanding the optimization landscape of deep neural networks. Prior work has attributed the well-documented "bulk-and-spike" spectral structure, where a few dominant eigenvalues are separated from a bulk of smaller ones, to the imbalance in the data covariance matrix. In this work, we challenge this view by demonstrating that such spectral bifurcation can arise purely from the network architecture, independent of data imbalance. Specifically, we analyze a deep linear network setup and prove that, even when the data covariance is perfectly balanced, the Hessian still exhibits a bifurcated eigenvalue structure: a dominant cluster and a bulk cluster. Crucially, we establish that the ratio between dominant and bulk eigenvalues scales linearly with the network depth. This reveals that the spectral gap is strongly affected by the network architecture rather than solely by the data distribution. Our results suggest that both model architecture and data characteristics should be considered when designing optimization algorithms for deep networks.

2602.00511 2026-05-26 cs.LG math.OC

Partition of Unity Neural Networks for Interpretable Classification with Explicit Class Regions

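Akram Aldroubi

AI summary: Proposes Partition of Unity Neural Networks (PUNN), which define class probabilities by directly learning nonnegative functions that sum to 1, with no softmax layer, giving interpretable classification with a proven density result; experiments show large parameter reductions at comparable accuracy.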

Comments v2: substantially revised; under review at TMLR

Abstract

Despite their empirical success, neural network classifiers remain difficult to interpret. In softmax-based models, class regions are defined implicitly as solutions to systems of inequalities among logits, making them difficult to extract and visualize. We introduce Partition of Unity Neural Networks (PUNN), an architecture in which class probabilities arise directly from a learned partition of unity, without requiring a softmax layer. PUNN constructs $k$ nonnegative functions $h_1, \ldots, h_k$ satisfying $\sum_i h_i(x) = 1$, where each $h_i(x)$ directly represents $P(\text{class } i \mid x)$. Unlike softmax, where class regions are defined implicitly through coupled inequalities among logits, each PUNN partition function $h_i$ directly defines the probability of class $i$ as a standalone function of $x$. We prove that PUNN is dense in the space of continuous probability maps on compact domains. The gate functions $g_i$ that define the partition can use various activation functions (sigmoid, Gaussian, bump) and parameterizations ranging from flexible MLPs to parameter-efficient shape-informed designs (spherical shells, ellipsoids, spherical harmonics). Experiments on synthetic data, UCI benchmarks, and MNIST show that PUNN with MLP-based gates achieves accuracy within 0.3-0.6% of standard multilayer perceptrons. When geometric priors match the data structure, shape-informed gates achieve comparable accuracy with up to 300x fewer parameters. These results demonstrate that interpretable-by-design architectures can be competitive with black-box models while providing transparent class probability assignments.
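
A minimal sketch shows how nonnegative gate functions can yield functions $h_i$ that sum to 1. The abstract does not state how PUNN combines its gates, so dividing each gate by the sum of all gates is an assumption here (it is one standard way to build a partition of unity), and the names `gaussian_gate` and `class_probabilities` are hypothetical.

```python
import math


def gaussian_gate(center, width):
    """Return a nonnegative Gaussian bump g(x) centered at `center`."""
    def gate(x):
        sq_dist = sum((xi - ci) ** 2 for xi, ci in zip(x, center))
        return math.exp(-sq_dist / (2.0 * width ** 2))
    return gate


def class_probabilities(gates, x):
    """Normalize gate values so they are nonnegative and sum to 1 at x."""
    values = [g(x) for g in gates]
    total = sum(values)
    if total == 0.0:
        # All gates underflowed; fall back to a uniform assignment.
        return [1.0 / len(gates)] * len(gates)
    return [v / total for v in values]


# Two classes with illustrative centers; each h_i can be read off directly as P(class i | x).
gates = [gaussian_gate((0.0, 0.0), 1.0), gaussian_gate((2.0, 0.0), 1.0)]
print(class_probabilities(gates, (1.0, 0.0)))
print(class_probabilities(gates, (0.0, 0.0)))
```

Each gate here has an explicit center and width, so the region where a class dominates can be inspected from its parameters, which is the kind of transparency the shape-informed designs aim for.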
