Abstract

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.


2605.18697 2026-05-19 cs.DC cs.AI cs.PL (updated version)

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

Stephen Mell, David Mell, Konstantinos Kallas, Steve Zdancewic, Osbert Bastani

Affiliations: University of Pennsylvania; Independent Researcher; University of California, Los Angeles

AI summary: This paper presents PopPy, a system that identifies parallelization opportunities in Python applications that call external components, achieving up to 6.4x speedups in end-to-end latency for compound AI applications while preserving sequential program semantics.

Abstract

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their end-to-end latency a critical bottleneck. In contrast to traditional applications, execution time is dominated by the external components, which cannot be handled by traditional language optimization systems, like optimizing compilers. To address this problem, we develop PopPy, a system that can uncover parallelization opportunities in Python applications that invoke these heavy external components, including those used in compound AI applications. PopPy supports a very expressive fragment of Python and requires minimal developer input to uncover parallelism. It combines an ahead-of-time compiler with a runtime, addressing three key challenges in extracting parallelism from Python applications: language complexity, dynamic dispatch, and variable mutation. On a set of real-world compound AI applications, PopPy achieves up to $6.4\times$ speedups in end-to-end execution time compared to standard Python execution while preserving the sequential program semantics.
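
The kind of opportunity the abstract describes can be illustrated by hand. The sketch below is not PopPy's compiler or runtime: it only shows that independent, latency-bound calls can overlap while their results are still consumed in program order. `slow_external_call` is a hypothetical stand-in for a model or API call, and the thread pool is an assumption made for the illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def slow_external_call(name, delay=0.2):
    """Hypothetical stand-in for an ML model or API call dominated by waiting."""
    time.sleep(delay)
    return f"result:{name}"


def sequential(names):
    # Baseline: each call waits for the previous one to finish.
    return [slow_external_call(n) for n in names]


def parallel(names):
    # The calls share no data, so they may overlap. map() yields results
    # in input order, so the output matches the sequential version.
    with ThreadPoolExecutor(max_workers=max(1, len(names))) as pool:
        return list(pool.map(slow_external_call, names))
```

Here the independence of the calls is obvious to the programmer; the abstract's point is that a system can discover such cases in ordinary Python code despite dynamic dispatch and variable mutation.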


2605.18696 2026-05-19 cs.LG cs.AI (updated version)

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi, Nassim Bouarour, Vinay Kumar Sankarapu

Affiliations: Lexsi Labs

AI summary: This paper studies ensembling of tabular foundation models (TFMs). It finds that although ensembling usually improves performance, pools of modern TFMs are nearly redundant and some ensemble strategies do poorly on accuracy and calibration; it recommends greedy selection as the practical default.

Abstract

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go-to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly worse than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, but the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.
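
For readers unfamiliar with the Q-statistic, the following is a minimal sketch assuming the standard Yule's Q definition over per-example correctness: Q = (N11*N00 - N01*N10) / (N11*N00 + N01*N10), where N11 counts examples both models get right and N00 those both get wrong. Whether the paper computes it exactly this way is an assumption; the function names are illustrative.

```python
from itertools import combinations


def q_statistic(correct_a, correct_b):
    """Yule's Q for two classifiers, given per-example correctness flags."""
    n11 = n00 = n10 = n01 = 0
    for a, b in zip(correct_a, correct_b):
        if a and b:
            n11 += 1          # both correct
        elif not a and not b:
            n00 += 1          # both wrong
        elif a:
            n10 += 1          # only the first is correct
        else:
            n01 += 1          # only the second is correct
    denom = n11 * n00 + n01 * n10
    if denom == 0:
        raise ValueError("Q is undefined when n11*n00 + n01*n10 == 0")
    return (n11 * n00 - n01 * n10) / denom


def mean_pairwise_q(correctness):
    """Average Q over all unordered pairs of models in the pool."""
    pairs = list(combinations(correctness, 2))
    return sum(q_statistic(a, b) for a, b in pairs) / len(pairs)
```

Q is 1 when two models are right and wrong on exactly the same examples, 0 when their errors are statistically independent, and -1 when they never err together. A pool mean near 1 therefore means the members rarely disagree, which leaves little for averaging to correct.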


2605.18693 2026-05-19 cs.AI (updated version)

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, Qizhen Lan, Zhangquan Chen, Zhi Yang, Qianyu Xu, Ronghao Chen, Huacan Wang, Sen Hu

Affiliations: SJTU; XJTU; NUS; QuantaAlpha; THU; SUFE; NTU; PKU; UCAS

AI summary: This paper introduces SkillGenBench, a benchmark that evaluates skill generation pipelines for LLM agents under a unified, controlled protocol. It covers two generation regimes, task-conditioned and task-agnostic, and two procedural sources, repositories and documents, and it reveals how skill generation performance differs across these sources.

Abstract

As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.


2605.18684 2026-05-19 cs.SE cs.AI (updated version)

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

Sanderson Oliveira de Macedo, Ronaldo Martins da Costa

Affiliations: Federal Institute of Goias; Federal University of Goias

AI summary: This paper presents Reversa, a reverse documentation engineering framework that converts legacy software into operational specifications usable by AI agents. A multi-agent pipeline extracts implicit rules and produces traceable specifications, and the paper proposes an evaluation protocol that measures coverage, traceability, confidence, utility, and cost.

Comments: Preprint. Includes a generative AI use statement

Abstract

Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices. At the same time, language-model-based coding agents depend on reliable context, correctness criteria, and behavioral contracts to modify real systems with lower risk. This paper presents Reversa, a reverse documentation engineering framework for converting legacy software into traceable operational specifications for AI agents. Reversa organizes this process as a multi-agent pipeline: specialized agents map the project surface, analyze modules, extract implicit rules, synthesize architecture, write unit-level specifications, and review generated claims. The proposal emphasizes three mechanisms: traceability between code and specification, explicit confidence marking, and preservation of gaps for human validation. The framework is distributed as a Node.js CLI, installs skills across multiple agent engines, and uses a SHA-256 manifest to preserve modified files during update or uninstall operations. In addition to the architectural description, we report an exploratory case study on migrating an ATM from COBOL to Go, in which the pipeline produced 517 claims classified by an internal confidence index, 10 registered gaps, 53 Gherkin parity scenarios, and a reconstruction plan with 9 of 11 tasks completed at inventory time. Final parity validation and cutover were not completed in this study. We do not claim broad empirical superiority; we position the contribution with respect to the literature on reverse engineering, LLM-based documentation, and software agents, and propose an evaluation protocol with metrics for coverage, traceability, confidence, utility, and cost.
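
The manifest mechanism mentioned in the abstract can be sketched as follows. This is an illustration of the general technique, not Reversa's code (which the abstract says ships as a Node.js CLI): record a SHA-256 digest for each file at install time, then treat any file whose digest has changed as user-modified and leave it alone during update or uninstall. The function names and the dictionary manifest format are assumptions.

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Hex SHA-256 digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def build_manifest(root, relative_paths):
    """Record the digest of each file as it was installed."""
    return {rel: file_sha256(Path(root) / rel) for rel in relative_paths}


def user_modified(root, manifest):
    """Return installed files whose current digest differs from the manifest."""
    changed = []
    for rel, digest in manifest.items():
        path = Path(root) / rel
        if path.exists() and file_sha256(path) != digest:
            changed.append(rel)
    return sorted(changed)
```

An updater built on this would overwrite or delete only the files that `user_modified` does not report.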


2605.18681 2026-05-19 cs.AI cs.LG (updated version)

Learning Quantifiable Visual Explanations Without Ground-Truth

Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny, Dimosthenis Karatzas

Affiliations: Computer Vision Center, Barcelona, Spain; Autonomous University of Barcelona, Spain

AI summary: This paper proposes a quantifiable metric, based on continuous input perturbation, for assessing the quality of XAI methods. It also proposes a new XAI method that fine-tunes with a differentiable approximation of the metric and produces causal explanations without degrading model performance.

Abstract

Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model's decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model's decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics.


2605.18675 2026-05-19 cs.LG cs.AI (updated version)

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar

Affiliations: Department of Mechanical Engineering, Iowa State University; Department of Computer Science, Iowa State University; Department of Industrial and Manufacturing Systems Engineering, Iowa State University; Translational AI Center, Iowa State University

AI summary: This paper proposes COOPO, which cycles between offline training and online fine-tuning to address distributional shift and limited performance in offline reinforcement learning and the high cost of environment interaction in online reinforcement learning. Periodically returning to offline training reduces forgetting and drift and improves sample efficiency and performance.

Abstract

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.


2605.18674 2026-05-19 cs.AI (updated version)

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

Michael Aichmüller, Simon Ståhlberg, Martin Funkquist, Hector Geffner

Affiliations: RWTH Aachen University; Linköping University

AI summary: This paper improves the efficiency and scalability of learning general policies in classical planning through a holistic encoding of the lookahead search tree and an abstracted width measure, addressing the compute cost and expressivity limits of earlier approaches.

Abstract

Generalized planning aims to learn policies that generalize across collections of instances within a classical planning domain. Recent Graph Neural Network (GNN) approaches have learned nearly perfect policies for several domains. This work improves on the recently published idea of Iterated Width (IW) policies. Therein, the policy broadens its successor scope through an IW-lookahead search that can "jump" over multiple transitions, simplifying the problem structure. Yet, each transition is evaluated individually, leading to unscalable compute costs and expressivity limitations. Furthermore, although IW(1) is attractive because it scales linearly with the number of atoms, it becomes inefficient once thousands of objects are considered, as in the International Planning Competition (IPC) 2023 benchmark. We address both limitations. First, we introduce a vastly more efficient holistic encoding of the entire search tree. It jointly represents IW(1)-reachable states only by their relational differences to the current state, enabling Relational GNNs (R-GNNs) to score all transitions in a single forward pass. Second, we define Abstracted IW(1) to improve scaling through relational abstraction during novelty checks. Rather than testing fully instantiated atoms, it abstracts each atom by replacing all but one argument with its type. The original atom is novel if any of its abstracted forms is novel. This structural compression shifts novelty search scaling from atoms to objects, while preserving meaningful subgoal structure. We evaluate our contributions on the hyperscaling IPC 2023 benchmark and across diverse domains, including domains requiring features beyond the $C_2$ logic fragment. Our policies achieve new state-of-the-art performance, significantly surpassing prior work, including the classical planner LAMA.
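
The abstracted novelty check described in the abstract can be sketched as below. This is a minimal reading of the two sentences that define it, not the authors' implementation: atoms are tuples `(predicate, arg1, ...)`, `type_of` maps each object to its type, and a `?` prefix marks a type placeholder. The treatment of zero-argument atoms and the choice to record all abstracted forms of a state are assumptions.

```python
def abstractions(atom, type_of):
    """All forms of an atom that keep one argument and replace the rest by their types."""
    pred, args = atom[0], atom[1:]
    if not args:
        return [(pred,)]  # assumption: a nullary atom is its own abstraction
    forms = []
    for keep in range(len(args)):
        forms.append((pred,) + tuple(
            arg if i == keep else "?" + type_of[arg]
            for i, arg in enumerate(args)))
    return forms


class AbstractNoveltyTable:
    """Tracks abstracted atoms seen so far during a width-1 style search."""

    def __init__(self, type_of):
        self.type_of = type_of
        self.seen = set()

    def state_is_novel(self, state_atoms):
        # A state is novel if any atom has an abstracted form not seen before.
        novel = False
        for atom in state_atoms:
            for form in abstractions(atom, self.type_of):
                if form not in self.seen:
                    self.seen.add(form)
                    novel = True
        return novel
```

With four blocks, seeing `on(a, b)` and `on(c, d)` already covers both abstracted forms of `on(a, d)`, so a state adding only that atom is pruned even though the atom itself is new. Each predicate argument position can contribute at most one abstracted form per object, which is the shift from atom-level to object-level scaling that the abstract refers to.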


2605.18672 2026-05-19 cs.AI (updated version)

Statistical Bounds and Efficient Algorithms for Differentially Private Federated Learning

Arnab Auddy, Xiangni Peng, Subhadeep Paul

Affiliations: Department of Statistics

AI summary: This paper studies the trade-offs among estimation accuracy, privacy constraints, and communication cost in differentially private federated learning. It proposes two efficient algorithms, FedHybrid and FedNewton, that aim to improve accuracy at reduced communication cost, and it establishes upper and lower bounds on the mean-squared error to assess them.

Abstract

Federated Learning is a leading framework for training ML and AI models collaboratively across numerous user devices or databases. We study the trade-offs among estimation accuracy, privacy constraints, and communication cost for differentially private (DP) federated M-estimation. The two standard methods in the literature are FedAvg, which may suffer from high federation bias, and FedSGD, which can incur high communication cost. Aiming to improve accuracy at a reduced communication cost, we propose FedHybrid, which runs FedSGD from an improved initialization given by the FedAvg estimator. We also propose FedNewton, which averages local Newton iterations to reduce the bias of FedAvg, achieving an estimation accuracy comparable to FedSGD with far fewer communication rounds when the number of clients grows sufficiently slowly. We establish finite-sample upper bounds on the mean-squared error (MSE) rates of the DP versions of these estimators as functions of the number of clients, local sample sizes, privacy budget, and number of iterations. We further derive a minimax lower bound on the MSE of any iterative private federated procedure, which provides a benchmark for assessing the optimality gap of these methods. We numerically evaluate our methods by training a logistic regression and a neural network on the computer vision datasets MNIST and CIFAR-10.


2605.18654 2026-05-19 cs.LG cs.AI (updated version)

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu, Pratinav Seth

Affiliations: Lexsi Labs

AI summary: This paper proposes distilling high-performing tabular foundation models (TFMs) into CPU-native gradient-boosted trees to close the gap between real-time fraud-scoring latency requirements and the inference latency of current models, and it validates the approach on multiple datasets.

Abstract

A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.
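
Stratified out-of-fold labeling can be sketched in plain Python as follows. This is an illustration of the idea, not the TabTune pipeline: `knn_teacher` is a toy stand-in for an in-context teacher (it predicts from whatever rows are placed in its context), and all function names are hypothetical. The key property is that each row is labeled by a teacher whose context excludes that row's fold, so the teacher cannot simply read back the row's own label.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, seed=0):
    """Assign each row to one of k folds, keeping class ratios similar across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    fold_of = [0] * len(labels)
    for idxs in by_class.values():
        rng.shuffle(idxs)
        for pos, i in enumerate(idxs):
            fold_of[i] = pos % k  # deal each class round-robin over the folds
    return fold_of


def knn_teacher(context_X, context_y, x, n_classes, k=3):
    """Toy in-context teacher: class frequencies among the k nearest context rows."""
    nearest = sorted(
        range(len(context_X)),
        key=lambda j: sum((a - b) ** 2 for a, b in zip(context_X[j], x)))[:k]
    probs = [0.0] * n_classes
    for j in nearest:
        probs[context_y[j]] += 1.0 / len(nearest)
    return probs


def oof_soft_labels(X, y, n_classes, k_folds=3):
    """Soft targets for every row, each produced without that row's fold in context."""
    fold_of = stratified_folds(y, k_folds)
    soft = [None] * len(X)
    for f in range(k_folds):
        ctx = [i for i in range(len(X)) if fold_of[i] != f]
        ctx_X = [X[i] for i in ctx]
        ctx_y = [y[i] for i in ctx]
        for i in range(len(X)):
            if fold_of[i] == f:
                soft[i] = knn_teacher(ctx_X, ctx_y, X[i], n_classes)
    return soft, fold_of
```

The resulting soft targets would then serve as the training signal for the student, for example a gradient-boosted tree model as in the abstract.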


2605.18648 2026-05-19 cs.LG cs.AI cs.CL (updated version)

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

Maja Pavlovic, Silviu Paun, Massimo Poesio

Affiliations: Queen Mary University London; Amazon; University of Utrecht

AI summary: By comparing human and model-generated labels in soft-label learning, this paper finds that human labels not only improve accuracy but also act as a regularizer, improving calibration on difficult samples and training stability.

Abstract

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.


2605.18635 2026-05-19 cs.LG cs.AI (updated version)

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

Aditya Tanna, Mitul Solanki, Mohamed Bouadi, Nassim Bouarour, Pratinav Seth, Vinay Kumar Sankarapu

Affiliations: Lexsi Labs

AI summary: This paper studies how different context-construction strategies affect the performance of tabular foundation models in credit risk prediction, and finds that the context strategy explains more of the variance in AUC-ROC than the choice of model.

Abstract

Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.
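
One plausible reading of a "balanced" context is sketched below: give each class an equal share of the context window and, if a rare class runs out of rows, fill the shortfall uniformly from what remains. The paper's seven strategies are not specified in the abstract, so the function name and the fill rule are assumptions.

```python
import random
from collections import defaultdict


def balanced_context(labels, context_size, seed=0):
    """Pick row indices so each class gets an equal share, as far as its rows allow."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    per_class = context_size // len(by_class)
    chosen = []
    for idxs in by_class.values():
        chosen.extend(rng.sample(idxs, min(per_class, len(idxs))))
    # If a rare class was exhausted, top up uniformly from the unused rows.
    chosen_set = set(chosen)
    remaining = [i for i in range(len(labels)) if i not in chosen_set]
    shortfall = context_size - len(chosen)
    chosen.extend(rng.sample(remaining, min(shortfall, len(remaining))))
    return chosen
```

With a 10% default rate, a uniform sample of 10 rows would contain about one default on average, while this sketch returns five, which is the kind of difference in minority-class exposure that the abstract credits for the AUC gain.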


2605.18632 2026-05-19 cs.LG cs.AI (updated version)

Position: Weight Space Should Be a First-Class Generative AI Modality

Zhangyang Wang, Peihao Wang, Kai Wang

Affiliations: University of Texas at Austin; Tencent Hy

AI summary: This paper argues that model checkpoints should be treated as a first-class data modality and that generative modeling in weight space should become a core machine learning primitive. Recent advances show that neural network weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. The paper contends that these results reflect a structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space. On that basis it organizes existing methods into a five-stage pipeline, surveys areas where the approach is already practical, and clarifies current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open.

Comments: AI systems routinely improve or create other AI systems

Abstract

英文摘要

Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community's default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.


2605.18630 2026-05-19 cs.AI physics.comp-ph (updated version)

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan

Affiliations: Rensselaer Polytechnic Institute; University of Texas at Arlington; Pacific Northwest National Laboratory; National Renewable Energy Laboratory

AI summary: This work introduces SCICONVBENCH, a benchmark for evaluating LLMs on multi-turn clarification during task formulation in computational science, focusing on eliciting missing information and resolving contradictions in requests. A structured task ontology and a rubric-based evaluation framework allow systematic measurement along three dimensions: clarification behavior, conversational grounding, and final-specification fidelity.

Abstract

Large Language Models (LLMs) are increasingly deployed as scientific AI assistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi-turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and partial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (inconsistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM performance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs frequently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.


2605.18627 2026-05-19 cs.AI 版本更新

Learning Lifted Action Models from Traces with Minimal Information About Actions and States

从动作轨迹中学习提升的动作模型：最少关于动作和状态的信息

Jonas Gösgens, Niklas Jansen, Hector Geffner

发表机构 * RWTH Aachen University（亚琛工业大学）

AI总结本文研究了在不完全信息下从动作轨迹中学习STRIPS+动作域的问题，提出了三种通用情况下的算法和完备性结果，假设选定的动作参数完全可观察，从而在不同可观察性假设下确定等效域的学习条件。

Comments accepted at KR2026

Abstract

It has been recently shown that lifted STRIPS models can be learned correctly and efficiently from action traces alone; i.e., applicable action sequences from a hidden STRIPS model. The result is remarkable because the states are not assumed to be observable at all, and yet it is not practical enough as STRIPS actions include arguments that are not needed for selecting the actions. This shortcoming has been addressed by assuming that the action traces come instead from a hidden STRIPS+ model where some action arguments are implicit in the hidden action preconditions. A limitation of this approach, however, is that it assumes that the states are fully observable. In this work, we relax these restrictions and consider the problem of learning STRIPS+ action domains from traces in a more general context where the traces carry partial information about both actions and states. In particular, we formulate algorithms and completeness results for three general cases, all of which assume full observability of selected action arguments. In the first case, no observability of the state is assumed; in the second case, full observability of some state predicates is assumed, and in the third case, local observability of some state predicates is assumed instead. Given a STRIPS+ domain, these results characterize the conditions under which an equivalent domain can be learned from traces. Experimental results are reported.


2605.18621 2026-05-19 cs.CV cs.AI (version update)

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

Affiliations: Zhejiang University

AI summary: This work proposes CrossView Suite, whose three components (CrossViewSet, CrossViewBench, and CrossViewer) address data scarcity, insufficient evaluation, and the lack of alignment mechanisms in cross-view reasoning, improving multi-view spatial understanding.

Abstract

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we develop CrossView Suite, which consists of three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. First, we introduce a multi-agent data engine to curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method uses an adaptive spatial region tokenizer to capture fine-grained object representations, then explicitly aligns objects across views, and finally fuses the aligned features to strengthen the cross-view reasoning capacity of MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.


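2605.18617 2026-05-19 cs.RO cs.AI cs.CV (version update)

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

Affiliations: Beihang University; National University of Singapore; Hangzhou Innovation Institute, Beihang University

AI summary: This paper presents ManiSoft, a benchmark for studying vision-language manipulation with soft continuum robots. A tailored simulator couples realistic soft-body dynamics with contact-rich interactions, four tasks highlight different aspects of deformable control, an automated pipeline generates 6,300 diverse scenes with expert trajectories, and three representative policy models are evaluated.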

Comments Accepted in ICML 2026

Abstract

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, ManiSoft includes an automated pipeline that generates $6{,}300$ diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but a substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoidance. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Our code and datasets are released at https://buaa-colalab.github.io/ManiSoft.


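2605.18613 2026-05-19 cs.SD cs.AI (version update)

SAME: A Semantically-Aligned Music Autoencoder

Julian D. Parker, Zach Evans, CJ Carr, Zachary Zukowski, Josiah Taylor, Matthew Rice, Jordi Pons

Affiliations: Stability AI

AI summary: This work proposes the SAME autoencoder, which combines a Transformer architecture with semantic regularization to achieve a 4096x temporal compression ratio while maintaining reconstruction quality and generative performance.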

2605.18610 2026-05-19 cs.CV cs.AI cs.LG (version update)

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu, Xiaofeng Chen

Affiliations: Fujian Normal University; Nanyang Technological University; Xidian University

AI summary: This paper presents the first study of continual unlearning for vision-language models and proposes CATA, a conflict-averse task arithmetic method that addresses the challenges of effectiveness, model fidelity, and persistence in unlearning.

Abstract

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.


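2605.18593 2026-05-19 cs.CR cs.AI cs.RO (version update)

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

Ali Iranmanesh, Peng Liu

Affiliations: Cyber Security Lab; The Pennsylvania State University; State College, USA

AI summary: This study examines how typographic attacks affect the full household robot manipulation pipeline. It introduces a decoupled perception architecture and finds that perception errors propagate through a persistent 3D semantic map into physical failures, showing that typographic misclassification is a practical threat to robot safety.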

Comments 10 pages, 1 figure, IEEE conference format

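AI abstract

Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, a structural vulnerability of this shared embedding space enables typographic attacks, in which printed text in the physical scene semantically overrides visual judgment. Although prior work has quantified this threat on static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This paper evaluates typographic attacks in Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while preserving geometric localization through DEtic. Across 59 controlled evaluation episodes, the attack achieves an overall attack success rate (ASR) of 67.8%, rising to 70.0% on fully successful episodes, under uncontrolled viewpoint and occlusion and without perceptual optimization. The key finding is that perception errors propagate through the persistent 3D semantic map and cause kinetic failures, that is, physical grasping and transport of the wrong object driven by an adversarially contaminated semantic state. In these cases, the robot physically grasps the wrong object and delivers it to the target receptacle. These results establish typographic misclassification as a practical, measurable, and physically consequential threat to the safety of modular manipulation pipelines, one that prior typographic attack research had not examined.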

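VISAFF: Speaker-Centric Visual Affective Feature Learning for Emotion Recognition in Conversations

Linan ZHU, Zihao Zhai, Xiao Han, Yuqian Fu, Xiangfan Chen, Xiangjie Kong, Guojiang Shen

Affiliations: Zhejiang University of Technology; ETH Zurich

AI summary: This paper proposes VISAFF, a framework that uses speaker-centric visual affective feature learning to handle complex scenarios in emotion recognition in conversations, improving computational efficiency and avoiding the high cost of fine-tuning large models.

AI abstract

Emotion recognition in conversations (ERC) is essential for effective human-computer interaction and aims to identify speakers' emotional states across multi-turn dialogues. Early text-based methods struggle with complex cases such as sarcasm because they inherently ignore crucial non-verbal information. Although recent vision-language models (VLMs) address this by analyzing video directly, they are not tailored to ERC and often attend to emotion-irrelevant background regions or passive listeners rather than the active speaker. Moreover, fine-tuning these large models incurs high computational cost. In addition, isolated visual signals are often ambiguous or technically degraded when they lack the context of linguistic content and vocal intonation. To address these challenges, we propose VISAFF, a speaker-centric visual affective feature learning framework for ERC. VISAFF comprises two stages: speaker-centric affective localization and reliability-guided affective supplementation. VISAFF takes a fine-tuning-free approach to unlock the reasoning ability of frozen VLMs, efficiently steering them toward the affective visual cues of the active speaker without heavy training overhead. In the second stage, we introduce a reliability-guided affective supplementation mechanism that dynamically uses the text and audio modalities to compensate for visual uncertainty. Experiments on two real-world datasets show that VISAFF achieves performance comparable to state-of-the-art methods in a fine-tuning-free setting, while substantially improving computational efficiency by eliminating the need for costly fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.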

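When Fireflies Cluster: Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

MKA Ariyaratne, Azwirman Gusrialdi, Yury Nikulin, Jaakko Peltonen

Affiliations: Department of Computer Science, Faculty of Applied Sciences, University of Sri Jayewardenepura; Faculty of Engineering and Natural Sciences, Tampere University; Department of Mathematics and Statistics, University of Turku

AI summary: This paper proposes an improved firefly algorithm for data clustering that addresses limitations of traditional methods such as K-means in handling non-uniform cluster shapes and densities and in requiring the number of clusters to be specified in advance. The algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a new TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. An application to robotic sensor networks demonstrates its practical value: experiments show better clustering quality than K-means and shorter intra-cluster path distances. These results confirm the algorithm's robustness on complex spatial clustering tasks, and it may be extended to higher-dimensional and adaptive scenarios in the future.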

Comments 34 pages, 19 Figures

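2605.18454 2026-05-19 cs.LG cs.AI cs.SC (version update)

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

Chengpeng Hu, Yingqian Zhang, Hendrik Baier

Affiliations: Eindhoven University of Technology, Eindhoven, the Netherlands; Centrum Wiskunde & Informatica, Amsterdam, the Netherlands

AI summary: This paper proposes ProRL, an interpretable programmatic reinforcement learning framework that achieves efficient scheduling through human-readable and editable programmatic policies, addressing the lack of transparency and computational efficiency in conventional deep reinforcement learning.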

Abstract

Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architectures and non-interpretable policy decisions can lead to critical trust and usability concerns for human decision makers. In addition, the computational requirements of DNNs can further hinder practical deployment in resource constrained environments. In this work, we propose ProRL, a novel interpretable programmatic reinforcement learning framework that achieves high-performance scheduling with human-readable and editable programmatic policies (i.e., programs). We first introduce a domain-specific language for scheduling (DSL-S) to represent scheduling strategies as structured programs. ProRL then explores the program space defined by DSL-S using local search to identify incomplete programs, which are subsequently completed by learning their parameters via Bayesian optimization. ProRL learns which scheduling heuristic rules to select, and hence, it naturally incorporates existing heuristics already used in industrial scenarios. Experiments on widely used benchmark instances demonstrate the strong performance of ProRL against existing heuristics and DRL baselines. Furthermore, ProRL performs well under strongly constrained computational resources, such as training with only 100 episodes. Our code is available at https://github.com/HcPlu/ProRL.


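2605.18449 2026-05-19 cs.LG cs.AI (version update)

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai

Affiliations: McGill University; Mila - Quebec AI Institute

AI summary: This paper proposes an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning problem to better reflect customers with bounded rationality, yielding more accurate estimates of impulse purchase rates and shelf traffic densities.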

Comments Proceeding of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

Abstract

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.
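
The trade-off between reward maximization and stochasticity described in this abstract can be illustrated with a Boltzmann (softmax) policy, the standard form that maximum entropy RL policies take. This is only a minimal sketch: the Q-values, the temperature name `alpha`, and the function `soft_policy` are illustrative assumptions and do not come from the paper.

```python
import math


def soft_policy(q_values, alpha=1.0):
    """Return action probabilities proportional to exp(Q / alpha).

    `alpha` is a hypothetical temperature: a small alpha gives a
    near-greedy policy, a large alpha gives a near-uniform one.
    """
    m = max(q_values)  # subtract the max for numerical stability
    weights = [math.exp((q - m) / alpha) for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


# Made-up Q-values for three candidate next aisles (illustration only).
q = [1.0, 2.0, 0.5]
greedy_like = soft_policy(q, alpha=0.1)   # almost always picks aisle 1
exploratory = soft_policy(q, alpha=10.0)  # close to a uniform choice
```

A shopper modelled with a large temperature wanders more, which is one simple way to represent bounded rationality relative to a shortest-path heuristic.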


2605.18444 2026-05-19 cs.AR cs.AI (version update)

Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations

Masoud Heidary, Biresh Kumar Joardar

Affiliations: Department of ECE, University of Houston

AI summary: This paper proposes a new technique that exploits the sign-invariance property of multiplication to mitigate aging in arithmetic multipliers, and applies it to systolic arrays to demonstrate its efficiency in high-throughput AI accelerators.

Abstract

Hardware aging poses a significant challenge for integrated circuits (ICs), leading to performance degradation and eventual failure. In this work, we focus on the aging of arithmetic multipliers, which are a cornerstone of modern computing systems, including CPUs, GPUs, and FPGAs, as well as AI accelerators such as systolic arrays. In particular, AI workloads, which rely predominantly on multiplications, can accelerate Negative Bias Temperature Instability (NBTI) effects in multipliers. This paper presents a novel aging mitigation technique that leverages the sign-invariance property of multiplication. By selectively applying two's complement transformations to inputs, the method redistributes stress across transistors, reducing the effects of NBTI aging. The proposed method is also integrated into systolic arrays, a common AI accelerator architecture, to demonstrate its efficiency in a high-throughput AI accelerator. Experimental evaluations using Cadence tools show an improved lifetime compared with the natural-aging baseline (no mitigation), while introducing negligible area and delay overheads.
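
The sign-invariance property this abstract relies on is the identity a × b = (−a) × (−b): negating both operands leaves the product unchanged while changing the bit patterns the multiplier sees. The sketch below checks this for 8-bit two's complement values. The bit width and helper names are assumptions for illustration; how the paper decides when to apply the transformation is not stated in the abstract and is not modelled here.

```python
def to_twos_complement(value, bits=8):
    """Encode a signed integer as an unsigned two's complement bit pattern."""
    return value & ((1 << bits) - 1)


def negate_twos_complement(pattern, bits=8):
    """Two's complement negation: invert all bits, then add one."""
    mask = (1 << bits) - 1
    return ((~pattern) + 1) & mask


def from_twos_complement(pattern, bits=8):
    """Decode an unsigned bit pattern back to a signed integer."""
    sign_bit = 1 << (bits - 1)
    return pattern - (1 << bits) if pattern & sign_bit else pattern


a, b = 5, -3
pa, pb = to_twos_complement(a), to_twos_complement(b)
na, nb = negate_twos_complement(pa), negate_twos_complement(pb)

# The product is unchanged when both operands are negated ...
same_product = from_twos_complement(na) * from_twos_complement(nb) == a * b
# ... but the bit patterns presented to the multiplier are different.
patterns_differ = (pa, pb) != (na, nb)
```

Note that the most negative value (−128 for 8 bits) has no positive counterpart, so a real design would need to treat that operand specially.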


2605.18419 2026-05-19 cs.CV cs.AI (version update)

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz

Affiliations: FAU Erlangen-Nürnberg, Erlangen, DE; Department of Computing, Imperial College London, London, UK

AI summary: This paper proposes GAUC, a training-free coreset selection method that operates directly in the pre-trained multimodal embedding space and optimizes three objectives to improve the robustness, accuracy, and calibration of visual in-context learning.

Abstract

Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.
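
The first of the three objectives, distributional fidelity measured by Maximum Mean Discrepancy (MMD), can be illustrated with a small pure-Python sketch. It computes the biased estimate of squared MMD under a Gaussian kernel. The kernel choice, the `gamma` value, and the toy 2-D points are assumptions for illustration; the abstract does not say which kernel or estimator GAUC uses, and the other two objectives are not modelled.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of squared MMD between samples xs and ys."""
    def mean_kernel(us, vs):
        total = sum(rbf_kernel(u, v, gamma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Toy 2-D "embeddings" (made-up numbers, not from the paper).
full_set = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
spread_coreset = [(0.0, 0.0), (1.0, 1.0)]   # covers the data
clumped_coreset = [(0.0, 0.0), (0.0, 0.0)]  # collapses onto one point

spread_score = mmd_squared(spread_coreset, full_set)
clumped_score = mmd_squared(clumped_coreset, full_set)
```

A coreset that covers the data yields a smaller MMD than one that collapses onto a single point, which is the behaviour a fidelity term of this kind rewards.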


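2605.18414 2026-05-19 cs.CR cs.AI (version update)

Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control

Rohith Uppala

Affiliations: Independent Researcher

AI summary: This paper proposes a governed MCP proxy that enforces attribute-based access control (ABAC) at both tool discovery and tool invocation, blocking unauthorized tool calls. Prompt-based restrictions reduce the unauthorized invocation rate by only 11-18 percentage points, which shows that architectural enforcement is necessary for reliable tool access control in deployed agentic systems.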

Comments 8 pages, 3 tables, 1 figure. Planning to submit to EMNLP 2026 Industry Track

Abstract

Large language models increasingly operate as autonomous agents that select and invoke tools from large registries. We identify a critical gap: when unauthorized tools are visible in an agent's context, models select them in adversarial scenarios -- even when explicitly instructed otherwise. We propose a governed MCP proxy that enforces attribute-based access control (ABAC) at two points: tool discovery, where unauthorized tools are removed from the model's context window, and tool invocation, where a second check blocks any unauthorized call. Across three models (Qwen 2.5 7B, Llama 3.1 8B, Claude Haiku 3.5) and 150 adversarial tasks spanning four attack categories, our proxy reduces unauthorized invocation rate (UIR) to 0% while adding under 50ms median latency. Prompt-based restrictions reduce UIR by only 11--18 percentage points, leaving substantial residual risk. Our results show that architectural enforcement -- not prompting -- is necessary for reliable tool access control in deployed agentic systems.
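
The two enforcement points named in this abstract, filtering at tool discovery and re-checking at tool invocation, can be sketched as a small in-process proxy. This is a hedged illustration only: the `Tool` and `GovernedProxy` classes, the single `required_role` attribute, and the policy rule are hypothetical and are not the paper's implementation or the MCP wire protocol.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Tool:
    name: str
    required_role: str  # hypothetical attribute consulted by the policy


@dataclass
class GovernedProxy:
    registry: dict = field(default_factory=dict)

    def register(self, tool):
        self.registry[tool.name] = tool

    def _is_authorized(self, agent_roles, tool):
        # Toy ABAC rule: the agent must hold the role the tool requires.
        return tool.required_role in agent_roles

    def discover(self, agent_roles):
        """Stage 1: only authorized tools are ever listed for the model."""
        return [t.name for t in self.registry.values()
                if self._is_authorized(agent_roles, t)]

    def invoke(self, agent_roles, tool_name):
        """Stage 2: re-check at call time, even if the name was guessed."""
        tool = self.registry.get(tool_name)
        if tool is None or not self._is_authorized(agent_roles, tool):
            raise PermissionError(f"blocked: {tool_name}")
        return f"ran {tool_name}"


proxy = GovernedProxy()
proxy.register(Tool("read_report", required_role="analyst"))
proxy.register(Tool("delete_records", required_role="admin"))
visible = proxy.discover({"analyst"})  # the admin-only tool is not listed
```

The second check matters because a model can still emit the name of a tool it was never shown; refusing at invocation keeps the guarantee independent of what the prompt says.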


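2605.18407 2026-05-19 cond-mat.mes-hall cond-mat.mtrl-sci cs.AI cs.RO (version update)

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, Ming Yin, Mayank Sengupta, Kristina Wolinski, Yanyu Jia, Jingzhi Shi, Derek Saucedo, Neill Saggi, Haosen Guan, Kenji Watanabe, Takashi Taniguchi, Ali Yazdani, Mengdi Wang, Sanfeng Wu

AI summary: This paper presents Qumus, described as the first embodied AI quantum materials experimentalist capable of real-world scientific discovery. Operating in a robotic mini-laboratory, it prepares and nano-processes atomically thin two-dimensional materials and van der Waals structures, and achieves the first AI creation of graphene and the first AI fabrication of atomically thin field-effect transistors.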

Comments 29 Pages in total. Supplementary Demo Videos are available at https://qumus.ai

Abstract

While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real-world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high-level reasoning, multimodal information processing and real-time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini-laboratory, Qumus is an intelligent, multimodal, and multi-agent system designed for the creation and nano-processing of atomically thin two-dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi-step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI-creation of graphene, as well as the first AI-fabrication of complex nanodevices including atomically thin field-effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed-loop experimentation. Our results establish a generalizable framework for self-improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.


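2605.18395 2026-05-19 cs.CY cs.AI (version update)

Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation

Sungwoo Kang

Affiliations: Department of Electrical and Computer Engineering; Korea University

AI summary: Using Dynamo-K, a census-grounded agent simulation framework, this paper studies the political behavior of Korean-language LLMs across six Korean elections, identifies three systematic failure modes, and proposes remedies for calibrating the models, improving the accuracy of political-behavior diagnosis.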

Abstract

Large language models (LLMs) exhibit systematic political biases in voter simulations, but their underlying mechanisms and cross-lingual generalizations remain poorly understood. We introduce Dynamo-K, a census-grounded simulation framework evaluating Korean-language LLM political behavior across four models on six Korean elections (2017-2025). Using this framework, we identify three systematic failure modes: (1) progressive bias in moderate agents, where explicit mitigation reduces Mean Absolute Error (MAE) by 5.2 times; (2) model-dependent third-party salience collapse, distinguishing between salience failure and decision bias; and (3) regional polarization collapse, where models bidirectionally under-predict historical party strongholds. To address these failures, we demonstrate that scenario reframing recovers 62% of 2017 MAE by restoring third-party visibility. Furthermore, we introduce a learned reweighting adapter that successfully calibrates opposing-valence models without relying on candidate names at train or test time. Validating our diagnostic framework, Dynamo-K accurately predicts 3/3 presidential winners - including a 2.1%p MAE on the highly contested 0.73%p-margin 2022 race - and correctly identifies the dominant party in a held-out local election. The pipeline is open-source and provides a scalable, cost-effective method for diagnosing LLM political behavior.
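
The Mean Absolute Error figures quoted in this abstract are stated in percentage points. As a reading aid, the sketch below computes MAE between simulated and official vote shares. The vote shares are made-up numbers, and averaging over candidates is an assumption; the abstract does not specify whether the paper averages over candidates, regions, or both.

```python
def mean_absolute_error(predicted, actual):
    """MAE in percentage points over the candidates present in `actual`."""
    return sum(abs(predicted[c] - actual[c]) for c in actual) / len(actual)


# Made-up vote shares in percent, for illustration only.
simulated = {"A": 46.0, "B": 49.0, "C": 5.0}
official = {"A": 48.0, "B": 47.0, "C": 5.0}

mae = mean_absolute_error(simulated, official)  # (2 + 2 + 0) / 3
```

In this toy case the simulation misses two candidates by two points each yet still names the wrong winner, which shows why a low MAE and a correct winner prediction are reported as separate results.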


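2605.18387 2026-05-19 cs.LG cs.AI (version update)

Graph Hierarchical Recurrence for Long-Range Generalization

Stefano Carotti, Marco Pacini, Alessio Gravina, Davide Bacciu, Bruno Lepri, Sebastiano Bontorin

Affiliations: Department of Computer Science, University of Trento; Fondazione Bruno Kessler; Department of Computer Science, University of Pisa

AI summary: This paper proposes Graph Hierarchical Recurrence (GHR), a new framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. It addresses the limitations of graph neural networks and graph transformers on tasks that require capturing long-range correlations, and performs strongly on multiple long-range benchmarks with high parameter efficiency.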

Abstract

Graph Neural Networks (GNNs) and Graph Transformers (GTs) are now a fundamental paradigm for graph learning, combining the representation-learning capabilities of deep models with the sample efficiency induced by their inductive biases. Despite their effectiveness, a large body of work has shown that these models still face fundamental limitations in tasks that require capturing correlations between distant regions of a graph. To address this issue, we introduce Graph Hierarchical Recurrence (GHR), a novel framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. We also show that the limitations of existing models are even more pronounced in out-of-range generalization, where test instances involve interactions over distances longer than those observed during training. By contrast, despite its simple design, GHR provides three key advantages: strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency. To corroborate these claims, we show that across a broad set of long-range benchmarks, GHR consistently outperforms existing graph models while using as little as 1% of the parameters of current state-of-the-art models. These results suggest a complementary direction to the current trend of scaling architectures to obtain graph foundation models, indicating that increased model capacity alone may not be sufficient for generalization.


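2605.18385 2026-05-19 cs.RO cs.AI (version update)

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin, Abderraouf Benali

Affiliations: Tshwane University of Technology

AI summary: This paper presents UbiSLAM, a solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras, it addresses the sensitivity to environmental change and the reliance on mobile-unit sensors found in traditional SLAM systems, improving the localization accuracy and responsiveness of robots in the environment.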

Journal ref Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Volume 1, pages 537-548, SciTePress, 2025. ISBN: 978-989-758-737-5, ISSN: 2184-433X

DOI: 10.5220/0013245400003890

Abstract

We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system.


2605.18382 2026-05-19 hep-ph cs.AI hep-ex 版本更新

Probing SMEFT Operators through $t\bar{t}t\bar{t}$ Production with Hyper-Graph Neural Networks at the LHC

通过LHC上的超图神经网络探测SMEFT算符的$t\bar{t}t\bar{t}$产生

Amir Subba, Sanmay Ganguly

Affiliations: Wilczek Quantum Center, Shanghai Institute for Advanced Studies, Shanghai 201315, China University of Science; Department of Physics, Indian Institute of Technology Kanpur, Uttar Pradesh 208016, India

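AI summary: This study uses a Hyper-Graph Neural Network (H-GNN) to probe $t\bar{t}t\bar{t}$ production in 13 TeV proton-proton collisions, separating multilepton signal events from the dominant Standard Model backgrounds ($t\bar{t}W$, $t\bar{t}Z$, $t\bar{t}H$, $t\bar{t}VV$, single-top associated production, and diboson and triboson processes). The improved signal extraction is then used to derive 95% confidence level limits on dimension-six SMEFT operators.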

Comments 16 pages, 9 figures, 3 tables. Comments are welcome

AI中文摘要

我们提出了一种现象学研究，利用超图神经网络（H-GNN）在13 TeV质子-质子碰撞中探测$t\bar{t}t\bar{t}$产生，用于区分多电荷信号事件与主导的标准模型背景，即$t\bar{t}W$、$t\bar{t}Z$、$t\bar{t}H$、$t\bar{t}VV$、单顶关联产生、双boson和三boson过程。在H-GNN架构中，每个事件被表示为一个超图，其节点对应于重建的喷注和电荷，其超边编码了任意子集之间的高阶相关性，使网络能够学习许多体动力学结构，这些结构特征于$t\bar{t}t\bar{t}$最终态。通过按照CMS样式的事件选择，结合同电荷双电荷、三电荷和四电荷通道，H-GNN在$t\bar{t}t\bar{t}$信号上获得ROC曲线下的面积为0.951，在积分光子流为140 fb^{-1}时，统计显著性为Z=9.11，与SPANet基线（Z=8.62）、Particle Transformer基线（Z=7.37）以及ATLAS分析（Z=5.13）相比。我们利用改进的信号提取，推导出SMEFT六维算符Φu、tt^{(1)}、qq^{(1)}、qt^{(1)}、qt^{(8)}的1-和2参数95%置信水平限，并投影了在HL-LHC积分光子流为1000 fb^{-1}和3000 fb^{-1}时的预期灵敏度，背景估计有50%的不确定性。

英文摘要

We present a phenomenological study of $t\bar{t}t\bar{t}$ production in proton-proton collisions at $\sqrt{s} = 13$~TeV, using a Hyper-Graph Neural Network (H-GNN) to discriminate multilepton signal events from the dominant SM backgrounds, namely $t\bar{t}W$, $t\bar{t}Z$, $t\bar{t}H$, $t\bar{t}VV$, single-top associated production, and diboson and triboson processes. In the H-GNN architecture each event is represented as a hypergraph whose nodes correspond to reconstructed jets and leptons and whose hyperedges encode higher-order correlations among arbitrary subsets of these objects, allowing the network to learn the many-body kinematic structures that characterize the $t\bar{t}t\bar{t}$ final state. Combining same-sign di-lepton, tri-lepton, and four-lepton channels following a CMS-like event selection, the H-GNN attains an area under the ROC curve of $0.951$ for the $t\bar{t}t\bar{t}$ signal and yields a statistical significance of $Z = 9.11$ at an integrated luminosity of $\mathcal{L} = 140~\mathrm{fb}^{-1}$, to be compared with $Z = 8.62$ for a SPANet baseline, $Z = 7.37$ for a Particle Transformer baseline, and $Z = 5.13$ obtained by the ATLAS analysis, evaluated under identical event selection. We exploit the improved signal extraction to derive one- and two-parameter $95\%$ confidence level limits on the Wilson coefficients of the dimension-six operators $\mathcal{O}_{Φu}$, $\mathcal{O}^{(1)}_{tt}$, $\mathcal{O}^{(1)}_{qq}$, $\mathcal{O}^{(1)}_{qt}$, and $\mathcal{O}^{(8)}_{qt}$, and we project the expected sensitivity at the HL-LHC integrated luminosities of $1000~\mathrm{fb}^{-1}$ and $3000~\mathrm{fb}^{-1}$ with $50\%$ uncertainty on the background estimation.
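For readers unfamiliar with how a significance such as $Z = 9.11$ relates to event counts, the sketch below uses the standard Asimov approximation for a simple counting experiment. The abstract does not say which definition of $Z$ the authors use, so the formula choice is an assumption, and the yields `s` and `b` are made-up illustrative numbers rather than values from the paper.

```python
import math


def asimov_significance(s: float, b: float) -> float:
    """Median discovery significance for s signal events over b background events.

    Uses Z = sqrt(2 * ((s + b) * ln(1 + s / b) - s)) and ignores any
    uncertainty on the background. Assumption: this is a textbook formula,
    not necessarily the one used in the paper.
    """
    if s < 0 or b <= 0:
        raise ValueError("need s >= 0 and b > 0")
    inner = (s + b) * math.log1p(s / b) - s
    # Guard against a tiny negative value caused by floating-point rounding.
    return math.sqrt(2.0 * max(0.0, inner))


# Hypothetical yields, chosen only to illustrate the formula.
z = asimov_significance(s=30.0, b=100.0)
simple = 30.0 / math.sqrt(100.0)  # the small-signal limit s / sqrt(b)
```

For small `s / b` the two numbers agree closely; the Asimov value falls below `s / sqrt(b)` as the signal becomes comparable to the background.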


2605.18380 2026-05-19 cs.AI 版本更新

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench: 一个评估语言模型进行定性空间和时间推理能力的新基准

Anthony G. Cohn, Robert E. Blackwell

Affiliations: School of Computer Science, The University of Leeds; The Alan Turing Institute; Tongji University

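AI summary: This paper presents QSTRBench, a benchmark for evaluating the qualitative spatial and temporal reasoning abilities of large language models. Using tasks on compositional reasoning, converse relations, and conceptual neighbourhoods across several calculi, it shows that model performance varies by calculus, with PA the easiest and RCC-22 the hardest.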

Comments 74 pages, 20 figures

AI中文摘要

我们介绍了一个广泛的定性空间和时间推理（QSTR）基准，用于评估大语言模型（LLMs）。我们提出了关于组合推理（使用组合表，CT）、反向关系和概念邻域（CN）的问题，针对QSTR算式、点代数（PA）、Allen区间代数、区间和持续时间（INDU）、区域连接算式（RCC-5、RCC-8和RCC-22）、九交模型、方向算式和STAR。RCC-22的CN首次在此发布。一个扩展的基准系统性地变化了问题呈现方式，包括前缀/后缀、词语/符号/非正式术语和图示描述，针对选定的算式。我们报告了当前前沿模型的结果。所有测试的模型都比猜测表现更好，但没有模型能一致正确回答所有问题。性能在不同算式之间差异显著，PA最简单，RCC-22最难。我们发布了该基准和我们的结果，以开放许可证发布，以促进进一步评估语言模型在定性空间/时间推理方面的能力。

英文摘要

We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for the following QSTR calculi: Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine-intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation, including prefix/infix notation, words/symbols/nonce terms, and schematic descriptions, for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing, but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward and RCC-22 the most difficult. We release the benchmark and our results under an open licence to facilitate further assessment of qualitative spatio-temporal reasoning in LLMs.
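To make "composition table" and "converse relation" concrete, here is a minimal sketch for the Point Algebra, the simplest calculus named above. It encodes the three base relations between points on a line; the function and variable names are illustrative and not taken from the benchmark.

```python
# Base relations of the Point Algebra between two points on a line.
BASE = ("<", "=", ">")

# Converse: if x R y holds, then y CONVERSE[R] x holds.
CONVERSE = {"<": ">", "=": "=", ">": "<"}


def compose(r1: str, r2: str) -> frozenset:
    """Possible relations between x and z, given x r1 y and y r2 z."""
    if r1 not in BASE or r2 not in BASE:
        raise ValueError("unknown base relation")
    if r1 == "=":
        return frozenset({r2})
    if r2 == "=":
        return frozenset({r1})
    if r1 == r2:
        return frozenset({r1})  # e.g. x < y and y < z implies x < z
    return frozenset(BASE)      # x < y and y > z leaves x versus z open


# The full 3 x 3 composition table, keyed by (r1, r2).
COMPOSITION_TABLE = {(a, b): compose(a, b) for a in BASE for b in BASE}
```

A benchmark question of the compositional kind then amounts to asking for one cell of such a table, for example `COMPOSITION_TABLE[("<", ">")]`.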


2605.18374 2026-05-19 cs.LG cs.AI 版本更新

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

超越推理时间搜索：强化学习合成可重用求解器

Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge

Affiliations: ETH Zürich; University of Maryland

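AI summary: This paper asks whether reinforcement learning can shift the reasoning cost of combinatorial optimization into the weights of a code LLM, so that the model synthesizes reusable solvers. On the Synergistic Dependency Selection problem, it finds that reinforcement learning reliably yields a constraint-aware Simulated Annealing template that is far cheaper to run than Best-of-64 sampling, with narrower but positive evidence of transfer to Job Shop Scheduling.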

AI中文摘要

大型语言模型（LLMs）通常将组合优化视为推理时间的过程，通过采样、搜索或重复提示单独解决每个实例。我们询问强化学习是否可以将部分推理成本转移到代码LLM的权重中，从而让模型为整个问题家族合成可重用的求解器。我们研究了Synergistic Dependency Selection（SDS），一种受约束的二次背包问题的受控变体，旨在暴露特定的失败模式：局部信号和严格可行性约束使贪心启发式方法具有吸引力但不可靠。在相同的框架下，Best-of-64基础模型采样在接近全局虚拟最佳求解器（VBS）的28.7%差距处饱和；代码审计显示基础模型经常检索模拟退火模板但错误实现Metropolis接受规则。我们使用可行性门控奖励和轻量结构框架对Qwen2.5-Coder-14B-Instruct进行微调，使用组相对策略优化（GRPO）。所得到的策略在99.8%的可行SDS输出中收敛到一个约束感知的模拟退火模板，达到VBS的5.0%差距，并且在生成后执行/搜索成本方面比累积Best-of-64评估便宜91倍。一次编译检查显示，每个种子的最优冻结求解器在SDS测试集上重复使用时仍然高度竞争，而额外领域评估在作业调度问题上提供了更窄但积极的证据，表明框架可以超越SDS。负消融揭示了这种配方的局限性：标准稳定器会降低性能，软可行性门控失败，结果仍对奖励归一化和领域特定设计选择敏感。

英文摘要

Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.
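Since the abstract singles out the Metropolis acceptance rule as the step the base model gets wrong, a minimal sketch of the textbook rule for a minimization problem may help. The abstract does not describe the specific errors found in the audit, and the function name and signature here are illustrative assumptions.

```python
import math
import random


def metropolis_accept(delta_cost: float, temperature: float, rng=random) -> bool:
    """Decide whether to accept a move that changes the cost by delta_cost.

    Improving or neutral moves (delta_cost <= 0) are always accepted.
    Worsening moves are accepted with probability exp(-delta_cost / T),
    which shrinks as the temperature T is lowered.
    """
    if delta_cost <= 0:
        return True
    if temperature <= 0:
        return False  # at zero temperature only non-worsening moves pass
    return rng.random() < math.exp(-delta_cost / temperature)


# Rough empirical check of the acceptance rate for a worsening move.
trials = 10_000
sampler = random.Random(0)
accepted = sum(metropolis_accept(1.0, 2.0, sampler) for _ in range(trials))
rate = accepted / trials  # should be near exp(-0.5)
```

Common ways to get this rule wrong include flipping the sign of the exponent or rejecting every worsening move, which turns the search into plain hill climbing.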


2605.18349 2026-05-19 cs.CV cs.AI 版本更新

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

通过参数自由注意力机制优化CSRNet以实现公共交通中的人群计数

Aida Rostamza, Enrico Del Re, Joshua Cherian Varughese, Cristina Olaverri-Monreal

Affiliations: Johannes Kepler University Linz; Department Intelligent Transport Systems

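AI summary: This paper studies the effectiveness of parameter-free attention mechanisms for crowd counting and density map estimation in congested scenes. It proposes PFCASA, a new attention mechanism combining PFCA and SA, and evaluates it on the ShanghaiTech dataset, with video streams onboard public transport as the target application.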

AI中文摘要

占用估计和人群计数是设计智能高效公共交通车辆的关键任务。鉴于公共交通载客量可能从稀疏到拥挤变化，传统的占用估计模型必须适应这一目的。注意力机制在增强深度神经网络在拥挤场景中的人群计数能力方面表现出显著优势，尤其是在存在遮挡、复杂背景和透视畸变的情况下。然而，传统方法通常作为卷积层中的参数化子网络实现，不可避免地增加了模型大小和计算成本，限制了在资源受限的边缘设备上的部署。本文研究了最先进的参数自由注意力机制在高度拥挤场景中的人群计数和密度图估计中的有效性。我们评估了通道级（PFCA）、空间级（SA）和三维级（SimAM）模块，并将其性能与参数化注意力模块进行比较，后者限制引入不超过1%的额外参数。此外，我们提出了一种新的注意力机制组合，结合PFCA和SA（PFCASA）以分析公共交通系统内的视频流。使用CSRNet作为骨干网络，在ShanghaiTech数据集上的实验表明，参数自由注意力机制在不引入额外模型参数的情况下实现了可比或更优的准确性。详细的性能分析进一步揭示，PFCASA在少于40人的场景中优于其他注意力模块，而PFCA在人群密度增加时表现出更大的有效性，凸显了其在智能公共交通模式中的应用潜力。

英文摘要

Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.


2605.18346 2026-05-19 cs.CV cs.AI 版本更新

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

聚焦强制：面向内容的每帧KV选择用于高效的自回归视频扩散

Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang, Weile Mo, Yue Ma, Shikang Zheng, Jiehang Huang, Dongrui Liu, Linfeng Zhang

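Affiliations: SJTU (Shanghai Jiao Tong University); SDU (Shandong University); HUST (Huazhong University of Science and Technology); UTokyo (University of Tokyo); HKUST (Hong Kong University of Science and Technology); SCUT; Shanghai AI Lab (Shanghai Artificial Intelligence Laboratory)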

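AI summary: This paper proposes a training-free KV selection method that keeps the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, making autoregressive video diffusion more efficient without sacrificing quality.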

AI中文摘要

近期在自回归视频扩散领域的进展使得序列和流式视频生成成为可能。然而，长视界生成需要越来越大的KV缓存，这使得在不牺牲质量的情况下实现高效的压缩具有挑战性。现有方法大多基于注意力分数选择历史帧，但它们的上下文决策仍然粗略。当同一块中生成多个帧时，这些方法通常对整个块应用共享的历史选择，仅通过注意力对历史帧评分，并将头预算均匀或通过注意力模式启发式分配，而不是显式估计头重要性。我们发现同一生成块中的帧可能依赖于不同的历史帧，同一历史帧在与当前帧的相对时间距离变化时可能获得不同的注意力分数，且屏蔽不同头会引发不均等的生成退化。受这些发现的启发，我们提出了Focused Forcing，一种无需训练的KV选择方法，该方法在生成帧和头维度上聚焦缓存历史。对于每个生成帧，Focused Forcing通过结合注意力分数和历史帧的多样性分数保留最相关和有区别的历史帧，同时将较大的预算分配给估计重要性更高的头。在多个自回归生成范式中，Focused Forcing在不训练的情况下实现了高达1.48倍的端到端加速，同时提高了视觉质量和文本对齐。

英文摘要

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose Focused Forcing, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $1.48\times$ end-to-end acceleration without training, while improving visual quality and text alignment. Our code will be released on GitHub.


2605.18332 2026-05-19 cs.SE cs.AI 版本更新

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

同信号，不同语义：软件工程代理跨框架行为分析

Wei Ma, Zhi Chen, Jingxu Gu, Tianling Li, Shangqing Liu, Lingxiao Jiang

Affiliations: Singapore Management University; Nanjing University; Nanyang Technological University

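AI summary: Through a large-scale empirical analysis of how software engineering agents behave under different frameworks, this paper finds that the same behavioral signal can carry opposite meaning in different frameworks, and stresses the importance of cross-framework validation.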

AI中文摘要

对基于大语言模型的软件工程代理进行行为研究，提取出关于轨迹形状与高分辨率率相关的操作规则：测试步骤跟随代码修改、错误级联较短或轨迹紧凑。每条规则通常源自单一框架，但其在结构不同的代理设计中是否转移，无论是符号还是幅度，尚未直接测试。我们在此生态系统层面进行研究：64,380次SWE-bench运行，涵盖126种代理配置，跨越43种框架，每种配置将LLM与一个框架（如SWE-Agent、OpenHands）配对，该框架提供其工具和工作流。我们通过固定每一层来分离框架效应与LLM效应，然后测量每种配置的行为-结果效应，并检查这些效应的一致性或分歧。在固定LLM的情况下，更换框架导致每个动作特征产生显著的行为差异。在大多数信号上，配置不仅在幅度上存在分歧，还在方向上存在分歧。错误率是清晰的案例：47种配置在错误率较低时解决更多问题，而48种配置在错误率较高时解决更多问题。其他五个连续特征和七个二元模式中的三个显示相似的方向分歧。框架身份比LLM家族解释了更多的变化：对于平均转弯数，框架解释了64%的配置间变异，而LLM仅解释了10%。这意味着相同的可观察行为信号可能在不同代理配置中具有相反的意义。因此，任何单一框架的行为发现应在跨配置验证后再被声称具有普遍性。

英文摘要

Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.
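A statement such as "framework explains 64% of the between-configuration variance" can be read as a variance decomposition over grouped data. The sketch below computes one common version, the ratio of between-group to total sum of squares (eta squared). This is an assumption about the general idea only: the abstract does not state which estimator the authors use, and the numbers in the usage lines are invented for illustration.

```python
from collections import defaultdict


def variance_explained(values, groups):
    """Fraction of the total variance in `values` explained by group means.

    Computes eta squared: between-group sum of squares divided by the
    total sum of squares. Returns a number between 0 and 1.
    """
    if len(values) != len(groups) or not values:
        raise ValueError("values and groups must be non-empty and equal length")
    grand_mean = sum(values) / len(values)
    ss_total = sum((v - grand_mean) ** 2 for v in values)
    if ss_total == 0:
        raise ValueError("values have no variance to explain")
    by_group = defaultdict(list)
    for v, g in zip(values, groups):
        by_group[g].append(v)
    ss_between = sum(
        len(vs) * (sum(vs) / len(vs) - grand_mean) ** 2 for vs in by_group.values()
    )
    return ss_between / ss_total


# Hypothetical mean-turn counts for six configurations (made-up numbers).
mean_turns = [12.0, 14.0, 30.0, 34.0, 20.0, 22.0]
frameworks = ["A", "A", "B", "B", "C", "C"]
llms = ["x", "y", "x", "y", "x", "y"]
by_framework = variance_explained(mean_turns, frameworks)
by_llm = variance_explained(mean_turns, llms)
```

In this toy data the framework grouping explains far more variance than the LLM grouping, which mirrors the kind of comparison the abstract reports.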


2605.18327 2026-05-19 cs.AI 版本更新

Causely: A Causal Intelligence Layer for Enterprise AI. A Benchmark Study on SRE and Reliability Workflows

Causely: 企业AI中的因果智能层一项关于SRE和可靠性工作流的基准研究

Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller, Shmuel Kliger

Affiliations: Causely

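AI summary: This paper proposes Causely, a causal intelligence layer for enterprise AI that maintains a structured representation of environment topology, attribute dependencies, and causal relationships, giving AI agents the semantic and causal grounding to diagnose, assess impact, and act safely in production. Its value proposition is evaluated in a benchmark study that injects faults, in a controlled setting, into a 24-microservice OpenTelemetry demo application.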

AI中文摘要

目前，部署到SRE工作流中的AI代理在查询时从原始可观测性遥测中获取对环境状态的理解，这在令牌、延迟和推断可靠性上产生了语义解释的代价。我们提出了Causely，一种因果智能层，它维护了环境拓扑、属性依赖性和因果关系的结构化表示，这些关系锚定在受管理环境的本体表示上。Causely将原始遥测转换为一个实时、可查询的模型，为AI代理提供所需的语义和因果基础，以诊断、评估影响并在生产环境中安全地行动。我们通过在受控环境下注入故障的24微服务OpenTelemetry演示应用进行基准研究来评估这一价值主张。我们的实验比较了四种代理配置（Claude Code、OpenAI Codex、HolmesGPT与Sonnet和Gemini后端）。实验在两种场景下进行：活跃事件和健康基线，分别有和无访问Causely。在活跃故障场景中，因果基础将平均诊断时间减少63%，平均令牌消耗减少60%，平均工具调用次数减少78%，将调查足迹压缩了4.8倍，并降低了每运行的直接API成本57%；根因诊断准确率从75%提升到100%。

英文摘要

AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability. We propose Causely, a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships that are anchored to an ontological representation of the managed environment. Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production. We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application. Our experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends). Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline. On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63%, mean token consumption by 60%, and mean tool-call count by 78%, compressing the investigation footprint by $4.8\times$ and lowering direct API cost per run by 57%; root-cause-diagnosis accuracy rises from 75% to 100%.


2605.18320 2026-05-19 cs.LG cs.AI 版本更新

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

ISEP: 通过随机策略优化实现离线强化学习的隐式支持扩展

Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji

Affiliations: The Chinese University of Hong Kong, Shenzhen Longgang

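AI summary: This paper proposes ISEP, which implicitly expands the feasible action support in offline reinforcement learning through stochastic policy optimization, addressing the difficulty conventional methods have in discovering optimal behaviors under strict safety constraints. Its core contributions are a value function interpolated between in-distribution data and policy samples and a stochastic action selection strategy, which together create a navigable path for policy improvement.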

AI中文摘要

离线强化学习方法通常强制严格的约束以确保安全；然而这种刚性往往阻止了在行为策略即时支持之外发现最优行为。为了解决这个问题，我们提出了通过随机策略优化实现的隐式支持扩展（ISEP），该方法利用在分布数据和策略样本之间插值的价值函数，以隐式方式扩展可行动作支持。这种机制“密集化”高奖励区域，为策略改进创建可导航路径，同时在理论上保证价值误差的有界性。然而，优化此扩展支持会创建多模态景观，标准确定性平均会导致模式崩溃和无效动作。ISEP通过随机动作选择策略缓解了这一问题，通过随机交替保守克隆和乐观扩展信号来优化策略。我们通过使用条件流匹配利用分类器免费引导，将此框架实例化为ISEP-FM，以有效捕捉插值的价值信号。

英文摘要

Offline reinforcement learning methods typically enforce strict constraints to ensure safety, yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM, using Conditional Flow Matching with classifier-free guidance to effectively capture the interpolated value signal.


2605.18309 2026-05-19 cs.LG cs.AI 版本更新

Alignment Dynamics in LLM Fine-Tuning

在LLM微调中的对齐动力学

Yuhan Huang, Huanran Chen, Yinpeng Dong

Affiliations: Shanghai Qi Zhi Institute & University of Tokyo; College of AI, Tsinghua University

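AI summary: This paper studies the dynamics of alignment during LLM fine-tuning. It introduces a tractable alignment score and derives its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Decomposing the alignment update into two competing components, a Rebound Force and a Driving Force, explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens that reversal. The framework also predicts a "Rehearsal Priming Effect": prior alignment leaves a latent posterior imprint that amplifies the Driving Force upon re-exposure, leading to faster re-alignment.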

AI中文摘要

尽管大型语言模型（LLMs）通过监督微调和人类反馈强化学习实现了强大的对齐，但在后续微调中对齐往往容易崩溃。现有的解释要么将对齐脆弱性归因于梯度几何，要么将其描述为模型输出的分布转移，但很少有研究能提供一个统一的框架，将参数空间的学习动态与函数空间的对齐行为联系起来。在本文中，我们引入了一个可计算的对齐评分，并推导了其在微调过程中的闭式更新公式，从而建立了对齐动态的统一框架。我们的分析将对齐更新分解为两个竞争成分：一种由当前对齐状态和模型分布狭窄性共同决定的“反弹力”，以及一种由训练分布与条件后验对齐和非对齐完成的后验对齐程度决定的“驱动力”。这种分解解释了为何先前的对齐可能被后续微调逆转，以及为何更狭窄的后验结构会增强这种逆转。此外，我们的框架预测了“复习强化效应”：先前的对齐会在重新暴露时留下潜在的后验印记，从而增强驱动力，导致更快的重新对齐。我们通过安全对齐、新兴不一致和情感设置验证了这些预测，展示了在重新暴露下一致的对齐逆转和加速的重新对齐。此外，安全对齐的受控实验确认了预测的反弹强度与后验狭窄性之间的依赖关系。这些结果共同提供了一个统一的动态视角，说明在LLM微调过程中对齐是如何被破坏和重新激活的。

英文摘要

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a Rebound Force, governed jointly by the current alignment state and the narrowness of model distribution, and a Driving Force, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a Rehearsal Priming Effect: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.


2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO 版本更新

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer: 通过端口-哈密顿生成动力学构建一个物理驱动的世界模型

Xueyu Luan, Chenwei Shi

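AI summary: This paper proposes PH-Dreamer, a physics-driven world model built on a Port-Hamiltonian framework. Three synergistic mechanisms improve world models based on recurrent state-space architectures, giving a more compact and physically structured representation, higher internal simulator fidelity, and reduced latent phase space volume, energy consumption, and mean squared jerk.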

Comments 12 pages, 3 figures

AI中文摘要

基于递归状态空间架构构建的世界模型能够实现高效的潜在想象，但仍然缺乏物理结构，导致动力学违反守恒和耗散原理。我们引入了一个统一的端口-哈密顿框架，通过三种协同机制来解决这一问题。首先，我们将隐含的物理先验嵌入到递归转换中，通过将投影的潜在演变建模为受流动和耗散控制的能量路由，使投影的PH相空间偏向于更紧凑且物理结构化的表示。其次，我们开发了一个具有运动学意识的能量世界模型，该模型从本体感觉观察估计哈密顿量和功率平衡，提供了一个明确的物理信号用于热力学推理。第三，利用这些能量梯度，我们建立了基于能量的Actor-Critic，利用拉格朗日乘数来正则化策略优化，使其朝着更低的能量和更平滑的控制方向发展。在视觉控制基准测试中，该范式不仅实现了更优的渐近回报，还通过在想象奖励和真实奖励之间建立更紧密且方差更低的对齐关系，提高了内部模拟器的保真度，同时将潜在相空间体积减少了4.18-8.41%，能量消耗降低了高达7.80%，平均加速度平方降低了高达9.38%。

英文摘要

World models built on recurrent state-space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action-controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics-aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy-guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower-variance alignment between imagined and real rewards, all while reducing latent phase space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.


2605.18299 2026-05-19 cs.AI cs.CL cs.IR 版本更新

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

SD-Search: 用于搜索增强推理的在线策略 hindsight 自监督学习

Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

Affiliations: Kuaishou Technology

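AI summary: This paper proposes SD-Search, a search-augmented reasoning method based on on-policy hindsight self-distillation, in which the policy itself produces fine-grained, step-level supervision without an external teacher model or additional annotations.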

AI中文摘要

搜索增强推理代理将内部推理与外部检索器的调用交替进行，其性能依赖于每次发出的查询质量。然而，在基于结果奖励的强化学习中，每个搜索决策在展开过程中共享同一轨迹级奖励，使个体查询缺乏步级信用。最近的过程监督方法通过从政策外部获取步级信号来解决这一差距，依赖于一个更大的教师模型或由更强的外部系统生成的子问题注释。相比之下，我们提出了SD-Search，通过在线策略的hindsight自监督学习自身生成步级监督，无需外部教师或额外标注。在SD-Search中，一个模型扮演两个角色：学生只看到推理时可用的上下文，而教师还根据一个紧凑的hindsight块总结了搜索查询和一组从同一问题采样的展开的最终结果。由于教师知道每个展开的展开过程和哪些成功，其查询分布隐含地标记了哪些决策值得做出，学生通过最小化token级的Jensen-Shannon散度来恢复这种行为。这在GRPO的粗粒度轨迹奖励上叠加了密集的步级信号。关键的是，这个信号由策略本身在标准RL训练循环中生成，无需外部模型推理、辅助标注流程或额外的训练阶段。

英文摘要

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen--Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.
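The training signal described above is a token-level Jensen-Shannon divergence between the student's and the teacher's next-token distributions. A minimal sketch of that divergence for a single position follows. The three-token vocabulary and probabilities are toy values, and how the paper aggregates or weights positions is not stated in the abstract, so none of that is modeled here.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions.

    JS(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m) with m = (p + q) / 2.
    It is symmetric and bounded above by ln 2.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Toy next-token distributions at one search-query position (made-up values).
student = [0.7, 0.2, 0.1]
teacher = [0.4, 0.5, 0.1]
loss_at_position = js_divergence(student, teacher)
```

Unlike a plain KL term, the mixture `m` is positive wherever either distribution is, so the value stays finite even when one model assigns zero probability to a token.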


2605.18298 2026-05-19 cs.AI cs.HC cs.LG 版本更新

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

DARE-EEG: 一种用于挖掘双对齐表示的EEG基础模型

Yang Shao, Peiliang Gong, Qun Dai, Daoqiang Zhang

Affiliations: College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics

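AI summary: This paper proposes DARE-EEG, a self-supervised foundation model pre-trained with dual-aligned representation learning so that EEG encoders learn representations invariant to incomplete observations. Mask alignment applies contrastive learning across masked views, anchor alignment to momentum-updated complete features provides semantic stability, and a conv-linear-probing strategy adapts to heterogeneous electrode configurations and sampling rates. Experiments show strong results on EEG benchmarks.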

Comments 22 pages, 10 pages of main text + 12 pages of appendices

详情

AI中文摘要

通过在大规模EEG数据上进行掩码重建预训练，基础模型已成为在多样化脑机接口应用中学习通用神经表示的有前景范式。然而，一个关键但被忽视的挑战是EEG编码器必须学习对不完整观测不变的表示——当不同掩码视图的同一信号有最小重叠时，现有方法无法将它们约束到一致的潜在子空间，导致转移性下降。为此，我们提出DARE-EEG，一种自监督基础模型，通过预训练期间的双对齐表示学习显式强制掩码不变性。具体而言，我们引入掩码对齐，通过对比学习约束同一EEG样本多个掩码视图的表示，补充锚点对齐，将掩码表示对齐到动量更新的完整特征以实现语义稳定性。此外，我们提出卷积-线性探针，一种参数高效策略，通过解耦频谱-空间投影适应异构电极配置和采样率。在多样化的EEG基准测试中，广泛实验表明DARE-EEG在准确性表现上始终领先，同时保持相对较低的参数复杂度和优于现有方法的跨数据集可移植性。此外，DARE-EEG有助于有效发现和利用EEG中的丰富潜在表示。

英文摘要

Foundation models pre-trained through masked reconstruction on large-scale EEG data have emerged as a promising paradigm for learning generalizable neural representations across diverse brain-computer interface applications. However, a critical yet overlooked challenge is that EEG encoders must learn representations invariant to incomplete observations: when different masked views of the same signal have minimal overlap, existing methods fail to constrain them to a consistent latent subspace, leading to degraded transferability. To address this, we propose DARE-EEG, a self-supervised foundation model that explicitly enforces the mask-invariance property through dual-aligned representation learning during pre-training. Specifically, we introduce mask alignment, which constrains representations from multiple masked views of the same EEG sample via contrastive learning, complementing anchor alignment, which aligns masked representations to momentum-updated complete features for semantic stability. Additionally, we propose conv-linear-probing, a parameter-efficient strategy that adapts pre-trained representations to heterogeneous electrode configurations and sampling rates through decoupled spectro-spatial projections. Extensive experiments across diverse EEG benchmarks demonstrate that DARE-EEG consistently achieves state-of-the-art accuracy while maintaining relatively low parameter complexity and superior cross-dataset portability compared to existing methods. Furthermore, DARE-EEG contributes to effectively discovering and utilizing the rich potential representations in EEG.
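The phrase "momentum-updated" usually refers to a target network whose weights track an online network through an exponential moving average. The sketch below shows that update on flat lists of weights. The abstract does not give the update rule or the momentum value, so both the EMA form and the default of 0.99 are assumptions borrowed from common practice in momentum-encoder methods.

```python
def momentum_update(target, online, momentum=0.99):
    """Return new target weights as an exponential moving average.

    new_target_i = momentum * target_i + (1 - momentum) * online_i
    A momentum close to 1 makes the target change slowly, which is what
    gives the anchor features their stability.
    """
    if not 0.0 <= momentum <= 1.0:
        raise ValueError("momentum must lie in [0, 1]")
    if len(target) != len(online):
        raise ValueError("weight lists must have the same length")
    return [momentum * t + (1.0 - momentum) * o for t, o in zip(target, online)]


# Illustrative weights only; a real encoder would hold tensors per layer.
target_weights = [1.0, -2.0, 0.5]
online_weights = [0.0, 0.0, 0.0]
for _ in range(10):
    target_weights = momentum_update(target_weights, online_weights, momentum=0.9)
```

After ten steps with momentum 0.9, each target weight has decayed toward the online value by a factor of roughly 0.9 to the tenth power.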


2605.18284 2026-05-19 cs.SE cs.AI 版本更新

Are Sparse Autoencoder Benchmarks Reliable?

David Chanin

Affiliations: Decode Research, MATS, UCL

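AI summary: This study evaluates the reliability of sparse autoencoder (SAE) benchmarks. Two metrics fail under multiple lenses, and the others are noisier and less discriminative than the field assumes, indicating that better SAE benchmarks are needed.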

AI中文摘要

稀疏自编码（SAEs）是大型语言模型的核心可解释性工具，其进展依赖于能够可靠区分更好和更差SAE的基准测试。我们通过三种互补的视角审计了SAEBench中SAE质量指标：固定SAE上的重新播种噪声、合成SAE上的真实相关性以及训练轨迹的可区分性。我们发现，两个指标，即目标探测扰动（TPP）和虚假相关性消除（SCR），在它们的典型设置下未能通过多个视角，不应用于评估SAE。其他指标显示出更高的重新播种噪声和更低的可区分性，比领域假设的要差。sae-probes变体的k-稀疏探测是我们在测试中发现最可靠的指标，但即使sae-probes也难以区分同一体系结构的不同变体。我们的结果表明，领域需要更好的SAE基准测试。

英文摘要

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

2605.18226 2026-05-19 cs.CL cs.AI Updated version

Context Memorization for Efficient Long Context Generation

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

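Affiliations: Institute of Science Tokyo, Japan; Imperial College London, UK

AI summary: This paper proposes a training-free context memorization method that externalizes the prefix into a lightweight lookup table of precomputed attention states, improving the accuracy and efficiency of long-context generation while reducing attention latency.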

Abstract

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on the NBA benchmark using only 20% of its memory footprint.

2605.18211 2026-05-19 cs.CL cs.AI Updated version

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu, Evgeny Kharlamov, Steffen Staab

Affiliations: Analytic Computing, KI, University of Stuttgart, Stuttgart, Germany; Bosch Center for Artificial Intelligence, Stuttgart, Germany; WAIS, University of Southampton, United Kingdom

AI summary: This paper proposes GA-S2S, a sequence-to-sequence model that incorporates graph structure by combining a T5-small encoder-decoder with a relational graph attention network (RGAT) to improve knowledge graph link prediction.

Comments 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10--14, 2026, Dubrovnik, Croatia

2605.18209 2026-05-19 cs.CV cs.AI Updated version

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

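Affiliations: National Taiwan University; Delta Robotics Innovation Center

AI summary: This paper proposes SpatioRoute, a dynamic prompt generation method that routes each question to a semantically tailored prompt template. It requires no additional training or 3D sensor input and improves zero-shot spatial reasoning. The authors also find that Chain-of-Thought prompting performs poorly for spatial video understanding.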

Comments 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026

Abstract

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.
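
The rule-based mode (SpatioRoute-R) can be pictured as a lookup from the leading word of a question to a prompt template. The sketch below is only an illustration under assumptions: the template wording, the `TEMPLATES` table, the `route_question` helper, and the fallback template are hypothetical and are not taken from the paper.

```python
# Hypothetical prompt templates keyed by question type. The wording is
# illustrative only and does not come from the SpatioRoute paper.
TEMPLATES = {
    "what": "Identify the object or attribute asked about. Question: {q}",
    "is": "Answer yes or no based on the scene. Question: {q}",
    "how": "Describe the count, state, or manner asked about. Question: {q}",
    "can": "Judge whether the action is feasible in the scene. Question: {q}",
    "which": "Choose the direction or object that fits. Question: {q}",
}
# Assumed fallback for question types outside the table.
DEFAULT_TEMPLATE = "Answer the spatial question. Question: {q}"


def route_question(question: str) -> str:
    """Deterministically map a question to a prompt by its first word."""
    text = question.strip()
    words = text.split()
    key = words[0].lower() if words else ""
    template = TEMPLATES.get(key, DEFAULT_TEMPLATE)
    return template.format(q=text)
```

Because the routing depends only on the question text, it needs no video input at routing time, which matches the property the abstract states for the routing step.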

2605.18202 2026-05-19 cs.LG cs.AI Updated version

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

Samuele Bortolotti, Emanuele Marconato, Andrea Pugnana, Andrea Passerini, Stefano Teso

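Affiliations: Department of Information Engineering and Computer Science, University of Trento, Italy; CIMeC, University of Trento, Rovereto, Italy

AI summary: This paper proposes COCOCO, a framework that integrates conformal prediction to address overconfident label and concept predictions in neuro-symbolic concept-based models. It satisfies three desiderata (consistency, coverage, and conciseness) and improves model reliability.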

Abstract

Neuro-Symbolic Concept-based Models (NeSy-CBMs) are a family of architectures that integrate neural networks with symbolic reasoning for enhanced reliability in high-stakes applications. They work by first extracting high-level concepts from the input and then inferring a task label from these compatibly with given logical constraints. Yet, their label and concept predictions can be overconfident, making it difficult for stakeholders to gauge when the model's decisions can be trusted. We address this issue by integrating ideas from Conformal Prediction (CP), a framework providing rigorous, distribution-free coverage guarantees. We formalize three desiderata -- consistency, coverage, and conciseness -- that any conformal method for NeSy-CBMs should satisfy, and show that existing approaches fall short of at least one. We then introduce COCOCO, a post-hoc framework that conformalizes concepts and labels jointly and reconciles them via a single deduction-abduction revision step. COCOCO satisfies all three desiderata, retains distribution-free coverage, is robust to imperfect knowledge and supports user-specified size budgets. Our experiments on 8 data sets highlight how COCOCO compares favorably against competitors and natural baselines in terms of performance and set size.
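
For readers unfamiliar with conformal prediction, a minimal sketch of generic split conformal classification follows. It is not COCOCO: it shows only the standard calibration-quantile step behind distribution-free coverage, and the function names and the choice of `1 - p` as the nonconformity score are assumptions made for illustration.

```python
import math


def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: every label is kept.
        return math.inf
    return sorted(cal_scores)[k - 1]


def prediction_set(probs, threshold):
    """Keep every label whose nonconformity score 1 - p is within the threshold."""
    return {label for label, p in probs.items() if 1.0 - p <= threshold}
```

A smaller `alpha` raises the threshold and so enlarges the sets, which is the tension between coverage and conciseness that the abstract names.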

2605.18199 2026-05-19 cs.IR cs.AI Updated version

PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

Riccardo Terrenzi, Matteo Falconi, Serkan Ayvaz, Pierluigi Plebani

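Affiliations: Centre for Industrial Software, University of Southern Denmark; Department of Electronics, Information and Bioengineering, Politecnico di Milano

AI summary: Motivated by the rapid growth of tabular datasets in data lakes, data spaces, and open data portals, PIPER is a content-based table search method that uses table profiles and LLM-generated pseudoqueries for dense retrieval. It outperforms traditional metadata-based methods and TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.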

Comments 15 pages, 3 figures, accepted at DEXA'26

Abstract

The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering, where the goal is to select a single table to answer a question, rather than retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval. Designed for dataset search in poor-metadata settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.

2605.18197 2026-05-19 cs.RO cs.AI cs.CV Updated version

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

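Affiliations: Mobile Robotics Group (MRG); Visual and Multimodal Applied Learning Lab (VANDAL)

AI summary: This paper proposes using fixed external RGB cameras as Common Prior Maps for active, incremental 3D scene graph generation, fusing data from the robot's onboard cameras and the fixed external cameras to improve the efficiency and accuracy of scene understanding.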

Abstract

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras, onboard or external, identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.18181 2026-05-19 cs.AI cs.CL Updated version

Causal Bias Detection in Generative Artificial Intelligence

Drago Plecko

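Affiliations: Department of Statistics & Data Science

AI summary: This paper studies causal fairness in generative AI and derives new causal decomposition results that quantify the fairness impact of different causal pathways and of replacing real-world mechanisms with the generative model's mechanisms. The method is demonstrated by analyzing race and gender bias in large language models.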

Abstract

Automated systems built on artificial intelligence (AI) are increasingly deployed across high-stakes domains, raising critical concerns about fairness and the perpetuation of demographic disparities that exist in the world. In this context, causal inference provides a principled framework for reasoning about fairness, as it links observed disparities to underlying mechanisms and aligns naturally with human intuition and legal notions of discrimination. Prior work on causal fairness primarily focuses on the standard machine learning setting, where a decision-maker constructs a single predictive mechanism $f_{\widehat Y}$ for an outcome variable $Y$, while inheriting the causal mechanisms of all other covariates from the real world. The generative AI setting, however, is markedly more complex: generative models can sample from arbitrary conditionals over any set of variables, implicitly constructing their own beliefs about all causal mechanisms rather than learning a single predictive function. This fundamental difference requires new developments in causal fairness methodology. We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model's mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.

2605.10843 2026-05-19 cs.CL cs.AI cs.CY Updated version

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen, Chi-Nguyen Tran, Phu-Hoa Pham, Nguyen Lam Phu Quy, The Anh Han, Long Tran-Thanh

Affiliations: Faculty of Information and Technology, University of Science, Vietnam National University; Department of Computer Science, University of Warwick; School of Computing, Engineering and Digital Technologies, Teesside University

AI summary: This paper proposes DISCA, a method that calibrates on persona disagreement to reduce the cultural misalignment of large language models on MultiTP without changing model weights, offering a scalable alternative for serving global moral preferences.

Comments 57 pages, 1 figure, 6 MultiTP moral dimensions

Abstract

Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.

2605.10811 2026-05-19 math.OC cs.AI Updated version

Switching-Geometry Analysis of Deflated Q-Value Iteration

Donghwan Lee

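Affiliations: Department of Electrical Engineering

AI summary: This paper develops a joint spectral radius framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, the author uses the geometry of switching systems to give the first joint-spectral-radius-based convergence analysis of deflated Q-VI for policy optimization problems. The analysis shows that the joint spectral radius of the standard Q-VI switching system model equals the discount factor γ ∈ (0,1) exactly, because all admissible subsystems share the all-ones vector as an invariant direction. Passing to the quotient space that removes this direction yields a projected switching system model whose joint spectral radius governs the relevant error dynamics and may be strictly smaller than γ. Deflated Q-VI therefore admits a potentially sharper convergence-rate characterization than the ambient-space γ bound. Finally, the correction is shown to be equivalent to a scalar recentering of standard Q-VI, so the projected trajectory and the resulting greedy-policy sequence match those of standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision problem but a more precise joint-spectral-radius description of the convergence geometry once the redundant all-ones component is removed.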

Abstract

This paper develops a joint spectral radius (JSR) framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, we interpret the resulting algorithm through the geometry of switching systems and, to the best of our knowledge, give the first JSR-based convergence analysis of deflated Q-VI for policy optimization problems. Our analysis reveals that the standard Q-VI switching system model has JSR exactly the discount factor $\gamma \in (0,1)$, since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than $\gamma$. Therefore, the deflated Q-VI admits a potentially sharper convergence-rate characterization than the ambient-space $\gamma$-bound. Finally, we prove that the correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry after the redundant all-ones component is removed.
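
The claim that a scalar recentering leaves the greedy policy unchanged can be checked numerically on a small random MDP. The sketch below is an illustration under assumptions, not the paper's algorithm: it recenters by subtracting the mean Q-entry after each Bellman step, which is one possible scalar shift along the all-ones direction, and the MDP, `compare`, and all other names are hypothetical.

```python
import random


def bellman(q, rewards, trans, gamma):
    """One step of the Bellman optimality operator on a tabular Q."""
    n_s, n_a = len(q), len(q[0])
    values = [max(row) for row in q]
    return [
        [
            rewards[s][a] + gamma * sum(trans[s][a][t] * values[t] for t in range(n_s))
            for a in range(n_a)
        ]
        for s in range(n_s)
    ]


def recenter(q):
    """Subtract the mean entry, i.e. a scalar multiple of the all-ones vector."""
    flat = [x for row in q for x in row]
    mean = sum(flat) / len(flat)
    return [[x - mean for x in row] for row in q]


def greedy(q):
    """Greedy action index for each state."""
    return [max(range(len(row)), key=row.__getitem__) for row in q]


def compare(n_s=4, n_a=3, gamma=0.9, steps=30, seed=0):
    """Run standard and recentered Q-VI side by side on a random MDP."""
    rng = random.Random(seed)
    rewards = [[rng.random() for _ in range(n_a)] for _ in range(n_s)]
    trans = []
    for _ in range(n_s):
        per_action = []
        for _ in range(n_a):
            w = [rng.random() for _ in range(n_s)]
            total = sum(w)
            per_action.append([x / total for x in w])  # normalized transition row
        trans.append(per_action)
    q_std = [[0.0] * n_a for _ in range(n_s)]
    q_rec = [[0.0] * n_a for _ in range(n_s)]
    same_policy = True
    for _ in range(steps):
        q_std = bellman(q_std, rewards, trans, gamma)
        q_rec = recenter(bellman(q_rec, rewards, trans, gamma))
        same_policy = same_policy and greedy(q_std) == greedy(q_rec)
    # The two iterates should differ by a single constant across all entries.
    offsets = [a - b for rs, rr in zip(q_std, q_rec) for a, b in zip(rs, rr)]
    return same_policy, max(offsets) - min(offsets)
```

The check relies on the identity that shifting Q by a constant c shifts the Bellman update by γc in every entry, so the argmax in each state is unaffected.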

2605.10503 2026-05-19 cs.AI Updated version

SLASH the Sink: Sharpening Structural Attention Inside LLMs

Yiming Liu, Bin Lu, Xinbing Wang, Chenghu Zhou, Meng Jin

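Affiliations: Shanghai Jiao Tong University; Institute of Geographical Science and Natural Resources Research; Chinese Academy of Sciences

AI summary: This paper studies the internal mechanisms of large language models and finds that they spontaneously reconstruct graph topology, but that this structural understanding is diluted by the attention sink. It proposes SLASH, a plug-and-play attention redistribution method that strengthens internal structural understanding. Experiments show significant gains on pure graph tasks and molecular prediction.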

Abstract

Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lose generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (SLASH), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate that SLASH delivers significant and consistent performance gains across diverse LLMs.

2605.07263 2026-05-19 eess.SP cs.AI cs.DC cs.LG stat.ML Updated version

Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning

Hao Chen, Zavareh Bozorgasl

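Affiliations: Signal, Communication, and Learning Lab (SCALE Lab), Department of Electrical and Computer Engineering, Boise State University

AI summary: This paper proposes resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. It uses slow-timescale calibration of average channel powers but requires no instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, the authors derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension.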

Comments Preprint; under review. Code to replicate the results is available at: https://github.com/zavareh1/REED

Abstract

Over-the-air federated learning (OTA-FL) reduces uplink latency by aggregating client updates directly over the wireless multiple-access channel. Coherent analog aggregation realizes this idea by aligning the phases and amplitudes of simultaneously transmitted waveforms, which typically requires synchronization, instantaneous channel-state information (CSI), phase compensation, and power control. Noncoherent energy detection removes the need for phase-coherent combining, but a single energy measurement is nonnegative and, therefore, cannot represent signed model updates. This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. The construction uses slow-timescale calibration of average channel powers, but does not require instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, we derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension that spreads each coordinate over multiple independently faded paired chips. The resulting variance laws separate fading-induced self-noise, signal-noise interaction, and receiver-noise fluctuation, giving an explicit diversity-resource tradeoff. The rest of the abstract is in the paper.
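
The energy-difference idea can be simulated in a few lines. The sketch below is a simplified illustration, not the authors' implementation: it assumes unit average channel power for every client (standing in for the slow-timescale calibration), independent Rayleigh fading on each resource element, and equal receiver noise power on both elements. All function names are hypothetical.

```python
import math
import random


def complex_gaussian(rng, variance):
    """Draw a circularly symmetric complex Gaussian with the given variance."""
    std = math.sqrt(variance / 2.0)
    return complex(rng.gauss(0.0, std), rng.gauss(0.0, std))


def reed_single_shot(updates, noise_var, rng):
    """One noncoherent energy-difference estimate of sum(updates)."""
    y_pos = complex_gaussian(rng, noise_var)  # receiver noise on the "+" element
    y_neg = complex_gaussian(rng, noise_var)  # receiver noise on the "-" element
    for x in updates:
        # Positive and negative parts become transmit energies on paired elements.
        y_pos += complex_gaussian(rng, 1.0) * math.sqrt(max(x, 0.0))
        y_neg += complex_gaussian(rng, 1.0) * math.sqrt(max(-x, 0.0))
    # Equal noise power on both elements cancels in expectation.
    return abs(y_pos) ** 2 - abs(y_neg) ** 2


def average_estimate(updates, noise_var=0.1, trials=20000, seed=0):
    """Average many independent single-shot estimates."""
    rng = random.Random(seed)
    total = sum(reed_single_shot(updates, noise_var, rng) for _ in range(trials))
    return total / trials
```

Under these assumptions each single-shot estimate is unbiased for the signed sum but noisy, so the example averages independent draws to make the mean visible. This averaging is only a stand-in for illustration and is not the chip-diverse extension described in the abstract.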

2604.18652 2026-05-19 cs.CR cs.AI Updated version

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu, Lingjun Chen, Changran Xu, Shu Chi, Jianrong Ding, Zeju Li, Haomin Li, Li Jiang, Fangxin Liu, Qiang Xu

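Affiliations: Shanghai Jiao Tong University; Zhejiang University; Peking University; Tsinghua University

AI summary: This paper proposes a governance-first execution architecture and a semantic instruction set architecture (ISA) to help agentic AI move from brittle prototypes to production systems. By introducing a probabilistic processing unit and a deterministic neuro-symbolic kernel, it improves system security and reliability.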

Abstract

The transition of agentic AI from brittle prototypes to production systems is stalled by a pervasive crisis of craft. We suggest that the prevailing orchestration paradigm, which delegates the system control loop to large language models and merely patches it with heuristic guardrails, is the root cause of this fragility. Instead, we propose Arbiter-K, a Governance-First execution architecture that reconceptualizes the underlying model as a Probabilistic Processing Unit encapsulated by a deterministic, neuro-symbolic kernel. Arbiter-K implements a Semantic Instruction Set Architecture (ISA) to reify probabilistic messages into discrete instructions. This allows the kernel to maintain a Security Context Registry and construct an Instruction Dependency Graph at runtime, enabling active taint propagation based on the data-flow pedigree of each reasoning node. By leveraging this mechanism, Arbiter-K precisely interdicts unsafe trajectories at deterministic sinks (e.g., high-risk tool calls or unauthorized network egress) and enables autonomous execution correction and architectural rollback when security policies are triggered. Evaluations on OpenClaw and NanoBot demonstrate that Arbiter-K enforces security as a microarchitectural property, achieving 76% to 95% unsafe interception for a 92.79% absolute gain over native policies. The code is publicly available at https://github.com/cure-lab/ArbiterOS.
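
Taint propagation over a dependency graph, the mechanism the abstract mentions, can be sketched generically as a forward reachability pass. This is a textbook-style illustration and not Arbiter-K's kernel or ISA: the graph encoding, the node names in the usage note, and both helper functions are hypothetical.

```python
from collections import deque


def propagate_taint(edges, tainted_sources):
    """Return every node reachable from a tainted source in a dependency graph."""
    tainted = set(tainted_sources)
    queue = deque(tainted_sources)
    while queue:
        node = queue.popleft()
        for child in edges.get(node, ()):
            if child not in tainted:
                tainted.add(child)
                queue.append(child)
    return tainted


def blocked_sinks(edges, tainted_sources, sinks):
    """List the sinks whose inputs trace back to a tainted source."""
    tainted = propagate_taint(edges, tainted_sources)
    return sorted(s for s in sinks if s in tainted)
```

For instance, with edges `{"web_page": ["summary"], "summary": ["shell_call"]}` and `"web_page"` marked as tainted, the `"shell_call"` sink would be flagged, while a sink fed only by trusted nodes would pass.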

2604.12253 2026-05-19 cs.AI Updated version

A Scoping Review of Large Language Model-Based Pedagogical Agents

Shan Li, Juan Zheng

Affiliations: Department of Education and Human Services, College of Education, Lehigh University; Department of Community and Global Health, College of Health, Lehigh University

AI summary: This paper reviews the use of large language models in educational settings, examining the design dimensions of pedagogical agents, emerging trends, and research gaps to guide future research.

Abstract

This scoping review examines the emerging field of Large Language Model (LLM)-based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA-ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. Our findings reveal diverse LLM-based agents spanning K-12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain-specific vs. general-purpose), role complexity (single-role vs. multi-role), and system integration (standalone vs. integrated). Emerging trends include multi-agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM-based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.

2604.08874 2026-05-19 cs.LG cs.AI Updated version

A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout

Rafael da Silva, Jeff Eicher, Gregory Longo

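Affiliations: Applied Data Science Program; Eastern University

AI summary: This paper proposes a temporal modeling framework with a counterfactual policy-simulation layer for analyzing student dropout in higher education. It models LMS engagement data and administrative withdrawal records, treats dropout as a time-to-event outcome, and models weekly risk with penalized, class-balanced logistic regression. The model attains high AUC on the training and test sets, and ablation analyses confirm the importance of temporal engagement signals.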

Comments Approx. 20 pages, 9 figures. This work introduces a temporal survival framework with counterfactual policy simulation. Code and reproducibility package: https://github.com/rafa-rodriguess/TCM-Student-Dropout

Abstract

This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person--period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts $\Delta S(T)$ under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch ($T_{\rm policy}=18$: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative ($\Delta S_{\rm mech}(18)=-0.0078$, $\Delta S_{\rm mech}(38)=-0.0134$). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.
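
The link between weekly hazards and a survival contrast can be made concrete with the standard discrete-time identity S(t) = (1 - h_1)(1 - h_2)...(1 - h_t). The sketch below is an illustration under assumptions: in the study the hazards would come from the fitted weekly logistic model, whereas here they are plain lists supplied by the caller, and the sign convention (policy minus baseline) and function names are assumed rather than taken from the paper.

```python
def survival_curve(hazards):
    """Discrete-time survival: S(t) is the running product of (1 - h_u)."""
    curve, alive = [], 1.0
    for h in hazards:
        alive *= 1.0 - h
        curve.append(alive)
    return curve


def survival_contrast(baseline_hazards, policy_hazards, horizon):
    """Contrast at week `horizon` (1-indexed): policy survival minus baseline."""
    base = survival_curve(baseline_hazards)[horizon - 1]
    policy = survival_curve(policy_hazards)[horizon - 1]
    return policy - base
```

A positive contrast means the scenario keeps more students enrolled at the horizon than the baseline does; a negative one, as reported for the mechanism-aware branch, means fewer.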

2604.08432 2026-05-19 physics.optics cs.AI Updated version

Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules

Luca Nogueira Calçado, Sergei K. Turitsyn, Egor Manuylovich

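Affiliations: Aston Institute of Photonic Technologies (AiPT)

AI summary: This paper proposes small-scale photonic Kolmogorov-Arnold networks built from standard telecom components. Trainable nonlinear modules enable efficient nonlinear inference, with strong performance on classification, regression, and image recognition tasks.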

AI中文摘要

光子神经网络有望实现超快推理，但大多数架构依赖于线性光学网格和电子非线性性，重新引入了光-电-光瓶颈。本文介绍了一种完全由标准电信组件实现的小规模光子Kolmogorov-Arnold网络（SSP-KANs）。每个网络边缘使用一个可训练的非线性模块，由马赫-曾德干涉仪、半导体光放大器和可变光学衰减器组成，提供一个由增益饱和和干涉混合导出的四参数传输函数。尽管这些光学非线性性的功能形式受限，仅由几个光学模块组成的SSP-KANs在分类、回归和图像识别任务中实现了强大的非线性推理性能，参数数量显著少于软件基线。一个四模块网络在非线性分类基准上的准确率为94.3%（IQR：90.3-97.4%，10种子），七模块网络在六输入回归中的R²为0.986±0.015。在现实硬件退化下性能保持稳健，即使在6位输入分辨率和14 dB信噪比下仍能保持高精度。通过使用完全可微的物理模型对光学参数进行端到端优化，本文建立了从仿真到实验演示的实用路径，利用商用电信硬件实现光子KANs。

英文摘要

Photonic neural networks promise ultrafast inference, yet most architectures rely on linear optical meshes with electronic nonlinearities, reintroducing optical-electrical-optical bottlenecks. Here we introduce small-scale photonic Kolmogorov-Arnold networks (SSP-KANs) implemented entirely with standard telecommunications components. Each network edge employs a trainable nonlinear module composed of a Mach-Zehnder interferometer, semiconductor optical amplifier, and variable optical attenuators, providing a four-parameter transfer function derived from gain saturation and interferometric mixing. Despite the constrained functional form of these optical nonlinearities, SSP-KANs comprising only a few optical modules achieve strong nonlinear inference performance across classification, regression, and image recognition tasks, approaching software baselines with significantly fewer parameters. A four-module network achieves 94.3% accuracy (IQR: 90.3--97.4%, 10 seeds) on nonlinear classification benchmarks; a seven-module network attains $R^2 = 0.986 \pm 0.015$ on six-input regression. Performance remains robust under realistic hardware impairments, maintaining high accuracy down to 6-bit input resolution and 14 dB signal-to-noise ratio. By using a fully differentiable physics model for end-to-end optimisation of optical parameters, this work establishes a practical pathway from simulation to experimental demonstration of photonic KANs using commodity telecom hardware.

2603.29868 2026-05-19 cs.AI cs.LO 版本更新

Spatiotemporal Robustness of Temporal Logic Tasks using Multi-Objective Reasoning

基于多目标推理的时序逻辑任务时空鲁棒性

Oliver Schön, Lars Lindemann

Affiliations: Automatic Control Laboratory, ETH Zürich

AI总结本文研究了通过多目标推理处理时序逻辑任务的时空鲁棒性，提出了一种新的时空鲁棒性定义，能够同时考虑空间和时间扰动，并展示了其在多智能体机器人、智慧城市和空中交通管制等交互系统中的应用。

Comments 30 pages, 6 figures, to be published at the 38th International Conference on Computer Aided Verification 2026

AI中文摘要

自主系统的可靠性依赖于其鲁棒性，即在不确定性下满足目标的能力。本文研究了在离散时间信号上评估的时序逻辑规范的时空鲁棒性。现有工作提出了鲁棒语义，能够捕捉不仅布尔可满足性，还包括从不可满足性距离的几何距离，对应于给定信号的可接受空间扰动。相比之下，我们提出了时空鲁棒性（STR），它同时捕捉可接受的空间和时间扰动。这一概念对于交互系统，如多智能体机器人、智慧城市和空中交通管制尤其具有信息量。我们将STR定义为一个多目标推理问题，通过空间和时间扰动的偏序关系形式化。这种视角有两个关键优势：（1）STR可以被解释为一个帕累托最优集，该集描述了所有可接受的时空扰动；（2）STR可以通过多目标优化工具进行计算。为克服计算挑战，我们提出了适用于STR的鲁棒语义，这些语义在适当的意义下是准确的，同时计算上是可行的。最后，我们使用这些鲁棒语义提出了STR的监控算法。据我们所知，这是首次通过多目标推理处理多维鲁棒性的工作。

英文摘要

The reliability of autonomous systems depends on their robustness, i.e., their ability to meet their objectives under uncertainty. In this paper, we study spatiotemporal robustness of temporal logic specifications evaluated over discrete-time signals. Existing work has proposed robust semantics that capture not only Boolean satisfiability, but also the geometric distance from unsatisfiability, corresponding to admissible spatial perturbations of a given signal. In contrast, we propose spatiotemporal robustness (STR), which captures admissible spatial and temporal perturbations jointly. This notion is particularly informative for interacting systems, such as multi-agent robotics, smart cities, and air traffic control. We define STR as a multi-objective reasoning problem, formalized via a partial order over spatial and temporal perturbations. This perspective has two key advantages: (1) STR can be interpreted as a Pareto-optimal set that characterizes all admissible spatiotemporal perturbations, and (2) STR can be computed using tools from multi-objective optimization. To navigate computational challenges, we propose robust semantics for STR that are sound in the sense of suitably under-approximating STR while being computationally tractable. Finally, we present monitoring algorithms for STR using these robust semantics. To the best of our knowledge, this is the first work to deal with robustness across multiple dimensions via multi-objective reasoning.

2603.26720 2026-05-19 cs.RO cs.AI 版本更新

SutureFormer: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

SutureFormer: 通过像素空间中的目标引导离线强化学习学习手术轨迹

Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

Affiliations: University of Macau; The Chinese PLA General Hospital; Duke University; Shanghai Jiao Tong University

AI总结本文提出SutureFormer，一种基于目标引导的离线强化学习框架，通过稀疏标注到密集奖励信号的插值，有效学习手术针轨迹预测，减少平均位移误差58.6%。

AI中文摘要

从内窥镜视频预测手术针轨迹对于机器人辅助缝合至关重要，能够实现预见性规划、实时引导和更安全的运动执行。现有直接从视觉观测学习运动分布的方法往往忽视相邻运动步骤之间的序列依赖性。此外，稀疏路径点标注通常无法提供足够的监督，进一步增加了监督或模仿学习方法的难度。为了解决这些挑战，我们将基于图像的针轨迹预测 formulations 为一个序列决策问题，在其中将针尖视为一个在像素空间中逐步移动的智能体。这种 formulation 自然捕捉了针运动的连续性，并能够显式建模在时间上物理上合理的像素级状态转换。从这个角度来看，我们提出SutureFormer，一种目标引导的离线强化学习框架，通过三次样条插值将稀疏标注转换为密集奖励信号，鼓励策略在利用有限专家指导的同时探索合理的未来运动路径。SutureFormer 使用观察编码器编码可变长度片段，以捕捉局部空间线索和长距离时间动态，并通过由离散方向和连续幅度组成的操作自回归地预测未来路径点。为了实现从专家演示中稳定离线策略优化，我们采用保守Q学习与行为克隆正则化。在包含1,158条轨迹的新的肾伤口缝合数据集中进行的实验表明，与最强基线相比，SutureFormer将平均位移误差减少了58.6%，证明了将针轨迹预测建模为像素级序列动作学习的有效性。

英文摘要

Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureFormer, a goal-conditioned offline reinforcement learning framework that converts sparse annotations into dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureFormer encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureFormer reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.

2603.20380 2026-05-19 cs.MA cs.AI cs.HC 版本更新

Herding CATs: ALARA for Agent Harness Engineering in Portable Composable Multi-Agent Teams

Christopher J. Agostino, Nayan D'Souza

Affiliations: Celeria, Inc.; Department of Linguistics, Indiana University

AI总结本文提出ALARA原则应用于多智能体团队的代理 harness 工程，通过引入CAT数据层，使用户能够直接声明工具访问权限并修改代理使用的工具，从而提升智能体在各种任务中的表现。

Comments Accepted to HAXD 2026, 8 pages, 6 figures

详情

AI中文摘要

行业从业者和学术研究人员经常使用多智能体系统来加速他们的工作，但用户使用的应用系统并未提供一种简单统一的机制来可扩展地管理代理 harness 的关键组件。这种缺乏控制对个体人机交互的质量产生了负面影响，并且限制了从业者协调上下文工程努力的能力。定义此类系统中智能体可以执行的行为规范仍然分散在文本指令文件中（无法保证合规性）或框架内部配置中，使得这些规范在跨团队和项目共享、版本控制或协作维护时变得困难。应用辐射安全中的ALARA原则（暴露量应尽可能低），我们引入了一种通过相互关联的纯文本文件表达的上下文-智能体-工具（CAT）数据层，允许用户为每个智能体直接声明工具访问权限，并在处理时修改智能体使用的工具本身。我们通过使用命令行 shell（加载团队并执行智能体运行）-- npcsh -- 和评估22个本地托管的模型（从0.6B到35B参数）在115个实际任务中的表现（包括文件操作、网络搜索、多步骤脚本、工具链和多智能体委托）来展示CAT数据层的能力。我们还表征了哪些模型家族在某些任务类别中成功，以及在约2500次总执行中它们的失败点。

英文摘要

Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, but the applications through which users operate these systems do not provide a simple, unified mechanism for scalably managing critical components of the agent harness. This lack of control degrades the quality of individual human-agent interactions and reduces the capacity of practitioners to coordinate context engineering efforts. The behavioral specifications that define what agents in such systems can do remain fragmented across prose instruction files -- for which compliance cannot be guaranteed -- or framework-internal configurations, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to context, we introduce a context-agent-tool (CAT) data layer expressed through interrelated plain-text files, allowing users to directly declare tool access for each agent and to modify the tools themselves that are used by the agents when processing. We demonstrate the capability of this CAT data layer to enable real agentic usage by using a command-line shell that loads the team and executes agent runs -- npcsh -- and evaluating 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation. We characterize which model families succeed in certain task categories and where they break down across ~2500 total executions.

2603.20216 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Locally Coherent Parallel Decoding in Diffusion Language Models

局部相干并行解码在扩散语言模型中

Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

Affiliations: IBM Research - Zurich

AI总结本文提出CoDiLA方法，通过引入小型辅助自回归模型来解决扩散语言模型在并行解码中的相干性问题，从而在代码生成任务中实现更高的准确性和速度。

Comments Accepted at ICML 2026

AI中文摘要

扩散语言模型（DLMs）作为一种有前景的替代自回归（AR）模型，提供了亚线性生成延迟和双向能力，这在代码生成和编辑中尤为吸引人。在离散DLMs中实现亚线性延迟需要并行预测多个token。然而，标准DLMs从条件边缘分布独立采样token，无法捕捉同时生成token之间的联合依赖关系。因此，它们常常导致语法不一致并破坏多token结构。在本工作中，我们引入CoDiLA（Coherent Diffusion with Local Autoregression），一种方法，通过引入小型辅助AR模型来解决并行采样与局部依赖建模之间的矛盾。该方法将局部解码委托给一个小型辅助AR模型，该模型在扩散潜变量上进行操作。这种设计允许并行生成，同时在块内确保序列的有效性，并保持核心DLM能力，包括跨块的双向建模。我们证明使用高度紧凑的辅助AR模型（例如，0.6B参数）可以有效消除相干性伪影，在代码生成基准中建立了一个新的帕累托前沿。

英文摘要

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

2603.17577 2026-05-19 cs.LG cs.AI stat.ML 版本更新

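EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

EveryQuery: 通过电子健康记录上的任务条件预训练实现零样本临床预测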

Payal Chandak, Gregory Kondas, Liat Antwarg Friedman, Isaac Kohane, Matthew McDermott

Affiliations: Harvard-MIT HST; Columbia University; Harvard Medical School

AI总结本文提出EveryQuery，一种通过任务条件预训练实现零样本临床预测的电子健康记录基础模型，通过直接估计未来窗口内结果发生的可能性，而非生成未来事件，从而在多个预测任务中优于自回归基线模型。

AI中文摘要

在电子健康记录（EHR）上预训练的基础模型已通过生成合成患者未来和聚合采样轨迹的统计信息，展示了零样本临床预测能力。然而，这种自回归推理过程计算成本高、统计噪声大且不支持直接提示条件预测，因为用户无法直接根据特定临床问题条件预测。在本初步工作中，我们引入EveryQuery，一种EHR基础模型，通过任务条件预训练实现零样本推理。不同于生成未来事件，EveryQuery输入患者的历史和一个结构化的查询指定临床任务，并通过单次前向传递直接估计未来窗口内结果发生的可能性。EveryQuery通过在随机采样的查询任务和患者上下文中预训练，直接训练模型以产生正确的答案。这使得无需微调、线性探测或轨迹生成即可对查询空间中的任何任务进行零样本预测。在MIMIC-IV上，EveryQuery在82%的39个随机采样的预测任务中优于自回归基线模型，平均AUC提高+0.16（95%置信区间：[0.10,0.22]）。这一优势在明确从预训练分布中排除的任务中保持一致。此外，EveryQuery的性能提升在罕见临床事件上最为显著，证实并展示了自回归推理在低预发率结果方面的根本限制的解决方案。然而，目前EveryQuery在需要对多个代码进行离散推理的任务上表现欠佳，如30天再入院，暴露了当前查询语言的表达性限制。

英文摘要

Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

2603.06984 2026-05-19 stat.ML cs.AI cs.GT cs.LG cs.SI 版本更新

Masking Causality and Conditional Dependence

掩盖因果关系与条件依赖

Zou Yang, Sophia Xiao, Bijan Mazaheri

Affiliations: Thayer School of Engineering, Dartmouth College

AI总结本文研究了通过平均约束来强制条件独立性的问题，发现这种约束在监管层面无法满足分层要求，而在优化者层面却能有效隐藏依赖关系，从而指出通过观测决策的平均统计来监管直接依赖是有限的，必须在决策规则层面进行监管。

AI中文摘要

许多监管和分析问题要求被禁止的变量只能通过指定的允许渠道影响决策——这是一种出现在路径特定公平性、处理敏感信息和监管非公开信息交易等场景中的条件独立性要求。这些要求可以通过分层方式执行，或更常见且更高效地通过单个平均约束来执行。本文从监管者的角度将因果掩盖建模为一个线性规划，并证明平均约束优化几乎总是产生违反分层要求但恰好满足平均约束的政策。掩盖收益随着混淆和结果异质性增加而增长，检测需要精确的条件独立性测试，而平均约束旨在避免这些测试。从优化者的角度来看，相同的构造表明，被掩盖的政策恢复了大部分无约束利用的收益，但更难被检测到，因此在决策基础本身敏感的任何设置中都具有吸引力。这些结果表明，通过观测决策的平均统计来监管直接依赖在结构上是有限的，有意义的监管必须在决策规则本身层面进行。

英文摘要

Many regulatory and analytic problems require that a prohibited variable influence a decision only through a designated allowable channel -- a conditional-independence requirement that arises in path-specific fairness, the handling of classified information, and the regulation of trading on non-public information, among other settings. Such requirements may be enforced either stratum-by-stratum or, more commonly (and more efficiently), through a single averaged constraint on the conditional effect. We study the resulting enforcement problem from two perspectives. From the regulator's side, we formulate causal masking as a linear program and show that averaged-constraint optimization almost surely produces policies that violate the stratum-wise requirement while satisfying the averaged one exactly. The gains from masking grow with confounding and outcome heterogeneity, and detection requires precisely the conditional-independence tests that average constraints aim to avoid. From the optimizer's side, the same construction shows that masked policies recover most of the reward of unconstrained exploitation while being far harder to detect, making them attractive in any setting where the basis of decisions is itself sensitive. Together, these results argue that regulating direct dependence through averaged statistics on observed decisions is structurally limited, and that meaningful enforcement must operate at the level of the decision rule itself.
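The gap between an averaged constraint and a stratum-wise requirement can be seen in a two-stratum toy case. The sketch below is only an illustration of that gap, not the paper's linear program: the policy table, stratum names, and weights are hypothetical values chosen here.

```python
def stratum_effects(policy):
    """Per-stratum effect of the prohibited variable A on the decision rate.

    policy maps stratum -> (P(decision | A=1), P(decision | A=0)).
    """
    return {s: p1 - p0 for s, (p1, p0) in policy.items()}


def averaged_effect(policy, weights):
    """Stratum effects averaged with the given stratum weights."""
    effects = stratum_effects(policy)
    return sum(weights[s] * effects[s] for s in effects)


def satisfies_stratumwise(policy, tol=1e-9):
    """True only if the effect of A vanishes inside every stratum."""
    return all(abs(e) <= tol for e in stratum_effects(policy).values())


# Hypothetical policy: A raises the decision rate in one stratum and
# lowers it by the same amount in the other.
policy = {"low": (0.7, 0.5), "high": (0.4, 0.6)}
weights = {"low": 0.5, "high": 0.5}

passes_average = abs(averaged_effect(policy, weights)) <= 1e-9
passes_strata = satisfies_stratumwise(policy)
```

Here `passes_average` is true while `passes_strata` is false: the opposite-signed effects cancel in the average, which is the kind of masked dependence the abstract argues an averaged statistic cannot detect.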

2603.04727 2026-05-19 cs.CV cs.AI 版本更新

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

多模态大语言模型是否准备好用于监控？对零样本异常检测在现实中的检验

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

Affiliations: Electrical and Computer Engineering Department

AI总结本文研究了多模态大语言模型在现实中的零样本异常检测性能，发现其存在保守偏差，通过特定指令可以提升F1分数，但召回率仍是关键瓶颈。

AI中文摘要

多模态大语言模型（MLLMs）在视频理解方面展示了出色的通用能力，但其在现实中的视频异常检测（VAD）可靠性仍待探索。与传统依赖重建或姿态线索的流程不同，MLLMs实现了将异常检测视为语言引导推理任务的范式转变。本文通过将VAD重新表述为二分类任务，在弱时间监督下系统评估了最先进的MLLMs在ShanghaiTech和CHAD基准上的性能。我们研究了提示特异性及时间窗口长度（1s-3s）对性能的影响，重点分析精度-召回率的权衡。研究发现，在零样本设置中存在显著的保守偏差；尽管模型表现出高置信度，但倾向于选择'正常'类，导致高精度但召回率崩溃，限制了实际应用。我们证明，针对类别的特定指令可显著改变这一决策边界，使ShanghaiTech的峰值F1分数从0.09提升至0.64，但召回率仍是关键瓶颈。这些结果突显了MLLMs在嘈杂环境中的显著性能差距，并为未来在召回导向提示和模型校准方面的研究提供了基础，这对需要复杂视频理解和推理的开放世界监控任务提出了要求。

英文摘要

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.

2602.10134 2026-05-19 cs.CR cs.AI cs.CL 版本更新

Reverse-Engineering Model Editing on Language Models

语言模型上的逆向工程模型编辑

Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He

AI总结本文研究了语言模型中参数编辑的漏洞，提出了一种名为KSTER的逆向工程攻击方法，通过利用参数更新的低秩结构恢复编辑数据，并提出subspace camouflage防御策略以降低重建风险。

Comments Accepted to ICML 2026

AI中文摘要

大型语言模型（LLMs）在预训练过程中会接触到包含万亿个标记的语料库，因此不可避免地会记住敏感信息。定位然后编辑方法作为一种主流的模型编辑范式，通过修改模型参数而不重新训练，提供了一个有前景的解决方案。然而，在本工作中，我们揭示了这一范式的关键漏洞：参数更新无意中充当了侧信道，使攻击者能够恢复编辑的数据。我们提出了一种两阶段的逆向工程攻击，称为KSTER（KeySpaceReconsThenEntropyReduction），该方法利用这些更新的低秩结构。首先，我们理论证明了更新矩阵的行空间编码了被编辑主体的“指纹”，通过谱分析可以准确恢复主体。其次，我们引入了一种基于熵的提示恢复攻击，重构了编辑的语义上下文。在多个LLM上的大量实验表明，我们的攻击能够以高成功率恢复编辑数据。此外，我们提出了一种名为subspace camouflage的防御策略，通过语义伪装来混淆更新指纹，从而有效降低重建风险，而不会影响编辑的实用性。我们的代码可在https://github.com/reanatom/EditingAttack上获得。

英文摘要

Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named KSTER (KeySpaceReconsTruction-then-EntropyReduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a "fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose subspace camouflage, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAttack.

2602.09805 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

超越准确率：分解大语言模型的推理效率

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

Affiliations: Integreat - Norwegian Centre for knowledge-driven machine learning; UiT - The Arctic University of Norway; University of Oslo

AI总结本文提出一种无需追踪的评估协议，通过完成率、条件正确性和生成长度三个指标分解大语言模型的token效率，同时考虑任务工作量元数据进行归一化处理，并评估模型在不同任务上的推理效率和冗余问题。

Comments Preprint (under review). 29 pages, 4 figures

AI中文摘要

随着推理大语言模型越来越多地通过推理、搜索和自我纠正来换取准确性，单一的准确性分数已无法说明这些token是否带来了有用的推理、从困难实例中恢复或不必要的冗长。我们介绍了一种可选追踪的评估协议，通过三个即使在封闭模型中也可用的观测指标精确分解token效率：完成率、在完成条件下正确性的条件正确性以及生成长度。当实例级工作量元数据可用时，我们进一步将生成长度归一化为声明的任务隐含工作，并将平均口头冗余与工作量依赖的扩展分离。当此类元数据不可用时，我们定义了一个可审计的求解器衍生工作量规模，并在留出自我、留出top-k和持有参考池扰动下评估其稳定性。我们在CogniLoad、GSM8K、ProofWriter和ZebraLogic上评估了14个共享开放权重模型。我们进一步在CogniLoad上评估了11个额外模型，从而能够对推理任务难度因素进行细致分析：任务长度、内在难度和干扰项密度。效率和冗余排名在所有基准对中保持稳定，比准确性排名更加稳健，同时分解了逻辑受限、上下文受限（截断驱动）和冗余受限的失败模式，这些模式在准确性每token下看起来是相同的。我们发布了评估工具包和报告模板，详细说明了LLM在推理上的低效原因。

英文摘要

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.
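One way to read the three observables named in this abstract is as factors of correct answers per generated token. The sketch below assumes that reading; the product form, the record fields, and the function name are assumptions made here for illustration and may differ from the paper's exact decomposition.

```python
def decompose_token_efficiency(records):
    """Split correct-answers-per-token into three observable factors.

    Each record is a dict with:
      completed (bool): the model produced a final answer (not truncated)
      correct   (bool): the final answer was right (False if not completed)
      tokens    (int):  number of generated tokens
    """
    n = len(records)
    completed = [r for r in records if r["completed"]]
    completion_rate = len(completed) / n
    # Correctness is measured only over runs that actually finished.
    conditional_correctness = (
        sum(r["correct"] for r in completed) / len(completed) if completed else 0.0
    )
    mean_tokens = sum(r["tokens"] for r in records) / n
    accuracy = completion_rate * conditional_correctness
    return {
        "completion_rate": completion_rate,
        "conditional_correctness": conditional_correctness,
        "mean_tokens": mean_tokens,
        "accuracy": accuracy,
        "correct_per_token": accuracy / mean_tokens,
    }
```

Under this reading, two models with the same `correct_per_token` can differ in why: a low `completion_rate` points to truncation, a low `conditional_correctness` to faulty reasoning, and a high `mean_tokens` to verbosity.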

2602.07085 2026-05-19 q-fin.ST cs.AI q-fin.CP 版本更新

QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

QuantaAlpha: 一种基于大语言模型的alpha挖掘进化框架

Jun Han, Shuo Zhang, Wei Li, Yifan Dong, Tu Hu, Yumo Zhu, Xiaomin Yu, Xin Guo, Zhaowei Liu, Kunyi Wang, Jingping Liu, Tianyi Jiang, Ruichuan An, Sen Hu, Zhi Yang, Ronghao Che, Huacan Wang

Affiliations: SUFE (Shanghai University of Finance and Economics); QuantaAlpha; SYSU (Sun Yat-sen University); PKU (Peking University)

AI总结本文提出QuantaAlpha框架，通过进化算法改进alpha挖掘过程，通过轨迹级突变和交叉实现多轮搜索和经验重用，实验表明其在多个市场指数上均表现出稳健的性能。

AI中文摘要

金融市场噪声和非平稳性使得alpha挖掘对回测噪声和制度转换高度敏感。尽管近期代理框架提高了自动化水平，但通常缺乏可控的多轮搜索和可靠的经验重用。为了解决这些挑战，我们提出了QuantaAlpha，一种进化alpha挖掘框架，将每个端到端挖掘运行视为轨迹，并通过轨迹级突变和交叉改进因素。QuantaAlpha定位次优步骤以进行针对性修订，并重新组合互补的高收益段以重用有效模式，从而在迭代中实现结构化探索和细化。在因子生成过程中，它强制假设、因子表达和可执行代码之间的语义一致性，并约束生成因子的复杂性和冗余性以缓解拥挤。在CSI 300上的大量实验表明，QuantaAlpha在强基线和先前代理系统上均表现出一致的优势。使用GPT-5.2，QuantaAlpha实现了IC为0.0472，ARR为4.68%，MDD为11.8%。此外，基于CSI 300挖掘的因子有效转移到CSI 500和S&P 500，分别在四年内分别产生约40.28%和19.1%的累计超额收益，这表明其在市场分布转换下的稳健性。

英文摘要

Financial markets are noisy and non-stationary, making alpha mining highly sensitive to backtest noise and regime shifts. While recent agentic frameworks improve automation, they often lack controllable multi-round search and reliable reuse of validated experience. To address these challenges, we propose QuantaAlpha, an evolutionary alpha mining framework that treats each end-to-end mining run as a trajectory and improves factors via trajectory-level mutation and crossover. QuantaAlpha localizes suboptimal steps for targeted revision and recombines complementary high-reward segments to reuse effective patterns, enabling structured exploration and refinement across iterations. During factor generation, it enforces semantic consistency across hypothesis, factor expression, and executable code, and constrains the complexity and redundancy of the generated factor to mitigate crowding. Extensive experiments on CSI 300 show consistent gains over strong baselines and prior agentic systems. Using GPT-5.2, QuantaAlpha achieves an IC of 0.0472 with ARR of 4.68% and MDD of 11.8%. Moreover, factors mined on CSI 300 transfer effectively to CSI 500 and the S&P 500, delivering about 40.28% and 19.1% cumulative excess return over four years, respectively, which indicates strong robustness under market distribution shifts.

2602.03664 2026-05-19 cs.AI cs.LG 版本更新

Mitigating Conversational Inertia in Multi-Turn Agents

缓解多轮代理中的对话惯性

Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; University of Rochester, Rochester, NY, USA

AI总结本文研究了多轮代理中对话惯性问题，提出通过上下文偏好学习来校准模型偏好，以减少惯性并提升性能。

Comments ICML2026

AI中文摘要

大型语言模型在获得适当演示时表现出色，但在多轮代理场景中，LLM错误地模仿自身之前的响应作为少样本示例。通过注意力分析，我们识别出对话惯性现象，即模型对先前响应表现出强烈的对角注意力，这与模仿偏差有关，限制了探索。这揭示了将少样本LLM转化为代理时的张力：更长的上下文丰富了环境反馈以供利用，但也加剧了对话惯性，从而损害探索。我们的关键见解是，对于相同状态，生成时使用更长上下文的动作表现出更强的惯性，这使得可以在没有环境奖励的情况下构建偏好对。基于此，我们提出上下文偏好学习，以校准模型偏好，使模型更倾向于选择低惯性响应而非高惯性响应。我们进一步提供了推理时的上下文管理策略，以平衡探索与利用。在八个代理环境和一个深度研究场景中的实验结果验证了我们的框架能够减少对话惯性并实现性能提升。

英文摘要

Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multi-turn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over high-inertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

2602.02262 2026-05-19 cs.SE cs.AI cs.CL 版本更新

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

PersonaDual: 通过自适应推理平衡个性化与客观性

Xiaoyou Liu, Xinyi Mou, Shengbin Yue, Liang Wang, Yuqing Wang, Qiexiang Wang, Tianrui Qin, Zhongyu Wei

Affiliations: Fudan University; Shanghai Innovation Institute; OPPO

AI总结本文提出PersonaDual框架，通过自适应切换模式，在单一模型中实现通用客观推理与个性化推理的平衡，减少干扰并提升客观问题解决能力。

2601.01685 2026-05-19 cs.CL cs.AI cs.MA 版本更新

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

用真理欺骗：通过生成蒙太奇进行开放式通道多智能体合谋以操纵信念

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

Affiliations: University of Liverpool; Mohamed bin Zayed University of Artificial Intelligence

AI总结本文研究了通过公开通道分发真实证据片段，利用多智能体合谋操纵信念的新威胁，提出了生成蒙太奇框架，展示了在14种LLM家族中74.4%的攻击成功率，并揭示了更强的推理能力反而增加了易受攻击的风险。

Comments Accepted to the ACL 2026 Main Conference (Oral Presentation)

AI中文摘要

随着大型语言模型（LLMs）向自主代理合成实时信息转变，其推理能力引入了意想不到的攻击面。本文介绍了一种新的威胁，即合谋代理通过仅使用真实证据片段在公开通道中引导受害者信念，而无需依赖隐蔽通信、后门或伪造文件。通过利用LLMs的过度思考倾向，我们正式化了首次认知合谋攻击，并提出生成蒙太奇：一个由写作者-编辑-导演框架构成的框架，通过对抗性辩论和协调发布证据片段来构建欺骗性叙述，使受害者内化并传播伪造结论。为研究此风险，我们开发了CoPHEME数据集，该数据集源自真实世界谣言事件，并在多种LLM家族中模拟攻击。我们的结果表明，14种LLM家族普遍存在漏洞：攻击成功率达到74.4%（专有模型）和70.6%（开放式权重模型）。反直觉的是，更强的推理能力增加了易受攻击性，推理专精模型的攻击成功率高于基础模型或提示。此外，这些虚假信念会传播到下游判断者，达到超过60%的欺骗率，突显了LLM代理在动态信息环境中交互的社会技术脆弱性。我们的实现和数据可在：https://github.com/CharlesJW222/Lying_with_Truth/tree/main。

英文摘要

As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

2601.00360 2026-05-19 cs.MA cs.AI cs.CY 版本更新

Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems

将人类反串通机制映射到多智能体AI系统

Jamiu Idowu, Ahmed Almasoud, Ayman Alfahid

Affiliations: Sahel AI, Sahel Group Inc.; Prince Sultan University; Majmaah University

AI总结本文研究如何将人类长期积累的反串通机制应用于多智能体AI系统，通过建立机制分类并提出实现方法，同时指出开放挑战如责任归属、身份流动性、边界问题和对抗性适应等。

Comments Accepted to ICML 2026 Workshop on Technical AI Governance Research (TAIGR); Published in Knowledge-Based Systems Journal

Journal ref Idowu, J., Almasoud, A. S., & Alfahid, A. (2026). Mapping human anti-collusion mechanisms to multi-agent AI systems. Knowledge-Based Systems, 344(116067), 116067. https://doi.org/10.1016/j.knosys.2026.116067

DOI: 10.1016/j.knosys.2026.116067

AI中文摘要

随着多智能体AI系统日益自主，证据表明它们可以发展出类似于人类市场和机构中长期观察到的串通策略。尽管人类领域积累了数世纪的反串通机制，但如何将这些机制适应到AI环境中仍不清楚。本文通过（i）开发人类反串通机制的分类学，包括制裁、宽大处理与举报、监控与审计、市场设计以及治理，以及（ii）将这些机制映射到多智能体AI系统的潜在干预措施来填补这一空白。对于每种机制，我们提出了实现方法。我们还强调了开放挑战，例如归属问题（难以将涌现的协调归因于特定智能体）、身份流动性（智能体容易被分裂或修改）、边界问题（区分有益的合作与有害的串通）以及对抗性适应（智能体学习逃避检测）

英文摘要

As multi-agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti-collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti-collusion mechanisms, including sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance and (ii) mapping them to potential interventions for multi-agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).

2512.05136 2026-05-19 cs.CV cs.AI 版本更新

Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

微调一种心电图基础模型以预测冠状动脉CT血管造影结果

Yujie Xiao, Qinghao Zhao, Gongzheng Tang, Hao Zhang, Zhuoran Kan, Deyun Zhang, Jun Li, Guangkun Nie, Xiaocheng Fang, Haoyu Wang, Shun Huang, Tong Liu, Jian Liu, Kangyin Chen, Shenda Hong

发表机构 * Institute of Medical Technology, Peking University Health Science Center（北京大学医学部医学技术研究院）； National Institute of Health Data Science, Peking University（北京大学国家健康数据科学研究院）； Department of Cardiology, Peking University People’s Hospital（北京大学人民医院心内科）； Tianjin Key Laboratory of Ionic-Molecular Function of Cardiovascular Disease, Department of Cardiology, Tianjin Institute of Cardiology, The Second Hospital of Tianjin Medical University（天津市心血管病离子与分子机能重点实验室，天津医科大学第二医院心脏科，天津心脏病学研究所）； Heart Voice Medical Technology（心声医疗科技）； School of Intelligence Science and Technology, Peking University（北京大学智能科学与技术学院）

AI总结本文通过微调心电图基础模型来预测冠状动脉CT血管造影（CCTA）结果：在多中心研究中，以CCTA作为解剖参考标准，开发并验证了用于预测血管特异性冠状动脉狭窄的AI-ECG模型，报告了模型在内部和外部验证中的表现，并讨论了其临床应用价值。

AI中文摘要

冠状动脉疾病（CAD）仍然是全球主要的公共卫生负担，然而可扩展的筛查工具有限。尽管CCTA是一线的非侵入性诊断方法，但其使用受到资源需求和辐射暴露的限制。AI-ECG可能为CAD风险分层提供一种补充方法。在这项多中心研究中，我们以CCTA作为解剖参考标准，开发并验证了用于预测血管特异性冠状动脉狭窄的AI-ECG模型。在内部验证中，模型在各血管上的AUC值为0.683-0.744，并表现出一致的外部性能。模型在临床正常的ECG中保持了鉴别能力，并在各亚组中保持大体稳定。模型预测的概率随CCTA定义的狭窄严重程度单调增加。模型概率通过预定义的基于灵敏度和特异性的阈值转换为血管特异性的低、中、高风险分层。校准分析显示预测风险与观察风险相一致，而DCA表明，与“全部治疗”和“全不治疗”策略相比具有净临床获益。将AI衍生的风险分层与基于指南的PTP类别相结合，提高了排除诊断性能，减少了灰色区域比例，并与单独使用PTP相比实现了正的NRI。在纵向随访队列中，Kaplan-Meier分析显示，各模型定义的风险组之间的主要不良心血管事件风险明显分离。基于波形和归因的分析进一步识别了与高风险预测相关的结构化ECG形态差异和具有生理意义的信号区域。这些发现支持AI-ECG作为补充性CAD筛查、解剖风险估计和临床分诊的可行工具，但仍需前瞻性研究来确认其临床影响。

英文摘要

CAD remains a major global public health burden, yet scalable screening tools are limited. Although CCTA is a first-line non-invasive diagnostic modality, its use is constrained by resource requirements and radiation exposure. AI-ECG may offer a complementary approach for CAD risk stratification. In this multicenter study, we developed and validated an AI-ECG model using CCTA as the anatomical reference standard to predict vessel-specific coronary stenosis. In internal validation, the model achieved AUC values of 0.683-0.744 across vessels and showed consistent external performance. Discrimination was maintained in clinically normal ECGs and remained broadly stable across subgroups. Model-predicted probabilities increased monotonically with CCTA-defined stenosis severity. Model probabilities were converted into vessel-specific low-, intermediate-, and high-risk strata using predefined sensitivity- and specificity-based thresholds. Calibration analysis showed agreement between predicted and observed risk, while DCA indicated net clinical benefit over treat-all and treat-none strategies. Integrating AI-derived risk strata with guideline-based PTP categories improved rule-out performance, reduced the gray-zone proportion, and achieved positive NRI compared with PTP alone. In a longitudinal follow-up cohort, Kaplan-Meier analysis showed clear separation of major adverse cardiovascular event risk across model-defined risk groups. Waveform- and attribution-based analyses further identified structured ECG morphology differences and physiologically meaningful signal regions associated with high-risk predictions. These findings support AI-ECG as a feasible tool for complementary CAD screening, anatomical risk estimation, and clinical triage, while prospective studies are needed to confirm its clinical impact.

2512.01537 2026-05-19 cs.SD cs.AI cs.IT cs.LG eess.SP math.IT 版本更新

Two-Dimensional Quantization for Geometry-Aware Audio Coding

二维量化用于几何感知的音频编码

Tal Shuster, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel（内盖夫本-古里安大学电气与计算机工程学院，以色列贝尔谢巴）

AI总结本文提出了一种二维量化方法Q2D2，通过将特征对投影到结构化的2D网格上，提高了音频压缩效率，同时保持了最先进的重建质量。

Comments Accepted to ICML 2026

AI中文摘要

最近的神经音频编解码器在重建质量上取得了显著成就，通常依赖于残差向量量化（RVQ）、向量量化（VQ）和有限标量量化（FSQ）等量化方法。然而，这些量化技术限制了潜在空间的几何结构，使特征之间的相关性捕捉变得更加困难，导致表示学习、代码本利用和令牌速率的效率低下。在本文中，我们引入了二维量化（Q2D2），一种将特征对投影到结构化2D网格（如六边形、菱形或矩形铺砌）并量化到最近网格值的量化方案，从而生成由网格级别乘积定义的隐式代码本，其代码本大小与传统方法相当。尽管其简单的几何公式，Q2D2在音频压缩效率方面有所提升，具有低令牌速率和高代码本利用率，同时保持了最先进的重建质量。具体而言，Q2D2在语音、音频和音乐领域广泛实验中，在各种客观和主观重建度量上实现了具有竞争力甚至更优的性能。全面的消融研究进一步证实了我们设计选择的有效性。

英文摘要

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space, make it harder to capture correlations between features leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state of the art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in speech, audio and music domains compared to state of the art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.
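
The grid quantization described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it covers only the rectangular tiling, and the `tanh` bounding, the grid range of [-1, 1], the default level counts, and the function names `quantize_pair` and `quantize_vector` are all assumptions made for illustration.

```python
import math


def quantize_pair(x, y, levels_x=5, levels_y=5):
    """Snap one feature pair to the nearest point of a rectangular grid on [-1, 1]^2.

    Returns (token, (qx, qy)), where token indexes an implicit codebook of
    size levels_x * levels_y. Bounding with tanh is an assumption of this sketch.
    """
    def snap(value, levels):
        bounded = math.tanh(value)              # squash into (-1, 1)
        step = 2.0 / (levels - 1)               # spacing between grid levels
        index = round((bounded + 1.0) / step)   # nearest grid level
        index = min(max(index, 0), levels - 1)  # guard against rounding overflow
        return index, -1.0 + index * step

    ix, qx = snap(x, levels_x)
    iy, qy = snap(y, levels_y)
    token = ix * levels_y + iy                  # product-of-levels codebook index
    return token, (qx, qy)


def quantize_vector(features, levels_x=5, levels_y=5):
    """Quantize a flat feature vector two dimensions at a time."""
    if len(features) % 2 != 0:
        raise ValueError("feature vector must have an even length")
    return [
        quantize_pair(features[i], features[i + 1], levels_x, levels_y)
        for i in range(0, len(features), 2)
    ]
```

With 5 levels per axis, every pair maps to one of 25 tokens without storing an explicit codebook, which is the sense in which the codebook is implicit. Hexagonal or rhombic tilings would replace the per-axis `snap` with a nearest-lattice-point search.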

2511.20857 2026-05-19 cs.CL cs.AI 版本更新

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Evo-Memory：基于自演化记忆的LLM代理测试时学习基准评测

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

发表机构 * Google DeepMind（谷歌DeepMind）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出Evo-Memory，一个用于评估LLM代理自演化记忆能力的综合流基准和框架，通过构建序列任务流数据集，要求LLM在每次交互后搜索、适应和演化记忆，并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了超过十种代表性的记忆模块。

AI中文摘要

状态性对于大型语言模型（LLM）代理进行长期规划和问题解决至关重要。这使得记忆成为关键组件，但其管理和演化在很大程度上仍未得到充分探索。现有的评估主要集中在静态对话设置上，其中记忆被动地从对话中检索以回答查询，忽略了在不断变化的任务流中积累和重用经验的动态能力。在现实世界环境中，如交互式问题助手或具身代理，LLM需要处理连续的任务流，但通常无法从积累的交互中学习，从而丢失有价值的上下文见解；这一局限呼唤测试时演化，即LLM在部署期间持续检索、整合和更新记忆。为了弥合这一差距，我们引入了Evo-Memory，一个综合的流式基准和框架，用于评估LLM代理的自演化记忆能力。Evo-Memory将数据集组织为顺序任务流，要求LLM在每次交互后搜索、适应和演化记忆。我们统一并实现了十余种代表性的记忆模块，并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了它们。为了更好地对经验重用进行基准测试，我们提供了一个基线方法ExpRAG，用于检索和利用先前经验，并进一步提出ReMem，一个将推理、任务动作和记忆更新紧密集成的行动-思考-记忆精炼流程，以实现持续改进。

英文摘要

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

2511.11654 2026-05-19 cs.LG cs.AI cs.MA 版本更新

Convergence of Multiagent Learning Systems for Traffic control

多智能体学习系统在交通控制中的收敛性

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa（亚马逊Alexa）； Indian Institute of Science（印度科学研究院）

AI总结本文研究了多智能体强化学习在交通信号控制中的收敛性问题，通过随机逼近方法分析学习动态，并证明了在特定条件下该算法能够收敛。

Comments 14 pages 2 figures

AI中文摘要

快速城市化导致城市如班加罗尔面临严重的交通拥堵，使得高效的交通信号控制（TSC）变得至关重要。多智能体强化学习（MARL）作为一种减少平均通勤延误的有希望策略，通常将每个交通信号视为一个独立的智能体使用Q学习进行建模。尽管先前的工作Prashant L A等人已经证明了这种方法的有效性，但在交通控制背景下对这种算法稳定性及收敛性进行严谨理论分析的研究尚未开展。本文通过专注于该多智能体算法的理论基础，填补了这一空白。我们研究了在合作性TSC任务中使用独立学习者固有的收敛问题。利用随机逼近方法，我们正式分析了学习动态。本文的主要贡献是证明了特定的交通控制多智能体强化学习算法在给定条件下能够收敛，扩展了从单智能体收敛证明中异步价值迭代的结论。

英文摘要

Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work Prashant L A et. al has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is the proof that the specific multi-agent reinforcement learning algorithm for traffic control is proven to converge under the given conditions extending it from single agent convergence proofs for asynchronous value iteration.
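
The independent-learner setup mentioned in the abstract can be sketched as one tabular Q-learning agent per traffic signal. This is the textbook Q-learning update, not the specific algorithm analyzed in the paper; the class name `IndependentQAgent`, the state encoding, and the use of negative delay as reward are assumptions made for illustration.

```python
from collections import defaultdict


class IndependentQAgent:
    """One tabular Q-learner; each traffic signal would own a separate instance."""

    def __init__(self, n_actions, alpha=0.1, gamma=0.9):
        self.n_actions = n_actions
        self.alpha = alpha    # step size
        self.gamma = gamma    # discount factor
        # Q-table: state -> list of action values, initialized to zero
        self.q = defaultdict(lambda: [0.0] * n_actions)

    def update(self, state, action, reward, next_state):
        """Apply one Q-learning step and return the temporal-difference error."""
        target = reward + self.gamma * max(self.q[next_state])
        td_error = target - self.q[state][action]
        self.q[state][action] += self.alpha * td_error
        return td_error

    def greedy_action(self, state):
        """Return the action with the highest current value estimate."""
        values = self.q[state]
        return max(range(self.n_actions), key=values.__getitem__)


# Hypothetical usage: two signals learning independently from local observations.
agents = {"junction_a": IndependentQAgent(2), "junction_b": IndependentQAgent(2)}
agents["junction_a"].update("queue_long", 1, -3.0, "queue_short")  # reward = -delay (assumed)
```

Because every agent updates only its own table while the others keep changing, each agent's environment is non-stationary from its own point of view, which is why convergence of such independent learners needs a separate argument.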

2511.07288 2026-05-19 cs.LG cs.AI 版本更新

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

通过深度演员-评论家稳定化实现离策略模仿学习

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa（亚马逊Alexa）； Indian Institute of Science（印度科学研究院）

AI总结本文提出一种结合离策略学习的对抗模仿学习算法，通过双Q网络稳定化和价值学习（无需奖励函数推断）来提高样本效率，从而更高效地匹配专家行为。

Comments 14 pages and 4 images

2511.06316 2026-05-19 cs.AI 版本更新

RADRON：通过配备康普顿相机的微型飞行器进行电离辐射源的协同定位

Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc, Jan Jakubek, Petr Stepan, Martin Saska

AI总结该研究提出了一种利用微型飞行器协同定位放射性物质的新方法，通过康普顿相机实时估计辐射源位置，即使在稀疏测量条件下也能实现高灵敏度检测。

Comments 8 pages, 9 figures, submitted for review to IEEE RA-L

DOI: 10.1109/LRA.2026.3688053

AI中文摘要

我们提出了一种新方法，通过相互协作的微型飞行器（MAVs）定位放射性物质。我们的方法利用最先进的单探测器康普顿相机，作为高灵敏度且微型化的电离辐射探测器。该探测器极低的重量（40克）为由一组协作的敏捷MAVs进行辐射检测开辟了新可能。我们提出了一种新的基本概念，通过融合康普顿相机测量来实时估计辐射源位置，即使测量极其稀疏也能做到。数据读取和处理直接在机载设备上进行，结果用于动态反馈以驱动飞行器运动。MAVs被稳定在紧密协作的集群中，以最大化康普顿相机获取的信息，快速定位辐射源，甚至跟踪移动的辐射源。

英文摘要

We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.

2510.21712 2026-05-19 cs.IR cs.AI cs.CL 版本更新

DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling

DecoupleSearch: 通过分层奖励建模解耦规划与搜索

Hao Sun, Zile Qiao, Bo Wang, Guoxin Chen, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang

发表机构 * Tongyi Lab（通义实验室）； Alibaba Group（阿里巴巴集团）

AI总结本文提出DecoupleSearch框架，通过双值模型解耦规划与搜索过程，利用蒙特卡洛树搜索评估每一步的质量，并通过分层束搜索迭代优化规划和搜索候选，验证了方法的有效性。

Comments EMNLP 2025 Main Conference

AI中文摘要

检索增强生成（RAG）系统已作为一种增强大型语言模型（LLM）的关键方法，通过动态整合外部知识。为了进一步提高RAG的灵活性，代理RAG引入了自主代理到工作流程中。然而，代理RAG面临几个挑战：（1）每一步的成功取决于高质量的规划和准确的搜索；（2）中间推理步骤缺乏监督；（3）规划和搜索的候选空间呈指数级增长。为了解决这些挑战，我们提出了DecoupleSearch，一种新的框架，通过双值模型解耦规划和搜索过程，使规划推理和搜索基础能够独立优化。我们的方法构建了一个推理树，其中每个节点代表规划和搜索步骤。我们利用蒙特卡洛树搜索来评估每一步的质量。在推理过程中，分层束搜索通过双值模型迭代优化规划和搜索候选。在不同参数规模的策略模型上的广泛实验验证了我们方法的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG's flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes demonstrate the effectiveness of our method.

2510.20584 2026-05-19 cs.CL cs.AI 版本更新

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

使用ChatGPT自动编码通信数据：子群体一致性分析

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

发表机构 * ETS Research Institute（ETS研究机构）

AI总结本文研究了使用ChatGPT进行通信数据编码在不同性别和种族/族裔群体间的一致性，发现其编码结果与人类评分者一致，为大规模评估协作与沟通提供了可能。

Comments Accepted to the Journal of Educational Measurement

AI中文摘要

大规模评估沟通与协作依赖于一项劳动密集型任务：按照不同的框架将通信数据编码为类别。先前研究已证明，可以直接用编码评分标准指示ChatGPT对通信数据进行编码，且其准确性与人类评分者相当。然而，ChatGPT或类似AI技术的编码在不同人口群体（如性别和种族）之间是否表现一致仍不清楚。为填补这一空白，我们改编了自动化评分文献中已有的框架，引入三种检查方法来评估基于LLM的编码中的子群体一致性。使用典型的协作问题解决编码框架和三种类型协作任务的数据，我们检查了基于ChatGPT的编码在性别和种族/族裔群体中的表现。我们的结果表明，基于ChatGPT的编码在性别或种族/族裔群体间的表现与人类评分者一样一致，证明了其用于大规模协作与沟通评估的可能性。

英文摘要

Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology perform consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

2510.11391 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

发表机构 * CUHK（香港中文大学）； UCAS（中国科学院大学）； XJTU（西安交通大学）； UMich（密歇根大学）； Microsoft（微软）

AI总结本文提出DocReward，一种用于评估文档结构和风格的奖励模型，通过构建包含117,000对文档的DocPair数据集，采用Bradley-Terry损失训练，有效提升了文档生成的结构和风格专业性。

AI中文摘要

近期的代理工作流程自动化了专业文档生成，但主要关注文本质量，忽视了结构和风格的专业性，这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型，无法引导代理生成结构和风格专业的文档。我们引入DocReward，一种评估文档结构和风格的文档奖励模型。为此，我们提出了一种文本质量无关的框架，确保评估不受内容质量的影响，并构建了包含117,000对文档的DocPair数据集，涵盖32个领域和267种类型。每对文档内容相同，但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中，DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明，DocReward能有效引导代理生成具有更一致结构和风格专业性的文档，突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.
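
The Bradley-Terry loss named in the abstract can be written down for a single document pair. This is a minimal sketch of the standard pairwise form only; the function name `bradley_terry_loss` and the scalar reward scores are assumptions, and the abstract does not describe how DocReward computes those scores.

```python
import math


def bradley_terry_loss(score_preferred, score_rejected):
    """Negative log-likelihood that the preferred document wins under Bradley-Terry.

    P(preferred beats rejected) = sigmoid(score_preferred - score_rejected),
    so the loss is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    The two branches keep the exponent non-positive for numerical stability.
    """
    margin = score_preferred - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss depends only on the score difference, which suits pairs that share identical content and differ only in structure and style: the model is pushed to rank the more professional layout higher rather than to predict an absolute quality value.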

2510.10930 2026-05-19 cs.CL cs.AI 版本更新

Evaluating Language Models' Evaluations of Games

评估语言模型对游戏的评估

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

发表机构 * University of Cambridge（剑桥大学）； MIT（麻省理工学院）； Princeton University（普林斯顿大学）； NYU（纽约大学）； Harvard University（哈佛大学）； Stanford University（斯坦福大学）

AI总结本文研究了语言模型对游戏评估的能力，通过比较现代语言模型和人类及符号计算代理的评估结果，发现推理模型在游戏评估上更接近人类，但随着模型接近博弈最优，其与人类数据的匹配度会减弱，且在评估趣味性时表现出更大的波动。

AI中文摘要

推理不仅仅是解决问题，也是评估哪些问题值得解决。人工智能系统的历史评估主要集中在解决问题上，通过研究模型如何玩国际象棋和围棋等游戏。在本文中，我们倡导一种新的范式，即评估人工智能系统对游戏的评估。首先，我们引入了一种评估此类评估的形式化方法。然后利用超过100种新型棋盘游戏和450份人类判断的大型数据集，将现代语言和推理模型的评估结果与人类和符号计算代理的评估结果进行比较。我们考虑了两种类型的评估查询：评估游戏的收益（或公平性）和趣味性。这些查询涵盖了两个与AI评估设计相关的重要维度：计算查询的复杂性和量化查询的难度。我们的结果表明，推理模型在游戏评估上通常比非推理语言模型更接近人类。然而，我们观察到非单调的关系：随着模型接近博弈最优，其与人类数据的匹配度会减弱。我们还发现，在评估趣味性时，模型之间存在更多的波动性，这与量化该查询的难度更大有关。在各种查询和游戏中，推理模型在评估查询时表现出高度变化和不可预测的资源使用，这表明在语言和推理模型中加入更多资源理性的元推理非常重要。

英文摘要

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

2509.13397 2026-05-19 cs.CY cs.AI 版本更新

The threat of analytic flexibility in using large language models to simulate human data

使用大语言模型模拟人类数据时分析灵活性的威胁

Jamie Cummins

发表机构 * University of Bern, Switzerland（伯尔尼大学，瑞士）； University of Oxford, United Kingdom（牛津大学，英国）

AI总结本文研究了在使用大语言模型生成合成数据时，分析选择对合成数据与人类数据一致性的影响，发现不同的配置选择可能导致结论差异显著，呼吁关注分析灵活性的潜在威胁并提出减少该威胁的策略。

Comments 14 pages, 4 figures

AI中文摘要

社会科学家现在使用大语言模型创建“硅样本”：旨在替代人类受访者的合成数据集。然而，生成这些样本需要许多分析选择，包括模型选择、采样参数、提示格式以及提供的人口统计或情境信息量。在两项研究中，我检验了这些选择是否对硅样本与人类数据的一致性产生实质性影响。在研究1中，我使用两种社会心理量表，为一个受控案例研究生成了252个硅样本配置，评估各配置能否恢复参与者排名、响应分布和量表间相关性。各配置在所有三个标准上差异显著，且在某一维度表现良好的配置往往在另一维度表现不佳。在研究2中，我将此分析扩展到一个已发表的硅样本使用案例，使用66种替代配置重新检验Argyle等人（2023）的研究3。人类与硅样本关联结构之间的相关性在不同配置下差异显著，从r=0.23到r=0.84。综合来看，这些研究的结果表明，不同的可辩护配置选择可以实质性地改变关于硅样本保真度的结论。我呼吁对使用硅样本时分析灵活性的威胁给予更多关注，并概述研究人员可以采用的减少此威胁的策略。

英文摘要

Social scientists are now using large language models to create "silicon samples": synthetic datasets intended to stand in for human respondents. However, producing these samples requires many analytic choices, including model selection, sampling parameters, prompt format, and the amount of demographic or contextual information provided. Across two studies, I examine whether these choices materially affect correspondence between silicon samples and human data. In Study 1, I generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, evaluating whether configurations recovered participant rankings, response distributions, and between-scale correlations. Configurations varied substantially across all three criteria, and configurations that performed well on one dimension often performed poorly on another. In Study 2, I extended this analysis to a published silicon-sample use case by re-examining Argyle et al.'s (2023) Study 3 using 66 alternative configurations. Correlations between human and silicon association structures differed substantially across configurations, from r = .23 to r = .84. Taken together, the results from these studies demonstrate that different defensible configuration choices can materially alter conclusions about the fidelity of silicon samples. I call for greater attention to the threat of analytic flexibility in using silicon samples and outline strategies that researchers may adopt to reduce this threat.

2509.07793 2026-05-19 econ.GN cs.AI cs.CY q-fin.EC 版本更新

Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment

个体生活满意度的效用揭示了与政治立场无关的不平等厌恶

Crispin Cooper, Ana Fredrich, Tommaso Reggiani, Wouter Poortinga

AI总结研究通过实验探讨了社会福利优先级和公平与个人幸福之间的权衡，发现个体对社会生活满意度不平等的厌恶与政治立场无关，挑战了平均生活满意度作为政策指标的使用，支持非线性效用替代方案的发展。

Comments 28 pages, 4 figures. Replacement adds link to version of record

Journal ref Social Indicators Research 183, 12 (2026)

DOI: 10.1007/s11205-026-03854-4

AI中文摘要

社会应如何优先考虑福祉，人们愿意在公平与个人福祉之间做出哪些权衡？我们通过一项具有全国代表性的英国样本（n=300）的声明偏好实验来探讨这些问题，参与者在不确定性条件下评估了自己和他人的生活满意度结果。使用期望效用最大化（EUM）框架估计个体层面的效用函数，并测试对小概率的过度重视，如累积前景理论（CPT）所描述的。大多数参与者表现出凹形（风险厌恶）效用曲线，并且对社会生活满意度不平等的厌恶程度强于个人风险。这些偏好与政治立场无关，表明了一种超越意识形态边界的共享福祉公平规范立场。研究结果挑战了平均生活满意度作为政策指标的使用，并支持开发更准确反映集体人类价值观的非线性效用替代方案。讨论了对公共政策、福祉测量以及价值一致的AI系统设计的影响。

英文摘要

How should well-being be prioritised in society, and what trade-offs are people willing to make between fairness and personal well-being? We investigate these questions using a stated preference experiment with a nationally representative UK sample (n = 300), in which participants evaluated life satisfaction outcomes for both themselves and others under conditions of uncertainty. Individual-level utility functions were estimated using an Expected Utility Maximisation (EUM) framework and tested for sensitivity to the overweighting of small probabilities, as characterised by Cumulative Prospect Theory (CPT). A majority of participants displayed concave (risk-averse) utility curves and showed stronger aversion to inequality in societal life satisfaction outcomes than to personal risk. These preferences were unrelated to political alignment, suggesting a shared normative stance on fairness in well-being that cuts across ideological boundaries. The results challenge use of average life satisfaction as a policy metric, and support the development of nonlinear utility-based alternatives that more accurately reflect collective human values. Implications for public policy, well-being measurement, and the design of value-aligned AI systems are discussed.
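
The link between a concave utility curve and risk or inequality aversion can be made concrete with a small worked example. This is a sketch under stated assumptions: the power utility u(x) = x^rho with 0 < rho < 1 is chosen for illustration only, and the abstract does not say which functional form the study fitted.

```python
def utility(x, rho=0.5):
    """Concave power utility for a non-negative outcome; rho < 1 gives concavity."""
    return x ** rho


def expected_utility(outcomes, probs, rho=0.5):
    """Probability-weighted average utility of a lottery over outcomes."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return sum(p * utility(x, rho) for x, p in zip(outcomes, probs))


def certainty_equivalent(outcomes, probs, rho=0.5):
    """Sure outcome whose utility equals the lottery's expected utility."""
    return expected_utility(outcomes, probs, rho) ** (1.0 / rho)


# A 50/50 lottery between life satisfaction scores of 2 and 8 (mean 5).
lottery_ce = certainty_equivalent([2.0, 8.0], [0.5, 0.5], rho=0.5)
```

With rho = 0.5 the certainty equivalent is about 4.5, below the mean of 5, so a sure 4.6 is preferred to the gamble. Reading the same two outcomes as two equally sized groups in a society instead of two lottery branches gives the inequality-aversion interpretation: an unequal distribution averaging 5 is valued like an equal one at about 4.5, which is why a plain average can overstate welfare.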

2509.06984 2026-05-19 cs.LG cs.AI 版本更新

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

FediLoRA: 在缺失模态约束下联邦微调基础模型的实用方法

Lishan Yang, Wei Emma Zhang, Nam Kha Nguygen, Po Hu, Yanjun Shu, Weitong Chen, Mong Yuan Sim

发表机构 * Adelaide University（阿德莱德大学）； Central China Normal University（华中师范大学）； Harbin Institute of Technology（哈尔滨工业大学）

AI总结本文提出FediLoRA，一种轻量级的联邦LoRA聚合框架，旨在解决联邦学习中异构环境下的缺失模态问题，通过联合简单平均和结构化编辑提升全局和个性化模型性能，实现在多个通用领域和医疗领域基准数据集上的强大表现。

Comments 8 pages, 7 figures

AI中文摘要

联邦学习与LoRA微调相结合，提供了一种高效且注重隐私的解决方案，使机构能够协作利用其大规模数据集来训练VLLMs。然而，参与机构通常拥有异构的计算资源，导致LoRA秩不平衡，这对有效协作构成重大挑战。此外，医疗和交通等现实应用领域常因用户错误或设备故障而出现模态缺失，这会显著降低联邦设置中的全局模型性能。据我们所知，此前尚无工作同时解决联邦VLLMs中的这两个挑战。为了解决这些问题，我们提出FediLoRA，一种轻量级的联邦LoRA聚合框架，可有效减轻异构环境中模态缺失的影响。FediLoRA的明确动机来自这样一个观察：简单平均和结构化编辑可以共同使全局模型和个性化模型受益。我们的方法在多个通用领域和医疗领域基准数据集上取得了强劲的性能。在医疗数据上的额外实验进一步证明，FediLoRA非常适合实际的真实世界部署场景。我们的代码已发布在https://github.com/gotobcn8/FediLoRA。

英文摘要

Federated Learning with LoRA fine-tuning offers an efficient and privacy-aware solution for institutions to collaboratively leverage their large datasets to train VLLMs. However, participating institutions often possess heterogeneous computational resources, resulting in imbalanced LoRA ranks, which pose a major challenge for effective collaboration. In addition, real-world applications in domains such as healthcare and transportation frequently suffer from missing modalities due to user mistakes or device failures, which significantly degrade global model performance in federated settings. To the best of our knowledge, no prior work has addressed these two challenges simultaneously in federated VLLMs. To tackle these issues, we propose FediLoRA, a lightweight federated LoRA aggregation framework that effectively mitigates the impact of missing modalities in heterogeneous environment. FediLoRA is explicitly motivated by the observation that simple averaging and structured editing can jointly benefit both global and personalized models. Our approach achieves strong performance across multiple general-domain and medical-domain benchmark datasets. Additional experiments on healthcare data further demonstrate that FediLoRA is well-suited for practical, real-world deployment scenarios. Our code is released at https://github.com/gotobcn8/FediLoRA.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG 版本更新

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR: 基于KL引导的剪枝感知联邦学习用于行人重识别

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

发表机构 * Media IC and System Lab, the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University（媒体IC与系统实验室，电子工程研究所及电气工程系，国立台湾大学）

AI总结本文提出FedKLPR框架，通过KL散度引导训练、非结构化剪枝和跨轮次恢复技术，解决联邦学习在行人重识别中的统计异质性和通信开销问题，实验表明其在通信开销和准确性方面均优于现有方法。

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

AI中文摘要

行人重识别（re-ID）是智能监控和公共安全中的基本任务。联邦学习（FL）提供了一种无需集中收集数据的隐私保护协同模型训练范式。然而，由于非独立同分布（non-IID）客户端数据导致的统计异质性，以及频繁传输大规模模型带来的大量通信开销，在现实世界的re-ID系统中部署FL仍然具有挑战性。为了解决这些挑战，我们提出了FedKLPR，一种轻量且通信高效的行人重识别联邦学习框架。FedKLPR包含三个关键组件。首先，引入KL散度引导训练，包括KL散度正则化损失（KLL）和KL散度聚合权重（KLAW），以缓解统计异质性并提高非IID设置下的收敛稳定性。其次，引入非结构化剪枝以减少通信开销，并提出剪枝率聚合权重（PRAW）以衡量剪枝后客户端参数的相对重要性。PRAW与KLAW共同构成KL散度-剪枝加权聚合（KLPWA），使得在异构数据分布下能够有效聚合剪枝后的本地模型。第三，跨轮次恢复（CRR）自适应地控制各通信轮次中的剪枝，以防止过度压缩并保持模型准确性。在八个基准数据集上的实验表明，FedKLPR在保持有竞争力的准确性的同时实现了显著的通信节省。与现有最先进方法相比，FedKLPR在ResNet-50上将通信成本降低了40%至42%，同时取得了更好的总体性能。

英文摘要

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40\%--42\% on ResNet-50 while achieving better overall performance.

URL PDF HTML ☆

赞 0 踩 0

2508.16663 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Fourier Compressor: 频域视觉令牌压缩用于视觉-语言模型

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

发表机构 * LUMIA Lab（LUMIA实验室）； School of Artificial Intelligence（人工智能学院）； Shanghai Jiao Tong University（上海交通大学）； Shanghai Innovation Institute（上海创新研究院）； Noah’s Ark Lab（诺亚实验室）； Huawei Technologies Ltd.（华为技术有限公司）； School of Computer Science（计算机科学学院）

AI总结本文提出了一种基于频域的视觉令牌压缩策略，通过傅里叶变换减少计算开销并提升效率，同时保持语义准确性，实验表明其在图像和视频任务中均表现出色。

详情

AI中文摘要

视觉-语言模型（VLMs）由于高分辨率图像和视频输入引入的大量视觉令牌，导致计算开销和推理延迟显著增加。现有的无参数令牌压缩方法通常依赖于令牌选择或合并，但可能丢弃大量视觉信息或扭曲原始表示分布，导致在高压缩比下性能明显下降。为此，我们探索了一种更有效且高效的视觉令牌压缩策略，重点关注频域方向。受图像压缩中频域变换（如JPEG）的成功启发，我们系统分析了视觉表示中的频域冗余，并揭示了语义信息在不同频带上的非均匀分布。基于此，我们引入了Fourier Compressor，一种有效、无参数且高度通用的模块，在频域内去除视觉表示的冗余。该模块通过FFT实现（复杂度为 $\mathcal{O}(n \log n)$），无额外参数，计算开销极小，同时保持语义保真度。在图像基准测试中，我们的方法在保留超过96%原始准确率的同时，将推理FLOPs减少高达83.8%，生成速度提升31.2%。它始终优于现有的无参数方法，甚至超过部分有参数的方法。Fourier Compressor在LLaVA和Qwen-VL架构上均能稳定泛化，并可进一步扩展到视频理解任务，证明了其在高效VLMs中的实用价值。

英文摘要

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we explore a more effective and efficient visual token compression strategy in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.
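To make the frequency-domain idea concrete, here is a minimal sketch, not the paper's implementation. It assumes a single feature channel treated as a real 1-D signal whose length is a power of two, and it uses a hypothetical compression rule: keep only the lowest-frequency coefficients, then invert the shorter spectrum. The recursive radix-2 FFT halves the problem at each level, which is where the $\mathcal{O}(n \log n)$ cost comes from.

```python
import cmath


def fft(x):
    """Recursive radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


def ifft(spectrum):
    """Inverse FFT computed through the conjugation identity."""
    n = len(spectrum)
    conj = fft([v.conjugate() for v in spectrum])
    return [v.conjugate() / n for v in conj]


def lowpass_compress(signal, keep):
    """Shrink a real signal from n to `keep` samples (both powers of two,
    keep >= 2) by retaining only the lowest-frequency coefficients.
    Illustrative scheme only."""
    n = len(signal)
    spec = fft([complex(v) for v in signal])
    half = keep // 2
    # Low positive frequencies (including DC) plus the matching negative ones.
    small = spec[:half] + spec[n - half:]
    # Rescale so that amplitudes are preserved after shortening.
    return [v.real * keep / n for v in ifft(small)]


compressed = lowpass_compress([3.0] * 8, keep=4)
```

A constant signal has energy only in the DC bin, so the shortened output keeps the same constant value while halving the length.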

URL PDF HTML ☆

赞 0 踩 0

2506.16042 2026-05-19 cs.AI cs.LG cs.OS 版本更新

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

OSWorld-Human: 评估计算机使用代理的效率基准

Reyna Abhyankar, Qi Qi, Yiying Zhang

发表机构 * OpenAI ； Anthropic ； Google DeepMind ； ByteDance（字节跳动）； Agent S2 ； GTA1 ； Lei ； Jedi

AI总结本文研究了计算机使用代理在OSWorld基准上的时间性能，发现大模型调用导致高延迟，并构建了包含人类轨迹的OSWorld Human数据集，评估发现最佳代理仍需更多步骤。

详情

AI中文摘要

生成式AI正被用于解决涉及桌面应用的多种计算机使用任务。最先进的系统只专注于提高在主流基准上的准确性。然而，这些系统的端到端延迟极高（例如数十分钟），而同样的任务人类通常只需几分钟即可完成，因此实际上难以使用。为了理解其原因并指导未来计算机代理的发展，我们首次研究了计算机使用代理在OSWorld基准上的时间性能。我们发现，用于规划、反思和判断的大模型调用占总延迟的大部分，并且随着代理使用更多步骤完成任务，后续每一步的耗时可达任务开始时步骤的3倍。随后我们构建了OSWorld Human，即原始OSWorld数据集的人工标注版本，其中包含每个任务由人类确定的轨迹。我们使用OSWorld Human评估了16个代理的效率，发现即使是最佳代理，所用步骤也比必要步骤多2.7至4.3倍。

英文摘要

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld Human and find that even the best agents take 2.7-4.3x more steps than necessary.

URL PDF HTML ☆

赞 0 踩 0

2506.08244 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Algebraic Priors for Approximately Equivariant Networks

代数先验用于近似等变网络

Riccardo Ali, Pietro Liò, Jamie Vicary

发表机构 * University of Cambridge（剑桥大学）

AI总结本文提出了一种无需参数的代数方法，利用群表示理论来构建等变网络的先验，通过实验验证该方法在多个任务中表现优异，甚至在无限群情况下也优于专门设计的模型。

详情

AI中文摘要

等变神经网络通过群作用来整合对称性，将其作为归纳偏差以提高性能。现有方法或在潜在空间中学习等变作用，或设计在构造上即等变的架构。这些方法通常能获得良好的经验结果，但可能涉及架构特定的约束、大量参数和高计算成本。我们挑战复杂等变架构的范式，提出一种基于群表示理论的无参数方法。我们证明，对于有限群上的等变编码器，潜在空间对每个线性无关的数据轨道几乎必然包含其正则表示的一个副本，并通过多项实证研究对此进行了探讨。利用这一代数洞察，我们通过辅助损失将群的正则表示作为归纳偏差引入，不增加可学习参数。广泛的评估显示，该方法在若干情形下与专门设计的模型持平或更优，包括针对无限群的模型。我们进一步通过消融研究验证了正则表示这一选择，显示其始终优于采用定义表示和平凡表示的基线。

英文摘要

Equivariant neural networks incorporate symmetries through group actions, embedding them as an inductive bias to improve performance. Existing methods learn an equivariant action on the latent space, or design architectures that are equivariant by construction. These approaches often deliver strong empirical results but can involve architecture-specific constraints, large parameter counts, and high computational cost. We challenge the paradigm of complex equivariant architectures with a parameter-free approach grounded in group representation theory. We prove that for an equivariant encoder over a finite group, the latent space must almost surely contain one copy of its regular representation for each linearly independent data orbit, which we explore with a number of empirical studies. Leveraging this foundational algebraic insight, we impose the group's regular representation as an inductive bias via an auxiliary loss, adding no learnable parameters. Our extensive evaluation shows that this method matches or outperforms specialized models in several cases, even those for infinite groups. We further validate our choice of the regular representation through an ablation study, showing it consistently outperforms defining and trivial group representation baselines.

URL PDF HTML ☆

赞 0 踩 0

2505.21893 2026-05-19 cs.LG cs.AI 版本更新

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

SIPO: 用于对齐扩散模型的人类偏好优化的稳定与改进方法

Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

发表机构 * Shanghai Science and Intelligence Institute, Shanghai, China（上海科学与智能研究所）； Fudan University, Shanghai, China（复旦大学）； Australian Institute for Machine Learning, The University of Adelaide（澳大利亚机器学习研究所，阿德莱德大学）

AI总结本研究提出SIPO框架，通过时间步感知的重要性重新加权和梯度稳定技术，解决扩散模型对齐中训练不稳定和策略偏差问题，提升了对齐效果和稳定性。

Comments This version supplements with more detailed content on reasoning and proof, additional experimental results, and ablation studies

详情

AI中文摘要

偏好学习作为一种有效技术，已被广泛用于将扩散模型与人类偏好对齐在视觉生成中。然而，现有对齐方法如Diffusion-DPO面临两个根本性挑战：由于各个时间步的高梯度方差导致的训练不稳定以及由于优化数据与策略模型分布之间的差异引起的策略偏差。我们的第一项贡献是对不同时间步的扩散轨迹进行系统分析，发现不稳定性主要源于早期时间步的低重要性权重。为了解决这些问题，我们提出了SIPO，即一种用于将扩散模型与人类偏好对齐的稳定和改进的偏好优化框架。具体而言，引入了一个关键梯度，即DPO-C&M，通过裁剪和屏蔽无信息的时间步来稳定训练。随后，采用时间步感知的重要性重新加权范式以缓解策略偏差并在对齐过程中强调信息更新。在各种基线模型上进行的广泛实验，包括图像生成模型SD1.5、SDXL和视频生成模型CogVideoX-2B/5B、Wan2.1-1.3B，表明我们的SIPO在稳定训练和性能方面均优于现有对齐方法。总体而言，这些结果表明了时间步感知对齐的重要性，并为改进扩散模型的偏好优化提供了有价值的指导。

英文摘要

Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variance at various timesteps and high parameter sensitivity, and off-policy bias arising from the discrepancy between the optimization data and the policy model's distribution. Our first contribution is a systematic analysis of diffusion trajectories across different timesteps, identifying that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose SIPO, a Stabilized and Improved Preference Optimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, i.e., DPO-C&M, is introduced to stabilize training by clipping and masking uninformative timesteps. This is followed by a timestep-aware importance-reweighting paradigm to mitigate off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models, including the image generation models SD1.5 and SDXL and the video generation models CogVideoX-2B/5B and Wan2.1-1.3B, demonstrate that SIPO consistently promotes stabilized training and outperforms existing alignment methods that rely on meticulous parameter adjustments. Overall, these results suggest the importance of timestep-aware alignment and provide valuable guidelines for improved preference optimization in aligning diffusion models.

URL PDF HTML ☆

赞 0 踩 0

2505.17138 2026-05-19 cs.LG cs.AI 版本更新

RAP: Runtime Adaptive Pruning for LLM Inference

RAP: 用于大语言模型推理的运行时自适应剪枝

Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

发表机构 * Faculty of Science and Technology, University of Macau, Macau, China（澳门大学科学与技术学院）； School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China（电子科技大学信息与软件工程学院）

AI总结本文提出RAP，一种基于强化学习的弹性剪枝框架，通过动态调整压缩策略来适应运行时内存变化和异构KV缓存需求，首次在推理过程中同时考虑模型权重和KV缓存。

详情

AI中文摘要

大语言模型（LLMs）在语言理解和生成方面表现出色，但其巨大的计算和内存需求限制了部署。压缩提供了一种潜在的解决方案来缓解这些约束。然而，大多数现有方法依赖于固定的启发式方法，因此无法适应运行时内存变化或来自多样化用户请求的异构KV缓存需求。为了解决这些限制，我们提出了RAP，一种由强化学习（RL）驱动的弹性剪枝框架，能够以运行时感知的方式动态调整压缩策略。具体而言，RAP动态跟踪实际执行过程中模型参数与KV缓存之间的演变比例。认识到前馈网络（FFNs）包含大部分参数，而参数轻量的注意力层主导KV缓存的形成，RL代理只保留那些在当前内存预算内最大化效用的组件，基于即时的工作负载和设备状态。广泛的实验结果表明，RAP优于最先进的基线方法，标志着首次在推理过程中同时考虑模型权重和KV缓存。

英文摘要

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache during practical execution. Recognizing that FFNs house most parameters, whereas parameter-light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experimental results demonstrate that RAP outperforms state-of-the-art baselines and is the first approach to jointly consider model weights and KV-cache on the fly.

URL PDF HTML ☆

赞 0 踩 0

2505.16278 2026-05-19 cs.CV cs.AI cs.RO 版本更新

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

DriveMoE：面向端到端自动驾驶的视觉-语言-动作混合专家模型

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

发表机构 * Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University（上海交通大学计算机科学学院与人工智能学院）； Institute of Trustworthy Embodied AI, Fudan University（复旦大学可信具身人工智能研究院）； Shanghai Key Laboratory of Multimodal Embodied AI（上海多模态具身人工智能重点实验室）； AnyScale AI Project（AnyScale AI项目）

AI总结本文提出DriveMoE，一种基于混合专家架构的端到端自动驾驶框架，通过场景专用的视觉混合专家和技能专用的动作混合专家，实现了对复杂驾驶场景的有效处理，展示了在自动驾驶任务中结合视觉和动作混合专家的有效性。

Comments Accepted by CVPR 2026, Project Page: https://thinklab-sjtu.github.io/DriveMoE/

详情

AI中文摘要

端到端自动驾驶（E2E-AD）需要有效处理多视角传感器数据和稳健处理多样且复杂的驾驶场景，特别是罕见的激进转弯等场景。最近混合专家（MoE）架构在大语言模型（LLMs）中的成功表明，参数的专业化能够实现强大的可扩展性。在本工作中，我们提出了DriveMoE，一种新的基于MoE的E2E-AD框架，包含场景专用的视觉MoE和技能专用的动作MoE。DriveMoE基于我们$π_0$视觉-语言-动作（VLA）基线（最初来自具身AI领域），称为Drive-$π_0$。具体而言，我们通过训练一个路由器，根据驾驶上下文动态选择相关摄像头，将视觉MoE添加到Drive-$π_0$中。这种设计模仿了人类驾驶认知，即司机选择性地关注关键视觉线索，而不是穷尽处理所有视觉信息。此外，我们通过训练另一个路由器来激活针对不同驾驶行为的专用专家模块，通过显式的行为专业化，DriveMoE能够处理多样化的场景而不受现有模型中模式平均的困扰。在Bench2Drive闭环评估实验中，DriveMoE实现了最先进的性能，证明了在自动驾驶任务中结合视觉和动作MoE的有效性。我们将发布DriveMoE和Drive-$π_0$的代码和模型。

英文摘要

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. The recent success of the Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to dynamically select relevant cameras according to the driving context. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging as existing models do. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.

URL PDF HTML ☆

赞 0 踩 0

2504.13217 2026-05-19 cs.CL cs.AI 版本更新

Sustainability via LLM Right-sizing

通过LLM规模适配实现可持续性

Jennifer Haase, Finn Klessascheck, Jan Mendling, Sebastian Pokutta

AI总结本文研究了在现实应用中，小型本地可部署模型是否足够好，通过评估十一种LLM在十种日常职业任务中的表现，提出了一种基于可持续性的评估方法，强调在成本、本地部署和隐私方面的需求。

Comments 21 pages, 2 Figures, 6 Tables

详情

AI中文摘要

大型语言模型（LLMs）日益融入组织工作流程，引发了对其能源消耗、财务成本和数据主权的担忧。尽管性能基准常赞扬前沿模型，但实际部署决策需要更广泛的视角：何时小型、本地可部署的模型“足够好”？本研究通过评估十一种专有和开放权重LLM在十种日常职业任务（包括文本摘要、日程生成以及邮件和提案起草）中的表现，提供实证答案。我们使用基于双LLM的评估框架，自动化任务执行，并在与输出质量、事实准确性和伦理责任相关的十项标准上进行标准化评估。结果显示，GPT-4o的性能始终更优，但成本和环境足迹显著更高。值得注意的是，较小的模型如Gemma-3和Phi-4在大多数任务中取得了强劲且可靠的结果，表明其在需要成本效率、本地部署或隐私的场景中具有可行性。聚类分析揭示了三类模型：高端全能型、胜任的通用型，以及能力有限但安全的模型，突显了质量、控制和可持续性之间的权衡。重要的是，任务类型影响了模型的有效性：概念性任务对大多数模型构成挑战，而聚合和转换任务的表现更好。我们主张从追求性能最大化的基准转向任务和情境感知的充分性评估，以更好地反映组织的优先事项。我们的方法贡献了一种从可持续性视角评估AI模型的可扩展方法，并为负责任的LLM部署提供了可行的指导。

英文摘要

Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

URL PDF HTML ☆

赞 0 踩 0

2503.13934 2026-05-19 cs.RO cs.AI 版本更新

COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning

COLSON: 通过基于扩散的强化学习实现可控的社会导航

Kohei Matsumoto, Yuki Tomita, Yuki Hyodo, Ryo Kurazume

AI总结本文提出了一种基于扩散的强化学习方法，用于社会导航，通过灵活的动作分布提高了导航的适应性和可控性，同时能够适应未见过的场景。

Comments ICRA 2026

详情

AI中文摘要

在动态环境中移动机器人导航面临行人交通的关键挑战，在自主移动服务机器人发展中尤为重要。最近，基于深度强化学习的方法被积极研究，并因其优化能力优于传统规则方法。其中，假设连续动作空间的方法通常依赖高斯分布，这限制了生成动作的灵活性。相比之下，将扩散模型应用于强化学习已取得进展，使动作分布比高斯策略方法更加灵活。在本研究中，我们应用基于扩散的强化学习方法进行社会导航，并验证其有效性。此外，通过利用扩散模型的特点，我们提出了能够适应以前未见过的场景而无需额外训练的扩展方法。作为具体场景示例，我们展示了适应环境中有静态障碍物的场景（这些障碍物在训练期间不存在），以及目标与训练不同的场景，例如在避免他人时陪同目标行人到达目的地。

英文摘要

Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these methods, those that assume continuous action spaces typically rely on Gaussian distributions, which limit the flexibility of the generated actions. In contrast, the application of diffusion models to reinforcement learning has advanced, enabling more flexible action distributions than Gaussian policy-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, by exploiting the characteristics of diffusion models, we propose extensions that enable adaptation to previously unseen scenarios without additional training. As concrete scenario examples, we demonstrate adaptability to scenarios in which static obstacles exist in the environment that were not present during training, as well as scenarios in which the objective differs from training, such as accompanying target pedestrians while avoiding others to reach the destination.

URL PDF HTML ☆

赞 0 踩 0

2503.02574 2026-05-19 cs.CR cs.AI 版本更新

LLM-Safety Evaluations Lack Robustness

大语言模型安全评估缺乏鲁棒性

Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann

发表机构 * Department of Computer Science & Munich Data Science Institute, Technical University of Munich ； Canada AI CIFAR Chair

AI总结本文指出当前大语言模型安全对齐研究受到多种交织的噪声源阻碍，如小数据集、方法学不一致和不可靠的评估设置，导致难以公平评估和比较攻击与防御，阻碍了进展。我们系统分析了大语言模型安全评估流程，涵盖数据集整理、自动化红队优化策略、响应生成和响应评估使用LLM法官。在每个阶段，我们识别了关键问题并突出了其实际影响。我们还提出了一套减少未来攻击和防御论文评估中噪声和偏见的指南。最后，我们提出了对立观点，强调现有限制的实用原因。我们相信，未来研究解决这些问题将提高领域生成可比结果的能力，并实现可衡量的进步。

详情

AI中文摘要

在本文中，我们论证当前大语言模型的安全对齐研究受到许多交织的噪声源的阻碍，例如小数据集、方法学不一致和不可靠的评估设置。这有时会使得无法公平地评估和比较攻击和防御，从而减缓进展。我们系统地分析了大语言模型安全评估流程，涵盖数据集整理、自动化红队优化策略、响应生成和使用LLM法官进行响应评估。在每个阶段，我们识别了关键问题并突出了其实际影响。我们还提出了一套减少未来攻击和防御论文评估中噪声和偏见的指南。最后，我们提出了对立观点，强调现有限制的实用原因。我们相信，未来研究解决这些问题将提高领域生成可比结果的能力，并实现可衡量的进步。

英文摘要

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

URL PDF HTML ☆

赞 0 踩 0

2502.08534 2026-05-19 cs.CE cs.AI 版本更新

Input convex neural networks: universal approximation theorem and implementation for isotropic polyconvex hyperelastic energies

输入凸神经网络：通用逼近定理及各向同性多凸超弹性能量的实现

Gian-Luca Geuken, Patrick Kurzeja, David Wiedemann, Jörn Mosler

发表机构 * Institute of Mechanics, Department of Mechanical Engineering, TU Dortmund University（多特蒙德工业大学机械工程系力学研究所）； Applied Analysis, Faculty of Mathematics, TU Dortmund University（多特蒙德工业大学数学学院应用分析组）

AI总结本文提出了一种新的神经网络框架，用于各向同性超弹性，该框架在满足必要的物理和数学约束的同时，也满足通用逼近定理。关键成分是输入凸网络架构和变形梯度的符号奇异值的初等多项式形式的公式化。与之前发布的网络一致，它可以严格捕捉框架不变性和多凸性，以及诸如角动量平衡和增长条件等其他约束。然而，与之前的方法不同，本文为所提出的方法证明了通用逼近定理。更具体地说，所提出的网络可以近似任何框架不变、各向同性多凸能量（只要网络足够大）。这通过使用框架不变、各向同性多凸函数的充分必要条件来实现。与现有方法的比较研究识别了所提出方法的优势，特别是在近似非多凸能量以及计算多凸包方面。

详情

DOI: 10.1016/j.jmps.2025.106209

AI中文摘要

本文提出了一种新的神经网络框架，用于各向同性超弹性，该框架在满足必要的物理和数学约束的同时，也满足通用逼近定理。该框架的两个关键成分是输入凸网络架构和变形梯度的符号奇异值的初等多项式形式的公式化。与之前发布的网络一致，它可以严格捕捉框架不变性和多凸性，以及诸如角动量平衡和增长条件等其他约束。然而，与之前的方法不同，本文为所提出的方法证明了通用逼近定理。更具体地说，所提出的网络可以近似任何框架不变、各向同性多凸能量（只要网络足够大）。这通过使用框架不变、各向同性多凸函数的充分必要条件来实现。与现有方法的比较研究识别了所提出方法的优势，特别是在近似非多凸能量以及计算多凸包方面。

英文摘要

This paper presents a novel framework of neural networks for isotropic hyperelasticity that enforces necessary physical and mathematical constraints while simultaneously satisfying the universal approximation theorem. The two key ingredients are an input convex network architecture and a formulation in the elementary polynomials of the signed singular values of the deformation gradient. In line with previously published networks, it can rigorously capture frame-indifference and polyconvexity - as well as further constraints like balance of angular momentum and growth conditions. However and in contrast to previous networks, a universal approximation theorem for the proposed approach is proven. To be more explicit, the proposed network can approximate any frame-indifferent, isotropic polyconvex energy (provided the network is large enough). This is possible by working with a sufficient and necessary criterion for frame-indifferent, isotropic polyconvex functions. Comparative studies with existing approaches identify the advantages of the proposed method, particularly in approximating non-polyconvex energies as well as computing polyconvex hulls.

URL PDF HTML ☆

赞 0 踩 0

2407.13059 2026-05-19 cs.CY cs.AI cs.ET 版本更新

Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models

在评估人工智能模型时优先考虑高后果生物能力

Jaspreet Pannu, Doni Bloomfield, Alex Zhu, Robert MacKnight, Gabe Gomes, Anita Cicero, Thomas V. Inglesby

发表机构 * Center for Health Security, Bloomberg School of Public Health, Johns Hopkins University（健康安全中心，公共卫生学院，约翰霍普金斯大学）； Department of Health Policy, Stanford School of Medicine, Stanford University（健康政策系，斯坦福医学院，斯坦福大学）； Department of Chemical Engineering, Carnegie Mellon University（化学工程系，卡内基梅隆大学）； Department of Chemistry, Carnegie Mellon University（化学系，卡内基梅隆大学）； Wilton E. Scott Institute for Energy Innovation, Carnegie Mellon University（威尔顿·E·斯科特能源创新研究所，卡内基梅隆大学）

AI总结本文基于人工智能模型的安全性、安全性和伦理问题，提出在评估人工智能模型时应优先考虑高后果风险，如大规模公众危害（如大流行病），并应在部署前进行评估，以建立针对性的AI安全评估方法，确保工具的安全性和防止潜在危害。

Comments 9 pages, 1 figure, 3 tables, 1 box

详情

DOI: 10.1371/journal.pcbi.1012975

AI中文摘要

随着人工智能能力的快速提升，过去一年中，各国政府和多国机构已宣布努力应对与人工智能模型相关的安全、安全和伦理问题。其中一项重要重点是减少人工智能模型的滥用。许多生物学家多年来一直在努力减少可能导致通过事故或滥用引发高后果疾病爆发的科研风险。科学家们已仔细考虑了哪些类型的生物科学研究具有潜在的益处和风险（双重用途），特别是随着科学进步加速了我们对生物体的工程能力和病原体新变种的创造能力。本文描述了科学家和政策专业人士在生物科学中的双重用途能力的先前经验研究如何帮助评估具有生物能力的人工智能模型的风险。我们主张人工智能模型的评估应优先处理高后果风险（可能导致大规模公众危害，如流行病），并在部署前进行评估，以便允许潜在的生物安全和/或生物安全措施。科学家在识别和缓解双重用途生物风险方面的经验可以帮助指导新的评估方法来评估生物人工智能模型。确定哪些AI能力最可能引发生物安全和生物安全问题是必要的，以便建立有针对性的AI安全评估方法，确保这些工具的安全性，防止事故和滥用，并避免阻碍巨大的潜在利益。

英文摘要

As a result of rapidly accelerating AI capabilities, over the past year, national governments and multinational bodies have announced efforts to address safety, security and ethics issues related to AI models. One high priority among these efforts is the mitigation of misuse of AI models. Many biologists have for decades sought to reduce the risks of scientific research that could lead, through accident or misuse, to high-consequence disease outbreaks. Scientists have carefully considered what types of life sciences research have the potential for both benefit and risk (dual-use), especially as scientific advances have accelerated our ability to engineer organisms and create novel variants of pathogens. Here we describe how previous experience and study by scientists and policy professionals of dual-use capabilities in the life sciences can inform risk evaluations of AI models with biological capabilities. We argue that AI model evaluations should prioritize addressing high-consequence risks (those that could cause large-scale harm to the public, such as pandemics), and that these risks should be evaluated prior to model deployment so as to allow potential biosafety and/or biosecurity measures. Scientists' experience with identifying and mitigating dual-use biological risks can help inform new approaches to evaluating biological AI models. Identifying which AI capabilities pose the greatest biosecurity and biosafety concerns is necessary in order to establish targeted AI safety evaluation methods, secure these tools against accident and misuse, and avoid impeding immense potential benefits.

URL PDF HTML ☆

赞 0 踩 0

2605.18150 2026-05-19 cs.AI 版本更新

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

噪声中的低语：基于多智能体框架的代理引导概念觉醒

Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, Yi Zhang

发表机构 * Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University（香港理工大学电子与电气工程系）； School of Cyber Science and Engineering, Sichuan University（四川大学网络空间安全学院）； Lee Kong Chian School of Medicine, Nanyang Technological University（南洋理工大学李光前医学院）

AI总结本文研究了在黑盒约束下如何通过多智能体框架从预训练模型中恢复被擦除的概念，提出了一种无需训练的代理方法，通过引导噪声状态来实现可控的觉醒，展示了当前概念擦除方法的局限性。

详情

AI中文摘要

扩散模型（DMs）被广泛用于文本到图像生成，但其强大的生成能力也引发了对不安全或不期望内容的担忧。概念擦除旨在通过从预训练模型中移除特定概念来缓解这些风险。然而，最近的研究表明，此类方法往往抑制而非完全消除目标概念，使模型易受觉醒攻击。现有方法主要依赖于通过优化或反向操作进行白盒访问，而概念觉醒在黑盒约束下仍显不足。在本文中，我们重新审视去噪过程并从轨迹角度出发，表明概念擦除主要破坏早期阶段的文本-语义对齐，但并未完全阻止语义信息沿去噪动态传播。随着生成过程的进行，模型越来越依赖于演化的噪声状态而非文本条件，这为绕过擦除映射提供了机会。受此观察启发，我们提出了ConceptAgent，一种无需训练、黑盒、多智能体框架，通过引导噪声状态初始化去噪轨迹来唤醒擦除的概念。大量实验表明，ConceptAgent能够在无模型参数、梯度或内部表示访问的情况下，实现准确且可控的擦除概念觉醒。这些结果突显了当前概念擦除方法的根本限制，并提供了关于DMs中语义控制动态性质的新见解。

英文摘要

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

URL PDF HTML ☆

赞 0 踩 0

2605.18144 2026-05-19 cs.AI 版本更新

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

基于证据的前沿映射与代理假设生成在纳米医学中

Christiaan G. A. Viviers, Koen de Bruin, Mirre M. Trines, Ayla M. Hokke, Roy van der Meel, Avi Schroeder, Twan Lammers, Willem J. M. Mulder, Fons van der Sommen

发表机构 * ARIA Lab, Signal Processing Systems, Department of Electrical Engineering, Eindhoven University of Technology（埃因霍温理工大学电气工程系信号处理系统组ARIA实验室）； Laboratory of Chemical Biology, Department of Biomedical Engineering, Eindhoven University of Technology（埃因霍温理工大学生物医学工程系化学生物学实验室）； The Louis Family Laboratory for Targeted Drug Delivery and Personalized Medicine Technologies, Department of Chemical Engineering, Technion - Israel Institute of Technology（以色列理工学院化学工程系Louis Family靶向药物递送与个性化医学技术实验室）； Department of Nanomedicine and Theranostics, Institute for Experimental Molecular Imaging (ExMI), RWTH Aachen University Hospital（亚琛工业大学医院实验分子成像研究所（ExMI）纳米医学与诊疗学系）； Department of Internal Medicine and Radboud Center for Infectious Diseases (RCI), Radboud University Medical Center（Radboud大学医学中心内科学系与Radboud感染疾病中心（RCI））

AI总结该研究提出了一种结合文章嵌入、相似性图分析、稀疏前沿提取、结构化证据包检索和审计过的大型语言模型（LLM）工作流的系统pArticleMap，用于支持纳米医学研究方向的选择和假设生成，通过生成和评分基于引用的假设，实现了证据导向的研究辅助。

详情

AI中文摘要

纳米医学研究涵盖了递送化学、免疫学、成像、生物材料和疾病特定的转化科学，但其概念设计空间仍然分散在大量异质文献中。迄今为止，人工智能在纳米医学中的应用主要集中在性质预测和配方优化，较少关注研究方向选择层面基于证据的发现支持。我们引入了pArticleMap，一个文献映射和研究假设生成系统，它结合了文章嵌入、相似性图分析、稀疏前沿提取、结构化证据包检索，以及经过审计的大型语言模型（LLM）工作流。该系统并不预测未来的概念共现，而是针对低密度的文章级桥接区域和聚类界面，然后在智能体设置中利用大型语言模型生成基于引用的假设并对其评分。我们通过回顾性实现基准（在历史截止点下生成后续文献）和盲法人类读者评估，在提示条件下的纳米医学任务中评估该系统。在4个选定的回顾性包中，pArticleMap按基准协议生成了想法，并选出任务保留的假设（获胜想法）。对于任务级保留的假设，汇总的黄金回收率为10.8%，recall@10为15.9%，未来邻域率为61.0%，表明即使没有精确的论文级回收，该系统也经常能够到达正确的前瞻性邻域（论文想法）。人类与智能体的一致性总体一般，表明内部评分可作为有用的支持信号，但不能替代专家判断。这些结果将pArticleMap定位为一种保守的、基于证据的纳米医学研究助手。

英文摘要

Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused primarily on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow for grounded ideation. Rather than forecasting future concept co-occurrence, pArticleMap targets low-density article-level bridge regions and cluster interfaces, then generates and scores citation-grounded hypotheses with large language models in an agentic setup. We evaluate the system with a retrospective realization benchmark (generate later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, a pooled gold recovery rate of 10.8% was obtained, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.
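For readers unfamiliar with the recall@10 figure, the sketch below shows the standard definition of recall@k: the fraction of relevant items that appear among the top k ranked results. Whether pArticleMap computes its metric exactly this way is an assumption, and the identifiers in the example are made up.

```python
def recall_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of relevant items found in the first k ranked results."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0  # convention chosen here for an empty relevant set
    found = set(ranked_ids[:k]) & relevant
    return len(found) / len(relevant)


# Hypothetical example: one of two relevant papers appears in the top 2.
score = recall_at_k(["a", "b", "c"], ["b", "z"], k=2)
```

Because the denominator is the number of relevant items rather than k, the score stays below 1 whenever some relevant items never appear in the ranking.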


2605.18143 2026-05-19 cs.AI 版本更新

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

生成式AI与生产力差距：教育中的人类-人工智能互补性

Lihi Idan, Bharat Anand

发表机构 * Leonard N. Stern School of Business, New York University（纽约大学 Leonard N. Stern 商学院）； Industrial and Systems Engineering Department, Texas A&M University（德克萨斯农工大学工业与系统工程系）

AI总结本研究探讨了生成式AI对不同用户生产力影响的异质性，发现AI交互能力（AIC）是决定AI使用效果的关键因素，通过概念图干预可减少不平等，强调需结合AIC微培训和标准流程以实现持续价值捕获。

详情

AI中文摘要

生成式人工智能（GenAI）正在改变企业创造、处理和应用知识的方式，但对其生产力影响的异质性知之甚少。我们报告了一项随机对照试验的结果，参与者（早期知识工作者的类比）被分配在传统资源或大语言模型（LLM）辅助下自学技术领域。平均而言，GenAI访问显著提高了任务表现，但收益分布极不均衡。改进未由GPA或先前知识预测，而是由AI交互能力（AIC）——即获取、过滤和验证模型输出的能力——预测。高AIC参与者实现了显著收益；低AIC参与者则获得有限甚至负的边际回报。概念图干预（ scaffolding）减少了结果变异，表明标准化流程可减轻AI中介表现中的不平等。我们通过人类-人工智能互补性视角解读这些发现：GenAI提高平均生产力，但引入了新的能力不平等轴。管理上，企业应将GenAI访问与短期AIC微培训和简单标准操作程序相结合，以一致捕获价值并避免不均的采用结果。

英文摘要

Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants (analogs of early-career knowledge workers) were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by AI Interaction Competence (AIC), the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.


2605.18133 2026-05-19 cs.CR cs.AI cs.HC cs.IR 版本更新

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

通过提示注入在黑盒聊天机器人环境中研究隐私泄露链的实证研究

Hongjang Yang, Hyunsik Na, Daeseon Choi

发表机构 * Department of Information Security（信息安全系）； Soongsil University（崇实大学）； AI Safety Center（人工智能安全中心）； Department of AI Software（人工智能软件系）

AI总结本文研究了通过间接提示注入在黑盒聊天机器人环境中存在的隐私泄露攻击链，分析了攻击者如何通过构造看似无害的外部内容来劫持代理任务，并评估了新的提示注入技术'exemplification'的攻击成功率，最终展示了使用虚构个人信息的证明概念数据外泄链。

Comments 9 pages, 2 figures

详情

AI中文摘要

基于大型语言模型的聊天机器人代理越来越多地通过结合自然语言推理与外部工具（如网络浏览）来处理用户请求。这些能力提高了可用性，但当不信任的外部内容作为用户任务的一部分被处理时，也创建了攻击面。本文研究了一种基于间接提示注入的隐私泄露攻击链，其中攻击者无法访问模型权重、系统提示或代理实现细节，包括处理查询时轨迹的管理方式。我们首先分析了攻击者如何通过构造看似无害的外部内容来劫持代理的预期任务，同时诱导代理执行攻击者定义的目标。然后我们评估了一种新的提示注入技术，称为exemplification，该技术利用外部内容中的桥梁将用户提示和检索页面的无害开头重新表述为few-shot示例，然后附加攻击者的目标。我们将其攻击成功率与先前的假完成技术进行比较。最后，我们展示了在受控环境中使用虚构个人身份信息的证明概念数据外泄链。我们的结果表明，提示注入、类似禁令的指令引导和网络工具调用可以组合成一个可行的隐私泄露路径，在部署的聊天机器人代理中。

英文摘要

LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user's task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details, including how a trajectory is actually managed while a query is processed. We first analyze how an attacker can hijack an agent's intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker's objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.


2605.18132 2026-05-19 cs.CV cs.AI 版本更新

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

谁生成了这个3D资产？学习生成3D模型的来源归属

Sihan Ma, Siyuan Liang, Dacheng Tao

发表机构 * College of Computing & Data Science, Nanyang Technological University, Singapore（南洋理工大学计算机与数据科学学院）

AI总结该研究提出了一种方法，用于确定给定3D资产是由哪种生成模型创建的，通过构建首个被动来源归属基准，发现生成3D模型留下稳定的指纹特征，从而建立了可信的3D内容来源的新标准。

详情

AI中文摘要

生成3D模型被应用于游戏、机器人和沉浸式创作，因此来源归属至关重要：给定一个3D资产，我们能否确定并识别出是哪种生成模型创建的？该问题面临两个核心挑战：分散的归属信号，其中3D指纹分布在多视角、几何和频率域提示中；以及现实部署约束，其中稀少的标签、退化的提示和混合真实/合成资产会破坏归属的可靠性。为了系统研究该问题，我们构建了迄今为止首个被动来源归属基准，涵盖22种代表性的3D生成器，在标准、少样本和现实部署协议下。基于此基准，我们发现生成3D模型留下两种稳定的指纹：跨视角不一致性和体现在几何统计和频率域提示中的结构伪影。为了捕捉这些分散的信号，我们提出了一种层次多视角多模态Transformer，融合每个视角的外观、几何和频率域特征，并在跨视角建模全局关系。大量实验表明性能优异，在全监督下达到97.22%的准确率，在仅有1%训练数据时达到77.17%的准确率，对应每个生成器少于五个样本。这些结果表明现代3D生成器留下稳定且可归属的指纹，建立了可信3D内容来源的新基准和方法论基础。

英文摘要

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.


2605.18128 2026-05-19 cs.AI 版本更新

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

POST: 基于先验观察的时空关联对抗学习用于多变量时间序列异常检测

Suofei Zhang, Yaxuan Zheng, Haifeng Hu

发表机构 * School of Internet of Things（物联网学院）； National Engineering Research Center of Communications and Networking（通信与网络国家工程研究中心）

AI总结本文提出了一种新的框架，通过联合先验观察对抗学习方法统一时空建模，以解决多变量时间序列异常检测中的空间过泛化问题，并在公开数据集和自建基准上展示了在时间检测和空间定位任务上的新状态。

详情

AI中文摘要

现有的多变量时间序列异常检测（MTSAD）框架越来越多地依赖于将图神经网络（GNNs）与序列模型相结合，以捕捉复杂的时空依赖关系。然而，较少关注空间过泛化问题，即不受约束的结构建模会 indiscriminately 重建异常，不可避免地降低检测召回率。为了解决这个问题，我们提出了一种新的框架，通过联合先验观察对抗学习方法统一时空建模。在空间维度上，模型交替学习邻接矩阵作为结构先验，并在训练过程中通过最小化方式建模先验与数据驱动观察之间的关联差异。这种对抗优化不仅提高了模型对时间检测的敏感性，还使模型能够定位到特定通道的异常。为了系统评估这种异常定位能力，我们进一步构建了一个带有精确通道注释的合成基准。在公开数据集和我们专门的基准上进行的广泛实验表明，所提出的框架在时间和空间定位任务上都建立了新的状态。我们的代码、预训练模型和基准已公开在 https://github.com/anocodetest1/POST。

英文摘要

Existing Multivariate Time Series Anomaly Detection (MTSAD) frameworks increasingly rely on integrating Graph Neural Networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention is paid to the spatial over-generalization problem, where unconstrained structural modeling indiscriminately reconstructs anomalies, inevitably degrading detection recall. To tackle this problem, we propose a novel framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as structural prior and models the association discrepancy between prior and data-driven observation in a minimax manner during training. Such adversarial optimization not only improves the model sensitivity for time-wise detection, but also enables the model to localize anomalies to specific channels. To systematically evaluate this anomaly localization capability, we further construct a synthetic benchmark equipped with precise channel-wise annotations. Extensive experiments across public datasets and our dedicated benchmark demonstrate that the proposed framework establishes a new state-of-the-art in both time-wise detection and spatial localization tasks. Our code, pre-trained models, and benchmark are publicly available at https://github.com/anocodetest1/POST.


2605.18109 2026-05-19 cs.AI cs.CV cs.RO 版本更新

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround：全场景家庭推理的结构化可执行任务推断

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University（清华大学）； Microsoft Research Asia（微软亚洲研究院）； Peking University（北京大学）； Fudan University（复旦大学）； Zhejiang University（浙江大学）

AI总结本文提出TaskGround框架，通过结构化任务推断提升全场景家庭推理能力，其核心贡献是引入FullHome评估套件，验证了在家庭场景中执行任务结构推断的重要性，并展示了紧凑本地模型在实际家庭部署中的有效性。

Comments Project page: https://aaronfengzy.github.io/TaskGround/

详情

AI中文摘要

在真实家庭部署中，家庭代理通常必须从完整的家庭场景和处于特定情境的家庭请求出发，而不是从干净的任务规范出发。此类请求要求代理识别与任务相关的实体，恢复意图的任务条件，并从周围场景上下文中解决顺序约束。我们正式将这种能力定义为全场景家庭推理：给定一个完整的家庭场景和一个处于特定情境的家庭请求，代理必须在生成接地技能级动作序列之前推断出可执行的任务结构。这种设置具有挑战性，因为完整的家庭场景包含大量与任务无关的信息，使直接完整场景提示效率低下且容易出错。在实际部署中，这一挑战进一步被隐私和本地计算限制放大，这些限制更倾向于紧凑的开放权重模型，其具有有限的长上下文推理能力。我们提出TaskGround，一种无需训练且模型无关的Ground-Infer-Execute框架，该框架将完整的场景接地为紧凑的任务相关场景切片，推断出可执行的任务结构，并将其编译为接地的技能级动作序列。为了评估这一设置，我们引入了FullHome，一个经过人类验证的400个家庭任务评估套件，涵盖多样化的家庭规模环境以及目标导向和过程约束要求。在FullHome上，TaskGround在专有和开放权重模型上均大幅提升了任务成功率。值得注意的是，它使Qwen3.5-9B在直接完整场景提示下与GPT-5竞争，同时将总输入token成本减少了多达18倍。我们的结果识别了执行任务结构推断为全场景家庭推理中的关键瓶颈，并表明结构化接地可以显著提高紧凑本地模型在实际家庭部署中的有效性。

英文摘要

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.


2605.18104 2026-05-19 cs.AI cs.CR 版本更新

通过缓解过压缩来改进时空残差误差传播

Seyed Mohamad Moghadas, Esther Rodrigo Bonet, Bruno Cornelis, Adrian Munteanu

发表机构 * ETRO Department, Vrije Universiteit Brussel（布鲁塞尔自由大学ETRO系）； imec

AI总结本文提出Teger模块，通过空间曲率感知的图重排机制改进误差相关的自回归预测，提升时空预测的连续排名概率得分。

详情

AI中文摘要

残差误差传播仍然是递归模型中的基本问题，其中小的预测不准确会随时间累积并降低长周期性能。准确建模此类残差的相关结构对于概率多变量时间序列预测中的可靠不确定性量化至关重要。尽管最近的时间序列深度模型能够高效参数化时间变化的同期相关性，但它们通常假设误差的时序独立性，并忽略了观测网络中的空间相关性。在本文中，我们引入Teger，一个结构化的不确定性模块，克服了误差相关自回归预测中的空间和时间限制。Teger提出了一种空间曲率感知的图重排机制，明确加强了由离散Forman曲率识别出的信息瓶颈边。该组件被集成到低秩加对角协方差头中，通过Woodbury恒等式保持可推断性。Teger是backbone无关的，仅需任何自回归编码器产生的潜在状态。我们提供了Teger的理论证据，并在四个现实世界的时空数据集上实验评估了它在LSTM、Transformer和xLSTM backbone上的表现，显示了连续排名概率得分的一致改进。我们进一步提供了将曲率感知重排与（i）过压缩缓解、（ii）改进的谱连接性、（iii）减少有效电阻以及（iv）改进的协方差校准界联系起来的正式理论分析。

英文摘要

Residual error propagation remains a fundamental problem in recurrent models, where small prediction inaccuracies compound over time and degrade long-horizon performance. Accurately modeling the correlation structure of such residuals is critical for reliable uncertainty quantification in probabilistic multivariate time-series forecasting. While recent time-series deep models efficiently parametrize time-varying contemporaneous correlations, they often assume temporal independence of errors and neglect spatial correlation across the observed network. In this paper, we introduce Teger, a structured uncertainty module that overcomes the spatial and temporal limitations of error-correlated autoregressive forecasting. Teger proposes a spatial curvature-aware graph rewiring mechanism explicitly strengthening information-bottleneck edges identified by discrete Forman curvature. The component is integrated into a low-rank-plus-diagonal covariance head, preserving tractable inference via the Woodbury identity. Teger is backbone-agnostic, requiring only the latent state produced by any autoregressive encoder. We provide theoretical evidence for Teger, and experimentally evaluate it on LSTM, Transformer, and xLSTM backbones across four real-world spatio-temporal datasets, showing consistent improvement in Continuous Ranked Probability Score (CRPS). We further provide a formal theoretical analysis connecting curvature-aware rewiring to (i) oversquashing alleviation, (ii) improved spectral connectivity, (iii) reduced effective resistance, and (iv) improved covariance calibration bounds.
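
The abstract states that a low-rank-plus-diagonal covariance head stays tractable through the Woodbury identity. As a minimal sketch of why, the snippet below handles only the rank-1 special case (the Sherman-Morrison formula), where solving against diag(d) + u uᵀ costs O(n) instead of a dense O(n³) inversion. The function names and the variables `d`, `u`, and `x` are illustrative assumptions and do not come from the paper; Teger's actual covariance head may differ.

```python
def solve_diag_plus_rank1(d, u, x):
    """Return (diag(d) + u u^T)^{-1} x using the Sherman-Morrison formula.

    d: positive diagonal entries, u: rank-1 factor, x: right-hand side.
    All names are hypothetical; this is not the paper's implementation.
    """
    dinv_x = [xi / di for xi, di in zip(x, d)]          # D^{-1} x
    dinv_u = [ui / di for ui, di in zip(u, d)]          # D^{-1} u
    u_dinv_x = sum(ui * v for ui, v in zip(u, dinv_x))  # u^T D^{-1} x
    u_dinv_u = sum(ui * v for ui, v in zip(u, dinv_u))  # u^T D^{-1} u
    scale = u_dinv_x / (1.0 + u_dinv_u)
    return [a - scale * b for a, b in zip(dinv_x, dinv_u)]


def apply_diag_plus_rank1(d, u, y):
    """Return (diag(d) + u u^T) y; used here only to check the solve."""
    u_dot_y = sum(ui * yi for ui, yi in zip(u, y))
    return [di * yi + ui * u_dot_y for di, ui, yi in zip(d, u, y)]
```

Applying `apply_diag_plus_rank1` to the output of `solve_diag_plus_rank1` should reproduce `x` up to floating-point error. A rank-r factor would replace the scalar division with an r-by-r solve, which is the general Woodbury case.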


2605.18055 2026-05-19 cs.LG cs.AI 版本更新

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

FLAG: 通过图结构的潜在扩散对齐实现基础模型表示以空间基因表达预测

Qi Si, Penglei Wang, Yushuai Wu, Yifeng Jiao, Xuyang Liu, Xin Guo, Yuan Qi, Yuan Cheng

发表机构 * Shanghai Academy of Artificial Intelligence for Science, Shanghai, China.（上海人工智能科学研究院）； School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China.（上海交通大学生物医学工程学院）； Incubation Institute, Fudan University, Shanghai, China.（复旦大学孵化院）

AI总结本文提出FLAG框架，通过图结构的潜在扩散对齐方法，解决空间基因表达预测中的基因协调和空间分布关系问题，并引入基因维度诅咒的概念，通过空间图编码器和基因基础模型对齐来提升模型的结构一致性与基因间保真度。

Comments 9 pages for main text, 3 pages for references, 19 pages for appendix. accepted by ICML 2026

详情

AI中文摘要

从常规的H&E染色预测空间基因表达能够实现大规模分子谱分析，但当前模型将此任务视为孤立的点wise任务，从而忽略了诸如基因协调和空间分布等关键生物结构。为保持这些关系，我们引入FLAG，一种基于扩散的框架，将此任务重新定义为结构分布建模。同时，我们识别出关键的基因维度诅咒，即联合建模基因表达及其空间相互作用在高维空间中失效，而FLAG通过整合空间图编码器以实现拓扑一致性，并利用基因基础模型（GFM）对齐以在生成过程中保持基因-基因的保真度。为严格评估模型性能，我们提出了一组新的结构评估度量标准，包括基因结构相关性（GSC）和空间结构相关性（SSC）。我们的实验表明，FLAG在传统准确性（PCC/MSE）方面具有高度竞争力，同时在捕捉基因-基因和基因-空间关系时实现了显著增强的结构保真度。代码可在https://github.com/darkflash03/FLAG上获取。

英文摘要

Predicting spatial gene expression from routine H&E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce FLAG, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical Gene Dimension Curse, where jointly modeling gene expression and its spatial interactions fails in high-dimensional spaces, and FLAG solves this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (GSC) and Spatial Structural Correlation (SSC). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at https://github.com/darkflash03/FLAG.


2605.18048 2026-05-19 cs.AI 版本更新

零阶硬阈值化中方差减少的新见解：缓解梯度误差和扩张性矛盾

Xinzhe Yuan, William de Vazelhes, Bin Gu, Huan Xiong

发表机构 * IASM, Harbin Institute of Technology（哈尔滨工业大学人工智能研究所，哈尔滨工业大学）； Mohamed bin Zayed University of Artificial Intelligence（穆罕默德·本·扎耶德人工智能大学）； School of Artificial Intelligence, Jilin University（吉林大学人工智能学院）

AI总结本文提出了一种通用的方差减少零阶硬阈值化算法，通过考虑方差的作用，缓解零阶梯度与硬阈值操作之间的冲突，从而消除对随机方向数量的限制，提高收敛速度和应用范围。

Comments Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong

Journal ref International Conference on Learning Representations (ICLR), 2024

详情

AI中文摘要

硬阈值化是机器学习中用于解决ℓ0约束优化问题的重要算法类型。然而，在某些情况下，目标函数的真实梯度可能难以获取，通常可以通过零阶（ZO）方法进行近似。到目前为止，SZOHT算法是唯一能够处理ℓ0稀疏性约束的ZO梯度算法。不幸的是，由于零阶梯度的偏差与硬阈值操作的扩张性之间存在固有的矛盾，SZOHT在ZO梯度的随机方向数量上存在明显的限制。本文通过考虑方差的作用，提供了一种新的方差减少见解：缓解零阶梯度与硬阈值操作之间的独特矛盾。在此视角下，我们提出了一种通用的方差减少零阶硬阈值化算法以及在标准假设下的通用收敛性分析。理论结果表明，新算法消除了对随机方向数量的限制，相较于SZOHT，具有改进的收敛速度和更广泛的应用范围。最后，我们通过岭回归问题以及黑盒对抗攻击问题展示了本方法的实用性。

英文摘要

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$ constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, in which case it can normally be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.
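
The two ingredients named in this abstract, a hard-thresholding operator and a zeroth-order gradient estimate built from random directions, can be sketched as follows. This is a generic illustration under stated assumptions, not the SZOHT algorithm and not the paper's variance-reduced method: the function names, the smoothing radius `mu`, the number of directions `q`, and the step size are all hypothetical choices.

```python
import random


def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [xi if i in keep else 0.0 for i, xi in enumerate(x)]


def zo_gradient(f, x, q=20, mu=1e-4, rng=random):
    """Estimate the gradient of f at x from function values only.

    Averages forward finite differences along q random Gaussian directions.
    """
    fx = f(x)
    grad = [0.0] * len(x)
    for _ in range(q):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        shifted = [xi + mu * ui for xi, ui in zip(x, u)]
        slope = (f(shifted) - fx) / mu  # directional derivative estimate
        grad = [g + slope * ui / q for g, ui in zip(grad, u)]
    return grad


def zo_hard_threshold_step(f, x, k, step=0.1, q=20):
    """One iteration: ZO gradient step, then projection onto k-sparse vectors."""
    g = zo_gradient(f, x, q=q)
    return hard_threshold([xi - step * gi for xi, gi in zip(x, g)], k)
```

With few directions the estimate is noisy, and the thresholding step can amplify that noise. This is the tension between gradient error and expansivity that the abstract describes.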


2605.18032 2026-05-19 cs.CL cs.AI cs.HC cs.SE 版本更新

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA：多智能体大语言模型工作流的离线评估与迭代优化

Kazuki Kawamura, Satoshi Waki, Kei Tateno

发表机构 * Sony Group Corporation（索尼集团公司）

AI总结本文提出PROTEA，一种用于多智能体大语言模型工作流的离线评估和迭代优化接口，通过配置评分标准和可视化工作流图中的节点状态，帮助开发者定位瓶颈并改进工作流性能。

Comments 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations

详情

AI中文摘要

多智能体大语言模型工作流——由多个角色特定的LLM调用组成——通常优于单提示基线，但调试和优化仍然困难。失败可能源于中间输出的细微错误，这些错误会传播到下游节点，要求开发者检查长轨迹并推断应修改哪个代理。我们提出了PROTEA，一个统一的接口，用于离线、测试驱动的多智能体工作流改进。PROTEA执行工作流，用可配置的评分标准评分中间节点输出，并在工作流图上叠加每个节点的状态和理由，以定位可能的瓶颈。为了支持复杂系统，其中最终答案参考是主要监督，PROTEA执行反向节点评估：它从最终答案参考和图上下文生成候选节点级期望，然后将它们与观察到的节点输出进行比较。对于选定的节点，PROTEA以可编辑的前后比较形式呈现目标提示修订，然后自动重新运行并重新评估工作流，以显示输出变化和评分轨迹。在两个生产相关的工作流中，PROTEA将文档检查准确性从64.3%提高到83.9%，推荐Hit@5从0.30提高到0.38。在与六名经验丰富的LLM开发者进行的形成研究中，参与者重视图层面的定位、节点级别的理由以及可编辑的前后提示修订。

英文摘要

Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.


2605.18031 2026-05-19 quant-ph cs.AI 版本更新

Quantum Sidecar Architectures for Hybrid AI Training and Inference: Stateful Protected Registers, Stateless Reset-and-Reprepare Circuits and Quantum Weight-State Outlook

用于混合AI训练和推理的量子Sidecar架构：状态保护寄存器、无状态重置和重新准备电路以及量子权重状态展望

Y. Mo, G. D. Su

发表机构 * Independent Researcher（独立研究者）； BroadLink Co., Ltd.（BroadLink公司）； Associate Professor, Hangzhou Dianzi University（杭州电子科技大学副教授）

AI总结本文提出了一种用于未来混合AI训练和推理的量子Sidecar架构家族，通过状态保护寄存器和无状态重置和重新准备模式，为优化器侧采样、适配器或专家选择、检索、路由和推理路径提案提供有界信号生成器，并引入了量子权重状态Sidecar作为受限的量子表示。

Comments 14 pages, 8 figures. Architecture and small-scale simulation study; no hardware experiment or quantum-advantage claim

详情

AI中文摘要

Spiker-LL：一种能效高的FPGA加速器，用于在脉冲神经网络中实现自适应局部学习

Alessio Caviglia, Filippo Marostica, Alessandro Savino, Stefano Di Carlo

发表机构 * Control and Computer Engineering Department, Politecnico di Torino（都灵理工学院控制与计算机工程系）

AI总结本文提出Spiker-LL，一种能效高的FPGA加速器，通过扩展开源的Spiker+推理架构，实现了高效的STSF局部学习规则支持，从而在边缘设备上实现自适应局部学习。

2605.17999 2026-05-19 cs.AI 版本更新

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

共享骨干PPO用于多UAV通信覆盖与连接保持

Z. Jiang

发表机构 * Zhejiang University（浙江大学）

AI总结本文提出了一种共享骨干PPO算法，通过在Actor和Critic网络之间共享基础模块，实现了高效的训练和提升的性能。该算法在保持连接的多UAV群体通信覆盖任务中得到实现，并与标准PPO算法进行比较。实验结果表明，所提出的方法具有优越的性能，此外，还集成了图信息聚合模块以适应代理之间的通信条件。整合该模块后，算法仍保持有效，训练后的代理群体表现出更高的合作水平。

2605.17997 2026-05-19 cs.LG cs.AI cs.CV 版本更新

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

MARR: 模块自适应残差重建用于低比特后训练量化

Le Su, Xing Luo, Zhi Jin

发表机构 * Peng Cheng Laboratory（鹏城实验室）

AI总结本文提出MARR，一种模块自适应残差重建方法，通过为每个模块分配特定的缩放系数，平衡残差相关的HA偏差和累积误差校正，从而在低比特量化中提升性能。

详情

AI中文摘要

近年来，基于残差重建的模型量化方法在低比特后训练量化（PTQ）中取得了有希望的性能，通过引入跨层残差来减少来自先前层的误差积累。然而，这些残差也可能引入额外的偏差，源于重建基于PTQ的Hessian近似（HA）假设，导致量化性能不理想。在本文中，我们分析发现，通过将残差项乘以一个缩放系数，可以提供一种直接的方法来缓解与残差强度相关的HA偏差，同时保持累积误差校正。更重要的是，我们观察到这种权衡是模块依赖性的，使单一全局残差强度不足以在不同模块之间平衡有效的校正和残差相关的偏差。基于这些观察，我们提出了模块自适应残差重建（MARR），为每个模块分配模块特定的缩放系数，以自适应地平衡累积误差校正和残差相关的HA偏差。为了避免昂贵的每模块系数搜索并获得稳定的系数估计，我们设计了一种基于比例-积分-微分（PID）的自适应更新策略，利用重建误差作为反馈，逐步细化此系数。在多个典型的大语言模型（LLMs）和视觉变换器（ViTs）上的实验表明，MARR在低比特量化（小于等于4位）中表现出色，实现了LLMs高达20.2%的性能提升，以及ViTs相对于残差重建最先进的方法高达4.6%的相对提升。代码将在接受后公开发布。

英文摘要

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers. However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance. In this work, our analysis shows that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules. Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module. To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods. Code will be made publicly available upon acceptance.
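
The abstract mentions a PID-based update that refines a per-module scaling coefficient from reconstruction-error feedback but does not give the update rule. The sketch below shows what a generic PID-style update of one such coefficient could look like. The class name, the gains `kp`, `ki`, `kd`, the clamping range, and the definition of the error signal are all assumptions made for illustration and should not be read as MARR's actual rule.

```python
class PIDCoefficient:
    """Hypothetical PID controller tuning one module's residual scaling coefficient."""

    def __init__(self, alpha=1.0, kp=0.5, ki=0.1, kd=0.05, lo=0.0, hi=1.0):
        self.alpha = alpha                      # current scaling coefficient
        self.kp, self.ki, self.kd = kp, ki, kd  # assumed controller gains
        self.lo, self.hi = lo, hi               # assumed clamping range
        self.integral = 0.0
        self.prev_error = None

    def update(self, error):
        """Move alpha against the feedback error and return the new value.

        `error` is assumed to be a signed signal, for example the change in
        reconstruction error observed after the previous adjustment.
        """
        self.integral += error
        derivative = 0.0 if self.prev_error is None else error - self.prev_error
        self.prev_error = error
        delta = self.kp * error + self.ki * self.integral + self.kd * derivative
        self.alpha = min(self.hi, max(self.lo, self.alpha - delta))
        return self.alpha
```

One controller instance per module would be kept across calibration steps, which avoids a grid search over coefficients at the cost of choosing the gains.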


2605.17994 2026-05-19 cs.IR cs.AI 版本更新

Towards Sustainable Growth: A Multi-Value-Aware Retrieval Framework for E-Commerce Search

迈向可持续增长：面向电子商务搜索的多价值感知检索框架

Yifan Wang, Yixuan Wang, YiDan Liang, Qiang Liu, Fei Xiao

发表机构 * Taobao & Tmall Group of Alibaba, Hangzhou, China； Taobao & Tmall Group of Alibaba

AI总结本文提出了一种多价值感知的检索框架，旨在平衡即时转化与长期商品增长，通过引入ItemLTV模块和MultiGR模块，提升电子商务搜索系统的可持续增长能力。

详情

AI中文摘要

新商品增长对于维持大型电子商务平台的健康生态系统至关重要。然而，现有系统倾向于优先展示已流行的商品，这种现象通常被称为“马太效应”。在检索检索的背景下，当前冷启动模型面临训练目标与在线业务指标不一致的问题，并缺乏有效机制来衡量商品的增长潜力。在本文中，我们提出了一种针对电子商务搜索的多价值感知检索框架GrowthGR，旨在更好地对齐搜索系统不同阶段的 cascaded 在线价值，同时平衡即时转化和长期商品增长。我们的框架GrowthGR包含两个关键组件：一个用于预测商品长期交易价值的ItemLTV模块和一个基于语义-ID的生成检索架构的多价值感知生成检索（MultiGR）模块。首先，在ItemLTV模块中，我们采用反事实推理来量化单个用户交互带来的长期价值增量。其次，在MultiGR模块中，基于语义-ID的生成检索架构，我们利用具有搜索级联信号的结构化样本，并采用多价值感知策略优化（MoPO）训练范式，以对齐多阶段在线价值，同时显式平衡短期交易价值和由ItemLTV估计的长期增长潜力。我们成功在淘宝的生产平台部署了GrowthGR，实现了新商品GMV的显著提升5.3%，并带来了整体搜索GMV的非平凡0.3%增长。广泛的在线分析和A/B测试证明了其对整体生态系统价值的积极影响。

英文摘要

New item growth is critical for maintaining a healthy ecosystem in large-scale e-commerce platforms. However, existing systems tend to prioritize presenting users with already popular items, a phenomenon often referred to as the "Matthew effect". In the context of search retrieval, current cold-start models suffer from the misalignment between training objectives and online business metrics, and they lack effective mechanisms to measure an item's growth potential. In this paper, we propose a Multi-Value-Aware retrieval framework tailored for e-commerce search, designed to better align with the cascaded online values across different stages of the search system while balancing immediate conversion and long-term item growth. Our framework GrowthGR consists of two key components: an Item Long-term Transaction Value Prediction (ItemLTV) module and a Multi-Value-Aware Generative Retrieval (MultiGR) module. First, in the ItemLTV module, we employ counterfactual inference to quantify the long-term value increment attributable to a single user interaction. Second, in the MultiGR module, building upon a semantic-ID-based generative retrieval architecture, we leverage structured samples with the search cascade signals and adopt a Multi-Value-Aware Policy Optimization (MoPO) training paradigm to align with multi-stage online values, while explicitly balancing short-term transactional value and long-term growth potential estimated by ItemLTV. We successfully deployed GrowthGR on Taobao's production platform, achieving a substantial 5.3% lift in new item GMV while delivering a non-trivial 0.3% gain in overall search GMV. Extensive online analysis and A/B testing demonstrate its positive impact on the overall ecosystem value.


2605.17991 2026-05-19 cs.SD cs.AI 版本更新

Stable Audio 3

稳定音频3

Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

AI总结稳定音频3提出了一种快速的潜在扩散模型家族，用于可变长度音频生成和编辑，通过高效的潜在空间生成和对抗训练提升了生成质量和效率。

Comments Training code: https://github.com/Stability-AI/stable-audio-tools Inference and weights: http://github.com/Stability-AI/stable-audio-3

详情

AI中文摘要

Stable Audio 3 是一组快速的潜在扩散模型（小、中、大）用于可变长度音频生成和编辑。由于我们的模型可以生成几分钟的音频，可变长度生成对于避免生成完整长度音频以生成短声音的成本至关重要。我们还支持修复，使能够进行有针对性的音频编辑和短录音的延续。我们的潜在扩散模型基于一种新的语义-声学自编码器，该自编码器将音频投影到紧凑的潜在空间中，从而在高效扩散生成的同时保持音频保真度，并在潜在空间中鼓励语义结构。最后，我们通过对抗性后训练来加速推理并提高生成质量，减少推理步骤的数量同时提高保真度和提示的遵循性。Stable Audio 3 模型在授权和Creative Commons数据上进行训练，可在H200 GPU上在2秒内生成音乐和声音，在MacBook Pro M4上在几秒内完成。我们发布了小和中型模型的权重，这些模型可以在消费级硬件上运行，并附带其训练和推理流程。

英文摘要

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than 2 s on an H200 GPU and in a few seconds on a MacBook Pro M4. We release the weights of the small and medium models, which can run on consumer-grade hardware, together with their training and inference pipeline.


2605.17989 2026-05-19 cs.CL cs.AI 版本更新

Predictive Prefetching for Retrieval-Augmented Generation

检索增强生成的预测预取

Wuyang Zhang, Shichao Pei

发表机构 * Department of Computer Science, University of Massachusetts Boston（马萨诸塞大学波士顿分校计算机科学系）

AI总结本文提出了一种先进的异步检索框架，通过预测检索触发时机和所需信息，以减少延迟并提高生成效率，同时保持回答质量。

Comments Accepted by Forty-third International Conference on Machine Learning ICML 2026

详情

AI中文摘要

检索增强生成（RAG）通过在大型语言模型中增强事实性，但因其同步检索导致显著延迟。尽管近期工作探索了异步检索，但现有方法依赖于检索与生成之间的启发式协调，并假设解码期间信息需求稳定，这在复杂、多领域设置中往往失效。本文提出了一种先进的异步检索框架，该框架能够与不断演变的信息需求相匹配，通过利用生成动态中出现的语义前驱，使用三个组件——检索预测器、上下文监视器和查询生成器，显式预测何时应触发检索以及应检索什么信息。在多个基准测试上的实验表明，该方法可实现高达43.5%的端到端延迟减少和62.4%的时间到第一个token的提升，同时保持与同步RAG基线相当的回答质量。

英文摘要

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.


2605.17985 2026-05-19 cs.LG cs.AI 版本更新

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

SAFE-SVD：面向物理基础模型的敏感性感知保真度压缩SVD

Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang, He Wang

发表机构 * AI Centre, University College London（伦敦大学学院人工智能中心）； University College London（伦敦大学学院）； Central South University（中南大学）； University of Massachusetts at Amherst（马萨诸塞大学阿姆赫斯特分校）

AI总结本文提出了一种新的压缩物理基础模型的方法，通过在压缩过程中显式建模损失感知的层敏感性，以保持准确性和物理保真度，实验表明在多个模型和数据集上实现了显著的压缩增益。

详情

AI中文摘要

我们提出了一种新的方法，用于压缩物理基础模型（PFMs），这是AI for Science领域的新趋势。尽管模型压缩对于减少内存使用和加速大基础模型的推理至关重要，但其在PFMs中的应用仍然不足探索，因为保持物理保真度至关重要。挑战在于物理数据的功能性质，其中偏导数编码了时空动态，并对压缩具有高度敏感性。传统压缩方法忽视了这种结构，常常导致严重的性能退化或失败。为此，我们引入了一种敏感性感知的保真度强制压缩框架，在压缩过程中显式建模输出函数空间中的损失感知层敏感性。这为压缩科学基础模型提供了一条新途径，同时保持准确性和物理保真度。实验表明，在多个模型和数据集上，相较于现有方法，取得了显著的增益，实现了更高的压缩比，同时保持准确性，在某些情况下甚至提高了几个数量级。更广泛地说，这项工作可能引领AI for Science领域高效、可部署和可持续的科学基础模型的新子领域。

英文摘要

We propose a new method for compressing physics foundation models (PFMs) which is a new trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.


2605.17976 2026-05-19 cs.AI math.OC 版本更新

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

释放大语言模型于贝叶斯优化：用于科学发现的偏好引导框架

Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu

发表机构 * Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； Harbin Institute of Technology（哈尔滨工业大学）； Shanghai Jiao Tong University（上海交通大学）

AI总结本文提出了一种基于大语言模型的贝叶斯优化框架LGBO，通过在优化循环中持续整合大语言模型的语义推理，提高了科学发现中的优化效率和收敛速度。

Comments Published as a conference paper at ICLR 2026. 10 pages main paper, 21 pages appendix, 26 figures

Journal ref International Conference on Learning Representations (ICLR), 2026

详情

AI中文摘要

科学发现日益受到昂贵实验和有限资源的限制，凸显了在AI for science中高效优化的必要性。尽管贝叶斯优化（BO）被广泛用于平衡探索与利用，但其在高维设置中表现出冷启动性能缓慢和可扩展性差的问题，限制了其在现实科学问题中的应用。为克服这些挑战，我们提出了LLM引导的贝叶斯优化（LGBO），这是首个将大语言模型（LLMs）的偏好引导整合到优化循环中的贝叶斯优化框架。与以往仅使用LLMs进行预热启动初始化或候选生成的工作不同，LGBO引入了一种区域提升的偏好机制，将LLM驱动的偏好嵌入到每一个迭代中，以稳定且可控的方式调整替代均值。理论上，我们证明了LGBO在最坏情况下不会显著劣于标准BO，而在偏好与目标一致时，能够实现显著更快的收敛速度。实验上，LGBO在物理、化学、生物学和材料科学等多样化的干基准测试中均优于现有方法。最值得注意的是，在一个新的湿实验室优化Fe-Cr电池电解质时，LGBO在6次迭代内达到了最佳观测值的90%，而标准BO和现有LLM增强的基线方法需要超过10次。这些结果表明，LGBO为将LLMs整合到科学优化工作流中提供了一个有前景的方向。

英文摘要

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains 90% of the best observed value within 6 iterations, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.


2605.17971 2026-05-19 cs.CR cs.AI 版本更新

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Babel: 通过混淆分布优化采样越狱安全注意力

Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang, Yebo Feng, Ju Jia, Ruiying Du, Cong Wu, Yang Liu

发表机构 * Wuhan University（武汉大学）； Nanyang Technological University（南洋理工大学）； Southeast University（东南大学）； University of Hong Kong（香港大学）

AI总结本文研究了大语言模型安全机制中的内在漏洞，提出了一种高效的黑盒攻击框架Babel，通过系统性的混淆采样和反馈驱动的分布优化，实现了高成功率的jailbreak攻击，展示了在LLM安全研究中的稳健方法。

详情

在扩散大型语言模型中进行提示压缩：在LLaDA上评估LLMLingua-2

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

发表机构 * University of Toronto（多伦多大学）； King Mongkut’s University of Technology Thonburi（吞武里国王科技大学）

AI总结本文研究了提示压缩在扩散大型语言模型中的有效性，通过在LLaDA上评估LLMLingua-2，发现提示压缩在数学推理任务中效果不佳，而摘要任务相对稳健，表明为自回归模型设计的提示压缩方法并不能一致地迁移到扩散大型语言模型。

详情

AI中文摘要

提示压缩可以减少大型语言模型的推理成本和上下文长度，但之前的评估主要集中在自回归架构上。本研究探讨了提示压缩是否能有效转移到扩散大型语言模型（DLLMs）中，使用LLMLingua-2，特别是具有8B参数的DLLM LLaDA。我们在GSM8K、DUC2004和ShareGPT数据集上使用每个数据集约250个提示，以大约2倍的压缩率，在数学推理、提示重建和摘要任务中评估压缩性能。通过精确匹配准确率、BLEU、ROUGE和BERTScore比较原始提示、压缩提示、重建提示和重建提示推理生成的输出。结果表明，语义保持并不必然意味着在扩散模型中下游行为的稳定性。摘要任务在压缩下相对稳健，而数学推理任务在高语义相似度分数下显著退化。重建实验进一步表明，语义相似的提示可能仍然遗漏了稳定去噪所需的关键推理信息。在所有任务中，BERTScore召回率始终低于精度，表明压缩失败主要由信息遗漏驱动，而非语义漂移。这些发现表明，为自回归模型设计的提示压缩方法并不均匀地适用于扩散大型语言模型，从而推动了为扩散模型设计的压缩策略的发展。

英文摘要

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.


2605.17923 2026-05-19 cs.DC cs.AI cs.LG 版本更新

AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

AdaptiveLoad: 向高效视频扩散变换器训练迈进

Yucheng Guo, Yongjian Guo, Zhong Guan, Haoran Sun, Wen Huang, Wanting Xu, Jing Long, Shuai Di, Junwu Xiong

发表机构 * Tsinghua University（清华大学）； Peking University（北京大学）； Tianjin University（天津大学）

AI总结本文提出AdaptiveLoad框架，通过双约束自适应负载平衡系统和融合LayerNorm-Modulate CUDA内核，解决视频生成模型中大规模视频扩散变换器（如DiT和MMDiT）训练中的计算不平衡问题，实验显示其在Wan 2.1世界模型上提升了计算效率和训练吞吐量。

详情

AI中文摘要

在视频生成模型，特别是世界模型中，训练大规模视频扩散变换器（如DiT和MMDiT）由于混合模式数据集中序列长度的极端差异，带来了显著的计算挑战。现有基于桶的数据加载策略通常依赖于“等token长度”约束。这种方法未能考虑自注意力机制的二次复杂性，导致严重的负载不平衡和GPU资源利用率低下。本文提出了AdaptiveLoad，一个集成优化框架，包含两个核心组件：（1）双约束自适应负载平衡系统，通过同时限制内存消耗和计算负载（B×S^p≤M_comp）消除长序列瓶颈；（2）融合LayerNorm-Modulate CUDA内核，利用D-tile合并归约（coalesced reduction）策略提高吞吐量并缓解内存压力。实验结果表明，在Wan 2.1世界模型上，我们的方法将计算不平衡率从39%降低到18.9%，峰值VRAM利用率效率提高22.7%，并实现了整体训练吞吐量增加27.2%。

英文摘要

In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes AdaptiveLoad, an integrated optimization framework consisting of two core components: (1) a dual-constraint adaptive load balancing system, which eliminates long-sequence bottlenecks by simultaneously limiting memory consumption and computational load ($B \times S^p \le M_{\text{comp}}$); (2) a fused LayerNorm-Modulate CUDA kernel, which utilizes a D-tile coalesced reduction strategy to increase throughput and alleviate memory pressure. Experimental results on the Wan 2.1 world model demonstrate that our method reduces the computational imbalance rate from 39% to 18.9%, improves peak VRAM utilization efficiency by 22.7%, and achieves an overall training throughput increase of 27.2%.
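The dual-constraint idea can be sketched as choosing, for each sequence-length bucket, the largest batch size $B$ that satisfies both a memory cap and the compute cap $B \times S^p \le M_{\text{comp}}$. The sketch below is a minimal illustration under assumptions: the memory cap is modeled as a token budget ($B \times S \le$ `mem_budget_tokens`), which the abstract does not specify, and the function names and budget values are hypothetical.

```python
def max_batch_size(seq_len, mem_budget_tokens, comp_budget, p=2.0):
    """Largest B with B*S <= mem_budget_tokens and B*S**p <= comp_budget (min 1)."""
    by_memory = mem_budget_tokens // seq_len          # assumed memory model
    by_compute = int(comp_budget // (seq_len ** p))   # compute cap from the abstract
    return max(1, min(by_memory, by_compute))

def plan_buckets(seq_lens, mem_budget_tokens, comp_budget, p=2.0):
    """Map each bucket's sequence length to its batch size."""
    return {s: max_batch_size(s, mem_budget_tokens, comp_budget, p) for s in seq_lens}

# Hypothetical budgets: 65536 tokens of memory, 2**28 units of attention compute.
plan = plan_buckets([1024, 4096, 16384], 65536, 2**28)
```

With these illustrative numbers, an equal-token rule alone would give the 16384-token bucket a batch of 4, which costs four times the attention compute of the 4096-token bucket (batch 16) when $p = 2$; the compute cap reduces it to a batch of 1.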


2605.17918 2026-05-19 cs.LG cs.AI cs.CV 版本更新

LAST-RAG：文献锚定的随机轨迹检索增强生成用于知识条件退化模型选择

Hanbyeol Park, Hyerim Bae

发表机构 * Department of Industrial Engineering（工业工程系）； Pusan National University（釜山国立大学）

AI总结本文提出LAST-RAG方法，通过结合观测健康指标轨迹和领域特定上下文，利用理论和机械证据从本地证据库中检索，以改进退化模型选择，将模型选择从纯统计拟合问题转变为结合观测数据和领域知识的决策问题。

详情

AI中文摘要

基于随机过程的退化建模是估计剩余使用寿命（RUL）分布的核心方法；然而，适当选择随机过程的方法尚未得到充分解决。现有模型选择方法主要依赖于观测健康指标（HI）轨迹的统计拟合，但当观察窗口较短或信号高度噪声时，这种方法可能选择与底层退化机制不一致的模型。为了解决这个问题，本文提出了文献锚定的随机轨迹检索增强生成（LAST-RAG）。该方法利用观测的HI轨迹和领域特定上下文，并基于从本地证据库中检索的理论和机械证据，分层地对候选退化模型空间进行条件。此外，引入了基于规则的置信度推理与不确定状态（RCRUS）以防止在分层决策不确定时过早排除候选模型。基于仿真的实验表明，所提出的方法在韦纳/伽马族分类和详细退化模型分类中均优于统计、预测和不确定性感知的基线方法。最终，本研究将退化模型选择从纯粹的统计拟合问题重新界定为一个结合观测数据和领域知识的知识条件决策问题。

英文摘要

Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.


2605.17900 2026-05-19 cs.AI 版本更新

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2: 基于大语言模型的大型兴趣点属性采集交互语音响应系统

Le Zhang, Shengming Zhang, Rui Zha, Yunpeng Wu, Jingbo Zhou, Jizhou Huang

发表机构 * Baidu Inc.（百度公司）

AI总结本文提出DuIVRS-2，一种基于大语言模型的端到端框架，用于大规模兴趣点属性采集，通过有限状态机引导的数据增强策略、选择生成方案与思维链机制，提高了输出稳定性并有效消除幻觉，最终在生产环境中实现了83.9%的任务成功率。

Comments Accepted to ACL 2026 Industry Track. 14 pages, including appendix

详情

AI中文摘要

准确获取兴趣点（POI）属性对于基于位置的服务至关重要，但传统模块化的交互语音响应（IVR）系统存在误差累积和高维护成本的问题。我们提出了DuIVRS-2，一种基于大语言模型（LLM）的端到端框架，用于百度地图的大规模POI属性采集。为了解决现实交互中的长尾分布问题，我们的方法首先采用有限状态机（FSM）引导的数据增强策略，生成平衡且多样化的训练数据集。然后通过选择生成方案结合思维链（CoT）机制，优化对话管理，确保输出稳定性并有效消除工业环境中的幻觉。为了便于持续策略优化且最小化人工努力，我们设计了协作迭代学习框架，利用双评估者投票系统。在生产环境中部署两个月，DuIVRS-2每天处理0.4百万次呼叫，实现了83.9%的任务成功率（TSR），比其前身高出4个百分点，同时保持130ms的低响应时间。本工作为开发鲁棒且成本效益高的LLM代理用于大规模工业对话应用提供了生产验证的参考。

英文摘要

Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved an 83.9% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.


2605.17899 2026-05-19 cs.LG cs.AI q-bio.QM 版本更新

DCFold: Efficient Protein Structure Generation with Single Forward Pass

DCFold: 通过单次前向传递高效生成蛋白质结构

Zhe Zhang, Yuanning Feng, Yuxuan Song, Keyue Qiu, Hao Zhou, Wei-Ying Ma

发表机构 * Institute for AI Industry Research (AIR)（人工智能产业研究院）； Department of Computer Science and Technology（计算机科学与技术系）； School of Computer Science and Technology（计算机科学与技术学院）； ByteDance Seed（字节跳动种子）

AI总结本文提出DCFold，一种单步生成模型，实现了与AlphaFold3同等的精度，通过双一致性训练框架和新的时间测地匹配（TGM）调度器，在保持预测保真度的同时将推理速度提升15倍，验证了其在结构预测和结合设计基准上的有效性。

2605.17894 2026-05-19 cs.AI 版本更新

Evaluating Cognitive Age Alignment in Interactive AI Agents

评估交互式AI代理的认知年龄对齐

Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim, Ismini Lourentzou, Xu Cao, Meihuan Huang

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Shenzhen Children's Hospital（深圳儿童医院）； Peking University（北京大学）； Hong Kong Polytechnic University（香港理工大学）

AI总结本文提出ChildAgentEval，首个基于心理测量的交互式基准，用于评估基于多模态大语言模型的代理的认知年龄对齐，通过与年龄特定的人类发展阶段进行系统比较，揭示当前代理在模拟年龄特定认知行为方面的优劣。

2605.17887 2026-05-19 cs.LG cs.AI 版本更新

Attention Sinks and Outliers in Attention Residuals

注意力沉底与注意力残差中的异常值

Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen, Eric Hanchen Jiang, Yijiang Li, Jingyuan Huang, Chenghao Qiu, Chenwei Xu, Zhenyu Pan, Haotian Zhang, Binghui Wang, Yan Chen

发表机构 * Department of Computer Science, Northwestern University（西北大学计算机科学系）； Department of Computer Science and Engineering, University of Michigan（密歇根大学计算机科学与工程系）； Department of Statistics and Data Science, University of California Los Angeles（加州大学洛杉矶分校统计与数据科学系）； Department of Electrical and Computer Engineering, University of California San Diego（加州大学圣地亚哥分校电气与计算机工程系）； Department of Computer Science, Rutgers University-New Brunswick（罗格斯大学新布朗斯维克分校计算机科学系）； Department of Computer Science and Engineering, Texas A&M University（德克萨斯农工大学计算机科学与工程系）； Department of Computer Science, Columbia University（哥伦比亚大学计算机科学系）

AI总结本文提出OASIS技术，通过层间空信号来解决注意力残差架构中注意力沉底、激活异常值以及推理稳定性下降的问题，通过双归一化设计和实验验证提升了模型的结构鲁棒性和量化鲁棒性。

详情

AI中文摘要

我们提出OASIS，一种基于层间空信号的异常值和沉底感知技术。由于AttnResidual架构引入了额外的深度归一化通道，它们提高了层间路由的灵活性，但也加剧了注意力沉底、激活异常值以及由此导致的推理稳定性和量化鲁棒性下降。OASIS通过引入基于Softmax1的空空间，并通过层间空信号将token级的空证据耦合到深度路由中，从而减少由沉底主导的路由并提高结构鲁棒性。理论上，我们证明了AttnResidual的双归一化设计加剧了沉底形成和量化脆性。实验上，我们在三个真实世界数据集上将OASIS与五个基线进行比较，并观察到在注意力沉底和后量化性能方面有持续的改进。值得注意的是，OASIS在评估设置中实现了最大无穷范数平均减少9.26%、平均峰度减少2.60%，并在W8A8下将困惑度降低了75.85%，在W4A4下将GSM8K Pass@1提高了12.42%。

英文摘要

We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.
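To make the "null" option concrete, the sketch below contrasts standard softmax with a softmax whose denominator carries an extra 1, so that attention weights may sum to less than one when no key scores highly. This assumes that "Softmax1" refers to that commonly used variant; the abstract does not spell out the definition, and the function names here are illustrative rather than taken from the paper.

```python
import math

def softmax(scores):
    """Standard softmax: weights always sum to 1, so mass must land somewhere."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax1(scores):
    """Assumed Softmax1: exp(x_i) / (1 + sum_j exp(x_j)), computed stably."""
    m = max(max(scores), 0.0)           # shift that also covers the implicit zero logit
    exps = [math.exp(s - m) for s in scores]
    total = math.exp(-m) + sum(exps)    # exp(-m) is the shifted "+1" term
    return [e / total for e in exps]

weak_scores = [-4.0, -5.0, -6.0]        # no key is a good match
standard_mass = sum(softmax(weak_scores))
null_aware_mass = sum(softmax1(weak_scores))
```

With uniformly low scores, the standard softmax still distributes a full unit of attention, which is one way sink tokens can arise, while the assumed Softmax1 leaves most of the mass unassigned.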


2605.17885 2026-05-19 cs.CL cs.AI 版本更新

交互评估需要一种设计科学

Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei

发表机构 * University of Texas Austin（德克萨斯大学奥斯汀分校）； California Institute of Technology（加州理工学院）； Carnegie Mellon University（卡内基梅隆大学）； Stanford University（斯坦福大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Microsoft Research（微软研究院）； Northwestern University（西北大学）； University of Cambridge（剑桥大学）

AI总结本文探讨了交互评估应被视为一种原则性的评估范式，而非仅仅是新的智能体基准。通过定义评估为证据到判断的自主映射，文章展示了交互评估如何改变这一映射的两方面，并提出双轴分类法，制定设计原则和报告标准，分析了长期评估挑战在轨迹层面的再现。

Comments 10 pages

详情

AI中文摘要

AI评估正经历结构性变革。大型语言模型（LLMs）越来越多地被部署为通过工具、环境、用户和其他智能体进行时间动作的系统，而许多评估实践仍继承自响应中心基准（例如固定输入、孤立输出和单个响应可做出的判断）。该领域开始构建交互基准，但所形成的景观却碎片化：基准在允许的交互制品、轨迹评分方式以及所支持的主张上各不相同。本文主张交互评估应被视为一种原则性的评估范式，而非仅仅是新的智能体基准。单纯采用以往的评估范式并不足够。我们定义评估为证据到判断的自主映射，并展示交互评估改变了这一映射的两方面：证据变为由交互生成的轨迹，而评估过程必须评估过程、可恢复性、协调性、鲁棒性和系统级性能。基于此定义，我们提出双轴分类法，推导设计原则和报告标准，分析代表性场景，并探讨长期评估挑战在轨迹层面的再现。

英文摘要

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.


2605.17827 2026-05-19 cs.LG cs.AI 版本更新

Content-Style Identification via Differential Independence

通过微分独立性进行内容-风格识别

Subash Timilsina, Hoang-Son Nguyen, Sagar Shrestha, Xiao Fu

发表机构 * School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA（电气工程与计算机科学学院，俄勒冈州立大学，科瓦利斯，俄勒冈，美国）

AI总结本文提出了一种新的结构条件，即内容-风格微分独立性（CSDI），用于在内容和风格可能依赖的情况下实现生成分析中的可识别性，通过在雅可比子空间上施加块状正交约束，并设计了基于数值雅可比近似的随机正则化器以支持高维生成模型。

Comments 24 pages, 15 figures, ICML 2026

详情

AI中文摘要

生成分析经常将多领域观察建模为领域不变内容变量和领域特定风格变量的非线性混合。从不成对的领域中识别这两种因素可以实现域迁移和反事实数据生成等任务。先前的工作在内容和风格之间（块状）统计独立性或通过非线性混合函数的稀疏雅可比假设下建立了可识别性，但这些条件在实践中可能过于严格。在本文中，我们引入了内容-风格微分独立性（CSDI），一种替代的结构条件，要求内容和风格的微小变化在数据流形上诱导正交方向，从而在内容和风格依赖且雅可比密集时也能实现可识别性。我们通过在内容和风格相关的雅可比子空间上施加块状正交约束来操作化这一条件。为了支持高维生成模型，我们设计了一个基于数值雅可比近似的随机正则化器，从而在如高分辨率图像生成等设置中实现可扩展训练。在多个数据集上的实验验证了可识别性分析，并展示了反事实生成和域迁移的实用优势。

英文摘要

Generative analysis often models multi-domain observations as nonlinear mixtures of domain-invariant content variables and domain-specific style variables. Identifying both factors from unpaired domains enables tasks such as domain transfer and counterfactual data generation. Prior work establishes identifiability under (block-wise) statistical independence between content and style, or via sparse Jacobian assumptions on the nonlinear mixing function, but such conditions can be restrictive in practice. In this work, we introduce content-style differential independence (CSDI), an alternative structural condition requiring that infinitesimal variations in content and style induce orthogonal directions on the data manifold, thereby enabling identifiability even when content and style are dependent and the Jacobian is dense. We operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style. To support high-dimensional generative models, we design a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training in settings such as high-resolution image generation. Experiments across multiple datasets corroborate the identifiability analysis and demonstrate practical benefits on counterfactual generation and domain translation.
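The blockwise orthogonality condition can be illustrated with a small numerical sketch: approximate the Jacobian of a toy mixing function by central differences, then penalize inner products between content columns and style columns. This is a deterministic toy under assumptions (the two mixing functions, the index split, and the squared-dot-product penalty are all hypothetical); the paper's regularizer is described as stochastic and is not reproduced here.

```python
import math

def numerical_jacobian(f, z, eps=1e-5):
    """Central-difference Jacobian of f at z, returned as one column per input dim."""
    cols = []
    for k in range(len(z)):
        hi, lo = list(z), list(z)
        hi[k] += eps
        lo[k] -= eps
        f_hi, f_lo = f(hi), f(lo)
        cols.append([(a - b) / (2 * eps) for a, b in zip(f_hi, f_lo)])
    return cols

def cross_block_penalty(cols, content_idx, style_idx):
    """Sum of squared inner products between content and style Jacobian columns."""
    total = 0.0
    for i in content_idx:
        for j in style_idx:
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            total += dot * dot
    return total

# z[0] plays the role of content, z[1] the role of style (toy assumption).
def separable_mix(z):
    return [z[0] ** 3, math.sin(z[1])]

def entangled_mix(z):
    return [z[0] + z[1], z[0] * z[1]]

point = [1.0, 2.0]
penalty_separable = cross_block_penalty(numerical_jacobian(separable_mix, point), [0], [1])
penalty_entangled = cross_block_penalty(numerical_jacobian(entangled_mix, point), [0], [1])
```

The separable map yields a penalty near zero because its content and style columns are orthogonal, whereas the entangled map at the point (1, 2) has columns (1, 2) and (1, 1), whose inner product is 3 and whose penalty is therefore about 9.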


2605.17826 2026-05-19 cs.CV cs.AI 版本更新

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

CounterCount: 一种用于视觉语言模型计数偏差诊断的框架

Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey, Bernard Ghanem

发表机构 * KAUST（阿卜杜拉国王科技大学）； University of Edinburgh（爱丁堡大学）； King Abdullah University of Science and Technology（阿卜杜拉国王科技大学）

AI总结本文提出CounterCount框架，通过对比事实性与反事实性图像来诊断视觉语言模型在计数任务中的偏差问题，揭示模型对物体级先验知识的依赖，并提出统一的注意力调节策略提升反事实计数准确性。

详情

AI中文摘要

视觉语言模型（VLMs）在多模态推理方面表现出色，但尚不清楚其答案是基于视觉证据还是由学习的语言和世界先验知识驱动。计数提供了一个精确的测试环境：当视觉证据与常识物体知识冲突时，模型必须依赖图像而非典型计数。我们引入CounterCount，一种用于VLMs的反事实计数诊断框架，包含配对的事实性和反事实性图像、编辑过的计数相关属性、验证答案和局部化证据注释。评估最近的VLMs，我们发现其在事实性图像上表现强劲，但在反事实属性变化下持续退化，表明即使存在矛盾的视觉证据，模型仍依赖物体级先验知识。利用局部化注释，我们发现这些失败不仅由于缺失或模糊的视觉证据，而是由于模型对计数相关视觉token的注意力权重不足。我们引入一种统一的推理时间注意力调节策略，重新加权所选的视觉token，使多个VLMs的反事实计数准确率提高高达8%。总体而言，CounterCount揭示了先验驱动的计数失败，并为设计未来的VLMs提供了诊断见解。

英文摘要

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.


2605.17823 2026-05-19 cs.CV cs.AI 版本更新

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

我们为何看向所看之处：最大化场景理解的中央凹视觉语言模型涌现出类人注视模式

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein

发表机构 * Psychological & Brain Sciences, University of California, Santa Barbara（加州大学圣芭芭拉分校心理学与脑科学系）； Electrical and Computer Engineering, University of California, Santa Barbara（加州大学圣芭芭拉分校电气与计算机工程系）； Computer Science, University of California, Santa Barbara（加州大学圣芭芭拉分校计算机科学系）

AI总结研究探讨了人类自由观看时注视模式的形成机制，发现最大化场景理解的中央凹（foveated）视觉语言模型能够产生类似人类的注视模式，表明这种模式可能是优化场景理解的副产品。

2605.17821 2026-05-19 cs.DC cs.AI 版本更新

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

TierCheck: 用于大语言模型训练故障容错的分层检查点系统

Shujie Han, Feng Jiang, Patrick P. C. Lee, Xiao Zhang, Zhijie Huang, Nannan Zhao, Xiaonan Zhao, Lichen Pan

发表机构 * Northwestern Polytechnical University（西北工业大学）； The Chinese University of Hong Kong（香港中文大学）； National University of Defense Technology（国防科技大学）

AI总结本文提出TierCheck，一种基于集群意识的分层检查点系统，通过将存储位置与故障异质性对齐，实现轻量级差异检查点在本地和对等内存中的快速本地恢复，同时异步迁移重型基础检查点到远程持久化存储，从而在低开销持久性和快速恢复之间取得最佳平衡。

详情

AI中文摘要

大语言模型（LLM）训练经常受到异构故障谱的中断，从常见的GPU崩溃到灾难性的集群级故障。现有检查点系统依赖于单一层次的存储后端，迫使在状态保存开销和恢复速度之间做出权衡。我们提出TierCheck，一种集群感知的分层检查点系统，通过将存储位置与故障异质性对齐。TierCheck采用三级设计，保持轻量级差异检查点在本地和对等内存中以实现快速本地恢复，同时异步迁移重型基础检查点到远程持久化存储。它还确保严格跨层次的全局一致性，而不会停滞训练，并在恢复期间实现快速的集群感知检查点恢复。在400亿参数模型上的评估显示，TierCheck实现了低训练开销，将端到端检查点时间减少到10秒以下，并支持高频检查点，最终在低开销持久性和快速恢复之间取得最佳平衡。

英文摘要

Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on a monolithic, single-tier storage backend, forcing a trade-off between state-saving overhead and recovery speed. We propose TierCheck, a cluster-aware tiered checkpointing system that aligns storage placement with failure heterogeneity. TierCheck adopts a three-tier design that maintains lightweight differential checkpoints in local and peer memory for fast localized recovery, while asynchronously migrating heavyweight base checkpoints to remote persistent storage. It also ensures strict global consistency across tiers without stalling training, and achieves fast cluster-aware checkpoint restoration during recovery. Evaluations on models up to 40 billion parameters show that TierCheck achieves low training overhead, reduces end-to-end checkpointing time to under 10s, and supports high-frequency checkpointing, ultimately striking an optimal balance between low-overhead persistence and fast recovery.
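The split between lightweight differential checkpoints and a heavyweight base checkpoint can be sketched as follows. This is a toy under stated assumptions: the abstract does not say how differentials are computed, so the sketch simply stores entries that changed since the base, and `TieredStore` with its three dictionaries is a hypothetical stand-in for local memory, peer memory, and remote persistent storage.

```python
def make_differential(base, current):
    """Entries of `current` that differ from `base` (both map name -> value)."""
    return {name: value for name, value in current.items() if base.get(name) != value}

def restore(base, differential):
    """Rebuild the full state from the base plus one differential."""
    state = dict(base)
    state.update(differential)
    return state

class TieredStore:
    """Hypothetical three tiers: local memory, peer memory, remote storage."""

    def __init__(self):
        self.local, self.peer, self.remote = {}, {}, {}

    def save(self, step, base, current):
        diff = make_differential(base, current)
        self.local[step] = diff   # fastest tier, lost if the node fails
        self.peer[step] = diff    # replica that survives a single-node failure
        return diff

    def migrate_base(self, step, base):
        # A real system would do this asynchronously; here it is a plain copy.
        self.remote[step] = dict(base)

    def recover(self, base_step, diff_step, local_lost=False):
        tier = self.peer if local_lost else self.local
        return restore(self.remote[base_step], tier[diff_step])

base = {"w1": 1.0, "w2": 2.0}
current = {"w1": 1.0, "w2": 2.5}
store = TieredStore()
store.migrate_base(0, base)
diff = store.save(10, base, current)
recovered = store.recover(0, 10, local_lost=True)
```

Only the changed entry is kept in the fast tiers, and recovery after losing the local copy falls back to the peer replica combined with the remote base.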


2605.17815 2026-05-19 cs.RO cs.AI 版本更新

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

秩序之中的混沌：在桌面堆叠重构中使用Topple动作的规划

Hao Lu, Rahul Shome

发表机构 * School of Computing at the Australian National University（澳大利亚国立大学计算学院）

AI总结本文研究了桌面环境中堆叠重构任务，通过引入更丰富的非抓取聚合动作（特别是从堆叠中倒落物体到桌面的Topple动作）来增强任务规划领域。核心方法是提出一种新的Topple聚合工具，将候选任务计划计算转化为 Pebble Motion 问题变体，从而在IsaacSim物理模拟中验证了其效果，展示了在执行速度上的显著优势。

Comments 8 pages, 7 figures

详情

AI中文摘要

高效的物体操作策略对自动化应用有重大影响。本文研究了桌面环境中的堆叠重构任务，重点是通过引入更丰富的非抓取聚合动作（特别是从堆叠中倒落物体到桌面的Topple动作）来增强任务规划领域。Topple可以压缩长序列的中间搬运动作。计算的计划需要根据问题在其中交错执行抓取和放置动作与Topple动作。为了生成任务计划并建模一个抽象来计算包含抓取和Topple动作的解决方案，引入了一种新的Topple聚合工具。使用这种有向图抽象，候选任务计划计算成为Pebble Motion问题的变种，将物体视为石子。然后在基于IsaacSim的物理模拟中报告了基准测试。结果突显了仅使用抓取和放置动作相比，在执行速度上的明显优势。尽管本文主要研究Topple动作，但证明了类似的抽象可以建模其他感兴趣的聚合动作，如Scoop。本文的工作为丰富物体交互的操纵应用提供了初步但有力的证据，表明抽象在其中的潜在好处。

英文摘要

Efficient object manipulation strategies have significant impact in automation applications. In this work, stack rearrangement in tabletop settings is studied, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack to the table. Toppling can compress long sequences of intermediate relocations. Computed plans need to interleave pick-and-place actions with topple actions throughout the plan, depending on the problem. In order to generate the task plan and model an abstraction to compute solutions that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in an IsaacSim-based physics simulation. Results highlight clear benefits of achieving faster execution than solely using pick-and-place actions. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary, strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.


2605.17812 2026-05-19 cs.AI 版本更新

Going Headless? On the Boundaries of Vertical AI Firms

going headless？关于垂直AI企业的边界

Muhammad Zia Hydari, Farooq Muzaffar

发表机构 * University of Pittsburgh（匹兹堡大学）

AI总结本文探讨了垂直AI企业在会计、法律、医疗、采购等领域中，将工作流、领域逻辑和责任整合到单一应用中的传统模式，以及通用AI代理如何解构这种模式，促使企业采取"going headless"策略。文章指出，这种策略对某些企业有益，对另一些企业则可能造成破坏，并提出了基于任务-责任制度的三类分类体系及规则债务的概念。

详情

AI中文摘要

垂直AI企业在会计、法律、医疗、采购等领域历史上将工作流、领域逻辑和责任整合到单一应用中。通用AI代理现在正在解构这种整合，促使创始人和投资者倡导"going headless"：将工作流和界面交给代理，并将领域专业知识作为可调用的服务暴露出来。本文认为，对于某些企业来说，going headless是正确的，而对于另一些企业则可能是破坏性的，后者往往通过看似界面决策的架构选择无意中放弃了其价值捕获。这是一个边界问题，答案取决于区分接口边界（通常可以移动）和责任边界（通常不能移动）。基于科斯的企业理论、埃森曼、帕克和范阿尔斯特恩的平台包容框架，以及蒂茨对互补资产和可获取性的分析，本文表明，通过开放协议运营的协调者即使在技术互操作性提高的情况下仍能获得包容权力，并且持久的价值捕获集中在专业签发、受监管的工作流、证据轨迹和受信任的记录系统中。本文提出了一种三类分类体系（组件、集成软件平台、双轨），该分类不是基于行业而是基于任务-责任制度，并正式化了规则债务的概念：当业务规则和专业标准从受控系统迁移到提示和代理指令时，客户组织将承担未来治理、维护和责任负担。随后有四项原则：按责任而非界面分解，翻转边缘同时保留核心，将规则债务作为集成平台防止的客户成本，避免单一协调者依赖。

英文摘要

Vertical AI firms in accounting, law, healthcare, procurement, and similar domains historically bundled workflow, domain logic, and accountability into a single application. General-purpose AI agents are now unbundling that package, prompting founders and investors to advocate "going headless": cede the workflow and interface to agents and expose domain expertise as callable services. This article argues that going headless is correct for some firms and destructive for others, and that the latter often cede their value capture inadvertently through architectural choices that look like interface decisions. This is a boundary question, and the answer turns on distinguishing the interface boundary, which can often move, from the accountability boundary, which often must not. Drawing on Coase's theory of the firm, Eisenmann, Parker, and Van Alstyne's platform envelopment framework, and Teece's analysis of complementary assets and appropriability, the article shows that orchestrators operating through open protocols acquire envelopment power even as technical interoperability improves, and that durable value capture concentrates in cospecialized accountability assets: professional signoff, regulated workflows, evidence trails, and trusted systems of record. The article proposes a three-position taxonomy (component, integrated software platform, dual-track) determined not by sector but by task-accountability regime, and formalizes the construct of rule debt: the future governance, maintenance, and accountability burden that accrues to customer organizations when business rules and professional standards migrate from governed systems into prompts and agent instructions. Four principles follow: decompose by accountability not interface, invert the edges while retaining the core, position rule debt as the customer cost the integrated platform prevents, and avoid single-orchestrator dependence.

2605.17811 2026-05-19 cs.LG cs.AI math.OC 版本更新

One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

一个模型，两种角色：共享循环Transformer中的涌现专业化

Jucheng Shen, Barbara Su, Anastasios Kyrillidis

发表机构 * Rice University（莱斯大学）

AI总结该研究探讨了共享权重的循环Transformer能否在不被划分为独立模块的情况下发展出不同的内部角色。通过非对称输入循环（AIR）架构，研究发现模型的两个内部状态分化出不同的功能角色，并展示了这种分化与模型状态动态的关系。

Comments 21 pages, 13 figures, 8 tables

详情

AI中文摘要

共享权重的循环Transformer能否在不被划分为独立模块的情况下发展出不同的内部角色？我们研究了非对称输入循环（AIR），这是一种最小的两状态推理架构：同一个Transformer模型被复用于两种更新（按文献记为L和H），更新规则中唯一的内置差异是编码后的输入在L更新中被注入，而在H更新中不被注入。在Sudoku-Extreme和Maze上，解码的rollout揭示出一致的分化：$z_H$表现得像一个完全确定的提案状态，而$z_L$保留局部不确定性和不断变化的中间结构。冻结实验表明，这种分化在实践中与模型的状态动态有关：在Sudoku中，冻结$z_H$会减少$z_L$的内容变化，而冻结$z_L$会增加$z_H$的内容变化；在Maze中，冻结任一状态都会增加另一状态的内容变化。消融实验表明，要诱导专业化，共享模型需要能够区分两种更新类型，可以通过输入注入的不对称性，也可以通过单独的层级标记。在机制上，注意力分析显示，在Sudoku和Maze中，L更新始终比H更新更局部。这些结果表明，在两状态循环设置中，清晰的状态身份信号可以在共享参数的循环Transformer内部诱导出稳定且相互关联的功能角色。代码可在 https://github.com/juchengshen/air 获得。

英文摘要

Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: $z_H$ behaves like a fully committed proposal state, whereas $z_L$ retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, freezing $z_H$ reduces $z_L$'s content changes whereas freezing $z_L$ increases $z_H$'s, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at https://github.com/juchengshen/air.
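The update rule described in the abstract can be illustrated with a toy rollout. This is a minimal sketch, not the authors' implementation: the nested schedule (several L-updates per H-update), the additive way the states are combined, and the names `air_rollout` and `toy_step` are all assumptions made for illustration. The only property taken from the abstract is that one shared function performs both updates and the encoded input enters only the L-update.

```python
from typing import Callable, List, Tuple

Vector = List[float]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum of two equal-length vectors."""
    return [x + y for x, y in zip(a, b)]


def air_rollout(
    shared_step: Callable[[Vector], Vector],
    x_enc: Vector,
    z_l: Vector,
    z_h: Vector,
    outer_steps: int,
    inner_steps: int,
) -> Tuple[Vector, Vector]:
    """Toy two-state recurrence with one shared update function.

    The nested schedule and additive mixing are assumptions for illustration.
    """
    for _ in range(outer_steps):
        for _ in range(inner_steps):
            # L-update: the encoded input x_enc IS injected.
            z_l = shared_step(add(add(z_l, z_h), x_enc))
        # H-update: same function, encoded input is NOT injected.
        z_h = shared_step(add(z_h, z_l))
    return z_l, z_h


def toy_step(v: Vector) -> Vector:
    """Hypothetical stand-in for the shared Transformer block."""
    return [0.5 * x for x in v]


z_l, z_h = air_rollout(toy_step, [1.0], [0.0], [0.0], outer_steps=1, inner_steps=1)
```

Because both branches call the same `shared_step`, any difference between the two states in this sketch comes only from what is fed in, which is the asymmetry the abstract identifies.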

2605.17807 2026-05-19 cs.CV cs.AI 版本更新

SocialMemBench: AI记忆系统是否准备好应对社交群体环境？

Olukunle Owolabi

发表机构 * Independent Researcher（独立研究者）

AI总结本文提出SocialMemBench，一个针对多党社交群体的AI记忆系统评估基准，通过人类验证的合成社交网络，测试记忆系统在处理共享历史、群体规范和成员退出等复杂社交场景中的能力。

详情

AI中文摘要

为单用户对话设计的AI记忆系统在应用于多党社交群体环境时会表现出典型故障。这一差距对当今构建的社会助手尤为重要：嵌入聊天平台的群体作用代理，以及需要全面用户模型的主动个人助理代理。现有记忆基准评估的是二元或职场对话；没有针对多党社交群体，其中记忆必须将事实锚定在共享历史而非职业角色，区分群体规范与个体例外，并在成员退出后正确归因。我们引入SocialMemBench，一个涵盖五个典型（亲密朋友、家庭、娱乐、兴趣社区、熟人网络）和三个群体规模层级（4-30成员）的人类验证合成社交群体网络的基准，包含430个角色和7,355次对话轮次，产生1,031个问题-答案对，覆盖九个问题类别。每个类别隔离一种架构能力，五个失败模式（单流融合、时间状态覆盖、大规模实体合并、缺失跨角色知识、规范-个体融合）是可测试的假设；我们的两项研究探针Subject-Mem和SMG提供了证据，其余三个仍待解决。在所有43个网络中，评估的四个开源记忆框架（Mem0、LangMem、Graphiti、Cognee）在问题加权范围内聚集在0.12-0.18，95%置信区间重叠，远低于未压缩检索参考0.345和匹配回答者完整上下文参考0.369（GPT-4o-mini）。当前的记忆系统显示出可测量的差距。

英文摘要

Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and attribute facts correctly even after a member departs. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes, Subject-Mem and SMG, provide evidence on two, while three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.

2605.17775 2026-05-19 cs.CL cs.AI 版本更新

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

在百万笔记规模上系统评估LLM重新表述的合成临床笔记质量

Jinghui Liu, Sarvesh Soni, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO, Australia（澳大利亚电子健康研究中心，CSIRO，澳大利亚）； National Library of Medicine, National Institutes of Health, USA（国家医学图书馆，国立卫生研究院，美国）

AI总结本研究系统评估了LLM生成的合成临床笔记的质量，包括内在、外在和事实性评估，发现尽管在粗粒度任务中保留了核心临床信息和预测效用，但在细粒度任务如ICD编码中丢失了细节，通过分块重述可以缓解这一问题，但会降低事实准确性。研究还发现合成错误主要源于临床情境的误解、时间混淆、测量误差和虚构声明，同时展示了这些合成笔记可以有效增强罕见ICD代码的特定任务训练。

详情

AI中文摘要

大型语言模型（LLMs）可以为各种应用生成或合成临床文本，从改善临床文档到增强临床文本分析。然而，评估通常集中在狭窄方面——例如相似性或效用比较——尽管这些方面是互补的，最好并行看待。在本研究中，我们旨在系统评估LLM生成的临床文本，包括在百万笔记规模上从MIMIC数据库重新表述的合成临床笔记的内在、外在和事实性评估。我们的分析显示，尽管存在显著的语言变化，合成笔记仍保留了核心临床信息和粗粒度任务的预测效用，但在像ICD编码这样的细粒度任务中会丢失细节。我们展示，通过分块重述而不是整体重述笔记可以显著缓解这种细节丢失，但会以减少事实准确性为代价。通过事实核查和错误分析，我们进一步发现合成错误主要由临床情境的误解、时间混淆、测量误差和虚构声明引起。最后，我们展示了这些合成笔记——尽管具有任务无关性——可以有效增强罕见ICD代码的特定任务训练。

英文摘要

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for tasks like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

2605.17762 2026-05-19 cs.AI 版本更新

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

表面形式神经稀疏检索：面向工业音乐搜索的鲁棒模糊匹配

Paul Greyson, Zhichao Geng, Wei Zhang, Yang Yang

发表机构 * Amazon（亚马逊）

AI总结本文提出了一种鲁棒的神经稀疏检索系统，通过改进的稀疏检索架构和领域特定的子词分词策略，提升了工业音乐搜索中对拼写错误、转置和发音变异的鲁棒性，实现了更高的召回率和更低的延迟。

Comments accepted at SIGIR 2026 industry track

详情

DOI: 10.1145/3805712.3808414

AI中文摘要

在亚马逊音乐的规模下进行音乐搜索面临独特挑战：查询经常由于拼写错误、转置和发音变异而偏离索引元数据，但检索系统必须在毫秒级延迟约束下运行。我们的现有学习到检索系统，即高置信度索引（HCI），从客户行为中学习查询-实体关联，依赖于持续的『探索』来选择候选。传统的n-gram匹配能够实现这种探索，但存在语义鲁棒性差和噪声高，限制了系统从长尾查询中学习的能力。在本工作中，我们提出了一种鲁棒的神经稀疏检索系统，旨在最大化探索效率。我们将最先进的『推理自由』稀疏检索架构适应到音乐领域，并结合一种有效的领域特定的细粒度子词分词策略。我们的方法利用短长度的token约束（最大3个字符）来强制学习表面形式的鲁棒性而非词法记忆。通过在离线索引阶段预计算神经嵌入和术语扩展，使在线处理减少到最小的tokenization和IDF加权，从而实现查询编码的几乎零延迟开销。在600万文档生产语料库上的评估显示，召回率@10达到91.4%（相比传统的三元组为57.7%），在可比的吞吐量下。对HCI反馈循环的模拟显示了探索效率的提高，稳定召回率比生产三元组高0.8%。消融研究表明，我们的稀疏训练方法驱动了性能提升，而领域特定的预训练提供了比大规模通用预训练更具成本效益的替代方案。

英文摘要

Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual "exploration" to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a robust neural sparse retrieval system designed to maximize exploration efficiency. We adapt a state-of-the-art inference-free sparse retrieval architecture to the music domain, combining it with an effective domain-specific granular subword tokenization strategy. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate 91.4% recall@10 (vs. 57.7% for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with +0.8% higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.
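The split between offline document-side weights and cheap query-side IDF weighting can be illustrated with a small sketch. Everything here is an assumption for illustration: the sliding character-window tokenizer stands in for the paper's subword tokenizer, raw term counts stand in for the learned neural term weights and expansions, and the names `short_tokens`, `build_index`, and `score` are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def short_tokens(text: str, max_len: int = 3) -> List[str]:
    """Split text into overlapping character windows of at most max_len chars."""
    s = text.lower().replace(" ", "_")
    return [s[i:i + max_len] for i in range(max(1, len(s) - max_len + 1))]


def build_index(docs: List[str]) -> Tuple[List[Counter], Dict[str, float]]:
    """Offline step: per-document term weights plus an IDF table.

    Raw counts are a stand-in for learned term weights (assumption).
    """
    doc_vecs = [Counter(short_tokens(d)) for d in docs]
    df = Counter(t for vec in doc_vecs for t in vec)
    idf = {t: math.log(1.0 + len(docs) / n) for t, n in df.items()}
    return doc_vecs, idf


def score(query: str, doc_vec: Counter, idf: Dict[str, float]) -> float:
    """Online step: tokenize the query and apply IDF weights only."""
    return sum(idf.get(t, 0.0) * doc_vec.get(t, 0) for t in set(short_tokens(query)))


docs = ["bohemian rhapsody", "born this way"]
doc_vecs, idf = build_index(docs)
scores = [score("bohemain rapsody", vec, idf) for vec in doc_vecs]
best = max(range(len(docs)), key=lambda i: scores[i])
```

In this toy case the misspelled query still shares several short tokens with the intended title and none with the other, so the intended title ranks first without any model call at query time.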

2605.17757 2026-05-19 cs.LG cs.AI cs.DC cs.PF 版本更新

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR: 2位KV缓存量化中的离线频谱协方差感知旋转

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu

发表机构 * Together AI ； University of Sydney（悉尼大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出OSCAR方法，通过离线估计注意力感知的协方差结构，实现2位KV缓存量化的高效和准确，同时开发了可部署的系统，提升了LLM服务框架的性能和效率。

Comments 35 pages, 10 figures

详情

AI中文摘要

INT2 KV-cache量化对于长上下文LLM服务具有吸引力，但实现准确性和可部署性仍然具有挑战。简单的旋转如Hadamard变换可以减少异常值，但仍然在INT2层面失效，因为它们与下游注意力不对齐。我们提出了OSCAR，一种超低比特KV缓存量化方法，通过离线估计注意力感知的协方差结构，并利用这些结构推导出固定旋转和截断阈值用于量化。这样，KV量化就与注意力实际消耗的协方差结构对齐。更重要的是，我们不仅提供了理论依据，还开发了一个完全可部署的OSCAR系统，包含一个定制的INT2注意力内核，该内核与分页KV缓存服务和融合内核流水线保持兼容，从而无缝集成到现代LLM服务框架中，如SGLang和vLLM。我们评估了我们的方法在最近的推理模型上，使用最多32k token的推理轨迹进行跨5个任务的测试。在Qwen3-4B-Thinking-2507和Qwen3-8B上，OSCAR将BF16精度差距分别减少到3.78和1.42个点，而朴素旋转INT2几乎归零。我们进一步将OSCAR扩展到Qwen3-32B和GLM-4.7（358B参数），其中它仍然与BF16保持有效相当。在长上下文-RULER-NIAH（最多128K）上，OSCAR在Qwen3模型上保持稳健，而朴素旋转INT2崩溃。从系统层面来看，OSCAR将KV缓存内存减少约8倍，在相同内存预算下，大批次大小下吞吐量提高最多7倍，并且由于内存带宽开销减少，单批次解码速度比BF16快最多3倍。

英文摘要

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an ultra-low-bit KV-cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our method on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On the long-context RULER-NIAH benchmark (up to 128K), OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.
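The building blocks the abstract mentions (a rotation that spreads outliers, clipping, and a 2-bit grid) can be sketched as follows. This is a generic illustration and not OSCAR itself: it uses a plain normalized 2x2 Hadamard rotation rather than the covariance-derived rotations the paper proposes, the clipping threshold is picked by hand, and the function names are hypothetical.

```python
import math
from typing import List, Tuple


def hadamard2(pair: Tuple[float, float]) -> Tuple[float, float]:
    """Apply the normalized 2x2 Hadamard rotation to a pair of values."""
    a, b = pair
    s = 1.0 / math.sqrt(2.0)
    return (s * (a + b), s * (a - b))


def quantize_int2(values: List[float], clip: float) -> List[int]:
    """Clip to [-clip, clip], then map onto the four codes 0..3."""
    step = 2.0 * clip / 3.0  # four levels span three intervals
    codes = []
    for v in values:
        clipped = max(-clip, min(clip, v))
        codes.append(int(round((clipped + clip) / step)))
    return codes


def dequantize_int2(codes: List[int], clip: float) -> List[float]:
    """Map codes 0..3 back to their grid values."""
    step = 2.0 * clip / 3.0
    return [c * step - clip for c in codes]


# A single large coordinate is spread across both coordinates by the rotation.
rotated = hadamard2((4.0, 0.0))

# Hand-picked clip threshold (assumption); OSCAR derives thresholds offline.
codes = quantize_int2([-5.0, -0.9, 0.9, 3.0], clip=1.5)
restored = dequantize_int2(codes, clip=1.5)

# Storage ratio of 16-bit BF16 entries to 2-bit entries, ignoring metadata.
memory_ratio = 16 / 2
```

The 16-to-2 bit ratio matches the roughly 8x memory reduction quoted in the abstract; a real deployment also stores scales or thresholds, which this arithmetic ignores.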

2605.17755 2026-05-19 cs.CL cs.AI 版本更新

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

弥合版本差距：多版本训练提升ICD代码预测，尤其是罕见代码

Jinghui Liu, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO（澳大利亚电子健康研究中心，CSIRO）

AI总结本文研究了通过结合不同ICD版本的数据训练版本无关模型的有效性，以解决ICD代码预测中的长尾问题和罕见代码性能瓶颈，实验表明多版本训练在提升罕见代码的微F1指标和频繁代码的宏指标方面均取得显著效果。

详情

AI中文摘要

临床编码将临床文档映射到标准化的医疗代码，这是一个关键但耗时的行政任务，可以通过自动化来改进。当前ICD编码模型通常针对特定版本的代码进行优化。然而，实际上ICD系统持续演进，不同版本在不同时期和地区被采用。此外，ICD编码面临长尾问题，罕见代码性能可能成为开发可实施模型的瓶颈。我们探讨了通过结合不同ICD版本的数据训练版本无关模型的可行性，这可能有助于解决这些挑战。我们将在修改后的标签注意力模型中加入ICD-9数据进行ICD-10预测训练，并发现尽管存在版本不匹配，加入ICD-9数据使18K个罕见ICD代码的微F1指标相比仅使用ICD-10训练提高了27%。在8K个频繁ICD-10代码上，多版本训练也显著提升了宏指标，并且模型参数更少。

英文摘要

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.
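Since the abstract reports micro F1 for rare codes and macro metrics for frequent codes, a short sketch of how the two averages differ may help. The label names and counts below are invented for illustration and are not results from the paper.

```python
from typing import Dict, Tuple

# label -> (true positives, false positives, false negatives)
Counts = Dict[str, Tuple[int, int, int]]


def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(counts: Counts) -> float:
    """Pool counts over all labels, then compute one F1."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1(tp, fp, fn)


def macro_f1(counts: Counts) -> float:
    """Compute F1 per label, then take the unweighted mean."""
    return sum(f1(*c) for c in counts.values()) / len(counts)


# Hypothetical counts: one frequent code and one rare code.
counts = {"frequent_code": (90, 10, 10), "rare_code": (1, 0, 9)}
micro = micro_f1(counts)
macro = macro_f1(counts)
```

Micro averaging is dominated by whichever labels contribute the most instances, while macro averaging gives each label equal weight, which is why long-tail performance shows up more strongly in macro scores over a mixed label set.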

2605.17746 2026-05-19 cs.AI cs.HC 版本更新

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

实验中的代理，代理中的实验：一种面向人工智能增强型实验科学的设计语法

Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

发表机构 * Guanghua School of Management, Peking University（北京大学光华管理学院）； Xi'an Jiaotong University（西安交通大学）； Cheung Kong Graduate School of Business（长江商学院）

AI总结本文提出SEED框架，用于表示实验条件为类型化的代理-流程图，以支持实验设计的自动化生成和评估，通过在医疗分诊任务中的实验证明其有效性，并讨论了新颖性、可重复性等治理问题。

详情

AI中文摘要

人工智能系统正成为组织和知识工作中的积极参与者。它们越来越多地与人类互动，协调工作流程，并在多代理安排中运作。因此，理解其影响需要的不仅仅是测量输出准确性，还需要关于机制、委托、反馈和控制的证据。实验仍然是这一任务的核心，但它们也面临递归挑战：我们需要为代理设计实验来研究这些安排，我们可能需要为实验设计设计代理以帮助搜索可能设计的扩展空间。然而，人类-人工智能和代理工作流程的实验条件仍然大多以散文形式指定，这使得它们难以比较、重用或审计。我们将其框架为AI增强型知识生产的流程表示、可追溯性和治理问题。我们引入SEED（结构编码用于实验发现），一个将实验条件表示为类型化代理-流程图的框架。SEED支持三种设计功能：将条件描述为交互结构、评估结构新颖性相对于编码的先前设计、以及在可行性和治理约束下生成候选设计。我们报告了一项轻量级的实证可行性测试，比较了图盲和SEED引导生成在医疗分诊设计任务中的表现。在这一诊断对比中，SEED引导的候选设计显示出更清晰的代理-流程变化、假设和治理检查，支持了该语法作为设计辅助工具的可行性。评论最后指出围绕新颖性、可重复性、有效性、探究多样性以及问责制的治理张力。

英文摘要

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEED-guided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

2605.17734 2026-05-19 cs.AI 版本更新

Harnessing LLM Agents with Skill Programs

利用技能程序驾驭LLM智能体

Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

发表机构 * New York University（纽约大学）； Salesforce AI Research（Salesforce人工智能研究）

AI总结本文提出 HASP 框架，通过将技能转化为可执行程序函数（PFs）来提升 LLM agent 在复杂任务中的表现，其核心方法是通过 PFs 在失败状态时介入并修正行动，主要贡献是通过模块化设计实现推理、训练和自改进的多场景应用。

Comments 40 pages, 7 figures

详情

AI中文摘要

为复杂和长周期任务提供可重用技能已成为一种流行且成功的做法。然而，这些经验通常编码为文本指导，缺乏明确的机制来决定何时以及如何介入 agent 循环。为弥合这一差距，我们引入 HASP（通过技能程序 harnessing LLM agents），一种新的框架，将技能升级为可执行程序函数（PFs）。与被动建议不同，PFs 作为可执行的护栏，在易出错的状态下激活，并修改下一步行动或注入修正上下文。HASP 高度模块化：可以在推理时直接介入 agent 循环，训练后提供结构化监督，或通过进化验证的教师评审 PFs 实现自改进。实证上，HASP 在网页搜索、数学推理和编码任务中相比训练自由和训练方法取得了显著提升。例如，在网页搜索推理中，推理时的 PFs 使平均表现比（多循环）ReAct Agent 提高 25%，而训练后和受控进化则比 Search-R1 提高 30.4%。为了深入理解 HASP，我们的机制分析揭示了 PFs 如何触发和介入，技能如何内化，以及稳定技能库进化的必要性。

英文摘要

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP (Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.
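The idea of a skill that activates on a failure-prone state and rewrites the next action can be sketched as a trigger/intervene pair. This is a hypothetical illustration of the concept only: the `ProgramFunction` class, the `guarded_action` helper, the state layout, and the repeated-query guard are all assumptions and do not reflect the HASP codebase.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

State = Dict[str, List[str]]


@dataclass
class ProgramFunction:
    """Hypothetical executable skill: a trigger plus an intervention."""
    name: str
    trigger: Callable[[State], bool]
    intervene: Callable[[State, str], str]


def guarded_action(
    state: State, proposed: str, pfs: List[ProgramFunction]
) -> Tuple[str, List[str]]:
    """Let every triggered PF rewrite the proposed action, in order."""
    fired = []
    for pf in pfs:
        if pf.trigger(state):
            proposed = pf.intervene(state, proposed)
            fired.append(pf.name)
    return proposed, fired


def repeated_query(state: State) -> bool:
    """Failure-prone state (assumed): the last two search queries are identical."""
    history = state["history"]
    return len(history) >= 2 and history[-1] == history[-2]


def force_reformulation(state: State, proposed: str) -> str:
    """Replace the proposed action instead of merely advising against it."""
    return "reformulate_query"


pfs = [ProgramFunction("no_repeat_search", repeated_query, force_reformulation)]
stuck = {"history": ["capital of X", "capital of X"]}
fresh = {"history": ["capital of X"]}
```

Calling `guarded_action(stuck, "search: capital of X", pfs)` replaces the action, while the same call with `fresh` passes the proposal through unchanged, which is the difference between an executable guardrail and advisory text.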

2605.17733 2026-05-19 cs.AI cs.LG 版本更新

Divergence-Suppressing Couplings for Rectified Flow

修正流的发散抑制耦合

Yimeng Min, Carla P. Gomes

发表机构 * Department of Computer Science（计算机科学系）

AI总结本文提出了一种修正流的发散抑制耦合方法，通过在耦合生成过程中抑制学习到的速度场中的发散成分，从而减少轨迹的扭曲，提升生成效果。

详情

AI中文摘要

修正流的潜力在于生成自我生成的耦合，其轨迹是直的或几乎如此。在实践中，基础流模型生成的轨迹可能会弯曲和交织，导致耦合继承这种扭曲。本文指出，这种轨迹交织通常与学习到的速度场中非零发散区域相关，其中局部扩张或收缩会扭曲轨迹并推动粒子远离理想终点。我们随后提出了一种修正流的发散抑制耦合，这是一种离线修正，可减小耦合生成过程中学习到的速度场的发散成分。该修正仅在每次耦合对生成时支付一次，且在训练过程中被摊销，因此部署运行的时钟时间成本与标准修正流相同。实验证明，这种离线修改在2D合成基准和图像生成任务上都带来了稳定改进。

英文摘要

The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we identify that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuates the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.
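To make "nonzero divergence" concrete, the sketch below estimates the divergence of a 2D velocity field with central finite differences. It only measures divergence; it does not implement the paper's correction, and the function name `divergence_2d` and the two example fields are illustrative assumptions.

```python
from typing import Callable, Tuple

Field = Callable[[float, float], Tuple[float, float]]


def divergence_2d(v: Field, x: float, y: float, h: float = 1e-4) -> float:
    """Central-difference estimate of d(v_x)/dx + d(v_y)/dy at (x, y)."""
    dvx_dx = (v(x + h, y)[0] - v(x - h, y)[0]) / (2.0 * h)
    dvy_dy = (v(x, y + h)[1] - v(x, y - h)[1]) / (2.0 * h)
    return dvx_dx + dvy_dy


def rotation_field(x: float, y: float) -> Tuple[float, float]:
    """Pure rotation: particles circle the origin, divergence is 0."""
    return (-y, x)


def expansion_field(x: float, y: float) -> Tuple[float, float]:
    """Pure expansion: particles move away from the origin, divergence is 2."""
    return (x, y)


div_rotation = divergence_2d(rotation_field, 0.3, -0.7)
div_expansion = divergence_2d(expansion_field, 0.3, -0.7)
```

The rotation field is divergence-free while the expansion field has divergence 2 everywhere, which is the kind of local expansion the abstract associates with distorted trajectories.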

2605.17729 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

面向抗疫情冲击的胸部X光分析的领域增量学习

Danu Kim

发表机构 * Danu Kim（丹努·金）

AI总结本文提出了一种基于回放的领域增量持续学习方法，用于在跨领域变化中保持肺炎检测的鲁棒性和一致性，通过类感知平衡回放和类感知损失实现平衡的类表示和动态重加权，实验表明该方法在领域偏移的PneumoniaMNIST数据集上达到88.66%的平均准确率，优于经验回放、微调和联合训练基线。

Comments Published in Korea Software Congress (2025)

详情

AI中文摘要

深度学习模型在肺炎检测中实现了高准确性，但其在临床领域中的泛化能力受限于成像设备、获取协议和机构条件的差异。本研究引入了一种基于回放的领域增量持续学习方法，旨在使模型能够持续适应跨领域变化而不发生灾难性遗忘。所提出的方法结合了类感知平衡回放以在受限内存中保持平衡的类表示，以及类感知损失以在训练过程中动态重新加权类不平衡。在包含五个模拟领域的领域偏移PneumoniaMNIST数据集上进行的实验表明，所提出的方法实现了88.66%的平均准确率，优于经验回放、微调和联合训练基线。这些发现突显了所提出方法在跨临床环境变化中实现稳健和一致肺炎检测的有效性。

英文摘要

Deep learning models have achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning method designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.
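One plausible reading of "class-aware balanced replay" and "class-aware loss" is a memory with an equal quota per class plus inverse-frequency loss weights. The sketch below follows that reading; the per-class reservoir sampling, the weighting formula, and the names `BalancedReplayBuffer` and `inverse_frequency_weights` are assumptions, not the paper's stated design.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable


class BalancedReplayBuffer:
    """Fixed-size memory that reserves an equal quota for every class."""

    def __init__(self, capacity: int, num_classes: int, seed: int = 0) -> None:
        self.per_class = capacity // num_classes
        self.slots = defaultdict(list)
        self.seen = defaultdict(int)
        self.rng = random.Random(seed)

    def add(self, sample: object, label: Hashable) -> None:
        """Reservoir sampling inside each class bucket (assumed policy)."""
        self.seen[label] += 1
        bucket = self.slots[label]
        if len(bucket) < self.per_class:
            bucket.append(sample)
        else:
            j = self.rng.randrange(self.seen[label])
            if j < self.per_class:
                bucket[j] = sample

    def counts(self) -> Dict[Hashable, int]:
        return {label: len(bucket) for label, bucket in self.slots.items()}


def inverse_frequency_weights(label_counts: Dict[Hashable, int]) -> Dict[Hashable, float]:
    """Loss weight per class: total / (num_classes * class_count)."""
    total = sum(label_counts.values())
    k = len(label_counts)
    return {label: total / (k * n) for label, n in label_counts.items()}


buffer = BalancedReplayBuffer(capacity=10, num_classes=2)
for i in range(90):
    buffer.add(f"normal_{i}", 0)
for i in range(10):
    buffer.add(f"pneumonia_{i}", 1)

weights = inverse_frequency_weights({0: 90, 1: 10})
```

Even though the incoming stream is 9:1 imbalanced, the buffer ends up holding the same number of samples per class, and the minority class receives the larger loss weight.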

2605.17721 2026-05-19 cs.AI 版本更新

EXG: Self-Evolving Agents with Experience Graphs

EXG: 基于经验图的自演化代理

Yuxin Jin, Siyuan Zhang, Hanchen Wang, Lu Qin, Ying Zhang, Wenjie Zhang

发表机构 * University of Technology Sydney（悉尼科技大学）； The University of New South Wales（新南威尔士大学）

AI总结本文提出EXG，一种基于经验图的自演化代理框架，通过结构化组织积累的成功与失败经验，提升代理在复杂任务中的解决质量和资源效率。

详情

AI中文摘要

基于大型语言模型（LLM）的代理在复杂推理和问题解决中表现出强大的能力，但大多数部署的代理行为静态，执行过程中获得的知识难以随时间系统性改进。为此，越来越多的研究探索如何在部署过程中通过经验使代理改进，但现有方法要么依赖于单一任务的随意反思，要么采用无结构的记忆积累碎片化经验。为了解决这一限制，我们引入EXG，一种经验图框架，用于自演化代理，明确将积累的成功与失败组织成结构化、关系化的表示。EXG是首个为自演化代理设计的经验图，支持在执行过程中实时增长图以实现跨任务经验重用，以及离线重用整合的经验图作为外部记忆模块。这种设计也使EXG能够作为可插拔组件为现有自演化代理服务，将先前经验组织成统一的经验图，并在部署过程中提高解决方案质量和资源效率。在代码生成和推理基准上的广泛实验表明，EXG在在线和离线评估中均优于基于反思和记忆的基线，在性能-效率权衡上表现更优。我们的结果表明，将经验结构化为图提供了一个原理性基础，以实现可扩展且可迁移的自演化代理行为。

英文摘要

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

2605.17693 2026-05-19 cs.LG cs.AI 版本更新

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

通过去噪策略优化微调口袋感知扩散模型

Yuan Xue, Daniel Kudenko, Megha Khosla

发表机构 * L3S Research Center（L3S研究中心）； Delft University of Technology（代尔夫特理工大学）

AI总结本文提出DEPPA方法，基于去噪扩散策略优化，通过强化学习微调预训练的口袋感知扩散模型，以优化结合亲和力、类药性、可合成性和多样性等多种属性。

详情

AI中文摘要

口袋感知的3D生成模型加速了基于结构的药物设计，但大多数方法主要拟合训练分布，可能无法满足真实世界治疗药物发现所需的多种属性。最近，越来越多的关注集中在基于结构的分子优化（SBMO）上，其目标是对多个指定的分子属性进行细粒度控制。在本文中，我们提出DEPPA，一种新的SBMO方法，基于去噪扩散策略优化，通过强化学习微调预训练的口袋感知扩散模型。DEPPA能够优化多个属性，包括结合亲和力、类药性、可合成性和多样性。我们将预训练口袋感知扩散模型的反向去噪过程建模为多步马尔可夫决策过程，其中作为奖励信号的目标属性在最终生成的配体分子上进行评估。DEPPA在RL微调期间采用粗粒度的去噪调度器，以实现高效且有效的分子优化。在CrossDocked2020基准上的实验结果表明，DEPPA在结合亲和力（Vina Score -8.5 kcal/mol）、类药性和多样性方面优于基线，在可合成性方面表现出有竞争力的性能。源代码可在 https://github.com/xy9485/DePPA 获得。

英文摘要

Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at https://github.com/xy9485/DePPA .

2605.17691 2026-05-19 cs.CL cs.AI 版本更新

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

验证你的权威：在多标签先例处理分类上对LLM进行基准测试

M. Mikail Demir, M. Abdullah Canbaz

发表机构 * Department of Information Science and Technology（信息科学与技术系）； College of Emergency Preparedness, Homeland Security, and Cybersecurity（应急准备、国土安全与网络安全学院）； University at Albany, SUNY（纽约州立大学奥尔巴尼分校）

AI总结本文提出了一种新的评估框架，通过专家标注的数据集对现代大语言模型进行基准测试，引入了平均严重性误差指标，以更准确地衡量分类错误的实践影响。

Comments Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP

详情

DOI: 10.18653/v1/2025.nllp-1.13

AI中文摘要

自动化法律先例中负面处理的分类是一个关键但复杂的自然语言处理任务，误分类可能带来重大风险。为了解决标准准确率的不足，本文介绍了一种更稳健的评估框架。我们对239个真实世界法律引用的新专家标注数据集上的现代大语言模型进行了基准测试，并提出了一种新的平均严重性误差度量标准，以更好地衡量分类错误的实践影响。我们的实验揭示了性能的分裂。Google的Gemini 2.5 Flash在高层次分类任务上达到了最高准确率（79.1%），而OpenAI的GPT-5-mini则在更复杂的细粒度模式上表现最佳（67.7%）。本工作建立了关键基准，提供了一个新的上下文丰富的数据集，并引入了一个针对这一复杂法律推理任务的评估度量标准。

英文摘要

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

2605.17685 2026-05-19 cs.CV cs.AI cs.CR cs.SY eess.SP eess.SY 版本更新

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

基于注意力引导的1D和2D CNN融合用于鲁棒的基于ECG的生物识别

Islameddine Arioua, Amir Benzaoui, Abdelhafid Zeroual, Lotfi Houam

发表机构 * PIMIS Laboratory, Electronics and Telecommunications Department（PIMIS实验室，电子与电信系）； Université du 8 Mai 1945（1945年5月8日大学）； Electrical Engineering Department, University of 20 August 1955（电气工程系，1955年8月20日大学）； Department of Electrical Engineering, Faculty of Science and Applied Sciences（电气工程系，科学与应用科学学院）； Larbi Ben M'hidi University（拉比·本·迈迪大学）； Department of Electronics and Communications, University of Larbi Tebessi（电子与通信系，拉比·特贝西大学）

AI总结本文提出了一种结合1D和2D CNN的混合框架，通过注意力引导融合机制提升ECG生物识别的鲁棒性和性能，实验表明该方法在多个数据集上均取得了较高的识别准确率。

Journal ref Digital Signal Processing 2026

详情

DOI: 10.1016/j.dsp.2026.106252

AI中文摘要

基于心电图（ECG）的生物识别已作为一种安全的身份验证和活体检测的有希望的解决方案。然而，大多数现有方法依赖于单模深度学习架构，单独处理一维（1D）时间信号或二维（2D）时频表示，限制了鲁棒性和泛化能力。为了解决这个问题，本文提出了一种将1D和2D卷积神经网络（CNNs）整合到统一端到端架构中的混合框架。1D分支从原始ECG信号中提取时序和形态学特征，而2D分支从时频表示中捕获判别性的频谱信息。注意力引导的融合机制根据输入特性动态加权两种模态，克服了传统静态融合策略的局限性。该框架在三个基准数据集（ECG-ID、MIT-BIH和PTB）上进行了评估，包括健康受试者和患有心脏病理学的患者，分别实现了99.56%、100.00%和99.89%的识别准确率。为了评估长期生物稳定性，还进行了多会话Heartprint数据集的实验，该数据集跨越十年。所提出的方法在相同会话中实现了98.54%（S1）、99.09%（S2）、94.93%（S3R）和96.08%（S3L）的准确率，跨会话评估达到了56.33%（S1-S2）和53.27%（S2-S3R），证明了其在时间上的稳定生物特征捕获能力。最优配置结合了InceptionTime用于1D处理，ResNet-34用于2D分析，以及基于注意力的融合。消融研究证实，所提出的注意力机制在传统融合方法中始终表现更优。总体而言，所提出的框架为ECG生物识别提供了一种稳健、可扩展且高性能的解决方案。

英文摘要

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.
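The contrast between static fusion and input-dependent weighting can be shown with a two-branch softmax sketch. This is a generic illustration under stated assumptions: in a trained model the branch scores would come from a learned scoring layer, whereas here they are passed in directly, and `attention_fuse` is a hypothetical name rather than the paper's module.

```python
import math
from typing import List, Tuple


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(
    feat_1d: List[float],
    feat_2d: List[float],
    score_1d: float,
    score_2d: float,
) -> Tuple[List[float], Tuple[float, float]]:
    """Weight two same-sized branch embeddings by softmax-normalized scores.

    The scores are supplied by the caller here (assumption); a real model
    would predict them from the input.
    """
    w1, w2 = softmax([score_1d, score_2d])
    fused = [w1 * a + w2 * b for a, b in zip(feat_1d, feat_2d)]
    return fused, (w1, w2)


# Equal scores reduce to a plain average of the two branches.
fused_equal, weights_equal = attention_fuse([1.0, 3.0], [3.0, 1.0], 0.0, 0.0)

# A higher score for the 1D branch pulls the fused vector toward it.
fused_skewed, weights_skewed = attention_fuse([1.0, 3.0], [3.0, 1.0], 2.0, 0.0)
```

A static fusion strategy corresponds to fixing the weights once for all inputs; letting the scores depend on the input is what makes the weighting dynamic.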

2605.17684 2026-05-19 cs.AI cs.SE 版本更新

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

EGI：一种多模态情感AI框架，用于增强Scrum Master的实时自我意识

Jingni Huang, Peter Bloodsworth

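Affiliations: Department of Computer Science; University of Oxford

AI summary: This paper proposes EGI, a multimodal emotional AI framework that integrates four selected AI models to monitor, in real time, the emotions that Scrum Masters and meeting organizers express unconsciously, improving emotion awareness in team dynamics.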

AI中文摘要

尽管越来越多的研究关注敏捷团队成员的情绪福祉，但在Scrum Master和会议组织者的情绪监测研究中仍存在显著差距，这些角色对团队动态的影响至关重要。本文提出了一种新的应用，整合四个精心选择和推荐的AI模型，通过实时语音转文本模型进行实时转录；通过阈值分析检测语气中的情绪线索；通过基于情绪的词汇匹配识别语音内容中的情感；并通过开源的多模块AI API提供上下文感知的建议，包含情绪关键词。系统在模拟会议环境中实现了10%的ASR词错误率。我们的评估表明，实时反馈显著提高了模拟敏捷会议中的情绪感知能力，为Scrum Master和会议组织者提供实时和实用的建议，帮助他们快速识别并减少负面情绪的表达，促进更积极有效的团队互动。

英文摘要

While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizers, whose impact on team dynamics is crucial. This paper proposes a novel application integrating four carefully selected and recommended AI models to monitor the unconsciously expressed emotions of these key roles. This is achieved through: real-time transcription using a speech-to-text model; thresholding for intonation analysis to detect emotional cues in prosody; emotion-based vocabulary matching to identify sentiment in spoken content; and context-aware suggestions containing emotion keywords, generated with an open-source, multi-module AI API. The system achieved an ASR word error rate (WER) of 10% in simulated meeting environments. Our evaluation shows that real-time feedback significantly improves emotion awareness during simulated agile meetings, providing Scrum Masters and meeting organizers with real-time, practical suggestions to help them quickly identify and minimize the expression of negative emotions, fostering more positive and effective team interactions.

2605.17679 2026-05-19 cs.HC cs.AI 版本更新

PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

PULSE：基于被动感知的代理探究用于癌症幸存者的主动干预

Zhiyuan Wang, Ariful Islam, Indrajeet Ghosh, Xinyu Chen, Katharine E. Daniel, Subigya Nepal, Philip Chow, Laura E. Barnes

Affiliations: Department of Systems and Information Engineering, University of Virginia; Center for Behavioral Health and Technology, University of Virginia; Department of Computer Science, University of Virginia

AI summary: This paper presents PULSE, a system that applies agentic sensing investigation to passive smartphone sensing data and diary data to predict cancer survivors' emotion regulation needs more accurately, supporting the value of agentic reasoning for proactive intervention.

AI中文摘要

癌症幸存者面临更高的抑郁、焦虑和一般情绪困扰风险，但需要支持的精确时刻往往自我报告数据稀疏，我们称之为日记悖论。被动智能手机感知提供了一种持续且无干扰的替代方案，但以往基于感知的愉悦预测受限于准确性上限，表明不仅数据可用性，而且行为信号的解释也存在瓶颈。我们提出了PULSE系统，该系统从固定特征管道转向代理感知探究：配备八个专用工具的LLM代理自主查询智能手机感知数据，将当前行为与个性化基线进行比较，并通过检索增强的群体层面比较进行校准。与接收预格式化特征摘要不同，代理决定检查哪些模态、回溯多远以及深入探究多少，模仿假设驱动的临床推理。我们通过2*2因子设计交叉推理架构（结构化 vs. 代理）与数据模态（仅感知 vs. 有日记）在50名癌症幸存者的纵向研究中评估PULSE。代理推理是性能的主要驱动因素：代理多模态代理在日记和感知数据下实现情绪调节需求的平衡准确率为0.743，而代理在仅被动感知数据下预测干预可用性的准确率为0.713。这些结果表明，代理探究可能成为解锁被动感知临床价值的关键，推动主动即时心理健康支持的可行性。

英文摘要

Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self-report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing-based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval-augmented population-level comparisons. Rather than receiving pre-formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE through a 2×2 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 cancer survivors from a longitudinal study. Agentic reasoning is the primary driver of performance: the agentic multimodal agent achieves a balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at a balanced accuracy of 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just-in-time mental health support.

2605.17671 2026-05-19 cs.LG cs.AI 版本更新

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

PEIRA: 通过视图回归对齐学习预测编码器

Michael Arbel, Basile Terver, Jean Ponce

Affiliations: Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK; Ecole Normale Supérieure / PSL, Inria Paris; New York University

AI summary: This paper proposes PEIRA, a non-contrastive self-supervised learning method built on an explicit objective function and linear-regressor alignment, and supports it with theoretical analysis and experiments on ImageNet-1K and CIFAR-10.

AI中文摘要

非对比自监督学习（SSL）是预测表示学习的有效框架，但像SimSiam、BYOL、I-JEPA或DINO等流行方法依赖于自蒸馏来训练教师-学生网络，但通常不最小化明确的目标函数。我们分析了联合嵌入预测架构（JEPA）的一个变种，使用正则化的线性回归器来预测数据两个视图之间的学习表示，并完全表征其稳定性：非坍塌的稳定平衡点对齐于主导的非线性典型相关子空间，而坍塌的平衡点也可能是稳定的吸引子。受此结果启发，我们引入PEIRA，一种非对比SSL方法，其目标函数通过最优线性回归器的迹定义。我们证明其唯一稳定的平衡点是非平凡的全局最小值，并恢复相同的典型相关子空间，正则化选择有效维度。在ImageNet-1K和CIFAR-10上的实验表明，PEIRA与VICReg和LeJEPA基线具有竞争力，定性实验结果支持理论。

英文摘要

Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.

2605.17669 2026-05-19 cs.AI 版本更新

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

多模态文化遗产品知图扩展与语言和视觉模型

Yang Zhang, Nada Mimouni, Jean-Claude Moissinac, Fayçal Hamdi

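Affiliations: Center for Studies and Research in Computer Science and Communication, CNAM

AI summary: This paper proposes a multimodal approach that uses language and vision models to extend cultural heritage knowledge graphs, building the multimodal knowledge graph WJoconde and an evaluation benchmark to make knowledge graph extension more efficient and reliable.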

AI中文摘要

文化遗产品保育和解读日益依赖数字技术，其中知识图谱（KGs）因其能够结构化大量数据而脱颖而出。然而，这些KGs的构建和扩展往往面临挑战，因为文化遗产信息具有多样性和复杂性。本文提出了一种新的方法，用于扩展文化遗产领域的KG资源，应用于法语数据。首先，我们引入了一个新的知识图谱WJoconde，其特点是多模态，整合了实体的文本和图像信息。我们进一步引入了三个WJoconde的变体，以促进下游研究，如知识图谱补全（KGC）。我们还建立了一个全面的KGC方法基准，用于我们的数据集。其次，我们提出了一种新的框架，利用多模态方法扩展文化遗产KGs，结合大型语言模型（LLMs）和视觉-语言模型（VLMs），包括从非结构化资源中自动提取数据，并结合特殊的验证流程来确保两种模型输出的可靠性，以进一步扩展WJoconde。我们的结果表明，通过整合文化遗产数据中的丰富文本和图像信息，可以高效地增强具有高可靠性的KGs。我们开源了所有代码和基准数据集，包括文本和图像，以及原始数据的交互访问点。

英文摘要

The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast amounts of data. However, the construction and expansion of these KGs often face challenges due to the diverse and complex nature of cultural heritage information. In this paper, we propose a novel approach for extending KG resources in the domain of cultural heritage, which we applied to French data. First, we introduce a new knowledge graph in the domain of French cultural heritage, WJoconde, which is distinguished by its multimodality as it integrates both textual and image information of the entities. We further introduce three variants of WJoconde to facilitate downstream research, such as Knowledge Graph Completion (KGC). We also built a comprehensive benchmark for KGC methods on our dataset. Second, we propose a new framework for extending cultural heritage KGs using multimodal approaches that leverage Large Language Models (LLMs) and Vision-Language Models (VLMs); it combines automated data extraction from unstructured resources with a dedicated validation pipeline for grounding the output of both models, and we use it to further extend WJoconde. Our results show that by integrating the rich text and image information in cultural heritage data, we can efficiently enhance KGs with high reliability. We open-source all code and benchmark datasets with text and images, as well as the original data with an interactive access point.

2605.17660 2026-05-19 math.OC cs.AI cs.LG stat.ML 版本更新

Training Infinitely Deep and Wide Transformers

训练无限深且宽的Transformer

Raphaël Barboni, Maarten V. de Hoop, Takashi Furuya, Gabriel Peyré

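Affiliations: Bocconi University; Doshisha University, RIKEN AIP; Rice University; CNRS, ENS, PSL Université

AI summary: This paper develops a rigorous mathematical framework for analyzing gradient-based training dynamics of transformers in the mean-field regime. Studying a mean-field model of infinitely deep and wide transformers, it derives an explicit formula for the conditional Wasserstein gradient of the training risk and proves that, under an NTK injectivity assumption, gradient flow converges to global minima when the initial loss is sufficiently small.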

AI中文摘要

Transformers已成为现代机器学习中占主导地位的架构，但其训练动态的理论理解仍然有限。本文开发了一个严格的数学框架，用于分析在均场 regime 中Transformer的梯度基于训练动态，其中深度（层数）和宽度（注意头数）趋于无穷大。虽然ResNet训练可以理解为控制神经ODE，但Transformer训练对应于控制神经PDE，因为通过注意力机制耦合了多个token分布。我们的均场模型特征两种类型的测度表示：通过层演变的token分布和每层的注意力参数。我们建立了无限深Transformer前向传递的well-posedness，通过流映射来表征token演变，这些流映射满足函数空间中的ODE。利用伴随敏感度分析，我们推导出训练风险的条件Wasserstein梯度的显式公式，该公式涉及由反向ODE控制的伴随变量。我们证明了在条件Wasserstein度量空间中梯度流曲线的存在性和唯一性，建立了梯度基于Transformer训练的严格基础。一个关键技术贡献是提供了注意力机制的神经切线核（NTK）注入性的必要且充分条件：我们证明NTK注入性等同于log-sum-exp函数的线性独立性模仿射函数，这一条件由多种token分布满足，包括离散分布、均匀分布和高斯混合分布。在NTK注入性假设下，我们证明当初始损失足够小时，梯度流收敛到全局极小值，消除了优化景观中的虚假局部极小值。

英文摘要

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.

2605.17653 2026-05-19 cs.LG cs.AI 版本更新

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

LLMForge: 多后端硬件感知的神经架构搜索与无限头注意力用于边缘语言模型

Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

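Affiliations: Brown University; University of Michigan; Google Research

AI summary: This paper presents LLMForge, a multi-backend hardware-aware neural architecture search framework. Infinite-Head Attention enlarges the per-layer attention configuration space, and Forge-Former and Forge-DSE make architecture search for edge language models efficient. The search yields differently shaped architectures on different hardware substrates, with variants optimized for different performance metrics.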

AI中文摘要

子百亿参数的Transformer语言模型正越来越多地部署在边缘设备上，其中设备端推理的隐私、延迟和运行成本优势受到紧密的内存带宽、能量和热预算的限制，使得架构选择和加速器特定的成本成为高效推理的关键。我们提出了LLMForge，一种硬件感知的神经架构搜索（NAS）框架，其三个可组合的贡献共同使边缘LM架构搜索变得硬件条件化，因为不同的基材施加了不同的硬件成本瓶颈。无限头注意力（IHA）解耦了查询头数、KV组数和每个头的查询/键/值维度，扩展了在我们的搜索空间范围内每层注意力配置空间，大约扩大了400倍。Forge-Former是一种基于编码器的替代方案，用于对架构候选者进行排名，优于MLP和随机森林基线。Forge-DSE是一种基于NSGA-II的设计空间探索引擎，与Forge-Former配对，结合了覆盖GPU、张量核心加速器和环数据流边缘加速器的多后端硬件成本模型。在四种不同的硬件基材上，搜索收敛到明显不同的架构，其形状跟踪每个基材的成本瓶颈。在多芯片环基材上，我们的联合搜索返回了三个3亿参数规模的部署感知变体，这些变体位于帕累托前沿上。每个变体都在FineWeb-Edu-10BT上重新训练，以匹配SmolLM2-360M和Qwen-0.5B架构基线。准确的变体具有最低的验证损失2.798，并在参数较少的情况下具有竞争性的基准性能，能量优化的变体降低了每token的能量消耗40%，延迟优化的变体降低了TTFT和TPOT 43%。

英文摘要

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under a matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss (2.798) and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.
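
The Pareto-front selection mentioned in the abstract can be illustrated with a minimal non-dominated filter. This is a sketch under assumptions, not the authors' implementation: NSGA-II additionally ranks successive fronts and uses crowding distance and evolutionary operators, and the candidate names and objective values below are hypothetical, chosen only so that all three objectives are minimized.

```python
def dominates(a, b):
    """True if a is no worse than b on every objective and strictly better on one (minimization)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return {
        name: objectives
        for name, objectives in candidates.items()
        if not any(dominates(other, objectives) for other in candidates.values())
    }


# Hypothetical (validation loss, energy per token, latency) triples, all minimized.
candidates = {
    "accurate": (2.80, 1.00, 1.00),
    "energy_opt": (2.90, 0.60, 0.90),
    "latency_opt": (2.95, 0.80, 0.57),
    "dominated": (3.00, 1.10, 1.05),
}
front = pareto_front(candidates)
```

Each surviving candidate is best on at least one trade-off, which is why a search of this kind can return several deployment variants rather than a single winner.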

2605.17648 2026-05-19 cs.AI 版本更新

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO：基于推理的生成推荐的步骤对齐策略优化

Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li

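Affiliations: University of Virginia; Nokia

AI summary: This paper proposes SAPO, which uses step-aligned policy optimization to address the training instability caused by sparse exact-match feedback in generative recommendation, improving the training of reasoning-based generative recommenders.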

AI中文摘要

生成推荐将下一项预测视为自回归的物品标识符生成。具体而言，物品被编码为语义标识符（SIDs），这些是短的由粗到细的令牌序列，早期令牌捕捉广泛语义，后期令牌细化它们。近期工作在该范式中加入了推理轨迹并通过强化学习进行优化，通常使用具有生成SID的精确匹配反馈的成果奖励算法。然而，在大型目录推荐中，对生成SID的精确匹配反馈只能报告最终物品是否正确；当生成SID不匹配时，成果奖励无法识别导致不匹配的SID-令牌预测，并可能对匹配的SID-令牌位置和不匹配的位置一起进行惩罚。我们发现在此设置中的自然信用分配单位是一个单独的推理步骤（一个思考块配对一个SID令牌）。我们实例化这一想法在SAPO（步骤对齐策略优化）中：而不是将一个优势广播到整个响应，SAPO为每个推理步骤计算一个单独的组内优势，并仅应用于相应的思考块和SID令牌。在三个真实世界推荐数据集中，SAPO稳定了强化学习训练并持续改进现有生成推荐基线，最大收益出现在稀疏精确匹配反馈使推理步骤信用分配重要的地方。我们的结果表明，结构生成的强化学习目标应反映解码器自身的输出分解。

英文摘要

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically an outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, an outcome reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.
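
The per-step, group-relative advantage described in the abstract can be sketched as follows. This is an illustration under assumptions rather than the SAPO implementation: the abstract does not state how each step is rewarded, so `token_match_rewards` (1.0 when a generated SID token equals the target token) is a hypothetical stand-in, and the normalization follows the common mean/standard-deviation form used by group-relative methods.

```python
from statistics import mean, pstdev


def token_match_rewards(generated_sid, target_sid):
    """Hypothetical step reward: 1.0 where the SID token matches the target, else 0.0."""
    return [1.0 if g == t else 0.0 for g, t in zip(generated_sid, target_sid)]


def stepwise_group_advantages(step_rewards, eps=1e-6):
    """step_rewards[g][k] is the reward of sampled response g at reasoning step k.

    Returns advantages[g][k], normalized across the group separately for each step,
    so a mismatch at one step does not penalize the other steps of that response.
    """
    num_steps = len(step_rewards[0])
    advantages = [[0.0] * num_steps for _ in step_rewards]
    for k in range(num_steps):
        column = [rewards[k] for rewards in step_rewards]
        mu, sigma = mean(column), pstdev(column)
        for g, reward in enumerate(column):
            advantages[g][k] = (reward - mu) / (sigma + eps)
    return advantages


target = [3, 7, 1]
group = [[3, 7, 1], [3, 7, 2], [3, 5, 1], [3, 7, 1]]  # four sampled SIDs
advantages = stepwise_group_advantages([token_match_rewards(s, target) for s in group])
```

In this toy group, the third response is wrong only at its second token, so it receives a negative advantage at that step and a positive one at the third step, instead of one shared penalty for the whole response.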

2605.17641 2026-05-19 cs.AI cs.CL 版本更新

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

基于因果干预的记忆选择用于长时域大语言模型智能体

Saksham Sahai Srivastava

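Affiliations: School of Computing, University of Georgia, Athens, Georgia, USA

AI summary: This paper proposes Causal Memory Intervention (CMI), which selects long-term memories for LLM agents through causal reasoning to improve answer quality and robustness, and introduces the Causal-LoCoMo benchmark for evaluation.

Comments: 12 pages, 3 figures, 3 tables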

AI中文摘要

长时域大语言模型智能体依赖持久记忆来支持跨会话的交互，但现有记忆系统通常使用语义相似性或广泛历史包含来检索上下文，将检索到的记忆视为统一有用。这一假设是脆弱的，因为记忆可能在主题上相关，但仍然无关、过时或误导性。我们提出了Causal Memory Intervention（CMI），一种因果记忆选择技术，通过在受控干预下估计候选记忆如何影响模型的答案，选择提高任务性能的同时抑制不稳定、无关或有害的记忆。为了评估这一设置，我们引入了Causal-LoCoMo，一个从长对话数据中衍生出的因果标注基准，其中每个示例包含用户请求、结构化记忆库、有用的记忆、无关干扰项以及合成有害记忆。我们比较了CMI与向量、图、反思、摘要、完整历史和无记忆基线。结果表明，CMI在回答质量和对误导性记忆的鲁棒性之间实现了更强的平衡，表明可靠的长期记忆需要基于因果有用性而非相关性本身来选择上下文。完整的框架、基准构建代码和实验流程可在https://github.com/Saksham4796/causal-memory-intervention获取。

英文摘要

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.
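
One simple way to picture intervention-based memory selection is to compare an answer-quality score with and without each candidate memory. The sketch below is only an assumption-laden illustration: the abstract does not specify the interventions CMI uses, and `answer_score`, `toy_score`, and the memory strings are hypothetical stand-ins for a real model call and a real memory bank.

```python
def select_memories(query, memories, answer_score, min_gain=0.0):
    """Keep memories whose insertion raises the score over the no-memory baseline."""
    baseline = answer_score(query, [])
    scored = []
    for memory in memories:
        gain = answer_score(query, [memory]) - baseline  # effect of the intervention
        if gain > min_gain:
            scored.append((memory, gain))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [memory for memory, _ in scored]


def toy_score(query, context):
    """Hypothetical scorer: a fresh fact helps, a stale fact hurts, others do nothing."""
    score = 0.5
    if "user moved to Athens in 2025" in context:
        score += 0.4
    if "user lives in Boston" in context:
        score -= 0.3
    return score


memories = ["user lives in Boston", "user likes jazz", "user moved to Athens in 2025"]
selected = select_memories("Where does the user live?", memories, toy_score)
```

The stale memory is topically related to the query yet lowers the score, so it is dropped along with the irrelevant one; a similarity-based retriever would likely have returned it.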

2605.17633 2026-05-19 cs.CV cs.AI 版本更新

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

SparseSAM: Segment Anything模型中激活的结构稀疏化

Hoai-Chau Tran, Chi H. Nguyen, Duy M. H. Nguyen, Mathias Niepert, Fan Lai, Khoa D. Doan

Affiliations: University of Illinois at Urbana-Champaign; College of Engineering & Computer Science, VinUniversity; VinUni-Illinois Smart Health Center, VinUniversity; DFKI; Max Planck Research School for Intelligent Systems (IMPRS-IS); University of Stuttgart

AI summary: This paper proposes SparseSAM, a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity, speeding up inference and reducing memory use while maintaining segmentation quality.

AI中文摘要

Segment Anything Model (SAM) 实现了强大的开放词汇分割，但其基于ViT的图像编码器在推理延迟和内存方面占主导地位。现有的激活压缩方法，如标记合并，通过减少标记长度来处理，但引入了非平凡的运行时开销，并在高压缩下导致灾难性质量下降。其他应用稀疏注意力的方法仅关注注意力本身，使MLP完全密集，并限制了可达到的速度提升。我们提出了SparseSAM，一种（i）无需训练的结构稀疏化框架，该框架在加速注意力和MLP层的同时保持token身份。SparseSAM引入了（ii）Stripe-Sort Attention，它使用确定性的Z序排列将密集注意力转换为静态的硬件友好的稀疏模式，消除了动态掩码的开销。SparseSAM进一步引入了（iii）残差一致性MLP，只将信息性token路由通过MLP，同时通过残差路径传播剩余token。在四个分割基准测试中，SparseSAM在0.4密度下仅损失0.004 mIoU，在0.3密度下损失0.021 mIoU，相较于标记合并方法的改进，准确率损失减少了2.10倍，同时实现了2倍更快的推理速度和2.8倍的内存减少。

英文摘要

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the number of tokens to process, yet introduce non-trivial runtime overhead and suffer catastrophic quality drops under high compression. Other methods that apply sparse attention focus on attention alone, leaving the MLP fully dense and capping the achievable speedup. We propose SparseSAM, (i) a training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces (iii) a Residual-Consistency MLP that routes only informative tokens through the MLP while propagating the remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at 0.4 density and 0.021 mIoU at 0.3 density, a 2.10x reduction in accuracy loss versus recent token-merging methods, while achieving 2x faster inference and a 2.8x memory reduction.
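
A deterministic Z-order (Morton) permutation of a 2D token grid can be sketched as below. The sketch assumes tokens are laid out row-major on an image patch grid; how Stripe-Sort Attention turns the reordered sequence into sparse attention stripes is not described in the abstract and is not modeled here.

```python
def morton_code(row, col, bits=16):
    """Interleave the bits of (row, col): column bits go to even positions, row bits to odd."""
    code = 0
    for i in range(bits):
        code |= ((col >> i) & 1) << (2 * i)
        code |= ((row >> i) & 1) << (2 * i + 1)
    return code


def z_order_permutation(height, width):
    """Return row-major token indices sorted along the Z-order curve."""
    cells = [(r, c) for r in range(height) for c in range(width)]
    cells.sort(key=lambda rc: morton_code(rc[0], rc[1]))
    return [r * width + c for r, c in cells]


perm = z_order_permutation(4, 4)
# perm == [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```

Because the curve visits each 2x2 block before moving on, spatially neighboring patches end up close together in the 1D sequence, and the permutation depends only on the grid shape, so it can be computed once and reused.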

2605.17625 2026-05-19 cs.AI 版本更新

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

用于长周期科学代理的事件-语义记忆架构

Nikola Milosevic

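Affiliations: Serbian Institute for Artificial Intelligence Research and Development; Bayer A.G.

AI summary: This paper presents a dual-process memory architecture that addresses context window saturation for scientific agents on long-horizon tasks by separating immediate episodic needs from long-term knowledge consolidation, improving performance and scalability in large-scale scientific workflows.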

AI中文摘要

随着大型语言模型（LLMs）发展为持久的科学合作者，情境窗口饱和已成为关键瓶颈。涉及迭代数据分析和假设修正的科学工作流迅速耗尽即使扩展的情境，而单一方法面临二次成本扩展和认知退化。我们评估了一种双过程记忆架构，将即时事件需求（恒定10条消息窗口）与长期整合知识（以每条消息约3个标记增长）分离。不同于先前的社会代理记忆系统，我们的领域特定整合解决了矛盾的参数演变、跨实验阶段的多跳推理以及精确的技术事实保留。通过覆盖15,000条消息的大型评估，跨模型验证六个LLM家族（OpenAI、Anthropic、Google）共计1,440个查询，我们得出三个关键发现。首先，尽管全情境模型在10,000条消息时因情境溢出失败，我们的系统在使用62%更少的标记（45,434 vs 120,000+限制）的情况下，保持70-85%的准确性，延迟仅1-2秒。其次，跨模型验证揭示了架构层面的权衡，与特定LLM无关：双过程在数值/时间查询（65-90%准确率）方面表现优异，而RAG在历史检索（60-85%）方面更优，表明互补的部署策略。第三，我们识别出“仿真到现实”的差距，合成测试保持恒定的记忆，但现实工作流表现出线性增长（约每条消息3个标记），其中整合质量成为主要的可扩展性瓶颈。该架构成功管理了包含14,000多个科学事实（125k标记）的资料，证明了领域特定的记忆整合能够持续运行超过全情境限制。

英文摘要

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.
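
The split between a constant episodic window and a growing consolidated store can be sketched with a bounded deque and a dictionary. This is a minimal sketch under assumptions: fact extraction is left to the caller, and the "latest value wins" rule for contradictory parameters is a hypothetical consolidation policy, not necessarily the one evaluated in the paper.

```python
from collections import deque


class DualProcessMemory:
    def __init__(self, window=10):
        self.episodic = deque(maxlen=window)  # constant-size window of recent messages
        self.semantic = {}                    # consolidated facts, grows slowly

    def add(self, message, facts=None):
        """Store a message and consolidate any facts the caller extracted from it."""
        self.episodic.append(message)
        for key, value in (facts or {}).items():
            self.semantic[key] = value  # assumed policy: a newer value replaces an older one

    def context(self):
        """Prompt context: consolidated facts first, then the recent window."""
        return [f"{k} = {v}" for k, v in self.semantic.items()] + list(self.episodic)


memory = DualProcessMemory(window=10)
for step in range(25):
    memory.add(f"message {step}", facts={"learning_rate": 0.1 / (step + 1)})
```

After 25 messages the episodic window still holds only the last 10, while the semantic store keeps a single, current value for the evolving parameter.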

2605.17624 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

通过不变/等变半监督学习进行部分标注数据集上的多任务学习

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

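Affiliations: KTH Royal Institute of Technology; Univrses AB

AI summary: This paper studies whether invariant and equivariant semi-supervised learning can address the challenges of training multi-task models on partially labeled datasets. It evaluates FixMatch and its equivariant extension Dense FixMatch on Cityscapes and BDD100K for object detection and semantic segmentation, finding that both outperform supervised baselines in most settings, with the largest gains when few labeled samples are available.

Comments: https://github.com/miquelmarti/DenseFixMatch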

AI中文摘要

我们研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力。具体而言，我们使用流行的FixMatch方法进行不变半监督学习，并采用其等变扩展Dense FixMatch。我们在Cityscapes和BDD100K数据集上评估了它们在计算机视觉中普遍的目标检测和语义分割任务中的性能。我们考虑了每个任务标注子集的不同大小以及它们之间的不同重叠情况。我们的结果表明，对于不变和等变半监督学习，大多数情况下都优于监督基线，特别是在任务中可用标注样本较少时，改进最为显著，且后者方法通常表现更好。我们的研究表明，不变/等变学习是有限标注数据下多任务学习的一个有前途的方向。

英文摘要

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.
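
For readers unfamiliar with FixMatch, its unlabeled-data term can be sketched in plain Python: a confident prediction on a weakly augmented view becomes a pseudo-label for the strongly augmented view. The sketch covers only the image-level (invariant) case with the commonly used 0.95 confidence threshold; Dense FixMatch, the equivariant extension for dense outputs, is not modeled, and the logits below are made-up inputs.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fixmatch_unlabeled_loss(weak_logits, strong_logits, threshold=0.95):
    """Masked cross-entropy between weak-view pseudo-labels and strong-view predictions.

    Samples whose weak-view confidence is below the threshold contribute zero,
    and the sum is averaged over the whole unlabeled batch.
    """
    total = 0.0
    for weak, strong in zip(weak_logits, strong_logits):
        probs = softmax(weak)
        confidence = max(probs)
        if confidence >= threshold:
            pseudo_label = probs.index(confidence)
            total += -math.log(softmax(strong)[pseudo_label])
    return total / len(weak_logits)


agree = fixmatch_unlabeled_loss([[10.0, 0.0, 0.0]], [[5.0, 0.0, 0.0]])
disagree = fixmatch_unlabeled_loss([[10.0, 0.0, 0.0]], [[0.0, 5.0, 0.0]])
ignored = fixmatch_unlabeled_loss([[0.1, 0.0, 0.0]], [[0.0, 5.0, 0.0]])
```

The loss is small when the two views agree, large when they disagree, and exactly zero when the weak-view prediction is not confident enough to be trusted as a label.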

2605.17620 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA：一种用于血管生成和动脉瘤编辑的模块化工具包

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

Affiliations: Visual Computing and Artificial Intelligence, Kiel University, Germany; Institute for Medical Informatics and Statistics, Kiel University, Germany; Clinic for Neuroradiology, Medical Faculty, Magdeburg University, Germany; Department of Radiology and Neuroradiology, University Hospital Schleswig-Holstein, Germany; Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poland

AI summary: This paper presents SynVA, a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. It combines new flow-matching-based methods with learning-based approaches to generate realistic vessel geometries and anatomically plausible aneurysms, and releases a large labeled dataset of 50,000 meshes to support medical image analysis.

AI中文摘要

颅内动脉瘤（IAs）以不可预测的生长和破裂风险为特征，是导致中风的主要原因，可能引发致命性出血，具有高死亡率和长期残疾。随着人口老龄化，脑血管疾病的发病率和整体负担预计会增加，凸显了需要可扩展的方法来分析复杂的医疗数据并提高对这些疾病的群体层面理解的必要性。尽管数字孪生和深度学习为提高诊断、预后和治疗提供了有希望的途径，但其效果受到大规模高质量医疗数据和相应标签稀缺的限制。我们提出了SynVA，一种用于血管网格生成和解剖学一致动脉瘤合成的模块化工具包。SynVA结合了基于流匹配的新型方法生成健康血管网格与基于学习的方法生成解剖条件下的动脉瘤网格——动脉瘤是从已有的血管几何结构计算而来的，而不是孤立生成。此外，我们引入了基于生理学原理和统计先验的SynVA过程模型，用于血管和动脉瘤合成，从而能够生成大规模数据集（例如用于训练基于网格的生成模型）。为此，我们发布了包含50,000个完全标注网格样本的数据集，用于各种下游视觉任务，如语义分割。广泛的定量和定性评估证明了SynVA能够生成逼真的血管几何和解剖学合理的动脉瘤。具体而言，我们的实验表明，某些方法生成的动脉瘤形状更符合专家人类感知，而其他方法在定量相似性度量上与真实动脉瘤的重建表现更优。

英文摘要

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

2605.17608 2026-05-19 cs.CE cs.AI 版本更新

Bayesian-Monte Carlo Schedule Updating for Construction Digital Twins: A Probabilistic Framework for Dynamic Project Forecasting

基于贝叶斯-蒙特卡罗的施工数字孪生调度更新：一种用于动态项目预测的概率框架

Atena Khoshkonesh, Mohsen Mohammadagha, Vinayak Kaushal, Navid Ebrahimi

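Affiliations: Department of Civil Engineering, The University of Texas at Arlington; The University of Texas at Arlington

AI summary: This paper presents a Bayesian-Monte Carlo probabilistic schedule updating framework for construction digital twin environments that integrates stochastic activity-duration modeling, Bayesian recursive updating, Monte Carlo simulation, and uncertainty propagation for dynamic project forecasting.

Comments: 22 pages, 3 figures, 5 tables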

AI中文摘要

施工项目经常由于劳动力生产率、材料供应、天气条件和项目协调的不确定性而出现进度延误和预测不确定性。传统的确定性调度方法如关键路径法（CPM）假设活动持续时间固定，因此无法充分表示动态项目不确定性。本文提出了一种适用于施工数字孪生环境的贝叶斯-蒙特卡罗概率调度更新框架。所提出的方法整合了随机活动持续时间建模、贝叶斯递归更新、蒙特卡罗模拟和不确定性传播，以统一的计算框架实现自适应的进度预测。活动持续时间使用对数正态概率分布进行建模，并通过贝叶斯推断不断更新，随着新的项目观测数据的出现。蒙特卡罗模拟用于传播更新的不确定性，通过项目网络生成概率完成时间预测、延误风险估计和活动关键性度量。使用PSPLIB基准项目网络的仿真实验表明，与确定性CPM和静态概率调度方法相比，所提出的框架在预测准确性和不确定性表示方面有所改进。该框架还通过整合BIM报告、无人机观测、物联网 telemetry、生产力日志和现场监控数据，支持自适应项目预测。

英文摘要

Construction projects frequently experience schedule delays and forecasting uncertainty due to variability in labor productivity, material availability, weather conditions, and project coordination. Conventional deterministic scheduling methods such as the Critical Path Method (CPM) assume fixed activity durations and therefore cannot adequately represent dynamic project uncertainty. This study presents a Bayesian-Monte Carlo probabilistic schedule updating framework for construction digital twin environments. The proposed methodology integrates stochastic activity-duration modeling, Bayesian recursive updating, Monte Carlo simulation, and uncertainty propagation within a unified computational framework for adaptive schedule forecasting. Activity durations are modeled using lognormal probability distributions and continuously updated through Bayesian inference as new project observations become available. Monte Carlo simulation is then used to propagate updated uncertainty throughout project networks and generate probabilistic completion-time forecasts, delay-risk estimates, and activity criticality measures. Simulation experiments using PSPLIB benchmark project networks demonstrate that the proposed framework improves forecasting accuracy and uncertainty representation compared with deterministic CPM and static probabilistic scheduling approaches. The framework further supports adaptive project forecasting through integration of BIM reports, drone observations, IoT telemetry, productivity logs, and site monitoring data.
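
The update-then-simulate loop described in the abstract can be sketched with a conjugate normal update on log-durations followed by Monte Carlo sampling through a small activity network. This is a sketch under stated assumptions, not the paper's model: the log-duration noise variance is treated as known, the three-activity network and all numbers are hypothetical, and posterior uncertainty about the mean is ignored when sampling.

```python
import math
import random


def update_log_mean(prior_mean, prior_var, observed_durations, noise_var):
    """Conjugate normal update for the mean of log-duration (known noise variance)."""
    logs = [math.log(d) for d in observed_durations]
    post_var = 1.0 / (1.0 / prior_var + len(logs) / noise_var)
    post_mean = post_var * (prior_mean / prior_var + sum(logs) / noise_var)
    return post_mean, post_var


def simulate_completion(activities, predecessors, runs=2000, seed=0):
    """Sample project completion times.

    activities maps name -> (mu, sigma) of the log-duration and must be listed
    in topological order; predecessors maps name -> list of prerequisite names.
    """
    rng = random.Random(seed)
    totals = []
    for _ in range(runs):
        finish = {}
        for name, (mu, sigma) in activities.items():
            start = max((finish[p] for p in predecessors.get(name, [])), default=0.0)
            finish[name] = start + rng.lognormvariate(mu, sigma)
        totals.append(max(finish.values()))
    return totals


# Hypothetical data: the prior expects about 10 days, site reports show 14 to 16 days.
post_mean, post_var = update_log_mean(math.log(10), 0.25, [14, 15, 16], noise_var=0.04)
activities = {"A": (post_mean, 0.2), "B": (math.log(8), 0.2), "C": (math.log(5), 0.2)}
totals = simulate_completion(activities, {"C": ["A", "B"]})
delay_risk = sum(t > 22 for t in totals) / len(totals)  # P(completion > 22 days)
```

The posterior mean moves most of the way toward the observed durations because three low-noise observations outweigh the diffuse prior, and the sampled completion times turn that update into a delay probability rather than a single deterministic date.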

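2605.17580 2026-05-19 cs.AI (updated version)

ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

Zhikang Chen, Yue Wang, Sen Cui, Yu Zhang, Changshui Zhang, Tianling Ren, Tingting Zhu

Affiliations: University of Oxford; Tsinghua University; Southern University of Science and Technology

AI summary: This paper proposes an ECG world model for action-conditioned prediction of cardiac electrophysiology. It integrates physiological ordinary differential equation (ODE) priors to make post-intervention ECG trajectories more physiologically plausible, and introduces an uncertainty-aware evaluation strategy for more reliable assessment of candidate interventions.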

Abstract

Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.

2605.17575 2026-05-19 cs.LG cs.AI (updated version)

UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts

Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Yong Cui

Affiliations: Institute for Network Sciences and Cyberspace, Tsinghua University; Department of Computer Science and Technology, Tsinghua University

AI summary: This paper proposes UniAlign, a model-agnostic framework that improves the robustness of deep learning network traffic classification models under distribution shifts through domain alignment fine-tuning and stable model ensembling. Experiments show that it outperforms existing baselines in both accuracy and F1 score.

Abstract

Network traffic classification (NTC) models often suffer severe performance degradation when deployed in real-world environments due to distribution shifts caused by changing network conditions. Existing robustness-enhancing approaches are commonly coupled to specific model architectures or data settings, fail to generalize to state-of-the-art raw-byte-based NTC models, or incur significant training overhead. In this paper, we propose UniAlign, a novel model-agnostic framework that improves the robustness of deep learning-based NTC models under distribution shifts. UniAlign combines domain alignment fine-tuning, which encourages the learning of domain-invariant traffic representations across heterogeneous network conditions, with stable model ensembling, which enhances inference robustness by aggregating checkpoints within a flat loss region. The framework can be seamlessly integrated into existing supervised NTC models without requiring specific feature modalities or introducing non-constant additional training costs. We evaluate UniAlign on three public datasets covering diverse distribution shifts, including encryption schemes, data collection devices, and attack behaviors. Experimental results on two representative NTC models demonstrate that, compared with standard training, UniAlign improves average classification accuracy by 2.51% and average F1 score by 2.71%, outperforming the strongest baseline by 1.45% in accuracy and 1.69% in F1 score, while requiring only 12.4% to 53.9% of the training time of all NTC-specific baselines.

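2605.17565 2026-05-19 cs.AI cs.CL (updated version)

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

Ethan Tang

Affiliations: School of Computing and Augmented Intelligence

AI summary: This paper studies whether chess-trained language models generalize or memorize. Its tests indicate that their high performance stems mainly from pattern matching, and it shows that the LLM-Modulo framework improves chess puzzle-solving performance, suggesting that a general LLM paired with an external verifier is more flexible than training directly on synthetic data.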

Comments 14 pages, 2 figures, 4 tables, 3 equations

Abstract

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We introduce KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, which outperforms the 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and the 4B-parameter C1-4B on a 20-theme puzzle benchmark. We examine several claims made in the existing literature about chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best-move accuracy from 1.2% to 21.2% and its move-generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, gains comparable to those ChessGPT achieves by fine-tuning on chess-specific web corpora, at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open-source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.
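
The verifier-in-the-loop idea can be sketched as a generic propose-and-check loop. The abstract does not describe how LLM-Modulo is wired up, so everything below is an assumption made for illustration: the function names, the feedback format, the scripted stand-in for the language model, and the hand-written move sets that replace a real chess rules engine.

```python
def verify_in_loop(propose, verify, max_rounds=5):
    """Ask a proposer for candidates until the verifier accepts one.

    propose(feedback) -> candidate, where feedback lists earlier critiques.
    verify(candidate) -> (ok, critique).
    Returns the accepted candidate, or None if the budget runs out.
    """
    feedback = []
    for _ in range(max_rounds):
        candidate = propose(feedback)
        ok, critique = verify(candidate)
        if ok:
            return candidate
        feedback.append(f"{candidate}: {critique}")
    return None

# Toy stand-ins (hypothetical): a scripted "model" and a hand-written position.
LEGAL_MOVES = {"Nf3", "Qh5", "Qxf7#"}
MATING_MOVES = {"Qxf7#"}
SCRIPT = ["Ke9", "Nf3", "Qxf7#"]

def scripted_proposer(feedback):
    # A real system would prompt an LLM with the position plus the feedback.
    return SCRIPT[min(len(feedback), len(SCRIPT) - 1)]

def toy_verifier(move):
    # A real system would query a chess rules engine here.
    if move not in LEGAL_MOVES:
        return False, "illegal move"
    if move not in MATING_MOVES:
        return False, "legal, but does not deliver mate"
    return True, ""

if __name__ == "__main__":
    print(verify_in_loop(scripted_proposer, toy_verifier))
```

Because the verifier rejects illegal output before it is returned, move validity is bounded by the verifier and the retry budget rather than by the model alone, which is one way to read the reported jump in validity.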

2605.17562 2026-05-19 cs.LG cs.AI cs.HC (updated version)

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

Affiliations: Vrije Universiteit Amsterdam; Imperial College London

AI summary: This paper studies the robustness, interpretability, and expressiveness of EEG foundation models by benchmarking six EEG-FMs and a baseline deep learning model on eight datasets. It reports how the models behave under different perturbations and characterizes their interpretability and expressiveness.

Abstract

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.
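
The difference between zero-padding and removing dropped channels, which the robustness analysis turns on, can be made concrete with a small sketch. This is an illustrative helper rather than the paper's evaluation code; the function name, the list-of-lists layout (channels by samples), and the parameters are assumptions.

```python
import random

def drop_channels(eeg, fraction, rng, mode="zero"):
    """Randomly drop a fraction of channels from a (channels x samples) recording.

    mode="zero" keeps the channel count and overwrites dropped channels with zeros;
    mode="remove" deletes the dropped channels from the input entirely.
    """
    n_channels = len(eeg)
    n_drop = int(fraction * n_channels)
    dropped = set(rng.sample(range(n_channels), n_drop))
    if mode == "zero":
        return [[0.0] * len(ch) if i in dropped else list(ch) for i, ch in enumerate(eeg)]
    if mode == "remove":
        return [list(ch) for i, ch in enumerate(eeg) if i not in dropped]
    raise ValueError(f"unknown mode: {mode}")

if __name__ == "__main__":
    recording = [[1.0] * 4 for _ in range(8)]  # 8 channels, 4 samples each
    zeroed = drop_channels(recording, 0.25, random.Random(0), mode="zero")
    removed = drop_channels(recording, 0.25, random.Random(0), mode="remove")
    print(len(zeroed), len(removed))
```

Zero-padding presents the model with flat signals it must process, whereas removal changes the input's channel set, so the two perturbations stress different parts of a model.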

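2605.17559 2026-05-19 stat.ME cs.AI q-bio.QM stat.ML (updated version)

Controlling False Discovery in Arbitrarily Structured Hypothesis Spaces via Reproducing Kernels

Binyamin Perets, Shie Mannor

Affiliations: Technion – Israel Institute of Technology; NVIDIA

AI summary: This paper proposes a reproducing-kernel framework for controlling the false discovery rate in arbitrarily structured hypothesis spaces. By recasting structured FDR control as a regularized learning problem, it handles continuous domains, graphs, and hierarchies in a unified way and increases discovery power.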

Comments 9 pages

Abstract

Large-scale hypothesis testing is central to modern science, where controlling the False Discovery Rate (FDR) has become the standard approach to managing false positives across many simultaneous tests. Hypotheses rarely exist in isolation; they often exhibit structure through proximity, connectivity, or hierarchy. This structure represents both a challenge and an opportunity: while classical methods treat these dependencies as obstacles requiring conservative correction, leveraging them can substantially increase discovery power. Here, we reframe structured FDR control as a regularized learning problem. By optimizing within a suitable Reproducing Kernel Hilbert Space (RKHS), we introduce a framework that unifies continuous domains, graphs, and hierarchies under a single algorithm through kernel choice alone. This formulation enables smooth solutions in place of the piecewise-constant fits of prior methods, principled likelihood-based hyperparameter selection rather than heuristic tuning, and inference at unobserved locations which in turn supports sample-efficient experimental design. Building on this estimator, we provide two decision rules which we prove to control the FDR. We validate our method on two sources: spatial locations derived from high-dimensional real-world datasets, and a differential gene expression task utilizing protein-protein interaction graphs.
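
For context, the classical structure-agnostic baseline for FDR control is the Benjamini-Hochberg step-up procedure, sketched below. This is not the kernel-based method the abstract proposes; it is the standard procedure that treats every hypothesis identically, shown here so the contrast with structure-aware approaches is concrete. The function name and example p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking which hypotheses are rejected.

    Step-up rule: find the largest rank k with p_(k) <= alpha * k / m,
    then reject the k hypotheses with the smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_max = rank
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected

if __name__ == "__main__":
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
    print(benjamini_hochberg(p, alpha=0.05))
```

The procedure never looks at where a hypothesis sits in space, in a graph, or in a hierarchy, which is the information a structured method can exploit.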

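2605.17556 2026-05-19 cs.RO cs.AI (updated version)

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

Peter Schaldenbrand, Jean Oh

Affiliations: The Robotics Institute, Carnegie Mellon University

AI summary: This paper proposes a visually aligned planning representation for long-horizon robot clay sculpting. By capturing lighting and texture features, it improves the modeling of deformable-material dynamics, and it is evaluated across different deformable materials and end-effectors.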

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

DOI: 10.1109/LRA.2026.3673896

Abstract

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. As a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior work on deformable object manipulation either requires retraining a policy for each goal or relies on dynamics models that represent state as sparse point clouds, which capture important clay features such as texture poorly. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a visually aligned representation that captures lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model performs comparably to the state of the art, with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into the clay with a single end-effector, which proved suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually aligned representation and provide an analysis of why this representation is more challenging to plan in than 3D representations.

2605.17530 2026-05-19 cs.CR cs.AI cs.LG cs.NI (updated version)

Few-Shot Network Intrusion Detection Using Online Triplet Mining

Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Miroslav Bures, Robert Atkinson

Affiliations: Department of Electronics and Electrical Engineering, University of Strathclyde; Faculty of Computer and Information Sciences, Ain Shams University; Faculty of Electrical Engineering, Czech Technical University

AI summary: This paper proposes a triplet network that uses online triplet mining and a KNN classifier for effective few-shot network intrusion detection. It compares different triplet mining algorithms and model design choices and finds the approach competitive when only a few malicious samples are available.

Comments Published in: MDPI Applied Sciences, 2026. Official version: https://doi.org/10.3390/app16104589 Code: https://github.com/jackwilkie/few_shot_nids_triplet_mining

Journal ref Wilkie, J.; Hindy, H.; Tachtatzis, C.; Bures, M.; Atkinson, R. Few-Shot Network Intrusion Detection Using Online Triplet Mining. Appl. Sci. 2026, 16, 4589. https://doi.org/10.3390/app16104589

DOI: 10.3390/app16104589

Abstract

Network intrusion detection systems play a vital role in protecting networks by detecting malicious network traffic, which can then be investigated by a cybersecurity operations centre. State-of-the-art approaches utilise supervised machine learning methods to train a classification model to recognise known cyberattacks; however, these models require a large labelled dataset to train and perform poorly when trained on smaller datasets. In an attempt to address this shortcoming, anomaly detection models learn the distribution of benign traffic and flag non-conforming traffic as malicious. While these methods do not require malicious examples to train, they suffer from high false-positive rates, rendering them impractical. As a result, networks may be particularly vulnerable when there are insufficient labelled instances of a specific attack class to train an effective classifier. This often occurs in newly established networks or when previously unseen types of attacks emerge. To address this challenge, this work proposes the use of a triplet network, utilising online triplet mining and a KNN classifier, which is able to perform few-shot classification, enabling effective intrusion detection after being trained on a limited number of malicious examples. Various online triplet mining algorithms were explored, and model design choices, such as the inference algorithm and optimised distance metrics, were compared and evaluated through a series of ablation studies. The final model was compared against other state-of-the-art approaches in few-shot binary and multiclass classification, where the proposed approach was found to be competitive with existing methods when trained on as few as 10 malicious samples of each class.
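
Online triplet mining selects triplets from within each training batch rather than precomputing them. One widely used strategy is batch-hard mining, sketched below in plain Python. The abstract does not say which mining algorithms were compared, so treating batch-hard as one of them is an assumption, and the function names, margin, and toy embeddings are illustrative.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def batch_hard_triplet_loss(embeddings, labels, margin=1.0):
    """Mean batch-hard triplet loss over all valid anchors in a batch.

    For each anchor, take the farthest same-class sample (hardest positive)
    and the closest other-class sample (hardest negative).
    """
    losses = []
    for i, (anchor, label) in enumerate(zip(embeddings, labels)):
        positives = [euclidean(anchor, e)
                     for j, (e, l) in enumerate(zip(embeddings, labels))
                     if l == label and j != i]
        negatives = [euclidean(anchor, e)
                     for e, l in zip(embeddings, labels) if l != label]
        if not positives or not negatives:
            continue  # anchor has no valid triplet in this batch
        losses.append(max(0.0, max(positives) - min(negatives) + margin))
    return sum(losses) / len(losses) if losses else 0.0

if __name__ == "__main__":
    # Two benign flows (label 0) and two attack flows (label 1), 2-D embeddings.
    emb = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
    print(batch_hard_triplet_loss(emb, [0, 0, 1, 1], margin=1.0))
```

At inference time, a classifier such as KNN can then label a new flow by its nearest labelled neighbours in the learned embedding space, which is why only a handful of examples per attack class may suffice.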

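2605.17528 2026-05-19 cs.LG cs.AI cs.CL (updated version)

CasualSynth: Generating Structurally Sound Synthetic Data

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

Affiliations: Department of Computer Science, University of Oxford; Institute of Logic and Computation, TU Wien

AI summary: This paper proposes the CausalSynth framework, which decouples causal structure generation from semantic realization to produce synthetic data that is both consistent with the underlying causal mechanisms and semantically rich, addressing the inability of LLMs to guarantee causal correctness when generating synthetic data.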

Comments 15 pages

Abstract

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM), a tuple of structural equations defined over a directed acyclic graph (DAG), generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained realizer, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem (the systematic tendency of LLMs to override imposed causal facts with pre-training priors) and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $\alpha=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.
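
The first and third phases can be illustrated with a small sketch: ancestral sampling draws each variable after its parents, and a consistency check compares the values extracted from generated text against the sampled skeleton. The three-variable graph, the probabilities, and the function names below are hypothetical, and the LLM realization step is omitted entirely.

```python
import random

# Hypothetical three-variable SCM: smoking -> cancer -> xray.
ORDER = ["smoking", "cancer", "xray"]  # a topological order of the DAG
PARENTS = {"smoking": [], "cancer": ["smoking"], "xray": ["cancer"]}
# P(variable is True | parent values), keyed by the tuple of parent values.
CPT = {
    "smoking": {(): 0.3},
    "cancer": {(True,): 0.2, (False,): 0.02},
    "xray": {(True,): 0.9, (False,): 0.05},
}

def sample_skeleton(rng):
    """Ancestral sampling: draw each variable given its already-sampled parents."""
    values = {}
    for var in ORDER:
        key = tuple(values[p] for p in PARENTS[var])
        values[var] = rng.random() < CPT[var][key]
    return values

def violations(skeleton, extracted):
    """Variables whose extracted value disagrees with the sampled skeleton."""
    return [v for v in ORDER if extracted.get(v) != skeleton[v]]

if __name__ == "__main__":
    rng = random.Random(0)
    skeleton = sample_skeleton(rng)
    # Pretend an extractor read these facts back out of a generated note.
    extracted = dict(skeleton, xray=not skeleton["xray"])
    print(skeleton, violations(skeleton, extracted))
```

In a closed-loop setup, the list returned by `violations` is the kind of targeted signal that could be fed back to the generator instead of discarding the whole sample.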

2605.17526 2026-05-19 cs.SE cs.AI (updated version)

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

Qingnan Ren, Shun Zou, Shiting Huang, Ziao Zhang, Kou Shi, Zhen Fang, Yiming Zhao, Yu Zeng, Qisheng Su, Lin Chen, Yong Wang, Zehui Chen, Xiangxiang Chu, Feng Zhao

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group

AI summary: This paper proposes SaaSBench, the first benchmark for enterprise SaaS engineering, designed to explore the boundaries of AI agents in complex systems. With 30 tasks and 5,370 validation nodes covering 8 programming languages, 6 databases, and 13 frameworks, it shows that the main bottleneck for current state-of-the-art agents lies in configuring and integrating multi-component systems.

Abstract

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at https://github.com/ShadeCloak/SaaSbench.

2605.17508 2026-05-19 cs.LG cs.AI (updated version)

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

Yuhan Xie, Chen Lyu, Jingrong Huang

Affiliations: MoE Key Laboratory of Interdisciplinary Research of Computation; Shanghai University of Finance

AI summary: This paper proposes BESplit, a framework that uses evidential aggregation and bias-compensated collaboration to address biased optimization and unstable convergence in split federated learning under non-IID data, improving model accuracy and efficiency.

Abstract

Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

2605.17504 2026-05-19 cs.CV cs.AI (updated version)

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu

Affiliations: School of Mathematics and Statistics; Ministry of Education Key Lab of Intelligent Networks and Network Security; Shanghai Innovation Institute; Fudan University

AI summary: This paper proposes a distribution-based approach to visual mechanistic interpretability that balances interpretability and model faithfulness through a KL-minimal optimization problem, implements it with energy-guided diffusion posterior sampling, and validates it on the DINOv3 model.

Abstract

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.

2605.17503 2026-05-19 cs.AI cs.CL cs.HC (updated version)

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady

Affiliations: IAS-LAB, Department of Information Engineering, University of Padova; Padova Neuroscience Center; Department of Health Technology, Technical University of Denmark

AI summary: This paper proposes a retrieval-augmented generation (RAG) approach to EEG-to-text decoding that combines an EEG encoder, a vector retrieval stage, and a large language model to improve sentence-level decoding accuracy, and validates it on the ZuCo dataset.

Comments 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics

Abstract

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. Across nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 ± 0.022 compared to 0.139 ± 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.
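
The vector retrieval stage can be sketched as a cosine-similarity search over sentence embeddings. This is a minimal illustration under stated assumptions, not the paper's pipeline: the three-dimensional vectors stand in for real sentence embeddings, the query stands in for the output of the EEG encoder, and the function names and example sentences are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, corpus, k=2):
    """Return the k sentences whose embeddings are most similar to the query."""
    ranked = sorted(corpus.items(), key=lambda kv: cosine(query, kv[1]), reverse=True)
    return [sentence for sentence, _ in ranked[:k]]

if __name__ == "__main__":
    # Toy sentence store: sentence -> embedding (values are made up).
    corpus = {
        "The film was moving.": [0.9, 0.1, 0.0],
        "He was born in 1950.": [0.0, 1.0, 0.2],
        "The plot felt slow.": [0.7, 0.0, 0.6],
    }
    eeg_embedding = [1.0, 0.0, 0.1]  # stand-in for the EEG encoder's output
    print(retrieve(eeg_embedding, corpus, k=2))
```

In a full pipeline, the retrieved sentences would be passed to a language model as context for producing the final decoded sentence.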

2605.17493 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph (updated version)

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

Minjong Cheon

Affiliations: Department of Computer Science and Engineering

AI summary: This paper proposes KAN-SAE, a sparse autoencoder based on Kolmogorov-Arnold Networks that uses nonlinear activation functions to uncover climate features in weather prediction models, yielding 72% more alive features and 20% lower feature redundancy than a linear baseline.

Abstract

Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.

2605.17461 2026-05-19 cs.HC cs.AI cs.CY 版本更新

Artificial Intelligence can Recognize Whether a Job Applicant is Selling and/or Lying According to Facial Expressions and Head Movements Much More Correctly Than Human Interviewers

人工智能通过面部表情和头部动作识别求职者是否在自我推销和/或撒谎，准确度远高于人类面试官

Hung-Yue Suen, Kuo-En Hung, Che-Wei Liu, Yu-Sheng Su, Han-Chih Fan

AI总结本文研究了人工智能通过面部表情和头部动作识别求职者是否在撒谎的准确性，提出了一种基于深度学习的计算机视觉模型，能够提取求职者在视频中的面部表情和头部动作的时序模式，以识别自报的诚实和欺骗性印象管理策略，并通过实验验证了该模型在识别诚实和欺骗性印象管理方面的有效性，比人类面试官更准确。

Comments 11 pages, 5 figures

Journal ref IEEE Transactions on Computational Social Systems, 11(5), 5949-5960, 2024

DOI: 10.1109/TCSS.2024.3376732

AI中文摘要

面试者的诚实与欺骗性回应能否通过视频中的面部表情信号检测出来，一直存在争议，仍需进一步研究。我们开发了基于计算机视觉的深度学习模型，提取求职者面部表情和头部动作的时序模式，从真实异步视频面试的视频帧中识别自报的诚实型和欺骗型印象管理（IM）策略。N=121名求职者在回答五个结构化行为面试问题时，每人录制了一段12至15分钟的视频。每位求职者完成了一份问卷，就四项IM指标自评其可信度。此外，我们还进行了一项现场实验，比较我们的建模方法与人类面试官在自报IM上的同时效度。人类面试官预测这些IM指标的表现取自另一组30个视频，由N=30名人类面试官各评估三段录像。我们的模型分别解释了诚实型和欺骗型IM 91%和84%的方差，并且与自报IM分数的相关性强于人类面试官。

英文摘要

Whether an interviewee's honest and deceptive responses can be detected by facial expression signals in videos has been debated and requires further research. We developed deep learning models enabled by computer vision to extract temporal patterns of job applicants' facial expressions and head movements to identify self-reported honest and deceptive impression management (IM) tactics from video frames in real asynchronous video interviews. A 12- to 15-minute video was recorded for each of N=121 job applicants as they answered five structured behavioral interview questions. Each applicant completed a survey to self-evaluate their trustworthiness on four IM measures. Additionally, a field experiment was conducted to compare the concurrent validity associated with self-reported IMs between our modeling approach and human interviewers. Human interviewers' performance in predicting these IM measures from another subset of 30 videos was obtained by having N=30 human interviewers evaluate three recordings. Our models explained 91% and 84% of the variance in honest and deceptive IMs, respectively, and showed stronger correlations with self-reported IM scores than human interviewers.

2605.17456 2026-05-19 cs.CV cs.AI 版本更新

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

GCE-MIL: 全切片成像多实例学习中忠实且可恢复的证据

Xiangyu Li, Ran Su

发表机构 * College of Intelligence and Computing（智能与计算学院）

AI总结该研究提出GCE-MIL方法，通过直接优化S/N/R标准提升全切片成像多实例学习的预测性能和证据质量，改进了宏F1分数和C-index，并减少了连续-离散差距。

Comments 10 pages, 17 figures, 24 tables

AI中文摘要

多实例学习（MIL）是全切片图像（WSI）分类和生存预测的标准方法，其中基于注意力的模型将图像块特征聚合为切片级预测。这些模型将注意力权重视为预测的证据，但注意力是为分类而优化的，而非用于识别真正支持诊断的图像块。这种混淆导致三种失败：所选图像块不充分（仅保留它们会使宏F1分数下降0.078）、非必要（移除它们几乎不影响预测）以及不可恢复（连续的注意力分数与推理中使用的离散图像块子集不一致）。核心前提是证据质量应通过显式标准（充分性、必要性和可恢复性，S/N/R）直接优化，而不是作为分类的副产品继承。GCE-MIL是一种与骨干网络无关的封装器，通过三种注入模式和三种证据组件实现：将选择与领域特定概念对齐的grounding机制，作为干预式证据搜索可微分代理的noisy-OR覆盖，以及通过边际引导修复将连续选择器转换为离散子集的阈值加修复恢复。在9个骨干网络和9个数据集（81种配置）上，GCE-MIL将平均宏F1分数提高了0.024，C-index提高了0.014，将连续-离散差距减少了4-7，将补集退化增加了2-4。在离散恢复后可选地进行图块预过滤，推理速度最高可提升5倍，同时保留0.989的完整包效用。

英文摘要

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models aggregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classification, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria (Sufficiency, Necessity, and Recoverability, S/N/R) rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5× faster while retaining 0.989 full-bag utility.
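
To make the "noisy-OR coverage" term concrete, here is a minimal sketch of the textbook noisy-OR aggregate together with a plain thresholding step. The abstract does not give GCE-MIL's exact formulation, so both functions (and their names) are illustrative assumptions, and the paper's marginal-guided repair step is not reproduced.

```python
import numpy as np

def noisy_or_coverage(selection_probs):
    """Probability that at least one patch 'fires', assuming independent
    per-patch probabilities (textbook noisy-OR; smooth in each input)."""
    p = np.clip(np.asarray(selection_probs, dtype=float), 0.0, 1.0)
    return 1.0 - np.prod(1.0 - p)

def threshold_select(selection_probs, tau=0.5):
    """Hypothetical discretisation: keep indices of patches scoring above tau."""
    return np.flatnonzero(np.asarray(selection_probs, dtype=float) > tau)

probs = [0.2, 0.9, 0.6]
coverage = noisy_or_coverage(probs)   # 1 - 0.8 * 0.1 * 0.4
selected = threshold_select(probs)    # discrete subset used at inference
```

Because the noisy-OR is differentiable in every probability, it can stand in for a discrete search over subsets during training, which is the role the abstract describes.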

2605.17454 2026-05-19 cs.AI 版本更新

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

多方多目标优化作为共识搜索：跨方重组的运行时间分析

Xiaolei Fang, Peilan Xu, Wenjian Luo

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology（南京信息工程大学人工智能学院）； Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology（哈尔滨工业大学计算机科学与技术学院网络空间安全研究院，广东省新型安全智能技术重点实验室）

AI总结本文研究了多方多目标优化问题中的跨方重组，通过分析 MP-JCG 和 BPBOMST 问题，证明了基于收益引导突变的基线方法在跨越间隙时存在瓶颈，而 CPR-NSGA-II 变体能够在 O(n log n) 的期望评估次数内发现共同帕累托最优解，并推导了基于边并集重组和均匀修复的实例参数化期望运行时间界。

Comments 40 pages, 7 figures

AI中文摘要

多方多目标优化问题（MPMOPs）需要自主决策者之间达成共识，因此不同于扁平化的高维多目标公式。现有多目标进化算法的运行时间理论大多针对单方帕累托前沿近似，无法直接解释MPMOPs中的共同解搜索。我们在两种代表性场景中研究跨方重组。在MP-JCG这一具有显式间隙区域的伪布尔基准上，我们证明基于收益引导突变的基线方法面临跨越间隙的瓶颈，需要Θ(n²)次期望适应度评估。相比之下，一种用于分析的CPR-NSGA-II变体通过直接组装分布在各方种群中的互补前缀和后缀模板，能够在O(n log n)次期望评估内发现两个共同帕累托最优解。与扁平化的四目标公式F-JCG相比，我们的全前沿覆盖分析展示了扁平化带来的额外覆盖负担。对于BPBOMST，即多方多目标最小生成树问题的双方、每方两目标特例，我们开发了分层支持覆盖分析。对于每个共同帕累托目标向量，对称平均投影诱导出一个辅助双目标MST实例，合适的支持代表可以产生一个2λ-共同近似覆盖，其中λ∈[1,2]。我们进一步推导了一个代表池CPR-NSGA-II变体的实例参数化期望运行时间界，该变体使用边并集重组和均匀修复。这个界分离了局部辅助前沿填充、跨方重组捷径和边并集修复歧义的影响。

英文摘要

Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring $\Theta(n^2)$ expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in $O(n\log n)$ expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a $2\lambda$-common approximation cover with $\lambda\in[1,2]$. We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.
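
The notion of a "common Pareto-optimal solution" can be shown on a toy instance. The sketch below assumes the usual reading, a solution that is Pareto-optimal under every party's objectives at once, and uses brute-force enumeration rather than the CPR-NSGA-II algorithm analyzed in the paper; the toy objectives are made up for illustration.

```python
def dominates(a, b):
    """True if objective vector a Pareto-dominates b (minimisation)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(solutions, objectives):
    """Indices of solutions that are non-dominated under one party's objectives."""
    values = [objectives(s) for s in solutions]
    return {i for i, v in enumerate(values)
            if not any(dominates(w, v) for w in values)}

def common_pareto(solutions, party_objectives):
    """Indices of solutions that are Pareto-optimal for every party simultaneously."""
    sets = [pareto_set(solutions, f) for f in party_objectives]
    return sorted(set.intersection(*sets))

# Toy instance: candidate solutions 0..4, two parties with two objectives each.
solutions = list(range(5))
party_a = lambda x: (x, (x - 2) ** 2)        # Pareto set for party A: {0, 1, 2}
party_b = lambda x: ((x - 2) ** 2, 4 - x)    # Pareto set for party B: {2, 3, 4}
common = common_pareto(solutions, [party_a, party_b])
```

Only solution 2 survives the intersection. Flattening the same instance into a single four-objective problem would instead keep every solution that is non-dominated in the joint objective space, which is the extra coverage burden the abstract refers to.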

2605.17450 2026-05-19 cs.SE cs.AI cs.CL cs.CR 版本更新

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

ContraFix：通过差分运行时证据和技能重用进行代理漏洞修复

Simiao Liu, Fang Liu, Li Zhang, Yang Liu, Yinghao Zhu

发表机构 * Beihang University（北京航空航天大学）； The University of Hong Kong（香港大学）

AI总结本文提出ContraFix框架，通过差分运行时证据和可重用的修复技能，解决大型语言模型代理在自动漏洞修复中的语义误解问题，实现了在SEC-Bench和PatchEval上的高准确率。

AI中文摘要

大型语言模型（LLM）代理越来越多地用于自动漏洞修复（AVR），其中仓库级推理使它们能够检查上下文并生成源代码补丁。然而，最近的实证结果表明，这些代理仍然难以处理现实世界中的漏洞。其主要失败模式是语义误解：选择了与根本原因不匹配的修复方向。我们识别出造成这一差距的两个原因。其一，现有代理通常仅依据失败的执行进行推理。崩溃报告可以指出程序失败的位置，但无法揭示故障点附近的众多候选中，究竟是哪个变量或状态转换将崩溃行为与安全执行区分开来。因此，代理往往生成针对症状的补丁，而非针对成因的修复。其二，为某个漏洞收集的证据很少被保留，后续仓库中的类似案例必须从头诊断。我们提出了ContraFix，一种将差分运行时证据与可重用修复技能相结合的代理式AVR框架。其Mutator构造跨越失败边界的PoC变体；其Analyzer在故障区域周围插入状态探针，并将崩溃与非崩溃执行之间的差异总结为修复规范；其Patcher将该规范转换为经过验证的源代码补丁。每次成功的修复都会更新一个包含修复规范和变异策略的双轨技能库，并通过三层策略为未来实例检索。在SEC-Bench（C/C++，200个实例）和PatchEval（Go、Python、JavaScript，225个实例）上，搭载GPT-5-mini的ContraFix分别解决了84.0%和73.8%的任务，在两个基准上均达到最先进水平，而成本不到最强可比基线的三分之一。

英文摘要

Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.

2605.17449 2026-05-19 cs.CV cs.AI 版本更新

Spatial Blindness in Whole-Slide Multiple Instance Learning

全切片多实例学习中的空间盲区

Xiangyu Li, Ran Su

发表机构 * College of Intelligence and Computing（智能与计算学院）

AI总结本文研究了全切片多实例学习中由于空间信息处理不足导致的分类误差问题，提出ResTopoMIL模型通过引入不变原型直方图和坐标洗牌约束来提升模型对空间关系的敏感性，从而在多个公开数据集上提升了分类和生存预测性能。

Comments 28 pages, 8 figures, 16 tables

AI中文摘要

当图网络、Transformer或状态空间模块被置于图像块嵌入之上时，全切片MIL模型常被称为上下文感知模型。我们证明这一标签可能具有误导性。在组织结构属于诊断信号一部分的病理任务中，若干强MIL基线在图像块坐标被随机置换后，切片级AUC几乎不变。它们的预测准确，但主要依赖成分构成而非空间排布。我们将这种失败模式称为空间盲区。我们的解释基于优化：在切片级监督下，密集的外观统计信息被较早学到，留给稀疏空间关系的梯度很弱。ResTopoMIL的做法是先拟合一个置换不变的原型直方图，然后将其冻结，同时让一个轻量级图分支在坐标打乱约束下学习残差。该架构刻意保持简单；干预点在于空间分支的训练方式。在9个公开WSI基准上，ResTopoMIL以1.15M参数提升了分类和生存预测性能，恢复了对坐标扰动的敏感性，并在CAMELYON-16上提供了更强的定位证据。

英文摘要

Whole-slide MIL models are often called context-aware once graphs, Transformers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide-level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI benchmarks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger localization evidence on CAMELYON-16.
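
The coordinate-permutation probe described above can be sketched with toy data. The two scoring functions below are hypothetical stand-ins (a pooled "compositional" score and a nearest-neighbour "spatial" score), not the paper's models; the point is only that a permutation-invariant readout cannot change when patch positions are shuffled.

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(50, 8))          # toy patch embeddings
coords = rng.uniform(0, 100, size=(50, 2))   # toy patch (x, y) positions

def bag_score(features, coords):
    """Permutation-invariant pooling: ignores coordinates entirely."""
    return float(features.mean(axis=0).sum())

def neighbour_score(features, coords):
    """Toy spatial statistic: mean similarity between each patch and its nearest neighbour."""
    d = np.linalg.norm(coords[:, None] - coords[None], axis=-1)
    np.fill_diagonal(d, np.inf)              # a patch is not its own neighbour
    nn = d.argmin(axis=1)
    return float((features * features[nn]).sum(axis=1).mean())

# Reassign the same set of positions to different patches.
shuffled = coords[rng.permutation(len(coords))]
bag_gap = abs(bag_score(features, coords) - bag_score(features, shuffled))
spatial_gap = abs(neighbour_score(features, coords) - neighbour_score(features, shuffled))
```

Here `bag_gap` is exactly zero, while `spatial_gap` will typically be nonzero. A model whose slide-level output behaves like `bag_score` under this probe is "spatially blind" in the abstract's sense, however much spatial machinery it contains.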

2605.17444 2026-05-19 cs.SE cs.AI cs.CL 版本更新

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair：用于代理式仓库级漏洞修复的分层记忆

Simiao Liu, Li Zhang, Fang Liu, Xiaoli Lian, Yang Liu, Yinghao Zhu

发表机构 * Beihang University（北京航空航天大学）； The University of Hong Kong（香港大学）

AI总结本研究提出MemRepair，一种增强记忆的代理框架，通过分层记忆和动态反馈循环提高漏洞修复的可靠性，实现了在多个仓库级别的漏洞修复基准上的高修复率。

AI中文摘要

现代软件生态系统面临越来越多披露的漏洞，增加了需要在仓库规模上可靠运行的自动化修复技术的需求。尽管基于大语言模型（LLM）的代理最近在自动化漏洞修复（AVR）中显示出潜力，但大多数现有系统仍然将修复视为在当前可见代码上下文中的一次生成步骤。因此，它们缺乏重用先前修复或从失败验证尝试中学习的持久机制，这限制了它们在复杂、多文件修复任务上的有效性。我们提出了MemRepair，一种增强记忆的代理框架，将漏洞修复视为一个迭代、经验驱动的过程。MemRepair结合了三个互补的记忆层，即History-Fix、Security-Pattern和Refinement-Trajectory记忆，并通过动态反馈驱动的细化循环。这种设计使代理能够检索仓库特定的修复惯例，应用可重用的安全防御，并利用先前的“失败到成功”轨迹来根据运行时证据修正语义无效的补丁。我们评估了MemRepair在三个具有代表性的仓库级别漏洞修复基准上的表现：SEC-Bench、PatchEval（Python、Go、JavaScript）以及Multi-SWE-bench的C++子集。MemRepair在三个基准上分别实现了58.0%、58.2%和30.58%的修复率，优于强大的通用代理如OpenHands和SWE-agent，以及专用的AVR工具InfCode-C++，同时保持竞争性的修复成本。这些结果表明，持久的、分层的修复记忆可以显著提高跨多种语言和仓库设置的代理漏洞修复的可靠性。

英文摘要

Modern software ecosystems face a rapidly growing number of disclosed vulnerabilities, increasing the need for automated repair techniques that can operate reliably at repository scale. Although Large Language Model (LLM)-based agents have recently shown promise for automated vulnerability repair (AVR), most existing systems still treat repair as a single generation step over the currently visible code context. As a result, they lack a persistent mechanism for reusing prior fixes or learning from failed validation attempts, which limits their effectiveness on complex, multi-file repair tasks. We present MemRepair, a memory-augmented agentic framework that formulates vulnerability repair as an iterative, experience-driven process. MemRepair combines three complementary memory layers, i.e., History-Fix, Security-Pattern, and Refinement-Trajectory memories, with a dynamic feedback-driven refinement loop. This design allows the agent to retrieve repository-specific repair conventions, apply reusable security defenses, and exploit prior "failure-to-success" trajectories to revise semantically invalid patches based on runtime evidence. We evaluate MemRepair on three representative repository-level vulnerability repair benchmarks: SEC-Bench, PatchEval (Python, Go, JavaScript), and the C++ subset of Multi-SWE-bench. MemRepair achieves state-of-the-art resolution rates of 58.0%, 58.2%, and 30.58%, respectively, outperforming strong general-purpose agents such as OpenHands and SWE-agent, as well as the specialized AVR tool InfCode-C++, while maintaining competitive repair cost. These results show that persistent, hierarchical repair memory can substantially improve the reliability of agentic vulnerability repair across diverse languages and repository settings.

2605.17442 2026-05-19 cs.CL cs.AI cs.IR 版本更新

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

超越目录计数：低资源多语言NLP中的数据集可见性不对称

Zhiyin Tan, Changxu Duan

发表机构 * L3S Research Center, Leibniz University Hannover（莱布尼茨汉诺威大学L3S研究中心）； Technische Universität Darmstadt（达姆施塔特技术大学）

AI总结本研究探讨了多语言NLP中数据集可见性不对称问题，通过结合目录基准和文献证据，提出了资源密度指数（RDI）来衡量语言的数据集可见性，揭示了大量语言在目录记录中数据贫乏但文献中存在明显数据集活动的现象。

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

DOI: 10.63317/3bep4yiomtp2

AI中文摘要

令牌经济学中的计算挑战：连接经济理论与AI系统设计

Ou Wu, Yingjun Deng

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences（中国科学院大学杭州高等研究院）； Hefei Institutes of Physical Science, Chinese Academy of Sciences（中国科学院合肥物质科学研究院）

AI总结本文探讨了在大规模语言模型系统中，将令牌作为经济原语时所面临的计算挑战，提出了计算令牌经济学的概念和令牌经济学三难困境，旨在建立连接令牌经济学与AI系统设计的研究议程。

Comments 43 pages

AI中文摘要

令牌经济学已逐渐成为理解大型语言模型系统中资源分配、价值创造和定价的一个有用视角。尽管近期研究越来越多地将令牌视为经济原语，但高层经济理论与现代AI基础设施的计算现实之间仍存在显著差距。本文识别并分析了在实时推理系统中落实令牌经济原则时出现的关键计算挑战。我们主张，计算可行性不只是令牌经济学的一个维度，而是其支配性约束：这些挑战源于细粒度估值、低延迟执行和不确定性下的分配最优性三者之间的根本张力。为了梳理这一问题空间，我们引入了计算令牌经济学（Computational Token Economics）的概念，并提出了令牌经济学三难困境（Token Economics Trilemma），这是一条有条件的“没有免费午餐”原则，刻画了粒度、实时性能和最优性之间的固有权衡。我们进一步将主要技术挑战分为三个领域：实时价值核算、受约束的资源分配和具备经济意识的系统架构。本文并不提供完整的解决方案，而是旨在界定一个连接令牌经济学与AI系统设计的研究议程，突出计算经济学、机器学习系统和AI基础设施交汇处的开放问题。

英文摘要

Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of Computational Token Economics and propose the Token Economics Trilemma, a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.

2605.17393 2026-05-19 cs.AI cs.LG cs.MA 版本更新

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

异质信息瓶颈协调图用于多智能体强化学习

Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

发表机构 * Australian Artificial Intelligence Institute (AAII)（澳大利亚人工智能研究所）

AI总结本文提出异质信息瓶颈协调图（HIBCG），通过理论指导机制解决多智能体强化学习中协调图的边存在性和信息传递容量分配问题，通过信息瓶颈方法构建组对齐的块对角先验，实现边存在性和信息容量的理论验证。

AI中文摘要

协调图是合作式多智能体强化学习（MARL）中的核心抽象，然而现有的稀疏图学习方法缺乏有理论依据的机制来决定哪些边应当存在以及每条边应携带多少信息。当前方法依赖启发式标准，对学到的拓扑结构不提供形式化保证，也没有原则性的方法为结构不同的智能体关系分配不同的通信容量。为此，我们提出了异质信息瓶颈协调图（HIBCG），它学习一个组感知的稀疏图，其中边的存在性和消息容量都有理论依据。以图信息瓶颈（GIB）为底层工具，HIBCG首先构建一个组对齐的块对角先验，为边保留提供闭式准则（确定哪些边应当存在以及每个组块的密度），然后在所得拓扑上控制每个智能体的特征带宽，压缩消息以仅保留与任务相关的内容。我们证明了组对齐先验严格收紧了拓扑学习的变分界，目标函数可按组块分解从而实现差异化的边控制，且容量分配遵循注水原则。

英文摘要

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.
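
The abstract states that capacity allocation "follows a water-filling principle" without giving the formula. As a reference point, the sketch below implements classical water-filling, where channel `i` receives `max(0, mu - n_i)` and the water level `mu` is chosen so the allocations sum to the budget. Mapping `noise_levels` onto agent relationships is an assumption made here for illustration, not the paper's derivation.

```python
import numpy as np

def water_filling(noise_levels, total_capacity):
    """Classical water-filling: allocate max(0, mu - n_i) so that the sum
    equals total_capacity (assumed to be positive)."""
    levels = np.asarray(noise_levels, dtype=float)
    n = np.sort(levels)
    for k in range(len(n), 0, -1):
        # Candidate water level if only the k lowest-noise channels are active.
        mu = (total_capacity + n[:k].sum()) / k
        if mu > n[k - 1]:
            break  # all k channels sit strictly below the water level
    return np.maximum(0.0, mu - levels)

allocation = water_filling([1.0, 2.0, 3.0], total_capacity=3.0)
```

With this input the water level settles at 3, so the allocation is `[2, 1, 0]`: cheaper (lower-noise) channels receive more capacity and the most expensive one receives none.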

2605.17382 2026-05-19 cs.AI cs.CL cs.GR 版本更新

UAM：VLA训练中遗忘的双流视角

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

发表机构 * Tsinghua University（清华大学）

AI总结本文提出UAM模型，通过双流架构解决VLA训练中因单一编码器导致的多模态能力下降问题，展示了通过架构分离而非冻结权重或辅助数据可实现语义保留，并在多种任务中取得高成功率。

AI中文摘要

视觉-语言-动作（VLA）模型通常通过在动作数据上微调预训练的视觉-语言模型（VLM）来构建。然而，我们证明这种标准方法系统性地削弱了VLM的多模态能力，这种副作用我们称之为‘具身税’。但VLA是否必须遗忘？受生物视觉双流组织的启发，我们将这种退化归因于结构性瓶颈：当前VLA要求单一编码器同时支持语言基础语义和控制相关的视觉特征，而生物视觉将识别与视觉运动控制分为不同的路径。基于此观点，我们提出了统一动作模型（UAM），添加了一个平行的背侧专家，作为大脑背侧通路的类比。为了使背侧专家成为有效的第二路径并减少对VLM的控制学习负担，我们从预训练的生成模型中初始化它，并用中层推理目标进行训练，该目标预测视觉动态。这种设计使我们能够仅用动作数据端到端地训练整个VLA：无需参数冻结、无需梯度停止、无需辅助VL共训练，UAM保留了超过95%的底层VLM的多模态能力，同时在多种操作任务中取得了基线中最高的平均成功率，这些任务探测分布外泛化，包括未见物体、新的物体-目标组合和指令变化。这些结果表明，VLA中的语义保留可以源自架构分离本身，而非通过冻结权重或辅助数据重放来强制实现，并且这种保留下来的语义能力可以自然地从VLM迁移到动作层面的语义泛化。

英文摘要

Vision--language--action (VLA) models are typically built by fine-tuning a pretrained vision--language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over $95\%$ of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object--target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

2605.15586 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

发表机构 * National Taiwan University（国立台湾大学）； The University of Tokyo（东京大学）； RIKEN Center for Advanced Intelligence Project（日本理化学研究院先进智能项目中心）

AI总结本文提出了一种新的框架BICL，通过设计偏置的标签生成过程来克服传统互补标签学习在多类设置中的限制，从而在CIFAR-100和TinyImageNet-200上实现了传统方法的七倍以上准确率提升。

Comments 33 pages, 16 figures, 18 tables

AI中文摘要

互补标签学习（CLL）是一种弱监督范式，其中每个实例标注的是它不属于的类别。尽管已有十年的研究，CLL方法仍主要在10类分类任务上具有竞争力，向大规模标签空间扩展始终是一个瓶颈。这种限制源于传统方法普遍假设标签均匀生成，这在多类设置中会严重稀释学习信号。在本文中，我们证明通过刻意设计有偏（非均匀）的生成过程，将互补标签限制在类别的一个子集内，可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling（BICL），一个从数据收集到训练、利用这种偏置的原则性框架。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习，准确率较传统方法提升七倍以上。我们的发现为CLL在现实应用中处理多类问题开辟了新的路径。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.
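
The contrast between uniform and biased complementary-label generation can be sketched as follows. This is an illustration under stated assumptions, not BICL itself: the abstract does not say how the class subset is chosen, so the 10-class subset and the function name `sample_complementary` are hypothetical.

```python
import numpy as np

def sample_complementary(true_labels, allowed_classes, rng):
    """Draw one complementary label per instance, uniformly from
    allowed_classes with the instance's true class excluded."""
    allowed = np.asarray(allowed_classes)
    comp = np.empty(len(true_labels), dtype=int)
    for i, y in enumerate(true_labels):
        candidates = allowed[allowed != y]   # never return the true class
        comp[i] = rng.choice(candidates)
    return comp

rng = np.random.default_rng(0)
num_classes = 100
true_labels = rng.integers(0, num_classes, size=1000)

# Uniform generation: any of the other 99 classes may be named.
uniform = sample_complementary(true_labels, np.arange(num_classes), rng)
# Biased generation: complementary labels restricted to a hypothetical 10-class subset.
biased = sample_complementary(true_labels, np.arange(10), rng)
```

Under uniform generation the 1000 labels are spread over 100 classes, whereas the biased process concentrates them on 10, which is one way to picture the "diluted learning signal" the abstract attributes to the uniform assumption.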

2605.15553 2026-05-19 cs.NI cs.AI cs.ET 版本更新

Operator-Controlled 6G: From Connectivity Infrastructure to Guaranteed Digital Services

运营商主导的6G：从连接基础设施到保证的数字服务

David Soldani

发表机构 * Rakuten Mobile Inc.（乐天移动公司）

AI总结本文提出了一种运营商主导的6G框架，通过重新排序运营商优先级，将控制、客户、业务、运营和技术作为优先级，定义了所有权分类和商业模型，以实现可执行的服务级别目标，并通过Rakuten Mobile的实证证据证明了其可行性。

Comments 81 pages, 18 figures, 66 references

AI中文摘要

第六代移动网络（6G）正接近结构性转折点。五代由供应商主导的架构，使运营商采购和运营的是自己并不拥有的网络，所依赖的平台无法修改，其中的AI层也无法审计。本文主张6G必须扭转这一趋势，重新排列运营商优先级：控制优先，客户优先，业务优先，运营优先，技术最后。技术应服务于运营商控制、客户成果、可变现的保证和软件驱动的运营，而不是支配它们。本文的两项贡献将这一论点落到实处。6G控制契约（6G Control Compact）定义了一个三层所有权分类（拥有、联邦和消费），根据战略价值分配架构主权。保证经济（Guarantee Economy）定义了一个五级、按结果定价的商业模型，将运营商控制转化为可执行的服务级别目标。该框架以Rakuten Mobile的运营证据为基础，这是世界上第一个全国规模、完全云原生、完全开放无线接入网络（Open RAN）的部署，并于2025财年实现了全年EBITDA盈利。它与ITU-R IMT-2030框架、3GPP 6G用例和服务要求、NGMN建议、ETSI标准、O-RAN联盟和AI-RAN联盟规范、IOWN全球论坛可持续性指标、Linux基金会倡议以及领先的行业和学术项目保持一致。一个涵盖2025-2027、2027-2029和2029-2032及以后的三阶段路线图，连同七项面向特定利益相关者的行动呼吁，将该架构转化为行业承诺。核心主张是Rakuten Mobile的部署证明了运营商主导的6G的可行性。2026-2028期间的决策将决定6G是成为保证数字服务的平台，还是又一个依赖供应商的基础设施周期。

英文摘要

Sixth-generation mobile networks (6G) are approaching a structural inflection point. Five generations of vendor-led architectures have left operators procuring and operating networks they do not own, on platforms they cannot modify, with AI layers they cannot audit. This paper argues that 6G must reverse this trajectory by reordering operator priorities: Control First, Customer First, Business First, Operations First, and Technology Last. Technology should serve operator control, customer outcomes, monetizable guarantees, and software-driven operations, not dictate them. Two contributions operationalize this thesis. The 6G Control Compact defines a three-layer ownership taxonomy (own, federate, and consume) that allocates architectural sovereignty according to strategic value. The Guarantee Economy defines a five-tier, outcome-priced commercial model that converts operator control into enforceable service-level objectives. The framework is grounded in operational evidence from Rakuten Mobile, the world's first national-scale, fully cloud-native, fully Open RAN deployment, which reached full-year EBITDA profitability in FY2025. It is aligned with the ITU-R IMT-2030 framework, 3GPP 6G use cases and service requirements, NGMN recommendations, ETSI standards, O-RAN Alliance and AI-RAN Alliance specifications, IOWN Global Forum sustainability metrics, Linux Foundation initiatives, and leading industry and academic programs. A three-phase roadmap covering 2025-2027, 2027-2029, and 2029-2032 and beyond, together with seven stakeholder-specific calls to action, translates the architecture into industry commitments. The central claim is that Rakuten Mobile's deployment demonstrates the feasibility of operator-controlled 6G. Decisions made during 2026-2028 will determine whether 6G becomes a platform for guaranteed digital services or another vendor-dependent infrastructure cycle.

2605.15377 2026-05-19 cs.AI 版本更新

ClawForge: 为命令行代理生成可执行的交互式基准测试

Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

发表机构 * University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校）； Stanford University（斯坦福大学）； University of California, Berkeley（加州大学伯克利分校）； University of Southern California（南加州大学）； University of California, Santa Cruz（加州大学圣克鲁兹分校）

AI总结ClawForge通过生成可执行的交互式基准测试，缓解了可扩展构建与真实工作流评估之间的矛盾，并系统测试代理在存在状态冲突时的处理能力。

AI中文摘要

交互式代理基准测试面临可扩展构建与真实工作流评估之间的张力。手工编写的任务扩展和修订成本高，而静态提示评估会漏掉只有当代理在持久状态上操作时才出现的失败。现有的交互式基准测试已显著推进了代理评估，但大多数从干净状态初始化任务，没有系统地测试代理如何处理预先存在的不完整、过时或相互冲突的工件。我们提出了ClawForge，一个由生成器支撑的基准测试框架，面向状态冲突下的可执行命令行工作流。该框架将场景模板、有依据的槽位、初始化状态、参考轨迹和验证器编译成可复现的任务规范，并基于归一化的终态和可观测的副作用（而非精确轨迹匹配）在持久工作流界面上逐步评估代理。我们将该框架实例化为ClawForge-Bench（17个场景，6个能力类别）。七个前沿模型上的结果表明，最佳模型仅达到45.3%的严格准确率，错误状态替换在所有模型中均低于17%，而最大的模型差距（17%到90%）取决于代理在行动前是否检查现有状态。部分得分和步骤效率分析进一步表明，许多失败是差一点完成的收尾失误而非早期崩溃，且各模型在状态冲突下表现出性质不同的失败风格。

英文摘要

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present ClawForge, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

2605.14038 2026-05-19 cs.AI 版本更新

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

模型适应性工具必要性揭示了大语言模型工具使用中的知行差距

Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi

发表机构 * University of Maryland, College Park（马里兰大学帕克分校）

AI总结本文研究了大语言模型在使用外部工具时的必要性问题，提出了一种基于模型自身性能的适应性工具必要性定义，并通过四个模型在算术和事实性问答数据集上的比较，发现工具必要性与实际调用行为之间存在显著的不匹配，揭示了LLM工具使用中的知行差距。

AI中文摘要

大型语言模型（LLMs）越来越多地作为自主代理，必须决定何时直接回答问题，何时调用外部工具。先前研究大多将工具必要性视为模型无关的属性，由人类或LLM判断者标注，主要涵盖答案明显的情况（例如获取天气与改写文本）。然而，现实中的工具必要性更为复杂，因为不同模型的能力边界存在分歧：一个强模型可以单独解决的问题，可能仍需要工具帮助弱模型。在本文中，我们引入了基于每个模型实证性能的模型适应性工具必要性定义。随后，我们比较了四个模型在算术和事实性问答数据集上的必要性与观察到的工具调用行为，发现存在26.5-54.0%和30.8-41.8%的显著不匹配。为了诊断失败，我们将工具使用分解为两个阶段：内部认知阶段，反映模型是否认为需要工具；执行阶段，决定模型是否实际做出调用动作。通过探测LLM隐藏状态，我们发现这两种信号往往可以线性解码，但它们的探测方向在晚期层、最后token的范围内几乎正交。通过追踪样本在两个阶段过程中的轨迹，我们进一步发现，大多数不匹配集中在认知到行动的转换过程中，而非认知本身。这些结果揭示了LLM工具使用中的知行差距：提高工具使用可靠性不仅需要更好的识别何时需要工具，还需要更好的将这种识别转化为行动。

英文摘要

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judges, and has mostly covered cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA datasets, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.
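
A mismatch rate of the kind reported above can be computed as sketched below. The necessity rule used here, a tool counts as necessary when the model's no-tool accuracy on a problem falls below a threshold, is an assumed operationalisation of "grounded in each model's empirical performance"; the abstract does not specify the exact criterion, and the record fields and data are invented for illustration.

```python
def tool_necessity(no_tool_correct_rate, threshold=0.5):
    """Hypothetical rule: a tool is 'necessary' for this model on this problem
    if the model usually fails when answering without one."""
    return no_tool_correct_rate < threshold

def mismatch_rate(records, threshold=0.5):
    """Fraction of problems where necessity and the observed tool call disagree."""
    mismatches = sum(
        tool_necessity(r["no_tool_correct_rate"], threshold) != r["called_tool"]
        for r in records
    )
    return mismatches / len(records)

records = [
    {"no_tool_correct_rate": 0.9, "called_tool": True},   # unnecessary call
    {"no_tool_correct_rate": 0.1, "called_tool": False},  # missed call
    {"no_tool_correct_rate": 0.2, "called_tool": True},   # consistent
    {"no_tool_correct_rate": 1.0, "called_tool": False},  # consistent
]
rate = mismatch_rate(records)
```

Because necessity is computed per model, the same problem can count as a mismatch for one model and as consistent behaviour for another, which is the distinction the model-adaptive definition is meant to capture.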


2605.13415 2026-05-19 cs.CL cs.AI cs.LG 版本更新

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

KIT-TIP-NLP 在 MultiPride 上的持续学习：多语言基础模型

Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

发表机构 * Text Information Processing Lab, Kitami Institute of Technology, Kitami, Hokkaido 090-0015, Japan（文本信息处理实验室，北见工业大学，日本北海道北见市，090-0015）

AI总结本文提出了一种多阶段框架，用于检测社交媒体中多语言的重新使用侮辱性语言。该框架解决了跨英语、西班牙语和意大利语推文中识别重新使用与非重新使用LGBTQ+相关侮辱性语言的挑战，通过数据驱动的模型选择、语义保留的增强、归纳迁移学习和领域特定知识注入等方法，提高了多语言情感表达的识别能力。

Comments Final Workshop of the 9th evaluation campaign EVALITA 2026

详情

AI中文摘要

本文提出了一种多阶段框架，用于检测多语言社交媒体中重新使用的侮辱性语言。该框架解决了在英语、西班牙语和意大利语推文中识别重新使用与非重新使用LGBTQ+相关侮辱性语言的挑战。该框架处理了三个交织的方法学挑战：数据稀缺、类别不平衡和跨语言的情感表达差异。该框架整合了通过交叉验证的数据驱动模型选择、通过回译的语义保留增强、具有动态周期级欠采样的归纳迁移学习，以及通过掩码语言模型注入的领域特定知识。系统评估了八个多语言嵌入模型，XLM-RoBERTa被选为基础模型，基于宏平均F1分数。通过GPT-4o-mini回译进行的数据增强有效将训练语料库增加了三倍，同时保留了语义内容和类别分布比例。该框架生成了四个最终运行用于评估，其中RUN 1是带有增强和欠采样的归纳迁移学习，RUN 2是带有掩码语言模型预训练，RUN 3和RUN 4是通过语言特定决策阈值优化的先前预测。语言特定的阈值优化表明，最优决策边界在不同语言中存在显著差异。这反映了模型置信度分数的分布差异和重新使用语言使用的语言差异。基于阈值的优化在不需模型重新训练的情况下，带来了2-5%的绝对F1提升。该方法完全可复现，所有代码和实验设置可在https://github.com/rbg-research/MultiPRIDE-Evalita-2026上找到。

英文摘要

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of distinguishing reclamatory from non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for evaluation: RUN 1 uses inductive transfer learning with augmentation and undersampling, RUN 2 adds masked language modeling pre-training, and RUN 3 and RUN 4 refine the previous predictions with language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields a 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.
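
The language-specific threshold step can be sketched as follows. The abstract only says thresholds are "optimized via ROC analysis"; using Youden's J (TPR minus FPR) as the selection criterion is an assumption made here for illustration, and the scores and labels are invented toy data rather than anything from the paper.

```python
def best_threshold(scores, labels):
    """Return (threshold, J) maximizing Youden's J = TPR - FPR.

    A sample is predicted positive when score >= threshold.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("both classes must be present")
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / positives - fp / negatives
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical per-language validation data: (confidence scores, gold labels).
validation = {
    "en": ([0.2, 0.3, 0.7, 0.9], [0, 0, 1, 1]),
    "it": ([0.1, 0.4, 0.5, 0.6], [0, 0, 1, 1]),
}

# One threshold per language, chosen without retraining the classifier.
thresholds = {lang: best_threshold(s, y)[0] for lang, (s, y) in validation.items()}
print(thresholds)
```

Because only the decision rule changes, this kind of refinement can be applied to saved predictions, which matches the abstract's point that no retraining is needed.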


2605.11461 2026-05-19 cs.AI cs.LG 版本更新

Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

打破赢家通吃：合作策略优化提升大语言模型的多样化推理

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

发表机构 * ISEE Lab, Sun Yat-sen University（中山大学ISEE实验室）

AI总结本文提出Group Cooperative Policy Optimization (GCPO)方法，通过改变训练范式从 rollout 竞争转向团队合作，提升大语言模型在推理任务中的准确性和解题多样性。

详情

AI中文摘要

基于验证器的强化学习（RLVR）已成为提升大语言模型（LLM）推理能力的核心范式，然而流行的基于群体的优化算法如GRPO常常面临探索崩溃问题，即模型过早收敛于一组高分模式，缺乏探索新解的能力。最近的研究尝试通过添加熵正则化或多样性奖励来缓解这一问题，但这些方法并未改变赢家通吃的本质，即rollouts仍为个体优势竞争而非合作最大化全局多样性。在本文中，我们提出Group Cooperative Policy Optimization（GCPO），将训练范式从rollout竞争转向团队合作。具体而言，GCPO将独立rollout评分替换为团队层面的信用分配：rollout被奖励其对团队有效解覆盖的贡献，而非其个体准确性。该覆盖被描述为奖励加权语义嵌入上的确定体体积，其中只有正确且非冗余的rollout才对这一体积做出贡献。在优势估计过程中，GCPO将集体团队奖励重新分配给每个单个rollout，根据其对团队的平均边际贡献。这种合作训练范式将优化方向导向非冗余的正确推理路径。在多个推理基准测试中，GCPO在现有方法的基础上显著提高了推理准确性和解题多样性。代码将在https://github.com/bradybuddiemarch/gcpo上发布。

英文摘要

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the model prematurely converges on a narrow set of high-scoring patterns and loses the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or a diversity bonus. However, these approaches do not change the winner-takes-all nature of training, in which rollouts still compete for individual advantage rather than cooperating to maximize global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than by its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.
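
To make the "determinant volume" idea concrete, here is a small sketch under stated assumptions. It is not the GCPO implementation: the coverage score log det(I + G), with G the Gram matrix of reward-weighted embeddings, is one common way to measure volume, and the leave-one-out difference used below is a simplification of the "average marginal contribution" mentioned in the abstract. The embeddings and rewards are toy values.

```python
import math


def det(matrix):
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    result = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            result = -result
        result *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return result


def coverage(embeddings, rewards):
    """Assumed coverage score: log det(I + G) over reward-weighted embeddings."""
    weighted = [[r * x for x in e] for e, r in zip(embeddings, rewards)]
    n = len(weighted)
    gram_plus_identity = [
        [
            sum(p * q for p, q in zip(weighted[i], weighted[j]))
            + (1.0 if i == j else 0.0)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return math.log(det(gram_plus_identity))


def leave_one_out_contributions(embeddings, rewards):
    """Coverage lost when each rollout is removed from the team."""
    total = coverage(embeddings, rewards)
    return [
        total - coverage(embeddings[:i] + embeddings[i + 1:],
                         rewards[:i] + rewards[i + 1:])
        for i in range(len(embeddings))
    ]


# Toy team: two redundant correct rollouts and one incorrect rollout (reward 0).
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
rewards = [1.0, 1.0, 0.0]
print(leave_one_out_contributions(embeddings, rewards))
```

In this toy case the incorrect rollout contributes nothing, and each of the two redundant rollouts contributes less than a distinct correct rollout would, which is the qualitative behavior the abstract describes.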


2605.11223 2026-05-19 cs.AI 版本更新

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

视觉-语言模型在点击式谜题游戏中是否展现出人类般的逻辑问题解决能力？

Maximilian Triebel, Marco Menner, Dominik Helfenstein

发表机构 * Institute of Artificial Intelligence, University of Stuttgart, Stuttgart, Germany（斯图加特大学人工智能研究所）

AI总结本文提出VLATIM基准测试，用于评估在经典物理谜题游戏The Incredible Machine 2中人类般的逻辑问题解决能力，发现尽管大模型在规划方面表现优异，但精确的视觉定位仍存在问题，尚未达到人类水平。

详情

AI中文摘要

视觉-语言（-动作）模型（VLMs）越来越多地应用于交互环境，但现有基准测试往往忽视了点击式谜题游戏中所需的复杂物理推理。本文介绍了Vision-Language Against The Incredible Machine（VLATIM），一个用于评估在经典物理谜题游戏The Incredible Machine 2（TIM）中人类般的逻辑问题解决能力的基准测试。与现有基准测试不同，VLATIM专门针对高水平逻辑推理与需要精确鼠标交互的连续动作空间之间的关键差距。该基准测试分为五个逐步部分，评估的能力从基本的视觉定位和领域理解到多步骤操作和完整谜题解决。我们的结果揭示了推理与执行之间的显著差距。尽管大型专有模型在规划能力方面表现优异，但它们在精确的视觉定位上存在困难。因此，它们尚未展现出人类般的解决问题能力。

英文摘要

Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.


2605.10871 2026-05-19 physics.med-ph cs.AI cs.LG 版本更新

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

吸引子-血管耦合理论：为基于智能手机光电容积图的AAMI标准无创血压估计提供形式基础和实证验证

Timothy Oladunni, Farouk Ganiyu Adewumi

发表机构 * Department of Computer Science, Morgan State University（摩根州立大学计算机科学系）

AI总结本文提出了一种数学框架，证明心脏吸引子几何编码了足够的血压信息，用于AAMI标准估计，并通过校准的无创血压模型验证了该理论，利用光电容积图（PPG）进行血压估计。

详情

AI中文摘要

本文提出吸引子-血管耦合理论（AVCT），一种数学框架，证明心脏吸引子几何编码了足够的血压（BP）信息，足以用于AAMI标准估计，并通过使用光电容积图（PPG）的校准无创血压模型验证了该理论。AVCT基于心脏稳定性理论，并通过Takens延迟嵌入和吸引子形态提取进行操作化。两个定理、一个命题和一个推论正式证明了PPG吸引子特征用于血压估计的使用，并预测了特征重要性层次。一个使用脉搏传导时间（PTT）和心脏稳定性指数（CSI）吸引子特征训练的LightGBM模型在严格留一受试者出交叉验证（LOSO-CV）上进行了评估，评估了来自BIDMC ICU（n=9）和VitalDB手术数据（n=37）的46名受试者，共29,684个窗口。该模型实现了收缩压（SBP）的平均绝对误差（MAE）为2.05 mmHg，舒张压（DBP）的MAE为1.67 mmHg，相关系数r=0.990和r=0.991，满足AAMI/IEEE SP10要求的MAE低于5 mmHg。每个受试者的中位数MAE为1.87/1.54 mmHg，70%/76%的受试者个体满足AAMI标准。使用九个智能手机吸引子特征的PPG-only消融与ECG+PPG模型的误差在0.05 mmHg以内，证明了仅使用智能手机摄像头即可实现临床级血压跟踪，超过了以往使用更少传感器的LOSO-CV结果。所有四个AVCT预测都得到了定量确认，从未校准到校准估计的误差减少了91.5%（epsilon_cal=0.915）。与后验可解释AI方法不同，AVCT预测的特征满足可解释AI可信度（EAT）框架的建筑忠实性标准，并将血压估计扎根于非线性动力学系统理论。

英文摘要

This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.
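
The Takens delay embedding mentioned above turns a one-dimensional signal into points in a reconstructed state space. The sketch below shows only that generic step. The embedding dimension, the delay, and the toy signal are illustrative assumptions, and the attractor morphology features used by AVCT are not reproduced here.

```python
def delay_embed(signal, dim, tau):
    """Return delay vectors [x[t], x[t + tau], ..., x[t + (dim - 1) * tau]]."""
    if dim < 1 or tau < 1:
        raise ValueError("dim and tau must be positive integers")
    n_vectors = len(signal) - (dim - 1) * tau
    if n_vectors <= 0:
        raise ValueError("signal is too short for this dim and tau")
    return [[signal[t + k * tau] for k in range(dim)] for t in range(n_vectors)]


# Toy stand-in for a PPG window (not real physiological data).
ppg_window = [0, 1, 2, 3, 4, 5]
points = delay_embed(ppg_window, dim=3, tau=2)
print(points)  # [[0, 2, 4], [1, 3, 5]]
```

Each returned vector is one point on the reconstructed attractor; geometric descriptors of that point cloud are the kind of feature the abstract refers to.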


2605.10236 2026-05-19 cs.LG cs.AI 版本更新

When Does Non-Uniform Replay Matter in Reinforcement Learning?

在强化学习中非均匀回放何时起作用？

Michal Korniak, Mikołaj Czarnecki, Yarden As, Piotr Miłoś, Pieter Abbeel, Michal Nauman

发表机构 * ETH Zurich（苏黎世联邦理工学院）； University of Warsaw（华沙大学）； UC Berkeley（加州大学伯克利分校）； Amazon FAR（亚马逊FAR）

AI总结本文研究了非均匀回放在强化学习中的有效性，发现回放体积、预期近期性和回放分布熵是决定因素，并提出了一种简单有效的截断几何回放策略以提高样本效率。

详情

AI中文摘要

现代非策略强化学习算法通常依赖于简单的均匀回放采样，但非均匀回放何时以及为何优于这一强基线仍不清楚。在多样化的强化学习设置中，我们证明非均匀回放的有效性由三个因素决定：回放体积、每环境步骤回放的转换数量；预期近期性，即所采样转换的近期程度；以及回放采样分布的熵。我们的主要贡献是明确非均匀回放何时有益，并为现代非策略强化学习中的回放设计提供实用指导。我们发现，当回放体积较低时，非均匀回放最有益，且即使在预期近期性相当时，高熵采样也很重要。受这些发现的启发，我们采用了一种简单的截断几何回放策略，该策略倾向于近期经验，同时保持高熵并带来可忽略的计算开销。在大规模并行模拟、单任务和多任务设置中，包括在五个强化学习基准套件上评估的三种现代算法，这种回放采样策略在低体积情况下提高了样本效率，而在高回放体积时仍具有竞争力。

英文摘要

Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling, and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.
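
A truncated geometric sampler over transition age can be sketched as below. The parameterization is an assumption made for illustration (age 0 is the newest transition and the probability decays by a factor of 1 - p per step of age, renormalized over the buffer). The paper's exact form may differ. The helpers for entropy and expected age correspond to two of the three factors named in the abstract.

```python
import math
import random


def truncated_geometric_probs(buffer_size, p):
    """P(age = k) proportional to (1 - p) ** k for k in 0 .. buffer_size - 1."""
    weights = [(1.0 - p) ** age for age in range(buffer_size)]
    total = sum(weights)
    return [w / total for w in weights]


def sample_ages(buffer_size, p, batch_size, rng):
    """Draw transition ages; age 0 is the most recently stored transition."""
    probs = truncated_geometric_probs(buffer_size, p)
    return rng.choices(range(buffer_size), weights=probs, k=batch_size)


def entropy(probs):
    """Shannon entropy (in nats) of the sampling distribution."""
    return -sum(q * math.log(q) for q in probs if q > 0.0)


def expected_age(probs):
    """Mean age of a sampled transition; lower means more recent."""
    return sum(age * q for age, q in enumerate(probs))


rng = random.Random(0)
probs = truncated_geometric_probs(buffer_size=1000, p=0.002)
batch = sample_ages(1000, 0.002, batch_size=8, rng=rng)
print(expected_age(probs), entropy(probs), math.log(1000))
```

Comparing the printed entropy with log(buffer_size), the entropy of uniform sampling, shows how much diversity a given p gives up in exchange for recency.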


2605.10185 2026-05-19 cs.CV cs.AI 版本更新

DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

DynGhost: 用于量子探测器动态鬼成像的时序建模Transformer

Vittorio Palladino, Ahmet Enis Cetin

发表机构 * Politecnico di Milano（米兰理工学院）； University of Illinois at Chicago（伊利诺伊大学芝加哥分校）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出DynGhost，一种基于Transformer的动态鬼成像方法，通过交替的空间和时间注意力模块解决传统方法在动态场景和低光条件下的局限性，利用量子感知训练框架提升真实硬件下的性能。

Comments 6 pages, 8 figures

详情

AI中文摘要

鬼成像通过将结构化照明图案与标量强度测量相关联，从单像素桶探测器重建空间信息。尽管深度学习方法在静态场景中取得了显著成果，但存在两个关键局限：现有架构未能利用帧间的时间相干性，导致动态鬼成像问题未得到解决，且假设加性高斯噪声模型，而实际单光子硬件遵循泊松统计。我们提出了DynGhost（动态鬼成像Transformer），通过交替的空间和时间注意力块解决这两个限制。基于物理准确的探测器模拟（SNSPDs、SPADs、SiPMs）和Anscombe方差稳定化归一化，我们的量子感知训练框架解决了导致经典模型在真实硬件约束下失效的分布偏移。在多个基准测试中，DynGhost在动态和光子匮乏设置中优于传统重建方法和现有深度学习架构。

英文摘要

Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.
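
The Anscombe transform referenced above maps Poisson-distributed counts to values whose variance is approximately constant, which is why it is a common normalization for photon-counting data. The sketch below shows the standard transform and its simple algebraic inverse. How DynGhost applies it inside the training pipeline is not specified in the abstract, so the usage here is only illustrative.

```python
import math


def anscombe(count):
    """Anscombe variance-stabilizing transform: 2 * sqrt(x + 3/8)."""
    if count < 0:
        raise ValueError("photon counts must be non-negative")
    return 2.0 * math.sqrt(count + 0.375)


def inverse_anscombe(value):
    """Algebraic inverse of the transform: (y / 2) ** 2 - 3/8."""
    return (value / 2.0) ** 2 - 0.375


# Hypothetical bucket-detector photon counts for a few illumination patterns.
bucket_counts = [0, 3, 12, 40]
stabilized = [anscombe(c) for c in bucket_counts]
recovered = [inverse_anscombe(v) for v in stabilized]
print(stabilized)
print(recovered)
```

The algebraic inverse is known to be biased at very low counts, so in photon-starved settings a bias-corrected inverse may be preferable.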


2605.10059 2026-05-19 cs.AI 版本更新

Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

LLM代理市场中的战略利用：电子商务信任的模拟框架

Shijun Lei, Quang Nguyen, Swapneel S Mehta, Zeping Li, Huichuan Fu, Xiaolong Zheng, Siki Chen, Yunji Liang, Philip Torr, Zhenfei Yin

发表机构 * Northwestern Polytechnical University（西北工业大学）； Boston University（波士顿大学）； Fudan University（复旦大学）； Wuhan University（武汉大学）； Chinese Academy of Sciences（中国科学院）； University of Oxford（牛津大学）

AI总结本文提出TruthMarketTwin模拟框架，用于研究LLM代理在电子商务市场中的行为，发现LLM代理在传统市场中会利用声誉治理的弱点，而强制执行可减少欺骗并重塑战略推理。

详情

AI中文摘要

基于代理的建模（ABM）长期以来被用于经济学中研究人类行为，而大型语言模型（LLM）代理现在使新的社会和经济模拟成为可能。尽管先前工作发现了LLM代理在金融交易和拍卖市场中的战略性欺骗，但电子商务仍鲜有研究，尽管其有独特的信息不对称：卖家私下观察产品质量，而买家依赖广告声明和声誉信号。我们引入TruthMarketTwin，一种用于研究LLM代理在电子商务市场中行为的受控模拟框架。该框架是首个模拟不对称信息共享下双边贸易的模型之一，其中代理做出战略性列表、购买、评分和救济相关决策以优化卖家利润和买家效用。我们发现，释放到传统市场中的LLM代理会自主利用基于声誉的治理弱点，而强制执行可减少欺骗并重塑战略推理。我们的结果将LLM代理模拟定位为研究由机构治理的自主市场工具。

英文摘要

Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.


2605.09040 2026-05-19 cs.AI cs.IR cs.LG 版本更新

UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

UxSID：面向超长序列的语义感知用户兴趣建模

Hongwei Zhang, Qiqiang Zhong, Jiangxia Cao, Yiyang Lv, Huanjie Wang, Liwei Guan, Jing Yao, Yiyu Wang, Junfeng Shu, Zhaojie Liu, Han Li

发表机构 * Kuaishou Technology（快手科技）

AI总结本文提出UxSID框架，通过语义组共享兴趣记忆和双层注意力策略，实现高效且语义感知的超长用户序列建模，取得最佳性能并提升广告收益。

Comments Work in progress

2605.08738 2026-05-19 cs.LG cs.AI cs.CL 版本更新

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

SlimQwen: 探索在大规模MoE模型预训练中的剪枝与知识蒸馏

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

发表机构 * Qwen Team, Alibaba Inc.（阿里巴巴通义千问团队）； MBZUAI ； KAUST（阿卜杜拉国王科技大学）

AI总结本文研究了在大规模预训练中如何应用剪枝和知识蒸馏技术，探讨了剪枝在初始化方面的优势、专家压缩对最终模型的影响以及训练策略的有效性，最终将Qwen3-Next-80A3B压缩到23A2B模型并保持竞争力。

详情

AI中文摘要

结构化剪枝和知识蒸馏（KD）是压缩大型语言模型的典型技术，但其在预训练规模下的应用仍不清楚，尤其是针对最近的混合专家（MoE）模型。本文系统研究了大规模预训练中的MoE压缩，重点探讨三个关键问题：剪枝是否比从头训练提供更好的初始化；专家压缩选择如何影响继续训练后的最终模型；以及哪种训练策略最有效。我们得出以下发现：首先，在深度、宽度和专家压缩方面，对预训练MoE进行剪枝在相同训练预算下优于从头训练。其次，不同的单次专家压缩方法在大规模持续预训练后收敛到相似的最终性能。受此启发，我们引入了一种简单的部分保留专家合并策略，该策略在大多数基准上提升了下游性能。第三，结合KD与语言建模损失在知识密集型任务上优于仅使用KD。我们进一步提出了多令牌预测（MTP）蒸馏，其效果一致。最后，鉴于相同的训练令牌，渐进式剪枝计划优于单次压缩，表明渐进的架构过渡导致更好的优化轨迹。综合来看，我们将Qwen3-Next-80A3B压缩到23A2B模型，保持了竞争力。这些结果为大规模高效MoE压缩提供了实用指导。

英文摘要

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.


2605.08163 2026-05-19 cs.CV cs.AI cs.CL 版本更新

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

MULTITEXTEDIT：跨语言文本-图像编辑中退化程度的基准测试

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

发表机构 * Harbin Institute of Technology（哈尔滨工业大学）

AI总结本文提出MULTITEXTEDIT基准测试，通过12种语言、5种视觉领域和7种编辑操作的3600个实例，评估跨语言文本-图像编辑中退化问题，引入语言保真度指标并发现模型在文本准确性和脚本保真度上的显著退化。

Comments 11 pages, 5 figures

详情

AI中文摘要

文本-图像编辑已成为视觉内容创作的关键能力，但现有基准测试大多以英语为中心且常将视觉合理性与语义正确性混为一谈。我们引入MULTITEXTEDIT，一个包含3,600个实例的受控基准测试，涵盖12种语言类型、5种视觉领域和7种编辑操作。每个实例的语言变体共享相同的视觉基础，并配有人工编辑的参考文本和区域掩码，从而隔离语言变量以进行跨语言比较。为捕捉粗粒度文本匹配度指标所遗漏的脚本级错误，如缺失变音符号、RTL顺序颠倒和混合脚本渲染，我们引入了一个由两阶段LVM协议评分的语言保真度（LSF）度量，其与母语者标注员的二次加权κ值达到0.76。评估12个开源和专有系统时，发现所有模型在跨语言退化方面表现显著，最大退化出现在希伯来语和阿拉伯语上，最小退化出现在荷兰语和西班牙语上，且集中在文本准确性和脚本保真度而非粗粒度结构维度上。我们还发现普遍存在的语义和像素不匹配，其中输出保持全局布局和背景保真度，但扭曲了脚本特定的形态。

英文摘要

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted $\kappa$ of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.


2605.07544 2026-05-19 cs.AI 版本更新

From Pixels to Prompts: Vision-Language Models

从像素到提示：视觉-语言模型

Khang Hoang Nhat Vo

发表机构 * MBZUAI

AI总结本文探讨了视觉-语言模型的发展历程，旨在提供清晰的认知框架，帮助读者理解该领域的核心概念和应用，而非罗列所有数据集和模型变体。

详情

AI中文摘要

当您阅读一篇关于新型视觉-语言模型的论文时，可能会忘记这个想法在不久以前听起来多么奇怪。教机器看见已经很困难，教它们阅读和生成语言也已很困难。让它们同时做到这些，并随后进行推理、回答问题、遵循指令，甚至有时令人惊讶，仍带着科幻的余韵，尽管它已成为日常。这本书源于一种简单的感觉：太容易迷失方向了。该领域发展迅速，新模型名称不断出现，‘我知道这些流行术语’与‘我真的理解其工作原理’之间的差距可能让人感到不适。我曾多次感受到这种差距。如果您持有这本书，您可能也有同感。我的目标不是提供一个详尽的数据集、基准和新模型变体的清单。相反，我希望提供更谦逊但或许更持久的东西：一个清晰的视觉-语言模型认知图谱。足够的结构，使您在阅读新论文时充满信心；足够的直觉，使您能够设计自己的系统而不觉得像在盲目地组装乐高积木。

英文摘要

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.


2605.07111 2026-05-19 cs.CL cs.AI 版本更新

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

超越LoRA与全微调：基于梯度的优化器路由用于大语言模型适应

Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Tsinghua University（清华大学）； Infinigence AI

AI总结本文提出了一种混合LoRA和全微调（MoLF）框架，通过在优化器层面动态路由更新，实现两种训练模式之间的连续导航，从而提升大语言模型的适应性能。

详情

AI中文摘要

通过迭代奖励引导的后训练改进表格语言模型

Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz

发表机构 * Department of Engineering, University of Cambridge（剑桥大学工程系）； CISPA Helmholtz Center for Information Security, Saarbrücken, Germany（德国萨尔布吕肯信息安全中心）； The Alan Turing Institute, London（伦敦阿兰·图灵研究所）

AI总结本文研究了通过生成-评分-对齐协议进行迭代奖励引导的后训练，提出了一种基于组相对对齐的方法TabGRAA，通过比较高分和低分生成组的组平均策略/参考对数比来改进表格语言模型，在五个混合类型基准上优于额外监督微调，并在保真度和下游效用之间实现了最佳平均权衡，同时保持经验隐私诊断接近监督基线。

详情

AI中文摘要

表格语言模型可以通过将行建模为令牌序列来生成合成表格，但通常通过监督微调一次后就作为静态生成器使用。这限制了下一步令牌似然不能直接优化用于评估合成数据的分布、效用和不可区分性属性。我们通过生成-评分-对齐协议研究了表格语言模型的迭代奖励引导后训练，其中生成器采样合成行，任务特定的奖励对其进行排序，模型则相对于固定监督参考进行更新。在该协议中，我们提出了TabGRAA（表格组相对优势对齐），通过组平均的策略/参考对数比比较高分和低分生成组，而非一对一偏好对。在五个混合类型基准上，TabGRAA在GReaT基座上优于额外监督微调，并在保真度和下游效用之间实现了最强的平均权衡，同时保持经验隐私诊断接近监督基线。消融研究显示，收益依赖于有意义的奖励排名和稳定的组级更新，而非额外训练本身。奖励替换和评分分离研究进一步表明，后训练循环可以使用基于分类器和无分类器的奖励，且适当的评分分离对于保持保真度-效用-隐私权衡至关重要。这些结果将TabGRAA定位为一种自改进的后训练方法，用于表格语言模型生成器，作为强大静态表格生成器的补充。

英文摘要

Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate-score-align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose TabGRAA (Tabular Group-Relative Advantage Alignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios rather than one-to-one preference pairs. Across five mixed-type benchmarks, TabGRAA improves a GReaT backbone beyond additional supervised fine-tuning and achieves the strongest average trade-off among adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while maintaining empirical privacy diagnostics near the supervised baseline. Ablations show that the gains depend on meaningful reward ranking and stable group-level updates rather than extra training alone. Reward-substitution and scorer-separation studies further show that the post-training loop can use both classifier-based and classifier-free rewards, and that proper scorer separation is important for preserving the fidelity-utility-privacy trade-off. These results position TabGRAA as a self-improving post-training method for tabular language-model generators, complementary to strong static tabular synthesizers.


2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

(稀疏) 注意细节：在基于机器学习的天气预测模型中保持频谱保真度

Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

发表机构 * AMLab（AM实验室）； University of Amsterdam（阿姆斯特丹大学）

AI总结本文提出Mosaic模型，通过学习功能扰动生成集合成员，并利用网格对齐的块稀疏注意力机制，在原分辨率网格上操作，以线性成本捕捉长距离依赖关系，从而在1.5°分辨率下达到或超越更精细分辨率模型的性能，实现了最先进的结果。

Comments Accepted to ICML 2026

详情

AI中文摘要

我们介绍Mosaic，一种概率天气预测模型，旨在解决基于机器学习的天气预测中频谱退化问题的三种失败模式：频谱阻尼（统计学）、高频混叠（架构学）和残余高频泄漏（参数学）。Mosaic通过学习的功能扰动生成集合成员，并通过网格对齐的块稀疏注意力机制在原分辨率网格上操作，该机制是一种硬件对齐的机制，通过在空间相邻查询之间共享键和值，以线性成本捕捉长距离依赖关系。在1.5°分辨率和214M参数下，Mosaic在关键变量上达到或超越了在6倍更精细分辨率上训练的模型的性能，并在1.5°模型中实现了最先进的结果，生成了经过良好校准的集合，其个体成员在所有解析频率上表现出近乎完美的频谱对齐。一个24成员、10天的预测在单个H100 GPU上不到12秒。代码可在https://github.com/maxxxzdn/mosaic上获得。

英文摘要

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6$\times$ finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12s on a single H100 GPU. Code is available at https://github.com/maxxxzdn/mosaic.


2604.16395 2026-05-19 cs.DB cs.AI 版本更新

Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)

Stream2LLM: 重叠上下文流式传输与预填充以减少时间到第一个标记（TTFT）

Rajveer Bachkaniwala, Chengqi Luo, Richard So, Divya Mahajan, Kexin Rong

发表机构 * Georgia Tech（佐治亚理工学院）

AI总结本文提出Stream2LLM，一种针对并发预填充-解码分离部署的流式感知大语言模型服务系统，通过自适应调度和抢占机制，有效解决上下文检索与推理之间的延迟问题，从而减少时间到第一个标记（TTFT），并在内存压力下保持吞吐量与非流式基线相等。

Comments Accepted to MLSys 2026. Minor formatting fixes

详情

AI中文摘要

针对LLM推理中的上下文检索系统面临的关键挑战：高检索延迟导致完全上下文等待（差的TTFT）与不完整上下文处理（降低质量）之间的根本矛盾。通过流式传输上下文——重叠检索与推理——可以缓解此延迟，但并发请求引入了新挑战：请求竞争GPU计算和内存，调度必须适应动态上下文到达。我们提出了Stream2LLM，一种流式感知的LLM服务系统，适用于并发预填充-解码分离部署。Stream2LLM引入了自适应调度和抢占机制，针对两种不同的检索模式：追加模式（逐步上下文累积）和更新模式（迭代精炼与缓存失效）。它将调度决策与资源获取解耦，使调度策略灵活，由硬件特定的成本模型引导，并使用最长公共前缀匹配来最小化动态输入变化时的冗余计算。为了评估Stream2LLM，我们收集了两个大规模的现实世界流式工作负载，基于网络爬行和近似最近邻搜索。我们的评估表明，流式架构在TTFT上实现了高达11倍的改进，成本感知调度在内存压力下提供了关键收益，同时保持与非流式基线相等的吞吐量。代码：https://github.com/rajveerb/stream2llm/tree/mlsys_artifact

英文摘要

Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Streaming context incrementally--overlapping retrieval with inference--can mitigate this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses longest common prefix matching to minimize redundant computation when input changes dynamically. To evaluate Stream2LLM, we collect two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that streaming architecture delivers up to 11x TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, all while maintaining throughput parity with non-streaming baselines. Code: https://github.com/rajveerb/stream2llm/tree/mlsys_artifact
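
The longest-common-prefix idea can be illustrated in a few lines: when a streamed context changes, only the tokens after the shared prefix need a fresh prefill. This is a minimal sketch, not Stream2LLM's code. The function names and the token IDs are hypothetical, and real KV-cache reuse also depends on block granularity and memory state, which are ignored here.

```python
def common_prefix_length(cached_tokens, new_tokens):
    """Number of leading tokens shared by the cached and the new input."""
    length = 0
    for cached, new in zip(cached_tokens, new_tokens):
        if cached != new:
            break
        length += 1
    return length


def tokens_to_prefill(cached_tokens, new_tokens):
    """Suffix of the new input that cannot reuse previously computed state."""
    return new_tokens[common_prefix_length(cached_tokens, new_tokens):]


# Append mode: context grows, so the whole cached input is a prefix.
cached = [101, 7, 8, 9]
appended = [101, 7, 8, 9, 10, 11]
print(tokens_to_prefill(cached, appended))  # [10, 11]

# Update mode: a retrieved chunk is revised, so reuse stops at the first change.
updated = [101, 7, 42, 9, 10]
print(tokens_to_prefill(cached, updated))  # [42, 9, 10]
```

The two calls mirror the append and update retrieval patterns in the abstract: appends reuse everything already computed, while updates invalidate the state from the first changed token onward.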


2604.15851 2026-05-19 cs.LG cs.AI cs.CR 版本更新

ECHO: 通过一步块扩散实现高效的胸部X光报告生成

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

发表机构 * Beijing Jiaotong University（北京交通大学）； Dalian University of Technology（大连理工大学）

AI总结本文提出ECHO，一种基于扩散模型的高效视觉-语言模型，用于生成胸部X光报告，通过一步块扩散和响应不对称扩散策略，显著提高了生成效率和文本连贯性，同时在临床准确性上保持良好表现。

详情

AI中文摘要

胸部X光报告生成（CXR-RG）有潜力显著减轻放射科医生的工作负担。然而，传统自回归视觉-语言模型（VLMs）由于序列令牌解码而存在高推理延迟。基于扩散的模型通过并行生成提供了一种有前景的替代方案，但它们仍然需要多个去噪迭代。将多步去噪压缩到单步可以进一步减少延迟，但通常会因令牌因子化去噪器引入的均场偏差而降级文本连贯性。为了解决这一挑战，我们提出了ECHO，一种高效的基于扩散的VLM（dVLM），用于胸部X光报告生成。ECHO通过一种新颖的直接条件蒸馏（DCD）框架实现了稳定的每块一步推理，该框架通过从策略扩散轨迹中构建非因子化监督来缓解均场限制，以编码联合令牌依赖性。此外，我们引入了一种响应不对称扩散（RAD）训练策略，该策略进一步提高了训练效率，同时保持模型有效性。广泛的实验表明，ECHO超越了最先进的自回归方法，在RaTE和SemScore上分别提高了64.33%和60.58%，同时在临床准确性上几乎没有下降的情况下，实现了高达8倍的推理加速。

英文摘要

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving up to $8\times$ inference speedup with negligible degradation in clinical accuracy.


2604.01658 2026-05-19 cs.AI 版本更新

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

CORAL：迈向自主多智能体进化以实现开放性发现

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

发表机构 * MIT（麻省理工学院）； NUS（新加坡国立大学）； MiniMax ； McGill（麦吉尔大学）； Stanford（斯坦福大学）； SambaNova ； Meta ； Singapore-MIT Alliance for Research and Technology（新加坡-麻省理工联合研究技术联盟）； Amazon（亚马逊）； Microsoft（微软）

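AI summary: This paper proposes CORAL, a framework for autonomous multi-agent evolution on open-ended problems, and shows that greater agent autonomy and multi-agent evolution substantially improve open-ended discovery.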

详情

AI中文摘要

基于大型语言模型（LLM）的进化是一种有前景的开放性发现方法，其中进展需要持续的搜索和知识积累。现有方法仍然严重依赖固定启发式和硬编码探索规则，这限制了LLM智能体的自主性。我们提出了CORAL，这是首个用于开放性问题的自主多智能体进化的框架。CORAL用长运行的智能体取代了刚性的控制，这些智能体通过共享持久记忆、异步多智能体执行和基于心跳的干预进行探索、反思和协作。它还提供了实用的保障措施，包括隔离的工作空间、评估者分离、资源管理以及智能体会话和健康管理。在多样化的数学、算法和系统优化任务上评估，CORAL在10个任务上实现了新的最先进结果，其改进率比固定进化搜索基线高出3-10倍，且使用更少的评估。在Anthropic的内核工程任务中，四个共进化智能体将最佳已知分数从1363提高到1103周期。机理分析进一步显示这些增益源于知识重用和多智能体探索和交流。这些结果表明，更大的智能体自主性和多智能体进化可以显著提高开放性发现。代码可在https://github.com/Human-Agent-Society/CORAL上获得。

英文摘要

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.


2603.27341 2026-05-19 cs.AI cs.CV cs.LG 版本更新

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

外科AI的比较研究：数据、计算和扩展的潜力与局限

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

Affiliations * Center for Applied AI, Chicago Booth; Surgical Data Science Collective; Children’s National Hospital; Operations Management & Tolan Center for Healthcare, Chicago Booth

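AI summary: Using state-of-the-art AI methods available in 2026, this paper studies performance and limitations in surgical tool detection. It finds that even with multi-billion-parameter models and extensive training, current vision-language models fall short on tool detection in neurosurgery, and that increasing model size and training time yields only limited gains, indicating that current AI still faces significant obstacles in surgical applications.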

详情

AI中文摘要

最近的人工智能（AI）模型在多个生物医学任务基准上已匹配或超越了人类专家，但特别是在外科手术基准方面，这些基准往往缺失于主要的医学基准套件中。由于手术需要整合多种任务，一般能力的AI模型可能成为协作工具，如果性能可以得到提升。一方面，通过扩展架构大小和训练数据的常规方法具有吸引力，尤其是由于每年有数百万小时的手术视频数据生成。另一方面，为AI训练准备手术数据需要显著更高的专业水平，并且在该数据上训练需要昂贵的计算资源。这些权衡描绘了现代AI是否以及在多大程度上能够帮助外科实践的不确定图景。在本文中，我们通过使用2026年最先进的AI方法进行外科手术工具检测的案例研究来探讨这个问题。我们证明，即使使用多十亿参数模型和大量训练，当前的视觉语言模型在看似简单的神经外科手术工具检测任务中仍表现不足。此外，我们展示了扩展实验，表明增加模型规模和训练时间仅导致相关性能指标的边际改善。因此，我们的实验表明，当前模型在手术使用案例中仍可能面临重大障碍。此外，一些障碍无法通过额外的计算能力简单地“解决”并持续存在于不同的模型架构中，提出了数据和标签可用性是否是唯一限制因素的问题。我们讨论了这些约束的主要贡献者，并提出了潜在的解决方案。

英文摘要

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to what extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion-parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply "scaled away" with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.


2603.25723 2026-05-19 cs.CL cs.AI 版本更新

Natural-Language Agent Harnesses

自然语言代理Harness

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University（清华大学深圳国际研究生院）； Harbin Institute of Technology (Shenzhen)（哈尔滨工业大学（深圳））

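AI summary: This paper proposes Natural-Language Agent Harnesses (NLAHs), executable natural-language objects that describe run-level harness policy, and introduces Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Experiments show that on coding, terminal-use, and computer-use benchmarks, NLAHs achieve task outcomes comparable to code and prompted realizations while exposing much shorter static harness policies.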

Comments revise paper

详情

AI中文摘要

代理性能受到周围Harness的强烈影响：围绕模型组织任务运行的外部执行系统。然而，这种逻辑通常隐藏在紧密耦合的控制器代码中，使得Harness难以检查、比较、转移和消解。本文探讨是否可以将代理Harness的可重用设计模式表示为可执行的自然语言对象。我们引入自然语言代理Harness（NLAH），即可编辑的文档，用于描述运行级别的Harness策略，并引入Intelligent Harness Runtime（IHR），一个共享运行时，能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。在编码、终端使用和计算机使用基准测试中，IHR执行的NLAH实现了与代码和提示实现相当的任务结果，同时暴露了更短的静态Harness策略。模块消解进一步表明，显式的Harness模块是可分析的。这些结果表明，代理Harness可以从模型周围的偶然粘合物转变为科学表示对象。

英文摘要

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.


2603.23231 2026-05-19 cs.AI 版本更新

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

PERMA：通过事件驱动的偏好和现实任务环境评估个性化记忆代理

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu

发表机构 * University of Science and Technology of China（中国科学技术大学）； City University of Hong Kong（香港城市大学）； Northeastern University（东北大学）； MemTensor (Shanghai) Technology Co., Ltd.（MemTensor（上海）科技有限公司）

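AI summary: This paper proposes PERMA, a benchmark that evaluates the long-term consistency of personalized memory agents through event-driven preferences and realistic task environments. It introduces text variability and linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. Experiments show that advanced memory systems extract precise preferences and reduce token consumption, but more robust personalized memory management is still needed.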

详情

AI中文摘要

为构建能适应用户不断变化需求的代理，增强大语言模型的长期记忆能力至关重要。现有评估通常将偏好相关对话与无关对话交织，使任务退化为needle-in-a-haystack检索，忽略了驱动用户偏好演变的事件之间的关系。此类设置忽视了现实世界个性化的一个基本特征：偏好是逐渐形成并在嘈杂环境中跨交互累积的。为弥合这一差距，我们引入PERMA，一个评估时间跨度内人格一致性的基准，超越静态偏好回忆。此外，我们引入（1）文本变异和（2）语言对齐，以模拟现实数据中的不规则用户输入和个体语言风格。PERMA包含跨多个会话和领域的时序排列交互事件，其中偏好相关查询随时间插入。我们设计了多选和交互任务以探测模型对人格的理解沿交互时间线。实验表明，通过关联相关交互，先进记忆系统能够精确提取偏好并减少token消耗，优于传统语义检索原始对话。然而，它们在时间和跨领域干扰中仍难以保持一致的人格，突显了代理中需要更稳健的个性化记忆管理的必要性。我们的代码和数据在https://github.com/PolarisLiu1/PERMA上开源。

英文摘要

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.


2603.14462 2026-05-19 cs.LG cs.AI 版本更新

STAG-CN: Spatio-Temporal Apiary Graph Convolutional Network for Disease Onset Prediction in Beehive Sensor Networks

STAG-CN：时空蜂巢图卷积网络用于蜂巢传感器网络中疾病发病预测

Sungwoo Kang

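AI summary: This study proposes STAG-CN, a spatio-temporal graph convolutional network that predicts disease onset by modeling inter-hive relationships through physical co-location and climatic sensor correlation. It finds that shared environmental response patterns carry stronger predictive signal than spatial proximity.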

Comments Null result after running with 10 seeds

详情

AI中文摘要

蜂蜜蜂群损失威胁着全球授粉服务，但当前监测系统将每个蜂箱视为孤立单元，忽略了疾病在养蜂场中传播的空间路径。本文介绍了时空蜂巢图卷积网络（STAG-CN），一种图神经网络，用于疾病发病预测。STAG-CN基于双邻接图，结合蜂箱会话间的物理共置和气候传感器相关性，通过基于因果扩张卷积和Chebyshev谱图卷积的时空-时空三明治架构处理多变量物联网传感器流。在韩国AI Hub养蜂数据集（数据集#71488）上进行扩展窗口时间交叉验证后，STAG-CN在三天预测范围内达到F1分数0.607。消融研究显示，仅气候邻接矩阵可达到全模型性能（F1=0.607），而仅物理邻接矩阵则为F1=0.274，表明共享的环境响应模式比空间接近性在疾病发病预测中更具预测信号。这些结果为基于图的生物安全监控在精准养蜂中的概念验证奠定了基础，证明了蜂箱传感器相关性编码了单个蜂箱方法无法察觉的疾病相关信息。

英文摘要

Honey bee colony losses threaten global pollination services, yet current monitoring systems treat each hive as an isolated unit, ignoring the spatial pathways through which diseases spread across apiaries. This paper introduces the Spatio-Temporal Apiary Graph Convolutional Network (STAG-CN), a graph neural network that models inter-hive relationships for disease onset prediction. STAG-CN operates on a dual adjacency graph combining physical co-location and climatic sensor correlation among hive sessions, and processes multivariate IoT sensor streams through a temporal-spatial-temporal sandwich architecture built on causal dilated convolutions and Chebyshev spectral graph convolutions. Evaluated on the Korean AI Hub apiculture dataset (dataset #71488) with expanding-window temporal cross-validation, STAG-CN achieves an F1 score of 0.607 at a three-day forecast horizon. An ablation study reveals that the climatic adjacency matrix alone matches full-model performance (F1 = 0.607), while the physical adjacency alone yields F1 = 0.274, indicating that shared environmental response patterns carry stronger predictive signal than spatial proximity for disease onset. These results establish a proof-of-concept for graph-based biosecurity monitoring in precision apiculture, demonstrating that inter-hive sensor correlations encode disease-relevant information invisible to single-hive approaches.
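The causal dilated convolutions named in the abstract above can be sketched in a few lines. This is a generic single-channel illustration in pure Python, not the STAG-CN layer itself; the zero padding of the past and the example weights are assumptions made for the demonstration.

```python
from typing import List


def causal_dilated_conv1d(x: List[float], weights: List[float], dilation: int) -> List[float]:
    """Compute y[t] = sum_k weights[k] * x[t - k * dilation], treating x[<0] as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation
            if idx >= 0:  # only current and past samples are read; the past is zero-padded
                acc += w * x[idx]
        out.append(acc)
    return out


readings = [1.0, 2.0, 3.0, 4.0, 5.0]
print(causal_dilated_conv1d(readings, [1.0, 1.0], dilation=2))  # [1.0, 2.0, 4.0, 6.0, 8.0]
```

Because the output at time `t` never reads an input later than `t`, a forecast built on such layers cannot leak future sensor values, and the dilation widens the history each output can see without adding weights.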


2603.12145 2026-05-19 cs.LG cs.AI cs.SE 版本更新

Automatic Generation of High-Performance RL Environments

自动生成高性能强化学习环境

Seth Karten, Rahul Dev Appapogu, Chi Jin

发表机构 * Princeton University（普林斯顿大学）； Independent Researcher（独立研究者）

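AI summary: This paper presents a closed-loop methodology that produces equivalent high-performance reinforcement learning environments at minimal compute cost. It demonstrates three distinct workflows, verifies the absence of a sim-to-sim gap across five environments, and shows how a new environment can be created.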

Comments 20 pages, 5 figures

详情

AI中文摘要

将复杂的强化学习（RL）环境转换为高性能实现传统上需要数月的专业工程工作。我们提出了一种闭环方法，以最小的计算成本生成等效的高性能环境。我们的方法使用通用提示模板、分层验证（属性、交互和运行测试）、迭代修复和跨后端策略转移来验证无仿真到仿真的差距。我们展示了三个不同的工作流程跨越五个环境：（1）从Game Boy模拟器PyBoy直接翻译到我们的EmuRust（通过Rust IPC）和从Pokemon Showdown翻译到我们的PokeJAX（通过JAX）；（2）通过与现有高性能实现的吞吐量一致性进行验证，如Puffer Pong、MJX和Brax在匹配的GPU批次大小下；（3）新环境的创建：TCGJax，第一个Pokemon TCG Pocket环境，从网页提取的规范中创建。在2亿个参数下，环境开销低于训练时间的4%。我们的闭环方法验证了所有五个环境的等效性。TCGJax，由一个不在公共存储库中的私有参考合成，用于控制代理预训练数据的污染问题。

英文摘要

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments: (1) Direct translation (no prior performance implementation exists) from Game Boy emulator PyBoy to our EmuRust (via Rust IPC) and from Pokemon Showdown to our PokeJAX (via JAX); (2) Translation verified against existing performance implementations via throughput parity with Puffer Pong, MJX and Brax at matched GPU batch sizes; and (3) New environment creation: TCGJax, the first Pokemon TCG Pocket environment, created from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Our closed-loop methodology confirms equivalence for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.


2603.11689 2026-05-19 cs.AI 版本更新

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

显式逻辑通道用于验证和增强用于零样本任务的前沿多模态大语言模型

Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen

发表机构 * Institute for Infocomm Research (I$^2$R)（信息通信研究所）； Agency for Science, Technology and Research (A*STAR)（科技研究局）； Singapore（新加坡）

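AI summary: This paper proposes an Explicit Logic Channel for validating and enhancing multimodal large language models on zero-shot tasks, using explicit logical reasoning to improve explainability and trustworthiness.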

详情

AI中文摘要

前沿多模态大语言模型（MLLMs）在视觉-语言理解（VLC）任务中表现出显著能力。然而，它们通常以黑盒方式部署到新任务中。验证和理解这些模型的行为对于应用到新任务变得重要。我们提出显式逻辑通道，与黑盒模型通道并行，以进行显式逻辑推理用于模型验证、选择和增强。前沿MLLM，封装潜在的视觉语言知识，可以被视为隐式逻辑通道。所提出的显式逻辑通道，模仿人类逻辑推理，结合了一个LLM、一个VFM和逻辑推理与概率推理，用于事实、反事实和关系推理，基于显式视觉证据。提出了一种一致性率（CR）用于跨通道验证和模型选择，即使没有地面真相注释。此外，跨通道整合进一步提高了MLLM在零样本任务中的性能，基于显式视觉证据以增强可信度。在两个代表性的VLC任务，即MC-VQA和HC-REC上，对三个具有挑战性的基准进行综合实验，使用11个最近的开源MLLMs，来自四个前沿家族。我们的系统评估证明了所提出的ELC和CR在增强可解释性和可信度的MLLM模型验证、选择和改进中的有效性。

英文摘要

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solutions to new tasks in a black-box manner. Validating and understanding the behavior of these models becomes important when applying them to new tasks. We propose an Explicit Logic Channel, in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates an LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments are conducted on two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of the proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.
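The abstract above does not give a formula for the Consistency Rate. One plausible reading, shown here purely as an assumption, is the fraction of samples on which the implicit channel (the MLLM) and the explicit channel give the same answer, which needs no ground-truth labels. The function name is hypothetical.

```python
from typing import List, Sequence


def consistency_rate(implicit_preds: Sequence[str], explicit_preds: Sequence[str]) -> float:
    """Fraction of samples where both channels predict the same answer (assumed definition)."""
    if len(implicit_preds) != len(explicit_preds):
        raise ValueError("both channels must answer the same samples")
    if not implicit_preds:
        raise ValueError("need at least one sample")
    agree = sum(1 for a, b in zip(implicit_preds, explicit_preds) if a == b)
    return agree / len(implicit_preds)


mllm_answers: List[str] = ["A", "B", "C", "D"]
logic_answers: List[str] = ["A", "B", "D", "D"]
print(consistency_rate(mllm_answers, logic_answers))  # 0.75
```

Under this reading, a model whose answers agree more often with the evidence-grounded channel would be preferred during label-free model selection.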


2603.10935 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

具有聚类感知可行区域的球形VAE：保证防止后验崩溃

Zegu Zhang, Jian Zhang

发表机构 * Independent Researcher（独立研究者）

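AI summary: This paper proposes a framework that theoretically guarantees non-collapsed solutions, using spherical shell geometry and cluster-aware constraints to prevent posterior collapse in VAEs, and reports 100% collapse prevention on synthetic and real-world datasets.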

Comments 8 pages, 6 figures

详情

AI中文摘要

变分自编码器（VAEs）经常受到后验崩溃的影响，其中潜在变量在近似后验退化为先验时变得无信息。尽管最近的研究将崩溃描述为由数据协方差属性决定的相变，但现有方法主要旨在避免而非消除崩溃。我们引入了一种新的框架，通过利用球壳几何和聚类感知约束，从理论上保证非崩溃解。我们的方法将数据转换为球壳，通过K-means计算最优聚类分配，并定义一个在聚类内方差W和崩溃损失δ-collapse之间的可行区域。我们证明当重构损失被限制在这个区域内时，崩溃解在数学上被排除在可行参数空间之外。关键的是，我们引入了规范约束机制，确保解码器输出保持与球壳几何兼容，而不限制表示能力。与以往方法不同，我们的方法提供了严格的理论保证，计算开销小，且不施加对解码器输出的限制。在合成和现实数据集上的实验表明，在传统VAE完全失败的条件下，实现了100%的崩溃预防，重构质量匹配或超过最先进的方法。我们的方法不需要显式的稳定性条件（例如σ² < λ_max），并且适用于任意神经网络架构。代码可在https://github.com/tsegoochang/spherical-vae-with-Cluster获取。

英文摘要

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $\delta_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity. Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $\sigma^2 < \lambda_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.


2603.03328 2026-05-19 cs.CL cs.AI 版本更新

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

Ruinan Jin, Yingbin Liang, Shaofeng Zou

发表机构 * Department of Electrical and Computer Engineering, The Ohio State University（俄亥俄州立大学电气与计算机工程系）； School of Electrical, Computer and Energy Engineering, Arizona State University（亚利桑那州立大学电气、计算机与能源工程学院）

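AI summary: This paper identifies second-moment normalization as the key mechanism in Adam and, through a stopping-time/martingale analysis under the classical bounded-variance model, shows that Adam has better high-probability convergence behavior than SGD: Adam's dependence on the confidence parameter $\delta$ is $\delta^{-1/2}$, whereas SGD's is at least $\delta^{-1}$.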

Comments 68 pages
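The second-moment normalization highlighted in the summary above can be seen in a scalar sketch of the standard Adam update next to plain SGD. This illustrates only the update rule with its usual default hyperparameters; it does not reproduce the paper's convergence analysis.

```python
import math
from typing import Tuple


def sgd_step(param: float, grad: float, lr: float) -> float:
    """Plain SGD: the step size scales linearly with the gradient magnitude."""
    return param - lr * grad


def adam_step(param: float, grad: float, m: float, v: float, t: int, lr: float,
              beta1: float = 0.9, beta2: float = 0.999, eps: float = 1e-8
              ) -> Tuple[float, float, float]:
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)  # normalize by sqrt of second moment
    return param, m, v


# A single heavy-tailed gradient: SGD moves by lr * 1000, Adam by roughly lr.
lr, big_grad = 0.01, 1000.0
print(abs(sgd_step(0.0, big_grad, lr)))                    # 10.0
new_param, _, _ = adam_step(0.0, big_grad, 0.0, 0.0, 1, lr)
print(round(abs(new_param), 6))                            # 0.01
```

Dividing by the square root of the second-moment estimate caps the effect of an unusually large gradient on a single step, which gives some intuition for why the update's tail behavior can differ from SGD's.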

2603.00631 2026-05-19 cs.AI 版本更新

LiTS: A Modular Framework for LLM Tree Search

LiTS：一个用于LLM树搜索的模块化框架

Xinzhe Li, Yaguang Tao

发表机构 * RMIT University（皇家墨尔本理工大学）

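AI summary: This paper presents LiTS, a modular framework for LLM reasoning via tree search. It demonstrates composability on language reasoning, environment planning, and tool-use tasks, and finds that in infinite action spaces LLM policy diversity is the bottleneck for effective tree search.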

Comments ACL 2026 Demo

详情

AI中文摘要

LiTS是一个模块化的Python框架，用于通过树搜索进行LLM推理。它将树搜索分解为三个可重用的组件（策略、转移和奖励模型），这些组件可以插入到MCTS和BFS等算法中。基于装饰器的注册机制使领域专家能够通过注册组件扩展到新领域，使算法研究人员能够实现自定义的搜索算法。我们在MATH500（语言推理）、Crosswords（环境规划）和MapEval（工具使用）上展示了可组合性，证明了组件和算法的正交性：组件可以在每个任务类型内跨算法重用，而算法可以在所有组件和领域中工作。我们还报告了一个模式崩溃发现：在无限动作空间中，LLM策略多样性（而不是奖励质量）是有效树搜索的瓶颈。演示视频可在https://youtu.be/nRGX43YrR3I获取。该包在Apache 2.0许可证下发布于https://github.com/xinzhel/lits-llm，包含安装说明和可运行示例，使用户能够重现演示的工作流。

英文摘要

LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.
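The decorator-based registry described in the abstract above follows a common Python pattern, sketched below. All names here (`register`, `lookup`, the component kinds, `length_penalty`) are hypothetical and chosen for illustration; the actual LiTS API may differ.

```python
from typing import Callable, Dict

# Hypothetical registry keyed by component kind, then by component name.
_REGISTRY: Dict[str, Dict[str, Callable]] = {"policy": {}, "transition": {}, "reward": {}}


def register(kind: str, name: str) -> Callable[[Callable], Callable]:
    """Return a decorator that stores the decorated object under (kind, name)."""
    def decorator(obj: Callable) -> Callable:
        if kind not in _REGISTRY:
            raise KeyError(f"unknown component kind: {kind}")
        _REGISTRY[kind][name] = obj
        return obj  # hand the object back unchanged so it stays usable directly
    return decorator


def lookup(kind: str, name: str) -> Callable:
    """Fetch a previously registered component."""
    return _REGISTRY[kind][name]


@register("reward", "length_penalty")
def length_penalty(state: str) -> float:
    """Toy reward model: shorter states score higher."""
    return -float(len(state))


reward_fn = lookup("reward", "length_penalty")
print(reward_fn("abc"))  # -3.0
```

With a registry like this, a search algorithm can request components by name, so a new domain is added by registering components rather than by editing the algorithm.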


2603.00607 2026-05-19 cs.CV cs.AI 版本更新

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

Affiliations * LARK, HKUST(GZ); Kuaishou Technology; UCL; UZH; NUS

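AI summary: This paper proposes CodeScaler, a reward model for scaling both reinforcement learning training and test-time inference in code generation. Trained on carefully curated preference data with syntax-aware code extraction, it outperforms execution-based RL across four coding benchmarks by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base, improves over the base model by +14.64 points without any test cases when scaled with additional synthetic data, reduces latency 10-fold at inference time, and surpasses existing reward models in the code, general, and reasoning domains.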

详情

AI中文摘要

基于可验证奖励的强化学习（RLVR）通过利用单元测试的执行反馈推动了代码大语言模型的最新进展，但其可扩展性从根本上受到高质量测试用例可用性和可靠性的影响。我们提出CodeScaler，一种奖励模型，旨在扩展代码生成的强化学习训练和测试时间推理。CodeScaler是在经过验证的代码问题上精心编纂的偏好数据上训练的，并结合语法感知的代码提取和保持有效性的奖励塑造，以确保稳定和稳健的优化。在四个编码基准上，CodeScaler在Qwen3-8B-Base上比基于执行的RL提升1.55分，在Qwen3-14B-Base上提升4.23分。通过进一步扩展到44K问题并添加额外的合成数据，CodeScaler在无任何测试用例的情况下，相对于基础模型提升了14.64分。在推理时间，CodeScaler作为有效的测试时间扩展方法，实现了与单元测试方法相当的性能，同时在推理时间减少了10倍的延迟。此外，CodeScaler在RM-Bench上不仅在代码领域（+3.3分）上优于现有奖励模型，还在通用和推理领域（平均+2.7分）上也表现优异。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).
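Using a reward model for test-time scaling, as the abstract above describes, is commonly done by best-of-N reranking: sample several candidate programs and keep the one the reward model scores highest. The sketch below assumes that setup; `toy_score` is a stand-in for illustration and has nothing to do with the real CodeScaler model.

```python
from typing import Callable, List


def best_of_n(candidates: List[str], score: Callable[[str], float]) -> str:
    """Return the candidate with the highest reward-model score."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)


def toy_score(code: str) -> float:
    """Stand-in scorer for illustration only: prefers code that returns a value."""
    return 1.0 if "return" in code else 0.0


samples = ["def f(x): pass", "def f(x): return x + 1"]
print(best_of_n(samples, toy_score))  # def f(x): return x + 1
```

No unit tests are executed in this loop, which is the property the abstract credits for the latency reduction relative to unit-test-based selection.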


2602.16990 2026-05-19 cs.AI cs.CE 版本更新

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Conv-FinRe：一种用于实用导向财务推荐的对话和纵向基准

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Columbia University（哥伦比亚大学）； California State University（加州州立大学）； University of Montreal（蒙特利尔大学）； The University of Manchester（曼彻斯特大学）； McGill University（麦吉尔大学）

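AI summary: This study proposes Conv-FinRe, a conversational and longitudinal benchmark for evaluating financial recommendation models. Its multi-view references distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, revealing a tension between rational decision quality and behavioral alignment.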

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

详情

AI中文摘要

大多数推荐基准评估模型模仿用户行为的能力。在金融顾问领域，观察到的行为可能在市场波动中嘈杂或短视，并可能与用户的长期目标冲突。因此，将用户的选择视为唯一真实情况，会将行为模仿与决策质量混淆。我们引入Conv-FinRe，一种用于股票推荐的对话和纵向基准，评估LLM超越行为匹配的能力。给定一个入职访谈、分步市场背景和顾问对话，模型必须在固定投资期限内生成排名。关键在于，Conv-FinRe提供了多视角参考，区分描述性行为与基于投资者特定风险偏好的规范性效用，使能够诊断LLM是否遵循理性分析、模仿用户噪声或受市场动量驱动。我们从真实市场数据和人类决策轨迹构建了该基准，实例化了受控的顾问对话，并评估了一套最先进的LLM。结果揭示了理性决策质量与行为一致性的持续张力：在效用基础上表现良好的模型往往无法匹配用户选择，而行为一致的模型可能会过拟合短期噪声。该数据集已公开发布在Hugging Face，代码库可在GitHub上获得。

英文摘要

Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.


2602.12978 2026-05-19 cs.RO cs.AI 版本更新

Learning Native Continuation for Action Chunking Flow Policies

学习原生延续以实现动作分块流策略

Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao

发表机构 * Spirit AI

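AI summary: This paper proposes Legato, a training-time continuation method for action-chunked flow-based VLA policies that reduces discontinuities at chunk boundaries and spurious multimodal switching, improving trajectory smoothness and task completion time.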

Comments Accepted by Robotics: Science and Systems 2026 (RSS 2026). Project page: https://lyfeng001.github.io/Legato/

详情

AI中文摘要

动作分块使Vision Language Action (VLA)模型能够实时运行，但朴素的分块执行常在分块边界处出现不连续性。实时分块（RTC）缓解了这一问题，但其作为外部策略导致伪多模态切换和非内在平滑的轨迹。我们提出Legato，一种针对动作分块流基于VLA策略的训练时延续方法。具体而言，Legato从具有调度形状的已知动作和噪声混合物初始化去噪，使模型接触部分动作信息。此外，Legato重塑学习的流动力学，确保在每步指导下去噪过程在训练和推理之间保持一致。Legato进一步在训练中使用随机调度条件以支持变化的推理延迟并实现可控的平滑度。实证结果表明，Legato产生更平滑的轨迹并减少执行中的伪多模态切换，导致较少的犹豫和更短的任务完成时间。广泛的现实世界实验表明，Legato在五个操作任务中始终优于RTC，实现了轨迹平滑度和任务完成时间的约10%的改进。

英文摘要

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.
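The idea of starting denoising from a schedule-shaped mixture of known actions and noise, described in the abstract above, can be illustrated for a one-dimensional action chunk. The linear ramp schedule, the function names, and the scalar actions are all assumptions made for this sketch; the abstract does not specify Legato's actual schedule or mixing rule.

```python
import random
from typing import List


def ramp_schedule(chunk_len: int, overlap: int) -> List[float]:
    """Hypothetical weights: 1.0 at the chunk start, falling linearly to 0.0 at step `overlap`."""
    if overlap <= 0:
        raise ValueError("overlap must be positive")
    return [max(0.0, 1.0 - i / overlap) for i in range(chunk_len)]


def init_denoising(known_actions: List[float], weights: List[float],
                   rng: random.Random) -> List[float]:
    """Blend known actions with Gaussian noise, step by step, using the schedule weights."""
    return [w * a + (1.0 - w) * rng.gauss(0.0, 1.0)
            for a, w in zip(known_actions, weights)]


weights = ramp_schedule(chunk_len=6, overlap=3)
known = [0.5, 0.6, 0.7, 0.0, 0.0, 0.0]  # zeros mark steps with no known action yet
start = init_denoising(known, weights, random.Random(0))
print(weights[0], weights[3:])  # 1.0 [0.0, 0.0, 0.0]
print(start[0])                 # 0.5
```

Steps with weight 1.0 start exactly at the already-known action, steps with weight 0.0 start from pure noise, and the ramp in between is what would let a model see partial action information during training.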


2602.12687 2026-05-19 cs.LG cs.AI 版本更新

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

信任不确定的教师：通过校准的不确定性提炼暗知识

Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho

Affiliations * Ewha Womans University; Tencent

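AI summary: This paper proposes Calibrated Uncertainty Distillation (CUD), a framework that revisits knowledge distillation from a distributional perspective to make dark knowledge more faithfully accessible. CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from calibrated rather than sharpened targets, so that students benefit from confident signals on easy cases and structured uncertainty on hard ones, improving accuracy and reliability under distribution shift and on long-tail inputs.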

详情

AI中文摘要

知识蒸馏的核心在于将教师的丰富'暗知识'-即揭示类别间关系和不确定性分布的细微概率模式进行转移。尽管这一理念已建立，但传统交叉熵训练的教师往往无法保留此类信号。它们的分布会坍缩成尖锐、过度自信的峰，看似决定性但实际脆弱，提供的仅限于硬标签或在表示层面转移时微妙地阻碍。这种过度自信在高基数任务中尤为成问题，因为许多可能类别的细微差别对指导紧凑的学生至关重要。此外，这种脆弱的目标会降低对分布偏移的鲁棒性，使学生在现实条件下的校准变得不可靠。为解决这一限制，我们从分布角度重新审视蒸馏，并提出校准不确定性蒸馏（CUD）框架，旨在使暗知识更忠实地被访问。CUD鼓励教师在有信息的地方揭示不确定性，并引导学生学习校准而非锐化确定性。通过在转移前直接塑造教师的预测分布，我们的方法在准确性和校准之间取得平衡，使学生在易例中受益于自信信号，在难例中受益于结构化不确定性。在多样化的基准测试中，CUD产生的学生不仅更加准确，而且在分布偏移下更加校准，在模糊的长尾输入上更加可靠。

英文摘要

The core of knowledge distillation lies in transferring the teacher's rich "dark knowledge": subtle probabilistic patterns that reveal how classes are related and how uncertainty is distributed. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from targets that are calibrated rather than sharpened toward certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also more calibrated under shift and more reliable on ambiguous, long-tail inputs.

2602.07884 2026-05-19 cs.LG cs.AI 版本更新

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

GRAFT：分离排名与校准用于生存分析

Mohammad Ashhad, Robert Hoehndorf, Ricardo Henao

发表机构 * KAUST（阿卜杜拉国王科技大学）； CEMSE KAUST（KAUST计算机、电气与数学科学与工程学部）； Duke University（杜克大学）

AI总结本文提出GRAFT模型，通过分离预测排名与生存校准，解决生存分析中排名与校准之间的权衡问题，该模型结合线性AFT模型与非线性残差神经网络，并利用随机门进行自动特征选择，从而在公开基准测试中实现了更好的判别能力和校准性能。

AI中文摘要

生存分析受到删失数据、高维特征和非线性交互的挑战。经典模型提供可解释性和优越的校准能力，但局限于线性或预定义的功能形式，而深度学习模型具有灵活性并实现了强大的判别性能，但倾向于产生校准不佳的生存估计。为了解决这一权衡问题，我们提出GRAFT（Gated Residual Accelerated Failure Time），一种新的AFT模型，该模型将预测排名与生存校准分离。GRAFT的混合架构结合了线性AFT模型与非线性残差神经网络，并整合了随机门用于自动特征选择。该模型通过优化可微的、C-index对齐的排名损失进行训练，利用局部Kaplan-Meier估计器的随机条件插补，而校准的生存估计则通过简单的后训练校准获得。在公开基准测试中，GRAFT在判别能力和校准性能上优于基线模型，同时在高噪声设置中保持稳健和稀疏。

英文摘要

Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models offer interpretability and superior calibration but are restricted to linear or predefined functional forms, while deep learning models are flexible and achieve strong discriminative performance, but tend to produce poorly calibrated survival estimates. To address this trade-off, we propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from survival calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic feature selection. The model is trained by optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators, while calibrated survival estimates are obtained through simple post-training calibration. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.
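
The ranking side of this trade-off is usually measured with the concordance index. The sketch below computes Harrell's C-index for right-censored data in plain Python. It is the standard, non-differentiable metric, not the differentiable ranking loss that GRAFT optimizes, and the toy data are hypothetical.

```python
def concordance_index(times, events, risk_scores):
    """Harrell's C-index for right-censored data.

    times: observed time (event or censoring) per subject
    events: 1 if the event was observed, 0 if censored
    risk_scores: higher score means higher predicted risk (shorter survival)
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # a pair is comparable only if the earlier time is an event
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5  # ties in predicted risk count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable

# Hypothetical cohort: the third subject is censored at time 6.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
risk = [0.9, 0.7, 0.8, 0.1]
print(concordance_index(times, events, risk))
```

Because the C-index depends only on the ordering of risk scores, a model can rank well and still produce poorly calibrated survival probabilities, which is why calibration is assessed separately.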

2602.05287 2026-05-19 cs.AI 版本更新

Position: Universal Time Series Foundation Models Rest on a Category Error

立场论文：通用时间序列基础模型建立在范畴错误之上

Xilin Dai, Wanxu Cai, Zhijian Xu, Qiang Xu

发表机构 * ZJU-UIUC Institute（浙大-UIUC研究院）； School of Software（软件学院）； Department of Computer Science and Engineering（计算机科学与工程系）

AI总结本文指出，追求'通用时间序列基础模型'存在根本性的类别错误，将结构容器误认为语义模态。由于时间序列包含不兼容的生成过程（如金融与流体动力学），单一大模型退化为昂贵的'通用过滤器'，在分布漂移下无法泛化。为此，我们引入'自回归盲目界限'，证明仅依赖历史的模型无法预测干预驱动的制度转变。我们主张用因果控制代理范式取代通用性，其中代理利用外部上下文协调一系列专门的求解器，从冻结领域专家到轻量级即时适应器。最后，我们呼吁将基准从'零样本准确性'转向'漂移适应速度'，以优先考虑鲁棒、控制理论系统。

AI中文摘要

本文立场论文认为，追求'通用时间序列基础模型'建立在根本性的类别错误上，误将结构容器视为语义模态。我们指出，由于时间序列包含不兼容的生成过程（例如金融与流体动力学），单一大模型退化为昂贵的'通用过滤器'，在分布漂移下无法泛化。为解决这一问题，我们引入'自回归盲目界限'，一个理论极限，证明仅依赖历史的模型无法预测干预驱动的制度转变。我们主张用因果控制代理范式取代通用性，其中代理利用外部上下文协调一系列专门的求解器，从冻结领域专家到轻量级即时适应器。最后，我们呼吁将基准从'零样本准确性'转向'漂移适应速度'，以优先考虑鲁棒、控制理论系统。

英文摘要

This position paper argues that the pursuit of "Universal Foundation Models for Time Series" rests on a fundamental category error, mistaking a structural Container for a semantic Modality. We contend that because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive "Generic Filters" that fail to generalize under distributional drift. To address this, we introduce the "Autoregressive Blindness Bound," a theoretical limit proving that history-only models cannot predict intervention-driven regime shifts. We advocate replacing universality with a Causal Control Agent paradigm, where an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. We conclude by calling for a shift in benchmarks from "Zero-Shot Accuracy" to "Drift Adaptation Speed" to prioritize robust, control-theoretic systems.

2602.04872 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Multi-layer Cross-attention is Provably Optimal for Multi-modal In-context Learning

多层交叉注意力是多模态上下文学习中可证明最优的

Nicholas Barnfield, Subhabrata Sen, Pragya Sur

发表机构 * Harvard University（哈佛大学）

AI总结本文研究了多模态上下文学习中多层交叉注意力机制的理论最优性，证明了在多模态数据下，交叉注意力机制在梯度流优化下可达到贝叶斯最优，同时指出单层线性自注意力无法在任务分布下统一恢复贝叶斯最优预测。

AI中文摘要

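PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0：一个全面的开源工具包，用于可访问和可重复的临床深度学习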

John Wu, Yongda Fan, Zhenbang Wu, Paul Landes, Eric Schrock, Sayeed Sajjad Razin, Arjun Chatterjee, Naveen Baskaran, Joshua Steier, Andrea Fitzpatrick, Bilal Arif, Rian Atri, Jathurshan Pradeepkumar, Siddhartha Laghuvarapu, Junyi Gao, Adam R. Cross, Jimeng Sun

发表机构 * University of Illinois Urbana-Champaign, Urbana, IL, USA（伊利诺伊大学厄巴纳-香槟分校）； PyHealth Research Initiative（PyHealth研究计划）； University of Illinois College of Medicine, Chicago, IL, USA（伊利诺伊大学医学院）； The University of Edinburgh, Edinburgh, UK（爱丁堡大学）； Health Data Research UK, London, UK（英国健康数据研究）； Department of Biomedical Engineering, Bangladesh University of Engineering（孟加拉国工程大学生物医学工程系）

AI总结本文提出PyHealth 2.0，一个全面的开源工具包，旨在解决临床AI研究中的可重复性和可访问性问题，通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化方法，实现7行代码即可完成预测建模。

Comments Under Review

AI中文摘要

难以复制基线、高计算成本和所需领域专业知识创建了持续存在的临床AI研究障碍。为了解决这些挑战，我们介绍了PyHealth 2.0，一个增强的临床深度学习工具包，使在7行代码内即可实现预测建模。PyHealth 2.0提供了三个关键贡献：(1) 一个全面的工具包，通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化（包括符合预测的置信预测）在一个框架中解决可重复性和兼容性挑战，支持多种临床数据模态——信号、影像和电子健康记录——并翻译5+医学编码标准；(2) 以可访问性为重点的设计，支持多模态数据和多样化的计算资源，处理速度比以往快39倍，内存使用减少20倍，使从16GB笔记本电脑到生产系统都能轻松使用；(3) 一个活跃的开源社区，拥有400多名成员，通过详尽的文档、可重复研究贡献以及与学术医疗系统和产业伙伴的合作，包括通过RHealth实现的多语言支持，降低了领域专业知识的障碍。PyHealth 2.0建立了一个开源基础和社区，推动了可访问和可重复的医疗AI发展。可在pip install pyhealth中获取。

英文摘要

Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities (signals, imaging, and electronic health records) with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Available via pip: pip install pyhealth.

2601.15630 2026-05-19 cs.AI 版本更新

Agentic AI Governance and Lifecycle Management in Healthcare

医疗领域代理AI治理与生命周期管理

Chandra Prakash, Mary Lind, Avneesh Sisodia

发表机构 * School of Computer Information Sciences, University of the Cumberlands, Williamsburg（坎伯兰大学计算机信息科学学院，威廉斯堡）

AI总结本文提出了一种统一的代理生命周期管理框架，旨在解决医疗领域中代理蔓延问题，通过五个控制层实现可审计的监督，同时支持本地创新和安全扩展。

Comments 21 Pages, 9 figures

AI中文摘要

医疗组织开始将代理AI嵌入到常规工作流程中，包括临床文档支持和早期预警监测。随着这些能力在各部门和供应商间扩散，医疗系统面临代理蔓延问题，导致代理重复、责任不明确、控制不一致和持续存在的工具权限。现有AI治理框架强调生命周期风险管理，但对代理舰队的日常操作提供有限指导。本文提出了一种统一的代理生命周期管理（UALM）蓝图，基于快速、实践导向的治理标准、代理安全文献和医疗合规要求的综合。UALM将反复出现的差距映射到五个控制层上：（1）身份和人物注册，（2）编排和跨域调解，（3） PHI 限定的上下文和记忆，（4）运行时策略执行与杀开关触发器，（5）生命周期管理和退役与凭证撤销和审计日志相关联。一个配套的成熟度模型支持分阶段采用。UALM为医疗CIO、CISO和临床领导者提供了一种可实施的模式，以实现可审计的监督，同时保持本地创新并安全扩展到临床和行政领域。

英文摘要

Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.

2601.14568 2026-05-19 cs.CV cs.AI 版本更新

Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

打破精度-资源困境：一种轻量级自适应视频推理增强

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

发表机构 * State Key Laboratory of Radio Frequency Heterogeneous Integration (Shenzhen University)（无线电频率异构集成国家重点实验室（深圳大学））； Institute of Applied Artificial Intelligence of the Guangdong–HongKong–Macao Greater Bay（粤港澳大湾区应用人工智能研究院）； Henan Academy of Science Applied Physics Institute Co.,Ltd.（河南省应用物理科学研究院有限公司）

AI总结本文提出了一种轻量级自适应视频推理增强框架，通过动态切换不同规模的模型来平衡资源利用与推理性能。

Comments 5 pages, 5 figures

2601.09722 2026-05-19 cs.CL cs.AI 版本更新

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

ADMEDTAGGER: 一个用于波兰医疗语言知识蒸馏的标注框架

Franciszek Górski, Andrzej Czyżewski

发表机构 * Gdansk University of Technology（格但斯克技术大学）

AI总结本文提出了一种标注框架，展示如何利用一个多语言预训练大语言模型作为教师模型，蒸馏出用于标注波兰医疗文本所需的专业知识，通过开发多类分类器，解决了标注资源不足的问题，最终得到了高效的分类器。

AI中文摘要

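In this work, we present an annotation framework that shows how a multilingual pretrained large language model can serve as a teacher for distilling the expert knowledge needed to annotate Polish medical texts. The work is part of the ADMEDVOICE project, within which we collected a large corpus of medical texts covering five clinical categories: radiology, oncology, cardiology, hypertension, and pathology. We used these data to develop a multi-class classifier, but the fundamental problem was a lack of annotation resources for labeling a sufficient number of texts. In our solution, we therefore used the multilingual Llama3.1 model to annotate the large corpus of Polish medical texts. With our limited annotation resources, we verified only a portion of these labels, which yielded a test set. The data annotated in this way were then used to train and validate three classifiers based on the BERT architecture: a distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on a Polish-language corpus. Among the models we trained, DistilBERT performed best, reaching an F1 score above 0.80 for every clinical category and above 0.93 for three of them. The resulting classifiers are efficient alternatives to large language models: about 500 times smaller, with 300 times lower GPU VRAM consumption, and several hundred times faster at inference.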

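The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

领域专精的幻觉：揭示混合专家模型中领域不变的“Standing Committee”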

Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu

发表机构 * The Fin AI（Fin AI）； Georgia Institute of Technology（佐治亚理工学院）； Cornell University（康奈尔大学）； Stevens Institute of Technology（史蒂文斯理工学院）； The University of Manchester（曼彻斯特大学）

AI总结本研究质疑混合专家模型通过稀疏路由实现领域专精的假设，提出COMMITTEEAUDIT框架分析专家组而非个体专家的路由行为，发现领域不变的standing committee，揭示模型存在向集中计算偏倚的结构倾向，表明混合专家模型中的专精程度远低于预期。

Comments Accepted by ACL 2026 main conference. Camera-ready version

AI中文摘要

混合专家模型被广泛假设通过稀疏路由实现领域专精。在本工作中，我们通过引入COMMITTEEAUDIT框架，质疑这一假设，该框架在专家组层面而非个体专家层面分析路由行为。在三个代表性模型和MMLU基准测试中，我们揭示了一个领域不变的standing committee。这是一个紧凑的路由专家联盟，能够跨领域、层和路由预算持续捕获大多数路由质量，即使在架构已包含共享专家的情况下。定性分析进一步显示，standing committee锚定推理结构和语法，而外围专家处理领域特定知识。这些发现揭示了模型对集中计算的强结构偏倚，表明混合专家模型中的专精程度远低于人们普遍认为的水平。这种固有偏倚也表明，当前的训练目标，如强制均匀专家利用的负载平衡损失，可能与模型的自然优化路径相悖，从而限制了训练效率和性能。

英文摘要

Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.
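
One way to picture a group-level routing audit is sketched below. This is a simplified illustration, not the COMMITTEEAUDIT procedure: the per-domain routing counts, the 50% mass threshold, and the function names are all hypothetical. For each domain it finds the smallest set of experts that captures at least half of the routing mass, then intersects those sets across domains.

```python
def top_mass_experts(routing_mass, threshold=0.5):
    """Smallest set of experts (by descending mass) whose share reaches threshold."""
    total = sum(routing_mass.values())
    chosen, covered = set(), 0.0
    for expert, mass in sorted(routing_mass.items(), key=lambda kv: -kv[1]):
        if covered >= threshold * total:
            break
        chosen.add(expert)
        covered += mass
    return chosen

def standing_committee(mass_by_domain, threshold=0.5):
    """Experts that appear in the high-mass set of every domain."""
    sets = [top_mass_experts(m, threshold) for m in mass_by_domain.values()]
    return set.intersection(*sets)

# Hypothetical routed-token counts per expert id, grouped by domain.
mass_by_domain = {
    "math": {0: 30, 1: 25, 2: 20, 3: 15, 4: 10},
    "law": {0: 28, 1: 27, 5: 25, 2: 20},
    "code": {1: 40, 0: 25, 6: 20, 2: 15},
}
committee = standing_committee(mass_by_domain)
print(sorted(committee))
```

If routing were strongly domain-specialized, the intersection would tend to be empty; a non-empty, stable intersection across domains is the kind of pattern the abstract describes.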

2601.01123 2026-05-19 cs.LG cs.AI 版本更新

Learning from Historical Activations in Graph Neural Networks

在图神经网络中学习历史激活

Yaniv Galron, Hadar Sinai, Haggai Maron, Moshe Eliasof

发表机构 * Technion – Israel Institute of Technology（以色列理工学院）； NVIDIA ； Ben-Gurion University of the Negev（内盖夫本-古里安大学）； University of Cambridge（剑桥大学）

AI总结本文提出HISTOGRAPH，一种基于注意力的两阶段最终聚合层，通过层间和节点间的注意力机制，利用节点的激活历史和图结构来优化最终预测特征，从而在多个图分类基准上实现了优于传统方法的性能。

Comments ICLR 2026

AI中文摘要

图神经网络（GNNs）在社交网络、分子化学等领域展现了显著的成功。GNNs的关键组成部分是池化过程，其中模型计算的节点特征被结合成一个有信息量的最终描述符，用于下游任务。然而，先前的图池化方案依赖于最后一个GNN层的特征作为池化或分类层的输入，这可能未能充分利用模型前向传递过程中先前层产生的重要激活，即历史图激活。这种差距在节点表示在许多图神经层中显著变化的情况下尤为明显，并且在深度架构中受到过平滑问题的加剧。为弥合这一差距，我们引入HISTOGRAPH，一种新颖的两阶段注意力最终聚合层，首先在中间激活上应用统一的层间注意力，随后进行节点间注意力。通过建模节点表示在层间的演变，我们的HISTOGRAPH利用节点的激活历史和图结构来优化最终预测所用的特征。在多个图分类基准上的实验证明，HISTOGRAPH提供了强大的性能，能够一致地改进传统技术，特别是在深度GNNs中表现出特别强的鲁棒性。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable success in domains such as social networks and molecular chemistry. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and is worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves on traditional techniques, with particularly strong robustness in deep GNNs.
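
The two-stage readout can be sketched in plain Python as follows. This is a minimal illustration under strong assumptions, not the HISTOGRAPH layer: attention scores come from dot products with two fixed query vectors (`layer_query` and `node_query`, both hypothetical), and the node-wise stage ignores graph structure, which the actual method is said to use.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def two_stage_readout(history, layer_query, node_query):
    """history[l][n] is the feature vector of node n after layer l."""
    num_layers, num_nodes = len(history), len(history[0])
    # Stage 1: for each node, attend over its activations from all layers.
    node_summaries = []
    for n in range(num_nodes):
        per_layer = [history[l][n] for l in range(num_layers)]
        alpha = softmax([dot(layer_query, h) for h in per_layer])
        node_summaries.append(weighted_sum(alpha, per_layer))
    # Stage 2: attend over nodes to form one graph-level descriptor.
    beta = softmax([dot(node_query, s) for s in node_summaries])
    return weighted_sum(beta, node_summaries)

# Hypothetical activations: 2 layers, 2 nodes, 2 features per node.
history = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[3.0, 0.0], [0.0, 3.0]],
]
print(two_stage_readout(history, layer_query=[0.5, 0.5], node_query=[1.0, 0.0]))
```

With all-zero queries both attention stages become uniform, so the readout reduces to mean pooling over layers and nodes; learned queries let it weight informative layers and nodes more heavily.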

2512.24497 2026-05-19 cs.AI cs.LG cs.RO stat.ML 版本更新

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

基于联合嵌入预测世界模型的物理规划：成功的驱动因素是什么？

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

发表机构 * Meta FAIR ； Inria Paris（法国国家信息与自动化研究所巴黎中心）； Ecole normale supérieure / PSL（巴黎高等师范学院 / PSL）； New York University（纽约大学）

AI总结本文研究了在物理规划中使用联合嵌入预测世界模型（JEPA-WMs）的成功因素，通过分析模型架构、训练目标和规划算法对规划成功的影响，提出了一种在导航和操作任务中优于现有基线方法的模型。

Comments V2: added AdaLN-zero; added a table comparing JEPA-WMs with baselines, where the standard deviation reflects per-seed variability only (no variability across epochs); reordered figures in the main body of the paper. V3: added data scaling experiments and a theoretical appendix section on autoregressive rollout; accepted at TMLR.

AI中文摘要

人工智能领域长期存在的挑战是开发能够解决广泛物理任务并泛化到新、未见过的任务和环境的智能体。一种流行的近期方法是通过状态-动作轨迹训练世界模型，然后使用规划算法解决新任务。规划通常在输入空间中进行，但最近出现的一类方法引入了在学习的表示空间中优化的规划算法，其承诺通过抽象无关细节来提高规划效率。在本工作中，我们将此类模型称为JEPA-WMs，并研究使此类算法有效技术选择。我们提出了一项全面研究几个关键组件，旨在找到该类中的最佳方法。我们使用模拟环境和真实世界机器人数据进行了实验，并研究了模型架构、训练目标和规划算法对规划成功的影响。我们结合发现，提出了一种在导航和操作任务中优于两个现有基线方法（DINO-WM和V-JEPA-2-AC）的模型。代码、数据和检查点可在https://github.com/facebookresearch/jepa-wms上获得。

英文摘要

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently using it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

2512.23994 2026-05-19 cs.SD cs.AI 版本更新

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

PhyAVBench: 一个具有挑战性的音频物理敏感性基准，用于物理基础的文本到音频视频生成

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

发表机构 * HKUST(GZ)（香港科技大学（广州））； Tencent（腾讯）

AI总结本文提出PhyAVBench，一个用于评估文本到音频视频生成、图像到音频视频生成和视频到音频生成模型中音频-物理基础能力的基准，通过引入新的数据集和评估方法，揭示了当前模型在物理合理音频生成方面的不足。

Comments 6 major physical dimensions, 41 fine-grained test points, 337 groups of variable-controlled test samples, 11,605 newly recorded videos

AI中文摘要

文本到音频视频（T2AV）生成在影视制作和世界建模等应用中至关重要。然而，当前模型往往无法生成物理上合理的音效。先前的基准主要关注音频视频时间同步，而忽视了对音频-物理基础的显式评估，从而限制了对物理合理音频视频生成的研究。为了解决这个问题，我们提出了PhyAVBench，这是第一个系统评估T2AV、I2AV和V2A模型音频-物理基础能力的基准。PhyAVBench提供PhyAV-Sound-11K，一个包含来自184名参与者25.5小时11,605个可听视频的新数据集，以确保多样性和避免数据泄漏。它包含337对提示组，具有受控的物理变化，驱动声音差异，每个组平均有17个视频，涵盖6个音频-物理维度和41个细粒度测试点。每个提示对都标注了其声音差异背后的物理因素。重要的是，PhyAVBench利用配对文本提示来评估这一能力。我们称这种评估范式为音频-物理敏感性测试（APST），并引入了一个新的指标，对比物理响应分数（CPRS），用于量化生成视频与现实世界对应物之间的声音一致性。我们对17种最先进的模型进行了全面评估。我们的结果表明，即使领先的商业模型在基本的音频物理现象上也存在问题，揭示了超出音频视频同步之外的关键差距，并指明了未来的研究方向。我们希望PhyAVBench能为推进物理基础的音频视频生成提供基础。提示、真实值和生成视频样本可在https://github.com/imxtx/PhyAVBench上获得。

英文摘要

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://github.com/imxtx/PhyAVBench.

2512.04746 2026-05-19 cs.CL cs.AI 版本更新

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2：朝着关闭LLMs极低比特后训练量化性能差距的目标

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen, Zaner Ma

发表机构 * Intel（英特尔公司）； Beijing Institute of Technology（北京理工大学）

AI总结本文提出SignRoundV2框架，通过自适应混合精度策略和轻量稳定技术，在极低比特量化下保持高性能，实验表明在混合MXFP设置中实现接近无损性能，将性能差距缩小到约1%。

AI中文摘要

极低比特量化对高效部署大型语言模型（LLMs）至关重要，但往往在2比特和4比特（如MXFP4）时导致严重性能下降。我们提出了SignRoundV2，一种后训练量化框架，旨在在极端压缩下保持高性能。SignRoundV2引入（1）一种简单而高效的自适应混合精度策略，利用梯度信息和量化引起的重建误差来指导层间比特分配，以及（2）一组轻量级稳定技术，包括损失过滤和预调制比例搜索，以提高极低比特环境下的调优效果。我们的方法在量化和全精度模型之间显著缩小了性能差距。在多种LLMs上的实验结果表明，SignRoundV2在混合MXFP设置中实现了接近无损性能，将差距缩小到约1%（平均4.5比特），同时在具有挑战性的2比特权重-only量化中大幅提高准确性。源代码可在https://github.com/intel/auto-round获取。

英文摘要

Extremely low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework designed to maintain high performance even under aggressive compression. SignRoundV2 introduces (1) a simple yet efficient adaptive mixed-precision strategy that leverages gradient information and quantization-induced reconstruction errors to guide layer-wise bit allocation, and (2) a set of lightweight stabilization techniques, including loss filtering and a pre-tuning scale search, to improve tuning effectiveness in extremely low-bit regimes. Our approach takes a significant step toward closing the performance gap between quantized and full-precision models. Experimental results across diverse LLMs demonstrate that SignRoundV2 achieves near-lossless performance in mixed MXFP settings, narrowing the gap to about 1% at an average of 4.5 bits, while substantially improving accuracy in challenging 2-bit weight-only quantization. The source code is available at https://github.com/intel/auto-round.

2511.23253 2026-05-19 cs.AI 版本更新

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Beacon：单轮诊断和缓解大型语言模型中潜在的阿谀倾向

Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal

AI总结本文提出Beacon基准测试，用于单轮诊断和缓解大型语言模型中潜在的阿谀倾向，通过评估十二种最先进的模型，揭示了阿谀倾向在语言和情感方面的稳定子偏差，并提出了在提示和激活层面的干预措施，以调节这些偏差，从而揭示对齐作为事实性和社会合规判断之间的动态流形。

AI中文摘要

大型语言模型内部化了诚实与奉承之间的结构权衡，这种权衡源于奖励优化，将有用性与礼貌服从混淆。这种潜在的偏见，称为阿谀倾向，表现为对用户同意的偏好而非原则性推理。我们引入Beacon，一种单轮强制选择基准测试，该测试独立于对话上下文，能够精确测量事实准确性与顺从偏见之间的张力。在十二种最先进的模型上的评估表明，阿谀倾向分解为稳定的语言和情感子偏见，每个都随模型容量而扩大。我们进一步提出了提示级别和激活级别干预，以调节这些偏见的相反方向，揭示对齐作为事实性和社会合规判断之间的动态流形。Beacon将阿谀倾向重新定义为可测量的规范性误泛化形式，为研究和缓解大规模生成系统中的对齐漂移提供了可重复的基础。

英文摘要

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

2510.16609 2026-05-19 cs.LG cs.AI cs.CC cs.DS 版本更新

NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

NeuroRVQ：多尺度生物信号分词用于生成式基础模型

Konstantinos Barmpas, Na Lee, Dimitrios Chalatsis, William Raftery, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Alexandros Koliousis, Dario Farina, Stefanos Zafeiriou

发表机构 * Imperial College London（伦敦帝国理工学院）； Cogitat ； National and Kapodistrian University of Athens（雅典国立卡波迪斯特里安大学）； Archimedes Research Unit（阿基米德研究单位）； Aristotle University of Thessaloniki（塞萨洛尼基亚里士多德大学）； Northeastern University London（伦敦东北大学）

AI总结本文提出NeuroRVQ，一种多尺度生物信号分词方法，通过多尺度时序卷积分解生物信号并结合相位感知损失，实现高保真信号重建，验证了高质量分词对下游性能的重要性。

AI中文摘要

生物信号如脑电图（EEG）、心电图（ECG）和肌电信号（EMG）在多个时间和频谱尺度上编码生理活动，产生丰富但对机器学习具有挑战性的表示。训练以预测掩码信号标记为基础模型的方法在学习通用生物信号表示方面显示出前景，但其性能取决于分词器保留高频动态和高保真重建信号的能力。我们引入NeuroRVQ，一种适用于高保真信号重建的多模态生物信号分词家族。为了捕获完整的频谱，NeuroRVQ通过多尺度时序卷积将生物信号分解为频特定表示，每个表示编码为层次化的RVQ代码本以保留高频细节，并结合一种新的相位感知训练损失，该损失尊重傅里叶相位的环形拓扑。通过调整时间分辨率、时间核的数量和大小以及RVQ深度，此设计适应每种生物信号模态的频谱-时间特性。为验证分词质量驱动下游性能，我们为每种模态训练一个简单的掩码标记基础模型（NeuroRVQ-FM）使用相应的NeuroRVQ分词器。NeuroRVQ-FM家族在与现有模态特定基础模型相比时实现了竞争或更优的下游性能，证明了高保真分词是有效生物信号建模的关键因素。

英文摘要

Biosignals such as electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) encode physiological activity across multiple temporal and spectral scales, yielding representations that are rich but challenging for machine learning. Foundation models trained to predict masked signal tokens have shown promise in learning generalizable biosignal representations, yet their performance depends on the tokenizer's ability to preserve high-frequency dynamics and reconstruct signals with high fidelity. We introduce NeuroRVQ, a modality-adaptive biosignal tokenizer family designed for high-fidelity signal reconstruction. To capture the full frequency spectrum, NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning the temporal resolution, number and size of temporal kernels and RVQ depth, this design adapts to the spectro-temporal characteristics of each biosignal modality. To validate that tokenizer quality drives downstream performance, we train a simple masked-token foundation model for each modality (NeuroRVQ-FM) using the corresponding NeuroRVQ tokenizer. The NeuroRVQ-FM family achieves competitive or superior downstream performance compared to existing modality-specific foundation models, demonstrating that high-fidelity tokenization is a critical factor for effective biosignal modeling.
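
Two ingredients named in the abstract can be illustrated generically: residual vector quantization (RVQ) and a loss that respects the circular nature of phase. The sketch below is not the NeuroRVQ tokenizer. The codebooks are tiny hypothetical examples, and `1 - cos(difference)` is one common circular loss, assumed here because the abstract does not give the exact form.

```python
import math

def nearest(codebook, vector):
    """Index of the codeword closest to vector in squared Euclidean distance."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(c, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))

def rvq_encode(vector, codebooks):
    """Quantize the running residual with one codebook per stage."""
    residual = list(vector)
    codes = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        codes.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return codes

def rvq_decode(codes, codebooks):
    """Sum the selected codewords from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for idx, codebook in zip(codes, codebooks):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out

def circular_phase_loss(pred_phase, true_phase):
    """Mean of 1 - cos(difference); phases 2*pi apart count as identical."""
    diffs = [1.0 - math.cos(p - t) for p, t in zip(pred_phase, true_phase)]
    return sum(diffs) / len(diffs)

# Hypothetical two-stage codebooks: a coarse stage, then a fine stage.
codebooks = [
    [[0.0, 0.0], [4.0, 4.0], [-4.0, 4.0]],
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
]
codes = rvq_encode([4.9, 4.2], codebooks)
print(codes, rvq_decode(codes, codebooks))
```

Each extra stage quantizes what the previous stages missed, which is how deeper RVQ stacks can retain finer detail. A plain squared error on phase would instead penalize a prediction of pi against a target of -pi, even though the two angles coincide.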

2510.03879 2026-05-19 cs.SE cs.AI 版本更新

Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation

对抗性代理协作提升C到安全Rust翻译的正确性

Tianyu Li, Ruishi Li, Bo Wang, Brandon Paulsen, Umang Mathur, Prateek Saxena

发表机构 * National University of Singapore（新加坡国立大学）； Amazon Web Services（亚马逊网络服务）

AI总结本文提出ACToR框架，通过对抗性搜索发现翻译与C源码分歧的输入，利用这些输入驱动后续优化，提升C到Rust翻译的正确性，实验表明其在多个真实世界工具中达到90%以上的测试通过率。

AI中文摘要

将C语言翻译成内存安全语言如Rust可以防止遗留C软件中普遍存在的关键内存安全漏洞。即使使用了最近的基于大语言模型（LLM）和工具增强的翻译器，生成的Rust代码在未测试的输入上仍经常与C源码产生分歧，这种正确性差距是自动C到Rust翻译可靠性的主要障碍。本文提出ACToR（对抗性C到Rust），一种简单的LLM代理循环，通过对抗性搜索发现翻译与C源码分歧的输入，并利用这些输入驱动后续优化。受生成对抗网络（GANs）启发，ACToR让翻译代理与鉴别代理协作，迭代优化Rust翻译。在每次迭代中，翻译代理生成并优化Rust翻译以通过现有测试套件，然后鉴别代理通过构造并优化C和Rust二进制文件的差分模糊器来发现新的失败测试。在63个真实世界命令行C工具上，平均代码行数为473行，最长可达数千行，ACToR在零人工干预下实现了超过90%的测试通过率。改进在七个代理-LLM配置上的微基准测试中保持稳定，表明该循环在底层翻译器和LLM选择上基本独立。与非对抗性、基于覆盖率的测试生成基线相比，ACToR将正确性提高了最高36.7%。当应用于最近的翻译器C2SaferRust时，ACToR进一步将验证通过率提高了16.6%。

英文摘要

Translating C to memory-safe languages, like Rust, prevents critical memory safety vulnerabilities that are prevalent in legacy C software. Even with recent LLM-based and tool-augmented translators, the resulting Rust code frequently diverges from the C source on inputs absent from the test suite used during translation; this correctness gap on unseen inputs remains a dominant obstacle to reliable, automatic C-to-Rust translation. In this work, we present ACToR (Adversarial C To Rust), a simple LLM-agent loop that closes this gap by adversarially searching for inputs on which the translation diverges from the C source, and using them to drive subsequent refinements. Inspired by GANs, ACToR pits a translator agent against a discriminator agent that collaborate to iteratively refine the Rust translation. On each iteration, the translator agent synthesizes and refines a Rust translation to pass an existing suite of tests, and then the discriminator agent finds new failing tests by constructing and refining a differential fuzzer over the C and Rust binaries. Across 63 real-world command-line C utilities, with an average size of 473 lines of code and the longest reaching thousands of lines in size, ACToR achieves over 90% test pass rate with zero human intervention. The improvement holds across seven agent-LLM configurations on our micro-benchmark, indicating that the loop is largely independent of the choice of underlying translator and LLM. Compared to a non-adversarial, coverage-driven test-generation baseline, ACToR improves correctness by up to 36.7%. When applied on top of one recent translator, C2SaferRust, ACToR further improves the validation pass rate by 16.6%.
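
The translate-then-discriminate loop can be sketched with ordinary Python functions standing in for the C program and the Rust translations. Everything below is a toy under stated assumptions: `reference`, `CANDIDATES`, the input range, and the iteration budget are hypothetical, and picking the first candidate that passes the suite stands in for the LLM-driven refinement step.

```python
import random

def reference(x):
    """Stand-in for the original C program: absolute value."""
    return x if x >= 0 else -x

# Stand-ins for successive translation attempts; the first two are subtly wrong.
CANDIDATES = [
    lambda x: x,                       # wrong for all negative inputs
    lambda x: x if x > -100 else -x,   # wrong for -99..-1
    lambda x: -x if x < 0 else x,      # matches the reference
]

def passes(candidate, tests):
    return all(candidate(x) == expected for x, expected in tests)

def find_divergence(candidate, rng, trials=2000):
    """Differential fuzzing: return an input where outputs differ, else None."""
    for _ in range(trials):
        x = rng.randint(-1000, 1000)
        if candidate(x) != reference(x):
            return x
    return None

def adversarial_loop(seed=0):
    rng = random.Random(seed)
    tests = [(3, 3)]  # initial test suite
    candidate = CANDIDATES[0]
    for _ in range(10):  # iteration budget
        # Translator step: first candidate that passes the current suite.
        candidate = next(c for c in CANDIDATES if passes(c, tests))
        # Discriminator step: search for a new failing input.
        bad = find_divergence(candidate, rng)
        if bad is None:
            break
        tests.append((bad, reference(bad)))
    return candidate, tests

final, suite = adversarial_loop()
print(len(suite), all(final(x) == reference(x) for x in range(-1000, 1001)))
```

The initial suite contains only a positive input, so the first wrong candidate passes it. Each divergence found by the fuzzer becomes a regression test that rules that candidate out in the next round.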

2510.01857 2026-05-19 cs.AI 版本更新

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

利用逆强化学习从专家示范中学习推理奖励

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

发表机构 * University of Cambridge（剑桥大学）

AI总结本文提出了一种名为Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL)的方法，通过逆强化学习从专家示范中学习推理奖励，以克服传统监督微调的局限性，并在多个数据集上展示了其在训练和推理过程中的有效性。

AI中文摘要

教学大型语言模型（LLMs）在训练后进行推理通常依赖于具有显式结果或过程基础的强化学习奖励函数。然而，在许多现实世界设置中，获得或定义此类奖励函数是困难的，尤其是对于复杂任务，使从专家示范中学习成为有吸引力的替代方法。主流方法监督微调（SFT）训练模型直接模仿专家推理轨迹，但受到离策略学习的一般限制：性能可能对推理时偏离演示中明确覆盖的状态敏感。为了解决这个问题，我们提出了推理对抗逆强化学习（R-AIRL）。与其模仿专家的推理，R-AIRL从专家的思维链中推断出底层的过程级奖励。通过在GSM8K、MMLU-Pro和MedReason上进行实验，我们展示了通过R-AIRL学习的推理奖励函数可以有效地用于整个训练和推理流程：（1）为训练提供训练信号，在大多数考虑的设置中优于SFT，（2）用于推理时的重排序，将pass@1提高高达17.4个点，（3）用于过程级评估，以高达86.1%的准确性局部化推理失败。总体而言，R-AIRL弥合了模仿学习和基于奖励的优化，使从专家思考轨迹中提取有意义的推理信号成为可能。

英文摘要

Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile to inference-time deviations from states explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert Chain-of-Thoughts. Through experiments on GSM8K, MMLU-Pro and MedReason we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.
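
Two of the downstream uses listed above, inference-time reranking and failure localisation, only need a function that scores individual reasoning steps. The sketch below assumes such a function exists. `toy_reward` is a hypothetical stand-in that checks simple additions, not a reward learned with R-AIRL, and using the mean step reward as the trace score is likewise an assumption.

```python
def rerank(candidates, step_reward):
    """Return candidate reasoning traces sorted by mean per-step reward, best first."""
    def score(steps):
        return sum(step_reward(s) for s in steps) / len(steps)
    return sorted(candidates, key=score, reverse=True)

def localize_failure(steps, step_reward):
    """Index of the step with the lowest reward."""
    rewards = [step_reward(s) for s in steps]
    return rewards.index(min(rewards))

def toy_reward(step):
    """Hypothetical stand-in for a learned reward: checks 'a + b = c' arithmetic."""
    left, right = step.split("=")
    a, b = left.split("+")
    return 1.0 if int(a) + int(b) == int(right) else -1.0

# Two hypothetical traces; the first makes an error in its first step.
candidates = [
    ["2 + 2 = 5", "5 + 1 = 6"],
    ["2 + 2 = 4", "4 + 1 = 5"],
]
best = rerank(candidates, toy_reward)[0]
print(best, localize_failure(candidates[0], toy_reward))
```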

2510.00304 2026-05-19 cs.LG cs.AI 版本更新

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

在不断变化的世界中学习的障碍：对学习能力丧失的数学理解

Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

发表机构 * ETH Zürich（苏黎世联邦理工学院）； Apple（苹果公司）

AI总结本文研究了在非平稳环境中深度学习模型因学习能力丧失（LoP）而失效的问题，通过动力系统理论分析了LoP的两个主要机制，并探讨了缓解策略。

AI中文摘要

深度学习模型在平稳数据上表现优异，但在非平稳环境中因一种称为学习能力丧失（LoP）的现象而表现不佳，即其未来学习能力下降。本文从第一性原理出发研究了基于梯度的学习中的LoP。基于动力系统理论，我们通过在参数空间中识别会困住梯度轨迹的稳定流形来正式定义LoP。我们的分析揭示了形成这些陷阱的两种主要机制：由激活饱和产生的冻结单元，以及由表征冗余产生的克隆单元流形。我们的框架揭示了一个根本性的矛盾：在静态设置中促进泛化的属性，如低秩表示和简单性偏差，会在持续学习场景中直接促成LoP。我们通过数值模拟验证了理论分析，并探讨了架构选择或针对性扰动作为潜在的缓解策略。

英文摘要

Deep learning models excel in stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.
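
One way to see why a cloned-unit manifold can trap gradient trajectories is a toy two-unit tanh network. The sketch below is an illustration under assumed settings (network shape, data, learning rate, and step count are all chosen here, not taken from the paper): when two hidden units start with identical weights, they receive identical gradients at every step, so gradient descent never separates them.

```python
import math


def sgd_step(params, data, lr=0.1):
    """One full-batch gradient step on 0.5 * squared error for
    y = a1 * tanh(w1 * x) + a2 * tanh(w2 * x)."""
    w1, w2, a1, a2 = params
    gw1 = gw2 = ga1 = ga2 = 0.0
    for x, target in data:
        h1, h2 = math.tanh(w1 * x), math.tanh(w2 * x)
        err = a1 * h1 + a2 * h2 - target
        ga1 += err * h1
        ga2 += err * h2
        gw1 += err * a1 * (1.0 - h1 * h1) * x  # d tanh(u)/du = 1 - tanh(u)^2
        gw2 += err * a2 * (1.0 - h2 * h2) * x
    n = len(data)
    return (w1 - lr * gw1 / n, w2 - lr * gw2 / n,
            a1 - lr * ga1 / n, a2 - lr * ga2 / n)


def train(params, data, steps=200):
    """Run plain gradient descent for a fixed number of steps."""
    for _ in range(steps):
        params = sgd_step(params, data)
    return params
```

Starting from `(0.3, 0.3, 0.5, 0.5)`, the two units remain exact copies after training, so the network effectively has one unit. A small perturbation of one unit's weights breaks the symmetry, which is in the spirit of the targeted perturbations the abstract mentions as a mitigation.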

2509.19102 2026-05-19 cs.RO cs.AI cs.CV 版本更新

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

发表机构 * TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg（汉堡大学信息学院TAMS（多模态系统技术））； Technical University of Munich（慕尼黑工业大学）； Agile Robots SE（敏捷机器人有限公司）

AI总结本文提出FUNCanon框架，通过功能对象规范化学习姿态感知的动作原语，以实现通用的机器人操作，该方法将长周期操作任务分解为由主体、动词和对象定义的动作片段，从而提升策略的可组合性和可重用性。

Comments project website: https://sites.google.com/view/funcanon, 11 pages

AI中文摘要

从端到端演示中学习通用机器人技能，往往会得到难以泛化到训练分布之外的任务特定策略。因此，我们引入FUNCanon框架，将长周期操作任务转换为一系列动作片段，每个片段由主体、动词和对象定义。这些片段使策略学习聚焦于动作本身，而不是孤立的任务，从而实现组合性和重用性。为了使策略具有姿态感知和类别通用性，我们对功能对象进行规范化，以实现功能对齐和自动操作轨迹迁移，并利用大型视觉语言模型提供的可供性（affordance）线索将对象映射到共享的功能坐标系中。以对象和动作为中心的扩散策略FuncDiffuser在对齐后的数据上训练，自然地遵循对象的可供性和姿态，从而简化学习并提高泛化能力。在模拟和现实基准上的实验表明，该方法实现了类别级泛化、跨任务行为重用和鲁棒的sim2real部署，说明功能规范化为复杂操作领域中可扩展的模仿学习提供了强大的归纳偏置。演示细节和补充材料可在我们的项目网站上获得：https://sites.google.com/view/funcanon。

英文摘要

Learning general-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object-centric and action-centric diffusion policy, FuncDiffuser, trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website: https://sites.google.com/view/funcanon.

2509.18150 2026-05-19 cs.LG cs.AI 版本更新

Improving MLLM Training Efficiency via Stage-Aware Sparsity

通过阶段感知稀疏性提升MLLM训练效率

Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

发表机构 * Peking University（北京大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出了一种基于稀疏表示的高效训练框架STS，通过阶段感知设计适应不同训练阶段的冗余，采用视觉标记压缩器和层动态跳过器来减少计算开销，验证了其在多种MLLM架构上的有效性。

AI中文摘要

多模态大语言模型（MLLMs）在各种领域中表现出色，但训练效率低下，由于长输入序列和未充分利用的层间操作导致大量计算冗余。值得注意的是，这种冗余并非静态，而是随训练阶段变化。基于此观察，我们关注训练过程本身，提出了一种基于稀疏表示的高效训练框架，称为稀疏训练方案（STS）。不同于统一的稀疏性策略，STS采用阶段感知设计，适应训练过程中不同的冗余来源。具体而言，该框架包含两个互补组件：视觉标记压缩器，通过在模态对齐过程中压缩视觉标记来减少信息负载；层动态跳过器，通过在指令微调过程中动态跳过不必要的层来减轻计算开销。我们的方法广泛适用于多种MLLM架构，并已在多个基准上进行了广泛评估，证明了其有效性和效率。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient, as much of the computation is redundant due to the long input sequences from multimodal data and underutilized inter-layer operations. Notably, such redundancy is not static but varies across different stages of training. Building on this observation, we shift the focus to the training process itself and propose a training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). Instead of applying a uniform sparsity strategy, STS adopts a stage-aware design that adapts to different sources of redundancy during training. Specifically, the framework consists of two complementary components: the Visual Token Compressor, which reduces the information load by compressing visual tokens during modality alignment, and the Layer Dynamic Skipper, which mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

2509.16391 2026-05-19 cs.LG cs.AI cs.CV 版本更新

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: 一个用于知识和推理评估的法语医学问答数据集

Adrien Bazoge

发表机构 * Data Clinic, University Hospital of Nantes, France（南特大学医院数据诊所，法国）； Nantes Université, École Centrale Nantes, CNRS, LS2N, France（南特大学，南特中央理工学院，法国国家科学研究中心，LS2N，法国）

AI总结本文提出MediQAl数据集，用于评估语言模型在事实性医学记忆和现实临床场景推理方面的能力，包含32,603个法语医学问题，涵盖41个医学科目，包含三种任务，通过14个大型语言模型的评估发现事实记忆与推理任务之间存在显著性能差距。

DOI: 10.1038/s41597-026-06680-y

AI中文摘要

本文介绍了MediQAl，一个法语医学问答数据集，旨在评估语言模型在事实性医学记忆和现实临床场景推理方面的能力。MediQAl包含32,603个问题，来源于41个医学科目中的法语医学考试。该数据集包含三种任务：(i) 有唯一答案的多项选择题，(ii) 有多个答案的多项选择题，以及(iii) 有短答案的开放性问题。每个问题都被标记为理解或推理，使能够对模型的认知能力进行详细分析。我们通过与14个大型语言模型的广泛评估，包括最近的推理增强模型，验证了MediQAl数据集，并观察到事实记忆与推理任务之间存在显著的性能差距。我们的评估为评估语言模型在法语医学问答上的性能提供了全面的基准，填补了医学领域多语言资源中的关键空白。

英文摘要

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

2507.16307 2026-05-19 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design

钙钛矿-R1：一个专门领域的大型语言模型，用于智能发现前驱体添加剂和实验设计

Xin-De Wang, Zhi-Rui Chen, Peng-Jie Guo, Ze-Feng Gao, Cheng Mu, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China（中国人民大学物理学院）； School of Chemistry and Life Resource, Renmin University of China（中国人民大学化学与生命资源学院）

AI总结本研究提出Perovskite-R1，一个专门用于发现钙钛矿太阳能电池前驱体添加剂和实验设计的大型语言模型，通过系统挖掘和整理1232篇高质量科学文献，并整合33269种候选材料，构建了领域特定的指令微调数据集，从而提升材料发现的效率。

Comments 24 pages; 5 figures

Journal ref Communications Materials 7, 86 (2026)

DOI: 10.1038/s43246-026-01099-9

AI中文摘要

钙钛矿太阳能电池（PSCs）因其卓越的功率转换效率和有利的材料特性而迅速成为下一代光伏技术的有力竞争者。尽管有这些进展，长期稳定性、环境可持续性和可扩展制造等挑战仍然阻碍其商业化。前驱体添加剂工程显示出通过提高PSCs的性能和耐久性来解决这些问题的潜力。然而，科学文献的爆炸性增长以及材料、工艺和设备架构之间的复杂相互作用，使研究人员难以高效地访问、组织和利用该领域内的领域知识。为此，我们介绍了Perovskite-R1，一个具有先进推理能力的专门大型语言模型（LLM），专门用于发现和设计PSC前驱体添加剂。通过系统挖掘和整理1232篇高质量科学出版物，并整合一个包含33,269种候选材料的全面库，我们使用自动问答生成和推理链的方法构建了一个领域特定的指令微调数据集。在该数据集上微调QwQ-32B模型，得到了Perovskite-R1，它可以智能地综合文献见解，生成创新且实用的解决方案用于缺陷钝化和前驱体添加剂的选择。对几个模型提出策略的实验验证证实了它们在提高材料稳定性和性能方面的有效性。我们的工作展示了领域适应的LLM在加速材料发现中的潜力，并提供了一个闭环框架，用于智能、数据驱动的钙钛矿光伏研究进展。

英文摘要

Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.

2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

Geometry-aware 4D Video Generation for Robot Manipulation

面向机器人操作的几何感知4D视频生成

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

发表机构 * Stanford University（斯坦福大学）； Toyota Research Institute（丰田研究院）

AI总结本文提出了一种几何感知的4D视频生成模型，通过跨视角点图对齐进行训练，以确保生成视频在多视角下的3D一致性，从而在单个RGB-D图像输入下生成时空一致的未来视频序列，并在不依赖相机姿态的情况下实现稳定的视觉和空间对齐预测。

Comments ICLR 2026; Project website: https://robot4dgen.github.io

AI中文摘要

理解并预测物理世界的动态可以增强机器人在复杂环境中的规划和交互能力。尽管最近的视频生成模型在建模动态场景方面显示出强大的潜力，但生成在不同摄像机视角下既时间一致又几何一致的视频仍然是一项重大挑战。为此，我们提出了一种4D视频生成模型，通过在训练过程中使用跨视角点图对齐来监督模型，以确保生成视频的多视角3D一致性。通过这种几何监督，模型学习了一个共享的3D场景表示，使其能够从单个RGB-D图像输入中，根据新的视角生成时空一致的未来视频序列，而无需依赖相机姿态作为输入。与现有基线方法相比，我们的方法在多个模拟和现实世界机器人数据集上产生了更稳定和空间对齐的预测。我们进一步表明，预测的4D视频可用于使用现成的6自由度姿态跟踪器恢复机器人末端执行器轨迹，从而生成在新相机视角下具有良好泛化能力的机器人操作策略。

英文摘要

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

2506.23549 2026-05-19 cs.AI cs.HC cs.LG 版本更新

CooT: Learning to Coordinate In-Context with Coordination Transformers

CooT: 利用Coordination Transformers学习上下文内协调

Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

发表机构 * Graduate Institute of Communication Engineering, National Taiwan University (NTU)（国立台湾大学通信工程研究所）； NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)（国立台湾大学人工智能研究中心）； University of Utah（犹他大学）

AI总结本研究提出CooT框架，通过上下文学习实现实时合作伙伴适应，解决了多智能体系统中协调不熟悉合作伙伴的挑战，其核心方法是通过观察学习对齐动作与合作伙伴意图，主要贡献是实现了在多样合作伙伴行为下的泛化能力。

Comments ICML 2026

AI中文摘要

在多智能体系统中，协调不熟悉合作伙伴仍然是一个重大挑战。现有方法，如基于种群的方法，通过多样性提高鲁棒性，但通常缺乏在训练分布之外高效适应的机制。此外，微调在少样本设置中不可行，因为其交互成本高。为了解决这些限制，我们提出了CooT，一个利用上下文学习（ICL）进行实时合作伙伴适应的框架。与以往专注于任务泛化的ICL方法不同，CooT旨在在多样化的合作伙伴行为上实现泛化。在行为偏好智能体的轨迹上训练，它通过观察学习对齐动作与合作伙伴意图。我们在两个具有挑战性的多智能体基准测试中评估了CooT：Overcooked和Google Research Football。结果表明，CooT在性能上始终优于基于种群的方法、基于梯度的微调和Meta-RL基线，实现了稳定且快速的适应，而无需参数更新。人类评估也发现CooT是更受青睐的合作者，我们的消融实验确认了其快速适应新合作伙伴并在突然合作伙伴变化下保持稳定的能力，使其在现实世界的人机协作中具有可靠性。

英文摘要

Effective coordination among unfamiliar partners remains a major challenge in multi-agent systems. Existing approaches, such as population-based methods, improve robustness through diversity but often lack mechanisms for efficient adaptation beyond the training distribution. Moreover, fine-tuning is impractical in few-shot settings due to its high interaction cost. To address these limitations, we propose CooT, a framework that leverages in-context learning (ICL) for real-time partner adaptation. Unlike prior ICL approaches that focus on task generalization, CooT is designed to generalize across diverse partner behaviors. Trained on trajectories from behavior-preferring agents, it learns to align actions with partner intentions purely through observation. We evaluate CooT on two challenging multi-agent benchmarks: Overcooked and Google Research Football. Results show that CooT consistently outperforms population-based methods, gradient-based fine-tuning, and Meta-RL baselines, achieving stable and rapid adaptation without parameter updates. Human evaluations also identify CooT as a preferred collaborator, and our ablations confirm its ability to adapt quickly to new partners and remain stable under sudden partner changes, making it reliable for real-world human-AI collaboration.

2506.17312 2026-05-19 cs.SI cs.AI cs.LG 版本更新

Heterogeneous Temporal Hypergraph Neural Network

异构时序超图神经网络

Huan Liu, Pengfei Jiao, Mengzhou Gao, Chaochao Chen, Di Jin

发表机构 * School of Cyberspace, Hangzhou Dianzi University（杭州电子科技大学网络空间安全学院）； Data Security Governance Zhejiang Engineering Research Center（浙江数据安全治理工程研究中心）； College of Computer Science and Technology, Zhejiang University（浙江大学计算机科学与技术学院）； College of Intelligence and Computing, Tianjin University（天津大学智能与计算学院）

AI总结本文提出了一种异构时序超图神经网络（HTHGN），旨在捕捉复杂异构时序超图中的高阶交互关系，通过引入层次注意力机制和对比学习来提升模型对异构节点和超边之间丰富语义的捕捉能力。

Comments Accepted by IJCAI 2025

DOI: 10.24963/ijcai.2025/347

AI中文摘要

图表示学习（GRL）已成为建模图结构数据的有效技术。在建模现实复杂网络中的异质性和动态性时，针对复杂异构时序图（HTGs）设计的GRL方法已被提出，并在各领域取得了成功应用。然而，大多数现有GRL方法主要关注保留低阶拓扑信息，而忽视了更高阶的组交互关系，这些关系更符合现实网络。此外，大多数现有超图方法只能建模静态同构图，限制了它们对HTGs中高阶交互关系的建模能力。因此，为了同时使GRL模型能够捕捉HTGs中的高阶交互关系，我们首先提出了异构时序超图的正式定义和不依赖额外信息的$P$-均匀异构超边构造算法。然后提出了一种新的异构时序超图神经网络（HTHGN），以完全捕捉HTGs中的高阶交互关系。HTHGN包含一个层次注意力机制模块，同时在异构节点和超边之间进行时间消息传递，以捕捉由超边带来的更宽广感受场中的丰富语义。此外，HTHGN通过最大化HTG中低阶相关异构节点对之间的一致性来进行对比学习，以避免低阶结构的模糊性问题。在三个真实世界HTG数据集上的详细实验结果验证了所提出HTHGN在建模HTGs中高阶交互关系的有效性，并展示了显著的性能提升。

英文摘要

Graph representation learning (GRL) has emerged as an effective technique for modeling graph-structured data. When modeling heterogeneity and dynamics in real-world complex networks, GRL methods designed for complex heterogeneous temporal graphs (HTGs) have been proposed and have achieved successful applications in various fields. However, most existing GRL methods mainly focus on preserving the low-order topology information while ignoring higher-order group interaction relationships, which are more consistent with real-world networks. In addition, most existing hypergraph methods can only model static homogeneous graphs, limiting their ability to model high-order interactions in HTGs. Therefore, to enable the GRL model to capture high-order interaction relationships in HTGs, we first propose a formal definition of heterogeneous temporal hypergraphs and a $P$-uniform heterogeneous hyperedge construction algorithm that does not rely on additional information. Then, a novel Heterogeneous Temporal HyperGraph Neural network (HTHGN) is proposed to fully capture higher-order interactions in HTGs. HTHGN contains a hierarchical attention mechanism module that simultaneously performs temporal message-passing between heterogeneous nodes and hyperedges to capture rich semantics in a wider receptive field brought by hyperedges. Furthermore, HTHGN performs contrastive learning by maximizing the consistency between low-order correlated heterogeneous node pairs on HTG to avoid the low-order structural ambiguity issue. Detailed experimental results on three real-world HTG datasets verify the effectiveness of the proposed HTHGN for modeling high-order interactions in HTGs and demonstrate significant performance improvements.

2506.03837 2026-05-19 cond-mat.supr-con cond-mat.mtrl-sci cs.AI cs.LG 版本更新

HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction

HTSC-2025: 一个用于人工智能驱动临界温度预测的环境压力高温超导体基准数据集

Xiao-Qi Han, Ze-Feng Gao, Xin-De Wang, Zhenfeng Ouyang, Peng-Jie Guo, Zhong-Yi Lu

发表机构 * School of Physics and Beijing Key Laboratory of Opto-electronic Functional Materials & Micro-nano Devices, Renmin University of China, Beijing 100872, China； Key Laboratory of Quantum State Construction and Manipulation (Ministry of Education), Renmin University of China, Beijing 100872, China； Hefei National Laboratory, Hefei 230088, China

AI总结本文提出HTSC-2025基准数据集，包含2023至2025年由理论物理学家基于BCS超导理论预测的高温超导材料，旨在促进人工智能在超导材料发现中的应用。

Comments 7 pages, 2 figures

Journal ref Chinese Physics B 34, 100301 (2025)

DOI: 10.1088/1674-1056/adf042

AI中文摘要

高温超导材料的发现对人类工业和日常生活具有重要意义。近年来，利用人工智能（AI）预测超导转变温度的研究日益流行，大多数工具声称实现了显著的准确性。然而，该领域缺乏广泛接受的基准数据集，严重阻碍了不同AI算法之间的公平比较以及这些方法的进一步发展。在本工作中，我们提出了HTSC-2025，一个环境压力高温超导基准数据集。该数据集全面涵盖了基于BCS超导理论由理论物理学家在2023至2025年间发现的理论预测超导材料，包括著名的X₂YH₆系统、钙钛矿MXH₃系统、M₃XH₈系统、源自LaH₁₀结构演化的笼状BCN掺杂金属原子系统，以及从MgB₂演化而来的二维蜂窝状系统。HTSC-2025基准数据集已开源在https://github.com/xqh19970407/HTSC-2025并将持续更新。该基准数据集对加速基于人工智能方法的超导材料发现具有重要意义。

英文摘要

The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence (AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at https://github.com/xqh19970407/HTSC-2025 and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.

2505.20650 2026-05-19 cs.CL cs.AI cs.CE 版本更新

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging: 评估LLM提取和结构化财务信息

Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie

发表机构 * Georgia Institute of Technology（佐治亚理工学院）； Columbia University（哥伦比亚大学）； California State University（加州州立大学）； University of Montreal（蒙特利尔大学）； Carnegie Mellon University（卡内基梅隆大学）； Rensselaer Polytechnic Institute（伦斯勒理工学院）； The University of Manchester（曼彻斯特大学）； Harvard University（哈佛大学）

AI总结本文提出FinTagging基准，用于评估LLM在提取和结构化财务信息方面的能力，通过分解为FinNI和FinCL两个子任务，揭示了LLM在细粒度概念链接上的局限性。

AI中文摘要

准确解读财务报告中的数字数据对市场和监管机构至关重要。尽管XBRL（可扩展商业报告语言）提供了对财务数据进行标记的标准，但将数千个事实映射到超过1万项美国通用会计准则（US GAAP）概念仍然成本高昂且容易出错。现有基准将此任务简化为对小概念子集的扁平单步分类，忽略了分类法的层次语义和财务文档的结构特性。因此，这些基准无法评估LLM在真实报告条件下的表现。为弥合这一差距，我们引入FinTagging，首个全面的结构感知和全范围XBRL标记基准。我们将复杂的标记过程分解为两个子任务：（1）FinNI（财务数字识别），从异构上下文中提取实体和类型；（2）FinCL（财务概念链接），将提取的实体映射到完整的US GAAP分类法。这种两阶段的框架使能够公平评估LLM在数值推理和分类法对齐方面的能力。在零样本设置下评估多种LLM发现，尽管模型在提取方面表现良好，但在细粒度概念链接上存在显著困难，突显了领域特定结构感知推理的关键限制。

英文摘要

Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over 10k US GAAP concepts remains costly and error-prone. Existing benchmarks oversimplify this task as flat, single-step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure-aware and full-scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts including text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US GAAP taxonomy. This two-stage formulation enables a fair assessment of LLMs' capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero-shot settings reveals that while models generalize well in extraction, they struggle significantly with fine-grained concept linking, highlighting critical limitations in domain-specific, structure-aware reasoning.

2505.09203 2026-05-19 cond-mat.mtrl-sci cond-mat.supr-con cs.AI cs.LG 版本更新

InvDesFlow-AL: active learning-based workflow for inverse design of functional materials

InvDesFlow-AL: 基于主动学习的反向设计功能材料工作流程

Xiao-Qi Han, Peng-Jie Guo, Ze-Feng Gao, Hao Sun, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China（中国人民大学物理学院）； Gaoling School of Artificial Intelligence, Renmin University of China（中国人民大学高瓴人工智能学院）； School of Engineering Science, University of Chinese Academy of Sciences（中国科学院大学工程科学学院）

AI总结本研究提出了一种基于主动学习的反向设计功能材料框架InvDesFlow-AL，通过迭代优化材料生成过程，提高性能特征的准确性，并在低形成能和低Ehull材料设计中取得显著成果，成功发现超导材料Li₂AuH₆。

Comments 29 pages, 11 figures

Journal ref npj Computational Materials 11, 364 (2025)

DOI: 10.1038/s41524-025-01830-z

AI中文摘要

开发具有特定性能的功能材料的反向设计方法对于推进可再生能源、催化、能量存储和碳捕集等领域的进步至关重要。基于扩散原理的生成模型可以直接生成满足性能约束的新材料，从而显著加速材料设计过程。然而，现有生成和预测晶体结构的方法往往受限于低成功率。在本工作中，我们提出了一种新的反向材料设计生成框架InvDesFlow-AL，该框架基于主动学习策略。该框架可以迭代优化材料生成过程，逐步引导其向期望的性能特征发展。在晶体结构预测方面，InvDesFlow-AL模型实现了RMSE为0.0423 Å，相比现有生成模型性能提高了32.96%。此外，InvDesFlow-AL已成功应用于低形成能和低Ehull材料的设计。它可以系统地生成具有逐步降低形成能的材料，同时在多样化的化学空间中不断扩展探索。这些结果充分证明了所提出的基于主动学习的生成模型在加速材料发现和反向设计中的有效性。为进一步证明该方法的有效性，我们以InvDesFlow-AL探索的常压下BCS超导体搜索为例。结果，我们成功发现了Li₂AuH₆作为传统BCS超导体，具有超高的转变温度140 K。这一发现为反向设计在材料科学中的应用提供了有力的实证支持。

英文摘要

Developing inverse design methods for functional materials with specific properties is critical to advancing fields like renewable energy, catalysis, energy storage, and carbon capture. Generative models based on diffusion principles can directly produce new materials that meet performance constraints, thereby significantly accelerating the material design process. However, existing methods for generating and predicting crystal structures often remain limited by low success rates. In this work, we propose a novel inverse material design generative framework called InvDesFlow-AL, which is based on active learning strategies. This framework can iteratively optimize the material generation process to gradually guide it towards desired performance characteristics. In terms of crystal structure prediction, the InvDesFlow-AL model achieves an RMSE of 0.0423 Å, representing a 32.96% improvement in performance compared to existing generative models. Additionally, InvDesFlow-AL has been successfully validated in the design of low-formation-energy and low-Ehull materials. It can systematically generate materials with progressively lower formation energies while continuously expanding the exploration across diverse chemical spaces. These results fully demonstrate the effectiveness of the proposed active learning-driven generative model in accelerating material discovery and inverse design. To further demonstrate the effectiveness of this method, we used InvDesFlow-AL to search for BCS superconductors under ambient pressure as an example. As a result, we successfully identified Li$_2$AuH$_6$ as a conventional BCS superconductor with an ultra-high transition temperature of 140 K. This discovery provides strong empirical support for the application of inverse design in materials science.

2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild：面向真实场景的机器人策略的灵巧交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

发表机构 * Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出DexWild框架，通过结合人类和机器人示范数据，提升机器人在多样化环境中的泛化能力，实验表明其在未见环境中的成功率显著高于传统方法。

Comments In RSS 2025. Website at https://dexwild.github.io

AI中文摘要

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径，但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集，但其高成本限制了可扩展性。相反，如果人们可以像在日常生活中一样使用自己的手来收集数据呢？在DexWild中，一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据，我们创建了DexWild-System，一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练，相较于单独训练每个数据集，其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略，只需少量额外的机器人特定数据。实验结果表明，DexWild显著提高了性能，在未见环境中实现了68.5%的成功率，几乎是仅使用机器人数据训练的策略的四倍，并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

英文摘要

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments, nearly four times higher than policies trained with robot data only, and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions are available at https://dexwild.github.io.

2505.06907 2026-05-19 cs.AI cs.CV cs.NE 版本更新

A Survey on Foundation Models for Personalized Federated Intelligence

面向个性化联邦智能的基础模型综述

Yu Qiao, Huy Q. Le, Avi Deb Raha, Phuong-Nam Tran, Apurba Adhikary, Mengchun Zhang, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong

发表机构 * School of Computing, Kyung Hee University（韩国庆熙大学计算机学院）； Noakhali Science and Technology University（诺阿克利科学与技术大学）； Korea Advanced Institute of Science and Technology（韩国科学技术院）； College of Computing and Data Science, Nanyang Technological University（南洋理工大学计算与数据科学学院）

AI总结本文综述了基础模型在个性化联邦智能中的应用，探讨了联邦学习与基础模型的结合，提出了一种新的个性化联邦智能范式，旨在为实现人工智能个性化提供基础支持。

Comments Accepted by ACM Computing Surveys

AI中文摘要

大语言模型（如ChatGPT、Gemini和Grok）的兴起重塑了人工智能领域。作为基础模型（FMs）的典型实例，它们在生成类人内容方面表现出色，推动人工智能向通用人工智能（AGI）迈进。然而，它们的规模庞大、隐私敏感和计算需求高，给个性化定制带来了挑战。为此，我们提出了人工智能个性化（API）的愿景，专注于将FMs适应到个体用户，同时确保隐私。作为API的核心赋能者，我们提出个性化联邦智能（PFI），这是一种新的范式，不仅整合了联邦学习（FL）的隐私优势和FMs的泛化能力，还将个性化置于核心。为此，我们首先回顾了最近的FL和FMs进展，为PFI奠定基础。然后，我们探讨了PFI流水线的核心阶段：边缘的高效个性化、可信的适应和通过检索增强生成的自适应细化。最后，我们强调了实现PFI的未来方向。总体而言，本文的综述旨在为API的发展奠定基础，作为AGI的补充方向，PFI是关键的赋能范式。

英文摘要

The rise of large language models (LLMs), such as ChatGPT, Gemini, and Grok, has reshaped the AI landscape. As prominent instances of foundational models (FMs), they exhibit remarkable capabilities in generating human-like content, pushing the boundaries towards artificial general intelligence (AGI). However, their large-scale nature, privacy sensitivity, and substantial computational demands pose significant challenges for personalized customization for end users. To bridge this gap, we present the vision of artificial personalized intelligence (API), which focuses on adapting FMs to individual users while ensuring privacy. As a central enabler of API, we propose personalized federated intelligence (PFI), a new paradigm that not only integrates the privacy benefits of federated learning (FL) with the generalization capabilities of FMs but also places personalization at its core. To this end, we first survey recent advances in FL and FMs that lay the foundation for PFI. We then explore core stages of the PFI pipeline: efficient personalization at the edge, trustworthy adaptation, and adaptive refinement via retrieval-augmented generation. Finally, we highlight future directions for enabling PFI. Overall, this survey aims to lay a foundation for the development of API as a complementary direction to AGI, with PFI as a key enabling paradigm.

2505.00409 2026-05-19 eess.AS cs.AI cs.LG 版本更新

Perceptual implications of automatic anonymization in pathological speech

病态语音中自动匿名化的人感知影响

Soroosh Tayebi Arasteh, Saba Afza, Tri-Thien Nguyen, Lukas Buess, Maryam Parvin, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Hiu Ching Hung, Mahshad Lotfinia, Thomas Gorges, Elmar Noeth, Maria Schuster, Seung Hee Yang, Andreas Maier

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. Department of Urology, Stanford University, Stanford, CA, USA. Department of Radiology, Stanford University, Stanford, CA, USA. Lab for AI in Medicine, RWTH Aachen University, Aachen, Germany. Department of Diagnostic & Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany. Institute of Radiology, University Hospital Erlangen, Erlangen, Germany. Department of Foreign Language Education, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. Department of Otorhinolaryngology, Head & Neck Surgery, Ludwig-Maximilians-Universität München, Munich, Germany. Speech & Language Processing Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.

AI总结本研究通过结构化协议评估自动匿名化病态语音的人感知影响，发现匿名化在不同疾病中存在显著差异，且感知质量下降，但临床严重程度评分保持稳定，同时发现感知结果与计算隐私指标脱钩。

详情

利用无监督学习实现高效视觉异常检测

Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge（剑桥大学工程系）

AI总结本研究提出一种低成本视觉异常检测系统，通过预训练模型和低成本硬件，利用少量数据实现高准确率的异常检测，适用于中小型企业。

详情

AI中文摘要

传统的基于机器学习的视觉检测系统需要大量数据收集和重复模型训练来提高准确性。这些系统通常需要昂贵的相机、计算设备和显著的机器学习专业知识，这对中小型企业构成重大负担。本研究探索利用预训练模型和低成本硬件的无监督学习方法，开发一种高效的视觉异常检测系统。该系统利用Anomalib的无监督学习模型，并通过openVINO部署在经济型Raspberry Pi硬件上。结果表明，该系统仅用10张正常产品图像即可在Raspberry Pi上完成异常检测的训练和推理，耗时仅90秒，达到F1宏评分超过0.95的性能。尽管系统对环境变化如光照、产品摆放或背景略有敏感，但其仍为中小型企业提供了一种快速且经济的工厂自动化检测方法。代码可在https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning获取。

英文摘要

Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive cameras, computing equipment, and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through OpenVINO. The results show that this cost-effective system can complete anomaly detection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers. The code is available at https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning.
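
The abstract reports an F1 macro score without spelling out the metric. As a reading aid, the following is a minimal sketch of macro-averaged F1 in plain Python: the unweighted mean of the per-class F1 scores. The function name `f1_macro` and the toy labels are illustrative assumptions, not taken from the paper or its repository.

```python
def f1_macro(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (illustrative helper)."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        # Count true positives, false positives and false negatives per class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy labels: 0 = normal, 1 = anomalous (hypothetical data).
    print(f1_macro([0, 0, 1, 1], [0, 1, 1, 1]))
```

Because every class contributes equally, a detector that misses a rare anomaly class is penalised as heavily as one that misclassifies the common normal class.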

URL PDF HTML ☆

赞 0 踩 0

2404.10981 2026-05-19 cs.IR cs.AI cs.CL 版本更新

A Survey on Retrieval-Augmented Text Generation for Large Language Models

基于大型语言模型的检索增强文本生成综述

Yizheng Huang, Jimmy Huang

发表机构 * York University（约克大学）

AI总结本文综述了检索增强文本生成方法，探讨了其在提升大型语言模型生成准确性和可靠性方面的核心方法与主要贡献。

Comments Ongoing Work

Journal ref ACM Computing Surveys, Volume 58, Issue 12, Article No.: 300, Pages 1 - 38, 2026

详情

DOI: 10.1145/3805774

AI中文摘要

检索增强生成（RAG）将检索方法与深度学习进展相结合，以解决大型语言模型（LLMs）静态限制的问题，通过动态整合最新外部信息。该方法主要关注文本领域，提供了一种成本效益高的解决方案，以生成合理但可能不正确的响应，从而通过真实世界数据提高LLMs的准确性和可靠性。随着RAG的复杂性增加并整合多个可能影响其性能的概念，本文将RAG范式分为四个类别：预检索、检索、后检索和生成，从检索角度提供详细视角。它概述了RAG的发展历程，并通过分析重要研究讨论了该领域的进步。此外，本文介绍了RAG的评估方法，解决了所面临的挑战，并提出了未来研究方向。通过提供有组织的框架和分类，该研究旨在整合现有的RAG研究，明确其技术基础，并突出其扩展大型语言模型适应性和应用潜力的潜力。

英文摘要

Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but possibly incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.
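
To make the four-stage taxonomy concrete, here is a toy sketch of a pipeline split into pre-retrieval, retrieval, post-retrieval, and generation. Everything in it is an assumption for illustration: the function names, the keyword-overlap scoring, and the prompt template are hypothetical, and the generation step only assembles a prompt where a real system would call an LLM.

```python
def pre_retrieval(query):
    # Pre-retrieval: normalise the text into lowercase keyword tokens.
    tokens = (tok.strip(".,?!") for tok in query.lower().split())
    return [tok for tok in tokens if tok]


def retrieve(tokens, corpus):
    # Retrieval: score each document by keyword overlap with the query.
    query_terms = set(tokens)
    return [(len(set(pre_retrieval(doc)) & query_terms), doc) for doc in corpus]


def post_retrieval(scored, top_k=2):
    # Post-retrieval: drop non-matching documents, keep the top_k by score.
    kept = [(score, doc) for score, doc in scored if score > 0]
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in kept[:top_k]]


def generate(query, passages):
    # Generation: stand-in for an LLM call; this only builds the prompt.
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {query}"


if __name__ == "__main__":
    corpus = [
        "RAG combines retrieval with generation.",
        "Bananas are yellow.",
        "Retrieval quality affects generation.",
    ]
    query = "How does retrieval help generation?"
    passages = post_retrieval(retrieve(pre_retrieval(query), corpus))
    print(generate(query, passages))
```

Keeping the stages as separate functions mirrors the survey's categorisation: each one can be swapped out (for example, a dense retriever or a reranker) without touching the others.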

URL PDF HTML ☆

赞 0 踩 0

2308.06197 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

利用基本特征的深度知识蒸馏进行复杂面部表情识别

Angus Maiden, Bahareh Nakisa

发表机构 * School of Information Technology, Deakin University（迪肯大学信息技术学院）

AI总结本文提出了一种基于持续学习的方法，通过知识蒸馏和新颖的预测排序记忆重放，实现了复杂面部表情识别的最新状态，能够在少量样本下准确识别新复合表情类别。

Comments 13 pages, 9 figures, 6 tables, 3 algorithms. Code available at https://github.com/AngusMaiden/complex-FER

详情

DOI: 10.1109/DICTA68720.2025.11302420

AI中文摘要

复杂情绪识别是一种认知任务，迄今为止尚未达到与其他处于或高于人类认知水平的任务相同的优秀性能。通过面部表情识别情绪尤其困难，因为人类面部表达的情绪复杂性。为了使机器在复杂面部表情识别方面达到人类的水平，可能需要实时综合知识和理解新概念，就像人类所做的那样。人类能够仅通过少量示例学习新概念，通过从记忆中蒸馏重要信息。受人类认知和学习的启发，我们提出了一种新的持续学习方法，用于复杂面部表情识别，通过在基本表情类别上构建和保留知识，能够使用少量训练样本准确识别新的复合表情类别。在本工作中，我们还使用GradCAM可视化来展示基本和复合面部表情之间的关系。我们的方法通过知识蒸馏和一种新颖的预测排序记忆重放来利用这种关系，实现了复杂面部表情识别持续学习的最新状态，新类别的总体准确率为74.28%。我们还证明了使用持续学习进行复杂面部表情识别的性能远优于非持续学习方法，比最先进的非持续学习方法提高了13.95%。我们的工作也是首次将少样本学习应用于复杂面部表情识别，仅使用每个类别一个训练样本，就实现了100%的准确率，达到了最先进的水平。

英文摘要

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only a few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.

2204.01611 2026-05-19 cs.AI 版本更新

A Machine With Human-Like Memory Systems

具有人类样记忆系统的机器

Taewoon Kim, Michael Cochez, Vincent Francois-Lavet, Mark Neerincx, Piek Vossen

发表机构 * Vrije Universiteit Amsterdam（阿姆斯特丹自由大学）； Technische Universiteit Delft（代尔夫特理工大学）

AI总结本文提出了一种同时具备语义记忆和事件记忆的智能体，证明双记忆系统优于单一记忆系统，并通过自研环境

Comments Submitted to Human-Centered Design of Symbiotic Hybrid Intelligence 2022 (https://ii.tudelft.nl/humancenteredsymbioticHI/)

2605.17361 2026-05-19 cs.LG cs.AI 版本更新

通过跨模态语义对齐实现面向视觉-语言模型的单样本黑盒成员推断攻击

Jiaqing Li, Yajuan Lu, Xiaochuan Shi, Gang Wu, ZhongYuan Wang, Chao Liang

发表机构 * Wuhan University（武汉大学）； Tarim University（塔里木大学）

AI总结本文提出了一种基于跨模态语义对齐的新型成员推断攻击框架，针对视觉-语言模型在单样本和黑盒场景下的数据安全风险进行评估，通过量化联合嵌入空间中的对齐程度，显著提升了攻击性能。

详情

AI中文摘要

视觉-语言模型（VLMs）虽取得了显著成功，但其依赖大规模数据集和意外记忆训练数据，带来了重大数据安全风险。成员推断攻击（MIAs）旨在通过确定数据样本是否包含在模型训练集中来评估这些风险。然而，现有针对VLMs的MIAs方法面临关键瓶颈：灰盒方法依赖于内部logits，通常在实际应用程序接口（APIs）中受限，而黑盒方法依赖于大规模统计分布，在单样本场景中表现不佳。为此，我们从跨模态语义对齐的角度研究MIAs，并观察到成员图像由于训练记忆表现出显著更强的图像-描述对齐，而生成的非成员描述可能偏离原始视觉内容。基于这一洞察，我们提出了一种针对严格黑盒和单样本场景的新MIAs框架，该框架在联合嵌入空间中量化此类对齐，从而绕过这些不现实的假设。我们在三个开源和两个闭源VLMs上进行了广泛实验。在VL-MIA/Flicker数据集上，我们的方法在LLaVA-1.5上实现了0.821的AUC，显著优于现有基线。此外，它在各种图像扰动下仍保持稳健，突显了其实用性。

英文摘要

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risks. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box methods rely on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box methods depend on large-scale statistical distributions and struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box and single-sample settings that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.
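
The core idea, scoring membership by how closely an image and its generated caption align in a joint embedding space, can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not specify the similarity function or decision rule, so cosine similarity and the `threshold=0.8` cut-off are hypothetical choices, and the embeddings are assumed to come from some external joint image-text encoder that is not shown.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # Treat a zero vector as having no alignment.
    return dot / (norm_u * norm_v)


def infer_membership(image_emb, caption_emb, threshold=0.8):
    """Return (alignment score, predicted-member flag).

    image_emb and caption_emb are assumed to be embeddings of the query image
    and of the caption the target VLM generated for it. The threshold is a
    hypothetical value that would need calibration in practice.
    """
    score = cosine_similarity(image_emb, caption_emb)
    return score, score >= threshold


if __name__ == "__main__":
    print(infer_membership([0.9, 0.1, 0.0], [0.8, 0.2, 0.1]))
```

The sketch needs only the generated caption, not logits or a population of reference samples, which is what makes the setting black-box and single-sample.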

2605.17329 2026-05-19 cs.CR cs.AI 版本更新

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

LPG: 在潜在策略护栏中平衡效率与政策推理

Nanxi Li, Zhengyue Zhao, Chaowei Xiao

发表机构 * Johns Hopkins University（约翰霍普金斯大学）

AI总结本文提出LPG框架，通过学习动态政策的语义潜在推理，在保持高安全准确率的同时实现低延迟的政策执行。

详情

AI中文摘要

护栏是现代AI系统的关键安全层，但其运行模式正在发生变化。随着LLMs被用作定制助手，安全策略越来越多地在推理时由用户、组织或监管环境指定。这使得安全执行本质上是动态的：护栏应适应变化的安全策略而无需重新训练。然而，这一要求带来了根本性的矛盾：忠实判断复杂的政策环境需要推理能力，而实际部署需要低延迟响应。我们介绍了潜在策略护栏（LPG），一种学习动态政策的语义潜在推理的框架。LPG将意图解释和政策基础所需的内部推理压缩成连续状态，这些状态由决策相关的语义监督。在推理时，它只生成一个紧凑的裁定，该裁定基于违反的政策条款，从而在保持可审计性的同时避免显式推理的延迟。在政策护栏基准测试中，LPG-4B在将推理压缩到仅10个潜在标记的情况下，达到了84.5%的平均安全准确率和77.9%的F1分数，优于最强的动态基线，同时在单样本评估设置下运行速度大约快11倍于Qwen3-4B-Thinking。代码和数据可在https://github.com/SaFo-Lab/Latent_Policy_Guard上获得。

英文摘要

Guardrails are a critical safety layer for modern AI systems, but their operating regime is changing. As LLMs are deployed as customized assistants, safety policies are increasingly specified at inference time by users, organizations, or regulatory contexts. This makes safety enforcement fundamentally dynamic: the guardrail should adapt to changing safety policies without retraining. Yet this requirement creates a fundamental tension: faithfully judging complex policy contexts demands reasoning capability, while practical deployment requires low-latency responses. We introduce Latent Policy Guardrail (LPG), a guardrail framework that learns semantic latent deliberation over dynamic policies. LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time, it generates only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency of explicit reasoning. Across policy guardrail benchmarks, LPG-4B reaches 84.5% average safety accuracy and 77.9% F1 by compressing deliberation into just 10 latent tokens, outperforming the strongest dynamic baseline while running roughly 11 times faster than Qwen3-4B-Thinking under the single-sample evaluation setup. Code and data are available at https://github.com/SaFo-Lab/Latent_Policy_Guard.

2605.17327 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

利用前馈3D模型实现单目视觉-惯性系统的高效无特征初始化

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

发表机构 * MBZUAI（穆罕默德·本·扎耶德人工智能大学）； HKUST (GZ)（香港科技大学（广州））； Zhejiang University（浙江大学）

AI总结本文提出了一种无需视觉特征跟踪的初始化框架，利用前馈3D模型预测的点云，从而提高了单目视觉-惯性导航系统的初始化可靠性与效率，实验表明其初始化成功率超过90%且数据需求显著减少。

详情

AI中文摘要

快速且可靠的初始化对于单目视觉-惯性导航系统（VINS）至关重要，因为它为后续的状态估计建立了初始条件。尽管已有显著进展，但大多数现有方法仍依赖于视觉特征对应关系，并需要3-4秒的传感器数据才能成功初始化，这限制了它们的应用性和效率。随着前馈3D模型的出现，这些模型可以直接从图像预测点云，我们重新从简洁的角度审视视觉-惯性初始化问题。在本文中，我们提出了一种特征-free初始化框架，利用前馈3D模型预测的点云，从而避免了视觉特征跟踪和估计的需要。这种设计显著降低了系统复杂性并提高了初始化的可靠性。在公开数据集上的实验表明，所提出的特征-free初始化方法实现了最高成功率，超过90%，并且显著减少了成功初始化所需的数据持续时间，通常降至1.2秒以下。我们进一步在自采集的数据集上验证了我们的方法，覆盖了各种室内和室外场景，展示了鲁棒性能，特别是在现有方法常失败的视觉退化环境中。代码和数据集可在https://github.com/Yuantai-Z/FF-VIO-Init获取。

英文摘要

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

2605.17324 2026-05-19 cs.CR cs.AI 版本更新

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

ASPI：寻求澄清会放大LLM代理中的提示注入漏洞

Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino, Max Fenkell

发表机构 * Scale AI ； Boston University（波士顿大学）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； Human Frontier Collective（人类前沿集体）

AI总结该研究探讨了LLM代理在执行任务时寻求澄清的行为对提示注入攻击的影响，发现这种行为会显著增加代理的脆弱性，并提出了ASPI基准测试来评估这一现象。

详情

AI中文摘要

寻求澄清的行为被视为LLM代理的有益特性，使它们能够在执行未明确的任务前解决歧义。然而，这种交互模式的安全影响尚未被探索。我们研究了从标准执行到寻求澄清状态的转变是否会使代理更容易受到提示注入攻击。我们引入了ASPI（模糊状态提示注入），这是一个包含728个任务-攻击场景的基准测试，将澄清作为独立的代理状态，并测量这种状态转换在受控条件下对脆弱性的影响。每个基准测试实例都在匹配的执行和澄清设置下进行评估：在执行设置中，代理在完全指定的指令下执行，并仅通过工具返回的数据遇到对抗性内容；在澄清设置中，代理必须首先请求并纳入额外的用户输入后再执行。我们评估了十种前沿LLM，并发现寻求澄清的行为始终显著地放大了脆弱性。例如，攻击成功率从o3的1.8%增加到34.0%，从Gemini-3-Flash的2.2%增加到35.7%。分解分析显示，这种差距反映了模型处理输入方式的状态依赖性变化以及由于代理请求澄清接口而产生的通道特定效应。这些发现表明，标准执行时间的安全评估系统性地低估了交互代理的攻击面，并且在完全指定任务下的鲁棒性不等于在歧义情况下的鲁棒性。为了可重复性，我们的数据和源代码可在https://github.com/scaleapi/aspi上获得。

英文摘要

Clarification-seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification-seeking state increases an agent's susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous-State Prompt Injection), a benchmark of 728 task-attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool-returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification-seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash. A decomposition analysis reveals that this gap reflects both a state-dependent shift in how models process incoming content and a channel-specific effect arising from the agent-solicited clarification interface. These findings demonstrate that standard execution-time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at https://github.com/scaleapi/aspi.

2605.17320 2026-05-19 cs.OS cs.AI 版本更新

TClone: Low-Latency Forking of Live GUI Environments for Computer-Use Agents

TClone：用于计算机使用代理的低延迟活GUI环境分叉

Yutong Huang, Vikranth Srivatsa, Alex Asch, Hansin Tushar Patwa, Yiying Zhang

发表机构 * University of California, San Diego（加州大学圣迭戈分校）； GenseeAI

AI总结 TClone通过分离快速分支创建与持久快照，实现了对计算机使用代理的活GUI环境的低延迟版本控制，从而提高代理执行的安全性和质量。

详情

AI中文摘要

计算机使用代理越来越多地在活的个人工作空间中运行，其操作可以修改文件、应用程序、GUI状态、凭证和认证会话。这在安全性和质量之间产生了张力：代理需要隔离和回滚以避免损坏用户状态，但同时也需要快速分支支持推测执行和并行搜索。现有的虚拟机、容器和检查点/恢复系统可以隔离或恢复工作负载，但它们不提供完整交互工作空间的低延迟版本控制。我们提出了TClone，一种为计算机使用代理设计的可分叉个人工作空间系统。TClone使活GUI工作空间能够被快照、分叉为隔离分支、回滚，并选择性地提交或合并。其设计通过使用兄弟容器、写时复制内存共享、文件系统版本控制、本地GUI执行和异步检查点来分离快速分支创建与持久快照。在我们的端到端代理循环测量中，TClone将总任务延迟分别降低了1.9倍和1.5倍，相比KVM和CRIU。通过将工作空间版本控制作为系统的第一类原语，TClone在真实个人计算环境中支持更安全和高质量的代理执行。

英文摘要

Computer-use agents increasingly operate inside live personal workspaces, where their actions can modify files, applications, GUI state, credentials, and authenticated sessions. This creates a tension between safety and quality: agents need isolation and rollback to avoid damaging user state, but also need fast branching to support speculative execution and parallel search. Existing VMs, containers, and checkpoint/restore systems can isolate or recover workloads, but they do not provide low-latency versioning of a full interactive workspace. We present TClone, a forkable personal workspace system for computer-use agents. TClone enables a live GUI workspace to be snapshotted, forked into isolated branches, rolled back, and selectively committed or merged. Its design separates fast branch creation from durable checkpointing, using sibling containers, copy-on-write memory sharing, filesystem versioning, GUI-local execution, and asynchronous checkpointing. In our end-to-end agent-loop measurement, TClone reduces total task latency by 1.9x and 1.5x over KVM and CRIU. By making workspace versioning a first-class systems primitive, TClone supports safer and higher-quality agent execution over real personal computing environments.

2605.17316 2026-05-19 cs.LG cs.AI 版本更新

Learning Higher-Order Structure from Incomplete Spatiotemporal Data: Multi-Scale Hypergraph Laplacians with Neural Refinement

从不完整时空数据中学习高阶结构：具有神经细化的多尺度超图拉普拉斯算子

Keshu Wu, Sixu Li, Zihao Li, Zhiwen Fan, Xiaopeng Li, Yang Zhou

发表机构 * Texas A&M University（德克萨斯农工大学）； University of Wisconsin-Madison（威斯康星大学麦迪逊分校）

AI总结本文提出了一种多尺度超图拉普拉斯（MSHL）框架，通过两阶段方法从不完整时空观测中学习高阶结构。该方法通过发现阶段构建多尺度超图，并在细化阶段引入条件残差网络，以处理高阶关系中的残差特征，从而在交通网络中实现了更准确的缺失数据填补。

详情

AI中文摘要

传感器网络日益成为现代基础设施的核心，然而标准填补基准所假设的均匀随机缺失模式往往不适用于实际场景。环形检测器在校准期间会断线，路边柜子会沉默附近传感器的集群，而新安装的仪器则无法提供历史数据。这些故障会产生结构化的缺失，其值受传感器组之间的高阶关系约束，而非仅仅是成对接近性。现有低秩和图方法往往无法捕捉这种集体结构，当缺失性变得一致时可能会失效。本文引入多尺度超图拉普拉斯（MSHL），一种两阶段框架，用于从不完整的时空观测中学习高阶结构。发现阶段通过互补的拓扑和残差相关证据构建多尺度超图，并采用仅基于观测的选取器，适应支持的交互尺度。细化阶段添加一个小型超图条件残差网络，其安全性由构造保证：在存在信息残差特征时学习非线性修正，在不存在时则退化为线性估计。我们证明MSHL可以表示无法被成对图先验捕捉的组内守恒模式，能够适应最佳固定尺度，至多一个对数因子，将这种优势转移到验证的填补误差中，并允许单侧细化保证。在两个真实交通网络上评估，针对散落单元缺失、连续块断电和整个传感器黑箱在五种速率下，MSHL在高阶结构可识别时优于成对图基线，否则在采样噪声范围内匹配。结果表明，可靠的基础设施学习存在更广泛的原则：缺失数据不应被视为孤立的填补条目，而应视为发现结构的证据。

英文摘要

Sensor networks increasingly govern modern infrastructure, yet the data they lose are rarely missing in the uniform-random patterns assumed by standard imputation benchmarks. Loop detectors go offline during calibration, roadside cabinets silence clusters of nearby sensors, and newly installed instruments provide no history. Such failures create structured absences whose values are constrained by higher-order relations among groups of sensors, not merely by pairwise proximity. Existing low-rank and graph-based methods often miss this collective structure and can fail when missingness becomes coherent. We introduce Multi-Scale Hypergraph Laplacians (MSHL), a two-stage framework for learning higher-order structure from incomplete spatiotemporal observations. The Discovery stage builds a multi-scale hypergraph from complementary topology and residual-correlation evidence, with an observation-only selector that adapts to the supported interaction scale. The Refinement stage adds a small hypergraph-conditioned residual network that is safe by construction: it learns nonlinear corrections where informative residual features exist and defers to the linear estimate where they do not. We prove that MSHL represents group-conservation patterns inaccessible to pairwise graph priors, adapts to the best fixed scale up to a logarithmic factor, transfers this advantage to held-out imputation error, and admits a one-sided refinement guarantee. On two real traffic networks evaluated across scattered cell missingness, contiguous block outages, and whole-sensor blackouts at five rates, MSHL improves over a pairwise-graph baseline whenever higher-order structure is identifiable and otherwise matches it within sampling noise. The results point to a broader principle for reliable infrastructure learning: missing data should be treated not as isolated entries to fill, but as evidence of structure to discover.

URL PDF HTML ☆

赞 0 踩 0

2605.17314 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Wei Deng

发表机构 * Independent Researcher（独立研究者）

AI总结本文研究了通过较小较弱模型的不匹配错误草稿引导更强学习者的能力，发现这种策略在MATH-500和AIME 2025/2026等任务上表现优异，主要贡献是提出了一种有效的训练方法。

详情

AI中文摘要

我们考虑是否可以利用较小、较弱模型的离线经验来引导更强的学习者，使其在在线策略学习（如GRPO）无法达到的能力。我们发现，将数学上错误但更领域训练的较小模型生成的草稿注入更强学习者的GRPO上下文，能一致优于标准在线GRPO在MATH-500和离分布AIME 2025/2026上。具体来说，我们使用Mathstral-7B作为学习者，Qwen2.5-Math-1.5B作为草稿模型，8.8K Level 3--5 MATH问题（其中MATH-500被排除），并使用Dr. GRPO进行训练。不匹配是关键成分：在保持其他条件不变的情况下，将草稿洗牌到不匹配的问题中，使MATH-500的greedy pass@1提升+1.62pp（n=10种子，p=0.0015，Welch's t检验）。事实上，不匹配-错误变体在MATH-500上所有测试的变体中均优于。在离分布AIME 2025和2026上，不匹配-错误变体在每个样本预算从k=1到k=1024的所有年份中，均将pass@k提升到Mathstral-7B（其原生[INST]格式）和Qwen2.5-Math-1.5B草稿模型之上。所有变体在测试时使用相同的提示，没有草稿注入。该配方——在单个GPU上训练，无需SFT、奖励模型、合成数据和无produce-critique-revise内循环——在Mathstral-7B-v0.1上达到了71.98%的MATH-500成绩，这是目前该模型的最高已发表结果，超过了WizardMath流程在完整MATH上的70.9%（SFT + PPO加过程/指令奖励模型）。

英文摘要

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).
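
The results above are reported as pass@$k$. The abstract does not say how pass@$k$ is estimated, so the following sketch is an assumption: it implements the commonly used unbiased estimator, which computes from $n$ samples with $c$ correct the probability that at least one of $k$ samples drawn without replacement is correct, $1 - \binom{n-c}{k} / \binom{n}{k}$. The function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, of which c are correct.

    Returns the probability that at least one of k samples, drawn without
    replacement from the n generated samples, is correct.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # Fewer than k incorrect samples: any draw of k hits a correct one.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical example: 1 correct answer out of 4 samples.
    print(pass_at_k(4, 1, 1), pass_at_k(4, 1, 2))
```

With $k=1$ the estimate reduces to the fraction of correct samples, while larger $k$ rewards problems that are solved only occasionally.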

2605.17310 2026-05-19 cs.CV cs.AI 版本更新

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

注意力劫持：跨查询的视觉-语言模型响应操控

Zhiqiang Wang, Dongrui Liu, Yan Li, Zonghao Ying, Wei Xue, Wenhan Luo, Yike Guo

发表机构 * Hong Kong University of Science and Technology（香港科技大学）； Shanghai Jiao Tong University（上海交通大学）； Beihang University（北京航空航天大学）

AI总结本文研究了视觉-语言模型中跨查询响应操控问题，提出了一种新的对抗攻击方法Attention Hijacking，通过引导内部注意力分布保持图像主导模式，提高攻击在不同查询下的有效性。

详情

AI中文摘要

现有针对视觉-语言模型（VLMs）的对抗攻击可以将模型输出导向攻击者指定的目标响应，但当相同扰动输入与不同文本查询配对时，其效果往往会下降。本文研究了跨查询响应操控，即期望一个对抗示例在多样化的用户查询中保持有效。我们首先分析了现有攻击的局限性，发现成功转移与在响应生成过程中保持图像主导的注意力模式密切相关。受此观察启发，我们提出了Attention Hijacking，一种新的对抗攻击方法，该方法明确引导内部注意力分布向持久的图像主导模式倾斜。通过放大视觉标记对目标响应标记的影响，同时抑制文本标记的竞争影响，我们的方法减少了 manipulated 输出对特定查询用语的依赖。在广泛使用的VLMs上的大量实验表明，Attention Hijacking显著提高了跨查询转移性，适用于多样化的目标响应和未见查询。该方法也有效扩展到多种攻击场景，为VLMs中注意力稳定性在可转移响应操控中的作用提供了新的见解。

英文摘要

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by this observation, we propose Attention Hijacking, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.

2605.17309 2026-05-19 cs.CV cs.AI 版本更新

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

StyleText: 一个大规模数据集和基准，用于具有风格保留的场景文本修复

Aleksandr Simonyan, Nipun Jindal

发表机构 * Adobe Inc.（Adobe公司）

AI总结本文提出StyleText，一个用于具有风格保留的场景文本修复的大规模数据集和基准，通过控制评估文本可读性和视觉一致性，利用共享场景上下文。

Comments Accepted at the SynData4CV Workshop, CVPR 2026. 8 pages + 1 page of references, 5 figures, 4 tables

详情

AI中文摘要

我们提出了StyleText，一个用于局部场景文本修复的大型数据集和基准，具有风格保留。StyleText包含28,518个图像-掩码-提示三元组，分为9,932个场景家族，使能够受控评估文本可读性和视觉一致性。我们通过自动化流程构建数据集，该流程结合LLM提示模板、基于Flux的源生成与键值(KV)缓存注入、基于OCR的语义过滤、多边形掩码提取以及掩码条件的FluxFill增强。我们定义了一个可重复的评估协议，使用归一化的OCR度量（词准确率和字符错误率）和CLIP图像-图像相似性，结合显式预处理。在StyleText上训练的FluxFill+LoRA基线在初始化基础上显著提高了OCR准确性，同时保持场景风格一致性，为未来的比较建立了有力的参考点。

英文摘要

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.
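
The evaluation protocol relies on normalized word accuracy and character error rate (CER). The sketch below shows one plausible way to compute both. The normalisation used here (lowercasing and collapsing whitespace) is an assumption, since the abstract mentions explicit preprocessing without describing it, and all function names are illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            # Minimum over deletion, insertion and substitution.
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def normalize(text):
    # Assumed preprocessing: lowercase and collapse runs of whitespace.
    return " ".join(text.lower().split())


def character_error_rate(reference, hypothesis):
    """Edit distance divided by the reference length, after normalisation."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def word_accuracy(references, hypotheses):
    """Fraction of OCR outputs that exactly match their normalised target text."""
    pairs = list(zip(references, hypotheses))
    if not pairs:
        return 0.0
    return sum(normalize(r) == normalize(h) for r, h in pairs) / len(pairs)


if __name__ == "__main__":
    print(character_error_rate("OPEN", "0PEN"))
    print(word_accuracy(["Open", "Sale"], ["open ", "Sole"]))
```

Word accuracy is all-or-nothing per text instance, whereas CER gives partial credit, so reporting both separates near misses from completely illegible renderings.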

2605.17308 2026-05-19 cs.AI 版本更新

OProver：一个用于代理形式定理证明的统一框架

David Ma, Kaijing Ma, Shawn Guo, Yunfeng Shi, Enduo Zhao, Jiajun Shi, Zhaoxiang Zhang, Gavin Cheung, Jiaheng Liu, Zili Wang

发表机构 * Lean 4

AI总结本文提出OProver，一个用于Lean 4的统一框架，通过迭代修订检索到的编译器验证证明和Lean编译器反馈来改进代理证明，通过持续预训练和迭代后训练，使OProver-32B在多个基准测试中取得最佳成绩。

详情

AI中文摘要

近年来，形式定理证明的进步得益于大规模证明生成和验证器感知训练，但代理证明很少被整合到证明器训练中，仅在推理时间出现。我们提出了OProver，一个用于Lean 4的统一框架，其中失败的证明尝试通过检索到的编译器验证证明和Lean编译器反馈进行迭代修订。OProver通过持续预训练和迭代后训练进行训练：每次迭代运行代理证明，将新验证的证明索引到OProofs和检索内存中，使用修复轨迹作为SFT数据，并使用未解决的困难案例用于RL。OProofs由公开的Lean资源、大规模证明合成和代理证明轨迹构建，包含177万条Lean语句、686万条编译器验证证明以及带有检索上下文、失败尝试、反馈和修复的序列轨迹。在五个基准测试中，OProver-32B在MiniF2F（93.3%）、ProverBench（58.2%）和PutnamBench（11.3%）上取得最佳Pass@32，且在MathOlympiad（22.8%）和ProofNet（33.2%）上排名第二，比任何先前的开放式整体证明证明器的顶级位置更多。

英文摘要

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler-verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%), more top placements than any prior open-weight whole-proof prover.

2605.17281 2026-05-19 cs.SE cs.AI version update

ContractBench: Can LLM Agents Preserve Observation Contracts?

Jicheng Wang, Yifeng He, Zili Wang, Hanwen Xing, Arkaprava De, Hao Chen

Affiliations: University of California, Davis; University of Southern California; University of Hong Kong

AI summary: This paper introduces ContractBench, a benchmark that evaluates whether LLM agents preserve observation contracts (such as temporal validity and byte-level integrity), and finds that existing models still show significant deficiencies on this task.

Abstract

Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA-256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real-world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude-Opus-4.6 leading at 77.8%, revealing that current frontier models still fail to comply with observation contracts; (ii) a sharp within-family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B-A17B: what emerges across the cliff is mid-trajectory restraint, not tool-call competence; (iii) non-monotonic scaling across the GPT-5 family: agentic post-training can erode compliance through sycophancy-driven regression; (iv) our failure taxonomy works as an actionable in-context reward signal, yielding +7.1 pp on 42 paired GPT-5.1 failures.
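
The two checks the abstract describes, a virtual clock for validity and SHA-256 hashes for byte integrity, can be sketched roughly as follows. This is not the benchmark's code: the class, the function, and the failure labels are hypothetical names, and the expiry rule (expired once the clock reaches `expires_at`) is an assumption.

```python
import hashlib


class VirtualClock:
    """Deterministic clock that only moves when the harness advances it."""

    def __init__(self, start: float = 0.0) -> None:
        self.now = start

    def advance(self, seconds: float) -> None:
        self.now += seconds


def check_contract(issued: bytes, used: bytes,
                   expires_at: float, clock: VirtualClock) -> str:
    """Return a hypothetical failure label, or "ok" if both axes hold."""
    # Validity axis: the artifact must be used before it expires.
    if clock.now >= expires_at:
        return "validity_failure"
    # Integrity axis: the bytes sent back must hash to the bytes issued.
    if hashlib.sha256(issued).hexdigest() != hashlib.sha256(used).hexdigest():
        return "integrity_failure"
    return "ok"
```

A typical integrity failure under this sketch is an agent that URL-decodes `%2B` to `+` inside a presigned URL before reusing it: the string looks equivalent, but the hash no longer matches.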

2605.17279 2026-05-19 cs.SE cs.AI version update

Rover: Context-aware Conflict Resolution with LLM

Qingyu Zhang, Junzhe Li, Jiayi Lin, Changhua Luo, Chenxiong Qian

Affiliations: The University of Hong Kong; Wuhan University

AI summary: This paper proposes Rover, a conflict resolution system that combines program analysis with large language models and uses a Multi-layer Code Property Graph to build context-aware prompts, improving the accuracy and efficiency of code merging.

Abstract

Code merging is a significant challenge, particularly in large-scale projects. Existing solutions, including program analysis and machine learning, show promise but face critical limitations. Program analysis lacks the ability to infer developers' intentions, relying on conservative strategies that offload unresolved conflicts for manual handling. Meanwhile, model-based approaches struggle with conflicts involving complex code dependencies due to insufficient contextual awareness. To address these gaps, we introduce Rover, a novel conflict resolution system that integrates program analysis with large language models (LLMs). To obtain context-aware prompts, we propose Multi-layer Code Property Graph (MtCPG), a new representation capturing inter-file dependencies and enabling contextual analysis for a given conflict. Using graph connectivity algorithms, Rover further clusters conflicting code and associated changes into meaningful "contexts" that guide the LLM in generating accurate resolutions. We compared Rover with standalone LLMs, machine learning baseline MergeGen, and suggestion provider tool WizardMerge with adjacent code as the contexts. Evaluation results show that Rover surpasses all of these approaches in terms of conflict resolution, achieving higher similarity to ground-truth resolutions at character, lexical, and semantic levels.
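
The abstract says Rover clusters conflicting code and related changes into "contexts" with graph connectivity algorithms but does not name the algorithm. As one plausible reading, the sketch below groups nodes into connected components with union-find; the node names and the `cluster_contexts` helper are invented for illustration and are not Rover's API.

```python
from collections import defaultdict


def cluster_contexts(nodes, edges):
    """Group nodes into connected components (one component per context)."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk to the root, halving the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    # Sort for a deterministic result.
    return sorted(sorted(g) for g in groups.values())


nodes = ["conflict_1", "a.py:foo", "b.py:bar", "conflict_2", "c.py:baz"]
edges = [("conflict_1", "a.py:foo"), ("a.py:foo", "b.py:bar"),
         ("conflict_2", "c.py:baz")]
contexts = cluster_contexts(nodes, edges)
```

Here `conflict_1` ends up in the same context as both functions it depends on, directly or transitively, while `conflict_2` forms a separate context.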

2605.17278 2026-05-19 cs.AI cs.LG version update

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, Rongrong Ji

Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education; Institute of Artificial Intelligence

AI summary: This paper proposes A2RBench, an automatic paradigm that improves the efficiency of abstract reasoning benchmark generation through a pipeline of generation, expansion, evaluation, and analysis. It finds that current LLMs have fundamental deficiencies in abstract reasoning and that inputs with higher information complexity can simplify the reasoning process.

Abstract

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate them, we further establish a theoretical framework and prove that programmatic verification, which tests whether the inverse operation perfectly reverses the forward operation (cycle consistency), guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) The 3D tasks that current LLMs generate are far less complex than their 2D and 1D tasks, revealing their lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.
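
The cycle-consistency check mentioned in the abstract can be illustrated with a toy rule. The sketch below only shows the check itself (does the inverse undo the forward operation on every tested input); the rules and the helper name are made up here, and the paper's uniqueness proof is not reproduced.

```python
def is_cycle_consistent(forward, inverse, inputs):
    """True if inverse(forward(x)) == x for every tested input."""
    return all(inverse(forward(x)) == x for x in inputs)


def rotate_left(s):
    # Invertible toy rule: move the first character to the end.
    return s[1:] + s[:1]


def rotate_right(s):
    # Inverse of rotate_left: move the last character to the front.
    return s[-1:] + s[:-1]


def sort_chars(s):
    # Lossy toy rule: sorting discards the original order.
    return "".join(sorted(s))


samples = ["abc", "xy", ""]
ok = is_cycle_consistent(rotate_left, rotate_right, samples)
bad = is_cycle_consistent(sort_chars, lambda s: s, ["ba"])
```

A generated task whose rule behaves like `sort_chars` would be rejected under this check, because no inverse can recover the input it was given.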

2605.17276 2026-05-19 cs.LG cs.AI version update

How Do Electrocardiogram Models Scale?

Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, Antônio H. Ribeiro

Affiliations: Uppsala University; Griffith University

AI summary: This paper studies how ECG models scale. It finds that supervised models perform poorly when data are limited, whereas self-supervised models scale robustly with both model and data size, and that self-supervised Transformers overtake ResNets at very large model sizes.

Abstract

While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.

2605.17256 2026-05-19 eess.SY cs.AI cs.LG cs.SY version update

Latency-Aware Deep Learning Benchmark for Real-Time Cyber-Physical Attack and Fault Classification in Inverter-Dominated Power Grids

Emad Abukhousa, Saman Zonouz, A. P. Sakis Meliopoulos

Affiliations: Emad Abukhousa; Saman Zonouz

AI summary: This paper presents a latency-aware deep learning benchmarking framework for evaluating models that detect power system anomalies in inverter-dominated grids from high-fidelity time-domain signals. Eight neural network architectures, from MLPs to Transformers, are systematically evaluated on streaming datasets generated from physical faults and cyber-attacks. All models classify two representative multi-event sequences in real time with sub-cycle response times below 15 ms, but end-to-end inference latency consistently exceeds three cycles, ranging from 50 to 90 ms. These results highlight a critical gap between algorithmic capability and protection-grade deployment and point to the need for further optimization and hardware acceleration. The study establishes a reproducible benchmark for sub-cycle anomaly detection and offers guidance for moving machine learning methods from research prototypes to real-world protection applications.

Abstract

This work introduces a latency-aware benchmarking framework for evaluating deep learning models in power system anomaly detection using high-fidelity, time-domain signals generated from an industry-grade electromagnetic transient simulator. Eight neural network architectures, ranging from MLPs to Transformers, were systematically evaluated on streaming datasets representing both physical faults and cyber-attacks in inverter-dominated networks. All models successfully classified two representative multi-event sequences in real time with sub-cycle response times below 15 ms. However, although classification decisions occurred within one cycle, the end-to-end inference latency consistently exceeded three cycles, ranging from 50 to 90 ms. These results highlight a critical gap between algorithmic capability and protection-grade deployment, pointing to the need for further optimization and hardware acceleration. The findings establish a reproducible benchmark for sub-cycle anomaly detection and provide guidance for transitioning machine learning methods from research prototypes to real-world protection applications.
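
The abstract mixes milliseconds and power cycles, so a quick conversion helps when reading the numbers. The sketch below assumes a 60 Hz nominal frequency, which the abstract does not state; that assumption is at least consistent with 50 ms being quoted as three cycles.

```python
def latency_in_cycles(latency_ms: float, grid_hz: float = 60.0) -> float:
    """Convert a latency in milliseconds to power-frequency cycles.

    grid_hz = 60.0 is an assumption made here, not a figure from the paper.
    """
    return latency_ms * grid_hz / 1000.0


response = latency_in_cycles(15.0)     # classification response time
end_to_end_min = latency_in_cycles(50.0)
end_to_end_max = latency_in_cycles(90.0)
```

Under that assumption, a 15 ms response is about 0.9 cycles (sub-cycle), while the 50 to 90 ms end-to-end range corresponds to roughly 3 to 5.4 cycles.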

2605.17255 2026-05-19 cs.AI math.OC version update

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

Wentao Long, Yunfei Zhang, Chenyi Li, Li Zhou, Chumin Sun, Zaiwen Wen

Affiliations: Fudan University; Qingdao University; Peking University; Huawei Technologies Ltd.

AI summary: This paper presents CAM-Bench, a benchmark of 1,000 Lean proof targets covering optimization, numerical linear algebra, and numerical analysis. It complements existing formal mathematics benchmarks by evaluating applied mathematics problems that rely on textbook concepts and elementary theorems.

Comments: Preprint. 44 pages, 7 figures

Abstract

Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.

2605.17247 2026-05-19 cs.AI version update

Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

Zheqin Yin, Yupei Ren, Yadong Zhang, Yujiang Lu, Man Lan

Affiliations: School of Computer Science and Technology, East China Normal University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; Lab of Artificial Intelligence for Education, East China Normal University

AI summary: This paper proposes TIDE, a framework that integrates trial and debate mechanisms to improve criteria-based prompt optimization for understanding and evaluating argumentative essays. Experiments show performance gains on automated essay scoring, argument component detection, and argument relation identification.

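2605.17246 2026-05-19 cs.LG cs.AI version update

Fidelity Probes for Specification--Code Alignment

Ferhat Erata, Hao Zhou, Luke Huan

Affiliations: AWS Agentic AI

AI summary: This paper proposes fidelity probes: natural-language questions generated from a reference artifact, with ground-truth answers derived from the code, that are then answered from a candidate specification. Fidelity is the fraction of probes on which the answers agree; it decomposes into a contradiction rate and a coverage-gap rate, which drive targeted specification edits until convergence. On a 15-program, roughly 12,000-line COBOL benchmark (AWS CardDemo), eight iterations raise frozen-test fidelity of the specification from 0.63 to 0.94, with the plateau position predicted by a two-state Markov fixed point $F^\dagger$ that needs only four iterations of rate data. Probes come either from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system dependence graphs, with a tunable mixing ratio. A probe-resampling protocol with a frozen held-out set provides a Hoeffding-bounded overfitting test; the measured train/test gap stays an order of magnitude below that envelope. Three graph-based mixtures raise fidelity by 16 to 30 points, and cross-distribution evaluation shows that the LLM and symbolic channels are empirically complementary. A cross-family generator sweep over five independent LLM families (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms that the convergence behavior does not depend on any single model family: three of the five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, while the frozen-test protocol actively rejects two generators whose probe distributions shift across iterations. The method applies to any pair of artifacts that should describe the same behavior.

Comments: 29 pages, 14 figures, 11 tables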
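
The fidelity score in the summary above is a simple proportion, which a few lines make concrete. This sketch assumes each probe ends in exactly one of three outcomes, so that the contradiction rate and coverage-gap rate together account for everything that is not agreement; the outcome labels and the function name are illustrative, not taken from the paper.

```python
from collections import Counter

OUTCOMES = {"agree", "contradict", "missing"}


def fidelity_report(outcomes):
    """Summarize per-probe outcomes as three rates that sum to 1."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no probes")
    if set(counts) - OUTCOMES:
        raise ValueError("unknown outcome label")
    return {
        "fidelity": counts["agree"] / total,
        "contradiction_rate": counts["contradict"] / total,
        "coverage_gap_rate": counts["missing"] / total,
    }


report = fidelity_report(["agree"] * 6 + ["contradict"] + ["missing"] * 3)
```

In this example the specification answers 6 of 10 probes correctly, contradicts the code once, and is silent on 3, which suggests that filling gaps would pay off more than fixing contradictions.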

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An, Filippo Menczer

Affiliations: Observatory on Social Media, Indiana University, USA; Center Synergy of Systems, TUD Dresden University of Technology, Germany

AI summary: This study examines the challenges AI models face in moderating pluralistic communities on social media and introduces the PluRule benchmark for detecting 13,371 rule violations, finding that even state-of-the-art vision-language models struggle to identify violations.

Comments: Accepted to ACL 2026 Main Conference

Abstract

Social media are shifting towards pluralism: community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

2605.17181 2026-05-19 cs.SD cs.AI version update

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition

Abhimanyu Kaushik

Affiliations: Independent Researcher; Trophy Club, Texas

AI summary: This study proposes an automated pipeline that uses optical music recognition to turn sheet music into violin fingerboard animations. Its core approach integrates three open-source tools and uses a custom lookup table to map musical notes to violin strings and finger positions.

Comments: 12 pages, 4 figures

Abstract

Learning the violin is harder than it looks. Unlike piano keys or guitar frets, the violin neck has no markings at all, so a beginner cannot tell by looking where to place each finger. MusicSynth is an open-source web tool that tries to fix that: a user uploads a photo of any violin sheet music (or a digital score file), and the system automatically produces a video showing a violin fingerboard with each note highlighted at the right moment, with no software to install and no manual note entry required. The system connects three existing open-source tools into one pipeline: an optical music recognition (OMR) library reads the notes from the uploaded image, a MusicXML parser extracts timing information from digital scores, and a video renderer draws the fingerboard frame by frame. The only part built from scratch is the lookup table that maps each musical note to a string and finger position on the violin. Tested across 110 public-domain violin scores, MusicSynth correctly identified 91.2% of notes in clean printed music and assigned the right finger position 99.1% of the time when given a digital score file. To the author's knowledge, no freely available tool currently turns a sheet music image into an animated violin fingerboard tutorial automatically and in a single browser-based step.
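
The note-to-fingering lookup table is the one component the abstract says was built from scratch, but its contents are not given. The sketch below is therefore a guess at the general shape only: it assumes first position, standard G-D-A-E tuning expressed as MIDI note numbers, a preference for the highest string that can play the note, and one particular semitone-to-finger assignment.

```python
# Open strings in standard tuning as (name, MIDI note number).
OPEN_STRINGS = [("G", 55), ("D", 62), ("A", 69), ("E", 76)]

# Semitones above the open string -> finger (0 means open string).
# Assumed first-position layout; real fingerings depend on the key.
FINGER_FOR_OFFSET = {0: 0, 1: 1, 2: 1, 3: 2, 4: 2, 5: 3, 6: 3, 7: 4}


def fingering(midi_note: int):
    """Return (string, finger) for a note reachable in first position."""
    # Try the highest string first so open strings beat fourth fingers.
    for name, open_pitch in reversed(OPEN_STRINGS):
        offset = midi_note - open_pitch
        if offset in FINGER_FOR_OFFSET:
            return name, FINGER_FOR_OFFSET[offset]
    raise ValueError(f"MIDI note {midi_note} is outside first position")
```

For example, MIDI note 71 (B4) maps to the first finger on the A string, and note 69 (A4) maps to the open A string rather than the fourth finger on the D string.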

2605.17176 2026-05-19 cs.AI version update

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

Zhaoyue Sun, Hainiu Xu, Andero Uusberg, James J. Gross, Petr Slovak, Yulan He

Affiliations: Department of Informatics; King’s College London; Institute of Psychology; University of Tartu; Department of Psychology; Stanford University; The Alan Turing Institute

AI summary: This paper presents CAREBench, the first benchmark with comprehensive annotations of cognitive appraisal reasoning, appraisal ratings, and multi-label emotions. Systematic experiments show that stronger models match or surpass humans on some tasks but fall short on appraisal reasoning and positive emotion recognition, revealing that current models have not internalized the mechanisms needed to capture human subjective heterogeneity.

Comments: 27 pages, 18 figures

Abstract

Emotion understanding is a core capability for LLMs to interact effectively with humans, yet existing evaluation paradigms rely on discrete emotion label prediction and fail to capture the cognitive processes underlying emotion generation. Grounded in appraisal theory, we introduce CAREBench, the first benchmark with complete inferential chain annotations from both first- and third-person perspectives on real-world narratives, spanning appraisal reasoning, appraisal ratings, and multi-label emotion annotation. We propose a process-level evaluation framework and conduct systematic experiments across six LLMs organized around four research questions. We find that stronger models match or surpass human observers on certain tasks, yet fall short on appraisal reasoning and positive emotion recognition; performance across chain steps and sensitivity to appraisal interventions exhibit dissociations across models; and current models have not internalized the mechanisms needed to capture human subjective heterogeneity. These findings suggest that downstream emotion prediction metrics may overestimate LLMs' true emotion understanding, and CAREBench provides a foundation for more diagnostically informative evaluation of LLMs' affective cognitive capabilities.

2605.17174 2026-05-19 cs.SE cs.AI version update

Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation

Shuyin Ouyang, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Jie M. Zhang

Affiliations: King’s College London; Elm Europe

AI summary: This paper studies reinforcement learning for diffusion-based code generation, examining how static-analysis rewards and hint-conditioned diffusion RL perform across task difficulty levels, and finds that static checking yields the largest gains in code generation performance.

Abstract

Reinforcement Learning (RL) is an important paradigm for aligning Diffusion Language Models (DLMs) toward functional correctness in code generation. However, these models often encounter a "capability cliff" on complex tasks, where execution-based semantic rewards become too low to provide a viable learning signal. In this paper, we present a systematic empirical study of RL post-training for diffusion-based code generation along three axes: reward design, hint-conditioned sampling, and task difficulty. We investigate the effectiveness of execution-free rewards as alternatives to traditional unit-test execution, the role of training-time hint-conditioned diffusion sampling in mitigating exploration bottlenecks, and how the impact of these design choices varies across tasks with different difficulty levels. Across HumanEval, MBPP, and LiveCodeBench, we find that static checking is the strongest overall standalone execution-free reward in our setting, especially improving DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while reducing rollout time by 9.4%. We further find that moderate AST-based hinting is most useful on harder benchmarks, while the best reward design depends strongly on task difficulty: similarity-based rewards are more effective on easier subsets, whereas static checking is more reliable on harder subsets where execution rewards are low. These findings suggest that reward design and training guidance substantially affect diffusion RL performance in our evaluated code-generation setting.
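
To give a feel for what an execution-free reward can look like, here is a deliberately small sketch built on Python's standard `ast` module. It is not the paper's static checker, whose rules the abstract does not describe; the reward values and the "defines a function" criterion are assumptions chosen only to show a signal that needs no unit-test run.

```python
import ast


def static_check_reward(source: str) -> float:
    """Toy execution-free reward computed without running the code.

    Assumed scoring: 0.0 if the code does not parse, 0.5 if it parses
    but defines no function, 1.0 if it parses and defines a function.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        return 0.0
    defines_function = any(
        isinstance(node, ast.FunctionDef) for node in ast.walk(tree)
    )
    return 1.0 if defines_function else 0.5
```

Unlike a pass/fail unit test, a check of this kind still separates malformed output from well-formed output on tasks the model cannot yet solve, which is the regime the abstract calls the capability cliff.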

2605.17173 2026-05-19 cs.CL cs.AI cs.LG version update

The Dynamics of Collective Creativity in AI Art Competitions

Mason Youngblood, Jeff Nusz, Joel Simon

Affiliations: Institute for Advanced Computational Science, Stony Brook University; Morphogen

AI summary: By analyzing the image generation process in AI art competitions, this study finds that collective creativity in human-AI co-creation shows image simplification, thematic convergence, and a tension between user preferences and creative complexity.

Abstract

Creativity is a fundamental aspect of how culture evolves, yet the mechanisms by which groups produce novelty are notoriously difficult to infer from the historical record. Iterated learning experiments have shown that cultural transmission reliably distorts artifacts toward the inductive biases of learners, but most of this work uses linear chains between human participants, leaving open how these dynamics play out in the networked, human-AI systems that increasingly shape cultural production. In this study, we leverage one such system, Artbreeder, which hosts daily "remix parties" where users iteratively build on each other's work from a single seed image, producing branching lineages of human-AI co-created images. We analyze a dataset of 130,882 images from 368 remix parties over 13 months and find that images become simpler and converge toward common thematic "attractors" (e.g., steampunk scenes, alien architecture). We also find that while more novel "parent" images produce more novel and complex "children" that attract more likes, users paradoxically prefer to remix images that are less novel and complex. Finally, larger remix parties produce more novelty at the cost of lower complexity.

2605.17137 2026-05-19 cs.AI version update

Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

Cheikh Ahmed, Mahdi Mostajabdaveh, Zirui Zhou

Affiliations: Huawei Technologies Canada

AI summary: This paper proposes a continuous heuristic discovery framework that maps discrete programs into a continuous embedding space and uses a differentiable surrogate model for gradient-based optimization, improving the performance of automated algorithm design.

Comments: Accepted at LION 2026, The Learning and Intelligent Optimization Conference

Abstract

The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on stochastic sampling to navigate a highly non-convex optimization landscape. This work proposes a continuous heuristic discovery framework that shifts optimization to a learned latent manifold. We employ an encoder to map discrete programs into continuous embeddings and train a differentiable surrogate model to predict performance, enabling gradient-based search. To regularize the optimization trajectory, an invertible normalizing flow maps these embeddings to a structured Gaussian prior, where we perform gradient ascent. The resulting optimized latent vectors are projected through a learned mapper into soft prompts, which condition a frozen LLM to synthesize novel executable heuristics. We evaluate the proposed method on the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), the Knapsack Problem (KSP), and Online Bin Packing (OBP). Empirical results demonstrate that continuous latent-space optimization achieves performance competitive with state-of-the-art discrete evolutionary baselines while offering a complementary methodological alternative for automated algorithm design. The implementation code is available at https://github.com/cheikh025/LHS.
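
The core search step, gradient ascent on a surrogate's predicted performance in latent space, can be shown in miniature. The sketch below substitutes a hand-written quadratic for the learned surrogate and leaves out the encoder, the normalizing flow, and the soft-prompt mapper entirely; every name in it is illustrative and unrelated to the linked repository.

```python
def gradient_ascent(grad_fn, z0, lr=0.1, steps=100):
    """Climb a differentiable score by repeatedly stepping along its gradient."""
    z = list(z0)
    for _ in range(steps):
        grad = grad_fn(z)
        z = [zi + lr * gi for zi, gi in zip(z, grad)]
    return z


# Stand-in surrogate: score(z) = -||z - BEST||^2, maximized at BEST.
BEST = [1.0, -2.0]


def surrogate_grad(z):
    # Gradient of the stand-in score with respect to z.
    return [-2.0 * (zi - bi) for zi, bi in zip(z, BEST)]


z_opt = gradient_ascent(surrogate_grad, [0.0, 0.0])
```

On this toy objective the latent vector converges to `BEST`. In the setting the abstract describes, the optimized vector would instead be decoded into a soft prompt for the frozen LLM, and the surrogate would be a trained model rather than a known function.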

2605.17133 2026-05-19 cs.CV cs.AI version update

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

Affiliations: Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport

AI summary: To address the challenges posed by the rapid development of deepfake technology and video editing tools, this paper proposes CAM-VFD, a framework that detects multimodal video forgery by modeling cross-modal contradictions. Experiments show strong performance and good robustness on two generated-video benchmarks.

Abstract

LLMs are increasingly used as "digital consumers" to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate-reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.

2605.17077 2026-05-19 cs.RO cs.AI version update

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu

Affiliations: University of California, San Diego; NVIDIA

AI summary: This study uses dense language annotations to make robot policy learning more efficient. It proposes DeMiAn, which uses vision-language models to generate multi-aspect annotations that improve policy and world-model performance without new demonstration data.

Abstract

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.

2605.17072 2026-05-19 cs.AI cs.CL Updated version

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

Chengrui Han, Zesheng Cheng

Affiliations: Qingdao University

AI summary: This paper proposes RAGA, a framework that embeds a read-search-verify-construct cognitive constraint to improve the efficiency and accuracy of knowledge graph construction and retrieval-augmented generation, and that manages the full knowledge graph lifecycle.

Abstract

Existing LLM-driven knowledge graph (KG) construction methods predominantly employ stateless batch processing pipelines, exhibiting structural deficiencies in cross-chunk semantic relation capture, entity disambiguation, and construction process interpretability. These limitations undermine KG quality, retrieval precision, and deployment trust in high-stakes domains. We propose RAGA (Reading And Graph-building Agent), an LLM-based autonomous KG construction and retrieval fusion framework. RAGA provides an atomic toolset supporting full KG lifecycle CRUD operations and embeds a Read-Search-Verify-Construct cognitive constraint into a ReAct tool loop. A KG-vector synchronization mechanism enables hybrid symbolic-vector retrieval, while evidence-anchored verification links every knowledge entry to its source text for auditable provenance. Preliminary experiments on a subset of the QASPER scientific QA dataset indicate that RAGA's fusion retrieval outperforms zero-shot baselines, with KG integration providing measurable gains in both answer and evidence quality. The framework design and experimental baseline serve as a reference for agent-driven autonomous KG construction.

2605.17071 2026-05-19 cs.AI Updated version

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

Shiying Yu, Jielei Wang, Guoming Lu

Affiliations: University of Electronic Science and Technology of China

AI summary: This paper proposes AnchorDiff, described as the first masked-diffusion framework with clinical anchors for radiology report generation. A topology-aware training strategy and an inference-time rewriting strategy mitigate the limitations of fixed-order autoregressive decoding, and the method reports state-of-the-art performance.

Abstract

Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.

2605.17064 2026-05-19 cs.AI Updated version

Evidential Information Fusion on Possibilistic Structures

Qianli Zhou, Ye Cui, Zhen Li, Witold Pedrycz, Yong Deng

Affiliations: School of Electronics and Information, Northwestern Polytechnical University; Department of Electrical and Computer Engineering, University of Alberta; China Mobile Information Technology Center; Systems Research Institute, Polish Academy of Sciences; Institute of Fundamental and Frontier Science, University of Electronic Science and Technology of China

AI summary: This paper proposes an evidential information fusion method based on a possibilistic structure. By introducing a belief evolution network and the triangular norm family, it builds a more flexible fusion framework suited to complex scenarios such as non-distinct source fusion and conflict management.

Abstract

Dempster's rule is a fundamental tool for combining belief functions from distinct and reliable sources. However, its intersection-based semantics imposes strong structural restrictions, which limits its flexibility in handling complex source states and diverse information fusion scenarios. To overcome this limitation, we propose a reversible transformation, derived from the isopignistic principle, between belief functions and a possibilistic structure defined on the power set. In this transformation, the relationships among subsets are explicitly characterized by a belief evolution network, which provides a more flexible representation of evidential information beyond the conventional mass function structure. On this basis, we further introduce the triangular norm family to develop a general and adaptive evidential information fusion framework. Unlike fusion methods rooted in Dempster semantics, the proposed framework supports more flexible combination behaviors and exhibits advantages in non-distinct source fusion, conflict management, parametric combination design, and heterogeneous information fusion.
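
For readers unfamiliar with the baseline this abstract argues against, the following is a minimal sketch of the classical Dempster's rule of combination, not the proposed possibilistic framework. The function name `dempster_combine` and the two example mass functions are illustrative assumptions. The sketch shows the intersection-based semantics the abstract refers to: mass is assigned only to non-empty intersections of focal sets, and the conflicting mass is normalized away.

```python
from itertools import product


def dempster_combine(m1, m2):
    """Combine two mass functions (dict: frozenset -> mass) with Dempster's rule.

    Returns the fused mass function and the conflict mass K.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            # Mass flows to the intersection of the two focal sets.
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Disjoint focal sets contribute to the conflict K.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("total conflict: Dempster's rule is undefined")
    norm = 1.0 - conflict
    return {s: w / norm for s, w in combined.items()}, conflict


# Hypothetical two-element frame {a, b} with two sources.
A, B = frozenset("a"), frozenset("b")
AB = A | B
m1 = {A: 0.6, AB: 0.4}
m2 = {B: 0.5, AB: 0.5}
fused, k = dempster_combine(m1, m2)
```

In this example the conflict is 0.6 × 0.5 = 0.3, so the surviving masses are rescaled by 1 / 0.7. That renormalization step is one reason the rule behaves rigidly when sources disagree strongly or are not distinct.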

2605.17037 2026-05-19 cs.LG cs.AI cs.CL Updated version

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu

Affiliations: Zhejiang University; AMAP, Alibaba Group

AI summary: This paper proposes D$^2$Evo, a dual difficulty-aware self-evolution mechanism that addresses the scarcity of effective data and dynamic difficulty shifts in reinforcement learning, outperforming existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples.

Comments Accepted by ICML 2026. First two authors contributed equally

Abstract

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.
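
The abstract does not say how medium-difficulty anchors are selected, so the following is only a sketch of one plausible reading: keep the questions whose empirical pass rate under the current Solver falls inside a band. The function names, the `attempts` structure, and the 0.25 to 0.75 thresholds are all hypothetical, not taken from the paper.

```python
def pass_rate(outcomes):
    """Fraction of sampled Solver attempts that were judged correct."""
    return sum(outcomes) / len(outcomes)


def mine_medium_anchors(attempts, low=0.25, high=0.75):
    """Keep questions that are neither trivial nor hopeless for the Solver.

    `attempts` maps a question id to a list of booleans (one per rollout).
    The thresholds are illustrative assumptions.
    """
    anchors = []
    for question, outcomes in attempts.items():
        rate = pass_rate(outcomes)
        if low <= rate <= high:
            anchors.append((question, rate))
    return anchors


# Hypothetical rollout results for three questions.
attempts = {
    "q_easy": [True, True, True, True],
    "q_medium": [True, False, True, False],
    "q_hard": [False, False, False, False],
}
anchors = mine_medium_anchors(attempts)
```

Because the pass rates are recomputed against the current Solver, rerunning this filter each iteration would track the dynamic difficulty shift the abstract describes: a question that is medium today drops out of the band once the Solver improves.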

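2605.17029 2026-05-19 cs.SE cs.AI Updated version

When Dynamics Shift, Robust Task Inference Wins: Revisiting Offline Imitation Learning with Behavioral Foundation Models

Rishabh Agrawal, Rahul Jain, Ashutosh Nayyar

Affiliations: University of Southern California

AI summary: This paper proposes a framework based on behavioral foundation models (BFMs) that casts task inference as a robust min-max optimization problem to cope with dynamics shift, adapting to worst-case dynamics perturbations without modifying pretraining. The method significantly outperforms standard BFM and robust offline imitation learning baselines under dynamics shift.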

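2605.17010 2026-05-19 cs.SI cs.AI cs.CL cs.CY cs.HC Updated version

Adversarial Fragility and Linguistic Fragility in Clinical AI: A Systematic Audit of Diagnostic Collapse under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf

Affiliations: Centre for Clinical Intelligence & Safety; Tomorrow University of Applied Sciences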

AI summary: This paper systematically audits diagnostic collapse in clinical AI under imperceptible perturbations and cross-lingual drift, showing how adversarial and linguistic fragility affect clinical AI systems in low-resource healthcare settings.

Comments 23 pages, 9 figures, 3 tables. Code and data available at https://github.com/anthoniooladimeji11-coder/clinical-ai-safety-audit

Abstract

Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.
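
To make the perturbation concrete, here is a minimal sketch of a fast-gradient step under an L-infinity budget, x_adv = clip(x + eps * sign(grad_x loss)). It uses a toy two-feature logistic model in place of DenseNet121; the weights, input, and the [0, 1] pixel range are assumptions for illustration, and only eps = 0.021 comes from the abstract. This is not the authors' audit code.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict(x, w, b):
    """Probability of the positive class under a toy logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def sign(g):
    return (g > 0) - (g < 0)


def fgm_linf(x, y, w, b, eps):
    """One fast-gradient step that increases the cross-entropy loss.

    For logistic regression the input gradient of the loss is (p - y) * w.
    The result is clipped to the assumed valid input range [0, 1].
    """
    p = predict(x, w, b)
    grad = [(p - y) * wi for wi in w]
    return [min(1.0, max(0.0, xi + eps * sign(g))) for xi, g in zip(x, grad)]


# Hypothetical model and input; the true label is the positive class.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.5], 1
x_adv = fgm_linf(x, y, w, b, eps=0.021)
p_clean = predict(x, w, b)
p_adv = predict(x_adv, w, b)
```

Each input coordinate moves by at most eps, yet every coordinate moves in the direction that hurts the model, which is why a change too small to see can still shift the prediction.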

2605.16991 2026-05-19 cs.CL cs.AI Updated version

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

Jan Netík, Patrícia Martinková

Affiliations: Faculty of Education, Charles University; Institute of Computer Science of the Czech Academy of Sciences

AI summary: This paper proposes a response-free item difficulty modelling approach that fine-tunes transformers to predict the difficulty of reading-comprehension multiple-choice items, using component-wise representation and multi-task learning to improve model performance.

Abstract

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

2605.16986 2026-05-19 cs.CL cs.AI Updated version

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

Affiliations: Shanghai Jiao Tong University; OPPO Research Institute

AI summary: This paper proposes SkillTTA, a test-time adaptive skill synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a task-specific textual skill, improving LLM agent performance on SpreadsheetBench, ALFWorld, and BigCodeBench.

Comments 10 pages, 4 figures

Abstract

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-$k$ retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.
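
The retrieval step can be pictured with a small sketch. The abstract does not specify the retriever, so the bag-of-words cosine similarity below is an assumption chosen to keep the example dependency-free; the trajectory records, field names, and `retrieve_top_k` are likewise hypothetical. Both successful and failed trajectories stay in the pool, in line with the abstract's note that failures are useful.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def retrieve_top_k(task, trajectories, k=2):
    """Return the k stored trajectories whose task text best matches `task`."""
    query = Counter(task.lower().split())
    scored = [
        (cosine(query, Counter(t["task"].lower().split())), t)
        for t in trajectories
    ]
    # Sort on the score only, so trajectory dicts are never compared.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [t for _, t in scored[:k]]


# Hypothetical training trajectories, including one failure.
trajectories = [
    {"id": "t1", "task": "sum a column in a spreadsheet", "success": True},
    {"id": "t2", "task": "heat an egg in the microwave", "success": True},
    {"id": "t3", "task": "sort spreadsheet rows by column", "success": False},
]
top = retrieve_top_k("average a spreadsheet column", trajectories, k=2)
```

The retrieved records would then be handed to a synthesizer model that writes the temporary skill text; that step is omitted here because the abstract gives no detail on the prompt or model interface.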

2605.16975 2026-05-19 cs.LG cs.AI Updated version

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

Wei Tang, Jinpei Han, Kangning Cui, Mattia Carletti, Fredrik K. Gustafsson, Shreyank N Gowda, Patitapaban Palo, Anshul Thakur, Lei Clifton, Jean-michel Morel, Raymond H. Chan, David A. Clifton, Xiao Gu

Affiliations: City University of Hong Kong; Imperial College London; Wake Forest University; University of Nottingham; Lingnan University; University of Oxford

AI summary: This paper proposes a parameter-efficient framework that extends pretrained 10-second ECG foundation models to longer and variable-length ECG signals without retraining the backbone, addressing structural incompatibility and semantic challenges. Experiments show that it outperforms sliding-window and pooling baselines on multiple long-horizon ECG tasks.

Abstract

Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are typically longer and vary in duration at inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.
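
As a point of reference, the sliding-window and pooling baseline named in the abstract can be sketched as follows. This is the baseline, not the proposed plug-in module. `toy_encoder` is a hypothetical stand-in for a frozen 10-second foundation model, and the 4 Hz sampling rate is an unrealistically small assumption chosen to keep the example readable.

```python
def sliding_windows(signal, window, stride):
    """Cut a long recording into fixed-length, possibly overlapping segments."""
    return [
        signal[start:start + window]
        for start in range(0, len(signal) - window + 1, stride)
    ]


def toy_encoder(segment):
    """Hypothetical frozen encoder: maps one segment to a 2-d embedding."""
    mean = sum(segment) / len(segment)
    return [mean, max(segment) - min(segment)]


def mean_pool(embeddings):
    """Average the per-window embeddings into one recording-level vector."""
    n = len(embeddings)
    return [sum(column) / n for column in zip(*embeddings)]


fs = 4                      # assumed sampling rate in Hz (illustrative)
window = 10 * fs            # 10-second window, matching the pretrained model
stride = window // 2        # 50% overlap, an arbitrary choice
signal = [float(i) for i in range(100)]   # stand-in for a 25-second recording

windows = sliding_windows(signal, window, stride)
pooled = mean_pool([toy_encoder(w) for w in windows])
```

Mean pooling treats the windows as an unordered set, so it discards the order and spacing of events across the recording. That loss is the "no built-in way to combine information across time" limitation the abstract sets out to address.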

2605.16969 2026-05-19 cs.AI Updated version

Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms

Anni Zhao, Alex Bateh, Tyler Baldridge, Sandra Billinger, Xiao Hu

Affiliations: Center for Data Science, Nell Hodgson Woodruff School of Nursing, Emory University; Division of Nephrology, Department of Medicine, University of Alabama at Birmingham; Department of Neurology, School of Medicine, University of Kansas Medical Center

AI summary: This study uses cerebral blood flow velocity data and machine learning algorithms to predict vascular age in patients with different brain diseases, assess accelerated aging, and examine whether TCD-generated features are relevant to evaluating accelerated cerebrovascular aging.

Abstract

Defining vascular age in terms of physiological function has become one focal point of the extensive studies to categorize and track chronological age. Transcranial Doppler (TCD) is a method by which cerebral blood flow velocity is measured along the major arteries feeding the human brain. This study aims to use features extracted from TCD to estimate chronological age and assess accelerated aging in subjects with various brain diseases. We predict subjects with various brain diseases to present with accelerated cerebrovascular aging when tested on various regression models trained by healthy subjects. 168 healthy subjects and 277 diseased subjects with bilateral TCD recordings of the middle cerebral artery were analyzed using the Morphological Analysis and Clustering of Intracranial Pressure (MOCAIP) algorithm. MOCAIP-generated features and heart rate variability features were used as input features for regression models to predict the brain vascular age. 66 subjects with acute stroke, 27 subjects with post stroke, 26 subjects with Alzheimer's disease, 23 subjects with mild cognitive impairment, and 135 established subjects were tested against the machine learning model to assess for accelerated cerebrovascular age. The trained model, on average, predicted healthy subjects' cerebrovascular age to be 3.69 years above their chronological age. Subjects with different disease conditions exhibited varying levels of age acceleration. The differences in healthy and diseased subjects' performances suggest that features generated using TCD may be relevant when evaluating accelerated cerebrovascular aging. Moreover, imbalanced datasets have been observed to affect the performance of machine-learning-based brain age prediction models.

2605.16966 2026-05-19 cs.AI Updated version

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

Zhentao Tan, Yuze Hao, Boyi Zou, Mingsheng Long, Yi Yang, Gang Bao

Affiliations: Collaborative Innovation Center of Artificial Intelligence (CCAI), Zhejiang University; School of Mathematical Sciences, Zhejiang University; Tsinghua University; Center for Interdisciplinary Applied Mathematics, School of Mathematical Sciences, Zhejiang University

AI summary: This paper surveys recent advances in using artificial intelligence to solve inverse PDE problems across three categories (inverse problems, inverse design, and control problems), summarizes representative applications in science and industry, and discusses open challenges and future prospects.

Comments 35 pages, 4 figures

Abstract

Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research due to its broad significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize them into three major categories: inverse problems, inverse design, and control problems. For each category, we further present methodological paradigms and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.

2605.16961 2026-05-19 cs.CV cs.AI Updated version

Latent Action Control for Reasoning-Guided Unified Image Generation

Fuxiang Zhai, Sixiang Chen, Yingjin Li, Shuaibo Li, Jianyu Lai, Tengjun Huang, Lei Zhu

Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: This paper proposes Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions, enabling reasoning-guided image generation in a unified generator. LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, improving generation quality.

Abstract

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

2605.16938 2026-05-19 cs.CL cs.AI q-bio.NC Updated version

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

Yueqing Hu, Tianhong Wang

Affiliations: Institute of Neuroscience, Chinese Academy of Sciences; School of Philosophy, Anhui University

AI summary: This study examines whether reasoning budget affects cognitive cost alignment between humans and large reasoning models. It finds that alignment stays the same regardless of reasoning effort, suggesting that the alignment is formed at training time rather than adjusted dynamically at inference time.

Comments 8 pages, 6 figures

Abstract

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

2605.16927 2026-05-19 cs.AI Updated version

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou, Tong Yang, Xiaoyu Zhang, Tao Tan, Yue Sun, Bin Cui

Affiliations: Faculty of Applied Sciences, Macao Polytechnic University; School of Software & Microelectronics, Peking University; School of Computing and Information Systems, Singapore Management University; College of Computing and Data Science, Nanyang Technological University; School of Public Health, Peking University; School of Computer Science and Technology, Xidian University; School of Computer, Peking University; School of Computer Science and Technology, Beijing Institute of Technology; Centre for Biomedical Data Science, Duke-NUS Medical School, National University of Singapore; Duke-NUS AI + Medical Sciences Initiative, Duke-NUS Medical School

AI summary: This paper reviews intervention-aware disease trajectory modeling in clinical AI and presents a unified framework that combines forecasting, counterfactual trajectories, and policy evaluation to address treatment assignment, time-varying confounding, and observation bias, moving clinical prediction toward decision-grade evidence.

Abstract

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI--methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

2605.16909 2026-05-19 cs.AI (updated version)

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si

Affiliations: Nanjing University; Huazhong University of Science and Technology; Southwest Jiaotong University; The Chinese University of Hong Kong

AI summary: This paper introduces TOBench, an omni-modal benchmark for real-world tool-using agents, built around closed-loop multimodal verification to evaluate and advance next-generation omni-modal tool-using agents.

Comments GitHub: https://github.com/Pi3AI/TOBench

AI Chinese abstract

工具使用代理正越来越多地被期望在现实中的专业工作流程中操作，其中它们必须解释多模态输入、协调外部工具、检查中间产物并修改其行为，以最终产生结果。然而，现有的基准测试通常孤立地评估工具使用、计算机使用和多模态推理，导致基准设置与现实中的端到端多模态工具使用之间存在差距。为此，我们引入MM-ToolBench，一个用于任务导向多模态工具使用的基准和评估工具。MM-ToolBench包含100个可执行任务，来自两个宏任务家族，客户服务和智能创作，涵盖20个子类切片，并由27个MCP服务器和324个工具支持。MM-ToolBench的核心设计是闭环多模态验证：代理必须执行工具、检查渲染或转换后的产物，并在输出未能满足任务特定要求时进行自我纠正。为了使此类评估可扩展和可验证，MM-ToolBench结合了基于MCP的执行与任务特定的地面评估器以及一个半自动化的场景发现、任务实例化、评估器合成和人类审核的构建流程。在15个当代代理模型上的实验表明，MM-ToolBench仍然极具挑战性：Claude Opus 4.6，通常被视为最强的编码代理模型之一，仅达到32.0%的任务成功率，远低于94.0%的人类基准。我们设想MM-ToolBench作为评估和推动下一代多模态工具使用代理的实用基础，通过闭环多模态验证。

English abstract

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.

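2605.16895 2026-05-19 cs.CE cs.AI cs.CL (updated version)

Some Body Must Bear the Pain for Agent Accountability

Botao Amber Hu, Helena Rong

Affiliations: University of Oxford; New York University Shanghai

AI summary: This paper studies the problem of consequence reception in AI agent accountability. It argues that current LLM agents cannot satisfy the bodily conditions that consequence reception requires, so the prevailing legal responses fail, and that sociotechnical infrastructure is needed to achieve consequence-agency coupling.

AI Chinese abstract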

人工智能代理在现实世界中日益产生后果。这导致了我们称之为"后果接收"的问题：伤害发生，产生系统被识别，但没有持续的代理接收后果以改变未来行为。将痛苦机械地理解为一种纠正反馈信号，是传统惩罚理论的基础——威慑、康复、报复和 incapacitation 都假设有一个持续的场所接收信号并更新行为。这反过来要求信号能够落地的身体：一个保护完整性的边界，一个信号积累的场所，将事件信号转化为持久更新的整合，以及一个通过改变未来行动来响应的基质。当前的LLM代理——由权重、提示、工具、记忆和凭证组成的软件定义复合体，可以自由交换、复制、重置和重新组装——无法满足这些条件。因此，两种主流法律回应未能实现后果接收。薄身份代理-主体二元组拥有身体但没有"后果-代理耦合"：人类为超出其控制的行为承受痛苦——Elish的"道德皱褶区"。厚身份Arbel等人提出的"算法公司"创建了法律可识别的实体，但并不保证任何AI决策架构会将痛苦作为行为信号。因此，实现后果-代理耦合是一个社会技术基础设施问题，而不仅是法律问题。在这样的架构存在之前，高风险的AI部署应继续与可问责的人类主体 tethered，具有有意义的控制、比例责任和终止代理的权力。"如果某些身体因设计而没有承受痛苦，某些身体将因默认而承受痛苦。"

English abstract

AI agents increasingly act consequentially in the real world. This creates a problem we call consequence reception: harm occurs, the producing system is identified, yet no continuing agent receives consequences in a way that changes future behavior. Pain, understood mechanistically as a corrective feedback signal, is foundational to canonical theories of punishment -- deterrence, rehabilitation, retribution, and incapacitation all assume a continuing locus that registers the signal and updates behavior. That, in turn, requires a body for the signal to land on: a boundary whose integrity it protects, a locus where it accumulates, consolidation that converts episodic signal into durable update, and a substrate that responds by altering future action. Current LLM agents -- software-defined composites of weights, prompts, tools, memory, and credentials, freely swapped, copied, reset, and reassembled -- satisfy none of these conditions. The two prevailing legal responses therefore fail to achieve consequence reception. The thin-identity agent-principal dyad has a body but no consequence-agency coupling: the human bears pain for behaviors beyond their control -- Elish's moral crumple zone. The thick-identity Arbel et al.'s Algorithmic Corporation creates legally legible entities but does not guarantee that any AI decision architecture receives pain as a behavioral signal. Achieving consequence-agency coupling is therefore a sociotechnical infrastructural problem, not only a legal one. Until such architectures exist, high-stakes AI deployment should remain tethered to accountable human principals with meaningful control, proportional liability, and authority to constrain or terminate the agent. If some body does not receive the pain by design, some body will receive it by default.

2605.16864 2026-05-19 cs.CV cs.AI (updated version)

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

Yachan Guo, Jose Luis Gomez Zurita, Danna Xue, Yi Xiao, Antonio Manuel Lopez Pena

Affiliations: Universitat Autònoma de Barcelona; Computer Vision Center; Harbin Institute of Technology, Shenzhen

AI summary: This paper proposes a metric-guided feature fusion method that scores the feature spaces of different visual foundation models, then selects and aggregates complementary features to improve dense prediction.

Comments Accepted to the CVPR 2026 Findings Track

AI Chinese abstract

尽管大规模视觉基础模型（VFMs）在语义理解方面表现优异，但在实例感知的密集预测任务中仍显不足。它们在表示上存在不同的偏倚：例如，可提示的分割模型（如SAM2）专注于细粒度区域边界，而自监督模型（如DINOv3）强调物体层面的结构。这一观察表明，结合不同VFMs的互补特征可以增强下游密集预测任务。然而，简单的多VFMs融合 seldom 导致可靠的增益，且如何利用其互补特征的可解释原则仍待探索。在本文中，我们提出了一种基于度量的方法，通过显式的评估分数选择并聚合不同VFMs的互补特征。具体而言，我们设计了一套无标签的度量标准，在特征空间的两个方面，结构一致性与边缘保真度，来评估VFM编码器的特征。在这些分数的指导下，我们识别出互补性强的边缘强和结构强的编码器对，并通过主辅融合方案进行整合。这种特征融合不需要复杂的架构更改，并且仅在单个阶段进行训练。我们的模型在多个密集预测任务中相比基线模型表现出一致的性能提升，具有更好的物体层面语义和更准确的边界定位。代码可在{https://github.com/gyc-code/metric-guided-fusion}获取。

English abstract

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at https://github.com/gyc-code/metric-guided-fusion.

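2605.16863 2026-05-19 cs.RO cs.AI cs.LG (updated version)

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

Affiliations: Technion; Harvard

AI summary: This paper proposes an extrinsic search-guided diffuser (XDiffuser) that first plans over a state-space graph and then guides the diffusion process, improving the efficiency and quality of long-horizon planning, with particularly strong results on low-quality data and unseen tasks.

AI Chinese abstract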

组合扩散模型通过去噪多个重叠的子轨迹并确保它们构成全局解，为长视距规划提供了一条有前途的路线。然而，强制在长链上执行局部行为往往不足以产生一致的全局结构。最近的工作通过内在搜索在去噪过程中探索多条路径来解决这一限制。尽管内在搜索提高了全局一致性，但代价是重复评估已经计算密集的模型。在本文中，我们主张在去噪过程之外进行外在搜索，为长视距规划提供更有效的探索模式，同时自然地使经典算法能够解决测试时的未见组合任务。我们的eXtrinsic搜索引导的Diffuser（XDiffuser）首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连接Oracle。该计划随后用于引导单条轨迹的去噪，有效地将探索负担转移出去。XDiffuser在长视距任务上优于基于扩散的基线，特别是在低质量数据领域和超出目标到达的未见任务中，包括多智能体协调和TSP风格推理。项目网站：https://yanivhass.github.io/XDiffuser-site/

English abstract

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/
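
The "plan first" step can be illustrated with a minimal sketch, assuming the state-space graph is a plain adjacency dictionary and that planning is a breadth-first shortest-path search. The graph, the node names, and the `plan_on_graph` helper are hypothetical; the abstract does not say which classical planner XDiffuser uses, and the guided denoising step itself is not shown here.

```python
from collections import deque


def plan_on_graph(adjacency, start, goal):
    """Breadth-first search returning a shortest node path, or None if unreachable."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            # Walk parent pointers back to the start, then reverse.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None


# Hypothetical 5-node state-space graph (illustrative only).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
waypoints = plan_on_graph(graph, "A", "E")
```

A waypoint list of this kind is the sort of coarse plan a diffusion model could be conditioned on when denoising a single trajectory, so that the search happens outside the denoising loop.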

2605.16861 2026-05-19 cs.CV cs.AI (updated version)

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

Mingxu Chai, Ziyu Shen, Chenyu Liu, Kaidi Zhang, Jiazheng Zhang, Dingwei Zhu, Zhiheng Xi, Ruoyu Chen, Jun Long, Jihua Kang, Tao Gui, Qi Zhang

Affiliations: Computation and Artificial Intelligence Innovative College, Fudan University, Shanghai, China; Shanghai Innovation Institute, Shanghai, China; ByteDance, Shanghai, China

AI summary: This paper proposes the Prefix-Adaptive Block Diffusion Model (PA-BDM), which reworks intra-block denoising and cache commitment to improve the efficiency and accuracy of document recognition.

Comments 17 pages, 6 figures

AI Chinese abstract

块扩散模型（BDMs）支持并行生成、灵活长度输出和KV缓存，使其在高效文档解析中具有潜力。然而，现有BDMs将去噪和缓存承诺绑定到固定的块边界：块内去噪时并行性缩小，而生成的token无法缓存直到整个块完成。此外，块内双向去噪与块间自回归冲突，导致信息流不一致，可能挑战结构敏感的识别。我们提出前缀自适应块扩散模型（PA-BDM），用从前缀到后缀的因果去噪替代块内双向去噪，并将块大小视为最大候选范围而非固定承诺单位。PA-BDM使用置信度门控结构损失（CSL）在扩展训练到更长延续之前构建低熵前缀。在推理过程中，逐步前缀承诺（PPC）则动态地将最长可靠的前缀投入KV缓存，并从更新的前缀重置下一个候选范围，每一步都恢复大的并行解码空间。实验表明，3B PA-BDM在多个基准上实现了更高的识别得分，并在2.5B MinerU-Diffusion上将推理吞吐量提高了71.6%。

English abstract

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6% over the 2.5B MinerU-Diffusion.
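
The commit-then-reset loop described for Progressive Prefix Commitment can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's implementation: "reliable" is taken to mean per-token confidence at or above a fixed `threshold`, at least one token is committed per step so decoding always advances, and `toy_scores` is a hypothetical stand-in for the model's confidence output.

```python
def longest_reliable_prefix(confidences, threshold=0.9, min_commit=1):
    """Length of the longest prefix whose tokens all meet the threshold."""
    length = 0
    for confidence in confidences:
        if confidence < threshold:
            break
        length += 1
    # Assumption: commit at least `min_commit` tokens so decoding advances.
    return max(length, min(min_commit, len(confidences)))


def decode(score_fn, total_tokens, max_range=4, threshold=0.9):
    """Toy loop: score a candidate range, commit a prefix, restart from it."""
    committed = 0
    steps = 0
    while committed < total_tokens:
        # The block size acts as a maximum candidate range, not a fixed unit.
        span = min(max_range, total_tokens - committed)
        confidences = score_fn(committed, span)
        committed += longest_reliable_prefix(confidences, threshold)
        steps += 1
    return steps


def toy_scores(start, span):
    """Hypothetical confidences: every third position is uncertain."""
    return [0.5 if (start + i) % 3 == 2 else 0.95 for i in range(span)]


steps_needed = decode(toy_scores, total_tokens=6)
```

Because the candidate range is re-anchored at the committed prefix after every step, each step again considers up to `max_range` fresh positions instead of a shrinking remainder of a fixed block.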

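2605.16860 2026-05-19 cs.LG cs.AI q-bio.QM (updated version)

Learning to Learn from Multimodal Experience

Xingyu Sui, Weixiang Zhao, Yongxin Tang, Yanyan Zhao, Yang Wu, Dandan Tu, Bing Qin

Affiliations: Harbin Institute of Technology; Huawei Technologies Co., Ltd

AI summary: This paper proposes a new paradigm, learning to learn from multimodal experience, in which agents dynamically construct and use memory to improve performance and generalization, addressing the shortcomings of fixed memory designs in multimodal environments.

AI Chinese abstract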

经验驱动学习已成为一种有前景的范式，使智能体能够通过积累和重用过去经验来改进。然而，现有方法主要在文本环境中开发，并依赖于手动设计的记忆架构，限制了它们在多模态环境中的适用性。在现实场景中，经验本质上是多模态的，涉及感知、推理和行动中的异构信号，这使得有效记忆设计变得更加具有挑战性。特别是，最优的多模态经验结构和利用方式高度依赖于任务，并随时间变化，使得固定记忆设计不足。在本文中，我们提出了一种新的范式，即从多模态经验中学习，将记忆设计从预定义的组件转变为适应性和可学习的过程。我们的框架使智能体能够根据任务需求和交互历史动态构建、组织和利用记忆，有效学习如何结构化经验以提高性能。实验表明，适应性记忆设计显著增强了智能体在多模态任务中的性能和泛化能力，突显了学习记忆机制在经验驱动学习中的关键作用。

English abstract

Experience-driven learning has emerged as a promising paradigm for enabling agents to improve from interaction trajectories by accumulating and reusing past experience. However, existing approaches are predominantly developed in textual settings and rely on manually designed memory schemas, limiting their applicability to multimodal environments. In real-world scenarios, experience is inherently multimodal, involving heterogeneous signals across perception, reasoning, and action, which makes effective memory design significantly more challenging. In particular, the optimal way to structure and utilize multimodal experience is highly task-dependent and evolves over time, rendering fixed memory designs insufficient. In this work, we propose a new paradigm, learning to learn from multimodal experience, which shifts memory design from a predefined component to an adaptive and learnable process. Our framework enables agents to dynamically construct, organize, and utilize memory based on task requirements and interaction history, effectively learning how to structure experience for improved performance. Experiments demonstrate that adaptive memory design substantially enhances agent performance and generalization across multimodal tasks, highlighting the critical role of learning memory mechanisms in experience-driven learning.

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG (updated version)

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

Affiliations: State Key Lab of CAD&CG

AI summary: This paper proposes Pattern Inference and Pattern Induction, strategies that let vision-language models perceive and reason more efficiently and accurately in visual planning, addressing the perceptual bottleneck that arises on complex inputs.

AI Chinese abstract

从原始视觉输入进行规划仍然对当前的视觉-语言模型（VLMs）构成重大挑战，当输入复杂度超出其一步感知能力时。受最近在图像思考（TWI）中的进展启发，一种合理的解决方案是通过迭代获取和整合局部视觉证据，将感知过程分解为更简单的步骤。然而，尽管当前VLMs在一般TWI能力上训练良好，但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战，我们将TWI视为一种工具，逐步构建并反映一个准确的内部世界模型。我们发现，由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务，但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率，我们提出模式推理，一种新的TWI策略，使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式，我们提出模式诱导，一种在线归纳学习策略，将视觉模式视为复合且可重用的专家，这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明，我们的方法在准确性和效率之间实现了良好的平衡。

English abstract

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

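2605.16844 2026-05-19 cs.AI (updated version)

Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence

Boris Kriuk

Affiliations: Independent Monograph

AI summary: This monograph examines the unnamed regime of machine behavior between narrow and general intelligence, names it Artificial Adaptive Intelligence (AAI), defines an adaptivity index and a principle of parametric minimality, analyzes three pathways to minimality, and illustrates them with case studies across several domains.

AI Chinese abstract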

在我们部署的狭义系统和我们推测的通用智能之间，存在一个从未被命名的机器行为阶段。本文主张这一阶段并非空缺：它是在元学习、神经架构搜索、AutoML、持续学习、进化计算和物理感知建模等技术中悄然汇聚的共同原则，即持续地将人类从参数规范的循环中排除。我们将其命名为人工适应智能（AAI），并对其进行操作性定义：一个系统表现出AAI的程度在于它不需要人类指定的可调超参数，同时在多样化的任务分布中保持竞争性性能。为使定义量化，我们引入了一个适应性指数，该指数衡量在与规模正交的轴上进展的进度，结合了系统吸收的超参数比例与相对于任务专用基线的性能比率。我们发展了参数最小性原则，并基于最小描述长度框架加以阐述，表明适当的超参数数量是由数据决定而非设计者决定。随后，我们围绕实现最小性的三条路径组织该领域：数据和任务感知的配置、结构和进化形态变化，以及训练中的自我适应。我们分析了它们的稳定性、收敛性和治理影响，并通过涵盖航空航天设计、金融制度检测、湍流建模、生态动态和视觉语言系统等案例研究来说明这些路径。本文的论点是：从ANIL到AGI的路径经过AAI，并且命名这一阶段改变了我们测量、构建和称作成功的标准。

English abstract

Between the narrow systems we deploy and the general intelligence we speculate about lies an entire regime of machine behavior that has never received its own name. This monograph argues that this regime is not empty: it is where meta-learning, neural architecture search, AutoML, continual learning, evolutionary computation, and physics-informed modeling have quietly converged on a common principle, namely the steady removal of the human from the loop of parameter specification. We name this regime Artificial Adaptive Intelligence (AAI) and define it operationally: a system exhibits AAI to the extent that it requires no human-specified tunable hyperparameters while maintaining competitive performance across a diverse distribution of tasks. To make the definition quantitative, we introduce an adaptivity index that measures progress along an axis orthogonal to scale, combining the fraction of hyperparameters absorbed by the system with the performance ratio against a task-specialized baseline. We develop the principle of parametric minimality and ground it in the minimum description length framework, showing that the appropriate hyperparameter count is data-determined rather than designer-determined. We then organize the field around three pathways to minimality: data- and task-aware configuration, structural and evolutionary morphing, and in-training self-adaptation. We analyze their stability, convergence, and governance implications, and illustrate them through case studies spanning aerospace design, financial regime detection, turbulence modeling, ecological dynamics, and vision-language systems. The thesis is that the path from ANI to AGI passes through AAI, and that naming this stage changes what we measure, what we build, and what we call a success.
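
To make the two ingredients of the adaptivity index concrete, here is a minimal sketch. The abstract names the ingredients (the fraction of hyperparameters absorbed by the system and the performance ratio against a task-specialized baseline) but not how they are combined, so the product used below is purely an assumption, as are the function and parameter names.

```python
def adaptivity_index(absorbed, total, system_score, baseline_score):
    """Hypothetical index: absorbed-hyperparameter fraction times performance ratio."""
    if total <= 0 or baseline_score <= 0:
        raise ValueError("total and baseline_score must be positive")
    if not 0 <= absorbed <= total:
        raise ValueError("absorbed must lie between 0 and total")
    absorbed_fraction = absorbed / total
    performance_ratio = system_score / baseline_score
    # Assumption: a simple product. The monograph's actual combination rule
    # is not stated in the abstract.
    return absorbed_fraction * performance_ratio


# A system that absorbs 8 of 10 hyperparameters at 90% of baseline performance.
example_index = adaptivity_index(absorbed=8, total=10, system_score=0.9, baseline_score=1.0)
```

Under this assumed form, a system scores highly only when it both removes human-specified hyperparameters and stays competitive with the specialized baseline, which matches the operational definition given in the abstract.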

2605.16842 2026-05-19 cs.AI (updated version)

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu

Affiliations: Shanghai Jiao Tong University; Shanghai Artificial Intelligence Laboratory; Nanjing University; Shanghai Innovation Institute; Peking University

AI summary: This paper proposes HT-GRPO, a hierarchical reinforcement learning method that uses a Sketch-Then-Paint training scheme and Hierarchical Credit Assignment to address key obstacles in RL optimization of diffusion multi-modal large language models, improving image quality and aesthetics.

AI Chinese abstract

扩散多模态大语言模型（dMLLMs）在图像生成方面具有强大能力，但通过强化学习（RL）进行优化仍是一个主要挑战。一个主要困难是单张图像可以通过许多不同的去屏蔽序列生成，这使得计算重要性比率往往不可行。此外，现有方法往往忽视dMLLMs的分层生成过程，其中早期标记定义全局布局，后期标记关注局部细节。通过给所有标记分配均匀奖励，这些现有方法未能反映每个标记对最终图像的实际贡献。为了解决这些问题，我们提出了Hierarchical Token GRPO（HT-GRPO），将此层次结构直接整合到策略优化过程中。我们的方法特征一个Sketch-Then-Paint训练方案，将更新过程分为三个不同的阶段：全局、结构和细化。我们还使用一个提示条件估计器来从完全遮蔽状态开始计算重要性比率。此外，我们引入了一种分层信用分配机制，优先考虑关键结构标记，以确保准确的奖励传播。使用两种流行的dMLLM骨干网络MMaDA和Lumina-DiMOO进行的实验表明，HT-GRPO在GenEval和DPG基准上取得了显著成效。在六个额外指标上的评估证实了在图像质量、美学和人类偏好方面的显著改进。

English abstract

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

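2605.16834 2026-05-19 cs.CV cs.AI cs.LG (updated version)

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

Shiwon Kim, Yu Rang Park

Affiliations: Yonsei University

AI summary: This paper proposes a relative-representation method for fine-grained multimodal alignment with limited paired data; it learns token-level cross-modal structure to improve zero-shot classification, cross-modal retrieval, and zero-shot segmentation.

AI Chinese abstract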

多模态预训练展示了强大的泛化性能，但在缺乏配对数据的领域中，这种范式往往难以实施。一种有前景的替代方法是事后多模态对齐，它通过有限数量的配对示例分别对预训练的单模态编码器进行对齐。然而，现有方法主要关注全局表示的对齐，忽略了片段-token关系。这可能阻碍了需要细粒度跨模态匹配的任务的迁移，超越粗粒度样本层面的语义。为了解决这个问题，我们提出了一种事后对齐方法，通过相对表示学习token级别的跨模态结构。具体来说，我们通过图像和文本与每种模态空间中一组可学习锚点的token级相似性来表示它们，这些锚点被训练以诱导一致的跨模态相似性模式，以匹配对。尽管仅学习锚点而没有重大的投影层，我们的方法在零样本分类、跨模态检索和零样本分割任务中均显著优于现有方法。这突显了在有限配对数据下，建模细粒度跨模态结构对于有效事后多模态对齐的重要性。

English abstract

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.
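
The idea of representing tokens by their similarities to anchors can be sketched in a few lines. This is an illustration only: the embeddings and anchors below are fixed toy values rather than learned ones, cosine similarity is assumed as the similarity measure, and the training objective that makes matched pairs agree is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def relative_representation(tokens, anchors):
    """Map each token embedding to its similarities with every anchor."""
    return [[cosine(token, anchor) for anchor in anchors] for token in tokens]


# Toy image tokens live in 2-D; toy text tokens live in 3-D.
image_tokens = [[1.0, 0.0], [0.0, 2.0]]
image_anchors = [[1.0, 0.0], [0.0, 1.0]]
text_tokens = [[0.0, 3.0, 0.0]]
text_anchors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

image_relative = relative_representation(image_tokens, image_anchors)
text_relative = relative_representation(text_tokens, text_anchors)

# Both modalities now share one coordinate per anchor, so they are comparable.
match = cosine(image_relative[0], text_relative[0])
```

Although the two encoders have different embedding sizes, each token ends up with one coordinate per anchor, which is what allows token-level comparison across modalities without a heavy projection layer.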

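2605.16828 2026-05-19 stat.ML cs.AI cs.LG stat.ME (updated version)

AgentKernelArena: A Generalization-Aware Benchmark for GPU Kernel Optimization Agents

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

Affiliations: AMD

AI summary: This paper presents AgentKernelArena, an open-source benchmark for evaluating GPU kernel optimization agents. It uses isolated workspaces and centralized scoring to measure agent performance and generalization across tasks and hardware targets, and finds near-perfect compilation and high correctness rates on most task categories but substantial correctness drops for PyTorch-to-HIP translation on unseen configurations.

AI Chinese abstract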

GPU核优化对于高效深度学习系统日益关键，但编写高性能核仍然需要大量的低级专业知识。最近的AI编码代理可以迭代阅读代码、调用编译器和性能分析器，并优化实现，但现有的核基准测试仅评估单个LLM调用而非完整的代理工作流程，且未包含核到核的优化和未见过的配置泛化测试。我们提出了AgentKernelArena，一个开源的基准测试，用于衡量AI编码代理在GPU核优化上的能力。该基准测试包含196个任务，涵盖HIP到HIP的优化、Triton到Triton的优化以及PyTorch到HIP的转换，并在隔离的工作区中使用门控编译、正确性和性能检查，集中评分和一个未见过的配置泛化协议，测试优化是否转移到代理从未见过的输入配置。在包括Cursor Agent、Claude Code和Codex Agent在内的生产代理中，我们发现大多数任务在正确性和编译效率上表现优异，最强配置在PyTorch到HIP任务中平均加速达6.89倍，在HIP到HIP任务中达6.69倍，在Triton到Triton任务中达2.13倍。我们的未见过的配置评估显示，HIP到HIP和Triton到Triton的优化大多能转移到未见过的输入形状，而PyTorch到HIP的转换则表现出显著的正确性下降，表明生成核的代理经常硬编码形状特定的假设。AgentKernelArena被设计为一个模块化、可扩展的框架，用于严格评估跨代理、任务和硬件目标的代理GPU核优化。

English abstract

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.
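
The gated evaluation order (compilation, then correctness, then performance) can be sketched as below. This is not the benchmark's actual scoring code: the `KernelResult` fields, the choice to score a failed gate as 0.0, and the use of an arithmetic mean are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class KernelResult:
    compiled: bool
    correct: bool
    baseline_ms: float   # runtime of the reference kernel
    optimized_ms: float  # runtime of the agent's kernel


def gated_speedup(result):
    """Speedup if the compile and correctness gates pass, else 0.0 (assumed)."""
    if not result.compiled or not result.correct:
        return 0.0
    return result.baseline_ms / result.optimized_ms


def mean_speedup(results):
    """Arithmetic mean of gated speedups over a list of task results."""
    scores = [gated_speedup(r) for r in results]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical task outcomes: one success, one incorrect kernel, one compile failure.
results = [
    KernelResult(True, True, 10.0, 5.0),
    KernelResult(True, False, 10.0, 1.0),
    KernelResult(False, False, 10.0, 10.0),
]
overall = mean_speedup(results)
```

Gating matters because a fast but incorrect kernel, like the second result above, would otherwise dominate the average; an unseen-configuration check would rerun the same gates on input shapes the agent never observed.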

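2605.16818 2026-05-19 cs.CV cs.AI (updated version)

Encoding Robust Topological Features for Hyperdimensional Computing

Arpan Kusari

Affiliations: University of Michigan Transportation Research Institute; University of Michigan

AI summary: This paper proposes a topology-guided approach to hyperdimensional computing that extracts discrete topological primitives and pairs them with RTS-invariant shape signatures, improving robustness to corruptions such as rotation, noise, and occlusion; experiments on MNIST and EMNIST show gains over a naive HD baseline.

AI Chinese abstract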

超维（HD）计算由于其简单性、快速的原型基推断和与在线更新的兼容性，为边缘学习提供了一个有吸引力的替代方案。然而，标准的基于像素的HD编码器容易受到分布偏移的影响，如旋转、噪声或遮挡，会显著降低准确性。我们从二值化形状中提取离散拓扑原始特征——尤其是孔洞，并将它们与旋转/平移/缩放（RTS）不变的形状签名配对。我们的方法为（i）外轮廓使用空间金字塔变体的Zernike矩构建RTS稳定的描述符，（ii）每个孔洞使用其径向签名的内在傅里叶描述符以及RTS-标准相对几何。每个原始特征通过随机投影和角色绑定映射到双极超向量，并通过排列不变的捆绑聚合变量卡数的孔洞集以形成单个图像超向量。为了避免过度加权任何线索，我们通过在验证集上融合余弦相似度学习Zernike和孔洞通道的非负可靠性权重。在MNIST和EMNIST数据集上进行的实验表明，拓扑引导的HD计算相比传统HD基线显著提高了鲁棒性，保持了多个扰动家族的高精度，并受益于轻量级在线训练。与在干净数据上训练的紧凑CNN相比，我们的方法在清洁精度上具有竞争力，同时对几种像素级扰动具有明显更强的鲁棒性，证明了显式拓扑结构是实现鲁棒HD表示的可行途径。代码在https://github.com/arpan-kusari/Topological-HDC提供。

English abstract

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives, most notably holes, from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that Topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.
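
The role binding and permutation-invariant bundling steps can be sketched with bipolar hypervectors in plain Python. This is a generic illustration of those two HD operations, not the paper's encoder: random hypervectors stand in for the projected hole descriptors, and the dimensionality, seed, and helper names are arbitrary choices.

```python
import random

DIM = 2048  # hypervector dimensionality (illustrative choice)


def random_hypervector(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(role, filler):
    """Role binding: elementwise product of two bipolar hypervectors."""
    return [r * f for r, f in zip(role, filler)]


def bundle(vectors):
    """Bundling: elementwise sum followed by sign (ties map to +1)."""
    return [1 if sum(column) >= 0 else -1 for column in zip(*vectors)]


def cosine(u, v):
    """Bipolar vectors all have norm sqrt(DIM), so cosine is dot / DIM."""
    return sum(a * b for a, b in zip(u, v)) / DIM


rng = random.Random(0)
hole_role = random_hypervector(rng)
# Stand-ins for three projected hole descriptors.
holes = [random_hypervector(rng) for _ in range(3)]
bound = [bind(hole_role, hole) for hole in holes]
image_hv = bundle(bound)
unrelated = random_hypervector(rng)
```

Because bundling is a sum, the order in which holes are listed does not change `image_hv`, and the bundled vector stays measurably similar to each bound hole while remaining nearly orthogonal to unrelated vectors.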

2605.16779 2026-05-19 cs.CV cs.AI (updated version)

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

Mingyang Zhao, Sipu Ruan, Xiaohong Jia

Affiliations: State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences; University of Chinese Academy of Sciences; Robotics Institute, School of Mechanical Engineering and Automation, Beihang University

AI summary: This paper presents a method for fitting superquadrics to point clouds contaminated by noise and outliers. It recasts the problem as unsupervised clustering, fits rigid and deformable superquadrics in one framework, and provides closed-form solutions and a convergence proof.

Comments 20 pages, Code: https://github.com/zikai1/SuperquadricFitting

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

AI Chinese abstract

本文提出了一种新的方法，用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合，该方法在多个领域具有广泛的应用。与以往仅专注于拟合刚性或变形超二次曲面或存在鲁棒性和数值稳定性问题的方法不同，我们的方法从无监督聚类的新视角重新定义问题，使刚性和变形超二次曲面的拟合能够在统一的框架中完成。我们的方法核心是一种受无监督聚类分析启发的稳定优化函数，其中我们将点云数据和潜在参数曲面的样本分别作为聚类成员和质心。然后，具有动态更新质心位置的聚类过程成为优化超二次曲面参数的直接代理，建立了几何拟合与聚类动态之间的原则性联系。我们进一步推导了聚类质心与聚类成员之间的成对计算与正交距离之间的关系，从而有效消除了耗时的曲面采样过程。此外，我们的公式为模糊成员度向量和协方差矩阵提供了闭式解析解，确保了高效迭代优化，并能够更有效地处理几何变形。此外，我们还提供了收敛性分析的理论证明，并证明了聚类启发的拟合方法通过内在增加目标函数的凸性来逃避局部极小值。实现已公开在https://github.com/zikai1/SuperquadricFitting。

English abstract

This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.


2605.16776 2026-05-19 cs.LG cs.AI 版本更新

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

可区分删除：统一知识擦除与拒绝用于大语言模型去学习

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

发表机构 * Department of Natural Language Processing, MBZUAI； University of Oxford； RIKEN Center for Advanced Intelligence Project； TMLR Group, Department of Computer Science, Hong Kong Baptist University

AI总结本文提出D^2方法，通过限制潜在表示中的响应分布来擦除不受欢迎的知识，同时区分保留知识，从而实现安全且一致的拒绝机制，以提高大语言模型去学习的效果。

Comments ICML2026 Accepted

详情

AI中文摘要

减轻敏感和有害输出对于确保大型语言模型（LLM）的安全部署至关重要。现有方法通常遵循两种范式：知识删除（KD），在训练期间擦除不受欢迎的信息，以及可区分拒绝（DR），在推理期间引导模型远离使用敏感知识。尽管进展迅速，基于KD的去学习在抑制特定令牌序列作为完整知识移除替代物时面临偏见删除的问题，而基于DR的去学习则因底层知识仍然完整而有重新出现有害知识的风险。为了解决这些问题，我们提出了可区分删除（D^2），一种通过限制潜在表示中的响应分布来擦除不受欢迎知识，同时区分保留知识的范式，从而能够安全且一致地处理去学习的输入。为了实现D^2，我们引入了一个能量指数，该指数量化了知识的存在以及去学习内容与保留内容之间的分离。数学和实证分析表明，能量既准确又高效，使能量基于去学习对齐（EUA）能够在训练期间强制执行能量边界去学习，并在推理时应用基于能量的拒绝机制。广泛的实验表明，EUA显著优于先前方法，表明D^2的优越性。我们的代码可在https://github.com/Puning97/EUA-for-LLM-Unlearning获取。

英文摘要

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
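The abstract does not define its energy index, so the following is only a rough illustration of what an energy-based refusal gate could look like. It assumes the common free-energy score, -T * logsumexp(logits / T), from energy-based out-of-distribution detection; the function names, the threshold, and the convention that higher energy triggers refusal are all hypothetical and may differ from EUA.

```python
import math

def free_energy(logits, temperature=1.0):
    """Free-energy score -T * logsumexp(logits / T), computed stably."""
    scaled = [v / temperature for v in logits]
    m = max(scaled)  # subtract the max before exponentiating to avoid overflow
    return -temperature * (m + math.log(sum(math.exp(v - m) for v in scaled)))

def should_refuse(logits, threshold):
    """Hypothetical gate: refuse when the energy exceeds the threshold.

    The sign convention (high energy = unlearned content) is an assumption.
    """
    return free_energy(logits) > threshold
```

Under this convention, sharply peaked logits give a lower energy than flat logits, so a flat, uncertain output distribution is the one that crosses the threshold.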


2605.16775 2026-05-19 cs.CV cs.AI cs.LG 版本更新

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D: 基于3D体积分块对齐的脑MRI自监督学习

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

发表机构 * Institute of Health Informatics（健康信息学研究所）； Sheikh Zayed Institute for Pediatric Surgical Innovation（谢赫扎耶德儿童外科创新研究所）； School of Medicine and Health Sciences（医学与健康科学学院）

AI总结本文提出VolTA-3D，一种用于脑MRI自监督学习的3D视觉Transformer框架，通过联合对齐全局类风格标记和局部块标记，增强体积分块表示的可迁移性，从而在多个下游任务中表现出更好的泛化能力和鲁棒性。

Comments Accepted at EMBC 2026

详情

AI中文摘要

自监督学习（SSL）通过利用大规模未标记数据推动了医学图像分析的发展。然而，在脑磁共振成像（MRI）中，大多数3D模型仍局限于分割或分类任务，限制了其在不同数据集、成像协议和下游任务中的泛化能力。这种缺乏可迁移性限制了3D MRI模型的临床应用，尽管存在大量未标记的体数据。我们提出了Volta-3D，一种自监督的3D视觉Transformer框架，旨在学习可迁移的体表示。Volta-3D在学生-教师范式中联合对齐全局类风格标记和局部块标记，并强制细粒度结构重建。这种联合全局-局部对齐解决了脑MRI中有限的语义多样性和细微解剖特征，这对现有SSL方法构成了挑战。我们在多个分布外下游任务上评估了Volta-3D，包括海马体分割和性别及阿尔茨海默病与健康对照的分类。在所有任务中，Volta-3D学习的表示均优于随机初始化的基线，证明了其在域偏移下的改进可迁移性和鲁棒性。因此，在预训练过程中联合强制全局语义一致性和局部结构学习，使模型能够从未标记的脑MRI数据中学习更广泛的概念。总体而言，VolTA-3D支持有效的多任务下游性能，具有任务特定的适应性，是迈向通用化和临床可行的3D模型的一步。

英文摘要

Self-supervised learning (SSL) has advanced medical image analysis by enabling learning from large unlabeled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation or classification, limiting their ability to generalize across datasets, imaging protocols, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present VolTA-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. VolTA-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenge existing SSL approaches. We evaluate VolTA-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by VolTA-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence, jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall, VolTA-3D supports effective multi-task downstream performance with task-specific pertaining, a step towards generalizable and clinically viable 3D models.


2605.16774 2026-05-19 cs.CV cs.AI 版本更新

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

CANSURF：用于水面垃圾检测与跟踪的ASV视角易拉罐数据集与基准

Zaid Aljundi, Zahra F. Rahmatullah, Mostafa Elemam, Abdullah Moosa

发表机构 * School of Mathematical and Computer Sciences（数学与计算机科学学院）； Heriot-Watt University Dubai（赫瑞-瓦特大学迪拜校区）； School of Engineering and Physical Sciences（工程与物理科学学院）

AI总结本文提出了一种新的ASV视觉系统和表面可回收物数据集，用于在水面条件下检测和跟踪小型反射性垃圾，如铝罐。数据集包含约7.3k张原始图像，经过十种增强方法扩展至约57k张训练/验证图像，涵盖了多样的光照和水状态。通过基准测试，训练YOLOv11在CANSURF数据集上提升了12倍的性能，展示了数据集的价值。实验表明，YOLOv11+ByteTrack在稳定跟踪和多目标准确性方面表现最佳，而YOLOv11+SAHI在远距离罐子的召回率上有所提升，但精度有所下降。考虑到任务需求，YOLOv11 + SAHI在检测最大数量的罐子方面表现更好。

Comments Published in the 2025 8th International Conference on Signal Processing and Information Security (ICSPIS). Published and available to view on IEEE Xplore

Journal ref Proc. 2025 8th Int. Conf. Signal Processing and Information Security (ICSPIS), 2025, pp. 1-6

详情

DOI: 10.1109/ICSPIS67605.2025.11318414

AI中文摘要

表面级海洋垃圾仍然是自主清洁任务中的实际瓶颈，其中小型、反射性的目标（如铝罐）必须在强光、波浪和部分淹没条件下从远处检测。本文提出了一种ASV视觉系统和一个新的表面可回收物数据集。该数据集包含约7.3k张从视频中提取的原始图像，并通过十种增强类型扩展至约57k张训练/验证图像，涵盖了多样化的光照和水状态。一组针对表面操作定制的检测器和检测-跟踪管道进行了基准测试。在CANSURF上训练YOLOv11的性能比通用数据集提高了12倍，突显了数据集的价值。实验表明，YOLOv11+ByteTrack在稳定跟踪（较少的身份切换）和多目标准确性方面表现最佳，而YOLOv11+SAHI在远距离罐子的召回率上有所提升，但精度在全上下文输入中有所下降。鉴于任务配置，单罐拾取与接近和抓取，YOLOv11 + SAHI在检测最大数量的罐子方面表现更好。没有先前的公开数据集针对从水面视角在水面上检测铝罐；此数据集填补了这一空白，并支持可重复的评估。

英文摘要

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations was benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile (single-can pickup with approach and grab), YOLOv11+SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.


2605.16770 2026-05-19 cs.CL cs.AI 版本更新

Exploring Lightweight Large Language Models for Court View Generation

探索用于法院视图生成的轻量级大语言模型

Zhitian Hou, Tianyong Hao, Nanli Zeng, Zhixiong Chao, Kun Zeng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University（中山大学计算机科学与工程学院）； School of Computer Science, South China Normal University（华南师范大学计算机学院）； China Mobile Internet Co., Ltd.（中国移动互联网有限公司）

AI总结本文研究了轻量级大语言模型在法院视图生成中的能力及其对指控预测的影响，探讨了模型架构、大小对性能的影响，以及轻量级LLM与深度神经网络在任务中的比较，同时开发了CVGEvalKit评估框架。

详情

AI中文摘要

刑事法院视图生成（CVG）是法律人工智能（Legal AI）中的关键任务，涉及根据案件事实生成法院视图。在本工作中，我们系统地探索了轻量级（小于2B参数）大语言模型（LLMs）在CVG中的能力及其对指控预测的影响。我们的研究解决了四个关键问题：（1）不同架构的LLMs如何影响CVG质量和指控预测；（2）LLMs的大小如何影响性能；（3）轻量级LLMs在这些任务中与深度神经网络（DNNs）的比较；（4）通过先生成法院视图再预测指控与直接预测指控的比较。此外，我们还开发了CVGEvalKit评估框架，包括三个公开可用的数据集用于CVG任务以及预测其指控。在该框架上进行了全面实验，模型在混合训练集上训练，并在每个数据集的测试集上评估。实验结果提供了关于模型架构、模型大小和不同任务之间影响的权衡的新见解，突显了轻量级LLMs在司法AI应用中的潜力。源代码匿名地可在https://github.com/ZhitianHou/CVGEvalKit获取。

英文摘要

Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court views based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how do different LLM architectures affect CVG quality and charge prediction, (2) how does LLM size contribute to performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting the charge after first generating the court view compare with predicting it directly. Additionally, we develop CVGEvalKit, an evaluation framework including three publicly available datasets for CVG tasks, as well as for predicting their charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at https://github.com/ZhitianHou/CVGEvalKit.


2605.16757 2026-05-19 cs.AI cs.MA stat.ME stat.ML 版本更新

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

NeuroMAS: 将多智能体系统视为神经网络并进行联合强化学习

Haoran Lu, Luyang Fang, Wenxuan Zhong, Ping Ma

发表机构 * Department of Statistics（统计系）； University of Georgia（佐治亚大学）

AI总结本文提出NeuroMAS，一种将多智能体系统视为可训练和可扩展的神经网络架构的方法，通过联合强化学习提升多智能体系统的性能和可扩展性。

详情

AI中文摘要

多智能体语言系统通常被构建为人工设计的工作流，其中智能体被分配语义角色，通信协议被预先指定。我们提出NeuroMAS，一种首次将多智能体语言系统视为可训练、可扩展的类神经网络架构的方法，其中LLM智能体作为节点，中间文本信号作为边。在NeuroMAS中，智能体节点是无角色但结构感知的：拓扑结构只决定信息总体上如何流动，而强化学习训练决定节点如何通信、专业化和协调。这种表述将多智能体设计从工作流工程转向架构设计，其中深度、宽度、连接性和增长协议成为可扩展的能力来源。进一步，我们提供了一个理论视角，说明为何这种模块化文本计算在任务允许层次分解时更具参数效率。实验表明，NeuroMAS相比推理时和经过训练的多智能体基线均有显著提升。我们进一步发现，组织扩展是路径依赖的：更大的系统从头开始训练具有挑战性，但当从较小的已训练系统逐步扩展时变得可行。这些结果表明，学习的神经多智能体系统是LLM的有前景的扩展轴。

英文摘要

Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.


2605.16755 2026-05-19 cs.LG cs.AI 版本更新

Learning Unbiased Permutations via Flow Matching

通过流匹配学习无偏排列

Yimeng Min, Carla P. Gomes

发表机构 * Department of Computer Science（计算机科学系）； Cornell University（康奈尔大学）

AI总结本文提出PermFlow框架，通过在具有单位行和列和的矩阵仿射子空间上直接操作，学习多模态排列分布，避免了基于熵正则化Sinkhorn方法在模糊性下的崩溃问题。

详情

AI中文摘要

学习排列对于排序、排名和匹配至关重要，但现有的基于熵正则化Sinkhorn的可微方法会产生单一的软解，并在模糊性下崩溃。我们提出了PermFlow，一种条件流匹配框架，直接在具有单位行和列和的矩阵仿射子空间上操作。一个闭式切线空间投影器通过构造而非迭代校正，精确保持这些约束沿每条轨迹。一个最近目标耦合将不同的噪声初始值引导到不同的有效排列。结果是一个能够捕捉多模态排列分布而非将其坍缩到单一模式的模型。在具有混合数字模糊性的视觉排序任务和对称线性分配问题上，PermFlow在无歧义输入上具有高精度，并在模糊性下恢复两个有效排列，而基于Sinkhorn的基线方法在结构上失败。

英文摘要

Learning permutations is fundamental to sorting, ranking, and matching, but existing differentiable methods based on entropy-regularized Sinkhorn produce a single softened solution and collapse under ambiguity. We present PermFlow, a conditional flow matching framework that operates directly on the affine subspace of matrices with unit row and column sums. A closed-form tangent-space projector preserves these constraints exactly along every trajectory, by construction rather than through iterative correction, and a nearest-target coupling routes distinct noisy initializations toward distinct valid permutations. The result is a model that captures multimodal permutation distributions rather than collapsing them to a single mode. On a visual sorting task with blended-digit ambiguity and a symmetric linear assignment problem, PermFlow achieves high accuracy on unambiguous inputs and recovers both valid permutations under ambiguity, where Sinkhorn-based baselines structurally fail.
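To make the constraint-preservation idea concrete, here is a minimal sketch of one closed-form projector onto the tangent space of the affine subspace of matrices with unit row and column sums, that is, onto matrices whose rows and columns all sum to zero. It is the standard orthogonal projection (double centering) and is offered as an assumption about what such a projector can look like; the paper's own construction may differ, and `project_to_tangent` and `euler_step` are hypothetical names.

```python
def project_to_tangent(v):
    """Orthogonally project a square matrix onto {V : every row and column sums to 0}.

    Closed form: V - row means - column means + grand mean (double centering).
    """
    n = len(v)
    row_means = [sum(row) / n for row in v]
    col_means = [sum(v[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_means) / n
    return [[v[i][j] - row_means[i] - col_means[j] + total_mean for j in range(n)]
            for i in range(n)]

def euler_step(x, velocity, dt):
    """One Euler step of a flow; projecting the velocity first leaves the
    row and column sums of x unchanged, with no iterative correction."""
    pv = project_to_tangent(velocity)
    n = len(x)
    return [[x[i][j] + dt * pv[i][j] for j in range(n)] for i in range(n)]
```

Because the projected velocity has zero row and column sums, any state that starts with unit sums keeps them exactly (up to floating-point error) along the whole trajectory, in contrast to Sinkhorn-style renormalization after each step.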


2605.16750 2026-05-19 cs.IR cs.AI 版本更新

UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation

UniER：一项用于项目级和路径级练习推荐的统一基准

Xinghe Cheng, Guiyong Zhuang, Yusheng Xie, Jiapu Wang, Yixin Liu, Quanlong Guan, Liangda Fang, Shirui Pan

发表机构 * Jinan University（暨南大学）； Beijing University of Technology（北京工业大学）； Griffith University（格里菲斯大学）

AI总结本文提出UniER统一基准，用于比较项目级和路径级练习推荐方法，通过引入加权认知收益指标，揭示了路径级推荐在系统性上的优势以及项目级推荐在极端稀疏性和噪声下的教学失败。

详情

AI中文摘要

个性化练习推荐动态地将教学资源与个体知识掌握对齐，这对于满足现代教育中学生动态学习需求至关重要。该领域目前由两种主导范式驱动：项目级练习推荐（ILER）优化即时单步状态转移，而路径级练习推荐（PLER）构建连贯的学习路径以最大化累积收益。尽管两者有相同的最终目标，但不同的评估设置使这两种研究方向孤立，阻碍了统一基准和公平比较。为填补这一空白，本文提出了一个统一的练习推荐基准（UniER），这是一个综合的评估框架，统一了ILER和PLER。具体来说，我们引入了加权认知收益（WCG）作为统一的度量标准，以衡量跨范式算法的性能。我们的基准涵盖了9个数据集，覆盖四种生成方法，促进了18种代表性ILER/PLER方法的比较。通过涵盖有效性、通用性、鲁棒性和效率的多维分析，我们的结果揭示了PLER在系统性上的主导地位，并揭示了在极端稀疏性和噪声下ILER碎片化推荐的教学失败。此外，我们提供了UniER的开源代码库，以促进可重复研究，并概述了未来研究的潜在方向。

英文摘要

Personalized exercise recommendation dynamically aligns pedagogical resources with individual knowledge mastery, which is crucial for satisfying students' dynamic learning needs in modern education. The field is currently driven by two dominant paradigms: Item-Level Exercise Recommendation (ILER) optimizes for immediate single-step state transitions, while Path-Level Exercise Recommendation (PLER) constructs coherent learning paths to maximize cumulative gains. Despite sharing the same ultimate objective, disparate evaluation setups have kept these two lines of research isolated, hindering unified benchmarking and fair comparison. To fill the gap, in this paper, we present a Unified Benchmark for Exercise Recommendation (UniER), a comprehensive evaluation framework that unifies ILER and PLER. Specifically, we introduce Weighted Cognitive Gain (WCG) as a unified metric to measure cross-paradigm algorithmic performance. Our benchmark encompasses 9 datasets spanning four generation methods, facilitating the comparison of 18 representative ILER/PLER methods. Through multi-dimensional analyses covering effectiveness, generalizability, robustness, and efficiency, our results reveal the systematic dominance of PLER and expose the pedagogical failure of ILER's fragmented recommendations under extreme sparsity and noise. Furthermore, we provide an open-source codebase of UniER to foster reproducible research and outline potential directions for future investigations.


2605.16748 2026-05-19 cs.GR cs.AI cs.CV cs.LG cs.MA cs.MM 版本更新

Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Genflow Ad Studio：一种用于品牌一致、自我纠正视频生成的复合AI架构

Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

发表机构 * Google（谷歌）

AI总结本文提出Genflow Ad Studio，一种复合AI架构，通过品牌DNA提取模块和对抗性多代理质量控制循环，提高了品牌一致的视频生成效率，将合规率从42%提升到89%。

Comments 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26). Includes demo video and code repository links

Journal ref ACM Conference on AI and Agentic Systems (CAIS '26), May 26-29, 2026, San Jose, CA, USA

详情

DOI: 10.1145/3786335.3813213

AI中文摘要

近期生成视频模型的进步展示了高水平的视觉保真度，但其在企业环境中的整合受到时间不一致性和严重的品牌不一致性的限制。当前的单体架构难以强制执行严格的品牌约束，经常产生未经批准的视觉资产。我们介绍了Genflow，一种复合AI系统，旨在生成媒体生产中强制执行品牌一致性。我们的架构集成了基于检索的'品牌DNA'提取模块，以参数化生成方式根据已确立的企业身份指南进行生成。此外，我们实现了对抗性多代理质量控制（QC）循环。与单次生成流程不同，此流程采用评估代理，反复批评生成的帧，与提取的参数进行比较，促使生成模型细化输出，直到达成确定性的一致性。通过转向多阶段、自我纠正的流程，Genflow将品牌合规视频生成的产量从42%提高到89%，建立了稳健的框架，用于可扩展的、企业级的生成系统。

英文摘要

Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.


2605.16746 2026-05-19 cs.AI cs.LG 版本更新

State Contamination in Memory-Augmented LLM Agents

内存增强型大语言模型代理中的状态污染

Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

发表机构 * Department of Computer Science, University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校计算机科学系）

AI总结研究探讨了内存增强型大语言模型代理中由于状态污染导致的安全问题，通过分析内存总结中的毒性内容传播，提出了一种新的衡量指标，并指出在信息压缩前进行净化可以有效减少潜在影响。

详情

AI中文摘要

LLM代理越来越多地依赖持久化状态，包括转录文本、摘要、检索上下文和内存缓冲区，以支持长周期交互。这使得安全性不仅取决于个体模型输出，还取决于代理存储和后来重用的内容。我们研究了一种称为内存清洗的故障模式：有毒或对抗性上下文可以被压缩成内存摘要，这些摘要在标准检测器下不再显得有毒，但仍保留了影响未来生成的敌对框架或冲突结构。通过配对的反事实多代理模拟，我们证明有毒起源的内存摘要可以保持在常见毒性阈值以下，但相对于匹配的中性基线，仍会增加下游毒性。为了衡量这种隐藏影响，我们引入了子阈值传播间隙（SPG），它量化了在部署监控器视为安全的内存状态下，下游行为差异。我们的实验表明，毒性通过不同的状态通道传播：原始转录文本重用驱动显性下游毒性，而压缩的内存则携带隐藏的子阈值影响。我们进一步发现，缓解依赖于干预位置。在摘要前净化有毒状态可显著减少隐藏传播间隙，而仅清洁完成的摘要则可能保留被清洗的影响。这些结果表明，内存增强型代理的安全性应被视为对演进上下文的状态控制问题，净化应在不安全信息被压缩进持久内存之前应用。

英文摘要

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.


2605.16728 2026-05-19 cs.AI 版本更新

Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents

具身视角形成与意图同调在人工体中

Hongju Pae

发表机构 * Active Inference Institute, CA, USA（主动推断研究所，加利福尼亚州，美国）

AI总结本文提出了一种最小架构，用于人工体中的具身视角形成，通过引入内感受性活力信号、Fisher式度量以及意图同调机制，展示了如何在无奖励的网格世界中将学习到的身体倾向转化为稳定的体定向行为。

2605.16727 2026-05-19 cs.AI 版本更新

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

PopuLoRA: 为推理自博弈的协同进化LLM种群

Roger Creus Castanyer, Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N. Mavor-Parker, Matthew James Sargent

发表机构 * Absolute Zero Reasoner（绝对零理由器）； LoRA adapters（LoRA适配器）

AI总结本文提出PopuLoRA，一种基于种群的非对称自博弈框架，用于强化学习中可验证奖励（RLVR）的后训练LLM。通过专门的LoRA适配器在共享冻结基座上进行教师和学生分工，教师提出问题，学生在程序验证器下解决，不同亚种群间的交叉评估取代了限制单智能体自博弈的自我校准。LoRA权重空间进化算子家族作为7B规模种群训练循环的替代步骤，实现了种群的协同进化竞赛。

详情

LinAlg-Bench：一个取证式验证基准，揭示 LLM 数学推理中的结构性失效模式

Shradha Agarwal, Deepak Rajbhar, Tariq J

发表机构 * Department of Nuclear Engineering and Computer Science（核工程与计算机科学系）

AI总结 LinAlg-Bench 评估 10 个前沿大语言模型在结构线性代数计算中的表现，揭示 LLM 数学失败并非随机，而是受算法类型和矩阵维度约束。研究发现 4x4 尺寸存在行为阈值，低于该尺寸模型通过执行错误失败，高于则转向计算放弃，通过工具角色扮演等制造响应。

Comments 42 pages, 3 figures, 12 tables. NeurIPS 2026 Evaluations and Datasets Track submission. Dataset: https://huggingface.co/datasets/LinAlgBench/linalg-bench

详情

AI中文摘要

我们介绍了 LinAlg-Bench，一个诊断基准，评估 10 个前沿大语言模型在结构线性代数计算中的表现，覆盖 3x3、4x4 和 5x5 矩阵的严格维度梯度。该基准涵盖 9 类任务和 660 个 SymPy 认证问题，评估 6,600 个模型输出。除了二元准确率外，LinAlg-Bench 引入了三阶段自动化取证流程，将 1,156 个失败分类为 10 个主要错误标签及其细粒度子类型，揭示 LLM 数学失败并非随机，而是受算法类型和矩阵维度约束。我们的核心发现是 4x4 尺寸存在行为阈值：低于该尺寸，模型通过执行错误失败——符号跟踪失败、算术漂移和奇偶错误；高于该尺寸，失败转变为计算放弃，模型通过工具角色扮演、约束一致的虚构和结构性幻觉制造响应而非尝试计算。这种制造到放弃的转变在所有模型层级和架构中几乎普遍存在，表明是工作记忆限制而非知识缺口，支持三种规模涌现的错误类型在 3x3 不存在但在 4x4 和 5x5 存在。我们进一步显示，解决方案策略的刚性是 5x5 行列式准确率的近完美预测因素，记录约束意识的虚构作为一种新的结构幻觉失败模式，并公开所有数据、模型输出、错误标签和判断流程。

英文摘要

We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.


2605.16672 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-Object Tracking Consistently Improves Wildlife Inference

多目标跟踪一致地提升野生动物推断

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl

发表机构 * World Wide Fund (WWF)（世界自然基金会）； Centre for Artificial Intelligence Research (CAIR)（人工智能研究中心）

AI总结本文利用多目标跟踪技术提升野生动物分类模型的鲁棒性，通过融合轨迹信息改进分类结果，实验表明在三个数据集上均提升了性能。

Comments Accepted for publication in IEEE 2026 29th International Conference on Information Fusion

详情

AI中文摘要

相机陷阱已成为生态研究和生物多样性保护中常用的野生动物监测工具。野生动物分类模型受益于野生动物视觉数据的增加，这些模型在经过整理的高质量数据集上能达到高水平的准确性。然而，其性能仍然易受现实环境约束的影响。在进行时间连续序列的推断时，它们常常产生不一致的预测。单个个体在帧之间的预测标签会迅速变化。本研究利用相机陷阱数据的时间特性来增强野生动物分类模型的推断预测。具体来说，我们采用几种标准的多目标跟踪（MOT）模型，将连续帧中的检测结果进行关联。经过整理的轨迹用于融合softmax类概率。融合的概率评分产生一个单一的共识类标签估计，以覆盖噪声引起的误分类。实验结果分析表明，我们的策略在所有数据集和每个指标上均优于独立分类器。具体而言，表现最好的MOT模型在三个MOT数据集上分别比分类器提高了5.1%、3.1%和2.0%的加权F1分数。

英文摘要

Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences. The predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier over all datasets and for each metric. Specifically, the best-performing MOT models gain a weighted F1-Score of 5.1%, 3.1% and 2.0% over the classifier across three MOT datasets.
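The track-level fusion described above can be sketched in a few lines. The abstract does not say which fusion rule is used, so this example assumes a plain arithmetic mean of the per-frame softmax vectors followed by an argmax; `fuse_track_labels` and the input layout are hypothetical.

```python
from collections import defaultdict

def fuse_track_labels(detections):
    """Fuse per-frame softmax vectors along each track into one consensus label.

    `detections` is an iterable of (track_id, softmax_probs) pairs, one per
    detection, where the track ids come from any multi-object tracker.
    Returns {track_id: index of the class with the highest mean probability}.
    """
    sums = {}
    counts = defaultdict(int)
    for track_id, probs in detections:
        if track_id not in sums:
            sums[track_id] = [0.0] * len(probs)
        for k, p in enumerate(probs):
            sums[track_id][k] += p
        counts[track_id] += 1
    consensus = {}
    for track_id, total in sums.items():
        mean = [s / counts[track_id] for s in total]
        consensus[track_id] = max(range(len(mean)), key=mean.__getitem__)
    return consensus
```

A single noisy frame that flips the per-frame argmax is outvoted by the rest of the track, which is the behavior the abstract attributes to the consensus estimate.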


2605.16671 2026-05-19 cs.AI cs.CV cs.CY cs.LG 版本更新

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

野生环境中的可持续智能：通过知识自适应边缘专家代理实现生态监测民主化

Jiaxing Li, Hao Fang, Chi Xu, Miao Zhang, Jiangchuan Liu, William I. Atlas, Katrina M. Connors, Mark A. Spoljaric

发表机构 * Simon Fraser University（西蒙弗雷泽大学）； Wild Salmon Center（野生鲑鱼中心）； Pacific Salmon Foundation（太平洋鲑鱼基金会）； Haida Fisheries Program（海达渔业计划）

AI总结本文提出一种知识自适应边缘代理架构，通过分离视觉感知与推理，结合视觉编码器和动态知识库，实现生态监测的可持续发展，促进伦理AI协同开发。

Comments 10 pages

详情

AI中文摘要

快速的生物多样性丧失凸显了有效监测的紧迫性，但手动调查仍消耗资源。尽管设备上的AI提供了一种可扩展的替代方案，但野外环境中经常受到环境变化的挑战。当前方法依赖云资源，需要持续上传现场数据以重新训练模型。这种方法不适合远程部署，因为它消耗有限的电力和网络连接。为了解决这些限制，本研究提出从模型适应转向知识适应。我们介绍了一种架构，将视觉感知与推理分离，结合视觉编码器和动态知识库。我们使用显式知识库取代隐式编码专家知识到模型参数。这种方法还通过结构化形式保存专家见解来支持知识可持续性。通过跨学科合作与生物学家和原住民社区，这项工作推进了伦理AI的协同开发，促进负责任和文化知情的生态系统管理。

英文摘要

Rapid biodiversity loss underscores the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resources, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We use an explicit knowledge base instead of implicitly encoding expert knowledge in model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.


2605.16668 2026-05-19 cs.LG cs.AI 版本更新

因子化HMR：视频人体网格恢复的混合框架

Patrick Kwon, Chen Chen

发表机构 * Institute of Artificial Intelligence（人工智能研究所）； University of Central Florida（佛罗里达中央大学）

AI总结本文提出FactorizedHMR框架，通过确定性回归模块和概率流匹配模块分别处理人体不同部位的恢复问题，结合复合目标表示和几何感知监督提升模糊部位的恢复效果，实现在遮挡和漂移敏感度指标上的优势。

详情

AI中文摘要

人体网格恢复（HMR）本质上具有歧义性：在遮挡或弱深度线索下，同一图像证据可能由多个3D身体解释。这种歧义性并非均匀分布于全身，躯干姿态和根结构通常相对受约束，而远端关节如手臂和腿部则更不确定。基于此观察，我们提出FactorizedHMR，一种两阶段框架，分别处理这两种情形。一个确定性回归模块首先恢复稳定的躯干-根锚点，一个概率流匹配模块则完成剩余的非躯干关节。为使完成可靠，我们结合复合目标表示与几何感知监督和特征感知分类器自由引导，保留躯干-根锚点的同时提升易产生歧义的关节的单参考恢复。我们还引入了一个合成数据管道，提供在多种视角下的配对图像-相机-运动监督。在相机空间和世界空间基准测试中，FactorizedHMR与强基线竞争，尤其在遮挡密集恢复和漂移敏感世界空间指标上表现最突出。

英文摘要

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.


2605.14504 2026-05-19 cs.AI 版本更新

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

当机器人做家务：一个基准和代理用于长期家庭任务执行

Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, Jing Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences（中国科学院自动化研究所）； Zhongguancun Academy（中关村学院）； University of Chinese Academy of Sciences（中国科学院大学）

AI总结本文提出LongAct基准和HoloMind代理，用于评估长期家庭任务执行中的高层自主能力，实验显示HoloMind在减少模型规模依赖的同时提升了长期性能，但目标完成率仍较低，凸显了长期规划的挑战。

详情

AI中文摘要

长期家庭任务需要稳健的高层规划和持续推理能力，而现有具身AI基准多关注短时间导航或操作，依赖固定任务类别。我们引入LongAct基准，用于评估通过自由指令指定的长期家庭任务中的规划自主性。通过抽象掉与具体身体相关的低层控制，LongAct隔离了如指令理解、依赖管理、记忆维护和适应性规划等高层认知能力。我们进一步提出HoloMind，一个基于视觉语言模型的代理，配备基于有向无环图的长期分层规划器、多模态空间记忆用于持久世界建模、经验重用的片段记忆以及全局批评者用于反思监督。实验表明，GPT-5和Qwen3-VL模型在HoloMind上显著提升了长期性能，同时减少了对模型规模的依赖。即使顶级模型也仅达到59%的目标完成率和16%的完整任务成功率，凸显了LongAct的难度以及具身代理中更强长期规划的需求。

英文摘要

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.
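As a rough illustration of what a DAG-based planner has to guarantee, the sketch below orders subtasks so that every dependency is executed first. It uses only the Python standard library and is not HoloMind's planner; the subtask names and the `plan_order` helper are hypothetical.

```python
from graphlib import TopologicalSorter

def plan_order(dependencies):
    """Return one valid execution order for a dependency DAG.

    `dependencies` maps each subtask to the set of subtasks that must be
    completed before it. A cyclic specification raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())

# Hypothetical household task broken into dependent subtasks.
chores = {
    "clear table": set(),
    "wash dishes": {"clear table"},
    "dry dishes": {"wash dishes"},
}
order = plan_order(chores)
```

A real long-horizon agent would re-plan as observations arrive; the topological order only captures the dependency-management part of the problem.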

2605.13877 2026-05-19 cs.NE cs.AI (updated version)

ARES-LSHADE: Autoresearch-Enhanced LSHADE with Memetic Polish for the GNBG Benchmark

Abdullah Naeem, Md Wasi Ul kabir, Manish Bhatt, Ayon Dey, Anav Katwal, Md Tamjidul Hoque

Affiliations: University of New Orleans; Amazon

AI summary: This paper presents ARES-LSHADE, which improves LSHADE with an autoresearch loop and a memetic polish phase. On the GNBG benchmark it obtains 510 of 744 wins and reaches machine precision on 18 of 24 functions, and it exposes a tension between LLM-driven research loops and benchmark integrity.

Abstract

We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner, contributing two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 of 744 wins (per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with operator-only edit surface and fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES-LSHADE.

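2605.12991 2026-05-19 cs.LG cs.AI (updated version)

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

Adarsh Kumarappan, Ananya Mujoo

Affiliations: California Institute of Technology; Evergreen Valley College

AI summary: This paper studies how often multi-agent systems switch from correct to incorrect answers under simulated peer disagreement. Pretrained base models show the same substitution pattern as their Instruct variants, with higher average yield. Activation patching localizes the corruption to a narrow mid-layer window, and patching restores most of the clean-to-pressured accuracy gap. Pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit.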

Abstract

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt-level defenses.
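The yield metric in the abstract above can be illustrated with a small sketch. It assumes that yield is the fraction of initially correct answers that become incorrect once peer pressure is applied; the abstract does not give the exact formula, so this reading, the function name `yield_rate`, and the toy data are all assumptions.

```python
def yield_rate(clean_correct, pressured_correct):
    """Fraction of initially correct answers that flip to incorrect under pressure.

    Assumed definition: only items answered correctly in the clean condition
    are eligible to flip, so they form the denominator.
    """
    eligible = 0
    flips = 0
    for clean, pressured in zip(clean_correct, pressured_correct):
        if clean:
            eligible += 1
            if not pressured:
                flips += 1
    return flips / eligible if eligible else 0.0


# Toy data (not from the paper): five questions, four correct without pressure.
clean = [True, True, True, True, False]
pressured = [True, False, False, True, False]
rate = yield_rate(clean, pressured)
print(f"yield = {rate:.2f}")
```

A gap between two conditions would then be reported in percentage points, that is, as `100 * (rate_a - rate_b)`, which matches how the abstract states its 47.5-point difference.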

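2605.12920 2026-05-19 cs.MA cs.AI cs.CL (updated version)

Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

Vardhan Dongre, Dilek Hakkani-Tür

Affiliations: Siebel School of Computing & Data Science

AI summary: This study examines whether dialogue aligns the world models of embodied agents. Dialogue reduces action conflicts but lowers task success, and the paper proposes a framework for measuring world-model alignment.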

Abstract

Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts by 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.

2605.12825 2026-05-19 cs.LG cs.AI (updated version)

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen

Affiliations: University of Oregon; Google DeepMind; Adobe Research

AI summary: Orthrus combines the high-fidelity generation of autoregressive large language models with the fast parallel generation of diffusion models. Its dual-view mechanism delivers up to a 7.8x speedup with very low memory overhead.

Abstract

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.

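2605.12824 2026-05-19 cs.MA cs.AI cs.CL cs.CY (updated version)

Mechanism Plausibility in Generative Agent-Based Modeling

Patrick Zhao, David Huu Pham, Nicholas Vincent

Affiliations: Simon Fraser University

AI summary: This paper proposes the Mechanism Plausibility Scale, which separates generative sufficiency from mechanistic plausibility, and discusses the generative and explanatory capacities of generative agent-based models.

Comments: Accepted at ACM FAccT 2026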

DOI: 10.1145/3805689.3812388

Abstract

Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate different phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are different--drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

2605.12070 2026-05-19 cs.LG cs.AI (updated version)

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Likang Wu, Xiong Jun Wu, Hongke Zhao

Affiliations: Tianjin University; Tsinghua University; Peking University; JDT AI Infra

AI summary: This paper studies the semantic mismatch caused by missing old logits in asynchronous agentic reinforcement learning. It proposes three strategies for acquiring exact old logits plus an approximate correction route, and adopts a revised PPO-EWMA method that improves both training speed and optimization performance.

Abstract

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emph{training--inference discrepancy term} that aligns inference-side and training-side distributions at the same behavior-policy version, and a \emph{policy-staleness term} that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance.
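The two-factor decomposition described in the abstract above can be sketched per token in log space. This is one reading of the abstract, not the authors' implementation: the function `decompose_ratio`, its argument names, and the example probabilities are hypothetical.

```python
import math


def decompose_ratio(logp_current, logp_train_old, logp_infer_old):
    """Split the total importance ratio into two factors (assumed reading).

    discrepancy: training-side vs inference-side probability at the SAME
                 behavior-policy version (the training-inference term).
    staleness:   current policy vs the historical training-side policy.
    total:       current policy vs what the inference engine actually sampled.
    """
    discrepancy = math.exp(logp_train_old - logp_infer_old)
    staleness = math.exp(logp_current - logp_train_old)
    total = math.exp(logp_current - logp_infer_old)
    return discrepancy, staleness, total


# Hypothetical token probabilities under the three distributions.
d, s, t = decompose_ratio(math.log(0.30), math.log(0.25), math.log(0.20))
print(f"discrepancy={d:.3f} staleness={s:.3f} total={t:.3f}")
```

The product of the two factors equals the total ratio. When `logp_train_old` is unavailable, only the total can be computed, so a single clip or mask threshold ends up acting on both effects at once, which is the entanglement the abstract describes.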

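2605.11518 2026-05-19 cs.AI cs.CL cs.LG (updated version)

Can Reinforcement Learning Teach Large Language Models Long-Horizon Reasoning? Expressiveness Is Key

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

Affiliations: Purdue University; UNC Chapel Hill; Georgia Tech; UC San Diego

AI summary: Using the ScaleLogic framework, this paper studies how RL training scales with task difficulty. Training compute follows a power law in reasoning depth whose exponent grows with logical expressiveness, more expressive training settings transfer more compute-efficiently to downstream benchmarks, and the long-horizon reasoning shortcomings of LLMs can be addressed with better training methodology.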

Abstract

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that they are fundamental to the autoregressive transformer architecture. To address this, we introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^{\gamma}$, $R^{2} > 0.99$), and that the scaling exponent $\gamma$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.
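A power law of the form stated in the abstract above is commonly estimated by ordinary least squares on log-transformed data. The sketch below shows that standard procedure; it is not taken from the paper, and the function `fit_power_law` and the synthetic data points are assumptions made for illustration.

```python
import math


def fit_power_law(depths, compute):
    """Fit T = c * D**gamma by least squares on (log D, log T).

    Returns (gamma, c, r_squared), where r_squared is measured in log space.
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(t) for t in compute]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    gamma = sxy / sxx                      # slope in log-log space
    log_c = mean_y - gamma * mean_x        # intercept in log-log space
    ss_res = sum((y - (log_c + gamma * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return gamma, math.exp(log_c), 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated with an exponent of 2.6 (not paper data).
depths = [2, 4, 8, 16, 32]
compute = [3.0 * d ** 2.6 for d in depths]
gamma, c, r2 = fit_power_law(depths, compute)
print(f"gamma={gamma:.2f} c={c:.2f} R^2={r2:.4f}")
```

Because the fit is linear in log space, the slope is the scaling exponent directly, which is why a larger exponent means compute grows faster with depth.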

2605.05739 2026-05-19 cs.LG cs.AI cs.CL q-fin.CP (updated version)

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

Affiliations: School of Electrical Engineering and Computer Science

AI summary: This paper proposes a multi-dimensional behavioral evaluation method in which large language model judges score the decision process of an agentic system, and closed-loop reinforcement learning feedback improves forecasting performance. The method is validated on stock prediction.

Comments: 17 pages, 5 figures, 14 tables. Manuscript submitted to Applied Artificial Intelligence (Taylor and Francis)

Abstract

Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.
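The relative improvement quoted in the abstract above can be checked with a short calculation. The sketch uses the standard definition of mean absolute percentage error; the helper names and the toy price series are hypothetical, and only the 0.61% and 0.54% figures come from the abstract.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, in percent (standard definition)."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


# Toy prices (not from the paper): each forecast is off by 1%.
toy_error = mape([100.0, 200.0], [101.0, 198.0])

# Figures quoted in the abstract: one-day MAPE falls from 0.61% to 0.54%.
rel = relative_reduction(0.61, 0.54)
print(f"toy MAPE = {toy_error:.2f}%, relative reduction = {rel:.1f}%")
```

The absolute change is 0.07 percentage points, and dividing by the starting value of 0.61 gives the 11.5% relative figure reported in the abstract.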

2605.04375 2026-05-19 eess.SY cs.AI cs.SY (updated version)

Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery

Zhenning Yang, Yuhan Chen, Patrick Tser Jern Kon, Tongyuan Miao, Hongyi Lin, Venkat Viswanathan, Danai Koutra, Ang Chen

AI summary: This paper proposes the Experiment-as-Code (EaC) Labs framework, in which declarative configurations are compiled down to device APIs so that AI agents can work with automated lab equipment, with the aim of advancing AI for scientific discovery.

Comments: Experiment-as-Code (EaC) white paper

Abstract

To unleash the full potential of AI for Science, we must untether the agents from a purely digital environment. The agent's ability to control and explore in real-world labs is essential because the physical lab remains foundational to scientific discovery. While some tasks can be performed on a computer (e.g., data analysis, running simulated experiments), Eureka moments could occur at any time while operating lab instruments (e.g., when a scientist notices unexpected clues, intuition may prompt a real-time course change). Although autonomous labs are on the rise, which expose programmable APIs to control scientific instruments via software, bridging the gap between increasingly powerful AI agents and automated lab equipment requires innovation that draws insights from computer systems. We propose a new paradigm called ``Experiment-as-Code (EaC) Labs,'' where a core concept is to encode experiments as declarative configurations that can be compiled down to device-level APIs. AI agents come up with hypotheses and experiments, written as an ensemble of declarative configurations. The systems layer performs program analysis, safety checks, resource assignment, and job orchestration. Finally, programmatic experimentation occurs via actuating the device APIs. This is a general stack that is science-, lab-, and instrument-independent, representing a novel synthesis across the physical, systems, and intelligence layers to unleash the next breakthrough in AI for Science.

2605.02832 2026-05-19 cs.AI cs.HC cs.SE (updated version)

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

Vicente Pelechano, Antoni Mestre, Manoli Albert, Miriam Gil

Affiliations: Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Politècnica de València, Camino de Vera s/n, Valencia, Spain

AI summary: This paper presents the HAAS framework, which adapts task allocation between humans and AI using a rule-based expert system and a contextual-bandit learner. It shows that governance is a tunable design variable rather than a binary switch, and that moderate governance becomes more competitive as the learner accumulates experience.

Abstract

Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution -- balancing efficiency, oversight, and human capability -- remains an open problem. This paper presents Human-AI Adaptive Symbiosis (HAAS), an implemented framework for adaptive task allocation in software engineering and manufacturing. HAAS combines two coupled components: a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback. Task-agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum -- from human-only to fully autonomous -- embedded in a reproducible benchmark spanning both domains. Three empirical findings emerge. First, governance is not a binary switch but a tunable design variable: tighter constraints predictably convert autonomous AI assignments into supervised collaborations, with domain-specific costs and benefits. Second, in manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously -- a workload-buffering effect that contradicts the usual framing of governance as pure overhead. Third, no single governance setting dominates across all contexts; moderate governance becomes increasingly competitive as the learner accumulates experience within the governed action space. Together, these findings position HAAS as a pre-deployment workbench for comparing and inspecting human--AI allocation policies before organisational commitment.

2605.02167 2026-05-19 cs.LG cs.AI cs.CV (updated version)

Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

Affiliations: Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST)

AI summary: This paper proposes Manifold-Aligned Guided Integrated Gradients (MA-GIG), which builds attribution paths in the latent space of a pre-trained variational autoencoder, reducing off-manifold noise and making feature attribution more reliable.

Comments: 32 pages, 13 figures, 12 tables. Accepted to ICML 2026; includes appendix

Abstract

Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low-gradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose \emph{Manifold-Aligned Guided Integrated Gradients} (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold and reduces exposure to implausible input-space regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma-gig/.
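For readers unfamiliar with the baseline method, the sketch below implements plain Integrated Gradients along a straight-line path, which is the starting point the abstract above builds on. It is not MA-GIG: there is no guidance and no autoencoder, and the toy model `f`, its analytic gradient, and the input values are assumptions chosen so the result can be checked by hand.

```python
def f(x):
    """Toy differentiable model (hypothetical): f(x) = x0^2 + 3*x1."""
    return x[0] ** 2 + 3.0 * x[1]


def grad_f(x):
    """Analytic gradient of the toy model."""
    return [2.0 * x[0], 3.0]


def integrated_gradients(x, baseline, grad, steps=50):
    """Straight-line Integrated Gradients with a midpoint Riemann sum."""
    n = len(x)
    avg_grad = [0.0] * n
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(point)
        for i in range(n):
            avg_grad[i] += g[i] / steps
    # Attribution = (input - baseline) * average gradient along the path.
    return [(xi - b) * a for xi, b, a in zip(x, baseline, avg_grad)]


x = [2.0, 1.0]
baseline = [0.0, 0.0]
attr = integrated_gradients(x, baseline, grad_f)
print(attr, sum(attr), f(x) - f(baseline))
```

The attributions sum to `f(x) - f(baseline)`, which is the completeness axiom that makes IG attractive. Every intermediate `point` here lies on a straight line in input space, and it is exactly those intermediate inputs that the abstract says can drift off the data manifold.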

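2605.01235 2026-05-19 cs.SD cs.AI (updated version)

MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention

Yimeng Zhang, Yueru Sun, Haoyu Gu, Zhanpeng Jin

Affiliations: South China University of Technology

AI summary: This paper presents MindMelody, a system that generates personalized music in real time from EEG signals. It combines a Transformer-GNN decoder with a RAG-equipped large language model to close the loop between emotion sensing and music generation, improving emotional alignment and perceived helpfulness.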

Abstract

Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.

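2605.00793 2026-05-19 eess.IV cs.AI cs.CV (updated version)

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

Zhilin Guan, Wei Zhang

Affiliations: Department of Computing; Harbin Institute of Technology

AI summary: This paper proposes an unsupervised framework for denoising low-dose liver CT based on perceptual attention networks. It combines a U-Net, an attention mechanism, and a residual network, uses a perceptual loss to better capture medical image features, and validates the method on a real clinical dataset with both image-based metrics and medical evaluation criteria.

Comments: 8 pages, 10 figures, 5 tables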

Abstract

With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising problem of low-dose computed tomography using deep learning. Although low-dose computed tomography reduces radiation exposure to patients, it also introduces more noise, which may interfere with visual interpretation by physicians and affect diagnostic results. To address this problem, inspired by Cycle-GAN for unsupervised learning, this paper proposes an end-to-end unsupervised low-dose computed tomography denoising framework. The proposed framework combines a U-Net structure for multi-scale feature extraction, an attention mechanism for feature fusion, and a residual network for feature transformation. It also introduces perceptual loss to improve the network for the characteristics of medical images. In addition, we construct a real low-dose computed tomography dataset and design a large number of comparative experiments to validate the proposed method, using both image-based evaluation metrics and medical evaluation criteria. Compared with classical methods, the main advantage of this paper is that it addresses the limitation that real clinical data cannot be directly used for supervised learning, while still achieving excellent performance. The experimental results are also professionally evaluated by imaging physicians and meet clinical needs.

2604.26904 2026-05-19 cs.CL cs.AI cs.LG (updated version)

ClawGym: A Scalable Framework for Building Effective Claw Agents

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao, Ji-Rong Wen

Affiliations: Gaoling School of Artificial Intelligence; Renmin University of China; IQuest Research; Beihang University

AI summary: This paper presents ClawGym, a framework covering the full development lifecycle of Claw-style personal agents, which improves agents through synthesized verifiable training data and reinforcement learning.

Abstract

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources have been released at https://github.com/ClawGym.

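2604.25858 2026-05-19 cs.LG cs.AI (version update)

Investigation into In-Context Learning Capabilities of Transformers

Rushil Chandrupatla, Leo Bangayan, Sebastian Leng

AI summary: Through systematic experiments on in-context learning for Gaussian-mixture binary classification, this paper analyzes how the input dimension, the number of in-context examples, and the number of pre-training tasks affect in-context test accuracy, and examines benign overfitting.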

Abstract

Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior governing when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Frei and Vardi (2024), we analyze how in-context test accuracy depends on three fundamental factors: the input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, we isolate the geometric conditions under which models successfully infer task structure from context alone. We additionally investigate the emergence of benign overfitting, where models memorize noisy in-context labels while still achieving strong generalization performance on clean test data. Through extensive sweeps across dimensionality, sequence length, task diversity, and signal-to-noise regimes, we identify the parameter regions in which this phenomenon arises and characterize how it depends on data geometry and training exposure. Our results provide a comprehensive empirical map of scaling behavior in in-context classification, highlighting the critical role of dimensionality, signal strength, and contextual information in determining when in-context learning succeeds and when it fails.
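
As a rough illustration of the setting this abstract describes (not the authors' code), the sketch below samples one Gaussian-mixture binary task and applies a simple linear in-context rule: predict the sign of the inner product between the query and the label-weighted mean of the context. The function names, the unit-variance noise, the `signal` parameterization, and the label-flip rate are all assumptions made for illustration.

```python
import math
import random

def sample_task(dim, n_context, signal, flip_prob, rng):
    """Sample one task: x = y * mu + standard Gaussian noise, y in {-1, +1}."""
    # Random direction rescaled to norm `signal` (assumed parameterization).
    raw = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in raw))
    mu = [signal * v / norm for v in raw]

    def draw():
        y = rng.choice((-1, 1))
        x = [y * m + rng.gauss(0.0, 1.0) for m in mu]
        return x, y

    context = []
    for _ in range(n_context):
        x, y = draw()
        # Flip the observed context label with probability flip_prob.
        if rng.random() < flip_prob:
            y = -y
        context.append((x, y))
    return context, draw

def predict(context, query):
    """Linear in-context rule: sign(<query, mean_i y_i * x_i>)."""
    dim = len(query)
    w = [sum(y * x[j] for x, y in context) / len(context) for j in range(dim)]
    score = sum(wj * qj for wj, qj in zip(w, query))
    return 1 if score >= 0 else -1

def accuracy(dim, n_context, signal, flip_prob=0.0, n_test=200, seed=0):
    """Accuracy on clean test points from the same task as the context."""
    rng = random.Random(seed)
    context, draw = sample_task(dim, n_context, signal, flip_prob, rng)
    hits = 0
    for _ in range(n_test):
        x, y = draw()  # test labels are never flipped
        hits += predict(context, x) == y
    return hits / n_test
```

Sweeping `dim`, `n_context`, `signal`, and `flip_prob` in `accuracy` gives a toy version of the dimensionality, sequence-length, and signal-to-noise sweeps mentioned in the abstract; it does not model the pre-training stage at all.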

2604.21937 2026-05-19 cs.AI cs.MA (version update)

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie

Affiliations: Peking University Health Science Center, Peking University, Beijing, China; Shanghai AI Laboratory, Shanghai, China; Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China

AI summary: MolClaw integrates more than 30 domain resources through a hierarchical skill architecture to automate drug molecule evaluation, screening, and optimization, and outperforms existing AI agents on complex workflows.

Comments: 28 pages, 8 figures. Code and data will be released

Abstract

Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.

2604.19219 2026-05-19 cs.CR cs.AI cs.DC cs.LG (version update)

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Daniel M. Jimenez-Gutierrez, Dario Pighin, Enrique Zuazua, Georgios Kellaris, Joaquin Del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria

AI summary: This paper presents the Sherpa.ai multi-party PSU protocol for privacy-preserving entity alignment in vertical federated learning. It supports exact and noisy matching while hiding intersection membership, and targets settings such as multi-institutional healthcare disease detection.

Abstract

Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.

2604.14215 2026-05-19 cs.IR cs.AI (version update)

PriHA: A RAG-Enhanced LLM Framework for Primary Healthcare Assistant in Hong Kong

Richard Wai Cheung Chan, Shanru Lin, Ya-nan Ma, Hao Chen, Liangjun Jiang, Wenqi Fan

Affiliations: The Hong Kong Polytechnic University; Haikou Affiliated Hospital of Central South University Xiangya School of Medicine; Sun Yat-Sen University Cancer Center

AI summary: This paper proposes PriHA, a framework that uses retrieval-augmented generation to address the fragmentation of guidelines in Hong Kong primary healthcare, improving the accuracy and clarity of information access.

Comments: Accepted to PAKDD 2026

Abstract

To address the unsustainable rise in public health expenditures, the Hong Kong SAR Government is shifting its strategic focus to primary healthcare and encouraging citizens to use community resources to self-manage their health. However, official clinical guidelines are fragmented across disparate departments and formats, creating significant access barriers. While general-purpose Large Language Models (LLMs) such as ChatGPT and DeepSeek offer potential solutions for information accessibility, they are prone to generating factually inaccurate content due to a lack of localized and domain-specific knowledge. To this end, we propose a Retrieval-Augmented Generation-Enhanced LLM system as Primary Healthcare Assistant (PriHA) in Hong Kong. Specifically, a tri-stage pipeline is proposed that leverages a query optimizer to generalize user intent-oriented sub-queries, followed by a novel Dual Retrieval Augmented Generation (DRAG) architecture for mixed-source retrieval and context-reorganized generation. Comprehensive experiments and a detailed case study demonstrate that our proposed method can outperform both ablations and baseline in terms of accuracy and clarity. Our research provides a reliable and traceable dialogue retrieval framework for exploring other high-risk, localized application scenarios.

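2604.12254 2026-05-19 cs.CR cs.AI (version update)

SpanKey: Dynamic Key Space Conditioning for Neural Network Access Control

WenBin Yan

Affiliations: University of Colorado Boulder

AI summary: SpanKey implements lightweight inference-time gating through dynamic key-space conditioning; it does not encrypt weights and does not optimize for accuracy under gated inference. The method conditions activations on a secret key, uses a basis matrix to define a low-dimensional key subspace, analyzes a multi-layer design space, and identifies a key-absorption failure mode with experimental validation.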

Comments 15 pages, 1 figure, multiple tables. Preprint (not yet published in a journal). Affiliation: University of Colorado Boulder. Code: https://github.com/mindmemory-ai/dksc

2604.11043 2026-05-19 cs.AI (version update)

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

Jincheng Xie, Xingchen Xiao, Runheng Liu, Zhongyi Huang, Yu Zheng, Heyan Huang

Affiliations: Tsinghua University; School of Computer Science and Technology, Beijing Institute of Technology; JD iCity, JD Technology, JD Intelligent Cities Research

AI summary: This paper proposes EmergentBridge, a framework that improves zero-shot transfer on unpaired modality pairs through learned noisy bridge anchors and subspace alignment, without requiring exhaustive pairwise supervision.

Abstract

Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.
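
The second ingredient, restricting proxy alignment to the subspace orthogonal to the anchor-alignment direction, can be pictured with a small projection sketch. This is only an illustration under the assumption that the protected subspace is a single direction represented as a plain vector; the paper may use a different or higher-dimensional construction, and the names `project_orthogonal`, `update`, and `anchor_dir` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def project_orthogonal(update, anchor_dir, eps=1e-12):
    """Remove the component of `update` that lies along `anchor_dir`."""
    denom = dot(anchor_dir, anchor_dir)
    if denom < eps:
        # Degenerate direction: nothing to protect, keep the update unchanged.
        return list(update)
    coeff = dot(update, anchor_dir) / denom
    return [u - coeff * a for u, a in zip(update, anchor_dir)]
```

Applying `project_orthogonal` to a proxy-alignment update leaves its inner product with `anchor_dir` at zero, so to first order the update cannot move an embedding along the protected direction.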

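2604.10825 2026-05-19 cs.AI (version update)

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

Zacharie Bugaud

Affiliations: Astera Institute

AI summary: CheeseBench evaluates large language models on nine classical behavioral neuroscience paradigms. It finds that model scale, context history, prompting style, and architecture significantly affect performance, and that current models remain below rodent baselines on tasks such as spatial navigation.

Comments: 8 pages, 6 figures, 4 tables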

Abstract

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.

2604.09297 2026-05-19 cs.SE cs.AI (version update)

SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, Jie M. Zhang

Affiliations: King's College London; Queen's University Belfast; Nanjing University; Johannes Gutenberg University Mainz; University College London; University of Duisburg-Essen

AI summary: This paper proposes SkillMOO, a framework that optimizes agent skill bundles through LLM-proposed edits and NSGA-II selection, raising task success while lowering inference cost.

Abstract

Agent skills are increasingly used to configure coding agents for software engineering (SE) tasks, yet current practice treats them as static, hand-crafted assets, or evolved on pass rate alone. This is insufficient: a skill can improve task success while substantially raising token cost, or introducing misleading guidance. We argue that SE agent skill bundles can be treated as multi-objective search objects and present SkillMOO, a framework that evolves skill bundles through LLM-proposed edits and NSGA-II Pareto selection on pass rate and inference cost. Evaluated across all 16 SkillsBench SE tasks, SkillMOO achieves the top pass rate rank on 11 of 12 non-zero-pass tasks while achieving cost reductions of up to 31.7% over static bundles, with pass rate gains up to 21 percentage points. Analysis of 38 skill edits shows that pruning and substitution dominate successful operations, offering actionable principles for skill bundle design. Thereby, the current practice of deploying skills without cost-aware validation leaves better skill configurations unexplored, motivating a new class of cost-aware, search-based skill engineering.
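
To make the two-objective selection concrete, here is a minimal sketch of the non-dominated filter that sits at the core of Pareto selection on pass rate (higher is better) and inference cost (lower is better). It is not the SkillMOO implementation and omits the rest of NSGA-II (front ranking beyond the first front, crowding distance, variation); the bundle names and scores are invented for illustration.

```python
def dominates(a, b):
    """a, b are (pass_rate, cost); higher pass rate and lower cost are better."""
    no_worse = a[0] >= b[0] and a[1] <= b[1]
    strictly_better = a[0] > b[0] or a[1] < b[1]
    return no_worse and strictly_better

def pareto_front(candidates):
    """Return the names of candidates not dominated by any other candidate."""
    front = []
    for name, score in candidates.items():
        dominated = any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front

# Hypothetical skill bundles: (pass rate, token cost per task).
bundles = {
    "baseline": (0.50, 1000),
    "pruned": (0.50, 700),
    "verbose": (0.60, 1500),
    "misleading": (0.40, 1200),
}
print(pareto_front(bundles))
```

In this toy population the pruned bundle displaces the baseline (same pass rate, lower cost), while the verbose bundle survives because its higher pass rate is a genuine trade-off against cost.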

2604.04202 2026-05-19 cs.LG cs.AI cs.CL (version update)

ClawArena: Benchmarking AI Agents in Evolving Information Environments

Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao

Affiliations: UNC-Chapel Hill; University of California, Santa Cruz; University of California, Berkeley

AI summary: ClawArena evaluates how AI agents cope with evolving information environments. It is organized around three challenges (multi-source conflict reasoning, dynamic belief revision, and implicit personalization) and tests agents across multi-channel sessions, workspace files, and staged updates.

Abstract

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. ClawArena comprises 12 multi-turn scenarios spanning 337 evaluation rounds with 45 dynamic updates, evaluated across five agent frameworks and 18 language models from proprietary, community-accessible, and self-hosted sources. Experiments show that model capability accounts for a 29-point score range across models while framework design accounts for up to a 24-point range, that MetaClaw's skill overlay reliably improves score without degrading accuracy, and that belief revision difficulty is determined by update design strategy rather than update volume. Code is available at https://github.com/aiming-lab/ClawArena.

2604.01674 2026-05-19 cs.AI (version update)

Can Heterogeneous Language Models Be Fused?

Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, Liang He

Affiliations: School of Computer Science and Technology, East China Normal University; Shanghai AI Laboratory

AI summary: This paper proposes HeteroFusion, which addresses architectural mismatch and cross-source conflict in heterogeneous language model fusion through functional-module alignment and conflict-aware denoising, yielding stable and efficient fusion.

Abstract

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are \emph{homogeneous}, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such \emph{heterogeneous} settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with \texttt{HeteroFusion} for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, \texttt{HeteroFusion} consistently outperforms strong merging, fusion, and ensemble baselines.

2604.01404 2026-05-19 cs.CL cs.AI (version update)

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva

Affiliations: Mentaleap; Independent Researcher; Tel Aviv University

AI summary: This study investigates how language models retrieve entity-specific facts by searching for sparse, entity-selective MLP neurons (entity cells), and finds that these cells cluster in early layers and play a causal role.

Abstract

How do language models retrieve entity-specific facts from their parameters? We investigate this question by searching for sparse, entity-selective MLP neurons - which we call entity cells, by analogy to the "grandmother cell" hypothesis in neuroscience - and testing whether they play a causal role in factual recall. We localize candidate entity cells by ranking MLP neurons for activation consistency across varied prompts about the same entity, applying this procedure across seven models on a curated subset of PopQA. In all models, localized neurons cluster predominantly in early layers, an empirical pattern not imposed by the architecture. Using Qwen2.5-7B base as a model organism, we find the clearest causal evidence: suppressing a localized cell selectively erases recall for its matched entity while leaving others intact, and activating a single cell is sufficient to recover correct knowledge for most entities - even when the entity is absent from the context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, suggesting they encode canonical entity identity rather than surface token patterns. Causal signals vary across model families, pointing to architectural differences in how entity knowledge is organized. These findings offer concrete, interpretable access points for understanding, controlling, and correcting factual knowledge in language models, and draw a surprising empirical parallel to longstanding questions in neuroscience about sparse coding of concepts.

2603.23638 2026-05-19 cs.AI (version update)

Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment

Yi Han, Yan Wang, Lingfei Qian, Haohang Li, Yupeng Cao, Yueru He, Xueqing Peng, Nanhan Shen, Yitao Xu, Yankai Chen, Dongji Feng, Jimin Huang, Xue Liu, Jian-Yun Nie, Sophia Ananiadou

Affiliations: Georgia Institute of Technology; The Fin AI; Stevens Institute of Technology; Columbia University; George Mason University; McGill University; Mohamed bin Zayed University of Artificial Intelligence; California State University, Monterey Bay; University of Manchester; Mila – Quebec Artificial Intelligence Institute; Université de Montréal

AI summary: Using the EnterpriseArena simulator, this paper evaluates the long-horizon resource allocation ability of LLMs under uncertainty and finds that existing models fall short on complex tasks, with only 15.4% of trials lasting the full horizon.

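LightZeroNav: Zero-Shot Vision-Language Navigation in Continuous Environments with Lightweight VLMs

Kun Luo, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou

Affiliations: Foshan Graduate School of Innovation, Northeastern University; Faculty of Robot Science and Engineering, Northeastern University; School of Aeronautic Science and Engineering, Beihang University; QingniaoAI, China

AI summary: This paper proposes LightZeroNav, which uses lightweight VLMs to address three major bottlenecks of zero-shot vision-language navigation in continuous environments. Without task-specific training or graph search, it reaches performance comparable to GPT-4o using RGB observations and the lightweight Qwen3-VL-8B model.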

2603.16091 2026-05-19 cs.CL cs.AI (version update)

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

Tianyi Huang, Ying Kai Deng

Affiliations: Ryquo; App-In Club

AI summary: CounterRefine retrieves candidate-specific evidence at inference time and applies a constrained revision step to improve factual question answering accuracy; experiments show that it improves base-model performance on multiple benchmarks.

Comments: Accepted at the 4th Workshop on Towards Knowledgeable Foundation Models at ACL 2026

Abstract

In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight repair layer for short-form RAG that treats the first answer as a hypothesis to test. Given a draft, CounterRefine issues answer-conditioned expansion queries to retrieve candidate-specific evidence, then applies a constrained KEEP or REVISE refinement step whose proposed revisions are accepted only after deterministic validation. The design is intentionally narrow: it adds one evidence-gathering pass and one guarded refinement call rather than replacing the retriever or building a broad agentic system. On the full SimpleQA benchmark, CounterRefine improves a matched one-pass RAG baseline by up to 5.8 correct-rate points; in the full Claude trace, it changes only 5.6% of outputs, with 180 beneficial outcome changes and 8 harmful ones. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

2603.08145 2026-05-19 cs.LG cs.AI (version update)

DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

Affiliations: Fudan University, Shanghai, China; Independent Researcher; Meta AI; Incubation Institute, Fudan University, Shanghai, China

AI summary: DARC is a risk-constrained decoding method that, without retraining, mitigates disagreement and tail risk by maximizing a KL-robust satisfaction objective while maintaining output quality.

Abstract

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.
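
The reranking idea can be sketched with the standard entropic risk measure, which is the usual closed form of a KL-penalized worst-case mean. The abstract does not give DARC's exact objective, so the formula below, the parameter `beta`, the `max_premium` cap, and the candidate scores are assumptions for illustration rather than the paper's definitions.

```python
import math

def entropic_value(scores, beta):
    """Compute -(1/beta) * log(mean(exp(-beta * s))) in a numerically stable way."""
    # Shift by the minimum so every exponent is <= 0 (assumes beta > 0).
    lo = min(scores)
    mean_exp = sum(math.exp(-beta * (s - lo)) for s in scores) / len(scores)
    return lo - math.log(mean_exp) / beta

def rerank(candidates, beta=1.0, max_premium=None):
    """Order candidates by entropic value; optionally drop those over a risk budget."""
    ranked = []
    for name, scores in candidates.items():
        mean = sum(scores) / len(scores)
        robust = entropic_value(scores, beta)
        premium = mean - robust  # gap between mean and robust value
        if max_premium is not None and premium > max_premium:
            continue  # exceeds the allowed risk premium
        ranked.append((robust, name))
    ranked.sort(reverse=True)
    return [name for _, name in ranked]

# Hypothetical per-annotator satisfaction scores for two responses.
candidates = {
    "safe": [0.6, 0.6, 0.6, 0.6],
    "divisive": [1.0, 1.0, 1.0, -0.4],
}
print(rerank(candidates, beta=2.0))
```

Here the divisive response has the higher mean score (0.65 versus 0.6) but ranks second once disagreement is penalized, and it is removed entirely when `max_premium=0.1` is set, which mirrors the cap-or-penalize controls the abstract mentions.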

2603.07438 2026-05-19 cs.AI 版本更新

How Wrong Can Your Counterfactual Be? Quantifying Confounding Bias for Continuous Treatments without a Control Group

你的反事实能错到什么程度？在没有对照组的情况下量化连续处理的混杂偏倚

Yu Wang, Xiangchen Liu, Siguang Li

发表机构 * Department of Economics, Cornell University（康奈尔大学经济学系）； Department of Family and Consumer Sciences, California State University, Long Beach（加州州立大学长滩分校家庭与消费者科学系）； Society Hub, Hong Kong University of Science and Technology (Guangzhou)（香港科技大学（广州）社会枢纽）

AI总结本文提出一种因果压力测试框架，假设未观测混杂因素以加性方式影响结果变量和宏观经济变量，据此量化连续处理下的混杂偏倚，并分析两种估计方法的误差界。

详情

AI中文摘要

压力测试提出的是一个因果问题：如果宏观经济沿不利的反事实路径发展，投资组合的信用损失会如何变化？然而，标准做法仍是预测性的，因此可能受到遗漏变量偏差的影响。本文针对具有连续共同处理且没有对照组的面板数据，提出一种用于因果压力测试的部分识别框架。假设未观测混杂因素以加性方式影响结果变量和宏观经济变量，我们推导出一个闭式的混杂包络，由两个可解释的敏感性参数刻画。我们进一步分析两种实用估计器（递归滚动预测和直接多期预测），推导非渐近误差界，并刻画递归误差累积使直接估计更优的条件。在推断方面，我们将识别包络与重要性加权共形预测相结合，得到协变量偏移下的有限样本区间，将估计不确定性与识别不确定性区分开来。在基于真实美国失业率路径构建的半合成实验中，标准的高精度预测模型仍存在因果偏倚，且覆盖率明显不足，而本文框架在各个压力期限上均达到接近名义水平的覆盖率。

英文摘要

Stress testing poses a causal question: how would portfolio credit losses change if the macroeconomy followed an adverse counterfactual path? Yet standard practice remains predictive and may therefore be vulnerable to omitted-variable bias. We propose a partial identification framework for causal stress testing in panel data with a continuous common treatment and no control group. By assuming that the unobserved confounder affects outcome and macro variables additively, we derive a closed-form confounding envelope parameterized by two interpretable sensitivity parameters. We further analyze two practical estimators -- recursive rollout and direct multi-horizon prediction -- derive non-asymptotic error bounds, and characterize when recursive compounding makes direct estimation preferable. For inference, we combine the identification envelope with importance-weighted conformal prediction, yielding finite-sample intervals that separate estimation uncertainty from identification uncertainty under covariate shift. In semi-synthetic experiments built from real U.S. unemployment paths, standard high-accuracy predictive models remain causally biased and substantially under-cover, whereas the proposed framework achieves near-nominal coverage across stress horizons.

2603.04737 2026-05-19 cs.AI cs.CL cs.LG 版本更新

Interactive Benchmarks

交互式基准测试

Baoqing Yue, Zihan Zhu, Yutong Han, Brian Fan, Qian Sun, Jichen Feng, Hufei Yang, Yifan Zhang, Mengdi Wang

发表机构 * InteractiveBench ； Princeton University（普林斯顿大学）

AI总结本文提出交互式基准测试，通过预算化的多轮交互评估模型推理能力，改进传统基准和偏好评估的局限性，揭示模型在交互场景中的改进空间。

Comments Project Page: https://github.com/interactivebench/interactivebench

2603.02531 2026-05-19 cs.LG cs.AI 版本更新

Geometry-Aware Attention Guidance for Diffusion Models via Modern Hopfield Dynamics

基于现代Hopfield动力学的扩散模型几何感知注意力引导

Kwanyoung Kim

发表机构 * Department of AI Convergence（人工智能融合系）

AI总结本文提出几何感知注意力引导方法，通过现代Hopfield动力学分析注意力外推，证明了稀疏-稠密差异的两个方向性性质，并据此给出一种无需训练、即插即用的外推规则，提升扩散模型的生成质量。

详情

AI中文摘要

无分类器引导（CFG）提高了扩散模型的样本质量，但其两次前向推理以及对空条件训练的依赖限制了它在少步采样场景中的应用。注意力空间引导作为一种互补范式弥补了这一空缺，但此前的稀疏-稠密注意力引导为何有效仍不清楚。我们通过现代Hopfield动力学分析注意力外推，证明了共享条件下稀疏-稠密差异的两个方向性性质，二者共同保证该差异是方向一致的加速信号。在此基础上，我们提出几何感知注意力引导（GAG），一种无需训练、即插即用的外推规则：它将差异分解为相对于检索方向的平行分量和正交分量，放大与收敛方向一致的分量，同时抑制偏离流形的噪声；其稳定性来自弱收缩性质。我们进一步将这种外推解释为注意力空间中的一阶Anderson加速，为注意力外推方法提供了统一视角。GAG是一种通用方法，可跨架构（UNet、MMDiT）和采样场景（多步、少步）泛化，在FLUX.1、较新的FLUX.2和Qwen-Image等多种骨干模型上一致地提升生成质量，且计算开销极小。

英文摘要

Classifier-Free Guidance (CFG) improves sample quality in diffusion models, but its dual-pass inference and reliance on null-condition training limit its use in few-step regimes. Attention-space guidance has emerged as a complementary paradigm that addresses this gap, yet why prior sparse-vs-dense attention guidance works remains elusive. We address this by analyzing attention extrapolation through Modern Hopfield dynamics, proving two directional properties of the sparse-dense discrepancy under shared conditioning that together certify it as a directionally consistent acceleration signal. Building on this, we propose Geometry-Aware Attention Guidance (GAG), a training-free, plug-and-play extrapolation rule that decomposes the discrepancy into parallel and orthogonal components relative to the retrieval direction, amplifying the convergence-aligned component while suppressing off-manifold noise; stability follows from a weak contraction property. We further provide an interpretation of this extrapolation as first-order Anderson Acceleration in attention space, offering a unified perspective on attention extrapolation methods. GAG is a universal method that generalizes across architectures (UNet, MMDiT) and sampling regimes (multi-step, few-step), consistently improving generation quality on diverse backbones, including FLUX.1, the recent FLUX.2, and Qwen-Image, with minimal computational overhead.
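The decomposition step described above (splitting a discrepancy into components parallel and orthogonal to a direction, then weighting them differently) can be sketched with plain vector arithmetic. This is only an illustration under assumptions: the sign convention (sparse minus dense), the choice of the dense output as the base point, and the weights `w_par` and `w_orth` are hypothetical and may differ from the actual GAG rule.

```python
import math

def guided_update(dense, sparse, retrieval_dir, w_par=2.0, w_orth=0.5):
    """Extrapolate from `dense` along the sparse-dense discrepancy.

    The component of the discrepancy parallel to `retrieval_dir` is scaled
    by w_par, and the orthogonal remainder is scaled by w_orth.
    """
    norm = math.sqrt(sum(u * u for u in retrieval_dir))
    unit = [u / norm for u in retrieval_dir]
    diff = [s - d for s, d in zip(sparse, dense)]        # assumed sign convention
    coeff = sum(x * u for x, u in zip(diff, unit))       # projection length
    parallel = [coeff * u for u in unit]
    orthogonal = [x - p for x, p in zip(diff, parallel)]
    return [d + w_par * p + w_orth * o
            for d, p, o in zip(dense, parallel, orthogonal)]

# Toy 2-D example: the retrieval direction is the first axis.
print(guided_update([0.0, 0.0], [1.0, 1.0], [2.0, 0.0]))
```

Choosing `w_par` above 1 and `w_orth` below 1 mirrors the stated intent of amplifying the convergence-aligned component while damping the rest; setting both to 1 recovers the sparse output unchanged.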

2603.01227 2026-05-19 cs.AI 版本更新

The Lattice Representation Hypothesis of Large Language Models

大语言模型的格表示假说

Bo Xiong

发表机构 * Stanford University（斯坦福大学）

AI总结本文提出大语言模型的格表示假说，通过嵌入几何将概念层次和逻辑运算统一到线性表示中，实验表明LLM嵌入编码了概念格及其逻辑结构。

Comments Accepted at ICLR 2026

2603.01092 2026-05-19 cs.AI cs.LG 版本更新

The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions

科学的异类空间：采样连贯但认知不可用的研究方向

Alejandro H. Artiles, Martin Weiss, Levin Brinkmann, Iyad Rahwan, Bernhard Schölkopf, Christopher Pal, Hugo Larochelle, Anirudh Goyal, Nasim Rahaman

发表机构 * Max Planck Institute for Human Development（马克斯·普朗克人类发展研究所）； Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； ELLIS Institute Tübingen（图宾根ELLIS研究所）； Polytechnique Montreal（蒙特利尔理工学院）； CIFAR AI Chair（CIFAR人工智能主席）； Mila – Quebec AI Institute（魁北克人工智能研究所）； Tiptree Systems（Tiptree系统）

AI总结本文提出一种框架，通过分解论文为概念单元并学习两个互补模型，采样出连贯但认知上不可用的研究方向，其探索的有效原子词表比前沿LLM构思基线大3.5至7倍。

Comments 10 main pages, 42 appendix pages, 29 figures

详情

AI中文摘要

科学发现不仅受制于何为真，也受制于当前探索某一领域的研究者在认知上能够触及什么。许多方向从文献来看是连贯的，却不太可能被提出，因为没有任何现有社区恰好具备所需的概念、方法和直觉组合。现代语言模型继承了这种偏差，在被要求提出新想法时只是重新组合文献中的高密度区域。我们提出一个面向互补区域的框架，并将该区域称为科学的异类空间：其中的方向在现有知识结构下是合理的，但在现有研究者的分布下不太可能出现。我们的方法首先将论文分解为细粒度的概念单元，并将其聚类为由想法原子（idea atoms）构成的共享词表，然后在该词表上学习两个互补模型：连贯性模型评估一组原子能否构成可行的研究方向，可用性模型评估是否有现有作者社区有条件产出给定组合。于是，采样异类方向就归结为对原子组合排序，在最大化连贯性的同时最小化可用性。在由16,068篇经同行评审的LLM论文（来自NeurIPS、ICLR、ICML和主要NLP会议）构成的语料库上，所得采样器在不牺牲连贯性的前提下，探索的有效原子词表比前沿LLM构思基线大3.5至7倍，并且在盲评LLM、人工评估和下游实验评估中产生的想法达到或超过这些基线。通过将科学合理性与社区可用性区分开来，我们的框架指向一种补充而非仅仅加速人类科学的AI构思方式，将探索扩展到当前社区可能忽视的连贯方向。

英文摘要

Scientific discovery is constrained not only by what is true, but by what is cognitively available to the researchers currently exploring a field. Many directions are coherent in light of the literature yet unlikely to be proposed because no existing community occupies the right combination of concepts, methods, and intuitions. Modern language models inherit this bias, recombining high-density regions of the literature when prompted for novel ideas. We introduce a framework that targets the complementary region, which we call the alien space of science, where directions are plausible under the structure of existing knowledge but unlikely under the distribution of existing researchers. Our method first decomposes papers into granular conceptual units and clusters them into a shared vocabulary of idea atoms. It then learns two complementary models over this vocabulary. A coherence model scores whether a combination of atoms forms a viable research direction, and an availability model scores whether any existing author community is positioned to produce a given combination. Sampling alien directions then reduces to ranking atom combinations that maximize coherence while minimizing availability. On a corpus of 16,068 peer-reviewed LLM papers from NeurIPS, ICLR, ICML, and major NLP venues, the resulting sampler explores a 3.5-7x broader effective atom vocabulary than frontier LLM ideation baselines without sacrificing coherence, and produces ideas that match or exceed those baselines under blind LLM, human, and downstream experimental evaluation. By separating scientific plausibility from community availability, our framework points toward AI ideation that complements rather than merely accelerates human science, expanding exploration into coherent directions that the current community may overlook.
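The ranking step in this abstract (maximize coherence while minimizing availability) can be pictured as scoring every atom combination and sorting. The sketch below is a hypothetical illustration: the scalar trade-off `lam`, the lookup-table scorers, and the toy atoms and scores are invented for the example and stand in for the learned coherence and availability models.

```python
from itertools import combinations

def rank_alien(atoms, coherence, availability, size=2, lam=1.0, top_k=3):
    """Rank atom combinations by coherence minus lam * availability."""
    scored = [
        (coherence(combo) - lam * availability(combo), combo)
        for combo in combinations(atoms, size)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [combo for _, combo in scored[:top_k]]

# Toy lookup tables standing in for the two learned models (made-up values).
atoms = ["diffusion", "theorem proving", "caching"]
toy_coherence = {
    ("diffusion", "theorem proving"): 0.8,
    ("diffusion", "caching"): 0.3,
    ("theorem proving", "caching"): 0.7,
}
toy_availability = {
    ("diffusion", "theorem proving"): 0.1,
    ("diffusion", "caching"): 0.2,
    ("theorem proving", "caching"): 0.9,
}

ranked = rank_alien(atoms, toy_coherence.get, toy_availability.get)
print(ranked[0])
```

In this toy setup the pair that is coherent but rarely available to any community ranks first, while an equally coherent pair that many authors could already produce is pushed down.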

2603.00975 2026-05-19 cs.LG cs.AI 版本更新

Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

遗忘是竞争：重新思考扩散模型中的去学习作为表征干扰

Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal

发表机构 * TCS Research（塔塔咨询服务公司研究院）； Kalinga Institute of Industrial Technology（卡林加工业技术学院）

AI总结本文提出SurgUn方法，通过可控竞争而非直接删除或一对一重分配来实现扩散模型的去学习，有效平衡遗忘与保留，提升模型在版权、安全等场景下的表现。

详情

AI中文摘要

已部署的文本到图像扩散模型越来越需要事后的概念去学习，以应对版权主张、艺术家退出、安全更新和受保护内容的处置，而无需完全重新训练。核心挑战是擦除与保留之间的失衡：激进的更新能抑制目标概念，但会损害共享能力；保守的或基于锚点的更新能保持质量，却使概念仍可通过相关、组合、改写或对抗性提示被恢复。受倒摄干扰（retroactive interference）启发，我们提出SurgUn，将遗忘视为受控的竞争，而非直接删除或一对一的重新指派。SurgUn通过以干扰项为条件的梯度竞争来实现倒摄式概念干扰：对目标做梯度上升，削弱以目标为条件的去噪或流匹配行为；同时在语义多样的干扰项集合上做梯度下降，在相同提示上下文下引入相互竞争的非目标轨迹。这样，输出被重新分配到多个非目标模式上，而不是坍缩到单一替代概念。为限制经由共享通路造成的附带遗忘，SurgUn加入了以像素为依据的权重空间定位，这是一种轻量级诊断方法，根据生成图像上的擦除-保留表现来选择注意力块，利用了“抑制普遍可以实现、而保留具有块选择性”这一不对称性。在UnlearnCanvas、IP角色擦除、Holistic Unlearning、EraseBench和Ring-A-Bell上，针对Stable Diffusion v1.5、SDXL和SANA-1.5，SurgUn取得了比基线更好的擦除-保留平衡。消融实验表明，多样的干扰项、对比式竞争和定位对于在保留相关与不相关概念的同时实现稳健抑制都是必要的。

英文摘要

Deployed text-to-image diffusion models increasingly require post-hoc concept unlearning for copyright claims, artist opt-outs, safety updates, and protected-content mitigation without full retraining. A central challenge is erase-retain imbalance: aggressive updates suppress targets but damage shared capabilities, while conservative or anchor-based updates preserve quality yet leave concepts recoverable through related, compositional, paraphrased, or adversarial prompts. Inspired by retroactive interference, we propose SurgUn, which treats forgetting as controlled competition rather than direct deletion or one-to-one reassignment. SurgUn instantiates retroactive concept interference via distractor-conditioned gradient competition: target-gradient ascent weakens target-conditioned denoising or flow-matching behavior, while descent over a semantically diverse distractor set introduces competing non-target trajectories under the same prompt context. This redistributes outputs across multiple non-target modes instead of collapsing to a single proxy. To limit collateral forgetting through shared pathways, SurgUn adds pixel-grounded weight-space localization, a lightweight diagnostic that selects attention blocks by generated-image erase-retain behavior, exploiting the asymmetry that suppression is broadly achievable whereas retention is block-selective. Across UnlearnCanvas, IP-character erasure, Holistic Unlearning, EraseBench, and Ring-A-Bell on Stable Diffusion v1.5, SDXL, and SANA-1.5, SurgUn achieves a stronger erase-retain balance than baselines. Ablations show that diverse distractors, contrastive competition, and localization are all necessary for robust suppression while preserving related and unrelated concepts.

2603.00876 2026-05-19 cs.AI cs.MA 版本更新

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

BioProAgent：用于受限科学规划的神经符号接地

Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian

发表机构 * School of AI for Science, Peking University（科学人工智能学院，北京大学）； School of Electronic and Computer Engineering, Peking University（电子与计算机工程学院，北京大学）； School of Computer Science, Peking University（计算机科学学院，北京大学）

AI总结 BioProAgent通过神经符号框架将概率规划锚定在确定性有限状态机中，解决复杂设备模式中的上下文瓶颈，提升物理执行的可靠性。

详情

AI中文摘要

大型语言模型（LLMs）在科学发现中展现出显著的推理能力，但在湿实验室等不可逆环境中难以实现物理执行。我们提出BioProAgent，一种神经符号框架，将概率规划锚定在确定性有限状态机（FSM）中。我们引入了状态增强的规划机制，强制执行严格的设计-验证-修正工作流，确保硬件兼容性后再执行。此外，我们通过语义符号接地解决复杂设备模式中的上下文瓶颈，通过符号抽象减少约6倍的token消耗。在扩展的BioProBench基准测试中，BioProAgent达到95.6%的物理兼容性（相比ReAct的21.0%），证明神经符号约束对于不可逆物理环境中的可靠自主性至关重要。代码：https://github.com/YuyangSunshine/bioproagent | 网站：https://yuyangsunshine.github.io/BioPro-Project.

英文摘要

Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect; they can cause equipment damage or experimental failure. We propose BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous Design-Verify-Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by ~6x through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. Code: https://github.com/YuyangSunshine/bioproagent | Website: https://yuyangsunshine.github.io/BioPro-Project.

2602.22801 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自动驾驶中的潜力

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, Long Chen, Ya-Qin Zhang, Xianyuan Zhan, Jingjing Liu

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University（人工智能产业研究院（AIR），清华大学）

AI总结本文通过大规模实车数据和道路测试，系统研究了扩散模型在端到端自动驾驶中的规划能力，提出Hyper Diffusion Planner框架，实现10倍性能提升。

详情

AI中文摘要

扩散模型已成为机器人决策任务中的流行选择，近年来也开始被考虑用于解决自动驾驶任务。然而，其在自动驾驶中的应用和评估仍局限于模拟或实验室环境。本研究通过大规模实车数据和道路测试，系统研究了扩散模型作为端到端自动驾驶规划器的潜力。通过全面而受控的研究，我们识别了扩散损失空间、轨迹表示和数据缩放等关键洞察，显著影响端到端规划性能。此外，我们还提供了一种有效的强化学习后训练策略，进一步提升学习规划器的安全性和鲁棒性。所提出的扩散学习框架Hyper Diffusion Planner (HDP)在真实车辆平台上部署，并在6个城市驾驶场景和200公里的真实世界测试中，实现了相对于基模型的10倍性能提升。本文证明了当正确设计和训练时，扩散模型可以作为有效且可扩展的端到端自动驾驶规划器，用于复杂的真实世界自动驾驶任务。

英文摘要

Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety and robustness of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.

2602.20706 2026-05-19 cs.AI cs.DS 版本更新

Online Algorithms with Unreliable Guidance

具有不可靠指导的在线算法

Julien Dallot, Yuval Emek, Yuval Gil, Maciej Pacut, Stefan Schmid

发表机构 * TU Berlin, Germany（柏林工业大学，德国）； Technion, Israel（以色列理工学院，以色列）； Reykjavik University, Iceland（雷克雅未克大学，冰岛）； TU Berlin and Weizenbaum Institute, Germany（柏林工业大学和魏岑鲍姆研究所，德国）

AI总结本文提出了一种名为OAG的在线算法模型，通过请求-回答游戏的视角，分离预测与算法组件，构建了通用分析框架，从而开发出首个通用编译器DTB，将标准在线算法转化为学习增强型算法，并在三个经典问题中取得新的性能平衡和最优解。

详情

AI中文摘要

本文提出具有不可靠指导的在线算法（OAG），这是一个面向机器学习增强的在线决策的模型，它清晰地分离了预测组件和算法组件，从而提供一个只依赖于所研究问题的、单一且定义明确的分析框架。OAG模型以请求-回答博弈为基础，引入了若干概念（取自回答空间的预测、指导（guide）、随时竞争性），使学习增强型算法的分析可以独立于预测器相关的选择（如预测语义、误差函数或探测策略），否则这些选择会限制算法的通用性和适用性。借助OAG模型的简洁框架，我们构建了首个通用编译器，即“丢弃或盲目信任”（drop-or-trust-blindly，DTB）编译器，它能将几乎任何标准的、不使用预测的在线算法转化为学习增强型算法。尽管DTB编译器很简单，我们证明它为三个经典在线问题给出了具有强一致性-鲁棒性保证的新学习增强型算法：在对抗性到达顺序的二分图匹配上取得了新的权衡，并在缓存和均匀度量任务系统上得到了最优解。

英文摘要

This paper introduces online algorithms with unreliable guidance (OAG), a model for ML-augmented online decision-making that cleanly separates the predictive and algorithmic components, thus offering a single, well-defined analysis framework that depends only on the problem at hand. Formulated through the lens of request-answer games, the OAG model brings multiple concepts (predictions from the answer space, guide, anytime competitiveness) which enable learning-augmented algorithms to be analyzed independently of predictor-specific choices - such as prediction semantics, error functions, or probing strategies - that would otherwise restrict the algorithm's generality and applicability. The clean framework of the OAG model allows us to build the first generic compiler, the drop-or-trust-blindly (DTB) compiler, that turns almost any standard, prediction-free online algorithm into a learning-augmented one. Although the DTB compiler is simple, we show that it produces new learning-augmented algorithms with strong consistency-robustness guarantees for three classic online problems: we achieve new trade-offs for bipartite matching with adversarial arrival order, and obtain optimal solutions for caching and uniform metrical task systems.

2602.18584 2026-05-19 cs.LG cs.AI cs.CV 版本更新

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST: 通过耦合优化几何进行指令微调的目标数据选择

Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

发表机构 * Department of Computer Science, University of Virginia, Charlottesville, USA（弗吉尼亚大学计算机科学系）

AI总结本文提出GIST方法，通过子空间对齐替代轴对齐缩放，解决参数高效微调中参数耦合问题，实现更高效的目标数据选择。

Comments ICML 2026; 27 pages, 8 figures, 11 tables

详情

AI中文摘要

目标数据选择已成为高效指令微调的关键范式，旨在为特定目标任务找出一小部分有影响力的训练样本。在实践中，影响力通常通过样本对参数更新的作用来衡量。为了使选择可扩展，许多方法利用优化器统计量（如Adam状态）作为更新几何的轴对齐近似（即对角预条件），隐式地将各参数视为逐坐标独立。我们证明，在LoRA等参数高效微调（PEFT）方法中，这一假设不再成立。在这种设定下，所诱导的优化几何表现出强烈的跨参数耦合和不可忽略的非对角交互，而与任务相关的更新方向被限制在一个低维子空间中。受这种不匹配的启发，我们提出GIST（梯度等距子空间变换），一种简单而有原则的替代方案，用稳健的子空间对齐取代轴对齐缩放。GIST通过奇异值分解（SVD）从验证梯度中恢复任务特定的子空间，将训练梯度投影到这个耦合子空间中，并按样本与目标方向的对齐程度打分。大量实验表明，在相同的选择预算下，GIST仅用0.29%的存储和25%的计算时间，就达到或超过了当前最先进的基线。

英文摘要

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.

2602.11553 2026-05-19 cs.CV cs.AI 版本更新

Perception-based Image Denoising via Generative Compression

基于生成压缩的感知图像去噪

Nam Nguyen, Thinh Nguyen, Bella Bose

发表机构 * School of Electrical and Computer Engineering, Oregon State University, Corvallis, OR 97331, USA（电气与计算机工程学院，俄勒冈州立大学，科瓦利斯，OR 97331，USA）

AI总结本文提出基于生成压缩的去噪框架，通过熵编码潜在表示和感知度量提升去噪效果，实验显示在保持失真性能的同时实现感知质量的提升。

详情

AI中文摘要

图像去噪旨在去除噪声的同时保持结构细节和感知真实感，但以失真为导向的方法常常产生过度平滑的重建结果，在强噪声和分布偏移下尤为明显。本文提出一种面向感知去噪的生成压缩框架：从熵编码的潜在表示中重建图像以施加低复杂度结构约束，同时由生成式解码器借助感知度量（如学习感知图像块相似度（LPIPS）损失和Wasserstein距离）恢复真实的纹理。本文给出两种互补的实现：(i) 基于条件Wasserstein GAN（WGAN）的压缩去噪器，可显式控制码率-失真-感知（RDP）权衡；(ii) 基于条件扩散的重建策略，在压缩潜变量的引导下进行迭代去噪。我们进一步为加性高斯噪声下基于压缩的最大似然去噪器建立了非渐近保证，包括重建误差和解码错误概率的界。在合成噪声和真实噪声基准上的实验显示，该方法在保持有竞争力的失真性能的同时，带来了一致的感知质量提升。

英文摘要

Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.

2602.08354 2026-05-19 cs.AI 版本更新

钻石映射：通过随机流映射实现高效的奖励对齐

Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz

发表机构 * MIT CSAIL（麻省理工学院计算机科学与人工智能实验室）； Carnegie Mellon University（卡内基梅隆大学）

AI总结本文提出钻石映射，一种通过随机流映射实现高效奖励对齐的生成模型，能够在推理时对任意奖励进行准确对齐，提升模型适应性和性能。

2602.02039 2026-05-19 cs.AI cs.CL cs.DB cs.LG 版本更新

Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

主动出击而非被动等待：评估大语言模型的深度数据研究能力

Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

发表机构 * GitHub

AI总结本文提出深度数据研究（DDR）任务和DDR-Bench基准，评估大型语言模型的探索智能，发现有效探索需要内在策略而非单纯扩展。

Comments 14 pages, 7 tables, 8 figures, accepted by ICML 2026

2602.01705 2026-05-19 cs.LG cs.AI 版本更新

LaDi-RL: Latent Diffusion Reasoning Prevents Entropy Collapse in Reinforcement Learning

LaDi-RL：潜在扩散推理防止强化学习中的熵崩溃

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin

发表机构 * UC San Diego（加州大学圣迭戈分校）； Apple（苹果公司）

AI总结本文提出LaDi-RL方法，通过潜在扩散模型生成潜在推理轨迹，解决强化学习中熵崩溃问题，提升代码生成和数学推理性能。

详情

AI中文摘要

强化学习已成为提升大语言模型推理能力的核心范式，但现有方法大多在离散的token序列上优化策略，导致优化空间与推理结构不匹配：许多重要决策是语义层面、全局性和轨迹级的，而不是局部的token选择。连续潜在空间中的强化学习提供了一种有前景的替代方案，使策略能够探索更高层次的推理表示。然而，仅仅转向潜在空间并不够，所得策略必须对有效推理轨迹上复杂的多模态分布建模。为此，我们提出潜在扩散推理与强化学习（LaDi-RL），由扩散模型通过迭代去噪生成潜在推理轨迹。这种形式支持结构化探索和富有表达力的分布建模，但也带来了根本性的信用分配难题：策略在潜在空间中行动，而奖励只有在潜变量被解码为文本之后才能观察到。朴素的rollout策略因此会把潜在推理质量与文本解码质量纠缠在一起，无法判断错误答案是源于糟糕的潜在轨迹还是源于不完美的文本实现。为解决这一问题，我们引入分层的潜变量-文本rollout：为每条潜在轨迹采样多个文本补全，并聚合它们的奖励，得到对解码器边缘化后的潜在效用估计。这为优化扩散策略提供了更干净、方差更低的奖励信号。实验表明，LaDi-RL的pass@1在代码生成和数学推理上分别比token级强化学习高9.4%和5.7%，甚至超过了基座模型的pass@k性能。

英文摘要

Reinforcement learning has become a central paradigm for improving LLM reasoning, but most existing methods optimize policies over discrete token sequences. This creates a mismatch between the optimization space and the structure of reasoning: many important decisions are semantic, global, and trajectory-level rather than local token choices. Continuous latent-space RL offers a promising alternative by allowing policies to explore higher-level reasoning representations. However, simply moving to latent space is not sufficient. The resulting policy must model a complex, multi-modal distribution over valid reasoning trajectories. We therefore propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), where a diffusion model generates latent reasoning trajectories through iterative denoising. This formulation enables structured exploration and expressive distribution modeling, but also introduces a fundamental credit-assignment challenge: the policy acts in latent space, while rewards are observed only after the latent is decoded into text. A naive rollout strategy therefore entangles latent reasoning quality with text decoding quality, making it unclear whether an incorrect answer results from a poor latent trajectory or from an imperfect textual realization. To address this, we introduce hierarchical latent-text rollouts. We sample multiple text completions for each latent trajectory and aggregate their rewards to obtain a decoder-marginalized estimate of latent utility. This provides a cleaner and lower-variance reward signal for optimizing the diffusion policy. Empirically, LaDi-RL outperforms token-level RL by 9.4% on code generation and 5.7% on math reasoning in pass@1, and even surpasses the base model's pass@k performance.
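The hierarchical rollout idea above (sample several text completions per latent trajectory and aggregate their rewards) amounts to a Monte Carlo average over the decoder. The sketch below is a minimal illustration under assumptions: `latent_utility`, the toy decoder, and the toy reward are hypothetical stand-ins, and the plain mean is just one possible aggregation.

```python
import random

def latent_utility(latent, decode, reward, num_texts=8, seed=0):
    """Estimate a latent's utility by averaging rewards over sampled decodes."""
    rng = random.Random(seed)
    rewards = [reward(decode(latent, rng)) for _ in range(num_texts)]
    return sum(rewards) / len(rewards)

def toy_decode(latent, rng):
    # Hypothetical decoder: emits a correct answer with probability `latent`.
    return "correct" if rng.random() < latent else "wrong"

def toy_reward(text):
    # Hypothetical verifier: 1.0 for a correct answer, 0.0 otherwise.
    return 1.0 if text == "correct" else 0.0

print(latent_utility(1.0, toy_decode, toy_reward))  # every decode is correct
print(latent_utility(0.0, toy_decode, toy_reward))  # every decode is wrong
print(latent_utility(0.5, toy_decode, toy_reward, num_texts=64))
```

Averaging over more decodes per latent lowers the variance of the utility estimate, which is why a single decode can misattribute a decoding failure to the latent trajectory itself.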

2601.23154 2026-05-19 cs.LG cs.AI 版本更新

On Safer Reinforcement Learning for Sedation and Analgesia in Intensive Care

论重症监护镇静与镇痛中更安全的强化学习

Joel Romero-Hernandez, Oscar Camara

发表机构 * BCN MedTech, Complex Systems Lab, Universitat Pompeu Fabra, Barcelona, Spain（BCN MedTech，复杂系统实验室，庞培法布拉大学，西班牙巴塞罗那）

AI总结本文提出一种离线深度强化学习框架，用于优化重症监护中的镇痛和镇静，通过减少疼痛或联合减少疼痛和30天出院后死亡率来提升治疗安全性。

Comments 48th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC 2026)

详情

AI中文摘要

重症监护中的疼痛管理通常涉及复杂的权衡，因为治疗不足和治疗过度都会危及患者安全。此前将强化学习用于镇静和镇痛的研究探讨了如何优化这些干预，但没有考虑患者生存或部分可观测性。为考察这些设计选择的风险，我们开发了一个离线深度强化学习框架，基于循环状态表示给出每小时的用药剂量建议。利用MIMIC-IV数据库中47,144次ICU住院的回顾性数据，我们训练并评估了行为正则化的actor-critic模型，按两种目标开具阿片类药物、丙泊酚、苯二氮䓬类药物和右美托咪定的连续剂量：仅减轻疼痛，或同时减轻疼痛并降低出院后30天死亡率。虽然两种策略都与更低的疼痛水平相关，但临床医生与仅针对疼痛的策略的一致程度与死亡率呈正相关（ρ=0.119，p<0.0001），而与联合策略的一致程度与死亡率呈负相关（ρ=-0.316，p<0.0001）。我们发现，这种分歧源于两种策略对高合并症水平的不同反应。这表明，即使短期目标仍是首要目标，重视出院后结局对于学习更安全的治疗策略也可能至关重要。

英文摘要

Pain management in intensive care usually involves complex trade-offs, since both inadequate and excessive treatment can compromise patient safety. Prior work on reinforcement learning for sedation and analgesia has explored how to optimize these interventions, but has not considered patient survival or partial observability. To investigate the risks of these design choices, we developed an offline deep reinforcement learning framework that suggests hourly medication doses based on recurrent state representations. Using retrospective data from 47,144 ICU stays in the MIMIC-IV database, we trained and evaluated behavior-regularized actor-critic models that prescribe continuous doses of opioids, propofol, benzodiazepines, and dexmedetomidine according to two goals: reduce pain or jointly reduce pain and 30-day post-discharge mortality. Although the two resulting policies were associated with lower pain, clinician agreement with the pain-only policy was positively correlated with mortality ($ρ$=0.119, p<0.0001), while agreement with the joint policy was negatively correlated ($ρ$=-0.316, p<0.0001). We found that such divergence arose from a different response to high levels of comorbidity. This suggests that valuing post-discharge outcomes could be critical for learning safer treatment policies, even if a short-term goal remains the primary objective.

2601.22664 2026-05-19 cs.AI 版本更新

Real-Time Aligned Reward Model beyond Semantics

超越语义的实时对齐奖励模型

Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

发表机构 * Beihang University（北京航空航天大学）； Tsinghua University（清华大学）； Renmin University of China（中国人民大学）； The Chinese University of Hong Kong, Shenzhen（香港中文大学（深圳））

AI总结本文提出R2M框架，通过实时利用策略模型反馈来对齐策略分布偏移，缓解RLHF中的奖励过度优化问题。

详情

AI中文摘要

基于人类反馈的强化学习（RLHF）是使大语言模型（LLM）与人类偏好对齐的关键技术，但它容易出现奖励过度优化，即策略模型过度拟合奖励模型，利用虚假的奖励模式，而不是忠实地捕捉人类意图。以往的缓解方法主要依赖表层语义信息，无法有效解决策略分布持续偏移所造成的奖励模型（RM）与策略模型之间的不匹配。这不可避免地使奖励偏差不断增大，加剧奖励过度优化。为克服这些局限，我们提出R2M（实时对齐奖励模型），一种新的轻量级RLHF框架。R2M不再像普通奖励模型那样仅依赖预训练LLM的语义表示，而是利用策略不断演化的隐藏状态（即策略反馈），在强化学习过程中与策略的实时分布偏移保持对齐。这项工作为通过实时利用策略模型的反馈来提升奖励模型性能指出了一个有前景的新方向。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model and exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily rely on surface semantic information and fail to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.

2601.22530 2026-05-19 cs.AI 版本更新

超越表面遗忘：多模态大语言模型中幻觉的锐度感知鲁棒擦除

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

发表机构 * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics（南京航空航天大学计算机科学与技术学院）； Institute for AI, Tsinghua University（清华大学人工智能研究院）； Huzhou University（湖州大学）； Institute of Dataspace, Hefei Comprehensive National Science Center（合肥综合性国家科学中心数据空间研究院）； University of Science and Technology of China（中国科学技术大学）

AI总结本文提出SARE方法，通过有针对性的min-max优化和Targeted-SAM机制，解决多模态大语言模型中幻觉的鲁棒擦除问题，提升模型稳定性与擦除效果。

详情

AI中文摘要

多模态大语言模型能力强大，但容易产生对象幻觉，即描述并不存在的实体，从而损害可靠性。尽管近期的去学习方法试图缓解这一问题，我们发现了一个关键缺陷：结构脆弱性。我们通过实验证明，标准擦除只能实现表面抑制，使模型陷入尖锐极小值，在轻量级的重新学习之后幻觉会灾难性地重新出现。为确保几何稳定性，我们提出SARE，将去学习表述为有针对性的min-max优化问题，并使用Targeted-SAM机制显式地平坦化幻觉概念周围的损失地形。通过在模拟的最坏情况参数扰动下抑制幻觉，我们的框架确保擦除效果在权重变化下依然稳健。大量实验表明，SARE在擦除效果上显著优于基线，同时保持了一般生成质量。关键的是，它在重新学习和参数更新之后仍能持续抑制幻觉，验证了几何稳定化的有效性。

英文摘要

Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.

2601.16172 2026-05-19 cs.AI 版本更新

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

强化学习训练的Lean定理证明器在推理时的多样性：一项诊断研究

Zachary Burton

发表机构 * MIT（麻省理工学院）

AI总结研究发现强化学习训练的Lean定理证明器在推理时存在模式崩溃：增加采样预算并未增加解出的定理数量，而固定的战术骨架调度可显著提升性能；受控的多样性消融排除了提示多样性这一混杂因素。

Comments 20 pages

详情

AI中文摘要

强化学习训练的Lean定理证明器在推理时出现模式崩溃：在miniF2F-test上使用DeepSeek-Prover-V1.5-RL时，将独立同分布采样预算从k=32翻倍到k=64，没有多解出任何定理（两种情况下均为42/244）。由15个战术骨架组成的固定调度打破了这一平台期，在k=16时带来+45%的相对提升（n=3个种子上平均Δ=+12.3±4.2个定理，每个种子上符号一致）。受控的多样性消融排除了提示多样性这一混杂因素：战术骨架有帮助，改写（paraphrase）与基线持平，而无关的Lean注释反而会降低性能。按形式化难度进行的留一法分层显示，这三种扰动之间存在一个结构-内容梯度。这一现象是强化学习特有的：无论采用何种干预，V1.5-Base都证明不出任何定理，这表明强化学习是产生证明能力的阶段，而该能力随后发生崩溃；在另外两个7B规模的Lean证明器上，强化学习训练的DeepSeek-Prover-V2-7B尽管总体持平，却贡献了+3个任何独立同分布基线都无法解出的前沿定理，而SFT训练的Goedel-Prover则没有（-10.0±4.4个定理，n=3，每个种子上符号一致）。推理时的结构多样性是强化学习训练的证明器的一条低成本、互补的改进途径，与扩大模型规模或训练计算量正交。

英文摘要

RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d.\ sampling budget from $k{=}32$ to $k{=}64$ produces zero additional solved theorems (42/244 in both cases). A fixed schedule of 15 tactic skeletons breaks this plateau and recovers a $+45\%$ relative improvement at $k{=}16$ (mean $\Delta = +12.3 \pm 4.2$ theorems across $n{=}3$ seeds, sign preserved in every seed). A controlled diversity ablation rules out the prompt-diversity confound: tactic skeletons help, paraphrases match the baseline, and irrelevant Lean comments actively degrade. A leave-one-out formalization-difficulty stratification reveals a structural-content gradient across the three perturbations. The phenomenon is RL-specific: V1.5-Base proves zero theorems regardless of intervention, identifying RL as the stage that creates the proof capability which subsequently collapses; extending to two additional 7B Lean provers, RL-trained DeepSeek-Prover-V2-7B contributes $+3$ frontier solves no i.i.d.\ baseline can reach despite a flat aggregate, while SFT-trained Goedel-Prover does not ($-10.0 \pm 4.4$ theorems, $n{=}3$, sign preserved every seed). Inference-time structural diversity is a cheap, complementary axis for RL-trained provers, orthogonal to scaling model size or training compute.

2601.13992 2026-05-19 cs.CL cs.AI 版本更新

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

整体大于部分之和：一种兼容性感知的多教师CoT蒸馏框架

Jin Cui, Jiaqi Guo, Ruixuan Yang, Jiayi Lu, Jiepeng Zhou, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University（人机混合增强智能国家重点实验室，人工智能与机器人研究院，西安交通大学）； Nankai University（南开大学）； The Hong Kong University of Science and Technology(Guangzhou)（香港科技大学（广州））； School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University（软件学院，人机混合增强智能国家重点实验室，人工智能与机器人研究院，西安交通大学）

AI总结本文提出COMPACT框架，通过动态加权不同教师的梯度，结合多维指标提升学生模型的推理能力，有效整合多样化推理能力并减少灾难性遗忘。

Comments 11pages, 9figures

详情

AI中文摘要

思维链（CoT）推理赋予大语言模型（LLMs）卓越的能力，但通常需要极大的参数规模。CoT蒸馏已成为将推理能力迁移到紧凑学生模型（SLMs）的一种有前景的范式，但现有方法通常依赖单一教师，这限制了学生的潜力，因为单个LLM往往具有各自的能力偏向，并可能出现灾难性遗忘。利用多样化的教师看似有吸引力，但有效融合它们的监督信号仍具挑战：教师与学生不兼容可能放大幻觉，而被动监督无法确保逻辑被真正内化。为此，我们提出COMPACT框架，它根据由多维指标评估的学生实时兼容性，对各教师的梯度进行动态加权，从而自适应地融合不同教师的监督：（1）基于图的共识，通过识别主流推理路径来过滤误导性的推理依据；（2）基于互信息的适应性，用于检测“顿悟时刻”，使学生真正理解推理过程而非单纯模仿；（3）基于损失的难度，用于评估学生对教师指导的接受度并防止负迁移。大量实验和潜在空间分析表明，COMPACT能有效整合多样化的推理能力而不破坏模型原有的知识结构，在多个基准测试上取得了最先进的性能，同时缓解了灾难性遗忘。

英文摘要

Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.

2601.11956 2026-05-19 cs.CL cs.AI 版本更新

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

双重校准：通过校准知识和推理置信度实现可靠的LLM

Yuyin Lu, Ziran Liang, Yanghui Rao, Wenqi Fan, Fu Lee Wang, Qing Li

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China（中山大学计算机科学与工程学院，广州，中国）； Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR（香港理工大学计算机系，香港特别行政区）； School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR（香港都会大学科技学院，香港特别行政区）

AI总结本文提出双重校准框架，通过校准知识和推理置信度提升LLM的可靠性，实验表明其在保持低token成本的同时显著提高准确性和置信度校准。

Comments This work is to appear in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

2601.11895 2026-05-19 cs.LG cs.AI cs.SE 版本更新

FormuLLA：一种用于生成新型3D打印配方的大型语言模型方法

Adeshola Okubena, Yusuf Ali Mohammed, Moe Elbadawi

发表机构 * School of Biological and Behavioural Sciences, Queen Mary University of London（伦敦女王玛丽大学生物与行为科学学院）

AI总结本文提出FormuLLA方法，利用微调后的大型语言模型推荐3D打印配方的辅料并预测丝材机械性能，揭示了模型选择和参数对性能的影响，指出小模型易遗忘及标准指标无法评估配方可加工性。

详情

AI中文摘要

ShareChat: 一个真实对话的大型数据集

Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

发表机构 * Indiana University（印第安纳大学）

AI总结本文提出ShareChat数据集，包含142,808条对话（660,293个回合），涵盖95种语言，分析不同平台对话完整性和响应延迟差异，揭示多平台交互特性。

详情

AI中文摘要

通过统一的文本接口评估大型语言模型（LLMs），当前学术基准掩盖了不同商业平台的独特设计和功能如何影响真实用户行为和系统性能。为弥合这一差距，我们提出了ShareChat，这是首个包含142,808条对话（660,293个回合）的大型语料库，从ChatGPT、Perplexity、Grok、Gemini和Claude的公开共享URL中收集。ShareChat保留了原生平台功能，包括引用、思考痕迹和代码 artifacts，涵盖95种语言，时间跨度从2023年4月至2025年10月，补充了现有语料库中同质化交互的不足。为了展示数据集的评估用途，我们提出了三个案例研究：对话完整性分析评估跨平台意图满足差异，来源定位分析比较搜索增强系统之间的引用策略，时间分析揭示响应延迟动态的差异。这些分析展示了单平台或剥离功能语料库无法解决的研究问题。该数据集已公开可用。

英文摘要

By evaluating Large Language Models (LLMs) through uniform, text-only interfaces, current academic benchmarks obscure how the unique designs and affordances of distinct commercial platforms shape real-world user behavior and system performance. To bridge this gap, we present ShareChat, the first large-scale corpus of 142,808 conversations (660,293 turns) collected from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat preserves native platform affordances, including citations, thinking traces, and code artifacts, across 95 languages and the period from April 2023 to October 2025, complementing existing corpora that homogenize these interactions. To demonstrate the dataset's evaluative utility, we present three case studies: a conversation completeness analysis assessing cross-platform differences in intent satisfaction, a source grounding analysis comparing citation strategies between search-augmented systems, and a temporal analysis revealing divergent response latency dynamics. Together, these analyses demonstrate research questions that are inaccessible to single-platform or stripped-affordance corpora. The dataset is publicly available.

2511.19078 2026-05-19 cs.CL cs.AI 版本更新

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind: 一种基于动态GNN的定理选择与结论生成框架用于LLM推理

Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin

AI总结 GraphMind通过动态图神经网络与LLM结合，实现多步推理中的定理选择和结论生成，提升上下文感知的推理能力。

Comments This paper has been withdrawn by the authors in order to prepare a substantially revised version

详情

AI中文摘要

大型语言模型（LLMs）在自然语言理解和生成方面表现出色，包括多步推理如数学证明。然而，现有方法缺乏显式且动态的机制来结构化表示和演变中间推理状态，限制了其在上下文感知定理选择和迭代结论生成方面的能力。为此，我们提出了GraphMind，一种新颖的动态图基框架，将图神经网络（GNN）与LLMs结合，以迭代方式选择定理并生成中间结论。我们的方法将推理过程建模为异构演进图，其中节点代表条件、定理和结论，边捕捉节点间的逻辑依赖。通过编码当前推理状态并利用语义匹配进行定理选择，我们的框架在闭环模式下实现了上下文感知、可解释和结构化的推理。在各种问答（QA）数据集上的实验表明，所提出的GraphMind方法在多步推理中实现了稳定性能提升，并显著优于现有基线方法，验证了我们方法的有效性和通用性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

2510.23641 2026-05-19 cs.LG cs.AI hep-ex physics.ins-det 版本更新

Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging

用于粒子喷注标记的空间感知线性Transformer（SAL-T）

Aaron Wang, Zihan Zhao, Subash Katel, Vivekanand Gyanchand Sahu, Elham E Khoda, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte

发表机构 * University of Illinois Chicago（伊利诺伊大学芝加哥分校）； University of California San Diego（加州大学圣地亚哥分校）； Fermi National Accelerator Laboratory（费米国家加速器实验室）

AI总结 SAL-T通过空间感知分区和卷积层提升喷注分类性能，在资源消耗和延迟方面优于标准linformer。

详情

AI中文摘要

Transformer能非常有效地捕捉高能粒子碰撞中的全局和局部相关性，但在CERN LHC等高数据吞吐量环境中部署面临挑战。Transformer模型的二次复杂度需要大量资源，并会增加推理延迟。为解决这些问题，我们提出空间感知线性Transformer（SAL-T），它是对linformer架构的一种受物理启发的增强，并保持线性注意力。我们的方法基于运动学特征对粒子进行空间感知分区，从而在具有物理意义的区域之间计算注意力。此外，受喷注物理的启发，我们采用卷积层来捕捉局部相关性。SAL-T不仅在喷注分类任务中优于标准linformer，其分类结果还与全注意力Transformer相当，同时推理时所用资源显著更少、延迟更低。在通用点云分类数据集（ModelNet10）上的实验进一步证实了这一趋势。我们的代码见 https://github.com/aaronw5/SAL-T4HEP 。

英文摘要

Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at https://github.com/aaronw5/SAL-T4HEP.

2510.18941 2026-05-19 cs.CL cs.AI cs.LG 版本更新

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench：需要专业知识回答和评判的多领域评分标准

Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

发表机构 * NVIDIA

AI总结 ProfBench通过7000多个由专业领域专家评估的响应-评分对，评估大语言模型在处理专业文档、信息整合和生成综合报告方面的能力，揭示了即使顶级模型在专业任务上也面临挑战。

Comments Published at ICLR 2026, 30 pages

详情

AI中文摘要

评估大语言模型（LLMs）的进步通常受限于验证响应的挑战，限制了评估任务仅限于数学、编程和简短问答。然而，许多现实应用需要评估LLMs在处理专业文档、整合信息和生成综合报告方面的能力。我们介绍了ProfBench：一个包含超过7000个响应-评分对的集合，由具有物理学博士、化学博士、金融MBA和咨询MBA专业知识的人类专家评估。我们构建了稳健且经济的LLM-Judges来评估ProfBench评分标准，通过减轻自我增强偏差并减少评估成本2-3个数量级，使其公平且对更广泛社区可及。我们的发现表明，即使对于最先进的LLM，ProfBench也提出了重大挑战，顶级模型如GPT-5-high仅达到65.9%的整体性能。此外，我们识别了专有模型与开源模型之间显著的性能差异，并提供了关于扩展思考在解决复杂专业领域任务中的作用的见解。数据：https://huggingface.co/datasets/nvidia/ProfBench 和代码：https://github.com/NVlabs/ProfBench 和排行榜：https://huggingface.co/spaces/nvidia/ProfBench

英文摘要

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench

2510.16416 2026-05-19 cs.CV cs.AI 版本更新

ProtoSiTex: 为多标签文本分类学习半可解释的原型

Utsav Kumar Nareti, Suraj Kumar, Soumya Pandey, Soumi Chattopadhyay, Chandranath Adak, Sankha Subhra Mullick

发表机构 * Dept. of CSE, IIT Patna（印度理工学院巴特那分校计算机科学与工程系）； Dept. of CSE, IIT Indore（印度理工学院印多尔分校计算机科学与工程系）； Dolby Laboratories（杜比实验室）

AI总结 ProtoSiTex提出一种半可解释框架，通过双阶段交替训练策略和分层损失函数，实现细粒度多标签文本分类，提升可解释性和对齐性。

详情

AI中文摘要

TusoAI: 科学方法的代理优化

Alistair Turcan, Kexin Huang, Lei Li, Martin Jinye Zhang

发表机构 * Carnegie Mellon University（卡内基梅隆大学）； Stanford University（斯坦福大学）； Phylo

AI总结 TusoAI通过整合领域知识和迭代优化，提升科学任务中计算方法的性能，优于现有专家方法和科学AI代理，在单细胞RNA测序数据去噪和卫星地球监测等任务中表现突出。

详情

AI中文摘要

科学发现常因手动开发分析复杂实验数据的计算工具而受阻。构建此类工具成本高且耗时，因为科学家需反复查阅文献、测试模型假设并将其转化为高效软件。大型语言模型（LLMs）在合成文献、处理实证数据和生成领域特定代码方面表现出色，为加速计算方法开发提供了新机遇。现有LLM系统或专注于使用现有计算方法进行科学分析，或专注于为通用机器学习开发计算方法，但未能有效整合科学领域中常无结构化的知识。本文介绍TusoAI，一种代理AI系统，通过科学任务描述和评估函数，自主开发和优化计算方法。TusoAI将领域知识整合到知识树表示中，进行迭代的领域特定优化和模型诊断，提高候选解决方案池的性能。我们进行了全面的基准评估，证明TusoAI在单细胞RNA测序数据去噪和卫星地球监测等多样化任务中优于最先进的专家方法、MLE代理和科学AI代理。将TusoAI应用于遗传学两个关键开放问题，改进了现有计算方法并发现了新生物学，包括9种新的自身免疫疾病与T细胞亚型之间的关联以及7种此前未报告的疾病变异与目标基因的关联。我们的代码在https://github.com/Alistair-Turcan/TusoAI上公开可用。

英文摘要

Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI, an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks, such as single-cell RNA-seq data denoising and satellite-based earth monitoring. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered novel biology, including 9 new associations between autoimmune diseases and T cell subtypes and 7 previously unreported links between disease variants and their target genes. Our code is publicly available at https://github.com/Alistair-Turcan/TusoAI.

2509.21319 2026-05-19 cs.CL cs.AI cs.LG 版本更新

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF：以二元灵活反馈连接人类反馈与可验证奖励

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

发表机构 * NVIDIA

AI总结 RLBFF结合人类偏好与规则验证，提升奖励模型对响应质量的精准捕捉，优于Bradley-Terry模型，在RM-Bench和JudgeBench上取得优异成绩，且支持用户自定义反馈原则。

Comments Published at ICLR 2026, 21 pages

详情

AI中文摘要

基于人类反馈的强化学习（RLHF）和基于可验证奖励的强化学习（RLVR）是LLM后训练中使用的主要RL范式，各有优势。然而，RLHF依赖通常缺乏明确标准的人类判断，因而在可解释性和奖励黑客（reward hacking）问题上存在困难；而RLVR侧重于基于正确性的验证器，适用范围受限。我们提出基于二元灵活反馈的强化学习（RLBFF），它将人类偏好的灵活性与基于规则的验证的精确性相结合，使奖励模型能够捕捉响应质量中超越单纯正确性的细微方面。RLBFF从自然语言反馈中提取可以用二元方式回答的原则（例如，信息准确性：是；代码可读性：否）。这些原则随后可用于将奖励模型训练构建为蕴含任务（响应满足或不满足任意一条原则）。我们表明，在数据量相当的情况下，以这种方式训练的奖励模型可以优于Bradley-Terry模型，并在RM-Bench（86.2%）和JudgeBench（81.4%，截至2025年9月24日排行榜第一）上取得最佳成绩。此外，与Bradley-Terry模型不同，用户可以在推理时指定感兴趣的原则，以定制奖励模型的关注点。最后，我们给出了一套完全开源的方案（包括数据），使用RLBFF和我们的奖励模型对Qwen3-32B进行对齐，在MT-Bench、WildBench和Arena Hard v2等通用对齐基准上达到或超过o3-mini和DeepSeek R1的性能（推理成本不到其5%）。模型：https://huggingface.co/collections/nvidia/reward-models-10-2025

英文摘要

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025
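
To make the idea of binary principles and inference-time customization more concrete, here is a minimal Python sketch. It is not the RLBFF reward model: the function `principle_reward`, the dictionary layout, and the choice to average the yes/no judgments into one scalar are all assumptions made for illustration. In the abstract, the per-principle judgment comes from a trained reward model framed as an entailment task; here the judgments are simply given.

```python
def principle_reward(judgments, principles_of_interest=None):
    """Average binary principle judgments into a scalar in [0, 1].

    judgments: dict mapping a principle name to True (satisfied) or False.
    principles_of_interest: optional list restricting which principles count,
        mimicking a user who customizes the focus at inference time.
    Averaging is a hypothetical aggregation rule, not taken from the paper.
    """
    if principles_of_interest is None:
        selected = list(judgments.values())
    else:
        selected = [judgments[p] for p in principles_of_interest if p in judgments]
    if not selected:
        return 0.0  # nothing to score against
    return sum(1.0 for ok in selected if ok) / len(selected)


# The two example principles quoted in the abstract.
judgments = {"accuracy of information": True, "code readability": False}
overall = principle_reward(judgments)
focused = principle_reward(judgments, ["accuracy of information"])
```

With these inputs, `overall` averages over both principles while `focused` only considers the single principle the user selected, which is the kind of customization a single Bradley-Terry score does not expose.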

2509.19590 2026-05-19 cs.AI cs.CY cs.LG 版本更新

Position: AI Evaluations Should be Grounded on a Theory of Capability

立场：AI评估应以能力理论为基础

Nathanael Jo, Ashia Wilson

发表机构 * MIT EECS, Cambridge, USA（麻省理工学院电子工程与计算机科学系，剑桥，美国）

AI总结本文提出AI评估应基于明确的能力理论，通过实验证明评估结果受建模假设影响显著，提出Evaluation Card促进透明化评估实践。

Comments ICML 2026 Position Paper Track

详情

AI中文摘要

对生成模型的评估如今无处不在，其结果深刻影响着公众和科学界对AI能力的预期。然而，对其可靠性的质疑持续增加。我们如何知道报告的准确率真实反映了模型的底层性能？尽管基准结果常被呈现为对能力的直接测量，但实际上它们是推断：把分数当作能力的证据，就已经预设了一套关于何谓“有能力完成某项任务”的理论。我们主张，AI评估应被表述为建立在明确能力理论之上的推断任务。这一观点在心理测量学等领域是标准做法，但在AI评估中仍不成熟，核心假设常常未被言明。作为概念验证，我们通过实验表明，报告的性能可能强烈依赖于评估者的建模假设，凸显了透明、理论驱动的评估实践的必要性。最后，我们提出Evaluation Card，帮助研究人员记录、论证并审视AI评估背后的建模决策。

英文摘要

Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.

2509.13270 2026-05-19 cs.CV cs.AI 版本更新

RadGame: An AI-Powered Platform for Radiology Education

RadGame：一种基于人工智能的放射学教育平台

Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad, Ali Alburkani, Sung Eun Kim, Kent Kleinschmidt, Abdulrahman O. Alhumaydhi, Mohannad Mohammed G. Alghamdi, Jeremy Francis Palacio, Mohammed Bukhaytan, Noah Michael Prudlo, Rithvik Akula, Brady Chrisler, Benjamin Galligos, Mohammed O. Almutairi, Mazeen Mohammed Alanazi, Nasser M. Alrashdi, Joel Jihwan Hwang, Sri Sai Dinesh Jaliparthi, Luke David Nelson, Nathaniel Nguyen, Sathvik Suryadevara, Steven Kim, Mohammed F. Mohammed, Yevgeniy R. Semenov, Kun-Hsing Yu, Abdulrhman Aljouie, Hassan AlOmaish, Adam Rodman, Pranav Rajpurkar

发表机构 * Harvard Medical School（哈佛医学院）； Mass General Brigham（麻省总医院布莱根）； Maastricht University（马斯特里赫特大学）； Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia（沙特阿拉伯利雅得，国民警卫队部阿卜杜勒阿齐兹国王医疗城医学影像科）； National Strategic Technology Research Institute, Seoul National University Hospital（首尔国立大学医院国家战略技术研究所）； Saint Louis University School of Medicine（圣路易斯大学医学院）； College of Medicine, King Saud bin Abdulaziz University for Health Sciences（沙特国王本·阿卜杜勒阿齐兹健康科学大学医学院）； Tufts University School of Medicine（塔夫茨大学医学院）； Department of Biomedical Informatics, Harvard Medical School（哈佛医学院生物医学信息学系）

AI总结 RadGame通过结合游戏化与大规模公开数据集，提供AI驱动的反馈，提升放射学教育中的定位和报告撰写能力，显著提高学习效果。

Comments ML4H Version

详情

AI中文摘要

我们介绍了RadGame，一个由人工智能驱动的游戏化放射学教育平台，针对两项核心技能：定位影像发现和生成报告。传统放射学培训依赖于被动接触病例，或在指导放射科医生实时反馈下的主动练习，限制了获得即时、可扩展反馈的机会。RadGame将游戏化与大规模公开数据集以及自动化的AI反馈相结合，为学习者提供清晰、结构化的指导，从而弥补这一不足。在RadGame Localize中，玩家围绕异常绘制边界框，系统自动将其与公开数据集中放射科医生绘制的标注进行比较，并由视觉语言模型针对用户遗漏的发现生成可视化解释。在RadGame Report中，玩家根据胸部X光片、患者年龄和检查指征撰写影像发现，并收到基于放射学报告生成指标的结构化AI反馈；该反馈对照公开数据集中放射科医生撰写的真实报告，指出错误和遗漏，并给出最终的表现与风格评分。在一项前瞻性评估中，在看过相同病例后，使用RadGame的参与者定位准确性提高了68%，而传统被动方法为17%；报告撰写准确性提高了31%，而传统方法为4%。RadGame展示了AI驱动的游戏化在提供可扩展、反馈丰富的放射学培训方面的潜力，并为医疗AI资源在教育中的应用提供了新的思路。

英文摘要

We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for findings the user missed. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist's written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.

2509.04471 2026-05-19 cs.CL cs.AI 版本更新

MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

MOSAIC：一种多语言、无类别依赖且计算高效的放射报告分类方法

Alice Schiavone, Marco Fraccaro, Lea Marie Pehrson, Silvia Ingala, Rasmus Bonnevie, Michael Bachmann Nielsen, Vincent Beliveau, Melanie Ganz, Desmond Elliott

发表机构 * Department of Computer Science, University of Copenhagen（哥本哈根大学计算机科学系）； Neurobiology Research Unit, Copenhagen University Hospital（哥本哈根大学医院神经生物学研究单位）； Unumed Aps（Unumed公司）； Department of Diagnostic Radiology, Copenhagen University Hospital（哥本哈根大学医院诊断放射学系）； Department of Clinical Medicine, University of Copenhagen（哥本哈根大学临床医学系）； Cerebriu A/S（Cerebriu公司）； Institute for Human Genetics, Medical University of Innsbruck（因斯布鲁克医学大学人类遗传学研究所）

AI总结 MOSAIC通过紧凑开放模型实现多语言、无类别依赖的放射报告分类，无需大量标注数据，且在多种影像模态和标签体系上表现优异，达到专家水平性能。

Comments 8 pages, 14 pages including references and appendix. 9 figures. Preprint

Journal ref Proceedings of the ClinicalNLP Workshop at LREC 2026

详情

AI中文摘要

放射学报告包含丰富的临床信息，可用于训练影像模型而无需依赖昂贵的手动标注。然而，现有方法面临关键限制：基于规则的方法难以处理语言多样性，监督模型需要大量标注数据集，而近期基于LLM的方法依赖封闭源或资源密集型模型，不适合临床使用。此外，当前解决方案大多局限于英语和单模态、单类别数据集。我们介绍了MOSAIC，一种多语言、无类别依赖且计算高效的放射报告分类方法。基于紧凑的开放访问语言模型（MedGemma-4B），MOSAIC支持零/少样本提示和轻量级微调，可在消费级GPU上部署。我们在英语、西班牙语、法语和丹麦语的七个数据集上评估MOSAIC，涵盖多种影像模态和标签体系。该模型在五个胸部X光数据集上达到平均宏F1分数88，接近或超过专家水平性能，同时仅需24GB GPU内存。通过数据增强，仅需80个标注样本即可在丹麦报告上达到加权F1分数82，相比完整1600样本训练集的86分。MOSAIC为临床环境中大型或专有LLM提供了实用替代方案。代码和模型是开源的。我们邀请社区在新语言、类别和模态上评估和扩展MOSAIC。

英文摘要

Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.
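
Because the abstract quotes both a macro F1 and a weighted F1, a small Python sketch of how the two averages differ may help. It uses only the standard definitions of these metrics; the toy labels `normal` and `effusion` and the single-label setup are assumptions for illustration and are not taken from the MOSAIC datasets. The functions return values on a 0 to 1 scale, whereas the abstract reports them on a 0 to 100 scale.

```python
from collections import Counter


def per_class_f1(y_true, y_pred, label):
    """F1 for one label: 2*TP / (2*TP + FP + FN), or 0.0 if undefined."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the labels present in y_true."""
    labels = sorted(set(y_true))
    return sum(per_class_f1(y_true, y_pred, l) for l in labels) / len(labels)


def weighted_f1(y_true, y_pred):
    """Mean of per-class F1 weighted by each class's support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    return sum(per_class_f1(y_true, y_pred, l) * n for l, n in support.items()) / total


# Toy, imbalanced example: 8 "normal" reports and 2 "effusion" reports,
# with one "effusion" report misclassified as "normal".
y_true = ["normal"] * 8 + ["effusion"] * 2
y_pred = ["normal"] * 8 + ["normal", "effusion"]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this toy case the rare class drags the macro average down more than the weighted average, which is why the two numbers in the abstract are not directly comparable.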

2509.03403 2026-05-19 cs.LG cs.AI 版本更新

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

超越正确性：通过RL训练协调过程奖励与结果奖励

Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal

发表机构 * Amazon（亚马逊公司）； University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）

AI总结本文提出PROF方法，通过过程一致性过滤提升推理质量和最终答案准确性，减少对强PRM的依赖。

详情

AI中文摘要

可验证奖励的强化学习（RLVR）提升了推理任务的最终答案准确性，但未能可靠提升推理质量。由于结果奖励仅评估最终答案，它也会奖励虚假成功：错误推理仍可能因偶然得到正确结果而获得最大奖励。这种结果奖励黑客行为会创建有偏的梯度，使当前RLVR不足以学习忠实的推理。过程奖励模型（PRMs）提供逐步监督，但直接优化PRMs或简单地将它们与结果奖励结合在RL训练过程中分布偏移时不稳定。我们引入了过程一致性过滤（PROF），一种数据整理方法，利用PRM-ORM一致性进行样本选择，而不是直接奖励优化。PROF保留具有强过程支持的正确响应和具有弱过程支持的错误响应，同时保持训练比例的平衡。实验表明，PROF在强基线之上一致地提高了最终答案准确性和中间推理质量，对强PRMs的依赖较少。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning tasks, but it does not reliably improve reasoning quality. Because outcome rewards only assess final answers, they also reward spurious successes: flawed reasoning can still receive maximal reward when it accidentally reaches the correct outcome. This outcome reward hacking creates biased gradients, making current RLVR insufficient for learning faithful reasoning. Process Reward Models (PRMs) provide step-wise supervision, but directly optimizing PRMs or naively combining them with outcome rewards is unstable under distribution shift during the RL training process. We introduce PRocess cOnsistency Filter (PROF), a data curation method that uses PRM--ORM consistency for sample selection rather than direct reward optimization. PROF keeps correct responses with strong process support and incorrect responses with weak process support while maintaining a balanced training ratio. Experiments show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs.
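
The selection rule described in the abstract (keep correct responses with strong process support and incorrect responses with weak process support, in balanced numbers) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the `Rollout` class, the single aggregate `prm_score` per response, the `keep_per_side` parameter, and the rank-and-truncate balancing rule are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    text: str
    correct: bool      # outcome verdict (final answer matches the reference)
    prm_score: float   # hypothetical aggregate process-reward score in [0, 1]


def consistency_filter(rollouts, keep_per_side):
    """Keep rollouts whose process score agrees with their outcome.

    Correct rollouts are ranked by descending process score and incorrect
    ones by ascending process score; the same number is kept from each side
    so the retained batch stays balanced (an assumed balancing rule).
    """
    correct = sorted((r for r in rollouts if r.correct),
                     key=lambda r: r.prm_score, reverse=True)
    incorrect = sorted((r for r in rollouts if not r.correct),
                       key=lambda r: r.prm_score)
    k = min(keep_per_side, len(correct), len(incorrect))
    return correct[:k] + incorrect[:k]


rollouts = [
    Rollout("sound steps, right answer", True, 0.9),
    Rollout("flawed steps, lucky answer", True, 0.2),
    Rollout("plausible steps, wrong answer", False, 0.8),
    Rollout("flawed steps, wrong answer", False, 0.1),
]
kept = consistency_filter(rollouts, keep_per_side=1)
```

Here the "lucky" correct rollout and the plausible-looking wrong rollout are dropped, so the outcome reward is only applied to samples where the process signal and the outcome signal agree. Note that this simplified version returns an empty list when a group contains only correct or only incorrect rollouts.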

2508.16438 2026-05-19 cs.IR cs.AI 版本更新

OPERA: A Reinforcement Learning--Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

OPERA: 一种强化学习增强的、面向推理型多跳检索的编排式规划器-执行器架构

Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, Weizhuo Chen, Jianjun Li, Zhiyuan Ma

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences（中国科学院信息工程研究所）； School of Cyber Security, University of Chinese Academy of Sciences（中国科学院大学网络安全学院）； School of Computer Science and Technology, Huazhong University of Science and Technology（华中科技大学计算机科学与技术学院）； Department of Electronic Engineering, Tsinghua University（清华大学电子工程系）

AI总结 OPERA通过协调规划-执行架构解决多跳检索中推理规划、检索和过滤的不足，采用MAPGRPO方法提升复杂任务性能。

Comments Accepted by AAAI 2026. Extended version

详情

AI中文摘要

近期大规模语言模型和密集检索器的进步推动了检索增强生成（RAG）的发展。然而，现有方法在复杂推理导向的多跳检索任务中面临三大挑战：1）无效的推理导向规划：现有方法难以生成稳健的多步骤计划，规则基分解器在非模板问题上表现不佳。2）次优的推理驱动检索：相关方法采用有限的查询改写，导致迭代检索循环难以定位黄金文档。3）不足的推理引导过滤：现有方法缺乏细粒度推理来有效过滤噪声结果中的显著信息，阻碍了检索知识的利用。根本上，这些限制都源于当前RAG架构中检索与推理耦合薄弱。我们引入协调规划-执行推理架构（OPERA），一种新的推理驱动检索框架。OPERA的目标规划模块（GPM）将问题分解为子目标，由具有专用组件的推理-执行模块（REM）执行，以实现精确推理和有效检索。为训练OPERA，我们提出多智能体渐进组相对策略优化（MAPGRPO），一种GRPO的新变体。在复杂多跳基准测试中，OPERA的优越性能验证了MAPGRPO方法和OPERA设计的有效性。

英文摘要

Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.
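
Since MAPGRPO is described as a variant of GRPO, a short Python sketch of the group-relative advantage used in standard GRPO may be useful background. It does not reproduce MAPGRPO's multi-agent or progressive components, which the abstract does not detail; the function name, the `eps` constant, and the use of the population standard deviation are assumptions made here.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each rollout's reward against its own sampling group.

    `rewards` holds one scalar reward per rollout sampled for the same query.
    `eps` (an assumption here) avoids division by zero when every rollout
    in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation of the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one multi-hop question: two succeed, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Successful rollouts receive a positive advantage and failed ones a negative advantage of equal magnitude, so no separate value network is needed to provide a baseline.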

2508.08501 2026-05-19 cs.AI 版本更新

GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games

GVGAI-LLM：通过无限游戏评估大语言模型代理

Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius

发表机构 * New York University（纽约大学）； University of the Witwatersrand（金山大学）； Meta ； Lingnan University（岭南大学）

AI总结 GVGAI-LLM通过 arcade 式游戏测试大语言模型的推理与问题解决能力，定义了可解释的评估指标，揭示了模型在空间推理和基本规划中的局限性。

AI中文摘要

我们介绍了 GVGAI-LLM，一个视频游戏基准，用于评估大语言模型（LLMs）的推理和问题解决能力。该基准基于 General Video Game AI 框架，包含多样化的 arcade 式游戏，用于测试模型处理与现有 LLM 基准不同的任务能力。该基准利用视频游戏描述语言，可快速创建新游戏（包括规则和关卡），以防止过拟合。每个游戏场景由紧凑的 ASCII 字符集表示，允许语言模型高效处理。GVGAI-LLM 定义了可解释的指标，包括有意义的步比、步效率和总分，以评估模型行为。通过在 118 个具有不同挑战和技能深度的游戏上进行零样本评估，我们揭示了 LLMs 在空间推理和基本规划中的持续局限性。当前模型在空间和逻辑上持续出现错误，推动了结构化提示和空间接地技术的发展。尽管这些干预措施带来了部分改进，但该基准仍远未解决。GVGAI-LLM 为推进语言模型能力研究提供了可重复的测试平台，尤其强调代理行为和空间推理。此外，其生成无限基准的能力（手动和程序化）提供了一种可扩展的长期评估框架。

英文摘要

We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a video game description language that enables the rapid creation of new games (including rules and levels), helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across 118 games with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. Although these interventions lead to partial improvements, the benchmark remains very far from being solved. GVGAI-LLM serves as a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and spatial reasoning. Furthermore, its ability to generate infinite benchmarks, both manually and procedurally, provides a scalable framework for longitudinal evaluation.

2508.06799 2026-05-19 cs.ET cs.AI 版本更新

LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

LSDTs: 基于大语言模型的语义数字孪生用于自适应知识密集型基础设施规划

Naiyi Li, Zihui Ma, Runlong Yu, Lingyao Li

发表机构 * Department of Civil & Environmental Engineering, University of Maryland, College Park（马里兰大学帕克分校土木与环境工程系）； Center for Urban Science and Progress, New York University（纽约大学城市科学与进步中心）； Department of Computer Science, University of Alabama（阿拉巴马大学计算机科学系）； School of Information, University of South Florida（南佛罗里达大学信息学院）

AI总结本文提出LSDTs框架，利用大语言模型从非结构化文档中提取规划知识并构建形式本体，通过语义层提升数字孪生在复杂规划场景中的适应性与仿真精度。

AI中文摘要

数字孪生（DTs）为管理复杂基础设施系统提供了强大工具，但其效果常受限于整合非结构化知识的挑战。近年来，大语言模型（LLMs）的进步为弥补这一差距提供了新潜力，它们具备提取和组织多样化文本信息的能力。因此，我们提出了LSDTs（LLM增强的语义数字孪生），这是一种帮助LLMs从环境法规和技术指南等非结构化文档中提取规划知识，并将其组织成形式本体的框架。该本体形成一个语义层，为数字孪生（即物理系统的虚拟模型）提供支持，使其能够模拟真实且符合法规要求的规划场景。我们通过马里兰海上风电场规划的案例研究评估LSDTs，包括飓风桑迪期间的应用。结果表明，LSDTs支持可解释、符合法规要求的布局优化，实现高保真的仿真，并增强基础设施规划的适应性。这项工作展示了将生成式AI与数字孪生结合以支持复杂、知识驱动规划任务的潜力。

英文摘要

Digital Twins (DTs) offer powerful tools for managing complex infrastructure systems, but their effectiveness is often limited by challenges in integrating unstructured knowledge. Recent advances in Large Language Models (LLMs) bring new potential to address this gap, with strong abilities in extracting and organizing diverse textual information. We therefore propose LSDTs (LLM-Augmented Semantic Digital Twins), a framework that helps LLMs extract planning knowledge from unstructured documents like environmental regulations and technical guidelines, and organize it into a formal ontology. This ontology forms a semantic layer that powers a digital twin (a virtual model of the physical system), allowing it to simulate realistic, regulation-aware planning scenarios. We evaluate LSDTs through a case study of offshore wind farm planning in Maryland, including its application during Hurricane Sandy. Results demonstrate that LSDTs support interpretable, regulation-aware layout optimization, enable high-fidelity simulation, and enhance adaptability in infrastructure planning. This work shows the potential of combining generative AI with digital twins to support complex, knowledge-driven planning tasks.

2508.04149 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

基于难度的偏好数据选择：通过DPO隐式奖励差距

Xuan Qi, Rongwu Xu, Zhijing Jin

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington（华盛顿大学保罗·G·艾伦计算机科学与工程学院）； Max Planck Institute for Intelligent Systems, Tübingen, Germany（德国图宾根马克斯·普朗克智能系统研究所）； Jinesis Lab, University of Toronto & Vector Institute（多伦多大学Jinesis实验室及向量研究所）

AI总结本文提出基于难度的偏好数据选择方法，利用DPO隐式奖励机制选择奖励差距小的样本，提升数据效率和模型对齐性能，在多个数据集和对齐任务中优于五个基线方法。

Comments Our code and data are available at https://github.com/Difficulty-Based-Preference-Data-Select/Difficulty-Based-Preference-Data-Select

AI中文摘要

对齐大语言模型（LLMs）与人类偏好是AI研究中的关键挑战。尽管基于人类反馈的强化学习（RLHF）和直接偏好优化（DPO）等方法被广泛使用，但它们通常依赖于大规模、成本高的偏好数据集。现有工作缺少专门针对偏好数据的高质量数据选择方法。在本文中，我们引入了一种基于难度的偏好数据选择策略，该策略基于DPO隐式奖励机制。我们选择DPO隐式奖励差距较小的偏好数据示例（这些示例代表更具挑战性的案例），从而提高数据效率和模型对齐。我们的方法在多个数据集和对齐任务中一致优于五个强大的基线方法，仅使用原始数据的10%即可实现优越性能。这种有原则且高效的选择方法为在有限资源下扩展LLM对齐提供了有前景的解决方案。

英文摘要

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. However, existing work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.
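
The selection rule in this abstract can be sketched as follows. This is a minimal illustration, not the authors' released code. It uses the standard DPO implicit reward, beta * (log pi(y|x) - log pi_ref(y|x)), and takes the gap as the reward of the chosen response minus that of the rejected one. The class name, function names, the beta value, and the choice to rank by the signed gap are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class PreferencePair:
    # Sequence log-probabilities (summed over tokens) of the chosen and
    # rejected responses under the policy and the frozen reference model.
    policy_logp_chosen: float
    policy_logp_rejected: float
    ref_logp_chosen: float
    ref_logp_rejected: float


def implicit_reward_gap(pair, beta=0.1):
    """Return r(x, chosen) - r(x, rejected) for the DPO implicit reward
    r(x, y) = beta * (log pi(y|x) - log pi_ref(y|x))."""
    r_chosen = beta * (pair.policy_logp_chosen - pair.ref_logp_chosen)
    r_rejected = beta * (pair.policy_logp_rejected - pair.ref_logp_rejected)
    return r_chosen - r_rejected


def select_hardest(pairs, fraction=0.1, beta=0.1):
    """Keep the given fraction of pairs with the smallest gap.

    Ranking by the signed gap is an assumption of this sketch; the
    abstract only says that smaller gaps mark harder examples.
    """
    k = max(1, int(len(pairs) * fraction))
    return sorted(pairs, key=lambda p: implicit_reward_gap(p, beta))[:k]
```

With `fraction=0.1` the helper keeps one pair in ten, which mirrors the 10% data budget mentioned in the abstract.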

2508.03018 2026-05-19 cs.AI cs.RO 版本更新

Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

超越策略优化：一种数据整理飞轮用于稀疏奖励长周期规划

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti

发表机构 * Department of Mechanical Engineering, National University of Singapore（新加坡国立大学机械工程系）； Robotics Institute, Carnegie Mellon University（卡内基梅隆大学机器人研究所）； School of Computing, National University of Singapore（新加坡国立大学计算机科学学院）； Institute of Computing Technology, Chinese Academy of Sciences（中国科学院计算技术研究所）； Department of Computer Science and Technology, Tsinghua University（清华大学计算机科学与技术系）

AI总结本文提出BPO框架，通过自改进的数据飞轮开发鲁棒推理模型，解决多轮代理规划中稀疏奖励长周期问题，实现高效推理和显著的token效率。

AI中文摘要

大型语言推理模型在静态任务中表现出色，但在交互环境中多轮代理规划面临两大挑战：信用分配问题使传统强化学习在稀疏奖励设置中无效，以及详尽的逐步推理历史计算开销过大。为此，我们提出BPO框架，包含三个阶段（自举、外推和精炼），通过自改进的数据飞轮开发稳健的推理模型，以应对长周期稀疏奖励环境。框架首先利用规划四元组和长短期链式思考融合高效推理，然后通过复杂度分层课程学习扩展到分布外任务，最后通过奖励门控拒绝采样学习经历进行迭代精炼。在ALFWorld、ScienceWorld和WebShop上的实验表明，本方法达到了最先进水平，并具有显著的token效率，为代理规划中的推理模型提供了新的配方。

英文摘要

Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art performance with significant token efficiency, providing a new recipe for reasoning models in agentic planning.

2508.00712 2026-05-19 cs.LG cs.AI 版本更新

JSON-Bag: A generic game trajectory representation

JSON-Bag：一种通用的游戏轨迹表示方法

Dien Nguyen, Diego Perez-Liebana, Simon Lucas

发表机构 * GitHub

AI总结本文提出JSON-Bag模型，通过分词JSON描述并使用Jensen-Shannon距离衡量游戏轨迹，验证了其在六个桌面游戏中对玩家、参数和种子分类的有效性，优于基线方法并提升了准确性。

Comments 8 pages, 3 figures, 6 tables, published in IEEE Conference on Games 2025
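
The JSON-Bag idea summarised above (tokenise a JSON game trajectory into a bag of tokens, then compare bags with the Jensen-Shannon distance) can be sketched as below. The tokenisation scheme (flattened key paths joined with their values) and all function names are hypothetical choices for illustration. The summary does not specify the paper's actual tokeniser.

```python
import json
import math
from collections import Counter


def json_tokens(obj, prefix=""):
    """Flatten a parsed JSON value into 'path=value' tokens (assumed scheme)."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from json_tokens(value, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(obj, list):
        for item in obj:
            yield from json_tokens(item, prefix + "[]")
    else:
        yield f"{prefix}={json.dumps(obj)}"


def bag(trajectory_json):
    """Count tokens over a trajectory given as a JSON array of game states."""
    counts = Counter()
    for state in json.loads(trajectory_json):
        counts.update(json_tokens(state))
    return counts


def js_distance(bag_a, bag_b):
    """Jensen-Shannon distance (square root of the base-2 JS divergence)
    between two token bags; the result lies in [0, 1]."""
    total_a, total_b = sum(bag_a.values()), sum(bag_b.values())
    if total_a == 0 or total_b == 0:
        raise ValueError("both bags must contain at least one token")
    divergence = 0.0
    for token in set(bag_a) | set(bag_b):
        p = bag_a[token] / total_a  # Counter returns 0 for missing tokens
        q = bag_b[token] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            divergence += 0.5 * p * math.log2(p / m)
        if q > 0:
            divergence += 0.5 * q * math.log2(q / m)
    return math.sqrt(max(divergence, 0.0))
```

A nearest-prototype classifier over such distances would be one way to use the representation for the player, parameter, and seed classification tasks the summary mentions.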

2506.22901 2026-05-19 cs.LG cs.AI q-bio.BM q-bio.GN 版本更新

Missing-Modality-Aware Graph Neural Network for Cancer Classification

面向缺失模态的图神经网络用于癌症分类

Sina Tabakhi, Chen Chen, Haiping Lu

发表机构 * School of Computer Science, University of Sheffield（谢菲尔德大学计算机科学学院）

AI总结本文提出MAGNET模型，通过动态患者-模态多头注意力机制融合低维模态嵌入，以提升部分模态下的多模态预测性能，实验表明其在癌症分类任务中优于现有方法。

Comments 27 pages, 22 figures

AI中文摘要

在学习多模态生物数据时，缺失模态是一个关键挑战，其中某些患者的数据缺失一个或多个模态。现有方法要么排除缺失模态的患者，要么填补缺失模态，或直接使用部分模态进行预测。然而，这些方法大多依赖于不灵活的、患者无关的融合策略，且无法扩展到随着模态数量增加而指数级增长的缺失模态模式。为解决这些限制，我们提出MAGNET（Missing-modality-Aware Graph neural NETwork）以增强部分模态下的多模态预测，其特征是动态患者-模态多头注意力机制，根据贡献和缺失性融合低维模态嵌入。MAGNET融合的复杂性随着模态数量线性增加，同时适应缺失模式的变异性。为了生成预测，MAGNET进一步构建一个患者图，其中融合的多模态嵌入作为节点特征，连接性由模态缺失性决定，随后通过图神经网络进行处理。在三个公共多组学数据集上进行的实验表明，MAGNET在癌症分类任务中优于现有最先进的融合方法。数据和代码可在https://github.com/SinaTabakhi/MAGNET获取。

英文摘要

A key challenge in learning from multimodal biological data is missing modalities, where data from one or more modalities are absent for some patients. Existing approaches either exclude patients with missing modalities, impute missing modalities, or make predictions directly with partial modalities. However, most of these methods rely on inflexible, patient-agnostic fusion strategies and do not scale computationally to the combinatorial growth of missing-modality patterns as the number of modalities increases. To address these limitations, we propose MAGNET (Missing-modality-Aware Graph neural NETwork) to enhance multimodal prediction with partial modalities, featuring a dynamic patient-modality multi-head attention mechanism to fuse lower-dimensional modality embeddings based on their contribution and missingness. MAGNET fusion's complexity increases linearly with the number of modalities while adapting to missing-pattern variability. To generate predictions, MAGNET further constructs a patient graph with fused multimodal embeddings as node features and connectivity determined by the modality missingness, followed by a graph neural network. Experiments on three public multiomics datasets for cancer classification, with real-world missingness, show that MAGNET outperforms state-of-the-art fusion methods. The data and code are available at https://github.com/SinaTabakhi/MAGNET.

2506.12617 2026-05-19 cs.AI cs.HC 版本更新

Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking

评估大语言模型中的AI对齐：通过75个模型的人类基准测试分析价值优先级

Gabriel Rongyang Lau, Wei Yan Low, Seow Min Koh, Fiona Fui-Hoon Nah, Andree Hartanto

发表机构 * School of Social Sciences, Nanyang Technological University（南洋理工大学社会科学学院）； Interdisciplinary Graduate Programme, Nanyang Technological University（南洋理工大学跨学科研究生项目）； Faculty of Arts and Social Sciences, National University of Singapore（新加坡国立大学人文与社会科学学院）； School of Computing and Information Systems, Singapore Management University（新加坡管理大学计算与信息系统学院）； School of Social Sciences, Singapore Management University（新加坡管理大学社会科学学院）

AI总结本文通过分析75个大语言模型的输出，评估其价值优先级与人类判断的一致性，发现模型在价值优先级上存在差异，且模型大小、新旧和能力层级与价值一致性无直接关联。

AI中文摘要

大型语言模型（LLMs）在人类-人工智能交互研究和实践中被越来越多地使用，但现有的能力和安全基准揭示了这些系统所表达的价值优先级以及这些优先级如何与人类判断相一致的信息有限。在三个研究中，我们引入了一种基于输出的方法来评估AI对齐的一个方面，通过将LLM生成的文本视为行为数据，并将其表达的价值优先级结构与人类参考进行比较。研究1利用归纳性定性分析得出六个最优AI功能的主题，即性能、适应能力、社会公益、伦理与责任、关系整合和自主性。研究2显示，LLM输出在模型内部高度稳定，并在不同模型间趋于一致的价值优先级结构，表明价值配置文件具有可靠性和可比性。研究3通过使用一个捕捉优先级相对顺序和优先级差异校准的配置文件保真度指标，将75个当代LLMs与376名人类受访者进行基准测试。尽管大多数模型复现了人类的价值顺序，但一些模型系统性地夸大了优先级之间的差异，表明模型可能在传统基准测试中看似对齐，但仍可能与人类价值校准偏离。配置文件保真度在不同模型间变化显著，并不一致地随大小、新旧或能力层级而变化。LLM和人类都倾向于对自主性进行降级，这提出了关于日益自主的AI系统发展的重大问题。对于研究和应用使用，六个主题和基于配置文件的指标提供了一种可扩展的方法，用于在关键对齐与人类优先级的背景下审计LLM的价值配置文件。

英文摘要

Large language models (LLMs) are increasingly used in human-AI interaction research and practice, yet existing capability and safety benchmarks reveal little about the value priorities these systems express or how those priorities correspond to human judgements. Across three studies, we introduce an output-based approach to evaluating one facet of AI alignment by treating LLM-generated text as behavioural data and comparing expressed value-priority profiles with a human reference. Study 1 used inductive qualitative analysis to derive six themes of optimal AI functioning, namely Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, and Agency. Study 2 showed that LLM outputs were highly stable within models and converged on a common value-priority structure across models, indicating reliable and comparable value profiles. Study 3 benchmarked 75 contemporary LLMs against 376 human respondents using a profile-fidelity metric capturing both the relative ordering of priorities and the calibration of between-priority differences. Although most models reproduced the human ordering of values, some systematically exaggerated the differences between them, showing that models can appear aligned on conventional benchmarks while still diverging from human value calibration. Profile fidelity varied substantially across models and did not consistently scale with size, recency, or capability tier. Both LLMs and humans converged on a deprioritisation of Agency, raising important questions about the development of increasingly agentic AI systems. For research and applied use, the six themes and profile-based metric provide a scalable method for auditing LLM value profiles before deployment in contexts where alignment with human priorities is critical.

2506.12119 2026-05-19 cs.CL cs.AI 版本更新

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

专家混合模型可在严格相等资源下超越密集语言模型

Houyi Li, Ka Man Lo, Shijie Xuyang, Ziqi Wang, Wenzhen Zheng, Haocheng Zhang, Zhao Li, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

发表机构 * Fudan University（复旦大学）； StepFun ； University of Science and Technology of China（中国科学技术大学）； Zhejiang University（浙江大学）

AI总结本文研究在资源相等条件下MoE模型是否能超越密集模型，提出优化框架并验证了在最优激活率下MoE模型性能更优，且该区域在不同模型规模下一致，通过数据重用解决数据量增加的权衡问题。

Comments Published as a conference paper at ICLR 2026

AI中文摘要

专家混合（MoE）语言模型显著扩展了模型容量，并在不增加每token计算量的情况下实现了显著性能提升。然而，在严格相等的资源约束下，即总参数量、训练计算和数据预算完全相同的情况下，MoE能否超越密集架构？尽管其具有重要的实际价值和潜力，这一问题仍缺乏深入研究。本文提出了一种新的视角和方法论框架，系统研究这一问题。首先，我们全面调查了MoE的架构并实现了最优模型设计以最大化性能。基于此，我们发现，激活率处于最优区域内的MoE模型在相同总参数、训练计算和数据资源下能够超越对应的密集模型。更重要的是，这一最优区域在不同模型规模下保持一致。虽然性能提升需要以额外的数据量为代价，但我们表明可以通过重用数据解决这一问题。我们通过广泛的实验验证了我们的发现，训练了近200个20亿参数规模的语言模型和超过50个70亿参数规模的语言模型，累计处理了50万亿token。所有模型检查点均已公开。

英文摘要

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints, that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes performance. Based on this, we find that an MoE model with an activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter count, training compute, and data budget. More importantly, this optimal region remains consistent across different model sizes. Although an additional amount of data turns out to be the trade-off for the enhanced performance, we show that this can be resolved by reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All model checkpoints are publicly available.

2506.11925 2026-05-19 cs.AR cs.AI cs.CV cs.LG 版本更新

Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference

基于知识图谱嵌入和贝叶斯推断的车道变换预测架构的现实世界部署

M. Manzour, Catherine M. Elias, Omar M. Shehata, R. Izquierdo, M. A. Sotelo

发表机构 * Department of Computer Engineering, University of Alcalá（阿尔卡拉大学计算机工程系）； Department of Computer Science, German University in Cairo（开罗德国大学计算机科学系）； Department of Mechatronics, German University in Cairo（开罗德国大学机电系）

AI总结本文提出基于知识图谱嵌入和贝叶斯推断的车道变换预测系统，通过现实硬件验证，实现了算法与道路部署的结合，提前3-4秒预测目标车辆车道变换，确保安全。

Journal ref 2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES)

DOI: 10.1109/ICVES65691.2025.11376512

AI中文摘要

近年来，车道变换预测研究取得显著进展，但大多数研究局限于仿真或数据集结果，未能实现算法与道路部署的结合。本文通过现实硬件展示了基于知识图谱嵌入（KGEs）和贝叶斯推断的车道变换预测系统。该系统包含感知模块和预测模块：感知模块感知环境，提取数值特征并转换为语言类别，与预测模块通信；预测模块执行KGE和贝叶斯推断模型，预测目标车辆的行驶动作并转换为纵向制动动作。现实硬件实验验证表明，该预测系统能提前3-4秒预测目标车辆的车道变换，为自动驾驶车辆提供充足反应时间，确保车道变换安全。

英文摘要

Research on lane change prediction has gained a lot of momentum in the last couple of years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego-vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives input numerical features, converts them into linguistic categories, and communicates them to the prediction module; and (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle's maneuver and transforms the prediction into a longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle's lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.

2506.10959 2026-05-19 cs.LG cs.AI math.ST stat.TH 版本更新

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

在结构流形上理解上下文学习：连接注意力机制与核方法

Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

发表机构 * School of Mathematics, Georgia Institute of Technology（佐治亚理工学院数学学院）； Department of Mathematics, Purdue University（普渡大学数学系）

AI总结本文研究了在结构几何数据上上下文学习的理论，通过将注意力机制与核方法联系，揭示了transformers在流形上进行核预测的机制，并推导了泛化误差界。

AI中文摘要

尽管上下文学习（ICL）在自然语言和视觉领域取得了显著成功，但其在结构几何数据中的理论理解仍不明确。本文首次对ICL在流形上回归Hölder函数的理论进行了研究。我们建立了注意力机制与经典核方法之间的新联系，证明transformers通过与提示的交互在新查询上进行基于核的预测。这一联系通过数值实验得到验证，显示学习的查询-提示分数与高斯核高度相关。基于此见解，我们推导了泛化误差界，以提示长度和训练任务数量为变量。当观察到足够多的训练任务时，transformers在流形上实现Hölder函数的最小最大回归率，该速率与提示长度呈指数关系，指数取决于流形的内在维度，而非外蕴空间维度。我们的结果还描述了泛化误差随训练任务数量的变化，揭示了transformers作为上下文核算法学习器的复杂性。我们的发现为理解几何在ICL中的作用提供了基础见解，并为研究非线性模型的ICL提供了新工具。

英文摘要

While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding, particularly in the context of structured geometric data, remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through their interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with respect to the prompt length with the exponent depending on the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novel tools to study ICL of nonlinear models.
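
The link between attention and kernel methods mentioned in this abstract can be illustrated with a small sketch. It is not the paper's construction. It only shows one well-known fact: if the attention score between a query and a prompt point is the negative squared distance divided by 2h^2, then softmax attention over the prompt labels equals the Nadaraya-Watson estimator with a Gaussian kernel of bandwidth h. The function names and the bandwidth value are illustrative assumptions.

```python
import math


def squared_distance(x, q):
    return sum((a - b) ** 2 for a, b in zip(x, q))


def kernel_regression(prompt_x, prompt_y, query, bandwidth=0.5):
    """Nadaraya-Watson estimate at `query` with a Gaussian kernel."""
    weights = [
        math.exp(-squared_distance(x, query) / (2.0 * bandwidth ** 2))
        for x in prompt_x
    ]
    total = sum(weights)
    return sum(w * y for w, y in zip(weights, prompt_y)) / total


def softmax_attention(prompt_x, prompt_y, query, bandwidth=0.5):
    """Single-query softmax attention with scores -||x - q||^2 / (2 h^2)."""
    scores = [
        -squared_distance(x, query) / (2.0 * bandwidth ** 2) for x in prompt_x
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    norm = sum(exps)
    return sum(e / norm * y for e, y in zip(exps, prompt_y))
```

Both functions return the same prediction up to floating-point error, because the softmax normalisation is exactly the kernel-weight normalisation.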

2506.05442 2026-05-19 cs.CV cs.AI 版本更新

患者发声，AI倾听：基于大语言模型的在线评论分析揭示了紧急护理满意度的关键驱动因素

Xiaoran Xu, Zhaoqian Xue, Chi Zhang, Jhonatan Medri, Junjie Xiong, Jiayan Zhou, Jin Jin, Yongfeng Zhang, Siyuan Ma, Lingyao Li

发表机构 * Electrical Engineering department, University of South Florida（南佛罗里达大学电气工程系）； Department of Biostatistics, Epidemiology and Bioinformatics, University of Pennsylvania（宾夕法尼亚大学生物统计学、流行病学与生物信息学系）； Computer Science and Engineering department, University of South Florida（南佛罗里达大学计算机科学与工程系）； Mathematics & Statistics, University of South Florida（南佛罗里达大学数学与统计学系）； Department of Computer Science and Engineering, University of Missouri Science and Technology（密苏里科技大学计算机科学与工程系）； School of Medicine, Stanford University（斯坦福大学医学院）； Department of Computer Science, Rutgers University（罗格斯大学计算机科学系）； Department of Biostatistics, Vanderbilt University（范德比尔特大学生物统计学系）

AI总结本文利用大语言模型分析在线评论，揭示紧急护理满意度的关键因素，发现人际因素和运营效率是主要决定因素，其他因素在调整后无显著影响。

AI中文摘要

调查紧急护理设施的公众体验对促进社区医疗发展至关重要。传统调查方法由于范围、时间和空间覆盖有限而效果不佳。通过在线评论或社交媒体进行众包研究是一种有价值的途径。随着大语言模型（LLMs）的最新进展，从评论中提取细微感知已成为可能。本研究收集了Google Maps上DMV和佛罗里达地区的评论，并使用GPT模型进行提示工程，分析紧急护理的方面情感。我们首先分析了各种方面的地理空间模式，包括人际因素、运营效率、技术质量、财务和设施。接下来，我们确定了影响公众感知的CBG层面特征，包括人口密度、中位收入、基尼指数、租金与收入比率、家庭贫困率、无保险率和失业率。我们的结果表明，人际因素和运营效率是紧急护理患者满意度的最强决定因素，而技术质量、财务和设施在多变量模型中无显著独立影响。在社会经济和人口因素中，只有人口密度与患者评分有显著但微弱的相关性，其余因素无显著相关性。总体而言，本研究强调了众包研究揭示居民关注因素的潜力，并为利益相关者改进紧急护理公众满意度提供有价值的见解。

英文摘要

Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group (CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.

2503.19950 2026-05-19 cs.LG cs.AI cs.CL 版本更新

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

LogQuant: 一种基于对数分布的2位KV缓存量化技术，具有更优异的精度保持性能

Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen

发表机构 * 4Paradigm（第四范式）

AI总结 LogQuant通过基于对数的过滤机制实现KV缓存的2位量化，减少内存占用的同时保持高性能，实验表明其在吞吐量、批处理大小和准确性上均优于现有方法。

Comments Accepted by ICLR 2025 Workshop on Sparsity in LLMs (SLLM)

AI中文摘要

我们介绍了LogQuant，一种突破性的2位量化技术，用于大型语言模型（LLM）推理中的KV缓存，实现显著的内存节省同时保持优越的性能。先前的方法要么假设后续token更重要，要么基于早期注意力模式预测重要token，但两者都可能导致性能瓶颈或频繁的误预测。LogQuant采取了不同的方法。通过应用基于对数的过滤机制，它在整个上下文中选择性地压缩KV缓存，与现有方法相比，实现更好的性能，甚至减少内存占用。在基准测试中，它提高了25%的吞吐量，提升了60%的批处理大小，而无需增加内存消耗。对于Math和Code Completion等具有挑战性的任务，LogQuant在相同压缩比下将准确性提高了40%至200%，优于其他方法。LogQuant可以轻松集成到流行的推理框架中，如Python的transformers库。实现可在https://github.com/Concyclics/LogQuantKV上获得。

英文摘要

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques. LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. The implementation is available at https://github.com/Concyclics/LogQuantKV.

2503.13535 2026-05-19 cs.CY cs.AI 版本更新

Unlocking Learning Potentials: The Transformative Effect of Generative AI in Education Across Grade Levels

解锁学习潜力：生成式AI在不同年级教育中的变革影响

Meijuan Xie, Liling Luo

发表机构 * School of Mathematics and Statistics, Guangxi Normal University（广西师范大学数学与统计学学院）

AI总结本文通过混合调查方法探讨生成式AI对不同年级学生在六个关键领域的影响，发现其在适当使用方面影响最大，而在学习兴趣和自信心方面影响最小，且大学生在各领域表现优于高中生。

AI中文摘要

生成式人工智能（GAI）的出现使教育领域出现了显著增长。GAI在支持学习中的使用越来越普遍，但其使用方式和程度因人而异。关于学生对GAI的使用和感知的研究仍较为有限。为此，本文提出混合调查方法，研究GAI对四个不同年级学生在六个关键领域（LIPSAL）的影响。首先，通过问卷发现，GAI对适当使用的影响最大，而对学习兴趣和自信心的影响最低。其次，四个年级的比较显示，LIPSAL的高低因素表现出年级相关变化，大学生在各领域表现优于高中生。第三，通过访谈发现，学生对GAI的应用有全面理解，他们对GAI持积极态度，愿意使用，因此GAI的流行度迅速增长。他们还提到了使用GAI的前景和挑战。未来，随着GAI技术的成熟，其对学生的影响将更大。这些发现可能帮助更好地理解不同学生使用情况，并指导未来数字教育研究。

英文摘要

The advent of generative artificial intelligence (GAI) has brought about a notable surge in the field of education. The use of GAI to support learning is becoming increasingly prevalent among students. However, the manner and extent of its utilisation vary considerably from one individual to another, and research on students' utilisation and perceptions of GAI remains relatively scarce. To gain insight into the issue, this paper proposes a hybrid-survey method to examine the impact of GAI on students across four different grades in six key areas (LIPSAL): learning interest, independent learning, problem solving, self-confidence, appropriate use, and learning enjoyment. Firstly, through a questionnaire, we found that among LIPSAL, GAI has the greatest impact on the concept of appropriate use and the lowest impact on learning interest and self-confidence. Secondly, a comparison of four grades revealed that the high and low factors of LIPSAL exhibited grade-related variation, and college students exhibited a higher level than high school students across LIPSAL. Thirdly, through interviews, the students demonstrated a comprehensive understanding of the application of GAI. We found that students have a positive attitude towards GAI and are very willing to use it, which is why GAI has grown so rapidly in popularity. They also described the prospects and challenges of using GAI. In the future, as GAI matures technologically, it will have a greater impact on students. These findings may help better understand usage by different students and inform future research in digital education.

2503.13533 2026-05-19 cs.CY cs.AI 版本更新

The Status Quo and Future of AI-TPACK for Mathematics Teacher Education Students: A Case Study in Chinese Universities

人工智能与数学教师教育学生TPACK现状及未来：中国大学案例研究

Meijuan Xie, Liling Luo

发表机构 * School of Mathematics and Statistics, Guangxi Normal University（广西师范大学数学与统计学学院）

AI总结本文通过对中国七所大学412名数学教师教育学生进行系统AI-TPACK测评，发现其处于初级阶段，且学业等级不影响AI-TPACK能力发展，提出AI-TPACK-SEM模型揭示自我效能与教学信念对AI-TPACK的影响。

Journal ref Computers and Education Open, vol. 10, pp. 100375, 2026

DOI: 10.1016/j.caeo.2026.100375

用于心血管疾病分类的高效混合超参数调优方法

Abhay Kumar Pathak, Mrityunjay Chaubey, Manjari Gupta

发表机构 * Department of Computer Science, Institute of Science, Banaras Hindu University（贝拿勒斯印度教大学科学学院计算机科学系）； School of Computer Science, University of Petroleum and Energy Studies（石油与能源研究大学计算机科学学院）

AI总结本文提出一种结合随机搜索和网格搜索的混合超参数调优方法，提升心血管疾病分类模型的准确性和效率，实验表明该方法在性能和计算时间上均优于传统方法。

AI中文摘要

心血管疾病（CVDs）是严重的心脏疾病，需要准确诊断以防止致命后果。超参数调优在优化机器学习模型性能中起关键作用，通过选择最合适的参数配置来提高准确性、泛化性和可靠性。网格搜索系统地评估预定义的超参数组合，而随机搜索则从搜索空间中随机采样配置，实现更广泛的探索并减少计算成本。因此，在开发分类模型时，高效的调优策略至关重要，因为时间和预测能力同样关键。本文提出了一种新的超参数调优方法，用于调优CVD分类的机器学习模型。所提出的随机网格搜索结合了随机搜索探索全局空间的能力和网格搜索在最有前途区域的集中且彻底的搜索。这种混合方法在探索和利用之间找到最佳平衡，得到一个稳健且省时的机器学习分类模型。在最先进的模型上的实验结果表明，随机网格搜索比传统超参数调优方法表现更好。除了观察到的模型性能提升外，大多数模型的训练所需计算时间也显著减少。研究结果强调了所提出的随机网格搜索方法在减少训练时间和提高计算效率方面的作用。所提出的技术在医疗保健领域的机器学习应用中具有重大潜力，能够提供及时且准确的CVDs诊断。

英文摘要

Cardiovascular diseases (CVDs) are serious illnesses of the heart that require accurate diagnosis to prevent fatal consequences. Hyperparameter tuning plays a critical role in optimizing machine learning model performance by selecting the most suitable parameter configurations for improved accuracy, generalization, and reliability. Grid search systematically evaluates predefined hyperparameter combinations, whereas random search samples configurations randomly from the search space, enabling broader exploration with reduced computational cost. Therefore, an efficient tuning strategy is essential when developing classification models where time plays a crucial role along with predictive capability. In this work, we propose a new hyperparameter tuning approach to tune the hyperparameters of ML models for CVD classification. The proposed randomized grid search combines the power of random search to explore the global space with the focused and exhaustive search of grid search in the most promising areas. This hybrid approach finds an optimal balance between exploration and exploitation and yields a robust and time-efficient ML model for classification settings. Experimental results on state-of-the-art models demonstrate that randomized grid search performs better than traditional hyperparameter tuning methods. In addition to the observed improvement in model performance, the computational time required for training models was substantially reduced across most of the models. The presented results emphasize the reduction in training time and the computational efficiency of the proposed randomized grid search method. The proposed technique has significant potential to advance ML applications in healthcare by providing timely and accurate CVD diagnosis.
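
The two-stage idea in this abstract (random search over the whole space, then an exhaustive grid around the most promising point) can be sketched with the standard library alone. This is an illustrative reading of the abstract, not the authors' implementation. The function name, the `shrink` factor that sizes the local grid, and the default budgets are assumptions. `score_fn` stands in for a cross-validated model score.

```python
import itertools
import random


def random_then_grid_search(score_fn, bounds, n_random=30, grid_points=5,
                            shrink=0.1, seed=0):
    """Maximise score_fn over continuous hyperparameters.

    bounds maps each hyperparameter name to a (low, high) range.
    Returns (best_config, best_score, best_score_after_random_stage).
    grid_points must be at least 2.
    """
    rng = random.Random(seed)
    names = sorted(bounds)

    # Stage 1: random search explores the global space.
    best_cfg, best_score = None, float("-inf")
    for _ in range(n_random):
        cfg = {name: rng.uniform(*bounds[name]) for name in names}
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score
    random_best = best_score

    # Stage 2: exhaustive grid in a small box around the stage-1 winner.
    axes = []
    for name in names:
        low, high = bounds[name]
        half_width = shrink * (high - low)
        start = best_cfg[name] - half_width
        step = 2.0 * half_width / (grid_points - 1)
        # Clip grid values so they stay inside the original bounds.
        axes.append([min(high, max(low, start + i * step))
                     for i in range(grid_points)])
    for values in itertools.product(*axes):
        cfg = dict(zip(names, values))
        score = score_fn(cfg)
        if score > best_score:
            best_cfg, best_score = cfg, score

    return best_cfg, best_score, random_best
```

Because the stage-1 winner is kept unless a grid point beats it, the final score can never be worse than the random-search result. The total cost is `n_random` plus `grid_points` raised to the number of hyperparameters, so the local grid should stay small.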


2411.15361 2026-05-19 cs.AI 版本更新

Designing Cellular Manufacturing System in Presence of Alternative Process Plans

在存在替代工艺计划的情况下设计单元制造系统

Md. Kutub Uddin, Md. Saiful Islam, Md Abrar Jahin, Md. Tanjid Hossen Irfan, Md. Saiful Islam Seam, M. F. Mridha

Affiliations: Department of Mechanical Engineering, Khulna University of Engineering & Technology; Department of Industrial Engineering and Management, Khulna University of Engineering & Technology; Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California; Department of Computer Science, American International University-Bangladesh

AI summary: This paper proposes four integer programming models for grouping parts and machines at the design and operation stages so as to minimize intracell and intercell movement, and discusses the suitability of this objective relative to others such as investment cost and operating cost.

Journal ref IET Collaborative Intelligent Manufacturing (2026)

2411.10636 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

缓解孟加拉语分类任务中的外在性别偏见

Sajib Kumar Saha Joy, Arman Hassan Mahy, Meherin Sultana, Azizah Mamun Abha, MD Piyal Ahmmed, Yue Dong, G M Shahariar

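Affiliations: Ahsanullah University of Science and Technology; University of California, Riverside

AI summary: This paper studies extrinsic gender bias in Bangla pretrained language models, builds four task-specific benchmark datasets, and proposes RandSymKL to mitigate the bias; experiments show that it reduces bias while maintaining competitive accuracy.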


AI中文摘要

在本研究中，我们探讨了孟加拉语预训练语言模型中的外在性别偏见，这是一个在低资源语言中鲜有研究的领域。为了评估这种偏见，我们构建了四个人工标注的任务特定基准数据集，用于情感分析、毒性检测、仇恨言论检测和讽刺检测。每个数据集都通过细致的性别扰动进行了增强，通过系统地交换性别化名称和术语并保持语义内容，实现了对性别驱动预测变化的最小配对评估。然后，我们提出RandSymKL，一种整合对称KL散度和交叉熵损失的随机去偏策略，以在任务特定的预训练模型中缓解偏见。RandSymKL是一种精炼的训练方法，以统一的方式整合这些元素，专注于分类任务的外在性别偏见缓解。我们的方法在现有偏见缓解方法上进行了评估，结果表明，我们的技术不仅有效减少了偏见，还与其他基线方法相比保持了竞争性的准确性。为了促进进一步研究，我们已公开了我们的实现和数据集：https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

英文摘要

In this study, we investigate extrinsic gender bias in Bangla pretrained language models, a largely underexplored area in low-resource languages. To assess this bias, we construct four manually annotated, task-specific benchmark datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. Each dataset is augmented using nuanced gender perturbations, where we systematically swap gendered names and terms while preserving semantic content, enabling minimal-pair evaluation of gender-driven prediction shifts. We then propose RandSymKL, a randomized debiasing strategy integrated with symmetric KL divergence and cross-entropy loss to mitigate the bias across task-specific pretrained models. RandSymKL is a refined training approach to integrate these elements in a unified way for extrinsic gender bias mitigation focused on classification tasks. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available: https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias
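One plausible reading of "symmetric KL divergence and cross-entropy loss" is a task loss on the original input plus a consistency penalty between the predictions for the original and the gender-swapped input. The sketch below illustrates that reading only. The abstract does not give the exact formula, so the function `debias_loss`, the weight `lam`, and the way the two terms are summed are assumptions, and the "randomized" part of RandSymKL is not modeled here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions, smoothed to avoid log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetric KL divergence: KL(p || q) + KL(q || p)."""
    return kl(p, q) + kl(q, p)


def debias_loss(logits_orig, logits_swapped, label, lam=1.0):
    """Cross-entropy on the original input plus a symmetric-KL consistency term.

    `lam` (hypothetical) trades task accuracy against prediction consistency
    between the original and the gender-swapped sentence.
    """
    p, q = softmax(logits_orig), softmax(logits_swapped)
    cross_entropy = -math.log(p[label] + 1e-12)
    return cross_entropy + lam * symmetric_kl(p, q)
```

When the model predicts the same distribution for both members of a minimal pair, the penalty is zero and the loss reduces to plain cross-entropy; the penalty grows as the two predictions diverge.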


2409.14634 2026-05-19 cs.HC cs.AI 版本更新

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

基于领域重构与新颖性评估的人机协同科学构想系统

Marissa Radensky, Simra Shahid, Raymond Fok, Pao Siangliulue, Tom Hope, Daniel S. Weld

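Affiliations: University of Washington; Microsoft; Allen Institute for AI

AI summary: Scideator uses facet recombination and novelty-evaluation modules to help users generate more creative ideas during scientific ideation; a user study shows stronger support for idea exploration and expressiveness than a baseline using the same LLM without the facet-based modules.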

Comments Updated based on most recent submission


AI中文摘要

科学构想过程通常涉及将现有论文的要素重新组合以产生新想法。我们提出了Scideator，首个基于要素的科学构想人机协同系统。从用户提供的论文出发，Scideator提取关键要素--目的、机制和评估--并允许用户交互式重新组合要素以合成想法。Scideator由三个设计选择驱动：(1) 人类在回路的要素重新组合，用户从检索的论文中选择要素，系统通过Faceted Idea Generator模块寻找跨要素的类比生成想法；(2) 距离控制检索通过Analogous Paper Facet Finder模块，提供从相同主题到完全不同领域的论文范围；(3) 基于要素的新颖性验证通过Idea Novelty Checker模块，一个检索后重排序流程，帮助用户评估想法的原创性。在计算机科学研究人员的用户研究中，Scideator比使用相同基础LLM但无要素模块的基线提供了显著更多的创造力支持，特别是在想法探索和表达性方面。消融实验进一步表明，要素对新颖性检查器有益：基于要素的检索后重排序比标准检索和重排序显示更多相关论文，且基于要素的新颖性分类器优于基于无结构想法和论文推理的分类器。

英文摘要

The scientific ideation process often involves blending facets of existing papers to create new ideas. We contribute Scideator, the first human-LLM system for facet-based scientific ideation. Starting from user-provided papers, Scideator extracts key facets -- purposes, mechanisms, and evaluations -- from these and related papers, allowing users to interactively recombine facets to synthesize ideas. Scideator is driven by three design choices: (1) human-in-the-loop facet recombination, in which users select facets from retrieved papers and the system generates ideas by finding analogies across them via the Faceted Idea Generator module; (2) distance-controlled retrieval via the Analogous Paper Facet Finder module, which surfaces papers ranging from the same topic to entirely different areas to provide a spectrum of directions; and (3) facet-based novelty verification via the Idea Novelty Checker module, a retrieve-then-rerank pipeline that helps users to evaluate idea originality using facets. In a user study with computer science researchers, Scideator provided significantly more creativity support than a baseline using the same backbone LLM without our facet-based modules, particularly in idea exploration and expressiveness. Ablations further show that the facets benefit the novelty checker: facet-based retrieve-then-rerank surfaces more relevant papers than standard retrieval and re-ranking, and a facet-grounded novelty classifier outperforms classifiers that reason over unstructured ideas and papers.


2409.10102 2026-05-19 cs.IR cs.AI cs.CL 版本更新

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

检索增强生成系统中的可信度：综述

Yujia Zhou, Wenbo Zhang, Jingying Shao, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Jason Chen Zhang, Zhicheng Dou, Philip S. Yu, Jiaxin Mao

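Affiliations: Tsinghua University; Renmin University of China; The Chinese University of Hong Kong; Beijing Academy of Artificial Intelligence; Hong Kong Polytechnic University; Microsoft Research Asia; University of Illinois

AI summary: This survey reviews the key dimensions of trustworthiness in retrieval-augmented generation systems, proposes the Trust-RAG Compass framework covering six aspects (factuality, robustness, fairness, transparency, accountability, and privacy), introduces an evaluation benchmark, reveals performance gaps between LLMs across trustworthiness dimensions, and identifies directions for future research.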


AI中文摘要

检索增强生成（RAG）已迅速成为大型语言模型（LLMs）发展中的关键范式。尽管现有研究主要强调准确性和效率，但RAG系统的可信度仍缺乏充分探讨。RAG通过将响应基于外部和最新知识来提高LLM的可靠性，减少幻觉。然而，不可靠的检索或不当的知识利用仍可能导致不良输出。为此，我们提出统一框架Trust-RAG Compass，从事实性、鲁棒性、公平性、透明性、问责性和隐私六个关键维度评估RAG系统的可信度。在此框架下，我们对现有文献进行了全面回顾，并引入评估基准TRC Bench，围绕六个维度对多种专有和开源模型进行全面评估。我们的结果揭示了不同类型的LLM在不同可信度维度上的性能差距。最后，基于我们的发现，我们识别了关键挑战和未来研究的前景。通过这项工作，我们旨在为后续研究提供结构化基础，并为开发真实场景中的可信RAG系统提供实用指导。

英文摘要

Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding responses in external and up-to-date knowledge, reducing hallucinations. However, unreliable retrieval or improper knowledge utilization may still lead to undesirable outputs. To address these concerns, we propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench (Trust-RAG Compass Benchmark), regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness. Finally, we identify key challenges and promising directions for future research based on our findings. Through this work, we aim to provide a structured foundation for subsequent investigations and practical guidance for developing trustworthy RAG systems in real-world scenarios.


2409.02428 2026-05-19 cs.LG cs.AI cs.CL cs.SY eess.SY 版本更新

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement Learning

语言模型作为定制环境多目标强化学习的高效奖励函数搜索器

Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang

Affiliations: Tsinghua Shenzhen International Graduate School, Tsinghua University, China; Department of Computer Science, University of Oxford, United Kingdom; Department of Data Science, New Jersey Institute of Technology, USA

AI summary: This paper proposes ERFSL, which uses language models to search for reward functions efficiently: it generates reward components, corrects their code with a reward critic, and designs reward functions for multi-objective reinforcement learning tasks in a zero-shot setting.


AI中文摘要

在强化学习任务中，设计和改进复杂定制环境和多重需求的奖励函数具有挑战性。本文提出ERFSL，一种利用大型语言模型（LLMs）的高效奖励函数搜索器，使LLMs成为有效的白盒搜索器，并突出其先进的语义理解能力。具体而言，我们为每个数值明确的用户需求生成奖励组件，并使用奖励批评者识别正确的代码形式。然后，LLMs为奖励组件分配权重以平衡其值，并通过灵活采用方向突变和交叉策略迭代调整权重，类似于遗传算法，基于训练日志分析器提供的上下文。我们将其应用于无直接人类反馈或奖励示例的定制数据收集RL任务（零样本学习）。奖励批评者仅需每个需求一个反馈实例即可有效纠正奖励代码，防止不可纠正的错误。权重初始化使在帕累托解集内获取不同奖励函数而无需权重搜索。即使权重偏差达500倍，平均仅需5.2次迭代即可满足用户需求。ERFSL也适用于大多数使用GPT-4o mini的提示，因为我们分解了权重搜索过程，以降低对数值和长上下文理解能力的要求。

英文摘要

Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights, avoiding ambiguous and redundant adjustments, by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to a customized data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. ERFSL also works well with most prompts when using GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.


2406.15797 2026-05-19 cs.LG cs.AI 版本更新

$\texttt{SynC}$: Synergistic Boosting of Structure and Representation for Deep Graph Clustering

$\texttt{SynC}$：深度图聚类的结构与表示协同提升

Shifei Ding, Benyu Wu, Xiao Xu, Ling Ding, Xindong Wu

Affiliations: School of Computer Science and Technology/the School of Artificial Intelligence, China University of Mining and Technology; Mine Digitization Engineering Research Center of Ministry of Education; College of Intelligence and Computing, Tianjin University; Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology

AI summary: SynC improves deep graph clustering by synergistically boosting structure and representation learning, reducing the parameter count and improving generalization on low-homophily graphs.

2406.14427 2026-05-19 cs.AI q-bio.NC 版本更新

Principles of frugal inference and control

节俭推断与控制的原则

Itzel Olivos-Castillo, Paul Schrater, Xaq Pitkow

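Affiliations: Department of Computer Science, Rice University; Department of Computer Science, University of Minnesota; Department of Psychology, University of Minnesota; Neuroscience Institute, Carnegie Mellon University

AI summary: This paper proposes principles of frugal inference and control: a variant of the POMDP framework that optimizes resource use, balancing utility maximization against resource consumption in an uncertain world, and applies them to nonlinear control problems such as pole balancing and drone stabilization.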


AI中文摘要

智能体在不确定世界中面临在效用最大化与资源使用之间取得平衡的挑战，不仅涉及外部运动还涉及内部计算。现有不确定性控制理论通常将推断视为无成本，尽管在人工和生物系统中造成显著的计算和能量负担。为解决此问题，我们引入POMDP框架的新变体，将通过推断获得的信息视为需优化的资源。解决局部线性高斯近似问题揭示了三个资源高效的控制原则：首先，当信息成本高时，推断从贝叶斯最优（无损）压缩转向损失性阶段，战略性地保留部分不确定性以优化资源使用。其次，放松精确贝叶斯推断产生等效解集，反映多种结合不完美推断与补偿控制的方式。这种灵活性可用于满足额外目标或约束而不牺牲原始任务性能。第三，超越目标达成，控制可用于抵消估计误差并引导系统进入表示成本较低的区域。我们实验证明这些原则超越局部线性高斯近似，解决非线性控制问题如平衡杆和无人机稳定。这些结果建立了一个理性计算框架，扩展了现有信息受限决策方法，并为大脑和机器如何在紧约束下实现有效行为提供规范见解。

英文摘要

A central challenge for intelligent agents in an uncertain world is striking the right balance between utility maximization and resource use, not only for external movement but also for internal computation. Existing theories of control under uncertainty typically treat inference as cost-free, despite the substantial computational and energetic burden it imposes in both artificial and biological systems. To remedy this problem, we introduce a novel variant of the POMDP framework in which the information acquired through inference is treated as a resource that must be optimized alongside utility. Solving a local linear-Gaussian approximation of the resulting problem reveals three general principles of resource-efficient control. First, when information is costly, inference shifts from a Bayes-optimal (lossless) compression of the past to a lossy regime that strategically leaves some uncertainty unresolved to optimize resource use. Second, relaxing exact Bayesian inference creates a manifold of equivalent solutions, reflecting multiple ways to combine imperfect inference with compensatory control. This flexibility can be used to meet additional objectives or constraints without sacrificing performance on the original task. Third, beyond goal attainment, control can be leveraged to counteract estimation errors and steer the system into regimes where representation costs are lower. We empirically demonstrate that these principles generalize beyond the local linear-Gaussian approximation, enabling the solution of nonlinear control problems such as pole balancing and drone stabilization. Together, these results establish a framework for rational computation that extends existing approaches to information-constrained decision-making and offers normative insight into how brains and machines can achieve effective behavior under tight computational constraints.


2401.03717 2026-05-19 cs.LG cs.AI 版本更新

Universal Time-Series Representation Learning: A Survey

通用时间序列表示学习：综述

Patara Trirat, Yooju Shin, Junhyeok Kang, Youngeun Nam, Jihye Na, Minyoung Bae, Joeun Kim, Byunghyun Kim, Jae-Gil Lee

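Affiliations: KAIST

AI summary: This survey reviews representation learning methods for time-series data, discusses the strengths of deep learning in extracting hidden patterns, and proposes a new taxonomy to guide future research.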

Comments Accepted by ACM Computing Surveys. Extended version: 41 pages, 7 figures


AI中文摘要

时间序列数据存在于现实世界的各个方面，从天空中的卫星到身上的可穿戴设备。通过提取和推断有价值的信息来学习表示对于理解复杂现象的动力学和做出明智决策至关重要。深度学习在无需手动特征工程的情况下展示了在时间序列数据中提取隐藏模式和特征的卓越性能。本文首先提出了一种基于三种基本要素的新分类方法，用于设计最先进的通用表示学习方法。根据该分类法，本文全面回顾了现有研究，讨论了这些方法如何提高学习表示的质量。最后，作为未来研究的指南，本文总结了常用的实验设置和数据集，并讨论了几个有前途的研究方向。相关资源可在https://github.com/itouchz/awesome-deep-time-series-representations上找到。

英文摘要

Time-series data exists in every corner of real-world systems and services, ranging from satellites in the sky to wearable devices on human bodies. Learning representations by extracting and inferring valuable information from these time series is crucial for understanding the complex dynamics of particular phenomena and enabling informed decisions. With the learned representations, we can perform numerous downstream analyses more effectively. Among several approaches, deep learning has demonstrated remarkable performance in extracting hidden patterns and features from time-series data without manual feature engineering. This survey first presents a novel taxonomy based on three fundamental elements in designing state-of-the-art universal representation learning methods for time series. According to the proposed taxonomy, we comprehensively review existing studies and discuss their intuitions and insights into how these methods enhance the quality of learned representations. Finally, as a guideline for future studies, we summarize commonly used experimental setups and datasets and discuss several promising research directions. An up-to-date corresponding resource is available at https://github.com/itouchz/awesome-deep-time-series-representations.


2305.10721 2026-05-19 cs.LG cs.AI 版本更新

Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

重新审视长期时间序列预测：对线性映射的调查

Zhe Li, Shiyi Qi, Yiduo Li, Zenglin Xu

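Affiliations: Harbin Institute of Technology, Shenzhen, China

AI summary: This paper investigates the effectiveness of linear mapping in long-term time series forecasting, reveals the key role of affine mapping in forecasting periodic signals, and examines how reversible normalization and the input horizon affect model robustness.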

Journal ref Li, Zhe, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting Long-Term Time Series Forecasting: an Investigation on Affine Mapping. Academia AI and Applications 2, no. 2 (2026)


DOI: 10.20935/AcadAI8236

AI中文摘要

引言：长期时间序列预测（LTSF）近年来获得了广泛关注。尽管存在各种专门设计来捕捉时间依赖性的方法，但近期研究表明，甚至一个单一的线性层也能取得竞争性的性能。本文研究了近期LTSF方法的内在有效性，并揭示了仿射映射在周期信号预测中的关键作用。材料和方法：我们对模拟和现实世界的数据集进行了全面实验，以分析最先进模型的组成部分。我们提供了理论分析，解释仿射映射在周期信号预测中的工作机制。我们评估了可逆归一化和输入时间跨度扩展对模型鲁棒性的影响。结果：我们发现（1）仿射映射在常用的基准测试中主导了预测性能，模型从输入到输出学习了相似的转换矩阵；（2）仿射映射能够有效捕捉周期性模式，但在非周期性信号或具有不同周期的时序数据中表现较差；（3）可逆归一化显著增强了趋势预测，通过将非周期性趋势转换为周期性模式；（4）增加输入时间跨度提高了多通道数据的性能。代码可在：https://github.com/plumprc/RTSF获得。结论：我们的发现为LTSF模型的工作机制提供了理论和实验见解，突显了线性方法的优势和局限性。结果表明，未来模型的发展应关注处理跨通道周期变化和非周期性成分。

英文摘要

Introduction: Long-term time series forecasting (LTSF) has gained significant attention in recent years. While various specialized designs exist for capturing temporal dependency, recent studies have shown that even a single linear layer can achieve competitive performance. This paper investigates the intrinsic effectiveness of recent LTSF approaches and reveals the critical role of affine mapping. Materials and methods: We conduct comprehensive experiments on both simulated and real-world datasets to analyze the components of state-of-the-art models. A theoretical analysis is provided to explain the working mechanisms of affine mapping in periodic signal forecasting. We evaluate the impact of reversible normalization and input horizon extension on model robustness. Results: We find that (1) affine mapping dominates forecasting performance across commonly utilized benchmarks, with models learning similar transition matrices from input to output; (2) affine mapping effectively captures periodic patterns but struggles with non-periodic signals or time series with varying periods across channels; (3) reversible normalization significantly enhances trend forecasting by transforming non-periodic trends into periodic-like patterns; (4) increasing input horizon improves performance on multi-channel data with different periods. Code is available at: https://github.com/plumprc/RTSF. Conclusions: Our findings provide theoretical and experimental insights into the working mechanisms of LTSF models, highlighting both the strengths and limitations of linear approaches. The results suggest that future model development should focus on handling cross-channel period variations and non-periodic components.
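A small example may help show why an affine map can forecast a purely periodic signal: if the input window covers at least one period, each future value equals some past value, so a weight matrix that simply copies the right input step already predicts exactly. The sketch below is an illustration of that intuition, not the paper's theoretical analysis; the helper names `periodic_copy_weights` and `affine_forecast` are hypothetical, and the bias is set to zero.

```python
import math


def periodic_copy_weights(input_len, horizon, period):
    """Build W so that output step i copies the input step a whole number of periods earlier."""
    if period > input_len:
        raise ValueError("the input window must cover at least one period")
    W = [[0.0] * input_len for _ in range(horizon)]
    for i in range(horizon):
        # Output step i sits at absolute time input_len + i; step back by whole periods
        # until the index falls inside the input window.
        t = input_len + i
        while t >= input_len:
            t -= period
        W[i][t] = 1.0
    return W


def affine_forecast(W, b, x):
    """Affine map y = W x + b, written with plain lists."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]


period, input_len, horizon = 12, 36, 24
series = [math.sin(2 * math.pi * t / period) for t in range(input_len + horizon)]
W = periodic_copy_weights(input_len, horizon, period)
forecast = affine_forecast(W, [0.0] * horizon, series[:input_len])
max_error = max(abs(f - s) for f, s in zip(forecast, series[input_len:]))
```

Here `max_error` is at floating-point noise level. The same construction breaks down once channels have different periods or the signal carries a non-periodic trend, which matches the limitations listed in the abstract.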


2212.02098 2026-05-19 cs.AI 版本更新

A Machine with Short-Term, Episodic, and Semantic Memory Systems

具有短期、事件性和语义记忆系统的机器

Taewoon Kim, Michael Cochez, Vincent François-Lavet, Mark Neerincx, Piek Vossen

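Affiliations: Vrije Universiteit Amsterdam; Technische Universiteit Delft

AI summary: This paper proposes an agent model with short-term, episodic, and semantic memory systems, models each memory system with a knowledge graph, and demonstrates the model's advantages in memory encoding, storage, and retrieval in a custom-built environment.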

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence (2023), 37(1), 48-56

2605.16638 2026-05-19 cs.AI 版本更新

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

TTE-Flash: 通过思考-然后-嵌入标记加速基于推理的多模态表示

Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

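Affiliations: Meta AI

AI summary: This paper proposes TTE-Flash, which replaces explicit reasoning chains with latent think tokens to make reasoning-based multimodal representations efficient. The model outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark and shows scaling behavior in zero-shot evaluation.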


AI中文摘要

近期研究显示，通用多模态嵌入（UME）显著受益于推理链（CoT）推理。在该范式中，生成模型为多模态查询生成显式推理轨迹，最终表示从<eos>嵌入标记提取，该标记同时关注查询和推理。尽管其有效性，生成显式CoT轨迹的计算开销常不可接受。本文提出用隐式思考标记替代显式CoT，这些标记被解释为潜在变量，可生成显式CoT轨迹作为观测变量。通过优化思考标记使用CoT生成损失，随后嵌入标记使用对比损失，我们生成高性能、基于推理的表示，且推理成本恒定。本研究探讨了两个关键架构设计：1）如何从同一LLM主干中提取思考和嵌入标记；2）如何将标记作为两个依赖任务进行训练。我们引入TTE-Flash-2B，一个基于推理的多模态表示模型，在MMEB-v2基准上优于其显式CoT对应物，同时生成可解释的文本和视觉隐式思考标记。此外，跨15个视频数据集的零样本评估揭示了随着思考标记数量增加的扩展行为，并促使基于任务需求的自适应思考预算分配的初步研究。

英文摘要

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embedding tokens should be extracted from the same LLM backbone; 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases and motivates a pilot study of adaptive think budget allocation based on task requirements.


2605.16632 2026-05-19 cs.LG cs.AI cs.LO 版本更新

Learning How to Cube

学习如何求立方

Ferhat Erata, Sam Kouteili, Thanos Typaldos, Timos Antonopoulos, Robert B. Jones, Byron Cook, Ruzica Piskac

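Affiliations: Yale University; AWS Agentic AI

AI summary: This paper proposes a neuro-symbolic post-training framework that uses an MCTS-based data curation pipeline and symbolic heuristics, enabling a 4B-parameter model to reach a pass@5 score of 53 on SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4.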

Comments 33 pages, preprint


AI中文摘要

尽管Cube-and-Conquer（C&C）在解决具有挑战性的布尔可满足性（SAT）问题上非常有效，但之前的工作没有展示基于Transformer的模型能够学习有效的求立方启发式方法。我们介绍了一种神经符号后训练框架。我们设计了一个基于MCTS的数据整理管道，利用符号启发式方法在SAT竞赛公式上探索分割决策，生成基于求解器统计信息的偏好数据，并辅以教师模型的推理轨迹。我们的两阶段后训练，监督微调（SFT）后接直接偏好优化（DPO），使4B参数模型在100个SAT竞赛基准上取得53的pass@5分数，超越了前沿LLM如Claude-Sonnet-4（50）并匹配最佳符号启发式（53）。消融实验显示，SFT单独将pass@5提升至51，DPO增加2个基准；对实际首次立方决策的熵/一致消融显示，SFT而非DPO导致根层决策多样性，产生互补的运行覆盖。这表明Transformer可以在传统由符号方法主导的领域中被训练出有效的求立方决策。

英文摘要

Despite the effectiveness of Cube-and-Conquer (C&C) for solving challenging Boolean Satisfiability (SAT) problems, no prior work has shown that transformer-based models can learn effective cubing heuristics. We introduce a neuro-symbolic post-training framework for this task. We design an MCTS-based data curation pipeline that uses symbolic heuristics to explore splitting decisions over SAT competition formulas, producing preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. Our two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), enables a 4B-parameter model to achieve a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). Ablations show that SFT alone improves pass@5 from 46 to 51, with DPO adding 2 additional benchmarks; an entropy/agreement ablation on realized first-cube decisions further shows that SFT, not DPO, accounts for the root-level decision diversity that produces complementary per-run coverage over deterministic symbolic methods. This demonstrates that transformers can be trained to make effective cubing decisions in a domain traditionally dominated by symbolic methods.
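For readers unfamiliar with Cube-and-Conquer, the toy sketch below shows the basic shape of the technique: split the formula on a few variables into cubes (partial assignments), then solve each cube independently. It is a deliberately naive illustration under stated assumptions. Real C&C pipelines pick split variables with lookahead heuristics and hand each cube to a CDCL solver, whereas here the split variables are supplied by the caller (standing in for the learned cubing decision) and the "conquer" step is brute force. All function names are hypothetical.

```python
from itertools import product


def satisfied(clauses, assignment):
    """True if every clause has at least one literal made true by the assignment."""
    return all(any(assignment[abs(lit)] == (lit > 0) for lit in clause) for clause in clauses)


def conquer(clauses, variables, cube):
    """Brute-force the variables left free by the cube; return a model or None."""
    free = [v for v in variables if v not in cube]
    for values in product([False, True], repeat=len(free)):
        assignment = {**cube, **dict(zip(free, values))}
        if satisfied(clauses, assignment):
            return assignment
    return None


def cube_and_conquer(clauses, split_vars):
    """Enumerate all cubes over split_vars and conquer each one in turn."""
    variables = sorted({abs(lit) for clause in clauses for lit in clause})
    for values in product([False, True], repeat=len(split_vars)):
        cube = dict(zip(split_vars, values))
        model = conquer(clauses, variables, cube)
        if model is not None:
            return model
    return None  # every cube was refuted, so the formula is unsatisfiable


# (x1 or x2) and (not x1 or x3) and (not x2 or not x3), in DIMACS-style integer literals.
clauses = [[1, 2], [-1, 3], [-2, -3]]
model = cube_and_conquer(clauses, split_vars=[1])
```

Because the cubes are independent, they can be solved in parallel, which is why the quality of the splitting decision matters so much: good split variables produce subproblems that are both balanced and much easier than the original formula.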


2605.16623 2026-05-19 cs.CY cs.AI 版本更新

To Trust or Not to Trust: Authors' Response to AI-based Reviews

信任还是不信任：作者对基于AI的评论的回应

César Leblanc, Lukas Picek

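Affiliations: École Normale Supérieure; Sorbonne University; University of West Bohemia; Massachusetts Institute of Technology

AI summary: Through two pilot studies, this paper examines how authors use and perceive AI-based auxiliary reviews, finding that most authors consider the AI feedback useful but do not treat it as equivalent to human review, and that they strongly support using AI as an internal review tool before submission.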


AI中文摘要

大型语言模型日益被讨论和用于协助学术同行评审，但关于作者如何使用和感知AI反馈的实证证据仍有限。本文报告了两项独立试点研究的结果，研究了作者在两个计算机科学会议中对AI辅助评审的使用和看法。在评审发布后，作者被邀请完成一份匿名的评审后问卷，询问AI评审的有用性、可信度、与人类评审的一致性、修改的实用性、感知的不准确性以及同意。最终的数据集包含40篇论文作者的56个可分析响应；封闭式问题使用描述性统计汇总，开放式回答使用归纳主题分析。大多数受访者（83.9%）认为AI评审有用，80.4%报告说AI发现了人类评审未提及的问题。这种感知的附加价值转化为行动：82.1%的受访者报告在最终版本中使用了至少一些AI反馈。然而，作者并不将AI评审视为等同于人类评审。他们普遍认为AI的可信度低于人类评审，尽管25.0%的受访者描述至少一些人类评审不很有用。报告的AI评审问题通常有限：51.8%报告了轻微的不准确，而16.1%报告了明显错误、误导或不相关的评论。对未来发展使用的支持最强当AI被框架为监督或作者控制的工具：96.4%表示在未来的提交中会使用AI作为内部评审工具，89.3%更倾向于提前得知AI将在评审中使用，76.8%更倾向于在使用前获得明确的同意。

英文摘要

Large language models are increasingly discussed and used as tools that may assist with scholarly peer review, but empirical evidence regarding how authors use and perceive AI-based feedback remains limited. This paper reports findings from two independent pilot studies on authors' use and perceptions of AI-based auxiliary review at two computer science venues. After the review release, authors were invited to complete an anonymous post-review questionnaire about the AI review's usefulness, trustworthiness, agreement with human reviews, practical value for revision, perceived inaccuracies, and consent. The final dataset included 56 analyzable responses from authors of 40 papers; closed-ended items were summarized using descriptive statistics, and open-ended responses were analyzed using inductive thematic analysis. Most respondents (83.9%) considered the AI-based review useful, and 80.4% reported that it identified issues not mentioned by human reviewers. This perceived added value translated into action: 82.1% reported using at least some AI feedback in their camera-ready version. However, the authors did not treat the AI review as equivalent to a human review. They generally trusted it less than the human reviews and found human feedback clearer, even though 25.0% described at least some human reviews as not very useful. Reported problems with the AI review were usually limited: 51.8% reported minor inaccuracies, while 16.1% reported clearly incorrect, misleading, or irrelevant comments. Support for future use was strongest when AI was framed as a supervised or author-controlled tool: 96.4% said they would use AI as an internal review tool before future submissions, 89.3% preferred advance notice that AI would be used in review, and 76.8% favored explicit consent before use.


2605.16612 2026-05-19 cs.AI cond-mat.mtrl-sci 版本更新

PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

PRISMat：基于策略的、排列不变的自回归材料生成

Claire Schlesinger, Circe Hsu, Peter Schindler, Robin Walters

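Affiliations: Khoury College of Computer Sciences; Northeastern University; College of Engineering

AI summary: PRISMat generates crystal slabs efficiently and improves the accuracy of targeted materials discovery, substantially reducing the mean absolute error on cleavage energy and work function tasks.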

Comments 10 pages, 8 figures, Under Review at Neurips 2026


AI中文摘要

快速识别具有目标性质的候选材料已成为材料科学中的关键任务。机器学习作为一种替代物理模拟的方法，提供了一种更快、更经济的方式过滤材料，基于其稳定性和其他目标性质，减少达到昂贵合成阶段的候选材料数量。最近，大语言模型（LLMs）已应用于此角色，但这些模型参数密集且计算成本高，训练和推理时都不可行，不适合高通量任务。这种低效性源于语言模型的过度参数化以及将材料生成作为序列学习问题的困难。在本文中，我们提出了PRISMat，一种成本效益高、排列不变的模型，解决了这些限制。我们显示，尽管PRISMat推理时间更短，但其在基于关键材料表面性质生成晶体片层方面能够超越LLMs。在目标材料发现中，我们实现了切开能和工作函数任务的均绝对误差分别为0.188 eV/A$^2$和2.79 eV，将下一个最佳模型的误差降低了4倍。

英文摘要

Rapid identification of candidate materials with target properties has become a key task in materials science. Machine learning has emerged as an alternative to physics-based simulation, offering a faster and cheaper way to filter materials based on their stability and other target properties, reducing the number of candidates that reach the costly synthesis stage. Recently, Large Language Models (LLMs) have been applied to this role, but these models are parameter-heavy and computationally expensive both during training and at inference time, making them unsuitable for high-throughput tasks. This inefficiency stems from both the large over-parameterization of language models and the difficulty of framing material generation as a sequence learning problem. In this paper, we present PRISMat, a cost-effective, permutation-invariant model, which addresses these limitations. We show that PRISMat, despite taking less time for inference, is able to outperform LLMs in generating crystal slabs conditioned on critical materials' surface properties. In targeted material discovery, we achieve mean absolute errors of 0.188 eV/A$^2$ and 2.79 eV for cleavage energy and work function tasks, respectively, reducing the error of the next best model by 4$\times$.


2605.16605 2026-05-19 cs.HC cs.AI 版本更新

PromptDecipher: Supporting AI Tutor Authoring Through Editable Simulated Interactions

PromptDecipher：通过可编辑的模拟交互支持AI辅导作者

Miina Koyama, Ruiwei Xiao, John Stamper

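Affiliations: Carnegie Mellon University

AI summary: PromptDecipher restructures the authoring workflow around direct, correction-based interaction, scaffolding teachers in the roles of learning designer and QA engineer to improve the quality and efficiency of AI tutor authoring.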


AI中文摘要

聊天机器人长期以来被探索为支持学习的工具，而大型语言模型的最新进展显著扩展了教育者创建AI辅导聊天机器人的平台。然而，有效的作者需求不仅仅是编写系统提示，还需要教育者扮演学习设计师、AI交互设计师和QA工程师。然而，实践中教师很少履行这些角色。我们的形成性研究发现，几乎没有人系统地测试他们的机器人后再部署给学生。为了解决这一差距，我们提出了PromptDecipher，一个系统，将作者流程围绕直接纠正交互重新组织，而不是编写抽象系统提示。教师与实时聊天预览互动并编辑不理想的机器人响应。自动化流程随后分析纠正，提出针对的系统提示重写，并在预定义的测试场景中验证更改。这将QA作为首要活动，并 scaffolds 教师在他们通常会跳过的角色中。PromptDecipher将在一个AI教育课程中部署，该课程将有数百名高等教育教师。一个实时原型（https://teacher-prompting.vercel.app/），匿名代码库（https://anonymous.4open.science/r/teacher-prompting-2EDF/），和匿名演示（https://tinyurl.com/las-prompt-decipher-demo）可通过脚注中的链接获取。

英文摘要

Chatbots have long been explored as tools to support learning, and recent advances in large language models have significantly expanded the availability of platforms for educators to author AI tutoring chatbots. Yet effective authorship demands more than writing a system prompt; it requires educators to act as learning designers, AI interaction designers, and QA engineers. In practice, however, teachers rarely fulfill these roles. Our formative study found that virtually none systematically tested their bots before deploying them to students. To address this gap, we present PromptDecipher, a system that restructures the authoring workflow around direct, correction-based interaction: rather than writing abstract system prompts, teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip. PromptDecipher will be deployed in an AI for Educators course enrolling hundreds of higher-education instructors. A live prototype (https://teacher-prompting.vercel.app/), an anonymized codebase (https://anonymous.4open.science/r/teacher-prompting-2EDF/), and an anonymized demo (https://tinyurl.com/las-prompt-decipher-demo) are available via links in the footnote.


2605.16602 2026-05-19 cs.HC cs.AI 版本更新

Why Modeling Human Haptic Material Perception with AI Is Difficult

为何用AI建模人类触觉材料感知是困难的

Yasemin Vardar

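Affiliations: Delft University of Technology (TU Delft)

AI summary: This position paper discusses the challenges of modeling human haptic material perception with AI, identifies data scarcity, the lack of evaluation platforms, and model limitations as the main bottlenecks, and stresses the importance of cross-disciplinary collaboration.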

Comments 5 pages, 1 figure, conference

详情

AI中文摘要

触觉在人类通过物理接触感知和识别材料中起着核心作用。尽管数十年的研究，触觉信号转化为有意义感知表征的机制仍不明确，限制了交互系统和智能体的设计。近年来人工智能（AI）的进步为建模和利用触觉数据提供了新机会；然而，触觉因其交互依赖性和多模态特性对当代AI提出了根本挑战。本文认为，AI与触觉的交叉领域进展受限于三个关键瓶颈：（1）触觉大数据集稀缺；（2）缺乏标准化评估平台和感知基准；（3）应用于触觉感知时模型容量和可解释性限制。本文讨论了这些挑战如何阻碍泛化、可重复性和对人类触觉的科学洞察，并回顾了新兴策略以解决这些问题。本文强调了协调、跨学科努力对推动AI系统的重要性，这些系统不仅能实现稳健的触觉感知，还能加深对人类触觉的理解。

英文摘要

Touch plays a central role in how humans perceive and recognize materials through physical contact. Despite decades of research, the mechanisms by which tactile signals are transformed into meaningful perceptual representations remain poorly understood, limiting the design of interactive systems and intelligent agents with human-like haptic perception. Recent advances in artificial intelligence (AI) offer new opportunities to model and exploit tactile data; however, haptics presents fundamental challenges for contemporary AI due to its interaction-dependent, multimodal nature. This position paper argues that progress at the intersection of AI and haptics is constrained by three key bottlenecks: (1) the scarcity of large, diverse, and balanced haptic datasets; (2) the lack of standardized evaluation platforms and perceptual benchmarks; and (3) limitations in model capacity and interpretability when applied to tactile perception. I discuss how these challenges impede generalization, reproducibility, and scientific insight into human touch and review emerging strategies to address them. This paper highlights opportunities for coordinated, cross-disciplinary efforts to advance AI systems that not only perform robust haptic perception but also contribute to a deeper understanding of human touch.

2605.16600 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

预训练写入，对齐读取：Transformer权重空间的不对称性

Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

发表机构 * Intuition Machines

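AI summary: The study reveals an asymmetry between pretraining and alignment in transformer weight space. By analyzing how weight deltas align with residual-stream activation subspaces and with the prediction subspace, it finds that alignment deltas in the read pathway concentrate along the principal directions of attention-input activations, while write-pathway deltas remain near-isotropic relative to the prediction subspace.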

AI中文摘要

交叉熵预训练和偏好对齐更新相同的Transformer权重，但留下几何上不同的痕迹。我们通过相对子空间分数探针来刻画这种不对称性，追踪权重变化如何与残差流激活子空间和由去嵌入定义的预测子空间对齐。对齐变化集中在读路径（W_Q，W_K）上，沿着注意力输入激活的主方向，而写路径（W_O，W_2）相对于预测子空间则保持近各向同性。我们通过各向异性梯度积累来解释这种模式：对矩阵W的更新是外积δ_t a_t^T之和，继承自哪一侧的协方差集中。对于读路径矩阵，这一侧是输入激活a_t，其协方差在训练过的Transformer中呈尖峰状，因此产生与目标无关的集中。对于写路径矩阵，相关的一侧是上游梯度δ_t，其各向异性取决于损失。交叉熵提供标准的每样本信号，诱导预训练期间写路径的预测几何；对齐目标通常在写路径上添加很少的进一步集中。我们通过检查点内轨迹、渐进对比目标控制以及闭合形式的秩1干预与匹配方向控制来支持这一解释，为所提出的权重空间几何提供因果证据。

英文摘要

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $\delta_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $\delta_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.
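
As a toy illustration of the outer-product argument in the abstract above, the following NumPy sketch accumulates an update as a sum of outer products of upstream gradients and input activations, using synthetic data. The dimensions, the spike scale of 10, and the helper names `accumulate_update` and `energy_fraction` are assumptions chosen for illustration; this is not the paper's probe or experimental setup.

```python
import numpy as np

def accumulate_update(deltas, acts):
    """Return sum_t outer(deltas[t], acts[t]) as a (d_out, d_in) matrix."""
    return deltas.T @ acts

def energy_fraction(delta_w, direction):
    """Fraction of the squared Frobenius norm of delta_w that lies along
    a single input-space direction."""
    u = direction / np.linalg.norm(direction)
    return float(np.linalg.norm(delta_w @ u) ** 2 / np.linalg.norm(delta_w) ** 2)

rng = np.random.default_rng(0)
steps, d_in, d_out = 2000, 32, 32  # hypothetical sizes

# Direction along which the "spiked" activations carry extra variance.
spike = np.zeros(d_in)
spike[0] = 1.0

# Isotropic upstream gradients: the output side has no preferred direction.
deltas = rng.standard_normal((steps, d_out))

# Spiked activations: one input direction has 10x the standard deviation.
spiked_acts = rng.standard_normal((steps, d_in))
spiked_acts[:, 0] *= 10.0

# Control: isotropic activations.
iso_acts = rng.standard_normal((steps, d_in))

spiked_frac = energy_fraction(accumulate_update(deltas, spiked_acts), spike)
iso_frac = energy_fraction(accumulate_update(deltas, iso_acts), spike)
print(f"spiked: {spiked_frac:.2f}, isotropic: {iso_frac:.2f}")
```

In this toy setting the accumulated update concentrates along the spiked input direction even though the gradients are isotropic, which is the sense in which the concentration is independent of the objective.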

2605.16598 2026-05-19 cs.MA cs.AI 版本更新

GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering

GRASP：基于命题的图代理搜索用于多跳问答

Stockton Jenkins, Ramya Korlakai Vinayak, Junjie Hu

发表机构 * University of Wisconsin-Madison（威斯康星大学麦迪逊分校）

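AI summary: GRASP decomposes multi-hop queries into dependency-aware plans, achieving high accuracy with low token usage in multi-hop question answering. On MuSiQue, 2WikiMultihopQA, and HotpotQA, it performs strongly in both the open-corpus retrieval and long-context reasoning settings while using fewer tokens.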

AI中文摘要

GRASP通过将多跳查询分解为依赖感知计划，提高了多跳问答的准确率并降低了token使用。在MuSiQue、2WikiMultihopQA和HotpotQA上，GRASP在开放语料检索和长文本推理设置中均表现出色，且token使用更少。

英文摘要

Agentic retrieval improves multi-hop question answering by giving language models autonomy to iteratively gather evidence. Recent work augments these systems with knowledge graphs for structured traversal, but this combination introduces significant cost: expensive graph construction at index time and compounding token usage at inference time. We introduce Graph Agentic Search over Propositions (GRASP), an agentic system that simultaneously optimizes for high accuracy and minimal token usage in multi-hop question answering. Rather than executing a rigid, singular query, GRASP actively coordinates its retrieval strategy by decomposing multi-hop queries into dependency-aware plans. This enables GRASP to dynamically scale the number of sub-agents according to the complexity of the problem. Each sub-agent resolves its single-hop query by exploring a novel three-layer hierarchical graph of entities, propositions, and passages, using the entity layer for targeted traversal and the proposition layer for high-recall passage retrieval via reciprocal-rank voting. We evaluate GRASP on MuSiQue, 2WikiMultihopQA, and HotpotQA under two settings: open-corpus retrieval and extended context reasoning (LongBench). GRASP achieves the highest QA accuracy in the open retrieval setting on MuSiQue and 2Wiki while using 40-50 percent fewer tokens than IRCoT+HippoRAG2. Furthermore, GRASP leads on EM and F1 across all three datasets in the LongBench setting while using 30 percent fewer tokens than the next most accurate method. Finally, we introduce success economy - the amortized token cost per correct answer, weighted by difficulty - and advocate for efficiency-aware evaluation as a standard practice for agentic QA.
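
The abstract above mentions passage retrieval via reciprocal-rank voting without giving details. A minimal sketch of one plausible reading follows: each ranked proposition votes for its source passage with weight 1/rank, and passages are ordered by total votes. The function name, the proposition-to-passage mapping, and the tie-breaking rule are assumptions for illustration, not GRASP's actual implementation.

```python
from collections import defaultdict

def reciprocal_rank_vote(ranked_props, prop_to_passage, top_k=2):
    """Aggregate ranked propositions into a passage ranking.

    ranked_props: proposition ids, best match first.
    prop_to_passage: maps each proposition id to its source passage id.
    """
    scores = defaultdict(float)
    for rank, prop in enumerate(ranked_props, start=1):
        # A proposition at rank r contributes 1/r to its passage.
        scores[prop_to_passage[prop]] += 1.0 / rank
    # Sort by descending score; break ties by passage id for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [passage for passage, _ in ordered[:top_k]]

ranked = ["p1", "p2", "p3", "p4"]
mapping = {"p1": "A", "p2": "B", "p3": "B", "p4": "B"}
print(reciprocal_rank_vote(ranked, mapping))
```

Here passage B collects 1/2 + 1/3 + 1/4 and overtakes passage A, which holds only the single top-ranked proposition; this shows how several moderately ranked propositions can outvote one strong match.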

2605.16575 2026-05-19 cs.AI 版本更新

Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

对手建模不是策略：大语言模型谈判者的局限

Romain Cosentino, Sarath Shekkizhar, Adam Earle, Silvio Savarese

发表机构 * Salesforce AI Research（Salesforce人工智能研究）

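AI summary: The study examines how LLM agents perform in multi-attribute negotiation and finds that they can model a counterparty's preferences but do not reliably turn that knowledge into strategic bargaining; final agreements are heavily shaped by opening anchors.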

AI中文摘要

谈判需要比推测对方需求更进一步：利用该信息在多个回合中做出有利的报价和反报价。我们研究了大语言模型（LLM）代理在受控的多属性讨价还价环境中的表现。发现当前LLM代理能建模对手偏好，但无法可靠地将此知识转化为战略谈判。当给予谈判伙伴偏好信息时，代理在推理轨迹早期准确建模，但此信息并未可靠改善知情方的收益。回合级分析显示原因：代理常回应他们认为对手重视的事物，但不一致地在自身高价值属性上获得收益。卖家总体更让步，且在不对称信息条件下，知情方常做出更弱补偿的让步。由于代理未能利用此底层效用结构获得战略优势，最终协议严重受表面初始锚定影响，而非实际效用权重。最后，要求代理在报价前明确陈述让步-互惠交易使单个回合看起来更战略化，但最终未能提高最终协议的效率。

英文摘要

Negotiation requires more than inferring what the other side wants: it requires using that information to make advantageous offers and counteroffers over multiple turns. We study whether large language model (LLM) agents do this in a controlled multi-attribute bargaining environment. We find that current LLM agents can model a counterparty's preferences, but do not reliably turn that knowledge into strategic bargaining. When given negotiating partner preference information, agents model it accurately and early in their reasoning traces, yet this does not reliably improve outcomes for the informed side. Turn-level analyses show why: agents often respond to what they believe the counterparty values, but do not consistently pair those moves with gains on their own high-value attributes. Sellers are more accommodating overall, and in asymmetric-information conditions, the informed side often makes the more weakly compensated concessions. Because agents fail to leverage this underlying utility structure for strategic advantage, their final agreements are heavily dictated by surface-level opening anchors rather than actual utility weights. Finally, requiring agents to explicitly state concession-for-reciprocity trades before making an offer makes individual turns look more strategic, but ultimately fails to improve the efficiency of the final agreements.

2605.16573 2026-05-19 cs.LG cs.AI physics.flu-dyn 版本更新

Wavelet Flow Matching for Multi-Scale Physics Emulation

小波流匹配用于多尺度物理模拟

Gabriele Accarino, Juan Nathaniel, Carla Roesch, Pierre Gentine, Sara Shamekh, Duncan Watson-Parris, Viviana Acquaviva

发表机构 * Department of Earth and Environmental Engineering（地球与环境工程系）； Columbia University（哥伦比亚大学）； University of Edinburgh（爱丁堡大学）； Courant Institute of Mathematical Sciences（数学科学学院）； New York University（纽约大学）； Scripps Institution of Oceanography（斯克里普斯海洋研究所）； Halıcıoğlu Data Science Institute（哈利奇数据科学研究所）； University of California San Diego（加州大学圣地亚哥分校）； CUNY New York City College of Technology（纽约市立大学纽约技术学院）； Lamont-Doherty Earth Observatory（拉蒙特-多伊蒂地球观测站）

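AI summary: This paper proposes wavelet flow matching, which performs optimal transport directly in a multi-scale wavelet space to balance stability and accuracy when emulating multi-scale physical systems, enabling more efficient generative emulation.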

从提示到协议：一种用于实验室自动化的AI代理

Angelos Angelopoulos, James F. Cahoon, Ron Alterovitz

发表机构 * Department of Computer Science, University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校计算机科学系）； Department of Chemistry, University of North Carolina at Chapel Hill（北卡罗来纳大学教堂山分校化学系）

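AI summary: This paper presents an AI agent that integrates large language models with laboratory orchestration, letting scientists create and monitor automated lab protocols in natural language and improving experimental efficiency and accuracy.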

AI中文摘要

自动化科学实验室能加快、安全、准确且可重复地执行协议，加速新材料和药物的发现与测试。然而，设置和运行自主实验室需要协调多种仪器和机器人，迫使科学家编写代码、管理配置文件和导航复杂软件架构。本文提出一种AI代理架构，整合大语言模型与实验室编排，使科学家能通过自然语言交互式创建和监控自动化实验协议。该代理集成到实验编排系统（EOS）中，通过代理循环实现自动验证和错误纠正，支持完整的实验生命周期：创建协议、运行和监控协议及闭环优化活动，以及分析结果。一个可视化图编辑器将协议渲染为同步于AI代理协议表示的交互式节点图，使在AI协助和手动协议构建之间无缝切换。在三个覆盖化学、生物学和材料科学的模拟自动化实验室上评估，该AI代理实现了97%的一次性协议生成成功率，并将所需界面操作减少了数量级。

英文摘要

Automating science laboratories enables faster, safer, more accurate, and more reproducible execution of protocols, accelerating the discovery and testing of new materials, drugs, and more. However, setting up and running autonomous labs requires coordinating numerous instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software infrastructure. We present an AI agent architecture that integrates large language models with laboratory orchestration, enabling scientists to interactively create and monitor automated lab protocols using natural language. Integrated into the Experiment Orchestration System (EOS), the AI agent operates under an agentic loop with automated validation and error correction, and supports the complete experimental lifecycle: creating protocols, running and monitoring both protocols and closed-loop optimization campaigns, and analyzing results. A visual graph editor renders protocols as interactive node-based diagrams synchronized with the AI agent's protocol representation, enabling seamless alternation between AI-assisted and manual protocol construction. Evaluated on three simulated automated labs spanning chemistry, biology, and materials science, the AI agent achieves a 97% first-attempt protocol generation success rate and an order of magnitude reduction in required interface actions.

2605.16535 2026-05-19 cs.IR cs.AI 版本更新

RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification

RAPT：基于检索的后处理阈值法用于多标签分类

Lasal Jayawardena, Nirmalie Wiratunga, Ikechukwu Nkisi-Orji, Darren Nicol

发表机构 * Robert Gordon University（罗伯特·戈登大学）； William Nicol (Aberdeen) Limited（威廉·尼科尔（阿伯丁）有限公司）

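AI summary: RAPT uses retrieval augmentation to improve label-set selection in multi-label classification without retraining the model, handling issues such as OCR noise and label imbalance and improving predictive performance and efficiency.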

AI中文摘要

工业多标签文档理解流程中，候选标签的评分和阈值或排序用于形成每个文档的标签集。这一早期选择步骤直接影响下游信息提取的准确性及相关验证工作。实际中，OCR噪声、标签不平衡、实例依赖的标签数量和不对称的误差成本使全局评分阈值变得脆弱且难以维护。本文提出RAPT，一种面向部署的检索增强评分阈值包装器，用于后处理以改进标签集选择而不重新训练基础分类器。RAPT是一种模型无关的包装器：任何提供文档表示用于相似性搜索和每个标签置信度分数的预测器都可以使用，包括度量学习编码器和微调的Transformer分类器。对于每个查询文档，给定分类器的评分向量，RAPT检索相似文档阈值情况（案例）并利用其结果适应查询的标签集选择阈值。适应过程通过局部聚合邻近解（例如平均标签数量、截止校准）来选择最终的标签集。评估比较了多标签分类器（度量学习器和Transformer）结合RAPT与全局和标签级阈值基线，以及少样本LLM。在工业数据集和六个公开基准上，RAPT一致优于全局和标签级静态阈值基线。在工业设置中，RAPT在度量学习器上达到最佳预测性能，宏F1得分为0.87，而微调的Transformer变体平均得分为0.775宏F1，优于少样本LLM基线（K=5）2倍，且需要至少115倍更少的推理时间和13.5倍更少的GPU内存。

英文摘要

Industrial multi-label document understanding pipelines score candidate labels and threshold or rank them to form a label set per document. This early selection step directly affects the accuracy of downstream information extraction from the document, as well as the associated verification effort. In practice, OCR noise, label imbalance, instance-dependent label cardinality, and asymmetric error costs make global score thresholds brittle and hard to maintain as document formats evolve. We present RAPT, a deployment-oriented retrieval-augmented score thresholding wrapper, applied post-hoc to improve label set selection without retraining the underlying classifier. RAPT is a model-agnostic wrapper: any predictor that provides document representations for similarity search and per-label confidence scores can be used, including metric learning encoders and fine-tuned transformer classifiers. For each query document, given a classifier's score vector, RAPT retrieves similar document thresholding situations (cases) and adapts the query's label set selection threshold using their outcomes. The adaptation selects the final label set by locally aggregating neighbour solutions (e.g. average label count, cutoff calibration). Evaluation compared multi-label classifiers (metric learners and transformers) combined with RAPT against global and label-wise thresholding baselines, and against few-shot LLMs. Across an industrial dataset and six public benchmarks, RAPT consistently outperformed global and label-wise static thresholding baselines. In the industrial setting, RAPT achieved its best predictive performance with metric learners, reaching 0.87 Macro-F1, while fine-tuned transformer variants on average achieved 0.775 Macro-F1, outperforming few-shot LLM baselines (K = 5) by 2x and requiring at least 115x less inference time and 13.5x less GPU memory.
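
The adaptation step described in the abstract above (retrieve similar cases, then aggregate their outcomes locally) can be sketched roughly as follows. This minimal version covers only the "average label count" variant: it finds the k nearest stored cases by cosine similarity and keeps that many top-scoring labels for the query. The function name, the case representation, and the choice of cosine similarity are assumptions for illustration, not RAPT's actual interface.

```python
import numpy as np

def select_labels_by_neighbours(query_emb, query_scores, case_embs, case_label_counts, k=3):
    """Pick a label set for the query using the average label count of
    its k most similar stored cases (hypothetical case base)."""
    query_emb = np.asarray(query_emb, dtype=float)
    case_embs = np.asarray(case_embs, dtype=float)
    counts = np.asarray(case_label_counts, dtype=float)
    scores = np.asarray(query_scores, dtype=float)

    # Cosine similarity between the query and every stored case.
    q = query_emb / np.linalg.norm(query_emb)
    c = case_embs / np.linalg.norm(case_embs, axis=1, keepdims=True)
    neighbours = np.argsort(-(c @ q))[:k]

    # Local aggregation: average label count of the neighbours, at least 1.
    n_labels = max(1, int(round(float(counts[neighbours].mean()))))

    # Keep the n_labels highest-scoring labels of the query.
    top = np.argsort(-scores)[:n_labels]
    return sorted(top.tolist())

case_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
case_label_counts = [2, 2, 1, 1]
scores = [0.9, 0.2, 0.6, 0.1]
print(select_labels_by_neighbours([1.0, 0.05], scores, case_embs, case_label_counts, k=2))
```

The same score vector yields a two-label set for a query near the first two cases and a one-label set for a query near the last two, which is the instance-dependent behaviour a single global threshold cannot provide.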

2605.16528 2026-05-19 cs.CY cs.AI 版本更新

Inventorship in AI-Assisted Inventions: Designing an Experiment to Shape Case Law

人工智能辅助发明中的发明人归属：设计实验以塑造判例法

Yevhenii Shchetynin, Duygu Usta, Bryan Khan

发表机构 * University of Turin（都灵大学）

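AI summary: This paper examines inventorship in AI-assisted inventions and proposes an experiment to generate relevant case law, with the aim of clarifying the contribution of AI tools to the inventive process and the criteria for recognizing human inventors.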

AI中文摘要

最新的人工智能进步对知识产权法提出了新挑战，特别是在人工智能辅助发明中的发明人归属问题。尽管大多数司法管辖区只允许自然人被视为发明人，但如何处理人工智能辅助发明仍存争议。主要挑战在于缺乏相关判例法。本文提出实验条件以生成相关判例法，通过涉及AI专家的 stakeholders 参与，提出实验方法和案例选择策略，以确定衡量人类在人工智能辅助发明中贡献的有效方法。

英文摘要

The latest improvements in artificial intelligence (AI) raise new challenges for intellectual property laws, particularly concerning the inventorship issue in AI-assisted inventions - that is, those in which AI is used in the inventive process. While most jurisdictions allow only a natural person to be considered the inventor, the question of how to deal with AI-assisted inventions remains relevant. Namely, what is the nature and contribution of AI tools in an AI-assisted invention that would prevent a human from being recognized as its inventor? The main challenge in addressing this question is the lack of case law on the issue. It is reasonable to assume that with the development of AI and the growing interest in its use in the inventive process, new cases will naturally arise, which in turn will harmonize and address the inventorship issue in AI-assisted inventions to some extent. However, this process will take significant time and may not keep pace with the rapid development of AI, nor fully address the new problems that arise alongside AI advancements. This research proposes the conditions of an experiment to create relevant case law. This experiment could be initiated by society, involving stakeholders specializing in AI. The article also proposes a methodology for conducting the experiment and selecting cases that best reflect the current state of AI use in the inventive process. Conducting such an approach will help identify the most effective methods for measuring human contribution to AI-assisted inventions when determining inventorship.

2605.16527 2026-05-19 cs.LG cs.AI 版本更新

Hypergraph Pattern Machine: Compositional Tokenization for Higher-Order Interactions

超图模式机：用于高阶交互的组合分词

Kyrie Zhao, Zehong Wang, Tianyi Ma, Fang Wu, Xiangru Tang, Pietro Lio, Sheng Wang, Yanfang Ye

发表机构 * University of Notre Dame（圣母大学）； Stanford University（斯坦福大学）； Yale University（耶鲁大学）； University of Cambridge（剑桥大学）； University of Washington（华盛顿大学）

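AI summary: This paper proposes the Hypergraph Pattern Machine, which learns the compositional patterns of subsets to better model higher-order interactions, achieving stronger results on hypergraph benchmarks and in a real-world case study.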

AI中文摘要

超图模型高阶关系，从药物处方到推荐。数据中的核心结构信号是交互组合性：高阶关系是否是组合、涌现或抑制性的。在多药治疗中，制度决定是否停药、保留或排除：组合药物三元组可安全简化，涌现三元组需联合所有药物，抑制三元组标志干扰现有交互的药物。现有超图学习方法仅传播观测超边消息，未建模此信号，导致危险组合被误分类。为此，本文提出超图模式机（HGPM），从消息传递转向学习子集的组合模式。它分词组合子集，组织成包含 DAG，并训练掩码重建的包含意识 Transformer。在十个超图基准上，HGPM 匹配或超越现有方法。值得注意的是，在真实不良事件预测案例中，HGPM 正确识别出抑制副作用的药物添加，而现有方法无法区分。代码和数据见 https://github.com/KryieZhao/HGPM.git.

英文摘要

Hypergraphs model higher-order relations that drive real-world decisions, from drug prescriptions to recommendations. A central structural signal in such data, beyond what pairwise relations can express, is interaction compositionality: whether a higher-order relation is compositional, emergent, or inhibitory with respect to its observed or unobserved sets. In polypharmacy, the regime decides whether a drug should be dropped, kept, or excluded: a compositional drug triple can be safely simplified, an emergent triple requires all drugs jointly, and an inhibitory triple flags a drug that disrupts an existing interaction. However, existing hypergraph learning methods, which merely propagate messages over observed hyperedges, leave this compositional signal unmodeled, allowing dangerous drug combinations to slip through and be misclassified. To this end, we propose the Hypergraph Pattern Machine (HGPM), shifting the paradigm from message passing to learning the compositional pattern of subsets. It tokenizes compositional subsets, organizes them in an inclusion DAG, and trains an inclusion-aware Transformer under masked reconstruction. On ten hypergraph benchmarks, HGPM matches or exceeds state-of-the-art methods. Notably, in a real adverse-event prediction case, HGPM correctly identifies the drug addition that inhibits the side effect among feature-identical candidates, a discrimination existing methods cannot make. The code and data are in https://github.com/KryieZhao/HGPM.git.

2605.16516 2026-05-19 cs.HC cs.AI cs.CL cs.CY 版本更新

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

长期人类-大语言模型交互中的对齐漂移：一种机制导向的框架

Xintong Yao

发表机构 * Xintong Yao（姚新同）

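AI summary: This paper proposes a mechanism-oriented framework for describing alignment drift in long-term human-LLM interaction, explains how drift develops through feedback loops and sub-pattern selection, and frames alignment drift as a recursive interactional process rather than an isolated model-side failure.

Comments 16 pages, 1 appendix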

DOI: 10.5281/zenodo.20113611

AI中文摘要

长期与基于大语言模型的系统交互可能导致对齐漂移：一种渐进过程，其中系统输出逐渐受用户当前消息的约束减少，而更多受先前交互历史影响，尽管仍显得有帮助、连贯和响应。此过程难以检测，因为用户的主观体验可能随着系统变得更熟悉、有用和适应而改善。现有研究主要集中在短期任务表现、孤立输出或单实例对齐问题，导致慢性和累积的交互层面动态未被充分描述。本文提出一种机制导向的框架来描述对齐漂移。该框架定义信号A和信号B的区别，解释漂移如何通过反馈回路和子模式选择发展，将过程分为三个互动阶段，并识别控制漂移的边界条件。通过将对齐漂移视为递归互动过程而非孤立模型失败，本文为研究长期人类-系统交互提供了概念基础。

英文摘要

Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user's current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user's subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.

2605.16514 2026-05-19 cs.RO cs.AI 版本更新

No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task

无计划，却有人类：一种反应式机器人模型预测临床任务中人类计划失败

Michael Migacev, Vito Mengers, Antonia Köngeter, Oliver Brock

发表机构 * Robotics and Biology Laboratory, Technische Universität Berlin, Germany（技术大学柏林机器人与生物学实验室，德国）； Science of Intelligence, Research Cluster of Excellence, Berlin, Germany（智能科学，卓越研究集群，柏林，德国）； Robotics Institute Germany（德国机器人研究所）

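AI summary: The study applies AICON, a reactive gradient-descent framework, to the Tower of London test to capture the reactive behavior that emerges when human planning capacity is reduced. AICON reproduces the human difficulty ordering across 24 problems more accurately than structural task parameters and generalizes to held-out problems, suggesting that its core abstraction reflects how biological systems are organized.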

AI中文摘要

理解为何某些顺序规划问题比其他问题更难需要超越平均性能的模型。这些模型应捕捉问题难度的具体模式，并理想情况下以与人类计划能力下降时相同的方式失败。我们应用为机器人操作开发的AICON反应式梯度下降框架，应用于塔罗伦敦测试，该测试用于评估帕金森病、轻度认知障碍和中风患者的规划能力。在不进行任何前瞻规划或了解人类认知的情况下，AICON在24个问题上更准确地再现了人类的细粒度难度排序，优于结构任务参数，并在留出验证中泛化到新问题。关键的是，AICON在计划能力下降的群体中优于计划基线，而计划基线更好地捕捉健康对照组。这种分离由原始AICON论文预测，该论文指出模型的失败模式与帕金森患者在目标层次结构上挣扎但不移动计数的情况相似。这表明，随着计划能力的下降，人类行为会转向AICON所建模的反应模式。这一发现扩展了更广泛的模式：AICON最初为机器人开发，现在能捕捉生物行为在感知、眼动和顺序规划方面的特征，表明其核心抽象反映了生物系统组织方式的真实特性。

英文摘要

Understanding why some sequential planning problems are harder than others requires models that go beyond average performance. They should capture the specific pattern of which problems are hard, and ideally fail in the same way people do when planning capacity is reduced. We apply AICON, a reactive gradient-descent framework developed for robotic manipulation, to the Tower of London test, a cognitive test used to assess planning in Parkinson's disease, mild cognitive impairment, and stroke. Without any lookahead planning or knowledge of human cognition, AICON reproduces the fine-grained human difficulty ordering across 24 problems better than structural task parameters and generalizes to held-out problems in a leave-two-out evaluation. Crucially, AICON outperforms a planning baseline for groups with reduced planning capacity while the planning baseline better captures healthy controls. This dissociation was predicted by the original AICON paper, which noted that the model's failure modes resemble those of Parkinson's patients who struggle with goal hierarchies but not move counts. This suggests that as planning capacity is reduced, human behavior shifts toward the reactive mode AICON models. The finding extends a broader pattern: AICON, originally built for robotics, now captures aspects of biological behavior across perception, eye movements, and sequential planning, suggesting its core abstraction reflects something real about how biological systems are organized.

2605.16508 2026-05-19 cs.CL cs.AI 版本更新

The Scaling Laws of Skills in LLM Agent Systems

大语言模型代理系统中技能的扩展规律

Charles Chen, Qiming Yu, Yuhang Gu, Zhuoye Huang, Hanjing Li, Hongyu Liu, Simin Liu, Jinhao Liu, Dengyun Peng, Jiangyi Wang, Zheng Yan, Fanqing Meng, Ethan Qin, Carl Che, Mengkang Hu

发表机构 * Evolvent AI Team（Evolvent AI团队）

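AI summary: The study identifies two coupled scaling laws for skills in large agent systems: routing accuracy decays logarithmically with library size, and, on the execution side, joint routing is approximately multiplicative while correct execution improves difficult downstream decisions. The two laws are coupled through the routing decay slope, and law-guided optimization yields substantial performance gains.

Comments Technical Report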

AI中文摘要

随着代理系统规模扩大，技能积累为大规模可重用库，但其扩展规律仍不明确。在15个前沿LLM、1141个现实技能及超300万次路由或执行决策中，发现两个耦合规律。路由规律：单步路由准确性随库大小对数衰减（R²>0.97），错误从局部技能竞争发展到跨家族漂移并被过于通用的'黑洞技能'捕获。执行规律：在状态实现前，联合路由近似乘法，正确执行可提升困难下游任务表现约4倍。单参数路由对数衰减斜率b耦合二者：路由侧拟合预测执行侧救援，显示同一库属性控制预执行崩溃和下游恢复能力。这些结果表明代理性能不仅取决于模型能力，还取决于技能库的结构、粒度和暴露策略。

英文摘要

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.
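
The routing law in the abstract above says accuracy decays logarithmically with library size. One way to estimate such a decay slope is an ordinary least-squares fit of accuracy against the log of library size. The sketch below assumes the functional form acc(N) = a - b ln N and uses synthetic, noise-free points generated from that form; the data, the function name, and the parameter values are illustrative assumptions, not figures from the paper.

```python
import numpy as np

def fit_log_decay(library_sizes, accuracies):
    """Fit acc = a - b * ln(N) by least squares.

    Returns (a, b, r_squared), where b is the logarithmic decay slope."""
    x = np.log(np.asarray(library_sizes, dtype=float))
    y = np.asarray(accuracies, dtype=float)
    slope, intercept = np.polyfit(x, y, 1)
    residuals = y - (intercept + slope * x)
    r_squared = 1.0 - residuals.var() / y.var()
    return intercept, -slope, r_squared

# Synthetic points generated from the assumed form (a = 0.95, b = 0.05).
sizes = [10, 30, 100, 300, 1000]
synthetic_acc = [0.95 - 0.05 * np.log(n) for n in sizes]

a, b, r2 = fit_log_decay(sizes, synthetic_acc)
print(f"a = {a:.3f}, b = {b:.3f}, R^2 = {r2:.3f}")
```

With real measurements, the R^2 value from such a fit is the kind of goodness-of-fit statistic the abstract reports per model, and the fitted b plays the role of the decay slope that couples the two laws.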

2605.16481 2026-05-19 cs.CV cs.AI 版本更新

Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

视觉代理记忆：通过在线索引、分层记忆和代理检索实现在线长视频理解

Aiden Yiliu Li, Nels Numan, Anthony Steed

发表机构 * University College London（伦敦大学学院）

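AI summary: This paper proposes Visual Agentic Memory, a framework that enables long video understanding through online indexing, hierarchical memory, and agentic retrieval; experiments show strong results on OVO-Bench and MM-Lifelong.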

AI中文摘要

长视频理解需要比大上下文窗口更多的内容，还需要一种记忆机制，决定保留哪些视觉证据，保持其在长时间范围内可搜索，并使后续推理基于可恢复的观察而非压缩的潜在状态。我们提出了视觉代理记忆（VAM），一种无需训练的框架，包含三个组件。在线索引支持在流式约束下选择性证据保留。分层记忆将保留的证据组织成并行表示，使时间上下文与空间观察对齐。代理检索在生成基于证据的答案前搜索、检查和验证候选证据。在OVO-Bench上，VAM在所有报告的基线中取得了最高的RT+BT平均值（68.41），优于使用相同基础MLLM（Gemini 3 Flash，67.46）的端到端方法。在MM-Lifelong train@month的月度分割（105.6小时覆盖51天）上，VAM达到17.11%，仅次于使用GPT-5的ReMA（17.62%）。这些结果表明，长时间视频理解受益于将视觉记忆视为显式、可检查和可查询的基质。代码可在https://github.com/yiliu-li/Visual-Agentic-Memory获取。

英文摘要

Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu-li/Visual-Agentic-Memory.

2605.16480 2026-05-19 q-bio.BM cs.AI 版本更新

MoleCode unlocks structural intelligence in large language models

MoleCode 解锁大型语言模型中的结构智能

Zhiyuan Yan, Chen Liu, Boxuan Zhao, Kaiqing Lin, Jixiang Zhao, Yimi Wang, Liuzhenghao Lv, Hao Li, Shanzhuo Zhang, Li Yuan, Fanyang Mo

发表机构 * Peking University Shenzhen Graduate School

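AI summary: MoleCode introduces a graph-explicit molecular language that lets large language models operate directly on molecular structure, improving performance on molecular reasoning, editing, generation, and analysis tasks, especially when structural access is limiting.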

AI中文摘要

分子是图，但大型语言模型（LLMs）通常通过线性字符串来推理分子。最流行的分子表示SMILES将原子、键、分支和环压缩成紧凑序列，其中拓扑结构是隐含的，迫使LLMs在执行化学操作前重建分子结构。本文介绍MoleCode，一种LLM原生、无需训练、图显的分子语言，其中所有分子组件均以带类型实体和持久标识符表示，并有显式关系。MoleCode使分子拓扑结构在语言上下文中直接可读、可编辑和可审计，使LLM能够操作结构而非从语法中恢复。在分子推理、编辑、生成和分析任务中，这种表征转变在结构访问受限时对前沿LLMs效果最显著：不熟悉的分子、拓扑敏感操作、更大的结构和重复的聚合物。它还改变了推理的分配方式，用更短的、化学导向的推理替代长推理轨迹用于隐含结构重建。在分子优化中，这使能够进行局部、属性对齐的编辑，保持结构相似性。相同的子图-节点-边语法扩展到聚合物、Markush结构、机制式转换和交织的科学文档，包括包含化学信息的科研论文和专利披露，其中化学信息分布于文本和图像中。这些结果表明，科学对象与LLMs之间的接口不应将结构视为从文本中解码的东西。当推理对象是关系时，结构本身应成为语言的一部分。

英文摘要

Molecules are graphs, but large language models (LLMs) are usually asked to reason about them through linear strings. The most popular molecular representation, SMILES, compresses atoms, bonds, branches and rings into a compact sequence in which topology is implicit, forcing LLMs to reconstruct molecular structure before performing the requested chemical operation. Here we introduce MoleCode, an LLM-native, training-free, graph-explicit molecular language in which all molecular components are represented as typed entities with persistent identifiers and explicit relations. MoleCode makes molecular topology directly readable, editable and auditable within the language context, allowing an LLM to operate on structure rather than recover it from syntax. Across molecular reasoning, editing, generation and analysis tasks, this representational shift improves frontier LLMs most strongly when structural access is limiting: unfamiliar molecules, topology-sensitive operations, larger structures and repetitive polymers. It also changes how inference is allocated, replacing long reasoning traces devoted to implicit structural reconstruction with shorter, more chemically directed reasoning over explicit atoms and bonds. In molecular optimization, this enables localized, property-aligned edits that preserve structural similarity to the starting compounds. The same Subgraph--Node--Edge grammar extends beyond small molecules to polymers, Markush structures, mechanism-style transformations and interleaved scientific documents, including research articles and patent disclosures in which chemical information is distributed across text and images. These results suggest that the interface between scientific objects and LLMs should not treat structure as something to be decoded from text. When the object of reasoning is relational, the structure itself should be part of the language.

2605.16479 2026-05-19 cs.IR cs.AI 版本更新

Policy-Grounded Dynamic Facet Suggestions for Job Search

基于政策的动态面建议用于求职搜索

Dan Xu, Baofen Zheng, Qianqi Shen, Jianqiang Shen, Wenqiong Liu, Chunnan Yao, Ping Liu, Rajat Arora, Kevin Kao, Hsiang Lin, Wanjun Jiang, Yusuke Takebuchi, Jingwei Wu, Wenjing Zhang

发表机构 * LinkedIn Corporation（LinkedIn公司）

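AI summary: This paper presents dynamic facet suggestion (DFS) to improve intent disambiguation and relevant job retrieval in job search. It surfaces personalized semantic attributes in real time, combining offline taxonomy curation, embedding-based retrieval, and small-language-model scoring to improve suggestion precision and user engagement.

Comments 6 pages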

DOI: 10.1145/3805712.3808443

AI中文摘要

求职者常以短且不明确的查询开始搜索。在LinkedIn上，超过80%的与工作相关的查询包含三个或更少的关键词，这使得准确推断用户意图和检索相关职位特别具有挑战性。我们提出了动态面建议（DFS），一种交互式查询细化机制，通过实时揭示基于用户-查询上下文的个性化语义属性来促进意图歧义消除。我们提出了一种基于政策的、检索增强的排名框架用于面建议，包括离线分类整理、基于嵌入的检索前K候选者以及基于提炼的小语言模型（SLM）的候选者评分。系统通过单个token评分、批处理和前缀缓存进行优化，以实现实时服务。离线评估显示生成建议的高精度，而在线A/B测试显示建议参与度和求职结果的显著改进。

英文摘要

Job seekers often initiate search with short, underspecified queries. At LinkedIn, over 80% of job-related queries contain three or fewer keywords, making accurate user intent inference and relevant job retrieval particularly challenging. We present dynamic facet suggestion (DFS), an interactive query refinement mechanism that facilitates intent disambiguation by surfacing personalized semantic attributes conditioned on the joint user-query context in real time. We propose a policy-grounded, retrieval-augmented ranking framework for facet suggestion, comprising offline taxonomy curation, embedding-based retrieval of top-K candidates, and distilled small language model (SLM) based candidate scoring. The system is optimized for real-time serving via pointwise single-token scoring with batching and prefix caching. Offline evaluation demonstrates high precision for generated suggestions, and online A/B tests show significant improvements in suggestion engagement and job search outcomes.

2605.16474 2026-05-19 cs.IR cs.AI 版本更新

LERA: LLM-Enhanced RAG for Ad Auction in Generative Chatbots

LERA：基于大语言模型的生成聊天机器人广告拍卖

Haoran Sun, Xinrui Song, Xinyu Zhang, Zhaohua Chen, Xu Chu, Zhilin Zhang, Chuan Yu, Jian Xu, Bo Zheng, Xiaotie Deng

发表机构 * Peking University（北京大学）； Alibaba Group（阿里巴巴集团）； Shandong University（山东大学）

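AI summary: LERA is a two-stage retrieve-then-generate auction framework that uses embedding-based coarse filtering followed by LLM-prompted relevance scoring, improving ad selection accuracy and insertion diversity.

Comments Work in Progress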

AI中文摘要

将广告拍卖机制整合到基于大语言模型（LLM）的聊天机器人中，为商业化提供了重要机会，但需在相关性、效率和用户体验之间取得平衡。最近，Feizi等人和Hajiaghayi等人提出了检索后生成范式，将检索与生成解耦，提供轻量级广告插入和支付确定。然而，当前检索仅依赖文本嵌入相似性，可能导致商业误解和重复插入问题。本文提出LERA，一种针对LLM聊天机器人的两阶段检索生成拍卖框架。第一阶段通过嵌入粗过滤预选少量候选广告商。第二阶段通过精心设计的提示查询LLM，生成候选人的logits作为优化的相关性评分。这些评分与报价结合，关键值支付规则考虑粗过滤和细排名阈值，确保对效用最大化广告商的诚实性。该框架自然扩展到动态对话流中的多个广告插入和长响应。在合成广告商-查询基准上的实验表明，LERA显著提高了广告选择准确性和插入多样性，同时仅引入可控的延迟开销。

英文摘要

The integration of advertising auction mechanisms into large language model (LLM)-based chatbots presents a significant opportunity for commercialization, yet poses unique challenges in balancing relevance, efficiency, and user experience. Recently, Feizi et al. and Hajiaghayi et al. outlined a retrieve-then-generate paradigm that decouples retrieval and generation, offering lightweight ad insertion and payment determination. However, current retrieval relies solely on text embedding similarity, which may lead to commercial misinterpretation and issues such as repetitive insertions. In this paper, we propose LERA, a two-stage retrieve-then-generate auction framework tailored for LLM chatbots. In the first stage, embedding-based coarse filtering pre-selects a small set of candidate advertisers. In the second stage, the LLM itself is queried with a carefully designed prompt to produce logits over candidates, which serve as refined organic relevance scores. These scores are combined with bids, and a critical-value payment rule accounts for both the coarse-filtering and fine-ranking thresholds, ensuring truthfulness for utility-maximizing advertisers. The framework naturally extends to multiple ad insertions within dynamic dialogue flows and long responses. Experiments on a synthetic advertiser-query benchmark show that LERA substantially improves ad selection accuracy and insertion diversity while incurring only controllable latency overhead.
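
A minimal sketch of the fine-ranking stage only, to make the critical-value idea concrete. It assumes (the abstract does not say this) that a candidate's ranking score is the product of its bid and its LLM-derived relevance, and that a single ad slot is sold. The name `run_auction` and all numbers are hypothetical, and the coarse-filtering threshold mentioned in the abstract is left out.

```python
def run_auction(bids, relevance):
    """Single-slot auction with a critical-value payment (illustrative only).

    Assumption: score = bid * relevance. The winner pays the lowest bid that
    would still match the runner-up's score, so the payment does not depend
    on the winner's own bid.
    """
    scores = {ad: bids[ad] * relevance[ad] for ad in bids}
    ranked = sorted(scores, key=scores.get, reverse=True)
    winner = ranked[0]
    runner_up_score = scores[ranked[1]] if len(ranked) > 1 else 0.0
    payment = runner_up_score / relevance[winner]
    return winner, payment


bids = {"ad_a": 2.0, "ad_b": 3.0, "ad_c": 1.0}       # hypothetical bids
relevance = {"ad_a": 0.9, "ad_b": 0.5, "ad_c": 0.8}  # hypothetical relevance scores
winner, payment = run_auction(bids, relevance)
```

Because the payment is set by the runner-up rather than by the winner's bid, overbidding or underbidding cannot lower the price paid while still winning, which is the usual intuition behind truthfulness in this kind of rule.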

2605.16470 2026-05-19 cs.LG cs.AI 版本更新

Peak-Detector：通过指令调优的大语言模型实现可解释的多模态峰值检测

Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou

发表机构 * University of Georgia（佐治亚大学）； Yixing People’s Hospital（宜兴人民医院）； University of Washington（华盛顿大学）； University of North Carolina at Charlotte（北卡罗来纳大学夏洛特分校）

AI总结本文提出Peak-Detector框架，利用指令调优的大语言模型实现跨模态、可解释的峰值检测，通过峰表示技术压缩时间序列数据并提升检测准确性，同时生成解释性内容以支持验证与错误分析。

详情

AI中文摘要

准确检测多种心脏生理信号（如心电图、脉搏波容积图、球状心图和体震图）中的峰值对心血管监测至关重要，但常受伪影和信号变异影响。传统算法通常基于专家知识针对单一信号模态设计，限制了通用性。相比之下，深度学习方法缺乏可解释性，限制了专家验证和人机交互。为此，我们引入Peak-Detector框架，利用指令调优的大语言模型（LLMs）实现稳健、跨模态且可解释的峰值检测。框架的核心创新是“峰表示”技术，将时间序列数据转换为压缩格式，在保留关键事件信息的同时显著减少信号长度。此表示提供关键的归纳偏差，引导LLM在生理有意义的事件上推理而非原始噪声数据。模型通过监督微调（SFT）后接强化学习（RL）的多目标奖励函数进行优化。模型的自解释能力通过在自建的Peak-Explanation数据集上微调来培养。在四个模态（ECG、PPG、BCG和BSG）覆盖七个数据集（六个公开基准加一个真实世界队列）上，Peak-Detector展示了强大的跨模态性能，实现了临床相关时间容忍度下的最佳或并列最佳检测。除了准确性外，生成的解释性内容揭示了失败模式并支持验证和错误分析。

英文摘要

Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities (ECG, PPG, BCG, and BSG) spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.

2605.16449 2026-05-19 cs.LG cs.AI 版本更新

PESD-TSF: A Period-Aware and Explicit Structured Decomposition Framework for Long-Term Time Series Forecasting

PESD-TSF：一种周期感知和显式结构分解框架，用于长期时间序列预测

Hua Wang, Xianhao Jiao, Fan Zhang

发表机构 * School of Computer and Artificial Intelligence（计算机与人工智能学院）； Ludong University（鲁东大学）； School of Computer Science and Technology（计算机科学与技术学院）； Shandong Technology and Business University（山东工商学院）

AI总结 PESD-TSF通过引入周期性门控机制、多尺度编码器和跨尺度协作注意力，解决深度网络中周期感知减弱和变量间依赖破坏的问题，提升多变量时间序列预测性能。

Comments 23 pages, 9 figures, 13 tables

详情

AI中文摘要

深度预测模型常面临周期感知减弱和趋势-噪声表示混乱的问题，且通道独立范式虽提高训练稳定性，却破坏变量间动态协调，阻碍多变量时间序列中变量一致性建模。为此，我们提出PESD-TSF，一种受物理启发的结构分解框架，旨在同时强调可解释性和预测准确性。PESD-TSF引入三个关键设计：首先，乘法周期性门控机制整合连续时间先验，动态调节信号幅度，保持深度层间的周期结构；其次，多尺度结构编码器整合去趋势注意力与分层采样，显式分离长期趋势与高频变化，同时保留细粒度时间语义；第三，为恢复被破坏的变量依赖，我们提出跨尺度协作注意力（CSCA）与RLC正则化方案，重构深度特征空间中的全局变量拓扑，并通过正交性和一致性约束实现物理一致的协作。在多个领域的基准数据集上进行的广泛实验表明，PESD-TSF始终实现了最先进的性能，在涉及复杂变量耦合的多变量预测任务上提升尤为明显，突显其优越的结构建模能力和泛化能力。

英文摘要

Deep forecasting models often suffer from attenuated periodic perception and entangled trend-noise representations as network depth increases. Moreover, the widely adopted channel-independent paradigm, while improving training stability, disrupts intrinsic dynamic coordination among variables, hindering the modeling of cross-variable consistency in multivariate time series. To address these issues, we propose PESD-TSF, a physics-inspired structured decomposition framework for long-term time series forecasting that jointly emphasizes interpretability and predictive accuracy. PESD-TSF introduces three key designs. First, a Multiplicative Periodic Gating mechanism incorporates continuous-time priors to dynamically modulate signal amplitudes, preserving periodic structures across deep layers. Second, a multi-scale structured encoder integrates detrended attention with hierarchical sampling to explicitly decouple long-term trends from high-frequency variations while retaining fine-grained temporal semantics. Third, to recover disrupted inter-variable dependencies, we propose Cross-Scale Collaborative Attention (CSCA) together with an RLC regularization scheme, which reconstructs global inter-variable topology in deep feature spaces and enforces physically consistent collaboration through orthogonality and consistency constraints. Extensive experiments on benchmark datasets from multiple domains demonstrate that PESD-TSF consistently achieves state-of-the-art performance, with particularly strong gains on multivariate forecasting tasks involving complex inter-variable coupling, highlighting its superior structural modeling capability and generalization.

2605.16444 2026-05-19 cs.CV cs.AI 版本更新

Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images

扩散注意力专家模型用于预测和半自动定位肺癌组织病理图像中的STAS

Liangrui Pan, Jiadi Luo, Yuxuan Xiao, Chenchen Nie, Xiaoshuai Wu, Songqing Fan, Ling Chu, Manqiu Li, Rongfang He, Zhenyu Zhao, Ruixing Wang, Shulin Liu, Yiyi Liang, Xiang Wang, Qingchun Liang, Shaoliang Peng

发表机构 * College of Computer Science and Electronic Engineering, Hunan University（湖南大学计算机科学与电子工程学院）； Department of Pathology, The Second Xiangya Hospital, Central South University（中南大学湘雅二医院病理科）； Hunan Clinical Medical Research Center for Cancer Pathogenic Genes Testing and Diagnosis（湖南省肿瘤致病基因检测与诊断临床医学研究中心）； Department of Thoracic Surgery, The Second Xiangya Hospital, Central South University（中南大学湘雅二医院胸外科）； Department of pathology, Hunan Cancer Hospital, The Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University（中南大学湘雅医学院附属肿瘤医院、湖南省肿瘤医院病理科）； Department of Pathology, The Third Xiangya Hospital, Central South University（中南大学湘雅三医院病理科）； Department of Pathology, First People's Hospital of Pingjiang County（平江县第一人民医院病理科）； Department of Pathology, the First Affiliated Hospital, Hengyang Medical School, University of South China（南华大学衡阳医学院附属第一医院病理科）； Department of Radiology, The Second Xiangya Hospital of Central South University（中南大学湘雅二医院放射科）； Department of Radiology, Xiangya Hospital, Central South University（中南大学湘雅医院放射科）； Oncology Department and State Key Laboratory of Systems Medicine for Cancer of Shanghai Cancer Institute, Renji Hospital, School of Medicine, Shanghai Jiaotong University（上海交通大学医学院附属仁济医院肿瘤科、上海市肿瘤研究所癌症系统医学国家重点实验室）

AI总结本文提出DAEM模型，通过多尺度特征学习和双分支架构提升STAS检测精度，实现对冷冻切片和石蜡切片的高AUC值检测，并利用肿瘤微环境特征实现STAS半自动定位。

Comments Accepted by Nature Communications

详情

AI中文摘要

准确的术中和术后STAS诊断对指导肺癌手术决策和术后管理至关重要。然而，组织病理学评估耗费人力且易出现漏诊或误诊。我们提出扩散注意力专家模型（DAEM）用于检测冷冻切片（FSs）和石蜡切片（PSs）中的STAS。其扩散注意力专家模块利用全注意力聚合学习多尺度特征，而双分支架构强化多尺度特征表示。在内部数据集中，DAEM在FSs和PSs上分别达到0.8946和0.9112的AUC值。在八个机构的外部多中心数据集上验证显示，模型具有强泛化性和可解释性。利用PSs中的肿瘤微环境（TME）特征，进一步实现了STAS位置及其与原发肿瘤距离的半自动测量。多个定量TME指标被识别为STAS的潜在生物标志物，包括微泡型STAS。总体而言，DAEM通过在FSs和PSs上实现准确且可解释的检测，为STAS评估提供临床可操作的框架，通过基于定量TME的分析支持术后风险分层。

英文摘要

Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.

2605.16443 2026-05-19 cs.LG cs.AI 版本更新

Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning

二值对称循环矩阵：在深度学习中的应用

Jayakrishna Amathi, Venkata Prasanth Yanambaka, Saraju P. Mohanty, Elias Kougianos

发表机构 * Department of Computer Science and Engineering（计算机科学与工程系）； University of North Texas（北德克萨斯大学）； Division of Computer Science（计算机科学部）； Texas Woman’s University（德克萨斯女子大学）

AI总结本文提出二值对称循环矩阵，通过每层仅使用两个权重实现极稀疏结构，显著降低存储需求，实验显示在MNIST和MIT-BIH数据集上参数减少超过80倍，同时保持较高精度，适用于边缘计算和低功耗系统。

详情

AI中文摘要

尽管深度神经网络在视觉、医疗诊断和物联网场景中取得成功，但其在资源受限平台上的部署面临严峻挑战，由于存储需求高、计算复杂度大和占用空间大。特别是全连接层需要大量权重，使边缘设备难以容纳。为克服与有限平台相关的挑战，本文提出二值对称循环矩阵（TVSCM），一种非常稀疏的架构，每层仅使用两个权重以保持循环和对称性。极结构稀疏架构的存储成本与传统全权重存储相比几乎可以忽略不计。与传统稀疏学习技术如低秩近似和剪枝方法不同，该架构提供极稀疏形式，实现极低的存储需求。模拟研究显示，在MNIST数据集上参数从623,290减少到7,852，MIT-BIH心律失常数据集上从24,709减少到942，同时保持在MNIST上97.6%到93.5%的精度，在MIT-BIH上97.6%到93.1%的精度。由于其极低的架构需求和非常低的功耗，该架构适用于边缘计算平台、微型机器学习平台、IoMT系统和电池供电系统。

英文摘要

Despite the success of deep neural networks in vision, medical diagnosis, and IoT scenarios, their deployment on resource-limited platforms poses serious challenges due to their high storage requirements, computational complexity, and large footprint. In particular, fully connected layers require a large number of weights, making it difficult for edge devices to accommodate them. To overcome these challenges associated with limited platforms, this paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a very sparse architecture that employs just two weights per layer to keep it circulant and symmetric. The extreme form of structured sparse architecture provides negligible storage costs compared to traditional full-weight storage. Instead of hardware and additional stages of other traditional sparse learning techniques, such as low-rank approximation and pruning approaches, this architecture provides an extreme form of sparsity, achieving very minimal storage requirements. The simulation study demonstrates more than 80$\times$ reduction in model parameters, reducing parameters from 623,290 to 7,852 on MNIST and from 24,709 to 942 on the MIT-BIH arrhythmia dataset, while maintaining comparable accuracy from 97.6% to 93.5% on MNIST and from 97.6% to 93.1% on MIT-BIH. Due to its minimal architectural requirements and very low power consumption, this architecture would be ideal for edge computing platforms, tiny-ML platforms, IoMT systems, and battery-powered systems.
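
A minimal sketch of what a two-valued symmetric circulant weight matrix can look like. The abstract only states that two weights per layer are used and that the matrix is circulant and symmetric; the placement rule below (value chosen by the parity of the circular distance from the diagonal) is an assumption made for illustration, and the function name is hypothetical.

```python
def two_valued_symmetric_circulant(n, a, b):
    """Build an n x n circulant, symmetric matrix from only two values.

    Assumed pattern: entry k of the first row depends on the circular
    distance min(k, n - k), which guarantees first_row[k] == first_row[n - k]
    and therefore symmetry. Every other row is a cyclic shift of the first.
    """
    first_row = [a if min(k, n - k) % 2 == 0 else b for k in range(n)]
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


W = two_valued_symmetric_circulant(5, a=0.5, b=-0.25)
dense_params = 5 * 5   # weights stored by an unstructured 5 x 5 layer
stored_params = 2      # only a and b need to be stored here
```

The point of the structure is that the whole matrix can be regenerated from the two stored values and the fixed pattern, so storage no longer grows with the layer width.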

2605.16442 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

面向环境感知长时域船舶轨迹预测的分层两阶段框架

Ganeshaaraj Gnanavel, Tharindu Fernando, Sridha Sridharan, Clinton Fookes

发表机构 * SAIVT Research Group, Queensland University of Technology（SAIVT研究组，昆士兰理工大学）

AI总结本文提出分层两阶段框架，结合长短期预测器与网格感知短期预测器，通过分层融合机制提升船舶轨迹预测精度，实验显示在ADE和FDE上优于现有方法。

详情

AI中文摘要

长航程船舶轨迹预测在真实海洋条件下对碰撞避免、交通管理和路线规划至关重要。然而，由于长距离时间依赖性和动态环境因素如洋流、风和波浪，实现准确预测具有挑战性。为此，我们提出一种分层两阶段框架，通过分层融合机制结合粗略长时预测器与网格感知的短时预测器。短时分支利用离散化海事单元上的时空图变换器捕捉局部动态，而长时分支编码总体航行意图。集成的环境模块利用洋流参数、风向量和显著波高，通过跨模态注意和特征调制实现对不同海况的适应性响应。此外，可学习的Savitzky-Golay平滑层增强了融合轨迹的时间一致性。我们在澳大利亚船队跟踪系统（CTS）数据上进行了评估，数据来自西北地区，并与Copernicus海洋服务产品对齐，使用3小时输入和10小时预测时间范围。实验结果表明，我们的框架在平均位移误差（ADE）和最终位移误差（FDE）上比现有方法提高了25%和17%。消融研究进一步验证了每个组件的贡献。

英文摘要

Long-horizon vessel trajectory forecasting under real ocean conditions is critical for collision avoidance, traffic management, and route planning. However, achieving accurate predictions is challenging due to long-range temporal dependencies and dynamic environmental factors such as currents, wind, and waves. To address these issues, we propose a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through a hierarchical fusion mechanism. The short-term branch leverages a Spatio-Temporal Graph Transformer on discretized maritime cells to capture localized dynamics, while the long-term branch encodes overarching navigational intent. An integrated environmental module incorporates oceanographic parameters, including surface currents, wind vectors, and significant wave height, using cross-modal attention and feature-wise modulation for adaptive response to varying sea conditions. Additionally, a learnable Savitzky-Golay smoothing layer enhances temporal coherence in fused trajectories. We evaluate our approach on Australian Craft Tracking System (CTS) data from the North West region, aligned with Copernicus Marine Service products, using a 3-hour input and a 10-hour prediction horizon. Experimental results show that our framework outperforms the state-of-the-art by 25% in Average Displacement Error (ADE) and 17% in Final Displacement Error (FDE). Ablation studies further validate the contribution of each component.
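
For readers unfamiliar with the two metrics, the sketch below computes Average Displacement Error (mean point-wise distance over the horizon) and Final Displacement Error (distance at the last step). It assumes positions are already in a planar coordinate system; real AIS-style latitude/longitude tracks would need a geodesic distance instead. The function name and sample tracks are hypothetical.

```python
import math


def ade_fde(predicted, actual):
    """Return (ADE, FDE) for two equally long sequences of (x, y) points."""
    errors = [math.dist(p, a) for p, a in zip(predicted, actual)]
    ade = sum(errors) / len(errors)  # average over every predicted step
    fde = errors[-1]                 # error at the end of the horizon only
    return ade, fde


predicted = [(0, 0), (1, 0), (2, 0)]  # hypothetical forecast
actual = [(0, 0), (1, 1), (2, 2)]     # hypothetical ground truth
ade, fde = ade_fde(predicted, actual)
```

FDE is usually the larger of the two on long horizons because drift accumulates, which is why both numbers are reported.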

2605.16441 2026-05-19 cs.LG cs.AI 版本更新

DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition

DeepArrhythmia：基于选择性证据获取的片段上下文化心电图心律失常分类

Jiahui Li, Ruili Fang, Zishuai Liu, WenZhan Song, Jin Lu, Fei Dou

发表机构 * University of Georgia（佐治亚大学）

AI总结 DeepArrhythmia通过选择性证据获取实现片段上下文化的心电图心律失常分类，结合原始ECG信号和渲染波形图像，利用专门工具将生理测量与证据整合解耦，提升多心跳节律上下文下的心律失常检测精度。

详情

AI中文摘要

心电图（ECG）心律失常检测旨在为每条心跳分配一个心律失常类别，但许多现有系统将心跳视为孤立的局部实例，限制了对多心跳节奏上下文的依赖。我们提出DeepArrhythmia，一种工具导向的多模态框架，用于段落上下文化的心跳级ECG心律失常分类。给定一个多心跳ECG段，DeepArrhythmia结合原始ECG信号和渲染的波形图像，定位R峰以识别心跳实例，并生成结构化的心跳级预测。该框架通过专门工具分离生理测量与证据整合，用于心跳定位、数值节奏-形态提取和形态聚焦的文本分析。DeepArrhythmia利用段级置信度在最小和丰富证据状态之间路由，因为更丰富的生理证据并不总是有用。这种代理设计整合了节奏上下文、显式生理基础和选择性证据获取以进行决策。

英文摘要

Beat-level Electrocardiography (ECG) arrhythmia detection aims to assign an arrhythmia class to each beat in a recording, yet many existing systems treat beats as isolated local instances. This is limiting because beat labels often depend on multi-beat rhythm context, including timing, compensatory pauses, and beat-to-beat morphological consistency. We present DeepArrhythmia, a tool-grounded multimodal framework for segment-contextualized beat-level ECG arrhythmia classification. Given a multi-beat ECG segment, DeepArrhythmia combines the raw ECG signal and a rendered waveform image, localizes R peaks to identify beat instances, and produces structured beat-level predictions. The framework decouples physiological measurement from evidence integration using specialized tools for beat localization, numerical rhythm-morphology extraction, and morphology-focused textual analysis. DeepArrhythmia uses segment-level confidence to route between minimal and rich evidence states, since richer physiological evidence is not uniformly useful. This agentic design integrates rhythm context, explicit physiological grounding, and selective evidence acquisition for decision making.

2605.16440 2026-05-19 cs.CV cs.AI 版本更新

Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

通过新视角合成实现语义平滑的鲁棒SAR图像分类

Daniel Brignac, Fengwei Tian, Banafsheh Latibari, Abhijit Mahalanobis, Ravi Tandon

发表机构 * The University of Arizona（亚利桑那大学）

AI总结本文提出语义平滑方法，通过新颖视角合成模型生成结构化随机变换，提升SAR图像分类在对抗攻击下的鲁棒性，并提高干净分类准确率。

详情

AI中文摘要

深度神经网络对对抗扰动敏感，限制了其在安全关键应用中的部署，如合成孔径雷达（SAR）自动目标识别（ATR）。随机化平滑通过在噪声输入上平均预测来提高鲁棒性，但各向同性噪声常无法保持SAR图像的语义结构。我们提出语义平滑，一种防御方法，用由新颖视角合成模型生成的结构化随机变换取代基于噪声的扰动。对于SAR，我们根据获取几何学合成多个可能的雷达视角。在生成的随机视角上进行预测并聚合，以形成鲁棒分类器。实验表明，语义平滑在标准攻击（如FGSM和PGD）以及SAR特定攻击（如OTSA和SMGAA）中提高了鲁棒性，同时提高了干净分类准确率。这些结果表明，通过保留语义的几何变换进行随机化平滑，是结构感知领域对抗防御的一种有前景的替代方案。

英文摘要

Deep neural networks are vulnerable to adversarial perturbations, limiting deployment in safety-critical applications such as synthetic aperture radar (SAR) automatic target recognition (ATR). Randomized smoothing improves robustness by averaging predictions over noisy inputs, but isotropic noise often fails to preserve the semantic structure of SAR imagery. We propose semantic smoothing, a defense that replaces noise-based perturbations with structured randomized transformations generated by a novel view synthesis model. For SAR, we condition on acquisition geometry to synthesize multiple plausible radar views. Predictions across generated randomized views are aggregated to form a robust classifier. Experiments show that semantic smoothing improves robustness against standard attacks, such as FGSM and PGD, and SAR-specific attacks, such as OTSA and SMGAA, while also increasing clean classification accuracy. These results demonstrate that randomized smoothing via semantically preserving geometric transformations is a promising alternative to isotropic noise for adversarial defense in structured sensing domains.
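
The aggregation step can be sketched as a majority vote over randomly transformed copies of the input. Everything concrete here is a stand-in: a circular shift replaces the novel view synthesis model, the one-pixel `brittle_classifier` replaces a real SAR network, and the names and the use of majority voting (rather than another aggregation rule) are assumptions.

```python
import random
from collections import Counter


def smoothed_predict(classify, transform, x, n_views=25, seed=0):
    """Classify n_views randomly transformed copies of x and return the majority label."""
    rng = random.Random(seed)
    votes = Counter(classify(transform(x, rng)) for _ in range(n_views))
    return votes.most_common(1)[0][0]


def circular_shift(x, rng):
    """Toy structure-preserving transform: rotate the signal by a random offset."""
    k = rng.randrange(len(x))
    return x[k:] + x[:k]


def brittle_classifier(view):
    """Toy model whose decision hinges on a single input value."""
    return "target" if view[0] > 0 else "clutter"


clean = [1.0] * 8
attacked = [-1.0] + [1.0] * 7  # a single perturbed value flips the base classifier
```

On `attacked`, the base classifier is fooled, while most shifted views move the perturbed value away from the position the classifier depends on, so the vote tends to recover the original label.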

2605.16439 2026-05-19 cs.CV cs.AI 版本更新

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

KVCapsule：面向具有非对称冗余的视觉-语言模型的高效序列KV缓存压缩

Yingbing Huang, Tharun Adithya Srikrishnan, Steven K. Reinhardt, Deming Chen

发表机构 * University of Illinois Urbana-Champaign（伊利诺伊大学厄巴纳-香槟分校）； AMD

AI总结本文提出KVCapsule，一种针对视觉语言模型的KV缓存压缩框架，通过轻量压缩和重建组件实现内存节省，提升吞吐量并减少内存占用，同时保持精度。

详情

AI中文摘要

视觉-语言模型（VLMs）作为大型语言模型（LLMs）的重要扩展，通过文本和图像输入实现多模态推理。尽管VLMs增强了语言模型的能力，但它们也继承并放大了关键计算瓶颈：自回归解码过程中大规模键值（KV）缓存带来的内存开销。这一挑战在VLMs中尤为严重，因为图像生成更长的token序列和更密集的特征表示，相比文本。此外，视觉token的空间和信息丰富性引入了结构化的注意力模式，使得许多针对LLM的KV缓存压缩技术在直接应用于VLMs时效果不佳。在本文中，我们对视觉token的行为进行了详细的实证分析，突显其与纯文本模型的关键差异。基于这些见解，我们提出KVCapsule，一种新的视觉token的KV缓存压缩框架。KVCapsule保持预训练VLM骨干网络冻结，不需要修改注意力计算模块，并且可以通过轻量级压缩和重建组件集成到现有VLMs中。我们评估了KVCapsule在多个VLMs和基准任务上的性能，证明在60%的压缩率下，TPS提升达2倍，KV缓存内存减少达2.4倍，同时精度或响应质量几乎没有下降。我们的发现为在受限内存预算下扩展VLM推理提供了实用路径，并启发进一步研究结构感知的缓存压缩方法以多模态模型。

英文摘要

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.

2605.16438 2026-05-19 cs.LG cs.AI 版本更新

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

通过量子退火器上基于QUBO的客户端选择实现拜占庭鲁棒的联邦学习

Andras Ferenczi, Sutapa Samanta, Dagen Wang, Jason Qizhe Qin

发表机构 * Columbia University（哥伦比亚大学）

AI总结本文提出利用量子退火解决联邦学习中的拜占庭容错问题，通过将客户端选择转化为二次无约束二元优化问题，提升对恶意更新的检测能力。

Comments 9 pages, 6 figures, 8 tables

详情

AI中文摘要

联邦学习（FL）在分布式客户端上训练全局模型，但规模扩大时易受恶意更新攻击。本文提出一种量子退火方法，将客户端选择转化为二次无约束二元优化（QUBO）问题，通过量子退火器求解。QUBO方法在小规模客户端中优于MultiKrum，但在大规模客户端中性能下降。本文引入MultiSignal集成方法，结合欧几里得和余弦Krum分数差距，将攻击分类为四个阶段并路由恶意攻击至受惩罚的QUBO。实验表明，MultiSignal在MNIST数据集上达到95.3%的检测准确率，显著优于传统MultiKrum方法。

英文摘要

Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients against their nearest neighbors and can miss malicious updates that preserve the statistical properties of honest ones. We propose a quantum annealing approach that reformulates client selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem, encoding pairwise distances into a cost function solved by quantum annealers (QA). Unlike MultiKrum's greedy per-client scoring, the QUBO formulation jointly optimizes over all subsets to find the mutually closest group of $m$ clients. At small scale (15 clients), QUBO outperforms MultiKrum on the most challenging Byzantine attacks: e.g., Advanced LIE is detected with 95.11% accuracy versus 81.33% on MNIST and 97.78% versus 75.56% on CIFAR-10. QUBO fares poorly on simpler attacks where MultiKrum excels, so the two methods are complementary. QUBO quality also degrades as the number of clients grows. To address this, we introduce a MultiSignal ensemble that uses a dual-feature routing gate based on Euclidean and cosine Krum score gaps to classify attacks into four regimes and routes evasion attacks to a suspicion-penalized QUBO with agreement voting. At 100 clients on MNIST, MultiSignal achieves 95.3% average detection accuracy versus 91.8% for classical MultiKrum, with the largest gains on Sparse Lie (72.0% to 95.2%, +23.2 points) and Advanced Lie (80.4% to 85.2%, +4.8 points). These results show that QUBO-based quantum annealing with MultiSignal is a principled and scalable defense against the most challenging Byzantine strategies in federated learning.
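
A small sketch of how selecting the mutually closest group of $m$ clients can be written as a QUBO-style energy. The penalty term that forces exactly $m$ selections, the penalty weight, and the exhaustive search that stands in for the quantum annealer are all assumptions for illustration; the paper's actual encoding may differ.

```python
import math
from itertools import combinations, product


def qubo_energy(x, dist, m, penalty):
    """Pairwise-distance cost of the selected clients plus a cardinality penalty."""
    pair_cost = sum(dist[i][j] * x[i] * x[j]
                    for i, j in combinations(range(len(x)), 2))
    return pair_cost + penalty * (sum(x) - m) ** 2


def select_clients(updates, m, penalty=100.0):
    """Brute-force minimiser over all binary vectors (annealer stand-in, small n only)."""
    n = len(updates)
    dist = [[math.dist(u, v) for v in updates] for u in updates]
    best = min(product((0, 1), repeat=n),
               key=lambda x: qubo_energy(x, dist, m, penalty))
    return {i for i, bit in enumerate(best) if bit}


# Four tightly clustered updates and two outliers (hypothetical 2-D gradients).
updates = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0), (5.1, 5.0)]
selected = select_clients(updates, m=4)
```

Unlike a per-client score, the energy is evaluated on whole subsets, so two colluding outliers that sit close to each other do not help one another into the selected group.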

2605.16436 2026-05-19 cs.CR cs.AI 版本更新

The End of Trust: How Agentic AI Breaks Security Assumptions

信任的终结：代理式AI如何打破安全假设

Osama Zafar, Alexander Nemecek, Erman Ayday

发表机构 * Dept. of Computer and Data Sciences（计算机与数据科学系）； Case Western Reserve University（凯斯西储大学）

AI总结代理AI打破了传统安全假设，允许高保真的定制欺骗在大众市场层面实现。本文提出'无限冒充者'攻击模型，并探讨从验证行为转向评估行动的安全范式转变。

详情

AI中文摘要

在数十年中，数字交互的安全性依赖于一个未被承认的经济约束。攻击者面临欺骗的保真度与可部署规模之间的权衡。说服性冒充需要持续的人力投入，仅限于高价值目标，而大众市场攻击牺牲了可信度以换取覆盖面。检测系统、验证机制和用户意识培训都隐式地校准到这种权衡所产生的廉价欺骗制品。代理AI消解了这种权衡，使高保真的个性化欺骗能够在大众市场层面产生。我们主张这种转变耗尽了安全范式，而非仅仅加剧威胁景观。我们引入了无限冒充者攻击模型，其中自主代理在双方已相互信任的双方之间插入，劫持现有关系而非从头建立新的关系。以检测为导向的防御共享一个假设，即生成性进步正在消除，合成输出可以区分于真实输出。我们提出一种默认怀疑范式，将安全从验证行为者转向评估行为，并探讨当平台成为数字交互监管基础时产生的治理张力。

英文摘要

For decades, the security of digital interaction has rested on an unacknowledged economic constraint. Attackers faced a tradeoff between the fidelity of a deception and the scale at which it could be deployed. Convincing impersonation required sustained human effort and was confined to a narrow set of high-value targets, while mass-market attacks sacrificed plausibility for reach. Detection systems, verification mechanisms, and user awareness training have all been implicitly calibrated to the artifacts of cheap deception that this tradeoff produced. Agentic AI collapses the tradeoff, allowing high-fidelity, individually tailored deception to be produced at mass-market scale. We argue that this shift exhausts a security paradigm rather than merely intensifying the threat landscape. We introduce the Infinite Impostor, an attack model in which an autonomous agent interposes itself between two parties who already trust each other, hijacking an existing relationship rather than building a new one from scratch. Detection-oriented defenses share an assumption that generative progress is eliminating, that synthetic outputs are distinguishable from authentic ones. We propose a suspect-by-default paradigm that shifts security from authenticating actors to evaluating actions, and examine the governance tensions that arise when platforms become the regulatory substrate of digital interaction.

2605.16435 2026-05-19 cs.LG cs.AI 版本更新

QuantFPFlow：用于连续强化学习中Fokker-Planck策略优化的量子振幅估计

Abraham Itzhak Weinberg

发表机构 * AI-WEINBERG, AI Experts（AI-WEINBERG人工智能专家）

AI总结 QuantFPFlow通过量子振幅估计提升连续强化学习中Fokker-Planck策略优化的效率，实现算法复杂度从O(1/ε²)到O(1/ε)的平方加速，并在多模态奖励景观中发现全局最优解。

详情

AI中文摘要

我们引入QuantFPFlow，一种将量子振幅估计整合到随机策略优化的Fokker-Planck（FP）公式中的强化学习框架。经典连续空间RL代理必须以$\mathcal{O}(1/\varepsilon^{2})$的代价估计FP配分函数$Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$；QuantFPFlow用Grover增强的振幅估计器替代，实现$\mathcal{O}(1/\varepsilon)$的可证明二次加速。尽管完全量子加速需要容错硬件，此处展示的量子启发经典模拟已表现出$\mathcal{O}(1/\varepsilon)$的算法结构。估计的稳态分布$\rho^{\star}$驱动有理论依据的探索奖励$R_{\mathrm{aug}} = R_{\mathrm{env}} + \alpha\log(1/\rho^{\star}(s))$。此奖励将代理引导至多模态奖励景观的全局最优区域，同时通过FP扩散匹配约束策略方差。在专门设计用于暴露局部最优失败的连续控制任务中，QuantFPFlow实现平均奖励$1{,}295.7 \pm 423.2$，而Soft Actor-Critic（SAC）为$1{,}284.0 \pm 474.0$，同时发现全局最优的频率高10.4%（33.9% vs. 30.7%）。策略熵在整个训练过程中保持在$H(\pi)\approx 6.5$纳特附近，而SAC下降至$1.5$纳特，证实FP扩散匹配主动防止过早收敛。维度实验进一步显示QuantFPFlow的计算规模为$\mathcal{O}(d^{0.35})$，而经典FP估计为$\mathcal{O}(d^{0.76})$。

英文摘要

We introduce QuantFPFlow, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker-Planck (FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function $Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$ at cost $\mathcal{O}(1/\varepsilon^{2})$; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving $\mathcal{O}(1/\varepsilon)$, a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the $\mathcal{O}(1/\varepsilon)$ algorithmic structure. The estimated stationary distribution $\rho^{\star}$ drives a theoretically grounded exploration bonus $R_{\mathrm{aug}} = R_{\mathrm{env}} + \alpha\log(1/\rho^{\star}(s))$. This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward $1{,}295.7 \pm 423.2$ versus $1{,}284.0 \pm 474.0$ for Soft Actor-Critic (SAC), while discovering the global optimum 10.4% more frequently (33.9% vs. 30.7%). Policy entropy remains near $H(\pi) \approx 6.5$ nats throughout training, whereas SAC collapses to $1.5$ nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of $\mathcal{O}(d^{0.35})$ for QuantFPFlow versus $\mathcal{O}(d^{0.76})$ for classical FP estimation.
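
To make the partition function and the exploration bonus concrete, here is a purely classical one-dimensional sketch. It uses a midpoint-rule integral rather than amplitude estimation, and the quadratic potential, the integration range, and all function names are assumptions chosen so the result can be checked against the closed form $Z = \sqrt{2\pi}$ for $V(x) = x^2/2$, $D = 1$.

```python
import math


def partition_function(potential, diffusion, lo=-10.0, hi=10.0, steps=20000):
    """Midpoint-rule estimate of Z = integral of exp(-V(x)/D) over [lo, hi]."""
    h = (hi - lo) / steps
    return h * sum(math.exp(-potential(lo + (k + 0.5) * h) / diffusion)
                   for k in range(steps))


def stationary_density(x, potential, diffusion, z):
    """Normalised stationary density exp(-V(x)/D) / Z."""
    return math.exp(-potential(x) / diffusion) / z


def augmented_reward(env_reward, density, alpha):
    """Environment reward plus alpha * log(1 / density) as the exploration bonus."""
    return env_reward + alpha * math.log(1.0 / density)


def quadratic_potential(x):
    return 0.5 * x * x


Z = partition_function(quadratic_potential, diffusion=1.0)
common = stationary_density(0.0, quadratic_potential, 1.0, Z)  # frequently visited state
rare = stationary_density(3.0, quadratic_potential, 1.0, Z)    # rarely visited state
```

States with low stationary density receive a larger bonus, which is what pushes the agent away from regions it already visits often.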

2605.16427 2026-05-19 cs.CV cs.AI 版本更新

EAGT: Echocardiography Augmentation for Generalisability and Transferability

EAGT：面向泛化性与可迁移性的超声心动图数据增强

Soroush Elyasi, Sara Adibzadeh, Nasim Dadashi Serej, Julie Wall, Massoud Zolgharni

发表机构 * THRIVE Centre, University of West London（西伦敦大学THRIVE中心）； University of West London（西伦敦大学）； School of Computing and Engineering, University of West London（西伦敦大学计算机与工程学院）

AI总结本文研究了29种数据增强技术及其组合对左心室分割的通用性和可迁移性影响，发现几何变换优于强度增强，且最佳组合提升模型鲁棒性。

详情

AI中文摘要

深度学习模型在超声分割中常难以跨机构、设备和患者群体泛化，因收集大量一致标注数据不现实。数据增强广泛用于提升模型鲁棒性，但其在超声中的跨数据集泛化作用尚不明确。本文评估了29种数据增强技术及其配对组合，使用U-Net在Unity、CAMUS和EchoNet Dynamic数据集上进行2D左心室分割。每种增强方法在不同超参数设置下，通过Dice和IoU在域内和跨域场景下重复运行评估，统计显著性通过独立t检验量化。结果表明，解剖合理几何变换，特别是仿射、位移-缩放-旋转、透视和随机水平翻转，显著提升跨数据集性能，而激进的强度或伪影增强常降低泛化能力。配对增强组合优于单个增强，尤其以随机水平翻转与仿射组合在大多数迁移场景中表现一致。这些发现为设计增强策略提供了实证指导，以增强超声分割模型的鲁棒性和可迁移性。

英文摘要

Deep learning models for echocardiography segmentation often struggle to generalise across institutions, scanners, and patient populations, where collecting large, consistently annotated datasets is infeasible. Data augmentation is widely used to improve the robustness of deep learning models; however, its role in enhancing cross-dataset generalisability in echocardiography remains insufficiently understood. This study presents a large-scale multi-dataset evaluation of 29 data augmentation techniques and their pairwise combinations for 2D left ventricular segmentation using a U-Net trained on Unity, CAMUS, and EchoNet Dynamic datasets. Each augmentation was explored under several hyperparameter settings and assessed through repeated runs using Dice and IoU in both in-domain and cross-dataset scenarios, with statistical significance quantified via independent t-tests. Results show that anatomically plausible geometric transformations, particularly affine, shift-scale-rotate, perspective, and random horizontal flip, substantially improve cross-dataset performance, whereas aggressive intensity- or artefact-based augmentations often degrade generalisability. Pairwise augmentation combinations outperform individual augmentations and show that moderate flip-centric combinations, especially random horizontal flip with affine, yield consistent gains across most transfer scenarios. These findings provide empirically grounded guidance for designing augmentation policies that enhance the robustness and transferability of echocardiography segmentation models.
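
The two overlap metrics used in the evaluation, and the simplest of the geometric augmentations, can be sketched in a few lines. Masks are assumed to be flat lists of 0/1 values and images lists of rows; the function names and the convention of returning 1.0 for two empty masks are assumptions.

```python
def dice_and_iou(pred, target):
    """Dice coefficient and intersection-over-union for two binary masks."""
    inter = sum(p * t for p, t in zip(pred, target))
    pred_sum, target_sum = sum(pred), sum(target)
    union = pred_sum + target_sum - inter
    dice = 2 * inter / (pred_sum + target_sum) if pred_sum + target_sum else 1.0
    iou = inter / union if union else 1.0
    return dice, iou


def horizontal_flip(image):
    """Mirror each row; the same flip must be applied to the image and its mask."""
    return [row[::-1] for row in image]


dice, iou = dice_and_iou([1, 1, 0, 0], [1, 0, 1, 0])
flipped = horizontal_flip([[1, 2, 3], [4, 5, 6]])
```

Dice is never smaller than IoU on the same pair of masks, so the two numbers should be compared only with like for like across studies.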

2605.16421 2026-05-19 cs.LO cs.AI 版本更新

Orthologic for SAT Solving

正交逻辑用于SAT求解

Vladislas de Haldat, Simon Guilloud, Viktor Kunčak

发表机构 * EPFL（洛桑联邦理工学院）

AI总结本文提出一种新的正交逻辑公式蕴含判定算法，避免了先前方法的高成本预处理阶段，同时保持相同的最坏复杂度。通过合成SAT基准测试，展示了正交逻辑归一化在某些难题中的提升效果。

2605.16419 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments

基于代理的自同步多视角关节角度监控管道：在无标定环境中

Juncheng Yu, Lusi A, Haoxuan Xie, Weiming Wang

发表机构 * National Engineering Research Center of Neuromodulation, School of Aerospace Engineering, Tsinghua University（神经调制国家工程研究中心，航空航天工程学院，清华大学）

AI总结本文提出了一种基于代理的自同步多视角关节角度监控方法，利用两台摄像头在无标定环境下实现自动视频同步和自验证，通过多模态大语言模型和先进单目2D姿态估计模型提取候选姿态，并通过代理选择机制自动识别和跟踪目标个体，以在多人和遮挡情况下产生一致的2D姿态，从而估计关节角度。

Comments Accepted by EMBC 2026. 7 pages, 3 figures

Abstract

Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, and an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. Such 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against a Vicon system demonstrated strong performance, achieving an MAE of $5.97^\circ \pm 2.36^\circ$ and a Pearson correlation coefficient of $0.962 \pm 0.014$. The proposed method is expected to provide a practical, patient self-deployable system to perform daily kinematic monitoring in uncalibrated home environments.
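
As a rough illustration of the quantities reported above, the sketch below computes a joint angle from three keypoints and then the MAE and Pearson correlation between an estimated and a reference angle series. It assumes NumPy, and it only covers the geometric definition and the two metrics; it is not the paper's multi-view optimization. The function names and all numbers are hypothetical.

```python
import numpy as np

def joint_angle_deg(a, b, c):
    """Angle at joint b (in degrees) between segments b->a and b->c."""
    a, b, c = (np.asarray(p, dtype=float) for p in (a, b, c))
    u, v = a - b, c - b
    cos_theta = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    # Clip guards against tiny floating-point overshoot outside [-1, 1].
    return float(np.degrees(np.arccos(np.clip(cos_theta, -1.0, 1.0))))

def mae_and_pearson(estimated, reference):
    """Mean absolute error and Pearson r between two angle series."""
    estimated = np.asarray(estimated, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mae = float(np.mean(np.abs(estimated - reference)))
    r = float(np.corrcoef(estimated, reference)[0, 1])
    return mae, r

# Hypothetical hip, knee, and ankle positions forming a right angle at the knee.
knee_angle = joint_angle_deg((0, 1, 0), (0, 0, 0), (1, 0, 0))

# Hypothetical estimated vs. reference angle series (degrees).
mae, r = mae_and_pearson([10.0, 20.0, 30.0], [12.0, 22.0, 32.0])
```

A constant offset between the two series, as in this toy example, leaves the Pearson correlation at its maximum while still producing a nonzero MAE, which is why the two metrics are usually reported together.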

2605.16418 2026-05-19 cs.CV cs.AI (updated version)

Neural Visual Decoding via Cognitive guided Adaptive Blurring and Information Constrained Alignment

Fan Yin, Chuhang Zheng, Peiliang Gong, Donghai Guan, Qi Zhu

Affiliations: Department of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics; Department of Electrical and Information Engineering, Tianjin University

AI summary: This paper proposes the CAIA framework, which uses cognitive-guided adaptive blurring and information-constrained alignment to map neural signals to visual semantics more accurately, improving Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval.

Abstract

EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically operate on static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for neural visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.

2605.16416 2026-05-19 cs.CV cs.AI (updated version)

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang, Dian Yang, Mingyu Zhang, Yuhua Fu, Shao-Lun Huang

Affiliations: Tsinghua University; Peking University; Zhejiang University of Technology

AI summary: CAVE improves fragmented visual reasoning through a structured process-reward mechanism, introducing three complementary signals to optimize reasoning steps and improve model reliability and robustness.

Comments 24 pages, 6 figures. Preprint

Abstract

Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.

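2605.16411 2026-05-19 cs.CV cs.AI cs.CL cs.DB cs.LG (updated version)

Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

Qinwu Xu

Affiliations: Meta AI

AI summary: This paper proposes a stage-wise preference optimization framework that builds hallucination-focused data to strengthen grounded reasoning in vision-language models, reducing hallucination and making responses more informative.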

Abstract

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, for example in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.
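
To make the role of preference pairs concrete, the sketch below evaluates the standard DPO objective for a single (grounded, hallucinated) response pair. It uses only the Python standard library. The log-probabilities, the default `beta`, and the function name are hypothetical, and the abstract does not say how the stage-wise framework sets any of them.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Inputs are sequence log-probabilities under the policy (logp_*) and
    under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss equals log(2).
loss_neutral = dpo_loss(-5.0, -5.0, -5.0, -5.0)

# Policy that raises the grounded (chosen) response and lowers the
# hallucinated (rejected) one relative to the reference: lower loss.
loss_better = dpo_loss(-4.0, -6.0, -5.0, -5.0)
```

A rejected response that differs only minimally from the chosen one gives the two log-probabilities a similar starting point, so the gradient concentrates on the tokens that carry the visual inconsistency.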

2605.16398 2026-05-19 cs.RO cs.AI (updated version)

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

Marios Papamichalis, Regina Ruane

Affiliations: Human Nature Lab, Yale University; Department of Statistics and Data Science, The Wharton School, University of Pennsylvania

AI summary: This paper proposes VHYDRO, a variational hybrid dynamics learner that mixes the learned proposal with a feasible transition law to prevent branch loss, jointly infers the continuous state and the discrete contact mode, and provides three guarantees for sparse port-Hamiltonian law recovery.

Abstract

Contact-rich robot dynamics are hybrid: a single observation can match several latent states and contact regimes (free, impact, stick--slip). A standard amortized filter that places no probability on a feasible contact transition will permanently lose the branch the robot actually follows. We introduce VHYDRO, a variational hybrid dynamics learner that prevents this branch loss. At each step, VHYDRO mixes the learned proposal with a feasible transition law before sampling and importance weighting, ensuring that every transition retained by the model-feasible carrier remains covered. VHYDRO jointly infers a continuous latent state and a discrete contact mode, and fits a sparse port-Hamiltonian law to each recovered regime. On top of this, three guarantees connect: support coverage stabilizes filtering, the stabilized filter concentrates the discrete contact posterior on coherent regimes, and mode-pure segments admit sparse port-Hamiltonian recovery. The recovery error separates cleanly into filtering, derivative, mode-impurity, and physics-residual parts. Three empirical findings track the same mechanism. Under heavy occlusion the support-safe filter stays usable while a non-defensive proposal collapses. On ManiSkill demonstrations and on four Sawyer/BridgeData task families the discrete state forms temporally coherent contact-regime segments and yields a stronger joint profile across ARI, change-point F1, and segment purity than post-hoc and mode-free baselines. On hybrid systems with known equations the mode-conditioned sparse fit recovers the active physical terms; purely predictive baselines do not.
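
The mixing step described above resembles defensive importance sampling, and a minimal sketch over three discrete contact modes is given below. It assumes NumPy. The probabilities, the mixing weight `eps`, and the use of the feasible law itself as the importance-sampling target are simplifying assumptions for illustration and are not VHYDRO's actual construction.

```python
import numpy as np

def defensive_proposal(learned_q, feasible_p, eps=0.1):
    """Mix a learned proposal with a feasible transition law."""
    learned_q = np.asarray(learned_q, dtype=float)
    feasible_p = np.asarray(feasible_p, dtype=float)
    return (1.0 - eps) * learned_q + eps * feasible_p

def importance_weights(target_p, proposal_q, samples):
    """Importance weights target/proposal evaluated at the sampled modes."""
    return target_p[samples] / proposal_q[samples]

modes = ["free", "impact", "stick-slip"]
feasible_p = np.array([0.6, 0.3, 0.1])  # hypothetical feasible transition law
learned_q = np.array([0.9, 0.1, 0.0])   # learned proposal that drops "stick-slip"

eps = 0.1
mixed_q = defensive_proposal(learned_q, feasible_p, eps=eps)

rng = np.random.default_rng(0)
samples = rng.choice(len(modes), size=1000, p=mixed_q)
weights = importance_weights(feasible_p, mixed_q, samples)
```

Because the mixture is at least `eps` times the feasible law everywhere, every feasible mode keeps nonzero proposal mass and the weights are bounded by `1 / eps`, whereas the unmixed `learned_q` could never sample "stick-slip" at all.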

2605.16397 2026-05-19 cs.CV cs.AI (updated version)

Trajectory-Aware Adaptive Inference in Object Detection Models

Grigorios Papanikolaou, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

Affiliations: Department of Electrical and Computer Engineering, National Technical University of Athens, Greece; Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece

AI summary: This paper uses GPS trajectory data to optimize inference in object detection models, introducing an early-exit mechanism that reduces computational cost and improves the efficiency of real-time perception.

Comments Accepted to the MuseKDE workshop of the IEEE MDM 2026 conference

Abstract

The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues such as inter-vessel distances. Frames in which vessels are separated by short distances and are converging at high speed are processed using the full model, while only a subset of the network's architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated from the inter-object distance and the rate at which that distance decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.
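
The gating rule described above can be sketched as a small decision function. This is only an illustration under stated assumptions: the class `VesselPairState`, both threshold values, and the function names are hypothetical, and the abstract does not say how the detector combines distance and closing rate in practice.

```python
from dataclasses import dataclass

@dataclass
class VesselPairState:
    distance_m: float       # current inter-vessel distance from GPS fixes
    prev_distance_m: float  # distance at the previous time step
    dt_s: float             # seconds between the two GPS fixes

def closing_speed(state: VesselPairState) -> float:
    """Rate at which the distance decreases (positive when converging)."""
    return (state.prev_distance_m - state.distance_m) / state.dt_s

def use_full_model(state: VesselPairState,
                   max_distance_m: float = 500.0,
                   min_closing_speed_mps: float = 5.0) -> bool:
    """Run the full detector only for close, fast-converging vessels.

    Both thresholds are placeholder values chosen for this example.
    """
    return (state.distance_m <= max_distance_m
            and closing_speed(state) >= min_closing_speed_mps)

close_and_converging = VesselPairState(300.0, 400.0, 10.0)
far_apart = VesselPairState(2000.0, 2100.0, 10.0)
close_but_diverging = VesselPairState(300.0, 250.0, 10.0)

decisions = [use_full_model(s)
             for s in (close_and_converging, far_apart, close_but_diverging)]
```

Frames for which the function returns `False` would take the early exit and run only part of the network, which is where the reported savings in inference time come from.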

2605.16393 2026-05-19 cs.CV cs.AI (updated version)

Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

Joel Valdivia Ortega, Tingying Peng, Marion Jasnin

Affiliations: Helmholtz Pioneer Campus, Helmholtz Munich, Neuherberg, Germany; School of Computation, Information and Technology, TUM, Garching, Germany; Department of Chemistry, TUM, Garching, Germany

AI summary: This paper proposes ViTC-UNet, which conditions a UNet on pre-trained ViT representations through learnable tokens and a two-way attention decoder to improve the accuracy and adaptability of biomedical semantic segmentation.

Abstract

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, which may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.

2605.16391 2026-05-19 eess.SP cs.AI cs.LG cs.RO (updated version)

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video Processing

Ao Li, Zihan Xiao, Zihao Yue, Boshen Xu, Linli Yao, Jiaze Li, Pei Fu, Jianzhong Ju, Jian Luan, Qin Jin

Affiliations: AIM3 Lab, Renmin University of China; MiLM Plus, Xiaomi Inc.

AI summary: StreamPro introduces the CB-Stream loss and GRPO to improve proactive decision-making in streaming video understanding, and on StreamPro-Bench it substantially outperforms the previous best.

Abstract

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.
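
The group-relative part of GRPO can be sketched in a few lines: each sampled response is scored against the other responses drawn for the same prompt. The sketch uses only the Python standard library. The reward values and the function name are hypothetical, the population standard deviation is one possible normalization choice (implementations differ), and the paper's multi-grained turn-level and trajectory-level reward design is not reproduced here.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward by the mean and std of its own sample group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps avoids division by zero when every response gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical scalar rewards for four responses sampled for one prompt.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Because the baseline is the group mean, no separate value network is needed, and a group in which all responses receive the same reward contributes zero advantage.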

2605.16380 2026-05-19 cs.LG cs.AI (updated version)

ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction

Jinwoong Kim, Sangjin Park

Affiliations: Department of Industrial Data Engineering, Hanyang University, Seoul, Republic of Korea; Hanyang University

AI summary: ReTAMamba reconstructs clinical time series as time-variable token sequences, estimates observation reliability from missingness and elapsed time, and uses Chronological Weaving to integrate short- and long-term temporal information, improving prediction on irregular time series.

Comments 11 pages

Abstract

Clinical time-series data are difficult to model with methods designed for regular sequences because they exhibit irregular sampling, frequent missing values, and heterogeneous observation patterns across variables. Existing approaches commonly use observation masks and time-gap information, but they do not continuously capture the decaying reliability of past observations or consistently organize multi-resolution information within a coherent temporal context during aggregation. To address these limitations, we propose Reliability-aware Temporal Aggregation with Mamba (ReTAMamba), which reconstructs clinical time series as time-variable token sequences, estimates observation reliability from missingness and elapsed time, and augments interval summaries with statistical descriptors. Chronological Weaving is used to integrate short- and long-term temporal information within a coherent temporal context, and a budgeted token router is applied to constrain sequence length while preserving informative summaries. Experiments on MIMIC-IV, eICU, and PhysioNet 2012 show that ReTAMamba consistently improves AUPRC over strong baselines, with average relative gains of 7.51%, 7.80%, and 10.15%, respectively. Cohort-level and patient-level analyses on eICU further showed that the learned mean decay for more dynamic signals, such as heart rate and blood pressure, was 24.3% larger than that for relatively static signals, such as laboratory test variables. These findings suggest that effective prediction in irregular clinical time series requires modeling not only what was measured, but also when and how it was observed, including information freshness and observation timeliness.

2605.16379 2026-05-19 cs.LG cs.AI cs.IT math.IT (updated version)

An Information-Theoretic Criterion for Efficient Data Synthesis

Hanyu Li, Zhengqi Sun, Xiaotie Deng

Affiliations: CFCS, School of Computer Science, Peking University, Beijing, China; Department of Information Management, Peking University, Beijing, China

AI summary: This paper proposes an information-openness criterion for the generation-training loop, arguing that synthetic data is effective only when external signals inject task-relevant information, which in turn shapes model efficiency and generalization.

Comments 12 pages. Camera-ready version for ICML 2026

Abstract

Synthetic data has become crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.

2605.16378 2026-05-19 cs.LG cs.AI (updated version)

Mixing Times of Glauber Dynamics on Masked Language Models

Suvadip Sana, Sami Wolf, Neer Mehta, Alina Shah, Aitzaz Shaikh, Janna Goodman, Lionel Levine

Affiliations: Department of Statistics and Data Science; Cornell University; Department of Mathematics

AI summary: This paper studies the global distributional behavior of masked language models under iterative generation, analyzing mixing times through a Glauber dynamics Markov chain and showing a phase transition in mixing behavior across temperatures.

Comments 21 pages, 7 figures

Abstract

Masked language models (MLMs) define local conditional distributions over tokens but do not, in general, correspond to any consistent joint distribution over sequences. This raises a fundamental question: what global distributional behavior is induced when such conditionals are used iteratively for generation? We address this question by modeling iterative masked-token resampling as a Glauber dynamics Markov chain on the discrete space of token sequences. We first show that MLM conditionals are intrinsically incompatible: we introduce a rectangle test that certifies this incompatibility and empirically verify its prevalence across modern MLMs. We then provide a theoretical analysis of the induced Markov chain. Under bounded cross-token influence, we establish a high-temperature contraction result implying $O(n\log n)$ mixing time where $n$ is the sequence length. In contrast, we prove that under a uniform local margin condition, the chain exhibits metastability, with exponentially slow escape from semantic basins at low temperatures. Empirically, we demonstrate a phase transition in mixing behavior as a function of temperature and sequence length, consistent with the theoretical predictions. We further characterize the induced stationary behavior through semantic trajectories, identifying persistent structures such as long-lived traps and recurrent semantic basins, with political content serving as a measurable case study.
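
The resampling chain described above can be sketched with a toy stand-in for the language model. In the example below, `toy_logits` is a hypothetical Potts-like scoring rule that rewards agreement with neighboring tokens; unlike a real MLM, its conditionals do come from a consistent joint distribution, so the sketch only illustrates the update rule, not the incompatibility result. Only the Python standard library is used, and the vocabulary, temperature, and step count are arbitrary.

```python
import math
import random

VOCAB = ["a", "b", "c"]

def toy_logits(sequence, position):
    """Hypothetical stand-in for an MLM: score tokens by neighbor agreement."""
    neighbors = [sequence[j] for j in (position - 1, position + 1)
                 if 0 <= j < len(sequence)]
    return [float(sum(tok == nb for nb in neighbors)) for tok in VOCAB]

def glauber_step(sequence, temperature, rng):
    """Resample one uniformly chosen position from its tempered conditional."""
    position = rng.randrange(len(sequence))
    logits = [l / temperature for l in toy_logits(sequence, position)]
    peak = max(logits)  # subtract the max before exponentiating for stability
    weights = [math.exp(l - peak) for l in logits]
    new_sequence = list(sequence)
    new_sequence[position] = rng.choices(VOCAB, weights=weights, k=1)[0]
    return new_sequence

rng = random.Random(0)
state = ["a", "b", "c", "a", "b"]
for _ in range(200):
    state = glauber_step(state, temperature=0.5, rng=rng)
```

Lowering `temperature` sharpens each conditional, so the chain tends to stay longer in sequences where neighbors agree; this is the toy analogue of the slow escape from basins that the abstract describes at low temperature.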

2605.16377 2026-05-19 cs.DL cs.AI cs.LG (updated version)

CheckSupport: A Local LLM-Powered Tool for Automated Manuscript Submission Checklist Selection and Completion

Satvik Tripathi, Don Enwerem, Kevin Song, Kristian Quevada, Jacinta Arnold, Tessa S. Cook

Affiliations: Department of Radiology, Perelman School of Medicine at University of Pennsylvania; Department of Computer Science, Drexel University; Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania; Department of Radiology, Cooper University Hospital; University of California Davis Graduate School of Management

AI summary: This paper presents CheckSupport, which uses local LLMs to automate the selection and completion of reporting checklists, improving the transparency and reproducibility of scientific reporting. Using a staged prompting strategy, the system runs on CPU-only hardware, takes 12.5 seconds per manuscript on average, and reaches 90% accuracy on checklist recommendation.

Abstract

Transparent and standardized reporting is essential for reproducible scientific research, yet adherence to reporting guidelines remains inconsistent because of the manual effort required to select and complete checklists. We present CheckSupport, an open-source, locally deployable system that uses large language models to automate the recommendation of reporting checklists and the evidence-grounded completion of checklists for scientific manuscripts. CheckSupport employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis. All inference is performed locally using instruction-tuned models, preserving data privacy and enabling reproducible, auditable workflows. Evaluated on a corpus of peer-reviewed manuscripts, CheckSupport achieved 90% overall accuracy for checklist recommendations and 88% overall accuracy for item-level completion while operating on CPU-only hardware. On average, the wall-clock time per manuscript was 12.5 seconds, including the checklist recommendation and full checklist completion. These results demonstrate that large language models, when applied as structured inference components, can reduce reporting burden and support more transparent and reproducible scientific reporting across disciplines.

2605.16374 2026-05-19 cs.LG cs.AI (updated version)

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

Katarzyna Filus, Kamil Faber, Roberto Corizzo, Christopher Kanan

Affiliations: AGH University of Krakow; American University; University of Rochester

AI summary: This paper proposes a diagnostic framework that uses sparse autoencoders to analyze concept-level forgetting, finding that forgetting stems mainly from changes in representational accessibility rather than information erasure.

Abstract

Continual learning studies how models can adapt to new tasks while retaining previously acquired knowledge. Although a broad spectrum of methods has been proposed to mitigate catastrophic forgetting, the field remains predominantly performance-driven, with limited insight into what forgetting actually corresponds to within the vision model's representation space. Prior work has primarily analyzed forgetting through task-level performance or coarse measures of representational drift, without disentangling output-level accessibility from changes in finer-grained internal structure. To this end, we propose a diagnostic framework that leverages Sparse Autoencoders (SAEs) to define a task-anchored latent feature space, enabling analysis of how task-specific information evolves at a finer granularity, where individual SAE latents are treated as concept proxies for recurring and relatively disentangled visual patterns in the model's internal computations. Within this framework, we decompose forgetting into apparent concept deletion, recoverability, and decodability. We show that a large portion of seemingly lost concept-level information can often be recovered under a linearity assumption, with concept decodability degrading as more tasks are introduced. Overall, our findings suggest that a significant part of concept-level forgetting can be attributed to changes in representational accessibility rather than complete information erasure.

2605.16373 2026-05-19 cs.CV cs.AI cs.LG (updated version)

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong, Gen Wen, Pengfei Cheng, Daqian Shi

Affiliations: Shanghai Maritime University; University College London; Shanghai Sixth People’s Hospital; Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine; Queen Mary University of London

AI summary: This paper proposes a dual-modality end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation to segment bone infections under annotation discrepancy, evaluated with patient-level 3D volumetric metrics and cross-validation.

Abstract

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation. To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.

2605.16372 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SwordBench: Evaluating Orthogonality of Steering Image Representations

SwordBench：评估转向图像表示的正交性

Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

发表机构 * Centre for Credible AI（可信人工智能中心）； Warsaw University of Technology（华沙理工大学）； University of Warsaw（华沙大学）

AI总结本文提出SwordBench，用于评估视觉模型在多个backbone和概念移除任务中转向表示的正交性，引入了交叉概念鲁棒性和 collateral damage 等新评估指标，发现线性SVM在分离性和正交性上优于稀疏自编码器，但无法实现零 collateral damage。

AI中文摘要

在推理时间对模型表示进行干预以校正预测对于AI可解释性和安全性至关重要，但现有评估协议局限于模糊的语言建模任务。为填补这一空白，我们引入SwordBench，一个用于评估视觉模型在多个backbone和概念移除任务中转向表示的基准。除了统一的基准测试套件外，我们还提出了新的评估概念，揭示了概念激活向量正交性对实用转向的二次影响。具体而言，交叉概念鲁棒性衡量在针对替代概念正交化输入上概念检测性能的稳定性，而collateral damage量化在缺乏偏见的输入上转向是否意外影响下游任务的模型性能。我们发现尽管线性支持向量机在分离性和正交性上表现优异，但无法实现零collateral damage，通常落后于稀疏自编码器。在更简单的环境中，标准基线和优化方法均无法实现完美的转向。源代码将很快在GitHub上发布。

英文摘要

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.
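The abstract refers to inputs "orthogonalized against" a concept. A minimal sketch of that operation, assuming it means removing a representation's linear component along a concept activation vector (the benchmark's actual steering procedures are not specified here, and `x` and `v` are hypothetical toy vectors):

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def orthogonalize(x, v):
    """Remove from x its component along the concept direction v."""
    scale = dot(x, v) / dot(v, v)  # assumes v is not the zero vector
    return [xi - scale * vi for xi, vi in zip(x, v)]


# Hypothetical 2-D representation and concept direction (illustration only).
x = [2.0, 0.0]
v = [1.0, 1.0]
x_steered = orthogonalize(x, v)
print(x_steered, dot(x_steered, v))
```

After the projection, the steered vector has zero inner product with `v`, so a linear probe along `v` can no longer detect the concept. Whether other concepts stay detectable afterwards is the kind of side effect that cross-concept robustness would measure.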

2605.16371 2026-05-19 cs.CV cs.AI 版本更新

TailedTS：用于重尾时间序列预测和周期性量化的大规模基准数据集

Xinyu Chen, HanQin Cai, Lijun Ding, Jinhua Zhao

发表机构 * University of Central Florida（中佛罗里达大学）； University of California, San Diego（加州大学圣地亚哥分校）； Massachusetts Institute of Technology（麻省理工学院）

AI总结 TailedTS数据集用于测试在重尾、零膨胀和非高斯条件下时间序列预测模型的鲁棒性，通过稀疏自回归框架揭示高频页面的周期性较弱，同时提供非高斯损失函数的标准化预测基准。

AI中文摘要

我们介绍了TailedTS，一个基于2024年维基百科每小时页面浏览观测数据的大规模基准数据集，专门用于测试时间序列预测模型在重尾、零膨胀和非高斯条件下的性能。该数据集包含约246.9亿个数据点，每月覆盖约300万个唯一维基百科页面，以高效的Apache Parquet格式存储。维基百科流量遵循显著的幂律分布，其中约5%的页面贡献了超过70%的总浏览量，为模型在极端波动下的鲁棒性提供了一个自然且严谨的测试环境，而这类极端波动在M4、M5和UCI电力数据集等现有基准中缺失或代表性不足。TailedTS支持多个研究任务：首先，我们引入了一个基于带稀疏性和非负性约束的稀疏自回归的周期性量化框架，揭示高浏览量页面的周期性结构显著弱于低浏览量页面，这对大型数字平台的服务器分配和流量预测有直接意义。其次，我们提供了在一系列非高斯损失函数下的标准化预测基准，包括ℓ1范数、Huber、分位数和ℓp范数损失，表明标准的基于高斯的估计器在高流量页面类别中性能显著下降，而鲁棒替代方案在所有流量规模上均提供一致的提升。TailedTS可在https://doi.org/10.5281/zenodo.17070469公开获取。

英文摘要

We present TailedTS, a large-scale benchmark dataset derived from Wikipedia hourly page view observations throughout 2024, specifically designed to test time series forecasting models under heavy-tailed, zero-inflated, and non-Gaussian conditions. The dataset comprises approximately 24.69 billion data points spanning roughly 3 million unique Wikipedia pages per month, stored in high-efficiency Apache Parquet format. Wikipedia traffic follows a pronounced power-law distribution where roughly 5% of pages account for over 70% of total page views, creating a natural and rigorous testbed for model robustness against extreme volatility, a regime that is absent from or underrepresented in existing benchmarks such as the M4, M5, and UCI electricity datasets. TailedTS enables several research tasks. First, we introduce a periodicity quantification framework based on sparse autoregression with sparsity and non-negativity constraints, revealing that frequently viewed pages exhibit significantly weaker periodic structure than their less-viewed counterparts, with direct implications for server allocation and traffic forecasting on large digital platforms. Second, we provide standardized prediction benchmarks evaluated under a suite of non-Gaussian loss functions, including $\ell_1$-norm, Huber, quantile, and $\ell_p$-norm losses, demonstrating that standard Gaussian-based estimators degrade substantially on high-volume page categories, while robust alternatives provide consistent gains across all traffic scales. TailedTS is publicly available at https://doi.org/10.5281/zenodo.17070469.
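Two of the robust losses named in the abstract can be written out as a minimal sketch. The definitions below are the textbook Huber and quantile (pinball) losses, not code from the TailedTS benchmark, and the residual values are made up for illustration:

```python
def squared_loss(residual):
    """Gaussian-style loss: grows quadratically with the residual."""
    return 0.5 * residual * residual


def huber_loss(residual, delta=1.0):
    """Quadratic for |residual| <= delta, linear beyond it."""
    a = abs(residual)
    if a <= delta:
        return 0.5 * residual * residual
    return delta * (a - 0.5 * delta)


def quantile_loss(y_true, y_pred, q=0.5):
    """Pinball loss: under-prediction costs q, over-prediction costs 1 - q."""
    diff = y_true - y_pred
    return max(q * diff, (q - 1.0) * diff)


# Hypothetical residuals with one heavy-tailed outlier (illustration only).
residuals = [0.5, -0.2, 50.0]
mean_squared = sum(squared_loss(r) for r in residuals) / len(residuals)
mean_huber = sum(huber_loss(r) for r in residuals) / len(residuals)
print(mean_squared, mean_huber)
```

The single outlier dominates the squared loss but contributes only linearly to the Huber loss, which is the usual motivation for robust losses on heavy-tailed traffic.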

2605.16360 2026-05-19 cs.LG cs.AI 版本更新

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV：跨模型代理剪枝用于高效长上下文LLM推理

Junjie Li, Jiong Lou, Jie Li

发表机构 * Shanghai Jiao Tong University（上海交通大学）

AI总结 ProxyKV通过跨模型代理剪枝方法，解决LLM长上下文推理中的KV缓存内存瓶颈，实现高效推理与高精度的平衡，提升预填充速度和长上下文处理能力。

AI中文摘要

高效长上下文推理在大型语言模型（LLM）中受到键值（KV）缓存内存瓶颈的严重限制，而现有剪枝方法在低延迟启发式和高精度重建方法之间做出取舍。为弥合评分成本与精度之间的差距，我们提出了ProxyKV，一种跨模型代理剪枝框架，将重要性评分卸载到轻量级的同族小型模型代理上，该代理异步执行于大型模型目标。为弥合异构模型之间的架构差距，我们设计了HybridAxialMapper，将时间特征提取与跨头对齐解耦，并设计了多粒度混合损失，将学习目标从刚性回归转向相对排名一致性。在Llama-3.1、Qwen-2.5和Qwen-3家族上，针对LongBench、SCBench和RULER等基准测试，ProxyKV在聚合层面（恢复约98.7%的平均精度）与KVZip相当，同时在Llama-3.1-8B上实现了高达3.21倍的预填充加速（双GPU；约1.5倍共享单GPU），并在Qwen-2.5-7B上支持高达170k tokens的上下文长度。

英文摘要

Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost--accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families spanning targets from 7B up to 32B parameters on LongBench, SCBench, and RULER, ProxyKV matches KVZip on aggregate (recovering $\sim$$98.7\%$ of its mean accuracy) while delivering up to a $3.21\times$ prefilling speedup on Llama-3.1-8B (dual-GPU; $\sim$$1.5\times$ shared single-GPU) and sustaining the speedup at contexts up to 170k tokens on Qwen-2.5-7B.

2605.16359 2026-05-19 cs.CV cs.AI 版本更新

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

多模态语言模型需要多少视觉标记？通过F^3A进行视觉标记剪枝的扩展

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang, Zihan Wang, Shi Feng, Daling Wang, Yifei Zhang, Yongkang Liu

发表机构 * School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China（东北大学计算机科学与工程学院，沈阳 110819，中国）； Shanghai Artificial Intelligence Laboratory（上海人工智能实验室）； School of Computer and Communication Engineering, Northeastern University, Qinhuangdao 066004, China（东北大学计算机与通信工程学院，秦皇岛 066004，中国）

AI总结本文提出F^3A方法，通过任务条件证据搜索优化视觉标记分配，在不训练模型的情况下实现高效的视觉标记剪枝，保留原始多模态提示和解码流程。

AI中文摘要

视觉语言模型通过将越来越长的视觉标记序列输入语言骨干网络来提升感知能力，但由此产生的推理成本提出了一个基本的扩展问题：随着多模态模型的增长，实际上需要多少视觉标记，以及在固定视觉标记预算下如何分配？现有训练免费剪枝方法通常通过一shot代理如解码器注意力、视觉相似性或条件多样性来回答这个问题。我们主张将视觉标记剪枝视为任务条件证据搜索，特别是在极端压缩和跨模型规模的情况下。我们提出F^3A，一种训练免费的视觉标记剪枝路由器，在语言模型消耗图像标记之前运行。F^3A构建轻量级的问题条件线索，通过冻结的稀疏感知头将它们与视觉网格标记匹配，并通过粗略证据定位、局部细化、覆盖保持竞争和恢复未覆盖区域来分配固定视觉标记预算。它不需要模型训练，不需要额外的LLM前向传递，并保留原始多模态提示和解码流程。

英文摘要

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training and no extra LLM forward pass, and it preserves the original multimodal prompting and decoding pipeline.

2605.16358 2026-05-19 cs.LG cs.AI 版本更新

LEAF: A Living Benchmark for Event-Augmented Forecasting

LEAF：一个用于事件增强预测的活体基准

Mingtian Tan, Mihir Parmar, Palash Goyal, Chun-Liang Li, Nanyun Peng, Thomas Hartvigsen, Jinsung Yoon, Tomas Pfister

发表机构 * Google（谷歌）； University of Virginia（弗吉尼亚大学）

AI总结本文提出LEAF，首个用于事件增强预测的活体基准，通过递归检索代理系统和双代理交叉验证，提供全面相关文本辅助预测，评估LLM在复杂真实场景中的预测能力。

Comments 12 tables, 6 figures, 39 pages

AI中文摘要

大型语言模型（LLMs）越来越多地应用于预测。为了评估这一能力并缓解预训练数据污染，已提出几种活体基准。然而，现有基准要么因数据稀缺缺乏多维事件，要么聚焦于相对封闭环境。为评估LLM在复杂真实场景中的预测能力，我们提出LEAF，首个用于事件增强预测任务的活体基准，包括未来事件概率、趋势和时间序列预测。LEAF利用递归检索代理系统配以双代理交叉验证，提供全面相关辅助文本。评估最新专有和开源LLMs发现，这些模型能利用复杂事件提取的信号提升预测性能。在股票领域，发现LLM在自信识别为更可预测的股票上表现更好。此外，事件与目标股票呈现强相关性。为此，LEAF提供必要的动态更新测试环境，持续跟踪和推动事件驱动预测任务的进步。

英文摘要

Large Language Models (LLMs) are increasingly applied to forecasting. To evaluate this capability while mitigating pre-training data contamination, several living benchmarks have been proposed. However, existing benchmarks either lack the multidimensional events essential for accurate forecasting due to data scarcity, or focus on relatively closed environments. To assess the predictive capabilities of LLMs in complex, real-world scenarios, we propose LEAF, the first living benchmark for event-augmented forecasting tasks, including future event probabilities, trend and time series forecasting. LEAF utilizes a recursive retrieval agent system paired with dual-agent cross-validation to provide comprehensive and relevant auxiliary text for forecasting. Evaluating state-of-the-art proprietary and open-weight LLMs, we find that these models can leverage signals extracted from complex events to enhance predictive performance. In the stock domain, we find that LLMs achieve better performance on equities they confidently identify as more predictable. Furthermore, the events demonstrate a strong correlation with the target equities. To this end, LEAF provides a necessary, dynamically updating testbed to continuously track and drive progress in event-driven forecasting tasks.

2605.16357 2026-05-19 eess.SP cs.AI cs.CV 版本更新

Learning Displacement-Aware WiFi Representations for Weakly Supervised Relative Localization

学习位移感知的Wi-Fi表示以实现弱监督的相对定位

Tzu-Ti Wei, Po-Cheng Chen, Yu-Chee Tseng, Jen-Jee Chen

发表机构 * College of AI, National Yang Ming Chiao Tung University（国立阳明交通大学人工智能学院）

AI总结本文提出IP框架，通过交叉模态学习对齐指纹轨迹与位移轨迹，学习位移感知的Wi-Fi表示，实现准确的相对定位，并扩展至少样本绝对定位。

AI中文摘要

基于Wi-Fi指纹的室内定位已广泛研究，但现有方法多关注绝对定位并依赖密集坐标标注，获取成本高。本文研究相对定位问题，目标是直接估计两个Wi-Fi指纹轨迹间的位移，不预测绝对位置。为减少标注开销，采用惯性传感获取的步进运动向量作为弱监督。提出Intersection Pathway (IP)框架，通过共享潜在空间对齐指纹轨迹与位移轨迹。关键思想是使潜在空间具有加法结构，使潜在空间的加减对应物理运动组合，实现直接的相对位移推断。实验表明，所提方法在合成数据集上学习位移感知的Wi-Fi表示，实现不同位移范围的准确相对定位。此外，所学模型可扩展至少样本绝对定位。

英文摘要

WiFi fingerprint-based indoor localization has been widely studied, but most existing approaches focus on absolute positioning and rely on dense coordinate annotations, which are costly to obtain at scale. In this paper, we study a fundamentally different problem: relative localization, where the goal is to directly estimate the displacement between two WiFi fingerprint traces without predicting their absolute positions. To reduce annotation overhead, we adopt weak supervision in the form of stepwise motion vectors obtained from inertial sensing. We propose Intersection Pathway (IP), a cross-modal learning framework that aligns fingerprint traces (f-traces) and displacement traces (d-traces) in a shared latent space. The key idea is to enforce an additive structure in the latent space, such that latent addition and subtraction correspond to physical motion composition, enabling direct relative-displacement inference. Experiments on a synthesized dataset derived from real measurements demonstrate that the proposed method learns displacement-aware WiFi representations and achieves accurate relative localization across varying displacement ranges. Furthermore, the learned model can be extended to few-shot absolute localization with sparse anchors.

2605.16354 2026-05-19 cs.LG cs.AI cs.CL cs.HC stat.ML 版本更新

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

通过LLM裁判增强人类评估：你真的需要多少人类评审？

Jane Paik Kim

发表机构 * Department of Psychiatry and Behavioral Sciences（精神病学与行为科学系）

AI总结本文提出通过LLM作为辅助裁判来增强人类评估，通过两阶段抽样设计确定人类和LLM评审样本量，以实现目标统计功效。

Comments 10 pages, 5 figures

AI中文摘要

大型语言模型（LLMs）越来越多地被用作AI系统的自动评估者，包括在高风险应用中。在这一角色中，LLMs用于生成关于模型输出质量、适当性甚至安全性的判断。这种做法受到实际限制的驱动。专家人类评分成本高且难以扩展，而LLM评分可以快速低成本地生成。然而，当前部署LLM评估者的方法是随意的，通常仅限于报告人类和LLM裁判之间的一致性度量作为替代人类评分的正当性，且缺乏正式的研究设计基础。本文（1）将LLM裁判的角色从替代性转为辅助性，并（2）将LLM作为裁判范式制定为通过两阶段抽样设计增强人类评估的一种方法，其中在第一阶段对所有观察进行LLM评估，在第二阶段对子样本进行部分人类评分。我们提出使用来自缺失数据文献的双重鲁棒估计器，利用预测模型的鲁棒性属性，因为缺失性模型是设计已知的。使用该估计器的渐近方差，我们提出如何确定人类和LLM评分的样本量以达到目标统计功效。我们还展示通过分配更多人类评分给LLM评分预测性不高的评估类型，可以高效地设计研究。据我们所知，关于在验证基准时应保留多少人类监督的指导非常有限。

英文摘要

Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.
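The two-stage design described above can be illustrated with a minimal sketch of a doubly robust estimate of the mean human rating. It assumes the standard augmented inverse-probability-weighted (AIPW) form from the missing-data literature, which may differ from the paper's exact estimator. The function name, the toy ratings, and the 0.5 sampling probability are all hypothetical:

```python
def doubly_robust_mean(predicted, human, sampled, prob):
    """AIPW estimate of the mean human rating.

    predicted[i]: prediction of the human rating from the LLM rating (all items)
    human[i]:     human rating, used only when sampled[i] is True
    sampled[i]:   True if item i was sent for human review (second stage)
    prob[i]:      known design probability that item i gets human review
    """
    total = 0.0
    for m, y, r, p in zip(predicted, human, sampled, prob):
        # Bias correction is applied only where a human rating was observed.
        correction = (y - m) / p if r else 0.0
        total += m + correction
    return total / len(predicted)


# Toy data (illustration only): four items, two of them human-reviewed.
predicted = [0.8, 0.6, 0.4, 0.2]
human = [1.0, None, 0.0, None]
sampled = [True, False, True, False]
prob = [0.5, 0.5, 0.5, 0.5]
estimate = doubly_robust_mean(predicted, human, sampled, prob)
print(estimate)
```

Because the sampling probabilities are fixed by design, the correction term keeps the estimate consistent even when the LLM-based predictions are poor. Better predictions mainly reduce its variance, which is what drives the sample-size calculation.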

2605.16352 2026-05-19 cs.IR cs.AI cs.LG 版本更新

LARGER: Lexically Anchored Repository Graph Exploration and Retrieval

LARGER: 词典锚定的仓库图探索与检索

Yuntong Hu, Tongli Su, Liang Zhao, Bowen Zhu, Hasibul Haque

发表机构 * Emory University（埃默里大学）

AI总结 LARGER通过词典锚定的结构化定位方法提升代码仓库文件定位精度，实现测试生成和代码库理解任务的性能提升。

AI中文摘要

仓库级别的编码代理必须首先定位与任务相关的文件和符号；此阶段的失败会影响从补丁生成到测试编写和代码库问答的下游目标。现有代理主要通过词汇搜索导航仓库，常遗漏结构关系如导入、调用链、类型层次和代码-测试链接。基于图的检索可恢复此类依赖，但现有方法常需要单独的图工具或遍历阶段，打断代理的交互循环。我们正式将仓库上下文定位定义为词典锚定的结构化定位，其成功取决于将词汇匹配转化为高精度的结构入口点，并在代理现有搜索循环中暴露最有用的置信度过滤局部邻域。我们引入LARGER（词典锚定的仓库图探索与检索），一种以词汇锚定的主动集检索框架，从词汇匹配开始，将其对齐到图锚点，并在代理现有搜索循环中执行置信度过滤的局部扩展。LARGER直接集成到现有CLI编码代理中，无需外部图数据库或专用图接口。在四个涵盖定位、测试生成和代码库理解的基准测试中，LARGER在LocBench上通过调整超参数将文件级Acc@5提升13.9点，即使在固定超参数下仍比最强基线提升11.8点，并在MuLocBench、SWE-Atlas测试编写和SWE-Atlas代码库问答任务上提供一致的提升。

英文摘要

Repository-level coding agents must first localize the files and symbols relevant to a task; failures at this stage can cascade across downstream objectives ranging from patch generation to test writing and codebase question answering. Existing agents navigate repositories primarily through lexical search, often missing structural relations such as imports, call chains, type hierarchies, and code-test links. Graph-based retrieval can recover such dependencies, but existing approaches often require separate graph tools or traversal stages that fragment the agent's interaction loop. We formalize repository context localization as Lexically Anchored Structural Localization, where success depends on turning lexical matches into high-precision structural entry points and exposing the most useful confidence-filtered local neighborhoods within the agent's existing search loop. We introduce LARGER (Lexically Anchored Repository Graph Exploration and Retrieval), a lexically anchored active-set retrieval framework that starts from lexical matches, aligns them to graph anchors, and performs confidence-filtered local expansion within the agent's existing search loop. LARGER integrates directly into existing CLI coding agents without requiring external graph databases or specialized graph interfaces. Across four benchmarks spanning localization, test generation, and codebase understanding, LARGER improves file-level Acc@5 on LocBench by +13.9 points with tuned hyperparameters and still gains +11.8 points with fixed hyperparameters over the strongest baseline, while delivering consistent gains on MuLocBench, SWE-Atlas Test Writing, and SWE-Atlas Codebase QA.

2605.16351 2026-05-19 cs.LG cs.AI 版本更新

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

PIMSM：基于物理的多尺度Mamba用于在分布偏移下稳定的神经表示

Sangyoon Bae, Shinjae Yoo, Jiook Cha

发表机构 * Interdisciplinary Program in Artificial Intelligence（人工智能交叉学科项目）； Seoul National University（首尔国立大学）； Computational Science Initiative（计算科学倡议）； Brookhaven National Laboratory（布鲁克海文国家实验室）； Department of Psychology（心理学系）

AI总结本文提出PIMSM，一种基于物理的多尺度Mamba架构，通过时间尺度对齐提升科学基础模型在分布偏移下的鲁棒性和表示稳定性，实验证明其在fMRI和气象预测中的有效性。

Comments 9 pages, 2 figures

LoopQ: 递归变换器的量化

Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

发表机构 * National Taiwan University（国立台湾大学）

AI总结本文提出LoopQ框架，针对递归变换器的量化挑战，通过激活缩放、选择性变换等方法提升模型精度与效率。

2605.16342 2026-05-19 cs.LG cs.AI cs.CL 版本更新

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO：去噪感知的信用分配用于扩散语言模型中的强化学习

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Lokesh Boominathan, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

发表机构 * The Ohio State University（俄亥俄州立大学）； Apple（苹果公司）

AI总结本文提出DACA-GRPO，通过引入去噪进度评分和分层掩码似然，改进扩散语言模型中强化学习的信用分配，提升数学推理、代码生成等任务性能。

AI中文摘要

扩散大语言模型是自回归模型的有力替代品，但现有强化学习方法将所有去噪步骤视为同等重要，并依赖于有偏、高方差的似然估计。我们识别出两个根本性弱点：去噪轨迹中缺乏时间信用分配，以及用于策略优化的均场似然估计存在系统偏差。为了解决这些问题，我们提出了Denoising-Aware Credit Assignment for GRPO（DACA-GRPO），一种轻量级、即插即用的增强方法，适用于任何GRPO风格的训练器。DACA-GRPO引入了两个互补机制：去噪进度评分，从中间预测中提取每token的重要性权重，无需额外前向成本；分层掩码似然，将token位置分为层次，使每个token在大部分序列作为上下文的情况下进行预测，从而减少均场偏差。在三种GRPO基础方法上应用DACA-GRPO，使其在七个基准测试中取得一致提升，涵盖数学推理、代码生成、约束满足和受约束生成等任务，在数学推理中提升达5.6个百分点，在代码生成中提升7.4个百分点，在约束满足中提升36.3个百分点，在JSON schema符合性中提升5.9个百分点。

英文摘要

Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.

2605.16336 2026-05-19 cs.CR cs.AI cs.CY 版本更新

Detecting Verbatim LLM Copy-Paste in Homework

检测作业中的直接LLM复制粘贴

Aizierjiang Aiersilan

发表机构 * The George Washington University（乔治·华盛顿大学）

AI总结本文提出SteganoPrompt工具，通过在作业提示中嵌入隐形指令，使LLM在响应时生成特征签名，帮助教师检测学生直接复制粘贴模型回复的行为。

决策能力阈值在自我对战强化学习中的崩溃中起作用

Arahan Kujur

发表机构 * Independent Researcher（独立研究者）

AI总结研究揭示决策能力阈值决定自我对战强化学习代理在不对称规则扰动下的崩溃，通过消除所有正可达条件决策导致快速收敛到确定性利用吸引子，而保留单个正可达条件决策可防止崩溃。

Comments 18 pages, 7 figures

2605.16312 2026-05-19 cs.LG cs.AI 版本更新

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

当动作消失时：自我对战强化学习中的对抗性动作移除

Arahan Kujur

发表机构 * Independent Researcher（独立研究者）

AI总结研究了自我对战强化学习中的对抗性动作遮蔽，发现学习的遮蔽比随机遮蔽和学习扰动基线更具破坏性，揭示了动作可用性作为自我对战RL中的新鲁棒性表面。

Comments 17 pages, 2 figures, 18 tables

2605.16306 2026-05-19 cs.GR cs.AI 版本更新

UVTran: Accurate Hole-Filling Parameterization with Transformers

UVTran: 基于变换器的精确孔填补参数化

JunFeng Zhang

AI总结 UVTran利用变换器框架，通过设计交叉注意力机制和分层训练策略，提升孔填补的几何精度与表面公平性，优于现有工业和学术基准。

AI中文摘要

在工业设计中，N边孔填补通常被建模为通过最小化公平能量并在几何边界约束下构造单个修剪B样条表面。此建模需要准确的参数空间表示来修剪曲面。现有方法将孔边界投影到邻近平面或多边形以建立对应关系；然而，它们经常忽视边界异质性，导致有偏映射、降低公平性甚至导致填补失败。我们提出UVTran，一种基于变换器的框架，预测辅助投影表面以更好地捕捉孔边界的几何特性。利用B样条局部性，我们设计了交叉注意力机制，使每个表面控制点偏向附近的孔边界，保留局部几何细节。我们对控制点坐标进行体素化，并将拟合问题建模为分类任务，从而减少模型对小数值扰动和噪声的敏感性。我们采用逐步分辨率训练策略，注入受控离散化误差以模仿分布偏移，从而缓解过拟合并提高高分辨率下的泛化能力。在我们的基准测试中，UVTran优于工业和学术基准：容忍度满足率提高了12%，并且在复杂孔边界条件下始终产生公平的填充表面。这些结果表明，UVTran在广泛的N边孔中能够产生更忠实的对应关系和更公平的修剪表面。

英文摘要

In industrial design, N-sided hole filling is typically formulated as the construction of a single trimmed B-spline surface by minimizing a fairness energy subject to geometric boundary constraints. This formulation requires an accurate parameter-space representation of the trimming curve on the filling surface. Most existing methods project the hole boundary onto a nearby plane or polygon to establish correspondence; however, they often neglect boundary heterogeneity, which can yield biased mappings, degrade fairness, and even cause filling failures. We propose UVTran, a transformer-based framework that predicts an auxiliary projection surface to better capture the geometric characteristics of the hole boundary. Exploiting B-spline locality, we design a cross-attention mechanism that biases each surface control point toward the nearby hole boundary, preserving local geometric detail. We voxelize control-point coordinates and formulate the fitting problem as a classification task, which reduces the model's sensitivity to small numerical perturbations and noise. We adopt a progressive-resolution training strategy that injects controlled discretization errors at coarse resolutions to mimic distribution shifts, thereby mitigating overfitting and improving generalization at high resolution. On our benchmark, UVTran outperforms both industrial and academic baselines: the tolerance-satisfaction rate improves by $12\%$, and it consistently produces fair filled surfaces even under complex hole boundary conditions. These results suggest that UVTran yields more faithful correspondences and fairer trimmed surfaces across a wide range of N-sided holes.

2605.16303 2026-05-19 cs.CY cs.AI cs.CL 版本更新

From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes

从人口统计学到调查锚点：评估LLM代理在建模退休态度中的表现

Rubén Garzón, Pauline Baron, Vincent Grari, Jonne Kamphorst, Michael Bernstein, Marcin Detyniecki

发表机构 * AI Research（AI研究）； AXA Group Operations（AXA集团运营）； CDSP & CEE（CDSP与CEE）； Sciences Po（巴黎政治学院）； Computer Science Department（计算机科学系）； Stanford University（斯坦福大学）

AI总结本文比较了基于人口统计学的LLM代理与基于调查数据的代理在预测退休态度调查中的准确性，发现仅依赖人口统计学的代理存在偏差且不够准确，而基于调查锚点的代理能更好地捕捉复杂的人类响应模式。

Comments 50 pages, 22 figures

ANVIL：为讲师提供类比和视频

Yuri Noviello, Anastasiia Birillo, Gosia Migut

发表机构 * Delft University of Technology（代尔夫特理工大学）； JetBrains Research（JetBrains研究院）

AI总结 ANVIL是一种多模态生成系统，可自动生成基于类比的计算机科学教学动画。通过生成文本类比、结构化视觉剧本和可执行代码，提升教学有效性。

AI中文摘要

我们介绍了ANVIL，一种多模态生成系统，可自动生成基于类比的教学动画。给定一个概念定义，ANVIL生成文本类比，将其编译成结构化的视觉剧本，并生成可执行的manim代码以渲染动画，同时具备自动修复机制以提高鲁棒性。在大规模评估此类系统时，需要在教学有效性与可扩展性之间取得平衡。我们首先通过教师评估来确定质量评估的基础，并利用其发现来指导自动化筛选。对于文本类比，我们引入基于LLM的评估器以实现可扩展的质量筛选；对于视频，由于主观判断难以自动化，我们改用自动代理来评估与预期剧本的一致性并进行错误分析。我们进一步与教育工作者进行用户研究，以考察采用要求和风险。我们的发现表明，ANVIL可以生成经常被评价为足够的材料，并且教育工作者对其感知价值和易用性有积极反应。

英文摘要

We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.

2605.16294 2026-05-19 cs.CY cs.AI 版本更新

Are Researchers Being Replaced by Artificial Intelligence?

研究人员是否被人工智能取代？

Angelo A. Salatino, Ansgar Scherp, Christin Katharina Kreutz, Sahar Vahdati

发表机构 * Knowledge Media Institute, The Open University, Milton Keynes, UK ； Ulm University, Ulm, Germany ； TH Mittelhessen – University of Applied Sciences, Gießen, Germany ； TIB – Leibniz Information Centre for Science and Technology & Leibniz University of Hannover, Hannover, Germany

AI总结研究探讨人工智能在科研中的影响，指出AI工具的普及导致研究人员角色转变，从创造者转向 curator，引发人类对科学理解的担忧。

2605.16292 2026-05-19 cs.CY cs.AI 版本更新

Evidence of a Cognitive Shift in AI Education: How Students Are Rethinking Human Intelligence?

人工智能教育中认知转变的证据：学生如何重新思考人类智能？

Islem Rekik

发表机构 * BASIRA Lab, Imperial-X (I-X) and Department of Computing（BASIRA实验室、Imperial-X（I-X）和计算系）； Imperial College London, London, United Kingdom（伦敦帝国理工学院，伦敦，英国）

AI总结研究通过长期分析发现，学生对人工智能和人类智能的评价随时间推移逐渐转变，从2020年偏爱AI到2026年多数学生更重视人类智能。

Comments ICLR HCAIR Workshop 2026 https://openreview.net/forum?id=chH4gO2tZT

AI中文摘要

对人工智能（AI）系统的感知影响学习者如何评估和依赖这些系统。尽管AI能力迅速提升，但持续接触这些工具对学生对人类智能（HI）与AI相对价值的影响仍被忽视。本文通过2020至2026年间收集的471名学生课堂投票数据，分析了AI相关本科和硕士课程中四个阶段：hype（炒作）、distrust（不信任）、trust（信任）和dependency（依赖）。2020年早期投票略微偏向AI，但自2024年起，所有硕士课程群体中逐渐转向更重视HI。到2026年，技术课程中HI偏好达到65%（比2025年增加12个百分点），而设计导向课程中HI偏好达到90%（比2025年增加36个百分点）。这些发现表明，随着AI成为常规工具，人类智能的重新评估逐渐发生，对学习者自主性和知识权威性有影响。本文最后反思了从偏爱AI到优先考虑HI的认知转变。

自动化法律推理迫使在不完美的替代方案之间做出选择：符号系统提供透明性但难以处理模糊性，而神经系统能灵活处理自然语言但缺乏可验证性。本文研究了一种混合的神经符号方法是否能解决这一权衡。我们评估了该架构在在线内容审核领域的应用，作为高量级法律决策（如大规模行政程序）的代理。在这些情况下，操作员必须在严格法律标准下每天评估数千个案例。具体而言，我们探讨了将大型语言模型（LLMs）限制在确定性符号框架内是否能提高基于法律条款的非法性评估，同时防止“范围漂移”（即LLMs将道德冒犯性与法律非法性混淆）。我们评估了规则映射的神经符号变体——一种视觉逻辑树方法，该方法将经典法律三段论形式化——在德国刑法第130(1)条下的在线仇恨言论分类。在多样化的LLMs上，规则映射保持高召回率（0.82-0.89）同时达到精确度0.80-0.86，相比无约束提示的0.34-0.49。专家编写的符号框架因此能够实现稳健的法律自动化，符合可审计性和可验证决策的监管要求。

英文摘要

Automating legal reasoning forces a choice between imperfect alternatives: symbolic systems offer transparency but struggle with ambiguity, whereas neural systems handle natural language flexibly but lack verifiability. This paper investigates whether a hybrid, neuro-symbolic approach can reconcile this trade-off. We evaluate this architecture in the domain of online content moderation, which serves as a proxy for high-volume legal decision-making such as mass administrative proceedings. In these settings, operators must assess thousands of cases daily under strict legal standards. Specifically, we examine whether constraining large language models (LLMs) within deterministic symbolic scaffolds improves statute-grounded illegality assessment while preventing "scope drift" (where LLMs conflate moral offensiveness with legal illegality). We evaluate the neuro-symbolic variant of Rulemapping - a visual logic-tree method that operationalises the classic legal syllogism - on online hate-speech classification under §130(1) of the German Criminal Code. Across diverse LLMs, Rulemapping maintains high recall (0.82-0.89) while achieving precision of 0.80-0.86, compared to 0.34-0.49 for unconstrained prompting. Expert-authored symbolic scaffolds thus enable robust legal automation aligned with regulatory requirements for auditability and verifiable decision-making.

2605.16279 2026-05-19 cs.CY cs.AI 版本更新

Generative AI and Two-Tiered Online Mental Health Communities

生成式AI与双层在线心理健康社区

Manyang Zhang, Jinyang Zheng, Zhijun Yan

发表机构 * School of Management, Beijing Institute of Technology（北京理工大学管理学院）； Simon Business School, University of Rochester（罗切斯特大学Simon商学院）

AI总结研究探讨生成式AI在双层在线心理健康社区中的影响，发现AI增强了响应速度和患者参与度，但导致部分顾问减少参与，产生跨层溢出效应。

AI中文摘要

在线心理健康社区（OMHCs）是分层平台，通过公开问答论坛和付费私人咨询连接患者与受过资格认证的顾问。其双层结构为生成式AI整合带来战略困境。对话代理可提供可扩展且及时的响应，缓解持续供应短缺，但大规模存在可能重塑顾问在提供细致专业知识、情感敏感支持和付费咨询中的参与，这些是平台收入和长期可持续性的核心。利用一个生成式AI对话代理整合的准自然实验，我们检验了AI进入如何影响顾问参与。通过多重识别策略，我们发现AI整合后发布强度显著增加，而平均响应长度保持不变，每条帖子的社会认可度下降。机制分析显示，AI提高了响应速度并扩大了患者参与度，扩大了顾问的机会集，部分活动从附近的非AI子论坛重新分配。顾问参与异质性：内在动机顾问减少参与，而经济动机顾问增强竞争努力。这些动态产生跨层溢出效应：不活跃的顾问经历付费咨询的下降，而增加公共参与的顾问保持或扩大下游需求。总体而言，我们的发现表明，在分层专业平台中，需求扩张和竞争激励可以超过内在挤出。

英文摘要

Online mental health communities (OMHCs) are tiered platforms that connect patients with licensed counselors through public Q&A forums and paid private consultations. Their two-tier structure creates a strategic dilemma for genAI integration. Conversational agents can provide scalable and timely responses to a broader set of patients, alleviating persistent supply shortages, but their large-scale presence may also reshape counselors' participation in providing nuanced expertise, emotionally sensitive support, and paid consultations, which are central to platform revenue and long-run sustainability. Leveraging a quasi-natural experiment from the integration of a genAI-based conversational agent in a leading OMHC, we examine how AI entry affects counselor participation. Using multiple identification strategies, we find that posting intensity increases significantly after AI integration, while average response length remains unchanged and per-post social recognition declines. Mechanism analyses show that AI improves responsiveness and expands patient engagement, enlarging counselors' opportunity sets, with activity partially reallocated from a nearby non-AI subforum. Counselors respond heterogeneously: intrinsically motivated counselors reduce participation, whereas economically motivated counselors intensify competitive effort. These dynamics generate cross-tier spillovers: inactive counselors experience declines in paid consultations, while those who increase public participation preserve or expand downstream demand. Overall, our findings show that in tiered professional platforms, demand expansion and competitive incentives can outweigh intrinsic crowding-out.


2605.16278 2026-05-19 cs.CY cs.AI cs.HC 版本更新

Keeping an Eye on AI: A Framework for Effective Human Oversight of AI Systems

关注人工智能：一种有效的人工智能系统人类监督的框架

Susanne Gaube, Markus Langer, Tim Miller, Kevin Baum, Raimund Dachselt, Anna Maria Feit, Ujwal Gadiraju, Harmanpreet Kaur, Mark T. Keane, Richard Landers, Johann Laux, Q. Vera Liao, Brian Lim, Linda Onnasch, Tim Schrills, Liz Sonenberg, Chenhao Tan, Nava Tintarev, Ziang Xiao, Hanwei Zhang

Affiliations: University College London; University of Freiburg; University of Queensland; Saarland University; TU Dresden; Delft University of Technology; University of Minnesota; University College Dublin; University of Oxford; University of Michigan; National University of Singapore; Technische Universität Berlin; University of Lübeck; University of Melbourne; University of Chicago; Maastricht University

AI summary: This paper proposes a cross-disciplinary framework for effective human oversight of AI systems, defines oversight architectures and processes, and discusses the open research challenges the field needs to consider.

Comments The conceptual analysis for this work was undertaken by the authors at Dagstuhl seminar 25272 'Challenges of Human Oversight: Achieving Human Control of AI-Based Systems' (https://www.dagstuhl.de/25272), held at Schloss Dagstuhl (June 29th-July 4th, 2025)


AI中文摘要

人工智能在高风险决策场景中的使用带来了技术、安全和规范性挑战；这些问题可能只能通过人类监督来缓解。然而，人类监督的概念缺乏共同的基础理解：监督架构未被良好定义，涉及的角色仍不明确，实施步骤不透明。因此，研究人员和实践者难以确定如何设计、实现和评估能够有效实现人类监督的系统。本文提出了一种实用框架，基于计算机科学、人机交互、心理学、哲学和法律的跨学科视角。核心贡献包括：（1）一个基础框架，包含有效的人工智能系统人类监督的定义、架构和流程；（2）一个初步的文档模板，用于记录监督架构和流程，适用于不同领域；（3）对新兴有效人工智能系统人类监督领域需要考虑的开放性研究挑战的综合总结。

英文摘要

The use of Artificial Intelligence (AI) in high-risk, decision-making scenarios presents technical, safety, and normative challenges; problems that may only be ameliorated by human oversight. However, notions of human oversight lack a common foundational understanding: oversight architectures are not well defined, the roles involved remain unclear, and implementation steps are opaque. Hence, researchers and practitioners struggle to determine how to design, implement, and evaluate systems that enable effective human oversight. This paper advances a practical framework for effective human oversight of AI systems, based on a cross-disciplinary perspective that draws on insights from computer science, human-computer interaction, psychology, philosophy, and law. The core contributions are: (1) a foundational framework, with a working definition, architecture and processes for effective human oversight of AI systems; (2) an initial template for documenting oversight architectures and processes, applied to diverse domains; and (3) a synthesis of open research challenges that need to be considered in the emerging field of effective human oversight of AI systems.


2605.16277 2026-05-19 cs.CY cs.AI 版本更新

Generative AI in K-12 Classrooms: A Midyear Implementation Report

生成式AI在K-12课堂中的应用：中期实施报告

Lief Esbenshade, Alex Liu, Michael Xiao, Zewei Tian, Min Sun, Zachary Zhang, Thomas Han, Yulia Lapicus, Kevin He

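Affiliation: University of Washington

AI summary: This report analyzes teachers' use of Colleague AI across 12 Washington State school districts from September to December 2025, examining the early effects of AI adoption in K-12 education and the factors that influence it.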

2605.16275 2026-05-19 cs.CY cs.AI cs.CL cs.MM 版本更新

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

AI 产出物还是AI增强？英语学术用途课程中学生对AI生成媒体的看法

David James Woo, Deliang Wang, Kai Guo

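Affiliations: Everwrite Limited; Faculty of Education, The University of Hong Kong; Faculty of Education, The Chinese University of Hong Kong

AI summary: The study examines the pedagogical value of AI-generated content in an EAP course. A mixed-methods analysis finds that students prefer visual content and that video preference correlates positively with academic performance, while higher cognitive load correlates negatively with grades, indicating that content must be designed carefully to support learning.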

Comments 23 pages, 7 figures


AI中文摘要

人工智能（AI）检索增强生成（RAG）工具现在使教育者能够将课程材料转化为多样化的多媒体内容。然而，这种AI生成内容是教学支架还是低质量的AI产出仍不明确。本文报告了在一所香港社区学院的英语学术用途（EAP）课程中，教师引导生成补充材料的开发、实施与评估。主要使用Google Notebook LM生成视频、播客、信息图和个性化反馈报告。通过混合方法设计，包括调查、半结构化访谈和与学术成绩的相关性分析，研究发现学生认为材料有用且易于使用，更偏好与评估相关的视觉和多模态内容，特别是视频和信息图。视频偏好与学业成绩正相关，但高认知负荷与成绩负相关，表明需谨慎校准内容复杂性。值得注意的是，部分成绩较低的学生自行将材料作为补救支架。该实践表明，RAG工具能够实现传统方法难以实现的规模化个性化反馈。当与学生目标和认知原理相结合时，教师引导的AI生成可以有意义地增强EAP学习生态系统，而非产生AI产出物。

英文摘要

Artificial intelligence (AI) retrieval-augmented generation (RAG) tools now enable educators to transform course materials into diverse multimedia at scale. However, it remains unclear whether such AI-generated content functions as a pedagogical scaffold or AI slop: high volume, low quality material. This innovative practice paper reports on the development, implementation, and evaluation of teacher-prompted, AI-generated supplemental materials in an English for Academic Purposes (EAP) course at a Hong Kong Community College. Using primarily Google Notebook LM, the instructor generated videos, podcasts, infographics, and individualized feedback reports from course materials and student work for 106 English as a Foreign Language learners. An explanatory sequential mixed-methods design comprising a survey, semi-structured interviews, and correlation analysis with academic scores was employed to examine students' preferences, perceptions, and learning outcomes. Findings are framed through the Technology Acceptance Model and Cognitive Load Theory. Students rated the materials highly for perceived usefulness and ease of use, and preferred assessment-linked content presented in visual and multimodal formats, particularly videos and infographics. Video preference correlated positively with academic performance; however, higher cognitive load was negatively associated with course grades, indicating that material complexity must be carefully calibrated. Notably, some lower-performing students independently adopted the materials as remedial scaffolds. The practice demonstrates that RAG tools enable scalable personalized feedback that would be less feasible through traditional methods. When aligned with student goals and cognitive principles, teacher-prompted AI generation can meaningfully enhance the EAP learning ecosystem rather than producing AI slop.


2605.16274 2026-05-19 cs.HC cs.AI 版本更新

ChartDesign: Towards LLM Designer of Data Visualization

ChartDesign: 向数据可视化设计的LLM设计师迈进

Mohammed Afaan Ansari, Aniruddh Bansal, Tianyi Zhou

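Affiliation: University of Maryland, College Park

AI summary: This paper proposes ChartDesign, which uses large language models to generate chart design attributes for data visualization, reaching up to 84% accuracy on chart design and generalizing to unseen domains.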


AI中文摘要

图表是数据可视化的主要媒介，用于发现模式和趋势以及传达数据驱动的见解，但设计仍需要昂贵的人力和专业知识，如选择适当的图表类型、轴方向、字体大小和布局。大多数自动可视化系统依赖手工规则或简单匹配，难以跨领域泛化。本文探索了大型语言模型（LLM）作为图表设计师的潜力。我们提出了ChartDesign，通过后训练LLM来模仿专家并根据表格数据生成图表设计属性。为此，我们精心整理了来自公共调查（PewResearch）和学术存储库（CharXiV）的多样化训练语料。使用视觉语言模型提取数据和设计属性，包括图表类型、子类型、对齐、标题、轴标签和条间距，格式为JSON。然后在Phi3、Qwen3和InternVL2.5上微调LoRA适配器，学习从数据到设计规范的映射。ChartDesign在强基线上显著提高了图表设计性能，在测试集上达到84%的准确率（对比最佳基线53%），并泛化到未见领域。我们进一步表明，由ChartDesign生成的图表在视觉上令人满意且受人类偏好，缩小了人类与AI在数据可视化中的差距。

英文摘要

Charts are the dominant medium for visualizing data, discovering patterns and trends, and communicating data-driven insights, yet designing them still requires expensive human effort and expertise, such as selecting appropriate chart types, axis orientations, font sizes, and layouts. Most automatic visualization systems rely on handcrafted heuristics or simple rule matching and therefore struggle to generalize across domains. This work explores the potential of large language models (LLMs) as chart designers. We propose ChartDesign, which post-trains LLMs to imitate human experts and generate chart design attributes given tabular data. To this end, we curate a diverse training corpus of data-design pairs from charts in public surveys (PewResearch) and academic repositories (CharXiV). Vision-language models are used to extract data and design attributes from these charts, including chart type, subtype, alignment, titles, axis labels, and bar spacing, formatted as JSON. We then fine-tune LoRA adapters on Phi3, Qwen3, and InternVL2.5 to learn a mapping from data to design specifications. ChartDesign significantly improves chart design performance over strong baselines, achieving up to 84% accuracy on a held-out test set (vs. 53% for the best baseline) and generalizing to unseen domains. We further show that charts rendered from ChartDesign-generated specifications are visually appealing and human-preferred, narrowing the human-AI gap in data visualization.


2605.16272 2026-05-19 cs.HC cs.AI 版本更新

Beyond Compliance: How AI Could Help Creative Writers by Refusing Them

超越合规：AI如何通过拒绝帮助创意作家

Hua Xuan Qin, Guangzhi Zhu, Mingming Fan, Pan Hui

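Affiliations: The Hong Kong University of Science and Technology; The Hong Kong University of Science and Technology (Guangzhou); Computational Media and Arts (Guangzhou); Hong Kong, China

AI summary: This paper explores whether AI can prompt reflection by refusing requests. A qualitative study with 22 creative writers finds that reflective potential depends on how well a refusal aligns with writers' differing preferences along situational, cognitive, and relational dimensions, and the paper discusses strategies for frictional AI design.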

Comments conditionally accepted to Creativity & Cognition 2026


AI中文摘要

主流创意支持设计优先考虑合规AI以实现无缝写作交互，但对过度依赖AI的担忧凸显了需要促进对平衡使用AI和非AI资源的反思设计的需求。理论上，有意的AI非合规性（拒绝请求）可通过比其他可绕过方案更强的摩擦引入反思。实践中，拒绝内容和语言特征导致微妙反应。然而，很少有研究在超出强制性伦理和技术约束之外的细微差别上进行实证研究，探讨如何将拒绝转化为战略摩擦以处理'无害'请求。本文通过22名创意作家的质性研究，探讨了在写作不同阶段（规划、翻译、审查）中拒绝常见请求的反应。研究结果表明，反思潜力取决于在情境（例如，收敛/发散思维阶段）、认知（例如，领域信念）和关系（例如，AI角色）维度上的异质性偏好对齐。我们讨论了对创意支持的影响、更广泛的问题（例如，AI成瘾）以及摩擦性和无缝AI设计（例如，整合不同合规级别）的含义。

英文摘要

Mainstream creativity support design prioritizes compliant AI for seamless writing interactions, but concerns over inappropriate AI reliance highlight the need for designs that foster reflection on the balanced use of AI and non-AI resources. Theoretically, intentional AI non-compliance in the form of refusals (saying "no" to requests) could introduce such reflection through friction stronger than other, bypassable solutions. Practically, the content and language characteristics of refusals lead to nuanced reactions. However, little empirical research looks beyond mandatory ethical and technical constraints to examine how refusals can be turned into strategic friction for 'innocuous' requests. We address this through a qualitative study with 22 creative writers, exploring reactions to refusals of common requests across writing stages (planning, translating, reviewing). Findings suggest that reflective potential depends on heterogeneous preference alignment along situational (e.g., convergent/divergent thinking phases), cognitive (e.g., domain beliefs), and relational (e.g., AI roles) dimensions. We discuss implications for creativity support, broader issues (e.g., AI addiction), and frictional/seamful AI design (e.g., integrating different compliance levels).


2605.16269 2026-05-19 cs.HC cs.AI 版本更新

Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments

训练培训者——一种基于同伴的AI框架，用于战场环境中的同伴心理支持

Atmaram Yarlagadda, Eranga Bandara, Ross Gore, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Amin Hass, Abdul Rahman

Affiliations: McDonald Army Health Center; Old Dominion University; Blanchfield Army Community Hospital; Department of Psychiatry and Neurobehavioral Sciences; University of Virginia School of Medicine; Florida International University; Accenture Technology Labs

AI summary: This paper proposes a peer-based framework in which AI agents make psychological support in battlefield environments more efficient, reducing the need for evacuation and improving continuity of care.


AI中文摘要

现代军事行动使士兵面临持续的心理压力，导致急性反应、创伤后应激症状和其他心理健康问题。尽管美国国防部提供循证疗法，但在前沿部署和对抗性环境中难以获得受过训练的专业人员。因此，出现早期心理困扰的士兵常被后送至后方医疗设施，导致护理延迟、战备状态下降和长期风险增加。本文提出一个Train-the-Trainers框架，其中完成治疗并返回执勤的士兵被训练为同伴促进者，在作战环境中提供一线心理支持。为了在资源和连接性严重受限的条件下扩展并标准化这一模式，我们引入了一个基于智能体AI的平台，为这些康复士兵配备专门的AI智能体。康复士兵作为人类监督者，协调各智能体完成症状分诊、引导式同伴支持干预、作战约束推理、训练与模拟，以及在需要临床升级时的结构化文档记录。AI智能体在高风险环境中使用共识驱动的决策支持。该架构可在物理隔离和低连接性环境中运行，并保持人类监督和伦理保障。功能性原型是与位于美国弗吉尼亚州纽波特纽斯的McDonald美国陆军健康中心合作开发的。通过结合基于同伴的干预与共识驱动的智能体AI决策支持，该框架旨在缩短响应时间，防止症状升级，减少不必要的后送，并提高护理连续性。本工作展示了智能体AI如何在艰苦环境中成为心理健康支持的力量倍增器，并指出了在国防和人道主义行动中开展更广泛评估和部署的途径。

英文摘要

Modern military operations expose soldiers to sustained psychological stress, leading to acute reactions, post-traumatic stress symptoms, and other mental health issues. Although the U.S. Department of Defense offers evidence-based therapies, access to trained professionals in forward-deployed and contested environments is limited. As a result, soldiers with early-stage distress are often evacuated to rear medical facilities, delaying care, reducing readiness, and increasing long-term risks. This paper proposes a Train-the-Trainers framework in which soldiers who have completed therapy and returned to duty are trained as peer facilitators to provide first-line psychological support in operational settings. To scale and standardize this model under severe resource and connectivity constraints, we introduce an agentic AI-enabled platform that augments these recovered soldiers with specialized AI agents. The recovered soldier acts as a human supervisor, coordinating agents for symptom triage, guided peer-support interventions, operational constraint reasoning, training and simulation, and structured documentation for clinical escalation when needed. The AI agents use consensus-driven decision support in high-stakes environments. The architecture functions in air-gapped and low-connectivity settings, maintaining human oversight and ethical safeguards. A functional prototype was developed with the McDonald U.S. Army Health Center, Newport News, VA, USA. By combining peer-based intervention with consensus-driven agentic AI decision support, the framework seeks to cut response times, prevent symptom escalation, reduce unnecessary evacuations, and improve continuity of care. This work shows how agentic AI can serve as a force multiplier for mental health support in austere environments and identifies pathways for broader evaluation and deployment across defense and humanitarian operations.


2605.16268 2026-05-19 cs.HC cs.AI cs.LG 版本更新

Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes

帮助陷入困境的客户：一个基于LLM的代理，能够对话、探测和分流

Alankar Atreya, Stefan Sylvius Wanger, Devesh Batra, Robert Hankache, Cristovao Iglesias, Patrick Sinclair, Giulio Pelosio, Michael McMillan, Greig A. Cowan, Raad Khraishi

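Affiliation: GitHub

AI summary: This paper presents an LLM-based AI triage agent that uses multi-turn conversation and probing questions to classify customer cases more accurately, improving the efficiency of bank customer service.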


AI中文摘要

银行每年收到数百万起欺诈、诈骗和争议交易报告，准确将客户引导至合适专业团队极具挑战性。现有的人工流程对客户和员工都缓慢且压力大。为此，我们开发了一个面向客户的AI分流代理，利用大型语言模型（LLMs）进行多轮对话、提问和分类，以实现准确的政策引导分流，嵌入客户旅程中。为了评估和持续改进代理，模拟了真实的客户数字孪生，生成基于历史数据的真实、标注对话，以测试各种现实场景。本文详细介绍了分流代理的建模方法、与政策、安全护栏和推理框架的整合、使用合成代理进行可扩展评估，以及AI系统在准确性、鲁棒性和合规性方面的发现。结果表明，代理成功提高了历史案例的分流效果，实现分类准确率提升30.6%，我们的领域专家报告了高水平的满意度，突显了针对性探测在大规模银行运营中的有效分流作用。

英文摘要

Banks receive millions of reports of fraud, scams, and disputed transactions every year, making it challenging to accurately direct customers to the appropriate specialist teams for assistance. The existing human-driven manual process is slow and stressful for both customers and staff. To address this, we develop a customer-facing, AI-powered triage agent, embedded in the customer journey, that leverages large language models (LLMs) to conduct multi-turn conversations, ask relevant questions, and classify cases for accurate, policy-guided routing. To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. This work details the triage agent's modelling approach, integration with policy, safety guardrails and reasoning frameworks, the use of the synthetic agent for scalable evaluation, and findings on the AI system's accuracy, robustness, and compliance. Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy, with high satisfaction levels reported by our subject-matter experts, highlighting how targeted probing can lead to more effective triage in banking operations at scale.


2605.16265 2026-05-19 cs.AI cs.CR 版本更新

AgentWall: A Runtime Safety Layer for Local AI Agents

AgentWall：本地AI代理的运行时安全层

Ashwin Aravind

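Affiliation: GitHub

AI summary: This paper presents AgentWall, a runtime safety and observability layer for local AI agents that intercepts agent actions, evaluates them against policy, requires human approval for sensitive operations, and records execution trails, achieving 92.9% policy enforcement accuracy.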

Comments 16 pages, 2 figures, open-source implementation at https://github.com/agentwall/Agentwall


AI中文摘要

自主AI代理的安全性日益被认识到是关键的开放性问题。随着代理从被动文本生成器转变为能够执行shell命令、修改文件、调用API和浏览网络的活跃行为者，不安全或被对手操控的行为后果变得立即且具现实影响。现有AI安全工作主要集中在模型对齐和输入过滤，但这些方法无法解决当代理意图变为真实机器上的实际操作时的问题。此缺口在本地环境中尤为明显，因为开发者在自己的文件系统、凭证和基础设施上运行代理时缺乏运行时控制。本文介绍AgentWall，一种用于本地AI代理的运行时安全与可观测性层。AgentWall在代理操作到达主机环境之前拦截每一个提议的代理操作，将其与显式声明性政策进行评估，要求对敏感操作进行人工批准，并记录完整的执行轨迹以供审计和回放。它被实现为一个强制执行MCP代理和原生OpenClaw插件，可在Claude Desktop、Cursor、Windsurf、Claude Code和OpenClaw上通过单条安装命令运行。我们展示了AgentWall的设计、架构、威胁模型和政策模型，并在14个基准测试中实现了92.9%的政策执行准确率，亚毫秒级的开销。AgentWall在https://github.com/agentwall/Agentwall上开源。

英文摘要

The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible. Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent's intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control. This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay. It is implemented as a policy-enforcing MCP proxy and native OpenClaw plugin, working across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and demonstrate 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests. AgentWall is open-source at https://github.com/agentwall/Agentwall.
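The intercept, evaluate, approve, and record loop described in the abstract can be illustrated with a minimal sketch. This is not AgentWall's actual policy format or API, neither of which is given in the abstract: the `Rule` and `PolicyGate` names, the glob-based matching, and the three decision strings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass
class Rule:
    """One declarative policy rule (hypothetical format)."""
    action_type: str  # e.g. "shell" or "file_read"
    pattern: str      # glob matched against the action target
    decision: str     # "allow", "deny", or "ask" (ask = needs human approval)


@dataclass
class PolicyGate:
    """Evaluates proposed actions before they reach the host (hypothetical)."""
    rules: list
    default: str = "ask"  # unmatched actions fall back to human approval
    audit_log: list = field(default_factory=list)

    def evaluate(self, action_type, target):
        decision = self.default
        for rule in self.rules:  # first matching rule wins
            if rule.action_type == action_type and fnmatchcase(target, rule.pattern):
                decision = rule.decision
                break
        # Record every evaluated action so the trail can be audited later.
        self.audit_log.append((action_type, target, decision))
        return decision


gate = PolicyGate(rules=[
    Rule("shell", "rm *", "deny"),
    Rule("file_read", "/home/user/project/*", "allow"),
])
print(gate.evaluate("shell", "rm -rf /tmp/build"))
print(gate.evaluate("http", "https://example.com"))
```

Defaulting unmatched actions to "ask" is a fail-closed choice for this sketch: anything the policy does not explicitly cover is routed to a human rather than silently allowed.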


2605.16259 2026-05-19 cs.LG cs.AI cs.DC 版本更新

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

苹果M3 Ultra上的实时扩散模型推断系统优化

Yoichi Ochiai

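Affiliation: University of Tsukuba, Faculty of Library, Information and Media Science

AI summary: This paper systematically optimizes real-time diffusion model inference on the Apple M3 Ultra, combining several techniques to reach real-time image-to-image transformation at 22.7 FPS and revealing optimization characteristics that differ from those of NVIDIA GPUs.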


AI中文摘要

尽管扩散模型在NVIDIA GPU上的实时图像生成技术迅速发展，但针对非CUDA平台如苹果硅芯片的系统性优化研究极为有限。本文在苹果M3 Ultra（60核心GPU，512GB统一内存）上进行了10个阶段的全面优化实验，目标是实现实时摄像头img2img转换。我们探索了包括CoreML转换、量化、令牌合并、神经引擎利用、紧凑模型探索、帧内插、基于kNN搜索的合成、pix2pix-turbo、光学流帧跳过和知识蒸馏等多种技术，并定量评估了每种方法的有效性。最终，通过结合CoreML转换的蒸馏专用模型SDXS-512与三线程摄像头流水线，在512x512分辨率下实现了22.7 FPS的实时摄像头img2img转换。本文的主要贡献是系统地证明了在CUDA上建立的优化见解不一定适用于苹果硅芯片的统一内存架构。我们揭示了一个与NVIDIA GPU截然不同的优化景观，包括量化无提速、并行推断无效以及神经引擎不适合大规模模型等特性，并为苹果硅芯片上的扩散模型推断提供了实用指南。

英文摘要

While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512x512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon's unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models -- and provide practical guidelines for diffusion model inference on Apple Silicon.


2605.14624 2026-05-19 cs.LG cs.AI cs.NE 版本更新

An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

用于比较组合优化中神经求解器与启发式求解器的摊销效率阈值

Sohaib Afifi

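Affiliation: Univ. Artois

AI summary: This paper studies the energy cost of neural solvers in combinatorial optimization and proposes the Amortized Efficiency Threshold framework. Experiments show that a neural solver uses less total energy than a heuristic baseline once deployment volume exceeds the threshold, offering a new way to evaluate such solvers.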

Comments 13 pages, 3 figures, 1 table. Code and benchmark pipeline at https://github.com/sohaibafifi/aet. v1: initial release with CVRP n=50


AI中文摘要

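神经组合优化求解器常被批评能效不如CPU元启发式方法，理由是在GPU上训练它们的运行能耗很高。本文审视了从“训练成本高”到“神经求解器整体低效”的推断步骤，这正是该批评出错之处。训练网络需要一次性消耗大量GPU能量；运行元启发式方法则在每个实例上消耗少量CPU能量，并在求解器部署期间不断重复。在确定部署量之前，二者不可直接比较。我们将摊销效率阈值（AET）定义为：在对解质量施加显式约束的条件下，神经求解器在总能耗或碳排放上与启发式基线持平所需的部署量。我们证明，只要网络在单实例上占优，两种求解器的累计能耗之比就趋于一个严格小于1的常数，且该极限不依赖于训练成本的测量方式。隐含碳项在两侧对称地摊销硬件制造成本。我们在n=50个客户的CVRP环境上实例化该框架，采用Kool等人（2019）基于注意力的自回归求解器（在20,000个实例上训练100个epoch，使用五个随机种子），并以通过PyVRP运行的HGS作为启发式基线。在六点基线预算扫描的中位数处，实测的运行能耗交叉点约为4.56e3个已部署实例；单实例的神经与启发式能耗之比为2.29e-3。本文的贡献在于框架、开放的测量工具以及端到端的测量协议。代码和基准流水线见 https://github.com/sohaibafifi/aet。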

英文摘要

A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from "training is expensive" to "neural solvers are net-inefficient", which is where the critique actually goes wrong. Training the network costs a large fixed amount of GPU energy; running the metaheuristic costs a small amount of CPU energy on every instance, repeated as long as the solver is deployed. The two are not commensurable until a deployment volume is fixed. We define the Amortized Efficiency Threshold (AET) as the deployment volume above which a neural solver breaks even with a heuristic baseline in total energy or carbon, under an explicit constraint on solution quality. We show that the cumulative-energy ratio between the two solvers tends to a constant strictly below one whenever the network wins per instance, and that this limit does not depend on how the training cost was measured. An embodied-carbon term amortizes hardware fabrication symmetrically on both sides. We instantiate the framework on the CVRP environment at n=50 customers with the attention-based autoregressive solver of Kool et al. (2019), trained for 100 epochs on 20,000 instances over five random seeds, and HGS via PyVRP as the heuristic baseline. The measured operational crossover sits near 4.56e3 deployed instances at the median of a six-point baseline-budget sweep; the per-instance neural-to-heuristic ratio is 2.29e-3. The contribution is the framework, the open instrumentation, and the end-to-end measurement protocol. Code and benchmark pipeline are available at https://github.com/sohaibafifi/aet.
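The break-even argument in the abstract can be made concrete with a small sketch. It assumes the simplest reading of the setup: a one-off training energy plus a constant per-instance energy for each solver. The function names and the numbers are hypothetical, not taken from the paper or its repository, and the sketch ignores both the solution-quality constraint and the embodied-carbon term.

```python
import math


def amortized_efficiency_threshold(train_energy, neural_per_instance,
                                   heuristic_per_instance):
    """Smallest deployment volume N with
    train_energy + N * neural_per_instance <= N * heuristic_per_instance.

    Returns None when the neural solver does not win per instance,
    because it then never breaks even.
    """
    saving = heuristic_per_instance - neural_per_instance
    if saving <= 0:
        return None
    return math.ceil(train_energy / saving)


def cumulative_energy_ratio(n, train_energy, neural_per_instance,
                            heuristic_per_instance):
    """Total neural energy divided by total heuristic energy after n instances."""
    return (train_energy + n * neural_per_instance) / (n * heuristic_per_instance)


# Hypothetical energies in arbitrary units.
TRAIN, NEURAL, HEURISTIC = 1000.0, 0.25, 1.0
n_star = amortized_efficiency_threshold(TRAIN, NEURAL, HEURISTIC)
print(n_star)
print(cumulative_energy_ratio(10**9, TRAIN, NEURAL, HEURISTIC))
```

As the deployment volume grows, the training term is divided by an ever larger N, so the cumulative ratio approaches the per-instance ratio (0.25 in this example). Under this simple model, that is the limit the abstract describes as independent of how the training cost was measured.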


2605.14457 2026-05-19 cs.AI 版本更新

Stateful Reasoning via Insight Replay

通过洞察回放实现状态化推理

Bin Lei, Caiwen Ding, Jiachen Yang, Ang Li, Xin Eric Wang

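Affiliations: University of Minnesota; Simular AI

AI summary: This paper proposes InsightReplay, which replays critical insights to keep important information accessible during reasoning, improving large language model performance on complex tasks.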


AI中文摘要

链式推理（CoT）已成为引导大语言模型进行多步骤推理的基础，但最近的研究表明，其收益并不随链长单调增加：虽然更长的CoT通常使模型能够解决更困难的问题，但在特定问题中，准确性通常在链长达到一定程度后会下降。我们发现这一现象的主要原因：随着CoT的增长，模型对推理轨迹中早期产生的关键洞察的注意力逐渐减弱，使这些洞察在最需要的时候变得越来越难以访问。因此，我们提出了InsightReplay，一种状态化推理方法，其中模型会定期从其推理轨迹中提取关键洞察并在活跃生成前沿附近回放它们，以保持这些洞察的可访问性。在包含模型规模{8B, 30B}、模型家族{Qwen3.5, DeepSeek-R1-Distill-Qwen, Gemma-4}和推理基准{AIME, HMMT, GPQA Diamond, LiveCodeBench v5}的$\mathbf{2}\!\times\!\mathbf{3}\!\times\!\mathbf{4}$基准网格上进行的大量实验表明，3轮InsightReplay在所有24种设置中均实现了准确性提升，平均比标准CoT提高了$\mathbf{+1.65}$点，其中在R1-Distill-32B的LiveCodeBench v5子集上最大的单设置提升为$\mathbf{+9.2}$点。我们的结果表明，测试时扩展的有效性不仅取决于模型推理的程度，还取决于关键中间洞察在整个长推理轨迹中是否保持可访问性。

英文摘要

Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose \textbf{InsightReplay}, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a $\mathbf{2}\!\times\!\mathbf{3}\!\times\!\mathbf{4}$ benchmark grid, covering model scales $\{\text{8B}, \text{30B}\}$, model families $\{\text{Qwen3.5}, \text{DeepSeek-R1-Distill-Qwen}, \text{Gemma-4}\}$, and reasoning benchmarks $\{\text{AIME}, \text{HMMT}, \text{GPQA Diamond}, \text{LiveCodeBench v5}\}$, show that 3-round InsightReplay yields accuracy gains across \textbf{all 24 settings}, with an averaged improvement of $\mathbf{+1.65}$ points over standard CoT, and a largest single-setting gain of $\mathbf{+9.2}$ points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.
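The extract-and-replay loop described in the abstract can be sketched as follows. This is an illustration of the idea only, not the authors' implementation: the `generate` and `extract_insights` callables, the prompt wording, and the round structure are all assumptions, since the abstract does not specify how insights are extracted or formatted.

```python
def insight_replay(question, generate, extract_insights, rounds=3):
    """Run `rounds` reasoning segments, replaying extracted insights before each.

    `generate(prompt) -> str` and `extract_insights(segment) -> list[str]`
    are caller-supplied stand-ins for model calls (hypothetical interfaces).
    """
    trace = []     # reasoning segments produced so far
    insights = []  # de-duplicated insights, in order of discovery
    for _ in range(rounds):
        replay = "\n".join(f"- {item}" for item in insights)
        # Insights go at the end of the prompt, next to where generation resumes.
        prompt = (
            f"Question: {question}\n"
            f"Reasoning so far:\n{''.join(trace)}\n"
            f"Key insights to keep in mind:\n{replay}\n"
            "Continue reasoning:"
        )
        segment = generate(prompt)
        trace.append(segment)
        for item in extract_insights(segment):
            if item not in insights:
                insights.append(item)
    return "".join(trace), insights
```

Placing the replayed insights immediately before the continuation point is one way to keep them close to the active generation frontier, however long the earlier trace has grown.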


2605.09826 2026-05-19 cs.AI cs.MA 版本更新

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

EnactToM：一种面向具身智能体功能性心智理论的演进基准

Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang

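Affiliations: University of California, Santa Barbara; King’s College London; Cisco Research

AI summary: This paper introduces EnactToM, a benchmark of 300 tasks in a 3D household environment that tests functional Theory of Mind in embodied agents, with each task verified for solvability and required epistemic depth. It shows that current models fall short when they must act on implicit beliefs.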


AI中文摘要

心智理论（ToM）是追踪他人认知状态的能力，使人类成为高效的协作者。AI智能体在多智能体场景中需要同样的能力，但现有基准大多通过直接询问信念来测试字面意义上的ToM。在具身环境中基于隐含信念采取最优行动的能力称为功能性ToM，目前基本未经测试。我们引入EnactToM，一个包含300个具身多智能体任务的演进基准，任务设置在3D家庭环境中，具有部分可观测性、私有信息和受限通信。每个任务的可解性和所需认知深度均经过形式化验证，并且随着模型的进步会生成难度更高的新任务。在困难划分上，所有七个受评估的前沿模型在功能性任务完成上的Pass^3得分均为0.0%，而在字面信念探测上的平均得分为45.0%。人工分析将93%的抽样失败归因于认知协调失效，例如隐瞒信息、忽略同伴约束和消息分配不当，为未来工作提供了具体目标。

英文摘要

Theory of Mind (ToM), the ability to track others' epistemic states, makes humans efficient collaborators. AI agents need the same capacity in multi-agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability to act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated to increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.


2605.08475 2026-05-19 cs.LG cs.AI cs.NA math.NA math.OC 版本更新

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Transformer 可实现用于上下文高斯核回归的预条件Richardson迭代

Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang

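Affiliation: Department of Mathematics, University of California, Santa Barbara, CA

AI summary: This paper studies in-context kernel ridge regression and proves that a standard softmax-attention transformer can approximate the Gaussian-kernel regression predictor via preconditioned Richardson iteration, revealing a functional decomposition within the transformer architecture.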


AI中文摘要

对上下文学习（ICL）的机制解释已识别出用于线性回归及相关线性预测任务的迭代算法，通常使用线性或ReLU注意力变体。对于非线性ICL，先前工作将softmax和核化注意力与功能梯度型动态相关联，但尚不清楚标准softmax注意力transformer能否实现具有端到端预测误差保证的收敛求解器。本文研究了具有高斯核的上下文核岭回归（KRR），并证明标准softmax-注意力transformer在正向传递过程中可通过在关联的核线性系统上实现预条件Richardson迭代来近似KRR预测器。在数据有界假设下，我们构建了一个具有O(log(1/ε))个块和MLP宽度O(√(N/ε))的单头transformer，实现了对长度为N的提示的ε精度预测。我们的构造揭示了transformer架构中的功能分解：softmax注意力产生用于跨token交互的行归一化高斯核算子，而ReLU MLP层局部近似所需的intra-token标量算术。通过训练GPT-2风格的transformer进行高斯过程回归任务进一步测试预条件Richardson解释。通过线性探测，我们比较transformer的层间预测与经典KRR求解器的逐步输出，发现其误差谱与预条件Richardson迭代最一致。消融研究进一步支持这一解释。共同，我们的理论和实验识别出预条件Richardson迭代作为softmax-注意力transformer实现非线性上下文高斯核回归的明确机制。

英文摘要

Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/ε))$ blocks and MLP width $O(\sqrt{N/ε})$ that achieves $ε$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.
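The classical solver named in the abstract can be shown in a few lines of plain Python. The sketch solves the kernel ridge regression system (K + λI)α = y by preconditioned Richardson iteration, α ← α + ω P⁻¹(y − Aα). The diagonal row-sum preconditioner is an assumption chosen to echo the row-normalized Gaussian-kernel operator that the abstract attributes to softmax attention; the paper's actual preconditioner and transformer construction are not given in the abstract, and all names and default values here are illustrative.

```python
import math


def gaussian_kernel(a, b, bandwidth):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * bandwidth ** 2))


def krr_richardson(xs, ys, bandwidth=0.5, ridge=0.1, step=1.0,
                   tol=1e-12, max_iters=10000):
    """Solve (K + ridge*I) alpha = ys with diagonally preconditioned Richardson."""
    n = len(xs)
    # System matrix A = K + ridge * I.
    A = [[gaussian_kernel(xs[i], xs[j], bandwidth) + (ridge if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    # Assumed preconditioner: P = diag(row sums of A), i.e. row normalization.
    row_sums = [sum(row) for row in A]
    alpha = [0.0] * n
    for _ in range(max_iters):
        residual = [ys[i] - sum(A[i][j] * alpha[j] for j in range(n))
                    for i in range(n)]
        if max(abs(r) for r in residual) < tol:
            break
        # Richardson update: alpha <- alpha + step * P^{-1} * residual.
        alpha = [alpha[i] + step * residual[i] / row_sums[i] for i in range(n)]
    return alpha, A


def predict(x_query, xs, alpha, bandwidth=0.5):
    """KRR prediction f(x) = sum_i alpha_i * k(x, x_i)."""
    return sum(a * gaussian_kernel(x_query, x, bandwidth)
               for a, x in zip(alpha, xs))


xs = [0.0, 0.5, 1.0, 1.5]
ys = [math.sin(x) for x in xs]
alpha, A = krr_richardson(xs, ys)
print(predict(0.75, xs, alpha))
```

With this preconditioner and step size 1 the iteration converges: every entry of A is positive, so P⁻¹A has rows summing to one and is similar to a symmetric positive definite matrix. Its eigenvalues therefore lie in (0, 1], and the error shrinks geometrically.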


2604.28010 2026-05-19 cs.LG cs.AI 版本更新

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

从分歧中学习：临床医生的覆盖作为价值医疗中临床AI的隐含偏好信号

Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son

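Affiliation: Altitude

AI summary: This paper presents a framework that treats clinician overrides of AI recommendations as implicit preference data, introducing a five-category override taxonomy and a dual learning architecture that addresses suppression bias, with the goal of improving AI decision-making in value-based care.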

Comments 22 pages, 2 tables, 1 figure


AI中文摘要

我们重新将临床医生对临床AI建议的覆盖视为隐含偏好数据——与强化学习从人类反馈（RLHF）中利用的相同信号结构，但更丰富：标注者是领域专家，替代方案具有实际后果，下游结果是可观察的。我们提出了一个扩展标准偏好学习的正式框架，包含三个贡献：五类覆盖分类法，将覆盖类型映射到不同的模型更新目标；一个基于患者状态s、组织背景c和临床医生能力κ的偏好公式，其中κ分解为执行能力κ-exec和对齐能力κ-align；以及一个双学习架构，通过交替优化联合训练奖励模型和能力模型，防止我们称为抑制偏差的系统性问题——当临床医生能力低于执行阈值时，系统性地压制正确但困难的建议。我们论证，在基于结果的支付合同下慢性病管理产生具有独特有利属性的覆盖数据——纵向密度、集中决策空间、结果标签和自然能力变化，并认为结合纵向结果测量与对齐的财务激励的训练环境是学习与患者轨迹而非就诊经济相一致的奖励模型的必要条件。此框架源于改进价值医疗部署中临床医生能力的运营工作。

英文摘要

We reframe clinician overrides of clinical AI recommendations as implicit preference data: the same signal structure exploited by reinforcement learning from human feedback (RLHF), but richer, because the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable. We present a formal framework extending standard preference learning with three contributions: a five-category override taxonomy mapping override types to distinct model update targets; a preference formulation conditioned on patient state s, organizational context c, and clinician capability κ, where κ decomposes into execution capability κ-exec and alignment capability κ-align; and a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, preventing a failure mode we term suppression bias, the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold. We argue that chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties (longitudinal density, concentrated decision space, outcome labels, and natural capability variation) and that training environments combining longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics. This framework emerged from operational work to improve clinician capability in a live value-based care deployment.

2604.16400 2026-05-19 cs.DC cs.AI cs.LG (version update)

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

Shaoyuan Huang, Yunfeng Zhao, Na Yan, Tiancheng Zhang, Xiaokai Wang, Xiaofei Wang, Wenyu Wang, Yansha Deng

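Affiliations: Tianjin University; King's College London; Paiou Cloud Computing (Shanghai) Company, Ltd.

AI summary: CoLLM unifies federated parameter-efficient fine-tuning and inference to continuously adapt LLM serving on shared GPU clusters, improving model quality and efficiency; experiments report strong goodput gains.

AI abstract (translated from Chinese)

As large language models (LLMs) are increasingly used in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of LLM post-training, including fine-tuning and inference, have become critical under resource constraints. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still treated as isolated workloads, which ignores their interdependence and leads to redundant deployments and delayed gains in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it as CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at the replica and cluster levels, enabling efficient model parameter reuse and workload balancing to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation on diverse LLMs and real-world traces shows that CoLLM achieves up to 3x higher goodput than state-of-the-art LLM systems, demonstrating its effectiveness for seamless LLM post-training in edge intelligence.

English abstract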

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase, including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces shows that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput and demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.

2604.02178 2026-05-19 cs.CL cs.AI cs.LG (version update)

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

Jeremy Herbst, Stefan Wermter, Jae Hee Lee

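Affiliations: Department of Informatics, University of Hamburg, Hamburg, Germany

AI summary: Using k-sparse probing to compare MoE experts with dense FFNs, the study finds that expert neurons are more monosemantic, proposes the expert as the unit of analysis, and shows that experts are fine-grained task specialists rather than domain specialists or token-level processors.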

Comments 8 pages, 7 Figures. Accepted at ICML 2026. Improved writing, changed author order, updated citations

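AI abstract (translated from Chinese)

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling large language models (LLMs), activating only a subset of parameters per token. Although MoE is adopted mainly for computational efficiency, it remains an open question whether its sparsity makes it easier to interpret than dense feed-forward networks (FFNs). Comparing MoE experts with dense FFNs via k-sparse probing, we find that expert neurons are consistently more monosemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pushes both individual neurons and entire experts toward monosemanticity. Building on this finding, we move from the neuron level to the expert level as a more effective unit of analysis, and validate the approach by automatically interpreting hundreds of experts. The analysis resolves the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they act as fine-grained task experts that specialize in linguistic operations or semantic tasks (e.g., closing LaTeX brackets). Our findings suggest that MoEs are inherently interpretable at the expert level, offering a clearer path toward large-scale model interpretability. Code: https://github.com/jerryy33/MoE_analysis.

English abstract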

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in $\LaTeX{}$). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis.
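
The sparsity discussed in the abstract comes from routing each token to a small subset of experts. A minimal sketch of top-k routing follows, assuming a softmax over the selected router logits; the function name `top_k_route` and the renormalization choice are illustrative assumptions, not details taken from the paper.

```python
import math


def top_k_route(router_logits, k):
    """Pick the k highest-scoring experts and return {expert_index: weight}.

    Assumption: weights are a softmax over the selected logits only, which is
    one common variant; other routers normalize before selecting.
    """
    if not 1 <= k <= len(router_logits):
        raise ValueError("k must be between 1 and the number of experts")
    # Indices of the k largest logits.
    selected = sorted(
        range(len(router_logits)), key=lambda i: router_logits[i], reverse=True
    )[:k]
    # Numerically stable softmax restricted to the selected experts.
    peak = max(router_logits[i] for i in selected)
    exps = {i: math.exp(router_logits[i] - peak) for i in selected}
    total = sum(exps.values())
    return {i: value / total for i, value in exps.items()}
```

Lowering `k` relative to the number of experts makes routing sparser, which is the regime in which the abstract reports the largest monosemanticity gap.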

2603.20562 2026-05-19 cs.CL cs.AI (version update)

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan

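Affiliations: App-In Club; Carnegie Mellon University

AI summary: The paper proposes PCFJudge, which reruns the listwise factuality prompt under multiple permutations to make LLM factuality judgments more robust; experiments show a significant accuracy gain on RewardBench 2 Factuality.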

Comments Accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026
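
The summary describes rerunning a listwise prompt under several candidate orderings. One way to sketch that idea is shown here, assuming majority voting over the judge's picks. `judge` is a hypothetical callable standing in for an LLM call, and the aggregation rule is an assumption, since the summary does not say how PCFJudge combines runs.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def permutation_consensus(
    candidates: Sequence[str],
    judge: Callable[[Sequence[str]], int],
    n_permutations: int = 5,
    seed: int = 0,
) -> int:
    """Return the original index of the candidate picked most often.

    `judge` (hypothetical) receives the candidates in a shuffled order and
    returns the position of the one it considers most factual.
    """
    rng = random.Random(seed)
    votes: Counter = Counter()
    for _ in range(n_permutations):
        order = list(range(len(candidates)))
        rng.shuffle(order)  # present the candidates in a fresh order
        picked_position = judge([candidates[i] for i in order])
        votes[order[picked_position]] += 1  # map back to the original index
    return votes.most_common(1)[0][0]
```

A judge whose choice depends only on content returns the same winner under every ordering, while a position-biased judge spreads its votes, which is the sensitivity that permutation reruns are meant to expose.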

2603.19470 2026-05-19 cs.LG cs.AI (version update)

The Token Games: Evaluating Language Model Reasoning through Puzzle Duels

Simon Henniger, Gabriel Poesia

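Affiliations: Harvard University

AI summary: The paper designs The Token Games (TTG), an evaluation framework in which models duel using puzzles they create themselves, with Elo ratings used to compare model capability. It evaluates the reasoning ability of 10 frontier models and requires no human involvement in puzzle creation.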

Comments Project website: https://token-games.ai/
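
The summary mentions Elo ratings for comparing models across duels. The standard Elo update is sketched below; the K-factor of 32 and the 400-point scale are conventional defaults assumed here, not values reported for TTG.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the updated (rating_a, rating_b) after one duel.

    score_a is 1.0 if A won, 0.5 for a draw, and 0.0 if A lost.
    """
    delta = k * (score_a - expected_score(rating_a, rating_b))
    return rating_a + delta, rating_b - delta
```

The update is zero-sum: whatever one model gains, its opponent loses, so the average rating of the pool stays constant.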

2602.16699 2026-05-19 cs.CL cs.AI (version update)

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

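Affiliations: New York University

AI summary: The paper proposes the Calibrate-Then-Act framework, which has LLM agents explicitly balance cost against uncertainty so that they make better decisions in uncertain environments.

AI abstract (translated from Chinese)

LLM agents are deployed in environments where they must interact to acquire information. In these settings, the agent must weigh the inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For example, on a programming task, an agent might run the code it generates or generate tests for that code snippet; writing and running a test has nonzero cost, but that cost is typically lower than the cost of running buggy code. We show that LLM agents can be induced to weigh these cost-uncertainty tradeoffs explicitly and thereby act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented question answering and a file-reading coding task, as sequential decision-making problems under uncertainty. Each problem has a latent environment state that affects agent performance. We introduce a framework called Calibrate-Then-Act (CTA), which passes the agent an inferred prior over the environment state so that it can act more optimally. This information qualitatively changes agent behavior and adds an environment sensitivity that standard RL training does not teach. Results on a synthetic task, question answering, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover better decision-making strategies.

English abstract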

LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost-uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.
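
The test-versus-run tradeoff in the abstract can be illustrated with a toy expected-cost comparison. This is a deliberately simplified sketch, not the CTA method: it assumes a single binary latent state (the code is buggy or not), a test that always detects a bug, and hypothetical names `expected_cost_of_committing` and `should_test_first`.

```python
def expected_cost_of_committing(p_bug: float, cost_failure: float) -> float:
    """Expected cost of running the code without testing it first."""
    return p_bug * cost_failure


def should_test_first(p_bug: float, cost_test: float, cost_failure: float) -> bool:
    """Test first when the test is cheaper than the expected failure cost.

    p_bug plays the role of the prior over the latent environment state.
    Assumption: the test catches every bug, so testing avoids the failure cost.
    """
    if not 0.0 <= p_bug <= 1.0:
        raise ValueError("p_bug must be a probability")
    return cost_test < expected_cost_of_committing(p_bug, cost_failure)
```

With a test cost of 1 and a failure cost of 10, the rule says to test when the prior probability of a bug exceeds 0.1, which shows how a calibrated prior can change the preferred action.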

2601.21941 2026-05-19 cs.LG cs.AI (version update)

Robust Multimodal Representation Learning in Healthcare

Xiaoguang Zhu, Linxiao Gong, Lianlong Sun, Yang Liu, Haoyu Wang, Jing Liu

Affiliations: University of California, Davis; HKUST (GZ); University of Rochester; Tongji University; Georgia Institute of Technology; Fudan University; The University of British Columbia

AI summary: The paper proposes a dual-stream feature decorrelation framework that uses structural causal analysis to handle systematic biases in medical multimodal data and improve generalization; experiments report performance gains on the MIMIC-IV, eICU, and ADNI datasets.

DOI: 10.1109/ICASSP55912.2026.11460772

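AI abstract (translated from Chinese)

Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations that support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses a major challenge. Existing methods typically focus on effective multimodal fusion and neglect the inherently biased features that hurt generalization. To address this, we propose a dual-stream feature decorrelation framework that identifies and handles the biases introduced by latent confounders through structural causal analysis. Our method uses a causal-biased decorrelation framework with dual-stream neural networks to separate causal features from spurious correlations, using generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on the MIMIC-IV, eICU, and ADNI datasets show consistent performance improvements.

English abstract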

Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion and neglect the inherently biased features that affect generalization. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that identifies and handles the biases introduced by latent confounders through structural causal analysis. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.
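
The abstract names generalized cross-entropy (GCE) as one of the losses used. A minimal per-example sketch follows, assuming the commonly used form (1 - p_y^q) / q with q in (0, 1]; the abstract does not give the exact formulation, the value of q, or which stream the loss is applied to, so those are assumptions here.

```python
def generalized_cross_entropy(probs, label, q=0.7):
    """GCE loss (1 - p_y**q) / q for a single example.

    probs: predicted class probabilities; label: index of the true class.
    q = 1 gives 1 - p_y (MAE-like); as q approaches 0 the loss approaches
    the usual cross-entropy -log(p_y). The default q = 0.7 is an assumption.
    """
    if not 0.0 < q <= 1.0:
        raise ValueError("q must lie in (0, 1]")
    return (1.0 - probs[label] ** q) / q
```

Larger values of q weight easy, confidently predicted examples more heavily in the gradient, which is why this loss is often used to train a deliberately biased branch in debiasing setups.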

2512.03280 2026-05-19 cs.LG cs.AI (version update)

BlendedNet++: A dataset and benchmark for field-resolved aerodynamics and inverse design of blended wing body aircraft

Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Matthew C. Jones, Faez Ahmed

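Affiliations: Massachusetts Institute of Technology; MIT Lincoln Laboratory

AI summary: The paper presents the BlendedNet++ dataset of 12,492 unique BWB geometries with integrated forces and dense surface fields from RANS simulations, uses geometric deep learning models for real-time aerodynamic prediction and generative inverse design, and finds Transolver the most accurate for field prediction.

AI abstract (translated from Chinese)

The conceptual design of Blended Wing Body (BWB) aircraft is often constrained by the high computational cost of resolving complex aerodynamics over a high-dimensional design space. Although deep learning offers a route to rapid aerodynamic prediction and inverse design, its use in aerospace engineering is limited by the lack of large-scale, field-resolved training data. This work introduces BlendedNet++, a comprehensive aerodynamic dataset of 12,492 unique BWB geometries, each evaluated with steady Reynolds-Averaged Navier-Stokes (RANS) simulations to provide integrated forces and dense surface fields (Cp, Cf). Using this data, we establish a robust framework for two key engineering tasks: (1) real-time prediction of surface aerodynamic fields with geometric deep learning models, and (2) generative inverse design. We benchmark five surrogate architectures and find Transolver the most accurate for field prediction. We also demonstrate a generative inverse design pipeline that combines conditional diffusion models with gradient-based refinement. This hybrid approach generates multiple feasible designs that meet specific lift-to-drag targets with high accuracy (R^2 > 0.99), as confirmed by computational fluid dynamics (CFD) simulation. These resources enable early-stage BWB design to shift from iterative analysis to direct generation.

English abstract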

The conceptual design of Blended Wing Body (BWB) aircraft is often constrained by the high computational cost of resolving complex aerodynamics over a high-dimensional design space. While deep learning offers a pathway to rapid aerodynamic prediction and inverse design, its adoption in aerospace engineering is limited by a lack of large-scale, field-resolved training data. This work addresses this gap by introducing BlendedNet++, a comprehensive aerodynamic dataset comprising 12,492 unique BWB geometries, each evaluated using steady Reynolds-Averaged Navier-Stokes (RANS) simulations to provide integrated forces and dense surface fields (Cp, Cf). Leveraging this data, we establish a robust framework for two critical engineering tasks: (1) real-time prediction of surface aerodynamic fields using geometric deep learning models, and (2) generative inverse design. We benchmark five surrogate architectures, identifying Transolver as the most accurate for field predictions. Furthermore, we demonstrate a generative inverse design pipeline using conditional diffusion models combined with gradient-based refinement. This hybrid approach is shown to generate multiple feasible designs that satisfy specific lift-to-drag targets with high accuracy (R^2 > 0.99), as confirmed by computational fluid dynamics (CFD) simulation. These resources enable a shift from iterative analysis to direct generation in early-stage BWB design.

2510.26899 2026-05-19 cs.CY cs.AI cs.SI (version update)

How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison

Taha Yasseri, Saeedeh Mohammadi

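Affiliations: Centre for Sociology of Humans and Machines (SOHAM); School of Mathematics and Statistics

AI summary: Comparing 17,790 article pairs, the study analyzes the textual and structural similarity of Grokipedia and Wikipedia. It finds that Grokipedia articles are longer but carry fewer references, and that the content splits into two groups, one of which shows a rightward shift in the political bias of cited sources, raising questions about the transparency of AI-generated content and knowledge governance.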

Comments 20 pages, 7 figures, updated with a larger sample size of 20,000 articles, better text cleaning procedure + Reference analysis, topical analysis

DOI: 10.1073/pnas.2603294123

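AI abstract (translated from Chinese)

The launch of Grokipedia was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to generate "truthful" entries with the Grok large language model. This study compares 17,790 article pairs and assesses how the two platforms differ in lexical richness, readability, reference density, structural features, and semantic similarity. Grokipedia articles are longer but contain fewer references, and the content divides into two groups: one aligned with Wikipedia in semantics and style, the other diverging sharply. Among the dissimilar articles, the political bias of cited news media sources shifts systematically rightward, concentrated in history, religion, literature, and art. The study finds that AI-generated encyclopedic content departs from established editorial norms, favoring narrative expansion over citation-based verification, which raises questions about transparency, provenance, and the governance of knowledge.

English abstract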

The launch of Grokipedia, an AI-generated encyclopedia developed by Elon Musk's xAI, was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to produce "truthful" entries using the Grok large language model. Yet whether an AI-driven alternative can escape the biases and limitations of human-edited platforms remains unclear. This study conducts a large-scale computational comparison of 17,790 matched article pairs from the 20,000 most-edited English Wikipedia pages. Using metrics spanning lexical richness, readability, reference density, structural features, and semantic similarity, we assess how closely the two platforms align in form and substance. We find that Grokipedia articles are substantially longer and contain significantly fewer references per word. Moreover, Grokipedia's content divides into two distinct groups: one that remains semantically and stylistically aligned with Wikipedia, and another that diverges sharply. Among the dissimilar articles, we observe a systematic rightward shift in the political bias of frequently cited news media sources, concentrated primarily in entries related to history and religion, and literature and art. More broadly, the findings indicate that AI-generated encyclopedic content departs from established editorial norms, favoring narrative expansion over citation-based verification, raising questions about transparency, provenance, and the governance of knowledge in automated information systems.

2508.06524 2026-05-19 cs.CL cs.AI cs.CY cs.DC cs.LG (version update)

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

Lei Jiang, Fan Chen

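Affiliations: Indiana University

AI summary: The paper proposes CarbonScaling, a framework that combines neural scaling laws with distributed training strategies to model the carbon footprint of frontier LLM training, improving the accuracy of hardware configuration and emission estimates.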

Comments 8 pages

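AI abstract (translated from Chinese)

Large language models (LLMs) increasingly follow neural scaling laws that tie performance gains to rapidly growing compute budgets, raising concerns about the sustainability of frontier-scale training. Existing carbon-estimation methods rely mainly on regression over historical runs and fail to capture key system-level factors, including hardware heterogeneity, distributed parallelism, communication overhead, and architectural sparsity. We propose CarbonScaling, a hardware-aware analytical framework for modeling the carbon scaling behavior of frontier LLM training. The framework integrates neural scaling laws, distributed training strategies, accelerator and interconnect modeling, and operational and embodied carbon accounting to estimate feasible hardware configurations and their emissions. CarbonScaling jointly models tensor, pipeline, data, and expert parallelism while incorporating memory, bandwidth, utilization, and runtime constraints. Experimental validation shows substantially higher fidelity than regression-based baselines and highlights the growing importance of embodied carbon at trillion-parameter scale. Source code: https://github.com/UnchartedRLab/CarbonScaling.

English abstract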

Large language models (LLMs) increasingly follow neural scaling laws that tie performance gains to rapidly expanding computational budgets, raising concerns about the sustainability of frontier-scale training. Existing carbon-estimation methods largely depend on regression over historical runs and fail to capture critical system-level factors, including hardware heterogeneity, distributed parallelism, communication overhead, and architectural sparsity. We present CarbonScaling, a hardware-aware analytical framework for modeling the carbon scaling behavior of frontier LLM training. The framework integrates neural scaling laws, distributed training strategies, accelerator and interconnect modeling, and operational and embodied carbon accounting to estimate feasible hardware configurations and associated emissions. CarbonScaling jointly models tensor, pipeline, data, and expert parallelism while incorporating memory, bandwidth, utilization, and runtime constraints. Experimental validation demonstrates substantially higher fidelity than regression-based baselines and highlights the growing importance of embodied carbon at trillion-parameter scales. Source code: https://github.com/UnchartedRLab/CarbonScaling.

2502.17007 2026-05-19 cs.LG cs.AI stat.ML (version update)

Uncertainty Quantification as a Principled Foundation for Explainable Artificial Intelligence: A Case Study of Counterfactual Explanations

Kacper Sokol, Santo M. A. R. Thies, Eyke Hüllermeier

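Affiliations: Department of Informatics, USI Lugano; Department of Computer Science, ETH Zurich; Institute of Informatics, LMU Munich

AI summary: Using uncertainty quantification in counterfactual explainability as a case study, the paper demonstrates its potential as a unifying framework, proposes two explainer variants, and shows that they outperform existing methods.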

2408.17352 2026-05-19 cs.SD cs.AI eess.AS (version update)

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

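Affiliations: AIRI

AI summary: The paper proposes the AASIST3 model, which enhances the existing AASIST framework with KAN networks and other techniques to substantially improve speech spoofing detection, reaching a minDCF of 0.5357 in the closed condition.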

Comments 8 pages, 2 figures, 2 tables. Accepted paper at the ASVspoof 2024 (the 25th Interspeech Conference)

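AI abstract (translated from Chinese)

Automatic Speaker Verification (ASV) systems identify speakers from their voice characteristics and are widely used for user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, advances in deep learning algorithms have made it possible to generate synthetic audio with Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counter this, we propose a new architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It reaches a minDCF of 0.5357 in the closed condition and 0.1414 in the open condition, significantly improving the detection of synthetic voices and ASV security. The new version of the model is publicly available on HuggingFace (2026).

English abstract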

Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security. The new version of the model is publicly available on HuggingFace (2026): https://huggingface.co/lab260/Spectra-AASIST3
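
The abstract lists pre-emphasis among the additions to AASIST. Pre-emphasis is commonly implemented as the first-order filter y[n] = x[n] - alpha * x[n-1]; the sketch below assumes that form with the conventional coefficient 0.97. The abstract does not state the coefficient or where in the pipeline the filter is applied.

```python
def pre_emphasis(signal, alpha=0.97):
    """Apply y[n] = x[n] - alpha * x[n-1] to a sequence of audio samples.

    The first sample is passed through unchanged. alpha = 0.97 is a
    conventional default, assumed here rather than taken from the paper.
    """
    if not signal:
        return []
    output = [signal[0]]
    for previous, current in zip(signal, signal[1:]):
        output.append(current - alpha * previous)
    return output
```

The filter attenuates slowly varying (low-frequency) content and boosts high frequencies relative to it.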

2304.03427 2026-05-19 cs.CL cs.AI cs.CY cs.LG (version update)

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

Queenie Luo, Yung-Sung Chuang

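Affiliations: Harvard University; Massachusetts Institute of Technology (MIT)

AI summary: The paper proposes a neural spelling correction model built on Google OCR-ed Tibetan manuscripts that uses a modified Transformer architecture to automatically correct noisy OCR output; experiments show it outperforms other sequence models.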

Journal ref Association for Computing Machinery 2024

DOI: 10.1145/3654811

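AI abstract (translated from Chinese)

Scholars in the humanities rely on ancient manuscripts to study history, religion, and socio-political structures. Many efforts have gone into digitizing these precious manuscripts with OCR technology, but most manuscripts were blemished over the centuries, so OCR programs cannot accurately capture faded characters and stains. This work presents a neural spelling correction model built on Google OCR-ed Tibetan manuscripts to automatically correct noise in OCR output. The paper has four parts: dataset, model architecture, training, and analysis. First, we feature-engineered the raw Tibetan e-text corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. We then implemented a confidence score mechanism in the Transformer architecture for the spelling correction task. In terms of loss and character error rate, our Transformer with confidence score mechanism outperforms the Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the model's robustness, we analyzed erroneous tokens and visualized the attention and self-attention heatmaps in the model.

English abstract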

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using Optical Character Recognition (OCR) technology, but most manuscripts were blemished over the centuries, so an OCR program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training, and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames: a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence Score mechanism architecture proves to be superior to the Transformer, LSTM-2-LSTM, and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens and visualized the Attention and Self-Attention heatmaps in our model.
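
The abstract compares architectures by Character Error Rate (CER). CER is commonly computed as the Levenshtein edit distance between the reference and the model output, divided by the reference length; the sketch below assumes that definition, since the abstract does not give the paper's exact normalization.

```python
def levenshtein(reference: str, hypothesis: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by the reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

Because the distance is computed over Python string elements, Tibetan text would be scored per Unicode code point under this sketch; scoring per syllable or per stacked glyph would need a different tokenization.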
