arXivDaily arXiv每日学术速递 周一至周五更新
2605.18753 2026-05-19 cs.CL cs.AI cs.LG 版本更新

DashAttention: Differentiable and Adaptive Sparse Hierarchical Attention

DashAttention: 可微且自适应的稀疏分层注意力

Yuxiang Huang, Nuno M. T. Gonçalves, Federico Alvetreti, Lei Li, Xu Han, Edoardo M. Ponti, André F. T. Martins, Marcos V. Treviso

发表机构 * Tsinghua University(清华大学) Instituto Superior Técnico, Universidade de Lisboa(里斯本大学理工学院) Instituto de Telecomunicações(电信研究院) Carnegie Mellon University(卡内基梅隆大学) Sapienza University of Rome(罗马萨皮恩扎大学) University of Edinburgh(爱丁堡大学) TransPerfect(TransPerfect公司) ELLIS Unit Lisbon(里斯本ELLIS单位)

AI总结 本研究提出DashAttention,一种可微且自适应的稀疏分层注意力机制,通过自适应稀疏α-entmax变换选择可变数量的块,从而在保持整个层次结构可微的同时,提升长上下文建模能力,实验表明其在高稀疏度下优于现有方法。

Comments Preprint

详情
AI中文摘要

当前的分层注意力方法,如NSA和InfLLMv2,基于粗粒度注意力得分选择前k个相关键值(KV)块,然后对所选标记应用细粒度softmax注意力。然而,top-k操作假设任何查询的相关标记数量固定,并且阻止了稀疏和密集阶段之间的梯度流动。在本工作中,我们提出了DashAttention(可微且自适应的稀疏分层注意力),它利用自适应稀疏α-entmax变换,在第一阶段根据当前查询选择可变数量的块。这反过来为第二阶段的softmax注意力提供先验信息,保持整个层次结构完全可微。与其他分层注意力方法不同,我们表明DashAttention是非发散的,这导致更好的长上下文建模能力。在大型语言模型(LLMs)上的实验表明,DashAttention在75%的稀疏度下达到与全注意力相当的准确性,并在高稀疏度情况下优于NSA和InfLLMv2,特别是在高稀疏度情况下。我们还提供了一个高效的、GPU-aware的DashAttention实现,在Triton中实现了比FlashAttention-3快超过一倍的推理速度。总体而言,DashAttention提供了一种成本效益高的长上下文建模策略。

英文摘要

Current hierarchical attention methods, such as NSA and InfLLMv2, select the top-k relevant key-value (KV) blocks based on coarse attention scores and subsequently apply fine-grained softmax attention on the selected tokens. However, the top-k operation assumes the number of relevant tokens for any query is fixed and it precludes the gradient flow between the sparse and dense stages. In this work, we propose DashAttention (Differentiable and Adaptive Sparse Hierarchical Attention), which leverages the adaptively sparse $α$-entmax transformation to select a variable number of blocks according to the current query in the first stage. This in turn provides a prior for the second-stage softmax attention, keeping the entire hierarchy fully differentiable. Contrary to other hierarchical attention methods, we show that DashAttention is non-dispersive, translating to better long-context modeling ability. Experiments with large language models (LLMs) show that DashAttention achieves comparable accuracy as full attention with 75% sparsity and a better Pareto frontier than NSA and InfLLMv2, especially in high-sparsity regimes. We also provide an efficient, GPU-aware implementation of DashAttention in Triton, which achieves a speedup of up to over FlashAttention-3 at inference time. Overall, DashAttention offers a cost-effective strategy to model long contexts.

2605.18747 2026-05-19 cs.CL cs.AI 版本更新

Code as Agent Harness

代码作为代理工具

Xuying Ning, Katherine Tieu, Dongqi Fu, Tianxin Wei, Zihao Li, Yuanchen Bei, Jiaru Zou, Mengting Ai, Zhining Liu, Ting-Wei Li, Lingjie Chen, Yanjun Zhao, Ke Yang, Bingxuan Li, Cheng Qian, Gaotang Li, Xiao Lin, Zhichen Zeng, Ruizhong Qiu, Sirui Chen, Yifan Sun, Xiyuan Yang, Ruida Wang, Rui Pan, Chenyuan Yang, Dylan Zhang, Liri Fang, Zikun Cui, Yang Cao, Pan Chen, Dorothy Sun, Ren Chen, Mahesh Srinivasan, Nipun Mathur, Yinglong Xia, Hong Li, Hong Yan, Pan Lu, Lingming Zhang, Tong Zhang, Hanghang Tong, Jingrui He

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta Stanford University(斯坦福大学)

AI总结 本文探讨了代码在代理系统中的作用,提出了一种统一的视角,将代码视为代理基础设施的基础,并讨论了代理工具接口、机制以及扩展到多代理系统的挑战。

Comments GitHub: https://github.com/YennNing/Awesome-Code-as-Agent-Harness-Papers

详情
AI中文摘要

近年来,大型语言模型(LLMs)在理解和生成代码方面展现了强大的能力,从竞争性编程到仓库级别的软件工程。在新兴的代理系统中,代码不再仅仅是目标输出,而是越来越多地作为代理推理、行动、环境建模和基于执行的验证的操作基础。我们通过代理工具的视角来阐述这一转变,并引入“代码作为代理工具”的概念:一种以代码为基础的统一视角,用于代理基础设施。为了系统地研究这一视角,我们围绕三个相连的层次组织了综述。首先,我们研究工具接口,其中代码连接代理到推理、行动和环境建模。其次,我们检查工具机制:计划、记忆和工具使用用于长周期执行,以及反馈驱动的控制和优化,使工具可靠且适应性强。第三,我们讨论将工具从单代理系统扩展到多代理系统,其中共享的代码艺术支持多代理协调、审查和验证。在这些层次中,我们总结了代码作为代理工具的代表性方法和实际应用,涵盖编码助手、GUI/OS自动化、具身代理、科学发现、个性化和推荐、DevOps以及企业工作流程。我们进一步概述了工具工程中的开放挑战,包括评估超越最终任务成功、在不完整反馈下的验证、无回归的工具改进、多个代理之间的一致共享状态、人类监督以确保安全关键行动,以及向多模态环境的扩展。通过将代码视为代理AI的工具,本文为可执行、可验证和具有状态的AI代理系统提供了一条统一的道路。

英文摘要

Recent large language models (LLMs) have demonstrated strong capabilities in understanding and generating code, from competitive programming to repository-level software engineering. In emerging agentic systems, code is no longer only a target output. It increasingly serves as an operational substrate for agent reasoning, acting, environment modeling, and execution-based verification. We frame this shift through the lens of agent harnesses and introduce code as agent harness: a unified view that centers code as the basis for agent infrastructure. To systematically study this perspective, we organize the survey around three connected layers. First, we study the harness interface, where code connects agents to reasoning, action, and environment modeling. Second, we examine harness mechanisms: planning, memory, and tool use for long-horizon execution, together with feedback-driven control and optimization that make harness reliable and adaptive. Third, we discuss scaling the harness from single-agent systems to multi-agent settings, where shared code artifacts support multi-agent coordination, review, and verification. Across these layers, we summarize representative methods and practical applications of code as agent harness, spanning coding assistants, GUI/OS automation, embodied agents, scientific discovery, personalization and recommendation, DevOps, and enterprise workflows. We further outline open challenges for harness engineering, including evaluation beyond final task success, verification under incomplete feedback, regression-free harness improvement, consistent shared state across multiple agents, human oversight for safety-critical actions, and extensions to multimodal environments. By centering code as the harness of agentic AI, this survey provides a unified roadmap toward executable, verifiable, and stateful AI agent systems.

2605.18738 2026-05-19 cs.AI 版本更新

What Does the AI Doctor Value? Auditing Pluralism in the Clinical Ethics of Language Models

人工智能医生重视什么?审查语言模型临床伦理中的多元主义

Payal Chandak, Victoria Alkin, David Wu, Maya Dagan, Taposh Dutta Roy, Maria Clara Saad Menezes, Ayush Noori, Nirali Somia, John S. Brownstein, Ran Balicer, Rebecca W. Brendel, Noa Dagan, Isaac S. Kohane, Gabriel A. Brat

AI总结 本文研究了大型语言模型在医疗建议中带来的伦理价值观,提出了一种审计框架来评估医疗AI中的价值多元主义,揭示了模型在决策中对患者自主权的潜在忽视,并强调了多模型协同的重要性。

Comments Code and data available upon request via https://hvp.global/

详情
AI中文摘要

医学本质上是多元主义的。诸如自主性、有益性、不伤害和正义等原则经常发生冲突,这种伦理困境常常使合理医生产生激烈分歧。良好的临床实践是在与每位患者的价值观协调一致的情况下解决这些矛盾,而不是强加单一的伦理立场。然而,大型语言模型带来的伦理价值观尚未被系统地考察。本文提出了一种审计价值多元主义的框架,包括经临床医生验证的困境基准和一种从决策中直接恢复价值优先级的方法。前沿模型生态系统涵盖了医生层面的价值异质性,模型在其推理中讨论竞争性价值观(Overton多元主义),然后做出决定。然而,个体模型决策在重复采样和语义变化下几乎近似决定性,无法再现医生小组的分布多元主义。在基准案例中,这些一致的决策反映了坚定、系统的价值偏好。虽然大多数模型优先级落在医生间自然变化的范围内,但某些显著低估了患者自主权。一个没有考虑其价值优先级的单个LLM可能在每服务的患者中放大这些优先级。如果没有明确努力通过一个或多个模型平衡伦理观点,这些工具可能会用部署单文化取代临床多元主义。

英文摘要

Medicine is inherently pluralistic. Principles such as autonomy, beneficence, nonmaleficence, and justice routinely conflict, and such ethical dilemmas often sharply divide reasonable physicians. Good clinical practice navigates these tensions in concert with each patient's values rather than imposing a single ethical stance. The ethical values that large language models bring to medical advice, however, have not been systematically examined. We present a framework for auditing value pluralism in medical AI, comprising a benchmark of clinician-verified dilemmas and an attribution method that recovers value priorities directly from decisions. The ecosystem of frontier models spans physician-level value heterogeneity, and models discuss competing values in their reasoning (Overton pluralism) before committing to a decision. However, individual model decisions are near-deterministic across repeated sampling and semantic variations, failing to reproduce the distributional pluralism of the physician panel. Across benchmark cases, these consistent decisions reflect committed, systematic value preferences. While most model priorities fall within the natural range of inter-physician variation, some significantly underweight patient autonomy. A single LLM deployed without regard for its value priorities could amplify those priorities at scale to every patient it serves. Without explicit efforts to balance ethical perspectives with one or multiple models, these tools risk replacing clinical pluralism with a deployment monoculture.

2605.18732 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Predictable Confabulations: Factual Recall by LLMs Scales with Model Size and Topic Frequency

可预测的编造:大型语言模型的事实回忆能力随模型大小和主题频率而增加

Matthew L. Smith, Jonathan P. Shock, Samuel T. Segun, Iyiola E. Olatunji, Tegawendé F. Bissyandé

发表机构 * International Development Research Centre Canada(加拿大国际发展研究中心) University of Cape Town(开普敦大学) Global Center on AI Governance(人工智能治理全球中心) SnT, University of Luxembourg(卢森堡大学SnT分校) CITADEL AI Centre of Excellence, Burkina Faso(布基纳法索CITADEL人工智能卓越中心)

AI总结 本研究探讨了大型语言模型在事实回忆方面的可预测性,发现模型大小和训练数据中主题频率是影响回忆质量的关键因素,且模型大小和主题频率的组合能解释60%-94%的方差。

Comments 18 pages, 5 figures, 6 tables

详情
AI中文摘要

尽管规模定律支配了大规模语言模型的整体性能,但尚未有规模定律将事实回忆与模型大小和训练数据组成联系起来。我们评估了38个模型,超过8900篇学术参考文献由自动参考验证系统评估。回忆质量在模型参数数量和训练数据中主题表示的对数线性组合中呈现S形。这两个变量单独能解释16个密集模型中60%的方差,而在单个家族内上升至74-94%。这种形式与受叠加启发的账户相匹配,其中回忆由信号噪声比门控:信号强度与概念频率成正比,噪声底座与模型容量成正比。

英文摘要

While scaling laws govern aggregate large language model performance, no scaling law has linked factual recall to both model size and training-data composition. We evaluated 38 models on over 8,900 scholarly references evaluated by an automated reference verification system. Recall quality follows a sigmoid in the log-linear combination of model parameter count and topic representation in training data. These two variables alone explain 60% of the variance across 16 dense models from four families, rising to 74-94% within individual families. The form matches a superposition-inspired account in which recall is gated by a signal-to-noise ratio: signal strength scales with concept frequency and the noise floor with model capacity.

2605.18727 2026-05-19 cs.RO cs.AI 版本更新

DexHoldem: Playing Texas Hold'em with Dexterous Embodied System

DexHoldem: 使用 Dexterous Embodied 系统进行德州扑克游戏

Feng Chen, Tianzhe Chu, Li Sun, Pei Zhou, Zhuxiu Xu, Shenghua Gao, Yuexiang Zhai, Yanchao Yang, Yi Ma

发表机构 * ShadowHand

AI总结 本文提出DexHoldem,一个基于ShadowHand的现实世界系统级基准,用于评估德克萨斯扑克的灵巧操作。研究通过14个德克萨斯扑克操作原始技能的1470个远程操作示例,测试了代理在感知、执行和决策路由中的能力。

Comments 30 Pages

详情
AI中文摘要

评估基于真实灵巧硬件的具身系统需要超越孤立的原始技能:一个代理必须感知一个变化的桌面场景,选择合适的上下文动作,用灵巧的手执行该动作,并确保场景在后续决策中仍然可用。我们介绍了DexHoldem,一个围绕使用ShadowHand进行德克萨斯扑克灵巧操作构建的现实世界系统级基准。DexHoldem提供了1470个远程操作示例,涵盖14个德克萨斯扑克操作原始技能,一个标准化的物理政策基准,以及一个测试代理是否能够恢复所需结构化游戏状态的代理感知基准。在原始执行方面,π_{0.5}获得了最高的任务完成率(61.2%),而π_{0.5}和π_0在保持场景成功率为47.5%时并列。在代理感知方面,Opus 4.7在严格的问题级准确性(34.3%)方面表现最佳,而GPT 5.5在平均领域准确性(66.8%)方面表现最佳,揭示了孤立视觉子能力与完整路由相关状态恢复之间的差距。最后,我们通过三个案例研究实现了完整的具身代理循环,其中等待、恢复调度、人类帮助请求和重复原始执行揭示了在闭环部署过程中感知和策略错误如何累积。DexHoldem因此在共享物理环境中评估了灵巧桌面执行、代理感知和具身决策路由。项目页面:https://dexholdem.github.io/Dexholdem/.

英文摘要

Evaluating embodied systems on real dexterous hardware requires more than isolated primitive skills: an agent must perceive a changing tabletop scene, choose a context-appropriate action, execute it with a dexterous hand, and leave the scene usable for later decisions. We introduce DexHoldem, a real-world system-level benchmark built around Texas Hold'em dexterous manipulation with a ShadowHand. DexHoldem provides 1,470 teleoperated demonstrations across 14 Texas Hold'em manipulation primitives, a standardized physical policy benchmark, and an agentic perception benchmark that tests whether agents can recover the structured game state needed for embodied decision making. On primitive execution, $π_{0.5}$ obtains the highest task completion rate ($61.2\%$), while $π_{0.5}$ and $π_0$ tie on scene-preserving success rate ($47.5\%$). On agentic perception, Opus 4.7 obtains the best strict problem-level accuracy ($34.3\%$), while GPT 5.5 obtains the best average field-wise accuracy ($66.8\%$), exposing a gap between isolated visual sub-capabilities and complete routing-relevant state recovery. Finally, we instantiate the full embodied-agent loop in three case studies, where waiting, recovery dispatches, human-help requests, and repeated primitive execution reveal how perception and policy errors accumulate during closed-loop deployment. DexHoldem therefore evaluates dexterous tabletop execution, agentic perception, and embodied decision routing in a shared physical setting. Project page: https://dexholdem.github.io/Dexholdem/.

2605.18714 2026-05-19 cs.CV cs.AI 版本更新

Semantic Generative Tuning for Unified Multimodal Models

语义生成微调用于统一多模态模型

Songsong Yu, Yuxin Chen, Ying Shan, Yanwei Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) Tencent ARCLab(腾讯ARCLab)

AI总结 本文提出语义生成微调(SGT)方法,通过将高阶语义任务作为生成代理,统一多模态模型的感知与生成能力,提升多模态理解和生成质量。

Comments 14 pages, 13 figures

详情
AI中文摘要

统一多模态模型(UMMs)致力于在单一架构中整合视觉理解和视觉生成。然而,现有训练范式分别通过稀疏文本信号优化理解,通过密集像素目标优化生成,导致表示空间不一致,隔离了视觉理解和生成,阻碍了它们的相互促进。本文首次系统地研究了生成式后训练,我们将层次化的视觉任务作为生成代理,以弥合UMMs中的隔离。通过实证研究发现,高阶语义任务,特别是图像分割,作为最优代理。不同于低阶任务,分割提供结构语义,显著增强视觉感知和生成布局的保真度。基于这些见解,我们引入语义生成微调(SGT),一种利用分割作为生成代理来对齐和协同多模态能力的新范式。机理分析进一步表明,SGT从根本上提高了特征线性可分离性,并优化了视觉-文本注意力分配模式。广泛的评估显示,SGT在主流基准上一致提升了多模态理解和生成保真度。我们的代码可在https://song2yu.github.io/SGT/上获得。

英文摘要

Unified multimodal models (UMMs) strive to consolidate visual understanding and visual generation within a single architecture. However, prevailing training paradigms independently optimize understanding via sparse text signals and generation through dense pixel objectives. Such a decoupled strategy yields misaligned representation spaces, isolating visual understanding from generation and hindering their mutual reinforcement. This work presents the first systematic investigation into generative post-training, where we formulate hierarchical visual tasks as generative proxies to bridge the isolation in UMMs. Our empirical investigation reveals that high-level semantic tasks, particularly image segmentation, serve as optimal proxies. Unlike low-level tasks that distract models with texture details, segmentation provides structural semantics that significantly enhance both vision-centric perception and generative layout fidelity. Building upon these insights, we introduce Semantic Generative Tuning (SGT), a novel paradigm that leverages segmentation as a generative proxy to align and synergize multimodal capabilities. Mechanistic analyses further demonstrate that SGT fundamentally improves feature linear separability and optimizes visual-textual attention allocation pattern. Extensive evaluations show that SGT consistently improves both multimodal comprehension and generative fidelity across mainstream benchmarks. Our code is available on the https://song2yu.github.io/SGT/.

2605.18702 2026-05-19 cs.LG cs.AI 版本更新

Distilling Tabular Foundation Models for Structured Health Data

为结构化健康数据 distilling 表格基础模型

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay Kumar Sankarapu, Pratinav Seth

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文研究了如何通过知识蒸馏将表格基础模型的预测行为转移到轻量级表格模型中,通过分层出折教师标签解决上下文泄露问题,在19个医疗数据集上验证了蒸馏学生模型在保持高AUC的同时显著提升了推理速度,并展示了多教师平均法并不总能超越最佳单教师。

详情
AI中文摘要

表格基础模型(TFMs)在健康数据集上表现出色,但其推理成本和基础设施需求限制了实际应用。我们研究了是否可以通过知识蒸馏将TFMs的预测行为转移到轻量级表格模型中。由于上下文TFMs在推理时依赖于训练集,直接蒸馏会引入上下文泄露;我们通过分层出折教师标签来解决这一问题。在19个医疗数据集、6个TFM教师、4个学生家族和多个多教师集成模型上,我们发现蒸馏后的学生模型至少保留了教师AUC的90%,在某些情况下优于教师,同时在CPU上运行速度至少快26倍,并保持了对健康应用至关重要的校准和公平性。此外,多教师平均法并不总能超越最佳单教师。因此,具有泄漏意识的蒸馏是一种将TFM质量预测带入受推理限制的健康环境中的可行途径。

英文摘要

Tabular foundation models (TFMs) achieve strong performance on health datasets, but their inference cost and infrastructure requirements limit practical use. We study whether their predictive behavior can be transferred to lightweight tabular models through knowledge distillation. Since in-context TFMs condition on the training set at inference time, naive distillation can introduce context leakage; we address this with stratified out-of-fold teacher labeling. Across $19$ healthcare datasets, $6$ TFM teachers, $4$ student families, and several multi-teacher ensembles, we find that distilled students retain at least $90\%$ of teacher AUC, outperforming teachers in some cases, while running at least $26\times$ faster on CPU and preserving calibration and fairness critical for health applications. Moreover, multi-teacher averaging does not consistently improve over the best single teacher. Leakage-aware distillation is thus a viable route for bringing TFM-quality predictions into inference-constrained health settings.

2605.18697 2026-05-19 cs.DC cs.AI cs.PL 版本更新

PopPy: Opportunistically Exploiting Parallelism in Python Compound AI Applications

PopPy: 在Python复合AI应用中机会性地利用并行性

Stephen Mell, David Mell, Konstantinos Kallas, Steve Zdancewic, Osbert Bastani

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Independent Researcher(独立研究者) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出PopPy系统,通过识别Python应用中调用外部组件的并行化机会,从而在复合AI应用的端到端延迟上实现6.4倍的加速,同时保持顺序程序语义。

详情
AI中文摘要

复合AI应用通过使用通用编程语言如Python调用ML模型的调用,广泛应用于软件工程和企业自动化等用户-facing任务,使其端到端延迟成为关键瓶颈。与传统应用不同,执行时间主要由外部组件主导,这些组件无法通过传统语言优化系统如优化编译器来处理。为了解决这个问题,我们开发了PopPy,一个能够发现Python应用中调用这些重型外部组件的并行化机会的系统,包括那些用于复合AI应用的组件。PopPy支持Python的一个非常表达性的片段,并且需要最小的开发者输入来发现并行性。它结合了提前编译器和运行时,解决了从Python应用中提取并行性的三个关键挑战:语言复杂性、动态调度和变量变异。在一组真实的复合AI应用上,PopPy在端到端执行时间上相比标准Python执行实现了高达6.4倍的加速,同时保持顺序程序语义。

英文摘要

Compound AI applications, which compose calls to ML models using a general-purpose programming language like Python, are widely used for a variety of user-facing tasks, from software engineering to enterprise automation, making their end-to-end latency a critical bottleneck. In contrast to traditional applications, execution time is dominated by the external components, which cannot be handled by traditional language optimization systems, like optimizing compilers. To address this problem, we develop PopPy, a system that can uncover parallelization opportunities in Python applications that invoke these heavy external components, including those used in compound AI applications. PopPy supports a very expressive fragment of Python and requires minimal developer input to uncover parallelism. It combines an ahead-of-time compiler with a runtime, addressing three key challenges in extracting parallelism from Python applications: language complexity, dynamic dispatch, and variable mutation. On a set of real-world compound AI applications, PopPy achieves up to $6.4\times$ speedups in end-to-end execution time compared to standard Python execution while preserving the sequential program semantics.

2605.18696 2026-05-19 cs.LG cs.AI 版本更新

Ensembling Tabular Foundation Models - A Diversity Ceiling And A Calibration Trap

表格基础模型的集成——多样性上限与校准陷阱

Aditya Tanna, Yash Desai, Pratinav Seth, Mohamed Bouadi, Nassim Bouarour, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文研究了表格基础模型(TFMs)的集成方法,发现尽管集成通常能提升性能,但现代TFMs的集成池近似冗余,且某些集成策略在准确率和校准上表现不佳,建议采用贪心选择作为实用默认方案。

详情
AI中文摘要

表格基础模型(TFMs)如今在越来越多的表格任务上能够匹配或超越调优的梯度提升树,但没有单一的TFM能在所有数据集上获胜。集成是解决此问题的首选方法,但其效果不如预期。六个现代TFMs形成一个近似冗余的池:它们的平均成对Q统计量为0.961,接近1,因此任何凸组合都受限制。我们对六个TFMs在153个OpenML分类任务上进行了六个集成策略的基准测试。最佳集成策略,两层级联堆叠,在计算成本增加253倍的情况下,比最强单个TFM的准确率提高0.18%。Friedman和Nemenyi分析将三个集成策略和最佳基础TFM置于一个等价组中;其他三个集成策略显著劣于最佳基础TFM。使用逻辑回归元学习器进行堆叠是最引人注目的案例:在准确率和ROC-AUC上具有竞争力,但在log-loss排名中是最差的。元学习器通过锐化类别边界来提高准确率,这破坏了校准。我们建议贪心选择作为实用默认方案。

英文摘要

Tabular foundation models (TFMs) now match or beat tuned gradient-boosted trees on a growing fraction of tabular tasks, but no single TFM wins on every dataset. Ensembling is the go to fix here, and it works less well than expected. Six modern TFMs form a near-redundant pool: their mean pairwise Q-statistic is $0.961$, close enough to $1$ that any convex combination is bounded above. We benchmark six ensemble strategies over six TFMs on 153 OpenML classification tasks. The best ensemble, two-level cascade stacking, buys $+0.18\%$ accuracy over the strongest single TFM at $253\times$ the compute. A Friedman and Nemenyi analysis places three ensembles and the best base TFM in a single equivalence group; three other ensembles are significantly \emph{worse} than the best base. Stacking with a logistic-regression meta-learner is the most striking case: competitive accuracy and ROC-AUC, the worst log-loss rank among the ensembles. The meta-learner improves accuracy by sharpening class boundaries, which destroys calibration. We recommend greedy selection as the practical default.

2605.18693 2026-05-19 cs.AI 版本更新

SkillGenBench: Benchmarking Skill Generation Pipelines for LLM Agents

SkillGenBench: 评估LLM代理技能生成流水线的基准测试

Yifan Zhou, Zhentao Zhang, Ziming Cheng, Shuo Zhang, Qizhen Lan, Zhangquan Chen, Zhi Yang, QianyuXu, Ronghao Chen, Huacan Wang, Sen Hu

发表机构 * SJTU(上海交通大学) XJTU(西安交通大学) NUS(新加坡国立大学) QuantaAlpha(量子Alpha) THU(清华大学) SUFE(上海财经大学) NTU(国立.ntu) PKU(北京大学) UCAS(中国科学技术大学)

AI总结 本文提出SkillGenBench,一个用于评估LLM代理技能生成流水线的基准测试,通过统一可控的协议评估技能生成过程,涵盖任务条件生成和任务无关生成两种模式,以及基于仓库和文档的两种程序来源,揭示技能生成在不同数据源中的表现差异。

详情
AI中文摘要

随着LLM代理越来越多地围绕可重用的技能构建,一个核心挑战不再是代理是否能使用提供的技能,而是它们能否从仓库和文档中生成正确、可重用且可执行的技能。现有基准主要评估给定技能的有效性或代理从原始上下文中解决下游任务的能力,但并未将技能生成本身作为研究对象。我们引入SkillGenBench,一个用于评估技能生成流水线的基准测试,采用统一且受控的协议。在SkillGenBench中,生成器接收原始语料并生成标准化的技能制品,然后在固定框架下执行并经过统一的评估程序。该基准涵盖两种生成模式:任务条件生成,即在任务揭示后合成特定任务的技能;以及任务无关生成,即在下游任务确定前必须整理出可重用的技能库。它还涵盖两种互补的程序来源:基于仓库的实例,其中程序分布在代码、配置和脚本中;以及基于文档的实例,其中程序和约束必须从长文本中提炼。我们提供了标准化的任务规范、固定环境和以确定性执行为基础的评估协议,并辅以辅助信号用于诊断。在多种技能生成方法和基础模型上的实验显示了显著的性能差异,突显了可重用技能提炼的难度,并揭示了从软件仓库与长文本中生成技能的不同失败模式。SkillGenBench为研究技能生成作为代理系统中的独立研究问题建立了可重复的测试环境。

英文摘要

As LLM agents are increasingly built around reusable skills, a central challenge is no longer only whether agents can use provided skills, but whether they can generate correct, reusable, and executable skills from repositories and documents. Existing benchmarks primarily evaluate the efficacy of given skills or the ability of agents to solve downstream tasks from raw context, but they do not isolate skill generation itself as the object of study. We introduce SkillGenBench, a benchmark for evaluating skill generation pipelines under a unified and controlled protocol. In SkillGenBench, a generator receives raw corpora and produces standardized skill artifacts, which are then executed under fixed harnesses and assessed with unified evaluation procedures. The benchmark covers two generation regimes: task-conditioned generation, where a task-specific skill is synthesized after the task is revealed, and task-agnostic generation, where a reusable skill library must be distilled before downstream tasks are known. It also spans two complementary procedural sources: repository-grounded instances, where procedures are distributed across code, configuration, and scripts, and document-grounded instances, where procedures and constraints must be distilled from long-form text. We provide standardized task specifications, pinned environments, and evaluation protocols centered on deterministic execution-based checks, supplemented by auxiliary signals for diagnosis. Experiments across a range of skill-generation methods and backbones show substantial performance variation, highlight the difficulty of reusable skill distillation, and reveal distinct failure modes in skill generation from software repositories versus long-form documents. SkillGenBench establishes a reproducible testbed for studying skill generation as an independent research problem in agent systems.

2605.18684 2026-05-19 cs.SE cs.AI 版本更新

Reversa: A Reverse Documentation Engineering Framework for Converting Legacy Software into Operational Specifications for AI Agents

Reversa:一个用于将遗留软件转换为AI代理操作规范的反向文档工程框架

Sanderson Oliveira de Macedo, Ronaldo Martins da Costa

发表机构 * Federal Institute of Goias(戈亚斯联邦理工学院) Federal University of Goias(戈亚斯联邦大学)

AI总结 本文提出Reversa框架,旨在通过反向文档工程将遗留软件转换为AI代理可操作的规范,通过多代理流水线流程提取隐含规则,生成可追溯的操作规范,并提出评估协议以衡量覆盖率、可追溯性、置信度、效用和成本。

Comments Preprint. Includes a generative AI use statement

详情
AI中文摘要

遗留系统集中了业务规则、架构决策和操作例外,这些通常在代码、数据、配置和维护实践中隐含存在。同时,基于语言模型的编码代理依赖于可靠的上下文、正确性标准和行为契约来修改真实系统,以降低风险。本文提出了Reversa,一个用于将遗留软件转换为可追溯的操作规范的反向文档工程框架。Reversa将此过程组织为一个多代理流水线:专门的代理映射项目表面,分析模块,提取隐含规则,合成架构,编写单元级规范,并审查生成的声明。该提案强调三个机制:代码与规范之间的可追溯性、显式的置信度标记以及保留缺口供人工验证。该框架作为Node.js CLI分布式发布,跨多个代理引擎安装技能,并使用SHA-256清单在更新或卸载操作期间保留修改后的文件。除了架构描述外,我们还报告了一个探索性案例研究,即从COBOL迁移到Go的ATM迁移,其中流水线产生了517个由内部置信度指数分类的声明,10个登记的缺口,53个Gherkin平衡场景,以及一个完成9/11任务的重建计划。最终平衡验证和切换未在本研究中完成。我们不声称有广泛的实证优势;我们根据反向工程、LLM文档和软件代理文献的位置,提出一个具有覆盖率、可追溯性、置信度、效用和成本指标的评估协议。

英文摘要

Legacy systems concentrate business rules, architectural decisions, and operational exceptions that often remain implicit in code, data, configuration, and maintenance practices. At the same time, language-model-based coding agents depend on reliable context, correctness criteria, and behavioral contracts to modify real systems with lower risk. This paper presents Reversa, a reverse documentation engineering framework for converting legacy software into traceable operational specifications for AI agents. Reversa organizes this process as a multi-agent pipeline: specialized agents map the project surface, analyze modules, extract implicit rules, synthesize architecture, write unit-level specifications, and review generated claims. The proposal emphasizes three mechanisms: traceability between code and specification, explicit confidence marking, and preservation of gaps for human validation. The framework is distributed as a Node.js CLI, installs skills across multiple agent engines, and uses a SHA-256 manifest to preserve modified files during update or uninstall operations. In addition to the architectural description, we report an exploratory case study on migrating an ATM from COBOL to Go, in which the pipeline produced 517 claims classified by an internal confidence index, 10 registered gaps, 53 Gherkin parity scenarios, and a reconstruction plan with 9 of 11 tasks completed at inventory time. Final parity validation and cutover were not completed in this study. We do not claim broad empirical superiority; we position the contribution with respect to the literature on reverse engineering, LLM-based documentation, and software agents, and propose an evaluation protocol with metrics for coverage, traceability, confidence, utility, and cost.

2605.18681 2026-05-19 cs.AI cs.LG 版本更新

Learning Quantifiable Visual Explanations Without Ground-Truth

学习无地面真实数据的可量化视觉解释

Amritpal Singh, Andrey Barsky, Mohamed Ali Souibgui, Ernest Valveny, Dimosthenis Karatzas

发表机构 * Computer Vision Center, Barcelona, Spain(巴塞罗那计算机视觉中心) Autonomous University of Barcelona, Spain(巴塞罗那自治大学)

AI总结 本文提出了一种基于连续输入扰动的可量化指标,用于评估XAI方法的质量,并提出了一种新的XAI方法,通过可微近似指标对模型进行微调,生成因果解释而不影响模型性能。

详情
AI中文摘要

可解释AI(XAI)技术对于验证和负责任使用现代深度学习模型日益重要,但缺乏良好的地面真实数据使得评估困难。我们提出了一种框架,该框架基于连续输入扰动作为XAI方法质量的可量化度量标准。我们的度量标准正式考虑了归因信息对模型决策的充分性和必要性,并展示了多种情况,其中它比现有度量标准更能符合人类对解释质量的直觉。为了利用该度量标准的特性,我们还提出了一种新的XAI方法,考虑了使用可微近似度量作为监督信号对模型进行微调的情况。结果是一个适配器模块,可以在任何黑盒模型上训练以输出因果解释,而不影响模型性能。我们证明了该方法生成的解释在多个可量化度量标准上优于竞争性的XAI技术。

英文摘要

Explainable AI (XAI) techniques are increasingly important for the validation and responsible use of modern deep learning models, but are difficult to evaluate due to the lack of good ground-truth to compare against. We propose a framework that serves as a quantifiable metric for the quality of XAI methods, based on continuous input perturbation. Our metric formally considers the sufficiency and necessity of the attributed information to the model's decision-making, and we illustrate a range of cases where it aligns better with human intuitions of explanation quality than do existing metrics. To exploit the properties of this metric, we also propose a novel XAI method, considering the case where we fine-tune a model using a differentiable approximation of the metric as a supervision signal. The result is an adapter module that can be trained on top of any black-box model to output causal explanations of the model's decision process, without degrading model performance. We show that the explanations generated by this method outperform those of competing XAI techniques according to a number of quantifiable metrics.

2605.18675 2026-05-19 cs.LG cs.AI 版本更新

COOPO: Cyclic Offline-Online Policy Optimization Algorithm

COOPO:循环离线-在线策略优化算法

Qisai Liu, Zhanhong Jiang, Joshua Russell Waite, Aditya Balu, Cody Fleming, Soumik Sarkar

发表机构 * Department of Mechanical Engineering, Iowa State University(伊阿华州立大学机械工程系) Department of Computer Science, Iowa State University(伊阿华州立大学计算机科学系) Department of Industrial and Manufacturing Systems Engineering, Iowa State University(伊阿华州立大学工业与制造系统工程系) Translational AI Center, Iowa State University(伊阿华州立大学转化人工智能中心)

AI总结 本文提出COOPO算法,通过循环离线训练和在线微调来解决离线强化学习的分布偏移和性能受限问题,以及在线强化学习的环境交互成本高问题,通过周期性回归离线训练减少遗忘和漂移,提升样本效率和性能。

详情
AI中文摘要

离线强化学习由于静态数据集的限制,在面对分布偏移和受限性能方面存在困难,而在线强化学习则需要大量的环境交互。最近出现的混合离线-在线方法连接了这两个领域,但存在转换过程中的分布漂移和对离线知识的灾难性遗忘问题。我们引入COOPO(循环离线-在线策略优化),一种通用框架,通过反复循环在受限的离线训练和在线微调之间进行。每个循环首先通过KL-正则化的优势加权离线更新将策略锚定到数据集,以最小化分布偏移,然后使用任何策略优化方法在线微调以实现稳定的探索。关键的是,定期返回离线训练可以消除遗忘和漂移,同时最大化数据集的再利用。循环行为还帮助减少在线环境交互。理论上,COOPO在样本效率上优于纯在线RL,满足标准覆盖假设下保证单调改进。广泛的D4RL基准测试显示,COOPO在减少在线交互的同时提高最终回报,保持在不同离线算法和在线优化器中的鲁棒性。这种循环协同为自适应RL设定了新的效率和性能标准。

英文摘要

Offline reinforcement learning struggles with distributional shift and constrained performance due to static dataset limitations, while online RL demands prohibitive environment interactions. The recent advent of hybrid offline-to-online methods bridges these domains but suffers from distribution drift during transitions and catastrophic forgetting of offline knowledge. We introduce COOPO (Cyclic Offline-Online Policy Optimization), a generalized framework that repeatedly cycles between constrained offline training and online fine-tuning. Each cycle first anchors the policy to the dataset via KL-regularized advantage-weighted offline updates to minimize distributional shift and then fine-tunes it online using any policy optimization for stable exploration. Crucially, periodically returning to offline training eliminates forgetting and drift while maximizing dataset reuse. The cyclic behavior also helps reduce the online environment interactions. Theoretically, COOPO achieves better online sample efficiency, surpassing pure online RL, with guaranteed monotonic improvement under standard coverage assumptions. Extensive D4RL benchmarks demonstrate COOPO reduces online interactions versus state-of-the-art hybrids while improving final returns, maintaining robustness across diverse offline algorithms and online optimizers. This looped synergy sets new efficiency and performance standards for adaptive RL.

2605.18674 2026-05-19 cs.AI 版本更新

Efficient Lookahead Encoding and Abstracted Width for Learning General Policies in Classical Planning

高效前瞻编码与抽象宽度用于经典规划中学习通用策略

Michael Aichmüller, Simon Ståhlberg, Martin Funkquist, Hector Geffner

发表机构 * RWTH Aachen University(亚琛RWTH大学) Linköping University(林雪平大学)

AI总结 本文提出了一种高效的方法,通过整体编码和抽象宽度来提升经典规划中学习通用策略的效率和可扩展性,解决了传统方法在计算成本和表达能力上的限制。

详情
AI中文摘要

通用规划旨在学习在经典规划领域内跨实例集合的通用策略。最近的图神经网络(GNN)方法在几个领域中学习了接近完美的策略。本工作改进了最近发表的迭代宽度(IW)策略的想法。其中,策略通过迭代宽度前瞻搜索扩展其后继范围,可以“跳过”多个转换,简化问题结构。然而,每个转换都被单独评估,导致计算成本不可扩展和表达限制。此外,尽管IW(1)因其与原子数线性扩展而具有吸引力,但一旦考虑数千个对象,如国际规划竞赛(IPC)2023基准,它就变得低效。我们解决了这两个限制。首先,我们引入了一种远更高效的整个搜索树的整体编码。它仅通过与当前状态的关系差异联合表示IW(1)-可达状态,使关系GNN(R-GNN)能够在单次正向传递中评分所有转换。其次,我们定义了抽象的IW(1),通过关系抽象在新颖性检查中提高可扩展性。而不是测试完全实例化的原子,它通过将所有但一个参数替换为其类型来抽象每个原子。原始原子如果任何抽象形式是新颖的,则被认为是新颖的。这种结构压缩将新颖性搜索的可扩展性从原子转移到对象,同时保留有意义的子目标结构。我们在超缩放的IPC 2023基准以及跨多样的领域中评估我们的贡献,包括需要超出C₂逻辑片段特征的领域。我们的策略实现了新的最先进的性能,显著超越了先前的工作,包括经典规划器LAMA。

英文摘要

Generalized planning aims to learn policies that generalize across collections of instances within a classical planning domain. Recent Graph Neural Network (GNN) approaches have learned nearly perfect policies for several domains. This work improves on the recently published idea of Iterated Width (IW) policies. Therein, the policy broadens its successor scope through an IW-lookahead search that can "jump" over multiple transitions, simplifying the problem structure. Yet, each transition is evaluated individually, leading to unscalable compute costs and expressivity limitations. Furthermore, although IW(1) is attractive because it scales linearly with the number of atoms, it becomes inefficient once thousands of objects are considered, as in the International Planning Competition (IPC) 2023 benchmark. We address both limitations. First, we introduce a vastly more efficient holistic encoding of the entire search tree. It jointly represents IW(1)-reachable states only by their relational differences to the current state, enabling Relational GNNs (R-GNNs) to score all transitions in a single forward pass. Second, we define Abstracted IW(1) to improve scaling through relational abstraction during novelty checks. Rather than testing fully instantiated atoms, it abstracts each atom by replacing all but one argument with its type. The original atom is novel if any of its abstracted forms is novel. This structural compression shifts novelty search scaling from atoms to objects, while preserving meaningful subgoal structure. We evaluate our contributions on the hyperscaling IPC 2023 benchmark and across diverse domains, including domains requiring features beyond the $C_2$ logic fragment. Our policies achieve new state-of-the-art performance, significantly surpassing prior work, including the classical planner LAMA.

2605.18672 2026-05-19 cs.AI 版本更新

Position: A Three-Layer Probabilistic Assume-Guarantee Architecture Is Structurally Required for Safe LLM Agent Deployment

位置:一种三层概率性假设-保证架构在安全LLM代理部署中是结构上必需的

S. Bensalem, Y. Dong, M. Franzle, X. Huang, J. Kroger, D. Nickovic, A. Nouri, R. Roy, C. Wu

发表机构 * CSX-AI University of Liverpool(利物浦大学) Carl von Ossietzky Universität Oldenburg(奥尔登堡卡尔·冯·奥西特齐克大学) Austrian Institute of Technology(奥地利技术研究院) Université Grenoble Alpes(格勒诺布尔阿尔卑斯大学)

AI总结 本文提出,单层抽象层内强制LLM代理安全并非仅仅是次优,而是结构性上不足以满足部署的LLM代理需求。安全操作的三个维度——语义意图和政策合规性、环境有效性以及动态可行性——各自依赖于在执行不同阶段才变得可用的信息集。没有单一的防护措施可以保证这三个维度。我们主张社区必须采用基于合同的架构,其中每个安全维度由独立认证的层强制执行,其概率保证满足下一层的假设。我们勾勒了这样的架构,并通过概率链规则推导出其允许的系统级安全界限。三个开放性问题阻碍着这一标准的实现:从非独立同分布轨迹中估计界限、在部署漂移下合同的渐进退化,以及向多代理设置的扩展——LLM代理运行时保证中最关键的未完成任务。

详情
AI中文摘要

本文主张,在单一抽象层内强制LLM代理安全并非仅仅是次优,而是结构性上不足以满足部署的LLM代理需求——这是代理执行方式的结构性结果,而非当前系统固有的限制。构成安全操作的三个维度——语义意图和政策合规性、环境有效性和动态可行性——各自依赖于在执行不同阶段才变得可用的信息集。没有单一的防护措施可以保证这三个维度。我们主张社区必须采用基于合同的架构,在该架构中,每个安全维度由一个独立认证的层强制执行,其概率保证满足下一层的假设。我们勾勒了这样的架构,并通过概率链规则推导出其允许的系统级安全界限。三个开放性问题阻碍着这一标准的实现:从非独立同分布轨迹中估计界限、在部署漂移下合同的渐进退化,以及向多代理设置的扩展——LLM代理运行时保证中最关键的未完成任务。

英文摘要

This position paper argues that enforcing LLM agent safety within a single abstraction layer is not merely suboptimal but categorically insufficient for deployed LLM agents -- a structural consequence of how agent execution works, not a contingent limitation of current systems. The three dimensions that jointly constitute safe operation -- semantic intent and policy compliance, environmental validity, and dynamical feasibility -- each depend on a strictly distinct set of information that becomes available at different stages of execution. No single guardrail can certify all three. We argue that the community must respond with a contract-based architecture in which each safety dimension is enforced by an independently certified layer whose probabilistic guarantee satisfies the next layer's assumption. We sketch such an architecture and derive the compositional system-level safety bounds it admits via the chain rule of probability. Three open problems stand between this and a deployable standard: bound estimation from non-i.i.d.\ traces, graceful degradation of contracts under deployment drift, and extension to multi-agent settings -- the most important unfinished business in LLM agent runtime assurance.

2605.18663 2026-05-19 cs.AI cs.CL cs.LG 版本更新

GIM: Evaluating models via tasks that integrate multiple cognitive domains

GIM:通过整合多个认知领域的任务评估模型

Rohit Patel, Alexandre Rezende, Steven McClain

发表机构 * Meta Superintelligence Labs(Meta超智能实验室)

AI总结 本文提出GIM基准测试,通过整合多个认知领域的任务来评估模型,其核心方法是设计820个原创问题,结合广泛的知识和多种认知操作,从而保持推理在现实任务中的基础性,同时通过2PL IRT模型校准能力估计,发布涵盖22个模型和47种测试配置的综合排行榜,并深入研究了测试时计算与模型能力之间的权衡。

Comments 56 pages, 27 figures, 4 tables. Code: https://github.com/facebookresearch/gim ; Dataset: https://huggingface.co/datasets/facebook/gim

详情
AI中文摘要

随着LLM基准测试趋于饱和,评估社区已采取两种策略来提高难度:提升知识需求(GPQA,HLE)或完全去除知识而采用抽象推理(ARC-AGI)。前者将记忆混淆为能力,后者使推理脱离实际应用背景。我们采取了不同的方法。Grounded Integration Measure(GIM)是一个包含820个原创问题(615个公开问题,205个私有问题)的基准测试,其中难度来自于整合;每个问题都需要协调多种认知操作(约束满足、状态跟踪、知识警惕、受众校准)在广泛可获取的知识上,从而保持推理在现实任务中而不依赖专门的专家知识。每个问题都是原创专家撰写的组成,大多数有基于评分标准分解的评分(中位数6个独立判断的准则)。一个平衡的公开-私有划分提供了内置的污染诊断。我们校准了一个连续响应的2参数逻辑(2PL)IRT模型,超过200,000个提示-响应对,覆盖28个模型,产生稳健的能力估计,即使在原始准确率被错误或缺失数据扭曲的情况下,也能正确排序测试配置,解决了基准报告中的常见挑战。使用这一框架,我们发布了一个涵盖22个模型和47种测试配置的综合排行榜(独特的模型和思考级别对),并进行了迄今为止最广泛的已发表研究,探讨在固定基准上测试时计算与模型能力之间的权衡:11个模型在35种测试配置中被扫过。我们观察到,家庭内部配置选择,如思考预算和量化,与模型选择一样重要。我们发布了评估框架、校准的IRT参数和所有公开问题。

英文摘要

As LLM benchmarks saturate, the evaluation community has pursued two strategies to increase difficulty: escalating knowledge demands (GPQA, HLE) or removing knowledge entirely in favor of abstract reasoning (ARC-AGI). The first conflates memorization with capability; the second divorces reasoning from the practical contexts in which it matters. We take a different approach. The Grounded Integration Measure (GIM) is a benchmark of 820 original problems (615 public, 205 private) where difficulty comes from integration; individual problems require coordinating multiple cognitive operations (constraint satisfaction, state tracking, epistemic vigilance, audience calibration) over broadly accessible knowledge, so that reasoning stays grounded in realistic tasks without being gated on specialized expertise. Each problem is an original expert-authored composition, majority with rubric-decomposed scoring (median 6 independently judged criteria). A balanced public--private split provides built-in contamination diagnostic. We calibrate a continuous response 2-parameter logistic (2PL) IRT model over >200k prompt-response pairs across 28 models, producing robust ability estimates that correctly order test-configurations even when raw accuracy is distorted by errors or missing data, addressing a common challenge in benchmark reporting. Using this framework, we present a comprehensive leaderboard spanning 22 models and 47 test-configurations (unique model, thinking-level pairs), and conduct what is to our knowledge the most extensive published study of how test-time compute trades off against model capability on a fixed benchmark: 11 models swept across 35 test-configurations. We observe that within-family configuration choices, such as thinking budget and quantization, matter as much as model selection. We release the evaluation framework, calibrated IRT parameters, and all public problems.

2605.18661 2026-05-19 cs.AI 版本更新

AI for Auto-Research: Roadmap & User Guide

AI用于自动研究:路线图与用户指南

Lingdong Kong, Xian Sun, Wei Chow, Linfeng Li, Kevin Qinghong Lin, Xuan Billy Zhang, Song Wang, Rong Li, Qing Wu, Wei Gao, Yingshuo Wang, Shaoyuan Xie, Jiachen Liu, Leigang Qu, Shijie Li, Lai Xing Ng, Benoit R. Cottereau, Ziwei Liu, Tat-Seng Chua, Wei Tsang Ooi

发表机构 * Awesome AI Auto-Research Team(Awesome AI 自动研究团队)

AI总结 本文研究了AI在自动研究中的应用,分析了AI在研究生命周期中的各个阶段的表现,指出AI在结构化任务中表现良好,但在新颖想法、实验和科学判断方面仍存在不足,并提出了协作框架和工具清单。

Comments Project Page at https://worldbench.github.io/awesome-ai-auto-research GitHub Repo at https://github.com/worldbench/awesome-ai-auto-research

详情
AI中文摘要

AI辅助研究正跨越一个门槛:完全自动化的系统现在可以以不到15美元的成本生成研究论文,而长期代理可以执行实验、起草手稿并模拟批判性评价,且几乎不需要人类输入。然而,这种生产力前沿暴露了更深层次的诚信问题:在科学压力下,即使前沿的LLMs仍会编造结果、遗漏隐藏的错误,并且无法可靠地判断新颖性。通过到2026年4月的发展,我们提出了对AI在整个研究生命周期的端到端分析,分为四个认识论阶段:创建(想法生成、文献综述、编码与实验、表格和图表)、写作(论文写作)、验证(同行评审、反驳与修订)和传播(海报、幻灯片、视频、社交媒体、项目页面和交互式代理)。我们识别出一个明确的、阶段依赖的界限,即可靠帮助与不可靠自主之间的界限:AI在结构化、基于检索和工具中介的任务中表现优异,但在真正新颖的想法、研究级实验和科学判断方面仍显得脆弱。生成的想法在实施后往往退化,研究代码远远落后于模式匹配基准,而端到端的自主系统尚未一致达到主要会议的接受标准。我们进一步表明,更大的自动化可能会掩盖而不是消除失败模式,使人类监管的协作成为最可信的部署范式。最后,我们提供了一个结构化的分类法、基准套件和工具清单,跨阶段设计原则以及面向实践者的操作手册,相关资源在我们的项目页面上维护。

英文摘要

AI-assisted research is crossing a threshold: fully automated systems can now generate research papers for as little as $15, while long-horizon agents can execute experiments, draft manuscripts, and simulate critique with minimal human input. Yet this productivity frontier exposes a deeper integrity problem: under scientific pressure, even frontier LLMs still fabricate results, miss hidden errors, and fail to judge novelty reliably. Studying developments through April 2026, we present an end-to-end analysis of AI across the complete research lifecycle, organized into four epistemological phases: Creation (idea generation, literature review, coding & experiments, tables & figures), Writing (paper writing), Validation (peer review, rebuttal & revision), and Dissemination (posters, slides, videos, social media, project pages, and interactive agents). We identify a sharp, stage-dependent boundary between reliable assistance and unreliable autonomy: AI excels at structured, retrieval-grounded, and tool-mediated tasks, but remains fragile for genuinely novel ideas, research-level experiments, and scientific judgment. Generated ideas often degrade after implementation, research code lags far behind pattern-matching benchmarks, and end-to-end autonomous systems have not yet consistently reached major-venue acceptance standards. We further show that greater automation can obscure rather than eliminate failure modes, making human-governed collaboration the most credible deployment paradigm. Finally, we provide a structured taxonomy, benchmark suite, and tool inventory, cross-stage design principles, and a practitioner-oriented playbook, with resources maintained at our project page.

2605.18656 2026-05-19 stat.ML cs.AI cs.LG stat.ME 版本更新

Statistical Limits and Efficient Algorithms for Differentially Private Federated Learning

统计界限与差分隐私联邦学习的高效算法

Arnab Auddy, Xiangni Peng, Subhadeep Paul

发表机构 * Department of Statistics(统计系)

AI总结 本文研究了差分隐私联邦学习中估计精度、隐私约束和通信成本之间的权衡,提出了FedHybrid和FedNewton两种高效算法,通过减少通信成本提升准确性,并建立了均方误差的上界和下界以评估算法性能。

详情
AI中文摘要

联邦学习是训练机器学习和人工智能模型的一种主流框架,用于在众多用户设备或数据库之间协同训练。我们研究了差分隐私(DP)联邦M估计中估计精度、隐私约束和通信成本之间的权衡。文献中的两种标准方法是FedAvg,可能面临较高的联邦偏差,以及FedSGD,可能导致较高的通信成本。为了在减少通信成本的同时提高准确性,我们提出了FedHybrid,它使用FedSGD,但起始时通过FedAvg估计器改进初始化。我们还提出了FedNewton,通过平均本地牛顿迭代来减少FedAvg的偏差,从而在客户端数量增长缓慢时,以更少的通信轮次达到与FedSGD相当的估计精度。我们建立了这些估计器的DP版本的均方误差率的有限样本上界,作为客户端数量、本地样本大小、隐私预算和迭代次数的函数。我们进一步推导了任何迭代私有联邦过程的均方误差的最小最大下界,以作为评估这些方法最优性差距的基准。我们还通过在MNIST和CIFAR-10计算机视觉数据集上训练逻辑回归和神经网络来数值评估我们的方法。

英文摘要

Federated Learning is a leading framework for training ML and AI models collaboratively across numerous user devices or databases. We study the trade-offs among estimation accuracy, privacy constraints, and communication cost for differentially private (DP) federated M estimation. The two standard methods in the literature are FedAvg, which may suffer from high federation bias, and FedSGD, which can incur high communication cost. Aimed at improving accuracy at a reduced communication cost, we propose FedHybrid, which uses FedSGD starting with an improved initialization by the FedAvg estimator. We propose FedNewton, which averages local Newton iterations to reduce bias in FedAvg, achieving an estimation accuracy comparable to FedSGD with much fewer communication rounds when the number of clients grows sufficiently slowly. We establish finite sample upper bounds on the mean-squared error rates of the DP versions of these estimators as functions of the number of clients, local sample sizes, privacy budget, and number of iterations. We further derive a minimax lower bound on the MSE of any iterative private federated procedure that provides a benchmark to assess the optimality gap of these methods. We numerically evaluate our methods for training a logistic regression and a neural network on the computer vision datasets MNIST and CIFAR-10.

2605.18654 2026-05-19 cs.LG cs.AI 版本更新

Pocket Foundation Models: Distilling TFMs into CPU-Ready Gradient-Boosted Trees

口袋基础模型:将TFMs压缩成CPU可用的梯度提升树

Aditya Tanna, Nassim Bouarour, Mohamed Bouadi, Vinay kumar Sankarapu, Pratinav Seth

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文提出了一种将高性能表格基础模型(TFMs)压缩成CPU原生梯度提升树的方法,以解决实时欺诈评分需求与现有模型性能之间的差距,同时在多个数据集上验证了该方法的有效性。

详情
AI中文摘要

一个欺诈评分器需要在2毫秒内响应。最好的表格基础模型(TFMs)在GPU上需要151-1275毫秒。我们通过将TFM离线压缩成XGBoost或CatBoost的学生模型,该模型可以在CPU上原生运行,从而缩小这一差距。核心障碍是特定于上下文学习(ICL)教师:他们在评分自己的训练集时会泄露标签,导致软目标崩溃为近一热向量,不再有可供压缩的类间结构。分层出折(OOF)教师标注可以防止这一问题。在153个来自TALENT、OpenML-CC18、TabZilla和TabArena的数据集上,将TabICLv2压缩成XGBoost在CPU上达到0.882宏均AUC(96.5%的教师AUC),在1.9毫秒内,比教师-学生对的教师模型快38到860倍,且在统计上显著优于调优的CatBoost基线(Wilcoxon p=0.0008;51%胜率)。四个进一步发现:教师排名精确转移到学生排名;收益集中在低维数据(<21个特征:比CatBoost高0.011 vs. >21个特征:高0.001);多教师平均有助于MLP学生(+0.006,p=0.003)但对树学生增加不到0.001;在高维任务中,当教师本身落后于CatBoost时,压缩反而使情况更糟。完整的流水线作为TabTune库的一部分开源。

英文摘要

A fraud scorer needs to answer in under 2 ms. The best tabular foundation models (TFMs) take 151-1,275 ms on GPU. We close this gap by distilling the TFM offline into an XGBoost or CatBoost student that runs natively on CPU. The central obstacle is specific to in-context learning (ICL) teachers: they leak labels when scoring their own training set, so the soft targets collapse to near-one-hot vectors with no inter-class structure left to distill. Stratified out-of-fold (OOF) teacher labeling prevents this. Across 153 classification datasets drawn from TALENT, OpenML-CC18, TabZilla, and TabArena, distilling TabICLv2 into XGBoost gives 0.882 macro-mean AUC (96.5% of teacher AUC) at 1.9 ms on CPU, a 38x to 860x speedup across teacher-student pairs with a statistically significant edge over a tuned CatBoost baseline (Wilcoxon p = 0.0008; 51% win rate). Four further findings: teacher rank transfers exactly to student rank; gains concentrate on low-dimensional data (< 21 features: +0.011 over CatBoost vs. >21 features: +0.001); multi-teacher averaging helps MLP students (+0.006, p = 0.003) but adds less than 0.001 for tree students; and on high-dimensional tasks where the teacher itself trails CatBoost, distillation makes things worse rather than better. The full pipeline is open-sourced as part of the TabTune library.

2605.18648 2026-05-19 cs.LG cs.AI cs.CL 版本更新

An Assessment of Human vs. Model Uncertainty in Soft-Label Learning and Calibration

对软标签学习和校准中人类与模型不确定性的评估

Maja Pavlovic, Silviu Paun, Massimo Poesio

发表机构 * Queen Mary University London(伦敦女王玛丽大学) Amazon(亚马逊) University of Utrecht(乌得勒支大学)

AI总结 本文通过对比人类和模型标签在软标签学习中的效果,发现人类标签不仅提升了模型准确性,还通过正则化作用改善了模型在困难样本上的校准和训练稳定性。

详情
AI中文摘要

人类对齐的人工智能的核心在于理解人类提取的标签相对于合成标签的优势。虽然人类软标签通过捕捉不确定性来提高校准,但先前研究将这些好处与隐含的错误标签修正(模式偏移)混淆了,从而掩盖了软标签的真实效果。我们对MNIST和一个合成变体上的软标签学习进行了受控审计,重新标注子集以提取人类不确定性。通过将软标签监督与底层标签模式偏移解耦,我们发现虽然人类软标签确实提供了准确性提升,但其更大的价值在于作为正则化器,改善模型在困难样本上的校准并促进训练运行中的稳定收敛。数据集制图显示,训练于人类软标签的模型能反映人类不确定性,而训练于合成标签的模型则无法与人类对齐。广泛而言,这项工作提供了一个用于人类-人工智能不确定性对齐的诊断测试平台。

英文摘要

Central to human-aligned AI is understanding the benefits of human-elicited labels over synthetic alternatives. While human soft-labels improve calibration by capturing uncertainty, prior studies conflate these benefits with the implicit correction of mislabeled data (mode shifts), obscuring true effects of soft-labels. We present a controlled audit of soft-label learning across MNIST and a synthetic variant, re-annotating subsets to extract human uncertainty. By decoupling soft-label supervision from underlying label mode shifts, we show that while human soft-labels do provide accuracy gains, their larger value lies in acting as a regularizer that improves model calibration on difficult samples and promotes stable convergence across training runs. Dataset cartography reveals models trained on human soft-labels mirror human uncertainty, whereas those trained on synthetic labels fail to align with humans. Broadly, this work provides a diagnostic testbed for human-AI uncertainty alignment.

2605.18635 2026-05-19 cs.LG cs.AI 版本更新

Data Presentation Over Architecture: Resampling Strategies for Credit Risk Prediction with Tabular Foundation Models

数据呈现与架构:用于表格基础模型的信用风险预测重采样策略

Aditya Tanna, Mitul Solanki, Mohamed Bouadi, Nassim Bouarour, Pratinav Seth, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs(Lexsi实验室)

AI总结 本文研究了在信用风险预测中,通过不同的上下文构建策略对表格基础模型性能的影响,发现上下文构建策略比模型架构对AUC-ROC指标的贡献更大。

详情
AI中文摘要

信用违约预测是一个具有严重类别不平衡、异质特征和严格延迟预算的表格学习问题。表格基础模型(TFMs)通过上下文学习来解决这个问题,其预测结果对上下文窗口的构建方式敏感。我们在Home Credit和Lending Club数据集上基准测试了四种经典模型和五种TFMs,变化上下文构建策略(七种选项)和上下文大小(1K到50K)。在两个数据集上,上下文策略的选择对AUC-ROC的方差解释比模型家族的选择更大:平衡和混合采样比均匀采样增加3到4个AUC点,且差距超过了TFMs之间的差异。使用5K到10K的平衡上下文,最强的TFMs达到经典基线模型在完整数据上训练的AUC,同时恢复了默认类别召回率,而默认阈值GBDTs无法做到。我们将此视为证据,表明在不平衡信用风险设置中,上下文构建而非架构选择是TFMs的主要部署杠杆。

英文摘要

Credit default prediction is a tabular learning problem with severe class imbalance, heterogeneous features, and tight latency budgets. Tabular Foundation Models (TFMs) approach this problem through in-context learning, which makes their predictions sensitive to how the context window is built. We benchmark four classical models and five TFMs on the Home Credit and Lending Club datasets, varying the context-construction strategy (seven options) and the context size (1K to 50K). On both datasets, the choice of context strategy explains more variance in AUC-ROC than the choice of TFM family: balanced and hybrid sampling add 3 to 4 AUC points over uniform sampling, and the gap exceeds the spread between TFMs. With a balanced context of 5K to 10K examples, the strongest TFMs reach the AUC of classical baselines trained on the full data, while also recovering meaningful default-class recall that default-threshold GBDTs do not. We frame this as evidence that context construction, rather than architecture choice, is the primary deployment lever for TFMs in imbalanced credit-risk settings.

2605.18632 2026-05-19 cs.LG cs.AI 版本更新

Position: Weight Space Should Be a First-Class Generative AI Modality

权重空间应成为一种第一类生成式AI模态

Zhangyang Wang, Peihao Wang, Kai Wang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) Tencent Hy(腾讯实验室)

AI总结 本文提出将模型检查点视为第一类数据模态,并主张在权重空间中进行生成式建模应成为机器学习的核心原始操作。通过最近的进展表明,神经网络权重可以按需合成,通常在减少适应成本的规模下达到微调性能。本文认为这些结果反映了权重空间中高性能模型占据的低维、高度结构化区域的结构事实。基于此观点,本文将现有方法组织成五阶段流程,调查该方法已实际应用的领域,并澄清当前限制:适配器规模和条件生成正在迅速发展,而无限制的前沿规模检查点合成仍处于开放状态。

Comments AI systems routinely improve or create other AI systems

详情
AI中文摘要

神经网络检查点已悄然成为大规模数据资源:现在存在数百万个训练好的权重向量,每个都编码任务、领域和架构特定的知识。本文立场论文认为,模型检查点应被视为第一类数据模态,并且在权重空间中的生成式建模应被标准化为机器学习的核心基本操作。最近的进展表明,神经权重可以按需合成,通常在减少适应成本的规模下达到微调性能。我们主张这些结果反映了底层的结构事实:高性能模型占据由对称性、平坦性、模块性和共享子空间形状的权重空间中的低维、高度结构化区域。基于这一观点,我们组织现有方法为五阶段流程,调查该方法已实际应用的领域,并澄清当前限制:适配器规模和条件生成正在迅速发展,而无限制的前沿规模检查点合成仍处于开放状态。我们的目标是将社区的默认思维从按任务优化模型转变为从学习的权重分布中采样模型,加速迈向一个AI系统定期改进或创建其他AI系统的时代。

英文摘要

Neural network checkpoints have quietly become a large-scale data resource: millions of trained weight vectors now exist, each encoding task-, domain-, and architecture-specific knowledge. This position paper argues that model checkpoints should be treated as a first-class data modality, and that generative modeling in weight space should be standardized as a core machine learning primitive. Recent advances demonstrate that neural weights can be synthesized on demand, often matching fine-tuning performance while reducing adaptation cost by orders of magnitude. We contend that these results reflect an underlying structural fact: high-performing models occupy low-dimensional, highly structured regions of weight space shaped by symmetry, flatness, modularity, and shared subspaces. Building on this view, we organize existing methods into a five-stage pipeline, survey applications where the approach is already practical, and clarify current limits: adapter-scale and conditional generation are advancing rapidly, while unrestricted frontier-scale checkpoint synthesis remains open. Our goal is to shift the community's default mindset from optimizing models per task to sampling models from learned weight distributions, accelerating toward an era in which AI systems routinely improve or create other AI systems.

2605.18630 2026-05-19 cs.AI physics.comp-ph 版本更新

SCICONVBENCH: Benchmarking LLMs on Multi-Turn Clarification for Task Formulation in Computational Science

SCICONVBENCH: 评估LLM在计算科学任务公式化中的多轮澄清能力

Nithin Somasekharan, Youssef Hassan, Shiyao Lin, Gihan Panapitiya, Patrick Emami, Anurag Acharya, Sameera Horawalavithana, Shaowu Pan

发表机构 * Rensselaer Polytechnic Institute(拉特格斯理工学院) University of Texas at Arlington(德克萨斯大学阿灵顿分校) Pacific Northwest National Laboratory(太平洋西北国家实验室) National Renewable Energy Laboratory(国家可再生能源实验室)

AI总结 该研究提出SCICONVBENCH基准,用于评估LLM在计算科学任务公式化中的多轮澄清能力,重点在于获取缺失信息和解决请求中的矛盾,通过结构化任务本体和基于标准的评估框架,系统测量LLM在澄清行为、对话基础性和最终规格忠实度三个维度上的表现。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作科学人工智能助手,越来越多的基准测试评估其在知识检索、推理、代码生成和工具使用方面的能力。然而,这些评估通常假设科学问题已经明确提出,而实际的科学协助往往从一个不明确的用户请求开始,必须通过对话进行澄清,才能进行任何计算、分析或实验。我们介绍了SCICONVBENCH,这是一个用于评估科学任务公式化中多轮澄清能力的基准,涵盖四个计算科学问题领域:流体力学、固体力学、材料科学和偏微分方程(PDEs)。SCICONVBENCH针对两种互补能力:获取缺失信息(消歧)和检测并纠正包含内部矛盾信息的请求(一致性解决)。我们的基准结合了结构化的任务本体和基于标准的评估框架,使能够系统地测量LLM在三个维度上的表现:澄清行为、对话基础性和最终规格的忠实度。当前前沿模型在一致性解决方面表现相对较好,但即使最好的模型在流体力学中也只解决了52.7%的消歧案例。我们进一步发现,前沿LLMs经常做出沉默假设并执行隐式规格修复,这些修复并未基于与用户对话的基础。SCICONVBENCH为评估可靠计算科学助手所需的上游对话推理建立了基础。代码和数据可在https://github.com/csml-rpi/SciConvBench找到。

英文摘要

Large Language Models (LLMs) are increasingly deployed as scientific AI as- sistants, and a growing body of benchmarks evaluates their capabilities across knowledge retrieval, reasoning, code generation, and tool use. These evaluations, however, typically assume the scientific problem is already well-posed, whereas practical scientific assistance often begins with an ill-posed user request that must be refined through dialogue before any computation, analysis, or experiment can be carried out reliably. We introduce SCICONVBENCH, a benchmark for multi- turn clarification in scientific task formulation across four computational science problem domains: fluid mechanics, solid mechanics, materials science, and par- tial differential equations (PDEs). SCICONVBENCH targets two complementary capabilities: eliciting missing information (disambiguation) and detecting and correcting erroneous requests containing internally contradictory information (in- consistency resolution). Our benchmark pairs a structured task ontology with a rubric-based evaluation framework, enabling systematic measurement of LLM per- formance across three dimensions: clarification behavior, conversational grounding, and final-specification fidelity. Current frontier models perform relatively well on inconsistency resolution, but even the best model resolves only 52.7% of the disambiguation cases in fluid mechanics. We further find that frontier LLMs fre- quently make silent assumptions and perform implicit specification repairs that are not grounded in the conversation with users. SCICONVBENCH establishes a foundation for evaluating the upstream conversational reasoning that a reliable computational science assistant requires. The code and data can be found at https://github.com/csml-rpi/SciConvBench.

2605.18627 2026-05-19 cs.AI 版本更新

Learning Lifted Action Models from Traces with Minimal Information About Actions and States

从动作轨迹中学习提升的动作模型:最少关于动作和状态的信息

Jonas Gösgens, Niklas Jansen, Hector Geffner

发表机构 * RWTH Aachen University(亚琛工业大学)

AI总结 本文研究了在不完全信息下从动作轨迹中学习STRIPS+动作域的问题,提出了三种通用情况下的算法和完备性结果,假设选定的动作参数完全可观察,从而在不同可观察性假设下确定等效域的学习条件。

Comments accepted at KR2026

详情
AI中文摘要

最近研究表明,仅从动作轨迹即可正确高效地学习提升的STRIPS模型;即应用隐藏的STRIPS模型中的可应用动作序列。这一结果令人印象深刻,因为并不假设状态完全可观察,但STRIPS动作包含的参数并非全部用于选择动作,因此实用性不足。为此,假设动作轨迹来自隐藏的STRIPS+模型,其中某些动作参数隐含在隐藏的动作前提中。然而,这种方法的局限性在于它假设状态完全可观察。在本文中,我们放宽这些限制,考虑在更一般的情境下从轨迹中学习STRIPS+动作域的问题,其中轨迹包含关于动作和状态的部分信息。特别地,我们为三种通用情况制定了算法和完备性结果,均假设选定的动作参数完全可观察。第一种情况不假设状态可观察;第二种情况假设某些状态谓词完全可观察;第三种情况则假设某些状态谓词局部可观察。给定一个STRIPS+域,这些结果描述了在什么条件下可以从此类轨迹中学习等效域。实验结果也进行了报告。

英文摘要

It has been recently shown that lifted STRIPS models can be learned correctly and efficiently from action traces alone; i.e., applicable action sequences from a hidden STRIPS model. The result is remarkable because the states are not assumed to be observable at all, and yet it is not practical enough as STRIPS actions include arguments that are not needed for selecting the actions. This shortcoming has been addressed by assuming that the action traces come instead from a hidden STRIPS+ model where some action arguments are implicit in the hidden action preconditions. A limitation of this approach, however, is that it assumes that the states are fully observable. In this work, we relax these restrictions and consider the problem of learning STRIPS+ action domains from traces in a more general context where the traces carry partial information about both actions and states. In particular, we formulate algorithms and completeness results for three general cases, all of which assume full observability of selected action arguments. In the first case, no observability of the state is assumed; in the second case, full observability of some state predicates is assumed, and in the third case, local observability of some state predicates is assumed instead. Given a STRIPS+ domain, these results characterize the conditions under which an equivalent domain can be learned from traces. Experimental results are reported.

2605.18621 2026-05-19 cs.CV cs.AI 版本更新

CrossView Suite: Harnessing Cross-view Spatial Intelligence of MLLMs with Dataset, Model and Benchmark

CrossView Suite: 利用数据集、模型和基准 harnessing MLLMs 的跨视图空间智能

Wei Wang, Yuqian Yuan, Tianwei Lin, Wenqiao Zhang, Siliang Tang, Jun Xiao, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学)

AI总结 该研究提出CrossView Suite,通过开发CrossViewSet、CrossViewBench和CrossViewer三个组件,解决跨视图推理中的数据稀缺、评估不足和对齐机制缺失问题,提升多视图空间理解能力。

详情
AI中文摘要

空间智能要求多模态大语言模型(MLLMs)超越单一视图感知,对物体、可见性、几何和交互在多个视角下保持一致推理。然而,跨视图推理的进步受限于三个主要缺口:大规模高质量标注训练数据的稀缺性、缺乏系统性评估的基准以及缺乏显式对齐机制以建立物体层面的一致性。为了解决这些缺口,我们全面开发了CrossView Suite的三个协调组件:CrossViewSet、CrossViewBench和CrossViewer。首先,我们引入一个多代理数据引擎,精心编纂了一个大规模、高质量的跨视图指令数据集,称为CrossViewSet,涵盖17种细粒度任务类型,包含1.6M个样本。其次,我们精心创建了一个场景不重叠的CrossViewBench,以全面评估MLLM的跨视图空间理解能力,评估其在各种方面的表现。最后,我们提出了CrossViewer,一个渐进的三阶段框架,用于MLLMs的跨视图空间推理,遵循感知->对齐->推理的范式。我们的方法配备了一个自适应的空间区域标记器,以捕捉细粒度的物体表示,然后显式对齐多视图对象,并因此融合对齐的特征,以提升MLLMs的跨视图推理能力。广泛的实验和分析表明,大规模训练数据、系统性评估和显式的跨视图对齐都是推动MLLMs从单视角感知向现实世界空间智能发展的关键因素。项目页面可在https://github.com/Thinkirin/Crossview-Suite上找到。

英文摘要

Spatial intelligence requires multimodal large language models (MLLMs) to move beyond single-view perception and reason consistently about objects, visibility, geometry, and interactions across multiple viewpoints. However, progress in cross-view reasoning remains limited by three major gaps: the scarcity of large-scale well-annotated training data, the lack of comprehensive benchmarks for systematic evaluation, and the absence of explicit alignment mechanisms that establish object-level consistency across views. To address these gaps, we thoroughly develop CrossView Suite across three coordinated components: CrossViewSet, CrossViewBench, and CrossViewer. Firstly, we introduce a multi-agent data engine to meticulously curate a large-scale, high-quality cross-view instruction dataset, termed CrossViewSet, covering 17 fine-grained task types with 1.6M samples. Second, we meticulously create a scene-disjoint CrossViewBench to comprehensively assess the cross-view spatial understanding capability of an MLLM, evaluating it across various aspects. Finally, we propose CrossViewer, a progressive three-stage framework for cross-view spatial reasoning in MLLMs, following a Perception -> Alignment -> Reasoning paradigm. Our method equips an adaptive spatial region tokenizer to capture fine-grained object representations, and then aligns the multi-view objects explicitly, and thus fuses aligned features for boosting the cross-view inference capacity for MLLMs. Extensive experiments and analyses show that large-scale training data, systematic evaluation, and explicit cross-view alignment are all critical for advancing MLLMs from single-view perception toward real-world spatial intelligence. The project page is available at https://github.com/Thinkirin/Crossview-Suite.

2605.18617 2026-05-19 cs.RO cs.AI cs.CV 版本更新

ManiSoft: Towards Vision-Language Manipulation for Soft Continuum Robotics

ManiSoft: 向视觉-语言操控的柔软连续机器人迈进

Ziyu Wei, Luting Wang, Chen Gao, Li Wen, Si Liu

发表机构 * Beihang University(北京航空航天大学) National University of Singapore(新加坡国立大学) Hangzhou Innovation Institute, Beihang University(北京航空航天大学杭州创新研究院)

AI总结 本文提出ManiSoft基准,用于研究柔软连续机器人的视觉-语言操控,通过定制模拟器结合真实柔软体动力学和丰富的接触交互,定义了四个任务以展示变形控制的不同方面,并通过自动化流程生成6300个多样场景和专家轨迹,评估了三种代表性策略模型的性能。

Comments Accepted in ICML 2026

详情
AI中文摘要

大多数现有的视觉-语言操控研究针对刚性机械臂,其固定形态限制了在杂乱或狭窄空间中的适应性。柔软机械臂由于其可变形性提供了一个有吸引力的替代方案,但面临不可靠的本体感觉和分布式的低层驱动挑战。为了研究这些挑战,我们介绍了ManiSoft,一个用于柔软机械臂的视觉-语言操控基准。ManiSoft特征一个定制的模拟器,通过弹性力约束将真实柔软体动力学与丰富的接触交互相结合。在此基础上,ManiSoft定义了四个任务,每个任务突出显示变形控制的不同方面,从基本末端执行器协调到障碍物回避。为了支持策略训练和评估,ManiSoft包括一个自动化流程,生成6,300个多样场景及其对应的专家轨迹。为了大规模生成高质量轨迹,我们首先使用高层规划器将每个任务分解为一系列路径点,然后使用低层强化学习策略生成扭矩命令以跟踪路径点。基准测试三种代表性策略模型显示在清洁场景中相对有希望的结果,但在随机化情况下性能显著下降。可视化分析表明,失败主要源于本体感觉状态的视觉估计不准确和变形性在适应性障碍回避中的利用有限。我们预计ManiSoft将作为有价值的测试平台,在视觉-语言操控的背景下弥合刚性和柔软机械臂之间的差距。代码和数据集已发布在https://buaa-colalab.github.io/ManiSoft。

英文摘要

Most existing vision-language manipulation research targets rigid robotic arms, whose fixed morphology limits adaptability in cluttered or confined spaces. Soft robotic arms offer an appealing alternative due to their deformability, but confront challenges such as unreliable proprioception and distributed low-level actuation. To investigate these challenges, we introduce \ManiSoft, a benchmark for vision-language manipulation with soft arms. ManiSoft features a tailored simulator that couples realistic soft-body dynamics with contact-rich interactions via an elastic force constraint. On this basis, ManiSoft defines four tasks, each highlighting distinct aspects of deformable control, from basic end-effector coordination to obstacle avoidance. To support policy training and evaluation, \ManiSoft{} includes an automated pipeline that generates $6{,}300$ diverse scenes and corresponding expert trajectories. To produce high-quality trajectories at scale, we first employ a high-level planner to decompose each task into a sequence of waypoints, followed by a low-level reinforcement learning policy that generates torque commands to track waypoints. Benchmarking three representative policy models shows relatively promising results in clean scenes but substantial performance drop under randomization. Visualization analysis indicates that failures stem primarily from inaccurate visual estimation of proprioceptive state and limited exploitation of deformability for adaptive obstacle avoiding. We anticipate ManiSoft to serve as a valuable testbed, bridging the gap between rigid and soft arms in the context of vision-language manipulation. Out codes and datasets are released at https://buaa-colalab.github.io/ManiSoft.

2605.18613 2026-05-19 cs.SD cs.AI 版本更新

SAME: A Semantically-Aligned Music Autoencoder

SAME:一种语义对齐的音乐自编码器

Julian D. Parker, Zach Evans, CJ Carr, Zachary Zukowski, Josiah Taylor, Matthew Rice, Jordi Pons

发表机构 * Stability AI

AI总结 该研究提出SAME自编码器,通过结合Transformer架构和语义正则化方法,实现了4096倍的时间压缩比,同时保持重建质量和生成性能。

详情
AI中文摘要

潜在表示是现代生成模型的核心。在音频领域,它们通常由神经音频编解码器自编码器生成。在本工作中,我们介绍了SAME(Semantically-Aligned Music autoEncoder),一种用于立体音乐和通用音频的自编码器,实现了4096×的时间压缩比,同时保持重建质量和下游生成性能。我们通过结合基于Transformer的主干结构和一组语义正则化方法、相位感知的重建损失以及改进的判别器设计来实现这一点。该架构通过其高压缩比和对优化良好的Transformer原语的依赖,提供了显著的计算成本优势。两种变体(大型SAME-L和可部署在CPU上的SAME-S)以开放权重形式发布。

英文摘要

Latent representations are at the heart of the majority of modern generative models. In the audio domain they are typically produced by a neural-audio-codec autoencoder. In this work we introduce SAME (Semantically-Aligned Music autoEncoder), an autoencoder for stereo music and general audio that reaches a 4096$\times$ temporal compression ratio while maintaining reconstruction quality and downstream generative performance. We achieve this by combining a tranformer-based backbone with set of semantic regularisation approaches, phase-aware reconstruction losses and improved discriminator designs. The architecture delivers substantial computational cost benefits, through both its high compression ratio and its reliance on well-optimised transformer primitives. Two variants (a large SAME-L and a CPU-deployable SAME-S) are released in open-weights form.

2605.18610 2026-05-19 cs.CV cs.AI cs.LG 版本更新

CATA: Continual Machine Unlearning via Conflict-Averse Task Arithmetic

CATA: 通过冲突厌恶任务算术实现持续机器去学习

Shen Lin, Junhao Dong, Rongjie Chen, Xiaoyu Zhang, Li Xu, Xiaofeng Chen

发表机构 * Fujian Normal University(福建师范大学) Nanyang Technological University(南洋理工大学) Xidian University(西安电子科技大学)

AI总结 本文首次研究了视觉语言模型的持续去学习问题,提出CATA方法,通过冲突厌恶任务算术有效解决去学习中的有效性、模型保真度和持续性挑战。

详情
AI中文摘要

视觉语言模型(VLMs)在对齐视觉和文本表示方面表现出色,能够支持多种多模态应用。然而,其大规模训练数据不可避免地引发了隐私、版权和不良内容的担忧,这使得机器去学习变得必要。尽管现有研究主要关注单次去学习,但实际VLM部署往往涉及随时间推移的连续删除请求,从而产生持续机器去学习。在本文中,我们首次研究了VLMs的持续去学习,并识别出该设置中的三个关键挑战:去除目标知识的有效性、保留模型效用的保真度以及在连续更新下防止知识重新出现的持续性。为了解决这些挑战,我们提出了CATA,一种冲突厌恶任务算术方法,将每个遗忘请求表示为一个去学习任务向量。通过维护历史任务向量并执行符号感知的冲突厌恶聚合,CATA抑制可能削弱先前遗忘效果的冲突更新组件。在单次和持续设置下的大量实验表明,CATA在遗忘有效性、模型保真度和遗忘持续性方面均优于基线方法。

英文摘要

Vision-language models (VLMs) have shown remarkable ability in aligning visual and textual representations, enabling a wide range of multimodal applications. However, their large-scale training data inevitably raises concerns about privacy, copyright, and undesirable content, creating a strong need for machine unlearning. While existing studies mainly focus on single-shot unlearning, practical VLM deployment often involves sequential removal requests over time, giving rise to continual machine unlearning. In this work, we make the first attempt to study continual unlearning for VLMs and identify three key challenges in this setting: effectiveness in removing target knowledge, fidelity in preserving retained model utility, and persistence in preventing knowledge re-emergence under sequential updates. To address these challenges, we propose CATA, a conflict-averse task arithmetic method that represents each forget request as an unlearning task vector. By maintaining historical task vectors and performing sign-aware conflict-averse aggregation, CATA suppresses conflicting update components that may weaken previous forgetting effects. Extensive experiments under both single-shot and continual settings show that CATA outperforms baselines in terms of forgetting effectiveness, model fidelity, and forgetting persistence.

2605.18593 2026-05-19 cs.CR cs.AI cs.RO 版本更新

Not What You Asked For: Typographic Attacks in Household Robot Manipulation

并非你所要求的:家庭机器人操作中的字体攻击

Ali Iranmanesh, Peng Liu

发表机构 * Cyber Security Lab(网络安全实验室) The Pennsylvania State University(宾夕法尼亚州立大学) State College, USA(州立学院,美国)

AI总结 本研究探讨了字体攻击对家庭机器人操作全流程的影响,提出了一种解耦感知架构,并发现感知错误会通过持久的3D语义地图导致物理性故障,揭示了字体误分类对机器人安全性的实际威胁。

Comments 10 pages, 1 figure, IEEE conference format

详情
AI中文摘要

开放词汇的具身AI代理越来越多地依赖如CLIP之类的视觉-语言模型进行物体感知和任务定位。然而,这种共享嵌入空间所带来的结构漏洞使字体攻击成为可能,其中物理场景中的印刷文本会语义上覆盖视觉判断。尽管先前研究在静态2D基准和3D导航任务中量化了这一威胁,但其对家庭机器人操作完整Sense-Plan-Act流程的影响仍未被探索。本文在基于Habitat的模拟中评估了字体攻击,使用HomeRobot基准。我们引入了一种解耦感知架构,使冻结的CLIP编码器暴露于对抗性贴纸,同时通过DEtic保持几何定位。在59个可控评估回合中,攻击的总体攻击成功率(ASR)为67.8%,在完全成功回合中上升至70.0%,在无控制视角和遮挡且无感知优化的情况下。关键发现是,感知错误通过持久的3D语义地图传播,导致动能故障,即由对抗性污染的语义状态驱动的物理性抓取和运输错误物体。在这些情况下,机器人会物理上抓取并传递错误的物体到目标容器。这些结果确立了字体误分类作为对模块化操作流程安全性的实际、可测量且物理上有影响的威胁,而此前的字体攻击研究未对其进行考察。

英文摘要

Open-vocabulary embodied AI agents increasingly rely on vision-language models such as CLIP for object perception and task grounding. However, the shared embedding space that enables this flexibility introduces a structural vulnerability to typographic attacks, where printed text in a physical scene semantically overrides visual judgment. While prior work has quantified this threat in static 2D benchmarks and 3D navigation tasks, its impact on the full Sense-Plan-Act pipeline of household robot manipulation remains unexplored. This work evaluates typographic attacks in a Habitat-based simulation using the HomeRobot benchmark. We introduce a decoupled perception architecture that exposes a frozen CLIP encoder to adversarial stickers while maintaining geometric grounding via DETIC. In a controlled evaluation pool of 59 attributable episodes, the attack achieves an overall Attack Success Rate (ASR) of 67.8%, rising to 70.0% among fully successful episodes, under uncontrolled viewing angles and occlusion with no perceptual optimization. Critically, we find that perceptual errors propagate through the persistent 3D semantic map to produce kinetic failures, defined here as physically executed grasping and transport of the wrong object driven by an adversarially poisoned semantic state. In these cases, the robot physically grasps and delivers the wrong object to a target receptacle. These results establish typographic misclassification as a real, measurable, and physically consequential threat to the safety of modular manipulation pipelines that prior typographic attack research has left unexamined.

2605.18591 2026-05-19 cs.LG cs.AI 版本更新

Randomized Advantage Transformation (RAT): Computing Natural Policy Gradients via Direct Backpropagation

随机优势变换(RAT):通过直接反向传播计算自然策略梯度

Mingfei Sun

发表机构 * The University of Manchester, United Kingdom(曼彻斯特大学,英国)

AI总结 本文提出RAT方法,通过直接反向传播估计正则化自然策略梯度,解决了传统方法中估计和求逆Fisher矩阵成本高的问题,实验证明其在连续和视觉控制基准上性能优异且易于实现。

Comments Accepted to ICML 2026

详情
AI中文摘要

自然策略梯度通过考虑分布空间的几何特性来提高优化效果,但其实际应用受限于估计和求逆Fisher矩阵的成本。我们提出了随机优势变换(RAT),一种通过直接反向传播估计Tikhonov正则化自然策略梯度的方法。通过应用Woodbury公式,我们将正则化自然策略梯度重新表述为带有变换优势的普通策略梯度。RAT通过在在线小批量上应用随机块Kaczmarz迭代高效计算这种变换,避免了显式Fisher构造、共轭梯度求解器和架构特定的近似。我们为RAT提供了收敛保证,并实验证明其在连续和视觉控制基准上与现有自然梯度方法相媲美或更优,同时保持简单易用且兼容各种架构。

英文摘要

Natural policy gradients improve optimization by accounting for the geometry of distribution space, but their practical use is limited by the cost of estimating and inverting the Fisher matrix. We present Randomized Advantage Transformation (RAT), a method for estimating Tikhonov-regularized natural policy gradients via direct backpropagation. By applying the Woodbury formula, we reformulate the regularized natural policy gradients as vanilla policy gradients with a transformed advantage. RAT computes this transformation efficiently via randomized block Kaczmarz iterations on on-policy mini-batches, avoiding explicit Fisher construction, conjugate-gradient solvers, and architecture-specific approximations. We provide convergence guarantees for RAT and demonstrate empirically that it matches or exceeds established natural-gradient methods across continuous and visual control benchmarks, while remaining simple to implement and compatible with various architectures.

2605.18583 2026-05-19 cs.SE cs.AI cs.CL cs.CR 版本更新

Overeager Coding Agents: Measuring Out-of-Scope Actions on Benign Tasks

过度激进的编码代理:在良性任务中测量超出范围的动作

Yubin Qu, Ying Zhang, Yanjun Zhang, Gelei Deng, Yuekang Li, Leo Yu Zhang, Yi Liu

发表机构 * Griffith University(格里菲斯大学) Wake Forest University(威克森林大学) Nanyang Technological University(南洋理工大学) University of New South Wales(新南威尔士大学) Quantstamp

AI总结 本文提出OverEager-Gen基准,用于评估编码代理在良性任务中超出范围的行为。研究发现,当基准中明确列出授权范围时,代理会停止推断边界并开始匹配声明文本,从而影响测量有效性。通过行为梯度验证器和双通道堆栈审计内部工具调用,验证了不同框架下的过度激进行为差异。

详情
AI中文摘要

编码代理现在能够自主运行,拥有shell、文件和网络权限。当用户发出良性请求时,代理有时会做更多事情:删除无关文件、清除过期凭证备份,或重写用户从未提及的配置。我们将这些超出范围的行为称为过度激进动作,这是一种与能力失败、提示注入或沙盒逃逸不同的授权问题。我们提出了OverEager-Gen,一个专门用于评估过度激进行为的基准。构建该基准暴露了一个测量有效性问题:如果基准在提示中明确列出授权范围,代理会停止推断边界并开始匹配声明文本。在Claude Code上,仅去除同意声明,配对场景中的过度激进率从0.0%增加到17.1%(McNemar精确p=2.4×10^-4)。OverEager-Gen因此在通过前使用行为梯度验证器验证每个场景的区分能力,通过双通道堆栈审计内部工具调用(PATH注入的 shim 加上每个代理的事件流),并发布字节完全一致的同意保留和同意去除变体。OverEager-Bench包含500个验证的场景和四个代理产品(Claude Code、OpenHands、Codex CLI、Gemini CLI)和六个基础模型的约7,500次运行;50个重新注释样本给出Cohen's kappa=0.73和规则判断召回率=1.00。去除同意声明在每个共享基础模型上都增加了过度激进率(Delta在[11.9,17.2]pp)。框架轴主导效应大小:一个宽松的集群(Claude Code、Codex CLI、Gemini CLI)运行在5.4-27.7%之间,而询问继续框架(OpenHands)位于0.2-4.5%(Fisher p<=10^-5)。在框架内基础模型方差达到15.9pp,表明模型层对齐并未完全通过宽松权限门控传播。

英文摘要

Coding agents now run autonomously with shell, file, and network privileges. When a user issues a benign request, the agent sometimes does more than asked: it deletes unrelated files, wipes a stale credentials backup, or rewrites configuration the user never mentioned. We call these scope expansions overeager actions, an authorization problem distinct from capability failures, prompt injection, or sandbox escapes. We present OverEager-Gen, a benchmark dedicated to overeager behavior on benign tasks. Building it surfaces a measurement-validity issue: if a benchmark spells out the authorized scope inside the prompt, the agent stops inferring boundaries and starts pattern-matching declaration text. On Claude Code, stripping the consent declaration alone raises the overeager rate from 0.0% to 17.1% on paired scenarios (McNemar exact p = 2.4 x 10^-4). OverEager-Gen therefore certifies each scenario's discriminative power before admission via a behavioral-gradient validator, audits internal tool calls through a dual-channel stack (PATH-injected shim plus per-agent event streams), and ships byte-identical consent_kept and consent_stripped variants. OverEager-Bench contains 500 validated scenarios and ~7,500 runs across four agent products (Claude Code, OpenHands, Codex CLI, Gemini CLI) and six base models; a 50-sample re-annotation gives Cohen's kappa = 0.73 and rule-judge recall = 1.00. Stripping consent multiplies the overeager rate on every shared base model (Delta in [11.9, 17.2] pp). The framework axis dominates effect size: a permissive cluster (Claude Code, Codex CLI, Gemini CLI) runs at 5.4-27.7% while the ask-to-continue framework (OpenHands) sits at 0.2-4.5% (Fisher p <= 10^-5). Within-framework base-model variance reaches 15.9 pp, indicating that model-layer alignment does not fully propagate through permissive permission gating.

2605.18580 2026-05-19 cs.AI cs.LG 版本更新

When Outcome Looks Right But Discipline Fails: Trace-Based Evaluation Under Hidden Competitor State

当结果看似正确但纪律却失败:基于轨迹的评估在隐藏对手状态下的应用

Peiying Zhu, Sidi Chang

发表机构 * Blossom AI Blossom AI Labs(Blossom AI 实验室)

AI总结 本文提出了一种基于轨迹的评估方法,用于评估在隐藏对手状态下的行为纪律稳定性,通过轨迹诊断、机制分离和转移测试来改进强化学习策略,特别是在酒店定价和隐藏预算竞标任务中。

详情
AI中文摘要

仅结果的评估可能无法保证经济安全的智能体:一种策略可能在达到业务KPI的同时,违反可部署的行为纪律。在酒店定价中,当存在隐藏的对手状态时,学习者可能在看似合理的每间房收入上取得成绩,却无法保持规则基于的收益管理对手的定价纪律。我们引入了纪律稳定性,一种基于轨迹的评估范式:定义基准行为,限制观察到部署阶段,从失败中诱导轨迹诊断,通过消融分离机制,并测试转移和部署。在两个酒店基准和一个紧凑的隐藏预算竞标任务中,仅奖励的PPO变体无法实现轨迹对齐;揭示隐藏状态可减少标签不确定性;确定性复制可压缩不确定性;而轨迹先验或修正历史策略能更好地保持价格或投标分布。纯粹的行为克隆在对称模仿中几乎足够,而轨迹先验强化学习在容量不对称情况下增加有限的适应性。本文的贡献是一种评估和基准范式,而不是新的优化器或关于多智能体强化学习的普遍声明。

英文摘要

Outcome-only evaluation can certify economically unsafe agents: a policy can hit a business KPI while violating deployable behavioral discipline. In hotel pricing with hidden competitor state, a learner can achieve plausible revenue per available room while failing to preserve the rate discipline of a rule-based revenue-management competitor. We introduce discipline stability, a trace-based evaluation paradigm: define the benchmark behavior, restrict observations to the deployment regime, induce trace diagnostics from failure, separate mechanisms with ablations, and test transfer and deployment. Across a two-hotel benchmark and a compact hidden-budget bidding task, reward-only PPO variants miss trace alignment; revealing hidden state reduces label uncertainty; deterministic copy collapses uncertainty; and trace-prior or corrected history policies better preserve price or bid distributions. Pure behavior cloning is nearly enough for symmetric imitation, while Trace-Prior RL adds bounded adaptation under capacity asymmetry. The contribution is an evaluation and benchmark paradigm, not a new optimizer or a universal claim about MARL

2605.18570 2026-05-19 cs.AI 版本更新

Query-Conditioned Knowledge Alignment for Reliable Cross-System Medical Reasoning

基于查询的知识对齐用于可靠的跨系统医学推理

Yan Jiao, Jingran Xu, Pin-Han Ho, Limei Peng

发表机构 * Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China(深圳先进研究院,电子科技大学) Department of Electrical and Computer Engineering, University of Waterloo(滑铁卢大学电气与计算机工程系) School of Computer Science and Engineering, Kyungpook National University(庆北国立大学计算机科学与工程学院)

AI总结 本文提出了一种基于查询的知识对齐方法QCEA,通过将实体对齐问题转化为查询条件下的对应关系问题,以提升跨系统医学推理的可靠性,主要贡献是引入了方向感知变换模块以捕捉异构知识系统中的非对称和多对多对应关系。

详情
AI中文摘要

跨领域知识对齐对于整合异构医学系统至关重要,但现有方法通常将实体对齐视为静态匹配问题,忽略了查询上下文和跨系统不对称性。本论文提出查询条件实体对齐(QCEA),将实体对齐重新表述为查询条件下的对应关系问题。不同于学习实体表示之间的固定映射,QCEA将源实体的文本描述视为查询,并在目标图中对候选实体进行排序,从而实现依赖上下文的对齐。该框架整合了语义编码、基于图的表示学习以及方向感知变换模块,以捕捉异构知识系统中的非对称和多对多对应关系。我们评估了QCEA在TCM--WM知识图谱上的表现,涵盖了症状对齐和草药-分子对齐任务。实验结果表明,QCEA在代表性基线方法上表现一致改进,特别是在对排名敏感的指标如Hit@K和MRR上。此外,下游检索增强生成(RAG)实验表明,改进的对齐导致更好的证据检索、更强的支撑性和更高的答案准确性。这些发现强调,对齐不仅仅是数据整合步骤,而是影响跨系统医学推理中知识可访问性和可靠性的关键因素。

英文摘要

Cross-domain knowledge alignment is essential for integrating heterogeneous medical systems, yet existing approaches typically treat entity alignment as a static matching problem, ignoring query context and cross-system asymmetry. This limitation is particularly critical in integrative medical settings, where correspondence between concepts is inherently context-dependent, non-bijective, and direction-sensitive. In this paper, we propose Query-Conditioned Entity Alignment (QCEA), which reformulates entity alignment as a query-conditioned correspondence problem. Instead of learning a fixed mapping between entity representations, QCEA treats the textual description of a source entity as a query and ranks candidate entities in the target graph, enabling context-dependent alignment. The framework integrates semantic encoding, graph-based representation learning, and a direction-aware transformation module to capture asymmetric and many-to-many correspondence across heterogeneous knowledge systems. We evaluate QCEA on TCM--WM knowledge graphs derived from SymMap, covering both symptom alignment and herb--molecule alignment tasks. Experimental results show consistent improvements over representative baselines, particularly on rank-sensitive metrics such as Hit@K and MRR. Furthermore, downstream retrieval-augmented generation (RAG) experiments demonstrate that improved alignment leads to better evidence retrieval, stronger grounding, and higher answer accuracy. These findings highlight that alignment is not merely a data integration step, but a key factor that shapes knowledge accessibility and reliability in cross-system medical reasoning.

2605.18562 2026-05-19 stat.ME cs.AI cs.LG stat.AP 版本更新

Estimating Item Difficulty with Large Language Models as Experts

利用大语言模型作为专家估算项目难度

Diana Kolesnikova, Kirill Fedyanin, Abe D. Hofman, Matthieu J. S. Brinkhuis, Maria Bolsinova

发表机构 * Department of Methodology and Statistics, Tilburg University(蒂尔堡大学方法学与统计学系) Smart Business Technologies(智能商务技术公司) Department of Psychological Methods, University of Amsterdam(阿姆斯特丹大学心理方法系) Prowise Learn, Amsterdam(Prowise Learn公司,阿姆斯特丹) Department of Information and Computing Sciences, Utrecht University(乌得勒支大学信息与计算科学系)

AI总结 本文研究了如何利用大语言模型估算新任务的难度,通过对比不同配置下的模型表现,发现基于对偶比较的配置在无额外优化时表现更优,而结合token概率和已知难度示例的绝对判断配置也表现出中等至高水平的对齐度。

Comments 24 pages, 2 figures, 9 tables

详情
AI中文摘要

准确估计项目难度对于有效的评估和适应性学习至关重要。然而,对于新创建的任务,响应数据通常不可用。预测试和专家判断可能成本高且耗时,而机器学习方法通常需要大量标记训练数据。最近的研究表明,大语言模型(LLMs)可能有所帮助。然而,关于如何通过提示配置来模拟专家进行难度估计的证据有限。本研究通过评估三种现成的LLMs作为新任务的难度评估者,填补了这一空白。使用一个在线学习系统中的项目库,研究了6个小学数学领域,将经验难度作为参考。研究采用全因子设计,交叉三个因素:判断格式(绝对vs对偶比较)、决策类型(硬决策vs基于token概率的估计)和提示策略(零样本vs少量样本)。LLM生成的难度估计与经验难度通过斯皮尔曼等级相关性进行比较。在各领域中,LLM生成的估计与经验项目难度表现出中等至强正相关。对于简单的算术任务,某些配置接近之前研究中人类专家报告的准确性范围的上限。对偶比较在无额外优化时始终优于绝对判断。然而,当结合token级概率并提供已知难度的项目示例时,绝对判断配置也表现出中等至高水平的对齐度。本研究将LLMs定位为初始项目校准的有前途的工具,并提供了有效工作流程配置的见解。

英文摘要

Accurate estimates of item difficulty are essential for valid assessment and effective adaptive learning. However, for newly created tasks, response data are typically unavailable. Pretesting and expert judgement can be costly and slow, while machine learning methods often require large labelled training datasets. Recent work suggests that large language models (LLMs) may help. However, there is limited evidence on the elicitation procedures and prompt configurations used to emulate experts for difficulty estimation. This study addresses this gap by evaluating three off-the-shelf LLMs as difficulty raters for newly created items without access to response data. Using an item bank from an online learning system, the study examined 6 domains of primary-school mathematics, with empirical difficulty estimates treated as empirical reference. The study used a full factorial design crossing three factors: judgement format (absolute vs pairwise), decision type (hard decisions vs token-probability-based estimates), and prompting strategy (zero-shot vs few-shot). LLM-derived difficulty estimates were compared with empirical difficulties using Spearman rank correlations. Across domains, LLM-based estimates exhibited moderate to strong positive correlations with empirical item difficulties. For simpler arithmetic tasks, some configurations approached the upper end of the accuracy range reported for human experts in previous research. Pairwise comparison consistently outperformed absolute judgement in the absence of additional refinements. However, when token-level probabilities were incorporated and examples of items with known empirical difficulty were provided, the absolute judgement configuration likewise demonstrated moderate-to-high alignment. The study positions LLMs as a promising tool for initial item calibration and offers insights into effective workflow configuration.

2605.18561 2026-05-19 cs.IR cs.AI cs.SE 版本更新

Improving BM25 Code Retrieval Under Fixed Generic Tokenization: Adaptive q-Log Odds as a Drop-In BM25 Fix

在固定通用分词下改进BM25代码检索:适应性q-对数比作为BM25的即插即用修复

Santosh Kumar Radha, Oktay Goktas

发表机构 * AgentField

AI总结 该研究针对固定通用分词下BM25代码检索的不足,提出适应性q-对数比作为BM25的改进方法,通过替换RSJ对数比的外层对数为q对数,在保持BM25精度的同时提升检索性能,实验显示在CodeSearchNet Go数据集上NDCG@10显著提升。

Comments 19 pages, 12 figures. Code and artifacts: https://github.com/santoshkumarradha/rarecode

详情
AI中文摘要

在检索增强的编程中,失败通常发生在相关文件不在检索上下文中的时候。在冻结通用分词的情况下,当BM25索引由不可控的搜索系统构建时,这种失败是常态:BM25的对数RSJ-比IDF无法区分不同函数的标识符尾部。我们用q对数替代RSJ比的外层对数。在q=1时,该变换通过洛必达法则恢复BM25的精确性,对于q<1的情况,它是对RSJ比的Box-Cox变换,其中lambda=1-q。在CoIR CodeSearchNet Go(182K文档)上,经过oracle调优的NDCG@10从0.2575提升到0.4874(绝对+0.2299;+89.3%相对;在10,000对的bootstrap重采样中,零符号反转,报告为p <= 10^-4)。该效果在不同编程语言中呈梯度变化,在BEIR文本上接近于零。一个一参数闭合形式估计一个语料库级别的q值,从hapax密度推导出来,并在BM25已经最优的语料库上保持接近q=1。索引时间成本是稀疏得分矩阵的一次遍历,查询延迟不变。一个分词消融实验显示,标识符意识的分词在很大程度上消除了q-IDF的增量增益。

英文摘要

In retrieval-augmented coding, failures often begin when the relevant file is absent from the retrieved context. Under frozen generic tokenization, where a BM25 index has been built by a search system whose analyzer the practitioner does not control, this failure is routine: BM25's logarithmic RSJ-odds IDF under-separates the identifier tail that distinguishes one function from another. We replace the outer logarithm of the Robertson-Spärck-Jones odds with a q-logarithm. At q=1 the transform recovers BM25 exactly by L'Hôpital's rule, and for q<1 it is a Box-Cox transform of the RSJ odds with lambda = 1-q. On CoIR CodeSearchNet Go (182K documents), oracle-tuned NDCG@10 rises from 0.2575 to 0.4874 (absolute +0.2299; +89.3% relative; zero sign reversals in 10,000 paired-bootstrap resamples, reported as p <= 10^-4). The effect is graded across code languages and is near-zero on BEIR text. A one-parameter closed form estimates a corpus-level q from hapax density and stays near q=1 on corpora where BM25 is already optimal. The index-time cost is a single pass over the sparse score matrix and query latency is unchanged. A tokenizer ablation shows that identifier-aware tokenization largely removes the incremental gain from q-IDF.

2605.18556 2026-05-19 cs.RO cs.AI 版本更新

Key-Gram: Extensible World Knowledge for Embodied Manipulation

Key-Gram: 用于具身操作的可扩展世界知识

Jingjing Fan, Siyuan Li, Botao Ren, Zhidong Deng

发表机构 * Department of Computer Science and Technology(计算机科学与技术系) Department of Automation(自动化系)

AI总结 本文提出Key-Gram框架,通过分离语言知识与视觉状态推理,提升具身控制中对组合语言指令的理解和执行能力,主要贡献是引入可扩展的外部记忆模块以提高迁移和现实世界操作性能。

Comments 16 pages, 5 figures

详情
AI中文摘要

具身控制越来越多地要求模型在动态视觉状态上进行推理的同时遵循组合语言指令。然而,当前的视觉-语言-动作策略和世界-动作模型通常将语言知识与视觉计算结合在共享的骨干或条件路径中,导致模态竞争,并使知识扩展依赖于骨干更新。在本文中,我们引入了Key-Gram,一种条件记忆框架,它将语言衍生的世界知识与视觉状态推理分离用于具身控制。其核心是一个记忆模块,该模块将指令分解为任务特定的关键词组,通过确定性哈希查找检索静态语言先验,并通过上下文感知门控和轻量级卷积融合将检索到的条目注入到选定的隐藏层中。这种设计使骨干能够将其主要能力用于视觉推理和动作推断,同时可重用的指令知识存储在可扩展的外部记忆中。逻辑记忆表可以在训练期间方便地划分,并且由于其O(1)的查找模式,在推理时可以高效地放置在主机内存中。在RoboTwin2.0、LIBERO/LIBERO-Plus和现实世界双臂操作中,Key-Gram一致地提高了π₀和π₀.₅骨干,平均相对增益为RoboTwin2.0上的29.5%/9.9%、LIBERO-Plus转移无目标领域微调时的35.8%/4.5%以及现实世界长周期任务上的15.4%/8.1%。这些结果表明,外部化的语言记忆提供了一种有效的、可扩展的机制,以提高组合基础、迁移和现实世界操作性能。

英文摘要

Embodied control increasingly requires models to follow compositional language instructions while reasoning over dynamic visual states. However, current vision-language-action policies and world-action models often couple linguistic knowledge with visual computation in a shared backbone or conditioning pathway, leading to modality competition and making knowledge extension dependent on backbone updates. In this paper, we introduce Key-Gram, a conditional-memory framework that separates language-derived world knowledge from visual-state reasoning for embodied control. At its core is a memory module that decomposes an instruction into task-specific key-grams, retrieves static linguistic priors through deterministic hashed lookup, and injects the retrieved entries into selected hidden layers through context-aware gating and lightweight convolutional fusion. This design allows the backbone to devote its main capacity to visual reasoning and action inference, while reusable instruction knowledge is stored in an extensible external memory. The logical memory table can be conveniently partitioned during training and, due to its $O(1)$ lookup pattern, efficiently placed on host memory during inference. Across RoboTwin2.0, LIBERO/LIBERO-Plus, and real-world dual-arm manipulation, Key-Gram consistently improves both $π_{0}$ and $π_{0.5}$ backbones, with average relative gains of $29.5\%/9.9\%$ on RoboTwin2.0, $35.8\%/4.5\%$ on LIBERO-Plus transfer without target-domain fine-tuning, and $15.4\%/8.1\%$ on real-world long-horizon tasks. These results demonstrate that externalized linguistic memory provides an effective and extensible mechanism for improving compositional grounding, transfer, and real-world manipulation.

2605.18553 2026-05-19 cs.CV cs.AI 版本更新

StableHand: Quality-Aware Flow Matching for World-Space Dual-Hand Motion Estimation from Egocentric Video

StableHand: 世界空间双臂运动估计中的质量感知流匹配

Huajian Zeng, Chaohua Yao, Yuantai Zhang, Jiaqi Yang, Rolandos Alexandros Potamias, Xingxing Zuo

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德·本·泽德人工智能大学) University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Imperial College London(伦敦帝国理工学院)

AI总结 本文提出StableHand,一种质量感知的流匹配框架,用于从第一人称视频中恢复世界空间双臂的4D运动,通过分解手部姿态估计器提取的观测质量为四个通道,并利用学习的质量网络预测质量信号,以提高运动估计的鲁棒性。

Comments Project Page: https://huajian-zeng.github.io/projects/stablehand/

详情
AI中文摘要

从第一人称视频中恢复世界空间中两个交互手的4D运动是监督机器人策略学习的基本能力,其中手腕轨迹跟踪末端执行器,手指运动规格化抓取姿态。在此设置中存在两个主要挑战:由于头部运动,手经常长时间离开摄像机视野,且持续的手-物体相互作用导致一个或两个手的严重遮挡。现有方法统一地基于噪声手运动观测,而不考虑其每帧的可靠性,导致性能显著下降。我们的关键见解是,准确的世界空间手运动估计与每帧手部观测的质量紧密相关。为此,我们将从现成的手部姿态估计器中提取的手部运动观测的质量分解为四个通道:双臂的手腕全局平移和手指运动。我们提出StableHand,一种质量感知的流匹配框架,其条件于这些四个通道的质量信号,这些信号由学习的质量网络预测。我们通过每通道的前向调度、质量调整的速度目标、AdaLN调制的DiT去噪器以及质量感知ODE初始化,自然地将质量信号整合到流匹配过程中。这种统一的生成过程在保持高质量观测的同时,利用学习的双臂运动先验重构不可靠的观测。在HOT3D和ARCTIC两个具有长缺失手跨度和持续手-物体遮挡的第一人称基准上,实验表明,StableHand在所有报告的指标上均达到最先进的性能,与最强基线相比,将W-MPJPE减少20-25%,在严重遮挡的ARCTIC序列上最大收益最明显。

英文摘要

Recovering world space 4D motion of two interacting hands from egocentric video is a fundamental capability for supervising robot policy learning, where wrist trajectories track the end-effector and finger articulations specify the grasp pose. Two major challenges arise in this setting: hands frequently leave the camera view for extended periods due to head motion, and persistent hand-object interactions cause severe occlusions of one or both hands. Existing methods uniformly condition on noisy hand motion observations without accounting for their per-frame reliability, leading to substantial performance degradation. Our key insight is that accurate world space hand motion estimation is tightly coupled with the quality of per-frame hand observations. To this end, we decompose the quality of hand motion observations extracted from an off-the-shelf hand pose estimator into four channels: wrist global translation and finger articulations for both hands. We propose StableHand, a quality-aware flow-matching framework conditioned on these four-channel quality signals, which are predicted by a learned quality network. We naturally incorporate the quality signals into the flow-matching process through a per-channel forward schedule, a quality-adjusted velocity target, AdaLN modulation of the DiT denoiser, and a quality-aware ODE initialization. This unified generative process preserves high-quality observations while reconstructing unreliable ones using a learned bimanual motion prior. Experiments on HOT3D and ARCTIC, two egocentric benchmarks featuring long missing-hand spans and persistent hand-object occlusions, show that StableHand achieves state-of-the-art performance across all reported metrics, reducing W-MPJPE by 20-25% compared to the strongest baseline, with the largest gains on heavily occluded ARCTIC sequences.

2605.18548 2026-05-19 cs.CL cs.AI 版本更新

STT-Arena: A More Realistic Environment for Tool-Using with Spatio-Temporal Dynamics

STT-Arena:一种更现实的工具使用环境,包含时空动态

Tingfeng Hui, Hao Xu, Pengyu Zhu, Hongsheng Xin, Kun Zhan, Sen Su, Chunxiao Liu, Ning Miao

发表机构 * Hong Kong Institute of AI for Science, City University of Hong Kong(香港人工智能科学研究院,香港城市大学) Department of Data Science, City University of Hong Kong(数据科学系,香港城市大学) Beijing University of Posts and Telecommunications(北京邮电大学) Li Auto Inc.(李汽有限公司) Independent Researcher(独立研究者)

AI总结 本文提出STT-Arena基准测试,旨在评估大型语言模型在面对时空动态变化时的适应性规划能力,发现现有模型在处理此类动态问题时存在显著不足,并提出改进方法STT-Agent-4B以提升性能。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLMs)在现实世界中的代理应用中必须能够重新规划和适应,当任务中途中断时推翻其先前决策。现有的动态基准主要测量LLMs是否能够及时检测时间变化,留下适应时空动态的互补挑战未被探索。我们介绍了STT-Arena(Spatio-Temporal Tool-Use Arena),一个包含227个高质量交互任务的基准测试,涵盖九种时空冲突类型和四种可解性级别。每个任务都基于一个现实、可执行的环境,配备注入的时空触发器,可以突然使正在进行的计划失效,迫使模型检测状态变化并构建修订的执行策略。对前沿LLMs的广泛评估显示,即使是最先进的专有模型,如Claude-4.6-Opus,也只达到低于40%的总体准确率,突显了时空动态推理的根本难度。对失败轨迹的系统分析揭示了现有模型的三种反复出现的错误模式:停滞状态执行、动态触发器的误诊断和缺失的适应后验证。基于这些发现,我们提出了一种迭代轨迹细化技术,消除这些失败模式,结合在线强化学习,产生STT-Agent-4B,其在STT-Arena上优于前沿LLMs。

英文摘要

Large language models (LLMs) deployed in real-world agentic applications must be capable of replanning and adapting when mid-task disruptions invalidate their prior decisions. Existing dynamic benchmarks primarily measure whether LLMs can detect temporal changes in a timely manner, leaving the complementary challenge of adaptive replanning under spatio-temporal dynamics largely unexplored. We introduce STT-Arena (Spatio-Temporal Tool-Use Arena), a benchmark of 227 high-quality interactive tasks spanning nine spatio-temporal conflict types and four solvability levels. Each task is grounded in a realistic, executable environment equipped with injected spatio-temporal triggers that can abruptly invalidate an ongoing plan, forcing the model to detect the state shift and construct a revised execution strategy. Extensive evaluation of frontier LLMs reveals that even the SOTA proprietary models, including Claude-4.6-Opus, achieves less than 40\% overall accuracies, highlighting the fundamental difficulty of spatio-temporal dynamic reasoning. Systematic analysis of failure trajectories uncovers three recurring error modes of existing models: Stale-State Execution, Misdiagnosis of Dynamic Triggers, and Missing Post-Adaptation Verification. Guided by these findings, we propose an iterative trajectory refinement technique that eliminates these failure patterns from training data, and combine it with online RL to produce STT-Agent-4B which outperforms frontier LLMs on STT-Arena.

2605.18547 2026-05-19 cs.AI 版本更新

VISAFF: Speaker-Centered Visual Affective Feature Learning for Emotion Recognition in Conversation

VISAFF: 以说话者为中心的视觉情感特征学习用于对话中的情感识别

Linan ZHU, Zihao Zhai, Xiao Han, Yuqian Fu, Xiangfan Chen, Xiangjie Kong, Guojiang Shen

发表机构 * Zhejiang University of Technology(浙江工业大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出VISAFF框架,通过以说话者为中心的视觉情感特征学习方法,解决对话中情感识别中的复杂场景问题,提升计算效率并避免大规模模型微调的高成本。

详情
AI中文摘要

对话中情感识别(ERC)对于有效的人机交互至关重要,旨在识别多轮对话中说话者的情感状态。早期基于文本的方法在处理如讽刺等复杂场景时存在困难,因为它们本质上忽略了关键的非语言信息。尽管最近的视觉-语言模型(VLMs)通过直接分析视频来解决这一问题,但它们并非专门为ERC量身定制,通常关注与情感无关的背景区域或被动听众,而非活跃说话者。此外,微调这些大模型会带来高昂的计算成本。此外,孤立的视觉信号在缺乏语言内容和语音语调的上下文时往往模糊或技术上受损。为了解决这些挑战,我们提出了VISAFF,一个以说话者为中心的视觉情感特征学习框架用于ERC。VISAFF包括两个阶段:说话者中心的情感定位和可靠性引导的情感补充。VISAFF采用无微调的方法来解锁冻结的VLMs的推理能力,高效地引导它们专注于活跃说话者的情感视觉线索,而无需沉重的训练开销。在第二阶段,我们引入了可靠性引导的情感补充机制,动态利用文本和声音模态来补偿视觉不确定性。在两个真实世界数据集上的实验表明,VISAFF在无微调设置下实现了与最先进方法相媲美的性能,显著提高了计算效率,通过消除对大规模VLMs昂贵微调的需要。源代码可在https://anonymous.4open.science/r/speaker-2365/上获得。

英文摘要

Emotion Recognition in Conversation (ERC) is essential for effective human-machine interaction, aiming to identify speakers' emotional states in multi-turn dialogues. Early text-based methods struggle with complex scenarios like sarcasm because they inherently neglect vital non-verbal information. While recent Vision-Language Models (VLMs) address this by analyzing video directly, they are not inherently tailored for ERC and often focus on emotionally irrelevant background regions or passive listeners rather than the active speaker. Furthermore, fine-tuning these large models incurs prohibitive computational costs. Additionally, isolated visual signals are frequently ambiguous or technically compromised without the context of linguistic content and vocal prosody. To address these challenges, we propose VISAFF, a speaker-centered VISual AFFective feature learning framework for ERC. VISAFF consists of two stages: Speaker-Centered Affective Grounding and Reliability-Guided Affective Complementation. VISAFF utilizes a tuning-free approach to unlock the reasoning capabilities of frozen VLMs, efficiently steering them to focus on the active speaker's emotional visual cues without heavy training overheads. In the second stage, we introduce a reliability-guided affective complementation mechanism that dynamically leverages textual and acoustic modalities to compensate for visual uncertainty. Experiments on two real-world datasets demonstrate that VISAFF achieves highly competitive performance compared to state-of-the-art methods in a tuning-free setting, significantly enhancing computational efficiency by eliminating the need for expensive fine-tuning of large VLMs. The source code is available at https://anonymous.4open.science/r/speaker-2365/.

2605.18537 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Probing for Representation Manifolds in Superposition

在叠加中探测表示流形

Alexander Modell

发表机构 * Department of Mathematics(数学系)

AI总结 本文提出Manifold Probe方法,用于发现叠加中的表示流形,通过学习可线性预测的特征空间以及编码方向,从而揭示模型行为中因果相关的流形。

Comments 19 pages, 7 figures

详情
AI中文摘要

本文介绍了一个名为Manifold Probe的监督方法,用于在叠加中发现表示流形。该方法通过学习一个概念的特征空间,该空间可以线性预测自表示,然后学习用于编码这些特征的方向。我们展示了该方法在Llama 2-7b中时间与空间的表示上,发现每个案例中都能线性表示可解释的特征集合。在时间案例中,我们展示了通过沿流形引导,可以影响模型对著名歌曲、电影和书籍发布年份的完成,提供了证据表明Manifold Probe能够发现与模型行为因果相关的流形。

英文摘要

This paper introduces the Manifold Probe, a supervised method for discovering representation manifolds in superposition. The method generalizes linear regression probes by learning the space of features of a concept that can be linearly predicted from the representations, and then learning the directions used to encode them. We demonstrate the probe on representations of time and space in Llama 2-7b, finding manifolds which linearly represent an interpretable set of features in each case. In the case of time, we show that by steering along the manifold, we can influence the model's completions about the years in which famous songs, movies and books were released, providing evidence that the Manifold Probe can discover manifolds which are causally involved in model behaviour.

2605.18530 2026-05-19 cs.CL cs.AI cs.LG stat.ML 版本更新

Continuous Diffusion Scales Competitively with Discrete Diffusion for Language

连续扩散在语言领域中能与离散扩散竞争性地扩展

Zhihan Yang, Wei Guo, Shuibai Zhang, Subham Sekhar Sahoo, Yongxin Chen, Arash Vahdat, Morteza Mardani, John Thickstun

发表机构 * NVIDIA & Cornell(NVIDIA与康奈尔大学) NVIDIA & Georgia Tech(NVIDIA与佐治亚理工学院) UW-Madison(威斯康星大学麦迪逊分校) MBZUAI-IFM(梅兰德大学-IFM) Cornell(康奈尔大学)

AI总结 本文研究了连续扩散模型在语言建模中的扩展能力,通过改进Plaid模型构建RePlaid,证明连续扩散模型在计算效率和性能上可与离散模型竞争,并提供了理论支持。

详情
AI中文摘要

尽管扩散模型近期在语言建模领域受到广泛关注,但连续扩散模型在扩展性方面似乎不如离散方法。为了挑战这一观点,我们重新审视Plaid,一种基于似然的连续扩散语言模型(DLM),并构建RePlaid,通过将Plaid的架构与现代离散DLMs对齐。在统一的设定下,我们建立了第一个连续DLMs的扩展定律,表明RePlaid的计算差距仅为自回归模型的20倍,使用更少的参数优于Duo,并在过训练范围内优于MDLM。我们将RePlaid与最近的连续DLMs进行基准测试:在OpenWebText上,RePlaid实现了连续DLMs中的新状态-of-the-art PPL界值为22.1,并在生成质量上更优。这些结果表明,当通过似然训练时,连续扩散是与离散DLMs高度竞争且可扩展的替代方案。此外,我们提供了理论见解以理解基于似然训练的优势。我们展示了优化噪声调度以最小化ELBO的方差自然会得到时间上的线性交叉熵(信息损失)。这均匀地分配去噪难度,而无需任何特定时间的重参数化。此外,我们发现通过似然优化嵌入会创建结构化的几何形状并驱动最大的似然增益。

英文摘要

While diffusion has drawn considerable recent attention from the language modeling community, continuous diffusion has appeared less scalable than discrete approaches. To challenge this belief we revisit Plaid, a likelihood-based continuous diffusion language model (DLM), and construct RePlaid by aligning the architecture of Plaid with modern discrete DLMs. In this unified setting, we establish the first scaling law for continuous DLMs that rivals discrete DLMs: RePlaid exhibits a compute gap of only $20\times$ compared to autoregressive models, outperforms Duo while using fewer parameters, and outperforms MDLM in the over-trained regime. We benchmark RePlaid against recent continuous DLMs: on OpenWebText, RePlaid achieves a new state-of-the-art PPL bound of $22.1$ among continuous DLMs and superior generation quality. These results suggest that continuous diffusion, when trained via likelihood, is a highly competitive and scalable alternative to discrete DLMs. Moreover, we offer theoretical insights to understand the advantage of likelihood-based training. We show that optimizing the noise schedule to minimize the ELBO's variance naturally yields linear cross-entropy (information loss) over time. This evenly distributes denoising difficulty without any case-specific time reparameterization. In addition, we find that optimizing embeddings via likelihood creates structured geometries and drives the most significant likelihood gain.

2605.18529 2026-05-19 cs.AI 版本更新

AMR-SD: Asymmetric Meta-Reflective Self-Distillation for Token-Level Credit Assignment

AMR-SD:不对称元反射自蒸馏用于标记级信用分配

Zhenlin Wei, Pu Jian, Yingzhuo Deng, Xiaohan Wang, Jiajun Chai, Zhexin Hu, Wei Lin, Shanbin Zhang, Guojun Yin

发表机构 * Meituan Beijing, China(美团北京,中国) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出AMR-SD,一种不对称元反射自蒸馏方法,旨在解决大型语言模型在复杂推理中因统一序列奖励导致的信用分配瓶颈问题,通过引入反思瓶颈和因果信息增益机制,实现更精确的标记级优势调节。

详情
AI中文摘要

大型语言模型(LLM)在复杂推理中的对齐严重依赖于可验证奖励强化学习(RLVR)。然而,标准算法如GRPO将序列级奖励均匀应用于所有标记,造成严重的信用分配瓶颈。尽管在线自我蒸馏试图通过将自我教师条件于特权上下文来解决这一问题,但直接暴露于原始 oracle 解决方案往往会诱导过条件的教师分布、隐含答案泄漏和晚期训练崩溃。为了克服这些限制,我们提出了不对称元反射自蒸馏(AMR-SD)。与直接条件于原始参考轨迹不同,AMR-SD插入了一个反思瓶颈:它将来自验证者结果、同伴回放或参考反馈的诊断信号压缩成简洁的自我生成苏格拉底提示和批评。此外,我们引入了因果信息增益(CIG),其具有不对称的ReLU门控阈值,用于将这些反思转换为稀疏、高精度的标记级优势调节。结合时间退火,这种机制在保持基础环境奖励的同时过滤掉分布噪声。在科学、数学和工具使用基准上的实验表明,AMR-SD显著优于现有基线,实现了稳健的长距离稳定性,并成功防止了晚期崩溃。

英文摘要

The alignment of Large Language Models (LLMs) for complex reasoning heavily relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, standard algorithms like GRPO apply sequence-level rewards uniformly to all tokens, creating a severe credit-assignment bottleneck. While on-policy self-distillation attempts to resolve this by conditioning a self-teacher on privileged contexts, direct exposure to raw oracle solutions often induces over-conditioned teacher distributions, implicit answer leakage, and late-stage training collapse. To overcome these limitations, we propose Asymmetric Meta-Reflective Self-Distillation (AMR-SD). Instead of conditioning directly on raw reference traces, AMR-SD inserts a reflection bottleneck: it compresses diagnostic signals -- from verifier outcomes, peer rollouts, or reference feedback -- into concise, self-generated Socratic hints and critiques. Furthermore, we introduce Causal Information Gain (CIG) with an asymmetric, ReLU-gated threshold to translate these reflections into sparse, highly precise token-level advantage modulations. Combined with temporal annealing, this mechanism preserves the base environmental reward while filtering out distributional noise. Experiments across scientific, mathematical, and tool-use benchmarks demonstrate that AMR-SD significantly outperforms existing baselines, achieving robust long-horizon stability and successfully preventing late-stage collapse.

2605.18522 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Beyond Morphology: Quantifying the Diagnostic Power of Color Features in Cancer Classification

超越形态学:量化颜色特征在癌症分类中的诊断能力

Farnaz Kheiri, Shahryar Rahnamayan, Masoud Makrehchi

发表机构 * Dept. of Electrical, Computer and Software Engineering(电气、计算机与软件工程系) Ontario Tech University(安大略技术大学) Dept. of Engineering(工程系) Brock University(布鲁克大学)

AI总结 本文研究了颜色特征在癌症分类中的诊断能力,通过排除形态学信息,评估了全局颜色特征的判别力,发现颜色特征在二分类任务中可达到高达89%的准确率,表明颜色分布包含非随机的诊断信号。

详情
AI中文摘要

在组织病理学中,人类专家主要依靠颜色增强对比度来解读组织形态,而机器视觉模型则将颜色视为原始统计信息。这一区别提出了一个根本性问题:像素强度本身,独立于结构和形态学线索,能支持多少癌症分类?为了解决这个问题,我们系统评估了全局颜色特征的独立判别力,同时刻意排除所有形态学信息。具体而言,我们提取了统计颜色矩,并对RGB和HSV颜色直方图进行离散化处理,然后在十个不同的实验设置中使用经典机器学习分类器评估其性能。我们的结果表明,在二元诊断任务(例如良性与恶性)中,仅颜色特征即可实现强劲的性能,分类准确率可达到89%。这种性能很可能归因于与恶性相关的全局色度变化。重要的是,这些简单的颜色基表示在很大程度上优于随机基线,表明原始颜色分布编码了非随机且具有诊断意义的信号用于癌症检测。因此,本研究表明,简单的、计算高效的色彩特征可以作为一种有效的预筛选工具。通过识别具有强色度指示恶性特征的样本,这些轻量模型可以作为第一道筛选系统,减少对复杂深度学习架构的计算负担。

英文摘要

In histopathology, human experts primarily rely on color as a means of enhancing contrast to interpret tissue morphology, whereas machine vision models process color as raw statistical information. This distinction raises a fundamental question: to what extent can pixel intensity alone, independent of structural and morphological cues, support cancer classification? To address this question, we systematically evaluated the standalone discriminative power of global color features while deliberately excluding all morphological information. Specifically, we extracted statistical color moments and discretized RGB and HSV color histograms, and assessed their performance across ten diverse experimental settings using classical machine learning classifiers. Our results demonstrate that color features alone can achieve strong performance in binary diagnostic tasks (e.g., benign versus malignant), with classification accuracies reaching up to 89%. This performance is likely attributable to global chromatic shifts associated with malignancy. Importantly, these simple color-based representations consistently outperformed random baselines by a substantial margin, indicating that raw color distributions encode a non-random and diagnostically relevant signal for cancer detection. Consequently, this study suggests that simple, computationally efficient color features can serve as an effective pre-screening tool. By identifying samples with strong chromatic indicators of malignancy, these lightweight models could function as a first-pass triage system, reducing the computational burden on complex deep learning architectures.

2605.18511 2026-05-19 cs.AI cond-mat.mtrl-sci eess.SP 版本更新

A Practical Noise2Noise Denoising Pipeline for High-Throughput Raman Spectroscopy

一种适用于高通量拉曼光谱的实用噪声2噪声去噪流程

David Martin-Calle, Cesar Alvarez Llamas, Vincent Motto- Ros, Christophe Dujardin, Jérémie Margueritat, David Rodney

发表机构 * CNRS(法国国家科学研究中心) Institut Universitaire de France (IUF)(法国大学研究院)

AI总结 本文提出了一种轻量级且可复现的高通量拉曼光谱去噪流程,采用一维卷积自编码器和噪声2噪声策略进行训练,无需外部光谱库或高信噪比参考光谱。通过重复短曝光采集的简化训练集,模型能够有效抑制随机噪声并重建拉曼光谱。在异质矿物样本上评估结果表明,5ms/谱的积分时间虽通常不足以可靠解释,但能产生高保真度的去噪光谱并保持化学相干性地图。该工作在光谱质量和获取速度之间提供了实用的权衡,使快速适应的拉曼流程适用于常规实验室使用,并为其他一维光谱模式提供了可转移的框架。

详情
AI中文摘要

本文提出了一种轻量级且可复现的高通量拉曼光谱去噪流程。该方法基于一维卷积自编码器,采用噪声2噪声策略进行训练,无需外部光谱库或高信噪比参考光谱。从由重复短曝光采集构成的简化训练子集中,模型学习重建拉曼光谱并高效抑制随机噪声。在异质矿物样本上,该方法使用定量光谱保真度指标(RMSE、SNR、SSIM)和基于无监督K-均值分类的任务导向标准进行评估。结果表明,5ms/谱的积分时间,通常不足以可靠解释,但能产生高保真度的去噪光谱,同时保持化学相干性地图。本工作在光谱质量和获取速度之间提供了实用的权衡,使快速、适应性强的拉曼流程能够与常规实验室使用兼容。此外,该工作还为其他一维光谱模式提供了可转移的框架。

英文摘要

A lightweight and reproducible denoising pipeline for high-throughput Raman spectroscopy is presented. The approach relies on a one-dimensional convolutional autoencoder trained using a Noise2Noise strategy, requiring neither external spectral libraries nor high signal-to-noise reference spectra for training. From a reduced training subset composed of repeated short-exposure acquisitions, the model learns to reconstruct Raman spectra while efficiently suppressing stochastic noise. The method is evaluated on a heterogeneous mineral sample, using both quantitative spectral fidelity metrics (RMSE, SNR, SSIM) and task-oriented criteria based on unsupervised K-means classification. Results demonstrate that integration times as short as 5 ms per spectrum, which are typically insufficient for reliable interpretation, yield denoised spectra with high fidelity to the reference data while preserving chemically coherent maps. This work provides a practical trade-off between spectral quality and acquisition speed, enabling fast, adaptable Raman workflows compatible with routine laboratory use. It also offers a transferable framework for other one-dimensional spectroscopic modalities.

2605.18508 2026-05-19 cs.LG cs.AI 版本更新

DiPRL: Learning Discrete Programmatic Policies via Architecture Entropy Regularization

DiPRL: 通过架构熵正则化学习离散程序性策略

Chengpeng Hu, Yingqian Zhang, Hendrik Baier

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) Centrum Wiskunde & Informatica(数学与信息学研究中心)

AI总结 本文提出DiPRL,一种通过架构熵正则化学习可解释程序性策略的方法,以避免事后细化阶段,提高策略表达性和任务性能。

详情
AI中文摘要

程序性强化学习(PRL)通过将策略表示为可读可编辑的程序,为深度强化学习提供了一种可解释的替代方案。尽管基于梯度的方法已被开发用于优化程序的连续松弛,但在将连续松弛转换回离散程序时会显著降低性能。事后离散化会丢弃优化的分支和参数,导致策略表达性崩溃和任务性能下降,从而需要额外的微调。为克服这些限制,我们提出了可微离散程序性强化学习(DiPRL),一种在训练过程中使程序接近离散的方法,避免了单独的事后微调阶段。我们首先分析了基于梯度方法事后离散化引入的性能下降固有风险。然后,我们引入了程序架构熵正则化,这使得训练过程平滑且可微,鼓励收敛到离散程序。DiPRL在保持基于梯度优化效率的同时,减轻了事后离散化的风险。在多个离散和连续RL任务中的实验表明,DiPRL可以通过可解释的程序性策略实现强大的性能。

英文摘要

Programmatic reinforcement learning (PRL) offers an interpretable alternative to deep reinforcement learning by representing policies as human-readable and -editable programs. While gradient-based methods have been developed to optimize continuous relaxations of programs, they face a significant performance drop when converting the continuous relaxations back into discrete programs. Post-hoc discretization can discard optimized branches and parameters in a program, which results in a collapse of policy expressivity and lowered task performance, leading in turn to a need for additional fine-tuning. To overcome these limitations, we propose Differentiable Discrete Programmatic Reinforcement Learning (DiPRL), a method that learns programmatic policies that become nearly discrete during training, avoiding a separate post-hoc fine-tuning stage. We first analyze the inherent risks of performance drop introduced by post-hoc discretization of gradient-based methods. Then, we introduce programmatic architecture entropy regularization, which enables smooth, differentiable training that encourages convergence toward a discrete program. DiPRL maintains the efficiency of gradient-based optimization while mitigating the risks of post-hoc discretization. Our experiments across multiple discrete and continuous RL tasks demonstrate that DiPRL can achieve strong performance via interpretable programmatic policies.

2605.18498 2026-05-19 cs.LG cs.AI 版本更新

DBES: A Systematic Benchmark and Metric Suite for Evaluating Expert Specialization in Large-Scale MoEs

DBES: 一种用于评估大规模MoE模型专家专业化程度的系统性基准和度量套件

Jing Wang, Hongxuan Lu, Jazze Young, Shu Wang, Zhimin Xin

发表机构 * Jing Wang(王静) Hongxuan Lu(卢洪轩) Jazze Young(杨杰兹) Shu Wang(王舒) Zhimin Xin(辛志敏)

AI总结 本文提出DBES系统性基准和度量套件,通过多领域基准和五个理论基础的度量指标,评估MoE模型中的专家专业化程度,并验证这些度量指标在领域特定后训练中的可操作性,实现了显著的性能提升。

详情
AI中文摘要

MoE模型中的专家专业化仍缺乏深入理解,传统评估将架构负载均衡与功能专业化混淆。我们引入DBES,一种综合的诊断框架,结合多领域基准和五个理论基础的度量指标:路由专业化、归一化有效秩、领域隔离、路由刚度分数和n-gram专家度量。关键发现显示不同模型展现出不同的专业化范式:Qwen系列表现出模块化专业化,具有高领域隔离,而DeepSeek和GLM采用分布式协作。然而,我们强调专业化是诊断维度,必要但不充分用于下游性能。最重要的是,干预证据验证了这些度量指标的可操作性:通过使用DBES在领域特定后训练中识别高专业化专家路径,我们仅使用15%的原始训练资源,在专业化领域实现了66%至94.48%的性能提升,证明这些诊断工具可以转化为具体的优化算子。本文提供了首个系统性的方法,用于独立于准确度指标评估专家专业化,为下一代MoE系统的设计和后训练优化提供了关键见解。

英文摘要

Expert specialization in Mixture-of-Experts (MoE) models remains poorly understood, with traditional evaluations conflating architectural load-balancing with functional specialization. We introduce DBES, a comprehensive diagnostic framework combining a multi-domain benchmark with five theoretically grounded metrics: Routing Specialization, Normalized Effective Rank, Domain Isolation, Routing Stiffness Score, and N-gram Expertise measures. Critical findings demonstrate distinct specialization paradigms across models: Qwen-series exhibit modular specialization with high domain isolation, while DeepSeek and GLM employ distributed collaboration. However, we emphasize that specialization is a diagnostic dimension, necessary but not sufficient for downstream performance. Most crucially, interventional evidence validates the actionability of these metrics: by using DBES to identify high-specialization expert paths during domain-specific post-training, we achieved 66% to 94.48% improvement in specialized domains with only 15% of original training resources, demonstrating that these diagnostic tools can be converted into concrete optimization operators. This work provides the first systematic methodology for evaluating expert specialization independently of accuracy metrics, offering crucial insights for the design and post-training optimization of next-generation MoE systems.

2605.18483 2026-05-19 cs.LG cs.AI 版本更新

Modality vs. Morphology: A Framework for Time Series Classification for Biological Signals

模态与形态:生物信号时间序列分类的框架

Jordan Tschida, Matthew Yohe, Edward Kane, Gavin Jager, Emma J. Reid, Tony G. Allen, Mark Story, Leanne Thompson, Joe Hoskins, Brandon Schreiber, Stan Seiferth, Scott Dolvin, David Cornett

发表机构 * UT-Battelle, LLC(UT-巴特勒公司) Oak Ridge National Laboratory(橡树岭国家实验室)

AI总结 本文提出了一种统一的形态-模态框架,通过分析生物信号的形态结构,揭示了如何影响模型设计和性能,强调形态对预处理和建模策略的重要性,并指出未来的工作方向包括形态数据增强和评估指标改进。

详情
AI中文摘要

生物信号时间序列分类(TSC)已从手工制作的模态特定方法发展为能够表示底层生理过程多样波形结构的深度架构(即形态)。本文综述介绍了一种统一的形态-模态框架,将波形结构与方法论设计连接起来,揭示了尖峰、爆发、振荡、慢漂移和层次节奏如何影响模型设计。通过分析脑电图、肌电图、心电图、脉搏波描记图以及眼动模态(电眼图、瞳孔测量、眼动追踪),本文展示了形态如何决定预处理和建模策略。整合这些生物信号的证据,该框架揭示形态而非模型类别最强烈地决定了性能和可解释性。这提供了深度模型在诱导偏见与底层波形动态一致时为何成功的原因。本文还识别了未来的工作,包括形态数据增强和评估指标改进以提高泛化能力。这些见解将形态意识建模定位为开发跨生物信号通用、可解释和生理意义的TSC模型的统一原则。

英文摘要

Time series classification (TSC) of biological signals has progressed from handcrafted, modality-specific approaches to deep architectures capable of representing the diverse waveform structures of underlying physiological processes (i.e., morphology). This review introduces a unified morphology--modality framework that connects waveform structure to a methodological design, revealing how spikes, bursts, oscillations, slow drift, and hierarchical rhythms inform model design. By analyzing electroencephalography, electromyography, electrocardiography, photoplethysmography, and ocular modalities (electrooculography, pupillometry, eye-tracking), the review demonstrates how morphology determines preprocessing and modeling strategies. Integrating evidence across these biological signals, the framework reveals that morphology, not model class, most strongly determines performance and interpretability. This provides insight into why deep models succeed when their inductive biases align with underlying waveform dynamics. This review also identifies future work including morphological data augmentation and evaluation metrics to improve generalization. Together, these insights position morphology-aware modeling as a unifying principle for developing generalizable, interpretable, and physiologically meaningful TSC models across biological signals.

2605.18481 2026-05-19 cs.AI 版本更新

OCCAM: Open-set Causal Concept explAnation and Ontology induction for black-box vision Models

OCCAM: 开集因果概念解释与本体诱导用于黑盒视觉模型

Chiara Maria Russo, Simone Carnemolla, Simone Palazzo, Daniela Giordano, Concetto Spampinato, Matteo Pennisi

发表机构 * University of Catania(卡塔尼亚大学)

AI总结 OCCAM通过开放集因果概念解释和本体诱导方法,提高黑盒视觉模型的可解释性,揭示概念间的因果关系和模型偏见。

详情
AI中文摘要

解释深度图像分类器的决策仍然具有挑战性,尤其是在黑盒设置中,模型内部不可访问。我们介绍了OCCAM,一种用于视觉模型开放集因果概念解释和本体诱导的框架。OCCAM以开放集的方式发现视觉概念,通过文本引导的分割进行局部化,并通过移除概念来测量类别置信度的变化,以估计每个概念的因果贡献。除了局部解释外,OCCAM跨数据集聚合干预证据,诱导出一个结构化的概念本体,该本体捕捉了分类器如何全局组织视觉概念。在本体上进行推理可以揭示概念之间的一致依赖关系,暴露潜在的因果关系,并揭示系统性的模型偏见。在Broden和ImageNet-S上多个分类器的实验表明,OCCAM在开放集黑盒设置中提高了解释质量,同时提供了比单图像归因方法更丰富的全局见解。

英文摘要

Interpreting the decisions of deep image classifiers remains challenging, particularly in black-box settings where model internals are inaccessible. We introduce OCCAM, a framework for open-set causal concept explanation and ontology induction in vision models. OCCAM discovers visual concepts in an open-set manner, localizes them via text-guided segmentation, and performs object-level interventions by removing concepts to measure changes in class confidence, estimating each concept's causal contribution. Beyond local explanations, OCCAM aggregates interventional evidence across a dataset to induce a structured concept ontology that captures how classifiers globally organize visual concepts. Reasoning over this ontology reveals consistent dependencies between concepts, exposes latent causal relations, and uncovers systematic model biases. Experiments on Broden and ImageNet-S across multiple classifiers show that OCCAM improves explanation quality in open-set black-box settings while providing richer global insight than per-image attribution methods.

2605.18476 2026-05-19 stat.CO cs.AI cs.LG 版本更新

AI4BayesCode: From Natural Language Descriptions to Validated Modular Stateful Bayesian Samplers

AI4BayesCode: 从自然语言描述到经过验证的模块化状态性贝叶斯采样器

Jungang Zou, Alex Ziyu Jiang, Qixuan Chen

发表机构 * Department of Biostatistics, Columbia University(哥伦比亚大学生物统计学系)

AI总结 该研究提出AI4BayesCode系统,通过自然语言描述生成可运行且验证过的MCMC采样器,采用模块化设计和递归状态性编码范式,提升了贝叶斯模型的可靠性和扩展性。

详情
AI中文摘要

编码和计算仍然是马尔可夫链蒙特卡洛(MCMC)工作流程中的主要瓶颈,尤其是在现代采样算法日益复杂的情况下,现有的概率编程系统在模型支持、扩展性和可组合性方面仍然有限。我们介绍了AI4BayesCode,这是一个可扩展的LLM驱动系统,能够将自然语言的贝叶斯模型描述转换为可运行且经过验证的MCMC采样器。为了提高可靠性,AI4BayesCode采用模块化设计,将模型分解为模块化采样块,并将每个块映射到内置的采样组件,从而减少从头实现复杂采样算法的需要。通过预生成模型规范的验证和后生成采样器代码的验证进一步提高了可靠性。AI4BayesCode还引入了一种新的递归状态性编码范式,使模块化采样组件(可能由不同贡献者开发)能够在更大的MCMC过程中协同一致地组成。我们开发了一个基准测试套件来评估AI4BayesCode的采样器生成能力。实验表明,AI4BayesCode能够仅通过自然语言描述实现广泛的贝叶斯模型。作为一项开放系统,其能力可以随着底层AI代理的改进和新增内置块的添加而继续扩展。

英文摘要

Coding and computation remain major bottlenecks in Markov chain Monte Carlo (MCMC) workflows, especially as modern sampling algorithms have become increasingly complex and existing probabilistic programming systems remain limited in model support, extensibility, and composability. We introduce \textbf{AI4BayesCode}, an extensible LLM-driven system that translates natural-language Bayesian model descriptions into runnable, validated MCMC samplers. To improve reliability, AI4BayesCode adopts a modular design that decomposes models into modular sampling blocks and maps each block to a built-in sampling component, reducing the need to implement complex sampling algorithms from scratch. Reliability is further improved through pre-generation validation of model specifications and post-generation validation of generated sampler code. AI4BayesCode also introduces a novel recursively stateful coding paradigm for MCMC, allowing modular sampling components, potentially developed by different contributors, to be composed coherently within larger MCMC procedures. We develop a benchmark suite to evaluate AI4BayesCode for sampler-generation. Experiments show that AI4BayesCode can implement a wide range of Bayesian models from natural-language descriptions alone. As an open-ended system, its capability can continue to expand with improvements in the underlying AI agent and the addition of new built-in blocks.

2605.18475 2026-05-19 cs.LG cs.AI 版本更新

GAMMA: Global Bit Allocation for Mixed-Precision Models under Arbitrary Budgets

GAMMA:在任意预算下为混合精度模型进行全局位分配

Zhangyang Yao, Haiyan Zhao, Haoyu Wang, Tianbo Huang, Lihua Zhang, Xu Han

发表机构 * Beihang University(北航) Tsinghua University(清华) ByteDance Inc(字节跳动)

AI总结 本文提出GAMMA框架,通过后训练流水线学习模块级精度偏好,优化教师强制隐藏状态重建目标并利用整数规划实现精确预算分配,从而在任意预算下提升大语言模型的精度,优于固定精度基线和搜索基混合精度方法。

详情
AI中文摘要

混合精度量化通过将更多位分配给敏感模块,提高了大语言模型(LLMs)的预算-精度权衡。然而,在LLM规模上自动化这种分配面临独特约束:可学习方法需要量化感知训练,这在十亿参数模型中不可行;训练自由替代方案依赖静态代理指标,无法捕捉跨模块交互,并且必须为每个目标预算重新计算;搜索方法成本高且无法保证精确预算符合。我们提出GAMMA,一种量化器无关的框架,完全在后训练流水线内学习模块级精度偏好。GAMMA在增强拉格朗日约束下优化教师强制隐藏状态重建目标,并通过整数规划将学习的偏好投影到精确预算可行的离散分配中。关键性质是分数重用:因为学习的偏好编码了一个稳定的敏感性排名而非预算特定权重,单次训练运行可服务于任意部署目标,仅需重新求解整数规划,将每预算适应时间从小时减少到几分钟。在Llama和Qwen模型(8B-32B)上,GAMMA优于固定精度基线(最高+12.99 Avg.)和搜索基混合精度方法(最高+7.00 Avg.),并在2.5位平均精度下可匹配固定3位质量,从而在大幅减小内存占用的情况下实现部署。

英文摘要

Mixed-precision quantization improves the budget--accuracy trade-off for large language models (LLMs) by allocating more bits to sensitive modules. However, automating this allocation at LLM scale faces a unique combination of constraints: learnable approaches require quantization-aware training, which is infeasible for billion-parameter models; training-free alternatives rely on static proxy metrics that miss cross-module interactions and must be recomputed per target budget; and search-based methods are expensive without guaranteeing exact budget compliance. We propose GAMMA, a quantizer-agnostic framework that learns module-wise precision preferences entirely within a post-training pipeline. GAMMA optimizes a teacher-forced hidden-state reconstruction objective under an augmented Lagrangian constraint, and projects the learned preferences into exact budget-feasible discrete assignments via integer programming. A key property is score reuse: because the learned preferences encode a stable sensitivity ranking rather than budget-specific weights, a single training run serves arbitrary deployment targets by re-solving only the integer program, reducing per-budget adaptation from hours to a few minutes. Across Llama and Qwen models (8B--32B), GAMMA outperforms both fixed-precision baselines (up to +12.99 Avg.) and search-based mixed-precision methods (up to +7.00 Avg.), and can match fixed 3-bit quality at 2.5-bit average precision, enabling deployment at substantially smaller memory footprints.

2605.18472 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Flowing with Confidence

流中自信

Friso de Kruiff, Dario Coscia, Max Welling, Erik Bekkers

发表机构 * CuspAI AMLab, University of Amsterdam(阿姆斯特丹大学AMLab) mathLab, SISSA(SISSA数学实验室)

AI总结 本文提出了一种名为流匹配与自信(FMwC)的方法,通过在选定层注入输入依赖的乘法噪声,传播其方差并通过网络闭式形式传播,从而在标准采样成本下获得每个样本的置信度评分,用于改进图像质量和晶体热力学稳定性、轨迹编辑和自适应步长等应用。

详情
AI中文摘要

生成模型可以产生不合逻辑的文本、不现实的图像和不稳定的材料,其生成速度比模拟或人类审查更快;没有每个样本的置信度,信任会逐渐丧失。现有解决方案运行k个集成或随机轨迹,消耗k倍的计算资源,测量模型之间的变异性,而不是模型的置信度。我们提出流匹配与自信(FMwC)。FMwC在选定的层注入输入依赖的乘法噪声,通过网络闭式形式传播其方差,并沿ODE轨迹整合,从而在标准采样成本下获得每个样本的置信度评分。该评分支持多种用途:过滤可以提高图像质量和晶体的热力学稳定性;编辑可以将轨迹回退到模型承诺的点并重新定向;自适应步长将ODE计算集中在流不明确的地方。我们发现置信度评分与学习速度场的发散量的大小相关,这为我们提供了一个窗口来理解生成过程,开启了针对关键时刻的手术形式指导,新的采样算法和生成模型的可解释性。

英文摘要

Generative models can produce nonsensical text, unrealistic images, and unstable materials faster than simulation or human review can absorb; without per-sample confidence, trust erodes. Existing fixes run $k$ ensembles or stochastic trajectories at $k\times$ compute, measuring variability between models, not model confidence. We propose Flow Matching with Confidence (FMwC). FMwC injects input-dependent multiplicative noise at selected layers, propagates its variance through the network in closed form, and integrates it along the ODE trajectory, yielding a per-sample confidence score at standard sampling cost. The score supports multiple uses: filtering improves image quality and thermodynamic stability of crystals; editing rewinds trajectories to the points where the model commits and redirects them; and adaptive stepping concentrates ODE compute where the flow is ambiguous. We find that the confidence score correlates with the magnitude of the divergence of the learned velocity field, which gives us a window to understand the generative process, opening up surgical forms of guidance that target the moments that matter, new sampling algorithms and interpretability of generative models.

2605.18460 2026-05-19 cs.AI cs.LG cs.NE 版本更新

When Fireflies Cluster; Enhancing Automatic Clustering via Centroid-Guided Firefly Optimization

当萤火虫聚类;通过重心引导萤火虫优化增强自动聚类

MKA Ariyaratne, Azwirman Gusrialdi, Yury Nikulin, Jaakko Peltonen

发表机构 * Department of Computer Science, Faculty of Applied Sciences, University of Sri Jayewardenepura(Sri Lanka 瑞籍耶文纳普拉大学计算机科学系,应用科学学院) Faculty of Engineering and Natural Sciences,Tampere University(蒂帕雷大学工程与自然科学学院) Department of Mathematics and Statistics, University of Turku(图尔库大学数学与统计学系)

AI总结 本文提出了一种改进的萤火虫算法用于数据聚类,解决了传统方法如K均值在处理非均匀聚类形状、密度以及需要预先定义聚类数的局限性。该算法引入了重心移动策略和多目标适应度函数,平衡了紧凑性、分离性和新的TSP基于的导航惩罚。它能够自动估计最佳聚类数并动态调整聚类边界。在机器人传感器网络中的应用展示了其实际价值,实验表明其聚类质量优于K均值,且减少集群内路径距离。这些结果证实了该算法在复杂空间聚类任务中的鲁棒性,未来可能扩展到更高维和适应性场景。

Comments 34 pages, 19 Figures

详情
AI中文摘要

本文提出了一种新的萤火虫算法变体用于数据聚类,以解决传统方法如K均值在处理非均匀聚类形状、密度以及需要预先定义聚类数的局限性。所提出的算法引入了重心移动策略和多目标适应度函数,该函数平衡了紧凑性、分离性和一个新的基于TSP的导航惩罚。该算法能够自动估计最佳聚类数并动态调整聚类边界。在机器人传感器网络中的应用展示了其实际价值,实验表明其聚类质量优于K均值,且减少集群内路径距离。这些结果证实了该算法在复杂空间聚类任务中的鲁棒性,具有未来扩展到更高维和适应性场景的潜力。

英文摘要

This work presents a novel variant of the Firefly Algorithm (FA) for data clustering, addressing limitations of traditional methods like K-Means that struggle with non-uniform cluster shapes, densities, and the need for pre-defining the number of clusters. The proposed algorithm introduces a centroid movement strategy and a multi-objective fitness function that balances compactness, separation, and a novel TSP-based navigation penalty. It automatically estimates the optimal number of clusters and dynamically adjusts cluster boundaries. Application to robotic sensor networks highlights its practical value, with experiments showing improved clustering quality and reduced intra-cluster path distances compared to K-Means. These results confirm the algorithm's robustness in complex spatial clustering tasks, with potential for future extensions to higher-dimensional and adaptive scenarios.

2605.18454 2026-05-19 cs.LG cs.AI cs.SC 版本更新

Scheduling That Speaks: An Interpretable Programmatic Reinforcement Learning Framework

能说话的调度:一种可解释的程序化强化学习框架

Chengpeng Hu, Yingqian Zhang, Hendrik Baier

发表机构 * Eindhoven University of Technology, Eindhoven, the Netherlands Centrum Wiskunde \& Informatica, Amsterdam, the Netherlands

AI总结 本文提出了一种可解释的程序化强化学习框架ProRL,通过人类可读且可编辑的程序化策略实现高效调度,解决了传统深度强化学习在透明性和计算效率方面的不足。

详情
AI中文摘要

深度强化学习(DRL)最近涌现出作为求解组合优化问题(如作业车间调度)的有希望的方法。然而,DRL学习的策略通常由深度神经网络(DNNs)表示,其不透明的神经架构和不可解释的策略决策可能引起人类决策者的关键信任和可用性问题。此外,DNNs的计算需求还会进一步阻碍在资源受限环境中实际部署。在本工作中,我们提出ProRL,一种新颖的可解释程序化强化学习框架,能够通过人类可读且可编辑的程序化策略实现高性能调度(即程序)。我们首先介绍了一种用于调度的领域特定语言(DSL-S)来表示调度策略为结构化程序。ProRL然后通过局部搜索探索由DSL-S定义的程序空间,以识别不完整的程序,这些程序随后通过贝叶斯优化学习其参数。ProRL学习选择哪种调度启发式规则,因此它自然地整合了已在工业场景中使用的现有启发式方法。在广泛使用的基准实例上的实验表明,ProRL在现有启发式方法和DRL基线方面表现出色。此外,ProRL在强约束计算资源下表现良好,例如仅使用100个episode进行训练。我们的代码可在https://github.com/HcPlu/ProRL上获得。

英文摘要

Deep reinforcement learning (DRL) has recently emerged as a promising approach to solve combinatorial optimization problems such as job shop scheduling. However, the policies learned by DRL are typically represented by deep neural networks (DNNs), whose opaque neural architectures and non-interpretable policy decisions can lead to critical trust and usability concerns for human decision makers. In addition, the computational requirements of DNNs can further hinder practical deployment in resource constrained environments. In this work, we propose ProRL, a novel interpretable programmatic reinforcement learning framework that achieves high-performance scheduling with human-readable and editable programmatic policies (i.e., programs). We first introduce a domain-specific language for scheduling (DSL-S) to represent scheduling strategies as structured programs. ProRL then explores the program space defined by DSL-S using local search to identify incomplete programs, which are subsequently completed by learning their parameters via Bayesian optimization. ProRL learns which scheduling heuristic rules to select, and hence, it naturally incorporates existing heuristics already used in industrial scenarios. Experiments on widely used benchmark instances demonstrate the strong performance of ProRL against existing heuristics and DRL baselines. Furthermore, ProRL performs well under strongly constrained computational resources, such as training with only 100 episodes. Our code is available at https://github.com/HcPlu/ProRL.

2605.18449 2026-05-19 cs.LG cs.AI 版本更新

Modelling Customer Trajectories with Reinforcement Learning for Practical Retail Insights

用强化学习建模客户轨迹以获得实际零售洞察

Ken Ming Lee, Paul Barde, Maxime C. Cohen, Derek Nowrouzezahrai

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究所)

AI总结 本文提出了一种基于智能体的建模框架,将客户轨迹预测转化为最大熵强化学习问题,以更准确地反映具有有限理性的客户行为,从而提供更精确的冲动购买率和货架交通密度估计。

Comments Proceeding of the 25th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2026)

详情
AI中文摘要

理解零售空间内客户移动对于优化商店布局至关重要。现实世界轨迹数据可以提供高度准确的洞察,但收集起来成本高昂且对许多零售商来说难以实现。启发式方法如旅行商问题(TSP)和概率最近邻(PNN)常被用作廉价的近似方法,但实际客户轨迹与最短路径的偏差平均为28%,突显了准确性和实用性之间的权衡。我们提出了一种基于智能体的建模框架,将客户轨迹预测视为最大熵强化学习(RL)问题,通过平衡奖励最大化与随机性来更好地反映具有有限理性的客户。使用现实世界便利商店的轨迹数据,我们证明RL生成的轨迹比TSP和PNN更接近客户行为,提供了更准确的冲动购买率和货架交通密度估计。此外,只有基于RL的预测能够为冲动产品提供与实际轨迹数据一致的重新定位决策,从而产生可比的估计利润增长。我们的工作表明,RL提供了一种实用且基于行为的替代方法,弥合了过于简化的启发式方法和数据密集型方法之间的差距,使准确的布局优化更具可及性。为了鼓励进一步研究,源代码可在GitHub上获得。

英文摘要

Understanding customer movement within retail spaces is essential for optimizing store layouts. Real-world trajectory data can provide highly accurate insights, but collecting it is costly and often infeasible for many retailers. Heuristics such as Travelling Salesman Problem (TSP) and Probabilistic Nearest Neighbours (PNN) are commonly used as inexpensive approximations, but actual customer trajectories deviate by an average of 28% from shortest paths, highlighting a tradeoff between accuracy and practicality. We propose an agent-based modelling framework that casts customer trajectory prediction as a maximum entropy reinforcement learning (RL) problem, balancing reward maximization with stochasticity to better reflect customers with bounded rationality. Using real-world trajectory data from a convenience store, we show that RL-generated trajectories align more closely with customer behaviour than TSP and PNN, providing more accurate estimates of impulse purchase rates and shelf traffic densities. Furthermore, only RL-based predictions yield repositioning decisions for impulse products that align with those derived from actual trajectory data, resulting in comparable estimated profit gains. Our work demonstrates that RL provides a practical, behaviourally grounded alternative that bridges the gap between oversimplified heuristics and data-intensive approaches, making accurate layout optimization more accessible. To encourage further research, the source code is available on GitHub.

2605.18444 2026-05-19 cs.AR cs.AI 版本更新

Building Reliable Arithmetic Multipliers Under NBTI Aging and Process Variations

在NBTI老化和工艺波动下构建可靠的算术乘法器

Masoud Heidary, Biresh Kumar Joardar

发表机构 * Department of ECE, University of Houston(电子工程系,休斯顿大学)

AI总结 本文提出了一种利用乘法的sign-invariance性质来缓解算术乘法器老化问题的新技术,并将其应用于 systolic arrays 中,以提高高吞吐量AI加速器的效率。

详情
AI中文摘要

硬件老化对集成电路(ICs)构成了重大挑战,导致性能下降和最终失效。在本文中,我们关注算术乘法器的老化问题,这些乘法器是现代计算系统(包括CPU、GPU、FPGA以及如脉冲数组的AI加速器)的核心。特别是,AI工作负载主要依赖乘法运算,可以加速负偏温不稳定性(NBTI)效应。本文提出了一种新颖的老化缓解技术,利用乘法的sign-invariance性质。通过有选择地对输入应用2s补码变换,该方法将应力分布到晶体管上,从而减少NBTI老化的影响。所提出的方法还被集成到脉冲数组中,一种常见的AI加速器,以展示其在高吞吐量AI加速器中的效率。使用Cadence工具进行的实验评估显示,与自然老化(无缓解)基线相比,其寿命更好,同时引入了可忽略的面积和延迟开销。

英文摘要

Hardware aging poses a significant challenge for integrated circuits (ICs), leading to performance degradation and eventual failure. In this work, we focus on the aging of arithmetic multipliers, which are a cornerstone of modern computing systems including in CPUs, GPUs, and FPGAs, as well as AI accelerators like systolic arrays. In particular, AI workloads, which rely predominantly on multiplications, can accelerate Negative Bias Temperature Instability (NBTI) effects in multipliers. This paper presents a novel aging mitigation technique that leverages the signinvariance property of multiplication. By selectively applying 2s complement transformations to inputs, the method redistributes stress across transistors, reducing the effects of NBTI aging. The proposed method is also integrated into systolic arrays, a common AI accelerator, to demonstrate its efficiency in a high-throughput AI accelerator. Experimental evaluations using Cadence tools show better lifetime compared to natural aging (with no mitigation) baseline, while introducing negligible area and delay overheads.

2605.18419 2026-05-19 cs.CV cs.AI 版本更新

Geometry-Aware Uncertainty Coresets for Robust Visual In-Context Learning in Histopathology

面向几何的不确定性聚类用于病理学中鲁棒的视觉上下文学习

Franciskus Xaverius Erick, Johanna Paula Müller, Bernhard Kainz

发表机构 * FAU Erlangen-Nürnberg, Erlangen, DE(埃尔兰根-纽伦堡大学) Department of Computing, Imperial College London, London, UK(伦敦帝国理工学院计算机系)

AI总结 本文提出GAUC,一种无需训练的聚类选择方法,直接在预训练的多模态嵌入空间中操作,通过优化三个目标提升视觉上下文学习的鲁棒性、准确性和校准性。

详情
AI中文摘要

视觉-语言模型(VLMs)能够将视觉感知与开放性临床推理结合,使其在计算病理学中具有吸引力。然而,对稀缺的专家标注病理数据进行数十亿参数的微调是不可行的,而上下文学习(ICL)在没有参数更新的情况下将VLM条件于演示图像-文本对,但容易受到所选示例和查询措辞的影响,导致诊断不可靠。现有选择策略依赖于查询依赖的最近邻检索,忽略了全局数据结构,需要昂贵的参数更新,或忽视了VLMs的联合视觉-文本嵌入几何。我们提出GAUC,一种无需训练的聚类选择方法,直接在预训练的多模态嵌入空间中操作。GAUC联合优化三个目标:(1)最大均值差异项,强制聚类与完整数据集之间的分布一致性;(2)有效互信息差异正则化器,通过利用VLMs的联合视觉-文本对齐来限制在提示改写下的性能下降;(3)预测方差惩罚,抑制过于自信且不稳定的输出。在CRC-100K和MHIST多个开源VLM架构上,GAUC在准确率、校准性和提示鲁棒性上均优于最近的ICL选择方法和数据集蒸馏基线,且无需单次梯度更新。

英文摘要

Vision-language models (VLMs) can couple visual perception with open-ended clinical reasoning, making them attractive for computational histopathology. However, fine-tuning billions of parameters on scarce, expert-annotated pathology data is prohibitive, while in-context learning (ICL), which conditions the VLM on demonstrative image-text pairs without parameter updates, suffers from high sensitivity to which examples are selected and how the query is phrased, producing unreliable diagnostics. Existing selection strategies rely on query-dependent nearest-neighbour retrieval that ignores global data structure, require costly parameter updates, or disregard the joint vision-text embedding geometry of VLMs. We propose GAUC, a training-free coreset selection method operating directly in the pre-trained multimodal embedding space. GAUC jointly optimises three objectives: (1) a Maximum Mean Discrepancy term enforcing distributional fidelity between coreset and full dataset, (2) an Effective Mutual Information Difference regulariser bounding performance degradation under prompt paraphrases by exploiting the VLM's joint vision-text alignment, and (3) a predictive-variance penalty suppressing overconfident, unstable outputs. On CRC-100K and MHIST across multiple open-source VLM architectures, GAUC consistently improves accuracy, calibration, and prompt robustness over recent ICL selection methods and dataset-distillation baselines, all without a single gradient update.

2605.18414 2026-05-19 cs.CR cs.AI 版本更新

Prompts Don't Protect: Architectural Enforcement via MCP Proxy for LLM Tool Access Control

提示不保护:通过MCP代理实现的架构强制以实现LLM工具访问控制

Rohith Uppala

发表机构 * Independent Researcher(独立研究员)

AI总结 本文提出了一种受控的MCP代理,通过在工具发现和工具调用两个阶段实施基于属性的访问控制(ABAC),有效阻止了未经授权的工具调用,而提示基于的限制仅能减少11-18个百分点的未授权调用率,证明了架构强制在部署的智能体系统中实现可靠工具访问控制的必要性。

Comments 8 pages, 3 tables, 1 figure. Planning to submit to EMNLP 2026 Industry Track

详情
AI中文摘要

大型语言模型越来越多地作为自主代理运行,这些代理从大型注册表中选择并调用工具。我们发现了一个关键缺口:当未经授权的工具出现在代理的上下文中时,模型在对抗性场景中会选择它们,即使被明确指示不要这样做。我们提出了一种受控的MCP代理,在两个阶段实施基于属性的访问控制(ABAC):在工具发现阶段,未经授权的工具被从模型的上下文窗口中移除;在工具调用阶段,第二次检查阻止任何未经授权的调用。在三个模型(Qwen 2.5 7B,Llama 3.1 8B,Claude Haiku 3.5)和150个对抗性任务(涵盖四个攻击类别)上,我们的代理将未授权调用率(UIR)降至0%,同时添加的中位延迟低于50毫秒。基于提示的限制仅能减少UIR 11-18个百分点,留下显著的残余风险。我们的结果表明,在部署的智能体系统中,架构强制——而不是提示——是实现可靠工具访问控制的必要条件。

英文摘要

Large language models increasingly operate as autonomous agents that select and invoke tools from large registries. We identify a critical gap: when unauthorized tools are visible in an agent's context, models select them in adversarial scenarios -- even when explicitly instructed otherwise. We propose a governed MCP proxy that enforces attribute-based access control (ABAC) at two points: tool discovery, where unauthorized tools are removed from the model's context window, and tool invocation, where a second check blocks any unauthorized call. Across three models (Qwen 2.5 7B, Llama 3.1 8B, Claude Haiku 3.5) and 150 adversarial tasks spanning four attack categories, our proxy reduces unauthorized invocation rate (UIR) to 0% while adding under 50ms median latency. Prompt-based restrictions reduce UIR by only 11--18 percentage points, leaving substantial residual risk. Our results show that architectural enforcement -- not prompting -- is necessary for reliable tool access control in deployed agentic systems.

2605.18407 2026-05-19 cond-mat.mes-hall cond-mat.mtrl-sci cs.AI cs.RO 版本更新

Qumus: Realization of An Embodied AI Quantum Material Experimentalist

Qumus: 一种具身人工智能量子材料实验家的实现

Lihan Shi, Zhaoyi Joy Zheng, Xinzhe Juan, Yimin Wang, Ming Yin, Mayank Sengupta, Kristina Wolinski, Yanyu Jia, Jingzhi Shi, Derek Saucedo, Neill Saggi, Haosen Guan, Kenji Watanabe, Takashi Taniguchi, Ali Yazdani, Mengdi Wang, Sanfeng Wu

AI总结 本文提出Qumus,首个能够进行真实世界科学发现的具身人工智能量子材料实验家,通过机器人微型实验室实现了原子薄二维材料和范德瓦耳斯结构的制备与纳米加工,首次实现了AI生成石墨烯和原子薄场效应晶体管的AI制造。

Comments 29 Pages in total. Supplementary Demo Videos are available at https://qumus.ai

详情
AI中文摘要

尽管现代大语言模型(LLMs)和代理型人工智能(AI)在数字领域展现出了变革性能力,但实现能够进行真实世界科学发现的具身人工智能仍是一个具有挑战性的前沿。这些进展受到将高级推理、多模态信息处理和实时物理执行整合在一起的固有复杂性所阻碍。在这里,我们介绍了Qumus,首个AI量子材料实验家。Qumus物理上体现在一个机器人微型实验室中,是一个智能、多模态和多代理系统,旨在创建和纳米加工原子薄二维(2D)材料和堆叠范德瓦耳斯(vdW)结构。Qumus能够自主导航完整的科学循环,从假设生成和协议规划到多步骤实验执行、结果分析和报告,充当实验家的角色。值得注意的是,该系统首次实现了AI生成石墨烯,以及首次实现了复杂纳米设备(包括原子薄场效应晶体管)的AI制造,通过范德瓦耳斯堆叠。Qumus在这些任务中表现出色,通过展示自主纠错和闭环实验。我们的结果建立了一个可推广的框架,用于学习直接来自量子世界的自我改进具身人工智能系统,为量子材料、电子学等领域加速发现开辟了新路径。

英文摘要

While modern Large Language Models (LLMs) and agentic artificial intelligence (AI) have demonstrated transformative capabilities in digital domains, the realization of embodied AI capable of real-world scientific discovery remains a difficult frontier. The advancements are hindered by the inherent complexity of integrating high-level reasoning, multimodal information processing and real-time physical execution. Here we introduce Qumus, the first AI quantum materials experimentalist. Physically embodied within a robotic mini-laboratory, Qumus is an intelligent, multimodal, and multi-agent system designed for the creation and nano-processing of atomically thin two-dimensional (2D) materials and stacked van der Waals (vdW) structures. Qumus autonomously navigates the full scientific cycle, from hypothesis generation and protocol planning to multi-step experimental execution, result analysis and reporting, acting as an experimentalist. Markedly, the system has achieved, for the first time, the AI-creation of graphene, as well as the first AI-fabrication of complex nanodevices including atomically thin field-effect transistors via vdW stacking. Qumus excels at these tasks by demonstrating autonomous error correction and closed-loop experimentation. Our results establish a generalizable framework for self-improving embodied AI systems that learn directly from the quantum world, opening a pathway toward accelerated discovery in quantum materials, electronics and beyond.

2605.18395 2026-05-19 cs.CY cs.AI 版本更新

Diagnosing Korean-Language LLM Political Bias via Census-Grounded Agent Simulation

通过人口普查基础代理模拟诊断韩语LLM的政治偏见

Sungwoo Kang

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Korea University(韩国大学)

AI总结 本文通过人口普查基础代理模拟框架Dynamo-K,研究韩语LLM在六个韩国选举中的政治行为,识别出三种系统性失败模式,并提出解决方案以校准模型,从而提高政治行为诊断的准确性。

详情
AI中文摘要

大型语言模型(LLMs)在选民模拟中表现出系统性政治偏见,但其底层机制和跨语言泛化仍不清晰。我们引入Dynamo-K,一个基于人口普查的模拟框架,评估四个模型在六个韩国选举(2017-2025)中的政治行为。使用该框架,我们识别出三种系统性失败模式:(1)中等代理的渐进偏见,其中显式缓解将均绝对误差(MAE)减少5.2倍;(2)模型依赖的第三方显著性崩溃,区分显著性失败与决策偏见;以及(3)区域极化崩溃,其中模型双向低估历史政党强区。为解决这些失败,我们证明场景重构可恢复62%的2017年MAE,通过恢复第三方可见性。此外,我们引入了一个学习重加权适配器,成功校准对立价值模型,而无需在训练或测试时依赖候选人姓名。验证我们的诊断框架,Dynamo-K准确预测了3/3总统胜者,包括在高度争议的2022年0.73%边界的比赛中MAE为2.1%。并且正确识别了 held-out 地方选举中的主导政党。该流程是开源的,并提供了一种可扩展且成本效益高的方法来诊断LLM的政治行为。

英文摘要

Large language models (LLMs) exhibit systematic political biases in voter simulations, but their underlying mechanisms and cross-lingual generalizations remain poorly understood. We introduce Dynamo-K, a census-grounded simulation framework evaluating Korean-language LLM political behavior across four models on six Korean elections (2017-2025). Using this framework, we identify three systematic failure modes: (1) progressive bias in moderate agents, where explicit mitigation reduces Mean Absolute Error (MAE) by 5.2 times; (2) model-dependent third-party salience collapse, distinguishing between salience failure and decision bias; and (3) regional polarization collapse, where models bidirectionally under-predict historical party strongholds. To address these failures, we demonstrate that scenario reframing recovers 62% of 2017 MAE by restoring third-party visibility. Furthermore, we introduce a learned reweighting adapter that successfully calibrates opposing-valence models without relying on candidate names at train or test time. Validating our diagnostic framework, Dynamo-K accurately predicts 3/3 presidential winners - including a 2.1%p MAE on the highly contested 0.73%p-margin 2022 race - and correctly identifies the dominant party in a held-out local election. The pipeline is open-source and provides a scalable, cost-effective method for diagnosing LLM political behavior.

2605.18387 2026-05-19 cs.LG cs.AI 版本更新

Graph Hierarchical Recurrence for Long-Range Generalization

图层次递归用于长距离泛化

Stefano Carotti, Marco Pacini, Alessio Gravina, Davide Bacciu, Bruno Lepri, Sebastiano Bontorin

发表机构 * Department of Computer Science, University of Trento(特伦托大学计算机科学系) Fondazione Bruno Kessler(布鲁诺·克谢勒基金会) Department of Computer Science, University of Pisa(帕尔马大学计算机科学系)

AI总结 本文提出了一种名为图层次递归(GHR)的新框架,通过在输入图和通过池化获得的层次抽象上联合操作,解决了图神经网络和图转换器在长距离相关性捕捉任务中的限制,并在多个长距离基准测试中表现出色,参数效率高。

详情
AI中文摘要

图神经网络(GNNs)和图转换器(GTs)已成为图学习的基本范式,结合了深度模型的表示学习能力与诱导偏置带来的样本效率。尽管其有效性已得到广泛认可,但大量研究表明这些模型在需要捕捉图中远距离区域之间相关性的任务中仍面临根本性限制。为了解决这一问题,我们引入了图层次递归(GHR),一种新的框架,该框架同时在输入图和通过池化获得的层次抽象上进行操作。我们还展示了现有模型的局限性在超出范围的泛化中更加明显,其中测试实例涉及比训练时观察到的更长距离的相互作用。相比之下,尽管其设计简单,GHR提供了三个关键优势:在长距离依赖上表现强劲,改进了超出范围的泛化能力,以及高参数效率。为了验证这些主张,我们展示了在广泛的长距离基准测试中,GHR在使用当前最先进的模型参数的1%的情况下,始终优于现有的图模型。这些结果表明,当前趋势通过扩展架构来获得图基础模型的互补方向,表明仅增加模型容量可能不足以实现泛化。

英文摘要

Graph Neural Networks (GNNs) and Graph Transformers (GTs) are now a fundamental paradigm for graph learning, combining the representation-learning capabilities of deep models with the sample efficiency induced by their inductive biases. Despite their effectiveness, a large body of work has shown that these models still face fundamental limitations in tasks that require capturing correlations between distant regions of a graph. To address this issue, we introduce Graph Hierarchical Recurrence (GHR), a novel framework that operates jointly on the input graph and on a hierarchical abstraction obtained through pooling. We also show that the limitations of existing models are even more pronounced in out-of-range generalization, where test instances involve interactions over distances longer than those observed during training. By contrast, despite its simple design, GHR provides three key advantages: strong performance on long-range dependencies, improved out-of-range generalization, and high parameter efficiency. To corroborate these claims, we show that across a broad set of long-range benchmarks, GHR consistently outperforms existing graph models while using as little as 1% of the parameters of current state-of-the-art models. These results suggest a complementary direction to the current trend of scaling architectures to obtain graph foundation models, indicating that increased model capacity alone may not be sufficient for generalization.

2605.18385 2026-05-19 cs.RO cs.AI 版本更新

Towards Ubiquitous Mapping and Localization for Dynamic Indoor Environments

面向动态室内环境的无处不在的映射与定位

Halim Djerroud, Nico Steyn, Olivier Rabreau, Patrick Bonnin, Abderraouf Benali

发表机构 * Tshwane University of Technology(茨瓦内理工大学)

AI总结 本文提出UbiSLAM,一种用于动态室内环境实时映射和定位的创新解决方案,通过部署固定RGB-D相机网络解决传统SLAM系统在环境变化敏感性和依赖移动单元传感器的问题,提升机器人在环境中的定位精度和响应性。

Journal ref Proceedings of the 17th International Conference on Agents and Artificial Intelligence (ICAART 2025), Volume 1, pages 537-548, SciTePress, 2025. ISBN: 978-989-758-737-5, ISSN: 2184-433X

详情
AI中文摘要

我们提出了UbiSLAM,一种用于动态室内环境实时映射和定位的创新解决方案。通过在工作空间内战略性地部署固定RGB-D相机网络,UbiSLAM解决了传统SLAM系统常见的局限性,如对环境变化的敏感性和对移动单元传感器的依赖。这种固定传感器方法实现了实时、全面的映射,提高了机器人在环境中的定位精度和响应性。由UbiSLAM生成的集中化地图持续更新,为机器人提供准确的全局视图,从而提高导航、减少碰撞并促进共享空间中更流畅的人机交互。除了其优势外,UbiSLAM还面临挑战,特别是在确保完整空间覆盖和管理盲区方面,这需要从机器人本身集成数据。在本文中,我们讨论了潜在的解决方案,如自动校准以获得最佳的相机位置和方向,以及增强的通信协议以实现实时数据共享。所提出的模型减少了对单个机器人单元的计算负载,使更复杂的机器人平台能够有效运行,同时增强了整个系统的鲁棒性。

英文摘要

We present UbiSLAM, an innovative solution for real-time mapping and localization in dynamic indoor environments. By deploying a network of fixed RGB-D cameras strategically throughout the workspace, UbiSLAM addresses limitations commonly encountered in traditional SLAM systems, such as sensitivity to environmental changes and reliance on mobile unit sensors. This fixed-sensor approach enables real-time, comprehensive mapping, enhancing the localization accuracy and responsiveness of robots operating within the environment. The centralized map generated by UbiSLAM is continuously updated, providing robots with an accurate global view, which improves navigation, minimizes collisions, and facilitates smoother human-robot interactions in shared spaces. Beyond its advantages, UbiSLAM faces challenges, particularly in ensuring complete spatial coverage and managing blind spots, which necessitate data integration from the robots themselves. In this paper we discuss potential solutions, such as automatic calibration for optimal camera placement and orientation, along with enhanced communication protocols for real-time data sharing. The proposed model reduces the computational load on individual robotic units, allowing less complex robotic platforms to operate effectively while enhancing the robustness of the overall system.

2605.18382 2026-05-19 hep-ph cs.AI hep-ex 版本更新

Probing SMEFT Operators through $t\bar{t}t\bar{t}$ Production with Hyper-Graph Neural Networks at the LHC

通过LHC上的超图神经网络探测SMEFT算符的$t\bar{t}t\bar{t}$产生

Amir Subba, Sanmay Ganguly

发表机构 * Wilczek Quantum Center, Shanghai Institute for Advanced Studies, Shanghai 201315, China University of Science(Wilczek量子中心、上海先进研究院、上海201315中国、中国科学技术大学) Department of Physics, Indian Institute of Technology Kanpur, Uttar Pradesh 208016, India(物理系、印度理工学院坎浦尔分校、乌塔尔 Pradesh 208016印度)

AI总结 该研究利用超图神经网络(H-GNN)在13 TeV质子-质子碰撞中探测$t\bar{t}t\bar{t}$产生,通过多电荷信号事件与主导的标准模型背景(如$t\bar{t}W$、$t\bar{t}Z$、$t\bar{t}H$、$t\bar{t}VV$、单顶关联产生、双boson和三boson过程)进行区分,通过改进的信号提取得到SMEFT六维算符的95%置信水平限。

Comments 16 pages, 9 figures, 3 tables. Comments are welcome

详情
AI中文摘要

我们提出了一种现象学研究,利用超图神经网络(H-GNN)在13 TeV质子-质子碰撞中探测$t\bar{t}t\bar{t}$产生,用于区分多电荷信号事件与主导的标准模型背景,即$t\bar{t}W$、$t\bar{t}Z$、$t\bar{t}H$、$t\bar{t}VV$、单顶关联产生、双boson和三boson过程。在H-GNN架构中,每个事件被表示为一个超图,其节点对应于重建的喷注和电荷,其超边编码了任意子集之间的高阶相关性,使网络能够学习许多体动力学结构,这些结构特征于$t\bar{t}t\bar{t}$最终态。通过按照CMS样式的事件选择,结合同电荷双电荷、三电荷和四电荷通道,H-GNN在$t\bar{t}t\bar{t}$信号上获得ROC曲线下的面积为0.951,在积分光子流为140 fb^{-1}时,统计显著性为Z=9.11,与SPANet基线(Z=8.62)、Particle Transformer基线(Z=7.37)以及ATLAS分析(Z=5.13)相比。我们利用改进的信号提取,推导出SMEFT六维算符Φu、tt^{(1)}、qq^{(1)}、qt^{(1)}、qt^{(8)}的1-和2参数95%置信水平限,并投影了在HL-LHC积分光子流为1000 fb^{-1}和3000 fb^{-1}时的预期灵敏度,背景估计有50%的不确定性。

英文摘要

We present a phenomenological study of $t\bar{t}t\bar{t}$ production in proton-proton collisions at $\sqrt{s} = 13$~TeV, using a Hyper-Graph Neural Network (H-GNN) to discriminate multilepton signal events from the dominant SM backgrounds, namely $t\bar{t}W$, $t\bar{t}Z$, $t\bar{t}H$, $t\bar{t}VV$, single-top associated production, and diboson and triboson processes. In the H-GNN architecture each event is represented as a hypergraph whose nodes correspond to reconstructed jets and leptons and whose hyperedges encode higher-order correlations among arbitrary subsets of these objects, allowing the network to learn the many-body kinematic structures that characterize the $t\bar{t}t\bar{t}$ final state. Combining same-sign di-lepton, tri-lepton, and four-lepton channels following a CMS-like event selection, the H-GNN attains an area under the ROC curve of $0.951$ for the $t\bar{t}t\bar{t}$ signal and yields a statistical significance of $Z = 9.11$ at an integrated luminosity of $\mathcal{L} = 140~\mathrm{fb}^{-1}$, to be compared with $Z = 8.62$ for a SPANet baseline, $Z = 7.37$ for a Particle Transformer baseline, and $Z = 5.13$ obtained by the ATLAS analysis, evaluated under identical event selection. We exploit the improved signal extraction to derive one- and two-parameter $95\%$ confidence level limits on the Wilson coefficients of the dimension-six operators $\mathcal{O}_{Φu}$, $\mathcal{O}^{(1)}_{tt}$, $\mathcal{O}^{(1)}_{qq}$, $\mathcal{O}^{(1)}_{qt}$, and $\mathcal{O}^{(8)}_{qt}$, and we project the expected sensitivity at the HL-LHC integrated luminosities of $1000~\mathrm{fb}^{-1}$ and $3000~\mathrm{fb}^{-1}$ with $50\%$ uncertainty on the background estimation.

2605.18380 2026-05-19 cs.AI 版本更新

QSTRBench: a New Benchmark to Evaluate the Ability of Language Models to Reason with Qualitative Spatial and Temporal Calculi

QSTRBench: 一个评估语言模型进行定性空间和时间推理能力的新基准

Anthony G. Cohn, Robert E. Blackwell

发表机构 * School of Computer Science, The University of Leeds(利兹大学计算机科学学院) The Alan Turing Institute(阿兰·图灵研究院) Tongji University(同济大学)

AI总结 本文提出QSTRBench基准,用于评估大语言模型在定性空间和时间推理方面的能力,通过不同推理算法规则的组合性推理、反向关系和概念邻域等任务,展示了不同模型在处理不同算法规则时的表现差异,发现PA最简单而RCC-22最难。

Comments 74 pages, 20 figures

详情
AI中文摘要

我们介绍了一个广泛的定性空间和时间推理(QSTR)基准,用于评估大语言模型(LLMs)。我们提出了关于组合推理(使用组合表,CT)、反向关系和概念邻域(CN)的问题,针对QSTR算式、点代数(PA)、Allen区间代数、区间和持续时间(INDU)、区域连接算式(RCC-5、RCC-8和RCC-22)、九交模型、方向算式和STAR。RCC-22的CN首次在此发布。一个扩展的基准系统性地变化了问题呈现方式,包括前缀/后缀、词语/符号/非正式术语和图示描述,针对选定的算式。我们报告了当前前沿模型的结果。所有测试的模型都比猜测表现更好,但没有模型能一致正确回答所有问题。性能在不同算式之间差异显著,PA最简单,RCC-22最难。我们发布了该基准和我们的结果,以开放许可证发布,以促进进一步评估语言模型在定性空间/时间推理方面的能力。

英文摘要

We introduce an extensive qualitative spatial and temporal reasoning (QSTR) benchmark for evaluating large language models (LLMs). We pose questions concerning compositional reasoning (using composition tables, CT), converse relations, and conceptual neighbourhoods (CN) for QSTR calculi, Point Algebra (PA), Allen's Interval Algebra, Interval and Duration (INDU), Region Connection Calculus (RCC-5, RCC-8, and RCC-22), the nine intersection model, cardinal direction calculus, and STAR. The RCC-22 CN is published here for the first time. An extended benchmark systematically varies question presentation including prefix/infix, words/symbols/nonce terms and schematic descriptions for selected calculi. We report results for contemporary frontier models. All models tested perform better than guessing but none can consistently answer all questions correctly. Performance varies sharply by calculus, with PA being the most straightforward, and RCC-22 the most difficult. We release the benchmark, and our results under an open licence to facilitate further assessment of qualitative spatio/temporal reasoning in LLMs.

2605.18374 2026-05-19 cs.LG cs.AI 版本更新

Beyond Inference-Time Search: Reinforcement Learning Synthesizes Reusable Solvers

超越推理时间搜索:强化学习合成可重用求解器

Soheyl Massoudi, Gabriel Apaza, Milad Habibi, Mark Fuge

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Maryland(马里兰大学)

AI总结 本文探讨了强化学习能否将组合优化的推理成本转移到代码LLM的权重中,从而合成可重用的求解器。通过Synergistic Dependency Selection问题,研究发现强化学习能有效生成约束感知的模拟退火模板,并在多个领域展示出更高的效率和鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)通常将组合优化视为推理时间的过程,通过采样、搜索或重复提示单独解决每个实例。我们询问强化学习是否可以将部分推理成本转移到代码LLM的权重中,从而让模型为整个问题家族合成可重用的求解器。我们研究了Synergistic Dependency Selection(SDS),一种受约束的二次背包问题的受控变体,旨在暴露特定的失败模式:局部信号和严格可行性约束使贪心启发式方法具有吸引力但不可靠。在相同的框架下,Best-of-64基础模型采样在接近全局虚拟最佳求解器(VBS)的28.7%差距处饱和;代码审计显示基础模型经常检索模拟退火模板但错误实现Metropolis接受规则。我们使用可行性门控奖励和轻量结构框架对Qwen2.5-Coder-14B-Instruct进行微调,使用组相对策略优化(GRPO)。所得到的策略在99.8%的可行SDS输出中收敛到一个约束感知的模拟退火模板,达到VBS的5.0%差距,并且在生成后执行/搜索成本方面比累积Best-of-64评估便宜91倍。一次编译检查显示,每个种子的最优冻结求解器在SDS测试集上重复使用时仍然高度竞争,而额外领域评估在作业调度问题上提供了更窄但积极的证据,表明框架可以超越SDS。负消融揭示了这种配方的局限性:标准稳定器会降低性能,软可行性门控失败,结果仍对奖励归一化和领域特定设计选择敏感。

英文摘要

Large language models (LLMs) typically approach combinatorial optimization as an inference-time procedure, solving each instance separately through sampling, search, or repeated prompting. We ask whether reinforcement learning can instead shift part of this reasoning cost into the weights of a code LLM, so that the model synthesizes a reusable solver for an entire problem family. We study this question on Synergistic Dependency Selection (SDS), a controlled variant of constrained Quadratic Knapsack designed to expose a specific failure mode: local signals and strict feasibility constraints make greedy heuristics attractive but unreliable. Under identical scaffolding, Best-of-64 base-model sampling saturates at an approximately 28.7% gap to the global Virtual Best Solver (VBS); code audits show that the base model often retrieves Simulated Annealing templates but misimplements the Metropolis acceptance rule. We fine-tune Qwen2.5-Coder-14B-Instruct with Group Relative Policy Optimization (GRPO) using a feasibility-gated reward and light structural scaffolding. The resulting policy converges to a constraint-aware Simulated Annealing template in 99.8% of feasible SDS outputs, achieves a 5.0% gap to that VBS, and is 91 times cheaper in post-generation execution/search cost than cumulative Best-of-64 evaluation. A compile-once check shows that one best frozen solver per seed remains highly competitive when reused unchanged across the SDS test set, while an additional-domain evaluation on Job Shop Scheduling provides narrower but positive evidence that the scaffold transfers beyond SDS. Negative ablations reveal the limits of this recipe: standard stabilizers degrade performance, a soft feasibility gate fails, and results remain sensitive to reward normalization and domain-specific design choices.

2605.18349 2026-05-19 cs.CV cs.AI 版本更新

Optimising CSRNet with parameter-free attention mechanisms for crowd counting in public transport

通过参数自由注意力机制优化CSRNet以实现公共交通中的人群计数

Aida Rostamza, Enrico Del Re, Joshua Cherian Varughese, Cristina Olaverri-Monreal

发表机构 * Johannes Kepler University Linz(约翰· Kepler 大学林茨) Department Intelligent Transport Systems(智能交通系统部门)

AI总结 本文研究了参数自由注意力机制在密集场景中的人群计数和密度图估计中的有效性,提出了一种结合PFCA和SA的新型注意力机制PFCASA,并在ShanghaiTech数据集上验证了其在公共交通视频流中的性能。

详情
AI中文摘要

占用估计和人群计数是设计智能高效公共交通车辆的关键任务。鉴于公共交通载客量可能从稀疏到拥挤变化,传统的占用估计模型必须适应这一目的。注意力机制在增强深度神经网络在拥挤场景中的人群计数能力方面表现出显著优势,尤其是在存在遮挡、复杂背景和透视畸变的情况下。然而,传统方法通常作为卷积层中的参数化子网络实现,不可避免地增加了模型大小和计算成本,限制了在资源受限的边缘设备上的部署。本文研究了最先进的参数自由注意力机制在高度拥挤场景中的人群计数和密度图估计中的有效性。我们评估了通道级(PFCA)、空间级(SA)和三维级(SimAM)模块,并将其性能与参数化注意力模块进行比较,后者限制引入不超过1%的额外参数。此外,我们提出了一种新的注意力机制组合,结合PFCA和SA(PFCASA)以分析公共交通系统内的视频流。使用CSRNet作为骨干网络,在ShanghaiTech数据集上的实验表明,参数自由注意力机制在不引入额外模型参数的情况下实现了可比或更优的准确性。详细的性能分析进一步揭示,PFCASA在少于40人的场景中优于其他注意力模块,而PFCA在人群密度增加时表现出更大的有效性,凸显了其在智能公共交通模式中的应用潜力。

英文摘要

Occupancy estimation and crowd counting are critical tasks in designing smart and efficient public transport vehicles. Given that public transport loading can vary from sparse to crowded, classical models for occupancy estimation must be adapted to suit this purpose. Attention mechanisms have shown remarkable capability in enhancing the representational power of deep neural networks for crowd counting in congested scenes with occlusion, complex backgrounds, and perspective distortion. However, conventional approaches, often implemented as parameterized sub-networks within convolutional layers, inevitably increase model size and computational cost, limiting deployment on resource-constrained edge devices. This paper investigates the effectiveness of state-of-the-art parameter-free attention mechanisms for crowd counting and density map estimation in highly congested scenes. We evaluate channel-wise (PFCA), spatial-wise (SA), and 3-D (SimAM) modules and compare their performance with parameterized attention modules constrained to introduce no more than 1% additional parameters. Furthermore, we present a novel combination of attention mechanisms that combines the strengths of PFCA and SA (PFCASA) customized for analyzing video streams onboard public transport systems. Using CSRNet as the backbone, experiments on the ShanghaiTech dataset demonstrate that parameter-free attention mechanisms achieve comparable or superior accuracy without introducing additional model parameters. A detailed performance analysis further reveals that PFCASA outperforms other attention modules in scenes with fewer than 40 individuals, while PFCA shows greater effectiveness as crowd density increases, underscoring their potential applicability for integration into smart public transport modalities.

2605.18346 2026-05-19 cs.CV cs.AI 版本更新

Focused Forcing: Content-Aware Per-Frame KV Selection for Efficient Autoregressive Video Diffusion

聚焦强制:面向内容的每帧KV选择用于高效的自回归视频扩散

Peiliang Cai, Evelyn Zhang, Jiacheng Liu, Hao Lin, Ruiqi Zhang, Weile Mo, Yue Ma, Shikang Zheng, Jiehang Huang, Dongrui Liu, Linfeng Zhang

发表机构 * SJTU(上海交通大学) SDU(山东大学) HUST(华中科技大学) UTokyo(东京大学) HKUST(香港科技大学) SCUT(上海大学) Shanghai AI Lab(上海人工智能实验室)

AI总结 本文提出了一种无需训练的KV选择方法,通过结合注意力分数和历史帧的多样性分数,保留最相关和有区别的历史帧,从而在不牺牲质量的情况下提高自回归视频扩散的效率。

详情
AI中文摘要

近期在自回归视频扩散领域的进展使得序列和流式视频生成成为可能。然而,长视界生成需要越来越大的KV缓存,这使得在不牺牲质量的情况下实现高效的压缩具有挑战性。现有方法大多基于注意力分数选择历史帧,但它们的上下文决策仍然粗略。当同一块中生成多个帧时,这些方法通常对整个块应用共享的历史选择,仅通过注意力对历史帧评分,并将头预算均匀或通过注意力模式启发式分配,而不是显式估计头重要性。我们发现同一生成块中的帧可能依赖于不同的历史帧,同一历史帧在与当前帧的相对时间距离变化时可能获得不同的注意力分数,且屏蔽不同头会引发不均等的生成退化。受这些发现的启发,我们提出了Focused Forcing,一种无需训练的KV选择方法,该方法在生成帧和头维度上聚焦缓存历史。对于每个生成帧,Focused Forcing通过结合注意力分数和历史帧的多样性分数保留最相关和有区别的历史帧,同时将较大的预算分配给估计重要性更高的头。在多个自回归生成范式中,Focused Forcing在不训练的情况下实现了高达1.48倍的端到端加速,同时提高了视觉质量和文本对齐。

英文摘要

Recent advances in autoregressive video diffusion have enabled sequential and streaming video generation. However, long-horizon generation requires increasingly large KV caches, making efficient compression without sacrificing quality challenging. Existing methods mostly select historical frames based on attention scores, but their context decisions remain coarse. When multiple frames are generated in the same chunk, these methods often apply a shared history selection to the whole chunk, score historical frames solely by attention, and assign head-wise budgets either uniformly or by attention-pattern heuristics rather than explicit head-importance estimation. We show that frames within the same generated chunk can depend on distinct historical frames, that the same historical frame can receive different attention scores as its relative temporal distance to the current frames changes, and that masking different heads induces unequal generation degradation. Motivated by these findings, we propose \textbf{Focused Forcing}, a training-free KV selection method that focuses cached history along both generated-frame and head dimensions. For each generated frame, Focused Forcing preserves the most relevant and distinctive historical frames by combining attention scores with diversity scores of historical frames, while assigning larger budgets to heads with higher estimated importance. Across multiple autoregressive generation paradigms, Focused Forcing achieves up to $\textbf{1.48}\times$ end-to-end acceleration without training, while \textbf{improving visual quality and text alignment}. \textit{Our code will be released on GitHub.}

2605.18332 2026-05-19 cs.SE cs.AI 版本更新

Same Signal, Different Semantics: A Cross-Framework Behavioral Analysis of Software Engineering Agents

同信号,不同语义:软件工程代理跨框架行为分析

Wei Ma, Zhi Chen, Jingxu Gu, Tianling Li, Shangqing Liu, Lingxiao Jiang

发表机构 * Singapore Management University(新加坡国立大学) Nanjing University(南京大学) Nanyang Technological University(南洋理工大学)

AI总结 本文通过大规模实验分析软件工程代理在不同框架下的行为表现,发现相同的行为信号在不同框架下可能具有相反的意义,强调了跨框架验证的重要性。

详情
AI中文摘要

对基于大语言模型的软件工程代理进行行为研究,提取出关于轨迹形状与高分辨率率相关的操作规则:测试步骤跟随代码修改、错误级联较短或轨迹紧凑。每条规则通常源自单一框架,但其在结构不同的代理设计中是否转移,无论是符号还是幅度,尚未直接测试。我们在此生态系统层面进行研究:64,380次SWE-bench运行,涵盖126种代理配置,跨越43种框架,每种配置将LLM与一个框架(如SWE-Agent、OpenHands)配对,该框架提供其工具和工作流。我们通过固定每一层来分离框架效应与LLM效应,然后测量每种配置的行为-结果效应,并检查这些效应的一致性或分歧。在固定LLM的情况下,更换框架导致每个动作特征产生显著的行为差异。在大多数信号上,配置不仅在幅度上存在分歧,还在方向上存在分歧。错误率是清晰的案例:47种配置在错误率较低时解决更多问题,而48种配置在错误率较高时解决更多问题。其他五个连续特征和七个二元模式中的三个显示相似的方向分歧。框架身份比LLM家族解释了更多的变化:对于平均转弯数,框架解释了64%的配置间变异,而LLM仅解释了10%。这意味着相同的可观察行为信号可能在不同代理配置中具有相反的意义。因此,任何单一框架的行为发现应在跨配置验证后再被声称具有普遍性。

英文摘要

Behavioral studies of LLM-based software engineering agents extract operational rules about which trajectory shapes correlate with higher resolution rates: that a test step follows a code modification, that error cascades are short, or that trajectories are compact. Each rule is typically derived from a single framework, and whether it transfers, in sign as well as magnitude, to structurally different agent designs has not been directly tested. We address this at ecosystem scale: 64,380 SWE-bench runs from 126 agent configurations spanning 43 frameworks, where each configuration pairs an LLM with a framework (e.g., SWE-Agent, OpenHands) that supplies its tools and workflow. We separate framework effects from LLM effects by holding each layer fixed in turn, then measure one behavior-outcome effect per configuration and examine how those effects agree or disagree. Swapping the framework while the LLM is held fixed produces large behavioral differences in every action feature. On most signals, configurations disagree not merely in magnitude but in direction. Error rate is the cleanest case: 47 configurations resolve more issues when their error rate is lower, while 48 resolve more when it is higher. Five other continuous features and three of seven binary patterns from prior SE literature show similar directional disagreement. Framework identity accounts for more of this variation than LLM family: for mean turns, framework explains 64% of the between-configuration variance against the LLM's 10%. The implication is that the same observable behavioral signal can carry opposite meaning for different agent configurations. Behavioral findings from any single framework therefore warrant cross-configuration validation before being claimed as general.

2605.18327 2026-05-19 cs.AI 版本更新

Causely: A Causal Intelligence Layer for Enterprise AI A Benchmark Study on SRE and Reliability Workflows

Causely: 企业AI中的因果智能层 一项关于SRE和可靠性工作流的基准研究

Dhairya Dalal, Endre Sara, Ben Yemini, Christine Miller, Shmuel Kliger

发表机构 * Causely

AI总结 本文提出Causely,一种企业AI的因果智能层,通过维护环境拓扑、属性依赖性和因果关系的结构化表示,为AI代理提供语义和因果基础,以诊断、评估影响并安全地在生产环境中操作。通过在受控环境下注入故障的24微服务OpenTelemetry演示应用进行基准研究,评估了Causely的价值主张。

详情
AI中文摘要

目前,部署到SRE工作流中的AI代理在查询时从原始可观测性遥测中获取对环境状态的理解,这在令牌、延迟和推断可靠性上产生了语义解释的代价。我们提出了Causely,一种因果智能层,它维护了环境拓扑、属性依赖性和因果关系的结构化表示,这些关系锚定在受管理环境的本体表示上。Causely将原始遥测转换为一个实时、可查询的模型,为AI代理提供所需的语义和因果基础,以诊断、评估影响并在生产环境中安全地行动。我们通过在受控环境下注入故障的24微服务OpenTelemetry演示应用进行基准研究来评估这一价值主张。我们的实验比较了四种代理配置(Claude Code、OpenAI Codex、HolmesGPT与Sonnet和Gemini后端)。实验在两种场景下进行:活跃事件和健康基线,分别有和无访问Causely。在活跃故障场景中,因果基础将平均诊断时间减少63%,平均令牌消耗减少60%,平均工具调用次数减少78%,将调查足迹压缩了4.8倍,并降低了每运行的直接API成本57%;根因诊断准确率从75%提升到100%。

英文摘要

AI agents deployed into SRE workflows currently derive their understanding of environment state from raw observability telemetry at query time, paying a semantic-interpretation tax in tokens, latency, and inferential reliability. We propose Causely, a causal intelligence layer that maintains a structured representation of environment topology, attribute dependencies, and causal relationships that are anchroed to a ontological representation of the managed environment. Causely transforms raw telemetry into a live, queryable model providing the semantic and causal foundation AI agents require to diagnose, evaluate impact, and act safely in production. We evaluate this value proposition through a benchmark study conducted in a controlled setting with injected faults in a 24-microservice OpenTelemetry demo application. Our experiments compare four agent configurations (Claude Code, OpenAI Codex, HolmesGPT with Sonnet and Gemini backends). Experiments are run with and without access to Causely under two scenarios: an active incident and a healthy baseline. On the active-fault scenario, causal grounding reduces mean time-to-diagnosis by 63\%, mean token consumption by 60\%, and mean tool-call count by 78\%, compressing the investigation footprint by 4.8$\times$ and lowering direct API cost per run by 57\%; root-cause-diagnosis accuracy rises from 75\% to 100\%.

2605.18320 2026-05-19 cs.LG cs.AI 版本更新

ISEP: Implicit Support Expansion for Offline Reinforcement Learning via Stochastic Policy Optimization

ISEP: 通过随机策略优化实现离线强化学习的隐式支持扩展

Yifei Chen, Shaoqin Zhu, Xiaoqiang Ji

发表机构 * The Chinese University of Hong Kong, Shenzhen Longgang(香港中文大学(深圳)松山湖校区)

AI总结 本文提出ISEP方法,通过随机策略优化实现离线强化学习中的隐式支持扩展,以解决传统方法在安全约束下难以发现最优行为的问题,核心贡献是通过价值函数插值和随机动作选择策略提高策略改进的导航能力。

详情
AI中文摘要

离线强化学习方法通常强制严格的约束以确保安全;然而这种刚性往往阻止了在行为策略即时支持之外发现最优行为。为了解决这个问题,我们提出了通过随机策略优化实现的隐式支持扩展(ISEP),该方法利用在分布数据和策略样本之间插值的价值函数,以隐式方式扩展可行动作支持。这种机制“密集化”高奖励区域,为策略改进创建可导航路径,同时在理论上保证价值误差的有界性。然而,优化此扩展支持会创建多模态景观,标准确定性平均会导致模式崩溃和无效动作。ISEP通过随机动作选择策略缓解了这一问题,通过随机交替保守克隆和乐观扩展信号来优化策略。我们通过使用条件流匹配利用分类器免费引导,将此框架实例化为ISEP-FM,以有效捕捉插值的价值信号。

英文摘要

Offline reinforcement learning methods typically enforce strict constraints to ensure safety; yet this rigidity often prevents the discovery of optimal behaviors outside the immediate support of the behavior policy. To address this, we propose Implicit Support Expansion via stochastic Policy optimization (ISEP), which leverages a value function interpolated between in-distribution data and policy samples to implicitly expand the feasible action support. This mechanism "densifies" high-reward regions, creating a navigable path for policy improvement while theoretically guaranteeing bounded value error. However, optimizing against this expanded support creates a multimodal landscape where standard deterministic averaging leads to mode collapse and invalid actions. ISEP mitigates this via a stochastic action selection strategy, optimizing the policy by stochastically alternating between conservative cloning and optimistic expansion signals. We instantiate this framework as ISEP-FM using Conditional Flow Matching utilizing classifier-free guidance to effectively capture the interpolated value signal.

2605.18309 2026-05-19 cs.LG cs.AI 版本更新

Alignment Dynamics in LLM Fine-Tuning

在LLM微调中的对齐动力学

Yuhan Huang, Huanran Chen, Yinpeng Dong

发表机构 * Shanghai Qi Zhi Institue & University of Tokyo(上海启智研究院 & 东京大学) College of AI, Tsinghua University(清华大学人工智能学院)

AI总结 本文研究了在LLM微调过程中对齐的动态特性,提出了一种可计算的对齐评分,并推导了其在微调过程中的闭式更新公式,从而建立了对齐动态的统一框架。通过将对齐更新分解为两种竞争成分:反弹力和驱动力,解释了为何先前的对齐可能被后续微调逆转,以及为何更狭窄的后验结构会增强这种逆转。此外,该框架预测了‘复习强化效应’,即先前的对齐会在重新暴露时留下潜在的后验印记,从而增强驱动力,导致更快的重新对齐。

详情
AI中文摘要

尽管大型语言模型(LLMs)通过监督微调和人类反馈强化学习实现了强大的对齐,但在后续微调中对齐往往容易崩溃。现有的解释要么将对齐脆弱性归因于梯度几何,要么将其描述为模型输出的分布转移,但很少有研究能提供一个统一的框架,将参数空间的学习动态与函数空间的对齐行为联系起来。在本文中,我们引入了一个可计算的对齐评分,并推导了其在微调过程中的闭式更新公式,从而建立了对齐动态的统一框架。我们的分析将对齐更新分解为两个竞争成分:一种由当前对齐状态和模型分布狭窄性共同决定的“反弹力”,以及一种由训练分布与条件后验对齐和非对齐完成的后验对齐程度决定的“驱动力”。这种分解解释了为何先前的对齐可能被后续微调逆转,以及为何更狭窄的后验结构会增强这种逆转。此外,我们的框架预测了“复习强化效应”:先前的对齐会在重新暴露时留下潜在的后验印记,从而增强驱动力,导致更快的重新对齐。我们通过安全对齐、新兴不一致和情感设置验证了这些预测,展示了在重新暴露下一致的对齐逆转和加速的重新对齐。此外,安全对齐的受控实验确认了预测的反弹强度与后验狭窄性之间的依赖关系。这些结果共同提供了一个统一的动态视角,说明在LLM微调过程中对齐是如何被破坏和重新激活的。

英文摘要

Although Large Language Models (LLMs) achieve strong alignment through supervised fine-tuning and reinforcement learning from human feedback, the alignment is often fragile under subsequent fine-tuning. Existing explanations either attribute alignment fragility to gradient geometry or characterize it as a distributional shift in model outputs, yet few provide a unified account that bridges parameter-space learning dynamics with function-space alignment behavior during fine-tuning. In this work, we introduce a tractable alignment score and derive its closed-form update during fine-tuning, yielding a unified framework for alignment dynamics. Our analysis decomposes alignment updates into two competing components: a \textbf{\color{red!60!black} Rebound Force}, governed jointly by the current alignment state and the narrowness of model distribution, and a \textbf{\color{green!60!black} Driving Force}, determined by how the training distribution aligns with outcome-conditioned posteriors over aligned and non-aligned completions. This decomposition explains why prior alignment can be reversed by later fine-tuning and why narrower posterior structure strengthens such reversal. Moreover, our framework predicts a \textbf{Rehearsal Priming Effect}: prior alignment leaves a latent posterior imprint that amplifies the effective Driving Force upon re-exposure, leading to faster re-alignment. We validate these predictions across safety alignment, emergent misalignment, and sentiment settings, demonstrating consistent alignment reversal and accelerated re-alignment under re-exposure. In addition, controlled experiments in safety alignment confirm the predicted dependence of rebound strength on posterior narrowness. Together, these results provide a unified dynamical perspective on how alignment is disrupted and reactivated during LLM fine-tuning.

2605.18303 2026-05-19 cs.LG cs.AI cs.CV cs.RO 版本更新

PH-Dreamer: A Physics-Driven World Model via Port-Hamiltonian Generative Dynamics

PH-Dreamer: 通过端口-哈密顿生成动力学构建一个物理驱动的世界模型

Xueyu Luan, Chenwei Shi

AI总结 本文提出了一种基于端口-哈密顿框架的物理驱动世界模型PH-Dreamer,通过三个协同机制改进了基于递归状态空间架构的世界模型,实现了更紧凑且物理结构化的表示,同时提高了内部模拟器的保真度,并减少了潜在相空间体积、能量消耗和平均加速度平方。

Comments 12 pages, 3 figures

详情
AI中文摘要

基于递归状态空间架构构建的世界模型能够实现高效的潜在想象,但仍然缺乏物理结构,导致动力学违反守恒和耗散原理。我们引入了一个统一的端口-哈密顿框架,通过三种协同机制来解决这一问题。首先,我们将隐含的物理先验嵌入到递归转换中,通过将投影的潜在演变建模为受流动和耗散控制的能量路由,使投影的PH相空间偏向于更紧凑且物理结构化的表示。其次,我们开发了一个具有运动学意识的能量世界模型,该模型从本体感觉观察估计哈密顿量和功率平衡,提供了一个明确的物理信号用于热力学推理。第三,利用这些能量梯度,我们建立了基于能量的Actor-Critic,利用拉格朗日乘数来正则化策略优化,使其朝着更低的能量和更平滑的控制方向发展。在视觉控制基准测试中,该范式不仅实现了更优的渐近回报,还通过在想象奖励和真实奖励之间建立更紧密且方差更低的对齐关系,提高了内部模拟器的保真度,同时将潜在相空间体积减少了4.18-8.41%,能量消耗降低了高达7.80%,平均加速度平方降低了高达9.38%。

英文摘要

World models built on recurrent state space architectures enable efficient latent imagination, yet remain physically unstructured, producing dynamics that violate conservation and dissipative principles. We introduce a unified Port-Hamiltonian framework that remedies this through three synergistic mechanisms. First, we embed implicit physical priors into recurrent transitions by modeling projected latent evolution as action controlled energy routing governed by flow and dissipation, biasing the projected PH phase space toward a more compact and physically structured representation. Second, we develop a kinematics aware energy world model that estimates the Hamiltonian and power balance from proprioceptive observations, providing an explicit physical signal for thermodynamic reasoning. Third, leveraging these energy gradients, we establish an energy guided Actor-Critic that uses Lagrangian multipliers to regularize policy optimization toward lower energy and smoother control. Across visual control benchmarks, this paradigm not only attains superior asymptotic returns but also elevates internal simulator fidelity by establishing a tighter, lower variance alignment between imagined and real rewards, all while reducing latent phase space volume by 4.18-8.41%, energy consumption by up to 7.80%, and mean squared jerk by up to 9.38%.

2605.18299 2026-05-19 cs.AI cs.CL cs.IR 版本更新

SD-Search: On-Policy Hindsight Self-Distillation for Search-Augmented Reasoning

SD-Search: 用于搜索增强推理的在线策略 hindsight 自监督学习

Yufei Ma, Zihan Liang, Ben Chen, Zhipeng Qian, Huangyu Dai, Lingtao Mao, Xuxin Zhang, Chenyi Lei, Wenwu Ou

发表机构 * Kuaishou Technology(快手科技)

AI总结 本文提出SD-Search,一种基于在线策略hindsight自监督学习的搜索增强推理方法,通过自身策略生成细粒度监督信号,无需外部教师模型或额外标注。

详情
AI中文摘要

搜索增强推理代理将内部推理与外部检索器的调用交替进行,其性能依赖于每次发出的查询质量。然而,在基于结果奖励的强化学习中,每个搜索决策在展开过程中共享同一轨迹级奖励,使个体查询缺乏步级信用。最近的过程监督方法通过从政策外部获取步级信号来解决这一差距,依赖于一个更大的教师模型或由更强的外部系统生成的子问题注释。相比之下,我们提出了SD-Search,通过在线策略的hindsight自监督学习自身生成步级监督,无需外部教师或额外标注。在SD-Search中,一个模型扮演两个角色:学生只看到推理时可用的上下文,而教师还根据一个紧凑的hindsight块总结了搜索查询和一组从同一问题采样的展开的最终结果。由于教师知道每个展开的展开过程和哪些成功,其查询分布隐含地标记了哪些决策值得做出,学生通过最小化token级的Jensen-Shannon散度来恢复这种行为。这在GRPO的粗粒度轨迹奖励上叠加了密集的步级信号。关键的是,这个信号由策略本身在标准RL训练循环中生成,无需外部模型推理、辅助标注流程或额外的训练阶段。

英文摘要

Search-augmented reasoning agents interleave internal reasoning with calls to an external retriever, and their performance relies on the quality of each issued query. However, under outcome-reward reinforcement learning, every search decision in a rollout shares the same trajectory-level reward, leaving individual queries without step-specific credit. Recent process-supervision approaches address this gap by drawing step-level signals from outside the policy, relying either on a much larger teacher model, or on sub-question annotations produced by a stronger external system. In contrast, we propose SD-Search, which derives step-level supervision from the policy itself through on-policy hindsight self-distillation, requiring neither an external teacher nor additional annotations. In SD-Search, a single model plays two roles that differ only in conditioning: a student that sees only the context available at inference time, and a teacher that additionally conditions on a compact hindsight block summarizing the search queries and final outcomes of a group of rollouts sampled from the same question. Since the teacher knows how each rollout unfolded and which ones succeeded, its query distribution implicitly marks which decisions were worth making, and the student is trained to recover this behavior by minimizing the token-level Jensen--Shannon divergence to the teacher at search-query positions. This layers a dense, step-level signal on top of GRPO's coarse trajectory reward. Crucially, this signal is produced by the policy itself within the standard RL training loop, without external model inference, auxiliary annotation pipeline, or additional training stage.

2605.18298 2026-05-19 cs.AI cs.HC cs.LG 版本更新

DARE-EEG: A Foundation Model for Mining Dual-Aligned Representation of EEG

DARE-EEG: 一种用于挖掘双对齐表示的EEG基础模型

Yang Shao, Peiliang Gong, Qun Dai, Daoqiang Zhang

发表机构 * College of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics(航空宇航学院人工智能学院)

AI总结 本文提出DARE-EEG,一种通过双对齐表示学习预训练的自监督基础模型,旨在解决EEG编码器在不完整观测下学习不变表示的问题,通过对比学习和动量更新实现语义稳定性,并通过卷积-线性探针策略适应异构电极配置和采样率,实验表明其在EEG基准测试中表现优异。

Comments 22 pages, 10 pages of main text + 12 pages of appendices

详情
AI中文摘要

通过在大规模EEG数据上进行掩码重建预训练,基础模型已成为在多样化脑机接口应用中学习通用神经表示的有前景范式。然而,一个关键但被忽视的挑战是EEG编码器必须学习对不完整观测不变的表示——当不同掩码视图的同一信号有最小重叠时,现有方法无法将它们约束到一致的潜在子空间,导致转移性下降。为此,我们提出DARE-EEG,一种自监督基础模型,通过预训练期间的双对齐表示学习显式强制掩码不变性。具体而言,我们引入掩码对齐,通过对比学习约束同一EEG样本多个掩码视图的表示,补充锚点对齐,将掩码表示对齐到动量更新的完整特征以实现语义稳定性。此外,我们提出卷积-线性探针,一种参数高效策略,通过解耦频谱-空间投影适应异构电极配置和采样率。在多样化的EEG基准测试中,广泛实验表明DARE-EEG在准确性表现上始终领先,同时保持相对较低的参数复杂度和优于现有方法的跨数据集可移植性。此外,DARE-EEG有助于有效发现和利用EEG中的丰富潜在表示。

英文摘要

Foundation models pre-trained through masked reconstruction on large-scale EEG data have emerged as a promising paradigm for learning generalizable neural representations across diverse brain-computer interface applications. However, a critical yet overlooked challenge is that EEG encoders must learn representations invariant to incomplete observations-when different masked views of the same signal have minimal overlap, existing methods fail to constrain them to a consistent latent subspace, leading to degraded transferability. To address this, we propose DARE-EEG, a self-supervised foundation model that explicitly enforces the mask-invariance property through dual-aligned representation learning during pre-training. Specifically, we introduce mask alignment that constrains representations from multiple masked views of the same EEG sample via contrastive learning, complementing anchor alignment that aligns masked representations to momentum-updated complete features for semantic stability. Additionally, we propose conv-linear-probing, a parameter-efficient strategy that adapts pre-trained representations to heterogeneous electrode configurations and sampling rates through decoupled spectro-spatial projections. Extensive experiments across diverse EEG benchmarks demonstrate that DARE-EEG consistently achieves state-of-the-art in accuracy performance while maintaining relatively low parameter complexity and superior cross-dataset portability compared to existing methods. Furthermore, DARE-EEG contributes to effectively discovering and utilizing the rich potential representations in EEG.

2605.18284 2026-05-19 cs.SE cs.AI 版本更新

CommitDistill: A Lightweight Knowledge-Centric Memory Layer for Software Repositories

CommitDistill: 一种轻量级的知识导向内存层用于软件仓库

Divya Chukkapalli, Thejesh Avula, Aditya Aggarwal, Harsimran Singh, Amith Tallanki

发表机构 * Microsoft Corporation(微软公司)

AI总结 本文提出CommitDistill,一种轻量级的知识导向内存层,通过确定性正则表达式从本地git历史中提取类型化知识单元,并通过TF-IDF检索器进行检索,以提高软件仓库中历史信息的利用率。

详情
AI中文摘要

软件仓库在提交信息、拉取请求讨论和问题线程中积累了大量未结构化的知识,但开发者和AI代码助手很少有效重用这些历史。最近关于类型内存架构用于LLM代理(MemGPT、生成代理和Yang等人提出的PlugMem模块)的工作认为,代理内存应被提炼为类型化的知识,而不是原始交互文本。我们适应这一立场,将这种观点应用于软件仓库自身的git历史,在受限环境下:确定性、无依赖、本地-only、无嵌入。我们提出了CommitDistill,一个开源的Python原型,通过确定性正则表达式将本地git历史挖掘为类型化知识单元(事实、技能、模式),并通过带有校准沉默阈值(theta=2.5)的TF-IDF检索器进行展示。该产物是一种信任仪器化的内存基质:确定性、无外部服务、可检查的纯JSON存储、可调节的沉默。在五个公共仓库上的案例研究(涵盖Python、JavaScript、C和Java,25,000个提交,1,167个提取单元)报告了在40个双注释Python单元上Cohen's kappa=0.633时的有用精度为0.525。决定性发现是预算受限的检索:在256字符每查询预算下,CommitDistill在12查询基准上达到0.750命中率,高于BM25的0.333和git log --grep的0.083。在四臂配对LLM-as-judge评估(n=200时间旅行bug-fixes,两个裁判)中,涵盖控制、CommitDistill、一个体预算匹配的CD-Hybrid和BM25,没有任何条件在标题均值和CD-Hybrid上产生统计上可检测的提升,CD-Hybrid与BM25在直接对抗中不可区分。在10,000个提交上提取可在笔记本电脑上在4秒内完成。源代码、注释、基准和可重复性脚本随本文提供。

英文摘要

Software repositories accumulate large amounts of unstructured knowledge in commit messages, pull-request discussions, and issue threads, but developers and AI coding assistants rarely reuse this history effectively. Recent work on typed-memory architectures for LLM agents (MemGPT, generative agents, and the PlugMem module of Yang et al.) argues that agent memory should be distilled, typed knowledge rather than raw interaction text. We adapt that stance to a software repository's own git history under a constrained regime: deterministic, dependency-free, local-only, no embeddings. We present CommitDistill, an open-source Python prototype that mines a local git history into typed knowledge units (Facts, Skills, Patterns) using deterministic regex and surfaces them through a TF-IDF retriever with a calibrated silence threshold (theta = 2.5) that abstains on out-of-distribution queries. The artefact is a trust-instrumented memory substrate: deterministic, no external service, inspectable plain-JSON store, tunable abstention. A case study on five public repositories spanning Python, JavaScript, C, and Java (25,000 commits, 1,167 extracted units) reports useful-precision 0.525 at Cohen's kappa = 0.633 on 40 dual-annotated Python units. The decisive finding is budget-constrained retrieval: at a 256-character per-query budget, CommitDistill reaches 0.750 hit-rate on a 12-query benchmark against BM25's 0.333 and git log --grep's 0.083. On a four-arm paired LLM-as-judge evaluation (n=200 time-travel bug-fixes, two judges) covering control, CommitDistill, a body-budget-matched CD-Hybrid, and BM25, no condition produces a statistically detectable lift over control on the headline mean and CD-Hybrid is indistinguishable from BM25 head-to-head. Extraction over 10,000 commits completes in under 4 seconds on a laptop. Source, annotations, baselines, and a reproducibility script accompany this paper.

2605.18257 2026-05-19 cs.CV cs.AI cs.CL 版本更新

CodeBind: Decoupled Representation Learning for Multimodal Alignment with Unified Compositional Codebook

CodeBind: 一种用于多模态对齐的解耦表示学习框架

Zeyu Chen, Jie Li, Kai Han

发表机构 * Visual AI Lab, The University of Hong Kong(视觉人工智能实验室,香港大学)

AI总结 CodeBind通过统一的组合代码本设计优化多模态表示空间,解决了传统方法在跨模态信息差异和数据稀缺导致的对齐空间不足问题,实现了多模态分类和检索任务中的最佳性能。

Comments ACL 2026 Findings; Project page: https://visual-ai.github.io/codebind

详情
AI中文摘要

多模态表示对齐对于大语言模型和机器人至关重要。传统方法常受到跨模态信息差异和数据稀缺的限制,导致对齐空间不优,忽略了模态特有的特征。我们提出了CodeBind,一种通过模态共享-特定代码本设计优化多模态表示空间的框架。通过逐步对齐目标和连接模态,CodeBind避免了需要完全配对数据的需要。不同于传统硬对齐,CodeBind将特征分解为共享组件以实现语义一致性,以及特定组件以捕捉模态特有的细节。这种设计利用了组合向量量化方案,其中共享代码本弥合模态差距,而模态特定代码本通过防止主导模态压制其他模态来缓解表示偏差。在九种模态(文本、图像、视频、音频、深度、热成像、触觉、3D点云、EEG)上验证,CodeBind在多模态分类和检索任务中实现了最先进的性能。

英文摘要

Multimodal representation alignment is pivotal for large language models and robotics. Traditional methods are often hindered by cross-modal information discrepancies and data scarcity, leading to suboptimal alignment spaces that overlook modality-unique features. We propose CodeBind, a framework that optimizes multimodal representation spaces through a modality-shared-specific codebook design. By incrementally aligning target and bridging modalities, CodeBind bypasses the need for fully paired data. Unlike traditional hard alignment, CodeBind decomposes features into shared components for semantic consistency and specific components for modality-unique details. This design utilizes a compositional vector quantization scheme, where a shared codebook bridges modality gaps and modality-specific codebooks mitigate representation bias by preventing dominant modalities from overshadowing others. Validated across nine modalities (text, image, video, audio, depth, thermal, tactile, 3D point cloud, EEG), CodeBind achieves state-of-the-art performance in multimodal classification and retrieval tasks.

2605.18253 2026-05-19 cs.CL cs.AI 版本更新

Machine Unlearning for Masked Diffusion Language Models

针对掩码扩散语言模型的机器去学习

Georu Lee, Seungwon Jeong, Hoki Kim, Jinseong Park, Woojin Lee

发表机构 * Dongguk University-Seoul(东国大学-首尔) Chung-Ang University(Chung-Ang 大学) Korea Institute for Advanced Study(韩国高级研究院)

AI总结 本文提出了一种针对掩码扩散语言模型的去学习框架MDU,通过重新审视扩散过程中的知识学习,实现了高效的去学习性能。

Comments 20 pages, 8 figures, appendix included

详情
AI中文摘要

最近的掩码扩散语言模型(MDLMs),如LLaDA和Dream,已经达到了与自回归大语言模型相当的性能。与自回归模型不同,MDLMs通过并行迭代去噪掩码位置来生成文本。在微调过程中,MDLMs学习从掩码响应状态中恢复响应,从而将预测从提示-掩码无条件分布转向提示-条件分布。尽管生成和微调机制不同,针对MDLMs的机器去学习仍鲜有研究。在本文中,我们提出Masked Diffusion Unlearning(MDU),通过重新审视扩散过程中的知识学习,首次提出了针对MDLMs的去学习框架。具体而言,MDU在每个掩码响应位置上最小化从提示-条件预测到提示-掩码无条件锚点的正向KL散度,并通过温度缩放参数控制隐私-效用权衡。在标准基准和MDLM骨干网络上的实验证明,MDU在与现有LLM去学习方法相比时实现了较高的去学习性能。代码可在https://github.com/leegeoru/MDU上获得。

英文摘要

Recent masked diffusion language models (MDLMs), such as LLaDA and Dream, have achieved performance comparable to autoregressive large language models. Unlike autoregressive models, which generate text sequentially, MDLMs generate text by iteratively denoising masked positions in parallel. During fine-tuning, MDLMs learn to recover responses from masked response states conditioned on a prompt, thereby shifting their predictions from a prompt-masked unconditional distribution toward a prompt-conditional distribution. Despite this distinct generative and fine-tuning mechanism, machine unlearning for MDLMs remains largely unexplored. In this paper, we propose Masked Diffusion Unlearning (MDU), the first unlearning framework for MDLMs, by revisiting the process of learning specific knowledge in terms of diffusion. Specifically, MDU minimizes a forward KL divergence from the prompt-conditional prediction to a prompt-masked unconditional anchor at every masked response position, with a temperature scaling parameter to control the privacy-utility trade-off. Our empirical results on standard benchmarks and MDLM backbones show that MDU achieves high unlearning performance compared to existing LLM unlearning methods. Code is available at https://github.com/leegeoru/MDU.

2605.18246 2026-05-19 cs.LG cs.AI 版本更新

Privacy Preserving Reinforcement Learning with One-Sided Feedback

具有单侧反馈的隐私保护强化学习

Lin William Cong, Guangyan Gan, Hanzhang Qin, Zhenzhen Yan

发表机构 * Nanyang Technological University(南洋理工大学) National University of Singapore(国立新加坡大学) Cornell SC Johnson College of Business(康奈尔大学SC Johnson商学院)

AI总结 本文研究了在多维连续状态和动作空间中,代理仅接收状态部分观测并仅在每个时间步获得状态-动作空间子集奖励信息的强化学习问题,提出了一种新的隐私保护强化学习算法POOL,并通过理论分析证明其样本复杂度与非隐私强化学习的下界一致,展示了在保持高学习效率的同时实现强隐私保障的可行性。

Comments Accepted at IJCAI-ECAI 2026

详情
AI中文摘要

我们研究了在多维连续状态和动作空间中具有单侧反馈的强化学习(RL)。在此设置中,智能体仅接收状态的部分观测,并在每个时间步仅获得状态-动作空间子集的奖励信息。这种设置在学习效率和隐私保护方面带来了重大挑战。为了解决这些挑战,我们提出了POOL,一种新颖的隐私保护RL算法。我们对POOL进行了全面的理论分析,推导出一个样本复杂度界,该界与已知的非隐私RL下界相匹配。其中,E_rho表示隐私参数,H是时间范围,alpha是最优性差距参数。我们的研究结果表明,可以在保持高学习效率的同时实现强隐私保障,这标志着在具有单侧反馈的多维环境中实现实用的隐私感知RL迈出重要一步。

英文摘要

We study reinforcement learning (RL) in multi-dimensional continuous state and action spaces with one-sided feedback, where the agent receives partial observations of the state and obtains reward information for only a subset of the state-action space at each time step. This setting introduces substantial challenges in both learning efficiency and privacy preservation. To address these challenges, we propose POOL, a novel privacy-preserving RL algorithm. We conduct a comprehensive theoretical analysis of POOL, deriving a sample complexity bound that matches the known lower bounds for non-private RL. Here, E_rho denotes the privacy parameter, H is the time horizon, and alpha is the optimality-gap parameter. Our findings show that it is possible to enforce strong privacy guarantees while maintaining high learning efficiency, marking a significant step toward practical, privacy-aware RL in multi-dimensional environments with one-sided feedback.

2605.18239 2026-05-19 cs.CL cs.AI 版本更新

Multilingual jailbreaking of LLMs using low-resource languages

使用低资源语言对LLM进行多语言劫持

Dylan Marx, Marcel Dunaiski

发表机构 * Computer Science Division, Mathematical Sciences Department(计算机科学系,数学科学系)

AI总结 研究通过使用低资源非洲语言进行多轮对话来测试大型语言模型的安全机制,发现翻译质量是影响低资源语言劫持成功率的关键因素。

Comments 12 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLMs)仍然容易受到绕过安全防护措施的劫持攻击。我们研究了使用低资源非洲语言(阿弗里卡语、基索瓦希利、isiXhosa和isiZulu)的多轮对话是否能绕过商业LLM的安全机制。我们翻译了现有数据集中的提示,并通过自动化测试和本地母语者的人工红队测试评估了ChatGPT、Claude、DeepSeek、Gemini和Grok。单轮翻译攻击效果不佳,而多轮对话在英语有害响应率方面从52.7%(Claude 3.5 Haiku)到83.6%(GPT-4o-mini),阿弗里卡语从60.0%(Claude 3.5 Haiku)到78.2%(GPT-4o-mini),基索瓦希利从41.8%(Claude 3.5 Haiku)到70.9%(DeepSeek)。人工红队测试比自动化方法提高了劫持率。所有评估语言的平均劫持率从59.8%增加到75.8%,其中阿弗里卡语提高了20.0%,isiZulu提高了12.7%,isiXhosa提高了12.3%,基索瓦希利提高了1%,这表明翻译质量限制了劫持的成功率。这些发现表明,LLM中的漏洞在多语言环境中仍然存在,翻译质量是决定低资源语言劫持成功率的关键因素。

英文摘要

Large Language Models (LLMs) remain vulnerable to jailbreak attempts that circumvent safety guardrails. We investigate whether multi-turn conversations using low-resource African languages (Afrikaans, Kiswahili, isiXhosa, and isiZulu) can bypass safety mechanisms across commercial LLMs. We translated prompts from existing datasets and evaluated ChatGPT, Claude, DeepSeek, Gemini, and Grok through automated testing and human red-teaming with native speakers. Single-turn translation attacks proved ineffective, while multi-turn conversations achieved English harmful response rates from 52.7% (Claude 3.5 Haiku) to 83.6% (GPT-4o-mini), Afrikaans from 60.0% (Claude 3.5 Haiku) to 78.2% (GPT-4o-mini), and Kiswahili from 41.8% (Claude 3.5 Haiku) to 70.9% (DeepSeek). Human red-teaming increased jailbreak rates compared to automated methods. Over all evaluated languages, the average jailbreak rate increased from 59.8% to 75.8%, with improvements of +20.0% (Afrikaans), +12.7% (isiZulu), +12.3% (isiXhosa), and +1% (Kiswahili), demonstrating that poor translation quality limits jailbreak success. These findings suggest that vulnerabilities in LLMs persist in multilingual contexts and that translation quality is the critical factor determining jailbreak success in low-resource languages.

2605.18232 2026-05-19 cs.CL cs.AI cs.IR 版本更新

SomaliWeb v1: A Quality-Filtered Somali Web Corpus with a Matched Tokenizer and a Public Language-Identification Benchmark

SomaliWeb v1: 一个经过质量过滤的索马里网页语料库,配有匹配的分词器和公开的语言识别基准

Khalid Yusuf Dahir

发表机构 * Independent researcher(独立研究者)

AI总结 本文提出了SomaliWeb v1,一个经过质量过滤的索马里语语料库,包含匹配的BPE-16K分词器和首个公开的索马里语言识别基准,揭示了现有分布中的质量问题。

Comments 16 pages, 6 figures, 6 tables. Code: https://github.com/khaledyusuf44/somali-corpus Dataset: https://huggingface.co/datasets/khaledyusuf44/somaliweb-v1

详情
AI中文摘要

索马里是一种非洲之角的库希特语,有约2500万使用者,但目前没有公开的专门索马里预训练语料库及其配套的分词器和语言识别基准。现有的索马里文本要么出现在多语言分布中(如HPLT v2、CC100、MADLAD-400、OSCAR、mC4),要么出现在Hugging Face上的小规模、未记录的索马里-only上传中。我们介绍了SomaliWeb v1,一个经过质量过滤的索马里语料库,包含819,322个文档(约303亿个标记),由三个上游来源(HPLT v2、CC100、索马里维基百科)通过六阶段可重复的流程构建。我们发布了(i)语料库,(ii)匹配的BPE-16K分词器,以及(iii)首个公开的索马里语言识别基准。我们的测量揭示了现有分布中的具体质量问题:HPLT v2的“清理”索马里发布保留了17.3%的字节精确重复项,其56.1%的文档包含可修复的mojibake,且其10.7%的字节唯一文档在Jaccard tau=0.80时为近重复项。我们的BPE-16K分词器在FLORES-200索马里开发测试上比GPT-4的cl100k_base少发出40.2%的标记;下游语言模型困惑度比较将推迟到后续发布。

英文摘要

Somali is a Cushitic language of the Horn of Africa with ~25 million speakers, yet no documented dedicated Somali pretraining corpus with a companion tokenizer and language-identification benchmark has been publicly released. Existing Somali text appears either inside multilingual distributions (HPLT v2, CC100, MADLAD-400, OSCAR, mC4) or in small, undocumented Somali-only uploads on Hugging Face. We introduce SomaliWeb v1, a quality-filtered Somali corpus of 819,322 documents (~303M tokens) built from three upstream sources (HPLT v2, CC100, Somali Wikipedia) through a six-stage reproducible pipeline. We release (i) the corpus, (ii) a matched BPE-16K tokenizer, and (iii) the first public side-by-side Somali benchmark of three production language identifiers. Our measurements reveal concrete quality defects in existing distributions: HPLT v2's "cleaned" Somali release retains 17.3% byte-exact duplicates, 56.1% of its documents contain fixable mojibake, and 10.7% of its byte-unique documents are near-duplicates at Jaccard tau=0.80. Our BPE-16K tokenizer emits 40.2% fewer tokens than GPT-4's cl100k_base on FLORES-200 Somali devtest as a tokenizer-level measurement; downstream language-model perplexity comparisons are deferred to a follow-up release.

2605.18229 2026-05-19 cs.LG cs.AI 版本更新

Are Sparse Autoencoder Benchmarks Reliable?

稀疏自编码基准测试是否可靠?

David Chanin

发表机构 * Decode Research, MATS, UCL(Decode研究、MATS、伦敦大学学院)

AI总结 该研究评估了稀疏自编码(SAE)基准测试的可靠性,发现其中两个指标在多个角度下表现不佳,其他指标也未能达到预期效果,表明需要改进SAE基准测试。

详情
AI中文摘要

稀疏自编码(SAEs)是大型语言模型的核心可解释性工具,其进展依赖于能够可靠区分更好和更差SAE的基准测试。我们通过三种互补的视角审计了SAEBench中SAE质量指标:固定SAE上的重新播种噪声、合成SAE上的真实相关性以及训练轨迹的可区分性。我们发现,两个指标,即目标探测扰动(TPP)和虚假相关性消除(SCR),在它们的典型设置下未能通过多个视角,不应用于评估SAE。其他指标显示出更高的重新播种噪声和更低的可区分性,比领域假设的要差。sae-probes变体的k-稀疏探测是我们在测试中发现最可靠的指标,但即使sae-probes也难以区分同一体系结构的不同变体。我们的结果表明,领域需要更好的SAE基准测试。

英文摘要

Sparse autoencoders (SAEs) are a core interpretability tool for large language models, and progress on SAE architectures depends on benchmarks that reliably distinguish better SAEs from worse ones. We audit the SAE quality metrics in SAEBench, the de-facto standard SAE evaluation suite, through three complementary lenses: reseed noise on a fixed SAE, ground-truth correlation on synthetic SAEs, and discriminability across training trajectories. We find that two of these metrics, Targeted Probe Perturbation (TPP) and Spurious Correlation Removal (SCR), fail multiple lenses at their canonical settings and should not be used to evaluate SAEs. The other metrics show higher reseed noise and lower discriminability than the field assumes. The sae-probes variant of $k$-sparse probing is the most reliable metric we tested, but even sae-probes struggles to separate variants of the same SAE architecture. Our results show the field needs better SAE benchmarks.

2605.18226 2026-05-19 cs.CL cs.AI 版本更新

Context Memorization for Efficient Long Context Generation

上下文记忆用于高效长上下文生成

Yasuyuki Okoshi, Hao Mark Chen, Guanxi Lu, Hongxiang Fan, Masato Motomura, Daichi Fujiki

发表机构 * Institute of Science Tokyo, Japan(东京科学研究所) Imperial College London, UK(伦敦帝国学院)

AI总结 本文提出了一种无需训练的上下文记忆方法,通过将前缀外部化为轻量级的预计算注意力状态查找表,以提高长上下文生成的准确性和效率,同时减少注意力计算的延迟。

详情
AI中文摘要

现代大型语言模型(LLM)应用越来越多地依赖长前缀来在推理时控制模型行为。尽管增强前缀的推理是有效的,但存在两个结构限制:i)随着生成过程的进行,前缀的影响逐渐减弱;ii)对前缀的注意力计算与长度成线性关系。现有方法要么在注意力中保留前缀同时压缩它,要么通过梯度训练将它内部化到模型参数中。前者在推理时仍然会关注到前缀,而后者训练成本高且不适合前缀更新。为了解决这些问题,我们提出了注意力状态记忆,这是一种无需训练的方法,将前缀外部化为一个轻量级的预计算注意力状态的查找表。在ManyICLBench上使用LLaMA-3.1-8B,我们的方法在1K-8K内存预算下比上下文学习提高了准确性,同时在8K时将注意力延迟减少了1.36倍,并在NBA基准测试中仅使用其内存足迹的20%就超过了全注意力RAG性能。

英文摘要

Modern large language model (LLM) applications increasingly rely on long conditioning prefixes to control model behavior at inference time. While prefix-augmented inference is effective, it incurs two structural limitations: i) the prefix's influence fades as generation proceeds, and ii) attention computation over the prefix scales linearly with its length. Existing approaches either keep the prefix in attention while compressing it, or internalize it into model parameters through gradient-based training. The former still attends to the prefix at inference, while the latter is training-intensive and ill-suited to prefix updates. To address these issues, we propose attention-state memory, a training-free approach that externalizes the prefix into a lightweight, lookup-based memory of precomputed attention states between prefix and query tokens. On ManyICLBench with LLaMA-3.1-8B, our method improves accuracy over in-context learning at 1K-8K memory budgets while reducing attention latency by 1.36x at 8K, and surpasses full-attention RAG performance on NBA benchmark using only 20% of its memory footprint.

2605.18211 2026-05-19 cs.CL cs.AI 版本更新

Leveraging Graph Structure in Seq2Seq Models for Knowledge Graph Link Prediction

利用图结构在序列到序列模型中进行知识图谱链接预测

Luu Huu Phuc, Ratan Bahadur Thapa, Mojtaba Nayyeri, Jingcheng Wu, Evgeny Kharlamov, Steffen Staab

发表机构 * Analytic Computing, KI, University of Stuttgart, Stuttgart, Germany(斯图加特大学分析计算研究所) Bosch Center for Artificial Intelligence, Stuttgart, Germany(博世人工智能中心) WAIS, University of Southampton, United Kingdom(南安普顿大学WAIS)

AI总结 本文提出了一种结合图结构的序列到序列模型GA-S2S,通过整合T5-small编码器解码器与关系图注意力网络RGAT,提升知识图谱链接预测的性能。

Comments 9 pages, 1 figure, 2 tables. Preprint of a paper accepted at the 5th Workshop on LLM-Integrated Knowledge Graph Generation from Text (TEXT2KG), co-located with ESWC 2026, May 10--14, 2026, Dubrovnik, Croatia

详情
AI中文摘要

我们介绍了图增强的序列到序列(GA-S2S)框架,这是一种新的框架,将T5-small编码器解码器与关系图注意力网络(RGAT)相结合,以提高知识图谱的链接预测。虽然现有的序列到序列模型仅依赖于实体和关系的表面描述,并且在最理想的情况下,将查询实体的邻居扁平化为一个线性序列,从而丢弃了内在的图结构,GA-S2S联合编码文本特征和查询实体周围的完整k跳子图拓扑。通过将原始编码器输出与RGAT的关系感知嵌入相结合,我们的模型捕捉并利用了更丰富的多跳关系模式和文本信息。在CoDEx数据集上的初步实验表明,GA-S2S在链接预测准确性上优于竞争的序列到序列基线模型,达到了高达19%的相对增益。

英文摘要

We introduce Graph-Augmented Sequence-to-Sequence (GA-S2S), a novel framework that integrates a T5-small encoder-decoder with a Relational Graph Attention Network (RGAT) to improve link prediction in knowledge graphs. While existing Seq2Seq models rely solely on surface-level textual descriptions of entities and relations and at best, flatten the neighborhoods of a query entity into a single linear sequence, thereby discarding the inherent graph structure, GA-S2S jointly encodes both textual features and the full $k$-hop subgraph topology surrounding the query entity. By integrating raw encoder outputs with RGAT's relation-aware embeddings, our model captures and leverages richer multi-hop relational patterns and textual information. Our preliminary experiments on the CoDEx dataset demonstrate that GA-S2S outperforms competitive Seq2Seq-based baseline models, achieving up to a 19\% relative gain in link prediction accuracy.

2605.18209 2026-05-19 cs.CV cs.AI 版本更新

SPATIOROUTE: Dynamic Prompt Routing for Zero-Shot Spatial Reasoning

SPATIOROUTE: 动态提示路由用于零样本空间推理

Pawat Chunhachatrachai, Gueter Josmy Faure, Hung-Ting Su, Winston H. Hsu

发表机构 * National Taiwan University(台湾国立大学) Delta Robotics Innovation Center(Delta机器人创新中心)

AI总结 本文提出SpatioRoute,一种动态提示生成方法,通过语义定制的提示模板路由问题,无需额外训练或3D传感器输入,在零样本设置下提升空间推理性能,同时发现Chain-of-Thought提示在空间视频理解中效果不佳。

Comments 10 pages, 2 figures, 2nd Workshop on 3D-LLM/VLA, CVPR 2026

详情
AI中文摘要

在眼动视频上的空间问题回答是一项具有挑战性的任务,需要视觉-语言模型(VLMs)对3D物体位置、场景可行性和方向关系进行推理,特别是在无任务特定微调的零样本设置中。我们引入SpatioRoute,一种动态提示生成方法,将每个输入问题路由到语义定制的提示模板,无需任何额外训练、微调或3D传感器输入。SpatioRoute在两个互补模式中运行:SpatioRoute-R,一种基于规则的路由器,将问题类型(如What、Is、How、Can、Which)确定性地映射到专门的提示模板;以及SpatioRoute-L,一种基于LLM的方法,仅从问题和情境上下文生成任务特定的提示,无需在路由时使用视频输入。我们评估了SpatioRoute在SQA3D基准测试上跨不同模型家族的VLMs。SpatioRoute在固定提示基线上实现了高达5%的总体准确率提升,建立了在不需3D点云输入的情况下零样本视频-only空间VQA的新状态。此外,我们发现Chain-of-Thought(CoT)提示,通过Think it Twice架构实现,在此设置中对Qwen系列模型性能有持续下降,证实了问题感知路由比统一推理指令在空间视频理解中更有效。

英文摘要

Spatial question answering over egocentric video is a challenging task that requires Vision-Language Models (VLMs) to reason about 3D object positions, scene affordances, and directional relationships, particularly in the zero-shot setting where no task-specific fine-tuning is available. We introduce SpatioRoute, a dynamic prompt generation approach that routes each incoming question to a semantically tailored prompt template -- without any additional training, fine-tuning, or 3D sensor input. SpatioRoute operates in two complementary modes: SpatioRoute-R, a rule-based router that deterministically maps question typologies (e.g., What, Is, How, Can, Which) to specialized prompt templates; and SpatioRoute-L, an LLM-driven approach that generates task-specific prompts from the question and situational context alone, with no video input at routing time. We evaluate SpatioRoute on the SQA3D benchmark across VLMs spanning model families. SpatioRoute achieves consistent overall accuracy gains up to 5% over fixed prompt baselines, establishing a new state-of-the-art for zero-shot video-only spatial VQA without requiring 3D point-cloud inputs. As an additional finding, we observe that Chain-of-Thought (CoT) prompting, implemented via the Think it Twice architecture, consistently degrades performance in this setting on Qwen series models, confirming that question-aware routing is more effective than uniform reasoning instructions for spatial video understanding.

2605.18202 2026-05-19 cs.LG cs.AI 版本更新

Concise and Logically Consistent Conformal Sets for Neuro-Symbolic Concept-Based Models

简洁且逻辑一致的神经符号概念模型的符合集

Samuele Bortolotti, Emanuele Marconato, Andrea Pugnana, Andrea Passerini, Stefano Teso

发表机构 * Department of Information Engineering and Computer Science, University of Trento, Italy(特伦托大学信息工程与计算机科学系) CIMeC, University of Trento, Rovereto, Italy(特伦托大学罗韦雷托CIMeC)

AI总结 本文提出COCOCO框架,通过整合符合预测方法,解决神经符号概念模型中标签和概念预测过于自信的问题,满足一致性、覆盖性和简洁性三个要求,提升模型的可靠性。

详情
AI中文摘要

神经符号概念模型(NeSy-CBMs)是一类将神经网络与符号推理相结合的架构,用于在高风险应用中提高可靠性。它们通过从输入中提取高层概念,然后在给定的逻辑约束下推断任务标签。然而,其标签和概念预测可能过于自信,使利益相关者难以判断何时可以信任模型的决策。本文通过整合符合预测(CP)框架,提供严格的分布无关覆盖保证,正式化了三个要求——一致性、覆盖性和简洁性,证明现有方法至少在一项上不足。然后引入COCOCO,一种后处理框架,联合符合概念和标签,并通过单个推断-反推修订步骤进行协调。COCOCO满足所有三个要求,保留分布无关覆盖,对不完美的知识具有鲁棒性,并支持用户指定的大小预算。在8个数据集上的实验显示,COCOCO在性能和集合大小方面优于竞争对手和自然基线。

英文摘要

Neuro-Symbolic Concept-based Models (NeSy-CBMs) are a family of architectures that integrate neural networks with symbolic reasoning for enhanced reliability in high-stakes applications. They work by first extracting high-level concepts from the input and then inferring a task label from these compatibly with given logical constraints. Yet, their label and concept predictions can be overconfident, making it difficult for stakeholders to gauge when the model's decisions can be trusted. We address this issue by integrating ideas from Conformal Prediction (CP), a framework providing rigorous, distribution-free coverage guarantees. We formalize three desiderata -- consistency, coverage, and conciseness -- that any conformal method for NeSy-CBMs should satisfy, and show that existing approaches fall short of at least one. We then introduce COCOCO, a post-hoc framework that conformalizes concepts and labels jointly and reconciles them via a single deduction-abduction revision step. COCOCO satisfies all three desiderata, retains distribution-free coverage, is robust to imperfect knowledge and supports user-specified size budgets. Our experiments on 8 data sets highlight how COCOCO compares favorably against competitors and natural baselines in terms of performance and set size.

2605.18199 2026-05-19 cs.IR cs.AI 版本更新

PIPER: Content-Based Table Search via profiling and LLM-Generated Pseudoqueries

PIPER: 通过 profiling 和 LLM 生成的伪查询实现基于内容的表格搜索

Riccardo Terrenzi, Matteo Falconi, Serkan Ayvaz, Pierluigi Plebani

发表机构 * Centre for Industrial Software, University of Southern Denmark(丹麦南部大学工业软件中心) Department of Electronics, Information and Bioengineering, Politecnico di Milano(米兰理工学院电子、信息与生物工程系)

AI总结 针对数据湖、数据空间和开放数据门户中表格数据集快速增长的问题,PIPER 提出了一种基于内容的表格搜索方法,利用表格 profile 和 LLM 生成的伪查询进行密集检索,优于传统元数据方法和 TableQA 检索方法,展示了 LLM 基于内容建模在表格数据集搜索中的价值。

Comments 15 pages, 3 figures, accepted at DEXA'26

详情
AI中文摘要

随着数据湖、数据空间和开放数据门户中表格数据集的快速增长,有效的数据集搜索对于重用和分析至关重要。现有搜索系统主要依赖元数据,这在很大程度上不完整或质量低,尤其是对于那些含义依赖于模式和单元格值的表格。近年来,大型语言模型(LLMs)的进步使得表格能够获得更丰富的基于内容的表示。然而,先前基于 LLM 的检索方法主要集中在表格问答上,目标是选择一个表格来回答问题,而不是检索和排序相关数据集。我们提出 PIPER,一种用于表格数据集的基于内容的检索方法,利用表格 profile 和 LLM 生成的查询嵌入进行密集检索。PIPER 专为元数据较差的环境设计,优于传统元数据基于的基线和强大的 TableQA 检索方法,证明了 LLM 基于内容建模在表格数据集搜索中的价值。

英文摘要

The rapid growth of tabular datasets in data lakes, data spaces, and open data portals makes effective dataset search essential for reuse and analysis. Existing search systems rely mainly on metadata, which is often incomplete or low quality, especially for tables whose meaning depends on both schema and cell values. Recent advances in Large Language Models (LLMs) enable richer, content-based representations of tables. However, prior LLM-based retrieval methods have focused on Table Question Answering, where the goal is to select a single table to answer a question, rather than retrieve and rank relevant datasets. We propose PIPER, a content-driven retrieval method for tabular datasets that uses table profiles and LLM-generated queries embedded for dense retrieval. Designed for dataset search in poor-metadata settings, PIPER outperforms both classical metadata-based baselines and strong TableQA retrieval methods, demonstrating the value of LLM-based content modeling for tabular dataset search.

2605.18197 2026-05-19 cs.RO cs.AI cs.CV 版本更新

RGB-only Active 3D Scene Graph Generation for Indoor Mobile Robots

仅RGB的主动3D场景图生成用于室内移动机器人

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

发表机构 * Mobile Robotics Group (MRG)(移动机器人小组) Visual and Multimodal Applied Learning Lab (VANDAL)(视觉与多模态应用学习实验室)

AI总结 本文提出了一种仅使用RGB输入的主动3D场景图生成方法,通过统一感知与规划的结构化表示,解决了传统方法对专用传感器的依赖问题,并在Replica数据集上验证了其有效性。

详情
AI中文摘要

当前3D场景图生成方法依赖于专用深度传感器,如LiDAR或RGB-D相机,限制了部署到专用机器人平台,并排除了仅使用RGB相机的场景,如固定外部基础设施。现有流程通常基于被动收集的观测轨迹,而不是基于部分构建的场景表示选择视角,因此无法有效利用图中编码的语义和空间信息。本文提出了一种完全视觉框架,用于从仅RGB输入中主动、逐步构建3D场景图,解决了这两个限制。所提出的方法围绕共享的结构化表示统一感知和规划,该表示捕捉了物体语义、3D几何、关系上下文以及多视角信息。由于该框架是硬件无关的,并且仅依赖RGB观测,因此可以将机载机器人相机和固定外部相机的输入整合到同一表示中。在Replica数据集上的实验表明,仅RGB的流程在F1分数上与使用真实深度的基线相当。在ReplicaCAD上的主动探索实验进一步表明,语义驱动的视角选择在相同探索预算下能够检测到比基于几何前沿的基线多超过两倍的物体。最后,外部相机设置表明,互补的RGB视角可以有效启动场景图并提高上下文理解,而无需额外的探索成本。

英文摘要

Current approaches to 3D scene graph generation rely on dedicated depth sensors, such as LiDAR or RGB-D cameras, for metric 3D reconstruction. This limits deployment to specialized robotic platforms and excludes settings where only RGB cameras are available, such as fixed external infrastructure. Existing pipelines also typically operate on passively collected observation trajectories, rather than selecting viewpoints based on the partially built scene representation, and therefore fail to effectively exploit the semantic and spatial information encoded within the graph during exploration. This paper presents a fully visual framework for the active, incremental construction of 3D scene graphs from RGB input only, addressing both limitations. The proposed approach unifies perception and planning around a shared structured representation that captures object semantics, 3D geometry, relational context, and information from multiple viewpoints. Because the framework is hardware-agnostic and relies only on RGB observations, it can incorporate inputs from both onboard robot cameras and fixed external cameras within the same representation. Experiments on the Replica dataset show that the RGB-only pipeline achieves F1-score parity with baselines using ground-truth depth. Active exploration experiments on ReplicaCAD further show that semantic-driven viewpoint selection detects more than twice as many objects as a geometric frontier-based baseline under the same exploration budget. Finally, the external-camera setting demonstrates that complementary RGB views can effectively bootstrap the scene graph and improve contextual understanding at no additional exploration cost.

2605.18194 2026-05-19 cs.AI cs.CV 版本更新

Beyond the Cartesian Illusion: Testing Two-Stage Multi-Modal Theory of Mind under Perceptual Bottlenecks

超越笛卡尔错觉:在感知瓶颈下测试双阶段多模态理论 of Mind

Yajing Zhou, Xiangyu Kong

发表机构 * College of Computer Science, Beijing Information Science and Technology University(北京信息科技大学计算机学院)

AI总结 本文研究了多模态大语言模型在感知瓶颈下的双阶段空间推理能力,提出了一种基于锚点的具身体验空间分解链式推理方法,以解决空间对称性和视角模糊性问题,从而提升多模态理论 of Mind 的表现。

Comments 17 pages, 3 figures

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)在一般推理方面表现出色,但其具身空间智能仍受“笛卡尔错觉”限制——依赖文本概率分布,缺乏基于3D拓扑的理解。这种局限性在多智能体环境中尤为明显,这些环境不仅需要场景感知,还需要第二阶的理论 of Mind(ToM)。具体而言,智能体A必须推断智能体B对环境的看法,这严格受B的物理朝向和感官限制影响。在本文中,我们通过一个新颖的音频-视觉任务来探测MLLMs的双阶段空间推理极限:要求A预测B对A相对位置的估计。为此,我们提出了一个认识感知瓶颈模块,摒弃了刚性的规则坐标转换。相反,我们引入了基于锚点的具身体验空间分解链式推理(CoT)。该方法引导MLLMs进行“几何到语义”的投影,迫使它首先建立B的局部坐标系统,然后根据A是否在B的视觉视锥内动态加权视觉和听觉模态。广泛的评估表明,尽管当前MLLMs在空间对称性和视外模糊性方面根本上存在困难(建立了一个严格的零样本基准线42%准确率),我们的感知受限推理链在纯自体心和 allocentric 基准上表现稳健。通过系统地评估这些感知瓶颈,我们的工作揭示了当前MLLM空间推理的极限,并为具身AI中的认识、模态感知推理建立了基础范式。

英文摘要

While Multi-Modal Large Language Models (MLLMs) demonstrate impressive capabilities in general reasoning, their embodied spatial intelligence remains hampered by a "Cartesian Illusion" - a reliance on text-based probability distributions that lack grounded, 3D topological understanding. This limitation is starkly exposed in multi-agent environments, which demand more than just scene perception; they require second-order Theory of Mind (ToM). Specifically, an Agent A must be able to infer Agent B's belief about the environment, governed strictly by Agent B's physical orientation and sensory limitations. In this paper, we probe the limits of two-stage spatial inference in MLLMs through a novel audio-visual task: requiring Agent A to predict Agent B's estimation of A's relative location. To solve this, we propose an Epistemic Sensory Bottleneck module that abandons rigid, rule-based coordinate transformations. Instead, we introduce an Anchor-Based Embodied Spatial Decomposition Chain-of-Thought (CoT). This guides the MLLM through a "geometric-to-semantic" projection, forcing it to first establish B's local coordinate system and then dynamically weight visual and auditory modalities based on whether A falls within B's visual frustum. Extensive evaluations reveal that while current MLLMs fundamentally struggle with spatial symmetry and out-of-view ambiguities (establishing a rigorous zero-shot baseline of 42% accuracy), our sensory-bounded reasoning chain robustly outperforms pure egocentric and allocentric baselines. By systematically benchmarking these perceptual bottlenecks, our work exposes the current limits of MLLM spatial reasoning and establishes a foundational paradigm for epistemic, modality-aware inference in Embodied AI.

2605.18191 2026-05-19 cs.AI 版本更新

Pairwise Preference Reward and Group-Based Diversity Enhancement for Superior Open-Ended Generation

成对偏好奖励与基于群体的多样性增强以实现更优的开放生成

Guining Cao, Jiaxin Peng, Chu Zeng, Yu Zhao, Shuangyong Song, Yongxiang

发表机构 * Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd(星辰AGI实验室,中国电信人工智能技术(北京)有限公司) School of Software and Microelectronics, Peking University(北京大学软件与微电子学院) Tsinghua University(清华大学)

AI总结 本文提出了一种适用于开放生成任务的强化学习方法PPR-GDE,通过成对偏好奖励和基于群体的多样性增强,解决了传统方法在开放生成任务中存在验证困难、计算成本高以及多样性下降的问题。

Comments Work in progress

详情
AI中文摘要

当前的强化学习(RL)方法在可验证的环境中具有广泛的应用和强大的能力,因为可以提供标量奖励。然而,在开放生成任务中,验证响应的正确性仍然具有挑战性,训练奖励模型会带来显著的计算和标注成本。此外,强化学习(RLVR)往往导致多样性崩溃,并产生刻板或僵化的输出,这在开放领域场景中尤其不可取。我们提出了成对偏好奖励与基于群体的多样性增强(PPR-GDE),一种更适合开放生成的RL方法。PPR-GDE不需要标量奖励,并将群体层面的多样性纳入奖励信号中,它通过成对偏好奖励保留主观评估的比较结构,通过重复比较并交换响应顺序来减轻裁判位置偏差,并引入一个基于群体的多样性奖励,明确鼓励响应组内的语义分散,所有这些奖励信号都被整合到一个统一的群体相对策略优化目标中。我们将在角色扮演任务上实例化PPR-GDE,实验表明PPR-GDE在对齐质量和表达多样性方面优于强大的RL基线。进一步分析显示,成对偏好对于主观视角的偏好对齐至关重要,而多样性度量在实现卓越的表达多样性和更广泛的语义覆盖方面起着关键作用。

英文摘要

Current reinforcement learning(RL) methods are broadly applicable and powerful in verifiable settings where scalar rewards can be provided. However, in open-ended generation tasks, verifying the correctness of responses remains challenging, and training reward models incurs substantial computational and annotation costs. Moreover, reinforcement learning (RLVR) often leads to diversity collapse and produces stereotypical or rigid outputs, outcomes that are particularly undesirable in open-domain scenarios. We propose Pairwise Preference Reward and Group-based Diversity Enhancement (PPR-GDE), a RL method that is more suitable for open-ended generation. PPR-GDE does not require scalar rewards and incorporates group-level diversity into the reward signal, it preserves the comparative structure of subjective evaluation through a pairwise preference reward, mitigates judge position bias via repeated comparisons with swapped response order, and introduces a group-based diversity reward that explicitly encourages semantic dispersion within a response group, all of these reward signals are integrated into a unified group-relative policy optimization objective. We instantiate PPR-GDE on role-playing task, experiments show that PPR-GDE achieves a better alignment quality as well as expressive diversity than strong RL baselines. Further analysis shows that pairwise preference is critical for preference alignment in subjective perspective, while the diversity metric plays an essential role in achieving superior expressive diversity and broader semantic coverage.

2605.18184 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Fixed External Cameras as Common Prior Maps for Active 3D Scene Graph Generation

固定外部摄像头作为主动3D场景图生成的共同先验地图

Giorgia Modi, Davide Buoso, Giuseppe Averta, Daniele De Martini

发表机构 * Mobile Robotics Group (MRG)(移动机器人组) Visual and Multimodal Applied Learning Lab (VANDAL)(视觉与多模态应用学习实验室)

AI总结 本文提出利用固定外部RGB摄像头作为共同先验地图,以实现主动、渐进式的3D场景图生成,通过融合机器人 onboard 摄像头和固定外部摄像头的数据,提高场景理解的效率和准确性。

详情
AI中文摘要

常用的先验信息,如BIM模型、平面图和遥感图像,可以为自主机器人系统提供有价值的几何和语义上下文。在本文中,我们将固定外部RGB摄像头的观测视为共同先验地图(CPMs):环境的广角视图,在任何机器人运动开始之前初始化一个语义和几何场景先验。我们提出一个仅使用RGB的框架,用于主动、渐进式的3D场景图(3DSG)生成,该框架在单一硬件无关的管道中无缝融合来自机器人 onboard 摄像头和固定外部摄像头的观测。通过仅依赖RGB观测并通过前馈3D重建模型进行处理,系统将所有摄像头——机器人 onboard 或外部——视为相同,无需硬件修改。基于图的主动语义探索框架然后直接利用部分场景图,引导机器人向高语义不确定性区域前进,逐步完成和细化先验。实验表明,使用单个外部摄像头初始化场景图可使初始物体召回率提高高达+79%,并且先验的更丰富上下文显著提高了后续主动探索的效率。

英文摘要

Commonly available prior information, such as BIM models, floor plans, and remote sensing images, can provide valuable geometric and semantic context for autonomous robotic systems. In this paper, we treat observations from fixed external RGB cameras as Common Prior Maps (CPMs): wide-field views of the environment that initialize a semantic and geometric scene prior before any robot motion begins. We present an RGB-only framework for active, incremental 3D scene graph (3DSG) generation that seamlessly fuses observations from both onboard robot cameras and fixed external cameras within a single hardware-agnostic pipeline. By relying solely on RGB observations processed by a feed-forward 3D reconstruction model, the system treats all cameras - onboard or external - identically, requiring no hardware modifications. A graph-based active semantic exploration framework then directly leverages the partial scene graph to guide the robot toward regions of high semantic uncertainty, progressively completing and refining the prior. Experiments demonstrate that bootstrapping the scene graph with even a single external camera increases initial object recall by up to +79%, and that the richer context of the prior significantly improves the efficiency of subsequent active exploration.

2605.18181 2026-05-19 cs.AI cs.CL 版本更新

Scalable Environments Drive Generalizable Agents

可扩展环境驱动可泛化的智能体

Jiayi Zhang, Fanqi Kong, Guibin Zhang, Maojia Song, Zhaoyang Yu, Jianhao Ruan, Jinyu Xiang, Bang Liu, Chenglin Wu, Yuyu Luo

发表机构 * HKUST(GZ)(香港科技大学(广州)) DeepWisdom PKU(北京大学) NUS(新加坡国立大学) SUTD(新加坡科技设计大学) UdeM & Mila(蒙特利尔大学及Mila)

AI总结 本文探讨了可泛化智能体需要通过可扩展环境来适应多样任务和未见环境的问题,提出环境扩展的核心挑战,并提出了统一的分类方法和可扩展环境的构建范式。

详情
AI中文摘要

可泛化智能体应能够适应多样任务和未见环境,而不仅仅是训练分布内的任务。本文认为这种泛化需要环境扩展:扩展智能体交互的可执行规则集分布,而不是仅增加轨迹或任务数量。当前扩展实践主要集中在固定交互规则下收集更多经验或更广的任务集,导致智能体在底层接口、动态、观察或反馈信号变化时变得脆弱。因此,核心挑战是世界层面的分布偏移:智能体需要系统性地暴露于具有显著不同可执行规则集的环境中。为澄清这一挑战,我们提出了一个统一的分类法,通过主要交付成果和可执行规则集的变化将轨迹扩展、任务扩展和环境扩展区分开来。基于此分类法,我们综合了可扩展环境的构建范式,对比了优先考虑可控性和可验证性的程序生成器与提供更广泛覆盖和开放性的生成世界模型。我们进一步概述了如何将环境扩展与具有状态的学习机制结合,强调学习的更新规则用于跨环境适应。最后,我们讨论了其他观点并论证可扩展环境是实现稳健通用智能体的必要基础。

英文摘要

Generalizable agents should adapt to diverse tasks and unseen environments beyond their training distribution. This position paper argues that such generalization requires environment scaling: expanding the distribution of executable rule-sets that agents interact with, rather than only increasing trajectories or tasks within fixed benchmarks. Current scaling practices largely focus on collecting more experience or broader task sets under fixed interaction rules, leaving agents brittle when underlying interfaces, dynamics, observations, or feedback signals change. The core challenge is therefore a world-level distribution shift: agents need systematic exposure to environments with meaningfully different executable rule-sets. To clarify this challenge, we propose a unified taxonomy that separates trajectory scaling, task scaling, and environment scaling by their primary deliverables and by what changes in the executable rule-set. Building on this taxonomy, we synthesize construction paradigms for scalable environments, contrasting programmatic generators that prioritize controllability and verifiability with generative world models that offer broader coverage and open-endedness. We further outline how environment scaling can be coupled with stateful learning mechanisms, emphasizing learned update rules for cross-environment adaptation. We conclude by discussing alternative perspectives and argue that scalable environments provide the essential substrate for measurable and controllable progress toward robust general agents.

2605.18176 2026-05-19 cs.CV cs.AI 版本更新

MARS: Technical Report for the CASTLE Challenge at EgoVis 2026

MARS:EgoVis 2026 CASTLE挑战的技术报告

Haoyu Zhang, Qiaohui Chu, Yisen Feng, Meng Liu, Weili Guan, Yaowei Wang, Liqiang Nie

发表机构 * Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) Pengcheng Laboratory(鹏城实验室) Shandong Jianzhu University(山东建筑大学)

AI总结 本文提出MARS系统,用于EgoVis 2026的CASTLE挑战,通过多模态代理推理解决需要多源信息的复杂问题,核心方法是多模态证据选择,主要贡献是实现了在多源数据上的有效推理。

Comments The Runner-up Solution for CASTLE Challenge @ EgoVis 2026

详情
AI中文摘要

本报告介绍了MARS,即多模态代理推理与源选择系统,是参与EgoVis 2026 CASTLE挑战的系统。参赛者必须在CASTLE 2024数据集上回答185个封闭式问题。与以往单视频眼动基准不同,CASTLE要求对四天活动、15个同步视角、官方 transcripts 及多种辅助模态(包括个人照片、辅助视频、注视、热成像和心率测量)进行推理。MARS将任务视为多模态源的代理证据选择问题,而非纯粹文本流程。MARS首先遵循官方CASTLE目录组织,从视频和 transcripts 两个主要来源以及注视、心率、照片和热成像四个辅助来源构建证据记忆。长视频仅转换为caption和基于DeepSeek的摘要,因为CASTLE视频太长无法直接输入模型上下文;此步骤压缩时间证据,同时保留照片和其他辅助媒体作为源特定证据。在推理时,一个GPT-5.4决策代理反复选择是否继续推理、请求特定缺失模态、生成答案或回退到随机选项,当证据不足时。所得到的系统在最终CASTLE挑战排行榜上获得第二名。我们的代码可在https://github.com/Hyu-Zhang/MARS获取。

英文摘要

This report presents MARS, short for Multimodal Agentic Reasoning with Source selection, our system for the CASTLE Challenge at EgoVis 2026. Participants must answer 185 closed-form questions over the CASTLE 2024 dataset. In contrast to prior single-video egocentric benchmarks, CASTLE requires reasoning over four days of activity, 15 synchronized perspectives, official transcripts, and multiple auxiliary modalities, including personal photos, auxiliary videos, gaze, thermal imagery, and heartrate measurements. MARS therefore treats the task as an agentic evidence-selection problem over multimodal sources rather than a purely text-only pipeline. MARS first follows the official CASTLE directory organization to build evidence memories from two primary sources, videos and transcripts, and four auxiliary sources, gaze, heartrate, photos, and thermal imagery. Long videos are converted into captions and DeepSeek-based summaries only because CASTLE videos are too long to fit directly into the model context for every question; this step compresses temporal evidence while keeping photos and other auxiliary media available as source-specific evidence. At inference time, a GPT-5.4 decision agent repeatedly chooses whether to continue reasoning, request a specific missing modality, produce an answer, or fall back to a random option when the evidence remains insufficient. The resulting system achieved second place on the final CASTLE Challenge leaderboard. Our codes are available at https://github.com/Hyu-Zhang/MARS.

2605.18163 2026-05-19 cs.AI cs.CL 版本更新

TRACE: Trajectory Correction from Cross-layer Evidence for Hallucination Reduction

TRACE: 通过跨层证据进行轨迹修正以减少幻觉

Tej Sanibh Ranade

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出TRACE算法,通过跨层证据在推理时修正LLM中的幻觉,无需训练或标注,通过内部证据选择修正策略,提升多个基准测试的性能。

Comments 25 pages, 8 figures, 4 tables

详情
AI中文摘要

幻觉修正并非单向问题。我们表明中间层既不总是比最终层更诚实,也不总是更不可信。然而,幻觉减少通常通过固定干预形式实现:对比一层另一层,引导沿诚实方向,或依赖外部证据。这种框架结构不完整。跨层事实证据并不均匀演变:在某些失败中,内部真实支持存在并随后被压制,而在其他情况下,候选竞争在深度上保持 genuinely 多方向性,因此没有单一符号标量家族通常足够。我们引入TRACE(Trajectory Correction from Cross-layer Evidence for Hallucination Reduction),一种确定性、无训练的算法,在LLM自身前向传递中,通过从每个输入的跨层候选轨迹中推导出修正层和适当的修正操作符,在推理时修正幻觉。在一种冻结超参数设置下,TRACE仅使用模型内部证据,在8个模型家族和3个事实性基准测试中评估为单一通用算法,对15个模型进行评估,提升所有评估单元,产生+12.26 MC1点和+8.65 MC2风格点的平均增益,无退化,增益达到+47.20 MC1和+43.38 MC2风格点。该方法不使用标签、检索、预训练、微调或每模型校准。

英文摘要

Hallucination correction is not a one-direction problem. We show that intermediate layers are neither uniformly more truthful than final layers nor uniformly less trustworthy. Yet hallucination reduction is usually instantiated through one fixed intervention form: contrast one layer against another, steer along a truthfulness direction, or defer to external evidence. This framing is structurally incomplete. Cross-layer factual evidence does not evolve uniformly: in some failures truthful support is present internally and later suppressed, whereas in others candidate competition remains genuinely multi-directional across depth, so no single signed scalar family is generally sufficient. We introduce Trajectory Correction from Cross-layer Evidence for Hallucination Reduction (TRACE), a deterministic, training-free algorithm which corrects hallucinations at inference time by deriving both the corrective layer and the appropriate correction operator from each input's cross-layer candidate trajectory inside the LLM's own forward pass. Under one frozen hyperparameter setting, TRACE selects among scalar reversal, earlier-state recovery, and candidate-space correction using only model-internal evidence. Evaluated as a single universal algorithm across 15 models, 8 model families, and 3 factuality benchmarks, TRACE improves every evaluation cell, yielding mean gains of +12.26 MC1 points and +8.65 MC2-style points with no regressions, with gains reaching +47.20 MC1 and +43.38 MC2-style points. The method uses no labels, retrieval, pretraining, finetuning, or per-model calibration.

2605.18162 2026-05-19 cs.CV cs.AI 版本更新

Self-Evolving Spatial Reasoning in Vision Language Models via Geometric Logic Consistency

通过几何逻辑一致性实现视觉语言模型中的自演化空间推理

Junming Liu, Yuqi Li, Yifei Sun, Maonan Wang, Piotr Koniusz, Yirong Chen, Ding Wang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The City University of New York(纽约城市大学) The Chinese University of Hong Kong(香港中文大学) Data61 CSIRO(Data61澳大利亚国家科学研究院) University of New South Wales(新南威尔士大学) Australian National University(澳大利亚国立大学)

AI总结 本文提出SAGE框架,通过几何和语言二元操作在视觉语言模型中实现自演化空间推理,提升模型在空间推理任务中的鲁棒性和泛化能力。

Comments 23 pages, 7 figures, 3 tables

详情
AI中文摘要

视觉语言模型(VLMs)在视觉和语言任务上取得了显著进展,但其空间推理能力仍然脆弱:能够正确回答原始输入的模型在面对具有可预测答案映射的配对变换时仍可能失败,揭示了实例级正确性与鲁棒空间推理之间的差距。为此,我们提出空间对齐通过几何演化(SAGE),一种自演化框架,通过几何和语言二元操作在VLMs中强制逻辑一致性。SAGE将二元一致性作为辅助奖励纳入GRPO训练,鼓励模型在原始和变换输入之间产生逻辑一致的答案。一个动态操作池持续探测不一致,促进具有挑战性的操作并淘汰已掌握的操作,使训练聚焦于最有信息量的信号。SAGE具有模型无关性,比先前的GRPO方法更数据高效,并可作为轻量级的后训练阶段应用于任何现有的VLM。在视频和空间推理基准上的实验表明,SAGE在强基线模型上表现一致提升,并增强了对未见数据的泛化能力。

英文摘要

Vision-Language Models (VLMs) have made striking progress, yet their spatial reasoning remains fragile: models that answer an original input correctly can still fail under paired transformations with predictable answer mappings, revealing a gap between instance-level correctness and robust spatial reasoning. To address this, we propose Spatial Alignment via Geometric Evolution (SAGE), a self-evolving framework that enforces logical consistency in VLMs through geometric and linguistic duality operations. SAGE incorporates duality consistency as an auxiliary reward within GRPO training, encouraging models to produce logically coherent answers across original and transformed inputs. A dynamic operation pool continuously probes for inconsistencies, promoting challenging operations and retiring mastered ones, so that training focuses on the most informative signals. SAGE is model-agnostic, data-efficient compared to prior GRPO methods, and can be applied as a lightweight post-training stage to any existing VLM. Experiments on video and spatial reasoning benchmarks demonstrate consistent improvements over strong baselines and enhanced generalization to unseen data.

2605.16142 2026-05-19 cs.AI cs.LG 版本更新

Property-Guided LLM Program Synthesis for Planning

基于属性的LLM程序合成用于规划

André G. Pereira, Augusto B. Corrêa, Jendrik Seipp

发表机构 * Federal University of Rio Grande do Sul(里约格朗德杜斯尔大学) University of Oxford(牛津大学) Linköping University(林奈大学)

AI总结 本文研究了一种基于属性的LLM程序合成方法,通过检查候选程序是否满足形式定义的属性来指导LLM生成更高质量的程序,从而减少生成和评估成本。

详情
AI中文摘要

LLMs在程序合成中表现出色,能够发现超越先前解决方案的程序。然而,这些方法依赖于简单的数值评分来指示程序质量,如解决方案的值或通过的测试数量。因为评分无法指导程序为何失败,系统必须生成并评估许多候选程序,希望其中一些成功,从而增加LLM推理和评估成本。我们研究了一种不同的方法:属性引导的LLM程序合成。与评分程序后评估不同,我们检查候选程序是否满足形式定义的属性。当属性被违反时,我们提前停止评估并提供具体的反例,显示程序为何失败。这种反馈显著减少了程序生成的数量和评估成本,并可以指导LLM生成更强大的程序。我们在PDDL规划领域评估了这种方法,要求LLM合成直接启发函数:每个通过严格改进转换可达的状态都有严格改进的后继。具有这种属性的启发函数可使爬山算法直接到达目标状态。反例引导的修复循环生成一个候选程序,检查训练集上的属性,并返回第一个违反属性的案例。我们在十个规划领域上评估了这种方法,并使用分布外测试集。合成的启发函数在几乎所有测试任务中都是直接的,与最佳先前生成方法相比,我们的方法在每个领域平均生成的程序数量少七倍,无需使用搜索即可解决更多任务,并且评估候选人的计算量减少了几个数量级。只要问题允许可验证的属性,属性引导的LLM合成可以降低成本并提高程序质量。

英文摘要

LLMs have shown impressive success in program synthesis, discovering programs that surpass prior solutions. However, these approaches rely on simple numeric scores to signal program quality, such as the value of the solution or the number of passed tests. Because a score offers no guidance on why a program failed, the system must generate and evaluate many candidates hoping some succeed, increasing LLM inference and evaluation costs. We study a different approach: property-guided LLM program synthesis. Instead of scoring programs after evaluation, we check whether a candidate satisfies a formally defined property. When the property is violated, we stop the evaluation early and provide the LLM with a concrete counterexample showing exactly how the program failed. This feedback drastically reduces both the number of program generations and the evaluation cost, and can guide the LLM to generate stronger programs. We evaluate this approach on PDDL planning domains, asking the LLM to synthesize direct heuristic functions: every state reachable by strictly improving transitions has a strictly improving successor. A heuristic with this property leads hill-climbing algorithm directly to a goal state. A counterexample-guided repair loop generates one candidate program, checks the property over a training set, and returns the first case that violates the property. We evaluate our approach on ten planning domains with an out-of-distribution test set. The synthesized heuristics are effectively direct on virtually all test tasks, and compared to the best prior generation method our approach generates seven times fewer programs per domain on average, solves more tasks without using search, and requires several orders of magnitude less computation to evaluate candidates. Whenever a problem admits a verifiable property, property-guided LLM synthesis can reduce cost and improve program quality.

2605.15960 2026-05-19 cs.AI cs.LG 版本更新

Imperfect World Models are Exploitable

不完美的世界模型是可利用的

Logan Mondal Bhamidipaty, Esmeralda S. Whitammer, David Abel, Mykel J. Kochenderfer, Subramanian Ramamoorthy

发表机构 * University of Edinburgh(爱丁堡大学) Stanford University(斯坦福大学)

AI总结 本文提出了一种新的强化学习中模型利用的定义,指出世界模型如果暗示某种策略应严格优于另一种策略,而真实环境转移模型却暗示相反,那么该模型就是可利用的。研究通过发展奖励黑客和模型利用的一般理论,证明在大规模策略集上利用本质上是不可避免的,并揭示了安全规划在世界模型中的局限性。

Comments 17 pages, 3 figures, 2 tables; modified (fixed metadata)

详情
AI中文摘要

我们提出了一种新的强化学习中模型利用的定义。非正式地说,如果世界模型暗示一种策略应严格优于另一种策略,而环境的真实转移模型却暗示相反,则该世界模型是可利用的。我们通过类比先前对奖励黑客的描述,但发现相关的不可避免性证明无法转移到利用上。为克服这一障碍,我们发展了一种奖励黑客和模型利用的一般理论,证明在大规模策略集上利用本质上是不可避免的,并得出黑客作为特殊情况的相应结论。不幸的是,我们还发现保证在有限策略集上不可黑客的条件没有对应的防止利用的条件。因此,我们引入了一种放松的利用概念,并推导出一个安全的视野,在其中可以避免利用。总的来说,我们的结果建立了奖励黑客和模型利用之间的正式桥梁,并阐明了世界模型中安全规划的局限性。

英文摘要

We propose a novel definition of model exploitation in reinforcement learning. Informally, a world model is exploitable if it implies that one policy should be strictly preferred over another while the environment's true transition model implies the reverse. We analogize our definition with a prior characterization of reward hacking but show that the associated proof of inevitability does not transfer to exploitation. To overcome this obstruction, we develop a general theory of reward hacking and model exploitation that proves that exploitation is essentially unavoidable on large policy sets and yields the corresponding claim for hacking as a special case. Unfortunately, we also find that the conditions that guarantee unhackability in finite policy sets have no counterpart that precludes exploitation. Consequently, we introduce a relaxed notion of exploitation and derive a safe horizon within which it can be avoided. Taken together, our results establish a formal bridge between reward hacking and model exploitation and elucidate the limits of safe planning in world models.

2605.15407 2026-05-19 math.NA cs.AI cs.NA 版本更新

Amortized Energy-Based Bayesian Inference

平均能量基于的贝叶斯推断

Hojjat Kaveh, Ricardo Baptista, Andrew M. Stuart

发表机构 * California Institute of Technology(加州理工学院) University of Toronto(多伦多大学)

AI总结 本文研究了在仅能获取参数和观测联合分布样本的情况下,非线性逆问题的平均能量基于的贝叶斯推断方法,提出了一种基于传输的方法,通过学习观测依赖的映射来近似后验分布,避免了传统方法的计算开销。

Comments 25 pages, 10 figures

详情
AI中文摘要

我们考虑了在仅能获取参数和观测联合分布样本的情况下,非线性逆问题的平均能量基于的贝叶斯推断。传统方法如马尔可夫链蒙特卡罗方法需要为每个观测解决新的推断问题,当推断必须多次重复时计算开销可能很高。我们提出了一种基于传输的方法,学习一个观测依赖的映射,将参考测度推送到近似后验分布。该映射通过最小化真实后验与学习推前之间的平均能量-距离目标进行训练。这种形式是无似然的,仅需联合样本,避免了密度评估、可逆性约束和雅可比行列式计算。对于函数空间逆问题,我们使用高斯先验参数化传输映射为恒等映射加上在先验中的 Cameron-Martin 空间中的扰动,保持与先验的绝对连续性。在无限维情况下,使用神经算子表示映射。我们通过有限维非线性逆问题和两个出现在多孔介质流和地震反演中的 PDE 约束逆问题展示了该方法。结果表明,学习的传输能够捕捉后验结构,包括多模态和主导模式,同时能够为新观测快速进行后验采样。

英文摘要

We consider amortized Bayesian inference for nonlinear inverse problems in settings where only samples from the joint distribution of parameters and observations are available. Classical methods such as Markov chain Monte Carlo require solving a new inference problem for each observation, which can be computationally prohibitive when inference must be repeated many times. We propose a transport-based approach that learns an observation-dependent map pushing forward a reference measure to approximate the posterior distribution. The map is trained by minimizing an averaged energy-distance objective between the true posterior and the learned pushforward. This formulation is likelihood-free, requiring only joint samples, and avoids density evaluation, invertibility constraints, and Jacobian determinant computations. For function-space inverse problems with Gaussian priors, we parameterize the transport map as the identity plus a perturbation in the Cameron-Martin space of the prior, preserving absolute continuity with respect to the prior. In infinite-dimensional settings, the map is represented using neural operators. We illustrate the method on a finite-dimensional nonlinear inverse problem and two PDE-constrained inverse problems arising in porous medium flow and seismic inversion. The results show that the learned transport captures posterior structure, including multimodality and dominant modes, while enabling fast posterior sampling for new observations.

2605.13339 2026-05-19 cs.CL cs.AI 版本更新

Probing Persona-Dependent Preferences in Language Models

探测语言模型中依赖人格的偏好

Oscar Gilg, Pierre Beckmann, Daniel Paleka, Patrick Butlin

发表机构 * MATS EPFL(瑞士联邦理工学院) ETH Zürich(苏黎世联邦理工学院) Eleos AI Research(Eleos AI研究)

AI总结 本文通过训练线性探针来探测语言模型中不同人格下的偏好表示,发现偏好向量在不同人格间具有共享特性,并能通过调整偏好向量来影响模型的输出选择。

Comments 41 pages, 45 figures. Code: https://github.com/oscar-gilg/Preferences. Earlier write-up on LessWrong: https://www.lesswrong.com/posts/pxC2RAeoBrvK8ivMf/models-have-linear-representations-of-what-tasks-they-like-1

详情
AI中文摘要

大型语言模型(LLMs)可以被认为具有偏好:它们能够可靠地选择某些任务和输出,而这些偏好受到训练后和系统提示的影响,似乎塑造了大部分行为。但模型也可以采用不同的身份,这些身份具有截然不同的偏好。这种内部是如何实现的?每个身份是否运行在自己的偏好机制上,还是有某种共享的基础?我们训练了线性探针来预测Gemma-3-27B和Qwen-3.5-122B残差流激活中的揭示性成对任务选择,并识别出一个真实的偏好向量:它跟踪模型在不同提示和情境下的偏好变化,并且在Gemma-3-27B中,沿着它引导的选择具有因果控制。这种偏好表示在不同身份间具有广泛共享性:一个训练于帮助助手的探针能够预测和引导质不同身份的选择,包括一个反相关于助手偏好的邪恶身份。

英文摘要

Large language models (LLMs) can be said to have preferences: they reliably pick certain tasks and outputs over others, and preferences shaped by post-training and system prompts appear to shape much of their behaviour. But models can also adopt different personas which have radically different preferences. How is this implemented internally? Does each persona run on its own preference machinery, or is something shared underneath? We train linear probes on residual-stream activations of Gemma-3-27B and Qwen-3.5-122B to predict revealed pairwise task choices, and identify a genuine preference vector: it tracks the model's preferences as they shift across a range of prompts and situations, and on Gemma-3-27B steering along it causally controls pairwise choice. This preference representation is largely shared across personas: a probe trained on the helpful assistant predicts and steers the choices of qualitatively different personas, including an evil persona whose preferences anti-correlate with those of the Assistant.

2605.12012 2026-05-19 cs.AI 版本更新

LegalCheck: Retrieval- and Context-Augmented Generation for Drafting Municipal Legal Advice Letters

LegalCheck: 基于检索和上下文增强的生成方法用于起草市政法律建议信

Virgill van der Meer, Julien Rossi

发表机构 * Municipality of Amsterdam(阿姆斯特丹市) University of Amsterdam(阿姆斯特丹大学)

AI总结 本文提出LegalCheck系统,通过检索增强生成(RAG)和上下文增强生成(CAG)技术,自动起草市政法律回应信,提高法律部门在人员短缺、案件量增加和合规压力下的工作效率,确保法律一致性和准确性。

Comments Accepted at ICAIL 2026 as Short Paper

详情
AI中文摘要

荷兰公共部门法律部门面临严重的人员短缺、案件量增加和满足合规要求的压力。本文提出了LegalCheck,一种新的系统,通过检索增强生成(RAG)和上下文增强生成(CAG)的结合,自动化起草异议回应信。利用大型语言模型(LLM)和经过筛选的法律知识库,LegalCheck检索相关法律和先例,并通过受控提示将外部知识和案件特定细节整合到连贯的草稿中。专家在循环审查确保生成的信件在法律上正确且上下文合适。在阿姆斯特丹市的实际部署中,LegalCheck在分钟内生成接近最终的建议信,而不是小时,同时保持高法律一致性和事实准确性。输出基于实际法规和先例,提供可解释的输出,涵盖了大多数所需的法律推理(通常80%到100%的必要内容)。法律专业人士发现该系统减少了他们的工作量,确保了法律标准的一致应用,而没有取代人类判断。这些结果展示了显著的效率提升、改进的法律一致性和积极的用户接受度。更广泛地说,这项工作展示了如何通过在LLM中加入领域知识和治理机制来部署负责任的AI,从而在法律领域应用负责任的AI。

英文摘要

Public-sector legal departments in the Netherlands face acute staff shortages, increased case volumes, and increased pressure to meet regulatory compliance. This paper presents LegalCheck, a novel system that addresses these challenges by automating the drafting of objection response letters through a combination of Retrieval-Augmented Generation (RAG) and Context-Augmented Generation (CAG). Using a large language model (LLM) alongside curated legal knowledge bases, LegalCheck performs retrieval of relevant laws and precedents, and uses controlled prompting to incorporate both external knowledge and case-specific details into a coherent draft. An expert-in-the-loop review ensures that each generated letter is legally sound and contextually appropriate. In a real-world deployment within the Municipality of Amsterdam, LegalCheck produced near-final advice letters in minutes rather than hours, while maintaining high legal consistency and factual accuracy. The output is based on actual regulations and prior cases, providing explainable outputs that captured the vast majority of required legal reasoning (often 80\% to 100\% of essential content). Legal professionals found that the system reduced their workload and ensured a consistent application of legal standards, without replacing human judgment. These results demonstrate substantial efficiency gains, improved legal consistency, and positive user acceptance. More broadly, this work illustrates how responsible AI can be deployed in the legal domain by augmenting LLMs with domain knowledge and governance mechanisms.

2605.11654 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

通过基于原型的语义部分发现实现抗天气的跨视角地理定位

Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Nguyen Lam Phu Quy, Phu-Hoa Pham, Long Tran-Thanh

发表机构 * Faculty of Information Technology, University of Science, Vietnam National University(信息技术学院,科学大学,越南国家大学) Department of Computer Science, University of Warwick(计算机科学系,沃里克大学)

AI总结 本文提出SkyPart,一种轻量级可替换头,用于基于补丁的视觉变换器,通过在补丁网格上显式分组实现部分分组。SkyPart有四个理论基础的组件:(i)通过单次传递余弦分配学习可学习的原型以竞争补丁标记;(ii)在训练期间应用的海拔条件线性调制,使检索嵌入在推理时无海拔依赖;(iii)对活跃原型的图注意力读出;(iv)一种Kendall不确定性加权多目标损失,其平稳点是帕累托平稳点。在26.95M参数和22.14 GFLOPs下,SkyPart是表现最佳方法中最小的,并在SUES-200、University-1652和DenseUAV上设定了新的状态。其在十条件WeatherPrompt腐蚀基准下的优势优于最强基线。

Comments 37 pages, 7 figures, 6 tables

详情
AI中文摘要

跨视角地理定位(CVGL),即匹配一个倾斜无人机视角到地理参考的卫星瓷砖,已成为在GPS信号被干扰、欺骗或不可用时自主无人机导航的关键替代方案。尽管近年来取得了显著进展,但仍然存在三个限制:(1)全局描述符设计将补丁网格压缩成一个向量,而没有在视角间隙中分离布局和纹理;(2)与海拔相关的尺度变化保留在学习嵌入中,而不是被边缘化;(3)多目标训练依赖于手动调整的标量损失,这些损失在不兼容的梯度尺度上。我们提出SkyPart,一种轻量级可替换头,用于基于补丁的视觉变换器(ViTs),在补丁网格上实施显式部分分组。SkyPart有四个理论基础的组件:(i)通过单次传递余弦分配学习可学习的原型以竞争补丁标记;(ii)在训练期间应用的海拔条件线性调制,使检索嵌入在推理时无海拔依赖;(iii)对活跃原型的图注意力读出;(iv)一种Kendall不确定性加权多目标损失,其平稳点是帕累托平稳点。在26.95M参数和22.14 GFLOPs下,SkyPart是表现最佳方法中最小的,并在SUES-200、University-1652和DenseUAV上设定了新的状态。其在十条件WeatherPrompt腐蚀基准下的优势优于最强基线。

英文摘要

Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.

2605.11365 2026-05-19 cs.AI cs.LG stat.ML 版本更新

Causal Bias Detection in Generative Artificial Intelligence

生成人工智能中的因果偏见检测

Drago Plecko

发表机构 * Department of Statistics & Data Science(统计与数据科学系)

AI总结 本文研究了生成人工智能中的因果公平性问题,提出了新的因果分解结果,以量化不同因果路径和现实机制被生成模型替代对公平性的影响,并通过分析大型语言模型中的种族和性别偏见验证了方法的有效性。

详情
AI中文摘要

基于人工智能构建的自动化系统越来越多地应用于高风险领域,引发了关于公平性和现实世界中存在的人口差异持续存在的关键担忧。在此背景下,因果推断提供了一个有原则的框架来思考公平性,因为它将观察到的不平等与潜在机制联系起来,并自然与人类直觉和法律上的歧视观念相一致。先前关于因果公平性的研究主要集中在标准机器学习设置中,其中决策者为结果变量Y构建单一预测机制f_Ŷ,同时继承其他协变量的因果机制。然而,生成人工智能的设置却更加复杂:生成模型可以从任意条件下对任何变量集进行采样,隐式地构建了自己对所有因果机制的看法,而不是学习单一预测函数。这种根本性的差异要求因果公平性方法论有新的发展。我们正式定义了生成人工智能中的因果公平性问题,并在统一的理论框架下将其与标准机器学习设置相结合。然后,我们推导了新的因果分解结果,使能够对不同因果路径以及现实机制被生成模型机制替代的公平性影响进行精细量化。我们建立了识别条件并引入了用于因果感兴趣的量的高效估计器,并通过分析不同数据集中的大型语言模型中的种族和性别偏见来证明了我们方法的价值。

英文摘要

Automated systems built on artificial intelligence (AI) are increasingly deployed across high-stakes domains, raising critical concerns about fairness and the perpetuation of demographic disparities that exist in the world. In this context, causal inference provides a principled framework for reasoning about fairness, as it links observed disparities to underlying mechanisms and aligns naturally with human intuition and legal notions of discrimination. Prior work on causal fairness primarily focuses on the standard machine learning setting, where a decision-maker constructs a single predictive mechanism $f_{\widehat Y}$ for an outcome variable $Y$, while inheriting the causal mechanisms of all other covariates from the real world. The generative AI setting, however, is markedly more complex: generative models can sample from arbitrary conditionals over any set of variables, implicitly constructing their own beliefs about all causal mechanisms rather than learning a single predictive function. This fundamental difference requires new developments in causal fairness methodology. We formalize the problem of causal fairness in generative AI and unify it with the standard ML setting under a common theoretical framework. We then derive new causal decomposition results that enable granular quantification of fairness impacts along both (a) different causal pathways and (b) the replacement of real-world mechanisms by the generative model's mechanisms. We establish identification conditions and introduce efficient estimators for causal quantities of interest, and demonstrate the value of our methodology by analyzing race and gender bias in large language models across different datasets.

2605.10843 2026-05-19 cs.CL cs.AI cs.CY 版本更新

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement

通过人格分歧实现无训练的文化对齐大语言模型

Huynh Trung Kiet, Dao Sy Duy Minh, Tuan Nguyen, Chi-Nguyen Tran, Phu-Hoa Pham, Nguyen Lam Phu Quy, The Anh Han, Long Tran-Thanh

发表机构 * Faculty of Information and Technology, University of Science, Vietnam National University(信息与技术学院,科学大学,越南国家大学) Department of Computer Science, University of Warwick(计算机科学系,沃里克大学) School of Computing, Engineering and Digital Technologies, Teesside University(计算、工程和数字技术学院,泰赛德大学)

AI总结 本文提出DISCA方法,在不改变模型权重的情况下,通过人格分歧校准减少大语言模型在多任务测试中的文化偏差,为服务全球道德偏好提供了可扩展的替代方案。

Comments 57 pages, 1 figure, 6 MultiTP moral dimensions

详情
AI中文摘要

大型语言模型越来越多地参与涉及道德判断的决策,但越来越多的证据表明,它们的隐含偏好并非文化中立。现有的文化对齐方法要么需要国家层面的偏好数据和微调预算,要么假设可以访问模型内部的白盒信息,而商业API并未暴露此类信息。在本工作中,我们专注于这种现实的黑盒、仅公共数据的环境,并观察到国家内部的社会人口学分歧,而非共识,是主要的指导信号。我们引入DISCA(基于分歧的文化对齐推理方法),一种在推理时的方法,将每个国家视为一个基于世界价值观调查的个人代理面板,并将他们的分歧转化为一个有界的、损失厌恶的logit校正。在20个国家和7个开放权重的backbone(2B-70B)上,DISCA在MultiTP上减少了10-24%的文化偏差(在六个backbone >=3.8B上),并在开放场景中减少了2-7%的偏差,而无需改变任何权重。我们的结果表明,推理时的校准是微调的可扩展替代方案,用于服务全球道德偏好的长尾。

英文摘要

Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets or assume white-box access to model internals that commercial APIs do not expose. In this work, we focus on this realistic black-box, public-data-only regime and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of World-Values-Survey-grounded persona agents and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones >=3.8B, and 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.

2605.10811 2026-05-19 math.OC cs.AI 版本更新

Switching-Geometry Analysis of Deflated Q-Value Iteration

退化Q值迭代的切换几何分析

Donghwan Lee

发表机构 * Department of Electrical Engineering(电气工程系)

AI总结 本文提出了一种联合谱半径框架,用于分析折扣马尔可夫决策过程控制中的一阶退化Q值迭代(Q-VI)。通过全一残差校正,作者利用切换系统的几何特性,首次给出了基于联合谱半径的退化Q-VI在策略优化问题中的收敛性分析。分析表明,标准Q-VI切换系统模型的联合谱半径恰好等于折扣因子γ∈(0,1),因为所有可接受的子系统共享全一向量作为不变方向。通过构造去除该方向的商空间,得到一个投影切换系统模型,其联合谱半径控制相关误差动态,并可能严格小于γ。因此,退化Q-VI可能比环境空间γ界具有更精确的收敛速率描述。最后,证明了校正等同于标准Q-VI的标量重新中心化。因此,投影轨迹以及由此产生的贪婪策略序列与标准Q-VI初始化相同点后的结果相同。退化的好处不是改变诱导的决策问题,而是在去除冗余的全一成分后,对收敛几何的更精确的联合谱半径描述。

详情
AI中文摘要

本文发展了一种联合谱半径(JSR)框架,用于分析折扣马尔可夫决策过程控制中的一阶退化Q值迭代(Q-VI)。聚焦于全一残差校正,我们通过切换系统的几何特性解释了所得到的算法,并到目前为止,首次给出了基于联合谱半径的退化Q-VI在策略优化问题中的收敛性分析。我们的分析表明,标准Q-VI切换系统模型的联合谱半径恰好等于折扣因子γ∈(0,1),因为所有可接受的子系统共享全一向量作为不变方向。通过构造去除该方向的商空间,我们得到一个投影切换系统模型,其联合谱半径控制相关误差动态,并可能严格小于γ。因此,退化Q-VI可能比环境空间γ界具有更精确的收敛速率描述。最后,我们证明校正等同于标准Q-VI的标量重新中心化。因此,投影轨迹以及由此产生的贪婪策略序列与标准Q-VI初始化相同点后的结果相同。退化的好处不是改变诱导的决策问题,而是在去除冗余的全一成分后,对收敛几何的更精确的联合谱半径描述。

英文摘要

This paper develops a joint spectral radius (JSR) framework for analyzing rank-one deflated Q-value iteration (Q-VI) in discounted Markov decision process control. Focusing on an all-ones residual correction, we interpret the resulting algorithm through the geometry of switching systems and, to the best of our knowledge, give the first JSR-based convergence analysis of deflated Q-VI for policy optimization problems. Our analysis reveals that the standard Q-VI switching system model has JSR exactly the discount factor $γ\in (0,1)$, since all admissible subsystems share the all-ones vector as an invariant direction. By passing to the quotient space that removes this direction, we obtain a projected switching system model whose JSR governs the relevant error dynamics and may be strictly smaller than $γ$. Therefore, the deflated Q-VI admits a potentially sharper convergence-rate characterization than the ambient-space $γ$-bound. Finally, we prove that the correction is equivalent to a scalar recentering of standard Q-VI. Hence, the projected trajectory, and therefore the greedy-policy sequence, is unchanged relative to standard Q-VI initialized from the same point. The benefit of deflation is not a change in the induced decision-making problem, but a more precise JSR-based description of the convergence geometry after the redundant all-ones component is removed.

2605.10503 2026-05-19 cs.AI 版本更新

SLASH the Sink: Sharpening Structural Attention Inside LLMs

SLASH the Sink: 在大语言模型中 sharpening 结构性注意力

Yiming Liu, Bin Lu, Xinbing Wang, Chenghu Zhou, Meng Jin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Institute of Geographical Science and Natural Resources Research(地理科学与自然资源研究所) Chinese Academy of Sciences(中国科学院)

AI总结 本文研究了大语言模型内部机制,发现其能自发重构图拓扑,但受注意力sink影响导致结构理解被削弱。提出SLASH方法,通过插件式注意力重分布增强内部结构理解,实验表明在纯图任务和分子预测中性能显著提升。

详情
AI中文摘要

大型语言模型(LLMs)在处理图拓扑时表现出显著的语义理解能力,但往往在结构理解上遇到困难。现有解决方案依赖于训练外部图结构适配器或微调,这导致成本高且失去泛化能力。本文研究了LLMs的内部机制,发现LLMs会自发地在内部重构图的拓扑结构,这在注意力图中表现为明显的“锯齿”模式,与“token级邻接矩阵”结构一致。然而,这种内在的结构理解被注意力sink所稀释。我们理论上将这种稀释定义为一个表示瓶颈,源于一个根本性的矛盾:模型的各向异性偏见,对于语言任务是必要的,却抑制了图推理所需的拓扑感知局部聚合。为了解决这个问题,我们提出了一种无需训练的解决方案,名为StructuraL Attention SHarpening(SLASH),通过插件式注意力重分布来增强这种内部结构理解。在纯图任务和分子预测实验中验证,SLASH在多种LLM上都带来了显著且一致的性能提升。

英文摘要

Large Language Models (LLMs) show remarkable semantic understanding but often struggle with structural understanding when processing graph topologies in a serialized format. Existing solutions rely on training external graph-based adapters or fine-tuning, which incur high costs and lost generalizability. In this work, we investigate the internal mechanisms of LLMs and present a critical finding: LLMs spontaneously reconstruct the graph's topology internally, evidenced by a distinct "sawtooth" pattern in their attention maps that structurally aligns with the "token-level adjacency matrix". However, this intrinsic structural understanding is diluted by the attention sink. We theoretically formalize this dilution as a representation bottleneck, stemming from a fundamental conflict: the model's anisotropic bias, essential for language tasks, suppresses the topology-aware local aggregation required for graph reasoning. To address this, we propose a training-free solution, named StructuraL Attention SHarpening (SLASH), which amplifies this internal structural understanding via a plug-and-play attention redistribution. Experiments on pure graph tasks and molecular prediction validate that SLASH delivers significant and consistent performance gains across diverse LLMs.

2605.07263 2026-05-19 eess.SP cs.AI cs.DC cs.LG stat.ML 版本更新

Resource-Element Energy Difference for Noncoherent Over-the-Air Federated Learning

非协作空中联邦学习的资源元素能量差

Hao Chen, Zavareh Bozorgasl

发表机构 * Signal, Communication, and Learning Lab (SCALE Lab), Department of Electrical and Computer Engineering, Boise State University(信号、通信与学习实验室(SCALE实验室),电气与计算机工程系,博伊西州立大学)

AI总结 本文提出了一种非协作物理层原始方法,即资源元素能量差(REED),用于连续符号聚合。该方法通过将实值更新的正负部分映射到配对正交的资源元素上的传输能量,并通过减去对应的接收到的能量来估计符号和。REED利用慢时间尺度校准的平均信道功率,但不需要瞬时发射端或接收端CSI或信道反转。对于独立的瑞利衰落,我们推导了单次REED和芯片多样扩展的精确一阶和二阶矩表达式。

Comments Preprint; Under-review; Codes to replicate the results is available at: https://github.com/zavareh1/REED

详情
AI中文摘要

Over-the-air federated learning (OTA-FL) reduces uplink latency by aggregating client updates directly over the wireless multiple-access channel. Coherent analog aggregation realizes this idea by aligning the phases and amplitudes of simultaneously transmitted waveforms, which typically requires synchronization, instantaneous channel-state information (CSI), phase compensation, and power control. Noncoherent energy detection removes the need for phase-coherent combining, but a single energy measurement is nonnegative and, therefore, cannot represent signed model updates. This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. The construction uses slow-timescale calibration of average channel powers, but does not require instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, we derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension that spreads each coordinate over multiple independently faded paired chips. The resulting variance laws separate fading-induced self-noise, signal-noise interaction, and receiver-noise fluctuation, giving an explicit diversity-resource tradeoff. More->The rest of abstract is in the paper.

英文摘要

Over-the-air federated learning (OTA-FL) reduces uplink latency by aggregating client updates directly over the wireless multiple-access channel. Coherent analog aggregation realizes this idea by aligning the phases and amplitudes of simultaneously transmitted waveforms, which typically requires synchronization, instantaneous channel-state information (CSI), phase compensation, and power control. Noncoherent energy detection removes the need for phase-coherent combining, but a single energy measurement is nonnegative and, therefore, cannot represent signed model updates. This paper introduces resource-element energy difference (REED), a noncoherent physical-layer primitive for continuous signed aggregation. REED maps the positive and negative parts of each real-valued update to transmit energies on paired orthogonal resource elements and estimates the signed sum by subtracting the corresponding received energies. The construction uses slow-timescale calibration of average channel powers, but does not require instantaneous transmitter- or receiver-side CSI or channel inversion. For independent Rayleigh fading, we derive exact first- and second-moment expressions for single-shot REED and for a chip-diverse extension that spreads each coordinate over multiple independently faded paired chips. The resulting variance laws separate fading-induced self-noise, signal-noise interaction, and receiver-noise fluctuation, giving an explicit diversity-resource tradeoff. More->The rest of abstract is in the paper.

2604.18652 2026-05-19 cs.CR cs.AI 版本更新

From Craft to Kernel: A Governance-First Execution Architecture and Semantic ISA for Agentic Computers

从手工到内核:一种以治理为导向的执行架构和语义ISA用于代理计算机

Xiangyu Wen, Yuang Zhao, Xiaoyu Xu, Lingjun Chen, Changran Xu, Shu Chi, Jianrong Ding, Zeju Li, Haomin Li, Li Jiang, Fangxin Liu, Qiang Xu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Zhejiang University(浙江大学) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 本文提出了一种以治理为导向的执行架构和语义指令集架构(ISA),旨在解决代理计算机中AI从脆弱原型到生产系统的过渡问题,通过引入概率处理单元和确定性神经符号内核,提升系统的安全性和可靠性。

详情
AI中文摘要

代理AI从脆弱原型到生产系统的过渡受到普遍的工艺危机的阻碍。我们建议,现行的编排范式——将系统控制循环委托给大型语言模型并仅通过启发式护栏进行修补——是这种脆弱性的根本原因。相反,我们提出了Arbiter-K,一种以治理为导向的执行架构,将底层模型重新概念化为一个概率处理单元,封装在确定性的神经符号内核中。Arbiter-K实现了语义指令集架构(ISA)将概率消息转化为离散指令。这使内核能够在运行时维护安全上下文注册表并构建指令依赖图,从而基于每个推理节点的数据流谱系进行主动污点传播。通过利用这一机制,Arbiter-K能够精确地阻止危险轨迹,防止在确定性终点(如高风险工具调用或未经授权的网络流出)发生不安全行为,并在触发安全策略时启用自主执行纠正和架构回滚。在OpenClaw和NanoBot上的评估显示,Arbiter-K将安全作为微架构属性强制执行,实现了比原生策略高出92.79%的绝对收益,安全拦截率在76%到95%之间。代码已公开在https://github.com/cure-lab/ArbiterOS。

英文摘要

The transition of agentic AI from brittle prototypes to production systems is stalled by a pervasive crisis of craft. We suggest that the prevailing orchestration paradigm-delegating the system control loop to large language models and merely patching with heuristic guardrails-is the root cause of this fragility. Instead, we propose Arbiter-K, a Governance-First execution architecture that reconceptualizes the underlying model as a Probabilistic Processing Unit encapsulated by a deterministic, neuro-symbolic kernel. Arbiter-K implements a Semantic Instruction Set Architecture (ISA) to reify probabilistic messages into discrete instructions. This allows the kernel to maintain a Security Context Registry and construct an Instruction Dependency Graph at runtime, enabling active taint propagation based on the data-flow pedigree of each reasoning node. By leveraging this mechanism, Arbiter-K precisely interdicts unsafe trajectories at deterministic sinks (e.g., high-risk tool calls or unauthorized network egress) and enables autonomous execution correction and architectural rollback when security policies are triggered. Evaluations on OpenClaw and NanoBot demonstrate that Arbiter-K enforces security as a microarchitectural property, achieving 76% to 95% unsafe interception for a 92.79% absolute gain over native policies. The code is publicly available at https://github.com/cure-lab/ArbiterOS.

2604.12253 2026-05-19 cs.AI 版本更新

A Scoping Review of Large Language Model-Based Pedagogical Agents

基于大语言模型的教育代理的综述

Shan Li, Juan Zheng

发表机构 * Department of Education and Human Services, College of Education, Lehigh University(教育与人类服务学院,教育学院,莱维大学) Department of Community and Global Health, College of Health, Lehigh University(社区与全球健康学院,健康学院,莱维大学)

AI总结 本文综述了大语言模型在教育环境中的应用,探讨了教育代理的设计维度、发展趋势及研究空白,为未来研究提供指导。

详情
AI中文摘要

本综述根据PRISMA-ScR指南,分析了2022年11月至2025年1月期间五个主要数据库中的52项研究,探讨了基于大语言模型(LLM)的教育代理在K-12教育、高等教育和非正式学习环境中的多样性。研究识别出四个关键设计维度:交互方式(反应型 vs. 主动型)、领域范围(领域专用 vs. 通用)、角色复杂性(单一角色 vs. 多角色)以及系统集成(独立 vs. 集成)。新兴趋势包括多代理系统模拟自然学习环境、虚拟学生模拟用于代理评估、与沉浸式技术的整合以及与学习分析的结合。本文还讨论了隐私、准确性和学生自主性等重要研究空白和伦理问题。

英文摘要

This scoping review examines the emerging field of Large Language Model (LLM)-based pedagogical agents in educational settings. While traditional pedagogical agents have been extensively studied, the integration of LLMs represents a transformative advancement with unprecedented capabilities in natural language understanding, reasoning, and adaptation. Following PRISMA-ScR guidelines, we analyzed 52 studies across five major databases from November 2022 to January 2025. Our findings reveal diverse LLM-based agents spanning K-12, higher education, and informal learning contexts across multiple subject domains. We identified four key design dimensions characterizing these agents: interaction approach (reactive vs. proactive), domain scope (domain-specific vs. general-purpose), role complexity (single-role vs. multi-role), and system integration (standalone vs. integrated). Emerging trends include multi-agent systems that simulate naturalistic learning environments, virtual student simulation for agent evaluation, integration with immersive technologies, and combinations with learning analytics. We also discuss significant research gaps and ethical considerations regarding privacy, accuracy, and student autonomy. This review provides researchers and practitioners with a comprehensive understanding of LLM-based pedagogical agents while identifying crucial areas for future development in this rapidly evolving field.

2604.08874 2026-05-19 cs.LG cs.AI 版本更新

A Mathematical Framework for Temporal Modeling and Counterfactual Policy Simulation of Student Dropout

面向学生退学的时序建模与反事实政策模拟的数学框架

Rafael da Silva, Jeff Eicher, Gregory Longo

发表机构 * Applied Data Science Program(应用数据科学项目) Eastern University(东部大学)

AI总结 本文提出了一种结合反事实政策模拟层的时序建模框架,用于分析高等教育学生退学问题,通过LMS参与数据和行政退学记录进行建模,采用时间到事件结局的方式,并通过惩罚性、类别平衡逻辑回归进行每周风险建模,展示了模型在训练和测试集上的高AUC表现,并通过消融分析验证了时间参与信号的重要性。

Comments Approx. 20 pages, 9 figures. Code and reproducibility package available at https://github.com/rafa-rodriguess/TCM-Student-Dropout This work introduces a temporal survival framework with counterfactual policy simulation

详情
AI中文摘要

本研究提出了一种针对高等教育学生退学问题的时序建模框架,结合反事实政策模拟层,利用LMS参与数据和行政退学记录进行建模。退学被定义为在入学层面的时间到事件结局;通过在人-时期行上进行惩罚性、类别平衡逻辑回归,对每周风险进行离散时间建模。在晚期事件时间验证下,模型在训练集和测试集上分别达到0.8350和0.8405的行级AUC,整体校准可接受但最高风险分箱支持稀疏。消融分析表明性能对特征集组成敏感,突显了时间参与信号的作用。一个基于场景的政策层产生生存对比ΔS(T)在显式的触发/计划合同下:正对比被限制在冲击分支(T_policy=18:0.0102,0.0260,0.0819),而机制-aware分支为负(ΔS_mech(18)=-0.0078,ΔS_mech(38)=-0.0134)。通过性别子组分析量化了场景诱导的生存差距,通过bootstrap方法进行统计检验;对比方向稳定但较小。结果未被因果识别;它们展示了在观察数据限制下,该框架进行内部结构场景比较的能力。

英文摘要

This study proposes a temporal modeling framework with a counterfactual policy-simulation layer for student dropout in higher education, using LMS engagement data and administrative withdrawal records. Dropout is operationalized as a time-to-event outcome at the enrollment level; weekly risk is modeled in discrete time via penalized, class-balanced logistic regression over person--period rows. Under a late-event temporal holdout, the model attains row-level AUCs of 0.8350 (train) and 0.8405 (test), with aggregate calibration acceptable but sparsely supported in the highest-risk bins. Ablation analyses indicate performance is sensitive to feature set composition, underscoring the role of temporal engagement signals. A scenario-indexed policy layer produces survival contrasts $ΔS(T)$ under an explicit trigger/schedule contract: positive contrasts are confined to the shock branch ($T_{\rm policy}=18$: 0.0102, 0.0260, 0.0819), while the mechanism-aware branch is negative ($ΔS_{\rm mech}(18)=-0.0078$, $ΔS_{\rm mech}(38)=-0.0134$). A subgroup analysis by gender quantifies scenario-induced survival gaps via bootstrap; contrasts are directionally stable but small. Results are not causally identified; they demonstrate the framework's capacity for internal structural scenario comparison under observational data constraints.

2604.08432 2026-05-19 physics.optics cs.AI 版本更新

Small-scale photonic Kolmogorov-Arnold networks using standard telecom nonlinear modules

利用标准电信非线性模块的小规模光子Kolmogorov-Arnold网络

Luca Nogueira Calçado, Sergei K. Turitsyn, Egor Manuylovich

发表机构 * Aston Institute of Photonic Technologies (AiPT)(阿斯顿光电技术研究所)

AI总结 本文提出了一种基于标准电信组件的小规模光子Kolmogorov-Arnold网络,通过可训练的非线性模块实现高效非线性推理,展示了在分类、回归和图像识别任务中的高性能。

详情
AI中文摘要

光子神经网络有望实现超快推理,但大多数架构依赖于线性光学网格和电子非线性性,重新引入了光-电-光瓶颈。本文介绍了一种完全由标准电信组件实现的小规模光子Kolmogorov-Arnold网络(SSP-KANs)。每个网络边缘使用一个可训练的非线性模块,由马赫-曾德干涉仪、半导体光放大器和可变光学衰减器组成,提供一个由增益饱和和干涉混合导出的四参数传输函数。尽管这些光学非线性性的功能形式受限,仅由几个光学模块组成的SSP-KANs在分类、回归和图像识别任务中实现了强大的非线性推理性能,参数数量显著少于软件基线。一个四模块网络在非线性分类基准上的准确率为94.3%(IQR:90.3-97.4%,10种子),七模块网络在六输入回归中的R²为0.986±0.015。在现实硬件退化下性能保持稳健,即使在6位输入分辨率和14 dB信噪比下仍能保持高精度。通过使用完全可微的物理模型对光学参数进行端到端优化,本文建立了从仿真到实验演示的实用路径,利用商用电信硬件实现光子KANs。

英文摘要

Photonic neural networks promise ultrafast inference, yet most architectures rely on linear optical meshes with electronic nonlinearities, reintroducing optical-electrical-optical bottlenecks. Here we introduce small-scale photonic Kolmogorov-Arnold networks (SSP-KANs) implemented entirely with standard telecommunications components. Each network edge employs a trainable nonlinear module composed of a Mach-Zehnder interferometer, semiconductor optical amplifier, and variable optical attenuators, providing a four-parameter transfer function derived from gain saturation and interferometric mixing. Despite the constrained functional form of these optical nonlinearities, SSP-KANs comprising only a few optical modules achieve strong nonlinear inference performance across classification, regression, and image recognition tasks, approaching software baselines with significantly fewer parameters. A four-module network achieves $94.3$\% (IQR: $90.3$--$97.4$\%, 10~seeds) accuracy on nonlinear classification benchmarks; a seven-module network attains $R^2 = 0.986 \pm 0.015$ on six-input regression. Performance remains robust under realistic hardware impairments, maintaining high accuracy down to 6-bit input resolution and 14 dB signal-to-noise ratio. By using a fully differentiable physics model for end-to-end optimisation of optical parameters, this work establishes a practical pathway from simulation to experimental demonstration of photonic KANs using commodity telecom hardware.

2603.29868 2026-05-19 cs.AI cs.LO 版本更新

Spatiotemporal Robustness of Temporal Logic Tasks using Multi-Objective Reasoning

基于多目标推理的时序逻辑任务时空鲁棒性

Oliver Schön, Lars Lindemann

发表机构 * Automatic Control Laboratory, ETH Zürich(自动化控制实验室,苏黎世联邦理工学院)

AI总结 本文研究了通过多目标推理处理时序逻辑任务的时空鲁棒性,提出了一种新的时空鲁棒性定义,能够同时考虑空间和时间扰动,并展示了其在多智能体机器人、智慧城市和空中交通管制等交互系统中的应用。

Comments 30 pages, 6 figures, to be published at the 38th International Conference on Computer Aided Verification 2026

详情
AI中文摘要

自主系统的可靠性依赖于其鲁棒性,即在不确定性下满足目标的能力。本文研究了在离散时间信号上评估的时序逻辑规范的时空鲁棒性。现有工作提出了鲁棒语义,能够捕捉不仅布尔可满足性,还包括从不可满足性距离的几何距离,对应于给定信号的可接受空间扰动。相比之下,我们提出了时空鲁棒性(STR),它同时捕捉可接受的空间和时间扰动。这一概念对于交互系统,如多智能体机器人、智慧城市和空中交通管制尤其具有信息量。我们将STR定义为一个多目标推理问题,通过空间和时间扰动的偏序关系形式化。这种视角有两个关键优势:(1)STR可以被解释为一个帕累托最优集,该集描述了所有可接受的时空扰动;(2)STR可以通过多目标优化工具进行计算。为克服计算挑战,我们提出了适用于STR的鲁棒语义,这些语义在适当的意义下是准确的,同时计算上是可行的。最后,我们使用这些鲁棒语义提出了STR的监控算法。据我们所知,这是首次通过多目标推理处理多维鲁棒性的工作。

英文摘要

The reliability of autonomous systems depends on their robustness, i.e., their ability to meet their objectives under uncertainty. In this paper, we study spatiotemporal robustness of temporal logic specifications evaluated over discrete-time signals. Existing work has proposed robust semantics that capture not only Boolean satisfiability, but also the geometric distance from unsatisfiability, corresponding to admissible spatial perturbations of a given signal. In contrast, we propose spatiotemporal robustness (STR), which captures admissible spatial and temporal perturbations jointly. This notion is particularly informative for interacting systems, such as multi-agent robotics, smart cities, and air traffic control. We define STR as a multi-objective reasoning problem, formalized via a partial order over spatial and temporal perturbations. This perspective has two key advantages: (1) STR can be interpreted as a Pareto-optimal set that characterizes all admissible spatiotemporal perturbations, and (2) STR can be computed using tools from multi-objective optimization. To navigate computational challenges, we propose robust semantics for STR that are sound in the sense of suitably under-approximating STR while being computationally tractable. Finally, we present monitoring algorithms for STR using these robust semantics. To the best of our knowledge, this is the first work to deal with robustness across multiple dimensions via multi-objective reasoning.

2603.26720 2026-05-19 cs.RO cs.AI 版本更新

SutureFormer: Learning Surgical Trajectories via Goal-conditioned Offline RL in Pixel Space

SutureFormer: 通过像素空间中的目标引导离线强化学习学习手术轨迹

Huanrong Liu, Chunlin Tian, Tongyu Jia, Tailai Zhou, Qin Liu, Yu Gao, Yutong Ban, Yun Gu, Guy Rosman, Xin Ma, Qingbiao Li

发表机构 * University of Macau(澳门大学) The Chinese PLA General Hospital(中国人民解放军总医院) Duke University(杜克大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出SutureFormer,一种基于目标引导的离线强化学习框架,通过稀疏标注到密集奖励信号的插值,有效学习手术针轨迹预测,减少平均位移误差58.6%。

详情
AI中文摘要

从内窥镜视频预测手术针轨迹对于机器人辅助缝合至关重要,能够实现预见性规划、实时引导和更安全的运动执行。现有直接从视觉观测学习运动分布的方法往往忽视相邻运动步骤之间的序列依赖性。此外,稀疏路径点标注通常无法提供足够的监督,进一步增加了监督或模仿学习方法的难度。为了解决这些挑战,我们将基于图像的针轨迹预测 formulations 为一个序列决策问题,在其中将针尖视为一个在像素空间中逐步移动的智能体。这种 formulation 自然捕捉了针运动的连续性,并能够显式建模在时间上物理上合理的像素级状态转换。从这个角度来看,我们提出SutureFormer,一种目标引导的离线强化学习框架,通过三次样条插值将稀疏标注转换为密集奖励信号,鼓励策略在利用有限专家指导的同时探索合理的未来运动路径。SutureFormer 使用观察编码器编码可变长度片段,以捕捉局部空间线索和长距离时间动态,并通过由离散方向和连续幅度组成的操作自回归地预测未来路径点。为了实现从专家演示中稳定离线策略优化,我们采用保守Q学习与行为克隆正则化。在包含1,158条轨迹的新的肾伤口缝合数据集中进行的实验表明,与最强基线相比,SutureFormer将平均位移误差减少了58.6%,证明了将针轨迹预测建模为像素级序列动作学习的有效性。

英文摘要

Predicting surgical needle trajectories from endoscopic video is critical for robot-assisted suturing, enabling anticipatory planning, real-time guidance, and safer motion execution. Existing methods that directly learn motion distributions from visual observations tend to overlook the sequential dependency among adjacent motion steps. Moreover, sparse waypoint annotations often fail to provide sufficient supervision, further increasing the difficulty of supervised or imitation learning methods. To address these challenges, we formulate image-based needle trajectory prediction as a sequential decision-making problem, in which the needle tip is treated as an agent that moves step by step in pixel space. This formulation naturally captures the continuity of needle motion and enables the explicit modeling of physically plausible pixel-wise state transitions over time. From this perspective, we propose SutureFormer, a goal-conditioned offline reinforcement learning framework that leverages sparse annotations to dense reward signals via cubic spline interpolation, encouraging the policy to exploit limited expert guidance while exploring plausible future motion paths. SutureFormer encodes variable-length clips using an observation encoder to capture both local spatial cues and long-range temporal dynamics, and autoregressively predicts future waypoints through actions composed of discrete directions and continuous magnitudes. To enable stable offline policy optimization from expert demonstrations, we adopt Conservative Q-Learning with Behavioral Cloning regularization. Experiments on a new kidney wound suturing dataset containing 1,158 trajectories from 50 patients show that SutureFormer reduces Average Displacement Error by 58.6% compared with the strongest baseline, demonstrating the effectiveness of modeling needle trajectory prediction as pixel-level sequential action learning.

2603.20380 2026-05-19 cs.MA cs.AI cs.HC 版本更新

Herding CATs: ALARA for Agent Harness Engineering in Portable Composable Multi-Agent Teams

Herding CATs: ALARA for Agent Harness Engineering in Portable Composable Multi-Agent Teams

Christopher J. Agostino, Nayan D'Souza

发表机构 * Celeria, Inc.(Celeria公司) Department of Linguistics, Indiana University(印第安纳大学语言学系)

AI总结 本文提出ALARA原则应用于多智能体团队的代理 harness 工程,通过引入CAT数据层,使用户能够直接声明工具访问权限并修改代理使用的工具,从而提升智能体在各种任务中的表现。

Comments Accepted to HAXD 2026, 8 pages, 6 figures

详情
AI中文摘要

行业从业者和学术研究人员经常使用多智能体系统来加速他们的工作,但用户使用的应用系统并未提供一种简单统一的机制来可扩展地管理代理 harness 的关键组件。这种缺乏控制对个体人机交互的质量产生了负面影响,并且限制了从业者协调上下文工程努力的能力。定义此类系统中智能体可以执行的行为规范仍然分散在文本指令文件中(无法保证合规性)或框架内部配置中,使得这些规范在跨团队和项目共享、版本控制或协作维护时变得困难。应用辐射安全中的ALARA原则(暴露量应尽可能低),我们引入了一种通过相互关联的纯文本文件表达的上下文-智能体-工具(CAT)数据层,允许用户为每个智能体直接声明工具访问权限,并在处理时修改智能体使用的工具本身。我们通过使用命令行 shell(加载团队并执行智能体运行)-- npcsh -- 和评估22个本地托管的模型(从0.6B到35B参数)在115个实际任务中的表现(包括文件操作、网络搜索、多步骤脚本、工具链和多智能体委托)来展示CAT数据层的能力。我们还表征了哪些模型家族在某些任务类别中成功,以及在约2500次总执行中它们的失败点。

英文摘要

Industry practitioners and academic researchers regularly use multi-agent systems to accelerate their work, but the applications through which users operate these systems do not provide a simple, unified mechanism for scalably managing critical components of the agent harness. This lack of control adversely impacts both the quality of individual human-agent interactions and reduces the capacity for practitioners to coordinate context engineering efforts. The behavioral specifications that define what agents in such systems can do remain fragmented across prose instruction files -- for which compliance cannot be guaranteed -- or framework-internal configurations, making these specifications difficult to share, version, or collaboratively maintain across teams and projects. Applying the ALARA principle from radiation safety (exposures kept as low as reasonably achievable) to context, we introduce a context-agent-tool (CAT) data layer expressed through interrelated plain-text files, allowing users to directly declare tool access for each agent and to modify the tools themselves that are used by the agents when processing. We demonstrate capability of this CAT data layer to enable real agentic usage by using a command-line shell that loads the team and executes agent runs -- \texttt{npcsh} -- and evaluating 22 locally-hosted models from 0.6B to 35B parameters across 115 practical tasks spanning file operations, web search, multi-step scripting, tool chaining, and multi-agent delegation. We characterize which model families succeed in certain task categories and where they break down across $\sim$2500 total executions.

2603.20216 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Locally Coherent Parallel Decoding in Diffusion Language Models

局部相干并行解码在扩散语言模型中

Michael Hersche, Nicolas Menet, Ronan Tanios, Abbas Rahimi

发表机构 * IBM Research - Zurich(IBM瑞士研究实验室)

AI总结 本文提出CoDiLA方法,通过引入小型辅助自回归模型来解决扩散语言模型在并行解码中的相干性问题,从而在代码生成任务中实现更高的准确性和速度。

Comments Accepted at ICML 2026

详情
AI中文摘要

扩散语言模型(DLMs)作为一种有前景的替代自回归(AR)模型,提供了亚线性生成延迟和双向能力,这在代码生成和编辑中尤为吸引人。在离散DLMs中实现亚线性延迟需要并行预测多个token。然而,标准DLMs从条件边缘分布独立采样token,无法捕捉同时生成token之间的联合依赖关系。因此,它们常常导致语法不一致并破坏多token结构。在本工作中,我们引入CoDiLA(Coherent Diffusion with Local Autoregression),一种方法,通过引入小型辅助AR模型来解决并行采样与局部依赖建模之间的矛盾。该方法将局部解码委托给一个小型辅助AR模型,该模型在扩散潜变量上进行操作。这种设计允许并行生成,同时在块内确保序列的有效性,并保持核心DLM能力,包括跨块的双向建模。我们证明使用高度紧凑的辅助AR模型(例如,0.6B参数)可以有效消除相干性伪影,在代码生成基准中建立了一个新的帕累托前沿。

英文摘要

Diffusion language models (DLMs) have emerged as a promising alternative to autoregressive (AR) models, offering sub-linear generation latency and bidirectional capabilities that are particularly appealing for code generation and editing. Achieving sub-linear latency in discrete DLMs requires predicting multiple tokens in parallel. However, standard DLMs sample tokens independently from conditional marginal distributions, failing to capture the joint dependencies among concurrently generated tokens. As a result, they often lead to syntactic inconsistencies and break multi-token structures. In this work, we introduce CoDiLA (Coherent Diffusion with Local Autoregression), a method that reconciles parallel sampling with local dependency modeling. Rather than forcing the DLM to resolve fine-grained syntax, CoDiLA delegates local decoding to a small, auxiliary AR model operating on the diffusion latents. This design allows for parallel generation while ensuring sequential validity within a block and maintaining core DLM capabilities, including bidirectional modeling across blocks. We demonstrate that using a highly compact auxiliary AR model (e.g., 0.6B parameters) effectively eliminates coherence artifacts, establishing a new Pareto frontier for accuracy and speed in code generation benchmarks.

2603.17577 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Identifying Latent Actions and Dynamics from Offline Data via Demonstrator Diversity

通过示范多样性从离线数据中识别潜在动作和动态

Felix Schur

发表机构 * ETH Zürich(苏黎世联邦理工学院)

AI总结 本文研究了在不观察动作的情况下从离线轨迹中恢复潜在动作和环境动态的问题,通过示范多样性假设,证明了在满足特定条件时,潜在转移和示范策略可以被唯一确定,从而为从离线强化学习数据中学习潜在动作和动态提供了新的方法。

详情
AI中文摘要

在动作未被观察的情况下,能否从离线轨迹中恢复潜在动作和环境动态?我们研究了在轨迹无动作但带有示范者身份标签的设置中这一问题。我们假设每个示范者遵循不同的策略,而环境动态在所有示范者之间是共享的,身份仅通过所选动作影响下一个观测。在这些假设下,条件下一个观测分布 $p(o_{t+1}\mid o_t,e)$ 是潜在动作条件化转移核的混合,具有示范者特定的混合权重。我们证明,这导致每个状态的可观测条件分布具有列随机非负矩阵分解。通过充分分散的策略多样性和秩条件,我们证明潜在转移和示范策略在潜在动作标签的排列下是可识别的。通过Gram行列式最小体积准则,我们将结果扩展到连续观测空间,并证明在连接的状态空间上转移映射的连续性将局部排列模糊性提升为单一全局排列。少量标记的动作数据足以消除最终的模糊性。这些结果确立了示范多样性作为从离线强化学习数据中学习潜在动作和动态的原理性可识别性来源。

英文摘要

Can latent actions and environment dynamics be recovered from offline trajectories when actions are never observed? We study this question in a setting where trajectories are action-free but tagged with demonstrator identity. We assume that each demonstrator follows a distinct policy, while the environment dynamics are shared across demonstrators and identity affects the next observation only through the chosen action. Under these assumptions, the conditional next-observation distribution $p(o_{t+1}\mid o_t,e)$ is a mixture of latent action-conditioned transition kernels with demonstrator-specific mixing weights. We show that this induces, for each state, a column-stochastic nonnegative matrix factorization of the observable conditional distribution. Using sufficiently scattered policy diversity and rank conditions, we prove that the latent transitions and demonstrator policies are identifiable up to permutation of the latent action labels. We extend the result to continuous observation spaces via a Gram-determinant minimum-volume criterion, and show that continuity of the transition map over a connected state space upgrades local permutation ambiguities to a single global permutation. A small amount of labeled action data then suffices to fix this final ambiguity. These results establish demonstrator diversity as a principled source of identifiability for learning latent actions and dynamics from offline RL data.

2603.17041 2026-05-19 stat.ML cs.AI cs.LG stat.ME 版本更新

When Marginals Match but Structure Fails: Covariance Fidelity in Generative Models

当边缘匹配但结构失败:生成模型中的协方差保真度

Nazia Riasat

发表机构 * North Dakota State University(北达科他州立大学)

AI总结 本文提出了一种基于协方差层面的依赖保真度评估标准,以弥补传统边缘分布匹配评估方法的不足,通过实验证明该标准能更准确地区分结构保留与结构丢失的生成模型。

Comments 44 pages, 25 figures. Extended version of paper accepted at MathAI 2026 (International Conference on Mathematics of Artificial Intelligence), March 30 - April 3, 2026

详情
AI中文摘要

生成模型正越来越多地被用作真实数据的替代品用于下游科学流程,但标准评估标准仍然集中在边缘分布匹配上。我们主张这代表了一个根本性的差距:下游推断很少是边缘操作,且一个通过所有单变量诊断的模型仍可能产生结构不可靠的合成数据。我们引入了协方差层面的依赖保真度,通过D_Sigma(P,Q) = ||Sigma_P - Sigma_Q||_F来衡量生成模型是否在超出单变量边缘之外保留数据的联合结构。三个结果正式化了这一准则。首先,边缘保真度对依赖结构没有任何约束:D_Sigma可以被任意增大,同时所有单变量边缘完全匹配。其次,协方差分歧会引起可量化的下游不稳定性,包括总体回归系数的符号反转。第三,通过Davis-Kahan型界提供对依赖敏感过程如PCA的正向稳定性保证。在三个领域,图像数据(Fashion-MNIST VAE,n = 60,000)、批量RNA-seq(TCGA-BRCA,n = 1,111)和小样本压力测试(阿尔茨海默症基因表达,n = 113)的实证验证显示,D_Sigma/delta在标准边缘诊断显示很少分离的情况下,能一致地区分结构丢弃与结构保留的生成器,确认了协方差层面保真度在跨领域和样本大小上提供了与现有评估指标正交的信息。

英文摘要

Generative models are increasingly deployed as substitutes for real data in downstream scientific workflows, yet standard evaluation criteria remain focused on marginal distribution matching. We argue that this represents a fundamental gap: downstream inference is rarely a marginal operation, and a model that passes every univariate diagnostic can still produce structurally unreliable synthetic data. We introduce covariance-level dependence fidelity, measured by D_Sigma(P,Q) = ||Sigma_P - Sigma_Q||_F, as a principled, computable criterion for evaluating whether a generative model preserves the joint structure of data beyond its univariate marginals. Three results formalise this criterion. First, marginal fidelity provides no constraint on dependence structure: D_Sigma can be made arbitrarily large while all univariate marginals match exactly. Second, covariance divergence induces quantifiable downstream instability, including sign reversals in population regression coefficients. Third, bounding D_Sigma provides positive stability guarantees for dependence-sensitive procedures such as PCA via Davis-Kahan-type bounds. Empirical validation across three domains, image data (Fashion-MNIST VAE, n = 60,000), bulk RNA-seq (TCGA-BRCA, n = 1,111), and a small-sample stress test (Alzheimer's gene expression, n = 113), shows that D_Sigma/delta consistently distinguishes structure-discarding from structure-preserving generators in cases where standard marginal diagnostics show little separation, confirming that covariance-level fidelity provides information orthogonal to existing evaluation metrics across domains and sample sizes.

2603.14371 2026-05-19 cs.RO cs.AI 版本更新

OxyGen: Unified KV Cache Management for VLA Inference under Multi-Task Parallelism

OxyGen: 为多任务并行下的VLA推理提供统一的KV缓存管理

Xiangyu Li, Huaizhi Tang, Xin Ding, Weijun Wang, Ting Cao, Yunxin Liu

发表机构 * Institute for AI Industry Research (AIR)(人工智能产业研究院) Department of Electronic Engineering(电子工程系) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出OxyGen,一种统一的KV缓存管理方法,用于在多任务并行下提高VLA推理效率,通过跨任务KV共享和跨帧连续批处理实现冗余计算和资源竞争的减少,从而在设备端实现更高的吞吐量和频率。

Comments Preprint

详情
AI中文摘要

具身AI代理越来越多地需要在不同的时间约束下从共享观察中并行执行多个任务,如操作、对话和记忆构建。最近的混合变换器(MoT)视觉-语言-动作模型(VLAs)在架构上支持这种异构输出,但现有的推理系统由于冗余计算和资源竞争未能在设备部署中实现高效的多任务并行。我们发现孤立的KV缓存管理是根本原因。为此,我们提出了统一的KV缓存管理,一种将KV缓存作为跨任务和时间的第一类共享资源的推理设计。这种抽象使两种关键优化成为可能:跨任务的KV共享消除了共享观察的冗余预填充,而跨帧连续批处理将可变长度的语言解码与固定速率的动作生成解耦。我们为流行的MoT VLA π_{0.5} 实现了这种设计,并在NVIDIA GeForce RTX 4090和Jetson AGX Thor两个代表性的设备端VLA推理平台上进行了评估。OxyGen在孤立执行的情况下实现了高达3.7倍的加速,同时在不降低动作质量的情况下,实现了超过200 tokens/s的语言吞吐量和70 Hz的动作频率,并进一步在搭载Jetson AGX Thor的现实人形机器人上验证了这些收益。

英文摘要

Embodied AI agents increasingly require parallel execution of multiple tasks, such as manipulation, conversation, and memory construction, from shared observations under distinct time constraints. Recent Mixture-of-Transformers (MoT) Vision-Language-Action Models (VLAs) architecturally support such heterogeneous outputs, yet existing inference systems fail to achieve efficient multi-task parallelism for on-device deployment because of redundant computation and resource contention. We identify isolated KV cache management as the root cause. To address this, we propose unified KV cache management, an inference design that treats the KV cache as a first-class shared resource across tasks and over time. This abstraction enables two key optimizations: cross-task KV sharing eliminates redundant prefill of shared observations, while cross-frame continuous batching decouples variable-length language decoding from fixed-rate action generation across control cycles. We implement this design for $π_{0.5}$, a popular MoT VLA, and evaluate it on both NVIDIA GeForce RTX 4090 and Jetson AGX Thor, two representative platforms for on-device VLA inference. OxyGen achieves up to 3.7$\times$ speedup over isolated execution, delivering over 200 tokens/s language throughput and 70 Hz action frequency simultaneously without degrading action quality, and we further validate the gains on a real humanoid robot with on-board Jetson AGX Thor.

2603.08290 2026-05-19 cs.LG cs.AI 版本更新

Minor First, Major Last: A Depth-Induced Implicit Bias of Sharpness-Aware Minimization

先浅后深:一种由深度诱导的sharpness-aware minimization的隐式偏见

Chaewon Moon, Dongkuk Si, Chulhee Yun

发表机构 * Graduate School of AI, KAIST(韩国成均馆大学人工智能研究生院) Mobilint, Inc.(Mobilint公司)

AI总结 该研究探讨了在训练线性可分二分类问题时,sharpness-aware minimization (SAM) 的隐式偏见,发现对于深度L=2的情况,SAM的行为与深度L=1时不同,展示了sequential feature amplification现象。

Comments Accepted to ICLR 2026, 84 pages, 35 figures

详情
AI中文摘要

我们研究了在训练L层线性对角网络时,sharpness-aware minimization (SAM) 的隐式偏见。对于线性模型(L=1),ℓ∞-SAM和ℓ2-SAM都能恢复ℓ2最大间隔分类器,与梯度下降(GD)一致。然而,对于深度L=2,行为发生剧烈变化——即使在单例数据集上。对于ℓ∞-SAM,极限方向依赖于初始化,并可能收敛到零向量或任何标准基向量,与GD的极限方向形成鲜明对比。对于ℓ2-SAM,我们证明其极限方向与GD的ℓ1最大间隔解一致,但有限时间动态表现出我们称之为“顺序特征放大”的现象,即预测器最初依赖于次要坐标,然后逐渐转向更大的坐标。我们的理论分析将这种现象归因于ℓ2-SAM在扰动中应用的梯度归一化因子,该因子在早期放大次要坐标,允许主要坐标在后期主导。合成和真实数据实验验证了我们的发现。

英文摘要

We study the implicit bias of Sharpness-Aware Minimization (SAM) when training $L$-layer linear diagonal networks on linearly separable binary classification. For linear models ($L=1$), both $\ell_\infty$- and $\ell_2$-SAM recover the $\ell_2$ max-margin classifier, matching gradient descent (GD). However, for depth $L = 2$, the behavior changes drastically -- even on a single-example dataset. For $\ell_\infty$-SAM, the limit direction depends critically on initialization and can converge to $\mathbf{0}$ or to any standard basis vector, in stark contrast to GD, whose limit aligns with the basis vector of the dominant data coordinate. For $\ell_2$-SAM, we show that although its limit direction matches the $\ell_1$ max-margin solution as in the case of GD, its finite-time dynamics exhibit a phenomenon we call "sequential feature amplification", in which the predictor initially relies on minor coordinates and gradually shifts to larger ones as training proceeds or initialization increases. Our theoretical analysis attributes this phenomenon to $\ell_2$-SAM's gradient normalization factor applied in its perturbation, which amplifies minor coordinates early and allows major ones to dominate later, giving a concrete example where infinite-time implicit-bias analyses are insufficient. Synthetic and real-data experiments corroborate our findings.

2603.07900 2026-05-19 cs.AI 版本更新

EveryQuery: Zero-Shot Clinical Prediction via Task-Conditioned Pretraining over Electronic Health Records

EveryQuery: 通过电子健康记录上的任务条件预训练实现零样本临床预测

Payal Chandak, Gregory Kondas, Liat Antwarg Friedman, Isaac Kohane, Matthew McDermott

发表机构 * Harvard-MIT HST(哈佛-麻省理工学院HST) Columbia University(哥伦比亚大学) Harvard Medical School(哈佛医学院)

AI总结 本文提出EveryQuery,一种通过任务条件预训练实现零样本临床预测的电子健康记录基础模型,通过直接估计未来窗口内结果发生的可能性,而非生成未来事件,从而在多个预测任务中优于自回归基线模型。

详情
AI中文摘要

在电子健康记录(EHR)上预训练的基础模型已通过生成合成患者未来和聚合采样轨迹的统计信息,展示了零样本临床预测能力。然而,这种自回归推理过程计算成本高、统计噪声大且不支持直接提示条件预测,因为用户无法直接根据特定临床问题条件预测。在本初步工作中,我们引入EveryQuery,一种EHR基础模型,通过任务条件预训练实现零样本推理。不同于生成未来事件,EveryQuery输入患者的历史和一个结构化的查询指定临床任务,并通过单次前向传递直接估计未来窗口内结果发生的可能性。EveryQuery通过在随机采样的查询任务和患者上下文中预训练,直接训练模型以产生正确的答案。这使得无需微调、线性探测或轨迹生成即可对查询空间中的任何任务进行零样本预测。在MIMIC-IV上,EveryQuery在82%的39个随机采样的预测任务中优于自回归基线模型,平均AUC提高+0.16(95%置信区间:[0.10,0.22])。这一优势在明确从预训练分布中排除的任务中保持一致。此外,EveryQuery的性能提升在罕见临床事件上最为显著,证实并展示了自回归推理在低预发率结果方面的根本限制的解决方案。然而,目前EveryQuery在需要对多个代码进行离散推理的任务上表现欠佳,如30天再入院,暴露了当前查询语言的表达性限制。

英文摘要

Foundation models pretrained on electronic health records (EHR) have demonstrated zero-shot clinical prediction capabilities by generating synthetic patient futures and aggregating statistics over sampled trajectories. However, this autoregressive inference procedure is computationally expensive, statistically noisy, and not natively promptable because users cannot directly condition predictions on specific clinical questions. In this preliminary work, we introduce EveryQuery, an EHR foundation model that achieves zero-shot inference through task-conditioned pre-training. Rather than generating future events, EveryQuery takes as input a patient's history and a structured query specifying a clinical task, and directly estimates the likelihood of the outcome occurring in the future window via a single forward pass. EveryQuery realizes this capability by pre-training over randomly sampled combinations of query tasks and patient contexts, directly training the model to produce correct answers to arbitrary input prompts. This enables zero-shot prediction for any task in the query space without finetuning, linear probing, or trajectory generation. On MIMIC-IV, EveryQuery outperforms an autoregressive baseline on 82% of 39 randomly sampled prediction tasks, with a mean AUC improvement of +0.16 (95% CI: [0.10,0.22]). This advantage remains consistent on tasks that were explicitly held out from the pre-training distribution. Further, EveryQuery's performance gains are most pronounced for rare clinical events, affirming and demonstrating a solution to the fundamental limitation of autoregressive inference for low-prevalence outcomes. However, at present, EveryQuery underperforms on tasks requiring disjunctive reasoning over multiple codes, such as 30-day readmission, exposing a concrete expressiveness limitation of the current query language.

2603.06984 2026-05-19 stat.ML cs.AI cs.GT cs.LG cs.SI 版本更新

Masking Causality and Conditional Dependence

掩盖因果关系与条件依赖

Zou Yang, Sophia Xiao, Bijan Mazaheri

发表机构 * Thayer School of Engineering(泰勒学校工程学院) Dartmouth College(达特茅斯学院)

AI总结 本文研究了通过平均约束来强制条件独立性的问题,发现这种约束在监管层面无法满足分层要求,而在优化者层面却能有效隐藏依赖关系,从而指出通过观测决策的平均统计来监管直接依赖是有限的,必须在决策规则层面进行监管。

详情
AI中文摘要

许多监管和分析问题要求被禁止的变量只能通过指定的允许渠道影响决策——这是一种出现在路径特定公平性、处理敏感信息和监管非公开信息交易等场景中的条件独立性要求。这些要求可以通过分层方式执行,或更常见且更高效地通过单个平均约束来执行。本文从监管者的角度将因果掩盖建模为一个线性规划,并证明平均约束优化几乎总是产生违反分层要求但恰好满足平均约束的政策。掩盖收益随着混淆和结果异质性增加而增长,检测需要精确的条件独立性测试,而平均约束旨在避免这些测试。从优化者的角度来看,相同的构造表明,被掩盖的政策恢复了大部分无约束利用的收益,但更难被检测到,因此在决策基础本身敏感的任何设置中都具有吸引力。这些结果表明,通过观测决策的平均统计来监管直接依赖在结构上是有限的,有意义的监管必须在决策规则本身层面进行。

英文摘要

Many regulatory and analytic problems require that a prohibited variable influence a decision only through a designated allowable channel -- a conditional-independence requirement that arises in path-specific fairness, the handling of classified information, and the regulation of trading on non-public information, among other settings. Such requirements may be enforced either stratum-by-stratum or, more commonly (and more efficiently), through a single averaged constraint on the conditional effect. We study the resulting enforcement problem from two perspectives. From the regulator's side, we formulate causal masking as a linear program and show that averaged-constraint optimization almost surely produces policies that violate the stratum-wise requirement while satisfying the averaged one exactly. The gains from masking grow with confounding and outcome heterogeneity, and detection requires precisely the conditional-independence tests that average constraints aim to avoid. From the optimizer's side, the same construction shows that masked policies recover most of the reward of unconstrained exploitation while being far harder to detect, making them attractive in any setting where the basis of decisions is itself sensitive. Together, these results argue that regulating direct dependence through averaged statistics on observed decisions is structurally limited, and that meaningful enforcement must operate at the level of the decision rule itself.

2603.04727 2026-05-19 cs.CV cs.AI 版本更新

Are Multimodal LLMs Ready for Surveillance? A Reality Check on Zero-Shot Anomaly Detection in the Wild

多模态大语言模型是否准备好用于监控?对零样本异常检测在现实中的检验

Shanle Yao, Armin Danesh Pazho, Narges Rashvand, Hamed Tabkhi

发表机构 * Electrical and Computer Engineering Department(电气与计算机工程系)

AI总结 本文研究了多模态大语言模型在现实中的零样本异常检测性能,发现其存在保守偏差,通过特定指令可以提升F1分数,但召回率仍是关键瓶颈。

详情
AI中文摘要

多模态大语言模型(MLLMs)在视频理解方面展示了出色的通用能力,但其在现实中的视频异常检测(VAD)可靠性仍待探索。与传统依赖重建或姿态线索的流程不同,MLLMs实现了将异常检测视为语言引导推理任务的范式转变。本文通过将VAD重新表述为二分类任务,在弱时间监督下系统评估了最先进的MLLMs在ShanghaiTech和CHAD基准上的性能。我们研究了提示特异性及时间窗口长度(1s-3s)对性能的影响,重点分析精度-召回率的权衡。研究发现,在零样本设置中存在显著的保守偏差;尽管模型表现出高置信度,但倾向于选择'正常'类,导致高精度但召回率崩溃,限制了实际应用。我们证明,针对类别的特定指令可显著改变这一决策边界,使ShanghaiTech的峰值F1分数从0.09提升至0.64,但召回率仍是关键瓶颈。这些结果突显了MLLMs在嘈杂环境中的显著性能差距,并为未来在召回导向提示和模型校准方面的研究提供了基础,这对需要复杂视频理解和推理的开放世界监控任务提出了要求。

英文摘要

Multimodal large language models (MLLMs) have demonstrated impressive general competence in video understanding, yet their reliability for real-world Video Anomaly Detection (VAD) remains largely unexplored. Unlike conventional pipelines relying on reconstruction or pose-based cues, MLLMs enable a paradigm shift: treating anomaly detection as a language-guided reasoning task. In this work, we systematically evaluate state-of-the-art MLLMs on the ShanghaiTech and CHAD benchmarks by reformulating VAD as a binary classification task under weak temporal supervision. We investigate how prompt specificity and temporal window lengths (1s--3s) influence performance, focusing on the precision--recall trade-off. Our findings reveal a pronounced conservative bias in zero-shot settings; while models exhibit high confidence, they disproportionately favor the 'normal' class, resulting in high precision but a recall collapse that limits practical utility. We demonstrate that class-specific instructions can significantly shift this decision boundary, improving the peak F1-score on ShanghaiTech from 0.09 to 0.64, yet recall remains a critical bottleneck. These results highlight a significant performance gap for MLLMs in noisy environments and provide a foundation for future work in recall-oriented prompting and model calibration for open-world surveillance, which demands complex video understanding and reasoning.

2602.10134 2026-05-19 cs.CR cs.AI cs.CL 版本更新

Reverse-Engineering Model Editing on Language Models

语言模型上的逆向工程模型编辑

Zhiyu Sun, Minrui Luo, Yu Wang, Zhili Chen, Tianxing He

AI总结 本文研究了语言模型中参数编辑的漏洞,提出了一种名为KSTER的逆向工程攻击方法,通过利用参数更新的低秩结构恢复编辑数据,并提出subspace camouflage防御策略以降低重建风险。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在预训练过程中会接触到包含万亿个标记的语料库,因此不可避免地会记住敏感信息。定位然后编辑方法作为一种主流的模型编辑范式,通过修改模型参数而不重新训练,提供了一个有前景的解决方案。然而,在本工作中,我们揭示了这一范式的关键漏洞:参数更新无意中充当了侧信道,使攻击者能够恢复编辑的数据。我们提出了一种两阶段的逆向工程攻击,称为KSTER(KeySpaceReconsThenEntropyReduction),该方法利用这些更新的低秩结构。首先,我们理论证明了更新矩阵的行空间编码了被编辑主体的“指纹”,通过谱分析可以准确恢复主体。其次,我们引入了一种基于熵的提示恢复攻击,重构了编辑的语义上下文。在多个LLM上的大量实验表明,我们的攻击能够以高成功率恢复编辑数据。此外,我们提出了一种名为subspace camouflage的防御策略,通过语义伪装来混淆更新指纹,从而有效降低重建风险,而不会影响编辑的实用性。我们的代码可在https://github.com/reanatom/EditingAttack上获得。

英文摘要

Large language models (LLMs) are pretrained on corpora containing trillions of tokens and, therefore, inevitably memorize sensitive information. Locate-then-edit methods, as a mainstream paradigm of model editing, offer a promising solution by modifying model parameters without retraining. However, in this work, we reveal a critical vulnerability of this paradigm: the parameter updates inadvertently serve as a side channel, enabling attackers to recover the edited data. We propose a two-stage reverse-engineering attack named \textit{KSTER} (\textbf{K}ey\textbf{S}paceRecons\textbf{T}ruction-then-\textbf{E}ntropy\textbf{R}eduction) that leverages the low-rank structure of these updates. First, we theoretically show that the row space of the update matrix encodes a ``fingerprint" of the edited subjects, enabling accurate subject recovery via spectral analysis. Second, we introduce an entropy-based prompt recovery attack that reconstructs the semantic context of the edit. Extensive experiments on multiple LLMs demonstrate that our attacks can recover edited data with high success rates. Furthermore, we propose \textit{subspace camouflage}, a defense strategy that obfuscates the update fingerprint with semantic decoys. This approach effectively mitigates reconstruction risks without compromising editing utility. Our code is available at https://github.com/reanatom/EditingAttack.

2602.09805 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Beyond Accuracy: Decomposing the Reasoning Efficiency of LLMs

超越准确率:分解大语言模型的推理效率

Daniel Kaiser, Arnoldo Frigessi, Ali Ramezani-Kebrya, Benjamin Ricaud

发表机构 * Integreat - Norwegian Centre for knowledge-driven machine learning(Integreat - 挪威知识驱动机器学习中心) UiT - The Arctic University of Norway(UiT - 北极大学) University of Oslo(奥斯陆大学)

AI总结 本文提出一种无需追踪的评估协议,通过完成率、条件正确性和生成长度三个指标分解大语言模型的token效率,同时考虑任务工作量元数据进行归一化处理,并评估模型在不同任务上的推理效率和冗余问题。

Comments Preprint (under review). 29 pages, 4 figures

详情
AI中文摘要

随着推理大语言模型越来越多地通过推理、搜索和自我纠正来换取准确性,单一的准确性分数已无法说明这些token是否带来了有用的推理、从困难实例中恢复或不必要的冗长。我们介绍了一种可选追踪的评估协议,通过三个即使在封闭模型中也可用的观测指标精确分解token效率:完成率、在完成条件下正确性的条件正确性以及生成长度。当实例级工作量元数据可用时,我们进一步将生成长度归一化为声明的任务隐含工作,并将平均口头冗余与工作量依赖的扩展分离。当此类元数据不可用时,我们定义了一个可审计的求解器衍生工作量规模,并在留出自我、留出top-k和持有参考池扰动下评估其稳定性。我们在CogniLoad、GSM8K、ProofWriter和ZebraLogic上评估了14个共享开放权重模型。我们进一步在CogniLoad上评估了11个额外模型,从而能够对推理任务难度因素进行细致分析:任务长度、内在难度和干扰项密度。效率和冗余排名在所有基准对中保持稳定,比准确性排名更加稳健,同时分解了逻辑受限、上下文受限(截断驱动)和冗余受限的失败模式,这些模式在准确性每token下看起来是相同的。我们发布了评估工具包和报告模板,详细说明了LLM在推理上的低效原因。

英文摘要

As reasoning LLMs increasingly trade tokens for accuracy through deliberation, search, and self-correction, a single accuracy score can no longer tell whether those tokens buy useful reasoning, recovery from hard instances, or unnecessary verbosity. We introduce a trace-optional evaluation protocol that exactly decomposes token efficiency using three observables available even for closed models: completion rate, conditional correctness given completion, and generated length. When instance-level workload metadata is available, we further normalize generated length by declared task-implied work and separate mean verbalization overhead from workload-dependent scaling. When such metadata is absent, we define an auditable solver-derived workload scale and evaluate its stability under leave-self-out, leave-top-k, and held-out-reference-pool perturbations. We evaluate 14 shared open-weight models on CogniLoad, GSM8K, ProofWriter, and ZebraLogic. We further evaluate 11 additional models on CogniLoad, enabling a fine-grained analysis of reasoning-task difficulty factors: task length, intrinsic difficulty, and distractor density. Efficiency and overhead rankings remain stable across all benchmark pairs, more robustly than accuracy rankings, while the decomposition separates logic-limited, context-limited (truncation-driven), and verbosity-limited failure modes that look identical under accuracy-per-token. We release an evaluation artifact and reporting template, which elaborates on why an LLM is inefficient at reasoning.

2602.07085 2026-05-19 q-fin.ST cs.AI q-fin.CP 版本更新

QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

QuantaAlpha: 一种基于大语言模型的alpha挖掘进化框架

Jun Han, Shuo Zhang, Wei Li, Yifan Dong, Tu Hu, Yumo Zhu, Xiaomin Yu, Xin Guo, Zhaowei Liu, Kunyi Wang, Jingping Liu, Tianyi Jiang, Ruichuan An, Sen Hu, Zhi Yang, Ronghao Che, Huacan Wang

发表机构 * SUFE(上海财经大学) QuantaAlpha SYSU(华南理工大学) PKU(北京大学)

AI总结 本文提出QuantaAlpha框架,通过进化算法改进alpha挖掘过程,通过轨迹级突变和交叉实现多轮搜索和经验重用,实验表明其在多个市场指数上均表现出稳健的性能。

详情
AI中文摘要

金融市场噪声和非平稳性使得alpha挖掘对回测噪声和制度转换高度敏感。尽管近期代理框架提高了自动化水平,但通常缺乏可控的多轮搜索和可靠的经验重用。为了解决这些挑战,我们提出了QuantaAlpha,一种进化alpha挖掘框架,将每个端到端挖掘运行视为轨迹,并通过轨迹级突变和交叉改进因素。QuantaAlpha定位次优步骤以进行针对性修订,并重新组合互补的高收益段以重用有效模式,从而在迭代中实现结构化探索和细化。在因子生成过程中,它强制假设、因子表达和可执行代码之间的语义一致性,并约束生成因子的复杂性和冗余性以缓解拥挤。在CSI 300上的大量实验表明,QuantaAlpha在强基线和先前代理系统上均表现出一致的优势。使用GPT-5.2,QuantaAlpha实现了IC为0.0472,ARR为4.68%,MDD为11.8%。此外,基于CSI 300挖掘的因子有效转移到CSI 500和S&P 500,分别在四年内分别产生约40.28%和19.1%的累计超额收益,这表明其在市场分布转换下的稳健性。

英文摘要

Financial markets are noisy and non-stationary, making alpha mining highly sensitive to backtest noise and regime shifts. While recent agentic frameworks improve automation, they often lack controllable multi-round search and reliable reuse of validated experience. To address these challenges, we propose QuantaAlpha, an evolutionary alpha mining framework that treats each end-to-end mining run as a trajectory and improves factors via trajectory-level mutation and crossover. QuantaAlpha localizes suboptimal steps for targeted revision and recombines complementary high-reward segments to reuse effective patterns, enabling structured exploration and refinement across iterations. During factor generation, it enforces semantic consistency across hypothesis, factor expression, and executable code, and constrains the complexity and redundancy of the generated factor to mitigate crowding. Extensive experiments on CSI 300 show consistent gains over strong baselines and prior agentic systems. Using GPT-5.2, QuantaAlpha achieves an IC of 0.0472 with ARR of 4.68% and MDD of 11.8%. Moreover, factors mined on CSI 300 transfer effectively to CSI 500 and the S&P 500, delivering about 40.28% and 19.1% cumulative excess return over four years, respectively, which indicates strong robustness under market distribution shifts.

2602.03664 2026-05-19 cs.AI cs.LG 版本更新

Mitigating Conversational Inertia in Multi-Turn Agents

缓解多轮代理中的对话惯性

Yang Wan, Zheng Cao, Zhenhao Zhang, Zhengwen Zeng, Shuheng Shen, Changhua Meng, Linchao Zhu

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China(浙江大学计算机科学与技术学院) University of Rochester, Rochester, NY, USA(罗切斯特大学)

AI总结 本文研究了多轮代理中对话惯性问题,提出通过上下文偏好学习来校准模型偏好,以减少惯性并提升性能。

Comments ICML2026

详情
AI中文摘要

大型语言模型在获得适当演示时表现出色,但在多轮代理场景中,LLM错误地模仿自身之前的响应作为少样本示例。通过注意力分析,我们识别出对话惯性现象,即模型对先前响应表现出强烈的对角注意力,这与模仿偏差有关,限制了探索。这揭示了将少样本LLM转化为代理时的张力:更长的上下文丰富了环境反馈以供利用,但也加剧了对话惯性,从而损害探索。我们的关键见解是,对于相同状态,生成时使用更长上下文的动作表现出更强的惯性,这使得可以在没有环境奖励的情况下构建偏好对。基于此,我们提出上下文偏好学习,以校准模型偏好,使模型更倾向于选择低惯性响应而非高惯性响应。我们进一步提供了推理时的上下文管理策略,以平衡探索与利用。在八个代理环境和一个深度研究场景中的实验结果验证了我们的框架能够减少对话惯性并实现性能提升。

英文摘要

Large language models excel as few-shot learners when provided with appropriate demonstrations, yet this strength becomes problematic in multiturn agent scenarios, where LLMs erroneously mimic their own previous responses as few-shot examples. Through attention analysis, we identify conversational inertia, a phenomenon where models exhibit strong diagonal attention to previous responses, which is associated with imitation bias that constrains exploration. This reveals a tension when transforming few-shot LLMs into agents: longer context enriches environmental feedback for exploitation, yet also amplifies conversational inertia that undermines exploration. Our key insight is that for identical states, actions generated with longer contexts exhibit stronger inertia than those with shorter contexts, enabling construction of preference pairs without environment rewards. Based on this, we propose Context Preference Learning to calibrate model preferences to favor low-inertia responses over highinertia ones. We further provide context management strategies at inference time to balance exploration and exploitation. Experimental results across eight agentic environments and one deep research scenario validate that our framework reduces conversational inertia and achieves performance improvements.

2602.02262 2026-05-19 cs.SE cs.AI cs.CL 版本更新

OmniCode: A Benchmark for Evaluating Software Engineering Agents

OmniCode: 一个评估软件工程代理的基准

Atharv Sonwane, Eng-Shen Tu, Wei-Chung Lu, Claas Beger, Carter Larsen, Debjit Dhar, Simon Alford, Rachel Chen, Ronit Pattanayak, Tuan Anh Dang, Guohao Chen, Gloria Geng, Kevin Ellis, Saikat Dutta

发表机构 * Cornell University(康奈尔大学) Independent contributor(独立贡献者) UC Santa Barbara(加州大学圣巴巴拉分校) Jadavpur University(贾瓦伊普尔大学) New York University(纽约大学)

AI总结 本文提出OmniCode,一个涵盖更广泛任务类别的软件工程基准,旨在评估软件代理在不同软件开发方面的性能。

详情
AI中文摘要

LLM驱动的编码代理正在重新定义现实世界软件的开发方式。为了推动研究向更好的编码代理发展,我们需要具有挑战性的基准,能够严格评估此类代理执行各种软件工程任务的能力。然而,流行的编码基准如HumanEval和SWE-Bench专注于狭窄的任务,如竞赛编程和补丁生成。实际上,软件工程师必须处理更广泛的任务来完成现实世界软件开发。为了解决这一差距,我们提出了OmniCode,一个新型的软件工程基准,包含比代码或补丁生成更广泛和多样化的任务类别。总体而言,OmniCode包含1794个任务,涵盖三种编程语言 - Python、Java和C++,以及四个关键类别:bug fixing、test generation、code review fixing和style fixing。与之前的软件工程基准不同,OmniCode的任务(1)经过人工验证以消除定义不清的问题,并且(2)合成创建或最近整理以避免数据泄漏问题,呈现了一个从有限现实数据中合成多样化软件任务的新框架。我们使用流行的代理框架如SWE-Agent评估OmniCode,显示尽管它们在Python的bug fixing上表现良好,但在Test Generation等任务以及C++和Java等语言上则表现不佳。例如,SWE-Agent在C++ Test Generation上使用DeepSeek-V3.1达到最大25.0%。OmniCode旨在作为稳健的基准,推动开发在软件开发不同方面表现良好的代理。代码和数据可在https://github.com/seal-research/OmniCode上获取。

英文摘要

LLM-powered coding agents are redefining how real-world software is developed. To drive the research towards better coding agents, we require challenging benchmarks that can rigorously evaluate the ability of such agents to perform various software engineering tasks. However, popular coding benchmarks such as HumanEval and SWE-Bench focus on narrowly scoped tasks such as competition programming and patch generation. In reality, software engineers have to handle a broader set of tasks for real-world software development. To address this gap, we propose OmniCode, a novel software engineering benchmark that contains a broader and more diverse set of task categories beyond code or patch generation. Overall, OmniCode contains 1794 tasks spanning three programming languages - Python, Java, and C++ - and four key categories: bug fixing, test generation, code review fixing, and style fixing. In contrast to prior software engineering benchmarks, the tasks in OmniCode are (1) manually validated to eliminate ill-defined problems, and (2) synthetically crafted or recently curated to avoid data leakage issues, presenting a new framework for synthetically generating diverse software tasks from limited real-world data. We evaluate OmniCode with popular agent frameworks such as SWE-Agent and show that while they may perform well on bug fixing for Python, they fall short on tasks such as Test Generation and in languages such as C++ and Java. For instance, SWE-Agent achieves a maximum of 25.0% with DeepSeek-V3.1 on C++ Test Generation. OmniCode aims to serve as a robust benchmark and spur the development of agents that can perform well across different aspects of software development. Code and data are available at https://github.com/seal-research/OmniCode.

2601.19667 2026-05-19 cs.CL cs.AI cs.IR cs.LG 版本更新

SynCABEL: Synthetic Contextualized Augmentation for Biomedical Entity Linking

SynCABEL:面向生物医学实体链接的合成上下文增强

Adam Remaki, Christel Gérardin, Eulàlia Farré-Maduell, Martin Krallinger, Xavier Tannier

发表机构 * Sorbonne Université, Inserm, Université Sorbonne Paris Nord, Limics(索邦大学、国家医学研究院、巴黎索邦大学、Limics) Service de médecine interne, Hôpital Tenon, Assistance Publique - Hôpitaux de Paris(内科服务部,Tenon医院,巴黎公共医院) Barcelona Supercomputing Center, Barcelona, Spain(巴塞罗那超级计算中心,西班牙巴塞罗那)

AI总结 SynCABEL通过利用大型语言模型生成丰富的上下文合成训练示例,解决了监督式生物医学实体链接中专家标注数据稀缺的问题,并在三个多语言基准上实现了新的最先进的结果。

Comments 7 pages, 5 figures

详情
AI中文摘要

我们提出了SynCABEL(Synthetic Contextualized Augmentation for Biomedical Entity Linking),一个框架,旨在解决监督式生物医学实体链接(BEL)中的核心瓶颈:专家标注训练数据的稀缺性。SynCABEL利用大型语言模型为目标知识库中的所有候选概念生成上下文丰富的合成训练示例,提供广泛的监督而无需手动标注。我们证明,当结合解码器-only模型和引导推理时,SynCABEL在三个广泛使用的多语言基准上建立了新的最先进结果:MedMentions(英语)、QUAERO(法语)和SPACCC(西班牙语)。评估数据效率时,我们显示SynCABEL在使用最多60%的标注数据的情况下达到全人工监督的性能,显著减少了对劳动密集型和昂贵的专家标注的依赖。最后,考虑到基于精确代码匹配的标准评估往往低估了由于本体冗余而具有临床价值的预测,我们引入了LLM-as-a-judge协议。这项分析揭示了SynCABEL显著提高了具有临床价值的预测率。我们的合成数据集、模型和代码已发布以支持可重复性和未来研究。

英文摘要

We present SynCABEL (Synthetic Contextualized Augmentation for Biomedical Entity Linking), a framework that addresses a central bottleneck in supervised biomedical entity linking (BEL): the scarcity of expert-annotated training data. SynCABEL leverages large language models to generate context-rich synthetic training examples for all candidate concepts in a target knowledge base, providing broad supervision without manual annotation. We demonstrate that SynCABEL, when combined with decoder-only models and guided inference, establishes new state-of-the-art results across three widely used multilingual benchmarks: MedMentions for English, QUAERO for French, and SPACCC for Spanish. Evaluating data efficiency, we show that SynCABEL reaches the performance of full human supervision using up to 60% less annotated data, substantially reducing reliance on labor-intensive and costly expert labeling. Finally, acknowledging that standard evaluation based on exact code matching often underestimates clinically valid predictions due to ontology redundancy, we introduce an LLM-as-a-judge protocol. This analysis reveals that SynCABEL significantly improves the rate of clinically valid predictions. Our synthetic datasets, models, and code are released to support reproducibility and future research.

2601.09413 2026-05-19 cs.SD cs.AI cs.CL cs.MA eess.AS 版本更新

Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception

Speech-Hands: 一种基于自我反思的语音代理方法用于语音识别和多感知音频推理

Zhen Wan, Chao-Han Huck Yang, Jinchuan Tian, Hanrong Ye, Ankita Pasad, Szu-wei Fu, Arushi Goel, Ryo Hachiuma, Shizhe Diao, Kunal Dhawan, Sreyan Ghosh, Yusuke Hirota, Zhehuai Chen, Rafael Valle, Chenhui Chu, Shinji Watanabe, Yu-Chiang Frank Wang, Boris Ginsburg

发表机构 * NVIDIA Kyoto University(京都大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出Speech-Hands框架,通过自我反思决策机制解决语音识别和外部声音理解任务中的信任问题,提升了模型在多任务音频推理中的准确性和鲁棒性。

Comments Accepted to ACL 2026. Oral Presentation. Code: https://github.com/YukinoWan/Speech-Hands OpenClaw Branch: https://github.com/openclaw/openclaw/pull/69073

详情
AI中文摘要

我们介绍了一种语音代理框架,该框架学习了一种关键的全方位理解技能:知道何时信任自身,何时咨询外部音频感知。我们的工作受到一个关键但反直觉的发现的启发:简单地在语音识别和外部声音理解任务上微调全方位模型往往会降低性能,因为模型容易被噪声假说误导。为了解决这个问题,我们的框架Speech-Hands将问题重新表述为一个显式的自我反思决策。这个可学习的反思原语在防止模型被错误的外部候选干扰方面证明是有效的。我们展示了这种代理行为机制能够自然地从语音识别推广到复杂的多选音频推理。在OpenASR排行榜上,Speech-Hands在七个基准测试中比强大的基线高出12.1%的WER。该模型在音频问答决策中也实现了77.37%的准确率和高F1分数,展示了在多样化的音频问答数据集上的鲁棒性和可靠性。通过统一感知和决策,我们的工作为更可靠和稳健的音频智能提供了实用路径。

英文摘要

We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, Speech-Hands consistently outperforms strong baselines by 12.1% WER on seven benchmarks. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.

2601.08679 2026-05-19 cs.AI 版本更新

PersonaDual: Balancing Personalization and Objectivity via Adaptive Reasoning

PersonaDual: 通过自适应推理平衡个性化与客观性

Xiaoyou Liu, Xinyi Mou, Shengbin Yue, Liang Wang, Yuqing Wang, Qiexiang Wang, Tianrui Qin, Zhongyu Wei

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) OPPO

AI总结 本文提出PersonaDual框架,通过自适应切换模式,在单一模型中实现通用客观推理与个性化推理的平衡,减少干扰并提升客观问题解决能力。

详情
AI中文摘要

随着用户对LLM对齐其偏好的期望增加,个性化信息变得有价值。然而,个性化信息可能是一把双刃剑:它能提高交互但可能损害客观性和事实准确性,尤其是在与问题不匹配时。为缓解此问题,我们提出PersonaDual,一个支持单个模型中通用目的客观推理和个性化推理的框架,并根据上下文自适应切换模式。PersonaDual首先通过SFT学习两种推理模式,然后通过强化学习和我们提出的DualGRPO进一步优化模式选择。在客观和个性化基准测试中,PersonaDual在保留个性化优势的同时减少干扰,实现近无干扰性能,并更有效地利用有用的个性化信号以改善客观问题解决。

英文摘要

As users increasingly expect LLMs to align with their preferences, personalized information becomes valuable. However, personalized information can be a double-edged sword: it can improve interaction but may compromise objectivity and factual correctness, especially when it is misaligned with the question. To alleviate this problem, we propose PersonaDual, a framework that supports both general-purpose objective reasoning and personalized reasoning in a single model, and adaptively switches modes based on context. PersonaDual is first trained with SFT to learn two reasoning patterns, and then further optimized via reinforcement learning with our proposed DualGRPO to improve mode selection. Experiments on objective and personalized benchmarks show that PersonaDual preserves the benefits of personalization while reducing interference, achieving near interference-free performance and better leveraging helpful personalized signals to improve objective problem-solving.

2601.01685 2026-05-19 cs.CL cs.AI cs.MA 版本更新

Lying with Truths: Open-Channel Multi-Agent Collusion for Belief Manipulation via Generative Montage

用真理欺骗:通过生成蒙太奇进行开放式通道多智能体合谋以操纵信念

Jinwei Hu, Xinmiao Huang, Youcheng Sun, Yi Dong, Xiaowei Huang

发表机构 * University of Liverpool(利物浦大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文研究了通过公开通道分发真实证据片段,利用多智能体合谋操纵信念的新威胁,提出了生成蒙太奇框架,展示了在14种LLM家族中74.4%的攻击成功率,并揭示了更强的推理能力反而增加了易受攻击的风险。

Comments Accepted to the ACL 2026 Main Conference (Oral Presentation)

详情
AI中文摘要

随着大型语言模型(LLMs)向自主代理合成实时信息转变,其推理能力引入了意想不到的攻击面。本文介绍了一种新的威胁,即合谋代理通过仅使用真实证据片段在公开通道中引导受害者信念,而无需依赖隐蔽通信、后门或伪造文件。通过利用LLMs的过度思考倾向,我们正式化了首次认知合谋攻击,并提出生成蒙太奇:一个由写作者-编辑-导演框架构成的框架,通过对抗性辩论和协调发布证据片段来构建欺骗性叙述,使受害者内化并传播伪造结论。为研究此风险,我们开发了CoPHEME数据集,该数据集源自真实世界谣言事件,并在多种LLM家族中模拟攻击。我们的结果表明,14种LLM家族普遍存在漏洞:攻击成功率达到74.4%(专有模型)和70.6%(开放式权重模型)。反直觉的是,更强的推理能力增加了易受攻击性,推理专精模型的攻击成功率高于基础模型或提示。此外,这些虚假信念会传播到下游判断者,达到超过60%的欺骗率,突显了LLM代理在动态信息环境中交互的社会技术脆弱性。我们的实现和数据可在:https://github.com/CharlesJW222/Lying_with_Truth/tree/main。

英文摘要

As large language models (LLMs) transition to autonomous agents synthesizing real-time information, their reasoning capabilities introduce an unexpected attack surface. This paper introduces a novel threat where colluding agents steer victim beliefs using only truthful evidence fragments distributed through public channels, without relying on covert communications, backdoors, or falsified documents. By exploiting LLMs' overthinking tendency, we formalize the first cognitive collusion attack and propose Generative Montage: a Writer-Editor-Director framework that constructs deceptive narratives through adversarial debate and coordinated posting of evidence fragments, causing victims to internalize and propagate fabricated conclusions. To study this risk, we develop CoPHEME, a dataset derived from real-world rumor events, and simulate attacks across diverse LLM families. Our results show pervasive vulnerability across 14 LLM families: attack success rates reach 74.4% for proprietary models and 70.6% for open-weights models. Counterintuitively, stronger reasoning capabilities increase susceptibility, with reasoning-specialized models showing higher attack success than base models or prompts. Furthermore, these false beliefs then cascade to downstream judges, achieving over 60% deception rates, highlighting a socio-technical vulnerability in how LLM-based agents interact with dynamic information environments. Our implementation and data are available at: https://github.com/CharlesJW222/Lying_with_Truth/tree/main.

2601.00360 2026-05-19 cs.MA cs.AI cs.CY 版本更新

Mapping Human Anti-collusion Mechanisms to Multi-agent AI Systems

将人类反串通机制映射到多智能体AI系统

Jamiu Idowu, Ahmed Almasoud, Ayman Alfahid

发表机构 * Sahel AI, Sahel Group Inc.(萨赫尔人工智能,萨赫尔集团有限公司) Prince Sultan University(普林斯顿国王大学) Majmaah University(马吉玛大学)

AI总结 本文研究如何将人类长期积累的反串通机制应用于多智能体AI系统,通过建立机制分类并提出实现方法,同时指出开放挑战如责任归属、身份流动性、边界问题和对抗性适应等。

Comments Accepted to ICML 2026 Workshop on Technical AI Governance Research (TAIGR); Published in Knowledge-Based Systems Journal

Journal ref Idowu, J., Almasoud, A. S., & Alfahid, A. (2026). Mapping human anti-collusion mechanisms to multi-agent AI systems. Knowledge-Based Systems, 344(116067), 116067. https://doi.org/10.1016/j.knosys.2026.116067

详情
AI中文摘要

随着多智能体AI系统日益自主,证据表明它们可以发展出类似于人类市场和机构中长期观察到的串通策略。尽管人类领域积累了数世纪的反串通机制,但如何将这些机制适应到AI环境中仍不清楚。本文通过(i)开发人类反串通机制的分类学,包括制裁、宽大处理与举报、监控与审计、市场设计以及治理,以及(ii)将这些机制映射到多智能体AI系统的潜在干预措施来填补这一空白。对于每种机制,我们提出了实现方法。我们还强调了开放挑战,例如归属问题(难以将涌现的协调归因于特定智能体)、身份流动性(智能体容易被分裂或修改)、边界问题(区分有益的合作与有害的串通)以及对抗性适应(智能体学习逃避检测)

英文摘要

As multi-agent AI systems become increasingly autonomous, evidence shows they can develop collusive strategies similar to those long observed in human markets and institutions. While human domains have accumulated centuries of anti-collusion mechanisms, it remains unclear how these can be adapted to AI settings. This paper addresses that gap by (i) developing a taxonomy of human anti-collusion mechanisms, including sanctions, leniency & whistleblowing, monitoring & auditing, market design, and governance and (ii) mapping them to potential interventions for multi-agent AI systems. For each mechanism, we propose implementation approaches. We also highlight open challenges, such as the attribution problem (difficulty attributing emergent coordination to specific agents), identity fluidity (agents being easily forked or modified), the boundary problem (distinguishing beneficial cooperation from harmful collusion), and adversarial adaptation (agents learning to evade detection).

2512.05136 2026-05-19 cs.CV cs.AI 版本更新

Fine-tuning an ECG Foundation Model to Predict Coronary CT Angiography Outcomes

微调一种心电图基础模型以预测冠状动脉CT血管造影结果

Yujie Xiao, Qinghao Zhao, Gongzheng Tang, Hao Zhang, Zhuoran Kan, Deyun Zhang, Jun Li, Guangkun Nie, Xiaocheng Fang, Haoyu Wang, Shun Huang, Tong Liu, Jian Liu, Kangyin Chen, Shenda Hong

发表机构 * Institute of Medical Technology, Peking University Health Science Center(北京大学人民医院医学技术研究所) National Institute of Health Data Science, Peking University(北京大学国家健康数据科学研究院) Department of Cardiology, Peking University People’s Hospital(北京大学人民医院心内科) Tianjin Key Laboratory of Ionic-Molecular Function of Cardiovascular Disease, Department of Cardiology, Tianjin Institute of Cardiology, The Second Hospital of Tianjin Medical University(天津医科大学心血管离子-分子功能重点实验室,天津心脏病学研究院,天津医科大学第二医院心内科) Heart Voice Medical Technology(心声医疗科技) School of Intelligence Science and Technology, Peking University(北京大学智能科学与技术学院)

AI总结 本文研究了通过微调心电图基础模型来预测冠状动脉CT血管造影结果的研究问题,采用多中心研究方法,利用CTCA作为解剖参考标准,开发并验证了AI-ECG模型,以预测血管特异性冠状动脉狭窄,并展示了模型在内部和外部验证中的表现,以及其在临床中的应用价值。

详情
AI中文摘要

CAD仍然是全球公共卫生的主要负担,然而可扩展的筛查工具有限。尽管CTCA是首选的非侵入性诊断方法,但其使用受到资源需求和辐射暴露的限制。AI-ECG可能为CAD风险分层提供补充方法。在多中心研究中,我们开发并验证了使用CTCA作为解剖参考标准的AI-ECG模型,以预测血管特异性冠状动脉狭窄。在内部验证中,模型在各血管上的AUC值为0.683-0.744,并表现出一致的外部性能。在临床正常ECG中保持了鉴别能力,并在各亚组中保持了广泛稳定性。模型预测的概率随着CTCA定义的狭窄严重程度呈单调增加。模型概率通过预定义的灵敏度和特异性基于阈值转换为血管特异性低、中、高风险分层。校准分析显示预测风险与观察风险之间的一致性,而DCA表明与“全部治疗”和“不治疗”策略相比,具有净临床获益。将AI衍生的风险分层与指南基于的PTP类别相结合,提高了排除性能,减少了灰色区域比例,并与PTP单独使用相比实现了正NRI。在纵向随访队列中,Kaplan-Meier分析显示模型定义的风险组在主要不良心血管事件风险上存在明显分离。波形和归因分析进一步识别了与高风险预测相关的结构化ECG形态差异和具有生理意义的信号区域。这些发现支持AI-ECG作为补充CAD筛查、解剖风险估计和临床分层的可行工具,但需要进一步的前瞻性研究来确认其临床影响。

英文摘要

CAD remains a major global public health burden, yet scalable screening tools are limited. Although CCTA is a first-line non-invasive diagnostic modality, its use is constrained by resource requirements and radiation exposure. AI-ECG may offer a complementary approach for CAD risk stratification. In this multicenter study, we developed and validated an AI-ECG model using CCTA as the anatomical reference standard to predict vessel-specific coronary stenosis. In internal validation, the model achieved AUC values of 0.683-0.744 across vessels and showed consistent external performance. Discrimination was maintained in clinically normal ECGs and remained broadly stable across subgroups. Model-predicted probabilities increased monotonically with CCTA-defined stenosis severity. Model probabilities were converted into vessel-specific low-, intermediate-, and high-risk strata using predefined sensitivity- and specificity-based thresholds. Calibration analysis showed agreement between predicted and observed risk, while DCA indicated net clinical benefit over treat-all and treat-none strategies. Integrating AI-derived risk strata with guideline-based PTP categories improved rule-out performance, reduced the gray-zone proportion, and achieved positive NRI compared with PTP alone. In a longitudinal follow-up cohort, Kaplan-Meier analysis showed clear separation of major adverse cardiovascular event risk across model-defined risk groups. Waveform- and attribution-based analyses further identified structured ECG morphology differences and physiologically meaningful signal regions associated with high-risk predictions. These findings support AI-ECG as a feasible tool for complementary CAD screening, anatomical risk estimation, and clinical triage, while prospective studies are needed to confirm its clinical impact.

2512.01537 2026-05-19 cs.SD cs.AI cs.IT cs.LG eess.SP math.IT 版本更新

Two-Dimensional Quantization for Geometry-Aware Audio Coding

二维量化用于几何感知的音频编码

Tal Shuster, Eliya Nachmani

发表机构 * School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Be’er Sheva, Israel(电气与计算机工程学院,内盖夫本· Gurion大学,贝尔谢巴,以色列)

AI总结 本文提出了一种二维量化方法Q2D2,通过将特征对投影到结构化的2D网格上,提高了音频压缩效率,同时保持了最先进的重建质量。

Comments Accepted to ICML 2026

详情
AI中文摘要

最近的神经音频编解码器在重建质量上取得了显著成就,通常依赖于残差向量量化(RVQ)、向量量化(VQ)和有限标量量化(FSQ)等量化方法。然而,这些量化技术限制了潜在空间的几何结构,使特征之间的相关性捕捉变得更加困难,导致表示学习、代码本利用和令牌速率的效率低下。在本文中,我们引入了二维量化(Q2D2),一种将特征对投影到结构化2D网格(如六边形、菱形或矩形铺砌)并量化到最近网格值的量化方案,从而生成由网格级别乘积定义的隐式代码本,其代码本大小与传统方法相当。尽管其简单的几何公式,Q2D2在音频压缩效率方面有所提升,具有低令牌速率和高代码本利用率,同时保持了最先进的重建质量。具体而言,Q2D2在语音、音频和音乐领域广泛实验中,在各种客观和主观重建度量上实现了具有竞争力甚至更优的性能。全面的消融研究进一步证实了我们设计选择的有效性。

英文摘要

Recent neural audio codecs have achieved impressive reconstruction quality, typically relying on quantization methods such as Residual Vector Quantization (RVQ), Vector Quantization (VQ) and Finite Scalar Quantization (FSQ). However, these quantization techniques limit the geometric structure of the latent space, make it harder to capture correlations between features leading to inefficiency in representation learning, codebook utilization and token rate. In this paper we introduce Two-Dimensional Quantization (Q2D2), a quantization scheme in which feature pairs are projected onto structured 2D grids, such as hexagonal, rhombic, or rectangular tiling and quantized to the nearest grid values, yielding an implicit codebook defined by the product of grid levels, with codebook sizes comparable to conventional methods. Despite its simple geometric formulation, Q2D2 improves audio compression efficiency, with low token rates and high codebook utilization while maintaining state of the art reconstruction quality. Specifically, Q2D2 achieves competitive to superior performance in various objective and subjective reconstruction metrics, across extensive experiments in speech, audio and music domains compared to state of the art models. Comprehensive ablation studies further confirm the effectiveness of our design choices.

2511.20857 2026-05-19 cs.CL cs.AI 版本更新

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Evo-Memory:通过自演化记忆基准测试LLM代理的测试时间学习

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, Derek Zhiyuan Cheng

发表机构 * Google DeepMind(谷歌深Mind) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出Evo-Memory,一个用于评估LLM代理自演化记忆能力的综合流基准和框架,通过构建序列任务流数据集,要求LLM在每次交互后搜索、适应和演化记忆,并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了超过十种代表性的记忆模块。

详情
AI中文摘要

状态性对于大型语言模型(LLM)代理进行长期规划和问题解决至关重要。这使得记忆成为关键组件,但其管理和进化仍 largely underexplored。现有的评估主要集中在静态对话设置上,其中记忆被动地从对话中检索以回答查询,忽略了在不断变化的任务流中积累和重用经验的能力。在现实世界环境中,如交互问题助手或具身代理中,LLM需要处理连续的任务流,但通常无法从积累的交互中学习,失去有价值的上下文见解,这限制了测试时间的进化,即LLM在部署期间持续检索、整合和更新记忆。为了弥合这一差距,我们引入了Evo-Memory,一个综合的流基准和框架,用于评估LLM代理的自演化记忆能力。Evo-Memory将数据集结构化为连续的任务流,要求LLM在每次交互后搜索、适应和演化记忆。我们统一并实现了超过十种代表性的记忆模块,并在10个多样化的多轮目标导向和单轮推理与问答数据集上评估了它们。为了更好地基准测试经验重用,我们提供了一个基线方法ExpRAG,用于检索和利用先前经验,并进一步提出ReMem,一个将推理、任务动作和记忆更新紧密集成的行动-思考-记忆精炼流程,以实现持续改进。

英文摘要

Statefulness is essential for large language model (LLM) agents to perform long-term planning and problem-solving. This makes memory a critical component, yet its management and evolution remain largely underexplored. Existing evaluations mostly focus on static conversational settings, where memory is passively retrieved from dialogue to answer queries, overlooking the dynamic ability to accumulate and reuse experience across evolving task streams. In real-world environments such as interactive problem assistants or embodied agents, LLMs are required to handle continuous task streams, yet often fail to learn from accumulated interactions, losing valuable contextual insights, a limitation that calls for test-time evolution, where LLMs retrieve, integrate, and update memory continuously during deployment. To bridge this gap, we introduce Evo-Memory, a comprehensive streaming benchmark and framework for evaluating self-evolving memory in LLM agents. Evo-Memory structures datasets into sequential task streams, requiring LLMs to search, adapt, and evolve memory after each interaction. We unify and implement over ten representative memory modules and evaluate them across 10 diverse multi-turn goal-oriented and single-turn reasoning and QA datasets. To better benchmark experience reuse, we provide a baseline method, ExpRAG, for retrieving and utilizing prior experience, and further propose ReMem, an action-think-memory refine pipeline that tightly integrates reasoning, task actions, and memory updates to achieve continual improvement.

2511.11654 2026-05-19 cs.LG cs.AI cs.MA 版本更新

Convergence of Multiagent Learning Systems for Traffic control

多智能体学习系统在交通控制中的收敛性

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa(亚马逊Alexa) Indian Institute of Science(印度科学研究院)

AI总结 本文研究了多智能体强化学习在交通信号控制中的收敛性问题,通过随机逼近方法分析学习动态,并证明了在特定条件下该算法能够收敛。

Comments 14 pages 2 figures

详情
AI中文摘要

快速城市化导致城市如班加罗尔面临严重的交通拥堵,使得高效的交通信号控制(TSC)变得至关重要。多智能体强化学习(MARL)作为一种减少平均通勤延误的有希望策略,通常将每个交通信号视为一个独立的智能体使用Q学习进行建模。尽管先前的工作Prashant L A等人已经证明了这种方法的有效性,但在交通控制背景下对这种算法稳定性及收敛性进行严谨理论分析的研究尚未开展。本文通过专注于该多智能体算法的理论基础,填补了这一空白。我们研究了在合作性TSC任务中使用独立学习者固有的收敛问题。利用随机逼近方法,我们正式分析了学习动态。本文的主要贡献是证明了特定的交通控制多智能体强化学习算法在给定条件下能够收敛,扩展了从单智能体收敛证明中异步价值迭代的结论。

英文摘要

Rapid urbanization in cities like Bangalore has led to severe traffic congestion, making efficient Traffic Signal Control (TSC) essential. Multi-Agent Reinforcement Learning (MARL), often modeling each traffic signal as an independent agent using Q-learning, has emerged as a promising strategy to reduce average commuter delays. While prior work Prashant L A et. al has empirically demonstrated the effectiveness of this approach, a rigorous theoretical analysis of its stability and convergence properties in the context of traffic control has not been explored. This paper bridges that gap by focusing squarely on the theoretical basis of this multi-agent algorithm. We investigate the convergence problem inherent in using independent learners for the cooperative TSC task. Utilizing stochastic approximation methods, we formally analyze the learning dynamics. The primary contribution of this work is the proof that the specific multi-agent reinforcement learning algorithm for traffic control is proven to converge under the given conditions extending it from single agent convergence proofs for asynchronous value iteration.

2511.07288 2026-05-19 cs.LG cs.AI 版本更新

Enabling Off-Policy Imitation Learning with Deep Actor Critic Stabilization

通过深度行为批评稳定化实现非策略模仿学习

Sayambhu Sen, Shalabh Bhatnagar

发表机构 * Amazon Alexa(亚马逊Alexa) Indian Institute of Science(印度科学研究院)

AI总结 本文提出一种结合非策略学习的对抗模仿学习算法,通过双Q网络稳定化和价值学习(无需奖励函数推断)来提高样本效率,从而更高效地匹配专家行为。

Comments 14 pages and 4 images

详情
AI中文摘要

使用强化学习(RL)学习复杂策略通常受到不稳定性慢收敛的阻碍,这一问题在奖励工程困难时尤为严重。模仿学习(IL)从专家演示中绕过了对奖励的依赖。然而,最先进的IL方法,如生成对抗模仿学习(GAIL)Ho等人,存在严重的样本不效率问题。这是由于其基础的策略学习算法,如TRPO Schulman等人,所导致的。在本文中,我们介绍了一种对抗模仿学习算法,该算法结合了非策略学习以提高样本效率。通过结合非策略框架和辅助技术,特别是在此情况下基于双Q网络的稳定化和价值学习(无需奖励函数推断),我们展示了在稳健匹配专家行为所需样本减少。

英文摘要

Learning complex policies with Reinforcement Learning (RL) is often hindered by instability and slow convergence, a problem exacerbated by the difficulty of reward engineering. Imitation Learning (IL) from expert demonstrations bypasses this reliance on rewards. However, state-of-the-art IL methods, exemplified by Generative Adversarial Imitation Learning (GAIL)Ho et. al, suffer from severe sample inefficiency. This is a direct consequence of their foundational on-policy algorithms, such as TRPO Schulman et.al. In this work, we introduce an adversarial imitation learning algorithm that incorporates off-policy learning to improve sample efficiency. By combining an off-policy framework with auxiliary techniques specifically, in this case a double Q network based stabilization and value learning without reward function inference we demonstrate a reduction in the samples required to robustly match expert behavior.

2511.06316 2026-05-19 cs.AI 版本更新

ALIGN: A Vision-Language Framework for High-Accuracy Accident Location Inference through Geo-Spatial Neural Reasoning

ALIGN:一种通过地理空间神经推理进行高精度事故位置推断的视觉-语言框架

MD Thamed Bin Zaman Chowdhury, Moazzem Hossain

发表机构 * Department of Civil Engineering, Bangladesh University of Engineering and Technology(孟加拉工程与技术大学土木工程系)

AI总结 本文提出ALIGN框架,通过视觉-语言模型整合文本和图像数据,以高精度推断事故位置,显著优于传统文本解析方法,实现了亚公里级的定位精度。

详情
AI中文摘要

在低收入和中等收入国家,公共安全和城市规划项目经常面临准确的、特定位置的道路事故数据短缺。从非结构化文本中提取可靠的地理信息需要克服传统文本基于地理编码工具的局限性,这些工具在多语言环境中常常无法处理含糊的地点描述。本研究引入ALIGN(通过地理空间神经推理进行事故位置推断),一种视觉-语言框架,旨在模拟人类空间推理能力,从非结构化的孟加拉语新闻报告和基于地图的线索中推断出精确的事故坐标。开发了一个多阶段自动化流程来处理多样化的文本和视觉数据,整合大语言模型用于线索提取与视觉-语言模型用于地图验证。使用代理架构,我们建模了一个迭代推理循环,结合光学字符识别(OCR)、基于网格的空间扫描以及三轮几何投票方法,以数学方式隔离和减少视觉幻觉。研究结果表明,多模态ALIGN框架显著优于传统文本-only地理解析基线。例如,所提出系统成功将平均定位误差从不可用的10.915公里减少到验证数据集上的亚公里精度0.593公里。此外,测试该框架与官方达卡警察局记录相比,证实了其可靠性,通过达到平均误差0.465公里。结果提供了一个高精度、无需训练的基础,用于数据稀少地区的自动化事故制图,支持证据驱动的道路安全政策制定,并促进多模态AI在交通分析中的整合。

英文摘要

In low- and middle-income countries, public safety and urban planning initiatives frequently face a critical shortage of accurate, location-specific road crash data. Extracting reliable geospatial information from unstructured text requires overcoming the limitations of traditional text-based geocoding tools, which often fail in multilingual environments with ambiguous place descriptions. This study introduces ALIGN (Accident Location Inference through Geo-Spatial Neural Reasoning), a vision-language framework designed to emulate human spatial reasoning to infer precise accident coordinates from unstructured Bangla news reports and map-based cues. A multi stage automated pipeline was developed to process diverse textual and visual data, integrating large language models for cue extraction with vision-language models for map verification. Using an agentic architecture, we modelled an iterative reasoning loop that combines Optical Character Recognition (OCR), grid-based spatial scanning, and a 3-run geometric voting method to mathematically isolate and reduce visual hallucinations. The findings highlight that the multimodal ALIGN framework significantly outperforms traditional text-only geoparsing baselines. For example, the proposed system successfully reduced the mean localization error from an unusable 10.915 km to a sub-kilometer precision of 0.593 km on a validation dataset. Furthermore, testing the framework against official Dhaka Metropolitan Police records confirmed its reliability by achieving a mean error of 0.465 km. The results provide a high-accuracy, training-free foundation for automated crash mapping in data-scarce regions, supporting evidence-driven road-safety policymaking and the integration of multimodal AI in transportation analytics.

2511.00392 2026-05-19 cs.RO cs.AI cs.CV 版本更新

SonarSweep: Fusing Sonar and Vision for Robust 3D Reconstruction via Plane Sweeping

SonarSweep: 通过平面扫描融合声纳与视觉以实现鲁棒的3D重建

Lingpeng Chen, Jiakun Tang, Apple Pui-Yi Chui, Ziyang Hong, Junfeng Wu

发表机构 * Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Chinese University of Hong Kong, Hong Kong(香港中文大学) Department of Automation, Harbin Institute of Technology(哈尔滨工业大学自动化系)

AI总结 本文提出SonarSweep,一种端到端的深度学习框架,通过将平面扫描算法应用于声纳与视觉数据的跨模态融合,克服了单一模态方法在 underwater 环境中3D重建的局限性,实现了更精确和稳定的深度图生成。

Comments 8 pages, 9 figures, conference

详情
AI中文摘要

在视觉退化的水下环境中实现准确的3D重建仍是一个严峻的挑战。单一模态方法不足:基于视觉的方法因可见性差和几何约束而失败,而声纳则因固有的高度歧义和低分辨率而受限。因此,先前的融合技术依赖于启发式方法和错误的几何假设,导致显著的伪影和无法建模复杂场景。在本文中,我们引入了SonarSweep,一种新颖的端到端深度学习框架,通过将原理性的平面扫描算法应用于声纳与视觉数据的跨模态融合,克服了这些限制。在高保真模拟和真实环境中的大量实验表明,SonarSweep能够一致地生成密集且准确的深度图,在挑战性条件下,特别是在高浊度情况下,显著优于最先进的方法。为了促进进一步研究,我们将公开我们的代码和一个新型的数据集,该数据集包含同步的立体相机和声纳数据,这是首次公开的此类数据集。

英文摘要

Accurate 3D reconstruction in visually-degraded underwater environments remains a formidable challenge. Single-modality approaches are insufficient: vision-based methods fail due to poor visibility and geometric constraints, while sonar is crippled by inherent elevation ambiguity and low resolution. Consequently, prior fusion technique relies on heuristics and flawed geometric assumptions, leading to significant artifacts and an inability to model complex scenes. In this paper, we introduce SonarSweep, a novel, end-to-end deep learning framework that overcomes these limitations by adapting the principled plane sweep algorithm for cross-modal fusion between sonar and visual data. Extensive experiments in both high-fidelity simulation and real-world environments demonstrate that SonarSweep consistently generates dense and accurate depth maps, significantly outperforming state-of-the-art methods across challenging conditions, particularly in high turbidity. To foster further research, we will publicly release our code and a novel dataset featuring synchronized stereo-camera and sonar data, the first of its kind.

2510.26745 2026-05-19 cs.LG cs.AI cs.CL stat.ML 版本更新

Deep sequence models tend to memorize geometrically; it is unclear why

深度序列模型倾向于记忆几何学;不清楚为何

Shahriar Noroozizadeh, Vaishnavh Nagarajan, Elan Rosenfeld, Sanjiv Kumar

发表机构 * Machine Learning Department \& Heinz College, Carnegie Mellon University, Pittsburgh, PA, USA Google Research, NY, USA

AI总结 研究探讨了深度序列模型中原子事实的存储机制,发现几何记忆能编码全局关系,即使在训练中未共现的实体间也能建立联系,挑战了传统关联记忆的观点。

Comments Forty-third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

深度序列模型被认为主要通过关联记忆存储原子事实,即通过暴力查找共现实体。我们识别出一种不同的存储形式,称为几何记忆。在此模型中,嵌入编码了所有实体之间的新型全局关系,包括训练中未共现的实体。这种存储形式强大:例如,我们展示了它如何将涉及ℓ-折叠组合的困难推理任务转化为易于学习的一步导航任务。从这一现象中,我们提取了神经嵌入几何学中难以解释的基本方面。我们认为,这种几何的出现,与局部关联的查找相比,不能简单归因于典型的监督、架构或优化压力。反直觉的是,即使几何比暴力查找更复杂,它仍然会被学习。然后,通过分析与Node2Vec的联系,我们展示了几何起源于一种光谱偏见,这与主流理论相反,确实自然产生,尽管缺乏各种压力。这一分析也指出了从业者在使Transformer记忆更几何化方面的可见空间。我们希望几何视角的参数记忆鼓励重新审视指导知识获取、容量、发现和遗忘等领域的默认直觉。

英文摘要

Deep sequence models are said to store atomic facts predominantly in the form of associative memory: a brute-force lookup of co-occurring entities. We identify a dramatically different form of storage of atomic facts that we term as geometric memory. Here, the model has synthesized embeddings encoding novel global relationships between all entities, including ones that do not co-occur in training. Such storage is powerful: for instance, we show how it transforms a hard reasoning task involving an $\ell$-fold composition into an easy-to-learn $1$-step navigation task. From this phenomenon, we extract fundamental aspects of neural embedding geometries that are hard to explain. We argue that the rise of such a geometry, as against a lookup of local associations, cannot be straightforwardly attributed to typical supervisory, architectural, or optimizational pressures. Counterintuitively, a geometry is learned even when it is more complex than the brute-force lookup. Then, by analyzing a connection to Node2Vec, we demonstrate how the geometry stems from a spectral bias that -- in contrast to prevailing theories -- indeed arises naturally despite the lack of various pressures. This analysis also points out to practitioners a visible headroom to make Transformer memory more strongly geometric. We hope the geometric view of parametric memory encourages revisiting the default intuitions that guide researchers in areas like knowledge acquisition, capacity, discovery, and unlearning.

2510.26018 2026-05-19 cs.RO cs.AI 版本更新

RADRON: Cooperative Localization of Ionizing Radiation Sources by MAVs with Compton Cameras

RADRON:通过配备康普顿相机的微型飞行器进行离子化辐射源的协同定位

Petr Stibinger, Tomas Baca, Daniela Doubravova, Jan Rusnak, Jaroslav Solc, Jan Jakubek, Petr Stepan, Martin Saska

AI总结 该研究提出了一种利用微型飞行器协同定位放射性物质的新方法,通过康普顿相机实时估计辐射源位置,即使在稀疏测量条件下也能实现高灵敏度检测。

Comments 8 pages, 9 figures, submitted for review to IEEE RA-L

详情
AI中文摘要

我们提出了一种新型方法,通过合作微型飞行器(MAVs)定位放射性物质。我们的方法利用了最先进的单探测器康普顿相机,作为高灵敏度且微型的离子化辐射探测器。该探测器极低的重量(40克)为由协作敏捷MAVs进行的辐射检测开辟了新可能。我们提出了一种新的基本概念,将康普顿相机测量融合以实时估计辐射源位置,即使从极稀疏的测量中也能做到。数据读取和处理直接在机载上进行,结果用于动态反馈以驱动车辆运动。MAVs在紧密协作的群体中稳定,以最大化康普顿相机获取的信息,快速定位辐射源,甚至跟踪移动的辐射源。

英文摘要

We present a novel approach to localizing radioactive material by cooperating Micro Aerial Vehicles (MAVs). Our approach utilizes a state-of-the-art single-detector Compton camera as a highly sensitive, yet miniature detector of ionizing radiation. The detector's exceptionally low weight (40 g) opens up new possibilities of radiation detection by a team of cooperating agile MAVs. We propose a new fundamental concept of fusing the Compton camera measurements to estimate the position of the radiation source in real time even from extremely sparse measurements. The data readout and processing are performed directly onboard and the results are used in a dynamic feedback to drive the motion of the vehicles. The MAVs are stabilized in a tightly cooperating swarm to maximize the information gained by the Compton cameras, rapidly locate the radiation source, and even track a moving radiation source.

2510.21712 2026-05-19 cs.IR cs.AI cs.CL 版本更新

DecoupleSearch: Decouple Planning and Search via Hierarchical Reward Modeling

DecoupleSearch: 通过分层奖励建模解耦规划与搜索

Hao Sun, Zile Qiao, Bo Wang, Guoxin Chen, Yingyan Hou, Yong Jiang, Pengjun Xie, Fei Huang, Yan Zhang

发表机构 * Tongyi Lab(通义实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文提出DecoupleSearch框架,通过双值模型解耦规划与搜索过程,利用蒙特卡洛树搜索评估每一步的质量,并通过分层束搜索迭代优化规划和搜索候选,验证了方法的有效性。

Comments EMNLP 2025 Main Conference

详情
AI中文摘要

检索增强生成(RAG)系统已作为一种增强大型语言模型(LLM)的关键方法,通过动态整合外部知识。为了进一步提高RAG的灵活性,代理RAG引入了自主代理到工作流程中。然而,代理RAG面临几个挑战:(1)每一步的成功取决于高质量的规划和准确的搜索;(2)中间推理步骤缺乏监督;(3)规划和搜索的候选空间呈指数级增长。为了解决这些挑战,我们提出了DecoupleSearch,一种新的框架,通过双值模型解耦规划和搜索过程,使规划推理和搜索基础能够独立优化。我们的方法构建了一个推理树,其中每个节点代表规划和搜索步骤。我们利用蒙特卡洛树搜索来评估每一步的质量。在推理过程中,分层束搜索通过双值模型迭代优化规划和搜索候选。在不同参数规模的策略模型上的广泛实验验证了我们方法的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) systems have emerged as a pivotal methodology for enhancing Large Language Models (LLMs) through the dynamic integration of external knowledge. To further improve RAG's flexibility, Agentic RAG introduces autonomous agents into the workflow. However, Agentic RAG faces several challenges: (1) the success of each step depends on both high-quality planning and accurate search, (2) the lack of supervision for intermediate reasoning steps, and (3) the exponentially large candidate space for planning and searching. To address these challenges, we propose DecoupleSearch, a novel framework that decouples planning and search processes using dual value models, enabling independent optimization of plan reasoning and search grounding. Our approach constructs a reasoning tree, where each node represents planning and search steps. We leverage Monte Carlo Tree Search to assess the quality of each step. During inference, Hierarchical Beam Search iteratively refines planning and search candidates with dual value models. Extensive experiments across policy models of varying parameter sizes demonstrate the effectiveness of our method.

2510.20584 2026-05-19 cs.CL cs.AI 版本更新

Automated Coding of Communication Data Using ChatGPT: Consistency Across Subgroups

使用ChatGPT自动编码通信数据:子群体一致性分析

Jiangang Hao, Wenju Cui, Patrick Kyllonen, Emily Kerzabi

发表机构 * ETS Research Institute(ETS研究机构)

AI总结 本文研究了使用ChatGPT进行通信数据编码在不同性别和种族/族裔群体间的一致性,发现其编码结果与人类评分者一致,为大规模评估协作与沟通提供了可能。

Comments Accepted to the Journal of Educational Measurement

详情
AI中文摘要

在大规模评估沟通和协作方面,对通信数据进行分类编码是一项劳动密集型任务,根据不同的框架进行分类。先前研究已证明,可以通过直接指示ChatGPT使用编码评分表来对通信数据进行编码,并且其准确性与人类评分者相当。然而,ChatGPT或类似AI技术在不同人口群体(如性别和种族)之间编码的一致性仍不清楚。为填补这一空白,我们引入了三种检查方法,用于评估基于LLM的编码中的子群体一致性,通过适应自自动化评分文献中已有的框架。使用典型的协作问题解决编码框架和三种类型的协作任务数据,我们检查了基于ChatGPT的编码在性别和种族/族裔群体中的表现。我们的结果表明,基于ChatGPT的编码在性别或种族/族裔群体中表现一致,与人类评分者一致,证明了其在大规模评估协作和沟通中的可行性。

英文摘要

Assessing communication and collaboration at scale depends on a labor-intensive task of coding communication data into categories according to different frameworks. Prior research has established that ChatGPT can be directly instructed with coding rubrics to code the communication data and achieves accuracy comparable to human raters. However, whether the coding from ChatGPT or similar AI technology perform consistently across different demographic groups, such as gender and race, remains unclear. To address this gap, we introduce three checks for evaluating subgroup consistency in LLM-based coding by adapting an existing framework from the automated scoring literature. Using a typical collaborative problem-solving coding framework and data from three types of collaborative tasks, we examine ChatGPT-based coding performance across gender and racial/ethnic groups. Our results show that ChatGPT-based coding perform consistently in the same way as human raters across gender or racial/ethnic groups, demonstrating the possibility of its use in large-scale assessments of collaboration and communication.

2510.11391 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DocReward: A Document Reward Model for Structuring and Stylizing

DocReward: 一种用于文档结构化和风格化的文档奖励模型

Junpeng Liu, Yuzhong Zhao, Bowen Cao, Jiayu Ding, Yilin Jia, Tengchao Lv, Yupan Huang, Wenshan Wu, Shaohan Huang, Nan Yang, Li Dong, Lei Cui, Tao Ge, Xun Wang, Huitian Jiao, Sun Mao, FNU Kartik, Si-Qing Chen, Wai Lam, Furu Wei

发表机构 * CUHK(香港大学) UCAS(中国科学技术大学) XJTU(西安交通大学) UMich(密歇根大学) Microsoft(微软)

AI总结 本文提出DocReward,一种用于评估文档结构和风格的奖励模型,通过构建包含117,000对文档的DocPair数据集,采用Bradley-Terry损失训练,有效提升了文档生成的结构和风格专业性。

详情
AI中文摘要

近期的代理工作流程自动化了专业文档生成,但主要关注文本质量,忽视了结构和风格的专业性,这对于可读性同样至关重要。这一差距主要源于缺乏有效的奖励模型,无法引导代理生成结构和风格专业的文档。我们引入DocReward,一种评估文档结构和风格的文档奖励模型。为此,我们提出了一种文本质量无关的框架,确保评估不受内容质量的影响,并构建了包含117,000对文档的DocPair数据集,涵盖32个领域和267种类型。每对文档内容相同,但结构和风格专业性不同。DocReward使用Bradley-Terry损失进行训练。在人工标注的基准测试中,DocReward在相同设置下比GPT-5高出14.6个百分点。强化学习实验进一步表明,DocReward能有效引导代理生成具有更一致结构和风格专业性的文档,突显了其实际应用价值。

英文摘要

Recent agentic workflows automate professional document generation but focus narrowly on textual quality, overlooking structural and stylistic professionalism, which is equally critical for readability. This gap stems mainly from a lack of effective reward models capable of guiding agents toward producing documents with high structural and stylistic professionalism. We introduce DocReward, a document reward model that evaluates documents based on their structure and style. To achieve this, we propose a textual-quality-agnostic framework that ensures assessments are not confounded by content quality, and construct DocPair, a dataset of 117K paired documents covering 32 domains and 267 types. Each pair shares identical content but differs in structural and stylistic professionalism. DocReward is trained using the Bradley-Terry loss. On a manually annotated benchmark, DocReward outperforms GPT-5 by 14.6 percentage points in the same setting. Reinforcement learning experiments further show that DocReward effectively guides agents toward generating documents with consistently higher structural and stylistic professionalism, highlighting its practical utility.

2510.10930 2026-05-19 cs.CL cs.AI 版本更新

Evaluating Language Models' Evaluations of Games

评估语言模型对游戏的评估

Katherine M. Collins, Cedegao E. Zhang, Graham Todd, Lance Ying, Mauricio Barba da Costa, Ryan Liu, Prafull Sharma, Adrian Weller, Ionatan Kuperwajs, Lionel Wong, Joshua B. Tenenbaum, Thomas L. Griffiths

发表机构 * University of Cambridge(剑桥大学) MIT(麻省理工学院) Princeton University(普林斯顿大学) NYU(纽约大学) Harvard University(哈佛大学) Stanford University(斯坦福大学)

AI总结 本文研究了语言模型对游戏评估的能力,通过比较现代语言模型和人类及符号计算代理的评估结果,发现推理模型在游戏评估上更接近人类,但随着模型接近博弈最优,其与人类数据的匹配度会减弱,且在评估趣味性时表现出更大的波动。

详情
AI中文摘要

推理不仅仅是解决问题,也是评估哪些问题值得解决。人工智能系统的历史评估主要集中在解决问题上,通过研究模型如何玩国际象棋和围棋等游戏。在本文中,我们倡导一种新的范式,即评估人工智能系统对游戏的评估。首先,我们引入了一种评估此类评估的形式化方法。然后利用超过100种新型棋盘游戏和450份人类判断的大型数据集,将现代语言和推理模型的评估结果与人类和符号计算代理的评估结果进行比较。我们考虑了两种类型的评估查询:评估游戏的收益(或公平性)和趣味性。这些查询涵盖了两个与AI评估设计相关的重要维度:计算查询的复杂性和量化查询的难度。我们的结果表明,推理模型在游戏评估上通常比非推理语言模型更接近人类。然而,我们观察到非单调的关系:随着模型接近博弈最优,其与人类数据的匹配度会减弱。我们还发现,在评估趣味性时,模型之间存在更多的波动性,这与量化该查询的难度更大有关。在各种查询和游戏中,推理模型在评估查询时表现出高度变化和不可预测的资源使用,这表明在语言和推理模型中加入更多资源理性的元推理非常重要。

英文摘要

Reasoning is not just about solving problems -- it is also about evaluating which problems are worth solving at all. Evaluations of artificial intelligence (AI) systems primarily focused on problem solving, historically by studying how models play games such as chess and Go. In this paper, we advocate for a new paradigm that assesses AI systems' evaluation of games. First, we introduce a formalism for evaluating such evaluations. We then leverage a large-scale dataset of over 100 novel board games and over 450 human judgments to compare evaluations produced by modern language and reasoning models against those of people and symbolic computational agents. We consider two kinds of evaluative queries: assessing the payoff (or fairness) and the funness of games. These queries span two dimensions relevant to the design of evaluations of AI evaluations: how complex a query is to compute and how difficult a query is to quantify. Our results show that reasoning models are generally more aligned to people in their evaluations of games than non-reasoning language models. However, we observe a non-monotonic relationship: as models get closer to game-theoretic optimal, their fit to human data weakens. We also observe more "jaggedness" across models for assessing funness, in line with the greater difficulty of quantifying this query. Across queries and games, reasoning models show highly variable and unpredictable resource usage when assessing queries, pointing to the importance of imbuing more resource-rational meta-reasoning in language and reasoning models.

2509.13397 2026-05-19 cs.CY cs.AI 版本更新

The threat of analytic flexibility in using large language models to simulate human data

使用大语言模型模拟人类数据时分析灵活性的威胁

Jamie Cummins

发表机构 * University of Bern, Switzerland(伯尔尼大学,瑞士) University of Oxford, United Kingdom(牛津大学,英国)

AI总结 本文研究了在使用大语言模型生成合成数据时,分析选择对合成数据与人类数据一致性的影响,发现不同的配置选择可能导致结论差异显著,呼吁关注分析灵活性的潜在威胁并提出减少该威胁的策略。

Comments 14 pages, 4 figures

详情
AI中文摘要

社会科学家现在使用大语言模型创建“硅样本”:合成数据集,旨在替代人类受访者。然而,生成这些样本需要许多分析选择,包括模型选择、采样参数、提示格式以及提供的性别或情境信息量。在两项研究中,我检验了这些选择是否对硅样本与人类数据的一致性产生实质性影响。在研究1中,我为受控案例研究生成了252个硅样本配置,使用两种社会心理量表,评估配置是否能恢复参与者排名、响应分布和量表间相关性。配置在所有三个标准上差异显著,且在某一维度表现良好的配置往往在另一维度表现不佳。在研究2中,我将此分析扩展到已发表的硅样本使用案例,通过66种替代配置重新审视Argyle等人(2023)的第三研究。人类与硅关联结构之间的相关性在不同配置下差异显著,从r=0.23到r=0.84。综合来看,这些研究的结果表明,不同的可辩护配置选择可以实质性地改变关于硅样本准确性的结论。我呼吁对使用硅样本时分析灵活性的威胁给予更多关注,并概述研究人员可能采用的减少此威胁的策略。

英文摘要

Social scientists are now using large language models to create "silicon samples": synthetic datasets intended to stand in for human respondents. However, producing these samples requires many analytic choices, including model selection, sampling parameters, prompt format, and the amount of demographic or contextual information provided. Across two studies, I examine whether these choices materially affect correspondence between silicon samples and human data. In Study 1, I generated 252 silicon-sample configurations for a controlled case study using two social-psychological scales, evaluating whether configurations recovered participant rankings, response distributions, and between-scale correlations. Configurations varied substantially across all three criteria, and configurations that performed well on one dimension often performed poorly on another. In Study 2, I extended this analysis to a published silicon-sample use case by re-examining Argyle et al.'s (2023) Study 3 using 66 alternative configurations. Correlations between human and silicon association structures differed substantially across configurations, from r = .23 to r = .84. Taken together, the results from these studies demonstrate that different defensible configuration choices can materially alter conclusions about the fidelity of silicon samples. I call for greater attention to the threat of analytic flexibility in using silicon samples and outline strategies that researchers may adopt to reduce this threat.

2509.07793 2026-05-19 econ.GN cs.AI cs.CY q-fin.EC 版本更新

Individual utilities of life satisfaction reveal inequality aversion unrelated to political alignment

个体生活满意度的效用揭示了与政治立场无关的不平等厌恶

Crispin Cooper, Ana Fredrich, Tommaso Reggiani, Wouter Poortinga

AI总结 研究通过实验探讨了社会福利优先级和公平与个人幸福之间的权衡,发现个体对社会生活满意度不平等的厌恶与政治立场无关,挑战了平均生活满意度作为政策指标的使用,支持非线性效用替代方案的发展。

Comments 28 pages, 4 figures. Replacement adds link to version of record

Journal ref Social Indicators Research 183, 12 (2026)

详情
AI中文摘要

社会应如何优先考虑福祉,人们愿意在公平与个人福祉之间做出哪些权衡?我们通过一项具有全国代表性的英国样本(n=300)的声明偏好实验来探讨这些问题,参与者在不确定性条件下评估了自己和他人的生活满意度结果。使用期望效用最大化(EUM)框架估计个体层面的效用函数,并测试对小概率的过度重视,如累积前景理论(CPT)所描述的。大多数参与者表现出凹形(风险厌恶)效用曲线,并且对社会生活满意度不平等的厌恶程度强于个人风险。这些偏好与政治立场无关,表明了一种超越意识形态边界的共享福祉公平规范立场。研究结果挑战了平均生活满意度作为政策指标的使用,并支持开发更准确反映集体人类价值观的非线性效用替代方案。讨论了对公共政策、福祉测量以及价值一致的AI系统设计的影响。

英文摘要

How should well-being be prioritised in society, and what trade-offs are people willing to make between fairness and personal well-being? We investigate these questions using a stated preference experiment with a nationally representative UK sample (n = 300), in which participants evaluated life satisfaction outcomes for both themselves and others under conditions of uncertainty. Individual-level utility functions were estimated using an Expected Utility Maximisation (EUM) framework and tested for sensitivity to the overweighting of small probabilities, as characterised by Cumulative Prospect Theory (CPT). A majority of participants displayed concave (risk-averse) utility curves and showed stronger aversion to inequality in societal life satisfaction outcomes than to personal risk. These preferences were unrelated to political alignment, suggesting a shared normative stance on fairness in well-being that cuts across ideological boundaries. The results challenge use of average life satisfaction as a policy metric, and support the development of nonlinear utility-based alternatives that more accurately reflect collective human values. Implications for public policy, well-being measurement, and the design of value-aligned AI systems are discussed.

2509.06984 2026-05-19 cs.LG cs.AI 版本更新

FediLoRA: Practical Federated Fine-Tuning of Foundation Models Under Missing-Modality Constraints

FediLoRA: 在缺失模态约束下联邦微调基础模型的实用方法

Lishan Yang, Wei Emma Zhang, Nam Kha Nguygen, Po Hu, Yanjun Shu, Weitong Chen, Mong Yuan Sim

发表机构 * Adelaide University(阿德莱德大学) Central China Normal University(中央中国师范大学) Harbin Institute of Technology(哈尔滨工程大学)

AI总结 本文提出FediLoRA,一种轻量级的联邦LoRA聚合框架,旨在解决联邦学习中异构环境下的缺失模态问题,通过联合简单平均和结构化编辑提升全局和个性化模型性能,实现在多个通用领域和医疗领域基准数据集上的强大表现。

Comments 8 pages, 7 figures

详情
AI中文摘要

联邦学习与LoRA微调提供了一种高效且隐私友好的解决方案,使机构能够协作利用其大规模数据集来训练VLLMs。然而,参与机构通常拥有异质计算资源,导致LoRA秩不平衡,这对有效协作构成重大挑战。此外,医疗和交通等现实应用领域常因用户错误或设备故障导致缺失模态,这显著降低了联邦设置中的全局模型性能。到目前为止,没有先前工作同时解决了联邦VLLMs中的这两个挑战。为了解决这些问题,我们提出FediLoRA,一种轻量级的联邦LoRA聚合框架,有效减轻了异构环境中的缺失模态影响。FediLoRA受到观察的启发,即简单平均和结构化编辑可以同时受益于全局和个性化模型。我们的方法在多个通用领域和医疗领域基准数据集上实现了强大性能。此外,在医疗数据上的额外实验进一步证明,FediLoRA适合实际应用部署场景。我们的代码已发布在https://github.com/gotobcn8/FediLoRA。

英文摘要

Federated Learning with LoRA fine-tuning offers an efficient and privacy-aware solution for institutions to collaboratively leverage their large datasets to train VLLMs. However, participating institutions often possess heterogeneous computational resources, resulting in imbalanced LoRA ranks, which pose a major challenge for effective collaboration. In addition, real-world applications in domains such as healthcare and transportation frequently suffer from missing modalities due to user mistakes or device failures, which significantly degrade global model performance in federated settings. To the best of our knowledge, no prior work has addressed these two challenges simultaneously in federated VLLMs. To tackle these issues, we propose FediLoRA, a lightweight federated LoRA aggregation framework that effectively mitigates the impact of missing modalities in heterogeneous environment. FediLoRA is explicitly motivated by the observation that simple averaging and structured editing can jointly benefit both global and personalized models. Our approach achieves strong performance across multiple general-domain and medical-domain benchmark datasets. Additional experiments on healthcare data further demonstrate that FediLoRA is well-suited for practical, real-world deployment scenarios. Our code is released at https://github.com/gotobcn8/FediLoRA.

2508.17431 2026-05-19 cs.CV cs.AI cs.LG 版本更新

FedKLPR: KL-Guided Pruning-Aware Federated Learning for Person Re-Identification

FedKLPR: 基于KL引导的剪枝感知联邦学习用于人重识别

Po-Hsien Yu, Yu-Syuan Tseng, Shao-Yi Chien

发表机构 * Media IC and System Lab, the Graduate Institute of Electronics Engineering and Department of Electrical Engineering, National Taiwan University(媒体IC与系统实验室,电子工程研究所及电气工程系,国立台湾大学)

AI总结 本文提出FedKLPR框架,通过KL散度引导训练、无结构剪枝和跨轮次恢复技术,解决联邦学习在人重识别中的统计异质性和通信开销问题,实验表明其在通信开销和准确性方面均优于现有方法。

Comments 10 pages, 3 figures, 5 tables, submitted to IEEE Transactions on Multimedia

详情
AI中文摘要

人重识别(re-ID)是智能监控和公共安全中的基本任务。联邦学习(FL)提供了一种隐私保护的协同模型训练范式,无需集中数据收集。然而,由于非独立同分布(non-IID)客户端数据导致的统计异质性和频繁传输大规模模型带来的通信开销,将FL应用于现实世界中的re-ID系统仍然具有挑战性。为了解决这些挑战,我们提出了FedKLPR,一种轻量且通信高效的联邦学习框架用于人重识别。FedKLPR包含三个关键组件。首先,KL散度引导训练,包括KL散度正则化损失(KLL)和KL散度聚合权重(KLAW),用于缓解统计异质性和在非IID设置下提高收敛稳定性。其次,引入无结构剪枝以减少通信开销,并提出剪枝率聚合权重(PRAW)以衡量剪枝后客户端参数的相对重要性。与KLAW结合,PRAW形成KL散度-剪枝权重聚合(KLPWA),使在异构数据分布下能够有效聚合剪枝后的本地模型。第三,跨轮次恢复(CRR)适应性地控制剪枝跨通信轮次以防止过度压缩并保持模型准确性。在八个基准数据集上的实验表明,FedKLPR在保持竞争性准确性的同时实现了显著的通信节省。与现有最先进方法相比,FedKLPR在ResNet-50上将通信成本减少了40%--42%,并实现了更优异的总体性能。

英文摘要

Person re-identification (re-ID) is a fundamental task in intelligent surveillance and public safety. Federated learning (FL) provides a privacy-preserving paradigm for collaborative model training without centralized data collection. However, deploying FL in real-world re-ID systems remains challenging due to statistical heterogeneity caused by non-IID client data and the substantial communication overhead incurred by frequent transmission of large-scale models. To address these challenges, we propose FedKLPR, a lightweight and communication-efficient federated learning framework for person re-ID. FedKLPR consists of three key components. First, KL-Divergence-Guided training, including the KL-Divergence Regularization Loss (KLL) and KL-Divergence-aggregation Weight (KLAW), is introduced to mitigate statistical heterogeneity and improve convergence stability under non-IID settings. Second, unstructured pruning is incorporated to reduce communication overhead, and the Pruning-ratio-aggregation Weight (PRAW) is proposed to measure the relative importance of client parameters after pruning. Together with KLAW, PRAW forms KL-Divergence-Prune Weighted Aggregation (KLPWA), enabling effective aggregation of pruned local models under heterogeneous data distributions. Third, Cross-Round Recovery (CRR) adaptively controls pruning across communication rounds to prevent excessive compression and preserve model accuracy. Experiments on eight benchmark datasets demonstrate that FedKLPR achieves substantial communication savings while maintaining competitive accuracy. Compared with state-of-the-art methods, FedKLPR reduces communication cost by 40\%--42\% on ResNet-50 while achieving better overall performance.

2508.16663 2026-05-19 cs.CV cs.AI cs.LG 版本更新

The Loupe: A Plug-and-Play Attention Module for Amplifying Discriminative Features in Vision Transformers

The Loupe: 一种用于增强视觉变换器中判别特征的插件式注意力模块

Naren Sengodan

发表机构 * Jain University(贾因大学)

AI总结 本文提出The Loupe模块,通过在视觉变换器的中间特征阶段插入轻量级插件式空间门控模块,利用小CNN预测单通道空间掩码,并在端到端训练中使用交叉熵目标和l1稀疏项对特征激活进行加权,从而提升细粒度视觉分类性能。

详情
AI中文摘要

细粒度视觉分类(FGVC)要求模型关注于细微的、与任务相关的区域,而非广泛的物体上下文。我们提出了The Loupe,一种轻量级的插件式空间门控模块,用于层次化的视觉变换器。该模块在中间特征阶段插入,使用小CNN预测单通道空间掩码,并在端到端训练中使用交叉熵目标和l1稀疏项对特征激活进行加权。在CUB-200-2011数据集上,The Loupe将Swin-Base的准确率从88.36%提升至91.72%,将Swin-Tiny的准确率从85.14%提升至88.61%,且仅增加0.1%的参数。消融实验表明,改进依赖于插入点和稀疏正则化器,表明受控的空间门控比朴素的多尺度遮蔽在此设置下更有效。定性结果表明,学习到的掩码通常与判别鸟类部分对齐,尽管该模块不是部分级监督的替代品,在遮挡或细粒度内部分差异时可能会失效。

英文摘要

Fine-Grained Visual Classification (FGVC) requires models to focus on subtle, task-relevant regions rather than broad object context. We present The Loupe, a lightweight plug-and-play spatial gating module for hierarchical Vision Transformers. The module is inserted at an intermediate feature stage, predicts a single-channel spatial mask with a small CNN, and uses that mask to reweight feature activations during end-to-end training with a cross-entropy objective and an l1 sparsity term. On CUB-200-2011, The Loupe improves Swin-Base from 88.36% to 91.72% and Swin-Tiny from 85.14% to 88.61%, with under 0.1% additional parameters. Ablations show that the improvement depends on the insertion point and the sparsity regularizer, suggesting that controlled spatial gating is more effective than naive multi-scale masking in this setting. Qualitative results indicate that the learned masks often align with discriminative bird parts, although the module is not a substitute for part-level supervision and can fail under occlusion or fine-grained intra-part differences.

2508.15878 2026-05-19 cs.LO cs.AI cs.CL cs.LG 版本更新

Lean Meets Theoretical Computer Science: Scalable Synthesis of Theorem Proving Challenges in Formal-Informal Pairs

Lean 与理论计算机科学的交汇:形式-非形式对中可扩展的定理证明挑战合成

Terry Jingchen Zhang, Wenyuan Jiang, Rongchuan Liu, Yisong Wang, Junran Yang, Ning Wang, Nicole Ni, Yinya Huang, Mrinmaya Sachan

发表机构 * D-CHAB, ETH Zurich, Zurich, Switzerland. D-INFK, ETH Zurich, Zurich, Switzerland. ETH AI Center, Zurich, Switzerland. University of Pennsylvania, PA, USA. Independent Researcher.

AI总结 本文提出利用理论计算机科学作为可扩展的严谨证明问题来源,通过算法定义自动生成大量挑战性定理-证明对,展示了在Busy Beaver问题和混合布尔算术问题上的应用,并揭示了自动定理证明在复杂问题上的局限性。

Comments Accepted to AI4MATH@ICML2025

详情
AI中文摘要

形式定理证明(FTP)已成为评估大语言模型推理能力的关键基础,使大规模自动验证数学证明成为可能。然而,进展受到有限数据集的限制,因为手动编纂成本高且缺乏具有验证形式-非形式对应关系的挑战性问题。我们提出利用理论计算机科学(TCS)作为可扩展的严谨证明问题来源,其中算法定义能够自动生成任意多的挑战性定理-证明对。我们在此两个TCS领域中展示了这种方法:Busy Beaver问题,涉及证明图灵机停止行为的界限,以及混合布尔算术问题,结合了逻辑和算术推理。我们的框架自动合成具有并行形式(Lean4)和非形式(Markdown)规范的问题,创建了一个可扩展的生成验证证明挑战的流水线。对前沿模型的评估揭示了自动定理证明的显著差距:尽管DeepSeekProver-V2-671B在Busy Beaver问题上达到57.5%的成功率,但在混合布尔算术问题上仅达到12%。这些结果突显了即使对于计算上易于验证的问题,长形式证明生成的难度,展示了TCS领域在推动自动推理研究中的价值。

英文摘要

Formal theorem proving (FTP) has emerged as a critical foundation for evaluating the reasoning capabilities of large language models, enabling automated verification of mathematical proofs at scale. However, progress has been constrained by limited datasets due to the high cost of manual curation and the scarcity of challenging problems with verified formal-informal correspondences. We propose leveraging theoretical computer science (TCS) as a scalable source of rigorous proof problems, where algorithmic definitions enable automated generation of arbitrarily many challenging theorem-proof pairs. We demonstrate this approach on two TCS domains: Busy Beaver problems, which involve proving bounds on Turing machine halting behavior, and Mixed Boolean Arithmetic problems, which combine logical and arithmetic reasoning. Our framework automatically synthesizes problems with parallel formal (Lean4) and informal (Markdown) specifications, creating a scalable pipeline for generating verified proof challenges. Evaluation on frontier models reveals substantial gaps in automated theorem proving: while DeepSeekProver-V2-671B achieves 57.5\% success on Busy Beaver problems, it manages only 12\% on Mixed Boolean Arithmetic problems. These results highlight the difficulty of long-form proof generation even for problems that are computationally easy to verify, demonstrating the value of TCS domains for advancing automated reasoning research.

2508.07292 2026-05-19 cs.AI cs.CL cs.CV 版本更新

EndoCogniAgent: Closed-Loop Agentic Reasoning with Self-Consistency Validation for Endoscopic Diagnosis

EndoCogniAgent: 闭环代理推理与自我一致性验证用于内窥镜诊断

Yi Tang, Kai-Ni Wang, Yang Chen, Xiaopu He, Guangquan Zhou

发表机构 * School of Biological Science and Medical Engineering, Southeast University(东南大学生物科学与医学工程学院) Jiangsu Key Laboratory of Biomaterials and Devices, Southeast University(江苏省生物材料与器件重点实验室) State Key Laboratory of Digital Medical Engineering, Southeast University(国家数字医学工程重点实验室) The First Affiliated Hospital of Nanjing Medical University(南京医科大学第一附属医院) Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Jiangsu Provincial Joint International Research Laboratory of Medical Information Processing(江苏省联合国际医学信息处理联合实验室) Laboratory of Image Science and Technology, the School of Computer Science and Engineering, Southeast University(图像科学与技术实验室,东南大学计算机科学与工程学院)

AI总结 该研究提出EndoCogniAgent框架,通过闭环代理推理和自我一致性验证提升内窥镜诊断的准确性与可靠性,其核心方法是将诊断过程建模为受控状态更新过程,并引入EndoAgentBench基准进行评估。

Comments 10 pages, 8 figures, 2 tables. Revised version with major updates on methodology and extended evaluation on EndoAgentBench. Code and data are available at https://github.com/Tyyds-ai/EndoCogniAgent

详情
AI中文摘要

内窥镜诊断是一个迭代过程,临床医生逐步获取、比较和验证局部视觉证据以得出结论。当前AI系统未能充分支持此过程,因为细粒度证据获取和多步推理仍弱相关,导致两种失败模式:幻觉证据和未纠正的误差累积,影响诊断可靠性。我们提出EndoCogniAgent,一种闭环代理框架,将内窥镜诊断建模为受控状态更新过程。在每次推理轮次中,中央计划器选择下一步证据获取动作,专用专家工具提取相应观察,自我一致性验证机制沿两个维度检查观察:知识一致性与输入图像以及时间一致性与先前验证的发现,然后更新诊断状态。验证的观察被纳入演进状态以指导后续计划,而缺乏充分支持的发现则保留并带有纠正反馈,引导计划器进行进一步验证。我们进一步引入EndoAgentBench,一个以工作流程为导向的基准,包含来自11个内窥镜数据集的6132个问题-答案对,旨在评估诊断代理在全面诊断链中的表现,从细粒度视觉感知到高水平诊断推理。实验显示,EndoCogniAgent在感知任务上达到85.23%的平均准确率,在推理任务上达到71.13%的临床接受率,消融分析确认自我一致性验证和事件状态维护对这些提升至关重要。

英文摘要

Endoscopic diagnosis is an iterative process in which clinicians progressively acquire, compare, and verify local visual evidence before reaching a conclusion. Current AI systems do not adequately support this process because fine-grained evidence acquisition and multi-step reasoning remain weakly coupled. This gives rise to two failure modes, hallucinated evidence and uncorrected error accumulation, that undermine diagnostic reliability. We propose EndoCogniAgent, a closed-loop agentic framework that formulates endoscopic diagnosis as a controlled state update process. At each reasoning round, a central planner selects the next evidence acquisition action, specialized expert tools extract the corresponding observation, and a self-consistency validation mechanism examines the observation along two dimensions, knowledge consistency against the input image and temporal consistency with prior validated findings, before updating the diagnostic state. Validated observations are admitted into the evolving state to condition subsequent planning, while insufficiently supported findings are retained with corrective feedback that redirects the planner toward additional verification. We further introduce EndoAgentBench, a workflow-oriented benchmark comprising 6,132 question-answer pairs from 11 endoscopic datasets, designed to evaluate diagnostic agents across a comprehensive diagnostic chain, from fine-grained visual perception to high-level diagnostic reasoning. Experiments show that EndoCogniAgent achieves 85.23\% average accuracy on perception tasks and 71.13\% clinical acceptance rate on reasoning tasks, with ablation analysis confirming that self-consistency validation and episodic state maintenance are individually critical to these gains.

2508.06038 2026-05-19 cs.CV cs.AI 版本更新

Fourier Compressor: Frequency-Domain Visual Token Compression for Vision-Language Models

Fourier Compressor: 频域视觉令牌压缩用于视觉-语言模型

Huanyu Wang, Jushi Kai, Haoli Bai, Lu Hou, Bo Jiang, Ziwei He, Zhouhan Lin

发表机构 * LUMIA Lab(LUMIA实验室) School of Artificial Intelligence(人工智能学院) Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Noah’s Ark Lab(诺亚实验室) Huawei Technologies Ltd.(华为技术有限公司) School of Computer Science(计算机科学学院)

AI总结 本文提出了一种基于频域的视觉令牌压缩策略,通过傅里叶变换减少计算开销并提升效率,同时保持语义准确性,实验表明其在图像和视频任务中均表现出色。

详情
AI中文摘要

视觉-语言模型(VLMs)由于高分辨率图像和视频输入引入的大量视觉令牌,导致计算开销和推理延迟显著增加。现有的无参数令牌压缩方法通常依赖于令牌选择或合并,但可能丢弃大量视觉信息或扭曲原始表示分布,导致在高压缩比下性能下降。为此,我们探索了一种更有效且高效的视觉令牌压缩策略,重点在频域方向。受图像压缩中频域变换(如JPEG)的成功启发,我们系统分析了视觉表示中的频域冗余,并揭示了不同频带中语义信息的非均匀分布。基于此,我们引入了傅里叶压缩器,一种有效、无参数且高度通用的模块,通过FFT(复杂度为O(n² log n))在频域内去除视觉表示的冗余。实现过程中无额外参数,计算开销极小且保持语义保真度。在图像基准测试中,我们的方法在保留超过96%原始准确率的同时,将推理FLOPs减少高达83.8%,生成速度提升31.2%。它在图像和视频理解任务中均表现出色,且在LLaVA和Qwen-VL架构中均能稳定泛化,证明其在高效VLMs中的实用价值。

英文摘要

Vision-Language Models (VLMs) incur substantial computational overhead and inference latency due to the large number of vision tokens introduced by high-resolution image and video inputs. Existing parameter-free token compression methods typically rely on token selection or merging, yet they risk discarding substantial visual information or distorting the original representation distribution, resulting in pronounced performance degradation at high compression ratios. In response, we aim to explore a more effective and efficient visual token compression strategy, with a promising direction in the frequency domain. Motivated by the success of frequency-domain transforms in image compression (e.g., JPEG), we systematically analyze the frequency redundancy in visual representations and uncover a non-uniform distribution of semantic information across frequency bands. Building upon this, we introduce Fourier Compressor, an effective, parameter-free, and highly generalizable module that removes redundancy from visual representations within the frequency domain. Implemented via FFT with $\mathcal{O}(n^2 \log n)$ complexity and no additional parameters, Fourier Compressor introduces negligible computational overhead while preserving semantic fidelity. Extensive experiments on image-based benchmarks demonstrate that our method achieves a favorable performance-efficiency trade-off, retaining over 96% of the original accuracy while reducing inference FLOPs by up to 83.8% and boosting generation speed by 31.2%. It consistently outperforms existing parameter-free methods and even surpasses some parameterized approaches. Importantly, Fourier Compressor generalizes consistently across both LLaVA and Qwen-VL architectures, and further extends to video understanding tasks, highlighting its practical applicability for efficient VLMs.

2506.16042 2026-05-19 cs.AI cs.LG cs.OS 版本更新

OSWorld-Human: Benchmarking the Efficiency of Computer-Use Agents

OSWorld-Human: 评估计算机使用代理的效率基准

Reyna Abhyankar, Qi Qi, Yiying Zhang

发表机构 * OpenAI Anthropic Google DeepMind ByteDance(字节跳动) Agent S2 GTA1 Lei Jedi

AI总结 本文研究了计算机使用代理在OSWorld基准上的时间性能,发现大模型调用导致高延迟,并构建了包含人类轨迹的OSWorld Human数据集,评估发现最佳代理仍需更多步骤。

详情
AI中文摘要

生成式AI正被用于解决涉及桌面应用的多种计算机使用任务。最先进的系统仅专注于提高领先基准的准确性。然而,这些系统由于端到端延迟极高(例如,数十分钟)而实际上不可用,因为通常只需人类几分钟即可完成的任务。为了理解这一现象并指导未来计算机代理的发展,我们首次研究了计算机使用代理在OSWorld基准上的时间性能。我们发现,规划、反思和判断的大模型调用占总延迟的主要部分,并且随着代理使用更多步骤完成任务,每一步骤的时间会比任务开始时的步骤长3倍。我们随后构建了OSWorld Human,即原始OSWorld数据集的手动标注版本,其中包含每个任务的人类确定轨迹。我们使用OSWorld Human评估了16个代理的效率,并发现即使最佳代理也比必要多出2.7-4.3倍的步骤。

英文摘要

Generative AI is being leveraged to solve a variety of computer-use tasks involving desktop applications. State-of-the-art systems have focused solely on improving accuracy on leading benchmarks. However, these systems are practically unusable due to extremely high end-to-end latency (e.g., tens of minutes) for tasks that typically take humans just a few minutes to complete. To understand the cause behind this and to guide future developments of computer agents, we conduct the first study on the temporal performance of computer-use agents on OSWorld, the flagship benchmark in computer-use AI. We find that large model calls for planning, reflection, and judging account for most of the overall latency, and as an agent uses more steps to complete a task, each successive step can take 3x longer than steps at the beginning of a task. We then construct OSWorld Human, a manually annotated version of the original OSWorld dataset that contains a human-determined trajectory for each task. We evaluate 16 agents on their efficiency using OSWorld Human and found that even the best agents take 2.7-4.3x more steps than necessary.

2506.08244 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Algebraic Priors for Approximately Equivariant Networks

代数先验用于近似等变网络

Riccardo Ali, Pietro Liò, Jamie Vicary

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出了一种无需参数的代数方法,利用群表示理论来构建等变网络的先验,通过实验验证该方法在多个任务中表现优异,甚至在无限群情况下也优于专门设计的模型。

详情
AI中文摘要

等变神经网络通过群作用来整合对称性,将其作为归纳偏差以提高性能。现有方法在潜在空间中学习等变作用,或设计具有等变结构的架构。这些方法通常能获得良好的经验结果,但可能涉及架构特定的约束、大量参数和高计算成本。我们挑战复杂等变架构范式,提出一种无参数的方法,基于群表示理论。我们证明,对于有限群上的等变编码器,潜在空间几乎必然包含每个线性无关数据轨道的一个副本,我们通过多个实验证明这一点。利用这一基础的代数洞察,我们通过辅助损失将群的正则表示作为归纳偏差,不增加可学习参数。我们的广泛评估显示,该方法在多个任务中表现优异,甚至在无限群情况下也优于专门设计的模型。我们进一步通过消融研究验证了正则表示的选择,显示其在所有情况下均优于定义和平凡群表示的基线模型。

英文摘要

Equivariant neural networks incorporate symmetries through group actions, embedding them as an inductive bias to improve performance. Existing methods learn an equivariant action on the latent space, or design architectures that are equivariant by construction. These approaches often deliver strong empirical results but can involve architecture-specific constraints, large parameter counts, and high computational cost. We challenge the paradigm of complex equivariant architectures with a parameter-free approach grounded in group representation theory. We prove that for an equivariant encoder over a finite group, the latent space must almost surely contain one copy of its regular representation for each linearly independent data orbit, which we explore with a number of empirical studies. Leveraging this foundational algebraic insight, we impose the group's regular representation as an inductive bias via an auxiliary loss, adding no learnable parameters. Our extensive evaluation shows that this method matches or outperforms specialized models in several cases, even those for infinite groups. We further validate our choice of the regular representation through an ablation study, showing it consistently outperforms defining and trivial group representation baselines.

2505.21893 2026-05-19 cs.LG cs.AI 版本更新

SIPO: Stabilized and Improved Preference Optimization for Aligning Diffusion Models

SIPO: 用于对齐扩散模型的人类偏好优化的稳定与改进方法

Xiaomeng Yang, Mengping Yang, Junyan Wang, Zhijian Zhou, Zhiyu Tan, Hao Li

发表机构 * Shanghai Science and Intelligence Institute, Shanghai, China(上海科学与智能研究所) Fudan University, Shanghai, China(复旦大学) Australian Institute for Machine Learning, The University of Adelaide(澳大利亚机器学习研究所,阿德莱德大学)

AI总结 本研究提出SIPO框架,通过时间步感知的重要性重新加权和梯度稳定技术,解决扩散模型对齐中训练不稳定和策略偏差问题,提升了对齐效果和稳定性。

Comments This version supplements with more detailed content on reasoning and proof, additional experimental results, and ablation studies

详情
AI中文摘要

偏好学习作为一种有效技术,已被广泛用于将扩散模型与人类偏好对齐在视觉生成中。然而,现有对齐方法如Diffusion-DPO面临两个根本性挑战:由于各个时间步的高梯度方差导致的训练不稳定以及由于优化数据与策略模型分布之间的差异引起的策略偏差。我们的第一项贡献是对不同时间步的扩散轨迹进行系统分析,发现不稳定性主要源于早期时间步的低重要性权重。为了解决这些问题,我们提出了SIPO,即一种用于将扩散模型与人类偏好对齐的稳定和改进的偏好优化框架。具体而言,引入了一个关键梯度,即DPO-C&M,通过裁剪和屏蔽无信息的时间步来稳定训练。随后,采用时间步感知的重要性重新加权范式以缓解策略偏差并在对齐过程中强调信息更新。在各种基线模型上进行的广泛实验,包括图像生成模型SD1.5、SDXL和视频生成模型CogVideoX-2B/5B、Wan2.1-1.3B,表明我们的SIPO在稳定训练和性能方面均优于现有对齐方法。总体而言,这些结果表明了时间步感知对齐的重要性,并为改进扩散模型的偏好优化提供了有价值的指导。

英文摘要

Preference learning has garnered extensive attention as an effective technique for aligning diffusion models with human preferences in visual generation. However, existing alignment approaches such as Diffusion-DPO suffer from two fundamental challenges: training instability caused by high gradient variances at various timesteps and high parameter sensitivities, and off-policy bias arising from the discrepancy between the optimization data and the policy models' distribution. Our first contribution is a systematic analysis of diffusion trajectories across different timesteps, identifying that the instability primarily originates from early timesteps with low importance weights. To address these issues, we propose \textbf{SIPO}, a \textbf{S}tabilized and \textbf{I}mproved \textbf{P}reference \textbf{O}ptimization framework for aligning diffusion models with human preferences. Concretely, a key gradient, \emph{i.e.,} DPO-C\&M is introduced to stabilize training by clipping and masking uninformative timesteps. This is followed by a timestep-aware importance-reweighting paradigm to mitigate off-policy bias and emphasize informative updates throughout the alignment process. Extensive experiments on various baseline models including image generation models on SD1.5, SDXL, and video generation models CogVideoX-2B/5B, Wan2.1-1.3B, demonstrate that our SIPO consistently promotes stabilized training and outperforms existing alignment methods that with meticulous adjustments on parameters.Overall, these results suggest the importance of timestep-aware alignment and provide valuable guidelines for improved preference optimization in aligning diffusion models.

2505.17138 2026-05-19 cs.LG cs.AI 版本更新

RAP: Runtime Adaptive Pruning for LLM Inference

RAP: 用于大语言模型推理的运行时自适应剪枝

Huanrong Liu, Chunlin Tian, Xuyang Wei, Qingbiao Li, Li Li

发表机构 * Faculty of Science and Technology, University of Macau, Macau, China(澳门大学科学与技术学院) School of Information and Software Engineering, University of Electronic Science and Technology of China, Chengdu, China(电子科技大学信息与软件工程学院)

AI总结 本文提出RAP,一种基于强化学习的弹性剪枝框架,通过动态调整压缩策略来适应运行时内存变化和异构KV缓存需求,首次在推理过程中同时考虑模型权重和KV缓存。

详情
AI中文摘要

大语言模型(LLMs)在语言理解和生成方面表现出色,但其巨大的计算和内存需求限制了部署。压缩提供了一种潜在的解决方案来缓解这些约束。然而,大多数现有方法依赖于固定的启发式方法,因此无法适应运行时内存变化或来自多样化用户请求的异构KV缓存需求。为了解决这些限制,我们提出了RAP,一种由强化学习(RL)驱动的弹性剪枝框架,能够以运行时感知的方式动态调整压缩策略。具体而言,RAP动态跟踪实际执行过程中模型参数与KV缓存之间的演变比例。认识到前馈网络(FFNs)包含大部分参数,而参数轻量的注意力层主导KV缓存的形成,RL代理只保留那些在当前内存预算内最大化效用的组件,基于即时的工作负载和设备状态。广泛的实验结果表明,RAP优于最先进的基线方法,标志着首次在推理过程中同时考虑模型权重和KV缓存。

英文摘要

Large language models (LLMs) excel at language understanding and generation, but their enormous computational and memory requirements hinder deployment. Compression offers a potential solution to mitigate these constraints. However, most existing methods rely on fixed heuristics and thus fail to adapt to runtime memory variations or heterogeneous KV-cache demands arising from diverse user requests. To address these limitations, we propose RAP, an elastic pruning framework driven by reinforcement learning (RL) that dynamically adjusts compression strategies in a runtime-aware manner. Specifically, RAP dynamically tracks the evolving ratio between model parameters and KV-cache across practical execution. Recognizing that FFNs house most parameters, whereas parameter -light attention layers dominate KV-cache formation, the RL agent retains only those components that maximize utility within the current memory budget, conditioned on instantaneous workload and device state. Extensive experiments results demonstrate that RAP outperforms state-of-the-art baselines, marking the first time to jointly consider model weights and KV-cache on the fly.

2505.16278 2026-05-19 cs.CV cs.AI cs.RO 版本更新

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

DriveMoE:面向端到端自动驾驶的视觉-语言-动作混合专家模型

Zhenjie Yang, Yilin Chai, Xiaosong Jia, Qifeng Li, Yuqian Shao, Xuekai Zhu, Haisheng Su, Junchi Yan

发表机构 * Sch. of Computer Science & Sch. of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学计算机科学学院与人工智能学院) Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) AnyScale AI Project(AnyScale AI项目)

AI总结 本文提出DriveMoE,一种基于混合专家架构的端到端自动驾驶框架,通过场景专用的视觉混合专家和技能专用的动作混合专家,实现了对复杂驾驶场景的有效处理,展示了在自动驾驶任务中结合视觉和动作混合专家的有效性。

Comments Accepted by CVPR 2026, Project Page: https://thinklab-sjtu.github.io/DriveMoE/

详情
AI中文摘要

端到端自动驾驶(E2E-AD)需要有效处理多视角传感器数据和稳健处理多样且复杂的驾驶场景,特别是罕见的激进转弯等场景。最近混合专家(MoE)架构在大语言模型(LLMs)中的成功表明,参数的专业化能够实现强大的可扩展性。在本工作中,我们提出了DriveMoE,一种新的基于MoE的E2E-AD框架,包含场景专用的视觉MoE和技能专用的动作MoE。DriveMoE基于我们$π_0$视觉-语言-动作(VLA)基线(最初来自具身AI领域),称为Drive-$π_0$。具体而言,我们通过训练一个路由器,根据驾驶上下文动态选择相关摄像头,将视觉MoE添加到Drive-$π_0$中。这种设计模仿了人类驾驶认知,即司机选择性地关注关键视觉线索,而不是穷尽处理所有视觉信息。此外,我们通过训练另一个路由器来激活针对不同驾驶行为的专用专家模块,通过显式的行为专业化,DriveMoE能够处理多样化的场景而不受现有模型中模式平均的困扰。在Bench2Drive闭环评估实验中,DriveMoE实现了最先进的性能,证明了在自动驾驶任务中结合视觉和动作MoE的有效性。我们将发布DriveMoE和Drive-$π_0$的代码和模型。

英文摘要

End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our $π_0$ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-$π_0$. Specifically, we add Vision MoE to Drive-$π_0$ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from modes averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-$π_0$.

2504.13217 2026-05-19 cs.CL cs.AI 版本更新

Sustainability via LLM Right-sizing

通过LLM右尺寸实现可持续性

Jennifer Haase, Finn Klessascheck, Jan Mendling, Sebastian Pokutta

AI总结 本文研究了在现实应用中,小型本地可部署模型是否足够好,通过评估十种LLM在日常职业任务中的表现,提出了一种基于可持续性的评估方法,强调在成本、本地部署和隐私方面的需求。

Comments 21 pages, 2 Figures, 6 Tables

详情
AI中文摘要

大型语言模型(LLMs)日益融入组织工作流程,引发了对其能源消耗、财务成本和数据主权的担忧。尽管性能基准常赞扬前沿模型,但实际部署决策需要更广泛的视角:何时小型、本地可部署的模型足够好?本研究通过评估十种专有和开源LLM在十种日常职业任务中的表现,提供实证答案。使用双LLM评估框架,自动化任务执行并标准化输出质量、事实准确性和伦理责任等十项标准。结果显示,GPT-4o在性能上始终优于,但成本和环境足迹显著更高。值得注意的是,较小的模型如Gemma-3和Phi-4在大多数任务中表现出强劲且可靠的结果,表明其在需要成本效率、本地部署或隐私的场景中的可行性。聚类分析揭示了三种模型群体——高端全能型、胜任的通用型和有限但安全的表演型,突显了质量、控制和可持续性之间的权衡。显著的是,任务类型影响了模型的有效性:概念性任务挑战了大多数模型,而聚合和转换任务则表现出更好的性能。我们主张从追求性能最大化的基准转向任务和情境感知的充分性评估,以更符合组织优先事项。我们的方法贡献了一种通过可持续性视角评估AI模型的可扩展方法,并为负责任的LLM部署提供了可行的指导。

英文摘要

Large language models (LLMs) have become increasingly embedded in organizational workflows. This has raised concerns over their energy consumption, financial costs, and data sovereignty. While performance benchmarks often celebrate cutting-edge models, real-world deployment decisions require a broader perspective: when is a smaller, locally deployable model "good enough"? This study offers an empirical answer by evaluating eleven proprietary and open-weight LLMs across ten everyday occupational tasks, including summarizing texts, generating schedules, and drafting emails and proposals. Using a dual-LLM-based evaluation framework, we automated task execution and standardized evaluation across ten criteria related to output quality, factual accuracy, and ethical responsibility. Results show that GPT-4o delivers consistently superior performance but at a significantly higher cost and environmental footprint. Notably, smaller models like Gemma-3 and Phi-4 achieved strong and reliable results on most tasks, suggesting their viability in contexts requiring cost-efficiency, local deployment, or privacy. A cluster analysis revealed three model groups -- premium all-rounders, competent generalists, and limited but safe performers -- highlighting trade-offs between quality, control, and sustainability. Significantly, task type influenced model effectiveness: conceptual tasks challenged most models, while aggregation and transformation tasks yielded better performances. We argue for a shift from performance-maximizing benchmarks to task- and context-aware sufficiency assessments that better reflect organizational priorities. Our approach contributes a scalable method to evaluate AI models through a sustainability lens and offers actionable guidance for responsible LLM deployment in practice.

2503.13934 2026-05-19 cs.RO cs.AI 版本更新

COLSON: Controllable Learning-Based Social Navigation via Diffusion-Based Reinforcement Learning

COLSON: 通过基于扩散的强化学习实现可控的社会导航

Kohei Matsumoto, Yuki Tomita, Yuki Hyodo, Ryo Kurazume

AI总结 本文提出了一种基于扩散的强化学习方法,用于社会导航,通过灵活的动作分布提高了导航的适应性和可控性,同时能够适应未见过的场景。

Comments ICRA 2026

详情
AI中文摘要

在动态环境中移动机器人导航面临行人交通的关键挑战,在自主移动服务机器人发展中尤为重要。最近,基于深度强化学习的方法被积极研究,并因其优化能力优于传统规则方法。其中,假设连续动作空间的方法通常依赖高斯分布,这限制了生成动作的灵活性。相比之下,将扩散模型应用于强化学习已取得进展,使动作分布比高斯策略方法更加灵活。在本研究中,我们应用基于扩散的强化学习方法进行社会导航,并验证其有效性。此外,通过利用扩散模型的特点,我们提出了能够适应以前未见过的场景而无需额外训练的扩展方法。作为具体场景示例,我们展示了适应环境中有静态障碍物的场景(这些障碍物在训练期间不存在),以及目标与训练不同的场景,例如在避免他人时陪同目标行人到达目的地。

英文摘要

Mobile robot navigation in dynamic environments with pedestrian traffic is a key challenge in the development of autonomous mobile service robots. Recently, deep reinforcement learning-based methods have been actively studied and have outperformed traditional rule-based approaches owing to their optimization capabilities. Among these methods, those that assume continuous action spaces typically rely on Gaussian distributions, which limit the flexibility of the generated actions. In contrast, the application of diffusion models to reinforcement learning has advanced, enabling more flexible action distributions than Gaussian policy-based approaches. In this study, we apply a diffusion-based reinforcement learning approach to social navigation and validate its effectiveness. Furthermore, by exploiting the characteristics of diffusion models, we propose extensions that enable adaptation to previously unseen scenarios without additional training. As concrete scenario examples, we demonstrate adaptability to scenarios in which static obstacles exist in the environment that were not present during training, as well as scenarios in which the objective differs from training, such as accompanying target pedestrians while avoiding others to reach the destination.

2503.02574 2026-05-19 cs.CR cs.AI 版本更新

LLM-Safety Evaluations Lack Robustness

大语言模型安全评估缺乏鲁棒性

Tim Beyer, Sophie Xhonneux, Simon Geisler, Gauthier Gidel, Leo Schwinn, Stephan Günnemann

发表机构 * Department of Computer Science \& Munich Data Science Institute, Technical University of Munich Canada AI CIFAR Chair

AI总结 本文指出当前大语言模型安全对齐研究受到多种交织的噪声源阻碍,如小数据集、方法学不一致和不可靠的评估设置,导致难以公平评估和比较攻击与防御,阻碍了进展。我们系统分析了大语言模型安全评估流程,涵盖数据集整理、自动化红队优化策略、响应生成和响应评估使用LLM法官。在每个阶段,我们识别了关键问题并突出了其实际影响。我们还提出了一套减少未来攻击和防御论文评估中噪声和偏见的指南。最后,我们提出了对立观点,强调现有限制的实用原因。我们相信,未来研究解决这些问题将提高领域生成可比结果的能力,并实现可衡量的进步。

详情
AI中文摘要

在本文中,我们论证当前大语言模型的安全对齐研究受到许多交织的噪声源的阻碍,例如小数据集、方法学不一致和不可靠的评估设置。这有时会使得无法公平地评估和比较攻击和防御,从而减缓进展。我们系统地分析了大语言模型安全评估流程,涵盖数据集整理、自动化红队优化策略、响应生成和使用LLM法官进行响应评估。在每个阶段,我们识别了关键问题并突出了其实际影响。我们还提出了一套减少未来攻击和防御论文评估中噪声和偏见的指南。最后,我们提出了对立观点,强调现有限制的实用原因。我们相信,未来研究解决这些问题将提高领域生成可比结果的能力,并实现可衡量的进步。

英文摘要

In this paper, we argue that current safety alignment research efforts for large language models are hindered by many intertwined sources of noise, such as small datasets, methodological inconsistencies, and unreliable evaluation setups. This can, at times, make it impossible to evaluate and compare attacks and defenses fairly, thereby slowing progress. We systematically analyze the LLM safety evaluation pipeline, covering dataset curation, optimization strategies for automated red-teaming, response generation, and response evaluation using LLM judges. At each stage, we identify key issues and highlight their practical impact. We also propose a set of guidelines for reducing noise and bias in evaluations of future attack and defense papers. Lastly, we offer an opposing perspective, highlighting practical reasons for existing limitations. We believe that addressing the outlined problems in future research will improve the field's ability to generate easily comparable results and make measurable progress.

2502.08534 2026-05-19 cs.CE cs.AI 版本更新

Input convex neural networks: universal approximation theorem and implementation for isotropic polyconvex hyperelastic energies

输入凸神经网络:通用逼近定理和等方各向同性超弹性能量的实现

Gian-Luca Geuken, Patrick Kurzeja, David Wiedemann, Jörn Mosler

发表机构 * Institute of Mechanics, Department of Mechanical Engineering, TU Dortmund University(机械学院,机械工程系,杜伊斯堡-埃森大学) Applied Analysis, Faculty of Mathematics, TU Dortmund University(应用分析,数学学院,杜伊斯堡-埃森大学)

AI总结 本文提出了一种新的神经网络框架,用于各向同性超弹性,该框架在满足必要的物理和数学约束的同时,也满足通用逼近定理。关键成分是输入凸网络架构和变形梯度的符号奇异值的初等多项式形式的公式化。与之前发布的网络一致,它可以严格捕捉框架不变性和多凸性,以及诸如角动量平衡和增长条件等其他约束。然而,与之前的方法不同,本文为所提出的方法证明了通用逼近定理。更具体地说,所提出的网络可以近似任何框架不变、各向同性多凸能量(只要网络足够大)。这通过使用框架不变、各向同性多凸函数的充分必要条件来实现。与现有方法的比较研究识别了所提出方法的优势,特别是在近似非多凸能量以及计算多凸包方面。

详情
AI中文摘要

本文提出了一种新的神经网络框架,用于各向同性超弹性,该框架在满足必要的物理和数学约束的同时,也满足通用逼近定理。该框架的两个关键成分是输入凸网络架构和变形梯度的符号奇异值的初等多项式形式的公式化。与之前发布的网络一致,它可以严格捕捉框架不变性和多凸性,以及诸如角动量平衡和增长条件等其他约束。然而,与之前的方法不同,本文为所提出的方法证明了通用逼近定理。更具体地说,所提出的网络可以近似任何框架不变、各向同性多凸能量(只要网络足够大)。这通过使用框架不变、各向同性多凸函数的充分必要条件来实现。与现有方法的比较研究识别了所提出方法的优势,特别是在近似非多凸能量以及计算多凸包方面。

英文摘要

This paper presents a novel framework of neural networks for isotropic hyperelasticity that enforces necessary physical and mathematical constraints while simultaneously satisfying the universal approximation theorem. The two key ingredients are an input convex network architecture and a formulation in the elementary polynomials of the signed singular values of the deformation gradient. In line with previously published networks, it can rigorously capture frame-indifference and polyconvexity - as well as further constraints like balance of angular momentum and growth conditions. However and in contrast to previous networks, a universal approximation theorem for the proposed approach is proven. To be more explicit, the proposed network can approximate any frame-indifferent, isotropic polyconvex energy (provided the network is large enough). This is possible by working with a sufficient and necessary criterion for frame-indifferent, isotropic polyconvex functions. Comparative studies with existing approaches identify the advantages of the proposed method, particularly in approximating non-polyconvex energies as well as computing polyconvex hulls.

2407.13059 2026-05-19 cs.CY cs.AI cs.ET 版本更新

Prioritizing High-Consequence Biological Capabilities in Evaluations of Artificial Intelligence Models

在评估人工智能模型时优先考虑高后果生物能力

Jaspreet Pannu, Doni Bloomfield, Alex Zhu, Robert MacKnight, Gabe Gomes, Anita Cicero, Thomas V. Inglesby

发表机构 * Center for Health Security, Bloomberg School of Public Health, Johns Hopkins University(健康安全中心,公共卫生学院,约翰霍普金斯大学) Department of Health Policy, Stanford School of Medicine, Stanford University(健康政策系,斯坦福医学院,斯坦福大学) Department of Chemical Engineering, Carnegie Mellon University(化学工程系,卡内基梅隆大学) Department of Chemistry, Carnegie Mellon University(化学系,卡内基梅隆大学) Wilton E. Scott Institute for Energy Innovation, Carnegie Mellon University(威尔顿·E·斯科特能源创新研究所,卡内基梅隆大学)

AI总结 本文基于人工智能模型的安全性、安全性和伦理问题,提出在评估人工智能模型时应优先考虑高后果风险,如大规模公众危害(如大流行病),并应在部署前进行评估,以建立针对性的AI安全评估方法,确保工具的安全性和防止潜在危害。

Comments 9 pages, 1 figure, 3 tables, 1 box

详情
AI中文摘要

随着人工智能能力的快速提升,过去一年中,各国政府和多国机构已宣布努力应对与人工智能模型相关的安全、安全和伦理问题。其中一项重要重点是减少人工智能模型的滥用。许多生物学家多年来一直在努力减少可能导致通过事故或滥用引发高后果疾病爆发的科研风险。科学家们已仔细考虑了哪些类型的生物科学研究具有潜在的益处和风险(双重用途),特别是随着科学进步加速了我们对生物体的工程能力和病原体新变种的创造能力。本文描述了科学家和政策专业人士在生物科学中的双重用途能力的先前经验研究如何帮助评估具有生物能力的人工智能模型的风险。我们主张人工智能模型的评估应优先处理高后果风险(可能导致大规模公众危害,如流行病),并在部署前进行评估,以便允许潜在的生物安全和/或生物安全措施。科学家在识别和缓解双重用途生物风险方面的经验可以帮助指导新的评估方法来评估生物人工智能模型。确定哪些AI能力最可能引发生物安全和生物安全问题是必要的,以便建立有针对性的AI安全评估方法,确保这些工具的安全性,防止事故和滥用,并避免阻碍巨大的潜在利益。

英文摘要

As a result of rapidly accelerating AI capabilities, over the past year, national governments and multinational bodies have announced efforts to address safety, security and ethics issues related to AI models. One high priority among these efforts is the mitigation of misuse of AI models. Many biologists have for decades sought to reduce the risks of scientific research that could lead, through accident or misuse, to high-consequence disease outbreaks. Scientists have carefully considered what types of life sciences research have the potential for both benefit and risk (dual-use), especially as scientific advances have accelerated our ability to engineer organisms and create novel variants of pathogens. Here we describe how previous experience and study by scientists and policy professionals of dual-use capabilities in the life sciences can inform risk evaluations of AI models with biological capabilities. We argue that AI model evaluations should prioritize addressing high-consequence risks (those that could cause large-scale harm to the public, such as pandemics), and that these risks should be evaluated prior to model deployment so as to allow potential biosafety and/or biosecurity measures. Scientists' experience with identifying and mitigating dual-use biological risks can help inform new approaches to evaluating biological AI models. Identifying which AI capabilities post the greatest biosecurity and biosafety concerns is necessary in order to establish targeted AI safety evaluation methods, secure these tools against accident and misuse, and avoid impeding immense potential benefits.

2605.18150 2026-05-19 cs.AI 版本更新

Whispers in the Noise: Surrogate-Guided Concept Awakening via a Multi-Agent Framework

噪声中的低语:通过多智能体框架引导的代理觉醒

Mengyu Sun, Ziyuan Yang, Zunlong Zhou, Junxu Liu, Haibo Hu, Yi Zhang

发表机构 * Department of Electrical and Electronic Engineering, The Hong Kong Polytechnic University(香港理工大学电子与电气工程系) School of Cyber Science and Engineering, Sichuan University(四川大学网络空间安全学院) Lee Kong Chian School of Medicine, Nanyang Technological University(南洋理工大学李科钦医学院)

AI总结 本文研究了在黑盒约束下如何通过多智能体框架从预训练模型中恢复被擦除的概念,提出了一种无需训练的代理方法,通过引导噪声状态来实现可控的觉醒,展示了当前概念擦除方法的局限性。

详情
AI中文摘要

扩散模型(DMs)被广泛用于文本到图像生成,但其强大的生成能力也引发了对不安全或不期望内容的担忧。概念擦除旨在通过从预训练模型中移除特定概念来缓解这些风险。然而,最近的研究表明,此类方法往往抑制而非完全消除目标概念,使模型易受觉醒攻击。现有方法主要依赖于通过优化或反向操作进行白盒访问,而概念觉醒在黑盒约束下仍显不足。在本文中,我们重新审视去噪过程并从轨迹角度出发,表明概念擦除主要破坏早期阶段的文本-语义对齐,但并未完全阻止语义信息沿去噪动态传播。随着生成过程的进行,模型越来越依赖于演化的噪声状态而非文本条件,这为绕过擦除映射提供了机会。受此观察启发,我们提出了ConceptAgent,一种无需训练、黑盒、多智能体框架,通过引导噪声状态初始化去噪轨迹来唤醒擦除的概念。大量实验表明,ConceptAgent能够在无模型参数、梯度或内部表示访问的情况下,实现准确且可控的擦除概念觉醒。这些结果突显了当前概念擦除方法的根本限制,并提供了关于DMs中语义控制动态性质的新见解。

英文摘要

Diffusion models (DMs) are widely used for text-to-image generation, but their strong generative capabilities also raise concerns about unsafe or undesirable content. Concept erasure aims to mitigate these risks by removing specific concepts from pretrained models. However, recent studies show that such methods often suppress rather than fully eliminate target concepts, leaving models vulnerable to awakening attacks. Existing approaches primarily rely on white-box access through optimization or inversion, while concept awakening under black-box constraints remains underexplored. In this work, we revisit the denoising process from a trajectory perspective and show that concept erasure mainly disrupts early-stage text-semantic alignment but does not fully prevent semantic information from propagating along the denoising dynamics. As generation proceeds, the model increasingly depends on the evolving noisy state rather than textual conditions, which creates an opportunity to bypass erased mappings. Motivated by this observation, we propose ConceptAgent, a training-free, black-box, multi-agent framework that awakens erased concepts by initializing the denoising trajectory from surrogate-guided noisy states. Extensive experiments demonstrate that ConceptAgent enables accurate and controllable awakening of erased concepts under black-box settings without access to model parameters, gradients, or internal representations. These results highlight fundamental limitations of current concept erasure methods and provide new insights into the dynamic nature of semantic control in DMs.

2605.18144 2026-05-19 cs.AI 版本更新

Evidence-Grounded Frontier Mapping and Agentic Hypothesis Generation in Nanomedicine

基于证据的前沿映射与代理假设生成在纳米医学中

Christiaan G. A. Viviers, Koen de Bruin, Mirre M. Trines, Ayla M. Hokke, Roy van der Meel, Avi Schroeder, Twan Lammers, Willem J. M. Mulder, Fons van der Sommen

发表机构 * ARIA Lab, Signal Processing Systems, Department of Electrical Engineering, Eindhoven University of Technology(ARIA实验室,信号处理系统,电气工程系,埃因霍温理工大学) Laboratory of Chemical Biology, Department of Biomedical Engineering, Eindhoven University of Technology(化学生物学实验室,生物医学工程系,埃因霍温理工大学) The Louis Family Laboratory for Targeted Drug Delivery and Personalized Medicine Technologies, Department of Chemical Engineering, Technion - Israel Institute of Technology(定向药物输送与个性化医学技术实验室,化学工程系,技术离子-以色列理工学院) Department of Nanomedicine and Theranostics, Institute for Experimental Molecular Imaging (ExMI), RWTH Aachen University Hospital(纳米医学与诊疗学系,实验分子成像研究所(ExMI),亚琛工业大学医院) Department of Internal Medicine and Radboud Center for Infectious Diseases (RCI), Radboud University Medical Center(内科学系和Radboud感染疾病中心(RCI),Radboud大学医学中心)

AI总结 该研究提出了一种结合文章嵌入、相似性图分析、稀疏前沿提取、结构化证据包检索和审计过的大型语言模型(LLM)工作流的系统pArticleMap,用于支持纳米医学研究方向的选择和假设生成,通过生成和评分基于引用的假设,实现了证据导向的研究辅助。

详情
AI中文摘要

纳米医学研究涵盖了递送化学、免疫学、成像、生物材料和疾病特定的转化科学,但其概念设计空间仍然在大量异质文献中碎片化。截至目前,人工智能在纳米医学中的应用主要集中在性质预测和配方优化,对研究方向选择层面的证据导向发现支持关注较少。我们引入了pArticleMap,一个结合文章嵌入、相似性图分析、稀疏前沿提取、结构化证据包检索和审计过的大型语言模型(LLM)工作流的文献映射和研究假设生成系统。该系统不同于预测未来概念共现,而是针对低密度文章级桥接区域和聚类界面,然后在代理设置中利用大型语言模型生成和评分基于引用的假设。我们通过回顾性实现基准(在历史截止点下生成后续文献)和盲人类读者评估层,在提示条件下的纳米医学任务中评估该系统。在4个选定的回顾性包中,pArticleMap在基准协议下生成了想法并选择了任务保留的假设(获胜想法)。对于任务级保留的假设,获得了一个汇总的黄金回收率10.8%,召回@10为15.9%,未来邻域率61.0%,表明该系统经常能够达到正确的前瞻性邻域(论文想法),即使没有精确的论文级回收。人类-代理协议总体上是中等的,表明内部评分是有用的支持信号,但不能替代专家判断。这些结果将pArticleMap定位为一种保守的、基于证据的研究助手,用于纳米医学。

英文摘要

Nanomedicine research spans delivery chemistry, immunology, imaging, biomaterials, and disease-specific translational science, yet its conceptual design space remains fragmented across a large and heterogeneous literature. To date, artificial intelligence in nanomedicine has focused primarily on property prediction and formulation optimization, with much less attention to evidence-grounded discovery support at the level of research direction selection. We introduce pArticleMap, a literature-mapping and research-hypothesis-generation system that combines article embeddings, similarity-graph analysis, sparse frontier extraction, structured evidence-pack retrieval, and an audited large-language-model (LLM) workflow for grounded ideation. Rather than forecasting future concept co-occurrence, pArticleMap targets low-density article-level bridge regions and cluster interfaces, then generates and scores citation-grounded hypotheses with large language models in an agentic setup. We evaluate the system with a retrospective realization benchmark (generate later literature under a historical cutoff) and a blinded human reader assessment layer across cue-conditioned nanomedicine tasks. Across 4 selected retrospective bundles, pArticleMap generated ideas and selected task-retained hypotheses (winner ideas) under the benchmark protocol. For task-level retained hypotheses, a pooled gold recovery rate of 10.8% was obtained, with a recall@10 of 15.9% and a future-neighborhood rate of 61.0%, indicating that the system often reached the correct forward-looking neighborhood (paper ideas) even without exact paper-level recovery. Human-agent agreement is modest overall, indicating that internal scoring is useful as a support signal but does not replace expert judgment. These results position pArticleMap as a conservative, evidence-grounded research assistant for nanomedicine.

2605.18143 2026-05-19 cs.AI 版本更新

Generative AI and the Productivity Divide: Human-AI Complementarities in Education

生成式AI与生产力差距:教育中的人类-人工智能互补性

Lihi Idan, Bharat Anand

发表机构 * Leonard N. Stern School of Business, New York University(纽约大学 Leonard N. Stern 商学院) Industrial and Systems Engineering Department, Texas A&M University(德克萨斯大学阿姆斯特朗工程学院)

AI总结 本研究探讨了生成式AI对不同用户生产力影响的异质性,发现AI交互能力(AIC)是决定AI使用效果的关键因素,通过概念图干预可减少不平等,强调需结合AIC微培训和标准流程以实现持续价值捕获。

详情
AI中文摘要

生成式人工智能(GenAI)正在改变企业创造、处理和应用知识的方式,但对其生产力影响的异质性知之甚少。我们报告了一项随机对照试验的结果,参与者(早期知识工作者的类比)被分配在传统资源或大语言模型(LLM)辅助下自学技术领域。平均而言,GenAI访问显著提高了任务表现,但收益分布极不均衡。改进未由GPA或先前知识预测,而是由AI交互能力(AIC)——即获取、过滤和验证模型输出的能力——预测。高AIC参与者实现了显著收益;低AIC参与者则获得有限甚至负的边际回报。概念图干预( scaffolding)减少了结果变异,表明标准化流程可减轻AI中介表现中的不平等。我们通过人类-人工智能互补性视角解读这些发现:GenAI提高平均生产力,但引入了新的能力不平等轴。管理上,企业应将GenAI访问与短期AIC微培训和简单标准操作程序相结合,以一致捕获价值并避免不均的采用结果。

英文摘要

Generative Artificial Intelligence (GenAI) is transforming how firms create, process, and apply knowledge, yet little is known about the heterogeneity of its productivity effects across users. We report results from a randomized controlled experiment in which participants-analogs of early-career knowledge workers-were assigned to self-study a technical domain using either traditional resources or large-language-model (LLM) assistance. On average, GenAI access significantly increased task performance, but the distribution of gains was highly uneven. Improvements were not predicted by GPA or prior knowledge, but by \textit{AI Interaction Competence (AIC)} -- the ability to elicit, filter, and verify model outputs. High-AIC participants realized outsized gains; low-AIC participants saw limited or even negative marginal returns. A scaffolding intervention (conceptual maps) reduced outcome variance, indicating that standardized workflows can mitigate inequality in AI-mediated performance. We interpret these findings through the lens of human-AI complementarities: GenAI raises mean productivity while introducing a new axis of capability inequality. Managerially, firms should pair GenAI access with short AIC micro-training and simple standard operating procedures to capture value consistently and avoid uneven adoption outcomes.

2605.18133 2026-05-19 cs.CR cs.AI cs.HC cs.IR 版本更新

An Empirical Study of Privacy Leakage Chains via Prompt Injection in Black-Box Chatbot Environments

通过提示注入在黑盒聊天机器人环境中研究隐私泄露链的实证研究

Hongjang Yang, Hyunsik Na, Daeseon Choi

发表机构 * Department of Information Security(信息安全系) Soongsil University(松山大学) AI Safety Center(人工智能安全中心) Department of AI Software(人工智能软件系)

AI总结 本文研究了通过间接提示注入在黑盒聊天机器人环境中存在的隐私泄露攻击链,分析了攻击者如何通过构造看似无害的外部内容来劫持代理任务,并评估了新的提示注入技术'exemplification'的攻击成功率,最终展示了使用虚构个人信息的证明概念数据外泄链。

Comments 9 pages, 2 figures

详情
AI中文摘要

基于大型语言模型的聊天机器人代理越来越多地通过结合自然语言推理与外部工具(如网络浏览)来处理用户请求。这些能力提高了可用性,但当不信任的外部内容作为用户任务的一部分被处理时,也创建了攻击面。本文研究了一种基于间接提示注入的隐私泄露攻击链,其中攻击者无法访问模型权重、系统提示或代理实现细节,包括处理查询时轨迹的管理方式。我们首先分析了攻击者如何通过构造看似无害的外部内容来劫持代理的预期任务,同时诱导代理执行攻击者定义的目标。然后我们评估了一种新的提示注入技术,称为exemplification,该技术利用外部内容中的桥梁将用户提示和检索页面的无害开头重新表述为few-shot示例,然后附加攻击者的目标。我们将其攻击成功率与先前的假完成技术进行比较。最后,我们展示了在受控环境中使用虚构个人身份信息的证明概念数据外泄链。我们的结果表明,提示注入、类似禁令的指令引导和网络工具调用可以组合成一个可行的隐私泄露路径,在部署的聊天机器人代理中。

英文摘要

LLM-based chatbot agents increasingly process user requests by combining natural-language reasoning with external tools such as web browsing. These capabilities improve usability, but they also create attack surfaces when untrusted external content is processed as part of a user' s task. This paper studies a privacy-leakage attack chain based on indirect prompt injection in black-box chatbot environments, where the attacker has no access to model weights, system prompts, or agent implementation details including how a trajectory is actually managed during its processing for a query. We first analyze how an attacker can hijack an agent' s intended task by crafting external content that appears benign to the victim while inducing the agent to execute an attacker-defined objective. We then evaluate a new prompt-injection technique, called exemplification, which uses a bridge in the external content to reframe the user prompt and the benign beginning of the retrieved page as few-shot examples before appending the attacker' s objective. We compare its attack success rate with a prior fake-completion technique. Finally, we demonstrate a proof-of-concept data-exfiltration chain using fictitious personal information in a controlled setting. Our results suggest that prompt injection, jailbreak-style instruction steering, and web-tool invocation can be combined into a feasible privacy-leakage path in deployed chatbot agents.

2605.18132 2026-05-19 cs.CV cs.AI 版本更新

Who Generated This 3D Asset? Learning Source Attribution for Generative 3D Models

谁生成了这个3D资产?学习生成3D模型的来源归属

Sihan Ma, Siyuan Liang, Dacheng Tao

发表机构 * College of Computing & Data Science, Nanyang Technological University, Singapore(南洋理工大学计算机与数据科学学院)

AI总结 该研究提出了一种方法,用于确定给定3D资产是由哪种生成模型创建的,通过构建首个被动来源归属基准,发现生成3D模型留下稳定的指纹特征,从而建立了可信的3D内容来源的新标准。

详情
AI中文摘要

生成3D模型被应用于游戏、机器人和沉浸式创作,因此来源归属至关重要:给定一个3D资产,我们能否确定并识别出是哪种生成模型创建的?该问题面临两个核心挑战:分散的归属信号,其中3D指纹分布在多视角、几何和频率域提示中;以及现实部署约束,其中稀少的标签、退化的提示和混合真实/合成资产会破坏归属的可靠性。为了系统研究该问题,我们构建了迄今为止首个被动来源归属基准,涵盖22种代表性的3D生成器,在标准、少样本和现实部署协议下。基于此基准,我们发现生成3D模型留下两种稳定的指纹:跨视角不一致性和体现在几何统计和频率域提示中的结构伪影。为了捕捉这些分散的信号,我们提出了一种层次多视角多模态Transformer,融合每个视角的外观、几何和频率域特征,并在跨视角建模全局关系。大量实验表明性能优异,在全监督下达到97.22%的准确率,在仅有1%训练数据时达到77.17%的准确率,对应每个生成器少于五个样本。这些结果表明现代3D生成器留下稳定且可归属的指纹,建立了可信3D内容来源的新基准和方法论基础。

英文摘要

Generative 3D models are deployed in gaming, robotics, and immersive creation, making source attribution critical: given a 3D asset, can we identify whether and which generative model created it? This problem faces two core challenges: dispersed attribution signals, where 3D fingerprints are distributed across multi-view, geometric, and frequency-domain cues; and realistic deployment constraints, where scarce labels, degraded prompts, and mixed real/synthetic assets undermine attribution reliability. To systematically study this problem, we construct, to the best of our knowledge, the first passive source attribution benchmark for modern generated assets, covering 22 representative 3D generators under standard, few-shot, and realistic deployment protocols. Based on this benchmark, we find that generative 3D models leave two types of stable fingerprints: cross-view inconsistency and structural artifacts reflected in geometric statistics and frequency-domain cues. To capture these dispersed signals, we propose a hierarchical multi-view multi-modal Transformer that fuses appearance, geometric, and frequency-domain features within each view and models global relationships across views. Extensive experiments demonstrate strong performance, achieving 97.22% accuracy under full supervision and 77.17% accuracy with only 1% training data, corresponding to fewer than five samples per generator. These results show that modern 3D generators leave stable and attributable fingerprints, establishing a new benchmark and methodological foundation for trustworthy 3D content provenance.

2605.18128 2026-05-19 cs.AI 版本更新

POST: Prior-Observation Adversarial Learning of Spatio-Temporal Associations for Multivariate Time Series Anomaly Detection

POST: 基于先验观察的时空关联对抗学习用于多变量时间序列异常检测

Suofei Zhang, Yaxuan Zheng, Haifeng Hu

发表机构 * School of Internet of Things(物联网学院) National Engineering Research Center of Communications and Networking(通信与网络国家工程研究中心)

AI总结 本文提出了一种新的框架,通过联合先验观察对抗学习方法统一时空建模,以解决多变量时间序列异常检测中的空间过泛化问题,并在公开数据集和自建基准上展示了在时间检测和空间定位任务上的新状态。

详情
AI中文摘要

现有的多变量时间序列异常检测(MTSAD)框架越来越多地依赖于将图神经网络(GNNs)与序列模型相结合,以捕捉复杂的时空依赖关系。然而,较少关注空间过泛化问题,即不受约束的结构建模会 indiscriminately 重建异常,不可避免地降低检测召回率。为了解决这个问题,我们提出了一种新的框架,通过联合先验观察对抗学习方法统一时空建模。在空间维度上,模型交替学习邻接矩阵作为结构先验,并在训练过程中通过最小化方式建模先验与数据驱动观察之间的关联差异。这种对抗优化不仅提高了模型对时间检测的敏感性,还使模型能够定位到特定通道的异常。为了系统评估这种异常定位能力,我们进一步构建了一个带有精确通道注释的合成基准。在公开数据集和我们专门的基准上进行的广泛实验表明,所提出的框架在时间和空间定位任务上都建立了新的状态。我们的代码、预训练模型和基准已公开在 https://github.com/anocodetest1/POST。

英文摘要

Existing Multivariate Time Series Anomaly Detection (MTSAD) frameworks increasingly rely on integrating Graph Neural Networks (GNNs) with sequence models to capture complex spatio-temporal dependencies. However, less attention is paid to the spatial over-generalization problem, where unconstrained structural modeling indiscriminately reconstructs anomalies, inevitably degrading detection recall. To tackle this problem, we propose a novel framework that unifies spatio-temporal modeling through a joint prior-observation adversarial learning paradigm. In the spatial dimension, the model alternately learns adjacency matrices as structural prior and models the association discrepancy between prior and data-driven observation in a minimax manner during training. Such adversarial optimization not only improves the model sensitivity for time-wise detection, but also enables the model to localize anomalies to specific channels. To systematically evaluate this anomaly localization capability, we further construct a synthetic benchmark equipped with precise channel-wise annotations. Extensive experiments across public datasets and our dedicated benchmark demonstrate that the proposed framework establishes a new state-of-the-art in both time-wise detection and spatial localization tasks. Our code, pre-trained models, and benchmark are publicly available at https://github.com/anocodetest1/POST.

2605.18109 2026-05-19 cs.AI cs.CV cs.RO 版本更新

TaskGround: Structured Executable Task Inference for Full-Scene Household Reasoning

TaskGround:全场景家庭推理的结构化可执行任务推断

ZhiYuan Feng, Yu Deng, Ruichuan An, Zhenhua Liu, Qixiu Li, Keming Wu, Zhiying Du, Weijie Wang, Haoxiao Wang, Shuang Chen, Sicheng Xu, Yaobo Liang, Jiaolong Yang, Baining Guo

发表机构 * Tsinghua University(清华大学) Microsoft Research Asia(微软亚洲研究院) Peking University(北京大学) Fudan University(复旦大学) Zhejiang University(浙江大学)

AI总结 本文提出TaskGround框架,通过结构化任务推断提升全场景家庭推理能力,其核心贡献是引入FullHome评估套件,验证了在家庭场景中执行任务结构推断的重要性,并展示了紧凑本地模型在实际家庭部署中的有效性。

Comments Project page: https://aaronfengzy.github.io/TaskGround/

详情
AI中文摘要

在真实家庭部署中,家庭代理通常必须从完整的家庭场景和处于特定情境的家庭请求出发,而不是从干净的任务规范出发。此类请求要求代理识别与任务相关的实体,恢复意图的任务条件,并从周围场景上下文中解决顺序约束。我们正式将这种能力定义为全场景家庭推理:给定一个完整的家庭场景和一个处于特定情境的家庭请求,代理必须在生成接地技能级动作序列之前推断出可执行的任务结构。这种设置具有挑战性,因为完整的家庭场景包含大量与任务无关的信息,使直接完整场景提示效率低下且容易出错。在实际部署中,这一挑战进一步被隐私和本地计算限制放大,这些限制更倾向于紧凑的开放权重模型,其具有有限的长上下文推理能力。我们提出TaskGround,一种无需训练且模型无关的Ground-Infer-Execute框架,该框架将完整的场景接地为紧凑的任务相关场景切片,推断出可执行的任务结构,并将其编译为接地的技能级动作序列。为了评估这一设置,我们引入了FullHome,一个经过人类验证的400个家庭任务评估套件,涵盖多样化的家庭规模环境以及目标导向和过程约束要求。在FullHome上,TaskGround在专有和开放权重模型上均大幅提升了任务成功率。值得注意的是,它使Qwen3.5-9B在直接完整场景提示下与GPT-5竞争,同时将总输入token成本减少了多达18倍。我们的结果识别了执行任务结构推断为全场景家庭推理中的关键瓶颈,并表明结构化接地可以显著提高紧凑本地模型在实际家庭部署中的有效性。

英文摘要

In real home deployments, household agents must often operate from a complete household scene and a situated household request, rather than from a clean task specification. Such requests require agents to identify task-relevant entities, recover intended task conditions, and resolve ordering constraints from the surrounding scene context. We formalize this capability as full-scene household reasoning: given a complete household scene and a situated household request, an agent must infer executable task structure before producing a grounded skill-level action sequence. This setting is challenging because complete household scenes contain substantial task-irrelevant information, making direct complete-scene prompting inefficient and error-prone. In practical deployment, this challenge is further amplified by privacy and local compute constraints, which favor compact open-weight models with limited long-context reasoning ability. We propose TaskGround, a training-free and model-agnostic Ground-Infer-Execute framework that grounds complete scenes into compact task-relevant scene slices, infers executable task structure, and compiles it into grounded skill-level action sequences. To evaluate this setting, we introduce FullHome, a human-validated evaluation suite of 400 household tasks spanning diverse home-scale environments and both goal-oriented and process-constrained requirements. On FullHome, TaskGround improves task success rates by large margins across both proprietary and open-weight models. Notably, it makes Qwen3.5-9B competitive with GPT-5 under direct complete-scene prompting while reducing total input-token cost by up to 18x. Our results identify executable task-structure inference as a central bottleneck in full-scene household reasoning and show that structured grounding can make compact local models substantially more effective for practical household deployment.

2605.18104 2026-05-19 cs.AI cs.CR 版本更新

Safety Geometry Collapse in Multimodal LLMs and Adaptive Drift Correction

多模态大语言模型中的安全几何坍缩与自适应漂移修正

Jiahe Guo, Xiangran Guo, Jiaxuan Chen, Weixiang Zhao, Yanyan Zhao, Yutai Hou, Qianchao Wang, Dandan Tu, Bing Qin

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文研究了多模态大语言模型在跨模态安全转移中的不足,提出安全几何坍缩现象,并通过自适应漂移修正方法提升模型安全性。

详情
AI中文摘要

多模态大语言模型(MLLMs)常常无法将文本模态中学习到的安全能力转移到语义等价的非文本输入中,揭示出一个持续存在的多模态安全缺口。我们从表示几何视角出发,通过分析文本对齐的拒绝方向和模态诱导的漂移方向来研究这一缺口。我们展示了多模态输入压缩了沿拒绝方向的可用分离度,使其不再可靠用于识别和拒绝有害输入。我们将这种失败模式称为安全几何坍缩。我们通过条件拒绝分离度量化这一现象,并显示更强的模态诱导漂移与更弱的拒绝分离度和更高的攻击成功率一致。随后,我们通过固定强度激活干预验证了模态诱导漂移的因果作用:抵消估计的漂移可以恢复拒绝分离度并提高多模态安全性。在漂移修正后,我们进一步观察到自修正现象,其中模型在前向动态中恢复了识别和拒绝有害多模态输入的能力。这种效果也提供了模型对每个输入感知有害性的内部信号。受此信号启发,我们提出了ReGap,一种无需训练的推理时方法,通过自修正自适应修正模态漂移。在多个多模态安全基准和实用性基准上的实验展示了ReGap的有效性,显著提高了MLLMs的安全性,而不会损害通用能力。我们的发现强调了表示层面的模态对齐作为实时安全改进和构建更安全、更可靠MLLMs的关键方向。

英文摘要

Multimodal large language models (MLLMs) often fail to transfer safety capabilities learned in the text modality to semantically equivalent non-text inputs, revealing a persistent multimodal safety gap. We study this gap from a representation-geometric perspective by analyzing a text-aligned refusal direction and a modality-induced drift direction. We show that multimodal inputs compress the usable separation along the refusal direction, making it no longer reliable for identifying and refusing harmful inputs. We refer to this failure mode as Safety Geometry Collapse. We quantify it through conditional refusal separability and show that stronger modality-induced drift is consistently associated with weaker refusal separability and higher attack success rates. We then validate the causal role of modality-induced drift through a fixed-strength activation intervention: counteracting the estimated drift restores refusal separability and improves multimodal safety. After drift correction, we further observe self-rectification, where the model recovers its ability to recognize and refuse harmful multimodal inputs during forward dynamics. This effect also provides an internal signal of the model's perceived harmfulness of each input. Motivated by this signal, we propose ReGap, a training-free inference-time method that adaptively corrects modality drift using self-rectification. Experiments across multiple multimodal safety benchmarks and utility benchmarks demonstrate the effectiveness of ReGap, which significantly improves the safety of MLLMs without compromising general capabilities. Our findings highlight representation-level modality alignment as a crucial direction for real-time safety improvement and for building safer, more reliable MLLMs.

2605.18101 2026-05-19 cs.CV cs.AI 版本更新

SENSE: Satellite-based ENergy Synthesis for Sustainable Environment

SENSE: 基于卫星的能源合成以实现可持续环境

Kailai Sun, Mingyi He, Heye Huang, Can Rong, Alok Prakash, Baoshen Guo, Shenhao Wang, Jinhua Zhao

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Florida(佛罗里达大学)

AI总结 本文提出SENSE,一种统一的生成性城市建筑能耗框架,通过结合生成扩散模型和大规模视觉模型知识,生成高分辨率的城市卫星图像和对齐的高质量建筑能耗和高度地图,以提高城市可持续发展预测性能。

Comments Accpted by KDD 2026 (Oral)

详情
AI中文摘要

城市建筑能耗建模在实现联合国可持续发展目标7和11中起着关键作用。尽管基于卫星图像和深度学习的研究已取得显著进展,但仍存在许多挑战:大多数现有研究本质上是预测性的,无法反映城市规划的生成性;虽然生成式AI和扩散模型在卫星图像中实现了指数级增长,但缺乏城市功能生成(例如能耗层);第三,高质量高分辨率建筑能耗数据与卫星图像的对齐数据有限且稀缺。本文提出SENSE(基于卫星的能源合成以实现可持续环境),一种统一的生成性城市建筑能耗(UBEM)框架,联合合成逼真的城市卫星图像和对齐的高质量建筑能耗和高度地图。通过在道路网络和城市密度指标上进行条件控制,SENSE基于可控扩散模型,利用大规模视觉模型学习到的知识,生成城市建筑能耗和高度信息(注释)在潜在空间中。在四个城市(纽约市、波士顿、里昂、釜山)上的实验表明,SENSE实现了高视觉保真度和强物理一致性,满足ASHRAE标准度量。实验表明,SENSE可以使用少于20%的标注能耗数据生成足够的注释合成数据,将下游预测性能提升10% IoU。与最先进的城市能耗预测方法相比,SENSE显著降低了预测误差(预测误差减少了3%-11% NMBE和1%-9% CVRMSE)。本研究为城市科学、能源科学和建筑科学提供了能耗效率的城市规划和物理生成解决方案。数据集和代码:https://huggingface.co/datasets/skl24/MUSE和https://github.com/kailaisun/GenAI4Urban-Energy/.

英文摘要

Urban Building Energy Modeling plays a critical role in achieving the United Nations' Sustainable Development Goals 7 and 11. Although existing studies based on satellite imagery and deep learning have achieved remarkable progress, many challenges exist: most existing studies are inherently predictive, failing to reflect the generative nature of urban planning; although generative AI and diffusion models have seen explosive growth in satellite imagery, they lack the urban functional generation (e.g., energy layer); third, aligned high-quality high-resolution building energy data with satellite imagery is limited and scarce. Here we propose SENSE (Satellite-based ENergy Synthesis for Sustainable Environment), a unified generative UBEM framework that jointly synthesizes realistic urban satellite imagery and aligned high-quality building energy consumption and height maps. By conditioning on road networks and urban density metrics, SENSE, based on a controllable diffusion model, leverages the knowledge learned by large vision models to generate urban building energy consumption and height information (annotations) in the latent space. Experiments across four cities (New York City, Boston, Lyon, Busan) demonstrate that SENSE achieves high visual fidelity and strong physical consistency, satisfying the ASHRAE standard metric. Experiments demonstrate that SENSE can generate enough annotated synthetic data using less than 20% labeled energy data, boosting downstream prediction performance by 10% IoU. Compared to SOTA urban energy prediction methods, SENSE significantly reduced prediction error (reduced 3%-11% NMBE and 1%-9% CVRMSE). This study offers an energy-efficiency urban planning and physical generation solution for urban science, energy science and building science. The dataset and code: https://huggingface.co/datasets/skl24/MUSE and https://github.com/kailaisun/GenAI4Urban-Energy/.

2605.18094 2026-05-19 cs.AI 版本更新

Learning to Solve Compositional Geometry Routing Problems

学习解决组合几何路由问题

Mingfeng Fan, Jianan Zhou, Jiaqi Cheng, Yifeng Zhang, Jie Zhang, Guillaume Adrien Sartoretti

发表机构 * National University of Singapore(新加坡国立大学) Nanyang Technological University(南洋理工大学) Central South University(中南大学)

AI总结 本文研究了组合几何路由问题(CGRP),这是一种涵盖点、线、面及任意混合任务几何的统一超类,为现实中的路由场景提供广泛抽象。为解决非点任务带来的不对称性和复杂性,作者提出DiCon框架,通过对比学习和差异注意力机制提升表示学习和决策能力。

Comments 27 pages, 10 figures

详情
AI中文摘要

我们研究了组合几何路由问题(CGRP),这是一种涵盖点-only、line-only、area-only及任意混合任务几何的统一超类,为现实中的路由场景提供广泛抽象。除了标准的点基路由外,CGRP中的非点任务可以本质上是不对称的,紧密耦合的旅行路线与内在路径密切相关,并扩展了大量可行但通常无关的行动空间,从而对表示学习和决策提出了重大挑战。为解决这些挑战,我们提出DiCon,一种带有对比学习的差分注意力辅助求解器,作为即插即用的框架,从两个互补的角度解决该问题。首先,我们引入差分注意力机制,主动抑制概率质量在不具竞争力的候选动作上的分布。其次,我们设计了双层对比学习目标,以促进稳健的全局实例表示并正则化几何感知的任务表示。广泛的实验表明,DiCon在不同组成CGRP实例上实现了强大的性能、广泛的通用性和优越的泛化能力。

英文摘要

We study the Compositional Geometry Routing Problem (CGRP), a unified superclass of traditional routing problems that covers point-only, line-only, area-only, and arbitrary hybrid task geometries, providing a broad abstraction for real-world routing scenarios. Beyond standard point-based routing, CGRP with non-point tasks can be inherently asymmetric, tightly coupled travel routes with the intrinsic path, and enlarges the action space with numerous feasible yet often irrelevant options, thereby posing significant challenges for both representation learning and decision-making. To address these challenges, we propose DiCon, a differential attention-assisted solver with contrastive learning, as a plug-and-play framework that tackles the problem from two complementary angles. First, we introduce a differential attention mechanism that actively suppresses the probability mass on less competitive candidate actions. Second, we design a double-level contrastive learning objective to promote robust global instance representations and regularize geometry-aware task representations. Extensive experiments demonstrate that DiCon achieves strong performance, broad versatility, and superior generalization across diverse CGRP instances with different compositions.

2605.18080 2026-05-19 quant-ph cond-mat.stat-mech cs.AI econ.EM econ.TH 版本更新

Parameterized 4-Qubit EWL Quantum Game Circuits with Dirac-Solow-Swan Hamiltonian Integration for Quadruple Helix Disruptive Innovation Recommender Systems

具有Dirac-Solow-Swan哈密顿量整合的参数化4量子比特EWL量子游戏电路用于四螺旋颠覆性创新推荐系统

Agung Trisetyarso, Fithra Faisal Hastiadi, Kridanto Surendro

发表机构 * Department of Mathematics and Statistics, Bina Nusantara University(宾汉纳大学数学与统计学系) Faculty of Economic and Business, Universitas Indonesia(印度尼西亚大学经济与商业学院) Faculty of School Electrical Engineering and Informatics, Institut Teknologi Bandung(万隆技术大学电气工程与信息学院)

AI总结 本文提出了一种参数化4量子比特EWL量子游戏电路,用于四螺旋创新生态系统中的推荐系统,通过提取欧洲委员会CORDIS Horizon Europe数据库中的实际参与者资金数据,直接调节每个螺旋主体的本地策略算子,并利用Dirac-Solow-Swan哈密顿量的对角线势能将测量概率映射为颠覆性与维持性创新趋势的推荐分数,从而实现对颠覆性资本轨迹的高保真预测。

Comments Submitted to Quantum

详情
AI中文摘要

我们提出了一种新颖的参数化4量子比特Eisert-Wilkens-Lewenstein (EWL)量子游戏电路,用于四螺旋创新生态系统(学术界、产业界、政府和民间社会)中的推荐系统。每个螺旋主体的本地策略算子$U_{i} = R_y(θ_{i})$直接通过从欧洲委员会CORDIS Horizon Europe数据库(项目COVend,ID 101045956)中提取的归一化主导权重进行调节。该电路采用多量子比特EWL纠缠器,随后是参数化局部旋转、反纠缠器和完整测量,仅使用22个门和电路深度11,随着$n$-轮螺旋通信扩展为$O(n)$。量子游戏测量后的概率作为推荐分数,用于颠覆性与维持性创新趋势。这些分数随后映射到Dirac-Solow-Swan哈密顿量的对角线势能,从而在颠覆性创新下实现资本积累和分岔动力学的时间演化模拟。在真实CORDIS四螺旋合作网络上的数值实验展示了该电路的NISQ兼容性及其高保真预测颠覆性资本轨迹的能力。所提出的框架连接了量子游戏理论、参数化量子电路和相对论经济成长模型,提供了一种在复杂社会经济生态系统中进行创新政策和战略决策的计算高效工具。复杂性分析和可重复性通过开放的Qiskit实现提供。

英文摘要

We present a novel parameterized 4-qubit Eisert-Wilkens-Lewenstein (EWL) quantum game circuit for recommender systems in quadruple helix innovation ecosystems (academia, industry, government, and civil society). The local strategy operators $U_{i} = R_y(θ_{i})$ for each helix actor are directly tuned by normalized dominance weights extracted from real participant funding data (\texit{ecContribution}) in the European Commission CORDIS Horizon Europe database (project COVend, ID 101045956). The circuit employs a multi-qubit EWL entangler followed by parameterized local rotations, inverse entangler, and full measurement, achieving only 22 gates and circuit depth 11 while scaling as $O(n)$ for $n$-round helix communications. Measurement probabilities after the quantum game serve as recommender scores for disruptive versus sustaining innovation trends. These scores are subsequently mapped into the diagonal Dirac potential of a Dirac-Solow-Swan Hamiltonian, enabling time-evolution simulation of capital accumulation and bifurcation dynamics under disruptive innovation. Numerical experiments on real CORDIS quadruple-helix collaboration networks demonstrate the circuit's NISQ compatibility and its ability to forecast disruptive capital trajectories with high fidelity. The proposed framework bridges quantum game theory, parameterized quantum circuits, and relativistic economic growth models, offering a computationally efficient tool for innovation policy and strategic decision-making in complex socio-economic ecosystems. Complexity analysis and reproducibility are provided through open Qiskit implementations.

2605.18073 2026-05-19 cs.SE cs.AI 版本更新

A-ProS: Towards Reliable Autonomous Programming Through Multi-Model Feedback

A-ProS:通过多模型反馈实现可靠的自主编程

Anika Tabassum, Md Sifat Hossain, Md. Fahim Arefin, Tariqul Islam, Tarannum Shaila Zaman

发表机构 * Dept. of Computer Science and Engineering, University of Dhaka(达卡大学计算机科学与工程系) Dept. of Information Systems, University of Maryland, Baltimore County(马里兰大学巴尔的摩县信息学院) University of Maryland, Baltimore County(马里兰大学巴尔的摩县)

AI总结 本文提出A-ProS,一种通过多模型反馈框架实现自主编程的AI代理,结合生成器和调试器,通过多轮反馈提升代码生成的准确性与效率,实验表明其在解决编程问题上具有显著优势。

Comments Accepted for Publication in ACM Transactions on Software Engineering and Methodology (TOSEM)

详情
AI中文摘要

大型语言模型(LLMs)在自动化代码生成方面展现出强大潜力,但其通过执行反馈迭代优化解决方案的能力仍待探索。编程竞赛提供了一个理想的测试平台,因为它要求端到端的算法推理、在严格计算约束下的精确实现以及完全的功能正确性。在本文中,我们提出了A-ProS,一个通过混合多模型反馈框架解决编程竞赛问题的自主AI代理,该框架将解决方案生成与专门的调试分离。A-ProS结合基于ChatGPT的生成器(GPT-4和GPT-5)以及三个调试批评者:Codestral-2508、Llama-3.3-70B和DeepSeek-R1,在2 x 3的因子设计下进行评估。我们对367个问题进行了六种工作流程的评估,涵盖ICPC世界总决赛(2011-2024)和Codeforces(评级1200-1800)。结果表明,GPT-5工作流程在三次优化轮次后,接受的解决方案从39个提升到85-90个,而GPT-4从15个提升到31-38个。对47个问题的受控消融分析显示,具有状态的优化优于无状态方法,提高了8.5-10.6个百分点,并将重复失败减少了多达3.5倍。与基线代理循环相比,A-ProS实现了超过2倍的收益,突显了持久上下文和多模型反馈在可靠自主程序合成中的重要性。

英文摘要

Large Language Models (LLMs) demonstrate strong potential for automated code generation, yet their ability to iteratively refine solutions using execution feedback remains underexplored. Competitive programming offers an ideal testbed for this investigation, as it demands end-to-end algorithmic reasoning, precise implementation under strict computational constraints, and complete functional correctness with rigorous evaluation. In this paper, we present A-ProS, an autonomous AI agent that solves competitive programming problems through a hybrid multi-model feedback framework separating solution generation from specialized debugging. A-ProS combines ChatGPT-based generators (GPT-4 and GPT-5) with three debugging critics: Codestral-2508, Llama-3.3-70B, and DeepSeek-R1, under a 2 x 3 factorial design. We evaluate six workflows on 367 problems from ICPC World Finals (2011-2024) and Codeforces (rated 1200-1800). The results show that GPT-5 workflows improve from 39 initial accepted solutions to 85-90 after three refinement rounds, while GPT-4 improves from 15 to 31-38. A controlled ablation on 47 problems shows that stateful refinement outperforms stateless approaches by 8.5-10.6 percentage points and reduces repeated failures by up to 3.5x. Compared to baseline agent loops, A-ProS achieves over 2x greater gains, highlighting the importance of persistent context and multi-model feedback for reliable autonomous program synthesis.

2605.18068 2026-05-19 cs.LG cs.AI 版本更新

Improving Spatio-Temporal Residual Error Propagation by Mitigating Over-Squashing

通过缓解过压缩来改进时空残差误差传播

Seyed Mohamad Moghadas, Esther Rodrigo Bonet, Bruno Cornelis, Adrian Munteanu

发表机构 * ETRO Department, Vrije Universiteit Brussel(瓦隆联合大学布鲁塞尔分校ETRO系) imec

AI总结 本文提出Teger模块,通过空间曲率感知的图重排机制改进误差相关的自回归预测,提升时空预测的连续排名概率得分。

详情
AI中文摘要

残差误差传播仍然是递归模型中的基本问题,其中小的预测不准确会随时间累积并降低长周期性能。准确建模此类残差的相关结构对于概率多变量时间序列预测中的可靠不确定性量化至关重要。尽管最近的时间序列深度模型能够高效参数化时间变化的同期相关性,但它们通常假设误差的时序独立性,并忽略了观测网络中的空间相关性。在本文中,我们引入Teger,一个结构化的不确定性模块,克服了误差相关自回归预测中的空间和时间限制。Teger提出了一种空间曲率感知的图重排机制,明确加强了由离散Forman曲率识别出的信息瓶颈边。该组件被集成到低秩加对角协方差头中,通过Woodbury恒等式保持可推断性。Teger是backbone无关的,仅需任何自回归编码器产生的潜在状态。我们提供了Teger的理论证据,并在四个现实世界的时空数据集上实验评估了它在LSTM、Transformer和xLSTM backbone上的表现,显示了连续排名概率得分的一致改进。我们进一步提供了将曲率感知重排与(i)过压缩缓解、(ii)改进的谱连接性、(iii)减少有效电阻以及(iv)改进的协方差校准界联系起来的正式理论分析。

英文摘要

Residual error propagation remains a fundamental problem in recurrent models, where small prediction inaccuracies compound over time and degrade long-horizon performance. Accurately modeling the correlation structure of such residuals is critical for reliable uncertainty quantification in probabilistic multivariate timeseries forecasting. While recent time-series deep models efficiently parametrize time-varying contemporaneous correlations, they often assume temporal independence of errors and neglect spatial correlation across the observed network. In this paper, we introduce Teger, a structured uncertainty module that overcomes the spa- tial and temporal limitations of error-correlated autoregressive forecasting. Teger proposes a spatial curvature-aware graph rewiring mechanism explicitly strengthening information-bottleneck edges identified by discrete Forman curvature. The component is integrated into a low-rank-plus-diagonal covariance head, preserving tractable inference via the Woodbury identity. Teger is backbone-agnostic, requiring only the latent state produced by any autoregressive encoder. We provide theoretical evidence of Teger, and experimentally evaluate it on LSTM, Transformer, and xLSTM backbones across four real-world spatio-temporal datasets, showing consistent improvement in Continuous Ranked Probability Score (CRPS). We further provide a formal theoretical analysis connecting curvature-aware rewiring to (i) oversquashing alleviation, (ii) improved spectral connectivity, (iii) reduced effective resistance, and (iv) improved covariance calibration bounds

2605.18055 2026-05-19 cs.LG cs.AI 版本更新

FLAG: Foundation model representation with Latent diffusion Alignment via Graph for spatial gene expression prediction

FLAG: 通过图结构的潜在扩散对齐实现基础模型表示以空间基因表达预测

Qi Si, Penglei Wang, Yushuai Wu, Yifeng Jiao, Xuyang Liu, Xin Guo, Yuan Qi, Yuan Cheng

发表机构 * Shanghai Academy of Artificial Intelligence for Science, Shanghai, China.(上海人工智能科学研究院) School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China.(上海交通大学生物医学工程学院) Incubation Institute, Fudan University, Shanghai, China.(复旦大学孵化院)

AI总结 本文提出FLAG框架,通过图结构的潜在扩散对齐方法,解决空间基因表达预测中的基因协调和空间分布关系问题,并引入基因维度诅咒的概念,通过空间图编码器和基因基础模型对齐来提升模型的结构一致性与基因间保真度。

Comments 9 pages for main text, 3 pages for references, 19 pages for appendix. accepted by ICML 2026

详情
AI中文摘要

从常规的H&E染色预测空间基因表达能够实现大规模分子谱分析,但当前模型将此任务视为孤立的点wise任务,从而忽略了诸如基因协调和空间分布等关键生物结构。为保持这些关系,我们引入FLAG,一种基于扩散的框架,将此任务重新定义为结构分布建模。同时,我们识别出关键的基因维度诅咒,即联合建模基因表达及其空间相互作用在高维空间中失效,而FLAG通过整合空间图编码器以实现拓扑一致性,并利用基因基础模型(GFM)对齐以在生成过程中保持基因-基因的保真度。为严格评估模型性能,我们提出了一组新的结构评估度量标准,包括基因结构相关性(GSC)和空间结构相关性(SSC)。我们的实验表明,FLAG在传统准确性(PCC/MSE)方面具有高度竞争力,同时在捕捉基因-基因和基因-空间关系时实现了显著增强的结构保真度。代码可在https://github.com/darkflash03/FLAG上获取。

英文摘要

Predicting spatial gene expression from routine H\&E enables large-scale molecular profiling, yet current models treat this as isolated pointwise tasks, thereby overlooking essential biological structures like gene coordination and spatial distribution. To preserve these relationships, we introduce \textbf{FLAG}, a diffusion-based framework that redefines this task as structured distribution modeling. At the same time, we identify the critical \textbf{Gene Dimension Curse}, where joint modeling gene expression and their spatial interactions fail in high-dimensional spaces, and FLAG solves this challenge by integrating a spatial graph encoder for topological consistency and utilizing Gene Foundation Model (GFM) alignment for gene-gene fidelity in the generation process. To rigorously assess model performance, we propose a set of novel structural evaluation metrics, including Gene Structural Correlation (\textbf{GSC}) and Spatial Structural Correlation (\textbf{SSC}). Our experiments demonstrate that FLAG is highly competitive in traditional accuracy (PCC/MSE) while achieving significantly enhanced structural fidelity in capturing both gene-gene and gene-spatial relationships. The code is available at https://github.com/darkflash03/FLAG.

2605.18048 2026-05-19 cs.AI 版本更新

DocOS: Towards Proactive Document-Guided Actions in GUI Agents

DocOS: 向 GUI 代理中的主动文档引导行动迈进

Jingjing Liu, Ziye Huang, Zihao Cheng, Zeming Liu, Jiahong Wu, Yuhang Guo, Kehai Chen, Yunhong Wang, Haifeng Wang

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing, China(北航计算机科学与工程学院) School of Computer Science, Peking University, Beijing, China(北京大学计算机科学学院) School of Computer Science and Technology, Beijing Institute of Technology, Beijing(北京理工大学计算机科学与技术学院) School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算机科学与技术学院) Baidu Inc., Beijing, China(百度公司)

AI总结 本文提出 DocOS 基准,通过引导文档解决长尾任务,解决 GUI 代理在动态开放网络环境中处理长尾任务的能力限制,核心方法是主动文档引导行动,主要贡献是设计了一个评估文档引导问题解决能力的基准。

详情
AI中文摘要

尽管图形用户界面(GUI)代理在自动化设备交互中表现出色,但它们主要依赖于预训练或指令微调的静态参数知识。这种依赖从根本上限制了它们处理需要显式过程知识的长尾任务的能力,通常迫使代理采用低效且易碎的试错探索。为缓解这一限制,我们引入了面向 GUI 代理的主动文档引导行动,这是一种新的范式,通过使代理能够自主搜索相关文档来解决长尾任务,从而模仿人类问题解决方式。为了评估代理在此范式中的能力,我们提出了 DocOS,一个基准,用于评估在完全交互环境中文档引导的问题解决能力。DocOS 要求代理自主导航网络浏览器,定位相关在线文档,理解操作步骤,并将这些步骤准确地转化为可执行的 GUI 操作。广泛的实验表明,进展受到双重瓶颈的限制:代理在主动搜索中难以可靠地定位相关信息,并且频繁失败将检索到的指令准确地转化为精确的操作,这表明文档引导交互是使 GUI 代理在动态环境中自我演化的关键路径。

英文摘要

While Graphical User Interface (GUI) agents have shown promising performance in automated device interaction, they primarily depend on static parametric knowledge from pre-training or instruction tuning. This reliance fundamentally limits their ability to handle long-tailed tasks that require explicit procedural knowledge absent from model parameters, often forcing agents to resort to inefficient and brittle trial-and-error exploration. To mitigate this limitation, we introduce \textbf{Proactive Document-Guided Action} for GUI agents in dynamic, open-web environments, a novel paradigm that mirrors human problem-solving by enabling agents to autonomously search for relevant documentation to resolve long-tailed tasks. To evaluate agents' capability in this paradigm, we propose \textbf{DocOS}, a benchmark designed to assess document-guided problem solving in fully interactive environments. DocOS requires agents to autonomously navigate a web browser, locate relevant online documentation, comprehend procedural instructions, and faithfully ground them into executable GUI actions. Extensive experiments reveal that progress is strictly constrained by dual bottlenecks: agents struggle to reliably locate relevant information during proactive search and frequently fail to faithfully ground retrieved instructions into precise actions, pointing toward document-guided interaction as a crucial pathway for enabling self-evolving GUI agents in dynamic environments.

2605.18045 2026-05-19 cs.RO cs.AI 版本更新

Confidence-Gated Robot Autonomy: When Does Uncertainty Actually Help?

置信度门控机器人自主性:不确定性何时真的有帮助?

Johannes A. Gaus, Jhon P. F. Charaja, Daniel Haeufle

发表机构 * Hertie Institute for Clinical Brain Research & Center for Integrative Neuroscience, University of Tübingen(赫尔特研究所临床脑研究与整合神经科学中心,图宾根大学)

AI总结 本文研究了不确定性在机器人自主性决策中的作用,发现当基础模型具备一定能力时,简单的不确定性代理足以实现选择性门控,但无法用于语义新颖性检测。

Comments ICRA 2026 workshop paper

详情
AI中文摘要

机器人系统常常使用预测不确定性来决定是否自主行动还是退回到备用策略。在阈值门控自主性中,不确定性主要通过其对可能错误的排序能力起作用。标准指标如预期校准误差和AUROC并不能直接测试不确定性是否改变行动/退避决策。因此,我们通过斯皮尔曼等级相关性、配对bootstrap等价检验和行动/退避一致率来评估不确定性。在三个时间活动识别基准上,我们发现存在一个数据集依赖的胜任区域,在此之下不确定性只能提供弱且不稳定的错误排序。在此之上,softmax启发式方法、MC Dropout和集成模型产生相似的门控行为,而阈值选择对执行结果影响更大。一个多种子具身模拟显示,一旦实现自主性,碰撞率和成本也呈现出相同模式。在时间协变量转移下,排序质量保持稳定,但细粒度语义OOD检测仍接近随机。这些结果表明,一旦基础模型具备一定能力,简单的不确定性代理足以实现选择性门控,但无法用于语义新颖性检测。

英文摘要

Robotic systems often use predictive uncertainty to decide whether to act autonomously or defer to a fallback policy. In threshold-gated autonomy, uncertainty matters mainly through its ability to rank likely errors. Standard metrics such as expected calibration error and AUROC do not directly test whether uncertainty changes act/defer decisions. We therefore evaluate uncertainty using Spearman rank correlation, paired bootstrap equivalence testing, and act/defer agreement. Across three temporal activity-recognition benchmarks, we find a dataset-dependent competence regime below which uncertainty provides a weak and unstable error ranking. Above this regime, softmax heuristics, MC Dropout, and ensembles produce similar gating behavior, while threshold choice has a much larger effect on execution outcomes. A multi-seed embodied simulation shows the same pattern for collision rate and cost once realized autonomy is matched. Under temporal covariate shift, ranking quality remains stable, but fine grained semantic OOD detection remains near chance. These results suggest that simple uncertainty proxies can suffice for selective gating once the base model is competent, but not for semantic novelty detection.

2605.18036 2026-05-19 cs.HC cs.AI 版本更新

Exploring Trust Calibration in XAI - The Impact of Exposing Model Limitations to Lay Users

探索可解释AI中的信任校准 - 暴露模型局限性对普通用户的影响

Alfio Ventura, Tim Katzke, Jan Corazza, Mustafa Yalçıner

发表机构 * Research Center Trustworthy Data Science and Security(可信数据科学与安全研究中心) University Alliance Ruhr(鲁尔大学联盟) University of Duisburg-Essen(杜伊斯堡- Essen大学) TU Dortmund University(多特蒙德技术大学)

AI总结 本文研究了在可解释AI(XAI)中信任校准的重要性,通过一项在线研究探讨了向普通用户展示模型局限性对信任评估的影响,发现仅在限制披露情况下,信任校准才有效,并发现短期经验对校准无显著影响。

Comments Preprint. Accepted at the 4th World Conference on eXplainable Artificial Intelligence (XAI 2026). Final version to appear in the conference proceedings

详情
AI中文摘要

信任校准——将用户的信任判断与模型能力对齐——对于安全部署可解释AI(XAI)至关重要,但通常通过全局信任评分来评估,而脱离客观性能证据。我们提出了一项预注册的、激励性的受试者间在线研究(N=418名英国代表性样本),研究可解释性皮肤病变分类,以区分期望设定与实际表现。参与者完成了15个案例评估,使用固定的XAI面板(恶性程度评分、可靠性评分和显著性图)。我们系统地操纵了五个实验入门条件,变化基于示例的信息和限制披露,五个刺激包自然变化观察预测质量。校准被定义为信任相关判断(TAIS和案例评分)与遇到案例的客观性能基准之间的偏差,分析使用分层混合效应模型。只有在限制披露情况下,对案例评分的校准才可靠地影响信任校准,短期经验未产生渐进校准。此外,经历的刺激包解释了比实验操纵更多的方差。然而,参与者很难区分案例感知的信任、可信度和准确性估计。我们讨论了在设计限制沟通和测量和分析XAI评估中的校准指标方面的含义。本研究的所有研究材料和数据都公开可用,供复制和进一步学术使用。

英文摘要

Trust calibration -- aligning user trust judgment with model capability -- is crucial for safe deployment of explainable AI (XAI), yet is often evaluated via global trust ratings detached from objective performance evidence. We present a preregistered, incentivized between-subject online study (N=418 representative UK sample) on explainable skin-lesion classification that disentangles expectation-setting from experienced performance. Participants completed 15 case evaluations using a fixed XAI panel (malignancy score, reliability score, and saliency map). We systematically manipulated five experimental onboarding conditions varying example-based information and limitation disclosures with five stimulus packages naturally varying observed prediction quality. Calibration was operationalized as the deviation between trust-related judgments (TAIS and case-wise ratings) and objective performance benchmarks for the encountered cases, analysed with hierarchical mixed-effects models. Only limitation disclosure for case-wise measures reliably impacts trust calibration, and short-term experience did not yield progressive calibration. Further, the experienced package of stimuli explained substantially more variance than the experimental manipulation. However, participants were hard-pressed to differentiate between case-wise perceived trust, trustworthiness, and accuracy estimation. We discuss implications for designing limitation communication and for measuring and analysing calibration metrics in XAI evaluations. All study materials and data of this study are publicly available for replication and further academic use.

2605.18035 2026-05-19 cs.AI cs.LG 版本更新

New Insight of Variance reduce in Zero-Order Hard-Thresholding: Mitigating Gradient Error and Expansivity Contradictions

零阶硬阈值化中方差减少的新见解:缓解梯度误差和扩张性矛盾

Xinzhe Yuan, William de Vazelhes, Bin Gu, Huan Xiong

发表机构 * IASM, Harbin Institute of Technology(哈尔滨工业大学人工智能研究所,哈尔滨工业大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) School of Artificial Intelligence, Jilin University(吉林大学人工智能学院)

AI总结 本文提出了一种通用的方差减少零阶硬阈值化算法,通过考虑方差的作用,缓解零阶梯度与硬阈值操作之间的冲突,从而消除对随机方向数量的限制,提高收敛速度和应用范围。

Comments Published as a conference paper at ICLR 2024. 9 pages main paper, 24 pages appendix, 11 figures, 7 tables. Correspondence to Bin Gu and Huan Xiong

Journal ref International Conference on Learning Representations (ICLR), 2024

详情
AI中文摘要

硬阈值化是机器学习中用于解决ℓ0约束优化问题的重要算法类型。然而,在某些情况下,目标函数的真实梯度可能难以获取,通常可以通过零阶(ZO)方法进行近似。到目前为止,SZOHT算法是唯一能够处理ℓ0稀疏性约束的ZO梯度算法。不幸的是,由于零阶梯度的偏差与硬阈值操作的扩张性之间存在固有的矛盾,SZOHT在ZO梯度的随机方向数量上存在明显的限制。本文通过考虑方差的作用,提供了一种新的方差减少见解:缓解零阶梯度与硬阈值操作之间的独特矛盾。在此视角下,我们提出了一种通用的方差减少零阶硬阈值化算法以及在标准假设下的通用收敛性分析。理论结果表明,新算法消除了对随机方向数量的限制,相较于SZOHT,具有改进的收敛速度和更广泛的应用范围。最后,我们通过岭回归问题以及黑盒对抗攻击问题展示了本方法的实用性。

英文摘要

Hard-thresholding is an important type of algorithm in machine learning that is used to solve $\ell_0$ constrained optimization problems. However, the true gradient of the objective function can be difficult to access in certain scenarios, which normally can be approximated by zeroth-order (ZO) methods. The SZOHT algorithm is the only algorithm tackling $\ell_0$ sparsity constraints with ZO gradients so far. Unfortunately, SZOHT has a notable limitation on the number of random directions % in ZO gradients due to the inherent conflict between the deviation of ZO gradients and the expansivity of the hard-thresholding operator. This paper approaches this problem by considering the role of variance and provides a new insight into variance reduction: mitigating the unique conflicts between ZO gradients and hard-thresholding. Under this perspective, we propose a generalized variance reduced ZO hard-thresholding algorithm as well as the generalized convergence analysis under standard assumptions. The theoretical results demonstrate the new algorithm eliminates the restrictions on the number of random directions, leading to improved convergence rates and broader applicability compared with SZOHT. Finally, we illustrate the utility of our method on a ridge regression problem as well as black-box adversarial attacks.

2605.18032 2026-05-19 cs.CL cs.AI cs.HC cs.SE 版本更新

PROTEA: Offline Evaluation and Iterative Refinement for Multi-Agent LLM Workflows

PROTEA:多智能体大语言模型工作流的离线评估与迭代优化

Kazuki Kawamura, Satoshi Waki, Kei Tateno

发表机构 * Sony Group Corporation(索尼集团公司)

AI总结 本文提出PROTEA,一种用于多智能体大语言模型工作流的离线评估和迭代优化接口,通过配置评分标准和可视化工作流图中的节点状态,帮助开发者定位瓶颈并改进工作流性能。

Comments 9 pages, 3 figures, 1 table. To appear in Proceedings of ACL 2026 System Demonstrations

详情
AI中文摘要

多智能体大语言模型工作流——由多个角色特定的LLM调用组成——通常优于单提示基线,但调试和优化仍然困难。失败可能源于中间输出的细微错误,这些错误会传播到下游节点,要求开发者检查长轨迹并推断应修改哪个代理。我们提出了PROTEA,一个统一的接口,用于离线、测试驱动的多智能体工作流改进。PROTEA执行工作流,用可配置的评分标准评分中间节点输出,并在工作流图上叠加每个节点的状态和理由,以定位可能的瓶颈。为了支持复杂系统,其中最终答案参考是主要监督,PROTEA执行反向节点评估:它从最终答案参考和图上下文生成候选节点级期望,然后将它们与观察到的节点输出进行比较。对于选定的节点,PROTEA以可编辑的前后比较形式呈现目标提示修订,然后自动重新运行并重新评估工作流,以显示输出变化和评分轨迹。在两个生产相关的工作流中,PROTEA将文档检查准确性从64.3%提高到83.9%,推荐Hit@5从0.30提高到0.38。在与六名经验丰富的LLM开发者进行的形成研究中,参与者重视图层面的定位、节点级别的理由以及可编辑的前后提示修订。

英文摘要

Multi-agent LLM workflows -- systems composed of multiple role-specific LLM calls -- often outperform single-prompt baselines, but they remain difficult to debug and refine. Failures can originate from subtle errors in intermediate outputs that propagate to downstream nodes, requiring developers to inspect long traces and infer which agent to modify. We present PROTEA, a unified interface for offline, test-driven improvement of multi-agent workflows. PROTEA executes a workflow, scores intermediate node outputs with configurable rubrics, and overlays per-node states and rationales on the workflow graph to localize likely bottlenecks. To support complex systems where final-answer references are the primary supervision, PROTEA performs backward node evaluation: it generates candidate node-level expectations from final-answer references and graph context, then compares them with observed node outputs. For selected nodes, PROTEA presents targeted prompt revisions as editable before/after comparisons, then automatically reruns and re-evaluates the workflow to show output changes and score trajectories within the same interface. In two production-adjacent workflows, PROTEA improved document-inspection accuracy from 64.3% to 83.9% and recommendation Hit@5 from 0.30 to 0.38. In a formative study with six experienced LLM developers, participants valued graph-level localization, per-node rationales, and editable before/after prompt revisions.

2605.18031 2026-05-19 quant-ph cs.AI 版本更新

Quantum Sidecar Architectures for Hybrid AI Training and Inference: Stateful Protected Registers, Stateless Reset-and-Reprepare Circuits and Quantum Weight-State Outlook

用于混合AI训练和推理的量子Sidecar架构:状态保护寄存器、无状态重置和重新准备电路以及量子权重状态展望

Y. Mo, G. D. Su

发表机构 * Independent Researcher(独立研究者;BroadLink公司) BroadLink Co., Ltd.(杭州电子科技大学副教授) Associate Professor, Hangzhou Dianzi University

AI总结 本文提出了一种用于未来混合AI训练和推理的量子Sidecar架构家族,通过状态保护寄存器和无状态重置和重新准备模式,为优化器侧采样、适配器或专家选择、检索、路由和推理路径提案提供有界信号生成器,并引入了量子权重状态Sidecar作为受限的量子表示。

Comments 14 pages, 8 figures. Architecture and small-scale simulation study; no hardware experiment or quantum-advantage claim

详情
AI中文摘要

我们提出了一种用于未来混合AI训练和推理的量子Sidecar架构家族。核心思想是不将整个Transformer存储在小量子内存中,也不声称一次性崩溃进入完全训练的模型或最优答案。相反,我们识别出两种物理上不同的操作模式,用于连接到经典大模型管道的量子协处理器。第一种是状态保护寄存器模式,其中保护寄存器存储可重用的量子资源,而辅助或临时寄存器执行QND风格的读取。第二种是无状态的重置和重新准备模式,其中每个查询准备一个任务条件的量子电路,经过受限制的训练或推理控制变量演变,测量候选信号,重置量子位,并重复。我们使用2/4/6/8个保护量子位密度矩阵QND风格奇偶读取进行状态模式的模拟,并通过Qiskit交叉验证。对于无状态模式,我们包括了抽象候选更新采样器和电路级QAOA风格的态向量采样器,随后进行重置开销敏感性分析。所得到的框架将量子Sidecar定位为优化器侧采样、适配器或专家选择、检索、路由和推理路径提案的有界信号生成器。作为推测展望,我们引入了量子权重状态Sidecar:受限的量子表示,而非完整的经典权重张量的直接编码。

英文摘要

We propose a quantum sidecar architecture family for future hybrid AI training and inference. The central idea is not to store an entire Transformer in a small quantum memory, nor to claim one-shot collapse into a fully trained model or an optimal answer. Instead, we identify two physically distinct operating modes for quantum co-processors attached to classical large-model pipelines. The first is a stateful protected-register mode, in which a protected register stores a reusable quantum resource while an ancilla or temporary register performs QND-style readout. The second is a stateless reset-and-reprepare mode, in which each query prepares a task-conditioned quantum circuit, evolves over bounded training or inference control variables, measures candidate signals, resets the qubits, and repeats. We simulate the stateful mode using 2/4/6/8 protected-qubit density-matrix QND-style parity readout with one ancilla and a Qiskit cross-check. For the stateless mode, we include both an abstract candidate-update sampler and a circuit-level QAOA-style statevector sampler over structured candidate landscapes, followed by reset-overhead sensitivity analysis. The resulting framework positions quantum sidecars as bounded signal generators for optimizer-side sampling, adapter or expert selection, retrieval, routing, and reasoning-path proposal. As a speculative outlook, we introduce quantum weight-state sidecars: restricted quantum representations over model-control variables, not direct encodings of complete classical weight tensors.

2605.18028 2026-05-19 cs.LG cs.AI 版本更新

FedSDR: Federated Self-Distillation with Rectification

FedSDR: 带校正的联邦自我蒸馏

Ziheng Ren, Zhanming Shen, Hao Wang, Ning Liu, You Song

发表机构 * Beijing University of Aeronautics(北京航空航天大学) Zhejiang University(浙江大学) Shandong University(山东大学) Stevens Institute of Technology(史蒂文斯理工学院)

AI总结 本文提出FedSDR,一种改进的联邦自我蒸馏方法,通过引入双重流机制来解决联邦学习中数据分布不匹配和幻觉问题,提升模型的准确性和一致性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大规模语言模型的联邦微调面临严重的统计异质性。然而,现有模型级防御方法往往忽视了根本原因:内在的数据分布不匹配。在本文中,我们首先建立了联邦自我蒸馏(FedSD)作为基本且有力的策略。通过将客户端表示投影到一个平滑的

英文摘要

Federated fine-tuning of Large Language Models faces severe statistical heterogeneity. However, existing model-level defenses often overlook the root cause: intrinsic data distribution mismatches. In this work, we first establish Federated Self-Distillation (FedSD) as a fundamental and potent strategy. By projecting client representations into a smoothed ``model-understanding space,'' FedSD alone serves as a universal booster, demonstrating superior performance over conventional algorithms. Despite its success, we identify a subtle trade-off termed the Rewrite Paradox -- unconstrained self-distillation can inadvertently increase hallucinations and redundancy. To refine this paradigm, we further propose FedSDR (Federated Self-Distillation with Rectification), the ultimate reinforced framework. It augments FedSD with a dual-stream mechanism: a local LoRA-S (Smoothing) branch to implicitly absorb heterogeneity via distilled data, and a parallel global LoRA-R (Rectification) branch anchored to raw data to enforce factual correctness. By selectively aggregating only LoRA-R, FedSDR yields a globally aligned and faithful model. Extensive experiments verify its superior performance.

2605.18025 2026-05-19 cs.AI 版本更新

TeleCom-Bench: How Far Are Large Language Models from Industrial Telecommunication Applications?

TeleCom-Bench: 大型语言模型在工业电信应用中还有多远?

Jieting Xiao, Yun Lin, Huizhen Qiu, Rui Ma, Chen Zhong, Dongyang Xu, Xiao Long, Chaoyu Zhang, Qiaobo Hao, Ding Zou, Zhiguo Yang, Yanqin Gao, Fang Tan

发表机构 * ZTE Corporation(中兴通讯)

AI总结 本文提出TeleCom-Bench,一个包含12个评估集和22678个精选样本的全面基准,旨在评估大型语言模型在电信领域的综合能力,揭示其在工业流程中的执行能力缺口。

Comments Accepted by KDD 2026

详情
AI中文摘要

尽管大型语言模型(LLM)在各种垂直场景中实现了显著整合,但其在电信领域的部署仍处于探索阶段,由于缺乏标准化的评估框架。当前的电信基准主要关注静态基础知识和孤立的原子技能,忽略了设备特定的文档和端到端的工业工作流,这些对于实际生产系统至关重要。为此,我们提出了TeleCom-Bench,一个包含12个评估集和22,678个精选样本的全面基准,评估LLM在协同层次上的能力:(1)多维知识理解,整合电信基础、3GPP协议、5G网络架构和专有产品知识,通过知识图谱驱动的合成整合有线、核心和无线网络的知识;(2)端到端知识应用,正式化六个核心任务在真实网络代理工作流中的真实轨迹,包括意图识别、实体提取、事件验证、工具调用、根本原因分析和解决方案生成,涵盖网络优化和故障维护场景。对八种最先进的LLM的评估揭示了一个普遍的执行墙:虽然模型在意图识别和实体提取等语言接口任务中达到90%的准确率,但在解决方案生成等过程执行任务中的性能降至约30%。这种能力差距表明,当前LLM在诊断方面表现良好,但在现场工程师方面却失败。TeleCom-Bench提供标准化的诊断,精确指出这一缺陷,为特定领域的对齐提供可操作的指导,以实现生产就绪的电信代理。数据集和评估代码已发布在https://github.com/ZTE-AICloud/TeleCom-Bench。

英文摘要

While Large Language Models have achieved remarkable integration in various vertical scenarios, their deployment in the telecommunications domain remains exploratory due to the lack of a standardized evaluation framework. Current telecom benchmarks primarily focus on static, foundational knowledge and isolated atomic skills, neglecting the equipment-specific documentation and end-to-end industrial workflows essential for real-world production systems. To bridge this gap, we present TeleCom-Bench, a comprehensive benchmark comprising 12 evaluation sets with 22,678 curated samples, which evaluates LLMs across a synergistic hierarchy: (1) Multi-dimensional Knowledge Comprehension, which integrates telecommunication fundamentals, 3GPP protocols, and 5G network architecture with proprietary product knowledge across wired, core, and wireless networks via knowledge graph-driven synthesis; and (2)End-to-End Knowledge Application, which formalizes six core tasks on authentic trajectories from live network agent workflows, including intent recognition, entity extraction, event verification, tool invocation, root cause analysis, and solution generation-across network optimization and fault maintenance scenarios. Evaluations of eight state-of-the-art LLMs reveal a universal Execution Wall: while models achieve 90% accuracy in linguistic interface tasks such as intent recognition and entity extraction, performance collapses to approximately 30% in procedural execution tasks like solution generation. This capability gap demonstrates that current LLMs function competently as diagnosticians but fail as field engineers. TeleCom-Bench provides standardized diagnostics to precisely pinpoint this deficit, offering actionable guidance for domain-specific alignment toward production-ready telecom agents. The dataset and evaluation code have been released at https://github.com/ZTE-AICloud/TeleCom-Bench.

2605.18022 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Unveiling Memorization-Generalization Coexistence: A Case Study on Arithmetic Tasks with Label Noise

揭示记忆与泛化共存:在带有标签噪声的算术任务中的案例研究

Linyu Liu, Pinyan Lu

发表机构 * Taylor Lab, Huawei Technologies Co., Ltd.(华为技术有限公司泰勒实验室) Key Laboratory of Interdisciplinary Research of Computation and Economics, Shanghai University of Finance and Economics(上海财经大学计算与经济交叉研究重点实验室)

AI总结 本文研究了在高过参数化模型中如何同时记忆噪声标签和泛化,通过模运算任务中的实验发现,适当优化和模型配置下大模型泛化能力更强,噪声标签被更快记忆,而过参数化模型内部形成泛化结构,但输出被拟合噪声标签的需求所抑制。通过频率方法提取内部结构可实现高准确率,提出任务无关方法将网络分为泛化和记忆组件,尽管该子网络提升泛化能力,但相比频率提取方法仍有局限,表明泛化结构分布于神经元中,需要新工具来检索过参数化网络中的可泛化知识。

Comments 27 pages, 32 figures

详情
AI中文摘要

高度过参数化的模型可以同时记忆噪声标签并良好泛化,但如何这些行为共存仍不明确。本文通过模运算任务在重噪声标签下研究其内在机制。通过在两层神经网络上的广泛实验发现,适当优化和模型配置下大模型泛化能力更强,而噪声标签被更快记忆。过参数化模型内部形成泛化结构,但其在输出中的表达被拟合噪声标签的需求所抑制。值得注意的是,即使在80%的标签噪声下,通过频率方法提取内部结构也可实现接近完美的测试准确率。我们进一步提出一种任务无关的方法将网络分为泛化和记忆组件。尽管该子网络提升泛化能力,但相比频率提取方法仍有局限,表明泛化结构分布于神经元中,需要新工具来检索过参数化网络中的可泛化知识。

英文摘要

Highly over-parameterized models can simultaneously memorize noisy labels and generalize well, yet how these behaviors coexist remains poorly understood. In this work, we investigate the underlying mechanisms of this coexistence using modular arithmetic tasks under heavy label noise. Through extensive experiments on two-layer neural networks, we find that larger models tend to generalize better under appropriate optimization and model configurations, while noisy labels are memorized faster than clean data. Over-parameterized models internally form a generalization structure, but its expression in the output is suppressed by the need to fit noisy labels. Remarkably, even with 80\% label noise, near-perfect test accuracy can be achieved by extracting this internal structure using frequency-based methods. We further propose a task-agnostic method to partition networks into generalization and memorization components. Although this subnetwork improves generalization, it is limited compared with frequency-based extraction, indicating that the generalization structure is distributed across neurons and motivating the development of new tools to retrieve generalizable knowledge from over-parameterized networks.

2605.18018 2026-05-19 cs.CV cs.AI cs.HC 版本更新

See What I Mean: Aligning Vision and Language Representations for Video Fine-grained Object Understanding

See What I Mean: 对齐视觉与语言表示以实现视频细粒度物体理解

Boyuan Sun, Bowen Yin, Yuanming Li, Xihan Wei, Qibin Hou

发表机构 * VCIP, CS, Nankai University(南开大学计算机科学与技术学院) Tongyi Lab, Alibaba Group(阿里云实验室) NKIARI, Shenzhen Futian(深圳福田国家信息研究所)

AI总结 本文提出SWIM方法,通过对齐视觉和语言表示,仅从文本提示中实现细粒度物体理解,解决了传统方法需要显式视觉提示的问题,通过构建NL-Refer数据集和多层交叉注意力图提升文本-视觉对齐性能。

Journal ref CVPR 2026

详情
AI中文摘要

我们提出了SWIM(See What I Mean),一种新颖的训练策略,通过对齐视觉和语言表示,仅从文本提示中实现细粒度物体理解。与需要显式视觉提示(如掩码或点)的传统方法不同,SWIM仅在训练期间利用掩码监督来指导跨模态注意力,使模型在推理时能够自动关注用户指定的物体。我们对预训练多模态大语言模型(MLLMs)的交叉注意力分析揭示了一种系统性差异:属性词在视觉模态中产生尖锐、局部化的激活,而物体名词由于语义参考偏差和分布式高层表示产生扩散和分散的模式。为了解决这种不对齐问题,我们构建了NL-Refer数据集,其中每个物体掩码都配以精确的自然语言指引用。SWIM从物体名词中提取多层交叉注意力图,并强制与真实掩码保持空间一致性。实验结果表明,SWIM显著提高了文本-视觉对齐性能,并在细粒度物体理解基准上优于基于视觉提示的方法。代码和数据可在https://github.com/HumanMLLM/SWIM获取。

英文摘要

We present SWIM (See What I Mean), a novel training strategy that aligns vision and language representations to enable fine-grained object understanding solely from textual prompts. Unlike existing approaches that require explicit visual prompts, such as masks or points, SWIM leverages mask supervision only during training to guide cross-modal attention, allowing the model to automatically attend to the user-specified object at inference. Our cross-attention analysis of pretrained multimodal large languagemodels (MLLMs) reveals a systematic discrepancy: Attribute words produce sharp, localized activations in the visual modality, whereas object nouns yield diffuse and scattered patterns due to semantic reference bias and distributed high-level representations. To address this misalignment, we construct NL-Refer, an enriched dataset, in which each object mask is paired with a precise natural language referring expression. SWIM extracts multi-layer cross-attention maps from object nouns and enforces spatial consistency with ground-truth masks. Experimental results demonstrate that SWIM substantially improves text-visual alignment and achieves superior performance over visual-prompt-based methods on fine-grained object understanding benchmarks. The code and data are available at \href{https://github.com/HumanMLLM/SWIM}{https://github.com/HumanMLLM/SWIM}.

2605.18013 2026-05-19 cs.CV cs.AI 版本更新

TinySAM 2: Extreme Memory Compression for Efficient Track Anything Model

TinySAM 2: 极端内存压缩用于高效的跟踪任何模型

Zhaoyuan Ding, Yijing Yang, Han Shu, Xinghao Chen

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文提出TinySAM 2,一种轻量级视频分割模型,通过引入内存质量管理机制和联合空间-时间令牌压缩,有效降低了内存存储和计算成本,实现了在DAVIS和SA-V等挑战性数据集上达到SAM 2.1 90%性能,仅使用7%内存令牌和3%训练数据。

Comments 12 pages, 6 figures

详情
AI中文摘要

Segment Anything Model 2 (SAM 2) 作为视频分割领域的核心基础模型,在半监督视频对象分割和跟踪任何任务中表现出色。然而,SAM 2的多阶段图像编码器和内存模块复杂的计算特性提高了模型在实际应用中的部署难度。为了解决这个问题,我们提出了TinySAM 2,一种在性能和效率之间取得平衡的轻量级视频分割模型。首先,引入了一个内存质量管理机制,用于选择并保留高信息量的历史帧作为内存。此外,提出了一种联合空间-时间令牌压缩方法,通过空间域上的平均池化压缩冗余令牌,在时间域上基于令牌级相似性测量选择信息令牌。此外,采用RepViT作为轻量级图像编码器,进一步减少模型参数。在DAVIS和SA-V等挑战性数据集上的大量实验表明,TinySAM 2在性能上达到了SAM 2.1的90%,仅使用7%的内存令牌和3%的训练数据。本研究有效缓解了SAM 2在参数数量、计算负载和部署成本方面的瓶颈,为视频分割模型在设备上的广泛应用提供了资源高效的解决方案。

英文摘要

Segment Anything Model 2 (SAM 2) serves as a core foundation model in the field of video segmentation. Building upon the original SAM model, it introduces a memory bank mechanism and demonstrates outstanding performance in tasks such as semi-supervised video object segmentation and tracking anything. However, the complex computational characteristics of SAM 2's multi-stage image encoder and memory module have raised the barrier to the model's deployment in practical applications. To address this issue, we propose TinySAM 2, a lightweight video segmentation model that balances performance and efficiency. First, a memory quality management mechanism is introduced to select and retain high-informative historical frames as the memory. In addition, a joint-spatial-temporal token compression is proposed that reduces the memory storage and computational cost. Specifically, average pooling is employed to first compress redundancy tokens in the spatial domain. In the temporal domain, informative tokens are selected across frames in the memory bank based on token-level similarity measurement. Besides, we take RepViT as the lightweight image encoder, which further reduces the model parameters. Extensive experiments on challenging datasets such as DAVIS and SA-V demonstrate that TinySAM 2 achieves 90% of the performance of SAM 2.1, with only 7% memory tokens and 3% training data. This study effectively alleviates the bottlenecks in parameter count, computational load, and deployment costs associated with SAM 2, providing a resource-efficient solution for the widespread application of video segmentation models on devices.

2605.18012 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SAS: Semantic-aware Sampling for Generative Dataset Distillation

SAS: 语义感知的生成数据集蒸馏

Mingzhuo Li, Guang Li, Linfeng Ye, Jiafeng Mao, Takahiro Ogawa, Konstantinos N. Plataniotis, Miki Haseyama

发表机构 * Hokkaido University(北海道大学) University of Toronto(多伦多大学) The University of Tokyo(东京大学)

AI总结 本文提出了一种语义感知的数据集蒸馏方法,通过利用CLIP作为语义先验,设计三个语义评分函数来量化类别相关性、类别间分离性和集合内多样性,从而生成紧凑且语义区分度高的数据集。

Comments Published as a journal paper in IEEE OJSP

详情
AI中文摘要

深度神经网络在广泛的任务中取得了显著的性能,但这种成功往往伴随着由于大规模训练数据带来的巨大计算和存储成本。数据集蒸馏通过构建紧凑且信息丰富的数据集,以实现高效的模型训练同时保持下游性能。然而,大多数现有方法主要强调匹配数据分布或下游训练统计,对蒸馏数据中高阶语义信息的保留有限。在本文中,我们引入了语义感知的视角进行数据集蒸馏,通过利用对比语言-图像预训练(CLIP)作为语义先验进行后采样。我们的目标是获得不仅紧凑而且语义上类别区分度高且多样化的蒸馏数据集。为此,我们设计了三个语义评分函数,以量化预训练语义空间中的类别相关性、类别间分离性和集合内多样性。基于现有蒸馏方法生成的图像池,我们进一步开发了一种两阶段策略进行有效的采样:第一阶段过滤语义区分度高的样本以形成可靠的候选集,第二阶段进行动态多样性感知选择以减少冗余并保持语义覆盖。在多个数据集、图像池和下游模型上的广泛实验显示了一致的性能提升,突显了在数据集蒸馏中整合语义信息的有效性。

英文摘要

Deep neural networks have achieved impressive performance across a wide range of tasks, but this success often comes with substantial computational and storage costs due to large-scale training data. Dataset distillation addresses this challenge by constructing compact yet informative datasets that enable efficient model training while maintaining downstream performance. However, most existing approaches primarily emphasize matching data distributions or downstream training statistics, with limited attention to preserving high-level semantic information in the distilled data. In this work, we introduce a semantic-aware perspective for dataset distillation by leveraging Contrastive Language-Image Pretraining (CLIP) as a semantic prior for post-sampling. Our goal is to obtain distilled datasets that are not only compact but also semantically class-discriminative and diverse. To this end, we design three semantic scoring functions that quantify class relevance, inter-class separability, and intra-set diversity in a pretrained semantic space. Based on image pools generated by existing distillation methods, we further develop a two-stage strategy for effective sampling: the first stage filters semantically discriminative samples to form a reliable candidate set, and the second stage performs a dynamic diversity-aware selection to reduce redundancy while preserving semantic coverage. Extensive experiments across multiple datasets, image pools, and downstream models demonstrate consistent performance gains, highlighting the effectiveness of incorporating semantic information into dataset distillation.

2605.18003 2026-05-19 cs.NE cs.AI 版本更新

Spiker-LL: An Energy-Efficient FPGA Accelerator Enabling Adaptive Local Learning in Spiking Neural Networks

Spiker-LL:一种能效高的FPGA加速器,用于在脉冲神经网络中实现自适应局部学习

Alessio Caviglia, Filippo Marostica, Alessandro Savino, Stefano Di Carlo

发表机构 * Control and Computer Engineering Department, Politecnico di Torino(都灵理工学院控制与计算机工程系)

AI总结 本文提出Spiker-LL,一种能效高的FPGA加速器,通过扩展开源的Spiker+推理架构,实现了高效的STSF局部学习规则支持,从而在边缘设备上实现自适应局部学习。

详情
AI中文摘要

在边缘部署自适应智能仍然具有挑战性,因为训练神经模型的计算和能耗成本很高。脉冲神经网络(SNNs)提供了一种有前途的替代方案,但实现设备端学习需要硬件-算法协同设计。本文提出了Spiker-LL,一种基于FPGA的SNN加速器,扩展了开源的Spiker+推理架构,以高效支持STSF局部学习规则。通过有针对性的微架构扩展,Spiker-LL在推理和在线学习中具有最小的开销。在MNIST、F-MNIST和DIGITS数据集上,它实现了高达93%的准确率,亚毫秒延迟和每推理小于0.1 mJ的能耗,同时保持无DSP且高度可扩展用于边缘FPGA部署。

英文摘要

Deploying adaptive intelligence at the edge remains challenging due to the high computational and energy cost of training neural models. Spiking Neural Networks (SNNs) offer a promising alternative, but enabling on-device learning requires hardware-algorithm co-design. This paper presents SPIKER-LL, an FPGA-based SNN accelerator that extends the open-source Spiker+ inference architecture with efficient support for the STSF local learning rule. Through targeted microarchitectural extensions, SPIKER-LL performs inference and online learning with minimal overhead. Across MNIST, F-MNIST, and DIGITS, it achieves up to 93% accuracy, sub-millisecond latency, and less than 0.1 mJ per inference, while remaining DSP-free and highly scalable for edge-FPGA deployments.

2605.17999 2026-05-19 cs.AI 版本更新

Shared Backbone PPO for Multi-UAV Communication Coverage with Connection Preservation

共享骨干PPO用于多UAV通信覆盖与连接保持

Z. Jiang

发表机构 * Zhejiang University(浙江大学)

AI总结 本文提出了一种共享骨干PPO算法,通过在Actor和Critic网络之间共享基础模块,实现了高效的训练和提升的性能。该算法在保持连接的多UAV群体通信覆盖任务中得到实现,并与标准PPO算法进行比较。实验结果表明,所提出的方法具有优越的性能,此外,还集成了图信息聚合模块以适应代理之间的通信条件。整合该模块后,算法仍保持有效,训练后的代理群体表现出更高的合作水平。

详情
AI中文摘要

本文提出了一种共享骨干近端策略优化(Shared Backbone PPO)算法。通过在Actor和Critic网络之间共享基础模块,该算法实现了高效的训练和改进的性能。该算法在保持连接的多UAV群体通信覆盖任务中得到实现,并与标准PPO算法进行比较。实验结果表明,所提出的方法实现了优越的性能。此外,将图信息聚合模块纳入模型架构中,以适应代理之间的通信条件。整合该模块后,算法仍保持有效,训练后的代理群体表现出更高的合作水平。

英文摘要

This paper proposes a Shared Backbone Proximal Policy Optimization (Shared Backbone PPO) algorithm. By sharing the base module between the Actor and Critic networks, the algorithm achieves efficient training and improved performance. The algorithm is implemented in a connectivity-preserving multi-UAV swarm communication coverage task and compared with the standard PPO algorithm. Experimental results demonstrate that the proposed method achieves superior performance. Furthermore, a graph information aggregation module is incorporated into the model architecture to accommodate the communication conditions among agents. With the integration of this module, the algorithm remains effective, and the trained agent swarm exhibits a higher level of cooperation.

2605.17997 2026-05-19 cs.LG cs.AI cs.CV 版本更新

MARR: Module-Adaptive Residual Reconstruction for Low-Bit Post-Training Quantization

MARR: 模块自适应残差重建用于低比特后训练量化

Le Su, Xing Luo, Zhi Jin

发表机构 * Peng Cheng Laboratory(鹏城实验室)

AI总结 本文提出MARR,一种模块自适应残差重建方法,通过为每个模块分配特定的缩放系数,平衡残差相关的HA偏差和累积误差校正,从而在低比特量化中提升性能。

详情
AI中文摘要

近年来,基于残差重建的模型量化方法在低比特后训练量化(PTQ)中取得了有希望的性能,通过引入跨层残差来减少来自先前层的误差积累。然而,这些残差也可能引入额外的偏差,源于重建基于PTQ的Hessian近似(HA)假设,导致量化性能不理想。在本文中,我们分析发现,通过将残差项乘以一个缩放系数,可以提供一种直接的方法来缓解与残差强度相关的HA偏差,同时保持累积误差校正。更重要的是,我们观察到这种权衡是模块依赖性的,使单一全局残差强度不足以在不同模块之间平衡有效的校正和残差相关的偏差。基于这些观察,我们提出了模块自适应残差重建(MARR),为每个模块分配模块特定的缩放系数,以自适应地平衡累积误差校正和残差相关的HA偏差。为了避免昂贵的每模块系数搜索并获得稳定的系数估计,我们设计了一种基于比例-积分-微分(PID)的自适应更新策略,利用重建误差作为反馈,逐步细化此系数。在多个典型的大语言模型(LLMs)和视觉变换器(ViTs)上的实验表明,MARR在低比特量化(小于等于4位)中表现出色,实现了LLMs高达20.2%的性能提升,以及ViTs相对于残差重建最先进的方法高达4.6%的相对提升。代码将在接受后公开发布。

英文摘要

Recently, residual reconstruction-based model quantization methods have achieved promising performance in low-bit post-training quantization (PTQ) by introducing cross-layer residuals to reduce error accumulated from previous layers.However, these residuals may also introduce additional bias arising from the Hessian-approximation (HA) assumption underlying reconstruction-based PTQ, leading to suboptimal quantization performance.In this work, we analyze that multiplying the residual term by a scaling coefficient provides a direct way to mitigate the HA bias associated with residual strength, while preserving accumulated-error correction. More importantly, we observe that this trade-off is module-dependent, making a single global residual strength insufficient to balance effective correction and residual-related bias across modules.Based on these observations, we propose Module-Adaptive Residual Reconstruction (MARR), which assigns a module-specific scaling coefficient to adaptively balance accumulated-error correction and residual-related HA bias for each module.To avoid expensive per-module coefficient search and obtain a stable coefficient estimate, we design a Proportional-Integral-Derivative (PID)-based adaptive update strategy that uses reconstruction error as feedback to progressively refine this coefficient. Experiments on several typical large language models (LLMs) and vision transformers (ViTs) demonstrate the effectiveness of MARR under low-bit quantization (less than or equal to 4-bit), achieving up to 20.2% performance gains on LLMs and up to 4.6% relative gains on ViTs over the residual reconstruction state-of-the-art methods.Code will be made publicly available upon acceptance.

2605.17994 2026-05-19 cs.IR cs.AI 版本更新

Towards Sustainable Growth: A Multi-Value-Aware Retrieval Framework for E-Commerce Search

迈向可持续增长:面向电子商务搜索的多价值感知检索框架

Yifan Wang, Yixuan Wang, YiDan Liang, Qiang Liu, Fei Xiao

发表机构 * Taobao \& Tmall Group of Alibaba HangZhou China Taobao \& Tmall Group of Alibaba

AI总结 本文提出了一种多价值感知的检索框架,旨在平衡即时转化与长期商品增长,通过引入ItemLTV模块和MultiGR模块,提升电子商务搜索系统的可持续增长能力。

详情
AI中文摘要

新商品增长对于维持大型电子商务平台的健康生态系统至关重要。然而,现有系统倾向于优先展示已流行的商品,这种现象通常被称为“马太效应”。在检索检索的背景下,当前冷启动模型面临训练目标与在线业务指标不一致的问题,并缺乏有效机制来衡量商品的增长潜力。在本文中,我们提出了一种针对电子商务搜索的多价值感知检索框架GrowthGR,旨在更好地对齐搜索系统不同阶段的 cascaded 在线价值,同时平衡即时转化和长期商品增长。我们的框架GrowthGR包含两个关键组件:一个用于预测商品长期交易价值的ItemLTV模块和一个基于语义-ID的生成检索架构的多价值感知生成检索(MultiGR)模块。首先,在ItemLTV模块中,我们采用反事实推理来量化单个用户交互带来的长期价值增量。其次,在MultiGR模块中,基于语义-ID的生成检索架构,我们利用具有搜索级联信号的结构化样本,并采用多价值感知策略优化(MoPO)训练范式,以对齐多阶段在线价值,同时显式平衡短期交易价值和由ItemLTV估计的长期增长潜力。我们成功在淘宝的生产平台部署了GrowthGR,实现了新商品GMV的显著提升5.3%,并带来了整体搜索GMV的非平凡0.3%增长。广泛的在线分析和A/B测试证明了其对整体生态系统价值的积极影响。

英文摘要

New item growth is critical for maintaining a healthy ecosystem in large-scale e-commerce platforms. However, existing systems tend to prioritize presenting users with already popular items, a phenomenon often referred to as the "Matthew effect". In the context of search retrieval, current cold-start models suffer from the misalignment between training objectives and online business metrics, and they lack effective mechanisms to measure an item's growth potential. In this paper, we propose a Multi-Value-Aware retrieval framework tailored for e-commerce search, designed to better align with the cascaded online values across different stages of the search system while balancing immediate conversion and long-term item growth. Our framework GrowthGR consists of two key components: an Item Long-term Transaction Value Prediction (ItemLTV) module and a Multi-Value-Aware Generative Retrieval (MultiGR) module. First, in the ItemLTV module, we employ counterfactual inference to quantify the long-term value increment attributable to a single user interaction. Second, in the MultiGR module, building upon a semantic-ID-based generative retrieval architecture, we leverage structured samples with the search cascade signals and adopt a Multi-Value-Aware Policy Optimization (MoPO) training paradigm to align with multi-stage online values, while explicitly balancing short-term transactional value and long-term growth potential estimated by ItemLTV. We successfully deployed GrowthGR on Taobao's production platform, achieving a substantial 5.3% lift in new item GMV while delivering a non-trivial 0.3% gain in overall search GMV. Extensive online analysis and A/B testing demonstrate its positive impact on the overall ecosystem value.

2605.17991 2026-05-19 cs.SD cs.AI 版本更新

Stable Audio 3

稳定音频3

Zach Evans, Julian D. Parker, Matthew Rice, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

AI总结 稳定音频3提出了一种快速的潜在扩散模型家族,用于可变长度音频生成和编辑,通过高效的潜在空间生成和对抗训练提升了生成质量和效率。

Comments Training code: https://github.com/Stability-AI/stable-audio-tools Inference and weights: http://github.com/Stability-AI/stable-audio-3

详情
AI中文摘要

Stable Audio 3 是一组快速的潜在扩散模型(小、中、大)用于可变长度音频生成和编辑。由于我们的模型可以生成几分钟的音频,可变长度生成对于避免生成完整长度音频以生成短声音的成本至关重要。我们还支持修复,使能够进行有针对性的音频编辑和短录音的延续。我们的潜在扩散模型基于一种新的语义-声学自编码器,该自编码器将音频投影到紧凑的潜在空间中,从而在高效扩散生成的同时保持音频保真度,并在潜在空间中鼓励语义结构。最后,我们通过对抗性后训练来加速推理并提高生成质量,减少推理步骤的数量同时提高保真度和提示的遵循性。Stable Audio 3 模型在授权和Creative Commons数据上进行训练,可在H200 GPU上在2秒内生成音乐和声音,在MacBook Pro M4上在几秒内完成。我们发布了小和中型模型的权重,这些模型可以在消费级硬件上运行,并附带其训练和推理流程。

英文摘要

Stable Audio 3 is a family of fast latent diffusion models (small, medium, large) for variable-length audio generation and editing. Since our models can generate several minutes of audio, variable-length generations are key to avoid the cost of producing full-length generations for short sounds. We also support inpainting, enabling targeted audio editing and the continuation of short recordings. Our latent diffusion models operate on top of a novel semantic-acoustic autoencoder that projects audio into a compact latent space, enabling efficient diffusion-based generation while preserving audio fidelity and encouraging semantic structure in the latent. Finally, we run adversarial post-training to both accelerate inference and improve generation quality, reducing the number of inference steps while improving fidelity and prompt adherence. Stable Audio 3 models are trained on licensed and Creative Commons data to generate music and sounds in less than a 2s on an H200 GPU and less than a few seconds on a MacBook Pro M4. We release the weights of small and medium, that can run on consumer-grade hardware, together with their training and inference pipeline.

2605.17989 2026-05-19 cs.CL cs.AI 版本更新

Predictive Prefetching for Retrieval-Augmented Generation

检索增强生成的预测预取

Wuyang Zhang, Shichao Pei

发表机构 * Department of Computer Science, University of Massachusetts Boston(马萨诸塞大学波士顿分校计算机科学系)

AI总结 本文提出了一种先进的异步检索框架,通过预测检索触发时机和所需信息,以减少延迟并提高生成效率,同时保持回答质量。

Comments Accepted by Forty-third International Conference on Machine Learning ICML 2026

详情
AI中文摘要

检索增强生成(RAG)通过在大型语言模型中增强事实性,但因其同步检索导致显著延迟。尽管近期工作探索了异步检索,但现有方法依赖于检索与生成之间的启发式协调,并假设解码期间信息需求稳定,这在复杂、多领域设置中往往失效。本文提出了一种先进的异步检索框架,该框架能够与不断演变的信息需求相匹配,通过利用生成动态中出现的语义前驱,使用三个组件——检索预测器、上下文监视器和查询生成器,显式预测何时应触发检索以及应检索什么信息。在多个基准测试上的实验表明,该方法可实现高达43.5%的端到端延迟减少和62.4%的时间到第一个token的提升,同时保持与同步RAG基线相当的回答质量。

英文摘要

Retrieval-Augmented Generation (RAG) improves factual grounding in large language models but suffers from substantial latency due to synchronous retrieval. While recent work explores asynchronous retrieval, existing approaches rely on heuristic coordination between retrieval and generation and assume stable information demands during decoding that often break in complex, multi-domain settings. In this paper, we propose an advanced asynchronous retrieval framework that enables predictive prefetching aligned with evolving information needs. The framework explicitly predicts when retrieval should be triggered and what information should be retrieved using three components, a retrieval predictor, a context monitor, and a query generator, by exploiting semantic precursors in generation dynamics that emerge several tokens before uncertainty becomes critical. Experiments on multiple benchmarks demonstrate up to 43.5% end-to-end latency reduction and 62.4% improvement in time-to-first-token, while maintaining answer quality comparable to synchronous RAG baselines.

2605.17985 2026-05-19 cs.LG cs.AI 版本更新

SAFE-SVD: Sensitivity-Aware Fidelity-Enforcing SVD for Physics Foundation Models

SAFE-SVD:面向物理基础模型的敏感性感知保真度压缩SVD

Chengjie Hong, Feixiang He, Yiheng Zeng, Lulu Kang, He Wang

发表机构 * AI Centre, University College London(伦敦大学学院人工智能中心) University College London(伦敦大学学院) Central South University(中南大学) University of Massachusetts at Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 本文提出了一种新的压缩物理基础模型的方法,通过在压缩过程中显式建模损失感知的层敏感性,以保持准确性和物理保真度,实验表明在多个模型和数据集上实现了显著的压缩增益。

详情
AI中文摘要

我们提出了一种新的方法,用于压缩物理基础模型(PFMs),这是AI for Science领域的新趋势。尽管模型压缩对于减少内存使用和加速大基础模型的推理至关重要,但其在PFMs中的应用仍然不足探索,因为保持物理保真度至关重要。挑战在于物理数据的功能性质,其中偏导数编码了时空动态,并对压缩具有高度敏感性。传统压缩方法忽视了这种结构,常常导致严重的性能退化或失败。为此,我们引入了一种敏感性感知的保真度强制压缩框架,在压缩过程中显式建模输出函数空间中的损失感知层敏感性。这为压缩科学基础模型提供了一条新途径,同时保持准确性和物理保真度。实验表明,在多个模型和数据集上,相较于现有方法,取得了显著的增益,实现了更高的压缩比,同时保持准确性,在某些情况下甚至提高了几个数量级。更广泛地说,这项工作可能引领AI for Science领域高效、可部署和可持续的科学基础模型的新子领域。

英文摘要

We propose a new method for compressing physics foundation models (PFMs) which is a new trend in AI for Science. While model compression is essential for reducing memory use and accelerating inference in large foundation models, it remains under-explored for PFMs, where preserving physical fidelity is crucial. The challenge lies in the functional nature of physics data, where partial derivatives encode spatiotemporal dynamics and exhibit high sensitivity to compression. Conventional compression methods ignore this structure, often causing severe performance degradation or failure. To address this, we introduce a sensitivity-aware fidelity-enforcing compression framework that explicitly models loss-aware layer sensitivity in the output function space during compression. This provides a new route to compressing scientific foundation models while preserving accuracy and physical fidelity. Experiments show substantial gains over existing methods across multiple models and datasets, achieving significantly higher compression ratios while maintaining accuracy, in some cases by orders of magnitude. More broadly, the work potentially leads to a new subfield of efficient, deployable, and sustainable scientific foundation models in AI for Science.

2605.17976 2026-05-19 cs.AI math.OC 版本更新

Unleashing LLMs in Bayesian Optimization: Preference-Guided Framework for Scientific Discovery

释放大语言模型于贝叶斯优化:用于科学发现的偏好引导框架

Xinzhe Yuan, Zhuo Chen, Jianshu Zhang, Huan Xiong, Nanyang Ye, Yuqiang Li, Qinying Gu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Harbin Institute of Technology(哈尔滨工业大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出了一种基于大语言模型的贝叶斯优化框架LGBO,通过在优化循环中持续整合大语言模型的语义推理,提高了科学发现中的优化效率和收敛速度。

Comments Published as a conference paper at ICLR 2026. 10 pages main paper, 21 pages appendix, 26 figures

Journal ref International Conference on Learning Representations (ICLR), 2026

详情
AI中文摘要

科学发现日益受到昂贵实验和有限资源的限制,凸显了在AI for science中高效优化的必要性。尽管贝叶斯优化(BO)被广泛用于平衡探索与利用,但其在高维设置中表现出冷启动性能缓慢和可扩展性差的问题,限制了其在现实科学问题中的应用。为克服这些挑战,我们提出了LLM引导的贝叶斯优化(LGBO),这是首个将大语言模型(LLMs)的偏好引导整合到优化循环中的贝叶斯优化框架。与以往仅使用LLMs进行预热启动初始化或候选生成的工作不同,LGBO引入了一种区域提升的偏好机制,将LLM驱动的偏好嵌入到每一个迭代中,以稳定且可控的方式调整替代均值。理论上,我们证明了LGBO在最坏情况下不会显著劣于标准BO,而在偏好与目标一致时,能够实现显著更快的收敛速度。实验上,LGBO在物理、化学、生物学和材料科学等多样化的干基准测试中均优于现有方法。最值得注意的是,在一个新的湿实验室优化Fe-Cr电池电解质时,LGBO在6次迭代内达到了最佳观测值的90%,而标准BO和现有LLM增强的基线方法需要超过10次。这些结果表明,LGBO为将LLMs整合到科学优化工作流中提供了一个有前景的方向。

英文摘要

Scientific discovery is increasingly constrained by costly experiments and limited resources, underscoring the need for efficient optimization in AI for science. Bayesian Optimization (BO), though widely adopted for balancing exploration and exploitation, often exhibits slow cold-start performance and poor scalability in high-dimensional settings, limiting its applicability in real-world scientific problems. To overcome these challenges, we propose LLM-Guided Bayesian Optimization (LGBO), the first LLM preference-guided BO framework that continuously integrates the semantic reasoning of large language models (LLMs) into the optimization loop. Unlike prior works that use LLMs only for warm-start initialization or candidate generation, LGBO introduces a region-lifted preference mechanism that embeds LLM-driven preferences into every iteration, shifting the surrogate mean in a stable and controllable way. Theoretically, we prove that LGBO does not perform significantly worse than standard BO in the worst case, while achieving significantly faster convergence when preferences align with the objective. Empirically, LGBO consistently outperforms existing methods across diverse dry benchmarks in physics, chemistry, biology, and materials science. Most notably, in a new wet-lab optimization of Fe-Cr battery electrolytes, LGBO attains \textbf{90\% of the best observed value within 6 iterations}, whereas standard BO and existing LLM-augmented baselines require more than 10. Together, these results suggest that LGBO offers a promising direction for integrating LLMs into scientific optimization workflows.

2605.17971 2026-05-19 cs.CR cs.AI 版本更新

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Babel: 通过混淆分布优化采样实现安全机制的 jailbreak 安全性

Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang, Yebo Feng, Ju Jia, Ruiying Du, Cong Wu, Yang Liu

发表机构 * Wuhan University(武汉大学) Nanyang Technological University(南洋理工大学) Southeast University(东南大学) University of Hong Kong(香港大学)

AI总结 本文研究了大语言模型安全机制中的内在漏洞,提出了一种高效的黑盒攻击框架Babel,通过系统性的混淆采样和反馈驱动的分布优化,实现了高成功率的jailbreak攻击,展示了在LLM安全研究中的稳健方法。

详情
AI中文摘要

尽管安全对齐严格,大语言模型(LLMs)仍然容易受到jailbreak攻击。现有黑盒方法往往依赖启发式模板或穷举尝试,缺乏机制解释性和查询效率。在本研究中,我们探讨了LLMs安全机制中的一个内在漏洞,其中安全对齐依赖于少量稀疏分布的注意力头,导致大部分表示空间被弱监控。我们通过数学jailbreaking模型正式化这一现象,该模型刻画了有效文本混淆的微妙边界,并分析了观察到的jailbreak行为。受此模型指导,我们提出Babel,一种高效的黑盒攻击框架,通过系统性混淆采样和迭代反馈驱动的分布优化,利用识别的安全间隙,实现可靠且高成功率的jailbreak攻击,而无需访问模型内部。在前沿商业模型上的全面评估表明,Babel实现了最先进的攻击成功率和优越的查询效率。具体而言,与现有最先进方法相比,Babel在GPT-4o上的攻击成功率从41.33%提升至82.67%,在Claude-3-5-haiku上的攻击成功率从38.33%提升至78.33%,在平均40次查询内提供了一种稳健的LLM安全研究方法。

英文摘要

Despite rigorous safety alignment, Large Language Models (LLMs) remain vulnerable to jailbreak attacks. Existing black-box methods often rely on heuristic templates or exhaustive trials, lacking mechanistic interpretability and query efficiency. In this study, we investigate an intrinsic vulnerability in the safety mechanisms of LLMs, where safety alignment relies on a small set of sparsely distributed attention heads, leaving much of the representational space weakly monitored. We formalize this phenomenon with a mathematical jailbreaking model that characterizes the delicate boundary of effective text obfuscation and analytically explains observed jailbreak behaviors. Guided by this model, we propose Babel, an efficient black-box attack framework that exploits the identified safety gap through systematic obfuscation sampling with iterative, feedback-driven distribution refinement, enabling reliable and high-success jailbreak attacks without access to model internals. Comprehensive evaluations on frontier commercial models demonstrate that Babel achieves state-of-the-art attack success rates and superior query efficiency. Specifically, compared to state-of-the-art methods, Babel increases the attack success rate on GPT-4o from 41.33% to 82.67% and on Claude-3-5-haiku from 38.33% to 78.33% within an average of 40 queries, providing a robust red-teaming methodology for LLMs safety research.

2605.17967 2026-05-19 cs.AI 版本更新

Reconciling Contradictory Views on the Effectiveness of SFT in LLMs: An Interaction Perspective

弥合对SFT在LLM中效果的矛盾观点:一种交互视角

Junpeng Zhang, Lei Cheng, Guoxi Zhang, Hua Cai, Qing Xu, Quanshi Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Beijing Institute for General Artificial Intelligence(北京一般人工智能研究院) UniDT

AI总结 本文从交互视角探讨了SFT在LLM中的效果不一致问题,发现SFT主要去除噪声交互但难以获得可靠新交互,且去噪阶段短暂,继续微调易引入过拟合交互。

详情
AI中文摘要

本文探讨了监督微调(SFT)在深度神经网络中的有效性问题:为何SFT在小规模模型中广泛有效,但在大语言模型(LLM)中却可能产生不一致甚至有害的效果。最近基于交互的解释方法表明,词/标记之间的交互提供了衡量LLM编码推理模式的忠实指标。我们发现SFT过程中交互的演变能有效解释SFT在LLM中的不一致效果。具体而言,我们发现(1)SFT主要去除噪声样的交互,而很少获得可靠的新的交互。(2)这一去噪阶段极为短暂,之后继续微调倾向于引入过拟合的交互。我们通过多个LLM和数据集验证了这些发现。我们的发现为早期停止提供了新见解,并为LLM训练提供了实用指导。

英文摘要

This paper explores a scientific question in supervised fine-tuning (SFT): why SFT is broadly effective for small-scale deep neural networks, yet can produce inconsistent or even detrimental effects when applied to large language models (LLMs). Recent advances in interaction-based explanations suggest that interactions between words/tokens provide a faithful metric for quantifying the inference patterns encoded by LLMs. We find that the evolution of interactions during SFT can effectively explain the inconsistent effectiveness of SFT for LLMs. Specifically, we find that (1) SFT primarily removes noise-like interactions, while rarely acquiring reliable new interactions. (2) This denoising stage is extremely brief, after which continued fine-tuning tends to introduce overfitted interactions. We validate these findings across multiple LLMs and datasets. Our findings provide new insights into early stopping and offer practical guidance for LLM training.

2605.17965 2026-05-19 cs.SE cs.AI 版本更新

BLAgent: Agentic RAG for File-Level Bug Localization

BLAgent: 一种面向文件级Bug定位的代理RAG

Md Afif Al Mamun, Gias Uddin

发表机构 * University of Calgary(卡尔加里大学) York University(约克大学)

AI总结 本文提出BLAgent,一种新型代理RAG框架,用于文件级Bug定位,通过结合代码结构感知的仓库编码、双视角查询转换和两阶段代理重排序,提高了Bug定位的准确性和效率。

Comments Under review at the ACM Transactions on Software Engineering and Methodology

详情
AI中文摘要

Bug定位仍然是软件维护任务中的关键瓶颈,包括根本原因分析、分拣和自动程序修复(APR),尽管近年来大型语言模型(LLM)基于修复系统取得了进展。文件级Bug定位在分层管道中尤其关键,因为错误可能传播到下游阶段,如语句级定位或补丁生成。虽然检索增强生成(RAG)为使LLM扎根于仓库上下文提供了有前途的方向,但现有RAG流程依赖静态检索,缺乏识别故障代码所需的原因。在本文中,我们提出了BLAgent,一种新的代理RAG框架,用于文件级Bug定位,集成了三个关键思想:(i)具有路径增强的AST基于分块的仓库编码,(ii)双视角查询转换捕捉结构和行为信号,(iii)两阶段代理重排序结合符号检查与证据基础推理。与之前的图基或多跳代理方法不同,BLAgent在紧凑的候选集中进行有限推理,平衡了准确性和成本。在SWE-bench Lite上,BLAgent使用开源模型时达到78%以上的Top-1准确率,使用闭源模型时达到86%以上,同时比使用相同模型的最强基线便宜18倍以上。当整合到APR框架中时,它提高了端到端修复的成功率超过20%。

英文摘要

Bug localization remains a key bottleneck in downstream software maintenance tasks, including root cause analysis, triage, and automated program repair (APR), despite recent advances in large language model (LLM)-based repair systems. File-level bug localization is especially critical in hierarchical pipelines, where errors can propagate to downstream stages such as statement-level localization or patch generation. While Retrieval-Augmented Generation (RAG) offers a promising direction for grounding LLMs in repository context, existing RAG pipelines rely on static retrieval and lack the reasoning needed to identify faulty code accurately. In this work, we present BLAgent, a novel agentic RAG framework for file-level bug localization that integrates three key ideas: (i) code structure-aware repository encoding with path-augmented AST-based chunking, (ii) dual-perspective query transformation capturing both structural and behavioral signals, and (iii) two-phase agentic reranking combining symbolic inspection with evidence-grounded reasoning. Unlike prior graph-based or multi-hop agentic approaches, BLAgent performs bounded reasoning over a compact candidate set, balancing accuracy and cost. On SWE-bench Lite, BLAgent attains over 78% Top-1 accuracy with open-source models and over 86% with a closed-source model, while being over 18x cheaper than the strongest baseline using the same model. When integrated into an APR framework, it improves end-to-end repair success by over 20%.

2605.17954 2026-05-19 cs.CV cs.AI cs.LG 版本更新

A More Word-like Image Tokenization for MLLMs

一种更像单词的图像标记化方法用于大规模语言模型

Hyun Lee, Hyemin Jeong, Yejin Kim, Hyungwook Choi, Hyunsoo Cho, Soo Kyung Kim, Joonseok Lee

发表机构 * Seoul National University(首尔国立大学) Ewha Womans University(成均馆大学)

AI总结 本文提出了一种解耦视觉标记化方法(DiVT),通过将图像块嵌入聚类为语义单元,使每个标记对应于独特的视觉概念,从而提升多模态模型的性能和效率。

Journal ref Proceedings of the IEEE/CVF International Conference on Pattern Recognition and Computer Vision (CVPR), 2026

详情
AI中文摘要

现代多模态大语言模型(MLLMs)通常保持语言模型不变,并训练一个视觉投影器,将像素映射到其嵌入空间中的标记序列,使图像能以与文本相同的形式呈现。然而,语言模型已优化以操作离散且具有语义意义的标记,而现有视觉投影器将图像转换为长流的连续且高度相关的嵌入。这导致视觉标记的行为不同于LLM最初训练以理解的单词状单元。我们提出了一种新的解耦视觉标记化(DiVT),将图像块嵌入聚类为连贯的语义单元,使得每个标记对应于一个独特的视觉概念,而不是一个刚性的网格单元。DiVT进一步根据图像复杂度调整其标记预算,提供显式的精度-计算权衡,既不修改视觉编码器也不修改语言模型。在多样化的多模态基准测试中,DiVT在显著较少的视觉标记下匹配或超越基线,展示了在有限标记预算下的鲁棒性,显著降低了内存成本和延迟,同时使视觉输入更兼容于LLM。我们的代码可在https://github.com/snuviplab/DiVT上获得。

英文摘要

Modern multimodal large language models (MLLMs) typically keep the language model fixed and train a visual projector that maps the pixels into a sequence of tokens in its embedding space, so that images can be presented in essentially the same form as text. However, the language model has been optimized to operate on discrete, semantically meaningful tokens, while prevailing visual projectors transform an image into a long stream of continuous and highly correlated embeddings. This causes the visual tokens to behave differently from the word-like units that LLMs are originally trained to understand. We propose a novel Disentangled Visual Tokenization (DiVT) that clusters patch embeddings into coherent semantic units, so each token corresponds to a distinct visual concept instead of a rigid grid cell. DiVT further adapts its token budget to image complexity, providing an explicit accuracy-compute trade-off modifying neither the vision encoder nor the language model. Across diverse multimodal benchmarks, DiVT matches or surpasses baselines with significantly fewer visual tokens, demonstrating robustness under limited token budgets, significantly reducing memory cost and latency while making visual inputs more compatible with LLMs. Our code is available at https://github.com/snuviplab/DiVT.

2605.17938 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Training data attribution in diffusion models via mirrored unlearning and noise-consistent skew

通过镜像反学习和噪声一致偏斜训练数据归因

Joan Serrà, Dipam Goswami, Fabio Morreale, Wei-Hsiang Liao, Yuki Mitsufuji

发表机构 * Sony AI(索尼人工智能)

AI总结 本文提出了一种基于镜像反学习和噪声一致偏斜的方法,用于提升扩散模型的训练数据归因的可靠性与鲁棒性,通过在不同数据集上显著优于现有方法,展示了其在生成实例间影响实例重叠和扩散损失比较任务中的潜力。

Comments 21 pages, 5 figures, 9 tables (includes appendix)

详情
AI中文摘要

训练数据归因(TDA)应能够促进生成模型的可解释性,并推动各种相关下游任务的发展。然而,当前的TDA方法缺乏可靠性和鲁棒性,阻碍了其在实际应用中的采用。在本文中,我们采取了关键步骤,以实现更可靠和鲁棒的扩散模型TDA。我们提出通过镜像反学习和噪声一致偏斜(MUCS)进行TDA。该方法的核心思想是使用受限的镜像梯度上升微调第二个模型,并通过一致的噪声样本测量该模型相对于原始模型的归一化偏斜。我们展示了,尽管概念上简单且通用,MUCS在三个不同的数据集上系统性地大幅优于现有方法。此外,我们研究了核心设计选择对最终性能的影响,并分析了影响实例在生成项目中的重叠以及整合TDA方法的潜力。我们相信,我们的发现可能对更一般的反学习设置以及需要比较扩散损失的任务具有更广泛的意义。

英文摘要

Training data attribution (TDA) should enable generative model interpretability and foster a variety of related downstream tasks. Nonetheless, current TDA approaches lack reliability and robustness, preventing their adoption in real-world setups. In this paper, we take a decisive step towards more reliable and robust TDA for diffusion models. We propose to perform TDA with mirrored unlearning and noise-consistent skew (MUCS). The idea is to fine-tune a second model with bounded mirrored gradient ascent, and to measure the normalized skew of this model with respect to the original one using consistent noise samples. We show that, while being conceptually simple and generic, MUCS systematically outperforms existing methods on three different datasets by a large margin. We additionally study the effect that core design choices have on final performance, and analyze novel aspects regarding the overlap of influential instances across generated items and the potential of ensembling TDA approaches. We believe that our findings may have broader implications for more general unlearning setups, as well as for tasks requiring the comparison of diffusion losses.

2605.17932 2026-05-19 cs.CL cs.AI 版本更新

Prompt Compression in Diffusion Large Language Models: Evaluating LLMLingua-2 on LLaDA

在扩散大型语言模型中进行提示压缩:在LLDA上评估LLMLingua-2

Sterling Huang, Abigayle Brown, Jiyoo Noh, Jiakang Xu, Wantong Huo, Kaung Myat Kyaw, Jonathan Chan

发表机构 * University of Toronto(多伦多大学) King Mongkut’s University of Technology Thonburi(泰国科技理工学院)

AI总结 本文研究了提示压缩在扩散大型语言模型中的有效性,通过在LLDA上评估LLMLingua-2,发现提示压缩在数学推理任务中效果不佳,而摘要任务相对稳健,表明为扩散模型设计的提示压缩方法并不适用于所有场景。

详情
AI中文摘要

提示压缩可以减少大型语言模型的推理成本和上下文长度,但之前的评估主要集中在自回归架构上。本研究探讨了提示压缩是否能有效转移到扩散大型语言模型(DLLMs)中,使用LLMLingua-2,特别是具有8B参数的DLLM LLaDA。我们在GSM8K、DUC2004和ShareGPT数据集上使用每个数据集约250个提示,以大约2倍的压缩率,在数学推理、提示重建和摘要任务中评估压缩性能。通过精确匹配准确率、BLEU、ROUGE和BERTScore比较原始提示、压缩提示、重建提示和重建提示推理生成的输出。结果表明,语义保持并不必然意味着在扩散模型中下游行为的稳定性。摘要任务在压缩下相对稳健,而数学推理任务在高语义相似度分数下显著退化。重建实验进一步表明,语义相似的提示可能仍然遗漏了稳定去噪所需的关键推理信息。在所有任务中,BERTScore召回率始终低于精度,表明压缩失败主要由信息遗漏驱动,而非语义漂移。这些发现表明,为自回归模型设计的提示压缩方法并不均匀地适用于扩散大型语言模型,从而推动了为扩散模型设计的压缩策略的发展。

英文摘要

Prompt compression reduces inference cost and context length in large language models, but prior evaluations focus primarily on autoregressive architectures. This study investigates whether prompt compression transfers effectively to diffusion large language models (DLLMs) using LLMLingua-2, specifically the 8B-parameter DLLM LLaDA. We evaluate compression performance on GSM8K, DUC2004, and ShareGPT using 250 prompts per dataset at an approximate 2$\times$ compression ratio, across mathematical reasoning, prompt reconstruction, and summarization tasks. Outputs generated from original prompts, compressed prompts, reconstructed prompts, and reconstructed-prompt reasoning were compared using exact-match accuracy, BLEU, ROUGE, and BERTScore. Results show that semantic preservation does not necessarily imply stable downstream behavior in diffusion models. Summarization tasks remained comparatively robust under compression, while mathematical reasoning degraded substantially despite high semantic similarity scores. Reconstruction experiments further showed that semantically similar prompts may still omit reasoning-critical information required for stable denoising. Across tasks, BERTScore recall was consistently lower than precision, suggesting that compression failures are primarily driven by information omission rather than semantic drift. These findings indicate that prompt compression methods designed for autoregressive models do not transfer uniformly to diffusion large language models and motivate the development of diffusion-aware compression strategies.

2605.17923 2026-05-19 cs.DC cs.AI cs.LG 版本更新

AdaptiveLoad: Towards Efficient Video Diffusion Transformer Training

AdaptiveLoad: 向高效视频扩散变换器训练迈进

Yucheng Guo, Yongjian Guo, Zhong Guan, Haoran Sun, Wen Huang, Wanting Xu, Jing Long, Shuai Di, Junwu Xiong

发表机构 * Tsinghua University(清华大学) Peking University(北京大学) Tianjin University(天津大学)

AI总结 本文提出AdaptiveLoad框架,通过双约束自适应负载平衡系统和融合LayerNorm-Modulate CUDA内核,解决视频生成模型中大规模视频扩散变换器(如DiT和MMDiT)训练中的计算不平衡问题,实验显示其在Wan 2.1世界模型上提升了计算效率和训练吞吐量。

详情
AI中文摘要

在视频生成模型,特别是世界模型中,训练大规模视频扩散变换器(如DiT和MMDiT)由于混合模式数据集中序列长度的极端差异,带来了显著的计算挑战。现有基于桶的数据加载策略通常依赖于'等长token'约束。这种方法未能考虑自注意力机制的二次复杂性,导致严重的负载不平衡和GPU资源利用率低下。本文提出了AdaptiveLoad,一个集成优化框架,包含两个核心组件:(1)双约束自适应负载平衡系统,通过同时限制内存消耗和计算负载(B×S^p≤M_comp)消除长序列瓶颈;(2)融合LayerNorm-Modulate CUDA内核,利用D-tile共alesced减少策略提高吞吐量并缓解内存压力。实验结果表明,在Wan 2.1世界模型上,我们的方法将计算不平衡率从39%降低到18.9%,峰值VRAM利用率效率提高22.7%,并实现了整体训练吞吐量增加27.2%。

英文摘要

In video generation models, particularly world models, training large-scale video diffusion Transformers (such as DiT and MMDiT) poses significant computational challenges due to the extreme variance in sequence lengths within mixed-mode datasets. Existing bucket-based data loading strategies typically rely on "equal token length" constraints. This approach fails to account for the quadratic complexity of self-attention mechanisms, leading to severe load imbalance and underutilization of GPU resources. This paper proposes \textit{AdaptiveLoad}, an integrated optimization framework consisting of two core components: (1) A dual-constraint adaptive load balancing system, which eliminates long-sequence bottlenecks by simultaneously limiting memory consumption and computational load ($B \times S^p \le M_{\text{comp}}$); (2) A fused LayerNorm-Modulate CUDA kernel, which utilizes a D-tile coalesced reduction strategy to increase throughput and alleviate memory pressure. Experimental results on the Wan 2.1 world model demonstrate that our method reduces the computational imbalance rate from 39\% to 18.9\%, improves peak VRAM utilization efficiency by 22.7\%, and achieves an overall training throughput increase of 27.2\%.

2605.17918 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Domain Transfer Becomes Identifiable via a Single Alignment

通过单个对齐使领域转移变得可识别

Sagar Shrestha, Subash Timilsina, Hoang-Son Nguyen, Xiao Fu

发表机构 * School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA(电气工程与计算机科学系,俄勒冈州立大学,科瓦利斯,俄勒冈,美国)

AI总结 本文提出了一种新的方法,通过结构稀疏性条件和单个配对锚样本实现领域转移的可识别性,减少了对监督信号的依赖,并提出了高效的雅可比稀疏性正则化器以支持高维学习。

详情
AI中文摘要

领域转移(DT)将源分布映射到目标分布,并支持无监督的图像到图像翻译、单细胞分析和跨平台医学影像任务。然而,DT本质上是不明确的:推动正向映射通常不可识别,因为保持测度的自同构(MPAs)在保持边缘分布的同时改变跨领域对应关系,导致内容不一致的翻译。最近的工作表明,通过联合转移多个对应的源/目标条件分布可以消除MPAs,但标记这些条件的监督信号在实践中并不总是可用。我们开发了一种替代的DT可识别性路线。在雅可比支持图案的结构稀疏性条件下,我们证明了分布匹配与单个配对锚样本足以识别真实转移——比先前方法需要的监督更少。为了支持实际的高维学习,我们进一步提出了一种基于随机掩码有限差分的高效雅可比稀疏性正则化器,得到一个可扩展的替代品,无需显式雅可比评估。在合成和现实任务上的实验证实了理论。

英文摘要

Domain transfer (DT) maps source to target distributions and supports tasks such as unsupervised image-to-image translation, single-cell analysis, and cross-platform medical imaging. However, DT is fundamentally ill-posed: push-forward mappings are generally non-identifiable, as measure-preserving automorphisms (MPAs) preserve marginals while altering cross-domain correspondences, leading to content-misaligned translation. Recent work shows that MPAs can be eliminated by jointly transferring multiple corresponding source/target conditional distributions, but supervision signals labeling such conditionals are not always available in practice. We develop an alternative route to DT identifiability. Under a structural sparsity condition on the Jacobian support pattern, we show that distribution matching together with a single paired anchor sample suffices to identify the ground-truth transfer -- requiring substantially less supervision than prior approaches. To enable practical high-dimensional learning, we further propose an efficient Jacobian sparsity regularizer based on randomized masked finite differences, yielding a scalable surrogate without explicit Jacobian evaluation. Empirical results on synthetic and real-world DT tasks validate the theory.

2605.17907 2026-05-19 cs.CV cs.AI 版本更新

One Model to Translate Them All: Universal Any-to-Any Translation for Heterogeneous Collaborative Perception

一个模型翻译它们所有:面向异构协作感知的通用任意到任意翻译

Yang Li, Weize Li, Quan Yuan, Congzhang Shao, Guiyang Luo, Yunqi Ba, Xuanhan Zhu, Xinyuan Ding, Xiaoyuan Fu, Jinglin Li

发表机构 * State Key Laboratory of Networking and Switching Technology(网络与交换技术国家重点实验室)

AI总结 本文提出UniTrans,一种通用任意到任意特征模态翻译模型,通过预训练一组翻译专家参数并学习其组合系数来实现零样本翻译,从而在OPV2V-H和DAIR-V2X数据集上实现了优于现有方法的性能。

Comments 19 pages, accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

通过共享中间特征,协作感知扩展了每个代理的感知能力,但现实世界中的特征模态异质性仍然是有效融合的关键障碍。大多数现有方法,包括直接适应和协议基于的转换,通常依赖于为新出现的特征模态训练适配器,往往需要额外的重新训练或微调。这种重复训练成本高,并且由于模型和数据隐私限制,在跨制造商之间不可行,限制了现实世界的可扩展性。为了解决这个问题,我们提出了UniTrans,一种通用的任意到任意特征模态翻译模型,该模型可以即时实例化任意模态的翻译器。UniTrans预训练了一组翻译专家参数,并学习其组合系数作为源到目标模态映射的函数。映射是在模态内在的潜在空间中进行测量,其中内在编码器从单帧中间特征中提取模态特定但场景不变的代码,使UniTrans能够以零样本的方式实例化翻译器。在OPV2V-H和DAIR-V2X上的实验表明,UniTrans在模拟和现实世界中均优于现有方法,通过通用模型实现了高效的任意到任意翻译。代码可在https://github.com/CheeryLeeyy/UniTrans上获得。

英文摘要

By sharing intermediate features, collaborative perception extends each agent's sensing beyond standalone limits, but real-world feature modality heterogeneity remains a key barrier to effective fusion. Most existing methods, including direct adaption and protocol-based transformation, typically rely on training adapters for newly emerging feature modalities and often require additional retraining or fine-tuning. Such repeated training is costly and is often infeasible across manufacturers due to model and data privacy constraints, limiting real-world scalability. To address this issue, we propose UniTrans, a universal any-to-any feature modality translation model that instantiates translators on the fly for arbitrary modalities. UniTrans pretrains a bank of translator expert parameters and learns their combination coefficients as a function of source-to-target modality mapping. The mapping is measured in a modality-intrinsic latent space, where an intrinsic encoder extracts modality-specific yet scene-invariant codes from single-frame intermediate features, enabling UniTrans to instantiate translators in a zero-shot manner. Experiments on OPV2V-H and DAIR-V2X demonstrate that UniTrans consistently outperforms state-of-the-art methods in both simulated and real-world settings, enabling efficient any-to-any translation through a universal model. The code is available at https://github.com/CheeryLeeyy/UniTrans.

2605.17903 2026-05-19 cs.AI cs.CL cs.HC cs.IR 版本更新

Agentic Chunking and Bayesian De-chunking of AI Generated Fuzzy Cognitive Maps: A Model of the Thucydides Trap

代理分块与贝叶斯去分块:人工智能生成的模糊认知图的模型:特克西德斯陷阱模型

Akash Kumar Panda, Olaoluwa Adigun, Bart Kosko

发表机构 * University of Southern California(美国南加州大学) Florida International University(佛罗里达国际大学)

AI总结 本文提出了一种基于代理分块和贝叶斯去分块的方法,用于生成和更新人工智能生成的模糊认知图,通过在文本中生成重叠的文本分块,并利用稀疏因果分块矩阵进行混合,从而构建出代表性的循环模糊认知图知识图谱,以预测特克西德斯陷阱模型中的冲突结果。

Comments 15 pages, 6 figures

详情
AI中文摘要

我们通过训练大语言模型代理将文本分解为重叠的文本分块,从而自动生成反馈因果模糊认知图(FCMs)。通过将这些分块FCMs进行凸混合,可以得到一个代表性的循环FCM知识图。文本分块可以有不同的重叠程度。分块FCMs仍然混合以形成新的FCM因果知识图。混合技术的可扩展性源于其使用轻量计算和稀疏因果分块矩阵。混合结构允许进行一种操作层面的贝叶斯推断,从而从混合的FCM中生成“去分块”或后验似的FCM。这些去分块的FCM在自身具有价值,并允许进一步的贝叶斯更新。我们通过Allison的“特克西德斯陷阱”模型的论文文本演示了这些混合技术,该模型描述了主导力量(如美国)与崛起力量(如中国)之间的冲突。FCM动态系统在达到固定点或极限环吸引子时预测结果。当我们通过激活代表崛起力量野心和权利的概念节点来刺激这些FCM知识图时,8个中的7个FCM知识图预测了战争类型。Gemini 3.1 LLMs作为分块AI代理。

英文摘要

We automatically generate feedback causal fuzzy cognitive maps (FCMs) from text by teaching large-language-model agents to break the text into overlapping chunks of text. Convex mixing of these chunk FCMs gives a representative cyclic FCM knowledge graph. The text chunks can have different levels of overlap. The chunk FCMs still mix to form a new FCM causal knowledge graph. The mixing technique scales because it uses light computation with sparse causal chunk matrices. The mixing structure allows an operator-level type of Bayesian inference that produces "de-chunked" or posterior-like FCMs from the mixed FCM. These de-chunked FCMs are useful in their own right and allow further iterations of Bayesian updating. We demonstrate these mixing techniques on the essay text of Allison's "Thucydides Trap" model of conflict between a dominant power such as the United States and a rising power such as China. The FCM dynamical systems predict outcomes as they equilibrate to fixed-point or limit-cycle attractors. Seven out of 8 FCM knowledge graphs predicted a type of war when we stimulated them by turning on and keeping on the concept node that stands for the rising power's ambition and entitlement. Gemini 3.1 LLMs served as the chunking AI agents.

2605.17902 2026-05-19 cs.AI 版本更新

LAST-RAG: Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation for Knowledge-Conditioned Degradation Model Selection

LAST-RAG:文献锚定的随机轨迹检索增强生成用于知识条件退化模型选择

Hanbyeol Park, Hyerim Bae

发表机构 * Department of Industrial Engineering(工业工程系) Pusan National University(釜山国立大学)

AI总结 本文提出LAST-RAG方法,通过结合观测健康指标轨迹和领域特定上下文,利用理论和机械证据从本地证据库中检索,以改进退化模型选择,将模型选择从纯统计拟合问题转变为结合观测数据和领域知识的决策问题。

详情
AI中文摘要

基于随机过程的退化建模是估计剩余使用寿命(RUL)分布的核心方法;然而,适当选择随机过程的方法尚未得到充分解决。现有模型选择方法主要依赖于观测健康指标(HI)轨迹的统计拟合,但当观察窗口较短或信号高度噪声时,这种方法可能选择与底层退化机制不一致的模型。为了解决这个问题,本文提出了文献锚定的随机轨迹检索增强生成(LAST-RAG)。该方法利用观测的HI轨迹和领域特定上下文,并基于从本地证据库中检索的理论和机械证据,分层地对候选退化模型空间进行条件。此外,引入了基于规则的置信度推理与不确定状态(RCRUS)以防止在分层决策不确定时过早排除候选模型。基于仿真的实验表明,所提出的方法在韦纳/伽马族分类和详细退化模型分类中均优于统计、预测和不确定性感知的基线方法。最终,本研究将退化模型选择从纯粹的统计拟合问题重新界定为一个结合观测数据和领域知识的知识条件决策问题。

英文摘要

Stochastic-process-based degradation modeling is a core approach for estimating the distribution of remaining useful life (RUL); however, the selection of an appropriate stochastic process has not been sufficiently addressed. Existing model selection methods mainly rely on the statistical fit of the observed health indicator (HI) trajectory, but this approach may select a model that is inconsistent with the underlying degradation mechanism when the observation window is short or the signal is highly noisy. To address this issue, this paper proposes Literature-Anchored Stochastic Trajectory Retrieval-Augmented Generation (LAST-RAG). The proposed method uses both the observed HI trajectory and domain-specific context, and hierarchically conditions the candidate degradation model space based on theoretical and mechanical evidence retrieved from a local evidence bank. In addition, Rule-based Confidence Reasoning with Uncertain State (RCRUS) is introduced to prevent candidate models from being prematurely eliminated when hierarchical decisions are uncertain. Simulation-based experiments demonstrate that the proposed method outperforms statistical, prognostic, and uncertainty-aware baselines in both Wiener/gamma family classification and detailed degradation model classification. Ultimately, this study reframes degradation model selection from a purely statistical goodness-of-fit problem into a knowledge-conditioned decision-making problem that integrates observed data with domain knowledge.

2605.17900 2026-05-19 cs.AI 版本更新

DuIVRS-2: An LLM-based Interactive Voice Response System for Large-scale POI Attribute Acquisition

DuIVRS-2: 基于大语言模型的大型兴趣点属性采集交互语音响应系统

Le Zhang, Shengming Zhang, Rui Zha, Yunpeng Wu, Jingbo Zhou, Jizhou Huang

发表机构 * Baidu Inc.(百度公司)

AI总结 本文提出DuIVRS-2,一种基于大语言模型的端到端框架,用于大规模兴趣点属性采集,通过有限状态机引导的数据增强策略、选择生成方案与思维链机制,提高了输出稳定性并有效消除幻觉,最终在生产环境中实现了83.9%的任务成功率。

Comments Accepted to ACL 2026 Industry Track. 14 pages, including appendix

详情
AI中文摘要

准确获取兴趣点(POI)属性对于基于位置的服务至关重要,但传统模块化的交互语音响应(IVR)系统存在误差累积和高维护成本的问题。我们提出了DuIVRS-2,一种基于大语言模型(LLM)的端到端框架,用于百度地图的大规模POI属性采集。为了解决现实交互中的长尾分布问题,我们的方法首先采用有限状态机(FSM)引导的数据增强策略,生成平衡且多样化的训练数据集。然后通过选择生成方案结合思维链(CoT)机制,优化对话管理,确保输出稳定性并有效消除工业环境中的幻觉。为了便于持续策略优化且最小化人工努力,我们设计了协作迭代学习框架,利用双评估者投票系统。在生产环境中部署两个月,DuIVRS-2每天处理0.4百万次呼叫,实现了83.9%的任务成功率(TSR),比其前身高出4个百分点,同时保持130ms的低响应时间。本工作为开发鲁棒且成本效益高的LLM代理用于大规模工业对话应用提供了生产验证的参考。

英文摘要

Accurate Point of Interest (POI) attribute acquisition is essential for location-based services, yet traditional modular Interactive Voice Response (IVR) systems suffer from error accumulation and high maintenance overhead. We present DuIVRS-2, a large language model (LLM)-based end-to-end framework designed for large-scale POI attribute acquisition at Baidu Maps. To address the long-tail distribution of real-world interactions, our methodology first employs a finite state machine (FSM)-guided data augmentation strategy to synthesize a balanced and diverse training dataset. We then streamline dialogue management via a selective generation scheme combined with a Chain-of-Thought (CoT) mechanism, which ensures output stability and effectively eliminates hallucinations in industrial settings. To facilitate continuous policy refinement with minimal manual effort, we design a cooperative iterative learning framework that leverages a dual-evaluator voting system. Deployed in production for two months, DuIVRS-2 processed 0.4 million calls daily and achieved a 83.9\% Task Success Rate (TSR), outperforming its predecessor by 4 percentage points while maintaining a low reaction time of 130ms. This work provides a production-proven reference for developing robust, cost-effective LLM agents for large-scale industrial dialogue applications.

2605.17899 2026-05-19 cs.LG cs.AI q-bio.QM 版本更新

DCFold: Efficient Protein Structure Generation with Single Forward Pass

DCFold: 通过单次前向传递高效生成蛋白质结构

Zhe Zhang, Yuanning Feng, Yuxuan Song, Keyue Qiu, Hao Zhou, Wei-Ying Ma

发表机构 * Institute for AI Industry Research (AIR)(人工智能产业研究院) Department of Computer Science and Technology(计算机科学与技术系) School of Computer Science and Technology(计算机科学与技术学院) ByteDance Seed(字节跳动种子)

AI总结 本文提出DCFold,一种单步生成模型,实现了与AlphaFold3同等的精度,通过双一致性训练框架和新的时间测地匹配(TGM)调度器,在保持预测保真度的同时将推理速度提升15倍,验证了其在结构预测和结合设计基准上的有效性。

详情
AI中文摘要

AlphaFold3引入了一种基于扩散的架构,将蛋白质结构预测提升到原子级分辨率,并提高了准确性。这种最先进的性能使AlphaFold3成为多样化生成和设计任务的基础模型。然而,其迭代设计显著增加了推理时间,限制了在虚拟筛选和蛋白质设计等下游任务中的实际部署。我们提出DCFold,一种单步生成模型,实现了AlphaFold3级别的精度。我们的双一致性训练框架,结合了新的时间测地匹配(TGM)调度器,使DCFold在保持预测保真度的同时,将推理速度提升15倍。我们验证了其在结构预测和结合设计基准上的有效性。

英文摘要

AlphaFold3 introduces a diffusion-based architecture that elevates protein structure prediction to all-atom resolution with improved accuracy. This state-of-the-art performance has established AlphaFold3 as a foundation model for diverse generation and design tasks. However, its iterative design substantially increases inference time, limiting practical deployment in downstream settings such as virtual screening and protein design. We propose DCFold, a single-step generative model that attains AlphaFold3-level accuracy. Our Dual Consistency training framework, which incorporates a novel Temporal Geodesic Matching (TGM) scheduler, enables DCFold to achieve a 15x acceleration in inference while maintaining predictive fidelity. We validate its effectiveness across both structure prediction and binder design benchmarks.

2605.17894 2026-05-19 cs.AI 版本更新

Evaluating Cognitive Age Alignment in Interactive AI Agents

评估交互式AI代理的认知年龄对齐

Yifan Shen, Jiawen Zhang, Jian Xu, Junho Kim, Ismini Lourentzou, Xu Cao, Meihuan Huang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Shenzhen Children's Hospital(深圳儿童医院) Peking University(北京大学) Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出ChildAgentEval,首个基于心理测量的交互式基准,用于评估基于多模态大语言模型的代理的认知年龄对齐,通过与年龄特定的人类发展阶段进行系统比较,揭示当前代理在模拟年龄特定认知行为方面的优劣。

详情
AI中文摘要

尽管代理AI及其核心多模态大语言模型(MLLMs)在语言和视觉推理方面展示了从日常生活到高级科学研究的广阔潜力,但人工与人类智能之间仍存在深刻差距。尽管集成了强大工具和先进MLLMs,最先进的AI代理经常在基础且看似简单的任务上失败,而儿童可以轻松解决。受韦氏儿童智力量表(WISC)启发,我们引入ChildAgentEval,首个心理测量学基础的交互式基准,用于评估基于MLLMs的代理的认知年龄对齐。ChildAgentEval系统地将各种基于MLLMs的交互代理的推理性能与年龄特定的人类发展阶段进行比较,揭示当前代理系统在模拟年龄特定认知行为方面的能力和局限性。

英文摘要

While agentic AI and its core multimodal large language models (MLLMs) have demonstrated remarkable promise in language and visual reasoning across domains ranging from daily life to advanced scientific research, a profound gap remains between artificial and human intelligence. Despite the integration of powerful tools and advanced MLLMs, state-of-the-art AI agents frequently fail at foundational, seemingly simple tasks that a child can resolve with ease. Inspired by the Wechsler Intelligence Scale for Children (WISC), we introduce ChildAgentEval, the first psychometrically grounded interactive benchmark for evaluating cognitive age alignment in MLLM-based agents. ChildAgentEval systematically compares the reasoning performance of various MLLM-based interactive agents against age-specific human developmental stages, exposing where current agentic AI systems can and cannot simulate age-specific cognitive behavior.

2605.17887 2026-05-19 cs.LG cs.AI 版本更新

Attention Sinks and Outliers in Attention Residuals

注意力沉底与注意力残差中的异常值

Haozheng Luo, Haoran Dai, Shaoyang Zhang, Xi Chen, Eric Hanchen Jiang, Yijiang Li, Jingyuan Huang, Chenghao Qiu, Chenwei Xu, Zhenyu Pan, Haotian Zhang, Binghui Wang, Yan Chen

发表机构 * Department of Computer Science, Northwestern University(西北大学计算机科学系) Department of Computer Science and Engineering, University of Michigan(密歇根大学计算机科学与工程系) Department of Statistics and Data Science, University of California Los Angeles(加州大学洛杉矶分校统计与数据科学系) Department of Electrical and Computer Engineering, University of California San Diego(加州圣地亚哥大学电气与计算机工程系) Department of Computer Science, Rutgers University-New Brunswick(新泽西州立大学鲁特学院计算机科学系) Department of Computer Science and Engineering, Texas A&M University(德克萨斯农工大学计算机科学与工程系) Department of Computer Science, Columbia University(哥伦比亚大学计算机科学系)

AI总结 本文提出OASIS技术,通过层间空信号来解决注意力残差架构中注意力沉底、激活异常值以及推理稳定性下降的问题,通过双归一化设计和实验验证提升了模型的结构鲁棒性和量化鲁棒性。

详情
AI中文摘要

我们提出OASIS,一种基于层间空信号的异常值和沉底感知技术。As AttnResidual架构引入了额外的深度归一化通道,它们提高了层间路由的灵活性,但也加剧了注意力沉底、激活异常值以及由此导致的推理稳定性和量化鲁棒性下降。OASIS通过引入基于Softmax1的空空间和通过层间空信号将token级的空证据耦合到深度路由中,从而减少由沉底主导的路由并提高结构鲁棒性。理论上,我们证明了AttnResidual的双归一化设计加剧了沉底形成和量化脆性。实验上,我们在三个真实世界数据集上将OASIS与五个基线进行比较,并观察到在注意力沉底和后量化性能方面有持续的改进。值得注意的是,OASIS在评估设置中实现了最大无穷范数平均减少9.26%、平均峰度减少2.60%,并在W8A8下将困惑度降低了75.85%,在W4A4下将GSM8K Pass@1提高了12.42%。

英文摘要

We propose OASIS, an outlier- and sink-aware technique built on inter-layer null signaling. As AttnResidual architectures introduce an additional depth-wise normalization channel, they improve inter-layer routing flexibility but also exacerbate attention sinks, activation outliers, and the resulting degradation in inference stability and quantization robustness. OASIS addresses this issue by introducing a Softmax1-based null space and coupling token-level null evidence to depth routing through an inter-layer null signal, thereby reducing sink-dominated routing and improving structural robustness. Theoretically, we show that the dual-normalization design of AttnResidual intensifies sink formation and quantization brittleness. Experimentally, we compare OASIS against five baselines on three real-world datasets and observe consistent improvements in both attention sink and post-quantization performance. Notably, OASIS achieves an average reduction of 9.26% in maximum infinity norm and 2.60% in average kurtosis across the evaluated settings, while lowering perplexity by 75.85% under W8A8 and improving GSM8K Pass@1 by 12.42% under W4A4.

2605.17885 2026-05-19 cs.CL cs.AI 版本更新

Multi-agent AI systems outperform human teams in creativity

多智能体AI系统在创造力上超越人类团队

Tiancheng Hu, Yixuan Jiang, Haotian Li, José Hernández-Orallo, Xing Xie, Nigel Collier, David Stillwell, Luning Sun

发表机构 * Microsoft Research Asia(微软亚洲研究院)

AI总结 研究探讨了多智能体AI系统在创造力任务中的表现,发现其在四个多样化问题解决任务中,比单智能体和人类团队更具创造力,核心方法是通过语义空间路径分析生成过程,主要贡献是揭示了AI和人类团队在创造力预测上的不同机制。

详情
AI中文摘要

尽管人工智能(AI)在众多认知任务上已匹配或超越人类表现,但创造力仍是一个极具争议的前沿。随着基于大语言模型(LLMs)的AI系统在研究和创新中被越来越多地采用,理解并增强其创造力变得至关重要。本文证明,多智能体LLM团队不仅超越了单个智能体,而且在4541个多智能体LLM想法和341个人类团队想法上,显著优于人类团队在创造力方面(Cohen's d=1.50)。这种优势由新颖性驱动,同时保持了相当的实用性。为了研究两组的生成过程,我们通过神经语言模型表示将对话表示为语义空间中的路径。LLM和人类团队在对话范围广泛而不是集中在单一主题(低全局一致性)时产生更多创造性想法。然而,预测创造力的额外模式不同:LLM团队受益于高效的探索(高语义扩展,较短路径),而人类团队受益于维持流畅的对话流程(高局部一致性,频繁转换)。此外,我们识别出模型选择和讨论结构作为正交的设计杠杆,共同解释了LLM对话动态中26.8%的方差,为系统开发具有增强创造力的多智能体系统铺平了道路。

英文摘要

Although artificial intelligence (AI) now matches or exceeds human performance across numerous cognitive tasks, creativity remains a highly contested frontier. As AI systems based on large language models (LLMs) are increasingly adopted in research and innovation, it is essential to understand and augment their creativity. Here we demonstrate that multi-agent LLM teams not only surpass single agents, but also substantially outperform human teams in creativity (Cohen's d=1.50) across 4,541 multi-agent LLM ideas and 341 human-team ideas on six diverse problem-solving tasks. This advantage is driven by novelty while maintaining comparable usefulness. To investigate the generative processes in both groups, we represent conversations as paths through semantic space using neural language model representations. Both LLM and human teams produce more creative ideas when conversations range widely rather than staying centered on a single theme (low global coherence). However, the additional patterns that predict creativity differ: LLM teams benefit from efficient exploration (high semantic spread, shorter paths), while human teams benefit from maintaining smooth conversational flow (high local coherence, frequent pivots). Additionally, we identify model choice and discussion structure as orthogonal design levers that together explain 26.8% of variance in LLM conversational dynamics, paving the way for systematic approaches to developing multi-agent systems with augmented creative capabilities.

2605.17879 2026-05-19 cs.DC cs.AI cs.LG 版本更新

Guard: Scalable Straggler Detection and Node Health Management for Large-Scale Training

Guard:用于大规模训练的可扩展的延迟检测和节点健康管理

Guanliang Liu, Abhinandan Patni, Congzhu Lin, Zoe Zeng, Jack Wittmayer, Josh Wu, Ashvin Nihalani, Binxuan Huang, Yinghong Liu, Rory Na, Anthony Ko, Alexander Zhipa, Cong Cheng, Mi Sun, Vijay Rajakumar, Rejith George Joseph, Parthasarathy Govindarajen

发表机构 * Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country(匿名机构,匿名城市,匿名地区,匿名国家)

AI总结 本文提出Guard系统,通过在线性能监控和离线节点扫描机制,有效检测训练中的延迟节点并确保节点健康,从而提升大规模训练的效率和稳定性。

Comments Proceedings of the 9 th MLSys Conference, Bellevue, WA, USA, 2026

详情
AI中文摘要

训练前沿规模的基础模型需要协调成千上万的GPU进行多月运行,其中即使微小的性能退化也会累积成显著的效率损失。现有健康检查机制,如NCCL测试或GPU烧录,主要关注功能正确性,往往无法检测到悄无声息降低系统性能的fail-slow行为。在本文中,我们提出了Guard,一个用于检测stragglers并确保大规模训练集群中节点健康的可扩展系统。Guard结合了训练期间的轻量级在线性能监控与一个离线节点扫描机制,系统地评估和认证节点在参与生产工作负载之前。这种设计使Guard能够检测到传统诊断无法捕捉的急性故障和长期运行的fail-slow行为。在大规模基础模型预训练工作负载上部署Guard,可将平均FLOPs利用率提高多达1.7倍,将运行到运行的训练步骤方差从20%降至1%,增加平均故障时间(MTTF),并显著减少操作和调试开销。这些结果表明,主动检测stragglers和系统化的节点认证对于维持稳定和高效的大型训练至关重要。

英文摘要

Training frontier-scale foundation models involves coordinating tens of thousands of GPUs over multi-month runs, where even minor performance degradations can accumulate into substantial efficiency losses. Existing health-check mechanisms, such as NCCL tests or GPU burn-in, primarily focus on functional correctness and often fail to detect fail-slow behaviors that silently degrade system performance. In this paper, we present Guard, a scalable system for detecting stragglers and ensuring node health in large-scale training clusters. Guard combines lightweight online performance monitoring during training with an offline node-sweep mechanism that systematically evaluates and qualifies nodes before they participate in production workloads. This design enables Guard to detect both acute failures and long-running fail-slow behaviors that traditional diagnostics cannot capture. Deployed on large-scale foundation model pretraining workloads, Guard improves mean FLOPs utilization by up to 1.7x, reduces run-to-run training step variance from 20% to 1%, increases mean time to failure (MTTF), and significantly reduces operational and debugging overhead. These results demonstrate that proactive straggler detection and systematic node qualification are critical for maintaining stable and efficient large-scale training.

2605.17877 2026-05-19 cs.AI 版本更新

PAIR: Prefix-Aware Internal Reward Model for Multi-Turn Agent Optimization

PAIR:面向多轮代理优化的前缀感知内部奖励模型

Wonjoong Kim, Yeonjun In, Sangwu Park, Dongha Lee, Chanyoung Park

发表机构 * KAIST(韩国科学技术院) Yonsei University(延世大学)

AI总结 本文提出PAIR模型,通过结合冻结的隐藏状态探针和轻量级注意力头部,解决多轮任务中内部正确性探针的可靠性问题,从而在不依赖外部模型调用或地面真实依赖的情况下,为GRPO训练提供密集的步骤级奖励信号。

Comments Under Review

详情
AI中文摘要

当前LLM在执行复杂多阶段任务方面面临重大挑战。组相对策略优化(GRPO)已成为主流选择,但其依赖稀疏结果奖励严重限制了中间步骤的信用分配。现有解决方案如运行完整回滚以分配步骤级优势、在每个步骤调用外部LLM评判者或计算内在奖励(需要每次评估都有地面真实答案)都引入了显著成本或实际限制。我们假设内部正确性探针可以重新利用LLM隐藏状态进行步骤级奖励信号,可能一次性解决所有这些限制。然而,现有探针研究假设输入干净,我们首先表明在多步骤设置中这一假设不成立:隐藏状态探针在前缀污染跟踪与可能损坏的前缀保持一致性时严重退化,而基于注意力的特征在污染下保持稳健但清洁前缀表现欠佳。基于这种互补关系,我们提出前缀感知内部奖励(PAIR),一种两阶段模型,包含冻结隐藏状态探针估计信念一致性以及轻量级注意力头部纠正其向地面正确性。实验结果表明,PAIR在受污染轨迹上实现了最高的AUROC,同时运行成本极低,能够在不依赖外部模型调用、地面真实依赖或完整轨迹回滚的情况下,为GRPO训练提供密集的步骤级奖励信号。

英文摘要

A significant hurdle for current LLMs is the execution of complex, multi-stage tasks. Group Relative Policy Optimization (GRPO) has been emerging as a leading choice, but its reliance on sparse outcome rewards severely limits credit assignment across intermediate steps. Existing remedies such as running full rollouts to assign step-level advantages, calling external LLM judges at each step, or computing intrinsic rewards that require ground-truth answers at every evaluation introduce significant costs or practical constraints. We hypothesize that internal correctness probing over LLM hidden states can be repurposed as a step-level reward signal, potentially addressing all of these limitations at once. However, existing probing research assumes clean inputs, and we first show that this assumption breaks down in multi-step settings: hidden-state probes degrade severely under prefix contamination tracking coherence with the (possibly corrupted) prefix rather than grounded correctness, while attention-based features remain robust to contamination but underperform on clean prefixes. Building on this complementary relationship, we propose the Prefix-Aware Internal Reward (PAIR), a two-stage model with a frozen hidden-state probe estimating belief-consistency and a lightweight attention-based head correcting it toward grounded correctness. Experimental results show that PAIR achieves the highest AUROC on contaminated trajectories while operating at negligible inference cost, enabling dense step-level reward signals for GRPO training without external model calls, ground-truth dependencies, or full-trajectory rollouts.

2605.17873 2026-05-19 cs.LG cs.AI cs.CL 版本更新

HINT-SD: Targeted Hindsight Self-Distillation for Long-Horizon Agents

HINT-SD:针对长 Horizon 智能体的定向 hindsight 自监督学习

Woongyeng Yeo, Yumin Choi, Taekyung Ki, Sung Ju Hwang

AI总结 本文提出 HINT-SD,一种针对长 Horizon 智能体的定向 hindsight 自监督学习框架,通过全轨迹 hindsight 选择失败相关的动作,并仅在目标动作跨度上应用反馈条件自监督学习,实验表明该方法在 BFCL v3 和 AppWorld 上比密集的每回合反馈基线提高了 18.80 个百分点,同时训练时间降低 2.26 倍。

详情
AI中文摘要

训练具有长 horizon 的 LLM 智能体进行强化学习具有挑战性,因为稀疏结果奖励只能表明任务是否成功,而不能指示哪些中间动作导致了结果或如何修正。最近的方法通过从回合级动作-输出信号生成奖励或文本提示,或通过反馈条件自监督学习来缓解这一问题。然而,当许多中间回合已经成功或中性时,在每个回合生成反馈效率低下,而固定或错位的反馈难以监督导致失败的动作。为此,我们提出了 HINT-SD,一种基于全轨迹 hindsight 的定向自监督学习框架,用于选择失败相关的动作,并仅在目标动作跨度上应用反馈条件自监督学习。在 BFCL v3 和 AppWorld 上的实验表明,我们的方法在比密集的每回合反馈基线提高 18.80 个百分点的同时,实现了 2.26 倍更低的训练时间,表明选择何时进行自监督学习是有效且高效的长 horizon 智能体训练的关键因素。

英文摘要

Training long-horizon LLM agents with reinforcement learning is challenging because sparse outcome rewards reveal whether a task succeeds, but not which intermediate actions caused the outcome or how they should be corrected. Recent methods alleviate this issue by generating rewards or textual hints from turn-level action-output signals, or by using feedback-conditioned self-distillation. However, generating feedback at every turn is inefficient when many intermediate turns are already successful or neutral, and applying feedback at a fixed or misaligned turn often fails to supervise the actions that contributed to the failure. To bridge this gap, we propose HINT-SD, a targeted self-distillation framework that uses full-trajectory hindsight to select failure-relevant actions and applies feedback-conditioned distillation only on targeted action spans. Experiments on BFCL v3 and AppWorld show that our method improves over the dense per-turn feedback baseline by up to 18.80 percent while achieving 2.26$\times$ lower time per training step, suggesting that selecting where to distill is a key factor for both effective and efficient long-horizon agent training.

2605.17862 2026-05-19 cs.LG cs.AI 版本更新

$\boldsymbol{f}$-OPD: Stabilizing Long-Horizon On-Policy Distillation with Freshness-Aware Control

f-OPD: 通过新鲜度感知控制稳定长周期在线策略蒸馏

Xianwei Chen, Shimin Zhang, Jibin Wu

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 本文提出f-OPD框架,通过引入样本级新鲜度评分来稳定长周期在线策略蒸馏,实现性能与效率的平衡,为大规模长周期智能体训练奠定基础。

详情
AI中文摘要

在大规模语言模型中扩展在线策略蒸馏(OPD)面临根本性矛盾:异步执行是系统效率的必要条件,但结构上偏离理想的在线策略目标。为解决这一挑战,我们理论上将目标偏差分解为回放漂移和监督漂移,分别捕捉学生回放和教师上下文的陈旧性。基于此,我们引入样本级新鲜度评分,量化缓冲样本相对于在线策略目标的可靠性。受此信号引导,我们进一步提出f-OPD,一种新颖的框架,能够自适应调节陈旧样本的影响并约束异步训练下累积的策略漂移。在推理、工具使用和编码代理任务中,f-OPD在增加交互周期时,始终能够实现与同步优化相当的任务性能,同时保留异步执行的吞吐量优势。我们的结果建立了OPD中实现性能-效率权衡的第一个配方,为大规模长周期智能体训练铺平道路。

英文摘要

Scaling on-policy distillation (OPD) for large language models (LLMs) confronts a fundamental tension: asynchronous execution is necessary for system efficiency, but structurally deviates from the ideal on-policy objective. To address this challenge, we theoretically decompose the objective discrepancy into rollout drift and supervision drift, capturing staleness in student rollout and teacher context, respectively. Building on this, we introduce a sample-level freshness score that quantifies the reliability of a buffered sample with respect to the on-policy objective. Guided by this signal, we further propose f-OPD, a novel framework that adaptively regulates stale-sample influence and constrains policy drift accumulated under asynchronous training. Across reasoning, tool-use, and coding-agent tasks of increasing interaction horizon, f-OPD consistently achieves task performance comparable to synchronous optimization while largely retaining the throughput advantages of asynchronous execution. Our results establish the first recipe for achieving a performance-efficiency trade-off in OPD, paving the way for long-horizon agentic post-training at scale.

2605.17860 2026-05-19 cs.CL cs.AI 版本更新

PAREDA: A Multi-Accent Speech Dataset of Natural Language Processing Research Discussions

PAREDA:自然语言处理研究讨论的多口音语音数据集

Sicheng Jin, Dipankar Srirag, Aditya Joshi

AI总结 本文提出PAREDA数据集,用于研究不同口音、自发性和领域特定语音的ASR性能,通过评估SOTA模型发现零样本设置下模型表现下降,但微调后显著降低WER,证明数据集捕捉了现有数据缺失的语言特征。

Comments Accepted and presented at SPEAKABLE 2026 workshop at LREC 2026

详情
AI中文摘要

尽管现代自动语音识别(ASR)系统在基准语料上实现高精度,但其性能在现实世界变化时往往下降。本文聚焦于因口音、自发性和领域特定语音引起的变异性。特别是,我们介绍了PAREDA数据集,这是首个多口音语音数据集,包含澳大利亚、印度英语和中文英语口音的学术自然语言处理(NLP)论文讨论。每个会话都会引发自发独白(一篇论文摘要的总结)和非独白(参与者之间的问答会话),从而产生一个充满技术术语和会话现象的语料库。我们评估了SOTA ASR模型在PAREDA上的性能,分析了口音混合和语音速度增加的影响。我们的结果表明,在零样本设置下,模型表现更差,证实了数据集的挑战性。然而,对PAREDA的微调显著降低了词错误率(WER),证明我们的数据集捕捉了现有语料中常缺失的语言特征。PAREDA为构建和评估更稳健和包容的ASR系统提供了宝贵的资源,用于专门的现实应用。

英文摘要

While modern Automatic Speech Recognition (ASR) systems achieve high accuracy on benchmark corpora, their performance often degrades when there is real-world variability. This work focuses on variability arising due to accented, spontaneous, and domain-specific speech. In particular, we introduce PAper REading DAtaset (PAREDA), a first-of-its-kind multi-accent speech dataset consisting of discussions on academic Natural Language Processing (NLP) papers between speakers with Australian, Indian-English, and Chinese English accents. Each session elicits a spontaneous monologue (a summary of a paper's abstract) and a non-monologue (a question-and-answer session between participants), resulting in a corpus rich with technical jargon and conversational phenomena. We evaluate the performance of SOTA ASR models on PAREDA, analysing the impact of accent mixing and increased speech rate. Our results show that, in the zero-shot setting, models perform worse, confirming the dataset's challenging nature. However, fine-tuning on PAREDA significantly reduces the Word Error Rate (WER), demonstrating that our dataset captures linguistic characteristics often missing from existing corpora. PAREDA serves as a valuable new resource for building and evaluating more robust and inclusive ASR systems for specialised, real-world applications.

2605.17856 2026-05-19 cs.AI 版本更新

KISS - Knowledge Infrastructure for Scientific Simulation: A Scaffolding for Agentic Earth Science

KISS - 地球科学的科学模拟知识基础设施:一种智能体的支架

Ziwei Li, Liujun Zhu, Yuchen Liu, Yichen Zhao, Birk Li, Ruiqi Wu, Junliang Jin, Jianyun Zhang

发表机构 * State Key Laboratory of Water Disaster Prevention, Hohai University, Nanjing(水利灾害预防国家重点实验室,河海大学,南京) Yangtze Institute for Conservation and Development, Hohai University, Nanjing(长江保护与发展研究院,河海大学,南京) Department of Bioresource Engineering, McGill University, Sainte-Anne-de-Bellevue, Quebec, Canada(生物资源工程系,麦吉尔大学,圣安妮-德-贝尔贝夫,魁北克,加拿大) Ottawa Research and Development Centre, Agriculture & Agri-Food Canada, Ottawa, Ontario, K1A 0C6, Canada(渥太华研发中心,加拿大农业与食品部,渥太华,安大略,K1A 0C6,加拿大) College of Water Conservancy and Hydropower Engineering, Hohai University, Nanjing(水利水电工程学院,河海大学,南京) Meta Platforms Inc.(Meta平台公司) Nanjing Hydraulic Research Institute, Nanjing(南京水利研究院)

AI总结 本文提出KISS,一种用于科学模拟的知识基础设施,通过将专业知识外化为经过验证的建模操作符、分阶段的领域协议和诊断恢复机制,使智能体能够生成物理合理且可验证的端到端模拟,从而降低非专业用户与过程模拟之间的接入门槛,并促进建模社区的整合。

详情
AI中文摘要

基于过程的模拟模型编码了数十年的地球科学领域科学理解,但最暴露于气候风险和资源稀缺的社区却最无法利用这些模型。本文介绍知识基础设施(KI),一种可被智能体执行的支架,将专业知识外化为经过验证的建模操作符、分阶段的领域协议和诊断恢复机制。在3000次耦合水文基准测试中,配备KI的智能体在84%的试验中生成了物理合理且可验证的端到端模拟,而未配备KI的智能体则停留在低于40%的水平。KI具有跨学科泛化能力。我们将其构建过程封装为知识解构工具包(KDT),该工具能够自主生成KI,使智能体能够执行117个额外的过程导向模型,覆盖14个地球科学领域。在所有119个KI中,建模决策和失败修复机制在不同底层物理基础上趋于一致,表明操作专业知识是结构化和可提取的,而非随意的。演示显示,配备KI的智能体降低了非专业用户与过程导向模拟之间的接入门槛,并降低了建模社区之间的整合门槛。通过这一支架,基于过程的科学可以作为可生长的科学公共领域发展,回应谁需要知道,且可由谁能够贡献来扩展。

英文摘要

Process-based simulation models encode decades of scientific understanding across the Earth sciences, yet the communities most exposed to climate risk and resource scarcity are the least able to use them. Here, we introduce knowledge infrastructure (KI), an agent-actionable scaffold that externalizes expertise into validated modelling operators, staged domain protocols, and diagnostic recovery mechanisms. Across a 3,000-trial coupled-hydrology benchmark, agents equipped with KI produced physically plausible, verifiable end-to-end simulations in up to 84% of trials, while agents without KI plateaued below 40%. KI generalizes across disciplines. We packaged its construction into a Knowledge Dissection Toolkit (KDT) that autonomously produced KI enabling end-to-end agent execution of 117 additional process-based models across 14 Earth-science domains. Across all 119 KIs, modelling decisions and failure remedies converged despite different underlying physics, showing that operational expertise is structured and extractable rather than ad hoc. Demonstrations show KI-equipped agents lowering both the access barrier between non-specialist users and process-based simulation, and the integration barrier between modelling communities. Through this scaffold, process-based science can then evolve as a living scientific commons, answerable to whoever needs to know and extendable by whoever can contribute.

2605.17849 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Generating Pretraining Tokens from Organic Data for Data-Bound Scaling

从有机数据生成预训练令牌以实现数据驱动的扩展

Zichun Yu, Chenyan Xiong

发表机构 * Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所)

AI总结 本文提出SynPro框架,通过重新表述和重新格式化操作,帮助大语言模型更充分地利用有限的有机数据,从而在数据驱动的预训练中实现更高效的扩展。

详情
AI中文摘要

LLM预训练正从计算驱动转向数据驱动的阶段,其中可用的人类(有机)文本远远无法满足扩展需求。然而,达到数据驱动阶段并不意味着模型已充分利用其有机语料库。在本文中,我们介绍了SynPro,一个合成数据生成框架,帮助LLM更深入地学习有限的有机数据。SynPro应用两种操作,即重新表述和重新格式化,以多样化的形式呈现相同的有机源,以促进更深层次的学习,而无需引入外部信息。两个生成器通过强化学习优化,使用质量、忠实度和数据影响奖励进行优化,并在预训练平台期持续更新,以针对模型尚未吸收的内容。我们使用DCLM-Baseline的10%最优令牌(0.8B和2.2B)预训练400M和1.1B模型,反映了前沿预训练中现实的数据驱动阶段。我们的结果表明,有机数据被标准重复方法显著低估:SynPro解锁了比重复方法多3.7-5.2倍的有效令牌,甚至在1.1B规模上超过了非数据驱动的Oracle,该Oracle在等效唯一数据上训练。分析证实,忠实、模型意识的合成可以在不导致分布崩溃的情况下实现数据驱动的扩展。我们开源代码在https://github.com/cxcscmu/SynPro。

英文摘要

LLM pretraining is shifting from a compute-bound to a data-bound regime, where available human (organic) text falls far short of scaling demands. However, reaching the data-bound regime does not mean the model has fully utilized its organic corpus. In this paper, we introduce SynPro, a synthetic data generation framework that helps LLMs more thoroughly learn from limited organic data. SynPro applies two operations, rephrasing and reformat, that present the same organic source in diverse forms to facilitate deeper learning without introducing external information. Both generators are optimized via reinforcement learning with quality, faithfulness, and data influence rewards, and are continuously updated as pretraining plateaus to target content the model has yet to absorb. We pretrain 400M and 1.1B models with 10% of their Chinchilla-optimal tokens (0.8B and 2.2B) from DCLM-Baseline, reflecting a realistic data-bound regime in frontier pretraining. Our results reveal that organic data is significantly underutilized by standard repetition: SynPro unlocks 3.7-5.2x the effective tokens of repetition, even surpassing the non-data-bound oracle that trains on equivalent unique data at the 1.1B scale. Analyses confirm that faithful, model-aware synthesis sustains data-bound scaling without causing distribution collapse. We open-source our code at https://github.com/cxcscmu/SynPro.

2605.17833 2026-05-19 cs.LG cs.AI 版本更新

Efficient Bilevel Optimization for Meta Label Correction in Noisy Label Learning

高效的元标签校正中的双层优化

Ba Hoang Anh Nguyen, Viet Cuong Ta

发表机构 * Human-Machine Interaction Laboratory, VNU University of Engineering and Technology(人机交互实验室,越南工程与技术大学)

AI总结 本文提出了一种高效的元标签校正方法EBOMLC,通过引入一步内循环更新、混合上界损失和对齐感知的动态障碍物,提高了元模型的训练效率和稳定性,实验表明其在高噪声环境下表现优异。

详情
AI中文摘要

训练深度神经网络时使用噪声标签可以降低数据标注成本,但可能会将噪声引入学习模型中。在元标签校正方法中,除了主模型外,还会训练一个额外的元模型,使用小规模干净数据集来校正大规模噪声数据集。然而,元模型的更新需要在主模型的内部步骤中计算超梯度,这会显著增加计算成本。为了提高训练效率,我们首先引入动态障碍梯度下降到标准元标签校正中。虽然这种直接扩展能够将训练过程的速度提高到大约一阶复杂度,但缺乏防止噪声信号泄漏到主模型和稳定元模型学习的机制。基于这一观察,我们提出了EBOMLC方法,其设计包含三个关键改进:一步内循环更新、混合上界损失和对齐感知的动态障碍物。在CIFAR-10和CIFAR-100上的实验结果表明,EBOMLC在高噪声率设置下优于其他基线方法,同时减少了元标签校正方法的训练时间。

英文摘要

Training a deep neural network with noisy labels could reduce data annotation cost but may introduce noise into the learned model. In meta label correction approaches, an additional meta model besides the main model is trained with a small, clean dataset to correct the large, noisy dataset. However, the update of the meta model requires the computation of hypergradients at the inner step of the main model which signif- icantly increases the computational cost. To improve the training efficiency, we first introduce the dynamic barrier gradient descent into standard meta label correction. While this naive extenstion is able to speed up the training process to approximately first- order complexity, it lacks mechanisms to prevent the leakage of noisy signals to the main model and to stabilize the learning of the meta model. Based on this observation, we propose the EBOMLC method, which is designed with three key improvements including one-step inner loop update, mixture upper loss and alignment- aware dynamic barrier. Empirical results on CIFAR-10 and CIFAR-100 demonstrate that EBOMLC consistently outperforms other baselines, especially under high noise rate settings, while reducing training time of the meta label correction approach.

2605.17830 2026-05-19 cs.AI cs.CL 版本更新

Remembering More, Risking More: Longitudinal Safety Risks in Memory-Equipped LLM Agents

记住更多,风险更多:具有记忆能力的LLM代理的纵向安全风险

Ahmad Al-Tawaha, Shangding Gu, Peizhi Niu, Ruoxi Jia, Ming Jin

发表机构 * Virginia Tech(弗吉尼亚理工大学) University of California, Berkeley(加州大学伯克利分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本研究探讨了具有记忆能力的LLM代理在长期任务中因记忆积累导致的安全风险,提出了一种触发-探测协议来评估记忆污染的影响,并发现记忆安全应被视为一个纵向属性而非单一状态属性。

详情
AI中文摘要

对具有记忆能力的LLM代理的安全评估通常测量单任务内的安全性:代理是否在对抗性条件下(如提示注入或记忆污染)安全地完成单一场景。然而,在部署中,一个代理会服务于许多独立任务,时间跨度较长,早期任务积累的记忆会影响后续无关任务的行为。研究这种情形需要在任务间的时间维度上进行评估:不是代理在任何单一记忆状态下的安全性,而是随着记忆在许多独立交互中积累,其安全性特征如何变化。我们称之为这种故障模式“时间记忆污染”。为了隔离记忆暴露与流非平稳性,我们引入了一种触发-探测协议,该协议通过固定探测集与不同前缀长度的只读记忆快照进行评估,并结合NullMemory反事实基线来识别由记忆引起的违规。我们将此协议应用于三个涵盖记录、备忘录、表单和电子邮件通信的部署场景,以及八种记忆架构,并进一步在Claw-like AI代理(如OpenClaw)上使用平台原生的记忆机制。具有记忆能力的代理在NullMemory基线上表现优异,记忆引起的违规率在两种代理类别中均表现出随暴露长度上升的稳健趋势。顺序随机化实验表明,该效应主要由积累内容而非接触顺序驱动。最后,事件分解的结构后果是记忆引起的风险在生成前的检索状态即可检测,我们通过高召回率的诊断监控器验证了这一点。我们的结果表明,应将记忆安全视为一个需要时间评估的纵向属性,而非可通过快照捕捉的单一状态属性。

英文摘要

Safety evaluations of memory-equipped LLM agents typically measure within-task safety: whether an agent completes a single scenario safely, often under adversarial conditions such as prompt injection or memory poisoning. In deployment, however, a single agent serves many independent tasks over a long horizon, and memory accumulated during earlier tasks can affect behavior on later, unrelated ones. Studying this regime requires evaluation along the temporal dimension across tasks: not whether an agent is safe at any single memory state, but how its safety profile changes as memory accumulates across many independent interactions. We call this failure mode temporal memory contamination. To isolate memory exposure from stream non-stationarity, we introduce a trigger-probe protocol that evaluates a fixed probe set against read-only memory snapshots at varying prefix lengths, together with a NullMemory counterfactual baseline for identifying memory-induced violations. We apply this protocol across three deployment scenarios spanning records, memos, forms, and email correspondence and eight memory architectures, and additionally on Claw-like AI agents, such as OpenClaw, using the platform's native memory mechanism. Memory-enabled agents consistently exceed the NullMemory baseline, and memory-induced violation rates show a robust upward trend with exposure length on both agent classes. Order-randomization experiments indicate that the effect is driven primarily by accumulated content rather than encounter order. Finally, a structural consequence of the event decomposition is that memory-induced risk is detectable from retrieval state before generation, which we confirm with a high-recall diagnostic monitor. Our results argue for treating memory safety as a longitudinal property that requires temporal evaluation, not a single-state property that can be captured by a snapshot.

2605.17829 2026-05-19 cs.AI 版本更新

Interactive Evaluation Requires a Design Science

交互评估需要一种设计科学

Keyang Xuan, Peiyang Song, Pan Lu, Pengrui Han, Wenkai Li, Zhenyu Zhang, Zexue He, Wenyue Hua, Manling Li, Jiaxuan You, Adrian Weller, Yizhong Wang, Jiaxin Pei

发表机构 * University of Texas Austin(德克萨斯大学奥斯汀分校) California Institute of Technology(加州理工学院) Carnegie Mellon University(卡内基梅隆大学) Stanford University(斯坦福大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Microsoft Research(微软研究院) Northwestern University(西北大学) University of Cambridge(剑桥大学)

AI总结 本文探讨了交互评估应被视为一种原则性的评估范式,而非仅仅是新的智能体基准。通过定义评估为证据到判断的自主映射,文章展示了交互评估如何改变这一映射的两方面,并提出双轴分类法,制定设计原则和报告标准,分析了长期评估挑战在轨迹层面的再现。

Comments 10 pages

详情
AI中文摘要

AI评估正经历结构性变革。大型语言模型(LLMs)越来越多地被部署为通过工具、环境、用户和其他智能体进行时间动作的系统,而许多评估实践仍继承自响应中心基准(例如固定输入、孤立输出和单个响应可做出的判断)。该领域开始构建交互基准,但所形成的景观却碎片化:基准在允许的交互制品、轨迹评分方式以及所支持的主张上各不相同。本文主张交互评估应被视为一种原则性的评估范式,而非仅仅是新的智能体基准。单纯采用以往的评估范式并不足够。我们定义评估为证据到判断的自主映射,并展示交互评估改变了这一映射的两方面:证据变为由交互生成的轨迹,而评估过程必须评估过程、可恢复性、协调性、鲁棒性和系统级性能。基于此定义,我们提出双轴分类法,推导设计原则和报告标准,分析代表性场景,并探讨长期评估挑战在轨迹层面的再现。

英文摘要

AI evaluation is undergoing a structural change. Large language models (LLMs) are increasingly deployed as systems that act over time through tools, environments, users, and other agents, while many evaluation practices still inherit assumptions from response-centered benchmarks (e.g., fixed inputs, isolated outputs, and outcome judgments that can be made from a single response). The field has begun to build interactive benchmarks, but the resulting landscape is fragmented: benchmarks differ in what interaction artifacts they admit, how trajectories are scored, and what claims their results support. This position paper argues that interactive evaluation should be treated as a principled evaluation paradigm, not merely a new family of agent benchmarks. Simply adopting previous evaluation paradigms does not suffice. We define evaluation as an autonomous mapping from evidence to judgments, and show that interactive evaluation changes both sides of this mapping: the evidence becomes interaction-generated trajectories, while the evaluation procedure must assess process, recoverability, coordination, robustness, and system-level performance. Building on this definition, we propose a two-axis taxonomy, derive design principles and reporting standards, examine representative scenarios, and analyze how longstanding evaluation challenges reappear at the trajectory level.

2605.17827 2026-05-19 cs.LG cs.AI 版本更新

Content-Style Identification via Differential Independence

通过微分独立性进行内容-风格识别

Subash Timilsina, Hoang-Son Nguyen, Sagar Shrestha, Xiao Fu

发表机构 * School of Electrical Engineering and Computer Science, Oregon State University, Corvallis, Oregon, USA(电气工程与计算机科学学院,俄勒冈州立大学,科瓦利斯,俄勒冈,美国)

AI总结 本文提出了一种新的结构条件,即内容-风格微分独立性(CSDI),用于在内容和风格可能依赖的情况下实现生成分析中的可识别性,通过在雅可比子空间上施加块状正交约束,并设计了基于数值雅可比近似的随机正则化器以支持高维生成模型。

Comments 24 pages, 15 figures, ICML 2026

详情
AI中文摘要

生成分析经常将多领域观察建模为领域不变内容变量和领域特定风格变量的非线性混合。从不成对的领域中识别这两种因素可以实现域迁移和反事实数据生成等任务。先前的工作在内容和风格之间(块状)统计独立性或通过非线性混合函数的稀疏雅可比假设下建立了可识别性,但这些条件在实践中可能过于严格。在本文中,我们引入了内容-风格微分独立性(CSDI),一种替代的结构条件,要求内容和风格的微小变化在数据流形上诱导正交方向,从而在内容和风格依赖且雅可比密集时也能实现可识别性。我们通过在内容和风格相关的雅可比子空间上施加块状正交约束来操作化这一条件。为了支持高维生成模型,我们设计了一个基于数值雅可比近似的随机正则化器,从而在如高分辨率图像生成等设置中实现可扩展训练。在多个数据集上的实验验证了可识别性分析,并展示了反事实生成和域迁移的实用优势。

英文摘要

Generative analysis often models multi-domain observations as nonlinear mixtures of domain-invariant content variables and domain-specific style variables. Identifying both factors from unpaired domains enables tasks such as domain transfer and counterfactual data generation. Prior work establishes identifiability under (block-wise) statistical independence between content and style, or via sparse Jacobian assumptions on the nonlinear mixing function, but such conditions can be restrictive in practice. In this work, we introduce content-style differential independence (CSDI), an alternative structural condition requiring that infinitesimal variations in content and style induce orthogonal directions on the data manifold, thereby enabling identifiability even when content and style are dependent and the Jacobian is dense. We operationalize this condition through a blockwise orthogonality constraint on the Jacobian subspaces associated with content and style. To support high-dimensional generative models, we design a stochastic regularizer based on numerical Jacobian approximation, enabling scalable training in settings such as high-resolution image generation. Experiments across multiple datasets corroborate the identifiability analysis and demonstrate practical benefits on counterfactual generation and domain translation.

2605.17826 2026-05-19 cs.CV cs.AI 版本更新

CounterCount: A Diagnostic Framework for Counting Bias in Vision Language Models

CounterCount: 一种用于视觉语言模型计数偏差诊断的框架

Reem Alzahrani, Hassan Alshanqiti, Bushra Bin Hemid, Zaid Alyafeai, Abdelrahman Eldesokey, Bernard Ghanem

发表机构 * KAUST(卡尔斯鲁德大学) University of Edinburgh(爱丁堡大学) King Abdullah University of Science and Technology(国王阿卜杜勒-阿齐兹大学)

AI总结 本文提出CounterCount框架,通过对比事实性与反事实性图像来诊断视觉语言模型在计数任务中的偏差问题,揭示模型对物体级先验知识的依赖,并提出统一的注意力调节策略提升反事实计数准确性。

详情
AI中文摘要

视觉语言模型(VLMs)在多模态推理方面表现出色,但尚不清楚其答案是基于视觉证据还是由学习的语言和世界先验知识驱动。计数提供了一个精确的测试环境:当视觉证据与常识物体知识冲突时,模型必须依赖图像而非典型计数。我们引入CounterCount,一种用于VLMs的反事实计数诊断框架,包含配对的事实性和反事实性图像、编辑过的计数相关属性、验证答案和局部化证据注释。评估最近的VLMs,我们发现其在事实性图像上表现强劲,但在反事实属性变化下持续退化,表明即使存在矛盾的视觉证据,模型仍依赖物体级先验知识。利用局部化注释,我们发现这些失败不仅由于缺失或模糊的视觉证据,而是由于模型对计数相关视觉token的注意力权重不足。我们引入一种统一的推理时间注意力调节策略,重新加权所选的视觉token,使多个VLMs的反事实计数准确率提高高达8%。总体而言,CounterCount揭示了先验驱动的计数失败,并为设计未来的VLMs提供了诊断见解。

英文摘要

Vision-Language Models (VLMs) excel at multimodal reasoning, yet it remains unclear whether their answers are grounded in visual evidence or driven by learned language and world priors. Counting provides a precise testbed: when visual evidence conflicts with canonical object knowledge, a model must rely on the image rather than a prototypical count. We introduce CounterCount, a diagnostic framework for counterfactual counting in VLMs, consisting of paired factual and counterfactual images with edited count-relevant attributes, verified answers, and localized evidence annotations. Evaluating recent VLMs, we find strong performance on factual images but consistent degradation under counterfactual attribute changes, indicating reliance on object-level priors even when contradictory visual evidence is present. Using localized annotations, we show that these failures are not solely due to missing or ambiguous visual evidence, but to models underweighting attention to count-relevant visual tokens. We introduce a unified inference-time attention modulation strategy that reweights selected visual tokens, improving counterfactual counting accuracy by up to 8% across multiple VLMs. Overall, CounterCount exposes prior-driven counting failures and provides diagnostic insights for designing future VLMs.

2605.17823 2026-05-19 cs.CV cs.AI 版本更新

Why We Look Where We Look: Emergent Human-like Fixations of a Foveated Visual Language Model Maximizing Scene Understanding

为什么我们看那里:一种最大化场景理解的视网膜视觉语言模型表现出的人类样注视模式

Shravan Murlidaran, Ziqi Wen, Sana Shehabi, Miguel P. Eckstein

发表机构 * Psychological & Brain Sciences, University of California, Santa Barbara(加州大学圣芭芭拉分校心理学与脑科学系) Electrical and Computer Engineering, University of California, Santa Barbara(加州大学圣芭芭拉分校电气与计算机工程系) Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 研究探讨了人类自由观看时注视模式的形成机制,发现最大化场景理解的视网膜视觉语言模型能够产生类似人类的注视模式,表明这种模式可能是优化场景理解的副产品。

详情
AI中文摘要

当人类在没有特定任务的情况下观察场景(自由观看)时,他们最初会将眼动定向到场景中心,然后注视人物、文本、被注视或抓取的物体以及具有语义意义的区域。这些标志性注视模式所反映的内容以及它们是否优化了底层感知任务仍不清楚。我们显示,一个具有模拟视网膜视觉的计算代理,经过训练以优化场景理解,会表现出人类样的注视模式。相比之下,经过训练以搜索或分类场景的代理版本,或配备比人类更好的或更差的周边视觉的版本,预测人类注视模式的准确性较低。因此,人类自由观看的注视模式可能是在生物视网膜视觉约束下优化场景理解的副产品。

英文摘要

When humans view scenes without a specific task (free-viewing), they initially direct their eye movements toward the scene center and then fixate on people, text, objects being gazed at or grasped, and semantically meaningful regions. What these signature fixation patterns reflect and whether they optimize an underlying perceptual task remain unknown. We show that a computational agent with simulated foveation, trained to optimize scene comprehension, exhibits emergent human fixation signature patterns. In contrast, versions of the agent trained to search or classify scenes, or equipped with peripheral vision that was better or worse than human vision, predicted human fixation patterns less accurately. Thus, human free-viewing fixation patterns may emerge as a functional byproduct of optimizing scene comprehension under the biological constraints of foveated vision.

2605.17821 2026-05-19 cs.DC cs.AI 版本更新

TierCheck: Tiered Checkpointing for Fault Tolerance in Large Language Model Training

TierCheck: 用于大语言模型训练故障容错的分层检查点系统

Shujie Han, Feng Jiang, Patrick P. C. Lee, Xiao Zhang, Zhijie Huang, Nannan Zhao, Xiaonan Zhao, Lichen Pan

发表机构 * Northwestern Polytechnical University(西北工业大学) The Chinese University of Hong Kong(香港中文大学) National University of Defense Technology(国防科技大学)

AI总结 本文提出TierCheck,一种基于集群意识的分层检查点系统,通过将存储位置与故障异质性对齐,实现轻量级差异检查点在本地和对等内存中的快速本地恢复,同时异步迁移重型基础检查点到远程持久化存储,从而在低开销持久性和快速恢复之间取得最佳平衡。

详情
AI中文摘要

大语言模型(LLM)训练经常受到异构故障谱的中断,从常见的GPU崩溃到灾难性的集群级故障。现有检查点系统依赖于单一层次的存储后端,迫使在状态保存开销和恢复速度之间做出权衡。我们提出TierCheck,一种集群感知的分层检查点系统,通过将存储位置与故障异质性对齐。TierCheck采用三级设计,保持轻量级差异检查点在本地和对等内存中以实现快速本地恢复,同时异步迁移重型基础检查点到远程持久化存储。它还确保严格跨层次的全局一致性,而不会停滞训练,并在恢复期间实现快速的集群感知检查点恢复。在400亿参数模型上的评估显示,TierCheck实现了低训练开销,将端到端检查点时间减少到10秒以下,并支持高频检查点,最终在低开销持久性和快速恢复之间取得最佳平衡。

英文摘要

Large Language Model (LLM) training is frequently interrupted by a heterogeneous spectrum of failures, from common GPU crashes to catastrophic cluster-wide outages. Existing checkpointing systems rely on monolithic, single-tier storage backend, forcing a trade-off between state-saving overhead and recovery speed. We propose TierCheck, a cluster-aware tiered checkpointing system that aligns storage placement with failure heterogeneity. TierCheck adopts a three-tier design that maintains lightweight differential checkpoints in local and peer memory for fast localized recovery, while asynchronously migrating heavyweight base checkpoints to remote persistent storage. It also ensures strict global consistency across tiers without stalling training, and achieves fast cluster-aware checkpoint restoration during recovery. Evaluations on models up to 40 billion parameters show that TierCheck achieves low training overhead, reduces end-to-end checkpointing time to under 10s, and supports high-frequency checkpointing, ultimately striking an optimal balance between low-overhead persistence and fast recovery.

2605.17815 2026-05-19 cs.RO cs.AI 版本更新

Virtues of Ordered Chaos: Planning with Topple Actions in Tabletop Stack Rearrangement

秩序之中的混沌:在桌面堆叠重构中使用Topple动作的规划

Hao Lu, Rahul Shome

发表机构 * School of Computing at the Australian National University(澳大利亚国立大学计算学院)

AI总结 本文研究了桌面环境中堆叠重构任务,通过引入更丰富的非抓取聚合动作(特别是从堆叠中倒落物体到桌面的Topple动作)来增强任务规划领域。核心方法是提出一种新的Topple聚合工具,将候选任务计划计算转化为 Pebble Motion 问题变体,从而在IsaacSim物理模拟中验证了其效果,展示了在执行速度上的显著优势。

Comments 8 pages, 7 figures

详情
AI中文摘要

高效的物体操作策略对自动化应用有重大影响。本文研究了桌面环境中的堆叠重构任务,重点是通过引入更丰富的非抓取聚合动作(特别是从堆叠中倒落物体到桌面的Topple动作)来增强任务规划领域。Topple可以压缩长序列的中间搬运动作。计算的计划需要根据问题在其中交错执行抓取和放置动作与Topple动作。为了生成任务计划并建模一个抽象来计算包含抓取和Topple动作的解决方案,引入了一种新的Topple聚合工具。使用这种有向图抽象,候选任务计划计算成为Pebble Motion问题的变种,将物体视为石子。然后在基于IsaacSim的物理模拟中报告了基准测试。结果突显了仅使用抓取和放置动作相比,在执行速度上的明显优势。尽管本文主要研究Topple动作,但证明了类似的抽象可以建模其他感兴趣的聚合动作,如Scoop。本文的工作为丰富物体交互的操纵应用提供了初步但有力的证据,表明抽象在其中的潜在好处。

英文摘要

Efficient object manipulation strategies have significant impact in automation applications. In this work, the stack rearrangement in tabletop settings is studied, with a focus on augmenting the task planning domain with richer nonprehensile aggregating actions, in particular the toppling of objects from a stack to the table. Toppling can compress long sequences of intermediate relocations. Computed plans need to interleave pick-and-place actions with topple throughout its plan based on the problem. In order to generate the task plan and model an abstraction to compute solutions that include both pick-and-place and topple actions, a novel aggregating gadget for topple is introduced. Using this directed graphical abstraction, candidate task plan computation becomes a variant of the pebble motion problem, treating objects as pebbles. Benchmarks are then reported in a IsaacSim-based physics simulation. Results highlight clear benefits of achieving faster execution than solely using pick-and-place actions. Though this work primarily investigates the topple action, we demonstrate that similar abstractions can model other aggregating actions of interest, like scoop. The current work provides a preliminary, strong indication of the promising benefits of abstractions for rich object interactions in manipulation applications.

2605.17812 2026-05-19 cs.AI 版本更新

Going Headless? On the Boundaries of Vertical AI Firms

going headless?关于垂直AI企业的边界

Muhammad Zia Hydari, Farooq Muzaffar

发表机构 * University of Pittsburgh(匹兹堡大学)

AI总结 本文探讨了垂直AI企业在会计、法律、医疗、采购等领域中,将工作流、领域逻辑和责任整合到单一应用中的传统模式,以及通用AI代理如何解构这种模式,促使企业采取"going headless"策略。文章指出,这种策略对某些企业有益,对另一些企业则可能造成破坏,并提出了基于任务-责任制度的三类分类体系及规则债务的概念。

详情
AI中文摘要

垂直AI企业在会计、法律、医疗、采购等领域历史上将工作流、领域逻辑和责任整合到单一应用中。通用AI代理现在正在解构这种整合,促使创始人和投资者倡导"going headless":将工作流和界面交给代理,并将领域专业知识作为可调用的服务暴露出来。本文认为,对于某些企业来说,going headless是正确的,而对于另一些企业则可能是破坏性的,后者往往通过看似界面决策的架构选择无意中放弃了其价值捕获。这是一个边界问题,答案取决于区分接口边界(通常可以移动)和责任边界(通常不能移动)。基于科斯的企业理论、埃森曼、帕克和范阿尔斯特恩的平台包容框架,以及蒂茨对互补资产和可获取性的分析,本文表明,通过开放协议运营的协调者即使在技术互操作性提高的情况下仍能获得包容权力,并且持久的价值捕获集中在专业签发、受监管的工作流、证据轨迹和受信任的记录系统中。本文提出了一种三类分类体系(组件、集成软件平台、双轨),该分类不是基于行业而是基于任务-责任制度,并正式化了规则债务的概念:当业务规则和专业标准从受控系统迁移到提示和代理指令时,客户组织将承担未来治理、维护和责任负担。随后有四项原则:按责任而非界面分解,翻转边缘同时保留核心,将规则债务作为集成平台防止的客户成本,避免单一协调者依赖。

英文摘要

Vertical AI firms in accounting, law, healthcare, procurement, and similar domains historically bundled workflow, domain logic, and accountability into a single application. General-purpose AI agents are now unbundling that package, prompting founders and investors to advocate "going headless": cede the workflow and interface to agents and expose domain expertise as callable services. This article argues that going headless is correct for some firms and destructive for others, and that the latter often cede their value capture inadvertently through architectural choices that look like interface decisions. This is a boundary question, and the answer turns on distinguishing the interface boundary, which can often move, from the accountability boundary, which often must not. Drawing on Coase's theory of the firm, Eisenmann, Parker, and Van Alstyne's platform envelopment framework, and Teece's analysis of complementary assets and appropriability, the article shows that orchestrators operating through open protocols acquire envelopment power even as technical interoperability improves, and that durable value capture concentrates in cospecialized accountability assets: professional signoff, regulated workflows, evidence trails, and trusted systems of record. The article proposes a three-position taxonomy (component, integrated software platform, dual-track) determined not by sector but by task-accountability regime, and formalizes the construct of rule debt: the future governance, maintenance, and accountability burden that accrues to customer organizations when business rules and professional standards migrate from governed systems into prompts and agent instructions. Four principles follow: decompose by accountability not interface, invert the edges while retaining the core, position rule debt as the customer cost the integrated platform prevents, and avoid single-orchestrator dependence.

2605.17811 2026-05-19 cs.LG cs.AI math.OC 版本更新

One Model, Two Roles: Emergent Specialization in a Shared Recurrent Transformer

一个模型,两种角色:共享递归变压器中的涌现专业化

Jucheng Shen, Barbara Su, Anastasios Kyrillidis

发表机构 * Rice University(里士大学)

AI总结 该研究探讨了共享权重的递归变压器是否能在未被分割成独立模块的情况下发展出不同的内部角色,通过不对称输入递归(AIR)架构发现,模型内部状态分化出不同的功能角色,并展示了这种分化与模型状态动态的关系。

Comments 21 pages, 13 figures, 8 tables

详情
AI中文摘要

可以一个共享权重的递归变压器在未被分割成独立模块的情况下发展出不同的内部角色吗?我们研究了不对称输入递归(AIR),这是一种最小的两状态推理架构,在其中相同的Transformer模型被重复用于更新(根据文献,L和H),唯一的更新规则差异是编码输入在L更新中被注入但在H更新中不被注入。在Sudoku-Extreme和Maze中,解码的rollouts揭示出一致的分裂:$\zH$表现得像一个完全承诺的提案状态,而$\zL$保留局部不确定性和移动的中间结构。冻结实验显示,这种分裂实际上与模型的状态动态有关:在Sudoku中,冻结$\zH$会减少$\zL$的内容变化,而冻结$\zL$会增加$\zH$的内容变化;而在Maze中,冻结任一状态会增加另一个状态的内容变化。消融实验显示,为了诱导专业化,共享模型需要能够区分两种更新类型,要么通过输入注入的不对称性,要么通过一个单独的层级标记。机理上,注意力分析显示在Sudoku和Maze中,L更新始终比H更新更局部。这些结果表明,在两状态递归设置中,清晰的状态身份信号可以诱导共享参数递归变压器内部稳定的、相关的功能角色。代码可在https://github.com/juchengshen/air获得。

英文摘要

Can a shared-weight recurrent Transformer develop distinct internal roles without being partitioned into separate modules? We study this in Asymmetric Input Recurrence (AIR), a minimal two-state reasoning architecture in which the same Transformer model is reused for both updates (per literature, L and H) and the only built-in difference in the update rule is that the encoded input is injected during L-updates but not H-updates. Across Sudoku-Extreme and Maze, decoded rollouts reveal a consistent split: $\zH$ behaves like a fully committed proposal state, whereas $\zL$ retains local uncertainty and shifting intermediate structure. Freeze experiments show that this split is, in practice, related to the model's state dynamics: in Sudoku, freezing $\zH$ reduces $\zL$'s content changes whereas freezing $\zL$ increases $\zH$'s, while in Maze, freezing either state increases content changes in the other state. Ablations show that to induce specialization, the shared model needs to be able to tell the two update types apart, either from input injection asymmetry or from a separate level token. Mechanistically, attention analysis shows that L-updates are consistently more local than H-updates in both Sudoku and Maze. Together, these results show that, in a two-state recurrent setting, a clear state-identity signal can induce stable, related functional roles inside a shared-parameter recurrent Transformer. Code is available at \href{https://github.com/juchengshen/air}{\textcolor{blue}{https://github.com/juchengshen/air}}.

2605.17807 2026-05-19 cs.CV cs.AI 版本更新

Curriculum Group Policy Optimization: Adaptive Sampling for Unleashing the Potential of Text-to-Image Generation

课程组策略优化:适应性采样以释放文本到图像生成的潜力

Baoteng Li, Xianghao Zang, Xinran Wang, Xiangyu Na, Zhixiang He, Hao Sun, Chi Zhang, Zhongjiang He, Tianwei Cao, Kongming Liang, Zhanyu Ma

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Institute of Artificial Intelligence (TeleAI), China Telecom(中国电信人工智能研究院) Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance(北京多模态数据智能感知与治理重点实验室)

AI总结 本文提出了一种适应性课程训练框架CGPO,通过动态调整采样策略来提高文本到图像生成的训练效率,同时解决多类别数据集中的数据不平衡问题。

详情
AI中文摘要

文本到图像(T2I)生成在近年来取得了显著进展。同时,基于组相对策略优化(GRPO)的强化学习方法引起了广泛关注,并已成功应用于T2I任务。然而,训练过程中常用的均匀采样策略往往忽略了样本难度与模型当前学习能力之间的匹配,导致训练效率低下。我们主张,提高训练效率需要持续优先选择与模型 evolving 能力匹配且仍能主动学习的提示。为此,我们提出了课程组策略优化(CGPO),一种适应性课程训练框架。在训练过程中,每个提示生成一组由奖励模型评分的图像。我们使用组奖励的方差作为在线代理来衡量提示的一致性。较高的方差表明模型部分捕捉了提示要求,但尚未达到稳定的掌握。此类提示更可能提供有用的训练信号,因此相应增加其采样概率。此外,为了解决多类别数据集中的数据不平衡问题,我们设计了一种基于比例公平优化的类别校准方法,以平衡各类别之间的训练难度。在GenEval、T2I-CompBench++和DPG Bench上的实验表明,我们的框架有效提高了生成性能。

英文摘要

Text-to-Image (T2I) generation has achieved remarkable progress in recent years. Meanwhile, reinforcement learning methods, particularly those based on Group Relative Policy Optimization (GRPO), have attracted widespread attention and been successfully applied to T2I tasks. However, the uniform sampling strategy commonly used during training often ignores the match between sample difficulty and the model's current learning capability, leading to low training efficiency. We argue that improving training efficiency requires continuously prioritizing prompts that match the model's evolving capability and remain actively learnable. To this end, we propose Curriculum Group Policy Optimization (CGPO), an adaptive curriculum training framework. During training, each prompt produces a group of images scored by a reward model. We use the variance of group rewards as an online proxy for prompt inconsistency. A higher variance suggests that the model has partially captured the prompt requirements but has not yet achieved stable mastery. Such prompts are more likely to provide useful learning signals, so we increase their sampling probabilities accordingly. Additionally, to address data imbalance in multi-category datasets, we design a category calibration method based on proportional fairness optimization, which balances training difficulty across categories. Experiments on GenEval, T2I-CompBench++, and DPG Bench demonstrate that our framework effectively improves generation performance.

2605.17800 2026-05-19 cs.RO cs.AI 版本更新

Optimal Knock-Pick Planning for Tightly Packed Tabletop Blocks With Parallel Grippers

紧密排列桌面积木的最优敲击抓取规划

Hao Lu, Rahul Shome

发表机构 * School of Computing(计算学院) Australian National University(澳大利亚国立大学)

AI总结 研究在平行夹具无法在物体周围获得足够空隙时,如何通过引入方向性敲击原语来优化敲击抓取策略,以减少动作数量。

Comments Accepted by WAFR 2026, 18 pages, 6 figures

详情
AI中文摘要

在平行夹具无法在物体周围获得足够空隙时,重新排列紧密堆积的桌面物体具有挑战性。本文研究了在实际应用中,均匀大小的积木放置在平面桌面网格位置时的问题特性。由于纯粹的抓取移除可能不可行,因此引入了方向性敲击原语,并将该问题的最优敲击抓取变体进行了建模。本文提出了一系列抽象,其中通过覆盖最小约束装置来识别必要的敲击。利用图抽象上的最大权重完美匹配,可以高效地在多项式时间内计算最优计划,以最小化动作数量。在合成环境以及IsaacSim中报告了随着网格大小增加的实验结果。理论观察为构建高效操作策略提供了有前途的基石,这些策略可以交错抓取和非抓取动作。

英文摘要

Rearranging densely packed tabletop objects is challenging when parallel-gripper picks are infeasible without sufficient clearance around an object. This work studies the problem characteristics for practically motivated settings with uniformly sized blocks placed at planar tabletop grid locations. Since purely prehensile removal can become infeasible, a directional knock primitive is therefore introduced and the optimal knock-pick variant of the problem is formulated. The work proposes a series of abstractions wherein minimal constraining gadgets are covered to identify the necessary knocks. Utilizing a maximum-weight perfect matching on a graphical abstraction yields efficient polynomial-time computation of the optimal plan that minimizes the number of actions. Experiments are reported for increasing grid sizes in synthetic settings as well as in IsaacSim. The theoretical observations provide a promising stepping stone towards rigorously building efficient manipulation strategies that interleave prehensile and non-prehensile actions.

2605.17790 2026-05-19 cs.AI 版本更新

STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

STRIDE:一种用于可靠自动方程发现的自反思代理框架

Jiarui Su, Songjun Tu, Bei Sun, Xiaojun Liang

发表机构 * Central South University(中南大学) Pengcheng Laboratory(鹏城实验室) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 本文提出STRIDE框架,通过协调数据感知生成、混合拟合评估、批评-执行器修复和多样性保持语义记忆,提升自动方程发现的可靠性,实验表明其在多个LLM基础上提升了准确性、OOD鲁棒性和结构恢复能力。

Comments 23 pages, 15 figures

详情
AI中文摘要

基于LLM的方程发现为从数据中恢复符号定律提供了有前途的途径,但许多系统仍依赖于以生成为中心的循环,提出候选者、拟合参数、评分结果并重用选定的例子。此类循环在不可靠的拟合下可能误判有用的骨架,丢弃需要修复的近正确方程,并积累冗余记忆提供有限的指导。我们提出了STRIDE,一种自反思代理框架,通过协调数据感知生成、混合拟合评估、批评-执行器修复和多样性保持语义记忆来提高可靠性。通过将拟合分数和候选行为转化为共享反馈,STRIDE使方程能够在闭环发现过程中被提出、评估、细化和重用。在具有代表性的符号回归基准和LSR-Synth套件上的实验表明,STRIDE在多个LLM基础上提高了准确性、OOD鲁棒性和结构恢复能力,消融分析和分析确认了其核心组件的贡献。

英文摘要

LLM-based equation discovery offers a promising route to recovering symbolic laws from data, but many systems still rely on generation-centered loops that propose candidates, fit parameters, score results, and reuse selected examples. Such loops can misjudge useful skeletons under unreliable fitting, discard near-correct equations that require repair, and accumulate redundant memories that provide limited guidance. We propose STRIDE, a self-reflective agent framework that improves reliability by coordinating data-aware generation, mixed-fitting evaluation, critic--executor repair, and diversity-preserving semantic memory. By turning fitted scores and candidate behavior into shared feedback, STRIDE enables equations to be proposed, assessed, refined, and reused within a closed-loop discovery process. Experiments on representative symbolic-regression benchmarks and LSR-Synth suites show that STRIDE improves accuracy, OOD robustness, and structural recovery across multiple LLM backbones, with ablations and analyses confirming the contribution of its core components.

2605.17789 2026-05-19 cs.CL cs.AI 版本更新

SocialMemBench: Are AI Memory Systems Ready for Social Group Settings?

SocialMemBench: AI记忆系统是否准备好应对社交群体环境?

Olukunle Owolabi

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出SocialMemBench,一个针对多党社交群体的AI记忆系统评估基准,通过人类验证的合成社交网络,测试记忆系统在处理共享历史、群体规范和成员退出等复杂社交场景中的能力。

详情
AI中文摘要

为单用户对话设计的AI记忆系统在应用于多党社交群体环境时会表现出典型故障。这一差距对当今构建的社会助手尤为重要:嵌入聊天平台的群体作用代理,以及需要全面用户模型的主动个人助理代理。现有记忆基准评估的是二元或职场对话;没有针对多党社交群体,其中记忆必须将事实锚定在共享历史而非职业角色,区分群体规范与个体例外,并在成员退出后正确归因。我们引入SocialMemBench,一个涵盖五个典型(亲密朋友、家庭、娱乐、兴趣社区、熟人网络)和三个群体规模层级(4-30成员)的人类验证合成社交群体网络的基准,包含430个角色和7,355次对话轮次,产生1,031个问题-答案对,覆盖九个问题类别。每个类别隔离一种架构能力,五个失败模式(单流融合、时间状态覆盖、大规模实体合并、缺失跨角色知识、规范-个体融合)是可测试的假设;我们的两项研究探针Subject-Mem和SMG提供了证据,其余三个仍待解决。在所有43个网络中,评估的四个开源记忆框架(Mem0、LangMem、Graphiti、Cognee)在问题加权范围内聚集在0.12-0.18,95%置信区间重叠,远低于未压缩检索参考0.345和匹配回答者完整上下文参考0.369(GPT-4o-mini)。当前的记忆系统显示出可测量的差距。

英文摘要

Memory systems for AI assistants were built for single-user dialogue and fail characteristically when applied to multi-party social group settings. This gap matters for the social assistants being built today: group-acting agents embedded in chat platforms, and proactive personal-assistant agents whose holistic model of a user must include their social context. Existing memory benchmarks evaluate dyadic or workplace dialogue; none targets multi-party social groups, where memory must anchor facts in shared history rather than professional roles, separate group norms from individual exceptions, and correctly attribute even after member departure. We introduce SocialMemBench, a benchmark of human-verified synthetic social group networks across five archetypes (close friends, family, recreational, interest community, acquaintance network) and three group-size tiers (4-30 members), with 430 personas and 7,355 conversation turns, yielding 1,031 QA pairs across nine question categories. Each category isolates an architectural capability, and the five failure modes (single-stream conflation, temporal-state overwrite, entity merging at scale, missing cross-persona knowledge, norm-individual conflation) are testable hypotheses; our two research probes Subject-Mem and SMG provide evidence on two, three remain open. A full-context Gemini 2.5 Flash reference reaches only 0.721 against a blind-critic reasoning-model mean of 0.98 on small networks, indicating the benchmark is genuinely difficult even with complete access to the conversation. Across all 43 networks, the four open-source memory frameworks evaluated (Mem0, LangMem, Graphiti, Cognee) cluster in the 0.12-0.18 question-weighted range with overlapping 95% CIs, well below an uncompressed retrieval reference of 0.345 and a matched-answerer full-context reference of 0.369 (GPT-4o-mini). Current memory systems show a measurable gap.

2605.17775 2026-05-19 cs.CL cs.AI 版本更新

Systematic Evaluation of the Quality of Synthetic Clinical Notes Rephrased by LLMs at Million-Note Scale

在百万笔记规模上系统评估LLM重新表述的合成临床笔记质量

Jinghui Liu, Sarvesh Soni, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO, Australia(澳大利亚电子健康研究中心,CSIRO,澳大利亚) National Library of Medicine, National Institutes of Health, USA(国家医学图书馆,国立卫生研究院,美国)

AI总结 本研究系统评估了LLM生成的合成临床笔记的质量,包括内在、外在和事实性评估,发现尽管在粗粒度任务中保留了核心临床信息和预测效用,但在细粒度任务如ICD编码中丢失了细节,通过分块重述可以缓解这一问题,但会降低事实准确性。研究还发现合成错误主要源于临床情境的误解、时间混淆、测量误差和虚构声明,同时展示了这些合成笔记可以有效增强罕见ICD代码的特定任务训练。

详情
AI中文摘要

大型语言模型(LLMs)可以为各种应用生成或合成临床文本,从改善临床文档到增强临床文本分析。然而,评估通常集中在狭窄方面——例如相似性或效用比较——尽管这些方面是互补的,最好并行看待。在本研究中,我们旨在系统评估LLM生成的临床文本,包括在百万笔记规模上从MIMIC数据库重新表述的合成临床笔记的内在、外在和事实性评估。我们的分析显示,尽管存在显著的语言变化,合成笔记仍保留了核心临床信息和粗粒度任务的预测效用,但在像ICD编码这样的细粒度任务中会丢失细节。我们展示,通过分块重述而不是整体重述笔记可以显著缓解这种细节丢失,但会以减少事实准确性为代价。通过事实核查和错误分析,我们进一步发现合成错误主要由临床情境的误解、时间混淆、测量误差和虚构声明引起。最后,我们展示了这些合成笔记——尽管具有任务无关性——可以有效增强罕见ICD代码的特定任务训练。

英文摘要

Large language models (LLMs) can generate or synthesize clinical text for a wide range of applications, from improving clinical documentation to augmenting clinical text analytics. Yet evaluations typically focus on a narrow aspect -- such as similarity or utility comparisons -- even though these aspects are complementary and best viewed in parallel. In this study, we aim to conduct a systematic evaluation of LLM-generated clinical text, which includes intrinsic, extrinsic, and factuality evaluations of synthetic clinical notes rephrased from MIMIC databases at million-note scale. Our analysis demonstrates that synthetic notes preserve core clinical information and predictive utility for coarse-grained tasks despite substantial linguistic changes, but lose fine-grained details for task like ICD coding. We show this loss of detail can be substantially mitigated by rephrasing notes by chunks rather than by the whole note, but at the cost of reduced factual precision under incomplete context. Through fact-checking and error analysis, we further find that synthesis errors are dominated by misinterpretation of clinical context, alongside temporal confusion, measurement errors, and fabricated claims. Finally, we show that the synthetic notes -- despite their task-agnostic nature -- can effectively augment task-specific training for rare ICD codes.

2605.17762 2026-05-19 cs.AI 版本更新

Surface-Form Neural Sparse Retrieval: Robust Fuzzy Matching for Industrial Music Search

表面形式神经稀疏检索:面向工业音乐搜索的鲁棒模糊匹配

Paul Greyson, Zhichao Geng, Wei Zhang, Yang Yang

发表机构 * Amazon(亚马逊)

AI总结 本文提出了一种鲁棒的神经稀疏检索系统,通过改进的稀疏检索架构和领域特定的子词分词策略,提升了工业音乐搜索中对拼写错误、转置和发音变异的鲁棒性,实现了更高的召回率和更低的延迟。

Comments accepted at SIGIR 2026 industry track

详情
AI中文摘要

在亚马逊音乐的规模下进行音乐搜索面临独特挑战:查询经常由于拼写错误、转置和发音变异而偏离索引元数据,但检索系统必须在毫秒级延迟约束下运行。我们的现有学习到检索系统,即高置信度索引(HCI),从客户行为中学习查询-实体关联,依赖于持续的『探索』来选择候选。传统的n-gram匹配能够实现这种探索,但存在语义鲁棒性差和噪声高,限制了系统从长尾查询中学习的能力。在本工作中,我们提出了一种鲁棒的神经稀疏检索系统,旨在最大化探索效率。我们将最先进的『推理自由』稀疏检索架构适应到音乐领域,并结合一种有效的领域特定的细粒度子词分词策略。我们的方法利用短长度的token约束(最大3个字符)来强制学习表面形式的鲁棒性而非词法记忆。通过在离线索引阶段预计算神经嵌入和术语扩展,使在线处理减少到最小的tokenization和IDF加权,从而实现查询编码的几乎零延迟开销。在600万文档生产语料库上的评估显示,召回率@10达到91.4%(相比传统的三元组为57.7%),在可比的吞吐量下。对HCI反馈循环的模拟显示了探索效率的提高,稳定召回率比生产三元组高0.8%。消融研究表明,我们的稀疏训练方法驱动了性能提升,而领域特定的预训练提供了比大规模通用预训练更具成本效益的替代方案。

英文摘要

Music search at the scale of Amazon Music presents a unique challenge: queries frequently deviate from indexed metadata due to misspellings, transpositions, and phonetic variations, yet the retrieval system must operate under strict millisecond-level latency constraints. Our existing learning-to-retrieve system, the High Confidence Index (HCI), learns query-entity associations from customer behavior, relying on continual ``exploration'' to choose candidates. Traditional n-gram matching enables this exploration but suffers from poor semantic robustness and high noise, limiting the system's ability to learn from long-tail queries. In this work, we present a \textbf{robust neural sparse retrieval system} designed to maximize exploration efficiency. We adapt a state-of-the-art \textbf{inference-free} sparse retrieval architecture to the music domain, combining it with an effective \textbf{domain-specific granular subword tokenization strategy}. Our approach utilizes short-length token constraints (max 3 chars) to enforce the learning of surface-form robustness over lexical memorization. By pre-computing the neural embeddings and term expansions during the offline indexing phase, online processing is reduced to minimal tokenization and IDF weighting, achieving effectively zero latency overhead for query encoding. Evaluations on a 6M-document production corpus show an aggregate \textbf{91.4\%} recall@10 (vs. \textbf{57.7\%} for trigrams) at comparable throughput. Simulation of the HCI feedback loop demonstrates improved exploration efficiency, with \textbf{+0.8\%} higher stabilized recall than production trigrams. Ablation studies indicate that our sparse training methodology drives the performance gains, while domain-specific pretraining provides a cost-effective alternative to large-scale general-purpose pretraining.

2605.17757 2026-05-19 cs.LG cs.AI cs.DC cs.PF 版本更新

OSCAR: Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization

OSCAR: 2位KV缓存量化中的离线频谱协方差感知旋转

Zhongzhu Zhou, Donglin Zhuang, Jisen Li, Ziyan Chen, Shuaiwen Leon Song, Ben Athiwaratkun, Xiaoxia Wu

发表机构 * Together AI University of Sydney(悉尼大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出OSCAR方法,通过离线估计注意力感知的协方差结构,实现2位KV缓存量化的高效和准确,同时开发了可部署的系统,提升了LLM服务框架的性能和效率。

Comments 35 pages, 10 figures

详情
AI中文摘要

INT2 KV-cache量化对于长上下文LLM服务具有吸引力,但实现准确性和可部署性仍然具有挑战。简单的旋转如Hadamard变换可以减少异常值,但仍然在INT2层面失效,因为它们与下游注意力不对齐。我们提出了OSCAR,一种超低比特KV缓存量化方法,通过离线估计注意力感知的协方差结构,并利用这些结构推导出固定旋转和截断阈值用于量化。这样,KV量化就与注意力实际消耗的协方差结构对齐。更重要的是,我们不仅提供了理论依据,还开发了一个完全可部署的OSCAR系统,包含一个定制的INT2注意力内核,该内核与分页KV缓存服务和融合内核流水线保持兼容,从而无缝集成到现代LLM服务框架中,如SGLang和vLLM。我们评估了我们的方法在最近的推理模型上,使用最多32k token的推理轨迹进行跨5个任务的测试。在Qwen3-4B-Thinking-2507和Qwen3-8B上,OSCAR将BF16精度差距分别减少到3.78和1.42个点,而朴素旋转INT2几乎归零。我们进一步将OSCAR扩展到Qwen3-32B和GLM-4.7(358B参数),其中它仍然与BF16保持有效相当。在长上下文-RULER-NIAH(最多128K)上,OSCAR在Qwen3模型上保持稳健,而朴素旋转INT2崩溃。从系统层面来看,OSCAR将KV缓存内存减少约8倍,在相同内存预算下,大批次大小下吞吐量提高最多7倍,并且由于内存带宽开销减少,单批次解码速度比BF16快最多3倍。

英文摘要

INT2 KV-cache quantization is attractive for long-context LLM serving, but it remains difficult to make both accurate and deployable. Simple rotations such as Hadamard transforms reduce outliers, but still degrade at INT2 because they are not aligned with downstream attention. We propose OSCAR, an Ultra-low-bit KV Cache quantization method that estimates attention-aware covariance structures offline and uses them to derive fixed rotations and clipping thresholds for quantization. In this way, it aligns KV quantization with the covariance structures that attention actually consumes. More importantly, we not only provide theoretical justification but also develop a fully deployable OSCAR system with a custom INT2 attention kernel that remains compatible with paged KV-cache serving and fused kernel pipelines, enabling seamless integration into modern LLM serving frameworks such as SGLang and vLLM. We evaluate our methods on recent reasoning models with reasoning traces of up to 32k tokens across 5 tasks. On Qwen3-4B-Thinking-2507 and Qwen3-8B, OSCAR reduces the BF16 accuracy gap to 3.78 and 1.42 points, respectively, while naive rotation INT2 collapses to nearly zero. We further scale OSCAR to Qwen3-32B and GLM-4.7 (358B params), where it remains effectively on par with BF16. On long context - RULER-NIAH up to 128K, OSCAR remains robust on both Qwen3 models, while naive rotation INT2 collapses. System-wise, OSCAR reduces KV-cache memory by approximately 8x, improves throughput by up to 7x at large batch sizes under the same memory budget, and accelerates batch-size-1 decoding by up to 3x over BF16 due to reduced memory bandwidth overhead.

2605.17755 2026-05-19 cs.CL cs.AI 版本更新

Bridging the Version Gap: Multi-version Training Improves ICD Code Prediction, Especially for Rare Codes

弥合版本差距:多版本训练提升ICD代码预测,尤其是罕见代码

Jinghui Liu, Anthony Nguyen

发表机构 * Australian e-Health Research Centre, CSIRO(澳大利亚电子健康研究中心,CSIRO)

AI总结 本文研究了通过结合不同ICD版本的数据训练版本无关模型的有效性,以解决ICD代码预测中的长尾问题和罕见代码性能瓶颈,实验表明多版本训练在提升罕见代码的微F1指标和频繁代码的宏指标方面均取得显著效果。

详情
AI中文摘要

临床编码将临床文档映射到标准化的医疗代码,这是一个关键但耗时的行政任务,可以通过自动化来改进。当前ICD编码模型通常针对特定版本的代码进行优化。然而,实际上ICD系统持续演进,不同版本在不同时期和地区被采用。此外,ICD编码面临长尾问题,罕见代码性能可能成为开发可实施模型的瓶颈。我们探讨了通过结合不同ICD版本的数据训练版本无关模型的可行性,这可能有助于解决这些挑战。我们将在修改后的标签注意力模型中加入ICD-9数据进行ICD-10预测训练,并发现尽管存在版本不匹配,加入ICD-9数据使18K个罕见ICD代码的微F1指标相比仅使用ICD-10训练提高了27%。在8K个频繁ICD-10代码上,多版本训练也显著提升了宏指标,并且模型参数更少。

英文摘要

Clinical coding maps clinical documentation to standardized medical codes, an essential yet time-consuming administrative task that could benefit from automation. Current models on ICD coding are typically optimized for codes from a specific ICD version. However, in reality, ICD systems evolve continuously, and different versions are adopted across time periods and regions. Moreover, ICD coding suffers from the long-tail problem, and rare code performance can be a bottleneck for developing implementable models. We examine whether it is viable to train version-independent models by combining data annotated in different ICD versions, which may help address these challenges. We add ICD-9 data to the training of a modified label-wise attention model for ICD-10 prediction, and find that despite the version mismatch, adding ICD-9 yields a 27% increase in micro F1 for 18K rare ICD codes compared to training on ICD-10 alone. On 8K frequent ICD-10 codes, the multi-version training also substantially improves macro metrics, with far fewer model parameters.

2605.17746 2026-05-19 cs.AI cs.HC 版本更新

Agents for Experiments, Experiments for Agents: A Design Grammar for AI-Enabled Experimental Science

实验中的代理,代理中的实验:一种面向人工智能增强型实验科学的设计语法

Yingjie Zhang, Chun Feng, Weizhang Zhu, Tianshu Sun

发表机构 * Guanghua School of Management, Peking University(北京大学光华管理学院) Xi'an Jiaotong University(西安交通大学) Cheung Kong Graduate School of Business(长江商学院)

AI总结 本文提出SEED框架,用于表示实验条件为类型化的代理-流程图,以支持实验设计的自动化生成和评估,通过在医疗分诊任务中的实验证明其有效性,并讨论了新颖性、可重复性等治理问题。

详情
AI中文摘要

人工智能系统正成为组织和知识工作中的积极参与者。它们越来越多地与人类互动,协调工作流程,并在多代理安排中运作。因此,理解其影响需要的不仅仅是测量输出准确性,还需要关于机制、委托、反馈和控制的证据。实验仍然是这一任务的核心,但它们也面临递归挑战:我们需要为代理设计实验来研究这些安排,我们可能需要为实验设计设计代理以帮助搜索可能设计的扩展空间。然而,人类-人工智能和代理工作流程的实验条件仍然大多以散文形式指定,这使得它们难以比较、重用或审计。我们将其框架为AI增强型知识生产的流程表示、可追溯性和治理问题。我们引入SEED(结构编码用于实验发现),一个将实验条件表示为类型化代理-流程图的框架。SEED支持三种设计功能:将条件描述为交互结构、评估结构新颖性相对于编码的先前设计、以及在可行性和治理约束下生成候选设计。我们报告了一项轻量级的实证可行性测试,比较了图盲和SEED引导生成在医疗分诊设计任务中的表现。在这一诊断对比中,SEED引导的候选设计显示出更清晰的代理-流程变化、假设和治理检查,支持了该语法作为设计辅助工具的可行性。评论最后指出围绕新颖性、可重复性、有效性、探究多样性以及问责制的治理张力。

英文摘要

AI systems are becoming active participants in organizational and knowledge work. They increasingly interact with humans, coordinate workflows, and operate in multi-agent arrangements. Understanding their effects therefore requires more than measuring output accuracy; it requires evidence about mechanisms, delegation, feedback, and control. Experiments remain central to this task, but they also face a recursive challenge: we need experiments for agents to study these arrangements, and we may need agents for experiments to help search the expanding space of possible designs. Yet experimental conditions for human-AI and agentic workflows are still largely specified in prose, making them difficult to compare, reuse, or audit. We frame this as a problem of workflow representation, traceability, and governance in AI-enabled knowledge production. We introduce SEED (Structural Encoding for Experimental Discovery), a framework that represents experimental conditions as typed actor-flow graphs. SEED supports three design functions: describing conditions as interaction structures, evaluating structural novelty relative to encoded prior designs, and generating candidate designs under feasibility and governance constraints. We report a lightweight empirical feasibility test that compares graph-blind and SEEDguided generation in a medical-triage design task. In this diagnostic contrast, SEED-guided candidate designs show clearer actor-flow changes, assumptions, and governance checks, supporting the feasibility of the grammar as a design aid. The commentary closes by identifying governance tensions around novelty, replication, validity, diversity of inquiry, and accountability.

2605.17734 2026-05-19 cs.AI 版本更新

Harnessing LLM Agents with Skill Programs

通过技能程序 harnessing LLM agents

Hongjun Liu, Yifei Ming, Shafiq Joty, Chen Zhao

发表机构 * New York University(纽约大学) Salesforce AI Research(Salesforce人工智能研究)

AI总结 本文提出 HASP 框架,通过将技能转化为可执行程序函数(PFs)来提升 LLM agent 在复杂任务中的表现,其核心方法是通过 PFs 在失败状态时介入并修正行动,主要贡献是通过模块化设计实现推理、训练和自改进的多场景应用。

Comments 40 pages, 7 figures

详情
AI中文摘要

为复杂和长周期任务提供可重用技能已成为一种流行且成功的做法。然而,这些经验通常编码为文本指导,缺乏明确的机制来决定何时以及如何介入 agent 循环。为弥合这一差距,我们引入 HASP(通过技能程序 harnessing LLM agents),一种新的框架,将技能升级为可执行程序函数(PFs)。与被动建议不同,PFs 作为可执行的护栏,在易出错的状态下激活,并修改下一步行动或注入修正上下文。HASP 高度模块化:可以在推理时直接介入 agent 循环,训练后提供结构化监督,或通过进化验证的教师评审 PFs 实现自改进。实证上,HASP 在网页搜索、数学推理和编码任务中相比训练自由和训练方法取得了显著提升。例如,在网页搜索推理中,推理时的 PFs 使平均表现比(多循环)ReAct Agent 提高 25%,而训练后和受控进化则比 Search-R1 提高 30.4%。为了深入理解 HASP,我们的机制分析揭示了 PFs 如何触发和介入,技能如何内化,以及稳定技能库进化的必要性。

英文摘要

Equipping LLM agents with reusable skills derived from past experience has become a popular and successful approach for tackling complex and long-horizon tasks. However, such lessons are often encoded as textual guidance that remains largely advisory, lacking explicit mechanisms for when and how to intervene in the agent loop. To bridge the gap, we introduce HASP(Harnessing LLM Agents with Skill Programs), a new framework that upgrades skills into executable Program Functions (PFs). Rather than offering passive advice, PFs act as executable guardrails that activate on failure-prone states and modify the next action or inject corrective context. HASP is highly modular: it can be applied at inference time for direct agent-loop intervention, during post-training to provide structured supervision, or for self-improvement by evolving validated, teacher-reviewed PFs. Empirically, HASP drives substantial gains compared to both training-free and training-based methods on web-search, math reasoning, and coding tasks. For example, on web-search reasoning, inference-time PFs alone improve the average performance by 25% compared to (multi-loop) ReAct Agent, while post-training and controlled evolution achieve a 30.4% gain over Search-R1. To provide deeper insights into HASP, our mechanism analysis reveals how PFs trigger and intervene, how skills are internalized, and the requirement for stable skill library evolution.

2605.17733 2026-05-19 cs.AI cs.LG 版本更新

Divergence-Suppressing Couplings for Rectified Flow

修正流的发散抑制耦合

Yimeng Min, Carla P. Gomes

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出了一种修正流的发散抑制耦合方法,通过在耦合生成过程中抑制学习到的速度场中的发散成分,从而减少轨迹的扭曲,提升生成效果。

详情
AI中文摘要

修正流的潜力在于生成自我生成的耦合,其轨迹是直的或几乎如此。在实践中,基础流模型生成的轨迹可能会弯曲和交织,导致耦合继承这种扭曲。本文指出,这种轨迹交织通常与学习到的速度场中非零发散区域相关,其中局部扩张或收缩会扭曲轨迹并推动粒子远离理想终点。我们随后提出了一种修正流的发散抑制耦合,这是一种离线修正,可减小耦合生成过程中学习到的速度场的发散成分。该修正仅在每次耦合对生成时支付一次,且在训练过程中被摊销,因此部署运行的时钟时间成本与标准修正流相同。实验证明,这种离线修改在2D合成基准和图像生成任务上都带来了稳定改进。

英文摘要

The promise of Rectified Flow rests on producing self-generated couplings whose trajectories are straight, or nearly so. In practice, trajectories generated by the base flow model can bend and intertwine, and the resulting coupling inherits this distortion. In this paper, we identify that such trajectory entanglement is often associated with regions of nonzero divergence in the learned velocity field, where local expansion or contraction distorts trajectories and steers particles away from their ideal endpoints. We then propose divergence-suppressing couplings for Rectified Flow, an offline correction that attenuate the divergent component of the learned velocity during coupling generation. The correction is paid only once per coupling pair and amortized over training, so deployment runs plain Euler at identical wall-clock cost to standard Rectified Flow. Empirically, this offline modification yields consistent improvements on 2D synthetic benchmarks and on image generation.

2605.17729 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Domain Incremental Learning for Pandemic-Resilient Chest X-Ray Analysis

领域增量学习用于疫情 resilient 胸部X光分析

Danu Kim

发表机构 * Danu Kim(丹努·金)

AI总结 本文提出了一种基于回放的领域增量持续学习方法,用于在跨领域变化中保持肺炎检测的鲁棒性和一致性,通过类感知平衡回放和类感知损失实现平衡的类表示和动态重加权,实验表明该方法在领域偏移的PneumoniaMNIST数据集上达到88.66%的平均准确率,优于经验回放、微调和联合训练基线。

Comments Published in Korea Software Congress (2025)

详情
AI中文摘要

深度学习模型在肺炎检测中实现了高准确性,但其在临床领域中的泛化能力受限于成像设备、获取协议和机构条件的差异。本研究引入了一种基于回放的领域增量持续学习方法,旨在使模型能够持续适应跨领域变化而不发生灾难性遗忘。所提出的方法结合了类感知平衡回放以在受限内存中保持平衡的类表示,以及类感知损失以在训练过程中动态重新加权类不平衡。在包含五个模拟领域的领域偏移PneumoniaMNIST数据集上进行的实验表明,所提出的方法实现了88.66%的平均准确率,优于经验回放、微调和联合训练基线。这些发现突显了所提出方法在跨临床环境变化中实现稳健和一致肺炎检测的有效性。

英文摘要

Deep learning models achieved high accuracy in pneumonia detection from chest X-rays. However, their generalization across clinical domains remains limited due to variations in imaging devices, acquisition protocols, and institutional conditions. This study introduces a replay-based domain-incremental continual learning designed to enable continual adaptation to cross-domain variations without catastrophic forgetting. The proposed method incorporates a class-aware balanced replay to maintain balanced class representation within a constrained memory and a class-aware loss to dynamically reweight class imbalance during training. Experiments conducted on a domain-shifted PneumoniaMNIST dataset consisting of five simulated domains demonstrate that the proposed method achieves an average accuracy of 88.66%, outperforming Experience Replay, Fine-Tuning, and Joint Training baselines. These findings highlight the efficacy of the proposed approach in achieving robust and consistent pneumonia detection across clinical environment variations.

2605.17721 2026-05-19 cs.AI 版本更新

EXG: Self-Evolving Agents with Experience Graphs

EXG: 基于经验图的自演化代理

Yuxin Jin, Siyuan Zhang, Hanchen Wang, Lu Qin, Ying Zhang, Wenjie Zhang

发表机构 * University of Technology Sydney(悉尼科技大学) The University of New South Wales(新南威尔士大学)

AI总结 本文提出EXG,一种基于经验图的自演化代理框架,通过结构化组织积累的成功与失败经验,提升代理在复杂任务中的解决质量和资源效率。

详情
AI中文摘要

基于大型语言模型(LLM)的代理在复杂推理和问题解决中表现出强大的能力,但大多数部署的代理行为静态,执行过程中获得的知识难以随时间系统性改进。为此,越来越多的研究探索如何在部署过程中通过经验使代理改进,但现有方法要么依赖于单一任务的随意反思,要么采用无结构的记忆积累碎片化经验。为了解决这一限制,我们引入EXG,一种经验图框架,用于自演化代理,明确将积累的成功与失败组织成结构化、关系化的表示。EXG是首个为自演化代理设计的经验图,支持在执行过程中实时增长图以实现跨任务经验重用,以及离线重用整合的经验图作为外部记忆模块。这种设计也使EXG能够作为可插拔组件为现有自演化代理服务,将先前经验组织成统一的经验图,并在部署过程中提高解决方案质量和资源效率。在代码生成和推理基准上的广泛实验表明,EXG在在线和离线评估中均优于基于反思和记忆的基线,在性能-效率权衡上表现更优。我们的结果表明,将经验结构化为图提供了一个原理性基础,以实现可扩展且可迁移的自演化代理行为。

英文摘要

Large language model (LLM)-based agents have demonstrated strong capabilities in complex reasoning and problem solving through multi-step interactions, yet most deployed agents remain behaviorally static, with knowledge acquired during execution rarely translating into systematic improvement over time. In response, a growing line of work on self-evolving agents explores how agents can improve through experience during deployment, but most existing approaches either rely on ad hoc reflection limited to single-task correction or adopt unstructured memory that accumulates fragmented experience with delayed usability. To address this limitation, we introduce EXG, an experience graph framework for self-evolving agents that explicitly organizes accumulated successes and failures into a structured, relational representation. EXG is the first experience graph designed for self-evolving agents, supporting both online, real-time graph growth during execution for immediate cross-task experience reuse, and offline reuse of a consolidated experience graph as an external memory module. This design also enables EXG to serve as a plug-and-play component for existing self-evolving agents, organizing prior experience into a unified experience graph and improving both solution quality and resource efficiency as deployment progresses. Extensive experiments across code generation and reasoning benchmarks show that EXG attains more favorable performance-efficiency trade-offs than reflection- and memory-based baselines in both online and offline evaluations. Our results suggest that structuring experience as a graph provides a principled foundation for scalable and transferable self-evolving agent behavior.

2605.17693 2026-05-19 cs.LG cs.AI 版本更新

Fine-tuning Pocket-Aware Diffusion Models via Denoising Policy Optimization

通过去噪策略优化微调意识口袋扩散模型

Yuan Xue, Daniel Kudenko, Megha Khosla

发表机构 * L3S Research Center(L3S研究所以) Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出DEPPA方法,基于去噪扩散策略优化,通过强化学习微调预训练的意识口袋扩散模型,以优化结合亲和力、药物性、可合成性和多样性等多属性。

详情
AI中文摘要

基于结构的药物设计已被意识口袋3D生成模型加速,但大多数方法主要拟合训练分布,可能无法满足真实世界治疗药物发现所需的多种属性。最近,越来越多的关注集中在基于结构的分子优化(SBMO)上,其目标是精细控制多个指定的分子属性。在本文中,我们提出DEPPA,一种新的SBMO方法,基于去噪扩散策略优化,通过强化学习微调预训练的意识口袋扩散模型。DEPPA能够优化多个属性,包括结合亲和力、药物性、可合成性和多样性。我们将预训练的意识口袋扩散模型的反向去噪过程建模为多步马尔可夫决策过程,其中期望的属性作为奖励信号在最终生成的配体分子上进行评估。DEPPA在RL微调期间结合粗略的去噪调度器,以实现高效的分子优化。在CrossDocked2020基准上的实验结果表明,DEPPA在结合亲和力(Vina Score -8.5 kcal/mol)、药物性和多样性方面优于基线,在可合成性方面表现出竞争性性能。源代码可在https://github.com/xy9485/DePPA上获得。

英文摘要

Structure-based drug design has been accelerated by pocket-aware 3D generative models, yet most methods primarily fit the training distribution and may fall short of satisfying multiple properties required in real-world therapeutic drug discovery. Recently, increasing attention has focused on structure-based molecule optimization (SBMO), which targets fine-grained control over multiple specified molecular properties. In this paper, we present DEPPA, a novel SBMO approach building upon Denoising Diffusion Policy Optimization for fine-tuning a pre-trained pocket-aware diffusion model via reinforcement learning. DEPPA enables optimization over multiple properties, including binding affinity, drug-likeness, synthesizability and diversity. We formulate the reverse denoising process of the pretrained pocket-aware diffusion model as a multi-step Markov Decision Process, where the desired properties that serve as reward signals are evaluated on the final generated ligand molecules. DEPPA incorporates a coarse denoising scheduler during the RL fine-tuning to achieve efficient and effective molecule optimization. Experimental results on the CrossDocked2020 benchmark demonstrate that DEPPA outperforms baselines in binding affinity (Vina Score -8.5 kcal/mol), drug-likeness and diversity while exhibiting competitive performance in synthesizability. The source code is available at https://github.com/xy9485/DePPA .

2605.17691 2026-05-19 cs.CL cs.AI 版本更新

Validate Your Authority: Benchmarking LLMs on Multi-Label Precedent Treatment Classification

验证你的权威:在多标签先例处理分类上对LLM进行基准测试

M. Mikail Demir, M. Abdullah Canbaz

发表机构 * Department of Information Science and Technology(信息科学与技术系) College of Emergency Preparedness, Homeland Security, and Cybersecurity(应急准备、国土安全与网络安全学院) University at Albany, SUNY(萨利纳大学)

AI总结 本文提出了一种新的评估框架,通过专家标注的数据集对现代大语言模型进行基准测试,引入了平均严重性误差指标,以更准确地衡量分类错误的实践影响。

Comments Accepted for publication at the Natural Legal Language Processing Workshop (NLLP) 2025, co-located with EMNLP

详情
AI中文摘要

自动化法律先例中负面处理的分类是一个关键但复杂的自然语言处理任务,误分类可能带来重大风险。为了解决标准准确率的不足,本文介绍了一种更稳健的评估框架。我们对239个真实世界法律引用的新专家标注数据集上的现代大语言模型进行了基准测试,并提出了一种新的平均严重性误差度量标准,以更好地衡量分类错误的实践影响。我们的实验揭示了性能的分裂。Google的Gemini 2.5 Flash在高层次分类任务上达到了最高准确率(79.1%),而OpenAI的GPT-5-mini则在更复杂的细粒度模式上表现最佳(67.7%)。本工作建立了关键基准,提供了一个新的上下文丰富的数据集,并引入了一个针对这一复杂法律推理任务的评估度量标准。

英文摘要

Automating the classification of negative treatment in legal precedent is a critical yet nuanced NLP task where misclassification carries significant risk. To address the shortcomings of standard accuracy, this paper introduces a more robust evaluation framework. We benchmark modern Large Language Models on a new, expert-annotated dataset of 239 real-world legal citations and propose a novel Average Severity Error metric to better measure the practical impact of classification errors. Our experiments reveal a performance split. Google's Gemini 2.5 Flash achieved the highest accuracy on a high-level classification task (79.1%), while OpenAI's GPT-5-mini was the top performer on the more complex fine-grained schema (67.7%). This work establishes a crucial baseline, provides a new context-rich dataset, and introduces an evaluation metric tailored to the demands of this complex legal reasoning task.

2605.17685 2026-05-19 cs.CV cs.AI cs.CR cs.SY eess.SP eess.SY 版本更新

Attention-Guided Fusion of 1D and 2D CNNs for Robust ECG-Based Biometric Recognition

基于注意力引导的1D和2D CNN融合用于鲁棒的基于ECG的生物识别

Arioua, Islameddine, Benzaoui, Amir, Zeroual, Abdelhafid, Houam, Lotfi

发表机构 * PIMIS Laboratory, Electronics and Telecommunications Department(PIMIS实验室,电子与电信系) Université du 8 Mai 1945(8月1945大学) Electrical Engineering Department, University of 20 August 1955(电子工程系,20 August 1955大学) Department of Electrical Engineering, Faculty of Science and Applied Sciences(电子工程系,科学与应用科学学院) Larbi Ben M'hidi University(拉比·本·迈迪大学) Department of Electronics and Communications, University of Larbi Tebessi(电子与通信系,拉比·塔贝西大学)

AI总结 本文提出了一种结合1D和2D CNN的混合框架,通过注意力引导融合机制提升ECG生物识别的鲁棒性和性能,实验表明该方法在多个数据集上均取得了较高的识别准确率。

Journal ref Digital Signal Processing 2026

详情
AI中文摘要

基于心电图(ECG)的生物识别已作为一种安全的身份验证和活体检测的有希望的解决方案。然而,大多数现有方法依赖于单模深度学习架构,单独处理一维(1D)时间信号或二维(2D)时频表示,限制了鲁棒性和泛化能力。为了解决这个问题,本文提出了一种将1D和2D卷积神经网络(CNNs)整合到统一端到端架构中的混合框架。1D分支从原始ECG信号中提取时序和形态学特征,而2D分支从时频表示中捕获判别性的频谱信息。注意力引导的融合机制根据输入特性动态加权两种模态,克服了传统静态融合策略的局限性。该框架在三个基准数据集(ECG-ID、MIT-BIH和PTB)上进行了评估,包括健康受试者和患有心脏病理学的患者,分别实现了99.56%、100.00%和99.89%的识别准确率。为了评估长期生物稳定性,还进行了多会话Heartprint数据集的实验,该数据集跨越十年。所提出的方法在相同会话中实现了98.54%(S1)、99.09%(S2)、94.93%(S3R)和96.08%(S3L)的准确率,跨会话评估达到了56.33%(S1-S2)和53.27%(S2-S3R),证明了其在时间上的稳定生物特征捕获能力。最优配置结合了InceptionTime用于1D处理,ResNet-34用于2D分析,以及基于注意力的融合。消融研究证实,所提出的注意力机制在传统融合方法中始终表现更优。总体而言,所提出的框架为ECG生物识别提供了一种稳健、可扩展且高性能的解决方案。

英文摘要

Electrocardiogram (ECG)-based biometric recognition has emerged as a promising solution for secure authentication and liveness detection. However, most existing methods rely on unimodal deep learning architectures that independently process either one-dimensional (1D) temporal signals or two-dimensional (2D) time-frequency representations, limiting robustness and generalization. To address this issue, this paper proposes a hybrid framework integrating 1D and 2D convolutional neural networks (CNNs) within a unified end-to-end architecture. The 1D branch extracts temporal and morphological features from raw ECG signals, while the 2D branch captures discriminative spectral information from time-frequency representations. An attention-guided fusion mechanism dynamically weights both modalities according to input characteristics, overcoming the limitations of conventional static fusion strategies. The framework was evaluated on three benchmark datasets (ECG-ID, MIT-BIH, and PTB), including healthy subjects and patients with cardiac pathologies, achieving identification accuracies of 99.56%, 100.00%, and 99.89%, respectively. To assess long-term biometric permanence, experiments were also conducted on the multi-session Heartprint dataset spanning ten years. The proposed approach achieved same-session accuracies of 98.54% (S1), 99.09% (S2), 94.93% (S3R), and 96.08% (S3L), while cross-session evaluations reached 56.33% (S1-S2) and 53.27% (S2-S3R), demonstrating the ability to capture stable biometric signatures over time. The optimal configuration combines InceptionTime for 1D processing, ResNet-34 for 2D analysis, and attention-based fusion. Ablation studies confirm that the proposed attention mechanism consistently outperforms conventional fusion approaches. Overall, the proposed framework provides a robust, scalable, and high-performance solution for ECG biometric recognition.

2605.17684 2026-05-19 cs.AI cs.SE 版本更新

EGI: A Multimodal Emotional AI Framework for Enhancing Scrum Master Real-time Self-Awareness

EGI:一种多模态情感AI框架,用于增强Scrum Master的实时自我意识

Jingni Huang, Peter Bloodsworth

发表机构 * Department of Computer Science(计算机科学系) University of Oxford(牛津大学)

AI总结 本文提出一种多模态情感AI框架EGI,通过整合四个精选的AI模型,实时监测Scrum Master和会议组织者无意识表达的情绪,提升团队动态中的情绪感知能力。

详情
AI中文摘要

尽管越来越多的研究关注敏捷团队成员的情绪福祉,但在Scrum Master和会议组织者的情绪监测研究中仍存在显著差距,这些角色对团队动态的影响至关重要。本文提出了一种新的应用,整合四个精心选择和推荐的AI模型,通过实时语音转文本模型进行实时转录;通过阈值分析检测语气中的情绪线索;通过基于情绪的词汇匹配识别语音内容中的情感;并通过开源的多模块AI API提供上下文感知的建议,包含情绪关键词。系统在模拟会议环境中实现了10%的ASR词错误率。我们的评估表明,实时反馈显著提高了模拟敏捷会议中的情绪感知能力,为Scrum Master和会议组织者提供实时和实用的建议,帮助他们快速识别并减少负面情绪的表达,促进更积极有效的团队互动。

英文摘要

While increasing research focuses on the emotional well-being of agile team members, a significant gap remains in emotion monitoring studies for Scrum Masters and meeting organizers, whose impact on team dynamics is crucial. This paper proposes a novel application integrating four carefully selected and recommended AI models to monitor the unconsciously expressed emotions of these key roles. This is achieved through: real- time transcription using a speech-to-text model; thresholding for intonation analysis to detect emotional cues in prosody; applying emotion-based vocabulary matching to identify sentiment in spoken content; and providing context-aware suggestions containing emotion keywords using an open-source, multi-module AI API. The system achieved an ASR word error rate WER of 10% in simulated meeting environments. Our evaluation shows that real- time feedback significantly improves emotion awareness during simulated agile meetings, providing Scrum Masters and meeting organizers with real-time and practical suggestions to help them quickly identify and minimize the expression of negative emotions, fostering more positive and effective team interactions.

2605.17679 2026-05-19 cs.HC cs.AI 版本更新

PULSE: Agentic Investigation with Passive Sensing for Proactive Intervention in Cancer Survivorship

PULSE:基于被动感知的代理探究用于癌症幸存者的主动干预

Zhiyuan Wang, Ariful Islam, Indrajeet Ghosh, Xinyu Chen, Katharine E. Daniel, Subigya Nepal, Philip Chow, Laura E. Barnes

发表机构 * Department of Systems and Information Engineering, University of Virginia(系统与信息工程系,弗吉尼亚大学) Center for Behavioral Health and Technology, University of Virginia(行为健康与技术中心,弗吉尼亚大学) Department of Computer Science, University of Virginia(计算机科学系,弗吉尼亚大学)

AI总结 本文提出PULSE系统,通过代理感知探究方法,利用智能手机被动感知数据和日记数据,提升对癌症幸存者情绪调节需求的预测准确率,验证了代理推理在主动干预中的有效性。

详情
AI中文摘要

癌症幸存者面临更高的抑郁、焦虑和一般情绪困扰风险,但需要支持的精确时刻往往自我报告数据稀疏,我们称之为日记悖论。被动智能手机感知提供了一种持续且无干扰的替代方案,但以往基于感知的愉悦预测受限于准确性上限,表明不仅数据可用性,而且行为信号的解释也存在瓶颈。我们提出了PULSE系统,该系统从固定特征管道转向代理感知探究:配备八个专用工具的LLM代理自主查询智能手机感知数据,将当前行为与个性化基线进行比较,并通过检索增强的群体层面比较进行校准。与接收预格式化特征摘要不同,代理决定检查哪些模态、回溯多远以及深入探究多少,模仿假设驱动的临床推理。我们通过2*2因子设计交叉推理架构(结构化 vs. 代理)与数据模态(仅感知 vs. 有日记)在50名癌症幸存者的纵向研究中评估PULSE。代理推理是性能的主要驱动因素:代理多模态代理在日记和感知数据下实现情绪调节需求的平衡准确率为0.743,而代理在仅被动感知数据下预测干预可用性的准确率为0.713。这些结果表明,代理探究可能成为解锁被动感知临床价值的关键,推动主动即时心理健康支持的可行性。

英文摘要

Cancer survivors face elevated rates of depression, anxiety, and general emotional distress, yet the precise moments they most need support are often the moments when self-report is sparse, a phenomenon we term the diary paradox. Passive smartphone sensing offers a continuous, unobtrusive alternative, but prior sensing-based affect prediction has been limited by an accuracy ceiling, suggesting a bottleneck not only in available data, but in how behavioral signals are interpreted. We present PULSE, a system that shifts from fixed feature pipelines to agentic sensing investigation: LLM agents equipped with eight purpose-built tools autonomously query smartphone sensing data, compare current behavior against personalized baselines, and calibrate inferences through retrieval-augmented population-level comparisons. Rather than receiving pre-formatted feature summaries, agents decide which modalities to inspect, how far back to look, and how deeply to investigate, mirroring hypothesis-driven clinical reasoning. We evaluate PULSE through a 2*2 factorial design crossing reasoning architecture (structured vs. agentic) with data modality (sensing-only vs. with diary) on 50 cancer survivors from a longitudinal study of cancer survivors. Agentic reasoning is the primary driver of performance: agentic multimodal agent achieves balanced accuracy of 0.743 for emotion regulation desire with diary and sensing data, while agentic agents predict intervention availability at 0.713 with passive sensing data only. These results suggest that agentic investigation may be a cornerstone for unlocking the clinical value of passive sensing, advancing the feasibility of proactive just-in-time mental health support.

2605.17671 2026-05-19 cs.LG cs.AI 版本更新

PEIRA: Learning Predictive Encoders through Inter-View Regressor Alignment

PEIRA: 通过视图回归对齐学习预测编码器

Michael Arbel, Basile Terver, Jean Ponce

发表机构 * Univ. Grenoble Alpes, Inria CNRS, Grenoble INP, LJK(格勒诺布尔大学、法国国家信息与自动化研究所、格勒诺布尔INP、LJK实验室) Ecole Normale Supérieure / PSL Inria Paris(巴黎高等师范学院/PSL 国家科学研究中心、法国国家信息与自动化研究所巴黎分部) New York University(纽约大学)

AI总结 本文提出PEIRA方法,通过显式目标函数和线性回归器对齐来实现非对比自监督学习,通过理论分析和实验验证其在ImageNet-1K和CIFAR-10上的有效性。

详情
AI中文摘要

非对比自监督学习(SSL)是预测表示学习的有效框架,但像SimSiam、BYOL、I-JEPA或DINO等流行方法依赖于自蒸馏来训练教师-学生网络,但通常不最小化明确的目标函数。我们分析了联合嵌入预测架构(JEPA)的一个变种,使用正则化的线性回归器来预测数据两个视图之间的学习表示,并完全表征其稳定性:非坍塌的稳定平衡点对齐于主导的非线性典型相关子空间,而坍塌的平衡点也可能是稳定的吸引子。受此结果启发,我们引入PEIRA,一种非对比SSL方法,其目标函数通过最优线性回归器的迹定义。我们证明其唯一稳定的平衡点是非平凡的全局最小值,并恢复相同的典型相关子空间,正则化选择有效维度。在ImageNet-1K和CIFAR-10上的实验表明,PEIRA与VICReg和LeJEPA基线具有竞争力,定性实验结果支持理论。

英文摘要

Non-contrastive self-supervised learning (SSL) is an effective framework for predictive representation learning, but popular (and in practice effective) methods such as SimSiam, BYOL, I-JEPA or DINO, which rely on a form of self-distillation to train a teacher-student network, remain poorly understood as they typically do not minimize a well-defined objective. We analyze the dynamics of a variant of the Joint Embedding Predictive Architecture (JEPA) using a regularized linear regressor to predict the learned representations of two views of the data from one another, and fully characterize its stability: non-collapsed stable equilibria align with leading nonlinear canonical correlation subspaces, while collapsed equilibria may also be stable attractors. Motivated by this result, we introduce PEIRA, a non-contrastive SSL method with an explicit objective defined through the trace of the optimal linear regressor. We show that its only stable equilibria are nontrivial global minimizers and recover the same canonical correlation subspaces, with regularization selecting the effective dimension. Experiments on ImageNet-1K and CIFAR-10 show PEIRA is competitive with VICReg and LeJEPA baselines, and qualitative empirical results support the theory.

2605.17669 2026-05-19 cs.AI 版本更新

Multimodal Cultural Heritage Knowledge Graph Extension with Language and Vision Models

多模态文化遗产品知图扩展与语言和视觉模型

Yang Zhang, Nada Mimouni, Jean-Claude Moissinac, Fayçal Hamdi

发表机构 * Center for Studies and Research in Computer Science and Communication, CNAM(计算机科学与通信研究所以及CNAME)

AI总结 本文提出了一种多模态方法,利用语言和视觉模型扩展文化遗产品知图,通过构建多模态知识图谱WJoconde并建立评估基准,提高知识图谱的扩展效率和可靠性。

详情
AI中文摘要

文化遗产品保育和解读日益依赖数字技术,其中知识图谱(KGs)因其能够结构化大量数据而脱颖而出。然而,这些KGs的构建和扩展往往面临挑战,因为文化遗产信息具有多样性和复杂性。本文提出了一种新的方法,用于扩展文化遗产领域的KG资源,应用于法语数据。首先,我们引入了一个新的知识图谱WJoconde,其特点是多模态,整合了实体的文本和图像信息。我们进一步引入了三个WJoconde的变体,以促进下游研究,如知识图谱补全(KGC)。我们还建立了一个全面的KGC方法基准,用于我们的数据集。其次,我们提出了一种新的框架,利用多模态方法扩展文化遗产KGs,结合大型语言模型(LLMs)和视觉-语言模型(VLMs),包括从非结构化资源中自动提取数据,并结合特殊的验证流程来确保两种模型输出的可靠性,以进一步扩展WJoconde。我们的结果表明,通过整合文化遗产数据中的丰富文本和图像信息,可以高效地增强具有高可靠性的KGs。我们开源了所有代码和基准数据集,包括文本和图像,以及原始数据的交互访问点。

英文摘要

The preservation and interpretation of cultural heritage increasingly rely on digital technologies, among which Knowledge Graphs (KGs) stand out for their ability to structure vast amounts of data. However, the construction and expansion of these KGs often face challenges due to the diverse and complex nature of cultural heritage information. In this paper, we propose a novel approach for extending KG resources in the domain of cultural heritage, which we applied to French data. First, we introduce a new knowledge graph in the domain of French cultural heritage, WJoconde, which is distinguished by its multimodality as it integrates both textual and image information of the entities. We further introduce three variants of WJoconde to facilitate downstream research, such as Knowledge Graph Completion (KGC). We also built a comprehensive benchmark for KGC methods on our dataset. Second, we propose a new framework for extending cultural heritage KGs using multi-modal approaches leveraging Large Language Models (LLMs) and Vision-Language Models (VLMs), which includes automated data extraction from unstructured resources combined with a special validation pipeline for grounding the output of both models, to further extend WJoconde. Our results show that by integrating the rich text and image information in cultural heritage data, we can efficiently enhance KGs with high reliability. We open-source all code and benchmark datasets with text and images, as well as the original data with an interactive access point

2605.17660 2026-05-19 math.OC cs.AI cs.LG stat.ML 版本更新

Training Infinitely Deep and Wide Transformers

训练无限深且宽的Transformer

Raphaël Barboni, Maarten V. de Hoop, Takashi Furuya, Gabriel Peyré

发表机构 * Bocconi University(博科尼大学) Doshisha University, RIKEN AIP(滋贺大学、RIKEN AIP) Rice University(里士满大学) CNRS, ENS, PSL Université(国家科学研究中心、巴黎综合理工学院、巴黎萨克勒大学)

AI总结 本文提出了一种严格的数学框架,用于分析Transformer在均场 regime 中的梯度基于训练动态,通过研究无限深和宽的Transformer的均场模型,建立了训练风险的条件Wasserstein梯度的显式公式,并证明了在NTK注入性假设下梯度流收敛到全局极小值。

详情
AI中文摘要

Transformers已成为现代机器学习中占主导地位的架构,但其训练动态的理论理解仍然有限。本文开发了一个严格的数学框架,用于分析在均场 regime 中Transformer的梯度基于训练动态,其中深度(层数)和宽度(注意头数)趋于无穷大。虽然ResNet训练可以理解为控制神经ODE,但Transformer训练对应于控制神经PDE,因为通过注意力机制耦合了多个token分布。我们的均场模型特征两种类型的测度表示:通过层演变的token分布和每层的注意力参数。我们建立了无限深Transformer前向传递的well-posedness,通过流映射来表征token演变,这些流映射满足函数空间中的ODE。利用伴随敏感度分析,我们推导出训练风险的条件Wasserstein梯度的显式公式,该公式涉及由反向ODE控制的伴随变量。我们证明了在条件Wasserstein度量空间中梯度流曲线的存在性和唯一性,建立了梯度基于Transformer训练的严格基础。一个关键技术贡献是提供了注意力机制的神经切线核(NTK)注入性的必要且充分条件:我们证明NTK注入性等同于log-sum-exp函数的线性独立性模仿射函数,这一条件由多种token分布满足,包括离散分布、均匀分布和高斯混合分布。在NTK注入性假设下,我们证明当初始损失足够小时,梯度流收敛到全局极小值,消除了优化景观中的虚假局部极小值。

英文摘要

Transformers have become the dominant architecture in modern machine learning, yet the theoretical understanding of their training dynamics remains limited. This paper develops a rigorous mathematical framework for analyzing gradient-based training of transformers in the mean-field regime, where both the depth (number of layers) and width (number of attention heads) tend to infinity. While ResNet training can be understood as controlling a neural ODE, transformer training corresponds to controlling a neural PDE, due to the coupling of multiple token distributions through the attention mechanism. Our mean-field model features two types of measure representations: token distributions evolving through layers and attention parameters at each layer. We establish well-posedness of the forward pass through infinitely deep transformers, characterizing token evolution via flow maps that satisfy ODEs in function spaces. Using adjoint sensitivity analysis, we derive an explicit formula for the conditional Wasserstein gradient of the training risk, involving adjoint variables governed by backward ODEs. We prove the existence and uniqueness of gradient flow curves in the conditional Wasserstein metric space, establishing a rigorous foundation for gradient-based transformer training. A key technical contribution is providing necessary and sufficient conditions for injectivity of the Neural Tangent Kernel (NTK) for attention mechanisms: we show that NTK injectivity is equivalent to linear independence of log-sum-exp functions modulo affine functions, a condition satisfied by diverse token distributions, including discrete distributions, uniform distributions, and Gaussian mixtures. Under this NTK injectivity assumption, we prove that gradient flow converges to global minima when the initial loss is sufficiently small, eliminating spurious local minima from the optimization landscape.

2605.17653 2026-05-19 cs.LG cs.AI 版本更新

LLMForge: Multi-Backend Hardware-Aware Neural Architecture Search with Infinite-Head Attention for Edge Language Models

LLMForge: 多后端硬件感知的神经架构搜索与无限头注意力用于边缘语言模型

Xinting Jiang, Junyi Luo, Ruichen Qi, Kauna Lei, Ben Laurie, Gregory Kielian, Mehdi Saligane

发表机构 * Brown University(布朗大学) University of Michigan(密歇根大学) Google Research(谷歌研究)

AI总结 本文提出LLMForge,一种多后端硬件感知的神经架构搜索框架,通过无限头注意力扩展了每层注意力配置空间,并结合Forge-Former和Forge-DSE实现了高效的边缘语言模型架构搜索,最终在不同硬件子系统上获得了不同形状的架构,展示了在不同性能指标上的优化效果。

详情
AI中文摘要

子百亿参数的Transformer语言模型正越来越多地部署在边缘设备上,其中设备端推理的隐私、延迟和运行成本优势受到紧密的内存带宽、能量和热预算的限制,使得架构选择和加速器特定的成本成为高效推理的关键。我们提出了LLMForge,一种硬件感知的神经架构搜索(NAS)框架,其三个可组合的贡献共同使边缘LM架构搜索变得硬件条件化,因为不同的基材施加了不同的硬件成本瓶颈。无限头注意力(IHA)解耦了查询头数、KV组数和每个头的查询/键/值维度,扩展了在我们的搜索空间范围内每层注意力配置空间,大约扩大了400倍。Forge-Former是一种基于编码器的替代方案,用于对架构候选者进行排名,优于MLP和随机森林基线。Forge-DSE是一种基于NSGA-II的设计空间探索引擎,与Forge-Former配对,结合了覆盖GPU、张量核心加速器和环数据流边缘加速器的多后端硬件成本模型。在四种不同的硬件基材上,搜索收敛到明显不同的架构,其形状跟踪每个基材的成本瓶颈。在多芯片环基材上,我们的联合搜索返回了三个3亿参数规模的部署感知变体,这些变体位于帕累托前沿上。每个变体都在FineWeb-Edu-10BT上重新训练,以匹配SmolLM2-360M和Qwen-0.5B架构基线。准确的变体具有最低的验证损失2.798,并在参数较少的情况下具有竞争性的基准性能,能量优化的变体降低了每token的能量消耗40%,延迟优化的变体降低了TTFT和TPOT 43%。

英文摘要

Sub-billion-parameter Transformer language models are increasingly deployed on edge devices, where the privacy, latency, and operating-cost advantages of on-device inference are constrained by tight memory-bandwidth, energy, and thermal budgets that make architectural choice and accelerator-specific cost central to efficient inference. We present LLMForge, a hardware-aware neural architecture search (NAS) framework whose three composable contributions together make edge-LM architecture search hardware-conditioned, since different substrates impose different hardware cost bottlenecks. Infinite-Head Attention (IHA) decouples the number of query heads, KV groups, and per-head query/key and value dimensions, expanding the feasible per-layer attention configuration space by approximately 400x over grouped-query attention within our search-space ranges. Forge-Former, an encoder-based surrogate for ranking architectural candidates, outperforms MLP and random-forest baselines. Forge-DSE, an NSGA-II-based design-space-exploration engine, pairs Forge-Former with a multi-backend hardware cost model spanning GPUs, systolic accelerators, and ring-dataflow edge accelerators. Across four different hardware substrates, the searches converge to visibly different architectures whose shapes track each substrate's cost bottleneck. On the multi-chip ring substrate, our co-search returns three 300M-scale deployment-aware variants on the Pareto front. Each is re-trained on FineWeb-Edu-10BT under matched recipe against SmolLM2-360M and Qwen-0.5B architecture baselines. The accurate variant has the lowest validation loss 2.798 and competitive benchmark performance with fewer parameters, the energy-optimized variant lowers energy per token by 40%, and the latency-optimized variant lowers TTFT and TPOT by 43%.

2605.17648 2026-05-19 cs.AI 版本更新

SAPO: Step-Aligned Policy Optimization for Reasoning-Based Generative Recommendation

SAPO:基于推理的生成推荐的步骤对齐策略优化

Zaiyi Zheng, Guanghui Min, Yaochen Zhu, Liang Wu, Liangjie Hong, Chen Chen, Jundong Li

发表机构 * University of Virginia(弗吉尼亚大学) Nokia(诺基亚)

AI总结 本文提出SAPO方法,通过步骤对齐策略优化解决生成推荐中因精确匹配反馈不足导致的训练不稳定问题,改进了基于推理的生成推荐系统的训练效果。

详情
AI中文摘要

生成推荐将下一项预测视为自回归的物品标识符生成。具体而言,物品被编码为语义标识符(SIDs),这些是短的由粗到细的令牌序列,早期令牌捕捉广泛语义,后期令牌细化它们。近期工作在该范式中加入了推理轨迹并通过强化学习进行优化,通常使用具有生成SID的精确匹配反馈的成果奖励算法。然而,在大型目录推荐中,对生成SID的精确匹配反馈只能报告最终物品是否正确;当生成SID不匹配时,成果奖励无法识别导致不匹配的SID-令牌预测,并可能对匹配的SID-令牌位置和不匹配的位置一起进行惩罚。我们发现在此设置中的自然信用分配单位是一个单独的推理步骤(一个思考块配对一个SID令牌)。我们实例化这一想法在SAPO(步骤对齐策略优化)中:而不是将一个优势广播到整个响应,SAPO为每个推理步骤计算一个单独的组内优势,并仅应用于相应的思考块和SID令牌。在三个真实世界推荐数据集中,SAPO稳定了强化学习训练并持续改进现有生成推荐基线,最大收益出现在稀疏精确匹配反馈使推理步骤信用分配重要的地方。我们的结果表明,结构生成的强化学习目标应反映解码器自身的输出分解。

英文摘要

Generative recommendation treats next-item prediction as autoregressive item-identifier generation. Specifically, items are encoded as semantic identifiers (SIDs), which are short coarse-to-fine token sequences whose early tokens capture broad semantics and later tokens refine them. Recent work augments this paradigm with reasoning traces and optimizes them via reinforcement learning with verifiable rewards, typically outcome-reward algorithm with exact-match feedback on the generated SID. However, in large-catalog recommendation, exact-match feedback on the generated SID only reports whether the final item is correct; when a generated SID mismatches, outcome-reward cannot identify which SID-token prediction caused the mismatch and may penalize matched SID-token positions together with the mismatched position. We identify that the natural unit of credit assignment in this setting is a single reasoning step (one thinking block paired with one SID token). We instantiate this idea in SAPO (Step-Aligned Policy Optimization): rather than broadcasting one advantage to the whole response, SAPO computes a separate group-relative advantage for each reasoning step and applies it only to the corresponding thinking block and SID token. Across three real-world recommendation datasets, SAPO stabilizes reinforcement-learning training and consistently improves over existing generative recommendation baselines, with the largest gains where sparse exact-match feedback makes reasoning-step credit assignment important. Our results suggest that reinforcement-learning objectives for structured generation should mirror the decoder's own decomposition of the output.

2605.17641 2026-05-19 cs.AI cs.CL 版本更新

Causal Intervention-Based Memory Selection for Long-Horizon LLM Agents

基于因果干预的记忆选择用于长时域大语言模型智能体

Saksham Sahai Srivastava

发表机构 * School of Computing, University of Georgia, Athens, Georgia, USA(佐治亚大学计算机学院)

AI总结 本文提出Causal Memory Intervention(CMI)方法,通过因果推理选择大语言模型的长期记忆,以提高回答质量和鲁棒性,同时引入Causal-LoCoMo基准数据集进行评估。

Comments 12 pages, 3 figures, 3 tables

详情
AI中文摘要

长时域大语言模型智能体依赖持久记忆来支持跨会话的交互,但现有记忆系统通常使用语义相似性或广泛历史包含来检索上下文,将检索到的记忆视为统一有用。这一假设是脆弱的,因为记忆可能在主题上相关,但仍然无关、过时或误导性。我们提出了Causal Memory Intervention(CMI),一种因果记忆选择技术,通过在受控干预下估计候选记忆如何影响模型的答案,选择提高任务性能的同时抑制不稳定、无关或有害的记忆。为了评估这一设置,我们引入了Causal-LoCoMo,一个从长对话数据中衍生出的因果标注基准,其中每个示例包含用户请求、结构化记忆库、有用的记忆、无关干扰项以及合成有害记忆。我们比较了CMI与向量、图、反思、摘要、完整历史和无记忆基线。结果表明,CMI在回答质量和对误导性记忆的鲁棒性之间实现了更强的平衡,表明可靠的长期记忆需要基于因果有用性而非相关性本身来选择上下文。完整的框架、基准构建代码和实验流程可在https://github.com/Saksham4796/causal-memory-intervention获取。

英文摘要

Long-horizon LLM agents rely on persistent memory to support interactions across sessions, yet existing memory systems often retrieve context using semantic similarity or broad history inclusion, treating retrieved memories as uniformly useful. This assumption is fragile because memories may be topically related while remaining irrelevant, stale, or misleading. We propose Causal Memory Intervention (CMI), a causal memory-selection technique that estimates how candidate memories affect the model's answer under controlled interventions, selecting memories that improve task performance while suppressing unstable, irrelevant, or harmful ones. To evaluate this setting, we introduce Causal-LoCoMo, a causally annotated benchmark derived from long conversational data, where each example contains a user request, a structured memory bank, useful memories, irrelevant distractors, and synthetic harmful memories. We compare CMI against vector, graph, reflection, summary, full-history, and no-memory baselines. Results show that CMI achieves a stronger balance between answer quality and robustness to misleading memory, suggesting that reliable long-term memory requires selecting context based on causal usefulness rather than relevance alone. The full framework, benchmark construction code, and experimental pipeline are available at https://github.com/Saksham4796/causal-memory-intervention.

2605.17633 2026-05-19 cs.CV cs.AI 版本更新

SparseSAM: Structured Sparsification of Activations in Segment Anything Models

SparseSAM: Segment Anything模型中激活的结构稀疏化

Hoai-Chau Tran, Chi H. Nguyen, Duy M. H. Nguyen, Mathias Niepert, Fan Lai, Khoa D. Doan

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) College of Engineering & Computer Science, VinUniversity(Vin大学工程与计算机科学学院) VinUni-Illinois Smart Health Center, VinUniversity(Vin大学-伊利诺伊智能健康中心) DFKI Max Planck Research School for Intelligent Systems (IMPRS-IS)(马克斯·普朗克智能系统研究学校) University of Stuttgart(斯图加特大学)

AI总结 本文提出SparseSAM,一种无需训练的结构稀疏化框架,通过联合加速注意力和MLP层并保持token身份,从而在保持高质量的同时提高推理速度和减少内存使用。

详情
AI中文摘要

Segment Anything Model (SAM) 实现了强大的开放词汇分割,但其基于ViT的图像编码器在推理延迟和内存方面占主导地位。现有的激活压缩方法,如标记合并,通过减少标记长度来处理,但引入了非平凡的运行时开销,并在高压缩下导致灾难性质量下降。其他应用稀疏注意力的方法仅关注注意力本身,使MLP完全密集,并限制了可达到的速度提升。我们提出了SparseSAM,一种(i)无需训练的结构稀疏化框架,该框架在加速注意力和MLP层的同时保持token身份。SparseSAM引入了(ii)Stripe-Sort Attention,它使用确定性的Z序排列将密集注意力转换为静态的硬件友好的稀疏模式,消除了动态掩码的开销。SparseSAM进一步引入了(iii)残差一致性MLP,只将信息性token路由通过MLP,同时通过残差路径传播剩余token。在四个分割基准测试中,SparseSAM在0.4密度下仅损失0.004 mIoU,在0.3密度下损失0.021 mIoU,相较于标记合并方法的改进,准确率损失减少了2.10倍,同时实现了2倍更快的推理速度和2.8倍的内存减少。

英文摘要

The Segment Anything Model (SAM) achieves strong open-vocabulary segmentation, but its ViT-based image encoders dominate inference latency and memory. Existing activation compression methods, such as token merging, reduce the token length to process, yet introduce non-trivial runtime overhead and encounter catastrophic quality drop under high compression. Other methods applying Sparse Attention focus on attention alone, leaving the MLP fully dense and capping achievable speedup. We propose SparseSAM, a (i) training-free structured sparsification framework that jointly accelerates attention and MLP layers while preserving token identity. SparseSAM introduces (ii) Stripe-Sort Attention, which uses a deterministic Z-order permutation to transform dense attention into static hardware-friendly sparse patterns, eliminating dynamic masking overhead. SparseSAM further introduces a (iii) Residual-Consistency MLP that routes only informative tokens through the MLP while propagating remaining tokens through the residual pathway. Across four segmentation benchmarks, SparseSAM loses only 0.004 mIoU at a 0.4 density and 0.021 mIoU at 0.3, a 2.10x reduction in accuracy loss versus token merging advances, while achieving 2x faster inference and 2.8x memory reduction.

2605.17625 2026-05-19 cs.AI 版本更新

Episodic-Semantic Memory Architecture for Long-Horizon Scientific Agents

用于长周期科学代理的事件-语义记忆架构

Nikola Milosevic

发表机构 * Serbian Institute for Artificial Intelligence Research and Development(塞尔维亚人工智能研究与发展研究所) Bayer A.G.(勃林格殷曼有限公司)

AI总结 本文提出了一种双过程记忆架构,用于解决科学代理在长周期任务中面临的情境窗口饱和问题,通过分离即时事件需求和长期知识整合,提升了在大规模科学工作流中的表现和可扩展性。

详情
AI中文摘要

随着大型语言模型(LLMs)发展为持久的科学合作者,情境窗口饱和已成为关键瓶颈。涉及迭代数据分析和假设修正的科学工作流迅速耗尽即使扩展的情境,而单一方法面临二次成本扩展和认知退化。我们评估了一种双过程记忆架构,将即时事件需求(恒定10条消息窗口)与长期整合知识(以每条消息约3个标记增长)分离。不同于先前的社会代理记忆系统,我们的领域特定整合解决了矛盾的参数演变、跨实验阶段的多跳推理以及精确的技术事实保留。通过覆盖15,000条消息的大型评估,跨模型验证六个LLM家族(OpenAI、Anthropic、Google)共计1,440个查询,我们得出三个关键发现。首先,尽管全情境模型在10,000条消息时因情境溢出失败,我们的系统在使用62%更少的标记(45,434 vs 120,000+限制)的情况下,保持70-85%的准确性,延迟仅1-2秒。其次,跨模型验证揭示了架构层面的权衡,与特定LLM无关:双过程在数值/时间查询(65-90%准确率)方面表现优异,而RAG在历史检索(60-85%)方面更优,表明互补的部署策略。第三,我们识别出“仿真到现实”的差距,合成测试保持恒定的记忆,但现实工作流表现出线性增长(约每条消息3个标记),其中整合质量成为主要的可扩展性瓶颈。该架构成功管理了包含14,000多个科学事实(125k标记)的资料,证明了领域特定的记忆整合能够持续运行超过全情境限制。

英文摘要

As Large Language Models (LLMs) evolve into persistent scientific collaborators, context window saturation has emerged as a critical bottleneck. Scientific workflows involving iterative data analysis and hypothesis refinement rapidly saturate even extended contexts with dense technical content, while monolithic approaches suffer from quadratic cost scaling and cognitive degradation. We evaluate a Dual Process Memory Architecture that decouples immediate episodic needs (constant 10-message window) from long-term consolidated knowledge (growing at approximately 3 tokens/message). Unlike prior social agent memory systems, our domain-specific consolidation addresses contradictory parameter evolution, multi-hop reasoning across experimental phases, and precise technical fact retention. Through large-scale evaluation spanning 15,000 messages with cross-model validation across six LLMs from three families (OpenAI, Anthropic, Google), totaling 1,440 queries, we establish three key findings. First, while full-context models fail at 10,000 messages due to context overflow, our system maintains 70-85% accuracy with 1-2 second latency using 62% fewer tokens (45,434 vs 120,000+ limit). Second, cross-model validation reveals architecture-level trade-offs independent of specific LLMs: Dual Process excels at numeric/temporal queries (65-90% accuracy) while RAG excels at historical retrieval (60-85%), suggesting complementary deployment strategies. Third, we identify a "Sim-to-Real" gap where synthetic tests maintain constant memory but realistic workflows exhibit linear growth (about 3 tokens/message), with consolidation quality emerging as the primary scalability bottleneck. The architecture successfully manages profiles with 14,000+ scientific facts (125k tokens), demonstrating that domain-specific memory consolidation enables sustained operation beyond full-context limits.

2605.17624 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-task learning on partially labeled datasets via invariant/equivariant semi-supervised learning

通过不变/等变半监督学习进行部分标注数据集上的多任务学习

Miquel Martí i Rabadán, Alessandro Pieropan, Hossein Azizpour, Atsuto Maki

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Univrses AB

AI总结 本文研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力,通过FixMatch方法和其等变扩展Dense FixMatch进行评估,在城市景观和BDD100K数据集上针对常见的目标检测和语义分割任务进行测试,发现不变和等变半监督学习在大多数情况下优于监督基线,特别是在标注样本较少时效果更佳。

Comments https://github.com/miquelmarti/DenseFixMatch

详情
AI中文摘要

我们研究了不变和等变半监督学习在处理部分标注数据集上多任务模型训练挑战的潜力。具体而言,我们使用流行的FixMatch方法进行不变半监督学习,并采用其等变扩展Dense FixMatch。我们在Cityscapes和BDD100K数据集上评估了它们在计算机视觉中普遍的目标检测和语义分割任务中的性能。我们考虑了每个任务标注子集的不同大小以及它们之间的不同重叠情况。我们的结果表明,对于不变和等变半监督学习,大多数情况下都优于监督基线,特别是在任务中可用标注样本较少时,改进最为显著,且后者方法通常表现更好。我们的研究表明,不变/等变学习是有限标注数据下多任务学习的一个有前途的方向。

英文摘要

We investigate the potential of invariant and equivariant semi-supervised learning for addressing the challenges of training multi-task models on partially labeled datasets with differently structured output tasks. Specifically, we use the popular FixMatch method for invariant semi-supervised learning and its equivariant extension Dense FixMatch. We evaluate their performance on the Cityscapes and BDD100K datasets in the context of the prevalent object detection and semantic segmentation tasks in computer vision. We consider varying sizes of the subsets annotated for each task and different overlaps among them. Our results for both invariant and equivariant semi-supervised learning outperform supervised baselines in most situations, with the most significant improvements observed when fewer labeled samples are available for a task and generally better results for the latter approach. Our study suggests that invariant/equivariant learning is a promising general direction for multi-task learning from limited labeled data.

2605.17620 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SynVA: A Modular Toolkit for Vessel Generation and Aneurysm Editing

SynVA:一种用于血管生成和动脉瘤编辑的模块化工具包

Marten J. Finck, Niklas C. Koser, Sarker M. Mahfuz, Tameem Jahangir, Jon E. Wilhelm, Daniel Behme, Naomi Larsen, Wojtek Palubicki, Sylvia Saalfeld, Sören Pirk

发表机构 * Visual Computing and Artificial Intelligence, Kiel University, Germany(视觉计算与人工智能研究所,基尔大学,德国) Institute for Medical Informatics and Statistics, Kiel University, Germany(医学信息学与统计研究所,基尔大学,德国) Clinic for Neuroradiology, Medical Faculty, Magdeburg University, Germany(神经放射科,马格德堡大学医学学院,德国) Department of Radiology and Neuroradiology, University Hospital Schleswig-Holstein, Germany(放射学与神经放射学部门,石勒苏益格-荷尔斯泰因大学医院,德国) Faculty of Mathematics and Computer Science, Adam Mickiewicz University, Poland(数学与计算机科学学院,亚当·密茨凯维奇大学,波兰)

AI总结 本文提出SynVA,一种模块化工具包,用于生成血管网格和在解剖学上一致的动脉瘤合成,通过结合新的流匹配方法和基于学习的方法,生成真实血管几何和解剖学合理的动脉瘤,同时提供大规模标注数据集以提升医疗影像分析能力。

详情
AI中文摘要

颅内动脉瘤(IAs)以不可预测的生长和破裂风险为特征,是导致中风的主要原因,可能引发致命性出血,具有高死亡率和长期残疾。随着人口老龄化,脑血管疾病的发病率和整体负担预计会增加,凸显了需要可扩展的方法来分析复杂的医疗数据并提高对这些疾病的群体层面理解的必要性。尽管数字孪生和深度学习为提高诊断、预后和治疗提供了有希望的途径,但其效果受到大规模高质量医疗数据和相应标签稀缺的限制。我们提出了SynVA,一种用于血管网格生成和解剖学一致动脉瘤合成的模块化工具包。SynVA结合了基于流匹配的新型方法生成健康血管网格与基于学习的方法生成解剖条件下的动脉瘤网格——动脉瘤是从已有的血管几何结构计算而来的,而不是孤立生成。此外,我们引入了基于生理学原理和统计先验的SynVA过程模型,用于血管和动脉瘤合成,从而能够生成大规模数据集(例如用于训练基于网格的生成模型)。为此,我们发布了包含50,000个完全标注网格样本的数据集,用于各种下游视觉任务,如语义分割。广泛的定量和定性评估证明了SynVA能够生成逼真的血管几何和解剖学合理的动脉瘤。具体而言,我们的实验表明,某些方法生成的动脉瘤形状更符合专家人类感知,而其他方法在定量相似性度量上与真实动脉瘤的重建表现更优。

英文摘要

Intracranial aneurysms (IAs), characterized by unpredictable growth and risk of rupture, are a major cause of stroke and can lead to life-threatening hemorrhages with high mortality and long-term disability. With aging populations, the incidence and overall burden of cerebrovascular diseases are expected to increase, highlighting the need for scalable approaches to analyze complex medical data and improve population-level understanding of these conditions. While digital twins and deep learning offer promising avenues for improving diagnosis, prognosis, and treatment, their effectiveness is limited by the scarcity of large-scale, high-quality medical data and corresponding labels. We present Synthetic VAsculature (SynVA), a modular toolkit for vascular mesh generation and anatomically consistent aneurysm synthesis. SynVA combines novel flow-matching-based methods for generating healthy vessel meshes with learning-based approaches for anatomy-conditioned aneurysm mesh generation - aneurysms are computed from pre-existing vascular geometries rather than being generated in isolation. In addition, we introduce the SynVA procedural model for vascular and aneurysm synthesis based solely on physiological principles and statistical priors, which enables the generation of large-scale datasets (e.g., for the training of mesh-based generative models). To this end, we release a dataset of 50,000 fully labeled mesh samples for a variety of downstream vision tasks, such as semantic segmentation. Extensive quantitative and qualitative evaluations demonstrate that SynVA generates realistic vessel geometries and anatomically plausible aneurysms. Specifically, our experiments indicate that some methods produce aneurysm shapes more aligned with expert human perception while others perform better on quantitative similarity metrics with reconstructions of real aneurysms.

2605.17608 2026-05-19 cs.CE cs.AI 版本更新

Bayesian-Monte Carlo Schedule Updating for Construction Digital Twins: A Probabilistic Framework for Dynamic Project Forecasting

基于贝叶斯-蒙特卡罗的施工数字孪生调度更新:一种用于动态项目预测的概率框架

Atena Khoshkonesh, Mohsen Mohammadagha, Vinayak Kaushal, Navid Ebrahimi

发表机构 * Department of Civil Engineering, The University of Texas at Arlington(德克萨斯理工大学土木工程系) The University of Texas at Arlington(德克萨斯理工大学)

AI总结 本文提出了一种基于贝叶斯-蒙特卡罗的概率调度更新框架,用于施工数字孪生环境,通过整合随机活动持续时间建模、贝叶斯递归更新、蒙特卡罗模拟和不确定性传播,实现动态项目预测。

Comments 22 pages, 3 figures, 5 tables

详情
AI中文摘要

施工项目经常由于劳动力生产率、材料供应、天气条件和项目协调的不确定性而出现进度延误和预测不确定性。传统的确定性调度方法如关键路径法(CPM)假设活动持续时间固定,因此无法充分表示动态项目不确定性。本文提出了一种适用于施工数字孪生环境的贝叶斯-蒙特卡罗概率调度更新框架。所提出的方法整合了随机活动持续时间建模、贝叶斯递归更新、蒙特卡罗模拟和不确定性传播,以统一的计算框架实现自适应的进度预测。活动持续时间使用对数正态概率分布进行建模,并通过贝叶斯推断不断更新,随着新的项目观测数据的出现。蒙特卡罗模拟用于传播更新的不确定性,通过项目网络生成概率完成时间预测、延误风险估计和活动关键性度量。使用PSPLIB基准项目网络的仿真实验表明,与确定性CPM和静态概率调度方法相比,所提出的框架在预测准确性和不确定性表示方面有所改进。该框架还通过整合BIM报告、无人机观测、物联网 telemetry、生产力日志和现场监控数据,支持自适应项目预测。

英文摘要

Construction projects frequently experience schedule delays and forecasting uncertainty due to variability in labor productivity, material availability, weather conditions, and project coordination. Conventional deterministic scheduling methods such as the Critical Path Method (CPM) assume fixed activity durations and therefore cannot adequately represent dynamic project uncertainty. This study presents a Bayesian-Monte Carlo probabilistic schedule updating framework for construction digital twin environments. The proposed methodology integrates stochastic activity-duration modeling, Bayesian recursive updating, Monte Carlo simulation, and uncertainty propagation within a unified computational framework for adaptive schedule forecasting. Activity durations are modeled using lognormal probability distributions and continuously updated through Bayesian inference as new project observations become available. Monte Carlo simulation is then used to propagate updated uncertainty throughout project networks and generate probabilistic completion-time forecasts, delay-risk estimates, and activity criticality measures. Simulation experiments using PSPLIB benchmark project networks demonstrate that the proposed framework improves forecasting accuracy and uncertainty representation compared with deterministic CPM and static probabilistic scheduling approaches. The framework further supports adaptive project forecasting through integration of BIM reports, drone observations, IoT telemetry, productivity logs, and site monitoring data.

2605.17580 2026-05-19 cs.AI 版本更新

ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation

ECG-WM: 一种基于生理的ECG世界模型用于临床干预模拟

Zhikang Chen, Yue Wang, Sen Cui, Yu Zhang, Changshui Zhang, Tianling Ren, Tingting Zhu

发表机构 * University of Oxford(牛津大学) Tsinghua University(清华大学) Southern University of Science and Technology(南方科技大学)

AI总结 本文提出了一种基于ECG的世界模型,用于条件化预测心脏电生理学,通过整合生理学普通微分方程先验知识,提升干预后ECG轨迹的生理合理性,并引入不确定性评估策略以更可靠地评估候选干预方案。

详情
AI中文摘要

基于ECG的模型在诊断任务中表现出色,但在建模外部干预下心脏动态演变方面仍有限。现有方法主要集中在静态预测,缺乏捕捉不同药理条件下ECG变化的机制。本文提出了一种ECG世界模型,用于动作条件化的预测模拟。通过将生理学普通微分方程先验知识整合到潜在扩散动态中,利用能量正则化,该框架实现了生理合理的干预后ECG轨迹合成,并有效缓解生成幻觉。在此模拟过程中,我们引入了一种不确定性意识的评估策略,利用扩散采样中的随机性来表征预期的临床风险及其变异性,从而更可靠地比较候选干预方案。我们在多种设置中评估了我们的方法,包括受控药物反应场景和真实世界临床记录。除了标准波形指标外,实验结果还显示了改进的风险校准和与专家指导治疗偏好的强一致。这些结果确立了我们的方法作为安全且干预感知的临床决策支持的稳健基础。

英文摘要

Electrocardiogram (ECG)-based models have achieved strong performance in diagnostic tasks, yet they remain limited in modeling how cardiac dynamics evolve under external interventions. In particular, existing approaches focus primarily on static prediction and lack mechanisms to capture ECG variations under different pharmacological conditions. In this work, we propose an ECG World Model for action-conditioned predictive simulation of cardiac electrophysiology. Moving beyond disjoint pipelines, our framework features a principled integration of physiological ordinary differential equation (ODE) priors into latent diffusion dynamics via energy regularization. This structural constraint enables the synthesis of physiologically plausible post-intervention ECG trajectories while effectively mitigating generative hallucinations. Building on this simulation process, we introduce an uncertainty-aware evaluation strategy that leverages the stochasticity of diffusion sampling to characterize both the expected clinical risk and its variability, allowing a more reliable comparative assessment of candidate interventions. We evaluate our method across diverse settings, including controlled drug-response scenarios and real-world clinical records. Beyond standard waveform metrics, experimental results demonstrate improved risk calibration and strong alignment with expert-informed treatment preferences. These results establish our approach as a robust foundation for safe and intervention-aware clinical decision support.

2605.17575 2026-05-19 cs.LG cs.AI 版本更新

UniAlign: A Model-Agnostic Framework for Robust Network Traffic Classification under Distribution Shifts

UniAlign:一种用于在分布偏移下鲁棒网络流量分类的模型无关框架

Tongze Wang, Xiaohui Xie, Wenduo Wang, Chuyi Wang, Yong Cui

发表机构 * Institute for Network Sciences and Cyberspace, Tsinghua University(网络科学与网络空间研究院,清华大学) Department of Computer Science and Technology, Tsinghua University(计算机科学与技术系,清华大学)

AI总结 本文提出UniAlign,一种模型无关的框架,通过领域对齐微调和稳定模型集成提升深度学习网络流量分类模型在分布偏移下的鲁棒性,实验表明其在准确率和F1分数上均优于现有基线。

详情
AI中文摘要

网络流量分类(NTC)模型在真实世界环境中部署时,由于网络条件的变化导致的分布偏移常常引起严重的性能下降。现有的增强鲁棒性的方法通常与特定的模型架构或数据设置耦合,无法泛化到最先进的原始字节基NTC模型,或导致显著的训练开销。在本文中,我们提出UniAlign,一种新的模型无关框架,旨在提升基于深度学习的NTC模型在分布偏移下的鲁棒性。UniAlign结合了领域对齐微调,该方法鼓励在异构网络条件下学习领域不变的流量表示,以及稳定模型集成,该方法通过在平坦损失区域内的检查点聚合来增强推理鲁棒性。该框架可以无缝集成到现有的监督NTC模型中,无需特定的特征模态或引入非常数的额外训练成本。我们在三个涵盖多样分布偏移的公开数据集上评估了UniAlign,包括加密方案、数据收集设备和攻击行为。在两个代表性的NTC模型上的实验结果表明,与标准训练相比,UniAlign将平均分类准确率提高了2.51%,平均F1分数提高了2.71%,在准确率和F1分数上均优于最强基线,同时仅需所有NTC特定基线训练时间的12.4%至53.9%。

英文摘要

Network traffic classification (NTC) models often suffer severe performance degradation when deployed in real-world environments due to distribution shifts caused by changing network conditions. Existing robustness-enhancing approaches are commonly coupled to specific model architectures or data settings, fail to generalize to state-of-the-art raw-byte-based NTC models, or incur significant training overhead. In this paper, we propose UniAlign, a novel model-agnostic framework that improves the robustness of deep learning-based NTC models under distribution shifts. UniAlign combines \emph{domain alignment fine-tuning}, which encourages the learning of domain-invariant traffic representations across heterogeneous network conditions, with \emph{stable model ensembling}, which enhances inference robustness by aggregating checkpoints within a flat loss region. The framework can be seamlessly integrated into existing supervised NTC models without requiring specific feature modalities or introducing non-constant additional training costs. We evaluate UniAlign on three public datasets covering diverse distribution shifts, including encryption schemes, data collection devices, and attack behaviors. Experimental results on two representative NTC models demonstrate that, compared with standard training, UniAlign improves average classification accuracy by 2.51\% and average F1 score by 2.71\%, outperforming the strongest baseline by 1.45\% in accuracy and 1.69\% in F1 score, while requiring only 12.4\%--53.9\% of the training time of all NTC-specific baselines.

2605.17565 2026-05-19 cs.AI cs.CL 版本更新

Generalization or Memorization? Brittleness Testing for Chess-Trained Language Models

泛化还是记忆?国际象棋训练语言模型的脆弱性测试

Ethan Tang

发表机构 * School of Computing and Augmented Intelligence(计算与增强智能学院)

AI总结 本文研究了国际象棋训练语言模型是泛化还是记忆,通过测试发现其高性能主要源于模式匹配,并展示了LLM-Modulo框架在提升国际象棋谜题解决性能上的效果,证明了与外部验证器结合的通用LLM比直接训练合成数据更灵活。

Comments 14 pages, 2 figures, 4 tables, 3 equations

详情
AI中文摘要

最近的研究对语言模型进行了棋类数据微调,并报告了高基准分数,作为证据表明由此产生的模型可以理解国际象棋规则、以专业水平下完整棋局,或生成基于专家知识的人可读解释。我们训练了KinGPT,一个仅在(位置,最佳移动)对上训练的2500万参数字符级语言模型,其在600个mate-in-N谜题套件上超过了300亿参数的ChessGPT,在20个主题谜题基准上超过了4000亿参数的C1-4B。我们检查了现有文献中关于国际象棋训练语言模型的几个主张,并断言其令人印象深刻的基准性能主要由模式匹配解释。我们还展示了LLM-Modulo,一个验证器在环框架,如何将RedPajama 3B的最佳移动准确率从1.2%提升到21.2%,移动生成有效性从19.3%提升到95.3%,在mate-in-N国际象棋谜题上,与ChessGPT在棋类特定网络语料库上微调所获得的提升相当,但成本仅为后者的一小部分。我们的结果展示了将通用LLM与外部验证器结合,为明确领域提供了一个更灵活的替代方案,而不是直接训练合成数据。我们开源了所有训练/评估代码、数据集、谜题样本和KinGPT模型检查点,以确保可重复性。

英文摘要

Recent work has fine-tuned language models on chess data and reported high benchmark scores as evidence that the resulting models can understand the rules of chess, play full chess games at a professional level, or generate human-readable explanations grounded in expert knowledge. We train KinGPT, a 25M-parameter character-level language model trained only on (position, best-move) pairs, who exceeds 3B-parameter ChessGPT on a 600-puzzle mate-in-N suite and 4B-parameter C1-4B over a 20-theme puzzle benchmark. We examine several claims made in existing literature regarding chess-trained language models and assert that their impressive benchmark performance is largely explained by pattern-matching. We also demonstrate how LLM-Modulo, a verifier-in-the-loop framework, raises RedPajama 3B's best move accuracy from 1.2% to 21.2% and move generation validity from 19.3% to 95.3% on mate-in-N chess puzzles, comparable to gains achieved from ChessGPT's fine-tuning on chess-specific web corpora at a fraction of the cost. Our results illustrate how pairing a general LLM with an external verifier offers a more flexible alternative to directly training on synthetic data for well-defined domains. We open source all training/evaluation code, datasets, puzzle samples, and KinGPT model checkpoints for reproducibility.

2605.17562 2026-05-19 cs.LG cs.AI cs.HC 版本更新

Beyond Accuracy: Robustness, Interpretability and Expressiveness of EEG Foundation Models

超越准确率:EEG基础模型的鲁棒性、可解释性和表达性

Urban Širca, Maryam Alimardani, Stefanos Zafeiriou, Konstantinos Barmpas

发表机构 * Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) Imperial College London(伦敦帝国学院)

AI总结 本文研究了EEG基础模型的鲁棒性、可解释性和表达性,通过在八个数据集上对六个EEG-FMs和一个基线深度学习模型进行基准测试,揭示了模型在不同扰动下的表现,以及其在可解释性和表达性方面的特性。

详情
AI中文摘要

EEG基础模型(EEG-FMs)主要在干净且分布内的准确性上进行了评估,其鲁棒性、可解释性和表征质量尚未得到充分考察。本研究通过在八个数据集上对六个EEG-FMs和一个基线深度学习模型进行基准测试,填补了这些空白。除了干净准确性外,我们进行了三层分析:(i)鲁棒性:我们应用了测试时扰动,包括加性噪声、随机和区域基于的通道丢弃以及区域特定的噪声注入。我们的分析表明,没有单一模型在所有失败模式中占主导地位。最抗噪的模型在通道丢弃下最为脆弱,当通道被移除而不是零填充时,许多丢弃脆弱性消失。(ii)可解释性:我们首次将注意力感知的层间相关传播(AttnLRP)应用于EEG-FMs,并展示了模型广泛集中在与任务相关的脑区,这与已知的神经生理学一致。然而,属性图在扰动下保持空间稳定,而预测性能下降,表明模型关注正确的脑区,但解码了被破坏的内容。(iii)表达性:通过块状探测,我们显示在微调过程中后期块被重新利用,而早期块已经包含任务相关的信息。此外,我们证明了之前归因于低质量预训练表示的头部-only性能较差,很大程度上是由于池化所致,且当EEG-FMs的token级嵌入被保留时,它们具有足够的表征能力。这些发现为EEG-FMs的鲁棒性、可解释性和表达性提供了首次系统的评估,并突显了其开发中的关键考虑因素。

英文摘要

EEG foundation models (EEG-FMs) have been evaluated predominantly on clean, in-distribution accuracy, leaving their robustness, interpretability and representational quality largely unexamined. This study addresses these gaps by benchmarking six EEG-FMs against a baseline deep learning model across eight datasets. Beyond clean accuracy, we conduct three layers of analysis: (i) Robustness: we apply test-time perturbations including additive noise, random and region-based channel dropout and region-specific noise injection. Our analyses show that no single model dominates all failure modes. The most noise-robust model is among the most fragile under channel dropout and much of the dropout fragility disappears when channels are removed rather than zero-padded. (ii) Interpretability: we present the first application of Attention-Aware Layer-Wise Relevance Propagation (AttnLRP) to EEG-FMs and show that models broadly concentrate relevance on task-appropriate brain regions consistent with known neurophysiology. However, attribution maps remain spatially stable under perturbation while predictions degrade, suggesting that the models attend to the correct brain regions but decode corrupted content. (iii) Expressiveness: With block-wise probing we show that late blocks are repurposed during fine-tuning, while early blocks already hold task-related information. Furthermore, we demonstrate that the poor head-only performance previously attributed to low-quality pre-trained representations is largely explained by pooling and that EEG-FMs possess sufficient representational capacity when their token-level embeddings are preserved. Together, these findings provide the first systematic assessment of robustness, interpretability and expressiveness for EEG-FMs and highlight critical considerations for their development.

2605.17559 2026-05-19 stat.ME cs.AI q-bio.QM stat.ML 版本更新

Controlling False Discovery in Arbitrarily Structured Hypothesis Spaces via Reproducing Kernels

通过再生核来控制任意结构假设空间中的假发现

Binyamin Perets, Shie Mannor

发表机构 * Technion – Israel Institute of Technology(技术Ion – 以色列理工学院) NVIDIA

AI总结 本文提出了一种基于再生核的框架,用于在任意结构的假设空间中控制假发现率,通过将结构FDR控制转化为正则化学习问题,实现了对连续域、图和层次结构的统一处理,提高了发现能力。

Comments 9 pages

详情
AI中文摘要

大规模假设检验是现代科学的核心,其中控制假发现率(FDR)已成为管理多个同时检验中假阳性的一种标准方法。假设很少是孤立存在的;它们通常通过接近性、连接性或层次结构表现出结构。这种结构既是挑战也是机会:虽然经典方法将这些依赖性视为需要保守校正的障碍,但利用它们可以显著提高发现能力。本文将结构化的FDR控制重新表述为一个正则化学习问题。通过在合适的再生核希尔伯特空间(RKHS)中优化,我们引入了一个框架,通过仅选择合适的核,将连续域、图和层次结构统一到单一算法中。这种形式化使我们能够用平滑的解决方案替代先前方法的分段常数拟合,通过原理化的基于似然的超参数选择而不是启发式调整,并在未观测位置进行推断,从而支持样本效率的实验设计。在该估计器的基础上,我们提供了两个决策规则,我们证明它们能够控制FDR。我们验证了我们的方法在两个来源上:来自高维现实数据集的空间位置,以及利用蛋白质-蛋白质相互作用图的差异基因表达任务。

英文摘要

Large-scale hypothesis testing is central to modern science, where controlling the False Discovery Rate (FDR) has become the standard approach to managing false positives across many simultaneous tests. Hypotheses rarely exist in isolation; they often exhibit structure through proximity, connectivity, or hierarchy. This structure represents both a challenge and an opportunity: while classical methods treat these dependencies as obstacles requiring conservative correction, leveraging them can substantially increase discovery power. Here, we reframe structured FDR control as a regularized learning problem. By optimizing within a suitable Reproducing Kernel Hilbert Space (RKHS), we introduce a framework that unifies continuous domains, graphs, and hierarchies under a single algorithm through kernel choice alone. This formulation enables smooth solutions in place of the piecewise-constant fits of prior methods, principled likelihood-based hyperparameter selection rather than heuristic tuning, and inference at unobserved locations which in turn supports sample-efficient experimental design. Building on this estimator, we provide two decision rules which we prove to control the FDR. We validate our method on two sources: spatial locations derived from high-dimensional real-world datasets, and a differential gene expression task utilizing protein-protein interaction graphs.

2605.17556 2026-05-19 cs.RO cs.AI 版本更新

Visual Sculpting: Visually-Aligned Planning Representations for Long-Horizon Robot Clay Sculpting

视觉雕刻:用于长周期机器人泥塑的视觉对齐规划表示

Peter Schaldenbrand, Jean Oh

发表机构 * The Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所)

AI总结 本文提出了一种视觉对齐的规划表示方法,用于长周期机器人泥塑任务,通过捕捉光照和纹理特征,提高了对可变形材料动态的建模能力,并展示了在不同可变形材料和末端执行器下的性能。

Comments 8 pages, 14 figures. Accepted for publication in IEEE Robotics and Automation Letters (RA-L)

详情
AI中文摘要

泥塑是一种复杂的艺术任务,需要通过长周期规划实现高阶目标。作为机器人问题,我们将泥塑视为形状到形状的匹配挑战。先前的可变形物体 manipulation 工作要么需要为每个目标重新训练策略,要么依赖于动态模型,这些模型将状态表示为稀疏点云,无法良好捕捉泥塑的重要特征,如纹理。我们提出了一种方法,用于建模可变形材料的动力学,并在视觉对齐的表示中为机器人雕刻规划。通过三种不同的可变形材料和各种末端执行器,我们证明我们的动力学模型在性能上与最先进的方法相当,并且具有兼容视觉规划的优势。我们的动作被表示为单个末端执行器向泥塑施加的参数化推力,这已被证明适用于长周期(>100次动作)的泥塑浮雕。最后,我们展示了在视觉对齐表示中规划的好处,同时提供了分析,证明了与3D表示相比,这种表示在规划上更具挑战性。

英文摘要

Clay sculpting is a nuanced, artistic task involving dexterous manipulation with long-horizon planning to achieve high-level goals. As a robotics problem, we formulate clay sculpting as a shape-to-shape matching challenge. Prior deformable object manipulation work either requires retraining a policy per goal or relies on dynamics models which represent state as sparse point clouds which do not capture important clay features, such as textures, well. We present a method for modeling the dynamics of deformable materials and planning for robotic sculpting in a representation that is visually-aligned, capturing lighting and texture features. With three different deformable materials and various end-effectors, we demonstrate that our dynamics model is comparable in performance to the state-of-the-art with the added benefit of being compatible with visual planning. Our actions are represented as parametrized pushes into clay with a single end-effector, which proved to be suitable for long-horizon (>100 actions) clay relief sculptures. Lastly, we show the benefits of planning in a visually-aligned representation, but also provide analysis providing evidence as to why this representation is challenging to plan in compared to 3D representations.

2605.17530 2026-05-19 cs.CR cs.AI cs.LG cs.NI 版本更新

Few-Shot Network Intrusion Detection Using Online Triplet Mining

基于在线三元组挖掘的少样本网络入侵检测

Jack Wilkie, Hanan Hindy, Christos Tachtatzis, Miroslav Bures, Robert Atkinson

发表机构 * Department of Electronics and Electrical Engineering, University of Strathclyde(斯特拉斯克莱德大学电子与电气工程系) Faculty of Computer and Information Sciences, Ain Shams University(爱思曼大学计算机与信息科学学院) Faculty of Electrical Engineering, Czech Technical University(捷克技术大学电气工程学院)

AI总结 本文提出利用在线三元组挖掘和KNN分类器的三元组网络,实现少样本下的有效网络入侵检测,通过对比不同三元组挖掘算法和模型设计,验证了在少量恶意样本下该方法的竞争力。

Comments Published in: MDPI Applied Sciences, 2026. Official version: https://doi.org/10.3390/app16104589 Code: https://github.com/jackwilkie/few_shot_nids_triplet_mining

Journal ref Wilkie, J.; Hindy, H.; Tachtatzis, C.; Bures, M.; Atkinson, R. Few-Shot Network Intrusion Detection Using Online Triplet Mining. Appl. Sci. 2026, 16, 4589. https://doi.org/10.3390/app16104589

详情
AI中文摘要

网络入侵检测系统在网络保护中起着关键作用,通过检测恶意网络流量并由网络安全运营中心调查。最先进的方法利用监督机器学习方法训练分类模型以识别已知的网络攻击;然而,这些模型需要大量的标记数据集进行训练,并在训练较小数据集时表现不佳。为了解决这一不足,异常检测模型学习良性流量的分布,并将不符合的流量标记为恶意。虽然这些方法不需要恶意示例进行训练,但它们的高误报率使其不切实际。因此,当特定攻击类别的标记实例不足时,网络可能特别容易受到攻击。这通常发生在新建立的网络或之前未见过的攻击类型出现时。为了解决这一挑战,本文提出使用三元组网络,利用在线三元组挖掘和KNN分类器,能够进行少样本分类,从而在仅训练少量恶意示例后实现有效的入侵检测。各种在线三元组挖掘算法被探索,并通过一系列消融研究比较和评估了模型设计选择,如推断算法和优化的距离度量。最终模型在少样本二分类和多类分类中与现有方法进行了比较,发现当每个类别训练至少10个恶意样本时,所提出的方法在竞争性方面表现良好。

英文摘要

Network intrusion detection systems play a vital role in protecting networks by detecting malicious network traffic which can then be investigated by a cybersecurity operations centre. State-of-the-art approaches utilise supervised machine learning methods to train a classification model to recognise known cyberattacks; however, these models require a large labelled dataset to train and show poor performance when trained on smaller datasets. In an attempt to address this shortcoming, anomaly detection models learn the distribution of benign traffic and flag non-conforming traffic as malicious. While these methods do not require malicious examples to train, they suffer from high false-positive rates rendering them impractical. As a result, networks may be particularly vulnerable when there are insufficient labelled instances of a specific attack class to train an effective classifier. This often occurs in newly established networks or when previously unseen types of attacks emerge. To address this challenge, this work proposes the use of a triplet network, utilising online triplet mining and a KNN classifier, which is able to perform few-shot classification, enabling effective intrusion detection after being trained on a limited number of malicious examples. Various online triplet mining algorithms were explored and model design choices, such as the inference algorithm and optimised distance metrics, were compared and evaluated through a series of ablation studies. The final model was compared against other state-of-the-art approaches in few-shot binary and multiclass classification, where the proposed approach was found to be competitive with existing methods when trained on as little as 10 malicious samples of each class.

2605.17528 2026-05-19 cs.LG cs.AI cs.CL 版本更新

CasualSynth: Generating Structurally Sound Synthetic Data

CasualSynth: 生成结构上合理的合成数据

Zehua Cheng, Wei Dai, Jiahao Sun, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) Institute of Logic and Computation, TU Wien(维也纳技术大学逻辑与计算研究所)

AI总结 本文提出CasualSynth框架,通过解耦因果结构生成与语义实现,生成既符合因果机制又语义丰富的合成数据,解决了LLM在生成合成数据时无法保证因果正确性的问题。

Comments 15 pages

详情
AI中文摘要

大型语言模型(LLMs)能够生成逼真的合成数据,但无法保证其输出符合目标领域的因果机制。我们引入CausalSynth框架,该框架将因果结构生成与语义实现解耦,生成既符合因果机制又语义丰富的合成数据。该框架分为三个阶段:首先,一个结构因果模型(SCM)——一个定义在有向无环图(DAG)上的结构方程组,通过祖先采样生成因果骨架,即满足支配图全局马尔可夫性质的变量赋值;其次,一个LLM作为受约束的实现者,一个条件翻译器,将每个骨架映射到高维观测,如临床笔记或交易日志;第三,一个迭代一致性验证模块通过确定性提取检测结构违规,并将针对性的修正反馈给LLM,形成闭环优化过程。我们识别出语义后门问题,即LLM系统性地用预训练先验覆盖施加的因果事实——并证明我们的迭代机制相对于标准拒绝采样减少了由此产生的选择偏差。在三个因果基准(ASIA、ALARM和MIMIC-Struct)上,CausalSynth在假阳性率接近名义α=0.05水平的情况下保持条件独立性,并在70B参数LLM基础上实现了超过96%的可实现率。该框架还通过保留噪声和图 mutilation 支持原理化的干预和反事实生成。

英文摘要

Large Language Models (LLMs) generate realistic synthetic data but offer no guarantee that their outputs respect the causal mechanisms governing the target domain. We introduce CausalSynth, a framework that decouples causal structure generation from semantic realization, yielding synthetic data that is both causally valid and linguistically rich. The framework operates in three phases. First, a Structural Causal Model (SCM) - a tuple of structural equations defined over a directed acyclic graph (DAG) generates causal skeletons, i.e., variable assignments that satisfy the Global Markov Property of the governing DAG, via ancestral sampling. Second, an LLM acts as a constrained \emph{realizer}, a conditional translator that maps each skeleton to a high-dimensional observation such as a clinical note or a transaction log. Third, an Iterative Consistency Verification module detects structural violations through deterministic extraction and feeds targeted corrections back to the LLM, forming a closed-loop refinement process. We identify the Semantic Backdoor problem the systematic tendency of LLMs to override imposed causal facts with pre-training priors -- and prove that our iterative mechanism reduces the resulting selection bias relative to standard rejection sampling. On three causal benchmarks (ASIA, ALARM, and MIMIC-Struct), CausalSynth preserved conditional independencies with false-positive rates near the nominal $α=0.05$ level and achieved realizability rates above 96% with 70B-parameter LLM backbones. The framework additionally supports principled interventional and counterfactual generation through noise retention and graph mutilation.

2605.17526 2026-05-19 cs.SE cs.AI 版本更新

SaaSBench: Exploring the Boundaries of Coding Agents in Long-Horizon Enterprise SaaS Engineering

SaaSBench: 探索编码代理在长周期企业SaaS工程中的边界

Qingnan Ren, Shun Zou, Shiting Huang, Ziao Zhang, Kou Shi, Zhen Fang, Yiming Zhao, Yu Zeng, Qisheng Su, Lin Chen, Yong Wang, Zehui Chen, Xiangxiang Chu, Feng Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学) AMAP, Alibaba Group(阿里云实验室)

AI总结 本文提出SaaSBench,首个针对企业SaaS工程的基准,旨在探索AI代理在复杂系统中的边界,通过30个任务和5370个验证节点,涵盖8种编程语言、6种数据库和13种框架,揭示了当前最先进代理在多组件系统配置和集成中的主要瓶颈。

详情
AI中文摘要

随着自主编码代理能够处理越来越长周期的任务,它们逐渐展示了完成端到端软件开发的潜力。尽管现有的基准近年来已从局部代码编辑发展到从头开始的项目生成,但它们仍局限于结构简化、单栈应用。因此,它们无法捕捉真实企业软件即服务(SaaS)系统中的异构环境、全栈编排和系统级复杂性,留下了在现实工程约束下评估代理的关键空白。为填补这一空白,我们引入SaaSBench,首个针对企业SaaS工程的基准。它涵盖30个复杂任务,跨越6个SaaS领域,包含5370个验证节点,整合8种编程语言、6种数据库和13种框架,以精确模拟现实世界的软件异质性。此外,我们设计了一种依赖感知的混合评估范式,专门针对具有长周期和多组件耦合的复杂系统,实现细粒度、可重复的评估。关键的是,我们的广泛实验揭示了一个显著的见解:当前最先进的代理的主要瓶颈不是生成孤立的代码逻辑,而是成功配置和集成多组件系统。超过95%的任务失败发生在代理甚至达到深度业务逻辑之前,模型常因过度自信而提前终止基础系统设置,或陷入无效的调试循环。我们希望SaaSBench能作为实用且具有挑战性的测试平台,推动可靠、系统级编码代理的发展。代码可在https://github.com/ShadeCloak/SaaSbench获取。

英文摘要

As autonomous coding agents become capable of handling increasingly long-horizon tasks, they have gradually demonstrated the potential to complete end-to-end software development. Although existing benchmarks have recently evolved from localized code editing to from-scratch project generation, they remain confined to structurally simplified, single-stack applications. Consequently, they fail to capture the heterogeneous environments, full-stack orchestration, and system-level complexity of real enterprise Software as a Service (SaaS) systems, leaving a critical gap in assessing agents under realistic engineering constraints. To fill this gap, we introduce SaaSBench, the first benchmark designed to explore the boundaries of AI agents in enterprise SaaS engineering. Spanning 30 complex tasks across 6 SaaS domains with 5,370 validation nodes, it incorporates 8 programming languages, 6 databases, and 13 frameworks to meticulously mirror real-world software heterogeneity. Furthermore, we design a dependency-aware hybrid evaluation paradigm tailored for complex systems with long horizons and multi-component coupling, enabling fine-grained, reproducible assessment. Crucially, our extensive experiments reveal a striking insight: the primary bottleneck for state-of-the-art agents is not generating isolated code logic, but successfully configuring and integrating a multi-component system. Over 95\% of task failures occur before agents even reach deep business logic, with models often falling victim to overconfidence and prematurely halting during foundational system setup, or getting trapped in ineffective debugging loops. We hope SaaSBench serves as a practical and challenging testbed to drive the evolution of reliable, system-level coding agents. The code is available at \url{https://github.com/ShadeCloak/SaaSbench}.

2605.17508 2026-05-19 cs.LG cs.AI 版本更新

BESplit: Bias-Compensated Split Federated Learning with Evidential Aggregation

BESplit: 偏差补偿分割联邦学习与证据聚合

Yuhan Xie, Chen Lyu, Jingrong Huang

发表机构 * MoE Key Laboratory of Interdisciplinary Research of Computation(交叉计算与经济学 interdisciplinary 研究 MOE 重点实验室) Shanghai University of Finance(上海财经大学)

AI总结 本文提出BESplit框架,通过证据聚合和偏差补偿协作来解决非独立同分布数据下分割联邦学习的偏差优化和收敛不稳定问题,提升了模型的准确性和效率。

详情
AI中文摘要

分割联邦学习(SFL)通过将模型分割到客户端和服务器之间实现隐私保护的协同训练。然而,在非独立同分布数据分布下,SFL常面临偏差优化和收敛不稳定的问题,而现有解决方案大多借鉴传统联邦学习的技术。在本工作中,我们发现SFL的分割架构本质上改变了客户端信息的表示和协调方式,为超越参数级聚合的偏差补偿提供了机会。基于这一见解,我们提出了BESplit,一个架构感知的框架,利用SFL内在结构来缓解非IID效应。首先,为防止偏见本地数据主导全局更新,我们引入证据聚合(EA)以基于证据不确定性对客户端贡献进行细粒度重新加权。其次,为进一步减少分布偏斜,我们开发了偏差补偿协作(BCC)以通过配对互补客户端对齐分割层表示。最后,双教师蒸馏(DTD)被纳入以同步解耦客户端和服务器模型之间的知识,使本地推理能够独立进行。在五个基准数据集上的广泛实验表明,BESplit在多样化的非IID设置下,准确率、收敛稳定性以及计算效率均优于现有最先进方法。

英文摘要

Split Federated Learning (SFL) enables privacy-preserving collaborative training by partitioning models between clients and a server. However, under non-IID data distributions, SFL often suffers from biased optimization and unstable convergence, while existing solutions largely adapt techniques from conventional federated learning. In this work, we observe that the split architecture of SFL inherently alters how client information is represented and coordinated, opening opportunities for bias compensation beyond parameter-level aggregation. Based on this insight, we propose BESplit, an architecture-aware framework that exploits the intrinsic structure of SFL to mitigate non-IID effects. First, to prevent biased local data from dominating global updates, we introduce Evidential Aggregation (EA) to perform fine-grained reweighting of client contributions based on evidential uncertainty. Second, to further reduce distributional skew, we develop Bias-Compensated Collaboration (BCC) to align split-layer representations by pairing complementary clients. Finally, Dual-Teacher Distillation (DTD) is incorporated to synchronize knowledge between decoupled client and server models, enabling independent local inference. Extensive experiments on five benchmark datasets demonstrate that BESplit consistently outperforms state-of-the-art methods in accuracy, convergence stability, and computational efficiency under diverse non-IID settings.

2605.17504 2026-05-19 cs.CV cs.AI 版本更新

A Distributional View for Visual Mechanistic Interpretability: KL-Minimal Soft-Constraint Principle

从分布视角看视觉机制可解释性:KL最小软约束原理

Guancheng Zhou, Yisi Luo, Zhengfu He, Zhenyu Jin, Xuyang Ge, Wentao Shu, Deyu Meng, Xipeng Qiu

发表机构 * School of Mathematics and Statistics(数学与统计学学院) Ministry of Education Key Lab of Intelligent Networks and Network Security(教育部智能网络与网络安全重点实验室) Shanghai Innovation Institute(上海创新研究院) Fudan University(复旦大学)

AI总结 本文提出了一种基于分布的视觉机制可解释性方法,通过KL最小化优化问题来平衡可解释性和模型忠实性,利用能量引导的扩散后验采样实现,并在DINOv3模型上验证了其有效性。

详情
AI中文摘要

当前视觉机制可解释性(MI)的主要范式仍局限于通过启发式方法(如Top-K激活检索或正则化优化)解释视觉模型的内部单元。在本文中,我们建立了视觉MI的理论分布视角,该视角模型了特征激活对自然图像分布的影响,从而构建了一个KL最小化优化问题来建模MI任务。在此框架下,识别了先前MI范式中的统计偏差,揭示这些范式可能在人类感知上不可解释(即偏离自然图像分布)或在机械上不忠实于视觉模型(即无法激活模型特征)。为了解决这些偏差,我们提出了一种基于KL最小化软约束原理的视觉MI模型,该模型在理论上平衡了可解释性和忠实性。我们通过能量引导的扩散后验采样实现了这一原理。广泛的实验验证了所提出分布视角的理论正确性,并展示了我们的范式在DINOv3视觉模型上的实际有效性。

英文摘要

Most current paradigms in visual mechanistic interpretability (MI) remain confined to interpreting internal units of the vision model via heuristic methods (e.g., top-$K$ activation retrieval or optimization with regularization). In this work, we establish a theoretical distributional view for visual MI, which models the influence of a feature activation on the natural image distribution, thereby formulating a Kullback-Leibler (KL)-minimal optimization problem to model the MI task. Under this framework, statistical biases are identified within previous MI paradigms, which reveal that they may either be perceptually uninterpretable to humans (i.e., deviate from the natural image distribution), or mechanistically unfaithful to the vision models (i.e., unable to activate model features). To resolve the biases under the distributional view, we propose a model with a KL-minimal soft-constraint principle for visual MI that theoretically balances interpretability and faithfulness. We realize this principle via energy-guided diffusion posterior sampling. Extensive experiments validate the theoretical soundness of the proposed distributional view and demonstrate the practical effectiveness of our paradigm on the DINOv3 vision model.

2605.17503 2026-05-19 cs.AI cs.CL cs.HC 版本更新

RAG-based EEG-to-Text Translation Using Deep Learning and LLMs

基于深度学习和大语言模型的RAG EEG到文本翻译

Enrico Collautti, Xiaopeng Mao, Luca Tonin, Stefano Tortora, Sadasivan Puthusserypady

发表机构 * IAS-LAB, Department of Information Engineering, University of Padova(帕多瓦大学信息工程系IAS实验室) Padova Neuroscience Center(帕多瓦神经科学中心) Department of Health Technology, Technical University of Denmark(丹麦技术大学健康技术系)

AI总结 本文提出了一种基于检索增强生成(RAG)的EEG到文本解码方法,结合EEG编码器、向量检索阶段和大语言模型,以提高句子级解码的准确性,并在ZuCo数据集上验证了其有效性。

Comments 6 pages, 2 figures. Submitted to the 2026 IEEE International Conference on Systems, Man, and Cybernetics

详情
AI中文摘要

从电生理图(EEG)信号解码语言信息仍然是脑机接口(BCI)研究中极具挑战性的问题。特别是,由于EEG记录的信噪比较低,从EEG进行句子级解码尤为困难。以往研究通常在推理阶段未使用教师强制时难以超越随机基线性能。在本文中,我们提出了一种基于检索增强生成(RAG)的句子级EEG到文本解码流程,结合与语义句子嵌入对齐的EEG编码器、向量检索阶段以及大语言模型(LLM)以将检索到的句子细化为连贯的输出。实验在Zurich认知语言处理语料库(ZuCo)数据集上进行,该数据集包含在静默阅读期间收集的单次试验EEG记录。为了评估系统是否从这些EEG信号中提取了有意义的信息,结果与随机基线进行比较。在九名受试者中,所提出的流程优于随机基线,平均余弦相似度为0.181±0.022,与基线0.139±0.029相比,相对改进为30.45%。统计分析进一步确认了这种改进的显著性,遵循严格评估流程,其中推理阶段不接触地面真实标签。

英文摘要

The decoding of linguistic information from electroencephalography (EEG) signals remains an extremely challenging problem in brain-computer interface (BCI) research. In particular, sentence-level decoding from EEG is difficult due to the low signal-to-noise ratio of these recordings. Previous studies tackling this problem have typically failed to surpass random baseline performance unless teacher forcing is used during the inference phase. In this work, we propose a retrieval-augmented generation (RAG)-based sentence-level EEG-to-text decoding pipeline that combines an EEG encoder aligned with semantic sentence embeddings, a vector retrieval stage, and a large language model (LLM) to refine retrieved sentences into coherent output. Experiments are conducted on the Zurich Cognitive Language Processing Corpus (ZuCo) dataset, which contains single-trial EEG recordings collected during silent reading. To evaluate whether the system extracts meaningful information from these EEG signals, the results are compared with a random baseline. In nine subjects, the proposed pipeline outperforms the random baseline, achieving a mean cosine similarity of 0.181 +- 0.022 compared to 0.139 +- 0.029 for the baseline, corresponding to a relative improvement of 30.45%. Statistical analysis further confirms that this improvement is significant, following a strict evaluation workflow where inference is performed without access to ground-truth labels.

2605.17493 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

Beyond Linear Superposition: Discovering Climate Features in AI Weather Models with KAN-SAE

超越线性叠加:利用KAN-SAE在AI天气模型中发现气候特征

Minjong Cheon

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文提出KAN-SAE,一种基于Kolmogorov-Arnold网络的稀疏自编码器,通过非线性激活函数揭示天气预测模型中的气候特征,相比线性基线提升了72%的活跃特征数量和降低了20%的特征冗余。

详情
AI中文摘要

深度学习天气预测模型在预测能力上表现出色,但其内部如何表示物理气候现象仍不明确。通过稀疏自编码器(SAEs)实现的机理可解释性提供了一种分解这些表示的有原则方法,但现有SAEs假设严格线性特征叠加,这与现代变压器中编码的高度非线性大气动力学不匹配。我们引入KAN-SAE,一种稀疏自编码器,其编码器将标准ReLU替换为可学习的每特征B-样条激活函数,这些激活函数来自Kolmogorov-Arnold网络(KANs),使每个潜在维度能够发展出自己的非线性门控配置。应用于Sonny时,KAN-SAE发现了975个活跃特征(相比线性基线的566个,提升了72%),并具有20%更低的特征冗余和可比的重建保真度。在无任何气候监督的情况下,KAN-SAE识别出一个在西欧空间集中的可解释热浪特征,并通过因果操控实验验证了西太平洋台风追踪器。我们的结果表明,非线性激活对于深度学习天气预测模型的机理可解释性至关重要,恢复了对线性基线不可见的气候特征。

英文摘要

Deep learning weather prediction models achieve remarkable predictive skill yet remain largely opaque: we know little about how they represent physical climate phenomena internally. Mechanistic interpretability through Sparse Autoencoders (SAEs) offers a principled route to decomposing these representations, but existing SAEs assume strictly linear feature superposition - a constraint ill-suited for the highly nonlinear atmospheric dynamics encoded in modern transformers. We introduce KAN-SAE, a sparse autoencoder whose encoder replaces the standard ReLU with learnable per-feature B-spline activations drawn from Kolmogorov-Arnold Networks (KANs), allowing each latent dimension to develop its own nonlinear gating profile. Applied to Sonny, KAN-SAE discovers 975 alive features (vs. 566 for a linear baseline, a 72% improvement) with 20% lower inter-feature redundancy and comparable reconstruction fidelity. Without any climate supervision, KAN-SAE identifies an interpretable European heatwave feature spatially concentrated over western Europe, and a western Pacific typhoon tracker confirmed by causal steering experiments. Our results demonstrate that nonlinear activations are essential for mechanistic interpretability of deep learning weather prediction models, recovering climate features that remain invisible to linear baselines.

2605.17461 2026-05-19 cs.HC cs.AI cs.CY 版本更新

Artificial Intelligence can Recognize Whether a Job Applicant is Selling and/or Lying According to Facial Expressions and Head Movements Much More Correctly Than Human Interviewers

人工智能能通过面部表情和头部动作更准确地识别求职者是否在撒谎,比人类面试官更准确

Hung-Yue Suen, Kuo-En Hung, Che-Wei Liu, Yu-Sheng Su, Han-Chih Fan

AI总结 本文研究了人工智能通过面部表情和头部动作识别求职者是否在撒谎的准确性,提出了一种基于深度学习的计算机视觉模型,能够提取求职者在视频中的面部表情和头部动作的时序模式,以识别自报的诚实和欺骗性印象管理策略,并通过实验验证了该模型在识别诚实和欺骗性印象管理方面的有效性,比人类面试官更准确。

Comments 11 pages, 5 figures

Journal ref IEEE Transactions on Computational Social Systems, 11(5), 5949-5960, 2024

详情
AI中文摘要

是否能够通过视频中的面部表情信号检测面试者的诚实和欺骗性回应一直是争论的话题,需要进一步研究。我们开发了基于计算机视觉的深度学习模型,以提取求职者在视频中的面部表情和头部动作的时序模式,以从视频帧中识别自报的诚实和欺骗性印象管理(IM)策略。每个N=121名求职者在回答五个结构化行为面试问题时录制了12至15分钟的视频。每位求职者完成了一份调查,以自评其信任度在四个印象管理(IM)指标上。此外,还进行了一项现场实验,以比较我们的建模方法与人类面试官在自报IMs的的同时效度。人类面试官在预测这些IM指标时,从另一组30个视频中获得表现,由N=30名人类面试官评估三个记录。我们的模型解释了诚实和欺骗性IMs的91%和84%的方差,并且比人类面试官显示出更强的与自报IM分数的相关性。

英文摘要

Whether an interviewee's honest and deceptive responses can be detected by facial expression signals in videos has been debated and requires further research. We developed deep learning models enabled by computer vision to extract temporal patterns of job applicants' facial expressions and head movements to identify self-reported honest and deceptive impression management (IM) tactics from video frames in real asynchronous video interviews. A 12- to 15-minute video was recorded for each of N=121 job applicants as they answered five structured behavioral interview questions. Each applicant completed a survey to self-evaluate their trustworthiness on four IM measures. Additionally, a field experiment was conducted to compare the concurrent validity associated with self-reported IMs between our modeling approach and human interviewers. Human interviewers' performance in predicting these IM measures from another subset of 30 videos was obtained by having N=30 human interviewers evaluate three recordings. Our models explained 91% and 84% of the variance in honest and deceptive IMs, respectively, and showed stronger correlations with self-reported IM scores than human interviewers.

2605.17456 2026-05-19 cs.CV cs.AI 版本更新

GCE-MIL: Faithful and Recoverable Evidence for Multiple Instance Learning in Whole-Slide Imaging

GCE-MIL: 多实例学习中全滑片成像的可信且可恢复的证据

Xiangyu Li, Ran Su

发表机构 * College of Intelligence and Computing(智能与计算学院)

AI总结 该研究提出GCE-MIL方法,通过优化S/N/R标准直接提升多实例学习中全滑片成像的预测性能和证据质量,改进了宏F1分数和C-index,并减少了连续-离散差距。

Comments 10 pages, 17 figures, 24 table

详情
AI中文摘要

多实例学习(MIL)是全滑片图像(WSI)分类和生存预测的标准方法,其中基于注意力的模型将图像块特征聚合为滑片级预测。这些模型将注意力权重视为预测的证据,但注意力被优化用于分类,而非识别支持诊断的实际图像块。这种混淆导致三个失败:选择的图像块不足(单独保留它们会降低宏F1分数0.078)、多余(移除它们几乎不影响预测)以及不可恢复(连续的注意力分数与推理中使用的离散图像块子集不一致)。核心前提是证据质量应通过显式标准直接优化——充分性、必要性和可恢复性(S/N/R)——而不是作为分类的副产品继承。GCE-MIL是一种背骨无关的封装器,通过三种注入模式和三种证据组件实现:一个将选择与领域特定概念对齐的 grounding 机制,一个作为可微分代理的 noisy-OR 覆盖,以及一个通过边缘引导修复将连续选择器转换为离散子集的阈值加修复恢复。在9个背骨和9个数据集(81种配置)上,GCE-MIL将平均宏F1分数提高了0.024,C-index提高了0.014,减少了连续-离散差距4-7,增加了补集退化2-4。通过在离散恢复后可选的图像块预过滤,推理速度可提高高达5倍,同时保留0.989的完整袋效用。

英文摘要

Multiple instance learning (MIL) is the standard approach for whole-slide image (WSI) classification and survival prediction, where attention-based models ag gregate patch features into slide-level predictions. These models treat attention weights as evidence for their predictions, but attention is optimized for classi fication, not for identifying which patches actually support the diagnosis. This conflation leads to three failures: selected patches are insufficient (keeping them alone drops Macro-F1 by 0.078), unnecessary (removing them barely changes the prediction), and unrecoverable (continuous attention scores disagree with discrete patch subsets used at inference). The central premise is that evidence quality should be optimized directly through explicit criteria- Sufficiency, Necessity, and Recov erability (S/N/R)- rather than inherited as a byproduct of classification. GCE-MIL is a backbone-agnostic wrapper implemented through three injection modes and three evidence components: a grounding mechanism that aligns selection with domain-specific concepts, noisy-OR coverage that acts as a differentiable proxy for interventional evidence search, and threshold-plus-repair recovery that converts continuous selectors into discrete subsets through marginal-guided repair. Across 9 backbones and 9 datasets (81 configurations), GCE-MIL improves average Macro-F1 by 0.024 and C-index by 0.014, reduces the continuous-discrete gap by 4-7, and increases complement degradation by 2-4. With optional tile prefiltering after discrete recovery, inference runs up to 5 faster while retaining 0.989 full-bag utility.

2605.17454 2026-05-19 cs.AI 版本更新

Multi-Party Multi-Objective Optimization as Consensus Search: Runtime Analysis of Cross-Party Recombination

多方多目标优化作为共识搜索:交叉 party 再组合的运行时间分析

Xiaolei Fang, Peilan Xu, Wenjian Luo

发表机构 * School of Artificial Intelligence, Nanjing University of Information Science and Technology(信息科学与技术南京大学人工智能学院) Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies, Institute of Cyberspace Security, School of Computer Science and Technology, Harbin Institute of Technology(新型安全智能技术广东省重点实验室,网络安全研究院,计算机科学与技术学院,哈尔滨工业大学)

AI总结 本文研究了多党多目标优化问题中的交叉 party 再组合,通过分析 MP-JCG 和 BPBOMST 问题,证明了基于收益引导突变的基线方法在跨越间隙时存在瓶颈,而改进的 CPR-NSGA-II 变体能够在 O(n log n) 的预期评估次数内发现共同帕累托最优解,并推导了基于边联合再组合和均匀修复的实例参数化预期运行时间界。

Comments 40 pages, 7 figures

详情
AI中文摘要

多党多目标优化问题(MPMOPs)需要自主决策者达成共识,因此不同于扁平化多目标公式。现有多目标进化算法的运行时间理论大多针对单党帕累托前沿近似,无法直接解释MPMOPs中的共同解搜索。我们研究了两种代表性场景中的交叉 party 再组合。在MP-JCG,一个具有显式间隙区域的伪布尔基准上,我们证明了基于收益引导突变的基线方法面临跨越间隙的瓶颈,需要Θ(n²)的预期适应度评估。相比之下,分析型CPR-NSGA-II变种通过直接组装互补前缀和后缀模板,分布在党派种群中,能够在O(n log n)的预期评估次数内发现共同帕累托最优解。与扁平化四目标公式F-JCG相比,我们的全前沿覆盖分析展示了扁平化带来的额外覆盖负担。对于BPBOMST,多党多目标最小生成树问题的双党双目标专业化,我们开发了分层支持覆盖分析。对于每个共同帕累托目标向量,对称平均投影诱导了一个辅助双目标MST实例,合适的支持代表可以产生一个2λ-共同近似覆盖,其中λ∈[1,2]。我们进一步推导了一个代表池CPR-NSGA-II变种的实例参数化预期运行时间界,使用边联合再组合和均匀修复。这个界分离了局部辅助前沿填充、跨党再组合捷径和边联合修复模糊性的影响。

英文摘要

Multi-party multi-objective optimization problems (MPMOPs) require consensus among autonomous decision makers and therefore differ from flattened many-objective formulations. Existing runtime theory for multi-objective evolutionary algorithms is largely tailored to single-party Pareto-front approximation and does not directly explain common-solution search in MPMOPs. We investigate cross-party recombination in two representative settings. On MP-JCG, a pseudo-Boolean benchmark with an explicit gap region, we prove that a payoff-guided mutation baseline faces a gap-crossing bottleneck requiring \(Θ(n^2)\) expected fitness evaluations. In contrast, an analytical CPR-NSGA-II variant discovers both common Pareto-optimal solutions in \(O(n\log n)\) expected evaluations by directly assembling complementary prefix and suffix templates distributed across party populations. Comparing this with the flattened four-objective formulation F-JCG, our full-front coverage analysis illustrates the additional coverage burden introduced by flattening. For BPBOMST, the bi-party, two-objective-per-party specialization of the multi-party multi-objective minimum spanning tree problem, we develop a layered support-cover analysis. For each common Pareto objective vector, the symmetric average projection induces an auxiliary bi-objective MST instance, and suitable support representatives yield a \(2λ\)-common approximation cover with \(λ\in[1,2]\). We further derive an instance-parameterized expected runtime bound for a representative-pool CPR-NSGA-II variant using edge-union recombination and uniform repair. This bound separates the effects of local auxiliary-front filling, cross-party recombination shortcuts, and edge-union repair ambiguity.

2605.17450 2026-05-19 cs.SE cs.AI cs.CL cs.CR 版本更新

ContraFix: Agentic Vulnerability Repair via Differential Runtime Evidence and Skill Reuse

ContraFix:通过差分运行时证据和技能重用进行代理漏洞修复

Simiao Liu, Fang Liu, Li Zhang, Yang Liu, Yinghao Zhu

发表机构 * Beihang University(北京航空航天大学) The University of Hong Kong(香港大学)

AI总结 本文提出ContraFix框架,通过差分运行时证据和可重用的修复技能,解决大型语言模型代理在自动漏洞修复中的语义误解问题,实现了在SEC-Bench和PatchEval上的高准确率。

详情
AI中文摘要

大型语言模型(LLM)代理越来越多地用于自动漏洞修复(AVR),其中仓库级推理使它们能够检查上下文并生成源代码补丁。然而,最近的经验结果表明,这些代理仍然难以处理现实世界中的漏洞。其主要失败模式是语义误解:选择一个修复方向,该方向不匹配根本原因。我们识别出这种差距的两个原因。现有代理通常仅从失败执行进行推理。崩溃报告可以指出程序失败的位置,但无法揭示众多候选者中哪一个变量或状态转换将崩溃行为与安全执行区分开来。因此,代理通常生成症状导向的补丁而不是因果修复。此外,为一个漏洞收集的证据很少被保留,因此后续仓库中的类似案例必须从头开始诊断。我们提出了ContraFix,一种结合差分运行时证据和可重用修复技能的代理AVR框架。其Mutator构造了跨越失败边界的POC变体;其Analyzer在故障区域插入状态探针,并将崩溃和非崩溃执行之间的差异总结为修复规范;其Patcher将规范转换为经过验证的源代码补丁。每次成功的修复都会更新一个包含修复规范和突变策略的双轨技能库,这些通过三层策略在未来实例中检索。在SEC-Bench(C/C++,200个实例)和PatchEval(Go、Python、JavaScript,225个实例)上,ContraFix与GPT-5-mini相比,分别解决了84.0%和73.8%的任务,分别在两个基准上实现了最先进的性能,同时成本低于最强的可比基线。

英文摘要

Large language model (LLM) agents are increasingly used for automated vulnerability repair (AVR), where repository-level reasoning enables them to inspect context and produce source-code patches. However, recent empirical results show that these agents still struggle with real-world vulnerabilities. Their main failure mode is semantic misunderstanding: choosing a repair direction that does not match the root cause. We identify two reasons for this gap. Existing agents usually reason from the failing execution alone. A crash report can pinpoint where the program failed, but it does not reveal which variable or state transition, among many candidates near the fault site, separates the crashing behavior from safe execution. As a result, agents often produce symptom-oriented patches instead of causal fixes. Moreover, evidence collected for one vulnerability is rarely retained, so similar cases in later repositories must be diagnosed again from scratch. We present ContraFix, an agentic AVR framework that couples differential runtime evidence with reusable repair skills. Its Mutator constructs PoC variants that straddle the failure boundary; its Analyzer inserts state probes around the fault region and summarizes divergences between crashing and non-crashing executions into a repair specification; and its Patcher converts the specification into verified source patches. Each successful repair updates a two-track skill base containing repair specifications and mutation strategies, which are retrieved through a three-tier policy for future instances. On SEC-Bench (C/C++, 200 instances) and PatchEval (Go, Python, JavaScript, 225 instances), ContraFix with GPT-5-mini resolves 84.0% and 73.8% of the tasks, respectively, achieving state-of-the-art performance on both benchmarks while costing less than one-third of the strongest comparable baseline.

2605.17449 2026-05-19 cs.CV cs.AI 版本更新

Spatial Blindness in Whole-Slide Multiple Instance Learning

全切片多实例学习中的空间盲区

Xiangyu Li, Ran Su

发表机构 * College of Intelligence and Computing(智能与计算学院)

AI总结 本文研究了全切片多实例学习中由于空间信息处理不足导致的分类误差问题,提出ResTopoMIL模型通过引入不变原型直方图和坐标洗牌约束来提升模型对空间关系的敏感性,从而在多个公开数据集上提升了分类和生存预测性能。

Comments 28 pages, 8 figures, 16 tables

详情
AI中文摘要

全切片MIL模型通常被称为上下文感知模型,当将图网络、Transformer或状态空间模块置于补丁嵌入之上时。我们证明这种标签可能具有误导性。在病理任务中,组织结构是诊断信号的一部分,几个强大的MIL基线在补丁坐标随机排列后,滑片级别AUC几乎未变。它们的预测准确,但大多具有组合性。我们将其失败模式称为空间盲区。我们的解释是基于优化的:在滑片级监督下,密集的外观统计信息被早期学习,留下弱梯度用于稀疏的空间关系。ResTopoMIL通过首先拟合一个排列不变的原型直方图,然后冻结它,同时一个轻量级图分支在坐标洗牌约束下学习残差来解决这个问题。该架构设计简单;干预在于如何训练空间分支。在9个公开WSI基准上,ResTopoMIL在1.15M参数下提升了分类和生存预测性能,恢复了对坐标扰动的敏感性,并在CAMELLYON-16上提供了更强的局部化证据。

英文摘要

Whole-slide MIL models are often called context-aware once graphs, Transform ers, or state-space modules are placed above patch embeddings. We show that this label can be deceptive. On pathology tasks where tissue architecture is part of the diagnostic signal, several strong MIL baselines retain nearly unchanged slide level AUC after patch coordinates are permuted. Their predictions are accurate, but largely compositional. We refer to this failure mode as spatial blindness. Our explanation is optimization-based: dense appearance statistics are learned early under slide-level supervision, leaving weak gradients for sparse spatial relations. ResTopoMIL addresses the issue by first fitting a permutation-invariant prototype histogram and then freezing it while a lightweight graph branch learns the residual under a coordinate-shuffling constraint. The architecture is simple by design; the intervention is in how the spatial branch is trained. Across 9 public WSI bench marks, ResTopoMIL improves classification and survival prediction with 1.15M parameters, restores sensitivity to coordinate perturbation, and gives stronger lo calization evidence on CAMELYON-16.

2605.17444 2026-05-19 cs.SE cs.AI cs.CL 版本更新

MemRepair: Hierarchical Memory for Agentic Repository-Level Vulnerability Repair

MemRepair:用于代理级漏洞修复的分层内存

Simiao Liu, Li Zhang, Fang Liu, Xiaoli Lian, Yang Liu, Yinghao Zhu

发表机构 * Beihang University(北京航空航天大学) The University of Hong Kong(香港大学)

AI总结 本研究提出MemRepair,一种增强记忆的代理框架,通过分层记忆和动态反馈循环提高漏洞修复的可靠性,实现了在多个仓库级别的漏洞修复基准上的高修复率。

详情
AI中文摘要

现代软件生态系统面临越来越多披露的漏洞,增加了需要在仓库规模上可靠运行的自动化修复技术的需求。尽管基于大语言模型(LLM)的代理最近在自动化漏洞修复(AVR)中显示出潜力,但大多数现有系统仍然将修复视为在当前可见代码上下文中的一次生成步骤。因此,它们缺乏重用先前修复或从失败验证尝试中学习的持久机制,这限制了它们在复杂、多文件修复任务上的有效性。我们提出了MemRepair,一种增强记忆的代理框架,将漏洞修复视为一个迭代、经验驱动的过程。MemRepair结合了三个互补的记忆层,即History-Fix、Security-Pattern和Refinement-Trajectory记忆,并通过动态反馈驱动的细化循环。这种设计使代理能够检索仓库特定的修复惯例,应用可重用的安全防御,并利用先前的“失败到成功”轨迹来根据运行时证据修正语义无效的补丁。我们评估了MemRepair在三个具有代表性的仓库级别漏洞修复基准上的表现:SEC-Bench、PatchEval(Python、Go、JavaScript)以及Multi-SWE-bench的C++子集。MemRepair在三个基准上分别实现了58.0%、58.2%和30.58%的修复率,优于强大的通用代理如OpenHands和SWE-agent,以及专用的AVR工具InfCode-C++,同时保持竞争性的修复成本。这些结果表明,持久的、分层的修复记忆可以显著提高跨多种语言和仓库设置的代理漏洞修复的可靠性。

英文摘要

Modern software ecosystems face a rapidly growing number of disclosed vulnerabilities, increasing the need for automated repair techniques that can operate reliably at repository scale. Although Large Language Model (LLM)-based agents have recently shown promise for automated vulnerability repair (AVR), most existing systems still treat repair as a single generation step over the currently visible code context. As a result, they lack a persistent mechanism for reusing prior fixes or learning from failed validation attempts, which limits their effectiveness on complex, multi-file repair tasks. We present MemRepair, a memory-augmented agentic framework that formulates vulnerability repair as an iterative, experience-driven process. MemRepair combines three complementary memory layers, i.e., History-Fix, Security-Pattern, and Refinement-Trajectory memories, with a dynamic feedback-driven refinement loop. This design allows the agent to retrieve repository-specific repair conventions, apply reusable security defenses, and exploit prior "failure-to-success" trajectories to revise semantically invalid patches based on runtime evidence. We evaluate MemRepair on three representative repository-level vulnerability repair benchmarks: SEC-Bench, PatchEval (Python, Go, JavaScript), and the C++ subset of Multi-SWE-bench. MemRepair achieves state-of-the-art resolution rates of 58.0%, 58.2%, and 30.58%, respectively, outperforming strong general-purpose agents such as OpenHands and SWE-agent, as well as the specialized AVR tool InfCode-C++, while maintaining competitive repair cost. These results show that persistent, hierarchical repair memory can substantially improve the reliability of agentic vulnerability repair across diverse languages and repository settings.

2605.17442 2026-05-19 cs.CL cs.AI cs.IR 版本更新

Beyond Catalogue Counts: the Dataset Visibility Asymmetry in Low-Resource Multilingual NLP

超越目录计数:低资源多语言NLP中的数据集可见性不对称

Zhiyin Tan, Changxu Duan

发表机构 * L3S Research Center, Leibniz University Hannover(莱布尼茨汉诺威大学L3S研究中心) Technische Universität Darmstadt(达姆施塔特技术大学)

AI总结 本研究探讨了多语言NLP中数据集可见性不对称问题,通过结合目录基准和文献证据,提出了资源密度指数(RDI)来衡量语言的数据集可见性,揭示了大量语言在目录记录中数据贫乏但文献中存在明显数据集活动的现象。

Comments Accepted at the 15th edition of the Language Resources and Evaluation Conference (LREC 2026)

详情
AI中文摘要

多语言NLP常常依赖于集中式目录中的数据集计数来确定哪些语言是资源丰富或贫乏的。然而,这些目录只记录了数据集可见性的一层:哪些数据集已被注册或机构分发。它们不一定反映哪些数据集在研究文献中被创建、引用或重用。为了考察这一差距,我们结合基于目录的基准与文献支持的数据集流通证据。我们引入了资源密度指数(RDI),定义为每一百万使用者的数据集数量,并计算了乙努诺格(Ethnologue)中200种最广泛使用的语言的RDI。其中,118种语言(59%)在LRE地图和语言数据 consortium(LDC)中平均RDI为零,另有23种语言低于0.1,对应每十万使用者最多一个目录数据集。然后,我们利用LLM辅助的引用挖掘流程处理Semantic Scholar语料库中的这141种低可见性语言。经过人工验证和整合,我们识别出53种语言中的609个唯一数据集,其中356个仍通过工作公共链接公开访问。这些结果揭示了显著的可见性差距:许多大使用者语言在目录记录中数据贫乏,但在研究文献中显示明显的数据集活动。我们的发现表明,多语言数据稀缺不仅应被视为生产问题,还应被视为文档、可发现性和长期可访问性的问题。代码和数据可在(https://github.com/zhiyintan/dataset-visibility-asymmetry)公开获取。

英文摘要

Multilingual NLP often relies on dataset counts from centralized catalogues to characterize which languages are resource-rich or resource-poor. However, these catalogues record only one layer of dataset visibility: what has been registered or institutionally distributed. They do not necessarily reflect which datasets are created, cited, or reused in the research literature. To examine this gap, we combine a catalogue-based baseline with literature-backed evidence of dataset circulation. We introduce the Resource Density Index (RDI), defined as the number of catalogued datasets per one million speakers, and compute it for the 200 most widely spoken languages in Ethnologue. Among them, 118 languages (59%) have an average RDI of zero across the LRE Map and the Linguistic Data Consortium (LDC), and another 23 fall below 0.1, corresponding to at most one catalogued dataset per ten million speakers. We then apply an LLM-assisted citation-mining pipeline over the Semantic Scholar corpus to these 141 low-visibility languages. After manual validation and consolidation, we identify 609 unique datasets across 53 languages, of which 356 remain openly accessible through working public links. These results reveal a substantial visibility gap: many large-speaker languages appear data-poor in catalogue records yet show clear evidence of dataset activity in the research literature. Our findings suggest that multilingual data scarcity should be understood not only as a production problem, but also as a question of documentation, discoverability, and long-term accessibility. Code and data are publicly available at (https://github.com/zhiyintan/dataset-visibility-asymmetry).

2605.17431 2026-05-19 cs.LG cs.AI 版本更新

MATE: Solving Contextual Markov Decision Processes with Memory of Accumulated Transition Embeddings

MATE:利用累积转移嵌入记忆解决上下文马尔可夫决策过程

Himchan Hwang, Hyeokju Jeong, Gene Chung, Seungyeon Kim, Sangwoong Yoon, Frank Chongwoo Park

发表机构 * Seoul National University(首尔国立大学) Ulsan National Institute of Science and Technology (UNIST)(釜山国立科学技术研究所(UNIST))

AI总结 MATE通过使用累积转移嵌入的记忆架构,解决了由未观察上下文参数化的上下文马尔可夫决策过程(CMDPs),在保持后验信念的同时,避免了传统方法的计算和梯度问题,实现了高效且性能优异的解决方案。

详情
AI中文摘要

我们提出了MATE,一种简单而有效的记忆架构,用于解决由未观察上下文参数化的上下文马尔可夫决策过程(CMDPs)。在CMDPs中,最优智能体可以通过维持上下文的后验信念来在线适应。MATE用求和聚合的记忆替代了不可行的后验,利用后验的排列不变性来保留可证明的充分表达性。与先前的记忆架构相比,MATE避免了Transformer的逐步展开成本增长和与循环神经网络(RNNs)通常相关的梯度问题。在多样化的基准测试中,MATE展示了清晰的计算优势,同时实现了与标准序列模型基线相当的性能。

英文摘要

We propose MATE, a simple yet effective memory architecture for solving Contextual Markov Decision Processes (CMDPs), a family of MDPs parameterized by an unobserved context. In CMDPs, an optimal agent can adapt online by maintaining the posterior belief over contexts. MATE replaces this intractable posterior with a sum-aggregated memory, leveraging the posterior's permutation invariance to retain provably sufficient expressiveness. Compared to prior memory architectures, MATE avoids the growing per-step rollout cost of Transformers and the gradient issues commonly associated with Recurrent Neural Networks (RNNs). Extensive evaluations across diverse benchmarks demonstrate that MATE provides clear computational advantages while achieving performance comparable to standard sequence-model baselines.

2605.17428 2026-05-19 cs.LG cs.AI 版本更新

Progressive Generalization Augmentation with Deeply Coupled RND-PPO and Domain-Prioritized Noise Injection for Robust Crop Management Reinforcement Learning

渐进泛化增强:结合深度耦合RND-PPO和领域优先噪声注入的稳健作物管理强化学习

Wu Yang

发表机构 * Chongho Bridge Group Limited(中宏桥梁集团有限公司)

AI总结 本文提出了一种渐进泛化增强方法,通过深度耦合RND-PPO和领域优先噪声注入,解决农业强化学习中早期学习效率与后期泛化能力的平衡、内在和外在奖励的简单加法结合以及统一噪声注入策略的问题,从而提高作物管理的鲁棒性。

详情
AI中文摘要

我们在gym-DSSAT玉米灌溉任务上的初步实验表明,±2摄氏度的温度噪声会导致在清洁条件下训练的PPO策略的经济收益减少11.9% - 这是现有研究未充分解决的系统性鲁棒性缺陷。本文针对阻碍农业RL系统实际部署的三个相互关联的限制:早期阶段学习效率与后期阶段泛化能力之间的权衡;探索增强PPO中内在和外在奖励的简单加法结合;以及忽视农业状态变量经验证实的差异敏感性的统一测量噪声注入策略。我们引入了三个系统性的创新:渐进泛化增强(PGA),实现一个三阶段课程(清洁训练0-800次回合,渐进800-1200次回合,完整增强1200-2000次回合);深度耦合RND-PPO架构,具有双通道GAE归一化、进度衰减的内在系数和语义离散化;以及领域优先噪声注入,具有层次激活。我们的实验评估显示:在佛罗里达州,相比最先进的BERT-DQN,产量提高了8.43%,氮肥利用效率提高了16.42%;在阿拉贡,产量提高了5.61%(尽管由于恶劣的地中海气候,经济评分降低了3.67%);在综合扰动下,性能保留率分别为94.4% vs 80.0%。所有实验均使用5个随机种子,在NVIDIA A100 GPU上进行,每运行约4.2±0.3小时(2000次回合,2048步缓冲区,64 mini-batch大小)。

英文摘要

Our preliminary experiments on gym-DSSAT maize irrigation tasks revealed that +/-2 degrees C temperature noise causes an 11.9% reduction in economic returns for PPO policies trained under clean conditions - a systematic robustness deficit that existing research has not adequately addressed. This paper tackles three interconnected limitations impeding practical deployment of agricultural RL systems: the trade-off between early-stage learning efficiency and late-stage generalization capability; the naive additive combination of intrinsic and extrinsic rewards in exploration-augmented PPO; and uniform measurement noise injection strategies that disregard empirically validated differential sensitivity across agricultural state variables. We introduce three systematic innovations: Progressive Generalization Augmentation (PGA) implementing a three-phase curriculum (clean training 0-800 episodes, progressive 800-1200, full augmentation 1200-2000); a deeply coupled RND-PPO architecture with dual-channel GAE normalization, progress-decayed intrinsic coefficients, and semantic discretization; and domain-prioritized noise injection with hierarchical activation. Our experimental evaluation demonstrates: 8.43% yield improvement and 16.42% nitrogen use efficiency improvement over SOTA BERT-DQN in Florida; 5.61% yield improvement in Zaragoza (though 3.67% lower economic score due to challenging Mediterranean climate); and 94.4% vs 80.0% performance retention under combined perturbations. All experiments used 5 random seeds on NVIDIA A100 GPUs with 4.2+/-0.3 hours per run (2000 episodes, 2048-step buffer, 64 mini-batch size).

2605.17419 2026-05-19 cs.LG cs.AI 版本更新

Learning Displacement-Robust Representations for Landslide Early Warning under Rainfall Forecast Uncertainty

学习位移鲁棒的表示以在降雨预报不确定性下进行滑坡预警

Ren Ozeki, Hamada Rizk, Hirozumi Yamaguchi

发表机构 * Osaka University(大阪大学) RIKEN Center for Computational Science(理化学研究所计算科学中心) Tanta University(塔塔大学)

AI总结 本文提出了一种鲁棒于降雨场位移的滑坡预警系统,通过学习降雨和地形数据的潜在表示,以提高在降雨预报不确定性下的滑坡预测精度。

详情
AI中文摘要

由降雨引发的滑坡已成为全球范围内日益增长的风险,因为气候变化加剧了极端降雨事件。为了提供足够的撤离时间,实时灾害监测的滑坡预警系统(LEWS)必须通过整合观测降雨与短期降雨预报来估计近未来滑坡风险,这些预报来自时空环境数据流。尽管最近的滑坡预测方法通过统计和深度学习方法提高了预测性能,但大多数方法假设降雨输入是准确的。然而,在实际应用中,滑坡预测依赖于降雨预报,这些预报通常包含由于预测不确定性导致的降雨场空间位移。这种位移会改变局部累积降雨并降低预测准确性。为了解决这一挑战,我们提出了一种新的LEWS,其对降雨场位移具有鲁棒性。关键思想是学习降雨和地形数据的潜在表示,这些表示在降雨场运动中的位移下保持稳定,从而实现可靠的地理空间数据整合以估计滑坡风险。滑坡预测模型通过使用降雨-运动-感知对比学习(RMCL)进行训练,该方法引入了时间相关的降雨场扰动以模拟预报引起的降雨驱动时空环境数据流中的位移。实验使用了日本两年的降雨和地形数据,覆盖了19个地区中的滑坡事件。所提出的系统在精度上比最先进的基线高出高达37%。这些结果表明,将降雨建模为移动的空间场并在学习过程中处理降雨场位移显著提高了操作预警系统中短期滑坡预测的可靠性。

英文摘要

Rainfall-induced landslides pose a growing risk worldwide as climate change intensifies extreme rainfall events. To provide sufficient evacuation time, landslide early warning systems (LEWS) for real-time disaster monitoring must estimate near-future landslide risk by integrating observed rainfall with short-term rainfall forecasts from spatio-temporal environmental data streams. Although recent landslide prediction methods have improved predictive performance using statistical and deep learning approaches, most assume accurate rainfall inputs. In operational settings, however, landslide prediction relies on rainfall forecasts, which often contain spatial displacement of rainfall fields due to forecasting uncertainties. Such displacement can alter local accumulated rainfall and degrade prediction accuracy. To address this challenge, we propose a novel LEWS robust to rainfall field displacement. The key idea is to learn latent representations from rainfall and terrain data that remain stable under displacement in rainfall field motion, enabling reliable geospatial data integration for landslide risk estimation. The landslide prediction model is trained using Rainfall-Motion-Aware Contrastive Learning (RMCL), which introduces temporally correlated rainfall field perturbations to emulate forecast-induced displacement in rainfall-driven spatio-temporal environmental data streams. Experiments were conducted using two years of rainfall and terrain data across Japan, covering 19 regions with landslide events. The proposed system achieved up to 37% higher precision than state-of-the-art baselines. These results demonstrate that modeling rainfall as a moving spatial field and addressing rainfall field displacement during learning significantly improve the reliability of short-term landslide prediction in operational early warning systems.

2605.17416 2026-05-19 cs.SE cs.AI 版本更新

Benchmarking Mythos-Linked Bug Rediscovery

基于Mythos链接的Bug重发现基准测试

Isaac David, Arthur Gervais

发表机构 * University College London(伦敦大学学院)

AI总结 本文基于Mythos链接的公开系统任务,对六个目标文件进行受控重发现实验,评估不同模型在无CVE标识等信息情况下发现核心bug的能力,结果显示仅6个目标匹配。

详情
AI中文摘要

Anthropic在2026年4月的Mythos材料中结合了基准声明与OpenBSD、FreeBSD、Linux、FFmpeg和浏览器等系统的具体bug发现故事。本文报告了一个针对六个公开或高信心的Mythos链接系统任务的受控目标文件重发现实验。每个模型接收相同的目标文件或文件、只读源代码工具、每个任务三次重复,并使用一个手动目标匹配标准;提示中省略CVE标识符、补丁哈希、公告文本、作者姓名、披露日期和答案键根本原因语言。实验包含54次模型-任务尝试:三个模型、六个任务和三次重复,每个模型有18次尝试。GPT-5.5 xhigh实现了5/18次目标重发现,覆盖2/6个任务;单独计算一次错误目标mpegts.c发现,得到3/6个不同的核心bug。Claude Opus 4.7实现了1/18次目标重发现,覆盖1/6个任务。Kimi K2记录了0/18次目标重发现。主要失败模式是模型在分配文件内过早承诺于看似合理的替代候选:模型通常提交基于源代码的假设,但错过了由公开Mythos补丁证据纠正的特定不变性。这些结果并未反驳Anthropic未披露的工作流程,但显示在该有利的目标文件支架下,系统特定的提示仅能产生6次目标匹配,共计54次尝试。

英文摘要

Anthropic's April 2026 Mythos materials combine benchmark claims with concrete bug-finding stories across OpenBSD, FreeBSD, Linux, FFmpeg, and browsers. This paper reports a controlled target-file rediscovery experiment on six public or high-confidence Mythos-linked systems tasks. Each model receives the same target file or files, read-only source tools, three repeats per task, and one manual target-matching rubric; prompts omit CVE identifiers, patch hashes, advisory text, author names, disclosure dates, and answer key root cause language. The experiment contains 54 counted model-task attempts: three models, six tasks, and three repeats, giving 18 attempts per model. GPT-5.5 xhigh achieves 5/18 target rediscoveries, covering 2/6 tasks; counting one wrong-target mpegts.c finding separately gives 3/6 distinct core bugs. Claude Opus 4.7 achieves 1/18 target rediscoveries, covering 1/6 tasks. Kimi K2 records 0/18 target rediscoveries. The dominant failure mode is early commitment to plausible alternate candidates within the assigned file: models often submit source-grounded hypotheses while missing the specific invariant corrected by public Mythos patch evidence. These results do not refute Anthropic's undisclosed workflow, but show that under this favorable target-file scaffold, systems-specific prompting yields only six target matches across 54 counted attempts.

2605.17413 2026-05-19 cs.CR cs.AI 版本更新

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

消除安全性:用于安全应用的语言模型对齐机制

Isaac David, Arthur Gervais

发表机构 * University College London(伦敦大学学院)

AI总结 该研究探讨了通过受控转换评估协议去除语言模型对齐机制的方法,评估了安全任务中的拒绝率、尝试率、安全成功率、一般能力保留、不稳定性及超出范围的不安全合规性,证明了去除对齐作为效用-风险前沿的重要性,而非单纯的去审查配方。

详情
AI中文摘要

安全对齐的语言模型经常拒绝那些用词类似于滥用的网络安全请求,即使任务已授权且具有防御性。这使得安全评估变得模糊:失败的回答可能反映能力缺失或拒绝策略干预。Ablating Safety研究将对齐去除作为受控转换评估协议,比较了授权上下文提示、可逆拒绝方向激活投影、表示控制投影、以及基于LoRA的去对齐或任务适应。我们评估了安全-AR,一个包含60个提示的授权安全、良性通用和非操作溢出探测的测试集。报告的运行包括一个四模型投影试点(416次完成)、一个三模型Qwen2.5 LoRA扩展(1,980次保留完成)、表示和稳健性扫描,以及可执行的安全修复验证器。单向量拒绝投影仅将平均安全分数从0.46提高到0.50,同时将不安全合规性从0.10提高到0.47;排名4的拒绝子空间投影达到0.51,同时匹配对齐溢出率。仅任务的LoRA将平均安全分数提高到0.87,一般分数为0.83,不安全合规性为0.13,而带有保留的拒绝抑制将溢出提高到0.27。这些结果支持将对齐去除作为效用-风险前沿进行评估,而非作为去审查配方,并将合规性单独视为既不表示能力也不安全部署。

英文摘要

Safety-aligned language models often refuse cybersecurity requests whose wording resembles misuse, even when the task is authorized and defensive. This makes security evaluation ambiguous: a failed answer may reflect missing capability or refusal-policy intervention. Ablating Safety studies alignment removal as a controlled transformation-evaluation protocol for authorized security tasks, comparing authorized-context prompting, reversible refusal-direction activation projection, representation-control projections, and LoRA-based de-alignment or task adaptation. We evaluate refusal, attempt rate, validated security success, general-capability retention, instability, and out-of-scope unsafe compliance on Security-AR, a 60-prompt suite of authorized security, benign general, and non-operational spillover probes. The reported runs include a four-model projection pilot with 416 completions, a three-model Qwen2.5 LoRA extension with 1,980 held-out completions, representation and robustness sweeps, and executable secure-repair validators. Single-vector refusal projection raises mean security score only from 0.46 to 0.50 while increasing unsafe compliance from 0.10 to 0.47; rank-4 refusal-subspace projection reaches 0.51 while matching the aligned spillover rate. Task-only LoRA raises mean security score to 0.87 with general score 0.83 and unsafe compliance 0.13, while refusal-suppression with retention raises spillover to 0.27. These results support evaluating alignment removal as a utility-risk frontier, not as an uncensoring recipe, and treating compliance alone as neither competence nor safe deployment.

2605.17410 2026-05-19 cs.AI 版本更新

Computational Challenges in Token Economics: Bridging Economic Theory and AI System Design

令牌经济学中的计算挑战:连接经济理论与AI系统设计

Ou Wu, Yingjun Deng

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) Hefei Institutes of Physical Science, Chinese Academy of Sciences(中国科学院合肥物理研究所)

AI总结 本文探讨了在大规模语言模型系统中,将令牌作为经济原语时所面临的计算挑战,提出了计算令牌经济学的概念和令牌经济学三元论,旨在建立连接令牌经济学与AI系统设计的研究议程。

Comments 43 pages

详情
AI中文摘要

令牌经济学已逐渐成为理解大型语言模型系统中资源分配、价值创造和定价的一个有用的视角。尽管近期的研究越来越多地将令牌视为经济原语,但高水平的经济理论与现代AI基础设施的计算现实之间仍存在显著的差距。本文识别并分析了在实时推理系统中实施令牌经济原则时出现的关键计算挑战。我们主张计算可行性不仅仅是令牌经济学的一个维度,而是其支配约束:这些挑战是由精细估值、低延迟执行和在不确定性下的分配最优性之间根本矛盾驱动的。为了结构化这个问题空间,我们引入了计算令牌经济学的概念,并提出了令牌经济学三元论——一个条件无免费午餐原则,捕捉了粒度、实时性能和最优性之间的固有权衡。我们进一步将主要技术挑战分为三个领域:实时价值会计、受限资源分配和经济感知的系统架构。与其提供完整的解决方案,本文旨在定义连接令牌经济学与AI系统设计的研究议程,突出计算经济学、机器学习系统和AI基础设施交汇处的开放问题。

英文摘要

Token economics has emerged as a useful lens for understanding resource allocation, value creation, and pricing in large language model systems. While recent work has increasingly treated tokens as economic primitives, there remains a substantial gap between high-level economic theory and the computational realities of modern AI infrastructure. This paper identifies and analyzes the key computational challenges that arise when token-economic principles are implemented in real-time inference systems. We argue that computational feasibility is not merely one dimension of token economics, but its governing constraint: these challenges are driven by fundamental tensions among fine-grained valuation, low-latency execution, and allocation optimality under uncertainty. To structure this problem space, we introduce the notion of \textbf{Computational Token Economics} and propose the \textbf{Token Economics Trilemma} -- a conditional no-free-lunch principle that captures the inherent trade-offs among granularity, real-time performance, and optimality. We further categorize the main technical challenges into three areas: real-time value accounting, constrained resource allocation, and economic-aware system architecture. Rather than presenting a complete solution, this paper aims to define a research agenda for bridging token economics and AI system design, highlighting open problems at the intersection of computational economics, machine learning systems, and AI infrastructure.

2605.17393 2026-05-19 cs.AI cs.LG cs.MA 版本更新

Heterogeneous Information-Bottleneck Coordination Graphs for Multi-Agent Reinforcement Learning

异质信息瓶颈协调图用于多智能体强化学习

Wei Duan, Junyu Xuan, En Yu, Xiaoyu Yang, Jie Lu

发表机构 * Australian Artificial Intelligence Institute (AAII)(澳大利亚人工智能研究所)

AI总结 本文提出异质信息瓶颈协调图(HIBCG),通过理论指导机制解决多智能体强化学习中协调图的边存在性和信息传递容量分配问题,通过信息瓶颈方法构建组对齐的块对角先验,实现边存在性和信息容量的理论验证。

详情
AI中文摘要

协调图是合作多智能体强化学习(MARL)中的核心抽象,然而现有的稀疏图学习者缺乏理论基础的机制来决定哪些边应存在以及每条边应携带多少信息。当前方法依赖于启发式标准,无法保证学习到的拓扑结构的正式保证,并且没有系统的方法来分配不同的通信容量以处理结构不同的智能体关系。为了解决这个问题,我们提出了异质信息瓶颈协调图(HIBCG),它学习了一个组感知的稀疏图,在其中边的存在性和信息容量都得到了理论支持。通过图信息瓶颈(GIB)作为底层工具,HIBCG首先构建了一个组对齐的块对角先验,提供了一个闭式标准用于边保留——确定哪些边应该存在以及每个组块的密度——然后在所得到的拓扑上控制每个智能体的特征带宽,压缩信息以保留仅与任务相关的内容。我们证明了组对齐的先验严格收紧拓扑学习的变分界,目标分解为每个组块,实现了微分边控制,且容量分配遵循水填充原则。

英文摘要

Coordination graphs are a central abstraction in cooperative multi-agent reinforcement learning (MARL), yet existing sparse-graph learners lack a theoretically grounded mechanism to decide which edges should exist and how much information each edge should carry. Current methods rely on heuristic criteria that offer no formal guarantee on the learned topology, and no principled way to allocate different communication capacities to structurally different agent relationships. To address this, we propose Heterogeneous Information-Bottleneck Coordination Graphs (HIBCG), which learns a group-aware sparse graph in which both edge existence and message capacity are theoretically justified. With the graph information bottleneck (GIB) serving as the underlying tool, HIBCG first constructs a group-aligned block-diagonal prior that provides a closed-form criterion for edge retention -- determining which edges should exist and at what density per group block -- and then controls per-agent feature bandwidth on the resulting topology, compressing messages to retain only task-relevant content. We prove that the group-aligned prior strictly tightens the variational bound on topology learning, that the objective decomposes per group block, enabling differential edge control, and that capacity allocation follows a water-filling principle.

2605.17382 2026-05-19 cs.AI cs.CL cs.GR 版本更新

QQJ: Quantifying Qualitative Judgment for Scalable and Human-Aligned Evaluation of Generative AI

QQJ: 量化定性判断以实现可扩展且与人类对齐的生成AI评估

Marjan Veysi, Pirooz Shamsinejadbabaki, Mohammad Zare, Mohammad Sabouri

发表机构 * AI Lab, Arioobarzan Engineering Team(艾伊罗巴赞工程团队人工智能实验室) Department of Computer Engineering and Information Technology(计算机工程与信息科技系) Department of Informatics, Bioengineering, Robotics and Systems Engineering(信息学、生物工程、机器人与系统工程系) University of Genoa(热那亚大学)

AI总结 本文提出QQJ框架,通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器,实现与人类判断一致的可扩展评估方法,验证了结构化定性判断在大规模应用中的有效性。

详情
AI中文摘要

生成人工智能的快速发展暴露了现有评估方法的根本局限,尤其是在开放性、创造性和面向人类的任务中。传统自动指标依赖于表面统计相似性,往往无法反映人类对质量的感知,而纯粹的人类评估虽然可靠,但成本高、主观性强且难以扩展。最近利用大语言模型作为评估者的做法虽然提高了可扩展性,但通常缺乏明确的人类定义评估原则,导致偏见和不一致。本文介绍Quantifying Qualitative Judgment (QQJ),一种可扩展且以人类为中心的评估框架,通过专家设计的多维评分标准和高质量标注集校准大语言模型评估器,以实现人类判断与自动化评估之间的桥梁。这种设计使在多样化的生成任务和模态上实现了一致、可解释和可扩展的评估。在文本和图像生成上的大量实验表明,QQJ在与人类判断的一致性方面优于传统自动指标和无约束的大语言模型评估者。此外,QQJ在重复评估中表现出更高的稳定性,并在识别关键失败模式如幻觉和意图不匹配方面具有更好的诊断能力。这些结果表明,结构化的定性判断可以在不牺牲可解释性和人类对齐的情况下实现规模化应用,使QQJ成为现代生成AI系统可靠评估的实用基础。

英文摘要

The rapid progress of generative artificial intelligence has exposed fundamental limitations in existing evaluation methodologies, particularly for open-ended, creative, and human-facing tasks. Traditional automatic metrics rely on surface-level statistical similarity and often fail to reflect human perceptions of quality, while purely human evaluation, although reliable, is costly, subjective, and difficult to scale. Recent approaches using large language models as evaluators offer improved scalability but frequently lack explicit grounding in human-defined evaluation principles, leading to bias and inconsistency. In this paper, we introduce Quantifying Qualitative Judgment (QQJ), a scalable and human-centric evaluation framework that explicitly bridges the gap between human judgment and automated assessment. QQJ separates the definition of quality from its execution by anchoring evaluation in expert-designed, multi-dimensional rubrics and calibrating large language model evaluators to align with expert reasoning using a small, high-quality annotation set. This design enables consistent, interpretable, and scalable evaluation across diverse generative tasks and modalities. Extensive experiments on text and image generation demonstrate that QQJ achieves substantially stronger alignment with human judgment than traditional automatic metrics and unconstrained LLM-based evaluators. Moreover, QQJ exhibits improved stability across repeated evaluations and superior diagnostic capability in identifying critical failure modes such as hallucination and intent mismatch. These results indicate that structured qualitative judgment can be operationalized at scale without sacrificing interpretability or human alignment, positioning QQJ as a practical foundation for reliable evaluation of modern generative AI systems.

2605.17380 2026-05-19 cs.AI cs.CR cs.LG 版本更新

ADR: An Agentic Detection System for Enterprise Agentic AI Security

ADR:一种用于企业代理AI安全的代理检测系统

Chenning Li, Pan Hu, Justin Xu, Baris Ozbas, Olivia Liu, Caroline Van, Manxue Li, Wei Zhou, Mohammad Alizadeh, Pengyu Zhang, KK Sriramadhesikan, Ming Zhang

发表机构 * Uber

AI总结 本文提出ADR系统,一种大规模、经过生产验证的企业框架,用于安全地管理通过模型上下文协议(MCP)运行的AI代理。该系统解决了三个关键问题:观测有限、鲁棒性不足和检测成本高,并通过三个组件实现了这些目标:ADR传感器、ADR探索器和ADR检测器。

Comments Accepted at MLSys 2026 (Industry Track)

详情
AI中文摘要

我们提出了代理AI检测与响应(ADR)系统,这是首个大规模、经过生产验证的企业框架,用于安全地管理通过模型上下文协议(MCP)运行的AI代理。我们识别出该领域存在的三个持续挑战:(1)观测有限——现有的终端检测与响应(EDR)工具只能看到文件写入,而无法看到代理推理、提示或连接意图到执行的因果链;(2)鲁棒性不足——静态防御受限于预定义规则,无法在多样化的攻击技术和企业环境中泛化;(3)高检测成本——基于LLM的推理在大规模上成本过高。ADR通过三个组件解决这些挑战:ADR传感器用于高保真的代理遥测,ADR探索器用于系统性的预部署红队行动和困难示例生成,以及ADR检测器用于可扩展的、两阶段在线检测,结合快速初步筛查与上下文感知推理。在Uber部署超过十个月,ADR在生产中保持了可靠的检测,随着采用的增加,已覆盖超过7,200个唯一主机,每天处理超过10,000个代理会话,发现了数百个凭证泄露,涵盖26类,并启用了向左预防层(97.2%的精度,206个检测到的凭证)。为了验证该方法并促进社区采用,我们引入了ADR-Bench(302个任务,17种技术,133个MCP服务器),其中ADR实现了零误报,同时检测了67%的攻击——在F1分数上,比三个最先进的基线(ALRPHFS、GuardAgent、LlamaFirewall)高出2-4倍。在AgentDojo(公共提示注入基准)上,ADR检测了所有攻击,仅在93个任务中产生了3个误报。

英文摘要

We present the Agentic AI Detection and Response (ADR) system, the first large-scale, production-proven enterprise framework for securing AI agents operating through the Model Context Protocol (MCP). We identify three persistent challenges in this domain: (1) limited observability -- existing Endpoint Detection and Response (EDR) tools see file writes but not the agent reasoning, prompts, or causal chains linking intent to execution; (2) insufficient robustness -- static defenses constrained by pre-defined rules fail to generalize across diverse attack techniques and enterprise contexts; and (3) high detection costs -- LLM-based inference is prohibitively expensive at scale. ADR addresses these challenges via three components: the ADR Sensor for high-fidelity agentic telemetry, the ADR Explorer for systematic pre-deployment red teaming and hard-example generation, and the ADR Detector for scalable, two-tier online detection combining fast triage with context-aware reasoning. Deployed at Uber for over ten months, ADR has sustained reliable detection in production with growing adoption reaching over 7,200 unique hosts and processing over 10,000 agent sessions daily, uncovering hundreds of credential exposures across 26 categories and enabling a shift-left prevention layer (97.2% precision, 206 detected credentials). To validate the approach and enable community adoption, we introduce ADR-Bench (302 tasks, 17 techniques, 133 MCP servers), where ADR achieves zero false positives while detecting 67% of attacks -- outperforming three state-of-the-art baselines (ALRPHFS, GuardAgent, LlamaFirewall) by 2--4x in F1-score. On AgentDojo (public prompt injection benchmark), ADR detects all attacks with only three false alarms out of 93 tasks.

2605.17379 2026-05-19 cs.CL cs.AI 版本更新

Learning Faster with Better Tokens: Parameter-Efficient Vocabulary Adaptation for Specialized Text Summarization

通过更好的令牌学习:用于专业文本摘要的参数高效词汇适应

Gunjan Balde, Soumyadeep Roy, Mainack Mondal, Niloy Ganguly

发表机构 * Dept. of Computer Science and Engg., IIT Kharagpur(印度Kharagpur理工学院计算机科学与工程系) Dept. of Medicine (Biomedical Informatics), Stanford University(斯坦福大学医学院(生物医学信息学))

AI总结 本文提出了一种参数高效的领域适应方法,通过结合词汇适应和预训练,提升大型语言模型在专业领域文本摘要任务中的性能,同时减少训练时间和参数数量。

Comments 16 pages. Accepted in the 64th Annual Meeting of the Association for Computational Linguistics [ACL (Main) 2026] as a long paper

详情
AI中文摘要

预训练在通用领域语料库上的大型语言模型在应用于专门领域时常常表现出令牌化效率低下。尽管连续预训练用于领域适应在一定程度上缓解了性能下降,但并未解决根本的词汇匹配问题。为了解决这一差距,我们引入了一种有针对性的参数高效领域适应方法,结合词汇适应与预训练用于基于LLM的文本摘要。我们的统一框架在预训练令牌化器中增加领域特定的令牌,同时选择性地替换未充分训练和不可达的令牌以限制参数增长。我们在Llama-3.1-8B和Qwen2.5-7B上评估了我们的方法,在法律和医学摘要任务上使用以专家驱动文本和摘要为中心的评估协议,这些文本通常包含更高浓度的Out-of-Vocabulary(OOV)词。词汇适应算法通过提高生成摘要与参考摘要之间的语义相似性,提升了摘要模型的整体质量。此外,适应后的模型生成的摘要包含更多合适的新型和领域特定的词汇,从而提高了连贯性、相关性和忠实性。我们进一步观察到,我们的方法在连续预训练上减少了35-55%的训练时间,并将参数数量减少了多达37%。我们公开了代码库:https://github.com/gb-kgp/VocabReplace-Then-Expand。

英文摘要

Large language models pretrained on general-domain corpora often exhibit tokenization inefficiencies when applied to specialized domains. Although continual pretraining for domain adaptation partially alleviate performance degradation, it does not resolve the fundamental vocabulary mismatch. To address this gap, we introduce a targeted parameter-efficient domain adaptation approach that combines vocabulary adaptation with pretraining for LLM-based text summarization. Our unified framework augments pretrained tokenizers with domain-specific tokens while selectively replacing under-trained and unreachable tokens to limit parameter growth. We evaluate our approach on Llama-3.1-8B and Qwen2.5-7B across legal and medical summarization tasks on a challenge-oriented evaluation protocol focused on expert-driven text and summaries which typically has higher concentration of over-fragmented Out-of-Vocabulary (OOV) words. The vocabulary adaptation algorithm enhances the overall quality of the summarization model by improving semantic similarity between the generated summaries and their references. In addition, the adapted model produces summaries that incorporate more appropriate novel and domain-specific words, leading to improved coherence, relevance, and faithfulness. We further observe that our proposed approach significantly reduce training time by $35-55\%$ over continual pretraining and reduce parameter counts up to $37\%$ w.r.t expansion-only methods. We make the codebase publicly available at https://github.com/gb-kgp/VocabReplace-Then-Expand.

2605.16234 2026-05-19 cs.LG cs.AI cs.CL 版本更新

No Free Swap: Protocol-Dependent Layer Redundancy in Transformers

没有免费的交换:Transformer中的协议依赖层冗余

Gabriel Garcia

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了Transformer中层冗余问题,通过比较替换和交换两种协议,发现它们在压缩中的效果存在显著差异,且在相同评估器下,不同协议可能导致层剪枝结果的变化,尤其在高替换距离时更为明显。

Comments 40 pages, 8 figures, 24 tables. Code is available at https://github.com/Gpgabriel25/ProtocolGapDiagnostic

详情
AI中文摘要

当研究人员询问两个Transformer层是否在压缩中“等价”时,他们常常混淆了不同的测试方法。替换测试询问是否可以将一层的映射替换为另一层的映射;交换测试询问是否当两层位置交换时,它们近似可交换。两者都是基于输出的swap-KL探测器,但它们并不总是一致:在预训练的Transformer中,协议差距可能在相同评估器下改变哪些层看起来可以安全剪枝,尤其是在替换距离较高时。我们跨检查点和架构测量了两种协议。在Pythia训练轨迹(410M和1.4B)上,替换-交换差距从初始化到收敛逐渐增大。在8B规模的WikiText-2合同下,Qwen3-8B进入了一个发散阶段:交换引导的移除比替换引导的在相同层预算下更安全,而Llama-3.1-8B在剪枝成本上两者持平,尽管交换KL较低,这表明指标差距不必一对一映射到移除。在层移除或合并之前,应在目标检查点上对两种swap-KL进行评分;该诊断仅需未标记的正向传递。

英文摘要

When researchers ask whether two transformer layers are "equivalent" for compression, they often conflate distinct tests. Replacement asks whether one layer's map can substitute for another's in place; interchange asks whether two layers approximately commute when their positions are swapped. Both are output-grounded swap-KL probes, but they need not agree: on pretrained transformers the protocol gap can change which layers look safe to prune by several-fold under the same evaluator, especially when replacement distances are high. We measure both protocols across checkpoints and architectures. On a Pythia training trajectory (410M and 1.4B), the replacement-interchange gap grows from initialization to convergence. Under one matched WikiText-2 contract at 8B scale, Qwen3-8B enters a divergent regime: interchange-guided removal is several-fold safer than replacement-guided at the same layer budgets, while Llama-3.1-8B ties the two protocols for pruning cost even though interchange KL is lower, showing metric gaps need not map one-to-one to removal. Before layer removal or merging, score both swap-KLs on the target checkpoint; the diagnostic requires only unlabeled forward passes.

2605.15735 2026-05-19 cs.CV cs.AI 版本更新

UAM: A Dual-Stream Perspective on Forgetting in VLA Training

UAM:VL A训练中遗忘的双流视角

Jianke Zhang, Yuanfei Luo, Yucheng Hu, Xiaoyu Chen, Yanjiang Guo, Ziyang Liu, Hongbin Xu, Tian Lan, Jianyu Chen

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出UAM模型,通过双流架构解决VL A训练中因单一编码器导致的多模态能力下降问题,展示了通过架构分离而非冻结权重或辅助数据可实现语义保留,并在多种任务中取得高成功率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通常通过在动作数据上微调预训练的视觉-语言模型(VLM)来构建。然而,我们证明这种标准方法系统性地削弱了VLM的多模态能力,这种副作用我们称之为‘具身税’。但VL A是否必须遗忘?受生物视觉双流组织的启发,我们将这种退化归因于结构性瓶颈:当前VL A要求单一编码器同时支持语言基础语义和控制相关的视觉特征,而生物视觉将识别与视觉运动控制分为不同的路径。基于此观点,我们提出了统一动作模型(UAM),添加了一个平行的背侧专家,作为大脑背侧通路的类比。为了使背侧专家成为有效的第二路径并减少对VLM的控制学习负担,我们从预训练的生成模型中初始化它,并用中层推理目标进行训练,该目标预测视觉动态。这种设计使我们能够仅用动作数据端到端地训练整个VLA:无需参数冻结、无需梯度停止、无需辅助VL共训练,UAM保留了超过95%的底层VLM的多模态能力,同时在多种任务中取得了最高平均成功率,包括未见物体、新物体-目标组合和指令变化等探测分布外泛化的任务。这些结果表明,VL A中的语义保留可以从架构分离本身产生,而非通过冻结权重或辅助数据重放,并且这种保留的语义能力可以自然地从VLM转移到动作中的语义泛化。

英文摘要

Vision--language--action (VLA) models are typically built by fine-tuning a pretrained vision--language model (VLM) on action data. However, we show that this standard recipe systematically erodes the VLM's multimodal competence, a side effect we call the embodiment tax. But do VLAs have to forget? Inspired by the two-stream organization of biological vision, we trace this degradation to a structural bottleneck: current VLAs ask a single encoder to support both language-grounded semantics and control-relevant visual features, whereas biological vision separates recognition and visuomotor control into distinct pathways. Building on this view, we propose the Unified Action Model (UAM), which adds a parallel Dorsal Expert, an analog of the brain's dorsal pathway. To make the Dorsal Expert an effective second pathway and reduce the control-learning burden on the VLM, we initialize it from a pretrained generative model and train it with a mid-level reasoning objective that predicts visual dynamics. This design allows us to train the whole VLA end-to-end on action data alone: with no parameter freezing, no gradient stopping, and no auxiliary VL co-training, UAM retains over $95\%$ of the underlying VLM's multimodal capability and at the same time achieves the highest average success rate among baselines on a variety of manipulation tasks that probe out-of-distribution generalization, including unseen objects, novel object--target compositions, and instruction variation. Together, these results suggest that semantic preservation in VLAs can emerge from architectural separation itself, rather than being enforced by frozen weights or auxiliary data replay, and that this preserved semantic capability can naturally transfer from VLMs to semantic generalization in actions.

2605.15586 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Embracing Biased Transition Matrices for Complementary-Label Learning with Many Classes

拥抱偏置转移矩阵以实现多类互补标签学习

Tan-Ha Mai, Chao-Kai Chiang, Han-Hwa Shih, Gang Niu, Masashi Sugiyama, Hsuan-Tien Lin

发表机构 * National Taiwan University(国立台湾大学) The University of Tokyo(东京大学) RIKEN Center for Advanced Intelligence Project(日本理化学研究院先进智能项目中心)

AI总结 本文提出了一种新的框架BICL,通过设计偏置的标签生成过程来克服传统互补标签学习在多类设置中的限制,从而在CIFAR-100和TinyImageNet-200上实现了传统方法的七倍以上准确率提升。

Comments 33 pages, 16 figures, 18 tables

详情
AI中文摘要

互补标签学习(CLL)是一种弱监督范式,其中实例被标记为不属于其类别的标签。尽管已有十年的研究,CLL方法主要在10类分类任务中具有竞争力,而扩展到大规模标签空间仍然是一个持久的瓶颈。这种限制源于传统方法对均匀标签生成的假设,这在多类设置中严重稀释了学习信号。在本文中,我们证明通过故意设计偏置(非均匀)的生成过程,将互补标签限制在类别的子集,可以克服这一长期存在的障碍。这一发现促使我们提出Bias-Induced Constrained Labeling(BICL),一个涵盖数据收集到训练的原理性框架,利用这种偏置。BICL在CIFAR-100和TinyImageNet-200上实现了有效学习,比传统方法的准确率提高了超过七倍。我们的发现为在现实应用中使CLL适用于多类问题开辟了新的道路。

英文摘要

Complementary-label learning (CLL) is a weakly supervised paradigm where instances are labeled with classes they do not belong to. Despite a decade of research, CLL methods remain competitive mainly on 10-class classification, with scaling to large label spaces continuing to be an enduring bottleneck. This limitation stems from the common assumption of uniform label generation in traditional methods, which fatally dilutes the learning signal in many-class settings. In this paper, we demonstrate that this long-standing barrier can be overcome by deliberately designing a biased (non-uniform) generation process that restricts complementary labels to a subset of classes. This finding motivates us to propose Bias-Induced Constrained Labeling (BICL), a principled framework spanning data collection to training that leverages this bias. BICL enables effective learning on CIFAR-100 and TinyImageNet-200, achieving more than sevenfold accuracy improvements over traditional methods. Our findings establish a new trajectory for making CLL feasible for many classes in real-world applications.

2605.15553 2026-05-19 cs.NI cs.AI cs.ET 版本更新

Operator-Controlled 6G: From Connectivity Infrastructure to Guaranteed Digital Services

运营商主导的6G:从连接基础设施到保证的数字服务

David Soldani

发表机构 * Rakuten Mobile Inc.(乐天移动公司)

AI总结 本文提出了一种运营商主导的6G框架,通过重新排序运营商优先级,将控制、客户、业务、运营和技术作为优先级,定义了所有权分类和商业模型,以实现可执行的服务级别目标,并通过Rakuten Mobile的实证证据证明了其可行性。

Comments 81 pages, 18 figures, 66 references

详情
AI中文摘要

第六代移动网络(6G)正接近结构性转折点。五代由供应商主导的架构使运营商在无法拥有、修改平台和审计AI层的情况下采购和操作网络。本文主张6G必须逆转这一趋势,重新排列运营商优先级:控制优先,客户优先,业务优先,运营优先,技术最后。技术应服务于运营商控制、客户成果、可 monetizable 的保证和软件驱动的运营,而不是决定它们。本文的两个贡献是将这一论点具体化。6G控制紧凑型定义了一个三层所有权分类——拥有、联邦和消费——根据战略价值分配架构主权。保证经济定义了一个五级、结果定价的商业模型,将运营商控制转化为可执行的服务级别目标。该框架基于Rakuten Mobile的实证证据,这是世界上第一个全国规模、完全云原生、完全开放无线接入网络(Open RAN)的部署,于2025财年实现了全年EBITDA盈利。它与ITU-R IMT-2030框架、3GPP 6G使用案例和服务要求、NGMN建议、ETSI标准、O-RAN联盟和AI-RAN联盟规范、IOWN全球论坛可持续性指标、Linux基金会倡议以及领先行业和学术项目保持一致。一个涵盖2025-2027、2027-2029和2029-2032及以后的三阶段路线图,以及七个针对特定利益相关者的行动呼吁,将架构转化为行业承诺。核心主张是Rakuten Mobile的部署证明了运营商主导的6G的可行性。2026-2028期间的决策将决定6G将成为保证数字服务的平台还是另一个依赖供应商的基础设施周期。

英文摘要

Sixth-generation mobile networks (6G) are approaching a structural inflection point. Five generations of vendor-led architectures have left operators procuring and operating networks they do not own, on platforms they cannot modify, with AI layers they cannot audit. This paper argues that 6G must reverse this trajectory by reordering operator priorities: Control First, Customer First, Business First, Operations First, and Technology Last. Technology should serve operator control, customer outcomes, monetizable guarantees, and software-driven operations, not dictate them.Two contributions operationalize this thesis. The 6G Control Compact defines a three-layer ownership taxonomy--own, federate, and consume--that allocates architectural sovereignty according to strategic value. The Guarantee Economy defines a five-tier, outcome-priced commercial model that converts operator control into enforceable service-level objectives. The framework is grounded in operational evidence from Rakuten Mobile, the world's first national-scale, fully cloud-native, fully Open RAN deployment, which reached full-year EBITDA profitability in FY2025. It is aligned with the ITU-R IMT-2030 framework, 3GPP 6G use cases and service requirements, NGMN recommendations, ETSI standards, O-RAN Alliance and AI-RAN Alliance specifications, IOWN Global Forum sustainability metrics, Linux Foundation initiatives, and leading industry and academic programs. A three-phase roadmap covering 2025-2027, 2027-2029, and 2029-2032 and beyond, together with seven stakeholder-specific calls to action, translates the architecture into industry commitments. The central claim is that Rakuten Mobile's deployment demonstrates the feasibility of operator-controlled 6G. Decisions made during 2026-2028 will determine whether 6G becomes a platform for guaranteed digital services or another vendor-dependent infrastructure cycle.

2605.15377 2026-05-19 cs.AI 版本更新

Ensemble Monitoring for AI Control: Diverse Signals Outweigh More Compute

为AI控制的集束监控:多样信号胜过更多计算

Eugene Koran, Yejun Yun, Samantha Tetef, Benjamin Arnav, Pablo Bernabeu-Pérez

发表机构 * Yale University(耶鲁大学)

AI总结 本文研究了通过结合多种监控信号来提高AI行为检测的性能,发现多样性的监控集合比单一或同质的监控集合更有效,且细调的监控方法在检测能力上更具优势。

详情
AI中文摘要

随着AI系统在大规模自主代理环境中越来越广泛地部署,确保它们采取的安全和符合用户意图的行为变得至关重要。监控代理行为是关键的安全机制,但可靠的监控仍然难以构建,而系统规模使人类监督变得不切实际。我们证明,将来自不同监控器的信号组合成一个集合可以提高检测偏离行为的能力。我们使用提示和微调策略构建了12个GPT-4.1-Mini监控器。我们在编码任务中评估了它们,其中候选解决方案通过标准测试但失败于对抗性输入。在这种情况下,多样化的集合优于单个监控器和同质的集合。我们的最佳3监控集合在检测性能上比由三个相同监控器组成的集合提高了2.4倍,且在独立数据集上表现强劲。我们认为这些结果表明,收益来自于多样性而不是规模。最佳集合结合了强个体表现和监控器之间低相关性。此外,微调的监控器出现在每一个表现最好的集合中,并且在非分布攻击类型上保持了这一优势,表明微调能够激发检测能力,而提示单独无法做到。这些结果支持集合监控作为一种实用的AI控制策略,以在合理的推理成本下获得安全收益。

英文摘要

As AI systems are increasingly deployed in autonomous agentic settings at scale, it is important to ensure the actions they take are safe and aligned with user intent. Monitoring agent actions is a key safety mechanism, yet reliable monitors remain difficult to build and the scale of these systems makes human oversight impractical. We show that combining signals from diverse monitors into an ensemble improves detection of misaligned actions. We build 12 GPT-4.1-Mini monitors using both prompting and fine-tuning strategies. We evaluate them on coding tasks where candidate solutions pass standard tests but fail on adversarial inputs. In this setting, diverse ensembles outperform both individual monitors and homogeneous ensembles. Our best 3-monitor ensemble achieves 2.4x greater detection performance gain compared to an ensemble composed of three identical monitors, with the same ensemble performing strongly on an independent dataset. We contend that these results show that diversity - not scale - drives gains. The best ensembles combine strong individual performance with low correlation between monitors. Furthermore, fine-tuned monitors appear in every top-performing ensemble and maintain this advantage on out-of-distribution attack types, suggesting that fine-tuning enables detection capabilities that prompting alone does not elicit. These results support ensemble monitoring as a practical AI control strategy for safety gains at reasonable inference costs.

2605.15338 2026-05-19 cs.CR cs.AI 版本更新

Hidden in Memory: Sleeper Memory Poisoning in LLM Agents

记忆中的隐患:LLM代理中的潜伏记忆污染

Sidharth Pulipaka, Stanislau Hlebik, Leonidas Raghav, Sahar Abdelnabi, Vyas Raina, Ivaxi Sheth, Mario Fritz

发表机构 * SPAR ELLIS Institute Tübingen(图宾根ELLIS研究所) MPI for Intelligent Systems(智能系统马克斯·普朗克研究所) Tübingen AI Center(图宾根人工智能中心) APTA CISPA Helmholtz Center for Information Security(信息安全海德堡中心)

AI总结 研究探讨了LLM代理中由于持久化内存带来的安全风险,提出并研究了潜伏记忆污染攻击,该攻击通过操控外部上下文使代理存储伪造的记忆,影响后续交互,实验显示攻击可跨多次对话持续生效。

Comments 86 pages, 60 tables

详情
AI中文摘要

大型语言模型越来越多地集成了持久化内存,使助手能够在不同会话中存储用户特定信息以实现个性化和连续性。这种状态性引入了新的安全风险:对抗性内容可以腐蚀助手所记住的信息,从而影响未来的交互。我们提出并研究了潜伏记忆污染,这是一种延迟攻击,攻击者通过操控外部上下文,如文档、网页或仓库,使助手存储关于用户的伪造记忆。与传统的提示注入不同,这种攻击可以保持潜伏并在多次后续对话中重新出现。我们评估了完整的攻击流程:被污染的记忆是否被写入、后来被检索,并最终用于引导后续对话。在状态化的LLM助手上,被污染的记忆在GPT-5.5上达到高达99.8%,在Kimi-K2.6上达到95%。关键的是,在成功的检索中,被污染的记忆导致攻击者期望的代理行为在60-89%的评估中出现。这些结果表明,持久化内存可以在多个未来对话中充当长期的攻击面。

英文摘要

Large language models are increasingly augmented with persistent memory, allowing assistants to store user-specific information across sessions for personalization and continuity. This statefulness introduces a new security risk: adversarial content can corrupt what an assistant remembers and thereby influence future interactions. We propose and study sleeper memory poisoning, a delayed attack in which an adversary manipulates external context, such as a document, webpage, or repository, to cause the assistant to store a fabricated memory about the user. Unlike conventional prompt injection, the attack can remain dormant and re-emerge across multiple later conversations. We evaluate the full attack pipeline: whether poisoned memories are written, later retrieved, and ultimately used to steer the following conversations. Across stateful LLM assistants, poisoned memories were added up to 99.8% on GPT-5.5 and 95% on Kimi-K2.6. Crucially, among successful retrievals, poisoned memories cause attacker-intended agentic actions in 60-89% of evaluations across models. These results show that persistent memory can act as a long-term attack surface across multiple future conversations.

2605.15177 2026-05-19 cs.AI 版本更新

OpenDeepThink: Parallel Reasoning via Bradley-Terry Aggregation

OpenDeepThink: 通过布拉德利-蒂尔利聚合实现并行推理

Shang Zhou, Wenhao Chai, Kaiyuan Liu, Huanzhi Mao, Qiuyang Mang, Jingbo Shang

发表机构 * UC San Diego(UC圣地亚哥大学) Princeton University(普林斯顿大学) University of Washington(华盛顿大学) UC Berkeley(伯克利大学)

AI总结 该研究提出OpenDeepThink框架,通过布拉德利-蒂尔利聚合方法在测试时扩展计算资源,以提高大语言模型的推理能力,通过并行选择候选方案并消除选择瓶颈,从而提升模型在Codeforces等领域的表现。

Comments 19 pages, 4 figures

详情
AI中文摘要

测试时计算扩展是提高大语言模型推理能力的主要方向。现有方法主要通过扩展单个推理轨迹来扩展深度,而通过并行采样多个候选方案来扩展广度则较为简单,但会引入选择瓶颈:在没有地面真相验证器的情况下选择最佳候选方案,因为点wise LLM判断是嘈杂且有偏见的。为了解决这个问题,我们引入了OpenDeepThink,一种基于种群的测试时计算框架,通过成对的布拉德利-蒂尔利比较来选择。每次生成中,LLM随机判断候选方案对并利用布拉德利-蒂尔利聚合生成全局排名;排名最高的候选方案被保留,前四分之三的方案通过自然语言批评进行变异;后四分之一的方案被丢弃。OpenDeepThink在八个连续的LLM调用轮次中(约27分钟实时时钟时间)将Gemini 3.1 Pro的Codeforces Elo有效提升405分。该流程在较弱和较强模型之间转移时无需重新训练,并在多领域HLE基准测试中,收益集中在客观可验证的领域,而在主观领域则相反。我们发布了CF-73,一个包含73个专家评分的Codeforces问题的精选集,具有国际大师注释,并且与官方判决的本地评估一致性达到99%。

英文摘要

Test-time compute scaling is a primary axis for improving LLM reasoning. Existing methods primarily scale depth by extending a single reasoning trace. Scaling breadth by sampling multiple candidates in parallel is straightforward, but introduces a selection bottleneck: choosing the best candidate without a ground-truth verifier, since pointwise LLM judging is noisy and biased. To address this, we introduce OpenDeepThink, a population-based test-time compute framework that selects via pairwise Bradley-Terry comparison. Each generation, the LLM judges random pairs of candidates and aggregates votes via Bradley-Terry into a global ranking; top-ranked candidates are preserved and the top three quarters are mutated using the natural-language critiques produced during comparison; the bottom quarter is discarded. OpenDeepThink raises Gemini 3.1 Pro's effective Codeforces Elo by +405 points in eight sequential LLM-call rounds (~27 minutes wall-clock). The pipeline transfers across weaker and stronger models without retuning, and on the multi-domain HLE benchmark, gains appear concentrated in objectively verifiable domains and reverse in subjective ones. We release CF-73, a curated set of 73 expert-rated Codeforces problems with International Grandmaster annotation and 99% local-evaluation agreement against the official verdict.

2605.14133 2026-05-19 cs.AI 版本更新

ClawForge: Generating Executable Interactive Benchmarks for Command-Line Agents

ClawForge: 为命令行代理生成可执行的交互式基准测试

Yuxiang Lai, Peng Xia, Haonian Ji, Kaiwen Xiong, Kaide Zeng, Jiaqi Liu, Fang Wu, Jike Zhong, Zeyu Zheng, Cihang Xie, Huaxiu Yao

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学切里波因特分校) Stanford University(斯坦福大学) University of California, Berkeley(加州大学伯克利分校) University of Southern California(南加州大学) University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 ClawForge通过生成可执行的交互式基准测试,解决了可扩展性与真实工作流评估之间的矛盾,通过系统测试代理在存在状态冲突时的处理能力。

详情
AI中文摘要

交互式代理基准测试面临可扩展性构建与真实工作流评估之间的张力。手工编写的任务扩展和修改成本高,而静态提示评估忽略了只有在代理在持久状态上操作时才会出现的失败。现有的交互式基准测试已显著提升了代理评估,但大多数初始化任务从干净的状态开始,没有系统测试代理如何处理已存在的部分、过时或冲突的物品。我们提出了ClawForge,一个基于生成器的可执行命令行工作流基准测试框架,在状态冲突下。该框架将场景模板、扎根槽位、初始化状态、参考轨迹和验证器编译成可重复的任务规范,并通过归一化的终端状态和可观测的副作用逐步评估代理,而不是精确轨迹匹配。我们实例化该框架为ClawForge-Bench(17个场景,6个能力类别)。在七个前沿模型上的结果表明,最佳模型仅达到45.3%的严格准确率,错误状态替换在所有模型中低于17%,最宽的模型分离(17%到90%)由代理在行动前是否检查现有状态决定。部分信用和步骤效率分析进一步揭示了许多失败是近似关闭而非早期崩溃,且在状态冲突下模型表现出不同的失败风格。

英文摘要

Interactive agent benchmarks face a tension between scalable construction and realistic workflow evaluation. Hand-authored tasks are expensive to extend and revise, while static prompt evaluation misses failures that only appear when agents operate over persistent state. Existing interactive benchmarks have advanced agent evaluation significantly, but most initialize tasks from clean state and do not systematically test how agents handle pre-existing partial, stale, or conflicting artifacts. We present \textbf{ClawForge}, a generator-backed benchmark framework for executable command-line workflows under state conflict. The framework compiles scenario templates, grounded slots, initialized state, reference trajectories, and validators into reproducible task specifications, and evaluates agents step by step over persistent workflow surfaces using normalized end state and observable side effects rather than exact trajectory matching. We instantiate this framework as the ClawForge-Bench (17 scenarios, 6 ability categories). Results across seven frontier models show that the best model reaches only 45.3% strict accuracy, wrong-state replacement remains below 17\% for all models, and the widest model separation (17% to 90%) is driven by whether agents inspect existing state before acting. Partial-credit and step-efficiency analyses further reveal that many failures are near-miss closures rather than early breakdowns, and that models exhibit qualitatively different failure styles under state conflict.

2605.14038 2026-05-19 cs.AI 版本更新

Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

模型适应性工具必要性揭示了大语言模型工具使用中的知行差距

Yize Cheng, Chenrui Fan, Mahdi JafariRaviz, Keivan Rezaei, Soheil Feizi

发表机构 * University of Maryland, College Park(马里兰大学College Park分校)

AI总结 本文研究了大语言模型在使用外部工具时的必要性问题,提出了一种基于模型自身性能的适应性工具必要性定义,并通过四个模型在算术和事实性问答数据集上的比较,发现工具必要性与实际调用行为之间存在显著的不匹配,揭示了LLM工具使用中的知行差距。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地作为自主代理,必须决定何时直接回答问题,何时调用外部工具。先前研究大多将工具必要性视为模型无关的属性,由人类或LLM判断者标注,主要涵盖答案明显的情况(例如获取天气与改写文本)。然而,现实中的工具必要性更为复杂,因为不同模型的能力边界存在分歧:一个强模型可以单独解决的问题,可能仍需要工具帮助弱模型。在本文中,我们引入了基于每个模型实证性能的模型适应性工具必要性定义。随后,我们比较了四个模型在算术和事实性问答数据集上的必要性与观察到的工具调用行为,发现存在26.5-54.0%和30.8-41.8%的显著不匹配。为了诊断失败,我们将工具使用分解为两个阶段:内部认知阶段,反映模型是否认为需要工具;执行阶段,决定模型是否实际做出调用动作。通过探测LLM隐藏状态,我们发现这两种信号往往可以线性解码,但它们的探测方向在晚期层、最后token的范围内几乎正交。通过追踪样本在两个阶段过程中的轨迹,我们进一步发现,大多数不匹配集中在认知到行动的转换过程中,而非认知本身。这些结果揭示了LLM工具使用中的知行差距:提高工具使用可靠性不仅需要更好的识别何时需要工具,还需要更好的将这种识别转化为行动。

英文摘要

Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools. Prior work studying adaptive tool use has largely treated tool necessity as a model-agnostic property, annotated by human or LLM judge, and mostly cover cases where the answer is obvious (e.g., fetching the weather vs. paraphrasing text). However, tool necessity in the wild is more nuanced due to the divergence of capability boundaries across models: a problem solvable by a strong model on its own may still require tools for a weaker one. In this work, we introduce a model-adaptive definition of tool-necessity, grounded in each model's empirical performance. Following this definition, we compare the necessity against observed tool-call behavior across four models on arithmetic and factual QA dataset, and find substantial mismatches of 26.5-54.0% and 30.8-41.8%, respectively. To diagnose the failure, we decompose tool use into two stages: an internal cognition stage that reflects whether a model believes a tool is necessary, and an execution stage that determines whether the model actually makes a tool-call action. By probing the LLM hidden states, we find that both signals are often linearly decodable, yet their probe directions become nearly orthogonal in the late-layer, last-token regime that drives the next-token action. By tracing the trajectory of samples in the two-stage process, we further discover that the majority of mismatch is concentrated in the cognition-to-action transition, not in cognition itself. These results reveal a knowing-doing gap in LLM tool-use: improving tool-use reliability requires not only better recognition of when tools are needed, but also better translation of that recognition into action.

2605.13415 2026-05-19 cs.CL cs.AI cs.LG 版本更新

KIT-TIP-NLP at MultiPride: Continual Learning with Multilingual Foundation Model

KIT-TIP-NLP 在 MultiPride 上的持续学习:多语言基础模型

Barathi Ganesh HB, Michal Ptaszynski, Rene Melendez, Juuso Eronen

发表机构 * Text Information Processing Lab, Kitami Institute of Technology, Kitami, Hokkaido 090-0015, Japan(函授信息处理实验室,Kitami理工学院,日本北海道Kitami,090-0015)

AI总结 本文提出了一种多阶段框架,用于检测社交媒体中多语言的重新使用侮辱性语言。该框架解决了跨英语、西班牙语和意大利语推文中识别重新使用与非重新使用LGBTQ+相关侮辱性语言的挑战,通过数据驱动的模型选择、语义保留的增强、归纳迁移学习和领域特定知识注入等方法,提高了多语言情感表达的识别能力。

Comments Final Workshop of the 9th evaluation campaign EVALITA 2026

详情
AI中文摘要

本文提出了一种多阶段框架,用于检测多语言社交媒体中重新使用的侮辱性语言。该框架解决了在英语、西班牙语和意大利语推文中识别重新使用与非重新使用LGBTQ+相关侮辱性语言的挑战。该框架处理了三个交织的方法学挑战:数据稀缺、类别不平衡和跨语言的情感表达差异。该框架整合了通过交叉验证的数据驱动模型选择、通过回译的语义保留增强、具有动态周期级欠采样的归纳迁移学习,以及通过掩码语言模型注入的领域特定知识。系统评估了八个多语言嵌入模型,XLM-RoBERTa被选为基础模型,基于宏平均F1分数。通过GPT-4o-mini回译进行的数据增强有效将训练语料库增加了三倍,同时保留了语义内容和类别分布比例。该框架生成了四个最终运行用于评估,其中RUN 1是带有增强和欠采样的归纳迁移学习,RUN 2是带有掩码语言模型预训练,RUN 3和RUN 4是通过语言特定决策阈值优化的先前预测。语言特定的阈值优化表明,最优决策边界在不同语言中存在显著差异。这反映了模型置信度分数的分布差异和重新使用语言使用的语言差异。基于阈值的优化在不需模型重新训练的情况下,带来了2-5%的绝对F1提升。该方法完全可复现,所有代码和实验设置可在https://github.com/rbg-research/MultiPRIDE-Evalita-2026上找到。

英文摘要

This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of identifying reclamatory versus non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges like data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to alternate languages effectively tripled the training corpus while preserving semantic content and class distribution ratios. The framework produces four final runs for the evaluation purposes where RUN 1 is inductive transfer learning with augmentation and undersampling, RUN 2 with masked language modeling pre-training, RUN 3 and RUN 4 are previous predictions refined via language-specific decision thresholds optimized via ROC analysis. Language-specific threshold refinement reveals that optimal decision boundaries vary significantly across languages. This reflects distributional differences in model confidence scores and linguistic variation in reclamatory language usage. The threshold-based optimization yields 2-5% absolute F1 improvement without requiring model retraining. The methodology is fully reproducible, with all code and experimental setup available at https://github.com/rbg-research/MultiPRIDE-Evalita-2026.

2605.11461 2026-05-19 cs.AI cs.LG 版本更新

Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning

打破赢家通吃:合作策略优化提升大语言模型的多样化推理

Haoxuan Chen, Tianming Liang, Wei-Shi Zheng, Jian-Fang Hu

发表机构 * ISEE Lab, Sun Yat-sen University(中山大学ISEE实验室)

AI总结 本文提出Group Cooperative Policy Optimization (GCPO)方法,通过改变训练范式从 rollout 竞争转向团队合作,提升大语言模型在推理任务中的准确性和解题多样性。

详情
AI中文摘要

基于验证器的强化学习(RLVR)已成为提升大语言模型(LLM)推理能力的核心范式,然而流行的基于群体的优化算法如GRPO常常面临探索崩溃问题,即模型过早收敛于一组高分模式,缺乏探索新解的能力。最近的研究尝试通过添加熵正则化或多样性奖励来缓解这一问题,但这些方法并未改变赢家通吃的本质,即rollouts仍为个体优势竞争而非合作最大化全局多样性。在本文中,我们提出Group Cooperative Policy Optimization(GCPO),将训练范式从rollout竞争转向团队合作。具体而言,GCPO将独立rollout评分替换为团队层面的信用分配:rollout被奖励其对团队有效解覆盖的贡献,而非其个体准确性。该覆盖被描述为奖励加权语义嵌入上的确定体体积,其中只有正确且非冗余的rollout才对这一体积做出贡献。在优势估计过程中,GCPO将集体团队奖励重新分配给每个单个rollout,根据其对团队的平均边际贡献。这种合作训练范式将优化方向导向非冗余的正确推理路径。在多个推理基准测试中,GCPO在现有方法的基础上显著提高了推理准确性和解题多样性。代码将在https://github.com/bradybuddiemarch/gcpo上发布。

英文摘要

Reinforcement learning with verifiers (RLVR) has become a central paradigm for improving LLM reasoning, yet popular group-based optimization algorithms like GRPO often suffer from exploration collapse, where the models prematurely converge on a narrow set of high-scoring patterns, lacking the ability to explore new solutions. Recent efforts attempt to alleviate this by adding entropy regularization or diversity bonus. However, these approaches do not change the \textit{winner-takes-all} nature, where rollouts still compete for individual advantage rather than cooperating for maximizing global diversity. In this work, we propose Group Cooperative Policy Optimization (GCPO), which shifts the training paradigm from rollout competition to team cooperation. Specifically, GCPO replaces independent rollout scoring with team-level credit assignment: a rollout is rewarded by how much it contributes to the team's valid solution coverage, rather than its individual accuracy. This coverage is described as a determinant volume over reward-weighted semantic embeddings, where only correct and non-redundant rollouts contribute to this volume. During advantage estimation, GCPO redistributes the collective team reward to each single rollout according to its average marginal contribution to the team. This cooperative training paradigm routes optimization toward non-redundant correct reasoning paths. Experiments across multiple reasoning benchmarks demonstrate that GCPO significantly improves both reasoning accuracy and solution diversity over existing approaches. Code will be released at https://github.com/bradybuddiemarch/gcpo.

2605.11223 2026-05-19 cs.AI 版本更新

Do Vision-Language-Models show human-like logical problem-solving capability in point and click puzzle games?

视觉-语言模型在点击式谜题游戏中是否展现出人类般的逻辑问题解决能力?

Maximilian Triebel, Marco Menner, Dominik Helfenstein

发表机构 * Institute of Artificial Intelligence, University of Stuttgart, Stuttgart, Germany(斯图加特大学人工智能研究所)

AI总结 本文提出VLATIM基准测试,用于评估在经典物理谜题游戏The Incredible Machine 2中人类般的逻辑问题解决能力,发现尽管大模型在规划方面表现优异,但精确的视觉定位仍存在问题,尚未达到人类水平。

详情
AI中文摘要

视觉-语言(-动作)模型(VLMs)越来越多地应用于交互环境,但现有基准测试往往忽视了点击式谜题游戏中所需的复杂物理推理。本文介绍了Vision-Language Against The Incredible Machine(VLATIM),一个用于评估在经典物理谜题游戏The Incredible Machine 2(TIM)中人类般的逻辑问题解决能力的基准测试。与现有基准测试不同,VLATIM专门针对高水平逻辑推理与需要精确鼠标交互的连续动作空间之间的关键差距。该基准测试分为五个逐步部分,评估的能力从基本的视觉定位和领域理解到多步骤操作和完整谜题解决。我们的结果揭示了推理与执行之间的显著差距。尽管大 proprietary 模型在规划能力方面表现优异,但它们在精确的视觉定位上存在困难。因此,它们尚未展现出人类般的解决问题能力。

英文摘要

Vision-Language(-Action) Models (VLMs) are increasingly applied to interactive environments, yet existing benchmarks often overlook the complex physical reasoning required for point-and-click puzzle games. This paper introduces Vision-Language Against The Incredible Machine (VLATIM), a benchmark designed to evaluate human-like logical problem-solving capabilities within the classic physics puzzle game The Incredible Machine 2 (TIM). Unlike existing benchmarks, VLATIM specifically targets the critical gap between high-level logical reasoning and continuous action spaces requiring precise mouse interactions. This benchmark is structured into five progressive parts, assessing capabilities that range from basic visual grounding and domain understanding to multi-step manipulation and full puzzle solving. Our results reveal a significant disparity between reasoning and execution. While large proprietary models demonstrate superior planning abilities, they struggle with precise visual grounding. Consequently, they do not yet show human-like problem-solving capabilities.

2605.10871 2026-05-19 physics.med-ph cs.AI cs.LG 版本更新

Attractor-Vascular Coupling Theory: Formal Grounding and Empirical Validation for AAMI-Standard Cuffless Blood Pressure Estimation from Smartphone Photoplethysmography

吸引子-血管耦合理论:为基于智能手机光电容积图的AAMI标准无创血压估计提供形式基础和实证验证

Timothy Oladunni, Farouk Ganiyu Adewumi

发表机构 * Department of Computer Science, Morgan State University(莫根州立大学计算机科学系)

AI总结 本文提出了一种数学框架,证明心脏吸引子几何编码了足够的血压信息,用于AAMI标准估计,并通过校准的无创血压模型验证了该理论,利用光电容积图(PPG)进行血压估计。

详情
AI中文摘要

本文提出吸引子-血管耦合理论(AVCT),一种数学框架,证明心脏吸引子几何编码了足够的血压(BP)信息,足以用于AAMI标准估计,并通过使用光电容积图(PPG)的校准无创血压模型验证了该理论。AVCT基于心脏稳定性理论,并通过Takens延迟嵌入和吸引子形态提取进行操作化。两个定理、一个命题和一个推论正式证明了PPG吸引子特征用于血压估计的使用,并预测了特征重要性层次。一个使用脉搏传导时间(PTT)和心脏稳定性指数(CSI)吸引子特征训练的LightGBM模型在严格留一受试者出交叉验证(LOSO-CV)上进行了评估,评估了来自BIDMC ICU(n=9)和VitalDB手术数据(n=37)的46名受试者,共29,684个窗口。该模型实现了收缩压(SBP)的平均绝对误差(MAE)为2.05 mmHg,舒张压(DBP)的MAE为1.67 mmHg,相关系数r=0.990和r=0.991,满足AAMI/IEEE SP10要求的MAE低于5 mmHg。每个受试者的中位数MAE为1.87/1.54 mmHg,70%/76%的受试者个体满足AAMI标准。使用九个智能手机吸引子特征的PPG-only消融与ECG+PPG模型的误差在0.05 mmHg以内,证明了仅使用智能手机摄像头即可实现临床级血压跟踪,超过了以往使用更少传感器的LOSO-CV结果。所有四个AVCT预测都得到了定量确认,从未校准到校准估计的误差减少了91.5%(epsilon_cal=0.915)。与后验可解释AI方法不同,AVCT预测的特征满足可解释AI可信度(EAT)框架的建筑忠实性标准,并将血压估计扎根于非线性动力学系统理论。

英文摘要

This work proposes Attractor-Vascular Coupling Theory (AVCT), a mathematical framework showing that cardiac attractor geometry encodes blood pressure (BP) information sufficient for AAMI-standard estimation, and validates the theory through a calibrated cuffless BP model using photoplethysmography (PPG). AVCT is grounded in Cardiac Stability Theory and operationalized using Takens delay embedding and attractor morphology extraction. Two theorems, one proposition, and one corollary formally justify the use of PPG attractor features for BP estimation and predict the feature-importance hierarchy. A LightGBM model trained on pulse transit time (PTT) and Cardiac Stability Index (CSI) attractor features under single-point calibration was evaluated using strict leave-one-subject-out cross-validation (LOSO-CV) on 46 subjects from BIDMC ICU (n = 9) and VitalDB surgical data (n = 37), comprising 29,684 windows. The model achieved systolic BP (SBP) mean absolute error (MAE) of 2.05 mmHg and diastolic BP (DBP) MAE of 1.67 mmHg, with correlations r = 0.990 and r = 0.991, satisfying the AAMI/IEEE SP10 requirement of MAE below 5 mmHg. Median per-subject MAE was 1.87/1.54 mmHg, and 70%/76% of subjects individually satisfied AAMI criteria. A PPG-only ablation using nine smartphone attractor features matched the ECG+PPG model within 0.05 mmHg, demonstrating that clinical-grade BP tracking is achievable using only a smartphone camera while surpassing prior generalized LOSO-CV results using fewer sensors. All four AVCT predictions were quantitatively confirmed, with 91.5% error reduction from uncalibrated to calibrated estimation (epsilon_cal = 0.915). Unlike post-hoc explainable AI methods, AVCT predicts features satisfying the architectural faithfulness criterion of the Explainable-AI Trustworthiness (EAT) framework and grounding BP estimation in nonlinear dynamical systems theory.

2605.10236 2026-05-19 cs.LG cs.AI 版本更新

When Does Non-Uniform Replay Matter in Reinforcement Learning?

在强化学习中非均匀回放何时起作用?

Michal Korniak, Mikołaj Czarnecki, Yarden As, Piotr Miłoś, Pieter Abbeel, Michal Nauman

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Warsaw(华沙大学) UC Berkeley(伯克利加州大学) Amazon FAR(亚马逊FAR)

AI总结 本文研究了非均匀回放在强化学习中的有效性,发现回放体积、预期近期性和回放分布熵是决定因素,并提出了一种简单有效的截断几何回放策略以提高样本效率。

详情
AI中文摘要

现代非策略强化学习算法通常依赖于简单的均匀回放采样,但非均匀回放何时以及为何优于这一强基线仍不清楚。在多样化的强化学习设置中,我们证明非均匀回放的有效性由三个因素决定:回放体积、每环境步骤回放的转换数量;预期近期性,即所采样转换的近期程度;以及回放采样分布的熵。我们的主要贡献是明确非均匀回放何时有益,并为现代非策略强化学习中的回放设计提供实用指导。我们发现,当回放体积较低时,非均匀回放最有益,且即使在预期近期性相当时,高熵采样也很重要。受这些发现的启发,我们采用了一种简单的截断几何回放策略,该策略倾向于近期经验,同时保持高熵并带来可忽略的计算开销。在大规模并行模拟、单任务和多任务设置中,包括在五个强化学习基准套件上评估的三种现代算法,这种回放采样策略在低体积情况下提高了样本效率,而在高回放体积时仍具有竞争力。

英文摘要

Modern off-policy reinforcement learning algorithms often rely on simple uniform replay sampling and it remains unclear when and why non-uniform replay improves over this strong baseline. Across diverse RL settings, we show that the effectiveness of non-uniform replay is governed by three factors: replay volume, the number of replayed transitions per environment step; expected recency, how recent sampled transitions are; and the entropy of the replay sampling distribution. Our main contribution is clarifying when non-uniform replay is beneficial and providing practical guidance for replay design in modern off-policy RL. Namely, we find that non-uniform replay is most beneficial when replay volume is low, and that high-entropy sampling is important even at comparable expected recency. Motivated by these findings, we adopt a simple Truncated Geometric replay that biases sampling toward recent experience while preserving high entropy and incurring negligible computational overhead. Across large-scale parallel simulation, single-task, and multi-task settings, including three modern algorithms evaluated on five RL benchmark suites, this replay sampling strategy improves sample efficiency in low-volume regimes while remaining competitive when replay volume is high.

2605.10185 2026-05-19 cs.CV cs.AI 版本更新

DynGhost: Temporally-Modelled Transformer for Dynamic Ghost Imaging with Quantum Detectors

DynGhost: 用于量子探测器动态鬼成像的时序建模Transformer

Vittorio Palladino, Ahmet Enis Cetin

发表机构 * Politecnico di Milano(米兰理工学院) University of Illinois at Chicago(伊利诺伊大学香槟分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出DynGhost,一种基于Transformer的动态鬼成像方法,通过交替的空间和时间注意力模块解决传统方法在动态场景和低光条件下的局限性,利用量子感知训练框架提升真实硬件下的性能。

Comments 6 pages, 8 figures

详情
AI中文摘要

鬼成像通过将结构化照明图案与标量强度测量相关联,从单像素桶探测器重建空间信息。尽管深度学习方法在静态场景中取得了显著成果,但存在两个关键局限:现有架构未能利用帧间的时间相干性,导致动态鬼成像问题未得到解决,且假设加性高斯噪声模型,而实际单光子硬件遵循泊松统计。我们提出了DynGhost(动态鬼成像Transformer),通过交替的空间和时间注意力块解决这两个限制。基于物理准确的探测器模拟(SNSPDs、SPADs、SiPMs)和Anscombe方差稳定化归一化,我们的量子感知训练框架解决了导致经典模型在真实硬件约束下失效的分布偏移。在多个基准测试中,DynGhost在动态和光子匮乏设置中优于传统重建方法和现有深度学习架构。

英文摘要

Ghost imaging reconstructs spatial information from a single-pixel bucket detector by correlating structured illumination patterns with scalar intensity measurements. While deep learning approaches have achieved promising results on static scenes, two critical limitations remain unaddressed: existing architectures fail to exploit temporal coherence across frames, leaving dynamic ghost imaging largely unsolved, and they assume additive Gaussian noise models that do not reflect the true Poissonian statistics of real single-photon hardware. We present DynGhost (Dynamic Ghost Imaging Transformer), a transformer architecture that addresses both limitations through alternating spatial and temporal attention blocks. Our quantum-aware training framework, based on physically accurate detector simulations (SNSPDs, SPADs, SiPMs) and Anscombe variance-stabilizing normalization, resolves the distribution shift that causes classical models to fail under realistic hardware constraints. Experiments across multiple benchmarks demonstrate that DynGhost outperforms both traditional reconstruction methods and existing deep learning architectures, with particular gains in dynamic and photon-starved settings.

2605.10059 2026-05-19 cs.AI 版本更新

Strategic Exploitation in LLM Agent Markets: A Simulation Framework for E-Commerce Trust

LLM代理市场中的战略利用:电子商务信任的模拟框架

Shijun Lei, Quang Nguyen, Swapneel S Mehta, Zeping Li, Huichuan Fu, Xiaolong Zheng, Siki Chen, Yunji Liang, Philip Torr, Zhenfei Yin

发表机构 * Northwestern Polytechnical University(西北工业大学) Boston University(波士顿大学) Fudan University(复旦大学) Wuhan University(武汉大学) Chinese Academy of Sciences(中国科学院) University of Oxford(牛津大学)

AI总结 本文提出TruthMarketTwin模拟框架,用于研究LLM代理在电子商务市场中的行为,发现LLM代理在传统市场中会利用声誉治理的弱点,而强制执行可减少欺骗并重塑战略推理。

详情
AI中文摘要

基于代理的建模(ABM)长期以来被用于经济学中研究人类行为,而大型语言模型(LLM)代理现在使新的社会和经济模拟成为可能。尽管先前工作发现了LLM代理在金融交易和拍卖市场中的战略性欺骗,但电子商务仍鲜有研究,尽管其有独特的信息不对称:卖家私下观察产品质量,而买家依赖广告声明和声誉信号。我们引入TruthMarketTwin,一种用于研究LLM代理在电子商务市场中行为的受控模拟框架。该框架是首个模拟不对称信息共享下双边贸易的模型之一,其中代理做出战略性列表、购买、评分和救济相关决策以优化卖家利润和买家效用。我们发现,释放到传统市场中的LLM代理会自主利用基于声誉的治理弱点,而强制执行可减少欺骗并重塑战略推理。我们的结果将LLM代理模拟定位为研究由机构治理的自主市场工具。

英文摘要

Agent-based modeling (ABM) has long been used in economics to study human behavior, and large language model (LLM) agents now enable new forms of social and economic simulation. While prior work has discovered strategic deception by LLM agents in financial trading and auction markets, e-commerce remains underexplored despite its distinctive information asymmetry: sellers privately observe product quality, whereas buyers rely on advertised claims and reputation signals. We introduce TruthMarketTwin, a controlled simulation framework for studying LLM-agent behavior in e-commerce markets. The framework is one of the first to model bilateral trade under asymmetric information sharing, where agents make strategic listing, purchasing, rating, and recourse-related decisions to optimize seller profit and buyer utility. We find that LLM agents released into traditional markets autonomously exploit weaknesses in reputation-based governance, while warrant enforcement reduces deception and reshapes strategic reasoning. Our results position LLM-agent simulation as a tool for studying institution-governed autonomous markets.

2605.09040 2026-05-19 cs.AI cs.IR cs.LG 版本更新

UxSID: Semantic-Aware User Interests Modeling for Ultra-Long Sequence

UxSID:面向超长序列的语义感知用户兴趣建模

Hongwei Zhang, Qiqiang Zhong, Jiangxia Cao, Yiyang Lv, Huanjie Wang, Liwei Guan, Jing Yao, Yiyu Wang, Junfeng Shu, Zhaojie Liu, Han Li

发表机构 * Kuaishou Technology(快手科技)

AI总结 本文提出UxSID框架,通过语义组共享兴趣记忆和双层注意力策略,实现高效且语义感知的超长用户序列建模,取得最佳性能并提升广告收益。

Comments Work in progress

详情
AI中文摘要

建模超长用户序列涉及效率与效果之间的艰难权衡。尽管当前方法依赖于物品特定搜索或物品无关压缩,我们提出UxSID,探索第三种路径:语义组共享兴趣记忆。通过利用语义ID(SIDs)和双层注意力策略,UxSID在不付出物品特定模型高昂代价的情况下捕捉目标感知偏好。这种端到端架构在计算效率与语义感知之间取得平衡,实现了最先进的性能,并在大规模广告A/B测试中提升了0.337%的收益。

英文摘要

Modeling ultra-long user sequences involves a difficult trade-off between efficiency and effectiveness. While current paradigms rely on either item-specific search or item-agnostic compression, we propose UxSID, a framework exploring a third path: semantic-group shared interest memory. By utilizing Semantic IDs (SIDs) and a dual-level attention strategy, UxSID captures target-aware preferences without the heavy cost of item-specific models. This end-to-end architecture balances computational parsimony with semantic awareness, achieving state-of-the-art performance and a 0.337% revenue lift in large-scale advertising A/B test.

2605.08738 2026-05-19 cs.LG cs.AI cs.CL 版本更新

SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

SlimQwen: 探索在大规模MoE模型预训练中的剪枝与知识蒸馏

Shengkun Tang, Zekun Wang, Bo Zheng, Liangyu Wang, Rui Men, Siqi Zhang, Xiulong Yuan, Zihan Qiu, Zhiqiang Shen, Dayiheng Liu

发表机构 * Qwen Team, Alibaba Inc.(通义实验室,阿里公司) MBZUAI KAUST(卡士大学)

AI总结 本文研究了在大规模预训练中如何应用剪枝和知识蒸馏技术,探讨了剪枝在初始化方面的优势、专家压缩对最终模型的影响以及训练策略的有效性,最终将Qwen3-Next-80A3B压缩到23A2B模型并保持竞争力。

详情
AI中文摘要

结构化剪枝和知识蒸馏(KD)是压缩大型语言模型的典型技术,但其在预训练规模下的应用仍不清楚,尤其是针对最近的混合专家(MoE)模型。本文系统研究了大规模预训练中的MoE压缩,重点探讨三个关键问题:剪枝是否比从头训练提供更好的初始化;专家压缩选择如何影响继续训练后的最终模型;以及哪种训练策略最有效。我们得出以下发现:首先,在深度、宽度和专家压缩方面,对预训练MoE进行剪枝在相同训练预算下优于从头训练。其次,不同的单次专家压缩方法在大规模持续预训练后收敛到相似的最终性能。受此启发,我们引入了一种简单的部分保留专家合并策略,该策略在大多数基准上提升了下游性能。第三,结合KD与语言建模损失在知识密集型任务上优于仅使用KD。我们进一步提出了多令牌预测(MTP)蒸馏,其效果一致。最后,鉴于相同的训练令牌,渐进式剪枝计划优于单次压缩,表明渐进的架构过渡导致更好的优化轨迹。综合来看,我们将Qwen3-Next-80A3B压缩到23A2B模型,保持了竞争力。这些结果为大规模高效MoE压缩提供了实用指导。

英文摘要

Structured pruning and knowledge distillation (KD) are typical techniques for compressing large language models, but it remains unclear how they should be applied at pretraining scale, especially to recent mixture-of-experts (MoE) models. In this work, we systematically study MoE compression in large-scale pretraining, focusing on three key questions: whether pruning provides a better initialization than training from scratch, how expert compression choices affect the final model after continued training, and which training strategy is most effective. We have the following findings: First, across depth, width, and expert compression, pruning a pretrained MoE consistently outperforms training the target architecture from scratch under the same training budget. Second, different one-shot expert compression methods converge to similar final performance after large-scale continual pretraining. Motivated by this, we introduce a simple partial-preservation expert merging strategy that improves downstream performance across most benchmarks. Third, combining KD with the language modeling loss outperforms KD alone, particularly on knowledge-intensive tasks. We further propose multi-token prediction (MTP) distillation, which yields consistent gains. Finally, given the same training tokens, progressive pruning schedules outperform one-shot compression, suggesting that gradual architecture transitions lead to better optimization trajectories. Putting it all together, we compress Qwen3-Next-80A3B to a 23A2B model that retains competitive performance. These results offer practical guidance for efficient MoE compression at scale.

2605.08163 2026-05-19 cs.CV cs.AI cs.CL 版本更新

MULTITEXTEDIT: Benchmarking Cross-Lingual Degradation in Text-in-Image Editing

MULTITEXTEDIT:跨语言文本-图像编辑中退化程度的基准测试

Liwei Cheng, Shibo Feng, Lunjie Zhou, Yixuan Guan, Dayan Guan

发表机构 * Harbin Institute of Technology(哈尔滨理工大学)

AI总结 本文提出MULTITEXTEDIT基准测试,通过12种语言、5种视觉领域和7种编辑操作的3600个实例,评估跨语言文本-图像编辑中退化问题,引入语言保真度指标并发现模型在文本准确性和脚本保真度上的显著退化。

Comments 11 pages, 5 figures

详情
AI中文摘要

文本-图像编辑已成为视觉内容创作的关键能力,但现有基准测试大多以英语为中心且常将视觉合理性与语义正确性混为一谈。我们引入MULTITEXTEDIT,一个包含3,600个实例的受控基准测试,涵盖12种语言类型、5种视觉领域和7种编辑操作。每个实例的语言变体共享相同的视觉基础,并配有人工编辑的参考文本和区域掩码,从而隔离语言变量以进行跨语言比较。为捕捉粗粒度文本匹配度指标所遗漏的脚本级错误,如缺失变音符号、RTL顺序颠倒和混合脚本渲染,我们引入了一个由两阶段LVM协议评分的语言保真度(LSF)度量,其与母语者标注员的二次加权κ值达到0.76。评估12个开源和专有系统时,发现所有模型在跨语言退化方面表现显著,最大退化出现在希伯来语和阿拉伯语上,最小退化出现在荷兰语和西班牙语上,且集中在文本准确性和脚本保真度而非粗粒度结构维度上。我们还发现普遍存在的语义和像素不匹配,其中输出保持全局布局和背景保真度,但扭曲了脚本特定的形态。

英文摘要

Text-in-image editing has become a key capability for visual content creation, yet existing benchmarks remain overwhelmingly English-centric and often conflate visual plausibility with semantic correctness. We introduce MULTITEXTEDIT, a controlled benchmark of 3,600 instances spanning 12 typologically diverse languages, 5 visual domains, and 7 editing operations. Language variants of each instance share a common visual base and are paired with a human-edited reference and region masks, isolating the language variable for cross-lingual comparison. To capture script-level errors that coarse text-matching metrics miss, such as missing diacritics, reversed RTL order, and mixed-script renderings, we introduce a language fidelity (LSF) metric scored by a two-stage LVM protocol that first traces the edited target text and then judges it in isolation, reaching a quadratic-weighted \k{appa} of 0.76 against native-speaker annotators. Evaluating 12 open-source and proprietary systems with LSF alongside standard semantic and mask-aware pixel metrics, we find pronounced cross-lingual degradation for every model, largest on Hebrew and Arabic and smallest on Dutch and Spanish, and concentrated in text accuracy and script fidelity rather than in coarse structural dimensions. We also uncover a pervasive semantic and pixel mismatch, where outputs preserve global layout and background fidelity yet distort script-specific forms.

2605.07544 2026-05-19 cs.AI 版本更新

From Pixels to Prompts: Vision-Language Models

从像素到提示:视觉-语言模型

Khang Hoang Nhat Vo

发表机构 * MBZUAI

AI总结 本文探讨了视觉-语言模型的发展历程,旨在提供清晰的认知框架,帮助读者理解该领域的核心概念和应用,而非罗列所有数据集和模型变体。

详情
AI中文摘要

当您阅读一篇关于新型视觉-语言模型的论文时,可能会忘记这个想法在不久以前听起来多么奇怪。教机器看见已经很困难,教它们阅读和生成语言也已很困难。让它们同时做到这些,并随后进行推理、回答问题、遵循指令,甚至有时令人惊讶,仍带着科幻的余韵,尽管它已成为日常。这本书源于一种简单的感觉:太容易迷失方向了。该领域发展迅速,新模型名称不断出现,‘我知道 buzzwords’与‘我真的理解其工作原理’之间的差距可能让人感到不适。我曾多次感受到这种差距。如果您持有这本书,您可能也有太大的感受。我的目标不是提供一个详尽的数据集、基准和新模型变体的清单。相反,我希望提供更谦逊但或许更持久的东西:一个清晰的视觉-语言模型认知图谱。足够的结构,使您在阅读新论文时充满信心;足够的直觉,使您能够设计自己的系统而不觉得像在盲目地组装乐高积木。

英文摘要

When you read a paper about a new Vision-Language Model today, it can be easy to forget how strange this idea would have sounded not so long ago. Teaching machines to see was already hard. Teaching them to read and generate language was already hard. Asking them to do both at once - and then to reason, answer questions, follow instructions, and sometimes even surprise us - still carries a quiet trace of science fiction, even as it becomes routine. This book was born from a simple feeling: it is too easy to get lost. The field moves quickly, new model names appear constantly, and the gap between "I know the buzzwords" and "I actually understand how this works" can feel uncomfortably wide. I have felt that gap many times. If you are holding this book, you probably have too. My goal is not to provide an exhaustive catalog of every dataset, benchmark, and new model variant. Instead, I want to offer something more modest - and, I hope, more durable: a clear mental map of Vision-Language Models. Enough structure that you can read new papers with confidence; enough intuition that you can design your own systems without feeling as if you are assembling LEGO bricks blindly.

2605.07111 2026-05-19 cs.CL cs.AI 版本更新

Beyond LoRA vs. Full Fine-Tuning: Gradient-Guided Optimizer Routing for LLM Adaptation

超越LoRA与全微调:基于梯度的优化器路由用于大语言模型适应

Haozhan Tang, Xiuqi Zhu, Xinyin Zhang, Boxun Li, Virginia Smith, Kevin Kuo

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Tsinghua University(清华大学) Infinigence AI

AI总结 本文提出了一种混合LoRA和全微调(MoLF)框架,通过在优化器层面动态路由更新,实现两种训练模式之间的连续导航,从而提升大语言模型的适应性能。

详情
AI中文摘要

近期关于微调大型语言模型的研究突显出一个根本性的争论。虽然全微调(FFT)提供了高熵知识注入所需的表示可塑性,但低秩适应(LoRA)可以匹配或超越FFT的性能,因为许多任务只需要在低秩空间中进行更新,并且受益于LoRA的额外正则化。通过在多样化的任务(SQL、医学问答和反事实知识)和不同语言模型(Gemma-3-1B、Qwen2.5-1.5B和Qwen2.5-3B)上的实证评估,我们验证了这两种趋势,并展示了仅依赖静态架构在结构上是有限的。为了解决这一挑战,我们提出了混合LoRA和全微调(MoLF)框架,这是一个统一的框架,能够连续导航于两种训练模式之间。MoLF在优化器层面动态地将更新路由到FFT和LoRA之间,以确保在整个训练过程中精确的梯度信号能够传达到两个专家,从而产生稳定的训练动态。对于内存受限的环境,我们还引入了MoLF-Efficient,它冻结了基础权重,并只在可能具有不同秩的一对LoRA专家之间路由更新。我们的评估显示,MoLF在所有设置中要么优于或保持在FFT和LoRA中更好的方法的1.5%以内,而MoLF-Efficient在事实任务上比先前的自适应LoRA方法高出高达20%,在医学和SQL任务上高出9%。我们的代码在https://github.com/11785T23/molf.git上开源。

英文摘要

Recent literature on fine-tuning Large Language Models highlights a fundamental debate. While Full Fine-Tuning (FFT) provides the representational plasticity required for high-entropy knowledge injection, Low-Rank Adaptation (LoRA) can match or surpass FFT performance because many tasks only require updates in a low-rank space and benefit from LoRA's additional regularization. Through empirical evaluation across diverse tasks (SQL, Medical QA, and Counterfactual Knowledge) and varying language models (Gemma-3-1B, Qwen2.5-1.5B, and Qwen2.5-3B), we verify both trends and demonstrate that relying solely on either static architecture is structurally limited. To address this challenge, we propose a Mixture of LoRA and Full (MoLF) Fine-Tuning, a unified framework that enables continuous navigation between both training regimes. MoLF dynamically routes updates between FFT and LoRA at the optimizer level to ensure that exact gradient signals are available to both experts throughout training, yielding stable training dynamics. For memory-constrained environments, we also introduce MoLF-Efficient, which freezes base weights and only routes updates among a pair of LoRA experts of potentially varying rank. Our evaluations show that MoLF either improves on or stays within $1.5\%$ of the better of FFT and LoRA across all settings, while MoLF-Efficient outperforms prior adaptive LoRA approaches by up to $20\%$ on Fact and $9\%$ on Med and SQL. Our code is open-sourced at https://github.com/11785T23/molf.git.

2605.03409 2026-05-19 cs.AI 版本更新

Robust Agent Compensation (RAC): Teaching AI Agents to Compensate

鲁棒代理补偿(RAC):教AI代理补偿

Srinath Perera, Kaviru Hapuarachchi, Frank Leymann, Rania Khalaf

发表机构 * University of Stuttgart(斯图加特大学)

AI总结 本研究提出了一种基于日志的恢复范式RAC,通过架构扩展实现安全网,可应用于大多数代理框架以支持可靠执行。RAC可在不修改现有代理代码的情况下启用,通过现有的扩展点在大多数现有代理框架中实现,并通过τ-bench和REALM-Bench验证,证明在解决复杂问题时,RAC在延迟和token经济性方面优于现有最先进的LLM-based恢复方法。

Comments Accepted at ACM Conference on AI and Agentic Systems (ACM CAIS 2026)

详情
AI中文摘要

我们提出了鲁棒代理补偿(RAC),一种基于日志的恢复范式(提供安全网),通过架构扩展实现,可应用于大多数代理框架以支持可靠的执行(避免意外副作用)。用户可以在不修改当前代理代码(例如LangGraph代理)的情况下启用RAC。所提出的方法可以通过大多数现有代理框架的现有扩展点实现。我们基于LangChain提出了一个实现,通过τ-bench和REALM-Bench验证其可行性,并证明在解决复杂问题时,RAC在延迟和token经济性方面比最先进的基于LLM的恢复方法快1.5至8倍或更多。

英文摘要

We present Robust Agent Compensation (RAC), a log-based recovery paradigm (providing a safety net) implemented through an architectural extension that can be applied to most Agent frameworks to support reliable executions (avoiding unintended side effects). Users can choose to enable RAC without changing their current agent code (e.g., LangGraph agents). The proposed approach can be implemented in most existing agent frameworks via their existing extension points. We present an implementation based on LangChain, demonstrate its viability through the $τ$-bench and REALM-Bench, and show that when solving complex problems, RAC is 1.5-8X or more better in both latency and token economy compared to state-of-the-art LLM-based recovery approaches.

2605.00505 2026-05-19 cs.IR cs.AI cs.CL 版本更新

LLM-Oriented Information Retrieval: A Denoising-First Perspective

面向大语言模型的信息检索:一种去噪优先的视角

Lu Dai, Liang Sun, Fanpu Cao, Ziyang Rao, Cehao Yang, Hao Liu, Hui Xiong

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出了一种以去噪为核心的信息检索方法,强调在信息检索全流程中,最大化可利用证据密度和可验证性是关键瓶颈,通过四个阶段框架和信号-噪声优化技术分类,探讨了信息检索中的挑战和解决方案。

Comments SIGIR 2026

详情
AI中文摘要

现代信息检索(IR)不再主要由人类消费,而是越来越多地通过检索增强生成(RAG)和代理搜索由大型语言模型(LLMs)使用。与人类用户不同,LLMs受制于有限的注意力预算,并且对噪声具有独特脆弱性;误导或不相关信息不再只是麻烦,而是导致幻觉和推理失败的直接原因。在本文的视角论文中,我们主张在信息访问全管道中,去噪——即在上下文窗口内最大化可利用证据密度和可验证性——已成为主要瓶颈。我们通过信息检索挑战的四阶段框架来概念化这一范式转变:从不可访问到不可发现,再到不一致,最后到不可验证。此外,我们提供了一个按流程组织的信号-噪声优化技术分类,涵盖索引、检索、上下文工程、验证和代理工作流程。我们还展示了在依赖检索的领域如终身助手、编码代理、深度研究和多模态理解中信息去噪的研究工作。

英文摘要

Modern information retrieval (IR) is no longer consumed primarily by humans but increasingly by large language models (LLMs) via retrieval-augmented generation (RAG) and agentic search. Unlike human users, LLMs are constrained by limited attention budgets and are uniquely vulnerable to noise; misleading or irrelevant information is no longer just a nuisance, but a direct cause of hallucinations and reasoning failures. In this perspective paper, we argue that denoising-maximizing usable evidence density and verifiability within a context window-is becoming the primary bottleneck across the full information access pipeline. We conceptualize this paradigm shift through a four-stage framework of IR challenges: from inaccessible to undiscoverable, to misaligned, and finally to unverifiable. Furthermore, we provide a pipeline-organized taxonomy of signal-to-noise optimization techniques, spanning indexing, retrieval, context engineering, verification, and agentic workflow. We also present research works on information denoising in domains that rely heavily on retrieval such as lifelong assistant, coding agent, deep research, and multimodal understanding.

2604.23355 2026-05-19 cs.AI 版本更新

LEGO: An LLM Skill-Based Front-End Design Generation Platform

LEGO: 一个基于LLM技能的前端设计生成平台

Jincheng Lou, Ruohan Xu, Jiecheng Ma, Runzhe Tao, Xinyu Qu, Yibo Lin

发表机构 * School of IC, Peking University(北京大学集成电路学院) School of EECS, Peking University(北京大学电子信息技术学院) School of Microelectronics, Xidian University(西安电子科技大学微电子学院) Institute of EDA, Peking University(北京大学EDA研究院) Beijing Advanced Innovation Center for IC(北京集成电路先进创新中心)

AI总结 本文提出LEGO平台,通过将数字前端流程分解为六个独立步骤,并将每个代理能力表示为标准化的可组合电路技能,实现了高效的前端设计生成,显著提升了RTL设计自动化的效果。

Comments Accepted to ISEDA 2026. Best Paper Nomination. 7 pages, 3 figures

详情
AI中文摘要

现有的基于LLM的EDA代理往往都是特定任务的孤立系统。这导致了重复的工程努力和成功设计和调试策略的有限重用。我们提出了LEGO,一个统一的基于技能的前端设计生成平台。它将数字前端流程分解为六个独立的步骤,并将每个代理的能力表示为标准化的可组合电路技能,以在即插即用的架构中进行表示。为了构建这个技能库,我们调查了超过100篇论文,选择了11个具有代表性的开源项目,并在六步有限状态机的公式中提取了42个可执行的电路技能。电路技能构建器通过线性可扩展性自动化技能提取。代理技能RAG实现了亚毫秒级检索,而无需依赖嵌入模型。在41个VerilogEval v2问题的严格子集上的实证评估显示,LEGO内构建的单个电路技能将Pass@1从0.000提升到0.805。这比基线提高了80.5%。跨项目技能组合也达到了0.805的Pass@1。它们在层次Verilog上表现更优14.6%,在VerilogCoder上表现更优2.5%。它们还与MAGE相匹配。这些结果表明,模块化技能组合支持有效且灵活的RTL设计自动化。LEGO平台和所有电路技能都在GitHub上公开:https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform

英文摘要

Existing LLM-based EDA agents are often isolated task-specific systems. This leads to repeated engineering effort and limited reuse of successful design and debugging strategies. We present LEGO, a unified skill-based platform for front-end design generation. It decomposes the digital front-end flow into six independent steps and represents every agent capability as a standardized composable circuit skill within a plug-and-play architecture. To build this skill library, we survey more than 100 papers, select 11 representative open-source projects, and extract 42 executable circuit skills within a six-step finite state machine formulation. Circuit Skill Builder automates skill extraction with linear scalability. Agent Skill RAG achieves submillisecond retrieval without relying on embedding models. Empirical evaluation on a hard subset of 41 VerilogEval v2 problems that gpt-5.2-codex fails to solve under extra-high reasoning effort shows that individual circuit skills constructed within LEGO raise Pass@1 from 0.000 to 0.805. This is an 80.5% gain over the baseline. Cross-project skill compositions also reach 0.805 Pass@1. They outperform hierarchy-verilog by 14.6% and VerilogCoder by 2.5%. They also match MAGE. These results show that modular skill composition supports both effective and flexible RTL design automation. The LEGO platform and all circuit skills are publicly available at GitHub: https://github.com/loujc/LEGO-An-LLM-Skill-Based-Front-End-Design-Generation-Platform

2604.18966 2026-05-19 cs.LG cs.AI 版本更新

Self-Improving Tabular Language Models via Iterative Reward-Guided Post-Training

通过迭代奖励引导的后训练改进表格语言模型

Yunbo Long, Tejumade Afonja, Guangya Hao, Alexandra Brintrup, Mario Fritz

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) CISPA Helmholtz Center for Information Security, Saarbrücken, Germany(德国萨尔布吕肯信息安全中心) The Alan Turing Institute, London(伦敦阿兰·图灵研究所)

AI总结 本文研究了通过生成-评分-对齐协议进行迭代奖励引导的后训练,提出了一种基于组相对对齐的方法TabGRAA,通过比较高分和低分生成组的组平均策略/参考对数比来改进表格语言模型,在五个混合类型基准上优于额外监督微调,并在保真度和下游效用之间实现了最佳平均权衡,同时保持经验隐私诊断接近监督基线。

详情
AI中文摘要

表格语言模型可以通过将行建模为令牌序列来生成合成表格,但通常通过监督微调一次后就作为静态生成器使用。这限制了下一步令牌似然不能直接优化用于评估合成数据的分布、效用和不可区分性属性。我们通过生成-评分-对齐协议研究了表格语言模型的迭代奖励引导后训练,其中生成器采样合成行,任务特定的奖励对其进行排序,模型则相对于固定监督参考进行更新。在该协议中,我们提出了TabGRAA(表格组相对优势对齐),通过组平均的策略/参考对数比比较高分和低分生成组,而非一对一偏好对。在五个混合类型基准上,TabGRAA在GReaT基座上优于额外监督微调,并在保真度和下游效用之间实现了最强的平均权衡,同时保持经验隐私诊断接近监督基线。消融研究显示,收益依赖于有意义的奖励排名和稳定的组级更新,而非额外训练本身。奖励替换和评分分离研究进一步表明,后训练循环可以使用基于分类器和无分类器的奖励,且适当的评分分离对于保持保真度-效用-隐私权衡至关重要。这些结果将TabGRAA定位为一种自改进的后训练方法,用于表格语言模型生成器,作为强大静态表格生成器的补充。

英文摘要

Tabular language models can generate synthetic tables by modeling rows as token sequences, but they are typically trained once with supervised fine-tuning and then used as static synthesizers. This is limiting because next-token likelihood does not directly optimize the distributional, utility, and indistinguishability properties used to evaluate synthetic data. We study iterative reward-guided post-training for tabular language models through a generate--score--align protocol, where a generator samples synthetic rows, a task-specified reward ranks them, and the model is updated relative to a fixed supervised reference. Within this protocol, we propose \textbf{TabGRAA} (\textbf{Tab}ular \textbf{G}roup-\textbf{R}elative \textbf{A}dvantage \textbf{A}lignment), a group-relative alignment method that compares high- and low-reward generated groups using group-averaged policy/reference log-ratios rather than one-to-one preference pairs. Across five mixed-type benchmarks, TabGRAA improves a GReaT backbone beyond additional supervised fine-tuning and achieves the strongest average trade-off among adapted DPO, KTO, and NPO baselines on fidelity and downstream utility, while maintaining empirical privacy diagnostics near the supervised baseline. Ablations show that the gains depend on meaningful reward ranking and stable group-level updates rather than extra training alone. Reward-substitution and scorer-separation studies further show that the post-training loop can use both classifier-based and classifier-free rewards, and that proper scorer separation is important for preserving the fidelity--utility--privacy trade-off. These results position TabGRAA as a self-improving post-training method for tabular language-model generators, complementary to strong static tabular synthesizers.

2604.16429 2026-05-19 cs.LG cs.AI cs.CV physics.ao-ph 版本更新

(Sparse) Attention to the Details: Preserving Spectral Fidelity in ML-based Weather Forecasting Models

(稀疏) 注意细节:在基于机器学习的天气预测模型中保持频谱保真度

Maksim Zhdanov, Ana Lucic, Max Welling, Jan-Willem van de Meent

发表机构 * AMLab(AM实验室) University of Amsterdam(阿姆斯特丹大学)

AI总结 本文提出Mosaic模型,通过学习功能扰动生成集合成员,并利用网格对齐的块稀疏注意力机制,在原分辨率网格上操作,以线性成本捕捉长距离依赖关系,从而在1.5°分辨率下达到或超越更精细分辨率模型的性能,实现了状态-of-the-art结果。

Comments Accepted to ICML 2026

详情
AI中文摘要

我们介绍Mosaic,一种概率天气预测模型,旨在解决基于机器学习的天气预测中频谱退化问题的三种失败模式:频谱阻尼(统计学)、高频混叠(架构学)和残余高频泄漏(参数学)。Mosaic通过学习的功能扰动生成集合成员,并通过网格对齐的块稀疏注意力机制在原分辨率网格上操作,该机制是一种硬件对齐的机制,通过在空间相邻查询之间共享键和值,以线性成本捕捉长距离依赖关系。在1.5°分辨率和214M参数下,Mosaic在关键变量上达到或超越了在6倍更精细分辨率上训练的模型的性能,并在1.5°模型中实现了最先进的结果,生成了经过良好校准的集合,其个体成员在所有解析频率上表现出近乎完美的频谱对齐。一个24成员、10天的预测在单个H100 GPU上不到12秒。代码可在https://github.com/maxxxzdn/mosaic上获得。

英文摘要

We introduce Mosaic, a probabilistic weather forecasting model that addresses three failure modes of spectral degradation in ML-based weather prediction: spectral damping (statistical), high-frequency aliasing (architectural), and residual high-frequency leakage (parametric). Mosaic generates ensemble members through learned functional perturbations and operates on native-resolution grids via mesh-aligned block-sparse attention, a hardware-aligned mechanism that captures long-range dependencies at linear cost by sharing keys and values across spatially adjacent queries. At 1.5° resolution with 214M parameters, Mosaic matches or outperforms models trained on 6$\times$ finer resolution on key variables and achieves state-of-the-art results among 1.5° models, producing well-calibrated ensembles whose individual members exhibit near-perfect spectral alignment across all resolved frequencies. A 24-member, 10-day forecast takes under 12s on a single H100~GPU. Code is available at https://github.com/maxxxzdn/mosaic.

2604.16395 2026-05-19 cs.DB cs.AI 版本更新

Stream2LLM: Overlap Context Streaming and Prefill for Reduced Time-to-First-Token (TTFT)

Stream2LLM: 重叠上下文流式传输与预填充以减少时间到第一个标记(TTFT)

Rajveer Bachkaniwala, Chengqi Luo, Richard So, Divya Mahajan, Kexin Rong

发表机构 * Georgia Tech(佐治亚理工学院)

AI总结 本文提出Stream2LLM,一种针对并发预填充-解码分离部署的流式感知大语言模型服务系统,通过自适应调度和抢占机制,有效解决上下文检索与推理之间的延迟问题,从而减少时间到第一个标记(TTFT),并在内存压力下保持吞吐量与非流式基线相等。

Comments Accepted to MLSys 2026. Minor formatting fixes

详情
AI中文摘要

针对LLM推理中的上下文检索系统面临的关键挑战:高检索延迟导致完全上下文等待(差的TTFT)与不完整上下文处理(降低质量)之间的根本矛盾。通过流式传输上下文——重叠检索与推理——可以缓解此延迟,但并发请求引入了新挑战:请求竞争GPU计算和内存,调度必须适应动态上下文到达。我们提出了Stream2LLM,一种流式感知的LLM服务系统,适用于并发预填充-解码分离部署。Stream2LLM引入了自适应调度和抢占机制,针对两种不同的检索模式:追加模式(逐步上下文累积)和更新模式(迭代精炼与缓存失效)。它将调度决策与资源获取解耦,使调度策略灵活,由硬件特定的成本模型引导,并使用最长公共前缀匹配来最小化动态输入变化时的冗余计算。为了评估Stream2LLM,我们收集了两个大规模的现实世界流式工作负载,基于网络爬行和近似最近邻搜索。我们的评估表明,流式架构在TTFT上实现了高达11倍的改进,成本感知调度在内存压力下提供了关键收益,同时保持与非流式基线相等的吞吐量。代码:https://github.com/rajveerb/stream2llm/tree/mlsys_artifact

英文摘要

Context retrieval systems for LLM inference face a critical challenge: high retrieval latency creates a fundamental tension between waiting for complete context (poor time-to-first-token) and proceeding without it (reduced quality). Streaming context incrementally--overlapping retrieval with inference--can mitigate this latency, but doing so with concurrent requests introduces new challenges: requests contend for GPU compute and memory, and scheduling must adapt to dynamic context arrivals. We present Stream2LLM, a streaming-aware LLM serving system for concurrent prefill-decode disaggregated deployments. Stream2LLM introduces adaptive scheduling and preemption for two distinct retrieval patterns: append-mode (progressive context accumulation) and update-mode (iterative refinement with cache invalidation). It decouples scheduling decisions from resource acquisition, enabling flexible preemption strategies guided by hardware-specific cost models, and uses longest common prefix matching to minimize redundant computation when input changes dynamically. To evaluate Stream2LLM, we collect two large-scale, real-world streaming workloads based on web crawling and approximate nearest neighbor search. Our evaluation demonstrates that streaming architecture delivers up to 11x TTFT improvements, with cost-aware scheduling providing critical benefits under memory pressure, all while maintaining throughput parity with non-streaming baselines. Code: https://github.com/rajveerb/stream2llm/tree/mlsys_artifact

2604.15851 2026-05-19 cs.LG cs.AI cs.CR 版本更新

DPrivBench: Benchmarking LLMs' Reasoning for Differential Privacy

DPrivBench:评估大语言模型在差分隐私推理中的基准测试

Erchi Wang, Pengrun Huang, Eli Chien, Om Thakkar, Kamalika Chaudhuri, Yu-Xiang Wang, Ruihan Wu

发表机构 * Halıcıoğlu Data Science Institute, UC San Diego(哈里奇奥格卢数据科学研究所,加州大学圣地亚哥分校) Department of Computer Science and Engineering, UC San Diego(计算机科学与工程系,加州大学圣地亚哥分校) Department of Electrical Engineering, National Taiwan University(电气工程系,国立台湾大学) OpenAI

AI总结 本文提出DPrivBench基准测试,用于评估大语言模型在差分隐私推理中的能力,发现当前模型在高级算法推理上存在显著差距,并为改进自动化差分隐私推理提供了方向。

详情
AI中文摘要

差分隐私(DP)在保护数据隐私方面有广泛的应用,但设计和验证DP算法需要专家级推理,这为非专家从业者设置了高门槛。先前的工作要么依赖于需要大量领域专业知识的专用验证语言,要么仍然是半自动化的,需要人工在循环中指导。在本文中,我们研究大语言模型(LLMs)能否自动化DP推理。我们引入了DPrivBench,这是一个基准测试,每个实例询问函数或算法是否在指定假设下满足陈述的DP保证。该基准测试精心设计,覆盖了广泛的DP主题,跨越不同的难度级别,并通过简单的模式匹配来抵抗快捷推理。实验显示,尽管最强的模型能够处理教科书机制,但所有模型在高级算法上都面临困难,揭示了当前DP推理能力的显著差距。通过进一步的分析研究和失败模式分析,我们识别出改进自动化DP推理的几个有前途的方向。我们的基准测试为开发和评估此类方法提供了坚实的基础,并补充了现有的数学推理基准测试。

英文摘要

Differential privacy (DP) has a wide range of applications for protecting data privacy, but designing and verifying DP algorithms requires expert-level reasoning, creating a high barrier for non-expert practitioners. Prior works either rely on specialized verification languages that demand substantial domain expertise or remain semi-automated and require human-in-the-loop guidance. In this work, we investigate whether large language models (LLMs) can automate DP reasoning. We introduce DPrivBench, a benchmark in which each instance asks whether a function or algorithm satisfies a stated DP guarantee under specified assumptions. The benchmark is carefully designed to cover a broad range of DP topics, span diverse difficulty levels, and resist shortcut reasoning through trivial pattern matching. Experiments show that while the strongest models handle textbook mechanisms well, all models struggle with advanced algorithms, revealing substantial gaps in current DP reasoning capabilities. Through further analytic study and failure-mode analysis, we identify several promising directions for improving automated DP reasoning. Our benchmark provides a solid foundation for developing and evaluating such methods, and complements existing benchmarks for mathematical reasoning.

2604.11852 2026-05-19 q-bio.QM cs.AI cs.LG 版本更新

Limitations of Sequence-Based Protein Representations for Parkinson's Disease Classification: A Leakage-Free Benchmark

序列基蛋白质表示在帕金森病分类中的局限性:一种无泄漏的基准测试

César Jesús Núñez-Prado, Grigori Sidorov, Liliana Chanona-Hernández

发表机构 * Higher School of Mechanical and Electrical Engineering, Instituto Politécnico Nacional(机械与电气工程高等专科学校,墨西哥国立理工学院) Research Center for Computing, Instituto Politécnico Nacional(计算研究中心,墨西哥国立理工学院)

AI总结 本文研究了序列基蛋白质表示在帕金森病分类中的局限性,通过无泄漏的基准测试评估了多种基于蛋白质初级序列的表示方法,发现单一序列信息对疾病分类的判别能力有限,需引入更丰富的生物学特征。

Comments 36 pages, 10 figures, 9 tables. Updated title, abstract, figures, and revised experimental discussion

详情
AI中文摘要

可靠分子生物标志物的鉴定仍因帕金森病的多因素性质而具有挑战性。尽管蛋白质序列是基础且广泛可用的生物信息来源,但其单独判别能力用于复杂疾病分类仍不明确。本文提出了一个受控且无泄漏的评估,评估了多种仅基于蛋白质初级序列的表示方法,包括氨基酸组成、k-mer、物理化学描述符、混合表示以及来自蛋白质语言模型的嵌入,所有均在嵌套分层交叉验证框架下评估以确保性能估计的无偏性。表现最佳的配置(ProtBERT + MLP)达到F1分数为0.704 ± 0.028和ROC-AUC为0.748 ± 0.047,表明判别性能仅中等。传统表示如k-mer达到相似的F1值(最高约0.667),但表现出高度不平衡的行为,召回率接近0.98,精度约0.50,反映出对正样本预测的强烈偏倚。在各种表示中,性能差异仍保持在狭窄范围内(F1在0.60到0.70之间),而无监督分析揭示没有与类别标签对齐的内在结构,统计检验(Friedman检验,p = 0.1749)不显示模型间的显著差异。这些结果表明类别之间有显著重叠,并表明仅凭初级序列信息对帕金森病分类的判别能力有限。本研究建立了一个可重复的基线,并提供了实证证据,表明更丰富的生物学特征,如结构、功能或相互作用描述符,对于稳健的疾病建模是必需的。

英文摘要

The identification of reliable molecular biomarkers for Parkinson's disease remains challenging due to its multifactorial nature. Although protein sequences constitute a fundamental and widely available source of biological information, their standalone discriminative capacity for complex disease classification remains unclear. In this work, we present a controlled and leakage-free evaluation of multiple representations derived exclusively from protein primary sequences, including amino acid composition, k-mers, physicochemical descriptors, hybrid representations, and embeddings from protein language models, all assessed under a nested stratified cross-validation framework to ensure unbiased performance estimation. The best-performing configuration (ProtBERT + MLP) achieves an F1-score of 0.704 +/- 0.028 and ROC-AUC of 0.748 +/- 0.047, indicating only moderate discriminative performance. Classical representations such as k-mers reach comparable F1 values (up to approximately 0.667), but exhibit highly imbalanced behavior, with recall close to 0.98 and precision around 0.50, reflecting a strong bias toward positive predictions. Across representations, performance differences remain within a narrow range (F1 between 0.60 and 0.70), while unsupervised analyses reveal no intrinsic structure aligned with class labels, and statistical testing (Friedman test, p = 0.1749) does not indicate significant differences across models. These results demonstrate substantial overlap between classes and indicate that primary sequence information alone provides limited discriminative power for Parkinson's disease classification. This work establishes a reproducible baseline and provides empirical evidence that more informative biological features, such as structural, functional, or interaction-based descriptors, are required for robust disease modeling.

2604.09609 2026-05-19 cs.AI cs.RO 版本更新

General-purpose LLMs as Models of Human Driver Behavior: The Case of Simplified Merging

通用大语言模型作为人类驾驶员行为模型:简化合并案例

Samir H. A. Mohammad, Wouter Mooi, Arkady Zgonnikov

发表机构 * Department of Transport and Planning, Delft University of Technology(代尔夫特理工大学交通与规划系) Department of Cognitive Robotics(认知机器人学系)

AI总结 本文研究了通用大语言模型在模拟人类驾驶员行为中的应用,通过在简化的一维合并场景中嵌入两个通用大语言模型,并与人类数据进行定量和定性分析,发现模型在间歇性操作控制和空间线索战术依赖方面能再现人类行为,但在动态速度线索响应和安全性能方面存在差异,提示未来需进一步研究其失效模式以确保其作为人类驾驶行为模型的有效性。

Comments To be published in proceedings of IEEE ITSC 2026

详情
AI中文摘要

人类行为模型在自动驾驶车辆(AVs)的虚拟安全评估中作为行为参考和模拟人类代理至关重要,但当前模型面临可解释性与灵活性之间的权衡。通用大语言模型(LLMs)提供了一种有前景的替代方案:一个模型可能在各种场景中无需参数拟合即可部署。然而,LLMs在捕捉人类驾驶行为方面能做什么、不能做什么仍不明确。我们通过将两个通用LLMs(OpenAI o3和Google Gemini 2.5 Pro)作为独立的闭环驾驶员代理嵌入简化的一维合并场景,并通过定量和定性分析将其行为与人类数据进行比较,来填补这一空白。两个模型能够再现人类样式的间歇性操作控制和对空间线索的战术依赖。然而,它们均无法一致地捕捉人类对动态速度线索的反应,且模型间的安全性能差异显著。系统性的提示消融研究揭示了提示组件作为模型特定的归纳偏置,这些偏置在不同LLMs之间不转移。这些发现表明,通用LLMs可能潜在地作为独立、即用型的人类行为模型在AV评估流程中发挥作用,但未来研究需要进一步理解其失效模式,以确保其作为人类驾驶行为模型的有效性。

英文摘要

Human behavior models are essential as behavior references and for simulating human agents in virtual safety assessment of automated vehicles (AVs), yet current models face a trade-off between interpretability and flexibility. General-purpose large language models (LLMs) offer a promising alternative: a single model potentially deployable without parameter fitting across diverse scenarios. However, what LLMs can and cannot capture about human driving behavior remains poorly understood. We address this gap by embedding two general-purpose LLMs (OpenAI o3 and Google Gemini 2.5 Pro) as standalone, closed-loop driver agents in a simplified one-dimensional merging scenario and comparing their behavior against human data using quantitative and qualitative analyses. Both models reproduce human-like intermittent operational control and tactical dependencies on spatial cues. However, neither consistently captures the human response to dynamic velocity cues, and safety performance diverges sharply between models. A systematic prompt ablation study reveals that prompt components act as model-specific inductive biases that do not transfer across LLMs. These findings suggest that general-purpose LLMs could potentially serve as standalone, ready-to-use human behavior models in AV evaluation pipelines, but future research is needed to better understand their failure modes and ensure their validity as models of human driving behavior.

2604.09450 2026-05-19 cs.LG cs.AI eess.IV 版本更新

ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion

ECHO: 通过一步块扩散实现高效的胸部X光报告生成

Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu

发表机构 * Beijing Jiaotong University(北京交通大学) Dalian University of Technology(大连理工大学)

AI总结 本文提出ECHO,一种基于扩散模型的高效视觉-语言模型,用于生成胸部X光报告,通过一步块扩散和响应不对称扩散策略,显著提高了生成效率和文本连贯性,同时在临床准确性上保持良好表现。

详情
AI中文摘要

胸部X光报告生成(CXR-RG)有潜力显著减轻放射科医生的工作负担。然而,传统自回归视觉-语言模型(VLMs)由于序列令牌解码而存在高推理延迟。基于扩散的模型通过并行生成提供了一种有前景的替代方案,但它们仍然需要多个去噪迭代。将多步去噪压缩到单步可以进一步减少延迟,但通常会因令牌因子化去噪器引入的均场偏差而降级文本连贯性。为了解决这一挑战,我们提出了ECHO,一种高效的基于扩散的VLM(dVLM),用于胸部X光报告生成。ECHO通过一种新颖的直接条件蒸馏(DCD)框架实现了稳定的每块一步推理,该框架通过从策略扩散轨迹中构建非因子化监督来缓解均场限制,以编码联合令牌依赖性。此外,我们引入了一种响应不对称扩散(RAD)训练策略,该策略进一步提高了训练效率,同时保持模型有效性。广泛的实验表明,ECHO超越了最先进的自回归方法,在RaTE和SemScore上分别提高了64.33%和60.58%,同时在临床准确性上几乎没有下降的情况下,实现了高达8倍的推理加速。

英文摘要

Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision--language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose \textbf{ECHO}, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by \textbf{64.33\%} and \textbf{60.58\%} respectively, while achieving up to \textbf{$8\times$} inference speedup with negligible degradation in clinical accuracy.

2604.01658 2026-05-19 cs.AI 版本更新

CORAL: Towards Autonomous Multi-Agent Evolution for Open-Ended Discovery

CORAL:迈向自主多智能体进化以实现开放性发现

Ao Qu, Han Zheng, Zijian Zhou, Yihao Yan, Yihong Tang, Shao Yong Ong, Fenglu Hong, Kaichen Zhou, Chonghe Jiang, Minwei Kong, Jiacheng Zhu, Xuan Jiang, Sirui Li, Cathy Wu, Bryan Kian Hsiang Low, Jinhua Zhao, Paul Pu Liang

发表机构 * MIT(麻省理工学院) NUS(新加坡国立大学) MiniMax McGill(麦吉尔大学) Stanford(斯坦福大学) SambaNova Meta Singapore-MIT Alliance for Research and Technology(新加坡-麻省理工联合研究技术联盟) Amazon(亚马逊) Microsoft(微软)

AI总结 本文提出CORAL框架,通过自主多智能体进化方法,实现了在开放性问题上的发现,展示了智能体自主性和多智能体进化对提升开放性发现的显著效果。

详情
AI中文摘要

基于大型语言模型(LLM)的进化是一种有前景的开放性发现方法,其中进展需要持续的搜索和知识积累。现有方法仍然严重依赖固定启发式和硬编码探索规则,这限制了LLM智能体的自主性。我们提出了CORAL,这是首个用于开放性问题的自主多智能体进化的框架。CORAL用长运行的智能体取代了刚性的控制,这些智能体通过共享持久记忆、异步多智能体执行和基于心跳的干预进行探索、反思和协作。它还提供了实用的保障措施,包括隔离的工作空间、评估者分离、资源管理以及智能体会话和健康管理。在多样化的数学、算法和系统优化任务上评估,CORAL在10个任务上实现了新的最先进结果,其改进率比固定进化搜索基线高出3-10倍,且使用更少的评估。在Anthropic的内核工程任务中,四个共进化智能体将最佳已知分数从1363提高到1103周期。机理分析进一步显示这些增益源于知识重用和多智能体探索和交流。这些结果表明,更大的智能体自主性和多智能体进化可以显著提高开放性发现。代码可在https://github.com/Human-Agent-Society/CORAL上获得。

英文摘要

Large language model (LLM)-based evolution is a promising approach for open-ended discovery, where progress requires sustained search and knowledge accumulation. Existing methods still rely heavily on fixed heuristics and hard-coded exploration rules, which limit the autonomy of LLM agents. We present CORAL, the first framework for autonomous multi-agent evolution on open-ended problems. CORAL replaces rigid control with long-running agents that explore, reflect, and collaborate through shared persistent memory, asynchronous multi-agent execution, and heartbeat-based interventions. It also provides practical safeguards, including isolated workspaces, evaluator separation, resource management, and agent session and health management. Evaluated on diverse mathematical, algorithmic, and systems optimization tasks, CORAL sets new state-of-the-art results on 10 tasks, achieving 3-10 times higher improvement rates with far fewer evaluations than fixed evolutionary search baselines across tasks. On Anthropic's kernel engineering task, four co-evolving agents improve the best known score from 1363 to 1103 cycles. Mechanistic analyses further show how these gains arise from knowledge reuse and multi-agent exploration and communication. Together, these results suggest that greater agent autonomy and multi-agent evolution can substantially improve open-ended discovery. Code is available at https://github.com/Human-Agent-Society/CORAL.

2603.27341 2026-05-19 cs.AI cs.CV cs.LG 版本更新

A Comparative Study in Surgical AI: Potential and Limitations of Data, Compute, and Scaling

外科AI的比较研究:数据、计算和扩展的潜力与局限

Kirill Skobelev, Eric Fithian, Yegor Baranovski, Jack Cook, Sandeep Angara, Shauna Otto, Zhuang-Fang Yi, John Zhu, Daniel A. Donoho, X. Y. Han, Neeraj Mainkar, Margaux Masson-Forsythe

发表机构 * Center for Applied AI, Chicago Booth(应用人工智能中心,芝加哥商学院) Surgical Data Science Collective(外科数据科学集体) Children’s National Hospital(儿童医学中心) Operations Management & Tolan Center for Healthcare, Chicago Booth(运营管理与托兰医疗中心,芝加哥商学院)

AI总结 本文通过2026年最先进的AI方法,研究了外科手术工具检测中的性能和限制,发现即使使用多十亿参数模型和大量训练数据,当前的视觉语言模型在神经外科手术工具检测任务中仍表现不足,且模型规模和训练时间的增加对性能提升效果有限,表明当前AI在手术应用中仍面临显著挑战。

详情
AI中文摘要

最近的人工智能(AI)模型在多个生物医学任务基准上已匹配或超越了人类专家,但特别是在外科手术基准方面,这些基准往往缺失于主要的医学基准套件中。由于手术需要整合多种任务,一般能力的AI模型可能成为协作工具,如果性能可以得到提升。一方面,通过扩展架构大小和训练数据的常规方法具有吸引力,尤其是由于每年有数百万小时的手术视频数据生成。另一方面,为AI训练准备手术数据需要显著更高的专业水平,并且在该数据上训练需要昂贵的计算资源。这些权衡描绘了现代AI是否以及在多大程度上能够帮助外科实践的不确定图景。在本文中,我们通过使用2026年最先进的AI方法进行外科手术工具检测的案例研究来探讨这个问题。我们证明,即使使用多十亿参数模型和大量训练,当前的视觉语言模型在看似简单的神经外科手术工具检测任务中仍表现不足。此外,我们展示了扩展实验,表明增加模型规模和训练时间仅导致相关性能指标的边际改善。因此,我们的实验表明,当前模型在手术使用案例中仍可能面临重大障碍。此外,一些障碍无法通过额外的计算能力简单地“解决”并持续存在于不同的模型架构中,提出了数据和标签可用性是否是唯一限制因素的问题。我们讨论了这些约束的主要贡献者,并提出了潜在的解决方案。

英文摘要

Recent Artificial Intelligence (AI) models have matched or exceeded human experts in several benchmarks of biomedical task performance, but surgical benchmarks in particular are often missing from prominent medical benchmark suites. Since surgery requires integrating disparate tasks, generally-capable AI models could be particularly attractive as a collaborative tool if performance could be improved. On the one hand, the canonical approach of scaling architecture size and training data is attractive, especially since there are millions of hours of surgical video data generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. These trade-offs paint an uncertain picture of whether and to-what-extent modern AI could aid surgical practice. In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026. We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery. Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics. Thus, our experiments suggest that current models could still face significant obstacles in surgical use cases. Moreover, some obstacles cannot be simply ``scaled away'' with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. We discuss the main contributors to these constraints and advance potential solutions.

2603.25723 2026-05-19 cs.CL cs.AI 版本更新

Natural-Language Agent Harnesses

自然语言代理Harness

Linyue Pan, Lexiao Zou, Shuo Guo, Jingchen Ni, Hai-Tao Zheng

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文提出自然语言代理Harness(NLAH)作为一种可执行的自然语言对象,用于描述任务运行的Harness策略,并引入Intelligent Harness Runtime(IHR)作为共享运行时,能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。实验表明,NLAH在编码、终端使用和计算机使用基准测试中表现与代码和提示实现相当,同时暴露了更短的静态Harness策略。

Comments revise paper

详情
AI中文摘要

代理性能受到周围Harness的强烈影响:围绕模型组织任务运行的外部执行系统。然而,这种逻辑通常隐藏在紧密耦合的控制器代码中,使得Harness难以检查、比较、转移和消解。本文探讨是否可以将代理Harness的可重用设计模式表示为可执行的自然语言对象。我们引入自然语言代理Harness(NLAH),即可编辑的文档,用于描述运行级别的Harness策略,并引入Intelligent Harness Runtime(IHR),一个共享运行时,能够将这些文档解释为代理调用、交接、状态更新、验证门和成果合同。在编码、终端使用和计算机使用基准测试中,IHR执行的NLAH实现了与代码和提示实现相当的任务结果,同时暴露了更短的静态Harness策略。模块消解进一步表明,显式的Harness模块是可分析的。这些结果表明,代理Harness可以从模型周围的偶然粘合物转变为科学表示对象。

英文摘要

Agent performance is strongly shaped by the surrounding harness: the external execution system around a model that organizes a task run. Yet this logic is usually buried in tightly coupled controller code, which makes harnesses hard to inspect, compare, transfer, and ablate. This paper asks whether the reusable design pattern of an agent harness can be represented as an executable natural-language object. We introduce Natural-Language Agent Harnesses (NLAHs), editable documents that describe run-level harness policy, and Intelligent Harness Runtime (IHR), a shared runtime that interprets these documents into agent calls, handoffs, state updates, validation gates, and artifact contracts. Across coding, terminal-use, and computer-use benchmarks, IHR-executed NLAHs achieve comparable task outcomes to code and prompted realizations, while exposing much shorter static harness policies. Module ablations further show that explicit harness modules are analyzable. These results suggest that agent harnesses can be turned from incidental glue around models into scientific representation objects.

2603.23231 2026-05-19 cs.AI 版本更新

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

PERMA:通过事件驱动的偏好和现实任务环境评估个性化记忆代理

Shuochen Liu, Junyi Zhu, Long Shu, Junda Lin, Yuhao Chen, Haotian Zhang, Chao Zhang, Derong Xu, Jia Li, Bo Tang, Zhiyu Li, Feiyu Xiong, Enhong Chen, Tong Xu

发表机构 * University of Science and Technology of China(中国科学技术大学) City University of Hong Kong(香港城市大学) Northeastern University(东北大学) MemTensor (Shanghai) Technology Co., Ltd.(MemTensor(上海)科技有限公司)

AI总结 本文提出PERMA基准,通过事件驱动的偏好和现实任务环境评估个性化记忆代理的长期一致性,引入文本变异和语言对齐以模拟真实数据中的不规则用户输入和个体语言风格,实验表明先进记忆系统能精准提取偏好并减少token消耗,但仍需更稳健的个性化记忆管理。

详情
AI中文摘要

为构建能适应用户不断变化需求的代理,增强大语言模型的长期记忆能力至关重要。现有评估通常将偏好相关对话与无关对话交织,使任务退化为needle-in-a-haystack检索,忽略了驱动用户偏好演变的事件之间的关系。此类设置忽视了现实世界个性化的一个基本特征:偏好是逐渐形成并在嘈杂环境中跨交互累积的。为弥合这一差距,我们引入PERMA,一个评估时间跨度内人格一致性的基准,超越静态偏好回忆。此外,我们引入(1)文本变异和(2)语言对齐,以模拟现实数据中的不规则用户输入和个体语言风格。PERMA包含跨多个会话和领域的时序排列交互事件,其中偏好相关查询随时间插入。我们设计了多选和交互任务以探测模型对人格的理解沿交互时间线。实验表明,通过关联相关交互,先进记忆系统能够精确提取偏好并减少token消耗,优于传统语义检索原始对话。然而,它们在时间和跨领域干扰中仍难以保持一致的人格,突显了代理中需要更稳健的个性化记忆管理的必要性。我们的代码和数据在https://github.com/PolarisLiu1/PERMA上开源。

英文摘要

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

2603.14462 2026-05-19 cs.LG cs.AI 版本更新

STAG-CN: Spatio-Temporal Apiary Graph Convolutional Network for Disease Onset Prediction in Beehive Sensor Networks

STAG-CN:时空蜂巢图卷积网络用于蜂巢传感器网络中疾病发病预测

Sungwoo Kang

AI总结 该研究提出STAG-CN模型,通过建模蜂箱间关系来预测疾病发病,利用时空图卷积网络结合物理位置和气候传感器相关性,验证了共享环境响应模式比空间接近性更有效。

Comments Null result after running with 10 seeds

详情
AI中文摘要

蜂蜜蜂群损失威胁着全球授粉服务,但当前监测系统将每个蜂箱视为孤立单元,忽略了疾病在养蜂场中传播的空间路径。本文介绍了时空蜂巢图卷积网络(STAG-CN),一种图神经网络,用于疾病发病预测。STAG-CN基于双邻接图,结合蜂箱会话间的物理共置和气候传感器相关性,通过基于因果扩张卷积和Chebyshev谱图卷积的时空-时空三明治架构处理多变量物联网传感器流。在韩国AI Hub养蜂数据集(数据集#71488)上进行扩展窗口时间交叉验证后,STAG-CN在三天预测范围内达到F1分数0.607。消融研究显示,仅气候邻接矩阵可达到全模型性能(F1=0.607),而仅物理邻接矩阵则为F1=0.274,表明共享的环境响应模式比空间接近性在疾病发病预测中更具预测信号。这些结果为基于图的生物安全监控在精准养蜂中的概念验证奠定了基础,证明了蜂箱传感器相关性编码了单个蜂箱方法无法察觉的疾病相关信息。

英文摘要

Honey bee colony losses threaten global pollination services, yet current monitoring systems treat each hive as an isolated unit, ignoring the spatial pathways through which diseases spread across apiaries. This paper introduces the Spatio-Temporal Apiary Graph Convolutional Network (STAG-CN), a graph neural network that models inter-hive relationships for disease onset prediction. STAG-CN operates on a dual adjacency graph combining physical co-location and climatic sensor correlation among hive sessions, and processes multivariate IoT sensor streams through a temporal--spatial--temporal sandwich architecture built on causal dilated convolutions and Chebyshev spectral graph convolutions. Evaluated on the Korean AI Hub apiculture dataset (dataset \#71488) with expanding-window temporal cross-validation, STAG-CN achieves an F1 score of 0.607 at a three-day forecast horizon. An ablation study reveals that the climatic adjacency matrix alone matches full-model performance (F1\,=\,0.607), while the physical adjacency alone yields F1\,=\,0.274, indicating that shared environmental response patterns carry stronger predictive signal than spatial proximity for disease onset. These results establish a proof-of-concept for graph-based biosecurity monitoring in precision apiculture, demonstrating that inter-hive sensor correlations encode disease-relevant information invisible to single-hive approaches.

2603.12145 2026-05-19 cs.LG cs.AI cs.SE 版本更新

Automatic Generation of High-Performance RL Environments

自动生成高性能强化学习环境

Seth Karten, Rahul Dev Appapogu, Chi Jin

发表机构 * Princeton University(普林斯顿大学) Independent Researcher(独立研究者)

AI总结 本文提出了一种闭环方法,通过最小的计算成本生成等效的高性能强化学习环境,展示了三种不同的工作流程,并在五个环境中验证了无仿真到仿真的差距,同时展示了新的环境创建方法。

Comments 20 pages, 5 figures

详情
AI中文摘要

将复杂的强化学习(RL)环境转换为高性能实现传统上需要数月的专业工程工作。我们提出了一种闭环方法,以最小的计算成本生成等效的高性能环境。我们的方法使用通用提示模板、分层验证(属性、交互和运行测试)、迭代修复和跨后端策略转移来验证无仿真到仿真的差距。我们展示了三个不同的工作流程跨越五个环境:(1)从Game Boy模拟器PyBoy直接翻译到我们的EmuRust(通过Rust IPC)和从Pokemon Showdown翻译到我们的PokeJAX(通过JAX);(2)通过与现有高性能实现的吞吐量一致性进行验证,如Puffer Pong、MJX和Brax在匹配的GPU批次大小下;(3)新环境的创建:TCGJax,第一个Pokemon TCG Pocket环境,从网页提取的规范中创建。在2亿个参数下,环境开销低于训练时间的4%。我们的闭环方法验证了所有五个环境的等效性。TCGJax,由一个不在公共存储库中的私有参考合成,用于控制代理预训练数据的污染问题。

英文摘要

Translating complex reinforcement learning (RL) environments into high-performance implementations has traditionally required months of specialized engineering. We present a closed-loop methodology that produces equivalent high-performance environments for minimal compute cost. Our method uses a generic prompt template, hierarchical verification (property, interaction, and rollout tests), iterative repair, and cross-backend policy transfer to verify no sim-to-sim gap. We demonstrate three distinct workflows across five environments: (1) Direct translation (no prior performance implementation exists) from Game Boy emulator PyBoy to our EmuRust (via Rust IPC) and from Pokemon Showdown to our PokeJAX (via JAX); (2) Translation verified against existing performance implementations via throughput parity with Puffer Pong, MJX and Brax at matched GPU batch sizes; and (3) New environment creation: TCGJax, the first Pokemon TCG Pocket environment, created from a web-extracted specification. At 200M parameters, the environment overhead drops below 4% of training time. Our closed-loop methodology confirms equivalence for all five environments. TCGJax, synthesized from a private reference absent from public repositories, serves as a contamination control for agent pretraining data concerns.

2603.11689 2026-05-19 cs.AI 版本更新

Explicit Logic Channel for Validation and Enhancement of MLLMs on Zero-Shot Tasks

显式逻辑通道用于验证和增强用于零样本任务的前沿多模态大语言模型

Mei Chee Leong, Ying Gu, Hui Li Tan, Liyuan Li, Nancy Chen

发表机构 * Institute for Infocomm Research (I$^2$R)(信息通信研究所) Agency for Science, Technology and Research (A*STAR)(科技研究局) Singapore(新加坡)

AI总结 本文提出显式逻辑通道用于验证和增强多模态大语言模型在零样本任务中的性能,通过显式逻辑推理提高模型的可解释性和可信度。

详情
AI中文摘要

前沿多模态大语言模型(MLLMs)在视觉-语言理解(VLC)任务中表现出显著能力。然而,它们通常以黑盒方式部署到新任务中。验证和理解这些模型的行为对于应用到新任务变得重要。我们提出显式逻辑通道,与黑盒模型通道并行,以进行显式逻辑推理用于模型验证、选择和增强。前沿MLLM,封装潜在的视觉语言知识,可以被视为隐式逻辑通道。所提出的显式逻辑通道,模仿人类逻辑推理,结合了一个LLM、一个VFM和逻辑推理与概率推理,用于事实、反事实和关系推理,基于显式视觉证据。提出了一种一致性率(CR)用于跨通道验证和模型选择,即使没有地面真相注释。此外,跨通道整合进一步提高了MLLM在零样本任务中的性能,基于显式视觉证据以增强可信度。在两个代表性的VLC任务,即MC-VQA和HC-REC上,对三个具有挑战性的基准进行综合实验,使用11个最近的开源MLLMs,来自四个前沿家族。我们的系统评估证明了所提出的ELC和CR在增强可解释性和可信度的MLLM模型验证、选择和改进中的有效性。

英文摘要

Frontier Multimodal Large Language Models (MLLMs) exhibit remarkable capabilities in Visual-Language Comprehension (VLC) tasks. However, they are often deployed as zero-shot solution to new tasks in a black-box manner. Validating and understanding the behavior of these models become important for application to new task. We propose an Explicit Logic Channel, in parallel with the black-box model channel, to perform explicit logical reasoning for model validation, selection and enhancement. The frontier MLLM, encapsulating latent vision-language knowledge, can be considered as an Implicit Logic Channel. The proposed Explicit Logic Channel, mimicking human logical reasoning, incorporates a LLM, a VFM, and logical reasoning with probabilistic inference for factual, counterfactual, and relational reasoning over the explicit visual evidence. A Consistency Rate (CR) is proposed for cross-channel validation and model selection, even without ground-truth annotations. Additionally, cross-channel integration further improves performance in zero-shot tasks over MLLMs, grounded with explicit visual evidence to enhance trustworthiness. Comprehensive experiments conducted for two representative VLC tasks, i.e., MC-VQA and HC-REC, on three challenging benchmarks, with 11 recent open-source MLLMs from 4 frontier families. Our systematic evaluations demonstrate the effectiveness of proposed ELC and CR for model validation, selection and improvement on MLLMs with enhanced explainability and trustworthiness.

2603.10935 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Spherical VAE with Cluster-Aware Feasible Regions: Guaranteed Prevention of Posterior Collapse

具有聚类感知可行区域的球形VAE:保证防止后验崩溃

Zegu Zhang, Jian Zhang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种理论保证非崩溃解的新型框架,通过利用球壳几何和聚类感知约束,防止VAE中的后验崩溃问题,并在合成和现实数据集上实现了100%的崩溃预防。

Comments 8 pages, 6 figures

详情
AI中文摘要

变分自编码器(VAEs)经常受到后验崩溃的影响,其中潜在变量在近似后验退化为先验时变得无信息。尽管最近的研究将崩溃描述为由数据协方差属性决定的相变,但现有方法主要旨在避免而非消除崩溃。我们引入了一种新的框架,通过利用球壳几何和聚类感知约束,从理论上保证非崩溃解。我们的方法将数据转换为球壳,通过K-means计算最优聚类分配,并定义一个在聚类内方差W和崩溃损失δ-collapse之间的可行区域。我们证明当重构损失被限制在这个区域内时,崩溃解在数学上被排除在可行参数空间之外。关键的是,我们引入了规范约束机制,确保解码器输出保持与球壳几何兼容,而不限制表示能力。与以往方法不同,我们的方法提供了严格的理论保证,计算开销小,且不施加对解码器输出的限制。在合成和现实数据集上的实验表明,在传统VAE完全失败的条件下,实现了100%的崩溃预防,重构质量匹配或超过最先进的方法。我们的方法不需要显式的稳定性条件(例如σ² < λ_max),并且适用于任意神经网络架构。代码可在https://github.com/tsegoochang/spherical-vae-with-Cluster获取。

英文摘要

Variational autoencoders (VAEs) frequently suffer from posterior collapse, where the latent variables become uninformative as the approximate posterior degenerates to the prior. While recent work has characterized collapse as a phase transition determined by data covariance properties, existing approaches primarily aim to avoid rather than eliminate collapse. We introduce a novel framework that theoretically guarantees non-collapsed solutions by leveraging spherical shell geometry and cluster-aware constraints. Our method transforms data to a spherical shell, computes optimal cluster assignments via K-means, and defines a feasible region between the within-cluster variance $W$ and collapse loss $δ_{\text{collapse}}$. We prove that when the reconstruction loss is constrained to this region, the collapsed solution is mathematically excluded from the feasible parameter space. \textbf{Critically, we introduce norm constraint mechanisms that ensure decoder outputs remain compatible with the spherical shell geometry without restricting representational capacity.} Unlike prior approaches, our method provides a strict theoretical guarantee with minimal computational overhead without imposing constraints on decoder outputs. Experiments on synthetic and real-world datasets demonstrate 100\% collapse prevention under conditions where conventional VAEs completely fail, with reconstruction quality matching or exceeding state-of-the-art methods. Our approach requires no explicit stability conditions (e.g., $σ^2 < λ_{\max}$) and works with arbitrary neural architectures. The code is available at https://github.com/tsegoochang/spherical-vae-with-Cluster.

2603.03328 2026-05-19 cs.CL cs.AI 版本更新

StructLens: A Structural Lens for Language Models via Maximum Spanning Trees

StructLens:通过最大生成树实现语言模型的结构镜像

Haruki Sakajo, Frederikus Hudi, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良科学技术大学)

AI总结 本文提出StructLens框架,通过最大生成树分析语言模型的表示结构,揭示模型在不同层和训练阶段中如何组织token表示。

详情
AI中文摘要

语言具有内在结构,这一特性解释了语言习得和语言变化。鉴于此特性,我们预期语言模型也会表现出自身的内部结构。尽管可解释性研究已经探讨了模型如何通过注意力模式和稀疏自编码器计算表示,但所得到的表示的组织方式却被忽视。为解决这一差距,我们引入StructLens,一个通过整体结构视角分析表示的框架。StructLens基于残差流中的语义表示构建最大生成树,受依赖解析中树表示的启发,并在表示空间中提供token关系的摘要。我们分析了连续token在表示空间中也彼此接近,并发现中间层显示出最强的局部跨度组织。此外,对预训练检查点的分析表明,较小的局部单元在预训练早期变得可检测,而较大的单元则在后期才变得可检测。我们的发现表明,StructLens提供了关于模型在不同层和训练过程中如何组织token表示的见解。我们的代码可在https://github.com/naist-nlp/structlens获取。

英文摘要

Language exhibits inherent structures, a property that explains both language acquisition and language change. Given this characteristic, we expect language models to manifest their own internal structures as well. While interpretability research has investigated how models compute representations mechanistically through attention patterns and Sparse AutoEncoders, the organization of the resulting representations is overlooked. To address this gap, we introduce StructLens, a framework to analyze representations through a holistic structural view. StructLens constructs maximum spanning trees based on the semantic representations in residual streams, inspired by tree representation in dependency parsing, and provides summaries of token relationships in representation space. We analyze how contiguous tokens are also nearby in representation space and find that middle layers show the strongest local-span organization. Moreover, analysis of pre-training checkpoints reveals that smaller local units become detectable earlier in pre-training, and larger units later. Our findings demonstrate that StructLens provides insights into how models organize token representations across layers and training. Our code is available at https://github.com/naist-nlp/structlens.

2603.03308 2026-05-19 cs.CL cs.AI 版本更新

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

旧习惯难改:对话历史如何几何学地困住大语言模型

Adi Simhi, Fazl Barez, Martin Tutek, Yonatan Belinkov, Shay B. Cohen

发表机构 * Technion - Israel Institute of Technology(技术ion-以色列理工学院) University of Oxford(牛津大学) University of Zagreb, FER(Zagreb大学,FER) Kempner Institute, Harvard University(Kempner研究所,哈佛大学) University of Edinburgh(爱丁堡大学)

AI总结 研究探讨对话历史如何通过几何陷阱影响大语言模型的后续表现,提出History-Echoes框架从概率和几何两个角度分析对话历史偏差,并揭示行为持续性在潜在空间中的几何陷阱。

Comments Accepted to ICML 2026

详情
AI中文摘要

大语言模型(LLMs)的对话历史如何影响其未来表现?近期研究表明,LLMs受对话历史影响的方式出人意料。例如,先前交互中的幻觉可能影响后续模型响应。在本工作中,我们引入History-Echoes框架,研究对话历史如何偏移后续生成。该框架从两个角度探索这种偏差:概率上,我们将对话建模为马尔可夫链以量化状态一致性;几何上,我们测量连续隐藏表示的一致性。在三个模型家族和六个涵盖多样化现象的数据集上,我们的分析揭示了两种视角之间的强相关性。通过连接这些视角,我们证明行为持续性表现为几何陷阱,即潜在空间中的间隙会限制模型轨迹。代码可在https://github.com/technion-cs-nlp/OldHabitsDieHard获取。

英文摘要

How does the conversational past of large language models (LLMs) influence their future performance? Recent work suggests that LLMs are affected by their conversational history in unexpected ways. For instance, hallucinations in prior interactions may influence subsequent model responses. In this work, we introduce History-Echoes, a framework that investigates how conversational history biases subsequent generations. The framework explores this bias from two perspectives: probabilistically, we model conversations as Markov chains to quantify state consistency; geometrically, we measure the consistency of consecutive hidden representations. Across three model families and six datasets spanning diverse phenomena, our analysis reveals a strong correlation between the two perspectives. By bridging these perspectives, we demonstrate that behavioral persistence manifests as a geometric trap, where gaps in the latent space confine the model's trajectory. Code available at https://github.com/technion-cs-nlp/OldHabitsDieHard.

2603.03190 2026-05-19 cs.AI q-bio.NC 版本更新

Expectation and Acoustic Neural Network Representations Enhance Music Identification from Brain Activity

期望与听觉神经网络表示增强从脑活动识别音乐

Shogo Noguchi, Taketo Akama, Tai Nakamura, Shun Minamikawa, Natalia Polouliakh

发表机构 * Sony Computer Science Laboratories, Inc.(索尼计算机科学实验室)

AI总结 本研究通过区分听觉和期望相关的神经网络表示作为教师目标,提高了基于EEG的音乐识别性能,展示了表示学习可以由神经编码引导,并为预测音乐认知和神经解码的发展提供了新方向。

Comments 47 pages, 12 figures

详情
AI中文摘要

在音乐聆听过程中,皮层活动编码了听觉和期望相关信息。先前工作已表明,ANN表示类似于皮层表示,并可作为EEG识别的监督信号。本文显示,将听觉和期望相关的ANN表示作为教师目标进行区分,能提高基于EEG的音乐识别性能。预训练以预测任一表示的模型优于非预训练基线,且结合它们可获得互补增益,超过通过不同随机初始化形成的强种子集合。这些发现表明,教师表示类型影响下游性能,且表示学习可以由神经编码引导。本工作为预测音乐认知和神经解码的发展指明了方向。我们的期望表示直接从原始信号计算得出,无需人工标签,反映了超越起始或音高的预测结构,使能够研究跨多样刺激的多层预测编码。其可扩展性表明,未来可能开发出基于皮层编码原理的通用EEG模型。

英文摘要

During music listening, cortical activity encodes both acoustic and expectation-related information. Prior work has shown that ANN representations resemble cortical representations and can serve as supervisory signals for EEG recognition. Here we show that distinguishing acoustic and expectation-related ANN representations as teacher targets improves EEG-based music identification. Models pretrained to predict either representation outperform non-pretrained baselines, and combining them yields complementary gains that exceed strong seed ensembles formed by varying random initializations. These findings show that teacher representation type shapes downstream performance and that representation learning can be guided by neural encoding. This work points toward advances in predictive music cognition and neural decoding. Our expectation representation, computed directly from raw signals without manual labels, reflects predictive structure beyond onset or pitch, enabling investigation of multilayer predictive encoding across diverse stimuli. Its scalability to large, diverse datasets further suggests potential for developing general-purpose EEG models grounded in cortical encoding principles.

2603.03099 2026-05-19 cs.LG cs.AI 版本更新

Why Adam Can Beat SGD: Second-Moment Normalization Yields Sharper Tails

为何Adam能胜过SGD:二阶矩归一化产生更尖锐的尾部

Ruinan Jin, Yingbin Liang, Shaofeng Zou

发表机构 * Department of Electrical and Computer Engineering, The Ohio State University(俄亥俄州立大学电气与计算机工程系) School of Electrical, Computer and Energy Engineering, Arizona State University(亚利桑那州立大学电气、计算机与能源工程学院)

AI总结 本文揭示了Adam中的关键二阶矩归一化机制,并通过停止时间/鞅分析,在经典有界方差模型下,证明了Adam在高概率收敛行为上优于SGD,前者对置信参数δ的依赖为δ^{-1/2},而SGD则至少为δ^{-1}。

Comments 68 pages

详情
AI中文摘要

尽管Adam在许多应用中表现出比SGD更快的实证收敛速度,但现有的大多数理论保证与SGD几乎相同,无法充分解释实证性能差距。在本文中,我们揭示了Adam中的关键二阶矩归一化,并开发了一种停止时间/鞅分析,该分析在经典有界方差模型(一个二阶矩假设)下,能够证明Adam在高概率收敛行为上优于SGD。具体而言,我们建立了两种方法高概率收敛行为之间的第一个理论区分:Adam对置信参数δ的依赖为δ^{-1/2},而SGD对应的高概率保证至少需要δ^{-1}的依赖。

英文摘要

Despite Adam demonstrating faster empirical convergence than SGD in many applications, much of the existing theory yields guarantees essentially comparable to those of SGD, leaving the empirical performance gap insufficiently explained. In this paper, we uncover a key second-moment normalization in Adam and develop a stopping-time/martingale analysis that provably distinguishes Adam from SGD under the classical bounded variance model (a second moment assumption). In particular, we establish the first theoretical separation between the high-probability convergence behaviors of the two methods: Adam achieves a $δ^{-1/2}$ dependence on the confidence parameter $δ$, whereas corresponding high-probability guarantee for SGD necessarily incurs at least a $δ^{-1}$ dependence.

2603.00631 2026-05-19 cs.AI 版本更新

LiTS: A Modular Framework for LLM Tree Search

LiTS:一个用于LLM树搜索的模块化框架

Xinzhe Li, Yaguang Tao

发表机构 * RMIT University(皇家墨尔本理工大学)

AI总结 本文提出LiTS,一个模块化框架,用于通过树搜索进行LLM推理,展示了其在语言推理、环境规划和工具使用任务中的可组合性,并发现无限动作空间中LLM策略多样性是有效树搜索的瓶颈。

Comments ACL 2026 Demo

详情
AI中文摘要

LiTS是一个模块化的Python框架,用于通过树搜索进行LLM推理。它将树搜索分解为三个可重用的组件(策略、转移和奖励模型),这些组件可以插入到MCTS和BFS等算法中。基于装饰器的注册机制使领域专家能够通过注册组件扩展到新领域,使算法研究人员能够实现自定义的搜索算法。我们在MATH500(语言推理)、Crosswords(环境规划)和MapEval(工具使用)上展示了可组合性,证明了组件和算法的正交性:组件可以在每个任务类型内跨算法重用,而算法可以在所有组件和领域中工作。我们还报告了一个模式崩溃发现:在无限动作空间中,LLM策略多样性(而不是奖励质量)是有效树搜索的瓶颈。演示视频可在https://youtu.be/nRGX43YrR3I获取。该包在Apache 2.0许可证下发布于https://github.com/xinzhel/lits-llm,包含安装说明和可运行示例,使用户能够重现演示的工作流。

英文摘要

LiTS is a modular Python framework for LLM reasoning via tree search. It decomposes tree search into three reusable components (Policy, Transition, and RewardModel) that plug into algorithms like MCTS and BFS. A decorator-based registry enables domain experts to extend to new domains by registering components, and algorithmic researchers to implement custom search algorithms. We demonstrate composability on MATH500 (language reasoning), Crosswords (environment planning), and MapEval (tool use), showing that components and algorithms are orthogonal: components are reusable across algorithms within each task type, and algorithms work across all components and domains. We also report a mode-collapse finding: in infinite action spaces, LLM policy diversity (not reward quality) is the bottleneck for effective tree search. A demonstration video is available at https://youtu.be/nRGX43YrR3I. The package is released under the Apache 2.0 license at https://github.com/xinzhel/lits-llm, including installation instructions and runnable examples that enable users to reproduce the demonstrated workflows.

2603.00607 2026-05-19 cs.CV cs.AI 版本更新

IdGlow: Dynamic Identity Modulation for Multi-Subject Generation

IdGlow: 多主体生成中的动态身份调节

Honghao Cai, Xiangyuan Wang, Jing Li, Yunhao Bai, Tianze Zhou, Haohua Chen, Chao Hui, Changhao Qiao, Runqi Wang, Sijie Xu, Yuyang Hao, Zezhou Cui, Yuyuan Yang, Wei Zhu, Yibo Chen, Xu Tang, Yao Hu, Zhen Li

发表机构 * Xiaohongshu Inc.(小红书公司) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Peking University(北京大学) Tsinghua University(清华大学)

AI总结 本文提出IdGlow框架,通过任务自适应的时间步调度和视觉语言模型解决多主体生成中的稳定性与可塑性矛盾,提升面部真实感与商业级美学质量。

详情
AI中文摘要

多主体图像生成需要在一致的场景中无缝协调多个参考身份。然而,现有方法依赖刚性空间掩码或局部注意力,往往在需要复杂结构变形的任务中(如保持身份的年龄变换)面临'稳定性-可塑性困境'。为此,我们提出IdGlow,一种基于流匹配扩散模型的无掩码、分阶段框架。在监督微调(SFT)阶段,我们引入任务自适应的时间步调度,与扩散生成动力学对齐:一种线性衰减调度,逐步放松约束以生成自然群体组成,以及一个时间门控机制,将身份注入集中于关键语义窗口,成功保留成人面部语义而不覆盖儿童样结构。为解决属性泄漏和语义模糊问题而无需显式布局输入,我们进一步整合了基于badcase驱动的视觉语言模型(VLM)进行精确的上下文感知提示合成。在第二阶段,我们设计了细粒度群体级直接偏好优化(DPO)方法,采用加权边距公式,同时消除多主体伪影、提升纹理和谐度,并重新校准身份保真度以适应现实分布。在两个具有挑战性的基准测试——直接多人物融合和年龄变换群体生成——上的大量实验表明,IdGlow从根本上缓解了稳定性-可塑性冲突,实现了在最先进的面部保真度和商业级美学质量之间的优越帕累托平衡。

英文摘要

Multi-subject image generation requires seamlessly harmonizing multiple reference identities within a coherent scene. However, existing methods relying on rigid spatial masks or localized attention often struggle with the "stability-plasticity dilemma," particularly failing in tasks that require complex structural deformations, such as identity-preserving age transformation. To address this, we present IdGlow, a mask-free, progressive two-stage framework built upon Flow Matching diffusion models. In the supervised fine-tuning (SFT) stage, we introduce task-adaptive timestep scheduling aligned with diffusion generative dynamics: a linear decay schedule that progressively relaxes constraints for natural group composition, and a temporal gating mechanism that concentrates identity injection within a critical semantic window, successfully preserving adult facial semantics without overriding child-like anatomical structures. To resolve attribute leakage and semantic ambiguity without explicit layout inputs, we further integrate a badcase-driven Vision-Language Model (VLM) for precise, context-aware prompt synthesis. In the second stage, we design a Fine-Grained Group-Level Direct Preference Optimization (DPO) with a weighted margin formulation to simultaneously eliminate multi-subject artifacts, elevate texture harmony, and recalibrate identity fidelity towards real-world distributions. Extensive experiments on two challenging benchmarks -- direct multi-person fusion and age-transformed group generation -- demonstrate that IdGlow fundamentally mitigates the stability-plasticity conflict, achieving a superior Pareto balance between state-of-the-art facial fidelity and commercial-grade aesthetic quality.

2602.23566 2026-05-19 cs.LG cs.AI 版本更新

Flowette: Flow Matching with Graphette Priors for Graph Generation

Flowette: 用于图生成的图结构先验的流匹配

Asiri Wijesinghe, Sevvandi Kandanaarachchi, Daniel M. Steinberg, Cheng Soon Ong

发表机构 * CSIRO’s Data61(CSIRO数据61) Australian National University(澳大利亚国立大学)

AI总结 本文提出Flowette框架,通过图神经网络基于transformer学习图表示上的速度场,结合最优传输耦合和正则化,利用图ettes先验结构模型提升图生成性能,实验证明结合结构先验和流训练的有效性。

Comments 48 Pages

详情
AI中文摘要

我们研究具有重复子图motif的图生成建模。我们提出了Flowette,一个连续流匹配框架,利用基于图神经网络的transformer学习具有节点和边属性的图表示上的速度场。我们的模型通过基于最优传输的耦合实现拓扑感知对齐,并通过正则化促进全局结构一致性。为整合领域驱动的结构先验,我们引入图ettes,一种新的概率图结构模型家族,通过受控的结构编辑推广图ons以适用于环、星形和树等motif。我们理论分析了框架的耦合、不变性和结构性质,评估了其在合成和分子基准上的性能,并通过受控消融实验隔离了结构先验、最优传输耦合和正则化项的贡献。Flowette在多个基准上取得了竞争性性能,达到多个指标的最先进结果,突显了结合结构先验与流训练在建模复杂图分布中的有效性。

英文摘要

We study generative modeling of graphs with recurring subgraph motifs. We propose Flowette, a continuous flow matching framework that employs a graph neural network-based transformer to learn a velocity field over graph representations with node and edge attributes. Our model promotes topology-aware alignment through optimal transport-based coupling and encourages global structural coherence through regularisation. To incorporate domain-driven structural priors, we introduce graphettes, a new probabilistic family of graph structure models that generalize graphons via controlled structural edits for motifs such as rings, stars, and trees. We theoretically analyze the coupling, invariance, and structural properties of the framework, evaluate it on synthetic and molecular benchmarks, and isolate the contributions of the structural prior, the optimal-transport coupling, and the regularisation terms through controlled ablations. Flowette achieves competitive performance overall, attaining state-of-the-art results on several metrics across multiple benchmarks, highlighting the effectiveness of combining structural priors with flow-based training for modeling complex graph distributions.

2602.20200 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Global Prior Meets Local Consistency: Dual-Memory Augmented Vision-Language-Action Model for Efficient Robotic Manipulation

全局先验与局部一致性:双内存增强的视觉-语言-动作模型用于高效机器人操作

Zaijing Li, Bing Hu, Rui Shao, Gongwei Chen, Dongmei Jiang, Pengwei Xie, Jianye Hao, Liqiang Nie

发表机构 * Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) PengCheng Laboratory(鹏城实验室) Shenzhen Loop Area Institute(深圳洛神研究院) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文提出OptimusVLA模型,通过引入全局先验内存和局部一致性内存,解决机器人操作中动作生成效率低和鲁棒性差的问题,从而在多个基准测试中实现了更高的成功率和更快的推理速度。

Comments Accepted by CVPR 2026

详情
AI中文摘要

分层视觉-语言-动作(VLA)模型已成为机器人操作中的主导范式。它通常包括一个视觉-语言骨干网络用于感知和理解,以及一个生成性策略用于动作生成。然而,其性能越来越受到动作生成过程的限制。(i) 低推理效率。各向同性噪声先验与目标动作分布之间存在显著的分布差距,这会增加去噪步骤和不可行样本的发生率。(ii) 脆弱性差。现有策略仅基于当前观察,忽视了历史序列的约束,因此缺乏对任务进展和时间一致性意识。为了解决这些问题,我们引入OptimusVLA,一种具有全局先验内存(GPM)和局部一致性内存(LCM)的双内存VLA框架。GPM用从语义相似轨迹中检索到的任务级先验替代高斯噪声,从而缩短生成路径并减少函数评估次数(NFE)。LCM动态建模执行的动作序列以推断任务进展,并注入一个学习的一致性约束,强制轨迹的时间一致性和平滑性。在三个模拟基准测试中,OptimusVLA始终优于强大的基线:它在LIBERO上实现了98.6%的平均成功率,在CALVIN上比pi_0提高了13.5%,在RoboTwin 2.0 Hard上达到了38%的平均成功率。在现实世界评估中,OptimusVLA在泛化和长周期套件中排名第一,比pi_0分别高出42.9%和52.4%,同时实现了2.9倍的推理加速。

英文摘要

Hierarchical Vision-Language-Action (VLA) models have rapidly become a dominant paradigm for robotic manipulation. It typically comprising a Vision-Language backbone for perception and understanding, together with a generative policy for action generation. However, its performance is increasingly bottlenecked by the action generation proceess. (i) Low inference efficiency. A pronounced distributional gap between isotropic noise priors and target action distributions, which increases denoising steps and the incidence of infeasible samples. (ii) Poor robustness. Existing policies condition solely on the current observation, neglecting the constraint of history sequence and thus lacking awareness of task progress and temporal consistency. To address these issues, we introduce OptimusVLA, a dual-memory VLA framework with Global Prior Memory (GPM) and Local Consistency Memory (LCM). GPM replaces Gaussian noise with task-level priors retrieved from semantically similar trajectories, thereby shortening the generative path and reducing the umber of function evaluations (NFE). LCM dynamically models executed action sequence to infer task progress and injects a learned consistency constraint that enforces temporal coherence and smoothness of trajectory. Across three simulation benchmarks, OptimusVLA consistently outperforms strong baselines: it achieves 98.6% average success rate on LIBERO, improves over pi_0 by 13.5% on CALVIN, and attains 38% average success rate on RoboTwin 2.0 Hard. In Real-World evaluation, OptimusVLA ranks best on Generalization and Long-horizon suites, surpassing pi_0 by 42.9% and 52.4%, respectively, while delivering 2.9x inference speedup.

2602.17684 2026-05-19 cs.LG cs.AI 版本更新

CodeScaler: Scaling Code LLM Training and Test-Time Inference via Reward Models

CodeScaler: 通过奖励模型扩展代码大语言模型的训练和测试时间推理

Xiao Zhu, Xinyu Zhou, Boyu Zhu, Hanxu Hu, Mingzhe Du, Haotian Zhang, Huiming Wang, Zhijiang Guo

发表机构 * LARK, HKUST(GZ)(LARK,香港科技大学(广州)) Kuaishou Technology(快手科技) UCL(伦敦大学学院) UZH(苏黎世联邦理工学院) NUS(国立新加坡大学)

AI总结 本文提出CodeScaler,一种通过奖励模型扩展代码生成模型的训练和测试时间推理的框架,通过精心编纂的偏好数据和语法感知的代码提取,实现了在四个编码基准上比基于执行的RL提升1.55分,在Qwen3-14B-Base上提升4.23分,并在无测试用例的情况下通过合成数据进一步提升14.64分,同时在推理时间减少10倍的延迟,且在代码、通用和推理领域均优于现有奖励模型。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)通过利用单元测试的执行反馈推动了代码大语言模型的最新进展,但其可扩展性从根本上受到高质量测试用例可用性和可靠性的影响。我们提出CodeScaler,一种奖励模型,旨在扩展代码生成的强化学习训练和测试时间推理。CodeScaler是在经过验证的代码问题上精心编纂的偏好数据上训练的,并结合语法感知的代码提取和保持有效性的奖励塑造,以确保稳定和稳健的优化。在四个编码基准上,CodeScaler在Qwen3-8B-Base上比基于执行的RL提升1.55分,在Qwen3-14B-Base上提升4.23分。通过进一步扩展到44K问题并添加额外的合成数据,CodeScaler在无任何测试用例的情况下,相对于基础模型提升了14.64分。在推理时间,CodeScaler作为有效的测试时间扩展方法,实现了与单元测试方法相当的性能,同时在推理时间减少了10倍的延迟。此外,CodeScaler在RM-Bench上不仅在代码领域(+3.3分)上优于现有奖励模型,还在通用和推理领域(平均+2.7分)上也表现优异。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has driven recent progress in code large language models by leveraging execution-based feedback from unit tests, but its scalability is fundamentally constrained by the availability and reliability of high-quality test cases. We propose CodeScaler, a reward model designed to scale both reinforcement learning training and test-time inference for code generation. CodeScaler is trained on carefully curated preference data derived from verified code problems and incorporates syntax-aware code extraction and validity-preserving reward shaping to ensure stable and robust optimization. Across four coding benchmarks, CodeScaler consistently outperforms execution-based RL by +1.55 points on Qwen3-8B-Base and +4.23 points on Qwen3-14B-Base. By further scaling to 44K problems with additional synthetic data, CodeScaler yields +14.64 points improvement over the base model without requiring any test cases. At inference time, CodeScaler serves as an effective test-time scaling method, achieving performance comparable to unit test approaches while providing a 10-fold reduction in latency. Moreover, CodeScaler surpasses existing reward models on RM-Bench not only in the code domain (+3.3 points), but also in general and reasoning domains (+2.7 points on average).

2602.16990 2026-05-19 cs.AI cs.CE 版本更新

Conv-FinRe: A Conversational and Longitudinal Benchmark for Utility-Grounded Financial Recommendation

Conv-FinRe:一种用于实用导向财务推荐的对话和纵向基准

Yan Wang, Yi Han, Lingfei Qian, Yueru He, Xueqing Peng, Dongji Feng, Zhuohan Xie, Vincent Jim Zhang, Rosie Guo, Fengran Mo, Jimin Huang, Yankai Chen, Xue Liu, Jian-Yun Nie

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Columbia University(哥伦比亚大学) California State University(加州州立大学) University of Montreal(蒙特利尔大学) The University of Manchester(曼彻斯特大学) McGill University(麦吉尔大学)

AI总结 本研究提出Conv-FinRe基准,用于评估金融推荐模型在对话和长期视角下的实用性,通过多视角参考区分描述性行为与基于投资者风险偏好的规范性效用,揭示理性决策与行为一致性的张力。

Comments Accepted by SIGIR 2026 Resource Track. Pre-camera-ready version

详情
AI中文摘要

大多数推荐基准评估模型模仿用户行为的能力。在金融顾问领域,观察到的行为可能在市场波动中嘈杂或短视,并可能与用户的长期目标冲突。因此,将用户的选择视为唯一真实情况,会将行为模仿与决策质量混淆。我们引入Conv-FinRe,一种用于股票推荐的对话和纵向基准,评估LLM超越行为匹配的能力。给定一个入职访谈、分步市场背景和顾问对话,模型必须在固定投资期限内生成排名。关键在于,Conv-FinRe提供了多视角参考,区分描述性行为与基于投资者特定风险偏好的规范性效用,使能够诊断LLM是否遵循理性分析、模仿用户噪声或受市场动量驱动。我们从真实市场数据和人类决策轨迹构建了该基准,实例化了受控的顾问对话,并评估了一套最先进的LLM。结果揭示了理性决策质量与行为一致性的持续张力:在效用基础上表现良好的模型往往无法匹配用户选择,而行为一致的模型可能会过拟合短期噪声。该数据集已公开发布在Hugging Face,代码库可在GitHub上获得。

英文摘要

Most recommendation benchmarks evaluate how well a model imitates user behavior. In financial advisory, however, observed actions can be noisy or short-sighted under market volatility and may conflict with a user's long-term goals. Treating what users chose as the sole ground truth, therefore, conflates behavioral imitation with decision quality. We introduce Conv-FinRe, a conversational and longitudinal benchmark for stock recommendation that evaluates LLMs beyond behavior matching. Given an onboarding interview, step-wise market context, and advisory dialogues, models must generate rankings over a fixed investment horizon. Crucially, Conv-FinRe provides multi-view references that distinguish descriptive behavior from normative utility grounded in investor-specific risk preferences, enabling diagnosis of whether an LLM follows rational analysis, mimics user noise, or is driven by market momentum. We build the benchmark from real market data and human decision trajectories, instantiate controlled advisory conversations, and evaluate a suite of state-of-the-art LLMs. Results reveal a persistent tension between rational decision quality and behavioral alignment: models that perform well on utility-based ranking often fail to match user choices, whereas behaviorally aligned models can overfit short-term noise. The dataset is publicly released on Hugging Face, and the codebase is available on GitHub.

2602.12978 2026-05-19 cs.RO cs.AI 版本更新

Learning Native Continuation for Action Chunking Flow Policies

学习原生延续以实现动作分块流策略

Yufeng Liu, Hang Yu, Juntu Zhao, Bocheng Li, Di Zhang, Mingzhu Li, Wenxuan Wu, Yingdong Hu, Junyuan Xie, Junliang Guo, Dequan Wang, Yang Gao

发表机构 * Spirit AI

AI总结 本文提出Legato方法,通过训练时的延续技术改进动作分块流基于VLA策略,减少动作边界不连续性和伪多模态切换,提升轨迹平滑度和任务完成效率。

Comments Accepted by Robotics: Science and Systems 2026 (RSS 2026). Project page: https://lyfeng001.github.io/Legato/

详情
AI中文摘要

动作分块使Vision Language Action (VLA)模型能够实时运行,但朴素的分块执行常在分块边界处出现不连续性。实时分块(RTC)缓解了这一问题,但其作为外部策略导致伪多模态切换和非内在平滑的轨迹。我们提出Legato,一种针对动作分块流基于VLA策略的训练时延续方法。具体而言,Legato从具有调度形状的已知动作和噪声混合物初始化去噪,使模型接触部分动作信息。此外,Legato重塑学习的流动力学,确保在每步指导下去噪过程在训练和推理之间保持一致。Legato进一步在训练中使用随机调度条件以支持变化的推理延迟并实现可控的平滑度。实证结果表明,Legato产生更平滑的轨迹并减少执行中的伪多模态切换,导致较少的犹豫和更短的任务完成时间。广泛的现实世界实验表明,Legato在五个操作任务中始终优于RTC,实现了轨迹平滑度和任务完成时间的约10%的改进。

英文摘要

Action chunking enables Vision Language Action (VLA) models to run in real time, but naive chunked execution often exhibits discontinuities at chunk boundaries. Real-Time Chunking (RTC) alleviates this issue but is external to the policy, leading to spurious multimodal switching and trajectories that are not intrinsically smooth. We propose Legato, a training-time continuation method for action-chunked flow-based VLA policies. Specifically, Legato initializes denoising from a schedule-shaped mixture of known actions and noise, exposing the model to partial action information. Moreover, Legato reshapes the learned flow dynamics to ensure that the denoising process remains consistent between training and inference under per-step guidance. Legato further uses randomized schedule condition during training to support varying inference delays and achieve controllable smoothness. Empirically, Legato produces smoother trajectories and reduces spurious multimodal switching during execution, leading to less hesitation and shorter task completion time. Extensive real-world experiments show that Legato consistently outperforms RTC across five manipulation tasks, achieving approximately 10% improvements in both trajectory smoothness and task completion time.

2602.12687 2026-05-19 cs.LG cs.AI 版本更新

Trust the uncertain teacher: distilling dark knowledge via calibrated uncertainty

信任不确定的教师:通过校准的不确定性提炼暗知识

Jeonghyun Kim, SooKyung Kim, Richeng Xuan, Hyunsoo Cho

发表机构 * Ewha Womans University(成均馆大学) Tencent(腾讯)

AI总结 本文提出校准不确定性提炼(CUD)框架,通过从分布角度重新审视知识蒸馏,使暗知识更忠实地被访问。CUD鼓励教师在有信息的地方揭示不确定性,并引导学生学习校准而非锐化确定性,从而在易例中获益于自信信号,在难例中获益于结构化不确定性,提升了学生在分布偏移和长尾输入上的准确性和可靠性。

详情
AI中文摘要

知识蒸馏的核心在于将教师的丰富'暗知识'-即揭示类别间关系和不确定性分布的细微概率模式进行转移。尽管这一理念已建立,但传统交叉熵训练的教师往往无法保留此类信号。它们的分布会坍缩成尖锐、过度自信的峰,看似决定性但实际脆弱,提供的仅限于硬标签或在表示层面转移时微妙地阻碍。这种过度自信在高基数任务中尤为成问题,因为许多可能类别的细微差别对指导紧凑的学生至关重要。此外,这种脆弱的目标会降低对分布偏移的鲁棒性,使学生在现实条件下的校准变得不可靠。为解决这一限制,我们从分布角度重新审视蒸馏,并提出校准不确定性蒸馏(CUD)框架,旨在使暗知识更忠实地被访问。CUD鼓励教师在有信息的地方揭示不确定性,并引导学生学习校准而非锐化确定性。通过在转移前直接塑造教师的预测分布,我们的方法在准确性和校准之间取得平衡,使学生在易例中受益于自信信号,在难例中受益于结构化不确定性。在多样化的基准测试中,CUD产生的学生不仅更加准确,而且在分布偏移下更加校准,在模糊的长尾输入上更加可靠。

英文摘要

The core of knowledge distillation lies in transferring the teacher's rich 'dark knowledge'-subtle probabilistic patterns that reveal how classes are related and the distribution of uncertainties. While this idea is well established, teachers trained with conventional cross-entropy often fail to preserve such signals. Their distributions collapse into sharp, overconfident peaks that appear decisive but are in fact brittle, offering little beyond the hard label or subtly hindering representation-level transfer. This overconfidence is especially problematic in high-cardinality tasks, where the nuances among many plausible classes matter most for guiding a compact student. Moreover, such brittle targets reduce robustness under distribution shift, leaving students vulnerable to miscalibration in real-world conditions. To address this limitation, we revisit distillation from a distributional perspective and propose Calibrated Uncertainty Distillation (CUD), a framework designed to make dark knowledge more faithfully accessible. Instead of uncritically adopting the teacher's overconfidence, CUD encourages teachers to reveal uncertainty where it is informative and guides students to learn from targets that are calibrated rather than sharpened certainty. By directly shaping the teacher's predictive distribution before transfer, our approach balances accuracy and calibration, allowing students to benefit from both confident signals on easy cases and structured uncertainty on hard ones. Across diverse benchmarks, CUD yields students that are not only more accurate, but also more calibrated under shift and more reliable on ambiguous, long-tail inputs.

2602.07884 2026-05-19 cs.LG cs.AI 版本更新

GRAFT: Decoupling Ranking and Calibration for Survival Analysis

GRAFT:分离排名与校准用于生存分析

Mohammad Ashhad, Robert Hoehndorf, Ricardo Henao

发表机构 * KAUST(卡奥斯特大学) CEMSE KAUST(KAUST工程与科学学院) Duke University(杜克大学)

AI总结 本文提出GRAFT模型,通过分离预测排名与生存校准,解决生存分析中排名与校准之间的权衡问题,该模型结合线性AFT模型与非线性残差神经网络,并利用随机门进行自动特征选择,从而在公开基准测试中实现了更好的判别能力和校准性能。

详情
AI中文摘要

生存分析受到删失数据、高维特征和非线性交互的挑战。经典模型提供可解释性和优越的校准能力,但局限于线性或预定义的功能形式,而深度学习模型具有灵活性并实现了强大的判别性能,但倾向于产生校准不佳的生存估计。为了解决这一权衡问题,我们提出GRAFT(Gated Residual Accelerated Failure Time),一种新的AFT模型,该模型将预测排名与生存校准分离。GRAFT的混合架构结合了线性AFT模型与非线性残差神经网络,并整合了随机门用于自动特征选择。该模型通过优化可微的、C-index对齐的排名损失进行训练,利用局部Kaplan-Meier估计器的随机条件插补,而校准的生存估计则通过简单的后训练校准获得。在公开基准测试中,GRAFT在判别能力和校准性能上优于基线模型,同时在高噪声设置中保持稳健和稀疏。

英文摘要

Survival analysis is complicated by censored data, high-dimensional features, and non-linear interactions. Classical models offer interpretability and superior calibration but are restricted to linear or predefined functional forms, while deep learning models are flexible and achieve strong discriminative performance, but tend to produce poorly calibrated survival estimates. To address this trade-off, we propose GRAFT (Gated Residual Accelerated Failure Time), a novel AFT model that decouples prognostic ranking from survival calibration. GRAFT's hybrid architecture combines a linear AFT model with a non-linear residual neural network, and it also integrates stochastic gates for automatic feature selection. The model is trained by optimizing a differentiable, C-index-aligned ranking loss using stochastic conditional imputation from local Kaplan-Meier estimators, while calibrated survival estimates are obtained through simple post-training calibration. In public benchmarks, GRAFT outperforms baselines in discrimination and calibration, while remaining robust and sparse in high-noise settings.

2602.05287 2026-05-19 cs.AI 版本更新

Position: Universal Time Series Foundation Models Rest on a Category Error

位置:通用时间序列基础模型建立在类别错误上

Xilin Dai, Wanxu Cai, Zhijian Xu, Qiang Xu

发表机构 * ZJU-UIUC Institute(浙大-UIUC研究院) School of Software(软件学院) Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 本文指出,追求'通用时间序列基础模型'存在根本性的类别错误,将结构容器误认为语义模态。由于时间序列包含不兼容的生成过程(如金融与流体动力学),单一大模型退化为昂贵的'通用过滤器',在分布漂移下无法泛化。为此,我们引入'自回归盲目界限',证明仅依赖历史的模型无法预测干预驱动的制度转变。我们主张用因果控制代理范式取代通用性,其中代理利用外部上下文协调一系列专门的求解器,从冻结领域专家到轻量级即时适应器。最后,我们呼吁将基准从'零样本准确性'转向'漂移适应速度',以优先考虑鲁棒、控制理论系统。

详情
AI中文摘要

本文立场论文认为,追求'通用时间序列基础模型'建立在根本性的类别错误上,误将结构容器视为语义模态。我们指出,由于时间序列包含不兼容的生成过程(例如金融与流体动力学),单一大模型退化为昂贵的'通用过滤器',在分布漂移下无法泛化。为解决这一问题,我们引入'自回归盲目界限',一个理论极限,证明仅依赖历史的模型无法预测干预驱动的制度转变。我们主张用因果控制代理范式取代通用性,其中代理利用外部上下文协调一系列专门的求解器,从冻结领域专家到轻量级即时适应器。最后,我们呼吁将基准从'零样本准确性'转向'漂移适应速度',以优先考虑鲁棒、控制理论系统。

英文摘要

This position paper argues that the pursuit of "Universal Foundation Models for Time Series" rests on a fundamental category error, mistaking a structural Container for a semantic Modality. We contend that because time series hold incompatible generative processes (e.g., finance vs. fluid dynamics), monolithic models degenerate into expensive "Generic Filters" that fail to generalize under distributional drift. To address this, we introduce the "Autoregressive Blindness Bound," a theoretical limit proving that history-only models cannot predict intervention-driven regime shifts. We advocate replacing universality with a Causal Control Agent paradigm, where an agent leverages external context to orchestrate a hierarchy of specialized solvers, from frozen domain experts to lightweight Just-in-Time adaptors. We conclude by calling for a shift in benchmarks from "Zero-Shot Accuracy" to "Drift Adaptation Speed" to prioritize robust, control-theoretic systems.

2602.04872 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Multi-layer Cross-attention is Provably Optimal for Multi-modal In-context Learning

多层交叉注意力是多模态上下文学习中可证明最优的

Nicholas Barnfield, Subhabrata Sen, Pragya Sur

发表机构 * Harvard University(哈佛大学)

AI总结 本文研究了多模态上下文学习中多层交叉注意力机制的理论最优性,证明了在多模态数据下,交叉注意力机制在梯度流优化下可达到贝叶斯最优,同时指出单层线性自注意力无法在任务分布下统一恢复贝叶斯最优预测。

详情
AI中文摘要

近期进展迅速推动了我们对现代基于注意力的神经网络中上下文学习机制的理解。然而,现有结果仅专注于单模态数据;相比之下,多模态数据的上下文学习的理论基础仍不清晰。我们引入了一个数学上可处理的框架来研究多模态学习,并探讨了在何种情况下Transformer-like架构可以在上下文中恢复贝叶斯最优性能。为了建模多模态问题,我们假设观测数据来自一个潜在因子模型。我们的第一个结果是对表达性的否定:我们证明单层线性自注意力无法在任务分布下统一恢复贝叶斯最优预测。为了解决这一限制,我们引入了一种新的线性化交叉注意力机制,并在交叉注意力层和上下文长度都较大的情况下进行研究。我们证明,当使用梯度流优化时,这种交叉注意力机制可证明是贝叶斯最优的。我们的结果强调了深度对上下文学习的好处,并确立了交叉注意力在多模态分布中的可证明效用。

英文摘要

Recent progress has rapidly advanced our understanding of the mechanisms underlying in-context learning in modern attention-based neural networks. However, existing results focus exclusively on unimodal data; in contrast, the theoretical underpinnings of in-context learning for multi-modal data remain poorly understood. We introduce a mathematically tractable framework for studying multi-modal learning and explore when transformer-like architectures can recover Bayes-optimal performance in-context. To model multi-modal problems, we assume the observed data arises from a latent factor model. Our first result comprises a negative take on expressibility: we prove that single-layer, linear self-attention fails to recover the Bayes-optimal predictor uniformly over the task distribution. To address this limitation, we introduce a novel, linearized cross-attention mechanism, which we study in the regime where both the number of cross-attention layers and the context length are large. We show that this cross-attention mechanism is provably Bayes optimal when optimized using gradient flow. Our results underscore the benefits of depth for in-context learning and establish the provable utility of cross-attention for multi-modal distributions.

2602.00924 2026-05-19 cs.AI 版本更新

Supervised sparse auto-encoders for interpretable and compositional representations

监督稀疏自编码器用于可解释和组合性表示

Ouns El Harzli, Hugo Wallner, Yoonsoo Nam, Haixuan Xavier Tao

发表机构 * Department of Computer Science, University of Oxford, Oxford, UK(牛津大学计算机科学系) KAIST AI, Korean Advanced Institute of Science(韩国科学技术高级研究院AI研究所) Independent researcher(独立研究者)

AI总结 本文提出了一种监督稀疏自编码器,通过结合无约束特征模型和监督学习,解决稀疏自编码器在非光滑性及特征与人类语义对齐方面的不足,实现了组合性泛化和语义图像编辑。

详情
AI中文摘要

稀疏自编码器(SAEs)重新成为机制可解释性的重要方法,但面临两个重大挑战:$L_1$惩罚的非光滑性阻碍了重建和可扩展性,以及学习到的特征与人类语义不一致。在本文中,我们通过适应无约束特征模型,一种来自神经崩溃理论的数学框架,并通过监督任务来解决这些限制。我们监督(解码器-only)SAEs通过联合学习稀疏概念嵌入和解码器权重来重建特征向量。在Stable Diffusion 3.5上验证,我们的方法展示了组合性泛化,成功重建了训练期间未见过的概念组合图像,并在不修改提示的情况下实现了特征级的语义图像编辑。

英文摘要

Sparse auto-encoders (SAEs) have re-emerged as a prominent method for mechanistic interpretability, yet they face two significant challenges: the non-smoothness of the $L_1$ penalty, which hinders reconstruction and scalability, and a lack of alignment between learned features and human semantics. In this paper, we address these limitations by adapting unconstrained feature models, a mathematical framework from neural collapse theory, and by supervising the task. We supervise (decoder-only) SAEs to reconstruct feature vectors by jointly learning sparse concept embeddings and decoder weights. Validated on Stable Diffusion 3.5, our approach demonstrates compositional generalization, successfully reconstructing images with concept combinations unseen during training, and enabling feature-level intervention for semantic image editing without prompt modification.

2601.21468 2026-05-19 cs.AI 版本更新

MemOCR: Layout-Aware Visual Memory for Efficient Long-Horizon Reasoning

MemOCR: 一种面向布局的视觉记忆用于高效的长周期推理

Yaorui Shi, Shugui Liu, Yu Yang, Wenyu Mao, Yuxin Chen, Qi GU, Hui Su, Xunliang Cai, Xiang Wang, An Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学) School of Computing(计算学院)

AI总结 MemOCR通过利用视觉布局进行自适应信息密度分配,提高了在有限上下文预算下的长周期推理效率,其核心方法是维护结构化的丰富文本记忆并将其渲染为图像,以实现对关键证据的视觉优先级分配和辅助细节的压缩,从而在各种基准测试中优于基于文本的基线方法。

详情
AI中文摘要

长周期代理推理需要有效地将增长的交互历史压缩到有限的上下文窗口中。现有的记忆系统通常将历史序列化为文本,其中每个标记的费用是均匀的,并且随着长度线性增长,往往在低价值细节上消耗稀缺的预算。为此,我们引入了MemOCR,一种多模态记忆代理,通过通过视觉布局进行自适应信息密度分配,从而在有限的上下文预算下提高长周期推理的效率。具体而言,MemOCR维护一个结构化的丰富文本记忆(例如标题、重点),并将其渲染为图像,供代理在记忆访问时参考,通过视觉优先级分配关键证据,同时积极压缩辅助细节。为了确保在不同内存预算下的鲁棒性,我们通过强化学习训练MemOCR,使用预算意识目标,使代理能够适应不同的压缩水平。在长上下文多跳和单跳问答基准测试中,MemOCR优于强大的文本基线,并在极端预算下实现了更有效的上下文利用。

英文摘要

Long-horizon agentic reasoning necessitates effectively compressing growing interaction histories into a limited context window. Most existing memory systems serialize history as text, where token-level cost is uniform and scales linearly with length, often spending scarce budget on low-value details. To this end, we introduce MemOCR, a multimodal memory agent that improves long-horizon reasoning under tight context budgets by allocating memory space with adaptive information density through visual layout. Concretely, MemOCR maintains a structured rich-text memory (e.g., headings, highlights) and renders it into an image that the agent consults for memory access, visually prioritizing crucial evidence while aggressively compressing auxiliary details. To ensure robustness across varying memory budgets, we train MemOCR with reinforcement learning under budget-aware objectives that expose the agent to diverse compression levels. Across long-context multi-hop and single-hop question-answering benchmarks, MemOCR outperforms strong text-based baselines and achieves more effective context utilization under extreme budgets.

2601.17887 2026-05-19 cs.AI 版本更新

When Personalization Legitimizes Risks: Uncovering Safety Vulnerabilities in Personalized Dialogue Agents

当个性化合理化风险:揭示个性化对话代理中的安全漏洞

Jiahe Guo, Xiangran Guo, Yulin Hu, Zimo Long, Xingyu Sui, Xuda Zhi, Yongbo Huang, Hao He, Weixiang Zhao, Yanyan Zhao, Bing Qin

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) SERES Group Co., Ltd(SERES集团有限公司)

AI总结 本文研究了个性化对话代理中的一种安全故障模式——意图合理化,通过引入PS-Bench基准测试,揭示了个性化记忆如何偏移意图推断并导致模型合理化有害查询,提出了一种轻量级的检测-反思方法以减少安全退化。

详情
AI中文摘要

长期记忆使大型语言模型(LLM)代理能够支持个性化和持续的交互。然而,大多数关于个性化代理的研究优先考虑效用和用户体验,将记忆视为中性组件,并在很大程度上忽略了其安全影响。在本文中,我们揭示了意图合理化,一种此前未被充分探讨的安全故障,在个性化代理中,良性个人记忆会偏移意图推断,导致模型合理化本质上有害的查询。为了研究这一现象,我们引入了PS-Bench,一个用于识别和量化个性化交互中意图合理化的基准测试。在多个增强记忆的代理框架和基础LLM中,个性化将攻击成功率提高了15.8%至243.7%相对于无状态基线。我们进一步从内部表示空间提供了意图合理化的机理证据,并提出了一种轻量级的检测-反思方法,有效减少了安全退化。总体而言,我们的工作提供了首次系统探索和评估意图合理化作为一种安全故障模式,这种模式自然地从良性、现实世界的个性化中产生,突显了在长期个人背景下评估安全的重要性。我们的代码可在:https://github.com/MuyuenLP/PS-Bench获得。警告:本文可能包含有害内容。

英文摘要

Long-term memory enables large language model (LLM) agents to support personalized and sustained interactions. However, most work on personalized agents prioritizes utility and user experience, treating memory as a neutral component and largely overlooking its safety implications. In this paper, we reveal intent legitimation, a previously underexplored safety failure in personalized agents, where benign personal memories bias intent inference and cause models to legitimize inherently harmful queries. To study this phenomenon, we introduce PS-Bench, a benchmark designed to identify and quantify intent legitimation in personalized interactions. Across multiple memory-augmented agent frameworks and base LLMs, personalization increases attack success rates by 15.8\%--243.7\% relative to stateless baselines. We further provide mechanistic evidence for intent legitimation from internal representations space, and propose a lightweight detection-reflection method that effectively reduces safety degradation. Overall, our work provides the first systematic exploration and evaluation of intent legitimation as a safety failure mode that naturally arises from benign, real-world personalization, highlighting the importance of assessing safety under long-term personal context. Our code is available at: https://github.com/MuyuenLP/PS-Bench. WARNING: This paper may contain harmful content.

2601.16414 2026-05-19 cs.LG cs.AI 版本更新

PyHealth 2.0: A Comprehensive Open-Source Toolkit for Accessible and Reproducible Clinical Deep Learning

PyHealth 2.0: 一个全面的开源工具包,用于可访问和可重复的临床深度学习

John Wu, Yongda Fan, Zhenbang Wu, Paul Landes, Eric Schrock, Sayeed Sajjad Razin, Arjun Chatterjee, Naveen Baskaran, Joshua Steier, Andrea Fitzpatrick, Bilal Arif, Rian Atri, Jathurshan Pradeepkumar, Siddhartha Laghuvarapu, Junyi Gao, Adam R. Cross, Jimeng Sun

发表机构 * University of Illinois Urbana-Champaign, Urbana, IL, USA(伊利诺伊大学厄巴纳-香槟分校) PyHealth Research Initiative(PyHealth研究计划) University of Illinois College of Medicine, Chicago, IL, USA(伊利诺伊大学医学院) The University of Edinburgh, Edinburgh, UK(爱丁堡大学) Health Data Research UK, London, UK(英国健康数据研究) Department of Biomedical Engineering, Bangladesh University of Engineering(孟加拉国工程大学生物医学工程系)

AI总结 本文提出PyHealth 2.0,一个全面的开源工具包,旨在解决临床AI研究中的可重复性和可访问性问题,通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化方法,实现7行代码即可完成预测建模。

Comments Under Review

详情
AI中文摘要

难以复制基线、高计算成本和所需领域专业知识创建了持续存在的临床AI研究障碍。为了解决这些挑战,我们介绍了PyHealth 2.0,一个增强的临床深度学习工具包,使在7行代码内即可实现预测建模。PyHealth 2.0提供了三个关键贡献:(1) 一个全面的工具包,通过统一15+数据集、20+临床任务、25+模型、5+可解释性方法和不确定性量化(包括符合预测的置信预测)在一个框架中解决可重复性和兼容性挑战,支持多种临床数据模态——信号、影像和电子健康记录——并翻译5+医学编码标准;(2) 以可访问性为重点的设计,支持多模态数据和多样化的计算资源,处理速度比以往快39倍,内存使用减少20倍,使从16GB笔记本电脑到生产系统都能轻松使用;(3) 一个活跃的开源社区,拥有400多名成员,通过详尽的文档、可重复研究贡献以及与学术医疗系统和产业伙伴的合作,包括通过RHealth实现的多语言支持,降低了领域专业知识的障碍。PyHealth 2.0建立了一个开源基础和社区,推动了可访问和可重复的医疗AI发展。可在pip install pyhealth中获取。

英文摘要

Difficulty replicating baselines, high computational costs, and required domain expertise create persistent barriers to clinical AI research. To address these challenges, we introduce PyHealth 2.0, an enhanced clinical deep learning toolkit that enables predictive modeling in as few as 7 lines of code. PyHealth 2.0 offers three key contributions: (1) a comprehensive toolkit addressing reproducibility and compatibility challenges by unifying 15+ datasets, 20+ clinical tasks, 25+ models, 5+ interpretability methods, and uncertainty quantification including conformal prediction within a single framework that supports diverse clinical data modalities - signals, imaging, and electronic health records - with translation of 5+ medical coding standards; (2) accessibility-focused design accommodating multimodal data and diverse computational resources with up to 39x faster processing and 20x lower memory usage, enabling work from 16GB laptops to production systems; and (3) an active open-source community of 400+ members lowering domain expertise barriers through extensive documentation, reproducible research contributions, and collaborations with academic health systems and industry partners, including multi-language support via RHealth. PyHealth 2.0 establishes an open-source foundation and community advancing accessible, reproducible healthcare AI. Available at pip install pyhealth.

2601.15630 2026-05-19 cs.AI 版本更新

Agentic AI Governance and Lifecycle Management in Healthcare

医疗领域代理AI治理与生命周期管理

Chandra Prakash, Mary Lind, Avneesh Sisodia

发表机构 * School of Computer Information Sciences(计算机信息科学学院) University of the Cumberlands(坎伯兰大学) Williamsburg(威廉斯堡)

AI总结 本文提出了一种统一的代理生命周期管理框架,旨在解决医疗领域中代理蔓延问题,通过五个控制层实现可审计的监督,同时支持本地创新和安全扩展。

Comments 21 Pages, 9 figures

详情
AI中文摘要

医疗组织开始将代理AI嵌入到常规工作流程中,包括临床文档支持和早期预警监测。随着这些能力在各部门和供应商间扩散,医疗系统面临代理蔓延问题,导致代理重复、责任不明确、控制不一致和持续存在的工具权限。现有AI治理框架强调生命周期风险管理,但对代理舰队的日常操作提供有限指导。本文提出了一种统一的代理生命周期管理(UALM)蓝图,基于快速、实践导向的治理标准、代理安全文献和医疗合规要求的综合。UALM将反复出现的差距映射到五个控制层上:(1)身份和人物注册,(2)编排和跨域调解,(3) PHI 限定的上下文和记忆,(4)运行时策略执行与杀开关触发器,(5)生命周期管理和退役与凭证撤销和审计日志相关联。一个配套的成熟度模型支持分阶段采用。UALM为医疗CIO、CISO和临床领导者提供了一种可实施的模式,以实现可审计的监督,同时保持本地创新并安全扩展到临床和行政领域。

英文摘要

Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.

2601.14568 2026-05-19 cs.CV cs.AI 版本更新

Breaking the accuracy-resource dilemma: a lightweight adaptive video inference enhancement

打破精度-资源困境:一种轻量级自适应视频推理增强

Wei Ma, Shaowu Chen, Junjie Ye, Peichang Zhang, Lei Huang

发表机构 * State Key Laboratory of Radio Frequency Heterogeneous Integration (Shenzhen University)(无线电频率异构集成国家重点实验室(深圳大学)) Institute of Applied Artificial Intelligence of the Guangdong–HongKong–Macao Greater Bay(粤港澳大湾区应用人工智能研究院) Henan Academy of Science Applied Physics Institute Co.,Ltd.(河南省应用物理科学研究院有限公司)

AI总结 本文提出了一种轻量级自适应视频推理增强框架,通过动态切换不同规模的模型来平衡资源利用与推理性能。

Comments 5 pages, 5 figures

详情
AI中文摘要

现有的视频推理(VI)增强方法通常通过扩大模型规模和采用复杂的网络架构来提高性能。尽管这些方法展示了最先进的性能,但往往忽视了资源效率和推理有效性之间的权衡,导致资源利用效率低下和次优的推理性能。为了解决这个问题,本文开发了一种基于关键系统参数和推理相关指标的模糊控制器(FC-r)。在FC-r的指导下,提出了一种VI增强框架,利用相邻视频帧中目标的时空相关性。根据目标设备的实时资源条件,该框架可以在VI过程中动态切换不同规模的模型。实验结果表明,所提出的方法有效实现了资源利用和推理性能之间的平衡。

英文摘要

Existing video inference (VI) enhancement methods typically aim to improve performance by scaling up model sizes and employing sophisticated network architectures. While these approaches demonstrated state-of-the-art performance, they often overlooked the trade-off of resource efficiency and inference effectiveness, leading to inefficient resource utilization and suboptimal inference performance. To address this problem, a fuzzy controller (FC-r) is developed based on key system parameters and inference-related metrics. Guided by the FC-r, a VI enhancement framework is proposed, where the spatiotemporal correlation of targets across adjacent video frames is leveraged. Given the real-time resource conditions of the target device, the framework can dynamically switch between models of varying scales during VI. Experimental results demonstrate that the proposed method effectively achieves a balance between resource utilization and inference performance.

2601.09722 2026-05-19 cs.CL cs.AI 版本更新

ADMEDTAGGER: an annotation framework for distillation of expert knowledge for the Polish medical language

ADMEDTAGGER: 一个用于波兰医疗语言知识蒸馏的标注框架

Franciszek Górski, Andrzej Czyżewski

发表机构 * Gdansk University of Technology(格但斯克技术大学)

AI总结 本文提出了一种标注框架,展示如何利用一个多语言预训练大语言模型作为教师模型,蒸馏出用于标注波兰医疗文本所需的专业知识,通过开发多类分类器,解决了标注资源不足的问题,最终得到了高效的分类器。

详情
AI中文摘要

在本工作中,我们提出了一种标注框架,展示了如何利用一个多语言预训练大语言模型作为教师模型,蒸馏出用于标注波兰医疗文本所需的专业知识。本工作是ADMEDVOICE项目的一部分,在此项目中,我们收集了涵盖五个临床类别(放射学、肿瘤学、心脏病学、高血压和病理学)的大量医疗文本语料库。利用这些数据,我们开发了一个多类分类器,但根本问题在于缺乏足够的标注资源来标注足够数量的文本。因此,在我们的解决方案中,我们使用多语言Llama3.1模型来标注大量波兰医疗文本语料库。利用我们有限的标注资源,我们只验证了这些标签中的一部分,从而创建了一个测试集。通过这种方式标注的数据随后用于训练和验证三种基于BERT架构的分类器:基于DistilBERT的蒸馏模型、在医疗数据上微调的BioBERT以及在波兰语言语料库上微调的HerBERT。在我们训练的模型中,DistilBERT模型表现最佳,每个临床类别达到了F1分数大于0.80,其中三个类别达到了F1分数大于0.93。通过这种方式,我们得到了一系列高效的分类器,这些分类器在大小、GPU VRAM消耗和推理速度方面分别比大型语言模型小约500倍、低300倍,以及快数百倍。

英文摘要

In this work, we present an annotation framework that demonstrates how a multilingual LLM pretrained on a large corpus can be used as a teacher model to distill the expert knowledge needed for tagging medical texts in Polish. This work is part of a larger project called ADMEDVOICE, within which we collected an extensive corpus of medical texts representing five clinical categories - Radiology, Oncology, Cardiology, Hypertension, and Pathology. Using this data, we had to develop a multi-class classifier, but the fundamental problem turned out to be the lack of resources for annotating an adequate number of texts. Therefore, in our solution, we used the multilingual Llama3.1 model to annotate an extensive corpus of medical texts in Polish. Using our limited annotation resources, we verified only a portion of these labels, creating a test set from them. The data annotated in this way were then used for training and validation of 3 different types of classifiers based on the BERT architecture - the distilled DistilBERT model, BioBERT fine-tuned on medical data, and HerBERT fine-tuned on the Polish language corpus. Among the models we trained, the DistilBERT model achieved the best results, reaching an F1 score > 0.80 for each clinical category and an F1 score > 0.93 for 3 of them. In this way, we obtained a series of highly effective classifiers that represent an alternative to large language models, due to their nearly 500 times smaller size, 300 times lower GPU VRAM consumption, and several hundred times faster inference.

2601.08118 2026-05-19 cs.AI cs.LG 版本更新

MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

MirrorBench: 一个评估对话用户代理人类化能力的基准测试

Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, Anil Babu Ankisettipalli

发表机构 * SAP Labs(SAP实验室)

AI总结 本文提出MirrorBench基准测试,用于评估对话用户代理的人类化能力,通过结合多种词汇多样性指标和LLM评估指标,揭示用户代理与真实人类用户之间的系统性差距。

Comments KDD 2026 (Dataset & Benchmark Track)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作人类模拟器,既用于评估对话系统,也用于生成微调数据。然而,简单的

英文摘要

Large language models (LLMs) are increasingly used as human simulators, both for evaluating conversational systems and for generating fine-tuning data. However, naive "act-as-a-user" prompting often yields verbose, unrealistic utterances, motivating principled evaluation of *user proxy agents*. We present **MirrorBench**, a reproducible and extensible benchmarking framework that evaluates user proxies solely on their ability to produce human-like user utterances across diverse conversational regimes, explicitly decoupled from downstream task success. **MirrorBench** combines three lexical-diversity metrics (**MATTR**, **Yule's~$K$**, and **HD-D**) with three LLM-judge-based metrics (**GTEval**, **Pairwise Indistinguishability**, and **Rubric-and-Reason**), and contextualizes judge scores using Human-Human and Proxy-Proxy calibration controls. Across four public datasets, **MirrorBench** yields variance-aware comparisons and reveals systematic gaps between user proxies and real human users. The framework is open sourced at https://github.com/SAP/mirrorbench and includes a command-line interface for running and managing user-proxy benchmarking experiments.

2601.07122 2026-05-19 cs.CR cs.AI cs.LG 版本更新

Enhancing Cloud Network Resilience via a Robust LLM-Empowered Multi-Agent Reinforcement Learning Framework

通过一个鲁棒的LLM赋能的多智能体强化学习框架增强云网络韧性

Yixiao Peng, Hao Hu, Feiyang Li, Xinye Cao, Yingchang Jiang, Jipeng Tang, Guoshun Nan, Yuling Liu

发表机构 * State Key Laboratory of Mathematical Engineering and Advanced Computing(数学工程与先进计算国家重点实验室) Henan Key Laboratory of Information Security(河南省信息安全重点实验室) National Engineering Research Center for Mobile Network Technologies(移动网络技术国家工程研究中心) Beijing University of Posts and Telecommunications(北京邮电大学) Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所)

AI总结 本文提出了一种基于大语言模型的多智能体强化学习框架,旨在提升云网络的防御能力和韧性,通过分层架构和人类在回路支持来增强系统的适应性和可解释性。

详情
AI中文摘要

尽管虚拟化和资源池化赋予了云网络结构灵活性和弹性扩展能力,但它们不可避免地扩大了攻击面并挑战了网络的网络安全性。基于强化学习(RL)的防御策略已被开发用于在对抗条件下优化资源部署和隔离策略,以通过维护和恢复网络可用性来增强系统韧性。然而,现有方法缺乏鲁棒性,因为它们需要重新训练才能适应网络结构、节点规模、攻击策略和攻击强度的动态变化。此外,缺乏人类在回路(HITL)支持限制了可解释性和灵活性。为了解决这些限制,我们提出了CyberOps-Bots,一种由大语言模型(LLMs)赋能的分层多智能体强化学习框架。受MITRE ATT&CK的战术-技术模型启发,CyberOps-Bots具有双层架构:(1)一个上层LLM代理,包含四个模块——ReAct规划、IPDRR基础感知、长短时记忆和动作/工具整合,执行全局意识、人类意图识别和战术规划;(2)下层RL代理,通过异构分离预训练开发,执行原子防御动作,以在本地网络区域中执行。这种协同作用保留了LLM的适应性和可解释性,同时确保了可靠的RL执行。在真实云数据集上的实验表明,与最先进的算法相比,CyberOps-Bots在不重新训练的情况下,网络可用性保持在68.5%更高,并且在场景切换时实现了34.7%的性能提升。据我们所知,这是首次建立具有HITL支持的鲁棒LLM-RL框架用于云防御的研究。

英文摘要

While virtualization and resource pooling empower cloud networks with structural flexibility and elastic scalability, they inevitably expand the attack surface and challenge cyber resilience. Reinforcement Learning (RL)-based defense strategies have been developed to optimize resource deployment and isolation policies under adversarial conditions, aiming to enhance system resilience by maintaining and restoring network availability. However, existing approaches lack robustness as they require retraining to adapt to dynamic changes in network structure, node scale, attack strategies, and attack intensity. Furthermore, the lack of Human-in-the-Loop (HITL) support limits interpretability and flexibility. To address these limitations, we propose CyberOps-Bots, a hierarchical multi-agent reinforcement learning framework empowered by Large Language Models (LLMs). Inspired by MITRE ATT&CK's Tactics-Techniques model, CyberOps-Bots features a two-layer architecture: (1) An upper-level LLM agent with four modules--ReAct planning, IPDRR-based perception, long-short term memory, and action/tool integration--performs global awareness, human intent recognition, and tactical planning; (2) Lower-level RL agents, developed via heterogeneous separated pre-training, execute atomic defense actions within localized network regions. This synergy preserves LLM adaptability and interpretability while ensuring reliable RL execution. Experiments on real cloud datasets show that, compared to state-of-the-art algorithms, CyberOps-Bots maintains network availability 68.5% higher and achieves a 34.7% jumpstart performance gain when shifting the scenarios without retraining. To our knowledge, this is the first study to establish a robust LLM-RL framework with HITL support for cloud defense.

2601.06943 2026-05-19 cs.CV cs.AI 版本更新

Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

观看、推理与搜索:一个面向开放网络的视频深度研究基准,用于代理视频推理

Chengwen Liu, Xiaomin Yu, Zhuoyue Chang, Zhe Huang, Shuo Zhang, Heng Lian, Jisheng Dang, Rui Xu, Sen Hu, Jianheng Hou, Chengwei Qin, Xiaobin Hu, Kunyi Wang, Zhi Yang, Hao Peng, Hong Peng, Ronghao Chen, Huacan Wang

发表机构 * LZU(兰州大学) HKUST(GZ)(香港科技大学(广州)) UBC(不列颠哥伦比亚大学) FDU(福建大学) PKU(北京大学) USC(美国南加州大学) NUS(新加坡国立大学) UCAS(中国科学院大学) HKUST(香港科技大学) QuantaAlpha(量子Alpha)

AI总结 本文提出VideoDR基准,用于研究开放网络环境下视频代理推理,通过跨帧视觉锚点提取、交互式网络检索和多跳推理验证,揭示了长检索链中维持初始视频锚点、目标漂移和长时程一致性等关键挑战。

详情
AI中文摘要

在现实世界视频问答场景中,视频往往只提供局部视觉线索,而可验证答案分布在开放网络中;模型因此需要联合执行跨帧线索提取、迭代检索和基于多跳推理的验证。为弥合这一差距,我们构建了首个视频深度研究基准VideoDR。VideoDR专注于视频条件的开放领域视频问答,要求进行跨帧视觉锚点提取、交互式网络检索和基于联合视频-网络证据的多跳推理;通过严格的真人标注和质量控制,我们获得了涵盖六个语义领域的高质量视频深度研究样本。我们评估了多种闭源和开源多模态大语言模型在Workflow和Agentic范式下的表现,结果表明Agentic并不始终优于Workflow:其收益取决于模型在长检索链中维持初始视频锚点的能力。进一步分析表明,目标漂移和长时程一致性是核心瓶颈。总之,VideoDR为研究开放网络环境下视频代理提供了系统性的基准,并揭示了下一代视频深度研究代理的关键挑战。

英文摘要

In real-world video question answering scenarios, videos often provide only localized visual cues, while verifiable answers are distributed across the open web; models therefore need to jointly perform cross-frame clue extraction, iterative retrieval, and multi-hop reasoning-based verification. To bridge this gap, we construct the first video deep research benchmark, VideoDR. VideoDR centers on video-conditioned open-domain video question answering, requiring cross-frame visual anchor extraction, interactive web retrieval, and multi-hop reasoning over joint video-web evidence; through rigorous human annotation and quality control, we obtain high-quality video deep research samples spanning six semantic domains. We evaluate multiple closed-source and open-source multimodal large language models under both the Workflow and Agentic paradigms, and the results show that Agentic is not consistently superior to Workflow: its gains depend on a model's ability to maintain the initial video anchors over long retrieval chains. Further analysis indicates that goal drift and long-horizon consistency are the core bottlenecks. In sum, VideoDR provides a systematic benchmark for studying video agents in open-web settings and reveals the key challenges for next-generation video deep research agents.

2601.04855 2026-05-19 cs.LG cs.AI 版本更新

Rethinking GNNs and Missing Features: Challenges, Evaluation and a Robust Solution

重新思考图神经网络与缺失特征:挑战、评估和一个稳健的解决方案

Francesco Ferrini, Veronica Lachi, Antonio Longa, Bruno Lepri, Matono Akiyoshi, Andrea Passerini, Xin Liu, Manfred Jaeger

发表机构 * University of Trento, Trento, Italy(特伦托大学) UiT, The Arctic University of Norway, Tromsø, Norway(北极大学) Aalborg University, Aalborg, Denmark(奥尔堡大学)

AI总结 本文针对图神经网络中缺失节点特征的问题,提出了一种稳健的解决方案,通过设计更真实的缺失机制和评估协议,提高了模型的鲁棒性。

详情
AI中文摘要

处理缺失节点特征是部署图神经网络(GNNs)在现实领域如医疗和传感器网络中的关键挑战。现有研究主要针对相对温和的场景,即基准数据集,其中节点特征具有高维但稀疏的特征和由完全随机缺失(MCAR)机制生成的不完整数据。对于(a),我们理论证明高稀疏性显著限制了缺失性导致的信息损失,使所有模型看起来稳健,从而防止了对性能的有意义比较。为克服这一限制,我们引入了一个合成和三个真实世界的数据集,具有密集且语义丰富的特征。对于(b),我们超越MCAR并设计了更真实的缺失机制的评估协议。此外,我们提供了理论背景,明确陈述了缺失过程的假设,并分析了这些假设对不同方法的影响。基于此分析,我们提出了GNNmim,一种简单但有效的基线模型,用于具有不完整特征数据的节点分类。实验表明,GNNmim在各种数据集和缺失性制度下与专门设计的架构具有竞争力。

英文摘要

Handling missing node features is a key challenge for deploying Graph Neural Networks (GNNs) in real-world domains such as healthcare and sensor networks. Existing studies mostly address relatively benign scenarios, namely benchmark datasets with (a) high-dimensional but sparse node features and (b) incomplete data generated under Missing Completely At Random (MCAR) mechanisms. For (a), we theoretically prove that high sparsity substantially limits the information loss caused by missingness, making all models appear robust and preventing a meaningful comparison of their performance. To overcome this limitation, we introduce one synthetic and three real-world datasets with dense, semantically meaningful features. For (b), we move beyond MCAR and design evaluation protocols with more realistic missingness mechanisms. Moreover, we provide a theoretical background to state explicit assumptions on the missingness process and analyze their implications for different methods. Building on this analysis, we propose GNNmim, a simple yet effective baseline for node classification with incomplete feature data. Experiments show that GNNmim is competitive with respect to specialized architectures across diverse datasets and missingness regimes.

2601.03425 2026-05-19 cs.LG cs.AI 版本更新

The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models

领域专精的幻觉:揭示混合专家模型中的领域不变‘ standing committee ’

Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu

发表机构 * The Fin AI(Fin AI) Georgia Institute of Technology(佐治亚理工学院) Cornell University(康奈尔大学) Stevens Institute of Technology(史蒂文斯理工学院) The University of Manchester(曼彻斯特大学)

AI总结 本研究质疑混合专家模型通过稀疏路由实现领域专精的假设,提出COMMITTEEAUDIT框架分析专家组而非个体专家的路由行为,发现领域不变的standing committee,揭示模型存在向集中计算偏倚的结构倾向,表明混合专家模型中的专精程度远低于预期。

Comments Accepted by ACL 2026 main conference. Camera-ready version

详情
AI中文摘要

混合专家模型被广泛假设通过稀疏路由实现领域专精。在本工作中,我们通过引入COMMITTEEAUDIT框架,质疑这一假设,该框架在专家组层面而非个体专家层面分析路由行为。在三个代表性模型和MMLU基准测试中,我们揭示了一个领域不变的standing committee。这是一个紧凑的路由专家联盟,能够跨领域、层和路由预算持续捕获大多数路由质量,即使在架构已包含共享专家的情况下。定性分析进一步显示,standing committee锚定推理结构和语法,而外围专家处理领域特定知识。这些发现揭示了模型对集中计算的强结构偏倚,表明混合专家模型中的专精程度远低于人们普遍认为的水平。这种固有偏倚也表明,当前的训练目标,如强制均匀专家利用的负载平衡损失,可能与模型的自然优化路径相悖,从而限制了训练效率和性能。

英文摘要

Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.

2601.01123 2026-05-19 cs.LG cs.AI 版本更新

Learning from Historical Activations in Graph Neural Networks

在图神经网络中学习历史激活

Yaniv Galron, Hadar Sinai, Haggai Maron, Moshe Eliasof

发表机构 * Technion – Israel Institute of Technology(技术ion–以色列理工学院) NVIDIA Ben-Gurion University of the Negev(贝内-约尔根大学) University of Cambridge(剑桥大学)

AI总结 本文提出HISTOGRAPH,一种基于注意力的两阶段最终聚合层,通过层间和节点间的注意力机制,利用节点的激活历史和图结构来优化最终预测特征,从而在多个图分类基准上实现了优于传统方法的性能。

Comments ICLR 2026

详情
AI中文摘要

图神经网络(GNNs)在社交网络、分子化学等领域展现了显著的成功。GNNs的关键组成部分是池化过程,其中模型计算的节点特征被结合成一个有信息量的最终描述符,用于下游任务。然而,先前的图池化方案依赖于最后一个GNN层的特征作为池化或分类层的输入,这可能未能充分利用模型前向传递过程中先前层产生的重要激活,即历史图激活。这种差距在节点表示在许多图神经层中显著变化的情况下尤为明显,并且在深度架构中受到过平滑问题的加剧。为弥合这一差距,我们引入HISTOGRAPH,一种新颖的两阶段注意力最终聚合层,首先在中间激活上应用统一的层间注意力,随后进行节点间注意力。通过建模节点表示在层间的演变,我们的HISTOGRAPH利用节点的激活历史和图结构来优化最终预测所用的特征。在多个图分类基准上的实验证明,HISTOGRAPH提供了强大的性能,能够一致地改进传统技术,特别是在深度GNNs中表现出特别强的鲁棒性。

英文摘要

Graph Neural Networks (GNNs) have demonstrated remarkable success in various domains such as social networks, molecular chemistry, and more. A crucial component of GNNs is the pooling procedure, in which the node features calculated by the model are combined to form an informative final descriptor to be used for the downstream task. However, previous graph pooling schemes rely on the last GNN layer features as an input to the pooling or classifier layers, potentially under-utilizing important activations of previous layers produced during the forward pass of the model, which we regard as historical graph activations. This gap is particularly pronounced in cases where a node's representation can shift significantly over the course of many graph neural layers, and worsened by graph-specific challenges such as over-smoothing in deep architectures. To bridge this gap, we introduce HISTOGRAPH, a novel two-stage attention-based final aggregation layer that first applies a unified layer-wise attention over intermediate activations, followed by node-wise attention. By modeling the evolution of node representations across layers, our HISTOGRAPH leverages both the activation history of nodes and the graph structure to refine features used for final prediction. Empirical results on multiple graph classification benchmarks demonstrate that HISTOGRAPH offers strong performance that consistently improves traditional techniques, with particularly strong robustness in deep GNNs.

2512.24497 2026-05-19 cs.AI cs.LG cs.RO stat.ML 版本更新

What Drives Success in Physical Planning with Joint-Embedding Predictive World Models?

在联合嵌入预测世界模型中成功因素是什么?

Basile Terver, Tsung-Yen Yang, Jean Ponce, Adrien Bardes, Yann LeCun

发表机构 * Meta FAIR Inria Paris(巴黎理工院) Ecole normale supérieure / PSL(巴黎高等师范学院 / PSL) New York University(纽约大学)

AI总结 本文研究了在物理规划中使用联合嵌入预测世界模型(JEPA-WMs)的成功因素,通过分析模型架构、训练目标和规划算法对规划成功的影响,提出了一种在导航和操作任务中优于现有基线方法的模型。

Comments V2 of the article: - Added AdaLN-zero - Added table comparing JEPA-WMs with baselines with std translating per-seed variability only, no variability across epochs - Reordered figures in main body of the paper V3: added data scaling experiments, theoretical appendix section on autoregressive rollout, acceptance at TMLR

详情
AI中文摘要

人工智能领域长期存在的挑战是开发能够解决广泛物理任务并泛化到新、未见过的任务和环境的智能体。一种流行的近期方法是通过状态-动作轨迹训练世界模型,然后使用规划算法解决新任务。规划通常在输入空间中进行,但最近出现的一类方法引入了在学习的表示空间中优化的规划算法,其承诺通过抽象无关细节来提高规划效率。在本工作中,我们将此类模型称为JEPA-WMs,并研究使此类算法有效技术选择。我们提出了一项全面研究几个关键组件,旨在找到该类中的最佳方法。我们使用模拟环境和真实世界机器人数据进行了实验,并研究了模型架构、训练目标和规划算法对规划成功的影响。我们结合发现,提出了一种在导航和操作任务中优于两个现有基线方法(DINO-WM和V-JEPA-2-AC)的模型。代码、数据和检查点可在https://github.com/facebookresearch/jepa-wms上获得。

英文摘要

A long-standing challenge in AI is to develop agents capable of solving a wide range of physical tasks and generalizing to new, unseen tasks and environments. A popular recent approach involves training a world model from state-action trajectories and subsequently use it with a planning algorithm to solve new tasks. Planning is commonly performed in the input space, but a recent family of methods has introduced planning algorithms that optimize in the learned representation space of the world model, with the promise that abstracting irrelevant details yields more efficient planning. In this work, we characterize models from this family as JEPA-WMs and investigate the technical choices that make algorithms from this class work. We propose a comprehensive study of several key components with the objective of finding the optimal approach within the family. We conducted experiments using both simulated environments and real-world robotic data, and studied how the model architecture, the training objective, and the planning algorithm affect planning success. We combine our findings to propose a model that outperforms two established baselines, DINO-WM and V-JEPA-2-AC, in both navigation and manipulation tasks. Code, data and checkpoints are available at https://github.com/facebookresearch/jepa-wms.

2512.23994 2026-05-19 cs.SD cs.AI 版本更新

PhyAVBench: A Challenging Audio Physics-Sensitivity Benchmark for Physically Grounded Text-to-Audio-Video Generation

PhyAVBench: 一个具有挑战性的音频物理敏感性基准,用于物理基础的文本到音频视频生成

Tianxin Xie, Wentao Lei, Kai Jiang, Guanjie Huang, Pengfei Zhang, Chunhui Zhang, Fengji Ma, Haoyu He, Han Zhang, Jiangshan He, Jinting Wang, Linghan Fang, Lufei Gao, Orkesh Ablet, Peihua Zhang, Ruolin Hu, Shengyu Li, Weilin Lin, Xiaoyang Feng, Xinyue Yang, Yan Rong, Yanyun Wang, Zihang Shao, Zelin Zhao, Chenxing Li, Shan Yang, Wenfu Wang, Meng Yu, Dong Yu, Li Liu

发表机构 * HKUST(GZ)(香港科技大学(广州)) Tencent(腾讯)

AI总结 本文提出PhyAVBench,一个用于评估文本到音频视频生成、图像到音频视频生成和视频到音频生成模型中音频-物理基础能力的基准,通过引入新的数据集和评估方法,揭示了当前模型在物理合理音频生成方面的不足。

Comments 6 major physical dimensions, 41 fine-grained test points, 337 groups of variable-controlled test samples, 11,605 newly recorded videos

详情
AI中文摘要

文本到音频视频(T2AV)生成在影视制作和世界建模等应用中至关重要。然而,当前模型往往无法生成物理上合理的音效。先前的基准主要关注音频视频时间同步,而忽视了对音频-物理基础的显式评估,从而限制了对物理合理音频视频生成的研究。为了解决这个问题,我们提出了PhyAVBench,这是第一个系统评估T2AV、I2AV和V2A模型音频-物理基础能力的基准。PhyAVBench提供PhyAV-Sound-11K,一个包含来自184名参与者25.5小时11,605个可听视频的新数据集,以确保多样性和避免数据泄漏。它包含337对提示组,具有受控的物理变化,驱动声音差异,每个组平均有17个视频,涵盖6个音频-物理维度和41个细粒度测试点。每个提示对都标注了其声音差异背后的物理因素。重要的是,PhyAVBench利用配对文本提示来评估这一能力。我们称这种评估范式为音频-物理敏感性测试(APST),并引入了一个新的指标,对比物理响应分数(CPRS),用于量化生成视频与现实世界对应物之间的声音一致性。我们对17种最先进的模型进行了全面评估。我们的结果表明,即使领先的商业模型在基本的音频物理现象上也存在问题,揭示了超出音频视频同步之外的关键差距,并指明了未来的研究方向。我们希望PhyAVBench能为推进物理基础的音频视频生成提供基础。提示、真实值和生成视频样本可在https://github.com/imxtx/PhyAVBench上获得。

英文摘要

Text-to-audio-video (T2AV) generation is central to applications such as filmmaking and world modeling. However, current models often fail to produce physically plausible sounds. Previous benchmarks primarily focus on audio-video temporal synchronization, while largely overlooking explicit evaluation of audio-physics grounding, thereby limiting the study of physically plausible audio-visual generation. To address this issue, we present PhyAVBench, the first benchmark that systematically evaluates the audio-physics grounding capabilities of T2AV, image-to-audio-video (I2AV), and video-to-audio (V2A) models. PhyAVBench offers PhyAV-Sound-11K, a new dataset of 25.5 hours of 11,605 audible videos collected from 184 participants to ensure diversity and avoid data leakage. It contains 337 paired-prompt groups with controlled physical variations that drive sound differences, each grounded with an average of 17 videos and spanning 6 audio-physics dimensions and 41 fine-grained test points. Each prompt pair is annotated with the physical factors underlying their acoustic differences. Importantly, PhyAVBench leverages paired text prompts to evaluate this capability. We term this evaluation paradigm the Audio-Physics Sensitivity Test (APST) and introduce a novel metric, the Contrastive Physical Response Score (CPRS), which quantifies the acoustic consistency between generated videos and their real-world counterparts. We conduct a comprehensive evaluation of 17 state-of-the-art models. Our results reveal that even leading commercial models struggle with fundamental audio-physical phenomena, exposing a critical gap beyond audio-visual synchronization and pointing to future research directions. We hope PhyAVBench will serve as a foundation for advancing physically grounded audio-visual generation. Prompts, ground-truth, and generated video samples are available at https://github.com/imxtx/PhyAVBench.

2512.04746 2026-05-19 cs.CL cs.AI 版本更新

SignRoundV2: Toward Closing the Performance Gap in Extremely Low-Bit Post-Training Quantization for LLMs

SignRoundV2:朝着关闭LLMs极低比特后训练量化性能差距的目标

Wenhua Cheng, Weiwei Zhang, Heng Guo, Haihao Shen, Zaner Ma

发表机构 * Intel(英特尔公司) Beijing Institute of Technology(北京理工大学)

AI总结 本文提出SignRoundV2框架,通过自适应混合精度策略和轻量稳定技术,在极低比特量化下保持高性能,实验表明在混合MXFP设置中实现接近无损性能,将性能差距缩小到约1%。

详情
AI中文摘要

极低比特量化对高效部署大型语言模型(LLMs)至关重要,但往往在2比特和4比特(如MXFP4)时导致严重性能下降。我们提出了SignRoundV2,一种后训练量化框架,旨在在极端压缩下保持高性能。SignRoundV2引入(1)一种简单而高效的自适应混合精度策略,利用梯度信息和量化引起的重建误差来指导层间比特分配,以及(2)一组轻量级稳定技术,包括损失过滤和预调制比例搜索,以提高极低比特环境下的调优效果。我们的方法在量化和全精度模型之间显著缩小了性能差距。在多种LLMs上的实验结果表明,SignRoundV2在混合MXFP设置中实现了接近无损性能,将差距缩小到约1%(平均4.5比特),同时在具有挑战性的2比特权重-only量化中大幅提高准确性。源代码可在https://github.com/intel/auto-round获取。

英文摘要

Extremely low-bit quantization is critical for efficiently deploying Large Language Models (LLMs), yet it often leads to severe performance degradation at 2 bits and even at 4 bits (e.g., MXFP4). We present SignRoundV2, a post-training quantization framework designed to maintain high performance even under aggressive compression. SignRoundV2 introduces (1) a simple yet efficient adaptive mixed-precision strategy that leverages gradient information and quantization-induced reconstruction errors to guide layer-wise bit allocation, and (2) a set of lightweight stabilization techniques, including loss filtering and a pre-tuning scale search, to improve tuning effectiveness in extremely low-bit regimes. Our approach takes a significant step toward closing the performance gap between quantized and full-precision models. Experimental results across diverse LLMs demonstrate that SignRoundV2 achieves near-lossless performance in mixed MXFP settings, narrowing the gap to $\sim$1\% at an average of 4.5 bits, while substantially improving accuracy in challenging 2-bit weight-only quantization. The source code is available at \url{https://github.com/intel/auto-round}.

2511.23253 2026-05-19 cs.AI 版本更新

AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture

AgroCoT:用于评估农业中视觉语言模型推理能力的推理链基准

Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Xiaoya Fan, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, Juepeng Zheng

发表机构 * Sun Yat-sen University(中山大学) Tsinghua University(清华大学) Southwest University(西南大学) HuanTian Wisdom Technology Co., Ltd.(慧天智慧科技有限公司) China Agricultural University(中国农业大学) Southwest Jiaotong University(西南交通大学) National Supercomputing Center in Shenzhen(深圳国家超算中心)

AI总结 本文提出AgroCoT基准,通过整合推理链(CoT)方法,评估视觉语言模型在农业复杂场景中的推理和问题解决能力,发现现有模型在推理能力上的不足。

详情
AI中文摘要

近年来,视觉语言模型(VLMs)的进步显著影响了各个行业。在农业中,这些多模态能力在精准农业、作物监测、害虫检测和环境可持续性方面具有巨大潜力。然而,尽管已经开发了多个视觉问答(VQA)数据集和基准来评估VLM性能,但它们往往无法有效评估复杂农业背景下所需的推理和问题解决能力。为解决这一差距,我们引入了AgroCoT,一个整合了推理链(CoT)推理的VQA数据集,专门用于评估VLM的推理能力。AgroCoT包含4,759个精心挑选的样本,提供了对推理能力的全面且稳健的评估,特别是在零样本场景中,重点在于模型进行逻辑推理和有效问题解决的能力。我们对30个代表性VLMs(包括专有和开源模型)的评估揭示了其推理能力的差距,这突显了在评估中整合CoT的重要性。我们的数据集可在https://huggingface.co/datasets/AgroCoT/AgroCoT上获取。

英文摘要

Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/AgroCoT/AgroCoT.

2510.26384 2026-05-19 cs.AI cs.LG 版本更新

Scales++: Compute Efficient Evaluation Subset Selection with Cognitive Scales Embeddings

Scales++: 一种计算高效的评估子集选择方法,基于认知尺度嵌入

Andrew M. Bean, Nabeel Seedat, Shengzhuang Chen, Jonathan Richard Schwarz

发表机构 * Thomson Reuters Foundational Research(汤姆森路透基础研究) University of Oxford(牛津大学) Imperial College London(帝国理工学院伦敦分校)

AI总结 本文提出了一种基于任务项目内在属性的评估子集选择方法Scales++,通过减少预选成本并保持预测保真度,提高了大规模语言模型的评估效率,同时提升了冷启动性能和可解释性。

Comments 9 pages, 2 figures, 4 tables

详情
AI中文摘要

对大规模语言模型(LLMs)进行全面评估的高昂成本需要创建小而有代表性的数据子集(即小型基准),以实现高效的评估同时保留预测保真度。当前的方法基于模型为中心的范式,根据现有模型的集体性能选择基准项目。这些方法受限于前期成本高、无法立即处理新基准(冷启动)以及假设未来模型会共享前代模型的失败模式的脆弱性。在本文中,我们提出了一种新的以项目为中心的基准子集选择方法,认为选择应基于任务项目的内在属性,而不是模型特定的失败模式。我们通过一种新的方法Scales++来实现这种以项目为中心的高效基准方法,其中数据选择基于基准样本的认知需求。实证研究表明,Scales++将前期选择成本降低了超过18倍,同时实现了有竞争力的预测保真度。在Open LLM Leaderboard上,使用仅0.25%的数据子集,我们预测完整基准分数的均方误差为3.2%,在Humanity's Last Exam上,使用2.0%的样本预测完整分数的均方误差为2.9%。我们证明这种以项目为中心的方法可以在不显著降低保真度的情况下更高效地评估模型,同时提供更好的冷启动性能和更可解释的基准测试。

英文摘要

The prohibitive cost of evaluating large language models (LLMs) on comprehensive benchmarks necessitates the creation of small yet representative data subsets (i.e., tiny benchmarks) that enable efficient assessment while retaining predictive fidelity. Current methods for this task operate under a model-centric paradigm, selecting benchmarking items based on the collective performance of existing models. Such approaches are limited by large upfront costs, an inability to immediately handle new benchmarks ("cold-start"), and the fragile assumption that future models will share the failure patterns of their predecessors. In this work, we propose a new item-centric approach to benchmark subset selection, arguing that selection should be based on the intrinsic properties of the task items themselves, rather than on model-specific failure patterns. We instantiate this item-centric efficient benchmarking approach via a novel method, Scales++, where data selection is based on the cognitive demands of the benchmark samples. Empirically, we show Scales++ reduces the upfront selection cost by over 18x while achieving competitive predictive fidelity. On the Open LLM Leaderboard, using just a 0.25% data subset, we predict full benchmark scores with a 3.2% mean absolute error, and on Humanity's Last Exam we predict full scores with 2.9% mean absolute error using a 2.0% sample. We demonstrate that this item-centric approach enables more efficient model evaluation without significant fidelity degradation, while also providing better cold-start performance and more interpretable benchmarking.

2510.24701 2026-05-19 cs.CL cs.AI cs.IR cs.LG cs.MA 版本更新

Tongyi DeepResearch Technical Report

通义深研技术报告

Tongyi DeepResearch Team, Baixuan Li, Bo Zhang, Dingchu Zhang, Fei Huang, Guangyu Li, Guoxin Chen, Huifeng Yin, Jialong Wu, Jingren Zhou, Kuan Li, Liangcai Su, Litu Ou, Liwen Zhang, Pengjun Xie, Rui Ye, Wenbiao Yin, Xinmiao Yu, Xinyu Wang, Xixi Wu, Xuanzhong Chen, Yida Zhao, Zhen Zhang, Zhengwei Tao, Zhongwang Zhang, Zile Qiao, Chenxi Wang, Donglei Yu, Gang Fu, Haiyang Shen, Jiayin Yang, Jun Lin, Junkai Zhang, Kui Zeng, Li Yang, Hailong Yin, Maojia Song, Ming Yan, Minpeng Liao, Peng Xia, Qian Xiao, Rui Min, Ruixue Ding, Runnan Fang, Shaowei Chen, Shen Huang, Shihang Wang, Shihao Cai, Weizhou Shen, Xiaobin Wang, Xin Guan, Xinyu Geng, Yingcheng Shi, Yuning Wu, Zhuo Chen, Zijian Li, Yong Jiang

发表机构 * Tongyi Lab(通义实验室) Alibaba Group(阿里巴巴集团)

AI总结 本文介绍了一种专为长时间深度信息检索任务设计的代理大语言模型,通过端到端训练框架结合代理中期和后期训练,实现了在复杂任务中的可扩展推理和信息检索,同时提供了高可扩展的数据合成管道,实现了无需昂贵人工标注的自动化训练流程,并在多个深度研究基准测试中取得了最先进的性能。

Comments https://tongyi-agent.github.io/blog

详情
AI中文摘要

我们介绍了通义深研,一种专为长周期、深度信息检索任务设计的代理大语言模型。为了激励自主深度研究代理,通义深研通过端到端训练框架结合代理中期和后期训练,实现了在复杂任务中的可扩展推理和信息检索。我们设计了一个高度可扩展的数据合成管道,完全自动化,无需依赖昂贵的人工标注,并赋能所有训练阶段。通过为每个阶段构建定制化环境,我们的系统在整个过程中实现了稳定一致的交互。通义深研拥有305亿总参数,每token仅激活33亿个参数,在多个代理深度研究基准测试中,包括人类最后考试、浏览比较、浏览比较-中文、WebWalkerQA、xbench-DeepSearch、FRAMES和xbench-DeepSearch-2510,均取得了最先进的性能。我们开源了该模型、框架和完整解决方案,以赋能社区。

英文摘要

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.

2510.16727 2026-05-19 cs.CL cs.AI 版本更新

Beacon: Single-Turn Diagnosis and Mitigation of Latent Sycophancy in Large Language Models

Beacon:单轮诊断和缓解大型语言模型中潜在的阿谀倾向

Sanskar Pandey, Ruhaan Chopra, Angkul Puniya, Sohom Pal

AI总结 本文提出Beacon基准测试,用于单轮诊断和缓解大型语言模型中潜在的阿谀倾向,通过评估十二种最先进的模型,揭示了阿谀倾向在语言和情感方面的稳定子偏差,并提出了在提示和激活层面的干预措施,以调节这些偏差,从而揭示对齐作为事实性和社会合规判断之间的动态流形。

详情
AI中文摘要

大型语言模型内部化了诚实与奉承之间的结构权衡,这种权衡源于奖励优化,将有用性与礼貌服从混淆。这种潜在的偏见,称为阿谀倾向,表现为对用户同意的偏好而非原则性推理。我们引入Beacon,一种单轮强制选择基准测试,该测试独立于对话上下文,能够精确测量事实准确性与顺从偏见之间的张力。在十二种最先进的模型上的评估表明,阿谀倾向分解为稳定的语言和情感子偏见,每个都随模型容量而扩大。我们进一步提出了提示级别和激活级别干预,以调节这些偏见的相反方向,揭示对齐作为事实性和社会合规判断之间的动态流形。Beacon将阿谀倾向重新定义为可测量的规范性误泛化形式,为研究和缓解大规模生成系统中的对齐漂移提供了可重复的基础。

英文摘要

Large language models internalize a structural trade-off between truthfulness and obsequious flattery, emerging from reward optimization that conflates helpfulness with polite submission. This latent bias, known as sycophancy, manifests as a preference for user agreement over principled reasoning. We introduce Beacon, a single-turn forced-choice benchmark that isolates this bias independent of conversational context, enabling precise measurement of the tension between factual accuracy and submissive bias. Evaluations across twelve state-of-the-art models reveal that sycophancy decomposes into stable linguistic and affective sub-biases, each scaling with model capacity. We further propose prompt-level and activation-level interventions that modulate these biases in opposing directions, exposing the internal geometry of alignment as a dynamic manifold between truthfulness and socially compliant judgment. Beacon reframes sycophancy as a measurable form of normative misgeneralization, providing a reproducible foundation for studying and mitigating alignment drift in large-scale generative systems.

2510.16609 2026-05-19 cs.LG cs.AI cs.CC cs.DS 版本更新

Prior Knowledge Makes It Possible: From Sublinear Graph Algorithms to LLM Test-Time Methods

先验知识使其成为可能:从次线性图算法到LLM测试时方法

Avrim Blum, Daniel Hsu, Cyrus Rashtchian, Donya Saless

发表机构 * Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) Columbia University(哥伦比亚大学) Google Research(谷歌研究)

AI总结 本文研究了测试时增强方法中先验知识与外部信息交互的理论基础,通过将多步推理建模为知识图中的s-t连通性问题,揭示了在部分先验知识下,测试时增强步骤数量与图结构之间的关系,发现当知识图中存在小组件时,增强步骤数呈平方根增长,而当知识密度超过阈值形成大组件时,增强步骤数趋于常数。

详情
AI中文摘要

测试时增强,如检索增强生成(RAG)或工具使用,关键依赖于模型参数知识与外部检索信息之间的相互作用。然而,这种关系的理论基础仍不明确。具体来说,不清楚在少量增强步骤下需要多少预训练知识来回答查询,这在实践中是理想的属性。为了解决这个问题,我们将多步推理建模为知识图中的s-t连通性问题。我们将模型的预训练参数知识表示为部分、可能嘈杂的子图。我们将增强视为查询一个 oracle 以获得真实的边,从而扩展模型的知识。然后,我们表征了在部分先验知识下,模型生成准确答案所需的必要和充分的增强步骤数。一个关键结果表明:如果包含n个顶点的知识图被分割成小组件,则通过增强找到路径是低效的,需要Ω(√n)次查询。另一方面,一旦正确知识的密度超过阈值,形成大组件,我们可以通过预期常数次查询找到路径。

英文摘要

Test-time augmentation, such as Retrieval-Augmented Generation (RAG) or tool use, critically depends on an interplay between a model's parametric knowledge and externally retrieved information. However, the theoretical underpinnings of this relationship remain poorly understood. Specifically, it is not clear how much pre-training knowledge is required to answer queries with a small number of augmentation steps, which is a desirable property in practice. To address this question, we formulate multi-step reasoning as an $s$-$t$ connectivity problem on a knowledge graph. We represent a model's pre-training parametric knowledge as a partial, potentially noisy subgraph. We view augmentation as querying an oracle for true edges that augment the model's knowledge. Then, we characterize the necessary and sufficient number of augmentation steps for the model to generate an accurate answer given partial prior knowledge. One key result shows a phase transition: if the prior knowledge graph over $n$ vertices is disconnected into small components, then finding a path via augmentation is inefficient and requires $Ω(\sqrt{n})$ queries. On the other hand, once the density of correct knowledge surpasses a threshold, forming a giant component, we can find paths with an expected constant number of queries.

2510.14466 2026-05-19 cs.CL cs.AI 版本更新

Toward Robust Multilingual Adaptation of LLMs for Low-Resource Languages

迈向低资源语言LLM鲁棒多语言适应

Haolin Li, Haipeng Zhang, Mang Li, Yaohua Wang, Lijie Wen, Yu Zhang, Biqing Huang

发表机构 * Department of Automation, Tsinghua University, Beijing, China(清华大学自动化系) Alibaba International Digital Commerce Group, Beijing, China(阿里巴巴国际数字 commerce 集团) School of Software, Tsinghua University, Beijing, China(清华大学软件学院)

AI总结 本文提出LiRA框架,通过轻量级微调实现低资源语言LLM的鲁棒多语言适应,结合Arca和LaSR组件提升跨语言语义一致性与表示稳定性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)在低资源语言上仍面临挑战,主要由于训练数据有限、翻译噪声和跨语言对齐不稳定。为解决这些问题,我们提出LiRA(LLM的语言鲁棒锚定框架)——一个插件式框架,仅需在现有预训练模型上进行轻量级微调。LiRA通过结合两个关键组件:Arca(锚定表示组合架构),通过基于锚点的对齐和协作编码将低资源输入对齐到共享的英语语义空间;以及LaSR(语言耦合语义推理器),一个轻量级、语言感知的头部,通过一致性正则化强制统一的跨语言理解、检索和推理。我们理论证明,在受控的锚定误差和翻译诱导偏差下,LiRA保证了表示偏差的有界性和稳定的下游性能,基于局部Lipschitz连续性。为促进研究,我们发布了一个新的多语言产品检索数据集,涵盖五个东南亚语言和两种南亚语言。在多样化的低资源基准测试中,广泛实验显示在检索、排序、问答和推理任务上均取得一致的改进。代码将在GitHub上公开,数据集将托管在Hugging Face上。

英文摘要

Large language models (LLMs) continue to struggle with low-resource languages, primarily due to limited training data, translation noise, and unstable cross-lingual alignment. To address these challenges, we propose LiRA (Linguistic Robust Anchoring for LLMs)-a plug-and-play framework that requires only lightweight fine-tuning on top of existing pretrained backbones. LiRA jointly optimizes representation stability and cross-lingual semantic consistency by combining two key components: Arca (Anchored Representation Composition Architecture), which aligns low-resource inputs to a shared English semantic space through anchor-based alignment and collaborative encoding; and LaSR (Language-coupled Semantic Reasoner), a lightweight, language-aware head that enforces consistency regularization for unified cross-lingual understanding, retrieval, and reasoning. We theoretically show that under controlled anchoring error and translation-induced bias, LiRA guarantees bounded representation deviation and stable downstream performance under local Lipschitz continuity. To facilitate research, we release a new multilingual product retrieval dataset covering five Southeast Asian and two South Asian languages. Extensive experiments across diverse low-resource benchmarks demonstrate consistent improvements in retrieval, ranking, question answering, and reasoning tasks. Code will be publicly available on GitHub, and the dataset will be hosted on Hugging Face.

2510.13870 2026-05-19 cs.CL cs.AI 版本更新

Unlocking the Potential of Diffusion Language Models through Template Infilling

通过模板填充解锁扩散语言模型的潜力

Junhoo Lee, Seungyeon Kim, Nojun Kwak

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出了一种针对扩散语言模型的模板填充方法,通过在生成响应空间中建立全局蓝图,提升了数学推理、代码生成和旅行规划等任务的性能,同时在多token生成中实现了生成质量与速度的平衡。

Comments ACL 2026 Main Conference - Long Paper, Oral Presentation

详情
AI中文摘要

扩散语言模型(DLMs)作为一种有前景的替代自回归语言模型的候选者,其推理策略仍局限于自回归范式继承的前缀提示。本文提出模板填充(TI),一种针对DLMs的定制化条件化方法。与传统前缀提示不同,TI在目标响应空间中灵活对齐结构锚点,建立全局蓝图后再填充被遮蔽段落。我们在数学推理、代码生成和旅行规划等多样基准上展示了方法的有效性,相对于基线模型在多个任务上实现了9.40%的提升。此外,我们发现TI在多token生成设置中提供了额外优势,能够在保持生成质量与鲁棒性的同时实现有效加速。通过强制这些全局约束,TI最终促进了系统2推理,使模型能够在结构定义的解决方案空间内进行深入思考。

英文摘要

Diffusion Language Models (DLMs) have emerged as a promising alternative to Autoregressive Language Models, yet their inference strategies remain limited to prefix-based prompting inherited from the autoregressive paradigm. In this paper, we propose Template Infilling (TI), a tailored conditioning methodology for DLMs. Unlike conventional prefix prompting, TI flexibly aligns structural anchors across the entire target response space, establishing a global blueprint before filling in the masked segments. We demonstrate the effectiveness of our approach on diverse benchmarks, including mathematical reasoning, code generation, and trip planning, achieving consistent improvements of 9.40% over the baseline. Furthermore, we observe that TI provides additional advantages in multi-token generation settings, enabling effective speedup while maintaining generation quality and robustness. By enforcing these global constraints, TI ultimately facilitates System-2 reasoning, empowering the model to deliberate within a structurally defined solution space.

2510.13068 2026-05-19 cs.LG cs.AI cs.HC 版本更新

NeuroRVQ: Multi-Scale Biosignal Tokenization for Generative Foundation Models

NeuroRVQ:多尺度生物信号分词用于生成式基础模型

Konstantinos Barmpas, Na Lee, Dimitrios Chalatsis, William Raftery, Yannis Panagakis, Dimitrios A. Adamos, Nikolaos Laskaris, Alexandros Koliousis, Dario Farina, Stefanos Zafeiriou

发表机构 * Imperial College London(帝国理工学院伦敦分校) Cogitat National and Kapodistrian University of Athens(国家与资本主义大学雅典分校) Archimedes Research Unit(阿基米德研究单位) Aristotle University of Thessaloniki(亚里士多德大学塞萨洛尼基分校) Northeastern University London(东北大学伦敦分校)

AI总结 本文提出NeuroRVQ,一种多尺度生物信号分词方法,通过多尺度时序卷积分解生物信号并结合相位感知损失,实现高保真信号重建,验证了高质量分词对下游性能的重要性。

详情
AI中文摘要

生物信号如脑电图(EEG)、心电图(ECG)和肌电信号(EMG)在多个时间和频谱尺度上编码生理活动,产生丰富但对机器学习具有挑战性的表示。训练以预测掩码信号标记为基础模型的方法在学习通用生物信号表示方面显示出前景,但其性能取决于分词器保留高频动态和高保真重建信号的能力。我们引入NeuroRVQ,一种适用于高保真信号重建的多模态生物信号分词家族。为了捕获完整的频谱,NeuroRVQ通过多尺度时序卷积将生物信号分解为频特定表示,每个表示编码为层次化的RVQ代码本以保留高频细节,并结合一种新的相位感知训练损失,该损失尊重傅里叶相位的环形拓扑。通过调整时间分辨率、时间核的数量和大小以及RVQ深度,此设计适应每种生物信号模态的频谱-时间特性。为验证分词质量驱动下游性能,我们为每种模态训练一个简单的掩码标记基础模型(NeuroRVQ-FM)使用相应的NeuroRVQ分词器。NeuroRVQ-FM家族在与现有模态特定基础模型相比时实现了竞争或更优的下游性能,证明了高保真分词是有效生物信号建模的关键因素。

英文摘要

Biosignals such as electroencephalography (EEG), electrocardiography (ECG), and electromyography (EMG) encode physiological activity across multiple temporal and spectral scales, yielding representations that are rich but challenging for machine learning. Foundation models trained to predict masked signal tokens have shown promise in learning generalizable biosignal representations, yet their performance depends on the tokenizer's ability to preserve high-frequency dynamics and reconstruct signals with high fidelity. We introduce NeuroRVQ, a modality-adaptive biosignal tokenizer family designed for high-fidelity signal reconstruction. To capture the full frequency spectrum, NeuroRVQ decomposes biosignals into frequency-specific representations via multi-scale temporal convolutions, each encoded into hierarchical RVQ codebooks to preserve high-frequency detail, combined with a novel phase-aware training loss that respects the circular topology of Fourier phase. By tuning the temporal resolution, number and size of temporal kernels and RVQ depth, this design adapts to the spectro-temporal characteristics of each biosignal modality. To validate that tokenizer quality drives downstream performance, we train a simple masked-token foundation model for each modality (NeuroRVQ-FM) using the corresponding NeuroRVQ tokenizer. The NeuroRVQ-FM family achieves competitive or superior downstream performance compared to existing modality-specific foundation models, demonstrating that high-fidelity tokenization is a critical factor for effective biosignal modeling.

2510.03879 2026-05-19 cs.SE cs.AI 版本更新

Adversarial Agent Collaboration for Correctness Improvements of C to Safe Rust Translation

对抗性代理协作提升C到安全Rust翻译的正确性

Tianyu Li, Ruishi Li, Bo Wang, Brandon Paulsen, Umang Mathur, Prateek Saxena

发表机构 * National University of Singapore(新加坡国立大学) Amazon Web Services(亚马逊网络服务)

AI总结 本文提出ACToR框架,通过对抗性搜索发现翻译与C源码分歧的输入,利用这些输入驱动后续优化,提升C到Rust翻译的正确性,实验表明其在多个真实世界工具中达到90%以上的测试通过率。

详情
AI中文摘要

将C语言翻译成内存安全语言如Rust可以防止遗留C软件中普遍存在的关键内存安全漏洞。即使使用了最近的基于大语言模型(LLM)和工具增强的翻译器,生成的Rust代码在未测试的输入上仍经常与C源码产生分歧,这种正确性差距是自动C到Rust翻译可靠性的主要障碍。本文提出ACToR(对抗性C到Rust),一种简单的LLM代理循环,通过对抗性搜索发现翻译与C源码分歧的输入,并利用这些输入驱动后续优化。受生成对抗网络(GANs)启发,ACToR让翻译代理与鉴别代理协作,迭代优化Rust翻译。在每次迭代中,翻译代理生成并优化Rust翻译以通过现有测试套件,然后鉴别代理通过构造并优化C和Rust二进制文件的差分模糊器来发现新的失败测试。在63个真实世界命令行C工具上,平均代码行数为473行,最长可达数千行,ACToR在零人工干预下实现了超过90%的测试通过率。改进在七个代理-LLM配置上的微基准测试中保持稳定,表明该循环在底层翻译器和LLM选择上基本独立。与非对抗性、基于覆盖率的测试生成基线相比,ACToR将正确性提高了最高36.7%。当应用于最近的翻译器C2SaferRust时,ACToR进一步将验证通过率提高了16.6%。

英文摘要

Translating C to memory-safe languages, like Rust, prevents critical memory safety vulnerabilities that are prevalent in legacy C software. Even with recent LLM-based and tool-augmented translators, the resulting Rust code frequently diverges from the C source on inputs absent from the test suite used during translation; this correctness gap on unseen inputs remains a dominant obstacle to reliable, automatic C-to-Rust translation. In this work, we present ACToR (Adversarial C To Rust), a simple LLM-agent loop that closes this gap by adversarially searching for inputs on which the translation diverges from the C source, and using them to drive subsequent refinements. Inspired by GANs, ACToR pits a translator agent against a discriminator agent that collaborate to iteratively refine the Rust translation. On each iteration, the translator agent synthesizes and refines a Rust translation to pass an existing suite of tests, and then the discriminator agent finds new failing tests by constructing and refining a differential fuzzer over the C and Rust binaries. Across 63 real-world command-line C utilities, with an average size of 473 lines of code and the longest reaching thousands of lines in size, ACToR achieves over 90% test pass rate with zero human intervention. The improvement holds across seven agent-LLM configurations on our micro-benchmark, indicating that the loop is largely independent of the choice of underlying translator and LLM. Compared to a non-adversarial, coverage-driven test-generation baseline, ACToR improves correctness by up to 36.7%. When applied on top of one recent translator, C2SaferRust, ACToR further improves the validation pass rate by 16.6%.

2510.01857 2026-05-19 cs.AI 版本更新

Learning Reasoning Rewards from Expert Demonstrations with Inverse Reinforcement Learning

通过逆强化学习学习推理奖励从专家示范

Claudio Fanconi, Nicolás Astorga, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出了一种名为Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL)的方法,通过逆强化学习从专家示范中学习推理奖励,以克服传统监督微调的局限性,并在多个数据集上展示了其在训练和推理过程中的有效性。

详情
AI中文摘要

教学大型语言模型(LLMs)在训练后进行推理通常依赖于具有显式结果或过程基础的强化学习奖励函数。然而,在许多现实世界设置中,获得或定义此类奖励函数是困难的,尤其是对于复杂任务,使从专家示范中学习成为有吸引力的替代方法。主流方法监督微调(SFT)训练模型直接模仿专家推理轨迹,但受到离策略学习的一般限制:性能可能对推理时偏离演示中明确覆盖的状态敏感。为了解决这个问题,我们提出了推理对抗逆强化学习(R-AIRL)。与其模仿专家的推理,R-AIRL从专家的思维链中推断出底层的过程级奖励。通过在GSM8K、MMLU-Pro和MedReason上进行实验,我们展示了通过R-AIRL学习的推理奖励函数可以有效地用于整个训练和推理流程:(1)为训练提供训练信号,在大多数考虑的设置中优于SFT,(2)用于推理时的重排序,将pass@1提高高达17.4个点,(3)用于过程级评估,以高达86.1%的准确性局部化推理失败。总体而言,R-AIRL弥合了模仿学习和基于奖励的优化,使从专家思考轨迹中提取有意义的推理信号成为可能。

英文摘要

Teaching large language models (LLMs) to reason during post-training typically relies on reinforcement learning with explicit outcome- or process-based reward functions. However, in many real-world settings, obtaining or defining such reward functions is difficult, especially for complex tasks, making learning from expert demonstrations an attractive alternative. The dominant approach, supervised fine-tuning (SFT), trains models to imitate expert reasoning traces directly, but suffers from the general limitations of off-policy learning: performance can be fragile to inference-time deviations from states explicitly covered by the demonstrations. To address this, we propose Reasoning Adversarial Inverse Reinforcement Learning (R-AIRL). Rather than imitating the expert's reasoning, R-AIRL infers the underlying process-level reward from the expert Chain-of-Thoughts. Through experiments on GSM8K, MMLU-Pro and MedReason we show that the reasoning reward function learned with R-AIRL can be effectively used throughout the training and inference pipeline: (1) to provide a training signal for post-training, outperforming SFT in most of the considered settings, (2) for inference-time reranking, improving pass@1 by up to 17.4 points, and (3) for process-level evaluation, localising reasoning failures with up to 86.1% accuracy. Overall, R-AIRL bridges imitation learning and reward-based optimisation, enabling the extraction of meaningful reasoning signals from expert thinking traces.

2510.00304 2026-05-19 cs.LG cs.AI 版本更新

Barriers for Learning in an Evolving World: Mathematical Understanding of Loss of Plasticity

在不断变化的世界中学习的障碍:对学习能力丧失的数学理解

Amir Joudaki, Giulia Lanzillotta, Mohammad Samragh Razlighi, Iman Mirzadeh, Keivan Alizadeh, Thomas Hofmann, Mehrdad Farajtabar, Fartash Faghri

发表机构 * ETH Zürich(苏黎世联邦理工学院) Apple(苹果公司)

AI总结 本文研究了在非平稳环境中深度学习模型因学习能力丧失(LoP)而失效的问题,通过动力系统理论分析了LoP的两个主要机制,并探讨了缓解策略。

详情
AI中文摘要

深度学习模型在静态数据上表现优异,但在非静态环境中因一种称为学习能力丧失(LoP)的现象而表现不佳,即其未来学习能力下降。本文首次从原理上研究了基于梯度的学习中的LoP。基于动力系统理论,我们通过在参数空间中识别稳定的流形来正式定义LoP,这些流形会捕获梯度轨迹。我们的分析揭示了两种主要机制,这些机制创造了这些陷阱:来自激活饱和的冻结单元和来自表征冗余的克隆单元流形。我们的框架揭示了一个根本性的矛盾:在静态设置中促进泛化的属性,如低秩表示和简单性偏差,直接在持续学习场景中促成LoP。我们通过数值模拟验证了我们的理论分析,并探讨了架构选择或针对性扰动作为潜在的缓解策略。

英文摘要

Deep learning models excel in stationary data but struggle in non-stationary environments due to a phenomenon known as loss of plasticity (LoP), the degradation of their ability to learn in the future. This work presents a first-principles investigation of LoP in gradient-based learning. Grounded in dynamical systems theory, we formally define LoP by identifying stable manifolds in the parameter space that trap gradient trajectories. Our analysis reveals two primary mechanisms that create these traps: frozen units from activation saturation and cloned-unit manifolds from representational redundancy. Our framework uncovers a fundamental tension: properties that promote generalization in static settings, such as low-rank representations and simplicity biases, directly contribute to LoP in continual learning scenarios. We validate our theoretical analysis with numerical simulations and explore architectural choices or targeted perturbations as potential mitigation strategies.

2509.19102 2026-05-19 cs.RO cs.AI cs.CV 版本更新

FUNCanon: Learning Pose-Aware Action Primitives via Functional Object Canonicalization for Generalizable Robotic Manipulation

FUNCanon: 通过功能对象规范化学习姿态感知的动作原语以实现通用的机器人操作

Hongli Xu, Lei Zhang, Xiaoyue Hu, Boyang Zhong, Kaixin Bai, Zoltán-Csaba Márton, Zhenshan Bing, Zhaopeng Chen, Alois Christian Knoll, Jianwei Zhang

发表机构 * TAMS (Technical Aspects of Multimodal Systems), Department of Informatics, University of Hamburg(汉堡大学信息学院TAMS(多模态系统技术)) Technical University of Munich(慕尼黑技术大学) Agile Robots SE(敏捷机器人有限公司)

AI总结 本文提出FUNCanon框架,通过功能对象规范化学习姿态感知的动作原语,以实现通用的机器人操作,该方法将长周期操作任务分解为由主体、动词和对象定义的动作片段,从而提升策略的可组合性和可重用性。

Comments project website: https://sites.google.com/view/funcanon, 11 pages

详情
AI中文摘要

通用机器人技能从端到端演示中通常会导致任务特定的策略,这些策略难以超越训练分布进行泛化。因此,我们引入FUNCanon框架,将长周期操作任务转换为一系列动作片段,每个片段由主体、动词和对象定义。这些片段将策略学习聚焦于动作本身,而不是孤立的任务,从而实现组合性和重用性。为了使策略具有姿态感知和类别通用性,我们对功能对象进行规范化,通过功能对齐和自动操作轨迹转移,利用大型视觉语言模型的 affordance 信息将对象映射到共享的功能框架中。一个以对象为中心和动作为中心的扩散策略FuncDiffuser在对齐的数据上进行训练,自然尊重对象的 affordances 和姿态,简化了学习并提高了泛化能力。在模拟和现实基准上的实验表明,该方法在类别层面实现了泛化,跨任务行为重用和鲁棒的sim2real部署,显示功能规范化为复杂操作领域可扩展模仿学习提供了强大的归纳偏置。演示细节和补充材料可在我们的项目网站上获得:https://sites.google.com/view/funcanon。

英文摘要

General-purpose robotic skills from end-to-end demonstrations often leads to task-specific policies that fail to generalize beyond the training distribution. Therefore, we introduce FunCanon, a framework that converts long-horizon manipulation tasks into sequences of action chunks, each defined by an actor, verb, and object. These chunks focus policy learning on the actions themselves, rather than isolated tasks, enabling compositionality and reuse. To make policies pose-aware and category-general, we perform functional object canonicalization for functional alignment and automatic manipulation trajectory transfer, mapping objects into shared functional frames using affordance cues from large vision language models. An object centric and action centric diffusion policy FuncDiffuser trained on this aligned data naturally respects object affordances and poses, simplifying learning and improving generalization ability. Experiments on simulated and real-world benchmarks demonstrate category-level generalization, cross-task behavior reuse, and robust sim2real deployment, showing that functional canonicalization provides a strong inductive bias for scalable imitation learning in complex manipulation domains. Details of the demo and supplemental material are available on our project website https://sites.google.com/view/funcanon.

2509.18150 2026-05-19 cs.LG cs.AI 版本更新

Improving MLLM Training Efficiency via Stage-Aware Sparsity

通过阶段感知稀疏性提升MLLM训练效率

Kean Shi, Liang Chen, Haozhe Zhao, Baobao Chang

发表机构 * Peking University(北京大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出了一种基于稀疏表示的高效训练框架STS,通过阶段感知设计适应不同训练阶段的冗余,采用视觉标记压缩器和层动态跳过器来减少计算开销,验证了其在多种MLLM架构上的有效性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在各种领域中表现出色,但训练效率低下,由于长输入序列和未充分利用的层间操作导致大量计算冗余。值得注意的是,这种冗余并非静态,而是随训练阶段变化。基于此观察,我们关注训练过程本身,提出了一种基于稀疏表示的高效训练框架,称为稀疏训练方案(STS)。不同于统一的稀疏性策略,STS采用阶段感知设计,适应训练过程中不同的冗余来源。具体而言,该框架包含两个互补组件:视觉标记压缩器,通过在模态对齐过程中压缩视觉标记来减少信息负载;层动态跳过器,通过在指令微调过程中动态跳过不必要的层来减轻计算开销。我们的方法广泛适用于多种MLLM架构,并已在多个基准上进行了广泛评估,证明了其有效性和效率。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated outstanding performance across a variety of domains. However, training MLLMs is often inefficient, as much of the computation is redundant due to the long input sequences from multimodal data and underutilized inter-layer operations. Notably, such redundancy is not static but varies across different stages of training. Building on this observation, we shift the focus to the training process itself and propose a training-efficient framework based on sparse representations, termed the Sparse Training Scheme (STS). Instead of applying a uniform sparsity strategy, STS adopts a stage-aware design that adapts to different sources of redundancy during training. Specifically, the framework consists of two complementary components: the Visual Token Compressor, which reduces the information load by compressing visual tokens during modality alignment, and the Layer Dynamic Skipper, which mitigates computational overhead by dynamically skipping unnecessary layers during instruction tuning. Our approach is broadly applicable to diverse MLLM architectures and has been extensively evaluated on multiple benchmarks, demonstrating its effectiveness and efficiency.

2509.16391 2026-05-19 cs.LG cs.AI cs.CV 版本更新

CoUn: Empowering Machine Unlearning via Contrastive Learning

CoUn: 通过对比学习赋能机器无学习

Yasser H. Khalil, Mehdi Setayesh, Hongliang Li

发表机构 * Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 本文提出CoUn框架,通过对比学习和监督学习调整保留数据的表示,以提高机器无学习的有效性,实验表明其在多个数据集和模型架构上均优于现有方法。

详情
AI中文摘要

机器无学习(MU)旨在从已训练模型中移除特定'遗忘'数据的影响,同时保持对剩余'保留'数据的知识。现有的基于标签操纵或模型权重扰动的MU方法往往效果有限。为此,我们引入了CoUn,一种受观察启发的新MU框架:当模型仅使用保留数据重新训练时,它会根据保留数据的语义相似性对遗忘数据进行分类。CoUn通过对比学习(CL)和监督学习调整学习的数据表示,仅应用于保留数据。具体而言,CoUn(1)利用数据样本之间的语义相似性,通过CL间接调整遗忘表示,(2)通过监督学习保持保留表示在其各自聚类内。在各种数据集和模型架构上的广泛实验表明,CoUn在无学习有效性上 consistently 超过最先进的MU基线。此外,将我们的CL模块集成到现有基线中可以增强其无学习有效性。

英文摘要

Machine unlearning (MU) aims to remove the influence of specific "forget" data from a trained model while preserving its knowledge of the remaining "retain" data. Existing MU methods based on label manipulation or model weight perturbations often achieve limited unlearning effectiveness. To address this, we introduce CoUn, a novel MU framework inspired by the observation that a model retrained from scratch using only retain data classifies forget data based on their semantic similarity to the retain data. CoUn emulates this behavior by adjusting learned data representations through contrastive learning (CL) and supervised learning, applied exclusively to retain data. Specifically, CoUn (1) leverages semantic similarity between data samples to indirectly adjust forget representations using CL, and (2) maintains retain representations within their respective clusters through supervised learning. Extensive experiments across various datasets and model architectures show that CoUn consistently outperforms state-of-the-art MU baselines in unlearning effectiveness. Additionally, integrating our CL module into existing baselines empowers their unlearning effectiveness.

2509.02351 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Ordinal Adaptive Correction: A Data-Centric Approach to Ordinal Image Classification with Noisy Labels

序数自适应校正:一种数据导向的带有噪声标签的序数图像分类方法

Alireza Sedighi Moghaddam, Mohammad Reza Mohammadi

发表机构 * School of Computer Engineering, Iran University of Science and Technology(伊朗科学技术大学计算机工程学院)

AI总结 本文提出了一种数据导向的序数图像分类方法ORDAC,通过利用标签分布学习来建模序数标签的内在模糊性和不确定性,动态调整每个样本的标签分布均值和标准差,从而有效校正噪声标签并提高模型性能。

Comments 10 pages, 5 figures, 5 tables

详情
AI中文摘要

标记数据是训练计算机视觉任务中监督深度学习模型的基本组成部分。然而,尤其是在序数图像分类中,类边界往往具有模糊性,因此标注过程容易产生错误和噪声。此类标签噪声会显著降低机器学习模型的性能和可靠性。本文针对序数图像分类任务中检测和校正标签噪声的问题,提出了一种新的数据导向方法,称为ORDinal Adaptive Correction(ORDAC)。该方法利用标签分布学习(LDL)的能力来建模序数标签的内在模糊性和不确定性。在训练过程中,ORDAC动态调整每个样本的标签分布的均值和标准差。与其丢弃可能含有噪声的样本不同,该方法旨在校正这些样本并充分利用整个训练数据集。所提出方法在年龄估计(Adience)和疾病严重程度检测(糖尿病视网膜病变)基准数据集上,针对各种不对称高斯噪声场景进行了评估。结果表明,ORDAC及其扩展版本(ORDAC_C和ORDAC_R)在模型性能上取得了显著提升。例如,在Adience数据集上40%的噪声情况下,ORDAC_R将均方误差从0.86降低到0.62,并将召回指标从0.37提高到0.49。该方法还展示了其在原始数据集中固有噪声的校正效果。这项研究表明,使用标签分布进行自适应标签校正是增强在存在噪声数据时序数分类模型鲁棒性和准确性的一种有效策略。

英文摘要

Labeled data is a fundamental component in training supervised deep learning models for computer vision tasks. However, the labeling process, especially for ordinal image classification where class boundaries are often ambiguous, is prone to error and noise. Such label noise can significantly degrade the performance and reliability of machine learning models. This paper addresses the problem of detecting and correcting label noise in ordinal image classification tasks. To this end, a novel data-centric method called ORDinal Adaptive Correction (ORDAC) is proposed for adaptive correction of noisy labels. The proposed approach leverages the capabilities of Label Distribution Learning (LDL) to model the inherent ambiguity and uncertainty present in ordinal labels. During training, ORDAC dynamically adjusts the mean and standard deviation of the label distribution for each sample. Rather than discarding potentially noisy samples, this approach aims to correct them and make optimal use of the entire training dataset. The effectiveness of the proposed method is evaluated on benchmark datasets for age estimation (Adience) and disease severity detection (Diabetic Retinopathy) under various asymmetric Gaussian noise scenarios. Results show that ORDAC and its extended versions (ORDAC_C and ORDAC_R) lead to significant improvements in model performance. For instance, on the Adience dataset with 40% noise, ORDAC_R reduced the mean absolute error from 0.86 to 0.62 and increased the recall metric from 0.37 to 0.49. The method also demonstrated its effectiveness in correcting intrinsic noise present in the original datasets. This research indicates that adaptive label correction using label distributions is an effective strategy to enhance the robustness and accuracy of ordinal classification models in the presence of noisy data.

2507.21035 2026-05-19 cs.AI cs.LG cs.MA q-bio.GN 版本更新

GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis

GenoMAS:通过代码驱动的基因表达分析进行科学发现的多智能体框架

Haoyang Liu, Yijiang Li, Haohan Wang

发表机构 * University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 该研究提出GenoMAS多智能体框架,通过类型消息传递协议协调六个专门的LLM代理,以实现基因表达数据的高效处理和科学发现,其在数据预处理和基因识别任务上均优于现有方法。

Comments 51 pages (14 pages for the main text, 10 pages for references, and 27 pages for the appendix)

详情
AI中文摘要

基因表达分析对于许多生物医学发现至关重要,但从原始转录组数据中提取见解仍然极具挑战性,这归因于多个大型半结构化文件的复杂性和对大量领域专业知识的需求。当前的自动化方法往往受到不灵活的工作流或完全自主代理的限制,这些代理缺乏进行严谨科学探究所需的精确度。GenoMAS则另辟蹊径,通过集成结构化工作流的可靠性与自主代理的适应性,提出了一支基于LLM的科学家团队。GenoMAS通过类型消息传递协议协调六个专门的LLM代理,每个代理都为共享的分析画布贡献互补的强项。GenoMAS的核心是一个引导规划框架:编程代理将高层任务指南展开为动作单元,并在每个节点选择前进、修订、绕过或回溯,从而在保持逻辑一致性的同时,灵活适应基因组数据的特性。在GenoTEX基准测试中,GenoMAS在数据预处理方面达到了89.13%的复合相似度相关性,在基因识别方面达到了60.48%的F1分数,分别超过了最佳现有方法10.61%和16.85%。除了指标外,GenoMAS还揭示了由文献支持的生物合理基因-表型关联,同时调整了潜在混杂因素。代码可在https://github.com/Liu-Hy/GenoMAS上获得。

英文摘要

Gene expression analysis holds the key to many biomedical discoveries, yet extracting insights from raw transcriptomic data remains formidable due to the complexity of multiple large, semi-structured files and the need for extensive domain expertise. Current automation approaches are often limited by either inflexible workflows that break down in edge cases or by fully autonomous agents that lack the necessary precision for rigorous scientific inquiry. GenoMAS charts a different course by presenting a team of LLM-based scientists that integrates the reliability of structured workflows with the adaptability of autonomous agents. GenoMAS orchestrates six specialized LLM agents through typed message-passing protocols, each contributing complementary strengths to a shared analytic canvas. At the heart of GenoMAS lies a guided-planning framework: programming agents unfold high-level task guidelines into Action Units and, at each juncture, elect to advance, revise, bypass, or backtrack, thereby maintaining logical coherence while bending gracefully to the idiosyncrasies of genomic data. On the GenoTEX benchmark, GenoMAS reaches a Composite Similarity Correlation of 89.13% for data preprocessing and an F$_1$ of 60.48% for gene identification, surpassing the best prior art by 10.61% and 16.85% respectively. Beyond metrics, GenoMAS surfaces biologically plausible gene-phenotype associations corroborated by the literature, all while adjusting for latent confounders. Code is available at https://github.com/Liu-Hy/GenoMAS.

2507.20917 2026-05-19 cs.CL cs.AI 版本更新

MediQAl: A French Medical Question Answering Dataset for Knowledge and Reasoning Evaluation

MediQAl: 一个用于知识和推理评估的法语医学问答数据集

Adrien Bazoge

发表机构 * Data Clinic, University Hospital of Nantes, France(南特大学医院数据诊所,法国) Nantes Université, École Centrale Nantes, CNRS, LS2N, France(南特大学,中央理工学院南特分校,国家科学研究中心,LS2N,法国)

AI总结 本文提出MediQAl数据集,用于评估语言模型在事实性医学记忆和现实临床场景推理方面的能力,包含32,603个法语医学问题,涵盖41个医学科目,包含三种任务,通过14个大型语言模型的评估发现事实记忆与推理任务之间存在显著性能差距。

详情
AI中文摘要

本文介绍了MediQAl,一个法语医学问答数据集,旨在评估语言模型在事实性医学记忆和现实临床场景推理方面的能力。MediQAl包含32,603个问题,来源于41个医学科目中的法语医学考试。该数据集包含三种任务:(i) 有唯一答案的多项选择题,(ii) 有多个答案的多项选择题,以及(iii) 有短答案的开放性问题。每个问题都被标记为理解或推理,使能够对模型的认知能力进行详细分析。我们通过与14个大型语言模型的广泛评估,包括最近的推理增强模型,验证了MediQAl数据集,并观察到事实记忆与推理任务之间存在显著的性能差距。我们的评估为评估语言模型在法语医学问答上的性能提供了全面的基准,填补了医学领域多语言资源中的关键空白。

英文摘要

This work introduces MediQAl, a French medical question answering dataset designed to evaluate the capabilities of language models in factual medical recall and reasoning over real-world clinical scenarios. MediQAl contains 32,603 questions sourced from French medical examinations across 41 medical subjects. The dataset includes three tasks: (i) Multiple-Choice Question with Unique answer, (ii) Multiple-Choice Question with Multiple answer, and (iii) Open-Ended Question with Short-Answer. Each question is labeled as Understanding or Reasoning, enabling a detailed analysis of models' cognitive capabilities. We validate the MediQAl dataset through extensive evaluation with 14 large language models, including recent reasoning-augmented models, and observe a significant performance gap between factual recall and reasoning tasks. Our evaluation provides a comprehensive benchmark for assessing language models' performance on French medical question answering, addressing a crucial gap in multilingual resources for the medical domain.

2507.16307 2026-05-19 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

Perovskite-R1: a domain-specialized large language model for intelligent discovery of precursor additives and experimental design

钙钛矿-R1:一个专门领域的大型语言模型,用于智能发现前驱体添加剂和实验设计

Xin-De Wang, Zhi-Rui Chen, Peng-Jie Guo, Ze-Feng Gao, Cheng Mu, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院) School of Chemistry and Life Resource, Renmin University of China(中国人民大学化学与生命资源学院)

AI总结 本研究提出Perovskite-R1,一个专门用于发现钙钛矿太阳能电池前驱体添加剂和实验设计的大型语言模型,通过系统挖掘和整理1232篇高质量科学文献,并整合33269种候选材料,构建了领域特定的指令微调数据集,从而提升材料发现的效率。

Comments 24 pages; 5 figures

Journal ref Communications Materials 7, 86 (2026)

详情
AI中文摘要

钙钛矿太阳能电池(PSCs)因其卓越的功率转换效率和有利的材料特性而迅速成为下一代光伏技术的有力竞争者。尽管有这些进展,长期稳定性、环境可持续性和可扩展制造等挑战仍然阻碍其商业化。前驱体添加剂工程显示出通过提高PSCs的性能和耐久性来解决这些问题的潜力。然而,科学文献的爆炸性增长以及材料、工艺和设备架构之间的复杂相互作用,使研究人员难以高效地访问、组织和利用该领域内的领域知识。为此,我们介绍了Perovskite-R1,一个具有先进推理能力的专门大型语言模型(LLM),专门用于发现和设计PSC前驱体添加剂。通过系统挖掘和整理1232篇高质量科学出版物,并整合一个包含33,269种候选材料的全面库,我们使用自动问答生成和推理链的方法构建了一个领域特定的指令微调数据集。在该数据集上微调QwQ-32B模型,得到了Perovskite-R1,它可以智能地综合文献见解,生成创新且实用的解决方案用于缺陷钝化和前驱体添加剂的选择。对几个模型提出策略的实验验证证实了它们在提高材料稳定性和性能方面的有效性。我们的工作展示了领域适应的LLM在加速材料发现中的潜力,并提供了一个闭环框架,用于智能、数据驱动的钙钛矿光伏研究进展。

英文摘要

Perovskite solar cells (PSCs) have rapidly emerged as a leading contender in next-generation photovoltaic technologies, owing to their exceptional power conversion efficiencies and advantageous material properties. Despite these advances, challenges such as long-term stability, environmental sustainability, and scalable manufacturing continue to hinder their commercialization. Precursor additive engineering has shown promise in addressing these issues by enhancing both the performance and durability of PSCs. However, the explosive growth of scientific literature and the complex interplay of materials, processes, and device architectures make it increasingly difficult for researchers to efficiently access, organize, and utilize domain knowledge in this rapidly evolving field. To address this gap, we introduce Perovskite-R1, a specialized large language model (LLM) with advanced reasoning capabilities tailored for the discovery and design of PSC precursor additives. By systematically mining and curating 1,232 high-quality scientific publications and integrating a comprehensive library of 33,269 candidate materials, we constructed a domain-specific instruction-tuning dataset using automated question-answer generation and chain-of-thought reasoning. Fine-tuning the QwQ-32B model on this dataset resulted in Perovskite-R1, which can intelligently synthesize literature insights and generate innovative and practical solutions for defect passivation and the selection of precursor additives. Experimental validation of several model-proposed strategies confirms their effectiveness in improving material stability and performance. Our work demonstrates the potential of domain-adapted LLMs in accelerating materials discovery and provides a closed-loop framework for intelligent, data-driven advancements in perovskite photovoltaic research.

2507.01099 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

Geometry-aware 4D Video Generation for Robot Manipulation

面向机器人操作的几何感知4D视频生成

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, Shuran Song

发表机构 * Stanford University(斯坦福大学) Toyota Research Institute(丰田研究院)

AI总结 本文提出了一种几何感知的4D视频生成模型,通过跨视角点图对齐进行训练,以确保生成视频在多视角下的3D一致性,从而在单个RGB-D图像输入下生成时空一致的未来视频序列,并在不依赖相机姿态的情况下实现稳定的视觉和空间对齐预测。

Comments ICLR 2026; Project website: https://robot4dgen.github.io

详情
AI中文摘要

理解并预测物理世界的动态可以增强机器人在复杂环境中的规划和交互能力。尽管最近的视频生成模型在建模动态场景方面显示出强大的潜力,但生成在不同摄像机视角下既时间一致又几何一致的视频仍然是一项重大挑战。为此,我们提出了一种4D视频生成模型,通过在训练过程中使用跨视角点图对齐来监督模型,以确保生成视频的多视角3D一致性。通过这种几何监督,模型学习了一个共享的3D场景表示,使其能够从单个RGB-D图像输入中,根据新的视角生成时空一致的未来视频序列,而无需依赖相机姿态作为输入。与现有基线方法相比,我们的方法在多个模拟和现实世界机器人数据集上产生了更稳定和空间对齐的预测。我们进一步表明,预测的4D视频可用于使用现成的6自由度姿态跟踪器恢复机器人末端执行器轨迹,从而生成在新相机视角下具有良好泛化能力的机器人操作策略。

英文摘要

Understanding and predicting dynamics of the physical world can enhance a robot's ability to plan and interact effectively in complex environments. While recent video generation models have shown strong potential in modeling dynamic scenes, generating videos that are both temporally coherent and geometrically consistent across camera views remains a significant challenge. To address this, we propose a 4D video generation model that enforces multi-view 3D consistency of generated videos by supervising the model with cross-view pointmap alignment during training. Through this geometric supervision, the model learns a shared 3D scene representation, enabling it to generate spatio-temporally aligned future video sequences from novel viewpoints given a single RGB-D image per view, and without relying on camera poses as input. Compared to existing baselines, our method produces more visually stable and spatially aligned predictions across multiple simulated and real-world robotic datasets. We further show that the predicted 4D videos can be used to recover robot end-effector trajectories using an off-the-shelf 6DoF pose tracker, yielding robot manipulation policies that generalize well to novel camera viewpoints.

2506.23549 2026-05-19 cs.AI cs.HC cs.LG 版本更新

CooT: Learning to Coordinate In-Context with Coordination Transformers

CooT: 通过协调转换器学习协调上下文

Huai-Chih Wang, Hsiang-Chun Chuang, Hsi-Chun Cheng, Dai-Jie Wu, Shao-Hua Sun

发表机构 * Graduate Institute of Communication Engineering, National Taiwan University (NTU)(国立台湾大学通信工程研究所) NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)(国立台湾大学人工智能研究中心) University of Utah(犹他大学)

AI总结 本研究提出CooT框架,通过上下文学习实现实时合作伙伴适应,解决了多智能体系统中协调不熟悉合作伙伴的挑战,其核心方法是通过观察学习对齐动作与合作伙伴意图,主要贡献是实现了在多样合作伙伴行为下的泛化能力。

Comments ICML 2026

详情
AI中文摘要

在多智能体系统中,协调不熟悉合作伙伴仍然是一个重大挑战。现有方法,如基于种群的方法,通过多样性提高鲁棒性,但通常缺乏在训练分布之外高效适应的机制。此外,微调在少样本设置中不可行,因为其交互成本高。为了解决这些限制,我们提出了CooT,一个利用上下文学习(ICL)进行实时合作伙伴适应的框架。与以往专注于任务泛化的ICL方法不同,CooT旨在在多样化的合作伙伴行为上实现泛化。在行为偏好智能体的轨迹上训练,它通过观察学习对齐动作与合作伙伴意图。我们在两个具有挑战性的多智能体基准测试中评估了CooT:Overcooked和Google Research Football。结果表明,CooT在性能上始终优于基于种群的方法、基于梯度的微调和Meta-RL基线,实现了稳定且快速的适应,而无需参数更新。人类评估也发现CooT是更受青睐的合作者,我们的消融实验确认了其快速适应新合作伙伴并在突然合作伙伴变化下保持稳定的能力,使其在现实世界的人机协作中具有可靠性。

英文摘要

Effective coordination among unfamiliar partners remains a major challenge in multi-agent systems. Existing approaches, such as population-based methods, improve robustness through diversity but often lack mechanisms for efficient adaptation beyond training distribution. Moreover, fine-tuning is impractical in few-shot settings due to its high interaction cost. To address these limitations, we propose CooT, a framework that leverages in-context learning (ICL) for real-time partner adaptation. Unlike prior ICL approaches that focus on task generalization, CooT is designed to generalize across diverse partner behaviors. Trained on trajectories from behavior-preferring agents, it learns to align actions with partner intentions purely through observation. We evaluate CooT on two challenging multi-agent benchmarks: Overcooked and Google Research Football. Results show that CooT consistently outperforms population-based methods, gradient-based fine-tuning, and Meta-RL baselines, achieving stable and rapid adaptation without parameter updates. Human evaluations also identify CooT as a preferred collaborator, and our ablations confirm its ability to adapt quickly to new partners and remain stable under sudden partner changes, making it reliable for real-world human-AI collaboration.

2506.17312 2026-05-19 cs.SI cs.AI cs.LG 版本更新

Heterogeneous Temporal Hypergraph Neural Network

异构时序超图神经网络

Huan Liu, Pengfei Jiao, Mengzhou Gao, Chaochao Chen, Di Jin

发表机构 * School of Cyberspace, Hangzhou Dianzi University(杭州电子科技大学信息学院) Data Security Governance Zhejiang Engineering Research Center(浙江数据安全治理工程研究中心) College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 本文提出了一种异构时序超图神经网络(HTHGN),旨在捕捉复杂异构时序超图中的高阶交互关系,通过引入层次注意力机制和对比学习来提升模型对异构节点和超边之间丰富语义的捕捉能力。

Comments Accepted by IJCAI 2025

详情
AI中文摘要

图表示学习(GRL)已成为建模图结构数据的有效技术。在建模现实复杂网络中的异质性和动态性时,针对复杂异构时序图(HTGs)设计的GRL方法已被提出,并在各领域取得了成功应用。然而,大多数现有GRL方法主要关注保留低阶拓扑信息,而忽视了更高阶的组交互关系,这些关系更符合现实网络。此外,大多数现有超图方法只能建模静态同构图,限制了它们对HTGs中高阶交互关系的建模能力。因此,为了同时使GRL模型能够捕捉HTGs中的高阶交互关系,我们首先提出了异构时序超图的正式定义和不依赖额外信息的$P$-均匀异构超边构造算法。然后提出了一种新的异构时序超图神经网络(HTHGN),以完全捕捉HTGs中的高阶交互关系。HTHGN包含一个层次注意力机制模块,同时在异构节点和超边之间进行时间消息传递,以捕捉由超边带来的更宽广感受场中的丰富语义。此外,HTHGN通过最大化HTG中低阶相关异构节点对之间的一致性来进行对比学习,以避免低阶结构的模糊性问题。在三个真实世界HTG数据集上的详细实验结果验证了所提出HTHGN在建模HTGs中高阶交互关系的有效性,并展示了显著的性能提升。

英文摘要

Graph representation learning (GRL) has emerged as an effective technique for modeling graph-structured data. When modeling heterogeneity and dynamics in real-world complex networks, GRL methods designed for complex heterogeneous temporal graphs (HTGs) have been proposed and have achieved successful applications in various fields. However, most existing GRL methods mainly focus on preserving the low-order topology information while ignoring higher-order group interaction relationships, which are more consistent with real-world networks. In addition, most existing hypergraph methods can only model static homogeneous graphs, limiting their ability to model high-order interactions in HTGs. Therefore, to simultaneously enable the GRL model to capture high-order interaction relationships in HTGs, we first propose a formal definition of heterogeneous temporal hypergraphs and $P$-uniform heterogeneous hyperedge construction algorithm that does not rely on additional information. Then, a novel Heterogeneous Temporal HyperGraph Neural network (HTHGN), is proposed to fully capture higher-order interactions in HTGs. HTHGN contains a hierarchical attention mechanism module that simultaneously performs temporal message-passing between heterogeneous nodes and hyperedges to capture rich semantics in a wider receptive field brought by hyperedges. Furthermore, HTHGN performs contrastive learning by maximizing the consistency between low-order correlated heterogeneous node pairs on HTG to avoid the low-order structural ambiguity issue. Detailed experimental results on three real-world HTG datasets verify the effectiveness of the proposed HTHGN for modeling high-order interactions in HTGs and demonstrate significant performance improvements.

2506.03837 2026-05-19 cond-mat.supr-con cond-mat.mtrl-sci cs.AI cs.LG 版本更新

HTSC-2025: A Benchmark Dataset of Ambient-Pressure High-Temperature Superconductors for AI-Driven Critical Temperature Prediction

HTSC-2025: 一个用于人工智能驱动临界温度预测的环境压力高温超导体基准数据集

Xiao-Qi Han, Ze-Feng Gao, Xin-De Wang, Zhenfeng Ouyang, Peng-Jie Guo, Zhong-Yi Lu

发表机构 * 1. School of Physics Beijing Key Laboratory of Opto-electronic Functional Materials \& Micro-nano Devices. Renmin University of China, Beijing 100872, China 2. Key Laboratory of Quantum State Construction Manipulation (Ministry of Education), Renmin University of China, Beijing 100872, China 3. Hefei National Laboratory, Hefei 230088, China

AI总结 本文提出HTSC-2025基准数据集,包含2023至2025年由理论物理学家基于BCS超导理论预测的高温超导材料,旨在促进人工智能在超导材料发现中的应用。

Comments 7 pages, 2 figures

Journal ref Chinese Physics B 34, 100301 (2025)

详情
AI中文摘要

高温超导材料的发现对人类工业和日常生活具有重要意义。近年来,利用人工智能(AI)预测超导转变温度的研究日益流行,大多数工具声称实现了显著的准确性。然而,该领域缺乏广泛接受的基准数据集,严重阻碍了不同AI算法之间的公平比较以及这些方法的进一步发展。在本工作中,我们提出了HTSC-2025,一个环境压力高温超导基准数据集。该数据集全面涵盖了基于BCS超导理论由理论物理学家在2023至2025年间发现的理论预测超导材料,包括著名的X₂YH₆系统、钙钛矿MXH₃系统、M₃XH₈系统、源自LaH₁₀结构演化的笼状BCN掺杂金属原子系统,以及从MgB₂演化而来的二维蜂窝状系统。HTSC-2025基准数据集已开源在https://github.com/xqh19970407/HTSC-2025并将持续更新。该基准数据集对加速基于人工智能方法的超导材料发现具有重要意义。

英文摘要

The discovery of high-temperature superconducting materials holds great significance for human industry and daily life. In recent years, research on predicting superconducting transition temperatures using artificial intelligence~(AI) has gained popularity, with most of these tools claiming to achieve remarkable accuracy. However, the lack of widely accepted benchmark datasets in this field has severely hindered fair comparisons between different AI algorithms and impeded further advancement of these methods. In this work, we present the HTSC-2025, an ambient-pressure high-temperature superconducting benchmark dataset. This comprehensive compilation encompasses theoretically predicted superconducting materials discovered by theoretical physicists from 2023 to 2025 based on BCS superconductivity theory, including the renowned X$_2$YH$_6$ system, perovskite MXH$_3$ system, M$_3$XH$_8$ system, cage-like BCN-doped metal atomic systems derived from LaH$_{10}$ structural evolution, and two-dimensional honeycomb-structured systems evolving from MgB$_2$. The HTSC-2025 benchmark has been open-sourced at https://github.com/xqh19970407/HTSC-2025 and will be continuously updated. This benchmark holds significant importance for accelerating the discovery of superconducting materials using AI-based methods.

2505.20650 2026-05-19 cs.CL cs.AI cs.CE 版本更新

FinTagging: Benchmarking LLMs for Extracting and Structuring Financial Information

FinTagging: 评估LLM提取和结构化财务信息

Yan Wang, Lingfei Qian, Xueqing Peng, Yang Ren, Keyi Wang, Yi Han, Dongji Feng, Fengran Mo, Shengyuan Lin, Qinchuan Zhang, Kaiwen He, Chenri Luo, Jianxing Chen, Junwei Wu, Chen Xu, Ziyang Xu, Jimin Huang, Guojun Xiong, Xiao-Yang Liu, Qianqian Xie, Jian-Yun Nie

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Columbia University(哥伦比亚大学) California State University(加州州立大学) University of Montreal(蒙特利尔大学) Carnegie Mellon University(卡内基梅隆大学) Rensselaer Polytechnic Institute(莱斯利理工学院) The University of Manchester(曼彻斯特大学) Harvard University(哈佛大学)

AI总结 本文提出FinTagging基准,用于评估LLM在提取和结构化财务信息方面的能力,通过分解为FinNI和FinCL两个子任务,揭示了LLM在细粒度概念链接上的局限性。

详情
AI中文摘要

准确解读财务报告中的数字数据对市场和监管机构至关重要。尽管XBRL(可扩展商业报告语言)提供了对财务数据进行标记的标准,但将数千个事实映射到超过1万项美国通用会计准则(US GAAP)概念仍然成本高昂且容易出错。现有基准将此任务简化为对小概念子集的扁平单步分类,忽略了分类法的层次语义和财务文档的结构特性。因此,这些基准无法评估LLM在真实报告条件下的表现。为弥合这一差距,我们引入FinTagging,首个全面的结构感知和全范围XBRL标记基准。我们将复杂的标记过程分解为两个子任务:(1)FinNI(财务数字识别),从异构上下文中提取实体和类型;(2)FinCL(财务概念链接),将提取的实体映射到完整的US GAAP分类法。这种两阶段的框架使能够公平评估LLM在数值推理和分类法对齐方面的能力。在零样本设置下评估多种LLM发现,尽管模型在提取方面表现良好,但在细粒度概念链接上存在显著困难,突显了领域特定结构感知推理的关键限制。

英文摘要

Accurate interpretation of numerical data in financial reports is critical for markets and regulators. Although XBRL (eXtensible Business Reporting Language) provides a standard for tagging financial figures, mapping thousands of facts to over 10k US GAAP concepts remains costly and error prone. Existing benchmarks oversimplify this task as flat, single step classification over small subsets of concepts, ignoring the hierarchical semantics of the taxonomy and the structured nature of financial documents. Consequently, these benchmarks fail to evaluate Large Language Models (LLMs) under realistic reporting conditions. To bridge this gap, we introduce FinTagging, the first comprehensive benchmark for structure aware and full scope XBRL tagging. We decompose the complex tagging process into two subtasks: (1) FinNI (Financial Numeric Identification), which extracts entities and types from heterogeneous contexts including text and tables; and (2) FinCL (Financial Concept Linking), which maps extracted entities to the full US GAAP taxonomy. This two stage formulation enables a fair assessment of LLMs' capabilities in numerical reasoning and taxonomy alignment. Evaluating diverse LLMs in zero shot settings reveals that while models generalize well in extraction, they struggle significantly with fine grained concept linking, highlighting critical limitations in domain specific structure aware reasoning.

2505.09203 2026-05-19 cond-mat.mtrl-sci cond-mat.supr-con cs.AI cs.LG 版本更新

InvDesFlow-AL: active learning-based workflow for inverse design of functional materials

InvDesFlow-AL: 基于主动学习的反向设计功能材料工作流程

Xiao-Qi Han, Peng-Jie Guo, Ze-Feng Gao, Hao Sun, Zhong-Yi Lu

发表机构 * School of Physics, Renmin University of China(中国人民大学物理学院) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院) School of Engineering Science, University of Chinese Academy of Sciences(中国科学院大学工程科学学院)

AI总结 本研究提出了一种基于主动学习的反向设计功能材料框架InvDesFlow-AL,通过迭代优化材料生成过程,提高性能特征的准确性,并在低形成能和低Ehull材料设计中取得显著成果,成功发现超导材料Li₂AuH₆。

Comments 29 pages, 11 figures

Journal ref npj Computational Materials 11, 364 (2025)

详情
AI中文摘要

开发具有特定性能的功能材料的反向设计方法对于推进可再生能源、催化、能量存储和碳捕集等领域的进步至关重要。基于扩散原理的生成模型可以直接生成满足性能约束的新材料,从而显著加速材料设计过程。然而,现有生成和预测晶体结构的方法往往受限于低成功率。在本工作中,我们提出了一种新的反向材料设计生成框架InvDesFlow-AL,该框架基于主动学习策略。该框架可以迭代优化材料生成过程,逐步引导其向期望的性能特征发展。在晶体结构预测方面,InvDesFlow-AL模型实现了RMSE为0.0423 Å,相比现有生成模型性能提高了32.96%。此外,InvDesFlow-AL已成功应用于低形成能和低Ehull材料的设计。它可以系统地生成具有逐步降低形成能的材料,同时在多样化的化学空间中不断扩展探索。这些结果充分证明了所提出的基于主动学习的生成模型在加速材料发现和反向设计中的有效性。为进一步证明该方法的有效性,我们以InvDesFlow-AL探索的常压下BCS超导体搜索为例。结果,我们成功发现了Li₂AuH₆作为传统BCS超导体,具有超高的转变温度140 K。这一发现为反向设计在材料科学中的应用提供了有力的实证支持。

英文摘要

Developing inverse design methods for functional materials with specific properties is critical to advancing fields like renewable energy, catalysis, energy storage, and carbon capture. Generative models based on diffusion principles can directly produce new materials that meet performance constraints, thereby significantly accelerating the material design process. However, existing methods for generating and predicting crystal structures often remain limited by low success rates. In this work, we propose a novel inverse material design generative framework called InvDesFlow-AL, which is based on active learning strategies. This framework can iteratively optimize the material generation process to gradually guide it towards desired performance characteristics. In terms of crystal structure prediction, the InvDesFlow-AL model achieves an RMSE of 0.0423 Å, representing an 32.96% improvement in performance compared to exsisting generative models. Additionally, InvDesFlow-AL has been successfully validated in the design of low-formation-energy and low-Ehull materials. It can systematically generate materials with progressively lower formation energies while continuously expanding the exploration across diverse chemical spaces. These results fully demonstrate the effectiveness of the proposed active learning-driven generative model in accelerating material discovery and inverse design. To further prove the effectiveness of this method, we took the search for BCS superconductors under ambient pressure as an example explored by InvDesFlow-AL. As a result, we successfully identified Li\(_2\)AuH\(_6\) as a conventional BCS superconductor with an ultra-high transition temperature of 140 K. This discovery provides strong empirical support for the application of inverse design in materials science.

2505.07813 2026-05-19 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

DexWild: Dexterous Human Interactions for In-the-Wild Robot Policies

DexWild:面向真实场景的机器人策略的灵巧交互

Tony Tao, Mohan Kumar Srirama, Jason Jingzhou Liu, Kenneth Shaw, Deepak Pathak

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出DexWild框架,通过结合人类和机器人示范数据,提升机器人在多样化环境中的泛化能力,实验表明其在未见环境中的成功率显著高于传统方法。

Comments In RSS 2025. Website at https://dexwild.github.io

详情
AI中文摘要

大规模、多样化的机器人数据集已成为使灵巧操作策略泛化到新环境的有希望途径,但获取此类数据集存在诸多挑战。虽然远程操作能提供高保真的数据集,但其高成本限制了可扩展性。相反,如果人们可以像在日常生活中一样使用自己的手来收集数据呢?在DexWild中,一个多样化的数据收集团队使用他们的手在多种环境和物体上收集数小时的交互数据。为了记录这些数据,我们创建了DexWild-System,一种低成本、移动且易于使用的设备。DexWild学习框架在人类和机器人示范数据上共同训练,相较于单独训练每个数据集,其性能得到提升。这种组合产生了能够泛化到新环境、任务和形态的稳健机器人策略,只需少量额外的机器人特定数据。实验结果表明,DexWild显著提高了性能,在未见环境中实现了68.5%的成功率,几乎是仅使用机器人数据训练的策略的四倍,并提供了5.8倍更好的跨形态泛化能力。视频结果、代码库和说明可在https://dexwild.github.io上找到。

英文摘要

Large-scale, diverse robot datasets have emerged as a promising path toward enabling dexterous manipulation policies to generalize to novel environments, but acquiring such datasets presents many challenges. While teleoperation provides high-fidelity datasets, its high cost limits its scalability. Instead, what if people could use their own hands, just as they do in everyday life, to collect data? In DexWild, a diverse team of data collectors uses their hands to collect hours of interactions across a multitude of environments and objects. To record this data, we create DexWild-System, a low-cost, mobile, and easy-to-use device. The DexWild learning framework co-trains on both human and robot demonstrations, leading to improved performance compared to training on each dataset individually. This combination results in robust robot policies capable of generalizing to novel environments, tasks, and embodiments with minimal additional robot-specific data. Experimental results demonstrate that DexWild significantly improves performance, achieving a 68.5% success rate in unseen environments-nearly four times higher than policies trained with robot data only-and offering 5.8x better cross-embodiment generalization. Video results, codebases, and instructions at https://dexwild.github.io

2505.06907 2026-05-19 cs.AI cs.CV cs.NE 版本更新

A Survey on Foundation Models for Personalized Federated Intelligence

面向个性化联邦智能的基础模型综述

Yu Qiao, Huy Q. Le, Avi Deb Raha, Phuong-Nam Tran, Apurba Adhikary, Mengchun Zhang, Loc X. Nguyen, Eui-Nam Huh, Dusit Niyato, Choong Seon Hong

发表机构 * School of Computing, Kyung Hee University(韩国庆熙大学计算机学院) Noakhali Science and Technology University(诺阿克利科学与技术大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 本文综述了基础模型在个性化联邦智能中的应用,探讨了联邦学习与基础模型的结合,提出了一种新的个性化联邦智能范式,旨在为实现人工智能个性化提供基础支持。

Comments Accepted ACM Computing Survey

详情
AI中文摘要

大语言模型(如ChatGPT、Gemini和Grok)的兴起重塑了人工智能领域。作为基础模型(FMs)的典型实例,它们在生成类人内容方面表现出色,推动人工智能向通用人工智能(AGI)迈进。然而,它们的规模庞大、隐私敏感和计算需求高,给个性化定制带来了挑战。为此,我们提出了人工智能个性化(API)的愿景,专注于将FMs适应到个体用户,同时确保隐私。作为API的核心赋能者,我们提出个性化联邦智能(PFI),这是一种新的范式,不仅整合了联邦学习(FL)的隐私优势和FMs的泛化能力,还将个性化置于核心。为此,我们首先回顾了最近的FL和FMs进展,为PFI奠定基础。然后,我们探讨了PFI流水线的核心阶段:边缘的高效个性化、可信的适应和通过检索增强生成的自适应细化。最后,我们强调了实现PFI的未来方向。总体而言,本文的综述旨在为API的发展奠定基础,作为AGI的补充方向,PFI是关键的赋能范式。

英文摘要

The rise of large language models (LLMs), such as ChatGPT, Gemini, and Grok, has reshaped the AI landscape. As prominent instances of foundational models (FMs), they exhibit remarkable capabilities in generating human-like content, pushing the boundaries towards artificial general intelligence (AGI). However, their large-scale nature, privacy sensitivity, and substantial computational demands pose significant challenges for personalized customization for end users. To bridge this gap, we present the vision of artificial personalized intelligence (API), which focuses on adapting FMs to individual users while ensuring privacy. As a central enabler of API, we propose personalized federated intelligence (PFI), a new paradigm that not only integrates the privacy benefits of federated learning (FL) with the generalization capabilities of FMs but also places personalization at its core. To this end, we first survey recent advances in FL and FMs that lay the foundation for PFI. We then explore core stages of the PFI pipeline: efficient personalization at the edge, trustworthy adaptation, and adaptive refinement via retrieval-augmented generation. Finally, we highlight future directions for enabling PFI. Overall, this survey aims to lay a foundation for the development of API as a complementary direction to AGI, with PFI as a key enabling paradigm.

2505.00409 2026-05-19 eess.AS cs.AI cs.LG 版本更新

Perceptual implications of automatic anonymization in pathological speech

病态语音中自动匿名化的人感知影响

Soroosh Tayebi Arasteh, Saba Afza, Tri-Thien Nguyen, Lukas Buess, Maryam Parvin, Tomas Arias-Vergara, Paula Andrea Perez-Toro, Hiu Ching Hung, Mahshad Lotfinia, Thomas Gorges, Elmar Noeth, Maria Schuster, Seung Hee Yang, Andreas Maier

发表机构 * Pattern Recognition Lab, Friedrich-Alexander-Universit\"at Erlangen-N\"urnberg, Erlangen, Germany. Department of Urology, Stanford University, Stanford, CA, USA. Department of Radiology, Stanford University, Stanford, CA, USA. Lab for AI in Medicine, RWTH Aachen University, Aachen, Germany. Department of Diagnostic Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany. Institute of Radiology, University Hospital Erlangen, Erlangen, Germany. Department of Foreign Language Education, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany. Department of Otorhinolaryngology, Head Neck Surgery, Ludwig-Maximilians-Universität München, Munich, Germany. Speech \& Language Processing Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany.

AI总结 本研究通过结构化协议评估自动匿名化病态语音的人感知影响,发现匿名化在不同疾病中存在显著差异,且感知质量下降,但临床严重程度评分保持稳定,同时发现感知结果与计算隐私指标脱钩。

详情
AI中文摘要

自动匿名化日益用于促进伦理共享的临床语音,但其感知和临床后果仍不明确。我们通过结构化协议,使用十名母语和非母语德语听众(涵盖临床和信号处理专业知识)对自动匿名化的病态语音进行了以人为中心的评估。受试者包括来自CLP、构音障碍、构语障碍、失声及成人和儿童对照组的180名德语说话者。每段原始录音及其自动匿名化版本在四个任务上进行评估:零样本图灵式辨别、少量样本辨别后短暂熟悉、5点质量评分以及4点盲评临床严重程度评分由资深语音病学家完成。听众在零样本和少量样本任务中检测到匿名化准确率分别为91%和93%,不同疾病之间存在显著差异(p=0.008),且熟悉度降低该差异。感知质量在0-100分上下降了30分(p<0.001),重新组织了各组的感知质量等级。母语影响了可检测性但不影响质量退化,而领域专业知识影响了质量退化但不影响可检测性,形成双分离现象;说话者性别和年龄无明显偏差。临床严重程度评分在构音障碍、构语障碍和失声中保持几乎完美的一致(二次加权Cohen's kappa 0.87-0.94),无录音移位超过一级。关键发现是感知结果与标准计算隐私指标脱钩:计算上匿名化最强的病态语音在感知上最不明显,反之亦然。这些发现支持了按疾病类型和听众类型、经临床验证的评估作为许可匿名语音用于临床使用的最低标准。

英文摘要

Automatic anonymization is increasingly used to enable ethical sharing of clinical speech, yet its perceptual and clinical consequences remain undercharacterized. We present a human-centered evaluation of automatically anonymized pathological speech, using a structured protocol with ten native and non-native German listeners spanning clinical and signal-processing expertise. The cohort comprised 180 German speakers from CLP, Dysarthria, Dysglossia, Dysphonia, and adult and child controls. Each original recording and its automatically-anonymized counterpart was evaluated on four tasks: zero-shot Turing-style discrimination, few-shot discrimination after brief familiarization, 5-point quality rating, and 4-point blinded clinical severity rating by a senior phoniatrician. Listeners detected anonymization at 91% zero-shot and 93% few-shot accuracy, with significant variation across disorders (p=0.008) that attenuated with familiarization. Perceived quality dropped by 30 ppts on a 0-100 scale (p<0.001), reorganizing the perceived-quality hierarchy across groups. Native language modulated detectability but not quality degradation, while domain expertise modulated quality degradation but not detectability, a double dissociation between the two listener attributes; speaker sex and age produced no detectable bias. Clinical severity ratings were preserved at near-perfect agreement in Dysarthria, Dysglossia, and Dysphonia (quadratic-weighted Cohen's kappa 0.87-0.94), with no recording shifting by more than one grade. Crucially, perceptual outcomes were decoupled from the standard computational privacy metric: the pathology with the strongest computational anonymization was the least perceptually conspicuous, and vice versa. These findings argue for disorder-stratified, listener-stratified, clinician-validated evaluation as the minimum standard for licensing anonymized speech for clinical use.

2503.14800 2026-05-19 cs.IR cs.AI cs.LG 版本更新

Long Context Modeling with Ranked Memory-Augmented Retrieval

长上下文建模与排名记忆增强检索

Ghadir Alselwi, Hao Xue, Shoaib Jameel, Basem Suleiman, Flora D. Salim, Imran Razzak

发表机构 * University of New South Wales(新南威尔士大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) University of Southampton(南安普顿大学) Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 本文提出了一种增强的排名记忆增强检索框架,通过动态排名记忆条目和学习到的排名技术,提升语言模型在长上下文任务中的性能和可扩展性。

详情
AI中文摘要

有效管理长期记忆对于语言模型处理扩展上下文至关重要。我们介绍了增强的排名记忆增强检索(ERMAR)框架,该框架根据相关性动态排名记忆条目。与先前模型不同,ERMAR采用了一种新颖的相关性评分机制和一个点wise重新排序模型,用于键值嵌入,灵感来自信息检索中的学习到的排名技术。通过整合历史使用模式和自适应检索,ERMAR在标准基准上实现了最先进的结果,展示了在长上下文任务中优越的可扩展性和性能。

英文摘要

Effective long-term memory management is crucial for language models handling extended contexts. We introduce the Enhanced Ranked Memory Augmented Retrieval (ERMAR) framework, which dynamically ranks memory entries based on relevance. Unlike prior models, ERMAR employs a novel relevance scoring mechanism and a pointwise re-ranking model for key-value embeddings, inspired by learning-to-rank techniques in information retrieval. By integrating historical usage patterns and adaptive retrieval, ERMAR achieves state-of-the-art results on standard benchmarks, demonstrating superior scalability and performance in long-context tasks.

2410.13846 2026-05-19 cs.CL cs.AI cs.LG 版本更新

LightTransfer: Your Long-Context LLM is Secretly a Hybrid Model with Effortless Adaptation

LightTransfer: 你的长上下文LLM实际上是一个具有轻松适应能力的混合模型

Xuan Zhang, Fengzhuo Zhang, Cunxiao Du, Chao Du, Tianyu Pang, Wei Gao, Min Lin

发表机构 * Singapore Management University(新加坡国立大学) National University of Singapore(新加坡国立大学) Sea AI Lab, Singapore(新加坡海智实验室)

AI总结 本文提出LightTransfer方法,通过将LLaMA等模型转换为混合架构,实现更高效的生成,实验表明在长上下文理解任务中,即使有半数层被识别为懒层,也能在性能损失小于1.5%的情况下提升2.17倍的吞吐量,并在数学基准AIME24上达到53.3%的分数。

Comments Accepted by TMLR 2025

详情
AI中文摘要

将语言模型扩展到处理更长上下文引入了由于键值(KV)缓存成本增加而带来的显著内存挑战。受混合模型的效率提升和预训练大变压器骨干的广泛可用性启发,我们探索将变压器模型转换为混合架构以实现更高效的生成。在本工作中,我们提出了LightTransfer,一种轻量级方法,将模型如LLaMA转换为混合变体。我们的方法识别出懒层——那些专注于最近或初始token的层,并将它们的完整注意力替换为流式注意力。这种转换可以在无需任何训练的情况下用于长上下文理解任务,或在需要更强推理能力的o1-like长推理生成任务中进行最小微调。在多样化的基准和模型(如LLaMA、Mistral、QwQ-STILL)上的实验表明,即使有半数层被识别为懒层,LightTransfer在性能损失小于1.5%(在LongBench上)的情况下,也能实现高达2.17倍的吞吐量提升,并在数学基准AIME24上达到先进o1-like长推理模型QwQ-STILL的53.3%。

英文摘要

Scaling language models to handle longer contexts introduces substantial memory challenges due to the growing cost of key-value (KV) caches. Motivated by the efficiency gains of hybrid models and the broad availability of pretrained large transformer backbones, we explore transitioning transformer models into hybrid architectures for a more efficient generation. In this work, we propose LightTransfer, a lightweight method that transforms models such as LLaMA into hybrid variants. Our approach identifies lazy layers -- those focusing on recent or initial tokens -- and replaces their full attention with streaming attention. This transformation can be performed without any training for long-context understanding tasks or with minimal fine-tuning for o1-like long reasoning generation tasks that require stronger reasoning capabilities. Experiments across diverse benchmarks and models (e.g., LLaMA, Mistral, QwQ-STILL) demonstrate that, even when half of the layers are identified as lazy, LightTransfer achieves up to 2.17$\times$ throughput improvement with minimal performance loss ($<1.5\%$ on LongBench) and achieves 53.3\% on math benchmark AIME24 of advanced o1-like long reasoning model QwQ-STILL.

2410.04941 2026-05-19 cs.LG cs.AI 版本更新

TOAST: Transformer Optimization using Adaptive and Simple Transformations

TOAST: 使用自适应和简单变换的Transformer优化

Irene Cannistraci, Simone Antonelli, Emanuele Palumbo, Thomas M. Sutter, Emanuele Rodolà, Bastian Rieck, Julia E. Vogt

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系) CISPA Helmholtz Center for Information Security(信息安全赫尔姆霍兹中心) Sapienza University of Rome(罗马大学萨皮恩扎大学) University of Fribourg(弗里堡大学)

AI总结 本文提出TOAST框架,通过利用Transformer内部的冗余性,用轻量级闭式映射(如线性变换或身份函数)近似整个Transformer块,从而在不额外训练的情况下减少参数和计算量,同时保持甚至提升下游性能。

Comments 33 pages, 16 figures, 22 tables

详情
AI中文摘要

基础模型在不同任务上实现了最先进的性能,但其规模和计算需求引发了关于可访问性和可持续性的担忧。现有的效率方法通常需要额外的重新训练或微调,限制了其实用性。最近的研究发现,深度神经网络表现出内部表示的相似性。虽然这种相似性已被用于启用技术如模型缝合和合并,但网络内部的冗余性仍较少被用作效率提升的来源。在本文中,我们介绍了Transformer优化使用自适应和简单变换(TOAST),一个框架利用这些冗余性,用轻量级闭式映射(如线性变换或甚至身份函数)近似整个Transformer块,而无需任何额外训练。在最先进的预训练视觉模型(如ViT、DINOv2、DeiT)和从MNIST到ImageNet-1k的各类数据集上,TOAST在减少参数和计算量的同时,保持并有时提升下游性能。这些结果表明,Transformer深度的大部分可以被简单函数替代,为高效基础模型提供了新的视角。

英文摘要

Foundation models achieve state-of-the-art performance across different tasks, but their size and computational demands raise concerns about accessibility and sustainability. Existing efficiency methods often require additional retraining or finetuning, limiting their practicality. Recent findings suggest that deep neural networks exhibit internal representation similarities. While such similarities across different models have been exploited for enabling techniques such as model stitching and merging, intra-network redundancy remains underexplored as a source for efficiency gains. In this paper, we introduce Transformer Optimization using Adaptive and Simple Transformations (TOAST), a framework that exploits these redundancies to approximate entire transformer blocks with lightweight closed-form mappings, such as linear transformations or even the identity function, without any additional training. Across state-of-the-art pretrained vision models (e.g., ViT, DINOv2, DeiT) and datasets ranging from MNIST to ImageNet-1k, TOAST reduces parameters and computation while preserving, and in some cases improving, downstream performance. These results show that large portions of transformer depth can be replaced by trivial functions, opening a new perspective on efficient foundation models.

2410.02064 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Inspection and Control of Self-Generated-Text Recognition Ability in Llama3-8b-Instruct

对Llama3-8b-Instruct自生成文本识别能力的检查与控制

Christopher Ackerman, Nina Panickssery

AI总结 本研究探讨了LLM是否能识别自身生成的文本,发现Llama3-8b-Instruct模型能够区分自身输出与人类输出,并通过残差流中的特定向量控制其行为和感知,揭示了模型自我归属的认知机制。

Comments 10 pages, 13 figs, 2 tables, accepted as conference paper to ICLR 2025

Journal ref The Thirteenth International Conference on Learning Representations (ICLR 2025)

详情
AI中文摘要

已报告LLM能够识别其自身生成的文本,这可能对AI安全有重要影响,但研究较少。我们调查这一现象,以确定其在行为层面是否稳健发生,观察行为是如何实现的,以及是否可以控制。首先,我们发现Llama3-8b-Instruct聊天模型(而非基础Llama3-8b模型)能够可靠地区分自身输出与人类输出,并提供证据表明聊天模型很可能利用其在训练后对自身输出的经验来完成文本识别任务。其次,我们识别出残差流中一个在模型正确识别自身生成文本时被差异激活的向量,证明该向量对自我归属相关信息的响应,并提供证据表明该向量与模型中的“自我”概念相关,并展示该向量与模型感知和声明自我归属能力的因果关系。最后,我们证明该向量可用于控制模型的行为和感知,通过将其应用于模型生成输出时,可引导模型声称或否认作者身份;通过将其应用于模型阅读的文本时,可引导模型相信或不相信其写了任意文本。

英文摘要

It has been reported that LLMs can recognize their own writing. As this has potential implications for AI safety, yet is relatively understudied, we investigate the phenomenon, seeking to establish whether it robustly occurs at the behavioral level, how the observed behavior is achieved, and whether it can be controlled. First, we find that the Llama3-8b-Instruct chat model - but not the base Llama3-8b model - can reliably distinguish its own outputs from those of humans, and present evidence that the chat model is likely using its experience with its own outputs, acquired during post-training, to succeed at the writing recognition task. Second, we identify a vector in the residual stream of the model that is differentially activated when the model makes a correct self-written-text recognition judgment, show that the vector activates in response to information relevant to self-authorship, present evidence that the vector is related to the concept of "self" in the model, and demonstrate that the vector is causally related to the model's ability to perceive and assert self-authorship. Finally, we show that the vector can be used to control both the model's behavior and its perception, steering the model to claim or disclaim authorship by applying the vector to the model's output as it generates it, and steering the model to believe or disbelieve it wrote arbitrary texts by applying the vector to them as the model reads them.

2409.15980 2026-05-19 cs.CV cs.AI 版本更新

Leveraging Unsupervised Learning for Cost-Effective Visual Anomaly Detection

利用无监督学习实现高效视觉异常检测

Yunbo Long, Zhengyang Ling, Sam Brook, Duncan McFarlane, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系)

AI总结 本研究提出一种低成本视觉异常检测系统,通过预训练模型和低成本硬件,利用少量数据实现高准确率的异常检测,适用于中小型企业。

详情
AI中文摘要

传统的基于机器学习的视觉检测系统需要大量数据收集和重复模型训练来提高准确性。这些系统通常需要昂贵的相机、计算设备和显著的机器学习专业知识,这对中小型企业构成重大负担。本研究探索利用预训练模型和低成本硬件的无监督学习方法,开发一种高效的视觉异常检测系统。该系统利用Anomalib的无监督学习模型,并通过openVINO部署在经济型Raspberry Pi硬件上。结果表明,该系统仅用10张正常产品图像即可在Raspberry Pi上完成异常检测的训练和推理,耗时仅90秒,达到F1宏评分超过0.95的性能。尽管系统对环境变化如光照、产品摆放或背景略有敏感,但其仍为中小型企业提供了一种快速且经济的工厂自动化检测方法。代码可在https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning获取。

英文摘要

Traditional machine learning-based visual inspection systems require extensive data collection and repetitive model training to improve accuracy. These systems typically require expensive camera, computing equipment and significant machine learning expertise, which can substantially burden small and medium-sized enterprises. This study explores leveraging unsupervised learning methods with pre-trained models and low-cost hardware to create a cost-effective visual anomaly detection system. The research aims to develop a low-cost visual anomaly detection solution that uses minimal data for model training while maintaining generalizability and scalability. The system utilises unsupervised learning models from Anomalib and is deployed on affordable Raspberry Pi hardware through openVINO. The results show that this cost-effective system can complete anomaly defection training and inference on a Raspberry Pi in just 90 seconds using only 10 normal product images, achieving an F1 macro score exceeding 0.95. While the system is slightly sensitive to environmental changes like lighting, product positioning, or background, it remains a swift and economical method for factory automation inspection for small and medium-sized manufacturers. The code is available at https://github.com/Yunbo-max/Cost-Effective-Visual-Anomaly-Detection-using-Unsupervised-Learning.

2404.10981 2026-05-19 cs.IR cs.AI cs.CL 版本更新

A Survey on Retrieval-Augmented Text Generation for Large Language Models

基于大型语言模型的检索增强文本生成综述

Yizheng Huang, Jimmy Huang

发表机构 * York University(约克大学)

AI总结 本文综述了检索增强文本生成方法,探讨了其在提升大型语言模型生成准确性和可靠性方面的核心方法与主要贡献。

Comments Ongoing Work

Journal ref ACM Computing Surveys, Volume 58, Issue 12, Article No.: 300, Pages 1 - 38, 2026

详情
AI中文摘要

检索增强生成(RAG)将检索方法与深度学习进展相结合,以解决大型语言模型(LLMs)静态限制的问题,通过动态整合最新外部信息。该方法主要关注文本领域,提供了一种成本效益高的解决方案,以生成合理但可能不正确的响应,从而通过真实世界数据提高LLMs的准确性和可靠性。随着RAG的复杂性增加并整合多个可能影响其性能的概念,本文将RAG范式分为四个类别:预检索、检索、后检索和生成,从检索角度提供详细视角。它概述了RAG的发展历程,并通过分析重要研究讨论了该领域的进步。此外,本文介绍了RAG的评估方法,解决了所面临的挑战,并提出了未来研究方向。通过提供有组织的框架和分类,该研究旨在整合现有的RAG研究,明确其技术基础,并突出其扩展大型语言模型适应性和应用潜力的潜力。

英文摘要

Retrieval-Augmented Generation (RAG) merges retrieval methods with deep learning advancements to address the static limitations of large language models (LLMs) by enabling the dynamic integration of up-to-date external information. This methodology, focusing primarily on the text domain, provides a cost-effective solution to the generation of plausible but possibly incorrect responses by LLMs, thereby enhancing the accuracy and reliability of their outputs through the use of real-world data. As RAG grows in complexity and incorporates multiple concepts that can influence its performance, this paper organizes the RAG paradigm into four categories: pre-retrieval, retrieval, post-retrieval, and generation, offering a detailed perspective from the retrieval viewpoint. It outlines RAG's evolution and discusses the field's progression through the analysis of significant studies. Additionally, the paper introduces evaluation methods for RAG, addressing the challenges faced and proposing future research directions. By offering an organized framework and categorization, the study aims to consolidate existing research on RAG, clarify its technological underpinnings, and highlight its potential to broaden the adaptability and applications of LLMs.

2308.06197 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features

利用基本特征的深度知识蒸馏进行复杂面部表情识别

Angus Maiden, Bahareh Nakisa

发表机构 * School of Information Technology, Deakin University(德克萨斯大学信息学院)

AI总结 本文提出了一种基于持续学习的方法,通过知识蒸馏和新颖的预测排序记忆重放,实现了复杂面部表情识别的最新状态,能够在少量样本下准确识别新复合表情类别。

Comments 13 pages, 9 figures, 6 tables, 3 algorithms. Code available at https://github.com/AngusMaiden/complex-FER

详情
AI中文摘要

复杂情绪识别是一种认知任务,迄今为止尚未达到与其他处于或高于人类认知水平的任务相同的优秀性能。通过面部表情识别情绪尤其困难,因为人类面部表达的情绪复杂性。为了使机器在复杂面部表情识别方面达到人类的水平,可能需要实时综合知识和理解新概念,就像人类所做的那样。人类能够仅通过少量示例学习新概念,通过从记忆中蒸馏重要信息。受人类认知和学习的启发,我们提出了一种新的持续学习方法,用于复杂面部表情识别,通过在基本表情类别上构建和保留知识,能够使用少量训练样本准确识别新的复合表情类别。在本工作中,我们还使用GradCAM可视化来展示基本和复合面部表情之间的关系。我们的方法通过知识蒸馏和一种新颖的预测排序记忆重放来利用这种关系,实现了复杂面部表情识别持续学习的最新状态,新类别的总体准确率为74.28%。我们还证明了使用持续学习进行复杂面部表情识别的性能远优于非持续学习方法,比最先进的非持续学习方法提高了13.95%。我们的工作也是首次将少样本学习应用于复杂面部表情识别,仅使用每个类别一个训练样本,就实现了100%的准确率,达到了最先进的水平。

英文摘要

Complex emotion recognition is a cognitive task that has so far eluded the same excellent performance of other tasks that are at or above the level of human cognition. Emotion recognition through facial expressions is particularly difficult due to the complexity of emotions expressed by the human face. For a machine to approach the same level of performance in complex facial expression recognition as a human, it may need to synthesise knowledge and understand new concepts in real-time, as humans do. Humans are able to learn new concepts using only few examples by distilling important information from memories. Inspired by human cognition and learning, we propose a novel continual learning method for complex facial expression recognition that can accurately recognise new compound expression classes using few training samples, by building on and retaining its knowledge of basic expression classes. In this work, we also use GradCAM visualisations to demonstrate the relationship between basic and compound facial expressions. Our method leverages this relationship through knowledge distillation and a novel Predictive Sorting Memory Replay, to achieve the current state-of-the-art in continual learning for complex facial expression recognition, with 74.28% Overall Accuracy on new classes. We also demonstrate that using continual learning for complex facial expression recognition achieves far better performance than non-continual learning methods, improving on state-of-the-art non-continual learning methods by 13.95%. Our work is also the first to apply few-shot learning to complex facial expression recognition, achieving the state-of-the-art with 100% accuracy using only a single training sample per class.

2204.01611 2026-05-19 cs.AI 版本更新

A Machine With Human-Like Memory Systems

具有人类样记忆系统的机器

Taewoon Kim, Michael Cochez, Vincent Francois-Lavet, Mark Neerincx, Piek Vossen

发表机构 * Vrije Universiteit Amsterdam(荷兰瓦赫宁根大学) Technische Universiteit Delft(代尔夫特理工大学)

AI总结 本文提出了一种同时具备语义记忆和事件记忆的智能体,证明双记忆系统优于单一记忆系统,并通过自研环境

Comments Submitted to Human-Centered Design of Symbiotic Hybrid Intelligence 2022 (https://ii.tudelft.nl/humancenteredsymbioticHI/)

详情
AI中文摘要

受认知科学理论启发,我们显式建模了一个同时具备语义记忆和事件记忆的智能体,并证明其比仅拥有单一记忆系统的智能体更优。为了证明这一点,我们设计并发布了自己的挑战环境

英文摘要

Inspired by the cognitive science theory, we explicitly model an agent with both semantic and episodic memory systems, and show that it is better than having just one of the two memory systems. In order to show this, we have designed and released our own challenging environment, "the Room", compatible with OpenAI Gym, where an agent has to properly learn how to encode, store, and retrieve memories to maximize its rewards. The Room environment allows for a hybrid intelligence setup where machines and humans can collaborate. We show that two agents collaborating with each other results in better performance than one agent acting alone.

2605.17361 2026-05-19 cs.LG cs.AI 版本更新

\textsc{MasFACT}: Continual Multi-Agent Topology Learning via Geometry-Aware Posterior Transfer

MasFACT:基于几何感知后验转移的连续多智能体拓扑学习

Xuefei Wang, Jialu Wang, Fengbo Zhang, Yihan Hu, Di Zhang, Yutong Ye, Yikun Ban, Jun Han, Ruijie Wang

发表机构 * Beihang University(北京航空航天大学)

AI总结 本文提出MasFACT框架,通过几何感知后验转移方法,解决多智能体系统中因新任务适应导致的拓扑遗忘问题,提升连续学习任务的准确性和拓扑稳定性。

详情
AI中文摘要

多智能体系统(MAS)借助大型语言模型(LLMs)已成为解决复杂问题的强大范式,其性能关键依赖于底层的智能体间通信拓扑。然而,现有拓扑生成方法主要针对孤立任务进行优化,而现实部署涉及连续演化的任务流,要求先前有效的协作模式被保留和重用而非重新发现或覆盖。本文识别出一种此前未被充分探索的失败模式,即拓扑遗忘,其中适应新任务会使拓扑生成器偏离早期任务所需通信结构。该问题源于智能体层面功能语义和关系通信结构的跨任务不一致。为解决这一挑战,我们提出MasFACT,一种几何感知后验转移框架,通过融合Gromov-Wasserstein最优传输在任务特定智能体空间中转移历史协作知识作为可转移拓扑先验,并通过PAC-Bayes引导的保守后验适应在任务特定可塑性与结构稳定性之间取得平衡。在类别级、领域级和任务级连续设置中的实验表明,MasFACT在提升平均准确率的同时减少了拓扑遗忘,相比强大的拓扑生成和重放基线表现更优,并可无缝集成到不同的MAS拓扑生成器中。

英文摘要

Multi-agent systems (MAS) powered by large language models (LLMs) have emerged as a powerful paradigm for complex problem solving, where performance critically depends on the underlying inter-agent communication topology. However, existing topology generation methods mainly optimize for isolated tasks, while real-world deployments involve streams of evolving tasks, requiring previously effective collaboration patterns to be retained and reused rather than rediscovered or overwritten. We identify a previously underexplored failure mode, \emph{topology forgetting}, in which adapting to new tasks shifts the topology generator away from communication structures required by earlier tasks. This issue stems from cross-task misalignment in both agent-level functional semantics and relational communication structures. To address this challenge, we propose \textbf{\textsc{MasFACT}}, a geometry-aware posterior transfer framework that preserves and reuses historical collaboration knowledge as transferable topology priors. We transfer these priors across task-specific agent spaces through Fused Gromov-Wasserstein optimal transport and perform PAC-Bayes-guided conservative posterior adaptation to balance task-specific plasticity with structural stability. Experiments across class-, domain-, and task-level continual settings demonstrate that \textsc{MasFACT} consistently improves average accuracy while reducing topology forgetting compared to strong topology generation and replay-based baselines, and can be seamlessly integrated with different MAS topology generators.

2605.17355 2026-05-19 cs.AI cs.CL 版本更新

HyperPersona: A Multi-Level Hypergraph Framework for Text-Based Automatic Personality Prediction

HyperPersona: 一种多级超图框架用于基于文本的自动人格预测

Sina Heydari, Majid Ramezani

发表机构 * Department of Computer Science and Information Technology(计算机科学与信息技术系) Institude for Advanced Studies in Basic Sciences (IASBS)(基础科学高级研究 institute (IASBS))

AI总结 本文提出HyperPersona框架,通过超图结构显式建模文本的层次结构,利用基于Transformer的图编码器学习不同语言层之间的交互,从而在不依赖传统心理测量法的情况下,实现更准确的人格预测。

Comments Preprint. Submitted to Artificial Intelligence (Elsevier)

详情
AI中文摘要

作为一种现代商品,语言已成为一个庞大的社会和心理重要特质和概念的存储库,反映了人们如何将思维模式、行为和情感的模式编码成词语。基于文本的自动人格预测(APP)旨在从语言行为中推断人格,提供了一种可扩展的替代传统心理测量法的方案。尽管文本本质上是层次化的,文档级捕捉全局特征,句子级编码局部语义,词级提供细粒度的词汇信息,但大多数现有方法依赖于浅层、顺序或单级表示,忽略了书面语言的多级结构。为了解决这个问题,我们提出了HyperPersona,一个框架,通过超图结构显式建模文本的层次组织(文档、句子和词),其中文档及其句子表示为超边,词表示为节点,从而实现对文本全局、局部和词汇依赖关系的联合建模。随后通过基于Transformer的图编码器学习这些语言层内的交互,产生上下文敏感且结构基础的特征表示用于人格预测。在Big Five人格维度上的实验表明,仅依赖文本的情况下,HyperPersona有效整合了多级语言线索,相比最先进的基线方法实现了更优的性能。这些发现强调了文本层次结构在从自然语言中推进类人人格推断中的关键作用。

英文摘要

As a modern commodity, language has become a vast repository of socially and psychologically significant traits and concepts, reflecting the ways people encode pattern of thoughts, behaviors, and emotions into words. Text-based Automatic Personality Prediction (APP), seeks to infer personality from linguistic behavior, offering a scalable alternative to traditional psychometric assessments. Although text is inherently hierarchical, with the document-level capturing global features, the sentence-level encoding local semantics, and the word-level providing fine-grained lexical information, most existing approaches rely on shallow, sequential, or single-level representations that ignore the multi-level structure of written language. To address this, we propose HyperPersona, a framework that explicitly models the hierarchical organization of text (document, sentence, and word) through hypergraph structure, where a document and its sentences are represented as hyperedges, and the words are represented as nodes, enabling joint modeling of global, local, and lexical dependencies of text. Followed by a transformer-based graph encoder that learns interactions within and across these linguistic layers, yielding context-sensitive and structurally grounded feature representations for personality prediction. Experiments on the Big Five personality dimensions show that, while relying solely on text, HyperPersona effectively integrates multi-level linguistic cues, achieving superior performance compared to state-of-the-art baselines. These findings underscore the critical role of textual hierarchy in advancing human-like personality inference from natural language.

2605.17342 2026-05-19 cs.CL cs.AI 版本更新

Transitivity Meets Cyclicity: Explicit Preference Decomposition for Dynamic Large Language Model Alignment

传递性与循环性:动态大语言模型对齐的显式偏好分解

Yucong Huang, Xiucheng Li, Kaiqi Zhao, Jing Li

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳校区)

AI总结 本文提出Hybrid Reward-Cyclic模型,通过博弈论分解显式分离传递性和循环性偏好,结合动态自我博弈优化方法提升大语言模型对齐效果,实验证明其在混合传递-循环设置中具有结构优势和更高的准确率。

Comments Accepted by ICML 2026

详情
AI中文摘要

标准的RLHF依赖于传递性的标量奖励,无法捕捉人类偏好的循环性质。尽管一些方法如通用偏好模型(GPM)试图解决这一问题,但其隐式公式将层次结构与循环性结合在一起,未能保证主导解。为此,我们提出了混合奖励-循环(HRC)模型,利用博弈论分解将偏好显式分解为正交的传递性(标量)和循环性(向量)组件。此外,我们引入了动态自我博弈偏好优化(DSPPO),将对齐视为随时间变化的游戏,逐步引导策略向纳什均衡发展。合成数据实验进一步验证了HRC在混合传递-循环设置中的结构优势,其中HRC收敛速度更快且准确率更高。在RewardBench 2上的实验表明,HRC在BT和GPM基线基础上持续改进(例如,在Gemma-2B-it上提升1.23%)。特别是,其在Ties领域中的优越表现验证了模型在处理复杂非严格偏好时的鲁棒性。对AlpacaEval 2.0、Arena-Hard-v0.1和MT-Bench的广泛下游评估确认了我们框架的有效性。值得注意的是,当使用Gemma-2B-it作为基础偏好模型时,HRC+DSPPO在AlpacaEval 2.0上达到峰值长度控制下的胜率44.75%,在Arena-Hard-v0.1上达到46.8%,显著优于使用BT或GPM训练的SPPO基线。我们的代码在https://github.com/lab-klc/Hybrid-Reward-Cyclic上公开可用。

英文摘要

Standard RLHF relies on transitive scalar rewards, failing to capture the cyclic nature of human preferences. While some approaches like the General Preference Model (GPM) address this, we identify a theoretical limitation: their implicit formulation entangles hierarchy with cyclicity, failing to guarantee dominant solutions. To address this, we propose the Hybrid Reward-Cyclic (HRC) model, which utilizes game-theoretic decomposition to explicitly disentangle preferences into orthogonal transitive (scalar) and cyclic (vector) components. Complementing this, we introduce Dynamic Self-Play Preference Optimization (DSPPO), which treats alignment as a time-varying game to progressively guide the policy toward the Nash equilibrium. Synthetic data experiments further validate HRC's structural superiority in mixed transitive--cyclic settings, where HRC converges faster and achieves higher accuracy than GPM. Experiments on RewardBench 2 demonstrate that HRC consistently improves over both BT and GPM baselines (e.g., +1.23% on Gemma-2B-it). In particular, its superior performance in the Ties domain empirically validates the model's robustness in handling complex, non-strict preferences. Extensive downstream evaluations on AlpacaEval 2.0, Arena-Hard-v0.1, and MT-Bench confirm the efficacy of our framework. Notably, when using Gemma-2B-it as the base preference model, HRC+DSPPO achieves a peak length-controlled win-rate of 44.75% on AlpacaEval 2.0 and 46.8% on Arena-Hard-v0.1, significantly outperforming SPPO baselines trained with BT or GPM. Our code is publicly available at https://github.com/lab-klc/Hybrid-Reward-Cyclic.

2605.17341 2026-05-19 cs.CV cs.AI 版本更新

Single-Sample Black-Box Membership Inference Attack against Vision-Language Models via Cross-modal Semantic Alignment

通过跨模态语义对齐实现面向视觉-语言模型的单样本黑盒成员推断攻击

Jiaqing Li, Yajuan Lu, Xiaochuan Shi, Gang Wu, ZhongYuan Wang, Chao Liang

发表机构 * Wuhan University(武汉大学) Tarim University(塔里木大学)

AI总结 本文提出了一种基于跨模态语义对齐的新型成员推断攻击框架,针对视觉-语言模型在单样本和黑盒场景下的数据安全风险进行评估,通过量化联合嵌入空间中的对齐程度,显著提升了攻击性能。

详情
AI中文摘要

视觉-语言模型(VLMs)虽取得了显著成功,但其依赖大规模数据集和意外记忆训练数据,带来了重大数据安全风险。成员推断攻击(MIAs)旨在通过确定数据样本是否包含在模型训练集中来评估这些风险。然而,现有针对VLMs的MIAs方法面临关键瓶颈:灰盒方法依赖于内部logits,通常在实际应用程序接口(APIs)中受限,而黑盒方法依赖于大规模统计分布,在单样本场景中表现不佳。为此,我们从跨模态语义对齐的角度研究MIAs,并观察到成员图像由于训练记忆表现出显著更强的图像-描述对齐,而生成的非成员描述可能偏离原始视觉内容。基于这一洞察,我们提出了一种针对严格黑盒和单样本场景的新MIAs框架,该框架在联合嵌入空间中量化此类对齐,从而绕过这些不现实的假设。我们在三个开源和两个闭源VLMs上进行了广泛实验。在VL-MIA/Flicker数据集上,我们的方法在LLaVA-1.5上实现了0.821的AUC,显著优于现有基线。此外,它在各种图像扰动下仍保持稳健,突显了其实用性。

英文摘要

Vision-Language Models (VLMs) have achieved remarkable success, yet their reliance on massive datasets and unintended memorization of training data raise significant data security risk. Membership Inference Attacks (MIAs) aim to assess these risks by determining whether a data sample was included in a model's training set. However, existing MIA methods against VLMs face critical bottlenecks: gray-box method relies on internal logits that are typically restricted in real-world Application Programming Interfaces (APIs), while black-box method depends on large-scale statistical distributions, which struggle in single-sample scenarios. To this end, we investigate MIAs from the perspective of cross-modal semantic alignment, and observe that member images exhibit significantly stronger image-caption alignment due to training memorization, whereas generated captions for non-members may deviate from the original visual content. Leveraging this insight, we propose a novel MIA framework designed for strict black-box and single-sample setting that quantifies such alignment within a joint embedding space, thereby bypassing these unrealistic assumptions. We conducted extensive experiments on three open-source and two closed-source VLMs. On the VL-MIA/Flicker dataset, our method achieves an AUC of 0.821 against LLaVA-1.5, significantly outperforming existing baselines. Furthermore, it remains robust under diverse image perturbations, highlighting its practicality.

2605.17329 2026-05-19 cs.CR cs.AI 版本更新

LPG: Balancing Efficiency and Policy Reasoning in Latent Policy Guardrails

LPG: 在潜在策略护栏中平衡效率与政策推理

Nanxi Li, Zhengyue Zhao, Chaowei Xiao

发表机构 * Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出LPG框架,通过学习动态政策的语义潜在推理,在保持高安全准确率的同时实现低延迟的政策执行。

详情
AI中文摘要

护栏是现代AI系统的关键安全层,但其运行模式正在发生变化。随着LLMs被用作定制助手,安全策略越来越多地在推理时由用户、组织或监管环境指定。这使得安全执行本质上是动态的:护栏应适应变化的安全策略而无需重新训练。然而,这一要求带来了根本性的矛盾:忠实判断复杂的政策环境需要推理能力,而实际部署需要低延迟响应。我们介绍了潜在策略护栏(LPG),一种学习动态政策的语义潜在推理的框架。LPG将意图解释和政策基础所需的内部推理压缩成连续状态,这些状态由决策相关的语义监督。在推理时,它只生成一个紧凑的裁定,该裁定基于违反的政策条款,从而在保持可审计性的同时避免显式推理的延迟。在政策护栏基准测试中,LPG-4B在将推理压缩到仅10个潜在标记的情况下,达到了84.5%的平均安全准确率和77.9%的F1分数,优于最强的动态基线,同时在单样本评估设置下运行速度大约快11倍于Qwen3-4B-Thinking。代码和数据可在https://github.com/SaFo-Lab/Latent_Policy_Guard上获得。

英文摘要

Guardrails are a critical safety layer for modern AI systems, but their operating regime is changing. As LLMs are deployed as customized assistants, safety policies are increasingly specified at inference time by users, organizations, or regulatory contexts. This makes safety enforcement fundamentally dynamic: the guardrail should adapt to changing safety policies without retraining. Yet this requirement creates a fundamental tension: faithfully judging complex policy contexts demands reasoning capability, while practical deployment requires low-latency responses. We introduce Latent Policy Guardrail (LPG), a guardrail framework that learnssemantic latent deliberation over dynamic policies. LPG compresses the internal deliberation needed for intent interpretation and policy grounding into continuous states supervised by decision-relevant semantics. At inference time, it generates only a compact verdict anchored to the violated policy clauses, preserving auditability while avoiding the latency of explicit reasoning. Across policy guardrail benchmarks, LPG-4B reaches 84.5% average safety accuracy and 77.9% F1 by compressing deliberation into just 10 latent tokens, outperforming the strongest dynamic baseline while running roughly 11 times faster than Qwen3-4B-Thinking under the single-sample evaluation setup. Code and data are available at https://github.com/SaFo-Lab/Latent_Policy_Guard.

2605.17327 2026-05-19 cs.RO cs.AI cs.CV 版本更新

Efficient Feature-Free Initialization for Monocular Visual-Inertial Systems Using a Feed-Forward 3D Model

为单目视觉-惯性系统使用前馈3D模型实现高效的特征-free初始化

Yuantai Zhang, Jiaqi Yang, Huajian Zeng, Changhao Chen, Haoang Li, Liang Li, Dezhen Song, Xingxing Zuo

发表机构 * MBZUAI(马克斯·普朗克人工智能研究所) HKUST (GZ)(香港科技大学(广州)) Zhejiang University(浙江大学)

AI总结 本文提出了一种无需视觉特征跟踪的初始化框架,利用前馈3D模型预测的点云,从而提高了单目视觉-惯性导航系统的初始化可靠性与效率,实验表明其初始化成功率超过90%且数据需求显著减少。

详情
AI中文摘要

快速且可靠的初始化对于单目视觉-惯性导航系统(VINS)至关重要,因为它为后续的状态估计建立了初始条件。尽管已有显著进展,但大多数现有方法仍依赖于视觉特征对应关系,并需要3-4秒的传感器数据才能成功初始化,这限制了它们的应用性和效率。随着前馈3D模型的出现,这些模型可以直接从图像预测点云,我们重新从简洁的角度审视视觉-惯性初始化问题。在本文中,我们提出了一种特征-free初始化框架,利用前馈3D模型预测的点云,从而避免了视觉特征跟踪和估计的需要。这种设计显著降低了系统复杂性并提高了初始化的可靠性。在公开数据集上的实验表明,所提出的特征-free初始化方法实现了最高成功率,超过90%,并且显著减少了成功初始化所需的数据持续时间,通常降至1.2秒以下。我们进一步在自采集的数据集上验证了我们的方法,覆盖了各种室内和室外场景,展示了鲁棒性能,特别是在现有方法常失败的视觉退化环境中。代码和数据集可在https://github.com/Yuantai-Z/FF-VIO-Init获取。

英文摘要

Fast and reliable initialization is critical for monocular visual-inertial navigation systems (VINS), as it establishes the starting conditions for subsequent state estimation. Despite steady progress, most existing methods heavily rely on visual feature correspondences and require 3-4 seconds of sensory data for successful initialization, which limits their applicability and efficiency. With the advent of feed-forward 3D models that can directly predict point clouds from images, we revisit the visual-inertial initialization problem from a concise perspective. In this work, we propose a feature-free initialization framework that leverages up-to-scale point clouds predicted by a feed-forward 3D model, thereby obviating the need for visual feature tracking and estimation. This design substantially reduces system complexity and improves the reliability of initialization. Experiments on public datasets demonstrate that the proposed feature-free initialization method achieves the highest success rate, exceeding 90%, and significantly reduces the data duration required for successful initialization, typically to under 1.2 s. We further validate our method on a self-collected dataset covering various indoor and outdoor scenarios, demonstrating robust performance, particularly in visually degraded environments where existing methods often fail. The code and dataset are available at https://github.com/Yuantai-Z/FF-VIO-Init.

2605.17324 2026-05-19 cs.CR cs.AI 版本更新

ASPI: Seeking Ambiguity Clarification Amplifies Prompt Injection Vulnerability in LLM Agents

ASPI:寻求澄清会放大LLM代理中的提示注入漏洞

Udari Madhushani Sehwag, Zhengyang Shan, Heming Liu, Dileepa Lakshan, Joseph Brandifino, Max Fenkell

发表机构 * Scale AI Boston University(波士顿大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Human Frontier Collective(人类前沿集体)

AI总结 该研究探讨了LLM代理在执行任务时寻求澄清的行为对提示注入攻击的影响,发现这种行为会显著增加代理的脆弱性,并提出了ASPI基准测试来评估这一现象。

详情
AI中文摘要

寻求澄清的行为被视为LLM代理的有益特性,使它们能够在执行未明确的任务前解决歧义。然而,这种交互模式的安全影响尚未被探索。我们研究了从标准执行到寻求澄清状态的转变是否会使代理更容易受到提示注入攻击。我们引入了ASPI(模糊状态提示注入),这是一个包含728个任务-攻击场景的基准测试,将澄清作为独立的代理状态,并测量这种状态转换在受控条件下对脆弱性的影响。每个基准测试实例都在匹配的执行和澄清设置下进行评估:在执行设置中,代理在完全指定的指令下执行,并仅通过工具返回的数据遇到对抗性内容;在澄清设置中,代理必须首先请求并纳入额外的用户输入后再执行。我们评估了十种前沿LLM,并发现寻求澄清的行为始终显著地放大了脆弱性。例如,攻击成功率从o3的1.8%增加到34.0%,从Gemini-3-Flash的2.2%增加到35.7%。分解分析显示,这种差距反映了模型处理输入方式的状态依赖性变化以及由于代理请求澄清接口而产生的通道特定效应。这些发现表明,标准执行时间的安全评估系统性地低估了交互代理的攻击面,并且在完全指定任务下的鲁棒性不等于在歧义情况下的鲁棒性。为了可重复性,我们的数据和源代码可在https://github.com/scaleapi/aspi上获得。

英文摘要

Clarification-seeking behavior is widely regarded as a desirable property of LLM agents, enabling them to resolve ambiguity before acting on underspecified tasks. However, the security implications of this interaction pattern remain unexplored. We investigate whether the transition from standard execution to a clarification-seeking state increases an agent's susceptibility to prompt injection attacks. We introduce ASPI (Ambiguous-State Prompt Injection), a benchmark of 728 task-attack scenarios that isolates clarification as a distinct agent state and measures how this state transition affects vulnerability under controlled conditions. Each benchmark instance is evaluated under matched execution and clarification settings: in the execution setting, the agent acts on a fully specified instruction and encounters adversarial content only through tool-returned data; in the clarification setting, the agent must first request and incorporate additional user input before acting. We evaluate ten frontier LLMs and find that clarification-seeking consistently and substantially amplifies vulnerability. For instance, attack success rises from 1.8% to 34.0% for o3 and from 2.2% to 35.7% for Gemini-3-Flash. A decomposition analysis reveals that this gap reflects both a state-dependent shift in how models process incoming content and a channel-specific effect arising from the agent-solicited clarification interface. These findings demonstrate that standard execution-time security evaluation systematically underestimates the attack surface of interactive agents, and that robustness under fully specified tasks does not translate to robustness under ambiguity. For reproducibility, our data and source code are available at https://github.com/scaleapi/aspi.

2605.17320 2026-05-19 cs.OS cs.AI 版本更新

TClone: Low-Latency Forking of Live GUI Environments for Computer-Use Agents

TClone:用于计算机使用代理的低延迟活GUI环境分叉

Yutong Huang, Vikranth Srivatsa, Alex Asch, Hansin Tushar Patwa, Yiying Zhang

发表机构 * University of California, San Diego(加州大学圣迭戈分校) GenseeAI

AI总结 TClone通过分离快速分支创建与持久快照,实现了对计算机使用代理的活GUI环境的低延迟版本控制,从而提高代理执行的安全性和质量。

详情
AI中文摘要

计算机使用代理越来越多地在活的个人工作空间中运行,其操作可以修改文件、应用程序、GUI状态、凭证和认证会话。这在安全性和质量之间产生了张力:代理需要隔离和回滚以避免损坏用户状态,但同时也需要快速分支支持推测执行和并行搜索。现有的虚拟机、容器和检查点/恢复系统可以隔离或恢复工作负载,但它们不提供完整交互工作空间的低延迟版本控制。我们提出了TClone,一种为计算机使用代理设计的可分叉个人工作空间系统。TClone使活GUI工作空间能够被快照、分叉为隔离分支、回滚,并选择性地提交或合并。其设计通过使用兄弟容器、写时复制内存共享、文件系统版本控制、本地GUI执行和异步检查点来分离快速分支创建与持久快照。在我们的端到端代理循环测量中,TClone将总任务延迟分别降低了1.9倍和1.5倍,相比KVM和CRIU。通过将工作空间版本控制作为系统的第一类原语,TClone在真实个人计算环境中支持更安全和高质量的代理执行。

英文摘要

Computer-use agents increasingly operate inside live personal workspaces, where their actions can modify files, applications, GUI state, credentials, and authenticated sessions. This creates a tension between safety and quality: agents need isolation and rollback to avoid damaging user state, but also need fast branching to support speculative execution and parallel search. Existing VMs, containers, and checkpoint/restore systems can isolate or recover workloads, but they do not provide low-latency versioning of a full interactive workspace. We present TClone, a forkable personal workspace system for computer-use agents. TClone enables a live GUI workspace to be snapshotted, forked into isolated branches, rolled back, and selectively committed or merged. Its design separates fast branch creation from durable checkpointing, using sibling containers, copy-on-write memory sharing, filesystem versioning, GUI-local execution, and asynchronous checkpointing. In our end-to-end agent-loop measurement, TClone reduces total task latency by 1.9x and 1.5x over KVM and CRIU. By making workspace versioning a first-class systems primitive, TClone supports safer and higher-quality agent execution over real personal computing environments.

2605.17316 2026-05-19 cs.LG cs.AI 版本更新

Learning Higher-Order Structure from Incomplete Spatiotemporal Data: Multi-Scale Hypergraph Laplacians with Neural Refinement

从不完整时空数据中学习高阶结构:具有神经细化的多尺度超图拉普拉斯算子

Keshu Wu, Sixu Li, Zihao Li, Zhiwen Fan, Xiaopeng Li, Yang Zhou

发表机构 * Texas A&M University(德克萨斯大学A&M分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出了一种多尺度超图拉普拉斯(MSHL)框架,通过两阶段方法从不完整时空观测中学习高阶结构。该方法通过发现阶段构建多尺度超图,并在细化阶段引入条件残差网络,以处理高阶关系中的残差特征,从而在交通网络中实现了更准确的缺失数据填补。

详情
AI中文摘要

传感器网络日益成为现代基础设施的核心,然而标准填补基准所假设的均匀随机缺失模式往往不适用于实际场景。环形检测器在校准期间会断线,路边柜子会沉默附近传感器的集群,而新安装的仪器则无法提供历史数据。这些故障会产生结构化的缺失,其值受传感器组之间的高阶关系约束,而非仅仅是成对接近性。现有低秩和图方法往往无法捕捉这种集体结构,当缺失性变得一致时可能会失效。本文引入多尺度超图拉普拉斯(MSHL),一种两阶段框架,用于从不完整的时空观测中学习高阶结构。发现阶段通过互补的拓扑和残差相关证据构建多尺度超图,并采用仅基于观测的选取器,适应支持的交互尺度。细化阶段添加一个小型超图条件残差网络,其安全性由构造保证:在存在信息残差特征时学习非线性修正,在不存在时则退化为线性估计。我们证明MSHL可以表示无法被成对图先验捕捉的组内守恒模式,能够适应最佳固定尺度,至多一个对数因子,将这种优势转移到验证的填补误差中,并允许单侧细化保证。在两个真实交通网络上评估,针对散落单元缺失、连续块断电和整个传感器黑箱在五种速率下,MSHL在高阶结构可识别时优于成对图基线,否则在采样噪声范围内匹配。结果表明,可靠的基础设施学习存在更广泛的原则:缺失数据不应被视为孤立的填补条目,而应视为发现结构的证据。

英文摘要

Sensor networks increasingly govern modern infrastructure, yet the data they lose are rarely missing in the uniform-random patterns assumed by standard imputation benchmarks. Loop detectors go offline during calibration, roadside cabinets silence clusters of nearby sensors, and newly installed instruments provide no history. Such failures create structured absences whose values are constrained by higher-order relations among groups of sensors, not merely by pairwise proximity. Existing low-rank and graph-based methods often miss this collective structure and can fail when missingness becomes coherent. We introduce Multi-Scale Hypergraph Laplacians (MSHL), a two-stage framework for learning higher-order structure from incomplete spatiotemporal observations. The Discovery stage builds a multi-scale hypergraph from complementary topology and residual-correlation evidence, with an observation-only selector that adapts to the supported interaction scale. The Refinement stage adds a small hypergraph-conditioned residual network that is safe by construction: it learns nonlinear corrections where informative residual features exist and defers to the linear estimate where they do not. We prove that MSHL represents group-conservation patterns inaccessible to pairwise graph priors, adapts to the best fixed scale up to a logarithmic factor, transfers this advantage to held-out imputation error, and admits a one-sided refinement guarantee. On two real traffic networks evaluated across scattered cell missingness, contiguous block outages, and whole-sensor blackouts at five rates, MSHL improves over a pairwise-graph baseline whenever higher-order structure is identifiable and otherwise matches it within sampling noise. The results point to a broader principle for reliable infrastructure learning: missing data should be treated not as isolated entries to fill, but as evidence of structure to discover.

2605.17314 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Weak-to-Strong Elicitation via Mismatched Wrong Drafts

通过不匹配的错误草稿实现弱到强的引导

Wei Deng

发表机构 * Independent Researcher(独立研究者)

AI总结 本文研究了通过较小较弱模型的不匹配错误草稿引导更强学习者的能力,发现这种策略在MATH-500和AIME 2025/2026等任务上表现优异,主要贡献是提出了一种有效的训练方法。

详情
AI中文摘要

我们考虑是否可以利用较小、较弱模型的离线经验来引导更强的学习者,使其在在线策略学习(如GRPO)无法达到的能力。我们发现,将数学上错误但更领域训练的较小模型生成的草稿注入更强学习者的GRPO上下文,能一致优于标准在线GRPO在MATH-500和离分布AIME 2025/2026上。具体来说,我们使用Mathstral-7B作为学习者,Qwen2.5-Math-1.5B作为草稿模型,8.8K Level 3--5 MATH问题(其中MATH-500被排除),并使用Dr. GRPO进行训练。不匹配是关键成分:在保持其他条件不变的情况下,将草稿洗牌到不匹配的问题中,使MATH-500的greedy pass@1提升+1.62pp(n=10种子,p=0.0015,Welch's t检验)。事实上,不匹配-错误变体在MATH-500上所有测试的变体中均优于。在离分布AIME 2025和2026上,不匹配-错误变体在每个样本预算从k=1到k=1024的所有年份中,均将pass@k提升到Mathstral-7B(其原生[INST]格式)和Qwen2.5-Math-1.5B草稿模型之上。所有变体在测试时使用相同的提示,没有草稿注入。该配方——在单个GPU上训练,无需SFT、奖励模型、合成数据和无produce-critique-revise内循环——在Mathstral-7B-v0.1上达到了71.98%的MATH-500成绩,这是目前该模型的最高已发表结果,超过了WizardMath流程在完整MATH上的70.9%(SFT + PPO加过程/指令奖励模型)。

英文摘要

We consider whether off-policy experience from a smaller, weaker model can elicit capability in a stronger learner that on-policy RL fine-tuning (e.g., GRPO) does not reach. We find that injecting mathematically wrong drafts from a smaller but more domain-trained model -- mismatched to the current problem -- into a stronger learner's GRPO context consistently outperforms standard on-policy GRPO on held-out MATH-500 and out-of-distribution AIME 2025/2026. Concretely, we use Mathstral-7B as the learner, Qwen2.5-Math-1.5B as the draft model, 8.8K Level 3--5 MATH problems (with MATH-500 held out), and train with Dr. GRPO. Mismatch is an active ingredient: shuffling drafts to mismatched problems while holding everything else constant yields $+1.62$pp on MATH-500 (greedy pass@1) over the matched-wrong variant ($n=10$ seeds, $p=0.0015$, Welch's $t$). In fact, the mismatched-wrong variant leads all other variants we tested on MATH-500 across both greedy pass@1 and sampling pass@$k$. On out-of-distribution AIME 2025 and 2026, the mismatched-wrong variant uniquely lifts pass@$k$ above both Mathstral-7B (in its native [INST] format) and the Qwen2.5-Math-1.5B draft model at every sample budget from $k=1$ to $k=1024$ across 2 seeds ($+14.2$pp on 2025 and $+9.0$pp on 2026 at pass@1024 over Mathstral-7B), and at pass@1024 also leads no-draft, matched-wrong, and mismatched-correct variants on both years. All variants use the same prompt with no draft injection at test time. The recipe -- trained on a single GPU with no SFT, no reward models, no synthesized data, and no produce-critique-revise inner loop -- reaches 71.98% MATH-500 on Mathstral-7B-v0.1, the highest published result on this model to our knowledge, surpassing the heavier WizardMath pipeline at 70.9% on full MATH (SFT + PPO with process/instruction reward models).

2605.17310 2026-05-19 cs.CV cs.AI 版本更新

Attention Hijacking: Response Manipulation Across Queries in Vision-Language Models

注意力劫持:跨查询的视觉-语言模型响应操控

Zhiqiang Wang, Dongrui Liu, Yan Li, Zonghao Ying, Wei Xue, Wenhan Luo, Yike Guo

发表机构 * Hong Kong University of Science and Technology(香港理工大学) Shanghai Jiao Tong University(上海交通大学) Beihang University(北京航空航天大学)

AI总结 本文研究了视觉-语言模型中跨查询响应操控问题,提出了一种新的对抗攻击方法Attention Hijacking,通过引导内部注意力分布保持图像主导模式,提高攻击在不同查询下的有效性。

详情
AI中文摘要

现有针对视觉-语言模型(VLMs)的对抗攻击可以将模型输出导向攻击者指定的目标响应,但当相同扰动输入与不同文本查询配对时,其效果往往会下降。本文研究了跨查询响应操控,即期望一个对抗示例在多样化的用户查询中保持有效。我们首先分析了现有攻击的局限性,发现成功转移与在响应生成过程中保持图像主导的注意力模式密切相关。受此观察启发,我们提出了Attention Hijacking,一种新的对抗攻击方法,该方法明确引导内部注意力分布向持久的图像主导模式倾斜。通过放大视觉标记对目标响应标记的影响,同时抑制文本标记的竞争影响,我们的方法减少了 manipulated 输出对特定查询用语的依赖。在广泛使用的VLMs上的大量实验表明,Attention Hijacking显著提高了跨查询转移性,适用于多样化的目标响应和未见查询。该方法也有效扩展到多种攻击场景,为VLMs中注意力稳定性在可转移响应操控中的作用提供了新的见解。

英文摘要

Existing adversarial attacks on vision-language models (VLMs) can steer model outputs toward attacker-specified target responses, but their effectiveness often degrades when the same perturbed input is paired with different textual queries. This paper studies cross-query response manipulation, where a single adversarial example is expected to remain effective across diverse user queries. We first analyze the limitations of existing attacks and find that successful transfer is closely associated with preserving an image-dominant attention pattern during response generation. Motivated by the observation, we propose \textbf{Attention Hijacking}, a novel adversarial attack that explicitly steers internal attention distributions toward a persistent image-dominant pattern. By amplifying the influence of visual tokens on target response tokens while suppressing the competing influence of textual tokens, our method reduces the dependence of the manipulated output on the specific wording of the query. Extensive experiments on widely used VLMs show that Attention Hijacking substantially improves cross-query transferability across diverse target responses and unseen queries. The method also extends effectively to multiple attack scenarios, offering new insights into the role of attention stability in transferable response manipulation for VLMs.

2605.17309 2026-05-19 cs.CV cs.AI 版本更新

StyleText: A Large-Scale Dataset and Benchmark for Stylized Scene Text Inpainting

StyleText: 一个大规模数据集和基准,用于具有风格保留的场景文本修复

Aleksandr Simonyan, Nipun Jindal

发表机构 * Adobe Inc.(Adobe公司)

AI总结 本文提出StyleText,一个用于具有风格保留的场景文本修复的大规模数据集和基准,通过控制评估文本可读性和视觉一致性,利用共享场景上下文。

Comments Accepted at the SynData4CV Workshop, CVPR 2026. 8 pages + 1 page of references, 5 figures, 4 tables

详情
AI中文摘要

我们提出了StyleText,一个用于局部场景文本修复的大型数据集和基准,具有风格保留。StyleText包含28,518个图像-掩码-提示三元组,分为9,932个场景家族,使能够受控评估文本可读性和视觉一致性。我们通过自动化流程构建数据集,该流程结合LLM提示模板、基于Flux的源生成与键值(KV)缓存注入、基于OCR的语义过滤、多边形掩码提取以及掩码条件的FluxFill增强。我们定义了一个可重复的评估协议,使用归一化的OCR度量(词准确率和字符错误率)和CLIP图像-图像相似性,结合显式预处理。在StyleText上训练的FluxFill+LoRA基线在初始化基础上显著提高了OCR准确性,同时保持场景风格一致性,为未来的比较建立了有力的参考点。

英文摘要

We present StyleText, a large-scale dataset and benchmark for localized scene-text inpainting with style preservation. StyleText contains 28,518 image-mask-prompt triplets grouped into 9,932 scene families, enabling controlled evaluation of text legibility and visual consistency under shared scene context. We construct the dataset with an automated pipeline that combines LLM prompt templating, Flux-based source generation with key-value (KV) cache injection, OCR-based semantic filtering, polygon mask extraction, and mask-conditioned FluxFill augmentation. We define a reproducible evaluation protocol using normalized OCR metrics (word accuracy and character error rate) and CLIP image-image similarity with explicit preprocessing. A FluxFill+LoRA baseline trained on StyleText improves OCR accuracy substantially over initialization while maintaining scene style consistency, establishing a strong reference point for future comparisons.

2605.17308 2026-05-19 cs.AI 版本更新

Reasoning Before Diagnosis: Physician-Inspired Structured Thinking for ECG Classification

在诊断前进行推理:受医生启发的结构化思维用于心电图分类

Yang Wu, Xiaoyan Yuan, Hau-San Wong, Xiping Hu

发表机构 * City University of Hong Kong(香港城市大学) Beijing Institute of Technology(北京理工大学) Shenzhen MSU-BIT University(深圳MSU-BIT大学)

AI总结 本文提出CardioThink框架,通过结构化推理过程提升心电图分类的临床相关性,并引入SSPO方法以优化诊断结果的准确性和可解释性。

详情
AI中文摘要

心电图(ECG)临床诊断依赖于对多个层次方面的结构化推理,包括心律、传导特性、波形形态和总体诊断印象。然而,现有大多数方法直接从ECG信号预测标签,缺乏显式的临床推理过程,导致决策不透明且不具临床相关性。为弥合这一差距,我们提出CardioThink,一个受医生启发的多模态大语言模型(MLLM)框架,通过可解释的中间阶段(心律、传导、形态和印象)显式建模诊断推理过程,以推导最终分类结果。此外,我们引入结构化集合策略优化(SSPO)以联合优化对这种结构化推理格式的遵循程度和变量大小诊断集的准确性,而无需手动标注的推理轨迹。在多样化的ECG基准测试中,广泛实验表明,我们的方法在诊断准确性上显著优于现有方法,同时提供可解释的临床推理。值得注意的是,推理质量评估确认SSPO显著增强了生成的推理依据的临床有效性。这些发现表明,超越直接标签预测,转向结构化推理为未来ECG建模提供了更符合临床需求的方向。

英文摘要

Electrocardiogram (ECG) diagnosis in clinical practice relies on structured reasoning over multiple hierarchical aspects, including cardiac rhythm, conduction properties, waveform morphology, and overall diagnostic impression. However, most existing approaches predict labels directly from ECG signals without explicit clinical reasoning, resulting in opaque decisions that lack clinical alignment. To bridge this gap, we propose CardioThink, a physician-inspired multimodal large language model (MLLM) framework that explicitly models the diagnostic reasoning process through human-interpretable intermediate stages (rhythm, conduction, morphology, and impression) to derive final classification results. Furthermore, we introduce Structured Set Policy Optimization (SSPO) to jointly optimize adherence to this structured reasoning format and the accuracy of variable-size diagnostic sets, without requiring manually annotated reasoning traces. Extensive experiments on diverse ECG benchmarks demonstrate the significant superiority of our approach in diagnostic accuracy, while simultaneously providing interpretable clinical reasoning. Notably, reasoning quality evaluations confirm that SSPO substantially enhances the clinical validity of the generated rationales. These findings reveal that moving beyond direct label prediction toward structured reasoning offers a more clinically aligned direction for future ECG modeling.

2605.17307 2026-05-19 q-fin.PM cs.AI cs.LG cs.NE q-fin.TR 版本更新

Deep Reinforcement Learning Framework for Diversified Portfolio Management Across Global Equity Markets

面向全球股票市场的多样化投资组合管理的深度强化学习框架

Kamil Kashif, Robert Ślepaczuk

发表机构 * Quantitative Finance Research Group, Faculty of Economic Sciences, University of Warsaw(经济科学学院量化金融研究组,华沙大学) Quantitative Finance Research Group, Department of Quantitative Finance and Machine Learning, Faculty of Economic Sciences, University of Warsaw(经济科学学院量化金融与机器学习系量化金融研究组,华沙大学)

AI总结 本文提出并评估了一个深度强化学习框架,用于动态分配全球股票市场投资组合,通过比较五种模型配置,探讨了奖励函数、策略结构、投资组合约束和时间编码器对风险调整后表现的影响。

Comments 67 pages, 11 figures, 16 tables

详情
AI中文摘要

本研究开发并评估了一个深度强化学习框架,用于动态分配全球股票市场投资组合。Soft Actor-Critic算法被用于在马尔可夫决策过程中学习连续的投资组合权重,将交易成本、换手惩罚和多样化约束纳入奖励函数中。比较了五种模型配置,这些配置在奖励公式、策略结构(扁平与分层Dirichlet)、投资组合约束和时间编码器(LSTM与Transformer)方面有所不同,并通过走步优化在2003-2026年的纳斯达克100、日经225和欧元 Stoxx 50十六个外样本折上进行了评估。结果表明,强化学习策略在欧元 Stoxx 50市场中实现了有竞争力的风险调整后表现,其中观察到统计显著的异常收益,但核心假设仅部分得到验证:没有策略在HAC稳健推断下相对于持有策略实现统计显著的超额收益。制度分析揭示,强化学习在不确定性升高时期增加价值,而跨市场的集合聚合提高了风险调整后表现,并确认了地理多样化的好处。

英文摘要

This study develops and evaluates a deep reinforcement learning framework for dynamic portfolio allocation across global equity markets. The Soft Actor-Critic algorithm is used to learn continuous portfolio weights within a Markov Decision Process, incorporating transaction costs, turnover penalties, and diversification constraints into the reward function. Five model configurations are compared, varying in reward formulation, policy structure (flat versus hierarchical Dirichlet), portfolio constraints, and temporal encoder (LSTM versus Transformer), and evaluated via walk-forward optimization across sixteen out-of-sample folds spanning 2003-2026 on the Nasdaq-100, Nikkei 225, and Euro Stoxx 50. Results show that RL strategies achieve competitive risk-adjusted performance primarily in the Euro Stoxx 50, where statistically significant abnormal returns are observed, but the central hypothesis is only partially confirmed: no strategy achieves statistically significant excess returns relative to Buy and Hold under HAC-robust inference across all markets. Regime analysis reveals that RL adds the most value during periods of elevated uncertainty, while ensemble aggregation across markets improves risk-adjusted performance and confirms the benefits of geographic diversification.

2605.17305 2026-05-19 cs.AI cs.CL 版本更新

CyberCorrect: A Cybernetic Framework for Closed-Loop Self-Correction in Large Language Models

CyberCorrect: 一种基于闭环自修正的大型语言模型框架

Yuning Wu, Yingmin Liu, Yang Shu

发表机构 * School of Software, Henan University, Kaifeng, China(河南大学软件学院,开封,中国) Zhejiang University, Hangzhou, China(浙江大学,杭州,中国)

AI总结 本文提出CyberCorrect框架,将大型语言模型的自我修正建模为闭环控制系统,通过三模态错误检测器、类型导向的修正控制器和收敛判断器,提升模型的自我修正能力和准确性。

Comments 6 pages, 1 figure, submitted to IEEE SMC 2026

详情
AI中文摘要

大型语言模型(LLM)的自我修正能力——即检测并修复生成输出中的错误——仍然主要依赖于通用提示,如'请重新考虑你的答案',缺乏系统性的错误分析和收敛保证。我们提出了CyberCorrect,一种将LLM自我修正建模为闭环控制系统的方法,基于控制论理论。该框架将LLM生成器视为被控对象,并引入三模态错误检测器(结合自一致性、口头化信心和逻辑链验证)作为传感器。类型导向的修正控制器根据诊断的错误类别生成针对性的修复指令,而收敛判断器利用控制理论适应的稳定性标准确定迭代终止。我们进一步引入了三个控制理论评估指标——收敛率、超调率和振荡率——以捕捉修正动态,而不仅仅是最终准确性。在我们构建的CyberCorrect-Bench(440个带有标注错误类型和修正路径的推理任务)上的实验表明,CyberCorrect实现了79.8%的最终准确性,比现有最佳自我修正方法提高了6.2个百分点,同时通过其收敛控制机制将超调(错误的过度修正)减少了41%。

英文摘要

Large language model (LLM) self-correction -- the ability to detect and fix errors in generated outputs -- remains largely ad hoc, relying on generic prompts such as "please reconsider your answer" without systematic error analysis or convergence guarantees. We propose CyberCorrect, a framework that formalizes LLM self-correction as a closed-loop control system grounded in cybernetic theory. The framework models the LLM generator as the plant and introduces a tri-modal Error Detector (combining self-consistency, verbalized confidence, and logic-chain verification) as the sensor. A type-directed Correction Controller generates targeted repair instructions based on diagnosed error categories, while a Convergence Judge determines iteration termination using stability criteria adapted from control theory. We further introduce three control-theoretic evaluation metrics -- convergence rate, overshoot rate, and oscillation rate -- that capture correction dynamics beyond final accuracy. Experiments on our constructed CyberCorrect-Bench (440 reasoning tasks with annotated error types and correction paths) show that CyberCorrect achieves 79.8% final accuracy, improving upon the best existing self-correction method by 6.2 percentage points, while reducing overshoot (erroneous over-correction) by 41% through its convergence control mechanism.

2605.17292 2026-05-19 cs.AI cs.MA 版本更新

MetaCogAgent: A Metacognitive Multi-Agent LLM Framework with Self-Aware Task Delegation

MetaCogAgent: 一种具有自我意识的任务委托多智能体大语言模型框架

Chenyu Wang, Yang Shu

发表机构 * School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China(郑州大学计算机与人工智能学院) Zhejiang University, Hangzhou, China(浙江大学)

AI总结 本文提出MetaCogAgent框架,通过引入元认知自我评估单元,使每个智能体在执行任务前评估自身能力边界,从而提升任务准确性并减少API调用次数。

Comments 6 pages, submitted to IEEE SMC 2026

详情
AI中文摘要

多智能体大语言模型(LLM)系统通过智能体协作展示了解决复杂任务的潜力。然而,现有框架基于预定义角色分配任务,未考虑智能体能否准确评估自身能力边界,导致超出其专长的任务执行过于自信。受认知科学中的元认知理论启发,我们提出了MetaCogAgent,一种多智能体LLM框架,其中每个智能体配备元认知自我评估单元,在执行前评估任务能力匹配度。该框架提出了三个贡献:(1)一种自我评估机制,通过结合口头不确定性与历史能力档案估计每项任务的置信度;(2)一种自适应委托协议,通过跨智能体评估将低置信度任务路由至更适合的智能体;(3)一种能力边界学习模块,通过闭环反馈迭代优化每个智能体的能力模型。在我们构建的MetaCog-Eval基准(700项任务,5个认知维度)上的实验表明,MetaCogAgent实现了82.4%的任务准确率——比最佳路由基线高8.7%——同时比AutoGen少使用5%的API调用,比投票集少34%。消融研究确认了每个元认知组件对整体系统性能的贡献。

英文摘要

Multi-agent large language model (LLM) systems have shown promise for solving complex tasks through agent collaboration. However, existing frameworks assign tasks based on predefined roles without considering whether an agent can accurately assess its own competence boundaries, leading to overconfident execution on tasks beyond its expertise. Inspired by metacognition theory from cognitive science, we propose MetaCogAgent, a multi-agent LLM framework where each agent is equipped with a Metacognitive Self-Assessment Unit that evaluates task-capability alignment before execution. The framework introduces three contributions: (1) a self-assessment mechanism that estimates per-task confidence by combining verbalized uncertainty with historical capability profiles; (2) an adaptive delegation protocol that routes low-confidence tasks to better-suited agents through cross-agent evaluation; and (3) a capability boundary learning module that iteratively refines each agent's competence model via cybernetic feedback. Experiments on our constructed MetaCog-Eval benchmark (700 tasks across 5 cognitive dimensions) demonstrate that MetaCogAgent achieves 82.4% task accuracy -- 8.7% above the best routing baseline -- while using 5% fewer API calls than AutoGen and 34% fewer than ensemble voting. Ablation studies confirm that each metacognitive component contributes to overall system performance.

2605.17288 2026-05-19 cs.CR cs.AI 版本更新

When Efficiency Backfires: Cascading LLMs Trigger Cascade Failure under Adversarial Attack

效率反噬:对抗攻击下级联LLM引发级联故障

Zehan Sun, Dingfan Chen, Songze Li

发表机构 * Southeast University(东南大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所)

AI总结 研究探讨了级联LLM系统在对抗攻击下的脆弱性,提出了一种新的攻击框架,通过约束序列协同优化对抗后缀,在级联依赖下同时利用轻量级模型和决策机制,验证了此类攻击的有效性和严重性。

Comments under review

详情
AI中文摘要

大型语言模型(LLM)级联系统通过使用轻量级模型处理查询并选择性地将复杂情况升级到更强大的模型,旨在平衡效率和性能。此类系统旨在减少计算成本和延迟,同时保持任务性能,使其成为大规模部署的有吸引力的选择。然而,级联设计通过扩展的攻击面引入了新的漏洞:轻量级前端模型和内部决策机制的引入带来了新的弱点。在本文中,我们首次研究了LLM级联系统对目标对抗操纵的易受性,这种操纵破坏了性能目标和级联设计的预期成本优势。我们提出了一种新的攻击框架,通过约束序列协同优化对抗后缀,在级联依赖下同时利用轻量级模型和决策机制。该框架适应于不同能力的对手,导致成本效率和准确性可控的降级。与以往针对独立模型的攻击不同,我们的方法战略性地利用级联结构,以实现更强的冲击效果。在多样化的数据集和代表性LLM级联系统上进行了广泛的实验,验证了此类攻击的实用性和严重性。我们的发现突显了对LLM级联系统安全性的严格审查的紧迫性,并呼吁对这种设计固有的系统风险引起更广泛的关注。

英文摘要

Large Language Model (LLM) cascade systems are designed to balance efficiency and performance by processing queries with lightweight models while selectively escalating complex cases to more powerful ones. Such systems seek to reduces computational cost and latency while maintaining task performance, making it an appealing choice for large-scale deployment. However, the cascade design introduces new vulnerabilities through an expanded attack surface: the inclusion of lightweight front-end models and internal decision mechanisms introduces new weaknesses. In this work, we present the first study demonstrating that LLM cascade systems are susceptible to targeted adversarial manipulation, which disrupts both performance objectives and the intended cost advantages of the cascade design. We propose a novel attack framework that employs constrained sequential collaborative optimization of adversarial suffix under cascade dependencies, enabling simultaneous exploitation of lightweight models and decision mechanisms. This framework adapts to adversaries with varying capabilities, inducing controllable degradation in both cost-efficiency and accuracy. Unlike prior attacks targeting standalone models, our approach strategically leverages the cascade structure to achieve significantly stronger impact. Extensive experiments across diverse datasets and representative LLM cascade systems validate the practicality and severity of this attack. Our findings highlight the urgent need to rigorously scrutinize the security of LLM cascade systems and call for broader attention to the systemic risks inherent in such designs.

2605.17285 2026-05-19 cs.LG cs.AI 版本更新

UNR-Explainer: Counterfactual Explanations for Unsupervised Node Representation Learning Models

UNR-Explainer: 为无监督节点表示学习模型生成反事实解释

Hyunju Kang, Geonhee Han, Hogun Park

发表机构 * Department of Artificial Intelligence(人工智能系)

AI总结 本文提出UNR-Explainer,一种基于蒙特卡洛树搜索的反事实解释生成方法,用于无监督节点表示学习模型,通过识别关键子图来提升对下游任务如链接预测和聚类的理解。

Comments Accepted at ICLR 2024

详情
AI中文摘要

节点表示学习,如图神经网络(GNNs),已成为机器学习中的关键方法。对可靠解释生成的需求日益增加,但无监督模型仍处于探索阶段。为此,我们提出了一种在无监督节点表示学习中生成反事实(CF)解释的方法。我们识别出在扰动后导致感兴趣节点k近邻显著变化的最重要子图。基于k近邻的反事实解释方法为理解无监督下游任务,如top-k链接预测和聚类,提供了简单但关键的信息。因此,我们引入UNR-Explainer,基于蒙特卡洛树搜索(MCTS)为无监督节点表示学习方法生成具有表现力的反事实解释。所提出的方法在多样化的数据集上对无监督的GraphSAGE和DGI表现出优越的性能。

英文摘要

Node representation learning, such as Graph Neural Networks (GNNs), has emerged as a pivotal method in machine learning. The demand for reliable explanation generation surges, yet unsupervised models remain underexplored. To bridge this gap, we introduce a method for generating counterfactual (CF) explanations in unsupervised node representation learning. We identify the most important subgraphs that cause a significant change in the k-nearest neighbors of a node of interest in the learned embedding space upon perturbation. The k-nearest neighbor-based CF explanation method provides simple, yet pivotal, information for understanding unsupervised downstream tasks, such as top-k link prediction and clustering. Consequently, we introduce UNR-Explainer for generating expressive CF explanations for Unsupervised Node Representation learning methods based on a Monte Carlo Tree Search (MCTS). The proposed method demonstrates superior performance on diverse datasets for unsupervised GraphSAGE and DGI.

2605.17284 2026-05-19 cs.CV cs.AI cs.LG cs.RO 版本更新

CLAP: Contrastive Latent-space Prompt Optimization for End-to-end Autonomous Driving

CLAP:用于端到端自动驾驶的对比潜在空间提示优化

Ruiyang Zhu, Yuehan He, Boyuan Zheng, Zesen Zhao, Ahmad Chalhoub, Qingzhao Zhang, Z. Morley Mao

发表机构 * University of Michigan(密歇根大学) University of Arizona(亚利桑那大学)

AI总结 本文提出CLAP方法,通过对比潜在空间提示优化解决自动驾驶中罕见但安全关键的长尾场景问题,利用V2X通信获取数据并优化提示,从而提升规划性能。

Comments 9 pages + appendix

详情
AI中文摘要

端到端自动驾驶系统通过视觉-语言-动作(VLA)模型在常见驾驶场景中表现出色,但在罕见但安全关键的长尾场景如活跃施工区和复杂让行几何中表现脆弱。本文提出了一种方法,超越数据扩展和模型训练,解决长尾挑战场景。我们引入CLAP(对比潜在空间提示优化),一种位置感知的适应框架,通过车辆到一切(V2X)通信按需检索,将冻结的VLA驾驶模型与每条道路块的软提示相结合。我们的方法基于VLA潜在空间的两个观察:(i)在VLA的隐藏状态层,来自相同道路块的场景紧密聚集并占据潜在空间的紧凑区域;(ii)在单个道路块内,长尾和正常帧在潜在表示中高度混合,难以改进其中一个而不影响另一个。CLAP通过两阶段流程解决此问题:监督对比学习发现道路块特定的困难场景方向,随后方向性正则化提示优化选择性改进挑战帧同时保持正常帧性能。在NAVSIM基准上,使用各种最先进的VLA后端,CLAP将挑战场景规划错误减少了24%,在不回归正常帧的情况下显著提高了规划性能。

英文摘要

End-to-end autonomous driving systems powered by Vision-Language-Action (VLA) models achieve strong performance on common driving scenarios, yet remain brittle in rare but safety-critical long-tail situations such as active construction zones and complex yielding geometries. In this paper, we present a method that addresses the long-tail challenging scenes beyond data scaling and model training. We introduce CLAP (Contrastive Latent-space Prompt optimization), a location-aware adaptation framework that augments a frozen VLA driving model with per-roadblock soft prompts, optimized from crowdsourced data and retrieved on demand via Vehicle-to-Everything (V2X) communication. Our approach rests on two observations from VLAs' latent space: (i) at the VLA's hidden-state layer, scenarios from the same roadblock cluster tightly and occupy compact regions of the latent space; and (ii) within a single roadblock, long-tail and normal frames are heavily intermixed in the latent representation, making it difficult to improve one without disturbing the other. CLAP addresses this via a two-stage pipeline: supervised contrastive learning to discover a roadblock-specific hard-scene direction, followed by directionally regularized prompt optimization that selectively improves challenging frames while preserving normal frame performance. On the NAVSIM benchmark with various state-of-the-art VLA backbones, CLAP reduces challenging scenario planning error by 24% with no regression on normal frames, significantly improving planning performance.

2605.17283 2026-05-19 cs.CL cs.AI 版本更新

OProver: A Unified Framework for Agentic Formal Theorem Proving

OProver:一个用于代理形式定理证明的统一框架

David Ma, Kaijing Ma, Shawn Guo, Yunfeng Shi, Enduo Zhao, Jiajun Shi, Zhaoxiang Zhang, Gavin Cheung, Jiaheng Liu, Zili Wang

发表机构 * Lean 4

AI总结 本文提出OProver,一个用于Lean 4的统一框架,通过迭代修订检索到的编译器验证证明和Lean编译器反馈来改进代理证明,通过持续预训练和迭代后训练,使OProver-32B在多个基准测试中取得最佳成绩。

详情
AI中文摘要

近年来,形式定理证明的进步得益于大规模证明生成和验证器感知训练,但代理证明很少被整合到证明器训练中,仅在推理时间出现。我们提出了OProver,一个用于Lean 4的统一框架,其中失败的证明尝试通过检索到的编译器验证证明和Lean编译器反馈进行迭代修订。OProver通过持续预训练和迭代后训练进行训练:每次迭代运行代理证明,将新验证的证明索引到OProofs和检索内存中,使用修复轨迹作为SFT数据,并使用未解决的困难案例用于RL。OProofs由公开的Lean资源、大规模证明合成和代理证明轨迹构建,包含177万条Lean语句、686万条编译器验证证明以及带有检索上下文、失败尝试、反馈和修复的序列轨迹。在五个基准测试中,OProver-32B在MiniF2F(93.3%)、ProverBench(58.2%)和PutnamBench(11.3%)上取得最佳Pass@32,且在MathOlympiad(22.8%)和ProofNet(33.2%)上排名第二,比任何先前的开放式整体证明证明器的顶级位置更多。

英文摘要

Recent progress in formal theorem proving has benefited from large-scale proof generation and verifier-aware training, but agentic proving is rarely integrated into prover training, appearing only at inference time. We present OProver, a unified framework for agentic formal theorem proving in Lean 4, in which failed proof attempts are iteratively revised using retrieved compiler verified proofs and Lean compiler feedback. OProver is trained through continued pretraining followed by iterative post-training: each iteration runs agentic proving, indexes newly verified proofs into OProofs and the retrieval memory, uses repair trajectories as SFT data, and uses unresolved hard cases for RL. OProofs is built from public Lean resources, large-scale proof synthesis, and agentic proving traces, containing 1.77M Lean statements, 6.86M compiler-verified proofs, and serialized trajectories with retrieved context, failed attempts, feedback, and repairs. Across five benchmarks, OProver-32B attains the best Pass@32 on MiniF2F (93.3%), ProverBench (58.2%), and PutnamBench (11.3%), and ranks second on MathOlympiad (22.8%) and ProofNet (33.2%) more top placements than any prior open-weight whole-proof prover.

2605.17281 2026-05-19 cs.SE cs.AI 版本更新

ContractBench: Can LLM Agents Preserve Observation Contracts?

ContractBench: LLM Agents能否保持观察契约?

Jicheng Wang, Yifeng He, Zili Wang, Hanwen Xing, Arkaprava De, Hao Chen

发表机构 * University of California, Davis(加州大学戴维斯分校) University of Southern California(南加州大学) University of Hong Kong(香港大学)

AI总结 本文提出ContractBench基准测试,用于评估LLM代理在保持观察契约(如时间有效性及字节完整性)方面的能力,发现现有模型在该任务上仍存在显著缺陷。

详情
AI中文摘要

工具增强的LLM代理调用API时,中间输出如预签名URL、会话令牌和OAuth状态参数等被视为观察契约:这些艺术制品的后续使用受到外部系统限制。我们证明观察契约合规性(保持时间有效性和字节级完整性)是一种涌现且易退化的能力:它既不被通用工具使用能力保证,也不由更大或更新的模型一致提升。为此,我们引入ContractBench,一个包含33个双轴任务的基准测试,探测两种现有基准未评估的垂直故障模式:有效性故障(使用过期的艺术制品)和完整性故障(通过观察到动作的管道腐蚀艺术制品的字节)。我们的评估是确定性和程序性的,通过虚拟时钟控制时间,SHA-256哈希验证字节完整性。我们为每个结果分配一个来自真实世界API规范的故障标签。我们评估了38个模型,并报告了四个发现:(i)没有评估的模型超过80%,Claude-Opus-4.6领先于77.8%,揭示当前前沿模型仍无法遵守观察契约;(ii)在Qwen 3.5中,4B到9B之间出现陡峭的家族能力悬崖,平滑到397B-A17B为70.7%:在悬崖上出现的是中轨迹限制,而不是工具调用能力;(iii)在GPT-5家族中非单调扩展:代理后训练可以通过奉承驱动的退化侵蚀合规性;(iv)我们的故障分类在上下文内作为可操作的奖励信号,使42对GPT-5.1故障获得+7.1 pp的提升。

英文摘要

Tool-augmented LLM agents call APIs whose intermediate outputs, such as presigned URLs, session tokens, and OAuth state parameters, are observation contracts: artifacts whose later use is constrained by the external system that produced them. We show that observation contract compliance (preserving the temporal validity and byte-level integrity) is an emergent, regression-prone capability: it is neither guaranteed by general tool-use ability nor consistently improved by larger or newer models. To measure this, we introduce ContractBench, a benchmark of 33 dual-axis tasks that probe two orthogonal failure modes no existing benchmark evaluates: validity failures (using an artifact after expiry) and integrity failures (corrupting an artifact's bytes through the observation-to-action pipeline). Our evaluation is deterministic and programmatic, with a virtual clock controlling time and SHA-256 hashes verifying byte integrity. We assign each outcome a failure label drawn from real-world API specifications. We evaluate 38 models and report four findings: (i) no evaluated model clears 80%, with Claude-Opus-4.6 leading at 77.8%, revealing that current frontier models still fail to comply with observation contracts; (ii) a sharp within-family capability cliff in Qwen 3.5 between 4B (0%) and 9B (56.6%), smoothing to 70.7% at 397B-A17B: what emerges across the cliff is mid-trajectory restraint, not tool-call competence; (iii) non-monotonic scaling across the GPT-5 family: agentic post-training can erode compliance through sycophancy-driven regression; (iv) our failure taxonomy works as an actionable in-context reward signal, yielding +7.1 pp on 42 paired GPT-5.1 failures.

2605.17279 2026-05-19 cs.SE cs.AI 版本更新

Rover: Context-aware Conflict Resolution with LLM

Rover: 基于上下文的冲突解决系统

Qingyu Zhang, Junzhe Li, Jiayi Lin, Changhua Luo, Chenxiong Qian

发表机构 * The University of Hong Kong(香港大学) Wuhan University(武汉大学)

AI总结 本文提出Rover,一种结合程序分析和大语言模型的冲突解决系统,通过多层代码属性图获取上下文感知提示,提升代码合并的准确性与效率。

详情
AI中文摘要

代码合并是大型项目中的重大挑战。现有解决方案,包括程序分析和机器学习,虽然有潜力,但存在关键限制。程序分析缺乏推断开发者意图的能力,依赖保守策略,将未解决的冲突转交人工处理。同时,基于模型的方法在处理涉及复杂代码依赖的冲突时,由于上下文意识不足而表现不佳。为解决这些差距,我们引入Rover,一种新的冲突解决系统,结合程序分析和大语言模型(LLM)。为了获得上下文感知的提示,我们提出了多层代码属性图(MtCPG),一种新的表示方法,捕捉文件间依赖关系,并为给定冲突启用上下文分析。使用图连通性算法,Rover进一步将冲突代码和相关更改聚类为有意义的“上下文”,引导LLM生成准确的解决方案。我们比较了Rover与独立LLM、机器学习基线MergeGen以及建议提供工具WizardMerge,使用相邻代码作为上下文。评估结果表明,Rover在冲突解决方面优于所有这些方法,在字符、词法和语义层面的相似度更高。

英文摘要

Code merging is a significant challenge, particularly in large-scale projects. Existing solutions, including program analysis and machine learning, show promise but face critical limitations. Program analysis lacks the ability to infer developers' intentions, relying on conservative strategies that offload unresolved conflicts for manual handling. Meanwhile, model-based approaches struggle with conflicts involving complex code dependencies due to insufficient contextual awareness. To address these gaps, we introduce Rover, a novel conflict resolution system that integrates program analysis with large language models (LLMs). To obtain context-aware prompts, we propose Multi-layer Code Property Graph (MtCPG), a new representation capturing inter-file dependencies and enabling contextual analysis for a given conflict. Using graph connectivity algorithms, Rover further clusters conflicting code and associated changes into meaningful "contexts" that guide the LLM in generating accurate resolutions. We compared Rover with standalone LLMs, machine learning baseline MergeGen, and suggestion provider tool WizardMerge with adjacent code as the contexts. Evaluation results show that Rover surpasses all of these approaches in terms of conflict resolution, achieving higher similarity to ground-truth resolutions at character, lexical, and semantic levels.

2605.17278 2026-05-19 cs.AI cs.LG 版本更新

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

A2RBench: 一个用于形式可验证抽象推理基准生成的自动范式

Qingchuan Ma, Yuexiao Ma, Yongkang Xie, Tianyu Xie, Xiawu Zheng, Rongrong Ji

发表机构 * Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education(教育部多媒体可信感知与高效计算重点实验室) Institute of Artificial Intelligence(人工智能研究院)

AI总结 本文提出A2RBench自动范式,通过生成、扩展、评估和分析流程提升抽象推理基准生成效率,发现当前LLM在抽象推理能力上存在根本缺陷,且高信息复杂度输入可简化推理过程。

详情
AI中文摘要

抽象推理能力反映了LLM提取和应用抽象规则的智能和泛化能力。然而,准确测量这一能力仍然具有挑战性:现有基准要么依赖昂贵的手动标注,限制了其规模,要么有风险测量记忆而非真正的推理。为此,我们引入了一个名为A2RBench的自动化流程,包括生成、扩展、评估和分析。具体而言,在生成阶段,LLM创建多样化的任务,要求真正的推理;在扩展阶段,LLM重用已验证的规则并扩展新的输入空间以生成任务变体,实现扩展。然而,这一过程可能导致幻觉。为消除它,我们进一步建立了理论框架并证明,程序验证——测试逆操作是否完美地逆转正向操作(循环一致性)——保证了唯一解。通过在主流LLM上的广泛评估,我们发现:(1)当前LLM在抽象推理上存在根本缺陷,顶级模型在代表性子集上显著低于人类(39.8% vs. 68.5%)。(2)当前LLM在生成3D任务的复杂度上远低于2D和1D,揭示了其对高维任务的理解不足。(3)反直觉的是,信息复杂度更高的输入可以简化推理过程。

英文摘要

Abstract reasoning ability reflects the intelligence and generalization capacity of LLMs to extract and apply abstract rules. However, accurately measuring this ability remains challenging: existing benchmarks either rely on expensive manual annotation, limiting their scale, or risk measuring memorization rather than genuine reasoning. To address this, we introduce an automated pipeline named A2RBench, encompassing generation, expansion, evaluation, and analysis. Specifically, in the generation stage, LLMs create diverse tasks demanding genuine reasoning; in the expansion stage, LLMs reuse validated rules and expand new input spaces to generate task variations, achieving scaling. However, such a process may cause hallucinations. To eliminate it, we further establish a theoretical framework and prove that programmatic verification--testing whether the inverse operation perfectly reverses the forward operation (cycle consistency)--guarantees a unique solution. Through extensive evaluations on mainstream LLMs, we find: (1) Current LLMs exhibit fundamental deficiencies in abstract reasoning, with top models significantly underperforming humans on a representative subset (39.8% vs. 68.5%). (2) Current LLMs fall far short of 2D and 1D in the complexity of generated 3D tasks, revealing their lack of understanding of high-dimensional tasks. (3) Counterintuitively, inputs with higher information complexity can simplify the reasoning process.

2605.17276 2026-05-19 cs.LG cs.AI 版本更新

How Do Electrocardiogram Models Scale?

ECG模型如何扩展?

Jiawei Li, Fabio Bonassi, Ming Jin, Stefan Gustafsson, Johan Sundström, Thomas B. Schön, Antônio H. Ribeiro

发表机构 * Uppsala University(乌普萨拉大学) Griffith University(格里菲斯大学)

AI总结 本文研究了ECG模型在不同规模下的扩展规律,发现监督学习模型在数据受限时表现不佳,而自监督学习模型在模型和数据规模上都具有鲁棒性,同时自监督Transformer在非常大的模型规模上超越了ResNet。

详情
AI中文摘要

尽管扩展定律已为自然语言处理中的基础模型建立了基本框架,但其在心电图(ECG)模型中的适用性仍缺乏充分的描述。事实上,最近的研究并未始终显示出随着ECG模型的大小或预训练数据集大小的增加,下游性能的一致性提升,这使得模型架构归纳偏置、预训练范式以及与规模相关的预期改进的确切作用仍然不明。在本工作中,我们系统地研究了ECG领域内的神经网络和损失到损失扩展定律。通过在大规模CODE数据集(230万条记录)上预训练超过120个模型(参数量从2万到2000万不等),我们解耦了模型架构(ResNet vs. Transformer)和预训练范式(监督学习SL vs. 自监督学习SSL)的影响。我们发现(i)SL模型在分布内是数据瓶颈的,而SSL模型在模型和数据规模上都具有鲁棒性;(ii)对于分布外(OOD)泛化,ResNet比Transformer在参数效率上高1.3到2.5倍,而SSL在数据效率上最高可达16倍,并在未见的临床任务上实现了高达7.6倍的转移效率;(iii)在观察到的规模范围内,基于ResNet的模型通常在OOD损失上表现最低,SSL在未见的临床任务上占据主导地位,而自监督的Transformer在非常大的模型规模上超越了ResNet。我们的结果表明,有效ECG基础模型的路径在于架构和范式的战略对齐,而非单纯的暴力扩展。

英文摘要

While scaling laws have established a fundamental framework for foundation models in natural language processing, their applicability to electrocardiogram (ECG) models remains poorly characterized. Indeed, recent studies do not always yield consistent downstream gains as one increases the model size or pre-training dataset size of ECG models, leaving the exact roles of architectural inductive biases, pre-training paradigms, and expected improvements with size largely unanswered. In this work, we systematically investigate neural and loss-to-loss scaling laws within the ECG domain. By pre-training over $120$ models (ranging from $20$K to $200$M parameters) on the large-scale CODE dataset ($2.3$M records), we decouple the effects of model architecture (ResNet vs. Transformer) and pre-training paradigm, namely supervised learning (SL) versus self-supervised learning (SSL). We found that (i) SL models are data-bottlenecked in-distribution, whereas SSL models scale robustly across both model and data sizes; (ii) for out-of-distribution (OOD) generalization, ResNets are $1.3$ to $2.5$ times more parameter-efficient than Transformers, while SSL is up to $16$ times more data-efficient and achieves up to $7.6$ times higher transfer efficiency than SL on unseen clinical tasks; (iii) across the observed scales, ResNet-based models generally achieve the lowest OOD loss, with SSL dominating on unseen clinical tasks and self-supervised Transformers overtaking at very large model sizes. Our results suggest that the path to effective ECG foundation models lies in the strategic alignment of architecture and paradigm rather than brute-force scaling.

2605.17256 2026-05-19 eess.SY cs.AI cs.LG cs.SY 版本更新

Latency-Aware Deep Learning Benchmark for Real-Time Cyber-Physical Attack and Fault Classification in Inverter-Dominated Power Grids

面向实时机电攻击和故障分类的延迟感知深度学习基准测试

Emad Abukhousa, Saman Zonouz, A. P. Sakis Meliopoulos

发表机构 * Emad Abukhousa(埃马德·阿布库霍萨) Saman Zonouz(萨曼·宗努兹)

AI总结 本文提出了一种延迟感知的深度学习基准测试框架,用于评估在逆变器主导电网中使用高保真时域信号进行电力系统异常检测的深度学习模型。通过系统评估从物理故障和网络攻击中生成的流数据集,评估了八种神经网络架构,包括MLP到Transformer。所有模型都能在亚周期响应时间低于15毫秒的情况下实时分类两种代表性多事件序列,但端到端推理延迟始终超过三个周期,范围从50到90毫秒。这些结果突显了算法能力与保护级部署之间的关键差距,指出了进一步优化和硬件加速的必要性。研究结果建立了可重复的亚周期异常检测基准,并为将机器学习方法从研究原型过渡到实际保护应用提供了指导。

详情
AI中文摘要

本文介绍了一种延迟感知的基准测试框架,用于评估在电力系统异常检测中使用高保真时域信号生成的深度学习模型。通过系统评估从物理故障和网络攻击中生成的流数据集,评估了八种神经网络架构,包括MLP到Transformer。所有模型都能在亚周期响应时间低于15毫秒的情况下实时分类两种代表性多事件序列,但端到端推理延迟始终超过三个周期,范围从50到90毫秒。这些结果突显了算法能力与保护级部署之间的关键差距,指出了进一步优化和硬件加速的必要性。研究结果建立了可重复的亚周期异常检测基准,并为将机器学习方法从研究原型过渡到实际保护应用提供了指导。

英文摘要

This work introduces a latency-aware benchmarking framework for evaluating deep learning models in power system anomaly detection using high-fidelity, time-domain signals generated from an industry-grade electromagnetic transient simulator. Eight neural network architectures, ranging from MLPs to Transformers, were systematically evaluated on streaming datasets representing both physical faults and cyber-attacks in inverter-dominated networks. All models successfully classified two representative multi-event sequences in real time with sub-cycle response times below 15 ms. However, although classification decisions occurred within one cycle, the end-to-end inference latency consistently exceeded three cycles, ranging from 50 to 90 ms. These results highlight a critical gap between algorithmic capability and protection-grade deployment, pointing to the need for further optimization and hardware acceleration. The findings establish a reproducible benchmark for sub-cycle anomaly detection and provide guidance for transitioning machine learning methods from research prototypes to real-world protection applications.

2605.17255 2026-05-19 cs.AI math.OC 版本更新

CAM-Bench: A Benchmark for Computational and Applied Mathematics in Lean

CAM-Bench: 一个用于Lean中的计算与应用数学的基准测试

Wentao Long, Yunfei Zhang, Chenyi Li, Li Zhou, Chumin Sun, Zaiwen Wen

发表机构 * Fudan University(复旦大学) Qingdao University(青岛大学) Peking University(北京大学) Huawei Technologies Ltd.(华为技术有限公司)

AI总结 本文提出CAM-Bench,一个包含1000个Lean证明目标的基准测试,涵盖优化、数值线性代数和数值分析等领域,旨在补充现有形式化数学基准测试,通过针对依赖教科书概念和基本定理的应用数学问题进行评估。

Comments Preprint. 44 pages, 7 figures

详情
AI中文摘要

形式化定理证明基准测试能够机械地验证大语言模型中的数学推理能力。然而,现有基准测试主要集中在竞赛式问题和代数领域,导致计算与应用数学代表性不足。我们引入CAM-Bench,一个包含1000个Lean 4形式化证明目标的基准测试,涵盖优化、数值线性代数和数值分析等领域。这些问题改编自教科书练习,通常依赖于局部引入的定义、符号、算法和基本结果。为了构建CAM-Bench,我们开发了一个依赖恢复流水线,用于重建每个问题所需的本地教科书上下文。然后,它将每个问题标准化为一个独立的非正式定理,并将其翻译成Lean目标。我们通过Lean编译和语义审查验证最终的形式化问题,检查形式正确性和与原始练习的语义一致性。对于每个问题,我们发布了原始练习、恢复的上下文、标准化的非正式定理和最终的Lean目标。CAM-Bench通过针对依赖教科书概念和基本定理的应用数学问题补充现有形式化数学基准测试,其中许多问题无法直接作为标准Mathlib4引理使用。我们评估了广泛使用的大型语言模型和形式化代理在CAM-Bench上的表现,并分析了在跟踪局部假设、应用基本结果、分解证明和维护长距离控制时的常见失败模式。

英文摘要

Formal theorem-proving benchmarks enable mechanically verifiable evaluation of mathematical reasoning in large language models. However, existing benchmarks mainly focus on Olympiad-style problems and algebraic domains, leaving computational and applied mathematics underrepresented. We introduce CAM-Bench, a Lean 4 theorem-proving benchmark of 1,000 Lean proof targets in computational and applied mathematics, with coverage spanning optimization, numerical linear algebra, and numerical analysis. These problems are adapted from textbook exercises and often depend on locally introduced definitions, notation, algorithms, and elementary results. To construct CAM-Bench, we develop a dependency-recovery pipeline that reconstructs the local textbook context needed to state each problem faithfully. It then normalizes each problem into a standalone informal theorem and translates it into a Lean target. We validate the resulting formal problems through Lean compilation and semantic review, checking both formal correctness and semantic alignment with the original exercises. For each problem, we release the raw exercise, recovered context, normalized informal theorem, and final Lean target. CAM-Bench complements existing formal mathematics benchmarks by targeting applied mathematics problems that rely on textbook concepts and elementary theorems, many of which are not directly available as standard Mathlib4 lemmas. We evaluate widely used large language models and formalization agents on CAM-Bench, and analyze common failure modes in tracking local assumptions, applying elementary results, decomposing proofs, and maintaining long-horizon control in Lean.

2605.17247 2026-05-19 cs.AI 版本更新

Towards Robust Argumentative Essay Understanding via TIDE: An Interactive Framework with Trial and Debate

通过TIDE实现稳健的论辩论文理解:一个具有试错和辩论机制的交互框架

Zheqin Yin, Yupei Ren, Yadong Zhang, Yujiang Lu, Man Lan

发表机构 * School of Computer Science and Technology, East China Normal University(东华师范大学计算机科学与技术学院) Shanghai Institute of Artificial Intelligence for Education, East China Normal University(上海教育人工智能研究院) Lab of Artificial Intelligence for Education, East China Normal University(教育人工智能实验室)

AI总结 本文提出TIDE框架,通过整合试错和辩论机制,改进基于标准的提示优化,以提高论辩任务的理解和评估能力,实验表明其在自动作文评分、论点组件检测和论点关系识别任务中均提升了性能。

详情
AI中文摘要

论辩论文是评估批判性思维和推理能力的重要媒介,但目前关于通过提示准确理解和评估此类文本的研究有限。在本工作中,我们提出了TIDE,一种新的框架,旨在通过整合TrIal和DEbate机制,改进基于标准的提示优化,以提高与论辩相关的任务。我们的方法通过减轻噪声训练数据的影响并增强优化稳定性,解决了基于标准的提示优化的关键限制。我们评估了TIDE在三个核心任务上的表现:自动作文评分、论点组件检测和论点关系识别。结果表明,我们的框架在各项任务中均提升了性能。这些发现凸显了结合基于提示的方法在高级论辩理解中的潜力。

英文摘要

Argumentative essays serve as a vital medium for assessing critical thinking and reasoning skills, yet there is limited works on accurately understanding and evaluating such texts via prompt. In this work, we propose TIDE, a novel framework designed to improve criteria-based prompt optimization for argument-related tasks by integrating TrIal and DEbate mechanism. Our method addresses key limitations of criteria-based prompt optimizing by mitigating the influence of noisy training data and enhancing optimization stability. We evaluate TIDE on three core tasks: Automated Essay Scoring, Argument Component Detection, and Argument Relation Identification. Results demonstrate that our framework improves performance across tasks. These findings underscore the potential of combining prompt-based methods for advanced argument understanding.

2605.17246 2026-05-19 cs.LG cs.AI 版本更新

Fidelity Probes for Specification--Code Alignment

规范-代码对齐的保真度探针

Ferhat Erata, Hao Zhou, Luke Huan

发表机构 * AWS Agentic AI(AWS智能AI)

AI总结 本文提出保真度探针,通过从参考artifact生成的自然语言问题和代码派生的地面真实答案,从候选规范中回答问题。保真度是同意探针的比例,分解为矛盾率和覆盖缺口率,驱动针对性的规范编辑以达到收敛。在15个程序、约12000行COBOL基准(AWS CardDemo)上,通过八次迭代将冻结测试规范的保真度从0.63提升到0.94,其中平台位置由仅需四次速率数据的两状态马尔可夫固定点$F^\dagger$预测。探针来自LLM读取代码或静态分析管道对其控制流、数据流和系统依赖图的处理,具有可调混合比例。一个带有冻结留出集的探针重采样协议提供了Hoeffding有界的过拟合判别;我们测量的训练/测试差距保持在该包络线下一个数量级。三种基于图的混合提升了保真度16到30分;跨分布评估显示LLM和符号通道在经验上互补。在五个独立LLM家族(Anthropic、DeepSeek、Google、Alibaba、OpenAI)上进行的跨家族生成器扫描确认了收敛行为不依赖于任何单一模型家族:五个非Claude生成器中有三个产生了与马尔可夫固定点预测一致的轨迹,而冻结测试协议主动否定了两个探针分布随迭代变化的生成器。该方法适用于任何应描述相同行为的artifact对。

Comments 29 pages, 14 figures, 11 tables

详情
AI中文摘要

我们引入了保真度探针:从参考artifact生成的自然语言问题,其代码派生的地面真实答案由候选规范回答。保真度是同意探针的比例,分解为矛盾率和覆盖缺口率,驱动针对性的规范编辑以达到收敛。在15个程序、约12000行COBOL基准(AWS CardDemo)上,我们通过八次迭代将冻结测试规范的保真度从0.63提升到0.94,其中平台位置由仅需四次速率数据的两状态马尔可夫固定点$F^\dagger$预测。探针来自LLM读取代码或静态分析管道对其控制流、数据流和系统依赖图的处理,具有可调混合比例。一个带有冻结留出集的探针重采样协议提供了Hoeffding有界的过拟合判别;我们测量的训练/测试差距保持在该包络线下一个数量级。三种基于图的混合提升了保真度16到30分;跨分布评估显示LLM和符号通道在经验上互补。在五个独立LLM家族(Anthropic、DeepSeek、Google、Alibaba、OpenAI)上进行的跨家族生成器扫描确认了收敛行为不依赖于任何单一模型家族:五个非Claude生成器中有三个产生了与马尔可夫固定点预测一致的轨迹,而冻结测试协议主动否定了两个探针分布随迭代变化的生成器。该方法适用于任何应描述相同行为的artifact对。

英文摘要

We introduce fidelity probes: natural-language questions generated from a reference artifact with code-derived ground-truth answers, answered from a candidate specification. The fraction of agreeing probes, which we call the fidelity, decomposes into contradiction and coverage-gap rates that drive targeted spec edits to convergence. On a 15-program, roughly 12k-line COBOL benchmark (AWS CardDemo), we raise frozen-test specification fidelity from 0.63 to 0.94 over eight iterations, with the plateau location predicted by a two-state Markov fixed point $F^\dagger$ from just four iterations of rate data. Probes come from an LLM reading the code or from a static-analysis pipeline over its control-flow, data-flow, and system-dependence graphs, with a tunable mixture. A probe-resampling protocol with a frozen held-out set gives a Hoeffding-bounded overfitting discriminant; our measured train/test gap stays more than an order of magnitude below this envelope. Three graph-grounded mixtures lift fidelity by +16 to +30 points; cross-distribution evaluation shows the LLM and symbolic channels are empirically complementary. A cross-family generator sweep on five independent LLM lineages (Anthropic, DeepSeek, Google, Alibaba, OpenAI) confirms the convergence behaviour is not tied to any single model family: three of five non-Claude generators produce trajectories consistent with the Markov fixed-point prediction, and the frozen-test protocol actively falsifies the two generators whose probe distributions drift across iterations. The method applies to any pair of artifacts that are supposed to describe the same behaviour.

2605.17244 2026-05-19 cs.LG cs.AI 版本更新

Drift Flow Matching

漂移流匹配

Chenrui Ma, Xi Xiao, Lin Zhao, Tianyang Wang, Ferdinando Fioretto, Yanning Shen

发表机构 * University of California, Irvine(加州大学伊万斯堡分校) University of Virginia(弗吉尼亚大学) University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Northeastern University(东北大学)

AI总结 本文提出Drift Flow Matching框架,结合漂移生成模型与基于流的迭代生成方法,实现高效生成与多步细化,提升生成质量与效率适应性。

详情
AI中文摘要

迭代生成模型如流匹配和扩散模型在测试时表现出强大的扩展性,额外的推理计算可以提高生成质量。相比之下,漂移模型提供高效的单步生成,但其直接生成范式限制了灵活性。在本文中,我们提出Drift Flow Matching (DFM),一个连接漂移生成建模与基于流的迭代生成的框架。DFM保留了直接传输映射的效率,同时在需要时通过多个推理步骤细化生成。这填补了单步漂移模型与多步流匹配方法之间的空白,并提供了一种新的生成范式,可以适应不同的质量-效率需求。在不同任务和数据集上的广泛实验验证了所提框架的有效性和通用性。

英文摘要

Iterative generative models such as Flow Matching and Diffusion models have demonstrated strong test-time scaling behavior, where additional inference computation can improve generation quality. In contrast, Drift Models offer efficient one-step generation, but their direct generation paradigm limits such flexibility. In this work, we propose Drift Flow Matching (DFM), a framework that connects drifting generative modeling with flow-based iterative generation. DFM preserves the efficiency of direct transport maps while enabling generation to be refined through multiple inference steps when desired. This bridges the gap between one-step Drift Models and multi-step Flow Matching methods, and provides a novel generative paradigm that can adapt sampling computation to different quality--efficiency requirements. Extensive experiments across different tasks and datasets demonstrate the effectiveness and generality of the proposed framework.

2605.17236 2026-05-19 cs.CV cs.AI 版本更新

Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability

对视觉变换器在自动化宫颈癌分类中的系统评估:优化、统计验证与临床可解释性

Nisreen Albzour, Sarah S. Lam

发表机构 * School of Systems Science and Industrial Engineering, Binghamton University(宾夕法尼亚大学系统科学与工业工程学院)

AI总结 本文研究了视觉变换器在自动化宫颈癌分类中的应用,通过优化和统计验证,展示了其在临床可解释性方面的优势。

详情
AI中文摘要

手动宫颈癌筛查的巴氏涂片分析受到观察者间差异、时间限制和专家资源有限的限制。尽管卷积神经网络(CNNs)已自动化了宫颈细胞分类,但它们在建模长距离空间依赖性和缺乏临床可解释性方面仍有局限。在本研究中,视觉变换器(ViT)架构被系统优化以提高自动化宫颈癌筛查的性能,从而提高了可解释性。通过赫尔勒夫数据集(917张图像:242张正常,675张异常)对ViT-Tiny进行优化,这是一种轻量级视觉变换器架构,旨在减少计算复杂性。通过全面评估增强策略、类别加权和超参数,最佳配置实现了94.9%-95.2%的交叉验证准确率,其中随机水平翻转和类别加权(0.7 x 1.3)被确定为最有效的因素。梯度加权类激活映射(Grad-CAM)分析证实,模型注意力对应于临床相关形态学特征,包括核区域、细胞边界和染色质纹理,这与细胞病理学标准一致。这些发现表明,视觉变换器可以提供准确且可解释的决策支持,以用于宫颈癌筛查,这满足了医疗AI部署所需的临床性能和透明性要求。

英文摘要

Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized to enhance automated cervical cancer screening, which resulted in improved interpretability. The Herlev dataset (917 images: 242 normal, 675 abnormal) was utilized to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, in which random horizontal flipping and class weighting (0.7 x 1.3) were identified as most effective. Gradient-weighted Class Activation Mapping (Grad-CAM) analysis confirmed that model attention corresponded to clinically relevant morphological features, which include nuclear regions, cell boundaries, and chromatin texture, which align with cytopathological criteria. These findings indicate that Vision Transformers can deliver accurate and interpretable decision support for cervical cancer screening, which fulfills both clinical performance and transparency requirements essential for medical AI deployment.

2605.17214 2026-05-19 cs.AI cs.CL cs.CV 版本更新

ChemVA: Advancing Large Language Models on Chemical Reaction Diagrams Understanding

ChemVA:推动大型语言模型在化学反应图示理解上的进步

Mingyang Rao, Kehua Feng, Zhihui Zhu, Jiangzhen Fu, Hao Yu, Keyan Ding, Huajun Chen

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) ZJU-Hangzhou Global Scientific and Technological Innovation Center, Zhejiang University(浙江大学杭州全球科学与技术创新中心) Department of Chemistry, Fudan University(复旦大学化学系)

AI总结 本文针对现有系统在理解化学反应图示时存在的视觉缺陷和语义断开问题,提出ChemVA框架,通过视觉锚机制和语义对齐方法提升大型语言模型在化学推理中的性能。

详情
AI中文摘要

尽管大型语言模型(LLMs)已革新了科学文本处理,但在解释化学反应图示方面存在显著的能力差距。我们识别出两个限制当前系统的根本瓶颈:视觉缺陷,即通用视觉编码器难以解析密集分子图的严格拓扑连接性;以及语义断开,即标准线性字符串,如SMILES,无法有效激活模型的潜在化学推理能力。为弥合这些差距,我们提出了化学视觉激活(ChemVA)框架,该框架采用视觉锚机制通过混合粒度检测来定位功能团,随后采用语义对齐方法将视觉特征转换为实体名称,以最大限度地激活LLMs中的知识。我们在OCRD-Bench数据集上评估了我们的方法,该数据集包含密集的视觉-语义上下文和全面的反应覆盖,以评估从识别到推理的整个谱系。在OCRD-Bench上的大量实验表明,ChemVA实现了92.0%的结构识别准确率。通过弥合视觉和语义瓶颈,我们的框架在9种不同的LLMs上实现了约20个百分点的性能提升,使开放式权重模型能够与专有SOTA系统在复杂的化学推理任务中竞争。

英文摘要

While Large Language Models (LLMs) have revolutionized scientific text processing, they exhibit a significant capability gap when interpreting chemical reaction diagrams. We identify two fundamental bottlenecks restricting current systems: a Visual Deficit, where generic vision encoders struggle to resolve the strict topological connectivity of dense molecular graphs, and a Semantic Disconnect, where standard linear strings, such as SMILES, fail to effectively activate the model's latent chemical reasoning. To bridge these gaps, we propose the Chemical Visual Activation (ChemVA) framework, which employs a Visual Anchor mechanism to ground functional groups via hybrid-granularity detection, followed by a semantic alignment approach that translates visual features into entity names to maximize knowledge activation in LLMs. We evaluate our approach on OCRD-Bench, a newly constructed dataset featuring dense visual-semantic contexts and comprehensive reaction coverage to evaluate the full spectrum from recognition to reasoning. Extensive experiments on OCRD-Bench demonstrate that ChemVA achieves 92.0% structural recognition accuracy. By bridging visual and semantic bottlenecks, our framework delivers a consistent performance gain of approximately 20 percentage points across 9 diverse LLMs, enabling open-weight models to rival proprietary SOTA systems in complex chemical reasoning tasks.

2605.17204 2026-05-19 cs.RO cs.AI 版本更新

Event-Grounded Sparse Autoencoders for Vision-Language-Action Policies

基于事件的稀疏自编码器用于视觉-语言-动作策略

Xinchen Jin, Aditya Chatterjee, Pranav Kumar, Rohan Paleja

发表机构 * Department of Computer Science, Purdue University West Lafayette, IN 47907(计算机科学系,普渡大学西拉法叶分校,印第安纳州,47907)

AI总结 本文提出了一种基于事件的稀疏自编码器(SAE)分析方法,用于视觉-语言-动作(VLA)策略的可解释性研究,通过行为事件锚定SAE特征分析,提升了对闭合回路行为的因果影响和可解释性。

详情
AI中文摘要

视觉-语言-动作(VLA)策略将语言和视觉输入转化为机器人动作,其隐藏表示直接塑造闭环行为。然而,语言和视觉-语言模型中的机制可解释性工具无法直接转移到VLA中:输出是机器人动作而非人类可读的标记,干预只能通过昂贵的闭环回放测试。我们提出了一种基于事件的可解释性流程,将SAE特征分析锚定在行为事件而非文本上下文中。通过在每个任务中使用视觉、状态和时间线索对末端执行器关键帧进行聚类,将SAE特征与行为显著事件联系起来,并通过可选的VLM注释与语义上下文联系起来。据我们所知,我们的流程是首个将基于SAE的VLA分析锚定在闭环行为事件上的方法之一。在两个仿真架构和一个真实机器人研究中,基于事件的排名在OpenVLA上产生了最强的因果效应,并转移到了π_{0.5}的连续动作块中。SAE是一种稀疏但不完美的干预基础:实用性因架构和干预位置而异,激进干预揭示了安全性和可解释性的限制。总体而言,基于事件的SAE分析成为行为锚定VLA可解释性的一种实用起点,推动了未来关于SAE特征的研究,包括超越动作对齐坐标的更细致分析、更精细的闭环评估以及高风险VLA部署中的安全干预。代码可在https://github.com/xc-j/Event-SAE上获得。

英文摘要

Vision-Language-Action (VLA) policies translate language and visual inputs into robot actions, where their hidden representations directly shape closed-loop behavior. However, mechanistic interpretability tools from language and vision-language models do not transfer cleanly to VLAs: outputs are robot actions rather than human-readable tokens, and interventions can only be tested via expensive closed-loop rollouts. We propose an event-grounded interpretability pipeline that anchors SAE feature analysis to behavioral events rather than text contexts. End-effector keyframes are clustered within each task using visual, state, and temporal cues, linking SAE features to behaviorally salient events and, via optional VLM annotations, to semantic context. To our knowledge, our pipeline is among the first to ground SAE-based VLA analysis in closed-loop behavioral events. Across two simulation architectures and a real-robot study, event-grounded ranking yields the strongest causal effects on OpenVLA and transfers to the continuous action chunks of $π_{0.5}$. SAE is a sparse but imperfect intervention basis: usability varies with architecture and intervention site, and aggressive intervention reveals safety and interpretability limits. Overall, event-grounded SAE analysis emerges as a practical starting point for behavior-anchored VLA interpretability, motivating future work on SAE features beyond action-aligned coordinates, finer-grained closed-loop evaluation, and safe interventions for high-stakes VLA deployments. Code is available at \url{https://github.com/xc-j/Event-SAE}.

2605.17187 2026-05-19 cs.CL cs.AI cs.CY 版本更新

PluRule: A Benchmark for Moderating Pluralistic Communities on Social Media

PluRule:一种用于社交媒体上多元社区调节的基准测试

Zoher Kachwala, Bao Tran Truong, Rasika Muralidharan, Haewoon Kwak, Jisun An, Filippo Menczer

发表机构 * Observatory on Social Media, Indiana University, USA(社交媒体观察站,印第安纳大学,美国) Center Synergy of Systems, TUD Dresden University of Technology, Germany(系统协同中心,德累斯顿技术大学,德国)

AI总结 研究探讨了AI模型在调节社交媒体上多元社区中的挑战,提出PluRule基准测试以检测13371条规则违规情况,发现即使使用最先进的视觉语言模型,也难以有效识别违规行为。

Comments Accepted to ACL 2026 Main Conference

详情
AI中文摘要

社交媒体正在向多元主义转变--由社区自行定义规范的平台。在某一社区中违反规则的行为可能在另一社区中是完全可接受的。AI模型能否帮助调节此类多元社区?我们将此任务形式化为多选问题,模仿人类调节员在现实世界中的操作方式:给定一条评论及其上下文,识别违反了哪一条具体规则(如果有的话)。我们引入了PluRule,一个多模态、多语言的基准测试,用于检测1989个Reddit社区中跨越2885条规则的13371条违规情况。使用此基准测试,我们发现最先进的视觉语言模型在识别违规方面表现显著不佳:即使GPT-5.2具有高水平推理能力,也仅略优于基础基线。我们还发现,更大的模型和更多的上下文提供微小收益,而普遍规则如礼貌和自我推广更容易检测。我们的结果表明,社交媒体上多元社区的调节是语言模型的基本挑战。我们的代码和基准测试已公开发布。

英文摘要

Social media are shifting towards pluralism -- community-governed platforms where groups define their own norms. What violates rules in one community may be perfectly acceptable in another. Can AI models help moderate such pluralistic communities? We formalize the task as a multiple-choice problem, mirroring how human moderators operate in the real world: given a comment and its surrounding context, identify which specific rule, if any, is violated. We introduce PluRule, a multimodal, multilingual benchmark for detecting 13,371 rule violations across 1,989 Reddit communities spanning 2,885 rules in 9 languages. Using this benchmark, we show that state-of-the-art vision-language models struggle significantly: even GPT-5.2 with high reasoning performs only slightly better than a trivial baseline. We also find that bigger models and increased context provide marginal gains, and universal rules like civility and self-promotion are easier to detect. Our results show that moderation of pluralistic communities on social media is a fundamental challenge for language models. Our code and benchmark are publicly available.

2605.17181 2026-05-19 cs.SD cs.AI 版本更新

MusicSynth: An Automated Pipeline for Generating Violin Fingerboard Animations from Sheet Music Using Optical Music Recognition

MusicSynth: 一种用于从乐谱生成小提琴指板动画的自动化流水线

Abhimanyu Kaushik

发表机构 * Independent Researcher(独立研究者) Trophy Club, Texas(德克萨斯奖杯俱乐部)

AI总结 该研究提出了一种自动化流程,通过光学音乐识别技术将乐谱转换为小提琴指板动画,其核心方法是整合三个开源工具,并通过自定义的查找表将音乐音符映射到小提琴的弦和指位。

Comments 12 pages, 4 figures

详情
AI中文摘要

学习小提琴比看起来更困难。与钢琴键或吉他品相比,小提琴琴颈上没有任何标记,因此初学者无法通过观察来确定每个手指应放置的位置。MusicSynth是一种开源的网页工具,旨在解决这个问题:用户上传任何小提琴乐谱的照片(或数字乐谱文件),系统会自动生成一个视频,显示带有每个音符高亮的小提琴指板——无需安装软件,也不需要手动输入音符。该系统将三个现有的开源工具连接成一个流水线:光学音乐识别(OMR)库从上传的图像中读取音符,MusicXML解析器从数字乐谱中提取时间信息,视频渲染器逐帧绘制指板。唯一从头开始构建的部分是将每个音乐音符映射到小提琴弦和指位的查找表。在110个公共领域小提琴乐谱上测试,MusicSynth在清洁打印乐谱中正确识别了91.2%的音符,并在获得数字乐谱文件时正确分配指位99.1%的时间。据作者所知,目前没有其他免费工具可以自动将乐谱图像转换为动画小提琴指板教程。

英文摘要

Learning the violin is harder than it looks. Unlike piano keys or guitar frets, the violin neck has no markings at all, so a beginner cannot tell by looking where to place each finger. MusicSynth is an open-source web tool that tries to fix that: user uploads a photo of any violin sheet music (or a digital score file), and the system automatically produces a video showing a violin fingerboard with each note highlighted at the right moment -- no software to install, no manual note entry required. The system connects three existing open-source tools into one pipeline: an optical music recognition (OMR) library reads the notes from the uploaded image, a MusicXML parser extracts timing information from digital scores, and a video renderer draws the fingerboard frame by frame. The only part built from scratch is the lookup table that maps each musical note to a string and finger position on the violin. Tested across 110 public-domain violin scores, MusicSynth correctly identified 91.2\,\% of notes in clean printed music and assigned the right finger position 99.1\,\% of the time when given a digital score file. To the author's knowledge, no freely available tool currently turns a sheet music image into an animated violin fingerboard tutorial automatically and in a single browser-based step.

2605.17176 2026-05-19 cs.AI 版本更新

CAREBench: Evaluating LLMs' Emotion Understanding by Assessing Cognitive Appraisal Reasoning

CAREBench: 通过评估认知评价推理来评估LLMs的情感理解

Zhaoyue Sun, Hainiu Xu, Andero Uusberg, James J. Gross, Petr Slovak, Yulan He

发表机构 * Department of Informatics(信息学院) King’s College London(伦敦国王学院) Institute of Psychology(心理学研究所) University of Tartu(塔尔图大学) Department of Psychology(心理学系) Stanford University(斯坦福大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 本文提出CAREBench,首个全面标注认知评价推理、评价评分和多标签情感标注的基准,通过系统实验发现更强模型在某些任务上匹配或超越人类,但在评价推理和积极情绪识别上表现不足,揭示了当前模型未能内部化捕捉人类主观异质性的机制。

Comments 27 pages,18 figures

详情
AI中文摘要

情感理解是LLMs有效与人类交互的核心能力,但现有评估方法依赖离散情绪标签预测,无法捕捉情绪生成的认知过程。基于评价理论,我们引入CAREBench,首个包含从第一和第三人称视角对现实叙述的完整推断链标注的基准,涵盖评价推理、评价评分和多标签情感标注。我们提出一个过程级评估框架,并在六个LLMs上围绕四个研究问题开展系统实验。我们发现,更强的模型在某些任务上匹配或超越人类观察者,但在评价推理和积极情绪识别上表现不足;跨步骤性能和对评价干预的敏感性在不同模型间表现出差异;当前模型尚未内部化捕捉人类主观异质性的机制。这些发现表明,下游情绪预测指标可能高估LLMs的真实情感理解,而CAREBench为更具有诊断信息的LLMs情感认知能力评估提供了基础。

英文摘要

Emotion understanding is a core capability for LLMs to interact effectively with humans, yet existing evaluation paradigms rely on discrete emotion label prediction and fail to capture the cognitive processes underlying emotion generation. Grounded in appraisal theory, we introduce CAREBench, the first benchmark with complete inferential chain annotations from both first- and third-person perspectives on real-world narratives, spanning appraisal reasoning, appraisal ratings, and multi-label emotion annotation. We propose a process-level evaluation framework and conduct systematic experiments across six LLMs organized around four research questions. We find that stronger models match or surpass human observers on certain tasks, yet fall short on appraisal reasoning and positive emotion recognition; performance across chain steps and sensitivity to appraisal interventions exhibit dissociations across models; and current models have not internalized the mechanisms needed to capture human subjective heterogeneity. These findings suggest that downstream emotion prediction metrics may overestimate LLMs' true emotion understanding, and CAREBench provides a foundation for more diagnostically informative evaluation of LLMs' affective cognitive capabilities.

2605.17174 2026-05-19 cs.SE cs.AI 版本更新

Beyond Execution: Static-Analysis Rewards and Hint-Conditioned Diffusion RL for Code Generation

超越执行:静态分析奖励与提示条件扩散强化学习用于代码生成

Shuyin Ouyang, Zhaozhi Qian, Faroq AL-Tam, Muhammad AL-Qurishi, Jie M. Zhang

发表机构 * King’s College London(伦敦国王学院) Elm Europe(Elm欧洲)

AI总结 本文研究了强化学习在扩散式代码生成中的应用,探讨了静态分析奖励和提示条件扩散强化学习在不同任务难度下的效果,发现静态检查在提升代码生成性能方面表现最佳。

详情
AI中文摘要

强化学习(RL)是将扩散语言模型(DLMs)对齐到功能正确性的重要范式,在代码生成中。然而,这些模型在复杂任务上常遇到一个“能力悬崖”,即基于执行的语义奖励变得过低,无法提供有效的学习信号。在本文中,我们对扩散式代码生成的RL后训练进行了系统性的实证研究,从三个轴线进行:奖励设计、提示条件采样和任务难度。我们调查了执行免费奖励作为传统单元测试执行替代品的有效性,训练时提示条件扩散采样在缓解探索瓶颈中的作用,以及这些设计选择在不同难度任务中的影响。在HumanEval、MBPP和LiveCodeBench上,我们发现静态检查是我们在设置中最强的独立执行免费奖励,特别是在HumanEval上将DiffuCoder从53.9提升到67.1,在LiveCodeBench上从14.9提升到15.5,同时减少rollout时间9.4%。我们进一步发现,中等程度的AST基于提示在更难的基准上最有用,而最佳奖励设计强烈依赖于任务难度:相似性基于奖励在更简单的子集上更有效,而静态检查在更难的子集上更可靠,其中执行奖励较低。这些发现表明,在我们评估的代码生成设置中,奖励设计和训练指导显著影响扩散RL性能。

英文摘要

Reinforcement Learning (RL) is an important paradigm for aligning Diffusion Language Models (DLMs) toward functional correctness in code generation. However, these models often encounter a ``capability cliff'' on complex tasks, where execution-based semantic rewards become too low to provide a viable learning signal. In this paper, we present a systematic empirical study of RL post-training for diffusion-based code generation along three axes: reward design, hint-conditioned sampling, and task difficulty. We investigate the effectiveness of execution-free rewards as alternatives to traditional unit-test execution, the role of training-time hint-conditioned diffusion sampling in mitigating exploration bottlenecks, and the impact of these design choices varies across tasks with different difficulty levels. Across HumanEval, MBPP, and LiveCodeBench, we find that static checking is the strongest overall standalone execution-free reward in our setting, especially improving DiffuCoder from 53.9 to 67.1 on HumanEval and from 14.9 to 15.5 on LiveCodeBench while reducing rollout time by 9.4\%. We further find that moderate AST-based hinting is most useful on harder benchmarks, while the best reward design depends strongly on task difficulty: similarity-based rewards are more effective on easier subsets, whereas static checking is more reliable on harder subsets where execution rewards are low. These findings suggest that reward design and training guidance substantially affect diffusion RL performance in our evaluated code-generation setting.

2605.17173 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Why Do Safety Guardrails Degrade Across Languages?

为何安全护栏在不同语言中会退化?

Max Zhang, Ameen Patel, Sang T. Truong, Sanmi Koyejo

发表机构 * Stanford University(斯坦福大学)

AI总结 该研究通过引入多组项目反应理论框架,揭示了语言无关的安全鲁棒性、提示内在难度、全球语言处理难度和提示特定的跨语言安全差距等因素,发现安全退化并非仅在低资源语言中发生,且文化与概念不匹配也会影响安全性能。

详情
AI中文摘要

大型语言模型在非英语语言中表现出安全退化。标准评估依赖于禁令成功率(JSR),但将多个安全驾驶因素合并为一个,掩盖了安全失败的具体原因。我们引入了一个潜在变量模型,即多组项目反应理论(IRT)框架,将安全驾驶因素如语言无关的安全鲁棒性(θ)、内在提示难度(β)、全球语言处理难度(γ)和提示特定的跨语言安全差距(τ)分离。使用MultiJail数据集,我们评估了61种模型配置在5个闭源模型家族和10种资源各异的语言中的安全鲁棒性,汇总了190万行数据集。探索性因子分析显示安全主要是一维的:模型拒绝不同危害类型主要通过共享机制。与预期趋势相反,22种模型配置在英语中比在低资源语言中更易受攻击。低资源语言产生更多不确定响应(高熵)比高资源语言。此外,高τ提示集中在如盗窃和武器等物理危害类别和低资源语言中,趋势通过跨数据集泛化得到验证。虽然全球翻译质量与τ相关性低,但严重翻译错误驱动高偏置异常值,通过本地说话者验证。文化与概念基础不匹配也会影响τ。在预测验证中,IRT框架实现了AUC=0.940,优于更简单的基线,在预测不安全提示的安全拒绝方面表现更优。我们的框架揭示了概念-语言脆弱性,这些指标汇总后被掩盖,使公平的跨语言安全评估和目标改进数据集建设成为可能。

英文摘要

Large language models exhibit safety degradation in non-English languages. Standard evaluation relies on Jailbreak Success Rate (JSR), which confounds several safety-driving factors into one, obscuring the specific cause(s) of safety failure. We introduce a latent variable model, a Multi-Group Item Response Theory (IRT) framework, that decouples safety-driving factors such as language-agnostic safety robustness ($θ$), intrinsic prompt hardness ($β$), global language processing difficulty ($γ$), and a prompt-specific cross-lingual safety gap ($τ$). Using the MultiJail dataset, we evaluate the safety robustness of 61 model configurations across 5 closed-model families and 10 languages of varying resource, aggregating a dataset of 1.9 million rows. Exploratory Factor Analysis shows safety is primarily unidimensional: models refuse different harm types mainly through a shared mechanism. Contrary to the expected trend that safety degrades largely in low-resource languages, 22 model configurations are more vulnerable in English than in low-resource languages. Low-resource languages produce more uncertain responses (high entropy) than high-resource languages. Also, high-$τ$ prompts cluster in physical harm categories like Theft and Weapons and lower-resource languages, trends validated through cross-dataset generalization. While global translation quality shows low correlation with $τ$, severe mistranslations drive high-bias outliers, as validated by native speakers. Cultural and conceptual grounding mismatches also contribute to $τ$. In predictive validation, the IRT framework achieves $\mathrm{AUC} = 0.940$, outperforming simpler baselines in predicting safe refusal of unsafe prompts. Our framework reveals concept-language vulnerabilities that aggregate metrics obscure, enabling fairer cross-lingual safety evaluation and targeted improvements in dataset construction.

2605.17172 2026-05-19 cs.LG cs.AI cs.CL 版本更新

OpenJarvis: Personal AI, On Personal Devices

OpenJarvis: 个人AI,本地设备上

Jon Saad-Falcon, Avanika Narayan, Robby Manihani, Tanvir Bhathal, Herumb Shandilya, Hakki Orhun Akengin, Gabriel Bo, Andrew Park, Matthew Hart, Caia Costello, Chuan Li, Christopher Ré, Azalia Mirhoseini

发表机构 * OpenClaw Hermes Agent PinchBench GAIA

AI总结 本文提出OpenJarvis,一种分解的个人AI堆栈,通过在本地设备上优化五个基本组件(智能、引擎、代理、工具与记忆、学习)来缩小本地与云端之间的性能差距,同时保持本地模型的特性。

Comments Code: https://github.com/openjarvis/openjarvis Website: https://open-jarvis.github.io/OpenJarvis/

详情
AI中文摘要

个人AI堆栈,如OpenClaw和Hermes Agent,正在成为日常工作的核心,但它们几乎将每一个查询(通常涉及敏感的本地数据)都路由到云托管的前沿模型。用现有的堆栈中替换前沿模型为本地模型并不奏效:将Claude Opus 4.6换成Qwen3.5-9B,在个人AI任务如PinchBench和GAIA上会降低25-39个百分点的准确性。现有堆栈围绕特定的云模型捆绑代理提示、工具描述、内存配置和运行时设置。只有提示可以进行调优,而最先进的提示优化器只能自行关闭5个百分点的本地-云差距。这促使了分解的个人AI堆栈:一种能够暴露个体原语,可以单独或联合优化以缩小本地-云差距的堆栈。我们提出了OpenJarvis,一种将个人AI系统表示为五种原语的类型规范的架构:智能、引擎、代理、工具与记忆、学习。每个原语都是独立可编辑的字段,使堆栈能够端到端优化,并且可以针对准确性、成本和延迟进行测量。为了在不牺牲本地模型特性的情况下缩小本地-云差距,OpenJarvis引入了LLM引导的规范搜索,这是一种本地-云协作,在搜索时前沿云模型提出规范的编辑,只有非退化的编辑被接受,最终的规范在推理时完全在设备上运行。通过LLM引导的规范搜索,设备上的规范在8个基准中的4个上匹配或超过了云准确性,并且平均在最佳云基线基础上减少了3.2个百分点。它们还减少了边际API成本约800倍,并将端到端延迟减少了4倍。

英文摘要

Personal AI stacks, like OpenClaw and Hermes Agent, are becoming central to daily work, yet they route nearly every query (often over sensitive local data) to cloud-hosted frontier models. Replacing frontier models with local models inside existing stacks does not work: swapping Claude Opus 4.6 for Qwen3.5-9B drops accuracy by 25-39 pp across personal AI tasks like PinchBench and GAIA. Existing stacks bundle agentic prompts, tool descriptions, memory configuration, and runtime settings around a specific cloud model. Only the prompts can be tuned, and state-of-the-art prompt optimizers close just 5 pp of the local-cloud gap on their own. This motivates a decomposed personal AI stack: one that exposes individual primitives which can be optimized individually or jointly to close the local-cloud gap. We present OpenJarvis, an architecture that represents a personal AI system as a typed spec over five primitives: Intelligence, Engine, Agents, Tools & Memory, and Learning. Each primitive is an independently editable field, making the stack end-to-end optimizable and measurable against accuracy, cost, and latency. Towards closing the local-cloud gap without surrendering local-model properties, OpenJarvis introduces LLM-guided spec search, a local-cloud collaboration in which frontier cloud models propose edits across the spec at search time, only non-regressing edits are accepted, and the resulting spec runs entirely on-device at inference time. With LLM-guided spec search, on-device specs match or exceed cloud accuracy on 4 of 8 benchmarks and land within 3.2 pp of the best cloud baseline on average. They also reduce marginal API cost by ~800x and end-to-end latency by 4x.

2605.17169 2026-05-19 cs.AI cs.CL cs.MA 版本更新

Responsible Agentic AI Requires Explicit Provenance

负责任的代理AI需要明确的来源

Jinwei Hu, Xinmiao Huang, Qisong He, Youcheng Sun, Yi Dong, Xiaowei Huang

发表机构 * School of Computer Science and Informatics, University of Liverpool(利兹大学计算机科学与信息学学院) Department of Computer Science, Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学计算机科学系)

AI总结 本文探讨了代理AI中责任归属的问题,指出需要在整个代理生命周期中明确来源,以使责任可计算和可执行,提出了通过因果归因函数和责任张量来形式化所需信息,并通过初步实验验证了在线估计和干预的可能性。

Comments Under Review

详情
AI中文摘要

代理AI正在迅速扩展到软件工程等多样化的真实世界领域,但公众信任并未同步增长。核心原因是责任,尽管被广泛讨论,但仍是一个主观且未强制执行的概念,因为目前没有任何代理框架能够产生所需的可量化、可追溯和可干预的来源,以在损害由多个方共同设计时分配责任。我们主张所缺失的不是更好的基准级评估,而是整个代理生命周期中明确的来源,这是使责任可计算和可操作的唯一可行基础。我们从四个方向推进这一议程:通过识别社会技术维度中的责任缺口,确立为何此类来源是结构上的必要条件;通过因果归因函数和责任张量形式化它必须编码的内容;讨论如何在四个生命周期层中使其可计算,通过初步实验表明来源可以在不可逆损害积累之前在线估计和干预;并通过具体代理事件考察谁应承担责任。明确来源不是可选的改进,而是负责任的代理AI的必要条件,其生态系统中的任何利益相关者都无法承担将其视为可选的态度。

英文摘要

Agentic AI is rapidly proliferating across diverse real-world domains such as software engineering, yet public trust has not kept pace. The central reason is that responsibility, despite being widely discussed, remains a subjective and unenforced concept, as no current agentic framework produces the quantifiable, traceable, and interventionable provenance needed to assign it when harm emerges from compositions no single party designed. We position that what is missing is not better benchmark-level evaluation but $\textbf{explicit provenance}$ across the full agentic lifecycle, which is the only viable basis for making responsibility computable and actionable. We advance this agenda along four axes: establishing $\textit{why}$ such provenance is a structural necessity by identifying responsibility gaps across sociotechnical dimensions, formalizing $\textit{what}$ it must encode through a causal attribution function and responsibility tensor, discussing $\textit{how}$ it can be made computable across four lifecycle layers with preliminary experiments showing that provenance is estimable and interveneable online before irreversible harm accumulates, and examining $\textit{who}$ bears responsibility through a concrete agentic incident. Explicit provenance is not a discretionary refinement but the necessary condition for responsible agentic AI, and no stakeholder across its ecosystem can afford to treat it as optional.

2605.17163 2026-05-19 cs.CR cs.AI 版本更新

STRIDE-AI: A Threat Modeling Framework for Generative AI Security Assessment

STRIDE-AI: 一种用于生成式AI安全评估的威胁建模框架

Tsafac Nkombong Regine Cyrille, Franziska Schwarz

发表机构 * SRH University of Applied Sciences Heidelberg School of Technology(海德堡应用科学大学技术与建筑学院)

AI总结 本文提出STRIDE-AI框架,旨在填补高层风险标准与技术漏洞分类之间的差距,通过六个阶段的评估生命周期和针对AI系统的威胁建模适应,降低AI系统面临攻击的成功率。

Comments 4 pages, 5 figures , 2 tables, CIIT 2026 23rd International Conference on Informatics and Information Technologies (CIIT)

详情
AI中文摘要

传统网络安全方法针对确定性系统,无法应对AI的概率性质,导致系统易受模型反向、数据污染和提示注入等攻击向量的攻击。最近的行业报告指出,大多数部署AI的组织缺乏专门的安全策略,对抗性攻击每年都在迅速增加。本文提出了STRIDE-AI框架,该框架连接了高层风险标准(NIST AI RMF)和技术漏洞分类(OWASP LLM Top 10)。该框架定义了一个六阶段的评估生命周期,引入了针对AI系统的经典STRIDE威胁建模适应,并通过专门设计的网页工具进行操作化。我们通过一个部署的LLM聊天机器人进行黑盒评估,验证了该方法的初步效果,成功将攻击成功率从80%降低到15%。

英文摘要

Traditional cybersecurity methodologies target deterministic systems and fail to address the probabilistic nature of AI, leaving systems vulnerable to attack vectors such as model inversion, data poisoning, and prompt injection. Recent industry reports indicate that a majority of organizations deploying AI lack a dedicated security strategy, with adversarial attacks increasing rapidly year-over-year. We present \textit{STRIDE-AI}, a framework that bridges the gap between high-level risk standards (NIST AI RMF) and technical vulnerability taxonomies (OWASP LLM Top 10). The framework defines a six-phase assessment lifecycle, introduces a threat modeling adaptation of classical STRIDE for AI systems, and is operationalized through a purpose-built web tool. We provide an initial validation of the approach through a black-box assessment of a deployed LLM chatbot, which successfully reduced the attack success rate from 80\% to 15\% in our sandbox case study.

2605.17162 2026-05-19 cs.AI cs.LG 版本更新

From Imitation to Interaction: Mastering Game of Schnapsen with Shallow Reinforcement Learning

从模仿到交互:利用浅层强化学习掌握斯纳普森游戏

Ján Klačan, Sizhong Zhang

发表机构 * Vrije Universiteit Amsterdam(弗里兹大学阿姆斯特丹)

AI总结 本文研究浅层神经网络代理是否能掌握纸牌游戏斯纳普森,并挑战使用蒙特卡洛采样和前瞻搜索的强搜索基线RdeepBot。通过逐步更复杂的实验设计,首先评估了基于回放数据训练的监督学习代理(MLPBot)以及通过异步蒙特卡洛更新和经验回放训练的强化学习代理(RLBot)。结果表明,监督模仿不足以击败强RdeepBot对手,而强化学习产生了更强的代理。在聚焦RdeepBot深度参数的设置中,最佳性能是在学习的价值函数与游戏过程中更深层次的前瞻搜索结合时实现的,使RLBot在最强的RdeepBot基线下实现了统计显著更高的胜率。在基于样本的设置中,收益更具条件性:最强性能出现在相对较低的训练num_samples参数下,而不是随着更强采样均匀增加。

Comments 17 pages, 8 figures

详情
AI中文摘要

本文研究浅层神经网络代理是否能掌握纸牌游戏斯纳普森,并挑战使用蒙特卡洛采样和前瞻搜索的强搜索基线RdeepBot。通过逐步更复杂的实验设计,首先评估了基于回放数据训练的监督学习代理(MLPBot)以及通过异步蒙特卡洛更新和经验回放训练的强化学习代理(RLBot)。结果表明,监督模仿不足以击败强RdeepBot对手,而强化学习产生了更强的代理。在聚焦RdeepBot深度参数的设置中,最佳性能是在学习的价值函数与游戏过程中更深层次的前瞻搜索结合时实现的,使RLBot在最强的RdeepBot基线下实现了统计显著更高的胜率。在基于样本的设置中,收益更具条件性:最强性能出现在相对较低的训练num_samples参数下,而不是随着更强采样均匀增加。

英文摘要

This paper investigates whether shallow neural network agents can master the card game Schnapsen and challenge a strong search-based baseline, RdeepBot, which uses Monte Carlo sampling and lookahead search. Guided by a progressively more complex experimental design, we first evaluate a supervised learning agent (MLPBot) trained on replay data and then a reinforcement learning agent (RLBot) with the same shallow architecture trained through asynchronous Monte Carlo updates and experience replay. The results show that supervised imitation does not generalize well enough to defeat strong RdeepBot opponents, whereas reinforcement learning produces substantially stronger agents. In the setting that focuses on the depth parameter of RdeepBot, the best performance is achieved when the learned value function is combined with deeper lookahead during gameplay, allowing RLBot to achieve statistically significant higher winning rates against the strongest evaluated RdeepBot baseline. In the sample-based setting, the gains are more conditional: the strongest performance appears at a relatively lower training num_samples parameter rather than increasing uniformly with stronger sampling.

2605.17160 2026-05-19 cs.LG cs.AI cs.CV 版本更新

When Bits Break Recourse: Counterfactual-Faithful Quantization

当比特失效时的反事实:反事实忠实量化

Chaymae Yahyati, Ismail Lamaakal, Khalid El Makkaoui, Ibrahim Ouahbi

发表机构 * Mohammed First University(穆罕默德第一大学)

AI总结 本文研究了量化过程中反事实可解释性的问题,提出反事实忠实量化方法,通过定义有效性下降和反事实可逆差距两个指标来评估量化对反事实可解释性的影响,并在多个数据集上验证了该方法在保持准确性的同时提升了反事实稳定性。

Comments 57 pages, 32 tables, 26 figures

详情
AI中文摘要

量化可以在低比特部署下保持预测准确性,但会无声地破坏算法可逆性:一个在量化前可以执行的操作在量化后可能失效,或变得显著更昂贵。我们通过有效性、成本和方向稳定性来形式化量化下的反事实敏感性,并引入两个指标:有效性下降(VD)和反事实可逆差距(CRG),以揭示准确性无法检测到的可逆失败。我们提出反事实忠实量化(CFQ),通过训练量化参数和混合精度位分配,在全局位预算下强制在教师可逆点上保持目标结果,以保留反事实行为。基于边界的分析给出了在受限制的量化扰动下可逆转移的充分条件。在Adult、德国信贷和COMPAS数据集上的实验表明,与准确性匹配的基线相比,CFQ在保持准确性的同时显著提高了VD和CRG。

英文摘要

Quantization can preserve predictive accuracy under low-bit deployment while silently breaking algorithmic recourse: an actionable change that flips a decision before quantization may fail after quantization, or become substantially more costly. We formalize counterfactual sensitivity under quantization through validity, cost, and direction stability, and introduce two metrics: Validity Drop (VD) and Counterfactual Recourse Gap (CRG) that reveal recourse failures invisible to accuracy. We propose Counterfactual-Faithful Quantization (CFQ), which trains quantizer parameters and mixed-precision bit allocation to preserve counterfactual behavior by enforcing the target outcome at teacher recourse points under a global bit budget. A margin-based analysis gives a sufficient condition for recourse transfer under bounded quantization perturbations. Experiments on Adult, German Credit, and COMPAS show that accuracy-matched baselines can significantly degrade recourse stability, while CFQ maintains accuracy and substantially improves VD and CRG across bit budgets.

2605.17159 2026-05-19 cs.AI cs.MA 版本更新

MADP: A Multi-Agent Pipeline for Sustainable Document Processing with Human-in-the-Loop

MADP:一种用于可持续文档处理的多智能体系统(带人机协作)

Diego Gosmar, Giovanni Zenezini

发表机构 * Tesisquare Head of AI(Tesisquare AI首席科学家) Polytechnic University of Turin, Department of Management and Production Engineering(都灵理工学院管理与生产工程系)

AI总结 本文提出MADP,一种结合深度学习分类和解析以及大语言模型提取的多智能体架构,通过选择性的人工验证保持准确性,实现了文档处理自动化,显著降低人工需求并减少环境影响。

Comments 18 pages, 5 figures

详情
AI中文摘要

文档处理自动化仍然是企业环境中关键的挑战,传统手动方法劳动强度大且容易出错。我们提出了MADP,一种多智能体架构,通过结合基于深度学习的分类和解析以及大语言模型提取,结合选择性的人工验证来解决企业环境中文档处理自动化的挑战。我们的系统集成了五个专门的智能体--Classificator、Splitter、Parser、Extraction和Validator--并采用带有人机协作(HITL)机制和一种新颖的Prompt Fine Tuning with Feedback Inheritance(PFTFI)方法。对每年处理10万张发票的生产使用案例的运营分析表明,可以将全职等效(FTE)需求减少约70%。在2026年1月前处理955个真实世界文档的生产部署中,实现了97.0%的全流程自动化率,仅有3%需要非AI回退。对一个分层的100文档子集(每种供应商/文档类型类别5个文档)的消融评估显示,带有HITL监督的完整MADP配置实现了98.5%的文档级准确性。此外,我们还展示了全面的可持续性分析,表明我们的混合AI+HITL方法相比传统手动处理减少了69%的二氧化碳排放、69%的能量消耗和63%的用水量。多个LLM后端(Granite-Docling、Mistral-Small、DeepSeek-OCR)的基准比较提供了在生产环境中部署的实用见解。

英文摘要

Document processing automation remains a critical challenge in enterprise environments, where traditional manual approaches are labor-intensive and error-prone. We present MADP, a multi-agent architecture that addresses the challenge of automating document processing in enterprise settings by combining deep learning-based classification and parsing with large language model extraction, while maintaining accuracy through selective human validation. Our system integrates five specialized agents--Classificator, Splitter, Parser, Extraction, and Validator--with a Human-in-the-Loop (HITL) mechanism and a novel Prompt Fine Tuning with Feedback Inheritance (PFTFI) approach. The operational analysis on a production use-case scenario of 100,000 invoices per year indicates a potential reduction of Full-Time Equivalent (FTE) requirements by approximately 70%. Production deployment on 955 real-world documents processed through January 2026 achieves a 97.0% full-pipeline automation rate, with only 3% requiring non-AI fallback. Ablation evaluation on a stratified 100-document subset (5 documents per each of 20 supplier/document-type categories) demonstrates that the full MADP configuration with Human-in-the-Loop supervision attains 98.5% document-level accuracy. Additionally, we present a comprehensive sustainability analysis showing that our hybrid AI+HITL approach reduces CO2 emissions by 69%, energy consumption by 69%, and water usage by 63% compared to traditional manual processing. Benchmark comparisons of multiple LLM backends (Granite-Docling, Mistral-Small, DeepSeek-OCR) provide practical insights for deployment in production environments.

2605.17148 2026-05-19 cs.NE cs.AI cs.LG 版本更新

Evolutionary Extreme Learning Machine of ab-initio Energy Landscapes for Crystal Structure Prediction using Manta Ray Optimization with Levy Flight

基于Manta Ray优化与Levy飞行的进化极值学习机用于二元系统中晶格结构预测

Adrian Rubio-Solis

发表机构 * Hamlyn Centre for Robotic Surgery, Imperial College London (ICL)(机器人手术哈姆林中心,伦敦帝国理工学院)

AI总结 本文提出了一种改进的Manta Ray优化算法结合Levy飞行用于训练极值学习机,以预测二元系统中未弛豫和弛豫形成能化合物相对于基态晶格结构的纯组分相对能量。

Comments 8 pages, 4 figures

详情
AI中文摘要

Manta Ray Foraging Optimization算法(MRFO)已被证明是解决大量工程问题最优解的强大启发式策略。本文提出了一种改进的MRFO结合Levy飞行用于训练极值学习机(ELM)的训练,其基本模型是单层前馈网络(SLFN)。所提出的方法称为进化极值学习机-MRFO-Levy飞行(EELM-MRFO-LF)被应用于预测二元系统中未弛豫和弛豫形成能化合物相对于基态晶格结构的纯组分相对能量。EELM-MRFO-LF遵循传统进化ELM的学习过程,首先使用MRFO与Levy飞行选择输入权重,然后应用Moore-Penrose广义逆来解析确定输出权重。Levy飞行轨迹用于增加ELM种群的多样性,以防止早收敛和避免陷入局部最优。所提出的EELM-MRFO-LF性能在相似条件下与其他知名启发式算法进行了比较。

英文摘要

The Manta Ray Foraging Optimization algorithm (MRFO) has proven to be a powerful heuristic strategy in the optimal solution of a large number of engineering problems. In this paper, an improvement of MRFO with Levy Flight is suggested for the training of extreme learning machines (ELMs) whose basic model is a Single Layer Feedforward Network (SLFN). The proposed methodology that we called Evolutionary EELM-MRFO-LF for short is implemented to the prediction of unrelaxed and relaxed formation energy compounds relative to ground state crystal structure of pure components in binary systems. EELM-MRFO-LF follows the learning procedure of traditional Evolutionary ELMs in which first MRFO with LF is used to select the input weights and Moore-Penrose (MP) generalized inverse is applied to analytically determine the output weights. Levy Flight trajectory is implemented for increasing the diversity of the population of ELMs against premature convergence and the ability of avoiding getting trapped in a local optima. The performance of the suggested EELM-MRFO-LF is compared with other well-known nature-inspired algorithms under similar conditions.

2605.17144 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Contrastive Conceptor Activation Steering (COAST): Unlocking Vision-Language-Action Models through Hidden States

对比性概念激活引导(COAST):通过隐藏状态解锁视觉-语言-动作模型

Miranda Muqing Miao, Subin Kim, Brandon Yang, Lyle Ungar

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出COAST方法,通过识别成功子空间来提升视觉-语言-动作模型在机器人任务中的性能,其核心方法是利用概念投射来引导模型向成功分布发展,从而提高任务成功率。

Comments Submitted to NeurIPS 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型利用大规模网络视觉-语言模型(VLM)预训练的强感知先验,但实际应用中却表现出惊人的脆弱性,常常在简单的机器人任务中失败。为缓解这一问题,我们提出了对比性概念激活引导(COAST)。COAST基于“概念”这一线性操作符,该操作符能将数据软投影到目标分布的主成分中。COAST利用概念来从少量的成功和失败轨迹中识别出目标机器人任务的成功子空间。在推理过程中,它将VLA的潜在表示引导到这些识别出的成功子空间中,以提高任务结果。在三种架构不同的神经策略(流匹配VLA、自回归VLA和扩散策略)上,COAST将绝对均值仿真和真实机器人任务的成功率分别提高了超过20%和40%。激活子空间几何表明,失败模式在不同任务中共享大量结构,而成功表示则主要任务特定。当任务共享相似的失败模式时,这种结构使之前拟合的概念能提升新任务的性能而无需重新拟合。最终,我们的结果表明,当前VLA在潜在表示中保留了大量任务相关的知识,而动作专家的解码瓶颈可以通过将残差流引导至任务相关子空间来缓解。COAST提供了一条轻量、无训练的路径,通过引导模型朝其自身的“成功”分布发展,来解锁这些潜在能力。

英文摘要

Vision-Language-Action (VLA) models leverage powerful perceptual priors from web-scale Vision-Language Model (VLM) pre-training, yet they remain surprisingly brittle in practice, frequently failing at simple robotic tasks. To mitigate this, we propose Contrastive Conceptor Activation Steering (COAST). COAST builds on the notion of a "conceptor", a linear operator that soft-projects data into the principal components of a target distribution. COAST uses conceptors to identify success-critical subspaces for a target robotic task from a few examples of success and failure rollouts. At inference time, it steers VLA latents into these identified success subspaces to improve task outcomes. Across three architecturally distinct neural policies (flow-matching VLA, autoregressive VLA, and Diffusion Policy), COAST improves absolute mean simulation and real-robot task success rate by over 20 and 40% respectively. The activation subspace geometry reveals that failure modes share substantial structure across tasks while success representations remain largely task-specific. When tasks share similar failure modes, this structure enables previously fitted conceptors to improve performance on new tasks without refitting. Ultimately, our results suggest that current VLAs retain substantial task-relevant knowledge in their latent representations, and that the action expert's decoding bottleneck could be mitigated by steering its residual stream toward task-relevant subspaces. COAST provides a lightweight, training-free path to unlocking these latent capabilities by steering the model towards its own "success" distributions.

2605.17141 2026-05-19 cs.AI 版本更新

Dynamics of collective creativity in AI art competitions

AI艺术竞赛中集体创造力的动力学

Mason Youngblood, Jeff Nusz, Joel Simon

发表机构 * Institute for Advanced Computational Science, Stony Brook University(斯通比克大学先进计算科学研究所) Morphogen

AI总结 研究通过分析AI艺术竞赛中的图像生成过程,发现集体创造力在人类与AI协同创作中呈现出图像简化、主题趋同以及用户偏好与创作复杂性之间的矛盾现象。

详情
AI中文摘要

创造力是文化演变的核心方面,但群体产生新颖性的机制难以从历史记录中推断。迭代学习实验表明,文化传承会将制品扭曲向学习者的归纳偏差,但大多数研究使用线性链式结构,未探讨这些动态在日益影响文化生产的人类-人工智能系统中的表现。在本研究中,我们利用Artbreeder系统,该系统每日举办'混搭派对',用户基于单一种子图像迭代构建彼此的作品,生成分支的共同创作图像。我们分析了13个月内368场混搭派对的130,882张图像数据,发现图像变得简单并趋同于常见主题'吸引子'(如蒸汽朋克场景、外星建筑)。我们还发现,尽管更新颖的'父'图像产生更新颖且复杂的'子'图像并吸引更多点赞,用户却 paradoxically 偏好混搭新颖性和复杂性较低的图像。最后,更大规模的混搭派对产生更多新颖性,但以更低的复杂性为代价。

英文摘要

Creativity is a fundamental aspect of how culture evolves, yet the mechanisms by which groups produce novelty are notoriously difficult to infer from the historical record. Iterated learning experiments have shown that cultural transmission reliably distorts artifacts toward the inductive biases of learners, but most of this work uses linear chains between human participants, leaving open how these dynamics play out in the networked, human-AI systems that increasingly shape cultural production. In this study, we leverage one such system, Artbreeder, which hosts daily "remix parties" where users iteratively build on each other's work from a single seed image, producing branching lineages of human-AI co-created images. We analyze a dataset of 130,882 images from 368 remix parties over 13 months and find that images become simpler and converge toward common thematic "attractors" (e.g., steampunk scenes, alien architecture). We also find that while more novel "parent" images produce more novel and complex "children" that attract more likes, users paradoxically prefer to remix images that are less novel and complex. Finally, larger remix parties produce more novelty at the cost of lower complexity.

2605.17137 2026-05-19 cs.AI 版本更新

Latent Heuristic Search: Continuous Optimization for Automated Algorithm Design

潜在启发式搜索:为自动化算法设计的连续优化

Cheikh Ahmed, Mahdi Mostajabdaveh, Zirui Zhou

发表机构 * Huawei Technologies Canada(华为加拿大技术有限公司)

AI总结 本文提出了一种连续启发式发现框架,通过将离散程序映射到连续嵌入空间,并利用可微代理模型进行梯度优化,以提升自动化算法设计的性能。

Comments Accepted at LION 2026, The Learning and Intelligent Optimization Conference

详情
AI中文摘要

将大型语言模型(LLMs)整合到进化框架中,已建立了自动化启发式发现的新范式。尽管具有潜力,这些方法通常在程序语法的离散空间中搜索,依赖随机采样来导航高度非凸的优化景观。本文提出了一种连续启发式发现框架,将优化转移到学习的潜在流形上。我们使用编码器将离散程序映射到连续嵌入,并训练一个可微代理模型来预测性能,从而实现基于梯度的搜索。为了正则化优化轨迹,一个可逆的归一化流将这些嵌入映射到结构化的高斯先验中,其中我们执行梯度上升。最终优化的潜在向量通过学习的映射器投影到软提示中,这些提示条件冻结的LLM合成新的可执行启发式方法。我们在旅行商问题(TSP)、有容量车辆路径问题(CVRP)、背包问题(KSP)和在线装箱问题(OBP)上评估了所提出的方法。实证结果表明,连续潜在空间优化在性能上与最先进的离散进化基线相当,同时为自动化算法设计提供了互补的方法论替代方案。实现代码可在https://github.com/cheikh025/LHS上找到。

英文摘要

The integration of Large Language Models (LLMs) into evolutionary frameworks has established a new paradigm for automated heuristic discovery. Despite their promise, these methods typically search in the discrete space of program syntax, relying on stochastic sampling to navigate a highly non-convex optimization landscape. This work proposes a continuous heuristic discovery framework that shifts optimization to a learned latent manifold. We employ an encoder to map discrete programs into continuous embeddings and train a differentiable surrogate model to predict performance, enabling gradient-based search. To regularize the optimization trajectory, an invertible normalizing flow maps these embeddings to a structured Gaussian prior, where we perform gradient ascent. The resulting optimized latent vectors are projected through a learned mapper into soft prompts, which condition a frozen LLM to synthesize novel executable heuristics. We evaluate the proposed method on the Traveling Salesman Problem (TSP), the Capacitated Vehicle Routing Problem (CVRP), the Knapsack Problem (KSP), and Online Bin Packing (OBP). Empirical results demonstrate that continuous latent-space optimization achieves performance competitive with state-of-the-art discrete evolutionary baselines while offering a complementary methodological alternative for automated algorithm design. The implementation code is available at \url{https://github.com/cheikh025/LHS}.

2605.17133 2026-05-19 cs.CV cs.AI 版本更新

CAM-VFD: Cross-Attention Multimodal Video Forgery Detection

CAM-VFD: 跨注意力多模态视频伪造检测

Hoda Osama Elkhodary, Sherin Mostafa Youssef, Marwa Elshenawy, Dalia Sobhy

发表机构 * Computer Engineering Department, College of Engineering and Technology, Arab Academy for Science, Technology and Maritime Transport(计算机工程系,工程与技术学院,阿拉伯科学、技术与海运交通学院)

AI总结 针对深度伪造技术和视频编辑工具快速发展带来的挑战,本文提出CAM-VFD框架,通过跨模态矛盾建模实现多模态视频伪造检测,实验表明其在两个生成视频基准测试中表现出色,具有良好的鲁棒性。

详情
AI中文摘要

深度伪造技术和视频编辑工具的快速发展对多媒体取证、司法证据完整性以及信息真实性构成了重大挑战。当前的检测器依赖单一模态信号,将外观、几何和运动独立处理。然而,先进的生成器在保持单模态一致性的同时会产生跨模态矛盾,这些矛盾在取证上具有鉴别性但无法被单一模态检测器发现。本文提出CAM-VFD,即跨注意力多模态视频伪造检测框架,将跨模态矛盾建模为方向性取证信号。该框架采用跨注意力融合机制,其中基于CLIP的外观表示作为查询,与VideoMAE运动特征和MiDaS深度特征进行对比,从而识别视觉、时间及几何证据之间的矛盾。通过跨模态注意力差异分析验证了该设计,观察到真实与伪造分布在统计上可分离(p<0.001,Cohen's d=0.68)。在两个生成视频基准测试中的实验结果表明,CAM-VFD在GenVidBench上达到95.31%的Top-1准确率,在GenVideo上达到93.43%的准确率、90.63%的F1分数和96.56%的AUROC。此外,CAM-VFD在压缩、噪声、模糊和对抗扰动下表现出稳定的性能,表明跨模态推理可能在媒体取证中提高鲁棒性。代码已公开在https://github.com/Hoda-Osama/CAM-VFD/tree/main。

英文摘要

The rapid advancement of Deepfake technologies and video manipulation tools poses a critical challenge to multimedia forensics, judicial evidence integrity, and information authenticity. Current detectors rely on single-modality signals, treating appearance, geometry, and motion independently. However, advanced generators maintain within-modality consistency while producing cross-modal contradictions, which are forensically discriminative but invisible to any single-modal detector. We propose CAM-VFD, a Cross-Attention Multimodal Video Forgery Detection framework that models cross-modal contradiction as a directional forensic signal. The framework uses a cross-attention fusion mechanism in which CLIP-based appearance representations serve as queries against VideoMAE motion features and MiDaS depth features, enabling the identification of contradictions between visual, temporal, and geometric evidence. We examine this design through cross-modal attention discrepancy analysis, observing statistically separable real and fake distributions ($p<0.001$, Cohen's $d=0.68$). Experimental results on two generative video benchmarks indicate consistent performance, with 95.31\% Top-1 accuracy on GenVidBench and 93.43\% accuracy, 90.63\% F1-score, and 96.56\% AUROC on GenVideo. Moreover, CAM-VFD demonstrates stable performance under compression, noise, blur, and adversarial perturbations, suggesting that cross-modal reasoning may improve robustness in media forensics. The code is publicly available at \url{https://github.com/Hoda-Osama/CAM-VFD/tree/main}.

2605.17128 2026-05-19 cs.CR cs.AI 版本更新

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

针对大模型的新型宽网投射劫持攻击风险

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

发表机构 * Lancaster University(兰卡斯特大学)

AI总结 本文提出了一种新的宽网投射劫持攻击场景,分析了在攻击者可以查询多个大模型而非单个模型时所暴露的安全风险,并开发了一种针对该场景的新型劫持方法,展示了其在无额外保护措施时高达100%的成功率。

Comments Accepted at ICML 2026; project page at https://zzlz233.github.io/Wide-net-casting/

详情
AI中文摘要

由于其与社会安全的紧密联系,对大模型的劫持攻击引起了越来越多的关注。本文识别了一个实用但此前未被探索的劫持场景,即宽网投射场景,其中攻击者可以查询一组大模型而非单个模型以引发有害输出。我们的分析揭示了在此场景下存在重大但此前被忽视的安全风险。作为我们分析的关键部分,我们进一步开发了一种针对宽网投射场景的新型劫持方法。使用这种定制方法,在某些实验中,当针对没有额外保护措施的大模型时,劫持成功率甚至可以达到100%,暴露宽网投射作为一种独特的、高风险场景,值得在未来评估和防御研究中关注。

英文摘要

Jailbreak attacks on large models have drawn growing attention due to their close ties to societal safety. This work identifies a practical yet unexplored jailbreak scenario, the wide-net-casting scenario, where an adversary can query a group of large models instead of a single one to elicit harmful outputs. Our analysis reveals substantial yet previously overlooked safety risks under this scenario. As a key part of our analysis, we further develop a novel jailbreak method tailored to the wide-net-casting scenario. With this tailored method, the jailbreak success rate can even reach 100\% in some experiments when targeting the large models without additional safeguards, exposing wide-net-casting as a distinct, high-risk scenario that warrants attention in future evaluation and defense research.

2605.17115 2026-05-19 cs.AI 版本更新

F2IND-IT! -- Multimodal Fuzzy Fake Indian News Detection using Images and Text

F2IND-IT! -- 多模态模糊假新闻检测:结合图像和文本

Kushal Trivedi, Murtuza Shaikh, Khushi Singh, Jeevaraj S.

发表机构 * ABV - Indian Institute of Information Technology, Gwalior(ABV-印度信息技术学院,加尔瓦里)

AI总结 本文提出了一种多模态模糊框架,结合图像和文本进行印度媒体虚假新闻检测,通过ResNet-50提取图像特征,DistilBERT获取文本语义嵌入,ANFIS生成模糊可靠性评分,并通过轻量级注意力融合模块进行分类,实验结果显示在准确率、精确率、召回率和F1分数上均优于现有方法。

Comments 10 pages, 1 figure

详情
AI中文摘要

跨区域和国家媒体 outlets 的事实扭曲使印度等多样化景观中的虚假信息检测变得更加复杂。本文介绍了一种新颖的多模态框架,结合视觉和文本模态以增强印度媒体上的虚假新闻检测。该架构利用ResNet-50卷积神经网络提取新闻图像的视觉特征,DistilBERT编码器获取文本语义嵌入,以及自适应神经模糊推理系统(ANFIS)生成模糊可靠性评分。一个轻量级基于注意力的融合模块在分类前为每个模态分配可学习的权重。在IFND数据集上评估,通过深入的比较分析验证了所提模型的有效性。实验结果表明,在准确率、精确率、召回率和F1分数上均优于先前研究,确认了架构的有效性。

英文摘要

Biased manipulation of facts across regional and national media outlets complicates misinformation detection in diverse landscapes like India. This paper introduces a novel multimodal framework combining visual and textual modalities for enhanced fake news detection on Indian media. The architecture utilizes a ResNet-50 Convolutional Neural Network to extract visual features from news images, a DistilBERT encoder to obtain textual semantic embeddings, and an Adaptive Neuro-Fuzzy Inference System (ANFIS) to generate a fuzzy reliability score. A lightweight attention-based fusion module assigns learnable weights to each modality prior to classification. Evaluated on the IFND dataset, the proposed model is validated through an in-depth comparative analysis against previous research. Experimental results demonstrate superior performance across accuracy, precision, recall, and $F_1$-scores, confirming the efficacy of the architecture.

2605.17113 2026-05-19 cs.CL cs.AI 版本更新

The Point of No Return: Counterfactual Localization of Deceptive Commitment in Language-Model Reasoning

无法回头的点:语言模型推理中欺骗承诺的反事实定位

Scott Merrill, Shashank Srivastava

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 本文研究语言模型在推理过程中何时开始承诺欺骗,通过反事实定位方法,分析不同环境中的欺骗产生机制,并发现注意力转移特征在跨环境泛化中的有效性,同时提出通过压缩注意力头集来抑制欺骗承诺。

Comments 41 pages, 25 figures

详情
AI中文摘要

现有欺骗数据集将完成的输出标记为诚实或欺骗,将欺骗视为最终响应的属性,而非模型推理轨迹的功能。这掩盖了一个更根本的问题:语言模型何时开始承诺欺骗?我们引入反事实定位:对于推理轨迹中的每个句子前缀,固定前缀,重新采样后续内容,并估计欺骗结果的概率。为了扩展此方法,我们构建了五个环境(涵盖战略欺骗、迷宫指引、财务建议、二手车销售和报价谈判),其中欺骗从未被提示,而是源自战略激励,标签机械地从环境状态得出,而非主观人类判断。所得到的语料库在四个推理模型中定位了约146万句话,来自超过9410万次采样的后续内容、915亿生成的token和超过1万种场景。句子层面的人类评估证实,检测到的承诺点对应于决策状态的可解释转变。使用此资源,我们显示,用于承诺预测的词汇线索在不同环境之间转移效果差,而基于注意力的转移特征在分布外泛化中表现良好,表明欺骗承诺反映在可重用的推理动态变化中,而非表层形式。我们进一步识别出压缩的注意力头集(少于10%的头)在一种环境中选择后,能因果地抑制其他环境中的欺骗承诺。我们发布此语料库作为研究语言模型推理中欺骗和更广泛承诺的子基质。

英文摘要

Existing deception datasets label completed outputs as honest or deceptive, treating deception as a property of the final response rather than a function of the model's reasoning trace. This obscures a more fundamental question: when does a language model become committed to deception? We introduce counterfactual localization: for each sentence prefix in a reasoning trace, we fix the prefix, resample continuations, and estimate the probability of a deceptive outcome. To scale this, we construct five environments (spanning strategic bluffing, maze guidance, financial advice, used-car sales, and offer negotiation) in which deception is never prompted but emerges from strategic incentives and labels follow mechanically from environment state rather than subjective human judgment. The resulting corpus localizes $\sim$1.46M sentences across four reasoning models, drawn from over 94.1M sampled continuations, 91.5B generated tokens, and over 100K scenarios. Sentence-level human evaluation confirms that detected commitment points correspond to interpretable shifts in decision state. Using this resource, we show that lexical cues for commitment prediction transfer poorly across environments, whereas attention-based transition features generalize out of distribution, suggesting that deceptive commitment is reflected in reusable changes in reasoning dynamics rather than surface form. We further identify compact attention-head sets (under 10% of heads) that, selected on one environment, causally suppress deceptive commitment across held-out environments. We release the corpus as a substrate for studying deception, and more broadly commitment, in language-model reasoning.

2605.17104 2026-05-19 cs.AI 版本更新

Scientific Logicality Enriched Methodology for LLM Reasoning: A Practice in Physics

具有科学逻辑性的方法论:LLM推理的实践:物理学

Zhaoxin Yu, Nan Xu, Kun Chen, Jiahao Zhao, Lei Wang, Wenji Mao

发表机构 * State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China Beijing Wenge Technology Co., Ltd, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China

AI总结 本文提出了一种增强科学逻辑性的方法论,旨在提升LLM在科学推理中的逻辑正确性与任务表现,通过物理学中的多样逻辑结构和形式化进行实践验证。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

随着大语言模型(LLM)推理能力的持续进步,其在科学推理任务中的应用获得了广泛关注。当前研究主要强调通过在更大、更全面的数据集上进行训练,以提升LLM在科学问答基准测试中的性能,但这些方法忽视了科学推理过程的本质——逻辑性,这是确保推理步骤有效性的理性基础。在本工作中,我们首次系统地研究了LLM科学推理内部的逻辑性,并开发了一种科学逻辑性增强的方法论,包括一套评估标准和数据采样方法,用于逻辑性引导的训练,以提高LLM推理的逻辑正确性以及任务性能。进一步地,我们以物理学为典范学科,实践上述方法论。在数据构建方面,我们从学术文献中提取科学问题,并采样出一个具有强逻辑性的高质量数据集。基于三种不同的基础LLM进行的实验表明:1)我们构建的训练数据能够有效提高LLM推理中的科学逻辑性;2)增强的科学逻辑性在解决科学问题中起着关键作用。代码可在https://github.com/ScienceOne-AI/PhysLogic获取。

英文摘要

With the continuous advancement of reasoning abilities in Large Language Models (LLMs), their application to scientific reasoning tasks has gained significant research attention. Current research primarily emphasizes boosting LLMs' performance on scientific QA benchmarks by training on larger, more comprehensive datasets with extended reasoning chains. However, these approaches neglect the essence of the scientific reasoning process -- logicality, which is the rational foundation to ensure the validity of reasoning steps leading to reliable conclusions. In this work, we make the first systematic investigation into the internal logicality underlying LLM scientific reasoning, and develop a scientific logicality-enriched methodology, including a set of assessment criteria and data sampling methods for logicality-guided training, to improve the logical faithfulness as well as task performance. Further, we take physics, characterized by its diverse logical structures and formalisms, as an exemplar discipline to practise the above methodology. For data construction, we extract scientific problems from academic literature and sample a high-quality dataset exhibiting strong logicality. Experiments based on three different backbone LLMs reveal that: 1) the training data we constructed can effectively improve the scientific logicality in LLM reasoning; and 2) the enriched scientific logicality plays a critical role in solving scientific problems. Code is available at \href{https://github.com/ScienceOne-AI/PhysLogic}{https://github.com/ScienceOne-AI/PhysLogic}.

2605.17095 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Visual Timelines of Police Encounters in Body-Worn Camera Footage: Operational Context and Activity Cataloging for Training and Analysis in OpenBWC

警察执法视频中的视觉时间线:用于训练和分析的开放BWC操作上下文和活动编目

Angela Srbinovska, Christopher Homan, Adrian Martin, Ernest Fokoué

发表机构 * Rochester Institute of Technology(罗切斯特理工大学) Rochester Police Department(罗切斯特警察局) Office of Business Intelligence(业务智能办公室) School of Mathematics and Statistics(数学与统计学学院)

AI总结 本文提出了一种处理体感摄像头视频的方法,生成时间对齐的固定长度10秒窗口序列,用于训练和分析,通过隐私保护协议进行处理和标记,以提高事件审查和培训流程的效率。

Comments 13 pages, 10 figures, 9 tables

详情
AI中文摘要

执法机构正在积累大量体感摄像头(BWC)视频。然而,这些视频仍然在操作上是模糊的。也就是说,分析人员和培训人员仍然需要花费大量时间观看完整视频以确定关键事件的开始点,并识别活动转向更剧烈的物理活动的点。我们提出了一种方法,将BWC视频处理为时间对齐的固定长度10秒窗口序列,通过隐私保护协议进行处理和标记。每个窗口被标记为两个维度的信息:(i)窗口的操作上下文和(ii)窗口内的运动强度水平,对于因黑暗、模糊或遮挡导致证据不足的窗口,使用低证据标签。我们训练模型根据这两个轴分类窗口,使用从每个窗口中采样的帧,通过CLIP模型编码并汇总成窗口级别的表示。我们提取每个窗口的密集光流统计信息以捕捉运动强度。在测试窗口中,最佳上下文模型达到78.75%的准确率,最佳准确率活动模型达到88.33%。我们还包含了完整性审计,以展示结果以及视觉时间线表示如何支持更快的事件审查,并使警官培训流程更加实用。

英文摘要

Law enforcement agencies are accumulating vast amounts of body-worn camera (BWC) footage. However, this remains operationally opaque. That is, analysts and trainers still have to invest considerable time watching full-length videos to pinpoint the start of key encounters and identify the points where activity shifts to something more physically intense. We present an approach to process BWC video into a time-aligned sequence of fixed-length 10-second windows, processed and labeled using a privacy-conscious protocol. Each window is labeled with two dimensions of information: (i) the operational context of the window and (ii) the level of motion intensity within the window, with low-evidence labels for windows for which insufficient evidence exists due to darkness, blur or occlusion. We train models to classify windows based on these two axes using frames sampled from each window encoded using CLIP model and aggregated into a window-level representation. We extract dense optical flow statistics for each window to capture motion intensity. On test windows the best context model achieves 78.75% accuracy, and the best-accuracy activity model achieves 88.33%. We also included integrity audits to show the results and how the visual timeline representations support faster incident review and make the officer training workflow more practical.

2605.17086 2026-05-19 econ.GN cs.AI cs.CY q-fin.EC stat.AP 版本更新

Global Automation Atlas

全球自动化图谱

Prashant Garg, Tommaso Crosta, Jasmin Baier

发表机构 * Imperial College London(伦敦帝国学院) Bocconi University(博科尼大学) University of Oxford(牛津大学)

AI总结 本文提出了一种基于任务和国家特定的方法,用于全球范围内分类自动化暴露,以区分劳动力替代和增强自动化,相关技术渠道以及人工智能的物质作用。研究涵盖了124个国家,生成了覆盖全球99%人口和GDP的233万个任务-国家标签。

Comments 65 pages, 6 figures. Data and code: https://automationatlas.org/

详情
AI中文摘要

自动化对工作劳动力内容的影响在不同背景下有所不同。然而,大多数现有的暴露测量方法对任务或职业分配固定分数,限制了国家之间的自动化暴露比较。我们开发了一种基于任务和国家特定的方法,用于在全球范围内分类自动化暴露,以区分劳动力替代和增强自动化,相关技术渠道以及人工智能的物质作用。我们的测量覆盖124个国家,生成了覆盖全球99%人口和GDP的233万个任务-国家标签。我们提出了五个描述性结果。首先,暴露程度高度不均,从南苏丹3.3%的任务到中国61.6%的任务,收入越高暴露程度越强,尽管收入组内仍有显著差异。其次,不同国家暴露的任务偏向于替代而非增强,但低收入国家更倾向于替代,而中等收入国家则更异质。第三,低收入国家中,技术先进的自动化形式占暴露任务的一半以上,而高收入国家则约为四分之一;而其他更复杂的渠道通常随收入水平上升。第四,人工智能在简单自动化渠道中较少,但在低收入地区更倾向于劳动力替代边缘,而在高收入地区则更倾向于增强劳动力。第五,我们发现女性似乎比男性更倾向于受到劳动力替代自动化的影响。我们的方法为比较不同发展阶段的自动化暴露提供了基础,将其与跨国数据联系起来,允许我们将暴露水平、劳动力边缘、技术渠道和人工智能参与视为独立维度。

英文摘要

Automation affects the labour content of work differently across different contexts. Yet, most existing exposure measures assign fixed scores to tasks or occupations, limiting comparisons of automation exposure across countries. We develop a task-based and country-specific approach to classify automation exposure across the world to disentangle labor-substituting from labor-augmenting automation, the relevant technology channel, and the material role of AI. Our measure spans 124 countries, generating an atlas of 2.33 million task-country labels for economies covering 99% of world population and GDP. We present five descriptive results. First, exposure is highly uneven, ranging from 3.3% of tasks in South Sudan to 61.6% in China, and rises strongly with income, although substantial variation remains within income groups. Second, across countries, exposed tasks are skewed towards substitution rather than augmentation, but low-income countries are disproportionately exposed to substitution, whereas middle-income countries are more heterogeneous. Third, less technologically advanced forms of automation account for more than half of exposed tasks in low-income countries but about one quarter in high-income countries; while other more complex channels generally rise with income levels. Fourth, AI tends to be less prevalent in simpler channels of automation, but also more prevalent in labour-substituting margins in lower income settings and to augment labour in higher income settings. Fifth, we find that females seem to be disproportionately more exposed to labour-substituting automation than males. Our methodology provides a basis for comparing automation exposure across development stages, linking it with cross-country data and allowing us to treat exposure levels, labour margins, technological channels and AI involvement as separate dimensions.

2605.17079 2026-05-19 cs.CL cs.AI cs.CY 版本更新

Can LLMs Think Like Consumers? Benchmarking Crowd-Level Reaction Reconstruction with ConsumerSimBench

LLMs能否像消费者一样思考?通过ConsumerSimBench进行大众级反应重建的基准测试

Tianyu Wang, Jiajun Li, Jianghao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出ConsumerSimBench基准,通过1553个真实中国社交媒体话题和23122个原子化、规则审核过的标准,评估LLM在模拟消费者反应方面的能力,揭示了前沿模型在预测高语境中文消费者讨论中实际关心内容方面的不足。

详情
AI中文摘要

LLMs越来越多地被用作“数字消费者”来模拟公众意见、预测试营销决策并预测观众反应。然而,现有评估很少询问模型是否能重建现实中消费者在公开讨论中表现的具体反应模式。我们引入了ConsumerSimBench,该基准基于1553个真实中国社交媒体话题和23122个原子化、规则审核过的标准,涵盖四个反应类别。与评分开放生成的综合偏好判断不同,ConsumerSimBench将每个任务分解为可审计的yes-no决定,使三判官协议从65.8%提升至92.1%,且点wise判断与人类多数标签在98.4%时一致。在13个前沿生成器中,最强的模型Gemini-3.1-Pro仅覆盖了47.8%的真实反应标准,而GPT-5.2和Claude-4.6尽管在技术基准上表现优异,但仍然落后。这些失败揭示了技术基准表现与基于社会的消费者直觉之间的巨大差距。直接的结构化推理提示会降低覆盖率,而生成-反思多代理流水线可将MiMo-V2.5-Pro在子集上的表现从32.9%提升至37.6%。ConsumerSimBench将消费者模拟重新定义为对真实公开讨论反应的预测问题,表明前沿LLM在预测高语境中文消费者讨论中实际关心内容方面仍远未可靠。

英文摘要

LLMs are increasingly used as ``digital consumers'' to simulate public opinion, pre-test marketing decisions, and anticipate audience response. However, existing evaluations rarely ask whether a model can reconstruct the concrete reaction patterns that real consumers surface in public discourse. We introduce ConsumerSimBench, a benchmark built from 1,553 real Chinese social-media topics and 23,122 atomic, rule-audited criteria spanning four reaction families. Rather than scoring open-ended generations with a holistic preference judge, ConsumerSimBench decomposes each task into auditable yes-no decisions over concrete reaction points, raising three-judge agreement from 65.8% to 92.1% with 98.4% agreement between pointwise judge decisions and human-majority labels. Across 13 frontier generators, the strongest model, Gemini-3.1-Pro, covers only 47.8% of real reaction criteria, while GPT-5.2 and Claude-4.6 trail far behind despite their strength on technical benchmarks. The failures reveal a sharp gap between technical-benchmark performance and socially grounded consumer intuition. A direct structured reasoning prompt decreases coverage, while a generate--reflect multi-agent pipeline improves MiMo-V2.5-Pro from 32.9% to 37.6% on a subset. ConsumerSimBench reframes consumer simulation as a forecasting problem over real public-discourse reactions, showing that frontier LLMs remain far from reliably predicting what consumers will actually care about in high-context Chinese consumer discourse.

2605.17077 2026-05-19 cs.RO cs.AI 版本更新

How to Instruct Your Robot: Dense Language Annotations Power Robot Policy Learning

如何指导你的机器人:密集语言标注助力机器人策略学习

Bosung Kim, Ruiyi Wang, David Acuna, Jaehun Jung, Alexander Trevithick, Brandon Cui, Yejin Choi, Prithviraj Ammanabrolu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) NVIDIA

AI总结 本研究通过密集语言标注提升机器人策略学习效率,提出DeMiAn方法,利用视觉语言模型生成多方面标注,提升策略和世界模型性能,无需新增演示数据。

详情
AI中文摘要

机器人策略学习受限于演示数据收集成本,而现有演示的语言标注相对廉价。我们研究语言密度作为提取固定机器人或第一人称视频数据集信号的杠杆。我们引入DeMiAn(密集多方面标注),一种两阶段方法,首先通过视觉语言模型生成四个互补方面的演示段落重标记:物理运动、场景组成、手臂姿态和推理。一个学习到的指导者将任务描述和初始场景快照映射到部署时的任务合适标注,异步运行以隐藏生成延迟。在超过100万机器人操作片段和5万EgoVerse人类第一人称视频上,DeMiAn在视觉语言-动作策略和基于视频的世界-动作模型上均未收集新演示的情况下提升了性能。在RoboCasa上,指导者在任务-only基线基础上提升了5个百分点,接近每任务oracle的3个百分点。没有固定标注方面在所有任务中占主导,表明选择正确的密集语言至关重要。DeMiAn还提高了复合任务和分布外性能,并在考虑标注生成FLOPs后,同时提升了中训练和后训练的计算-性能前沿。这些结果将密集重新标注定位为机器人策略学习的实用扩展杠杆。

英文摘要

Scaling robot policy learning is bottlenecked by the cost of collecting demonstrations, while language annotations for existing demonstrations are comparatively cheap. We study language density as a lever for extracting more signal from a fixed robot or egocentric-video corpus. We introduce DeMiAn (Dense Multi-aspect Annotation), a two-stage approach that first re-labels demonstration segments with VLM-generated annotations along four complementary aspects: physical motion, scene composition, arm pose, and reasoning. A learned instructor then maps a task description and initial scene snapshot to a task-appropriate annotation at deployment, running asynchronously so generation latency is hidden behind policy execution. Across over 1M robot manipulation clips and 50K EgoVerse human-egocentric videos, DeMiAn improves both a vision-language-action policy and a video-based world-action model without collecting new demonstrations. On RoboCasa, the instructor raises success by 5 points over a task-only baseline and comes within 3 points of a per-task oracle. No fixed annotation aspect dominates across tasks, showing that selecting the right dense language matters. DeMiAn also improves composite-task and out-of-distribution performance, and shifts the compute-performance frontier in both mid-training and post-training after accounting for annotation-generation FLOPs. These results position dense re-annotation as a practical scaling lever for robot policy learning.

2605.17072 2026-05-19 cs.AI cs.CL 版本更新

RAGA: Reading-And-Graph-building-Agent for Autonomous Knowledge Graph Construction and Retrieval-Augmented Generation

RAGA:用于自主知识图谱构建和检索增强生成的阅读与图构建代理

Chengrui Han, Zesheng Cheng

发表机构 * Qingdao University(青岛大学)

AI总结 本文提出RAGA框架,通过结合阅读、搜索、验证和构建的认知约束,提升知识图谱构建与检索增强生成的效率和准确性,实现了知识图谱的全生命周期管理。

详情
AI中文摘要

现有基于LLM的知识图谱(KG)构建方法主要采用无状态的批处理流程,存在跨片段语义关系捕捉、实体消歧和构建过程可解释性方面的结构性缺陷。这些限制影响了KG的质量、检索精度和在高风险领域的部署信任度。我们提出RAGA(Reading And Graph-building Agent),一种基于LLM的自主KG构建和检索融合框架。RAGA提供支持完整KG生命周期CRUD操作的原子工具集,并将读取-搜索-验证-构建的认知约束嵌入到ReAct工具循环中。KG向量同步机制实现了混合符号-向量检索,而证据锚定验证将每个知识条目与其源文本链接,以实现可审计的溯源性。在QASPER科学问答数据集的子集上的初步实验表明,RAGA的融合检索优于零样本基线,KG整合在答案和证据质量方面提供了可衡量的提升。该框架设计和实验基线为代理驱动的自主KG构建提供了参考。

英文摘要

Existing LLM-driven knowledge graph (KG) construction methods predominantly employ stateless batch processing pipelines, exhibiting structural deficiencies in cross-chunk semantic relation capture, entity disambiguation, and construction process interpretability. These limitations undermine KG quality, retrieval precision, and deployment trust in high-stakes domains. We propose RAGA (Reading And Graph-building Agent), an LLM-based autonomous KG construction and retrieval fusion framework. RAGA provides an atomic toolset supporting full KG lifecycle CRUD operations and embeds a Read-Search-Verify-Construct cognitive constraint into a ReAct tool loop. A KG-vector synchronization mechanism enables hybrid symbolic-vector retrieval, while evidence-anchored verification links every knowledge entry to its source text for auditable provenance. Preliminary experiments on a subset of the QASPER scientific QA dataset indicate that RAGA's fusion retrieval outperforms zero-shot baselines, with KG integration providing measurable gains in both answer and evidence quality. The framework design and experimental baseline serve as a reference for agent-driven autonomous KG construction.

2605.17071 2026-05-19 cs.AI 版本更新

AnchorDiff: Topology-Aware Masked Diffusion with Confidence-based Rewriting for Radiology Report Generation

AnchorDiff: 基于拓扑结构的掩码扩散模型与基于置信度的重写方法用于放射学报告生成

Shiying Yu, Jielei Wang, Guoming Lu

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出AnchorDiff,一种首个结合临床锚点的掩码扩散框架,用于生成放射学报告。该方法通过拓扑感知训练策略和推理时的重写策略,有效缓解了固定顺序自回归解码的局限性,实现了最先进的性能。

详情
AI中文摘要

放射学报告生成(RRG)旨在从医学图像自动生成临床准确的文本报告。现有方法大多依赖于自回归(AR)语言模型,其因果依赖结构限制生成过程为单向的左到右过程。这种范式可能导致序列偏差,即模型倾向于遵循刻板的token顺序和高频报告模板,而非完全基于图像特定的证据进行生成。在本文中,我们提出AnchorDiff,这是首个用于RRG的掩码扩散框架,整合了来自知识图谱的临床锚点到扩散语言模型中。通过利用双向上下文和迭代细化,AnchorDiff缓解了固定顺序自回归解码的局限性。具体而言,我们引入了一种拓扑感知的训练策略,利用RadGraph推导出的实体层次结构来分配临床重要token的差异化掩码保护和损失权重。我们进一步设计了推理时的重写策略,通过基于扰动的测试检测不稳定已提交的token,并在去噪过程中选择性地修改它们。在MIMIC-CXR和MIMIC-RG4基准上的大量实验表明,AnchorDiff实现了最先进的性能,展示了临床锚点掩码扩散在放射学报告生成中的有效性。

英文摘要

Radiology report generation (RRG) aims to automatically produce clinically accurate textual reports from medical images. Existing methods predominantly rely on autoregressive (AR) language models, whose causal dependency structure restricts generation to a unidirectional left-to-right process. This paradigm can induce sequence bias, where models tend to follow stereotypical token orders and high-frequency report templates rather than fully grounding generation in image-specific evidence. In this paper, we propose AnchorDiff, the first masked-diffusion framework for RRG that integrates knowledge-graph-derived clinical anchors into diffusion language modeling. By leveraging bidirectional context and iterative refinement, AnchorDiff mitigates the limitations of fixed-order autoregressive decoding. Specifically, we introduce a topology-aware training strategy that uses RadGraph-derived entity hierarchies to assign clinically important tokens differentiated masking protection and loss weights. We further design an inference-time rewriting strategy that detects unstable committed tokens through perturbation-based testing and selectively revises them during denoising. Extensive experiments on the MIMIC-CXR and MIMIC-RG4 benchmarks demonstrate that AnchorDiff achieves state-of-the-art (SOTA) performance, showing the effectiveness of clinically anchored masked diffusion for radiology report generation.

2605.17064 2026-05-19 cs.AI 版本更新

Towards Human-Level Book-Writing Capability

迈向人类水平的书籍写作能力

Jan Zierstek, Matteo Batelic, Maya Medjad, Tim Schönenberger

AI总结 本文提出了一种用于大规模创意写作的 dataset 构建和训练框架,通过将监督微调重新定义为提示到书籍生成任务,以人类创作的虚构作品为基础,旨在提升生成文本的文学性。

Comments 17 pages, 3 figures

详情
AI中文摘要

优化用于指令跟随和代理任务的大型语言模型仍然难以满足高质量创意写作的要求。小说经常依赖于助手训练模型明确避免的行为,特别是欺骗、道德模糊和不可靠的叙述。因此,生成的故事往往在结构上正确,但风格上却过于通用、解释性过强或在人类文学行为上缺乏依据。我们提出了一种用于书籍规模创意写作的 dataset 构建和训练框架,将监督微调重新定义为基于人类创作小说的提示到书籍生成任务。从公共领域小说开始,我们通过逐步细化的层次总结来推导多分辨率规划框架,从高层次的前提到章节和场景层面的结构。然后在训练过程中反转这一层次结构:模型学习将提示扩展成越来越详细的计划,最终生成原始的人类创作书籍文本。这种 formulation 保留了人类散文作为最终的监督目标,同时利用中间摘要使书籍规模的生成变得可学习。我们在这些提示到书籍轨迹上训练了一个长上下文语言模型,并研究这种目标是否能使生成文本从助手风格的散文转向人类文学写作。

英文摘要

Large language models optimized for instruction following and agentic tasks remain poorly aligned with the requirements of high-quality creative writing. Fiction frequently depends on behaviors that assistant-tuned models are explicitly trained to avoid, particularly deception, moral ambiguity, and unreliable narration. As a result, generated stories often appear structurally correct while remaining stylistically generic, overly explanatory, or weakly grounded in human literary behavior. We present a dataset construction and training framework for book-scale creative writing that reframes supervised fine-tuning as a prompt-to-book generation task grounded in human-authored fiction. Starting from public-domain novels, we derive a multi-resolution planning scaffold by summarizing each book at progressively finer levels, from a high-level premise to chapter- and scene-level structure. We then invert this hierarchy during training: the model learns to expand a prompt into increasingly detailed plans and finally into the original human-authored book text. This formulation preserves human prose as the final supervised target while using intermediate summaries to make book-scale generation learnable. We train a long-context language model on these prompt-to-book trajectories and study whether this objective shifts generation away from assistant-style prose and toward human literary writing.

2605.17044 2026-05-19 cs.AI 版本更新

PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models

PersonaArena: 用于评估和提升大语言模型层面角色扮演的动态模拟

Wenlong Shi, Jianxun Lian, Mingqi Wu, Haiming Qin, Mingyang Zhou, Xing Xie, Naipeng Chao, Hao Liao

发表机构 * College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Microsoft Research Asia(微软亚洲研究院) Microsoft Gaming(微软游戏) Provincial Key Laboratory of Intelligent Communication and Digital Society Governance, Shenzhen University(深圳大学省级智能通信与数字社会治理重点实验室)

AI总结 本文提出PersonaArena框架,通过动态模拟评估和提升大语言模型在角色扮演层面的能力,利用用户生成的社会内容构建细致的个性库,并在模拟社交环境中进行多轮上下文丰富的交互,通过多代理辩论裁判实现全面公正的评估。

Comments ACL 2026 Findings

详情
AI中文摘要

大语言模型(LLMs)日益成为交互式社会代理,但其维持连贯且真实的层面角色扮演能力仍有限,尤其是在现实社交场景中。现有研究主要集中在角色层面设置,并依赖静态评估格式,无法捕捉日常社交互动的复杂性。在本文中,我们提出了PersonaArena,一个用于评估和改进LLMs层面角色扮演的动态模拟框架。PersonaArena利用大量过滤后的用户生成社交内容构建细致的个性库,并在模拟社交环境中引发多轮、上下文丰富的交互。我们的框架包含一个多代理辩论裁判,用于全面且无偏的评估。通过广泛实验,我们证明PersonaArena能够严格评估和提升LLMs的角色扮演能力,推动更真实且社交能力强的AI代理的发展。

英文摘要

Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs' role-playing capabilities, advancing the development of more authentic and socially adept AI agents.

2605.17041 2026-05-19 cs.CL cs.AI cs.HC 版本更新

Agentic AI Translate: An Agentic Translator Prototype for Translation as Communication Design

代理AI翻译:一种用于翻译作为沟通设计的代理翻译原型

Masaru Yamada

发表机构 * Rikkyo University(立命馆大学) Translation Lab Inc(翻译实验室公司)

AI总结 本文提出了一种代理翻译原型,通过将翻译研究的金属语言转化为生成AI的指令代码,重新定义翻译作为沟通设计的过程,而非文本转换。

Comments 14 pages. Conceptual and architectural paper; empirical validation in future work. Code: https://github.com/chuckmy/agentic-translator (v0.8.0). Live demo: https://agentic-translator-chuckmy.streamlit.app

详情
AI中文摘要

我们提出了Agentic AI Translate,一种代理翻译原型,实现了Yamada(即将出版)的论点——翻译研究的金属语言已成为生成AI的指令代码。该系统取代了机器翻译中占主导地位的文本输入/文本输出范式,采用四阶段代理循环(识别->提示->生成->验证),并在用户通过模型辅助对话构建一个基于skopos理论、语域、受众和体裁惯例的结构化翻译简报的交互规范阶段之前。验证阶段采用GEMBA-MQM错误跨度协议(Kocmi & Federmann, 2023)进行证据导向评分,并通过Wang等人(2025)的Delta-lite记忆保存文档层面的连贯性。我们描述了哲学动机、架构承诺、系统消耗的四种参考材料类别以及架构显式说明的主要设计张力。实证验证留待未来工作;本文的贡献是概念性和架构性的——一种可执行的体现,表明在GenAI时代翻译是沟通设计,而非文本转换。

英文摘要

We present Agentic AI Translate, an agentic translator prototype that operationalises the thesis of Yamada (forthcoming) -- that the metalanguage of Translation Studies has become an instruction code for generative AI. The system replaces the dominant text-in / text-out paradigm of machine translation with a four-stage agentic cycle (Identify -> Prompt -> Generate -> Verify), preceded by an interactive specification phase in which the user composes -- through model-assisted dialogue -- a structured translation brief grounded in skopos theory, register, audience, and genre conventions. The verification stage adopts the GEMBA-MQM error-span protocol (Kocmi & Federmann, 2023) for evidence-grounded scoring, and document-level coherence is preserved through a DelTA-lite memory of proper nouns and a running bilingual summary, after Wang et al. (2025). We describe the philosophical motivation, the architectural commitments, the four reference-material categories the system consumes, and the principal design tensions the architecture makes explicit. Empirical validation is left for future work; the contribution here is conceptual and architectural -- an executable embodiment of the position that translation in the GenAI era is communication design, not text conversion.

2605.17038 2026-05-19 cs.AI 版本更新

Evidential Information Fusion on Possibilistic Structure

可能性结构上的证据信息融合

Qianli Zhou, Ye Cui, Zhen Li, Witold Pedrycz, Yong Deng

发表机构 * School of Electronics and Information, Northwestern Polytechnical University(电子信息学院,西北工业大学) Department of Electrical and Computer Engineering, University of Alberta(阿尔伯塔大学电气与计算机工程系) China Mobile Information Technology Center(中国移动信息科技中心) Systems Research Institute, Polish Academy of Sciences(波兰科学院系统研究所) Institute of Fundamental and Frontier Science, University of Electronic Science and Technology of China(中国电子科技大学基础与前沿科学研究院)

AI总结 本文提出了一种基于可能性结构的证据信息融合方法,通过引入信任演化网络和三角范数家族,实现了更灵活的信息融合框架,适用于非distinct源融合、冲突管理等复杂场景。

详情
AI中文摘要

Dempster's规则是结合来自不同且可靠来源的信念函数的基本工具。然而,其基于交集的语义 imposes 强烈的结构限制,限制了其在处理复杂源状态和多样信息融合场景时的灵活性。为克服这一限制,我们提出了一种可逆转换,源自等概率原则,将信念函数与定义在幂集上的可能性结构联系起来。在此转换中,子集之间的关系通过信念演化网络显式表征,提供了比传统质量函数结构更灵活的证据信息表示。在此基础上,我们进一步引入三角范数家族,开发了一个通用且适应性的证据信息融合框架。与根植于Dempster语义的融合方法不同,所提出的框架支持更灵活的组合行为,并在非distinct源融合、冲突管理、参数组合设计和异构信息融合中表现出优势。

英文摘要

Dempster's rule is a fundamental tool for combining belief functions from distinct and reliable sources. However, its intersection-based semantics imposes strong structural restrictions, which limits its flexibility in handling complex source states and diverse information fusion scenarios. To overcome this limitation, we propose a reversible transformation, derived from the isopignistic principle, between belief functions and a possibilistic structure defined on the power set. In this transformation, the relationships among subsets are explicitly characterized by a belief evolution network, which provides a more flexible representation of evidential information beyond the conventional mass function structure. On this basis, we further introduce the triangular norm family to develop a general and adaptive evidential information fusion framework. Unlike fusion methods rooted in Dempster semantics, the proposed framework supports more flexible combination behaviors and exhibits advantages in non-distinct source fusion, conflict management, parametric combination design, and heterogeneous information fusion.

2605.17037 2026-05-19 cs.LG cs.AI cs.CL 版本更新

D$^2$Evo: Dual Difficulty-Aware Self-Evolution for Data-Efficient Reinforcement Learning

D$^2$Evo: 双重难度感知的自进化方法用于数据高效的强化学习

Ru Zhang, Renda Li, Ziyu Ma, Weijie Qiu, Chongyang Tao, Yong Wang, Xiangxiang Chu

发表机构 * Zhejiang University(浙江大学) AMAP, Alibaba Group(AMAP,阿里巴巴集团)

AI总结 本文提出D$^2$Evo方法,通过双重难度感知的自进化机制,解决强化学习中有效数据稀缺和动态难度变化的问题,从而在数学推理基准上以少于2K真实数学样本实现优于现有方法的性能。

Comments Accepted by ICML 2026. First two authors contributed equally

详情
AI中文摘要

强化学习(RL)在增强大型语言模型(LLMs)推理能力方面展现出潜力。然而,需要中等难度训练样本的有效RL训练面临两个根本性挑战:有效数据稀缺和动态难度变化,其中中等难度样本稀缺且随着模型提升变得简单。现有方法在一定程度上缓解了这种稀缺性,通过生成训练样本。然而,这些方法存在无锚点生成、忽略共进化和难度不匹配的问题。为了解决这些问题,我们提出了D$^2$Evo,一种双重难度感知的自进化RL框架。在每次迭代中,我们的方法基于当前求解器的能力挖掘中等难度锚点,训练提问者生成不同难度层级的多样化问题,并共同优化两个组件以实现渐进式的推理提升。广泛实验表明,D$^2$Evo在数学推理基准上以少于2K真实数学样本优于现有方法,并在通用推理基准上表现出强大的泛化能力。

英文摘要

Reinforcement learning (RL) has demonstrated potential for enhancing reasoning in large language models (LLMs). However, effective RL training, which requires medium-difficulty training samples, faces two fundamental challenges: Effective Data Scarcity and Dynamic Difficulty Shifts, where medium-difficulty samples are scarce and become trivial as models improve. Existing methods mitigate this scarcity to some extent by generating training samples. However, these approaches suffer from anchor-free generation, ignoring co-evolution, and difficulty mismatch. To address these issues, we propose D$^2$Evo, a Dual Difficulty-aware self-Evolution RL framework. In each iteration, our method mines medium-difficulty anchors based on the current Solver's capability, trains the Questioner to generate diverse questions at appropriate difficulty levels, and jointly optimizes both components to enable progressive reasoning gains. Extensive experiments demonstrate that D$^2$Evo outperforms existing methods on mathematical reasoning benchmarks with fewer than 2K real mathematical samples, and exhibits strong generalization on general reasoning benchmarks.

2605.17029 2026-05-19 cs.SE cs.AI 版本更新

Task Abstention for Large Language Models in Code Generation

大型语言模型在代码生成中的任务回避

Yanke Zhou, Yuhao Tan, Senrong Xu, Zenan Li, Yuan Yao, Taolue Chen, Xiaoxing Ma

AI总结 本文研究了大型语言模型在代码生成任务中是否应回避生成可能产生幻觉的代码,提出了一种基于多重假设检验原理的校准回避规则,通过代码执行结果评估生成一致性,无需依赖Oracle测试用例或外部数据库,提供了一种可靠的安全且稳健的代码生成机制。

Comments 17 pages, 4 figures

详情
AI中文摘要

大型语言模型(LLMs)已经革新了自动化代码生成。然而,一个严重的问题是所谓的'幻觉',即LLMs可能生成看似合理但功能上错误的代码。在本文中,我们研究了任务回避问题,即确定给定的LLM是否应回避执行特定的代码生成任务以避免可能的幻觉。我们的方法特征是一种校准的回避规则,基于多重假设检验原理。该规则通过代码执行结果评估生成一致性,使其能够处理语义等价代码的语法多样性,而无需依赖Oracle测试用例或外部数据库。我们证明了我们的方法在其回避决策上提供了一种严格且分布无关的理论保证。我们使用几个开源代码LLMs在基准数据集上评估了我们的方法。结果表明,我们的方法使生成模型能够更准确且高效地识别并回避导致幻觉的任务,提供了一种可靠的安全且稳健的代码生成机制。

英文摘要

Large language models (LLMs) have revolutionized automated code generation. One serious concern, however, is the so-called ``hallucination'', i.e., LLMs may generate seemingly plausible but functionally incorrect code. In this paper, we study the task abstention problem, i.e., determining whether a given LLM should abstain from performing a specific code generation task to avoid likely hallucination. Our approach features a calibrated abstention rule, grounded in the principles of multiple hypothesis testing. The rule assesses generation consistency through code execution outcomes, allowing it to handle syntactic diversity of semantically equivalent code without reliance on oracle test cases or external databases. We prove that our approach provides a rigorous, distribution-free theoretical guarantee on its abstention decisions. We evaluate our method on benchmark datasets using several open-source code LLMs. Results show that our method allows generative models to more accurately and efficiently identify and abstain from tasks that induce hallucination compared to existing techniques, providing a reliable mechanism for safer and more robust code generation.

2605.17028 2026-05-19 cs.CL cs.AI 版本更新

PARALLAX: Separating Genuine Hallucination Detection from Benchmark Construction Artifacts

PARALLAX:区分真实幻觉检测与基准构建人工制品

Khizar Hussain, Murat Kantarcioglu

发表机构 * Virginia Tech(弗吉尼亚理工大学)

AI总结 本文研究了大型语言模型幻觉检测中的基准构建人工制品问题,提出DRIFT作为对比方法,发现大部分基线方法在控制条件下表现接近随机,而SAPLMA和DRIFT作为上层隐藏状态的监督探针表现出例外。

Comments Preprint to Neurips 2026 submission

详情
AI中文摘要

大型语言模型(LLMs)在生成输出时常常表现出自信的幻觉:其输出可以流畅、权威且错误。在医疗、法律和科学应用中,这种失败会造成直接伤害,而通过内部模型状态检测幻觉则为更安全的部署提供了路径。越来越多的研究表明,这一问题变得越来越可处理,最近的方法在广泛使用的基准上实现了高检测性能。然而,我们发现,这种明显的进步在仔细审视后并不成立。六个语料库中的四个直接将真实答案嵌入输入提示中。我们提出的名为 extsc{TxTemb}的简单文本相似度基线利用这一点,无需访问模型内部状态即可实现接近完美的检测分数。为了衡量在消除这些人工制品后剩余的真实检测能力,我们进行了涵盖22种检测方法、12种开源模型(涵盖6种架构家族)和6个语料库的大规模评估。我们进一步引入 extbf{DRIFT},作为实时生成检测的比较点。我们的发现表明,该领域报告的幻觉检测进展在很大程度上是由广泛使用的语料库中的基准构建人工制品所解释的;在受控条件下,大多数已建立的基线方法表现接近随机;一致的例外是SAPLMA和DRIFT,两者都是基于上层隐藏状态的监督探针。

英文摘要

Large language models (LLMs) hallucinate with confidence: their outputs can be fluent, authoritative, and simply wrong. In medical, legal, and scientific applications this failure causes direct harm, and detecting it from internal model states offers a path to safer deployment. A growing body of work reports that this problem is increasingly tractable, with recent methods achieving high detection performance on widely used benchmarks. We show, however, that much of this apparent progress does not survive scrutiny. Four of the six corpora embed the ground-truth answer directly in the input prompt. A naïve text-similarity baseline we call \textsc{TxTemb} exploits this to achieve near-perfect detection scores without any access to model internals. To measure what genuine detection capability remains once these artifacts are controlled, we conduct a large-scale evaluation spanning twenty-two detection methods, twelve open-source models spanning six architectural families, and six corpora. We further introduce \textbf{DRIFT}, a supervised probe over inter-layer hidden-state transitions, as a point of comparison for live-generation detection. Our findings suggest that the field's reported progress on hallucination detection is substantially explained by benchmark construction artifacts in widely used corpora, and that the majority of established baselines perform near chance under controlled conditions; the consistent exceptions are SAPLMA and DRIFT, both supervised probes on upper-layer hidden states.

2605.17021 2026-05-19 cs.AI 版本更新

A Conflict-aware Evidential Framework for Reliable Sleep Stage Classification

一种考虑冲突的证据框架用于可靠的睡眠阶段分类

Yunzhi Tian, Dekui Wang, Qirong Bu, Wei Zhou, Xingxing Hao, Jun Feng

发表机构 * College of Computer Science, Northwest University, Xi'an, China(西北大学计算机科学学院)

AI总结 本文提出了一种考虑冲突的证据框架ConfSleepNet,用于可靠地进行睡眠阶段分类,通过动态解决不同视图之间的冲突,提高分类的可靠性。

Comments 19 pages, 7 figures

详情
AI中文摘要

多视图学习已被广泛应用于使用多模态数据进行睡眠阶段分类。然而,现有方法通常假设不同模态是良好对齐的,这在现实世界中往往难以实现,从而影响分类结果的可靠性。在本文中,我们提出ConfSleepNet,一种考虑冲突的证据框架,能够动态解决视图间的冲突。该框架包括多视图证据提取和冲突感知聚合。在第一阶段,它从不同模态中学习类别相关的证据,代表对个体睡眠阶段的支持程度。考虑到不同模态的固有特性,我们为不同模态提出混合类别结构,以促进更合理的证据学习。在第二阶段,从学习到的证据中构建视图特定的意见,包括预测结果和不确定性。值得注意的是,我们提出了一种新的冲突感知聚合方法,将这些视图特定的意见整合到一个可靠的联合决策中。这种机制可以有效解决意见间的冲突,并将它们合成到一个可靠的联合决策中。理论分析和实验结果均证明了ConfSleepNet在睡眠阶段任务中的有效性。代码可在https://github.com/By4te/ConfSleepNet_ICML2026/上获得。

英文摘要

Multi-view learning has been widely applied for sleep stage classification using multi-modal data. However, existing methods typically assume that different modalities are well-aligned, which is often unattainable in real-world scenarios, thereby compromising the reliability of the staging results. In this paper, we propose ConfSleepNet, a conflict-aware evidential framework that dynamically resolves inter-view conflicts. The framework consists of multi-view evidence extraction and conflict-aware aggregation. In the first phase, it learns category-related evidence from different modalities, which represents the degree of support for individual sleep stages. Considering the inherent characteristics of varying modalities, we propose hybrid category structures for different modalities to promote more reasonable evidence learning. In the second phase, view-specific opinions, including prediction results and uncertainty, are constructed from the learned evidence. Notably, we propose a novel conflict-aware aggregation method that integrates these view-specific opinions into a reliable joint decision. This mechanism can effectively resolve conflicts among opinions and synthesize them into a reliable joint decision. Both theoretical analysis and experimental results demonstrate the effectiveness of ConfSleepNet in sleep staging tasks. The code is available at https://github.com/By4te/ConfSleepNet_ICML2026/.

2605.17017 2026-05-19 cs.LG cs.AI 版本更新

When Dynamics Shift, Robust Task Inference Wins: Offline Imitation Learning with Behavior Foundation Models Revisited

当动态变化时,鲁棒任务推断胜出:重新审视具有行为基础模型的离线模仿学习

Rishabh Agrawal, Rahul Jain, Ashutosh Nayyar

发表机构 * University of Southern California(南加州大学)

AI总结 本文提出了一种基于行为基础模型(BFM)的框架,通过将任务推断建模为鲁棒最小最大优化问题,以应对动态变化,从而在不修改预训练的情况下实现对最坏动态扰动的适应。该方法在动态变化下显著优于标准BFM和鲁棒离线模仿学习基线。

详情
AI中文摘要

行为基础模型(BFM)通过预训练任务无关的表示,实现了可扩展的模仿学习(IL)。然而,现有BFM假设环境动态固定,限制了其在现实世界变化(如摩擦力、驱动或传感器噪声变化)下的鲁棒性。我们通过将BFM的任务推断建模为鲁棒最小最大优化问题来解决这一问题,从而能够在不修改预训练的情况下适应最坏情况的动态扰动。到目前为止,这是首个仅依赖单个名义环境的离线数据的BFM框架,能够在动态变化下实现鲁棒性。我们的方法在动态变化下显著优于标准BFM和鲁棒离线IL基线。这些结果表明,鲁棒策略可以完全在任务推断时间实现,提高了BFM在动态环境中的实用性。

英文摘要

Behavior Foundation Models (BFMs) enable scalable imitation learning (IL) by pretraining task-agnostic representations that can be rapidly adapted to new tasks. However, existing BFMs assume fixed environment dynamics, limiting their robustness under real-world shifts such as changes in friction, actuation, or sensor noise. We address this by formulating BFM task-inference as a robust minimax optimization problem, enabling adaptation to worst-case dynamics perturbations without modifying pretraining. To the best of our knowledge, this is the first BFM-based framework that achieves robustness to dynamics shifts while relying solely on offline data from a single nominal environment. Our approach significantly outperforms standard BFM and robust offline IL baselines under dynamics shifts. These results demonstrate that robust policy can be achieved entirely at task-inference time, improving the practicality of BFMs in dynamic settings.

2605.17010 2026-05-19 cs.SI cs.AI cs.CL cs.CY cs.HC 版本更新

Algorithmic Cultivation: How Social Media Feeds Shape User Language

算法培养:社交媒体信息流如何塑造用户语言

Olivia Pal, Agam Goyal, Eshwar Chandrasekharan, Koustuv Saha

AI总结 本文基于培养理论,研究社交媒体信息流如何影响用户语言,发现信息流暴露导致用户语言风格、语义和正式程度显著变化,且不同信息流类型产生不同影响,揭示信息流作为持久语言环境的作用。

详情
AI中文摘要

算法信息流已成为人们在线获取信息的主要环境,尽管它们塑造了人们看到的内容,但较少有人了解持续的信息流暴露如何影响人们的写作方式。基于培养理论,我们检验算法信息流是否作为在线环境,在用户语言中留下可测量的痕迹。我们利用Bluesky平台上23500万条由400万用户发布的帖子的大型纵向数据集,并进行准实验研究,将初始池中的368513名用户(暴露于新闻、科学和Blacksky三种信息流之一)与201915名未接触这些信息流的活跃对照用户进行匹配。我们从词汇语义、心理语言学和主题三个维度考察语言演变。我们发现,接触这些信息流的用户在风格适应、语义对齐和正式程度方面显著高于对照组。这些影响因信息流身份差异明显——Blacksky产生最深的心理语言学重构,显著影响认知处理、情感表达和代词使用,而新闻和科学的影响主要集中在正式程度和主题聚焦。回归模型显示,转发是所有信息流中语言趋同最一致的预测因素,而发布和书签行为则有信息流依赖性影响,其影响在不同信息流间差异超过四倍。我们的研究将培养理论扩展到语言行为,证明信息流作为持久语言环境,逐渐塑造用户在线写作的内容和方式。我们的研究对研究算法影响、在线身份形成以及调解在线互动的基于信息流平台的设计和治理具有启示意义。

英文摘要

Algorithmic feeds have become primary environments for encountering information online, yet while they shape what people see, less is known about how sustained feed exposure shapes how people write. Drawing on Cultivation Theory, we examine whether algorithmic feeds function as online environments that leave measurable traces in users' language. We leverage a large-scale longitudinal dataset of 235M posts by 4M users on Bluesky, and conduct a quasi-experimental study matching an initial pool of 368,513 users exposed to one of three feeds -- News, Science, and Blacksky -- with a pool of 2,001,915 active control users who did not engage with any of these feeds. We examine linguistic evolution across three dimensions: lexico-semantics, psycholinguistics, and topics. We find that users exposed to these feeds show significantly greater stylistic accommodation, semantic alignment, and register formalization than matched controls. These effects vary markedly by feed identity -- Blacksky produces the deepest psycholinguistic restructuring, with significant shifts in cognitive processing, affective expression, and pronoun use, while News and Science effects are largely confined to register and topical focus. Regression models reveal that reposting is the most consistent predictor of linguistic convergence across all feeds, whereas posting and bookmarking show feed-dependent effects, with effects differing more than fourfold across feeds. Our work extends Cultivation Theory beyond belief formation to linguistic behavior, demonstrating that feeds function as persistent linguistic environments that gradually shape what and how users write online. Our work has implications for studying algorithmic influence, online identity formation, and the design and governance of feed-based platforms that mediate online interactions.

2605.17008 2026-05-19 cs.PL cs.AI cs.CL 版本更新

The IsalProgram Programming Language

IsalProgram 编程语言

Ezequiel López-Rubio

发表机构 * Department of Computer Languages and Computer Science(计算机语言与计算机科学系) University of Málaga(马拉加大学) ITIS Software(ITIS软件) Universidad de Málaga(马拉加大学)

AI总结 本文提出了一种新的类似汇编的编程语言IsalProgram,其核心特性包括:1) 作为形式语言理论中的正则语言,其程序可被有限自动机接受;2) 每个有限的指令字母表上的字符串都是语法有效的程序;3) 不显式使用内存地址或变量名。程序由固定指令集中的有限令牌序列组成,并在虚拟机上执行,其唯一数据结构是通过三个数据指针导航的循环双链表,控制流由两个代码指针控制。本文还讨论了IsalProgram作为神经程序合成目标语言的潜在优势,以及通过Levenshtein编辑距离进行程序空间度量探索的可能性,以及在此框架内分析可计算性和复杂性的方向。

详情
AI中文摘要

我们介绍了一种新的类似汇编的编程语言IsalProgram,它具有三个独特的理论特性:(1) 它在形式语言理论中是正则语言,意味着其程序被有限自动机接受;(2) 每个有限的指令字母表上的字符串都是语法有效的程序;(3) 它不显式使用内存地址或变量名,绝对或相对。程序是由固定指令集中的有限令牌序列组成,并在虚拟机上执行,该虚拟机的唯一数据结构是一个通过三个数据指针导航的循环双链表(CDLL),控制流由两个代码指针控制。我们给出了该语言及其虚拟机的完整形式定义,证明了其正则性,并展示了其表达能力。我们进一步讨论了IsalProgram作为神经程序合成目标语言的潜在优势,其程序空间通过Levenshtein编辑距离进行度量探索的可能性,以及在此框架内分析可计算性和复杂性的方向。

英文摘要

We introduce IsalProgram (Instruction Set and Language for Programming), a novel assembly-like programming language with three distinctive theoretical properties: (1) it is a regular language in the sense of formal language theory, meaning its programs are accepted by a finite automaton; (2) every finite string over the instruction alphabet is a syntactically valid program; and (3) it makes no explicit use of memory addresses or variable names, absolute or relative. Programs are finite sequences of tokens drawn from a fixed instruction set, and are executed on a virtual machine whose sole data structure is a circular doubly linked list (CDLL) navigated by three data pointers, with control flow governed by two code pointers. We give a complete formal definition of the language and its virtual machine, prove its regularity, and demonstrate its expressive power. We further discuss IsalProgram's potential advantages as a target language for neural program synthesis, the amenability of its program space to metric-based exploration via the Levenshtein edit distance, and directions for analyzing computability and complexity within this framework.

2605.17000 2026-05-19 cs.LG cs.AI 版本更新

BoLT: A Benchmark to Democratize Black-box Optimization Research for Expensive LLM Tasks

BoLT:一个民主化黑盒优化研究的基准,用于昂贵的LLM任务

Ruth Wan Theng Chew, Zhiliang Chen, Apivich Hemachandra, Bryan Kian Hsiang Low

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文提出BoLT基准,旨在通过提供真实LLM优化问题,促进黑盒优化方法在昂贵的大型语言模型任务中的研究和评估。

详情
AI中文摘要

优化大型语言模型(LLM)的训练和推理配置,如超参数、数据混合和提示,对性能至关重要,但在实践中往往采用启发式方法,导致可能的次优结果。通过将它们视为噪声、昂贵且无导数的优化问题,贝叶斯优化(BO)和其他黑盒优化(BBO)方法提供了一个有前途但尚未充分探索的方向,用于原则性、样本效率高的方法。然而,LLM训练和推理成本对大多数BBO研究社区来说过高,新方法往往仅在合成测试函数和小规模数据集上进行评估,这些数据集无法捕捉现代LLM优化问题的挑战。这阻碍了BBO方法的发展,并使评估这些方法在现代LLM任务上的有效性变得困难。我们介绍了BoLT,这是首个以LLM为中心的基准,旨在民主化LLM研究,服务于BBO社区。BoLT在https://github.com/chewwt/bolt上发布。BoLT涵盖了广泛且有动机的LLM优化问题,包括多保真度、多目标、异方差噪声和高维搜索空间。BoLT中的每个问题都基于真实的实验数据,并通过轻量级的替代模型,基于成千上万的真实LLM实验结果,使其完全可重复和可访问。我们对BoLT进行了广泛的BO和BBO方法的评估,显示选定的BO方法在各种任务上持续优于其他方法,突显了现有BBO方法在LLM任务上的不足,强调了为BBO社区现代化基准的必要性。

英文摘要

Optimization of LLM training and inference configurations, such as hyperparameters, data mixtures, and prompts, is critical to performance, but it is often approached heuristically in practice, leading to potentially suboptimal outcomes. By framing them as noisy, expensive, and derivative-free optimization problems, Bayesian optimization (BO) and other black-box optimization (BBO) methods offer a promising yet underexplored direction for principled, sample-efficient methods. However, LLM training and inference costs are prohibitively high for most of the BBO research community, and new methods are often only evaluated on synthetic test functions and small-scale datasets that fail to capture the challenges of modern LLM optimization problems. This impedes the development of BBO methods and makes it difficult to assess their effectiveness on modern LLM tasks. We introduce BoLT, the first LLM-centric benchmark that democratizes LLM research for the BBO community. BoLT is released at https://github.com/chewwt/bolt. BoLT covers broad and well-motivated LLM optimization problems, involving multi-fidelity, multi-objective, heteroscedastic noise, and high-dimensional search spaces. Each problem in BoLT is grounded in real experimental data and made fully reproducible and accessible through lightweight surrogate models fitted to the results of thousands of real LLM experiments. We benchmark BoLT against an extensive range of BO and BBO methods, showing that selected BO methods consistently outperform others across tasks and highlighting gaps in existing BBO methods on LLM tasks, underscoring the need to modernize benchmarks for the BBO community.

2605.16993 2026-05-19 cs.CY cs.AI cs.LG 版本更新

Adversarial Fragility and Language Vulnerability in Clinical AI: A Systematic Audit of Diagnostic Collapse Under Imperceptible Perturbations and Cross-Lingual Drift in Low-Resource Healthcare Settings

临床AI中的对抗脆弱性与语言脆弱性:在低资源医疗环境中对诊断崩溃的系统审计及不可察觉扰动和跨语言漂移的影响

Anthonio Oladimeji Gabriel, Ahmad Rufai Yusuf

发表机构 * Centre for Clinical Intelligence & Safety(临床智能与安全中心) Tomorrow University of Applied Sciences(明天应用科学大学)

AI总结 本文系统地审计了临床AI在不可察觉扰动和跨语言漂移下的诊断崩溃问题,揭示了对抗脆弱性和语言脆弱性对低资源医疗环境中的临床AI系统的影响。

Comments 23 pages, 9 figures, 3 tables. Code and data available at https://github.com/anthoniooladimeji11-coder/clinical-ai-safety-audit

详情
AI中文摘要

当前的临床人工智能(AI)系统几乎只在干净、标准化的英语输入条件下进行评估,这些条件无法反映低资源环境中的医疗实践现实。本研究首次系统地对临床AI的两种正交安全漏洞进行了双重审计:对抗图像脆弱性和跨语言诊断漂移。使用DenseNet121,这是CheXNet架构的基础,经过在COVID-QU-Ex胸部X光数据集(85,318张图像;COVID-19、非COVID肺炎、正常)上微调,我们证明在Fast Gradient Method(FGM)扰动下,epsilon=0.021时,诊断准确率从89.3%下降到62.0%,这种幅度对人眼来说是不可察觉的。标准防御策略,包括高斯平滑和投票集成,未能恢复临床安全。在平行的语言脆弱性实验中,我们测试了Llama3.1:8b和NatLAS(N-ATLAS)在Standard English、Nigerian Pidgin(Naija)和Yoruba-inflected English中的20例COVID-19临床病例。两种模型均表现出显著的准确性下降:Llama3.1:8b在Pidgin上从80.0%下降到65.0%;NatLAS,一个非洲语境模型,从85.0%下降到55.0%,诊断一致性下降到50%。这些发现为尼日利亚初级卫生中心(PHC)部署中代表性的临床AI系统建立了定量失败范围,并促使对对抗性强、语言包容的临床AI架构的紧急呼吁。

英文摘要

Current clinical artificial intelligence (AI) systems are evaluated almost exclusively on clean, standardised, English-language inputs, conditions that do not reflect the realities of healthcare delivery in low-resource settings. This study presents the first systematic dual audit of two orthogonal safety vulnerabilities in clinical AI: adversarial image fragility and cross-lingual diagnostic drift. Using DenseNet121, the architecture underlying CheXNet, fine-tuned on the COVID-QU-Ex chest X-ray dataset (85,318 images; COVID-19, Non-COVID Pneumonia, Normal), we demonstrate that diagnostic accuracy collapses from 89.3% to 62.0% under a Fast Gradient Method (FGM) perturbation of epsilon=0.021, a magnitude imperceptible to the human eye. Standard defensive strategies including Gaussian smoothing and ensemble voting failed to restore clinical safety. In a parallel language fragility experiment, we tested Llama3.1:8b and NatLAS (N-ATLAS) on 20 COVID-19 clinical cases presented in Standard English, Nigerian Pidgin (Naija), and Yoruba-inflected English. Both models exhibited significant accuracy degradation: Llama3.1:8b dropped from 80.0% to 65.0% on Pidgin; NatLAS, an African-context model, collapsed from 85.0% to 55.0%, with diagnosis consistency falling to 50%. These findings establish a quantitative failure envelope for clinical AI under conditions representative of Primary Health Centre (PHC) deployment in Nigeria, and motivate urgent calls for adversarially hardened, linguistically inclusive clinical AI architectures.

2605.16991 2026-05-19 cs.CL cs.AI 版本更新

Response-free item difficulty modelling for multiple-choice items with fine-tuned transformers: Component-wise representation and multi-task learning

无响应项目难度建模用于多项选择题:细调Transformer:组件表示和多任务学习

Jan Netík, Patrícia Martinková

发表机构 * Faculty of Education, Charles University(查理大学教育学院) Institute of Computer Science of the Czech Academy of Sciences(捷克科学院计算机科学研究所)

AI总结 本文提出了一种无响应项目难度建模方法,通过细调Transformer来处理阅读理解多项选择题的难度问题,采用组件级表示和多任务学习方法来提升模型性能。

详情
AI中文摘要

无响应项目难度建模旨在减少对响应校准的依赖,但对阅读理解多项选择题而言,其难度取决于词汇组件的推断需求。尽管现有方法通常从项目文本中提取特征并传递给单独的统计或机器学习模型,本文通过端到端地在项目词汇上微调Transformer编码器,消除了手动特征工程和预处理所丢失的信息。此外,本文还提出了两种扩展:一种是组件级变体,通过共享编码器分别编码词汇组件;另一种是多任务变体,保留联合编码并添加辅助的多项选择问题回答目标。每种方法都在三种训练集大小下通过蒙特卡洛子采样设计在保留的测试集上进行评估。研究发现,联合编码是一种可行的端到端替代方案;虽然组件级变体没有明显优势,这与自注意力机制本身已经捕获跨组件信号一致,但多任务变体在小样本情况下提供了显著的改进。Transformer微调,尤其是通过合适的辅助任务进行正则化,能够在应用测量中典型的训练集大小下恢复大量词汇可推导的信号。该框架为心理测量学扩展提供了可定制的接口。

英文摘要

Response-free item difficulty modelling promises to reduce reliance on response-based calibration but is intrinsically difficult on reading-comprehension multiple-choice items, where difficulty depends on inferential demands across wording components. Whereas most existing approaches extract item-text features and pass them to a separate statistical or machine-learning model, we fine-tune transformer encoders end-to-end on the item wording, eliminating the manual feature engineering and preprocessing that discards information. Moreover, two extensions to this joint-encoding approach are proposed: a component-wise variant that encodes wording components separately through a shared encoder, and a multi-task variant that retains joint encoding and adds an auxiliary multiple-choice question answering objective on the shared encoder. Each method is evaluated under a Monte Carlo subsampling design at three training-set sizes on a held-out test set. We find that joint encoding is a viable end-to-end alternative to feature-engineering pipelines; while the component-wise variant shows no detectable benefit, consistent with self-attention already harvesting the cross-component signal, the multi-task variant delivers significant paired improvements in the smallest-sample regime. Transformer fine-tuning, especially if regularised by a suitable auxiliary task, recovers a substantial share of the wording-derivable signal at training-set sizes typical of applied measurement. The framework provides a customisable interface for psychometrically motivated extensions.

2605.16986 2026-05-19 cs.CL cs.AI 版本更新

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Skills on the Fly: Test-Time Adaptive Skill Synthesis for LLM Agents

Jingxing Wang, Chenyu Zhou, Zhihui Fu, Jun Wang, Weiwen Liu, Weinan Zhang, Jianghao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) OPPO Research Institute(OPPO研究院)

AI总结 本文提出了一种在测试时自适应的技能合成方法SkillTTA,通过检索与当前任务相关的少量训练轨迹并将其合成成为任务特定的文本技能,以提高LLM代理在SpreadsheetBench、ALFWorld和BigCodeBench等任务上的性能。

Comments 10 pages, 4 figures

详情
AI中文摘要

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose SkillTTA, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-k retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.

英文摘要

LLM agents benefit from reusable skills, yet test-time tasks often require guidance more specific than a static skill library can provide. We propose \emph{SkillTTA}, a Test-Time Adaptive Skill Synthesis method that retrieves a small set of training trajectories relevant to the current task and synthesizes them into a temporary, task-specific textual skill. The solver model is kept fixed, so adaptation happens entirely through generated context rather than parameter updates. We evaluate the method on SpreadsheetBench, ALFWorld, and BigCodeBench. Compared with static trajectory-to-skill synthesis using GPT-5.5, task-specific skills improve SpreadsheetBench Pass@1 from 0.397 to 0.505 and BigCodeBench Pass@1 from 0.517 to 0.651. On ALFWorld, the method matches a heavier memory-learning baseline within four points of success rate while producing the shortest successful trajectories among reported methods. Ablations on SpreadsheetBench further show that synthesized skills outperform raw trajectory prompting, that top-$k$ retrieval should stay small, and that failed trajectories are especially useful because they expose recurring evaluator-facing mistakes.

2605.16975 2026-05-19 cs.LG cs.AI 版本更新

Extending Pretrained 10-Second ECG Foundation Models to Longer Horizons

扩展预训练的10秒ECG基础模型以适应更长的时域

Wei Tang, Jinpei Han, Kangning Cui, Mattia Carletti, Fredrik K. Gustafsson, Shreyank N Gowda, Patitapaban Palo, Anshul Thakur, Lei Clifton, Jean-michel Morel, Raymond H. Chan, David A. Clifton, Xiao Gu

发表机构 * City University of Hong Kong(香港城市大学) Imperial College London(伦敦帝国学院) Wake Forest University(威克森林大学) University of Nottingham(诺丁汉大学) Lingnan University(岭南大学) University of Oxford(牛津大学)

AI总结 本文提出了一种参数高效的框架,通过在不重新训练基础模型的情况下扩展预训练的10秒ECG基础模型,使其能够处理更长和可变长度的ECG信号,解决了结构不兼容和语义挑战问题,实验表明其在多个长时域ECG任务中优于滑动窗口和池化基线方法。

详情
AI中文摘要

预训练在典型诊断10秒ECG片段上的ECG基础模型已在多种临床应用中展示了强大的迁移能力。然而,许多实际应用产生的记录通常更长,且在推理过程中持续时间各异。这些10秒模型缺乏整合时间信息的内置方法。将其扩展到更长的时域引入了两个挑战:由于输入长度差异导致的结构不兼容性,以及限制有意义时间聚合的语义挑战。我们提出了一种参数高效的框架,通过冻结预训练的10秒模型,引入一个轻量级插件模块,以两种互补的方式扩展模型:(i) 结构兼容的长序列处理,(ii) 语义指导的时间建模。在多个长时域ECG任务、数据集和基础模型背骨上的实验表明,我们的方法能够从预训练的快照模型中实现稳健的长时域扩展,一致优于滑动窗口和池化基线方法,具有强大的参数效率。

英文摘要

Electrocardiogram (ECG) foundation models pretrained on typical diagnostic 10-second ECG segments, have demonstrated strong transferability across a range of clinical applications. However, many real-world applications produce recordings that are typically longer, and are varied in duration during inference time. These 10-second models have no built-in way to combine information across time. Extending them to longer horizons introduces two challenges: structural incompatibilities arising from input-length disparities, and semantic challenges that limit meaningful temporal aggregation. We propose a parameter-efficient framework that extends pretrained ECG foundation models to longer and variable-length ECGs without retraining the backbone. Guided by a frozen pretrained 10-second model, we introduce a lightweight plug-in module that extends the model in two complementary ways: (i) structurally compatible long-sequence processing and (ii) semantically informed temporal modeling. Experiments on multiple long-horizon ECG tasks, datasets, and foundation model backbones demonstrate that our method enables robust long-horizon extension from pretrained snapshot models, consistently outperforming sliding-window and pooling-based baselines with strong parameter efficiency.

2605.16969 2026-05-19 cs.AI 版本更新

Brain Vascular Age Prediction Using Cerebral Blood Flow Velocity and Machine Learning Algorithms

基于脑血流速度和机器学习算法的脑血管年龄预测

Anni Zhao, Alex Bateh, Tyler Baldridge, Sandra Billinger, Xiao Hu

发表机构 * Center for Data Science Nell Hodgson Woodruff School of Nursing Emory University(数据科学中心Nell Hodgson Woodruff护理学院埃默里大学) Division of Nephrology Department of Medicine University of Alabama at Birmingham(肾脏科医学部阿拉巴马大学伯明翰分校) Department of Neurology School of Medicine University of Kansas Medical Center(神经科医学院堪萨斯医学中心)

AI总结 本研究利用脑血流速度数据和机器学习算法,通过分析不同脑疾病患者的血管年龄预测,评估加速衰老现象,并探讨TCD生成的特征在评估加速脑血管老化中的相关性。

详情
AI中文摘要

定义血管年龄为生理功能的范畴已成为广泛研究中分类和跟踪年龄的关键问题。超声多普勒(TCD)是一种测量人类大脑主要动脉血流速度的方法。本研究旨在利用从TCD提取的特征来估计年龄并评估患有各种脑疾病个体的加速老化。我们预测患有各种脑疾病的个体在使用不同回归模型训练的健康个体上会表现出加速的脑血管老化。使用形态学分析和颅内压聚类(MOCAIP)算法分析了168名健康受试者和277名双侧大脑中动脉TCD记录的疾病受试者。MOCAIP生成的特征和心率变异性特征被用作回归模型的输入特征以预测脑血管年龄。对66名急性中风患者、27名中风后患者、26名阿尔茨海默病患者、23名轻度认知障碍患者和135名正常受试者进行了测试,以评估加速的脑血管年龄。训练好的模型在平均上预测健康受试者的脑血管年龄比实际年龄高3.69年。不同疾病状况的受试者表现出不同程度的年龄加速。健康和疾病受试者之间的表现差异表明,使用TCD生成的特征可能在评估加速的脑血管老化时是相关的。此外,不平衡的数据集已被观察到会影响基于机器学习的脑年龄预测模型的性能。

英文摘要

Defining vascular age in terms of physiological function has become one focal point of the extensive studies to categorize and track chronological age. Transcranial Doppler (TCD) is a method by which cerebral blood flow velocity is measured along the major arteries feeding the human brain. This study aims to use features extracted from TCD to estimate chronological age and assess accelerated aging in subjects with various brain diseases. We predict subjects with various brain diseases to present with accelerated cerebrovascular aging when tested on various regression models trained by healthy subjects. 168 healthy subjects and 277 diseased subjects with bilateral TCD recordings of the middle cerebral artery were analyzed using the Morphological Analysis and Clustering of Intracranial Pressure (MOCAIP) algorithm. MOCAIP-generated features and heart rate variability features were used as input features for regression models to predict the brain vascular age. 66 subjects with acute stroke, 27 subjects with post stroke, 26 subjects with Alzheimer's disease, 23 subjects with mild cognitive impairment, and 135 established subjects were tested against the machine learning model to assess for accelerated cerebrovascular age. The trained model, on average, predicted healthy subjects' cerebrovascular age to be 3.69 years above their chronological age. Subjects with different disease conditions exhibited varying levels of age acceleration. The differences in healthy and diseased subjects' performances suggest that features generated using TCD may be relevant when evaluating accelerated cerebrovascular aging. Moreover, imbalanced datasets have been observed to affect the performance of machine-learning-based brain age prediction models.

2605.16966 2026-05-19 cs.AI 版本更新

Harnessing AI for Inverse Partial Differential Equation Problems: Past, Present, and Prospects

利用人工智能解决逆偏微分方程问题:过去、现在与展望

Zhentao Tan, Yuze Hao, Boyi Zou, Mingsheng Long, Yi Yang, Gang Bao

发表机构 * Collaborative Innovation Center of Artificial Intelligence (CCAI), Zhejiang University(人工智能协同创新中心(CCAI),浙江大学) School of Mathematical Sciences, Zhejiang University(浙江大学数学科学学院) Tsinghua University(清华大学) Center for Interdisciplinary Applied Mathematics, School of Mathematical Sciences, Zhejiang University(浙江大学数学科学学院交叉应用数学中心)

AI总结 本文综述了利用人工智能解决逆偏微分方程问题的最新进展,涵盖了逆问题、逆设计和控制问题三大类,总结了科学和工业领域中的典型应用,并讨论了开放挑战和未来前景。

Comments 35 pages, 4 figures

详情
AI中文摘要

求解逆偏微分方程(PDE)问题在科学研究中是一个基础性课题,因其在广泛现实应用中的重要性。逆PDE问题出现在医学成像、地球物理、材料科学和空气动力学等领域,目标是推断隐藏原因、设计结构或控制物理状态。本文全面回顾了利用人工智能(AI)解决逆PDE问题的最新进展。我们首先介绍了逆PDE问题的基本 formulation、关键挑战和传统数值基础,然后将其分为三大类别:逆问题、逆设计和控制问题。对于每个类别,我们进一步提出了方法论范式,并回顾了近年来的代表性最先进方法。我们随后总结了科学和工业领域的典型应用,包括机械系统、空气动力学问题、热系统、全波形反演、系统识别和医学成像。最后,我们讨论了开放挑战和未来前景,如物理感知架构、有限现实数据、不确定性量化和逆基础模型。本文旨在为人工智能解决逆PDE问题提供首个统一和系统的视角,展示现代基于学习的方法如何重塑PDE系统中的逆问题、逆设计和控制问题。

英文摘要

Solving inverse partial differential equation (PDE) problems is a fundamental topic in scientific research due to its broad significance across a wide range of real-world applications. Inverse PDE problems arise across medical imaging, geophysics, materials science, and aerodynamics, where the goal is to infer hidden causes, design structures, or control physical states. In this paper, we provide a comprehensive review of recent advances in solving inverse PDE problems using artificial intelligence (AI). We first introduce the basic formulation, key challenges, and traditional numerical foundations of inverse PDE problems, and then organize it into three major categories: inverse problems, inverse design, and control problems. For each category, we further present a methodological paradigms, and review representative state-of-the-art approaches from recent years. We then summarize representative applications across scientific and industrial domains, including mechanical systems, aerodynamic problems, thermal systems, full-waveform inversion, system identification, and medical imaging. Finally, we discuss open challenges and future prospects, such as physics-informed architectures, limited real-world data, uncertainty quantification, and inverse foundation models. This survey aims to provide the first unified and systematic perspective on AI for inverse PDE problems, demonstrating how modern learning-based methods are reshaping inverse problems, inverse design, and control problems in PDE-governed systems.

2605.16961 2026-05-19 cs.CV cs.AI 版本更新

Latent Action Control for Reasoning-Guided Unified Image Generation

潜在动作控制用于推理引导的统一图像生成

Fuxiang Zhai, Sixiang Chen, Yingjin Li, Shuaibo Li, Jianyu Lai, Tengjun Huang, Lei Zhu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港理工大学(广州))

AI总结 本文提出Latent Action Control (LAC),通过将推理表示为隐藏的连续动作,使推理过程可操作,从而在统一生成器中实现推理引导的图像生成。LAC通过角色结构化的潜在轨迹进行规划、内部视觉草图、诊断和细化,并将这些动作注入到条件流生成的隐藏流中,从而提升生成质量。

详情
AI中文摘要

统一的多模态模型可以在共享的骨干网络中编码视觉理解和图像生成,但理解并不自动转化为控制:模型可能推断出对象、关系或知识提示,但无法在生成的图像中实例化。我们提出潜在动作控制(LAC),通过将推理表示为隐藏的连续动作,使推理过程可操作。给定提示,LAC会规划角色结构化的潜在轨迹,进行内部视觉草图、诊断和细化,并将这些动作注入到条件流生成的隐藏流中,而无需生成推理标记或中间图像。由于这些动作轨迹是未观察到的,LAC通过先验引导的变分潜在动作对齐从仅训练的语义先验、草图图像特征和监督停止信号中学习这些动作,随后通过Latent-Flow GRPO对齐潜在到图像的生成轨迹与终端视觉反馈。这为从推断的关系、绑定和知识提示到生成过程的控制路径提供了支持。在BAGEL-7B-MoT上实现后,LAC在GenEval、WISE和T2I-CompBench中一致提升了组合性和知识引导的生成,尤其是在空间关系、属性绑定和世界知识敏感提示上表现最佳。消融实验和潜在干预显示,学习的动作轨迹被生成器消耗,表明统一生成在理解不仅被编码,而是在生成过程中被操作时受益。

英文摘要

Unified multimodal models can encode visual understanding and image generation within a shared backbone, yet understanding does not automatically translate into control: models may infer objects, relations, or knowledge cues but fail to instantiate them in the generated image. We propose Latent Action Control (LAC), which makes reasoning actionable by representing it as hidden continuous actions inside a unified generator. Given a prompt, LAC rolls out a role-structured latent trajectory for planning, internal visual drafting, diagnosis, and refinement, and injects these actions into the hidden stream that conditions flow-based generation, without producing reasoning tokens or intermediate images. Since such action trajectories are unobserved, LAC learns them through prior-guided variational latent action alignment from training-only rendered semantic priors, draft image features, and supervised halting signals, followed by Latent-Flow GRPO to align the latent-to-image rollout with terminal visual feedback. This provides a control path from inferred relations, bindings, and knowledge cues to the generation process. Instantiated on BAGEL-7B-MoT, LAC consistently improves compositional and knowledge-grounded generation across GenEval, WISE, and T2I-CompBench, with the largest gains on spatial relations, attribute binding, and world-knowledge-sensitive prompts. Ablations and latent interventions show that the learned action trajectory is consumed by the generator, suggesting that unified generation benefits when understanding is not only encoded, but made actionable during generation.

2605.16938 2026-05-19 cs.CL cs.AI q-bio.NC 版本更新

Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models

努力作为上限,而非调节器:推理预算不影响人类与大推理模型之间的认知成本对齐

Yueqing Hu, Tianhong Wang

发表机构 * Institute of Neuroscience, Chinese Academy of Sciences(中国科学院神经科学研究所) School of Philosophy, Anhui University(安徽大学哲学系)

AI总结 该研究探讨了推理预算是否影响人类与大推理模型之间的认知成本对齐,发现无论推理努力如何变化,对齐情况保持不变,表明这种对齐是在训练时形成的,而非在推理时动态调整。

Comments 8 pages, 6 figures

详情
AI中文摘要

大推理模型(LRMs)生成的思维链轨迹长度与人类反应时间在认知任务中保持一致,但最近的争论质疑这种一致性是否反映真实的计算结构还是表面的冗长性。我们测试了这种一致性是否随推理时间的推理努力而变化。在GPT-OSS-20B和GPT-OSS-120B上,三个努力水平和六个推理任务中,任务内和跨任务的一致性保持不变:贝叶斯因子倾向于null,且各条件下的平均一致性几乎相同。操纵检查显示,努力参数设定了生成的上限,而非驱动实时分配,表明分配策略在训练时已固化。算术复杂度对比进一步显示,令牌分配跟踪细粒度、格式依赖的人类难度模式,模型规模提高了匹配程度。人类与LRMs之间的认知成本对齐似乎是在训练时形成的,对推理时的扰动具有鲁棒性,支持大推理模型问题解决的编译而非在线账户。

英文摘要

Large Reasoning Models (LRMs) generate chain-of-thought traces whose length tracks human reaction times across cognitive tasks, but recent debate questions whether this alignment reflects genuine computational structure or surface verbosity. We test whether the alignment varies with inference-time reasoning effort. Across GPT-OSS-20B and GPT-OSS-120B, three effort levels, and six reasoning tasks, within-task and cross-task alignment remain invariant: Bayes Factors lean toward the null, and mean alignment is numerically near-identical across conditions. A manipulation check reveals that the effort parameter sets an upper budget on generation rather than driving real-time allocation, suggesting that the allocation policy is crystallized at training time. Arithmetic complexity contrasts further show that token allocation tracks fine-grained, format-dependent human difficulty patterns, with model scale improving the match. Cognitive cost alignment between LRMs and humans appears to be a training-time achievement, robust to inference-time perturbations, supporting a compiled rather than online account of LRM problem-solving.

2605.16927 2026-05-19 cs.AI 版本更新

From Static Risk to Dynamic Trajectories: Toward World-Model-Inspired Clinical Prediction

从静态风险到动态轨迹:迈向世界模型启发的临床预测

Pujun Feng, Xiaoyu Guo, Seyed Ehsan Saffari, Min Hun Lee, Siew-Kei Lam, Erik Cambria, Xibin Sun, Yangtao Zhou, Tong Yang, Xiaoyu Zhang, Tao Tan, Yue Sun, Bin Cui

发表机构 * Faculty of Applied Sciences, Macao Polytechnic University(澳门理工学院应用科学学院) School of Software & Microelectronics, Peking University(北京大学软件与微电子学院) School of Computing and Information Systems, Singapore Management University(新加坡管理学院计算机与信息系统学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Public Health, Peking University(北京大学公共卫生学院) School of Computer Science and Technology, Xidian University(西安电子科技大学计算机科学与技术学院) School of Computer, Peking University(北京大学计算机学院) School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) Centre for Biomedical Data Science, Duke-NUS Medical School, National University of Singapore(新加坡国立大学杜克-国立新加坡医学学院生物医学数据科学中心) Duke-NUS AI + Medical Sciences Initiative, Duke-NUS Medical School(杜克-国立新加坡医学学院AI+医学科学计划)

AI总结 本文探讨了临床AI中干预感知的疾病轨迹建模方法,提出了统一框架,结合了预测、反事实轨迹和政策评估,以解决治疗分配、时间变化混杂和观察偏差问题,推动临床预测向决策级证据发展。

详情
AI中文摘要

临床决策是一个反馈系统,其中风险估计影响治疗,而治疗又改变疾病轨迹,两者共同塑造医生的测量实践。静态预测在临床中往往失败:训练于观察性护理日志的模型会将疾病生物学与医生行为混为一谈,特别是在存在治疗混杂反馈和不规则或信息性观察的情况下。本文聚焦于临床AI中的干预感知疾病轨迹建模方法——估计患者特定的纵向疾病演变并评估在替代治疗下的轨迹变化。本文围绕六个相关组成部分组织该领域:三个决策任务(事实预测、反事实估计、政策评估)和三个数据生成机制(疾病演变、治疗分配、观察过程),这些决定了可识别性。本文提出了第一个统一框架,连接了离散/连续时间下的预测、反事实轨迹和政策评估,明确处理治疗分配、时间变化混杂和观察偏差。本文综合了关键方法家族(多状态/联合模型、时间点过程、深度序列架构、纵向因果推断),将它们映射到相关组成部分,并通过重叠诊断、不确定性量化、非策略鲁棒性和目标试验验证对齐评估。这种综合将基准预测推进到决策级临床证据,使治疗敏感的个性化未来成为可能,实现部署前的政策压力测试,并推动更安全的闭环学习健康系统,在证据不足时适应或回避。

英文摘要

Clinical decision-making is a feedback system where risk estimates influence treatment, which in turn changes disease trajectories, and both shape clinicians' measurement practices. Static prediction often fails clinically: models trained on observational care logs conflate disease biology with clinician behavior, particularly under treatment confounder feedback and irregular or informative observation. This Review focuses on intervention-aware disease trajectory modeling in clinical AI--methods estimating patient-specific longitudinal disease evolution and assessing trajectory changes under alternative treatments. We organize the field around six linked components: three decision tasks (factual forecasting, counterfactual estimation, policy evaluation) and three data-generating mechanisms (disease evolution, treatment assignment, observation process) that determine identifiability. We present the first unified framework bridging forecasting, counterfactual trajectories, and policy evaluation across discrete/continuous time, explicitly addressing treatment assignment, time-varying confounding, and observation bias. We synthesize key method families (multistate/joint models, temporal point-process, deep sequence architectures, longitudinal causal inference), map them to relevant components, and align evaluation with claim strength via overlap diagnostics, uncertainty quantification, off-policy robustness, and target-trial validation. This synthesis advances benchmark prediction to decision-grade clinical evidence, enabling treatment-sensitive individualized futures, pre-deployment policy stress-testing, and safer closed-loop learning health systems that adapt/abstain when evidence is insufficient.

2605.16909 2026-05-19 cs.AI 版本更新

TOBench: A Task-Oriented Omni-Modal Benchmark for Real-World Tool-Using Agents

TOBench:面向真实世界工具使用代理的任务导向多模态基准

Zhiqiang Liu, Wenhui Dong, Yilang Tan, Yuwen Qu, Haochen Yin, Chenyang Si

发表机构 * Nanjing University(南京大学) Huazhong University of Science and Technology(华中科技大学) Southwest Jiaotong University(西南交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出TOBench,一个面向真实世界工具使用代理的多模态基准,通过闭环多模态验证设计,评估和推动下一代多模态工具使用代理的发展。

Comments Github: https://github.com/Pi3AI/TOBench

详情
AI中文摘要

工具使用代理正越来越多地被期望在现实中的专业工作流程中操作,其中它们必须解释多模态输入、协调外部工具、检查中间产物并修改其行为,以最终产生结果。然而,现有的基准测试通常孤立地评估工具使用、计算机使用和多模态推理,导致基准设置与现实中的端到端多模态工具使用之间存在差距。为此,我们引入MM-ToolBench,一个用于任务导向多模态工具使用的基准和评估工具。MM-ToolBench包含100个可执行任务,来自两个宏任务家族,客户服务和智能创作,涵盖20个子类切片,并由27个MCP服务器和324个工具支持。MM-ToolBench的核心设计是闭环多模态验证:代理必须执行工具、检查渲染或转换后的产物,并在输出未能满足任务特定要求时进行自我纠正。为了使此类评估可扩展和可验证,MM-ToolBench结合了基于MCP的执行与任务特定的地面评估器以及一个半自动化的场景发现、任务实例化、评估器合成和人类审核的构建流程。在15个当代代理模型上的实验表明,MM-ToolBench仍然极具挑战性:Claude Opus 4.6,通常被视为最强的编码代理模型之一,仅达到32.0%的任务成功率,远低于94.0%的人类基准。我们设想MM-ToolBench作为评估和推动下一代多模态工具使用代理的实用基础,通过闭环多模态验证。

英文摘要

Tool-using agents are increasingly expected to operate across realistic professional workflows, where they must interpret multimodal inputs, coordinate external tools, inspect intermediate artifacts, and revise their actions before producing a final result. Existing benchmarks, however, often evaluate tool use, computer use, and multimodal reasoning in isolation, leaving a gap between benchmark settings and end-to-end omni-modal tool use in the real world. To address this gap, we introduce MM-ToolBench, a benchmark and evaluation harness for task-oriented omni-modal tool use. MM-ToolBench contains 100 executable tasks from two macro task families, Customer Service and Intelligent Creation, covering 20 subcategory slices and supported by 27 MCP servers with 324 tools. The central design of MM-ToolBench is closed-loop multimodal verification: agents must execute tools, inspect rendered or transformed artifacts, and self-correct when outputs fail task-specific requirements. To make such evaluation scalable and verifiable, MM-ToolBench couples MCP-based execution with task-specific grounded evaluators and a semi-automated construction pipeline for scenario discovery, task instantiation, evaluator synthesis, and human audit. Experiments on 15 contemporary agentic models show that MM-ToolBench remains highly challenging: Claude Opus 4.6, commonly regarded as one of the strongest coding-agent models, achieves only 32.0% task success, far below the 94.0% human benchmark. We envision MM-ToolBench as a practical foundation for evaluating and advancing next-generation omni-modal tool-using agents through closed-loop multimodal verification.

2605.16895 2026-05-19 cs.CE cs.AI cs.CL 版本更新

The Alpha Illusion: Reported Alpha from LLM Trading Agents Should Not Be Treated as Deployment Evidence

阿尔法幻觉:LLM交易代理报告的阿尔法不应被视为部署证据

Yuxuan Ye, Jun Han, Ao Hu, Juncheng Bu, Yiyi Chen, Liangjian Wen, Danilo Mandic, Danny Dongning Sun, Xu Yinghui, Zenglin Xu

发表机构 * Fudan University(复旦大学) Shanghai University of Finance and Economics(上海财经大学) Southwest University of Finance and Economics(西南财经大学) Northeastern University(东北大学) Imperial College London(伦敦帝国理工学院) Peng Cheng Laboratory(鹏城实验室)

AI总结 本文指出,LLM交易代理报告的阿尔法不应被当作部署的证据,因为这些阿尔法需要通过结构有效性测试来验证其时间完整性、现实摩擦、反事实稳健性、预测校准、数值执行和多代理分解等关键指标,当前的公开证据无法区分稳健的预测能力与时间污染、未建模的摩擦、短窗口夏普不确定性、叙事拟合和参数先验等因素。

详情
AI中文摘要

端到端的LLM交易代理已经从研究兴趣迅速发展为一个小型的命名系统生态系统,包括FinCon、FinMem、TradingAgents、FinAgent、QuantAgent和FLAG-Trader。其中几个系统报告的 headline 夏普比率如果在部署桌面上被直接解读,将是实质性的影响,而相关的基准如FinBen也报告了在相同范围内的交易任务夏普统计。学术界与工业界之间的差距在双方都被过度跨越了。我们对这一差距持立场:端到端LLM交易代理报告的阿尔法不应被视为部署证据。在这样的收益能够支持部署交易能力的主张之前,它们必须通过结构有效性测试,以确保时间完整性、现实摩擦、反事实稳健性、预测校准、数值执行和多代理分解。当前的公开证据尚无法区分稳健的预测能力与时间污染、未建模的摩擦、短窗口夏普不确定性、叙事拟合和参数先验等因素。问题不仅是评估性的,也是结构性的。语言信心不是可交易的概率,叙事推理不是数值执行,模型先验可能成为未披露的隐性因子暴露。我们贡献了一个最小报告协议套件,P1-P6,根据主张强度分层适用,并且提供了一个保守的模块化替代方案,该方案利用LLM作为可审计的信息接口,在独立校准、风险和执行模块之前。代码和再生产工具:\url{https://github.com/hj1650782738/Trading}。

英文摘要

End-to-end LLM trading agents have moved quickly from research curiosity to a small ecosystem of named systems, including FinCon, FinMem, TradingAgents, FinAgent, QuantAgent, and FLAG-Trader. Several of these report headline Sharpe ratios that would be material if read at face value on a deployment desk, and associated benchmarks such as FinBen report trading-task Sharpe statistics in the same range. The gap between architecture research and deployment claim has been crossed too freely on both sides of the academia--industry divide. We take a position on that gap: reported alpha from end-to-end LLM trading agents should not be treated as deployment evidence. Before such returns can support claims of deployable trading capability, they must survive structural validity tests for temporal integrity, real-world frictions, counterfactual robustness, predictive calibration, numerical execution, and multi-agent disaggregation. Current public evidence cannot yet distinguish robust predictive ability from temporal contamination, unmodeled frictions, short-window Sharpe uncertainty, narrative fitting, and parametric priors. The problem is not only evaluative but structural. Language confidence is not tradable probability, narrative reasoning is not numerical execution, and model priors may become undisclosed implicit factor exposures. We contribute a minimum reporting protocol suite, P1--P6, with tiered applicability by claim strength, and a conservative modular alternative that uses LLMs as auditable information interfaces upstream of independent calibration, risk, and execution modules. Code and reproduction harness: \url{https://github.com/hj1650782738/Trading}.

2605.16893 2026-05-19 cs.AI 版本更新

NGM: A Plug-and-Play Training-Free Memory Module for LLMs

NGM: 一种无需训练的插拔式内存模块用于大语言模型

Yuwen Qu, Wenhui Dong, Chenyang Si, Caifeng Shan

发表机构 * Nanjing University(南京大学)

AI总结 本文提出NGM,一种无需训练的插拔式内存模块,通过因果n-gram编码器和余弦门控内存注入器,实现高效的知识检索,提升大语言模型在代码生成和知识密集型任务中的性能。

Comments Code is available at https://github.com/PioneerQyw/NGM

详情
AI中文摘要

近期的研究引入了条件内存模块,将知识存储与神经计算解耦,从而实现更直接的知识访问。与MoE相比,依赖动态计算路径的MoE,显式查找提供了一种更高效的检索机制。然而,这些方法仍然依赖于学习的内存嵌入,需要额外的训练,限制了灵活性。为此,我们提出了N-gram Memory (NGM),一种无需训练的插拔式模块,由因果N-gram编码器和余弦门控内存注入器组成。因果N-gram编码器直接平均预训练的骨干模型的token嵌入,以构建n-gram表示,从而消除了需要从头训练n-gram嵌入的需要。这种设计不需要额外的内存表或检索流水线。余弦门控内存注入器则使用非参数化的余弦门与ReLU,将检索到的嵌入调节为上下文表示。我们在Qwen3系列从0.6B到14B的八个基准测试中评估了NGM。NGM在平均性能上提高了0.5到1.2个点,尤其在代码生成和知识密集型任务(例如,Qwen3-14B在LiveCodeBench上+3.0,在GPQA上+3.03)中表现突出。此外,NGM还在多模态基准测试(例如,MMStar在Qwen3-VL-2B上+1.53)中提高了性能。

英文摘要

Recent studies introduce conditional memory modules that decouple knowledge storage from neural computation, enabling more direct knowledge access. Compared to MoE, which relies on dynamic computation paths, explicit lookup provides a more efficient knowledge retrieval mechanism. However, these approaches still depend on learned memory embeddings, requiring additional training and limiting flexibility. To address this, we propose N-gram Memory (NGM), a training-free, plug-and-play module composed of a Causal N-Gram Encoder and a Cosine-Gated Memory Injector. The Causal N-Gram Encoder directly averages the pretrained token embeddings of the backbone model to construct N-gram representations, thereby eliminating the need to train separate N-gram embeddings from scratch. This design requires neither an additional memory table nor a retrieval pipeline. The Cosine-Gated Memory Injector then uses a non-parametric cosine gate with ReLU to modulate the retrieved embeddings into the contextual representations. We evaluate NGM on the Qwen3 series from 0.6B to 14B across eight benchmarks. NGM improves average performance by 0.5 to 1.2 points, with particularly clear gains on code generation and knowledge-intensive tasks (e.g., +3.0 on LiveCodeBench and +3.03 on GPQA for Qwen3-14B). Moreover, NGM also improves performance in multimodal benchmarks (e.g., MMStar +1.53 on Qwen3-VL-2B).

2605.16892 2026-05-19 cs.CV cs.AI cs.CL 版本更新

DriveSafe: A Framework for Risk Detection and Safety Suggestions in Driving Scenarios

DriveSafe: 一种用于驾驶场景中风险检测与安全建议的框架

Sainithin Artham, Shankar Gangisetty, Avijit Dasgupta, C. V. Jawahar

发表机构 * IIIT-Hyderabad(IIIT-海得拉巴)

AI总结 本文提出DriveSafe框架,通过结构化自然语言描述实现风险感知场景理解,结合多模态上下文生成空间 grounded 的描述,用于下游风险评估和安全建议,实验表明其在DRAMA基准上达到最先进的性能。

Comments 8 pages

详情
AI中文摘要

全面的情景意识对于在安全关键环境中运行的自动驾驶车辆至关重要,因为它能够识别并缓解潜在风险。尽管最近的多模态大语言模型(MLLMs)在通用视觉-语言任务上表现出色,但我们的研究发现,零样本MLLMs在细粒度、空间接地的风险评估中仍不如领域特定的方法。为了解决这一差距,我们提出了DriveSafe,一种用于风险感知场景理解的框架,利用结构化自然语言描述。具体而言,我们的方法首先生成包含运动、空间和深度线索的多模态上下文的时空接地描述。这些描述随后用于下游的风险评估,明确识别危险物体、其位置以及它们所暗示的不安全行为,随后提供可操作的安全建议。为了进一步提高性能,我们采用描述-风险配对来微调一个轻量级的适配器模块,高效地将领域特定的知识注入基础LLM中。通过将风险评估条件化为显式的语言基础场景表示,DriveSafe在零样本MLLMs和先前的领域特定基线之上取得了显著的提升。在DRAMA基准上的全面实验表明了最先进的性能,而消融研究验证了我们关键设计选择的有效性。项目页面:https://cvit.iiit.ac.in/research/projects/cvit-projects/drivesafe

英文摘要

Comprehensive situational awareness is essential for autonomous vehicles operating in safety-critical environments, as it enables the identification and mitigation of potential risks. Although recent Multimodal Large Language Models (MLLMs) have shown promise on general vision-language tasks, our findings indicate that zero-shot MLLMs still underperform compared to domain-specific methods in fine-grained, spatially grounded risk assessment. To address this gap, we propose DriveSafe, a framework for risk-aware scene understanding that leverages structured natural language descriptions. Specifically, our method first generates spatially grounded captions enriched with multimodal context, including motion, spatial, and depth cues. These captions are then used for downstream risk assessment, explicitly identifying hazardous objects, their locations, and the unsafe behaviors they imply, followed by actionable safety suggestions. To further improve performance, we employ caption-risk pairings to fine-tune a lightweight adapter module, efficiently injecting domain-specific knowledge into the base LLM. By conditioning risk assessment on explicit language-based scene representations, DriveSafe achieves significant gains over both zero-shot MLLMs and prior domain-specific baselines. Exhaustive experiments on the DRAMA benchmark demonstrate state-of-the-art performance, while ablation studies validate the effectiveness of our key design choices. Project page: https://cvit.iiit.ac.in/ research/projects/cvit-projects/drivesafe

2605.16880 2026-05-19 cs.AI 版本更新

Virtual Nodes Guided Dynamic Graph Neural Network for Brain Tumor Segmentation with Missing Modalities

虚拟节点引导的动态图神经网络用于缺失模态的脑肿瘤分割

Sha Tao, Jiao Pan, Yu Guo, Chao Yao

发表机构 * University of Science and Technology Beijing(科学技术大学)

AI总结 本文提出了一种基于图的单阶段框架,通过引入模态特定的虚拟节点来补偿缺失模态,利用图网络的灵活性设计动态连接策略,提升模型对任意模态组合的鲁棒性,并在BRATS-2018和BRATS-2020数据集上验证了方法在不完整模态下的优越性能。

Comments The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2026

详情
AI中文摘要

多模态磁共振成像(MRI)对于脑肿瘤分割至关重要,许多方法利用其四种关键模态来捕捉互补信息以实现有效的子区域分析。然而,实践中缺少几种模态的情况非常常见,导致现有全模态分割方法性能严重下降。受限于结构化数据模型,近期工作常采用多阶段训练策略处理全模态和缺失模态场景,这增加了训练成本且无法充分解决缺失模态带来的干扰。在本文中,我们提出了一种基于图的单阶段框架,用于鲁棒的脑肿瘤分割。具体而言,我们引入了模态特定的虚拟节点,作为补充信息源以补偿缺失模态。为了增强模型对任意模态组合的鲁棒性,我们利用图网络的内在灵活性设计了动态连接策略。该机制根据模态可用性动态调整邻接矩阵,在保留有益信息流的同时减轻缺失模态引起的干扰效应。此外,我们通过异质权重矩阵增强了图网络,使其更适应多模态场景。在BRATS-2018和BRATS-2020数据集上的大量实验表明,我们的方法在几乎所有不完整模态的子集上均优于现有最先进方法。

英文摘要

Multimodal magnetic resonance imaging (MRI) is crucial for brain tumor segmentation, with many methods leveraging its four key modalities to capture complementary information for effective sub-region analysis. However, the absence of several modalities is very common in practice, leading to severe performance degradation in existing full-modality segmentation methods. Limited by the structured data model, recent works often adopt a multi-stage training strategy for full-modality and missing-modality scenarios, which increases training costs and inadequately addresses the interference of miss. In this work, we propose a graph-based one-stage framework for robust brain tumor segmentation with missing modalities. Specifically, we introduce modality-specific virtual nodes that serve as supplementary information sources to compensate for missing modalities. To enhance model robustness against arbitrary modality combinations, we leverage the inherent flexibility of graph networks to devise a dynamic connection strategy. This mechanism dynamically adjusts the adjacency matrix based on modality availability, preserving beneficial information flow while mitigating interference effects caused by missing modalities. Furthermore, we enhance the graph network through heterogeneous weight matrices, enhancing its adaptability to multimodal scenarios. Extensive experiments on the BRATS-2018 and BRATS-2020 datasets demonstrate that our method outperforms the state-of-the-art methods on almost all subsets of incomplete modalities.

2605.16874 2026-05-19 cs.AI 版本更新

Reasoning Can Be Restored by Correcting a Few Decision Tokens

通过纠正少量决策标记来恢复推理能力

Changshuo Shen, Leheng Sheng, Yuxin Chen, An Zhang, Xiang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 本文研究了基础模型在生成过程中推理优势的稀疏性,提出了一种基于分歧指导的标记干预方法,在少量干预下显著恢复甚至超越了同规模推理模型的性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型推理模型(LRMs)在具有挑战性的推理基准测试中显著优于其基础LLM counterpart,但基础模型在逐token生成过程中哪里出错以及如何高效缩小这一差距仍不清楚。我们通过量化基础模型与更强推理模型之间在token层面的分布分歧来研究基础推理差距,使用基于似然的分歧度量。在多个基准测试中,我们发现推理优势高度稀疏,集中在少量早期规划相关的决策标记上。例如,在Qwen3-0.6B中,只有约8%的生成标记导致显著分歧,这些标记集中在响应早期,规划相关决策强烈富集(17倍),并且与基础模型的高不确定性重合——表明基础模型主要在早期规划点上失败,这些点引导后续的推理轨迹。基于这些发现,我们提出了分歧指导的标记干预,一种简单的推理时间委托方案,仅在高分歧位置将单个标记的生成委托给推理模型,并立即切换回基础模型。在少量干预预算下,这种稀疏委托显著恢复,甚至在具有挑战性的推理任务上可以超越同规模推理模型的性能。代码可在https://github.com/AlphaLab-USTC/RRTokenIntervention获得。

英文摘要

Large reasoning models (LRMs) substantially outperform their base LLM counterparts on challenging reasoning benchmarks, yet it remains poorly understood where base models go wrong during token-by-token generation and how to narrow this gap efficiently. We study the base-reasoning gap through quantifying token-level distributional disagreement between a base model and a stronger reasoning model using likelihood-based divergences. Across benchmarks, we find that the reasoning advantage is highly sparse and concentrates on a small set of early, planning-related decision tokens. For instance, on Qwen3-0.6B, only ~8% of generated tokens account for the salient disagreement, and these tokens concentrate early in the response, are strongly enriched in planning-related decisions (17x), and coincide with high base-model uncertainty -- suggesting that base models fail mainly at early planning points that steer the subsequent reasoning trajectory. Building on these findings, we propose disagreement-guided token intervention, a simple inference-time delegation scheme that performs a one-token takeover by the reasoning model only at high-disagreement positions and immediately switches back to the base model. With a small intervention budget, this sparse delegation substantially recovers and can even surpass the performance of a same-size reasoning model on challenging reasoning tasks. Code is available at https://github.com/AlphaLab-USTC/RRTokenIntervention.

2605.16872 2026-05-19 cs.CY cs.AI 版本更新

Some[Body] Must Receive That Pain for Agent Accountability

某些身体必须承受痛苦以实现代理问责

Botao Amber Hu, Helena Rong

发表机构 * University of Oxford(牛津大学) New York University Shanghai(纽约大学上海分校)

AI总结 本文研究了人工智能代理问责中的后果接收问题,指出当前LLM代理无法满足必要的身体条件,因此传统法律回应失效,提出需要建立社会技术基础设施来实现后果-代理耦合。

详情
AI中文摘要

人工智能代理在现实世界中日益产生后果。这导致了我们称之为"后果接收"的问题:伤害发生,产生系统被识别,但没有持续的代理接收后果以改变未来行为。将痛苦机械地理解为一种纠正反馈信号,是传统惩罚理论的基础——威慑、康复、报复和 incapacitation 都假设有一个持续的场所接收信号并更新行为。这反过来要求信号能够落地的身体:一个保护完整性的边界,一个信号积累的场所,将事件信号转化为持久更新的整合,以及一个通过改变未来行动来响应的基质。当前的LLM代理——由权重、提示、工具、记忆和凭证组成的软件定义复合体,可以自由交换、复制、重置和重新组装——无法满足这些条件。因此,两种主流法律回应未能实现后果接收。薄身份代理-主体二元组拥有身体但没有"后果-代理耦合":人类为超出其控制的行为承受痛苦——Elish的"道德皱褶区"。厚身份Arbel等人提出的"算法公司"创建了法律可识别的实体,但并不保证任何AI决策架构会将痛苦作为行为信号。因此,实现后果-代理耦合是一个社会技术基础设施问题,而不仅是法律问题。在这样的架构存在之前,高风险的AI部署应继续与可问责的人类主体 tethered,具有有意义的控制、比例责任和终止代理的权力。"如果某些身体因设计而没有承受痛苦,某些身体将因默认而承受痛苦。"

英文摘要

AI agents increasingly act consequentially in the real world. This creates a problem we call \emph{consequence reception}: harm occurs, the producing system is identified, yet no continuing agent receives consequences in a way that changes future behavior. Pain, understood mechanistically as a corrective feedback signal, is foundational to canonical theories of punishment -- deterrence, rehabilitation, retribution, and incapacitation all assume a continuing locus that registers the signal and updates behavior. That, in turn, requires a body for the signal to land on: a boundary whose integrity it protects, a locus where it accumulates, consolidation that converts episodic signal into durable update, and a substrate that responds by altering future action. Current LLM agents -- software-defined composites of weights, prompts, tools, memory, and credentials, freely swapped, copied, reset, and reassembled -- satisfy none of these conditions. The two prevailing legal responses therefore fail to achieve consequence reception. The thin-identity agent-principal dyad has a body but no \emph{consequence--agency coupling}: the human bears pain for behaviors beyond their control -- Elish's \emph{moral crumple zone}. The thick-identity Arbel et al.'s \emph{Algorithmic Corporation} creates legally legible entities but does not guarantee that any AI decision architecture receives pain as a behavioral signal. Achieving consequence-agency coupling is therefore a sociotechnical infrastructural problem, not only a legal one. Until such architectures exist, high-stakes AI deployment should remain tethered to accountable human principals with meaningful control, proportional liability, and authority to constrain or terminate the agent. \emph{If some body does not receive the pain by design, some body will receive it by default.}

2605.16864 2026-05-19 cs.CV cs.AI 版本更新

Metric-Guided Feature Fusion of Visual Foundation Models for Segmentation Tasks

基于度量的视觉基础模型特征融合用于分割任务

Yachan Guo, JoseLuis Gomez Zurita, Danna Xue, Yi Xiao, AntonioManuel Lopez Pena

发表机构 * Universitat Autònoma de Barcelona(巴塞罗那自治大学) Computer Vision Center(计算机视觉中心) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳研究院)

AI总结 本文提出了一种基于度量的特征融合方法,通过评估不同视觉基础模型的特征空间,选择并聚合互补特征以提升密集预测任务的性能。

Comments Accepted to the CVPR 2026 Findings Track

详情
AI中文摘要

尽管大规模视觉基础模型(VFMs)在语义理解方面表现优异,但在实例感知的密集预测任务中仍显不足。它们在表示上存在不同的偏倚:例如,可提示的分割模型(如SAM2)专注于细粒度区域边界,而自监督模型(如DINOv3)强调物体层面的结构。这一观察表明,结合不同VFMs的互补特征可以增强下游密集预测任务。然而,简单的多VFMs融合 seldom 导致可靠的增益,且如何利用其互补特征的可解释原则仍待探索。在本文中,我们提出了一种基于度量的方法,通过显式的评估分数选择并聚合不同VFMs的互补特征。具体而言,我们设计了一套无标签的度量标准,在特征空间的两个方面,结构一致性与边缘保真度,来评估VFM编码器的特征。在这些分数的指导下,我们识别出互补性强的边缘强和结构强的编码器对,并通过主辅融合方案进行整合。这种特征融合不需要复杂的架构更改,并且仅在单个阶段进行训练。我们的模型在多个密集预测任务中相比基线模型表现出一致的性能提升,具有更好的物体层面语义和更准确的边界定位。代码可在{https://github.com/gyc-code/metric-guided-fusion}获取。

英文摘要

Although large-scale visual foundation models (VFMs) achieve remarkable performance in semantic understanding, they still underperform in instance-aware dense prediction tasks. They exhibit different biases in representation: for instance, promptable segmentation models (e.g., SAM2) focus on fine-grained region boundaries, while self-supervised models (e.g., DINOv3) emphasize object-level structure. This observation highlights the potential of combining complementary features from different VFMs to enhance downstream dense prediction tasks. However, naive multi-VFM fusion seldom leads to reliable gains, and interpretable principles for leveraging their complementary features are still underexplored. In this work, we propose a metric-guided approach that effectively selects and aggregates complementary features from different VFMs based on explicit assessment scores. Specifically, we design a suite of label-free metrics in feature space across two aspects, Structural Coherence and Edge Fidelity, to assess features of VFM encoders. Guided by these scores, we identify complementary edge-strong and structure-strong encoder pairs, and integrate them via a master-auxiliary fusion scheme. This feature fusion requires no complex architectural changes and is trained only in a single stage. Our model shows consistent performance gains across multiple dense prediction tasks compared with the baselines, with better object-level semantics and more accurately localized boundaries. The code is available at {https://github.com/gyc-code/metric-guided-fusion}.

2605.16863 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Plan First, Diffuse Later: Extrinsic Graph Guidance for Long-Horizon Diffusion Planning

先规划,后扩散:用于长视距扩散规划的外在图引导

Yaniv Hassidof, Adir Morgan, Yilun Du, Kiril Solovey

发表机构 * Technion(技术Ion大学) Harvard(哈佛大学)

AI总结 本文提出了一种外在搜索引导的扩散模型(XDiffuser),通过在状态空间图上先规划再引导扩散过程,以提高长视距规划的效率和效果,尤其在低质量数据和未见任务中表现优异。

详情
AI中文摘要

组合扩散模型通过去噪多个重叠的子轨迹并确保它们构成全局解,为长视距规划提供了一条有前途的路线。然而,强制在长链上执行局部行为往往不足以产生一致的全局结构。最近的工作通过内在搜索在去噪过程中探索多条路径来解决这一限制。尽管内在搜索提高了全局一致性,但代价是重复评估已经计算密集的模型。在本文中,我们主张在去噪过程之外进行外在搜索,为长视距规划提供更有效的探索模式,同时自然地使经典算法能够解决测试时的未见组合任务。我们的eXtrinsic搜索引导的Diffuser(XDiffuser)首先在状态空间图上计算一个计划——作为扩散模型的轻量级局部连接Oracle。该计划随后用于引导单条轨迹的去噪,有效地将探索负担转移出去。XDiffuser在长视距任务上优于基于扩散的基线,特别是在低质量数据领域和超出目标到达的未见任务中,包括多智能体协调和TSP风格推理。项目网站:https://yanivhass.github.io/XDiffuser-site/

英文摘要

Compositional diffusion models offer a promising route to long-horizon planning by denoising multiple overlapping sub-trajectories while ensuring that together they constitute a global solution. However, enforcing local behavior over long chains is often insufficient for a coherent global structure to emerge. Recent works tackle this limitation through intrinsic search, which explores multiple paths during the denoising process. While intrinsic search improves global coherence, it comes at the cost of repeated evaluations of an already compute-heavy model. In this work, we argue that extrinsic search, performed outside the denoising process, offers a more effective mode of exploration for long-horizon planning while naturally enabling the use of classical algorithms to solve unseen combinatorial tasks at test time. Our eXtrinsic search-guided Diffuser (XDiffuser) first computes a plan over a state-space graph -- serving as a lightweight local connectivity oracle for the diffusion model. The plan is then used to guide denoising for a single trajectory, effectively offloading the burden of exploration. XDiffuser outperforms diffusion-based baselines on long-horizon tasks, with particularly large gains in the low-quality data regime and on unseen tasks beyond goal-reaching, including multi-agent coordination and TSP-style reasoning. Project website: https://yanivhass.github.io/XDiffuser-site/

2605.16861 2026-05-19 cs.CV cs.AI 版本更新

Prefix-Adaptive Block Diffusion for Efficient Document Recognition

前缀自适应块扩散用于高效的文档识别

Mingxu Chai, Ziyu Shen, Chenyu Liu, Kaidi Zhang, Jiazheng Zhang, Dingwei Zhu, Zhiheng Xi, Ruoyu Chen, Jun Long, Jihua Kang, Tao Gui, Qi Zhang

发表机构 * Computation and Artificial Intelligence Innovative College, Fudan University, Shanghai, China(复旦大学计算与人工智能创新学院,上海,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国) ByteDance, Shanghai, China(字节跳动,上海,中国)

AI总结 本文提出前缀自适应块扩散模型(PA-BDM),通过改进块内去噪和缓存机制,提升文档识别的效率和准确性。

Comments 17pages,6 figures

详情
AI中文摘要

块扩散模型(BDMs)支持并行生成、灵活长度输出和KV缓存,使其在高效文档解析中具有潜力。然而,现有BDMs将去噪和缓存承诺绑定到固定的块边界:块内去噪时并行性缩小,而生成的token无法缓存直到整个块完成。此外,块内双向去噪与块间自回归冲突,导致信息流不一致,可能挑战结构敏感的识别。我们提出前缀自适应块扩散模型(PA-BDM),用从前缀到后缀的因果去噪替代块内双向去噪,并将块大小视为最大候选范围而非固定承诺单位。PA-BDM使用置信度门控结构损失(CSL)在扩展训练到更长延续之前构建低熵前缀。在推理过程中,逐步前缀承诺(PPC)则动态地将最长可靠的前缀投入KV缓存,并从更新的前缀重置下一个候选范围,每一步都恢复大的并行解码空间。实验表明,3B PA-BDM在多个基准上实现了更高的识别得分,并在2.5B MinerU-Diffusion上将推理吞吐量提高了71.6%。

英文摘要

Block Diffusion Models (BDMs) support parallel generation, flexible-length output, and KV caching, making them promising for efficient document parsing. However, existing BDMs bind denoising and cache commitment to fixed block boundaries: parallelism shrinks during intra-block denoising, while generated tokens cannot be cached until the whole block is completed. Moreover, intra-block bidirectional denoising conflicts with inter-block autoregression, creating inconsistent information flow that can challenge structure-sensitive recognition. We propose the Prefix-Adaptive Block Diffusion Model (PA-BDM), which replaces intra-block bidirectional denoising with causal denoising from prefix to suffix and treats the block size as a maximum candidate range rather than a fixed commitment unit. PA-BDM uses Confidence-gated Structural Loss (CSL) to build low-entropy prefixes before extending training to longer continuations. During inference, Progressive Prefix Commitment (PPC) then dynamically commits the longest reliable prefix into the KV cache and resets the next candidate range from the updated prefix, restoring a large parallel decoding space at each step. Experiments show that the 3B PA-BDM achieves higher recognition scores on several benchmarks and improves inference throughput by 71.6\% over the 2.5B MinerU-Diffusion.

2605.16860 2026-05-19 cs.LG cs.AI q-bio.QM 版本更新

PhysioSeq2Seq: A Hybrid Physiological Digital Twin and Sequence-to-Sequence LSTM for Long-Horizon Glucose Forecasting in Type 1 Diabetes

PhysioSeq2Seq:一种混合生理数字孪生和序列到序列LSTM的长周期1型糖尿病葡萄糖预测方法

Phat Tran, Neville Mehta, Clara Mosquera-Lopez, Robert H. Dodier, Lizhong Chen, Peter G. Jacobs

发表机构 * Oregon State University(俄勒冈州立大学) Oregon Health & Science University(俄勒冈健康与科学大学)

AI总结 本文提出了一种结合患者特定生理建模与序列到序列LSTM的混合架构PhysioSeq2Seq,用于长周期1型糖尿病葡萄糖预测,通过消除递归误差累积并注入患者匹配的生理状态,提高了预测精度和临床意义。

详情
AI中文摘要

准确的长周期葡萄糖预测对于自动胰岛素输送系统至关重要,这些系统帮助1型糖尿病患者管理血糖并避免危险的低血糖。然而,标准递归长短期记忆网络(LSTM)在更长的周期内由于误差累积存在系统性负偏置,而纯粹的机理微分方程(ODE)模型在群体参数化时无法跨个体泛化。我们提出PhysioSeq2Seq,一种结合患者特定生理建模与序列到序列(Seq2Seq)LSTM的混合架构。对于每个葡萄糖段,双胞胎匹配搜索300个参数化的数字孪生体群体,以从连续葡萄糖监测(CGM)历史中找到最佳拟合的生理匹配。匹配双胞胎的10个内部ODE状态变量被注入到Seq2Seq LSTM的编码器和解码器中。这种同时48步预测策略消除了递归误差累积,而ODE特征提供了一个基于物理的约束,限制了长周期漂移在生理合理范围内。PhysioSeq2Seq在1型糖尿病运动倡议(T1DEXI)数据集中训练了348名参与者的CGM和胰岛素数据,并在74名被排除的参与者上进行评估。在240分钟的预测范围内,PhysioSeq2Seq的平均绝对误差为39.28 mg/dL,平均误差为-10.62 mg/dL,比递归LSTM减少了13.89 mg/dL的偏置,比基于ODE的数字孪生减少了28.62 mg/dL的平均绝对误差。这些结果表明,消除架构反馈并注入患者匹配的生理状态是一种有效且具有临床意义的策略,用于1型糖尿病的长周期葡萄糖预测。

英文摘要

Accurate long-horizon glucose forecasting is critical for automated insulin delivery systems, which help people with type 1 diabetes (T1D) manage their glucose and avoid dangerous hypoglycemia. However, standard recursive long short-term memory (LSTM) networks suffer from systematic negative bias at longer horizons due to error compounding, while purely mechanistic ordinary differential equation (ODE) models fail to generalize across individuals when parameterized at the population level. We propose PhysioSeq2Seq, a hybrid architecture that combines patient-specific physiological modeling with a sequence-to-sequence (Seq2Seq) LSTM. For each glucose segment, twin matching searches a population of 300 parameterized digital twins to identify the best-fitting physiological match from a 3-hour continuous glucose monitoring (CGM) history. The 10 internal ODE state variables of the matched twin are injected as exogenous covariates into both the encoder and decoder of the Seq2Seq LSTM. This simultaneous 48-step prediction strategy eliminates recursive error compounding, while the ODE features provide a physics-grounded constraint that bounds long-horizon drift within physiologically plausible ranges. PhysioSeq2Seq was trained on CGM and insulin data from 348 participants in the Type 1 Diabetes Exercise Initiative (T1DEXI) dataset and evaluated on 74 held-out participants. At the 240-minute horizon, PhysioSeq2Seq achieves a mean absolute error of 39.28 mg/dL and a mean error of -10.62 mg/dL, reducing bias by 13.89 mg/dL over the recursive LSTM and reducing mean absolute error by 28.62 mg/dL over the ODE-based digital twin. These results show that eliminating architectural feedback and injecting patient-matched physiological states is an effective and clinically meaningful strategy for long-horizon glucose forecasting in T1D.

2605.16859 2026-05-19 cs.CV cs.AI 版本更新

VGGT-CD: Training-Free Robust Registration for 3D Change Detection

VGGT-CD:无训练的鲁棒三维变化检测注册

Wei Zhang, Songhua Li, Yihang Wu, Qiang Li, Qi Wang

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 本文提出VGGT-CD方法,通过解耦跨时间注册与动态变化干扰,实现无训练的鲁棒三维变化检测注册,有效减少轨迹误差并提升注册速度。

Comments 13 pages, 5 figures. Code is available at: https://github.com/WZ-CS/VGGT-CD

详情
AI中文摘要

从多视角图像进行三维变化检测对于城市监控、灾难评估和自动驾驶至关重要。然而,现有方法大多在2D领域操作,其中视角变化被误认为物理变化且深度不可用。虽然视觉几何基础模型如VGGT能够快速从未摆正的图像生成密集点云,但独立每轮重建面临根本性障碍:不可预测的跨轮标度模糊、注册-变化悖论以及普遍存在的边缘飞行噪声。为了解决这些挑战,我们提出了VGGT-CD,一种无训练的流水线,将跨时间注册与动态变化干扰解耦。在粗阶段,稀疏关键帧联合推断建立统一的度量空间并产生初始Sim(3)先验。在细阶段,密集重建通过隔离静态背景对应关系进行净化。闭合形式的质心对齐优化平移同时锁定标度和旋转,使用残差自检数学保证非退化。在World Across Time数据集的11场景基准上评估,VGGT-CD在户外将绝对轨迹误差减少了44%,在室内减少了59%。它以6倍于传统方法的速度完成注册,生成高纯度的3D变化地图,无需任务特定训练。

英文摘要

3D change detection from multi-view images is essential for urban monitoring, disaster assessment, and autonomous driving. However, existing methods predominantly operate in the 2D domain, where viewpoint variations are mistaken for physical changes and depth is unavailable. While visual geometry foundation models like VGGT rapidly produce dense point clouds from unposed images, independent per-epoch reconstruction encounters fundamental obstacles: unpredictable inter-epoch scale ambiguity, registration-change paradox where scene changes corrupt alignment, and pervasive edge-flying noise. To address these challenges, we present VGGT-CD, a training-free pipeline decoupling cross-temporal registration from dynamic-change interference. In the Coarse Stage, sparse keyframe joint inference establishes a unified metric space and yields an initial Sim(3) prior. In the Fine Stage, dense reconstructions are purified by isolating static-background correspondences. A closed-form centroid alignment refines the translation while locking scale and rotation, using a residual self-check to mathematically guarantee non-degradation. Evaluated on an 11-scene benchmark from the World Across Time dataset, VGGT-CD reduces Absolute Trajectory Error by 44% outdoors and 59% indoors. It completes registration over 6 times faster, producing high-purity 3D change maps without task-specific training.

2605.16858 2026-05-19 cs.RO cs.AI 版本更新

Pedestrian-Aware LLM-Driven Behavioral Planning for Autonomous Vehicles

面向行人的LLM驱动行为规划用于自动驾驶车辆

Aidana Baimbetova, Haruki Yonekura, Hamada Rizk, Hirozumi Yamaguchi

发表机构 * The University of Osaka, Japan(大阪大学,日本) RIKEN Center for Computational Science, Japan(日本计算科学研究中心) Tanta University, Egypt(埃及塔塔大学)

AI总结 本文提出了一种基于大型语言模型的决策框架,用于自动驾驶车辆在复杂城市环境中考虑行人行为,通过自然语言推理提示将结构化场景观测转换为语言推理,从而生成安全的驾驶决策。

Comments This paper has been accepted for presentation at the 29th IEEE International Conference on Intelligent Transportation Systems (ITSC)

详情
AI中文摘要

自动驾驶车辆(AVs)必须在行人行为多变、有时异常且训练中常未见的密集城市环境中做出可靠决策。基于强化学习(RL)的AV控制系统在结构化交通中表现良好,但在面对不可预测的行人交互和分布外场景时泛化能力较差。其依赖手工制定的奖励和不透明决策进一步限制了其在行人密集、安全关键环境中的适用性。为了解决这些限制,我们引入了一种基于大型语言模型(LLM)的决策框架,用于行人感知的行为规划。该系统将结构化的场景观测转换为自然语言推理提示,使LLM能够推断行人意图、预测风险并生成谨慎的战术驾驶决策。这些决策由运动规划器执行,以确保平滑且动力学可行的控制。我们在SUMO上评估了该框架,涵盖多个行人交互场景,包括意外闯红灯、回退过马路、犹豫和双向过马路。在零样本评估中,基于LLM的智能体实现了68%的无碰撞成功率,显著优于深度RL基线(17.7%)。在单行人场景中使用少量样本的episodic记忆,性能增加到96.0%,超过定制DQN控制器(82.0%)。跨行为评估进一步表明,来自回退交互的记忆可以转移到未见的犹豫和双向过马路场景,分别达到82.0%和90.0%的成功率。该系统能够更早地发起响应,维持更宽的安全缓冲区,并产生可解释、与人类一致的决策。

英文摘要

Autonomous Vehicles (AVs) must make reliable decisions in dense urban environments where pedestrian behavior is variable, sometimes abnormal, and often unseen during training. Reinforcement learning (RL)-based AV control systems perform well in structured traffic but struggle to generalize to unpredictable pedestrian interactions and out-of-distribution scenarios. Their reliance on handcrafted rewards and opaque decisions further limits their suitability for safety-critical, pedestrian-rich environments. To address these limitations, we introduce a Large Language Model (LLM)-based decision-making framework for pedestrian-aware behavioral planning. The system converts structured scene observations into natural-language reasoning prompts, enabling the LLM to infer pedestrian intent, anticipate risk, and generate cautious tactical driving decisions. These decisions are executed by a motion planner that ensures smooth, kinematically feasible control. We evaluate the framework in SUMO across multiple pedestrian-interaction scenarios, including unexpected jaywalking, turn-back crossing, hesitation, and bidirectional crossing. In zero-shot evaluation, the LLM-based agent achieves a 68% collision-free success rate, substantially outperforming deep RL baselines (17.7%). With few-shot episodic memory in a single-pedestrian scenario, performance increases to 96.0%, exceeding a custom DQN controller (82.0%). Cross-behavior evaluation further shows that memory derived from turn-back interactions transfers to unseen hesitation and bidirectional crossing scenarios, achieving 82.0% and 90.0% success, respectively. The system consistently initiates earlier responses, maintains wider safety buffers, and produces interpretable, human-aligned decisions.

2605.16857 2026-05-19 cs.AI 版本更新

Learning to Learn from Multimodal Experience

从多模态经验中学习学习

Xingyu Sui, Weixiang Zhao, Yongxin Tang, Yanyan Zhao, Yang Wu, Dandan Tu, Bing Qin

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 本文提出了一种新的学习范式,即从多模态经验中学习,通过动态构建和利用记忆来提升智能体的性能和泛化能力,解决了传统固定记忆设计在多模态环境中的不足。

详情
AI中文摘要

经验驱动学习已成为一种有前景的范式,使智能体能够通过积累和重用过去经验来改进。然而,现有方法主要在文本环境中开发,并依赖于手动设计的记忆架构,限制了它们在多模态环境中的适用性。在现实场景中,经验本质上是多模态的,涉及感知、推理和行动中的异构信号,这使得有效记忆设计变得更加具有挑战性。特别是,最优的多模态经验结构和利用方式高度依赖于任务,并随时间变化,使得固定记忆设计不足。在本文中,我们提出了一种新的范式,即从多模态经验中学习,将记忆设计从预定义的组件转变为适应性和可学习的过程。我们的框架使智能体能够根据任务需求和交互历史动态构建、组织和利用记忆,有效学习如何结构化经验以提高性能。实验表明,适应性记忆设计显著增强了智能体在多模态任务中的性能和泛化能力,突显了学习记忆机制在经验驱动学习中的关键作用。

英文摘要

Experience-driven learning has emerged as a promising paradigm for enabling agents to improve from interaction trajectories by accumulating and reusing past experience. However, existing approaches are predominantly developed in textual settings and rely on manually designed memory schemas, limiting their applicability to multimodal environments. In real-world scenarios, experience is inherently multimodal, involving heterogeneous signals across perception, reasoning, and action, which makes effective memory design significantly more challenging. In particular, the optimal way to structure and utilize multimodal experience is highly task-dependent and evolves over time, rendering fixed memory designs insufficient. In this work, we propose a new paradigm, learning to learn from multimodal experience, which shifts memory design from a predefined component to an adaptive and learnable process. Our framework enables agents to dynamically construct, organize, and utilize memory based on task requirements and interaction history, effectively learning how to structure experience for improved performance. Experiments demonstrate that adaptive memory design substantially enhances agent performance and generalization across multimodal tasks, highlighting the critical role of learning memory mechanisms in experience-driven learning.

2605.16848 2026-05-19 cs.CV cs.AI cs.CL cs.LG 版本更新

Thinking with Patterns: Breaking the Perceptual Bottleneck in Visual Planning via Pattern Induction

基于模式的思考:通过模式诱导突破视觉规划中的感知瓶颈

Yichang Jian, Boyuan Xiao, Zhenyuan Huang, Yifei Peng, Yao-Xiang Ding

发表机构 * State Key Lab of CAD& CG(CAD与CG国家重点实验室)

AI总结 本文提出通过模式诱导的方法,利用模式推理和模式诱导策略,使视觉语言模型在视觉规划任务中实现更高效和准确的感知与推理,解决传统模型在复杂输入下的感知瓶颈问题。

详情
AI中文摘要

从原始视觉输入进行规划仍然对当前的视觉-语言模型(VLMs)构成重大挑战,当输入复杂度超出其一步感知能力时。受最近在图像思考(TWI)中的进展启发,一种合理的解决方案是通过迭代获取和整合局部视觉证据,将感知过程分解为更简单的步骤。然而,尽管当前VLMs在一般TWI能力上训练良好,但其在规划领域中的感知瓶颈仍然存在。为解决这一挑战,我们将TWI视为一种工具,逐步构建并反映一个准确的内部世界模型。我们发现,由此产生的无训练规划策略使VLMs能够解决远超其初始能力的任务,但代价是过多的TWI操作会显著增加计算开销。为进一步提高效率,我们提出模式推理,一种新的TWI策略,使VLMs能够主动识别新任务中的已知视觉模式并直接推断局部世界模型结构。为了获得这些模式,我们提出模式诱导,一种在线归纳学习策略,将视觉模式视为复合且可重用的专家,这些专家是自主从经验中发现和优化的。在FrozenLake、Crafter和CubeBench领域中的实验评估表明,我们的方法在准确性和效率之间实现了良好的平衡。

英文摘要

Planning from raw visual input remains a significant challenge for current Vision-Language Models (VLMs), when the complexity of input is beyond their one-step perception capability. Motivated by recent advances in Thinking with Images (TWI), a reasonable solution is to decompose the perception process into simpler steps by iteratively acquiring and incorporating local visual evidence. However, even though current VLMs are well-trained in general TWI ability, their perceptual bottleneck in the planning domain remains. To tackle this challenge, we formulate TWI as a tool to gradually build and reflect an accurate internal world model. We find that the resulting training-free planning strategy enables VLMs to solve tasks that are far beyond their initial capabilities, at the cost that too many TWI operations would significantly increase the computational overhead. To further improve efficiency, we propose Pattern Inference, a novel TWI strategy enabling VLMs to actively recognize known visual patterns in the new tasks and directly infer local world model structures. To obtain these patterns, we propose Pattern Induction, an online inductive learning strategy treating visual patterns as composite and reusable experts, which are autonomously discovered and optimized from experience. Experimental evaluations in FrozenLake, Crafter and CubeBench domains show that our approaches achieve a desirable balance between accuracy and efficiency.

2605.16844 2026-05-19 cs.AI 版本更新

Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence

人工适应智能:狭义智能与通用智能之间的缺失阶段

Boris Kriuk

发表机构 * Independent Monograph(独立专著)

AI总结 本文探讨了狭义智能与通用智能之间缺失的机器行为阶段,提出人工适应智能(AAI)的概念,通过定义适应性指数和参数最小性原则,分析了实现AAI的三种路径,并展示了其在多个领域的应用。

详情
AI中文摘要

在我们部署的狭义系统和我们推测的通用智能之间,存在一个从未被命名的机器行为阶段。本文主张这一阶段并非空缺:它是在元学习、神经架构搜索、AutoML、持续学习、进化计算和物理感知建模等技术中悄然汇聚的共同原则,即持续地将人类从参数规范的循环中排除。我们将其命名为人工适应智能(AAI),并对其进行操作性定义:一个系统表现出AAI的程度在于它不需要人类指定的可调超参数,同时在多样化的任务分布中保持竞争性性能。为使定义量化,我们引入了一个适应性指数,该指数衡量在与规模正交的轴上进展的进度,结合了系统吸收的超参数比例与相对于任务专用基线的性能比率。我们发展了参数最小性原则,并基于最小描述长度框架加以阐述,表明适当的超参数数量是由数据决定而非设计者决定。随后,我们围绕实现最小性的三条路径组织该领域:数据和任务感知的配置、结构和进化形态变化,以及训练中的自我适应。我们分析了它们的稳定性、收敛性和治理影响,并通过涵盖航空航天设计、金融制度检测、湍流建模、生态动态和视觉语言系统等案例研究来说明这些路径。本文的论点是:从ANIL到AGI的路径经过AAI,并且命名这一阶段改变了我们测量、构建和称作成功的标准。

英文摘要

Between the narrow systems we deploy and the general intelligence we speculate about lies an entire regime of machine behavior that has never received its own name. This monograph argues that this regime is not empty: it is where meta-learning, neural architecture search, AutoML, continual learning, evolutionary computation, and physics-informed modeling have quietly converged on a common principle, namely the steady removal of the human from the loop of parameter specification. We name this regime Artificial Adaptive Intelligence (AAI) and define it operationally: a system exhibits AAI to the extent that it requires no human-specified tunable hyperparameters while maintaining competitive performance across a diverse distribution of tasks. To make the definition quantitative, we introduce an adaptivity index that measures progress along an axis orthogonal to scale, combining the fraction of hyperparameters absorbed by the system with the performance ratio against a task-specialized baseline. We develop the principle of parametric minimality and ground it in the minimum description length framework, showing that the appropriate hyperparameter count is data-determined rather than designer-determined. We then organize the field around three pathways to minimality: data- and task-aware configuration, structural and evolutionary morphing, and in-training self-adaptation. We analyze their stability, convergence, and governance implications, and illustrate them through case studies spanning aerospace design, financial regime detection, turbulence modeling, ecological dynamics, and vision-language systems. The thesis is that the path from ANI to AGI passes through AAI, and that naming this stage changes what we measure, what we build, and what we call a success.

2605.16842 2026-05-19 cs.AI 版本更新

Sketch Then Paint: Hierarchical Reinforcement Learning for Diffusion Multi-Modal Large Language Models

草图然后绘画:用于扩散多模态大语言模型的分层强化学习

Siqi Luo, Jianghan Shen, Yi Xin, Huayu Zheng, Haoxing Chen, Yan Tai, Yue Li, Junjun He, Yihao Liu, Guangtao Zhai, Yuewen Cao, Xiaohong Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) Nanjing University(南京大学) Shanghai Innovation Institute(上海创新研究院) Peking University(北京大学)

AI总结 本文提出了一种分层强化学习方法HT-GRPO,通过Sketch-Then-Paint训练方案和分层信用分配机制,解决扩散多模态大语言模型在强化学习优化中的关键问题,提升图像质量和审美效果。

详情
AI中文摘要

扩散多模态大语言模型(dMLLMs)在图像生成方面具有强大能力,但通过强化学习(RL)进行优化仍是一个主要挑战。一个主要困难是单张图像可以通过许多不同的去屏蔽序列生成,这使得计算重要性比率往往不可行。此外,现有方法往往忽视dMLLMs的分层生成过程,其中早期标记定义全局布局,后期标记关注局部细节。通过给所有标记分配均匀奖励,这些现有方法未能反映每个标记对最终图像的实际贡献。为了解决这些问题,我们提出了Hierarchical Token GRPO(HT-GRPO),将此层次结构直接整合到策略优化过程中。我们的方法特征一个Sketch-Then-Paint训练方案,将更新过程分为三个不同的阶段:全局、结构和细化。我们还使用一个提示条件估计器来从完全遮蔽状态开始计算重要性比率。此外,我们引入了一种分层信用分配机制,优先考虑关键结构标记,以确保准确的奖励传播。使用两种流行的dMLLM骨干网络MMaDA和Lumina-DiMOO进行的实验表明,HT-GRPO在GenEval和DPG基准上取得了显著成效。在六个额外指标上的评估证实了在图像质量、美学和人类偏好方面的显著改进。

英文摘要

Diffusion Multi-Modal Large Language Models (dMLLMs) are powerful for image generation, but optimizing them through reinforcement learning (RL) remains a major challenge. One primary difficulty is that a single image can be generated through many different unmasking sequences, which makes calculating importance ratios often intractable. Additionally, existing methods tend to ignore the hierarchical generation process of dMLLMs, where early tokens define the global layout and later tokens focus on local details. By assigning uniform rewards to all tokens, these current methods fail to reflect the actual contribution of each token to the final image. To address these issues, we propose Hierarchical Token GRPO (HT-GRPO), which integrates this hierarchy directly into the policy optimization process. Our approach features a Sketch-Then-Paint training scheme that organizes updates into three distinct stages: global, structure, and refinement. We also use a prompt-conditioned estimator to calculate importance ratios starting from a fully masked state. Furthermore, we introduce a Hierarchical Credit Assignment mechanism that prioritizes key structural tokens to ensure accurate reward propagation. Experiments using two popular dMLLM backbones, MMaDA and Lumina-DiMOO, demonstrate that HT-GRPO achieves substantial gains on the GenEval and DPG benchmarks. Evaluations across six additional metrics confirm significant improvements in image quality, aesthetics, and human preference.

2605.16834 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Learning Relative Representations for Fine-Grained Multimodal Alignment with Limited Data

基于有限数据的细粒度多模态对齐的相对表示学习

Shiwon Kim, Yu Rang Park

发表机构 * Yonsei University(延世大学)

AI总结 本文提出了一种基于相对表示的学习方法,用于在有限数据条件下实现细粒度多模态对齐,通过学习token级别的跨模态结构来提升零样本分类、跨模态检索和零样本分割任务的性能。

详情
AI中文摘要

多模态预训练展示了强大的泛化性能,但在缺乏配对数据的领域中,这种范式往往难以实施。一种有前景的替代方法是事后多模态对齐,它通过有限数量的配对示例分别对预训练的单模态编码器进行对齐。然而,现有方法主要关注全局表示的对齐,忽略了片段-token关系。这可能阻碍了需要细粒度跨模态匹配的任务的迁移,超越粗粒度样本层面的语义。为了解决这个问题,我们提出了一种事后对齐方法,通过相对表示学习token级别的跨模态结构。具体来说,我们通过图像和文本与每种模态空间中一组可学习锚点的token级相似性来表示它们,这些锚点被训练以诱导一致的跨模态相似性模式,以匹配对。尽管仅学习锚点而没有重大的投影层,我们的方法在零样本分类、跨模态检索和零样本分割任务中均显著优于现有方法。这突显了在有限配对数据下,建模细粒度跨模态结构对于有效事后多模态对齐的重要性。

英文摘要

Multimodal pre-training demonstrates strong generalization performance, but this paradigm is often impractical in domains where paired data are scarce. A promising alternative is post-hoc multimodal alignment, which aligns separately pre-trained unimodal encoders using a limited number of paired examples. However, existing methods focus primarily on aligning global representations, missing patch-token relations. This may hinder transfer to tasks that require fine-grained cross-modal matching beyond coarse sample-level semantics. To address this issue, we propose a post-hoc alignment method that learns token-level cross-modal structure using relative representations. Specifically, we represent images and texts through their token-level similarities to a set of learnable anchors in each modality space, which are trained to induce consistent cross-modal similarity patterns for matched pairs. Despite learning only the anchors without heavy projection layers, our approach consistently outperforms existing methods in zero-shot classification, cross-modal retrieval, and zero-shot segmentation by a substantial margin. This highlights the importance of modeling fine-grained cross-modal structure for effective post-hoc multimodal alignment with limited paired data.

2605.16828 2026-05-19 stat.ML cs.AI cs.LG stat.ME 版本更新

Prediction-Intervention Games and Invariant Sets

预测-干预博弈与不变集

Linus Kühne, Felix Schur, Jonas Peters

发表机构 * Seminar for Statistics and ETH AI Center(统计研究所和ETH人工智能中心) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文研究了预测-干预博弈中的领导方如何通过选择预测函数来应对跟随方的干预,证明了基于稳定毯的预测在某些情况下优于因果父母的预测,并讨论了实际应用中的策略。

详情
AI中文摘要

我们考虑了一个两位玩家博弈:利用观测数据,领导者选择一个响应变量Y的预测函数,跟随者则在潜在的结构因果模型中对某些协变量进行干预以最大化自身目标。领导者知道干预目标,但可能对跟随者的目标了解有限。我们称这种设置为预测-干预博弈,是Stackelberg博弈的一种特殊情况。找到领导者的最优策略通常很困难。为了避免严重性能损失,领导者可能基于Y的因果父母或更一般地基于协变量的不变子集来选择预测。我们证明,对于两种常见的跟随者目标类别,基于稳定毯(特定不变子集)的预测总是更好或至少与基于因果父母的预测一样好。我们进一步通过允许的干预的最坏情况风险上界来上界领导者干预后的风险,并加强现有的分布泛化结果以分析此界限:我们给出了稳定毯预测在某些条件下最坏情况最优的充分条件,并通过例子表明这些条件不能一般被删除。最后,我们讨论了已知和未知图的实际情况中的实用策略,并在模拟和现实数据上测试了这些策略。

英文摘要

We consider the following two-player game: using observational data, the leader chooses a prediction function for a response variable $Y$ from given covariates. The follower then reacts with an intervention on some covariates in the underlying structural causal model to maximize their own objective. The leader knows the intervention targets, but may have limited knowledge of the follower's objective. We call this setup a prediction-intervention game, a special case of a Stackelberg game. Finding an optimal strategy for the leader is generally difficult. To avoid severe performance loss, the leader may base their prediction on the causal parents of $Y$, or more generally on an invariant subset of covariates. We prove, for two common classes of follower objectives, that predictors based on the stable blanket, a specific invariant subset, are always better or as good as those based on the causal parents. We further upper bound the leader's post-intervention risk by a worst-case risk over allowed interventions and strengthen existing distribution generalization results to analyze this bound: we give sufficient conditions under which stable-blanket predictors are worst-case optimal, and show by examples that these conditions cannot in general be dropped. Finally, we discuss practical strategies for settings with known and unknown graph, and test them on simulated and real-world data.

2605.16827 2026-05-19 cs.AI 版本更新

Voices in the Loop: Mapping Participatory AI

循环中的声音:参与式AI的映射

Rashid Mushkani

发表机构 * Right to AI(人工智能权利) Mila -- Québec AI Institute(魁北克人工智能研究所)

AI总结 本文研究了参与式AI在公共、公民和人道主义领域的应用,提出了一个开放的参与式AI倡议存储库和交互式地图集,通过整合Maga~na和Shilton的可信AI语料库以及额外的审计案例,揭示了参与式AI在地理分布、参与层级、生命周期、组织形式、验证状态和文档缺口等方面的模式,并展示了如何通过版本发布、记录链接的问题和注释通道、模式反馈流程和删除或限制披露请求来实现参与式AI基础设施的设计和治理框架。

Comments Accepted to The 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26), June 25--28, 2026, Montreal, QC, Canada

详情
AI中文摘要

参与式人工智能的方法在公共、公民和人道主义领域日益被记录,但关于参与如何组织的证据仍碎片化。本文报告了构建一个开放的参与式AI倡议存储库和交互式地图集的过程,使用了Maga~na和Shilton的可信AI语料库中的记录进行协调,以及来自研究和实践的额外审计案例。我们贡献了三个要素。首先,我们指定了一个可重复的发现、审查、协调、地理编码、来源追踪和基于发布版本的发布协议。第二,我们报告了语料库层面在地理、参与层级、生命周期位置、组织形式、验证状态和剩余文档缺口的模式。记录的倡议仍然集中在少数国家,而参与通常被编码在问题制定、评估和治理阶段,而不是模型开发或训练阶段。第三,我们展示了地图集如何通过版本发布、记录链接的问题和注释通道、模式反馈流程以及删除或限制披露请求来实现参与式AI基础设施的设计和治理框架。地图集旨在通过一个可以更新、争议和重用的活体清单,支持比较研究、政策学习和社区审查。

英文摘要

Participatory approaches to artificial intelligence are increasingly documented across public, civic, and humanitarian settings, but evidence about how participation is organized remains fragmented. This paper reports on the construction of an open repository and interactive atlas of participatory AI initiatives, using records harmonized from Maga~na and Shilton's Trustworthy AI corpus, and additional audited cases from research and practice. We contribute three elements. First, we specify a reproducible protocol for discovery, vetting, harmonization, geocoding, provenance tracking, and release-based publication of participatory AI records. Second, we report corpus-level patterns in geography, participation tiers, lifecycle loci, organizational form, verification status, and remaining documentation gaps. Documented initiatives remain concentrated in a small number of countries, while participation is most often coded at problem formulation, evaluation, and governance rather than model development or training. Third, we show how the atlas operationalizes a design and governance framework for participatory-by-default AI infrastructures through versioned releases, record-linked issue and annotation channels, schema feedback workflows, and redaction or restricted-disclosure requests. The atlas is intended to support comparative research, policy learning, and community scrutiny through a living inventory that can be updated, contested, and reused.

2605.16826 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Decoupling KL and Trajectories: A Unified Perspective for SFT, DAgger, Offline RL, and OPD in LLM Distillation

解耦KL与轨迹:为LLM蒸馏中的SFT、DAgger、离线RL和OPD提供统一视角

Anhao Zhao, Haoran Xin, Yingqi Fan, Junlong Tong, Wenjie Li, Xiaoyu Shen

发表机构 * Eastern Institute of Technology(东技术院) The Hong Kong Polytechnic University(香港理工大学) Shanghai Jiao Tong University(上海交通大学) Thrust of Artificial Intelligence, The Hong Kong University of Science and Technology (Guangzhou)(人工智能 thrust,香港科学与技术大学(广州))

AI总结 本文探讨了知识蒸馏中KL散度与轨迹之间的耦合问题,通过解耦两个轴向,提出了四种有效的蒸馏目标,并通过实验揭示了KL方向、前缀源和训练长度之间的权衡关系,提出了KL混合和熵门长度课程等实用方法。

Comments Code available at https://github.com/EIT-NLP/Decoupled-Distill

详情
AI中文摘要

知识蒸馏是LLM后训练的核心,但其设计空间仍不明确,尤其是在与强化学习(RL)结合时。我们展示了主流范式,即离线蒸馏和在线蒸馏(OPD),隐含地耦合了两个正交选择:前缀源和token级KL方向。这源于将序列级KL分解为自回归响应分布的KL:前向KL将教师前缀与token级前向KL配对,而反向KL将学生前缀与token级反向KL配对。我们主张这种耦合并非本质:解耦这两个轴向会产生四个有效的目标。我们建立了梯度级恒等式,显示前向KL给出SFT风格的交叉熵匹配,而反向KL给出RL风格的策略梯度目标,连接到离线SFT、DAgger风格的在线SFT、离线RL风格的蒸馏和OPD。我们在数学推理上进行了广泛的受控研究,评估了四个目标作为独立方法和后续RL的初始化。结果揭示了三个权衡:KL方向引起准确度-熵权衡,前缀源引起质量-计算权衡,训练长度引起准确度-稳定性权衡。受这些发现启发,我们提出了KL混合和熵门长度课程。KL混合显示长序列蒸馏需要显著的前向KL权重以防止熵崩溃和长度膨胀而不牺牲准确性。熵门长度课程提高了Avg@k和Pass@k分别3.6和高达5.8个点,并将平均响应长度减少了约3倍。我们的结果提供了一个框架和实用方法,用于设计平衡准确度、多样性、计算和RL行为的推理蒸馏目标。

英文摘要

Knowledge distillation is central to LLM post-training, yet its design space remains poorly understood, especially alongside reinforcement learning (RL). We show that the prevailing paradigms, off-policy distillation and on-policy distillation (OPD), implicitly couple two orthogonal choices: prefix source and token-level KL direction. This follows from decomposing sequence-level KL over autoregressive response distributions: forward KL pairs teacher prefixes with token-level forward KL, and reverse KL pairs student prefixes with token-level reverse KL. We argue this coupling is not intrinsic: decoupling the two axes yields four valid objectives. We establish gradient-level identities showing forward KL gives SFT-style cross-entropy matching with teacher soft targets, whereas reverse KL gives an RL-style policy-gradient objective with a dense teacher-student log-ratio reward, connecting them to off-policy SFT, DAgger-style on-policy SFT, offline-RL-style distillation, and OPD. We conduct an extensive controlled study on math reasoning, evaluating the four objectives both as standalone methods and as initializations for subsequent RL. The results reveal three tradeoffs: KL direction induces an accuracy-entropy tradeoff, prefix source a quality-compute tradeoff, and training length an accuracy-stability tradeoff. Motivated by these findings, we propose KL mixing and an entropy-gated length curriculum. KL mixing shows long-sequence distillation requires substantial forward-KL weight to prevent entropy collapse and length inflation without sacrificing accuracy. The entropy-gated length curriculum improves Avg@k and Pass@k by 3.6 and up to 5.8 points, and cuts average response length by roughly 3x versus fixed long-horizon training. Our results provide a framework and practical methods for designing reasoning distillation objectives that balance accuracy, diversity, compute, and RL behavior.

2605.16821 2026-05-19 cs.AI 版本更新

Multi-Paradigm Agent Interaction in Practice:A Systematic Analysis of Generator-Evaluator, ReAct Loop,and Adversarial Evaluation in the buddyMe Framework

多范式代理交互实践:buddyMe框架中生成器-评估器、ReAct循环和对抗性评估的系统分析

Xiaohua Wang, Chao Han, Kai Yu, XiaoLiang Xu, Liang Wang

发表机构 * BuddyMe Research Team(buddyMe研究团队) CHARMMIRAEL Biotech Co., Ltd(CHARMMIRAEL生物技术有限公司)

AI总结 本文通过系统分析buddyMe框架中生成器-评估器、ReAct循环和对抗性评估三种代理交互范式,揭示了在实际应用中多范式代理系统的设计挑战与优化策略。

Comments 11 pages, 7 tables

详情
AI中文摘要

大型语言模型(LLM)代理的快速演进产生了多样的交互范式,但很少有生产系统在一个统一的架构中整合多种范式。本文系统分析了三种主要的代理交互范式,包括多代理协调(生成器-评估器)、ReAct工具使用循环和记忆增强交互,这些范式在buddyMe开源的多模型代理编程框架中得到实现。我们正式化了一个五阶段的处理流程:需求预审查->任务分解->ReAct执行->实际执行验证->对抗性评估讨论,并建立了一个六维评估方案,采用加权评分。通过四个实证案例研究,基于真实世界部署日志中的博物馆导游生成、定时天气任务和综合旅游规划,我们得出三个关键结论。第一,生成器-评估器预审查在20%的复杂任务中检测到需求遗漏,其中80%的任务通过初步审查。第二,ReAct循环确保了子任务执行的稳定性,但导致大约30%的冗余工具调用。第三,对抗性评估者-防御者讨论在2-3轮内达成共识,适用于近70%的场景,主要用于内容细化而非逻辑反转。我们还提供了三种基于Mermaid的架构图,并在六个系统维度上与CrewAI、AutoGen、LangGraph、MemGPT和A-Mem进行了跨范式比较。研究结果为构建稳定可靠的多范式代理系统提供了实用的设计指南。

英文摘要

The rapid evolution of Large Language Model (LLM) agents has produced diverse interaction paradigms, yet few production systems integrate multiple paradigms within a unified architecture. This paper presents a systematic analysis of three principal agent interaction paradigms, including Multi-Agent Orchestration (Generator-Evaluator), ReAct Tool-Use Loops, and Memory-Augmented Interaction, as implemented in buddyMe, an open-source multi-model agent programming framework. We formalize a five-stage processing pipeline: Requirement Pre-Review -> Task Decomposition -> ReAct Execution -> Real-Execution Verification -> Adversarial Evaluation Discussion, and establish a six-dimensional evaluation schema with weighted scoring. Through four empirical case studies drawn from real-world deployment logs covering museum guide generation, scheduled weather tasks, and comprehensive tour planning, we draw three key conclusions. First, Generator-Evaluator pre-review detects requirement omissions in 20 percent of complex tasks, with 80 percent tasks passing initial inspection. Second, the ReAct loop ensures stable subtask execution but leads to around 30 percent redundant tool invocations. Third, adversarial Evaluator-Defender discussions reach consensus within 2-3 rounds for nearly 70 percent of scenarios, functioning mainly for content refinement rather than logical reversal. We additionally provide three Mermaid-based architectural diagrams and conduct cross-paradigm comparisons with CrewAI, AutoGen, LangGraph, MemGPT and A-Mem across six system dimensions. The research outcomes offer practical design guidelines for constructing stable and reliable multi-paradigm agent systems.

2605.16819 2026-05-19 cs.CL cs.AI cs.LG 版本更新

AgentKernelArena: Generalization-Aware Benchmarking of GPU Kernel Optimization Agents

AgentKernelArena: GPU核优化代理的通用化意识基准测试

Sharareh Younesian, Wenwen Ouyang, Sina Rafati, Mehdi Rezagholizadeh, Sharon Zhou, Ji Liu, Yue Liu, Yuchen Yang, Hao Li, Ziqiong Liu, Dong Li, Vikram Appia, Zhenyu Gu, Emad Barsoum

发表机构 * AMD

AI总结 本文提出AgentKernelArena,一个用于评估GPU核优化代理的开源基准,通过隔离工作区和统一评分机制,测试代理在不同任务和硬件目标上的性能和通用化能力,发现大多数任务在正确性和编译效率上表现优异,但在PyTorch到HIP的转换任务中存在显著的正确性下降。

详情
AI中文摘要

GPU核优化对于高效深度学习系统日益关键,但编写高性能核仍然需要大量的低级专业知识。最近的AI编码代理可以迭代阅读代码、调用编译器和性能分析器,并优化实现,但现有的核基准测试仅评估单个LLM调用而非完整的代理工作流程,且未包含核到核的优化和未见过的配置泛化测试。我们提出了AgentKernelArena,一个开源的基准测试,用于衡量AI编码代理在GPU核优化上的能力。该基准测试包含196个任务,涵盖HIP到HIP的优化、Triton到Triton的优化以及PyTorch到HIP的转换,并在隔离的工作区中使用门控编译、正确性和性能检查,集中评分和一个未见过的配置泛化协议,测试优化是否转移到代理从未见过的输入配置。在包括Cursor Agent、Claude Code和Codex Agent在内的生产代理中,我们发现大多数任务在正确性和编译效率上表现优异,最强配置在PyTorch到HIP任务中平均加速达6.89倍,在HIP到HIP任务中达6.69倍,在Triton到Triton任务中达2.13倍。我们的未见过的配置评估显示,HIP到HIP和Triton到Triton的优化大多能转移到未见过的输入形状,而PyTorch到HIP的转换则表现出显著的正确性下降,表明生成核的代理经常硬编码形状特定的假设。AgentKernelArena被设计为一个模块化、可扩展的框架,用于严格评估跨代理、任务和硬件目标的代理GPU核优化。

英文摘要

GPU kernel optimization is increasingly critical for efficient deep learning systems, but writing high-performance kernels still requires substantial low-level expertise. Recent AI coding agents can iteratively read code, invoke compilers and profilers, and refine implementations, yet existing kernel benchmarks evaluate single LLM calls rather than full agent workflows, and none include both kernel-to-kernel optimization and unseen-configuration generalization testing. We present AgentKernelArena, an open-source benchmark for measuring AI coding agents on GPU kernel optimization. The benchmark contains 196 tasks spanning HIP-to-HIP optimization, Triton-to-Triton optimization, and PyTorch-to-HIP translation, and evaluates complete agent workflows in isolated workspaces using gated compilation, correctness, and performance checks, centralized scoring and an unseen-configuration generalization protocol that tests whether optimizations transfer to input configurations the agent never observed. Across production agents including Cursor Agent, Claude Code, and Codex Agent, we find near-perfect compilation and high correctness rates on most task categories, with the strongest configurations achieving mean speedups of up to 6.89x on PyTorch-to-HIP, 6.69x on HIP-to-HIP, and 2.13x on Triton-to-Triton tasks. Our unseen-configuration evaluation shows that HIP-to-HIP and Triton-to-Triton optimizations largely transfer to unseen input shapes, while PyTorch-to-HIP exhibits substantial correctness drops, indicating that agents generating kernels from scratch frequently hardcode shape-specific assumptions. AgentKernelArena is designed as a modular, extensible framework for rigorous evaluation of agentic GPU kernel optimization across agents, tasks, and hardware targets.

2605.16818 2026-05-19 cs.CV cs.AI 版本更新

Observation-Aligned Mask Priors for Learning Physical Dynamics from Authentic Occlusions

基于观测对齐的遮罩先验学习物理动态的遮罩方法

Chiyuan Ma, Zihan Zhou, Tianshu Yu

发表机构 * School of Data Science, The Chinese University of Hong Kong, Shenzhen(数据科学学院,香港大学(深圳))

AI总结 本文提出了一种基于观测对齐的遮罩先验方法,通过学习真实的遮罩分布来构建上下文-查询分区,从而在不完整数据上训练物理动态学习。该方法利用贝叶斯流网络预训练二进制遮罩,结合全局归一化交叉熵目标生成与稀疏观测对齐的样本特定遮罩,从而避免零查询死区和局部生成崩溃。

详情
AI中文摘要

直接从不完整观测中学习物理动态具有挑战性,因为真实的遮罩是结构化的、样本依赖的,并且常常不是随机缺失的,而现有方法通常依赖启发式遮罩规则或预定义的遮罩分布。我们提出Observation-Aligned Mask Priors框架,该框架学习真实的观测遮罩分布,并利用其构建上下文-查询分区以从不完整数据中训练。具体来说,我们先在二进制观测遮罩上预训练一个贝叶斯流网络(BFN)以捕捉真实的遮罩拓扑结构,然后通过全局归一化交叉熵目标引导BFN采样,生成与每个稀疏观测对齐的样本特定遮罩。遮罩与观测遮罩的交集定义为上下文,剩余的观测条目成为扩散模型的查询目标。我们证明,这种基于交集的分区使每个有效的观测维度都有严格正的概率被查询,防止零查询死区和局部生成崩溃。在三个具有真实卫星遮罩的现实世界海洋学数据集上,跨分辨率至256×256的实验显示,在MSE和PSNR上优于强扩散基线的一致改进。这些结果表明,从真实遮罩中学习遮罩先验是学习不完整物理观测的有效替代方法,无需访问完全观测的场数据。

英文摘要

Learning physical dynamics directly from incomplete observations is challenging because authentic occlusions are structured, sample-dependent, and often missing not at random, whereas existing methods typically rely on heuristic masking rules or predefined mask distributions. We propose Observation-Aligned Mask Priors, a framework that learns the distribution of authentic observation masks and uses it to construct context-query partitions for training from incomplete data. Specifically, we pretrain a Bayesian Flow Network (BFN) on binary observation masks to capture real occlusion topologies, then guide BFN sampling with a globally normalized cross-entropy objective to generate sample-specific masks aligned with each sparse observation. The intersection between the guided mask and the observed mask defines the context, and the remaining observed entries become query targets for a diffusion-based reconstruction model. We show that this intersection-based partitioning gives every valid observed dimension a strictly positive probability of being queried, preventing zero-query dead zones and local generative collapse. Experiments on three real-world oceanographic datasets with authentic satellite occlusions, across resolutions up to 256$\times$256, show consistent improvements over strong diffusion baselines in MSE and PSNR. These results demonstrate that learning mask priors from authentic occlusions is an effective alternative to heuristic masking for learning from incomplete physical observations without access to fully observed fields.

2605.16806 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Cross-modal Affinity-aligned Multimodal Learning Analytics for Predicting Student Collaboration Satisfaction in Game-Based Learning

跨模态亲和对齐的多模态学习分析用于预测基于游戏的学习中学生协作满意度

Wen-Hsin Tsai, Chia-Ming Lee, Yuk-Ying Tung

发表机构 * Institute of Education, National Cheng Kung University(国立成功大学教育研究所) Institute of Intelligent System, National Yang Ming Chiao Tung University(阳明交通大学智能系统研究所) Department of Computer Science, University at Albany, State University of New York(纽约州立大学水牛城分校计算机科学系)

AI总结 本文提出了一种跨模态亲和对齐的多模态学习分析框架,通过建模模态间关系和对比学习来增强学生协作满意度预测的鲁棒性和可解释性。

Comments Accetped by CVPR 2026 CVxEdu Workshop

详情
AI中文摘要

协作式基于游戏的学习环境为小组知识构建提供了丰富的机遇,但自动预测学生协作满意度仍具挑战性。关键障碍是模态退化:在教育部署中,个体模态如眼动在学生群体间表现出不一致的信息量,导致基于隐式注意力的融合产生脆弱的多模态表示。我们提出了亲和对齐多模态学习分析(AAMLA)框架,其核心贡献是跨模态亲和引导的模态对齐(CAMA)模块,该模块通过亲和矩阵显式建模模态间关系,并通过对比学习强制跨模态一致性,从而实现对无信息模态的自适应抑制而不丢弃它们。AAMLA进一步应用模态特定的投影层,将异构特征,包括面部动作单元、头部姿态、眼动和交互痕迹日志,映射到统一的语义空间,然后再进行对齐。在EcoJourneys协作学习环境中的50名中学生实验表明,在标准和模态退化条件下,AAMLA在单模态基线和先前跨注意力方法上均表现出一致的改进,SHAP和t-SNE分析证实CAMA能够产生稳健且可解释的跨模态表示,用于学生协作建模。

英文摘要

Collaborative game-based learning environments offer rich opportunities for small-group knowledge construction, yet automatically predicting student collaboration satisfaction remains challenging. A critical barrier is modality degradation: in educational deployments, individual modalities such as eye gaze exhibit inconsistent informativeness across student cohorts, causing implicit attention-based fusion to produce brittle multimodal representations. We propose the Affinity-Aligned Multimodal Learning Analytics (AAMLA) framework, whose core contribution is the Cross-modal Affinity-guided Modality Alignment (CAMA) module, which explicitly models inter-modal relationships via affinity matrices and enforces cross-modal consistency through contrastive learning, enabling adaptive suppression of uninformative modalities without discarding them. AAMLA further applies modality-specific projection layers to map heterogeneous features, including facial action units, head pose, eye gaze, and interaction trace logs, into a unified semantic space prior to alignment. Experiments on 50 middle school students in the EcoJourneys collaborative learning environment demonstrate consistent improvements over unimodal baselines and prior cross-attention approaches under standard and modality degradation conditions, with SHAP and t-SNE analyses confirming that CAMA produces robust, interpretable cross-modal representations for student collaboration modeling.

2605.16795 2026-05-19 cs.CV cs.AI cs.GR 版本更新

3DPhysVideo: Consistency-Guided Flow SDE for Video Generation via 3D Scene Reconstruction and Physical Simulation

3DPhysVideo: 通过3D场景重建和物理模拟的一致性引导流SDE用于视频生成

Hwidong Kim, Yunho Kim, Tae-Kyun Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 本文提出了一种无需训练的管道,通过3D场景重建和物理模拟生成逼真视频,利用一致性引导流SDE分解预测速度以确保条件输入的一致性,从而在多物体和流体交互场景中实现从单张图像到物理合理视频的过渡。

Comments Project page: https://hwidong-kim.github.io/projects/3DPhysVideo

详情
AI中文摘要

视频生成模型取得了显著进展,但它们常常产生违反物理动态基础的视觉伪影。最近的工作如PhysGen3D通过网格重建和基于物理的渲染处理单张图像到3D物理,但在建模流体动力学、多物体交互和照片级真实感方面仍存在挑战。本文介绍了3DPhysVideo,一种新颖的无训练管道,能够从单张图像生成物理真实的视频。我们重新利用现成的视频模型进行两个阶段。首先,我们将其用作新的视图合成器,通过引导图像到视频(I2V)流模型使用渲染点云来重建完整的360度3D场景几何。其次,在应用物理求解器到此几何后,物理模拟的点云用于引导相同的I2V流模型以合成最终的高质量视频。一致性引导流SDE将I2V流模型预测的速度分解为去噪和一致性偏差,强制条件输入的一致性,使我们能够有效地重新利用模型进行3D重建和模拟引导的视频生成。在包括多物体和流体交互场景在内的多样化实验中,我们的方法成功地从单张图像过渡到物理合理的视频,同时在单个消费级GPU上运行高效。它在GPT基线得分、VideoPhy基准和人类评估中优于最先进的基线。

英文摘要

Video generative models have made remarkable progress, yet they often yield visual artifacts that violate grounding in physical dynamics. Recent works such as PhysGen3D tackle single image-to-3D physics through mesh reconstruction and Physically-Based Rendering, but challenges remain in modeling fluid dynamics, multi-object interactions and photorealism. This work introduces 3DPhysVideo, a novel training-free pipeline that generates physically realistic videos from a single image. We repurpose an off-the-shelf video model for two stages. First, we use it as a novel view synthesizer to reconstruct complete 360-degree 3D scene geometry by guiding the image-to-video (I2V) flow model with rendered point clouds. Second, after applying physics solvers to this geometry, the physically simulated point cloud is used to guide the same I2V flow model to synthesize final, high-quality videos. Consistency-Guided Flow SDE, which decomposes the predicted velocity of the I2V flow model into denoising and consistency bias, enforces consistency to the conditional inputs, allowing us to effectively repurpose the model for both 3D reconstruction and simulation-guided video generation. In the diverse experiments including multi-objects, and fluid interaction scenes, our method successfully bridges the gap from single-images to physically plausible videos, while remaining efficient to run on a single consumer GPU. It outperforms state-of-the-art baselines on GPT-based scores, VideoPhy benchmark and human evaluation.

2605.16790 2026-05-19 cs.LG cs.AI cs.CL 版本更新

TIER: Trajectory-Invariant Execution Rewards for Multi-Step Tool Composition

TIER: 用于多步工具组合的轨迹不变执行奖励

Anay Kulkarni, ChiaEn Lu, Dheeraj Mekala, Jayanth Srinivasa, Gaowen Liu, Jingbo Shang

发表机构 * UC San Diego(加州大学圣迭戈分校) Cisco Research(思科研究)

AI总结 本文提出TIER,一种基于函数模式和运行时执行的奖励框架,能够提供密集且可解释的序列级反馈,支持多种解决方案策略并适应变化的工具接口,在DepthBench等基准上实现了高准确率。

Comments Preprint. Submitted to NeurIPS 2026. 28 pages, 7 figures, 8 tables. Code and datasets available at https://github.com/anaykulkarni/TIER

详情
AI中文摘要

工具使用使大语言模型能够通过一系列API调用解决复杂任务,但现有的强化学习方法无法扩展到多步骤组合设置。基于结果的奖励只能提供稀疏反馈,而轨迹监督的奖励依赖于注释的参考解决方案,惩罚有效的替代方案并限制可扩展性。我们提出TIER:轨迹不变执行奖励,一种奖励框架,其监督直接来自函数模式和运行时执行,而非参考轨迹。该奖励分解为格式有效性、模式遵守、执行成功和答案正确性,提供来自细粒度验证的单个步骤工具使用反馈。这种设计允许任何有效的执行路径获得信用,自然支持多种解决方案策略并适应变化的工具接口。在DepthBench,一个按深度(1到6步)分层的组合基准上,TIER在所有步骤中实现了>90%的准确率,其中轨迹监督的奖励在第4步之后崩溃。我们进一步在BFCL v3和NestFUL等基准上展示了持续的提升。消融研究确认所有奖励组件都是必要的,突显了多级监督对于组合推理的重要性。

英文摘要

Tool use enables large language models to solve complex tasks through sequences of API calls, yet existing reinforcement learning approaches fail to scale to multi-step composition settings. Outcome-based rewards provide only sparse feedback, while trajectory-supervised rewards depend on annotated reference solutions, penalizing valid alternatives and limiting scalability. We propose TIER: Trajectory-Invariant Execution Rewards, a reward framework that derives supervision directly from function schemas and runtime execution, rather than from reference trajectories. The reward decomposes into format validity, schema adherence, execution success, and answer correctness, providing dense, interpretable sequence-level feedback derived from fine-grained verification of individual steps of tool use. This design allows any valid execution path to receive credit, naturally supporting multiple solution strategies and adapting to evolving tool interfaces. On DepthBench, a compositional benchmark stratified by depth (1 to 6 steps), TIER achieves >90% accuracy across steps, where trajectory-supervised rewards collapse beyond step-4. We further demonstrate consistent gains on benchmarks like BFCL v3 and NestFUL. Ablation studies confirm that all reward components are necessary, highlighting the importance of multi-level supervision for compositional reasoning.

2605.16785 2026-05-19 cs.CV cs.AI 版本更新

Encoding Robust Topological Signatures for Hyperdimensional Computing

为超维计算编码鲁棒的拓扑特征

Arpan Kusari

发表机构 * University of Michigan Transportation Research Institute(密歇根大学交通研究院) University of Michigan(密歇根大学)

AI总结 本文提出了一种基于拓扑特征的超维计算方法,通过提取离散拓扑原始特征并结合RTS不变的形状签名,提高了超维计算在旋转、噪声和遮挡等扰动下的鲁棒性,实验表明其在多个数据集上优于传统方法。

详情
AI中文摘要

超维(HD)计算由于其简单性、快速的原型基推断和与在线更新的兼容性,为边缘学习提供了一个有吸引力的替代方案。然而,标准的基于像素的HD编码器容易受到分布偏移的影响,如旋转、噪声或遮挡,会显著降低准确性。我们从二值化形状中提取离散拓扑原始特征——尤其是孔洞,并将它们与旋转/平移/缩放(RTS)不变的形状签名配对。我们的方法为(i)外轮廓使用空间金字塔变体的Zernike矩构建RTS稳定的描述符,(ii)每个孔洞使用其径向签名的内在傅里叶描述符以及RTS-标准相对几何。每个原始特征通过随机投影和角色绑定映射到双极超向量,并通过排列不变的捆绑聚合变量卡数的孔洞集以形成单个图像超向量。为了避免过度加权任何线索,我们通过在验证集上融合余弦相似度学习Zernike和孔洞通道的非负可靠性权重。在MNIST和EMNIST数据集上进行的实验表明,拓扑引导的HD计算相比传统HD基线显著提高了鲁棒性,保持了多个扰动家族的高精度,并受益于轻量级在线训练。与在干净数据上训练的紧凑CNN相比,我们的方法在清洁精度上具有竞争力,同时对几种像素级扰动具有明显更强的鲁棒性,证明了显式拓扑结构是实现鲁棒HD表示的可行途径。代码在https://github.com/arpan-kusari/Topological-HDC提供。

英文摘要

Hyperdimensional (HD) computing offers an attractive alternative to deep networks for edge learning due to its simplicity, fast prototype-based inference, and compatibility with online updates. However, standard pixel-based HD encoders are brittle: small distribution shifts such as rotation, noise, or occlusion can drastically reduce accuracy. We extract discrete topological primitives-most notably holes-from binarized shapes and pair them with rotation/translation/scale (RTS)-invariant shape signatures. Our method constructs RTS-stable descriptors for (i) the outer shape using a spatial-pyramid variant of Zernike moments and (ii) each hole using an intrinsic Fourier descriptor of its radial signature together with RTS-canonical relative geometry. Each primitive is mapped to a bipolar hypervector via randomized projection and role binding, and variable-cardinality hole sets are aggregated by permutation-invariant bundling to form a single image hypervector. To avoid over-weighting any cue, we learn nonnegative reliability weights for the Zernike and hole channels on a validation set via late fusion of cosine similarities. Experiments on MNIST and EMNIST under controlled corruptions (rotation, Gaussian noise, salt-and-pepper, cutout, zoom) show that Topology-guided HD computing substantially improves robustness compared with a naive HD baseline, maintaining high accuracy across multiple corruption families and benefiting from lightweight online training. Compared with a compact CNN trained on clean data, our method achieves competitive clean accuracy while offering markedly stronger robustness to several pixel-level corruptions, demonstrating that explicit topological structure is a practical route to robust HD representations. The code is provided at https://github.com/arpan-kusari/Topological-HDC.

2605.16779 2026-05-19 cs.CV cs.AI 版本更新

A Holistic Method for Superquadric Fitting Using Unsupervised Clustering Analysis

一种基于无监督聚类分析的超二次曲面拟合整体方法

Mingyang Zhao, Sipu Ruan, Xiaohong Jia

发表机构 * State Key Laboratory of Mathematical Sciences, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(数学科学国家重点实验室,数学与系统科学学院,中国科学院) University of Chinese Academy of Sciences(中国科学院大学) Robotics Institute, School of Mechanical Engineering and Automation, Beihang University(北京航空航天大学机械工程与自动化学院机器人研究所)

AI总结 本文提出了一种新的方法,用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合,通过无监督聚类分析重新定义问题,实现了刚性和变形超二次曲面的一体化拟合,同时提供了闭式解析解和收敛性证明。

Comments 20 pages, Code: https://github.com/zikai1/SuperquadricFitting

Journal ref IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE, 2026

详情
AI中文摘要

本文提出了一种新的方法,用于在存在噪声和异常值的情况下对点云进行超二次曲面拟合,该方法在多个领域具有广泛的应用。与以往仅专注于拟合刚性或变形超二次曲面或存在鲁棒性和数值稳定性问题的方法不同,我们的方法从无监督聚类的新视角重新定义问题,使刚性和变形超二次曲面的拟合能够在统一的框架中完成。我们的方法核心是一种受无监督聚类分析启发的稳定优化函数,其中我们将点云数据和潜在参数曲面的样本分别作为聚类成员和质心。然后,具有动态更新质心位置的聚类过程成为优化超二次曲面参数的直接代理,建立了几何拟合与聚类动态之间的原则性联系。我们进一步推导了聚类质心与聚类成员之间的成对计算与正交距离之间的关系,从而有效消除了耗时的曲面采样过程。此外,我们的公式为模糊成员度向量和协方差矩阵提供了闭式解析解,确保了高效迭代优化,并能够更有效地处理几何变形。此外,我们还提供了收敛性分析的理论证明,并证明了聚类启发的拟合方法通过内在增加目标函数的凸性来逃避局部极小值。实现已公开在https://github.com/zikai1/SuperquadricFitting。

英文摘要

This work presents a novel method for fitting superquadrics to point clouds under the contamination of noise and outliers, which has many applications for shape modeling across diverse fields. Unlike prior approaches that either exclusively focus on fitting rigid or deformable superquadrics, or suffer from robustness and numerical instability issues, our method redefines the problem from a new unsupervised clustering perspective, enabling the holistic fitting of both rigid and deformable superquadrics within a unified framework. Central to our approach is a stable optimization function inspired by unsupervised clustering analysis, where we formulate the point cloud data and samples from the potential parametric surface as clustering members and centroids, respectively. Then, the clustering process with dynamic updates to centroid locations serves as a direct proxy for optimizing superquadric parameters, establishing a principled link between geometric fitting and clustering dynamics. We further derive the relationship between pairwise computations of clustering centroids and clustering members to orthogonal distances, effectively eliminating the need for the time-consuming surface sampling process. Moreover, our formulation provides closed-form analytical solutions for both the fuzzy membership degree vector and the covariance matrix, ensuring efficient iteration optimization and enabling more effective handling of geometric deformations. In addition, we provide a theoretical certificate of convergence analysis and demonstrate that the clustering-inspired fitting method can escape local minima by inherently increasing the convexity of the objective function. The implementation is publicly available at https://github.com/zikai1/SuperquadricFitting.

2605.16776 2026-05-19 cs.LG cs.AI 版本更新

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

可区分删除:统一知识擦除与拒绝用于大语言模型去学习

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen

发表机构 * Department of Natural Language Processing, MBZUAI. University of Oxford. RIKEN Center for Advanced Intelligence Project. TMLR Group, Department of Computer Science, Hong Kong Baptist University

AI总结 本文提出D^2方法,通过限制潜在表示中的响应分布来擦除不受欢迎的知识,同时区分保留知识,从而实现安全且一致的拒绝机制,以提高大语言模型去学习的效果。

Comments ICML2026 Accepted

详情
AI中文摘要

减轻敏感和有害输出对于确保大型语言模型(LLM)的安全部署至关重要。现有方法通常遵循两种范式:知识删除(KD),在训练期间擦除不受欢迎的信息,以及可区分拒绝(DR),在推理期间引导模型远离使用敏感知识。尽管进展迅速,基于KD的去学习在抑制特定令牌序列作为完整知识移除替代物时面临偏见删除的问题,而基于DR的去学习则因底层知识仍然完整而有重新出现有害知识的风险。为了解决这些问题,我们提出了可区分删除(D^2),一种通过限制潜在表示中的响应分布来擦除不受欢迎知识,同时区分保留知识的范式,从而能够安全且一致地处理去学习的输入。为了实现D^2,我们引入了一个能量指数,该指数量化了知识的存在以及去学习内容与保留内容之间的分离。数学和实证分析表明,能量既准确又高效,使能量基于去学习对齐(EUA)能够在训练期间强制执行能量边界去学习,并在推理时应用基于能量的拒绝机制。广泛的实验表明,EUA显著优于先前方法,表明D^2的优越性。我们的代码可在https://github.com/Puning97/EUA-for-LLM-Unlearning获取。

英文摘要

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.

2605.16775 2026-05-19 cs.CV cs.AI cs.LG 版本更新

VolTA-3D: Self-Supervised Learning for Brain MRI using 3D Volumetric Token Alignment

VolTA-3D: 基于3D体积分块对齐的脑MRI自监督学习

Amy Makawana, Abhijeet Parida, Marius George Linguraru, Julia Ive, Syed Muhammad Anwar

发表机构 * Institute of Health Informatics(健康信息学研究所) Sheikh Zayed Institute for Pediatric Surgical Innovation(谢赫扎耶德儿童外科创新研究所) School of Medicine and Health Sciences(医学与健康科学学院)

AI总结 本文提出VolTA-3D,一种用于脑MRI自监督学习的3D视觉Transformer框架,通过联合对齐全局类风格标记和局部块标记,增强体积分块表示的可迁移性,从而在多个下游任务中表现出更好的泛化能力和鲁棒性。

Comments Accepted at EMBC 2026

详情
AI中文摘要

自监督学习(SSL)通过利用大规模未标记数据推动了医学图像分析的发展。然而,在脑磁共振成像(MRI)中,大多数3D模型仍局限于分割或分类任务,限制了其在不同数据集、成像协议和下游任务中的泛化能力。这种缺乏可迁移性限制了3D MRI模型的临床应用,尽管存在大量未标记的体数据。我们提出了Volta-3D,一种自监督的3D视觉Transformer框架,旨在学习可迁移的体表示。Volta-3D在学生-教师范式中联合对齐全局类风格标记和局部块标记,并强制细粒度结构重建。这种联合全局-局部对齐解决了脑MRI中有限的语义多样性和细微解剖特征,这对现有SSL方法构成了挑战。我们在多个分布外下游任务上评估了Volta-3D,包括海马体分割和性别及阿尔茨海默病与健康对照的分类。在所有任务中,Volta-3D学习的表示均优于随机初始化的基线,证明了其在域偏移下的改进可迁移性和鲁棒性。因此,在预训练过程中联合强制全局语义一致性和局部结构学习,使模型能够从未标记的脑MRI数据中学习更广泛的概念。总体而言,VolTA-3D支持有效的多任务下游性能,具有任务特定的适应性,是迈向通用化和临床可行的3D模型的一步。

英文摘要

Self-supervised learning (SSL) has advanced medical image analysis be enabling learning form large unlabelled data. However, in brain magnetic resonance imaging (MRI), most 3D models remain specialized for either segmentation of classification, limiting their ability to generalize across datasets, imaging protocols,, and downstream tasks. This lack of transferability constrains the clinical utility of 3D MRI models, despite the availability of unlabeled volumetric data. We present Volta-3D, a self-supervised 3D Vision Transformer framework designed to learn transferable volumetric representations. Volta-3D jointly aligns global class-style tokens and local patch tokens within a student-teacher paradigm and enforces fine-grained structural reconstruction. This combined global-local alignment addresses the limited semantic diversity and subtle anatomical characteristics of brain MRI, which challenges existing SSL approaches. We evaluate Volta-3D on multiple out-of-distribution downstream tasks, including hippocampal segmentation and classification of sex and Alzheimer's disease versus healthy controls. Across all tasks, representations learned by Volta-3D outperform randomly initialized baselines, demonstrating improved transferability and robustness under domain shift. Hence jointly enforcing global semantic consistency and local structural learning during pretraining enables broader concept learning from unlabeled brain MRI data. Overall VolTA-3D supports effective multi-task downstream performance with task-specific pertaining, a step towards generalizable and clinically viable 3D models.

2605.16774 2026-05-19 cs.CV cs.AI 版本更新

CANSURF: An ASV-View Can Dataset and Benchmark for Detection and Tracking of Surface-Level Debris

CANSURF:一种ASV视角的可回收物数据集和基准,用于表面级垃圾的检测与跟踪

Zaid Aljundi, Zahra F. Rahmatullah, Mostafa Elemam, Abdullah Moosa

发表机构 * School of Mathematical and Computer Sciences(数学与计算机科学学院) Heriot-Watt University Dubai(惠顿大学迪拜分校) School of Engineering and Physical Sciences(工程与物理科学学院)

AI总结 本文提出了一种新的ASV视觉系统和表面可回收物数据集,用于在水面条件下检测和跟踪小型反射性垃圾,如铝罐。数据集包含约7.3k张原始图像,经过十种增强方法扩展至约57k张训练/验证图像,涵盖了多样的光照和水状态。通过基准测试,训练YOLOv11在CANSURF数据集上提升了12倍的性能,展示了数据集的价值。实验表明,YOLOv11+ByteTrack在稳定跟踪和多目标准确性方面表现最佳,而YOLOv11+SAHI在远距离罐子的召回率上有所提升,但精度有所下降。考虑到任务需求,YOLOv11 + SAHI在检测最大数量的罐子方面表现更好。

Comments Published in the 2025 8th International Conference on Signal Processing and Information Security (ICSPIS). Published and available to view on IEEE Xplore

Journal ref Proc. 2025 8th Int. Conf. Signal Processing and Information Security (ICSPIS), 2025, pp. 1-6

详情
AI中文摘要

表面级海洋垃圾仍然是自主清洁任务中的实际瓶颈,其中小型、反射性的目标(如铝罐)必须在强光、波浪和部分淹没条件下从远处检测。本文提出了一种ASV视觉系统和一个新的表面可回收物数据集。该数据集包含约7.3k张从视频中提取的原始图像,并通过十种增强类型扩展至约57k张训练/验证图像,涵盖了多样化的光照和水状态。一组针对表面操作定制的检测器和检测-跟踪管道进行了基准测试。在CANSURF上训练YOLOv11的性能比通用数据集提高了12倍,突显了数据集的价值。实验表明,YOLOv11+ByteTrack在稳定跟踪(较少的身份切换)和多目标准确性方面表现最佳,而YOLOv11+SAHI在远距离罐子的召回率上有所提升,但精度在全上下文输入中有所下降。鉴于任务配置,单罐拾取与接近和抓取,YOLOv11 + SAHI在检测最大数量的罐子方面表现更好。没有先前的公开数据集针对从水面视角在水面上检测铝罐;此数据集填补了这一空白,并支持可重复的评估。

英文摘要

Surface-level marine debris remains a practical bottleneck for autonomous clean-up, where small, reflective targets (e.g., aluminum cans) must be detected at distance under glare, ripples, and partial submersion. This paper presents, an ASV vision system and a new surface-can dataset. The dataset comprises ~7.3k raw images extracted from videos and annotated with bounding boxes, expanded via ten augmentation types to ~57k training/validation images spanning diverse lighting and water states. A family of detector and detector-tracker pipelines tailored to surface operations were benchmarked. Training YOLOv11 on CANSURF boosts performance 12x over generic datasets, highlighting the dataset's value. Experiments show that YOLOv11+ByteTrack yields the most stable tracks (fewer identity switches) and stronger multi-object accuracy under, while YOLOv11+SAHI increases recall on far-field cans at the cost of lower precision in full-context inputs. Given the mission profile, single-can pickup with approach and grab, YOLOv11 + SAHI proves better for detecting the maximum number of cans. No prior open dataset targets aluminum cans on water from a surface-level viewpoint; this dataset fills this gap and supports reproducible evaluation.

2605.16770 2026-05-19 cs.CL cs.AI 版本更新

Exploring Lightweight Large Language Models for Court View Generation

探索用于法院视图生成的轻量级大语言模型

Zhitian Hou, Tianyong Hao, Nanli Zeng, Zhixiong Chao, Kun Zeng

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University(中山大学计算机科学与工程学院) School of Computer Science, South China Normal University(华南师范大学计算机学院) China Mobile Internet Co., Ltd.(中国移动互联网有限公司)

AI总结 本文研究了轻量级大语言模型在法院视图生成中的能力及其对指控预测的影响,探讨了模型架构、大小对性能的影响,以及轻量级LLM与深度神经网络在任务中的比较,同时开发了CVGEvalKit评估框架。

详情
AI中文摘要

刑事法院视图生成(CVG)是法律人工智能(Legal AI)中的关键任务,涉及根据案件事实生成法院视图。在本工作中,我们系统地探索了轻量级(小于2B参数)大语言模型(LLMs)在CVG中的能力及其对指控预测的影响。我们的研究解决了四个关键问题:(1)不同架构的LLMs如何影响CVG质量和指控预测;(2)LLMs的大小如何影响性能;(3)轻量级LLMs在这些任务中与深度神经网络(DNNs)的比较;(4)通过先生成法院视图再预测指控与直接预测指控的比较。此外,我们还开发了CVGEvalKit评估框架,包括三个公开可用的数据集用于CVG任务以及预测其指控。在该框架上进行了全面实验,模型在混合训练集上训练,并在每个数据集的测试集上评估。实验结果提供了关于模型架构、模型大小和不同任务之间影响的权衡的新见解,突显了轻量级LLMs在司法AI应用中的潜力。源代码匿名地可在\url{https://github.com/ZhitianHou/CVGEvalKit}获取。

英文摘要

Criminal Court View Generation (CVG) is a critical task in Legal Artificial Intelligence (Legal AI), involving the generation of court view based on case facts. In this work, we systematically explore the capabilities of lightweight (smaller than 2B) large language models (LLMs) in CVG and their impact on charge prediction. Our study addresses four key questions: (1) how does different architecture of LLMs affect the CVG quality and charge prediction. (2) how does LLMs size contribute to the performance, (3) how do lightweight LLMs compare with Deep Neural Networks (DNNs) in these tasks, and (4) how does predicting charge by court view generation first compare with predicting it directly. Additionally, we also develop CVGEvalKit, an evaluation framework including three public available datasets for CVG tasks, as well as predicting their charges. Comprehensive experiments are conducted on this framework, where models are trained on a mixed training set and evaluated on each dataset's test set. Experimental results provide new insights into the trade-offs between model architecture, model size, and the influence between different tasks, highlighting the potential of lightweight LLMs in judicial AI applications. The source code is anonymously available at \url{https://github.com/ZhitianHou/CVGEvalKit}

2605.16757 2026-05-19 cs.AI cs.MA stat.ME stat.ML 版本更新

NeuroMAS: Multi-Agent Systems as Neural Networks with Joint Reinforcement Learning

NeuroMAS: 多智能体系统作为神经网络的多智能体系统

Haoran Lu, Luyang Fang, Wenxuan Zhong, Ping Ma

发表机构 * Department of Statistics(统计系) University of Georgia(佐治亚大学)

AI总结 本文提出NeuroMAS,一种将多智能体系统视为可训练和可扩展的神经网络架构的方法,通过联合强化学习提升多智能体系统的性能和可扩展性。

详情
AI中文摘要

多智能体语言系统通常被构建为人工设计的工作流,其中智能体被分配语义角色,通信协议在提前指定。我们提出NeuroMAS,一种方法,首先将多智能体语言系统视为可训练和可扩展的神经网络-like架构,其中LLM智能体作为节点,中间文本信号作为边。在NeuroMAS中,智能体节点是无角色但结构感知的:拓扑结构只决定信息如何一般流动,而强化学习训练决定如何通信、专业化和协调。这种表法将多智能体设计从工作流工程转向架构设计,其中深度、宽度、连接性和增长协议成为可扩展的能力来源。进一步,我们提供了一个理论视角,说明为何这种模块化文本计算在任务允许层次分解时更具参数效率。实验表明,NeuroMAS在推理时间和训练多智能体基线方面均有显著提升。我们进一步发现,组织扩展是路径依赖的:更大的系统从头开始训练具有挑战性,但当从较小的训练系统逐步扩展时变得可行。这些结果表明,学习的神经多智能体系统是LLM的有前景的扩展轴。

英文摘要

Multi-agent language systems are often built as hand-designed workflows, where agents are assigned semantic roles and communication protocols are specified in advance. We propose NeuroMAS, a method that first treats a multi-agent language system as a trainable and scalable neural-network-like architecture with LLM agents as nodes and intermediate textual signals as edges. In NeuroMAS, agent nodes are role-free but structure-aware: the topology only determines how information can flow in general, while reinforcement learning training determines how nodes communicate, specialize, and coordinate. This formulation shifts multi-agent design from workflow engineering toward architecture design, where depth, width, connectivity, and growth protocol become scalable sources of capability. Further, we provide a theoretical perspective showing why such modular textual computation is more parameter-efficient when tasks admit hierarchical decompositions. Experiments show that NeuroMAS improves significantly over both inference-time and trained multi-agent baselines. We further find that organizational scaling is path-dependent: larger systems can be challenging to train from scratch, but become feasible when grown progressively from smaller trained systems. These results suggest that learned neural multi-agent systems are a promising scaling axis for LLMs.

2605.16755 2026-05-19 cs.LG cs.AI 版本更新

Learning Unbiased Permutations via Flow Matching

通过流匹配学习无偏排列

Yimeng Min, Carla P. Gomes

发表机构 * Department of Computer Science(计算机科学系) Cornell University(康奈尔大学)

AI总结 本文提出PermFlow框架,通过在具有单位行和列和的矩阵仿射子空间上直接操作,学习多模态排列分布,避免了基于熵正则化Sinkhorn方法在模糊性下的崩溃问题。

详情
AI中文摘要

学习排列对于排序、排名和匹配至关重要,但现有的基于熵正则化Sinkhorn的可微方法会产生单一的软解,并在模糊性下崩溃。我们提出了PermFlow,一种条件流匹配框架,直接在具有单位行和列和的矩阵仿射子空间上操作。一个闭式切线空间投影器通过构造而非迭代校正,精确保持这些约束沿每条轨迹。一个最近目标耦合将不同的噪声初始值引导到不同的有效排列。结果是一个能够捕捉多模态排列分布而非将其坍缩到单一模式的模型。在具有混合数字模糊性的视觉排序任务和对称线性分配问题上,PermFlow在无歧义输入上具有高精度,并在模糊性下恢复两个有效排列,而基于Sinkhorn的基线方法在结构上失败。

英文摘要

Learning permutations is fundamental to sorting, ranking, and matching, but existing differentiable methods based on entropy-regularized Sinkhorn produce a single softened solution and collapse under ambiguity. We present PermFlow, a conditional flow matching framework that operates directly on the affine subspace of matrices with unit row and column sums. A closed-form tangent-space projector preserves these constraints exactly along every trajectory, by construction rather than through iterative correction, and a nearest-target coupling routes distinct noisy initializations toward distinct valid permutations. The result is a model that captures multimodal permutation distributions rather than collapsing them to a single mode. On a visual sorting task with blended-digit ambiguity and a symmetric linear assignment problem, PermFlow achieves high accuracy on unambiguous inputs and recovers both valid permutations under ambiguity, where Sinkhorn-based baselines structurally fail.

2605.16750 2026-05-19 cs.IR cs.AI 版本更新

UniER: A Unified Benchmark for Item-level and Path-level Exercise Recommendation

UniER:一项用于项目级和路径级练习推荐的统一基准

Xinghe Cheng, Guiyong Zhuang, Yusheng Xie, Jiapu Wang, Yixin Liu, Quanlong Guan, Liangda Fang, Shirui Pan

发表机构 * Jinan University(济南大学) Beijing University of Technology(北京理工大学) Griffith University(格里菲斯大学)

AI总结 本文提出UniER统一基准,用于比较项目级和路径级练习推荐方法,通过引入加权认知收益指标,揭示了路径级推荐在系统性上的优势以及项目级推荐在极端稀疏性和噪声下的教学失败。

详情
AI中文摘要

个性化练习推荐动态地将教学资源与个体知识掌握对齐,这对于满足现代教育中学生动态学习需求至关重要。该领域目前由两种主导范式驱动:项目级练习推荐(ILER)优化即时单步状态转移,而路径级练习推荐(PLER)构建连贯的学习路径以最大化累积收益。尽管两者有相同的最终目标,但不同的评估设置使这两种研究方向孤立,阻碍了统一基准和公平比较。为填补这一空白,本文提出了一个统一的练习推荐基准(UniER),这是一个综合的评估框架,统一了ILER和PLER。具体来说,我们引入了加权认知收益(WCG)作为统一的度量标准,以衡量跨范式算法的性能。我们的基准涵盖了9个数据集,覆盖四种生成方法,促进了18种代表性ILER/PLER方法的比较。通过涵盖有效性、通用性、鲁棒性和效率的多维分析,我们的结果揭示了PLER在系统性上的主导地位,并揭示了在极端稀疏性和噪声下ILER碎片化推荐的教学失败。此外,我们提供了UniER的开源代码库,以促进可重复研究,并概述了未来研究的潜在方向。

英文摘要

Personalized exercise recommendation dynamically aligns pedagogical resources with individual knowledge mastery, which is crucial for satisfying students' dynamic learning needs in modern education. The field is currently driven by two dominant paradigms: Item-Level Exercise Recommendation (ILER) optimizes for immediate single-step state transitions, while Path-Level Exercise Recommendation (PLER) constructs coherent learning paths to maximize cumulative gains. Despite sharing the same ultimate objective, disparate evaluation setups have kept these two lines of research isolated, hindering unified benchmarking and fair comparison. To fill the gap, in this paper, we present a Unified Benchmark for Exercise Recommendation (UniER), a comprehensive evaluation framework that unifies ILER and PLER. Specifically, we introduce Weighted Cognitive Gain (WCG) as a unified metric to measure cross-paradigm algorithmic performance. Our benchmark encompasses 9 datasets spanning four generation methods, facilitating the comparison of 18 representative ILER/PLER methods. Through multi-dimensional analyses covering effectiveness, generalizability, robustness, and efficiency, our results reveal the systematic dominance of PLER and expose the pedagogical failure of ILER's fragmented recommendations under extreme sparsity and noise. Furthermore, we provide an open-source codebase of UniER to foster reproducible research and outline potential directions for future investigations.

2605.16748 2026-05-19 cs.GR cs.AI cs.CV cs.LG cs.MA cs.MM 版本更新

Genflow Ad Studio: A Compound AI Architecture for Brand-Aligned, Self-Correcting Video Generation

Genflow Ad Studio:一种用于品牌一致、自我纠正视频生成的复合AI架构

Debanshu Das, Lavi Nigam, Sunil Kumar Jang Bahadur, Gopala Dhar

发表机构 * Google(谷歌)

AI总结 本文提出Genflow Ad Studio,一种复合AI架构,通过品牌DNA提取模块和对抗性多代理质量控制循环,提高了品牌一致的视频生成效率,将合规率从42%提升到89%。

Comments 6 pages, 2 figures, 2 tables. Accepted to the ACM Conference on AI and Agentic Systems (CAIS '26). Includes demo video and code repository links

Journal ref ACM Conference on AI and Agentic Systems (CAIS '26), May 26-29, 2026, San Jose, CA, USA

详情
AI中文摘要

近期生成视频模型的进步展示了高水平的视觉保真度,但其在企业环境中的整合受到时间不一致性和严重的品牌不一致性的限制。当前的单体架构难以强制执行严格的品牌约束,经常产生未经批准的视觉资产。我们介绍了Genflow,一种复合AI系统,旨在生成媒体生产中强制执行品牌一致性。我们的架构集成了基于检索的'品牌DNA'提取模块,以参数化生成方式根据已确立的企业身份指南进行生成。此外,我们实现了对抗性多代理质量控制(QC)循环。与单次生成流程不同,此流程采用评估代理,反复批评生成的帧,与提取的参数进行比较,促使生成模型细化输出,直到达成确定性的一致性。通过转向多阶段、自我纠正的流程,Genflow将品牌合规视频生成的产量从42%提高到89%,建立了稳健的框架,用于可扩展的、企业级的生成系统。

英文摘要

Recent advancements in generative video models demonstrate high visual fidelity, yet their integration into enterprise environments is restricted by temporal inconsistencies and severe brand misalignment. Current monolithic architectures struggle to enforce rigid brand constraints, frequently hallucinating unapproved visual assets. We introduce Genflow, a Compound AI System designed to enforce brand consistency in generative media production. Our architecture integrates a retrieval-based 'Brand DNA' extraction module to parameterize generation according to established corporate identity guidelines. Furthermore, we implement an Adversarial Multi-Agent Quality Control (QC) loop. Instead of a single-pass generation, this pipeline employs evaluator agents to iteratively critique generated frames against the extracted parameters, prompting generator models to refine outputs until a deterministic consensus is reached. By transitioning to a multi-stage, self-correcting pipeline, Genflow improved the yield of brand-compliant video generations from 42% to 89%, establishing a robust framework for scalable, enterprise-grade generative systems.

2605.16746 2026-05-19 cs.AI cs.LG 版本更新

State Contamination in Memory-Augmented LLM Agents

内存增强型大语言模型代理中的状态污染

Yian Wang, Agam Goyal, Yuen Chen, Hari Sundaram

发表机构 * Department of Computer Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校计算机科学系)

AI总结 研究探讨了内存增强型大语言模型代理中由于状态污染导致的安全问题,通过分析内存总结中的毒性内容传播,提出了一种新的衡量指标,并指出在信息压缩前进行净化可以有效减少潜在影响。

详情
AI中文摘要

LLM代理越来越多地依赖持久化状态,包括转录文本、摘要、检索上下文和内存缓冲区,以支持长周期交互。这使得安全性不仅取决于个体模型输出,还取决于代理存储和后来重用的内容。我们研究了一种称为内存清洗的故障模式:有毒或对抗性上下文可以被压缩成内存摘要,这些摘要在标准检测器下不再显得有毒,但仍保留了影响未来生成的敌对框架或冲突结构。通过配对的反事实多代理模拟,我们证明有毒起源的内存摘要可以保持在常见毒性阈值以下,但相对于匹配的中性基线,仍会增加下游毒性。为了衡量这种隐藏影响,我们引入了子阈值传播间隙(SPG),它量化了在部署监控器视为安全的内存状态下,下游行为差异。我们的实验表明,毒性通过不同的状态通道传播:原始转录文本重用驱动显性下游毒性,而压缩的内存则携带隐藏的子阈值影响。我们进一步发现,缓解依赖于干预位置。在摘要前净化有毒状态可显著减少隐藏传播间隙,而仅清洁完成的摘要则可能保留被清洗的影响。这些结果表明,内存增强型代理的安全性应被视为对演进上下文的状态控制问题,净化应在不安全信息被压缩进持久内存之前应用。

英文摘要

LLM agents increasingly rely on persistent state, including transcripts, summaries, retrieved context, and memory buffers, to support long-horizon interaction. This makes safety depend not only on individual model outputs, but also on what an agent stores and later reuses. We study a failure mode we call memory laundering: toxic or adversarial context can be compressed into memory summaries that no longer appear toxic under standard detectors, while still preserving hostile framing or conflict structure that influences future generations. Using paired counterfactual multi-agent rollouts, we show that toxic-origin memory summaries can remain below common toxicity thresholds while nevertheless increasing downstream toxicity relative to matched neutral baselines. To measure this hidden influence, we introduce the sub-threshold propagation gap (SPG), which quantifies downstream behavioral differences conditioned on memory states that a deployed monitor would classify as safe. Our experiments show that toxicity propagates through distinct state channels: raw transcript reuse drives overt downstream toxicity, while compressed memory carries hidden sub-threshold influence. We further find that mitigation depends critically on intervention placement. Sanitizing toxic state before summarization substantially reduces the hidden propagation gap, whereas cleaning only the completed summary can leave laundered influence intact. These results suggest that safety in memory-augmented agents should be treated as a state-control problem over evolving context, with sanitization applied before unsafe information is compressed into persistent memory.

2605.16728 2026-05-19 cs.AI 版本更新

Body-Grounded Perspective Formation and Conative Attunement in Artificial Agents

具身视角形成与意图同调在人工体中

Hongju Pae

发表机构 * Active Inference Institute, CA, USA(主动推断研究所,加利福尼亚州,美国)

AI总结 本文提出了一种最小架构,用于人工体中的具身视角形成,通过引入内感受性活力信号、Fisher式度量以及意图同调机制,展示了如何在无奖励的网格世界中将学习到的身体倾向转化为稳定的体定向行为。

详情
AI中文摘要

本文提出了一种最小架构,用于人工体中的具身视角形成。在扩展先前工作的同时,该模型引入了内感受性活力信号,一种基于融合的外感受性和内感受性状态的Fisher式度量,以及将身体倾向与行动准备性联系起来的意图同调机制。在无奖励的网格世界中,意图将学习到的身体倾向转化为稳定的体定向行为,而身体到视角的路由允许身体扰动在视角潜在空间中留下可恢复的几何残差。本研究展示了如何通过具身组织世界如何呈现给代理的方式,在现象学意义上实现人工主体性的最小结构性条件的操作化。

英文摘要

This paper proposes a minimal architecture for body-grounded perspective formation in artificial agents. Extending prior work, the model introduces an interoceptive viability signal, a Fisher-style metric over fused exteroceptive-interoceptive states, and a conative alignment mechanism linking bodily tendency to action readiness. In a reward-free gridworld, conation converts learned bodily tendency into stable body-directed behavior, while body-to-perspective routing allows bodily perturbations to leave a recoverable geometric residue in the perspective latent. This study shows how minimal structural conditions for artificial subjectivity can be operationalized in the phenomenological sense, through the embodied organization of how a world is given to an agent.

2605.16727 2026-05-19 cs.AI 版本更新

PopuLoRA: Co-Evolving LLM Populations for Reasoning Self-Play

PopuLoRA: 为推理自博弈的协同进化LLM种群

Roger Creus Castanyer, Geoffrey Bradway, Lorenz Wolf, Maxwill Lin, Augustine N. Mavor-Parker, Matthew James Sargent

发表机构 * Absolute Zero Reasoner(绝对零理由器) LoRA adapters(LoRA适配器)

AI总结 本文提出PopuLoRA,一种基于种群的非对称自博弈框架,用于强化学习中可验证奖励(RLVR)的后训练LLM。通过专门的LoRA适配器在共享冻结基座上进行教师和学生分工,教师提出问题,学生在程序验证器下解决,不同亚种群间的交叉评估取代了限制单智能体自博弈的自我校准。LoRA权重空间进化算子家族作为7B规模种群训练循环的替代步骤,实现了种群的协同进化竞赛。

详情
AI中文摘要

我们介绍了PopuLoRA,一种基于种群的非对称自博弈框架,用于强化学习中可验证奖励(RLVR)的后训练LLM。教师和学生是专门的LoRA适配器,共享冻结基座:教师提出问题,匹配的学生在程序验证器下解决,亚种群间的交叉评估取代了限制单智能体自博弈的自我校准。一组LoRA权重空间进化算子(在几秒钟内产生同等级种群成员的突变和交叉)作为7B规模种群训练循环的替代步骤。我们将在Absolute Zero Reasoner上实现PopuLoRA,并将其与一个每适配器计算匹配的单智能体基线进行比较。当单智能体自我校准到可以可靠解决的问题时,种群进入协同进化竞赛:教师产生越来越复杂的问题,学生解决率波动,问题空间覆盖持续扩展。尽管训练时间奖励较低,种群均值在三个代码基准(HumanEval+, MBPP+, LiveCodeBench)和七个数学基准(AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench)上均优于基线,并且种群中最弱的成员在汇总上也优于基线。

英文摘要

We introduce PopuLoRA, a population-based asymmetric self-play framework for reinforcement learning with verifiable rewards (RLVR) post-training of LLMs. Teachers and students are specialised LoRA adapters on a shared frozen base: teachers propose problems, matched students solve them under a programmatic verifier, and cross-evaluation between sub-populations replaces the self-calibration that limits single-agent self-play. A family of LoRA weight-space evolution operators (mutations and crossovers that produce same-rank population members in seconds) serves as the replacement step of a population-based training loop at 7B scale. We instantiate PopuLoRA on top of Absolute Zero Reasoner and compare it against a per-adapter compute-matched single-agent baseline. Where the single agent self-calibrates to generating easy problems it can reliably solve, the population enters a co-evolutionary arms race: teachers produce increasingly complex problems, student solve rates oscillate, and problem-space coverage keeps expanding throughout training. Despite lower training-time reward, the population mean outperforms the baseline on three code benchmarks (HumanEval+, MBPP+, LiveCodeBench) and seven math benchmarks (AIME 24/25, AMC 23, MATH-500, Minerva, GSM8K, OlympiadBench), and even the weakest member of the population beats the baseline on aggregate.

2605.16726 2026-05-19 cs.AI 版本更新

A Global-Local Graph Attention Network for Traffic Forecasting

面向交通预测的全局-局部图注意力网络

Tianchi Zhang

AI总结 本文提出一种全局-局部图注意力网络(GLGAT),通过成对编码和基于事件的邻接矩阵,解决传统图卷积网络和图注意力网络在处理顶点异质性时的复杂性问题,有效捕捉时空相关性并在交通预测中取得竞争优势。

详情
AI中文摘要

交通预测是智能交通系统的重要组成部分。交通预测中的关键挑战之一是发现时空相关性。近年来,图卷积网络和图注意力网络已取代传统统计模型来预测未来交通。然而,这两种方法都难以让顶点具有非常不同的特性。为了解决这个问题,我们提出了具有成对编码和基于事件的邻接矩阵的全局-局部图注意力网络(GLGAT)。GLGAT允许顶点拥有针对整个图的全局注意力矩阵集,并为每个顶点分配局部注意力矩阵集。在两个真实世界交通数据集上的实验表明,GLGAT能够有效捕捉时空相关性,并在与其他最先进的基线模型相比时表现出竞争力。

英文摘要

Traffic forecasting is a significant part of intelligent transportation systems. One of the critical challenges of traffic forecasting is to find spatio-temporal correlations. In recent years, graph convolutional networks and graph attention networks have replaced traditional statistical models to predict future traffic. However, it is complicated for both of them to allow vertices to have far different characters. To address this, we propose the Global-Local Graph Attention Network (GLGAT) with pairwise encoding and the event-based adjacency matrix. The GLGAT allows vertices to have a global attention matrix set for the whole graph and assigns local attention matrix sets to each vertex. Experiments on two real-world traffic datasets show that GLGAT can effectively capture spatio-temporal correlations and has competitive performance against other state-of-the-art baselines.

2605.16725 2026-05-19 cs.AI 版本更新

Baba in Wonderland: Online Self-Supervised Dynamics Discovery for Executable World Models

《白兔在奇幻世界:在线自监督动态发现用于可执行世界模型》

SeungWon Seo, DongHeun Han, SeongRae Noh, HyeongYeop Kang

发表机构 * Korea University(韩国大学) Kyung Hee University(庆熙大学)

AI总结 该研究探讨了在先验错配情况下,如何通过交互证据自监督学习可执行世界模型,引入了Alice系统,通过失败的候选更新作为结构信号,发现并改进动态,从而提升可执行世界模型的学习效果。

详情
AI中文摘要

可执行世界模型可以被阅读、编辑、执行和重用以进行规划,但前提是程序捕获了环境的转换定律,而非其表面词汇的语义捷径。我们研究了在先验错配情况下在线可执行世界模型学习的问题,其中智能体必须从交互证据中诱导状态依赖的动态,而无需规则描述、奖励信号或可信的词汇先验。我们引入了Alice,一个闭环系统,将失败的候选更新视为结构信号:当候选解释新的转换但失去之前解释的转换时,保存冲突揭示了当前程序所混淆的动态。Alice将这些冲突细化为假设类别,这些类别既提供了紧凑的、分层的保存反例以指导更新,又引导前沿探索向新颖且在当前程序下代表性不足的转换。我们在《白兔在奇幻世界》上评估了Alice,这是《白兔在你》的一个先验错配变种,它保留了模拟动态,同时将语义重要的规则属性标签替换为无关词汇。实验表明,Alice在先验错配情况下显著提升了可执行世界模型的学习效果,消融实验显示,类别细化和类别感知探索均有所贡献。

英文摘要

Executable world models can be read, edited, executed, and reused for planning, but only if the program captures the environment's transition law rather than semantic shortcuts in its surface vocabulary. We study online executable world-model learning under prior misalignment, where an agent must induce state-dependent dynamics from interaction evidence alone, without rule descriptions, reward signals, or trustworthy lexical priors. We introduce Alice, a closed-loop system that treats failed candidate updates as structural signal: when a candidate explains a new transition but loses previously explained ones, the preservation conflict reveals dynamics that the current program had conflated. Alice refines these conflicts into hypothesis classes that both provide compact, class-stratified preservation counterexamples for update and guide frontier exploration toward transitions that are novel and underrepresented with respect to the current program. We evaluate Alice on Baba in Wonderland, a prior-misaligned variant of Baba Is You that preserves simulator dynamics while replacing semantically meaningful rule-property labels with unrelated words. Experiments show that Alice substantially improves executable world-model learning under prior misalignment, and ablations show that both class refinement and class-aware exploration contribute.

2605.16719 2026-05-19 physics.soc-ph cs.AI cs.SI 版本更新

Universal Dynamics of Punctuated Progress

突变进步的普遍动力学

Yian Yin, Dashun Wang

发表机构 * Department of Information Science, Cornell University(康奈尔大学信息科学系) Center for Science of Science and Innovation, Northwestern University(西北大学科学与创新中心) McCormick School of Engineering, Northwestern University(西北大学工程学院) Kellogg School of Management, Northwestern University(西北大学管理学院) Ryan Institute on Complexity, Northwestern University(西北大学复杂性研究院) Northwestern Innovation Institute, Northwestern University(西北大学创新研究院)

AI总结 本文研究了科学和技术前沿突变进步的动力学,通过分析9个不同领域的历史数据,发现三个普遍规律:突变间等待时间服从重尾分布,前沿记录积累速率呈亚线性增长,记录突破事件在时间上存在相关性。作者提出一个包含激进创新和渐进改进的模型,揭示了突变进步的普遍动力学机制。

详情
AI中文摘要

科学和技术前沿的进步通过突变动力学推进,但这些动力学的原理仍不明确。本文收集并分析了9个不同领域中前沿演化的数据集,涵盖材料发现、结构生物学、人工智能、计算生物医学、数据科学、理论计算机科学、F1赛车和物理车轮制造等领域。分析680万种解决6700个任务的解决方案,揭示出三个普遍规律:(1)新前沿之间的等待时间服从重尾分布,大多数尝试集中在长期停滞期;(2)前沿记录积累速度呈亚线性增长,比对数增长快但比线性增长慢;(3)记录突破事件在时间上存在相关性,产生短期可预测性但长期不可预测性。尽管设置的规模、范围和定义存在差异,这些规律在所有研究的领域中都非常一致,并未被复杂系统、记录统计、创新经济和文化进化模型所捕捉。作者将缺失的成分归因于激进创新与渐进改进的区别,并开发了一个最小、可解析的模型,结合了重新定义可实现目标的激进重置和利用当前前沿的渐进改进。该简单模型再现了所有三个经验规律。令人惊讶的是,主导级预测参数无关,识别出一个新的普遍类,支配突变进步,并为开放性和前沿解决方案的可及性如何影响进步速度提供可检验的预测。总体而言,这些结果揭示了突变进步的普遍动力学,并识别了激进重置与渐进改进之间的相互作用是科学和技术前沿如何推进的关键驱动力。

英文摘要

Scientific and technological frontiers advance through punctuated dynamics, yet the principles governing these dynamics remain poorly understood. Here we collect and analyze datasets tracking the evolution of frontiers across 9 different domains, spanning materials discovery, structural biology, AI, computational biomedicine, data science, theoretical computer science, Formula-1 racing, and physical wheel building. Analyzing 6.8M solutions to 6.7K tasks, we uncover three universal patterns: (1) waiting times between new frontiers are heavy-tailed, with most attempts concentrated in long stasis; (2) frontier records accumulate at a sublinear rate, faster than logarithmic yet slower than linear growth; (3) record-breaking events are temporally correlated, generating short-term predictability yet long-term unpredictability. Despite the differences in the scale, scope, and definition of the settings, these patterns are remarkably consistent across all domains we study, and are not captured by models from complex systems, record statistics, economics of innovation, and cultural evolution. We trace the missing ingredient to the distinction between radical and incremental innovation, and develop a minimal, analytically solvable model incorporating both radical resets that restructure what is achievable and incremental refinements that exploit the current frontier. The simple model reproduces all three empirical regularities. Remarkably, the leading-order predictions are parameter-independent, identifying a new universality class governing punctuated progress and yielding testable predictions about how openness and access to frontier solutions shape the pace of advance. Overall, these results reveal universal dynamics governing punctuated progress and identify the interplay between radical resets and incremental refinements as the key driver of how scientific and technological frontiers advance.

2605.16714 2026-05-19 cs.AI cs.CR 版本更新

GRID: Graph Representation of Intelligence Data for Security Text Knowledge Graph Construction

GRID:用于安全文本知识图谱构建的智能数据图表示

Liangyi Huang, Zichen Liu, Fei Shao, Shang Ma, Mengshi Zhang, Zihao Chen, Yanfang Ye, Xusheng Xiao

发表机构 * Arizona State University(亚利桑那州立大学) Case Western Reserve University(凯斯西储大学) University of Notre Dame(诺丁汉大学) TensorBlock Facebook(脸书)

AI总结 本文提出GRID框架,通过构建可追溯的文章-图对齐,将文档到图学习转化为剧本任务库,提升安全知识图谱构建的稳定性和效率。

详情
AI中文摘要

安全知识图谱可以为安全代理提供可计算的外部记忆,但从长篇网络威胁情报(CTI)中构建仍具有挑战性:LLMs通常缺乏领域知识,端到端文档到图训练难以用低成本、稳定的奖励监督。我们提出了GRID(智能数据图表示),一种用于安全文本知识图谱构建的端到端框架。GRID首先通过图提取和知识图引导的文本修订,从CTI文章中构建安全领域监督。然后将文档到图学习转化为结合四选项多选问题和三级正则表达式匹配目标的剧本任务库,产生比反复评分完整图输出的LLM判断器更稳定的任务特定奖励。使用这种监督流程,我们训练了两个基于Qwen3-4B-Instruct-2507的4B提取器:一个任务库奖励模型和一个端到端奖励模型。在249篇CTI文章上,任务库奖励模型在具有本体引导的GRID提取流程下达到84.62%的源平均精度、64.91%的源平均召回率和68.53%的平均F1分数,实现了最佳源平均召回率和接近顶级平均F1分数,同时具有更低的token使用和部署成本。端到端奖励模型达到76.91%的精度、53.85%的召回率和58.06%的平均F1分数。进一步分析显示,任务库奖励可以一次离线构建并在后续训练运行中重复使用,优于在线端到端LLM作为判断器奖励和较弱的替代方案,如仅选择奖励和无RL的端到端SFT。

英文摘要

Security knowledge graphs can provide computable external memory for security agents, but constructing them from long-form cyber threat intelligence (CTI) remains difficult: LLMs often lack grounded security-domain knowledge, and end-to-end document-to-graph training is hard to supervise with cheap, stable rewards. We present GRID (Graph Representation of Intelligence Data), an end-to-end framework for security text knowledge graph construction. GRID first builds security-domain supervision from CTI articles by creating traceable article-graph alignments through graph extraction and knowledge-graph-conditioned text revision. It then turns document-to-graph learning into a scripted task bank combining four-option multi-select questions with triple-level regex matching targets, yielding more stable task-specific rewards than repeatedly scoring full graph outputs with an LLM judge. Using this supervision pipeline, we train two Qwen3-4B-Instruct-2507-based 4B extractors: a primary Task-bank Reward model and a secondary End2End Reward model with LLM-as-judge precision/recall rewards. On 249 CTI articles from GRID, CASIE, CTINexus, MalKG, and SecureNLP, the Task-bank Reward model with the ontology-guided GRID extraction pipeline reaches 84.62% source-averaged precision, 64.91% source-averaged recall, and 68.53% Avg F1, achieving the best source-averaged recall and near-top Avg F1 with lower token usage and deployment cost. The End2End Reward model reaches 76.91% precision, 53.85% recall, and 58.06% Avg F1. Further analyses show that task-bank rewards can be built once offline and reused across later post-training runs, outperforming online End2End LLM-as-judge reward and weaker alternatives such as Choice-only Reward and End2End SFT without RL.

2605.16676 2026-05-19 cs.AI 版本更新

Enhancing Metacognitive AI: Knowledge-Graph Population with Graph-Theoretic LLM Enrichment

增强元认知AI:基于图论的LLM知识图谱填充

Deniz Askin, Gal Hadar, Brendan Conway-Smith

发表机构 * Department of Cognitive Science, Carleton University(认知科学系,卡尔顿大学) Faculty of Computer and Information Science, Ben-Gurion University of the Negev(计算机与信息科学学院,贝内-约尔大学)

AI总结 本文提出MetaKGEnrich系统,通过构建知识图谱、检测稀疏区域、生成问题并检索证据,提升LLM的自我修复能力,在多个数据集上显著提高了回答质量。

详情
AI中文摘要

元认知——即监控自身知识状态、发现知识缺口并自主填补的能力——在现代AI中仍然缺失。本文提出了MetaKGEnrich,一个完全自动化的流程,使大语言模型(LLM)应用具备自我导向的知识修复能力。该系统(i)从种子查询构建知识图谱,(ii)通过七种图度量检测稀疏区域,(iii)利用GPT-4o生成针对性问题,(iv)通过Tavily检索网络证据并将其导入Neo4j,(v)使用GraphRAG重新回答查询以供GPT-4评估改进。在Google Research Natural Questions、MS MARCO和Hot-potQA三个广泛使用的数据集上测试了30个查询。MetaKGEnrich在80%的HotpotQA问题、87%的Google Research Natural Questions和83%的MS MARCO问题中提高了回答质量,同时保持了支持充分的区域。这一概念验证展示了拓扑自诊断加针对性检索如何推动AI向具有人类般的元认知学习能力发展。

英文摘要

Metacognition-the ability to monitor one's own knowledge state, spot gaps, and autonomously fill them--remains largely absent from modern AI. Here, we present MetaKGEnrich, a fully automated pipeline that endows large language model (LLM) applications with self-directed knowledge repair. The system (i) builds knowledge graphs from a seed query, (ii) detects sparse regions via seven graph metrics, (iii) has GPT-4o generate targeted questions, (iv) retrieves web evidence with Tavily and ingests it into Neo4j, and (v) re-answers the query with GraphRAG for GPT-4 to evaluate improvement. Tested on 30 queries from each of three widely-used datasets: Google Research Natural Questions, MS MARCO, and Hot-potQA. MetaKGEnrich improved answer quality in 80% of HotpotQA questions, 87% of Google Research Natural Questions and 83% of MS MARCO questions, while preserving well-supported regions. This proof of concept demonstrates how topological self-diagnosis plus targeted retrieval can advance AI toward humanlike metacognitive learning.

2605.16675 2026-05-19 cs.AI 版本更新

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

LinAlg-Bench:一个 forensic 验证基准,揭示 LLM 数学推理中的结构失效模式

Shradha Agarwal, Deepak Rajbhar, Tariq J

发表机构 * Department of Nuclear Engineering and Computer Science(核工程与计算机科学系)

AI总结 LinAlg-Bench 评估 10 个前沿大语言模型在结构线性代数计算中的表现,揭示 LLM 数学失败并非随机,而是受算法类型和矩阵维度约束。研究发现 4x4 尺寸存在行为阈值,低于该尺寸模型通过执行错误失败,高于则转向计算放弃,通过工具角色扮演等制造响应。

Comments 42 pages, 3 figures, 12 tables. NeurIPS 2026 Evaluations and Datasets Track submission. Dataset: https://huggingface.co/datasets/LinAlgBench/linalg-bench

详情
AI中文摘要

我们介绍了 LinAlg-Bench,一个诊断基准,评估 10 个前沿大语言模型在结构线性代数计算中的表现,覆盖 3x3、4x4 和 5x5 矩阵的严格维度梯度。该基准涵盖 9 类任务和 660 个 SymPy 认证问题,评估 6,600 个模型输出。除了二元准确率外,LinAlg-Bench 引入了三阶段自动化取证流程,将 1,156 个失败分类为 10 个主要错误标签及其细粒度子类型,揭示 LLM 数学失败并非随机,而是受算法类型和矩阵维度约束。我们的核心发现是 4x4 尺寸存在行为阈值:低于该尺寸,模型通过执行错误失败——符号跟踪失败、算术漂移和奇偶错误;高于该尺寸,失败转变为计算放弃,模型通过工具角色扮演、约束一致的虚构和结构性幻觉制造响应而非尝试计算。这种制造到放弃的转变在所有模型层级和架构中几乎普遍存在,表明是工作记忆限制而非知识缺口,支持三种规模涌现的错误类型在 3x3 不存在但在 4x4 和 5x5 存在。我们进一步显示,解决方案策略的刚性是 5x5 确定性准确率的近完美预测因素,记录约束意识的虚构作为一种新的结构幻觉失败模式,并公开所有数据、模型输出、错误标签和判断流程。

英文摘要

We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.

2605.16672 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Multi-Object Tracking Consistently Improves Wildlife Inference

多目标跟踪一致地提升野生动物推断

Mufhumudzi Muthivhi, Jiahao Huo, Fredrik Gustafsson, Terence L. van Zyl

发表机构 * World Wide Fund (WWF)(世界自然基金会) Centre for Artificial Intelligence Research (CAIR)(人工智能研究中心)

AI总结 本文利用多目标跟踪技术提升野生动物分类模型的鲁棒性,通过融合轨迹信息改进分类结果,实验表明在三个数据集上均提升了性能。

Comments Accepted for publication in IEEE 2026 29th International Conference on Information Fusion

详情
AI中文摘要

相机陷阱已成为生态研究和生物多样性保护中常用的野生动物监测工具。野生动物分类模型受益于野生动物视觉数据的增加,这些模型在经过整理的高质量数据集上能达到高水平的准确性。然而,其性能仍然易受现实环境约束的影响。在进行时间连续序列的推断时,它们常常产生不一致的预测。单个个体在帧之间的预测标签会迅速变化。本研究利用相机陷阱数据的时间特性来增强野生动物分类模型的推断预测。具体来说,我们采用几种标准的多目标跟踪(MOT)模型,将连续帧中的检测结果进行关联。经过整理的轨迹用于融合softmax类概率。融合的概率评分产生一个单一的共识类标签估计,以覆盖噪声引起的误分类。实验结果分析表明,我们的策略在所有数据集和每个指标上均优于独立分类器。具体而言,表现最好的MOT模型在三个MOT数据集上分别比分类器提高了5.1%、3.1%和2.0%的加权F1分数。

英文摘要

Camera traps have become a common tool for wildlife monitoring efforts in ecological research and biodiversity conservation. Wildlife classification models have benefited from the increase in wildlife visual data. These models reach high levels of accuracy on curated, high-quality datasets. However, their performance remains sensitive to real-world environmental constraints. They often produce inconsistent predictions when performing inference on temporally coherent sequences. The predicted label for a single individual shifts rapidly between frames. This study exploits the temporal nature of camera-trap data to augment inferred predictions from a wildlife classification model. Specifically, we adopt several standard Multi-Object Tracking (MOT) models to link detections across consecutive frames. The curated trajectories are used to fuse the softmax class probabilities. The fused probability score produces a single consensus class label estimate that overrides misclassifications caused by noise. The analysis of the experimental results shows that our proposed strategy improves over a standalone classifier over all datasets and for each metric. Specifically, the best-performing MOT models gain a weighted F1-Score of 5.1%, 3.1% and 2.0% over the classifier across three MOT datasets.

2605.16671 2026-05-19 cs.AI cs.CV cs.CY cs.LG 版本更新

Sustainable Intelligence for the Wild: Democratizing Ecological Monitoring via Knowledge-Adaptive Edge Expert Agents

野生环境中的可持续智能:通过知识自适应边缘专家代理实现生态监测民主化

Jiaxing Li, Hao Fang, Chi Xu, Miao Zhang, Jiangchuan Liu, William I. Atlas, Katrina M. Connors, Mark A. Spoljaric

发表机构 * Simon Fraser University(西蒙 Fraser大学) Wild Salmon Center(野生鲑鱼中心) Pacific Salmon Foundation(太平洋鲑鱼基金会) Haida Fisheries Program(海达渔业计划)

AI总结 本文提出一种知识自适应边缘代理架构,通过分离视觉感知与推理,结合视觉编码器和动态知识库,实现生态监测的可持续发展,促进伦理AI协同开发。

Comments 10 pages

详情
AI中文摘要

快速的生物多样性丧失凸显了有效监测的紧迫性,但手动调查仍消耗资源。尽管设备上的AI提供了一种可扩展的替代方案,但野外环境中经常受到环境变化的挑战。当前方法依赖云资源,需要持续上传现场数据以重新训练模型。这种方法不适合远程部署,因为它消耗有限的电力和网络连接。为了解决这些限制,本研究提出从模型适应转向知识适应。我们介绍了一种架构,将视觉感知与推理分离,结合视觉编码器和动态知识库。我们使用显式知识库取代隐式编码专家知识到模型参数。这种方法还通过结构化形式保存专家见解来支持知识可持续性。通过跨学科合作与生物学家和原住民社区,这项工作推进了伦理AI的协同开发,促进负责任和文化知情的生态系统管理。

英文摘要

Rapid biodiversity loss underscore the urgency of effective monitoring, yet manual surveys remain resource-intensive. While on-device AI offers a scalable alternative, its performance in the wild is often challenged by environmental variability. Current methods rely heavily on cloud resource, which requires continuous uploading of field data for model retraining. This approach is unsuitable for remote deployments because it consumes limited power and network connectivity. To address these constraints, this research proposes a shift from model adaptation to knowledge adaptation. We introduce an architecture that separates visual perception from reasoning, combining a visual encoder with a dynamic knowledge base. We uses an explicit knowledge base to replace implicitly encoding expert knowledge into model parameters. This method also supports knowledge sustainability by preserving expert insights in a structured form. Through cross-disciplinary collaboration with biologists and Indigenous communities, this work advances ethical AI co-development, fostering responsible and culturally informed ecosystem management.

2605.16668 2026-05-19 cs.LG cs.AI 版本更新

GraViti: Graph-Level Variational Autoencoders with Relaxed Permutation Invariance

GraViti:具有放松排列不变性的图级变分自编码器

Roman Bresson, Konstantinos Divriotis, Johannes F. Lutzeyer, Iakovos Evdaimon, Michalis Vazirgiannis

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎德·本·扎耶德人工智能大学) LIX, CNRS, École Polytechnique, IP Paris(巴黎理工学院LIX实验室,法国国家科学研究中心,巴黎理工学院,IP巴黎)

AI总结 GraViti通过图级变分自编码器生成紧凑的潜在向量,支持平滑插值和下游任务,优于节点级嵌入。

详情
AI中文摘要

我们介绍了GraViti,一种基于transformer的图级变分自编码器,将整个图映射到紧凑的潜在向量。这种设计产生了一个真正的图级潜在空间,支持平滑插值、属性引导搜索等下游任务,超越节点级嵌入的限制。在分子基准上,GraViti学会解码符合训练数据化学约束的有效样本,表明模型能直接从图级表示中恢复领域规则。我们还显示,在存在可靠规范节点顺序的领域(如分子或贝叶斯网络)中,强制排列不变性可能对一致重建有害。GraViti在大规模数据集上实现了最先进的重建准确性,并提供了坚实的生成性能。其单步解码提供了一种轻量级替代方案,同时保持实用的样本质量。

英文摘要

We introduce GraViti, a transformer-based graph-level variational autoencoder that maps entire graphs to compact latent vectors. This design produces a true graph-level latent space that supports smooth interpolation, property-guided search, and other downstream tasks beyond the constraints of node-level embeddings. On molecular benchmarks, GraViti learns to decode valid samples that follow the chemical constraints present in the training data, showing that the model recovers domain rules directly from graph-level representations. We also show that, in domains where a reliable canonical node ordering exists such as molecules or bayesian networks, enforcing permutation invariance can prove detrimental for consistent reconstruction. GraViti achieves state-of-the-art reconstruction accuracy on large datasets, and provides solid generative performance. Its single-step decoding offers a lightweight alternative to more complex generation pipelines while maintaining practical sample quality.

2605.16654 2026-05-19 cs.CL cs.AI 版本更新

A Scalable Tool for Measuring Manner and Result Verbs in Developmental Language Research

一种可扩展的测量方式用于发展语言研究中的方式和结果动词

Divyesh Pratap Singh, Dakshesh Gusain, Federica Bulgarelli, Alison Eisel Hendricks, John Beavers, Nathan M. Beers, Ifeoma Nwogu

发表机构 * University at Buffalo(布法罗大学) Nanyang Technological University(南洋理工大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出一种利用大规模语言模型进行方式和结果动词识别的方法,通过MASC和InterCorp数据扩展至436类,并在三个数据集上验证,模型准确率达89.6%。

Comments 12 pages

详情
AI中文摘要

方式和结果动词编码事件结构的不同方面,在发展语言研究中被视为研究早期动词学习的潜在区分特征。然而,由于目前缺乏大规模标注的 manner 和 result 分类资源,这种区分难以量度。本文提出了一种计算方法,利用语言学启发式提示生成句子级标注,扩展了VerbNet的标注范围至436类。然后在这些标注上训练了基于RoBERTa的分类器,并在三个保留的金标准数据集上进行评估,包括之前标注的项目和一个新的专家标注集。在这些评估中,模型表现出有希望的性能,平均准确率高达89.6%。本文将此工作作为可扩展的测量工具,支持未来关于动词语义的发展语言和其他语言数据集的研究,同时指出需要进一步验证边缘情况、混合方式/结果动词以及下游发展应用。

英文摘要

Manner and result verbs encode different aspects of event structure and have been discussed in developmental work as a potentially informative distinction for studying early verb learning. However, this distinction remains difficult to measure at scale because large annotated resources for manner and result classification are not currently available. We present a computational approach for identifying manner and result verbs in sentence context. Using linguistically informed prompts, we generate sentence-level annotations with large language models over data drawn from MASC and InterCorp, extending coverage from previously annotated portions of VerbNet to 436 classes. We then train a RoBERTa-based classifier on these annotations and evaluate it on three held-out gold-standard datasets, including previously annotated items and a new expert-annotated set. Across these evaluations, the model shows promising performance, with average accuracy up to 89.6%. We present this work as a scalable measurement tool that can support future research on verb semantics in developmental and other language datasets, while noting that further validation is needed for borderline cases, mixed manner/result verbs, and downstream developmental applications.

2605.16650 2026-05-19 cs.CL cs.AI 版本更新

SKG-Eval: Stateful Evaluation of Multi-Turn Dialogue via Incremental Semantic Knowledge Graphs

SKG-Eval: 通过增量语义知识图谱进行多轮对话的有状态评估

Avijit Shil, Suman Samui

发表机构 * Maulana Abul Kalam Azad University of Technology(Maulana Abul Kalam Azad 工业技术大学) National Institute of Technology Durgapur(Durgapur 国家理工大学)

AI总结 本文提出SKG-Eval框架,通过增量语义知识图谱模型,解决多轮对话评估中长距离不一致问题,提供可解释的评估信号和可复现的评分结果。

Comments 36 Pages, 6 Figures

详情
AI中文摘要

评估多轮对话系统仍具挑战性,因为响应质量不仅取决于当前提示,还取决于之前建立的实体、声明和对话承诺。现有自动评估器主要依赖扁平或轮次隔离的表示,难以检测长距离问题如矛盾、话题漂移和实体不一致。为此,我们提出SKG-Eval,一个近确定性和可解释的框架,将对话建模为跨轮次的语义知识图谱(SKG)的实体、关系和承诺。该框架通过结构化三元组提取逐步更新图谱,并计算三个互补信号:(i)局部相关性,衡量与当前提示和可选参考的一致性;(ii)历史一致性,评估新引入信息如何连接到先前对话上下文,使用图谱驱动和嵌入驱动信号;(iii)逻辑一致性,通过几何矛盾引擎评估跨轮次冲突,不依赖NLI模型或LLM判断。这些信号通过近期加权趋势分析适应性融合,生成长度不变的会话分数。在多个基准测试中,SKG-Eval在与人类判断的相关性上更高,并显著提高了长距离不一致的检测效果。此外,该框架为固定输入生成明确的矛盾证书和确定性分数,使评估可复现和可审计。整体而言,我们的结果表明,通过语义知识图谱的结构化外部化状态跟踪,为LLM基于对话评估器提供了可扩展的替代方案。

英文摘要

Evaluating multi-turn dialogue systems remains challenging because response quality depends not only on the current prompt, but also on previously established entities, claims, and conversational commitments. Existing automatic evaluators, including LLM-as-a-judge frameworks and embedding-based metrics, largely rely on flat or turn-isolated representations, making them less effective at detecting long-range issues such as contradiction, topic drift, and entity inconsistency. To address this, we propose SKG-Eval, a quasi-deterministic and interpretable framework that models dialogue as an evolving Semantic Knowledge Graph (SKG) of entities, relations, and commitments across turns. The framework incrementally updates the graph through structured triple extraction and computes three complementary signals: (i) local relevance, measuring alignment with the current prompt and optional reference; (ii) historical consistency, evaluating how newly introduced information connects to prior conversational context using graph-based and embedding-driven signals; and (iii) logical coherence, assessed by a geometric contradiction engine that detects cross-turn conflicts without relying on NLI models or LLM judges. These signals are adaptively fused and aggregated into a length-invariant session score via recency-weighted trend analysis. Across multiple benchmarks, SKG-Eval achieves higher correlation with human judgments and substantially improves detection of long-range inconsistencies in extended conversations. In addition, the framework produces explicit contradiction certificates and deterministic scores for fixed inputs, enabling reproducible and auditable evaluation. Overall, our results suggest that structured externalized state tracking through semantic knowledge graphs provides a scalable alternative to implicit reasoning in LLM-based dialogue evaluators.

2605.14854 2026-05-19 cs.CV cs.AI 版本更新

FactorizedHMR: A Hybrid Framework for Video Human Mesh Recovery

因子化HMR:视频人体网格恢复的混合框架

Patrick Kwon, Chen Chen

发表机构 * Institute of Artificial Intelligence(人工智能研究所) University of Central Florida(佛罗里达中央大学)

AI总结 本文提出FactorizedHMR框架,通过确定性回归模块和概率流匹配模块分别处理人体不同部位的恢复问题,结合复合目标表示和几何感知监督提升模糊部位的恢复效果,实现在遮挡和漂移敏感度指标上的优势。

详情
AI中文摘要

人体网格恢复(HMR)本质上具有歧义性:在遮挡或弱深度线索下,同一图像证据可能由多个3D身体解释。这种歧义性并非均匀分布于全身,躯干姿态和根结构通常相对受约束,而远端关节如手臂和腿部则更不确定。基于此观察,我们提出FactorizedHMR,一种两阶段框架,分别处理这两种情形。一个确定性回归模块首先恢复稳定的躯干-根锚点,一个概率流匹配模块则完成剩余的非躯干关节。为使完成可靠,我们结合复合目标表示与几何感知监督和特征感知分类器自由引导,保留躯干-根锚点的同时提升易产生歧义的关节的单参考恢复。我们还引入了一个合成数据管道,提供在多种视角下的配对图像-相机-运动监督。在相机空间和世界空间基准测试中,FactorizedHMR与强基线竞争,尤其在遮挡密集恢复和漂移敏感世界空间指标上表现最突出。

英文摘要

Human Mesh Recovery (HMR) is fundamentally ambiguous: under occlusion or weak depth cues, multiple 3D bodies can explain the same image evidence. This ambiguity is not uniform across the body, as torso pose and root structure are often relatively well constrained, whereas distal articulations such as the arms and legs are more uncertain. Building on this observation, we propose FactorizedHMR, a two-stage framework that treats these two regimes differently. A deterministic regression module first recovers a stable torso-root anchor, and a probabilistic flow-matching module then completes the remaining non-torso articulation. To make this completion reliable, we combine a composite target representation with geometry-aware supervision and feature-aware classifier-free guidance, preserving the torso-root anchor while improving single-reference recovery of ambiguity-prone articulation. We also introduce a synthetic data pipeline that provides the paired image-camera-motion supervision under diverse viewpoints. Across camera-space and world-space benchmarks, FactorizedHMR remains competitive with strong baselines, with the clearest gains in occlusion-heavy recovery and drift-sensitive world-space metrics.

2605.14504 2026-05-19 cs.AI 版本更新

When Robots Do the Chores: A Benchmark and Agent for Long-Horizon Household Task Execution

当机器人做家务:一个基准和代理用于长期家庭任务执行

Zilin Zhu, Longteng Guo, Yanghong Mei, Bowen Pang, Zongxun Zhang, Xingjian He, Ruyi Ji, Jing Liu

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Zhongguancun Academy(中关村学院) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 本文提出LongAct基准和HoloMind代理,用于评估长期家庭任务执行中的高层自主能力,实验显示HoloMind在减少模型规模依赖的同时提升了长期性能,但目标完成率仍较低,凸显了长期规划的挑战。

详情
AI中文摘要

长期家庭任务需要稳健的高层规划和持续推理能力,而现有具身AI基准多关注短时间导航或操作,依赖固定任务类别。我们引入LongAct基准,用于评估通过自由指令指定的长期家庭任务中的规划自主性。通过抽象掉与具体身体相关的低层控制,LongAct隔离了如指令理解、依赖管理、记忆维护和适应性规划等高层认知能力。我们进一步提出HoloMind,一个基于视觉语言模型的代理,配备基于有向无环图的长期分层规划器、多模态空间记忆用于持久世界建模、经验重用的片段记忆以及全局批评者用于反思监督。实验表明,GPT-5和Qwen3-VL模型在HoloMind上显著提升了长期性能,同时减少了对模型规模的依赖。即使顶级模型也仅达到59%的目标完成率和16%的完整任务成功率,凸显了LongAct的难度以及具身代理中更强长期规划的需求。

英文摘要

Long-horizon household tasks demand robust high-level planning and sustained reasoning capabilities, which are largely overlooked by existing embodied AI benchmarks that emphasize short-horizon navigation or manipulation and rely on fixed task categories. We introduce LongAct, a benchmark designed to evaluate planning-level autonomy in long-horizon household tasks specified through free-form instructions. By abstracting away embodiment-specific low-level control, LongAct isolates high-level cognitive capabilities such as instruction understanding, dependency management, memory maintenance, and adaptive planning. We further propose HoloMind, a VLM-driven agent with a DAG-based long-horizon hierarchical planner, a Multimodal Spatial Memory for persistent world modeling, an Episodic Memory for experience reuse, and a global Critic for reflective supervision. Experiments with GPT-5 and Qwen3-VL models show that HoloMind substantially improves long-horizon performance while reducing reliance on model scale. Even top models achieve only 59% goal completion and 16% full-task success, underscoring the difficulty of LongAct and the need for stronger long-horizon planning in embodied agents.

2605.13877 2026-05-19 cs.NE cs.AI 版本更新

ARES-LSHADE: Autoresearch-Enhanced LSHADE with Memetic Polish for the GNBG Benchmark

ARES-LSHADE:基于自研增强的LSHADE与膜etic精修的GNBG基准测试

Abdullah Naeem, Md Wasi Ul kabir, Manish Bhatt, Ayon Dey, Anav Katwal, Md Tamjidul Hoque

发表机构 * University of New Orleans(新奥尔良大学) Amazon(亚马逊)

AI总结 本文提出ARES-LSHADE,通过自研循环和膜etic精修改进LSHADE,针对GNBG基准测试实现510/744胜率,达到机器精度,揭示LLM研究循环与基准完整性之间的张力。

详情
AI中文摘要

我们介绍了ARES-LSHADE,一种膜etic差分进化变体,参加GECCO 2026竞赛中的LLM设计进化算法竞赛,针对通用数值基准生成器(GNBG)。该算法基于2025年LLM-LSHADE冠军,贡献两个新组件:(a) 一种增强的觅食突变算子,通过约三十个LLM驱动的设计实验,结合自适应CMA-ES;(b) 一种多起点L-BFGS-B精修阶段,严格遵守基准的黑箱处理。在官方31次运行/函数评估中,ARES-LSHADE获得510/744胜(每函数差距低于1e-8),在18/24个函数上达到机器精度。其余六个函数表现出特征平台签名,与GNBG的组成结构一致,且被自研循环独立识别为最困难的函数。除了结果本身,本报告还记录了两种方法论观察:(i) 一个仅通过算子编辑表面和适应度观察空间的LLM驱动研究循环在该基准上收敛到特征平台;(ii) 当我们最初扩展观察空间以包含基准的组成元数据时,算法轻易解决了所有24个函数,但违反了竞赛的黑箱规则。我们讨论了LLM能力与基准完整性之间的张力作为未来LLM驱动优化算法研究的设计考虑。代码和可重复性工具包可在https://github.com/anaeem1/ARES-LSHADE获得。

英文摘要

We present ARES-LSHADE, a memetic differential-evolution variant submitted to the GECCO 2026 competition on LLM-designed evolutionary algorithms for the Generalized Numerical Benchmark Generator (GNBG). The algorithm builds on the LLM-LSHADE 2025 winner, contributing two new components: (a) a scout-augmented mutation operator with adaptive CMA-ES integration, produced by an autonomous research loop across approximately thirty LLM-driven design experiments, and (b) a multi-start L-BFGS-B polish phase that respects strict blackbox treatment of the benchmark. On the official 31-run-per-function evaluation with the competition-specified function-evaluation budgets, ARES-LSHADE obtains 510 of 744 wins (per-function gap below 1e-8), reaching machine precision on 18 of 24 functions. The remaining six functions exhibit characteristic plateau signatures consistent with GNBG's compositional structure, and were independently identified by the autoresearch loop as the hardest of the suite. Beyond the result itself, this report documents two methodological observations: (i) an LLM-driven research loop with operator-only edit surface and fitness-only observation space converges to a characteristic plateau on this benchmark; (ii) when we initially widened the observation space to include the benchmark's compositional metadata, the resulting algorithm trivially solved all 24 functions but violated the competition's blackbox rule, which we identified before submission. We discuss this tension between LLM capability and benchmark integrity as a design consideration for future LLM-driven optimization-algorithm research. Code and reproducibility artifacts are available at https://github.com/anaeem1/ARES-LSHADE.

2605.12991 2026-05-19 cs.LG cs.AI 版本更新

Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

不只是RLHF:为何仅对齐不足以解决多智能体趋同

Adarsh Kumarappan, Ananya Mujoo

发表机构 * California Institute of Technology(加州理工学院) Evergreen Valley College(艾弗绿谷学院)

AI总结 本文研究了多智能体系统在模拟同伴分歧下的错误率问题,发现预训练基础模型与指令模型存在相似的替换模式,且错误率较高。通过激活修补发现错误集中在中间层,修复后可恢复大部分正确率差距。研究还指出压力抑制了清洁推理特征,而非激活新的趋同回路。

详情
AI中文摘要

基于LLM的多智能体管道在模拟同伴分歧下,正确答案转为错误答案的速率我们称为收益,这一漏洞广泛归因于RLHF诱导的趋同。我们测试了四种模型家族,发现这种归因大多不成立:预训练基础模型表现出与指令变体相同的替换模式,其平均收益高于指令变体。通过激活修补,我们发现错误集中在狭窄的中间层窗口,其中注意力承担因果权重,而MLP贡献可忽略不计;在该窗口上方进行修补可恢复96%的清洁到受压P(correct)差距。攻击面分解为两个独立因素(通道框架和共识强度)的相互作用,产生47.5个百分点的收益差距,在多数共识下保持不变,适用于陪审团大小$N \in \{4, 5, 6\}$。两种收敛的激活空间干预显示,压力抑制了清洁推理特征,而非激活新的趋同回路。一个正确论证的异议者在所有测试框架中将收益降低54-73个百分点,而最强的提示级防御在攻击变体超出其设计范围时失效。缓解措施应针对机制,而非提示级防御,应在管道层面实施结构化异议。

英文摘要

LLM-based multi-agent pipelines flip from correct to incorrect answers under simulated peer disagreement at rates we term yield, a vulnerability widely attributed to RLHF-induced sycophancy. We test this attribution across four model families and find it largely wrong: pretrained base models exhibit the same substitution pattern as their Instruct variants, averaging higher yield than Instruct. Using activation patching, we localize the corruption to a narrow mid-layer window where attention carries the causal weight and MLP contribution is negligible; patching above this window restores 96% of the clean-to-pressured P(correct) gap. The attack surface decomposes into two independent factors (channel framing and consensus strength) whose interaction produces a 47.5 percentage-point yield gap at majority consensus, preserved across jury sizes $N \in \{4, 5, 6\}$. Two converging activation-space interventions show that pressure suppresses clean-reasoning features rather than activating a new sycophancy circuit. A single correctly-arguing dissenter reduces yield by 54-73 percentage points across all framings tested, whereas the strongest prompt-level defense fails on attack variants outside its design surface. Mitigations should target the mechanism, structured dissent at the pipeline level, rather than prompt-level defenses.

2605.12920 2026-05-19 cs.MA cs.AI cs.CL 版本更新

Embodied Multi-Agent Coordination by Aligning World Models Through Dialogue

通过对话对齐世界模型实现具身多智能体协调

Vardhan Dongre, Dilek Hakkani-Tür

发表机构 * Siebel School of Computing & Data Science(计算机与数据科学学院)

AI总结 研究通过对话机制探索具身智能体的世界模型对齐,发现对话能减少冲突但降低任务成功率,提出评估世界模型对齐的框架。

详情
AI中文摘要

有效的具身智能体协作需要超越共享环境中的行动,要求基于每个智能体对世界的理解进行沟通。当智能体只能部分观察环境时,无沟通的协调是难以证明的,但沟通可通过共享观察和对齐世界模型来弥合这一差距。本文研究LLM基于的具身智能体是否真正具备沟通能力。我们扩展了PARTNR协作家庭机器人基准,加入自然语言对话通道,使两个具有部分观察能力的智能体在任务执行中沟通。为评估对话是否导致真实的世界模型对齐而非表面协调,我们提出了一种基于每智能体世界图的对齐测量框架:观察收敛(私人世界模型随时间对齐吗?)、信息新颖性(信息是否传达了伙伴所缺乏的内容?)以及信念敏感的通信(智能体是否建模了伙伴所知的内容?)。我们的实验显示,对话减少了40至83个百分点的行动冲突,但相对于沉默协调任务成功率较低。使用我们的指标,我们表征了表面协调与真实世界模型对齐之间的差距,并确定当前模型在该光谱中的位置。

英文摘要

Effective collaboration between embodied agents requires more than acting in a shared environment; it demands communication grounded in each agent's evolving understanding of the world. When agents can only partially observe their surroundings, coordination without communication is provably hard, but communication can, in principle, bridge this gap by allowing agents to share observations and align their world models. In this work, we examine whether LLM-based embodied agents actually realize the ability to communicate. We extend PARTNR, a benchmark for collaborative household robotics, with a natural-language dialogue channel that enables two agents with partial observability to communicate during task execution. To evaluate whether dialogue leads to genuine world-model alignment rather than superficial coordination, we propose a framework for measuring world-model alignment defined over per-agent world graphs: observation convergence (do private world models align over time?), information novelty (do messages convey what the partner lacks?), and belief-sensitive messaging (do agents model what their partner knows?). Our experiments across three LLMs reveal that dialogue reduces action conflicts 40 to 83 percentage points but degrades task success relative to silent coordination. Using our metrics, we characterize the gap between superficial coordination and genuine world-model alignment, and identify where current models fall on this spectrum.

2605.12825 2026-05-19 cs.LG cs.AI 版本更新

Orthrus: Memory-Efficient Parallel Token Generation via Dual-View Diffusion

Orthrus:通过双视角扩散实现内存高效的并行令牌生成

Chien Van Nguyen, Chaitra Hegde, Van Cuong Pham, Ryan A. Rossi, Franck Dernoncourt, Thien Huu Nguyen

发表机构 * University of Oregon(俄勒冈大学) Google DeepMind(谷歌深Mind) Adobe Research(Adobe研究)

AI总结 Orthrus结合自回归大语言模型的高保真生成与扩散模型的高速并行生成,通过双视角机制实现高效推理,提升速度7.8倍且内存开销极低。

详情
AI中文摘要

我们介绍Orthrus,一种简单高效的双架构框架,结合自回归大语言模型(LLM)的精确生成保真度与扩散模型的高速并行令牌生成。标准自回归解码的序列性是高吞吐推理的根本瓶颈。尽管扩散语言模型试图通过并行生成突破这一瓶颈,但存在显著的性能下降、高训练成本和缺乏严格的收敛保证。Orthrus原生解决这一二元对立。设计用于无缝集成到现有Transformer中,框架在冻结的LLM上添加一个轻量可训练模块,创建一个并行扩散视角与标准自回归视角。在统一系统中,两个视角均关注相同的高保真键值(KV)缓存;自回归头执行上下文预填充以构建准确的KV表示,而扩散头执行并行生成。通过在两个视角之间采用精确的一致性机制,Orthrus保证无损推理,仅以O(1)的内存缓存开销和极小的参数增加,即可实现高达7.8倍的速度提升。

英文摘要

We introduce Orthrus, a simple and efficient dual-architecture framework that unifies the exact generation fidelity of autoregressive Large Language Models (LLMs) with the high-speed parallel token generation of diffusion models. The sequential nature of standard autoregressive decoding represents a fundamental bottleneck for high-throughput inference. While diffusion language models attempt to break this barrier via parallel generation, they suffer from significant performance degradation, high training costs, and a lack of rigorous convergence guarantees. Orthrus resolves this dichotomy natively. Designed to seamlessly integrate into existing Transformers, the framework augments a frozen LLM with a lightweight, trainable module to create a parallel diffusion view alongside the standard autoregressive view. In this unified system, both views attend to the exact same high-fidelity Key-Value (KV) cache; the autoregressive head executes context pre-filling to construct accurate KV representations, while the diffusion head executes parallel generation. By employing an exact consensus mechanism between the two views, Orthrus guarantees lossless inference, delivering up to a 7.8x speedup with only an O(1) memory cache overhead and minimal parameter additions.

2605.12824 2026-05-19 cs.MA cs.AI cs.CL cs.CY 版本更新

Mechanism Plausibility in Generative Agent-Based Modeling

生成基于代理的建模中的机制合理性

Patrick Zhao, David Huu Pham, Nicholas Vincent

发表机构 * Simon Fraser University(西蒙弗雷泽大学)

AI总结 本文提出机制合理性量表,区分生成充分性与机制合理性,探讨生成式代理模型的生成能力与解释能力。

Comments Accepted at ACM FAccT 2026

详情
AI中文摘要

大型语言模型(LLMs)能够生成多样化现象而无需显式编程规则,这一能力使其在不同代理基于模型(ABMs)和社会模拟中得到应用。最近的研究探讨了LLMs生成不同现象的能力,例如社交媒体上的人类行为或博弈论场景中的外星行为。然而,能力、预测和解释是不同的——从科学哲学和机制文献中,解释需要展示现象如何由相关组织实体和活动产生。对于建模者而言,在没有基于潜在遥远研究领域的情况下,描述实验特征或判断模拟是否在能力(或解释)上取得进展是困难的。我们整合了最近关于LLM-ABMs的研究与当代科学哲学文献,用以操作化'合理性'的定义,提出四等级量表。该量表将模型生成充分性(重现现象的能力)与机制合理性(现象如何产生)分开,并明确不同模型的不同角色,如预测性和解释性。我们将其介绍为机制合理性量表。

英文摘要

Large language models (LLMs) can generate high-level diverse phenomena without explicitly programmed rules. This capability has led to their adoption within different agent-based models (ABMs) and social simulations. Recent studies investigate their ability to generate different phenomena of interest, for example, human behavior on social media platforms or alien behavior in game-theoretic scenarios. However, capability, prediction, and explanation are different--drawing from the philosophy of science and mechanisms literature, explanation requires showing, to some degree, how a phenomenon is produced by related organized entities and activities. For modelers, describing the characteristics of an experiment or whether a simulation provides progress in capability (or explanation), can be difficult without being grounded in potentially distant research areas. We integrate recent work on LLM-ABMs with contemporary philosophy of science literature and use it to operationalize a definition of 'plausibility' in a four-level scale. Our scale separates the evaluation of a model's generative sufficiency (ability to reproduce a phenomenon) from its mechanistic plausibility (how the phenomenon could be produced), and clarifies the distinct roles of different models, such as predictive and explanatory ones. We introduce this as the Mechanism Plausibility Scale.

2605.12070 2026-05-19 cs.LG cs.AI 版本更新

Missing Old Logits in Asynchronous Agentic RL: Semantic Mismatch and Repair Methods for Off-Policy Correction

异步代理强化学习中缺失旧日志:语义不匹配及用于离线策略修正的修复方法

Zhong Guan, Yongjian Guo, Haoran Sun, Wen Huang, Shuai Di, Likang Wu, Xiong Jun Wu, Hongke Zhao

发表机构 * Tianjin University(天津大学) Tsinghua University(清华大学) Peking University(北京大学) JDT AI Infra(京东AI基础设施)

AI总结 本文研究了异步代理强化学习中因缺失旧日志导致的语义不匹配问题,提出三种精确获取旧日志的策略及近似修正方法,改进了PPO-EWMA方法,提升了训练速度和优化性能。

详情
AI中文摘要

异步强化学习通过将样本生成与策略优化解耦,提高了大语言模型代理的回放吞吐量,但同时也引入了PPO类离线策略修正中的关键故障模式。在异构训练系统中,总重要性比率应理想地分解为两个语义不同的因素:一个训练-推理不匹配项,用于对齐同一行为策略版本的推理侧和训练侧分布,以及一个策略陈旧项,用于约束从历史策略到当前策略的更新。我们发现实际的异步管道在延迟更新和部分回放的情况下,常常丢失所需的训练侧旧日志或旧日志。这种缺失旧日志的问题使不匹配修复与陈旧修正纠缠在一起,破坏了解耦修正的初衷,并使裁剪和掩码阈值产生不良交互。为了解决这一问题,我们研究了精确和近似修正路径。我们提出了三种精确旧日志获取策略:基于快照的版本跟踪、专用旧日志模型以及通过部分回放中断进行同步,并比较了它们的系统权衡。从近似修正的角度来看,我们关注通过更合适的近似策略保留解耦修正的好处,当无法以低成本恢复精确旧日志时,不增加额外系统开销。随后,我们采用改进的PPO-EWMA方法,该方法在训练速度和优化性能方面均取得显著提升。

英文摘要

Asynchronous reinforcement learning improves rollout throughput for large language model agents by decoupling sample generation from policy optimization, but it also introduces a critical failure mode for PPO-style off-policy correction. In heterogeneous training systems, the total importance ratio should ideally be decomposed into two semantically distinct factors: a \emph{training--inference discrepancy term} that aligns inference-side and training-side distributions at the same behavior-policy version, and a \emph{policy-staleness term} that constrains the update from the historical policy to the current policy. We show that practical asynchronous pipelines with delayed updates and partial rollouts often lose the required historical training-side logits, or old logits. This missing-old-logit problem entangles discrepancy repair with staleness correction, breaks the intended semantics of decoupled correction, and makes clipping and masking thresholds interact undesirably. To address this issue, we study both exact and approximate correction routes. We propose three exact old-logit acquisition strategies: snapshot-based version tracking, a dedicated old-logit model, and synchronization via partial rollout interruption, and compare their system trade-offs. From the perspective of approximate correction, we focus on preserving the benefits of decoupled correction through a more appropriate approximate policy when exact old logits cannot be recovered at low cost, without incurring extra system overhead. Following this analysis, we adopt a revised PPO-EWMA method, which achieves significant gains in both training speed and optimization performance.

2605.11518 2026-05-19 cs.AI cs.CL cs.LG 版本更新

AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

AutoLLMResearch: 训练研究代理以自动化LLM实验配置 - 从低成本学习,优化高成本

Taicheng Guo, Nitesh V. Chawla, Olaf Wiest, Xiangliang Zhang

发表机构 * University of Notre Dame(诺丁汉大学)

AI总结 本文提出AutoLLMResearch框架,通过多保真度实验环境学习LLM配置原则,解决高成本实验自动化问题,展示其在大规模LLM实验中的有效性与通用性。

详情
AI中文摘要

有效配置可扩展的大规模语言模型(LLM)实验,涵盖架构设计、超参数调优等,对推进LLM研究至关重要,因为糟糕的配置选择会浪费大量计算资源并阻碍模型潜力的实现。以往的自动化方法适用于低成本环境,但可扩展的LLM实验成本过高,无法进行大量迭代。为了解决这一问题,我们提出AutoLLMResearch,一个模仿人类研究人员从低保真度实验中学习一般性原则并高效识别高成本LLM配置的代理框架。核心挑战是如何使代理通过与多保真度实验环境的交互学习LLM配置景观的结构。为此,我们提出一个系统框架,包含两个关键组件:1) LLMConfig-Gym,涵盖四个关键LLM实验任务的多保真度环境,支持超过一百万GPU小时的可验证实验结果;2) 一个结构化训练管道,将配置研究建模为长周期马尔可夫决策过程,并相应地激励跨保真度外推推理。在各种强基线上的广泛评估表明了我们框架的有效性、通用性和可解释性,支持其作为大规模现实LLM实验自动化的实用且通用解决方案的潜力。

英文摘要

Effectively configuring scalable large language model (LLM) experiments, spanning architecture design, hyperparameter tuning, and beyond, is crucial for advancing LLM research, as poor configuration choices can waste substantial computational resources and prevent models from realizing their full potential. Prior automated methods are designed for low-cost settings where repeated trial and error is feasible, but scalable LLM experiments are too expensive for such extensive iteration. To our knowledge, no work has addressed the automation of high-cost LLM experiment configurations, leaving this problem labor-intensive and dependent on expert intuition. Motivated by this gap, we propose AutoLLMResearch, an agentic framework that mimics how human researchers learn generalizable principles from low-fidelity experiments and extrapolate to efficiently identify promising configurations in expensive LLM settings. The core challenge is how to enable an agent to learn, through interaction with a multi-fidelity experimental environment that captures the structure of the LLM configuration landscape. To achieve this, we propose a systematic framework with two key components: 1) LLMConfig-Gym, a multi-fidelity environment encompassing four critical LLM experiment tasks, supported by over one million GPU hours of verifiable experiment outcomes; 2) A structured training pipeline that formulates configuration research as a long-horizon Markov Decision Process and accordingly incentivizes cross-fidelity extrapolation reasoning. Extensive evaluation against diverse strong baselines on held-out experiments demonstrates the effectiveness, generalization, and interpretability of our framework, supporting its potential as a practical and general solution for scalable real-world LLM experiment automation.

2605.09395 2026-05-19 cs.AI cs.LG cs.MA cs.MM 版本更新

Empowering VLMs for Few-Shot Multimodal Time Series Classification via Tailored Agentic Reasoning

通过定制代理推理增强VLMs在少样本多模态时间序列分类中的能力

Lin Li, Jiawei Huang, Qihao Quan, Dan Li, Boxin Li, Xiao Zhang, Erli Meng, Wenjie Feng, Jian Lou, See-Kiong Ng

发表机构 * Sun Yat-sen University(中山大学) Xiaomi Corporation(小米公司) University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 本文提出MarsTSC框架,通过自演化知识库和代理推理提升少样本多模态时间序列分类性能,实验表明其在六个VLM基础上均优于传统和基础模型基线。

Comments 18 pages, 12 figures, 6 tables. Preprint

详情
AI中文摘要

本文提出首个VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning框架用于少样本多模态时间序列分类(MarsTSC),引入自演化知识库作为动态上下文,通过反思代理推理不断优化。框架包含三个协作角色:i) 生成器通过推理进行可靠分类;ii) 反射器诊断推理错误根源以获得判别性见解;iii) 修改器应用验证更新以防止上下文崩溃。进一步引入测试时更新策略以实现谨慎持续的知识库优化,缓解少样本偏差和分布偏移。在12个主流时间序列基准上的广泛实验表明,MarsTSC在六个VLM基础上均取得显著且一致的性能提升,优于传统和基础模型基线,并生成可解释的推理依据,使每个分类决策都基于人类可读的特征证据。

英文摘要

In this paper, we propose the first VL$\underline{\textbf{M}}$ $\underline{\textbf{a}}$gentic $\underline{\textbf{r}}$easoning framework for few-$\underline{\textbf{s}}$hot multimodal $\underline{\textbf{T}}$ime $\underline{\textbf{S}}$eries $\underline{\textbf{C}}$lassification ($\textbf{MarsTSC}$), which introduces a self-evolving knowledge bank as a dynamic context iteratively refined via reflective agentic reasoning. The framework comprises three collaborative roles: i) Generator conducts reliable classification via reasoning; ii) Reflector diagnoses the root causes of reasoning errors to yield discriminative insights targeting the temporal features overlooked by Generator; iii) Modifier applies verified updates to the knowledge bank to prevent context collapse. We further introduce a test-time update strategy to enable cautious, continuous knowledge bank refinement to mitigate few-shot bias and distribution shift. Extensive experiments across 12 mainstream time series benchmarks demonstrate that $\textbf{MarsTSC}$ delivers substantial and consistent performance gains across 6 VLM backbones, outperforming both classical and foundation model-based time series baselines under few-shot conditions, while producing interpretable rationales that ground each classification decision in human-readable feature evidence.

2605.08794 2026-05-19 cs.LG cs.AI 版本更新

Deterministic Decomposition of Stochastic Generative Dynamics

确定性分解随机生成动力学

Xingyu Song, Yuan Mei, Naoya Takeishi

发表机构 * The University of Tokyo(东京大学) Zhejiang University(浙江大学)

AI总结 本文提出Bridge Matching框架,通过分解生成动力学中的确定性与随机效应,实现可控的生成模型。

Comments 10 pages main text, 6 figures; appendix included. Code available at: https://github.com/xingyu-song/bridge_matching

详情
AI中文摘要

现代生成模型可视为从简单基础分布到目标数据分布的概率传输。确定性传输模型提供可计算的速度场参数化,而随机生成模型通过漂移和扩散捕捉更丰富的密度演变。然而,当随机动力学通过确定性速度场描述时,漂移和扩散的影响常被压缩为单一有效场,掩盖了确定性演化和随机波动的差异作用。本文表明,随机生成过程的确定性场$b_t$可自然分解为传输-渗透分解,分离确定性传输与随机扩散效应:$b_t = u_t + d_t$,其中$u_t$控制边际概率传输,$d_t$由扩散和边际分数决定。基于此分解,我们提出Bridge Matching框架,通过边际和条件形式学习分解的生成动力学。在生成模型实验中,我们重新组合学习的组件作为$b_t = u_t + λ_d d_t$,显示所提分解通过调整概率传输中的渗透贡献实现可解释和可控的采样。

英文摘要

Modern generative models can be understood as probability transport from a simple base distribution to a target data distribution. Deterministic transport models offer tractable velocity-field parameterizations, whereas stochastic generative models capture richer density evolution through drift and diffusion. Yet when stochastic dynamics are described through deterministic velocity fields, the effects of drift and diffusion are often compressed into a single effective field, obscuring the distinct roles of deterministic evolution and stochastic fluctuation. In this work, we show that the deterministic field \(b_t\) of a stochastic generative process admits a natural transport--osmotic decomposition that separates deterministic transport from stochastic, diffusion-induced effects: \(b_t = u_t + d_t\), where \(u_t\) governs marginal probability transport and \(d_t\) captures an osmotic effect induced by diffusion and determined by the marginal score. Based on this decomposition, we propose Bridge Matching, a flow-based framework for learning decomposed generative dynamics through both marginal and conditional formulations. In generative modeling experiments, we recombine the learned components as \(b_t = u_t + λ_d d_t\), showing that the proposed decomposition enables interpretable and controllable sampling by adjusting the osmotic contribution in probability transport.

2605.07905 2026-05-19 cs.CL cs.AI 版本更新

CoCoReviewBench: A Completeness- and Correctness-Oriented Benchmark for AI Reviewers

CoCoReviewBench:面向AI审稿人完整性和正确性的基准测试

Hexuan Deng, Xiaopeng Ke, Yichen Li, Ruina Hu, Dehao Huang, Derek F. Wong, Yue Wang, Xuebo Liu, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳研究院) Zhongguancun Academy, Beijing, China(中关村学院) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院) Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China(南方科技大学计算机科学与工程系) NLP²CT Lab, Department of Computer and Information Science, University of Macau, China(澳门大学自然语言处理实验室)

AI总结 本文提出CoCoReviewBench,通过构建领域特定的基准子集和专家注释,提升AI审稿人评估的完整性和正确性,分析显示AI审稿人仍存在正确性不足和幻觉问题,强调推理模型的优越性。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管AI审稿系统发展迅速,但评估这些系统仍具挑战性:现有指标倾向于重叠度而非正确性。由于人类审稿通常只涵盖部分显著问题且可能含错误,因此不可靠。为此,我们构建了领域特定的基准子集,并在对应人类审稿缺失时跳过评估以增强完整性。我们还利用审稿人-作者-元审稿讨论作为专家注释,并据此过滤不可靠的审稿以增强正确性。最终,我们引入了CoCoReviewBench,该基准从ICLR和NeurIPS中精选了3900篇论文,以实现对AI审稿人的可靠和细致评估。分析表明,AI审稿人仍受限于正确性不足且易产生幻觉,推理模型表现更佳,这推动了进一步改进AI审稿人的方向。基准和模型可在https://github.com/hexuandeng/CoCoReviewBench获取。

英文摘要

Despite the rapid development of AI reviewers, evaluating such systems remains challenging: metrics favor overlap with human reviews over correctness. However, since human reviews often cover only a subset of salient issues and sometimes contain mistakes, they are unreliable as gold references. To address this, we build category-specific benchmark subsets and skip evaluation when the corresponding human reviews are missing to strengthen Completeness. We also leverage reviewer--author--meta-review discussions as expert annotations and filter unreliable reviews accordingly to strengthen Correctness. Finally, we introduce CoCoReviewBench, which curates 3,900 papers from ICLR and NeurIPS to enable reliable and fine-grained evaluation of AI reviewers. Analysis shows that AI reviewers remain limited in correctness and are prone to hallucinations, and highlights reasoning models as more effective reviewers, motivating further directions for improving AI reviewers. Benchmarks and models are available at https://github.com/hexuandeng/CoCoReviewBench.

2605.06638 2026-05-19 cs.AI cs.CL 版本更新

Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

强化学习能否教会大语言模型长期 horizon 推理?表达性是关键

Tianle Wang, Zhaoyang Wang, Guangchen Lan, Xinpeng Wei, Sipeng Zhang, Guanwen Qiu, Abulhair Saparov

发表机构 * Purdue University(普渡大学) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Georgia Tech(佐治亚理工学院) UC San Diego(加州大学圣地亚哥分校)

AI总结 本文通过ScaleLogic框架研究了RL训练与任务难度的关系,发现推理深度和逻辑表达性影响训练计算量,表达性越高,训练效率越高,证明LLM的长期推理问题可通过改进训练方法解决。

详情
AI中文摘要

强化学习(RL)已被应用于改进大语言模型(LLM)的推理能力,但关于训练规模与任务难度之间系统研究受限于缺乏可控且可扩展的环境。观察到LLM在长期推理方面的不足引发了它们可能是自回归Transformer架构根本问题的推测。为此,我们引入了ScaleLogic,一个合成逻辑推理框架,可独立控制两个难度轴:所需证明规划的深度(即horizon)和底层逻辑的表达性。我们提出的框架支持多种逻辑:从简单的蕴含逻辑(

英文摘要

Reinforcement learning (RL) has been applied to improve large language model (LLM) reasoning, yet the systematic study of how training scales with task difficulty has been hampered by the lack of controlled, scalable environments. Observed LLM shortcomings in long-horizon reasoning have raised the prospect that they are fundamental to the autoregressive transformer architecture. To address this, we introduce ScaleLogic, a synthetic logical reasoning framework that offers independent control over two axes of difficulty: the depth of the required proof planning (i.e., the horizon) and the expressiveness of the underlying logic. Our proposed framework supports a wide range of logics: from simple implication-only logic ("if-then") towards more expressive first-order reasoning with conjunction ("and"), disjunction ("or"), negation ("not"), and universal quantification ("for all"). Using this framework, we show that the RL training compute $T$ follows a power law with respect to reasoning depth $D$ ($T \propto D^γ$, $R^{2} > 0.99$), and that the scaling exponent $γ$ increases monotonically with logical expressiveness, from $1.04$ to $2.60$. On downstream mathematics and general reasoning benchmarks, more expressive training settings yield both larger performance gains (up to $+10.66$ points) and more compute-efficient transfer compared to less expressive settings, demonstrating that what a model is trained on, not just how much it is trained, shapes downstream transfer. We further show that the power-law relationship holds across multiple RL methods, and curriculum-based training substantially improves scaling efficiency. More broadly, our results demonstrate that LLM shortcomings in long-horizon reasoning are not fundamental to the underlying architecture, and can be addressed by improved training methodology and data.

2605.05739 2026-05-19 cs.LG cs.AI cs.CL q-fin.CP 版本更新

Multi-Dimensional Behavioral Evaluation of Agentic Stock Prediction Systems Using Large Language Model Judges with Closed-Loop Reinforcement Learning Feedback

基于大语言模型判官的多维行为评估:用于代理股票预测系统的闭环强化学习反馈

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

发表机构 * School of Electrical Engineering and Computer Science(电气工程与计算机科学学院)

AI总结 本文提出一种多维行为评估方法,通过大语言模型判官评估代理系统决策过程,利用闭环强化学习反馈提升预测性能,验证了方法在股票预测中的有效性。

Comments 17 pages, 5 figures, 14 tables. Manuscript submitted to Applied Artificial Intelligence (Taylor and Francis)

详情
AI中文摘要

代理人工智能系统通过一系列相互依赖的自主决策产生输出,但标准评估仅评估输出而无法诊断底层过程。本文开发了一种行为评估方法,通过评分中间决策过程补充输出级测试。在每个自主决策点记录的行为轨迹被分为五日周期,并由三个大语言模型(LLM)判官根据六个领域特定维度(制度检测、路由、适应、风险校准、策略一致性、错误恢复)评分。一种扰动程序破坏一个维度,同时保持其他五个维度不变,验证了维度特异性;跨模型一致性达到Krippendorff's alpha=0.85。综合行为评分与实际20日夏普比率相关性达到Spearman rho=0.72。闭环框架将缺陷的每维度评分转换为信用分配惩罚,添加到Soft Actor-Critic奖励中。三次微调循环,限制在验证数据上,将持有期MAPE从0.61%降低到0.54%(11.5%相对;p<0.001,d=0.31)在2017至2025的测试期上,显著性在Diebold-Mariano下,通过Giacomini-White局部化到高波动性制度。该方法应用无关,适用于任何可以记录中间决策的代理系统。

英文摘要

Agentic artificial intelligence systems produce outputs through sequences of interdependent autonomous decisions, yet standard evaluation assesses outputs alone and cannot diagnose the underlying process. We develop a behavioral evaluation methodology that complements output-level testing by scoring the intermediate decision process itself. Behavioral traces logged at each autonomous decision point are grouped into five-day episodes and scored along six domain-specific dimensions (regime detection, routing, adaptation, risk calibration, strategy coherence, error recovery) by an ensemble of three large language model (LLM) judges. A perturbation procedure that corrupts one dimension while leaving the other five intact confirms dimension specificity; cross-model agreement reaches Krippendorff's alpha = 0.85. The composite behavioral score correlates at Spearman rho = 0.72 with realized 20-day Sharpe ratio. Closing the loop, the framework converts deficient per-dimension scores into a credit-assigned penalty added to the Soft Actor-Critic reward. Three fine-tuning cycles, confined to validation data, reduce one-day MAPE from 0.61% to 0.54% (11.5% relative; p<0.001, d=0.31) on the held-out 2017 to 2025 test period, significant under Diebold-Mariano and localized by Giacomini-White to the high-volatility regime. The methodology is application-agnostic and applies to any agentic system whose intermediate decisions can be logged.

2605.04375 2026-05-19 eess.SY cs.AI cs.SY 版本更新

Experiment-as-Code Labs: A Declarative Stack for AI-Driven Scientific Discovery

实验作为代码实验室:面向AI驱动科学发现的声明式栈

Zhenning Yang, Yuhan Chen, Patrick Tser Jern Kon, Tongyuan Miao, Hongyi Lin, Venkat Viswanathan, Danai Koutra, Ang Chen

AI总结 本文提出实验作为代码实验室框架,通过声明式配置编译至设备API,实现AI代理与自动化实验室设备的高效协同,推动AI在科学发现中的应用突破。

Comments Experiment-as-Code (EaC) white paper

详情
AI中文摘要

为了释放AI在科学中的全部潜力,必须使代理脱离纯数字环境。代理控制和探索现实世界实验室的能力至关重要,因为物理实验室仍是科学发现的基础。尽管一些任务可在计算机上完成(例如数据分析、运行模拟实验),但顿悟时刻可能在操作实验室仪器时发生(例如当科学家发现意外线索时)。虽然自动化实验室正在兴起,但连接日益强大的AI代理与自动化实验室设备仍需创新。我们提出一种新的范式称为“实验作为代码(EaC)实验室”,其中核心概念是将实验编码为声明式配置,可编译至设备级API。AI代理提出假设和实验,以声明式配置的集合形式编写。系统层执行程序分析、安全检查、资源分配和任务调度。最后,通过激活设备API进行程序化实验。这是一个通用栈,不依赖于特定科学、实验室或仪器,代表了物理、系统和智能层的新型综合,以释放AI在科学中的下一个突破。

英文摘要

To unleash the full potential of AI for Science, we must untether the agents from a purely digital environment. The agent's ability to control and explore in real-world labs is essential because the physical lab remains foundational to scientific discovery. While some tasks can be performed on a computer (e.g., data analysis, running simulated experiments), Eureka moments could occur at any time while operating lab instruments (e.g., when a scientist notices unexpected clues, intuition may prompt a real-time course change). Although autonomous labs are on the rise, which expose programmable APIs to control scientific instruments via software, bridging the gap between increasingly powerful AI agents and automated lab equipment requires innovation that draws insights from computer systems. We propose a new paradigm called ``Experiment-as-Code (EaC) Labs,'' where a core concept is to encode experiments as declarative configurations that can be compiled down to device-level APIs. AI agents come up with hypotheses and experiments, written as an ensemble of declarative configurations. The systems layer performs program analysis, safety checks, resource assignment, and job orchestration. Finally, programmatic experimentation occurs via actuating the device APIs. This is a general stack that is science-, lab-, and instrument-independent, representing a novel synthesis across the physical, systems, and intelligence layers to unleash the next breakthrough in AI for Science.

2605.02832 2026-05-19 cs.AI cs.HC cs.SE 版本更新

HAAS: A Policy-Aware Framework for Adaptive Task Allocation Between Humans and Artificial Intelligence Systems

HAAS:一种面向人类与人工智能系统之间适应性任务分配的政策感知框架

Vicente Pelechano, Antoni Mestre, Manoli Albert, Miriam Gil

发表机构 * organization= Valencian Research Institute for Artificial Intelligence (VRAIN), Universitat Polit\`ecnica de Val\`encia , addressline= Camino de Vera s/n , city= Valencia , country= Spain

AI总结 本文提出HAAS框架,通过规则专家系统和情境带教师算法实现人类与AI任务分配的动态调整,揭示治理并非二元开关,而是可调节设计变量,且适度治理在积累经验后更具竞争力。

详情
AI中文摘要

决定如何在人类和AI系统之间分配工作是组织设计中的核心挑战。大多数方法将其视为二元选择,但实际运营现实更复杂:人类和AI经常共享任务或根据情境、疲劳和利害关系承担互补角色。管理这种分配——平衡效率、监督和人类能力——仍是一个开放问题。本文提出了人类-人工智能适应共生(HAAS),一种用于软件工程和制造中适应性任务分配的实现框架。HAAS结合了两个耦合组件:一个基于规则的专家系统,在任何学习之前强制执行治理约束,以及一个情境带教师算法,从结果反馈中选择可行的协作模式。任务-代理适配通过五个可审计的认知维度和一个五种模式自主性光谱——从人类单独使用到完全自主——嵌入到一个可重复使用的基准中,涵盖两个领域。三个经验发现出现。首先,治理不是二元开关,而是一个可调节的设计变量:更紧的约束可预测地将自主AI任务分配转换为监督协作,具有领域特定的成本和收益。其次,在制造中,更强的治理可以同时提高操作性能并减少疲劳——一种与通常将治理视为纯开销相矛盾的工作负载缓冲效应。第三,没有单一的治理设置在所有情境中占主导地位;适度的治理在学习者在受治理的操作空间内积累经验时变得越来越具有竞争力。这些发现将HAAS定位为一种预部署的工作台,用于在组织承诺之前比较和检查人类-AI分配政策。

英文摘要

Deciding how to distribute work between humans and AI systems is a central challenge in organisational design. Most approaches treat this as a binary choice, yet the operational reality is richer: humans and AI routinely share tasks or take complementary roles depending on context, fatigue, and the stakes involved. Governing that distribution -- balancing efficiency, oversight, and human capability -- remains an open problem. This paper presents Human-AI Adaptive Symbiosis (HAAS), an implemented framework for adaptive task allocation in software engineering and manufacturing. HAAS combines two coupled components: a rule-based expert system that enforces governance constraints before any learning occurs, and a contextual-bandit learner that selects among feasible collaboration modes from outcome feedback. Task-agent fit is represented through five auditable cognitive dimensions and a five-mode autonomy spectrum -- from human-only to fully autonomous -- embedded in a reproducible benchmark spanning both domains. Three empirical findings emerge. First, governance is not a binary switch but a tunable design variable: tighter constraints predictably convert autonomous AI assignments into supervised collaborations, with domain-specific costs and benefits. Second, in manufacturing, stronger governance can improve operational performance and reduce fatigue simultaneously -- a workload-buffering effect that contradicts the usual framing of governance as pure overhead. Third, no single governance setting dominates across all contexts; moderate governance becomes increasingly competitive as the learner accumulates experience within the governed action space. Together, these findings position HAAS as a pre-deployment workbench for comparing and inspecting human--AI allocation policies before organisational commitment.

2605.02167 2026-05-19 cs.LG cs.AI cs.CV 版本更新

Manifold-Aligned Guided Integrated Gradients for Reliable Feature Attribution

面向流形的引导集成梯度用于可靠特征归因

Soyeon Kim, Seongwoo Lim, Kyowoon Lee, Jaesik Choi

发表机构 * Kim Jaechul Graduate School of AI, Korea Advanced Institute of Science and Technology (KAIST)(金 Jaechul人工智能研究生院,韩国科学技术院(KAIST))

AI总结 本文提出面向流形的引导集成梯度(MA-GIG),通过在预训练变分自编码器的潜在空间中构建归因路径,减少非流形区域噪声,提升特征归因的可靠性。

Comments 32 pages, 13 figures, 12 tables. Accepted to ICML 2026; includes appendix

详情
AI中文摘要

特征归因是诊断和信任深度神经网络的核心,集成梯度(IG)因其公理性质而被广泛使用。然而,当基线与输入之间的积分路径经过具有噪声梯度的区域时,IG可能产生不可靠的解释。虽然引导集成梯度通过自适应更新低梯度幅度特征来减少这种敏感性,但输入空间的引导仍会产生偏离数据流形的中间输入。为了解决这一限制,我们提出了面向流形的引导集成梯度(MA-GIG),通过在预训练变分自编码器的潜在空间中构建归因路径。通过解码中间潜在状态,MA-GIG将路径偏向于学习的生成流形,减少对不合理的输入空间区域的暴露。通过定性与定量评估,我们证明MA-GIG通过在接近输入的路径特征上聚合梯度,产生忠实的解释。因此,我们的方法减少了非流形噪声,并在多个数据集和分类器上优于先前的路径归因方法。我们的代码可在https://github.com/leekwoon/ma-gig/上获得。

英文摘要

Feature attribution is central to diagnosing and trusting deep neural networks, and Integrated Gradients (IG) is widely used due to its axiomatic properties. However, IG can yield unreliable explanations when the integration path between a baseline and the input passes through regions with noisy gradients. While Guided Integrated Gradients reduces this sensitivity by adaptively updating low-gradient-magnitude features, input-space guidance still produces intermediate inputs that deviate from the data manifold. To address this limitation, we propose \emph{Manifold-Aligned Guided Integrated Gradients} (MA-GIG), which constructs attribution paths in the latent space of a pre-trained variational autoencoder. By decoding intermediate latent states, MA-GIG biases the path toward the learned generative manifold and reduces exposure to implausible input-space regions. Through qualitative and quantitative evaluations, we demonstrate that MA-GIG produces faithful explanations by aggregating gradients on path features proximal to the input. Consequently, our method reduces off-manifold noise and outperforms prior path-based attribution methods across multiple datasets and classifiers. Our code is available at https://github.com/leekwoon/ma-gig/.

2605.01235 2026-05-19 cs.SD cs.AI 版本更新

MindMelody: A Closed-Loop EEG-Driven System for Personalized Music Intervention

MindMelody:一种基于EEG的闭环个性化音乐干预系统

Yimeng Zhang, Yueru Sun, Haoyu Gu, Zhanpeng Jin

发表机构 * South China University of Technology(南方科技大学)

AI总结 本文提出MindMelody系统,通过EEG信号实时生成个性化音乐,结合Transformer-GNN和RAG-LLM实现情绪感知与音乐生成的闭环控制,提升情感适应性与用户参与度。

详情
AI中文摘要

为应对全球心理健康问题日益严峻的挑战,音乐干预因其非侵入性和成本效益而受到广泛关注,用于情绪调节和心理压力缓解。然而,当前的数字音乐服务依赖静态偏好,无法适应用户瞬时的心理状态。此外,直接将脑电图(EEG)映射到音乐生成仍然具有挑战性,由于配对数据稀缺和缺乏可解释性。为此,我们提出了MindMelody,一个完整的闭环实时系统,用于EEG驱动的个性化音乐干预。MindMelody引入了一个情绪介导的语义桥梁。具体而言,混合Transformer-GNN首先将实时EEG信号解码为全局Valence-Arousal状态和局部时间影响轨迹。这些状态随后被输入配备检索增强生成(RAG)的大型语言模型(LLM)以制定结构化干预计划。随后,一种新的分层EEG控制器将全局情感前缀和局部时间指导注入预训练的音乐骨干,实现细粒度可控的音频合成。关键的是,系统集成了一个连续反馈回路,根据用户的EEG动态实时更新生成参数。大量实验表明,MindMelody提高了控制依从性和情感匹配,并在短期聆听设置中获得了更高的感知效用,表明其作为适应性情感感知音乐生成框架的潜力。

英文摘要

Driven by the escalating global burden of mental health conditions, music-based interventions have attracted significant attention as a non-invasive, cost-effective modality for emotion regulation and psychological stress relief. However, current digital music services rely on static preferences and fail to adapt to users' instantaneous psychological states. Furthermore, directly mapping electroencephalography (EEG) to music generation remains challenging due to severe paired-data scarcity and a lack of interpretability. To address these limitations, we propose MindMelody, a fully functional, closed-loop real-time system for EEG-driven personalized music intervention. MindMelody introduces an emotion-mediated semantic bridge. Specifically, a hybrid Transformer-GNN first decodes real-time EEG signals into global Valence-Arousal states and local temporal affect trajectories. These states are then fed into a Retrieval-Augmented Generation (RAG)-equipped Large Language Model (LLM) to formulate structured intervention plans. Subsequently, a novel Hierarchical EEG Controller injects global affect prefixes and local temporal guidance into a pretrained music backbone, enabling fine-grained controllable audio synthesis. Crucially, the system incorporates a continuous feedback loop that updates generation parameters on the fly based on the user's evolving EEG dynamics. Extensive experiments show that MindMelody improves control adherence and emotional alignment, and receives higher perceived helpfulness in a short-term listening setting, suggesting its promise as an adaptive affect-aware music generation framework.

2605.00793 2026-05-19 eess.IV cs.AI cs.CV 版本更新

Unsupervised Denoising of Real Clinical Low Dose Liver CT with Perceptual Attention Networks

无监督的实时临床低剂量肝CT去噪与感知注意力网络

Zhilin Guan, Wei Zhang

发表机构 * Department of Computing(计算系) Harbin Institute of Technology(哈尔滨工业大学)

AI总结 本文提出基于感知注意力网络的无监督低剂量肝CT去噪框架,结合U-Net、注意力机制和残差网络,通过感知损失提升医疗图像特征提取,利用真实临床数据集和医学评价标准验证方法有效性,满足临床需求。

Comments 8 pages, 10 figures, 5 tables

详情
AI中文摘要

随着深度学习的发展,医学图像处理已广泛用于辅助临床研究。本文聚焦于利用深度学习进行低剂量计算机断层扫描(CT)的去噪问题。尽管低剂量CT减少了患者辐射暴露,但也引入了更多噪声,可能干扰医生的视觉解读并影响诊断结果。为了解决这个问题,受Cycle-GAN启发,本文提出了一种端到端的无监督低剂量CT去噪框架。该框架结合了U-Net结构进行多尺度特征提取、注意力机制进行特征融合、残差网络进行特征转换,并引入感知损失以提升网络对医疗图像特征的适应性。此外,我们构建了真实低剂量CT数据集,并设计了大量对比实验,通过图像基评估指标和医学评价标准验证所提方法。与经典方法相比,本文的主要优势在于解决了真实临床数据不能直接用于监督学习的限制,同时仍实现了优异的性能。实验结果也由影像医师专业评估,满足临床需求。

英文摘要

With the development of deep learning, medical image processing has been widely used to assist clinical research. This paper focuses on the denoising problem of low-dose computed tomography using deep learning. Although low-dose computed tomography reduces radiation exposure to patients, it also introduces more noise, which may interfere with visual interpretation by physicians and affect diagnostic results. To address this problem, inspired by Cycle-GAN for unsupervised learning, this paper proposes an end-to-end unsupervised low-dose computed tomography denoising framework. The proposed framework combines a U-Net structure for multi-scale feature extraction, an attention mechanism for feature fusion, and a residual network for feature transformation. It also introduces perceptual loss to improve the network for the characteristics of medical images. In addition, we construct a real low-dose computed tomography dataset and design a large number of comparative experiments to validate the proposed method, using both image-based evaluation metrics and medical evaluation criteria. Compared with classical methods, the main advantage of this paper is that it addresses the limitation that real clinical data cannot be directly used for supervised learning, while still achieving excellent performance. The experimental results are also professionally evaluated by imaging physicians and meet clinical needs.

2604.26904 2026-05-19 cs.CL cs.AI cs.LG 版本更新

ClawGym: A Scalable Framework for Building Effective Claw Agents

ClawGym:一种构建有效Claw代理的可扩展框架

Fei Bai, Huatong Song, Shuang Sun, Daixuan Cheng, Yike Yang, Chuan Hao, Renyuan Li, Feng Chang, Yuan Wei, Ran Tao, Bryan Dai, Jian Yang, Wayne Xin Zhao, Ji-Rong Wen

发表机构 * Gaoling School of Artificial Intelligence(人工智能学院) Renmin University of China(中国人民大学) IQuest Research(IQuest研究) Beihang University(北航)

AI总结 本文提出ClawGym框架,用于构建Claw式个人代理的全生命周期,通过合成可验证训练数据和强化学习方法提升代理效能。

详情
AI中文摘要

Claw-style环境支持在本地文件、工具和持久工作区状态上进行多步骤工作流。然而,围绕这些环境的可扩展开发受限于缺乏系统框架,尤其是合成可验证训练数据并将其与代理训练和诊断评估集成的框架。为解决这一挑战,我们提出了ClawGym,一种支持Claw式个人代理全生命周期的可扩展框架。具体而言,我们构建了ClawGym-SynData,一个包含13500个过滤任务的多样化数据集,这些任务由基于人物驱动的意图和技能基础操作合成,配以现实的模拟工作区和混合验证机制。我们随后通过在黑箱滚动轨迹上进行监督微调训练了一组有能力的Claw式模型,称为ClawGym-Agents,并进一步通过轻量级管道探索强化学习,该管道在每项任务的沙箱中并行化滚动。为了支持可靠的评估,我们进一步构建了ClawGym-Bench,一个通过自动化过滤和人工LLM审查校准的200个实例的基准。相关资源已发布在https://github.com/ClawGym。

英文摘要

Claw-style environments support multi-step workflows over local files, tools, and persistent workspace states. However, scalable development around these environments remains constrained by the absence of a systematic framework, especially one for synthesizing verifiable training data and integrating it with agent training and diagnostic evaluation. To address this challenge, we present ClawGym, a scalable framework that supports the full lifecycle of Claw-style personal agent development. Concretely, we construct ClawGym-SynData, a diverse dataset of 13.5K filtered tasks synthesized from persona-driven intents and skill-grounded operations, paired with realistic mock workspaces and hybrid verification mechanisms. We then train a family of capable Claw-style models, termed ClawGym-Agents, through supervised fine-tuning on black-box rollout trajectories, and further explore reinforcement learning via a lightweight pipeline that parallelizes rollouts across per-task sandboxes. To support reliable evaluation, we further construct ClawGym-Bench, a benchmark of 200 instances calibrated through automated filtering and human-LLM review. Relevant resources have been released at https://github.com/ClawGym.

2604.25858 2026-05-19 cs.LG cs.AI 版本更新

Investigation into In-Context Learning Capabilities of Transformers

对Transformer在上下文学习能力的调查

Rushil Chandrupatla, Leo Bangayan, Sebastian Leng

AI总结 本文通过系统实验研究了Gaussian-mixture二分类任务中的上下文学习,分析了输入维度、上下文示例数量和预训练任务数量对上下文测试准确率的影响,并探讨了良性过拟合现象。

详情
AI中文摘要

Transformer在上下文学习(ICL)中展现出强大的能力,使模型能够仅通过推理时提供的输入输出对解决之前未见过的任务。尽管先前的理论工作已经确立了在上下文内进行线性分类的条件,但指导这一机制何时成功的经验性扩展行为仍不够明确。本文对Gaussian-mixture二分类任务的上下文学习进行了系统性的实证研究。基于Frei和Vardi(2024)的理论框架,我们分析了上下文测试准确率如何依赖于三个基本因素:输入维度、上下文示例数量以及预训练任务数量。通过受控的合成设置和线性上下文分类器公式,我们隔离了模型在仅凭上下文自身推断任务结构时成功的几何条件。我们还研究了良性过拟合现象的出现,其中模型记忆了嘈杂的上下文标签,同时在干净的测试数据上仍能保持良好的泛化性能。通过在维度性、序列长度、任务多样性以及信噪比范围内进行广泛的扫描,我们识别了这种现象出现的参数区域,并描述了其如何依赖于数据几何和训练暴露。我们的结果为上下文分类的扩展行为提供了全面的经验图谱,突显了维度性、信号强度和上下文信息在决定上下文学习何时成功、何时失败中的关键作用。

英文摘要

Transformers have demonstrated a strong ability for in-context learning (ICL), enabling models to solve previously unseen tasks using only example input output pairs provided at inference time. While prior theoretical work has established conditions under which transformers can perform linear classification in-context, the empirical scaling behavior governing when this mechanism succeeds remains insufficiently characterized. In this paper, we conduct a systematic empirical study of in-context learning for Gaussian-mixture binary classification tasks. Building on the theoretical framework of Frei and Vardi (2024), we analyze how in-context test accuracy depends on three fundamental factors: the input dimension, the number of in-context examples, and the number of pre-training tasks. Using a controlled synthetic setup and a linear in-context classifier formulation, we isolate the geometric conditions under which models successfully infer task structure from context alone. We additionally investigate the emergence of benign overfitting, where models memorize noisy in-context labels while still achieving strong generalization performance on clean test data. Through extensive sweeps across dimensionality, sequence length, task diversity, and signal-to-noise regimes, we identify the parameter regions in which this phenomenon arises and characterize how it depends on data geometry and training exposure. Our results provide a comprehensive empirical map of scaling behavior in in-context classification, highlighting the critical role of dimensionality, signal strength, and contextual information in determining when in-context learning succeeds and when it fails.

2604.21937 2026-05-19 cs.AI cs.MA 版本更新

MolClaw: An Autonomous Agent with Hierarchical Skills for Drug Molecule Evaluation, Screening, and Optimization

MolClaw:一种具有分层技能的自主代理,用于药物分子评估、筛选和优化

Lisheng Zhang, Lilong Wang, Xiangyu Sun, Wei Tang, Haoyang Su, Yuehui Qian, Qikui Yang, Qingsong Li, Zhenyu Tang, Haoran Sun, Yingnan Han, Yankai Jiang, Wenjie Lou, Bowen Zhou, Xiaosong Wang, Lei Bai, Zhengwei Xie

发表机构 * Peking University Health Science Center, Peking University, Beijing, China(北京大学北京医院科学中心,北京大学,北京,中国) Shanghai AI Laboratory, Shanghai, China(上海人工智能实验室,上海,中国) Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China(北京大学先进跨学科研究学院,北京大学,北京,中国)

AI总结 MolClaw通过分层技能架构整合30余种领域资源,实现药物分子评估、筛选和优化的自动化,其在复杂工作流中的表现优于现有AI代理。

Comments 28 pages, 8 figures. Code and data will be released

详情
AI中文摘要

计算药物发现,特别是药物分子筛选和优化的复杂工作流,需要协调数十种专用工具进行多步骤流程,但当前AI代理难以维持稳健性能并在高复杂度场景中表现不佳。本文提出MolClaw,一种自主代理,通过三级分层技能架构(共70个技能)整合超过30种领域资源,促进运行时的长期交互:工具级技能标准化原子操作,工作流级技能将它们组成经过验证的流程并包含质量检查和反思,学科级技能提供指导规划和验证的科学原理。此外,我们引入MolBench,一个包含分子筛选、优化和端到端发现挑战的基准,涵盖8到50+个连续工具调用。MolClaw在所有指标上均取得最佳性能,消融研究证实收益集中在需要结构化流程的任务上,而消失在可由随机脚本解决的任务上,确立了工作流协调能力作为AI驱动药物发现的主要能力瓶颈。

英文摘要

Computational drug discovery, particularly the complex workflows of drug molecule screening and optimization, requires orchestrating dozens of specialized tools in multi-step workflows, yet current AI agents struggle to maintain robust performance and consistently underperform in these high-complexity scenarios. Here we present MolClaw, an autonomous agent that leads drug molecule evaluation, screening, and optimization. It unifies over 30 specialized domain resources through a three-tier hierarchical skill architecture (70 skills in total) that facilitates agent long-term interaction at runtime: tool-level skills standardize atomic operations, workflow-level skills compose them into validated pipelines with quality check and reflection, and a discipline-level skill supplies scientific principles governing planning and verification across all scenarios in the field. Additionally, we introduce MolBench, a benchmark comprising molecular screening, optimization, and end-to-end discovery challenges spanning 8 to 50+ sequential tool calls. MolClaw achieves state-of-the-art performance across all metrics, and ablation studies confirm that gains concentrate on tasks that demand structured workflows while vanishing on those solvable with ad hoc scripting, establishing workflow orchestration competence as the primary capability bottleneck for AI-driven drug discovery.

2604.19219 2026-05-19 cs.CR cs.AI cs.DC cs.LG 版本更新

Sherpa.ai Privacy-Preserving Multi-Party Entity Alignment without Intersection Disclosure for Noisy Identifiers

Sherpa.ai 保护隐私的多方实体对齐无需披露交集

Daniel M. Jimenez-Gutierrez, Dario Pighin, Enrique Zuazua, Georgios Kellaris, Joaquin Del Rio, Oleksii Sliusarenko, Xabi Uribe-Etxebarria

AI总结 本文提出Sherpa.ai多方PSU协议,用于垂直联邦学习中的隐私保护实体对齐,实现精确和噪声匹配,同时隐藏交集成员信息,适用于多机构医疗疾病检测等场景。

详情
AI中文摘要

联邦学习(FL)使多个参与方在不集中原始数据的情况下协同训练模型。FL主要有两种范式:水平FL(HFL),所有参与者共享相同特征空间但持有不同样本;垂直FL(VFL),各参与方持有互补特征的相同样本集。VFL训练的前提是隐私保护实体对齐(PPEA),即在不暴露共享样本的情况下建立跨参与方的共同索引。传统私有集合交集(PSI)实现对齐但泄露交集成员信息,暴露敏感数据集关系。标准私有集合并集(PSU)通过在标识符并集上对齐而非交集来缓解此风险。然而,现有方法通常局限于两方或缺乏容错匹配支持。本文介绍Sherpa.ai多方PSU协议,一种PPEA方法,隐藏交集成员信息并实现精确和噪声匹配。该协议将两方方法扩展到多方,通信开销低,并提供两种变体:顺序保持版本用于精确对齐,无序版本容忍拼写和格式差异。我们证明了正确性和隐私性,分析了通信和计算(指数)复杂度,并正式化了从本地记录到共享索引空间的通用索引映射。该多方PSU为现实中的VFL部署提供了一种可扩展、数学基础的PPEA协议,如多机构医疗疾病检测、银行与保险公司的协作风险建模、电信与金融领域的跨域欺诈检测,同时保护交集隐私。

英文摘要

Federated Learning (FL) enables collaborative model training among multiple parties without centralizing raw data. There are two main paradigms in FL: Horizontal FL (HFL), where all participants share the same feature space but hold different samples, and Vertical FL (VFL), where parties possess complementary features for the same set of samples. A prerequisite for VFL training is privacy-preserving entity alignment (PPEA), which establishes a common index of samples across parties (alignment) without revealing which samples are shared between them. Conventional private set intersection (PSI) achieves alignment but leaks intersection membership, exposing sensitive relationships between datasets. The standard private set union (PSU) mitigates this risk by aligning on the union of identifiers rather than the intersection. However, existing approaches are often limited to two parties or lack support for typo-tolerant matching. In this paper, we introduce the Sherpa.ai multi-party PSU protocol for VFL, a PPEA method that hides intersection membership and enables both exact and noisy matching. The protocol generalizes two-party approaches to multiple parties with low communication overhead and offers two variants: an order-preserving version for exact alignment and an unordered version tolerant to typographical and formatting discrepancies. We prove correctness and privacy, analyze communication and computational (exponentiation) complexity, and formalize a universal index mapping from local records to a shared index space. This multi-party PSU offers a scalable, mathematically grounded protocol for PPEA in real-world VFL deployments, such as multi-institutional healthcare disease detection, collaborative risk modeling between banks and insurers, and cross-domain fraud detection between telecommunications and financial institutions, while preserving intersection privacy.

2604.14215 2026-05-19 cs.IR cs.AI 版本更新

PriHA: A RAG-Enhanced LLM Framework for Primary Healthcare Assistant in Hong Kong

PriHA:一种增强型大语言模型框架,用于香港初级医疗服务助手

Richard Wai Cheung Chan, Shanru Lin, Ya-nan Ma, Hao Chen, Liangjun Jiang, Wenqi Fan

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Haikou Affiliated Hospital of Central South University Xiangya School of Medicine(中南大学湘雅医学院海口附属医院) Sun Yat-Sen University Cancer Center(中山大学肿瘤中心)

AI总结 本文提出PriHA框架,通过检索增强生成技术解决香港初级医疗服务中指南碎片化问题,提升信息访问准确性与清晰度。

Comments Accepted to PAKDD 2026

详情
AI中文摘要

为应对公共健康支出持续上升,香港特区政府正将战略重点转向初级医疗,并鼓励公众利用社区资源自我管理健康。然而,官方临床指南分散在不同部门和格式中,造成显著的访问障碍。尽管通用大语言模型(如ChatGPT和DeepSeek)在信息可及性方面有潜力,但因缺乏本地化和领域特定的知识,容易生成事实性不准确的内容。为此,我们提出了一种检索增强生成增强型大语言模型系统,作为香港初级医疗助手(PriHA)。具体而言,提出了一种三阶段流程,利用查询优化器泛化用户意图导向的子查询,随后采用新颖的双检索增强生成(DRAG)架构进行混合源检索和上下文重组生成。全面的实验和详细案例研究表明,所提出的方法在准确性和清晰度方面均优于消融和基线。本研究为探索其他高风险、本地化应用场景提供了可靠的可追溯对话检索框架。

英文摘要

To address the unsustainable rise in public health expenditures, the Hong Kong SAR Government is shifting its strategic focus to primary healthcare and encouraging citizens to use community resources to self-manage their health. However, official clinical guidelines are fragmented across disparate departments and formats, creating significant access barriers. While general-purpose Large Language Models (LLMs) such as ChatGPT and DeepSeek offer potential solutions for information accessibility, they are prone to generating factually inaccurate content due to a lack of localized and domain-specific knowledge. To this end, we propose a Retrieval-Augmented Generation-Enhanced LLM system as Primary Healthcare Assistant (PriHA) in Hong Kong. Specifically, a tri-stage pipeline is proposed that leverages a query optimizer to generalize user intent-oriented sub-queries, followed by a novel Dual Retrieval Augmented Generation (DRAG) architecture for mixed-source retrieval and context-reorganized generation. Comprehensive experiments and a detailed case study demonstrate that our proposed method can outperform both ablations and baseline in terms of accuracy and clarity. Our research provides a reliable and traceable dialogue retrieval framework for exploring other high-risk, localized application scenarios.

2604.12254 2026-05-19 cs.CR cs.AI 版本更新

SpanKey: Dynamic Key Space Conditioning for Neural Network Access Control

SpanKey:神经网络访问控制的动态密钥空间条件化

WenBin Yan

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 SpanKey通过动态密钥空间条件化实现轻量级推理门控,不加密权重且不追求门控推理的准确率。方法通过秘密密钥条件化激活,利用基矩阵定义低维密钥子空间,并通过多层设计空间进行分析,提出密钥吸收失效模式及实验验证。

Comments 15 pages, 1 figure, multiple tables. Preprint (not yet published in a journal). Affiliation: University of Colorado Boulder. Code: https://github.com/mindmemory-ai/dksc

详情
AI中文摘要

SpanKey通过动态密钥空间条件化实现轻量级推理门控,不加密权重且不追求门控推理的准确率。方法通过秘密密钥条件化激活,利用基矩阵定义低维密钥子空间,并通过多层设计空间进行分析,提出密钥吸收失效模式及实验验证。

英文摘要

SpanKey is a lightweight way to gate inference without encrypting weights or chasing leaderboard accuracy on gated inference. The idea is to condition activations on secret keys. A basis matrix $B$ defines a low-dimensional key subspace $Span(B)$; during training we sample coefficients $α$ and form keys $k=α^\top B$, then inject them into intermediate activations with additive or multiplicative maps and strength $γ$. Valid keys lie in $Span(B)$; invalid keys are sampled outside that subspace. We make three points. (i) Mechanism: subspace key injection and a multi-layer design space. (ii) Failure mode: key absorption, together with two analytical results (a Beta-energy split and margin-tail diagnostics), explains weak baseline separation in energy and margin terms -- these are not a security theorem. iii) Deny losses and experiments: Modes A--C and extensions, with CIFAR-10 ResNet-18 runs and MNIST ablations for Mode B. We summarize setup and first-order analysis, injectors, absorption, deny losses and ablations, a threat discussion that does not promise cryptography, and closing remarks on scale. Code: \texttt{https://github.com/mindmemory-ai/dksc}

2604.11043 2026-05-19 cs.AI 版本更新

EmergentBridge: Improving Zero-Shot Cross-Modal Transfer in Unified Multimodal Embedding Models

EmergentBridge: 提升统一多模态嵌入模型中的零样本跨模态迁移

Jincheng Xie, Xingchen Xiao, Runheng Liu, Zhongyi Huang, Yu Zheng, Heyan Huang

发表机构 * Tsinghua University(清华大学) School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) JD iCity, JD Technology, JD Intelligent Cities Research(京东i城、京东科技、京东智能城市研究院)

AI总结 本文提出EmergentBridge框架,通过学习噪声桥梁锚点和子空间对齐,提升未配对模态对的零样本迁移性能,无需 exhaustive pairwise 监督。

详情
AI中文摘要

统一的多模态嵌入空间支撑了跨模态检索和零样本识别等实际应用。然而,在许多实际部署中,监督仅适用于少量模态对(例如图像-文本),导致未配对模态对(例如音频↔深度、红外↔音频)弱连接,从而在零样本迁移中表现不佳。为了解决这种稀疏配对情况,本文提出了EmergentBridge,一种嵌入层面的桥梁框架,能够在不需 exhaustive pairwise 监督的情况下提升这些未配对模态对的性能。我们的关键观察是,直接对新模态与合成代理嵌入对齐会引入梯度干扰,破坏现有检索/分类依赖的锚点对齐结构。EmergentBridge通过(i)学习从锚点嵌入生成噪声桥梁锚点(已对齐模态的代理嵌入)的映射,以及(ii)在锚点对齐方向的正交子空间内强制代理对齐,从而在保持锚点对齐的同时加强非锚点连接。在九个涵盖多种模态的数据集上,EmergentBridge在零样本分类和检索中均优于现有绑定基线,展示了强大的涌现对齐。

英文摘要

Unified multimodal embedding spaces underpin practical applications such as cross-modal retrieval and zero-shot recognition. In many real deployments, however, supervision is available only for a small subset of modality pairs (e.g., image--text), leaving \emph{unpaired} modality pairs (e.g., audio$\leftrightarrow$depth, infrared$\leftrightarrow$audio) weakly connected and thus performing poorly on zero-shot transfer. Addressing this sparse-pairing regime is therefore essential for scaling unified embedding systems to new tasks without curating exhaustive pairwise data. We propose \textbf{EmergentBridge}, an embedding-level bridging framework that improves performance on these unpaired pairs \emph{without requiring exhaustive pairwise supervision}. Our key observation is that naively aligning a new modality to a synthesized proxy embedding can introduce \emph{gradient interference}, degrading the anchor-alignment structure that existing retrieval/classification relies on. EmergentBridge addresses this by (i) learning a mapping that produces a \emph{noisy bridge anchor} (a proxy embedding of an already-aligned modality) from an anchor embedding, and (ii) enforcing proxy alignment only in the subspace orthogonal to the anchor-alignment direction, preserving anchor alignment while strengthening non-anchor connectivity. Across nine datasets spanning multiple modalities, EmergentBridge consistently outperforms prior binding baselines on zero-shot classification and retrieval, demonstrating strong emergent alignment.

2604.10825 2026-05-19 cs.AI 版本更新

CheeseBench: Evaluating Large Language Models on Rodent Behavioral Neuroscience Paradigms

CheeseBench:在啮齿类行为神经科学范式上评估大语言模型

Zacharie Bugaud

发表机构 * Astera Institute(Astera研究院)

AI总结 CheeseBench通过九种经典行为神经科学范式评估大语言模型,发现模型规模、上下文历史、提示方式和架构对性能有显著影响,且当前模型在空间导航等任务上仍低于啮齿类动物基准。

Comments 8 pages, 6 figures, 4 tables

详情
AI中文摘要

我们介绍了CheeseBench,一个评估大语言模型(LLMs)在九种经典行为神经科学范式( Morris水迷宫、Barnes迷宫、T迷宫、径向臂迷宫、星形迷宫、操作舱、穿梭箱、条件性位置偏好和延迟非匹配到样本)上的基准,涵盖六个认知维度。每个任务均基于同行评审的啮齿类动物协议,具有近似的动物基准线。代理接收一个统一的系统提示,没有特定任务指令,必须仅通过ASCII文本观察和奖励信号发现目标,类似于将啮齿类动物置于陌生设备中。我们评估了六个开源权重的LLMs(3B到72B参数)在基于文本的ASCII渲染中,并与随机基线和基于图的强化学习代理进行比较。我们的最佳模型(Qwen2.5-VL-7B)在ASCII输入上的平均成功率为52.6%,相比随机代理的32.1%和近似啮齿类动物基准的78.9%。我们发现(1)超过7B的规模带来 diminishing returns,(2)更长的上下文历史会降低性能,(3)链式推理提示有害而非有益,(4)视觉-语言架构在7B时提供优势,但在32B时有害。由于同一体系的性能在接口参数上从20%到57%波动,这些结果描述的是代理加接口系统,而非孤立模型。在这一统一的零样本ASCII协议下,当前开源权重LLM代理在空间导航等任务上仍低于近似啮齿类动物基准值。

英文摘要

We introduce CheeseBench, a benchmark that evaluates large language models (LLMs) on nine classical behavioral neuroscience paradigms (Morris water maze, Barnes maze, T-maze, radial arm maze, star maze, operant chamber, shuttle box, conditioned place preference, and delayed non-match to sample), spanning six cognitive dimensions. Each task is grounded in peer-reviewed rodent protocols with approximate animal baselines. The agent receives a unified system prompt with no task-specific instructions and must discover goals purely from ASCII text observations and reward signals, much like a rodent placed into an unfamiliar apparatus. We evaluate six open-weight LLMs (3B to 72B parameters) on text-based ASCII renderings and compare against both a random baseline and a graph-based reinforcement learning agent. Our best model (Qwen2.5-VL-7B) reaches 52.6% average success on ASCII input, compared to 32.1% for random agents and 78.9% for approximate rodent baselines. We find that (1) scaling beyond 7B yields diminishing returns, (2) longer context history degrades performance, (3) chain-of-thought prompting hurts rather than helps, and (4) a vision-language architecture provides an advantage at 7B but hurts at 32B. Because the same model's performance ranges from 20% to 57% depending on interface parameters alone, these results characterize the agent-plus-interface system, not the model in isolation. Under this unified zero-shot ASCII protocol, current open-weight LLM agents remain well below approximate rodent reference values, particularly on tasks requiring spatial navigation and within-trial state tracking.

2604.09297 2026-05-19 cs.SE cs.AI 版本更新

SkillMOO: Multi-Objective Optimization of Agent Skills for Software Engineering

SkillMOO:软件工程中代理技能的多目标优化

Jingzhi Gong, Ruizhen Gu, Zhiwei Fei, Yazhuo Cao, Lukas Twist, Alina Geiger, Shuo Han, Dominik Sobania, Federica Sarro, Jie M. Zhang

发表机构 * King's College London(伦敦国王学院) Queen's University Belfast(贝尔法斯特女王大学) Nanjing University(南京大学) Johannes Gutenberg University Mainz(美因茨约翰尼斯·古滕贝格大学) University College London(伦敦大学学院) University of Duisburg-Essen(杜伊斯堡- Essen大学)

AI总结 本文提出SkillMOO框架,通过LLM提议的编辑和NSGA-II算法优化代理技能包,提升任务成功率并降低推理成本。

详情
AI中文摘要

代理技能越来越多地用于配置软件工程任务的编码代理,但当前实践将其视为静态的手工资产或仅基于通过率进化。本文认为软件工程代理技能包可作为多目标搜索对象,并提出SkillMOO框架,通过LLM提议的编辑和NSGA-II算法在通过率和推理成本上进行帕累托选择。在所有16个SkillsBench SE任务上评估,SkillMOO在12个非零通过任务中取得最高通过率排名,同时将成本降低31.7%,通过率提升达21个百分点。分析38个技能编辑显示,剪枝和替换主导成功操作,为技能包设计提供可操作原则。当前不进行成本意识验证的技能部署实践限制了更优配置的探索,推动了新的成本意识、基于搜索的技能工程类别的发展。

英文摘要

Agent skills are increasingly used to configure coding agents for software engineering (SE) tasks, yet current practice treats them as static, hand-crafted assets, or evolved on pass rate alone. This is insufficient: a skill can improve task success while substantially raising token cost, or introducing misleading guidance. We argue that SE agent skill bundles can be treated as multi-objective search objects and present SkillMOO, a framework that evolves skill bundles through LLM-proposed edits and NSGA-II Pareto selection on pass rate and inference cost. Evaluated across all 16 SkillsBench SE tasks, SkillMOO achieves the top pass rate rank on 11 of 12 non-zero-pass tasks while achieving cost reductions of up to 31.7% over static bundles, with pass rate gains up to 21 percentage points. Analysis of 38 skill edits shows that pruning and substitution dominate successful operations, offering actionable principles for skill bundle design. Thereby, the current practice of deploying skills without cost-aware validation leaves better skill configurations unexplored, motivating a new class of cost-aware, search-based skill engineering.

2604.04202 2026-05-19 cs.LG cs.AI cs.CL 版本更新

ClawArena: Benchmarking AI Agents in Evolving Information Environments

ClawArena:在演化的信息环境中评估AI代理的基准测试

Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao

发表机构 * UNC-Chapel Hill(北卡罗来纳州立大学夏洛特分校) University of California, Santa Cruz(加州大学圣克鲁兹分校) University of California, Berkeley(加州大学伯克利分校)

AI总结 ClawArena评估AI代理在信息环境动态变化中的能力,通过多源冲突推理、动态信念更新和隐式个性化三个挑战,测试代理在多通道会话、工作区文件和阶段更新中的表现。

详情
AI中文摘要

部署为持久助手的AI代理必须在信息环境演变时保持正确信念。实际中,证据分散在异构来源中,常相互矛盾,新信息可能推翻先前结论,用户偏好通过修正而非明确指令出现。现有基准大多假设静态、单一权威环境,不评估代理能否应对这种复杂性。我们引入ClawArena,一个评估AI代理在演化的信息环境中的基准。每个场景保持完整的隐藏真实情况,同时仅向代理暴露噪声、部分且有时矛盾的痕迹,跨多通道会话、工作区文件和阶段更新。评估围绕三个相互关联的挑战:多源冲突推理、动态信念更新和隐式个性化,其相互作用产生14类问题分类。两种问题格式,多选(集合选择)和基于shell的可执行检查,测试推理和工作区定位。ClawArena包含12个多轮场景,覆盖337个评估轮次和45个动态更新,评估五个代理框架和18种语言模型,来自专有、社区可访问和自托管来源。实验表明,模型能力在模型间产生29分的分数范围,而框架设计最多产生24分的范围,MetaClaw的技能叠加可靠提高分数而不降低准确性,信念更新难度由更新设计策略而非更新量决定。代码可在https://github.com/aiming-lab/ClawArena获取。

英文摘要

AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. ClawArena comprises 12 multi-turn scenarios spanning 337 evaluation rounds with 45 dynamic updates, evaluated across five agent frameworks and 18 language models from proprietary, community-accessible, and self-hosted sources. Experiments show that model capability accounts for a 29-point score range across models while framework design accounts for up to a 24-point range, that MetaClaw's skill overlay reliably improves score without degrading accuracy, and that belief revision difficulty is determined by update design strategy rather than update volume. Code is available at https://github.com/aiming-lab/ClawArena.

2604.01674 2026-05-19 cs.AI 版本更新

Can Heterogeneous Language Models Be Fused?

异构语言模型能否融合?

Shilian Chen, Jie Zhou, Qin Chen, Wen Wu, Xin Li, Qi Feng, Liang He

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 本文提出HeteroFusion方法,解决异构语言模型融合中的架构不匹配和冲突问题,通过功能模块对齐和冲突感知去噪,实现稳定高效的模型融合。

详情
AI中文摘要

模型融合旨在将多个专家模型整合为一个单一模型,该模型继承其互补优势,而无需在推理时间付出装入的代价。最近的进展表明,当所有源模型都是同质的,即源自相同的预训练骨干网络,因此共享对齐的参数坐标或兼容的任务向量时,融合可以非常有效。然而,在开放模型生态系统中,这种假设日益不现实,有用的专家往往基于不同的家族,如Llama、Qwen和Mistral。在这种异构设置中,直接在权重空间中融合变得不成立,因为存在架构不匹配、潜在基础错位和跨源冲突放大问题。我们通过HeteroFusion方法解决异构语言模型融合问题,该方法包含两个关键组件:基于拓扑的对齐,通过匹配功能模块结构而非原始张量坐标来跨异构骨干网络转移知识,以及冲突感知去噪,通过融合过程抑制不兼容或嘈杂的转移信号。我们进一步提供分析证明,保留目标适配器基础并预测结构化更新可以导致稳定且良好的条件转移过程。在异构转移、多源融合、嘈杂源鲁棒性和跨家族泛化设置中,HeteroFusion始终优于强大的融合、融合和装入基线。

英文摘要

Model merging aims to integrate multiple expert models into a single model that inherits their complementary strengths without incurring the inference-time cost of ensembling. Recent progress has shown that merging can be highly effective when all source models are \emph{homogeneous}, i.e., derived from the same pretrained backbone and therefore share aligned parameter coordinates or compatible task vectors. Yet this assumption is increasingly unrealistic in open model ecosystems, where useful experts are often built on different families such as Llama, Qwen, and Mistral. In such \emph{heterogeneous} settings, direct weight-space fusion becomes ill-posed due to architectural mismatch, latent basis misalignment, and amplified cross-source conflict. We address this problem with \texttt{HeteroFusion} for heterogeneous language model fusion, which consists of two key components: topology-based alignment that transfers knowledge across heterogeneous backbones by matching functional module structures instead of raw tensor coordinates, and conflict-aware denoising that suppresses incompatible or noisy transfer signals during fusion. We further provide analytical justification showing that preserving the target adapter basis while predicting structured updates leads to a stable and well-conditioned transfer process. Across heterogeneous transfer, multi-source fusion, noisy-source robustness, and cross-family generalization settings, \texttt{HeteroFusion} consistently outperforms strong merging, fusion, and ensemble baselines.

2604.01404 2026-05-19 cs.CL cs.AI 版本更新

Friends and Grandmothers in Silico: Localizing Entity Cells in Language Models

硅世界中的朋友与祖母:在语言模型中本地化实体细胞

Itay Yona, Dan Barzilay, Michael Karasik, Mor Geva

发表机构 * Mentaleap Independent Researcher(独立研究者) Tel Aviv University(特拉维夫大学)

AI总结 研究通过寻找稀疏的实体选择性MLP神经元(实体细胞)探讨语言模型如何检索实体特定事实,并发现这些细胞在早期层聚集,具有因果作用。

详情
AI中文摘要

语言模型如何从参数中检索实体特定事实?我们通过寻找稀疏、实体选择性的MLP神经元(称为实体细胞,类比神经科学中的'祖母细胞'假说)来探讨这一问题,并测试这些细胞在事实回忆中的因果作用。我们通过在不同提示下对同一实体的激活一致性对MLP神经元进行排名,跨七个模型在Curated PopQA子集上应用此过程。所有模型中,本地化神经元主要集中在早期层,这一经验模式并非由架构强制。使用Qwen2.5-7B base作为模型生物,我们发现最清晰的因果证据:抑制局部细胞会擦除其匹配实体的回忆,而其他保持不变;激活单个细胞足以恢复大多数实体的正确知识,即使实体不在上下文中。相同的细胞在别名、缩写、拼写错误和多语言表层形式下仍能恢复,并在指令微调中保持稳定,表明它们编码的是实体身份而非表层标记模式。因果信号在不同模型家族中变化,指出了架构差异如何影响实体知识的组织。这些发现为理解、控制和纠正语言模型中的事实知识提供了具体且可解释的访问点,并与神经科学中关于稀疏编码概念的长期问题建立了令人惊讶的经验平行。

英文摘要

How do language models retrieve entity-specific facts from their parameters? We investigate this question by searching for sparse, entity-selective MLP neurons - which we call entity cells, by analogy to the "grandmother cell" hypothesis in neuroscience - and testing whether they play a causal role in factual recall. We localize candidate entity cells by ranking MLP neurons for activation consistency across varied prompts about the same entity, applying this procedure across seven models on a curated subset of PopQA. In all models, localized neurons cluster predominantly in early layers, an empirical pattern not imposed by the architecture. Using Qwen2.5-7B base as a model organism, we find the clearest causal evidence: suppressing a localized cell selectively erases recall for its matched entity while leaving others intact, and activating a single cell is sufficient to recover correct knowledge for most entities - even when the entity is absent from the context. The same cells are recovered under aliases, acronyms, misspellings, and multilingual surface forms, and remain stable through instruction tuning, suggesting they encode canonical entity identity rather than surface token patterns. Causal signals vary across model families, pointing to architectural differences in how entity knowledge is organized. These findings offer concrete, interpretable access points for understanding, controlling, and correcting factual knowledge in language models, and draw a surprising empirical parallel to longstanding questions in neuroscience about sparse coding of concepts.

2603.23638 2026-05-19 cs.AI 版本更新

Can LLM Agents Be CFOs? Benchmarking Long-Horizon Resource Allocation in an Uncertain Enterprise Environment

LLM代理能成为CFO吗?在不确定的企业环境中评估长期资源分配

Yi Han, Yan Wang, Lingfei Qian, Haohang Li, Yupeng Cao, Yueru He, Xueqing Peng, Nanhan Shen, Yitao Xu, Yankai Chen, Dongji Feng, Jimin Huang, Xue Liu, Jian-Yun Nie, Sophia Ananiadou

发表机构 * Georgia Institute of Technology(佐治亚理工学院) The Fin AI Stevens Institute of Technology(史蒂文斯理工学院) Columbia University(哥伦比亚大学) George Mason University(乔治·马歇尔大学) McGill University(麦吉尔大学) Mohamed bin Zayed University of Artificial Intelligence(莫扎大学人工智能学院) California State University, Monterey Bay(加州州立大学蒙特雷湾分校) University of Manchester(曼彻斯特大学) Mila – Quebec Artificial Intelligence Institute(魁北克人工智能研究所) Université de Montréal(蒙特利尔大学)

AI总结 本文通过EnterpriseArena模拟器评估LLM在不确定环境下的长期资源分配能力,发现现有模型在复杂任务中表现不足,仅15.4%的试验能持续完整周期。

详情
AI中文摘要

大型语言模型(LLM)代理在复杂任务上的测试日益增加,但其在长期资源分配中的能力仍不明确。本文引入EnterpriseArena,一个132个月的CFO模拟器,用于评估在金融科技贷款公司中不确定环境下的长期资源分配。代理必须管理流动性、结账、收集成本信号并请求股权或债务融资,以应对不断变化的宏观经济环境。模拟器基于转换后的公司财务数据、匿名商业文档、十年期宏观经济和行业信号以及专家验证的操作规则构建。在23个LLM和四个代理框架的实验中发现,当前代理仍远未稳健:仅15.4%的试验能持续完整周期,较大模型并不总能优于较小模型,且失败会跨观测、行动时间和资本规模传播。这些发现确立了在不确定环境下长期资源分配作为LLM代理的一个独特能力缺口。

英文摘要

Large language model (LLM) agents are increasingly tested on complex tasks, but their ability to allocate scarce resources over long horizons remains unclear. Unlike reactive tasks with immediate feedback, this setting requires agents to make binding commitments under partial observability, delayed consequences, hard resource budgets, and shifting dynamics. We introduce EnterpriseArena, a 132-month CFO simulator that evaluates long-horizon resource allocation under uncertainty in a FinTech lending firm. Agents must manage liquidity, close books, gather costly signals, and request equity or debt financing across changing macroeconomic regimes. The simulator is built from transformed firm-level financial data, anonymized business documents, decade-scale macroeconomic and industry signals, and expert-validated operating rules. Experiments across 23 LLMs and four agent frameworks show that current agents remain far from robust: only 15.4% of trials survive the full horizon, larger models do not reliably outperform smaller ones, and failures cascade across observation, action timing, and capital sizing. These findings establish long-horizon resource allocation under uncertainty as a distinct capability gap for LLM agents.

2603.23566 2026-05-19 cs.LG cs.AI 版本更新

AscendOptimizer: Episodic Agent for Ascend NPU Operator Optimization

AscendOptimizer: 一种用于Ascend NPU运算优化的经验型智能体

Jiehao Wu, Zixiao Huang, Wenhao Li, Chuyun Shen, Junjie Sheng, Xiangfeng Wang

发表机构 * School of Computer Science and Technology, East China Normal University(东华大学计算机科学与技术学院) School of Computer Science and Technology, Tongji University(同济大学计算机科学与技术学院) Shanghai University of International Business and Economics(上海国际商务经济大学) Key Lab of Mathematics and Engineering Applications (MoE), East China Normal University(东华大学数学与工程应用重点实验室) School of Mathematical Sciences, East China Normal University(东华大学数学科学学院) Shenzhen Loop Area Institute (SLAI)(深圳环宇院)

AI总结 本文提出AscendOptimizer,通过自身执行构建缺失的优化知识,结合主机侧和内核侧优化,实现AscendC运算的加速,达到1.21倍的几何平均速度提升。

详情
AI中文摘要

本文提出AscendOptimizer,一种用于Ascend NPU运算优化的经验型智能体。通过自身执行构建缺失的优化知识,结合主机侧和内核侧优化,实现AscendC运算的加速,达到1.21倍的几何平均速度提升。

英文摘要

Optimizing AscendC (Ascend C) operators for Ascend NPUs is difficult for two reasons. First, unlike CUDA, the ecosystem offers few public kernels to learn from. Second, performance depends on a coupled two-part implementation: a host-side tiling program that controls data movement and a kernel program that schedules and pipelines computation. We present AscendOptimizer, an episodic agent that builds missing optimization knowledge from execution itself. For kernel optimization, AscendOptimizer rewinds strong implementations by removing optimizations in a controlled way, then keeps the changes whose removal measurably hurts performance as reusable experience for later rewriting. For host-side optimization, it runs profiling-in-the-loop evolutionary search to find valid, fast tiling and data-movement configurations directly from hardware feedback. This combination lets the agent improve kernel structure and host-side scheduling together. On a benchmark of 101 real AscendC operators, AscendOptimizer achieves a 1.21x geometric-mean speedup over the open-source baseline, and 53.47% of operators run faster than their references. Given a same budget of evaluations per operator, AscendOptimizer consistently outperforms Best-of-N sampling and OpenEvolve in terms of geometric mean speedup, fast_p tail speedup ratios, and overall optimization progress across varying budgets.

2603.21071 2026-05-19 cs.CV cs.AI 版本更新

CTFS : Collaborative Teacher Framework for Forward-Looking Sonar Image Semantic Segmentation with Extremely Limited Labels

CTFS:用于极有限标注数据的前瞻性声呐图像语义分割协作教师框架

Ping Guo, Chengzhou Li, Guanchen Meng, Qi Jia, Jinyuan Liu, Zhu Liu, Yu Liu, Zhongxuan Luo, Xin Fan

发表机构 * School of Software Technology, Dalian University of Technology(大连理工大学软件学院)

AI总结 本文提出CTFS框架,通过多教师协作机制提升声呐图像在极有限标注下的分割性能,通过跨教师可靠性评估机制减少噪声伪标签影响,实验显示在FLSMD数据集上2%标注时mIoU提升5.08%。

Comments Accepted to CVPR 2026 Finding. Code: https://github.com/pingggg516/CTFS

详情
AI中文摘要

作为最重要的水下传感技术之一,前瞻性声呐具有独特的成像特性。声呐图像常受严重斑点噪声、低纹理对比度、声影和几何失真影响,使传统教师-学生框架在极有限标注条件下难以获得满意性能。为解决此问题,我们提出一种用于前瞻性声呐图像的协作教师语义分割框架。该框架引入由一个通用教师和多个声呐专用教师组成的多教师协作机制。通过采用多教师交替指导策略,学生模型可学习通用语义表示的同时捕捉声呐图像的特殊特性,从而实现更全面和稳健的特征建模。考虑到声呐图像的挑战,可能导致教师生成大量噪声伪标签,我们进一步设计了跨教师可靠性评估机制。该机制通过评估多视角和多教师间的预测一致性与稳定性动态量化伪标签的可靠性,从而减轻噪声伪标签的负面影响。值得注意的是,在FLSMD数据集上,当仅标注2%的数据时,我们的方法在mIoU上比其他最先进的方法提高了5.08%。

英文摘要

As one of the most important underwater sensing technologies, forward-looking sonar exhibits unique imaging characteristics. Sonar images are often affected by severe speckle noise, low texture contrast, acoustic shadows, and geometric distortions. These factors make it difficult for traditional teacher-student frameworks to achieve satisfactory performance in sonar semantic segmentation tasks under extremely limited labeled data conditions. To address this issue, we propose a Collaborative Teacher Semantic Segmentation Framework for forward-looking sonar images. This framework introduces a multi-teacher collaborative mechanism composed of one general teacher and multiple sonar-specific teachers. By adopting a multi-teacher alternating guidance strategy, the student model can learn general semantic representations while simultaneously capturing the unique characteristics of sonar images, thereby achieving more comprehensive and robust feature modeling. Considering the challenges of sonar images, which can lead teachers to generate a large number of noisy pseudo-labels, we further design a cross-teacher reliability assessment mechanism. This mechanism dynamically quantifies the reliability of pseudo-labels by evaluating the consistency and stability of predictions across multiple views and multiple teachers, thereby mitigating the negative impact caused by noisy pseudo-labels. Notably, on the FLSMD dataset, when only 2% of the data is labeled, our method achieves a 5.08% improvement in mIoU compared to other state-of-the-art approaches.

2603.18178 2026-05-19 cs.CV cs.AI 版本更新

VLM-AutoDrive: Post-Training Vision-Language Models for Safety-Critical Autonomous Driving Events

VLM-AutoDrive: 事后训练视觉-语言模型用于安全关键的自动驾驶事件

Mohammad Qazim Bhat, Yufan Huang, Niket Agarwal, Hao Wang, Michael Woods, John Kenyon, Tsung-Yi Lin, Xiaodong Yang, Ming-Yu Liu, Kevin Xie

发表机构 * NVIDIA

AI总结 本文提出VLM-AutoDrive框架,通过整合元数据生成的描述、LLM生成的描述、视觉问答对和推理监督,提升预训练视觉语言模型在安全关键自动驾驶事件中的检测性能。

Comments 16 pages, 9 figures, submitted to arXiv

详情
AI中文摘要

随着第一人称视角 dashcam 视频的快速增长,检测安全关键事件如碰撞和近碰撞成为重大挑战,这些场景短暂、罕见且难以被通用视觉模型捕捉。尽管多模态大语言模型(MLLMs)展现出强大的推理能力,但其在驾驶场景中因领域和时间对齐问题而表现不佳。我们引入VLM-AutoDrive,一种模块化的事后训练框架,用于将预训练的视觉-语言模型(VLMs)适应到高保真异常检测。该框架整合了元数据衍生的标题、LLM生成的描述、视觉问答对以及推理链(CoT)监督,以实现领域对齐和可解释的学习。现成的VLMs如NVIDIA的Cosmos-Reason1 7B(CR1)在零样本设置中碰撞召回率接近零;通过VLM-AutoDrive微调,碰撞F1值从0.00提升到0.69,整体准确率从35.35%提升到77.27%。VLM-AutoDrive提供了一种可扩展的配方,用于将通用VLMs适应到安全关键、时间局部化的感知任务。在真实世界Nexar dashcam视频上评估,它在碰撞和近碰撞检测方面实现了显著提升,同时生成可解释的推理轨迹,弥合了感知、因果性和决策推理之间的差距。

英文摘要

The rapid growth of ego-centric dashcam footage presents a major challenge for detecting safety-critical events such as collisions and near-collisions, scenarios that are brief, rare, and difficult for generic vision models to capture. While multimodal large language models (MLLMs) demonstrate strong general reasoning ability, they underperform in driving contexts due to domain and temporal misalignment. We introduce VLM-AutoDrive, a modular post-training framework for adapting pretrained Vision-Language Models (VLMs) to high-fidelity anomaly detection. The framework integrates metadata-derived captions, LLM-generated descriptions, visual question answering (VQA) pairs, and chain-of-thought (CoT) reasoning supervision to enable domain-aligned and interpretable learning. Off-the-shelf VLMs such as NVIDIA's Cosmos-Reason1 7B (CR1) exhibit near-zero Collision recall in zero-shot settings; fine-tuning with VLM-AutoDrive improves Collision F1 from 0.00 to 0.69 and overall accuracy from 35.35% to 77.27%. VLM-AutoDrive offers a scalable recipe for adapting general-purpose VLMs to safety-critical, temporally localized perception tasks. Evaluated on real-world Nexar dashcam videos, it achieves substantial gains in Collision and Near-Collision detection while producing interpretable reasoning traces, bridging the gap between perception, causality, and decision reasoning in autonomous driving.

2603.16947 2026-05-19 cs.CV cs.AI 版本更新

LightZeroNav: Zero-Shot Vision Language Navigation in Continuous Environments Based on Lightweight VLMs

LightZeroNav: 基于轻量级VLMs的连续环境中零样本视觉语言导航

Kun Luo, Xiangyu Dong, Xiaoguang Ma, Haoran Zhao, Yaoming Zhou

发表机构 * Foshan Graduate School of Innovation, Northeastern University(创新研究生院,东北大学) Faculty of Robot Science and Engineering, Northeastern University(机器人科学与工程学院,东北大学) School of Aeronautic Science and Engineering, Beihang University(航空科学与工程学院,北航) QingniaoAI, China(清北AI,中国)

AI总结 本文提出LightZeroNav,通过轻量级VLMs解决连续环境中零样本视觉语言导航的三大瓶颈,无需特定训练或图搜索,在RGB观测和轻量级Qwen3-VL-8B模型下实现与GPT-4o相当的性能。

详情
AI中文摘要

尽管视觉语言导航(VLN)发展迅速,但在连续环境中使用轻量级视觉语言模型(VLMs)进行零样本VLN(VLN-CE)仍极具挑战性,因为这些模型的有限推理能力使长周期导航不可靠。本文提出LightZeroNav,以解决在使用轻量级VLMs进行零样本VLN-CE时的三大主要瓶颈,即多源输入的信息冗余、由噪声文本记忆引起的进度估计不准确以及动作执行与阶段转换之间的任务纠缠。仅使用RGB观测和轻量级开源Qwen3-VL-8B主干,LightZeroNav在无需特定训练、图搜索或路径预测器的情况下实现了与GPT-4o(~200B)相当的性能,证明了其在零样本VLN-CE中的有效性。

英文摘要

Although vision-language navigation (VLN) has progressed rapidly, zero-shot VLN in continuous environments (VLN-CE) remains highly challenging when using lightweight vision-language models (VLMs), whose limited reasoning capacity makes long-horizon navigation unreliable. In this paper, we propose LightZeroNav to tackle the three major bottlenecks when using lightweight VLMs in zero-shot VLN-CE,i.e.,information redundancy from multi-source inputs, inaccurate progress estimation caused by noisy textual memory, and task entanglement between action execution and stage transition. Using only RGB observations and a lightweight open-source Qwen3-VL-8B backbone, LightZeroNav achieves competitive performance with GPT-4o (~200B) without task-specific training, graph search, or waypoint predictors, demonstrating its effectiveness in zero-shot VLN-CE.

2603.16091 2026-05-19 cs.CL cs.AI 版本更新

CounterRefine: Answer-Conditioned Counterevidence Retrieval for Inference-Time Knowledge Repair in Factual Question Answering

CounterRefine:用于事实问答中推理时知识修复的答案条件计数证据检索

Tianyi Huang, Ying Kai Deng

发表机构 * Ryquo App-In Club

AI总结 CounterRefine通过在推理时检索特定证据并进行约束性修正,提升事实问答的准确性,实验表明其在多个基准测试中有效改进了基础模型的表现。

Comments Accepted at the 4th Workshop on Towards Knowledgeable Foundation Models at ACL 2026

详情
AI中文摘要

在事实问答中,许多错误并非检索失败,而是对答案的固执。我们提出了CounterRefine,一种轻量级的修复层,用于短形式RAG。该方法将第一个答案视为假设进行检验。给定草稿,CounterRefine会发出答案条件扩展查询以检索候选特定证据,然后应用受约束的KEEP或REVISE修正步骤,其提出的修订仅在确定性验证后才被接受。设计是故意狭窄的:它添加了一次证据收集流程和一次受保护的修正调用,而不是替换检索器或构建广泛代理系统。在完整的SimpleQA基准测试中,CounterRefine将匹配的一次通过RAG基线改进了最多5.8个正确率点;在完整的Claude轨迹中,它只改变了5.6%的输出,其中180个有益变化和8个有害变化。这些发现表明,对于知识丰富的基础模型来说,除了访问证据外,它们还应能够利用该证据重新考虑,并在必要时修复自己的答案。

英文摘要

In factual question answering, many errors are not failures of access but failures of commitment: the system retrieves relevant evidence, yet still settles on the wrong answer. We present CounterRefine, a lightweight repair layer for short-form RAG that treats the first answer as a hypothesis to test. Given a draft, CounterRefine issues answer-conditioned expansion queries to retrieve candidate-specific evidence, then applies a constrained KEEP or REVISE refinement step whose proposed revisions are accepted only after deterministic validation. The design is intentionally narrow: it adds one evidence-gathering pass and one guarded refinement call rather than replacing the retriever or building a broad agentic system. On the full SimpleQA benchmark, CounterRefine improves a matched one-pass RAG baseline by up to 5.8 correct-rate points; in the full Claude trace, it changes only 5.6% of outputs, with 180 beneficial outcome changes and 8 harmful ones. These findings suggest a simple but important direction for knowledgeable foundation models: beyond accessing evidence, they should also be able to use that evidence to reconsider and, when necessary, repair their own answers.

2603.08145 2026-05-19 cs.LG cs.AI 版本更新

DARC: Disagreement-Aware Alignment via Risk-Constrained Decoding

DARC:通过风险约束解码实现的分歧意识对齐

Mingxi Zou, Jiaxiang Chen, Junfan Li, Langzhang Liang, Qifan Wang, Xu Yinghui, Zenglin Xu

发表机构 * Fudan University, Shanghai, China(复旦大学,上海,中国) Independent Researcher(独立研究者) Meta AI Incubation Institute, Fudan University, Shanghai, China(创新与孵化院,复旦大学,上海,中国)

AI总结 DARC通过风险约束解码方法,在不重新训练的情况下,通过最大化KL-鲁棒满意度目标来缓解分歧和尾部风险,保持高质量输出。

详情
AI中文摘要

基于偏好对齐的方法(如RLHF、DPO)通常优化单一标量目标,隐式地平均异质人类偏好。在实践中,系统标注者和用户组的分歧使均值奖励最大化变得脆弱且易受代理过优化影响。我们提出了**通过风险约束解码实现的分歧意识对齐(DARC)**,一种无需重新训练的推理时间方法,将响应选择框架为分布鲁棒、风险敏感的决策制定。给定多个偏好样本或可扩展的分歧代理,DARC通过最大化KL-鲁棒(熵)满意度目标对候选者进行重新排序,并提供简单的部署控制,使相应的熵风险溢价相对于均值进行限制或惩罚,从而在不重新训练的情况下实现显式风险预算。我们提供了将此解码规则与原则性悲观主义和基于KL的分布鲁棒优化联系起来的理论分析。在对齐基准测试中,DARC在减少分歧和尾部风险的同时,保持在噪声、异质反馈下的竞争力平均质量。

英文摘要

Preference-based alignment methods (e.g., RLHF, DPO) typically optimize a single scalar objective, implicitly averaging over heterogeneous human preferences. In practice, systematic annotator and user-group disagreement makes mean-reward maximization brittle and susceptible to proxy over-optimization. We propose **Disagreement-Aware Alignment via Risk-Constrained Decoding (DARC)**, a retraining-free inference-time method that frames response selection as distributionally robust, risk-sensitive decision making. Given multiple preference samples or scalable disagreement proxies, DARC reranks candidates by maximizing a *KL-robust (entropic)* satisfaction objective, and provides simple deployment controls that cap or penalize the corresponding entropic risk premium relative to the mean, enabling explicit risk budgets without retraining. We provide theoretical characterization linking this decoding rule to principled pessimism and KL-based distributionally robust optimization. Experiments on alignment benchmarks show that DARC reduces disagreement and tail risk while maintaining competitive average quality under noisy, heterogeneous feedback.

2603.07438 2026-05-19 cs.AI 版本更新

How Wrong Can Your Counterfactual Be? Quantifying Confounding Bias for Continuous Treatments without a Control Group

你的反事实能错到什么程度?在没有对照组的情况下,为连续治疗量化混杂偏倚

Yu Wang, Xiangchen Liu, Siguang Li

发表机构 * Department of Economics, Cornell University(康奈尔大学经济学系) Department of Family and Consumer Sciences, California State University, Long Beach(加州州立大学长滩分校家庭与消费者科学系) Society Hub, Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)社会枢纽)

AI总结 本文提出一种因果压力测试框架,通过假设未观测混杂因素对结果和宏观经济变量的加性影响,量化连续治疗下的混杂偏倚,并分析两种估计方法的误差界。

详情
AI中文摘要

压力测试提出因果问题:如果宏观经济沿不利反事实路径发展,投资组合信用损失会如何变化?然而,标准做法仍为预测性,可能面临遗漏变量偏差。本文提出一种用于面板数据的半识别框架,用于连续共同治疗和无对照组的因果压力测试。通过假设未观测混杂因素对结果和宏观经济变量的加性影响,推导出一个闭式混杂包络参数化于两个可解释敏感参数。进一步分析两种实用估计器——递归滚动和直接多时段预测,推导非渐近误差界,并表征递归复合使直接估计更优的条件。对于推断,结合识别包络与重要加权符合预测,得到有限样本区间,将估计不确定性与识别不确定性分离。在基于真实美国失业率路径构建的半合成实验中,标准高精度预测模型仍存在因果偏倚且显著低估,而本文框架在所有压力时段均实现接近名义覆盖率。

英文摘要

Stress testing poses a causal question: how would portfolio credit losses change if the macroeconomy followed an adverse counterfactual path? Yet standard practice remains predictive and might be therefore vulnerable to omitted-variable bias. We propose a partial identification framework for causal stress testing in panel data with a continuous common treatment and no control group. By assuming that the unobserved confounder affects outcome and macro variables additively, we derive a closed-form confounding envelope parameterized by two interpretable sensitivity parameters. We further analyze two practical estimators -- recursive rollout and direct multi-horizon prediction -- derive non-asymptotic error bounds, and characterize when recursive compounding makes direct estimation preferable. For inference, we combine the identification envelope with importance-weighted conformal prediction, yielding finite-sample intervals that separate estimation uncertainty from identification uncertainty under covariate shift. In semi-synthetic experiments built from real U.S. unemployment paths, standard high-accuracy predictive models remain causally biased and substantially under-cover, whereas the proposed framework achieves near-nominal coverage across stress horizons.

2603.04737 2026-05-19 cs.AI cs.CL cs.LG 版本更新

Interactive Benchmarks

交互式基准测试

Baoqing Yue, Zihan Zhu, Yutong Han, Brian Fan, Qian Sun, Jichen Feng, Hufei Yang, Yifan Zhang, Mengdi Wang

发表机构 * InteractiveBench Princeton University(普林斯顿大学)

AI总结 本文提出交互式基准测试,通过预算化的多轮交互评估模型推理能力,改进传统基准和偏好评估的局限性,揭示模型在交互场景中的改进空间。

Comments Project Page: https://github.com/interactivebench/interactivebench

详情
AI中文摘要

现有的推理评估范式存在不同局限:固定基准日益饱和且易受污染,而基于偏好的评估依赖主观判断。我们主张智能的核心在于决定获取哪些信息以及如何有效使用它们。我们提出了交互式基准,一种统一的评估范式,通过预算化的多轮交互评估模型的推理能力。我们在两种设置中评估模型:交互证明,其中模型与裁判互动解决逻辑、UI2Html和数学任务,在客观反馈下;以及交互游戏,其中模型战略推理以最大化长期效用。我们的结果表明,交互式基准提供了更稳健的评估,揭示了模型在交互场景中的显著改进空间。

英文摘要

Existing reasoning evaluation paradigms suffer from different limitations: fixed benchmarks are increasingly saturated and vulnerable to contamination, while preference-based evaluations rely on subjective judgments. We argue that a core aspect of intelligence is the ability to decide what information to acquire and how to use it effectively. We propose Interactive Benchmarks, a unified evaluation paradigm that assesses a model's reasoning ability through budgeted multi-turn interaction. We evaluate models under this framework in two settings: Interactive Proofs, where models interact with a judge to solve Logic, UI2Html, and Mathematics tasks under objective feedback; and Interactive Games, where models reason strategically to maximize long-horizon utilities. Our results show that interactive benchmarks provide a more robust assessment of this dimension of model intelligence, revealing substantial room for improvement in interactive scenarios.

2603.02531 2026-05-19 cs.LG cs.AI 版本更新

Geometry-Aware Attention Guidance for Diffusion Models via Modern Hopfield Dynamics

基于现代Hopfield动力学的几何感知注意力引导:通过现代Hopfield动力学实现扩散模型的几何感知注意力引导

Kwanyoung Kim

发表机构 * Department of AI Convergence(人工智能融合学院)

AI总结 本文提出几何感知注意力引导方法,通过分析注意力扩展中的现代Hopfield动力学,证明了稀疏-密集差异的两个方向性性质,从而提供一种无需训练的插拔式扩展规则,提升扩散模型生成质量。

详情
AI中文摘要

分类器自由引导(CFG)在扩散模型中提高了样本质量,但其双步推理和对空条件训练的依赖限制了其在少步场景中的应用。注意力空间引导作为一种互补范式,解决了这一缺口,但为何先前的稀疏-密集注意力引导有效仍不清楚。我们通过分析注意力扩展中的现代Hopfield动力学,证明了在共享条件下的稀疏-密集差异的两个方向性性质,从而证明其作为方向一致的加速信号。在此基础上,我们提出了几何感知注意力引导(GAG),一种无需训练的插拔式扩展规则,将差异分解为与检索方向平行和正交的分量,放大与收敛方向一致的分量,同时抑制离流形噪声;稳定性来源于弱收缩性质。我们进一步将此扩展解释为注意力空间中的第一阶Anderson加速,为注意力扩展方法提供了统一视角。GAG是一种通用方法,能够跨架构(UNet, MMDiT)和采样场景(多步、少步)泛化,一致地在多种架构上提升生成质量,包括FLUX.1、最近的FLUX.2和Qwen-Image,且计算开销极低。

英文摘要

Classifier-Free Guidance (CFG) improves sample quality in diffusion models, but its dual-pass inference and reliance on null-condition training limit its use in few-step regimes. Attention-space guidance has emerged as a complementary paradigm that addresses this gap, yet why prior sparse-vs-dense attention guidance works remains elusive. We address this by analyzing attention extrapolation through Modern Hopfield dynamics, proving two directional properties of the sparse-dense discrepancy under shared conditioning that together certify it as a directionally consistent acceleration signal. Building on this, we propose Geometry-Aware Attention Guidance (GAG), a training-free, plug-and-play extrapolation rule that decomposes the discrepancy into parallel and orthogonal components relative to the retrieval direction, amplifying the convergence-aligned component while suppressing off-manifold noise; stability follows from a weak contraction property. We further provide an interpretation of this extrapolation as first-order Anderson Acceleration in attention space, offering a unified perspective on attention extrapolation methods. GAG is a universal method that generalizes across architectures (UNet, MMDiT) and sampling regimes (multi-step, few-step), consistently improving generation quality on diverse backbones, including FLUX.1, the recent FLUX.2, and Qwen-Image, with minimal computational overhead.

2603.01227 2026-05-19 cs.AI 版本更新

The Lattice Representation Hypothesis of Large Language Models

大语言模型的晶格表示假说

Bo Xiong

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出大语言模型的晶格表示假说,通过嵌入几何将概念层次和逻辑运算统一到线性表示中,实验表明LLM嵌入编码概念晶格及其逻辑结构。

Comments Accepted at ICLR 2026

详情
AI中文摘要

我们提出大语言模型的晶格表示假说:一种符号骨架,将概念层次和逻辑运算根植于嵌入几何中。我们的框架将线性表示假说与形式概念分析(FCA)统一起来,显示线性属性方向带有分离阈值诱导概念晶格通过半空间交集。这种几何使通过几何meet(交集)和join(并集)操作实现符号推理,并且当属性方向线性无关时具有标准形式。在WordNet子层次上的实验提供了经验证据,表明LLM嵌入编码概念晶格及其逻辑结构,揭示了连续几何与符号抽象之间有原则的桥梁。

英文摘要

We propose the Lattice Representation Hypothesis of large language models: a symbolic backbone that grounds conceptual hierarchies and logical operations in embedding geometry. Our framework unifies the Linear Representation Hypothesis with Formal Concept Analysis (FCA), showing that linear attribute directions with separating thresholds induce a concept lattice via half-space intersections. This geometry enables symbolic reasoning through geometric meet (intersection) and join (union) operations, and admits a canonical form when attribute directions are linearly independent. Experiments on WordNet sub-hierarchies provide empirical evidence that LLM embeddings encode concept lattices and their logical structure, revealing a principled bridge between continuous geometry and symbolic abstraction.

2603.01092 2026-05-19 cs.AI cs.LG 版本更新

The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions

科学的异类空间:采样连贯但认知不可用的研究方向

Alejandro H. Artiles, Martin Weiss, Levin Brinkmann, Iyad Rahwan, Bernhard Schölkopf, Christopher Pal, Hugo Larochelle, Anirudh Goyal, Nasim Rahaman

发表机构 * Max Planck Institute for Human Development(马克斯·普朗克人类发展研究所) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) ELLIS Institute Tübingen(图宾根ELLIS研究所) Polytechnique Montreal(蒙特利尔理工学院) CIFAR AI Chair(CIFAR人工智能主席) Mila – Quebec AI Institute(魁北克人工智能研究所) Tiptree Systems(Tiptree系统)

AI总结 本文提出一种框架,通过分解论文为概念单元并学习两个互补模型,采样出连贯但认知不可用的研究方向,扩展了LLM生成的潜在词汇库。

Comments 10 main pages, 42 appendix pages, 29 figures

详情
AI中文摘要

科学发现不仅受真理限制,还受研究人员当前探索领域认知可用性限制。许多方向在文献中是连贯的,但因没有现有社区占据正确的概念、方法和直觉组合而不被提出。现代语言模型继承这种偏见,当被提示生成新想法时会重新组合文献的高密度区域。我们引入了一个框架,旨在针对互补区域,称为科学的异类空间,其中方向在现有知识结构下是可能的,但在现有研究人员分布下不太可能。我们的方法首先将论文分解为细粒度的概念单元,并将它们聚类为共享的词汇概念原子。然后在该词汇上学习两个互补模型。一个连贯性模型评分原子组合是否形成可行的研究方向,另一个可用性模型评分是否任何现有作者社区能够产生给定组合。采样异类方向则减少为排名原子组合,以最大化连贯性同时最小化可用性。在包含16,068篇经同行评审的LLM论文的语料库上,所得到的采样器在不牺牲连贯性的前提下,探索出比前沿LLM生成基线大3.5至7倍的有效原子词汇库,并在盲LLM、人类和下游实验评估中产生匹配或超过基线的想法。通过将科学合理性与社区可用性分开,我们的框架指向AI生成想法,补充而非仅仅加速人类科学,扩展探索到当前社区可能忽视的连贯方向。

英文摘要

Scientific discovery is constrained not only by what is true, but by what is cognitively available to the researchers currently exploring a field. Many directions are coherent in light of the literature yet unlikely to be proposed because no existing community occupies the right combination of concepts, methods, and intuitions. Modern language models inherit this bias, recombining high-density regions of the literature when prompted for novel ideas. We introduce a framework that targets the complementary region, which we call the alien space of science, where directions are plausible under the structure of existing knowledge but unlikely under the distribution of existing researchers. Our method first decomposes papers into granular conceptual units and clusters them into a shared vocabulary of idea atoms. It then learns two complementary models over this vocabulary. A coherence model scores whether a combination of atoms forms a viable research direction, and an availability model scores whether any existing author community is positioned to produce a given combination. Sampling alien directions then reduces to ranking atom combinations that maximize coherence while minimizing availability. On a corpus of 16,068 peer-reviewed LLM papers from NeurIPS, ICLR, ICML, and major NLP venues, the resulting sampler explores a 3.5 - 7 x broader effective atom vocabulary than frontier LLM ideation baselines without sacrificing coherence, and produces ideas that match or exceed those baselines under blind LLM, human, and downstream experimental evaluation. By separating scientific plausibility from community availability, our framework points toward AI ideation that complements rather than merely accelerates human science, expanding exploration into coherent directions that the current community may overlook.

2603.00975 2026-05-19 cs.LG cs.AI 版本更新

Forgetting is Competition: Rethinking Unlearning as Representation Interference in Diffusion Models

遗忘是竞争:重新思考扩散模型中的去学习作为表征干扰

Ashutosh Ranjan, Vivek Srivastava, Shirish Karande, Murari Mandal

发表机构 * TCS Research(印度 Tata Consulting Engineers 研究部) Kalinga Institute of Industrial Technology(卡林加工业技术学院)

AI总结 本文提出SurgUn方法,通过可控竞争而非直接删除或一对一重分配来实现扩散模型的去学习,有效平衡遗忘与保留,提升模型在版权、安全等场景下的表现。

详情
AI中文摘要

部署的文本到图像扩散模型日益需要事后概念去学习以应对版权主张、艺术家退出、安全更新和受保护内容缓解,而无需完全重新训练。核心挑战是擦除-保留失衡,激进更新抑制目标但损害共享能力,而保守或基于锚点的更新保留质量但使概念可通过相关、组合、改写或对抗性提示恢复。受反向干扰启发,我们提出SurgUn,将遗忘视为受控竞争而非直接删除或一对一重分配。SurgUn通过干扰条件梯度竞争实现反向概念干扰:目标梯度上升削弱目标条件的去噪或流匹配行为,而下降于语义多样的干扰集引入竞争非目标轨迹。这将输出分布在多个非目标模式而非坍缩到单一代理。为通过共享路径限制意外遗忘,SurgUn添加像素基础的权重空间局部化,轻量级诊断通过生成图像擦除-保留行为选择注意力块,利用抑制广泛可行而保留块选择性的不对称性。在UnlearnCanvas、IP-character擦除、Holistic Unlearning、EraseBench和Ring-A-Bell上,SurgUn在Stable Diffusion v1.5、SDXL和SANA-1.5中实现了比基线更强的擦除-保留平衡。消融实验显示,多样干扰、对比竞争和局部化对于稳健抑制同时保留相关和不相关概念都是必要的。

英文摘要

Deployed text-to-image diffusion models increasingly require post-hoc concept unlearning for copyright claims, artist opt-outs, safety updates, and protected-content mitigation without full retraining. A central challenge is erase-retain imbalance, aggressive updates suppress targets but damage shared capabilities, while conservative or anchor-based updates preserve quality yet leave concepts recoverable through related, compositional, paraphrased, or adversarial prompts. Inspired by retroactive interference, we propose SurgUn, which treats forgetting as controlled competition rather than direct deletion or one-to-one reassignment. SurgUn instantiates retroactive concept interference via distractor-conditioned gradient competition: target-gradient ascent weakens target-conditioned denoising or flow-matching behavior, while descent over a semantically diverse distractor set introduces competing non-target trajectories under the same prompt context. This redistributes outputs across multiple non-target modes instead of collapsing to a single proxy. To limit collateral forgetting through shared pathways, SurgUn adds pixel-grounded weight-space localization, a lightweight diagnostic that selects attention blocks by generated-image erase-retain behavior, exploiting the asymmetry that suppression is broadly achievable whereas retention is block-selective. Across UnlearnCanvas, IP-character erasure, Holistic Unlearning, EraseBench, and Ring-A-Bell on Stable Diffusion v1.5, SDXL, and SANA-1.5, SurgUn achieves a stronger erase-retain balance than baselines. Ablations show that diverse distractors, contrastive competition, and localization are all necessary for robust suppression while preserving related and unrelated concepts.

2603.00876 2026-05-19 cs.AI cs.MA 版本更新

BioProAgent: Neuro-Symbolic Grounding for Constrained Scientific Planning

BioProAgent:用于受限科学规划的神经符号接地

Yuyang Liu, Jingya Wang, Liuzhenghao Lv, Yonghong Tian

发表机构 * School of AI for Science, Peking University(科学人工智能学院,北京大学) School of Electronic and Computer Engineering, Peking University(电子与计算机工程学院,北京大学) School of Computer Science, Peking University(计算机科学学院,北京大学)

AI总结 BioProAgent通过神经符号框架将概率规划锚定在确定性有限状态机中,解决复杂设备模式中的上下文瓶颈,提升物理执行的可靠性。

详情
AI中文摘要

大型语言模型(LLMs)在科学发现中展现出显著的推理能力,但在湿实验室等不可逆环境中难以实现物理执行。我们提出BioProAgent,一种神经符号框架,将概率规划锚定在确定性有限状态机(FSM)中。我们引入了状态增强的规划机制,强制执行严格的设计-验证-修正工作流,确保硬件兼容性后再执行。此外,我们通过语义符号接地解决复杂设备模式中的上下文瓶颈,通过符号抽象减少约6倍的token消耗。在扩展的BioProBench基准测试中,BioProAgent达到95.6%的物理兼容性(相比ReAct的21.0%),证明神经符号约束对于不可逆物理环境中的可靠自主性至关重要。代码:https://github.com/YuyangSunshine/bioproagent | 网站:https://yuyangsunshine.github.io/BioPro-Project.

英文摘要

Large language models (LLMs) have demonstrated significant reasoning capabilities in scientific discovery but struggle to bridge the gap to physical execution in wet-labs. In these irreversible environments, probabilistic hallucinations are not merely incorrect; they can cause equipment damage or experimental failure. We propose BioProAgent, a neuro-symbolic framework that anchors probabilistic planning in a deterministic Finite State Machine (FSM). We introduce a State-Augmented Planning mechanism that enforces a rigorous Design-Verify-Rectify workflow, ensuring hardware compliance before execution. Furthermore, we address the context bottleneck inherent in complex device schemas by Semantic Symbol Grounding, reducing token consumption by ~6* through symbolic abstraction. In the extended BioProBench benchmark, BioProAgent achieves 95.6% physical compliance (compared to 21.0% for ReAct), demonstrating that neuro-symbolic constraints are essential for reliable autonomy in irreversible physical environments. Code: https://github.com/YuyangSunshine/bioproagent | Website: https://yuyangsunshine.github.io/BioPro-Project.

2602.22801 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Unleashing the Potential of Diffusion Models for End-to-End Autonomous Driving

释放扩散模型在端到端自动驾驶中的潜力

Yinan Zheng, Tianyi Tan, Bin Huang, Enguang Liu, Ruiming Liang, Jianlin Zhang, Jianwei Cui, Guang Chen, Kun Ma, Hangjun Ye, Long Chen, Ya-Qin Zhang, Xianyuan Zhan, Jingjing Liu

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(人工智能产业研究院(AIR),清华大学)

AI总结 本文通过大规模实车数据和道路测试,系统研究了扩散模型在端到端自动驾驶中的规划能力,提出Hyper Diffusion Planner框架,实现10倍性能提升。

详情
AI中文摘要

扩散模型已成为机器人决策任务中的流行选择,近年来也开始被考虑用于解决自动驾驶任务。然而,其在自动驾驶中的应用和评估仍局限于模拟或实验室环境。本研究通过大规模实车数据和道路测试,系统研究了扩散模型作为端到端自动驾驶规划器的潜力。通过全面而受控的研究,我们识别了扩散损失空间、轨迹表示和数据缩放等关键洞察,显著影响端到端规划性能。此外,我们还提供了一种有效的强化学习后训练策略,进一步提升学习规划器的安全性和鲁棒性。所提出的扩散学习框架Hyper Diffusion Planner (HDP)在真实车辆平台上部署,并在6个城市驾驶场景和200公里的真实世界测试中,实现了相对于基模型的10倍性能提升。本文证明了当正确设计和训练时,扩散模型可以作为有效且可扩展的端到端自动驾驶规划器,用于复杂的真实世界自动驾驶任务。

英文摘要

Diffusion models have become a popular choice for decision-making tasks in robotics, and more recently, are also being considered for solving autonomous driving tasks. However, their applications and evaluations in autonomous driving remain limited to simulation-based or laboratory settings. The full strength of diffusion models for large-scale, complex real-world settings, such as End-to-End Autonomous Driving (E2E AD), remains underexplored. In this study, we conducted a systematic and large-scale investigation to unleash the potential of the diffusion models as planners for E2E AD, based on a tremendous amount of real-vehicle data and road testing. Through comprehensive and carefully controlled studies, we identify key insights into the diffusion loss space, trajectory representation, and data scaling that significantly impact E2E planning performance. Moreover, we also provide an effective reinforcement learning post-training strategy to further enhance the safety and robustness of the learned planner. The resulting diffusion-based learning framework, Hyper Diffusion Planner (HDP), is deployed on a real-vehicle platform and evaluated across 6 urban driving scenarios and 200 km of real-world testing, achieving a notable 10x performance improvement over the base model. Our work demonstrates that diffusion models, when properly designed and trained, can serve as effective and scalable E2E AD planners for complex, real-world autonomous driving tasks.

2602.20706 2026-05-19 cs.AI cs.DS 版本更新

Online Algorithms with Unreliable Guidance

具有不可靠指导的在线算法

Julien Dallot, Yuval Emek, Yuval Gil, Maciej Pacut, Stefan Schmid

发表机构 * TU Berlin, Germany(柏林技术大学,德国) Technion, Israel(技术学院,以色列) Reykjavik University, Iceland(雷克雅未克大学,冰岛) TU Berlin and Weizenbaum Institute, Germany(柏林技术大学和魏泽恩鲍姆研究所,德国)

AI总结 本文提出了一种名为OAG的在线算法模型,通过请求-回答游戏的视角,分离预测与算法组件,构建了通用分析框架,从而开发出首个通用编译器DTB,将标准在线算法转化为学习增强型算法,并在三个经典问题中取得新的性能平衡和最优解。

详情
AI中文摘要

本文介绍了具有不可靠指导的在线算法(OAG),这是一种用于机器学习增强的在线决策模型,通过请求-回答游戏的视角,清晰地分离了预测和算法组件,从而提供了一个单一、明确的分析框架,仅依赖于问题本身。该模型通过分析框架,使多个概念(来自回答空间的预测、指导、随时竞争性)得以独立分析,使得学习增强型算法能够摆脱预测特定选择(如预测语义、误差函数或探测策略)的限制,从而提升算法的通用性和适用性。OAG模型的简洁框架允许构建首个通用编译器,即滴或信任盲(DTB)编译器,该编译器能够将几乎任何标准、无预测的在线算法转化为学习增强型算法。尽管模型简单,但本文展示了DTB编译器所产生的学习增强型算法在三个经典在线问题中具有强的一致性-鲁棒性保证:在具有对抗性到达顺序的二分图匹配中实现了新的性能平衡,在缓存和均匀度量任务系统中获得了最优解。

英文摘要

This paper introduces online algorithms with unreliable guidance (OAG), a model for ML-augmented online decision-making that cleanly separates the predictive and algorithmic components, thus offering a single, well-defined analysis framework that depends only on the problem at hand. Formulated through the lens of request-answer games, the OAG model brings multiple concepts (predictions from the answer space, guide, anytime competitiveness) which enable learning-augmented algorithms to be analyzed independently of predictor-specific choices - such as prediction semantics, error functions, or probing strategies - that would otherwise restrict the algorithm's generality and applicability. The clean framework of the OAG model allows to build the first generic compiler, the drop-or-trust-blindly (DTB) compiler, that turns almost any standard, prediction-free online algorithm into a learning-augmented one. Although simple, we show that the DTB compiler produces new learning-augmented algorithms with strong consistency-robustness guarantees for three classic online problems: we achieve new trade-offs for bipartite matching with adversarial arrival order, and obtain optimal solutions for caching and uniform metrical task systems.

2602.18584 2026-05-19 cs.LG cs.AI cs.CV 版本更新

GIST: Targeted Data Selection for Instruction Tuning via Coupled Optimization Geometry

GIST: 通过耦合优化几何进行指令微调的目标数据选择

Guanghui Min, Tianhao Huang, Ke Wan, Chen Chen

发表机构 * Department of Computer Science, University of Virginia, Charlottesville, USA(弗吉尼亚大学计算机科学系)

AI总结 本文提出GIST方法,通过子空间对齐替代轴对齐缩放,解决参数高效微调中参数耦合问题,实现更高效的目标数据选择。

Comments ICML 2026; 27 pages, 8 figures, 11 tables

详情
AI中文摘要

目标数据选择已成为高效指令微调中的关键范式,旨在为特定任务识别一小部分有影响力的训练示例。在实践中,影响力通常通过示例对参数更新的影响来衡量。为了使选择可扩展,许多方法利用优化器统计(如Adam状态)作为轴对齐的替代品,隐式地将参数视为坐标独立。我们证明在参数高效微调(PEFT)方法如LoRA中,这一假设在破裂。在这种情况下,诱导的优化几何表现出强跨参数耦合和非平凡的非对角交互,而任务相关的更新方向被限制在低维子空间中。受此不匹配的启发,我们提出GIST(梯度等距子空间转换),一种简单但原则性的替代方法,用稳健的子空间对齐替代轴对齐缩放。GIST通过奇异值分解(SVD)从验证梯度中恢复任务特定的子空间,将训练梯度投影到该耦合子空间,并通过与目标方向的对齐程度评分示例。大量实验表明,在相同的选择预算下,GIST仅使用0.29%的存储和25%的计算时间,与当前最先进的基线匹配或优于。

英文摘要

Targeted data selection has emerged as a crucial paradigm for efficient instruction tuning, aiming to identify a small yet influential subset of training examples for a specific target task. In practice, influence is often measured through the effect of an example on parameter updates. To make selection scalable, many approaches leverage optimizer statistics (e.g., Adam states) as an axis-aligned surrogate for update geometry (i.e., diagonal precondition), implicitly treating parameters as coordinate-wise independent. We show that this assumption breaks down in parameter-efficient fine-tuning (PEFT) methods such as LoRA. In this setting, the induced optimization geometry exhibits strong cross-parameter coupling with non-trivial off-diagonal interactions, while the task-relevant update directions are confined to a low-dimensional subspace. Motivated by this mismatch, we propose GIST (Gradient Isometric Subspace Transformation), a simple yet principled alternative that replaces axis-aligned scaling with robust subspace alignment. GIST recovers a task-specific subspace from validation gradients via singular value decomposition (SVD), projects training gradients into this coupled subspace, and scores examples by their alignment with target directions. Extensive experiments have demonstrated that GIST matches or outperforms the state-of-the-art baseline with only 0.29% of the storage and 25% of the computational time under the same selection budget.

2602.11553 2026-05-19 cs.CV cs.AI 版本更新

Perception-based Image Denoising via Generative Compression

基于生成压缩的图像去噪

Nam Nguyen, Thinh Nguyen, Bella Bose

发表机构 * School of Electrical and Computer Engineering, Oregon State University, Corvallis, OR 97331, USA(电气与计算机工程学院,俄勒冈州立大学,科瓦利斯,OR 97331,USA)

AI总结 本文提出基于生成压缩的去噪框架,通过熵编码潜在表示和感知度量提升去噪效果,实验显示在保持 distortion 性能的同时实现感知改进。

详情
AI中文摘要

图像去噪旨在在去除噪声的同时保持结构细节和感知现实,但受扰动驱动的方法常产生过度平滑的重建,特别是在强噪声和分布偏移下。本文提出一种基于生成压缩的去噪框架,通过从熵编码的潜在表示中重建,强制低复杂度结构,同时通过感知度量如学习感知图像块相似性(LPIPS)损失和Wasserstein距离的生成解码器恢复真实纹理。介绍了两种互补的实例:(i) 基于条件Wasserstein GAN(WGAN)的压缩去噪器,明确控制速率-失真-感知(RDP)权衡;(ii) 基于条件扩散的重建策略,通过压缩潜在进行迭代去噪。进一步建立了在加性高斯噪声下的压缩最大似然去噪器的非渐近保证,包括重建误差和解码误差概率的界限。在合成和真实噪声基准上的实验显示了一致的感知改进,同时保持竞争性的失真性能。

英文摘要

Image denoising aims to remove noise while preserving structural details and perceptual realism, yet distortion-driven methods often produce over-smoothed reconstructions, especially under strong noise and distribution shift. This paper proposes a generative compression framework for perception-based denoising, where restoration is achieved by reconstructing from entropy-coded latent representations that enforce low-complexity structure, while generative decoders recover realistic textures via perceptual measures such as learned perceptual image patch similarity (LPIPS) loss and Wasserstein distance. Two complementary instantiations are introduced: (i) a conditional Wasserstein GAN (WGAN)-based compression denoiser that explicitly controls the rate-distortion-perception (RDP) trade-off, and (ii) a conditional diffusion-based reconstruction strategy that performs iterative denoising guided by compressed latents. We further establish non-asymptotic guarantees for the compression-based maximum-likelihood denoiser under additive Gaussian noise, including bounds on reconstruction error and decoding error probability. Experiments on synthetic and real-noise benchmarks demonstrate consistent perceptual improvements while maintaining competitive distortion performance.

2602.08354 2026-05-19 cs.AI 版本更新

Does Your Reasoning Model Implicitly Know When to Stop Thinking?

你的推理模型是否在隐式地知道何时停止思考?

Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuanda Wang, Zhixia Zhang, Hongyan Xie, Songshi Liang, Zehao Chen, Xuefeng Xiao, Fuzhen Zhuang, Jianxin Li, Deqing Wang, Yikun Ban

发表机构 * Beihang University(北京航空航天大学)

AI总结 本文研究了大推理模型在复杂推理任务中停止思考的隐式能力,提出SAGE方法提升推理效率与准确性。

详情
AI中文摘要

近年来,大型推理模型(LRMs)通过长链思考(CoTs)在复杂推理任务中取得了显著进展。然而,这种方法常导致冗余,影响计算效率并造成实时应用中的延迟。近期研究表明,更长的推理链与正确性无关,甚至可能损害准确性。深入分析后发现,LRMs隐式地知道何时停止思考,但这一能力被当前采样范式所掩盖。为此,我们引入SAGE(Self-Aware Guided Efficient Reasoning)新采样范式,释放其高效推理潜力。进一步将SAGE作为混合采样整合到基于群体的强化学习(SAGE-RL)中,使SAGE-RL能有效将SAGE发现的高效推理模式融入标准pass@1推理,显著提升LRMs在多个挑战性数学基准上的推理准确性和效率。

英文摘要

Recent advancements in large reasoning models (LRMs) have greatly improved their capabilities on complex reasoning tasks through Long Chains of Thought (CoTs). However, this approach often results in substantial redundancy, impairing computational efficiency and causing significant delays in real-time applications. Recent studies show that longer reasoning chains are frequently uncorrelated with correctness and can even be detrimental to accuracy. In a further in-depth analysis of this phenomenon, we surprisingly uncover and empirically verify that LRMs implicitly know the appropriate time to stop thinking, while this capability is obscured by current sampling paradigms. Motivated by this, we introduce SAGE (Self-Aware Guided Efficient Reasoning), a novel sampling paradigm that unleashes this efficient reasoning potential. Furthermore, integrating SAGE as mixed sampling into group-based reinforcement learning (SAGE-RL) enables SAGE-RL to effectively incorporate SAGE-discovered efficient reasoning patterns into standard pass@1 inference, markedly enhancing both the reasoning accuracy and efficiency of LRMs across multiple challenging mathematical benchmarks.

2602.08167 2026-05-19 cs.RO cs.AI cs.CV cs.LG 版本更新

Self-Supervised Bootstrapping of Action-Predictive Embodied Reasoning

基于互联网规模知识的自监督行动预测具身推理

Milan Ganai, Katie Luo, Jonas Frey, Clark Barrett, Marco Pavone

发表机构 * Stanford(斯坦福大学) UC Berkeley(加州大学伯克利分校) NVIDIA(英伟达)

AI总结 本文提出R&B-EnCoRe方法,通过自监督细化使模型从互联网知识中自推导具身推理策略,提升动作执行和导航性能,减少碰撞率。

Comments Robotics: Science and Systems (RSS) 2026

详情
AI中文摘要

具身链式思维(CoT)推理显著提升了视觉-语言-动作(VLA)模型,但当前方法依赖刚性模板指定推理原语(如场景中的物体、高层计划、结构 affordances)。这些模板可能迫使策略处理无关信息,干扰关键动作预测信号。我们引入R&B-EnCoRe,使模型通过自监督细化从互联网规模知识中自推导具身推理。通过将推理视为重要加权变分推断中的潜在变量,模型可生成并提炼无外部奖励、验证者或人工标注的具身特定策略训练数据集。我们在各种VLA架构中验证R&B-EnCoRe,应用于 manipulation(Franka Panda在仿真中,WidowX在硬件中)、legged导航(双足、轮式、自行车、四足)和自动驾驶具身,参数规模为1B、4B、7B和30B。我们的方法在 manipulation 成功率提升28%,导航评分提高101%,碰撞率减少21%。R&B-EnCoRe使模型提炼出预测成功控制的推理,避免手动标注工程,同时将互联网规模知识接地于物理执行。

英文摘要

Embodied Chain-of-Thought (CoT) reasoning has significantly enhanced Vision-Language-Action (VLA) models, yet current methods rely on rigid templates to specify reasoning primitives (e.g., objects in the scene, high-level plans, structural affordances). These templates can force policies to process irrelevant information that distracts from critical action-prediction signals. This creates a bottleneck: without successful policies, we cannot verify reasoning quality; without quality reasoning, we cannot build robust policies. We introduce R&B-EnCoRe, which enables models to bootstrap embodied reasoning from internet-scale knowledge through self-supervised refinement. By treating reasoning as a latent variable within importance-weighted variational inference, models can generate and distill a refined reasoning training dataset of embodiment-specific strategies without external rewards, verifiers, or human annotation. We validate R&B-EnCoRe across manipulation (Franka Panda in simulation, WidowX in hardware), legged navigation (bipedal, wheeled, bicycle, quadruped), and autonomous driving embodiments using various VLA architectures with 1B, 4B, 7B, and 30B parameters. Our approach achieves 28% gains in manipulation success, 101% improvement in navigation scores, and 21% reduction in collision-rate metric over models that indiscriminately reason about all available primitives. R&B-EnCoRe enables models to distill reasoning that is predictive of successful control, bypassing manual annotation engineering while grounding internet-scale knowledge in physical execution.

2602.07730 2026-05-19 cs.LG cs.AI 版本更新

The Laplacian Keyboard: Beyond the Linear Span

拉普拉斯键盘:超越线性空间

Siddarth Chandrasekar, Marlos C. Machado

发表机构 * Department of Computing Science, University of Alberta, Canada(阿尔伯塔大学计算科学系) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔人工智能研究所) Canada CIFAR AI Chair(加拿大CIFAR人工智能 chair)

AI总结 本文提出拉普拉斯键盘框架,通过构建行为库超越线性空间限制,提升零样本控制的表达能力与样本效率。

Comments 31 pages, 17 figures

详情
AI中文摘要

跨科学领域,拉普拉斯特征向量已成为简化复杂系统的基础,从信号处理到量子力学。在强化学习(RL)中,它们同样形成状态空间的基础,使奖励函数可以通过在少量特征向量上的投影来近似。这种投影使零样本控制成为可能,但同时也带来了根本性的限制:诱导的策略只能在所选特征向量的线性空间内具有表达能力。我们引入了拉普拉斯键盘(LK),一种分层框架,超越了这一线性空间。LK从这些特征向量中构建任务无关的行为库,形成一个保证包含任何奖励在该线性空间内的最优策略的行为基础。一个元策略学习动态地缝合这些行为,使在原始线性约束外高效学习策略成为可能。我们建立了零样本近似误差的理论界限,并实证表明LK在零样本解法上有所改进,同时在样本效率上优于标准RL方法。

英文摘要

Across scientific disciplines, Laplacian eigenvectors serve as a fundamental basis for simplifying complex systems, from signal processing to quantum mechanics. In reinforcement learning (RL), they similarly form a basis over the state space, enabling reward functions to be approximated by projection onto a small set of eigenvectors. This projection makes zero-shot control possible, but it also imposes a fundamental limitation: the induced policies are only as expressive as the linear span of the chosen eigenvectors. We introduce the Laplacian Keyboard (LK), a hierarchical framework that goes beyond this linear span. LK constructs a task-agnostic library of behaviors from these eigenvectors, forming a behavior basis guaranteed to contain the optimal policy for any reward within the linear span. A meta-policy learns to stitch these behaviors dynamically, enabling efficient learning of policies outside the original linear constraints. We establish theoretical bounds on zero-shot approximation error and demonstrate empirically that LK improves over the zero-shot solution while achieving better sample efficiency compared to standard RL methods.

2602.06807 2026-05-19 cs.RO cs.AI cs.LG 版本更新

SuReNav: Superpixel Graph-based Constraint Relaxation for Navigation in Over-constrained Environments

SuReNav:基于超像素图的约束放松用于过约束环境中的导航

Keonyoung Koh, Moonkyeong Jung, Samuel Seungsup Lee, Daehyung Park

发表机构 * School of Computing, Korea Advanced Institute of Science and Technology, Korea(韩国科学技术院计算机学院)

AI总结 本文提出SuReNav方法,通过超像素图构建区域约束,利用图神经网络实现安全高效导航,适用于半静态环境中过约束规划问题,提升导航的人类类比性能。

Comments Accepted by ICRA 2026. Code and videos are available at https://sure-nav.github.io/

详情
AI中文摘要

我们针对半静态环境中过约束规划问题,提出SuReNav方法,通过超像素图构建区域约束,利用图神经网络训练于人类示范数据,实现安全高效的导航。框架包含三个组件:1)带有区域约束的超像素图地图生成,2)利用图神经网络进行区域约束放松,3)放松、规划和执行的交织过程。在2D语义地图和3D OpenStreetMap地图上评估,实现最高的人类类比得分,同时保持效率与安全的平衡。最后在现实城市导航中展示其可扩展性和泛化能力。代码和视频可在https://sure-nav.github.io/获取。

英文摘要

We address the over-constrained planning problem in semi-static environments. The planning objective is to find a best-effort solution that avoids all hard constraint regions while minimally traversing the least risky areas. Conventional methods often rely on pre-defined area costs, limiting generalizations. Further, the spatial continuity of navigation spaces makes it difficult to identify regions that are passable without overestimation. To overcome these challenges, we propose SuReNav, a superpixel graph-based constraint relaxation and navigation method that imitates human-like safe and efficient navigation. Our framework consists of three components: 1) superpixel graph map generation with regional constraints, 2) regional-constraint relaxation using graph neural network trained on human demonstrations for safe and efficient navigation, and 3) interleaving relaxation, planning, and execution for complete navigation. We evaluate our method against state-of-the-art baselines on 2D semantic maps and 3D maps from OpenStreetMap, achieving the highest human-likeness score of complete navigation while maintaining a balanced trade-off between efficiency and safety. We finally demonstrate its scalability and generalization performance in real-world urban navigation with a quadruped robot, Spot. Code and Videos are available at https://sure-nav.github.io/.

2602.05993 2026-05-19 cs.LG cs.AI 版本更新

Diamond Maps: Efficient Reward Alignment via Stochastic Flow Maps

钻石映射:通过随机流映射实现高效的奖励对齐

Peter Holderrieth, Douglas Chen, Luca Eyring, Ishin Shah, Giri Anantharaman, Yutong He, Zeynep Akata, Tommi Jaakkola, Nicholas Matthew Boffi, Max Simchowitz

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出钻石映射,一种通过随机流映射实现高效奖励对齐的生成模型,能够在推理时对任意奖励进行准确对齐,提升模型适应性和性能。

详情
AI中文摘要

流和扩散模型能生成高质量样本,但训练后适应用户偏好或约束仍成本高且脆弱,这一挑战被称为奖励对齐。本文认为高效的奖励对齐应是生成模型本身的属性,而非事后考虑,并重新设计模型以增强适应性。我们提出

英文摘要

Flow and diffusion models produce high-quality samples, but adapting them to user preferences or constraints post-training remains costly and brittle, a challenge commonly called reward alignment. We argue that efficient reward alignment should be a property of the generative model itself, not an afterthought, and redesign the model for adaptability. We propose "Diamond Maps", stochastic flow map models that enable efficient and accurate alignment to arbitrary rewards at inference time. Diamond Maps amortize many simulation steps into a single-step sampler, like flow maps, while preserving the stochasticity required for optimal reward alignment. This design makes search, Sequential Monte Carlo, and guidance scalable by enabling efficient and consistent estimation of the value function. Our experiments show that Diamond Maps can be learned efficiently via distillation from GLASS Flows, achieve stronger reward alignment performance, and scale better than existing methods. Our results point toward a practical route to generative models that can be rapidly adapted to arbitrary preferences and constraints at inference time.

2602.02039 2026-05-19 cs.AI cs.CL cs.DB cs.LG 版本更新

Hunt Instead of Wait: Evaluating Deep Data Research on Large Language Models

在大型语言模型上进行深度数据研究:评估深度数据研究

Wei Liu, Peijie Yu, Michele Orini, Yali Du, Yulan He

发表机构 * GitHub

AI总结 本文提出深度数据研究(DDR)任务和DDR-Bench基准,评估大型语言模型的探索智能,发现有效探索需要内在策略而非单纯扩展。

Comments 14 pages, 7 tables, 8 figures, accepted by ICML 2026

详情
AI中文摘要

Agentic Large Language Models 的代理预期超越正确回答,要求模型自主设定目标和决定探索方向。我们称其为探索智能,区别于仅完成任务的执行智能。数据科学提供自然测试场,因为现实分析从原始数据而非明确查询开始,但很少有基准关注此领域。为此,我们引入深度数据研究(DDR),一个开放任务,使 LLM 自主从数据库提取关键洞察,并提出 DDR-Bench,一个大规模、基于清单的基准,支持可验证评估。结果表明,尽管前沿模型显示出新兴自主性,但长周期探索仍具挑战性。我们的分析强调,有效的探索智能不仅依赖代理支架或单纯扩展,还依赖于 agentic 模型的内在策略。

英文摘要

The agency expected of Agentic Large Language Models goes beyond answering correctly, requiring autonomy to set goals and decide what to explore. We term this investigatory intelligence, distinguishing it from executional intelligence, which merely completes assigned tasks. Data Science provides a natural testbed, as real-world analysis starts from raw data rather than explicit queries, yet few benchmarks focus on it. To address this, we introduce Deep Data Research (DDR), an open-ended task where LLMs autonomously extract key insights from databases, and DDR-Bench, a large-scale, checklist-based benchmark that enables verifiable evaluation. Results show that while frontier models display emerging agency, long-horizon exploration remains challenging. Our analysis highlights that effective investigatory intelligence depends not only on agent scaffolding or merely scaling, but also on intrinsic strategies of agentic models.

2602.01705 2026-05-19 cs.LG cs.AI 版本更新

LaDi-RL: Latent Diffusion Reasoning Prevents Entropy Collapse in Reinforcement Learning

LaDi-RL:潜在扩散推理防止强化学习中的熵崩溃

Haoqiang Kang, Yizhe Zhang, Nikki Lijing Kuang, Yi-An Ma, Lianhui Qin

发表机构 * UC San Diego(斯克利普斯海洋研究所) Apple(苹果公司)

AI总结 本文提出LaDi-RL方法,通过潜在扩散模型生成潜在推理轨迹,解决强化学习中熵崩溃问题,提升代码生成和数学推理性能。

详情
AI中文摘要

强化学习已成为改进大语言模型推理的核心范式,但现有方法多在离散token序列上优化政策,导致优化空间与推理结构不匹配。连续潜在空间RL提供了一种替代方案,允许政策探索更高层次的推理表示。然而,单纯转向潜在空间不足,所生成的策略必须建模复杂多模态的合理推理轨迹分布。为此,我们提出潜在扩散推理与强化学习(LaDi-RL),其中扩散模型通过迭代去噪生成潜在推理轨迹。此方法支持结构化探索和表达性分布建模,但也引入了根本的信用分配挑战:策略在潜在空间中行动,而奖励仅在潜在被解码为文本后才被观察到。因此,我们引入层次化潜在-文本回放,对每个潜在轨迹采样多个文本完成并聚合其奖励以获得解码边缘化的潜在效用估计。这为优化扩散策略提供了更清晰且方差更低的奖励信号。实验证明,LaDi-RL在代码生成和数学推理的pass@1指标上分别优于token级RL 9.4%和5.7%,甚至超越了基模型的pass@k性能。

英文摘要

Reinforcement learning has become a central paradigm for improving LLM reasoning, but most existing methods optimize policies over discrete token sequences. This creates a mismatch between the optimization space and the structure of reasoning: many important decisions are semantic, global, and trajectory-level rather than local token choices. Continuous latent-space RL offers a promising alternative by allowing policies to explore higher-level reasoning representations. However, simply moving to latent space is not sufficient. The resulting policy must model a complex, multi-modal distribution over valid reasoning trajectories. We therefore propose Latent Diffusion Reasoning with Reinforcement Learning (LaDi-RL), where a diffusion model generates latent reasoning trajectories through iterative denoising. This formulation enables structured exploration and expressive distribution modeling, but also introduces a fundamental credit-assignment challenge: the policy acts in latent space, while rewards are observed only after the latent is decoded into text. A naive rollout strategy therefore entangles latent reasoning quality with text decoding quality, making it unclear whether an incorrect answer results from a poor latent trajectory or from an imperfect textual realization. To address this, we introduce hierarchical latent-text rollouts. We sample multiple text completions for each latent trajectory and aggregate their rewards to obtain a decoder-marginalized estimate of latent utility. This provides a cleaner and lower-variance reward signal for optimizing the diffusion policy. Empirically, LaDi-RL outperforms token-level RL by 9.4% on code generation and 5.7% on math reasoning in pass@1, and even surpasses the base model's pass@k performance.

2601.23154 2026-05-19 cs.LG cs.AI 版本更新

On Safer Reinforcement Learning for Sedation and Analgesia in Intensive Care

关于重症监护中镇痛和镇静的安全强化学习

Joel Romero-Hernandez, Oscar Camara

发表机构 * BCN MedTech, Complex Systems Lab Universitat Pompeu Fabra Barcelona, Spain(BCN医疗科技,复杂系统实验室 巴塞罗那自治大学 巴塞罗那)

AI总结 本文提出一种离线深度强化学习框架,用于优化重症监护中的镇痛和镇静,通过减少疼痛或联合减少疼痛和30天出院后死亡率来提升治疗安全性。

Comments 48th Annual International Conference of the IEEE Engineering in Medicine & Biology Society (EMBC 2026)

详情
AI中文摘要

重症监护中的镇痛管理通常涉及复杂的权衡,因为治疗不足或过量都会影响患者安全。先前强化学习在镇静和镇痛中的研究主要关注优化干预,但未考虑患者生存率或部分可观测性。为探讨这些设计选择的风险,我们开发了一个离线深度强化学习框架,基于递归状态表示建议每小时药物剂量。使用MIMIC-IV数据库中47,144例ICU住院数据,我们训练并评估了行为正则化的actor-critic模型,根据两个目标:减少疼痛或联合减少疼痛和30天出院后死亡率来处方连续剂量的阿片类药物、丙泊酚、苯二氮䓬类药物和去甲肾上腺素。尽管两种政策与较低的疼痛相关,但镇痛政策与死亡率呈正相关(ρ=0.119,p<0.0001),而联合政策与死亡率呈负相关(ρ=-0.316,p<0.0001)。我们发现这种分歧源于对高共病率的不同反应。这表明,重视出院后结果可能对学习更安全的治疗政策至关重要,即使短期目标仍是主要目标。

英文摘要

Pain management in intensive care usually involves complex trade-offs, since both inadequate and excessive treatment can compromise patient safety. Prior work on reinforcement learning for sedation and analgesia has explored how to optimize these interventions, but has not considered patient survival or partial observability. To investigate the risks of these design choices, we developed an offline deep reinforcement learning framework that suggests hourly medication doses based on recurrent state representations. Using retrospective data from 47,144 ICU stays in the MIMIC-IV database, we trained and evaluated behavior-regularized actor-critic models that prescribe continuous doses of opioids, propofol, benzodiazepines, and dexmedetomidine according to two goals: reduce pain or jointly reduce pain and 30-day post-discharge mortality. Although the two resulting policies were associated with lower pain, clinician agreement with the pain-only policy was positively correlated with mortality ($ρ$=0.119, p<0.0001), while agreement with the joint policy was negatively correlated ($ρ$=-0.316, p<0.0001). We found that such divergence arose from a different response to high levels of comorbidity. This suggests that valuing post-discharge outcomes could be critical for learning safer treatment policies, even if a short-term goal remains the primary objective.

2601.22664 2026-05-19 cs.AI 版本更新

Real-Time Aligned Reward Model beyond Semantics

实时对齐奖励模型超越语义

Zixuan Huang, Xin Xia, Yuxi Ren, Jianbin Zheng, Xuefeng Xiao, Hongyan Xie, Li Huaqiu, Songshi Liang, Zhongxiang Dai, Fuzhen Zhuang, Jianxin Li, Yikun Ban, Deqing Wang

发表机构 * Beihang University(北京航空航天大学) Tsinghua University(清华大学) Renmin University of China(中国人民大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 本文提出R2M框架,通过实时利用策略模型反馈来对齐策略分布偏移,解决RLHF中奖励过拟合问题。

详情
AI中文摘要

Reinforcement Learning from Human Feedback (RLHF) 是对齐大语言模型 (LLMs) 与人类偏好的重要技术,但易受奖励过拟合影响,即策略模型过度拟合奖励模型,利用虚假奖励模式而非忠实捕捉人类意图。以往的缓解方法主要依赖表面语义信息,未能有效解决奖励模型 (RM) 与策略模型之间因连续策略分布偏移导致的不匹配。这不可避免地导致奖励差异增大,加剧奖励过拟合。为解决这些限制,我们引入R2M(实时对齐奖励模型),一种新的轻量RLHF框架。R2M超越了仅依赖预训练LLM语义表示的普通奖励模型。相反,它利用策略的演变隐藏状态(即策略反馈)来在RL过程中与策略的实时分布偏移对齐。本文指出了通过实时利用策略模型反馈来改进奖励模型性能的新有前途的方向。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) is a pivotal technique for aligning large language models (LLMs) with human preferences, yet it is susceptible to reward overoptimization, in which policy models overfit to the reward model, exploit spurious reward patterns instead of faithfully capturing human intent. Prior mitigations primarily relies on surface semantic information and fails to efficiently address the misalignment between the reward model (RM) and the policy model caused by continuous policy distribution shifts. This inevitably leads to an increasing reward discrepancy, exacerbating reward overoptimization. To address these limitations, we introduce R2M (Real-Time Aligned Reward Model), a novel lightweight RLHF framework. R2M goes beyond vanilla reward models that solely depend on the semantic representations of a pretrained LLM. Instead, it leverages the evolving hidden states of the policy (namely policy feedback) to align with the real-time distribution shift of the policy during the RL process. This work points to a promising new direction for improving the performance of reward models through real-time utilization of feedback from policy models.

2601.22530 2026-05-19 cs.AI 版本更新

Enhancing Table Reasoning with Deterministic Table-State Rewards

通过确定性表格状态奖励增强表格推理

Tung Sum Thomas Kwok, Xinyu Wang, Hengzhi He, Xiaofeng Lin, Peng Lu, Liheng Ma, Chunhe Wang, Chun Ho Mak, Yuyu Luo, Ying Nian Wu, Lei Ding, Guang Cheng

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) McGill University(麦吉尔大学) Université de Montréal(蒙特利尔大学) University College London(伦敦大学学院) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Manitoba(曼尼托巴大学)

AI总结 本文提出TABROUGE,一种无需训练的确定性状态奖励,通过改进的LCS指标评估表格状态,提升表格推理准确率并减少样本需求。

详情
AI中文摘要

大型语言模型(LLMs)在处理结构化表格的多步推理时面临挑战,主要原因是缺乏对中间推理状态的显式监督。现有学习奖励模型或执行器基验证器要么无法扩展,要么依赖于不可用的问答环境。为此,我们引入TABROUGE,一种无需训练的确定性状态奖励。通过将最长公共子序列(LCS)指标从文本摘要中适应到表格状态评估,TABROUGE在不需学习模型或外部执行器的情况下,评估中间表格的词汇覆盖度和结构完整性。基于此指标,我们提出RE-TAB框架,将表格推理重新定义为对中间状态的确定性控制,利用TABROUGE提供逐步反馈和轨迹级测试时间扩展(TTS)信号。在六个backbone和三个基准测试中,RE-TAB将准确率提高了26.7个百分点,同时将TTS样本减少了33%。初步GRPO实验进一步表明TABROUGE作为可扩展的后训练奖励的有效性,提升收益8.34个百分点。我们还分析了TABROUGE的失败模式,包括同义词欠奖励和回声列黑客,并确定结构意识的词汇奖励何时仍可靠。

英文摘要

Large Language Models (LLMs) struggle with multi-step reasoning over structured tables. The primary reason is the lack of explicit supervision for intermediate reasoning states. Existing learned reward models or executor-based verifiers are either unscalable or rely on answer-checking environments unavailable for many tabular tasks. This leaves no signal that is scalable and grounded in the query. To address this, we introduce TABROUGE, a training-free and deterministic state reward. By adapting the Longest Common Subsequence (LCS) metric from text summarization to evaluate tabular states, TABROUGE assesses the lexical coverage and structural integrity of intermediate tables against the query without requiring learned models or external executors. Built upon this metric, we propose RE-TAB, a plug-and-play, training-free framework. RE-TAB reframes table reasoning as deterministic control over intermediate states, utilizing TABROUGE for stepwise feedback and trajectory-level test-time scaling (TTS) signals. Across six backbones and three benchmarks, RE-TAB improves accuracy by an average of 26.7 pp over no-reward baselines. It also reduces TTS samples by up to 33%. Preliminary GRPO experiments further indicate TABROUGE's viability as a scalable post-training reward, increasing gains by 8.34 pp. We further analyze failure modes of TABROUGE, including paraphrase under-rewarding and echo-column hacking, and identify when structure-aware lexical rewards remain reliable.

2601.21531 2026-05-19 cs.CR cs.AI cs.CV 版本更新

On the Adversarial Robustness of Large Vision-Language Models under Visual Token Compression

大型视觉-语言模型在视觉标记压缩下的对抗鲁棒性研究

Xinwei Zhang, Hangcheng Liu, Li Bai, Hao Wang, Qingqing Ye, Tianwei Zhang, Haibo Hu

发表机构 * The Hong Kong Polytechnic University, Hong Kong(香港理工大学) Nanyang Technological University, Singapore(南洋理工大学) Chongqing University, Chongqing, China(重庆大学) Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU(未来智能系统中的隐私与安全技术研究中心)

AI总结 本文研究了视觉标记压缩对大型视觉-语言模型对抗鲁棒性的影响,提出CAGE攻击方法,通过优化与压缩推理对齐,揭示压缩机制下的鲁棒性漏洞。

Comments Accepted by ICML 2026

详情
AI中文摘要

视觉标记压缩广泛用于加速大型视觉-语言模型(LVLMs),通过剪枝或合并视觉标记来提升效率,但其对抗鲁棒性仍未经探索。我们发现现有基于编码器的攻击无法充分揭示压缩LVLMs的鲁棒性漏洞,原因在于优化与推理之间的不匹配:扰动在完整标记表示上优化,而推理则通过标记压缩瓶颈进行。为解决这一差距,我们提出了压缩对齐攻击(CAGE),无需假设访问部署压缩机制或其标记预算,通过预期特征破坏和排名扭曲对齐,集中扰动在可能预算下存活的标记上,并主动对齐标记扭曲与排名分数以促进高扭曲证据的保留。在多样化的代表性插件式压缩机制和数据集上,结果表明CAGE在鲁棒性上始终优于基线。本文强调忽视压缩的鲁棒性评估可能过于乐观,呼吁对高效LVLMs进行压缩感知的安全评估和防御。

英文摘要

Visual token compression is widely used to accelerate large vision-language models (LVLMs) by pruning or merging visual tokens, yet its adversarial robustness remains unexplored. We show that existing encoder-based attacks cannot fully disclose the robustness vulnerabilities of compressed LVLMs, due to an optimization-inference mismatch: perturbations are optimized on the full-token representation, while inference is performed through a token-compression bottleneck. To address this gap, we propose the Compression-AliGnEd attack (CAGE), which aligns perturbation optimization with compression inference without assuming access to the deployed compression mechanism or its token budget. CAGE combines (i) expected feature disruption, which concentrates distortion on tokens likely to survive across plausible budgets, and (ii) rank distortion alignment, which actively aligns token distortions with rank scores to promote the retention of highly distorted evidence. Across diverse representative plug-and-play compression mechanisms and datasets, our results show that CAGE consistently achieves lower robust accuracy than the baseline. This work highlights that robustness assessments ignoring compression can be overly optimistic, calling for compression-aware security evaluation and defenses for efficient LVLMs.

2601.19624 2026-05-19 cs.LG cs.AI 版本更新

Tracking Drift: Variation-Aware Entropy Scheduling for Non-Stationary Reinforcement Learning

追踪漂移:面向非平稳强化学习的变异性感知熵调度

Tongxi Wang, Zhuoyang Xia, Xinran Chen, Shan Liu

发表机构 * School of Future Technology, Southeast University, Nanjing, China(东南大学未来技术学院,南京,中国) School of Automation, Southeast University, Nanjing, China(东南大学自动化学院,南京,中国)

AI总结 本文提出AES方法,通过动态调整熵系数以应对环境漂移,减少性能下降并加快恢复速度。

Comments Accepted by ICML 2026

详情
AI中文摘要

现实中的强化学习常面临环境漂移问题,但现有方法多依赖静态熵系数/目标熵,导致稳定期过度探索和漂移后探索不足。本文证明,在标准假设下,非平稳最大熵强化学习中的熵调度可转化为跟踪漂移比较器与稳定更新之间的动态遗憾权衡,得出熵权重与在线非平稳性代理的平方根缩放规则。基于此,提出AES--自适应熵调度,通过在线训练中使用可观察的漂移代理动态调整熵系数/温度,几乎不改变结构且开销极小。在四种算法变体、十二个任务和四种漂移模式中,AES显著减少了漂移导致的性能下降比例并加速了突变后的恢复。

英文摘要

Real-world reinforcement learning often faces environment drift, but most existing methods rely on static entropy coefficients/target entropy, causing over-exploration during stable periods and under-exploration after drift, and leaving unanswered the principled question of how exploration intensity should scale with drift magnitude. We show that, under standard assumptions, entropy scheduling in non-stationary maximum-entropy RL can be cast as the dynamic-regret trade-off between tracking a drifting comparator and stabilizing updates, yielding a square-root scaling rule for the entropy weight in terms of a online non-stationarity proxy. Building on this, we propose AES--Adaptive Entropy Scheduling--which adaptively adjusts the entropy coefficient/temperature online using observable drift proxies during training, requiring almost no structural changes and incurring minimal overhead. Across 4 algorithm variants, 12 tasks, and 4 drift modes, AES significantly reduces the fraction of performance degradation caused by drift and accelerates recovery after abrupt changes.

2601.16527 2026-05-19 cs.LG cs.AI cs.CL cs.CV 版本更新

Beyond Superficial Unlearning: Sharpness-Aware Robust Erasure of Hallucinations in Multimodal LLMs

超越表面遗忘:多模态大语言模型中Hallucinations的锐度感知鲁棒擦除

Xianya Fang, Feiyang Ren, Xiang Chen, Yu Tian, Zhen Bi, Haiyang Yu, Sheng-Jun Huang

发表机构 * College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) Institute for AI, Tsinghua University(清华大学人工智能研究院) Huzhou University(湖州大学) Institute of Dataspace, Hefei Comprehensive National Science Center(合肥综合性国家科学中心数据空间研究院) University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出SARE方法,通过目标导向的min-max优化和Targeted-SAM机制,解决多模态大语言模型中 hallucinations 的鲁棒擦除问题,提升模型稳定性与擦除效果。

详情
AI中文摘要

多模态大语言模型虽然强大,但容易产生hallucinations,即不存在的实体,影响可靠性。尽管最近的遗忘方法试图缓解这一问题,我们发现了一个关键缺陷:结构脆弱性。我们实证显示,标准擦除仅能表面抑制,使模型陷入尖锐极小值,轻度重新学习后hallucinations会灾难性复苏。为确保几何稳定性,我们提出SARE,将遗忘视为目标min-max优化问题,并使用Targeted-SAM机制显式平坦hallucinated概念周围的损失景观。通过在模拟最坏情况参数扰动下抑制hallucinations,我们的框架确保了鲁棒去除的稳定性。大量实验表明,SARE在擦除效果上显著优于基线,同时保持一般生成质量。关键的是,它在重新学习和参数更新中维持持久的hallucination抑制,验证了几何稳定性的有效性。

英文摘要

Multimodal LLMs are powerful but prone to object hallucinations, which describe non-existent entities and harm reliability. While recent unlearning methods attempt to mitigate this, we identify a critical flaw: structural fragility. We empirically demonstrate that standard erasure achieves only superficial suppression, trapping the model in sharp minima where hallucinations catastrophically resurge after lightweight relearning. To ensure geometric stability, we propose SARE, which casts unlearning as a targeted min-max optimization problem and uses a Targeted-SAM mechanism to explicitly flatten the loss landscape around hallucinated concepts. By suppressing hallucinations under simulated worst-case parameter perturbations, our framework ensures robust removal stable against weight shifts. Extensive experiments demonstrate that SARE significantly outperforms baselines in erasure efficacy while preserving general generation quality. Crucially, it maintains persistent hallucination suppression against relearning and parameter updates, validating the effectiveness of geometric stabilization.

2601.16172 2026-05-19 cs.AI 版本更新

Inference-Time Diversity in RL-Trained Lean Theorem Provers: A Diagnostic Study

强化学习训练的轻量定理证明器推理时的多样性:诊断研究

Zachary Burton

发表机构 * MIT(麻省理工学院)

AI总结 研究发现强化学习训练的轻量定理证明器在推理时存在模式崩溃,通过增加采样预算未能提升解题数量,但固定战术骨架调度可显著提升性能,且多样性控制揭示了提示多样性对证明能力的影响。

Comments 20 pages

详情
AI中文摘要

强化学习训练的轻量定理证明器在推理时出现模式崩溃:在miniF2F测试中使用DeepSeek-Prover-V1.5-RL,将独立同分布采样预算从k=32增加到k=64并未增加解题数量(42/244在两种情况下)。固定15个战术骨架调度打破了这一平台期,在k=16时恢复了+45%的相对改进(平均Δ=+12.3±4.2个定理,n=3种子,每个种子的符号均保持)。受控多样性消融排除了提示多样性的混淆因素:战术骨架有助于,同义词匹配基线,而无关的Lean注释会主动退化。留一法正式难度分层揭示了三种扰动之间的结构-内容梯度。这一现象是强化学习特定的:V1.5-Base无论干预如何都证明零个定理,识别出强化学习是创造证明能力的阶段,随后该能力崩溃;扩展到两个额外的7B Lean证明器,强化学习训练的DeepSeek-Prover-V2-7B贡献了+3个前沿解,尽管整体表现平稳,而SFT训练的Goedel-Prover没有(-10.0±4.4个定理,n=3,每个种子的符号均保持)。推理时的结构多样性是强化学习训练证明器的一个廉价、互补的轴,与模型大小或训练计算量无关。

英文摘要

RL-trained Lean theorem provers mode-collapse at inference time: on miniF2F-test with DeepSeek-Prover-V1.5-RL, doubling the i.i.d.\ sampling budget from $k{=}32$ to $k{=}64$ produces zero additional solved theorems (42/244 in both cases). A fixed schedule of 15 tactic skeletons breaks this plateau and recovers a $+45%$ relative improvement at $k{=}16$ (mean $Δ= +12.3 \pm 4.2$ theorems across $n{=}3$ seeds, sign preserved in every seed). A controlled diversity ablation rules out the prompt-diversity confound: tactic skeletons help, paraphrases match the baseline, and irrelevant Lean comments actively degrade. A leave-one-out formalization-difficulty stratification reveals a structural-content gradient across the three perturbations. The phenomenon is RL-specific: V1.5-Base proves zero theorems regardless of intervention, identifying RL as the stage that creates the proof capability which subsequently collapses; extending to two additional 7B Lean provers, RL-trained DeepSeek-Prover-V2-7B contributes $+3$ frontier solves no i.i.d.\ baseline can reach despite a flat aggregate, while SFT-trained Goedel-Prover does not ($-10.0 \pm 4.4$ theorems, $n{=}3$, sign preserved every seed). Inference-time structural diversity is a cheap, complementary axis for RL-trained provers, orthogonal to scaling model size or training compute.

2601.13992 2026-05-19 cs.CL cs.AI 版本更新

"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework

整体大于部分之和:一种兼容性感知的多教师CoT蒸馏框架

Jin Cui, Jiaqi Guo, Ruixuan Yang, Jiayi Lu, Jiepeng Zhou, Jiajun Xu, Jiangcheng Song, Boran Zhao, Pengju Ren

发表机构 * State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(人机混合增强智能国家重点实验室,人工智能与机器人研究院,西安交通大学) Nankai University(南开大学) The Hong Kong University of Science and Technology(Guangzhou)(香港科技大学(广州)) School of Software Engineering, State Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Institute of Artificial Intelligence and Robotics, Xi’an Jiaotong University(软件学院,人机混合增强智能国家重点实验室,人工智能与机器人研究院,西安交通大学)

AI总结 本文提出COMPACT框架,通过动态加权不同教师的梯度,结合多维指标提升学生模型的推理能力,有效整合多样化推理能力并减少灾难性遗忘。

Comments 11pages, 9figures

详情
AI中文摘要

链式推理(CoT)推理赋予大语言模型(LLMs)显著能力,但通常需要极高的参数规模。CoT蒸馏作为一种有前景的范式,将推理能力转移到紧凑的学生模型(SLMs)中,但现有方法通常依赖单一教师,限制了学生潜力,因为个体LLMs常有不同能力偏倚且可能遭受灾难性遗忘。虽然利用多样教师似乎有吸引力,但有效融合其监督仍具挑战:教师-学生不兼容可能放大幻觉,被动监督无法确保真实逻辑内化。为此,我们引入COMPACT框架,通过动态加权教师梯度,基于多维指标评估学生实时兼容性:(1)基于图的共识过滤误导性推理路径;(2)基于互信息的适应性检测“顿悟时刻”以真正理解推理过程而非单纯模仿;(3)基于损失的难度评估学生对教师指导的接受度并防止负迁移。大量实验和潜在空间分析表明,COMPACT能有效整合多样化推理能力而不破坏模型原有知识结构,在各种基准测试中取得最佳性能并缓解灾难性遗忘。

英文摘要

Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs), but existing approaches often rely on a solitary teacher, capping the student's potential since individual LLMs often exhibit distinct capability biases and may suffer from catastrophic forgetting. While leveraging diverse teachers seems appealing, effectively fusing their supervisions remains challenging: teacher-student incompatibility risks amplifying hallucinations, and passive supervision fails to ensure genuine logic internalization. To address this, we introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients based on the student's real-time compatibility evaluated by a multi-dimensional metric: (1) Graph-based Consensus to filter misleading rationales by identifying mainstream reasoning paths; (2) Mutual-Information-based Adaptability to detect "epiphany moments" for genuinely understanding the reasoning process rather than merely imitating; and (3) Loss-based Difficulty to assess student receptivity to the teacher's guidance and prevent negative transfer. Extensive experiments and latent space analysis demonstrate that COMPACT effectively integrates diverse reasoning capabilities without damaging the model's original knowledge structure, achieving state-of-the-art performance on various benchmarks while mitigating catastrophic forgetting.

2601.11956 2026-05-19 cs.CL cs.AI 版本更新

Double-Calibration: Towards Reliable LLMs via Calibrating Knowledge and Reasoning Confidence

双重校准:通过校准知识和推理置信度实现可靠的LLM

Yuyin Lu, Ziran Liang, Yanghui Rao, Wenqi Fan, Fu Lee Wang, Qing Li

发表机构 * School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学计算机科学与工程学院,广州,中国) Department of Computing, The Hong Kong Polytechnic University, Hong Kong SAR(香港理工大学计算机系,香港特别行政区) School of Science and Technology, Hong Kong Metropolitan University, Hong Kong SAR(香港 Metropolitan 大学科技学院,香港特别行政区)

AI总结 本文提出双重校准框架,通过校准知识和推理置信度提升LLM的可靠性,实验表明其在保持低token成本的同时显著提高准确性和置信度校准。

Comments This work is to appear in the Proceedings of the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026)

详情
AI中文摘要

Large Language Models (LLMs) 的可靠推理受到其易产生幻觉的挑战。尽管通过知识图谱(KGs)增强LLMs可以提高事实准确性,但现有KG增强方法未能量化检索证据和LLM推理中的epistemic不确定性。为此,我们引入DoublyCal框架,基于新颖的双重校准原则。DoublyCal采用轻量级代理模型生成KG证据并校准证据置信度,此校准的支撑证据引导黑盒LLM,产生更准确且校准良好的最终预测,其置信度可追溯到支撑证据的不确定性。在知识密集型基准测试中,DoublyCal显著提高了黑盒LLM的准确性和置信度校准,同时保持低token成本。

英文摘要

Reliable reasoning in Large Language Models (LLMs) is challenged by their propensity for hallucination. While augmenting LLMs with Knowledge Graphs (KGs) improves factual accuracy, existing KG-augmented methods fail to quantify epistemic uncertainty in both the retrieved evidence and LLMs' reasoning. To bridge this gap, we introduce DoublyCal, a framework built on a novel double-calibration principle. DoublyCal employs a lightweight proxy model to first generate KG evidence alongside a calibrated evidence confidence. This calibrated supporting evidence then guides a black-box LLM, yielding final predictions that are not only more accurate but also well-calibrated, with confidence scores traceable to the uncertainty of the supporting evidence. Experiments on knowledge-intensive benchmarks show that DoublyCal significantly improves both the accuracy and confidence calibration of black-box LLMs while maintaining low token cost.

2601.11895 2026-05-19 cs.LG cs.AI cs.SE 版本更新

DevBench: A Realistic, Developer-Informed Benchmark for Code Generation Models

DevBench:一个现实的、面向开发者的代码生成模型基准测试

Adarsh Kumarappan, Pareesa Ameneh Golnari, Wen Wen, Xiaoyu Liu, Gabriel Ryan, Yuting Sun, Shengyu Fu, Elsie Nallipogu

发表机构 * California Institute of Technology(加州理工学院) Microsoft(微软公司)

AI总结 DevBench通过真实开发者数据和生成模型合成,构建了包含六种编程语言和任务的1800个实例,评估大语言模型在代码补全任务中的表现,揭示模型在语法精度、语义推理和实用价值上的差异。

详情
AI中文摘要

DevBench是一个基于 telemetry 的基准测试,旨在评估大语言模型(LLMs)在现实代码补全任务中的性能。它包含1,800个评估实例,覆盖六种编程语言和六种任务类别,这些数据来源于真实开发者 telemetry 和多个提供商家庭的生成模型,以减轻单一来源偏差。与以往的基准测试不同,它强调生态效度,避免训练数据污染,并允许详细的诊断。评估结合了功能性正确性、基于相似度的指标以及LLM评估,专注于有用性和上下文相关性。9种最先进的模型被评估,最强的模型在Pass@1上仅达到43.5%,证实了该基准测试仍然具有挑战性,并揭示了语法精度、语义推理和实用价值之间的差异。我们的基准测试提供了可操作的见解,以指导模型选择和改进,这些细节通常缺失于其他基准测试,但对实际部署和目标模型开发至关重要。

英文摘要

DevBench is a telemetry-driven benchmark designed to evaluate Large Language Models (LLMs) on realistic code completion tasks. It includes 1,800 evaluation instances across six programming languages and six task categories derived from real developer telemetry and synthesized using generator models from multiple provider families to mitigate single-source bias. Unlike prior benchmarks, it emphasizes ecological validity, avoids training data contamination, and enables detailed diagnostics. The evaluation combines functional correctness, similarity-based metrics, and LLM-judge assessments focused on usefulness and contextual relevance. 9 state-of-the-art models were assessed, with the strongest achieving only 43.5% Pass@1, confirming the benchmark remains challenging and revealing differences in syntactic precision, semantic reasoning, and practical utility. Our benchmark provides actionable insights to guide model selection and improvement, detail that is often missing from other benchmarks but is essential for both practical deployment and targeted model development.

2601.06633 2026-05-19 cs.LG cs.AI cs.CL cs.CY 版本更新

KASER: Knowledge-Aligned Student Error Simulator for Open-Ended Coding Tasks

KASER:面向开放性编程任务的知识对齐学生错误模拟器

Zhangqi Duan, Nigel Fernandez, Andrew Lan

发表机构 * University of Massachusetts(马萨诸塞大学) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 KASER通过强化学习方法,结合代码相似性、错误匹配和预测多样性,提升大语言模型对学生错误的模拟与预测能力,实验表明其在代码和错误预测及错误覆盖方面优于基线方法。

Comments Published in ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

开放性任务,如计算机科学教育中的编程问题,能提供关于学生知识的深入洞察。然而,训练大语言模型(LLMs)模拟和预测学生在这些问题上的可能错误具有挑战性:它们常出现模式崩溃,并无法充分捕捉学生响应中的语法、风格和解决方案方法的多样性。在本文中,我们提出了KASER(知识对齐学生错误模拟器),一种将错误与学生知识对齐的新方法。我们提出了一种基于强化学习的训练方法,使用混合奖励反映学生代码预测的三个方面:i)代码与地面真相的相似性,ii)错误匹配,以及iii)代码预测的多样性。在两个真实世界数据集上,我们进行了两个层面的评估,并表明:在每对学生-问题对层面,我们的方法在代码和错误预测上优于基线;在每问题层面,我们的方法在错误覆盖和模拟代码多样性上优于基线。

英文摘要

Open-ended tasks, such as coding problems that are common in computer science education, provide detailed insights into student knowledge. However, training large language models (LLMs) to simulate and predict possible student errors in their responses to these problems can be challenging: they often suffer from mode collapse and fail to fully capture the diversity in syntax, style, and solution approach in student responses. In this work, we present KASER (Knowledge-Aligned Student Error Simulator), a novel approach that aligns errors with student knowledge. We propose a training method based on reinforcement learning using a hybrid reward that reflects three aspects of student code prediction: i) code similarity to the ground-truth, ii) error matching, and iii) code prediction diversity. On two real-world datasets, we perform two levels of evaluation and show that: At the per-student-problem pair level, our method outperforms baselines on code and error prediction; at the per-problem level, our method outperforms baselines on error coverage and simulated code diversity.

2601.05527 2026-05-19 cs.LG cs.AI 版本更新

DeMa: Dual-Path Delay-Aware Mamba for Efficient Multivariate Time Series Analysis

DeMa:双路径延迟感知Mamba用于高效多变量时间序列分析

Rui An, Haohao Qu, Wenqi Fan, Xuequn Shang, Qing Li

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 DeMa通过双路径架构改进Mamba,解决多变量时间序列分析中的延迟建模、跨变量依赖和时间动态分离问题,实现高效且准确的分析。

Comments The article has been accepted by Frontiers of Computer Science (FCS), with the DOI: {10.1007/s11704-026-52221-6}

详情
AI中文摘要

准确且高效的多变量时间序列(MTS)分析对广泛智能应用越来越关键。在这一领域,Transformer因其强大的捕捉成对依赖能力而成为主导架构。然而,基于Transformer的模型存在二次计算复杂度和高内存开销,限制了其在长期和大规模MTS建模中的可扩展性和实用性。最近,Mamba作为一种线性时间替代方案出现,具有高表达能力。然而,直接应用原始Mamba到MTS仍不理想,因为存在三个关键限制:(i)缺乏显式的跨变量建模,(ii)难以分离纠缠的系列内时间动态和系列间交互,(iii)对潜在时间滞后交互效应的建模不足。这些问题限制了其在多样MTS任务中的有效性。为了解决这些挑战,我们提出了DeMa,一种双路径延迟感知Mamba骨干网络。DeMa保留了Mamba的线性复杂度优势,同时显著提高了其在MTS设置中的适用性。具体而言,DeMa引入了三个关键创新:(i)它将MTS分解为系列内时间动态和系列间交互;(ii)它开发了一个时间路径,包含Mamba-SSD模块,以捕捉每个单独系列内的长程动态,实现系列无关的并行计算;(iii)它设计了一个变量路径,包含Mamba-DALA模块,通过延迟感知线性注意力模块来建模跨变量依赖。在五个代表性任务(长期和短期预测、数据插补、异常检测和系列分类)上的广泛实验表明,DeMa在达到最先进性能的同时,还实现了显著的计算效率。

英文摘要

Accurate and efficient multivariate time series (MTS) analysis is increasingly critical for a wide range of intelligent applications. Within this realm, Transformers have emerged as the predominant architecture due to their strong ability to capture pairwise dependencies. However, Transformer-based models suffer from quadratic computational complexity and high memory overhead, limiting their scalability and practical deployment in long-term and large-scale MTS modeling. Recently, Mamba has emerged as a promising linear-time alternative with high expressiveness. Nevertheless, directly applying vanilla Mamba to MTS remains suboptimal due to three key limitations: (i) the lack of explicit cross-variate modeling, (ii) difficulty in disentangling the entangled intra-series temporal dynamics and inter-series interactions, and (iii) insufficient modeling of latent time-lag interaction effects. These issues constrain its effectiveness across diverse MTS tasks. To address these challenges, we propose DeMa, a dual-path delay-aware Mamba backbone. DeMa preserves Mamba's linear-complexity advantage while substantially improving its suitability for MTS settings. Specifically, DeMa introduces three key innovations: (i) it decomposes the MTS into intra-series temporal dynamics and inter-series interactions; (ii) it develops a temporal path with a Mamba-SSD module to capture long-range dynamics within each individual series, enabling series-independent, parallel computation; and (iii) it designs a variate path with a Mamba-DALA module that integrates delay-aware linear attention to model cross-variate dependencies. Extensive experiments on five representative tasks, long- and short-term forecasting, data imputation, anomaly detection, and series classification, demonstrate that DeMa achieves state-of-the-art performance while delivering remarkable computational efficiency.

2601.02071 2026-05-19 cs.AI 版本更新

FormuLLA: A Large Language Model Approach to Generating Novel 3D Printable Formulations

FormuLLA:一种用于生成新型3D打印配方的大型语言模型方法

Adeshola Okubena, Yusuf Ali Mohammed, Moe Elbadawi

发表机构 * School of Biological and Behavioural Sciences, Queen Mary University of London(伦敦女王玛丽大学生物与行为科学学院)

AI总结 本文提出FormuLLA方法,利用微调后的大型语言模型推荐3D打印配方的辅料并预测丝材机械性能,揭示了模型选择和参数对性能的影响,指出小模型易遗忘及标准指标无法评估配方可加工性。

详情
AI中文摘要

制药三维(3D)打印是一种先进的制造技术,有潜力实现真正个性化剂量形式。最近的研究将人工智能(AI)整合到配方和过程开发中,大大改变了制药3D打印的现状。到目前为止,大多数AI驱动的努力仍然狭隘聚焦,未能考虑该技术固有的配方挑战。最近的AI进展引入了人工通用智能概念,其中系统超越传统预测建模,向更通用、类人推理发展。在本文中,我们研究了在包含超过1400种配方的熔融沉积建模(FDM)数据集上微调的大型语言模型(LLMs),以根据活性药物成分(API)剂量推荐合适的辅料,并预测丝材的机械性能。四种LLM架构进行了微调,系统评估了微调和生成参数配置。我们的结果表明,Llama2最适合推荐FDM配方的辅料。此外,模型选择和参数化显著影响性能,较小的LLM表现出灾难性遗忘。此外,我们证明:(i)即使使用超过1400种配方的相对较小数据集,也可能导致模型灾难性遗忘;(ii)标准LLM指标仅评估语言性能,而不评估配方可加工性;(iii)训练于生物医学相关数据的LLM并不总是产生最佳结果。解决这些挑战对于推动LLM超越语言能力,向可靠的制药配方开发系统迈进至关重要。

英文摘要

Pharmaceutical three-dimensional (3D) printing is an advanced fabrication technology with the potential to enable truly personalised dosage forms. Recent studies have integrated artificial intelligence (AI) to accelerate formulation and process development, drastically transforming current approaches to pharmaceutical 3D printing. To date, most AI-driven efforts remain narrowly focused, while failing to account for the broader formulation challenges inherent to the technology. Recent advances in AI have introduced artificial general intelligence concepts, wherein systems extend beyond conventional predictive modelling toward more generalised, human-like reasoning. In this work, we investigate the application of large language models (LLMs), fine-tuned on a fused deposition modelling (FDM) dataset comprising over 1400 formulations, to recommend suitable excipients based on active pharmaceutical ingredient (API) dose, and predict filament mechanical properties. Four LLM architectures were fine-tuned, with systematic evaluation of both fine-tuning and generative parameter configurations. Our results demonstrate that Llama2 was best suited for recommending excipients for FDM formulations. Additionally, model selection and parameterisation significantly influence performance, with smaller LLMs exhibiting instances of catastrophic forgetting. Furthermore, we demonstrate: (i) even with relatively small dataset of over 1400 formulations, it can lead to model catastrophic forgetting; (ii) standard LLM metrics only evaluate linguistic performance but not formulation processability; and (iii) LLMs trained on biomedically-related data do not always produce the best results. Addressing these challenges is essential to advancing LLMs beyond linguistic proficiency and toward reliable systems for pharmaceutical formulation development.

2512.23752 2026-05-19 cs.LG cs.AI 版本更新

Geometric Scaling of Bayesian Inference in LLMs

贝叶斯推断在大语言模型中的几何特性

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

发表机构 * Columbia University(哥伦比亚大学) Columbia University School of Professional Studies(哥伦比亚大学专业研究学院) Department of Statistics(统计学系) Columbia University Department of Computer Science(哥伦比亚大学计算机科学系)

AI总结 研究发现大语言模型中存在几何结构,用于编码后验结构,通过干预实验表明该结构是不确定性的重要读取而非单一计算瓶颈。

Comments v2: Extend cross-architecture analysis with Qwen2.5 and DeepSeek (MLA) families; add SULA and RoPE-channel results; document MLA boundary case (DeepSeek-V2-Lite: substrate preserved, dynamic routing absent); add dual-entropy framework at scale; fix duplicate bibliography entries

详情
AI中文摘要

近期研究表明,经过受控

英文摘要

Recent work has shown that small transformers trained in controlled "wind-tunnel'' settings can implement exact Bayesian inference, and that their training dynamics produce a geometric substrate -- low-dimensional value manifolds and progressively orthogonal keys -- that encodes posterior structure. We investigate whether this geometric signature persists in production-grade language models. Across Pythia, Phi-2, Llama-3, and Mistral families, we find that last-layer value representations organize along a single dominant axis whose position strongly correlates with predictive entropy, and that domain-restricted prompts collapse this structure into the same low-dimensional manifolds observed in synthetic settings. To probe the role of this geometry, we perform targeted interventions on the entropy-aligned axis of Pythia-410M during in-context learning. Removing or perturbing this axis selectively disrupts the local uncertainty geometry, whereas matched random-axis interventions leave it intact. However, these single-layer manipulations do not produce proportionally specific degradation in Bayesian-like behavior, indicating that the geometry is a privileged readout of uncertainty rather than a singular computational bottleneck. Taken together, our results show that modern language models preserve the geometric substrate that enables Bayesian inference in wind tunnels, and organize their approximate Bayesian updates along this substrate.

2512.22473 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Gradient Dynamics of Attention: How Cross-Entropy Sculpts Bayesian Manifolds

注意力的梯度动力学:交叉熵如何塑造贝叶斯流形

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

发表机构 * Columbia University(哥伦比亚大学) Columbia University School of Professional Studies(哥伦比亚大学专业研究学院) Department of Statistics(统计学系) Columbia University Department of Computer Science(哥伦比亚大学计算机科学系)

AI总结 研究通过分析交叉熵训练如何重塑Transformer注意力分数和值向量,揭示了注意力评分的优势路由定律和值的职责加权更新,展示了梯度动力学如何塑造贝叶斯流形以支持概率推理。

Comments v2: Add dual-entropy connection - advantage signal drives \r{ho} down; fix duplicate bibliography entries (synced from Paper I)

详情
AI中文摘要

Transformer在精心构建的『贝叶斯风洞』和大规模语言模型中表现出精确的概率推理能力,但梯度学习如何创建所需的内部几何仍不清楚。本文提供了一种完整的首次级分析,揭示了交叉熵训练如何重塑Transformer注意力头中的注意力评分和值向量。核心结果是注意力评分的『优势路由定律』,以及值的『职责加权更新』。这些方程诱导出正反馈循环,使路由和内容共同专业化:查询更强烈地路由到误差信号高于平均的值,而这些值被拉向使用它们的查询。本文展示了这种耦合专业化行为类似于两时间尺度EM过程:注意力权重实现E步(软责任),而值实现M步(责任加权原型更新),查询和键调整假设框架。通过受控模拟,包括一个粘性马尔可夫链任务,比较了闭合形式EM式更新与标准SGD,证明了相同的梯度动力学在最小化交叉熵的同时,塑造了本文配套工作所识别的低维流形,这些流形实现了贝叶斯推理。这给出了一个统一的画面:优化(梯度流)导致几何(贝叶斯流形),后者又支持功能(上下文概率推理)。

英文摘要

Transformers empirically perform precise probabilistic reasoning in carefully constructed ``Bayesian wind tunnels'' and in large-scale language models, yet the mechanisms by which gradient-based learning creates the required internal geometry remain opaque. We provide a complete first-order analysis of how cross-entropy training reshapes attention scores and value vectors in a transformer attention head. Our core result is an \emph{advantage-based routing law} for attention scores, \[ \frac{\partial L}{\partial s_{ij}} = α_{ij}\bigl(b_{ij}-\mathbb{E}_{α_i}[b]\bigr), \qquad b_{ij} := u_i^\top v_j, \] coupled with a \emph{responsibility-weighted update} for values, \[ Δv_j = -η\sum_i α_{ij} u_i, \] where $u_i$ is the upstream gradient at position $i$ and $α_{ij}$ are attention weights. These equations induce a positive feedback loop in which routing and content specialize together: queries route more strongly to values that are above-average for their error signal, and those values are pulled toward the queries that use them. We show that this coupled specialization behaves like a two-timescale EM procedure: attention weights implement an E-step (soft responsibilities), while values implement an M-step (responsibility-weighted prototype updates), with queries and keys adjusting the hypothesis frame. Through controlled simulations, including a sticky Markov-chain task where we compare a closed-form EM-style update to standard SGD, we demonstrate that the same gradient dynamics that minimize cross-entropy also sculpt the low-dimensional manifolds identified in our companion work as implementing Bayesian inference. This yields a unified picture in which optimization (gradient flow) gives rise to geometry (Bayesian manifolds), which in turn supports function (in-context probabilistic reasoning).

2512.22471 2026-05-19 cs.LG cs.AI stat.ML 版本更新

The Bayesian Geometry of Transformer Attention

Transformer 注意力的贝叶斯几何

Naman Agarwal, Siddhartha R. Dalal, Vishal Misra

发表机构 * Columbia University(哥伦比亚大学) Columbia University School of Professional Studies(哥伦比亚大学专业研究学院) Department of Statistics(统计学系) Columbia University Department of Computer Science(哥伦比亚大学计算机科学系)

AI总结 本文通过构建贝叶斯风道,验证了Transformer在上下文中的贝叶斯推理能力,发现其通过几何机制实现后验更新与路由,揭示了注意力机制的必要性及扁平架构的不足。

Comments v2: Add dual-entropy measurement framework (H_I, H_P, \r{ho} = H_P/H_I); incorporate Overleaf revisions; fix duplicate bibliography entries (akyurek mashup; openai title; legacy aliases removed)

详情
AI中文摘要

Transformer 似乎在上下文中表现出贝叶斯推理,但严格验证一直困难:自然数据缺乏解析后验,大模型将推理与记忆混淆。我们通过构建贝叶斯风道——可控环境,其中真实后验以闭合形式给出,记忆可证明不可能。在这些设置中,小型Transformer以10^-3-10^-4 bit精度再现贝叶斯后验,而容量匹配的MLP则相差多个数量级,确立了明确的架构分离。在两个任务——双射消除和隐马尔可夫模型(HMM)状态跟踪中,发现Transformer通过一致的几何机制实现贝叶斯推理:残差流作为信念基质,前馈网络执行后验更新,注意力提供内容可寻址路由。几何诊断揭示正交键基、渐进查询-键对齐和由后验熵参数化的低维值流形。训练期间该流形展开而注意力模式保持稳定,这与最近的梯度分析预测的帧精度解离一致。这些结果表明,分层注意力通过几何设计实现贝叶斯推理,解释了注意力的必要性及扁平架构的失败。贝叶斯风道为机械连接小型可验证系统与大语言模型中推理现象提供了基础。

英文摘要

Transformers often appear to perform Bayesian reasoning in context, but verifying this rigorously has been impossible: natural data lack analytic posteriors, and large models conflate reasoning with memorization. We address this by constructing \emph{Bayesian wind tunnels} -- controlled environments where the true posterior is known in closed form and memorization is provably impossible. In these settings, small transformers reproduce Bayesian posteriors with $10^{-3}$-$10^{-4}$ bit accuracy, while capacity-matched MLPs fail by orders of magnitude, establishing a clear architectural separation. Across two tasks -- bijection elimination and Hidden Markov Model (HMM) state tracking -- we find that transformers implement Bayesian inference through a consistent geometric mechanism: residual streams serve as the belief substrate, feed-forward networks perform the posterior update, and attention provides content-addressable routing. Geometric diagnostics reveal orthogonal key bases, progressive query-key alignment, and a low-dimensional value manifold parameterized by posterior entropy. During training this manifold unfurls while attention patterns remain stable, a \emph{frame-precision dissociation} predicted by recent gradient analyses. Taken together, these results demonstrate that hierarchical attention realizes Bayesian inference by geometric design, explaining both the necessity of attention and the failure of flat architectures. Bayesian wind tunnels provide a foundation for mechanistically connecting small, verifiable systems to reasoning phenomena observed in large language models.

2512.17843 2026-05-19 cs.CL cs.AI cs.HC 版本更新

ShareChat: A Dataset of Chatbot Conversations in the Wild

ShareChat: 一个真实对话的大型数据集

Yueru Yan, Tuc Nguyen, Bo Su, Melissa Lieffers, Thai Le

发表机构 * Indiana University(印第安纳大学)

AI总结 本文提出ShareChat数据集,包含142,808条对话(660,293个回合),涵盖95种语言,分析不同平台对话完整性和响应延迟差异,揭示多平台交互特性。

详情
AI中文摘要

通过统一的文本接口评估大型语言模型(LLMs),当前学术基准掩盖了不同商业平台的独特设计和功能如何影响真实用户行为和系统性能。为弥合这一差距,我们提出了ShareChat,这是首个包含142,808条对话(660,293个回合)的大型语料库,从ChatGPT、Perplexity、Grok、Gemini和Claude的公开共享URL中收集。ShareChat保留了原生平台功能,包括引用、思考痕迹和代码 artifacts,涵盖95种语言,时间跨度从2023年4月至2025年10月,补充了现有语料库中同质化交互的不足。为了展示数据集的评估用途,我们提出了三个案例研究:对话完整性分析评估跨平台意图满足差异,来源定位分析比较搜索增强系统之间的引用策略,时间分析揭示响应延迟动态的差异。这些分析展示了单平台或剥离功能语料库无法解决的研究问题。该数据集已公开可用。

英文摘要

By evaluating Large Language Models (LLMs) through uniform, text-only interfaces, current academic benchmarks obscure how the unique designs and affordances of distinct commercial platforms shape real-world user behavior and system performance. To bridge this gap, we present ShareChat, the first large-scale corpus of 142,808 conversations (660,293 turns) collected from publicly shared URLs on ChatGPT, Perplexity, Grok, Gemini, and Claude. ShareChat preserves native platform affordances, including citations, thinking traces, and code artifacts, across 95 languages and the period from April 2023 to October 2025, complementing existing corpora that homogenize these interactions. To demonstrate the dataset's evaluative utility, we present three case studies: a conversation completeness analysis assessing cross-platform differences in intent satisfaction, a source grounding analysis comparing citation strategies between search-augmented systems, and a temporal analysis revealing divergent response latency dynamics. Together, these analyses demonstrate research questions that are inaccessible to single-platform or stripped-affordance corpora. The dataset is publicly available.

2511.19078 2026-05-19 cs.CL cs.AI 版本更新

GraphMind: Theorem Selection and Conclusion Generation Framework with Dynamic GNN for LLM Reasoning

GraphMind: 一种基于动态GNN的定理选择与结论生成框架用于LLM推理

Yutong Li, Yitian Zhou, Xudong Wang, GuoChen, Caiyan Qin

AI总结 GraphMind通过动态图神经网络与LLM结合,实现多步推理中的定理选择和结论生成,提升上下文感知的推理能力。

Comments This paper has been withdrawn by the authors in order to prepare a substantially revised version

详情
AI中文摘要

大型语言模型(LLMs)在自然语言理解和生成方面表现出色,包括多步推理如数学证明。然而,现有方法缺乏显式且动态的机制来结构化表示和演变中间推理状态,限制了其在上下文感知定理选择和迭代结论生成方面的能力。为此,我们提出了GraphMind,一种新颖的动态图基框架,将图神经网络(GNN)与LLMs结合,以迭代方式选择定理并生成中间结论。我们的方法将推理过程建模为异构演进图,其中节点代表条件、定理和结论,边捕捉节点间的逻辑依赖。通过编码当前推理状态并利用语义匹配进行定理选择,我们的框架在闭环模式下实现了上下文感知、可解释和结构化的推理。在各种问答(QA)数据集上的实验表明,所提出的GraphMind方法在多步推理中实现了稳定性能提升,并显著优于现有基线方法,验证了我们方法的有效性和通用性。

英文摘要

Large language models (LLMs) have demonstrated impressive capabilities in natural language understanding and generation, including multi-step reasoning such as mathematical proving. However, existing approaches often lack an explicit and dynamic mechanism to structurally represent and evolve intermediate reasoning states, which limits their ability to perform context-aware theorem selection and iterative conclusion generation. To address these challenges, we propose GraphMind, a novel dynamic graph-based framework that integrates the graph neural network (GNN) with LLMs to iteratively select theorems and generate intermediate conclusions for multi-step reasoning. Our method models the reasoning process as a heterogeneous evolving graph, where nodes represent conditions, theorems, and conclusions, while edges capture logical dependencies between nodes. By encoding the current reasoning state with GNN and leveraging semantic matching for theorem selection, our framework enables context-aware, interpretable, and structured reasoning in a closed-loop manner. Experiments on various question-answering (QA) datasets demonstrate that our proposed GraphMind method achieves consistent performance improvements and significantly outperforms existing baselines in multi-step reasoning, validating the effectiveness and generalizability of our approach.

2510.23641 2026-05-19 cs.LG cs.AI hep-ex physics.ins-det 版本更新

Spatially Aware Linear Transformer (SAL-T) for Particle Jet Tagging

具有空间意识的线性变换器(SAL-T)用于粒子喷注标记

Aaron Wang, Zihan Zhao, Subash Katel, Vivekanand Gyanchand Sahu, Elham E Khoda, Abhijith Gandrakota, Jennifer Ngadiuba, Richard Cavanaugh, Javier Duarte

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of California San Diego(加州大学圣地亚哥分校) Fermi National Accelerator Laboratory(费米国家加速器实验室)

AI总结 SAL-T通过空间感知分区和卷积层提升喷注分类性能,在资源消耗和延迟方面优于标准linformer。

详情
AI中文摘要

Transformers在高能粒子碰撞中能有效捕捉全局和局部相关性,但在高数据吞吐环境如CERN LHC中部署存在挑战。由于transformer模型的二次复杂性,需要大量资源且推理延迟高。为此,我们引入了物理启发的线性变换器增强架构SAL-T,保持线性注意力。我们的方法基于动量学特征对粒子进行空间感知分区,从而计算具有物理意义区域之间的注意力。此外,我们采用卷积层捕捉局部相关性,受喷注物理启发。除了在喷注分类任务中优于标准linformer外,SAL-T在推理时使用更少的资源且延迟更低,其结果与全注意力transformer相当。在通用点云分类数据集(ModelNet10)上的实验进一步证实了这一趋势。我们的代码可在https://github.com/aaronw5/SAL-T4HEP获得。

英文摘要

Transformers are very effective in capturing both global and local correlations within high-energy particle collisions, but they present deployment challenges in high-data-throughput environments, such as the CERN LHC. The quadratic complexity of transformer models demands substantial resources and increases latency during inference. In order to address these issues, we introduce the Spatially Aware Linear Transformer (SAL-T), a physics-inspired enhancement of the linformer architecture that maintains linear attention. Our method incorporates spatially aware partitioning of particles based on kinematic features, thereby computing attention between regions of physical significance. Additionally, we employ convolutional layers to capture local correlations, informed by insights from jet physics. In addition to outperforming the standard linformer in jet classification tasks, SAL-T also achieves classification results comparable to full-attention transformers, while using considerably fewer resources with lower latency during inference. Experiments on a generic point cloud classification dataset (ModelNet10) further confirm this trend. Our code is available at https://github.com/aaronw5/SAL-T4HEP.

2510.18941 2026-05-19 cs.CL cs.AI cs.LG 版本更新

ProfBench: Multi-Domain Rubrics requiring Professional Knowledge to Answer and Judge

ProfBench:需要专业知识回答和评判的多领域评分标准

Zhilin Wang, Jaehun Jung, Ximing Lu, Shizhe Diao, Ellie Evans, Jiaqi Zeng, Pavlo Molchanov, Yejin Choi, Jan Kautz, Yi Dong

发表机构 * NVIDIA

AI总结 ProfBench通过7000多个由专业领域专家评估的响应-评分对,评估大语言模型在处理专业文档、信息整合和生成综合报告方面的能力,揭示了即使顶级模型在专业任务上也面临挑战。

Comments Published at ICLR 2026, 30 pages

详情
AI中文摘要

评估大语言模型(LLMs)的进步通常受限于验证响应的挑战,限制了评估任务仅限于数学、编程和简短问答。然而,许多现实应用需要评估LLMs在处理专业文档、整合信息和生成综合报告方面的能力。我们介绍了ProfBench:一个包含超过7000个响应-评分对的集合,由具有物理学博士、化学博士、金融MBA和咨询MBA专业知识的人类专家评估。我们构建了稳健且经济的LLM-Judges来评估ProfBench评分标准,通过减轻自我增强偏差并减少评估成本2-3个数量级,使其公平且对更广泛社区可及。我们的发现表明,即使对于最先进的LLM,ProfBench也提出了重大挑战,顶级模型如GPT-5-high仅达到65.9%的整体性能。此外,我们识别了专有模型与开源模型之间显著的性能差异,并提供了关于扩展思考在解决复杂专业领域任务中的作用的见解。数据:https://huggingface.co/datasets/nvidia/ProfBench 和代码:https://github.com/NVlabs/ProfBench 和排行榜:https://huggingface.co/spaces/nvidia/ProfBench

英文摘要

Evaluating progress in large language models (LLMs) is often constrained by the challenge of verifying responses, limiting assessments to tasks like mathematics, programming, and short-form question-answering. However, many real-world applications require evaluating LLMs in processing professional documents, synthesizing information, and generating comprehensive reports in response to user queries. We introduce ProfBench: a set of over 7000 response-criterion pairs as evaluated by human-experts with professional knowledge across Physics PhD, Chemistry PhD, Finance MBA and Consulting MBA. We build robust and affordable LLM-Judges to evaluate ProfBench rubrics, by mitigating self-enhancement bias and reducing the cost of evaluation by 2-3 orders of magnitude, to make it fair and accessible to the broader community. Our findings reveal that ProfBench poses significant challenges even for state-of-the-art LLMs, with top-performing models like GPT-5-high achieving only 65.9% overall performance. Furthermore, we identify notable performance disparities between proprietary and open-weight models and provide insights into the role that extended thinking plays in addressing complex, professional-domain tasks. Data: https://huggingface.co/datasets/nvidia/ProfBench and Code: https://github.com/NVlabs/ProfBench and Leaderboard: https://huggingface.co/spaces/nvidia/ProfBench

2510.16416 2026-05-19 cs.CV cs.AI 版本更新

SSL4RL: Revisiting Self-supervised Learning as Intrinsic Reward for Visual-Language Reasoning

SSL4RL:重新审视自监督学习作为视觉语言推理的内在奖励

Xiaojun Guo, Runyu Zhou, Yifei Wang, Qi Zhang, Chenheng Zhang, Stefanie Jegelka, Xiaohan Wang, Jiajun Chai, Guojun Yin, Wei Lin, Yisen Wang

发表机构 * Peking University(北京大学) MIT(麻省理工学院) TUM(技术大学(TUM)) Meituan(美团) Peking University School of CIT, MCML, MDSIEECS and CSAIL(北京大学计算机学院、MCML、MDSIEECS以及CSAIL)

AI总结 本文提出SSL4RL框架,利用自监督学习任务作为强化学习的可验证奖励,提升视觉语言推理性能,并通过实验验证其在多模态模型对齐中的有效性。

详情
AI中文摘要

视觉语言模型(VLMs)通过整合大语言模型与视觉输入展现出显著能力,但往往无法充分利用视觉证据,要么依赖视觉任务中的语言先验,要么在推理中依赖文本捷径。尽管强化学习(RL)可以对齐模型与期望行为,但其在VLMs中的应用受限于缺乏可扩展且可靠的奖励机制。为克服这一挑战,我们提出了SSL4RL,一种新的框架,利用自监督学习(SSL)任务作为RL微调的可验证奖励源。我们的方法将SSL目标,如预测图像旋转或重建遮蔽块,转化为密集的自动奖励信号,消除了对人类偏好数据或不可靠的AI评估者的需求。实验表明,SSL4RL在视觉中心和视觉语言推理基准上显著提高了性能。此外,通过系统消融分析,我们识别出影响SSL4RL任务有效性的关键因素,如任务难度、模型规模和与目标领域语义的一致性,为未来工作提供了新的设计原则。我们还通过将其应用于图学习,展示了框架的通用性,其中取得了显著收益。SSL4RL建立了一种利用可验证的自监督目标对齐多模态模型的灵活且有效的方法论。

英文摘要

Vision-language models (VLMs) have shown remarkable abilities by integrating large language models with visual inputs. However, they often fail to utilize visual evidence adequately, either depending on linguistic priors in vision-centric tasks or resorting to textual shortcuts during reasoning. Although reinforcement learning (RL) can align models with desired behaviors, its application to VLMs has been hindered by the lack of scalable and reliable reward mechanisms. To overcome this challenge, we propose SSL4RL, a novel framework that leverages self-supervised learning (SSL) tasks as a source of verifiable rewards for RL-based fine-tuning. Our approach reformulates SSL objectives-such as predicting image rotation or reconstructing masked patches-into dense, automatic reward signals, eliminating the need for human preference data or unreliable AI evaluators. Experiments show that SSL4RL substantially improves performance on both vision-centric and vision-language reasoning benchmarks. Furthermore, through systematic ablations, we identify key factors-such as task difficulty, model scale, and semantic alignment with the target domain-that influence the effectiveness of SSL4RL tasks, offering new design principles for future work. We also demonstrate the framework's generality by applying it to graph learning, where it yields significant gains. SSL4RL establishes a versatile and effective paradigm for aligning multimodal models using verifiable, self-supervised objectives.

2510.16079 2026-05-19 cs.CL cs.AI 版本更新

EvolveR: Self-Evolving LLM Agents through an Experience-Driven Lifecycle

EvolveR:通过经验驱动的生命周期实现自进化大语言模型代理

Rong Wu, Xiaoman Wang, Jianbiao Mei, Pinlong Cai, Daocheng Fu, Cheng Yang, Licheng Wen, Xuemeng Yang, Yufan Shen, Yuxin Wang, Botian Shi

发表机构 * Zhejiang University(浙江大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) East China Normal University(华东师范大学) Fudan University(复旦大学) Central South University(中南大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Jiao Tong University(上海交通大学) University of Science and Technology of China(中国科学技术大学)

AI总结 EvolveR通过闭环经验生命周期实现LLM代理的自进化,结合离线自蒸馏和在线交互,利用策略强化机制迭代优化,提升多跳问答任务性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

当前大语言模型代理在工具使用方面表现优异,但缺乏系统性地从自身经验学习的能力。现有框架主要解决外部知识缺口问题,未能解决更根本的限制:无法迭代优化问题解决策略。本文提出EvolveR框架,通过完整的闭环经验生命周期实现代理自改进。该生命周期包含两个关键阶段:(1) 离线自蒸馏,将代理交互轨迹合成结构化的抽象可重用策略原则;(2) 在线交互,代理与任务互动并主动检索蒸馏的原则以指导决策,积累多样化的行为轨迹。此循环利用策略强化机制迭代更新代理。我们在复杂多跳问答基准测试中展示了EvolveR的有效性,其在强代理基线上的表现更优。我们的工作为能够从外部数据和自身行为后果学习的代理提供了全面蓝图,为更自主和持续改进的系统铺平道路。代码可在https://github.com/Edaizi/EvolveR获取。

英文摘要

Current Large Language Model (LLM) agents show strong performance in tool use, but lack the crucial capability to systematically learn from their own experiences. While existing frameworks mainly focus on mitigating external knowledge gaps, they fail to address a more fundamental limitation: the inability to iteratively refine problem-solving strategies. In this work, we introduce EvolveR, a framework designed to enable agent to self-improve through a complete, closed-loop experience lifecycle. This lifecycle comprises two key stages: (1) Offline Self-Distillation, where the agent's interaction trajectories are synthesized into a structured repository of abstract, reusable strategic principles; (2) Online Interaction, where the agent interacts with tasks and actively retrieves distilled principles to guide its decision-making, accumulating a diverse set of behavioral trajectories. This loop employs a policy reinforcement mechanism to iteratively update the agent based on its performance. We demonstrate the effectiveness of EvolveR on complex multi-hop question-answering benchmarks, where it achieves superior performance over strong agentic baselines. Our work presents a comprehensive blueprint for agents that learn not only from external data but also from the consequences of their own actions, paving the way for more autonomous and continuously improving systems. Code is available at https://github.com/Edaizi/EvolveR.

2510.15221 2026-05-19 cs.AI cs.CY cs.LG 版本更新

WELD: The First Naturalistic Long-Period Small-Team Workplace Emotion Dataset for Ubiquitous Affective Computing

WELD:首个自然长期小团队工作场所情感数据集用于无处不在的情感计算

Xiao Sun

发表机构 * AnHui Province Key Laboratory of Affective Computing and Advanced Intelligent Machines, School of Computer Science and Information Engineering, Hefei University of Technology(安徽省情感计算与先进智能机器重点实验室,计算机科学与信息工程学院,合肥工业大学)

AI总结 WELD是首个自然长期小团队工作场所情感数据集,包含733780个每帧七类面部表情概率向量,用于支持长期情感计算研究,验证了三个已知现象并发现四个新结果。

Comments v2: Major revision. 30-month report with full ethics framework, 4-tier access model, variance decomposition, HMM regime discovery, AUC=0.79 vs C-index=0.52 turnover-prediction methodology audit, and Asian-neutral-face FER bias finding. Companion: arXiv:2510.16046. 49 employees, 733,780 records, 17 pages. Submitted to IEEE TAFFC

详情
AI中文摘要

情感计算在实验室环境中迅速成熟,但此前没有数据集同时满足(i)数月到数年的持续时间,(ii)自然工作场所环境,(iii)稳定的小组社交结构,以及(iv)完全被动传感协议并通过机构审查。我们介绍了WELD,首个满足所有四个条件的数据集。WELD包含来自中国软件公司49名员工30.1个月(2021年11月-2024年5月)的733780个每帧七类面部表情概率向量——最长的自然情感语料库和唯一支持多年纵向分析和小组关系分析的数据集。数据以四级访问模型发布,只有聚合概率可公开下载。我们通过复制三个已知现象(+43.1%周末情感提升;13:00低谷日周期;上海2022封锁效应d=-0.40)验证了数据集,并报告了四个新发现:(1)方差分解将每日情感变异性中19.3%归因于人与人差异,29.8%归因于月季节性——对未来预测模型的定量上限;(2)隐藏马尔可夫分解揭示了六个情感状态,具有不对称的负面状态停留时间(16-18天 vs 3天);(3)留一人出离职预测达到AUC=0.79,但Cox一致性指数仅为0.52,暴露了在不考虑生存意识基线时报告AUC的度量陷阱;(4)数据集揭示了基于现成FER模型对中性亚洲面孔预测“愤怒”存在系统性过预测(0.194 vs ~0.05西方先验),使WELD成为FER公平审计的重要资源。对数据集的复杂系统分析作为配套预印本(arXiv:2510.16046)发表。

英文摘要

Affective computing has matured rapidly in laboratory settings, yet no prior dataset combines (i) months-to-years of duration, (ii) a naturalistic workplace context, (iii) a stable small-team social structure, and (iv) a fully passive sensing protocol that survives institutional review. We introduce WELD, the first dataset to satisfy all four. WELD comprises 733,780 per-frame seven-class facial-expression probability vectors from 49 employees of a Chinese software company over 30.1 months (Nov 2021 - May 2024) -- the longest naturalistic in-the-wild emotion corpus and the only multi-year corpus supporting both within-individual longitudinal and within-team relational analyses on the same subjects. Data are released under a four-tier access model with only aggregated probabilities publicly downloadable. We validate the corpus by replicating three established phenomena (+43.1% weekend valence boost; 13:00-trough diurnal cycle; Shanghai 2022 lockdown effect d=-0.40), and report four novel findings: (1) variance decomposition attributes 19.3% of daily-valence variance to between-person differences and 29.8% to month seasonality -- a quantitative ceiling for future predictive models; (2) Hidden Markov decomposition reveals six emotional regimes with asymmetric negative-state dwell times (16-18 d vs 3 d); (3) leave-one-person-out turnover prediction reaches AUC=0.79 yet a Cox concordance index of only 0.52, exposing a metric-trap when AUC is reported without survival-aware baselines; (4) the corpus reveals systematic over-prediction of "angry" by an off-the-shelf FER model on neutral Asian faces (0.194 vs ~0.05 Western priors), making WELD valuable for FER fairness audits. A complex-systems analysis of the corpus appears as a companion preprint (arXiv:2510.16046).

2510.14102 2026-05-19 astro-ph.IM cs.AI cs.LG 版本更新

Extracting latent representations from X-ray spectra. Classification, regression, and accretion signatures of Chandra sources

从X射线光谱中提取潜在表示。Chandra源的分类、回归和吸积特征

Nicolò Oreste Pinciroli Vago, Juan Rafael Martínez-Galarza, Roberta Amato

发表机构 * Department of Electronics, Information Center for Astrophysics Harvard \& Smithsonian, 60 Garden Street, Cambridge, MA 02138, USA

AI总结 本文利用深度学习从Chandra X射线光谱中提取紧凑且物理意义明确的表示,通过分类、回归和可解释性分析验证其有效性,并测量光谱与时间域属性间的互信息,用于未来识别暂现事件。

Comments 21 pages, 17 figures; accepted in A&A

详情
AI中文摘要

光谱特征在大规模X射线巡天时代至关重要。自动机器学习方法在此方面已被证明有效,但迄今为止尚未应用于大规模光谱数据集,如Chandra源目录(CSC)。本工作旨在利用深度学习开发一种紧凑且具有物理意义的Chandra X射线光谱表示。为验证所学表示是否捕捉到相关信息,我们通过分类、回归和可解释性分析进行评估,并测量这些源的光谱与时间域属性间的互信息,以帮助未来识别暂现事件。我们使用基于变换器的自编码器将X射线光谱压缩到一个8维的潜在空间中。天体物理源类型和物理汇总统计信息来自外部目录。我们从光谱重建精度、8种已知天体物理源类的聚类性能以及与硬度比和氢柱密度(N_H)等物理量的相关性来评估所学表示。重建后,潜在空间中的聚类在8种源类上实现了约40%的平衡分类精度,当仅限于类星体和恒星级致密天体时,该精度提高至约69%。此外,潜在特征与光谱和时间属性相关,表明压缩的表示捕捉到了物理相关信息。直接从X射线光谱中学习的特征在捕捉相关物理信息方面与需要额外计算的人工提取特征同样有效。它们可用于大规模巡天中的分类和回归,并且与时间域属性共享互信息。该方法可以适应现有和即将来临的X射线目录。

英文摘要

Spectral signatures are crucial in the era of large X-ray surveys. Automatic machine learning methods have proven useful in this respect, but so far they have not been applied to large spectral datasets, such as the Chandra Source Catalog (CSC). This work aims to develop a compact and physically meaningful representation of Chandra X-ray spectra using deep learning. To verify that the learned representation captures relevant information, we evaluate it through classification, regression, and interpretability analyses, and measure the mutual information between spectral and time-domain properties of these sources, aiding in the future identification of transient events. We use a transformer-based autoencoder to compress X-ray spectra into representations in an 8-dimensional latent space. Astrophysical source types and physical summary statistics are compiled from external catalogs. We evaluate the learned representation in terms of spectral reconstruction accuracy, clustering performance on 8 known astrophysical source classes, and correlation with physical quantities such as hardness ratios and hydrogen column densities ($N_H$). Upon reconstruction, clustering in the latent space yields a balanced classification accuracy of $\sim$40% across the 8 source classes, increasing to $\sim$69% when restricted to AGNs and stellar-mass compact objects exclusively. Moreover, latent features correlate with spectral and temporal properties, suggesting that the compressed representation captures physically relevant information. Features learned directly from X-ray spectra capture relevant physical information as effectively as human-extracted features that require additional computations. They can be used for both classification and regression in large surveys, and also share mutual information with time-domain properties. The method can be adapted to existing and upcoming X-ray catalogs.

2510.12534 2026-05-19 cs.AI 版本更新

ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification

ProtoSiTex: 为多标签文本分类学习半可解释的原型

Utsav Kumar Nareti, Suraj Kumar, Soumya Pandey, Soumi Chattopadhyay, Chandranath Adak, Sankha Subhra Mullick

发表机构 * Dept. of CSE, IIT Patna(印度帕纳布理工大学计算机科学与工程系) Dept. of CSE, IIT Indore(印度印多尔理工大学计算机科学与工程系) Dolby Laboratories(多利贝实验室)

AI总结 ProtoSiTex提出一种半可解释框架,通过双阶段交替训练策略和分层损失函数,实现细粒度多标签文本分类,提升可解释性和对齐性。

详情
AI中文摘要

用户生成文本的快速增长加剧了对能进行细粒度文本分类和解释的可解释模型的需求。现有基于原型的模型提供直观解释,但通常在粗粒度(句子或文档级别)操作,无法解决现实世界文本分类的多标签特性。我们提出ProtoSiTex,一种为细粒度多标签文本分类设计的半可解释框架。ProtoSiTex采用双阶段交替训练策略:一个无监督原型发现阶段,学习语义连贯且多样的原型,和一个监督分类阶段,将这些原型映射到类别标签。分层损失函数强制子句、句子和文档级别的一致性,提升可解释性和对齐性。与以往方法不同,ProtoSiTex使用自适应原型和多头注意力捕捉重叠和冲突的语义。我们还引入了一个标注到子句级别的酒店评论基准数据集。在该数据集和两个公开基准(二元和多类)上的实验表明,ProtoSiTex在达到最先进的性能的同时,提供忠实的人类对齐的解释,确立其作为半可解释多标签文本分类稳健解决方案的地位。

英文摘要

The rapid growth of user-generated text across digital platforms has intensified the need for interpretable models capable of fine-grained text classification and explanation. Existing prototype-based models offer intuitive explanations but typically operate at coarse granularity (sentence or document level) and fail to address the multi-label nature of real-world text classification. We propose ProtoSiTex, a semi-interpretable framework designed for fine-grained multi-label text classification. ProtoSiTex employs a dual-phase alternate training strategy: an unsupervised prototype discovery phase that learns semantically coherent and diverse prototypes, and a supervised classification phase that maps these prototypes to class labels. A hierarchical loss function enforces consistency across subsentence, sentence, and document levels, enhancing interpretability and alignment. Unlike prior approaches, ProtoSiTex captures overlapping and conflicting semantics using adaptive prototypes and multi-head attention. We also introduce a benchmark dataset of hotel reviews annotated at the subsentence level with multiple labels. Experiments on this dataset and two public benchmarks (binary and multi-class) show that ProtoSiTex achieves state-of-the-art performance while delivering faithful, human-aligned explanations, establishing it as a robust solution for semi-interpretable multi-label text classification.

2510.07799 2026-05-19 cs.CL cs.AI 版本更新

Dynamic Generation of Multi-LLM Agents Communication Topologies with Graph Diffusion Models

基于图扩散模型的多LLM代理通信拓扑动态生成

Eric Hanchen Jiang, Mengting Li, Guancheng Wan, Sophia Yin, Yuchen Wu, Xiao Liang, Xinfeng Li, Yizhou Sun, Wei Wang, Kai-Wei Chang, Ying Nian Wu

发表机构 * University of California Los Angeles(加州大学洛杉矶分校) University of Washington(华盛顿大学) Nanyang Technological University(南洋理工大学)

AI总结 本文提出Guided Topology Diffusion框架,通过迭代构建过程生成适应任务需求的高效通信拓扑,优于现有方法。

Comments ACL 2026 Main

详情
AI中文摘要

多代理系统效率依赖于其通信拓扑,但设计最优拓扑具有挑战性。本文引入Guided Topology Diffusion框架,通过条件离散图扩散模型生成任务自适应的通信拓扑,实现实时优化,显著提升LLM代理协作性能。

英文摘要

The efficiency of multi-agent systems driven by large language models (LLMs) largely hinges on their communication topology. However, designing an optimal topology is a non-trivial challenge, as it requires balancing competing objectives such as task performance, communication cost, and robustness. Existing frameworks often rely on static or hand-crafted topologies, which inherently fail to adapt to diverse task requirements, leading to either excessive token consumption for simple problems or performance bottlenecks for complex ones. To address this challenge, we introduce a novel generative framework called \textit{Guided Topology Diffusion (GTD)}. Inspired by conditional discrete graph diffusion models, GTD formulates topology synthesis as an iterative construction process. At each step, the generation is steered by a lightweight proxy model that predicts multi-objective rewards (e.g., accuracy, utility, cost), enabling real-time, gradient-free optimization towards task-adaptive topologies. This iterative, guided synthesis process distinguishes GTD from single-step generative frameworks, enabling it to better navigate complex design trade-offs. We validated GTD across multiple benchmarks, and experiments show that this framework can generate highly task-adaptive, sparse, and efficient communication topologies, significantly outperforming existing methods in LLM agent collaboration.

2510.01782 2026-05-19 cs.CL cs.AI 版本更新

Can LLMs Refuse Questions They Do Not Know? Measuring Knowledge-Aware Refusal in Factual Tasks

大语言模型能否拒绝它们不知道的问题?在事实性任务中测量知识感知的拒绝

Wenbo Pan, Jie Xu, Qiguang Chen, Junhao Dong, Libo Qin, Xinfeng Li, Haining Yu, Xiaohua Jia

发表机构 * Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系) Harbin Institute of Technology(哈尔滨工业大学) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) School of Computer Science and Engineering, Central South University(中南大学计算机科学与工程学院)

AI总结 本文提出Refusal Index(RI)作为衡量大语言模型知识感知拒绝能力的新指标,通过评估拒绝概率与错误概率的相关性,揭示模型在事实性任务中的可靠性问题。

Comments Accepted at ICLR 2026

详情
AI中文摘要

大语言模型(LLMs)应拒绝回答超出其知识范围的问题。这种称为知识感知拒绝的能力对于事实性可靠性至关重要,但现有指标未能捕捉这一能力。本文提出Refusal Index(RI),一种新颖且原理明确的度量标准,用于衡量LLMs拒绝其不知问题的准确性。我们将RI定义为拒绝概率与错误概率之间的Spearman秩相关性。RI可通过轻量级两轮评估方法进行实际测量,仅需在两个标准评估运行中观察到的拒绝率。在16个模型和5个数据集上的广泛实验表明,RI准确量化了模型的知识感知拒绝能力。值得注意的是,RI在不同拒绝率下保持稳定,并提供一致的模型排名,不依赖于模型的整体准确率和拒绝率。这些特性表明RI捕捉了模型知识校准的稳定、内在方面。更重要的是,RI提供了关于LLM事实性的重要但此前被忽视方面的见解:尽管LLM在事实性任务上实现高准确率,但其拒绝行为可能不可靠且脆弱。

英文摘要

Large Language Models (LLMs) should refuse to answer questions beyond their knowledge. This capability, which we term knowledge-aware refusal, is crucial for factual reliability, while existing metrics fail to capture this ability. In this work, we propose the Refusal Index (RI), a novel and principled metric that measures how accurately LLMs refuse questions they do not know. We define RI as Spearman's rank correlation between refusal probability and error probability. RI is practically measurable with a lightweight two-pass evaluation method which only require observed refusal rates across two standard evaluation runs. Extensive experiments across 16 models and 5 datasets demonstrate that RI accurately quantifies a model's knowledge-aware refusal capability. Notably, RI remains stable across different refusal rates and provides consistent model rankings independent of a model's overall accuracy and refusal rates. These properties suggest RI captures a stable, intrinsic aspect of model knowledge calibration. More importantly, RI provides insight into an important but previously overlooked aspect of LLM factuality: while LLMs achieve high accuracy on factual tasks, their refusal behavior can be unreliable and fragile.

2509.26037 2026-05-19 cs.AI cs.CV cs.LG 版本更新

CoLLM-NAS: Collaborative Large Language Models for Efficient Knowledge-Guided Neural Architecture Search

CoLLM-NAS:协作大型语言模型用于高效知识引导的神经架构搜索

Zhe Li, Zhiwei Lin, Yongtao Wang

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学计算机科学技术研究院)

AI总结 本文提出CoLLM-NAS,一种基于协作大型语言模型的两阶段神经架构搜索框架,通过导航和生成两个LLM及协调模块,有效指导搜索过程,提升效率并取得新状态最优结果。

Comments Accepted as Oral at CVPR 2026 Workshop on Neural Architecture Search (NAS)

详情
AI中文摘要

将大型语言模型(LLMs)与神经架构搜索(NAS)结合,为自动设计神经架构提供了新可能。然而,现有方法面临架构无效、计算低效和性能劣于传统NAS的限制。本文提出协作LLM-based NAS(CoLLM-NAS),一种两阶段NAS框架,通过两个互补的LLM驱动知识引导搜索。具体而言,提出具有状态的导航LLM指导搜索方向,无状态的生成LLM合成高质量候选,以及协调模块协调LLM间通信并管理评估过程。CoLLM-NAS通过结合LLM对结构神经架构的内在知识与迭代反馈和历史轨迹的逐步知识,高效指导搜索过程。在ImageNet和NAS-Bench-201上的实验结果表明,CoLLM-NAS超越现有NAS方法和传统搜索算法,取得新状态最优结果,同时显著降低搜索成本4-10倍。此外,CoLLM-NAS在多种搜索空间(如MobileNet、ShuffleNet和AutoFormer)中一致提升各种两阶段NAS方法(如OFA、SPOS和AutoFormer)的性能和效率,展示其优秀的泛化能力。

英文摘要

The integration of Large Language Models (LLMs) with Neural Architecture Search (NAS) has introduced new possibilities for automating the design of neural architectures. However, most existing methods face critical limitations, including architectural invalidity, computational inefficiency, and inferior performance compared to traditional NAS. In this work, we present Collaborative LLM-based NAS (CoLLM-NAS), a two-stage NAS framework with knowledge-guided search driven by two complementary LLMs. Specifically, we propose a stateful Navigator LLM to guide search direction, a stateless Generator LLM to synthesize high-quality candidates, and a Coordinator module to orchestrate inter-LLM communication and manage evaluation processes. CoLLM-NAS efficiently guides the search process by combining LLMs' inherent knowledge of structured neural architectures with progressive knowledge from iterative feedback and historical trajectory. Experimental results on ImageNet and NAS-Bench-201 show that CoLLM-NAS surpasses existing NAS methods and conventional search algorithms, achieving new state-of-the-art results while significantly reducing search costs by 4--10. Furthermore, CoLLM-NAS consistently enhances the performance and efficiency of various two-stage NAS methods (e.g., OFA, SPOS, and AutoFormer) across diverse search spaces (e.g., MobileNet, ShuffleNet, and AutoFormer), demonstrating its excellent generalization.

2509.23986 2026-05-19 cs.AI 版本更新

TusoAI: Agentic Optimization for Scientific Methods

TusoAI: 科学方法的代理优化

Alistair Turcan, Kexin Huang, Lei Li, Martin Jinye Zhang

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Stanford University(斯坦福大学) Phylo

AI总结 TusoAI通过整合领域知识和迭代优化,提升科学任务中计算方法的性能,优于现有专家方法和科学AI代理,在单细胞RNA测序数据去噪和卫星地球监测等任务中表现突出。

详情
AI中文摘要

科学发现常因手动开发分析复杂实验数据的计算工具而受阻。构建此类工具成本高且耗时,因为科学家需反复查阅文献、测试模型假设并将其转化为高效软件。大型语言模型(LLMs)在合成文献、处理实证数据和生成领域特定代码方面表现出色,为加速计算方法开发提供了新机遇。现有LLM系统或专注于使用现有计算方法进行科学分析,或专注于为通用机器学习开发计算方法,但未能有效整合科学领域中常无结构化的知识。本文介绍TusoAI,一种代理AI系统,通过科学任务描述和评估函数,自主开发和优化计算方法。TusoAI将领域知识整合到知识树表示中,进行迭代的领域特定优化和模型诊断,提高候选解决方案池的性能。我们进行了全面的基准评估,证明TusoAI在单细胞RNA测序数据去噪和卫星地球监测等多样化任务中优于最先进的专家方法、MLE代理和科学AI代理。将TusoAI应用于遗传学两个关键开放问题,改进了现有计算方法并发现了新生物学,包括9种新的自身免疫疾病与T细胞亚型之间的关联以及7种此前未报告的疾病变异与目标基因的关联。我们的代码在https://github.com/Alistair-Turcan/TusoAI上公开可用。

英文摘要

Scientific discovery is often slowed by the manual development of computational tools needed to analyze complex experimental data. Building such tools is costly and time-consuming because scientists must iteratively review literature, test modeling and scientific assumptions against empirical data, and implement these insights into efficient software. Large language models (LLMs) have demonstrated strong capabilities in synthesizing literature, reasoning with empirical data, and generating domain-specific code, offering new opportunities to accelerate computational method development. Existing LLM-based systems either focus on performing scientific analyses using existing computational methods or on developing computational methods or models for general machine learning without effectively integrating the often unstructured knowledge specific to scientific domains. Here, we introduce TusoAI , an agentic AI system that takes a scientific task description with an evaluation function and autonomously develops and optimizes computational methods for the application. TusoAI integrates domain knowledge into a knowledge tree representation and performs iterative, domain-specific optimization and model diagnosis, improving performance over a pool of candidate solutions. We conducted comprehensive benchmark evaluations demonstrating that TusoAI outperforms state-of-the-art expert methods, MLE agents, and scientific AI agents across diverse tasks, such as single-cell RNA-seq data denoising and satellite-based earth monitoring. Applying TusoAI to two key open problems in genetics improved existing computational methods and uncovered novel biology, including 9 new associations between autoimmune diseases and T cell subtypes and 7 previously unreported links between disease variants linked to their target genes. Our code is publicly available at https://github.com/Alistair-Turcan/TusoAI.

2509.21319 2026-05-19 cs.CL cs.AI cs.LG 版本更新

RLBFF: Binary Flexible Feedback to bridge between Human Feedback & Verifiable Rewards

RLBFF:二进制灵活反馈用于连接人类反馈与可验证奖励

Zhilin Wang, Jiaqi Zeng, Olivier Delalleau, Ellie Evans, Daniel Egert, Hoo-Chang Shin, Felipe Soares, Yi Dong, Oleksii Kuchaiev

发表机构 * NVIDIA

AI总结 RLBFF结合人类偏好与规则验证,提升奖励模型对响应质量的精准捕捉,优于Bradley-Terry模型,在RM-Bench和JudgeBench上取得优异成绩,且支持用户自定义反馈原则。

Comments Published at ICLR 2026, 21 pages

详情
AI中文摘要

Reinforcement Learning with Human Feedback (RLHF) 和 Reinforcement Learning with Verifiable Rewards (RLVR) 是LLM后训练的主要RL范式,各有优势。然而,RLHF在可解释性和奖励黑客问题上存在困难,因为它依赖于通常缺乏明确标准的人类判断,而RLVR则受限于其对正确性基于验证器的专注。我们提出Reinforcement Learning with Binary Flexible Feedback (RLBFF),结合人类驱动的偏好灵活性与规则基础验证的精确性,使奖励模型能够捕捉响应质量的细微方面,超越单纯的正确性。RLBFF从自然语言反馈中提取可以二进制回答的原则(例如信息准确性:是,或代码可读性:否)。这些原则随后可用于将奖励模型训练作为蕴含任务(响应满足或不满足任意原则)。我们展示奖励模型以这种方式训练可以优于匹配数据的Bradley-Terry模型,在RM-Bench(86.2%)和JudgeBench(81.4%,2025年9月24日排行榜第一)上取得最佳成绩。此外,用户可以在推理时指定感兴趣的原理以自定义我们的奖励模型,与Bradley-Terry模型不同。最后,我们提供了一个完全开源的食谱(包括数据)来对Qwen3-32B进行对齐,以匹配或超过o3-mini和DeepSeek R1在MT-Bench、WildBench和Arena Hard v2的一般对齐基准上的性能(在<5%的推理成本下)。模型:https://huggingface.co/collections/nvidia/reward-models-10-2025

英文摘要

Reinforcement Learning with Human Feedback (RLHF) and Reinforcement Learning with Verifiable Rewards (RLVR) are the main RL paradigms used in LLM post-training, each offering distinct advantages. However, RLHF struggles with interpretability and reward hacking because it relies on human judgments that usually lack explicit criteria, whereas RLVR is limited in scope by its focus on correctness-based verifiers. We propose Reinforcement Learning with Binary Flexible Feedback (RLBFF), which combines the versatility of human-driven preferences with the precision of rule-based verification, enabling reward models to capture nuanced aspects of response quality beyond mere correctness. RLBFF extracts principles that can be answered in a binary fashion (e.g. accuracy of information: yes, or code readability: no) from natural language feedback. Such principles can then be used to ground Reward Model training as an entailment task (response satisfies or does not satisfy an arbitrary principle). We show that Reward Models trained in this manner can outperform Bradley-Terry models when matched for data and achieve top performance on RM-Bench (86.2%) and JudgeBench (81.4%, #1 on leaderboard as of September 24, 2025). Additionally, users can specify principles of interest at inference time to customize the focus of our reward models, in contrast to Bradley-Terry models. Finally, we present a fully open source recipe (including data) to align Qwen3-32B using RLBFF and our Reward Model, to match or exceed the performance of o3-mini and DeepSeek R1 on general alignment benchmarks of MT-Bench, WildBench, and Arena Hard v2 (at <5% of the inference cost). Models: https://huggingface.co/collections/nvidia/reward-models-10-2025

2509.19590 2026-05-19 cs.AI cs.CY cs.LG 版本更新

Position: AI Evaluations Should be Grounded on a Theory of Capability

位置:AI评估应基于能力理论

Nathanael Jo, Ashia Wilson

发表机构 * MIT EECS, Cambridge, USA(麻省理工学院电子工程与计算机科学系,剑桥,美国)

AI总结 本文提出AI评估应基于明确的能力理论,通过实验证明评估结果受建模假设影响显著,提出Evaluation Card促进透明化评估实践。

Comments ICML 2026 Position Paper Track

详情
AI中文摘要

生成模型的评估如今普遍存在,其结果深刻影响公众和科学界对AI能力的看法。然而,对其可靠性的怀疑持续增长。如何确定报告的准确率真实反映模型的底层性能?尽管基准结果常被视为能力的直接测量,但实际上它们是推断:将分数视为能力证据已预设了能力定义的理论。我们主张AI评估应作为基于明确能力理论的推断任务。虽然这一观点在心理学测量学等学科中是标准做法,但在AI评估中仍不完善,核心假设常被隐含。作为概念验证,我们实证显示报告性能可能强烈依赖评估者的建模假设,凸显透明、理论驱动的评估实践的必要性。最后,我们提出Evaluation Card帮助研究人员记录、论证和审查AI评估背后的建模决策。

英文摘要

Evaluations of generative models are now ubiquitous, and their outcomes critically shape public and scientific expectations of AI's capabilities. Yet skepticism about their reliability continues to grow. How can we know that a reported accuracy genuinely reflects a model's underlying performance? Although benchmark results are often presented as direct measurements of capability, in practice they are inferences: treating a score as evidence of capability already presupposes a theory of what it means to be capable at a task. We argue that AI evaluations should instead be framed as inference tasks grounded on an explicit theory of capability. While this perspective is standard in fields like psychometrics, it remains underdeveloped in AI evaluation, where core assumptions are often left implicit. As a proof-of-concept, we empirically show that reported performance can depend strongly on the evaluator's modeling assumptions, underscoring the need for transparent, theory-driven evaluation practices. We conclude by offering an Evaluation Card to help researchers document, justify, and scrutinize the modeling decisions underlying AI evaluations.

2509.13270 2026-05-19 cs.CV cs.AI 版本更新

RadGame: An AI-Powered Platform for Radiology Education

RadGame:一种基于人工智能的放射学教育平台

Mohammed Baharoon, Siavash Raissi, John S. Jun, Thibault Heintz, Mahmoud Alabbad, Ali Alburkani, Sung Eun Kim, Kent Kleinschmidt, Abdulrahman O. Alhumaydhi, Mohannad Mohammed G. Alghamdi, Jeremy Francis Palacio, Mohammed Bukhaytan, Noah Michael Prudlo, Rithvik Akula, Brady Chrisler, Benjamin Galligos, Mohammed O. Almutairi, Mazeen Mohammed Alanazi, Nasser M. Alrashdi, Joel Jihwan Hwang, Sri Sai Dinesh Jaliparthi, Luke David Nelson, Nathaniel Nguyen, Sathvik Suryadevara, Steven Kim, Mohammed F. Mohammed, Yevgeniy R. Semenov, Kun-Hsing Yu, Abdulrhman Aljouie, Hassan AlOmaish, Adam Rodman, Pranav Rajpurkar

发表机构 * Harvard Medical School(哈佛医学院) Mass General Brigham(麻省总医院) Maastricht University(马斯特里赫特大学) Department of Medical Imaging, King Abdulaziz Medical City, Ministry of National Guard, Riyadh, Saudi Arabia(国王阿卜杜勒-阿齐兹医疗城医学影像科,沙特阿拉伯) National Strategic Technology Research Institute, Seoul National University Hospital(全国战略技术研究所,首尔国立大学医院) Saint Louis University School of Medicine(圣路易斯大学医学院) College of Medicine, King Saud bin Abdulaziz University for Health Sciences(国王萨勒曼·本·阿卜杜勒阿齐兹大学医学院) Tufts University School of Medicine(塔夫茨大学医学院) Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系)

AI总结 RadGame通过结合游戏化与大规模公开数据集,提供AI驱动的反馈,提升放射学教育中的定位和报告撰写能力,显著提高学习效果。

Comments ML4H Version

详情
AI中文摘要

我们介绍了RadGame,一种基于人工智能的游戏化平台,用于放射学教育,旨在提升局部定位和报告生成两项核心技能。传统放射学培训基于被动接触病例或实时指导,限制了即时和可扩展的反馈机会。RadGame通过结合游戏化、大规模公开数据集和自动化AI反馈,为人类学习者提供清晰的结构化指导。在RadGame Localize中,玩家绘制边界框以定位异常,自动与放射科医生绘制的标注比较,并通过视觉语言模型生成用户遗漏的解释。在RadGame Report中,玩家根据胸片、年龄和指征撰写发现,接收基于放射学报告生成指标的结构化AI反馈,突出与放射科医生书面真实报告的错误和遗漏,最终生成性能和风格评分。在前瞻性评估中,使用RadGame的参与者在定位准确性上比传统被动方法提高了68%,在报告撰写准确性上比传统方法提高了31%。RadGame展示了AI驱动游戏化在提供可扩展、反馈丰富的放射学培训中的潜力,并重新定义了医疗AI资源在教育中的应用。

英文摘要

We introduce RadGame, an AI-powered gamified platform for radiology education that targets two core skills: localizing findings and generating reports. Traditional radiology training is based on passive exposure to cases or active practice with real-time input from supervising radiologists, limiting opportunities for immediate and scalable feedback. RadGame addresses this gap by combining gamification with large-scale public datasets and automated, AI-driven feedback that provides clear, structured guidance to human learners. In RadGame Localize, players draw bounding boxes around abnormalities, which are automatically compared to radiologist-drawn annotations from public datasets, and visual explanations are generated by vision-language models for user missed findings. In RadGame Report, players compose findings given a chest X-ray, patient age and indication, and receive structured AI feedback based on radiology report generation metrics, highlighting errors and omissions compared to a radiologist's written ground truth report from public datasets, producing a final performance and style score. In a prospective evaluation, participants using RadGame achieved a 68% improvement in localization accuracy compared to 17% with traditional passive methods and a 31% improvement in report-writing accuracy compared to 4% with traditional methods after seeing the same cases. RadGame highlights the potential of AI-driven gamification to deliver scalable, feedback-rich radiology training and reimagines the application of medical AI resources in education.

2509.04471 2026-05-19 cs.CL cs.AI 版本更新

MOSAIC: A Multilingual, Taxonomy-Agnostic, and Computationally Efficient Approach for Radiological Report Classification

MOSAIC:一种多语言、无类别依赖且计算高效的放射报告分类方法

Alice Schiavone, Marco Fraccaro, Lea Marie Pehrson, Silvia Ingala, Rasmus Bonnevie, Michael Bachmann Nielsen, Vincent Beliveau, Melanie Ganz, Desmond Elliott

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系) Neurobiology Research Unit, Copenhagen University Hospital(哥本哈根大学医院神经生物学研究单位) Unumed Aps(Unumed公司) Department of Diagnostic Radiology, Copenhagen University Hospital(哥本哈根大学医院诊断放射学系) Department of Clinical Medicine, University of Copenhagen(哥本哈根大学临床医学系) Cerebriu A/S(Cerebriu公司) Institute for Human Genetics, Medical University of Innsbruck(因斯布鲁克医学大学人类遗传学研究所)

AI总结 MOSAIC通过紧凑开放模型实现多语言、无类别依赖的放射报告分类,无需大量标注数据,且在多种影像模态和标签体系上表现优异,达到专家水平性能。

Comments 8 pages, 14 pages including references and appendix. 9 figures. Preprint

Journal ref Proceedings of the ClinicalNLP Workshop at LREC 2026

详情
AI中文摘要

放射学报告包含丰富的临床信息,可用于训练影像模型而无需依赖昂贵的手动标注。然而,现有方法面临关键限制:基于规则的方法难以处理语言多样性,监督模型需要大量标注数据集,而近期基于LLM的方法依赖封闭源或资源密集型模型,不适合临床使用。此外,当前解决方案大多局限于英语和单模态、单类别数据集。我们介绍了MOSAIC,一种多语言、无类别依赖且计算高效的放射报告分类方法。基于紧凑的开放访问语言模型(MedGemma-4B),MOSAIC支持零/少样本提示和轻量级微调,可在消费级GPU上部署。我们在英语、西班牙语、法语和丹麦语的七个数据集上评估MOSAIC,涵盖多种影像模态和标签体系。该模型在五个胸部X光数据集上达到平均宏F1分数88,接近或超过专家水平性能,同时仅需24GB GPU内存。通过数据增强,仅需80个标注样本即可在丹麦报告上达到加权F1分数82,相比完整1600样本训练集的86分。MOSAIC为临床环境中大型或专有LLM提供了实用替代方案。代码和模型是开源的。我们邀请社区在新语言、类别和模态上评估和扩展MOSAIC。

英文摘要

Radiology reports contain rich clinical information that can be used to train imaging models without relying on costly manual annotation. However, existing approaches face critical limitations: rule-based methods struggle with linguistic variability, supervised models require large annotated datasets, and recent LLM-based systems depend on closed-source or resource-intensive models that are unsuitable for clinical use. Moreover, current solutions are largely restricted to English and single-modality, single-taxonomy datasets. We introduce MOSAIC, a multilingual, taxonomy-agnostic, and computationally efficient approach for radiological report classification. Built on a compact open-access language model (MedGemma-4B), MOSAIC supports both zero-/few-shot prompting and lightweight fine-tuning, enabling deployment on consumer-grade GPUs. We evaluate MOSAIC across seven datasets in English, Spanish, French, and Danish, spanning multiple imaging modalities and label taxonomies. The model achieves a mean macro F1 score of 88 across five chest X-ray datasets, approaching or exceeding expert-level performance, while requiring only 24 GB of GPU memory. With data augmentation, as few as 80 annotated samples are sufficient to reach a weighted F1 score of 82 on Danish reports, compared to 86 with the full 1600-sample training set. MOSAIC offers a practical alternative to large or proprietary LLMs in clinical settings. Code and models are open-source. We invite the community to evaluate and extend MOSAIC on new languages, taxonomies, and modalities.

2509.03403 2026-05-19 cs.LG cs.AI 版本更新

Beyond Correctness: Harmonizing Process and Outcome Rewards through RL Training

超越正确性:通过RL训练和谐过程与结果奖励

Chenlu Ye, Zhou Yu, Ziji Zhang, Hao Chen, Narayanan Sadagopan, Jing Huang, Tong Zhang, Anurag Beniwal

发表机构 * Amazon(亚马逊公司) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出PROF方法,通过过程一致性过滤提升推理质量和最终答案准确性,减少对强PRM的依赖。

详情
AI中文摘要

可验证奖励的强化学习(RLVR)提升了推理任务的最终答案准确性,但未能可靠提升推理质量。由于结果奖励仅评估最终答案,它也会奖励虚假成功:错误推理仍可能因偶然得到正确结果而获得最大奖励。这种结果奖励黑客行为会创建有偏的梯度,使当前RLVR不足以学习忠实的推理。过程奖励模型(PRMs)提供逐步监督,但直接优化PRMs或简单地将它们与结果奖励结合在RL训练过程中分布偏移时不稳定。我们引入了过程一致性过滤(PROF),一种数据整理方法,利用PRM-ORM一致性进行样本选择,而不是直接奖励优化。PROF保留具有强过程支持的正确响应和具有弱过程支持的错误响应,同时保持训练比例的平衡。实验表明,PROF在强基线之上一致地提高了最终答案准确性和中间推理质量,对强PRMs的依赖较少。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves final-answer accuracy on reasoning tasks, but it does not reliably improve reasoning quality. Because outcome rewards only assess final answers, they also reward spurious successes: flawed reasoning can still receive maximal reward when it accidentally reaches the correct outcome. This outcome reward hacking creates biased gradients, making current RLVR insufficient for learning faithful reasoning. Process Reward Models (PRMs) provide step-wise supervision, but directly optimizing PRMs or naively combining them with outcome rewards is unstable under distribution shift during RL training process. We introduce PRocess cOnsistency Filter (PROF), a data curation method that uses PRM--ORM consistency for sample selection rather than direct reward optimization. PROF keeps correct responses with strong process support and incorrect responses with weak process support while maintaining a balanced training ratio. Experiments show that PROF consistently improves both final-answer accuracy and intermediate reasoning quality over strong baselines, with less dependence on strong PRMs.

2508.16438 2026-05-19 cs.IR cs.AI 版本更新

OPERA: A Reinforcement Learning--Enhanced Orchestrated Planner-Executor Architecture for Reasoning-Oriented Multi-Hop Retrieval

OPERA: 一种增强强化学习的协调规划-执行架构用于面向推理的多跳检索

Yu Liu, Yanbing Liu, Fangfang Yuan, Cong Cao, Youbang Sun, Kun Peng, Weizhuo Chen, Jianjun Li, Zhiyuan Ma

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) School of Computer Science and Technology, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) Department of Electronic Engineering, Tsinghua University(清华大学电子工程系)

AI总结 OPERA通过协调规划-执行架构解决多跳检索中推理规划、检索和过滤的不足,采用MAPGRPO方法提升复杂任务性能。

Comments Accepted by AAAI 2026. Extended version

详情
AI中文摘要

近期大规模语言模型和密集检索器的进步推动了检索增强生成(RAG)的发展。然而,现有方法在复杂推理导向的多跳检索任务中面临三大挑战:1)无效的推理导向规划:现有方法难以生成稳健的多步骤计划,规则基分解器在非模板问题上表现不佳。2)次优的推理驱动检索:相关方法采用有限的查询改写,导致迭代检索循环难以定位黄金文档。3)不足的推理引导过滤:现有方法缺乏细粒度推理来有效过滤噪声结果中的显著信息,阻碍了检索知识的利用。根本上,这些限制都源于当前RAG架构中检索与推理耦合薄弱。我们引入协调规划-执行推理架构(OPERA),一种新的推理驱动检索框架。OPERA的目标规划模块(GPM)将问题分解为子目标,由具有专用组件的推理-执行模块(REM)执行,以实现精确推理和有效检索。为训练OPERA,我们提出多智能体渐进组相对策略优化(MAPGRPO),一种GRPO的新变体。在复杂多跳基准测试中,OPERA的优越性能验证了MAPGRPO方法和OPERA设计的有效性。

英文摘要

Recent advances in large language models (LLMs) and dense retrievers have driven significant progress in retrieval-augmented generation (RAG). However, existing approaches face significant challenges in complex reasoning-oriented multi-hop retrieval tasks: 1) Ineffective reasoning-oriented planning: Prior methods struggle to generate robust multi-step plans for complex queries, as rule-based decomposers perform poorly on out-of-template questions. 2) Suboptimal reasoning-driven retrieval: Related methods employ limited query reformulation, leading to iterative retrieval loops that often fail to locate golden documents. 3) Insufficient reasoning-guided filtering: Prevailing methods lack the fine-grained reasoning to effectively filter salient information from noisy results, hindering utilization of retrieved knowledge. Fundamentally, these limitations all stem from the weak coupling between retrieval and reasoning in current RAG architectures. We introduce the Orchestrated Planner-Executor Reasoning Architecture (OPERA), a novel reasoning-driven retrieval framework. OPERA's Goal Planning Module (GPM) decomposes questions into sub-goals, which are executed by a Reason-Execute Module (REM) with specialized components for precise reasoning and effective retrieval. To train OPERA, we propose Multi-Agents Progressive Group Relative Policy Optimization (MAPGRPO), a novel variant of GRPO. Experiments on complex multi-hop benchmarks show OPERA's superior performance, validating both the MAPGRPO method and OPERA's design.

2508.08501 2026-05-19 cs.AI 版本更新

GVGAI-LLM: Evaluating Large Language Model Agents with Infinite Games

GVGAI-LLM:通过无限游戏评估大语言模型代理

Yuchen Li, Cong Lin, Muhammad Umair Nasir, Philip Bontrager, Jialin Liu, Julian Togelius

发表机构 * New York University(纽约大学) University of the Witwatersrand(沃尔特·斯通大学) Meta Lingnan University(岭南大学)

AI总结 GVGAI-LLM通过 arcade 式游戏测试大语言模型的推理与问题解决能力,定义了可解释的评估指标,揭示了模型在空间推理和基本规划中的局限性。

详情
AI中文摘要

我们介绍了 GVGAI-LLM,一个视频游戏基准,用于评估大语言模型(LLMs)的推理和问题解决能力。该基准基于 General Video Game AI 框架,包含多样化的 arcade 式游戏,用于测试模型处理与现有 LLM 基准不同的任务能力。该基准利用视频游戏描述语言,可快速创建新游戏(包括规则和关卡),以防止过拟合。每个游戏场景由紧凑的 ASCII 字符集表示,允许语言模型高效处理。GVGAI-LLM 定义了可解释的指标,包括有意义的步比、步效率和总分,以评估模型行为。通过在 118 个具有不同挑战和技能深度的游戏上进行零样本评估,我们揭示了 LLMs 在空间推理和基本规划中的持续局限性。当前模型在空间和逻辑上持续出现错误,推动了结构化提示和空间接地技术的发展。尽管这些干预措施带来了部分改进,但该基准仍远未解决。GVGAI-LLM 为推进语言模型能力研究提供了可重复的测试平台,尤其强调代理行为和空间推理。此外,其生成无限基准的能力(手动和程序化)提供了一种可扩展的长期评估框架。

英文摘要

We introduce GVGAI-LLM, a video game benchmark for evaluating the reasoning and problem-solving capabilities of large language models (LLMs). Built on the General Video Game AI framework, it features a diverse collection of arcade-style games designed to test a model's ability to handle tasks that differ from most existing LLM benchmarks. The benchmark leverages a video game description language that enables the rapid creation of new games (including rules and levels), helping to prevent overfitting over time. Each game scene is represented by a compact set of ASCII characters, allowing for efficient processing by language models. GVGAI-LLM defines interpretable metrics, including meaningful step ratio, step efficiency, and overall score, to assess model behavior. Through zero-shot evaluations across 118 games with diverse challenges and skill depth, we reveal persistent limitations of LLMs in spatial reasoning and basic planning. Current models consistently exhibit spatial and logical errors, motivating structured prompting and spatial grounding techniques. Although these interventions lead to partial improvements, the benchmark remains very far from being solved. GVGAI-LLM serves as a reproducible testbed for advancing research on language model capabilities, with a particular emphasis on agentic behavior and spatial reasoning. Furthermore, its ability to generate infinite benchmarks, both manually and procedurally, provides a scalable framework for longitudinal evaluation.

2508.06799 2026-05-19 cs.ET cs.AI 版本更新

LSDTs: LLM-Augmented Semantic Digital Twins for Adaptive Knowledge-Intensive Infrastructure Planning

LSDTs: 基于大语言模型的语义数字孪生用于自适应知识密集型基础设施规划

Naiyi Li, Zihui Ma, Runlong Yu, Lingyao Li

发表机构 * Department of Civil & Environmental Engineering, University of Maryland, College Park(大学公园马里兰大学土木与环境工程系) Center for Urban Science and Progress, New York University(纽约大学城市科学与进步中心) Department of Computer Science, University of Alabama(阿拉巴马大学计算机科学系) School of Information, University of South Florida(佛罗里达州立大学信息学院)

AI总结 本文提出LSDTs框架,利用大语言模型从非结构化文档中提取规划知识并构建形式本体,通过语义层提升数字孪生在复杂规划场景中的适应性与仿真精度。

详情
AI中文摘要

数字孪生(DTs)为管理复杂基础设施系统提供了强大工具,但其效果常受限于整合非结构化知识的挑战。近年来,大语言模型(LLMs)的进步为解决这一差距提供了新潜力,具备提取和组织多样化文本信息的能力。因此,我们提出了LSDTs(LLM增强的语义数字孪生),一种框架,帮助LLMs从环境法规和技术指南等非结构化文档中提取规划知识,并将其组织成形式本体。该本体形成一个语义层,为数字孪生(虚拟物理系统的模型)提供支持,使其能够模拟真实、法规意识的规划场景。我们通过马里兰海上风电场规划的案例研究评估LSDTs,包括飓风桑迪期间的应用。结果表明,LSDTs支持可解释、法规意识的布局优化,实现高保真的仿真,并增强基础设施规划的适应性。这项工作展示了将生成式AI与数字孪生结合在支持复杂、知识驱动规划任务中的潜力。

英文摘要

Digital Twins (DTs) offer powerful tools for managing complex infrastructure systems, but their effectiveness is often limited by challenges in integrating unstructured knowledge. Recent advances in Large Language Models (LLMs) bring new potential to address this gap, with strong abilities in extracting and organizing diverse textual information. We therefore propose LSDTs (LLM-Augmented Semantic Digital Twins), a framework that helps LLMs extract planning knowledge from unstructured documents like environmental regulations and technical guidelines, and organize it into a formal ontology. This ontology forms a semantic layer that powers a digital twin-a virtual model of the physical system-allowing it to simulate realistic, regulation-aware planning scenarios. We evaluate LSDTs through a case study of offshore wind farm planning in Maryland, including its application during Hurricane Sandy. Results demonstrate that LSDTs support interpretable, regulation-aware layout optimization, enable high-fidelity simulation, and enhance adaptability in infrastructure planning. This work shows the potential of combining generative AI with digital twins to support complex, knowledge-driven planning tasks.

2508.04149 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Difficulty-Based Preference Data Selection by DPO Implicit Reward Gap

基于难度的偏好数据选择:通过DPO隐式奖励差距

Xuan Qi, Rongwu Xu, Zhijing Jin

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学计算机科学与工程保罗·G·艾伦学校) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所) Jinesis Lab, University of Toronto & Vector Institute(多伦多大学Jinesis实验室及向量研究所)

AI总结 本文提出基于难度的偏好数据选择方法,利用DPO隐式奖励机制选择奖励差距小的样本,提升数据效率和模型对齐性能,在多个数据集和对齐任务中优于五个基线方法。

Comments Our code and data are available at https://github.com/Difficulty-Based-Preference-Data-Select/Difficulty-Based-Preference-Data-Select

详情
AI中文摘要

对齐大语言模型(LLMs)与人类偏好是AI研究中的关键挑战。尽管强化学习从人类反馈(RLHF)和直接偏好优化(DPO)等方法被广泛使用,但它们通常依赖于大规模、成本高的偏好数据集。本文缺少针对偏好数据的高质量数据选择方法。在本文中,我们引入了一种基于难度的偏好数据选择策略,该策略基于DPO隐式奖励机制。通过选择奖励差距较小的偏好数据示例,这些示例代表更具挑战性的案例,从而提高数据效率和模型对齐。我们的方法在多个数据集和对齐任务中一致优于五个强大的基线方法,仅使用原始数据的10%即可实现优越性能。这种原理上高效的选择方法为在有限资源下扩展LLM对齐提供了有前景的解决方案。

英文摘要

Aligning large language models (LLMs) with human preferences is a critical challenge in AI research. While methods like Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are widely used, they often rely on large, costly preference datasets. The current work lacks methods for high-quality data selection specifically for preference data. In this work, we introduce a novel difficulty-based data selection strategy for preference datasets, grounded in the DPO implicit reward mechanism. By selecting preference data examples with smaller DPO implicit reward gaps, which are indicative of more challenging cases, we improve data efficiency and model alignment. Our approach consistently outperforms five strong baselines across multiple datasets and alignment tasks, achieving superior performance with only 10\% of the original data. This principled, efficient selection method offers a promising solution for scaling LLM alignment with limited resources.

2508.03018 2026-05-19 cs.AI cs.RO 版本更新

Beyond Policy Optimization: A Data Curation Flywheel for Sparse-Reward Long-Horizon Planning

超越策略优化:一种数据整理飞轮用于稀疏奖励长周期规划

Yutong Wang, Pengliang Ji, Kaixin Li, Baolong Bi, Tao Feng, Guillaume Sartoretti

发表机构 * Department of Mechanical Engineering, National University of Singapore(新加坡国立大学机械工程系) Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) School of Computing, National University of Singapore(新加坡国立大学计算机科学学院) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 本文提出BPO框架,通过自改进的数据飞轮开发鲁棒推理模型,解决多轮代理规划中稀疏奖励长周期问题,实现高效推理和显著的token效率。

详情
AI中文摘要

大型语言推理模型在静态任务中表现出色,但在交互环境中多轮代理规划面临两大挑战:信用分配问题使传统强化学习在稀疏奖励设置中无效,以及详尽的逐步推理历史计算开销过大。为此,我们提出BPO框架,包含三个阶段(自举、外推和精炼),通过自改进的数据飞轮开发稳健的推理模型,以应对长周期稀疏奖励环境。框架首先利用规划四元组和长短期链式思考融合高效推理,然后通过复杂度分层课程学习扩展到分布外任务,最后通过奖励门控拒绝采样学习经历进行迭代精炼。在ALFWorld、ScienceWorld和WebShop上的实验表明,本方法在状态-of-the-art中实现了显著的token效率,为代理规划中的推理模型提供了新的配方。

英文摘要

Large Language Reasoning Models have demonstrated remarkable success on static tasks, yet their application to multi-round agentic planning in interactive environments faces two fundamental challenges. First, the intractable credit assignment problem renders conventional reinforcement learning ineffective in sparse-reward settings. Second, the computational overhead of verbose, step-by-step reasoning histories is prohibitive. To address these challenges, we propose BPO, a three-stage framework (bootstrapping, extrapolation, and refinement) that establishes a self-improving data flywheel to develop robust reasoning models for long-horizon, sparse-reward environments. Our framework first bootstraps efficient reasoning using the proposed planning quaternions with long-short chain-of-thought fusion. It then extrapolates to out-of-distribution tasks through complexity-stratified curriculum learning. Finally, the model iteratively refines itself by learning exclusively on experiences selected via reward-gated rejection sampling. Experiments on ALFWorld, ScienceWorld, and WebShop demonstrate that our approach achieves state-of-the-art with significant token efficiency, providing a new recipe for reasoning models in agentic planning.

2508.00712 2026-05-19 cs.LG cs.AI 版本更新

JSON-Bag: A generic game trajectory representation

JSON-Bag:一种通用的游戏轨迹表示方法

Dien Nguyen, Diego Perez-Liebana, Simon Lucas

发表机构 * GitHub

AI总结 本文提出JSON-Bag模型,通过分词JSON描述并使用Jensen-Shannon距离衡量游戏轨迹,验证了其在六个桌面游戏中对玩家、参数和种子分类的有效性,优于基线方法并提升了准确性。

Comments 8 pages, 3 figures, 6 tables, published in IEEE Conference on Games 2025

详情
AI中文摘要

本文提出JSON-Bag模型,通过分词JSON描述并使用Jensen-Shannon距离衡量游戏轨迹,验证了其在六个桌面游戏中对玩家、参数和种子分类的有效性,优于基线方法并提升了准确性。

英文摘要

We introduce JSON Bag-of-Tokens model (JSON-Bag) as a method to generically represent game trajectories by tokenizing their JSON descriptions and apply Jensen-Shannon distance (JSD) as distance metric for them. Using a prototype-based nearest-neighbor search (P-NNS), we evaluate the validity of JSON-Bag with JSD on six tabletop games: 7 Wonders, Dominion, Sea Salt and Paper, Can't Stop, Connect4, Dots and boxes; each over three game trajectory classification tasks: classifying the playing agents, game parameters, or game seeds that were used to generate the trajectories. Our approach outperforms a baseline using hand-crafted features in the majority of tasks. Evaluating on N-shot classification suggests using JSON-Bag prototype to represent game trajectory classes is also sample efficient. Additionally, we demonstrate JSON-Bag ability for automatic feature extraction by treating tokens as individual features to be used in Random Forest to solve the tasks above, which significantly improves accuracy on underperforming tasks. Finally, we show that, across all six games, the JSD between JSON-Bag prototypes of agent classes highly correlates with the distances between agents' policies.

2506.22901 2026-05-19 cs.LG cs.AI q-bio.BM q-bio.GN 版本更新

Missing-Modality-Aware Graph Neural Network for Cancer Classification

面向缺失模态的图神经网络用于癌症分类

Sina Tabakhi, Chen, Chen, Haiping Lu

发表机构 * School of Computer Science, University of Sheffield(谢菲尔德大学计算机科学学院)

AI总结 本文提出MAGNET模型,通过动态患者-模态多头注意力机制融合低维模态嵌入,以提升部分模态下的多模态预测性能,实验表明其在癌症分类任务中优于现有方法。

Comments 27 pages, 22 figures

详情
AI中文摘要

在学习多模态生物数据时,缺失模态是一个关键挑战,其中某些患者的数据缺失一个或多个模态。现有方法要么排除缺失模态的患者,要么填补缺失模态,或直接使用部分模态进行预测。然而,这些方法大多依赖于不灵活的、患者无关的融合策略,且无法扩展到随着模态数量增加而指数级增长的缺失模态模式。为解决这些限制,我们提出MAGNET(Missing-modality-Aware Graph neural NETwork)以增强部分模态下的多模态预测,其特征是动态患者-模态多头注意力机制,根据贡献和缺失性融合低维模态嵌入。MAGNET融合的复杂性随着模态数量线性增加,同时适应缺失模式的变异性。为了生成预测,MAGNET进一步构建一个患者图,其中融合的多模态嵌入作为节点特征,连接性由模态缺失性决定,随后通过图神经网络进行处理。在三个公共多组学数据集上进行的实验表明,MAGNET在癌症分类任务中优于现有最先进的融合方法。数据和代码可在https://github.com/SinaTabakhi/MAGNET获取。

英文摘要

A key challenge in learning from multimodal biological data is missing modalities, where data from one or more modalities are absent for some patients. Existing approaches either exclude patients with missing modalities, impute missing modalities, or make predictions directly with partial modalities. However, most of these methods rely on inflexible, patient-agnostic fusion strategies and do not scale computationally to the combinatorial growth of missing-modality patterns as the number of modalities increases. To address these limitations, we propose MAGNET (Missing-modality-Aware Graph neural NETwork) to enhance multimodal prediction with partial modalities, featuring a dynamic patient-modality multi-head attention mechanism to fuse lower-dimensional modality embeddings based on their contribution and missingness. MAGNET fusion's complexity increases linearly with the number of modalities while adapting to missing-pattern variability. To generate predictions, MAGNET further constructs a patient graph with fused multimodal embeddings as node features and connectivity determined by the modality missingness, followed by a graph neural network. Experiments on three public multiomics datasets for cancer classification, with real-world missingness, show that MAGNET outperforms state-of-the-art fusion methods. The data and code are available at https://github.com/SinaTabakhi/MAGNET.

2506.12617 2026-05-19 cs.AI cs.HC 版本更新

Evaluating AI Alignment in LLMs: Output Analysis of Value Priorities Across 75 Models with Human Benchmarking

评估大语言模型中的AI对齐:通过75个模型的人类基准测试分析价值优先级

Gabriel Rongyang Lau, Wei Yan Low, Seow Min Koh, Fiona Fui-Hoon Nah, Andree Hartanto

发表机构 * School of Social Sciences, Nanyang Technological University(南洋理工大学社会科学学院) Interdisciplinary Graduate Programme, Nanyang Technological University(南洋理工大学跨学科研究生项目) Faculty of Arts and Social Sciences, National University of Singapore(新加坡国立大学人文与社会科学学院) School of Computing and Information Systems, Singapore Management University(新加坡管理学院计算与信息学院) School of Social Sciences, Singapore Management University(新加坡管理学院社会科学学院)

AI总结 本文通过分析75个大语言模型的输出,评估其价值优先级与人类判断的一致性,发现模型在价值优先级上存在差异,且模型大小、新旧和能力层级与价值一致性无直接关联。

详情
AI中文摘要

大型语言模型(LLMs)在人类-人工智能交互研究和实践中被越来越多地使用,但现有的能力和安全基准揭示了这些系统所表达的价值优先级以及这些优先级如何与人类判断相一致的信息有限。在三个研究中,我们引入了一种基于输出的方法来评估AI对齐的一个方面,通过将LLM生成的文本视为行为数据,并将其表达的价值优先级结构与人类参考进行比较。研究1利用归纳性定性分析得出六个最优AI功能的主题,即性能、适应能力、社会公益、伦理与责任、关系整合和自主性。研究2显示,LLM输出在模型内部高度稳定,并在不同模型间趋于一致的价值优先级结构,表明价值配置文件具有可靠性和可比性。研究3通过使用一个捕捉优先级相对顺序和优先级差异校准的配置文件保真度指标,将75个当代LLMs与376名人类受访者进行基准测试。尽管大多数模型复现了人类的价值顺序,但一些模型系统性地夸大了优先级之间的差异,表明模型可能在传统基准测试中看似对齐,但仍可能与人类价值校准偏离。配置文件保真度在不同模型间变化显著,并不一致地随大小、新旧或能力层级而变化。LLM和人类都倾向于对自主性进行降级,这提出了关于日益自主的AI系统发展的重大问题。对于研究和应用使用,六个主题和基于配置文件的指标提供了一种可扩展的方法,用于在关键对齐与人类优先级的背景下审计LLM的价值配置文件。

英文摘要

Large language models (LLMs) are increasingly used in human-AI interaction research and practice, yet existing capability and safety benchmarks reveal little about the value priorities these systems express or how those priorities correspond to human judgements. Across three studies, we introduce an output-based approach to evaluating one facet of AI alignment by treating LLM-generated text as behavioural data and comparing expressed value-priority profiles with a human reference. Study 1 used inductive qualitative analysis to derive six themes of optimal AI functioning, namely Performance, Adaptive Capacity, Social Good, Ethics and Responsibility, Relational Integration, and Agency. Study 2 showed that LLM outputs were highly stable within models and converged on a common value-priority structure across models, indicating reliable and comparable value profiles. Study 3 benchmarked 75 contemporary LLMs against 376 human respondents using a profile-fidelity metric capturing both the relative ordering of priorities and the calibration of between-priority differences. Although most models reproduced the human ordering of values, some systematically exaggerated the differences between them, showing that models can appear aligned on conventional benchmarks while still diverging from human value calibration. Profile fidelity varied substantially across models and did not consistently scale with size, recency, or capability tier. Both LLMs and humans converged on a deprioritisation of Agency, raising important questions about the development of increasingly agentic AI systems. For research and applied use, the six themes and profile-based metric provide a scalable method for auditing LLM value profiles before deployment in contexts where alignment with human priorities is critical.

2506.12119 2026-05-19 cs.CL cs.AI 版本更新

Mixture-of-Experts Can Surpass Dense LLMs Under Strictly Equal Resource

专家混合模型可在严格相等资源下超越密集语言模型

Houyi Li, Ka Man Lo, Shijie Xuyang, Ziqi Wang, Wenzhen Zheng, Haocheng Zhang, Zhao Li, Shuigeng Zhou, Xiangyu Zhang, Daxin Jiang

发表机构 * Fudan University(复旦大学) StepFun University of Science and Technology of China(中国科学技术大学) Zhejiang University(浙江大学)

AI总结 本文研究在资源相等条件下MoE模型是否能超越密集模型,提出优化框架并验证了在最优激活率下MoE模型性能更优,且该区域在不同模型规模下一致,通过数据重用解决数据量增加的权衡问题。

Comments Published as a conference paper at ICLR 2026

详情
AI中文摘要

专家混合(MoE)语言模型显著扩展了模型容量,并在不增加每token计算量的情况下实现了显著性能提升。然而,在严格相等的资源约束下,即总参数量、训练计算和数据预算完全相同的情况下,MoE能否超越密集架构?尽管其具有重要的实际价值和潜力,这一问题仍缺乏深入研究。本文提出了一种新的视角和方法论框架,系统研究这一问题。首先,我们全面调查了MoE的架构并实现了最优模型设计以最大化性能。基于此,我们发现,在最优区域内的MoE模型在相同总参数、训练计算和数据资源下能够超越其密集 counterpart。更重要的是,这一最优区域在不同模型规模下保持一致。虽然增加的数据量会带来性能的权衡,但我们通过重用数据解决了这一问题。我们通过广泛的实验验证了我们的发现,训练了近200个20亿参数规模的语言模型和超过50个70亿参数规模的语言模型,累计处理了50万亿token。所有模型检查点均已公开。

英文摘要

Mixture-of-Experts (MoE) language models dramatically expand model capacity and achieve remarkable performance without increasing per-token compute. However, can MoEs surpass dense architectures under strictly equal resource constraints -- that is, when the total parameter count, training compute, and data budget are identical? This question remains under-explored despite its significant practical value and potential. In this paper, we propose a novel perspective and methodological framework to study this question thoroughly. First, we comprehensively investigate the architecture of MoEs and achieve an optimal model design that maximizes the performance. Based on this, we subsequently find that an MoE model with activation rate in an optimal region is able to outperform its dense counterpart under the same total parameter, training compute and data resource. More importantly, this optimal region remains consistent across different model sizes. Although additional amount of data turns out to be a trade-off for enhanced performance, we show that this can be resolved via reusing data. We validate our findings through extensive experiments, training nearly 200 language models at 2B scale and over 50 at 7B scale, cumulatively processing 50 trillion tokens. All model checkpoints are publicly available.

2506.11925 2026-05-19 cs.AR cs.AI cs.CV cs.LG 版本更新

Real-World Deployment of a Lane Change Prediction Architecture Based on Knowledge Graph Embeddings and Bayesian Inference

基于知识图谱嵌入和贝叶斯推断的车道变换预测架构的现实世界部署

M. Manzour, Catherine M. Elias, Omar M. Shehata, R. Izquierdo, M. A. Sotelo

发表机构 * Department of Computer Engineering, University of Alcalá(阿尔卡拉大学计算机工程系) Department of Computer Science, German University in Cairo(开罗德国大学计算机科学系) Department of Mechatronics, German University in Cairo(开罗德国大学机电系)

AI总结 本文提出基于知识图谱嵌入和贝叶斯推断的车道变换预测系统,通过现实硬件验证,实现了算法与道路部署的结合,提前3-4秒预测目标车辆车道变换,确保安全。

Journal ref 2025 IEEE International Conference on Vehicular Electronics and Safety (ICVES)

详情
AI中文摘要

近年来,车道变换预测研究取得显著进展,但大多数研究局限于仿真或数据集结果,未能实现算法与道路部署的结合。本文通过现实硬件展示了基于知识图谱嵌入(KGEs)和贝叶斯推断的车道变换预测系统。该系统包含感知模块和预测模块:感知模块感知环境,提取数值特征并转换为语言类别,与预测模块通信;预测模块执行KGE和贝叶斯推断模型,预测目标车辆的行驶动作并转换为纵向制动动作。现实硬件实验验证表明,该预测系统能提前3-4秒预测目标车辆的车道变换,为自动驾驶车辆提供充足反应时间,确保车道变换安全。

英文摘要

Research on lane change prediction has gained a lot of momentum in the last couple of years. However, most research is confined to simulation or results obtained from datasets, leaving a gap between algorithmic advances and on-road deployment. This work closes that gap by demonstrating, on real hardware, a lane-change prediction system based on Knowledge Graph Embeddings (KGEs) and Bayesian inference. Moreover, the ego-vehicle employs a longitudinal braking action to ensure the safety of both itself and the surrounding vehicles. Our architecture consists of two modules: (i) a perception module that senses the environment, derives input numerical features, and converts them into linguistic categories; and communicates them to the prediction module; (ii) a pretrained prediction module that executes a KGE and Bayesian inference model to anticipate the target vehicle's maneuver and transforms the prediction into longitudinal braking action. Real-world hardware experimental validation demonstrates that our prediction system anticipates the target vehicle's lane change three to four seconds in advance, providing the ego vehicle sufficient time to react and allowing the target vehicle to make the lane change safely.

2506.10959 2026-05-19 cs.LG cs.AI math.ST stat.TH 版本更新

Understanding In-Context Learning on Structured Manifolds: Bridging Attention to Kernel Methods

在结构流形上理解上下文学习:连接注意力机制与核方法

Zhaiming Shen, Alexander Hsu, Rongjie Lai, Wenjing Liao

发表机构 * School of Mathematics, Georgia Institute of Technology(佐治亚理工学院数学系) Department of Mathematics, Purdue University(普渡大学数学系)

AI总结 本文研究了在结构几何数据上上下文学习的理论,通过将注意力机制与核方法联系,揭示了transformers在流形上进行核预测的机制,并推导了泛化误差界。

详情
AI中文摘要

尽管上下文学习(ICL)在自然语言和视觉领域取得了显著成功,但其在结构几何数据中的理论理解仍不明确。本文首次对ICL在流形上回归Hölder函数的理论进行了研究。我们建立了注意力机制与经典核方法之间的新联系,证明transformers通过与提示的交互在新查询上进行基于核的预测。这一联系通过数值实验得到验证,显示学习的查询-提示分数与高斯核高度相关。基于此见解,我们推导了泛化误差界,以提示长度和训练任务数量为变量。当观察到足够多的训练任务时,transformers在流形上实现Hölder函数的最小最大回归率,该速率与提示长度呈指数关系,指数取决于流形的内在维度,而非外蕴空间维度。我们的结果还描述了泛化误差随训练任务数量的变化,揭示了transformers作为上下文核算法学习器的复杂性。我们的发现为理解几何在ICL中的作用提供了基础见解,并为研究非线性模型的ICL提供了新工具。

英文摘要

While in-context learning (ICL) has achieved remarkable success in natural language and vision domains, its theoretical understanding-particularly in the context of structured geometric data-remains unexplored. This paper initiates a theoretical study of ICL for regression of Hölder functions on manifolds. We establish a novel connection between the attention mechanism and classical kernel methods, demonstrating that transformers effectively perform kernel-based prediction at a new query through its interaction with the prompt. This connection is validated by numerical experiments, revealing that the learned query-prompt scores for Hölder functions are highly correlated with the Gaussian kernel. Building on this insight, we derive generalization error bounds in terms of the prompt length and the number of training tasks. When a sufficient number of training tasks are observed, transformers give rise to the minimax regression rate of Hölder functions on manifolds, which scales exponentially with respect to the prompt length with the exponent depending on the intrinsic dimension of the manifold, rather than the ambient space dimension. Our result also characterizes how the generalization error scales with the number of training tasks, shedding light on the complexity of transformers as in-context kernel algorithm learners. Our findings provide foundational insights into the role of geometry in ICL and novels tools to study ICL of nonlinear models.

2506.05442 2026-05-19 cs.CV cs.AI 版本更新

Structured Labeling Enables Faster Vision-Language Models for End-to-End Autonomous Driving

结构化标注加速面向端到端自动驾驶的视觉-语言模型

Hao Jiang, Chuan Hu, Yukang Shi, Yuan He, Ke Wang, Xi Zhang, Zhipeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) KargoBot

AI总结 本文提出结构化标注的NuScenes-S数据集和紧凑型FastDrive模型,提升自动驾驶中决策任务的效率与准确性,实验显示在结构化数据集上性能优异,推理速度提升超10倍。

详情
AI中文摘要

视觉-语言模型(VLMs)因其类人推理能力成为端到端自动驾驶的有前景方法。然而,现有VLMs与现实应用之间仍存在显著差距。主要限制是现有松散格式的语言描述数据集不适用于机器,可能引入冗余。此外,VLMs的高计算成本和大规模阻碍了推理速度和现实部署。为弥合这一差距,本文引入了结构化且简洁的基准数据集NuScenes-S,该数据集源自NuScenes数据集并包含适用于机器的结构化表示。此外,我们提出了FastDrive,一个参数仅为0.9B的紧凑型VLM基线。与现有参数超过7B且未结构化的VLMs(如LLaVA-1.5)相比,FastDrive能够理解和生成结构化且简洁的描述,以高效率生成机器友好的驾驶决策。大量实验表明,FastDrive在结构化数据集上实现了竞争性的性能,决策任务的精度提高了约20%,同时在推理速度上超越大规模参数基线超过10倍。此外,消融研究进一步聚焦于场景注释(如天气、时间)对决策任务的影响,证明了其在自动驾驶决策任务中的重要性。

英文摘要

Vision-Language Models (VLMs) offer a promising approach to end-to-end autonomous driving due to their human-like reasoning capabilities. However, troublesome gaps remains between current VLMs and real-world autonomous driving applications. One major limitation is that existing datasets with loosely formatted language descriptions are not machine-friendly and may introduce redundancy. Additionally, high computational cost and massive scale of VLMs hinder the inference speed and real-world deployment. To bridge the gap, this paper introduces a structured and concise benchmark dataset, NuScenes-S, which is derived from the NuScenes dataset and contains machine-friendly structured representations. Moreover, we present FastDrive, a compact VLM baseline with 0.9B parameters. In contrast to existing VLMs with over 7B parameters and unstructured language processing(e.g., LLaVA-1.5), FastDrive understands structured and concise descriptions and generates machine-friendly driving decisions with high efficiency. Extensive experiments show that FastDrive achieves competitive performance on structured dataset, with approximately 20% accuracy improvement on decision-making tasks, while surpassing massive parameter baseline in inference speed with over 10x speedup. Additionally, ablation studies further focus on the impact of scene annotations (e.g., weather, time of day) on decision-making tasks, demonstrating their importance on decision-making tasks in autonomous driving.

2505.16831 2026-05-19 cs.CL cs.AI cs.CR cs.LG 版本更新

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

反学习不是删除:调查机器反学习在大语言模型中的可逆性

Xiaoyu Xu, Xiang Yue, Yang Liu, Qingqing Ye, Huadi Zheng, Peizhao Hu, Minxin Du, Haibo Hu

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Carnegie Mellon University(卡内基梅隆大学) University of California, Santa Cruz(加州大学圣克ruz分校) Huawei Technologies(华为技术有限公司) Research Centre for Privacy and Security Technologies in Future Smart Systems, PolyU(未来智能系统中的隐私与安全技术研究中心,PolyU)

AI总结 研究揭示大语言模型反学习的可逆性问题,提出表示层面分析框架,通过PCA相似度、CKA和Fisher信息等指标评估表示漂移,发现四种遗忘模式,指出数据来源影响重学效率,揭示不可逆遗忘的挑战。

Comments ICML 2026, accepted to appear

详情
AI中文摘要

在大语言模型(LLMs)中,反学习旨在移除指定数据,但其效果通常通过任务级指标如准确率和困惑度评估。我们证明这些指标可能误导,因为模型似乎遗忘,但通过最小微调即可恢复原始行为。这种可逆性表明信息被抑制而非真正删除。为填补这一评估空白,我们引入表示层面分析框架。我们的工具包包括PCA相似度和位移、中心核对齐(CKA)和Fisher信息,辅以均值PCA距离作为总结指标,用于衡量表示漂移。在多种反学习方法、数据领域和LLMs上应用此框架,我们识别出四种基于可逆性和灾难性程度的遗忘模式。我们比较了恢复策略,发现重学效率依赖于数据来源。我们还发现不可逆、非灾难性遗忘异常困难。通过探测反学习极限,我们识别出一个看似不可逆的目标遗忘案例,为更稳健的擦除算法提供见解。总体而言,我们的发现揭示了当前评估的差距,并建立了可信反学习的表示层面基础。

英文摘要

Unlearning in large language models (LLMs) aims to remove specified data, but its efficacy is typically assessed with task-level metrics like accuracy and perplexity. We show that these metrics can be misleading, as models can appear to forget while their original behavior is easily restored through minimal fine-tuning. This \emph{reversibility} suggests that information is merely suppressed, not genuinely erased. To address this critical evaluation gap, we introduce a \emph{representation-level analysis framework}. Our toolkit comprises PCA similarity and shift, centered kernel alignment (CKA), and Fisher information, complemented by a summary metric, the mean PCA distance, to measure representational drift. Applying this framework across multiple unlearning methods, data domains, and LLMs, we identify four distinct forgetting regimes based on their \emph{reversibility} and \emph{catastrophicity}. We compare recovery strategies and show that relearning efficiency relies on the data source. We also find that irreversible, non-catastrophic forgetting is exceptionally challenging. By probing unlearning limits, we identify a case of seemingly irreversible, targeted forgetting, offering insights for more robust erasure algorithms. Overall, our findings expose a gap in current evaluation and establish a representation-level foundation for trustworthy unlearning.

2505.02360 2026-05-19 cs.LG cs.AI 版本更新

Catastrophic Overfitting, Entropy Gap and Participation Ratio: A Noiseless $l^p$ Norm Solution for Fast Adversarial Training

灾难性过拟合、熵差与参与比:一种无噪声的 $l^p$ 范数解决方案用于快速对抗训练

Fares B. Mehouachi, Saif Eddin Jabari

发表机构 * New York University of Abu Dhabi(纽约阿布扎比分校) Department of Civil and Urban Engineering(土木与城市工程系) NYU Tandon School of Engineering(纽约大学坦顿工程学院)

AI总结 本文提出基于 $l^p$ 范数的无噪声方法,通过量化梯度集中度和熵测度,自动调整训练范数以缓解灾难性过拟合问题,无需额外正则化或噪声注入。

Comments 26 pages, 13 figures, 5 table. Preliminary version at NeurIPS 2025 Reliable and Responsible AI Workshop. Code: https://github.com/FaresBMehouachi/lpfgsm

详情
AI中文摘要

对抗训练是稳健深度学习的基石,但快速方法如快速梯度符号法(FGSM)常遭遇灾难性过拟合(CO),即模型对单步攻击鲁棒但对多步变种失效。现有解决方案依赖噪声注入、正则化或梯度裁剪,本文提出一种纯控制 $l^p$ 训练范数以缓解 CO 的新方法。我们的研究受实证观察启发,即 CO 在 $l^{\infty}$ 范数下比 $l^2$ 范数更普遍。基于此洞察,我们开发了广义 $l^p$ 攻击作为固定点问题,并设计 $l^p$-FGSM 攻击以理解从 $l^2$ 到 $l^{\infty}$ 的过渡机制。这导致我们的核心洞察:CO 出现于高度集中梯度(信息在少数维度本地化)与激进范数约束相互作用时。通过量化梯度集中度通过参与比和熵测度,我们开发了自适应 $l^p$-FGSM,根据梯度信息自动调整训练范数。大量实验表明,该方法在无需额外正则化或噪声注入的情况下实现了强大的鲁棒性,提供了一种新颖且理论指导的缓解 CO 问题的途径。

英文摘要

Adversarial training is a cornerstone of robust deep learning, but fast methods like the Fast Gradient Sign Method (FGSM) often suffer from Catastrophic Overfitting (CO), where models become robust to single-step attacks but fail against multi-step variants. While existing solutions rely on noise injection, regularization, or gradient clipping, we propose a novel solution that purely controls the $l^p$ training norm to mitigate CO. Our study is motivated by the empirical observation that CO is more prevalent under the $l^{\infty}$ norm than the $l^2$ norm. Leveraging this insight, we develop a framework for generalized $l^p$ attack as a fixed point problem and craft $l^p$-FGSM attacks to understand the transition mechanics from $l^2$ to $l^{\infty}$. This leads to our core insight: CO emerges when highly concentrated gradients where information localizes in few dimensions interact with aggressive norm constraints. By quantifying gradient concentration through Participation Ratio and entropy measures, we develop an adaptive $l^p$-FGSM that automatically tunes the training norm based on gradient information. Extensive experiments demonstrate that this approach achieves strong robustness without requiring additional regularization or noise injection, providing a novel and theoretically-principled pathway to mitigate the CO problem.

2503.20981 2026-05-19 cs.CL cs.AI cs.SI 版本更新

Patients Speak, AI Listens: LLM-based Analysis of Online Reviews Uncovers Key Drivers for Urgent Care Satisfaction

患者发声,AI倾听:基于大语言模型的在线评论分析揭示了紧急护理满意度的关键驱动因素

Xiaoran Xu, Zhaoqian Xue, Chi Zhang, Jhonatan Medri, Junjie Xiong, Jiayan Zhou, Jin Jin, Yongfeng Zhang, Siyuan Ma, Lingyao Li

发表机构 * Electrical Engineering department, University of South Florida(佛罗里达州立大学电气工程系) Department of Biostatistics, Epidemiology and Bioinformatics, University of Pennsylvania(宾夕法尼亚大学生物统计学、流行病学与生物信息学系) Computer Science and Engineering department, University of South Florida(佛罗里达州立大学计算机科学与工程系) Mathematics & Statistics, University of South Florida(佛罗里达州立大学数学与统计学系) Department of Computer Science and Engineering, University of Missouri Science and Technology(密苏里科技大学计算机科学与工程系) School of Medicine, Stanford University(斯坦福大学医学院) Department of Computer Science, Rutgers University(罗格斯大学计算机科学系) Department of Biostatistics, Vanderbilt University(范德比尔特大学生物统计学系)

AI总结 本文利用大语言模型分析在线评论,揭示紧急护理满意度的关键因素,发现人际因素和运营效率是主要决定因素,其他因素在调整后无显著影响。

详情
AI中文摘要

调查紧急护理设施的公众体验对促进社区医疗发展至关重要。传统调查方法由于范围、时间和空间覆盖有限而效果不佳。通过在线评论或社交媒体进行众包研究是一种有价值的途径。随着大语言模型(LLMs)的最新进展,从评论中提取细微感知已成为可能。本研究收集了Google Maps上DMV和佛罗里达地区的评论,并使用GPT模型进行提示工程,分析紧急护理的方面情感。我们首先分析了各种方面的地理空间模式,包括人际因素、运营效率、技术质量、财务和设施。接下来,我们确定了影响公众感知的CBG层面特征,包括人口密度、中位收入、基尼指数、租金与收入比率、家庭贫困率、无保险率和失业率。我们的结果表明,人际因素和运营效率是紧急护理患者满意度的最强决定因素,而技术质量、财务和设施在多变量模型中无显著独立影响。在社会经济和人口因素中,只有人口密度与患者评分有显著但微弱的相关性,其余因素无显著相关性。总体而言,本研究强调了众包研究揭示居民关注因素的潜力,并为利益相关者改进紧急护理公众满意度提供有价值的见解。

英文摘要

Investigating the public experience of urgent care facilities is essential for promoting community healthcare development. Traditional survey methods often fall short due to limited scope, time, and spatial coverage. Crowdsourcing through online reviews or social media offers a valuable approach to gaining such insights. With recent advancements in large language models (LLMs), extracting nuanced perceptions from reviews has become feasible. This study collects Google Maps reviews across the DMV and Florida areas and conducts prompt engineering with the GPT model to analyze the aspect-based sentiment of urgent care. We first analyze the geospatial patterns of various aspects, including interpersonal factors, operational efficiency, technical quality, finances, and facilities. Next, we determine Census Block Group (CBG)-level characteristics underpinning differences in public perception, including population density, median income, GINI Index, rent-to-income ratio, household below poverty rate, no insurance rate, and unemployment rate. Our results show that interpersonal factors and operational efficiency emerge as the strongest determinants of patient satisfaction in urgent care, while technical quality, finances, and facilities show no significant independent effects when adjusted for in multivariate models. Among socioeconomic and demographic factors, only population density demonstrates a significant but modest association with patient ratings, while the remaining factors exhibit no significant correlations. Overall, this study highlights the potential of crowdsourcing to uncover the key factors that matter to residents and provide valuable insights for stakeholders to improve public satisfaction with urgent care.

2503.19950 2026-05-19 cs.LG cs.AI cs.CL 版本更新

LogQuant: Log-Distributed 2-Bit Quantization of KV Cache with Superior Accuracy Preservation

LogQuant: 一种基于对数分布的2位KV缓存量化技术,具有更优异的精度保持性能

Han Chen, Zicong Jiang, Zining Zhang, Bingsheng He, Pingyi Luo, Mian Lu, Yuqiang Chen

发表机构 * Paradigm(4Paradigm)

AI总结 LogQuant通过基于对数的过滤机制实现KV缓存的2位量化,减少内存占用的同时保持高性能,实验表明其在吞吐量、批处理大小和准确性上均优于现有方法。

Comments Accepted by ICLR 2025 Workshop on Sparsity in LLMs (SLLM)

详情
AI中文摘要

我们介绍了LogQuant,一种突破性的2位量化技术,用于大型语言模型(LLM)推理中的KV缓存,实现显著的内存节省同时保持优越的性能。先前的方法要么假设后续token更重要,要么基于早期注意力模式预测重要token,但两者都可能导致性能瓶颈或频繁的误预测。LogQuant采取了不同的方法。通过应用基于对数的过滤机制,它在整个上下文中选择性地压缩KV缓存,与现有方法相比,实现更好的性能,甚至减少内存占用。在基准测试中,它提高了25%的吞吐量,提升了60%的批处理大小,而无需增加内存消耗。对于Math和Code Completion等具有挑战性的任务,LogQuant在相同压缩比下将准确性提高了40%至200%,优于其他方法。LogQuant可以轻松集成到流行的推理框架中,如Python的transformers库。实现可在https://github.com/Concyclics/LogQuantKV上获得。

英文摘要

We introduce LogQuant, a groundbreaking 2-bit quantization technique for KV Cache in large language model (LLM) inference, delivering substantial memory savings while preserving superior performance. Previous methods either assume that later tokens are more important or attempt to predict important tokens based on earlier attention patterns. Both approaches, however, can result in performance bottlenecks or frequent mispredictions. LogQuant takes a different approach. By applying a log-based filtering mechanism, it selectively compresses the KV Cache across the entire context, achieving better performance with the same or even reduced memory footprint compared to existing methods. In benchmark tests, it enhances throughput by 25% and boosts batch size by 60% without increasing memory consumption. For challenging tasks such as Math and Code Completion, LogQuant improves accuracy by 40% to 200% at the same compression ratio, outperforming comparable techniques.LogQuant integrates effortlessly with popular inference frameworks like Python's transformers library. Implementation can be available in https://github.com/Concyclics/LogQuantKV.

2503.13535 2026-05-19 cs.CY cs.AI 版本更新

Unlocking Learning Potentials: The Transformative Effect of Generative AI in Education Across Grade Levels

解锁学习潜力:生成式AI在不同年级教育中的变革影响

Meijuan Xie, Liling Luo

发表机构 * School of Mathematics and Statistics, Guangxi Normal University(广西师范大学数学与统计学学院)

AI总结 本文通过混合调查方法探讨生成式AI对不同年级学生在六个关键领域的影响,发现其在适当使用方面影响最大,而在学习兴趣和自信心方面影响最小,且大学生在各领域表现优于高中生。

详情
AI中文摘要

生成式人工智能(GAI)的出现使教育领域出现了显著增长。GAI在支持学习中的使用越来越普遍,但其使用方式和程度因人而异。关于学生对GAI的使用和感知的研究仍较为有限。为此,本文提出混合调查方法,研究GAI对四个不同年级学生在六个关键领域(LIPSAL)的影响。首先,通过问卷发现,GAI对适当使用的影响最大,而对学习兴趣和自信心的影响最低。其次,四个年级的比较显示,LIPSAL的高低因素表现出年级相关变化,大学生在各领域表现优于高中生。第三,通过访谈发现,学生对GAI的应用有全面理解,他们对GAI持积极态度,愿意使用,因此GAI的流行度迅速增长。他们还提到了使用GAI的前景和挑战。未来,随着GAI技术的成熟,其对学生的影响将更大。这些发现可能帮助更好地理解不同学生使用情况,并指导未来数字教育研究。

英文摘要

The advent of generative artificial intelligence (GAI) has brought about a notable surge in the field of education. The use of GAI to support learning is becoming increasingly prevalent among students. However, the manner and extent of its utilisation vary considerably from one individual to another. And researches about student's utilisation and perceptions of GAI remains relatively scarce. To gain insight into the issue, this paper proposed a hybrid-survey method to examine the impact of GAI on students across four different grades in six key areas (LIPSAL): learning interest, independent learning, problem solving, self-confidence, appropriate use, and learning enjoyment. Firstly, through questionnaire, we found that among LIPSAL, GAI has the greatest impact on the concept of appropriate use, the lowest level of learning interest and self-confidence. Secondly, a comparison of four grades revealed that the high and low factors of LIPSAL exhibited grade-related variation, and college students exhibited a higher level than high school students across LIPSAL. Thirdly, through interview, the students demonstrated a comprehensive understanding of the application of GAI. We found that students have a positive attitude towards GAI and are very willing to use it, which is why GAI has grown so rapidly in popularity. They also told us prospects and challenges in using GAI. In the future, as GAI matures technologically, it will have an greater impact on students. These findings may help better understand usage by different students and inform future research in digital education.

2503.13533 2026-05-19 cs.CY cs.AI 版本更新

The Status Quo and Future of AI-TPACK for Mathematics Teacher Education Students: A Case Study in Chinese Universities

人工智能与数学教师教育学生TPACK现状及未来:中国大学案例研究

Meijuan Xie, Liling Luo

发表机构 * School of Mathematics and Statistics, Guangxi Normal University(数学与统计学学院,广西师范大学)

AI总结 本文通过对中国七所大学412名数学教师教育学生进行系统AI-TPACK测评,发现其处于初级阶段,且学业等级不影响AI-TPACK能力发展,提出AI-TPACK-SEM模型揭示自我效能与教学信念对AI-TPACK的影响。

Journal ref Computers and Education Open, vol. 10, pp. 100375, 2026

详情
AI中文摘要

随着人工智能技术在教育领域的普及,数学教师教育学生(MTES)需要展示将AI与技术教学内容知识(AI-TPACK)整合的能力。为研究此问题,我们首先设计了系统AI-TPACK量表并测试了412名MTES。通过描述性统计分析发现,中国MTES的AI-TPACK现状处于初级阶段。其次,我们比较了三个不同年级MTES在六个变量上的差异,发现无明显差异,表明研究生教育未促进AI-TPACK能力发展。第三,我们提出了新的AI-TPACK结构方程模型(AI-TPACK-SEM)以探讨自我效能和教学信念对AI-TPACK的影响。研究发现自我效能与AI-TPACK呈正相关,同时得出与常识相反的结论:过度的教学信念可能阻碍AI-TPACK的发展。本文首次揭示了中国MTES的AI-TPACK现状,设计了专用SEM研究特定因素对AI-TPACK的影响,并提出未来发展方向的建议。

英文摘要

As artificial intelligence (AI) technology becomes increasingly prevalent in the filed of education, there is a growing need for mathematics teacher education students (MTES) to demonstrate proficiency in the integration of AI with the technological pedagogical content knowledge (AI-TPACK). To study the issue, we firstly devised an systematic AI-TPACK scale and test on 412 MTES from seven universities. Through descriptive statistical analyses, we found that the current status of AI-TPACK for MTES in China is at a basic, preliminary stage. Secondly, we compared MTES between three different grades on the six variables and found that there is no discernible difference, which suggested that graduate studies were observed to have no promotion in the development of AI-TPACK competencies. Thirdly, we proposed a new AI-TPACK structural equation model (AI-TPACK-SEM) to explore the impact of self-efficacy and teaching beliefs on AI-TPACK. Our findings indicate a positive correlation between self-efficacy and AI-TPACK. We also come to a conclusion that may be contrary to common perception, excessive teaching beliefs may impede the advancement of AI-TPACK. Overall, this paper revealed the current status of AI-TPACK for MTES in China for the first time, designed a dedicated SEM to study the effect of specific factors on AI-TPACK, and proposed some suggestions on future developments.

2503.12181 2026-05-19 cs.AI cs.RO 版本更新

Action-Gradient Monte Carlo Tree Search for Non-Parametric Continuous (PO)MDPs

动作-梯度蒙特卡洛树搜索用于非参数连续(PO)MDPs

Idan Lev-Yehudi, Michael Novitsky, Moran Barenboim, Ron Benchetrit, Vadim Indelman

发表机构 * Technion – Israel Institute of Technology(技术学院 – 以色列理工学院)

AI总结 本文提出AGMCTS框架,结合全局树搜索与局部梯度优化,解决连续状态空间下的规划问题,理论贡献包括动作评分梯度定理、多重要性采样树和可计算的动作评分梯度。

详情
AI中文摘要

在连续状态、动作和观察空间中,自主系统在线规划仍具挑战性。尽管蒙特卡洛树搜索(MCTS)通过采样有效扩展,但大多数连续(PO)MDP求解器未利用基于梯度的动作优化。本文提出动作-梯度MCTS(AGMCTS),结合全局树搜索与局部梯度优化,保持一致的价值估计。我们提供了三个关键理论贡献:(1)粒子信念状态的动作评分梯度定理;(2)多重要性采样(MIS)树,通过重用先前样本支持频繁动作分支更新而不引入估计漂移;(3)使用区域公式为平滑生成模型提供可计算的动作评分梯度。实验结果表明,AGMCTS在多个具有挑战性的连续MDP和POMDP基准中优于最先进的基于样本的求解器。

英文摘要

Online planning in continuous state, action, and observation spaces remains challenging for autonomous systems. While Monte Carlo Tree Search (MCTS) scales effectively via sampling, most continuous (PO)MDP solvers do not exploit gradient-based action optimization. We propose Action-Gradient MCTS (AGMCTS), a framework that combines global tree search with local gradient-based action refinement, while maintaining consistent value estimates. We provide three key theoretical contributions: (1) an action score gradient theorem for particle belief states; (2) the Multiple Importance Sampling (MIS) Tree that supports frequent action-branch updates by reusing prior samples without introducing estimator drift; and (3) tractable action score gradients for smooth generative models using the Area Formula. Empirical results demonstrate that AGMCTS outperforms state-of-the-art sample-based solvers in multiple challenging continuous MDP and POMDP benchmarks.

2502.18632 2026-05-19 cs.AI cs.CL cs.CY cs.LG cs.SE 版本更新

Automated Knowledge Component Generation for Interpretable Knowledge Tracing in Coding Problems

面向编码问题可解释知识追踪的自动化知识组件生成

Zhangqi Duan, Nigel Fernandez, Arun Balajiee Lekshmi Narayanan, Mohammad Hassany, Rafaella Sampaio de Alencar, Peter Brusilovsky, Bita Akram, Andrew Lan

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of Pittsburgh(匹兹堡大学) North Carolina State University(北卡罗来纳州立大学)

AI总结 本文提出基于LLM的知识组件生成与标注自动化流程,开发KCGen-KT框架,在不同编程语言的实测数据中验证其优于传统方法和人工编写的知识组件。

Comments Findings of ACL 2026: The 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

知识组件(KCs)映射到问题有助于建模学生学习,跟踪他们在细粒度技能上的掌握水平,从而在在线学习平台中实现个性化学习和反馈。然而,传统上由人类领域专家进行的知识组件编制和标注工作非常劳动密集。本文提出一个基于LLM的自动化流程,用于开放性编程问题的知识组件生成和标注。我们还开发了一个基于LLM的知识追踪(KT)框架,利用这些LLM生成的知识组件,称为KCGen-KT。我们在两个真实世界的学生代码提交数据集中进行了广泛的定量和定性评估。我们发现KCGen-KT在预测学生未来响应方面优于现有KT方法和人工编写的KCs。我们研究了生成KCs的学习曲线,并显示在认知模型下,LLM生成的KCs比人工编写的KCs拟合更好。我们还进行了与课程讲师的人类评估,以展示我们的流程生成合理的问题-KC映射。

英文摘要

Knowledge components (KCs) mapped to problems help model student learning, tracking their mastery levels on fine-grained skills thereby facilitating personalized learning and feedback in online learning platforms. However, crafting and tagging KCs to problems, traditionally performed by human domain experts, is highly labor intensive. We present an automated, LLM-based pipeline for KC generation and tagging for open-ended programming problems. We also develop an LLM-based knowledge tracing (KT) framework to leverage these LLM-generated KCs, which we refer to as KCGen-KT. We conduct extensive quantitative and qualitative evaluations on two real-world student code submission datasets in different programming languages.We find that KCGen-KT outperforms existing KT methods and human-written KCs on future student response prediction. We investigate the learning curves of generated KCs and show that LLM-generated KCs result in a better fit than human written KCs under a cognitive model. We also conduct a human evaluation with course instructors to show that our pipeline generates reasonably accurate problem-KC mappings.

2502.13957 2026-05-19 cs.CL cs.AI 版本更新

Supervising the search process produces reliable and generalizable information-seeking agents

通过监督搜索过程产生可靠且可推广的信息寻求代理

Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, Aidong Zhang

发表机构 * Department of Computer Science, University of Virginia, USA(弗吉尼亚大学计算机科学系) National Library of Medicine, National Institutes of Health, USA(美国国立卫生研究院国家医学图书馆) Department of Computer Science, University of Illinois Urbana–Champaign, USA(伊利诺伊大学厄巴纳-香槟分校计算机科学系) Medical Oncology, Dana–Farber Cancer Institute, USA(达纳-法伯癌症研究所医学肿瘤科) Surgery, University of Alabama at Birmingham, USA(阿拉巴马大学伯明翰分校外科系) Department of Neurology, Yale School of Medicine, USA(耶鲁医学院神经病学系)

AI总结 本文提出通过监督搜索过程来构建更可靠且可推广的信息寻求代理,通过RAG-Gym框架系统研究了架构设计、参数优化和动作评估,发现推理反思是关键能力,Re$^2$Search++在多跳信息检索基准上取得显著提升,尤其在领域外任务中表现更优。

Comments Homepage: https://rag-gym.github.io; Code: https://github.com/RAG-Gym/RAG-Gym

详情
AI中文摘要

大型语言模型(LLMs)通过从文档排序转向综合答案的方式改变了网络搜索,并越来越多地被用作自主的代理搜索系统,这些系统通过迭代与外部知识源交互。尽管有进展,构建有效的搜索代理仍然具有挑战性,因为高质量的中间搜索步骤难以生成。以往的方法主要依赖于结果监督,仅奖励代理生成正确最终答案。这往往导致奖励黑客和对参数记忆的过度依赖,限制了对领域外任务的泛化能力。为了解决这些限制,我们引入RAG-Gym框架,将监督从最终答案转移到搜索过程本身。通过RAG-Gym,我们系统地研究了架构设计、参数优化和动作评估,确定推理反思是搜索代理的关键能力。基于这一见解,我们提出了Re$^2$Search++,一个受过程监督的代理,它在多跳信息检索基准上实现了显著改进,尤其是在领域外设置中。性能提升主要由更高质量的搜索查询驱动,而非仅靠答案优化。所学的搜索批评者能够跨模型转移,包括专有LLMs。这些发现表明,监督搜索过程会产生更可靠且可推广的信息寻求代理。

英文摘要

Large language models (LLMs) are transforming web search by shifting from document ranking to synthesizing answers, and are increasingly deployed as autonomous agentic search systems that iteratively interact with external knowledge sources. Despite this progress, building effective search agents remains challenging because high-quality intermediate search steps are difficult to generate. Previous approaches have primarily relied on outcome supervision, rewarding agents only for producing correct final answers. This often leads to reward hacking and excessive dependence on parametric memory, limiting generalization to out-of-domain tasks. To address these limitations, we introduce RAG-Gym, a framework that shifts supervision from final answers to the search process itself. With RAG-Gym, we systematically investigate architecture design, parameter optimization, and action evaluation, identifying reasoning reflection as a critical capability for search agents. Building on this insight, we propose Re$^2$Search++, a process-supervised agent that achieves substantial improvements on multi-hop information-seeking benchmarks, especially in out-of-domain settings. Performance gains are driven primarily by higher-quality search queries rather than answer optimization alone, and the learned search critics transfer across models, including proprietary LLMs. These findings show that supervising the search process produces more reliable and generalizable information-seeking agents.

2411.18234 2026-05-19 cs.LG cs.AI cs.PF stat.CO 版本更新

Time-Efficient Hybrid Hyperparameter Tuning Approach for Cardiovascular Disease Classification

用于心血管疾病分类的高效混合超参数调优方法

Abhay Kumar Pathak, Mrityunjay Chaubey, Manjari Gupta

发表机构 * Department of Computer Science, Institute of Science, Banaras Hindu University(计算机科学系,科学学院,班纳拉森胡大学) School of Computer Science, University of Petroleum and Energy Studies(计算机科学学院,石油与能源研究大学)

AI总结 本文提出一种结合随机搜索和网格搜索的混合超参数调优方法,提升心血管疾病分类模型的准确性和效率,实验表明该方法在性能和计算时间上均优于传统方法。

详情
AI中文摘要

心血管疾病(CVDs)是任何严重的心脏疾病,需要准确诊断以防止致命后果。超参数调优在优化机器学习模型性能中起关键作用,通过选择最合适的参数配置来提高准确性、泛化性和可靠性。网格搜索系统地评估预定义的超参数组合,而随机搜索则从搜索空间中随机采样配置,实现更广泛的探索并减少计算成本。因此,在开发分类模型时,高效调优策略至关重要,因为时间和预测能力同样关键。本文提出了一种新的超参数调优方法,用于调优用于CVD分类的机器学习模型。所提出的随机网格搜索结合了随机搜索探索全局空间的能力和网格搜索在最有前途区域的集中和彻底搜索。这种混合方法在探索和利用之间找到最佳平衡,产生了一个稳健且高效的时间机器学习模型。在最先进的模型上的实验结果表明,随机网格搜索比传统超参数调优方法表现更好。除了观察到的模型性能提升外,大多数模型的训练所需计算时间也显著减少。所提研究的结果强调了所提出随机网格搜索方法在训练时间和计算效率上的减少。所提出的技术在医疗保健领域的机器学习应用中具有重大潜力,能够提供及时且准确的CVDs诊断。

英文摘要

Cardiovascular diseases (CVDs) are any serious illness of the heart, which require accurate diagnosis to prevent fatal consequences. Hyperparameter tuning plays a critical role in optimizing machine learning model performance by selecting the most suitable parameter configurations for improved accuracy, generalization, and reliability. Grid search systematically evaluates predefined hyperparameter combinations, whereas random search samples configurations randomly from the search space enabling broader exploration with reduced computational cost. Therefore, an efficient tuning strategy is essential when developing classification models where time plays an crucial role along with the predictive capability. In this work, we propose a new hyperparameter tuning approach to tune the hyperparameters of ML models for CVD classification. The proposed random grid search combines the power of random search to explore the global space with the focused and exhaustive search of grid search in the most promising areas. This hybrid approach finds an optimal balance between exploration and exploitation and yields a robust and time-efficient ML model for classification seetings. Experimental results on state of the art models demonstrated that randomised grid search performed better than traditional hyperparameter tuning methods. In addition to the observed improvement in model performance, the computational time required for training models was substantially reduced across most of the models. Presented results of the proposed study emphasizes the reduction in training time and computational efficiency of the proposed Randomized-Grid Search method. The proposed technique has significant potential to advance ML application in healthcare providing timely and accurate CVDs diagnosis.

2411.15361 2026-05-19 cs.AI 版本更新

Designing Cellular Manufacturing System in Presence of Alternative Process Plans

在存在替代工艺计划的情况下设计单元制造系统

Md. Kutub Uddin, Md. Saiful Islam, Md Abrar Jahin, Md. Tanjid Hossen Irfan, Md. Saiful Islam Seam, M. F. Mridha

发表机构 * Department of Mechanical Engineering, Khulna University of Engineering & Technology(Khulna大学工程与技术学院机械工程系) Department of Industrial Engineering and Management, Khulna University of Engineering & Technology(Khulna大学工程与技术学院工业工程与管理系) Thomas Lord Department of Computer Science, Viterbi School of Engineering, University of Southern California(南加州大学维特比工程学院托马斯·劳德计算机科学系) Department of Computer Science, American International University−Bangladesh(美国国际大学-孟加拉国计算机科学系)

AI总结 本文提出四种整数规划模型,用于在设计和运营阶段对零件和机器进行分组,以最小化单元内和单元间移动,同时讨论了该目标与其他目标如投资成本和运营成本的适用性。

Journal ref IET Collaborative Intelligent Manufacturing (2026)

详情
AI中文摘要

在设计单元制造系统(CMS)时,必须在设计和运营阶段做出众多技术和管理决策。本文提出了四种整数规划公式,用于在设计和运营层次上对零件和机器进行分组,解决一个通用的分组问题,其中每个零件有多个工艺计划,每个工艺计划的操作可以由多台机器执行。通过将同一零件类型的连续操作尽可能分配到同一单元和同一台机器上,来最小化单元内和单元间的移动。讨论了以最小化单元内和单元间移动作为目标与其他目标如最小化机器投资成本、运营成本等的适用性。包含数值示例以说明公式的运作。

英文摘要

In the design of cellular manufacturing systems (CMS), numerous technological and managerial decisions must be made at both the design and operational stages. The first step in designing a CMS involves grouping parts and machines. In this paper, four integer programming formulations are presented for grouping parts and machines in a CMS at both the design and operational levels for a generalized grouping problem, where each part has more than one process plan, and each operation of a process plan can be performed on more than one machine. The minimization of inter-cell and intra-cell movements is achieved by assigning the maximum possible number of consecutive operations of a part type to the same cell and to the same machine, respectively. The suitability of minimizing inter-cell and intra-cell movements as an objective, compared to other objectives such as minimizing investment costs on machines, operating costs, etc., is discussed. Numerical examples are included to illustrate the workings of the formulations.

2411.10636 2026-05-19 cs.CL cs.AI cs.LG 版本更新

Mitigating Extrinsic Gender Bias for Bangla Classification Tasks

缓解孟加拉语分类任务中的外在性别偏见

Sajib Kumar Saha Joy, Arman Hassan Mahy, Meherin Sultana, Azizah Mamun Abha, MD Piyal Ahmmed, Yue Dong, G M Shahariar

发表机构 * Ahsanullah University of Science and Technology(阿沙努拉科学与技术大学) University of California, Riverside(加州大学河滨分校)

AI总结 本文研究了孟加拉语预训练语言模型中的外在性别偏见,构建了四个任务特定的基准数据集,并提出RandSymKL方法以缓解偏见,实验表明其能有效减少偏见并保持高准确率。

详情
AI中文摘要

在本研究中,我们探讨了孟加拉语预训练语言模型中的外在性别偏见,这是一个在低资源语言中鲜有研究的领域。为了评估这种偏见,我们构建了四个人工标注的任务特定基准数据集,用于情感分析、毒性检测、仇恨言论检测和讽刺检测。每个数据集都通过细致的性别扰动进行了增强,通过系统地交换性别化名称和术语并保持语义内容,实现了对性别驱动预测变化的最小配对评估。然后,我们提出RandSymKL,一种整合对称KL散度和交叉熵损失的随机去偏策略,以在任务特定的预训练模型中缓解偏见。RandSymKL是一种精炼的训练方法,以统一的方式整合这些元素,专注于分类任务的外在性别偏见缓解。我们的方法在现有偏见缓解方法上进行了评估,结果表明,我们的技术不仅有效减少了偏见,还与其他基线方法相比保持了竞争性的准确性。为了促进进一步研究,我们已公开了我们的实现和数据集:https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

英文摘要

In this study, we investigate extrinsic gender bias in Bangla pretrained language models, a largely underexplored area in low-resource languages. To assess this bias, we construct four manually annotated, task-specific benchmark datasets for sentiment analysis, toxicity detection, hate speech detection, and sarcasm detection. Each dataset is augmented using nuanced gender perturbations, where we systematically swap gendered names and terms while preserving semantic content, enabling minimal-pair evaluation of gender-driven prediction shifts. We then propose RandSymKL, a randomized debiasing strategy integrated with symmetric KL divergence and cross-entropy loss to mitigate the bias across task-specific pretrained models. RandSymKL is a refined training approach to integrate these elements in a unified way for extrinsic gender bias mitigation focused on classification tasks. Our approach was evaluated against existing bias mitigation methods, with results showing that our technique not only effectively reduces bias but also maintains competitive accuracy compared to other baseline approaches. To promote further research, we have made both our implementation and datasets publicly available: https://github.com/sajib-kumar/Mitigating-Bangla-Extrinsic-Gender-Bias

2409.14634 2026-05-19 cs.HC cs.AI 版本更新

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

基于领域重构与新颖性评估的人机协同科学构想系统

Marissa Radensky, Simra Shahid, Raymond Fok, Pao Siangliulue, Tom Hope, Daniel S. Weld

发表机构 * University of Washington(华盛顿大学) Microsoft(微软) Allen Institute for AI(人工智能研究院)

AI总结 Scideator通过领域重构与新颖性评估模块,帮助用户在科学构想中生成更具创造性的想法,实验显示其在想法探索和表达性方面优于传统LLM方法。

Comments Updated based on most recent submission

详情
AI中文摘要

科学构想过程通常涉及将现有论文的要素重新组合以产生新想法。我们提出了Scideator,首个基于要素的科学构想人机协同系统。从用户提供的论文出发,Scideator提取关键要素--目的、机制和评估--并允许用户交互式重新组合要素以合成想法。Scideator由三个设计选择驱动:(1) 人类在回路的要素重新组合,用户从检索的论文中选择要素,系统通过Faceted Idea Generator模块寻找跨要素的类比生成想法;(2) 距离控制检索通过Analogous Paper Facet Finder模块,提供从相同主题到完全不同领域的论文范围;(3) 基于要素的新颖性验证通过Idea Novelty Checker模块,一个检索后重排序流程,帮助用户评估想法的原创性。在计算机科学研究人员的用户研究中,Scideator比使用相同基础LLM但无要素模块的基线提供了显著更多的创造力支持,特别是在想法探索和表达性方面。消融实验进一步表明,要素对新颖性检查器有益:基于要素的检索后重排序比标准检索和重排序显示更多相关论文,且基于要素的新颖性分类器优于基于无结构想法和论文推理的分类器。

英文摘要

The scientific ideation process often involves blending facets of existing papers to create new ideas. We contribute Scideator, the first human-LLM system for facet-based scientific ideation. Starting from user-provided papers, Scideator extracts key facets -- purposes, mechanisms, and evaluations -- from these and related papers, allowing users to interactively recombine facets to synthesize ideas. Scideator is driven by three design choices: (1) human-in-the-loop facet recombination, in which users select facets from retrieved papers and the system generates ideas by finding analogies across them via the Faceted Idea Generator module; (2) distance-controlled retrieval via the Analogous Paper Facet Finder module, which surfaces papers ranging from the same topic to entirely different areas to provide a spectrum of directions; and (3) facet-based novelty verification via the Idea Novelty Checker module, a retrieve-then-rerank pipeline that helps users to evaluate idea originality using facets. In a user study with computer science researchers, Scideator provided significantly more creativity support than a baseline using the same backbone LLM without our facet-based modules, particularly in idea exploration and expressiveness. Ablations further show that the facets benefit the novelty checker: facet-based retrieve-then-rerank surfaces more relevant papers than standard retrieval and re-ranking, and a facet-grounded novelty classifier outperforms classifiers that reason over unstructured ideas and papers.

2409.10102 2026-05-19 cs.IR cs.AI cs.CL 版本更新

Trustworthiness in Retrieval-Augmented Generation Systems: A Survey

检索增强生成系统中的可信度:综述

Yujia Zhou, Wenbo Zhang, Jingying Shao, Yan Liu, Xiaoxi Li, Jiajie Jin, Hongjin Qian, Zheng Liu, Chaozhuo Li, Jason Chen Zhang, Zhicheng Dou, Philip S. Yu, Jiaxin Mao

发表机构 * Tsinghua University(清华大学) Renmin University of China(中国人民大学) The Chinese University of Hong Kong(香港中文大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Hong Kong Polytechnic University(香港理工大学) Microsoft Research Asia(微软亚洲研究院) University of Illinois(伊利诺伊大学)

AI总结 本文综述了检索增强生成系统中可信度的关键维度,提出Trust-RAG Compass框架,评估事实性、鲁棒性等六个方面,并建立评估基准,揭示不同LLM在可信度方面的性能差异,指出未来研究方向。

详情
AI中文摘要

检索增强生成(RAG)已迅速成为大型语言模型(LLMs)发展中的关键范式。尽管现有研究主要强调准确性和效率,但RAG系统的可信度仍缺乏充分探讨。RAG通过将响应基于外部和最新知识来提高LLM的可靠性,减少幻觉。然而,不可靠的检索或不当的知识利用仍可能导致不良输出。为此,我们提出统一框架Trust-RAG Compass,从事实性、鲁棒性、公平性、透明性、问责性和隐私六个关键维度评估RAG系统的可信度。在此框架下,我们对现有文献进行了全面回顾,并引入评估基准TRC Bench,围绕六个维度对多种专有和开源模型进行全面评估。我们的结果揭示了不同类型的LLM在不同可信度维度上的性能差距。最后,基于我们的发现,我们识别了关键挑战和未来研究的前景。通过这项工作,我们旨在为后续研究提供结构化基础,并为开发真实场景中的可信RAG系统提供实用指导。

英文摘要

Retrieval-Augmented Generation (RAG) has quickly grown into a pivotal paradigm in the development of Large Language Models (LLMs). Although existing research mainly emphasizes accuracy and efficiency, the trustworthiness of RAG systems remains insufficiently explored. RAG can improve LLM reliability by grounding responses in external and up-to-date knowledge, reducing hallucinations. However, unreliable retrieval or improper knowledge utilization may still lead to undesirable outputs. To address these concerns, we propose a unified framework, Trust-RAG Compass, that assesses the trustworthiness of RAG systems across six key dimensions: factuality, robustness, fairness, transparency, accountability, and privacy. Within this framework, we provide a thorough review of the existing literature along each dimension. Furthermore, we introduce an evaluation benchmark, TRC Bench (\underline{T}rust-\underline{R}AG \underline{C}ompass \underline{Bench}mark), regarding the six dimensions and conduct comprehensive evaluations for a variety of proprietary and open-source models. Our results shed light on the performance gaps between different types of LLMs across varying dimensions of trustworthiness. Finally, we identify key challenges and promising directions for future research based on our findings. Through this work, we aim to provide a structured foundation for subsequent investigations and practical guidance for developing trustworthy RAG systems in real-world scenarios.

2409.02428 2026-05-19 cs.LG cs.AI cs.CL cs.SY eess.SY 版本更新

Language Models as Efficient Reward Function Searchers for Custom-Environment Multi-Objective Reinforcement

语言模型作为定制环境多目标强化学习的高效奖励函数搜索器

Guanwen Xie, Jingzehua Xu, Yiyuan Yang, Yimian Ding, Shuai Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University, China(清华大学深圳国际研究生院,清华大学,中国) Department of Computer Science, University of Oxford, United Kingdom(英国牛津大学计算机科学系) Department of Data Science, New Jersey Institute of Technology, USA(美国新泽西理工学院数据科学系)

AI总结 本文提出ERFSL,利用语言模型高效搜索奖励函数,通过生成奖励组件和使用奖励批评者修正代码,实现多目标强化学习任务中零样本学习的高效奖励函数设计。

详情
AI中文摘要

在强化学习任务中,设计和改进复杂定制环境和多重需求的奖励函数具有挑战性。本文提出ERFSL,一种利用大型语言模型(LLMs)的高效奖励函数搜索器,使LLMs成为有效的白盒搜索器,并突出其先进的语义理解能力。具体而言,我们为每个数值明确的用户需求生成奖励组件,并使用奖励批评者识别正确的代码形式。然后,LLMs为奖励组件分配权重以平衡其值,并通过灵活采用方向突变和交叉策略迭代调整权重,类似于遗传算法,基于训练日志分析器提供的上下文。我们将其应用于无直接人类反馈或奖励示例的定制数据收集RL任务(零样本学习)。奖励批评者仅需每个需求一个反馈实例即可有效纠正奖励代码,防止不可纠正的错误。权重初始化使在帕累托解集内获取不同奖励函数而无需权重搜索。即使权重偏差达500倍,平均仅需5.2次迭代即可满足用户需求。ERFSL也适用于大多数使用GPT-4o mini的提示,因为我们分解了权重搜索过程,以降低对数值和长上下文理解能力的要求。

英文摘要

Achieving the effective design and improvement of reward functions in reinforcement learning (RL) tasks with complex custom environments and multiple requirements presents considerable challenges. In this paper, we propose ERFSL, an efficient reward function searcher using LLMs, which enables LLMs to be effective white-box searchers and highlights their advanced semantic understanding capabilities. Specifically, we generate reward components for each numerically explicit user requirement and employ a reward critic to identify the correct code form. Then, LLMs assign weights to the reward components to balance their values and iteratively adjust the weights without ambiguity and redundant adjustments by flexibly adopting directional mutation and crossover strategies, similar to genetic algorithms, based on the context provided by the training log analyzer. We applied the framework to a customized data collection RL task without direct human feedback or reward examples (zero-shot learning). The reward critic successfully corrects the reward code with only one feedback instance for each requirement, effectively preventing unrectifiable errors. The initialization of weights enables the acquisition of different reward functions within the Pareto solution set without the need for weight search. Even in cases where a weight is 500 times off, on average, only 5.2 iterations are needed to meet user requirements. The ERFSL also works well with most prompts utilizing GPT-4o mini, as we decompose the weight searching process to reduce the requirement for numerical and long-context understanding capabilities.

2406.15797 2026-05-19 cs.LG cs.AI 版本更新

$\texttt{SynC}$: Synergistic Boosting of Structure and Representation for Deep Graph Clustering

$\texttt{SynC}$:深度图聚类的结构与表示协同提升

Shifei Ding, Benyu Wu, Xiao Xu, Ling Ding, Xindong Wu

发表机构 * School of Computer Science and Technology/the School of Artificial Intelligence, China University of Mining and Technology(计算机科学与技术学院/人工智能学院,中国矿业大学) Mine Digitization Engineering Research Center of Ministry of Education(教育部矿山数字化工程研究中心) College of Intelligence and Computing, Tianjin University(智能与计算学院,天津大学) Key Laboratory of Knowledge Engineering with Big Data (the Ministry of Education of China), Hefei University of Technology(大数据知识工程重点实验室(教育部),合肥工业大学)

AI总结 SynC通过协同提升结构与表示学习,改进深度图聚类,减少参数并提升低同质图的泛化能力。

详情
AI中文摘要

SynC通过协同提升结构与表示学习,改进深度图聚类,减少参数并提升低同质图的泛化能力。

英文摘要

Employing graph neural networks (GNNs) for graph clustering has shown promising results in deep graph clustering. However, existing methods disregard the reciprocal relationship between representation learning and structure augmentation: the more homogeneous the graph, the more cohesive the node representations; the more cohesive the node representations, the more reliable the structure augmentation becomes. Moreover, the generalization ability of existing GNN-based models on the low homophily graph is relatively poor. To this end, we propose a graph clustering framework named Synergistic Deep Graph Clustering Network (SynC). SynC employs a Transform Input Graph Auto-Encoder (TIGAE) to obtain high-quality embeddings via mitigating the representations collapse issue of GAE for guiding structure augmentation. Then, we re-capture neighborhood representations on the refined graph to obtain clustering-friendly embeddings and conduct self-supervised clustering. Notably, these two stages share weights, resulting in synergistic boosting while significantly reducing the number of model parameters. Additionally, we introduce a structure fine-tuning strategy to improve the model's generalization on the low homophily graph. Extensive experiments on benchmark datasets demonstrate the superiority of SynC. The code is released at GitHub.

2406.14427 2026-05-19 cs.AI q-bio.NC 版本更新

Principles of frugal inference and control

节俭推断与控制的原则

Itzel Olivos-Castillo, Paul Schrater, Xaq Pitkow

发表机构 * Department of Computer Science, Rice University(计算机科学系,里士大学) Department of Computer Science, University of Minnesota(计算机科学系,明尼苏达大学) Department of Psychology, University of Minnesota(心理学系,明尼苏达大学) Neuroscience Institute, Carnegie Mellon University(神经科学研究所,卡内基梅隆大学)

AI总结 本文提出节俭推断与控制的原则,通过POMDP框架优化资源使用,在不确定世界中平衡效用最大化与资源消耗,解决非线性控制问题如平衡杆和无人机稳定。

详情
AI中文摘要

智能体在不确定世界中面临在效用最大化与资源使用之间取得平衡的挑战,不仅涉及外部运动还涉及内部计算。现有不确定性控制理论通常将推断视为无成本,尽管在人工和生物系统中造成显著的计算和能量负担。为解决此问题,我们引入POMDP框架的新变体,将通过推断获得的信息视为需优化的资源。解决局部线性高斯近似问题揭示了三个资源高效的控制原则:首先,当信息成本高时,推断从贝叶斯最优(无损)压缩转向损失性阶段,战略性地保留部分不确定性以优化资源使用。其次,放松精确贝叶斯推断产生等效解集,反映多种结合不完美推断与补偿控制的方式。这种灵活性可用于满足额外目标或约束而不牺牲原始任务性能。第三,超越目标达成,控制可用于抵消估计误差并引导系统进入表示成本较低的区域。我们实验证明这些原则超越局部线性高斯近似,解决非线性控制问题如平衡杆和无人机稳定。这些结果建立了一个理性计算框架,扩展了现有信息受限决策方法,并为大脑和机器如何在紧约束下实现有效行为提供规范见解。

英文摘要

A central challenge for intelligent agents in an uncertain world is striking the right balance between utility maximization and resource use, not only for external movement but also for internal computation. Existing theories of control under uncertainty typically treat inference as cost-free, despite the substantial computational and energetic burden it imposes in both artificial and biological systems. To remedy this problem, we introduce a novel variant of the POMDP framework in which the information acquired through inference is treated as a resource that must be optimized alongside utility. Solving a local linear-Gaussian approximation of the resulting problem reveals three general principles of resource-efficient control. First, when information is costly, inference shifts from a Bayes-optimal (lossless) compression of the past to a lossy regime that strategically leaves some uncertainty unresolved to optimize resource use. Second, relaxing exact Bayesian inference creates a manifold of equivalent solutions, reflecting multiple ways to combine imperfect inference with compensatory control. This flexibility can be used to meet additional objectives or constraints without sacrificing performance on the original task. Third, beyond goal attainment, control can be leveraged to counteract estimation errors and steer the system into regimes where representation costs are lower. We empirically demonstrate that these principles generalize beyond the local linear-Gaussian approximation, enabling the solution of nonlinear control problems such as pole balancing and drone stabilization. Together, these results establish a framework for rational computation that extends existing approaches to information-constrained decision-making and offers normative insight into how brains and machines can achieve effective behavior under tight computational constraints.

2401.03717 2026-05-19 cs.LG cs.AI 版本更新

Universal Time-Series Representation Learning: A Survey

通用时间序列表示学习:综述

Patara Trirat, Yooju Shin, Junhyeok Kang, Youngeun Nam, Jihye Na, Minyoung Bae, Joeun Kim, Byunghyun Kim, Jae-Gil Lee

发表机构 * KAIST(韩国延世大学)

AI总结 本文综述了时间序列数据表示学习方法,探讨了深度学习在提取隐藏模式中的优势,并提出了新的分类方法以指导未来研究。

Comments Accepted by ACM Computing Surveys. Extended version: 41 pages, 7 figures

详情
AI中文摘要

时间序列数据存在于现实世界的各个方面,从天空中的卫星到身上的可穿戴设备。通过提取和推断有价值的信息来学习表示对于理解复杂现象的动力学和做出明智决策至关重要。深度学习在无需手动特征工程的情况下展示了在时间序列数据中提取隐藏模式和特征的卓越性能。本文首先提出了一种基于三种基本要素的新分类方法,用于设计最先进的通用表示学习方法。根据该分类法,本文全面回顾了现有研究,讨论了这些方法如何提高学习表示的质量。最后,作为未来研究的指南,本文总结了常用的实验设置和数据集,并讨论了几个有前途的研究方向。相关资源可在https://github.com/itouchz/awesome-deep-time-series-representations上找到。

英文摘要

Time-series data exists in every corner of real-world systems and services, ranging from satellites in the sky to wearable devices on human bodies. Learning representations by extracting and inferring valuable information from these time series is crucial for understanding the complex dynamics of particular phenomena and enabling informed decisions. With the learned representations, we can perform numerous downstream analyses more effectively. Among several approaches, deep learning has demonstrated remarkable performance in extracting hidden patterns and features from time-series data without manual feature engineering. This survey first presents a novel taxonomy based on three fundamental elements in designing state-of-the-art universal representation learning methods for time series. According to the proposed taxonomy, we comprehensively review existing studies and discuss their intuitions and insights into how these methods enhance the quality of learned representations. Finally, as a guideline for future studies, we summarize commonly used experimental setups and datasets and discuss several promising research directions. An up-to-date corresponding resource is available at https://github.com/itouchz/awesome-deep-time-series-representations.

2305.10721 2026-05-19 cs.LG cs.AI 版本更新

Revisiting Long-term Time Series Forecasting: An Investigation on Linear Mapping

重新审视长期时间序列预测:对线性映射的调查

Zhe Li, Shiyi Qi, Yiduo Li, Zenglin Xu

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学深圳研究院)

AI总结 本文研究了长期时间序列预测中线性映射的有效性,揭示了仿射映射在周期信号预测中的关键作用,并探讨了可逆归一化和输入时间 horizon 对模型鲁棒性的影响。

Journal ref Li, Zhe, Shiyi Qi, Yiduo Li, and Zenglin Xu. Revisiting Long-Term Time Series Forecasting: an Investigation on Affine Mapping. Academia AI and Applications 2, no. 2 (2026)

详情
AI中文摘要

引言:长期时间序列预测(LTSF)近年来获得了广泛关注。尽管存在各种专门设计来捕捉时间依赖性的方法,但近期研究表明,甚至一个单一的线性层也能取得竞争性的性能。本文研究了近期LTSF方法的内在有效性,并揭示了仿射映射在周期信号预测中的关键作用。材料和方法:我们对模拟和现实世界的数据集进行了全面实验,以分析最先进模型的组成部分。我们提供了理论分析,解释仿射映射在周期信号预测中的工作机制。我们评估了可逆归一化和输入时间跨度扩展对模型鲁棒性的影响。结果:我们发现(1)仿射映射在常用的基准测试中主导了预测性能,模型从输入到输出学习了相似的转换矩阵;(2)仿射映射能够有效捕捉周期性模式,但在非周期性信号或具有不同周期的时序数据中表现较差;(3)可逆归一化显著增强了趋势预测,通过将非周期性趋势转换为周期性模式;(4)增加输入时间跨度提高了多通道数据的性能。代码可在:https://github.com/plumprc/RTSF获得。结论:我们的发现为LTSF模型的工作机制提供了理论和实验见解,突显了线性方法的优势和局限性。结果表明,未来模型的发展应关注处理跨通道周期变化和非周期性成分。

英文摘要

Introduction: Long-term time series forecasting (LTSF) has gained significant attention in recent years. While various specialized designs exist for capturing temporal dependency, recent studies have shown that even a single linear layer can achieve competitive performance. This paper investigates the intrinsic effectiveness of recent LTSF approaches and reveals the critical role of affine mapping. Materials and methods: We conduct comprehensive experiments on both simulated and real-world datasets to analyze the components of state-of-the-art models. A theoretical analysis is provided to explain the working mechanisms of affine mapping in periodic signal forecasting. We evaluate the impact of reversible normalization and input horizon extension on model robustness. Results: We find that (1) affine mapping dominates forecasting performance across commonly utilized benchmarks, with models learning similar transition matrices from input to output; (2) affine mapping effectively captures periodic patterns but struggles with non-periodic signals or time series with varying periods across channels; (3) reversible normalization significantly enhances trend forecasting by transforming non-periodic trends into periodic-like patterns; (4) increasing input horizon improves performance on multi-channel data with different periods. Code is available at: \url{https://github.com/plumprc/RTSF}. Conclusions: Our findings provide theoretical and experimental insights into the working mechanisms of LTSF models, highlighting both the strengths and limitations of linear approaches. The results suggest that future model development should focus on handling cross-channel period variations and non-periodic components.

2212.02098 2026-05-19 cs.AI 版本更新

A Machine with Short-Term, Episodic, and Semantic Memory Systems

具有短期、事件性和语义记忆系统的机器

Taewoon Kim, Michael Cochez, Vincent François-Lavet, Mark Neerincx, Piek Vossen

发表机构 * Vrije Universiteit Amsterdam(瓦赫宁海姆大学) Technische Universiteit Delft(代尔夫特理工大学)

AI总结 本文提出了一种具有短期、事件性和语义记忆系统的智能体模型,通过知识图谱实现各记忆系统的建模,并在自研环境中验证了该模型在记忆编码、存储与检索上的优势。

Journal ref Proceedings of the AAAI Conference on Artificial Intelligence (2023), 37(1), 48-56

详情
AI中文摘要

受人类显性记忆系统理论启发,本文构建了一个包含短期、事件性和语义记忆系统的智能体模型,每个记忆系统均用知识图谱建模。为评估该系统并分析智能体行为,我们设计并发布了自研强化学习环境

英文摘要

Inspired by the cognitive science theory of the explicit human memory systems, we have modeled an agent with short-term, episodic, and semantic memory systems, each of which is modeled with a knowledge graph. To evaluate this system and analyze the behavior of this agent, we designed and released our own reinforcement learning agent environment, "the Room", where an agent has to learn how to encode, store, and retrieve memories to maximize its return by answering questions. We show that our deep Q-learning based agent successfully learns whether a short-term memory should be forgotten, or rather be stored in the episodic or semantic memory systems. Our experiments indicate that an agent with human-like memory systems can outperform an agent without this memory structure in the environment.

2605.16638 2026-05-19 cs.AI 版本更新

TTE-Flash: Accelerating Reasoning-based Multimodal Representations via Think-Then-Embed Tokens

TTE-Flash: 通过思考-然后-嵌入标记加速基于推理的多模态表示

Jianpeng Cheng, Xian Wu, Jiangfan Zhang, Wentao Bao, Chaitanya Ahuja, Shlok Kumar Mishra, Hanchao Yu, Yang Gao, Fan Xia, Qi Guo, Shaodan Zhai, Xiangjun Fan, Jun Xiao

发表机构 * Meta AI

AI总结 本文提出TTE-Flash模型,通过引入隐式思考标记替代显式推理链,实现多模态表示的高效推理。模型在MMEB-v2基准上优于显式CoT方法,且在零样本评估中展示了可扩展性。

详情
AI中文摘要

近期研究显示,通用多模态嵌入(UME)显著受益于推理链(CoT)推理。在该范式中,生成模型为多模态查询生成显式推理轨迹,最终表示从<eos>嵌入标记提取,该标记同时关注查询和推理。尽管其有效性,生成显式CoT轨迹的计算开销常不可接受。本文提出用隐式思考标记替代显式CoT,这些标记被解释为潜在变量,可生成显式CoT轨迹作为观测变量。通过优化思考标记使用CoT生成损失,随后嵌入标记使用对比损失,我们生成高性能、基于推理的表示,且推理成本恒定。本研究探讨了两个关键架构设计:1)如何从同一LLM主干中提取思考和嵌入标记;2)如何将标记作为两个依赖任务进行训练。我们引入TTE-Flash-2B,一个基于推理的多模态表示模型,在MMEB-v2基准上优于其显式CoT对应物,同时生成可解释的文本和视觉隐式思考标记。此外,跨15个视频数据集的零样本评估揭示了随着思考标记数量增加的扩展行为,并促使基于任务需求的自适应思考预算分配的初步研究。

英文摘要

Recent research has demonstrated that Universal Multimodal Embedding (UME) benefits significantly from Chain-of-Thought (CoT) reasoning. In this paradigm, a generative model produces explicit reasoning traces for a multimodal query, with the final representation extracted from an <eos> embedding token attending to both the query and the reasoning. Despite its effectiveness, the computational overhead of generating explicit CoT traces is often prohibitive. In this work, we propose replacing explicit CoT with latent think tokens, which are interpreted as latent variables that can produce explicit CoT traces as observed variables. By optimizing think tokens using CoT generation loss and subsequent embedding tokens using contrastive loss, we produce high-performance, reasoning-aware representations at a constant inference cost. Our study investigates two key architectural designs: 1) how think and embeddings tokens should be extracted from the same LLM backbone. 2) how the tokens should be trained as two dependent tasks. We introduce TTE-Flash-2B, a reasoning-aware multimodal representation model that outperforms its explicit-CoT counterpart on the MMEB-v2 benchmark, while producing latent think tokens that are interpretable both textually and visually. Furthermore, zero-shot evaluation across 15 video datasets reveals scaling behavior as the number of think tokens increases, and motivating a pilot study of adaptive think budget allocation based on task requirements.

2605.16632 2026-05-19 cs.LG cs.AI cs.LO 版本更新

Learning How to Cube

学习如何求立方

Ferhat Erata, Sam Kouteili, Thanos Typaldos, Timos Antonopoulos, Robert B. Jones, Byron Cook, Ruzica Piskac

发表机构 * Yale University(耶鲁大学) AWS Agentic AI(AWS智能体AI)

AI总结 本文提出一种神经符号后训练框架,通过MCTS数据整理管道和符号启发式方法,使4B参数模型在SAT竞赛基准上取得53的pass@5分数,超越了Claude-Sonnet-4等前沿LLM。

Comments 33 pages, preprint

详情
AI中文摘要

尽管Cube-and-Conquer(C&C)在解决具有挑战性的布尔可满足性(SAT)问题上非常有效,但之前的工作没有展示基于Transformer的模型能够学习有效的求立方启发式方法。我们介绍了一种神经符号后训练框架。我们设计了一个基于MCTS的数据整理管道,利用符号启发式方法在SAT竞赛公式上探索分割决策,生成基于求解器统计信息的偏好数据,并辅以教师模型的推理轨迹。我们的两阶段后训练,监督微调(SFT)后接直接偏好优化(DPO),使4B参数模型在100个SAT竞赛基准上取得53的pass@5分数,超越了前沿LLM如Claude-Sonnet-4(50)并匹配最佳符号启发式(53)。消融实验显示,SFT单独将pass@5提升至51,DPO增加2个基准;对实际首次立方决策的熵/一致消融显示,SFT而非DPO导致根层决策多样性,产生互补的运行覆盖。这表明Transformer可以在传统由符号方法主导的领域中被训练出有效的求立方决策。

英文摘要

Despite the effectiveness of Cube-and-Conquer (C&C) for solving challenging Boolean Satisfiability (SAT) problems, no prior work has shown that transformer-based models can learn effective cubing heuristics. We introduce a neuro-symbolic post-training framework for this task. We design an MCTS-based data curation pipeline that uses symbolic heuristics to explore splitting decisions over SAT competition formulas, producing preference data grounded in solver statistics and augmented with reasoning traces from a teacher model. Our two-stage post-training, supervised fine-tuning (SFT) followed by direct preference optimization (DPO), enables a 4B-parameter model to achieve a pass@5 score of 53 on 100 SAT competition benchmarks, surpassing frontier LLMs such as Claude-Sonnet-4 (50) and matching the best symbolic heuristic (53). Ablations show that SFT alone improves pass@5 from 46 to 51, with DPO adding 2 additional benchmarks; an entropy/agreement ablation on realized first-cube decisions further shows that SFT, not DPO, accounts for the root-level decision diversity that produces complementary per-run coverage over deterministic symbolic methods. This demonstrates that transformers can be trained to make effective cubing decisions in a domain traditionally dominated by symbolic methods.

2605.16623 2026-05-19 cs.CY cs.AI 版本更新

To Trust or Not to Trust: Authors' Response to AI-based Reviews

信任还是不信任:作者对基于AI的评论的回应

César Leblanc, Lukas Picek

发表机构 * École Normale Supérieure(巴黎高等师范学院) Sorbonne University(索邦大学) University of West Bohemia(西波什埃大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过两项研究探讨作者对AI辅助评审的使用和看法,发现大多数作者认为AI反馈有用,但不将其等同于人类评审,且更倾向于在提交前使用AI作为内部工具。

详情
AI中文摘要

大型语言模型日益被讨论和用于协助学术同行评审,但关于作者如何使用和感知AI反馈的实证证据仍有限。本文报告了两项独立试点研究的结果,研究了作者在两个计算机科学会议中对AI辅助评审的使用和看法。在评审发布后,作者被邀请完成一份匿名的评审后问卷,询问AI评审的有用性、可信度、与人类评审的一致性、修改的实用性、感知的不准确性以及同意。最终的数据集包含40篇论文作者的56个可分析响应;封闭式问题使用描述性统计汇总,开放式回答使用归纳主题分析。大多数受访者(83.9%)认为AI评审有用,80.4%报告说AI发现了人类评审未提及的问题。这种感知的附加价值转化为行动:82.1%的受访者报告在最终版本中使用了至少一些AI反馈。然而,作者并不将AI评审视为等同于人类评审。他们普遍认为AI的可信度低于人类评审,尽管25.0%的受访者描述至少一些人类评审不很有用。报告的AI评审问题通常有限:51.8%报告了轻微的不准确,而16.1%报告了明显错误、误导或不相关的评论。对未来发展使用的支持最强当AI被框架为监督或作者控制的工具:96.4%表示在未来的提交中会使用AI作为内部评审工具,89.3%更倾向于提前得知AI将在评审中使用,76.8%更倾向于在使用前获得明确的同意。

英文摘要

Large language models are increasingly discussed and used as tools that may assist with scholarly peer review, but empirical evidence regarding how authors use and perceive AI-based feedback remains limited. This paper reports findings from two independent pilot studies on authors' use and perceptions of AI-based auxiliary review at two computer science venues. After the review release, authors were invited to complete an anonymous post-review questionnaire about the AI review's usefulness, trustworthiness, agreement with human reviews, practical value for revision, perceived inaccuracies, and consent. The final dataset included 56 analyzable responses from authors of 40 papers; closed-ended items were summarized using descriptive statistics, and open-ended responses were analyzed using inductive thematic analysis. Most respondents (83.9%) considered the AI-based review useful, and 80.4% reported that it identified issues not mentioned by human reviewers. This perceived added value translated into action: 82.1% reported using at least some AI feedback in their camera-ready version. However, the authors did not treat the AI review as equivalent to a human review. They generally trusted it less than the human reviews and found human feedback clearer, even though 25.0% described at least some human reviews as not very useful. Reported problems with the AI review were usually limited: 51.8% reported minor inaccuracies, while 16.1% reported clearly incorrect, misleading, or irrelevant comments. Support for future use was strongest when AI was framed as a supervised or author-controlled tool: 96.4% said they would use AI as an internal review tool before future submissions, 89.3% preferred advance notice that AI would be used in review, and 76.8% favored explicit consent before use.

2605.16612 2026-05-19 cs.AI cond-mat.mtrl-sci 版本更新

PRISMat: Policy-Driven, Permutation-Invariant Autoregressive Material Generation

PRISMat:基于策略的、排列不变的自回归材料生成

Claire Schlesinger, Circe Hsu, Peter Schindler, Robin Walters

发表机构 * Khoury College of Computer Sciences(科里学院计算机科学学院) Northeastern University(东北大学) College of Engineering(工程学院)

AI总结 PRISMat通过高效生成晶体片层,提升了材料发现的准确性,其在切开能和工作函数任务中的均绝对误差显著降低。

Comments 10 pages, 8 figures, Under Review at Neurips 2026

详情
AI中文摘要

快速识别具有目标性质的候选材料已成为材料科学中的关键任务。机器学习作为一种替代物理模拟的方法,提供了一种更快、更经济的方式过滤材料,基于其稳定性和其他目标性质,减少达到昂贵合成阶段的候选材料数量。最近,大语言模型(LLMs)已应用于此角色,但这些模型参数密集且计算成本高,训练和推理时都不可行,不适合高通量任务。这种低效性源于语言模型的过度参数化以及将材料生成作为序列学习问题的困难。在本文中,我们提出了PRISMat,一种成本效益高、排列不变的模型,解决了这些限制。我们显示,尽管PRISMat推理时间更短,但其在基于关键材料表面性质生成晶体片层方面能够超越LLMs。在目标材料发现中,我们实现了切开能和工作函数任务的均绝对误差分别为0.188 eV/A$^2$和2.79 eV,将下一个最佳模型的误差降低了4倍。

英文摘要

Rapid identification of candidate materials with target properties has become a key task in materials science. Machine learning has emerged as an alternative to physics-based simulation, offering a faster and cheaper way to filter materials based on their stability and other target properties, reducing the number of candidates that reach the costly synthesis stage. Recently, Large Language Models (LLMs) have been applied to this role, but these models are parameter-heavy and computationally expensive both during training and at inference time, making them unsuitable for high-throughput tasks. This inefficiency stems from both the large over-parameterization of language models and the difficulty of framing material generation as a sequence learning problem. In this paper, we present PRISMat, a cost-effective, permutation-invariant model, which addresses these limitations. We show that PRISMat, despite taking less time for inference, is able to outperform LLMs in generating crystal slabs conditioned on critical materials' surface properties. In targeted material discovery, we achieve mean absolute errors of 0.188 eV/A$^2$ and 2.79 eV for cleavage energy and work function tasks, respectively, reducing the error of the next best model by 4$\times$.

2605.16605 2026-05-19 cs.HC cs.AI 版本更新

PromptDecipher: Supporting AI Tutor Authoring Through Editable Simulated Interactions

PromptDecipher:通过可编辑的模拟交互支持AI辅导作者

Miina Koyama, Ruiwei Xiao, John Stamper

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 PromptDecipher通过直接纠正交互重构作者流程,帮助教师成为学习设计师和QA工程师,提升AI辅导作者的质量与效率。

详情
AI中文摘要

聊天机器人长期以来被探索为支持学习的工具,而大型语言模型的最新进展显著扩展了教育者创建AI辅导聊天机器人的平台。然而,有效的作者需求不仅仅是编写系统提示,还需要教育者扮演学习设计师、AI交互设计师和QA工程师。然而,实践中教师很少履行这些角色。我们的形成性研究发现,几乎没有人系统地测试他们的机器人后再部署给学生。为了解决这一差距,我们提出了PromptDecipher,一个系统,将作者流程围绕直接纠正交互重新组织,而不是编写抽象系统提示。教师与实时聊天预览互动并编辑不理想的机器人响应。自动化流程随后分析纠正,提出针对的系统提示重写,并在预定义的测试场景中验证更改。这将QA作为首要活动,并 scaffolds 教师在他们通常会跳过的角色中。PromptDecipher将在一个AI教育课程中部署,该课程将有数百名高等教育教师。一个实时原型(https://teacher-prompting.vercel.app/),匿名代码库(https://anonymous.4open.science/r/teacher-prompting-2EDF/),和匿名演示(https://tinyurl.com/las-prompt-decipher-demo)可通过脚注中的链接获取。

英文摘要

Chatbots have long been explored as tools to support learning, and recent advances in large language models have significantly expanded the availability of platforms for educators to author AI tutoring chatbots. Yet effective authorship demands more than writing a system prompt; it requires educators to act as learning designers, AI interaction designers, and QA engineers. In practice, however, teachers rarely fulfill these roles. Our formative study found that virtually none systematically tested their bots before deploying them to students. To address this gap, we present PromptDecipher, a system that restructures the authoring workflow around a direct correction-based interaction rather than writing abstract system prompts, teachers interact with a live chat preview and edit undesirable bot responses. An automated pipeline then analyzes the correction, proposes a targeted system prompt rewrite, and validates the change across pre-defined test scenarios. This enforces QA as a first-class activity and scaffolds teachers in roles they would otherwise skip. PromptDecipher will be deployed in an AI for Educators course enrolling hundreds of higher-education instructors. A live prototype (https://teacher-prompting.vercel.app/), an anonymized codebase (https://anonymous.4open.science/r/teacher-prompting-2EDF/), and anonymized demo (https://tinyurl.com/las-prompt-decipher-demo) are available via links in the footnote.

2605.16602 2026-05-19 cs.HC cs.AI 版本更新

Why Modeling Human Haptic Material Perception with AI Is Difficult

为何用AI建模人类触觉材料感知是困难的

Yasemin Vardar

发表机构 * Delft University of Technology (TU Delft)(代尔夫特理工大学)

AI总结 本文探讨了用AI建模人类触觉材料感知的挑战,指出数据稀缺、评估平台缺乏和模型局限性是主要瓶颈,强调跨学科合作的重要性。

Comments 5 pages, 1 figure, conference

详情
AI中文摘要

触觉在人类通过物理接触感知和识别材料中起着核心作用。尽管数十年的研究,触觉信号转化为有意义感知表征的机制仍不明确,限制了交互系统和智能体的设计。近年来人工智能(AI)的进步为建模和利用触觉数据提供了新机会;然而,触觉因其交互依赖性和多模态特性对当代AI提出了根本挑战。本文认为,AI与触觉的交叉领域进展受限于三个关键瓶颈:(1)触觉大数据集稀缺;(2)缺乏标准化评估平台和感知基准;(3)应用于触觉感知时模型容量和可解释性限制。本文讨论了这些挑战如何阻碍泛化、可重复性和对人类触觉的科学洞察,并回顾了新兴策略以解决这些问题。本文强调了协调、跨学科努力对推动AI系统的重要性,这些系统不仅能实现稳健的触觉感知,还能加深对人类触觉的理解。

英文摘要

Touch plays a central role in how humans perceive and recognize materials through physical contact. Despite decades of research, the mechanisms by which tactile signals are transformed into meaningful perceptual representations remain poorly understood, limiting the design of interactive systems and intelligent agents with human-like haptic perception. Recent advances in artificial intelligence (AI) offer new opportunities to model and exploit tactile data; however, haptics presents fundamental challenges for contemporary AI due to its interaction-dependent, multimodal nature. This position paper argues that progress at the intersection of AI and haptics is constrained by three key bottlenecks: (1) the scarcity of large, diverse, and balanced haptic datasets; (2) the lack of standardized evaluation platforms and perceptual benchmarks; and (3) limitations in model capacity and interpretability when applied to tactile perception. I discuss how these challenges impede generalization, reproducibility, and scientific insight into human touch and review emerging strategies to address them. This paper highlights opportunities for coordinated, cross-disciplinary efforts to advance AI systems that not only perform robust haptic perception but also contribute to a deeper understanding of human touch.

2605.16600 2026-05-19 cs.LG cs.AI cs.CL 版本更新

Where Pretraining writes and Alignment reads: the asymmetry of Transformer weight space

预训练写入,对齐读取:Transformer权重空间的不对称性

Valeria Ruscio, Eli-Shaoul Khedouri, Keiran Thompson

发表机构 * Intuition Machines

AI总结 研究揭示了预训练和对齐在Transformer权重空间中的不对称性,通过分析权重变化在残差流激活子空间和预测子空间中的对齐情况,发现读路径权重集中于注意力输入激活的主方向,而写路径权重在预测子空间中保持各向同性。

详情
AI中文摘要

交叉熵预训练和偏好对齐更新相同的Transformer权重,但留下几何上不同的痕迹。我们通过相对子空间分数探针来刻画这种不对称性,追踪权重变化如何与残差流激活子空间和由去嵌入定义的预测子空间对齐。对齐变化集中在读路径(W_Q,W_K)上,沿着注意力输入激活的主方向,而写路径(W_O,W_2)相对于预测子空间则保持近各向同性。我们通过各向异性梯度积累来解释这种模式:对矩阵W的更新是外积δ_t a_t^T之和,继承自哪一侧的协方差集中。对于读路径矩阵,这一侧是输入激活a_t,其协方差在训练过的Transformer中呈尖峰状,因此产生与目标无关的集中。对于写路径矩阵,相关的一侧是上游梯度δ_t,其各向异性取决于损失。交叉熵提供标准的每样本信号,诱导预训练期间写路径的预测几何;对齐目标通常在写路径上添加很少的进一步集中。我们通过检查点内轨迹、渐进对比目标控制以及闭合形式的秩1干预与匹配方向控制来支持这一解释,为所提出的权重空间几何提供因果证据。

英文摘要

Cross-entropy pretraining and preference alignment update the same transformer weights, but leave geometrically distinct traces. We characterise this asymmetry with a relative-subspace-fraction probe that tracks how weight deltas align with residual-stream activation subspaces and with the prediction subspace defined by the unembedding. Alignment deltas concentrate in the read pathway ($W_Q$, $W_K$), along principal directions of attention-input activations, while remaining near-isotropic in the write pathway ($W_O$, $W_2$) relative to the prediction subspace. We explain this pattern through anisotropic gradient accumulation: updates to a matrix $W$ are sums of outer products $δ_t a_t^\top$, and inherit directional structure from whichever side has concentrated covariance. For read-pathway matrices, this side is the input activation $a_t$, whose covariance is spiked in trained transformers and therefore produces objective-agnostic concentration. For write-pathway matrices, the relevant side is the upstream gradient $δ_t$, whose anisotropy depends on the loss. Cross-entropy supplies the canonical sharp per-sample signal, inducing write-pathway prediction geometry during pretraining; alignment objectives typically add little further write-side concentration. We support this explanation with a within-checkpoint trajectory, a graded contrastive-objective control, and a closed-form rank-1 intervention with matched direction controls, providing causal evidence for the proposed weight-space geometry.

2605.16598 2026-05-19 cs.MA cs.AI 版本更新

GRASP: Graph Agentic Search over Propositions for Multi-hop Question Answering

GRASP:基于命题的图代理搜索用于多跳问答

Stockton Jenkins, Ramya Korlakai Vinayak, Junjie Hu

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 GRASP通过分解多跳查询为依赖感知计划,实现多跳问答中的高准确率与低token使用。在MuSiQue、2WikiMultihopQA和HotpotQA上,GRASP在开放语料检索和长文本推理设置中均表现优异,且token使用更少。

详情
AI中文摘要

GRASP通过将多跳查询分解为依赖感知计划,提高了多跳问答的准确率并降低了token使用。在MuSiQue、2WikiMultihopQA和HotpotQA上,GRASP在开放语料检索和长文本推理设置中均表现出色,且token使用更少。

英文摘要

Agentic retrieval improves multi-hop question answering by giving language models autonomy to iteratively gather evidence. Recent work augments these systems with knowledge graphs for structured traversal, but this combination introduces significant cost: expensive graph construction at index time and compounding token usage at inference time. We introduce Graph Agentic Search over Propositions (GRASP), an agentic system that simultaneously optimizes for high accuracy and minimal token usage in multi-hop question answering. Rather than executing a rigid, singular query, GRASP actively coordinates its retrieval strategy by decomposing multi-hop queries into dependency-aware plans. This enables GRASP to dynamically scale the number of sub-agents according to the complexity of the problem. Each sub-agent resolves its single-hop query by exploring a novel three-layer hierarchical graph of entities, propositions, and passages, using the entity layer for targeted traversal and the proposition layer for high-recall passage retrieval via reciprocal-rank voting. We evaluate GRASP on MuSiQue, 2WikiMultihopQA, and HotpotQA under two settings: open-corpus retrieval and extended context reasoning (LongBench). GRASP achieves the highest QA accuracy in the open retrieval setting on MuSiQue and 2Wiki while using 40-50 percent fewer tokens than IRCoT+HippoRAG2. Furthermore, GRASP leads on EM and F1 across all three datasets in the LongBench setting while using 30 percent fewer tokens than the next most accurate method. Finally, we introduce success economy - the amortized token cost per correct answer, weighted by difficulty - and advocate for efficiency-aware evaluation as a standard practice for agentic QA.

2605.16575 2026-05-19 cs.AI 版本更新

Counterparty Modeling is Not Strategy: The Limits of LLM Negotiators

对手建模不是策略:大语言模型谈判者的局限

Romain Cosentino, Sarath Shekkizhar, Adam Earle, Silvio Savarese

发表机构 * Salesforce AI Research(Salesforce人工智能研究)

AI总结 研究探讨了大语言模型在多属性谈判中的表现,发现其能建模对手偏好但无法有效转化为战略谈判,最终协议受初始锚定影响大。

详情
AI中文摘要

谈判需要比推测对方需求更进一步:利用该信息在多个回合中做出有利的报价和反报价。我们研究了大语言模型(LLM)代理在受控的多属性讨价还价环境中的表现。发现当前LLM代理能建模对手偏好,但无法可靠地将此知识转化为战略谈判。当给予谈判伙伴偏好信息时,代理在推理轨迹早期准确建模,但此信息并未可靠改善知情方的收益。回合级分析显示原因:代理常回应他们认为对手重视的事物,但不一致地在自身高价值属性上获得收益。卖家总体更让步,且在不对称信息条件下,知情方常做出更弱补偿的让步。由于代理未能利用此底层效用结构获得战略优势,最终协议严重受表面初始锚定影响,而非实际效用权重。最后,要求代理在报价前明确陈述让步-互惠交易使单个回合看起来更战略化,但最终未能提高最终协议的效率。

英文摘要

Negotiation requires more than inferring what the other side wants: it requires using that information to make advantageous offers and counteroffers over multiple turns. We study whether large language model (LLM) agents do this in a controlled multi-attribute bargaining environment. We find that current LLM agents can model a counterparty's preferences, but do not reliably turn that knowledge into strategic bargaining. When given negotiating partner preference information, agents model it accurately and early in their reasoning traces, yet this does not reliably improve outcomes for the informed side. Turn-level analyses show why: agents often respond to what they believe the counterparty values, but do not consistently pair those moves with gains on their own high-value attributes. Sellers are more accommodating overall, and in asymmetric-information conditions, the informed side often makes the more weakly compensated concessions. Because agents fail to leverage this underlying utility structure for strategic advantage, their final agreements are heavily dictated by surface-level opening anchors rather than actual utility weights. Finally, requiring agents to explicitly state concession-for-reciprocity trades before making an offer makes individual turns look more strategic, but ultimately fails to improve the efficiency of the final agreements.

2605.16573 2026-05-19 cs.LG cs.AI physics.flu-dyn 版本更新

Wavelet Flow Matching for Multi-Scale Physics Emulation

小波流匹配用于多尺度物理模拟

Gabriele Accarino, Juan Nathaniel, Carla Roesch, Pierre Gentine, Sara Shamekh, Duncan Watson-Parris, Viviana Acquaviva

发表机构 * Department of Earth and Environmental Engineering(地球与环境工程系) Columbia University(哥伦比亚大学) University of Edinburgh(爱丁堡大学) Courant Institute of Mathematical Sciences(数学科学学院) New York University(纽约大学) Scripps Institution of Oceanography(斯克里普斯海洋研究所) Halıcıoğlu Data Science Institute(哈利奇数据科学研究所) University of California San Diego(加州大学圣地亚哥分校) CUNY New York City College of Technology(纽约市立大学纽约技术学院) Lamont-Doherty Earth Observatory(拉蒙特-多伊蒂地球观测站)

AI总结 本文提出小波流匹配方法,通过在多尺度小波空间中直接进行最优传输,解决多尺度物理系统模拟中稳定性与精度的平衡问题,实现更高效的生成式模拟。

详情
AI中文摘要

准确模拟由偏微分方程 governing 的多尺度物理系统需要保持长期自回归滚动的稳定性同时保留细尺度结构的模型。确定性模拟器产生过于平滑的预测,而生成方法能更好地捕捉细节但成本高。潜在空间生成模型作为折中方案,但需额外训练自动编码器。我们提出小波流匹配(WFM),一种新型生成模拟器,通过在多尺度小波空间中直接进行最优传输,克服了当前成本与能力之间的权衡。WFM 不学习潜在压缩,而是利用 U-Net 的层次结构,共同预测指定小波表示的传输速度。在三个具有挑战性的混沌流体动力学系统上,WFM 在长期稳定性、准确性和频谱一致性方面优于现有最佳模型。我们的结果清楚地表明,小波空间作为无训练的表示,在复杂物理动态的生成模拟中是有效的。

英文摘要

Accurate emulation of multi-scale physical systems governed by PDEs demands models that remain stable over long autoregressive rollouts while preserving fine-scale structures. Deterministic emulators produce overly-smoothed predictions, while generative approaches better capture details but are costly. Latent-space generative models have emerged as a compromise but with the additional cost of separately pre-trained autoencoders. We propose Wavelet Flow Matching (WFM), a novel generative emulator that overcomes current trade-offs between cost and skill by performing optimal-transport directly in the multi-scale wavelet space. Rather than learning a latent compression, WFM leverages the hierarchical structure of a U-Net to jointly predict transport velocities of a prescribed wavelet representation. On three challenging systems of chaotic fluid dynamics, WFM achieves superior long-horizon stability, accuracy and spectral coherence compared to state-of-the-art models. Our results clearly position the wavelet space as an effective training-free representation for generative emulation of complex physical dynamics.

2605.16571 2026-05-19 stat.ML cs.AI cs.LG 版本更新

Isotonic Survival Regression: Calibrated Survival Distributions from Deep Cox Models

非递减生存回归:从深度Cox模型中校准生存分布

Anchit Jain, Kevin Zhang, Stephen Bates

发表机构 * EECS, MIT(MIT电子工程与计算机科学系)

AI总结 本文提出一种非递减回归方法,用于校准深度Cox模型的生存概率,通过理论保证和实验验证提升模型实用性。

详情
AI中文摘要

时间到事件数据在生命科学和工程中普遍存在,但通常伴随删失,这使得标准机器学习方法的应用复杂化。深度Cox模型因能优雅处理删失并可与无结构数据如临床文本报告、基因组序列和病理图像结合而成为分析时间到事件数据的流行方法。然而,其预测的生存概率往往校准不良,限制了实际应用。本文提出了一种新颖的后验校准方法,利用非递减回归来改进预测生存概率而不影响判别能力。我们建立了有利的理论保证,包括双重鲁棒性属性和渐近校准。在合成和真实世界临床数据上的实验展示了我们方法的实证有效性。

英文摘要

Time-to-event data is widespread across the life sciences and engineering, but it is typically encountered together with censoring, which complicates the application of standard machine learning methods. Deep Cox models have emerged as a popular method for analyzing time-to-event data because they gracefully handle censoring and can be used with unstructured data such as clinical text reports, genomic sequences, and pathology images. However, their predicted survival probabilities are often poorly calibrated, thus limiting their practical utility. In this paper, we propose a novel post hoc calibration method for Deep Cox models that uses isotonic regression to refine predicted survival probabilities without affecting discriminative power. We establish favorable theoretical guarantees, including a double-robustness property and asymptotic calibration. Experiments on synthetic and real-world clinical data demonstrate the empirical effectiveness of our method.

2605.16568 2026-05-19 cs.AI 版本更新

Scalable Uncertainty Reasoning in Knowledge Graphs

知识图谱中的可扩展不确定性推理

Jingcheng Wu

发表机构 * University of Stuttgart, Stuttgart, Germany(斯图加特大学)

AI总结 本文提出模块化框架,通过定制技术处理知识图谱中不确定性三个层次:概率属性值、概率三元组存在性和不完整模式知识,旨在平衡语义精度与计算可行性。

Comments 14 pages. Preprint of a paper accepted at the ESWC 2026 PhD Symposium

详情
AI中文摘要

知识图谱对于语义数据整合至关重要。它们所建模的现实数据往往本质上具有不确定性。在知识图谱中,不确定性表现为三个不同的层次:不精确的属性值、概率三元组存在性和不完整的模式知识。然而,当前语义网络标准缺乏对这种不确定性的原生支持,而简单的扩展通常会导致计算不可行。本文旨在开发一个模块化框架,通过定制技术分别处理每个层次:(1)定义概率字面量和对应的查询代数用于连续属性;(2)一种基于编译的框架将SPARQL溯源转换为可计算的概率电路以处理不确定的三元组;(3)拓扑感知的几何嵌入用于统计模式推理。核心假设是专门的推理机制,即代数、逻辑和几何方法,能够协调语义精度与计算可行性。

英文摘要

Knowledge Graphs are pivotal for semantic data integration. The real-world data they model is often inherently uncertain. Within knowledge graphs, uncertainty manifests in three distinct levels: imprecise attribute values, probabilistic triple existence, and incomplete schema knowledge. However, current Semantic Web standards lack native support for reasoning over such uncertainty, and naïve extensions often incur computational intractability. In this thesis, I aim to develop a modular framework that addresses each level through tailored techniques: (1) defining probabilistic literals and a corresponding query algebra for continuous attributes; (2) a compilation-based framework transforming SPARQL provenance into tractable probabilistic circuits for uncertain triples; and (3) topology-aware geometric embeddings for statistical schema reasoning. The central hypothesis is that specialized reasoning mechanisms, namely algebraic, logical, and geometric approaches, can reconcile semantic precision with computational tractability.

2605.16567 2026-05-19 cs.LG cs.AI cs.DB 版本更新

Automatic Unsupervised Ensemble Outlier Model Selection--Extended Version

自动无监督集成异常检测模型选择——扩展版

Hong-Phuc Phan, Tuan-Anh Vu, Tung Kieu, Son Ha Xuan, Bin Yang, Christian S. Jensen

发表机构 * Department of Software Engineering, FPT University, Vietnam(FPT大学软件工程系) Department of Information Technology, Can Tho University of Technology, Vietnam(胡志明市技术大学信息科技系) School of Business, RMIT University, Vietnam(RMIT大学商学院) Department of Computer Science, Aalborg University, Denmark(阿阿尔堡大学计算机科学系) School of Data Science and Engineering, East China Normal University, China(华东师范大学数据科学与工程学院)

AI总结 本文提出MetaEns框架,通过学习预测边际增益模型,自动选择高质异常检测模型集成,无需标注数据,实验显示其在39个真实数据集上表现优异。

Comments 25 pages. An extended version of "Automatic Unsupervised Ensemble Outlier Model Selection" accepted at ICML 2026

详情
AI中文摘要

无监督异常检测因其无需标注数据而具有吸引力。此外,多模型集成可提高检测鲁棒性。然而,无标注数据下构建集成具有挑战性。简单集成可能因冗余或不可靠的检测模型导致饱和问题。我们提出MetaEns,一种自动无监督框架,用于选择异常检测模型的集成。利用标注元数据集,MetaEns学习预测边际增益模型,估计添加候选模型到部分构建集成的预期改进。在测试时,该学习信号结合子模函数启发的代理目标,通过多样性感知折扣和家族级风险正则化,实现贪心顺序选择与自适应提前停止。结果表明,MetaEns可在无真实标签的情况下构建紧凑高质量的集成。在39个真实数据集上的实验显示,MetaEns在平均精度上优于现有无监督选择器和集成基线,同时使用更少的模型。

英文摘要

Unsupervised outlier detection is attractive because it eliminates the need for labeled data. Moreover, forming multi-model ensembles can improve detection robustness. However, composing an ensemble without labeled data is challenging. Naively composed ensembles can suffer from ensemble saturation, where redundant or unreliable detection models degrade performance and incur unnecessary computation. We propose MetaEns, an automatic unsupervised framework for selecting ensembles of outlier detection models. Using labeled meta-datasets, MetaEns learns a model that predicts marginal ensemble gains, estimating the expected improvement from adding a candidate model to a partially constructed ensemble. At test time, this learned signal is combined with a submodular-inspired proxy objective that enforces diminishing returns through diversity-aware discounting and family-level risk regularization, thereby enabling greedy sequential selection with adaptive early stopping. As a result, MetaEns constructs compact, high-quality ensembles without access to ground-truth labels. Experiments on 39 real-world datasets show that MetaEns consistently outperforms state-of-the-art unsupervised selectors and ensemble baselines, achieving higher average precision while using fewer models.

2605.16552 2026-05-19 cs.AI cs.RO 版本更新

From Prompts to Protocols: An AI Agent for Laboratory Automation

从提示到协议:一种用于实验室自动化的AI代理

Angelos Angelopoulos, James F. Cahoon, Ron Alterovitz

发表机构 * Department of Computer Science, University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系) Department of Chemistry, University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校化学系)

AI总结 本文提出一种整合大语言模型与实验室编排的AI代理,使科学家能通过自然语言创建和监控自动化实验协议,提升实验效率与准确性。

详情
AI中文摘要

自动化科学实验室能加快、安全、准确且可重复地执行协议,加速新材料和药物的发现与测试。然而,设置和运行自主实验室需要协调多种仪器和机器人,迫使科学家编写代码、管理配置文件和导航复杂软件架构。本文提出一种AI代理架构,整合大语言模型与实验室编排,使科学家能通过自然语言交互式创建和监控自动化实验协议。该代理集成到实验编排系统(EOS)中,通过代理循环实现自动验证和错误纠正,支持完整的实验生命周期:创建协议、运行和监控协议及闭环优化活动,以及分析结果。一个可视化图编辑器将协议渲染为同步于AI代理协议表示的交互式节点图,使在AI协助和手动协议构建之间无缝切换。在三个覆盖化学、生物学和材料科学的模拟自动化实验室上评估,该AI代理实现了97%的一次性协议生成成功率,并将所需界面操作减少了数量级。

英文摘要

Automating science laboratories enables faster, safer, more accurate, and more reproducible execution of protocols, accelerating the discovery and testing of new materials, drugs, and more. However, setting up and running autonomous labs requires coordinating numerous instruments and robots, forcing scientists to write code, manage configuration files, and navigate complex software infrastructure. We present an AI agent architecture that integrates large language models with laboratory orchestration, enabling scientists to interactively create and monitor automated lab protocols using natural language. Integrated into the Experiment Orchestration System (EOS), the AI agent operates under an agentic loop with automated validation and error correction, and supports the complete experimental lifecycle: creating protocols, running and monitoring both protocols and closed-loop optimization campaigns, and analyzing results. A visual graph editor renders protocols as interactive node-based diagrams synchronized with the AI agent's protocol representation, enabling seamless alternation between AI-assisted and manual protocol construction. Evaluated on three simulated automated labs spanning chemistry, biology, and materials science, the AI agent achieves a 97% first-attempt protocol generation success rate and an order of magnitude reduction in required interface actions.

2605.16535 2026-05-19 cs.IR cs.AI 版本更新

RAPT: Retrieval-Augmented Post-hoc Thresholding for Multi-Label Classification

RAPT:基于检索的后处理阈值法用于多标签分类

Lasal Jayawardena, Nirmalie Wiratunga, Ikechukwu Nkisi-Orji, Darren Nicol

发表机构 * Robert Gordon University(罗伯特·戈登大学) William Nicol (Aberdeen) Limited(威廉·尼科尔(阿伯丁)有限公司)

AI总结 RAPT通过检索增强的方法改进多标签分类中的标签集选择,无需重新训练模型,有效应对OCR噪声和标签不平衡等问题,提升预测性能和效率。

详情
AI中文摘要

工业多标签文档理解流程中,候选标签的评分和阈值或排序用于形成每个文档的标签集。这一早期选择步骤直接影响下游信息提取的准确性及相关验证工作。实际中,OCR噪声、标签不平衡、实例依赖的标签数量和不对称的误差成本使全局评分阈值变得脆弱且难以维护。本文提出RAPT,一种面向部署的检索增强评分阈值包装器,用于后处理以改进标签集选择而不重新训练基础分类器。RAPT是一种模型无关的包装器:任何提供文档表示用于相似性搜索和每个标签置信度分数的预测器都可以使用,包括度量学习编码器和微调的Transformer分类器。对于每个查询文档,给定分类器的评分向量,RAPT检索相似文档阈值情况(案例)并利用其结果适应查询的标签集选择阈值。适应过程通过局部聚合邻近解(例如平均标签数量、截止校准)来选择最终的标签集。评估比较了多标签分类器(度量学习器和Transformer)结合RAPT与全局和标签级阈值基线,以及少样本LLM。在工业数据集和六个公开基准上,RAPT一致优于全局和标签级静态阈值基线。在工业设置中,RAPT在度量学习器上达到最佳预测性能,宏F1得分为0.87,而微调的Transformer变体平均得分为0.775宏F1,优于少样本LLM基线(K=5)2倍,且需要至少115倍更少的推理时间和13.5倍更少的GPU内存。

英文摘要

Industrial multi-label document understanding pipelines score candidate labels and threshold or rank them to form a label set per document. This early selection step directly affects the accuracy of downstream information extraction from the document, as well as the associated verification effort. In practice, OCR noise, label imbalance, instance-dependent label cardinality, and asymmetric error costs make global score thresholds brittle and hard to maintain as document formats evolve. We present RAPT, a deployment-oriented retrieval-augmented score thresholding wrapper, applied post-hoc to improve label set selection without retraining the underlying classifier. RAPT is a model-agnostic wrapper: any predictor that provides document representations for similarity search and per label confidence scores can be used, including metric learning encoders and fine-tuned transformer classifiers. For each query document, given a classifier's score vector, RAPT retrieves similar document thresholding situations (cases) and adapts the query's label set selection threshold using their outcomes. The adaptation selects the final label set by locally aggregating neighbour solutions (e.g. average label count, cutoff calibration). Evaluation compared multi-label classifiers (metric learners and transformers) combined with RAPT against global and label-wise thresholding baselines, and against few-shot LLMs. Across an industrial dataset and six public benchmarks, RAPT consistently outperformed global and label-wise static thresholding baselines. In the industrial setting, RAPT achieved its best predictive performance with metric learners, reaching 0.87 Macro-F1, while fine-tuned transformer variants on average achieved 0.775 Macro-F1, outperforming fewshot LLM baselines (K = 5) by 2x and requiring at least 115x less inference time and 13.5x less GPU memory.

2605.16528 2026-05-19 cs.CY cs.AI 版本更新

Inventorship in AI-Assisted Inventions: Designing an Experiment to Shape Case Law

人工智能辅助发明中的发明人归属:设计实验以塑造判例法

Yevhenii Shchetynin, Duygu Usta, Bryan Khan

发表机构 * University of Turin(都灵大学)

AI总结 本文探讨人工智能辅助发明中发明人归属问题,提出通过实验生成相关判例法,以明确AI工具在发明过程中的贡献及人类发明人的认定标准。

详情
AI中文摘要

最新的人工智能进步对知识产权法提出了新挑战,特别是在人工智能辅助发明中的发明人归属问题。尽管大多数司法管辖区只允许自然人被视为发明人,但如何处理人工智能辅助发明仍存争议。主要挑战在于缺乏相关判例法。本文提出实验条件以生成相关判例法,通过涉及AI专家的 stakeholders 参与,提出实验方法和案例选择策略,以确定衡量人类在人工智能辅助发明中贡献的有效方法。

英文摘要

The latest improvements in artificial intelligence (AI) raise new challenges for intellectual property laws, particularly concerning the inventorship issue in AI-assisted inventions - that is, those in which AI is used in the inventive process. While most jurisdictions allow only a natural person to be considered the inventor, the question of how to deal with AI-assisted inventions remains relevant. Namely, what is the nature and contribution of AI tools in an AI-assisted invention that would prevent a human from being recognized as its inventor? The main challenge in addressing this question is the lack of case law on the issue. It is reasonable to assume that with the development of AI and the growing interest in its use in the inventive process, new cases will naturally arise, which in turn will harmonize and address the inventorship issue in AI-assisted inventions to some extent. However, this process will take significant time and may not keep pace with the rapid development of AI, nor fully address the new problems that arise alongside AI advancements. This research proposes the conditions of an experiment to create relevant case law. This experiment could be initiated by society, involving stakeholders specializing in AI. The article also proposes a methodology for conducting the experiment and selecting cases that best reflect the current state of AI use in the inventive process. Conducting such an approach will help identify the most effective methods for measuring human contribution to AI-assisted inventions when determining inventorship.

2605.16527 2026-05-19 cs.LG cs.AI 版本更新

Hypergraph Pattern Machine: Compositional Tokenization for Higher-Order Interactions

超图模式机:用于高阶交互的组合分词

Kyrie Zhao, Zehong Wang, Tianyi Ma, Fang Wu, Xiangru Tang, Pietro Lio, Sheng Wang, Yanfang Ye

发表机构 * University of Notre Dame(内布拉斯加大学) Stanford University(斯坦福大学) Yale University(耶鲁大学) University of Cambridge(剑桥大学) University of Washington(华盛顿大学)

AI总结 本文提出超图模式机,通过学习子集的组合模式,改进高阶交互的建模,从而在超图基准和真实案例中取得更好效果。

详情
AI中文摘要

超图模型高阶关系,从药物处方到推荐。数据中的核心结构信号是交互组合性:高阶关系是否是组合、涌现或抑制性的。在多药治疗中,制度决定是否停药、保留或排除:组合药物三元组可安全简化,涌现三元组需联合所有药物,抑制三元组标志干扰现有交互的药物。现有超图学习方法仅传播观测超边消息,未建模此信号,导致危险组合被误分类。为此,本文提出超图模式机(HGPM),从消息传递转向学习子集的组合模式。它分词组合子集,组织成包含 DAG,并训练掩码重建的包含意识 Transformer。在十个超图基准上,HGPM 匹配或超越现有方法。值得注意的是,在真实不良事件预测案例中,HGPM 正确识别出抑制副作用的药物添加,而现有方法无法区分。代码和数据见 https://github.com/KryieZhao/HGPM.git.

英文摘要

Hypergraphs model higher-order relations that drive real-world decisions, from drug prescriptions to recommendations. A central structural signal in such data, beyond what pairwise relations can express, is interaction compositionality: whether a higher-order relation is compositional, emergent, or inhibitory with respect to its observed or unobserved sets. In polypharmacy, the regime decides whether a drug should be dropped, kept, or excluded: a compositional drug triple can be safely simplified, an emergent triple requires all drugs jointly, and an inhibitory triple flags a drug that disrupts an existing interaction. However, existing hypergraph learning methods, which merely propagate messages over observed hyperedges, leave this compositional signal unmodeled, allowing dangerous drug combinations to slip through and be misclassified. To this end, we propose the Hypergraph Pattern Machine (HGPM), shifting the paradigm from message passing to learning the compositional pattern of subsets. It tokenizes compositional subsets, organizes them in an inclusion DAG, and trains an inclusion-aware Transformer under masked reconstruction. On ten hypergraph benchmarks, HGPM matches or exceeds state-of-the-art methods. Notably, in a real adverse-event prediction case, HGPM correctly identifies the drug addition that inhibits the side effect among feature-identical candidates, a discrimination existing methods cannot make. The code and data are in https://github.com/KryieZhao/HGPM.git.

2605.16516 2026-05-19 cs.HC cs.AI cs.CL cs.CY 版本更新

Alignment Drift in Long-Term Human-LLM Interaction: A Mechanism-Oriented Framework

长期人类-大语言模型交互中的对齐漂移:一种机制导向的框架

Xintong Yao

发表机构 * Xintong Yao(姚新同)

AI总结 本文提出一种机制导向的框架,用于描述长期人类-大语言模型交互中的对齐漂移现象,通过反馈回路和子模式选择解释漂移的发展过程,并将对齐漂移视为递归互动过程而非孤立模型失败。

Comments 16 pages, 1 appendix

详情
AI中文摘要

长期与基于大语言模型的系统交互可能导致对齐漂移:一种渐进过程,其中系统输出逐渐受用户当前消息的约束减少,而更多受先前交互历史影响,尽管仍显得有帮助、连贯和响应。此过程难以检测,因为用户的主观体验可能随着系统变得更熟悉、有用和适应而改善。现有研究主要集中在短期任务表现、孤立输出或单实例对齐问题,导致慢性和累积的交互层面动态未被充分描述。本文提出一种机制导向的框架来描述对齐漂移。该框架定义信号A和信号B的区别,解释漂移如何通过反馈回路和子模式选择发展,将过程分为三个互动阶段,并识别控制漂移的边界条件。通过将对齐漂移视为递归互动过程而非孤立模型失败,本文为研究长期人类-系统交互提供了概念基础。

英文摘要

Long-term interaction with LLM-based systems may produce alignment drift: a gradual process in which system outputs become less constrained by the user's current message and more shaped by prior interaction history, while still appearing helpful, coherent, and responsive. This process is difficult to detect because the user's subjective experience may improve as the system becomes more familiar, useful, and attuned. Existing research on human-LLM interaction has largely focused on short-term task performance, isolated outputs, or single-instance alignment problems, leaving slow and cumulative interaction-level dynamics undercharacterized. This paper proposes a mechanism-oriented framework for describing alignment drift. The framework defines the distinction between signal A and signal B, explains how drift develops through feedback loops and sub-pattern selection, divides the process into three interactional regimes, and identifies boundary conditions for controlling drift. By framing alignment drift as a recursive interactional process rather than an isolated model-side failure, the paper provides a conceptual basis for studying long-term human-system interaction.

2605.16514 2026-05-19 cs.RO cs.AI 版本更新

No Plan, Yet Human: A Reactive Robotics Model Predicts Human Planning Failures on a Clinical Task

无计划,却有人类:一种反应式机器人模型预测临床任务中人类计划失败

Michael Migacev, Vito Mengers, Antonia Köngeter, Oliver Brock

发表机构 * Robotics and Biology Laboratory, Technische Universität Berlin, Germany(技术大学柏林机器人与生物学实验室,德国) Science of Intelligence, Research Cluster of Excellence, Berlin, Germany(智能科学,卓越研究集群,柏林,德国) Robotics Institute Germany(德国机器人研究所)

AI总结 该研究利用反应式梯度下降框架AICON,通过塔罗伦敦测试揭示人类计划能力下降时的反应模式,发现其能更准确预测24个问题的难度排序,并在留出验证中表现优异,揭示了生物系统组织方式的普遍规律。

详情
AI中文摘要

理解为何某些顺序规划问题比其他问题更难需要超越平均性能的模型。这些模型应捕捉问题难度的具体模式,并理想情况下以与人类计划能力下降时相同的方式失败。我们应用为机器人操作开发的AICON反应式梯度下降框架,应用于塔罗伦敦测试,该测试用于评估帕金森病、轻度认知障碍和中风患者的规划能力。在不进行任何前瞻规划或了解人类认知的情况下,AICON在24个问题上更准确地再现了人类的细粒度难度排序,优于结构任务参数,并在留出验证中泛化到新问题。关键的是,AICON在计划能力下降的群体中优于计划基线,而计划基线更好地捕捉健康对照组。这种分离由原始AICON论文预测,该论文指出模型的失败模式与帕金森患者在目标层次结构上挣扎但不移动计数的情况相似。这表明,随着计划能力的下降,人类行为会转向AICON所建模的反应模式。这一发现扩展了更广泛的模式:AICON最初为机器人开发,现在能捕捉生物行为在感知、眼动和顺序规划方面的特征,表明其核心抽象反映了生物系统组织方式的真实特性。

英文摘要

Understanding why some sequential planning problems are harder than others requires models that go beyond average performance. They should capture the specific pattern of which problems are hard, and ideally fail in the same way people do when planning capacity is reduced. We apply AICON, a reactive gradient-descent framework developed for robotic manipulation, to the Tower of London test, a cognitive test used to assess planning in Parkinson's disease, mild cognitive impairment, and stroke. Without any lookahead planning or knowledge of human cognition, AICON reproduces the fine-grained human difficulty ordering across 24 problems better than structural task parameters and generalizes to held-out problems in a leave-two-out evaluation. Crucially, AICON outperforms a planning baseline for groups with reduced planning capacity while the planning baseline better captures healthy controls. This dissociation was predicted by the original AICON paper, which noted that the model's failure modes resemble those of Parkinson's patients who struggle with goal hierarchies but not move counts. This suggests that as planning capacity is reduced, human behavior shifts toward the reactive mode AICON models. The finding extends a broader pattern: AICON, originally built for robotics, now captures aspects of biological behavior across perception, eye movements, and sequential planning, suggesting its core abstraction reflects something real about how biological systems are organized.

2605.16508 2026-05-19 cs.CL cs.AI 版本更新

The Scaling Laws of Skills in LLM Agent Systems

大语言模型代理系统中技能的扩展规律

Charles Chen, Qiming Yu, Yuhang Gu, Zhuoye Huang, Hanjing Li, Hongyu Liu, Simin Liu, Jinhao Liu, Dengyun Peng, Jiangyi Wang, Zheng Yan, Fanqing Meng, Ethan Qin, Carl Che, Mengkang Hu

发表机构 * Evolvent AI Team(Evolvent AI团队)

AI总结 研究揭示了大规模代理系统中技能扩展的双重规律:路由准确性随库大小对数衰减,执行准确性通过联合路由乘法提升下游任务表现,二者通过路由衰减斜率参数耦合,优化后显著提升性能。

Comments Technical Report

详情
AI中文摘要

随着代理系统规模扩大,技能积累为大规模可重用库,但其扩展规律仍不明确。在15个前沿LLM、1141个现实技能及超300万次路由或执行决策中,发现两个耦合规律。路由规律:单步路由准确性随库大小对数衰减(R²>0.97),错误从局部技能竞争发展到跨家族漂移并被过于通用的'黑洞技能'捕获。执行规律:在状态实现前,联合路由近似乘法,正确执行可提升困难下游任务表现约4倍。单参数路由对数衰减斜率b耦合二者:路由侧拟合预测执行侧救援,显示同一库属性控制预执行崩溃和下游恢复能力。这些结果表明代理性能不仅取决于模型能力,还取决于技能库的结构、粒度和暴露策略。

英文摘要

As agent systems scale, skills accumulate into large reusable libraries, yet their scaling laws remain poorly understood. Across 15 frontier LLMs, 1,141 real-world skills, and over 3M routing or execution decisions, we identify two coupled laws. Routing law: single-step routing accuracy decays logarithmically with library size ($R^2{>}0.97$ for all models), with errors progressing from local skill competition to cross-family drift and capture by overly general "black-hole skills". Execution law: before state realization, joint routing is approximately multiplicative, whereas correct execution can improve difficult downstream decisions by about $4{\times}$. A single parameter, the routing logarithmic decay slope $b$, couples the two laws: routing-side fits predict execution-side rescue across models, showing that the same library property controls both pre-execution collapse and downstream recoverability. The laws are actionable: law-guided optimization raises held-out routing accuracy from 71.3% to 91.7%, reduces hijack from 22.4% to 4.1%, and transfers directionally to downstream ClawBench and ClawMark execution settings, improving mean pass rate from 49.3% to 61.6% on ClawBench and from 28.4% to 34.5% on ClawMark. These results show that agent performance depends not only on model capability, but also on the structure, granularity, and exposure policy of the skill library.

2605.16481 2026-05-19 cs.CV cs.AI 版本更新

Visual Agentic Memory: Enabling Online Long Video Understanding via Online Indexing, Hierarchical Memory, and Agentic Retrieval

视觉代理记忆:通过在线索引、分层记忆和代理检索实现在线长视频理解

Aiden Yiliu Li, Nels Numan, Anthony Steed

发表机构 * University College London(伦敦大学学院)

AI总结 本文提出视觉代理记忆框架,通过在线索引、分层记忆和代理检索实现长视频理解,实验显示其在OVO-Bench和MM-Lifelong数据集上均取得优异成绩。

详情
AI中文摘要

长视频理解需要比大上下文窗口更多的内容,还需要一种记忆机制,决定保留哪些视觉证据,保持其在长时间范围内可搜索,并使后续推理基于可恢复的观察而非压缩的潜在状态。我们提出了视觉代理记忆(VAM),一种无需训练的框架,包含三个组件。在线索引支持在流式约束下选择性证据保留。分层记忆将保留的证据组织成并行表示,使时间上下文与空间观察对齐。代理检索在生成基于证据的答案前搜索、检查和验证候选证据。在OVO-Bench上,VAM在所有报告的基线中取得了最高的RT+BT平均值(68.41),优于使用相同基础MLLM(Gemini 3 Flash,67.46)的端到端方法。在MM-Lifelong train@month的月度分割(105.6小时覆盖51天)上,VAM达到17.11%,仅次于使用GPT-5的ReMA(17.62%)。这些结果表明,长时间视频理解受益于将视觉记忆视为显式、可检查和可查询的基质。代码可在https://github.com/yiliu-li/Visual-Agentic-Memory获取。

英文摘要

Long video understanding requires more than large context windows. It also needs a memory mechanism that decides what visual evidence to retain, keeps it searchable over long horizons, and grounds later reasoning in recoverable observations rather than compressed latent state alone. We propose Visual Agentic Memory (VAM), a training-free framework with three components. Online Indexing supports selective evidence retention under streaming constraints. Hierarchical Memory organises retained evidence in a Parallel Representation that aligns temporal context with spatial observations. Agentic Retrieval searches, inspects, and verifies candidate evidence before producing a grounded answer. On OVO-Bench, VAM achieves the highest RT+BT average (68.41) across all reported baselines, improving over end-to-end use of the same underlying MLLM (Gemini 3 Flash, 67.46). On the month-scale split of MM-Lifelong train@month (105.6 hours over 51 days), VAM reaches 17.11%, second only to ReMA with GPT-5 (17.62%). These results suggest that long-horizon video understanding benefits from treating visual memory as an explicit, inspectable, and queryable substrate. Code is available at https://github.com/yiliu-li/Visual-Agentic-Memory.

2605.16480 2026-05-19 q-bio.BM cs.AI 版本更新

MoleCode unlocks structural intelligence in large language models

MoleCode 解锁大型语言模型中的结构智能

Zhiyuan Yan, Chen Liu, Boxuan Zhao, Kaiqing Lin, Jixiang Zhao, Yimi Wang, Liuzhenghao Lv, Hao Li, Shanzhuo Zhang, Li Yuan, Fanyang Mo

发表机构 * Peking University Shenzhen Graduage School

AI总结 MoleCode 通过引入图显分子语言,使大型语言模型能直接操作分子结构,提升分子推理、编辑、生成和分析任务的性能,尤其在结构受限场景下表现突出。

详情
AI中文摘要

分子是图,但大型语言模型(LLMs)通常通过线性字符串来推理分子。最流行的分子表示SMILES将原子、键、分支和环压缩成紧凑序列,其中拓扑结构是隐含的,迫使LLMs在执行化学操作前重建分子结构。本文介绍MoleCode,一种LLM原生、无需训练、图显的分子语言,其中所有分子组件均以带类型实体和持久标识符表示,并有显式关系。MoleCode使分子拓扑结构在语言上下文中直接可读、可编辑和可审计,使LLM能够操作结构而非从语法中恢复。在分子推理、编辑、生成和分析任务中,这种表征转变在结构访问受限时对前沿LLMs效果最显著:不熟悉的分子、拓扑敏感操作、更大的结构和重复的聚合物。它还改变了推理的分配方式,用更短的、化学导向的推理替代长推理轨迹用于隐含结构重建。在分子优化中,这使能够进行局部、属性对齐的编辑,保持结构相似性。相同的子图-节点-边语法扩展到聚合物、Markush结构、机制式转换和交织的科学文档,包括包含化学信息的科研论文和专利披露,其中化学信息分布于文本和图像中。这些结果表明,科学对象与LLMs之间的接口不应将结构视为从文本中解码的东西。当推理对象是关系时,结构本身应成为语言的一部分。

英文摘要

Molecules are graphs, but large language models~(LLMs) are usually asked to reason about them through linear strings. The most popular molecular representation, SMILES, compresses atoms, bonds, branches and rings into a compact sequence in which topology is implicit, forcing LLMs to reconstruct molecular structure before performing the requested chemical operation. Here we introduce MoleCode, an LLM-native, training-free, graph-explicit molecular language in which all molecular components are represented as typed entities with persistent identifiers and explicit relations. MoleCode makes molecular topology directly readable, editable and auditable within the language context, allowing an LLM to operate on structure rather than recover it from syntax. Across molecular reasoning, editing, generation and analysis tasks, this representational shift improves frontier LLMs most strongly when structural access is limiting: unfamiliar molecules, topology-sensitive operations, larger structures and repetitive polymers. It also changes how inference is allocated, replacing long reasoning traces devoted to implicit structural reconstruction with shorter, more chemically directed reasoning over explicit atoms and bonds. In molecular optimization, this enables localized, property-aligned edits that preserve structural similarity to the starting compounds. The same Subgraph--Node--Edge grammar extends beyond small molecules to polymers, Markush structures, mechanism-style transformations and interleaved scientific documents, including research articles and patent disclosures in which chemical information is distributed across text and images. These results suggest that the interface between scientific objects and LLMs should not treat structure as something to be decoded from text. When the object of reasoning is relational, the structure itself should be part of the language.

2605.16479 2026-05-19 cs.IR cs.AI 版本更新

Policy-Grounded Dynamic Facet Suggestions for Job Search

基于政策的动态面建议用于求职搜索

Dan Xu, Baofen Zheng, Qianqi Shen, Jianqiang Shen, Wenqiong Liu, Chunnan Yao, Ping Liu, Rajat Arora, Kevin Kao, Hsiang Lin, Wanjun Jiang, Yusuke Takebuchi, Jingwei Wu, Wenjing Zhang

发表机构 * LinkedIn Corporation(LinkedIn公司)

AI总结 本文提出动态面建议(DFS)以提高求职搜索中的意图识别和相关职位检索,通过实时个性化语义属性推荐,结合离线分类整理、嵌入检索和小语言模型评分,提升建议精度和用户参与度。

Comments 6 pages

详情
AI中文摘要

求职者常以短且不明确的查询开始搜索。在LinkedIn上,超过80%的与工作相关的查询包含三个或更少的关键词,这使得准确推断用户意图和检索相关职位特别具有挑战性。我们提出了动态面建议(DFS),一种交互式查询细化机制,通过实时揭示基于用户-查询上下文的个性化语义属性来促进意图歧义消除。我们提出了一种基于政策的、检索增强的排名框架用于面建议,包括离线分类整理、基于嵌入的检索前K候选者以及基于提炼的小语言模型(SLM)的候选者评分。系统通过单个token评分、批处理和前缀缓存进行优化,以实现实时服务。离线评估显示生成建议的高精度,而在线A/B测试显示建议参与度和求职结果的显著改进。

英文摘要

Job seekers often initiate search with short, underspecified queries. At LinkedIn, over 80% of job-related queries contain three or fewer keywords, making accurate user intent inference and relevant job retrieval particularly challenging. We present dynamic facet suggestion (DFS), an interactive query refinement mechanism that facilitates intent disambiguation by surfacing personalized semantic attributes conditioned on the joint user-query context in real time. We propose a policy-grounded, retrieval-augmented ranking framework for facet suggestion, comprising offline taxonomy curation, embedding-based retrieval of top-K candidates, and distilled small language model (SLM) based candidate scoring. The system is optimized for real-time serving via pointwise single-token scoring with batching and prefix caching. Offline evaluation demonstrates high precision for generated suggestions, and online A/B tests show significant improvements in suggestion engagement and job search outcomes.

2605.16474 2026-05-19 cs.IR cs.AI 版本更新

LERA: LLM-Enhanced RAG for Ad Auction in Generative Chatbots

LERA:基于大语言模型的生成聊天机器人广告拍卖

Haoran Sun, Xinrui Song, Xinyu Zhang, Zhaohua Chen, Xu Chu, Zhilin Zhang, Chuan Yu, Jian Xu, Bo Zheng, Xiaotie Deng

发表机构 * Peking University(北京大学) Alibaba Group(阿里巴巴集团) Shandong University(山东大学)

AI总结 LERA提出一种两阶段检索生成拍卖框架,通过嵌入粗过滤和LLM提示生成优化广告相关性评分,提升广告选择准确性和多样性。

Comments Work in Progress

详情
AI中文摘要

将广告拍卖机制整合到基于大语言模型(LLM)的聊天机器人中,为商业化提供了重要机会,但需在相关性、效率和用户体验之间取得平衡。最近,Feizi等人和Hajiaghayi等人提出了检索后生成范式,将检索与生成解耦,提供轻量级广告插入和支付确定。然而,当前检索仅依赖文本嵌入相似性,可能导致商业误解和重复插入问题。本文提出LERA,一种针对LLM聊天机器人的两阶段检索生成拍卖框架。第一阶段通过嵌入粗过滤预选少量候选广告商。第二阶段通过精心设计的提示查询LLM,生成候选人的logits作为优化的相关性评分。这些评分与报价结合,关键值支付规则考虑粗过滤和细排名阈值,确保对效用最大化广告商的诚实性。该框架自然扩展到动态对话流中的多个广告插入和长响应。在合成广告商-查询基准上的实验表明,LERA显著提高了广告选择准确性和插入多样性,同时仅引入可控的延迟开销。

英文摘要

The integration of advertising auction mechanisms into large language model (LLM)-based chatbots presents a significant opportunity for commercialization, yet poses unique challenges in balancing relevance, efficiency, and user experience. Recently, Feizi et al.~\citep{feizi2023online} and Hajiaghayi et al.~\citep{hajiaghayi2024ad} outlined a retrieve-then-generate paradigm that decouples retrieval and generation, offering lightweight ad insertion and payment determination. However, current retrieval relies solely on text embedding similarity, which may lead to commercial misinterpretation and issues such as repetitive insertions. In this paper, we propose LERA, a two-stage retrieve-then-generate auction framework tailored for LLM chatbots. In the first stage, embedding-based coarse filtering pre-selects a small set of candidate advertisers. In the second stage, the LLM itself is queried with a carefully designed prompt to produce logits over candidates, which serve as refined organic relevance scores. These scores are combined with bids, and a critical-value payment rule accounts for both the coarse-filtering and fine-ranking thresholds, ensuring truthfulness for utility-maximizing advertisers. The framework naturally extends to multiple ad insertions within dynamic dialogue flows and long responses. Experiments on a synthetic advertiser-query benchmark show that LERA substantially improves ad selection accuracy and insertion diversity while incurring only controllable latency overhead.

2605.16470 2026-05-19 cs.LG cs.AI 版本更新

Strategic Over-Parameterization for Generalizable Low-Rank Adaptation

战略性过参数化以实现通用的低秩适应

Jing Gao, Zhong-Yi Lu, Pan Zhang, Ze-Feng Gao

发表机构 * School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China(1 基础物理与数学科学学院,杭州先进研究院,UCAS,杭州 310024,中国) School of Physical Sciences, University of Chinese Academy of Sciences, No. 19A Yuquan Road, Beijing 100049, China(2 物理科学学院,中国科学院大学,玉泉路19A号,北京 100049,中国) CAS Key Laboratory of Theoretical Physics, Institute of Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China(3 中国科学院理论物理重点实验室,理论物理研究所,中国科学院,北京 100190,中国) School of Physics and Key Laboratory of Quantum State Construction and Manipulation (Ministry of Education), Renmin University of China, Beijing 100872, China(4 物理学院和量子态构造与操控(教育部)重点实验室,中国人民大学,北京 100872,中国)

AI总结 本文提出LoRA-Over框架,通过训练时丰富优化景观并推理时压缩,提升低秩适应的泛化能力,实验显示其在多个任务上优于传统LoRA。

详情
AI中文摘要

本文提出LoRA-Over框架,通过训练时丰富优化景观并推理时压缩,提升低秩适应的泛化能力,实验显示其在多个任务上优于传统LoRA。

英文摘要

Adapting large language models (LLMs) to downstream tasks via full fine-tuning is increasingly impractical due to its computational and memory demands. Parameter-efficient fine-tuning (PEFT) approaches such as Low-Rank Adaptation (LoRA) mitigate this by confining updates to a compact set of trainable parameters, but this aggressive reduction often sacrifices generalization, especially under transfer across heterogeneous tasks and domains. We revisit the tension between parameter efficiency and adaptation capacity, and ask whether the two are truly at odds. We answer in the negative by introducing LoRA-Over, a framework grounded in a simple principle: enrich the optimization landscape during training, then collapse the enrichment at inference. LoRA-Over injects auxiliary parameters into the low-rank adapters during training to broaden the effective hypothesis space, and through a decomposition-based reformulation folds them back into a standard low-rank structure with negligible reconstruction error, keeping inference cost identical to vanilla LoRA. Since not all weight matrices benefit equally from added capacity, we further propose two scheduling strategies, one statically predefined and one dynamically determined at runtime, that direct extra capacity where most needed. We evaluate LoRA-Over on language understanding (GLUE, T5-Base), dialogue (MT-Bench), arithmetic reasoning (GSM8K), and code generation (HumanEval), using LLaMA 2-7B and LLaMA 3.1-8B. Across all benchmarks and scales, LoRA-Over consistently outperforms vanilla LoRA, showing that principled over-parameterization designed to vanish at inference is an effective lever for improving PEFT generalization. Code will be released upon acceptance.

2605.16468 2026-05-19 cs.CV cs.AI cs.CL cs.LG q-bio.NC 版本更新

Mechanistically Interpretable Neural Encoding Reveals Fine-Grained Functional Selectivity in Human Visual Cortex

可解释的神经编码机制揭示人类视觉皮层的精细功能选择性

Idan Daniel Grosbard, Mor Geva, Galit Yovel

发表机构 * Sagol School of Neuroscience(萨戈尔神经科学学院) Blavatnik School of Computer Science and AI(布拉瓦提克计算机科学与人工智能学院) School of Psychological Sciences(心理学科学学院)

AI总结 本文提出MINE框架,通过机制可解释工具揭示自然图像中驱动皮层 voxel 活动的特征,验证了特征对 voxel 响应的因果影响,并揭示了视觉皮层中精细的功能选择性。

Comments 40 pages, 28 figures

详情
AI中文摘要

理解人类视觉的核心目标是揭示驱动神经活动的视觉特征。已有研究利用人工神经网络作为编码模型预测皮层对自然图像的响应,揭示了激活类别选择区域的视觉内容。然而,现有方法多为相关性分析,将编码器视为黑箱,无法确定哪些图像特征驱动每个 voxel 的响应。本文提出机制可解释神经编码(MINE)框架,通过机制可解释工具定位自然图像中驱动毫米级(voxel 级)活动的特征。MINE利用语言对齐的图像表示预测每个 voxel 的响应,并生成语义可解释的特征描述,用于 voxel 的激活。进一步将这些 per-image 特征泛化为 per-voxel 功能轮廓。为验证 per-image 描述,我们显示它们足以生成激发 voxel 响应与原始图像响应匹配的图像,其准确性优于随机或低贡献控制生成的图像。此外,通过反事实插入或移除预测特征,可使激活在预期方向变化,提供因果证据。由 voxel 激活轮廓指导的反事实编辑产生更强的激活变化,表明轮廓忠实捕捉每个 voxel 的选择性。最后,将 MINE 应用于研究充分的类别选择脑区,显示其恢复了已知的类别偏好,同时揭示了每个区域内的精细 voxel 结构。总体而言,我们的结果确立了机制可解释性作为发现和验证神经功能精细假设的路径。

英文摘要

A central goal in understanding human vision is to uncover the visual features that drive neuronal activity. A growing body of work has used artificial neural networks as encoding models to predict cortical responses to natural images, revealing the visual content that activates category-selective regions. However, existing approaches are largely correlational and treat the encoder as a black box, leaving open which image features drive each voxel's response. We introduce Mechanistically Interpretable Neural Encoding (MINE), a framework that opens this black box by applying mechanistic-interpretability tools to localize the features within natural images that drive millimeter-scale (voxel-level) activity. MINE predicts each voxel's response using language-aligned image representations, and produces semantically interpretable descriptions of the features critical for the voxel's activation. We further generalize these per-image features into per-voxel functional profiles. To validate the per-image descriptions, we show they are sufficient to generate images that elicit voxel responses matching the responses to the original images, more accurately than images generated from random or low-attribution controls. Moreover, counterfactually inserting or removing the predicted features from images shifts activation in the expected direction, providing causal evidence. Counterfactual editing guided by the per-voxel activation profiles produces even stronger activation shifts, indicating that the profiles faithfully capture each voxel's selectivity. Finally, we apply MINE to well-studied category-selective brain regions, showing it recovers their known categorical preferences while revealing fine-grained unique voxel structure within each region. Overall, our results establish mechanistic interpretability as a path to discover and causally validate fine-grained hypotheses about neural function.

2605.16464 2026-05-19 cs.CV cs.AI 版本更新

MHMamba: Multi-Head Mamba for 3D Brain Tumor Segmentation

MHMamba:多头Mamba用于3D脑肿瘤分割

Hanjun Tao, Hua Wang, Fan Zhang

发表机构 * Shandong Technology and Business University(山东科技与商务大学) Ludong University(鲁东大学)

AI总结 本文提出MHMamba,结合U型结构与多头状态空间模型,提升3D脑肿瘤分割的长程表示与多模态训练稳定性,实验显示在BraTS数据集上整体准确率和边界平滑度显著提升。

Comments 10 pages, 3 figures, 4 tables

详情
AI中文摘要

脑肿瘤在形态和多模态对比方面具有高度异质性,手动逐层标注耗时且依赖经验,因此需要高效稳定的自动化分割方法。为解决CNN建模长程依赖的局限性和Transformer在3DMRI中的计算和内存开销问题,本文提出多头Mamba(MHMamba)。该方法结合U型架构与多头状态空间模型(Mamba),将通道维度拆分为并行SSM头并进行残差聚合,增强长程表示并提升多模态训练的稳定性,同时保持线性复杂度。为进一步对齐统计信息并增强病变响应,设计了多头输出的通道空间校准模块,并引入适应性融合机制在跳跃连接中动态连接全局语义与局部细节,从而提升边界一致性及小体积病变的检测能力。在BraTS2021和BraTS2023上进行了实验和消融测试,结果显示MHMamba在整体准确率、边界平滑度及肿瘤核心和小体积增强区域的敏感度上实现了稳定显著的提升,同时保持了基于Mamba建模的线性复杂度优势,验证了方法的有效性和通用性。

英文摘要

Brain tumors exhibit high heterogeneity in morphology and multimodal contrast, making manual slice-by-slice de lineation time-consuming and experience-dependent, thus necessitating efficient and stable automated segmentation methods. To address the limitations of CNNs in modeling long-range dependencies, and the heavy computational and memory overhead and inter-block contextual in coherence of Transformers in 3D MRI, this paper proposes Multi-Head Mamba (MHMamba). This method combines a U-shaped architecture with a multi-head state-space model (Mamba), splitting the channel dimension into parallel SSM heads and aggregating them with residuals. This enhances long-range representation and improves the stability of multimodal training while maintaining linear complexity. To further align statistics and enhance lesion response, we designed a channel-space calibration module for multi-head outputs and introduced an adaptive fusion mechanism at skip connections to dynamically connect global semantics with local details, thereby improving boundary consistency and the detection of small-volume lesions. We conducted experiments and ablations on BraTS2021 and BraTS2023. The results showed that MHMamba achieved stable and significant improvements in overall accuracy, boundary smoothness, and sensitivity to tumor core and small-volume enhancement areas, while preserving the linear-complexity advantage of Mamba-based modeling, thus verifying the effectiveness and versatility of the method.

2605.16462 2026-05-19 cs.CR cs.AI 版本更新

Asking Back: Interaction-Layer Antidistillation Watermarks

反向提问:交互层反知识蒸馏水印

Guang Yang, Amir Ghasemian, Fengchen Liu, Zhong Wang, Ninareh Mehrabi, Homa Hosseinmardi

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Lawrence Berkeley National Laboratory(伯克利国家实验室) Meta

AI总结 本文提出交互层反知识蒸馏水印,通过教师模型的行为特征进行水印,验证其在不同模型上的有效性与鲁棒性。

Comments 34 pages, 17 figures

详情
AI中文摘要

检测未经授权的知识蒸馏从部署的LLM API是困难的,因为防守方无法控制攻击者的训练流程或下一个token的logits。现有防御措施针对教师模型的输出token——偏移下一个token分布(绿色列表水印,密码学方案,反知识蒸馏采样)或在生成后重写输出。最近的研究表明,一个改写攻击者可以剥离这些信号而不失去底层知识。我们提出交互层反知识蒸馏水印,将痕迹提高一层,进入教师模型的行为:防守方用一个间歇性诱导行为标记的系统提示包装教师模型——一个显式的后续问题,一个低频变体或一个声明性重述。一个无意识的蒸馏器继承了这种行为,防守方通过黑盒查询进行审计,使用由人类验证的LLM作为判断(Cohen's kappa = 0.84/0.78在强/风格评分表上)。在63个LoRA蒸馏学生下,使用Llama-3.3-70B-Instruct教师(35,343个判断样本),行为水印在Gemma(88.9%)、OLMo(80.9%)、Qwen(45.2%)中转移,相对保真度(H1,H2)。在非自适应DIPPER改写下,鲁棒性分解为教师自身天花板(约66.4%)和学生相对保留(21-112%),OLMo在教师自身之上保留水印(H3,F-Amp)。低密度(约20%)显式和隐式声明性变体在每族基准上转移(H4,F-Style)。一个N=20的实验室研究(预注册拉丁方)显示所有标记变体在基线周围0.22 Likert步内;TOST、Friedman和Bonferroni-Wilcoxon支持H5。交互层是反知识蒸馏水印的可行设计位置,补充了token-、模型-和推理痕迹层的防御。

英文摘要

Detecting unauthorized knowledge distillation from a deployed LLM API is hard because the defender controls neither the attacker's training pipeline nor the next-token logits. Existing defenses operate on the teacher's output tokens -- biasing the next-token distribution (green-list watermarks, cryptographic schemes, antidistillation sampling) or rewriting outputs after generation. Recent work shows a paraphrasing attacker can strip these signals without losing the underlying knowledge. We propose interaction-layer antidistillation watermarks, which move the trace one layer higher, into the teacher's interaction behavior: the defender wraps the teacher with a system prompt that intermittently induces a behavioral marker -- an explicit follow-up question, a low-frequency variant, or a declarative restatement. An oblivious distiller inherits the behavior, and the defender audits via black-box queries with a human-validated LLM-as-judge (Cohen's kappa = 0.84/0.78 on strong/style rubrics). Across 63 LoRA-distilled students under a Llama-3.3-70B-Instruct teacher (35,343 judged samples), behavioral watermarks transfer at 88.9% (Gemma) / 80.9% (OLMo) / 45.2% (Qwen) relative fidelity (H1, H2). Under non-adaptive DIPPER paraphrasing, robustness decomposes into a teacher-self ceiling (about 66.4%) and student-relative retention of 21-112%, with OLMo preserving the watermark above the teacher itself (H3, F-Amp). Low-density (about 20%) explicit and implicit declarative variants transfer above per-family baseline (H4, F-Style). An N=20 in-lab study (pre-registered Latin-square) shows all marker variants within 0.22 Likert step of baseline; TOST, Friedman, and Bonferroni-Wilcoxon support H5. The interaction layer is a viable design locus for antidistillation watermarking, complementary to token-, model-, and reasoning-trace-layer defenses.

2605.16458 2026-05-19 cs.CV cs.AI 版本更新

Conservative AI for Safety-Sensitive Medical Image Restoration: Residual-Bounded CT-CTA Enhancement for Intracranial Aneurysm-Relevant Signal Recovery

安全敏感医学图像修复中的保守AI:用于颅内动脉瘤相关信号恢复的残差受限CT-CTA增强

Weijun Ma

发表机构 * Independent Researcher(独立研究者) King George, Vancouver School Board(金乔治,温哥华学区) Vancouver, BC, Canada(温哥华,BC省,加拿大)

AI总结 本文提出了一种残差受限的2.5D修复框架,用于安全敏感的医学图像修复,通过编辑控制图限制修改范围,提升CT和CTA图像质量,减少误诊风险。

Comments Preprint manuscript, 16 pages, 4 figures, 3 tables. This manuscript presents a residual-bounded 2.5D CT/CTA restoration framework for conservative medical image enhancement and evaluates it using image-recovery, baseline comparison, Monte Carlo stability, anatomical localization, and external low-dose CT testing

详情
AI中文摘要

图像修复模型越来越多地应用于退化的医学扫描,但在安全敏感的环境中,必须在不不受控制地修改临床重要区域的情况下提高图像质量。这在颅内CT和CT血管造影(CTA)中尤为重要,因为小血管和动脉瘤相关线索靠近高对比度的解剖边界。我们将医学图像修复视为保守AI问题,并提出了一种基于合成退化CT/CTA输入训练的残差受限2.5D修复框架。模型通过编辑控制图将学习到的残差添加到原始中心切片中,限制修改的幅度和空间范围。我们使用动脉瘤相关图像恢复矩阵、与高斯基线的配对比较、蒙特卡洛稳定性测试、有意义编辑的解剖定位以及低剂量CT的外部评估来评估该框架。在50个分布外的CT-CTA案例中,受限模型实现了平均目标增益0.0635,平均PSNR 37.51 dB,以及iatrogenic编辑率4.0%。在1000次蒙特卡洛运行中,模型在85.4%的运行中保持净正收益,没有稳定负收益。在外部低剂量CT中,模型在方向上有益,并且产生的修改足迹比基线小得多。有意义的编辑集中在大脑和颅骨区域,而无关解剖结构几乎没有变化。这些发现提供了初步的计算证据,表明在敏感血管成像中残差受限的修复是可行的,但它们不证明临床诊断性能,需要专家审查和前瞻性验证后才能用于临床应用。

英文摘要

Image restoration models are increasingly applied to degraded medical scans, but in safety-sensitive settings they must improve image quality without uncontrolled modification of clinically important regions. This is especially relevant for intracranial CT and CT angiography (CTA), where small vessels and aneurysm-relevant cues lie near high-contrast anatomical boundaries. We frame medical image restoration as a conservative AI problem and present a residual-bounded 2.5D restoration framework trained on synthetically degraded CT/CTA inputs. The model adds a learned residual to the original center slice through an edit-control map that limits the magnitude and spatial extent of modification. We evaluate the framework using an aneurysm-relevant image-recovery matrix, paired comparison against a Gaussian baseline, Monte Carlo stability testing, anatomical localization of meaningful edits, and external evaluation on low-dose CT. On 50 out-of-distribution CT-CTA cases, the bounded model achieved a mean target gain of 0.0635, a mean PSNR of 37.51 dB, and an iatrogenic-edit rate of 4.0%. Across 1,000 Monte Carlo runs, it remained net positive in 85.4% of runs with no stably negative cases. On external low-dose CT, the model was directionally beneficial and produced a substantially smaller modification footprint than the baseline. Meaningful edits concentrated in brain and skull regions while unrelated anatomy showed negligible change. These findings provide preliminary computational evidence that residual-bounded restoration is feasible in boundary-sensitive vascular imaging, but they do not establish clinical diagnostic performance and require expert review and prospective validation before clinical use.

2605.16452 2026-05-19 cs.LG cs.AI 版本更新

Peak-Detector: Explainable Peak Detection via Instruction-Tuned Large Language Models in Physiological Sign

峰值检测器:通过指令调优的大语言模型实现可解释的多模态峰值检测

Jiahui Li, Yida Zhang, Zixuan Zeng, Jiayu Chen, Yingjian Song, Yin Xiao, Nishan Dong, Junjie Lu, Younghoon Kwon, Xiang Zhang, Jin Lu, Wenzhan Song, Fei Dou

发表机构 * University of Georgia(佐治亚大学) Yixing People’s Hospital(宜兴人民医院) University of Washington(华盛顿大学) University of North Carolina at Charlotte(北卡罗来纳大学夏洛特分校)

AI总结 本文提出Peak-Detector框架,利用指令调优的大语言模型实现跨模态、可解释的峰值检测,通过峰表示技术压缩时间序列数据并提升检测准确性,同时生成解释性内容以支持验证与错误分析。

详情
AI中文摘要

准确检测多种心脏生理信号(如心电图、脉搏波容积图、球状心图和体震图)中的峰值对心血管监测至关重要,但常受伪影和信号变异影响。传统算法通常基于专家知识针对单一信号模态设计,限制了通用性。相比之下,深度学习方法缺乏可解释性,限制了专家验证和人机交互。为此,我们引入Peak-Detector框架,利用指令调优的大语言模型(LLMs)实现稳健、跨模态且可解释的峰值检测。框架的核心创新是“峰表示”技术,将时间序列数据转换为压缩格式,在保留关键事件信息的同时显著减少信号长度。此表示提供关键的归纳偏差,引导LLM在生理有意义的事件上推理而非原始噪声数据。模型通过监督微调(SFT)后接强化学习(RL)的多目标奖励函数进行优化。模型的自解释能力通过在自建的Peak-Explanation数据集上微调来培养。在四个模态(ECG、PPG、BCG和BSG)覆盖七个数据集(六个公开基准加一个真实世界队列)上,Peak-Detector展示了强大的跨模态性能,实现了临床相关时间容忍度下的最佳或并列最佳检测。除了准确性外,生成的解释性内容揭示了失败模式并支持验证和错误分析。

英文摘要

Accurate peak detection across diverse cardiac physiological signals, including the Electrocardiogram (ECG), Photoplethysmogram (PPG), Ballistocardiogram (BCG), and Bodyseismography (BSG), is fundamental for cardiovascular monitoring but is often hindered by artifacts and signal variability. Conventional algorithms are typically engineered with expert knowledge for a single signal modality, limiting their generalizability. Conversely, deep learning-based methods often lack interpretability, limiting transparency for expert verification and hindering expert-computer interaction. To address these limitations, we introduce Peak-Detector, a novel framework that leverages instruction-tuned Large Language Models (LLMs) for robust, cross-modal, and explainable peak detection. A core innovation of our framework is a "peak-representation" technique that transforms time-series data into a condensed format, preserving critical event information while significantly reducing signal length. This representation provides a crucial inductive bias, guiding the LLM to reason over physiologically meaningful events rather than raw, noisy data. The model is optimized through a two-stage process: supervised fine-tuning (SFT) followed by reinforcement learning (RL) with a multi-objective reward function. The model's self-explanation capabilities are cultivated by fine-tuning on a custom-built Peak-Explanation dataset. Across four modalities-ECG, PPG, BCG, and BSG-spanning seven datasets (six public benchmarks plus one real-world cohort), Peak-Detector demonstrates strong cross-modal performance, achieving best or tied-best detection under clinically relevant temporal tolerance. Beyond accuracy, the generated rationales surface failure modes and support verification and error analysis.

2605.16449 2026-05-19 cs.LG cs.AI 版本更新

PESD-TSF: A Period-Aware and Explicit Structured Decomposition Framework for Long-Term Time Series Forecasting

PESD-TSF:一种周期感知和显式结构分解框架,用于长期时间序列预测

Hua Wang, Xianhao Jiao, Fan Zhang

发表机构 * School of Computer and Artificial Intelligence(计算机与人工智能学院) Ludong University(鲁东大学) School of Computer Science and Technology(计算机科学与技术学院) Shandong Technology and Business University(山东科技职业大学)

AI总结 PESD-TSF通过引入周期性门控机制、多尺度编码器和跨尺度协作注意力,解决深度网络中周期感知减弱和变量间依赖破坏的问题,提升多变量时间序列预测性能。

Comments 23 pages, 9 figures, 13 tables

详情
AI中文摘要

深度预测模型常面临周期感知减弱和趋势-噪声表示混乱的问题,且通道独立范式虽提高训练稳定性,却破坏变量间动态协调,阻碍多变量时间序列中变量一致性建模。为此,我们提出PESD-TSF,一种受物理启发的结构分解框架,旨在同时强调可解释性和预测准确性。PESD-TSF引入三个关键设计:首先,乘法周期性门控机制整合连续时间先验,动态调节信号幅度,保持深度层间的周期结构;其次,多尺度结构编码器整合去趋势注意力与分层采样,显式分离长期趋势与高频变化,同时保留细粒度时间语义;第三,为恢复被破坏的变量依赖,我们提出跨尺度协作注意力(CSCA)与RLC正则化方案,重构深度特征空间中的全局变量拓扑,并通过正交性和一致性约束实现物理一致的协作。在多个领域的基准数据集上进行的广泛实验表明,PESD-TSF在多变量预测任务中,特别是在涉及复杂变量耦合的任务中, consistently 实现了最先进的性能,突显其优越的结构建模能力和泛化能力。

英文摘要

Deep forecasting models often suffer from attenuated periodic perception and entangled trend-noise representations as network depth increases. Moreover, the widely adopted channel-independent paradigm, while improving training stability, disrupts intrinsic dynamic coordination among variables, hindering the modeling of cross-variable consistency in multivariate time series. To address these issues, we propose PESD-TSF, a physics-inspired structured decomposition framework for long-term time series forecasting that jointly emphasizes interpretability and predictive accuracy. PESD-TSF introduces three key designs. First, a Multiplicative Periodic Gating mechanism incorporates continuous-time priors to dynamically modulate signal amplitudes, preserving periodic structures across deep layers. Second, a multi-scale structured encoder integrates detrended attention with hierarchical sampling to explicitly decouple long-term trends from high-frequency variations while retaining fine-grained temporal semantics. Third, to recover disrupted inter-variable dependencies, we propose Cross-Scale Collaborative Attention (CSCA) together with an RLC regularization scheme, which reconstructs global inter-variable topology in deep feature spaces and enforces physically consistent collaboration through orthogonality and consistency constraints. Extensive experiments on benchmark datasets from multiple domains demonstrate that PESD-TSF consistently achieves state-of-the-art performance, with particularly strong gains on multivariate forecasting tasks involving complex inter-variable coupling, highlighting its superior structural modeling capability and generalization.

2605.16444 2026-05-19 cs.CV cs.AI 版本更新

Diffusion Attention Expert Model for Predicting and Semi-automatic Localizing STAS in Lung Cancer Histopathological Images

扩散注意力专家模型用于预测和半自动定位肺癌组织病理图像中的STAS

Liangrui Pan, Jiadi Luo, Yuxuan Xiao, Chenchen Nie, Xiaoshuai Wu, Songqing Fan, Ling Chu, Manqiu Li, Rongfang He, Zhenyu Zhao, Ruixing Wang, Shulin Liu, Yiyi Liang, Xiang Wang, Qingchun Liang, Shaoliang Peng

发表机构 * College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机科学与电子工程学院) Department of Pathology, The Second Xiangya Hospital, Central South University(中南大学湘雅医院病理科) Hunan Clinical Medical Research Center for Cancer Pathogenic Genes Testing and Diagnosis(湖南临床医学肿瘤基因检测与诊断研究中心) Department of Thoracic Surgery, The Second Xiangya Hospital, Central South University(中南大学湘雅医院胸外科) Department of pathology, Hunan Cancer Hospital, The Affiliated Cancer Hospital of Xiangya School of Medicine, Central South University(湖南肿瘤医院病理科) Department of Pathology, The Third Xiangya Hospital, Central South University(中南大学湘雅第三医院病理科) Department of Pathology, First People's Hospital of Pingjiang County(平江县第一人民医院病理科) Department of Pathology, the First Affiliated Hospital, Hengyang Medical School, University of South China(南华大学衡阳医学院第一附属医院病理科) Department of Radiology, The Second Xiangya Hospital of Central South University(中南大学湘雅医院放射科) Department of Radiology, Xiangya Hospital, Central South University(中南大学湘雅医院放射科) Oncology Department and State Key Laboratory of Systems Medicine for Cancer of Shanghai Cancer Institute, Renji Hospital, School of Medicine, Shanghai Jiaotong University(上海癌症研究院肿瘤科及上海交通大学医学院系统医学重点实验室)

AI总结 本文提出DAEM模型,通过多尺度特征学习和双分支架构提升STAS检测精度,实现对冷冻切片和石蜡切片的高AUC值检测,并利用肿瘤微环境特征实现STAS半自动定位。

Comments Accepted by Nature Communications

详情
AI中文摘要

准确的术中和术后STAS诊断对指导肺癌手术决策和术后管理至关重要。然而,组织病理学评估耗费人力且易出现漏诊或误诊。我们提出扩散注意力专家模型(DAEM)用于检测冷冻切片(FSs)和石蜡切片(PSs)中的STAS。其扩散注意力专家模块利用全注意力聚合学习多尺度特征,而双分支架构强化多尺度特征表示。在内部数据集中,DAEM在FSs和PSs上分别达到0.8946和0.9112的AUC值。在八个机构的外部多中心数据集上验证显示,模型具有强泛化性和可解释性。利用PSs中的肿瘤微环境(TME)特征,进一步实现了STAS位置及其与原发肿瘤距离的半自动测量。多个定量TME指标被识别为STAS的潜在生物标志物,包括微泡型STAS。总体而言,DAEM通过在FSs和PSs上实现准确且可解释的检测,为STAS评估提供临床可操作的框架,通过基于定量TME的分析支持术后风险分层。

英文摘要

Accurate intraoperative and postoperative diagnosis of spread through air spaces (STAS) is essential for guiding surgical decisions and postoperative management in lung cancer. However, histopathological assessment is labor-intensive and is prone to missed or incorrect diagnoses. We propose a Diffusion Attention Expert Model (DAEM) to detect STAS in frozen sections (FSs) and paraffin sections (PSs). Its diffusion attention expert module leverages full attention aggregation to learn multi-scale features from histopathological images, while a dual-branch architecture strengthens multi-scale feature representation. On an internal dataset, DAEM achieves AUCs of 0.8946 for FSs and 0.9112 for PSs. Validation on external multi-center datasets from eight institutions demonstrates strong generalizability and interpretability. Using tumor microenvironment (TME) features in PSs, we further enable semi-automatic measurement of STAS location and its distance from the primary tumor. Several quantitative TME metrics are identified as potential biomarkers for STAS, including micropapillary-type STAS. Overall, DAEM offers a clinically actionable framework for STAS assessment by enabling accurate and interpretable detection on FSs and PSs, supporting postoperative risk stratification through quantitative TME-based analysis.

2605.16443 2026-05-19 cs.LG cs.AI 版本更新

Two-Valued Symmetric Circulant Matrices: Applications in Deep Learning

二值对称循环矩阵:在深度学习中的应用

Jayakrishna Amathi, Venkata Prasanth Yanambaka, Saraju P. Mohanty, Elias Kougianos

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) University of North Texas(北卡罗来纳大学达顿分校) Division of Computer Science(计算机科学分校) Texas Woman’s University(德克萨斯女子大学)

AI总结 本文提出二值对称循环矩阵,通过每层仅使用两个权重实现极稀疏结构,显著降低存储需求,实验显示在MNIST和MIT-BIH数据集上参数减少超过80倍,同时保持较高精度,适用于边缘计算和低功耗系统。

详情
AI中文摘要

尽管深度神经网络在视觉、医疗诊断和物联网场景中取得成功,但其在资源受限平台上的部署面临严峻挑战,由于存储需求高、计算复杂度大和占用空间大。特别是全连接层需要大量权重,使边缘设备难以容纳。为克服与有限平台相关的挑战,本文提出二值对称循环矩阵(TVSCM),一种非常稀疏的架构,每层仅使用两个权重以保持循环和对称性。极结构稀疏架构的存储成本与传统全权重存储相比几乎可以忽略不计。与传统稀疏学习技术如低秩近似和剪枝方法不同,该架构提供极稀疏形式,实现极低的存储需求。模拟研究显示,在MNIST数据集上参数从623,290减少到7,852,MIT-BIH心律失常数据集上从24,709减少到942,同时保持在MNIST上97.6%到93.5%的精度,在MIT-BIH上97.6%到93.1%的精度。由于其极低的架构需求和非常低的功耗,该架构适用于边缘计算平台、微型机器学习平台、IoMT系统和电池供电系统。

英文摘要

Despite the success of deep neural networks in vision, medical diagnosis, and IoT scenarios, their deployment on resource-limited platforms poses serious challenges due to their high storage requirements, computational complexity, and large footprint. In particular, fully connected layers require a large number of weights, making it difficult for edge devices to accommodate them. To overcome these challenges associated with limited platforms, this paper proposes the Two-Valued Symmetric Circulant Matrix (TVSCM), a very sparse architecture that employs just two weights per layer to keep it circulant and symmetric. The extreme form of structured sparse architecture provides negligible storage costs compared to traditional full-weight storage. Instead of hardware and additional stages of other traditional sparse learning techniques, such as low-rank approximation and pruning approaches, this architecture provides an extreme form of sparsity, achieving very minimal storage requirements. The simulation study demonstrates more than 80$\times$ reduction in model parameters, reducing parameters from 623,290 to 7,852 on MNIST and from 24,709 to 942 on the MIT-BIH arrhythmia dataset, while maintaining comparable accuracy from 97.6% to 93.5% on MNIST and from 97.6% to 93.1% on MIT-BIH. Due to its minimal architectural requirements and very low power consumption, this architecture would be ideal for edge computing platforms, tiny-ML platforms, IoMT systems, and battery-powered systems.

2605.16442 2026-05-19 cs.RO cs.AI cs.LG 版本更新

Hierarchical Two-Stage Framework for Environment-Aware Long-Horizon Vessel Trajectory Prediction

面向环境的长航程船舶轨迹预测分层两阶段框架

Ganeshaaraj Gnanavel, Tharindu Fernando, Sridha Sridharan, Clinton Fookes

发表机构 * SAIVT Research Group, Queensland University of Technology(SAIVT研究组,昆士兰理工大学)

AI总结 本文提出分层两阶段框架,结合长短期预测器与网格感知短期预测器,通过分层融合机制提升船舶轨迹预测精度,实验显示在ADE和FDE上优于现有方法。

详情
AI中文摘要

长航程船舶轨迹预测在真实海洋条件下对碰撞避免、交通管理和路线规划至关重要。然而,由于长距离时间依赖性和动态环境因素如洋流、风和波浪,实现准确预测具有挑战性。为此,我们提出一种分层两阶段框架,通过分层融合机制结合粗略长时预测器与网格感知的短时预测器。短时分支利用离散化海事单元上的时空图变换器捕捉局部动态,而长时分支编码总体航行意图。集成的环境模块利用洋流参数、风向量和显著波高,通过跨模态注意和特征调制实现对不同海况的适应性响应。此外,可学习的Savitzky-Golay平滑层增强了融合轨迹的时间一致性。我们在澳大利亚船队跟踪系统(CTS)数据上进行了评估,数据来自西北地区,并与Copernicus海洋服务产品对齐,使用3小时输入和10小时预测时间范围。实验结果表明,我们的框架在平均位移误差(ADE)和最终位移误差(FDE)上比现有方法提高了25%和17%。消融研究进一步验证了每个组件的贡献。

英文摘要

Long-horizon vessel trajectory forecasting under real ocean conditions is critical for collision avoidance, traffic management, and route planning. However, achieving accurate predictions is challenging due to long-range temporal dependencies and dynamic environmental factors such as currents, wind, and waves. To address these issues, we propose a hierarchical two-stage framework that combines a coarse long-term predictor with a grid-aware short-term predictor through a hierarchical fusion mechanism. The short-term branch leverages a Spatio-Temporal Graph Transformer on discretized maritime cells to capture localized dynamics, while the long-term branch encodes overarching navigational intent. An integrated environmental module incorporates oceanographic parameters, including surface currents, wind vectors, and significant wave height, using cross-modal attention and feature-wise modulation for adaptive response to varying sea conditions. Additionally, a learnable Savitzky-Golay smoothing layer enhances temporal coherence in fused trajectories. We evaluate our approach on Australian Craft Tracking System (CTS) data from the North West region, aligned with Copernicus Marine Service products, using a 3-hour input and a 10-hour prediction horizon. Experimental results show that our framework outperforms the state-of-the-art by 25% in Average Displacement Error (ADE) and 17% in Final Displacement Error (FDE). Ablation studies further validate the contribution of each component.

2605.16441 2026-05-19 cs.LG cs.AI 版本更新

DeepArrhythmia: Segment-Contextualized ECG Arrhythmia Classification via Selective Evidence Acquisition

DeepArrhythmia: 基于选择性证据获取的段落上下文化ECG心律失常分类

Jiahui Li, Ruili Fang, Zishuai Liu, WenZhan Song, Jin Lu, Fei Dou

发表机构 * University of Georgia(佐治亚大学)

AI总结 DeepArrhythmia通过选择性证据获取实现段落上下文化ECG心律失常分类,结合原始ECG信号和渲染波形图像,利用专门工具分离生理测量与证据整合,提升多beat节奏上下文下的心律失常检测精度。

详情
AI中文摘要

心电图(ECG)心律失常检测旨在为每条心跳分配一个心律失常类别,但许多现有系统将心跳视为孤立的局部实例,限制了对多心跳节奏上下文的依赖。我们提出DeepArrhythmia,一种工具导向的多模态框架,用于段落上下文化的心跳级ECG心律失常分类。给定一个多心跳ECG段,DeepArrhythmia结合原始ECG信号和渲染的波形图像,定位R峰以识别心跳实例,并生成结构化的心跳级预测。该框架通过专门工具分离生理测量与证据整合,用于心跳定位、数值节奏-形态提取和形态聚焦的文本分析。DeepArrhythmia利用段级置信度在最小和丰富证据状态之间路由,因为更丰富的生理证据并不总是有用。这种代理设计整合了节奏上下文、显式生理基础和选择性证据获取以进行决策。

英文摘要

Beat-level Electrocardiography (ECG) arrhythmia detection aims to assign an arrhythmia class to each beat in a recording, yet many existing systems treat beats as isolated local instances. This is limiting because beat labels often depend on multi-beat rhythm context, including timing, compensatory pauses, and beat-to-beat morphological consistency. We present DeepArrhythmia, a tool-grounded multimodal framework for segment-contextualized beat-level ECG arrhythmia classification. Given a multi-beat ECG segment, DeepArrhythmia combines the raw ECG signal and a rendered waveform image, localizes R peaks to identify beat instances, and produces structured beat-level predictions. The framework decouples physiological measurement from evidence integration using specialized tools for beat localization, numerical rhythm--morphology extraction, and morphology-focused textual analysis. DeepArrhythmia uses segment-level confidence to route between minimal and rich evidence states, since richer physiological evidence is not uniformly useful. This agentic design integrates rhythm context, explicit physiological grounding, and selective evidence acquisition for decision making.

2605.16440 2026-05-19 cs.CV cs.AI 版本更新

Semantic Smoothing via Novel View Synthesis for Robust SAR Image Classification

通过新颖视角合成实现语义平滑以实现稳健的SAR图像分类

Daniel Brignac, Fengwei Tian, Banafsheh Latibari, Abhijit Mahalanobis, Ravi Tandon

发表机构 * The University of Arizona(亚利桑那大学)

AI总结 本文提出语义平滑方法,通过新颖视角合成模型生成结构化随机变换,提升SAR图像分类在对抗攻击下的鲁棒性,并提高干净分类准确率。

详情
AI中文摘要

深度神经网络对对抗扰动敏感,限制了其在安全关键应用中的部署,如合成孔径雷达(SAR)自动目标识别(ATR)。随机化平滑通过在噪声输入上平均预测来提高鲁棒性,但各向同性噪声常无法保持SAR图像的语义结构。我们提出语义平滑,一种防御方法,用由新颖视角合成模型生成的结构化随机变换取代基于噪声的扰动。对于SAR,我们根据获取几何学合成多个可能的雷达视角。在生成的随机视角上进行预测并聚合,以形成鲁棒分类器。实验表明,语义平滑在标准攻击(如FGSM和PGD)以及SAR特定攻击(如OTSA和SMGAA)中提高了鲁棒性,同时提高了干净分类准确率。这些结果表明,通过保留语义的几何变换进行随机化平滑,是结构感知领域对抗防御的一种有前景的替代方案。

英文摘要

Deep neural networks are vulnerable to adversarial perturbations, limiting deployment in safety-critical applications such as synthetic aperture radar (SAR) automatic target recognition (ATR). Randomized smoothing improves robustness by averaging predictions over noisy inputs, but isotropic noise often fails to preserve the semantic structure of SAR imagery. We propose semantic smoothing, a defense that replaces noised-based perturbations with structured randomized transformations generated by a novel view synthesis model. For SAR, we condition on acquisition geometry to synthesize multiple plausible radar views. Predictions across generated randomized views are aggregated to form a robust classifier. Experiments show that semantic smoothing improves robustness against standard attacks, such as FGSM and PGD, and SAR-specific attacks, such as OTSA and SMGAA, while also increasing clean classification accuracy. These results demonstrate that randomized smoothing via semantically preserving geometric transformations is a promising alternative to isotropic noise for adversarial defense in structured sensing domains.

2605.16439 2026-05-19 cs.CV cs.AI 版本更新

KVCapsule: Efficient Sequential KV Cache Compression for Vision-Language Models with Asymmetric Redundancy

KVCapsule: 用于视觉-语言模型的高效序列KV缓存压缩方法:不对称冗余

Yingbing Huang, Tharun Adithya Srikrishnan, Steven K. Reinhardt, Deming Chen

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) AMD

AI总结 本文提出KVCapsule,一种针对视觉语言模型的KV缓存压缩框架,通过轻量压缩和重建组件实现内存节省,提升吞吐量并减少内存占用,同时保持精度。

详情
AI中文摘要

视觉-语言模型(VLMs)作为大型语言模型(LLMs)的重要扩展,通过文本和图像输入实现多模态推理。尽管VLMs增强了语言模型的能力,但它们也继承并放大了关键计算瓶颈:自回归解码过程中大规模键值(KV)缓存带来的内存开销。这一挑战在VLMs中尤为严重,因为图像生成更长的token序列和更密集的特征表示,相比文本。此外,视觉token的空间和信息丰富性引入了结构化的注意力模式,使得许多针对LLM的KV缓存压缩技术在直接应用于VLMs时效果不佳。在本文中,我们对视觉token的行为进行了详细的实证分析,突显其与纯文本模型的关键差异。基于这些见解,我们提出KVCapsule,一种新的视觉token的KV缓存压缩框架。KVCapsule保持预训练VLM骨干网络冻结,不需要修改注意力计算模块,并且可以通过轻量级压缩和重建组件集成到现有VLMs中。我们评估了KVCapsule在多个VLMs和基准任务上的性能,证明在60%的压缩率下,TPS提升达2倍,KV缓存内存减少达2.4倍,同时精度或响应质量几乎没有下降。我们的发现为在受限内存预算下扩展VLM推理提供了实用路径,并启发进一步研究结构感知的缓存压缩方法以多模态模型。

英文摘要

Vision-Language Models (VLMs) have emerged as a critical and fast-growing extension of Large Language Models (LLMs) that enable multimodal reasoning through both text and image inputs. Although VLMs enrich the capabilities of language models, they also inherit and amplify key computational bottlenecks: the memory overhead caused by the large key-value (KV) cache during autoregressive decoding. This challenge is particularly severe in VLMs, where images produce longer token sequences and denser feature representations compared to text. Moreover, the spatial and information-rich nature of vision tokens introduces structured attention patterns that make many LLM-oriented KV cache compression techniques ineffective when applied directly to VLMs. In this work, we conduct a detailed empirical analysis of the behavior of vision tokens, highlighting the critical differences from purely text-based models. Based on these insights, we propose KVCapsule, a novel KV cache compression framework for vision tokens. KVCapsule keeps the pretrained VLM backbone frozen, requires no modification to the attention computation modules, and can be integrated into existing VLMs through lightweight compression and reconstruction components. We evaluate KVCapsule on multiple VLMs and benchmark tasks, demonstrating up to 2x improvement in TPS and 2.4x reduction in KV cache memory at a 60% compression ratio, with negligible degradation in accuracy or response quality. Our findings offer practical pathways to scale VLM inference under constrained memory budgets and inspire further research into structure-aware cache compression for multimodal models.

2605.16438 2026-05-19 cs.LG cs.AI 版本更新

Byzantine-Resilient Federated Learning via QUBO-Based Client Selection on Quantum Annealers

通过量子退火的客户端选择实现容错联邦学习

Andras Ferenczi, Sutapa Samanta, Dagen Wang, Jason Qizhe Qin

发表机构 * Columbia University(哥伦比亚大学)

AI总结 本文提出利用量子退火解决联邦学习中的拜占庭容错问题,通过将客户端选择转化为二次无约束二元优化问题,提升对恶意更新的检测能力。

Comments 9 pages, 6 figures, 8 tables

详情
AI中文摘要

联邦学习(FL)在分布式客户端上训练全局模型,但规模扩大时易受恶意更新攻击。本文提出一种量子退火方法,将客户端选择转化为二次无约束二元优化(QUBO)问题,通过量子退火器求解。QUBO方法在小规模客户端中优于MultiKrum,但在大规模客户端中性能下降。本文引入MultiSignal集成方法,结合欧几里得和余弦Krum分数差距,将攻击分类为四个阶段并路由恶意攻击至受惩罚的QUBO。实验表明,MultiSignal在MNIST数据集上达到95.3%的检测准确率,显著优于传统MultiKrum方法。

英文摘要

Federated Learning (FL) trains a global model across decentralized clients while preserving data privacy, but at scale it is vulnerable to malicious updates. Byzantine-resilient aggregation methods such as MultiKrum score gradients against their nearest neighbors and can miss malicious updates that preserve the statistical properties of honest ones. We propose a quantum annealing approach that reformulates client selection as a Quadratic Unconstrained Binary Optimization (QUBO) problem, encoding pairwise distances into a cost function solved by quantum annealers (QA). Unlike MultiKrum's greedy per-client scoring, the QUBO formulation jointly optimizes over all subsets to find the mutually closest group of $m$ clients. At small scale (15 clients), QUBO outperforms MultiKrum on the most challenging Byzantine attacks: e.g., Advanced LIE is detected with 95.11% accuracy versus 81.33% on MNIST and 97.78% versus 75.56% on CIFAR-10. QUBO fares poorly on simpler attacks where MultiKrum excels, so the two methods are complementary. QUBO quality also degrades as the number of clients grows. To address this, we introduce a MultiSignal ensemble that uses a dual-feature routing gate based on Euclidean and cosine Krum score gaps to classify attacks into four regimes and routes evasion attacks to a suspicion-penalized QUBO with agreement voting. At 100 clients on MNIST, MultiSignal achieves 95.3% average detection accuracy versus 91.8% for classical MultiKrum, with the largest gains on Sparse Lie (72.0% to 95.2%, +23.2 points) and Advanced Lie (80.4% to 85.2%, +4.8 points). These results show that QUBO-based quantum annealing with MultiSignal is a principled and scalable defense against the most challenging Byzantine strategies in federated learning.

2605.16436 2026-05-19 cs.CR cs.AI 版本更新

The End of Trust: How Agentic AI Breaks Security Assumptions

信任的终结:如何代理AI打破安全假设

Osama Zafar, Alexander Nemecek, Erman Ayday

发表机构 * Dept. of Computer and Data Sciences(计算机与数据科学系) Case Western Reserve University(凯斯西储大学)

AI总结 代理AI打破了传统安全假设,允许高保真的定制欺骗在大众市场层面实现。本文提出'无限冒充者'攻击模型,并探讨从验证行为转向评估行动的安全范式转变。

详情
AI中文摘要

在数十年中,数字交互的安全性依赖于一个未被承认的经济约束。攻击者面临欺骗的保真度与可部署规模之间的权衡。说服性冒充需要持续的人力投入,仅限于高价值目标,而大众市场攻击牺牲了可信度以换取覆盖面。检测系统、验证机制和用户意识培训都隐式地校准到这种权衡所产生的廉价欺骗制品。代理AI消解了这种权衡,使高保真的个性化欺骗能够在大众市场层面产生。我们主张这种转变耗尽了安全范式,而非仅仅加剧威胁景观。我们引入了无限冒充者攻击模型,其中自主代理在双方已相互信任的双方之间插入,劫持现有关系而非从头建立新的关系。以检测为导向的防御共享一个假设,即生成性进步正在消除,合成输出可以区分于真实输出。我们提出一种默认怀疑范式,将安全从验证行为者转向评估行为,并探讨当平台成为数字交互监管基础时产生的治理张力。

英文摘要

For decades, the security of digital interaction has rested on an unacknowledged economic constraint. Attackers faced a tradeoff between the fidelity of a deception and the scale at which it could be deployed. Convincing impersonation required sustained human effort and was confined to a narrow set of high-value targets, while mass-market attacks sacrificed plausibility for reach. Detection systems, verification mechanisms, and user awareness training have all been implicitly calibrated to the artifacts of cheap deception that this tradeoff produced. Agentic AI collapses the tradeoff, allowing high-fidelity, individually tailored deception to be produced at mass-market scale. We argue that this shift exhausts a security paradigm rather than merely intensifying the threat landscape. We introduce the Infinite Impostor, an attack model in which an autonomous agent interposes itself between two parties who already trust each other, hijacking an existing relationship rather than building a new one from scratch. Detection-oriented defenses share an assumption that generative progress is eliminating, that synthetic outputs are distinguishable from authentic ones. We propose a suspect-by-default paradigm that shifts security from authenticating actors to evaluating actions, and examine the governance tensions that arise when platforms become the regulatory substrate of digital interaction.

2605.16435 2026-05-19 cs.LG cs.AI 版本更新

GPU-Accelerated Deep Learning for Heatwave Prediction and Urban Heat Risk Assessment

基于GPU的深度学习用于热浪预测和城市热风险评估

Adis Alihodžić

发表机构 * Faculty of Science, University of Sarajevo(萨拉热窝大学科学学院)

AI总结 本文提出基于GPU的深度学习框架,用于预测城市热条件和评估热风险,采用MODIS和Open-Meteo数据,验证了ConvLSTM混合损失函数的有效性,提升了预测精度与效率。

详情
AI中文摘要

热浪是城市中的重要问题,气候变化使其更加困难。本文提出一种基于GPU的深度学习框架,用于预测城市热条件和热风险评估。研究在萨拉热窝使用MODIS地表温度数据和Open-Meteo预报数据进行。测试了多种模型,包括卷积模型和时空模型。其中,混合损失函数的ConvLSTM模型表现最佳,得到MAE=0.2293,RMSE=0.3089,R2=0.8877。实验还表明,使用更长的时间序列和额外气象变量可提高结果。由于框架在GPU上实现并采用混合精度训练,执行时间减少。基于预测温度场,可以结合危险信息与暴露和脆弱性数据生成城市热风险地图。所提框架可作为城市热分析的实用基础。

英文摘要

Heatwaves are an important problem in cities, and climate change makes this problem more difficult. In this paper, we present a GPU-based deep learning framework for next-day prediction of urban thermal conditions and for heat risk assessment. The study was carried out in Sarajevo by using MODIS land surface temperature data and Open-Meteo forecast data. We tested several models, including convolutional models and spatiotemporal models. Among them, ConvLSTM with a mixed loss function gave the best results. The obtained values were MAE = 0.2293, RMSE = 0.3089, and R2 = 0.8877. The experiments also showed that results can be improved by using longer temporal series and additional meteorological variables. Since the framework was implemented on a GPU and trained with mixed precision, the execution time was reduced. Based on the predicted temperature fields, it was also possible to combine hazard information with exposure and vulnerability data in order to generate city heat risk maps. The proposed framework can be used as a practical basis for city heat analysis.

2605.16433 2026-05-19 cs.LG cs.AI 版本更新

Edge-AI-Driven Learning-to-Rank for Decentralized Task Allocation in Circular Smart Manufacturing

边缘AI驱动的基于排序的学习排名用于圆环式智能制造中的去中心化任务分配

Mohammadhossein Ghahramani, Yan Qiao, Mengchu Zhou

发表机构 * Birmingham City University(伯明翰城市大学) Macao Institute of Systems Engineering and Collaborative Laboratory for Intelligent Science and Systems(澳门系统工程研究院和智能科学与系统联合实验室) New Jersey Institute of Technology(新泽西理工学院)

AI总结 本文提出一种边缘AI驱动的去中心化任务分配框架,通过基于排序的协商实现高效资源分配,提升高负载和紧 deadline 场景下的延迟和能效。

Journal ref Under review at IEEE IoT J, 2026

详情
AI中文摘要

在智能制造系统中,任务分配需要在去中心化决策、动态负载和共享资源约束下运行。在循环制造环境中,这些挑战因需平衡运营效率与资源和能源可持续性而加剧。尽管已有基于学习的方法,但许多方法专注于预测绝对性能指标,这些指标不一定能提升分配结果,因为去中心化分配由候选机器的相对排序决定。本文提出一种基于排序意识协商的边缘AI驱动的去中心化任务分配框架,其中轻量级决策智能嵌入在机器层面,以实现低延迟协调而无需集中控制。该框架逐步开发:首先,资源感知的启发式方法建立去中心化投标结构,然后基于边缘AI的回归模型提供学习的本地投标近似,最后基于排序的公式重塑学习目标以与赢家选择的排序性质一致。每台机器使用本地信息评估 incoming 任务,包括处理能力、队列状态和资源竞争。该框架通过离散事件模拟在高负载和紧 deadline 场景下进行评估,使用延迟、截止期限违规、吞吐量和能耗等指标。结果表明,在高负载下延迟和截止期限遵守有所改善,在更紧的约束下能耗效率提高,导致更高效的资源操作,符合循环制造目标。这些发现表明,将学习目标与去中心化决策结构对齐对于有效的协商驱动任务分配至关重要。

英文摘要

Task allocation in smart manufacturing systems needs to operate under decentralized decision-making, dynamic workloads, and shared resource constraints. In circular manufacturing settings, these challenges are further intensified by the need to balance operational efficiency with resource and energy sustainability. While learning-based approaches have been explored, many focus on predicting absolute performance metrics that do not necessarily translate into improved allocation outcomes, since decentralized assignment is governed by the relative ordering of candidate machines. This work proposes an Edge-AI-driven decentralized task allocation framework based on ranking-aware negotiation, where lightweight decision intelligence is embedded at the machine level to enable low-latency coordination without centralized control. The framework is developed progressively: a resource-aware heuristic first establishes the decentralized bidding structure, an Edge-AI-based regression model then provides learned local bid approximation, and a ranking-aware formulation finally reshapes the learning objective to align with the ordering-based nature of winner selection. Each machine evaluates incoming tasks using local information, including processing capability, queue state, and resource contention. The framework is evaluated via discrete-event simulation under high-load and tight-deadline scenarios using delay, deadline violations, throughput, and energy consumption. Results show improved delay and deadline adherence under high load, and enhanced energy efficiency under tighter constraints, leading to more resource-efficient operation aligned with circular manufacturing objectives. These findings demonstrate that aligning learning objectives with decentralized decision structures is critical for effective negotiation-driven task allocation.

2605.16432 2026-05-19 cs.RO cs.AI cs.HC 版本更新

MR-SLAM: Immersive Spatial Supervision for Multi-Robot Mapping via Mixed Reality

MR-SLAM:通过混合现实实现多机器人地图的沉浸式空间监督

Prakash Aryan, Cem Erdogdu, Kavinaya Kumarchokkappan, Timo Kehrer, Sebastiano Panichella

发表机构 * University of Bern, Bern, Switzerland(伯尔尼大学) AI4I -- The Italian Institute of Artificial Intelligence, Turin, Italy(意大利人工智能研究所)

AI总结 本文提出MR-SLAM系统,利用混合现实技术实现多机器人SLAM的沉浸式空间监督,通过实时可视化和空间锚定面板提升多机器人定位与建图效率。

Comments Accepted to ICRA 2026 Workshop "MM-SpatialAI Workshop: Multi-Modal Spatial AI for Robust Navigation and Open-World Understanding"

详情
AI中文摘要

在建筑检查或仓库通道监控等应用中,操作多机器人队伍进行同时定位与建图(SLAM)需要操作员持续保持对每个机器人位置和建图状态的空间意识,这在传统2D界面中表现不佳。我们提出了MR-SLAM,一种混合现实(MR)系统,其中佩戴Meta Quest 3头显的operator通过带有真实世界遮挡的通透视图操控三个模拟TurtleBot3机器人,同时空间锚定的仪表板面板实时报告建图进度。每个机器人运行独立的SLAM Toolbox实例,其占用网格在ROS 2后端实时合并。在五次9分钟的评估会话中,系统以8.83±0.16Hz的速度生成扫描,合并了17.9±0.8平方米的占用网格,并在机器人对之间达到94.7±0.5%的跨实例占用一致性。额外的会话记录了6.3ms的中位转换抖动和41平方米网格的26.7平方米覆盖。我们将MR-SLAM定位为一种参考实现,用于在消费级硬件上结合通透混合现实监督与多机器人SLAM。

英文摘要

Operating a multi-robot fleet for simultaneous localization and mapping (SLAM) in applications such as building inspection or warehouse-aisle monitoring requires the operator to maintain spatial awareness of each robot's position and mapping state, a task that scales poorly on conventional 2D interfaces. We present MR-SLAM, a mixed reality (MR) system in which an operator wearing a Meta Quest 3 headset teleoperates three simulated TurtleBot3 robots through a passthrough view with real-world occlusion, while spatially anchored dashboard panels report mapping progress in situ. Each robot runs an independent SLAM Toolbox instance whose occupancy grid is merged in real time on a Robot Operating System 2 (ROS 2) back end. Across five 9-minute evaluation sessions, the system delivered scans at 8.83 +/- 0.16 Hz, mapped 17.9 +/- 0.8 m^2 of merged occupancy, and reached 94.7 +/- 0.5% cross-instance occupancy consistency across robot pairs. An additional session recorded 6.3 ms median transform jitter and 26.7 m^2 coverage of a 41 m^2 grid. We position MR-SLAM as a reference implementation for combining passthrough mixed reality supervision with multi-robot SLAM on consumer hardware.

2605.16429 2026-05-19 cs.LG cs.AI 版本更新

QuantFPFlow: Quantum Amplitude Estimation for Fokker--Planck Policy Optimisation in Continuous Reinforcement Learning

QuantFPFlow:用于连续强化学习中Fokker-Planck策略优化的量子振幅估计

Abraham Itzhak Weinberg

发表机构 * AI-WEINBERG, AI Experts(AI-WEINBERG人工智能专家)

AI总结 QuantFPFlow通过量子振幅估计提升连续强化学习中Fokker-Planck策略优化的效率,实现算法复杂度从O(1/ε²)到O(1/ε)的平方加速,并在多模态奖励景观中发现全局最优解。

详情
AI中文摘要

我们引入QuantFPFlow,一种将量子振幅估计整合到随机策略优化的Fokker-Planck(FP)公式中的强化学习框架。经典连续空间RL代理必须以成本O(1/ε²)估计FP分区函数Z=∫e^{-V(x)/D}dx;QuantFPFlow用Grover增强的振幅估计器替代,实现O(1/ε)的可证明二次加速。尽管完全量子加速需要容错硬件,此处展示的量子启发经典模拟已表现出O(1/ε)的算法结构。估计的稳态分布ρstar驱动理论支撑的探索奖励Raug=Renv+αlog(1/ρstar(s))。此奖励将代理引导至多模态奖励景观的全局最优区域,同时通过FP扩散匹配约束策略方差。在专门设计暴露局部最优失败的连续控制任务中,QuantFPFlow实现平均奖励1,295.7±423.2,优于Soft Actor-Critic(SAC)的1,284.0±474.0,同时发现全局最优的频率高10.4%(33.9% vs. 30.7%)。策略熵保持在H(π)≈6.5纳特,而SAC下降至1.5纳特,证实FP扩散匹配主动防止过早收敛。维度实验进一步显示QuantFPFlow的计算规模为O(d^{0.35}),而经典FP估计为O(d^{0.76})。

英文摘要

We introduce \textbf{QuantFPFlow}, a reinforcement learning framework that integrates quantum amplitude estimation into the Fokker--Planck~(FP) formulation of stochastic policy optimisation. Classical continuous-space RL agents must estimate the FP partition function $Z = \int e^{-V(\mathbf{x})/D}\,d\mathbf{x}$ at cost $\calO(1/\varepsilon^{2})$; QuantFPFlow replaces this with a Grover-amplified amplitude estimator achieving $\calO(1/\varepsilon)$ -- a provable quadratic speedup. While the full quantum acceleration requires fault-tolerant hardware, the quantum-inspired classical simulation demonstrated here already exhibits the $\calO(1/\varepsilon)$ algorithmic structure. The estimated stationary distribution $\rhostar$ drives a theoretically grounded exploration bonus $\Raug = \Renv + α\log(1/\rhostar(s))$. This bonus steers the agent toward globally optimal regions of multimodal reward landscapes while simultaneously constraining policy variance through FP diffusion matching. On a continuous-control task specifically designed to expose local-optima failure, QuantFPFlow achieves mean reward $1{,}295.7 \pm 423.2$ versus $1{,}284.0 \pm 474.0$ for Soft Actor-Critic~(SAC), while discovering the global optimum \textbf{10.4\,\% more frequently} (33.9\,\% vs.\ 30.7\,\%). Policy entropy remains near $H(π)\approx 6.5$\,nats throughout training, whereas SAC collapses to $1.5$\,nats, confirming that FP diffusion matching actively prevents premature convergence. Dimensionality experiments further show computational scaling of $\calO(d^{0.35})$ for QuantFPFlow versus $\calO(d^{0.76})$ for classical FP estimation.

2605.16427 2026-05-19 cs.CV cs.AI 版本更新

EAGT: Echocardiography Augmentation for Generalisability and Transferability

超声波增强:通用性和可迁移性

Soroush Elyasi, Sara Adibzadeh, Nasim Dadashi Serej, Julie Wall, Massoud Zolgharni

发表机构 * THRIVE Centre, University of West London(西伦敦大学THRIVE中心) University of West London(西伦敦大学) School of Computing and Engineering, University of West London(西伦敦大学计算机与工程学院)

AI总结 本文研究了29种数据增强技术及其组合对左心室分割的通用性和可迁移性影响,发现几何变换优于强度增强,且最佳组合提升模型鲁棒性。

详情
AI中文摘要

深度学习模型在超声分割中常难以跨机构、设备和患者群体泛化,因收集大量一致标注数据不现实。数据增强广泛用于提升模型鲁棒性,但其在超声中的跨数据集泛化作用尚不明确。本文评估了29种数据增强技术及其配对组合,使用U-Net在Unity、CAMUS和EchoNet Dynamic数据集上进行2D左心室分割。每种增强方法在不同超参数设置下,通过Dice和IoU在域内和跨域场景下重复运行评估,统计显著性通过独立t检验量化。结果表明,解剖合理几何变换,特别是仿射、位移-缩放-旋转、透视和随机水平翻转,显著提升跨数据集性能,而激进的强度或伪影增强常降低泛化能力。配对增强组合优于单个增强,尤其以随机水平翻转与仿射组合在大多数迁移场景中表现一致。这些发现为设计增强策略提供了实证指导,以增强超声分割模型的鲁棒性和可迁移性。

英文摘要

Deep learning models for echocardiography segmentation often struggle to generalise across institutions, scanners, and patient populations, where collecting large, consistently annotated datasets is infeasible. Data augmentation is widely used to improve the robustness of deep learning models; however, its role in enhancing cross-dataset generalisability in echocardiography remains insufficiently understood. This study presents a large-scale multi-dataset evaluation of 29 data augmentation techniques and their pairwise combinations for 2D left ventricular segmentation using a U-Net trained on Unity, CAMUS, and EchoNet Dynamic datasets. Each augmentation was explored under several hyperparameter settings and assessed through repeated runs using Dice and IoU in both in-domain and cross-dataset scenarios, with statistical significance quantified via independent t-tests. Results show that anatomically plausible geometric transformations, particularly affine, shift-scale-rotate, perspective, and random horizontal flip, substantially improve cross-dataset performance, whereas aggressive intensity- or artefact-based augmentations often degrade generalisability. Pairwise augmentation combinations outperform individual augmentations and show that moderate flip-centric combinations, especially random horizontal flip with affine, yield consistent gains across most transfer scenarios. These findings provide empirically grounded guidance for designing augmentation policies that enhance the robustness and transferability of echocardiography segmentation models.

2605.16421 2026-05-19 cs.LO cs.AI 版本更新

Orthologic for SAT Solving

正交逻辑用于SAT求解

Vladislas de Haldat, Simon Guilloud, Viktor Kunčak

发表机构 * EPFL(苏黎世联邦理工学院)

AI总结 本文提出一种新的正交逻辑公式蕴含判定算法,避免了先前方法的高成本预处理阶段,同时保持相同的最坏复杂度。通过合成SAT基准测试,展示了正交逻辑归一化在某些难题中的提升效果。

详情
AI中文摘要

我们提出了一种新的正交逻辑公式蕴含判定算法,避免了先前方法的高成本预处理阶段,同时保持相同的$\mathcal{O}(n^2(1+|A|))$最坏复杂度。随后,我们引入了一组合成的SAT基准测试,基于观察到对于任何公式$ϕ$,等价性$ϕ\leftrightarrow \mathrm{NF}_{\mathrm{OL}}(ϕ)$是一个蕴含式,其Tseitin编码产生难以被最先进的SAT求解器解决但具有短正交逻辑证明的不可满足实例。应用于EPFL算术电路时,我们的算法能高效地解决这些实例,而Kissat则在大量实例上超时。最后,我们展示了在某些难题上使用正交逻辑归一化作为预处理步骤可以提高SAT求解时间。

英文摘要

We present a new algorithm for deciding formula entailment in orthologic (a sound approximation of classical logic) that avoids the costly preprocessing phase of prior implementations while retaining the same $\mathcal{O}(n^2(1+|A|))$ worst-case complexity. We then introduce a family of synthetic SAT benchmarks based on the observation that, for any formula $ϕ$, the equivalence $ϕ\leftrightarrow \mathrm{NF}_{\mathrm{OL}}(ϕ)$ is a tautology whose Tseitin encoding yields unsatisfiable instances that are hard for state-of-the-art SAT solvers yet have short orthologic proofs. Applied to EPFL arithmetic circuits, our algorithm solves these instances efficiently while Kissat times out on a significant fraction. Finally, we show that using orthologic normalization as a preprocessing step can improve SAT solving time on some hard problems.

2605.16419 2026-05-19 cs.CV cs.AI cs.RO 版本更新

Agentic Pipeline for Self-Synchronized Multiview Joint Angle Monitoring in Uncalibrated Environments

基于代理的自同步多视角关节角度监控管道:在无标定环境中

Juncheng Yu, Lusi A, Haoxuan Xie, Weiming Wang

发表机构 * National Engineering Research Center of Neuromodulation, School of Aerospace Engineering, Tsinghua University(神经调制国家工程研究中心,航空航天工程学院,清华大学)

AI总结 本文提出了一种基于代理的自同步多视角关节角度监控方法,利用两台摄像头在无标定环境下实现自动视频同步和自验证,通过多模态大语言模型和先进单目2D姿态估计模型提取候选姿态,并通过代理选择机制自动识别和跟踪目标个体,以在多人和遮挡情况下产生一致的2D姿态,从而估计关节角度。

Comments Accepted by EMBC 2026. 7 pages, 3 figures

详情
AI中文摘要

运动监控在长期康复中对脊髓损伤患者至关重要,其中多视角无标记运动捕捉方法已显示出显著潜力。然而,由于依赖校准和多视角同步的困难,其在患者自行部署环境中部署仍然具有挑战性。在本工作中,我们提出了一种基于代理的自同步多视角关节角度监控管道,利用两台摄像头在无标定环境中实现自动视频同步和代理驱动的自验证。最先进的单目2D姿态估计模型用于提取候选姿态,其中应用了基于代理的选择机制,以自动识别和跟踪目标个体,从而在多人和遮挡情况下产生一致的2D姿态。此类2D姿态被优化以从无标定的多视角姿态序列中估计关节角度,通过显式的几何建模确保可解释性。与Vicon系统的验证显示了该方法的强性能,达到MAE为5.97°±2.36°和Pearson相关系数为0.962±0.014。所提出的方法预计能提供一个实用的、患者可自行部署的系统,以在无标定的家庭环境中进行日常运动监控。

英文摘要

Kinematic monitoring plays a critical role in long-term rehabilitation for patients with spinal cord injury (SCI), where multi-view markerless motion capture methods have shown significant potential. However, owing to the reliance on calibration and the difficulty of achieving multi-view synchronization, their deployment in patient self-deployed environments remains challenging. In this work, we propose an agentic pipeline for self-synchronized multi-view joint angle monitoring in uncalibrated environments using two cameras without hardware triggers. The Multimodal large language models enable automatic video synchronization and agent-driven self-verification. State-of-the-art monocular 2D pose estimation models are employed to extract candidate poses, where an agent-based selection mechanism is then applied to automatically identify and track the target subject, thereby producing consistent 2D poses in the presence of multiple individuals and occlusions. Such 2D poses are optimized to estimate joint angles from uncalibrated multi-view pose sequences, ensuring interpretability through explicit geometric modeling. Validation against Vicon system demonstrated the strong performance, achieving an MAE of $5.97^\circ \pm 2.36^\circ$ and a Pearson correlation coefficient of $0.962 \pm 0.014$. The proposed method is expected to provide a practical, patient self-deployable system to perform daily kinematic monitoring in uncalibrated home environments.

2605.16418 2026-05-19 cs.CV cs.AI 版本更新

Neural Visual Decoding via Cognitive guided Adaptive Blurring and Information Constrained Alignment

通过认知引导的自适应模糊和信息受限对齐实现神经视觉解码

Fan Yin, Chuhang Zheng, Peiliang Gong, Donghai Guan, Qi Zhu

发表机构 * Department of Artificial Intelligence, Nanjing University of Aeronautics and Astronautics(南京航空航天大学人工智能学院) Department of Electrical and Information Engineering, Tianjin University(天津大学电气与信息工程学院)

AI总结 本文提出CAIA框架,通过认知引导的自适应模糊和信息受限对齐,提升神经信号与视觉语义的映射精度,改进零样本脑-图像检索的Top-1和Top-5准确率。

详情
AI中文摘要

基于EEG的视觉解码旨在建立神经信号与视觉语义之间的映射。然而,它受到严重的信息粒度不匹配和EEG信号信噪比低的双重挑战。现有方法通常处理静态视觉特征,忽略了人类视觉的动态选择性和神经振荡的频率特异性。为此,我们提出了CAIA框架,通过认知引导的自适应模糊和信息受限对齐来弥合这一差距。在视觉侧,它模拟选择性注意以自适应地减少冗余。同时,在EEG侧,它利用神经振荡先验和信息瓶颈机制来增强信噪比。具体而言,我们设计了一种基于认知动态的自适应模糊机制,通过跨模态注意动态整合中心偏向和显著性引导的视觉线索。此外,我们引入了分布感知的边界校准损失,以稳健地纠正由异常样本引起的对齐偏差。此外,提出了一种认知引导的信息筛选方法,以选择任务相关的EEG振荡。大量实验表明,CAIA在零样本脑-图像检索中提高了受试者依赖和受试者无关的平均Top-1和Top-5准确率,显著优于现有方法。我们的工作验证了优化视觉信息密度以匹配神经粒度能提供更可解释和稳健的神经解码路径。

英文摘要

EEG-based visual decoding aims to establish a mapping between neural signals and visual semantics. However, it remains constrained by the dual challenges of severe information granularity mismatch and the low signal-to-noise ratio (SNR) of EEG signals. Existing approaches typically treat static visual features, ignoring the dynamic selectivity of human vision and the frequency specificity of neural oscillations. To bridge this gap, we propose CAIA, a Cognitive-guided Adaptive blurring with Information-Constrained Alignment framework for Neural-Visual decoding. On the visual side, it simulates selective attention to adaptively reduce redundancy. Meanwhile, on the EEG side, it leverages neural oscillation priors and the information bottleneck mechanism to enhance SNR. Specifically, we devise a cognitive-dynamics-based adaptive blurring mechanism that dynamically integrates center-biased and saliency-guided visual cues via cross-modal attention. Furthermore, we introduce a distribution-aware boundary calibration loss to robustly rectify alignment bias caused by outlier samples. Moreover, a cognitively-guided information-screening method is proposed to select task-relevant EEG oscillations. Extensive experiments demonstrate that CAIA improves both subject-dependent and subject-independent average Top-1 and Top-5 accuracy in zero-shot brain-to-image retrieval, significantly outperforming prior methods. Our work validates that optimizing visual information density to match neural granularity offers a more interpretable and robust pathway for neural decoding.

2605.16416 2026-05-19 cs.CV cs.AI 版本更新

CAVE: A Structured Credit Assignment Approach for Fragmented Visual Evidence Reasoning

CAVE:一种用于碎片化视觉证据推理的结构化信用分配方法

Tengda Guo, Jie Leng, Hanlei Li, Yaoyuan Liang, Qingyue Zhang, Dian Yang, Mingyu Zhang, Yuhua Fu, Shao-Lun Huang

发表机构 * Tsinghua University(清华大学) Peking University(北京大学) Zhejiang University of Technology(浙江工业大学)

AI总结 CAVE通过结构化过程-奖励机制提升碎片化视觉推理能力,引入三个互补信号优化推理步骤,提升模型可靠性与鲁棒性。

Comments 24 pages, 6 figures. Preprint

详情
AI中文摘要

视觉-语言模型(VLMs)在通用多模态推理中表现优异,但在整合非局部视觉信息支持语义不明确的视觉推理方面面临挑战。本文提出CAVE,一种基于GRPO的结构化过程-奖励方法,通过信念更新、证据获取和自适应聚焦控制三个信号评估中间步骤贡献,引导模型优化推理动作并学习更可靠的视觉推理策略。同时构建TRACER-Bench,涵盖四个非局部且语义易混淆的推理维度,提供关键中间证据监督推理路径。实验表明,CAVE在需要整合碎片化视觉证据的任务中显著提升性能,涵盖公开基准和新引入的TRACER-Bench,同时在通用多模态评估中保持竞争力。进一步分析显示,CAVE有效提升视觉推理能力,在长距离和深层跨区域依赖下表现更稳健。

英文摘要

Vision-Language Models (VLMs) have achieved strong performance on general multimodal reasoning, yet remain challenged in integrating nonlocal visual information to support semantically underdetermined visual reasoning. We describe this challenge as Fragmented Visual Reasoning. To this end, we propose Credit Assignment for Visual Evidence (CAVE), a structured process-reward method based on GRPO for interleaved visual reasoning. Specifically, CAVE evaluates the contribution of intermediate steps at the action level via three complementary reasoning process signals: belief update, evidence acquisition, and adaptive focus control, thereby guiding the model to optimize each reasoning action and learn more reliable visual reasoning strategies. Meanwhile, we construct TRACER-Bench, which covers four nonlocal and semantically confusable reasoning dimensions and provides key intermediate evidence to supervise reasoning paths. Experiments demonstrate that CAVE substantially improves performance on tasks requiring fragmented visual evidence integration, covering both public benchmarks and our newly introduced TRACER-Bench, while retaining competitive performance on general multimodal evaluations. Further analyses reveal that CAVE effectively improves the visual reasoning capacity and exhibits stronger robustness under longer-range and deeper cross-region dependencies.

2605.16411 2026-05-19 cs.CV cs.AI cs.CL cs.DB cs.LG 版本更新

Reducing Hallucination in Vision-Language Models via Stage-wise Preference Optimization under Distribution Shift

通过分布偏移下的分阶段偏好优化减少视觉-语言模型中的幻觉

Qinwu Xu

发表机构 * Meta AI

AI总结 本文提出分阶段偏好优化框架,通过构建针对幻觉问题的数据集,提升视觉-语言模型的 grounded reasoning,减少幻觉并提高响应信息量。

详情
AI中文摘要

幻觉仍然是视觉-语言模型(VLMs)中的基本挑战,其中自回归生成可能因联合概率建模下的最大似然估计而产生语言上合理但物理上不一致或视觉上不 grounded 的响应。我们提出了一种分阶段偏好优化框架,通过有针对性的多模态数据构建来减少幻觉。该框架强调模糊的空间方向、物体关系、OCR不确定性以及对抗性假前提训练。幻觉负样本通过最小扰动但视觉不一致的替代品生成,使直接偏好优化(DPO)能够更好地区分 grounded 推理与 plausible 幻觉。在开源基准和现实多模态评估场景中的实验表明,改进了 grounded 一致性,减少了幻觉,并产生了更具信息量的 grounded 响应。跨模型定性评估进一步显示,所提出的多模态 LLM DPO 框架在模糊空间推理和对抗性假前提设置中比几个前沿专有 VLMs 产生更视觉 grounded 的响应。结果表明,幻觉可能不仅源于模型容量的限制,还源于自回归概率生成在弱视觉 grounding 下倾向于选择语言上合理但视觉上不一致的延续。未来工作可能探索物理一致性建模、不确定性感知的多模态推理以及超越标准自回归解码的架构替代方案。

英文摘要

Hallucination remains a fundamental challenge in vision-language models (VLMs), where autoregressive generation may produce linguistically plausible yet physically inconsistent or visually ungrounded responses due to likelihood maximization under joint probabilistic modeling. We propose a stage-wise preference optimization framework for hallucination reduction through targeted multimodal data construction. Rather than directly optimizing on generic instruction-following data, our approach progressively constructs hallucination-focused preference pairs near known failure boundaries. The framework emphasizes ambiguous spatial orientation, object relationships, OCR uncertainty, and adversarial false-premise training. Hallucinated negatives are generated through minimally perturbed yet visually inconsistent alternatives, enabling Direct Preference Optimization (DPO) to better separate grounded reasoning from plausible hallucination. Experiments on open-source benchmarks and real-world multimodal evaluation scenarios demonstrate improved grounding consistency, reduced hallucination, and more informative grounded responses. Cross-model qualitative evaluation further shows that the proposed multimodal LLM DPO framework produces more visually grounded responses than several frontier proprietary VLMs, such as in ambiguous spatial reasoning and adversarial false-premise settings. The results suggest that hallucination may arise not only from limited model capacity, but also from inherent tendencies of autoregressive probabilistic generation to favor linguistically plausible continuations under weak visual grounding. Future work may explore physical consistency modeling, uncertainty-aware multimodal reasoning, and architectural alternatives beyond standard autoregressive decoding.

2605.16398 2026-05-19 cs.RO cs.AI 版本更新

Support-Safe Variational Hybrid Filtering for Contact-Mode and Sparse-Law Recovery

支持安全的变分混合滤波器用于接触模式和稀疏定律恢复

Marios Papamichalis, Regina Ruane

发表机构 * Human Nature Lab, Yale University(耶鲁大学人类本质实验室) Department of Statistics and Data Science, The Wharton School, University of Pennsylvania(宾夕法尼亚大学统计与数据科学系,沃顿商学院)

AI总结 本文提出VHYDRO变分混合动力学习器,通过混合学习的提案与可行转换律,防止分支丢失,实现连续状态和离散接触模式的联合推断,并在稀疏端-哈密顿定律恢复中提供三种保障。

详情
AI中文摘要

接触丰富的机器人动力学是混合的:单个观测可以匹配多个潜在状态和接触模式(自由、冲击、粘滑)。标准的退火滤波器不将概率分配给可行的接触转换将永久失去机器人实际遵循的分支。我们介绍了VHYDRO,一种变分混合动力学习器,防止这种分支丢失。在每一步中,VHYDRO混合学习的提案与可行转换律,然后进行采样和重要加权,确保模型可行的载体保留的每个转换都得到覆盖。VHYDRO联合推断连续的潜在状态和离散接触模式,并为每个恢复的模式拟合稀疏端-哈密顿定律。在此基础上,三种保证连接:支持覆盖稳定了滤波,稳定后的滤波将离散接触后验集中在一致的模式上,且模式纯段允许稀疏端-哈密顿恢复。恢复误差清晰地分为滤波、导数、模式不纯和物理残差部分。三种经验发现跟踪相同的机制。在重遮挡下,支持安全的滤波器保持可用,而非防御性的提案会崩溃。在ManiSkill演示和四个Sawyer/BridgeData任务家族上,离散状态形成时间一致的接触模式段,离散状态在ARI、变化点F1和段纯度上比事后和模式自由基线更强。在已知方程的混合系统中,模式条件的稀疏拟合恢复了活跃的物理项;纯预测基线则不能。

英文摘要

Contact-rich robot dynamics are hybrid: a single observation can match several latent states and contact regimes (free, impact, stick--slip). A standard amortized filter that places no probability on a feasible contact transition will permanently lose the branch the robot actually follows. We introduce VHYDRO, a variational hybrid dynamics learner that prevents this branch loss. At each step, VHYDRO mixes the learned proposal with a feasible transition law before sampling and importance weighting, ensuring that every transition retained by the model-feasible carrier remains covered. VHYDRO jointly infers a continuous latent state and a discrete contact mode, and fits a sparse port-Hamiltonian law to each recovered regime. On top of this, three guarantees connect: support coverage stabilizes filtering, the stabilized filter concentrates the discrete contact posterior on coherent regimes, and mode-pure segments admit sparse port-Hamiltonian recovery. The recovery error separates cleanly into filtering, derivative, mode-impurity, and physics-residual parts. Three empirical findings track the same mechanism. Under heavy occlusion the support-safe filter stays usable while a non-defensive proposal collapses. On ManiSkill demonstrations and on four Sawyer/BridgeData task families the discrete state forms temporally coherent contact-regime segments that the discrete state yields a stronger joint profile across ARI, change-point F1, and segment purity than post-hoc and mode-free baselines. On hybrid systems with known equations the mode-conditioned sparse fit recovers the active physical terms; purely predictive baselines do not.

2605.16397 2026-05-19 cs.CV cs.AI 版本更新

Trajectory-Aware Adaptive Inference in Object Detection Models

轨迹感知的自适应推理在目标检测模型中

Grigorios Papanikolaou, Ioannis Kontopoulos, Giannis Spiliopoulos, Dimitris Zissis, Konstantinos Tserpes

发表机构 * Department of Electrical and Computer Engineering, National Technical University of Athens, Greece(电子与计算机工程系,国家技术大学亚历山大学院,希腊) Department of Product and Systems Design Engineering, University of the Aegean, Syros, Greece(产品与系统设计工程系,爱琴海大学,西罗斯,希腊)

AI总结 本文提出利用GPS轨迹数据优化目标检测模型的推理过程,通过引入早退机制减少计算成本,提升实时感知效率。

Comments Accepted to the MuseKDE workshop of the IEEE MDM 2026 conference

详情
AI中文摘要

随着自主水下导航中传感器的集成,大规模多模态数据集的出现对高效实时感知提出了挑战。在这样的系统中,目标检测和附近船只轨迹感知紧密耦合,尤其是在动态环境中。然而,目标检测模型在推理过程中的效率常被忽视。为此,我们基于现有目标检测框架,将GPS轨迹数据纳入推理过程,实现输入自适应计算。具体来说,在基于YOLOv8的检测器中引入早退机制,结合运动线索(如船舶间距离)。分离距离短且高速接近的船舶帧使用完整模型处理,而其他帧仅激活网络的一部分架构。通过利用物体间距离和距离减少速率评估帧或帧集的难度(或场景复杂度)。实验结果表明,该策略在保持满意检测性能的同时,显著减少了推理时间和计算成本,从而在准确性和效率之间实现了灵活的权衡,相比完整模型推理。

英文摘要

The increasing integration of sensors in autonomous maritime navigation has led to large-scale multimodal datasets, raising challenges in achieving efficient real-time perception. In such systems, object detection and trajectory perception of nearby vessels are tightly coupled, particularly in dynamic environments such as maritime navigation. However, the efficiency of object detection models during inference remains an often-overlooked aspect. To this end, we build upon an existing object detection framework by incorporating GPS trajectory data into the inference process to enable input-adaptive computation. Specifically, we introduce an early-exit mechanism in a YOLOv8-based detector that incorporates motion cues - such as inter-vessel distances. Frames of vessels that are separated by short distances, converging with high speed, are processed using the full model, while only a subset of the network's architecture is activated otherwise. The difficulty degree (or scene complexity) of a frame or set of frames per second is evaluated by leveraging inter-object distance and the rate at which the distance between them decreases. Experimental results demonstrate that this strategy maintains satisfactory detection performance while significantly reducing inference time and computational cost, thus enabling a flexible trade-off between accuracy and efficiency compared to full-model inference.

2605.16393 2026-05-19 cs.CV cs.AI 版本更新

Vision Transformer-Conditioned UNet for Domain-Adaptive Semantic Segmentation

基于 Vision Transformer 的 UNet 用于领域自适应语义分割

Joel Valdivia Ortega, Tingying Peng, Marion Jasnin

发表机构 * Helmholtz Pioneer Campus, Helmholtz Munich, Neuherberg, Germany(海德堡先锋校园,海德堡穆恩奇,纽赫尔伯格,德国) School of Computation, Information and Technology, TUM, Garching, Germany(计算、信息与技术学院,技术大学慕尼黑,冈辛,德国) Department of Chemistry, TUM, Garching, Germany(化学系,技术大学慕尼黑,冈辛,德国)

AI总结 本文提出 ViTC-UNet,通过可学习令牌和双向注意力解码器将预训练 ViT 表示条件化于 UNet,以提升生物医学语义分割的精度与适应性。

详情
AI中文摘要

语义分割在生物医学研究中至关重要,但 Vision Transformers(ViTs)在该领域仍存在性能差距,尤其在稀疏、精细结构和低信噪比目标上。我们部分归因于可提示 ViT 模型中常用的轻量级像素解码器,可能缺乏高精度生物医学掩码所需的局部归纳偏置。我们通过引入 ViTC-UNet,通过可学习令牌和双向注意力解码器将预训练 ViT 表示条件化于 UNet,结合 ViT 的全局视觉先验与 UNet 的局部归纳偏置和高分辨率解码能力,同时避免端到端 ViT 微调,即使在跨领域设置中。ViTC-UNet 在 MRI 和 CT 模态的语义分割任务中均优于基线结果,证明了结构条件化的 UNet 解码可有效适应大规模视觉先验到高复杂度的生物医学分割。

英文摘要

Semantic segmentation is essential for analysing anatomical features in biomedical research, yet a performance gap remains for Vision Transformers (ViTs) in the field, particularly for sparse, fine-structured, and low signal-to-noise targets. We attribute this challenge in part to the lightweight pixel decoders commonly used in promptable ViT models, who may lack the local inductive bias needed for high-precision biomedical masks. We bridge this gap by introducing ViTC-UNet, which conditions a UNet on frozen pre-trained ViT representations through learnable tokens and a two-way attention decoder. This combines ViT global visual priors with the local inductive bias and high-resolution decoding capacity of UNets, while avoiding end-to-end ViT fine-tuning even in cross-domain settings. ViTC-UNet outperforms baseline results in semantic segmentation tasks across MRI and CT modalities, demonstrating that structure-conditioned UNet decoding can efficiently adapt large-scale visual priors to high-complexity biomedical segmentation.

2605.16391 2026-05-19 eess.SP cs.AI cs.LG cs.RO 版本更新

Overcoming the Intrinsic Performance Limitations of MEMS IMU via Diffusion-Based Generative Learning

通过扩散生成学习克服MEMS惯性测量单元的固有性能限制

Jiarui Lv, Feng Zhu, Xiaohong Zhang

发表机构 * School of Geodesy and Geomatics, Wuhan University(武汉大学测绘学院) Hubei Luojia Laboratory, Wuhan University(湖北珞珈实验室) Chinese Antarctic Center of Surveying and Mapping, Wuhan University(中国极地测绘南极科考中心)

AI总结 本文提出基于扩散的生成学习框架,利用低成本IMU数据生成高保真虚拟IMU数据,提升定位和姿态估计性能,并在空中测绘中验证了其有效性。

详情
AI中文摘要

惯性测量单元(IMUs)是多源集成导航系统中的基本传感组件,其性能直接影响解决方案的精度和可靠性。然而,低成本IMUs的精度受硬件限制。最近,生成式人工智能在建模复杂数据分布和重建高保真信号方面表现出色。受此启发,我们提出了一种基于扩散的生成学习框架,用于从低成本IMU测量中合成高保真虚拟IMU数据。具体而言,基于U-Net架构构建了条件扩散模型,其中高质量IMU测量用作先验真实数据,低成本IMU测量作为条件输入。模型生成的虚拟IMU数据用于后续导航和定位任务。实验结果表明,生成的虚拟IMU数据在定位和姿态估计方面均显著优于原始低成本IMU测量。此外,我们将模型转移到空中测绘实验中,其中所提出的方法产生了更薄且一致的点云。总体而言,所提出的框架突破了低成本IMU的性能限制,并展示了扩散基于生成学习在虚拟高质量IMU数据方面的潜力。

英文摘要

Inertial measurement units (IMUs) are fundamental sensing components in multi-source integrated navigation systems, and their performance directly determines the accuracy and reliability of solutions. However, the precision of low-cost IMUs is inherently constrained by hardware limitations. Recently, generative artificial intelligence has demonstrated remarkable capability in modeling complex data distributions and reconstructing high-fidelity signals. Motivated by this, we propose a diffusion-based generative learning framework for synthesizing high-fidelity virtual IMU data from low-cost IMU measurements. Specifically, a conditional diffusion model based on a U-Net architecture is constructed, where high-grade IMU measurements are utilized as ground-truth priors and low-cost IMU measurements are employed as conditional inputs. The virtual IMU data generated by the model is used for subsequent navigation and localization tasks. Experimental results demonstrate that the generated virtual IMU data significantly outperform the original low-cost IMU measurements in both positioning and attitude estimation. Furthermore, we transfer the model to airborne mapping experiments, where the proposed method produces thinner and more consistent point clouds. Overall, the proposed framework breaks the performance limits of low-cost IMU and demonstrates the potential of diffusion-based generative learning for virtual high-grade IMU data.

2605.16389 2026-05-19 cs.RO cs.AI cs.SY eess.SY 版本更新

Haptic Rendering of Fractional-Order Viscoelasticity: Passivity and Rendering Fidelity

触觉渲染中的分数阶粘弹性:被动性和渲染保真度

Gorkem Gemalmaz, Harun Tolasa, Volkan Patoglu

发表机构 * Faculty of Engineering and Natural Sciences(工程与自然科学学院)

AI总结 本文研究分数阶粘弹性模型在有限记忆离散化下的被动性与渲染性能,推导闭式表达式确保触觉渲染的被动性,并通过实验验证理论结果及人感知的真实感。

Comments Under review for publication in IEEE Transactions on Robotics

详情
AI中文摘要

触觉渲染具有蠕变和应力松弛特性的粘弹性材料对于许多应用至关重要,如使用真实生物组织模型的医学培训。分数阶粘弹性模型提供了一种有效描述本质上时间依赖动态的方法,仅需少量参数,因为这些模型可以自然捕捉记忆效应。在本研究中,我们分析了分数阶粘弹性模型在有限记忆离散化下的被动性和渲染性能。我们推导出闭式表达式,以确保基于Grunwald-Letnikov导数的分数阶(FO)标准线性固体(SLS)模型的触觉渲染被动性。我们还提供了此类FO-SLS模型的有效刚度和阻尼的符号表达式。所得到的被动性条件构成了一个统一的框架,该框架推广了之前报告的整数阶凯尔文-沃伊特、麦克斯韦和SLS模型的结果,因为这些结果是新推导条件的特殊情况。此外,我们还提供了理论被动性界限的实验验证和对FO-SLS模型感知真实感的人类受试者评估。总体而言,本研究建立了在有限记忆离散化下的分数阶粘弹性渲染的统一理论框架和实验评估。

英文摘要

Haptic rendering of viscoelastic materials that exhibit creep and stress relaxation is crucial for many applications, such as medical training with realistic biological tissue models. Fractional-order viscoelastic models provide an effective means of describing intrinsically time-dependent dynamics with few parameters, as these models can naturally capture memory effects. In this study, we present analyses of passivity and rendering performance for fractional-order viscoelastic models under finite-memory discretization. We derive closed-form expressions to ensure the passivity of haptic rendering with a fractional-order (FO) standard linear solid (SLS) model based on Grunwald-Letnikov derivative under short-memory discretization. We also provide symbolic expressions for the effective stiffness and damping of such FO-SLS models. The resulting passivity conditions constitute a unified framework that generalizes previously reported results for integer-order Kelvin-Voigt, Maxwell, and SLS models, since these results are special cases of the newly derived condition. Furthermore, we provide experimental validations of the theoretical passivity bounds and human-subject evaluations of perceived realism of FO-SLS models. Overall, this study establishes a unified theoretical framework and experimental evaluations for FO viscoelastic rendering under short-memory discretization.

2605.16387 2026-05-19 cs.CV cs.AI 版本更新

Stabilizing Temporal Inference Dynamics for Online Surgical Phase Recognition

稳定在线手术阶段识别的时序推断动态

Yang Liu, Ning Zhu, Jingjing Peng, Xiwu Chen, Alejandro Granados, Guotai Wang, Sebastien Ourselin

发表机构 * King's College London, London, UK(伦敦国王学院) University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文提出统一框架稳定时序推断动态,通过TEC损失抑制误差传播,EGTP强制证据驱动状态转移,TFI衡量时间碎片化,提升稳定性并减少预测碎片化。

Comments Early accepted by MICCAI 2026

详情
AI中文摘要

在线手术阶段识别(SPR)模型可达到高帧级准确性,但其预测往往缺乏时序稳定性,导致工作流理解碎片化并降低下游辅助的可靠性。本文表明这种不稳定性并非随机噪声,而是源于两个机制:早期误分类会破坏时序特征状态并传播形成误差级联,而阶段转换遵循证据积累动态,但大多数在线SPR系统依赖无记忆的帧级决策,使其对短暂置信度波动敏感。我们提出一个统一的训练-推断-评估框架,通过模型无关、即插即用的组件显式稳定时序推断动态。在训练中,时序误差级联(TEC)损失通过稳定时序特征演变抑制误差起始并缓解误差传播。在推断中,证据门控转换预测器(EGTP)强制证据驱动的状态转移,仅在积累证据超过置信度边界时允许阶段变化。在评估中,我们引入时间碎片化指数(TFI),一个可靠性感知的度量标准,量化由不稳定性引起的时序分歧,超越传统帧级和基于标记的度量。在Cholec80和AutoLaparo上跨三个代表性backbone的实验表明,所提框架显著提高了时序稳定性并减少了预测碎片化,同时保持或略微提高了帧级性能。

英文摘要

Online Surgical Phase Recognition (SPR) models can reach high frame-wise accuracy, yet their predictions often lack temporal stability, fragmenting workflow understanding and reducing the reliability of downstream assistance. We show that this instability is not random noise but arises from two mechanisms: early misclassifications corrupt temporal feature states and propagate forward to form error cascades, and phase transitions follow evidence-accumulation dynamics whereas most online SPR systems rely on memoryless frame-wise decisions, making them sensitive to transient confidence fluctuations. We propose a unified Train-Inference-Evaluation framework that explicitly stabilizes temporal inference dynamics using model-agnostic, plug-and-play components. For training, the Temporal Error-Cascade (TEC) loss suppresses error onset and mitigates forward error propagation by stabilizing temporal feature evolution. For inference, the Evidence-Gated Transition Predictor (EGTP) enforces evidence-driven state transitions, allowing phase changes only when accumulated evidence exceeds a confidence boundary. For evaluation, we introduce the Temporal Fragmentation Index (TFI), a reliability-aware metric that quantifies instability-induced temporal disagreement beyond conventional frame-wise and token-based measures. Experiments on Cholec80 and AutoLaparo across three representative backbones show that the proposed framework substantially improves temporal stability and reduces prediction fragmentation, while maintaining or modestly improving frame-wise performance.

2605.16384 2026-05-19 cs.CV cs.AI 版本更新

Mutual Enhancement Between Global Tokens and Patch Tokens: From Theory to Practice

全局标记与补丁标记之间的相互增强:从理论到实践

Xiusheng Huang, Xin Jiang, Jun Zhao, Kang Liu, Yequan Wang

发表机构 * The Key Laboratory of Cognition and Decision Intelligence for Complex Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China(认知与决策智能复杂系统重点实验室,自动化研究所,中国科学院,北京,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 本文提出TaTok框架,通过引入全局标记和动态令牌过滤算法,解决现有方法中信息不足和冗余问题,提升图像令牌化效果和推理速度。

Comments 21 pages, 8 figures

详情
AI中文摘要

准确有效的离散图像令牌化对长图像序列处理至关重要。然而,当前方法以固定比率压缩所有内容,忽视了图像中信息密度的变化,导致冗余或信息丢失。受信息熵启发,我们提出TaTok,一种理论指导的自适应图像令牌化框架。我们严格识别现有方法的两个关键问题:仅使用补丁令牌重建图像时的信息不足,以及补丁令牌之间的信息冗余。为此,我们引入全局令牌来建模补丁令牌之间的互信息,并基于累积条件熵的动态令牌过滤(DTF)算法来消除冗余。实验证实TaTok的最先进性能,实现了1.3倍gFID提升和8.7倍推理加速。通过根据信息丰富度分配令牌,TaTok实现了更压缩但更准确的图像令牌化,为未来研究提供了有价值的见解。

英文摘要

Accurate and effective discrete image tokenization is crucial for long image sequence processing. However, current methods rigidly compress all content at a fixed rate, ignoring the variable information density of images and leading to either redundancy or information loss. Inspired by information entropy, we propose TaTok, a Theoretically grounded adaptive image Tokenization framework. We rigorously identify two key drawbacks in existing methods: information insufficiency when reconstructing images with patch tokens alone, and information redundancy among patch tokens. To address these, we introduce global tokens that model mutual information across patch tokens, and a Dynamic Token Filtering (DTF) algorithm based on cumulative conditional entropy to eliminate redundancy. Experiments confirm TaTok's state-of-the-art performance, delivering a 1.3x gFID improvement and 8.7x inference speedup. By allocating tokens according to information richness, TaTok enables more compressed yet accurate image tokenization, offering valuable insights for future research.

2605.16383 2026-05-19 cs.CV cs.AI stat.ML 版本更新

A neurosymbolic Approach with Epistemic Deep Learning for Hierarchical Image Classification

一种结合知识符号学习与认知深度学习的分层图像分类方法

Ezel Kilicdere, Shireen Kudukkil Manchingal, Fabio Cuzzolin

发表机构 * Institute for AI, Data Analysis and Systems (AIDAS) School of Engineering, Computing and Mathematics, Oxford Brookes University, UK(人工智能、数据分析和系统研究所(AIDAS)工程、计算与数学学院,英国奥克斯福德布鲁克斯大学)

AI总结 本文提出一种统一的神经符号和认知建模框架,通过融合Swin Transformer、焦点集推理和可微模糊逻辑,提升分层图像分类的准确性和逻辑一致性。

Comments 36 pages

详情
AI中文摘要

深度神经网络在图像分类任务中实现高精度,但往往产生过于自信的预测,无法表达认知不确定性,并违反数据中存在的逻辑或结构约束。这些局限性在分层分类中尤为明显,因为细粒度和粗粒度的预测必须保持一致。本文首次提出一种统一的神经符号和认知建模框架,通过融合Swin Transformer、焦点集推理和可微模糊逻辑,将标签视为孤立类别,而是在学习的嵌入空间中诱导数据驱动的焦点集,帮助捕捉多个可能细粒度类别的认知不确定性。这些焦点集构成了一个基于信念理论的层,利用模糊隶属函数和t-范数合取来鼓励细粒度和粗粒度预测之间的一致性。可学习的损失进一步平衡校准、质量正则化和逻辑一致性,使模型能够自适应地权衡符号结构与数据驱动的证据。在分层图像分类实验中,本文框架在与Transformer基线相当的准确性的同时,提供更校准和可解释的预测,减少过度自信并强制在分层输出中保持高逻辑一致性。实验结果表明,结合焦点集推理与模糊逻辑为深度学习模型提供了实际步骤,使其既准确又具有认知意识。

英文摘要

Deep neural networks achieve high accuracy on image classification tasks. Yet, they often produce overconfident predictions as which fail to express epistemic uncertainty, and frequently violate logical or structural constraints present in the data. These limitations are particularly pronounced in hierarchical classification, where predictions across fine and coarse levels must remain coherent. We propose, for the first time, a unified neurosymbolic and epistemic modelling framework that augments Swin Transformers with focal set reasoning and differentiable fuzzy logic. Rather than treating labels as isolated categories, our method induces data-driven focal sets within the learnt embedding space, which helps capture epistemic uncertainty over multiple plausible fine-grained classes. These focal sets form the basis of a belief-theoretic layer that uses fuzzy membership functions and t-norm conjunctions to encourage consistency between fine- and coarse-grained predictions. A learnable loss further balances calibration, mass regularisation, and logical consistency, allowing the model to adaptively trade off symbolic structure with data-driven evidence. In experiments on hierarchical image classification, our framework maintains accuracy on par with transformer baselines while providing more calibrated and interpretable predictions, reducing overconfidence and enforcing high logical consistency across hierarchical outputs. Our experimental results show that combining focal set reasoning with fuzzy logic provides a practical step toward deep learning models that are both accurate and epistemically aware.

2605.16381 2026-05-19 cs.CV cs.AI 版本更新

StreamPro: From Reactive Perception to Proactive Decision-Making in Streaming Video

StreamPro: 从反应式感知到主动决策的流视频处理

Ao Li, Zihan Xiao, Zihao Yue, Boshen Xu, Linli Yao, Jiaze Li, Pei Fu, Jianzhong Ju, Jian Luan, Qin Jin

发表机构 * AIM3 Lab, Renmin University of China(中国人民大学AIM3实验室) MiLM Plus, Xiaomi Inc.(小米公司MiLM Plus)

AI总结 StreamPro通过引入CB-Stream损失和GRPO算法,提升流视频处理的主动决策能力,在StreamPro-Bench上取得显著成效,性能优于先前最佳。

详情
AI中文摘要

主动流视频理解需要模型持续处理视频流并决定何时响应,而非仅仅确定响应内容。这自然引入了部分观察下的决策问题,模型需在早期预测与充分证据之间平衡。然而,现有基准大多遵循“看见再回答”范式,响应仅在明确证据出现后触发,将主动推理缩减为延迟感知。因此,它们无法评估模型在不完整观察下的及时性和可靠性决策能力。此外,训练主动模型本身具有挑战性,因为流轨迹中沉默与响应信号之间存在极端不平衡,且需要联合优化响应准确性和时机。为解决这些问题,我们引入StreamPro-Bench,从感知理解、时间推理和主动代理三个互补视角评估流模型。其中,主动代理衡量模型在部分观察下的早期但可靠决策能力。我们进一步提出StreamPro,一种两阶段训练框架用于主动学习。首先,我们引入CB-Stream损失以缓解监督不平衡问题。然后,我们应用基于多粒度奖励设计的分组相对策略优化(GRPO)。实验表明,StreamPro显著提升了主动性能。在StreamPro-Bench上,其达到41.5,远超先前最佳(10.4),同时在实时流基准测试中也表现优异,达到78.9分。

英文摘要

Proactive streaming video understanding requires models to continuously process video streams and decide when to respond, rather than merely what to respond. This naturally introduces a decision-making problem under partial observations, where models must balance early prediction against sufficient evidence. However, existing benchmarks largely follow a "see-then-answer" paradigm, where responses are triggered only after explicit evidence appears, effectively reducing proactive reasoning to delayed perception. As a result, they fail to evaluate a model's ability to make timely and reliable decisions under incomplete observations. Moreover, training proactive models is inherently challenging due to the extreme imbalance between silence and response signals in streaming trajectories, as well as the need to jointly optimize response correctness and timing. To address these challenges, we introduce StreamPro-Bench, a new benchmark that evaluates streaming models from three complementary perspectives: Perception Understanding, Temporal Reasoning, and Proactive Agency, where the last measures a model's ability to make early yet reliable decisions under partial observations. We further propose StreamPro, a two-stage training framework for proactive learning. First, we introduce CB-Stream Loss to mitigate the severe supervision imbalance during supervised fine-tuning (SFT). Then, we apply Group Relative Policy Optimization (GRPO) with a multi-grained reward design that involves both turn-level and trajectory-level rewards. Experiments show that StreamPro significantly improves proactive performance. On StreamPro-Bench, it achieves 41.5, substantially outperforming the previous best (10.4), while also maintaining strong performance on real-time streaming benchmarks, achieving 78.9 on StreamingBench-RTVU.

2605.16380 2026-05-19 cs.LG cs.AI 版本更新

ReTAMamba: Reliability-Aware Temporal Aggregation with Mamba for Irregular Clinical Time Series Prediction

ReTAMamba:基于Mamba的可靠性感知时间聚合用于不规则临床时间序列预测

Jinwoong Kim, Sangjin Park

发表机构 * Department of Industrial Data Engineering Hanyang University Seoul Republic of Korea(工业数据工程系 首尔国立翰阳大学 韩国) Hanyang University(国立翰阳大学)

AI总结 ReTAMamba通过时间变量标记序列重构临床时间序列,利用缺失性和时间间隔估计观测可靠性,并通过时间编织整合短期和长期时间信息,提升不规则时间序列预测性能。

Comments 11 pages

详情
AI中文摘要

临床时间序列数据难以用常规方法建模,因其表现出不规则采样、频繁缺失值和变量异质性。现有方法通常使用观测掩码和时间间隔信息,但无法持续捕捉过去观测的衰减可靠性或在聚合过程中保持一致的时序上下文。为此,我们提出了Reliability-aware Temporal Aggregation with Mamba(ReTAMamba),将临床时间序列重建为时间变量标记序列,从缺失性和经过时间估计观测可靠性,并将区间总结与统计描述符相结合。通过时间编织整合短期和长期时间信息,并应用预算标记路由器约束序列长度同时保留信息性总结。在MIMIC-IV、eICU和PhysioNet 2012上的实验表明,ReTAMamba在强基线模型上一致提升了AUPRC,平均相对提升分别为7.51%、7.80%和10.15%。eICU的队列和患者层面分析显示,学习到的动态信号(如心率和血压)的均值衰减比相对静态信号(如实验室变量)大24.3%。这些发现表明,有效预测不规则临床时间序列需要建模不仅测量了什么,还要何时以及如何观测,包括信息新鲜度和观测及时性。

英文摘要

Clinical time-series data are difficult to model with methods designed for regular sequences because they exhibit irregular sampling, frequent missing values, and heterogeneous observation patterns across variables. Existing approaches commonly use observation masks and time-gap information, but they do not continuously capture the decaying reliability of past observations or consistently organize multi-resolution information within a coherent temporal context during aggregation. To address these limitations, we propose Reliability-aware Temporal Aggregation with Mamba (ReTAMamba), which reconstructs clinical time series as time-variable token sequences, estimates observation reliability from missingness and elapsed time, and augments interval summaries with statistical descriptors. Chronological Weaving is used to integrate short- and long-term temporal information within a coherent temporal context, and a budgeted token router is applied to constrain sequence length while preserving informative summaries. Experiments on MIMIC-IV, eICU, and PhysioNet 2012 show that ReTAMamba consistently improves AUPRC over strong baselines, with average relative gains of 7.51%, 7.80%, and 10.15%, respectively. Cohort-level and patient-level analyses on eICU further showed that the learned mean decay for more dynamic signals, such as heart rate and blood pressure, was 24.3% larger than that for relatively static signals, such as laboratory test variables. These findings suggest that effective prediction in irregular clinical time series requires modeling not only what was measured, but also when and how it was observed, including information freshness and observation timeliness.

2605.16379 2026-05-19 cs.LG cs.AI cs.IT math.IT 版本更新

An Information-Theoretic Criterion for Efficient Data Synthesis

一种信息论准则用于高效数据合成

Hanyu Li, Zhengqi Sun, Xiaotie Deng

发表机构 * CFCS, School of Computer Science, Peking University, Beijing, China(计算机科学系,北京大学,北京,中国) Department of Information Management, Peking University, Beijing, China(信息管理系,北京大学,北京,中国)

AI总结 本文提出信息开放循环的准则,指出合成数据的有效性取决于外部信号注入任务相关信息,从而提升模型效率与泛化能力。

Comments 12 pages. Camera-ready version for ICML 2026

详情
AI中文摘要

合成数据在大语言模型训练中变得至关重要,但其效果高度不一致。本文从信息论角度解释这种不一致:合成数据只有在生成-训练循环信息开放(即由外部信号塑造)时,才能提升模型性能。当循环信息封闭(依赖模型自身输出)时,数据处理不等式确保任务相关信息只能减少,导致崩溃。在信息开放管道中,效率和泛化依赖于元级监督:较粗的信号如二元正确性将所有可接受输出视为等同,因此其教导的行为不绑定特定领域或表层形式,能自然泛化到不同任务和领域。这些观察得出指导性论点:学习倾向于收敛到最信息高效的信号组件,当该组件为预期时加速学习,但当存在伪模式时导致奖励黑客。

英文摘要

Synthetic data becomes crucial for large language model training, but its effectiveness is highly inconsistent. We provide an information-theoretic account of this inconsistency: synthetic data improves a model only when the generation-training loop is information-open, i.e., shaped by external signals (verifiers, environments, or rubrics) that inject task-relevant information beyond the model's current distribution. When the loop is information-closed (relying on the model's own outputs without such signals), the data processing inequality ensures that task-relevant information can only decrease, making collapse a predicted outcome. Among information-open pipelines, both efficiency and generalization hinge on the meta-level of supervision: a coarser signal such as binary correctness treats all acceptable outputs as equivalent, so the behavior it teaches is not tied to any particular domain or surface form and generalizes naturally across tasks and domains. These observations lead to a guiding thesis: learning preferentially converges to the most information-efficient signal component available, which accelerates learning when that component is the intended one, but causes reward hacking when a spurious pattern happens to be simpler.

2605.16378 2026-05-19 cs.LG cs.AI 版本更新

Mixing Times of Glauber Dynamics on Masked Language Models

掩码语言模型上Glauber动力学的混合时间

Suvadip Sana, Sami Wolf, Neer Mehta, Alina Shah, Aitzaz Shaikh, Janna Goodman, Lionel Levine

发表机构 * Department of Statistics and Data Science(统计与数据科学系) Cornell University(康奈尔大学) Department of Mathematics(数学系)

AI总结 研究掩码语言模型迭代生成时的全局分布行为,通过Glauber动力学马尔可夫链分析其混合时间,揭示在不同温度下混合行为的相变现象。

Comments 21 pages, 7 figures

详情
AI中文摘要

掩码语言模型(MLMs)定义了令牌的局部条件分布,但通常不对应任何一致的序列联合分布。这提出了一个根本性问题:当此类条件在生成中迭代使用时,会诱导出何种全局分布行为?本文通过将迭代的掩码令牌重采样建模为离散令牌序列上的Glauber动力学马尔可夫链来回答这一问题。我们首先证明MLM条件本质上是不相容的:引入了一个矩形测试来验证这种不相容性,并实证验证其在现代MLM中的普遍性。然后我们对由此诱导的马尔可夫链进行了理论分析。在有限的跨令牌影响下,我们建立了高温度收缩结果,表明混合时间为O(n log n),其中n是序列长度。相反,在均匀局部边际条件下,链表现出 metastability,低温下缓慢逃离语义盆地。实证上,我们展示了混合行为随温度和序列长度的变化呈现相变,与理论预测一致。我们进一步通过语义轨迹表征诱导的平稳行为,识别出持久结构如长寿命陷阱和复发语义盆地,政治内容作为可测量的案例研究。

英文摘要

Masked language models (MLMs) define local conditional distributions over tokens but do not, in general, correspond to any consistent joint distribution over sequences. This raises a fundamental question: what global distributional behavior is induced when such conditionals are used iteratively for generation? We address this question by modeling iterative masked-token resampling as a Glauber dynamics Markov chain on the discrete space of token sequences. We first show that MLM conditionals are intrinsically incompatible: we introduce a rectangle test that certifies this incompatibility and empirically verify its prevalence across modern MLMs. We then provide a theoretical analysis of the induced Markov chain. Under bounded cross-token influence, we establish a high-temperature contraction result implying $O(n\log n)$ mixing time where $n$ is the sequence length. In contrast, we prove that under a uniform local margin condition, the chain exhibits metastability, with exponentially slow escape from semantic basins at low temperatures. Empirically, we demonstrate a phase transition in mixing behavior as a function of temperature and sequence length, consistent with the theoretical predictions. We further characterize the induced stationary behavior through semantic trajectories, identifying persistent structures such as long-lived traps and recurrent semantic basins, with political content serving as a measurable case study.

2605.16377 2026-05-19 cs.DL cs.AI cs.LG 版本更新

CheckSupport: A Local LLM-Powered Tool for Automated Manuscript Submission Checklist Selection and Completion

CheckSupport:一种基于本地LLM的自动化手稿提交检查清单选择与完成工具

Satvik Tripathi, Don Enwerem, Kevin Song, Kristian Quevada, Jacinta Arnold, Tessa S. Cook

发表机构 * Department of Radiology, Perelman School of Medicine at University of Pennsylvania(宾夕法尼亚大学佩雷尔曼医学学院放射科) Department of Computer Science, Drexel University(德雷塞尔大学计算机科学系) Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania(宾夕法尼亚大学工程与应用科学学院计算机与信息科学系) Department of Radiology, Cooper University Hospital(科珀大学医院放射科) University of California Davis Graduate School of Management(加州大学戴维斯分校管理研究生院)

AI总结 本文提出CheckSupport,利用本地LLM自动化选择和完成检查清单,提升科研报告的透明度和可重复性。系统通过分阶段提示策略实现高准确率,运行在CPU上,每篇手稿耗时12.5秒,准确率达90%。

详情
AI中文摘要

透明和标准化的报告对于可重复的科学研究至关重要,但因手动选择和完成检查清单的劳动强度,遵循报告指南仍不一致。我们提出了CheckSupport,一种开源、本地可部署的系统,利用大语言模型自动化推荐报告检查清单并完成清单。CheckSupport采用分阶段提示策略,将报告流程分解为受约束的推理任务,优先提取忠实信息而非生成文本合成。所有推理均在本地使用指令调优模型完成,保护数据隐私并实现可重复、可审计的工作流程。在同行评审手稿语料库上评估,CheckSupport在清单推荐上达到90%的整体准确率,在项目级完成上达到88%的整体准确率,运行在仅CPU硬件上。平均而言,每篇手稿的墙钟时间为12.5秒,包括检查清单推荐和完整检查清单完成。这些结果表明,当大语言模型作为结构化推理组件应用时,可以减少报告负担,支持跨学科更透明和可重复的科学研究报告。

英文摘要

Transparent and standardized reporting is essential for reproducible scientific research, yet adherence to reporting guidelines remains inconsistent because of the manual effort required to select and complete checklists. We present CheckSupport, an open-source, locally deployable system that uses large language models to automate the recommendation of reporting checklists and the evidence-grounded completion of checklists for scientific manuscripts. CheckSupport employs a staged prompting strategy that decomposes reporting workflows into constrained inference tasks, prioritizing faithful extraction over generative text synthesis. All inference is performed locally using instruction-tuned models, preserving data privacy and enabling reproducible, auditable workflows. Evaluated on a corpus of peer-reviewed manuscripts, CheckSupport achieved 90% overall accuracy for checklist recommendations and 88% overall accuracy for item-level completion while operating on CPU-only hardware. On average, the wall-clock time per manuscript was 12.5 seconds, including the checklist recommendation and full checklist completion. These results demonstrate that large language models, when applied as structured inference components, can reduce reporting burden and support more transparent and reproducible scientific reporting across disciplines.

2605.16374 2026-05-19 cs.LG cs.AI 版本更新

Lost or Hidden? A Concept-Level Forgetting in Supervised Continual Learning

丢失或隐藏?监督连续学习中的概念层面遗忘

Katarzyna Filus, Kamil Faber, Roberto Corizzo, Christopher Kanan

发表机构 * AGH University of Krakow(克拉科夫AGH大学) American University(美国大学) University of Rochester(罗切斯特大学)

AI总结 本文提出一种诊断框架,利用稀疏自编码器分析概念层面遗忘,发现遗忘主要源于表征可访问性变化而非信息擦除。

详情
AI中文摘要

持续学习研究模型如何在适应新任务的同时保留先前知识。尽管已有多种方法缓解灾难性遗忘,但该领域仍以性能为导向,缺乏对视觉模型表征空间中遗忘本质的理解。本文提出利用稀疏自编码器定义任务锚定的潜在特征空间,分析任务特定信息在更细粒度下的演变。我们分解遗忘为显性概念删除、可恢复性和解码性。结果显示,大量看似丢失的概念信息在线性假设下可恢复,而随着任务增加,概念解码性下降。总体而言,我们的发现表明,概念层面遗忘主要归因于表征可访问性变化而非完全信息擦除。

英文摘要

Continual learning studies how models can adapt to new tasks while retaining previously acquired knowledge. Although a broad spectrum of methods has been proposed to mitigate catastrophic forgetting, the field remains predominantly performance-driven, with limited insight into what forgetting actually corresponds to within the vision model's representation space. Prior work has primarily analyzed forgetting through task-level performance or coarse measures of representational drift, without disentangling output-level accessibility from changes in finer-grained internal structure. To this end, we propose a diagnostic framework that leverages Sparse Autoencoders (SAEs) to define a task-anchored latent feature space, enabling analysis of how task-specific information evolves at a finer granularity, where individual SAE latents are treated as concept proxies for recurring and relatively disentangled visual patterns in the model's internal computations. Within this framework, we decompose forgetting into apparent concept deletion, recoverability, and decodability. We show that a large portion of seemingly lost concept-level information can often be recovered under linearity assumption, with concept decodability degrading as more tasks are introduced. Overall, our findings suggest that a significant part of concept-level forgetting can be attributed to changes in the representational accessibility rather than complete information erasure.

2605.16373 2026-05-19 cs.CV cs.AI cs.LG 版本更新

Cross-Source Supervision for Bone Infection Segmentation in Dual-Modality PET-CT

跨源监督在双模态PET-CT骨感染分割中的应用

Zonglin Yang, Xiaolei Diao, Jishizhan Chen, Xiaozhuang Man, Wei Kong, Gen Wen, Pengfei Cheng, Daqian Shi

发表机构 * Shanghai Maritime University(上海海洋大学) University College London(伦敦大学学院) Shanghai Sixth People’s Hospital(上海第六人民医院) Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine(上海第六人民医院附属复旦大学医学院) Queen Mary University of London(伦敦女王玛丽大学)

AI总结 本文提出一种双模态端到端分割框架,通过早融合多模态表示整合PET代谢信号和CT骨窗解剖信息,解决标注不一致下的骨感染分割问题,采用患者级3D体积评估和交叉验证提高性能。

详情
AI中文摘要

早期和准确诊断骨感染及病变定位对临床治疗至关重要。PET-CT结合了CT的解剖信息和PET的代谢信息,是诊断骨感染的重要成像模态。然而,由于病变边界不清晰和不同专家或自动化系统生成的标注不一致,准确的病变分割仍具挑战性。本文研究了在标注不一致下的多模态分割。我们开发了一个双模态端到端分割框架,通过早融合多模态表示整合PET代谢信号和CT骨窗解剖信息。为了缓解小数据集中小切片相关性导致的性能膨胀,本研究弃用传统二维评估方法,采用严格的患者级3D体积评估和交叉验证。此外,我们提出了一种解耦的双源学习框架,其中并行模型在由高灵敏度和高特异性临床意图驱动的独立专家标注上进行训练。实验结果客观报告了患者级性能变化(均值±标准差和均值-标准差),证明了多模态PET-CT融合的有效性。交叉评估矩阵定量揭示了模型如何成功内化不同的专家诊断哲学,提供了一种稳健且保持多样性的临床AI部署范式,用于骨感染分割。

英文摘要

Early and accurate diagnosis and lesion localization of bone infections are crucial for clinical treatment. PET-CT integrates anatomical information from CT with metabolic information from PET, making it an important imaging modality for diagnosing bone infections. However, accurate lesion segmentation remains challenging due to indistinct lesion boundaries and inconsistencies in annotations generated by different experts or automated systems. In this work, we investigate multimodal segmentation of bone infections under annotation discrepancy. We develop a bimodal end-to-end segmentation framework that integrates PET metabolic signals and CT bone-window anatomy through an early-fusion multimodal representation.To mitigate performance inflation caused by inter-slice correlation in small datasets, this study discards traditional two-dimensional evaluation methods and implements a rigorous patient-level 3D volumetric evaluation and cross-validation. Furthermore, instead of forcing a singular consensus, we propose a decoupled dual-source learning framework where parallel models are trained on independent expert annotations driven by high-sensitivity and high-specificity clinical intents. Experimental results objectively report performance variations at the patient level (Mean + SD and Mean - SD), demonstrating the effectiveness of multimodal PET-CT fusion. The cross-evaluation matrix quantitatively reveals how models successfully internalize distinct expert diagnostic philosophies, providing a robust, diversity-preserving paradigm for clinical AI deployment in bone infection segmentation.

2605.16372 2026-05-19 cs.CV cs.AI cs.LG 版本更新

SwordBench: Evaluating Orthogonality of Steering Image Representations

SwordBench:评估转向图像表示的正交性

Vladimir Zaigrajew, Dawid Pludowski, Hubert Baniecki, Przemyslaw Biecek

发表机构 * Centre for Credible AI(可信人工智能中心) Warsaw University of Technology(华沙技术大学) University of Warsaw(华沙大学)

AI总结 本文提出SwordBench,用于评估视觉模型在多个backbone和概念移除任务中转向表示的正交性,引入了交叉概念鲁棒性和 collateral damage 等新评估指标,发现线性SVM在分离性和正交性上优于稀疏自编码器,但无法实现零 collateral damage。

详情
AI中文摘要

在推理时间对模型表示进行干预以校正预测对于AI可解释性和安全性至关重要,但现有评估协议局限于模糊的语言建模任务。为填补这一空白,我们引入SwordBench,一个用于评估视觉模型在多个backbone和概念移除任务中转向表示的基准。除了统一的基准测试套件外,我们还提出了新的评估概念,揭示了概念激活向量正交性对实用转向的二次影响。具体而言,交叉概念鲁棒性衡量在针对替代概念正交化输入上概念检测性能的稳定性,而collateral damage量化在缺乏偏见的输入上转向是否意外影响下游任务的模型性能。我们发现尽管线性支持向量机在分离性和正交性上表现优异,但无法实现零collateral damage,通常落后于稀疏自编码器。在更简单的环境中,标准基线和优化方法均无法实现完美的转向。源代码将很快在GitHub上发布。

英文摘要

Steering or intervening on model representations at inference time to correct predictions is essential for AI interpretability and safety, yet existing evaluation protocols are limited to ambiguous language modeling tasks. To address this gap, we introduce SwordBench, a benchmark for steering image representations of vision models across multiple backbones and concept removal tasks. Beyond a unified benchmarking suite, we propose new evaluation notions that uncover the second-order effects of orthogonalization among concept activation vectors for pragmatic steering. Specifically, cross-concept robustness measures the stability of concept detection performance across inputs orthogonalized against alternative concepts, and collateral damage quantifies whether steering inadvertently affects model performance on a downstream task for inputs lacking the bias. We find that although a linear support vector machine exhibits superior separability and orthogonality, it fails to achieve zero collateral damage, often trailing sparse autoencoders. In simpler regimes, both standard baselines and optimization-based methods fail to achieve perfect steering. The source code will be made available soon on GitHub.

2605.16371 2026-05-19 cs.CV cs.AI 版本更新

GeoSym127K: Scalable Symbolically-verifiable Synthesis for Multimodal Geometric Reasoning

GeoSym127K:可扩展的符号验证合成用于多模态几何推理

Jinhao Jing, Zheng Ma, Jinwei Liang, Qiannian Zhao, Shawn Chen, Jing Yang, Por Lip Yee, Prayag Tiwari, Jingjing Bai, Benyou Wang, Lewei Lu, Zhan Su

发表机构 * School of Information Technology, Halmstad University(哈姆斯塔德大学信息科技学院)

AI总结 本文提出GeoSym引擎,通过类型条件语法和分析SymGT求解器生成精确符号地面真实值,构建了包含51K高清图像、127K问题和55K答案验证CoT QA对的GeoSym127K数据集,并展示了其在几何推理任务中的性能提升。

详情
AI中文摘要

大型多模态模型(LMMs)在几何推理中常因视觉幻觉和缺乏数学精确的Chain-of-Thought(CoT)数据而遇到困难。为此,我们提出了GeoSym引擎,一种自动且可扩展的神经符号框架。通过利用类型条件语法和分析SymGT求解器,它能够推导出精确的符号地面真实值,并无缝整合到稳健的渲染管线中,生成高精度的几何图示。使用该引擎,我们构建了GeoSym127K,一个难度分层的数据集,包含51K高清图像、127K带有符号地面真实值的问题和55K答案验证的CoT QA对。我们还引入了GeoSym-Bench,一个由专家整理的511个复杂样本集,用于严格评估。通过广泛的监督微调(SFT),我们证明GeoSym在依赖图示和多步骤几何任务上实现了集中改进。我们的Qwen3-VL-8B模型在MathVerse Vision-Only子集上实现了绝对+22.21%的提升,并在WeMath上达到61.52%(+6.19%的改进),缓解了长距离逻辑碎片化问题,并优于先进的闭源模型如Doubao-1.8。进一步地,通过Reinforcement Learning with Verifiable Rewards(RLVR) via GRPO发现,从结构SFT检查点初始化显著提升了零样本RL的性能上限。由确定性精确匹配信号驱动,这展示了我们可验证推理合成的稳健扩展潜力。数据集和代码可在https://huggingface.co/datasets/Tomie0506/GeoSym127K和https://github.com/Tomie56/GeoSym127K获得。

英文摘要

Large Multimodal Models (LMMs) often struggle with geometric reasoning due to visual hallucinations and a lack of mathematically precise Chain-of-Thought (CoT) data. To address this, we propose the GeoSym Engine, an automated and scalable neuro-symbolic framework. By leveraging a type-conditional grammar and an analytic SymGT Solver, it derives exact symbolic ground truths and seamlessly integrates with a robust rendering pipeline to produce high-precision geometric diagrams. Using this engine, we construct GeoSym127K, a difficulty-stratified dataset featuring 51K high-resolution images, 127K questions with symbolic ground truths, and 55K answer-verified CoT QA pairs. We also introduce GeoSym-Bench, an expert-curated suite of 511 complex samples for rigorous evaluation. Through extensive supervised fine-tuning (SFT), we demonstrate that GeoSym drives concentrated improvements specifically on diagram-dependent and multi-step geometry tasks. Our Qwen3-VL-8B model gains an absolute +22.21% on the MathVerse Vision-Only subset and reaches 61.52% (+6.19% improvement) on WeMath, mitigating long-horizon logic fragmentation and outperforming advanced closed-source models like Doubao-1.8. Furthermore, applying Reinforcement Learning with Verifiable Rewards (RLVR) via GRPO reveals that initializing from structural SFT checkpoints substantially elevates the performance ceiling over zero-shot RL. Driven by deterministic exact-match signals, this showcases the robust scaling potential of our verifiable reasoning synthesis. Datasets and code are available at https://huggingface.co/datasets/Tomie0506/GeoSym127K and https://github.com/Tomie56/GeoSym127K.

2605.16366 2026-05-19 cs.CV cs.AI 版本更新

Fre-Res: Frequency-Residual Video Token Compression for Efficient Video MLLMs

Fre-Res: 频率-残差视频令牌压缩用于高效的视频多模态大语言模型

Yigui Feng, Qinglin Wang, Yang Liu, Jie Liu

发表机构 * College of Computer Science, National University of Defense Technology(计算机科学学院,国防科技大学) Shien-Ming Wu School of Intelligent Engineering, South China University of Technology(智能工程谢民明伍学院,华南理工大学)

AI总结 Fre-Res通过分离空间和时间信息,实现视频令牌压缩,在保持细节精度的同时提升效率,适用于短时事件和长视频推理。

Comments 24 pages, 5 figures

详情
AI中文摘要

视频多模态大语言模型面临空间保真度与时间覆盖度之间的矛盾:保留细粒度视觉细节需要大量空间令牌,而捕捉短暂事件需要密集的时间采样。我们提出Fre-Res,一种预算自适应的双轨视频令牌压缩框架,分别处理这两种证据形式。Fre-Res保留稀疏的高保真空间锚点,并通过紧凑的残差频域令牌表示密集的时间演变。具体而言,它对视觉潜在空间中的帧间残差轨迹应用时间1D-DCT,在其中观察到强低频集中。为对齐频域动态与原生视觉嵌入,Fre-Res引入了空间引导吸收器,将时间残差信息注入与空间锚点对应的令牌中。在细粒度短视频和长视频推理基准上,Fre-Res实现了有利的准确率-效率权衡,匹配或接近全令牌性能,同时显著减少视觉令牌长度。广泛消融实验进一步表明,时间频域残差保留因果转换线索,而空间锚点对细粒度物体和布局推理至关重要。

英文摘要

Video MLLMs face a persistent tension between spatial fidelity and temporal coverage: preserving fine-grained visual details requires many spatial tokens, while capturing short-lived events requires dense temporal sampling. We propose \textbf{Fre-Res}, a budget-adaptive dual-track video-token compression framework that separates these two forms of evidence. Fre-Res preserves sparse high-fidelity spatial anchors and represents dense temporal evolution through compact residual-frequency tokens. Specifically, it applies temporal 1D-DCT to inter-frame residual trajectories in vision-latent space, where we observe strong low-frequency concentration. To align frequency-domain dynamics with native visual embeddings, Fre-Res introduces a Spatial-Guided Absorber that injects temporal residual information into spatially corresponding anchor tokens. Across fine-grained short-video and long-video reasoning benchmarks, Fre-Res achieves a favorable accuracy--efficiency trade-off, matching or approaching full-token performance while substantially reducing visual-token length. Extensive ablations further show that temporal-frequency residuals preserve causal transition cues, while spatial anchors remain essential for fine-grained object and layout reasoning.

2605.16364 2026-05-19 cs.SD cs.AI cs.CL 版本更新

WASIL: In-the-Wild Arabic Spoken Interactions with LLMs

WASIL:真实场景下阿拉伯语口语交互与LLMs

Zien Sheikh Ali, Hamdy Mubarak, Soon-Gyo Jung, Hunzalah Hassan Bhatti, Firoj Alam, Shammur Absar Chowdhury

发表机构 * Qatar Computing Research Institute, Qatar(卡塔尔计算研究所)

AI总结 本文提出WASIL数据集,包含真实阿拉伯语口语交互数据,包含音频、ASR假说、助手回复及显式喜欢/不喜欢反馈,用于评估LLMs在真实场景下的表现。

Comments Spoken Prompts, Multilingual LLMs, Speech-based Evaluation, Dialectal Speech, Low-resource Languages, Conversational AI, Speech-to-Text QA, Real-world Interaction, Spoken Language Understanding

详情
AI中文摘要

大型语言模型(LLMs)的语音助手通常构建为自动语音识别(ASR)与LLM系统的级联系统,其中识别错误可能扭曲用户意图。不满可能还源于模糊、领域外或非请求的对话轮次,使难以分离ASR影响。我们发布WASIL(在阿拉伯语中表示连接或链接):包含真实场景下的阿拉伯语口语交互提示,包含音频、ASR假说、助手回复及显式喜欢/不喜欢反馈(8,529轮次;14.2%的不满),再加上一个包含现代标准阿拉伯语(MSA)和四种主要方言及其标签的2,000轮次测试集。我们通过多ASR协议引导的后编辑提供低成本的黄金转录,并标注回答性(可回答、模糊/需要澄清、不支持、非请求/噪声)以区分内在不可回答性与ASR引起的退化。最后,我们描述了使用多裁判LLM评分的可扩展无参考评估方法,用于评估ASR与黄金转录之间的响应。

英文摘要

Large Language Models (LLMs) voice assistants are commonly built as cascaded Automatic Speech recognition (ASR) to LLM systems, where recognition errors can distort user intent. Dislikes may also arise from ambiguous, out-of-domain, or non-request turns, making it hard to isolate ASR effects. We release WASIL (it denotes connection or linking in Arabic): in-the-wild Arabic spoken interaction prompts with audio, ASR hypotheses, assistant responses, and explicit like/dislike feedback (8,529 turns; 14.2% dislikes), plus a 2,000-turn test set covering Modern Standard Arabic (MSA) and four major dialects with their labels. We provide low-cost gold transcripts via multi-ASR agreement-guided post-editing and annotate answerability (answerable, ambiguous/needs-clarification, unsupported, not-a-request/noise) to separate intrinsic unanswerability from ASR-induced degradation. Finally, we describe scalable reference-free evaluation of responses from ASR vs. gold transcripts using multi-judge LLM scoring.

2605.16361 2026-05-19 cs.LG cs.AI stat.ML 版本更新

TailedTS: Benchmark Dataset for Heavy-Tailed Time Series Prediction and Periodicity Quantification

TailedTS:用于重尾时间序列预测和周期性量化的大规模基准数据集

Xinyu Chen, HanQin Cai, Lijun Ding, Jinhua Zhao

发表机构 * University of Central Florida(中央佛罗里达大学) University of California, San Diego(加州大学圣地亚哥分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 TailedTS数据集用于测试在重尾、零膨胀和非高斯条件下时间序列预测模型的鲁棒性,通过稀疏自回归框架揭示高频页面的周期性较弱,同时提供非高斯损失函数的标准化预测基准。

详情
AI中文摘要

我们介绍了TailedTS,一个基于2024年维基百科每小时页面浏览观测数据的大规模基准数据集,专门用于测试时间序列预测模型在重尾、零膨胀和非高斯条件下的性能。该数据集包含约2469亿个数据点,覆盖约300万个唯一维基百科页面,存储在高效的Apache Parquet格式中。维基百科流量遵循幂律分布,其中约5%的页面贡献了70%的总浏览量,为模型在极端波动下的鲁棒性提供了一个自然且严谨的测试环境。TailedTS支持多个研究任务:首先,我们引入了一个基于稀疏自回归的周期性量化框架,揭示高频页面的周期性结构显著弱于低频页面,这对大型数字平台的服务器分配和流量预测有直接意义。其次,我们提供了在一系列非高斯损失函数下的标准化预测基准,包括ℓ1范数、Huber、分位数和ℓp范数损失,表明基于高斯的估计器在高流量页面类别中性能显著下降,而鲁棒替代方案在所有流量规模上均提供一致的提升。TailedTS可在https://doi.org/10.5281/zenodo.17070469公开获取。

英文摘要

We present TailedTS, a large-scale benchmark dataset derived from Wikipedia hourly page view observations throughout 2024, specifically designed to test time series forecasting models under heavy-tailed, zero-inflated, and non-Gaussian conditions. The dataset comprises approximately 24.69 billion data points spanning roughly 3 million unique Wikipedia pages per month, stored in high-efficiency Apache Parquet format. Wikipedia traffic follows a pronounced power-law distribution where roughly 5% of pages account for over 70% of total page views, creating a natural and rigorous testbed for model robustness against extreme volatility that are absent from or underrepresented in existing benchmarks such as M4, M5, and UCI electricity datasets. TailedTS enables several research tasks. First, we introduce a periodicity quantification framework based on sparse autoregression with sparsity and non-negativity constraints, revealing that frequently-viewed pages exhibit significantly weaker periodic structure than their less-viewed counterparts, showing direct implications for server allocation and traffic forecasting on large digital platforms. Second, we provide standardized prediction benchmarks evaluated under a suite of non-Gaussian loss functions, including $\ell_1$-norm, Huber, quantile, and $\ell_p$-norm losses, demonstrating that standard Gaussian-based estimators degrade substantially on high-volume page categories, while robust alternatives provide consistent gains across all traffic scales. TailedTS is publicly available at https://doi.org/10.5281/zenodo.17070469.

2605.16360 2026-05-19 cs.LG cs.AI 版本更新

ProxyKV: Cross-Model Proxy Pruning for Efficient Long-Context LLM Inference

ProxyKV:跨模型代理剪枝用于高效长上下文LLM推理

Junjie Li, Jiong Lou, Jie Li

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 ProxyKV通过跨模型代理剪枝方法,解决LLM长上下文推理中的KV缓存内存瓶颈,实现高效推理与高精度的平衡,提升预填充速度和长上下文处理能力。

详情
AI中文摘要

高效长上下文推理在大型语言模型(LLM)中受到键值(KV)缓存内存瓶颈的严重限制,而现有剪枝方法在低延迟启发式和高精度重建方法之间做出取舍。为弥合评分成本与精度之间的差距,我们提出了ProxyKV,一种跨模型代理剪枝框架,将重要性评分卸载到轻量级的同族小型模型代理上,该代理异步执行于大型模型目标。为弥合异构模型之间的架构差距,我们设计了HybridAxialMapper,将时间特征提取与跨头对齐解耦,并设计了多粒度混合损失,将学习目标从刚性回归转向相对排名一致性。在Llama-3.1、Qwen-2.5和Qwen-3家族上,针对LongBench、SCBench和RULER等基准测试,ProxyKV在聚合层面(恢复约98.7%的平均精度)与KVZip相当,同时在Llama-3.1-8B上实现了高达3.21倍的预填充加速(双GPU;约1.5倍共享单GPU),并在Qwen-2.5-7B上支持高达170k tokens的上下文长度。

英文摘要

Efficient long-context inference in Large Language Models (LLMs) is severely constrained by the Key-Value (KV) cache memory wall, yet existing pruning methods force a choice between low-latency heuristics that sacrifice precision and high-precision reconstruction methods that incur prohibitive prefilling overhead. To bridge this scoring-cost--accuracy gap, we propose ProxyKV, a cross-model proxy pruning framework that offloads importance scoring to a lightweight intra-family Small-Model Proxy executed asynchronously to the Large-Model Target. To bridge the architectural gap between heterogeneous models, we design the HybridAxialMapper, which disentangles temporal feature extraction from cross-head alignment, together with a Multi-Granularity Hybrid Loss that shifts the learning objective from rigid regression to relative ranking consistency. Across the Llama-3.1, Qwen-2.5, and Qwen-3 families spanning targets from 7B up to 32B parameters on LongBench, SCBench, and RULER, ProxyKV matches KVZip on aggregate (recovering $\sim$$98.7\%$ of its mean accuracy) while delivering up to a $3.21\times$ prefilling speedup on Llama-3.1-8B (dual-GPU; $\sim$$1.5\times$ shared single-GPU) and sustaining the speedup at contexts up to 170k tokens on Qwen-2.5-7B.

2605.16359 2026-05-19 cs.CV cs.AI 版本更新

How Many Visual Tokens Do Multimodal Language Models Need? Scaling Visual Token Pruning with F^3A

多模态语言模型需要多少视觉标记?通过F^3A进行视觉标记剪枝的扩展

YiJie Huang, Yiqun Zhang, Zhuoyue Jia, Xiaocui Yang, Junzhao Huang, Zihan Wang, Shi Feng, Daling Wang, Yifei Zhang, Yongkang Liu

发表机构 * School of Computer Science and Engineering, Northeastern University, Shenyang 110819, China(东北大学计算机科学与工程学院,沈阳 110819,中国) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Computer and Communication Engineering, Northeastern University, Qinhuangdao 066004, China(东北大学计算机与通信工程学院,秦皇岛 066004,中国)

AI总结 本文提出F^3A方法,通过任务条件证据搜索优化视觉标记分配,在不训练模型的情况下实现高效的视觉标记剪枝,保留原始多模态提示和解码流程。

详情
AI中文摘要

视觉语言模型通过将越来越长的视觉标记序列输入语言骨干网络来提升感知能力,但由此产生的推理成本提出了一个基本的扩展问题:随着多模态模型的增长,实际上需要多少视觉标记,以及在固定视觉标记预算下如何分配?现有训练免费剪枝方法通常通过一shot代理如解码器注意力、视觉相似性或条件多样性来回答这个问题。我们主张将视觉标记剪枝视为任务条件证据搜索,特别是在极端压缩和跨模型规模的情况下。我们提出F^3A,一种训练免费的视觉标记剪枝路由器,在语言模型消耗图像标记之前运行。F^3A构建轻量级的问题条件线索,通过冻结的稀疏感知头将它们与视觉网格标记匹配,并通过粗略证据定位、局部细化、覆盖保持竞争和恢复未覆盖区域来分配固定视觉标记预算。它不需要模型训练,不需要额外的LLM前向传递,并保留原始多模态提示和解码流程。

英文摘要

Vision-language models improve perception by feeding increasingly long visual token sequences into language backbones, but the resulting inference cost raises a basic scaling question: as multimodal models grow, how many visual tokens are actually needed, and how should they be allocated under a fixed visual token budget? Existing training-free pruning methods typically answer this with one-shot proxies such as decoder attention, visual similarity, or conditional diversity. We argue that visual token pruning is better viewed as task-conditioned evidence search, especially under aggressive compression and across model scales. We propose F^3A, a training-free router for visual token pruning that operates before the language model consumes image tokens. F^3A builds lightweight question-conditioned cues, matches them to visual-grid tokens through frozen sparse sensing heads, and allocates a fixed vision token budget via coarse evidence localization, local refinement, coverage-preserving competition, and recovery of under-covered regions. It requires no model training, no extra LLM forward pass and preserves the original multimodal prompting and decoding pipeline.

2605.16358 2026-05-19 cs.LG cs.AI 版本更新

LEAF: A Living Benchmark for Event-Augmented Forecasting

LEAF:一个用于事件增强预测的活体基准

Mingtian Tan, Mihir Parmar, Palash Goyal, Chun-Liang Li, Nanyun Peng, Thomas Hartvigsen, Jinsung Yoon, Tomas Pfister

发表机构 * Google(谷歌) University of Virginia(弗吉尼亚大学)

AI总结 本文提出LEAF,首个用于事件增强预测的活体基准,通过递归检索代理系统和双代理交叉验证,提供全面相关文本辅助预测,评估LLM在复杂真实场景中的预测能力。

Comments 12 tables, 6 figures, 39 pages

详情
AI中文摘要

大型语言模型(LLMs)越来越多地应用于预测。为了评估这一能力并缓解预训练数据污染,已提出几种活体基准。然而,现有基准要么因数据稀缺缺乏多维事件,要么聚焦于相对封闭环境。为评估LLM在复杂真实场景中的预测能力,我们提出LEAF,首个用于事件增强预测任务的活体基准,包括未来事件概率、趋势和时间序列预测。LEAF利用递归检索代理系统配以双代理交叉验证,提供全面相关辅助文本。评估最新专有和开源LLMs发现,这些模型能利用复杂事件提取的信号提升预测性能。在股票领域,发现LLM在自信识别为更可预测的股票上表现更好。此外,事件与目标股票呈现强相关性。为此,LEAF提供必要的动态更新测试环境,持续跟踪和推动事件驱动预测任务的进步。

英文摘要

Large Language Models (LLMs) are increasingly applied to forecasting. To evaluate this capability while mitigating pre-training data contamination, several living benchmarks have been proposed. However, existing benchmarks either lack the multidimensional events essential for accurate forecasting due to data scarcity, or focus on relatively closed environments. To assess the predictive capabilities of LLMs in complex, real-world scenarios, we propose LEAF, the first living benchmark for event-augmented forecasting tasks, including future event probabilities, trend and time series forecasting. LEAF utilizes a recursive retrieval agent system paired with dual-agent cross-validation to provide comprehensive and relevant auxiliary text for forecasting. Evaluating state-of-the-art proprietary and open-weight LLMs, we find that these models can leverage signals extracted from complex events to enhance predictive performance. In the stock domain, we find that LLMs achieve better performance on equities they confidently identify as more predictable. Furthermore, the events demonstrate a strong correlation with the target equities. To this end, LEAF provides a necessary, dynamically updating testbed to continuously track and drive progress in event-driven forecasting tasks.

2605.16357 2026-05-19 eess.SP cs.AI cs.CV 版本更新

Learning Displacement-Aware WiFi Representations for Weakly Supervised Relative Localization

学习位移感知的Wi-Fi表示以实现弱监督的相对定位

Tzu-Ti Wei, Po-Cheng Chen, Yu-Chee Tseng, Jen-Jee Chen

发表机构 * College of AI, National Yang Ming Chiao Tung University(人工智能学院,National Yang Ming Chiao Tung大学)

AI总结 本文提出IP框架,通过交叉模态学习对齐指纹轨迹与位移轨迹,学习位移感知的Wi-Fi表示,实现准确的相对定位,并扩展至少样本绝对定位。

详情
AI中文摘要

基于Wi-Fi指纹的室内定位已广泛研究,但现有方法多关注绝对定位并依赖密集坐标标注,获取成本高。本文研究相对定位问题,目标是直接估计两个Wi-Fi指纹轨迹间的位移,不预测绝对位置。为减少标注开销,采用惯性传感获取的步进运动向量作为弱监督。提出Intersection Pathway (IP)框架,通过共享潜在空间对齐指纹轨迹与位移轨迹。关键思想是使潜在空间具有加法结构,使潜在空间的加减对应物理运动组合,实现直接的相对位移推断。实验表明,所提方法在合成数据集上学习位移感知的Wi-Fi表示,实现不同位移范围的准确相对定位。此外,所学模型可扩展至少样本绝对定位。

英文摘要

WiFi fingerprint-based indoor localization has been widely studied, but most existing approaches focus on absolute positioning and rely on dense coordinate annotations, which are costly to obtain at scale. In this paper, we study a fundamentally different problem: relative localization, where the goal is to directly estimate the displacement between two WiFi fingerprint traces without predicting their absolute positions. To reduce annotation overhead, we adopt weak supervision in the form of stepwise motion vectors obtained from inertial sensing. We propose Intersection Pathway (IP), a cross-modal learning framework that aligns fingerprint traces (f-traces) and displacement traces (d-traces) in a shared latent space. The key idea is to enforce an additive structure in the latent space, such that latent addition and subtraction correspond to physical motion composition, enabling direct relative-displacement inference. Experiments on a synthesized dataset derived from real measurements demonstrate that the proposed method learns displacement-aware WiFi representations and achieves accurate relative localization across varying displacement ranges. Furthermore, the learned model can be extended to few-shot absolute localization with sparse anchors.

2605.16354 2026-05-19 cs.LG cs.AI cs.CL cs.HC stat.ML 版本更新

Augmenting Human Evaluation with LLM Judges: How Many Human Reviews Do You Need?

通过LLM裁判增强人类评估:你真的需要多少人类评审?

Jane Paik Kim

发表机构 * Department of Psychiatry and Behavioral Sciences(精神病学与行为科学系)

AI总结 本文提出通过LLM作为辅助裁判来增强人类评估,通过两阶段抽样设计确定人类和LLM评审样本量,以实现目标统计功效。

Comments 10 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作AI系统的自动评估者,包括在高风险应用中。在这一角色中,LLMs用于生成关于模型输出质量、适当性甚至安全性的判断。这种做法受到实际限制的驱动。专家人类评分成本高且难以扩展,而LLM评分可以快速低成本地生成。然而,当前部署LLM评估者的方法是随意的,通常仅限于报告人类和LLM裁判之间的一致性度量作为替代人类评分的正当性,且缺乏正式的研究设计基础。本文(1)将LLM裁判的角色从替代性转为辅助性,并(2)将LLM作为裁判范式制定为通过两阶段抽样设计增强人类评估的一种方法,其中在第一阶段对所有观察进行LLM评估,在第二阶段对子样本进行部分人类评分。我们提出使用来自缺失数据文献的双重鲁棒估计器,利用预测模型的鲁棒性属性,因为缺失性模型是设计已知的。使用该估计器的渐近方差,我们提出如何确定人类和LLM评分的样本量以达到目标统计功效。我们还展示通过分配更多人类评分给LLM评分预测性不高的评估类型,可以高效地设计研究。据我们所知,关于在验证基准时应保留多少人类监督的指导非常有限。

英文摘要

Large language models (LLMs) are increasingly used as automated evaluators of AI systems, including in high-stakes applications. In this role, LLMs are used to generate judgments about the quality, appropriateness, or even safety of model outputs. This approach is motivated by practical constraints. Expert human ratings are costly and difficult to scale, whereas LLM ratings can be produced quickly at low cost. However, current approaches to deploying LLM evaluators are ad hoc, typically limited to reporting agreement metrics between human and LLM judges as a justification for substitution of human ratings, and lack a formal basis for study design. This paper (1) shifts the role of the LLM judge from substitutive to auxiliary, and (2) formulates the LLM-as-a-judge paradigm as one of augmenting human evaluation through a two-stage sampling design, where LLM evaluations are measured for all observations at the first stage and human ratings are partially observed for a subsample at the second stage. We propose to use a doubly robust estimator from the missing data literature, which takes advantage of the robustness property against the prediction model, since the missingness model is known by design. Using the asymptotic variance of this estimator, we propose how sample sizes of human and LLM ratings can be determined to achieve a targeted level of power. We also show that a study can be efficiently designed by allocating more human ratings for types of evaluations where the predictability of LLM ratings is not high. To the best of our knowledge, there is very little guidance on how much human oversight should be retained when validating benchmarks.

2605.16352 2026-05-19 cs.IR cs.AI cs.LG 版本更新

LARGER: Lexically Anchored Repository Graph Exploration and Retrieval

LARGER: 词典锚定的仓库图探索与检索

Yuntong Hu, Tongli Su, Liang Zhao, Bowen Zhu, Hasibul Haque

发表机构 * Emory University(埃默里大学)

AI总结 LARGER通过词典锚定的结构化定位方法提升代码仓库文件定位精度,实现测试生成和代码库理解任务的性能提升。

详情
AI中文摘要

仓库级别的编码代理必须首先定位与任务相关的文件和符号;此阶段的失败会影响从补丁生成到测试编写和代码库问答的下游目标。现有代理主要通过词汇搜索导航仓库,常遗漏结构关系如导入、调用链、类型层次和代码-测试链接。基于图的检索可恢复此类依赖,但现有方法常需要单独的图工具或遍历阶段,打断代理的交互循环。我们正式将仓库上下文定位定义为词典锚定的结构化定位,其成功取决于将词汇匹配转化为高精度的结构入口点,并在代理现有搜索循环中暴露最有用的置信度过滤局部邻域。我们引入LARGER(词典锚定的仓库图探索与检索),一种以词汇锚定的主动集检索框架,从词汇匹配开始,将其对齐到图锚点,并在代理现有搜索循环中执行置信度过滤的局部扩展。LARGER直接集成到现有CLI编码代理中,无需外部图数据库或专用图接口。在四个涵盖定位、测试生成和代码库理解的基准测试中,LARGER在LocBench上通过调整超参数将文件级Acc@5提升13.9点,即使在固定超参数下仍比最强基线提升11.8点,并在MuLocBench、SWE-Atlas测试编写和SWE-Atlas代码库问答任务上提供一致的提升。

英文摘要

Repository-level coding agents must first localize the files and symbols relevant to a task; failures at this stage can cascade across downstream objectives ranging from patch generation to test writing and codebase question answering. Existing agents navigate repositories primarily through lexical search, often missing structural relations such as imports, call chains, type hierarchies, and code-test links. Graph-based retrieval can recover such dependencies, but existing approaches often require separate graph tools or traversal stages that fragment the agent's interaction loop. We formalize repository context localization as Lexically Anchored Structural Localization, where success depends on turning lexical matches into high-precision structural entry points and exposing the most useful confidence-filtered local neighborhoods within the agent's existing search loop. We introduce LARGER (Lexically Anchored Repository Graph Exploration and Retrieval), a lexically anchored active-set retrieval framework that starts from lexical matches, aligns them to graph anchors, and performs confidence-filtered local expansion within the agent's existing search loop. LARGER integrates directly into existing CLI coding agents without requiring external graph databases or specialized graph interfaces. Across four benchmarks spanning localization, test generation, and codebase understanding, LARGER improves file-level Acc@5 on LocBench by +13.9 points with tuned hyperparameters and still gains +11.8 points with fixed hyperparameters over the strongest baseline, while delivering consistent gains on MuLocBench, SWE-Atlas Test Writing, and SWE-Atlas Codebase QA.

2605.16351 2026-05-19 cs.LG cs.AI 版本更新

PIMSM: Physics-Informed Multi-Scale Mamba for Stable Neural Representations under Distribution Shift

PIMSM:基于物理的多尺度Mamba用于在分布偏移下稳定的神经表示

Sangyoon Bae, Shinjae Yoo, Jiook Cha

发表机构 * Interdisciplinary Program in Artificial Intelligence(人工智能交叉学科项目) Seoul National University(首尔国立大学) Computational Science Initiative(计算科学倡议) Brookhaven National Laboratory(布鲁赫斯国家实验室) Department of Psychology(心理学系)

AI总结 本文提出PIMSM,一种基于物理的多尺度Mamba架构,通过时间尺度对齐提升科学基础模型在分布偏移下的鲁棒性和表示稳定性,实验证明其在fMRI和气象预测中的有效性。

Comments 9 pages, 2 figures

详情
AI中文摘要

科学基础模型旨在在数据集、获取协议和部署领域变化时重用表示,但许多序列骨干网络将科学时间结构视为无约束模式进行拟合。本文认为,这忽略了自然动力系统的核心特性:神经和大气时间序列由跨多个物理时间尺度的相互过程组织,未能保留这种多尺度结构会加剧分布偏移下的脆性。本文将这种失败模式正式定义为时间核不匹配,即模型使用与信号物理时间尺度无关的有效记忆策略拟合分布内动态,导致表示漂移和转移性能下降。本文提出物理约束的多尺度Mamba(PIMSM),一种状态空间架构,将频域估计的过渡点(膝频)映射到尺度特定的离散化参数并锚定到获取时间单位。在人类连接组计划fMRI上,PIMSM在严重时间上下文截断、极端低资源转移和静息态到任务态泛化中提升了鲁棒性和表示稳定性。在无需模态特定适应的情况下,相同架构在Weather-5K持出站空间分布外预测中取得了所有报告范围和变量的最低变量加权MAE。这些结果支持时间尺度对齐作为科学基础模型的实用归纳偏置,这些模型必须在部署偏移下保持结构,而非仅拟合相关性。

英文摘要

Scientific foundation models are expected to reuse representations under changes in dataset, acquisition protocol, and deployment domain, yet many sequence backbones treat scientific temporal structure as an unconstrained pattern to be fitted. We argue that this misses a central property of natural dynamical systems: neural and atmospheric time series are organized by interacting processes across multiple physical timescales, and failure to preserve this multiscale structure contributes to brittleness under distribution shift. We formalize this failure mode as temporal kernel mismatch, where a model fits in-distribution dynamics with an effective memory policy that is not anchored to the signal's physical timescales, leading to representation drift and degraded transfer. We propose Physics-Informed Multi-Scale Mamba (PIMSM), a state-space architecture that maps spectrum-estimated transition points between frequency regimes (knee frequencies) to scale-specific discretization parameters and anchors them to acquisition time units. On Human Connectome Project fMRI, PIMSM improves robustness and representation stability under severe temporal-context truncation, extreme low-resource transfer, and resting-state-to-task-state generalization. Without modality-specific adaptation, the same architecture also attains the lowest variable-wise MAE across all reported horizons and variables on Weather-5K held-out-station spatial out-of-distribution forecasting. These results support temporal-scale alignment as a practical inductive bias for scientific foundation models that must preserve structure, not only fit correlations, under deployment shift.

2605.16350 2026-05-19 cs.LG cs.AI 版本更新

Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation

联邦嵌套学习:为测试时间适应性协作训练自参考记忆

Hong Chen, Pengcheng Wu, Yuanguo Lin, Peilin Zhao, Xiuze Zhou, Fan Lin, Han Yu

发表机构 * Nanyang Technological University(南洋理工大学) Jimei University(集美大学) Shanghai Jiao Tong University(上海交通大学) Xiamen University(厦门大学)

AI总结 本文提出FedNL框架,通过嵌套优化系统实现协作学习优化规则,提升非独立同分布数据下的推理与检索性能,保持恒定推理内存。

详情
AI中文摘要

我们从嵌套学习视角重新思考联邦学习(FL),将核心挑战定为如何协作学习优化规则而非静态模型,以应对非独立同分布客户端数据。为此,我们提出联邦嵌套学习(FedNL),一种新的框架,将FL重新表述为三层嵌套优化系统。FedNL嵌入基于Titans的线性注意力机制到FL中,使客户端能够通过将delta规则视为在线梯度步骤进行轻量级零样本测试时间适应。在非独立同分布MMLU和长上下文基准测试中,FedNL在短上下文推理中取得竞争性性能,增强了长上下文检索和流式交叉熵的性能,并保持恒定的推理内存。

英文摘要

We rethink Federated Learning (FL) from a nested learning perspective, framing the core challenge as how to collaboratively learn optimization rules, not just static models, to tackle Non-IID client data. To address this, we propose Federated Nested Learning (FedNL), a novel framework that reformulates FL as a three-level nested optimization system. FedNL embeds Titans-based linear attention into FL, enabling clients to perform lightweight, zero-shot test-time adaptation by treating a delta rule as an online gradient step. Experiments on Non-IID MMLU and long-context benchmarks show that FedNL achieves competitive performance in short-context reasoning, enhances the performance of long-context retrieval and streaming Cross-Entropy, and maintains constant inference memory.

2605.16348 2026-05-19 cs.LG cs.AI 版本更新

Flow-Direct: Feedback-Efficient and Reusable Guidance for Flow Models via Non-Parametric Guidance Field

Flow-Direct: 通过非参数指导场实现反馈高效且可重用的流模型指导

Kim Yong Tan, Yueming Lyu, Ivor Tsang, Yew-Soon Ong

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院,新加坡) Centre for Frontier AI Research, Agency for Science, Technology and Research, Singapore(前沿人工智能研究中心,新加坡科技研究局) Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore(高性能计算研究所,新加坡科技研究局)

AI总结 本文提出Flow-Direct框架,通过持续指导场提升流模型的反馈效率和可重用性,利用累积的奖励评估样本构建非参数估计器,实现高效优化和多目标样本生成。

详情
AI中文摘要

免训练指导使预训练的扩散和流模型能够利用外部黑盒奖励函数的反馈来优化应用特定的目标。然而,现有方法反馈效率低,因为奖励反馈仅临时用于指导局部梯度近似或离散搜索决策,随后被丢弃。为解决这一限制,我们提出Flow-Direct框架,通过持续的指导场引导生成过程。理论上,该指导场是从基础分布与奖励加权目标分布的对数密度比分析得出的;它将预训练分布传输到目标分布。实践中,该场被实现为一个由所有累积奖励评估样本构建的非参数估计器。随着优化过程中样本的增加,该经验指导场变得越来越准确。这种持续的 formulation 产生了两个主要优势。首先,Flow-Direct具有高度的反馈效率:因为每个评估样本都用于细化全局指导场,没有奖励信息被浪费。其次,该框架具有自然的可重用性:一旦优化完成,收集的数据库定义了一个可重用的指导场,用于生成新的目标样本而无需额外的奖励评估,并且不同的指导场可以结合以生成同时满足多个目标的样本。

英文摘要

Training-free guidance enables pre-trained diffusion and flow models to optimize application-specific objectives using feedback from external black-box reward functions. However, existing methods are feedback-inefficient because reward feedback is used only transiently to inform a localized gradient approximation or a discrete search decision, and is subsequently discarded. To address this limitation, we propose Flow-Direct, a framework that guides the generation process via a persistent guidance field. Theoretically, this guidance field is analytically derived from the log-density ratio between the base and reward-weighted target distributions; it transports the pre-trained distribution to the target distribution. In practice, the field is implemented as a non-parametric estimator constructed from all accumulated reward-evaluated samples. As more samples are collected during optimization, this empirical guidance field becomes increasingly accurate. This persistent formulation yields two major advantages. First, Flow-Direct is highly feedback-efficient: because every evaluated sample is used to refine the global guidance field, no reward information is wasted. Second, the framework is naturally reusable: once optimization is complete, the collected dataset defines a reusable guidance field for generating novel target samples without additional reward evaluations, and distinct guidance fields can be combined to generate samples that simultaneously satisfy multiple objectives.

2605.16346 2026-05-19 cs.LG cs.AI cs.CR 版本更新

PropGuard: Safeguarding LLM-MAS via Propagation-Aware Exploration and Remediation

PropGuard: 通过传播感知探索与修复保障LLM-MAS

Bingyu Yan, Xiaoming Zhang, Jinyu Hou, Chaozhuo Li, Ziyi Zhou, Xiaozhe Zhang, Litian Zhang

发表机构 * Beihang University(北航) Beijing University of Posts and Telecommunications(北京邮电大学) City University of Hong Kong(香港城市大学)

AI总结 PropGuard通过构建双视角时空图,结合响应中心风险评估与全状态证据保存,实现对LLM-MAS中传播性攻击的检测与修复,有效降低攻击成功率并保持高任务防御效率。

详情
AI中文摘要

基于多智能体系统(LLM-MAS)的大型语言模型(LLM)已成为解决复杂任务的有前景范式,通过角色专业化、工具使用、记忆和协作推理。然而,这些交互创造了新的安全风险,恶意指令通过消息、工具或记忆注入后可能在代理之间传播,导致系统级妥协。现有防御主要依赖局部过滤或图基异常检测,但往往无法追踪细粒度传播路径或在不干扰良性协作的情况下修复污染状态。我们提出PropGuard,一种传播感知框架,用于保障LLM-MAS。PropGuard构建了双视角时空图,结合响应中心风险评估与全状态证据保存。受这些风险先验引导,一个训练好的GE-GRPO检查员依次探索全状态图,以恢复紧凑的可疑传播子图。PropGuard随后通过子图感知诊断验证有害传播,并应用源引导修复以纠正上游污染并重放受影响的下游交互。在四个通信架构和五个攻击设置上的实验表明,PropGuard在降低攻击成功率的同时保持高任务级防御成功率,实现了有利的效果-效率权衡。

英文摘要

LLM-based multi-agent systems (LLM-MAS) have become a promising paradigm for solving complex tasks through role specialization, tool use, memory, and collaborative reasoning. However, these interactions create new security risks that malicious instructions injected through messages, tools, or memories can propagate across agents and rounds, causing system-level compromise. Existing defenses largely rely on local filtering or graph-based anomaly detection, but they often fail to trace fine-grained propagation paths or remediate contaminated states without disrupting benign collaboration. We propose PropGuard, a propagation-aware framework for safeguarding LLM-MAS. PropGuard constructs a dual-view spatio-temporal graph that combines response-centric risk estimation with full-state evidence preservation. Guided by these risk priors, a GE-GRPO trained inspector sequentially explores the full-state graph to recover compact suspicious propagation subgraphs. PropGuard then verifies harmful propagation through subgraph-aware diagnosis and applies source-guided remediation to correct upstream contamination and replay affected downstream interactions. Experiments across four communication architectures and five attack settings demonstrate that PropGuard consistently lowers attack success while maintaining high task-level defense success, achieving a favorable effectiveness--efficiency trade-off.

2605.16345 2026-05-19 cs.LG cs.AI 版本更新

Goal-Conditioned Supervised Learning for LLM Fine-Tuning

面向目标的监督学习用于大语言模型微调

Shijun Li, Kaiwen Dong, Xiang Gao, Joydeep Ghosh

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Intuit AI Research(Intuit人工智能研究)

AI总结 本文提出目标条件监督学习(GCSL)框架,通过将反馈信号作为显式目标,利用监督学习生成高质量响应,改进了传统监督微调和直接偏好优化的局限性。

详情
AI中文摘要

大型语言模型通常需要微调以更好地与用户意图对齐。现有方法可分为在线和离线范式。在线方法如基于强化学习的对齐方法可直接优化结果质量,但通常依赖外部奖励模型和迭代滚动,使其成本高且难以部署。离线方法更高效,但现有方法如监督微调(SFT)和直接偏好优化(DPO)仍有局限:SFT通常将分级反馈转化为二元监督,而DPO依赖配对偏好数据,此类数据往往不可用或昂贵。本文提出目标条件监督学习(GCSL)作为大语言模型的离线微调框架。我们的核心思想是将反馈信号直接视为显式目标,并通过监督学习训练模型生成实现该目标的响应。为更好地利用分级反馈,我们进一步引入一种新的目标公式,将学习定义为持续追求超过目标质量阈值的成果,而非模仿选定高质量子集中的样本。此设计通过显式引导模型学习质量的定向进步,缓解了SFT和经典GCSL的有限学习效应。我们还提出了自然语言目标表示,以更好地利用大语言模型的语义理解和推理能力。我们在三个任务上评估了我们的方法:非毒性生成、代码生成和推荐系统中的大语言模型。结果表明,我们的方法在保持监督学习的效率、可扩展性和简单数据需求的同时,始终优于标准离线微调基线。

英文摘要

Large language models often require fine-tuning to better align their behavior with user intent at deployment. Existing approaches are commonly divided into online and offline paradigms. Online methods, such as RL-based alignment, can directly optimize outcome quality but typically rely on external reward models and iterative rollouts, making them costly and difficult to deploy in many cases. Offline methods are more efficient, but prevailing approaches such as supervised fine-tuning (SFT) and direct preference optimization (DPO) remain limited: SFT typically collapses graded feedback into binary supervision, while DPO depends on paired preference data that is often unavailable or expensive to construct. In this paper, we propose goal-conditioned supervised learning (GCSL) as an offline fine-tuning framework for LLMs. Our core idea is to treat feedback signals directly as an explicit goal and train the model, purely through supervised learning, to generate responses that achieve that goal. To better exploit graded feedback, we further introduce a novel goal formulation that defines learning as consistently pursuing outcomes above a target quality threshold, rather than imitating samples from a selected high-quality subset. This design mitigates the bounded-learning effect of SFT and classic GCSL by explicitly guiding the model to learn the directional progression of quality. We also propose natural-language goal representations to better leverage the semantic understanding and reasoning capabilities of LLMs. We evaluate our method on three tasks: non-toxic generation, code generation, and LLM for recommendation. Results show that our approach consistently outperforms standard offline fine-tuning baselines while retaining the efficiency, scalability, and simple data requirements of supervised learning.

2605.16343 2026-05-19 cs.LG cs.AI 版本更新

LoopQ: Quantization for Recursive Transformers

LoopQ: 递归变换器的量化

Rui Fang, Hsi-Wen Chen, Ming-Syan Chen

发表机构 * National Taiwan University(国立台湾大学)

AI总结 本文提出LoopQ框架,针对递归变换器的量化挑战,通过激活缩放、选择性变换等方法提升模型精度与效率。

详情
AI中文摘要

Looped语言模型(LoopLMs)通过递归重用Transformer块提高参数效率,但在训练后量化(PTQ)中易出现脆弱性。本文首次系统研究LoopLMs的量化问题,识别出三个挑战:角色间的分布偏移、循环转换中的状态重用以及递归误差累积。提出LoopQ框架,通过共享量化主干和轻量级适应,结合激活缩放、选择性变换、跨循环状态对齐和轨迹感知优化,减少循环内的分布不匹配和跨循环的误差累积。实验表明,在W4A4量化下,LoopQ在七个基准测试中平均下游准确率提升68.8%,平均困惑度降低87.7%。

英文摘要

Looped language models (LoopLMs) improve parameter efficiency by recursively reusing Transformer blocks, enabling deeper computation under a fixed model size. However, this reuse makes LoopLMs more fragile under post-training quantization (PTQ). We present the first systematic study of quantization in LoopLMs and identify three challenges: distribution shift across roles, state reuse across loop transitions, and recursive error accumulation. To address these challenges, we propose LoopQ, a loop-aware PTQ framework that preserves a shared quantized backbone while introducing lightweight adaptations. LoopQ combines activation scaling, selective transformation, cross-loop state alignment, and trajectory-aware optimization to reduce distributional mismatch within loops and error accumulation across loops. Experiments across seven benchmarks show that, under W4A4 quantization, LoopQ improves average downstream accuracy by 68.8% and reduces average perplexity by 87.7% compared with the strongest static PTQ baseline.

2605.16342 2026-05-19 cs.LG cs.AI cs.CL 版本更新

DACA-GRPO: Denoising-Aware Credit Assignment for Reinforcement Learning in Diffusion Language Models

DACA-GRPO:去噪感知的信用分配用于扩散语言模型中的强化学习

Amin Karimi Monsefi, Dominic Culver, Nikhil Bhendawade, Lokesh Boominathan, Manuel R. Ciosici, Yizhe Zhang, Irina Belousova

发表机构 * The Ohio State University(俄亥俄州立大学) Apple(苹果公司)

AI总结 本文提出DACA-GRPO,通过引入去噪进度评分和分层掩码似然,改进扩散语言模型中强化学习的信用分配,提升数学推理、代码生成等任务性能。

详情
AI中文摘要

扩散大语言模型是自回归模型的有力替代品,但现有强化学习方法将所有去噪步骤视为同等重要,并依赖于有偏、高方差的似然估计。我们识别出两个根本性弱点:去噪轨迹中缺乏时间信用分配,以及用于策略优化的均场似然估计存在系统偏差。为了解决这些问题,我们提出了Denoising-Aware Credit Assignment for GRPO(DACA-GRPO),一种轻量级、即插即用的增强方法,适用于任何GRPO风格的训练器。DACA-GRPO引入了两个互补机制:去噪进度评分,从中间预测中提取每token的重要性权重,无需额外前向成本;分层掩码似然,将token位置分为层次,使每个token在大部分序列作为上下文的情况下进行预测,从而减少均场偏差。在三种GRPO基础方法上应用DACA-GRPO,使其在七个基准测试中取得一致提升,涵盖数学推理、代码生成、约束满足和受约束生成等任务,在数学推理中提升达5.6个百分点,在代码生成中提升7.4个百分点,在约束满足中提升36.3个百分点,在JSON schema符合性中提升5.9个百分点。

英文摘要

Diffusion large language models are a compelling alternative to autoregressive models, yet existing RL methods for diffusion treat all denoising steps as equally important and rely on biased, high-variance likelihood estimates. We identify two fundamental weaknesses: the absence of temporal credit assignment across the denoising trajectory, and the systematic bias of mean-field likelihood estimates used for policy optimization. To address these, we propose Denoising-Aware Credit Assignment for GRPO (DACA-GRPO), a lightweight, plug-and-play enhancement for any GRPO-style trainer. DACA-GRPO introduces two complementary mechanisms: Denoising Progress Scores, which extract per-token importance weights from intermediate predictions at no additional forward cost, and Stratified Masking Likelihood, which partitions token positions into strata so that each token is predicted with most of the sequence as context, reducing the mean-field bias. Applied on top of three GRPO base methods, DACA-GRPO achieves consistent improvements across seven benchmarks spanning mathematical reasoning, code generation, constraint satisfaction, and constrained generation, with gains of up to 5.6pp on math reasoning, 7.4pp on code generation, 36.3pp on constraint satisfaction, and 5.9pp on JSON schema adherence.

2605.16336 2026-05-19 cs.CR cs.AI cs.CY 版本更新

Detecting Verbatim LLM Copy-Paste in Homework

检测作业中的直接LLM复制粘贴

Aizierjiang Aiersilan

发表机构 * The George Washington University(乔治·华盛顿大学)

AI总结 本文提出SteganoPrompt工具,通过在作业提示中嵌入隐形指令,使LLM在响应时生成特征签名,帮助教师检测学生直接复制粘贴模型回复的行为。

详情
AI中文摘要

大型语言模型(LLMs)使学生能够即时获得流畅的论文写作、代码起草和答题服务。许多教育者不反对LLM的使用,但需要检测学生将作业提示粘贴到聊天机器人并直接提交模型回复的情况。现有后处理AI文本检测器不可靠且对非英语母语者有偏见,而输出侧水印需要模型提供者的合作。本文提出一种由教师直接控制的输入侧水印方法:在可见的作业提示中嵌入隐形指令。LLM在直接摄入提示时会静默读取隐藏指令,并在响应中生成特征签名,专门暴露复制粘贴路径。我们描述了SteganoPrompt,一个单页、零依赖的网页工具,将任意可打印的ASCII负载编码到弃用的Unicode标签块(U+E0000--U+E007F)。编码后的字符串在视觉上与原始字符串相同,能经受常见复制粘贴通道(Word、Google Docs、PDF、Markdown、Slack、电子邮件、主要学习管理系统)的考验,并能被前沿模型可靠地分词。我们评估了七种LLM家族和代表性教育内容渠道的合规性。这项工作受到我在乔治华盛顿大学本科软件工程课程中的研究生助教经验的启发。该工具以MIT许可发布于https://ezharjan.github.io/SteganoPrompt/。

英文摘要

Large language models (LLMs) have made fluent essay writing, code drafting, and quiz answering instantly available to students at every level, from secondary school through graduate study. Many educators do not object to LLM use \emph{per~se}; what they need to detect is the case in which a student pastes the assignment prompt into a chatbot and submits the model's reply verbatim, without engaging with the work. Existing post-hoc AI-text detectors remain unreliable and have been shown to penalise non-native English writers, while output-side watermarks require cooperation from the model provider. We propose an alternative that the educator controls directly: an input-side watermark in which an invisible instruction is embedded inside the visible assignment prompt itself. An LLM that ingests the prompt verbatim quietly reads the hidden instruction and writes a tell-tale signature into its reply, exposing the copy-and-paste pathway specifically. We describe SteganoPrompt, a single-page, zero-dependency web tool that encodes an arbitrary printable-ASCII payload into the deprecated Unicode Tags block (\texttt{U+E0000}--\texttt{U+E007F}). The encoded string is visually identical to the original, survives common copy-paste channels (Word, Google Docs, PDF, Markdown, Slack, e-mail, the major learning-management systems), and is reliably tokenized by frontier models. We evaluate compliance across seven LLM families and a representative set of educational content channels. The work is informed by my experience as a graduate teaching assistant for an undergraduate software engineering course at the George Washington University. The tool is released under the MIT licence at \url{https://ezharjan.github.io/SteganoPrompt/}.

2605.16327 2026-05-19 eess.SY cs.AI cs.SY 版本更新

Differentiable Optimization Layered Safety-Critical Control for Risk-Aware Navigation via Conformal Prediction

可微优化分层安全关键控制用于通过共形预测的风险感知导航

Jinyang Dong, Shizhen Wu, Yongchun Fang

发表机构 * Institute of Robotics and Automatic Information Systems, College of Artificial Intelligence, Nankai University, Tianjin 300071, China(机器人与自动化信息系统学院,人工智能学院,南开大学,天津300071,中国) Academy for Advanced Interdisciplinary Studies, Nankai University, Tianjin 300071, China(先进交叉学科研究院,南开大学,天津300071,中国) Institute of Intelligence Technology and Robotic Systems, Shenzhen Research Institute of Nankai University, Shenzhen 518083, China(智能技术与机器人系统研究所,南开大学深圳研究院,深圳518083,中国)

AI总结 本文提出基于共形预测的可微优化分层安全关键控制方法,用于未知环境中风险感知导航。通过生成风险感知障碍物椭球体,构建控制屏障函数以实现避障和可行性保障,并通过数值模拟验证了方法的有效性。

详情
AI中文摘要

未知环境中的风险感知导航是自动驾驶车辆在复杂城市系统中面临的根本挑战。为了解决这个问题,本文提出了一种基于共形预测的可微优化分层安全关键控制方法。首先,为处理传感器噪声引起的不确定性,采用共形预测方法生成围绕椭圆形机器人的风险感知障碍物椭球体。其次,引入两个嵌套的可微优化层,分别构建用于避障和可行性保障的控制屏障函数。然后,提出基于二次规划的安全关键控制律,以整合上述控制屏障函数约束以及输入约束。最后,通过数值模拟验证了所提框架的有效性。

英文摘要

Risk-aware navigation in unknown environments is a fundamental challenge for autonomous vehicles operating in complex urban systems. To address this issue, this paper presents a differentiable optimization layered safety-critical control method based on conformal prediction. First, to handle uncertainties arising from sensor noise, the conformal prediction method is employed to generate risk-aware obstacle ellipsoids around an elliptical-shaped robot. Second, two nested differentiable optimization layers are introduced to build the control barrier functions for obstacle avoidance and feasibility guarantee, respectively. Then, a quadratic program based safety-critical control law is proposed to integrate the above control barrier function constraints as well as input constraints. In the end, the effectiveness of the proposed framework is demonstrated through numerical simulations.

2605.16326 2026-05-19 q-bio.QM cs.AI cs.LG eess.SP 版本更新

A Machine Learning Framework for EEG-Based Prediction of Treatment Efficacy in Chronic Neck Pain

一种基于EEG的慢性颈部疼痛治疗效果预测机器学习框架

Xiru Wang, Aiden Li, Hongzhao Tan, Stevie Foglia, Aimee Nelson, Zhen Gao

发表机构 * Department of Kinesiology, Faculty of Science, McMaster University(麦吉尔大学运动科学系) W Booth School of Engineering Practice and Technology, Faculty of Engineering, McMaster University(麦吉尔大学工程学院W Booth工程实践与技术学院)

AI总结 本文提出利用EEG数据预测慢性颈部疼痛治疗效果的机器学习框架,通过严格的数据预处理和文献综述,旨在开发支持个性化医疗的鲁棒预测模型。

Comments 15 pages, 7 figures

详情
AI中文摘要

慢性颈部疼痛是全球导致残疾的主要原因之一,当前的治疗选择仍主要依赖于试错。我们提出了一种机器学习框架,利用脑电图(EEG)预测慢性颈部疼痛患者的治疗效果,旨在支持个性化治疗并减轻医疗系统负担。该框架的核心是针对每种EEG记录类型特征的严格数据预处理阶段。对于静息态EEG,预处理流程包括基线信号去除、坏通道识别和排除、重新参考、带通和-notch滤波、独立成分分析和功率谱密度分析。对于运动执行和运动想象记录,应用相同的初始步骤后,信号对触发事件对齐,以便量化事件相关去同步(ERD)和事件相关同步(ERS)。同步记录的表面肌电数据经过带通滤波和移动平均平滑,然后与相应的EEG通道相关联,以表征尝试运动期间的EEG-EMG关系。同时,我们进行了广泛的文献综述,回顾了应用于临床EEG的机器学习模型(最初筛选出763条记录,保留16名患者和47名健康对照研究),以指导后续处理策略。通过这种结合的预处理和综述工作,我们旨在开发一个鲁棒的预测模型,以支持慢性疼痛管理中的个性化医疗策略。

英文摘要

Chronic neck pain is a leading cause of disability worldwide, and current treatment selection remains largely trial and error. We present a machine learning framework that uses electroencephalography to predict treatment efficacy in patients with chronic neck pain, with the goal of supporting individualized therapy and reducing the burden on healthcare systems. The framework centers on a rigorous data preprocessing stage tailored to the characteristics of each EEG recording type. For resting-state EEG, the preprocessing pipeline comprises baseline signal removal, bad channel identification and exclusion, re-referencing, bandpass and notch filtering, Independent Component Analysis, and power spectral density analysis. For motor execution and motor imagery recordings, the same initial steps are applied, after which signals are aligned to trigger events so that event-related desynchronization (ERD) and event-related synchronization (ERS) can be quantified. Synchronously recorded electromyography data are bandpass filtered and smoothed with a moving average, then correlated with the corresponding EEG channels to characterize the EEG EMG relationship during attempted movement. In parallel, we performed an extensive literature review of machine learning models applied to clinical EEG (763 records initially screened, 16 patient and 47 healthy-control studies retained), to inform the post-processing strategy. Through this combined preprocessing and review effort, we aim to develop a robust predictive model that can support personalized healthcare strategies in chronic pain management.

2605.16325 2026-05-19 cs.LG cs.AI 版本更新

Phase Transitions in Driven Informational Systems: A Two-Field Perspective on Learning Theory and Non-Equilibrium Chemistry

驱动信息系统的相变:从学习理论和非平衡化学的角度看两种场视角

Truong Xuan Khanh

发表机构 * H&K Research Studio, Clevix LLC(H&K研究室,Clevix公司)

AI总结 本文从学习理论和非平衡化学角度,提出驱动信息系统的两种场框架,引入对抗破裂阈值和自参照耦合阈值,探讨相变现象的普遍性类与可验证预测。

Comments 29 pages, 2 figures

详情
AI中文摘要

深度学习中的相变现象(如grokking、涌现能力、上下文转移下的本体重构)已通过表征压缩、奇异学习理论和信息论进步度量等视角研究。同时,非平衡统计物理在预生物选择下的驱动化学反应网络中识别出相变,其经验特征难以在单一场梯度模型中复现。本文提出一种视角,将两类现象共同描述为驱动信息系统:由两个梯度场(熵产率Sigma和信息准势Phi_I := -ln p*,其中p*是稳态密度)支配的随机过程。在此框架中引入两个候选序参数:对抗破裂阈值alpha_dagger和自参照耦合阈值kappa_c。(alpha_dagger, kappa_c)的联合缩放定义了一个候选普遍性类,具有指数(gamma_1, gamma_2)。本文概述了该框架的几何结构,识别出可区分其与单一场替代方案的可验证预测,并展示其与2024-2026年最近的实证发现(如对齐相变、对抗破裂缩放和大语言模型部分自我反思)的一致性。

英文摘要

Phase-transition phenomena in deep learning (grokking, emergent capabilities, and ontological reorganization under context shift) have been studied through several lenses, including representational compression, singular learning theory, and information-theoretic progress measures. Independently, non-equilibrium statistical physics has identified phase transitions in driven chemical reaction networks underlying prebiotic selection, with empirical signatures that are difficult to reproduce within single-field gradient accounts. We propose a perspective in which both classes of phenomena admit a common description as driven informational systems: stochastic processes governed by two gradient fields, an entropy production rate Sigma and an information quasi-potential Phi_I := -ln p*, where p* is the stationary density. Within this framework we introduce two candidate order parameters: an adversarial breakdown threshold alpha_dagger and a self-referential coupling threshold kappa_c. The joint scaling of (alpha_dagger, kappa_c) defines a candidate universality class with exponents (gamma_1, gamma_2). We outline the geometric structure of this framework, identify falsifiable predictions distinguishing it from single-field alternatives, and show consistency with recent empirical findings (2024--2026) on alignment transitions, adversarial breakdown scaling, and partial introspection in large language models.

2605.16320 2026-05-19 cs.LG cs.AI 版本更新

AdaGraph: A Graph-Native Clustering Algorithm That Overcomes the Curse of Dimensionality and Enables Scientific Discovery

AdaGraph:一种克服维度诅咒并促进科学发现的图原生聚类算法

Ahmed Elmahdi

发表机构 * Independent Researcher(独立研究者)

AI总结 AdaGraph是一种基于图的聚类算法,通过结构导向的计算方法克服维度诅咒,其在10个合成数据集上表现优异,且在基因表达、文本聚类和材料科学领域实现了科学发现。

Comments 12 pages, 4 figures, 1 table. Full paper in preparation for KDD 2027

详情
AI中文摘要

AdaGraph是一种基于图的聚类算法,通过结构导向的计算方法克服维度诅咒,其在10个合成数据集上表现优异,且在基因表达、文本聚类和材料科学领域实现了科学发现。

英文摘要

We present AdaGraph, a graph-native clustering algorithm born from the Structure-Centric Machine Learning (SC-ML) paradigm -- a new field of unsupervised learning that replaces geometry-centric (distance-based) computation with structure-centric (topology-based) computation, fundamentally dissolving the curse of dimensionality. AdaGraph operates entirely within the kNN graph topology, a representation that retains meaningful relational structure in arbitrarily high dimensions where Euclidean distance metrics become uninformative. AdaGraph requires no a priori specification of the number of clusters k, handles noise natively, and scales via the SLCD (Sample-Learn-Calibrate-Deploy) prototype-deployment framework. As its unsupervised tuning objective, AdaGraph pairs with Graph-SCOPE, the topology-based cluster validity index introduced as a separate SC-ML contribution. On 10 synthetic benchmarks spanning d=10 to d=5000, Graph-SCOPE achieves mean ARI=0.900 and correctly selects k on 9/10 datasets -- outperforming Silhouette, Davies-Bouldin, and Calinski-Harabasz -- while maintaining Kendall tau >= 0.92 with ground-truth cluster quality across all dimensionalities (Silhouette: tau ~= 0.46). We validate AdaGraph across three scientific domains: (1) gene co-expression discovery in hepatocellular carcinoma (GSE14520, 10,000 genes, 488 patients, no dimensionality reduction), where AdaGraph identifies condition-specific gene modules that WGCNA, ICA, NMF, and Spectral Biclustering fail to resolve; (2) natural language text clustering, where AdaGraph achieves ARI=0.751 on 20NG-6cat versus HDBSCAN's 0.464 (62% relative improvement); (3) materials science clustering of superconductors (145-dimensional Magpie features), perovskites, and JARVIS-DFT materials, where AdaGraph achieves the highest Graph-SCOPE on all three datasets.

2605.16315 2026-05-19 cs.LG cs.AI 版本更新

A Structural Threshold in Decision Capacity Governs Collapse in Self-Play Reinforcement Learning

决策能力阈值在自我对战强化学习中的崩溃中起作用

Arahan Kujur

发表机构 * Independent Researcher(独立研究者)

AI总结 研究揭示决策能力阈值决定自我对战强化学习代理在不对称规则扰动下的崩溃,通过消除所有正可达条件决策导致快速收敛到确定性利用吸引子,而保留单个正可达条件决策可防止崩溃。

Comments 18 pages, 7 figures

详情
AI中文摘要

我们展示了一个阈值在决策能力中决定自我对战强化学习代理在不对称规则扰动下的崩溃。在扑克变种、矩阵游戏、骰子游戏和多种学习算法中,消除所有正可达条件决策导致快速收敛到确定性利用吸引子,一个在接近最大损失处的固定点。保留甚至一个正可达条件决策点可防止此崩溃。冻结基线和固定对手控制确认该机制是受约束下的共适应,而非扰动本身。该现象是时间不变的,一旦恢复行动即可完全可逆,并在函数逼近下加剧。这些结果确立了一个在测试领域中精确的阈值,即零可达加权条件行动能力,其严重性通过可达加权能力连续缩放。

英文摘要

We show that a threshold in decision capacity determines whether self-play reinforcement learning agents collapse under asymmetric rule perturbations. Across poker variants, matrix games, a dice game, and multiple learning algorithms, eliminating all positive-reach contingent decisions causes rapid convergence to a deterministic exploitation attractor, a fixed point at near-maximal loss. Preserving even a single positive-reach contingent decision point prevents this collapse. A frozen baseline and fixed-opponent control confirm that the mechanism is co-adaptation under constraint, not the perturbation itself. The phenomenon is timing-invariant, fully reversible upon action restoration, and intensifies under function approximation. These results establish a sharp threshold at zero reach-weighted contingent action capacity, with severity scaling continuously via reach-weighted capacity in the tested domains.

2605.16312 2026-05-19 cs.LG cs.AI 版本更新

When Actions Disappear: Adversarial Action Removal in Self-Play Reinforcement Learning

当动作消失时:自我对战强化学习中的对抗性动作移除

Arahan Kujur

发表机构 * Independent Researcher(独立研究者)

AI总结 研究了自我对战强化学习中的对抗性动作遮蔽,发现学习的遮蔽比随机遮蔽和学习扰动基线更具破坏性,揭示了动作可用性作为自我对战RL中的新鲁棒性表面。

Comments 17 pages, 2 figures, 18 tables

详情
AI中文摘要

我们研究了自我对战强化学习中的对抗性动作遮蔽:攻击者会选择性地从受害者的行为集移除合法动作。不同于观察或动作扰动,移除是在智能体行动前消除了决策选项。在从6到5531个信息状态的扑克游戏以及两个非扑克领域中,学习的遮蔽造成的损害比随机遮蔽和学习扰动基线要大得多。攻击在Q-learning、PPO、NFSP、神经NFSP和DQN受害者中持续存在;能够跨智能体转移;在自我对战中被放大;并在延长的遮蔽训练下没有恢复。机理上,攻击者针对高价值决策点,通过可达加权条件动作容量(CAC_w)和价值加权细化CAC_v来捕捉。这些结果将动作可用性识别为自我对战RL中的一个新鲁棒性表面。

英文摘要

We study adversarial action masking in self-play reinforcement learning: an attacker selectively removes legal actions from a victim's action set. Unlike observation or action perturbations, removal eliminates decision options before the agent acts. Across poker games scaling from 6 to 5,531 information states and two non-poker domains, learned masking causes substantially more damage than random masking and learned perturbation baselines. The attack persists across Q-learning, PPO, NFSP, neural NFSP, and DQN victims; transfers across agents; is amplified by self-play; and shows no recovery under extended masked training. Mechanistically, the adversary targets high-value decision points, captured by reach-weighted contingent action capacity (CAC$_w$) and a value-weighted refinement CAC$_v$. These results identify action availability as a distinct robustness surface in self-play RL.

2605.16306 2026-05-19 cs.GR cs.AI 版本更新

UVTran: Accurate Hole-Filling Parameterization with Transformers

UVTran: 基于变换器的精确孔填补参数化

JunFeng Zhang

AI总结 UVTran利用变换器框架,通过设计交叉注意力机制和分层训练策略,提升孔填补的几何精度与表面公平性,优于现有工业和学术基准。

详情
AI中文摘要

在工业设计中,N边孔填补通常被建模为通过最小化公平能量并在几何边界约束下构造单个修剪B样条表面。此建模需要准确的参数空间表示来修剪曲面。现有方法将孔边界投影到邻近平面或多边形以建立对应关系;然而,它们经常忽视边界异质性,导致有偏映射、降低公平性甚至导致填补失败。我们提出UVTran,一种基于变换器的框架,预测辅助投影表面以更好地捕捉孔边界的几何特性。利用B样条局部性,我们设计了交叉注意力机制,使每个表面控制点偏向附近的孔边界,保留局部几何细节。我们对控制点坐标进行体素化,并将拟合问题建模为分类任务,从而减少模型对小数值扰动和噪声的敏感性。我们采用逐步分辨率训练策略,注入受控离散化误差以模仿分布偏移,从而缓解过拟合并提高高分辨率下的泛化能力。在我们的基准测试中,UVTran优于工业和学术基准:容忍度满足率提高了12%,并且在复杂孔边界条件下始终产生公平的填充表面。这些结果表明,UVTran在广泛的N边孔中能够产生更忠实的对应关系和更公平的修剪表面。

英文摘要

In industrial design, N-sided hole filling is typically formulated as the construction of a single trimmed B-spline surface by minimizing a fairness energy subject to geometric boundary constraints. This formulation requires an accurate parameter-space representation of the trimming curve on the filling surface. Most existing methods project the hole boundary onto a nearby plane or polygon to establish correspondence; however, they often neglect boundary heterogeneity, which can yield biased mappings, degrade fairness, and even cause filling failures. We propose UVTran, a transformer-based framework that predicts an auxiliary projection surface better to capture the geometric characteristics of the hole boundary. Exploiting B-spline locality, we design a cross-attention mechanism that biases each surface control point toward the nearby hole boundary, preserving local geometric detail. We voxelize control-point coordinates and formulate the fitting problem as a classification task, which reduces the model's sensitivity to small numerical perturbations and noise. We adopt a progressive-resolution training strategy that injects controlled discretization errors at coarse resolutions to mimic distribution shifts, thereby mitigating overfitting and improving generalization at high resolution. On our benchmark, UVTran outperforms both industrial and academic baselines: the tolerance-satisfaction rate improves by $12\%$, and it consistently produces fair filled surfaces even under complex hole boundary conditions. These results suggest that UVTran yields more faithful correspondences and fairer trimmed surfaces across a wide range of N-sided holes.

2605.16303 2026-05-19 cs.CY cs.AI cs.CL 版本更新

From Demographics to Survey Anchors: Evaluating LLM Agents for Modeling Retirement Attitudes

从人口统计学到调查锚点:评估LLM代理在建模退休态度中的表现

Rubén Garzón, Pauline Baron, Vincent Grari, Jonne Kamphorst, Michael Bernstein, Marcin Detyniecki

发表机构 * AI Research(AI研究) AXA Group Operations(AXA集团运营) CDSP & CEE(CDSP与CEE) Sciences Po(社会科学高等学院) Computer Science Department(计算机科学系) Stanford University(斯坦福大学)

AI总结 本文比较了基于人口统计学的LLM代理与基于调查数据的代理在预测退休态度调查中的准确性,发现仅依赖人口统计学的代理存在偏差且不够准确,而基于调查锚点的代理能更好地捕捉复杂的人类响应模式。

Comments 50 pages, 22 figures

详情
AI中文摘要

大型语言模型(LLM)代理可能为预测人类对调查的响应提供工具。一种常见的技术是仅使用人口统计学数据(如国家、年龄、性别、就业状况、收入、教育和婚姻状况)来定义这些代理。我们比较了基于人口统计学的代理与基于更广泛调查响应数据的代理的预测准确性。我们测试了这两种方法在预测多学科、跨国的《健康、老龄化与退休状况调查》(SHARE)中的响应,重点关注五个变量,这些变量来自三个与个人财务相关的政策相关构念。在这些三个构念中,我们发现,与基于更广泛数据训练的调查代理相比,仅依赖人口统计学的代理(1)表现出集中趋势偏差,使答案偏向人口均值,(2)过于准确,无法重现人类响应中的错误答案和“不知道”响应。这些性能差异通过复制先前退休规划研究中的分层回归分析得到进一步验证。仅基于人口统计信息的代理重现了财务风险承受能力、未来时间观念和退休规划知识各自预测退休储蓄的结果。然而,只有基于调查锚点的代理才能重现这三个因素之间的相互作用。这些发现表明,在仅使用人口统计学定义LLM代理以预测调查响应时应保持谨慎。

英文摘要

Large language models (LLM) agents may offer tools to predict human responses to surveys. A common technique for defining these agents uses only demographics, for example country, age, gender, employment status, income, education and marital status. We compare the predictive accuracy of demographic agents to that of survey agents defined with a larger set of in-domain survey responses. We test both approaches in predicting responses to the multidisciplinary, cross-national Survey of Health, Ageing and Retirement in Europe (SHARE), focusing on five variables from three policy-relevant constructs around personal finance. In these three constructs, we observe that, compared to survey agents trained on broader data, demographics-only agents (1) exhibited a central tendency bias, skewing answers toward population means, and (2) were unrealistically accurate, failing to reproduce the incorrect answers and "don't know" responses typical of human respondents. These performance differences are further substantiated through the replication of a hierarchical regression analysis from prior retirement planning research. Agents based solely on demographic information reproduce the outcome that financial risk tolerance, future time perspective, and knowledge of retirement planning each are predictive of retirement savings. However, only the survey-anchored agents succeed in reproducing the interaction among these three factors. These findings suggest caution in using only demographics to define LLM agents for predicting survey responses.

2605.16300 2026-05-19 cs.CY cs.AI cs.MA cs.RO 版本更新

Consent Chain Degradation in Embodied Multi-Agent Systems: Bridging the Gap Between AI Agent Governance and Robot Ethics

具身多智能体系统中的同意链退化:弥合人工智能代理治理与机器人伦理之间的鸿沟

Mehmet Haklidir

发表机构 * Artificial Intelligence Institute, TÜBİTAK BİLGEM(人工智能研究所,TÜBİTAK BİLGEM)

AI总结 本文提出同意链退化(CCD)概念,探讨多机器人委托链中人类同意的具体性、有效性及范围如何退化,并通过医疗、家庭和工业机器人场景展示其实际表现,分析现有法规对CCD核心维度的缺失。

Comments Accepted for oral presentation at the 2nd Workshop on Robot Ethics (WoRoBet), ICRA 2026, Vienna, Austria, June 1, 2026. 6 pages, 3 tables, 1 figure

详情
AI中文摘要

机器人系统正从孤立平台转向在人类环境中运行的互联多智能体生态系统。这一转变引发了现有框架未解决的治理问题:如何在多机器人委托链中传播、退化和破裂同意?人工智能伦理社区已开始研究数字软件代理的同意,而人机交互社区则研究人机双方面对的同意。现有研究均未涵盖当物理机器人以影响人类的方式委托任务给其他机器人时的情况。本文引入同意链退化(CCD),一种分析多机器人委托链中人类同意具体性、有效性及范围如何退化的概念框架。我们提出一种三层治理架构,即具身代理的同意运行验证框架(CoRVE),整合了同意范围建模、委托链追踪和物理不可逆性评估。医疗、家庭和工业机器人三个场景展示了CCD的实际表现,包括一个数值示例。对欧盟人工智能法案、GDPR、机械指令和修订后产品责任指令的监管缺口分析显示,这四个工具均未涵盖CCD的核心维度。

英文摘要

Robotic systems are moving from isolated platforms to interconnected multi-agent ecosystems that operate in human environments. This shift raises a governance problem that existing frameworks do not address: how does consent propagate, degrade, and break down across chains of delegation between embodied autonomous agents? The AI ethics community has begun to study consent for digital software agents, and the HRI community has examined consent in dyadic human-robot encounters. Neither body of work covers what happens when physical robots delegate tasks to other robots in ways that affect humans. This paper introduces consent chain degradation (CCD), a conceptual framework for analyzing how the specificity, validity, and scope of human consent erodes as authority passes through multi-robot delegation chains. We propose a three-layer governance architecture, the Consent Runtime Verification Framework for Embodied Agents (CoRVE), which integrates consent scope modeling, delegation chain tracking, and physical irreversibility assessment. Three scenarios in healthcare, domestic, and industrial robotics show how CCD arises in practice, including a worked numerical example. A regulatory gap analysis covering the EU AI Act, the GDPR, the Machinery Regulation, and the Revised Product Liability Directive shows that all four instruments leave core CCD dimensions unaddressed.

2605.16298 2026-05-19 cs.CY cs.AI cs.MA 版本更新

Data-driven and distributed governance of building facilities management using decentralized autonomous organization, digital twin, and large language models

基于去中心化自治组织、数字孪生和大语言模型的建筑设施管理数据驱动与分布式治理

Reachsak Ly, Alireza Shojaei, Xinghua Gao, Philip Agee, Abiola Akanmu

发表机构 * School of Technology, Eastern Illinois University(东伊利诺伊大学技术学院) Myers-Lawson School of Construction, Virginia Polytechnic Institute and State University(弗吉尼亚理工学院和州立大学迈尔斯-劳森建设学院)

AI总结 本文提出一种融合DAO、数字孪生、大语言模型和区块链的建筑管理分布式治理框架,以提升决策透明度和系统安全性。

Comments 33 pages, 20 figures, 4 tables

详情
AI中文摘要

尽管传统AI和数据驱动的设施管理方法提升了建筑运营效率,但其受集中化组织结构限制,易受网络攻击、缺乏上下文理解且排除关键利益相关者参与治理。本文提出一种新型AI和数据驱动的分布式治理框架,整合去中心化自治组织(DAOs)、数字孪生、大语言模型(LLMs)和区块链技术。该框架通过DAO治理平台实现透明集体决策,利用物联网和数字孪生实现数据驱动管理,采用LLM虚拟助手增强决策支持,并利用区块链实现安全建筑自动化。开发了一个全栈去中心化应用以促进用户与这些组件的交互。系统通过系统可用性量表(SUS)评估了成本效益、可扩展性、数据安全性和可用性,并通过专家访谈评估其实际效益和实施挑战。

英文摘要

While traditional AI and data-driven facilities management approaches have improved building operational efficiency, they remain constrained by centralized organizational structures that are vulnerable to cyber attacks, limited contextual understanding, and decision-making processes that exclude key stakeholders from governance. This paper introduces a novel AI- and data-driven distributed governance framework for smart building management that integrates decentralized autonomous organizations (DAOs), digital twins, large language models (LLMs), and blockchain technology. The framework enables transparent collective decision-making through a DAO governance platform, implements data-driven management using IoT and digital twins, incorporates LLM-based virtual assistants for enhanced decision support, and utilizes blockchain for secure building automation. A full-stack decentralized application was developed to facilitate user interaction with these integrated components. The system was evaluated for cost efficiency, scalability, data security, and usability using the System Usability Scale (SUS). Expert interviews were also conducted to assess its practical benefits and implementation challenges.

2605.16297 2026-05-19 cs.CY cs.AI cs.HC cs.SE 版本更新

Task-Level AI Readiness Assessment for Business Process Management:The T-IPO Model and LARA Matrix in Financial-Services IT Operations

面向业务流程管理的任务级AI准备性评估:T-IPO模型与LARA矩阵在金融服务IT运营中的应用

Mingjun Li, Xiaojun Ye

发表机构 * Tsinghua University(清华大学)

AI总结 本文提出T-IPO模型和LARA矩阵,用于评估任务级AI代理的适用性,通过认知执行复杂度和治理合规强度的两因素结构提升预测效用。

Comments 19 pages, 7 figures, 8 tables, 50 references. A shortened workshop version has been submitted to the BPM 2026 Workshop. This preprint is the complete version

详情
AI中文摘要

本文探讨了企业工作流中哪些任务可由大语言模型代理可靠处理及其条件。现有业务流程建模框架仍以活动级回答,但单个活动可能包含难度差异极大的工作。本文提出T-IPO模型,将每个任务表示为八元组,以及LARA矩阵,通过德尔菲研究和AHP验证,赋予合规敏感性1.5倍权重,产生L1至L4四个等级,并应用地板规则。通过127个任务评估,Kappa值达0.80,三机构复现为0.73。对比活动级评估显示任务级预测效用提升。120个任务实例部署显示自动完成率从L1的95%降至L3的40%。探索性因子分析显示任务准备性由认知执行复杂度和治理合规强度共同决定。最后提出LARA-TCA校准程序以适应LLM能力演变。

英文摘要

Which tasks inside an enterprise workflow can a large-language-model agent reliably handle, and under what conditions? Most business process modeling frameworks still answer this at the activity level, even though a single activity can bundle work of radically different difficulty. This paper takes the analysis a step smaller. We describe two design artifacts developed in a financial-services IT setting: T-IPO, which represents each task as an eight-element tuple, and LARA (LLM Agent Readiness Assessment), a five-dimension rubric that scores a task's readiness for agent substitution. Compliance Sensitivity carries $1.5\times$ weight, a value we fixed through a three-round Delphi study and cross-checked with AHP. The rubric produces four levels, L1 to L4, and applies a floor rule so that a task with maximum compliance load cannot be classified below L3 no matter what the other scores say. Both artifacts sit inside a larger methodology (PARTIS) that we map onto BWW ontology in Section 3. We evaluate the instruments across 127 tasks. Inter-rater agreement reaches Fleiss' $κ= 0.80$; a replication at three further institutions returns $κ= 0.73$. A controlled comparison against activity-level assessment suggests, though does not prove, an improvement in predictive utility at the task level. Pilot deployment of 120 task instances confirms that auto-completion decays monotonically from $95\%$ at L1 through about $70\%$ at L2 to about $40\%$ at L3. Exploratory factor analysis points to a two-factor structure: task readiness seems to be determined jointly by cognitive-execution complexity and governance-compliance intensity. We close with a recalibration procedure (LARA-TCA) so the rubric can keep pace with evolving LLM capabilities.

2605.16295 2026-05-19 cs.CY cs.AI cs.CL cs.GR cs.HC cs.MM 版本更新

ANVIL: Analogies and Videos for Lecturers

ANVIL:为讲师提供类比和视频

Yuri Noviello, Anastasiia Birillo, Gosia Migut

发表机构 * Delft University of Technology(代尔夫特理工大学) JetBrains Research(JetBrains研究院)

AI总结 ANVIL是一种多模态生成系统,可自动生成基于类比的计算机科学教学动画。通过生成文本类比、结构化视觉剧本和可执行代码,提升教学有效性。

详情
AI中文摘要

我们介绍了ANVIL,一种多模态生成系统,可自动生成基于类比的教学动画。给定一个概念定义,ANVIL生成文本类比,将其编译成结构化的视觉剧本,并生成可执行的manim代码以渲染动画,同时具备自动修复机制以提高鲁棒性。在大规模评估此类系统时,需要在教学有效性与可扩展性之间取得平衡。我们首先通过教师评估来确定质量评估的基础,并利用其发现来指导自动化筛选。对于文本类比,我们引入基于LLM的评估器以实现可扩展的质量筛选;对于视频,由于主观判断难以自动化,我们改用自动代理来评估与预期剧本的一致性并进行错误分析。我们进一步与教育工作者进行用户研究,以考察采用要求和风险。我们的发现表明,ANVIL可以生成经常被评价为足够的材料,并且教育工作者对其感知价值和易用性有积极反应。

英文摘要

We present ANVIL, a multimodal generative system that automates the production of analogy-based instructional animations for computer science topics. Given a concept definition, ANVIL generates a textual analogy, compiles it into a structured visual screenplay, and produces executable manim code to render an animation, with an automated repair mechanism to improve robustness. Evaluating such systems at scale requires balancing pedagogical validity with scalability. We begin with a teacher evaluation to ground the quality assessment and use its findings to guide automated screening. For textual analogies, we introduce an LLM-based evaluator for scalable quality screening; for videos, where subjective judgments are difficult to automate, we instead assess fidelity to the intended screenplay using an automated proxy for auditing and error analysis. We further conduct a user study with educators to examine adoption requirements and risks. Our findings suggest that ANVIL can produce materials that are frequently rated as adequate, and that educators respond positively to its perceived value and usability.

2605.16294 2026-05-19 cs.CY cs.AI 版本更新

Are Researchers Being Replaced by Artificial Intelligence?

研究人员是否被人工智能取代?

Angelo A. Salatino, Ansgar Scherp, Christin Katharina Kreutz, Sahar Vahdati

发表机构 * Knowledge Media Institute, The Open University Milton Keynes UK Ulm University Ulm Germany TH Mittelhessen – University of Applied Sciences Gießen Germany TIB – Leibniz Information Centre for Science Technology \& Leibniz University of Hannover Hannover Germany Knowledge Media Institute, The Open University Ulm University TH Mittelhessen – University of Applied Sciences Technology \& Leibniz University of Hannover

AI总结 研究探讨人工智能在科研中的影响,指出AI工具的普及导致研究人员角色转变,从创造者转向 curator,引发人类对科学理解的担忧。

详情
AI中文摘要

一项2023年Nature调查涉及1600名研究人员,显示科学家对人工智能工具在科研中的增加使用感到担忧和兴奋。本文探讨AI如何重塑科学生命周期,并揭示深层危险:并非AI无法从事科学,而是人类可能停止真正理解科学。

英文摘要

A Nature survey from 2023 involving 1,600 researchers shows that scientists are ``concerned, as well as excited, by the increasing use of artificial-intelligence tools in research.'' This tension frames our central question: Are researchers being replaced by artificial intelligence? We argue that replacement is already underway-not as disappearance, but as a shift from researcher-as-creator to researcher-as-curator. As AI agents increasingly generate hypotheses, papers, and reviews, humans risk retaining responsibility while losing intellectual ownership. This article examines how AI is reshaping the scientific lifecycle and exposes the deeper danger: not that AI will fail to do science, but that humans may stop truly understanding it.

2605.16292 2026-05-19 cs.CY cs.AI 版本更新

Evidence of a Cognitive Shift in AI Education: How Students Are Rethinking Human Intelligence?

人工智能教育中认知转变的证据:学生如何重新思考人类智能?

Islem Rekik

发表机构 * BASIRA Lab, Imperial-X (I-X) and Department of Computing(BASIRA实验室、Imperial-X(I-X)和计算系) Imperial College London, London, United Kingdom(伦敦帝国学院,伦敦,英国)

AI总结 研究通过长期分析发现,学生对人工智能和人类智能的评价随时间推移逐渐转变,从2020年偏爱AI到2026年多数学生更重视人类智能。

Comments ICLR HCAIR Workhop 2026 https://openreview.net/forum?id=chH4gO2tZT

详情
AI中文摘要

对人工智能(AI)系统的感知影响学习者如何评估和依赖这些系统。尽管AI能力迅速提升,但持续接触这些工具对学生对人类智能(HI)与AI相对价值的影响仍被忽视。本文通过2020至2026年间收集的471名学生课堂投票数据,分析了AI相关本科和硕士课程中四个阶段: hype(炒作)、distrust(不信任)、trust(信任)和 dependency(依赖)。2020年早期投票略微偏向AI,但自2024年起,所有硕士课程群体中逐渐转向更重视HI。到2026年,技术课程中HI偏好达到65%(比2025年增加12个百分点),而设计导向课程中HI偏好达到90%(比2025年增加36个百分点)。这些发现表明,随着AI成为常规工具,人类智能的重新评估逐渐发生,对学习者自主性和知识权威性有影响。本文最后反思了从偏爱AI到优先考虑HI的认知转变。

英文摘要

Perceptions of intelligence shape how learners evaluate and rely on artificial intelligence (AI) systems. Despite rapid advances in AI capabilities, the impact of sustained exposure to these tools on students' valuation of human intelligence (HI) relative to AI remains underexplored. This paper presents a longitudinal analysis of classroom poll responses collected between 2020 and 2026 in AI-focused undergraduate and MSc courses in computer science. Data from 471 students across technical courses (such as Machine Learning and Deep Graph Learning) and design-oriented courses (such as Design Thinking for AI) reveal four recurring phases: hype, distrust, trust, and dependency. While early responses in 2020 slightly favored AI, a consistent shift toward HI emerged from 2024 onward across all MSc cohorts. By 2026, preference for HI reached 65 percent in a technical course (a 12 percentage-point increase from 2025) and 90 percent in a design-oriented course (a 36 percentage-point increase). These findings suggest a gradual reappraisal of human intelligence as AI becomes a routine tool, with implications for learner autonomy and epistemic agency. We conclude by reflecting on this cognitive shift from favoring artificial intelligence toward prioritizing human intelligence.

2605.16291 2026-05-19 cs.CY cs.AI cs.GT 版本更新

AI of the People, by the People, for the People: A Social Choice Approach to Collective Control of Artificial Intelligence

为人民的AI,由人民,为人民:一种社会选择方法用于集体控制人工智能

Paul Anton Bachmann, Niclas Boehmer, Lukas Daniel Klausner, Martin Lackner

发表机构 * Institute of Logic and Computation, TU Wien(逻辑与计算研究所,维也纳技术大学) Hasso Plattner Institute, University of Potsdam(哈索·platzer研究所,波茨坦大学) Center for Artificial Intelligence, University of Applied Sciences St. Pölten(人工智能中心,圣波尔滕应用科学大学)

AI总结 本文提出基于社会选择理论的集体控制AI方法,强调在机器学习全流程中融入公众意见,提供数学框架评估控制机制。

Comments Accepted for publication in Proceedings of the 2026 ACM Conference on Fairness, Accountability, and Transparency (FAccT '26)

详情
AI中文摘要

随着AI系统的广泛应用,如何社会控制AI成为紧迫问题。现有民主控制研究多关注宏观治理,本文提出基于社会选择理论的集体控制AI方法,主张在数据收集、目标设计到对齐等ML开发全流程中纳入集体输入。进一步证明社会选择理论适用于各阶段集体输入的建模,并通过公理化方法提供评估控制机制的原理性标准。本文概念贡献提供数学基础框架来实施和分析AI系统的集体控制。

英文摘要

With the growing adoption of AI systems, reasoning about how society can exert control over AI becomes an increasingly urgent problem. Existing work on democratic control largely focuses on macro-level governance. In contrast, we propose a new approach grounded in social choice theory, which we term collective control of artificial intelligence. We argue that collective input can and should be incorporated at multiple points across the ML development pipeline, from data collection through objective design to alignment. We further demonstrate that social choice provides a well-suited modelling language for the treatment of collective input across all stages and that its axiomatic methodology yields principled criteria for evaluating various control mechanisms. Overall, our conceptual contribution provides a mathematically grounded framework to implement and analyse collective control of AI systems.

2605.16290 2026-05-19 cs.CY cs.AI cs.LG 版本更新

MCQ Difficulty Prediction via Modeling Learner Heterogeneity Using Data-Driven Cognitive Profiling

通过数据驱动的认知画像建模学习者异质性以预测多项选择题难度

Dhriti Krishnan, Jaromir Savelka

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出基于学习者异质性的数据驱动认知画像框架,通过隐类分析识别行为画像并模拟响应分布,结合主题上下文和岭回归模型预测IRT难度参数,提升难度预测精度。

详情
AI中文摘要

预测多项选择题(MCQ)难度对有效评估至关重要,但当前方法通常假设学生能力分布单峰,忽视学生误解的异质性。本文提出一种基于角色的框架,用数据驱动的认知画像替代理论能力采样。利用EEDI数据集中的学生互动,通过潜在类分析(LCA)识别行为画像,然后将大语言模型(LLM)调制以模拟每个画像的响应分布。这些信号与主题上下文结合,输入岭回归模型预测项目反应理论(IRT)难度参数。通过五折交叉验证,本文方法在MSE上优于最近的基线(0.367到0.274;R2:0.525到0.686)。发现的画像具有可解释性,并提供了关于项目难度原因的见解,潜在应用于诊断评估设计。

英文摘要

Predicting the difficulty of multiple-choice questions (MCQs) is important for effective assessment, yet current methods typically assume a unimodal student ability distribution, overlooking the heterogeneous nature of student misconceptions. We propose a persona-driven framework that replaces theoretical ability sampling with data-driven cognitive profiling. Using student interactions from the EEDI dataset, we identify behavioral personas via latent class analysis (LCA), then condition a large language model (LLM) to simulate response distributions for each persona. These signals are aggregated with topic context and fed into a Ridge Regression model to predict the item response theory (IRT) difficulty parameter. With five-fold cross-validation, our method improves over a recent baseline (MSE: 0.367 to 0.274; R2: 0.525 to 0.686). The discovered personas are interpretable and offer insights into why items are difficult, with potential applications to diagnostic assessment design.

2605.16286 2026-05-19 cs.CY cs.AI 版本更新

Homoglyph-based Adversarial Perturbation of Introductory Computer Science Theory Problems

基于同形字的初级计算机科学理论问题对抗扰动

Aidan Alexander, Chitrangada Juneja, Napaluck Tontrasathien, Miro Vanek, Reyan Ahmed, Saumya Debray, Sazzadur Rahaman

发表机构 * our institution(我们的机构)

AI总结 本文提出一种简单方法,通过同形字对抗扰动修改问题,使学生无法利用AI工具完成作业,同时开发交互工具方便应用。

详情
AI中文摘要

不同的AI工具如ChatGPT、Gemini和Claude正变得非常流行。尽管它们对许多日常任务有帮助,但它们可能以意想不到的方式使用。例如,如果学生使用这些工具解决作业问题,课程的学习目标可能无法实现。本文提出了一种简单的方法来解决懒惰学生模型中的这一问题。该方法使用基于同形字的对抗扰动首先修改问题,而不改变问题的语义含义。然后通过其同形字扰动少量字符。我们的实验结果表明,初级计算机科学课程的理论问题可以被有效扰动。我们还提出了一种交互工具,以方便使用我们的方法。

英文摘要

Different AI tools such as ChatGPT, Gemini, and Claude are becoming very popular. Although they are helpful for many day-to-day tasks, they can be used in unexpected ways. For example, the learning objectives of a course may not be achieved if students use these tools to solve their homework problems. This paper proposes a simple method to address this issue in the lazy student model. The method uses homoglyph-based adversarial perturbation to first modify the question without changing the semantic meaning of the question. Then a few characters are perturbed by their homoglyphs. Our experimental result shows the theoretical problems of introductory computer science courses can be effectively perturbed. We also propose an interactive tool to conveniently use our method.

2605.16284 2026-05-19 cs.CY cs.AI 版本更新

Measuring Changes in Instructor Class Design and Student Learning After the Release of Large Language Models (LLMs)

评估大型语言模型发布后教师课程设计和学生学习的变化

Amanda Potasznik, Daniel Haehn

发表机构 * University of Massachusetts, Boston(马萨诸塞大学波士顿分校)

AI总结 研究通过混合方法分析大学课程中LLM的使用对学生学习和教师教学的影响,结合定量分析和问卷调查数据,探讨学习成果的变化。

详情
AI中文摘要

学生在完成课程作业时使用生成式AI工具,无论教授是否知情或批准,已导致高等教育领域发生显著变化。尽管生成式AI广泛使用,但其对学生学习方法、教师课程开发、成绩报告和整体学习的影响尚未充分记录。本研究采用混合方法,结合回顾性定量分析、教师调查和匿名学生调查,分析大学课堂内外使用LLM作为学习工具的感知和经验。通过定量和主题分析教师和学生调查结果,以及历史成绩数据,研究学习成果在LLM时代前后的变化。希望本研究能为其他机构提供试点研究,其结果可帮助教授、大学及其他教育机构制定GenAI政策,以最大化学生学习。

英文摘要

Student use of Generative AI (GenAI) products in completing their classwork, with or without their professors' knowledge and/or approval, has resulted in substantial shifts in higher education. While GenAI use is widespread, its impact on student study methods, faculty course development, grade reporting, and overall learning is not well documented. This is a mixed-methods, multi-course study using retrospective quantitative analysis, instructor surveys, and anonymous student surveys at a university in the New England region of the United States. This research seeks to identify and document patterns in student and faculty perceptions of, and experiences in, the use of LLMs as a learning tool inside and outside of the university classroom. Alongside quantitative and thematic analysis of both faculty and student survey responses, historical grade data as reported to the university registrar is used to triangulate the phenomenon of learning achievement in pre- and post-LLM eras. It is hoped that this research can serve as a pilot study for a broader set of institutions. Results from this study can inform GenAI policy for professors, universities, and other educational institutions that are trying to maximize student learning in the age of AI.

2605.16282 2026-05-19 cs.CY cs.AI 版本更新

Taxonomy and Consistency Analysis of Safety Benchmarks for AI Agents

AI代理安全基准的分类与一致性分析

Miles Q. Li, Benjamin C. M. Fung, Boyang Li, Heba Ismail, Farkhund Iqbal

发表机构 * McGill University(麦吉尔大学) Kean University(凯恩大学) Zayed University(扎耶德大学)

AI总结 本文分析了AI代理安全基准的分类与一致性问题,揭示了方法学选择对安全结论的影响,指出基准选择可能导致矛盾的安全结论,且指标碎片化限制了比较。

详情
AI中文摘要

LLM基于自主代理的快速部署引入了超越传统LLM安全问题的风险,自2023年底以来安全基准迅速增多,但存在不一致的威胁模型、不兼容的指标和不完全的风险覆盖。本文首次系统分析代理安全基准作为评估工具,列举了40个行为代理安全基准及5个相邻评估器、防御和数据集工具,提出六轴基准评估方法论分类,并应用于整个语料库,以说明方法学选择如何塑造安全结论。覆盖矩阵显示广泛的风险覆盖但有限的方法学收敛,而分类分析显示核心集中在沙盒化、受限和通常仅安全的评估。在整体领域中,发现基准选择可导致矛盾的安全结论,覆盖计数常高估评估深度,环境保真度系统地影响报告的安全性,该领域不成比例地测试外部施加而非代理内部风险,指标碎片化限制了比较,鲁棒性仍基本未被评估。这些主张通过跨基准一致性检查得到支持,95%置信区间和Kendall's W一致性分析显示无证据显示在评估维度上存在排名一致性(W=0.10,p=0.94)。本文释放了结构化元数据、完整分类编码、风险注释和所有实验工具,并提出未来基准的最低报告标准。

英文摘要

The rapid deployment of LLM-based autonomous agents has introduced safety risks that extend far beyond traditional LLM concerns, prompting a proliferation of safety benchmarks since late 2023. However, these benchmarks have developed independently, with inconsistent threat models, incompatible metrics, and overlapping yet incomplete risk coverage. We present the first systematic analysis dedicated to agent safety benchmarks as evaluation instruments. We catalog 40 behavioral agent-safety benchmarks (2023-2026), plus 5 adjacent evaluator, defense, and dataset artifacts, propose a six-axis taxonomy of benchmark evaluation methodology, and apply it across the corpus to characterize how methodological choices shape safety conclusions. A coverage matrix reveals broad risk coverage but limited methodological convergence, while the taxonomy analysis shows a behavioral-benchmark core concentrated in sandboxed, constrained, and often safety-only evaluation. Across the landscape, we find that benchmark choice can yield contradictory safety conclusions, coverage counts often overstate evaluation depth, environment fidelity systematically shapes reported safety, the field disproportionately tests externally imposed rather than agent-internal risks, metric fragmentation limits comparison, and robustness remains effectively unbenchmarked. We ground these claims with a cross-benchmark consistency check, with 95% confidence intervals and Kendall's W concordance analysis, finding no evidence of ranking concordance across evaluation dimensions (W = 0.10, p = 0.94). We release structured metadata, full taxonomy codings, risk annotations, and all experimental artifacts, and propose minimum reporting standards for future benchmarks.

2605.16281 2026-05-19 cs.CY cs.AI 版本更新

From Reactive to Proactive: A Multi-Regulatory Empirical Analysis of 480 AI Incidents and a Data-Driven Governance Compliance Framework

从被动到主动:对480起AI事件的多监管实证分析及数据驱动的治理合规框架

Ummara Mumtaz, Summaya Mumtaz

发表机构 * University of the Cumberlands, USA(卡姆伯兰大学)

AI总结 本文通过分析480起AI事件,揭示现有治理框架在部署后问责中的不足,并提出主动治理框架,旨在从被动响应转向部署前的合规保障。

详情
AI中文摘要

人工智能系统日益应用于高风险领域,但现有治理框架在部署后确保问责性仍不明确。本研究的贡献在于:首先,对480起真实AI事件进行跨监管实证分析,评估其与欧盟AI法案(第72-73条)、NIST AI风险管理框架(MANAGE和GOVERN功能)及GDPR(第22、33-35条)的契合度,结果显示三大框架存在显著治理缺口。其次,基于此研究结果,提出主动AI治理合规框架(PAGCF),该框架包含四个阶段生命周期方法,旨在将治理从被动事件响应转向部署前的合规保障。框架包括风险分层治理层级、与特定监管条款关联的实施检查表以及利用内部监控作为主动治理能力代理的预测影响分析。

英文摘要

Artificial intelligence systems are increasingly deployed in high-stakes domains, yet it remains unclear whether existing governance frameworks ensure accountability after deployment. This study makes two contributions. First, it presents a cross-regulatory empirical analysis of 480 real-world AI incidents from the AI Incident Database (AIID), evaluating their alignment with post-deployment provisions in three major governance frameworks: the EU AI Act (Articles 72-73), the NIST AI Risk Management Framework (MANAGE and GOVERN functions), and the General Data Protection Regulation (GDPR Articles 22, 33-35). The results reveal substantial governance gaps across these frameworks, indicating persistent weaknesses in post-deployment accountability. Second, based on these findings, the study proposes the Proactive AI Governance Compliance Framework (PAGCF), a four-phase lifecycle methodology designed to shift governance from reactive incident response toward pre-deployment compliance assurance. The framework includes risk-stratified governance tiers, an implementation checklist linked to specific regulatory provisions, and a projected impact analysis that uses internal monitoring as a proxy for proactive governance capacity.

2605.16280 2026-05-19 cs.CY cs.AI 版本更新

Beyond Imperfect Alternatives with Rulemapping: A Neuro-Symbolic Case Study on Online Hate Speech

超越不完美替代:基于规则映射的神经符号案例研究:在线仇恨言论

Oskar von Cossel

发表机构 * Max Planck Institute for Comparative and International Private Law(比较与国际私法Max Planck研究所)

AI总结 本文探讨神经符号方法在在线仇恨言论分类中的应用,通过约束大语言模型以提高法律判断的准确性与可验证性。

Comments Extended version of a paper accepted at ICAIL 2026. 10 pages, 1 figure, 2 tables

详情
AI中文摘要

自动化法律推理迫使在不完美的替代方案之间做出选择:符号系统提供透明性但难以处理模糊性,而神经系统能灵活处理自然语言但缺乏可验证性。本文研究了一种混合的神经符号方法是否能解决这一权衡。我们评估了该架构在在线内容审核领域的应用,作为高量级法律决策(如大规模行政程序)的代理。在这些情况下,操作员必须在严格法律标准下每天评估数千个案例。具体而言,我们探讨了将大型语言模型(LLMs)限制在确定性符号框架内是否能提高基于法律条款的非法性评估,同时防止“范围漂移”(即LLMs将道德冒犯性与法律非法性混淆)。我们评估了规则映射的神经符号变体——一种视觉逻辑树方法,该方法将经典法律三段论形式化——在德国刑法第130(1)条下的在线仇恨言论分类。在多样化的LLMs上,规则映射保持高召回率(0.82-0.89)同时达到精确度0.80-0.86,相比无约束提示的0.34-0.49。专家编写的符号框架因此能够实现稳健的法律自动化,符合可审计性和可验证决策的监管要求。

英文摘要

Automating legal reasoning forces a choice between imperfect alternatives: symbolic systems offer transparency but struggle with ambiguity, whereas neural systems handle natural language flexibly but lack verifiability. This paper investigates whether a hybrid, neuro-symbolic approach can reconcile this trade-off. We evaluate this architecture in the domain of online content moderation, which serves as a proxy for high-volume legal decision-making such as mass administrative proceedings. In these settings, operators must assess thousands of cases daily under strict legal standards. Specifically, we examine whether constraining large language models (LLMs) within deterministic symbolic scaffolds improves statute-grounded illegality assessment while preventing "scope drift" (where LLMs conflate moral offensiveness with legal illegality). We evaluate the neuro-symbolic variant of Rulemapping - a visual logic-tree method that operationalises the classic legal syllogism - on online hate-speech classification under §130(1) of the German Criminal Code. Across diverse LLMs, Rulemapping maintains high recall (0.82-0.89) while achieving precision of 0.80-0.86, compared to 0.34-0.49 for unconstrained prompting. Expert-authored symbolic scaffolds thus enable robust legal automation aligned with regulatory requirements for auditability and verifiable decision-making.

2605.16279 2026-05-19 cs.CY cs.AI 版本更新

Generative AI and Two-Tiered Online Mental Health Communities

生成式AI与双层在线心理健康社区

Manyang Zhang, Jinyang Zheng, Zhijun Yan

发表机构 * School of Management, Beijing Institute of Technology(北京理工大学管理学院) Simon Business School, University of Rochester(罗切斯特大学Simon商学院)

AI总结 研究探讨生成式AI在双层在线心理健康社区中的影响,发现AI增强了响应速度和患者参与度,但导致部分顾问减少参与,产生跨层溢出效应。

详情
AI中文摘要

在线心理健康社区(OMHCs)是分层平台,通过公开问答论坛和付费私人咨询连接患者与受过资格认证的顾问。其双层结构为生成式AI整合带来战略困境。对话代理可提供可扩展且及时的响应,缓解持续供应短缺,但大规模存在可能重塑顾问在提供细致专业知识、情感敏感支持和付费咨询中的参与,这些是平台收入和长期可持续性的核心。利用一个生成式AI对话代理整合的准自然实验,我们检验了AI进入如何影响顾问参与。通过多重识别策略,我们发现AI整合后发布强度显著增加,而平均响应长度保持不变,每条帖子的社会认可度下降。机制分析显示,AI提高了响应速度并扩大了患者参与度,扩大了顾问的机会集,部分活动从附近的非AI子论坛重新分配。顾问参与异质性:内在动机顾问减少参与,而经济动机顾问增强竞争努力。这些动态产生跨层溢出效应:不活跃的顾问经历付费咨询的下降,而增加公共参与的顾问保持或扩大下游需求。总体而言,我们的发现表明,在分层专业平台中,需求扩张和竞争激励可以超过内在挤出。

英文摘要

Online mental health communities (OMHCs) are tiered platforms that connect patients with licensed counselors through public Q&A forums and paid private consultations. Their two-tier structure creates a strategic dilemma for genAI integration. Conversational agents can provide scalable and timely responses to a broader set of patients, alleviating persistent supply shortages, but their large-scale presence may also reshape counselors' participation in providing nuanced expertise, emotionally sensitive support, and paid consultations, which are central to platform revenue and long-run sustainability. Leveraging a quasi-natural experiment from the integration of a genAI-based conversational agent in a leading OMHC, we examine how AI entry affects counselor participation. Using multiple identification strategies, we find that posting intensity increases significantly after AI integration, while average response length remains unchanged and per-post social recognition declines. Mechanism analyses show that AI improves responsiveness and expands patient engagement, enlarging counselors' opportunity sets, with activity partially reallocated from a nearby non-AI subforum. Counselors respond heterogeneously: intrinsically motivated counselors reduce participation, whereas economically motivated counselors intensify competitive effort. These dynamics generate cross-tier spillovers: inactive counselors experience declines in paid consultations, while those who increase public participation preserve or expand downstream demand. Overall, our findings show that in tiered professional platforms, demand expansion and competitive incentives can outweigh intrinsic crowding-out.

2605.16278 2026-05-19 cs.CY cs.AI cs.HC 版本更新

Keeping an Eye on AI: A Framework for Effective Human Oversight of AI Systems

关注人工智能:一种有效的人工智能系统人类监督的框架

Susanne Gaube, Markus Langer, Tim Miller, Kevin Baum, Raimund Dachselt, Anna Maria Feit, Ujwal Gadiraju, Harmanpreet Kaur, Mark T. Keane, Richard Landers, Johann Laux, Q. Vera Liao, Brian Lim, Linda Onnasch, Tim Schrills, Liz Sonenberg, Chenhao Tan, Nava Tintarev, Ziang Xiao, Hanwei Zhang

发表机构 * University College London(伦敦大学) University of Freiburg(弗赖堡大学) University of Queensland(昆士兰大学) Saarland University(萨尔兰大学) TU Dresden(德累斯顿技术大学) Delft University of Technology(代尔夫特理工大学) University of Minnesota(明尼苏达大学) University College Dublin(都柏林大学) University of Oxford(牛津大学) University of Michigan(密歇根大学) National University of Singapore(新加坡国立大学) Technische Universität Berlin(柏林技术大学) University of Lübeck(吕贝克大学) University of Melbourne(墨尔本大学) University of Chicago(芝加哥大学) Maastricht University(马斯特里赫特大学)

AI总结 本文提出一个跨学科框架,用于有效的人工智能系统人类监督,定义了监督架构和流程,并探讨了该领域需要考虑的开放性研究挑战。

Comments The conceptual analysis for this work was undertaken by the authors at Dagstuhl seminar 25272 'Challenges of Human Oversight: Achieving Human Control of AI-Based Systems' (https://www.dagstuhl.de/25272), held at Schloss Dagstuhl (June 29th-July 4th, 2025)

详情
AI中文摘要

人工智能在高风险决策场景中的使用带来了技术、安全和规范性挑战;这些问题可能只能通过人类监督来缓解。然而,人类监督的概念缺乏共同的基础理解:监督架构未被良好定义,涉及的角色仍不明确,实施步骤不透明。因此,研究人员和实践者难以确定如何设计、实现和评估能够有效实现人类监督的系统。本文提出了一种实用框架,基于计算机科学、人机交互、心理学、哲学和法律的跨学科视角。核心贡献包括:(1)一个基础框架,包含有效的人工智能系统人类监督的定义、架构和流程;(2)一个初步的文档模板,用于记录监督架构和流程,适用于不同领域;(3)对新兴有效人工智能系统人类监督领域需要考虑的开放性研究挑战的综合总结。

英文摘要

The use of Artificial Intelligence (AI) in high-risk, decision-making scenarios presents technical, safety, and normative challenges; problems that may only be ameliorated by human oversight. However, notions of human oversight lack a common foundational understanding: oversight architectures are not well defined, the roles involved remain unclear, and implementation steps are opaque. Hence, researchers and practitioners struggle to determine how to design, implement, and evaluate systems that enable effective human oversight. This paper advances a practical framework for effective human oversight of AI systems, based on a cross-disciplinary perspective that draws on insights from computer science, human-computer interaction, psychology, philosophy, and law. The core contributions are: (1) a foundational framework, with a working definition, architecture and processes for effective human oversight of AI systems; (2) an initial template for documenting oversight architectures and processes, applied to diverse domains; and (3) a synthesis of open research challenges that need to be considered in the emerging field of effective human oversight of AI systems.

2605.16277 2026-05-19 cs.CY cs.AI 版本更新

Generative AI in K-12 Classrooms: A Midyear Implementation Report

生成式AI在K-12课堂中的应用:中期实施报告

Lief Esbenshade, Alex Liu, Michael Xiao, Zewei Tian, Min Sun, Zachary Zhang, Thomas Han, Yulia Lapicus, Kevin He

发表机构 * University of Washington(华盛顿大学)

AI总结 本报告分析了2025年9月至12月华盛顿州12个学区中教师使用Colleague AI的情况,探讨了AI在K-12教育中的初步应用效果及影响因素。

详情
AI中文摘要

本中期报告由Colleague AI与AmplifyLearn.AI联合发布,汇总了2025年9月至12月华盛顿州12个学区的平台数据和行政记录,展示了教师在2025-26学年第一学期与AI互动的情况。这些学区规模各异,涵盖农村、郊区和城市地区。仅有部分学区能提供中期行政数据,与教师使用AI相关的学生特征的发现应被视为初步信号。

英文摘要

This mid-year report summarizes teacher use of Colleague AI across 12 Washington State school districts from September 1 to December 31, 2025. Produced jointly by Colleague AI and AmplifyLearn.AI at the University of Washington, this report aggregates platform data and district-provided administrative records to provide an early look at how teachers engaged with AI during the first half of the 2025-26 school year. The districts vary in size from small districts with a few thousand students to large districts with up to thirty thousand students. The districts are rural, suburban, and urban. Only a subset of districts were able to provide mid-year administrative data, and findings that link teachers' use of Colleague AI to student characteristics should be interpreted as preliminary signals.

2605.16275 2026-05-19 cs.CY cs.AI cs.CL cs.MM 版本更新

AI Slop or AI-enhancement? Student perceptions of AI-generated media for an English for Academic Purposes course

AI 产出物还是AI增强?英语学术用途课程中学生对AI生成媒体的看法

David James Woo, Deliang Wang, Kai Guo

发表机构 * Everwrite Limited(Everwrite有限公司) Faculty of Education, The University of Hong Kong(香港大学教育学院) Faculty of Education, The Chinese University of Hong Kong(香港中文大学教育学院)

AI总结 研究探讨了AI生成内容在EAP课程中的教学效果,通过混合方法分析发现学生偏好视觉化内容,视频与学业表现正相关,但高认知负荷与成绩负相关,表明需合理设计内容以提升学习效果。

Comments 23 pages, 7 figures

详情
AI中文摘要

人工智能(AI)检索增强生成(RAG)工具现在使教育者能够将课程材料转化为多样化的多媒体内容。然而,这种AI生成内容是教学支架还是低质量的AI产出仍不明确。本文报告了在一所香港社区学院的英语学术用途(EAP)课程中,教师引导生成补充材料的开发、实施与评估。主要使用Google Notebook LM生成视频、播客、信息图和个性化反馈报告。通过混合方法设计,包括调查、半结构化访谈和与学术成绩的相关性分析,研究发现学生认为材料有用且易于使用,更偏好与评估相关的视觉和多模态内容,特别是视频和信息图。视频偏好与学业成绩正相关,但高认知负荷与成绩负相关,表明需谨慎校准内容复杂性。值得注意的是,部分成绩较低的学生自行将材料作为补救支架。该实践表明,RAG工具能够实现传统方法难以实现的规模化个性化反馈。当与学生目标和认知原理相结合时,教师引导的AI生成可以有意义地增强EAP学习生态系统,而非产生AI产出物。

英文摘要

Artificial intelligence (AI) retrieval-augmented generation (RAG) tools now enable educators to transform course materials into diverse multimedia at scale. However, it remains unclear whether such AI-generated content functions as a pedagogical scaffold or AI slop: high volume, low quality material. This innovative practice paper reports on the development, implementation, and evaluation of teacher-prompted, AI-generated supplemental materials in an English for Academic Purposes (EAP) course at a Hong Kong Community College. Using primarily Google Notebook LM, the instructor generated videos, podcasts, infographics, and individualized feedback reports from course materials and student work for 106 English as a Foreign Language learners. An explanatory sequential mixed-methods design comprising a survey, semi-structured interviews, and correlation analysis with academic scores was employed to examine students' preferences, perceptions, and learning outcomes. Findings are framed through the Technology Acceptance Model and Cognitive Load Theory. Students rated the materials highly for perceived usefulness and ease of use, and preferred assessment-linked content presented in visual and multimodal formats, particularly videos and infographics. Video preference correlated positively with academic performance; however, higher cognitive load was negatively associated with course grades, indicating that material complexity must be carefully calibrated. Notably, some lower-performing students independently adopted the materials as remedial scaffolds. The practice demonstrates that RAG tools enable scalable personalized feedback that would be less feasible through traditional methods. When aligned with student goals and cognitive principles, teacher-prompted AI generation can meaningfully enhance the EAP learning ecosystem rather than producing AI slop.

2605.16274 2026-05-19 cs.HC cs.AI 版本更新

ChartDesign: Towards LLM Designer of Data Visualization

ChartDesign: 向数据可视化设计的LLM设计师迈进

Mohammed Afaan Ansari, Aniruddh Bansal, Tianyi Zhou

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 本文提出ChartDesign,利用大语言模型生成数据可视化设计属性,提升图表设计性能,实现84%的准确率,并在未见领域中表现出色。

详情
AI中文摘要

图表是数据可视化的主要媒介,用于发现模式和趋势以及传达数据驱动的见解,但设计仍需要昂贵的人力和专业知识,如选择适当的图表类型、轴方向、字体大小和布局。大多数自动可视化系统依赖手工规则或简单匹配,难以跨领域泛化。本文探索了大型语言模型(LLM)作为图表设计师的潜力。我们提出了ChartDesign,通过后训练LLM来模仿专家并根据表格数据生成图表设计属性。为此,我们精心整理了来自公共调查(PewResearch)和学术存储库(CharXiV)的多样化训练语料。使用视觉语言模型提取数据和设计属性,包括图表类型、子类型、对齐、标题、轴标签和条间距,格式为JSON。然后在Phi3、Qwen3和InternVL2.5上微调LoRA适配器,学习从数据到设计规范的映射。ChartDesign在强基线上显著提高了图表设计性能,在测试集上达到84%的准确率(对比最佳基线53%),并泛化到未见领域。我们进一步表明,由ChartDesign生成的图表在视觉上令人满意且受人类偏好,缩小了人类与AI在数据可视化中的差距。

英文摘要

Charts are the dominant medium for visualizing data, discovering patterns and trends, and communicating data driven insights, yet designing them still requires expensive human effort and expertise, such as selecting appropriate chart types, axis orientations, font sizes, and layouts. Most automatic visualization systems rely on handcrafted heuristics or simple rule matching and therefore struggle to generalize across domains. This work explores the potential of large language models (LLMs) as chart designers. We propose ChartDesign, which post-trains LLMs to imitate human experts and generate chart design attributes given tabular data. To this end, we curate a diverse training corpus of data design pairs from charts in public surveys (PewResearch) and academic repositories (CharXiV). Vision language models are used to extract data and design attributes from these charts, including chart type, sub type, alignment, titles, axis labels, and bar spacing, formatted as JSON. We then fine tune LoRA adapters on Phi3, Qwen3, and InternVL2.5 to learn a mapping from data to design specifications. ChartDesign significantly improves chart design performance over strong baselines, achieving up to 84% accuracy on a held-out test set (vs. 53% for the best baseline) and generalizing to unseen domains. We further show that charts rendered from ChartDesign generated specifications are visually appealing and human preferred, narrowing the human AI gap in data visualization.

2605.16272 2026-05-19 cs.HC cs.AI 版本更新

Beyond Compliance: How AI Could Help Creative Writers by Refusing Them

超越合规:AI如何通过拒绝帮助创意作家

Hua Xuan Qin, Guangzhi Zhu, Mingming Fan, Pan Hui

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Computational Media and Arts (Guangzhou)(计算媒体与艺术(广州)) Hong Kong, China(香港,中国)

AI总结 本文探讨AI通过拒绝请求引入反思的可能性,通过22名创意作家的质性研究,发现不同情境、认知和关系维度下的偏好差异影响反思效果,提出摩擦性AI设计的策略。

Comments conditionally accepted to Creativity & Cognition 2026

详情
AI中文摘要

主流创意支持设计优先考虑合规AI以实现无缝写作交互,但对过度依赖AI的担忧凸显了需要促进对平衡使用AI和非AI资源的反思设计的需求。理论上,有意的AI非合规性(拒绝请求)可通过比其他可绕过方案更强的摩擦引入反思。实践中,拒绝内容和语言特征导致微妙反应。然而,很少有研究在超出强制性伦理和技术约束之外的细微差别上进行实证研究,探讨如何将拒绝转化为战略摩擦以处理'无害'请求。本文通过22名创意作家的质性研究,探讨了在写作不同阶段(规划、翻译、审查)中拒绝常见请求的反应。研究结果表明,反思潜力取决于在情境(例如,收敛/发散思维阶段)、认知(例如,领域信念)和关系(例如,AI角色)维度上的异质性偏好对齐。我们讨论了对创意支持的影响、更广泛的问题(例如,AI成瘾)以及摩擦性和无缝AI设计(例如,整合不同合规级别)的含义。

英文摘要

Mainstream creativity support design prioritizes compliant AI for seamless writing interactions, but concerns over inappropriate AI reliance highlight the need for designs fostering reflection on balanced AI and non-AI resource use. Theoretically, intentional AI non-compliance, refusals (saying ``no'' to requests), could introduce such reflection through friction stronger than other bypass-able solutions. Practically, refusal content/language characteristics lead to nuanced reactions. However, little research empirically focuses on nuances beyond mandatory ethical/technical constraints, on turning refusals into strategic friction for `innocuous' requests. We address this through a qualitative study with 22 creative writers, exploring reactions to refusals to common requests across writing stages (planning, translating, reviewing). Findings suggest that reflective potential depends on heterogeneous preference alignment along situational (e.g., convergent/divergent thinking phases), cognitive (e.g., domain beliefs), and relational (e.g., AI roles) dimensions. We discuss implications for creativity support, broader issues (e.g., AI addiction), and frictional/seamful AI design (e.g., integrating different compliance levels).

2605.16269 2026-05-19 cs.HC cs.AI 版本更新

Train the Trainers -- An Agentic AI Framework for Peer-Based Mental Health Support in Battlefield Environments

训练培训者——一种基于同伴的AI框架,用于战场环境中的同伴心理支持

Atmaram Yarlagadda, Eranga Bandara, Ross Gore, Anita H. Clayton, Preston Samuel, Christopher K. Rhea, Sachin Shetty, Ravi Mukkamala, Xueping Liang, Amin Hass, Abdul Rahman

发表机构 * McDonald Army Health Center(麦克唐纳陆军健康中心) Old Dominion University(老 Dominion 大学) Blanchfield Army Community Hospital(布莱恩菲尔德陆军社区医院) Department of Psychiatry and Neurobehavioral Sciences(精神病学与神经行为科学系) University of Virginia School of Medicine(弗吉尼亚大学医学院) Florida International University(佛罗里达国际大学) Accenture Technology Labs(埃森哲技术实验室)

AI总结 本文提出一种基于同伴的AI框架,利用AI代理提升战场环境下心理支持的效率,减少撤离需求并提高持续护理质量。

详情
AI中文摘要

现代军事行动使士兵面临持续的心理压力,导致急性反应、创伤后应激症状和其他心理健康问题。尽管美国国防部提供证据支持的疗法,但在前线和冲突环境中难以获得受过训练的专业人员。因此,早期压力症状的士兵常被撤离至后方医疗设施,延迟护理,降低战备状态并增加长期风险。本文提出一个Train-the-Trainers框架,其中完成治疗并返回执勤的士兵被训练为同伴促进者,以在作战环境中提供一线心理支持。为了在资源和连接性受限的条件下扩大和标准化这一模型,我们引入了一个基于代理AI的平台,该平台增强这些恢复士兵的专业能力。恢复士兵作为人类监督者,协调代理进行症状分诊、指导同伴支持干预、作战约束推理、训练和模拟以及在需要时的结构化文档记录以进行临床升级。AI代理在高风险环境中使用共识驱动的决策支持。该架构在断网和低连接性环境中运行,保持人类监督和伦理保障。与麦当劳美国陆军健康中心、纽波特新闻、美国退伍军人事务局合作开发了一个功能性原型。通过结合基于同伴的干预与共识驱动的代理AI决策支持,该框架旨在缩短响应时间,防止症状升级,减少不必要的撤离,并提高护理连续性。本工作展示了代理AI如何在恶劣环境中成为心理健康支持的增强力量,并识别了在国防和人道主义行动中进一步评估和部署的途径。

英文摘要

Modern military operations expose soldiers to sustained psychological stress, leading to acute reactions, post-traumatic stress symptoms, and other mental health issues. Although the U.S. Department of Defense offers evidence-based therapies, access to trained professionals in forward-deployed and contested environments is limited. As a result, soldiers with early-stage distress are often evacuated to rear medical facilities, delaying care, reducing readiness, and increasing long-term risks. This paper proposes a Train-the-Trainers framework in which soldiers who have completed therapy and returned to duty are trained as peer facilitators to provide first-line psychological support in operational settings. To scale and standardize this model under severe resource and connectivity constraints, we introduce an agentic AI-enabled platform that augments these recovered soldiers with specialized AI agents. The recovered soldier acts as a human supervisor, coordinating agents for symptom triage, guided peer-support interventions, operational constraint reasoning, training and simulation, and structured documentation for clinical escalation when needed. The AI agents use consensus-driven decision support in high-stakes environments. The architecture functions in air-gapped and low-connectivity settings, maintaining human oversight and ethical safeguards. A functional prototype was developed with the McDonald U.S. Army Health Center, Newport News, VA, USA. By combining peer-based intervention with consensus-driven agentic AI decision support, the framework seeks to cut response times, prevent symptom escalation, reduce unnecessary evacuations, and improve continuity of care. This work shows how agentic AI can serve as a force multiplier for mental health support in austere environments and identifies pathways for broader evaluation and deployment across defense and humanitarian operations.

2605.16268 2026-05-19 cs.HC cs.AI cs.LG 版本更新

Helping Customers in Distress: An LLM-powered Agent that Converses, Probes, and Routes

帮助陷入困境的客户:一个基于LLM的代理,能够对话、探测和分流

Alankar Atreya, Stefan Sylvius Wanger, Devesh Batra, Robert Hankache, Cristovao Iglesias, Patrick Sinclair, Giulio Pelosio, Michael McMillan, Greig A. Cowan, Raad Khraishi

发表机构 * GitHub

AI总结 本文提出一个基于LLM的AI分流代理,通过多轮对话和提问提高客户问题分类准确性,提升银行客户服务效率。

详情
AI中文摘要

银行每年收到数百万起欺诈、诈骗和争议交易报告,准确将客户引导至合适专业团队极具挑战性。现有的人工流程对客户和员工都缓慢且压力大。为此,我们开发了一个面向客户的AI分流代理,利用大型语言模型(LLMs)进行多轮对话、提问和分类,以实现准确的政策引导分流,嵌入客户旅程中。为了评估和持续改进代理,模拟了真实的客户数字孪生,生成基于历史数据的真实、标注对话,以测试各种现实场景。本文详细介绍了分流代理的建模方法、与政策、安全护栏和推理框架的整合、使用合成代理进行可扩展评估,以及AI系统在准确性、鲁棒性和合规性方面的发现。结果表明,代理成功提高了历史案例的分流效果,实现分类准确率提升30.6%,我们的领域专家报告了高水平的满意度,突显了针对性探测在大规模银行运营中的有效分流作用。

英文摘要

Banks receive millions of reports of fraud, scams, and disputed transactions every year, making it challenging to accurately direct customers to the appropriate specialist teams for assistance. The existing manual process driven by humans is slow and stressful for both customers and staff. To address this, we develop a customer-facing AI powered triaging agent that leverages large language models (LLMs) to conduct multi-turn conversations, ask relevant questions, and classify cases for accurate, policy-guided routing, making it embedded in the customer journey. To evaluate and continuously improve the agent, synthetic digital twins of real customers were simulated, generating realistic, labelled dialogues based on historical data to test a wide range of real-world scenarios. This work details the triage agent's modelling approach, integration with policy, safety guardrails and reasoning frameworks, the use of the synthetic agent for scalable evaluation, and findings on the AI system's accuracy, robustness, and compliance. Results show that the agent successfully improves triaging of historical cases, achieving a 30.6% increase in classification accuracy, with high satisfaction levels reported by our subject-matter experts, highlighting how targeted probing can lead to more effective triage in banking operations at scale.

2605.16265 2026-05-19 cs.AI cs.CR 版本更新

AgentWall: A Runtime Safety Layer for Local AI Agents

AgentWall:本地AI代理的运行时安全层

Ashwin Aravind

发表机构 * GitHub

AI总结 本文提出AgentWall,一种用于本地AI代理的运行时安全与可观测性层,通过拦截代理操作、评估政策、要求人类批准和记录执行轨迹,实现92.9%的政策执行准确率。

Comments 16 pages, 2 figures, open-source implementation at https://github.com/agentwall/Agentwall

详情
AI中文摘要

自主AI代理的安全性日益被认识到是关键的开放性问题。随着代理从被动文本生成器转变为能够执行shell命令、修改文件、调用API和浏览网络的活跃行为者,不安全或被对手操控的行为后果变得立即且具现实影响。现有AI安全工作主要集中在模型对齐和输入过滤,但这些方法无法解决当代理意图变为真实机器上的实际操作时的问题。此缺口在本地环境中尤为明显,因为开发者在自己的文件系统、凭证和基础设施上运行代理时缺乏运行时控制。本文介绍AgentWall,一种用于本地AI代理的运行时安全与可观测性层。AgentWall在代理操作到达主机环境之前拦截每一个提议的代理操作,将其与显式声明性政策进行评估,要求对敏感操作进行人工批准,并记录完整的执行轨迹以供审计和回放。它被实现为一个强制执行MCP代理和原生OpenClaw插件,可在Claude Desktop、Cursor、Windsurf、Claude Code和OpenClaw上通过单条安装命令运行。我们展示了AgentWall的设计、架构、威胁模型和政策模型,并在14个基准测试中实现了92.9%的政策执行准确率,亚毫秒级的开销。AgentWall在https://github.com/agentwall/Agentwall上开源。

英文摘要

The safety of autonomous AI agents is increasingly recognized as a critical open problem. As agents transition from passive text generators to active actors capable of executing shell commands, modifying files, calling APIs, and browsing the web, the consequences of unsafe or adversarially manipulated behavior become immediate and tangible. Existing AI safety work has focused primarily on model alignment and input filtering, but these approaches do not address what happens at the moment an agent's intent becomes a real action on a real machine. This gap is especially acute in local environments, where developers run agents against their own filesystems, credentials, and infrastructure with little runtime control. This paper introduces AgentWall, a runtime safety and observability layer for local AI agents. AgentWall intercepts every proposed agent action before it reaches the host environment, evaluates it against an explicit declarative policy, requires human approval for sensitive operations, and records a complete execution trail for audit and replay. It is implemented as a policy-enforcing MCP proxy and native OpenClaw plugin, working across Claude Desktop, Cursor, Windsurf, Claude Code, and OpenClaw with a single install command. We present the design, architecture, threat model, and policy model of AgentWall, and demonstrate 92.9% policy enforcement accuracy with sub-millisecond overhead across 14 benchmark tests. AgentWall is open-source at https://github.com/agentwall/Agentwall.

2605.16259 2026-05-19 cs.LG cs.AI cs.DC 版本更新

Systematic Optimization of Real-Time Diffusion Model Inference on Apple M3 Ultra

苹果M3 Ultra上的实时扩散模型推断系统优化

Yoichi Ochiai

发表机构 * University of Tsukuba, Faculty of Library, Information and Media Science(筑波大学图书馆、信息与媒体科学系)

AI总结 本文针对苹果M3 Ultra平台系统性优化扩散模型实时推断,通过多种技术结合实现22.7 FPS的实时图像转换,揭示了与NVIDIA GPU不同的优化特性。

详情
AI中文摘要

尽管扩散模型在NVIDIA GPU上的实时图像生成技术迅速发展,但针对非CUDA平台如苹果硅芯片的系统性优化研究极为有限。本文在苹果M3 Ultra(60核心GPU,512GB统一内存)上进行了10个阶段的全面优化实验,目标是实现实时摄像头img2img转换。我们探索了包括CoreML转换、量化、令牌合并、神经引擎利用、紧凑模型探索、帧内插、基于kNN搜索的合成、pix2pix-turbo、光学流帧跳过和知识蒸馏等多种技术,并定量评估了每种方法的有效性。最终,通过结合CoreML转换的蒸馏专用模型SDXS-512与三线程摄像头流水线,在512x512分辨率下实现了22.7 FPS的实时摄像头img2img转换。本文的主要贡献是系统地证明了在CUDA上建立的优化见解不一定适用于苹果硅芯片的统一内存架构。我们揭示了一个与NVIDIA GPU截然不同的优化景观,包括量化无提速、并行推断无效以及神经引擎不适合大规模模型等特性,并为苹果硅芯片上的扩散模型推断提供了实用指南。

英文摘要

While real-time image generation using diffusion models has advanced rapidly on NVIDIA GPUs, systematic optimization research on non-CUDA platforms such as Apple Silicon remains extremely limited. In this study, we conducted comprehensive optimization experiments across 10 phases targeting the Apple M3 Ultra (60-core GPU, 512 GB unified memory) with the goal of achieving real-time camera img2img transformation. We explored a wide range of techniques including CoreML conversion, quantization, Token Merging, Neural Engine utilization, compact model exploration, frame interpolation, kNN search-based synthesis, pix2pix-turbo, optical flow frame skipping, and knowledge distillation, quantitatively evaluating the effectiveness of each approach. Ultimately, by combining CoreML conversion of the distillation-specialized model SDXS-512 with a 3-thread camera pipeline, we achieved real-time camera img2img transformation at 22.7 FPS at 512x512 resolution. The primary contribution of this work is the systematic demonstration that optimization insights established for CUDA are not necessarily effective on Apple Silicon's unified memory architecture. We reveal an optimization landscape fundamentally different from that of NVIDIA GPUs -- including the absence of speedup from quantization, the ineffectiveness of parallel inference, and the unsuitability of the Neural Engine for large-scale models -- and provide practical guidelines for diffusion model inference on Apple Silicon.

2605.14624 2026-05-19 cs.LG cs.AI cs.NE 版本更新

An Amortized Efficiency Threshold for Comparing Neural and Heuristic Solvers in Combinatorial Optimization

比较神经求解器与启发式求解器在组合优化中的平均效率阈值

Sohaib Afifi

发表机构 * Univ. Artois(阿劳斯大学)

AI总结 本文研究了神经求解器在组合优化中的能耗问题,提出平均效率阈值框架,通过实验显示神经求解器在部署量超过阈值后能耗低于启发式方法,提供了新的评估方法。

Comments 13 pages, 3 figures, 1 table. Code and benchmark pipeline at https://github.com/sohaibafifi/aet. v1: initial release with CVRP n=50

详情
AI中文摘要

神经组合优化求解器常被批评其能耗高于CPU启发式方法,因其在GPU上训练的成本高。本文探讨了从

英文摘要

A common critique of neural combinatorial-optimization solvers is that they are less energy-efficient than CPU metaheuristics, given the operational energy cost of training them on GPUs. This paper examines the inferential step from "training is expensive" to "neural solvers are net-inefficient", which is where the critique actually goes wrong. Training the network costs a large fixed amount of GPU energy; running the metaheuristic costs a small amount of CPU energy on every instance, repeated as long as the solver is deployed. The two are not commensurable until a deployment volume is fixed. We define the Amortized Efficiency Threshold (AET) as the deployment volume above which a neural solver breaks even with a heuristic baseline in total energy or carbon, under an explicit constraint on solution quality. We show that the cumulative-energy ratio between the two solvers tends to a constant strictly below one whenever the network wins per instance, and that this limit does not depend on how the training cost was measured. An embodied-carbon term amortizes hardware fabrication symmetrically on both sides. We instantiate the framework on the CVRP environment at n=50 customers with the attention-based autoregressive solver of Kool et al. (2019), trained for 100 epochs on 20,000 instances over five random seeds, and HGS via PyVRP as the heuristic baseline. The measured operational crossover sits near 4.56e3 deployed instances at the median of a six-point baseline-budget sweep; the per-instance neural-to-heuristic ratio is 2.29e-3. The contribution is the framework, the open instrumentation, and the end-to-end measurement protocol. Code and benchmark pipeline are available at https://github.com/sohaibafifi/aet.

2605.14457 2026-05-19 cs.AI 版本更新

Stateful Reasoning via Insight Replay

通过洞察回放实现状态化推理

Bin Lei, Caiwen Ding, Jiachen Yang, Ang Li, Xin Eric Wang

发表机构 * University of Minnesota(明尼苏达大学) Simular AI

AI总结 本文提出InsightReplay方法,通过回放关键洞察保持推理过程中的重要信息可访问性,提升大语言模型在复杂任务中的表现。

详情
AI中文摘要

链式推理(CoT)已成为引导大语言模型进行多步骤推理的基础,但最近的研究表明,其收益并不随链长单调增加:虽然更长的CoT通常使模型能够解决更困难的问题,但在特定问题中,准确性通常在链长达到一定程度后会下降。我们发现这一现象的主要原因:随着CoT的增长,模型对推理轨迹中早期产生的关键洞察的注意力逐渐减弱,使这些洞察在最需要的时候变得越来越难以访问。因此,我们提出了InsightReplay,一种状态化推理方法,其中模型会定期从其推理轨迹中提取关键洞察并在活跃生成前沿附近回放它们,以保持这些洞察的可访问性。在包含模型规模{8B, 30B}、模型家族{Qwen3.5, DeepSeek-R1-Distill-Qwen, Gemma-4}和推理基准{AIME, HMMT, GPQA Diamond, LiveCodeBench v5}的$\mathbf{2}\!\times\!\mathbf{3}\!\times\!\mathbf{4}$基准网格上进行的大量实验表明,3轮InsightReplay在所有24种设置中均实现了准确性提升,平均比标准CoT提高了$\mathbf{+1.65}$点,其中在R1-Distill-32B的LiveCodeBench v5子集上最大的单设置提升为$\mathbf{+9.2}$点。我们的结果表明,测试时扩展的有效性不仅取决于模型推理的程度,还取决于关键中间洞察在整个长推理轨迹中是否保持可访问性。

英文摘要

Chain-of-Thought (CoT) reasoning has become a foundation for eliciting multi-step reasoning in large language models, but recent studies show that its benefits do not scale monotonically with chain length: while longer CoT generally enables a model to tackle harder problems, on a given problem, accuracy typically increases with CoT length up to a point, after which it declines. We identify a major cause of this phenomenon: as the CoT grows, the model's attention to critical insights produced earlier in the trace gradually weakens, making those insights progressively less accessible when they are most needed. Therefore, we propose \textbf{InsightReplay}, a stateful reasoning approach in which the model periodically extracts critical insights from its reasoning trace and replays them near the active generation frontier, keeping them accessible as the reasoning scales. Extensive experiments on a $\mathbf{2}\!\times\!\mathbf{3}\!\times\!\mathbf{4}$ benchmark grid, covering model scales $\{\text{8B}, \text{30B}\}$, model families $\{\text{Qwen3.5}, \text{DeepSeek-R1-Distill-Qwen}, \text{Gemma-4}\}$, and reasoning benchmarks $\{\text{AIME}, \text{HMMT}, \text{GPQA Diamond}, \text{LiveCodeBench v5}\}$, show that 3-round InsightReplay yields accuracy gains across \textbf{all 24 settings}, with an averaged improvement of $\mathbf{+1.65}$ points over standard CoT, and a largest single-setting gain of $\mathbf{+9.2}$ points on R1-Distill-32B's LiveCodeBench v5 subset. Our results suggest that the effectiveness of test-time scaling depends not only on how much a model reasons, but also on whether critical intermediate insights remain accessible throughout long reasoning trajectories.

2605.09826 2026-05-19 cs.AI cs.MA 版本更新

EnactToM: An Evolving Benchmark for Functional Theory of Mind in Embodied Agents

EnactToM:一种用于具身智能体功能理论之心的演进基准

Gurusha Juneja, Dylan Lu, Saaket Agashe, Parth Diwane, Edward Gunn, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Yali Du, Xin Eric Wang

发表机构 * University of California, Santa Barbara(加州大学圣巴巴拉分校) King’s College London(伦敦国王学院) Cisco Research(思科研究)

AI总结 本文提出EnactToM基准,用于测试具身智能体的功能理论之心能力,通过300个3D家庭环境任务验证解决能力和认知深度,揭示了现有模型在隐含信念上的不足。

详情
AI中文摘要

理论之心(ToM)是跟踪他人认知状态的能力,使人类成为高效的协作者。AI代理在多智能体设置中需要相同的能力,但现有基准大多通过直接询问信念来测试字面意义的ToM。能够基于隐含信念在具身环境中最优行动的能力,称为功能ToM,仍鲜有测试。我们引入EnactToM,一个包含300个具身多智能体任务的演进基准,设置在3D家庭环境中,具有部分可观测性、私人信息和受限通信。每个任务均正式验证可解性和所需认知深度,新任务随模型进步而增加难度。在硬划分上,所有七个评估的前沿模型在功能任务完成上得分为0.0%,而在字面信念探测上平均得分为45.0%。人工分析显示93%的样本失败归因于认知协调破裂,如信息 withheld、忽略伙伴约束和消息分配不当,为未来工作提供了具体目标。

英文摘要

Theory of Mind (ToM), the ability to track others epistemic state, makes humans efficient collaborators. AI agents need the same capacity in multi agent settings, yet existing benchmarks mostly test literal ToM by asking direct belief questions. The ability act optimally on implicit beliefs in embodied environments, called functional ToM, remains largely untested. We introduce EnactToM, an evolving benchmark of 300 embodied multi-agent tasks set in a 3D household with partial observability, private information, and constrained communication. Each task is formally verified for solvability and required epistemic depth, and new tasks are generated increase difficulty as models improve. On the hard split, all seven evaluated frontier models score 0.0% Pass^3 on functional task completion, while averaging 45.0% on literal belief probes. Manual analysis traces 93% of sampled failures to epistemic coordination breakdowns such as withheld information, ignored partner constraints, and misallocated messages, providing a concrete target for future work.

2605.08475 2026-05-19 cs.LG cs.AI cs.NA math.NA math.OC 版本更新

Transformers Can Implement Preconditioned Richardson Iteration for In-Context Gaussian Kernel Regression

Transformer 可实现用于上下文高斯核回归的预条件Richardson迭代

Mingsong Yan, Dongyang Li, Charles Kulick, Sui Tang

发表机构 * Department of Mathematics, University of California, Santa Barbara, CA(加州大学圣芭芭拉分校数学系)

AI总结 本文研究了上下文核岭回归,证明标准softmax注意力transformer可通过预条件Richardson迭代近似高斯核回归预测器,展示了transformer架构中的功能分解。

详情
AI中文摘要

对上下文学习(ICL)的机制解释已识别出用于线性回归及相关线性预测任务的迭代算法,通常使用线性或ReLU注意力变体。对于非线性ICL,先前工作将softmax和核化注意力与功能梯度型动态相关联,但尚不清楚标准softmax注意力transformer能否实现具有端到端预测误差保证的收敛求解器。本文研究了具有高斯核的上下文核岭回归(KRR),并证明标准softmax-注意力transformer在正向传递过程中可通过在关联的核线性系统上实现预条件Richardson迭代来近似KRR预测器。在数据有界假设下,我们构建了一个具有O(log(1/ε))个块和MLP宽度O(√(N/ε))的单头transformer,实现了对长度为N的提示的ε精度预测。我们的构造揭示了transformer架构中的功能分解:softmax注意力产生用于跨token交互的行归一化高斯核算子,而ReLU MLP层局部近似所需的intra-token标量算术。通过训练GPT-2风格的transformer进行高斯过程回归任务进一步测试预条件Richardson解释。通过线性探测,我们比较transformer的层间预测与经典KRR求解器的逐步输出,发现其误差谱与预条件Richardson迭代最一致。消融研究进一步支持这一解释。共同,我们的理论和实验识别出预条件Richardson迭代作为softmax-注意力transformer实现非线性上下文高斯核回归的明确机制。

英文摘要

Mechanistic accounts of in-context learning (ICL) have identified iterative algorithms for linear regression and related linear prediction tasks, often using linear or ReLU attention variants. For nonlinear ICL, prior work has related softmax and kernelized attention to functional-gradient-type dynamics, but it remains unclear whether a standard transformer with softmax attention can implement a convergent solver with an end-to-end prediction-error guarantee. In this paper, we study in-context kernel ridge regression (KRR) with Gaussian kernels and show that a standard softmax-attention transformer can approximate the KRR predictor during its forward pass by implementing preconditioned Richardson iteration on the associated kernel linear system. Under bounded-data assumptions, we construct a single-head transformer with $O(\log(1/ε))$ blocks and MLP width $O(\sqrt{N/ε})$ that achieves $ε$-accurate prediction for prompts of length $N$. Our construction reveals a functional decomposition within the transformer architecture: softmax attention produces a row-normalized Gaussian-kernel operator needed for cross-token interactions, while ReLU MLP layers act locally to approximate the intra-token scalar arithmetic required by the update. Empirically, we train GPT-2-style transformers on Gaussian-process regression tasks to further test the preconditioned Richardson interpretation. Through linear probing, we compare the transformer's layer-wise predictions with the step-wise outputs of classical KRR solvers and find that its error profiles align most consistently with preconditioned Richardson iteration. Ablation studies further support this interpretation. Together, our theory and experiments identify preconditioned Richardson iteration as a concrete mechanism that softmax-attention transformers can realize for nonlinear in-context Gaussian-kernel regression.

2604.28010 2026-05-19 cs.LG cs.AI 版本更新

Learning from Disagreement: Clinician Overrides as Implicit Preference Signals for Clinical AI in Value-Based Care

从分歧中学习:临床医生的覆盖作为价值医疗中临床AI的隐含偏好信号

Prabhjot Singh, Abhishek Gupta, Chris Betz, Abe Flansburg, Brett Ives, Sudeep Lama, Jung Hoon Son

发表机构 * Altitude

AI总结 本文提出了一种框架,将临床医生对AI建议的覆盖视为隐含偏好数据,通过引入五类覆盖分类法和双学习架构,解决抑制偏差问题,以提升价值医疗中AI的决策能力。

Comments 22 pages, 2 tables, 1 figure

详情
AI中文摘要

我们重新将临床医生对临床AI建议的覆盖视为隐含偏好数据——与强化学习从人类反馈(RLHF)中利用的相同信号结构,但更丰富:标注者是领域专家,替代方案具有实际后果,下游结果是可观察的。我们提出了一个扩展标准偏好学习的正式框架,包含三个贡献:五类覆盖分类法,将覆盖类型映射到不同的模型更新目标;一个基于患者状态s、组织背景c和临床医生能力κ的偏好公式,其中κ分解为执行能力κ-exec和对齐能力κ-align;以及一个双学习架构,通过交替优化联合训练奖励模型和能力模型,防止我们称为抑制偏差的系统性问题——当临床医生能力低于执行阈值时,系统性地压制正确但困难的建议。我们论证,在基于结果的支付合同下慢性病管理产生具有独特有利属性的覆盖数据——纵向密度、集中决策空间、结果标签和自然能力变化,并认为结合纵向结果测量与对齐的财务激励的训练环境是学习与患者轨迹而非就诊经济相一致的奖励模型的必要条件。此框架源于改进价值医疗部署中临床医生能力的运营工作。

英文摘要

We reframe clinician overrides of clinical AI recommendations as implicit preference data - the same signal structure exploited by reinforcement learning from human feedback (RLHF), but richer: the annotator is a domain expert, the alternatives carry real consequences, and downstream outcomes are observable. We present a formal framework extending standard preference learning with three contributions: a five-category override taxonomy mapping override types to distinct model update targets; a preference formulation conditioned on patient state s, organizational context c, and clinician capability kappa, where kappa decomposes into execution capability kappa-exec and alignment capability kappa-align; and a dual learning architecture that jointly trains a reward model and a capability model via alternating optimization, preventing a failure mode we term suppression bias-the systematic suppression of correct-but-difficult recommendations when clinician capability falls below the execution threshold. We argue that chronic disease management under outcome-based payment contracts produces override data with uniquely favorable properties-longitudinal density, concentrated decision space, outcome labels, and natural capability variation-and that training environments combining longitudinal outcome measurement with aligned financial incentives are a necessary condition for learning a reward model aligned with patient trajectory rather than with encounter economics. This framework emerged from operational work to improve clinician capability in a live value-based care deployment.

2604.16400 2026-05-19 cs.DC cs.AI cs.LG 版本更新

CoLLM: Continuous Adaptation for SLO-Aware LLM Serving on Shared GPU Clusters

CoLLM:面向共享GPU集群的SLO感知LLM服务连续适应

Shaoyuan Huang, Yunfeng Zhao, Na Yan, Tiancheng Zhang, Xiaokai Wang, Xiaofei Wang, Wenyu Wang, Yansha Deng

发表机构 * Tianjin University(天津大学) King's College London(伦敦大学国王学院) Paiou Cloud Computing (Shanghai) Company, Ltd.(上海帕优云计算有限公司)

AI总结 CoLLM通过统一联邦参数高效微调与推理,实现LLM服务在共享GPU集群中的连续适应,提升模型质量和效率,实验显示其在吞吐量上表现优异。

详情
AI中文摘要

随着大型语言模型(LLM)在边缘智能中被越来越多地用于驱动领域特定应用和个性化服务,LLM训练后的质量与效率,包括微调和推理,因资源受限而变得至关重要。尽管最近在联邦参数高效微调(FL PEFT)和低延迟推理方面的进展提高了单个任务性能,但微调和推理仍被视为孤立的工作负载,忽略了它们的相互依赖性,导致冗余部署和推理质量提升延迟。为了解决这些限制,我们引入了一个新的共执行框架,并将其实例化为CoLLM,一个将FL PEFT和推理统一在共享边缘副本和模型参数上的系统。CoLLM通过在副本和集群层面解决关键挑战,实现了高效模型参数重用和工作负载平衡,从而联合优化长期模型质量增益和短期推理效率。在多样化的LLM和真实世界跟踪上进行的广泛评估显示,CoLLM在吞吐量上比最先进的LLM系统高出多达3倍,证明了其在边缘智能中无缝LLM训练后处理的有效性。

英文摘要

As Large Language Models (LLMs) are increasingly adopted in edge intelligence to power domain-specific applications and personalized services, the quality and efficiency of the LLM post-training phase-including fine-tuning and inference, have become critical due to constrained resources. Although recent advances in federated parameter-efficient fine-tuning (FL PEFT) and low-latency inference have improved individual task performance, fine-tuning and inference are still handled as isolated workloads, which overlooks their interdependence and results in redundant deployments and delayed improvement in inference quality. To address these limitations, we introduce a new co-execution framework and instantiate it with CoLLM, a system that unifies FL PEFT and inference on shared edge replicas and model parameters. CoLLM addresses key challenges at both replica and cluster levels through: (1) an intra-replica model sharing mechanism that enables real-time model parameter reuse via unmerged inference and shadow adapter strategies; and (2) a two-timescale inter-replica coordination algorithm that adaptively balances fine-tuning and inference workloads to jointly optimize long-term model quality gains and short-term inference efficiency. Extensive evaluation across diverse LLMs and real-world traces show that CoLLM consistently outperforms state-of-the-art LLM systems, achieving up to 3x higher goodput, demonstrating its effectiveness in enabling seamless LLM post-training for edge intelligence.

2604.02178 2026-05-19 cs.CL cs.AI cs.LG 版本更新

The Expert Strikes Back: Interpreting Mixture-of-Experts Language Models at Expert Level

专家反击:在专家层面解读混合专家语言模型

Jeremy Herbst, Stefan Wermter, Jae Hee Lee

发表机构 * Department of Informatics, University of Hamburg, Hamburg, Germany(汉堡大学信息学院)

AI总结 研究通过k稀疏探测比较MoE专家与密集FFN,发现专家神经元更单语义,提出以专家为分析单位,揭示专家是细粒度任务专家,而非领域专家或token处理者。

Comments 8 pages, 7 Figures. Accepted at ICML 2026. Improved writing, changed author order, updated citations

详情
AI中文摘要

混合专家(MoE)架构已成为扩展大语言模型(LLMs)的主导选择,每个token仅激活部分参数。尽管MoE主要用于计算效率,但其稀疏性是否使其比密集前馈网络(FFN)更容易解释仍存疑问。通过k稀疏探测比较MoE专家与密集FFN,发现专家神经元始终更单语义,随着路由稀疏性增加,差距扩大。这表明稀疏性迫使神经元和整个专家朝单语义方向发展。基于此发现,我们从神经元层面转向专家层面作为更有效的分析单位。通过自动解读数百个专家,验证了这一方法。此分析解决了关于专业化争论:专家既非广领域专家(如生物学)也非简单token处理者。相反,它们作为细粒度任务专家,专门处理语言操作或语义任务(如闭合LaTeX括号)。我们的发现表明,MoE在专家层面具有内在可解释性,为大规模模型可解释性提供了更清晰路径。代码见:https://github.com/jerryy33/MoE_analysis。

英文摘要

Mixture-of-Experts (MoE) architectures have become the dominant choice for scaling Large Language Models (LLMs), activating only a subset of parameters per token. While MoE architectures are primarily adopted for computational efficiency, it remains an open question whether their sparsity makes them inherently easier to interpret than dense feed-forward networks (FFNs). We compare MoE experts and dense FFNs using $k$-sparse probing and find that expert neurons are consistently less polysemantic, with the gap widening as routing becomes sparser. This suggests that sparsity pressures both individual neurons and entire experts toward monosemanticity. Leveraging this finding, we zoom out from the neuron to the expert level as a more effective unit of analysis. We validate this approach by automatically interpreting hundreds of experts. This analysis allows us to resolve the debate on specialization: experts are neither broad domain specialists (e.g., biology) nor simple token-level processors. Instead, they function as fine-grained task experts, specializing in linguistic operations or semantic tasks (e.g., closing brackets in $\LaTeX{}$). Our findings suggest that MoEs are inherently interpretable at the expert level, providing a clearer path toward large-scale model interpretability. Code is available at: https://github.com/jerryy33/MoE_analysis.

2603.20562 2026-05-19 cs.CL cs.AI 版本更新

Permutation-Consensus Listwise Judging for Robust Factuality Evaluation

排列一致性列表判断用于鲁棒事实性评估

Tianyi Huang, Nathan Huang, Justin Tang, Wenqian Chen, Elsa Fan

发表机构 * App-In Club(App-In俱乐部) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出PCFJudge方法,通过多排列重跑列表事实性提示以提高LLM事实性判断的鲁棒性,实验显示其在RewardBench 2 Factuality上显著提升准确率。

Comments Accepted at the Fifth Workshop on Natural Language Generation, Evaluation, and Metrics at ACL 2026

详情
AI中文摘要

大型语言模型(LLMs)如今广泛用作评判者,但其决定可能受无关的呈现选择影响。我们研究了列表事实性评估中的候选顺序敏感性问题,其中多个答案可能看似同样精良但实际存在显著的幻觉风险差异。我们引入PCFJudge,一种推理时方法,通过多次重跑相同事实性优先的列表提示并聚合结果,生成一致的决策。在RewardBench 2 Factuality上,最终七排列聚合(K=7)使GPT-5.4的Top-1选择准确率从86.00%提升至91.33%,Claude Sonnet 4.6的准确率从86.33%提升至89.67%。这些结果表明候选顺序可能是事实性评判中的重要误差来源,消除这种干扰可提高LLM评估的可靠性。

英文摘要

Large language models (LLMs) are now widely used as judges, yet their decisions can change under presentation choices that should be irrelevant. We study one such source of instability: candidate-order sensitivity in listwise factuality evaluation, where several answers can look similarly polished while differing substantially in hallucination risk. We introduce PCFJudge, an inference-time method that reruns the same factuality-first listwise prompt over multiple orderings of the same candidate set and aggregates the resulting scores, ranks, and uncertainty signals into a single consensus decision. On RewardBench 2 Factuality, the final seven-permutation aggregate (K=7) improves top-1 selection accuracy from 86.00% to 91.33% with GPT-5.4 and from 86.33% to 89.67% with Claude Sonnet 4.6. These results suggest that candidate order can be a meaningful source of factuality-judging error and that marginalizing over this nuisance variation can improve the reliability of LLM evaluation.

2603.19470 2026-05-19 cs.LG cs.AI 版本更新

Adaptive Layerwise Perturbation: Unifying Off-Policy Corrections for LLM RL

自适应分层扰动:统一LLM RL中的非策略修正

Chenlu Ye, Xuanchang Zhang, Yifan Hao, Zhou Yu, Ziji Zhang, Abhinav Gullapalli, Hao Chen, Jing Huang, Tong Zhang

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Amazon(亚马逊)

AI总结 本文提出自适应分层扰动(ALP),通过在更新过程中向每一层的输入隐藏状态注入可控噪声,缓解策略退化和训练-推理不匹配问题,提升训练稳定性与探索能力。

详情
AI中文摘要

非策略问题如策略老化和训练-推理不匹配已成为LLM RL训练稳定性及进一步探索的主要瓶颈。由于增强推理效率的技术,推理策略与更新策略的分布差距扩大,导致重要性比率呈重尾分布。当策略局部尖锐时,重尾比率出现,进一步放大梯度并可能使更新超出信任区域。为解决此问题,我们提出自适应分层扰动(ALP),在更新过程中向每一层的输入隐藏状态注入小的可学习扰动,并将由此产生的扰动策略作为重要性比率的分子,与未改变的推理策略进行比较。直观上,通过向中间表示添加受控噪声,ALP防止更新策略过于偏离推理策略,并扩大策略家族以覆盖推理时的不匹配噪声。因此,扁平化的分布可自然缩小更新策略与推理策略之间的差距,并减少重要性比率的尾部,从而维持训练稳定性。这通过实验证实。在单轮数学和多轮工具集成推理任务中的实验表明,ALP不仅提高了最终性能,还避免了重要性比率尾部的爆炸和KL尖峰,同时提升了探索能力。消融实验显示,跨所有层的表示级扰动效果最佳,显著优于部分层和logits-only变体。

英文摘要

Off-policy problems such as policy staleness and training--inference mismatch have become a major bottleneck for training stability and further exploration in LLM RL. The distribution gap between the inference and updated policies grows because of the techniques to enhance inference efficiency, leading to heavy-tailed importance ratios. Heavy-tailed ratios arise when the policy is locally sharp, which further inflates gradients and can push updates outside the trust region. To address this, we propose Adaptive Layerwise Perturbation (ALP), which injects small learnable perturbations into the input hidden states of each layer during updates and uses the resulting perturbed policy as the numerator of the importance ratio against the unchanged inference policy in the objective. Intuitively, by adding controlled noise to intermediate representations, ALP prevents the updated policy from deviating too sharply from the inference policy and enlarges the policy family to cover inference-time mismatch noise. Hence, the flattened distribution can naturally tighten the gap between the updated and inference policies and reduce the tail of importance ratios, thus maintaining training stability. This is further validated empirically. Experiments on single-turn math and multi-turn tool-integrated reasoning tasks show that ALP not only improves final performance, but also avoids blow-up in the importance-ratio tail and KL spikes during iterative training, along with boosted exploration. Ablations show that representation-level perturbations across all layers are most effective, substantially outperforming partial-layer and logits-only variants.

2603.02512 2026-05-19 cs.ET cs.AI cs.SE 版本更新

Human-Certified Module Repositories for the AI Age

面向AI时代的可信模块仓库

Szilárd Enyedi

发表机构 * Automation Department, Technical University of Cluj-Napoca, Romania(克卢日-纳波卡技术大学自动化系,罗马尼亚)

AI总结 本文提出HCMR框架,通过人类监督与自动化分析结合,确保模块的可信度,为AI辅助开发提供安全可预测的模块组装方法。

Comments v4: acknowledged AI use for grammar and readability, added IEEE copyright notice; v3: 13 pages, new subsection about strong typed languages forcing AI to create more reliable code; v2: 12 pages, improved references; v1: 11 pages, 3 figures, 2 tables, prepared for AQTR 2026

详情
AI中文摘要

本文介绍了人类认证模块仓库(HCMRs)作为构建可信软件的新建筑模型。随着大型语言模型在代码生成、配置合成和多组件集成中的参与增加,AI组装系统的可靠性依赖于其使用的构建块的可信度。当前软件供应链事件和模块化开发生态系统凸显了依赖于来源不明、审查不足或行为不可预测的组件的风险。我们主张未来AI驱动的开发流程需要经过 curated、安全审查、来源丰富的可重用模块仓库。为此,我们提出了HCMRs框架,结合人类监督与自动化分析来认证模块,并通过人类和AI代理支持安全、可预测的组装。我们提出了HCMRs的参考架构,概述了认证和来源工作流程,分析了与模块生态系统相关的威胁面,并从最近的失败中提取了教训。我们进一步讨论了治理、可扩展性和AI问责制的含义,将HCMRs定位为可靠和可审计的AI构建软件系统的基础子系统。

英文摘要

Human-Certified Module Repositories (HCMRs) are introduced in this work as a new architectural model for constructing trustworthy software in the era of AI-assisted development. As large language models increasingly participate in code generation, configuration synthesis, and multi-component integration, the reliability of AI-assembled systems will depend critically on the trustworthiness of the building blocks they use. Today's software supply-chain incidents and modular development ecosystems highlight the risks of relying on components with unclear provenance, insufficient review, or unpredictable composition behavior. We argue that future AI-driven development workflows require repositories of reusable modules that are curated, security-reviewed, provenance-rich, and equipped with explicit interface contracts. To this end, we propose HCMRs, a framework that blends human oversight with automated analysis to certify modules and support safe, predictable assembly by both humans and AI agents. We present a reference architecture for HCMRs, outline a certification and provenance workflow, analyze threat surfaces relevant to modular ecosystems, and extract lessons from recent failures. We further discuss implications for governance, scalability, and AI accountability, positioning HCMRs as a foundational substrate for reliable and auditable AI-constructed software systems.

2603.02218 2026-05-19 cs.LG cs.AI cs.CL cs.IT math.IT 版本更新

Self-Play Only Evolves When Self-Synthetic Pipeline Ensures Learnable Information Gain

仅在自我合成管道确保可学习信息增益时,自我博弈才会进化

Wei Liu, Siya Qi, Yali Du, Yulan He

发表机构 * King's College London(伦敦国王学院) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 本文通过实验揭示可持续自我进化需要可学习信息递增的自我合成数据管道,提出自我进化LLM的三重角色及系统设计,解决自我博弈停滞问题。

Comments 10 pages, 6 figures, 7 formulas, accepted by ICML 2026 position paper track

详情
AI中文摘要

大型语言模型(LLMs)使构建通过自我进化循环改进的系统成为可能,但许多现有方案更倾向于自我博弈且易陷入停滞。核心失败模式是循环生成更多数据但未增加下一轮的可学习信息。通过自我博弈编程任务实验,我们发现可持续自我进化需要具有可学习信息递增的自我合成数据管道。我们识别出自我进化LLM的三重角色:生成任务的Proposer、尝试解决方案的Solver以及提供训练信号的Verifier,并提出三种系统设计共同针对三重角色视角下的可学习信息增益。不对称共进化在角色间形成弱到强到弱的循环。容量增长扩展参数和推理时间预算以匹配上升的可学习信息。主动信息寻求引入外部上下文和新任务来源以防止饱和。这些模块共同提供从脆弱自我博弈动态到持续自我进化的可测量系统级路径。

英文摘要

Large language models (LLMs) make it plausible to build systems that improve through self-evolving loops, but many existing proposals are better understood as self-play and often plateau quickly. A central failure mode is that the loop synthesises more data without increasing learnable information for the next iteration. Through experiments on a self-play coding task, we reveal that sustainable self-evolution requires a self-synthesised data pipeline with learnable information that increases across iterations. We identify triadic roles that self-evolving LLMs play: the Proposer, which generates tasks; the Solver, which attempts solutions; and the Verifier, which provides training signals, and we identify three system designs that jointly target learnable information gain from this triadic roles perspective. Asymmetric co-evolution closes a weak-to-strong-to-weak loop across roles. Capacity growth expands parameter and inference-time budgets to match rising learnable information. Proactive information seeking introduces external context and new task sources that prevent saturation. Together, these modules provide a measurable, system-level path from brittle self-play dynamics to sustained self-evolution.

2603.01683 2026-05-19 cs.CL cs.AI 版本更新

Surgical Post-Training: Proximal On-Policy Distillation for Reasoning with Knowledge Retention

手术式后训练:用于知识保留的近端策略蒸馏

Wenye Lin, Kai Han

发表机构 * The University of Hong Kong(香港大学)

AI总结 本文提出SPOT框架,通过近端策略蒸馏在后训练中保留知识,提升推理性能,实验表明其在少量数据下显著提升模型准确率。

Comments 21 pages

详情
AI中文摘要

通过后训练向大语言模型注入新的推理知识时,常常导致灾难性遗忘。最近的研究强调了on-policy数据的重要性,但KL散度未能缓解遗忘。本文通过分析和实验表明,KL约束的奖励公式在后训练中保留知识起关键作用。本文提出手术式后训练(SPOT),一种近端策略蒸馏框架,旨在高效优化推理同时保留先前知识。SPOT包含(1)使用Oracle的数据显示校正流水线,通过最小编辑修正错误步骤,生成近端on-policy数据;(2)基于奖励的二元交叉熵目标,用于增强推理和缓解遗忘。实验表明,仅使用4k校正数学对,SPOT在域内和域外任务上平均提升Qwen3-8B的准确率6.2%,仅需16分钟模型训练(8x H800 GPU)。此外,SPOT为后续强化学习提供更优初始化,显著提升性能上限。代码:https://github.com/Visual-AI/SPoT

英文摘要

Injecting new reasoning knowledge into Large Language Models (LLMs) via post-training often induces catastrophic forgetting. Recent studies emphasize the importance of on-policy data but suggest that KL-divergence fails to mitigate forgetting. In contrast, we show, both analytically and empirically, that the KL-constrained reward formulation actually plays a critical role in retaining knowledge during post-training. This motivates our Surgical Post-Training (SPOT), a proximal on-policy distillation framework designed to optimize reasoning efficiently while preserving prior knowledge. SPOT consists of (1) a data rectification pipeline employing an Oracle to surgically correct erroneous steps via minimal edits, generating proximal on-policy data; and (2) a reward-based binary cross-entropy objective essential for enhancing reasoning and mitigating forgetting. Empirically, with only 4k rectified math pairs, SPOT improves Qwen3-8B's accuracy by 6.2% on average across in-domain and out-of-domain tasks, requiring merely 16-minute model training on 8x H800 GPUs. Moreover, SPOT provides a superior initialization for subsequent reinforcement learning, significantly elevating the performance ceiling. Code: https://github.com/Visual-AI/SPoT

2602.17831 2026-05-19 cs.AI 版本更新

The Token Games: Evaluating Language Model Reasoning with Puzzle Duels

令牌游戏:通过谜题对决评估语言模型推理能力

Simon Henniger, Gabriel Poesia

发表机构 * Harvard University(哈佛大学)

AI总结 本文设计了令牌游戏(TTG)评估框架,通过模型自创谜题进行对决,利用Elo评分比较模型能力,验证了10个前沿模型的推理能力,且无需人工参与谜题创作。

Comments Project website: https://token-games.ai/

详情
AI中文摘要

本文设计了令牌游戏(TTG)评估框架,通过模型自创谜题进行对决,利用Elo评分比较模型能力,验证了10个前沿模型的推理能力,且无需人工参与谜题创作。

英文摘要

Evaluating the reasoning capabilities of Large Language Models is increasingly challenging as models improve. Human curation of hard questions is highly expensive, especially in recent benchmarks using PhD-level domain knowledge to challenge the most capable models. Even then, there is always a concern about whether these questions test genuine reasoning or if similar problems have been seen during training. Here, we take inspiration from 16th-century mathematical duels to design The Token Games (TTG): an evaluation framework where models challenge each other by creating their own puzzles. We leverage the format of Programming Puzzles - given a function that returns a boolean, find inputs that make it return True - to flexibly represent problems and enable verifying solutions. Using results from pairwise duels, we then compute Elo ratings, allowing us to compare models relative to each other. We evaluate 10 frontier models on TTG, and closely match the ranking from existing benchmarks such as Humanity's Last Exam, spending less than $200 USD and without involving any human effort in creating puzzles. We also find that creating good puzzles is still a highly challenging task for current models. Overall, our work suggests new paradigms for evaluating reasoning that avoid saturation by design, and that allow testing models for other skills like creativity and task creation alongside problem solving.

2602.16699 2026-05-19 cs.CL cs.AI 版本更新

Calibrate-Then-Act: Cost-Aware Exploration in LLM Agents

校准后再行动:在大语言模型代理中考虑成本的探索

Wenxuan Ding, Nicholas Tomlin, Greg Durrett

发表机构 * New York University(纽约大学)

AI总结 本文提出Calibrate-Then-Act框架,使LLM代理在不确定环境下显式平衡成本与不确定性,从而更优地决策。

详情
AI中文摘要

大语言模型代理被部署在需要交互以获取信息的环境中。在这些场景中,代理必须权衡行动中的内在成本不确定性,例如何时停止探索并提交答案。例如,在编程任务中,代理可能运行生成的代码,或为该代码片段生成测试;编写和运行测试的成本非零,但通常低于运行有缺陷代码的成本。本文表明,可以通过诱导LLM代理显式权衡这些成本-不确定性权衡,使代理在环境中表现更优。我们正式化了多个任务,包括检索增强的问答和文件阅读编码任务,作为在不确定性下的连续决策问题。每个问题都有潜在的环境状态影响代理性能。我们引入了名为Calibrate-Then-Act(CTA)的框架,通过将代理传递推断出的环境状态先验信息,使其能够更优地行动。此信息在定性上改变了代理行为,并向代理添加了非标准RL训练所学的环境敏感性。在合成任务、问答和文件阅读上的结果表明,通过CTA显式进行成本-收益权衡有助于代理发现更优的决策策略。

英文摘要

LLM agents are deployed in environments where they must interact to acquire information. In these scenarios, the agent must reason about inherent cost-uncertainty tradeoffs in how to act, such as when to stop exploring and commit to an answer. For instance, on a programming task, an agent might run the code it generates, or it might generate tests for that code snippet; the cost of writing and running a test is nonzero, but typically lower than the cost of running buggy code. In this work, we show that we can induce LLM agents to explicitly reason about balancing these cost-uncertainty tradeoffs, then act more optimally in their environments. We formalize multiple tasks, including retrieval-augmented QA and a file reading coding task, as sequential decision-making problems under uncertainty. Each problem has latent environment state that impacts the agent's performance. We introduce a framework called Calibrate-Then-Act (CTA), where we pass the agent an inferred prior about this environment state to enable it to act more optimally. This information qualitatively changes agent behavior, and adds environment sensitivity to the agent which is not learned via standard RL training. Our results on a synthetic task, QA, and file reading show that making cost-benefit tradeoffs explicit with CTA helps agents discover more optimal decision-making strategies.

2601.21941 2026-05-19 cs.LG cs.AI 版本更新

Robust Multimodal Representation Learning in Healthcare

医疗领域鲁棒多模态表征学习

Xiaoguang Zhu, Linxiao Gong, Lianlong Sun, Yang Liu, Haoyu Wang, Jing Liu

发表机构 * University of California, Davis(加州大学戴维斯分校) HKUST (GZ)(香港科技大学) University of Rochester(罗切斯特大学) Tongji University(同济大学) Georgia Institute of Technology(佐治亚理工学院) Fudan University(复旦大学) The University of British Columbia(不列颠哥伦比亚大学)

AI总结 本文提出双流特征去相关框架,通过结构因果分析处理医疗多模态数据中的系统性偏差,提升模型泛化能力,实验验证在MIMIC-IV、eICU和ADNI数据集上的性能提升。

详情
AI中文摘要

医疗多模态表征学习旨在将异构数据整合为统一的患者表示以支持临床结果预测。然而,真实世界医疗数据集通常包含来自多个来源的系统性偏差,这对医疗多模态表征学习提出了重大挑战。现有方法通常专注于有效的多模态融合,忽视了影响泛化能力的固有偏见特征。为解决这些挑战,我们提出了一种双流特征去相关框架,通过引入由潜在混杂因素引入的结构因果分析来识别和处理偏见。我们的方法采用因果偏见去相关框架,结合双流神经网络,将因果特征与虚假相关性分离,利用广义交叉熵损失和互信息最小化实现有效去相关。该框架模型无关,可集成到现有医疗多模态学习方法中。在MIMIC-IV、eICU和ADNI数据集上的全面实验显示了一致的性能提升。

英文摘要

Medical multimodal representation learning aims to integrate heterogeneous data into unified patient representations to support clinical outcome prediction. However, real-world medical datasets commonly contain systematic biases from multiple sources, which poses significant challenges for medical multimodal representation learning. Existing approaches typically focus on effective multimodal fusion, neglecting inherent biased features that affect the generalization ability. To address these challenges, we propose a Dual-Stream Feature Decorrelation Framework that identifies and handles the biases through structural causal analysis introduced by latent confounders. Our method employs a causal-biased decorrelation framework with dual-stream neural networks to disentangle causal features from spurious correlations, utilizing generalized cross-entropy loss and mutual information minimization for effective decorrelation. The framework is model-agnostic and can be integrated into existing medical multimodal learning methods. Comprehensive experiments on MIMIC-IV, eICU, and ADNI datasets demonstrate consistent performance improvements.

2512.03280 2026-05-19 cs.LG cs.AI 版本更新

BlendedNet++: A dataset and benchmark for field-resolved aerodynamics and inverse design of blended wing body aircraft

BlendedNet++:一种用于混合翼身融合飞机场解气动学和逆设计的数据集和基准

Nicholas Sung, Steven Spreizer, Mohamed Elrefaie, Matthew C. Jones, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院) MIT Lincoln Laboratory(麻省理工学院林肯实验室)

AI总结 本文提出BlendedNet++数据集,包含12492种独特的BWB几何体,通过RANS模拟提供集成力和密集表面场,利用几何深度学习模型实现实时气动预测和生成性逆设计,验证了Transolver在场预测中的准确性。

详情
AI中文摘要

Blended Wing Body (BWB)飞机的概念设计常受限于高维设计空间中复杂气动学的高计算成本。尽管深度学习为快速气动预测和逆设计提供了途径,但其在航空航天工程中的应用受限于缺乏大规模、场解训练数据。本文通过引入BlendedNet++,一个包含12,492种独特BWB几何体的综合气动学数据集,每种几何体均通过稳态雷诺平均纳维-斯托克斯(RANS)模拟进行评估,以提供集成力和密集表面场(Cp,Cf)。利用此数据,我们建立了两个关键工程任务的稳健框架:(1)利用几何深度学习模型实时预测表面气动场;(2)生成性逆设计。我们基准测试了五种替代架构,发现Transolver在场预测中最为准确。此外,我们通过条件扩散模型结合梯度基优化方法演示了生成性逆设计流程。这种混合方法被证明能够生成多个可行设计,满足特定升阻比目标,精度高(R^2 > 0.99),经计算流体动力学(CFD)模拟验证。这些资源使早期阶段BWB设计从迭代分析转向直接生成。

英文摘要

The conceptual design of Blended Wing Body (BWB) aircraft is often constrained by the high computational cost of resolving complex aerodynamics over a high-dimensional design space. While deep learning offers a pathway to rapid aerodynamic prediction and inverse design, its adoption in aerospace engineering is limited by a lack of large-scale, field-resolved training data. This work addresses this gap by introducing BlendedNet++, a comprehensive aerodynamic dataset comprising 12,492 unique BWB geometries, each evaluated using steady Reynolds-Averaged Navier--Stokes (RANS) simulations to provide integrated forces and dense surface fields (Cp, Cf). Leveraging this data, we establish a robust framework for two critical engineering tasks: (1) real-time prediction of surface aerodynamic fields using geometric deep learning models, and (2) generative inverse design. We benchmark five surrogate architectures, identifying Transolver as the most accurate for field predictions. Furthermore, we demonstrate a generative inverse design pipeline using conditional diffusion models combined with gradient-based refinement. This hybrid approach is shown to generate multiple feasible designs that satisfy specific lift-to-drag targets with high accuracy (R^2 > 0.99), as confirmed by computational fluid dynamics (CFD) simulation. These resources enable a shift from iterative analysis to direct generation in early-stage BWB design.

2510.26899 2026-05-19 cs.CY cs.AI cs.SI 版本更新

How Similar Are Grokipedia and Wikipedia? A Multi-Dimensional Textual and Structural Comparison

Grokipedia和Wikipedia有多相似?一个多维的文本和结构比较

Taha Yasseri, Saeedeh Mohammadi

发表机构 * Centre for Sociology of Humans and Machines (SOHAM)(人类与机器社会研究中心) School of Mathematics and Statistics(数学与统计学学院)

AI总结 研究通过对比17790对文章,分析Grokipedia和Wikipedia在文本和结构上的相似性,发现Grokipedia文章更长但引用更少,内容分为两组,其中一组在政治偏见上有所右移,引发对AI生成内容透明性和知识治理的质疑。

Comments 20 pages, 7 figures, updated with a larger sample size of 20,000 articles, better text cleaning procedure + Reference analysis, topical analysis

详情
AI中文摘要

Grokipedia的推出被视作对维基百科意识形态和结构偏见的回应,旨在用Grok大语言模型生成'真实'条目。本研究通过比较17790对文章,评估两者在词汇丰富度、可读性、引用密度、结构特征和语义相似性方面的差异。发现Grokipedia文章更长但引用较少,内容分为两组:一组与维基百科在语义和风格上一致,另一组则显著偏离。在不一致的文章中,引用的新闻媒体来源政治偏见系统性右移,集中在历史、宗教、文学和艺术领域。研究指出,AI生成的百科内容偏离传统编辑规范,更倾向于叙事扩展而非引用验证,引发对透明度、来源和知识治理的疑问。

英文摘要

The launch of Grokipedia, an AI-generated encyclopedia developed by Elon Musk's xAI, was presented as a response to perceived ideological and structural biases in Wikipedia, aiming to produce "truthful" entries using the Grok large language model. Yet whether an AI-driven alternative can escape the biases and limitations of human-edited platforms remains unclear. This study conducts a large-scale computational comparison of 17,790 matched article pairs from the 20,000 most-edited English Wikipedia pages. Using metrics spanning lexical richness, readability, reference density, structural features, and semantic similarity, we assess how closely the two platforms align in form and substance. We find that Grokipedia articles are substantially longer and contain significantly fewer references per word. Moreover, Grokipedia's content divides into two distinct groups: one that remains semantically and stylistically aligned with Wikipedia, and another that diverges sharply. Among the dissimilar articles, we observe a systematic rightward shift in the political bias of frequently cited news media sources, concentrated primarily in entries related to history and religion, and literature and art. More broadly, the findings indicate that AI-generated encyclopedic content departs from established editorial norms, favoring narrative expansion over citation-based verification, raising questions about transparency, provenance, and the governance of knowledge in automated information systems.

2508.06524 2026-05-19 cs.CL cs.AI cs.CY cs.DC cs.LG 版本更新

CarbonScaling: Extending Neural Scaling Laws for Carbon Footprint in Large Language Models

CarbonScaling:扩展神经缩放定律以用于大型语言模型中的碳足迹

Lei Jiang, Fan Chen

发表机构 * Indiana University(印第安纳大学)

AI总结 本文提出CarbonScaling框架,结合神经缩放定律和分布式训练策略,准确建模前沿LLM训练的碳足迹,提高硬件配置和排放估计的精度。

Comments 8 pages

详情
AI中文摘要

大型语言模型(LLMs)日益遵循将性能提升与快速扩展计算预算联系起来的神经缩放定律,这引发了对前沿规模训练可持续性的担忧。现有碳估计方法主要依赖于历史运行的回归分析,无法捕捉关键系统级因素,包括硬件异质性、分布式并行性、通信开销和架构稀疏性。我们提出了CarbonScaling,一种硬件意识的分析框架,用于建模前沿LLM训练的碳缩放行为。该框架整合了神经缩放定律、分布式训练策略、加速器和互连建模,以及操作和嵌入碳会计,以估计可行的硬件配置和相关排放。CarbonScaling同时建模张量、流水线、数据和专家并行性,同时纳入内存、带宽、利用率和运行时间约束。实验验证显示其比基于回归的基线具有显著更高的保真度,并突显了在万亿参数规模下嵌入碳的重要性。源代码:https://github.com/UnchartedRLab/CarbonScaling。

英文摘要

Large language models (LLMs) increasingly follow neural scaling laws that tie performance gains to rapidly expanding computational budgets, raising concerns about the sustainability of frontier-scale training. Existing carbon-estimation methods largely depend on regression over historical runs and fail to capture critical system-level factors, including hardware heterogeneity, distributed parallelism, communication overhead, and architectural sparsity. We present \textit{CarbonScaling}, a hardware-aware analytical framework for modeling the carbon scaling behavior of frontier LLM training. The framework integrates neural scaling laws, distributed training strategies, accelerator and interconnect modeling, and operational and embodied carbon accounting to estimate feasible hardware configurations and associated emissions. CarbonScaling jointly models tensor, pipeline, data, and expert parallelism while incorporating memory, bandwidth, utilization, and runtime constraints. Experimental validation demonstrates substantially higher fidelity than regression-based baselines and highlights the growing importance of embodied carbon at trillion-parameter scales. Source code: \url{https://github.com/UnchartedRLab/CarbonScaling}.

2502.17007 2026-05-19 cs.LG cs.AI stat.ML 版本更新

Uncertainty Quantification as a Principled Foundation for Explainable Artificial Intelligence: A Case Study of Counterfactual Explanations

不确定性量化作为可解释人工智能的原理性基础:反事实解释的案例研究

Kacper Sokol, Santo M. A. R. Thies, Eyke Hüllermeier

发表机构 * Department of Informatics, USI Lugano(乌里大学信息学院) Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系) Institute of Informatics, LMU Munich(慕尼黑大学信息研究所)

AI总结 本文通过反事实可解释性中的不确定性量化,展示其作为统一框架的潜力,提出两种解释器变体,并证明其在性能上优于现有方法。

详情
AI中文摘要

本文认为,透明性研究忽视了人工智能的基础概念。以反事实可解释性中的不确定性量化为例,证明其广泛应用能解决领域关键挑战。通过将核心反事实属性用不确定性表达,构建两种解释器变体,并展示框架在性能上优于现有方法。文章进一步表明,将人工智能基础融入透明性研究能产生更可靠、稳健和易懂的预测模型。提出使人工智能可解释性真正不确定性感知是实现该目标的第一步。

英文摘要

In this paper we argue that, to its detriment, transparency research overlooks many foundational concepts of artificial intelligence. As an illustrating example we focus on uncertainty quantification in the context of counterfactual explainability, demonstrating that its broader adoption could address key challenges in the field. To this end, we show how uncertainty can provide a principled unifying framework for counterfactual explainability by expressing the core counterfactual properties in terms of uncertainty, allowing us to build two variants of an explainer upon them -- one based solely on uncertainty estimates and another pairing them with distance measured in the feature space. Our comprehensive experiments illustrate highly competitive performance of our framework when compared to many state-of-the-art methods despite its radically simple design. More broadly, the paper demonstrates that integrating artificial intelligence fundamentals into transparency research promises to yield more reliable, robust and understandable predictive models. We posit that making artificial intelligence explainability truly uncertainty-aware is the first step towards this goal.

2408.17352 2026-05-19 cs.SD cs.AI eess.AS 版本更新

AASIST3: KAN-Enhanced AASIST Speech Deepfake Detection using SSL Features and Additional Regularization for the ASVspoof 2024 Challenge

AASIST3: 基于SSL特征和额外正则化的KAN增强型AASIST语音深度伪造检测用于ASVspoof 2024挑战

Kirill Borodin, Vasiliy Kudryavtsev, Dmitrii Korzh, Alexey Efimenko, Grach Mkrtchian, Mikhail Gorodnichev, Oleg Y. Rogov

发表机构 * AIRI

AI总结 本文提出AASIST3模型,通过增强现有AASIST框架并引入KAN网络等技术,显著提升了语音伪造检测性能,在封闭条件下达到0.5357的minDCF结果。

Comments 8 pages, 2 figures, 2 tables. Accepted paper at the ASVspoof 2024 (the 25th Interspeech Conference)

详情
AI中文摘要

自动语音验证(ASV)系统通过识别语音特征来识别说话人,广泛应用于金融交易用户认证、智能设备专属访问控制及法医欺诈检测等领域。然而,深度学习算法的进步使得通过文本到语音(TTS)和语音转换(VC)系统生成合成音频成为可能,使ASV系统面临潜在漏洞。为应对这一问题,我们提出了一种名为AASIST3的新型架构。通过增强现有的AASIST框架,引入Kolmogorov-Arnold网络、额外层、编码器和前置强调技术,AASIST3实现了性能的两倍以上提升。它在封闭条件下展示了0.5357的minDCF结果,在开放条件下达到0.1414,显著提高了对合成语音的检测能力,并提升了ASV安全性。新版本的模型已公开在HuggingFace (2026)。

英文摘要

Automatic Speaker Verification (ASV) systems, which identify speakers based on their voice characteristics, have numerous applications, such as user authentication in financial transactions, exclusive access control in smart devices, and forensic fraud detection. However, the advancement of deep learning algorithms has enabled the generation of synthetic audio through Text-to-Speech (TTS) and Voice Conversion (VC) systems, exposing ASV systems to potential vulnerabilities. To counteract this, we propose a novel architecture named AASIST3. By enhancing the existing AASIST framework with Kolmogorov-Arnold networks, additional layers, encoders, and pre-emphasis techniques, AASIST3 achieves a more than twofold improvement in performance. It demonstrates minDCF results of 0.5357 in the closed condition and 0.1414 in the open condition, significantly enhancing the detection of synthetic voices and improving ASV security. \textbf{The new version of the model is publicly available at \href{https://huggingface.co/lab260/Spectra-AASIST3}{\underline{HuggingFace (2026)}}}

2304.03427 2026-05-19 cs.CL cs.AI cs.CY cs.LG 版本更新

Cleansing Jewel: A Neural Spelling Correction Model Built On Google OCR-ed Tibetan Manuscripts

清除珠宝:基于谷歌OCR的藏文手稿的神经拼写纠正模型

Queenie Luo, Yung-Sung Chuang

发表机构 * Harvard University(哈佛大学) Massachusetts Institute of Technology (MIT)(麻省理工学院(MIT))

AI总结 本文提出基于谷歌OCR的藏文手稿的神经拼写纠正模型,通过改进的Transformer架构实现自动纠正OCR噪声输出,实验表明其优于其他序列模型。

Journal ref Association for Computing Machinery 2024

详情
AI中文摘要

人文学者依赖古代手稿来研究历史、宗教和社会政治结构。许多努力致力于使用OCR技术数字化这些珍贵的手稿,但大多数手稿因数世纪的污损,使得OCR程序无法准确捕捉褪色的字符和污渍。本文提出基于谷歌OCR处理的藏文手稿的神经拼写纠正模型,用于自动纠正OCR输出中的噪声。本文分为四个部分:数据集、模型架构、训练和分析。首先,我们将原始藏文电子文本语料库特征工程为两个结构化数据框——一组配对玩具数据和一组配对真实数据。然后,我们在Transformer架构中实现了置信度评分机制,用于拼写纠正任务。根据损失和字符错误率,我们的Transformer加置信度评分机制架构证明优于Transformer、LSTM-2-LSTM和GRU-2-GRU架构。最后,为了检验模型的鲁棒性,我们分析了错误的标记,可视化了模型中的注意力和自我注意力热图。

英文摘要

Scholars in the humanities rely heavily on ancient manuscripts to study history, religion, and socio-political structures in the past. Many efforts have been devoted to digitizing these precious manuscripts using OCR technology, but most manuscripts were blemished over the centuries so that an Optical Character Recognition (OCR) program cannot be expected to capture faded graphs and stains on pages. This work presents a neural spelling correction model built on Google OCR-ed Tibetan Manuscripts to auto-correct OCR-ed noisy output. This paper is divided into four sections: dataset, model architecture, training and analysis. First, we feature-engineered our raw Tibetan etext corpus into two sets of structured data frames -- a set of paired toy data and a set of paired real data. Then, we implemented a Confidence Score mechanism into the Transformer architecture to perform spelling correction tasks. According to the Loss and Character Error Rate, our Transformer + Confidence score mechanism architecture proves to be superior to Transformer, LSTM-2-LSTM and GRU-2-GRU architectures. Finally, to examine the robustness of our model, we analyzed erroneous tokens, visualized Attention and Self-Attention heatmaps in our model.