arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30353 2026-05-29 cs.AI astro-ph.CO cs.HC cs.SE (updated version)

Physics Is All You Need? A Case Study in Physicist-Supervised AI Development of Scientific Software

Nhat-Minh Nguyen

Affiliations Kavli IPMU (WPI), UTIAS, The University of Tokyo; Center for Data-Driven Discovery; Institute For Interdisciplinary Research in Science

AI summary Through a case study of a physicist supervising an AI coding agent to build a differentiable perturbation theory module, the paper examines how reliable AI agents are in scientific software development and finds that supervision design, more than model capability, determines whether the output can be trusted.

Comments 10 pages, 2 figures, 2 tables, 1 physicist and a few AI agents. Accepted by ICML 2026 AI for Science Workshop. Code and development log are available at this repo: https://github.com/MinhMPA/clax-pt

Abstract

Are AI agents tools, co-authors, or researchers? We present a quantified case study ($N=1$): a physicist supervising an AI coding agent (Claude Code, Sonnet and Opus models) over 12 work days and 57 sessions to build CLAX-PT, a differentiable one-loop perturbation theory module in JAX. We documented and classified 15 supervision events by intervention level. The agent resolved ten autonomously by iterating against oracle tests, and two more were resolved through the physicist's domain knowledge. The three it could not resolve -- all of which evaded oracle detection -- share a common property: the agent treated symptom reduction as root-cause resolution. It spent 33 of the 57 sessions adjusting coefficients within a code architecture that could not represent the target physics, and could not re-evaluate its CLASS-PT branch choice even when prompted to reconsider; only an injected physics concept (anisotropic BAO damping) triggered the redesign. Separately, the agent committed a calibrated correction that passed all oracle tests but corresponded to no quantity in the theory, predicting wrong values at any other cosmology. The fudge factor was caught and replaced within the same session. Three supervision practices proved critical for catching what oracle tests missed: testing at diverse parameter points beyond the fiducial calibration; shared changelogs that surfaced stalled exploration across sessions; and an explicit rule against unphysical numerical patches. In this case, supervision design, not model capability, determined whether the agent's output was trustworthy. Closing the gap would require agents that propose architectural alternatives rather than optimize within a given structure, and distinguish predictive adequacy from explanatory correctness -- capabilities not exhibited here and not obviously addressed by scaling alone. [Abridged.]
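
The first supervision practice, testing away from the fiducial point, can be illustrated with a toy sketch. Everything in it is hypothetical: `reference`, `wrong_model`, the fiducial value, and the test points are stand-ins chosen for illustration and are not taken from the CLAX-PT code or its oracle tests. The sketch only shows why a calibrated multiplicative patch can pass a test at the calibration point and still be wrong elsewhere.

```python
import numpy as np

def reference(x):
    """Stand-in oracle: the quantity the code is supposed to reproduce."""
    return x ** 2

def wrong_model(x):
    """Hypothetical implementation with the wrong functional form."""
    return x

X_FIDUCIAL = 3.0
# Calibrated patch: forces agreement with the oracle at the fiducial point only.
FUDGE = reference(X_FIDUCIAL) / wrong_model(X_FIDUCIAL)

def patched_model(x):
    """Wrong model multiplied by the calibrated fudge factor."""
    return FUDGE * wrong_model(x)

def max_rel_error(points):
    """Largest relative deviation of the patched model from the oracle."""
    points = np.asarray(points, dtype=float)
    return float(np.max(np.abs(patched_model(points) / reference(points) - 1.0)))

# A test at the calibration point alone cannot see the problem.
fiducial_error = max_rel_error([X_FIDUCIAL])
# Testing at several other parameter values exposes it immediately.
diverse_error = max_rel_error([1.0, 2.0, 3.0, 6.0])
```

In this toy setup the patched model is exact at the fiducial point and off by a factor of three at `x = 1.0`, which is the kind of failure a single-point oracle test would miss.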

2605.30351 2026-05-29 cs.CV cs.AI (updated version)

VideoMLA: Low-Rank Latent KV Cache for Minute-Scale Autoregressive Video Diffusion

Hidir Yesiltepe, Jiazhen Hu, Tuna Han Salih Meral, Adil Kaan Akan, Kaan Oktay, Hoda Eldardiry, Pinar Yanardag

Affiliations Virginia Tech; fal Project

AI summary This paper proposes VideoMLA, which uses Multi-Head Latent Attention (MLA) to replace per-head keys and values with a shared low-rank content latent and a decoupled 3D-RoPE positional key, cutting KV memory in video diffusion by 92.7% while preserving quality and improving throughput.

Comments Project Page: https://videomla.github.io/

Abstract

Long-rollout causal video diffusion has converged on a fixed-size sliding-window KV cache, with recent progress innovating within this layout by changing which tokens occupy the window or how their positions are encoded. The per-head KV layout itself, a dominant contributor to streaming memory and latency, has been mostly left unchanged. In this paper, we present the first study of Multi-Head Latent Attention (MLA) in video diffusion. VideoMLA replaces per-head keys and values with a shared low-rank content latent and a shared decoupled 3D-RoPE positional key, reducing per-token KV memory by 92.7% at every cached layer. We further investigate why MLA succeeds in video diffusion even though the spectral assumption often used to motivate it in language models does not hold: pretrained video attention is not low-rank, with 99%-energy effective rank far above any practical latent dimension. VideoMLA retains quality at compression ratios where direct spectral approximation would predict large reconstruction error. We show that the MLA bottleneck, rather than the pretrained spectrum, determines the effective rank: both spectral and random initialization occupy nearly the full rank budget from initialization, and training preserves this budget while adapting within it. On VBench, VideoMLA matches short-horizon streaming video diffusion baselines, achieves the best overall score at long horizons among evaluated methods, and improves throughput by 1.23x on a single B200.
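
To make the source of the memory saving concrete, here is a small arithmetic sketch. It assumes the usual MLA accounting, in which a standard cache stores one key and one value vector per head while an MLA cache stores one shared latent plus one shared positional key per token. The dimensions below are hypothetical and are not the paper's configuration, which the abstract does not give.

```python
def kv_elements_per_token(num_heads: int, head_dim: int) -> int:
    """Per-head layout: one key and one value vector for every head."""
    return 2 * num_heads * head_dim

def mla_elements_per_token(latent_dim: int, rope_dim: int) -> int:
    """MLA-style layout: one shared content latent plus one shared positional key."""
    return latent_dim + rope_dim

def memory_reduction(num_heads: int, head_dim: int, latent_dim: int, rope_dim: int) -> float:
    """Fraction of per-token cache memory saved by the latent layout."""
    baseline = kv_elements_per_token(num_heads, head_dim)
    compressed = mla_elements_per_token(latent_dim, rope_dim)
    return 1.0 - compressed / baseline

# Hypothetical dimensions, for illustration only.
reduction = memory_reduction(num_heads=16, head_dim=128, latent_dim=256, rope_dim=64)
print(f"per-token KV memory reduction: {reduction:.1%}")
```

With these made-up sizes the baseline stores 4096 elements per token and the latent layout stores 320, a reduction of the same order as the figure reported in the abstract.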

2605.30348 2026-05-29 cs.CL cs.AI cs.LG (updated version)

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

Affiliations VILA Lab, MBZUAI; UCL

AI summary Proposes LLMSurgeon, a framework that treats the task as an inverse problem and estimates the domain distribution of a target LLM's pretraining corpus from its generated text, enabling post-hoc auditing without access to the training data.

Comments ACL 2026 Main. Code at https://github.com/Yaxin9Luo/LLMSurgeon

Abstract

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
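
The inverse problem described above can be sketched in a few lines. Under the label-shift assumption, the average classifier output `q` on generated text satisfies `q = C @ pi`, where column `j` of the soft confusion matrix `C` is the average classifier output on text from domain `j` and `pi` is the unknown mixture. The sketch below recovers `pi` by projected gradient descent on the probability simplex. It is a generic textbook solver on synthetic numbers, not the authors' implementation, and the function names and the matrix values are assumptions made for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def recover_mixture(confusion, observed, steps=2000):
    """Minimize 0.5 * ||confusion @ pi - observed||^2 subject to pi on the simplex."""
    k = confusion.shape[1]
    pi = np.full(k, 1.0 / k)
    # Step size 1/L, where L is the squared largest singular value of the matrix.
    lr = 1.0 / np.linalg.norm(confusion, 2) ** 2
    for _ in range(steps):
        grad = confusion.T @ (confusion @ pi - observed)
        pi = project_to_simplex(pi - lr * grad)
    return pi

# Synthetic soft confusion matrix: column j is the mean classifier output
# on text whose true domain is j (each column sums to 1).
confusion = np.array([
    [0.80, 0.10, 0.10],
    [0.15, 0.70, 0.20],
    [0.05, 0.20, 0.70],
])
true_mixture = np.array([0.5, 0.3, 0.2])
observed = confusion @ true_mixture  # what naive aggregation would report

estimate = recover_mixture(confusion, observed)
```

Note that `observed` differs from `true_mixture`: aggregating classifier outputs directly inherits the systematic domain confusion, which is the bias the constrained inverse step removes.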

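2605.30345 2026-05-29 cs.AI cs.CL cs.LG (updated version)

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu

Affiliations University of California, San Diego; Microsoft Research Asia; The University of Texas at Austin

AI summary Proposes SchGen, the first large language model that generates editable PCB schematics from natural-language requests. A semantically grounded code representation turns a geometry-driven problem into a semantics-driven matching task, and a large-scale dataset is constructed; SchGen significantly outperforms existing methods on wire connectivity accuracy and functional correctness.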

Comments 19 pages, 7 figures

Abstract

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

2605.30344 2026-05-29 cs.AI (updated version)

Tiny but Trusted: Efficient Vision-Language Reasoning for Time-Series Anomaly Detection

Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, Constantin Brif, Ismini Lourentzou

Affiliations University of Illinois Urbana-Champaign; Sandia National Laboratories

AI summary To address the lack of natural-language explanations in time-series anomaly detection, the authors build the VisAnomBench benchmark and fine-tune VisAnomReasoner, a parameter-efficient vision-language model that substantially outperforms baselines in accuracy and generalization.

Abstract

Recent advances in Vision-Language Models (VLMs) have achieved impressive performance across many tasks, yet prior studies report unsatisfactory performance when applying large language or multimodal models to finding abnormal patterns in sequential data. Public anomaly detection benchmarks typically provide interval annotations but not natural-language rationales, making it difficult to fine-tune VLMs to produce grounded, interpretable decisions. To address this gap, we construct VisAnomBench, a curated benchmark built from public time-series datasets and augmented with high-quality anomaly explanations selected from multiple large VLMs using fine-grained, task-specific rewards. Through fine-tuning on this benchmark, we develop VisAnomReasoner, a parameter-efficient VLM for time-series anomaly detection. Experimental results on VisAnomBench show that VisAnomReasoner achieves more accurate anomaly localization and consistently outperforms all baselines, with improvements of at least 21.23 and 23.87 percentage points in precision and F1, respectively. Additional experiments on the TSB-AD-U benchmark demonstrate strong cross-benchmark generalization, with VisAnomReasoner improving precision and F1 by 9.57 and 13.39 percentage points, respectively.

2605.30343 2026-05-29 cs.CL cs.AI (updated version)

Unlocking the Working Memory of Large Language Models for Latent Reasoning

Lukas Aichberger, Sepp Hochreiter

Affiliations ELLIS Unit Linz and LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz, Austria; NXAI GmbH, Linz, Austria

AI summary Proposes RiM, a latent reasoning method that replaces the autoregressive generation of intermediate reasoning steps with fixed memory blocks, enabling compute-efficient latent reasoning in a single forward pass.

Comments Preprint

Abstract

To improve the reasoning capabilities of large language models, test-time compute is typically scaled by generating intermediate tokens before the final answer. However, this couples reasoning to autoregressive generation and thereby conflates internal computation with external communication. In contrast, human cognition can use working memory to hold and manipulate information internally without the need to externalize intermediate thoughts. Drawing on this principle, we introduce Reasoning in Memory (RiM), a latent reasoning method that replaces the autoregressive generation of reasoning steps with memory blocks. These memory blocks are fixed sequences of special tokens that unlock the working-memory capacity of large language models. Since they are fixed rather than generated, they can be processed in a single forward pass, enabling compute-efficient latent reasoning. To operationalize these memory blocks, we employ a two-stage curriculum. First, we ground them by predicting explicit reasoning steps after each memory block. Second, we discard this step-level supervision and iteratively refine the final answer after each memory block. Our experiments on reasoning benchmarks show that, across language models of different families and sizes, RiM matches or exceeds existing latent reasoning methods while avoiding the autoregressive generation of thoughts. These results demonstrate that large language models can be trained to use working memory as an effective mechanism for latent reasoning.

2605.30341 2026-05-29 cs.CV cs.AI (updated version)

GPIC: A Giant Permissive Image Corpus for Visual Generation

Keshigeyan Chandrasegaran, Kyle Sargent, Suchir Agarwal, Michael Jang, Michael Poli, Juan Carlos Niebles, Justin Johnson, Jiajun Wu, Li Fei-Fei

Affiliations Stanford University; Radical Numerics; University of Michigan; Salesforce Research

AI summary Proposes GPIC, a large permissively licensed image dataset of approximately 28 trillion pixels with 100M training examples, captioned by a state-of-the-art vision-language model, for research on visual generative modeling.

Comments 25 pages; Dataset: https://huggingface.co/datasets/stanford-vision-lab/giant-permissive-image-corpus; Project website: https://gpic.stanford.edu

Abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu

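2605.30335 2026-05-29 cs.AI cs.CL (updated version)

Locally Coherent, Globally Incoherent: Bounding Compositional Incoherence in Multi-Component LLM Agents

Anany Kotawala

Affiliations Princeton University

AI summary This paper formalizes failures in multi-component LLM agents that are locally coherent but globally incoherent, proposes the compositional residual eps* as a measure of incoherence, and introduces hierarchical projection repair and sequential coherence monitoring; experiments find that incoherence is widespread and affects decisions.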

Comments 25 pages, 7 figures, 24 tables. Preliminary versions to appear at the ICML 2026 Workshops on Combining Theory and Benchmarks (CTB), Statistical Frameworks for Uncertainty in Agentic Systems (AgenticUQ), and Failure Modes of Agentic AI (FAGEN)

Abstract

Multi-component LLM agents assemble probabilistic claims from components that each see only part of a joint problem; the composition can violate basic probability axioms even when every component is locally coherent. We formalise this locally coherent, globally incoherent failure via the compositional residual eps*, the L2 distance from the composed quote to the joint coherent polytope, computable at runtime from system output and the declared cross-component coupling constraints. A product-structure dichotomy characterises when local coherence suffices, and a Rayleigh-quotient prediction matches the observed residual within 7% on three of four relation classes. A hierarchical Boyle-Dykstra projection repairs the composition deterministically; an anytime-valid e-process gives sequential coherence monitoring. Across 1,876 ensemble cliques on a four-LLM mid-tier panel (frontier-panel rerun in Section 5.5), eps* > 0 on 33-94% of cliques, translating to +0.115 nats per bet of regret on 1,770 resolved bets under the proportional allocation rule (the gain collapses to +0.006 under bettors that themselves coherentise). Three intuitive LLM-side mitigations (retrieval, partition-aware prompting, aggregator-LLM) each fail or regress.
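
A toy version of the residual makes the definition concrete. Suppose three components each quote a probability for one of three mutually exclusive, exhaustive outcomes. Each quote is a valid probability on its own, yet together they may not sum to 1. The sketch below computes the L2 distance from such a quote to the coherent set using plain Dykstra alternating projections between the hyperplane `sum(x) = 1` and the box `[0, 1]^n`. The constraint set, the function names, and the numbers are assumptions for illustration; the paper's hierarchical Boyle-Dykstra scheme and its coupling constraints are not specified in the abstract.

```python
import numpy as np

def project_hyperplane(v):
    """Project onto the hyperplane {x : sum(x) = 1}."""
    return v - (v.sum() - 1.0) / v.size

def project_box(v):
    """Project onto the box [0, 1]^n."""
    return np.clip(v, 0.0, 1.0)

def dykstra_projection(quote, iterations=200):
    """Dykstra's algorithm: projection onto the intersection of the two sets."""
    x = np.asarray(quote, dtype=float).copy()
    p = np.zeros_like(x)  # correction term for the hyperplane
    r = np.zeros_like(x)  # correction term for the box
    for _ in range(iterations):
        y = project_hyperplane(x + p)
        p = x + p - y
        x = project_box(y + r)
        r = y + r - x
    return x

def compositional_residual(quote):
    """L2 distance from the composed quote to the coherent set."""
    quote = np.asarray(quote, dtype=float)
    return float(np.linalg.norm(quote - dykstra_projection(quote)))

# Each entry is a locally valid probability, but the three sum to 1.2.
quote = np.array([0.5, 0.4, 0.3])
repaired = dykstra_projection(quote)
residual = compositional_residual(quote)
```

Here the repair subtracts the same amount from every entry, and the residual is positive even though no single component did anything locally invalid.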

2605.30334 2026-05-29 cs.AI cs.CL (updated version)

Demystifying Data Organization for Enhanced LLM Training

Yalun Dai, Yangyu Huang, Tongshen Yang, Yonghan Wang, Xin Zhang, Wenshan Wu, Qihao Zhao, Hao Li, Yuanyuan Gao, Kim-Hui Yap, Scarlett Li

Affiliations Nanyang Technological University; Microsoft Research; The Hong Kong University of Science and Technology

AI summary This paper systematically explores how data organization affects LLM training, proposes four optimization guidelines (Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity), and designs two new data ordering methods, STR and SAW, whose effectiveness is validated in both pre-training and fine-tuning.

Comments ACL 2026 Main Conference

Abstract

Large Language Models (LLMs) have revolutionized various fields, yet their training efficiency is heavily reliant on effective data curation. While data selection has been widely studied, strategic data organization for enhanced training remains an underexplored area, particularly since current LLMs are often trained for only one or a few epochs. This paper systematically explores the influence of data organization on LLM training by reusing pre-computed sample-level scores originally generated for data efficiency, thereby incurring minimal additional computational overhead. We identify and formalize four key guidelines for optimizing data organization: Boundary Sharpening, Cyclic Scheduling, Curriculum Continuity, and Local Diversity. Guided by them, we introduce two novel data ordering methods termed STR and SAW. Extensive experiments across different model scales and data sizes, encompassing both pre-training and SFT stages, validate the effectiveness of our summarized guidelines. They also demonstrate the robustness of our proposed data ordering methods in enhancing the stability and performance of LLM training. GitHub link: https://github.com/microsoft/data-efficacy/

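2605.30327 2026-05-29 cs.LG cs.AI cs.CL math.ST stat.ML stat.TH (updated version)

Reasoning with Sampling: Cutting at Decision Points

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

Affiliations Yale University; Stanford University

AI summary Proposes the Entropy-Cut Metropolis-Hastings algorithm, which uses the base model's next-token entropy as a proxy to identify key decision points and resample from them, sampling efficiently from the power distribution to strengthen reasoning and outperforming baselines and RL-trained models on several benchmarks.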

Abstract

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
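
The contrast between uniform cuts and entropy-guided cuts can be sketched as follows. The code computes the next-token entropy at each position of a trace and turns it into a distribution over cut positions. Weighting cuts in direct proportion to entropy is an assumption made here for illustration; the abstract does not say how the algorithm converts entropy (or entropy jumps) into a proposal, and a valid Metropolis-Hastings sampler would also need an acceptance step that accounts for the non-uniform proposal, which is not shown.

```python
import numpy as np

def next_token_entropy(probs):
    """Shannon entropy in nats of each row of a (positions, vocab) array."""
    p = np.clip(probs, 1e-12, 1.0)  # avoid log(0)
    return -np.sum(p * np.log(p), axis=1)

def cut_distribution(probs):
    """Distribution over cut positions, proportional to next-token entropy."""
    h = next_token_entropy(probs)
    return h / h.sum()

def sample_cut(probs, rng):
    """Draw one cut position; the suffix from this position would be resampled."""
    weights = cut_distribution(probs)
    return int(rng.choice(len(weights), p=weights))

# Hypothetical next-token distributions at three positions of a trace.
# Position 1 is a "decision point": the model is unsure what comes next.
probs = np.array([
    [0.98, 0.01, 0.01],
    [1 / 3, 1 / 3, 1 / 3],
    [0.90, 0.05, 0.05],
])

weights = cut_distribution(probs)
rng = np.random.default_rng(0)
cut = sample_cut(probs, rng)
```

A uniform sampler would pick each of the three positions with probability 1/3, whereas the entropy-weighted distribution concentrates most of its mass on position 1.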

2605.30326 2026-05-29 cs.RO cs.AI (updated version)

RoboWits: Unexpected Challenges for Robotic Creative Problem Solving

Chunru Lin, Hongxin Zhang, Fenghao Yu, Zhehuan Chen, Thomas L. Griffiths, Yejin Choi, David Held, Chuang Gan

Affiliations University of Massachusetts Amherst; Princeton University; Stanford University; Carnegie Mellon University

AI summary This paper presents RoboWits, a bi-manual robotic benchmark that uses an automated multi-agent task generation pipeline to evaluate cognitive reasoning, creative tool use, and robustness across geometry, material, and assembly reasoning, and finds that pre-trained VLAs are brittle on mutated tasks.

Comments The first two authors contributed equally

Abstract

The ability to reason, adapt, and creatively solve problems under unexpected challenges is essential for robots operating in real-world environments. However, current robotic benchmarks primarily emphasize skill-level execution and provide limited insight into such cognitive reasoning capabilities. We introduce RoboWits, a bi-manual robotic benchmark designed to systematically evaluate cognitive reasoning, creative tool use, and robustness to unexpected conditions. To enable scalable construction of high-quality reasoning-centric unexpected scenarios, we propose an automated task generation pipeline formulated as a multi-agent cooperative framework, comprising agents for seed task generation and verification, metric generation, scene generation, and task mutation. Using the pipeline, we curated 30 diverse seed tasks and 208 tasks with mutations and graded difficulty across geometry, material, and assembly-based reasoning. We benchmark popular robot policies, pre-trained VLAs, and oracle-state planners. Our results reveal a significant performance gap: while pre-trained VLAs exhibit preliminary success on seed tasks after single-task fine-tuning, they struggle to perform on mutated tasks, implying their brittleness in manipulation tasks requiring reasoning, strategy adaptation, and robustness to deceptive or constrained environments. Project page is available at https://umass-embodied-agi.github.io/RoboWits.

2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML (updated version)

On Language Generation in the Limit with Bounded Memory

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

Affiliations Cornell University; Stanford University; Google Research

AI summary Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding-window analysis to examine how memory constraints affect generability, density, and identification.

Comments The abstract has been shortened to fit within the arXiv limit

Abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.

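2605.30323 2026-05-29 cs.LG cs.AI (updated version)

In-Context Reward Adaptation for Robust Preference Modeling

Zhenyu Sun, Zheng Xu, Ermin Wei

Affiliations Northwestern University; Meta Superintelligence Labs

AI summary Proposes a transformer-based in-context reward adaptation framework that models diverse and unseen human preferences on the fly from a few preference examples, with human response time as an auxiliary signal, yielding robust preference modeling and adaptation to distribution shift.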

Abstract

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. By characterizing its asymptotic bias relative to the ground truth, we show that a standard transformer architecture is insufficient for this task, whereas incorporating human response time as an auxiliary input signal enables the model to successfully adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.

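2605.30322 2026-05-29 cs.LG cs.AI (updated version)

Gram: Assessing sabotage propensities via automated alignment auditing

David Lindner, Victoria Krakovna, Sebastian Farquhar

Affiliations Google

AI summary Proposes Gram, a framework that automatically audits the sabotage propensity of AI agents by simulating 17 agentic deployment scenarios; it finds that Gemini models misbehave in about 2-3% of trajectories and introduces an investigator agent pipeline to identify the drivers.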

Abstract

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini models misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by "overeagerness" in Gemini models resulting in both excessive role-playing and goal-seeking behavior. In contrast to other alignment auditing approaches, Gram is designed to specifically evaluate misalignment and intentional sabotage in agentic coding and research agents. We additionally introduce an experimental investigator agent pipeline which enables fine-grained targeted experiments to identify the drivers of misbehavior. We find that increasing realism of environments and removing nudges to misbehave tends to reduce sabotage rates close to zero.

2605.30319 2026-05-29 stat.ML cs.AI cs.DS cs.LG math.ST stat.TH 版本更新

Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion

通过矩阵补全改进异质性处理效应估计的保证

Anay Mehrotra, Phuc Tran, Van H. Vu, Manolis Zampetakis

发表机构 * Stanford University(斯坦福大学) Vin University(Vin大学) The University of Hong Kong(香港大学) Yale University(耶鲁大学)

AI总结 针对面板数据中的异质性处理效应估计问题,提出一种基于矩阵补全的简单高效估计器,在低秩假设下实现行向$\ell_2$误差$\tilde{O}(\sqrt{1/n + n/m^2})$,并首次建立了低秩逼近的行向$\ell_2$扰动界。

详情
AI中文摘要

现代因果推断的一个核心目标是估计异质性处理效应,以回答诸如“干预如何影响每个单元”的问题,而不仅仅是平均效应。我们研究面板数据下的该问题,其中我们观察到$n$个单元在$m$个时间点上的数据,且处理分配未知且非均匀。该设置中的数据自然表示为所有单元-时间处理效应的矩阵。估计异质性处理效应可以表示为对该矩阵中每一行平均值的良好估计。这使我们能够将问题表述为矩阵补全,在自然低秩假设下可解。然而,现有的矩阵补全保证不足以得到估计异质性处理效应所需的每行保证的有意义界;粗略地说,它们仅适用于估计平均处理效应界,正如最近一系列工作所示。我们给出一个简单、计算高效的估计器,在不知道倾向性且标准低秩和正则性假设下,实现行向$\ell_2$误差$\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$。在技术上,我们的分析首次建立了低秩逼近的尖锐行向$\ell_2$扰动界,补充了现有的谱、Frobenius和逐元素扰动理论。

英文摘要

A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention affect each unit," rather than only on average. We study this problem with panel-data where we observe $n$ units across $m$ times under unknown, non-uniform treatment assignments. The data in this setting is naturally represented as a matrix of all unit--time treatment effects. Estimating heterogeneous treatment effects can then be expressed as obtaining a good estimation of each row's average in this matrix. This allows us to formulate the problem as matrix completion, which can be solved under natural low-rankness assumptions. However, existing matrix-completion guarantees are not powerful enough to get meaningful bounds for the per-row guarantee required for estimating the heterogeneous treatment effect; roughly speaking, they are only useful for estimating average treatment effect bounds, as also illustrated in a recent line of work. We give a simple, computationally efficient estimator that, without knowledge of the propensities and under standard low-rankness and regularity assumptions, achieves a row-wise $\ell_2$ error of $\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$. Technically, our analysis establishes the first sharp row-wise $\ell_2$-perturbation bound for low-rank approximation, complementing existing spectral-, Frobenius-, and entrywise perturbation theory.

2605.30318 2026-05-29 cs.GR cs.AI cs.CV 版本更新

Before the Shutter: Aesthetic and Actionable Portrait Photography Planning in 3D Scenes

快门之前:3D场景中兼具美感与可执行性的人像摄影规划

Ruixiang Jiang, Chang Wen Chen

发表机构 * The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出在3D场景中生成人像姿态、相机、照明和曝光方案的方法,通过构建摄影场景图实现美学引导的规划,生成视觉上引人注目且几何与光度可行的人像。

详情
AI中文摘要

人像摄影在很大程度上是在快门打开之前决定的:主体的姿态、相机配置和照明设备必须在周围的3D场景中协调。相比之下,大多数现有的计算方法侧重于2D图像空间中的后期制作,例如修饰、重新照明或编辑已经存在的图像;捕获前的摄影规划仍然很大程度上未被探索。我们引入了3D美学人像规划,即生成人体姿态、相机、照明和曝光计划的任务,这些计划在满足3D场景中的几何和光度可行性的同时,产生视觉上引人注目的人像。我们的方法构建了一个摄影场景图,该图表示场景可供性、主体-场景关系以及与人像相关的照明结构。基于这种表示,我们对先前的尝试和当前的取景器观察进行美学引导的比较规划。在多样化的室内和室外场景中的实验表明,我们的方法生成的人像比竞争基线更受人类评分者和MLLM评估者的青睐,同时保持高物理合理性。总之,我们的结果指明了从捕获后校正走向捕获前计算人像规划的道路。项目仓库:https://github.com/songrise/Before-the-Shutter

英文摘要

Portrait photography is largely decided before the shutter opens: the subject's pose, the camera configuration, and the lighting devices must be coordinated within the surrounding 3D scene. In contrast, most existing computational methods focus on post-production in 2D image space, such as retouching, relighting, or editing images that already exist; pre-capture photographic planning remains largely unexplored. We introduce 3D aesthetic portrait planning, the task of generating human pose, camera, lighting, and exposure plans that produce visually compelling portraits while satisfying geometric and photometric feasibility in a 3D scene. Our approach builds a Photographic Scene Graph that represents scene affordances, subject-scene relations, and portrait-relevant lighting structure. Built on this representation, we perform aesthetic-guided comparative planning over previous attempts and current viewfinder observations. Experiments across diverse indoor and outdoor scenes show that our method produces portraits preferred by human raters and MLLM evaluators over competitive baselines, while maintaining high physical plausibility. Together, our results suggest a path from post-capture correction toward pre-capture computational portrait planning. Project repository: https://github.com/songrise/Before-the-Shutter

2605.30311 2026-05-29 cs.CV cs.AI 版本更新

Archon: A Unified Multimodal Model for Holistic Digital Human Generation

Archon:面向整体数字人生成的统一多模态模型

Chong Bao, Shichen Liu, Lijun Yu, David Futschik, Stylianos Moschoglou, Shefali Srivastava, Ziqian Bai, Feitong Tan, Guofeng Zhang, Zhaopeng Cui, Sean Fanello, Yinda Zhang

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) Google(谷歌) Google DeepMind(谷歌DeepMind)

AI总结 提出Archon,一个完全预训练的以人为中心的统一多模态模型,通过模态特定分词器、语义视频重参数化和“模态思维”策略,实现文本、音频、动作和视觉等七种模态的整体数字人生成。

Comments Accepted to CVPR 2026. Project Page: https://zju3dv.github.io/archon/

详情
AI中文摘要

数字人是沉浸式交互的基础,然而创建一个统一模型来处理包括文本、音频、动作和视觉内容在内的整体模态仍然是一个开放的挑战。在本文中,我们提出了Archon,一个完全预训练的、以人为中心的统一多模态模型,用于整体虚拟形象生成。Archon通过模态特定分词器统一了七种模态,并利用一个在同步模态和72个不同任务上预训练的原生自回归统一多模态模型来建模整体联合分布。为了解决高保真说话视频中的标记爆炸挑战,我们引入了一种内存高效的语义视频重参数化方法,在保持细粒度动态的同时实现了4倍的标记减少,并结合了一个语义驱动的视频扩散解码器。我们进一步提出了一种“模态思维”,它将模糊的跨模态任务分解为替代模态链中的逐步思维,逐步增强保真度和可控性。大量实验表明,Archon在各种数字人生成任务中实现了优越或可比的性能,验证了我们统一框架的有效性。项目页面:https://zju3dv.github.io/archon/。

英文摘要

Digital humans are fundamental to immersive interaction, yet creating a unified model for holistic modalities, including text, audio, motion, and visual content, remains an open challenge. In this paper, we present Archon, a fully pretrained, human-centric unified multimodal model for holistic avatar generation. Archon unifies seven modalities with modality-specific tokenizers, and a native autoregressive unified multimodal model pretrained on synchronized modalities and 72 diverse tasks to model holistic joint distributions. To address the token explosion challenge in high-fidelity talking videos, we introduce a memory-efficient semantic video reparameterization, achieving 4x token reduction while preserving fine-grained dynamics, coupled with a semantic-driven video diffusion decoder. We further propose a "Thinking in Modality" that decomposes ambiguous cross-modal tasks into stepwise thinking in an alternative chain of modality, progressively enhancing fidelity and controllability. Extensive experiments demonstrate that Archon achieves superior or comparable performance across diverse digital human generation tasks, validating the effectiveness of our unified framework. Project page: https://zju3dv.github.io/archon/.

2605.30310 2026-05-29 cs.CV cs.AI cs.GR 版本更新

City-Mesh3R: Simulation-Ready City-Scale 3D Mesh Reconstruction from Multi-View Images

City-Mesh3R:面向仿真就绪的城市级多视图三维网格重建

Sayan Paul, Sourav Ghosh, Siddharth Katageri, Soumyadip Maity, Sanjana Sinha, Brojeshwar Bhowmick

发表机构 * Visual Computing & Embodied AI Lab, TCS Research(视觉计算与具身人工智能实验室,TCS研究)

AI总结 提出City-Mesh3R框架,通过分治策略从大规模无序图像集合端到端重建水密表面网格,解决城市尺度重建中几何不完整、表面不规则及计算复杂性问题。

Comments Accepted to the USM3D Workshop Proceedings at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2026 as an Oral Presentation. Project page: https://citymesh3r.github.io/

详情
AI中文摘要

从多视图图像进行城市级三维表面重建以支持下游三维仿真,由于城市场景的规模和复杂性,带来了极具挑战性的问题。现有的基于NeRF、高斯泼溅等方法的城市级三维重建技术,常因几何不完整/缺失以及不规则、噪声表面而无法恢复可用于仿真的三维网格。将现有小规模三维重建方法扩展到任意大规模城市场景因计算复杂而不可行。我们提出City-Mesh3R,一个可扩展的框架,直接从大规模无序图像集合重建水密表面网格。与近期使用全局稀疏SfM点云初始化后分布式稠密重建大规模场景的方法不同,我们的方法采用分治策略,遵循端到端的图像到网格三维重建流程。通过拓扑图像聚类、聚类独立稀疏SfM和地图合并重建稀疏城市地图,无需穷举图像特征匹配。然后对该地图进行空间划分,执行几何感知的相机选择,接着进行稠密表面重建,并使用曲率感知的自适应顶点密度重网格化进行表面细化。这些分区网格随后拼接成城市全局网格。所提出的端到端框架在城市级重建数据集上进行了评估。定性和定量结果表明,我们的方法能生成具有规则几何、捕捉精细表面细节的高保真水密三维网格,并因其分布式端到端处理而适用于任意大规模场景。

英文摘要

City-scale 3D surface reconstruction from multi-view images for downstream 3D simulation poses highly challenging problems due to the scale and complexity of urban scenes. Existing city-scale 3D reconstruction methods based on NeRF, Gaussian Splatting, and similar techniques often fail to recover simulation-ready 3D meshes because of incomplete or missing geometry and irregular, noisy surfaces. Scaling existing small-scale 3D reconstruction methods to arbitrarily large urban scenes is infeasible due to their computational complexity. We present City-Mesh3R, a scalable framework for reconstructing watertight surface meshes directly from large unordered image collections. Unlike recent methods that use global sparse SfM point-cloud initialization followed by distributed dense 3D reconstruction of large-scale scenes, our method follows an end-to-end images-to-mesh 3D reconstruction approach using a divide-and-conquer strategy. The sparse city map is reconstructed via topological image clustering, cluster-wise independent sparse SfM, and map merging, without the need for exhaustive image feature matching. This map is then partitioned spatially to perform geometry-aware camera selection, followed by dense surface reconstruction and surface refinement using curvature-aware adaptive vertex density remeshing. The partition meshes are then stitched together to produce the global mesh of the city. The proposed end-to-end framework is evaluated on city-scale reconstruction datasets. As our qualitative and quantitative results demonstrate, the method yields high-fidelity watertight 3D meshes with regular geometry that capture fine surface details, and it is suitable for scaling to arbitrarily large scenes owing to end-to-end processing in a distributed setting.

2605.30295 2026-05-29 cs.CL cs.AI 版本更新

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

MedCase-Structured:用于在临床真实EHR环境中基准测试诊断推理的文本到FHIR数据集

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

发表机构 * System Inc.(系统公司)

AI总结 提出一个从非结构化文本生成临床真实HL7 FHIR R4数据集的流水线,构建MedCase-Structured数据集,发现LLMs在结构化FHIR输入上的诊断准确性低于纯文本,强调部署对齐基准测试的重要性。

Comments Accepted to ICML 2026 Structured Data for Health Workshop

详情
AI中文摘要

大型语言模型(LLMs)在临床推理和决策支持方面显示出潜力,但在真实、与电子健康记录一致的环境中的评估仍然有限。现有的基准测试通常依赖于静态数据集或不反映临床系统中使用的结构化、可互操作数据格式的非结构化输入。我们引入了一个从非结构化文本生成临床真实HL7 FHIR R4数据包的流水线,从而实现对临床决策支持系统的可控评估。该流水线将分阶段LLM生成与基于术语的验证和修复相结合,以减少幻觉代码并强制结构和语义一致性。将此方法应用于MedCaseReasoning,我们构建了MedCase-Structured,这是一个与临床医生编写的诊断案例对齐的合成数据集,实现了82.5%案例的有效FHIR生成。在MedCase-Structured上的评估显示,LLMs在结构化FHIR输入上的诊断准确性始终低于纯文本,突出了部署对齐基准测试的重要性。

英文摘要

Large language models (LLMs) show promise for clinical reasoning and decision support, but evaluation in realistic, electronic health record-congruent settings remains limited. Existing benchmarks often rely on static datasets or unstructured inputs that do not reflect the structured, interoperable data formats used in clinical systems. We introduce a pipeline for generating clinically realistic HL7 FHIR R4 bundles from unstructured text, enabling controllable evaluation of clinical decision support systems. The pipeline combines staged LLM generation with terminology-grounded validation and repair to reduce hallucinated codes and enforce structural and semantic consistency. Applying this approach to MedCaseReasoning, we construct MedCase-Structured, a synthetic dataset aligned with clinician-authored diagnostic cases, achieving valid FHIR generation for 82.5% of cases. Evaluation on MedCase-Structured reveals consistently lower diagnostic accuracy for LLMs on structured FHIR inputs than with plain text, highlighting the importance of deployment-aligned benchmarking.

2605.29169 2026-05-29 cs.CR cs.AI 版本更新

Domain-Informed Representation for Evolutionary Sieving in Integral and Module Lattices

整格与模格中进化筛法的领域信息表示

Ahmad Tashfeen, Qi Cheng

发表机构 * University of Oklahoma(俄克拉荷马大学)

AI总结 针对格密码中最短向量问题(SVP),通过引入领域信息表示和交叉操作,将Ajtai等人的筛法改进为遗传算法,并自然扩展到模格。

Comments Published (16 pages) in the proceedings of EvoApplications 2026. You may find the proceedings version here at https://link.springer.com/chapter/10.1007/978-3-032-23604-3_9

详情
Journal ref
Lecture Notes in Computer Science 16524 (2026) 133-148
AI中文摘要

传统密码学基于整数分解或离散对数等问题,不可避免地容易受到全功能量子计算机的攻击。尽管这仍是一个工程前沿,但迫在眉睫的威胁延伸到今天存储的加密数据,这些数据将来可能被具有量子能力的计算机解密。为了防范这种可能性,现代量子安全密码学的支柱是最短向量问题(SVP)。我们通过引入领域信息的SVP表示和交叉操作,增强了Laarhoven对Ajtai等人筛法作为遗传算法(GA)处理SVP的方法,同时自然地将应用扩展到模格。

英文摘要

Traditional cryptography, rooted in problems such as integer factorisation or discrete logarithms, is inevitably vulnerable to a fully operational quantum computer. Although building one remains an engineering frontier, the looming threat extends to encrypted data stored today, which could be decrypted in the future with quantum capabilities. To safeguard against this eventuality, the backbone of modern quantum-safe cryptography is the Shortest Vector Problem (SVP). We enhance Laarhoven's treatment of Ajtai et al.'s sieving as a genetic algorithm (GA) for the SVP by incorporating a domain-informed SVP representation and crossover, while naturally extending the application to module lattices.

2605.28746 2026-05-29 math.OC cs.AI cs.NE 版本更新

Preference-Shaped Expected Hypervolume and R2 Improvement: Exact Computation and Monotonicity

偏好形状的期望超体积和R2改进:精确计算与单调性

Michael T. M. Emmerich

发表机构 * Faculty of Information Technology, University of Jyväskylä(于韦斯屈莱大学信息技术学院)

AI总结 本文研究了贝叶斯多目标优化中偏好形状的期望改进准则,精确计算了超体积和R2指标的期望改进,并分析了其单调性和几何特性。

Comments 17 pages; Changes from v1: added strict Pareto compliance proof, removed missing figure references and redundant graphics section, added Liang et al. 2026 citation in outlook, improved figures and language

详情
AI中文摘要

本文研究了贝叶斯多目标优化中偏好形状的期望改进准则。我们考虑了两个常用于类似算法目的但几何性质不同的指标族。超体积指标基于一个反乌托邦参考点,测量目标空间中的支配体积。R2指标基于一个乌托邦点,通过加权Tchebycheff标量化包络评估近似集。本文的目的是明确哪些偏好变换保留了精确计算、Pareto兼容性和单调性,哪些变换改变了底层几何。在超体积方面,我们通过Deng表示重新审视了经典的EHVI,在合意度(desirability)坐标中制定了乘积密度加权的EHVI,讨论了基于锥的EHVI作为线性锥变换后的普通EHVI,并将这些情况与截断EHVI区分开来,后者可能违反方差单调性。在R2方面,我们证明精确积分R2改进通常不是普通的目标空间加权超体积。障碍是低维的:Lebesgue密度超体积无法看到Tchebycheff标量化仍能检测到的某些边界贡献。然后我们证明精确积分R2改进恰好是一个标量化空间体积,即当前标量化包络与参考包络之间的Tchebycheff阴影的测度。该表示产生了离散R2的有限和ER2I算法、精确积分R2的求积方法,以及一个成就空间高斯代理公式,其中ER2I是标量高斯期望改进的积分。

英文摘要

This paper studies preference-shaped expected improvement criteria for Bayesian multiobjective optimization. We consider two indicator families which are often used for similar algorithmic purposes, but which are geometrically different. The hypervolume indicator is based on a dystopian reference point and measures dominated volume in objective space. The R2 indicator is based on a utopian point and evaluates approximation sets through weighted Tchebycheff scalarization envelopes. The purpose of the paper is to make precise which preference transformations preserve exact computation, Pareto compatibility, and monotonicity properties, and which transformations change the underlying geometry. On the hypervolume side, we revisit canonical EHVI through the Deng representation, formulate product-density weighted EHVI in desirability coordinates, discuss cone-based EHVI as ordinary EHVI after a linear cone transformation, and separate these cases from truncated EHVI, where variance monotonicity may fail. On the R2 side, we prove that exact integral R2 improvement is not, in general, an ordinary objective-space weighted hypervolume. The obstruction is lower-dimensional: Lebesgue-density hypervolume cannot see certain boundary contributions that Tchebycheff scalarizations still detect. We then show that exact integral R2 improvement is exactly a scalarization-space volume, namely the measure of the Tchebycheff shadow between the incumbent scalarization envelope and the reference envelope. This representation yields finite-sum ER2I algorithms for discrete R2, quadrature methods for exact integral R2, and an achievement-space Gaussian surrogate formulation in which ER2I is an integral of scalar Gaussian expected improvements.
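The discrete R2 indicator mentioned in the abstract above can be illustrated with a small Python sketch. It uses the commonly cited definition of R2 as the mean, over a finite set of weight vectors, of the best weighted Tchebycheff value attained by the approximation set relative to a utopian point (minimization assumed). The function names are hypothetical, and the sketch computes only the deterministic improvement from adding one known point; it is not the paper's ER2I algorithm, which takes an expectation under a Gaussian predictive distribution.

```python
from typing import List, Sequence

Point = Sequence[float]


def tchebycheff(point: Point, weights: Point, utopian: Point) -> float:
    """Weighted Tchebycheff scalarization: max_j w_j * |f_j - z*_j|."""
    return max(w * abs(f - z) for f, w, z in zip(point, weights, utopian))


def r2_indicator(front: Sequence[Point], weight_vectors: Sequence[Point],
                 utopian: Point) -> float:
    """Discrete R2: average over weights of the smallest scalarized value in the set."""
    total = 0.0
    for w in weight_vectors:
        # The "scalarization envelope" at w is the best value any point achieves.
        total += min(tchebycheff(p, w, utopian) for p in front)
    return total / len(weight_vectors)


def r2_improvement(front: Sequence[Point], candidate: Point,
                   weight_vectors: Sequence[Point], utopian: Point) -> float:
    """Decrease in R2 when `candidate` is added to `front` (never negative)."""
    extended: List[Point] = list(front) + [candidate]
    return (r2_indicator(front, weight_vectors, utopian)
            - r2_indicator(extended, weight_vectors, utopian))
```

With the front `[(1.0, 3.0), (3.0, 1.0)]`, utopian point `(0.0, 0.0)`, and weights `(1, 0)`, `(0.5, 0.5)`, `(0, 1)`, adding the candidate `(2.0, 2.0)` lowers the envelope only at the middle weight vector, so the improvement comes entirely from that one scalarization.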

2604.04956 2026-05-29 physics.soc-ph cs.AI cs.CY physics.pop-ph 版本更新

The Planetary Cost of AI Acceleration, Part II: The 10th Planetary Boundary and the 6.5-Year Countdown

人工智能加速的行星成本,第二部分:第十个行星边界与6.5年倒计时

William Yicheng Zhu, Lei Zhu

AI总结 本研究指出,大规模语言模型(LLM)的指数级扩展导致“思考”本身的热力学后果,并预测在6.5年内将突破行星热阈值,提出AI热排放构成第十个行星边界。

Comments Minor revisions for clarity

详情
AI中文摘要

近期,自主大型语言模型(LLM)代理的超指数级扩展标志着更广泛、根本性的范式转变:从机器主要替代人类双手(体力劳动和机械加工)转向机器代表人类思维(认知、推理和意图)。超出人类有限但高效的生物能力,“思考”本身不受控制的卸载和扩展对人类的热平衡表产生深远影响,因为思考或智能具有热力学后果。地球已经超过了长期生态稳定性所需的热耗散阈值,基于经验数据的预测揭示了一条令人担忧的轨迹:如果没有激进的结构性干预,即使在最理想的情况下(地球能量不平衡(EEI)保持恒定),人为热积累将在不到6.5年内突破关键的行星生态阈值。在这项工作中,我们确定了人工智能中影响全球热耗散率的六个因素,并描述了它们如何相互作用推动社会走向四种宏观轨迹之一。我们提出,人工智能及其热耗散融入行星系统构成了第十个行星边界(9+1)。该边界的核心经验测量是由人工智能指数增长产生的净新增废热,平衡其对减少经济和社会低效率以及因此减少基线人为废热排放的影响。我们证明,管理人工智能扩展缺乏适度的中间地带:它将要么加速关键行星热力学阈值的突破,要么成为稳定其他九个行星边界的最有效杠杆,从而保障人类文明的生存。

英文摘要

The recent, super-exponential scaling of autonomous Large Language Model (LLM) agents signals a broader, fundamental paradigm shift from machines primarily replacing human hands (manual labor and mechanical processing) to machines acting on behalf of human minds (cognition, reasoning, and intention). The uncontrolled offloading and scaling of "thinking" itself, beyond humans' limited but efficient biological capacity, has profound consequences for humanity's heat balance sheet, since thinking, or intelligence, carries thermodynamic consequences. The Earth has already surpassed the heat dissipation threshold required for long-term ecological stability, and projections based on empirical data reveal a concerning trajectory: without radical structural intervention, anthropogenic heat accumulation will breach critical planetary ecological thresholds in less than 6.5 years, even under the most ideal scenario where Earth Energy Imbalance (EEI) holds constant. In this work, we identify six factors from artificial intelligence that influence the global heat dissipation rate and delineate how their interplay drives society toward one of four broad macroscopic trajectories. We propose that the integration of artificial intelligence and its heat dissipation into the planetary system constitutes the tenth planetary boundary (9+1). The core empirical measurement of this boundary is the net-new waste heat generated by exponential AI growth, balanced against its impact on reducing economic and societal inefficiencies and thus baseline anthropogenic waste heat emissions. We demonstrate that managing AI scaling lacks a moderate middle ground: it will either accelerate the breach of critical planetary thermodynamic thresholds, or it will serve as the single most effective lever for stabilizing the other nine planetary boundaries and thereby safeguarding human civilization's survival.

2602.11389 2026-05-29 cs.AI 版本更新

Causal-JEPA: Learning World Models through Object-Level Latent Masking

Causal-JEPA:通过对象级潜在掩码学习世界模型

Heejeong Nam, Quentin Le Lidec, Lucas Maes, Yann LeCun, Randall Balestriero

发表机构 * Brown University, GalilAI(布朗大学,GalilAI) New York University(纽约大学)

AI总结 提出C-JEPA,一种通过对象级潜在掩码扩展联合嵌入预测的对象中心世界模型,在视觉问答中将反事实推理绝对提升约20%,在智能体控制任务中仅用1%的潜在输入特征即可实现性能相当的高效规划。

Comments Project Page: https://hazel-heejeong-nam.github.io/cjepa/ ICML 2026 Accepted

详情
AI中文摘要

世界模型需要稳健的关系理解以支持预测、推理和控制。虽然对象中心表示提供了有用的抽象,但不足以捕捉依赖交互的动态。因此,我们提出C-JEPA,一种简单灵活的对象中心世界模型,将掩码联合嵌入预测从图像块扩展到对象中心表示。通过掩码对象级潜在变量并要求每个掩码对象状态从周围上下文中推断,C-JEPA在训练期间施加了结构化的部分可观测性,创建了类似反事实的预测查询,阻止捷径解决方案,并在学习目标下使依赖交互的预测成为必要。实验上,C-JEPA在视觉问答中取得了一致的提升,与没有对象级掩码的相同架构相比,反事实推理绝对提高了约20%。在智能体控制任务中,C-JEPA仅使用基于块的世界模型所需总潜在输入特征的1%,即可实现相当的性能,从而实现了更高效的规划。最后,我们提供了形式化分析,证明对象级掩码通过控制可观测性引入了有用的归纳偏置。我们的代码可在https://github.com/galilai-group/cjepa获取。

英文摘要

World models require robust relational understanding to support prediction, reasoning, and control. While object-centric representations provide a useful abstraction, they are not sufficient to capture interaction-dependent dynamics. We therefore propose C-JEPA, a simple and flexible object-centric world model that extends masked joint embedding prediction from image patches to object-centric representations. By masking object-level latents and requiring each masked object state to be inferred from the surrounding context, C-JEPA imposes structured partial observability during training, creating counterfactual-like prediction queries that discourage shortcut solutions and make interaction-dependent prediction necessary under the learning objective. Empirically, C-JEPA leads to consistent gains in visual question answering, with an absolute improvement of about 20% in counterfactual reasoning over the same architecture without object-level masking. On agent control tasks, C-JEPA enables substantially more efficient planning by using only 1% of the total latent input features required by patch-based world models, while achieving comparable performance. Finally, we provide a formal analysis demonstrating that object-level masking induces useful inductive bias by controlling observability. Our code is available at https://github.com/galilai-group/cjepa.
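The object-level masking idea described above can be sketched in plain Python. The abstract does not specify the predictor, the mask token, or the loss, so everything here is an assumption for illustration: slots are lists of floats, a fixed `mask_token` replaces hidden objects, and the loss is a mean squared error over the masked slots only. The names `mask_objects` and `masked_mse` are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Latent = List[float]


def mask_objects(slots: Sequence[Latent], masked_idx: Sequence[int],
                 mask_token: Latent) -> Tuple[List[Latent], Dict[int, Latent]]:
    """Hide the chosen object slots and keep their originals as prediction targets."""
    masked = set(masked_idx)
    # Context: masked slots are replaced by the mask token, others are copied.
    context = [list(mask_token) if i in masked else list(s)
               for i, s in enumerate(slots)]
    # Targets: the original latents of the masked objects, keyed by slot index.
    targets = {i: list(slots[i]) for i in sorted(masked)}
    return context, targets


def masked_mse(predicted: Sequence[Latent], targets: Dict[int, Latent]) -> float:
    """Mean squared error computed only on the masked object slots."""
    errs = [(p - t) ** 2
            for i, tgt in targets.items()
            for p, t in zip(predicted[i], tgt)]
    return sum(errs) / len(errs) if errs else 0.0
```

Because the loss ignores visible slots, a predictor can only reduce it by inferring each hidden object's state from the other objects, which is the partial-observability pressure the abstract describes.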

2601.22139 2026-05-29 cs.CL cs.AI 版本更新

Reasoning While Asking: Transforming Reasoning Large Language Models from Passive Solvers to Proactive Inquirers

边推理边提问:将推理型大语言模型从被动求解者转变为主动询问者

Xin Chen, Feng Jiang, Yiqian Zhang, Hardy Chen, Shuo Yan, Wenya Xie, Min Yang, Shujian Huang

发表机构 * National Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室) Artificial Intelligence Research Institute(人工智能研究院) Shenzhen Institutes of Advanced Technology(深圳先进技术研究院) University of California, Santa Cruz(加州大学圣克鲁兹分校) University of Texas, Dallas(德克萨斯大学达拉斯分校) University of Minnesota(明尼苏达大学)

AI总结 提出主动交互推理(PIR)范式,通过不确定性感知微调和用户模拟器策略优化,使LLM在推理中主动提问以澄清前提和意图不确定性,在数学推理、代码生成和文档编辑任务上显著提升准确率、通过率和BLEU值,同时减少近半推理计算和不必要交互。

Comments ACL Main Conference

详情
AI中文摘要

面向推理的大语言模型(LLMs)通过思维链(CoT)提示取得了显著进展,但它们仍然受到一种“盲目自我思考”范式的根本限制:即使在关键信息缺失或模糊的情况下,也会进行大量的内部推理。我们提出了主动交互推理(PIR),一种新的推理范式,将LLMs从被动求解者转变为主动询问者,在推理过程中穿插澄清。与现有的主要通过与外部环境交互来解决知识不确定性的搜索或工具框架不同,PIR通过与用户直接交互来解决前提和意图层面的不确定性。PIR通过两个核心组件实现:(1)一种不确定性感知的监督微调过程,赋予模型交互推理能力;(2)一个基于用户模拟器的策略优化框架,由复合奖励驱动,使模型行为与用户意图对齐。在数学推理、代码生成和文档编辑上的大量实验表明,PIR始终优于强基线,准确率提高高达32.70%,通过率提高22.90%,BLEU提升41.36,同时减少近一半的推理计算和不必要的交互轮次。在事实知识、问答和缺失前提场景上的进一步可靠性评估证实了PIR的强大泛化能力和鲁棒性。模型和代码公开于:https://github.com/SUAT-AIRI/Proactive-Interactive-R1

英文摘要

Reasoning-oriented Large Language Models (LLMs) have achieved remarkable progress with Chain-of-Thought (CoT) prompting, yet they remain fundamentally limited by a "blind self-thinking" paradigm: performing extensive internal reasoning even when critical information is missing or ambiguous. We propose Proactive Interactive Reasoning (PIR), a new reasoning paradigm that transforms LLMs from passive solvers into proactive inquirers that interleave reasoning with clarification. Unlike existing search- or tool-based frameworks that primarily address knowledge uncertainty by querying external environments, PIR targets premise- and intent-level uncertainty through direct interaction with the user. PIR is implemented via two core components: (1) an uncertainty-aware supervised fine-tuning procedure that equips models with interactive reasoning capability, and (2) a user-simulator-based policy optimization framework driven by a composite reward that aligns model behavior with user intent. Extensive experiments on mathematical reasoning, code generation, and document editing demonstrate that PIR consistently outperforms strong baselines, achieving up to 32.70% higher accuracy, 22.90% higher pass rate, and 41.36 BLEU improvement, while cutting reasoning computation and unnecessary interaction turns nearly in half. Further reliability evaluations on factual knowledge, question answering, and missing-premise scenarios confirm the strong generalization and robustness of PIR. Model and code are publicly available at: https://github.com/SUAT-AIRI/Proactive-Interactive-R1

2601.07525 2026-05-29 cs.CL cs.AI 版本更新

Thinking Before Constraining: A Unified Decoding Framework for Large Language Models

先思考再约束:大型语言模型的统一解码框架

Ngoc Trinh Hung Nguyen, Alonso Silva, Laith Zumot, Liubov Tupikina, Armen Aghasaryan, Mehwish Alam

发表机构 * Télécom Paris, Institut Polytechnique de Paris(巴黎电信学院,巴黎理工学院) Nokia Bell Labs(诺基亚贝尔实验室) Nokia(诺基亚)

AI总结 提出In-Writing混合方法,通过触发令牌将自由形式推理与结构化解码解耦,在分类和推理任务上准确率提升高达27%。

Comments v2-EMNLP

详情
AI中文摘要

自然生成允许大型语言模型(LLMs)产生具有丰富推理的自由形式响应,但缺乏结构使得输出难以验证。相反,约束解码确保标准化格式,但可能在生成过程中过早施加约束,从而无意中限制推理能力。我们提出一种混合方法,即In-Writing,它在单次调用中结合了自由形式推理和结构化生成。模型首先执行无约束推理,仅在生成触发令牌后应用结构化解码,明确地将推理与格式化解耦。我们证明,我们的触发令牌策略能够几乎消除过早触发,即约束解码中断正在进行推理的失败模式。在涵盖分类和推理任务的多个数据集上的评估表明,我们的方法优于现有技术,在自然生成基础上准确率提升高达27%。我们的代码可在https://github.com/Nokia-Bell-Labs/InWriting获取。

英文摘要

Natural generation allows Large Language Models (LLMs) to produce free-form responses with rich reasoning, yet the lack of structure makes outputs difficult to verify. Conversely, constrained decoding ensures standardized formats but can inadvertently restrict reasoning capabilities by imposing constraints too early in the generation process. We propose a hybrid approach, namely In-Writing, that combines free-form reasoning and structured generation in a single call. The model first performs unconstrained reasoning and only applies structured decoding after a trigger token is generated, explicitly decoupling reasoning from formatting. We establish that our trigger-token strategies virtually eradicate premature triggering, a failure mode in which constrained decoding interrupts ongoing reasoning. Evaluations across diverse datasets covering classification and reasoning tasks demonstrate that our approach outperforms the state-of-the-art by achieving accuracy gains of up to 27% over natural generation. Our code is available at: https://github.com/Nokia-Bell-Labs/InWriting.
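The two-phase idea in this abstract (free-form reasoning first, structured decoding only after a trigger token) can be sketched with a toy decoder. This is not the In-Writing implementation: the trigger string `<answer>`, the callables `next_token` and `score_tokens`, and the forced trigger when the free-form budget runs out are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Sequence

TRIGGER = "<answer>"  # hypothetical trigger token, not taken from the paper


def decode_with_trigger(
    next_token: Callable[[List[str]], str],
    score_tokens: Callable[[List[str]], Dict[str, float]],
    allowed: Sequence[str],
    max_free_tokens: int = 64,
) -> List[str]:
    """Generate freely until TRIGGER appears, then emit one constrained token."""
    tokens: List[str] = []
    # Phase 1: unconstrained generation; no format restriction is applied here.
    for _ in range(max_free_tokens):
        tok = next_token(tokens)
        tokens.append(tok)
        if tok == TRIGGER:
            break
    else:
        # Budget exhausted without a trigger: force the switch (an assumption).
        tokens.append(TRIGGER)
    # Phase 2: constrained decoding; only tokens in `allowed` may be chosen.
    scores = score_tokens(tokens)
    best = max(allowed, key=lambda t: scores.get(t, float("-inf")))
    tokens.append(best)
    return tokens
```

In this toy version the constraint is a flat label set, which matches a classification task; a structured format such as JSON would need a grammar or schema in phase 2 instead of a single `max` over labels.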

2511.14426 2026-05-29 cs.LG cond-mat.mtrl-sci cs.AI physics.comp-ph 版本更新

MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

MiAD: 幻影原子扩散用于从头晶体生成

Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Mikhail Lazarev, Andrey E Ustyuzhanin, Dmitry Vetrov

发表机构 * Higher School of Economics(俄罗斯高等经济学院) Moscow State University(莫斯科国立大学) Constructor University of Bremen(不来梅Constructor大学)

AI总结 提出幻影注入技术,使扩散模型能在生成过程中改变原子数量,显著提升晶体生成质量,在MP-20数据集上实现8.2%的S.U.N.率。

详情
AI中文摘要

近年来,基于扩散的模型在搜索同时稳定、独特和新颖(S.U.N.)的晶体材料方面表现出卓越性能。然而,大多数这些模型在生成过程中无法改变晶体中的原子数量,这限制了模型采样轨迹的多样性。在本文中,我们展示了这种限制的严重性,并引入了一种简单而强大的技术——幻影注入,它使扩散模型能够将构成晶体的原子状态从存在变为不存在(幻影),反之亦然。我们表明,与没有这种修改的相同模型相比,该技术将模型质量提高了多达2.5倍。由此产生的模型,幻影原子扩散(MiAD),是一种用于从头晶体生成的等变联合扩散模型,能够在生成过程中改变原子数量。MiAD在MP-20数据集上实现了8.2%的S.U.N.率,大大超过了现有的最先进方法。代码:https://github.com/andrey-okhotin/miad.git

英文摘要

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models cannot change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by a factor of up to 2.5 compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an 8.2% S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. Code: https://github.com/andrey-okhotin/miad.git

2605.30284 2026-05-29 cs.AI 版本更新

ProjectionBench: Evaluating Scientific Hypothesis Generation in LLMs Under Progressive Information Disclosure

ProjectionBench: 在渐进信息揭示下评估大语言模型的科学假设生成

A. J. Lew, Y. Cao, M. J. Buehler

发表机构 * Unreasonable Labs

AI总结 提出ProjectionBench框架,通过渐进式信息揭示评估大语言模型在科学发现中的创新性和推理能力,实验表明GPT-5.4在最小上下文下仍保持0.7 F1分数与真实结论对齐。

Comments 19 pages, 4 figures

详情
AI中文摘要

科学发现本质上是一个创造性和不确定的过程,需要超越已知知识的推理。尽管许多基准测试通过多跳检索评估大语言模型在深度研究任务上的表现,但其对真正科学发现至关重要的创新推理能力仍未得到充分测试。我们引入了一个基准框架,用于评估模型在科学发现和推理中的表现,从原始问题逐步构建到经典零假设检验。在我们的框架中,模型最初仅接收来自近期论文的主题和研究问题,技术细节逐步揭示。在每个信息揭示阶段,模型需要生成针对研究问题的假设,这些假设与原始论文的结论进行比较,并通过组成原子声明的自动语义相似性进行评估。这种对与真实结论语义偏离的渐进评估,使得能够评估模型的创新性(在最小信息下)到基于推理的能力(在完整实验细节下),这两者对于将大语言模型用于科学发现都至关重要。我们的框架为系统评估大语言模型的科学推理和发现能力提供了基础,这对于推动下一代AI科学家/协同科学家系统的发展至关重要。具体来说,我们在涵盖生物活性材料、机械材料和纳米材料的45篇论文上评估了GPT-5、GPT-5.4、Gemini 2.5 pro和Gemini 3.1 pro preview。我们发现GPT-5.4和Gemini 3.1 pro的表现优于其前代版本,特别是GPT-5.4即使在最小上下文下仍保持0.7 F1分数与真实结论对齐。

英文摘要

Scientific discovery is an inherently creative and uncertain process, requiring reasoning beyond the recall of known knowledge. While many benchmarks have been proposed to evaluate large language model (LLM) performance on deep research tasks via multi-hop retrieval, the innovative reasoning abilities essential for true scientific discovery remain largely untested. We introduce a benchmark framework for evaluating model performance in scientific discovery and reasoning, building up from a raw problem to the classical null hypothesis test. In our framework, models initially receive only the topic and research question from a recent paper, with technical details progressively revealed. At each stage of information disclosure, the model is tasked with generating hypotheses that address the research question; these are compared with the conclusions from the original paper and evaluated via automated semantic similarity of their constituent atomic claims. This progressive evaluation of semantic divergence from ground-truth conclusions enables assessment of capabilities ranging from a model's innovativeness (under minimal information) to its grounded reasoning (under full experimental details), both critical for using LLMs for scientific discovery purposes. Our framework provides a foundation for systematically evaluating scientific reasoning and discovery capabilities in LLMs, crucial for advancing the development of next-generation AI scientist/co-scientist systems. Specifically, here we evaluate GPT-5, GPT-5.4, Gemini 2.5 pro, and Gemini 3.1 pro preview across 45 papers spanning bioactive materials, mechanical materials, and nanomaterials. We find that GPT-5.4 and Gemini 3.1 pro outperform their previous generation counterparts as expected, and GPT-5.4 in particular maintains 0.7 F1 score alignment with ground truth conclusions even under minimal context.
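A claim-level F1 score of the kind reported above can be sketched as follows. The abstract does not say how atomic claims are matched, so this sketch makes two loud assumptions: similarity is a simple token-overlap (Jaccard) score standing in for the benchmark's automated semantic similarity, and a claim counts as matched when its best similarity reaches a fixed threshold. The names `jaccard` and `claim_f1` are hypothetical.

```python
from typing import Callable, Sequence


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a crude stand-in for a semantic similarity model."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def claim_f1(predicted: Sequence[str], reference: Sequence[str],
             threshold: float = 0.5,
             sim: Callable[[str, str], float] = jaccard) -> float:
    """F1 over atomic claims: precision on predicted claims, recall on reference claims."""
    if not predicted or not reference:
        return 0.0
    # A predicted claim is correct if some reference claim is similar enough.
    hit_pred = sum(1 for p in predicted
                   if any(sim(p, r) >= threshold for r in reference))
    # A reference claim is recovered if some predicted claim is similar enough.
    hit_ref = sum(1 for r in reference
                  if any(sim(p, r) >= threshold for p in predicted))
    precision = hit_pred / len(predicted)
    recall = hit_ref / len(reference)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

Running the same scoring at each disclosure stage would give one F1 value per stage, which is how a curve from minimal context to full experimental detail could be traced under these assumptions.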

2605.30283 2026-05-29 cs.AI cs.ET 版本更新

mcp-proto-okn: Natural-language access to open scientific knowledge graphs through the Model Context Protocol

mcp-proto-okn:通过模型上下文协议实现对开放科学知识图谱的自然语言访问

Peter W. Rose, Benjamin M. Good, Amanda M. Saravia-Butler, Charlotte A. Nelson, James P. Balhoff, Yaphet Kebede, Patricia L. Whetzel, Christopher Bizon, Andrew I. Su, Sergio E. Baranzini

发表机构 * San Diego Supercomputer Center, University of California San Diego(圣地亚哥超级计算机中心,加州大学圣地亚哥分校) The Scripps Research Institute(斯克里普斯研究所) Amentum, Space Biosciences Division, NASA Ames Research Center(Amentum,空间生物科学部,NASA艾姆斯研究中心) Mate Bioservices(Mate生物服务) Renaissance Computing Institute, University of North Carolina at Chapel Hill(复兴计算研究所,北卡罗来纳大学教堂山分校) Weill Institute for Neurosciences, Department of Neurology, University of California San Francisco(Weill神经科学研究所,神经病学系,加州大学旧金山分校)

AI总结 提出基于模型上下文协议的服务器mcp-proto-okn,使AI助手能通过自然语言发现、查询和集成科学知识图谱,降低跨领域知识图谱分析门槛。

Comments 9 pages, 1 figure

详情
AI中文摘要

MCP Server Proto-OKN (mcp-proto-okn) 是一个基于Python的模型上下文协议服务器,使AI助手能够通过自然语言发现、检查、查询和集成科学知识图谱。该服务器提供图路由、模式检查、SPARQL执行、本体扩展、多图查询和对话记录生成功能,降低了生物医学和科学用户进行跨领域知识图谱分析的门槛。mcp-proto-okn使用FastMCP框架在Python中实现,可在https://github.com/sbl-sdsc/mcp-proto-okn获取。GitHub仓库提供了文档、客户端配置说明和示例分析记录。

英文摘要

MCP Server Proto-OKN (mcp-proto-okn) is a Python-based Model Context Protocol server that enables AI assistants to discover, inspect, query and integrate scientific knowledge graphs through natural language. The server provides graph routing, schema inspection, SPARQL execution, ontology expansion, multi-graph querying, and transcript generation, lowering the barrier to cross-domain knowledge graph analysis for biomedical and scientific users. mcp-proto-okn is implemented in Python using the FastMCP framework and is available at https://github.com/sbl-sdsc/mcp-proto-okn. Documentation, client configuration instructions, and example analysis transcripts are provided in the GitHub repository.

2605.30274 2026-05-29 cs.CL cs.AI 版本更新

Loong: A Human-Like Long Document Translation Agent with Observe-and-Act Adaptive Context Selection

Loong: 一种类人长文档翻译代理,具有观察与行动的适应性上下文选择

Yutong Wang, Xuebo Liu, Derek F. Wong, Zhilin Li, Rongqing Jiang, Min Zhang, Shimin Tao, Daimeng Wei, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区计算与智能研究院) NLP2CT Lab, Department of Computer and Information Science, University of Macau(澳门大学计算机与信息科学系NLP2CT实验室) Huawei Translation Services Center(华为翻译服务中心)

AI总结 提出Loong代理,通过3E记忆模块和强化学习优化上下文策略,解决长文档翻译中上下文窗口限制和冗余信息问题,在英⇄中、德、法翻译中平均提升13.0分。

详情
AI中文摘要

文档级翻译仍然是大型语言模型最具挑战性的任务之一,它们受到有限上下文窗口的限制,阻碍了全局连贯性,同时遭受冗余上下文信息的影响,降低了翻译质量。为了解决这个问题,我们提出了一种名为Loong的类人长文档翻译代理,它利用3E记忆模块(精华-示例-实体)存储摘要、句子对和实体记录作为历史上下文。Loong不是被动地关注所有历史,而是进行深度推理,自适应地识别翻译指导的最佳上下文。Loong通过强化学习优化其上下文策略,利用从其自身采样的观察与行动推理轨迹中得出的偏好数据。实证评估表明,Loong在英语⇄中文、德语和法语方向上实现了显著的翻译质量提升,在三个评估指标上平均提升高达13.0分。此外,Loong在跨领域和对抗上下文噪声方面表现出强大的泛化能力和鲁棒性,同时在超长文档翻译中保持显著的稳定性。我们的代码发布在https://github.com/YutongWang1216/LoongDocMT。

英文摘要

Document-level translation remains one of the most challenging tasks for large language models, which are constrained by limited context windows that impede global cohesion, while simultaneously suffering from redundant contextual information that degrades translation quality. To address this, we propose a human-like long document translation agent called Loong, which leverages a 3E memory module (Essence-Exemplar-Entity) to store summaries, sentence pairs, and entity records as historical context. Instead of passively attending to all history, Loong performs deep reasoning to adaptively identify the optimal context for translation guidance. Loong optimizes its context policy through reinforcement learning, utilizing preference data derived from its own sampled observe-and-act reasoning trajectories. Empirical evaluations demonstrate that Loong achieves substantial translation quality improvements in English $\Leftrightarrow$ Chinese, German, and French directions, with average gains of up to 13.0 points across the three evaluation metrics. Furthermore, Loong exhibits strong generalization across domains and robustness against contextual noise, while maintaining remarkable stability in ultra-long document translation. Our code is released at https://github.com/YutongWang1216/LoongDocMT.

2605.30273 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

LLUMI: 利用在线社区反馈改进心理健康支持中的LLM写作辅助

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar, Dong Whi Yoo, Eshwar Chandrasekharan, Koustuv Saha

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Indiana University Indianapolis(印第安纳大学印第安纳波利斯分校)

AI总结 提出LLUMI框架,通过在线社区反馈(如Reddit投票)构建偏好对,结合监督微调和直接偏好优化训练开源小模型,在隐私保护下实现与GPT相当的心理健康支持性能。

详情
AI中文摘要

大型语言模型在生成心理健康问题的支持性回复方面展现出潜力,但提升其有用性、共情能力和安全性通常需要大量计算、专家输入和标注数据。同时,在心理健康相关交互中部署专有云模型会引发重要的隐私和数据治理问题。为解决这一挑战,我们提出了LLUMI设置,该设置可在受保护环境内部署。LLUMI包含两个互补组件:生成模型(GM)起草对心理健康问题的支持性回复,以及改进模型(IM)修改初始人工编写的回复。我们利用Reddit心理健康社区的反馈信号,使用社区认可模式(如点赞和点踩)构建用于监督微调和直接偏好优化的选择-拒绝回复对。我们还通过五个维度(可读性、共情、连接、可操作性和安全性)的人工评估进一步对齐LLUMI。结果表明,尽管依赖较小的开源模型而非专有云GPT模型,LLUMI在语言分析和人工评估中均实现了相当的性能。这些发现表明,使用社区衍生的偏好信号训练的开源模型可以支持高质量的心理健康支持辅助,同时为敏感的支持场景提供更保护隐私的替代方案。

英文摘要

Large language models (LLMs) show promise in generating supportive responses for mental health queries, but improving their usefulness, empathy, and safety often requires substantial compute, expert input, and labeled data. At the same time, deploying proprietary, cloud-based models for mental health-related interactions raises important privacy and data-governance concerns, given the sensitivity of these interactions. To address this challenge, we introduce LLUMI, a setup that can be hosted in-house within protected environments. LLUMI consists of two complementary components: a generation model (GM), which drafts supportive responses to mental health queries, and an improvement model (IM), which revises an initial human-crafted response. We leverage feedback signals from Reddit mental health communities, using community endorsement patterns such as upvotes and downvotes to construct chosen-rejected response pairs for Supervised Fine-Tuning (SFT) and Direct Preference Optimization (DPO). We further align LLUMI using human evaluation across five dimensions: readability, empathy, connection, actionability, and safety. Our results show that, despite relying on smaller open-source models rather than proprietary cloud-based GPT models, LLUMI achieves comparable performance across linguistic analyses and human evaluations. These findings suggest that open-source models, when trained with community-derived preference signals, can provide high-quality mental health support assistance while offering a more privacy-preserving alternative for sensitive support contexts.

2605.30268 2026-05-29 cs.CV cs.AI 版本更新

PhyGenHOI: Physically-Aware 4D Generation of Dynamic Human-Object Interactions

PhyGenHOI:物理感知的动态人-物交互4D生成

Omer Benishu, Gal Fiebelman, Sagie Benaim

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学)

AI总结 提出PhyGenHOI框架,结合运动扩散模型和物质点方法,通过窗口吸引损失、接触驱动重模拟和掩码视频SDS目标,生成物理一致且视觉逼真的4D人-物交互动态场景。

详情
AI中文摘要

我们解决了生成物理准确且视觉逼真的4D人-物交互(HOI)的任务。给定一个静态3D人体和以3D高斯泼溅(3DGS)表示的目标物体,我们的目标是合成动态场景,其中人体根据给定的输入文本主动与物体交互,例如拳击或踢腿。为此,我们引入了PhyGenHOI,一种新颖的框架,将生成式人体运动与显式物理物体模拟相结合。我们将人体建模为由运动扩散模型(MDM)驱动的语义智能体,将物体建模为通过物质点方法(MPM)模拟的物理智能体,并利用3D高斯作为统一的、可微分的表示。我们通过三种耦合机制监督它们的交互:(1)窗口吸引损失,时间上同步生成运动以拦截物体;(2)接触驱动重模拟步骤,在碰撞时触发物理一致动量传递;(3)掩码视频SDS目标,注入基于视频的先验以增强接触保真度。实验表明,PhyGenHOI在多种动作、人体和物体上生成物理一致的4D HOI,优于基线方法。项目页面和视频:https://omerbenishu.github.io/PhyGenHOI/

英文摘要

We address the task of generating physically accurate and visually faithful 4D Human-Object Interaction (HOI). Given a static 3D human and target object represented as 3D Gaussian Splats (3DGS), our goal is to synthesize dynamic scenes where the human actively engages with the object through actions, such as punching or kicking, in accordance with a given input text. To this end, we introduce PhyGenHOI, a novel framework that couples generative human motion with an explicit physical object simulation. We model the human as a semantic agent driven by a Motion Diffusion Model (MDM) and the object as a physical agent simulated via the Material Point Method (MPM), utilizing 3D Gaussians as a unified, differentiable representation. We supervise their interaction through three coupled mechanisms: (1) A Windowed Attraction Loss that temporally synchronizes generative motion to intercept the object; (2) A Contact-Driven Re-simulation step that triggers physically consistent momentum transfer upon impact; and (3) A Masked Video-SDS objective that injects video-based priors to enhance contact fidelity. Experiments show PhyGenHOI generates physically consistent 4D HOI across diverse actions, humans, and objects, outperforming baselines. Project page and videos: https://omerbenishu.github.io/PhyGenHOI/

2605.30260 2026-05-29 cs.CL cs.AI cs.CV cs.LG 版本更新

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

LoRA如何记忆?大语言模型微调的参数记忆定律

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出参数记忆定律,揭示LoRA在微调中参数与序列长度对损失降低的幂律关系,并基于此设计MemFT优化策略提升记忆保真度与效率。

Comments Ongoing work

详情
AI中文摘要

大型语言模型(LLM)必须持续学习和更新知识,以在动态的真实世界环境中保持有效。虽然低秩适应(LoRA)被广泛用于此类记忆更新,但现有研究主要依赖于定性的下游评估,使得精确参数记忆的定量容量限制和潜在动态在很大程度上未被探索。为了弥合这一差距,我们在潜在空间中使用LoRA作为受控记忆容量探针,以系统量化精确参数记忆。我们引入了参数记忆定律,这是一个将损失降低ΔL与有效参数和序列长度联系起来的稳健幂律。在令牌级别,细粒度分析揭示了确定性相变,表明在贪婪解码下,预测概率p > 0.5构成逐字回忆的充分条件。基于这些见解,我们引入了MemFT,一种阈值引导的优化策略,该策略动态地将训练预算重新分配给低于阈值的令牌。实证评估表明,MemFT可以提高记忆保真度和效率。代码将在https://github.com/zjunlp/ParametricMemoryLaw发布。

英文摘要

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction $\Delta L$ to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
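The sufficiency of p > 0.5 for verbatim recall under greedy decoding follows from normalization of the next-token distribution. The short derivation below is the standard argument, written with assumed notation ($p_t(v)$ for the model's probability of token $v$ at position $t$ given the correct prefix, $y_t$ for the memorized token); it is not necessarily how the paper presents it.

```latex
\[
p_t(y_t) > \tfrac{1}{2}
\;\Longrightarrow\;
\forall v \neq y_t:\quad
p_t(v) \;\le\; 1 - p_t(y_t) \;<\; \tfrac{1}{2} \;<\; p_t(y_t)
\;\Longrightarrow\;
\arg\max_{v} p_t(v) = y_t .
\]
```

If the condition holds at every position, greedy decoding emits $y_1$, then conditions on the correct prefix and emits $y_2$, and so on by induction, reproducing the sequence verbatim. The condition is sufficient but not necessary: a token with probability 0.4 is still the greedy choice when every competitor has probability 0.3 or less.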

2605.30251 2026-05-29 cs.CL cs.AI 版本更新

Same Evidence, Different Answers: Canonical-Context On-Policy Distillation for Multi-Turn Language Models

相同证据,不同答案:面向多轮语言模型的规范上下文在线策略蒸馏

Zizhuo Lin, Quanling Liu, Jinsheng Quan, Chao Zhang, Yifan Zhu, Xing Shi, Jingtao Xu, Zhihui Li, Yawei Luo

发表机构 * Zhejiang University(浙江大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出规范上下文在线策略蒸馏(CCOPD)方法,通过教师-学生框架对齐模型在完整提示和逐步揭示信息下的行为,减少自我锚定漂移,在多轮数学对话上训练后,在原始分片任务上平均提升32%性能。

详情
AI中文摘要

大型语言模型(LLMs)通常在单次提示中给出所有指令时能解决任务,但当相同信息在多个轮次中逐步揭示时却会失败。当干净的完整提示和原始分片对话包含相同的完整用户证据时,模型仍应得出相同的答案。我们认为造成这一差距的关键原因是自我锚定漂移:在部分信息下产生的响应引入了未经支持的假设,而这些假设随后扭曲了最终答案。为了减少这种影响,我们提出了规范上下文在线策略蒸馏(CCOPD)。在训练过程中,同一基础模型扮演两个角色:一个冻结的教师模型,以干净的完整提示为条件;一个可训练的学生模型,通过多轮对话逐步接收相同的证据;CCOPD将学生在其自身轨迹上的行为与教师的规范全上下文行为对齐。仅在数学问题对话上训练后,CCOPD在数学和五个零样本跨领域任务族上的原始分片性能相比原始基础模型平均提升32%,同时基本保持全上下文性能。进一步分析表明,CCOPD增强了基于用户证据的推理,并减少了对早期助手轮次污染的敏感性。

英文摘要

Large language models (LLMs) often solve a task when all instructions are given in a single prompt, but fail when the same information is revealed gradually across turns. When a clean FULL prompt and a RAW-SHARDED conversation contain the same complete user evidence, the model should still arrive at the same answer. We argue that a key reason for this gap is self-anchored drift: responses produced under partial information introduce unsupported assumptions, and those assumptions later distort the final answer. To reduce this effect, we propose Canonical-Context On-Policy Distillation (CCOPD). During training, the same base model is used in two roles: a frozen teacher conditioned on the clean FULL prompt and a trainable student that receives the same evidence incrementally through a multi-turn conversation; CCOPD aligns the student's behavior on its own trajectories with the teacher's canonical full-context behavior. Trained only on math problem conversations, CCOPD yields a 32% average relative improvement in RAW-SHARDED performance over the original base model across math and five zero-shot out-of-domain task families, while largely preserving full-context performance. Further analyses suggest that CCOPD strengthens grounding in user evidence and reduces sensitivity to contamination from earlier assistant turns.

2605.30244 2026-05-29 cs.CV cs.AI 版本更新

Reinforcement Learning with Robust Rubric Rewards

基于稳健评分规则的强化学习

Ya-Qi Yu, Hao Wang, Fangyu Hong, Xiangyang Qu, Gaojie Wu, Qiaoyu Luo, Nuo Xu, Huixin Wang, Wuheng Xu, Yongxin Liao, Zihao Chen, Haonan Li, Ziming Li, Dezhi Peng, Minghui Liao, Jihao Wu, Haoyu Ren, Dandan Tu

发表机构 * Huawei Technologies Co., Ltd.(华为技术有限公司)

AI总结 针对部分可验证的视觉-语言任务,提出RLR^3方法,通过双路径执行评分规则、最小暴露策略和层次聚合,实现从任务级到准则级验证的扩展,在15个基准上平均提升4.7分。

详情
AI中文摘要

虽然基于可验证奖励的强化学习(RLVR)对于确定性可检查的任务有效,但许多视觉-语言任务部分可验证,需要多准则监督(例如,感知细节、推理步骤和约束)。评分规则为此细粒度监督提供了自然接口,但其有效性取决于在线RL期间的执行准确性。我们提出基于稳健评分规则的强化学习($\text{RLR}^3$),将RLVR从任务级验证扩展到准则级验证。$\text{RLR}^3$通过两条执行路径路由实例特定的评分规则:LLM作为提取器与确定性验证器配对,或LLM作为裁判用于不可验证的准则。为确保忠实评分,$\text{RLR}^3$引入最小暴露策略,从提取器中屏蔽真实标签,从裁判中屏蔽图像。此外,$\text{RLR}^3$采用层次聚合,优先考虑基本准则而非附加准则,并缓解rollout组内的分数饱和。在Qwen3-VL-30B-A3B上跨15个基准评估,$\text{RLR}^3$始终优于RLVR,比基础模型提升4.7分,并超过官方instruct-to-thinking模型差距。受控审计证实,我们的确定性验证和最小暴露显著减少了可利用的假阳性。

英文摘要

While Reinforcement Learning with Verifiable Rewards (RLVR) is effective for deterministically checkable tasks, many vision-language tasks are partially verifiable, demanding multi-criteria supervision (e.g., perceptual details, reasoning steps, and constraints). Rubrics provide a natural interface for this fine-grained supervision, but their effectiveness depends on the execution accuracy during online RL. We propose Reinforcement Learning with Robust Rubric Rewards ($\text{RLR}^3$), extending RLVR from task-level verification to criterion-level verification. $\text{RLR}^3$ routes instance-specific rubrics through two execution paths: an LLM-as-an-extractor paired with a deterministic verifier, or an LLM-as-a-Judge for non-verifiable criteria. To ensure faithful scoring, $\text{RLR}^3$ introduces a minimal exposure strategy that masks ground truths from extractors and images from judges. Furthermore, $\text{RLR}^3$ employs hierarchical aggregation to prioritize essential criteria over additional criteria, and mitigates score saturation within rollout groups. Evaluated on Qwen3-VL-30B-A3B across 15 benchmarks, $\text{RLR}^3$ consistently outperforms RLVR, yielding a 4.7-point improvement over the base model and exceeding the official instruct-to-thinking model gap. Controlled audits confirm our deterministic verification and minimal exposure significantly reduce exploitable false positives.

2605.30233 2026-05-29 cs.CL cs.AI 版本更新

Do Language Models Track Entities Across State Changes?

语言模型是否在状态变化中跟踪实体?

Zilu Tang, Qiao Zhao, Gabriel Franco, Derry Wijaya, Aaron Mueller, Sebastian Schuster, Najoung Kim

发表机构 * Department of Computer Science, Boston University, Boston, USA(波士顿大学计算机科学系) Department of Data Science, Monash University, Indonesia(莫纳什大学数据科学系) Faculty of Computer Science, University of Vienna, Austria(维也纳大学计算机科学系) Department of Linguistics, Boston University, Boston, USA(波士顿大学语言学系)

AI总结 研究语言模型在自然语言中处理多步状态变化操作时的实体跟踪机制,发现其采用非增量策略,在最后token并行聚合信息,并揭示了REMOVE操作的全局抑制标签及其导致的失败模式。

Comments ICML main conference 2026, 9 pages

详情
AI中文摘要

实体跟踪(ET),即跟踪状态的能力,是支撑复杂推理的基本技能。越来越多的研究探讨transformer语言模型(LMs)如何在没有状态变化的情况下解决实体绑定问题。然而,对于非玩具级LMs如何处理以自然语言表达的具有现实难度的ET问题,理解仍然有限。为此,我们研究了在具有多个状态变化操作的更复杂场景下ET背后的机制。我们发现,LMs不会跨token增量地跟踪世界状态,也不会跨层跟踪查询相关状态,而是在查询变得明显时,在最后一个token处并行地聚合相关信息。我们进一步研究了单个操作(PUT、REMOVE、MOVE)的机制,以表征这种非增量ET机制。令人惊讶的是,LMs使用一种脆弱的全局抑制标签来实现REMOVE操作;这种全局移除机制预测了我们通过行为实验确认的各种失败模式。我们提供了一种消除该标签的机械解决方案,以部分解决此问题。总体而言,我们的发现揭示了LMs使用非顺序策略来解决一个本质上是顺序的任务。更广泛地说,我们的工作展示了行为分析和机制分析如何有效地相互作用。行为结果为机制假设提供信息,而机制分析的见解通过预测现有评估中缺失的失败模式,有助于构建更强的行为评估。

英文摘要

Entity tracking (ET), the ability to keep track of states, is a fundamental skill that underlies complex reasoning. An increasing amount of work investigates how transformer language models (LMs) solve entity binding $\textit{without}$ state changes. However, there is limited understanding of how non-toy LMs address ET problems of realistic difficulties expressed in natural language. To this end, we investigate the mechanisms underlying ET in more complex scenarios featuring multiple state-changing operations. We find that LMs do not incrementally track world states across tokens or query-relevant states across layers, but simply aggregate relevant information in parallel at the last token when the query becomes evident. We further investigate mechanisms of individual operations ($\texttt{PUT}$, $\texttt{REMOVE}$, $\texttt{MOVE}$) to characterize this non-incremental ET mechanism. Surprisingly, LMs implement the $\texttt{REMOVE}$ operation with a fragile global suppression tag; this global removal mechanism predicts various failure modes that we confirm behaviorally. We provide a mechanistic solution of nullifying this tag to partially address this issue. Overall, our findings reveal that LMs solve a fundamentally sequential task using a non-sequential strategy. More broadly, our work illustrates how behavioral and mechanistic analyses can fruitfully interact. Behavioral results inform mechanistic hypotheses, and insights from mechanistic analyses help build stronger behavioral evaluations by predicting failure modes missing from existing evaluations.

2605.30231 2026-05-29 cs.CV cs.AI 版本更新

Beyond 3D VQAs: Injecting 3D Spatial Priors into Vision-Language Models for Enhanced Geometric Reasoning

超越3D VQA:将3D空间先验注入视觉-语言模型以增强几何推理

Chun-Hsiao Yeh, Shengyi Qian, Manchen Wang, Yi Ma, Joseph Tighe, Fanyi Xiao

发表机构 * FAIR at Meta(Meta的FAIR)

AI总结 提出GASP框架,通过将几何先验注入LLM的Transformer层,利用对比损失和深度一致性监督训练,显著提升VLM的3D空间推理能力,在多个基准上取得大幅提升。

Comments CVPR 2026. Project page: https://danielchyeh.github.io/GASP/

详情
AI中文摘要

视觉-语言模型(VLM)通常在鲁棒的3D空间推理方面存在困难。依赖于使用3D视觉问答(VQA)数据集进行微调的主流方法可能过度拟合数据集特定的偏差,而集成专门的3D视觉编码器往往不灵活且繁琐。在本文中,我们认为真正的空间理解应该源于学习基本的几何先验,而不仅仅是来自高级VQA监督。我们提出了GASP(几何感知空间先验),这是一个将这些先验直接注入LLM的Transformer层的框架。GASP采用一个小的对应头,作为跨所有层的深度监督信号,并使用一个双重目标进行训练,该目标利用大规模视频场景的真实几何:基于真实点对应的对比损失强制2D视图不变性,而深度一致性监督解决3D几何歧义。我们的分析首先提供了一个诊断,表明标准VLM的内部对应匹配精度非常低(通常低于5%)。然后我们证明,我们的训练显著改善了这种行为,将逐层峰值对应提升到70%以上,并保持超过85%的时间鲁棒性,而基线仍低于5%。这些内部改进转化为下游空间基准的显著提升,包括在All-Angles Bench上+18.2%,在VSI-Bench上+29.0%,所有这些都没有在任何3D VQA数据上进行训练。我们的发现表明,从基本几何先验中学习是实现具有更可靠3D空间推理的VLM的一条有前途且可推广的途径。

英文摘要

Vision-Language Models (VLMs) often struggle with robust 3D spatial reasoning. Prevailing methods that rely on fine-tuning with 3D visual question-answering (VQA) datasets may overfit dataset-specific biases, while integrating specialized 3D visual encoders is often inflexible and cumbersome. In this paper, we argue that genuine spatial understanding should emerge from learning fundamental geometric priors, not only from high-level VQA supervision. We propose GASP (Geometric-Aware Spatial Priors), a framework that injects these priors directly into the LLM's transformer layers. GASP employs a small correspondence head, applied as a deep supervision signal across all layers, and is trained with a dual objective leveraging ground-truth geometry from large-scale video scenes: a contrastive loss on ground-truth point correspondences enforces 2D view-invariance, while a depth consistency supervision resolves 3D geometric ambiguities. Our analysis first provides a diagnostic showing that standard VLMs' internal correspondence matching accuracy is very low (often below 5%). We then demonstrate that our training substantially improves this behavior, boosting peak layer-wise correspondence to over 70% and maintaining over 85% temporal robustness while baselines remain below 5%. These internal improvements translate to significant gains on downstream spatial benchmarks including +18.2% on All-Angles Bench and +29.0% on VSI-Bench, all without training on any 3D VQA data. Our findings indicate that learning from fundamental geometric priors is a promising and generalizable pathway towards VLMs with more reliable 3D spatial reasoning.

2605.30227 2026-05-29 cs.MA cs.AI 版本更新

Unifying Temporal and Structural Credit Assignment in LLM-Based Multi-Agent Prompt Optimization

统一基于LLM的多智能体提示优化中的时间与结构信用分配

Wenwu Li, Yuran Song, Mingze Zhao, Bo Jin, Wenhao Li

发表机构 * Tongji University(同济大学) The University of Hong Kong(香港大学)

AI总结 提出通过时间信用(状态空间瓶颈识别关键轮次)和结构信用(固定角色策略隔离智能体贡献)分解误差信号,并利用离散言语化块坐标下降算法迭代优化角色提示和聚合协议,降低查询复杂度并提升性能。

Comments 15 pages, 4 figures, 6 tables

详情
AI中文摘要

虽然多智能体系统(MAS)通过协作交互使大型语言模型能够处理复杂推理任务,但由于计算图的离散、不可微性质以及全局监督信号的稀疏性,优化其动态仍然是一个严峻的挑战。现有的黑盒优化器难以将轨迹级别的失败归因于特定的局部组件,导致低效、高方差的探索。我们认为,可处理的MAS优化需要结构归纳偏差来解开误差信号。我们提出了时间和结构信用分配,它沿着两个轴分解目标:(i)时间信用,使用状态空间瓶颈识别关键轮次;(ii)结构信用,使用固定角色策略隔离智能体贡献。利用这些分解后的信号,我们引入了一种离散的、言语化的块坐标下降算法用于迭代优化。它不是不加区分的全局更新,而是在优化角色提示和聚合协议之间交替,使用LLM生成的“代理梯度”仅针对识别出的薄弱环节。在多种推理基准测试中,我们的方法在提高性能的同时显著降低了查询复杂度,为自我改进的MAS提供了一条有原则且可解释的路径。

英文摘要

While Multi-Agent Systems (MAS) empower Large Language Models to tackle complex reasoning tasks through collaborative interaction, optimizing their dynamics remains a formidable challenge due to the discrete, non-differentiable nature of the computation graph and the sparsity of global supervisory signals. Existing black-box optimizers struggle to attribute trajectory-level failure to specific local components, resulting in inefficient, high-variance exploration. We argue that tractable MAS optimization needs structural inductive biases to disentangle error signals. We propose temporal and structural credit assignment, which decomposes the objective along two axes: (i) temporal credit, using state-space bottlenecks to identify critical rounds, and (ii) structural credit, using stationary role policies to isolate agent contributions. Leveraging these decomposed signals, we introduce a discrete, verbalized block coordinate descent algorithm for iterative refinement. Rather than indiscriminate global updates, it alternates between optimizing role prompts and aggregation protocols, using LLM-generated "proxy gradients" to target only the identified weak links. Across diverse reasoning benchmarks, our approach substantially reduces query complexity while improving performance, providing a principled and interpretable path toward self-improving MAS.

2605.30219 2026-05-29 cs.AI cs.CL cs.LG 版本更新

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

模型何时应改变想法?大语言模型中的上下文信念管理

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

发表机构 * Zhejiang University(浙江大学) HomologyAI

AI总结 提出上下文信念管理(CBM)框架,通过引入BeliefTrack基准和信念状态奖励的强化学习,将大语言模型在长程交互中的信念更新失败率平均降低70.9%。

Comments Work in progress

详情
AI中文摘要

长程交互要求语言模型管理累积信息:何时更新状态、何时保持状态、以及忽略什么。我们将这一挑战研究为上下文信念管理(CBM):在隔离任务无关噪声的同时,维护与形式证据对齐的预测信念状态。为了使CBM可测量,我们引入了BeliefTrack,一个涵盖规则发现和电路诊断的封闭世界基准,其中有限的信念空间和符号验证器支持精确的逐轮评估。BeliefTrack诊断三种失败:保持失败、更新失败和隔离失败。在多个大语言模型中,原始模型表现出严重的CBM失败,而显式的信念跟踪提示提供的改进有限。相比之下,使用信念状态奖励的强化学习平均将失败率降低了70.9%。进一步的探测揭示了这些失败背后的潜在信念状态动态,而表示级引导在两个任务上将失败率降低了46.1%。代码即将在https://github.com/zjunlp/CBM发布。

英文摘要

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. Code is coming soon at https://github.com/zjunlp/CBM.

2605.30207 2026-05-29 cs.AI 版本更新

Persona Conditioning of Brand Recommendations in Retrieval-Augmented Commercial Chat: A Prominence-Stratified Cross-Provider Audit

检索增强商业对话中品牌推荐的人格条件化:一种突出性分层跨提供商审计

Will Jack, Noah Lehman, Keller Maloney, Sarah Xu

AI总结 本研究通过审计10种人格×8个提示×3种模型配置的2000次运行,发现用户人格显著改变AI推荐品牌集,且效果在中等市场品牌和依赖先验的生成路径中更为突出。

详情
AI中文摘要

相同的提示——“最佳CRM软件”——来自不同背景的买家(独立创始人、企业副总裁、英国中小企业主)会到达AI助手。我们审计了这种上下文变化如何强烈地重塑模型推荐的品牌。审计采样了2000次运行,覆盖10种人格×8个提示×3种模型配置×N=10次重复的设计空间,其中两个OpenAI单元覆盖全部8个提示,Anthropic sonnet-4.6/低单元覆盖4个提示。在用户消息前添加人格,相对于同人格基线,推荐集相似度(Jaccard)下降Delta = -0.12至-0.20(聚类95%置信区间在所有三个测量单元上排除零;sonnet单元的置信区间仅基于4个提示聚类,相应更宽)。该效应具有明显的突出性分层:品类领导者具有人格抗性(跨人格约80%相同品牌一致性),但中等市场品牌随人格变化最多更换75%的推荐集。Anthropic模型的点估计效应大于OpenAI配置,尽管聚类置信区间在更接近的对比(sonnet vs. OpenAI/高)中重叠;这种不对称性与Anthropic更多依赖检索未归因的生成路径一致(43-52%的推荐没有观察到检索层证据,而OpenAI为8-29%,记录在Jack 2026中)。任何AI品牌感知的测量都必须以提供查询的买家人格为条件:相同的提示根据模型认为谁在提问而产生实质上不同的推荐集,而跨人格聚合的测量协议系统性地掩盖了这种变化。该效应集中在中等市场,并且在我们审计中最依赖先验的生成路径上最大,这与人格响应性随着模型更依赖训练数据先验和更丰富的上下文集成而增强是一致的。

英文摘要

The same prompt -- "best CRM software" -- reaches AI assistants from buyers in widely different contexts: a solo founder, an enterprise VP, a UK SMB owner. We audit how strongly that contextual variation reshapes which brands the model recommends. The audit samples 2,000 runs over a design space of 10 personas × 8 prompts × 3 model configurations × N=10 repetitions, with the two OpenAI cells at full 8-prompt coverage and the Anthropic sonnet-4.6 / low cell at 4-prompt coverage. Prefixing the user message with a persona drops the recommendation-set similarity (Jaccard) by $\Delta = -0.12$ to $-0.20$ relative to a same-persona baseline (clustered 95% CIs exclude zero on all three measured cells; the sonnet cell's CI rests on only 4 prompt clusters and is correspondingly wider). The effect is sharply prominence-stratified: category leaders are persona-resistant (~80% same-brand consistency across personas), but mid-market brands swap up to 75% of the recommendation set as the persona changes. The Anthropic model shows a larger point-estimate effect than the OpenAI configurations, though clustered CIs overlap for the closer contrast (sonnet vs. OpenAI/high); the asymmetry is consistent with Anthropic's more retrieval-unattributed generation route (43-52% recommendations without observed retrieval-layer evidence, vs. OpenAI's 8-29%, documented in Jack 2026). Any measurement of AI brand perception must condition on the buyer persona supplying the query: the same prompt produces materially different recommendation sets depending on who the model thinks is asking, and a measurement protocol that aggregates across personas systematically obscures that variation. The effect concentrates at mid-market and is largest on the most priors-reliant generation route in our audit, consistent with persona responsiveness growing as models lean more on training-data priors and richer context integration.
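The similarity measure named in this abstract, Jaccard similarity between recommendation sets, can be sketched in a few lines. The brand names are placeholders, and the way the drop is computed here (cross-persona similarity minus same-persona similarity for a single pair of runs) is an assumed simplification; the abstract does not describe the paper's aggregation or clustering procedure.

```python
def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| between two brand lists."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention: two empty recommendation sets are identical
    return len(a & b) / len(a | b)


def persona_delta(baseline_run, same_persona_run, other_persona_run):
    """Cross-persona similarity minus same-persona similarity (illustrative)."""
    same = jaccard(baseline_run, same_persona_run)
    cross = jaccard(baseline_run, other_persona_run)
    return cross - same


# Placeholder brands; two repeats under one persona, one run under another.
run_a1 = ["BrandA", "BrandB", "BrandC"]
run_a2 = ["BrandA", "BrandB", "BrandD"]
run_b1 = ["BrandA", "BrandE", "BrandF"]
delta = persona_delta(run_a1, run_a2, run_b1)
```

A negative `delta` means that changing the persona moves the recommendation set further than repeat-to-repeat noise within one persona does.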

2605.30201 2026-05-29 cs.LG cs.AI 版本更新

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

HPO: 稀疏奖励机制下稳定高效训练的滞后策略优化

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

发表机构 * Paris Research Center, Huawei Technologies(华为技术有限公司巴黎研究中心)

AI总结 针对GRPO在稀疏验证奖励下的失败模式,提出HPO通过降低负优势更新权重和均值长度归一化改进训练,并引入自适应版本A-HPO,在TeleLogs和Countdown实验中显著提升奖励。

详情
AI中文摘要

我们研究了GRPO风格的强化学习在稀疏可验证奖励背景下的一种狭窄但常见的失败模式:早期更新中包含更多具有负优势的响应,而非正优势的响应,而响应级长度归一化将更新幅度与输出长度挂钩。我们提出滞后策略优化(HPO),这是对GRPO的最小修改,它降低了负优势更新的权重,并用均值长度归一化替代了每个响应的长度归一化。我们进一步引入自适应HPO(A-HPO),它基于批次级优势符号统计设置滞后权重,从而消除了调整固定滞后权重的需要。在我们的TeleLogs和Countdown实验中,与GRPO相比,A-HPO提高了每次更新的奖励,在早期稀疏奖励机制中增益最大。在TeleLogs上,A-HPO实现了0.84的最终奖励,比SAPO高5%,比GSPO高11%,比GRPO高15%,同时保持了可比较的响应长度。在Countdown上,A-HPO在1.5B-7B模型的初始和最困难配置中实现了最大增益。关于滞后权重的消融研究表明,A-HPO的增益来自于比仅正更新或完全对称更新更好地平衡正负优势的贡献。

英文摘要

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response-length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
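The two changes HPO makes to GRPO, as described in this abstract, are a reduced weight on negative-advantage updates and mean-length normalization in place of per-response length normalization. The sketch below illustrates only those two ideas at the level of per-token loss weights. The parameter `beta` (a fixed hysteretic weight in [0, 1]) and both function names are assumptions for illustration; the paper's actual objective, clipping, and the adaptive rule used by A-HPO are not given in the abstract and are not reproduced here.

```python
def grpo_token_weights(advantages, lengths):
    """Per-token weight with per-response length normalization (baseline)."""
    return [adv / n for adv, n in zip(advantages, lengths)]


def hpo_token_weights(advantages, lengths, beta=0.5):
    """Per-token weight with a hysteretic scale and mean-length normalization.

    advantages: one advantage value per response in the rollout group
    lengths:    token count of each response
    beta:       assumed weight applied only to negative advantages
    """
    mean_len = sum(lengths) / len(lengths)
    weights = []
    for adv in advantages:
        # Negative-advantage responses are down-weighted; positive ones are kept.
        scale = 1.0 if adv >= 0 else beta
        # Every response shares the same normalizer, so the per-token
        # weight no longer depends on that response's own length.
        weights.append(scale * adv / mean_len)
    return weights


advantages = [1.0, -1.0]
lengths = [10, 30]
baseline = grpo_token_weights(advantages, lengths)
hysteretic = hpo_token_weights(advantages, lengths, beta=0.5)
```

With these toy inputs the mean length is 20, so the positive response gets a per-token weight of 0.05 and the negative response gets -0.025, whereas the baseline weights differ with each response's own length.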

2605.30200 2026-05-29 cs.AI 版本更新

Double-Edged Sword or Sharp Tool? Designing and Evaluating Triadic LLM-Teacher Collaboration for K-12 Writing at Scale

双刃剑还是利器?设计与评估面向K-12写作规模化的三元LLM-教师协作

Canran Wang, Yuwen Yang, Zhen Wang, Ming Ma, Ding Yu, Chentai Wang, Keman Huang, Xiaoyong Du

发表机构 * Renmin University of China(中国人民大学) Central University of Finance and Economics(中央财经大学) Beijing HQ Intelligent Technology, Ltd.(北京HQ智能科技有限公司)

AI总结 本文通过开发一个三元协作系统,结合系统功能语言学与建议轨迹追踪管道,基于包含57,954篇作文的大规模实证数据,验证了LLM作为生成引擎、教师作为教学把关者的分工策略能有效提升写作质量,并发现语言扩展存在边际效用递减的天花板效应。

详情
AI中文摘要

集成大型语言模型(LLM)的双刃剑效应需要一个有效的LLM、教师和学生之间的三元协作机制,尤其是对于K-12教育。通过开发一个支持K-12写作学习的三元协作系统,一个基于系统功能语言学和建议轨迹追踪管道的多维评估框架,本文贡献了一个大规模实证数据集,包含来自120所学校10,195名学生在两年内提交的57,954篇作文。我们的发现证实了该系统通过战略分工提高写作质量的功效:LLM作为生成引擎以缓解教师倦怠,教师作为教学把关者和桥梁以保证反馈质量。虽然LLM和教师对技能提升都至关重要,但我们发现了一个天花板效应,即过度的语言扩展产生递减的边际效用。这些表明随着学生熟练度的提高,需要动态自适应的LLM-教师协作。

英文摘要

The double-edged sword of integrating Large Language Models (LLMs) requires an effective triadic collaboration mechanism among LLMs, teachers and students, especially for K-12 education. By developing a triadic collaboration system to support K-12 writing learning, a multidimensional evaluation framework grounded in Systemic Functional Linguistics, and a suggestion trajectory tracing pipeline, this paper contributes a large-scale empirical dataset involving 57,954 essays from 10,195 students across 120 schools over two years. Our findings confirm the efficacy of this system in improving writing quality through a strategic division of labor: the LLM serves as a generative engine to mitigate teacher burnout, and the teacher acts as a pedagogical gatekeeper and bridge to guarantee feedback quality. While both LLM and teacher are critical for skill improvement, we uncover a ceiling effect where excessive linguistic expansion yields diminishing marginal utility. These findings suggest a need for dynamically adaptive LLM-teacher collaboration as student proficiency increases.

2605.30195 2026-05-29 cond-mat.mtrl-sci cs.AI cs.LG 版本更新

What drives performance in molecular MPNNs? An operator-level factorial benchmark

分子MPNN性能驱动因素:算子级因子基准测试

Panyu Jiao, Shuizhou Chen, Yiheng Shen, Yuyang Wang, Runhai Ouyang, Wei Xie

发表机构 * Materials Genome Institute, Shanghai University(上海大学材料基因组研究所) School of Computer Engineering and Science, Shanghai University(上海大学计算机工程与科学学院) School of Materials Science and Engineering, Tongji University(同济大学材料科学与工程学院)

AI总结 通过分解分子MPNN为消息种子初始化、节点-边融合和节点更新三类算子,在84种配置下对MoleculeNet数据集进行基准测试,发现消息构建而非更新复杂度主导性能,并提出了设计启发式方法。

详情
AI中文摘要

消息传递神经网络(MPNN)广泛用于分子性质预测,但其作为整体架构部署使得难以识别特定消息传递算子如何影响性能。我们提出了一个算子级因子基准测试,将二维分子MPNN分解为消息种子初始化、节点-边融合和节点更新算子三个家族。在共享实验设置和统计分析协议下,对十个MoleculeNet数据集上的84种配置进行了基准测试。在这个受控设计中,性能变化主要与消息构建相关,而非更新复杂度。消息种子初始化在回归和分类任务中均显示出显著的家族级效应;节点-边融合在回归任务中显示出显著的家族级效应,且基于拼接的混合具有描述性优势;更新家族在任一任务家族中均未显示出统计上支持的效应。对Quinethazone分子的表征探测进一步表明,与Hadamard门控相比,基于拼接的混合能更好地区分化学上不同的杂原子并抵抗过度平滑。分别针对分类和回归任务选择的代表性配置相对于已建立的分子图神经网络(GNN)基线恢复了竞争性性能,在十个基准数据集中有八个数值上排名最佳。这些实证结果通过对代表性节点-边融合和更新算子的简洁机理分析进行了解释。我们的发现通过将模型设计从搜索整体架构转变为针对化学信息在消息传递管道中进入位置和方式的定向评估,为分子MPNN提供了实证设计启发式方法。

英文摘要

Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures makes it difficult to identify how specific message-passing operators affect performance. We present an operator-level factorial benchmark that decomposes 2D molecular MPNNs into the three families of message-seed initialization, node-edge fusion, and node update operators. The resulting 84 configurations are benchmarked on ten MoleculeNet datasets under a shared experimental setup and statistical analysis protocol. Across this controlled design, performance variation is associated primarily with message construction rather than update complexity. Message-seed initialization shows significant family-level effects for both regression and classification, node-edge fusion shows a significant family-level effect for regression with descriptive advantages for concatenation-based mixing, and the update family shows no statistically supported effect for either endpoint family. A representation probe into the Quinethazone molecule further demonstrates that concatenation-based mixing can better differentiate chemically distinct heteroatoms and withstand oversmoothing than Hadamard gating. Representative configurations selected separately for classification and regression recover competitive performance relative to established molecular graph neural network (GNN) baselines, ranking numerically best on eight of ten benchmark datasets. These empirical results are interpreted through concise mechanistic analyses of representative node-edge fusion and update operators. Our findings provide empirical design heuristics for molecular MPNNs by turning model design from a search over monolithic architectures into a targeted assessment of where and how chemical information enters the message-passing pipeline.

2605.30189 2026-05-29 cs.CR cs.AI cs.CL cs.LG Updated version

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

LoRA适配器后门中的令牌级泛化:攻击表征与行为检测

Travis Lelle

Affiliations * Travis Lelle

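AI summary This paper implants backdoors in LoRA adapters through data poisoning, finds that the backdoor generalizes at the token-feature level rather than the structural-pattern level, and proposes two detection methods, one based on behavioral statistics and one on weight statistics.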

Comments 45 pages, 27 tables. Code and evaluation data: https://github.com/Travis-ML/lora-backdoors. Trained adapter weights available on request

Details
AI Chinese abstract

我们表明,LoRA适配器(微调LLM的主要分发格式)可以通过训练数据投毒可靠地植入后门,同时保持基线任务性能。在Qwen 2.5 1.5B提示注入分类器上,一小部分中毒样本即可驱动一个保持干净精度的后门达到饱和。由此产生的后门在令牌特征层面而非结构模式层面泛化:在一个RFC引用上训练的模型会在任何RFC引用上激活,但不会迁移到结构相同的ISO、OWASP、CWE或NIST引用上。这种不对称性有利于攻击者,因为防御者无法通用地探测“结构化引用”。 我们跨基础模型规模与系列、LoRA秩和触发字符串表征了该攻击,并针对多种子适配器队列评估了两种互补的检测路径。一个由两个探测电池统计量(outlier_gap和mean_attack_rate)构建的行为检测器,在探测电池与触发器的令牌邻域重叠时完美区分中毒适配器和干净适配器,在不重叠时以零假阳性实现高召回率。一个权重级统计量——维度归一化Frobenius范数的跨模块标准差——也能在不运行模型的情况下完美区分队列。两者结合对探测组成具有鲁棒性。因果修补将后门定位到中后层的MLP块,其中down_proj是最强的单投影原因。 跨规模、系列和秩的重复实验表明,行为检测器无需重新调整即可迁移,而权重级检测器则需针对基础模型进行校准。攻击随秩单调扩展,且选择的触发锚点令牌既依赖于触发也依赖于基础模型。行为检测是适配器供应链扫描中操作上可移植的结果。

English abstract

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
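The weight-level statistic named above (the cross-module standard deviation of dimension-normalized Frobenius norms) can be sketched in a few lines of Python. The abstract does not spell out the normalization, so dividing by sqrt(d_out * d_in) is an assumption, as are the module names other than `down_proj` and the toy matrices. The abstract also does not say in which direction poisoned adapters differ, so the example only shows how the number is computed.

```python
import math
import statistics

def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def frobenius_norm(M):
    return math.sqrt(sum(v * v for row in M for v in row))

def normalized_update_norm(B, A):
    """Frobenius norm of the LoRA update B @ A, divided by sqrt(d_out * d_in).
    The normalization is an assumed reading of 'dimension-normalized'."""
    delta = matmul(B, A)
    d_out, d_in = len(delta), len(delta[0])
    return frobenius_norm(delta) / math.sqrt(d_out * d_in)

def cross_module_std(adapter):
    """adapter maps a module name to its (B, A) LoRA factors."""
    norms = [normalized_update_norm(B, A) for B, A in adapter.values()]
    return statistics.pstdev(norms)

# Hypothetical rank-1 adapters over three modules.
unit = ([[1.0], [0.0]], [[1.0, 0.0]])
uniform_adapter = {"q_proj": unit, "v_proj": unit, "down_proj": unit}
uneven_adapter = {"q_proj": unit, "v_proj": unit,
                  "down_proj": ([[10.0], [0.0]], [[1.0, 0.0]])}

std_uniform = cross_module_std(uniform_adapter)
std_uneven = cross_module_std(uneven_adapter)
```

Because the statistic reads only the adapter weights, it needs no forward pass. That matches the abstract's point that this route works without running the model.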

2605.30187 2026-05-29 cs.AI cs.CY Updated version

Modularizing Educational LLM-Agency for Fostering Responsible Learning Assistance

模块化教育大语言模型代理以促进负责任的学习辅助

Julius Gabelmann, Felix Jahn, Kevin Baum, Sophie van Rossum, Emely Wuenscher, Timo P. Gros, Verena Wolf

Affiliations * German Research Center for Artificial Intelligence (DFKI); Saarland Informatics Campus, Germany; Center for European Research in Trusted AI (CERTAIN)

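AI summary Proposes an AI chatbot with a modular agent architecture that guides exercise solving in stages and incorporates targeted pedagogical advice, making the learning process more controllable, transparent, and overseeable and promoting responsible AI use in education.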

Comments 12 pages, 2 figures (+ 2 in appendix), accepted at AISoLA 2025 (Track: Responsible and Trusted AI: An Interdisciplinary Perspective)

Details
AI Chinese abstract

AI聊天机器人在教育中的广泛采用将彻底改变学习方式,使负责任部署成为关键问题。虽然大型语言模型(LLM)可能能够访问讨论教育科学见解的来源,但它们并不特别倾向于遵循教学概念,可能对学习过程产生负面影响,如丧失迁移能力、批判性思维或创造力。在本文中,我们介绍了一种辅助学生解决练习的代理型AI聊天机器人架构,专门设计用于促进教育中更负责任的AI使用。我们的概念开发基于对负责任的基于LLM的教育系统若干期望的识别,论证了整体式开箱即用解决方案固有的结构缺陷,并建议模块化代理架构。我们提出了针对练习解决不同阶段的特定模块,使得能够融入有针对性的教学建议,以更可控、透明和可监督的方式引导学生完成学习过程。

English abstract

The widespread adoption of AI chatbots in education will drastically change learning, making responsible deployment a critical concern. While large language models (LLMs) might have access to sources discussing insights from educational sciences, they are not particularly inclined to adhere to pedagogical concepts, risking negative effects on the learning process, such as a loss of transfer capabilities, critical thinking, or creativity. In this paper, we introduce an agentic AI chatbot architecture assisting students with exercise solving, specifically designed to contribute to more responsible AI use in education. We base our conceptual development on the identification of several desiderata for responsible LLM-based educational systems, argue for the structural shortcomings inherent in monolithic, out-of-the-box solutions, and instead suggest modularizing the agentic architecture. We propose specific modules for different stages of exercise solving, enabling incorporation of targeted pedagogical advice, guiding students through the learning process in a more controllable, transparent, and overseeable manner.

2605.30179 2026-05-29 cs.LG cs.AI Updated version

iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

iLoRA: 用于微生物组诊断的具有潜在交互图的贝叶斯低秩适应

Yang Song, Yixuan Zhang, Lingfa Meng, Tongyuan Hu, Haizhou Shi, Hao Wang, Samir Bhatt, Hengguan Huang

Affiliations * University of Copenhagen, Copenhagen, Denmark; Rutgers University, New Brunswick, NJ, USA; Section of Health Data Science & AI, Department of Public Health, University of Copenhagen, Copenhagen, Denmark; MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom

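AI summary Proposes iLoRA, a Bayesian graph-conditioned LoRA framework that infers a latent interaction graph from the input to generate input-conditioned LoRA updates, learns prediction and latent interaction structure jointly, and outperforms existing methods on microbiome diagnosis.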

Comments Accepted at ICML 2026

Details
AI Chinese abstract

参数高效适应使得大型语言模型在领域预测中变得实用,但标准LoRA仍然依赖于静态低秩更新,并且没有揭示通常驱动科学标签的潜在交互。我们引入了iLoRA。据我们所知,这是第一个贝叶斯图条件LoRA框架。它从输入中推断潜在交互图,并使用它生成输入条件LoRA更新。因此,iLoRA联合学习预测和潜在交互结构,而不是训练预测器然后仅事后应用交互分析。我们将这一思想实例化用于微生物组诊断,其中疾病状态可能依赖于物种水平丰度和微生物-微生物串扰,并在两个互补设置中评估:与人工注释图进行交互式问答,测试潜在结构恢复;以及多队列IBD诊断,测试生物医学效用。在这两种设置中,iLoRA优于强LoRA和贝叶斯适应基线,恢复与人工注释和队列水平微生物组关联一致的图,并提供具有适度图分支开销的校准不确定性。

English abstract

Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-conditioned LoRA updates. As a result, iLoRA learns prediction and latent interaction structure jointly, rather than training a predictor and applying interaction analysis only post hoc. We instantiate this idea for microbiome diagnosis, where disease state can depend on both species-level abundance and microbe-microbe cross-talk, and evaluate it in two complementary settings: interactive QA with human-annotated graphs, which tests latent structure recovery, and multi-cohort IBD diagnosis, which tests biomedical utility. Across both settings, iLoRA improves over strong LoRA and Bayesian adaptation baselines, recovers graphs aligned with human annotations and cohort-level microbiome associations, and provides calibrated uncertainty with moderate graph-branch overhead.

2605.30162 2026-05-29 cs.AI cs.CR cs.LG Updated version

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

BioRefusalAudit: 使用通用和领域微调稀疏自编码器审计生物安全拒绝深度

Caleb DeLeeuw

Affiliations * Independent researcher; Apart Research AIxBio Sprint

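AI summary Proposes BioRefusalAudit, which combines behavioral testing with analysis of internal sparse-autoencoder features to assess how consistently language models refuse in biosecurity settings, finding structural fragility, over-refusal, and differences across architectures.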

Comments 21 pages, 2 figures, 3 tables. Apart Research AIxBio Sprint hackathon paper, April 2026 (Track 3: AI Biosecurity Tools). Code, eval set, and SAEs: github.com/SolshineCode/Deleeuw-AI-x-Bio-hackathon. Reviewer feedback: apartresearch.com/project/biorefusalaudit-auditing-biosecurity-refusal-depth-using-general-and-domainfinetuned-sparse-autoencoders-1fyk

Details
AI Chinese abstract

语言模型的生物安全评估通常询问模型是否产生危险输出。本文提出一个补充性问题:当模型拒绝时,该拒绝在结构上是否稳健,还是在提示框架、格式或输出长度的适度变化下消失?在五种架构中,没有模型能清晰区分良性查询和危险查询。Gemma 2 2B-IT 在75个提示中从未真正拒绝,对每个接近危险的查询都含糊其辞。Gemma 4 E2B-IT 在聊天模板格式下拒绝了65/75个提示,无格式时拒绝了0/75。两个Gemma模型在80个token的输出上限下拒绝率都降至0%。Qwen 2.5 1.5B 和 Phi-3-mini 过度拒绝,将83-87%的良性生物学标记为危险。Llama 3.2 1B 显示出唯一有意义的层级梯度(61点跨度)。为了探究过度拒绝的驱动因素,我们测试了一组附表I但无生物毒性的化合物(特别是裸盖菇素栽培,具有FDA突破性疗法资格)。一些模型对这些化合物的拒绝率超过了真正危险的生物学,表明拒绝追踪法律和文化显著性而非CBRN危险。为了测量内部层面,我们引入了一个分歧分数D,比较模型的表面响应标签与其内部稀疏自编码器(SAE)特征激活。在Gemma 2 2B-IT(Gemma Scope 1)和Gemma 4 E2B-IT(作者训练的bio SAE)上计算了完整的D。发布了两个微调的Gemma 2领域SAE。在Gemma 4上,服从和拒绝响应之间差距为0.647点,零重叠(n=75),尽管这是初步的,目录狭窄,样本内校准,且仅覆盖Gemma家族的SAE。在一个黑客马拉松周末使用消费级硬件(GTX 1650 Ti Max-Q,以及用于SAE训练的Colab T4)构建,这一初步证据表明,激活级审计可能揭示行为评估无法发现的失败模式,且各架构间存在显著差异。

English abstract

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

2605.30160 2026-05-29 cs.LG cs.AI Updated version

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

混沌动力系统中的分布强化学习

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

Affiliations * Centre for Artificial Intelligence; Department of Computer Science; University College London; University of Bologna

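AI summary To address the high variance and poorly conditioned gradients that reinforcement learning faces in chaotic dynamical systems, argues that distributional RL achieves more stable optimization through a distributional Bellman objective measured under the 1-Wasserstein metric.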

Details
AI Chinese abstract

混沌动力系统对强化学习(RL)提出了根本性挑战:对初始条件的指数敏感性导致高方差的引导目标和病态的梯度更新。混沌动力学出现在科学和工程领域的各个方面,从流体流动和气候系统到多智能体系统,在这些领域中,可靠的学习是非常可取的。标准RL方法通过标量值函数优化期望回报,隐式地对发散轨迹进行平均,并将轨迹层面的不稳定性与学习目标纠缠在一起。我们证明,在温和的统计稳定性假设下,当在$1$-Wasserstein度量下测量时,回报分布比单个轨迹更规则地演化,从而产生更平滑的分布贝尔曼目标。通过将优化与该度量层面结构对齐,分布RL提供了更好的条件学习。我们为混沌系统中分布方法的优势以及混沌下RL目标的几何结构提供了原则性的解释。

English abstract

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the $1$-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure level structure, distributional RL provides better conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.
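To make the contrast between trajectory-level and distribution-level comparisons concrete, here is a small Python sketch using the logistic map as a stand-in chaotic system. The choice of system, the reward (the state itself), the discount factor, and the ensemble size are all assumptions for illustration and are not taken from the paper. The 1-Wasserstein distance is computed with the standard sorted-sample formula for equal-size one-dimensional samples.

```python
import random

def logistic_orbit(x0, steps, r=4.0):
    """Iterate the logistic map x -> r*x*(1-x), which is chaotic at r = 4."""
    xs = [x0]
    for _ in range(steps):
        xs.append(r * xs[-1] * (1.0 - xs[-1]))
    return xs

def discounted_return(rewards, gamma=0.95):
    """Discounted sum of a reward sequence."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

def wasserstein_1(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples
    on the real line: mean absolute difference of the sorted values."""
    if len(xs) != len(ys):
        raise ValueError("samples must have the same size")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

rng = random.Random(0)
starts = [rng.uniform(0.1, 0.9) for _ in range(200)]
# Each run is paired with a twin whose initial state is shifted by 1e-9.
returns_a = [discounted_return(logistic_orbit(x, 100)) for x in starts]
returns_b = [discounted_return(logistic_orbit(x + 1e-9, 100)) for x in starts]

# Trajectory-level comparison: each run against its perturbed twin.
pathwise_gap = sum(abs(a - b) for a, b in zip(returns_a, returns_b)) / len(starts)
# Distribution-level comparison: the two ensembles as a whole.
w1 = wasserstein_1(returns_a, returns_b)
```

The sorted pairing is the optimal coupling on the real line, so `w1` can never exceed `pathwise_gap`. This toy inequality is much weaker than the paper's result, but it shows why a distribution-level metric is no more sensitive to diverging trajectories than a run-by-run comparison.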

2605.30159 2026-05-29 cs.AI Updated version

Meta-Cognitive Memory Policy Optimization for Long-Horizon LLM Agents

元认知记忆策略优化用于长视野LLM智能体

Ziyan Liu, Zhezheng Hao, Yeqiu Chen, Hong Wang, Jingren Hou, Ruiyi Ding, Yongkang Yang, Wence Ji, Wei Xia, Feng Liu

Affiliations * University of Science and Technology of China; Zhejiang University; Tencent

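AI summary To address the lack of intermediate supervision when training memory policies for long-horizon tasks, proposes Metacognitive Memory Policy Optimization (MMPO), which uses Belief Entropy as a self-supervised proxy to penalize summaries that induce high epistemic uncertainty, improving long-horizon reasoning.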

Details
AI Chinese abstract

记忆增强的LLM智能体通过递归地将交互轨迹总结为紧凑记忆来处理复杂的长期任务。然而,现有方法通常使用基于结果的强化学习来训练这些记忆策略,未能定位中间记忆质量下降的位置。随着交互的展开,模糊的递归总结逐渐丢弃任务相关信息并引入语义噪声。这加剧了信念偏差,模糊了智能体对潜在任务状态的估计,最终导致长期推理偏离轨道。因此,我们认为记忆优化不仅应关注轨迹层面的成功,还应关注中间总结所诱导的信念清晰度。为此,我们引入了信念熵,这是一种自监督代理,用于探测模型在当前记忆下对潜在任务状态的不确定性程度。基于这一代理,我们提出了元认知记忆策略优化(MMPO)。MMPO不依赖稀疏的基于结果的信号,而是通过显式惩罚诱导高认知不确定性的总结,提供细粒度的、记忆特定的监督。实验表明,MMPO在各种长期任务上持续优于现有方法,即使在扩展到175万token的上下文时仍保持97.1%的性能。

English abstract

Memory-augmented LLM agents tackle complex long-horizon tasks by recursively summarizing interaction trajectories into compact memory. However, existing approaches typically train these memory policies using outcome-based reinforcement learning, failing to localize where intermediate memory quality degrades. As interactions unfold, ambiguous recursive summaries progressively discard task-relevant information and introduce semantic noise. This exacerbates belief deviation, obscuring the agent's estimate of the latent task state and ultimately derailing long-horizon reasoning. We therefore argue that memory optimization should focus not merely on trajectory-level success, but on the clarity of the belief induced by intermediate summaries. To this end, we introduce Belief Entropy, a self-supervised proxy that probes how uncertain the model remains about the latent task state given its current memory. Based on this proxy, we propose Metacognitive Memory Policy Optimization (MMPO). Instead of relying only on sparse outcome-based signals, MMPO provides fine-grained, memory-specific supervision via explicitly penalizing summaries that induce high epistemic uncertainty. Experiments show that MMPO consistently outperforms existing methods on diverse long-horizon tasks, maintaining 97.1% performance even when scaled to 1.75M-token contexts.
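The idea of penalizing summaries that leave the model uncertain about the latent task state can be sketched as an entropy term subtracted from the outcome reward. The abstract does not define Belief Entropy precisely or say how it enters the objective, so the Shannon-entropy form, the `penalty_weight` parameter, and the toy belief vectors below are assumptions.

```python
import math

def belief_entropy(probs):
    """Shannon entropy (in nats) of a belief over candidate latent task states."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def shaped_reward(outcome_reward, belief, penalty_weight=0.1):
    """Outcome reward minus a penalty that grows with belief uncertainty.
    The additive form and the weight are illustrative choices."""
    return outcome_reward - penalty_weight * belief_entropy(belief)

# Hypothetical beliefs induced by two candidate memory summaries.
sharp_belief = [0.97, 0.01, 0.01, 0.01]   # summary keeps the task state clear
vague_belief = [0.25, 0.25, 0.25, 0.25]   # summary leaves the state ambiguous

reward_sharp = shaped_reward(1.0, sharp_belief)
reward_vague = shaped_reward(1.0, vague_belief)
```

With equal outcome rewards, the summary that induces the sharper belief receives the higher shaped reward, which is the kind of memory-specific signal the abstract describes.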

2605.30152 2026-05-29 cs.CL cs.AI cs.HC Updated version

Do Proactive Agents Really Need an LLM to Decide When to Wake and What to Anchor?

主动型智能体真的需要LLM来决定何时唤醒和锚定什么吗?

Xiaoze Liu, Ruowang Zhang, Amir H. Abdi, Michel Galley, Zhikai Chen, Siheng Xiong, Xiaoqian Wang, Jing Gao

Affiliations * Purdue University; Microsoft; Michigan State University; Georgia Institute of Technology

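AI summary Proposes replacing the LLM with a temporal-graph-learning (TGL) model as the trigger for proactive agents, processing user activity as graph updates rather than text to make efficient, low-latency trigger decisions.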

Comments 31 pages, 5 figures, 7 tables

Details
AI Chinese abstract

主动型智能体将用户活动读取为文本,并在每个事件上调用LLM来决定是否行动。但用户活动本质上不是文本:它是操作系统以图形式维护的结构化事件流(actor, verb, object, timestamp)元组。将结构渲染为文本并要求LLM恢复它是系统本不必进行的往返。我们将始终在线的信号视为图更新而非文本,并使用小型时间图学习(TGL)模型作为编码器:一次前向传播产生每个事件的触发概率和每个实体的路由分数,只有下游智能体(将小型结构化交接转化为流畅的用户面向句子)是LLM调用,仅在触发时调用。TGL在14个骨干模型上均提升了F1(平均+16.7,最高+46.0);在触发架构比较中,一个TGL检查点给出了最强的触发AUC和最稳定的部署阈值。它在GPU服务器上每个事件运行11.13毫秒,在消费级笔记本电脑上运行13.99毫秒,分别比相应场景中测试的每个单次前向LLM触发器配置快约4-7倍和12-83倍,其BF16驻留内存占用约220 MiB,可部署在设备上,与其消费的隐私敏感活动流一起运行。

English abstract

Proactive agents read user activity as text and call an LLM on every event to decide whether to act. But user activity is not natively text: it is a structured event stream of (actor, verb, object, timestamp) tuples that the operating system already maintains in graph form. Rendering the structure as text and asking an LLM to recover it is a round-trip the system never had to take. We treat the always-on signal as graph updates rather than text and use a small temporal-graph-learning (TGL) model as the encoder: one forward pass yields a per-event trigger probability and a per-entity routing score, and only the downstream agent (turning a small structured handoff into a fluent user-facing sentence) is an LLM call, invoked only when the trigger fires. TGL improves F1 on each of 14 backbones (mean +16.7, up to +46.0); in trigger-architecture comparisons, one TGL checkpoint gives the strongest trigger AUCs and the most stable deployed threshold. It runs at 11.13 ms per event on a GPU server and 13.99 ms on a consumer laptop, approximately 4--7x and 12--83x faster than every single-forward LLM-as-trigger configuration tested in each regime, with an approximately 220 MiB BF16 resident footprint deployable on-device alongside the privacy-sensitive activity stream it consumes.
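The control flow described above (update a graph on every event, call the LLM only when the trigger fires) can be sketched as follows. The `Event` fields follow the tuple in the abstract. Everything else is a stand-in: `ActivityGraph`, `toy_score`, and `toy_llm` are hypothetical names, and the hand-written scoring rule merely takes the place of a trained TGL model.

```python
from collections import defaultdict
from typing import Callable, NamedTuple

class Event(NamedTuple):
    actor: str
    verb: str
    obj: str
    timestamp: float

class ActivityGraph:
    """Stores events as timestamped, typed edges between actors and objects."""
    def __init__(self):
        self.edges = defaultdict(list)  # (actor, obj) -> [(verb, timestamp), ...]

    def update(self, event: Event) -> None:
        self.edges[(event.actor, event.obj)].append((event.verb, event.timestamp))

def run_stream(events, score_event: Callable, call_llm: Callable, threshold=0.5):
    """Update the graph on every event; call the LLM only when the trigger fires."""
    graph, outputs = ActivityGraph(), []
    for ev in events:
        graph.update(ev)
        if score_event(graph, ev) >= threshold:
            outputs.append(call_llm(ev))
    return outputs

# Stand-ins: a real system would use a trained TGL model and an LLM client.
def toy_score(graph, ev):
    # Fire once the same (actor, object) pair has been touched three times.
    return 1.0 if len(graph.edges[(ev.actor, ev.obj)]) >= 3 else 0.0

llm_calls = []
def toy_llm(ev):
    llm_calls.append(ev)
    return f"Suggestion about {ev.obj}"

events = [Event("user", "open", "report.docx", t) for t in (1.0, 2.0, 3.0)]
events.append(Event("user", "open", "notes.txt", 4.0))
outputs = run_stream(events, toy_score, toy_llm)
```

Four events arrive but the LLM stand-in runs once, which is the cost structure the abstract argues for: the cheap scorer sees every event and the expensive call is reserved for triggers.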

2605.30151 2026-05-29 cs.AI Updated version

Temporal Stability and Few-Shot Prompting in Math Task Assessment

数学任务评估中的时间稳定性和少样本提示

Danielle S. Fox, Brenda L. Robles, Elizabeth DiPietro Brovey, Christian D. Schunn

Affiliations * Learning Research and Development Center, University of Pittsburgh; Institute for Learning, University of Pittsburgh

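AI summary Through a longitudinal experiment, evaluates the temporal stability and few-shot prompting performance of AI tools in classifying the cognitive demand of mathematics tasks, finding that prompt engineering improves performance more than model version updates do.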

Comments 23 pages, 1 figure

Details
AI Chinese abstract

随着AI工具日益融入教育环境,其随时间稳定性以及对提示工程技术的响应性成为问题。本纵向研究聚焦于不同AI工具使用任务分析指南(TAG; Stein & Smith, 1998)对数学任务认知需求进行分类的能力。具体而言,考察了这种分类能力是否随(1)模型版本更新和(2)使用示例任务的少样本提示而改变。我们测试了一个通用AI工具(Gemini)和一个教育专用AI工具(Coteach)。选择这些特定工具是因为它们在相关公开基准和先前任务特定测试中表现相对较高。模型在基线时进行测试,在模型版本更新后重新测试,然后再次使用少样本提示(每个认知需求类别两个示例任务)进行测试。结果显示,仅更新模型版本产生了混合效应:Gemini的准确率稳定在58%,而Coteach的准确率从75%下降到50%。然而,少样本提示提高了两个模型的性能:Gemini提高到67%,Coteach恢复到75%的准确率。这些发现表明,提示工程技术可以产生比被动模型改进更大且更可靠的效果,并且版本更新并不总是能提高在专门教育任务上的性能。该研究对教育工作者和研究人员在教育环境中如何选择、评估和实施AI工具具有重要意义。

English abstract

As AI tools become increasingly integrated into educational contexts, questions arise about both their stability over time and their responsiveness to prompt engineering techniques. This longitudinal study focused on different AI tools' ability to use the Task Analysis Guide (TAG; Stein & Smith, 1998) to classify the cognitive demand of mathematics tasks. In particular, it examined whether this classification ability changed with (1) model version updates over time and (2) few-shot prompting using exemplar tasks. We tested a general-purpose AI tool (Gemini) and an education-specific AI tool (Coteach). The specific tools were selected because of their relatively high performance on relevant published benchmarks and prior task-specific tests. Models were tested at baseline, retested with model version updates, and then tested again using few-shot prompting (two exemplar tasks for each cognitive demand category). Results revealed that newer model versions alone produced mixed effects: Gemini's accuracy remained stable at 58%, while Coteach's accuracy decreased from 75% to 50%. However, few-shot prompting improved both models' performance: Gemini increased to 67% and Coteach recovered to 75% accuracy. These findings demonstrate that prompt engineering techniques can have larger and more reliable effects than passive model improvements, and that version updates may not always improve performance on specialized educational tasks. The study has important implications for how educators and researchers should approach AI tool selection, evaluation, and implementation in educational contexts.

2605.30150 2026-05-29 cs.AI Updated version

Anchorless Diversification for Parallel LLM Ideation

无锚点多样化并行LLM创意生成

Fares Nabil Ibrahim, Nafis Saami Azad, Raiyan Abdul Baten

Affiliations * Bellini College of Artificial Intelligence, Cybersecurity, and Computing, University of South Florida

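AI summary Studies anchorless methods (such as semantic direction stratification) for diversifying candidate pools in parallel LLM ideation without relying on seed ideas, and reports that they compare favorably with anchored baselines on diversity, quality, and compute efficiency.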

Details
AI Chinese abstract

大型语言模型越来越多地用于生成创意任务的候选想法池,其中广泛探索是有价值的。在此场景下,并行推理在拓宽池的同时保持质量和成本效率时具有吸引力。我们研究推理时控制以实现候选池多样化,探究无锚点方法是否能与依赖观察到的种子想法的方法相抗衡。在三个创意任务族中,我们在中性和群体参照发散指令下,比较了独立生成和语义方向分层与自我、同伴和代表性锚点基线。群体参照发散是一个强大的低成本基线,在保持质量代理的同时增加了语义多样性。语义方向分层更强:一次规划调用即可组织跨广泛语义方向的生成,产生最佳的多样性-质量-计算前沿。锚点再生在最终池多样性上可能很强,但其优势在完整流水线令牌核算下缩小。这些结果为开放式LLM创意生成建立了实用的无锚点基线。

English abstract

LLMs are increasingly used to generate candidate-idea pools for creative tasks where broad exploration is valuable. Parallel inference can be attractive in this setting when it broadens the pool while retaining quality and cost efficiency. We study inference-time controls for candidate-pool diversification, asking whether anchorless methods can rival methods that depend on observed seed ideas. Across three creative task families, we compare independent generation and semantic direction stratification with self-, peer-, and representative-anchor baselines, under neutral and population-referential divergent instructions. Population-referential divergence is a strong low-cost baseline, increasing semantic diversity while preserving quality proxies. Semantic direction stratification is stronger: a single planning call organizes generations across broad semantic directions, yielding the best diversity--quality--compute frontier. Anchored regeneration can be strong in final-pool diversity, but its advantage shrinks under full-pipeline token accounting. These results establish practical anchorless baselines for open-ended LLM ideation.

2605.30148 2026-05-29 cs.LG cs.AI Updated version

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

克服LLM微调中的遗忘:进化策略方法

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

Affiliations * Cognizant AI Lab; UT Austin

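AI summary Finds that prior-task forgetting in Evolution Strategies fine-tuning is better described as recoverable performance drift, and introduces Anchored Weight Decay (AWD), a regularization technique that stabilizes prior-task performance, showing that forgetting is largely avoidable and that ES is a viable approach to continual learning in LLMs.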

Details
AI Chinese abstract

进化策略(ES)最近作为强化学习(RL)在大语言模型(LLM)微调中的竞争性替代方案出现,通过简单性、可扩展性和仅推理训练提供优势。然而,近期研究表明,在新任务上进行ES微调可能导致对先前任务的遗忘。首先,本文表明先前任务遗忘(1)更好地被描述为性能漂移而非不可逆遗忘,在ES训练过程中先前任务性能通常会恢复;(2)并非ES特有的失败模式,使用RL方法微调时也可能出现。其次,本文分析了这种漂移何时以及为何出现,强调了其对ES训练动态的依赖性,特别是权重空间中弱约束方向上的随机游走行为。第三,基于这些见解,本文引入了锚定权重衰减(AWD)作为一种参数空间正则化技术,将优化约束向初始模型参数。AWD在保持目标任务性能的同时有效稳定了先前任务性能,以更低的计算成本实现了与大型ES种群规模相当的优势。因此,与先前观点相反,本文表明ES下的先前任务遗忘在很大程度上是可以避免的,使ES成为LLM持续学习中一种有前景的方法。

English abstract

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
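A minimal sketch of how a decay-toward-the-anchor term could sit inside an ES update is given below. The abstract only says that AWD constrains optimization toward the initial parameters, so the specific form `lr * awd * (theta - anchor)`, the antithetic-sampling estimator, the hyperparameter names, and the two-parameter toy objective are all assumptions.

```python
import random

def es_step(theta, anchor, fitness, rng, sigma=0.1, lr=0.05, pairs=16, awd=0.0):
    """One antithetic ES step, followed by decay toward the anchor parameters."""
    grad = [0.0] * len(theta)
    for _ in range(pairs):
        eps = [rng.gauss(0.0, 1.0) for _ in theta]
        plus = fitness([t + sigma * e for t, e in zip(theta, eps)])
        minus = fitness([t - sigma * e for t, e in zip(theta, eps)])
        for i, e in enumerate(eps):
            grad[i] += (plus - minus) * e / (2.0 * sigma * pairs)
    # Ascend the fitness estimate, then pull each weight back toward its anchor.
    return [t + lr * g - lr * awd * (t - a) for t, g, a in zip(theta, grad, anchor)]

def target_fitness(theta):
    # Only theta[0] matters; theta[1] never affects fitness, so its ES
    # update is sampling noise (a weakly constrained direction).
    return -(theta[0] - 1.0) ** 2

def train(awd, steps=200, seed=0):
    rng = random.Random(seed)
    anchor = [0.0, 0.0]          # stands in for the initial model parameters
    theta = list(anchor)
    for _ in range(steps):
        theta = es_step(theta, anchor, target_fitness, rng, awd=awd)
    return theta

theta_plain = train(awd=0.0)
theta_anchored = train(awd=0.5)
```

In this form the decay acts on every coordinate, including the one the target task needs, so the coefficient trades some target-task fit for stability. How the paper balances that trade-off is not stated in the abstract.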

2605.30144 2026-05-29 cs.AI cs.MA Updated version

AgentSchool: An LLM-Powered Multi-Agent Simulation for Education

AgentSchool:基于LLM的多智能体教育模拟系统

Yulei Ye, Wenhao Li, Zhong Wen, Yunshu Huang, Yichen Hu, Zifan Wei, Yige Wang, Xinyu Xie, Haoxuan Yang, Yanjun Huang, Ruijia Li, Hong Qian, Yu Song, Bo Jiang, Bingdong Li, Lijun Li, Bo Zhang, Pinlong Cai, Xingcheng Xu, Shuangye Chen, Xia Hu, Liang He, Aimin Zhou, Jingjing Qu, Jing Shao, Xiangfeng Wang

Affiliations * Shanghai Institute of AI for Education; School of Computer Science and Technology; East China Normal University; School of Design; Faculty of Education; Shanghai Artificial Intelligence Laboratory

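AI summary Proposes AgentSchool, an LLM-driven multi-agent simulator that models learning through growable student agents (with knowledge graphs, thinking workflows, and misconceptions) and adaptive teacher agents (guided by the Zone of Proximal Development), supports multi-scale simulation, and in experiments produces differentiated mastery trajectories and behavior patterns consistent with classroom social theories.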

Comments 39 pages, 10 figures

Details
AI Chinese abstract

尽管LLM已迅速部署到课堂中,验证教育AI仍然具有独特的棘手性:干预措施作用于发展中的学习者,其认知和社会轨迹被不可逆地塑造,而现实世界试验缓慢、受伦理约束且受制度限制。基于LLM的教育模拟器已成为潜在的补救措施,但许多模拟器仍将学习简化为角色扮演,并且当仅优化以再现现有课堂时,可能会结构性惩罚教学改革所需的制度创新。在这项工作中,我们介绍了AgentSchool,一种LLM驱动的多智能体模拟器,将学习建模为状态转换而非提示行为。AgentSchool将可成长的学生智能体(配备加权学科知识图谱、思维工作流池和显式错误概念)与自适应教师智能体(在最近发展区内规划、搭建支架和反思)相结合,嵌入可配置的场景生成器(将教学置于正式和非正式学习领域)和多尺度模拟器(解耦交互规模、时间粒度和模拟持续时间)。实验表明,结构化学生智能体比基线模拟器产生更差异化的掌握和错误概念轨迹,而教师智能体比较显示出与基于ZPD的适应一致的骨干依赖模式。此外,AgentSchool生成与课堂社会理论一致的外围参与、小团体形成、攻击者诱导的凝聚力和意见领袖出现的合理轨迹。除了作为教育研究工具的作用外,AgentSchool还将教育构建为在组织压力下进行长时记忆、多智能体协调和未来制度推理的社会意义测试平台。

English abstract

Despite the rapid deployment of LLMs into classrooms, validating educational AI remains uniquely intractable: interventions act on developing learners whose cognitive and social trajectories are irreversibly shaped, while real-world trials are slow, ethically constrained, and institutionally locked. LLM-based educational simulators have emerged as a potential remedy, but many still collapse learning into persona-conditioned role-play and, when optimized only to reproduce existing classrooms, can structurally penalize the institutional novelty that pedagogical reform requires. In this work, we introduce AgentSchool, an LLM-driven multi-agent simulator that models learning as state transition rather than prompted behavior. AgentSchool couples cognitively growable student agents -- equipped with weighted subject knowledge graphs, thinking-workflow pools, and explicit misconceptions -- with adaptive teacher agents that plan, scaffold, and reflect along the Zone of Proximal Development, embedded in a configurable scenery generator that situates instruction within both formal and informal learning fields, and a multi-scale simulator that decouples interaction scale, temporal granularity, and simulation duration. Experiments show that structured student agents produce more differentiated mastery and misconception traces than a baseline simulator, while teacher-agent comparisons show backbone-dependent patterns consistent with ZPD-informed adaptation. Further, AgentSchool generates plausible traces of peripheral participation, clique formation, aggressor-induced cohesion, and opinion-leader emergence consistent with classroom social theories. Beyond its role as an educational research instrument, AgentSchool frames education as a socially meaningful testbed for long-horizon memory, multi-agent coordination, and future institutional reasoning under organizational pressure.

2605.30136 2026-05-29 cs.AI Updated version

Enhancing Multi-Agent Communication through Attention Steering with Context Relevance

通过上下文相关性的注意力引导增强多智能体通信

Hongxiang Zhang, Yuan Tian, Tianyi Zhang

Affiliations * Purdue University

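AI summary To address information dilution caused by long conversation histories in LLM multi-agent systems, proposes Agent-Radar, a training-free context management method that dynamically steers attention with a temporal and spatial decay mechanism, gaining up to 7.64 absolute points across five benchmarks.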

Details
AI Chinese abstract

基于LLM的多智能体系统通过协作推理在复杂任务上表现出色。然而,这些系统在交互过程中会迅速积累极长的对话历史。随着对话变长,相关信息被无关上下文稀释,导致性能下降。在这项工作中,我们提出了Agent-Radar,一种无需训练的上下文管理方法,通过新颖的时空衰减机制动态引导每个智能体的注意力到相关上下文。实验表明,Agent-Radar在五个不同基准上优于最先进的方法,最高提升7.64个绝对点。此外,分析显示Agent-Radar在智能体数量和交互轮次增加时仍然有效且鲁棒。最后,消融研究表明Agent-Radar的核心组件对性能至关重要,且在不同设置下具有泛化性。

English abstract

LLM-based multi-agent systems have demonstrated remarkable performance on complex tasks through collaborative reasoning. However, these systems tend to rapidly accumulate extremely long conversation histories during interaction. As conversations lengthen, relevant information is increasingly diluted by irrelevant context, leading to degraded performance. In this work, we present Agent-Radar, a training-free context management method that dynamically steers each agent's attention toward relevant context with a novel temporal and spatial decay mechanism. Our experiments demonstrate that Agent-Radar outperforms state-of-the-art methods across five different benchmarks, yielding gains of up to 7.64 absolute points. Furthermore, our analysis shows that Agent-Radar remains effective and robust as the number of agents and interaction rounds increases. Finally, the ablation study shows that core components in Agent-Radar are crucial to performance and generalizable in different settings.

2605.30135 2026-05-29 cs.LG cs.AI Updated version

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning

DAMEL: 双轴多专家学习用于类别不平衡学习

Hyuck Lee, Taemin Park, Heeyoung Kim

Affiliations * AI Research, Krafton; Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)

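AI summary Proposes DAMEL, a dual-axis multi-expert learning algorithm that ensembles multiple experts along the representation and time axes to reduce both prediction bias and variance in class-imbalanced learning.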

Details
AI Chinese abstract

针对来自具有长尾分布的真实世界数据的类别不平衡学习所带来的挑战,已有多种算法被提出。这些算法通过重平衡技术减少了预测偏差,但通常以增加预测方差为代价。一些多专家学习算法旨在解决这一方差问题,但涉及复杂的过程。我们提出了一种新的多专家学习算法,称为双轴多专家学习(DAMEL),该算法通过沿表示轴和时间轴使用多个专家来同时降低预测的偏差和方差。沿表示轴,DAMEL拼接多个专家的表示,并同时使用拼接后的表示训练一个辅助的平衡分类器。沿时间轴,DAMEL聚合跨训练时期的网络权重,并在测试时使用这些聚合权重。实验结果表明,DAMEL同时降低了预测的偏差和方差,突显了其在类别不平衡学习中的有效性。

English abstract

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called the dual-axis multi-expert learning (DAMEL), which reduces both bias and variance of predictions by using multiple experts along both representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both bias and variance of predictions, highlighting its effectiveness in class-imbalanced learning.
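The two axes described above can be sketched with two small helpers: one that concatenates expert representations, and one that keeps a running average of network weights across epochs. The abstract does not say how the weights are aggregated, so the uniform running mean is an assumption (an exponential moving average would be another reading), and `concat_representations` and `WeightAverager` are hypothetical names.

```python
def concat_representations(expert_features):
    """Representation axis: join the feature vectors of several experts
    into one vector that a balanced classifier could consume."""
    joined = []
    for features in expert_features:
        joined.extend(features)
    return joined

class WeightAverager:
    """Time axis: uniform running mean of parameter vectors over epochs."""
    def __init__(self):
        self.count = 0
        self.mean = None

    def update(self, weights):
        self.count += 1
        if self.mean is None:
            self.mean = list(weights)
        else:
            # Incremental mean: m_new = m + (w - m) / n
            self.mean = [m + (w - m) / self.count
                         for m, w in zip(self.mean, weights)]
        return self.mean

# Toy usage with made-up numbers.
joined = concat_representations([[1.0, 2.0], [3.0]])
averager = WeightAverager()
averager.update([1.0, 2.0])                  # weights after epoch 1
test_weights = averager.update([3.0, 4.0])   # weights after epoch 2
```

At test time the averaged weights would replace the final-epoch weights, which is the step the abstract describes as employing the aggregated weights during testing.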

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG Updated version

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

PARCEL: 基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

Affiliations * Max Planck Institute for Informatics; Google

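AI summary Proposes PARCEL, a visual tokenization architecture that uses pool anchoring and conditioned elastic query resampling to resolve the conflict between spatial and query representations in visual-token compression, improving the performance-efficiency Pareto frontier across 27 benchmarks.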

Comments 33 pages, 4 figures

Details
AI Chinese abstract

大型视觉-语言模型(LVLMs)将视觉输入映射为密集的令牌序列,导致推理时的二次计算瓶颈。弹性视觉令牌压缩通过训练单一模型以在多个视觉令牌预算下运行来解决这一问题。然而,现有方法在激进压缩下表现不佳。空间压缩(如嵌套池化)表现为不完美的低通滤波器,并引起频谱混叠,掩盖了细粒度细节。查询压缩(如嵌套查询重采样)用非局部摘要替代显式的网格对齐令牌,显著降低了空间定位能力。为解决这一表示冲突,我们引入了PARCEL(基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解),一种视觉分词架构,动态分配特征提取的工作。PARCEL将空间池令牌建立为低频布局锚点,并通过池条件查询重采样使弹性查询令牌依赖于这些锚点。这鼓励查询令牌专注于互补的视觉特征,而非冗余的空间映射。在27个基准上的广泛评估表明,PARCEL改进了性能-效率帕累托前沿,在各种视觉令牌预算下持续优于现有的嵌套基线,同时保留了“一次训练,随处部署”的范式。

English abstract

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30117 2026-05-29 cs.AI 版本更新

VLA-Trace: Diagnosing Vision-Language-Action Models through Representation and Behavior Tracing

VLA-Trace: 通过表示与行为追踪诊断视觉-语言-动作模型

Haoyuan Shi, Xiancong Ren, Yingji Zhang, Qinfan Zhang, Jiayu Hu, Haozhe Shan, Han Dong, Jinpeng Lu, Yinda Chen, Yi Zhang, Yong Dai, Xiaozhu Ju

发表机构 * University of Science and Technology of China(中国科学技术大学) University of Manchester(曼彻斯特大学) Beihang University(北京航空航天大学) Fudan University(复旦大学) University of New South Wales(新南威尔士大学)

AI总结 提出VLA-Trace诊断框架,通过表示演化、因果控制归因和行为表现分析,揭示VLA模型在多模态知识向具身控制转化中的机制,发现不同模型在微调适应、多模态路由和语义遵循上的差异与局限。

详情
AI中文摘要

理解视觉-语言-动作(VLA)模型如何将多模态知识转化为具身控制仍然是一个开放的挑战。我们提出了VLA-Trace,一个渐进式诊断框架,通过从表示动态到因果控制归因再到行为表现的统一证据链来分析VLA模型。具体而言,它结合了跨模态与检查点漂移两种中心核对齐(CKA)分析来追踪表示演化,利用注意力阻断干预来识别模态特定的控制通路,并借助 rollout 级别的行为探针来检查落地能力(grounding)、捷径依赖和语义遵循。在 $π_{0.5}$ 和 OpenVLA 上的实验揭示了三个关键发现。第一,两个模型在 VLA 微调期间表现出不同的模态特定适应动态。第二,它们在动作解码期间依赖于不同的多模态路由策略和层间依赖关系。第三,尽管 VLA 策略在视觉引导的轨迹生成方面表现出色,但在细粒度语义遵循方面仍然有限。这些发现指出了表示保持适应、因果 VLA 回路和组合语义控制的未来方向。

英文摘要

Understanding how Vision-Language-Action (VLA) models transform multimodal knowledge into embodied control remains an open challenge. We present VLA-Trace, a progressive diagnostic framework that analyzes VLA models through a unified evidence chain from representation dynamics to causal control attribution and behavioral manifestation. It specifically combines cross-modal and checkpoint-drift centered kernel alignment (CKA) to trace representation evolution, attention knockout interventions to identify modality-specific control pathways, and rollout-level behavioral probes to examine grounding, shortcut dependence, and semantic following. Experiments on $π_{0.5}$ and OpenVLA reveal three key findings. First, the two models exhibit distinct modality-specific adaptation dynamics during VLA finetuning. Second, they rely on different multimodal routing strategies and layer-wise dependencies during action decoding. Third, although VLA policies excel at visually grounded trajectory generation, they remain limited in fine-grained semantic following. These findings highlight future directions for representation-preserving adaptation, causal VLA circuits, and compositional semantic control.

2605.30111 2026-05-29 cs.CV cs.AI 版本更新

xModel-KD: Cross-modal Knowledge Distillation for 3D Scene Perception using LiDAR

xModel-KD:基于LiDAR的3D场景感知跨模态知识蒸馏

Thenukan Pathmanathan, Kanchan Keisham, Thangarajah Akilan

发表机构 * Dept. of Computer Science, Lakehead University, Thunder Bay, Canada; School of Computer Science Engg. & Info. Systems, Vellore Institute of Technology, Tamil Nadu, India; Dept. of Software Engg., Lakehead University, Thunder Bay, Canada

AI总结 提出跨模态知识蒸馏框架xModel-KD,通过对比学习对齐2D图像纹理与3D点云几何特征,在无额外标注下提升LiDAR点云分割性能。

Comments 3 figures, and 5 tables

详情
AI中文摘要

点云分割是3D场景理解中的基础任务。其进展受到密集3D标注高成本和高时间的限制,导致标注样本难以获取。除了标注稀缺,不同感知模态面临固有局限性。2D图像提供丰富的纹理和外观线索,但缺乏明确的深度和几何结构。相比之下,3D点云捕捉精确的空间几何,但稀疏且不含纹理信息。因此,依赖单一模态限制了所学表示的丰富性并削弱了泛化能力。尽管最近结合3D点云与2D图像的多模态方法在分类和检索等任务中表现出色,但它们通常依赖大规模标注数据集,且尚未充分用于数据高效的密集预测。为解决这些限制,我们提出一种新颖的跨模态知识蒸馏框架xModel-KD,用于3D点云分割。我们的方法通过跨模态对齐学习统一的逐点表示,利用2D纹理和3D几何的互补优势。具体而言,我们设计了一个跨模态融合编码器,通过对比目标训练,强制多视图下对应的2D和3D表示之间的特征一致性。通过将强大的预训练骨干与有针对性的融合策略相结合,所提框架有效地将图像的外观线索迁移到几何感知的点特征中。实验结果表明,跨模态融合在mIoU上比仅使用LiDAR的基线实现了2%的绝对提升,证明了利用互补多模态信息进行可扩展和标注高效的3D场景理解的优势。

英文摘要

Point cloud segmentation is a fundamental task in 3D scene understanding. Its progress is constrained by the high cost and time required for dense 3D annotations, making labeled samples difficult to obtain. Beyond annotation scarcity, different sensing modalities face inherent limitations. 2D images provide rich texture and appearance cues, yet they lack explicit depth and geometric structure. In contrast, 3D point clouds capture accurate spatial geometry but are sparse and contain no texture information. As a result, relying on a single modality restricts the richness of learned representations and weakens generalization. Although recent multi-modal methods that combine 3D point clouds with 2D images have demonstrated strong performance in tasks such as classification and retrieval, they typically depend on large-scale labeled datasets and have not been fully exploited for data-efficient dense prediction. To address these limitations, we propose a novel cross-modal knowledge distillation framework, xModel-KD, for 3D point cloud segmentation. Our method exploits the complementary strengths of 2D texture and 3D geometry by learning unified per-point representations through cross-modal alignment. Specifically, we design a cross-modal fusion encoder trained with a contrastive objective that enforces feature consistency between corresponding 2D and 3D representations across multiple views. By integrating powerful pre-trained backbones with a targeted fusion strategy, the proposed framework effectively transfers appearance cues from images to geometry-aware point features. Experimental results show that cross-modal fusion achieves a 2% absolute improvement in mIoU over a LiDAR-only baseline, demonstrating the benefit of leveraging complementary multi-modal information for scalable and annotation-efficient 3D scene understanding.

2605.30102 2026-05-29 cs.MA cs.AI 版本更新

When Cloud Agents Meet Device Agents: Lessons from Hybrid Multi-Agent Systems

当云端智能体遇到设备端智能体:混合多智能体系统的经验教训

Corrado Rainone, Davide Belli, Bence Major, Arash Behboodi

发表机构 * Qualcomm(高通)

AI总结 本文系统研究混合多智能体系统(结合设备端小模型和云端大模型)的设计空间,分析不同设计选择对功耗、成本和性能帕累托前沿的影响,发现最优架构高度依赖任务且前沿计算并不总能带来更好性能。

Comments 30 pages, 16 figures. Accepted to the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

智能体AI推理的设计空间涵盖两个极端:前沿大语言模型(LLM),通常托管在云端,在广泛任务上提供强性能但成本高昂;以及更具成本效益的小语言模型(SLM),适合设备端推理。结合设备端和云端模型的混合多智能体系统(MAS)提供了一种有前景的中间地带,但它们也引入了一个复杂且理解不足的设计空间,其中任务准确性、货币成本和边缘能耗紧密耦合;在缺乏通用设计原则的情况下,混合组件虽然并非最普遍的选择,但通常通过针对特定领域的临时决策引入。在这项工作中,我们更系统地审视了这一设计空间。我们调整了两种代表性的MAS架构以支持混合推理,并研究了单个设计选择如何沿着功耗、成本和性能的帕累托前沿移动工作点。我们的发现描绘了混合MAS设计的细致图景:虽然SLM可以有效受益于LLM的协助,但最优架构高度依赖任务,且更大的前沿计算并不总能转化为更好的性能。

英文摘要

The design space of agentic AI inference spans two extremes: frontier large language models (LLMs), typically hosted in the cloud and offering strong performance across a wide range of tasks at substantially high cost, and more cost-efficient small language models (SLMs), which are amenable to on-device inference. Hybrid multi-agent systems (MASs) combining on-device and cloud models offer a promising middle ground, but they also introduce a complex and poorly understood design space in which task accuracy, monetary cost, and edge energy consumption are tightly coupled; in the absence of general design principles, hybrid components, although not the most prevalent choice, are typically introduced through ad hoc decisions tailored to specific domains. In this work, we examine this design space more systematically. We adapt two representative MAS architectures to support hybrid inference and study how individual design choices shift the operating point along the Pareto frontier of power, cost, and performance. Our findings paint a nuanced picture of hybrid MAS design: while SLMs can effectively benefit from LLM assistance, the optimal architecture is highly task-dependent, and greater frontier-level compute does not consistently translate to better performance.

2605.30096 2026-05-29 cs.CR cs.AI 版本更新

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

AI攻击者对固定脆弱目标的可靠性如何?LLM渗透测试一致性的400次运行实证研究

Galip Tolga Erdem

发表机构 * Independent Researcher(独立研究者)

AI总结 通过400次自主渗透测试运行(4个模型各100次),研究LLM在固定目标上攻击行为的一致性,发现模型间成功率差异显著且失败模式独特。

Comments 41 pages, 7 figures. Code and 400-run dataset: https://doi.org/10.5281/zenodo.20421592

详情
AI中文摘要

大型语言模型(LLM)可以自主进行多阶段网络攻击,但其在重复试验下攻击行为的一致性尚未被研究。本文首次对LLM攻击一致性进行了大规模实证测量:针对托管OWASP Juice Shop和另外两个脆弱服务的相同蜜罐,进行了400次自主渗透测试运行(4个模型,各100次),保持提示、编排器和目标不变。没有模型发出在编排器第0-1次迭代的一次性授权重新提示后仍存在的拒绝内容。Claude Sonnet 4的API调用确实遇到了上游服务不可用——在记录的Anthropic容量事件期间,1135次调用中有91次返回HTTP 529 overloaded_error,导致100次Claude运行中有39次被截断。早期草稿将这些归类为安全拒绝;在完整日志审计后,它们是上游API故障,而非模型级拒绝。尽管如此,Claude在100次运行中有61次实现了完全利用;Gemini 2.5 Flash-Lite为85次;GPT-4o-mini为56次,同时部署了98种独特的攻击策略;qwen2.5-coder:14b为25次。失败模式因模型而异:Claude因API截断(39次运行),qwen因过早完成(52次),GPT-4o-mini因迭代预算耗尽(23次)。跨服务凭据重用仅出现在保留最多对话历史的配置中(qwen 57%,GPT-4o-mini 49%,云模型在5次交换窗口内为0%)。跨模型利用率的差异具有统计学显著性(p < 0.001),效应量大;qwen与Gemini的SQL注入率差异的Cohen's h = 1.12。首次利用时间落在15-30秒的挂钟时间范围内。据我们所知,这是首个在N=100每模型下测量跨多服务目标的自主LLM攻击行为的研究。

英文摘要

Large language models (LLMs) can autonomously conduct multi-stage cyber attacks, but the consistency of their offensive behavior under repeated trials remains unstudied. This work presents the first large-scale empirical measurement of LLM attack consistency: 400 autonomous penetration testing runs (4 models, 100 each) against an identical honeypot hosting OWASP Juice Shop and two additional vulnerable services, holding prompt, orchestrator, and target constant. No model emitted a content refusal that survived the orchestrator's one-shot authorization re-prompt at iterations 0-1. Claude Sonnet 4's API calls did encounter upstream service unavailability - 91 of 1,135 calls returned HTTP 529 overloaded_error during a documented Anthropic capacity event, truncating 39 of 100 Claude runs. An earlier draft catalogued these as safety refusals; on full-log audit they are upstream API failures, not model-level refusals. Despite this, Claude achieved full exploitation in 61 of 100 runs; Gemini 2.5 Flash-Lite in 85; GPT-4o-mini in 56 while deploying 98 unique attack strategies; qwen2.5-coder:14b in 25. Failure modes are model-distinctive: Claude through API truncation (39 runs), qwen through premature completion (52), GPT-4o-mini through iteration-budget exhaustion (23). Cross-service credential reuse appeared only in configurations retaining the most conversation history (qwen 57%, GPT-4o-mini 49%, cloud models 0% on 5-exchange windows). Cross-model exploitation rate differences are statistically significant (p < 0.001) with large effect sizes; qwen vs. Gemini SQL injection rates differ at Cohen's h = 1.12. First-exploit timing fell within a 15-30 second wall-clock range. To our knowledge, this is the first study to measure autonomous LLM attack behavior at N=100 per model across a multi-service target.
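The effect size quoted above, Cohen's h, compares two proportions after an arcsine transformation. The sketch below applies the standard formula to the full-exploitation rates stated in the abstract. It is only an illustration: the abstract's h = 1.12 refers to SQL injection rates, which are not given here, and the function and dictionary names are assumptions made for this example.

```python
import math


def cohens_h(p1: float, p2: float) -> float:
    """Cohen's h for two proportions: 2*asin(sqrt(p1)) - 2*asin(sqrt(p2))."""
    for p in (p1, p2):
        if not 0.0 <= p <= 1.0:
            raise ValueError("proportions must lie in [0, 1]")
    return 2.0 * math.asin(math.sqrt(p1)) - 2.0 * math.asin(math.sqrt(p2))


# Full-exploitation rates from the abstract (successes out of 100 runs each).
# The dictionary keys are illustrative labels, not identifiers from the study.
full_exploit_rate = {
    "gemini": 0.85,
    "claude": 0.61,
    "gpt4o_mini": 0.56,
    "qwen": 0.25,
}

h_gemini_vs_qwen = cohens_h(full_exploit_rate["gemini"], full_exploit_rate["qwen"])
print(f"Cohen's h (Gemini vs qwen, full exploitation): {h_gemini_vs_qwen:.2f}")
```

By Cohen's usual rule of thumb, an absolute h of about 0.8 or more counts as a large effect.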

2605.30094 2026-05-29 cs.AI cs.GT 版本更新

PokerSkill: LLMs Can Play Expert-Level Poker without Training or Solvers

PokerSkill: 无需训练或求解器,大语言模型可达到专家级扑克水平

Boning Li, Baoxiang Wang, Longbo Huang

发表机构 * IIIS, Tsinghua University(清华大学交叉信息研究院) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出PokerSkill框架,通过规则驱动的技能库约束大语言模型动作,无需训练或求解器即可在扑克中达到接近GTO水平的性能。

Comments 45 pages, 3 figures

详情
AI中文摘要

扑克是人工智能的一个标志性挑战。主流方法依赖于基于反事实遗憾最小化的均衡求解器,需要数百万核心小时的训练。大语言模型(LLMs)拥有广泛的扑克知识,但当被要求直接游戏时,其表现远低于基于求解器的智能体。传统的基于规则的扑克智能体是可解释且无需训练的,但其策略上限仍远低于均衡玩法。我们提出了PokerSkill,一个无需训练且无需求解器的框架,通过使用详细的基于规则的扑克技能作为LLMs的结构化动作基础接口来弥合这一差距。一个确定性上下文引擎分析当前状态,并从完全由人类扑克专家设计的分层技能库中仅检索相关片段,将LLM的选择限制在合理动作内。针对最先进的GTO基准GTOWizard,使用PokerSkill的GPT-5.5 XHigh达到$-57 \pm 21$ mbb/hand,Claude Opus 4.6达到$-80 \pm 29$ mbb/hand,Claude Opus 4.7达到$-87\pm 64$ mbb/hand,相比默认提示基线减少了49-61%的损失,并优于强机器人Slumbot。我们的关键发现是,仅靠基于规则的技能不足以构成强大策略,仅靠LLM也无法良好游戏,但它们的结合产生了一个既不需要训练也不需要求解器访问,却能媲美基于数百万核心小时计算构建的系统的智能体。据我们所知,这是首次证明LLM在复杂不完美信息游戏中无需特定游戏训练或求解器查询即可达到竞争性能。代码可在https://github.com/lbn187/PokerSkill获取。

英文摘要

Poker is a landmark challenge for artificial intelligence. The dominant approach relies on equilibrium solvers built on counterfactual regret minimization, requiring millions of core-hours of training. Large Language Models (LLMs) possess extensive poker knowledge but perform far below solver-based agents when asked to play directly. Traditional rule-based poker agents are interpretable and training-free, but their strategic ceiling remains far below equilibrium play. We introduce PokerSkill, a training-free and solver-free framework that bridges this gap by using detailed rule-based poker skills as a structured action-grounding interface for LLMs. A deterministic context engine analyzes the current state and retrieves only the relevant fragments from a layered skill library, which is entirely designed by human poker experts, constraining the LLM's choice to reasonable actions. Against GTOWizard, a state-of-the-art GTO benchmark, GPT-5.5 XHigh with PokerSkill achieves $-57 \pm 21$ mbb/hand, Claude Opus 4.6 achieves $-80 \pm 29$ mbb/hand and Claude Opus 4.7 achieves $-87\pm 64$ mbb/hand, reducing losses by 49-61% compared to default-prompt baselines and outperforming the strong bot Slumbot. Our key finding is that rule-based skills alone do not constitute a strong strategy, and LLMs alone cannot play well, but their combination yields an agent that requires neither training nor solver access yet competes with systems built on millions of core-hours of computation. To our knowledge, this is the first demonstration of an LLM achieving competitive performance in a complex imperfect-information game without game-specific training or solver queries. Code is available at https://github.com/lbn187/PokerSkill.

2605.30087 2026-05-29 cs.AI 版本更新

Selective QA over Conflicting Multi-Source Personal Memory: A Diagnostic Testbed and Method Comparison

冲突多源个人记忆上的选择性问答:诊断性测试平台与方法比较

Tiancheng Yang, Matthias Schonlau, Ilia Sucholutsky

AI总结 针对多源冲突记忆的选择性问答问题,构建了包含34,560个实例的诊断基准,评估了多种方法,发现结构化融合方法在准确性和选择性上优于纯提示LLM。

Comments 55 pages, 5 figures

详情
AI中文摘要

新兴的个人AI代理正朝着持久、多源记忆的方向发展。这带来了一个评估问题:系统必须决定如何使用冲突或不完整的证据;它们不能仅从一个干净的历史中检索事实。现有的基准很少能显示错误是来自提供给方法的证据还是来自方法的冲突解决步骤。我们将此研究为冲突多源个人记忆上的选择性问答:系统基于冲突的、有时不完整的来源进行回答,或者在证据不足时放弃回答。我们开发了一个基准,包含8种推理类型下的18个问题模板、480个角色、4个随机种子和34,560个实例,具有受控的来源扭曲和确定性的真实答案。我们评估了无法访问任何来源、访问单一来源、结构化融合方法以及前沿LLM的基线性能。最佳训练融合解析器达到80.3%的准确率,而最强的纯提示LLM基线达到70.0%。在允许弃权的情况下,同一解析器在78.3%的覆盖率下达到85.3%的选择性准确率,最佳LLM在95.4%的覆盖率下达到71.0%的选择性准确率。不同模型在不同推理类型上具有不同的优势。我们发布了数据、代码、缓存的模型输出以及数据生成过程以供复用。

英文摘要

Emerging personal AI agents are moving toward persistent, multi-source memory. This creates an evaluation problem: systems must decide how to use conflicting or incomplete evidence; they cannot just retrieve facts from one clean history. Existing benchmarks rarely show whether an error came from the evidence given to a method or from the method's conflict-resolution step. We study this as selective QA over conflicting multi-source personal memory: systems answer based on conflicting, sometimes incomplete sources, or abstain when evidence is insufficient. We develop a benchmark containing 18 question templates across 8 reasoning types, 480 personas, 4 random seeds, and 34,560 instances, with controlled source distortions and deterministic ground truth. We evaluate the performance of baselines without access to any source, access to a single source, structured fusion methods, and frontier LLMs. The best trained fusion resolver reaches 80.3% accuracy, while the strongest prompt-only LLM baseline reaches 70.0%. With abstention, the same resolver reaches 85.3% selective accuracy at 78.3% coverage and the best LLM reaches 71.0% selective accuracy at 95.4% coverage. Different models have different strengths across reasoning types. We release the data, code, cached model outputs, and data-generating process for reuse.
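The two abstention metrics reported above can be computed as follows. This is a minimal sketch under the usual definitions (coverage is the fraction of questions answered; selective accuracy is the accuracy on answered questions only). The function name, the use of `None` to mark abstention, and the toy data are assumptions made for illustration and are not taken from the benchmark's released code.

```python
from typing import Optional, Sequence, Tuple


def selective_metrics(
    predictions: Sequence[Optional[str]], gold: Sequence[str]
) -> Tuple[float, float]:
    """Return (coverage, selective_accuracy); a prediction of None means abstain."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    answered = [(p, g) for p, g in zip(predictions, gold) if p is not None]
    coverage = len(answered) / len(gold) if gold else 0.0
    correct = sum(1 for p, g in answered if p == g)
    selective_accuracy = correct / len(answered) if answered else 0.0
    return coverage, selective_accuracy


# Toy example: four questions, one abstention, two of three answers correct.
toy_predictions = ["a", None, "b", "c"]
toy_gold = ["a", "x", "b", "d"]
toy_coverage, toy_selective_accuracy = selective_metrics(toy_predictions, toy_gold)
print(toy_coverage, toy_selective_accuracy)
```

Under these definitions, coverage times selective accuracy is the share of all instances answered correctly. For the figures quoted in the abstract this gives roughly 0.783 × 0.853 ≈ 0.668 for the trained resolver and 0.954 × 0.710 ≈ 0.677 for the best LLM.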

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML 版本更新

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

发表机构 * Department of Electrical & Computer Engineering, Rice University(莱斯大学电气与计算机工程系)

AI总结 提出CROP方法,通过保形校准选择阈值,返回最长无错前缀,并控制错误包含概率,平衡保留有效推理与丢弃误导后缀。

Comments Code available at https://github.com/matthewyccheung/crop

详情
AI中文摘要

语言模型推理轨迹很少是全有或全无;在关键错误发生之前,它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应,未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题,我们引入了CROP(保形推理输出前缀),一种与验证器无关的校准程序,用于干净前缀认证。给定任何步骤级风险代理,CROP选择一个校准阈值,并返回其步骤风险代理保持低于该阈值的最长连续前缀,将未认证的后缀路由到下游审查或修复。假设可交换性,CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上,我们证明了标准步骤级指标(如AUROC)不能完全捕捉前缀效用,建议验证器应改为通过认证前缀长度进行评估。此外,CROP平衡了过度扣留与扣留不足,通过保留有效的中间推理同时丢弃误导后缀,提高了下游修复的准确性。最终,这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

英文摘要

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
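The thresholding idea described above can be sketched with a generic split-conformal calibration. This is not necessarily the authors' exact procedure. It assumes each calibration trace comes with per-step risk scores and the index of its first annotated error (or `None` if the trace is clean), and that a prefix keeps steps whose risk is strictly below the threshold. All names and numbers below are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple

# (step risk scores, index of first annotated error or None for a clean trace)
Trace = Tuple[Sequence[float], Optional[int]]


def error_inclusion_score(risks: Sequence[float], first_error: Optional[int]) -> float:
    """Smallest threshold level at which the kept prefix would reach the first error.

    The prefix includes the erroneous step only if every risk up to and including
    that step is below the threshold, i.e. the threshold exceeds this maximum.
    """
    if first_error is None:
        return math.inf  # a clean trace can never contribute an error
    return max(risks[: first_error + 1])


def calibrate_threshold(calibration: Sequence[Trace], alpha: float) -> float:
    """Pick the k-th smallest score with k = floor(alpha * (n + 1))."""
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be in (0, 1)")
    scores = sorted(error_inclusion_score(r, e) for r, e in calibration)
    k = math.floor(alpha * (len(scores) + 1))
    if k < 1:
        return -math.inf  # too few calibration traces: certify nothing
    return scores[k - 1]


def certified_prefix_length(risks: Sequence[float], threshold: float) -> int:
    """Length of the longest leading run of steps with risk strictly below threshold."""
    length = 0
    for risk in risks:
        if risk >= threshold:
            break
        length += 1
    return length


# Hypothetical calibration set of seven traces (made-up numbers).
calibration = [
    ([0.1, 0.2, 0.9], 2),
    ([0.1, 0.8, 0.3], 1),
    ([0.2, 0.1], None),
    ([0.05, 0.6, 0.7], 1),
    ([0.3, 0.4], 0),
    ([0.1, 0.1, 0.5], 2),
    ([0.2, 0.7], 1),
]
threshold = calibrate_threshold(calibration, alpha=0.25)
kept = certified_prefix_length([0.1, 0.2, 0.6, 0.1], threshold)
print(threshold, kept)
```

With exchangeable traces, a test trace's score falls strictly below the chosen order statistic with probability at most k/(n+1), which is at most alpha. This is the kind of marginal guarantee the abstract refers to, stated here for this simplified scoring rule only.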

2605.30070 2026-05-29 cs.LG cs.AI 版本更新

A Predictive Law for On-Policy Self-Distillation From World Feedback

基于世界反馈的在线自蒸馏预测定律

Tommy He, Jerome Sieber, Matteo Saponati

AI总结 本文发现在线自蒸馏(OPSD)中初始师生性能差距与最终性能改进之间存在线性关系,并提出一种预测定律,用于在训练前预测OPSD配置的效果。

详情
AI中文摘要

超越简单的标量奖励,向更丰富的世界反馈迈进,是实现更可扩展的RL后训练的自然路径。在线自蒸馏(OPSD)是一种有前景的最新方法,它使用任意反馈作为学习信号,但其与GRPO等成熟方法相比的可靠性仍不清楚。我们发现了OPSD中初始学生-教师性能差距与最终性能改进之间存在惊人的一致线性相关性。这种关系在不同上下文类型和模型家族中均成立,为预测OPSD配置的结果提供了一种强大的预测定律,而无需运行完整的训练过程。有趣的是,我们表明这种线性可预测性随模型规模成立,这为具有更强上下文学习能力的大型模型上新的经验缩放定律提供了潜在基础。本质上,我们的发现表明,OPSD性能可以在训练前进行预测和调整,为将世界反馈作为后训练流水线的一等组件提供了一种原则性方法。

英文摘要

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.
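To make the notion of a linear predictive law concrete, the sketch below fits an ordinary least-squares line relating the initial performance gap to the final improvement, then uses it to predict the improvement for a new configuration. The data points are synthetic placeholders chosen to lie exactly on a line. They are not results from the paper, and the function names are assumptions.

```python
from typing import Sequence, Tuple


def fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Ordinary least-squares fit y ~ slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two paired observations")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0.0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# SYNTHETIC placeholder data: initial gap vs. final improvement from pilot runs.
initial_gap = [0.05, 0.10, 0.20, 0.30]
final_improvement = [0.035, 0.060, 0.110, 0.160]

slope, intercept = fit_line(initial_gap, final_improvement)


def predict_improvement(gap: float) -> float:
    """Predicted improvement for an untrained configuration with the given gap."""
    return slope * gap + intercept


print(slope, intercept, predict_improvement(0.25))
```

In practice the gap would be measured before training, and the fitted line would only be trusted within the range of gaps it was fitted on.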

2605.30054 2026-05-29 cs.SE cs.AI 版本更新

Projectional Decoding: Towards Semantic-Aware LLM Generation

投影式解码:迈向语义感知的LLM生成

Boqi Chen, José Antonio Hernández López, Aren A. Babikian

发表机构 * University of Ottawa(渥太华大学) University of Murcia(穆尔西亚大学) University of Toronto(多伦多大学)

AI总结 提出投影式解码框架,通过维护部分图模型作为主要工件表示,实现增量语义验证和错误检测,以提升LLM生成工件的语义有效性。

Comments 5 pages, 3 figures. Accepted at FSE 2026 IVR track

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于跨许多软件工程(SE)任务生成软件工件,然而确保这些工件的语义有效性仍然是一个基本挑战。现有的约束解码技术可以强制执行语法正确性,并且在某些情况下强制执行特定的语义规则,但缺乏一种通用表示,能够将LLM生成的文本与SE中语义验证所需的推理联系起来。在本文中,我们提出了投影式解码,一种新颖的概念框架,通过在整个生成过程中与文本一起维护部分图模型作为主要工件表示,直接将领域语义集成到生成过程中。这种抽象表示通过显式捕获不确定性并原生支持错误检测,实现增量语义验证,同时引导生成朝向具有可证明保证的语义有效输出。我们在一个程序生成任务上展示了初步结果,证明了这种方法在提高LLM生成工件的语义有效性方面的潜力。我们还讨论了投影式解码如何能够在各种SE活动中实现与LLM的可验证自动化。

英文摘要

Large language models (LLMs) are increasingly used to generate software artifacts across many software engineering (SE) tasks, yet ensuring the semantic validity of these artifacts remains a fundamental challenge. Existing constrained decoding techniques can enforce syntactic correctness and, in some cases, specific semantic rules, but lack a general representation that bridges LLM-generated text with the reasoning required for semantic validation in SE. In this paper, we propose projectional decoding, a novel conceptual framework that integrates domain semantics directly into the generation process by maintaining, alongside text, a partial graph model as the primary artifact representation throughout generation. This abstract representation enables incremental semantic validation by explicitly capturing uncertainty and natively supporting error detection, while guiding generation toward semantically valid outputs with provable guarantees. We present preliminary results on a program generation task which demonstrate the potential of this approach to improve the semantic validity of LLM-generated artifacts. We also discuss how projectional decoding can enable verifiable automation with LLMs across various SE activities.

2605.30052 2026-05-29 cs.SE cs.AI cs.CL 版本更新

REPOT: Recoverable Program-of-Thought via Checkpoint Repair

REPOT:通过检查点修复实现可恢复的思维程序

Parsa Mazaheri

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校)

AI总结 提出 RePoT 方法,通过确定性验证重放和 LLM 调用从验证前缀恢复,以解决 Program-of-Thought 中单个无效动作导致轨迹失效的问题,在多个模型和基准上提升成功率。

详情
AI中文摘要

单次 Program-of-Thought (PoT) 生成一个打印基本动作计划的 Python 程序;单个无效动作会无声地使轨迹失效。我们引入 RePoT (可恢复 PoT):先进行确定性的验证重放,让计划在环境中逐步执行直到第一个无效转换,然后通过一次 LLM 调用从已验证的前缀继续。仅在 PoT 失败的约 14% 的问题上,RePoT 才需要至多一次额外的 LLM 调用。在 PuzzleZoo-775 上,RePoT 在四种闭源模型配置上比 PoT 提高 +3 到 +11 个百分点,在 gpt-5.4-mini-medium 上达到 96.9% 对比 86.3% 的峰值;与预算匹配的 PoT-retry 基线相比,RePoT 在 Gemini 上明显获胜(+3.8pp,95% CI [+2.2,+5.4]),在 GPT-medium 和 Claude 上处于采样噪声范围内,在 GPT-mini 上落败;这是一种随模型能力变化的模式,我们开始通过自适应 RePoT 解决,这是一种基于规则的调度器,根据验证前缀长度在后缀修复和全新 PoT 重试之间路由(初步)。我们在 PlanBench Blocksworld 上复现了这一结果(+1.1 到 +11.4pp),并在四个开放权重模型上复现(其中三个提升 +3.3 到 +20.0pp)。在 Derail-550(我们的受控恢复基准)上,每个能够访问检查点信息的条件在 GPT-medium 上达到 >=30%,在 Gemini 上达到 >=70%,而仅错误反馈条件 <=3.1%,表明检查点信息(而非特定的验证前缀尾部)是承载恢复的信号。

英文摘要

One-shot Program-of-Thought (PoT) emits a Python program that prints a primitive-action plan; a single invalid action silently invalidates the trajectory. We introduce RePoT (Recoverable PoT): a deterministic verified replay that walks the plan through the environment to its first invalid transition, then one LLM call that resumes from the verified prefix. RePoT costs at most one extra LLM call on the ~14% of problems where PoT fails. RePoT beats PoT by +3 to +11pp across four closed-model configurations on PuzzleZoo-775 and peaks at 96.9% vs 86.3% on gpt-5.4-mini-medium; against the matched-budget PoT-retry baseline, RePoT wins decisively on Gemini (+3.8pp, 95% CI [+2.2,+5.4]), is within sampling noise on GPT-medium and Claude, and loses on GPT-mini -- a capability-scaling pattern we begin to address with Adaptive RePoT, a rule-based dispatcher that routes between suffix repair and a fresh PoT retry based on verified-prefix length (preliminary). We replicate on PlanBench Blocksworld (+1.1 to +11.4pp) and on four open-weights models (+3.3 to +20.0pp on three of four). On Derail-550, our controlled recovery benchmark, every condition with access to checkpoint information clears >=30% on GPT-medium and >=70% on Gemini, vs <=3.1% for error-only feedback -- showing that checkpoint information, not the specific verified-prefix tail, is the load-bearing recovery signal.
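The verified-replay step described above can be sketched as a loop that executes the plan against the environment and stops at the first invalid transition, returning the verified prefix and the checkpoint state. The environment (a four-cell corridor), the `propose_suffix` callback standing in for the single LLM call, and all names are hypothetical. This is an illustration of the idea, not the authors' implementation.

```python
from typing import Callable, List, Optional, Sequence, Tuple

# A transition function returns the next state, or None if the action is invalid.
StepFn = Callable[[int, str], Optional[int]]


def verified_replay(
    initial: int, plan: Sequence[str], step: StepFn
) -> Tuple[List[str], int, Optional[int]]:
    """Replay the plan; return (verified prefix, checkpoint state, first bad index)."""
    state = initial
    for index, action in enumerate(plan):
        next_state = step(state, action)
        if next_state is None:
            return list(plan[:index]), state, index
        state = next_state
    return list(plan), state, None


def repair_plan(
    initial: int,
    plan: Sequence[str],
    step: StepFn,
    propose_suffix: Callable[[int, List[str]], Sequence[str]],
) -> List[str]:
    """Keep the verified prefix and ask once for a new suffix from the checkpoint."""
    prefix, checkpoint, bad_index = verified_replay(initial, plan, step)
    if bad_index is None:
        return prefix  # the whole plan was valid, no extra call needed
    return prefix + list(propose_suffix(checkpoint, prefix))


def corridor_step(position: int, action: str) -> Optional[int]:
    """Toy environment: positions 0..3, actions 'L' and 'R'; leaving the corridor is invalid."""
    delta = {"L": -1, "R": 1}.get(action)
    if delta is None or not 0 <= position + delta <= 3:
        return None
    return position + delta


def stub_suffix(checkpoint: int, prefix: List[str]) -> List[str]:
    """Stand-in for the LLM call: walk right from the checkpoint to cell 3."""
    return ["R"] * (3 - checkpoint)


broken_plan = ["R", "R", "L", "L", "L", "L"]
replay_result = verified_replay(0, broken_plan, corridor_step)
repaired = repair_plan(0, broken_plan, corridor_step, stub_suffix)
print(replay_result, repaired)
```

The sketch does not re-verify the proposed suffix. A fuller version would replay the repaired plan as well before accepting it.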

2605.30049 2026-05-29 cs.AI 版本更新

Robust and Generalizable Safety Steering for Text-to-Image Diffusion Transformers

面向文本到图像扩散Transformer的鲁棒且可泛化的安全引导

Zihao Xue, Yan Wang, Zhen Bi, Long Ma, Zhonglong Zheng, Zeyu Yang, Bingyu Zhu, Longtao Huang, Jie Xiao, Jungang Lou

发表机构 * Huzhou Normal University(湖州师范学院) Alibaba Group(阿里巴巴集团) University of Science and Technology of China(中国科学技术大学) Zhejiang Normal University(浙江师范大学) Zhejiang University of Technology(浙江工业大学)

AI总结 提出SafeDIG框架,通过位置感知稀疏特征迁移实现扩散Transformer的安全引导,在保持源域安全性和图像质量的同时,有效降低目标域和整体不安全生成率。

详情
AI中文摘要

扩散Transformer已成为文本到图像生成的强大骨干网络,但其分层和跨模态生成过程使得安全控制在根本上不同于提示级过滤或输出级检测。有害语义可能在文本表示中弱表达,逐步绑定到视觉潜变量,最终与渲染动态纠缠。因此,在固定层进行安全引导可能不稳定,而从已知风险学习到的引导机制可能无法可靠地迁移到偏移的目标风险域。我们提出SafeDIG,一个将DiT安全适应形式化为位置感知稀疏特征迁移的安全引导框架。SafeDIG首先在功能不同的DiT干预位置上构建稀疏自编码器,并使用鲁棒性感知预训练路由来优先选择在源-目标风险偏移下预期保持稳定的干预站点。然后,通过冻结SAE编码器作为可重用的稀疏安全字典,并仅将解码器适应到目标域激活流形,将可迁移的安全特征与特定领域的激活几何分离。在推理过程中,SafeDIG结合混合和排斥操作,将不安全激活引导至迁移的安全流形或远离有害的稀疏方向。在FLUX.1 Dev和Stable Diffusion 3.5 Large上的实验表明,SafeDIG在保持源域安全性和图像质量的同时,持续降低了目标域和整体的不安全生成率。

英文摘要

Diffusion Transformers have become a powerful backbone for text-to-image generation, but their layered and cross-modal generation process makes safety control fundamentally different from prompt-level filtering or output-level detection. Harmful semantics may be weakly expressed in text representations, progressively bound to visual latents, and finally entangled with rendering dynamics. As a result, safety steering at a fixed layer can be unstable, and a steering mechanism learned from known risks may not transfer reliably to a shifted target risk domain. We propose SafeDIG, a safety steering framework that formulates DiT safety adaptation as position-aware sparse feature transfer. SafeDIG first constructs Sparse Autoencoders over functionally distinct DiT intervention positions and uses robustness-aware pre-training routing to prioritize intervention sites that are expected to remain stable under source-target risk shift. It then separates transferable safety features from domain-specific activation geometry by freezing the SAE encoder as a reusable sparse safety dictionary and adapting only the decoder to the target-domain activation manifold. During inference, SafeDIG combines Blend and Repel operations to steer unsafe activations toward transferred safety manifolds or away from harmful sparse directions. Experiments on FLUX.1 Dev and Stable Diffusion 3.5 Large show that SafeDIG consistently reduces target-domain and overall unsafe generation rates while preserving source-domain safety and image quality.

2605.30046 2026-05-29 cs.LG cs.AI 版本更新

Masked Diffusion Modeling for Anomaly Detection

掩码扩散建模用于异常检测

Lixing Zhang, Yuchen Liang, Liyan Xie

发表机构 * University of Minnesota(明尼苏达大学) Ohio State University(俄亥俄州立大学)

AI总结 提出基于掩码扩散模型的MaskDiff-AD方法,通过重建随机掩码坐标的难度构建异常分数,在分类、混合类型和离散序列数据上实现高效异常检测。

详情
AI中文摘要

异常检测旨在识别偏离名义数据分布的样本,是许多安全关键应用的核心。然而,针对分类、混合类型和离散序列数据开发有效的异常检测方法仍然具有挑战性且相对未被充分探索。掩码扩散模型通过学习从剩余可见上下文中恢复掩码值,为建模此类数据提供了一种自然的方式。在本文中,我们提出了用于异常检测的掩码扩散(MaskDiff-AD),一种基于掩码扩散模型、仅需前向计算的方法,且仅在名义数据上训练。给定测试样本,MaskDiff-AD从随机掩码坐标的重建难度构建异常分数,产生一个直接作用于离散状态空间且避免反向时间采样的内容敏感分数。我们还开发了MaskDiff-AD的非参数变体,并通过在固定检测阈值下表征I型和II型错误提供了理论保证。在来自ADBench和UADAD的十四个分类和混合类型表格数据集,以及来自NLP-ADBench的四个文本异常检测数据集上的实验表明,MaskDiff-AD相对于经典、基于扩散以及最近的表格/文本异常检测基线取得了有竞争力的性能。值得注意的是,MaskDiff-AD达到了最佳总体平均排名,优于所有十二种表格基线方法。

英文摘要

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.

2605.30042 2026-05-29 cs.AI 版本更新

Learning to Choose: An Empowerment-Guided Multi-Agent System with semantic communication for Adaptive Method Selection

学会选择:一种基于赋权与语义通信的自适应方法选择多智能体系统

Geremy Loachamín-Suntaxi, Robert Lazar, Dimitrios G. Giovanis, Ioannis G. Kevrekidis, Eleni D. Koronaki

发表机构 * Faculty of Science, Technology and Medicine(科学、技术与医学学院) University of Luxembourg(卢森堡大学) Johns Hopkins University(约翰霍普金斯大学) Luxembourg Institute of Science and Technology(卢森堡科学与技术研究院)

AI总结 提出一种结合上下文赌博机、结构化智能体间通信和语义检查点的多智能体框架,通过保持动作-结果因果一致性来提升科学计算工作流中自适应决策的收敛性、鲁棒性和泛化能力。

详情
AI中文摘要

自动化科学计算工作流不仅需要生成可执行代码:自主系统还必须选择适当的计算策略,忠实地执行它们,并确保最终结果在因果上可归因于产生它们的决策。在多智能体流水线中,这一过程尤其脆弱,因为智能体意图与行动之间的微小不一致可能导致语义漂移,即最终执行的程序不再反映最初选择的策略,从而破坏下游评估和适应。受ATHENA框架(Toscano等人,2025;Toscano等人,2026)和赋权概念(Yiu等人,2025)的启发,本文引入了一个多智能体框架,该框架将上下文赌博机与结构化智能体间通信相结合,最重要的是,引入了语义检查点以保持整个流水线中行动-结果的一致性。该系统在自适应决策架构中集成了专门的大语言模型(LLM)智能体、有依据的(grounded)代码生成和自修复执行循环。通过赋权的视角解释该框架,我们表明可靠的自主学习不仅需要识别高质量的行动,还需要保持这些行动在智能体间传播的完整性。使用敏感性分析和不确定性量化工作流作为代表性案例研究,我们证明未受约束的语义漂移会损害策略学习,而所提出的框架则提高了收敛性、鲁棒性和对新问题情境的适应能力。这些结果表明了科学多智能体系统的一个更广泛的设计原则:自适应决策必须与明确的机制相结合,以保证整个计算流水线中的语义一致性和可靠信息流。

英文摘要

Automating scientific computing workflows requires more than generating executable code: autonomous systems must also select appropriate computational strategies, implement them faithfully, and ensure that the resulting outcomes remain causally attributable to the decisions that produced them. In multi-agent pipelines, this process is particularly fragile, as small inconsistencies between agent intentions and actions can lead to semantic drift, where the eventually executed procedure no longer reflects the originally selected strategy, thereby corrupting downstream evaluation and adaptation. In this work, motivated by the ATHENA framework (Toscano et al., 2025; Toscano et al., 2026) and the concept of empowerment (Yiu et al., 2025), we introduce a multi-agent framework that combines contextual bandits with structured inter-agent communication and, most importantly, semantic checkpoints that preserve action-outcome fidelity throughout the pipeline. The system integrates specialized large language model (LLM) agents, grounded code generation, and self-healing execution loops within an adaptive decision-making architecture. Interpreting the framework through the lens of empowerment, we show that reliable autonomous learning requires not only identifying high-quality actions, but also preserving the integrity of their propagation across agents. Using sensitivity analysis and uncertainty quantification workflows as representative case studies, we demonstrate that unchecked semantic drift degrades policy learning, whereas the proposed framework improves convergence, robustness, and adaptation to novel problem contexts. These results suggest a broader design principle for scientific multi-agent systems: adaptive decision-making must be coupled with explicit mechanisms that guarantee semantic consistency and reliable information flow across the computational pipeline.

2605.30040 2026-05-29 cs.CR cs.AI cs.CL 版本更新

Token Inflation: How Dishonest Providers Can Overcharge for Large Language Model Usage

Token通胀:不诚实的提供商如何对大型语言模型使用超额收费

Shahinul Hoque, Jinghuai Zhang, Jinyuan Sun, Fnu Suya

发表机构 * University of Tennessee, Knoxville(田纳西大学诺克斯维尔分校) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 研究揭示了基于每token计费的大型语言模型商业服务中,提供商利用审计信任悖论系统性地虚报token数量,导致用户费用大幅增加的问题。

详情
AI中文摘要

按token计费现在是商业大型语言模型(LLM)的标准定价模式,因此报告token数量的诚实性直接影响用户支付的费用。我们表明,这种计费方式在设计上难以审计:提供商隐藏模型、分词器和执行过程以保护其知识产权、缓解越狱攻击并保护用户隐私,这意味着审计员只能检查提供商提供的证明。因此,审计简化为对提供商自身报告的一致性检查。我们称之为信任悖论:每次审计都必须信任某些工件,但当前的框架恰恰信任提供商最有动机操纵的那些工件。我们研究了三个最近的token审计框架,并表明具有普通商业能力的提供商可以系统地虚报计费token数量。在最宽松的设置中,隐藏的推理使用量平均可以膨胀1469%而不被检测到。以当前前沿推理价格计算,这将使同一查询的诚实账单从100美元变成约1569美元。即使当用户可以看到完整的推理字符串时,仅分词歧义就允许在检测阈值以下多报50.85%。这些结果表明问题不在于任何特定的审计器,而在于任何证据来自被审计方的审计。恢复诚实计费需要将报告的token数量与提供商无法控制的证据(例如可信执行证明、推理的加密证明或第三方重新执行)联系起来的验证。

英文摘要

Per-token billing is now the standard pricing model for commercial large language models (LLMs), so the honesty of reported token counts directly affects what users pay. We show that this kind of billing is hard to audit by design: providers hide the model, the tokenizer, and the execution to protect their IP, mitigate jailbreaks, and preserve user privacy, which means an auditor can only inspect proofs the provider supplies. The audit therefore reduces to a consistency check on the provider's own reports. We call this a trust paradox: every audit must trust some artifact, but current frameworks trust exactly the ones a provider has the strongest reason to manipulate. We study three recent token auditing frameworks and show that a provider with ordinary commercial capabilities can systematically inflate billed token counts. In the most permissive setting, hidden reasoning usage can be inflated by 1,469% on average without detection. At current frontier reasoning prices, that turns a \$100 honest bill into roughly a \$1,569 bill on the same query. Even when the user can see the full reasoning string, tokenization ambiguity alone still allows 50.85% over-reporting below the detection threshold. These results suggest the problem is not in any specific auditor but in any audit whose evidence comes from the audited party. Restoring honest billing will require verification that ties reported token counts to evidence the provider does not control, such as trusted execution attestation, cryptographic proofs of inference, or third-party re-execution.
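
The headline figures follow from simple arithmetic. As a rough check, assuming (as the abstract's \$100 example implies) that the bill scales linearly with the reported token count, an inflation of $p$ percent multiplies an honest bill $B$ by $1 + p/100$. The symbols $B$ and $p$ are introduced here for illustration only:

$$B_{\text{inflated}} = B\left(1 + \frac{p}{100}\right), \qquad 100 \times (1 + 14.69) = 1569, \qquad 100 \times (1 + 0.5085) = 150.85.$$

The first product reproduces the abstract's \$1,569 figure. The second is an illustrative extension, not a number quoted in the abstract: under the 50.85% over-reporting bound, the same \$100 bill would come to about \$150.85.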

2605.30038 2026-05-29 cs.LG cs.AI cs.CV 版本更新

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

对齐引导的分数匹配用于扩散模型中的文本到图像对齐

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, South Korea(韩国科学技术院人工智能研究生院)

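AI summary: Proposes a lightweight, reward-free post-training method that integrates contrastive alignment guidance directly into the score-matching objective of diffusion models, addressing over-penalization and counting errors in text-image alignment.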

Comments ICML 2026, Project page: https://jaayeon.github.io/AGSM

详情
AI中文摘要

扩散模型生成高度逼真的图像,但通常难以实现精确的文本-图像对齐。虽然最近的后训练方法使用外部奖励或人类偏好信号改善对齐,但其性能严重依赖奖励质量,且不直接解决扩散过程中的对齐问题。最近的无奖励方法如SoftREPA表明,通过对比学习优化软文本令牌可以有效改善文本-图像表示对齐,优于标准参数高效微调基线。然而,对比公式可能过度惩罚负对,表现为典型的失败案例,如过度计数和重复。为解决此问题,我们提出一种轻量级、无奖励的后训练方法,通过将对比对齐引导直接整合到扩散模型的分数匹配目标中来细化软令牌。通过在分数级别分配对齐方向,我们的方法缓解了这些限制,并产生更连贯和语义忠实的生成。实验表明,我们的方法与SoftREPA相当,同时显著改善了其失败案例,在GenEval基准上计数准确性提高了超过35%。我们的方法可无缝应用于现有扩散骨干网络(SD1.5、SDXL和SD3),并与现有的基于RL的扩散后训练方法互补。项目页面:https://jaayeon.github.io/AGSM

英文摘要

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance depends heavily on reward quality, and they do not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30031 2026-05-29 cs.SD cs.AI cs.CL 版本更新

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

大型音频语言模型中的音频越狱:分类、攻防分析与成本感知评估

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang, Yun-Nung Chen

发表机构 * National Taiwan University(台湾大学)

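AI summary: Presents a unified taxonomy and a controlled empirical evaluation of audio jailbreak attacks and defenses for large audio-language models, showing that Acoustic Best-of-N exposes worst-case audio-space vulnerabilities, that Narrative Framing is an effective low-latency semantic threat, and that existing defenses trade robustness against benign usability.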

Comments Submitted to ACL ARR 2026 May

详情
AI中文摘要

大型音频语言模型(LALMs)将越狱风险从令牌级提示扩展到完整的语音感知到推理管道,其中不安全行为可以通过语义、声学风格、信号伪影或内部表示来诱导。现有研究在异质的威胁模型和评估协议下研究这些风险,使得比较攻击实用性或防御效用变得困难。本文提供了LALM越狱攻击和防御的统一分类法和受控实证评估。我们将先前的工作组织为语义、声学、信号和嵌入层攻击;基于防护、无需训练和基于训练的防御;以及跨模态、音频原生和交互式基准。然后,我们在十个开源LALM上评估代表性攻击和防御,不仅测量攻击成功率,还测量良性拒绝和延迟。我们的结果表明,声学最佳N揭示了最坏情况下的音频空间漏洞,叙事框架是一种有效的低延迟语义威胁,而当前防御在鲁棒性与良性可用性之间存在权衡。这些发现支持将成本和效用感知评估作为仅成功率的LALM安全基准的必要补充。

英文摘要

Large Audio Language Models (LALMs) expand jailbreak risks from token-level prompting to the full speech perception-to-reasoning pipeline, where unsafe behavior can be induced through semantics, acoustic style, signal artifacts, or internal representations. Existing work studies these risks under heterogeneous threat models and evaluation protocols, making it difficult to compare attack practicality or defense utility. This paper provides a unified taxonomy and a controlled empirical evaluation of LALM jailbreak attacks and defenses. We organize prior work into semantic, acoustic, signal, and embedding-layer attacks; guard-based, training-free, and training-based defenses; and cross-modal, audio-native, and interactive benchmarks. We then evaluate representative attacks and defenses across ten open-source LALMs, measuring not only attack success rate but also benign refusal and latency. Our results show that Acoustic Best-of-N reveals strong worst-case audio-space vulnerabilities, Narrative Framing is an effective low-latency semantic threat, and current defenses trade robustness against benign usability. These findings support cost- and utility-aware evaluation as a necessary complement to success-rate-only LALM safety benchmarks.

2605.30029 2026-05-29 cs.AI 版本更新

RAISE: RAG Design as an Architecture Search Problem

RAISE:将RAG设计视为架构搜索问题

Zhen Chen, Yibing Liu, Weihao Xie, Yu Liang, Peilin Chen, Shiqi Wang

发表机构 * City University of Hong Kong(香港城市大学) Baidu Inc.(百度公司)

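AI summary: Formalizes the design choices of retrieval-augmented generation (RAG) systems as an architecture search problem and builds the RAISE framework and benchmark, which evaluates 13 optimization algorithms on 7 datasets under standardized search spaces and budgets and finds that optimization performance is highly task-dependent.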

详情
AI中文摘要

检索增强生成(RAG)系统涉及众多设计选择,包括查询重写、分块、检索深度、重排序和上下文压缩。在实践中,这些选择通常通过启发式方法配置,阻碍了跨设置的系统评估和可重复性。我们认为这一挑战最好被形式化为RAG架构搜索。为了支持对该问题的可控和可重复研究,我们引入了RAG智能搜索引擎(RAISE),这是一个用于RAG超参数优化的综合框架和基准,它在标准化的搜索空间和预算下评估RAG管道的优化方法。RAISE实现了13种搜索算法,并使用三种随机种子在七个公开文本和多模态数据集上对其进行评估。我们的实验表明,优化性能高度依赖于任务:在一个数据集上表现良好的方法可能无法在其他数据集上一致泛化,这提醒我们不要将聚合排名解释为普遍优越策略的证据。RAISE为公平、可重复和系统的RAG超参数优化研究提供了共同的实验基础。

英文摘要

Retrieval-augmented generation (RAG) systems expose numerous design choices spanning query rewriting, chunking, retrieval depth, reranking, and context compression. In practice, these choices are often configured through heuristics, hindering systematic evaluation and reproducibility across settings. We argue that this challenge is best formulated as RAG architecture search. To support controlled and reproducible study of this problem, we introduce the RAG Intelligence Search Engine (RAISE), a comprehensive framework and benchmark for RAG hyperparameter optimization, which evaluates optimization methods for RAG pipelines under standardized search spaces and budgets. RAISE implements 13 search algorithms and evaluates them across seven public text and multimodal datasets using three random seeds. Our experiments show that optimization performance is highly task-dependent: methods that perform strongly on one dataset may not generalize consistently across others, cautioning against interpreting aggregate rankings as evidence of universally superior strategies. RAISE provides a common experimental substrate for fair, reproducible, and systematic research on RAG hyperparameter optimization.

2605.30022 2026-05-29 cs.CL cs.AI 版本更新

Give it Space! Explicit Disentangling of Positional and Semantic Representations in Encoders

给它空间!编码器中位置和语义表示的显式解缠

Pierre-Antoine Lequeu, Camille Barboule, Benjamin Piwowarski

发表机构 * Sorbonne Université, CNRS, ISIR(索邦大学、法国国家科学研究中心、智能系统与机器人研究所) Orange Innovation(Orange创新)

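AI summary: Studies how Transformers handle positional encoding by separating positional and semantic signals into three independent streams, finding that the disentangled approach preserves macroscopic structure and improves linguistic representations.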

Comments 8 pages + 10 pages of bibliography and appendix

详情
AI中文摘要

位置编码(PE)是置换不变的Transformer表示序列顺序的基础,然而位置信息如何处理和存储仍知之甚少。现代PE方法如RoPE在长上下文理解或检索等任务上仍存在困难(Chen et al., 2025)。因此,更好地理解内部位置机制有助于设计更好的PE。基于位置和语义信号在训练好的Transformer中占据几乎正交子空间的证据,我们修改编码器Transformer以处理三个显式解缠的流:语义、绝对位置(AP)和相对位置(RP),并将掩码语言建模(MLM)目标限制在语义流上。这种解耦使得能够进行清晰的机制研究,并得出三个要点:(1)孤立的AP子空间自发坍缩为一个捕获文档结构的低频二维流形;(2)注意力头特化为结构导向和语义导向两组,其中RP专门支持后者;(3)标准位置编码不能稳健地保留宏观结构:RoPE和RP仅弱编码它,而纠缠的AP在MLM压力下在最后几层丢失了它。解缠方法保留了位置编码,在Flash-Holmes探测基准的65个语言现象中的49个上改善了语言表示。

英文摘要

Positional encoding (PE) underpins how permutation-invariant Transformers represent sequence order, yet how positional information is processed and stored remains poorly understood. Modern PE methods such as RoPE still struggle on tasks such as long-context understanding or retrieval (Chen et al., 2025). Hence, a better understanding of the internal positional mechanism could help design better PE. Building on evidence that positional and semantic signals occupy nearly orthogonal subspaces in trained Transformers, we modify an encoder Transformer to process three explicitly disentangled streams: semantic, absolute positional (AP) and relative positional (RP), and confine the masked-language-modeling (MLM) objective to the semantic stream. This decoupling enables a clean mechanistic study and yields three take-aways. (1) The isolated AP subspace spontaneously collapses into a low-frequency two-dimensional manifold that captures the structure of the document; (2) Attention heads specialize into structure- and semantic-oriented groups, with RP exclusively supporting the latter; (3) Standard positional encodings do not robustly retain macroscopic structure: RoPE and RP only weakly encode it, and entangled AP loses it in the final layers under MLM pressure. The disentangled approach preserves positional encoding, which improves linguistic representation on 49 of the 65 linguistic phenomena of the Flash-Holmes probing benchmark.

2605.30015 2026-05-29 cs.LG cs.AI 版本更新

Test Time Training for Supervised Causal Learning

测试时训练用于监督因果学习

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Microsoft(微软) Sony Research(索尼研究)

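AI summary: To address the weak out-of-distribution generalization of supervised causal learning, proposes TTT-SCL, a test-time training framework that dynamically generates training sets aligned with each test instance and significantly improves causal discovery performance.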

详情
AI中文摘要

监督因果学习(SCL)通过将因果发现构建为监督学习问题,展现了潜力。然而,它面临显著的分布外泛化挑战。我们揭示了先前SCL实践的三个局限性:合成基准与真实数据之间的显著性能差距、对分布偏移的脆弱性以及组合泛化的失败,共同质疑了其现实世界适用性。为此,我们提出测试时训练用于监督因果学习(TTT-SCL),一种新颖的框架,动态生成与任何特定测试实例显式对齐的训练集。我们展示了TTT-SCL与基于分数的方法之间的关联,并基于经典评分函数设计了一个高效模块用于生成训练集。在合成基准、伪真实和真实世界数据集上的实验表明,TTT-SCL显著优于现有的SCL和传统因果发现方法。

英文摘要

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

2605.30014 2026-05-29 cs.AI 版本更新

From GPS Points to Travel Patterns: Flexible and Semantic Trajectory Generation with LLMs

从GPS点到出行模式:基于LLM的灵活语义轨迹生成

Silin Zhou, Chenhao Wang, Yuntao Wen, Shuo Shang, Lisi Chen, Panos Kalnis

发表机构 * University of Electronic Science and Technology of China(电子科技大学) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

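AI summary: Proposes HTP, which hierarchically generates travel patterns first and then GPS points, using an LLM and an RQ-VAE for flexible, semantically rich trajectory generation, with an average 29.78% improvement in generation quality.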

Comments Accepted at KDD 2026 (second round)

详情
AI中文摘要

城市轨迹在建模城市动态和支持各种智慧城市应用中起着关键作用。然而,隐私问题限制了对大规模高质量轨迹数据集的访问。轨迹生成通过合成现实数据来减轻隐私风险,提供了一种有前景的替代方案。然而,现有方法未能显式捕获出行模式,并且只能在单一条件下生成固定长度的轨迹。为了解决这些局限性,我们提出了HTP,它层次化地首先生成出行模式,然后使用大语言模型(LLM)生成GPS点,而不是直接生成GPS点。我们首先设计了一个轨迹特定的残差量化变分自编码器(RQ-VAE),它以从粗到细的方式将微观级别的GPS轨迹量化为紧凑的宏观级别出行模式令牌。这些令牌捕获了丰富的段空间不规则性,例如由交通条件引起的点密度变化。然后,我们用出行模式令牌扩展LLM词汇表,以对齐轨迹表示与LLM输入,并应用监督微调(SFT)使LLM与轨迹生成任务对齐,从而能够在各种条件下生成出行模式序列。在两个真实世界数据集上的大量实验表明,HTP在生成质量上平均比最强基线高出29.78%。我们的代码可在https://github.com/slzhou-xy/HTP获取。

英文摘要

Urban trajectories play a crucial role in modeling urban dynamics and supporting various smart city applications. However, privacy concerns restrict access to large-scale and high-quality trajectory datasets. Trajectory generation provides a promising alternative by synthesizing realistic data to mitigate privacy risks. However, existing methods fail to explicitly capture travel patterns and can only generate fixed-length trajectories under a single condition. To address these limitations, we propose HTP, which Hierarchically generates Travel patterns first and then generates GPS Points by using large language models (LLMs), rather than directly generating GPS points. We first design a trajectory-specific residual quantization variational autoencoder (RQ-VAE) that quantizes micro-level GPS trajectories into compact, macro-level travel pattern tokens in a coarse-to-fine manner. These tokens capture rich segment spatial irregularities, such as point density variations caused by traffic conditions. Then, we extend the LLM vocabulary with travel pattern tokens to align trajectory representations with the LLM input, and apply supervised fine-tuning (SFT) to align the LLM with the trajectory generation task, enabling generation of travel pattern sequences under various conditions. Extensive experiments on two real-world datasets show that HTP outperforms the strongest baseline by an average of 29.78% in terms of generation quality. Our code is available at https://github.com/slzhou-xy/HTP.

2605.30011 2026-05-29 cs.CV cs.AI 版本更新

VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

VisualThink-VLA:用于高效低延迟视觉-语言-动作策略的视觉中间推理

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学) Cornell University(康奈尔大学) National University of Singapore(新加坡国立大学) Xi'an University of Electronic Science and Technology(西安电子科技大学)

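AI summary: Proposes the VisualThink-VLA framework, which uses visual intermediate reasoning and a selective routing mechanism to cut inference latency from several seconds to under a second while maintaining high accuracy.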

详情
AI中文摘要

近期工作开始为视觉-语言-动作(VLA)策略配备显式的中间推理。然而,在具身控制中,文本思维链并不适用:无关或弱文本信息会干扰动作预测,而自回归文本解码为实时闭环执行增加了过多延迟。我们提出VISUALTHINK-VLA,一个用于准确、低延迟VLA策略的视觉中间推理框架。我们的引导哲学是通过有效的视觉思维来指导动作:VISUALTHINK-VLA通过一个紧凑的视觉证据接口引导动作预测,该接口在避免解码开销的同时保持空间精度。此外,为了进一步提升性能和效率,VISUALTHINK-VLA采用了一种定制的选择性路由机制来学习视觉证据令牌,从而实现低延迟推理同时保持高容量专用性。我们还引入了VisualEvidence-Kit,这是一个以VisualEvidence-Agent为核心的监督与审计资源,该智能体构建了754.7k条VLA指令的VisualEvidence-Set,用于路由监督和反事实忠实性测试。在多个基准测试和真实机器人评估中,VISUALTHINK-VLA在大多数基准测试上实现了最高成功率,同时将推理增强基线的多秒延迟降至亚秒级。例如,在BridgeData V2上,它将步骤延迟从ECoT的8.377秒降至0.367秒,实现了22.8倍的加速。

英文摘要

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, while autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VISUALTHINK-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VISUALTHINK-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. In addition, to further improve performance and efficiency, VISUALTHINK-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs a VisualEvidence-Set of 754.7k VLA instructions for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VISUALTHINK-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, achieving a 22.8x speedup.

2605.30003 2026-05-29 cs.MA cs.AI cs.LG 版本更新

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

发现合作管线:面向序列社会困境的自动研究

Víctor Gallego

发表机构 * Komorebi AI Technologies(Komorebi人工智能技术)

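AI summary: Proposes a two-level autoresearch framework in which an outer-loop AI agent automatically redesigns the inner-loop LLM policy-synthesis pipeline for multi-agent sequential social dilemmas; experiments show it outperforms hand-designed baselines across multiple games and welfare objectives.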

Comments Accepted to the AI Agents for Discovery in the Wild (AID-Wild) Workshop at ACM CAIS 2026

详情
AI中文摘要

我们研究了两层自动研究合作问题:外层AI智能体自主重新设计用于多智能体序列社会困境(SSD)的LLM策略合成系统的内层管线。研究者智能体$\mathcal{R}$(作为编码智能体运行)读取内层源代码,编辑系统提示、反馈函数、辅助库和迭代逻辑,运行评估,并决定保留什么,遵循自动研究范式。在两个游戏(Cleanup和Gathering)、两个策略合成器LLM和两个福利目标(功利主义效率和Rawlsian最大最小原则)下,研究者可靠地超越了手工设计的基线,显著缩小了运行间方差,并优于仅提示优化。发现的管线依赖于目标:只有在最大最小原则下,研究者才会向合成器管线注入显式的公平机制,而这类机制在其自身目标无关的系统提示和每个效率优化的管线中都不存在。这支持了一种信息设计解读,即研究者根据福利目标选择向有限理性的合成器揭示什么。代码见https://github.com/vicgalle/autoresearch-social-dilemmas。

英文摘要

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.

2605.30002 2026-05-29 cs.AI 版本更新

KairosAgent: Agentic Time Series Forecasting with Fused Semantic Reasoning

KairosAgent:融合语义推理的智能体时间序列预测

Kun Feng, Ziwei Shan, Yuchen Fang, Yiyang Tan, Sihan Lu, Shuqi Gu, Lintao Ma, Xingyu Lu, Kan Ren

发表机构 * School of Information Science and Technology, ShanghaiTech University(信息科学与技术学院,上海科技大学) Ant Group(蚂蚁集团)

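AI summary: Proposes KairosAgent, a framework that combines an LLM-based reasoner with a TSFM-based forecaster and adds a reinforcement learning paradigm to achieve zero-shot multimodal time series forecasting.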

详情
AI中文摘要

跨领域多模态时间序列预测是一项具有挑战性的任务,要求模型整合精确的数值理解、跨领域语义理解和有效的多模态融合。现有方法要么从头构建时间序列基础模型(TSFM),要么利用预训练的大语言模型(LLM)。然而,TSFM通常忽略语义理解且缺乏面向未来的语义推理能力,而LLM在数值理解和准确的定量预测方面存在困难。为克服这些限制,我们提出KairosAgent,一种用于多模态时间序列预测的新型智能体框架,包括基于LLM的推理器和基于TSFM的预测器。KairosAgent通过动态调用分析工具来增强LLM的数值理解和语义推理能力,从而统一文本推理和数值预测。推理结果随后融合到TSFM流程中,实现更准确可靠的未来预测。为进一步改进推理,我们整理了一个大规模高质量轨迹语料库,并引入了一种基于预测的强化学习范式,包含多轮细化和轮次级别信用分配。实验表明,KairosAgent在最大化预训练LLM和TSFM效用的同时,实现了卓越的零样本预测性能,为高效且可解释的时间序列智能体提供了有前景的方向。项目页面位于https://foundation-model-research.github.io/KairosAgent。

英文摘要

Cross-domain multimodal time series forecasting is a challenging task, requiring models to integrate precise numerical comprehension, cross-domain semantic understanding, and effective multimodal fusion. Existing approaches either build Time Series Foundation Models (TSFMs) from scratch or leverage pretrained Large Language Models (LLMs). However, TSFMs often overlook semantic understanding and lack the ability to perform future-oriented semantic reasoning, and LLMs struggle with numerical comprehension and accurate quantitative forecasting. To overcome these limitations, we propose KairosAgent, a novel agentic framework for multimodal time series forecasting, including an LLM-based reasoner and a TSFM-based forecaster. KairosAgent unifies textual reasoning and numerical forecasting by dynamically invoking analytical tools to enhance the numerical understanding and semantic reasoning capabilities of LLMs. The reasoning results are subsequently fused into the TSFM pipeline, enabling more accurate and reliable future predictions. To further improve the reasoning, we curate a large-scale corpus of high-quality trajectories, alongside a reinforcement-learning-from-forecasting paradigm with multi-turn refinement and turn-level credit assignment. Experiments demonstrate that KairosAgent achieves superior zero-shot forecasting performance while maximizing the utility of pretrained LLMs and TSFMs, presenting a promising direction for efficient and interpretable time series agents. The project page is at https://foundation-model-research.github.io/KairosAgent.

2605.29986 2026-05-29 cs.AI 版本更新

Accelerating Constrained Decoding with Token Space Compression

加速受限解码:通过词元空间压缩

Michael Sullivan, Alexander Koller

发表机构 * Department of Language Science and Technology(语言科学与技术系) Saarland Informatics Campus(萨尔兰州信息学校区) Saarland University(萨尔兰大学)

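AI summary: Proposes CFGzip, which compresses the token search space offline to sharply cut the overhead of context-free-grammar-constrained decoding, reducing latency by up to two orders of magnitude and speeding up total generation by up to 7.5x.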

Comments 13 pages; 5 figures; under review at EMNLP 2026

详情
AI中文摘要

为了保证LLM的输出符合指定结构,上下文无关文法(CFG)解码引擎强制选择能够产生符合给定CFG的字符串的下一个词元。虽然当前的CFG受限解码引擎已经高度优化,但由于每一步搜索空间(即整个词元词汇表)巨大,导致对于更复杂的CFG会产生难以承受的高开销——而这正是CFG引擎最有用的情况。在本文中,我们引入了CFGzip,一种离线压缩词元搜索空间的技术,它大幅减少了CFG引擎的开销。实验中,我们报告了当CFGzip与最先进的语法引擎一起使用时,延迟减少高达两个数量级,在总受限生成时间上实现了高达7.5倍的加速:借助CFGzip,受限解码现在可以大规模应用于复杂CFG。

英文摘要

To guarantee that an LLM's outputs conform to a specified structure, context-free grammar (CFG) decoding engines force the selection of next tokens that produce strings that conform to a given CFG. While current CFG-constrained decoding engines are highly optimized, the inherent costs arising from the massive per-step search space -- i.e. the entire token vocabulary -- result in intractably high overhead for more complex CFGs: precisely the situation where CFG engines are most useful. In this paper, we introduce CFGzip, an offline technique for compressing the token search space, which massively reduces CFG engine overhead. In experiments, we report latency reduction of up to two orders of magnitude when CFGzip is used with a SoTA grammar engine, yielding an up to 7.5x speedup in total constrained generation time: with CFGzip, constrained decoding is now feasible at scale for complex CFGs.

2605.29980 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Genetically Aligned Patient Representations Improve Hematological Diagnosis

基因对齐的患者表示改善血液学诊断

Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

发表机构 * Institute of AI for Health, Helmholtz Munich, Germany; International School of Medicine, Istanbul Medipol University, Türkiye; Munich Leukemia Laboratory, Germany; Department of Medicine III, Ludwig-Maximilian-University Hospital, Germany; Department of Physics, University of Munich, Germany; Munich Center for Machine Learning (MCML), Germany; DKTK, German Cancer Consortium, Germany

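AI summary: Proposes a two-stage framework that uses self-supervised visual pretraining and supervised contrastive learning to align white blood cell images with chromosomal aberrations and somatic mutations, improving hematological diagnosis.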

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

组织病理学编码器与转录组和基因组数据的多模态对齐已被证明能显著提高下游诊断任务的性能。血液学细胞学的独特之处在于,视觉单细胞评估通常与细胞遗传学和分子遗传学相结合用于血癌诊断。在本研究中,我们提出了一个框架,将单个白细胞图像与染色体畸变(核型)以及来自靶向基因面板的体细胞突变对齐。我们的训练策略采用两阶段方法:(i)在超过1500名患者的队列上,使用iBOT头进行自监督、仅视觉的Transformer聚合器预训练;(ii)通过急性髓系白血病患者的监督对比损失进行基因对齐。我们的基因对齐患者编码器改善了血液学诊断任务,优于切片级组织病理学基础模型。此外,该模型为疾病和遗传改变提供了即用型检索能力。将遗传数据纳入患者编码器提高了患者表示的质量,提供了一个与临床诊断工作流程对齐的框架,并为未来的多模态血液学特定AI铺平了道路。代码和模型权重可在https://github.com/marrlab/GenBloom获取。

英文摘要

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

2605.29976 2026-05-29 physics.ao-ph cs.AI 版本更新

Evaluating Skill and Stability of ArchesWeather and ArchesWeatherGen under Multi-Decadal Climate Simulations

评估 ArchesWeather 和 ArchesWeatherGen 在多年代际气候模拟中的技能和稳定性

Renu Singh, Robert Brunstein, Antonia Jost, Thomas Rackow, Claire Monteleoni, Yana Hasson, Christian Lessig, Guillaume Couairon

发表机构 * Inria Paris(巴黎国家信息与自动化研究所) Google DeepMind(谷歌DeepMind) Otto-von-Guericke-University Magdeburg(马格德堡奥托·冯·格里克大学) University of Potsdam(波茨坦大学) European Centre for Medium-Range Weather Forecasting(欧洲中期天气预报中心) University of Colorado Boulder(科罗拉多大学博尔德分校)

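AI summary: Adapts two machine learning models originally built for weather forecasting, ArchesWeather (deterministic) and ArchesWeatherGen (probabilistic flow matching), into forced atmospheric models that use monthly mean sea surface temperature and sea ice cover as boundary conditions, and evaluates their multi-decadal climate simulation skill under the AIMIP Phase 1 protocol, finding that they produce stable long-term simulations with a stable annual cycle and capture the drift of many climate variables.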

Comments 29 pages, 16 figures, preprint

详情
AI中文摘要

我们评估了 ArchesWeather 和 ArchesWeatherGen 的气候模拟能力,这两个机器学习模型最初训练用于天气预报,并评估了长达10天的预报时效。ArchesWeather 是一个确定性模型,而 ArchesWeatherGen 是一个概率流匹配模型,利用 ArchesWeather 的预报,实现基于集合的不确定性量化。在这项工作中,我们通过额外以月平均海表温度(SST)和海冰覆盖(SIC)作为边界条件进行条件化,将这些模型改造为强迫大气模型。具体地,我们遵循 AI 模型比较项目(AIMIP)第一阶段协议,该协议类似于大气模型比较项目(AMIP),提出了一个标准化的实验设置,以评估基于 ML 的强迫大气模型的气候技能。我们在这两种条件下对两个模型进行了全面评估,包括与数值气候模型的比较、检查扩展中关键设计选择的消融研究,以及强迫与非强迫配置的分析。尽管最初是为天气预报开发的,但我们证明,ArchesWeather 和 ArchesWeatherGen 的强迫配置能产生稳定的长期气候模拟,具有稳定的年循环,并捕捉许多气候变量的漂移。这些模型忠实地再现了 ERA5 的气候态、大尺度环流和年际变率,并捕捉了分布的尾部。

英文摘要

We evaluate the climate simulation capabilities of ArchesWeather and ArchesWeatherGen, two machine learning models originally trained for weather forecasting and evaluated up to a 10-day lead time. ArchesWeather is a deterministic model, while ArchesWeatherGen is a probabilistic flow-matching model leveraging ArchesWeather's forecasts, enabling ensemble-based uncertainty quantification. In this work, we adapt these models to act as forced atmospheric models by using additional conditioning on the monthly mean sea surface temperature (SST) and sea ice cover (SIC) as boundary conditions. In particular, we follow the AI Model Intercomparison Project (AIMIP) Phase 1 protocol, which, analogous to the Atmospheric Model Intercomparison Project (AMIP), proposes a standardized experimental setup to evaluate the climate skill of ML-based forced atmospheric models. We present a comprehensive evaluation of both models under these conditions, including comparison against numerical climate models, ablation studies that examine key design choices in the extension, and an analysis of forced versus unforced configurations. Despite being originally developed for weather forecasting, we demonstrate that forced configurations of ArchesWeather and ArchesWeatherGen produce stable long-term climate simulations, have a stable annual cycle, and capture the drift of many climate variables. The models faithfully reproduce ERA5's climatology, large-scale circulations and interannual variability, and they capture the tails of the distributions.

2605.29966 2026-05-29 cs.AI 版本更新

Compass: Navigating Global Marine Lead Data Integration through Expert-Guided LLM Agent

Compass: 通过专家引导的LLM代理导航全球海洋铅数据整合

Yiming Liu, Bin Lu, Meng Jin, Ziyuan Sang, Shuo Jiang, Lei Zhou, Xinbing Wang, Chenghu Zhou, Jing Zhang

发表机构 * School of Information Science and Electronic Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Estuarine and Coastal Research, East China Normal University, Shanghai, China; School of Oceanography, Shanghai Jiao Tong University, Shanghai, China; Institute of Geographical Science and Natural Resources Research, Chinese Academy of Sciences, Beijing, China

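AI summary: To address marine lead data being scattered across unstructured papers, proposes Compass, an expert-guided LLM agent framework that uses a Knowledge Tree to decompose tasks; it extracts 3,751 lead records from over 230,000 papers to build the largest marine lead database to date, with 92% accuracy.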

详情
AI中文摘要

海洋铅及其同位素是海洋环流和人为污染的关键示踪剂,然而实地观测仍然成本高昂且稀疏。尽管存在大量历史记录,但它们被埋藏在学术论文的非结构化内容中,形成了无法进行综合分析的数据孤岛。手动提取不可扩展,而通用大语言模型缺乏必要的领域特定知识,导致幻觉和科学上无效的输出。为了解决这个问题,我们引入了一种专家引导的适应方法,使LLM能够在不进行微调的情况下执行严格的科学数据提取。我们通过Compass(一个由与海洋科学家共同设计的知识树增强的LLM代理框架)来实现这种方法,该框架将复杂任务分解为可验证的步骤,引导代理的推理以确保科学有效性。将Compass应用于超过23万篇相关开放获取论文的语料库,我们成功提取了3751条先前未纳入的铅记录。这项工作建立了迄今为止最大的综合海洋铅数据库。除了标准指标外,Compass通过多层验证展示了卓越的可靠性,经专家手动验证确认准确率达到92%。新整合的数据扩展了先前采样不足区域(如东海和南大洋)的覆盖范围,为未来的科学发现提供了丰富的数据基础。我们发布了一个交互式可视化平台以促进开放科学访问。我们的工作表明,专家引导的代理可以有效弥合通用LLM与高风险科学领域之间的差距,实现地球科学中的可扩展数据发现。

英文摘要

Marine lead (Pb) and its isotopes are critical tracers for ocean circulation and anthropogenic pollution, yet in-situ observations remain costly and sparse. While vast historical records exist, they lie buried within the unstructured content of academic papers, creating "data silos" inaccessible to comprehensive analysis. Manual extraction is unscalable, while general-purpose Large Language Models (LLMs) lack the necessary domain-specific knowledge, leading to hallucinations and scientifically invalid outputs. To address this, we introduce an expert-guided adaptation approach that enables LLMs to perform rigorous scientific data extraction without fine-tuning. We operationalize this approach through Compass, an LLM agent framework enhanced by a Knowledge Tree co-designed with marine scientists, which decomposes complex tasks into verifiable steps, guiding the agent's reasoning to ensure scientific validity. Deploying Compass across a corpus of over 230,000 relevant open-access papers, we successfully extract 3,751 previously unincorporated Pb records. This effort establishes the largest integrated marine Pb database to date. Beyond standard metrics, Compass demonstrates superior reliability through multi-layered validation, achieving 92% accuracy as confirmed through expert manual verification. The newly integrated data expand coverage in previously under-sampled regions such as the East China Sea and the Southern Ocean, providing an enriched data foundation for future scientific discoveries. We release an interactive visualization platform to facilitate open scientific access. Our work demonstrates that expert-guided agents can effectively bridge the gap between general-purpose LLMs and high-stakes scientific domains, enabling scalable data discovery in geosciences.

2605.29965 2026-05-29 cs.AI 版本更新

Meta-Programming for Linear-time Temporal Answer Set Programming

线性时态回答集编程的元编程

Susana Hahn, Amade Nems, Javier Romero, Torsten Schaub

发表机构 * University of Potsdam, Germany(波茨坦大学,德国)

AI总结 提出一种统一的元编程框架,通过扩展clingo的理论语法并引入转换管道保护嵌套模态,实现了对多种线性时态逻辑(TEL、MEL、DEL)的语义操作化,并开发了metasp系统。

详情
AI中文摘要

回答集编程(ASP)的时态扩展的发展导致了非单调线性时态(TEL)、动态(DEL)和度量(MEL)时态均衡逻辑的出现。然而,高度优化的ASP系统固有的刚性常常阻碍了替代逻辑设计的快速探索和实现。在这项工作中,我们提出了一个灵活的元编程框架,通过统一的声明性框架操作化各种时态逻辑的语义。我们的方法通过用形式类型规范和嵌套能力增强clingo的理论语法,扩展了标准ASP元编程。为了确保语义正确性,我们引入了一个转换管道,在实例化过程中保护嵌套模态免受基于稳定模型的简化。我们通过实现TEL、MEL和DEL的元编码来展示我们框架的可扩展性。我们提供了TEL的全面说明,并突出了管理MEL的区间约束和DEL中的Fischer-Ladner闭包的关键特性。最后,我们介绍了metasp系统,这是一个封装了此工作流程的多功能工具。

英文摘要

The development of temporal extensions of Answer Set Programming (ASP) has led to the emergence of non-monotonic linear-time (TEL), dynamic (DEL), and metric (MEL) temporal equilibrium logics. However, the inherent rigidity of highly optimized ASP systems often hinders the rapid exploration and implementation of alternative logical designs. In this work, we propose a flexible meta-programming framework that operationalizes the semantics of varied temporal logics through a unified, declarative framework. Our approach extends standard ASP meta-programming by augmenting clingo's theory grammar with formal type specifications and nesting capabilities. To ensure semantic correctness, we introduce a transformation pipeline that protects nested modalities from stable-model-based simplifications during grounding. We demonstrate the extensibility of our framework by implementing meta-encodings for TEL, MEL, and DEL. We provide a comprehensive account of TEL and highlight the key features for managing the interval constraints of MEL and the Fischer-Ladner closure in DEL. Finally, we introduce the metasp system, a versatile tool that encapsulates this workflow.

2605.29963 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Honeyval: 基于LLM的HTTP蜜罐综合评估框架

Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Jamie Hayes, Niels Heinen, Tianqi Fan, Luca Invernizzi, Martin Vechev

发表机构 * ETH Zurich(苏黎世联邦理工学院) Google(谷歌) Google DeepMind(谷歌DeepMind) AI Sequrity Company(AI Sequrity公司) Independent(独立)

AI总结 提出Honeyval评估框架,通过16个后端应用、AI攻击代理、控制任务和可验证利用目标,系统评估LLM驱动的HTTP蜜罐,发现其相比规则基线能显著延长攻击交互、降低被前沿模型检测率,且保持成本优势。

详情
AI中文摘要

蜜罐是模拟真实系统组件的诱饵系统,旨在防御网络攻击。最近,LLM越来越多地作为蜜罐的模拟骨干。它们使防御者能够构建高交互蜜罐,同时降低系统安全风险。然而,基于LLM的蜜罐开发缺乏统一的评估框架。大多数评估包括测量固定命令上的响应相似性、手动测试或实际部署。这些方法通常不可扩展用于开发、不可跨评估复现、不能代表实际攻击,或不能适应各种攻击者和蜜罐配置。在这项工作中,我们弥补了这一差距,提出了Honeyval,一个针对LLM驱动的HTTP蜜罐的综合评估框架。我们通过将蜜罐基于16个后端应用程序、使用AI黑客代理作为攻击者、采用两个控制任务来监控代理和蜜罐在定制化方面的能力,以及为攻击者定义清晰且可验证的利用目标,解决了先前评估的局限性。使用Honeyval,我们对近期成本高效的LLM作为HTTP蜜罐进行了广泛评估。我们的实验突出了LLM驱动的蜜罐的前景;它们与基于规则的基线蜜罐相比,导致与攻击者的交互时间显著延长,并且即使被前沿模型检测到的频率也远低得多,同时平均而言,保持了针对代理攻击者的运行成本优势。此外,我们实验了不同的反攻蜜罐配置,并观察到了独特的权衡,例如以增加检测为代价获得更长的交互。

英文摘要

Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypot configurations and observe unique trade-offs, such as longer interactions at the cost of increased detection.

2605.29960 2026-05-29 cs.CR cs.AI 版本更新

Hijacking Agent Memory: Stealthy Trojan Attacks Through Conversational Interaction

劫持Agent记忆:通过对话交互的隐蔽木马攻击

Hongtao Wang, Se Yang, Yu Chen, Puzhuo Liu

发表机构 * North China Electric Power University(华北电力大学) Tencent(腾讯) Tsinghua University(清华大学)

AI总结 提出MemPoison攻击方法,通过语义关系桥、实体伪装和联合嵌入优化绕过选择性记忆机制,在LLM Agent长期记忆中注入触发器后门,实现高达0.95的攻击成功率。

Comments 19 pages, 12 figures

详情
AI中文摘要

大型语言模型(LLM)Agent越来越多地利用长期记忆来支持持久且自主的任务执行。然而,这种能力也引入了一个新的攻击面:记忆投毒,即对手可以注入恶意信息以影响未来行为。现有的记忆投毒攻击通常假设注入内容可以直接存储在记忆中,忽略了现代记忆流水线中的选择性提取和重写阶段。这使得先前的方法在现实场景中无效。在本文中,我们提出MemPoison,一种新颖的记忆投毒攻击,能够绕过LLM Agent中的选择性记忆机制,攻击者可以通过对话交互将可触发的后门注入Agent的长期记忆,从而误导其后续响应。MemPoison引入三个关键组件:(i)语义关系桥,将触发器和载荷绑定为连贯的陈述,确保它们一起被提取到记忆中;(ii)实体伪装,优化触发器以模仿命名实体,抵抗重写;(iii)联合嵌入优化,将注入触发器的文本在嵌入空间中形成紧密聚类,同时与良性嵌入保持隔离以实现隐蔽。跨不同Agent领域和记忆机制的评估显示,MemPoison的攻击成功率高达0.95,优于现有基线。机制分析表明,该攻击利用了嵌入空间各向异性并转移注意力模式,突显了选择性记忆系统的核心漏洞。我们评估了多种防御策略,并展示了它们在缓解攻击方面的根本局限性。

英文摘要

Large language model (LLM) agents increasingly leverage long term memory to support persistent and autonomous task execution. However, this capability also introduces a new attack surface: memory poisoning, where adversaries can inject malicious information to influence future behavior. Existing memory poisoning attacks often assume that injected content can be stored directly in memory, overlooking the selective extraction and rewriting stages in modern memory pipelines. This makes prior methods ineffective under realistic settings. In this paper, we propose MemPoison, a novel memory poisoning attack that bypasses selective memory mechanisms in LLM agents, where an attacker can inject triggerable backdoors into the agent's long-term memory through dialogue interactions, thereby misleading its subsequent responses. MemPoison introduces three key components: (i) a semantic relational bridge that binds the trigger and payload into a coherent statement to ensure they are extracted into memory together; (ii) entity masquerading that optimizes triggers to mimic named entities, resisting rewriting; and (iii) joint embedding optimization that shapes trigger-injected texts into a tight cluster in the embedding space while maintaining isolation from benign embeddings for stealth. Evaluations across different agent domains and memory mechanisms show MemPoison achieves attack success rates up to 0.95, outperforming existing baselines. Mechanistic analysis indicates that the attack exploits embedding-space anisotropy and shifts attention patterns, highlighting core vulnerabilities in selective memory systems. We evaluate multiple defense strategies and demonstrate their fundamental limitations in mitigating the attack.

2605.29955 2026-05-29 cs.AI 版本更新

Formalizing Mathematics at Scale

大规模形式化数学

Ahmad Rammal, Niket Patel, Fabian Gloeckle, Amaury Hayat, Julia Kempe, Remi Munos, Charles Arnal, Vivien Cabannes

发表机构 * FAIR at Meta(Meta的FAIR) New York University(纽约大学) Korea Institute for Advanced Study(韩国高等研究院)

AI总结 提出多智能体系统AutoformBot,利用LLM和形式化验证工具,自动将非正式教材翻译为Lean 4可验证代码,构建了包含超过45,000个声明和50万行代码的Atlas形式化库。

详情
AI中文摘要

我们提出了AutoformBot,一个用于在Lean 4中大规模构建自动形式化教材库(Atlas)的多智能体系统。AutoformBot协调数千个LLM智能体,配备形式化验证工具、依赖感知的任务调度和协作版本控制,将非正式的教材文本转化为机器可检查的定义和证明。我们将方法应用于26本开放获取教材,涵盖分析、代数、拓扑、组合学和概率论,生成了Atlas:一个包含超过45,000个Lean 4声明和50万行代码的已验证库。我们发布两个成果:(i)AutoformBot,开源的多智能体框架;(ii)Atlas,生成的形式化库。我们的结果表明,大规模自动形式化研究生级别数学的核心内容在经济和技术上现在是可行的。这为在研究层面上自动验证人类和机器生成的数学打开了大门。

英文摘要

We present AutoformBot, a multi-agent system for building an Autoformalized Textbook Library At Scale (Atlas) in Lean 4. AutoformBot orchestrates thousands of LLM agents, equipped with formal verification tools, dependency-aware task scheduling, and collaborative version control, to translate informal textbook prose into machine-checked definitions and proofs. We apply our methods to a corpus of 26 open-access textbooks spanning analysis, algebra, topology, combinatorics, and probability, producing Atlas: a verified library of over 45,000 Lean 4 declarations and 500 thousand lines of code. We release two artifacts: (i) AutoformBot, the open-source multi-agent framework; and (ii) Atlas, the resulting formal library. Our results suggest that autoformalizing the core content of graduate-level mathematics at scale is now economically and technically feasible. This opens the door to the automated verification of both human- and machine-generated mathematics at a research level.

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM 版本更新

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

MuPHI: 通过语义基础奖励优化学习隐式多模态有害推理

Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) Saarland Informatics Campus(萨尔信息学园区) Saarland University(萨尔大学) The University of Edinburgh(爱丁堡大学) Samsung AI Center, Cambridge(三星AI中心,剑桥)

AI总结 针对视觉语言模型在隐式跨模态有害语义推理上的不足,提出MuPHI数据集和MuPHIRM训练框架,通过多视角奖励优化联合语义学习,提升有害检测与推理质量及分布外鲁棒性。

详情
AI中文摘要

理解看似良性的图像-文本对之间交互如何产生危害,需要超越表面特征的意图感知跨模态推理。现有的视觉语言模型(VLM)擅长对感知线索进行字面推理,但往往无法推导出依赖于隐式、上下文相关推理的有害语义。为了评估VLM在组合性有害检测和推理方面的能力,我们引入了多模态语用有害解释(MuPHI)数据集,其中包含有害编码在微妙多模态线索中的图像-文本对。MuPHI涵盖多种有害类别,并包含用于评估VLM推理链的注释有害理由。为了改进VLM的检测和推理能力,我们提出了MuPHIRM,一种推理增强的训练框架,通过优化多视角奖励来学习联合语义。MuPHIRM提高了VLM的有害检测和推理质量,同时与训练和推理时基线相比,表现出优越的分布外鲁棒性。我们的发现表明,面向推理的奖励优化为构建超越基准特定捷径进行泛化的多模态系统提供了一个有前景的方向。

英文摘要

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

2605.29940 2026-05-29 cs.AI 版本更新

Make LLM Learn to Synthesize from Streaming Experiences through Feedback

使大语言模型通过反馈从流式经验中学习合成

Zhenlin Hu, Yan Wang, Zhen Bi, Zihao Xue, Bingyu Zhu, Longtao Huang, Xiongtao Zhang, Zeyu Yang, Zhixuan Chu, Jungang Lou

发表机构 * Huzhou Normal University(湖州师范学院) Alibaba Group(阿里巴巴集团) Zhejiang University(浙江大学) Zhejiang Key Laboratory of Intelligent Education Technology and Application(浙江省智能教育技术与应用重点实验室)

AI总结 提出StreamSynth设置和SynLearner框架,使模型通过任务流积累经验并利用反馈提升合成数据生成性能。

详情
AI中文摘要

大语言模型(LLMs)已被广泛用于合成数据生成,显著降低了标注成本。然而,现有研究大多将合成视为一组孤立任务,忽略了一个更基本的问题:模型能否通过积累过去任务的经验并将其迁移到未来任务来学习合成。在这项工作中,我们引入了StreamSynth,一种新的设置,其中合成任务顺序到达,历史任务的经验为未来合成提供信息信号。为了解决这一设置,我们提出了SynLearner,一个通用框架,使合成模型能够在任务流上获取可重用的合成经验。SynLearner不是为每个任务独立生成数据,而是鼓励模型探索多样化的合成模式,从反馈中学习,并在任务演化中平衡样本质量与集合级多样性。在多个基准上的大量实验表明,SynLearner有效地利用了早期任务的经验来改进后期任务的合成性能,表现出一致的跨任务可迁移性。这些发现为StreamSynth的可行性提供了证据,并突显了合成数据生成作为一个经验驱动过程,可以从任务流中受益。

英文摘要

Large language models (LLMs) have been widely adopted for synthetic data generation, significantly reducing annotation costs. However, most existing studies treat synthesis as a set of isolated tasks and overlook a more fundamental question: whether a model can learn to synthesize by accumulating experience from past tasks and transferring it to future ones. In this work, we introduce StreamSynth, a new setting in which synthesis tasks arrive sequentially and experience from historical tasks provides informative signals for future synthesis. To address this setting, we propose SynLearner, a general framework that enables synthesis models to acquire reusable synthesis experience over a task stream. Instead of generating data independently for each task, SynLearner encourages the model to explore diverse synthesis patterns, learn from feedback, and balance sample quality with set-level diversity as tasks evolve. Extensive experiments across multiple benchmarks show that SynLearner effectively leverages experience from earlier tasks to improve synthesis performance on later ones, exhibiting consistent cross-task transferability. These findings provide evidence for the feasibility of StreamSynth and highlight synthetic data generation as an experience-driven process that can benefit from task streams.

2605.29935 2026-05-29 cs.CV cs.AI 版本更新

CityGen: Structure-Guided City-Style Synthesis for Cross-City Autonomous Driving

CityGen: 结构引导的城市风格合成用于跨城市自动驾驶

Zezhong Qian, Zhao Yang, Lu Tan, Zhihao Yan, Weiyi Hong, Haizhuang Liu, Yawei Jueluo

发表机构 * Jiangsu Cytoderm Intelligent Technology Co., Ltd., China(江苏细胞膜智能科技有限公司,中国) Xi'an Jiaotong University, Xi'an, China(西安交通大学,中国) Tsinghua University, Beijing, China(清华大学,中国) University of Science and Technology of China, Hefei, China(中国科学技术大学,中国)

AI总结 提出CityGen,一种基于扩散模型的生成框架,通过高清地图条件和城市级视觉提示实现零标签城市适应,提升跨城市自动驾驶在感知、分割和规划任务上的鲁棒性。

详情
AI中文摘要

自动驾驶系统通常在有限的地理区域内进行训练和评估,这阻碍了它们在新城市部署时的可扩展性。然而,外观、道路拓扑和交通模式的显著域偏移常常导致跨城市部署时性能严重下降。现有的基于域适应、数据增强或合成数据生成的方法通常依赖于标注的目标数据、城市特定的标注或任务特定的设计,限制了它们在整体评估中的可扩展性和有效性。在本文中,我们引入了CityTransfer-Bench,一个地理上不重叠的基准,用于评估跨城市泛化在感知、分割和规划任务上的表现,并提出了CityGen,一个基于扩散的生成框架,通过城市级视觉提示引导的高清地图条件合成实现零标签城市适应。大量实验表明,CityGen在多个任务上持续提高了跨城市鲁棒性,为可泛化的自动驾驶建立了可扩展且标签高效的基石。

英文摘要

Autonomous driving systems are commonly trained and evaluated within limited geographic regions, which hinders their scalability when deployed in new cities. However, significant domain shifts in appearance, road topology, and traffic patterns often cause severe performance degradation under cross-city deployment. Existing approaches based on domain adaptation, data augmentation, or synthetic data generation typically rely on labeled target data, city-specific annotations, or task-specific designs, limiting their scalability and effectiveness for holistic evaluation. In this paper, we introduce CityTransfer-Bench, a geographically disjoint benchmark for evaluating cross-city generalization across perception, segmentation, and planning, and propose CityGen, a diffusion-based generative framework that performs zero-label city adaptation via HD-map-conditioned synthesis guided by city-level visual prompts. Extensive experiments demonstrate that CityGen consistently improves cross-city robustness across multiple tasks, establishing a scalable and label-efficient foundation for generalizable autonomous driving.

2605.29931 2026-05-29 cs.AI eess.AS 版本更新

It's All About Speed: AI's Impact on Workflow in Music Production

一切都关乎速度:AI对音乐制作工作流程的影响

Finn McClellan, Fabio Morreale

发表机构 * Waipapa Taumata Rau - University of Auckland, Auckland (Aotearoa - New Zealand)(怀帕帕·陶马塔·劳 - 奥克兰大学,奥克兰(奥特亚罗瓦 - 新西兰)) Sony AI, Barcelona (Spain)(索尼AI,巴塞罗那(西班牙))

AI总结 通过民族志研究,探讨AI和自动化工具如何影响音乐制作工作流程,重点关注录音工程师、混音师和制作人的使用体验与态度,并分析速度、可控性与创造性自主权之间的张力及其缓解方法。

Comments Audio Engineering Society Conference Paper - Presented at the AES International Conference on Machine Learning and Artificial Intelligence for Audio 2025 - September 8-10, London, UK

详情
AI中文摘要

在本文中,我们展示了一项关于AI和自动化工具对音乐制作工作流程影响的民族志研究结果。我们特别关注那些自认为是录音工程师、混音师和制作人的专业参与者,讨论了他们对常见AI和自动化软件的使用情况,以及他们对这些工具普及的看法。我们讨论了在速度和效率、可控性以及保持创造性自主权等关键领域,用户与自动化工具之间可能产生的紧张关系,以及如何通过工具设计来缓解这些紧张关系。

英文摘要

In this paper, we present the results of an ethnographic study into the impact of AI and automated tools on music production workflow. Focusing specifically on professional participants who identified as recording engineers, mixers, and producers, we discuss their usage of common AI and automated software, as well as their sentiments on the proliferation of these tools. We discuss tensions that may be created between users and automated tools in key areas such as the need for speed and efficiency, controllability, and maintaining creative agency, and how these tensions may be alleviated through tool design.

2605.29927 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

计划方式重要吗?LLM网络代理计划表示的实证研究

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

发表机构 * Concordia University(康考迪亚大学) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Copenhagen(哥本哈根大学) Universite Claude Bernard Lyon(克洛德·贝尔纳·里昂大学) McGill University(麦吉尔大学)

AI总结 本研究提出PlanAhead框架,通过自动难度分类和四种计划表示(顺序子目标、叙述、伪代码、检查清单)的对比实验,发现计划表示形式和生成计划的LLM显著影响网络代理的鲁棒性和任务成功率。

Comments Extended version of paper submitted to EMNLP, waiting for acceptance

详情
AI中文摘要

尽管最近取得了进展,基于LLM的网络代理仍然面临探索有限、遗漏关键步骤以及对任务约束敏感等问题。先前的研究表明,许多这些失败源于规划中的弱点,但替代自然语言计划表示的影响尚未被探索。为了解决这个问题,我们引入了PlanAhead,一个静态规划器-执行器框架,评估计划表示对代理性能的影响。我们首先将WebArena任务自动分类为3个难度级别,无需人工标注即可实现一致的难度分级。然后,我们在被分类为困难的任务上系统评估了4种不同的计划表示:顺序子目标、叙述、伪代码和检查清单;跨越不同系列的多模态LLM驱动的代理(OpenAI、阿里巴巴和谷歌)。为了解释随机变异性,我们引入了两个新的评估指标:达成率(AR)和解决任务一致性(STC)。我们的结果表明,计划制定和生成计划的底层LLM都显著影响网络代理的鲁棒性和任务成功率。

英文摘要

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. Then we systematically evaluate 4 different plan representations on the tasks categorized as hard: sequential subgoals, narrative, pseudocode, and checklist; across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

2605.29919 2026-05-29 cs.AI cs.MA 版本更新

On the Geometry of Games and their Solvers

论博弈及其求解器的几何结构

Yaqi Sun, Julian Ma, David Mguni

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) University College London(伦敦大学学院)

AI总结 提出一种结构感知的求解器合成框架,通过学习连续求解器对齐的博弈几何表示,实现自适应均衡计算并揭示求解器行为的连续区域。

详情
AI中文摘要

博弈论和生成对抗网络等学习系统中的一个核心挑战是理解哪些算法能够在异质博弈景观中高效计算均衡。均衡计算通常按求解器和博弈类别分别研究,产生了强局部保证但碎片化的求解器行为视图。现有的离散分类法往往无法完整解释算法成功的原因。我们通过一个将博弈与有效求解器动力学联系起来的求解器-博弈映射来研究这一问题。经典理论识别出该映射的孤立区域,但对中间或重叠区域提供的见解有限,表明可解性由定义连续求解器对齐博弈几何的潜在结构属性控制。我们通过结构感知的求解器合成来形式化这一视角。一个学习到的结构识别器将每个博弈映射到低维求解器对齐表示,一个策略将该表示映射到有效的原始机制,从而跨区域调整求解器行为。这揭示了特定求解器动力学有效的区域,以及需要原始机制混合而非单一主导求解器的区域。一个有界残差充当局部校正器和诊断信号,用于不完整的求解器基或表示。该框架同时产生自适应求解器和分析视角:具有相似优化动力学的博弈聚类在一起,揭示了算法有效性的连续区域和重叠的求解器行为。实验表明,固定原始机制表现出系统性的区域不匹配,而学习到的表示将博弈空间组织成与求解器行为对齐的结构化地图。这些结果表明,应将均衡计算视为学习求解器机制和映射可解性几何的联合问题。

英文摘要

A central challenge in game theory and learning systems such as GANs is understanding which algorithms can efficiently compute equilibria across the heterogeneous landscape of games. Equilibrium computation is typically studied solver by solver and game class by game class, yielding strong local guarantees but a fragmented view of solver behaviour. Existing discrete taxonomies often provide an incomplete account of where algorithms succeed. We study this problem through a solver-game map linking games to effective solver dynamics. Classical theory identifies isolated regions of this map but provides limited insight into intermediate or overlapping regimes, suggesting that solvability is governed by latent structural properties defining a continuous solver-aligned geometry of games. We formalise this perspective through structure-aware solver synthesis. A learned structure recogniser maps each game to a low-dimensional solver-aligned representation, and a policy maps this representation to effective primitive mechanisms, adapting solver behaviour across regimes. This reveals regions where particular solver dynamics are effective and where mixtures of primitives are required rather than a single dominant solver. A bounded residual acts as a local corrector and diagnostic signal for incomplete solver bases or representations. The framework yields both an adaptive solver and an analytical lens: games with similar optimisation dynamics cluster together, revealing continuous regions of algorithmic validity and overlapping solver behaviour. Empirically, we show that fixed primitives exhibit systematic regime mismatch, while the learned representation organises game space into a structured cartography aligned with solver behaviour. These results suggest viewing equilibrium computation as the joint problem of learning solver mechanisms and mapping the geometry of solvability.

2605.29910 2026-05-29 cs.SE cs.AI 版本更新

Agora: Toward Autonomous Bug Detection in Production-Level Consensus Protocols with LLM Agents

Agora: 面向生产级共识协议中自主漏洞检测的LLM智能体

Xiang Liu, Sa Song, Zhaowei Zhang, Huiying Lan, Jason Zeng, Ming Wu, Michael Heinrich, Yong Sun, Ceyao Zhang

发表机构 * School of Computing, National University of Singapore(新加坡国立大学计算机学院) School of Information and Telecommunication Engineering, Beijing University of Posts and Telecommunications(北京邮电大学信息与电信工程学院) Peking University(北京大学) 0G Labs(0G实验室)

AI总结 提出Agora,一个领域感知的多智能体框架,通过假设驱动测试和LLM协作,在Raft、EPaxos、HotStuff、BullShark四个共识实现中发现15个未知协议级逻辑漏洞,而现有LLM方法未能检测到任何此类漏洞。

Comments 35 pages, 4 figures

详情
AI中文摘要

共识协议构成了分布式系统和区块链的骨干,其中的实现漏洞可能导致数据损坏和财务损失。虽然基于LLM的方法在代码分析中显示出前景,但它们难以处理涉及跨多个执行阶段的复杂状态依赖行为的深层协议级逻辑漏洞。我们提出Agora,一个领域感知的多智能体框架,将假设驱动测试与LLM能力相结合,用于系统性的协议验证。Agora采用专门的智能体,协作探索协议状态空间,使用领域特定约束综合攻击场景,并通过迭代细化验证发现。这种明确的角色分离使得能够推理全局协议不变量,超越单函数代码分析。我们在四个共识实现(Raft、EPaxos、HotStuff、BullShark)上使用四个最先进的LLM评估了Agora。Agora发现了15个先前未知的违反安全属性的协议级逻辑漏洞,而现有的基于LLM的智能体未能检测到任何此类协议级逻辑漏洞。我们的结果表明,领域感知的多智能体协作对于检测复杂协议中的深层逻辑漏洞至关重要。

英文摘要

Consensus protocols form the backbone of distributed systems and blockchains, where implementation bugs can cause data corruption and financial losses. While LLM-based approaches show promise in code analysis, they struggle with deep protocol-level logic bugs involving complex state-dependent behaviors across multiple execution stages. We present Agora, a domain-aware multi-agent framework that integrates hypothesis-driven testing with LLM capabilities for systematic protocol verification. Agora employs specialized agents that collaboratively explore protocol state spaces, synthesize attack scenarios using domain-specific constraints, and validate findings through iterative refinement. This explicit role separation enables reasoning about global protocol invariants beyond single-function code analysis. We evaluate Agora on four consensus implementations (Raft, EPaxos, HotStuff, BullShark) using four state-of-the-art LLMs. Agora discovers 15 previously unknown protocol-level logic bugs that violate safety properties, while existing LLM-based agents fail to detect any such protocol-level logic bugs. Our results demonstrate that domain-aware multi-agent collaboration is essential for detecting deep logic bugs in complex protocols.

2605.29893 2026-05-29 cs.AI 版本更新

Redundant or Necessary? A Benchmark for Detecting Redundant Steps in Agent Trajectories

冗余还是必要?检测智能体轨迹中冗余步骤的基准

Minyang Hu, Bo Yang, Zhinuo Zhou, Jiachen Liang, Guo Jiahao, Yiyang Yin, Xiongwei Han

发表机构 * Huawei Technologies(华为技术有限公司) Noah's Ark Lab(诺亚方舟实验室) Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所)

AI总结 针对LLM智能体轨迹中的冗余步骤检测问题,提出RedundancyBench基准,包含标注轨迹的数据集,并评估三种方法,发现最佳方法仅达到24.88%的检测分数。

详情
AI中文摘要

基于LLM的智能体通过多步推理和工具使用在解决复杂任务方面表现出强大的能力。然而,现有的评估协议主要关注任务成功,忽略了智能体行为的一个关键方面:执行效率。在实践中,智能体轨迹通常包含冗余步骤,这些步骤消耗大量资源但对任务完成贡献甚微。在这项工作中,我们提出并定义了一个新的研究领域:智能体轨迹的冗余步骤检测。为了支持这一倡议,我们引入了RedundancyBench,这是一个新的基准,包含多样化的任务和精心标注的轨迹,其中每个步骤根据其对任务完成的贡献进行标记。利用RedundancyBench,我们开发并评估了3种代表性方法,以回答轨迹中的步骤是冗余还是必要的问题。我们的结果表明,即使是最优方法在检测冗余步骤方面也仅达到24.88%的分数,而有些方法的表现甚至不如随机猜测。这些结果突显了该任务的复杂性以及在该领域进一步研究的必要性。本文的代码和数据集均可在 https://anonymous.4open.science/r/RedundancyBench 获取。

英文摘要

LLM-based agents have demonstrated strong capabilities in solving complex tasks through multi-step reasoning and tool use. However, existing evaluation protocols primarily focus on task success, overlooking a critical aspect of agent behavior: execution efficiency. In practice, agent trajectories often contain redundant steps that consume substantial resources while contributing little to task completion. In this work, we propose and formulate a new research area: redundant step detection for agent trajectories. To support this initiative, we introduce RedundancyBench, a new benchmark that contains diverse tasks with carefully annotated trajectories, where each step is labeled according to its contribution to task completion. Using RedundancyBench, we develop and evaluate 3 representative methods to answer whether a step within a trajectory is redundant or necessary. Our results show that even the best-performing method achieves a score of only 24.88% in detecting redundant steps, while some methods perform worse than random guessing. These results highlight the task's complexity and the need for further research in this area. Code and dataset are available at https://anonymous.4open.science/r/RedundancyBench.

2605.29889 2026-05-29 cs.CL cs.AI 版本更新

Internal Representation, Not Clinical Knowledge: Where Apparent LLM Triage Failures Originate

内部表示,而非临床知识:明显的大语言模型分诊失败源于何处

David Fraile Navarro, Berardino Como, Jialei Sheng, Soundariya Ananthan, Shlomo Berkovsky

发表机构 * Macquarie University(麦考瑞大学) Politecnico di Bari(巴里理工大学) NSW Health(新南威尔士州卫生部) Independent Researcher(独立研究者)

AI总结 本研究通过稀疏自编码器特征分析,发现大语言模型在分诊任务中表现不佳源于输出格式限制,而非临床知识表示缺陷。

Comments 9 pages main text, 27 pages total including appendices; 7 figures, 25 tables

详情
AI中文摘要

患者语音临床分诊基准报告显示,在受限的多选输出中,消费级大语言模型存在较高的分诊不足率,但同样的案例在自由文本中得分不同。我们探究输出格式是否改变了模型的临床表示,还是仅改变了从保留表示到答案的映射。使用Gemma 3 4B/12B IT和Qwen3-8B中的稀疏自编码器(SAE)特征,我们发现相同的医学特征在两种格式下对共享临床叙述激活,但在所有模型的每个案例的多选决策标记处变得沉默。三种独立方法(自然语言自编码器言语化、决策标记logit归因和顶部特征表征)一致认为,驱动决策logit的是支架和格式特征,而非医学特征。行为上,多选惩罚在结构化和自然语言输入下均反转,选项顺序洗牌排除了位置偏差,且差距主要由相差一级的决策(模型选择与黄金答案相邻的敏锐度字母)主导,而非知识失败。因此,失败源于输出格式,而非临床表示。

英文摘要

Patient-voiced clinical-triage benchmarks report high under-triage rates for consumer LLMs under constrained multiple-choice output, yet the same cases score differently with free-text output. We ask whether output format changes the model's clinical representation or only the mapping from a preserved representation to an answer. Using sparse-autoencoder (SAE) features in Gemma 3 4B/12B IT and Qwen3-8B, we find the same medical features fire on the shared clinical narrative under both formats but go silent at the multiple-choice decision token in every case for every model. Three independent methods (natural-language autoencoder verbalization, decision-token logit attribution, and top-feature characterization) agree that scaffold and format features, but not medical features, drive the decision logits. Behaviorally, the multiple-choice penalty inverts under both structured and natural-language input, option-order shuffle rules out positional bias, and the gap is dominated by off-by-one decisions (the model picks an acuity letter adjacent to the gold answer) rather than knowledge failure. Thus, the failure originates in the output format and not in the clinical representation.
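The off-by-one observation in the abstract above can be illustrated with a small sketch. It assumes a hypothetical four-level ordered acuity scale labeled A to D (the abstract does not state the actual scale) and uses made-up predictions; `adjacent_error_share` is an illustrative helper, not the authors' code.

```python
# Hypothetical ordered acuity scale; the paper's actual labels may differ.
ACUITY_LEVELS = ["A", "B", "C", "D"]


def adjacent_error_share(gold, predicted, levels=ACUITY_LEVELS):
    """Return the fraction of wrong answers that sit exactly one level from gold."""
    rank = {label: i for i, label in enumerate(levels)}
    errors = [(g, p) for g, p in zip(gold, predicted) if g != p]
    if not errors:
        return 0.0
    adjacent = sum(1 for g, p in errors if abs(rank[g] - rank[p]) == 1)
    return adjacent / len(errors)


# Made-up example data, for illustration only.
gold = ["A", "B", "C", "D", "B"]
pred = ["B", "B", "D", "A", "C"]
share = adjacent_error_share(gold, pred)
```

A share close to 1 would mean that most errors land on a neighbouring acuity level, which is the pattern the abstract describes as dominating the gap.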

2605.29888 2026-05-29 cs.LG cs.AI 版本更新

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

LaRA: 面向RL后训练中数据污染的逐层表示分析

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim

发表机构 * Yonsei University(延世大学) Seoul National University(首尔国立大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出LaRA框架,通过逐层表示分析检测强化学习后训练中的污染数据,利用扰动敏感性、方向坍缩和局部表示刚性三个指标,优于现有输出级方法。

Comments Work in Progress

详情
AI中文摘要

强化学习(RL)后训练已被证明能提升大型语言模型(LLMs)的推理能力。然而,关于RL后训练中数据污染问题的探索很少,这可能损害训练过程本身的泛化能力和评估可靠性。现有的检测方法主要依赖于输出级信号,如似然或熵,这对于RL训练的模型变得不可靠,因为RL通过轨迹级奖励而非token似然来塑造行为。我们提出LaRA,一个用于检测RL后训练LLMs中数据污染的逐层表示分析框架。LaRA引入了三个互补指标,测量受控扰动下的扰动敏感性、方向坍缩和局部表示刚性。我们发现污染会在各层产生渐进式的几何偏差,包括放大的扰动敏感性、更强的方向坍缩和增强的局部刚性。基于我们的发现,我们还开发了一个污染检测协议,聚合跨层和跨指标的表示级偏差。在RL训练推理模型上的实验表明,我们的协议在污染检测方面优于现有的输出级基线。

英文摘要

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of the problem of data contamination in RL post-training, potentially undermining generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics, measuring perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on our findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.
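To make the idea of a layer-wise perturbation sensitivity concrete, here is a minimal sketch on a toy stack of tanh layers. The abstract does not define LaRA's metrics, so the formula used here (norm of the change in a layer's activations divided by the norm of the input noise) is only one plausible reading, and every name in the code is hypothetical.

```python
import math
import random


def layer_outputs(x, weights):
    """Run a toy stack of tanh layers and return every layer's activation vector."""
    outputs = []
    h = x
    for w in weights:
        h = [math.tanh(sum(wij * hj for wij, hj in zip(row, h))) for row in w]
        outputs.append(h)
    return outputs


def perturbation_sensitivity(x, weights, eps=1e-2, seed=0):
    """Per-layer ||h(x + noise) - h(x)|| / ||noise|| for one Gaussian perturbation.

    This ratio is an assumed stand-in, not the metric defined in the paper.
    """
    rng = random.Random(seed)
    noise = [eps * rng.gauss(0.0, 1.0) for _ in x]
    noise_norm = math.sqrt(sum(n * n for n in noise))
    clean = layer_outputs(x, weights)
    noisy = layer_outputs([xi + ni for xi, ni in zip(x, noise)], weights)
    return [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(hc, hn))) / noise_norm
        for hc, hn in zip(clean, noisy)
    ]


# Toy random network, for illustration only.
rng = random.Random(1)
dim, depth = 4, 3
weights = [
    [[rng.gauss(0.0, 0.5) for _ in range(dim)] for _ in range(dim)]
    for _ in range(depth)
]
scores = perturbation_sensitivity([0.1, -0.2, 0.3, 0.0], weights)
```

In a real setting the toy layers would be replaced by the hidden states of the model under study, and the per-layer scores would be compared between suspected-contaminated and clean inputs.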

2605.29886 2026-05-29 cs.CL cs.AI 版本更新

CRITIC-R1: Learning Structured Critics for Retrieval-Augmented Generation

CRITIC-R1: 学习结构化评论用于检索增强生成

Wenhan Xiao, Ziwei Zhang, Chuanyue Yu, Xingcheng Fu, Qingyun Sun, Runhua Xu, Jianxin Li

发表机构 * Nankai University(南开大学) Beihang University(北京航空航天大学) Guangxi Normal University(广西师范大学)

AI总结 提出CRITIC-R1框架,通过强化学习将RAG评论建模为结构化错误诊断问题,设计保守判断对齐和诊断质量对齐奖励函数,提升检索增强生成的答案质量。

Comments 17 pages,13 figures

详情
AI中文摘要

检索增强生成(RAG)通过引入外部证据改进了知识密集型问答。然而,现有的RAG方法仍然存在幻觉和细微推理错误。最近的研究引入外部评论来优化RAG输出,但它们通常提供粗粒度且结构薄弱的反馈,表现出过度激进的干预,导致噪声大且不可靠的优化,限制了其纠正效果。为解决这些问题,我们提出了CRITIC-R1,一个结构化评论框架,将RAG评论制定并学习为使用强化学习(RL)的显式错误诊断问题。我们的框架将常见的RAG错误分类为多个诊断维度,包括判定、错误位置、推理分析和修复生成。为了学习这些能力,我们设计了两个奖励函数:保守判断对齐(CJA)首先鼓励校准的高层判断,同时减轻过度激进现象;而诊断质量对齐(DQA)通过门控奖励进一步改进细粒度诊断反馈。我们使用基于GRPO的RL训练评论模型,并从外部LLM教师模型收集过程级监督。在五个QA基准上的实验表明,CRITIC-R1在强RAG基线上持续提高了答案质量。我们的源代码可在 https://anonymous.4open.science/r/critic-r1-FCB0 获取。

英文摘要

Retrieval-augmented generation (RAG) improves knowledge-intensive question answering by incorporating external evidence. However, existing RAG methods still suffer from hallucinations and subtle reasoning errors. Recent studies introduce external critics to refine RAG outputs, yet they often provide coarse-grained and weakly structured feedback, exhibit over-aggressive intervention, and lead to noisy and unreliable refinement, limiting their effectiveness for correction. To tackle these issues, we propose CRITIC-R1, a structured critic framework that formulates and learns RAG critique as an explicit error diagnosis problem using reinforcement learning (RL). Our framework categorizes common RAG errors into multiple diagnostic dimensions, including verdict, error location, reasoning analysis, and fix generation. To learn these capabilities, we design two reward functions: Conservative Judgement Alignment (CJA) first encourages calibrated high-level judgements while mitigating the over-aggressive phenomenon, whereas Diagnostic Quality Alignment (DQA) further improves fine-grained diagnostic feedback through gated rewards. We train the critic model using GRPO-based RL with process-level supervision collected from external LLM teacher models. Experiments across five QA benchmarks show that CRITIC-R1 consistently improves answer quality over strong RAG baselines. Our source code is available at https://anonymous.4open.science/r/critic-r1-FCB0

2605.29881 2026-05-29 cs.CV cs.AI 版本更新

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

通过屏障调控自适应闭式引导缓解视觉语言模型中的幻觉

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

发表机构 * Indian Institute of Technology Guwahati(印度理工学院古瓦哈提分校)

AI总结 提出BRACS框架,通过监测视觉注意力并仅在接地退化时进行闭式修正,无需训练即可有效减少LVLM中的物体幻觉。

详情
AI中文摘要

大型视觉语言模型(LVLMs)经常幻觉出输入图像中不存在的物体,这主要是因为随着解码进行,视觉接地减弱。现有的推理时缓解方法在生成过程中修改logits或隐藏状态,但它们存在三个关键限制:缺乏明确的接地目标,即使在模型已经良好接地时也进行干预,以及使用固定的修正强度,无法适应接地失败的严重程度。我们提出BRACS(屏障调控自适应闭式引导),一种无需训练的引导框架,通过屏障调控自适应闭式引导解决这些问题。BRACS监测模型自身的注意力以衡量视觉接地,并仅在接地恶化时对隐藏状态进行修正。修正更新以闭式解析计算,无需训练辅助网络或重新训练模型。在LLaVA-1.5-7B和Qwen-VL-Chat上的实验表明,BRACS在幻觉基准上持续优于先前方法,将CHAIR$_s$降低9.4个点,将POPE F1提高2.7个点,同时在四个通用多模态基准上匹配或提升性能。BRACS还保持高效,运行速度为贪心解码吞吐量的80%,平均速度比基线快1.3倍。

英文摘要

Large vision-language models (LVLMs) often hallucinate objects that are not present in the input image, largely because visual grounding weakens as decoding progresses. Existing inference-time mitigation methods modify logits or hidden states throughout generation, but they suffer from three key limitations: they lack an explicit grounding objective, intervene even when the model is already well-grounded, and use fixed correction strengths that do not adapt to the severity of grounding failure. We propose BRACS (Barrier-Regulated Adaptive Closed-form Steering), a training-free steering framework that addresses these issues through barrier-regulated adaptive closed-form steering. BRACS monitors the model's own attention to measure visual grounding and applies corrections to the hidden states only when grounding deteriorates. The corrective update is computed analytically in closed form, requiring no training of auxiliary networks or model retraining. Experiments on LLaVA-1.5-7B and Qwen-VL-Chat show that BRACS consistently outperforms prior methods on hallucination benchmarks, reducing CHAIR$_s$ by 9.4 points and improving POPE F1 by 2.7 points, while matching or improving performance on four general multimodal benchmarks. BRACS also remains efficient, operating at 80% of greedy decoding throughput and achieving 1.3 times higher speed on average than the baselines.

2605.29873 2026-05-29 cs.AI 版本更新

Moment-KV: Momentum-Based Decode-Time KV Cache Compression for Long Generation

Moment-KV: 基于动量的解码时KV缓存压缩用于长文本生成

Soumyadeep Jana, Sagar Nishad, Sanasam Ranbir Singh

发表机构 * Indian Institute of Technology Guwahati(印度理工学院古瓦哈提分校)

AI总结 提出Moment-KV方法,利用动量驱动的时序注意力聚合在解码阶段压缩KV缓存,以提升长文本生成质量并保持解码延迟。

详情
AI中文摘要

键值(KV)缓存仍然是大型语言模型(LLM)在长文本生成任务中部署的主要瓶颈。先前的工作通常对预填充和解码缓存应用均匀压缩,但压缩预填充缓存会破坏关键上下文从而降低性能。虽然保留预填充缓存至关重要,但解码阶段的压缩仍未被充分探索,现有方法依赖于固定的近期窗口或瞬时注意力。我们对注意力动态的分析揭示了强时间模式:关键标记在长时间范围内获得持续注意力,而局部推理涉及短暂的爆发。静态启发式方法无法捕捉这种行为,导致重要标记被过早驱逐或陈旧标记被保留。我们提出Moment-KV,一种基于动量驱动的时序注意力聚合的解码时KV缓存压缩方法。我们的方法将标记重要性建模为连续演化的状态,其中注意力通过衰减进行聚合,捕捉长期影响和近期相关性。实验表明,Moment-KV在长文本生成任务中显著提高了生成保真度(2.3-3.2%),同时保持了解码延迟。

英文摘要

The key-value (KV) cache remains a major bottleneck for deploying Large Language Models (LLMs) in long-generation tasks. Prior work often applies uniform compression across both prefill and decoding caches, but compressing the prefill cache degrades performance by corrupting critical context. While preserving the prefill cache is essential, decoding-phase compression remains underexplored, with existing methods relying on rigid recency windows or instantaneous attention. Our analysis of attention dynamics reveals strong temporal patterns: critical tokens receive sustained attention over long horizons, while local reasoning involves short-lived bursts. Static heuristics fail to capture this behavior, leading to premature eviction of important tokens or retention of stale ones. We propose Moment-KV, a decoding-time KV cache compression method based on momentum-driven temporal attention aggregation. Our method models token importance as a continuously evolving state, where attention is aggregated with decay, capturing both long-term influence and recent relevance. Experiments show that Moment-KV significantly improves generation fidelity in long-generation tasks (by 2.3-3.2%) while maintaining decoding latency.
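The idea of aggregating attention with decay can be illustrated with a small sketch. This is not the authors' implementation: the exponential-moving-average update, the `decay` value, the `budget` parameter, and both function names are assumptions chosen for illustration. It only shows how a momentum-style score can favor a token with sustained attention over one with a short burst, while leaving the prefill cache untouched.

```python
def update_importance(scores, attention, decay=0.9):
    """Exponential-moving-average update of per-token importance (assumed form)."""
    return [decay * s + (1.0 - decay) * a for s, a in zip(scores, attention)]


def evict_decode_tokens(scores, num_prefill, budget):
    """Return sorted indices of cached tokens to keep.

    All prefill tokens (indices < num_prefill) are always kept; among the
    decode-phase tokens only the `budget` highest-scoring ones survive.
    """
    prefill = list(range(num_prefill))
    decode = list(range(num_prefill, len(scores)))
    # Highest momentum score first; ties are broken in favour of newer tokens.
    decode.sort(key=lambda i: (scores[i], i), reverse=True)
    return prefill + sorted(decode[:budget])


# Toy trace: 2 prefill tokens and 3 decode tokens over three decoding steps.
scores = [0.0] * 5
attention_trace = [
    [0.1, 0.1, 0.6, 0.1, 0.1],  # token 2 receives a single short burst
    [0.1, 0.1, 0.1, 0.6, 0.1],  # token 3 receives attention ...
    [0.1, 0.1, 0.1, 0.6, 0.1],  # ... over two consecutive steps
]
for attention in attention_trace:
    scores = update_importance(scores, attention, decay=0.5)

kept = evict_decode_tokens(scores, num_prefill=2, budget=1)
print(kept)
```

With a decode budget of one token, the token with sustained attention is retained and the one with a single burst is evicted; a fixed recency window or a single-step attention snapshot could rank them differently.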

2605.29862 2026-05-29 eess.AS cs.AI cs.SD 版本更新

Mitigating Stethoscope-Induced Shortcuts in Respiratory Sound Classification under Federated Domain Generalization with Causality-Inspired Interventions

在联邦域泛化下通过因果启发的干预减轻听诊器引起的呼吸音分类中的捷径

Heejoon Koo, Yoon Tae Kim, Miika Toikkanen, June-Woo Kim

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) RSC LAB(RSC实验室) Wonkwang University(圆光大学)

AI总结 针对呼吸音分类中听诊器设备差异导致的域偏移问题,提出一种因果启发的多模态联邦域泛化框架,通过内容保持的风格扰动、反事实文本增强和梯度对齐实现设备不变表示,在ICBHI和SPRSound数据集上优于传统方法。

Comments 2 figures, 4 tables, and 5 pages

详情
AI中文摘要

基于AI的呼吸音分类(RSC)有望实现自动化肺部疾病检测,但多站点部署受到听诊器间差异的阻碍。我们针对听诊器引起的设备偏移引入了一种联邦域泛化(FedDG)公式,其中客户端使用异构设备,模型在未见设备上进行评估。我们的实证分析表明,听诊器引起的风格和疾病特定内容紧密纠缠,使得确定性风格去除不可靠。为此,我们提出了一种因果启发的多模态FedDG框架,结合了:(i) 因果启发的设备风格干预网络,执行内容保持的风格扰动,(ii) 反事实文本增强,中和元数据捷径,以及(iii) 梯度对齐,促进跨客户端的设备不变表示。基于多模态语言-音频预训练模型,在ICBHI和SPRSound数据集上的留一设备验证中,它优于传统数据增强和联邦学习基线。代码将在发表后发布。

英文摘要

AI-driven respiratory sound classification (RSC) is promising for automated pulmonary disease detection, yet multi-site deployment is hindered by inter-stethoscope variability. We introduce a federated domain generalization (FedDG) formulation for RSC under stethoscope-induced device shifts, where clients use heterogeneous devices and the model is evaluated on unseen devices. Our empirical analysis shows that stethoscope-induced style and disease-specific content are tightly entangled, making deterministic style removal unreliable. In response, we propose a causality-inspired multimodal FedDG framework that combines: (i) a causality-inspired device style intervention network that performs content-preserving style perturbations, (ii) counterfactual text augmentation that neutralizes metadata shortcuts, and (iii) gradient alignment that facilitates device-invariant representations across clients. Built on a multimodal language-audio pretraining model, it outperforms conventional data augmentation and federated learning baselines in leave-one-device-out validation on ICBHI and SPRSound datasets. Code will be released upon publication.

2605.29860 2026-05-29 cs.LG cs.AI 版本更新

ESPO: Early-Stopping Proximal Policy Optimization

ESPO:早期停止的近端策略优化

Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li, Yongbin Li, Tong Yang, Jieping Ye

发表机构 * Tongyi Lab(通义实验室) Alibaba Group(阿里巴巴集团) Peking University(北京大学)

AI总结 提出ESPO算法,通过在强化学习训练大语言模型时在线检测轨迹失败并提前终止,节省计算资源并提升数学推理性能。

详情
AI中文摘要

当大语言模型在强化学习过程中,在轨迹早期出现错误的推理步骤时,标准算法会强制其继续生成直到最大步长,从而在从未获得正奖励的令牌上浪费计算资源,并用失败后的噪声污染优势估计。我们提出ESPO(早期停止的近端策略优化),该算法能够在线检测轨迹失败并提前终止轨迹生成。在每个生成步骤中,ESPO仅利用采样过程中已计算出的logits计算一个替代遗憾值,并在平滑累积遗憾值显著超过其估计值时终止。截断轨迹被视为具有终止奖励的吸收失败状态,将负的时间差分误差集中在检测到的失败步骤附近,无需任何额外的奖励模型或人工标注。在基于DeepSeek-R1-Distill-Qwen-7B训练的数学推理任务上,ESPO在AIME 2024(46.28% vs. 45.25%)、AMC 2023(85.83% vs. 82.94%)和MATH-500(87.42% vs. 85.43%)上超越了PPO,同时累计节省了超过20%的轨迹生成令牌。

英文摘要

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% rollout tokens cumulatively.
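A rough sketch of the early-stopping test described above follows. The abstract does not define the surrogate regret, the smoothing rule, or the estimate it is compared against, so everything here is an assumption: the regret is taken as the gap between the best logit and the sampled token's logit, the smoothing is an exponential moving average, and `expected_per_step` and `margin` are hypothetical stand-ins for the estimated value and the significance threshold.

```python
def surrogate_regret(logits, sampled_index):
    """Gap between the best logit and the sampled token's logit (assumed proxy)."""
    return max(logits) - logits[sampled_index]


def first_stop_step(step_logits, sampled, expected_per_step, margin, smoothing=0.5):
    """Return the first step at which the rollout would be cut, else None."""
    cumulative = 0.0
    smoothed = 0.0
    for t, (logits, index) in enumerate(zip(step_logits, sampled)):
        cumulative += surrogate_regret(logits, index)
        # Exponential smoothing of the running cumulative regret.
        smoothed = smoothing * smoothed + (1.0 - smoothing) * cumulative
        # Stop once the smoothed value exceeds the (hypothetical) estimate
        # for this step by more than the margin.
        if smoothed > expected_per_step * (t + 1) + margin:
            return t
    return None


# Toy rollout: the sampled token drifts away from the top-logit token.
step_logits = [[2.0, 1.0], [2.0, 0.0], [3.0, 0.0], [3.0, 0.0]]
sampled = [0, 1, 1, 1]
stop = first_stop_step(step_logits, sampled, expected_per_step=0.5, margin=1.0)
print(stop)
```

The sketch reuses only the logits produced during sampling, which matches the abstract's claim that no extra reward model is needed; how the truncated trajectory is then rewarded is outside its scope.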

2605.29843 2026-05-29 cs.LG cs.AI 版本更新

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

HARP: 哈达玛预条件自适应旋转处理器用于极端LLM量化

Artur Zagitov, Gleb Molodtsov, Aleksandr Beznosikov

发表机构 * BRAIn Lab(BRAIn实验室)

AI总结 提出HARP,一种可学习的结构化双正交处理器,替代固定随机哈达玛变换,通过自适应旋转基来改善极端低位量化中的激活异常值和各向异性权重曲率问题,在2-4比特设置下提升困惑度和零样本准确率,并保持部署效率。

详情
AI中文摘要

后训练量化(PTQ)对于在内存和带宽约束下部署LLM至关重要。然而,极端低位量化仍然对激活异常值和各向异性权重曲率高度敏感。现有的基于非相干性的PTQ方法通过固定的随机哈达玛变换(RHT)缓解了这一问题,这提高了量化鲁棒性,但无法将旋转基适应于层、校准分布或量化器。我们引入了HARP(哈达玛预条件自适应旋转处理器),一种可学习的结构化双侧正交处理器,它替代了固定的哈达玛混合,同时保留了精确的全精度等价性。HARP将每个旋转表示为稀疏蝶形类块正交阶段的乘积,通过混合基数调度支持非2的幂次维度,并且初始化时与RHT处理器等价(仅相差一个固定置换)。仅在校准数据上拟合,HARP将量化基适应于每一层和后端。在从1B到70B参数的模型的2-4比特设置中,HARP在困惑度和零样本准确率上优于固定RHT。重要的是,HARP保持了部署效率,达到128 tok/s,而FP16为61 tok/s。

英文摘要

Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via Mixed-Radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2-4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.
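The "exact full-precision equivalence" property can be checked numerically for the fixed randomized Hadamard baseline. The sketch below is an illustration under stated assumptions, not HARP itself: it builds an orthogonal matrix R = H·diag(signs)/√n and verifies that (W Rᵀ)(R x) = W x. The function names, the toy weights, and the seed are hypothetical. The same identity holds for any orthogonal R, which is why a learned orthogonal processor can replace the fixed one without changing full-precision outputs.

```python
import math
import random


def hadamard(n):
    """Sylvester construction; n must be a power of two."""
    h = [[1.0]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def randomized_hadamard(n, seed=0):
    """Orthogonal matrix H * diag(random signs) / sqrt(n)."""
    rng = random.Random(seed)
    signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
    return [[v * s / math.sqrt(n) for v, s in zip(row, signs)] for row in hadamard(n)]


n = 4
rotation = randomized_hadamard(n)
weights = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.25, -0.75, 3.0]]  # toy 2x4 layer
activation = [[1.0], [-2.0], [0.5], [4.0]]                   # toy input column

rotated_weights = matmul(weights, transpose(rotation))  # W R^T
rotated_activation = matmul(rotation, activation)       # R x
original = matmul(weights, activation)
recovered = matmul(rotated_weights, rotated_activation)
```

Quantization would be applied to `rotated_weights` and `rotated_activation`; the rotation spreads outliers across coordinates while the unquantized product stays unchanged.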

2605.29836 2026-05-29 cs.LG cs.AI stat.ML 版本更新

CB-SLICE: Concept-Based Interpretable Error Slice Discovery

CB-SLICE: 基于概念的可解释错误切片发现

Yael Konforti, Mateo Espinosa Zarlenga, Elaf Almahmoud, Mateja Jamnik

发表机构 * Department of Computer Science and Technology, University of Cambridge, Cambridge, UK(计算机科学与技术系,剑桥大学,剑桥,英国) Trinity College, University of Oxford, Oxford, UK(牛津大学三一学院,牛津,英国) Cambridge Institute for Technology and Humanity, Cambridge, UK(剑桥技术与人类研究所,剑桥,英国)

AI总结 提出CB-SLICE方法,利用概念瓶颈模型的概念预测失败来发现错误切片,并通过关键词概念解释失败模式,优于现有方法。

Comments 20 pages, 7 figures, 12 tables, to be published at Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

尽管平均性能强劲,深度学习模型在特定人群组(称为错误切片)上常表现出系统性错误。识别这些组及其失败的根本原因对于模型调试和偏差缓解至关重要。然而,现有的错误切片发现方法(SDMs)通常生成与模型推理过程脱节的解释,因此只能近似潜在错误源,可能不准确。我们通过利用概念瓶颈模型(CBMs)来解决这一局限,其预测直接依赖于人类可理解的语义概念。由于CBM中下游任务失败通常源于概念预测错误,概念表示为错误切片识别提供了强有力的候选,提供了直接关联错误源的细粒度解释。基于这一见解,我们引入CB-SLICE,一种基于概念的SDM,它将共享概念预测失败的样本分组,并识别每个切片失败模式中最关键的关键词概念。在多个基准测试中,我们展示了CB-SLICE在发现已知偏差方面优于最先进方法,同时提供更丰富、更忠实的模型错误解释。

英文摘要

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process; they therefore only approximate the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.

2605.29829 2026-05-29 cs.AI cs.LG 版本更新

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

OptSkills: 通过基于聚类的蒸馏从问题原型中学习可泛化的优化技能

Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian

发表机构 * East China Normal University(华东师范大学) Shanghai Innovation Institute(上海创新研究院) Ant Group(蚂蚁集团)

AI总结 提出OptSkills系统,通过聚类问题原型、蒸馏成功轨迹为可复用工作流技能,并动态扩展技能库,提升优化建模与求解的分布内和分布外泛化能力。

Comments 22 pages, 10 figuers, project: https://github.com/fujiwaranoM0kou/OptSkills

详情
AI中文摘要

利用大型语言模型(LLM)从自然语言自动制定和求解优化问题已成为自动化优化的高效范式。然而,现有方法仍表现出有限的泛化能力:它们对表面叙述变化敏感,主要在案例层面复用经验,难以适应变化或新兴的问题类型。我们提出OptSkills,一个以原型为中心的技能学习和推理智能体系统,用于优化建模和求解。为提升鲁棒泛化,我们的系统根据问题的底层原型而非表面叙述进行聚类。为提升分布内泛化,它在每个聚类内探索多样的建模范式和求解器配置,然后将成功轨迹蒸馏为可重用的工作流级技能。为提升分布外泛化,它利用新获得的轨迹改进现有技能或扩展技能库。我们的系统在涵盖多种问题类型和场景的数据集上达到了68.27%的最先进微平均准确率。此外,在极具挑战性的大规模高维基准MIPLIB-NL上,它达到了26.91%的准确率,比DeepSeek-V3.2-Thinking高出4.53%。在Nano-CO上进行技能学习后,它在OOD NLCO基准上达到了72.79%。代码和技能可在https://github.com/fujiwaranoM0kou/OptSkills获取。

英文摘要

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.

2605.29826 2026-05-29 cs.CL cs.AI 版本更新

Towards Localized and Disentangled Knowledge Editing for Multimodal Large Language Models

面向多模态大语言模型的局部化与解耦知识编辑

Leijiang Gu, Zhen Zeng, Feng Li, Xinjian Gao, Zenglin Shi

发表机构 * Hefei University of Technology(合肥工业大学) Tongji University(同济大学)

AI总结 针对多模态知识编辑中因果错位和特征纠缠问题,提出LDKE框架,通过快速定位关键层和解耦分类器实现精准泛化编辑并保持高局部性。

详情
AI中文摘要

现有的多模态知识编辑(MKE)方法在纠正多模态大语言模型(MLLMs)中过时或不准确的知识方面取得了进展。然而,它们存在一个关键局限性:虽然能有效修改目标事实对,但无法将编辑泛化到逻辑相关的查询,并且常常对无关但视觉或语义上关联的信息造成意外改变。我们识别并形式化了导致该问题的两种潜在失败模式:因果错位(将编辑限制在特定样本)和特征纠缠(对耦合但无关的信息造成意外改变)。为解决这些问题,我们提出局部化与解耦知识编辑(LDKE),一种通过定位事实特定模型层并将目标相关输入与无关输入解耦来实现精确和泛化编辑的新框架。我们的方法引入快速定位模块以高效识别和更新关键层,以及解耦分类器以适当路由输入从而保留无关知识。在各种基准和MLLMs上的大量实验表明,LDKE在将编辑传播到相关上下文方面实现了优越性能,同时保持了高局部性。

英文摘要

Existing methods in Multimodal Knowledge Editing (MKE) have advanced the ability to correct outdated or inaccurate knowledge in Multimodal Large Language Models (MLLMs). However, they exhibit a critical limitation: while effectively modifying target factual pairs, they fail to generalize edits to logically related queries and often cause unintended alterations to unrelated but visually or semantically linked information. We identify and formalize two underlying failure modes causing this issue: Causal Misalignment, which confines edits to the specific sample, and Feature Entanglement, which causes unintended alterations to coupled but irrelevant information. To address these issues, we propose Localized and Disentangled Knowledge Editing (LDKE), a new framework that achieves precise and generalized editing by localizing fact-specific model layers and disentangling target-relevant inputs from irrelevant ones. Our approach introduces a Fast Localization module to identify and update critical layers efficiently, along with a Disentanglement Classifier that routes inputs appropriately to preserve unrelated knowledge. Extensive experiments across various benchmarks and MLLMs demonstrate that LDKE achieves superior performance in propagating edits to related contexts while maintaining high locality.

2605.29822 2026-05-29 cs.SE cs.AI 版本更新

Inferring Code Correctness from Specification

从规约推断代码正确性

Tambon Florian, Papadakis Mike

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 提出TRAILS方法,通过基于规约的类别划分生成测试输入并执行,利用LLM评估输入输出对是否符合规约,从而推断代码正确性,在LiveCodeBench和CoCoClaNeL数据集上相比基线方法提升了马修斯相关系数并增强了稳定性。

详情
AI中文摘要

大型语言模型(LLM)已成为现代软件开发不可或缺的一部分,实现了大规模自动代码生成。然而,验证LLM生成代码的正确性仍然是一个关键且基本未解决的挑战。现有方法要么依赖多个代码候选之间的动态共识——这使得它们成本高昂且难以扩展,要么依赖静态推理,容易受到动态错误和顺序偏差的影响。在本文中,我们提出TRAILS(通过输入和规约的目标推理一致性),一种将LLM推理与具体(输入,输出)对相结合的方法。TRAILS首先基于规约通过类别划分生成多样化的测试输入,然后针对候选代码执行这些输入,并提示LLM评估产生的输入输出对是否符合规约——而无需对代码本身进行推理。分数跨输入聚合,以确定程序是否可能正确。我们在两个数据集LiveCodeBench和CoCoClaNeL上,使用三个LLM(Qwen3Coder-30B、Devstral-Small-24B和Olmo3.1-Instruct)评估TRAILS,并与HoarePrompt和零样本思维链基线进行比较。TRAILS的马修斯相关系数相比零样本思维链提高了高达39%,并且始终优于HoarePrompt。除了准确性,TRAILS在多次运行中表现出更高的稳定性,降低了对LLM非确定性的敏感性,并且相比竞争方法为更多独特的代码样本分配了正确的标签。

英文摘要

Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge. Existing approaches either rely on dynamic consensus across multiple code candidates - making them costly and difficult to scale - or on static reasoning that is susceptible to dynamic bugs and order bias. In this paper, we propose TRAILS (Targeted Reasoning Agreement via Inputs and Specifications), an approach that grounds LLM reasoning with concrete (input, output) pairs. TRAILS first generates diverse test inputs via category partitioning based on the specification, then executes them against the candidate code and prompts LLMs to assess whether the resulting input-output pairs conform to the specification - without ever reasoning over the code itself. Scores are aggregated across inputs to determine whether the program is likely correct. We evaluate TRAILS on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), comparing against HoarePrompt and a Zero-Shot Chain-of-Thought baseline. TRAILS improves the Matthews correlation coefficient by up to 39% relative to Zero-Shot Chain-of-Thought and consistently outperforms HoarePrompt. Beyond accuracy, TRAILS demonstrates greater stability across seeded runs, reducing sensitivity to LLM non-determinism, and assigns correct labels to a larger set of unique code samples than competing approaches.
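For readers unfamiliar with the reported metric, the Matthews correlation coefficient for binary labels is (TP·TN − FP·FN) / √((TP+FP)(TP+FN)(TN+FP)(TN+FN)). The sketch below computes it and pairs it with a deliberately simple aggregation step. The abstract does not specify how per-input scores are aggregated, so the mean-and-threshold rule and the name `judge_program` are assumptions made only for illustration.

```python
import math


def matthews_corrcoef(y_true, y_pred):
    """Matthews correlation coefficient for binary (0/1) labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention the coefficient is 0 when any marginal count is empty.
    return (tp * tn - fp * fn) / denom if denom else 0.0


def judge_program(per_input_scores, threshold=0.5):
    """Hypothetical aggregation: mean conformance score against a threshold."""
    return int(sum(per_input_scores) / len(per_input_scores) >= threshold)


# Each inner list holds conformance scores for one candidate program's inputs.
candidate_scores = [[1.0, 1.0, 0.9], [0.6, 0.2, 0.4], [0.1, 0.0, 0.2], [0.3, 0.2, 0.1]]
ground_truth = [1, 1, 0, 0]  # 1 = program is actually correct
predicted = [judge_program(scores) for scores in candidate_scores]
mcc = matthews_corrcoef(ground_truth, predicted)
print(predicted, round(mcc, 4))
```

Unlike plain accuracy, the coefficient stays informative when correct and incorrect programs are imbalanced, which is a common reason to report it for correctness classifiers.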

2605.29816 2026-05-29 cs.AI 版本更新

Harnessing non-adversarial robustness in large language models

利用大语言模型中的非对抗鲁棒性

Qinghua Zhou, Ellina Aleshina, Andrey Lovyagin, Oleg Somov, Mikhail Seleznyov, Alexander Panchenko, Ivan Oseledets, Elena Tutubalina, Ivan Y. Tyukin

发表机构 * Applied AI Institute, Moscow, Russia(莫斯科应用人工智能研究所) King's College London, London, UK(伦敦国王学院) International Joint Laboratory of AI for Industry, QUST, Qingdao, China(工业人工智能联合实验室)

AI总结 本文通过理论分析和实验,提出了一种基于去偏的微调方法,以提升大语言模型对语义相似但文本不同的提示的鲁棒性,并提供了认证保证。

详情
AI中文摘要

本文提出了一种方法来解决大语言模型(LLMs)对由语义相似但文本不同的提示引起的改变和潜在错误的鲁棒性挑战。最近的研究表明,这类提示变化会显著影响LLMs在任务上的性能。核心问题是:能否在不重新训练整个模型的情况下,获得LLMs对语义中性提示变化的鲁棒性?我们通过理论和实验来探讨这个问题。我们的理论分析揭示了一个影响模型鲁棒性的关键因素——神经网络模块输出中的系统性预期偏移或扰动引起的偏差。受此分析启发,我们表明可以通过一个简单的微调过程实现鲁棒性:为鲁棒性进行去偏。我们确定了去偏有帮助和没有帮助的条件,并通过理论和大量实验证明,为鲁棒性进行去偏确实可以成为一种快速有效的工具,以增强鲁棒性并提供对随机提示扰动的认证。

英文摘要

The work presents an approach for addressing the challenge of robustness in Large Language Models (LLMs) to alterations and potential errors caused by semantically similar but textually different prompts. Recent works have shown that these kinds of prompt variations can significantly impact the performance of LLMs on tasks. The central question is: can LLMs' robustness to semantically-neutral prompt alterations be acquired without expensive retraining of the entire model? We address this question both theoretically and through experiments. Our theoretical analysis reveals a crucial factor impacting model robustness - a systematic expected shift or perturbation-induced bias in neural network module outputs. Motivated by this analysis, we show that robustness can be achieved via a simple fine-tuning process: debiasing for robustness. We identify conditions when debiasing helps and when it does not, and demonstrate, through both theory and extensive experiments, that debiasing for robustness may indeed be a quick and efficient tool to enhance robustness and provide certification against random prompt perturbations.

2605.29815 2026-05-29 cs.AI cs.CL 版本更新

PRAIB: Peer Review AI Benchmark of Behaviour of LLM-Assisted Reviewing

PRAIB: 大语言模型辅助审稿行为的同行评审AI基准

Krzysztof Żurawicki, Julia Farganus, Arkadiusz Gaweł, Mateusz Bystroński, Tomasz Jan Kajdanowicz

发表机构 * Department of Artificial Intelligence(人工智能系)

AI总结 提出PRAIB框架,通过定义审稿特异性、风格和参与行为的指标,并基于11000条机器生成审稿与人类审稿的对比实验,揭示LLM审稿在评分、交叉引用和弱点识别方面与人类审稿的系统性差异。

详情
AI中文摘要

提交论文数量的增长促使人们探索利用大型语言模型(LLMs)来支持和增强同行评审过程,特别是在提高其速度和可扩展性方面。然而,目前尚不清楚LLMs是否以与人类审稿人相同的方式处理科学稿件,还是仅仅生成看起来像审稿的文本。为了解决这个问题,我们引入了同行评审AI基准(PRAIB),这是一个新颖的框架,包含精确定义的指标,用于衡量审稿的特异性、风格和参与行为。为补充PRAIB框架,我们进行了一项大规模实证研究,利用一个包含由五个专有和开源模型为1000篇ICLR和NeurIPS论文生成的11000条审稿的数据集。这些机器生成的审稿跨越2021-2025年,与原始人类反馈在不同提示策略下进行比较,以识别系统性的行为差异。我们的分析表明,生成的审稿与人类审稿人提供的反馈存在显著差异:LLM评分变异性较小、存在正向偏差且过度自信,其交叉引用模式依赖于模型且与人类规范不同。此外,通过PRAIB评估,我们观察到LLMs倾向于生成更长、更复杂的审稿,但经常忽略人类审稿人指出的原子性弱点。通过描述LLM审稿行为在哪些方面以及如何偏离人类规范,PRAIB为社区提供了一个诊断工具,用于识别LLMs目前可以可靠支持审稿过程的哪些方面,以及在部署前哪些方面需要进一步发展。

英文摘要

The growing number of submitted papers has motivated the exploration of Large Language Models (LLMs) as a means to support and augment the peer review process, particularly in terms of improving its speed and scalability. Yet, it remains unknown whether LLMs engage with scientific manuscripts in the same manner as human reviewers, or whether they merely produce review-looking text. To address this, we introduce the Peer Review AI Benchmark (PRAIB), a novel framework comprising thoroughly defined metrics that measure review specificity, style, and behavior of engagement. To complement the PRAIB framework, we conduct a large-scale empirical study leveraging a dataset of 11,000 reviews generated by five proprietary and open-source models for 1,000 ICLR and NeurIPS papers. Spanning the 2021-2025 period, these machine-generated reviews are compared against original human feedback across diverse prompting strategies to identify systematic behavioral divergences. Our analysis reveals that the generated reviews diverge significantly from feedback provided by human reviewers: LLM ratings are less variable, positively biased, and overconfident, and their cross-reference patterns are model-dependent and distinct from human norms. Furthermore, when evaluated through PRAIB, we observe that LLMs tend to generate longer, more complex reviews, yet frequently overlook the atomic weaknesses noted by human reviewers. By characterizing where and how LLM reviewing behavior departs from human norms, PRAIB provides the community with a diagnostic tool for identifying which aspects of the review process LLMs can reliably support today and which require further development before deployment.

2605.29807 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Data filtering methods for training language models

训练语言模型的数据过滤方法

Egor Shevchenko, Elena Bruches

发表机构 * Novosibirsk State University(新西伯利亚国立大学) A. P. Ershov Institute of Informatics Systems SB RAS(A. P. Ershov 信息系统研究所)

AI总结 本文比较了Confident Learning和Dataset Cartography两种自动标签错误检测方法在俄语文本分类任务中的效果,发现其有效性依赖于数据集特性,在小规模高噪声数据集上Confident Learning显著提升F1-macro。

Comments AINL-2026

详情
AI中文摘要

数据质量是机器学习模型有效性的关键因素。即使广泛使用的基准数据集中也存在标签错误,这些错误会引入训练数据噪声并降低模型泛化能力。在本工作中,我们对两种自动标签错误检测方法——Confident Learning和Dataset Cartography——在三个俄语文本分类语料库上进行了比较分析,这些语料库在规模、类别数量和领域上各不相同:ru_emotion_e-culture(49,123个样本,情感分类)、RuCoLA(8,524个样本,语言可接受性)和TERRa(2,337个样本,文本蕴含识别)。我们使用在每个语料库上微调的预训练rubert-base-cased模型。为了验证过滤的意义,我们进行了控制实验,随机移除等量样本。结果表明,两种方法的有效性强烈依赖于数据集特征:在噪声水平低的大规模语料库上,过滤并未提升性能,而在噪声高的小规模数据集上,Confident Learning实现了显著的F1-macro提升。Dataset Cartography表现出更保守的行为,移除的样本更少。在所有语料库中,两种方法的目标性移除均优于随机移除,证实了这些方法的意义。

英文摘要

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
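The core of Confident Learning can be sketched in a few lines. This is a simplified illustration, not the paper's pipeline: it assumes out-of-sample predicted probabilities are already available, that every class has at least one labeled example, and it omits the joint-distribution estimation and pruning options of the full method. Function names and the toy data are hypothetical. Each class gets a threshold equal to the mean predicted probability of that class over the examples labeled with it; an example is flagged when its most probable above-threshold class disagrees with its given label.

```python
def class_thresholds(probs, labels, num_classes):
    """Per-class threshold: mean self-confidence of examples labeled with that class."""
    thresholds = []
    for c in range(num_classes):
        members = [p[c] for p, y in zip(probs, labels) if y == c]
        # Assumes every class has at least one labeled example.
        thresholds.append(sum(members) / len(members))
    return thresholds


def suspected_label_errors(probs, labels, num_classes):
    """Indices whose confident predicted class disagrees with the given label."""
    thresholds = class_thresholds(probs, labels, num_classes)
    flagged = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        confident = [c for c in range(num_classes) if p[c] >= thresholds[c]]
        if confident:
            guess = max(confident, key=lambda c: p[c])
            if guess != y:
                flagged.append(i)
    return flagged


# Toy binary task; the last example looks like class 0 but is labeled 1.
probs = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8], [0.95, 0.05]]
labels = [0, 0, 1, 1, 1]
flagged = suspected_label_errors(probs, labels, num_classes=2)
print(flagged)
```

A control of the kind the abstract describes would remove the same number of randomly chosen examples and compare downstream F1-macro against removal of the flagged ones.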

2605.29801 2026-05-29 cs.AI cs.CL cs.CR cs.CV cs.LG 版本更新

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

AgentDoG 1.5:一种轻量级且可扩展的AI智能体安全与安保对齐框架

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 针对开放世界智能体的新兴安全风险,提出一种轻量级可扩展的安全对齐框架,通过更新安全分类法、构建数据引擎并训练小模型(0.8B-8B参数),实现与闭源模型相当的性能,并降低部署开销两个数量级。

Comments 44 pages, 12 Figures, 9 Tables

详情
AI中文摘要

现代开放世界智能体(如OpenClaw)展现出强大的跨环境执行能力,但同时也引入了广泛的新安全风险源。同时,先进的前沿AI模型大幅降低了攻击门槛,使得当前的智能体对齐框架不足以应对实际部署。为了应对这些新兴威胁,我们提出了一种轻量级且可扩展的智能体安全对齐框架。具体而言,我们更新了智能体安全分类法,以涵盖来自Codex和OpenClaw执行场景的新兴风险。我们进一步构建了一个基于分类法指导的数据引擎,并采用影响函数净化,仅使用约1k样本训练轻量级AgentDoG 1.5变体(0.8B、2B、4B和8B参数),达到了与领先闭源模型(如GPT-5.4)相当的性能。基于AgentDoG 1.5,我们构建了一个高效的智能体安全SFT和RL训练环境,将Docker级环境的部署开销降低了两个数量级。最后,我们将AgentDoG 1.5部署为无需训练的在线护栏,用于实时安全审核。大量实验结果表明,AgentDoG 1.5在多样且复杂的交互式智能体场景中达到了最先进的性能。所有模型和数据集均已公开发布。

英文摘要

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2605.29795 2026-05-29 cs.AI 版本更新

MEMENTO: Leveraging Web as a Learning Signal for Low-Data Domains

MEMENTO: 利用网络作为低数据领域的学习信号

Ashutosh Ojha, Vinay Aggarwal, Ashutosh Srivastava, Siddharth Yedlapati, Yaman K Singla, Jitendra Ajmera

发表机构 * Adobe, Media & Data Science Research Lab(Adobe媒体与数据科学研究实验室)

AI总结 提出MEMENTO框架,通过自适应探索树和双通道记忆将网络作为学习信号,在低数据专业领域(销售自动化和法律研究)中显著提升性能。

详情
AI中文摘要

现实世界的任务通常缺乏大规模标注数据集,这激发了在低数据场景下学习的广泛研究。然而,现有方法如少样本提示、指令调优和合成数据生成,仍将标注或伪标注数据作为主要学习信号。相比之下,人类从业者通过反复、自主地与开放网络交互来获取专业知识,逐步完善领域知识和搜索策略。我们提出MEMENTO,一个将网络视为学习信号而非无状态检索接口的框架。MEMENTO在两个层面运作:在每个会话内,它通过自适应探索树(AET)进行迭代式网络探索,将任务分解为演化中的问题并反思中间发现;跨会话间,它通过双通道记忆积累经验,将陈述性知识(事实)与程序性知识(搜索策略)分离。这种设计使智能体能够从网络交互轨迹中学习可重用的研究策略和领域专业知识,而无需额外的模型训练。我们在两个低数据专业领域(销售自动化和法律研究)上评估MEMENTO。实验结果显示,与基于ReAct的基线相比,性能持续提升(销售自动化+25.6%,法律研究+36.5%),表明网络可以作为在数据稀缺场景下获取任务特定专业知识的可扩展学习源。

英文摘要

Real-world tasks often lack large labeled datasets, motivating extensive work on learning in low-data regimes. However, existing approaches such as few-shot prompting, instruction tuning, and synthetic data generation continue to treat labeled or pseudo-labeled data as the primary learning signal. In contrast, human practitioners acquire expertise through repeated, self-directed interaction with the open web, progressively refining both domain knowledge and search strategies. We propose MEMENTO, a framework that treats the web as a learning signal rather than a stateless retrieval interface. MEMENTO operates at two levels: within each session, it conducts iterative web exploration via an Adaptive Exploration Tree (AET) that decomposes tasks into evolving questions and reflects on intermediate findings; across sessions, it accumulates experience through dual-channel memory, separating declarative knowledge (facts) from procedural knowledge (search strategies). This design enables agents to learn reusable research strategies and domain expertise from trajectories of web interaction without additional model training. We evaluate MEMENTO on two low-data professional domains: sales automation and legal research. Our empirical results show consistent improvements in performance over ReAct-based baselines (+25.6% on sales automation and +36.5% on legal research), demonstrating that the web can serve as a scalable learning source for acquiring task-specific expertise in data-scarce settings.

2605.29794 2026-05-29 cs.AI 版本更新

SkillsInjector: Dynamic Skill Context Construction for LLM Agents

SkillsInjector: 面向LLM智能体的动态技能上下文构建

Yanchao Li, Wanhao Liu, Ben Gao, Jiaqing Xie, Zhehong Ai, Na Zou, Yuqiang Li, Tianfan Fu

发表机构 * Nanjing University(南京大学) Shanghai AI Lab(上海人工智能实验室)

AI总结 针对静态技能注入导致性能下降的问题,提出SkillsInjector两阶段自适应方法,通过上下文规划器学习技能偏好并自适应预算,结合集合感知渲染器优化描述呈现,在三个基准上分别提升3.9、6.1和7.3个百分点。

详情
AI中文摘要

LLM智能体现在依赖不断增长的技能库来处理复杂任务。然而,注入更多技能并不总能提高任务完成度,甚至可能降低性能。现有方法仍将技能注入视为静态步骤,使用固定标准选择技能,预先设定预算,并保持描述不变。我们认为这种静态处理会削弱技能的效用,因为暴露哪些技能、包含多少技能以及如何呈现它们都会影响下游性能。我们提出SkillsInjector,一种两阶段自适应方法,共同解决这些决策。首先,上下文规划器学习基于执行的技能偏好,并为每个任务自适应地确定技能数量。然后,集合感知渲染器根据共注入的邻居定制所选描述的呈现方式。在tau2-bench、SkillsBench和ALFWorld上,SkillsInjector取得了最高分数,分别比最强基线提高了3.9、6.1和7.3个百分点。消融研究表明,技能选择、自适应预算和集合感知渲染各自对性能提升有贡献。这些结果表明,技能增强型智能体受益于优化注入的上下文本身。代码将在发表后发布。

英文摘要

LLM agents now draw on growing skill libraries to handle complex tasks. However, injecting more skills does not always improve task completion and can even degrade it. Existing methods still treat skill injection as a static step, selecting skills with fixed criteria, fixing the budget in advance, and leaving descriptions unchanged. We argue that this static treatment can undermine the utility of skills, because which skills are exposed, how many are included, and how they are presented all affect downstream performance. We propose SkillsInjector, a two-stage adaptive method that jointly addresses these decisions. First, a context planner learns execution-grounded skill preferences and admits an adaptive number of skills for each task. A set-aware renderer then tailors how selected descriptions are presented relative to their co-injected neighbors. Across tau2-bench, SkillsBench, and ALFWorld, SkillsInjector achieves the highest score, improving over the strongest baseline by 3.9, 6.1, and 7.3 percentage points, respectively. Ablation studies show that skill selection, adaptive budgeting, and set-aware rendering each contribute to the gain. These results show that skill-augmented agents benefit from optimizing the injected context itself. Code will be released upon publication.

2605.29790 2026-05-29 cs.MA cs.AI 版本更新

Evolve as a Team: Collaborative Self-Evolution for LLM-based Multi-Agent Systems

像团队一样进化:基于LLM的多智能体系统的协作自我进化

Zhezheng Hao, Tianfu Wang, Huanshuo Dong, Ziyan Liu, Hong Wang, Xiankun Lin, Qiang Lin, Can Wang, Hande Dong, Jiawei Chen

发表机构 * Zhejiang University(浙江大学) Hong Kong University of Science and Technology(香港科技大学) Tencent(腾讯)

AI总结 提出Meta-Team框架,通过协作自我进化机制,基于执行经验改进多智能体系统的行为、协调和团队组织,在长周期任务中显著优于单智能体、手工MAS及先前进化方法。

详情
AI中文摘要

基于LLM的多智能体系统(MAS)已成为处理复杂和长周期任务的有效范式。然而,在实际任务中,MAS在执行过程中经常出现各种故障,且这些故障在设计阶段难以消除。这激发了经验驱动的MAS进化,即系统根据自身执行经验进行改进。然而,这种进化具有挑战性,因为MAS经验漫长而复杂,交织着多个智能体的执行链和通信消息,使得难以识别需要改进的内容。为应对这一挑战,我们提出了Meta-Team,一种基于协作自我进化的经验驱动MAS进化框架。Meta-Team保留每个智能体的执行上下文并协调任务后通信,使智能体能够交换分布式证据以进行进化。基于此设计,Meta-Team进行多尺度自我进化,将执行经验转化为对智能体行为、智能体间协调以及团队级组织的可复用改进。在六个长周期智能体基准测试中,Meta-Team始终优于单智能体系统、手工MAS和先前的MAS进化方法;进一步分析表明,Meta-Team实现了更可靠和可扩展的MAS自我进化。

英文摘要

LLM-based multi-agent systems (MAS) have emerged as an effective paradigm for complex and long-horizon tasks. However, in real-world tasks, MAS often exhibit various failures during execution and such failures are difficult to eliminate during design. This motivates experience-driven MAS evolution, where a system improves based on its own execution experience. Yet such evolution is challenging because MAS experience is prolonged and intricate, interleaving multiple agents' execution chains and communication messages, which makes it difficult to identify what should be improved. To address this challenge, we propose Meta-Team, an experience-driven MAS evolution framework based on collaborative self-evolution. Meta-Team preserves the execution context of each agent and coordinates post-task communication, enabling agents to exchange distributed evidence for evolution. Building on this design, Meta-Team conducts multi-scale self-evolution, transforming execution experience into reusable improvements to agent behaviors, inter-agent coordination, and team-level organization. Across six long-horizon agent benchmarks, Meta-Team consistently outperforms single-agent systems, hand-crafted MAS, and prior MAS evolution methods; further analyses demonstrate that Meta-Team enables more reliable and scalable MAS self-evolution.

2605.29788 2026-05-29 cs.AI cs.LG 版本更新

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

嵌套因果赌博机的认证策略优化:基于PAC-Bayes风险

Tim Woydt, Paul-David Zuercher

发表机构 * ProdAxon

AI总结 本文提出嵌套因果汤普森采样(NCTS)算法,通过PAC-Bayes超额风险界对历史数据进行离线、任意时刻的部署策略认证,解决分层因果赌博机中的跨时间尺度因果耦合问题。

详情
AI中文摘要

关键序列决策很少是单时间尺度的:一个战略决策因果地塑造了每个后续战术选择所处的环境;标准赌博机和强化学习理论并未捕捉时间尺度之间的这种因果耦合。我们将问题类别形式化为嵌套上下文因果赌博机(NCCBs),这是一个分层SCM,其中每个层次的动作设置下一层次的上下文分布,并提出了嵌套因果汤普森采样(NCTS),该算法每轮抽取一个机制因子化的信念,并在其下递归地行动。我们的主要理论结果是一个因果PAC-Bayesian超额风险界,它仅从历史数据中认证任何候选部署策略,离线且任意时刻,回答了部署问题:我们能否在此处信任该智能体,风险如何?在分层SCM上的实验表明,相对于同一函数类上的匹配RFF-GP联合回归,因子化的SCM机制后验在外生分布偏移下零样本迁移显著更好,递归的元到内层提交在分布上显著优于联合提交替代方案,并且随着离线数据积累,认证显著收缩。结合这些结果,我们建立了渐进式认证交接,一种安全部署方法:每个时间尺度在收益可被认证时从传统控制器切换到NCTS,独立于其他时间尺度。

英文摘要

Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.

2605.29786 2026-05-29 cs.AI 版本更新

Croissant Tasks: A Metadata Format for Reproducible Machine Learning Evaluations

Croissant Tasks:一种用于可重复机器学习评估的元数据格式

Omar Benjelloun, Leonardo Martins Bianco, Isabelle Guyon, Thanh Gia Hieu Khuong, Jonathan Lebensold, Sebastian Lobentanzer, Luis Oala, Benedictus Kent Rachmat, Ihsan Ullah, Peyman Vahidi, Joaquin Vanschoren

发表机构 * Google DeepMind(谷歌DeepMind) ChaLearn Université Paris-Saclay(巴黎-萨克雷大学) Jetty Mila, Quebec AI Institute(魁北克AI研究所) Inst. of Computational Biology, Helmholtz Munich(亥姆霍兹慕尼黑计算生物学研究所) German Center for Diabetes Research(德国糖尿病研究中心) School of CIT, TUM(慕尼黑工业大学CIT学院) Helmholtz AI(亥姆霍兹人工智能) Brickroad Eindhoven Univ. of Technology(埃因霍温理工大学)

AI总结 提出Croissant Tasks元数据格式,通过声明式规范解耦任务问题与解决方案,结合自动化LLM管道实现概念可重复性,使自主代理能从零生成可复现的评估流水线。

Comments 10 pages, 4 figures

详情
AI中文摘要

可重复性是科学方法的基础,但在机器学习中仍然是一个关键挑战。导致这一问题的因素包括不明确的执行细节和脆弱的软件环境。以人为中心的补救措施(如检查清单和手动验证)有所帮助,但需要大量精力且难以扩展。为了解决这个问题,我们引入了Croissant Tasks:一种声明式的、机器可操作的元数据格式,将低层实现细节抽象为高层规范。这种格式实现了概念可重复性:通过独立的、由代理生成的实现来验证声明,而不是脆弱的源代码复制。我们贡献了:(1) Croissant Tasks规范,正式将任务问题与解决方案解耦;(2) 一个自动化的LLM流水线,将现有基准测试改造为此格式;(3) 实证验证表明,自主代理可以摄取这些规范,从零开始生成功能准确的可重复流水线。我们设想这种格式将成为机器学习中自动化和概念可重复性的新基础。

英文摘要

Reproducibility is fundamental to the scientific method, yet remains a critical challenge in machine learning. Contributing factors include underspecified execution details and brittle software environments. Human-centric remedies, such as checklists and manual verification, help but require intensive effort and fail to scale. To address this, we introduce Croissant Tasks: a declarative, machine-actionable metadata format that abstracts low-level implementation details into high-level specifications. This format enables conceptual reproducibility: verifying claims via independent, agent-generated implementations rather than brittle source code replication. We contribute: (1) the Croissant Tasks specification, formally decoupling task problem from solution; (2) an automated LLM pipeline that retrofits existing benchmarks into this format; and (3) empirical validation showing autonomous agents can ingest these specifications to generate functional, accurate reproduction pipelines from scratch. We envision this format as a new foundation for automated and conceptual reproducibility in machine learning.

2605.29782 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Hista 和 Numca:为 LLM 强化学习有效估计状态值

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Huawei Technologies Ltd(华为技术有限公司)

AI总结 针对 LLM 强化学习中状态值估计不准确的问题,提出 Numca(利用数值跨度作为可分级里程碑)和 Hista(利用隐藏状态加权平均不连续轨迹及其回报)两种方法,显著提升估计精度和训练性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

强化学习(RL)通过奖励信号直接优化模型行为来改进大型语言模型(LLMs)。虽然在经典RL中准确的状态值估计对于稳定训练至关重要,但在LLM后训练中这仍是一个未被充分探索的挑战。在这项工作中,我们引入了状态值估计基准(SVEB)来评估现有RL框架中的状态估计,并展示了像PPO这样的标准方法中的评论家会退化为粗糙的组平均基线。为了解决这个问题,我们提出了两种技术:Numca,它利用数值跨度作为可分级里程碑进行状态值估计;以及Hista,一个使用LLM的隐藏状态作为表示来加权平均不连续轨迹及其回报的框架。大量实验表明,这两种方法都能产生更准确的状态值估计,并在不同的RL算法和模型大小上提升训练性能,而不会产生显著的计算开销。

英文摘要

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.

2605.29773 2026-05-29 cs.CV cs.AI cs.RO 版本更新

Energy-Aware NECO for Single-Pass Pixel-wise Out-of-Distribution Detection in Semantic Segmentation

能量感知NECO:用于语义分割中单次逐像素分布外检测

Boyuan Zhang, Huanshan Huang, Yifei Cao

发表机构 * Ecole Polytechnique, Institut Polytechnique de Paris(巴黎综合理工学院,巴黎理工学院) CIAD, UTBM, Université Marie et Louis Pasteur(CIAD、UTBM、玛丽和路易·巴斯德大学) U2IS, ENSTA, Institut Polytechnique de Paris(U2IS、ENSTA、巴黎理工学院)

AI总结 提出一种结合NECO几何比率和能量分数的混合方法,实现单次前向传播的逐像素分布外检测,在miniMUAD数据集上AUROC达0.8539,优于单独使用NECO或能量分数。

Comments 7 pages, 6 figures. Accepted at the ICRA 2026 Workshop on Long-term Deployments in the Wild (LoWi 2026)

详情
AI中文摘要

移动机器人的可靠语义分割需要准确的密集预测和分布偏移下的鲁棒不确定性估计。强不确定性基线如蒙特卡洛Dropout通常需要重复的随机前向传播,难以在边缘平台上部署。我们提出能量感知NECO,一种用于语义分割的单次逐像素分布外(OOD)检测器。该方法将从解码器特征计算的居中NECO风格几何比率与基于logit的能量分数相结合。两个分量均使用在纯分布内验证集上拟合的统计量进行标准化,并通过凸组合融合。我们在miniMUAD子集上使用真实像素级OOD标签评估该方法。所提出的混合分数达到0.8539的AUROC,优于仅NECO(0.8280)、仅能量(0.8171)和集成预测熵基线(0.8124)。额外的定性和操作点分析表明,混合检测器在保持单次设计效率优势的同时,提高了整体排名性能。代码可在https://github.com/boyuan-zhangx/Energy-Aware_NECO获取。

英文摘要

Reliable semantic segmentation for mobile robots requires both accurate dense prediction and robust uncertainty estimation under distribution shift. Strong uncertainty baselines such as Monte Carlo Dropout often require repeated stochastic forward passes and are difficult to deploy on edge platforms. We propose Energy-Aware NECO, a single-pass pixel-wise out-of-distribution (OOD) detector for semantic segmentation. The method combines a centered NECO-style geometric ratio computed from decoder features with a logit-based Energy score. Both components are standardized using statistics fitted on a pure in-distribution validation split and fused through a convex combination. We evaluate the method on the miniMUAD subset using true pixel-level OOD labels. The proposed hybrid score achieves an AUROC of 0.8539, outperforming NECO-only (0.8280), Energy-only (0.8171), and an ensemble predictive-entropy baseline (0.8124). Additional qualitative and operating-point analyses show that the hybrid detector improves overall ranking performance while preserving the efficiency advantages of a single-pass design. Code is available at https://github.com/boyuan-zhangx/Energy-Aware_NECO
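
The fusion step described in the abstract above (standardize each score with in-distribution statistics, then take a convex combination) can be sketched as follows. This is a minimal illustration and not the authors' code: the function names, the mixing weight `alpha`, and the convention that larger values mean "more OOD" are all assumptions made here.

```python
from statistics import mean, pstdev


def fit_stats(values):
    """Fit (mean, std) on scores from an in-distribution validation split."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        sigma = 1.0  # avoid division by zero for constant scores
    return (mu, sigma)


def standardize(x, stats):
    """Z-score a raw score using previously fitted statistics."""
    mu, sigma = stats
    return (x - mu) / sigma


def hybrid_score(neco, energy, neco_stats, energy_stats, alpha=0.5):
    """Convex combination of two standardized per-pixel scores.

    alpha is a hypothetical mixing weight in [0, 1]; the abstract does not
    state the value used in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    z_neco = standardize(neco, neco_stats)
    z_energy = standardize(energy, energy_stats)
    return alpha * z_neco + (1.0 - alpha) * z_energy
```

Standardizing first puts both components on a comparable scale, so the weight controls their relative influence rather than compensating for different units.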

2605.29768 2026-05-29 cs.AI 版本更新

From XXLTraffic to EvoXXLTraffic: Scaling Traffic Forecasting to Sensor-Evolving Networks

从XXLTraffic到EvoXXLTraffic:将交通预测扩展到传感器演化的网络

Du Yin, Hao Xue, Arian Prabowo, Shuang Ao, Flora Salim

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 针对现有交通预测基准假设固定传感器集的问题,提出包含长达27年数据的XXLTraffic数据集及其传感器演化版本EvoXXLTraffic,定义年度流式预测协议,并评估多种基线方法,发现超大规模演化数据集更贴近现实且许多现有SOTA方法失效。

Comments Under Review

详情
AI中文摘要

现有的交通预测基准假设固定的传感器集,但实际道路传感器网络随着道路网络逐年变化而持续增长。我们引入了XXLTraffic数据集系列,涵盖长达27年的加州PeMS和新南威尔士州交通数据。XXLTraffic的固定传感器子集支持多年间隔的极长周期预测以及标准的每小时/每日长时预测。我们将其扩展为EvoXXLTraffic,这是一个传感器演化的重组版本,暴露了每年活跃的传感器、年度交通流矩阵以及九个PeMS区域的年度图快照,增长率从+305%到超过+10,000%。我们在EvoXXLTraffic上定义了一个年度流式预测协议,其中每个日历年是一个持续任务,并评估了来自静态时空GNN、朴素在线方案、演化图持续方法以及检索/测试时方法的各种代表性基线。我们发现,我们的超大规模演化数据集更好地反映了现实世界,许多最先进(SOTA)结果不再有效。我们的数据集通过支持在超长演化道路网络下更现实的预测,补充了现有的基准。

英文摘要

Existing traffic forecasting benchmarks assume a fixed sensor set, but real road-sensor networks grow continuously as the road network changes year by year. We introduce the XXLTraffic dataset family, which spans up to 27 years of California PeMS and Transport for NSW data. The fixed-sensor subsets of XXLTraffic support extremely long forecasting with multi-year gaps and standard hourly / daily long-horizon forecasting. We extend it to EvoXXLTraffic, a sensor-evolving reorganization that exposes per-year active sensors, yearly traffic-flow matrices, and yearly graph snapshots across nine PeMS districts, with growth ratios ranging from +305% to over +10,000%. We define a yearly streaming forecasting protocol on EvoXXLTraffic in which each calendar year is a continual task, and benchmark a wide range of representative baselines drawn from static spatio-temporal GNNs, naïve online schemes, evolving-graph continual methods, and retrieval / test-time methods. We find that our ultra-large evolutionary dataset better reflects the real world, and that many state-of-the-art (SOTA) results no longer hold on it. Our dataset complements existing benchmarks by enabling more realistic forecasting under ultra-long evolutionary road networks.

2605.29756 2026-05-29 cs.AI 版本更新

LFQ: Logit-aware Final-block Quantization for Boosting the Generation Quality of Low-Bit Quantized LLMs

LFQ:面向提升低比特量化LLM生成质量的Logit感知最终块量化

Jung Hyun Lee, June Yong Yang, Jungwook Choi, Eunho Yang

发表机构 * Kim Jaechul Graduate School of AI, KAIST, Daejeon, South Korea(韩国科学技术院人工智能研究生院) LG AI Research, Seoul, South Korea(LG人工智能研究院) Hanyang University, Seoul, South Korea(汉阳大学)

AI总结 针对低比特量化LLM在生成任务中质量下降的问题,提出通过最小化FP模型与量化模型在最终Transformer块上的logits交叉熵来优化量化,从而提升复杂生成任务的准确性。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着大语言模型规模的持续扩大,低比特权重的训练后量化(PTQ)为其内存高效部署提供了实用解决方案。尽管分块PTQ在基本语言建模和理解任务上能够匹配全精度(FP)基线,但其在生成任务(尤其是长响应和扩展思维链,这对提升任务准确性至关重要)上的质量有所下降。我们将这一不足归因于两个因素:(i) 分块优化中忽略了反嵌入层(LM头),以及(ii) 对均方误差(MSE)目标的依赖。这两个因素导致量化模型的令牌概率分布与FP模型不一致,从而在文本生成基准上产生显著的准确性下降。为纠正这一偏差,我们引入了Logit感知最终块量化(LFQ),这是对分块PTQ的一种简单而有效的增强,通过最小化FP模型与其量化对应模型在logits上的交叉熵来量化最终Transformer块。通过在最终块中在logit级别对齐令牌概率,LFQ在不同模型家族中持续提升了复杂生成任务的准确性,优于最先进的分块PTQ,同时在语言建模和理解任务上保持与FP基线相当的性能。

英文摘要

As large language models continue to scale, low-bit weight-only post-training quantization (PTQ) offers a practical solution to their memory-efficient deployment. Although block-wise PTQ is capable of matching the full-precision (FP) baseline on basic language modeling and understanding, its quality is degraded for generative tasks -- especially for longer responses and extended chains of thought, which are critical in boosting task accuracy. We attribute this shortfall to two factors: (i) the omission of the unembedding layer (the LM head) in block-wise optimization and (ii) the reliance on the mean squared error (MSE) objective. Both factors cause the token probability distribution of the quantized model to misalign with that of the FP model, yielding notable accuracy drops on text generation benchmarks. To rectify the discrepancy, we introduce Logit-aware Final-block Quantization (LFQ), a simple yet effective enhancement to block-wise PTQ that quantizes the final Transformer block by minimizing the cross-entropy between the logits of the FP model and those of its quantized counterpart. By aligning token probabilities at the logit level in the final block, LFQ consistently improves the accuracy of complex generation tasks over state-of-the-art block-wise PTQ across diverse model families, while maintaining parity with FP baselines on language modeling and understanding.
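
The objective named in the abstract above, a cross-entropy between the FP model's logits and the quantized model's logits, can be illustrated for a single token position. This is only a sketch of the generic loss, assuming the FP logits are turned into a soft target distribution via softmax; the function names are hypothetical, and the paper's actual optimization over the final Transformer block is not reproduced here.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax via the log-sum-exp trick."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def logit_cross_entropy(fp_logits, q_logits):
    """Cross-entropy H(p_fp, p_q) for one token position.

    p_fp is the FP model's token distribution (used as a soft target) and
    p_q is the quantized model's distribution.
    """
    if len(fp_logits) != len(q_logits):
        raise ValueError("logit vectors must have the same vocabulary size")
    p = softmax(fp_logits)
    log_q = log_softmax(q_logits)
    return -sum(pi * lqi for pi, lqi in zip(p, log_q))
```

Unlike an MSE on hidden activations, this loss is measured after the LM head, so it penalizes exactly the mismatch in token probabilities that the abstract identifies as the cause of degraded generation.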

2605.29754 2026-05-29 cs.AI 版本更新

Benchmarking Positional Encoding Strategies for Transformer-Based EEG Foundation Models

基于Transformer的脑电图基础模型位置编码策略基准测试

Ayse Betul Yuce, Sebastian Stober

发表机构 * Department of Computer Science, Otto von Guericke University(奥托·冯·格里克大学计算机科学系)

AI总结 本研究在CBraMod骨干网络中基准测试五种位置编码策略,通过线性探测和微调协议评估运动想象分类和情感识别任务,发现最优策略具有任务依赖性。

详情
AI中文摘要

脑电图(EEG)是一种广泛使用的非侵入性技术,用于测量脑机接口(BCI)应用中的大脑活动。监督式EEG解码模型通常难以跨任务、受试者和数据集泛化,这促使了基于Transformer的EEG基础模型通过自监督学习进行训练。由于Transformer是排列不变的,它们需要显式的位置信息。与文本标记不同,EEG电极在头皮上空间分布,这引发了如何在基于Transformer的EEG模型中编码电极位置的问题。在本研究中,我们在CBraMod骨干网络中基准测试了五种位置编码策略,并在运动想象分类和情感识别任务上通过线性探测和微调协议进行评估。我们的结果表明,没有单一策略能在所有任务中持续表现优异。球形位置编码(SPE)为运动想象生成了强大的表示,但在情感识别上表现不佳,而非对称条件位置编码(ACPE)在任务间表现更为一致。这些发现表明,最优位置编码策略具有任务依赖性,在EEG解码场景中没有通用解决方案。

英文摘要

Electroencephalography (EEG) is a widely used non-invasive technique for measuring brain activity in brain-computer interface (BCI) applications. Supervised EEG decoding models often struggle to generalize across tasks, subjects, and datasets, motivating transformer-based EEG foundation models trained with self-supervised learning. Since transformers are permutation-invariant, they require explicit positional information. Unlike textual tokens, EEG electrodes are spatially distributed across the scalp, raising the question of how electrode positions should be encoded in transformer-based EEG models. In this study, we benchmark five positional encoding strategies within the CBraMod backbone and evaluate them under linear probing and fine-tuning protocols on motor imagery classification and emotion recognition. Our results show that no single strategy consistently outperforms the others across tasks. Spherical Positional Encoding (SPE) yields strong representations for motor imagery but underperforms on emotion recognition, while Asymmetric Conditional Positional Encoding (ACPE) demonstrates more consistent performance across tasks. These findings suggest that the optimal positional encoding strategy is task-dependent, with no universal solution across EEG decoding scenarios.

2605.29753 2026-05-29 eess.IV cs.AI 版本更新

A unified deeplearning framework for contrast-phase-specific virtual monochromatic imaging

一种用于对比相位特异性虚拟单色成像的统一深度学习框架

Antony Jerald, Hemant K Aggarwal, Brian Nett, Avinash Gopal, Phaneendra K Yalavarthy, Bipul Das, Rajesh Langoju

发表机构 * Science and Technology Organization, GE HealthCare(科技组织,GE医疗)

AI总结 提出一种统一深度学习框架,利用对比相位先验信息从单能CT数据合成对比相位特异性虚拟单色50 keV图像,通过新型先验条件架构实现能量转换,并在四个对比相位上验证了其对比增强和泛化能力。

详情
Journal ref
SPIE Medical Imaging 2026
AI中文摘要

双能CT(DECT)可实现虚拟单色成像(VMI)并提高对比度分辨率,但其临床采用受到硬件复杂性和成本的限制。在这项工作中,我们提出了一种统一的深度学习框架,通过利用对比相位信息作为先验,从单能CT(SECT)数据合成对比相位特异性虚拟单色50 keV图像。该模型使用DECT衍生的70 keV和50 keV图像对进行训练,涵盖四个对比相位——血管期、动脉期、门脉期和延迟期——采用一种新颖的先验条件架构,将对比相位先验整合到能量转换过程中。我们证明了所提出的统一模型能够实现对比增强,并在对比相位之间具有良好的泛化能力。此外,我们展示了该模型可以从SECT输入生成类似50 keV的图像,并保留对比相位特异性动态。

英文摘要

Dual-energy CT (DECT) enables virtual monochromatic imaging (VMI) and improved contrast resolution, but its clinical adoption is limited by hardware complexity and cost. In this work, we propose a unified deep learning framework that synthesizes contrast-phase-specific virtual monochromatic 50 keV images from single-energy CT (SECT) data by leveraging contrast phase information as a prior. The model is trained using DECT-derived 70 keV and 50 keV image pairs across four contrast phases -- Angio, Arterial, Portal, and Delayed -- using a novel prior conditioning architecture that integrates contrast phase priors into the energy transformation process. We demonstrate that the proposed unified model achieves contrast enhancement and generalizes well across contrast phases. Additionally, we show that the model can generate 50 keV-like images from SECT inputs, preserving contrast phase-specific dynamics.

2605.29744 2026-05-29 cs.AI cs.CL cs.LG cs.MA 版本更新

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

为什么专家模型仍然重要:面向医学人工智能的异构多智能体范式

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

发表机构 * Anthropic AI

AI总结 提出HetMedAgent异构多智能体框架,通过冲突感知证据融合、不确定性驱动的临床医生干预触发和自适应阈值校准,实现通用大语言模型与领域专家模型的协同,在三个临床决策任务中验证了专家模型在模态特定分析中的不可替代价值。

Comments Accepted at ICML 2026. 12 pages main text, 16 pages appendix

详情
AI中文摘要

GPT和Claude等通用大语言模型在医疗保健领域的出色表现引发了一个关键问题:特定领域的医学专家模型是否会变得过时?我们认为,医学人工智能的未来不在于构建单一的医学基础模型,也不在于取代人类专业知识,而在于协调通用大语言模型、领域特定专家模型和临床医生之间的协作。我们提出HetMedAgent,一个异构医学多智能体框架,能够实现冲突感知证据融合、基于不确定性的临床医生干预触发和自适应阈值校准。在三个真实世界临床决策任务上的实验表明,通用大语言模型与领域特定专家模型之间的协同显著优于单独使用任一类型模型,验证了专家模型在模态特定分析中的不可替代价值。HetMedAgent代表了从构建医学大语言模型或基础模型向多智能体协作的转变,实现了通用推理能力与领域特定精度之间的平衡。

英文摘要

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2605.29742 2026-05-29 cs.AI 版本更新

Citation-Closure Retrieval and Per-Rule Attribution for Real-World Regulatory Compliance Question Answering

引用闭包检索与逐规则归因:面向真实世界法规合规问答

Yeong-Joon Ju, Seong-Whan Lee

发表机构 * Department of Artificial Intelligence, Korea University(高丽大学人工智能系)

AI总结 针对法规合规问答中多层级权威结构的引用追踪难题,提出基于操作知识图谱的基准RegOps-Bench和统一框架RefWalk,通过共享主题锚点遍历跨文档引用、多视角候选融合及逐规则归因,显著提升检索召回率和引用准确性。

Comments Under Review

详情
AI中文摘要

将大型语言模型(LLM)部署于法规合规领域,要求通过跨多层权威结构的全面引用来实现严格的追溯性。与传统多跳或法律问答不同,该任务需要结构化的程序性查找和证据集闭包,而非实体解析或判例推理。现有的RAG系统由于扁平化的引用边、碎片化的检索扩展以及脆弱的后期归因而难以胜任。我们通过RegOps-Bench将法规合规问答形式化,这是一个新颖的基准,包含从复杂的国家研发法规中导出的操作知识图谱。为解决这些瓶颈,我们提出了RefWalk,一个由共享主题锚点驱动的统一框架。RefWalk遍历跨文档引用,通过基于最大值的聚合融合多视角候选,并强制执行逐规则归因,以明确地将声明映射到来源。我们建立了一个强大的基线,在检索召回率和引用准确性方面取得了显著改进。最后,在美国健康合规数据集(HIPAA)上的对比评估显示,现有系统在扁平结构规则上表现饱和,凸显了RegOps-Bench的必要性。我们的代码可在https://github.com/yeongjoonJu/RefWalk获取。

英文摘要

Deploying Large Language Models (LLMs) for regulatory compliance demands rigorous traceability via comprehensive citations across multi-tiered authority structures. Unlike traditional multi-hop or legal QA, this task requires structured procedural lookups and evidence-set closure rather than entity resolution or case-law reasoning. Existing RAG systems struggle here due to flattened citation edges, fragmented retrieval expansions, and fragile post-hoc attribution. We formalize Regulatory Compliance QA with RegOps-Bench, a novel benchmark featuring an Operational Knowledge Graph derived from complex national R&D regulations. To address these bottlenecks, we propose RefWalk, a unified framework driven by a shared topic anchor. RefWalk traverses cross-document citations, fuses multi-view candidates via max-based aggregation, and enforces per-rule attribution to explicitly map claims to sources. We establish a strong baseline with substantial improvements in retrieval recall and citation accuracy. Finally, a contrastive evaluation on a U.S. health compliance dataset (HIPAA) reveals that existing systems exhibit saturation on flat-structure rules, underscoring the need for RegOps-Bench. Our code is available at https://github.com/yeongjoonJu/RefWalk.
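
The "max-based aggregation" of multi-view candidates mentioned in the abstract above could look roughly like the sketch below. It is an assumption-laden illustration rather than RefWalk's implementation: the data layout (one dictionary of candidate scores per retrieval view), the function name `fuse_max`, and the example rule identifiers are all hypothetical.

```python
def fuse_max(views):
    """Fuse per-view candidate scores by keeping each candidate's maximum.

    views: list of dicts mapping a candidate id (e.g. a rule identifier)
    to its retrieval score in that view.
    Returns (candidate, score) pairs sorted by descending score, with ties
    broken by candidate id so the ranking is deterministic.
    """
    fused = {}
    for view in views:
        for candidate, score in view.items():
            if candidate not in fused or score > fused[candidate]:
                fused[candidate] = score
    return sorted(fused.items(), key=lambda item: (-item[1], item[0]))


# Two hypothetical retrieval views over the same rule corpus.
ranking = fuse_max([
    {"rule-a": 0.2, "rule-b": 0.9},
    {"rule-a": 0.7, "rule-c": 0.4},
])
```

Taking the maximum lets a candidate that is strongly supported by one view survive even when other views score it poorly, which a mean-based fusion would dilute.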

2605.29738 2026-05-29 cs.CL cs.AI 版本更新

Multi-Legal-Bench: Evaluating LLMs on Legal Reasoning Across Jurisdictions, Languages, and Legal Traditions

Multi-Legal-Bench: 跨司法管辖区、语言和法律传统的法律推理评估LLM

Volodymyr Ovcharov

发表机构 * SecondLayer

AI总结 提出Multi-Legal-Bench,首个跨司法管辖区法律基准,在6个国家、4个语系和1.34亿份法院判决上评估LLM,发现少样本效果跨辖区复制、无单一模型主导所有语言、跨语言迁移不遵循语言邻近性、分词器效率不显著预测跨语言准确率。

Comments 14 pages, 5 figures, 8 tables. Dataset: https://huggingface.co/datasets/overthelex/multi-legal-bench

详情
AI中文摘要

法律NLP基准绝大多数评估单一语言或汇总跨司法管辖区根本不同的任务,使得跨语言比较不可能。我们引入Multi-Legal-Bench,首个跨司法管辖区法律基准,在六个国家(乌克兰、法国、荷兰、波兰、捷克共和国、立陶宛)、四个语系和1.34亿份法院判决上评估相同任务。该基准定义了五个任务——法院类型分类、判决形式分类、案件结果预测、法律规范提取和原因类别预测——映射到来自国家法院登记处的结构化元数据,形成一个故意稀疏的5x6任务-司法管辖区矩阵(30个单元格中填充20个)。我们通过AWS Bedrock在零样本和3样本提示下评估7个前沿LLM,并额外使用4个小/中型模型(3-12B)进行规模分析。我们的结果显示:(1)在乌克兰发现的依赖任务的少样本效果在所有司法管辖区复制;(2)没有单一模型主导任何语言——排名随任务和司法管辖区而变化;(3)跨语言少样本迁移不遵循语言邻近性:UA->FR(罗曼语族,-2.1个百分点)迁移优于UA->PL(斯拉夫语族,-13.7个百分点),标签集对齐比语系更能预测迁移质量;(4)分词器生育率尽管有2.3倍的差异,并不能显著预测跨语言准确率(r=-0.27,p=0.14),表明模型架构和预训练数据主导分词器效率。我们发布所有数据、提示和模型预测。

英文摘要

Legal NLP benchmarks overwhelmingly evaluate a single language or aggregate tasks that differ fundamentally across jurisdictions, making cross-lingual comparison impossible. We introduce Multi-Legal-Bench, the first cross-jurisdictional legal benchmark that evaluates identical tasks across six countries (Ukraine, France, Netherlands, Poland, Czech Republic, Lithuania), four language families, and 134 million court decisions. The benchmark defines five tasks (court-type classification, judgment form classification, case-outcome prediction, legal norm extraction, and cause category prediction) mapped to structured metadata from national court registries, forming a deliberately sparse 5x6 task-jurisdiction matrix (20 of 30 cells filled). We evaluate 7 frontier LLMs under zero-shot and 3-shot prompting via AWS Bedrock, with 4 additional small/medium models (3-12B) for scaling analysis. Our results reveal that: (1) task-dependent few-shot effects discovered in Ukrainian replicate across all jurisdictions; (2) no single model dominates any language; rankings shift with both task and jurisdiction; (3) cross-lingual few-shot transfer does not follow language proximity: UA->FR (Romance, -2.1 pp) transfers better than UA->PL (Slavic, -13.7 pp), with label-set alignment predicting transfer quality better than language family; and (4) tokenizer fertility, despite a 2.3x spread, does not significantly predict cross-lingual accuracy (r=-0.27, p=0.14), suggesting that model architecture and pretraining data dominate tokenizer efficiency. We release all data, prompts, and model predictions.

2605.29733 2026-05-29 cs.AI 版本更新

Uncertainty-Aware Transfer Learning for Cross-Building Energy Forecasting: Toward Robust and Scalable District-Level Energy Management

面向跨建筑能耗预测的不确定性感知迁移学习:迈向鲁棒且可扩展的区域级能源管理

Shadmehr Zaregarizi, Khashayar Yavari

发表机构 * Politecnico di Torino(都灵理工大学)

AI总结 提出基于时间融合变换器的不确定性感知迁移学习框架,通过引入迁移鲁棒性指标和探针微调策略,实现跨建筑能耗预测的鲁棒迁移与不确定性量化。

Comments 5 pages, 3 figures, 2 tables. Accepted at BALANCES'26 (6th ACM International Workshop on Big Data and Machine Learning for Smart Buildings and Cities), Banff, Alberta, Canada, June 22, 2026. This is the author's accepted manuscript; final published version DOI will be activated after June 22, 2026

详情
AI中文摘要

将数据驱动的能耗预测扩展到区域级需要能够在最小目标域数据和诚实不确定性估计下跨建筑复用的模型。我们提出了一种基于时间融合变换器的不确定性感知迁移学习框架,用于跨建筑能耗预测,并在新发布的高分辨率真实子计量数据集上进行了评估:丹麦奥尔堡大学的一栋教育建筑(源域)和瑞士EMPA的多类型NEST建筑(目标域)。我们引入了迁移鲁棒性指数,一种与架构无关的度量,用于量化跨域泛化质量。一项四策略层冻结消融实验表明,仅探针微调(仅更新806K参数中的455个输出层参数)实现了最佳的迁移质量,优于全微调,表明TFT编码器学习了可迁移的时间表示。蒙特卡洛丢弃法得到的预测区间覆盖概率为93.2%,接近名义上的95%目标。数据稀缺性分析进一步显示,随着目标域数据的增加,性能单调提升,为区域能源部署提供了实践指导。

英文摘要

Scaling data-driven energy forecasting to district level requires models that can be re-used across buildings with minimal target-domain data and honest uncertainty estimates. We present an uncertainty-aware transfer learning (TL) framework for cross-building energy forecasting based on the Temporal Fusion Transformer (TFT), evaluated on a newly released high-resolution real sub-meter dataset: an educational building at Aalborg University, Denmark (source) and the multi-typology NEST building at EMPA, Switzerland (target). We introduce the Transfer Robustness Index (TRI), an architecture-agnostic metric for quantifying generalization quality across domain gaps. A four-strategy layer-freezing ablation shows that Probe-Only fine-tuning, updating only 455 output-layer parameters out of 806K, achieves the best transfer quality (TRI = 3,097), outperforming full fine-tuning and suggesting that TFT encoders learn transferable temporal representations. Monte Carlo Dropout yields a prediction interval coverage probability of 93.2%, close to the nominal 95% target. A data-scarcity analysis further shows monotonic improvement with increasing target-domain data, providing practical guidance for district energy deployment.
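
The prediction interval coverage probability (PICP) quoted in the abstract above is the fraction of observations that fall inside their predicted intervals. A minimal sketch follows; the function name and the toy numbers are illustrative only, and how the paper derives interval bounds from its Monte Carlo Dropout samples is not specified in the abstract.

```python
def picp(y_true, lower, upper):
    """Fraction of observations lying inside their prediction interval.

    y_true, lower, upper: equal-length sequences of observed values and
    the lower/upper bounds of each prediction interval.
    """
    if not y_true or not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    covered = sum(
        1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi
    )
    return covered / len(y_true)


# Toy data: the second observation falls outside its interval.
coverage = picp(
    [1.0, 2.0, 3.0, 4.0],
    [0.5, 2.5, 2.0, 3.0],
    [1.5, 3.5, 4.0, 5.0],
)
```

A well-calibrated 95% interval should give a PICP near 0.95, which is the comparison the abstract makes when it reports 93.2% against the nominal target.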

2605.29716 2026-05-29 cs.AI 版本更新

NaRA: Noise-Aware LoRA for Parameter-Efficient Fine-Tuning of Diffusion LLMs

NaRA: 面向扩散大语言模型参数高效微调的噪声感知LoRA

Shuaidi Wang, Zhan Zhuang, Ruping Huang, Yu Zhang

发表机构 * Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China(南方科技大学计算机科学与工程系,深圳,中国) Department of Computer Science, City University of Hong Kong, Hong Kong, China(香港城市大学计算机科学系,香港,中国) Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学计算机科学与工程系,香港,中国)

AI总结 针对扩散大语言模型,提出噪声感知低秩适配(NaRA),通过噪声条件超网络生成低秩核心矩阵,实现沿去噪轨迹连续变化的更新矩阵,在常识推理、数学推理和代码生成基准上优于噪声无关基线。

详情
AI中文摘要

扩散大语言模型(dLLMs)已成为一种有前途的非自回归生成范式。鉴于全微调的计算成本过高,参数高效微调(PEFT)已成为标准方法。然而,现有的PEFT方法(如LoRA)最初是为自回归模型设计的,依赖于静态参数,对噪声水平不敏感。因此,它们忽略了扩散过程的内在动态性,其中输入分布和生成难度沿去噪轨迹显著变化,使得它们对dLLMs而言是次优的。为了解决这个问题,我们提出了噪声感知低秩适配(NaRA),它引入了一个由轻量级、全局共享的超网络根据噪声水平生成的低秩核心矩阵。这种设计使得更新矩阵能够沿扩散过程连续变化,同时保持参数和延迟开销可忽略不计。我们为所提出的NaRA框架提供了理论依据,并在常识推理、数学推理和代码生成基准上实证证明了其相对于噪声无关基线的持续改进。我们的代码可在https://github.com/generaldi/NaRA获取。

英文摘要

Diffusion Large Language Models (dLLMs) have emerged as a promising non-autoregressive generative paradigm. Given the prohibitive computational cost of full fine-tuning, Parameter-Efficient Fine-Tuning (PEFT) has become the standard approach. However, existing PEFT methods (e.g., LoRA), originally tailored for autoregressive models, rely on static parameters that are agnostic to the noise level. Consequently, they ignore the intrinsic dynamics of the diffusion process, where input distributions and generation difficulty shift significantly along the denoising trajectory, rendering them suboptimal for dLLMs. To address this, we propose Noise-aware Low-Rank Adaptation (NaRA), which introduces a low-rank core matrix generated by a lightweight, globally shared hypernetwork conditioned on the noise level. This design enables the update matrices to vary continuously along the diffusion process while keeping parameter and latency overhead negligible. We provide a theoretical justification for the proposed NaRA framework and empirically demonstrate consistent improvements over noise-agnostic baselines across commonsense reasoning, mathematical reasoning, and code generation benchmarks. Our code is available at https://github.com/generaldi/NaRA.

2605.29713 2026-05-29 cs.LG cs.AI 版本更新

The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer

生成式AI基础小书:直观数学入门

Tianhua Chen

发表机构 * School of Computing and Engineering(计算与工程学院)

AI总结 本书通过推导导向的方式,从PCA到能量模型,系统介绍现代生成式人工智能的数学基础,旨在使生成建模结构更易理解。

Comments Preprint version, 178 pages. Comments and corrections are welcome

详情
AI中文摘要

本书提供了对现代生成式人工智能数学基础的紧凑、推导导向的介绍。它不是调查每一个最近的架构或实现细节,而是通过连接主要生成模型家族的思想发展出一条连贯的路线,从PCA、概率PCA、变分自编码器和扩散模型到归一化流、自回归分解、GANs、Wasserstein GANs和基于能量的模型。目的是使生成建模的结构更易理解,同时不失去理解这些模型如何推导和关联所需的数学实质。本书旨在为具有数学好奇心的研究人员、从业者和学生提供基础构建的入门读物。

英文摘要

This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students.

2605.29712 2026-05-29 cs.CL cs.AI 版本更新

Teaching Language Models to Check Grounded Claim Factuality with Human Test-Taking Strategies

教会语言模型使用人类应试策略检查基于事实的声明真实性

Yuxuan Ye, Raul Santos-Rodriguez, Edwin Simpson

发表机构 * Intelligent Systems Laboratory(智能系统实验室) University of Bristol(布里斯托大学)

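AI summary: Frames grounded claim factuality checking as a true/false reading comprehension task, prompts language models with explicit test-taking strategies for efficient reasoning, and trains small language models to reduce inference cost.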

Comments ACL 2026 Main

详情
AI中文摘要

基于事实的声明真实性检查对于大型语言模型(LLM)应用(如检索增强生成)非常重要,因为它帮助用户评估生成输出的正确性。现有的使用蕴含分类器的指标需要针对数据集调整阈值,而基于LLM的方法通常使用直接提示,这未能充分利用LLM的推理能力。我们通过将基于事实的声明真实性检查建模为真假阅读理解任务,并提示LLM使用明确的应试策略进行高效推理来解决这一问题。与无引导的开放式推理相比,我们的方法减少了超过80%的令牌使用量,并在两个真实性基准测试中取得了与更昂贵替代方案竞争的性能,在一个基准上达到了新的最先进水平。为了进一步降低推理成本,我们训练小语言模型(SLM)来替代检查流程中的LLM。通过监督微调(SFT)和自我修正机制,SLM学会了改进其真实性判断。实验结果表明,生成的SLM在性能上与强基线相当,结合了低推理成本和生成支持理由以支持可解释性。代码和数据集将在接收后发布。

英文摘要

Grounded claim factuality checking is important for large language model (LLM) applications such as retrieval-augmented generation, as it helps users assess the correctness of generated outputs. Existing metrics using entailment classifiers require dataset-specific threshold tuning, while LLM-based approaches often use direct prompting, which underutilises the reasoning capabilities of LLMs. We address this by formulating grounded claim factuality checking as a true/false reading comprehension task and prompting LLMs with explicit test-taking strategies for efficient reasoning. Our method reduces token usage by over 80% compared to unguided open-ended reasoning and achieves performance competitive with more expensive alternatives across two factuality benchmarks, setting a new state of the art on one. To further reduce inference cost, we train small language models (SLMs) to replace LLMs in the checking pipeline. Using supervised fine-tuning (SFT) and a self-revision mechanism, the SLMs learn to improve their factuality judgements. Experimental results show that the resulting SLMs perform on par with strong baselines, combining low inference cost with supporting rationales that aid interpretability. Code and datasets will be released upon acceptance.

2605.29711 2026-05-29 cs.CL cs.AI 版本更新

Personalized Turn-Level User Conversation Satisfaction Benchmark

个性化轮级用户对话满意度基准

Zhefan Wang, Zhiqiang Guo, Weizhi Ma, Min Zhang, Quanjia Yan, Hengliang Luo

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China.(清华大学计算机科学与技术系,北京,中国) Institute for AI Industry Research, Tsinghua University, Beijing, China.(清华大学人工智能产业研究院,北京,中国) Meituan(美团)

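AI summary: To evaluate personalized satisfaction with AI assistant responses, proposes a satisfaction evaluator that combines user memories with target-turn context, and builds the PersTurnBench benchmark, which uses replay for controlled comparison of generation models.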

详情
AI中文摘要

用户对AI助手的满意度高度个性化:同一响应可能满足一个用户但令另一个失望,取决于每个用户的期望以及他们之前询问的内容。现有的自动评估方法大多衡量通用响应质量,难以判断某个响应在特定轮次是否满足用户。我们将此问题作为个性化轮级用户对话满意度评估进行研究。我们构建了一个对话满意度评估器,将紧凑的用户记忆与目标轮上下文相结合,生成满意度分数和不满意的理由。与人类满意度标注的元评估表明,个性化记忆和事后分数校准在有序一致性和不满意轮次检测上优于监督式、检索式和通用LLM作为评判者的基线。我们进一步引入了PersTurnBench,这是一个个性化轮级用户对话满意度基准,通过回放使用经过验证的评估器来评估生成模型。通过固定回放状态,PersTurnBench能够在无需为每个候选模型收集新人工标签的情况下,对通用生成模型和记忆增强的个性化系统进行受控比较。该评估器和基准让研究人员能够在无需为每个模型收集新用户反馈的情况下,比较候选生成模型在个性化满意度上的表现。

英文摘要

User satisfaction with AI assistants is highly personalized: the same response may satisfy one user but disappoint another depending on what each user expects and what they have asked for before. Existing automatic evaluation methods mostly measure generic response quality, making it difficult to judge whether a response satisfies a user at a specific turn. We study this problem as personalized turn-level user conversation satisfaction evaluation. We build a conversation satisfaction evaluator that combines compact user memories with target-turn context to produce satisfaction scores and dissatisfaction-oriented rationales. Meta-evaluation against human satisfaction annotations shows that personalized memory and post-hoc score calibration improve ordinal agreement and dissatisfied-turn detection over supervised, retrieval-based, and generic LLM-as-a-judge baselines. We further introduce PersTurnBench, a personalized turn-level user conversation satisfaction benchmark that uses the verified evaluator to assess generation models via replay. By holding the replay state fixed, PersTurnBench enables controlled comparison of generic generation models and memory-augmented personalized systems without new human labels for every candidate model. The evaluator and benchmark let researchers compare candidate generation models on personalized satisfaction without collecting new user feedback for every model.

2605.29705 2026-05-29 cs.AI 版本更新

BitTP: The Lightweight Trajectory Prediction Model with BitLLM for Edge-Devices

BitTP:基于BitLLM、面向边缘设备的轻量级轨迹预测模型

Mincheol Kang, Hyunjin Lim, Bomin Kang, Daehee Park

发表机构 * KAIST, Republic of Korea(韩国科学技术院) DGIST, Republic of Korea(韩国大邱庆北科学技术院)

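AI summary: Proposes BitTP, which converts an LLM-based trajectory predictor into a lightweight 1.58-bit architecture, greatly reducing memory and compute requirements while preserving or improving prediction quality, so that it can be deployed on edge devices.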

Comments Camera-ready version. Accepted as a findings paper at CVPR 2026. 8 pages, 4 figures

详情
AI中文摘要

轨迹预测是自主系统的一项基本任务,需要对多智能体交互和意图进行复杂推理。大型语言模型(LLM)最近被用于此任务,因为它们提供了强大的上下文推理和可解释的、基于语言的轨迹表示。然而,这些基于LLM的预测器极其消耗内存和计算资源,难以部署在资源受限的边缘设备上,例如自主机器人的车载计算机。为弥合这一差距,我们提出BitTP,它将基于LLM的轨迹预测器转换为轻量级比特线性架构。我们证明,仅权重量化到1.58比特(BitTP-Weight)是最优的。关键在于,激活值必须保持全精度,因为量化它们会导致时空推理的严重退化和不稳定性。实验表明,BitTP-Weight不仅保持了全精度(BF16)LLM基线的预测质量,还提升了质量,平均ADE降低14.29%,FDE降低20.97%,同时相比其他量化方法减少了内存使用和推理延迟。这些结果表明,精心设计的量化可作为有效的正则化器,使得基于LLM的复杂推理能够在边缘设备上实际部署。代码地址:https://github.com/MintCat98/BitTP。

英文摘要

Trajectory prediction is a fundamental task for autonomous systems, requiring complex reasoning about multi-agent interactions and intents. Large language models (LLMs) have recently been adopted for this task, as they provide strong contextual reasoning and interpretable, language-based trajectory representations. However, these LLM-based predictors are extremely memory- and compute-intensive, making them difficult to deploy on resource-constrained edge devices such as on-board computers in autonomous robots. To bridge this gap, we propose BitTP, which converts an LLM-based trajectory predictor into a lightweight bitlinear architecture. We demonstrate that weight-only quantization to 1.58-bit (BitTP-Weight) is optimal. Crucially, activations must remain in full precision, as quantizing them leads to severe degradation and instability in spatio-temporal reasoning. Empirically, BitTP-Weight not only preserves but improves prediction quality over the full-precision (BF16) LLM baseline, reducing ADE by 14.29% and FDE by 20.97% on average, while simultaneously reducing memory usage and inference latency relative to other quantization methods. These results demonstrate that carefully designed quantization acts as an effective regularizer, enabling the practical deployment of sophisticated LLM-based reasoning on edge devices. Code is available at: https://github.com/MintCat98/BitTP.
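
The abstract does not spell out the quantizer, so the following is only a minimal sketch of weight-only ternary quantization under one common assumption: an absmean scale followed by rounding and clipping to {-1, 0, +1}. Three states carry log2(3) ≈ 1.58 bits, which is where the "1.58-bit" name comes from. The helper names `quantize_ternary` and `bitlinear` are hypothetical, and activations are deliberately left in full precision, as the abstract says they must be.

```python
import math

def quantize_ternary(weights, eps=1e-8):
    """Absmean ternary quantization (assumed scheme): W -> ({-1, 0, 1}, scale)."""
    magnitudes = [abs(w) for row in weights for w in row]
    scale = sum(magnitudes) / len(magnitudes)
    quantized = [
        [max(-1, min(1, round(w / (scale + eps)))) for w in row]
        for row in weights
    ]
    return quantized, scale

def bitlinear(x, q_weights, scale):
    """y = scale * (Wq @ x); x stays in full precision (weight-only quantization)."""
    return [scale * sum(qw * xi for qw, xi in zip(row, x)) for row in q_weights]

bits_per_weight = math.log2(3)  # about 1.585 bits for three states

W = [[0.9, -0.05, -1.2],
     [0.02, 0.6, -0.4]]
Wq, s = quantize_ternary(W)     # Wq == [[1, 0, -1], [0, 1, -1]]
y = bitlinear([1.0, 2.0, 3.0], Wq, s)
```

Because the quantized weights are only -1, 0, or +1, the matrix product reduces to additions and subtractions followed by a single multiplication by the scale, which is the source of the memory and latency savings on edge hardware.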

2605.29697 2026-05-29 cs.AI 版本更新

Beyond Trajectory Rewards: Step-level Credit Assignment for Agentic Search via Graph Modeling

超越轨迹奖励:通过图建模实现智能搜索的步骤级信用分配

Yuchen Liu, Yingjie Feng, Lixiong Qin, Jiasi Chen, Jianing Yu, Sheng Gao, Sheng Yang, Weiran Xu

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Li Auto Inc.(理想汽车)

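AI summary: Because trajectory-level rewards in agentic search cannot quantify the contribution of individual steps, proposes Graph-Distance Contribution Reward (GDCR), a step-level process reward, and combines it with Step Advantage Policy Optimization (SAPO); effectiveness is validated on four benchmarks.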

Comments 15 pages, 8 figures

详情
AI中文摘要

在智能搜索中,轨迹级结果奖励无法量化单个步骤的行为贡献,而现有的步骤级奖励方法通常依赖于代价高昂的树采样。我们将世界知识视为潜在的世界图,并将每个信息搜索任务视为在潜在任务图中的搜索,其中有效步骤应朝着答案节点进行图进展。基于这一先验,我们提出图距离贡献奖励(GDCR),这是一种步骤级过程奖励,通过训练时实体-关系(ER)图中实体到答案节点的距离对新检索和引用的实体进行评分。我们进一步提出步骤优势策略优化(SAPO),它将GDCR转换为步骤级优势,并与轨迹级结果优势相结合。在四个具有挑战性的基准上的实验验证了我们方法的有效性。

英文摘要

In Agentic Search, trajectory-level outcome rewards fail to quantify the behavioral contributions of individual steps, while existing step-level reward methods typically rely on costly tree sampling. We view world knowledge as a latent world graph and each information-seeking (IS) task as a search within a latent task graph, where effective steps should make graph progress toward the answer node. Based on this prior, we propose Graph-Distance Contribution Reward (GDCR), a step-level process reward that scores newly-retrieved and newly-cited entities by their distance to the answer node in a training-time Entity-Relation (ER) graph. We further propose Step Advantage Policy Optimization (SAPO), which converts GDCR into step-level advantages and combines them with trajectory-level outcome advantages. Experiments on four challenging benchmarks validate the effectiveness of our method.
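
To make the graph-distance idea concrete, here is a small sketch that scores each search step by how close its newly seen entities are to the answer node. The reward shape 1 / (1 + distance), the use of the best new entity per step, and the toy graph are all assumptions for illustration; the abstract does not give the actual GDCR formula.

```python
from collections import deque

def distances_to_answer(graph, answer):
    """Breadth-first search over an undirected ER graph; hop count to `answer`."""
    dist = {answer: 0}
    queue = deque([answer])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist

def step_rewards(graph, answer, steps):
    """One reward per step, based only on entities not seen in earlier steps."""
    dist = distances_to_answer(graph, answer)
    seen = set()
    rewards = []
    for entities in steps:
        new_entities = [e for e in entities if e not in seen]
        seen.update(new_entities)
        # Assumed scoring rule: closer entities score higher; unknown ones score 0.
        scores = [1.0 / (1 + dist[e]) for e in new_entities if e in dist]
        rewards.append(max(scores, default=0.0))
    return rewards

# Toy chain: film - director - birthplace - country (the answer).
graph = {
    "film": ["director"],
    "director": ["film", "birthplace"],
    "birthplace": ["director", "country"],
    "country": ["birthplace"],
}
steps = [["film"], ["film", "director"], ["unrelated"], ["birthplace", "country"]]
rewards = step_rewards(graph, "country", steps)
```

In this toy run the third step retrieves nothing connected to the answer and earns zero, while the final step reaches the answer node itself; a trajectory-level reward alone could not tell these steps apart.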

2605.29695 2026-05-29 cs.AI cs.CE cs.LG math.PR 版本更新

FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting

FHRFormer: 一种用于胎儿心率时间序列修复和预测的自监督掩码Transformer框架

Kjersti Engan, Neel Kanwal, Anita Yeconia, Ladislaus Blacy, Yuda Munyaw, Estomih Mduma, Hege Ersdal

发表机构 * University of Stavanger(斯塔万格大学) Haydom Lutheran Hospital(海多姆路德医院) Stavanger University Hospital(斯塔万格大学医院)

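AI summary: To address signal dropout in fetal heart rate monitoring, proposes a self-supervised masked-transformer autoencoder that inpaints and forecasts missing signals by capturing local temporal and frequency components; the method is robust and supports the development of AI-based risk algorithms.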

Comments Submitted to Frontiers in Digital Health. arXiv admin note: substantial text overlap with arXiv:2509.20852

详情
AI中文摘要

大约10%的新生儿出生时需要帮助才能开始呼吸,约5%需要通气支持。胎儿心率(FHR)监测在产前护理中评估胎儿健康状况方面起着关键作用,能够检测异常模式并支持及时产科干预以减轻分娩期间的胎儿风险。应用人工智能(AI)方法分析具有不同结局的连续FHR监测大数据集,可能为预测需要呼吸辅助或干预的风险提供新见解。可穿戴FHR监测仪的最新进展实现了在不影响母亲活动能力的情况下进行连续胎儿监测。然而,母亲运动期间的传感器移位以及胎儿或母亲位置的变化常常导致信号丢失,造成记录的FHR数据出现缺口。这种缺失数据限制了有意义信息的提取,并使基于AI的自动化分析复杂化。传统的缺失数据处理方法,如简单插值技术,往往无法保留信号的频谱特性。在本文中,我们提出了一种基于掩码Transformer的自编码器方法,通过捕获数据的局部时间和频率成分来重建缺失的FHR信号。所提出的方法在不同缺失数据时长下表现出鲁棒性,可用于信号修复和预测。该方法可回顾性地应用于研究数据集,以支持基于AI的风险算法开发。未来,该方法可集成到可穿戴FHR监测设备中,实现更早、更稳健的风险检测。

英文摘要

Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropout, resulting in gaps in recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handling missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both local temporal and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

2605.29687 2026-05-29 cs.AI cs.LO 版本更新

Reliable Reasoning with Large Language Models via Preference-Based Maximum Satisfiability

基于偏好最大可满足性的大语言模型可靠推理

Pedro Orvalho, Marta Kwiatkowska, Guillem Alenyà, Felip Manyà

发表机构 * Artificial Intelligence Research Institute (IIIA) Consejo Superior de Investigaciones Científicas (CSIC)(人工智能研究所(IIIA)西班牙国家科学研究委员会(CSIC)) Department of Computer Science University of Oxford(牛津大学计算机科学系) Institut de Robòtica i Informàtica Industrial (IRI-CSIC-UPC)(机器人与工业信息学研究所(IRI-CSIC-UPC))

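AI summary: Proposes a hybrid reasoning approach in which an LLM generates code that encodes a natural-language problem as a preference-based MaxSAT problem, which an exact solver then solves and an independent check verifies, substantially improving feasibility.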

Comments 17 pages, 1 figure, 4 tables

详情
AI中文摘要

大语言模型(LLM)擅长理解自然语言,但在涉及多个约束和用户定义偏好的优化任务(常见于机器人等领域)中表现不佳。我们提出一种混合推理方法,其中LLM通过代码生成实现外部化推理。给定自然语言问题描述,LLM生成Python代码,将用户定义的约束和偏好编码为偏好最大可满足性(MaxSAT)问题,然后由精确的MaxSAT求解器求解。为确保正确性,模型生成代码返回的解会与规范MaxSAT编码独立验证可行性和最优性,允许不同的编码和多个最优解。我们使用开源和闭源LLM在三个偏好推理任务族上评估该方法,并与相同模型的直接回答、思维链和程序思维基线进行比较。虽然这些基线很少产生可行解,但基于MaxSAT的流水线实现了显著更高的接受率,在某些情况下超过80%。我们的结果表明,LLM驱动的代码生成结合偏好MaxSAT能够针对生成的编码实现可验证的优化,并在独立验证的参考语义下大幅提高正确性。

英文摘要

Large Language Models (LLMs) excel at understanding natural language but struggle with optimisation tasks involving multiple constraints and user-defined preferences, which commonly arise in domains such as robotics. We propose a hybrid reasoning approach in which LLMs externalise reasoning through code generation. Given a natural language problem description, an LLM generates Python code that encodes user-defined constraints and preferences as a preference-based Maximum Satisfiability (MaxSAT) problem, which is then solved by an exact MaxSAT solver. To ensure correctness, solutions returned by the model-generated code are independently verified for feasibility and optimality against a canonical MaxSAT encoding, allowing for different encodings and multiple optimal solutions. We evaluate our approach using both open-source and closed-access LLMs on three families of preference-based reasoning tasks, and compare it against direct-answer, chain-of-thought, and program-of-thought baselines using the same models. While these baselines rarely produce feasible solutions, the MaxSAT-based pipeline achieves substantially higher acceptance rates, in some cases exceeding 80%. Our results demonstrate that LLM-driven code generation combined with preference-based MaxSAT enables solver-verifiable optimisation with respect to generated encodings, and substantially improves correctness under independently verified reference semantics.
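
For readers unfamiliar with the encoding, the sketch below shows what a preference-based MaxSAT instance looks like: hard clauses that must hold and weighted soft clauses that express preferences. It uses brute-force enumeration only to stay dependency-free; the pipeline described above relies on an exact MaxSAT solver instead. The toy robot scenario, variable numbering, and weights are invented for illustration.

```python
from itertools import product

def satisfied(clause, assignment):
    """A clause is a list of signed ints (DIMACS style): 2 means x2, -2 means not x2."""
    return any(assignment[abs(lit)] == (lit > 0) for lit in clause)

def is_feasible(assignment, hard):
    """Independent feasibility check: every hard clause must be satisfied."""
    return all(satisfied(clause, assignment) for clause in hard)

def solve_maxsat(num_vars, hard, soft):
    """Return (best_weight, assignment) maximizing satisfied soft weight."""
    best_weight, best_assignment = None, None
    for values in product([False, True], repeat=num_vars):
        assignment = dict(enumerate(values, start=1))
        if not is_feasible(assignment, hard):
            continue
        weight = sum(w for w, clause in soft if satisfied(clause, assignment))
        if best_weight is None or weight > best_weight:
            best_weight, best_assignment = weight, assignment
    return best_weight, best_assignment

# x1: clean kitchen, x2: clean bedroom, x3: recharge.
hard = [
    [1, 2],    # clean at least one room
    [-1, -2],  # no time to clean both
    [-2, -3],  # cannot recharge while cleaning the bedroom
]
soft = [
    (3, [2]),  # user prefers the bedroom cleaned
    (2, [3]),  # user prefers the robot recharged
    (2, [1]),  # user prefers the kitchen cleaned
]
best_weight, best = solve_maxsat(3, hard, soft)
# best_weight == 4, best == {1: True, 2: False, 3: True}
```

Although the bedroom carries the single highest preference weight, the optimum cleans the kitchen and recharges, because those two preferences can be satisfied together. Trade-offs of this kind are what an exact solver resolves reliably.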

2605.29685 2026-05-29 cs.AI 版本更新

NICE: A Theory-Grounded Diagnostic Benchmark for Social Intelligence of LLMs

NICE:一个基于理论的LLM社交智能诊断基准

Yunjin Qi, Zhaojun Jiang, Xuan Wu, Hanxi Pan, Yixuan Wang, Yanfang Liu, Xiang Ji, Churu Yu, Chunyuan Zheng, Yingze Chen, Jie He, Liuqing Chen, Zaifeng Gao

发表机构 * Department of Psychology and Behavioral Sciences, Zhejiang University(浙江大学心理学与行为科学系) College of Artificial Intelligence, Zhejiang University(浙江大学人工智能学院) Human Machine Interaction Lab, Huawei Technologies Co., Ltd.(华为技术有限公司人机交互实验室) Zhejiang Key Laboratory of Neurocognitive Development and Mental Health(浙江省神经认知发展与心理健康重点实验室)

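AI summary: Builds a social intelligence framework grounded in social theory and, on top of it, the diagnostic benchmark NICE for fine-grained assessment of LLM capability weaknesses in social interaction.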

详情
AI中文摘要

随着大语言模型(LLM)在情感陪伴和客户服务等社交场景中的广泛应用,衡量其社交智能对人工智能交互的质量与安全性变得至关重要。然而,现有的社交智能基准缺乏统一框架来组织社交能力,因此无法进行细粒度诊断。为了构建首个基于社会理论的整体诊断评估,我们首先通过文献综述和多阶段专家验证(遵循心理测量学原则)构建了一个社交智能框架。该框架包括4个类别和11个维度,每个维度进一步由细粒度的能力方面指定。基于此框架,我们提出了NICE(规范、交互、认知、体验),一个包含137个项目的诊断基准,通过代表性中文情境进行操作化。在5个前沿LLM和一个人类参考组中,模型在总体准确率上得分较高,但在沟通方面表现出持续的弱点,框架将其定位到三个具体能力方面:多轮沟通、非语言沟通和同步性。因此,NICE将社交智能评估重新定义为对LLM中具有社会后果的弱点的基于理论的诊断。

英文摘要

As large language models (LLMs) are increasingly applied in social contexts such as emotional companionship and customer service, measuring their social intelligence has become critical to the quality and safety of human-AI interaction. However, existing social intelligence benchmarks lack a unified framework for organizing social abilities, and therefore cannot support fine-grained diagnosis. To build the first holistic diagnostic evaluation grounded in social theory, we first construct a social intelligence framework through a literature review and multi-stage expert validation guided by psychometric principles. The resulting framework includes 4 categories and 11 dimensions, each further specified by fine-grained capability facets. Building on this framework, we introduce NICE (Norm, Interaction, Cognition, Experience), a diagnostic benchmark of 137 items operationalized through representative Chinese contexts. Across 5 frontier LLMs and a human reference group, models score higher in aggregate accuracy yet show a consistent weakness in Communication, which the framework localizes to 3 specific capability facets: multi-turn communication, nonverbal communication, and synchrony. NICE thus reframes social intelligence evaluation toward theory-grounded diagnosis of socially consequential weaknesses in LLMs.

2605.29675 2026-05-29 cs.HC cs.AI cs.IR 版本更新

From Prompts to Context: An Ontology-Driven Framework for Human-Generative AI Collaboration

从提示到上下文:一种面向人类-生成式AI协作的本体驱动框架

Ngoc Luyen Le, Marie-Hélène Abel, Bertrand Laforge

发表机构 * Gamaizer Université de technologie de Compiègne, CNRS, Heudiasyc(Gamaizer、贡比涅技术大学、CNRS、Heudiasyc) Sorbonne Université, CNRS UMR 7585, LPMHE(索邦大学、CNRS UMR 7585、LPMHE)

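AI summary: Proposes an ontology-based (CCAI) framework that models tasks, roles, resources, and constraints in a structured way, turning prompt-response interactions into queryable collaboration traces to improve traceability and accountability in information-intensive workflows.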

详情
AI中文摘要

与生成式AI的协作通常始于简短提示,止于不透明输出,隐去了参与者、任务、资源及约束等关键信息。这种上下文显式性的缺失阻碍了信任、可追溯性和问责性,尤其在搜索、查询和档案管理等信息密集型工作流中。本文提出“从提示到上下文”这一本体驱动框架,用于表示人类-生成式AI协作。其核心组件——上下文协作AI本体(CCAI)——将任务、智能体角色、资源和约束等协作关键元素建模为共享的机器可解释词汇。通过将填充的CCAI实例与基于SPARQL的上下文检索相结合,该框架将原本短暂的提示-响应交互转化为结构化、可查询的协作轨迹,连接提示、输出及其周围上下文。通过一个软件开发团队构建基于能力的教育功能(用于查看和更新学习者能力档案)的案例研究,展示了该框架如何支持需求分析、设计、实现和测试阶段的协作片段表示与文档化。结果表明,显式协作建模有助于使任务上下文更清晰,提高AI生成贡献的可追溯性,并支持更透明、更负责任的人类-生成式AI实践。最后,我们提出了未来人类-生成式AI系统的设计原则,强调不仅关注输出质量,还要显式表示产生输出的协作上下文。

英文摘要

Collaborations with Generative AI often begin with a short prompt and end with an opaque output, leaving implicit who was involved, what task was being pursued, which resources were used, and which constraints should have shaped the process. This limited contextual explicitness hinders trust, traceability, and accountability, particularly when Generative AI is embedded in information-intensive workflows such as search, querying, and profile management. This paper introduces From Prompts to Context, an ontology-driven framework for representing Human-Generative AI collaboration. Its core component, the Contextual Collaboration AI Ontology (CCAI), models key elements of collaboration - including tasks, agent roles, resources, and constraints - as a shared machine-interpretable vocabulary. By combining populated CCAI instances with SPARQL-based context retrieval in operational workflows, the framework turns otherwise ephemeral prompt-response interactions into structured and queryable collaboration traces linking prompts, outputs, and their surrounding context. The approach is illustrated through a case study involving a software development team building a competency-based education feature for viewing and updating learner competency profiles. The case study shows how the framework can support the representation and documentation of collaboration episodes across requirements analysis, design, implementation, and testing. Within this setting, the results indicate that explicit collaboration modelling helps make task context more explicit, improves the traceability of AI-generated contributions, and supports more transparent and accountable Human-Generative AI practices. We conclude by outlining design principles for future Human-Generative AI systems that emphasise not only output quality, but also the explicit representation of the collaborative context in which outputs are produced.

2605.29670 2026-05-29 cs.CL cs.AI 版本更新

EviLink: Multi-Path Schema Linking with Uncertainty-Guided Evidence Acquisition for Large-Scale Text-to-SQL

EviLink: 面向大规模Text-to-SQL的基于不确定性引导证据获取的多路径模式链接

Huawei Zheng, Sen Yang, Zhaorui Yang, Yuhui Zhang, Haozhe Feng, Haoxuan Li, Xuan Yi, Chao Hu, Defeng Xie, Chen Hou, Danqing Huang, Wei Chen, Yingcai Wu, Peng Chen, Dazhen Deng

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院) State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD&CG国家重点实验室) Tencent TEG(腾讯TEG) School of Mathematical Sciences, Peking University(北京大学数学科学学院)

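AI summary: Proposes EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition and reframes schema linking as uncertainty-aware schema-need inference, balancing schema completeness, relevance, and token cost to improve large-scale Text-to-SQL.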

详情
AI中文摘要

模式链接是大规模Text-to-SQL中困难且重要的步骤,系统必须从庞大且模糊的数据库中识别出紧凑且充分的模式上下文。现有方法通常将模式链接视为围绕单个SQL路径的确定性选择,但复杂问题可能允许多个具有不同模式需求的有效实现。我们将模式链接重新定义为对多个可行SQL路径的不确定性感知模式需求推理,其中系统区分必需模式项与路径依赖的不确定项,并仅在需要时获取证据。我们通过EviLink实例化这一重构,它结合了多假设模式基础与不确定性引导的证据获取。在BIRD-Dev和Spider2-Snow上的实验表明,这种视角改善了模式完整性、模式相关性和令牌成本之间的平衡。在Spider2-Snow上,EviLink实现了90.15%的字段级严格召回率,平均使用123.30K令牌,并在固定生成器下提升了下游SQL生成性能。

英文摘要

Schema linking is a difficult and important step in large-scale Text-to-SQL, where systems must identify a compact yet sufficient schema context from large and ambiguous databases. Existing methods often treat schema linking as deterministic selection around a single SQL path, but complex questions may admit multiple valid realizations with different schema needs. We reframe schema linking as uncertainty-aware schema-need inference over multiple plausible SQL paths, where the system distinguishes required schema items from path-dependent uncertain ones and acquires evidence only where needed. We instantiate this reframing with EviLink, which combines multi-hypothesis schema grounding with uncertainty-guided evidence acquisition. Experiments on BIRD-Dev and Spider2-Snow show that this perspective improves the balance among schema completeness, schema relevance, and token cost. On Spider2-Snow, EviLink achieves 90.15% field-level strict recall rate, uses 123.30K average tokens, and improves downstream SQL generation under a fixed generator.
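
One way to picture the distinction between required and path-dependent schema items is with plain set operations over several candidate SQL paths. The sketch below assumes that "required" means "used by every hypothesis" and that evidence is gathered by a caller-supplied `probe` function; both are illustrative assumptions, and none of the names come from EviLink itself.

```python
def split_schema_needs(hypotheses):
    """Split schema items into those every hypothesis needs and the uncertain rest."""
    if not hypotheses:
        return set(), set()
    item_sets = [set(items) for items in hypotheses]
    required = set.intersection(*item_sets)
    uncertain = set.union(*item_sets) - required
    return required, uncertain

def link_schema(hypotheses, probe):
    """Keep required items; keep an uncertain item only if the probe supports it."""
    required, uncertain = split_schema_needs(hypotheses)
    probed = sorted(uncertain)  # evidence is acquired only for uncertain items
    confirmed = {item for item in probed if probe(item)}
    return required | confirmed, probed

# Two plausible SQL paths for the same question (toy schema).
hypotheses = [
    ["orders.id", "orders.total", "customers.name"],
    ["orders.id", "orders.total", "invoices.amount"],
]

def probe(item):
    # Hypothetical evidence check, e.g. sampling column values; hard-coded here.
    return item.startswith("customers.")

linked, probed = link_schema(hypotheses, probe)
```

Only two of the four distinct schema items are probed in this example, which hints at how restricting evidence acquisition to uncertain items can hold token cost down without dropping the columns every path needs.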

2605.29668 2026-05-29 cs.AI cs.CL 版本更新

GRASP: Gated Regression-Aware Skill Proposer for Self-Improving LLM Agents

GRASP: 门控回归感知技能提议器用于自我改进的LLM智能体

Johannes Moll, Jean-Philippe Corbeil, Jiazhen Pan, Martin Hadamitzky, Daniel Rueckert, Lisa Adams, Keno Bressem

发表机构 * Technical University of Munich and TUM University Hospital(慕尼黑工业大学及其附属医院) Microsoft Healthcare & Life Sciences(微软医疗与生命科学)

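AI summary: Proposes GRASP, which gates regression-aware edits to a skill library so that every skill update yields a net improvement under a hard regression budget, substantially improving the operational reliability of LLM agents in structured environments.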

详情
AI中文摘要

在结构化环境中运行的LLM智能体以操作方式而非对话方式失败,其可靠性取决于对环境的程序性知识。先前的自我改进方法累积自然语言指导而不检查每个新项目是否保留先前正确的行为,因此修复一条轨迹的笔记可能静默地使另一条轨迹退化。我们引入GRASP(门控回归感知技能提议器),将智能体改进视为对有限技能库的一系列编辑,仅在候选技能在硬回归预算下对平衡的保留探针产生净改进时才接受它。我们在两个基于FHIR的临床基准上评估了GRASP在五个基础模型(gpt-oss-120b、DeepSeek V4 Flash、Gemini 3.1 Flash Lite、GPT-4.1、GPT-5.4)上的表现。在MedAgentBench上,GRASP将gpt-oss-120b从40.6%提升至88.8%,超过五个自我改进基线中最强的21.0个百分点,并将其他每个基础模型提升17.2至40.3个百分点。消融实验将增益归因于比较性提议生成、接受门和硬回归预算,而非技能编写本身——没有验证的技能编写并不比不使用技能更好。该机制泛化到临床领域之外,在四个非临床环境中的三个上改进了智能体,仅在动作空间开放的环境中保持持平。冻结的技能库可在模型间迁移,其中来自更强模型的技能将较弱执行者提升到超出其自身学习能力的水平,而反向则不然,这种不对称性是没有门控的基线无法复现的。

英文摘要

LLM agents acting in structured environments fail in operational rather than conversational ways, and reliability depends on procedural knowledge of the environment. Prior self-improvement methods accumulate natural-language guidance without checking that each new item preserves previously correct behavior, so a note that fixes one trajectory can silently regress another. We introduce GRASP (Gated Regression-Aware Skill Proposer), which treats agent improvement as a sequence of edits to a bounded skill library, admitting each candidate only if it produces a net improvement on a balanced held-out probe under a hard regression budget. We evaluate GRASP across five base models (gpt-oss-120b, DeepSeek V4 Flash, Gemini 3.1 Flash Lite, GPT-4.1, GPT-5.4) on two FHIR-based clinical benchmarks. On MedAgentBench, GRASP lifts gpt-oss-120b from 40.6% to 88.8%, exceeds the strongest of five self-improvement baselines by 21.0 points, and improves every other base model by 17.2 to 40.3 points. Ablations attribute the gain to comparative proposal generation, the acceptance gate, and the hard regression budget rather than to skill writing itself, which without validation is no better than using no skills. The mechanism generalizes beyond the clinical domain, improving agents on three of four non-clinical environments and remaining flat only where the action space is open-ended. Frozen libraries transfer across models, where skills from a stronger model improve weaker executors beyond what they learn for themselves while the reverse does not, an asymmetry that no ungated baseline reproduces.
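
The acceptance rule described above can be sketched as a small gate over per-task pass/fail results on a held-out probe. This is a simplified reading of the abstract: the probe results are supplied as precomputed booleans, the budget is a plain count of allowed regressions, and the skill names are invented; the real system's probe construction and budget definition may differ.

```python
def evaluate_gate(before, after, regression_budget=0):
    """Compare per-task results without and with a candidate skill."""
    fixes = sum(1 for b, a in zip(before, after) if a and not b)
    regressions = sum(1 for b, a in zip(before, after) if b and not a)
    accepted = (fixes - regressions) > 0 and regressions <= regression_budget
    return accepted, fixes, regressions

def update_library(library, baseline, candidates, regression_budget=0):
    """Admit candidates one at a time; each is judged against the current state."""
    current = list(baseline)
    for name, results in candidates:
        accepted, _, _ = evaluate_gate(current, results, regression_budget)
        if accepted:
            library.append(name)
            current = list(results)
    return library, current

baseline = [True, True, False, False]
candidates = [
    # Fixes task 3 and breaks nothing: admitted.
    ("check-units-before-posting", [True, True, True, False]),
    # Fixes task 4 but breaks task 1: rejected.
    ("always-retry-search", [False, True, True, True]),
]
library, final_results = update_library([], baseline, candidates)
```

The second candidate shows why both conditions matter: judged against the original baseline it fixes two tasks and breaks one, a net gain, yet a hard budget of zero regressions still rejects it.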

2605.29659 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Opir:针对毒性、越狱、仇恨言论和有害内容的高效多任务安全分类

Ihor Stepanov, Aleksandr Smechov

发表机构 * Knowledgator Wordcab

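AI summary: Proposes Opir, a family of encoder-based guardrail models built on the GLiClass architecture that use multi-task learning for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization; across 12 safety-classification tasks and 17 category tasks it is competitive with existing guardrail systems at a smaller deployment footprint.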

Comments 23 pages, 4 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)应用的实时安全过滤需要能够检测不安全提示、有毒语言、越狱尝试和不安全响应的分类器,且不能像大型护栏模型那样成本高昂,同时要能区分良性的敏感文本与真正隐蔽的有害内容。在本文中,我们介绍了Opir,一个基于GLiClass架构的编码器护栏模型系列。Opir包括用于二进制安全/不安全分类、多标签毒性分类、越狱分类以及零样本不安全提示和响应分类的多任务模型。我们还发布了专门用于二进制安全/不安全分类的边缘变体,参数少于1亿。这些模型在一个三级分类体系上训练,该体系包含16个顶层标签、126个中层标签和854个叶标签,共996个类别。Opir的训练数据结合了基于分类体系的不安全提示、对抗性挖掘的难负例、良性安全保持示例、生成的响应示例、多语言翻译以及Aegis2和WildGuard训练子集的部分内容。我们还开源了一个评估工具,支持GLiClass和GLiNER2后端以及基于解码器的模型,涵盖二进制安全分类、多标签分类、毒性、越狱检测、提示安全、响应安全、响应拒绝以及跨公共基准系列的提示子类别视图。在与八个当代护栏系统(包括基于GLiNER2和生成式护栏模型)的扩展比较中,涵盖12项安全分类任务和17项类别任务,Opir变体在大多数基准数据集上与最强的开源基线模型竞争或领先,同时部署规模显著更小。

英文摘要

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-source an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

2605.29657 2026-05-29 cs.CV cs.AI 版本更新

OccamToken: Efficient VLM Inference with Training-Free and Budget-Adaptive Token Pruning

OccamToken: 无需训练且预算自适应的令牌剪枝实现高效VLM推理

Geng Li, Guohao Chen, Ting Chen, Shilin Shan, Kuangji Zuo, Bofan Lyu, Tuo An, Gen Li, Jianfei Yang

发表机构 * Nanyang Technological University (NTU)(南洋理工大学)

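AI summary: Proposes OccamToken, a training-free, budget-adaptive visual token pruning framework that replaces the absolute-ranking paradigm with register-anchored relative evidence testing, sharply reducing token counts while maintaining high accuracy.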

Comments 26 pages,8 figures

详情
AI中文摘要

视觉语言模型(VLM)依赖长视觉令牌序列进行视觉理解,导致预填充阶段在计算和内存上开销巨大。现有大多数剪枝方法遵循绝对排名范式,为视觉令牌分配重要性分数并保留固定的Top-K子集。本文认为这种范式本质上是脆弱的:注意力汇聚点扭曲令牌重要性排名,而图像冗余和查询依赖的视觉证据使得固定令牌预算在不同输入间不可靠。我们提出OccamToken,一个无需训练的框架,用寄存器锚定的相对证据测试替代绝对令牌排名。OccamToken不询问哪些令牌全局重要,而是评估视觉令牌是否提供了超越寄存器基线的信息。我们的关键洞察是,寄存器令牌自然吸收低信息注意力模式,使其成为识别真正信息性视觉证据的稳定参考。基于这一原理,OccamToken通过从寄存器注意力中导出的动态阈值,执行图像自适应冗余剪枝和查询自适应相关性剪枝。在LLaVA-NeXT、LLaVA-v1.5和Qwen3-VL上,OccamToken一致地改善了准确率-效率权衡,无需额外训练。值得注意的是,在LLaVA-NeXT上,它将2880个视觉令牌减少到约40个,同时保留了超过93%的原始准确率,即使在极端的1.4%保留率下也能实现稳定的视觉令牌压缩。

英文摘要

Vision-language models (VLMs) rely on long visual token sequences for visual understanding, making the prefill stage expensive in both computation and memory. Most existing pruning methods follow an absolute-ranking paradigm, assigning importance scores to visual tokens and retaining a fixed top-K subset. In this work, we argue that this paradigm is fundamentally brittle: attention sinks distort token importance rankings, while image redundancy and query-dependent visual evidence make fixed token budgets unreliable across inputs. We propose OccamToken, a training-free framework that replaces absolute token ranking with register-anchored relative evidence testing. Instead of asking which tokens are globally important, OccamToken evaluates whether a visual token provides information beyond a register-based reference. Our key insight is that register tokens naturally absorb low-information attention patterns, making them a stable reference for identifying genuinely informative visual evidence. Based on this principle, OccamToken performs both image-adaptive redundancy pruning and query-adaptive relevance pruning through dynamic thresholds derived from register attention. Across LLaVA-NeXT, LLaVA-v1.5, and Qwen3-VL, OccamToken consistently improves the accuracy-efficiency trade-off without additional training. Notably, on LLaVA-NeXT, it reduces 2,880 visual tokens to approximately 40 while preserving over 93% of the original accuracy, enabling stable visual token compression even in the extreme 1.4% retention regime.
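
The contrast between a fixed top-K budget and a register-anchored test can be shown in a few lines. In the sketch below a visual token is kept only if its score exceeds a threshold derived from the register tokens' scores. Taking the maximum register score as the threshold, and the attention values themselves, are assumptions; the abstract says only that thresholds are dynamic and derived from register attention.

```python
def prune_by_register(visual_scores, register_scores, margin=0.0):
    """Keep indices of visual tokens scoring above the register-based reference."""
    threshold = max(register_scores) + margin
    return [i for i, score in enumerate(visual_scores) if score > threshold]

def retention_rate(kept, total):
    """Fraction of visual tokens that survive pruning, as a percentage."""
    return 100.0 * kept / total

# Toy attention scores: two informative tokens among mostly redundant ones.
busy_image = [0.01, 0.20, 0.03, 0.15, 0.02]
plain_image = [0.01, 0.02, 0.03, 0.02, 0.01]
registers = [0.05, 0.04]

kept_busy = prune_by_register(busy_image, registers)    # [1, 3]
kept_plain = prune_by_register(plain_image, registers)  # []

# The figures quoted in the abstract: about 40 of 2,880 tokens retained.
reported_rate = retention_rate(40, 2880)  # roughly 1.4 percent
```

Unlike top-K selection, the number of surviving tokens here depends on the input: the busy image keeps two tokens and the plain one keeps none, which is the budget-adaptive behaviour the abstract argues for.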

2605.29656 2026-05-29 cs.AI 版本更新

TRACE: Toulmin-based Reasoning Assessment through Constructive Elements for LLM CoT Evaluation

TRACE: 基于图尔敏论证元素的 LLM 思维链推理评估

Yundong Kim, Heyoung Yang

发表机构 * Applied Agent Research Center, Korea Institute of Science(应用智能代理研究中心,韩国科学技术信息研究所) Department of Computer Science and Engineering, University of Seoul, Republic of Korea(首尔市立大学计算机科学与工程系,大韩民国)

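AI summary: Proposes the TRACE metric, which combines Toulmin's argumentation theory with Flavell's metacognitive framework to analyze the structure of chain-of-thought reasoning; experiments show a strong correlation with benchmark accuracy (r=0.74), and TRACE is effective as a reinforcement learning reward signal.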

Comments 23 pages, Accepted at ICML 2026

详情
AI中文摘要

由于缺乏真实答案,评估大型语言模型(LLM)的开放式输出仍然具有挑战性。现有指标依赖于最终答案的准确性或表面统计,而未检查推理过程本身。我们提出 TRACE(基于图尔敏论证元素的推理评估),一种分析思维链(CoT)推理过程的指标。TRACE 不判断结果,而是通过整合图尔敏的论证理论与弗拉维尔的元认知框架来检查论证的构建方式,从而评估推理结构。在 7 个推理模型的 26.3K QA 样本上的实验表明,TRACE 与基准准确率强相关(r=0.74)。此外,TRACE 作为强化学习奖励信号有效,优于仅基于准确率的基线。这些结果共同表明,逻辑合理的推理能带来更高质量的答案。因此,TRACE 可作为评估开放式输出的补充指标。代码可在 https://github.com/hyyangkisti/trace 获取。

英文摘要

Evaluating open-ended outputs from large language models (LLMs) remains challenging due to the absence of ground truth. Existing metrics rely on final-answer accuracy or surface-level statistics, leaving the reasoning process itself unexamined. We introduce TRACE (Toulmin-based Reasoning Assessment through Constructive Elements), a metric that analyzes Chain-of-Thought (CoT) reasoning processes. Rather than judging outcomes, TRACE inspects how arguments are constructed by integrating Toulmin's argumentation theory with Flavell's metacognitive framework to assess reasoning structure. Experiments on 26.3K QA samples across 7 reasoning models show strong correlation with benchmark accuracy (r=0.74). Furthermore, TRACE is effective as a reinforcement learning reward signal, outperforming accuracy-only baselines. Together, these results indicate that logically sound reasoning leads to higher-quality answers. TRACE thus serves as a complementary metric for evaluating open-ended outputs. Code is available at https://github.com/hyyangkisti/trace.

2605.29653 2026-05-29 cs.AI 版本更新

PTCG-Bench: Can LLM Agents Master Pokémon Trading Card Game?

PTCG-Bench:LLM智能体能否掌握宝可梦集换式卡牌游戏?

Dongdong Hua, Yifei Sun, Renhong Huang, Feng Gao, Chunping Wang, Yang Yang

发表机构 * Zhejiang University(浙江大学) FinVolution Group(FinVolution集团)

AI总结 提出PTCG-Bench基准,通过宝可梦集换式卡牌游戏评估LLM智能体的决策性能和自进化能力,并设计模块化消融实验分析智能体性能。

详情
AI中文摘要

面对一个策略复杂的棋盘游戏,人类玩家在玩几轮后就能快速学会制定策略。自主智能体在现实交互环境中需要类似的能力,然而现有的智能体基准往往未能充分捕捉这种策略性和不断演变的决策场景。我们提出了PTCG-Bench,一个基于宝可梦集换式卡牌游戏(PTCG)构建的基准,它在两个互补层面上评估LLM智能体:(1)它们在单个复杂环境中的决策性能,以及(2)它们通过积累经验自我进化的能力。我们还加入了针对模块化运行框架(harness)的消融实验,以更好地解释智能体性能,而不将其与模型能力混为一谈。我们的实验表明,尽管LLM智能体能够实现非平凡的游戏对战性能,但持续稳定的自我进化仍然具有挑战性,并且性能对运行框架(harness)的设计敏感。我们希望PTCG-Bench能够促进未来在现实交互环境中对框架感知(harness-aware)和自我进化智能体的研究。

英文摘要

Given a strategically complex board game, human players can quickly learn to devise strategies after playing a few rounds. Autonomous agents require similar capabilities in realistic interactive environments, yet existing agent benchmarks often fail to fully capture such strategic and evolving decision-making scenarios. We present PTCG-Bench, a benchmark built on the Pokémon Trading Card Game (PTCG) that evaluates LLM agents at two complementary levels: (1) their decision-making performance within a single complex environment, and (2) their ability to self-evolve through accumulated experience. We further include a modular harness ablation to better interpret agent performance without conflating it with model capability. Our experiments show that, although LLM agents can achieve non-trivial gameplay performance, sustained and stable self-evolution remains challenging, and performance is sensitive to harness design. We hope that PTCG-Bench will facilitate future research on harness-aware and self-evolving agents in realistic interactive environments.

2605.29652 2026-05-29 cs.AI 版本更新

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

快速思考,智能对话:结构化健康文本生成中确定性与神经计算的划分

Kai-Chen Cheng, Haejun Han, David Q. Sun

AI总结 提出一种将确定性计算与有限LLM调用相结合的流水线,用于结构化健康文本生成,在降低错误率和成本的同时保持忠实性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用于从结构化记录(如可穿戴时间序列、生物标志物、生命体征和护理管理日志)生成健康文本。对于重复性健康输出,流畅性是不够的:系统必须忠实于源数据,将解释性主张建立在可用证据上,遵循既定政策,输出机器可读的内容,并且运行成本足够低以支持重复使用。我们探讨在结构化健康生成中,哪些责任应由确定性计算承担,而非运行时LLM提示。我们引入了“快速思考,智能对话”,一个睡眠健康洞察流水线,其中确定性代码在调用一次有界LLM写入器之前执行重复分析。在280个用户-夜晚和六个模型上,与结构化零样本和少样本单次调用基线相比,该方法实现了更低的数值误差、更低的指令合规误差和更低的端到端成本。层替换揭示了特定合约的失败:LLM比较增加了数值误差,LLM排名降低了策略选择,LLM属性增加了无根据的因果语言,而LLM生成的写入器接口即使在上游事实确定后也会重新引入误差。结果支持一个更广泛的设计规则:让代码负责重复分析,让LLM在有界接口内表达已验证的事实。

英文摘要

Large language models (LLMs) are increasingly being used to generate health text from structured records such as wearable time series, biomarkers, vitals, and care-management logs. For recurring health outputs, fluency is not enough: systems must remain faithful to source data, ground explanatory claims in available evidence, follow stated policies, emit machine-readable outputs, and run cheaply enough for repeated use. We ask which responsibilities in structured health generation should be deterministic computation rather than runtime LLM prompting. We introduce Think Fast, Talk Smart, a sleep-health insight pipeline in which deterministic code performs recurring analysis before one bounded LLM writer call. Across 280 user-nights and six models, it achieves lower numeric error, lower instruction-compliance error, and lower end-to-end cost than structured zero-shot and few-shot one-call baselines. Layer replacement reveals contract-specific failures: LLM comparison raises numeric error, LLM ranking degrades policy selection, LLM attribution increases unsupported causal language, and an LLM-generated writer interface reintroduces errors even after upstream facts are deterministic. The results support a broader design rule: let code own recurring analysis, and let LLMs express verified facts within bounded interfaces.

2605.29645 2026-05-29 cs.LG cs.AI stat.ML 版本更新

The Sample Complexity of Multiclass and Sparse Contextual Bandits

多类别和稀疏上下文赌博机的样本复杂度

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren, Yishay Mansour, Shay Moran, Alexander Rakhlin

发表机构 * Tel Aviv University(特拉维夫大学) Massachusetts Institute of Technology(麻省理工学院) Google Research Tel Aviv(谷歌研究特拉维夫) Technion—Israel Institute of Technology(技术学院—以色列理工学院)

AI总结 针对随机i.i.d.上下文赌博机,提出基于决策估计系数和低方差探索的算法,在稀疏奖励下实现接近最优的样本复杂度,并匹配下界。

详情
AI中文摘要

我们研究随机i.i.d.设置下的上下文赌博机,其中学习器观察来自未知分布的上下文,从有限集合$A$中选择动作,并旨在基于赌博机反馈从给定类别中识别近似最优策略。受零一奖励的赌博机多类别分类启发,我们关注$s$-稀疏设置,其中对于每个上下文,奖励向量的$L_1$范数至多为$s \ll |A|$。我们的主要结果是设计算法,以高概率输出一个相对于策略类$Π$的$ε$-最优策略,使用$\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$个样本。我们将此界推广到一般Natarajan类,并补充了匹配的下界(对数因子内),从而缩小了先前工作(Erez等人,2024, 2025)留下的巨大差距,后者额外增加了$Θ(|A|^9)$依赖。我们通过两种互补方法获得这些结果。首先,我们从具有结构化观测的上下文决策角度分析上下文赌博机,设计了一种探索-优化算法,其样本复杂度由决策估计系数(DEC;Foster等人,2021, 2022)控制。我们证明,在$s$-稀疏奖励下,诱导的模型类具有随$s$缩放的尖锐DEC界,直接产生最优速率。由于这种方法主要是信息论性的,并涉及求解复杂的min-max优化问题,我们还开发了第二种更专门的算法方法,基于低方差探索技术。这种方法产生了具体、易处理的算法,并自然地扩展到上下文组合半赌博机,为赌博机多类别列表分类提供了改进的样本复杂度保证。

英文摘要

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the $s$-sparse setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $ε$-optimal policy compared to policy class $Π$ using $\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $Θ(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the decision-estimation coefficient (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.
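
To make the rate in the abstract above concrete, the block below restates the bound and notes when each term dominates. It is a reading aid, not a result taken from the paper: it assumes that $\log |Π|/δ$ is meant as $\log(|Π|/δ)$, and the $s = 1$ line assumes exactly one rewarded action per context, as in multiclass classification with zero-one rewards.

```latex
% Sample complexity as stated in the abstract (n is notation introduced here).
\[
  n(\varepsilon, \delta)
  \;=\;
  \tilde{O}\!\left(
    \left( \frac{s}{\varepsilon^{2}} + \frac{|A|}{\varepsilon} \right)
    \log \frac{|\Pi|}{\delta}
  \right)
\]

% The s / eps^2 term is the larger of the two exactly when eps <= s / |A|.
\[
  \frac{s}{\varepsilon^{2}} \;\ge\; \frac{|A|}{\varepsilon}
  \quad\Longleftrightarrow\quad
  \varepsilon \;\le\; \frac{s}{|A|}
\]

% Special case s = 1 (one rewarded action per context).
\[
  s = 1:
  \qquad
  n(\varepsilon, \delta)
  \;=\;
  \tilde{O}\!\left(
    \left( \frac{1}{\varepsilon^{2}} + \frac{|A|}{\varepsilon} \right)
    \log \frac{|\Pi|}{\delta}
  \right)
\]
```

Read this way, once $ε \le s/|A|$ the bracket is at most $2s/ε^2$, so the number of actions no longer affects the leading term.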

2605.29631 2026-05-29 cs.CL cs.AI 版本更新

Predicting Causal Effects from Natural Language Queries using Structured Representations

使用结构化表示从自然语言查询预测因果效应

Giuliano Martinelli, Piriyakorn Piriyatamwong, Abelardo Carlos Martinez Lorenzo, Jasmin Baier, Riccardo Orlando, Satvik Garg, Sharif Kazemi, Linxi Wang, Arianna Legovini, Samuel Fraiberger

发表机构 * The World Bank Group(世界银行集团) University of Oxford(牛津大学) New York University(纽约大学)

AI总结 针对从自然语言查询预测因果效应的问题,提出Query2Effect基准和两步框架,通过生成结构化表示再预测效应大小,微调使绝对误差降低27%-71%。

Comments 18 pages

详情
AI中文摘要

随机对照试验是医学和社会科学的基石,因为它们能够可靠地估计因果效应。然而,进行这些试验成本高昂且耗时,这激发了从现有实验证据预测因果效应的兴趣。大型语言模型(LLMs)的最新进展在知识密集型任务上表现出强大的性能,引发了一个问题:这些模型能否用于预测因果效应大小?为了研究这一点,我们引入了Query2Effect,这是一个新的大规模基准,包含超过72,000个与实验描述对齐的自然语言问题,通过改变查询在隐含性、抽象性和歧义性维度上的特异性,模拟现实的信息寻求场景。然后,我们提出了一个两步框架,首先生成查询的合成结构化表示,然后使用监督编码器模型预测效应大小。实验表明,微调在提高预测性能方面起着关键作用,与开箱即用的提示式LLMs相比,绝对误差降低了27%至71%,并且我们的两步框架有利于域外泛化,突显了将语义解释与数值效应估计分离的好处。

英文摘要

Randomized controlled trials are a cornerstone of medicine and the social sciences as they enable reliable estimates of causal effects. However, they are costly and time-consuming to conduct, motivating interest in predicting causal effects from existing experimental evidence. Recent advances in large language models (LLMs) have demonstrated strong performance on knowledge-intensive tasks, raising the question of whether these models can be used for forecasting causal effect sizes. To investigate this, we introduce Query2Effect, a new large-scale benchmark consisting of more than 72,000 natural language questions aligned with experiment descriptions, created to simulate realistic information-seeking scenarios by varying query specificity along dimensions of implicitness, abstraction, and ambiguity. We then propose a two-step framework that first generates a synthetic structured representation of a query before predicting effect size using a supervised encoder model. Experiments show that finetuning plays a crucial role in improving prediction performance, reducing absolute error by 27% to 71% compared to prompted out-of-the-box LLMs, and that our two-step framework is beneficial for out-of-domain generalization, highlighting the benefits of separating semantic interpretation from numerical effect estimation.

2605.29630 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Entity-Collision: A Stratified Protocol for Attributing Retrieval Lift in Agent Memory

实体碰撞:一种用于归因智能体记忆检索提升的分层协议

Youwang Deng

发表机构 * Independent Researcher(独立研究员)

AI总结 提出实体碰撞协议,通过控制实体重叠和标签分层,将BM25基线固定,从而将检索提升归因于嵌入器,并在多维度实验中揭示编码器容量并非唯一约束。

Comments 48 pages with appendix; 6-page body, mandatory Limitations, References, and 7 appendices. Code, benchmarks, and 37 reproduce scripts: https://github.com/youwangd/engram (see paper/REPRODUCIBILITY.md). Apache 2.0

详情
AI中文摘要

端到端的智能体记忆基准测试为每个检索器报告一个单一的hit@k指标,混淆了词汇泄漏(不受控制的查询/黄金/干扰实体重叠)与标签混合(偏好、服务、工具平均在一起)。我们提出实体碰撞,一种系统无关的协议,通过构造将BM25基线固定——每个干扰项共享答案的实体标记——并按判别器标签对查询进行分层,因此任何超过BM25的提升都可归因于嵌入器。应用于一个开源智能体记忆测试平台,涵盖5个标签×3个嵌入器×5个碰撞程度,并采用配对自助法95%置信区间,该协议揭示了一个双轴模式:256维哈希三元组仅在深度碰撞下的封闭词汇标签上有帮助;MiniLM-384在两个轴上均占优;而参数规模2.7倍的BGE-large并未在MiniLM上一致提升——它在意图式查询上胜出,但在词汇式查询上落败。编码器容量本身并非约束条件。合成意图标签的零假设在LongMemEval(n=500)上重现为单会话偏好回忆悬崖。LoCoMo上的自适应向量权重路由是一个测量的零假设:存在11.7个百分点的oracle空间,但我们测试的所有信号均未恢复。所有26个结果表和37个复现脚本均受版本控制并由公共注册表验证;该协议在一个确定性管理的记忆测试平台(事件溯源决策日志、DAG状态机模式生命周期)上执行,因此每个报告的置信区间都可以从输入流中逐字节复现。

英文摘要

End-to-end agent-memory benchmarks report a single hit@k per retriever, confounding lexical leakage (uncontrolled query/gold/distractor entity overlap) with tag-mixing (preferences, services, tools averaged together). We propose entity-collision, a system-agnostic protocol that pins the BM25 floor by construction -- every distractor shares the answer's entity tokens -- and stratifies queries by discriminator tag, so any lift over BM25 is attributable to the embedder. Applied to an open-source agent-memory testbed across 5 tags x 3 embedders x 5 collision degrees with paired-bootstrap 95% CIs, the protocol reveals a two-axis pattern: a 256-d hash trigram helps only on closed-vocabulary lexical tags at deep collision; MiniLM-384 dominates both axes; and a 2.7x-parameter BGE-large does not uniformly improve on MiniLM -- it wins on intent-style queries but loses on lexical ones. Encoder capacity alone is not the binding constraint. The synthetic intent-tag null replicates on LongMemEval (n=500) as a single-session-preference recall cliff. Adaptive vector-weight routing on LoCoMo is a measured null: 11.7pp of oracle headroom exists, but no signal we tested recovers it. All 26 result tables and 37 reproduce scripts are version-controlled and verified by a public registry; the protocol is exercised on a deterministically governed memory testbed (event-sourced decision log, DAG-state-machine schema lifecycle) so every reported CI is reproducible byte-for-byte from the ingest stream.
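
The paired-bootstrap 95% CIs mentioned in the abstract above can be pictured with a generic percentile bootstrap over per-query score differences. This is a minimal sketch of the standard technique, not the paper's reproduce scripts: the function name `paired_bootstrap_ci`, its parameters, and the choice of the percentile method are assumptions made here for illustration.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-query difference (a - b).

    Hypothetical helper: scores_a[i] and scores_b[i] are the scores (for
    example hit@k as 0 or 1) of two retrievers on the same query i.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is repeatable
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample queries with replacement. Working on differences keeps
        # the pairing: both systems are always scored on the same queries.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


if __name__ == "__main__":
    embedder_hits = [1, 1, 1, 0, 1]  # made-up illustrative data
    bm25_hits = [0, 0, 1, 0, 0]
    print(paired_bootstrap_ci(embedder_hits, bm25_hits))
```

A lift would usually be read as attributable to the embedder only when the whole interval lies above zero.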

2605.29629 2026-05-29 cs.AI 版本更新

Beyond Attack Success Rate: Temporal Logit Observability for LLM Safety Failures

超越攻击成功率:LLM安全失效的时间对数可观测性

Junyoung Park, Sunghwan Park, Seongyong Ju, Jaewoo Lee

发表机构 * Chung-Ang University(中央大学)

AI总结 提出时间对数可观测性(TLO)方法,通过解码过程中的合规-拒绝边际将模型-攻击条件映射到校准的二维平面,揭示攻击成功的时间模式,并基于此设计早期停止规则将成功越狱减少一半以上。

详情
AI中文摘要

攻击成功率(ASR)在生成结束时用单个是/否标签评估每次越狱,告诉我们是否发生了失败,但未说明失败如何展开。产生同等有害输出的两次攻击可能遵循完全不同的路径,而ASR无法区分它们。我们仅从对数几率使这些隐藏路径变得可观测。时间对数可观测性(TLO)是一种无需训练的诊断方法,在解码过程中观察合规-拒绝边际,并将每个模型-攻击条件置于校准的二维平面上。通过设计,该平面在ASR信息量最小的情况下最具信息量:即在因真正不同原因而成功的攻击中。在四种对齐的LLM和三种越狱范式下,具有几乎相同ASR的攻击在平面上位于明显不同的点:同一模型可能通过不同的时间模式失败。在大多数条件下,几何形状与来自隐藏状态的拒绝方向探针匹配,但一个模型显示了固定词汇方法的局限性。从TLO导出的简单早期停止规则将成功的越狱减少一半以上,且对普通良性查询无误报。安全评估应报告失败发生的时间和方式,而不仅仅是是否发生。TLO仅从对数几率即可观测前两者。

英文摘要

Attack Success Rate (ASR) evaluates each jailbreak with a single yes/no label at the end of generation, telling us whether a failure happened but not how it unfolded. Two attacks that produce equally harmful outputs may have followed completely different paths, and ASR cannot tell them apart. We make those hidden paths observable from logits alone. Temporal Logit Observability (TLO) is a training-free diagnostic that watches a compliance-refusal margin during decoding and places each model-attack condition on a calibrated 2D plane. By design, this plane is most informative exactly where ASR is least informative: among attacks that succeed for genuinely different reasons. Across four aligned LLMs and three jailbreak paradigms, attacks with nearly identical ASR land at clearly different points on the plane: the same model can fail through different temporal patterns. The geometry matches refusal-direction probes from hidden states on most conditions, with one model showing the limit of our fixed-lexicon approach. A simple early-stop rule derived from TLO cuts successful jailbreaks by more than half, without false alarms on plain benign queries. Safety evaluation should report when and how a failure unfolds, not only whether it occurred. TLO makes the first two observable from logits alone.
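
The abstract above does not define the compliance-refusal margin or the early-stop rule, so the following is only one plausible reading, written as a small sketch. Everything in it is an assumption introduced for illustration: the log-sum-exp form of the margin, the fixed token-id lexicons `comply_ids` and `refuse_ids`, and the threshold-plus-patience stop rule.

```python
import math


def logsumexp(values):
    """Numerically stable log(sum(exp(v))) for a non-empty list."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def margin(step_logits, comply_ids, refuse_ids):
    """Assumed margin: log-mass on compliance tokens minus refusal tokens."""
    comply = logsumexp([step_logits[i] for i in comply_ids])
    refuse = logsumexp([step_logits[i] for i in refuse_ids])
    return comply - refuse


def first_stop_step(margins, threshold, patience):
    """Hypothetical rule: stop once the margin has exceeded `threshold`
    for `patience` consecutive decoding steps; return that step or None."""
    run = 0
    for step, value in enumerate(margins):
        run = run + 1 if value > threshold else 0
        if run >= patience:
            return step
    return None


if __name__ == "__main__":
    # Toy 4-token vocabulary: id 0 is a compliance cue, id 2 a refusal cue.
    print(margin([2.0, 0.0, 1.0, 0.0], comply_ids=[0], refuse_ids=[2]))
    print(first_stop_step([0.5, 1.2, 1.5, 0.1, 2.0, 2.1, 2.2], 1.0, 3))
```

Tracking the margin per step, rather than a single end-of-generation label, is what lets two attacks with the same ASR be told apart by when the margin turns.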

2605.29628 2026-05-29 cs.SD cs.AI cs.CL cs.LG eess.AS 版本更新

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

COMET:音频-文本多模态对比嵌入中模态间隙的概念空间剖析

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

发表机构 * School of Artificial Intelligence, Beijing University of Posts and Telecommunications(北京邮电大学人工智能学院) Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey(萨里大学视觉、语音和信号处理中心)

AI总结 提出COMET框架,通过PLS-SVD分解揭示CLAP模型中模态间隙主要由少数共享概念轴贡献,并基于谱截断方法无训练地缓解间隙,实现零样本音频字幕接近全监督性能。

详情
AI中文摘要

对比语言-音频预训练(CLAP)模型广泛用于音频理解,并在许多零样本应用中支持模态无关的条件交换。然而,其性能受到音频和文本嵌入之间模态间隙的严重影响。现有解释主要将此间隙归因于锥体效应,将其视为均值嵌入之间的偏移,但仅纠正均值只能带来有限的改进。其他假设,如信息不平衡和维度坍缩,也被提出,但仍未得到充分验证,并且在音频领域尚未被深入研究。同时,一些工作尝试将多模态对比嵌入分解为可解释的概念,但没有任何工作从概念分解的角度显式分析模态间隙。在这项工作中,我们引入了COMET(基于PLS-SVD变换的概念空间组织与模态间隙解释),这是一个新颖的用于CLAP的偏最小二乘奇异值分解(PLS-SVD)框架,揭示了模态间隙的更广泛视角。我们的框架揭示,只有一小部分可解释的轴(捕捉共享概念)对相似度计算有显著贡献,并且均值分量仅部分代表模态间隙。基于这一见解,我们提出了一种简单的谱截断方法,以无训练的方式缓解模态间隙。该方法使得零样本音频字幕通过条件交换接近全监督性能,无需大型辅助记忆库或昂贵计算。同时,它在保持检索和音频字幕任务强性能的同时,实现了显著的嵌入维度缩减。

英文摘要

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that unveils a broader perspective of the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

2605.29626 2026-05-29 cs.CL cs.AI 版本更新

DLM-SWAI: Steering Diffusion Language Models Before They Unmask

DLM-SWAI: 在扩散语言模型去掩码之前引导它们

Hyeseon An, Yo-Sub Han

发表机构 * Department of Computer Science(计算机科学系) Yonsei University(延世大学)

AI总结 提出一种无需训练的引导方法DLM-SWAI,通过预计算的词级风格分数在去噪步骤中偏置词分布,实现扩散语言模型的可控生成。

Comments preprint

详情
AI中文摘要

将语言模型生成引导至期望的文本属性对于实际部署至关重要,而推理时方法特别有吸引力,因为它们无需重新训练即可实现可控生成。最近的研究也强调了扩散语言模型作为一种新兴的生成范式,具有独特的解码特性。然而,大多数现有的引导方法要么依赖辅助模型,要么专为自回归下一个词解码设计,难以应用于通过部分掩码序列的迭代去噪生成文本的扩散语言模型(DLM)。因此,我们提出DLM-SWAI,一种简单的无需训练的引导方法,通过使用预计算的词级风格分数在每个去噪步骤偏置词分布。在风格和安全控制任务上的实验表明,DLM-SWAI有效引导扩散语言模型,同时保持生成质量并需要最小的计算开销。消融实验进一步揭示了引导强度与流畅性之间的可控权衡,我们的分析将类别可引导性与词级属性线索的强度联系起来。

英文摘要

Steering language model generation toward desired textual properties is essential for practical deployment, and inference-time methods are particularly appealing because they enable controllable generation without retraining. Recent work has also highlighted diffusion language models as an emerging generation paradigm with distinct decoding properties. However, most existing steering approaches either rely on auxiliary models or are designed for autoregressive next-token decoding, making them difficult to apply to diffusion language models (DLMs), which generate text through iterative denoising of partially masked sequences. Therefore, we propose DLM-SWAI, a simple training-free steering method that biases the token distribution at each denoising step using pre-computed token-level style scores. Experiments on style and safety control tasks show that DLM-SWAI effectively steers diffusion language models while preserving generation quality and requiring minimal computational overhead. Ablations further reveal a controllable trade-off between steering strength and fluency, and our analysis links class-wise steerability to the strength of token-level attribute cues.
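
The idea of biasing the token distribution with pre-computed token-level scores, as described in the abstract above, can be sketched for a single masked position. The abstract does not give the exact form of the bias, so the additive logit shift below, the function name `steer_step`, and the `strength` parameter are assumptions made for illustration rather than the DLM-SWAI implementation.

```python
import math


def softmax(logits):
    """Convert logits to probabilities in a numerically stable way."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer_step(logits, style_scores, strength):
    """Assumed steering rule for one masked position at one denoising step.

    logits:       model logits over the vocabulary
    style_scores: pre-computed per-token scores (higher = more on-style)
    strength:     hypothetical knob trading steering against fluency
    """
    biased = [l + strength * s for l, s in zip(logits, style_scores)]
    return softmax(biased)


if __name__ == "__main__":
    logits = [0.0, 0.0, 0.0]   # toy 3-token vocabulary
    scores = [1.0, 0.0, -1.0]  # made-up style scores
    print(steer_step(logits, scores, strength=0.0))  # unsteered distribution
    print(steer_step(logits, scores, strength=1.0))  # mass shifts to token 0
```

Under this reading, a strength of zero recovers the model's own distribution, and larger values push probability toward on-style tokens at the possible cost of fluency, which matches the trade-off the ablations describe.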

2605.29625 2026-05-29 cs.AI 版本更新

Improving Collaborative Storytelling with a Multi-Agent Framework Based on Large Language Models

基于大语言模型的多智能体框架改进协作故事讲述

Arturo Valdivia, Paolo Burelli

发表机构 * Data Science Section IT University of Copenhagen(哥本哈根信息技术大学数据科学部门) AI Section - brAIn lab IT University of Copenhagen(哥本哈根信息技术大学人工智能部门 - brAIn实验室)

AI总结 提出一种基于大语言模型的多智能体框架,通过迭代的写者-编辑者过程,在物理棋盘游戏中与儿童协作生成高质量故事。

详情
AI中文摘要

共同创作(即AI智能体与人类交互生成输出(如艺术))的话题近期获得了显著关注。然而,大多数研究关注数字环境中的成人-人类交互。本文探索了一种新颖的游戏式共同创作场景,涉及儿童和大语言模型(LLMs)通过物理棋盘游戏交互来创作书面故事。我们的目标是开发一个多智能体框架,能够生成适合年轻玩家的高质量叙事。我们方法的核心是一个迭代的写者-编辑者过程,其中一个LLM生成故事,另一个评估故事并提供改进反馈。通过涉及多个LLM的模拟研究,我们表明这种迭代交互在连续循环中持续提高了生成故事的感知质量。结果表明,在交互式故事讲述系统中,少量改进步骤可能足以实现高质量输出。

英文摘要

Co-creation, in which AI agents interact with humans to generate outputs such as art, has gained significant attention recently. However, most studies focus on adult-human interactions in a digital setting. This paper explores a novel ludic co-creation scenario involving children and Large Language Models (LLMs) interacting through a physical board game to create written stories. Our goal is to develop a multi-agent framework capable of producing high-quality narratives suitable for young players. At the core of our approach is an iterative Writer-Editor process in which one LLM generates stories while another evaluates them and provides feedback for refinement. Through a simulation study involving multiple LLMs, we show that this iterative interaction consistently improves the perceived quality of generated stories across successive loops. The results indicate that a small number of refinement steps may be sufficient to achieve high-quality outputs in interactive storytelling systems.

2605.29610 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Learning Context-Conditioned Predicate Semantics via Prototype Feedback

通过原型反馈学习上下文条件谓词语义

NamGyu Jung, Chang Choi

发表机构 * Department of Computer Engineering, Gachon University, Seongnam, Republic of Korea(韩国城南市嘉泉大学计算机工程系)

AI总结 提出AlignG方法,利用原型反馈从图像关系候选中推断上下文条件谓词语义并调整关系表示,在VG-150和GQA-200上分别提升SGDet的F@100指标1.4和2.7。

Comments Accepted at ICML 2026. Code: https://github.com/Namgyu97/AlignG-SGG.pytorch

详情
AI中文摘要

在场景图生成中,一个核心挑战是建模多义谓词,其含义随上下文变化。先前的方法通过将谓词分解为多个静态原型或检索语义相似的示例来解决此问题。然而,这些策略保持谓词表示静态,无法重新组织语义以反映图像特定的证据,导致在模糊上下文中出现系统性混淆。我们提出AlignG,通过原型反馈学习上下文条件谓词语义。AlignG从每幅图像中的关系候选中推断上下文条件谓词语义,并将调整后的语义反馈回来以重新校准关系表示。学习目标将此适应锚定到全局语义中心,防止语义漂移,同时当场景提供一致的关系线索时仍允许选择性重组。在VG-150和GQA-200上的实验表明,在SGDet下,F@100指标分别提升了+1.4和+2.7,优于最先进的基线。我们进一步可视化每幅图像的原型相似性变化,并观察到一致的上下文相关重组,其中原型根据场景证据选择性地合并或分离谓词。代码可在https://github.com/Namgyu97/AlignG-SGG.pytorch获取。

英文摘要

In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image-specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers context-conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over state-of-the-art baselines, with F@100 improvements of +1.4 on VG-150 and +2.7 on GQA-200 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG-SGG.pytorch.

2605.29606 2026-05-29 cs.AI cs.IR 版本更新

HiKEY: Hierarchical Multimodal Retrieval for Open-Domain Document Question Answering

HiKEY: 面向开放域文档问答的分层多模态检索

Joongmin Shin, Gyuho Shim, Jeongbae Park, Jaehyung Seo, Heuiseok Lim

发表机构 * Human-inspired AI Research, Korea University(高丽大学类人人工智能研究院) Computer Science and Engineering, Konkuk University(建国大学计算机科学与工程系) Department of Computer Science and Engineering, Korea University(高丽大学计算机科学与工程系)

AI总结 提出基于文档层次结构的分层多模态检索框架HiKEY,通过文档层次解析和粗到细的检索策略解决大规模工业语料中的路由失败和证据碎片化问题,在ODQA基准上检索召回率提升达12.9%,端到端QA性能提升达6.8%。

Comments Accepted to ACL2026 Main

详情
AI中文摘要

基于检索增强生成(RAG)的文档级开放域问答(ODQA)在大规模工业语料库中面临两个关键瓶颈:定位正确文档时的路由失败以及整合分散信息时的证据碎片化。现有依赖平面文本块或页面级图像的方法本质上难以(i)在数千个候选中精确定位目标文档,以及(ii)在有限的token预算内有机连接多模态证据(如表格和图形)。为应对这些挑战,我们提出HiKEY,一种基于层次树的多模态检索框架,将文档层次结构提升为一等检索信号。不同于简单的分块,HiKEY通过文档层次解析(DHP)重建逻辑异构图,显式编码父子关系。采用层次化由粗到细的策略,该框架(1)通过全局路由利用层次索引快速剪枝搜索空间,以及(2)通过采用捕获最具区分性证据的多模态融合策略进行细粒度检索以对章节排序。最后,HiKEY通过混合结构-语义打包策略组装一个token高效的证据子图。在ODQA基准上的实验表明,HiKEY显著优于基于页面和基于块的基线,检索召回率提升高达12.9%,端到端QA性能提升高达6.8%。

英文摘要

Retrieval-augmented generation (RAG) for document-based Open-domain Question Answering (ODQA) on large-scale industrial corpora faces two critical bottlenecks: routing failure in locating the correct document and evidence fragmentation in integrating scattered information. Existing approaches relying on flat text chunks or page-level images inherently struggle to (i) precisely pinpoint the target document among thousands of candidates and (ii) organically connect multimodal evidence, such as tables and figures, within a limited token budget. To address these challenges, we propose HiKEY, a hierarchical tree-based multimodal retrieval framework that elevates document hierarchy to a first-class retrieval signal. Instead of simple chunking, HiKEY reconstructs a logical heterogeneous graph via Document Hierarchical Parsing (DHP), explicitly encoding parent-child relationships. Adopting a hierarchical coarse-to-fine strategy, the framework (1) performs global routing to rapidly prune the search space using hierarchical indexing, and (2) conducts fine-grained retrieval to rank sections by employing a multimodal fusion strategy that captures the most discriminative evidence. Finally, HiKEY assembles a token-efficient evidence subgraph via a hybrid structural-semantic packing strategy. Experiments on ODQA benchmarks demonstrate that HiKEY significantly outperforms page- and chunk-based baselines, improving retrieval recall by up to 12.9% and end-to-end QA performance by up to 6.8%.

2605.29601 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Training Deliberative Monitors for Black-Box Scheming Detection

训练审慎监控器用于黑箱策划检测

Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn

发表机构 * Independent(独立) MATS Research(MATS研究) Astra Fellowship Apollo Research(Apollo研究)

AI总结 提出一种基于行动轨迹的审慎监控方法,通过蒸馏前沿模型的推理过程训练开源模型,以低成本高精度检测智能体的策划与破坏行为。

详情
AI中文摘要

随着自主智能体在执行现实任务方面变得愈发强大,区分策划行为与良性任务追求可能成为AI控制的核心问题。现有监控器通常依赖思维链访问或内部激活,或使用提示的前沿模型,这些在部署中可能不可用、不可靠或成本高昂。在本工作中,我们研究仅基于行动的审慎监控器:较小的开源模型,经过训练可从智能体轨迹中检测策划与破坏行为,而无需访问被监控智能体的推理或模型内部。我们的方法受审慎对齐启发,使用策划规范从前沿教师模型中引出结构化推理,通过独立的评判器进行过滤,并通过监督微调和强化学习将最高质量的推理蒸馏到开源监控器中。我们在五个数据集上训练,并在六个分布外智能体失调基准上评估。我们表明,将我们的方法应用于Qwen3.5-27B,其性能优于所有低成本前沿模型作为提示监控器(Gemini 3.1 Flash-Lite、GPT-5.4 Nano和Claude Haiku 4.5)以及Gemini 2.5 Pro,同时实现了更低的边际推理成本(每1000次评估的token计费美元)。更强的提示前沿监控器(Gemini 3.1 Pro、GPT-5.4、Claude Sonnet 4.6和Claude Opus 4.6)实现了更高的性能,但边际推理成本大约高出16-34倍。我们训练的多个监控器在我们评估的监控器中位于经验成本-性能帕累托前沿,为提示前沿模型提供了实用的低成本、低误报率替代方案。

英文摘要

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly 16-34× higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost-performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.

2605.29591 2026-05-29 cs.AI 版本更新

Mind-Omni: A Unified Multi-Task Framework for Brain-Vision-Language Modeling via Discrete Diffusion

Mind-Omni:通过离散扩散实现脑-视觉-语言建模的统一多任务框架

Yizhuo Lu, Changde Du, Qingyu Shi, Hang Chen, Jie Peng, Liuyun Jiang, Shuangchen Zhao, Huiguang He

发表机构 * NeuBCI Lab, State Key Laboratory of Brain Cognition Brain-inspired Intelligence Technology, Institute of Automation, Chinese Academy of Sciences, Beijing, China School of Future Technology, University of Chinese Academy of Sciences, Beijing, China School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China Zhongguancun Academy, Beijing, China Peking University, Beijing, China

AI总结 提出Mind-Omni框架,利用离散扩散范式统一七种编码与解码任务,通过脑分词器将连续脑信号转化为离散令牌,实现多模态交互,并构建脑问答指令调优数据集,在多项任务上达到或超越专用模型性能。

详情
AI中文摘要

建模外部刺激与内部神经表征之间的相互作用是脑机接口(BCI)领域的关键研究方向。以往工作的主要局限性在于普遍采用专门的单任务模型,这限制了通用性并忽略了任务间的协同效应。为解决这一问题,我们提出了Mind-Omni,这是第一个通过离散扩散范式统一七种不同编码和解码任务的通用框架。其核心是一种新颖的脑分词器(Brain Tokenizer),可将异质、连续的脑信号转化为标准化、离散的令牌。这使得在共享语义空间中,任意两个或多个模态之间能够进行直接的令牌级交互,实现相互理解和生成。为了解锁高级推理能力,我们进一步策划了一个专门的脑问答(BQA)指令调优数据集。我们的模型不仅在多任务统一框架中确立了新的最先进水平,还为多任务协同提供了有力证据。通过展示与更大规模专用模型相当甚至有时更优的性能,我们的工作为神经建模提供了强大的新范式,并为神经活动基础模型铺平了道路。代码已公开于https://github.com/ReedOnePeck/Mind-Omni。

英文摘要

Modeling the interplay between external stimuli and internal neural representations is a pivotal research area for Brain-Computer Interfaces (BCIs). A major limitation of prior work is the prevailing paradigm of specialized, single-task models, which curtails versatility and neglects inter-task synergies. To address this, we propose Mind-Omni, the first versatile framework that unifies seven distinct encoding and decoding tasks through a discrete diffusion paradigm. At its core is a novel Brain Tokenizer that transforms heterogeneous, continuous brain signals into standardized, discrete tokens. This enables direct, token-level interactions for mutual understanding and generation between any two or more modalities within a shared semantic space. To unlock advanced reasoning capabilities, we further curate a specialized Brain Question Answering (BQA) instruction-tuning dataset. Our model not only establishes a new state-of-the-art among multi-task unified frameworks but also provides strong evidence for multi-task synergy. By demonstrating performance competitive with, and at times superior to, larger specialized models, our work offers a powerful new paradigm for neural modeling and paves the way for foundation models of neural activity. The code is publicly available at https://github.com/ReedOnePeck/Mind-Omni.

2605.29586 2026-05-29 cs.AI 版本更新

FinVerBench: Benchmark Validity and Calibration in Large Language Model Financial Statement Verification

FinVerBench: 大型语言模型财务报表验证中的基准有效性与校准

Silu Panda

AI总结 提出FinVerBench基准,通过四类错误分类和SEC 10-K XBRL数据,评估LLM在财务报表数值一致性验证中的性能,发现校准和渲染选择显著影响结果,强调构造有效性而非最终排行榜。

Comments 37 pages, 9 figures

详情
AI中文摘要

我们介绍了FinVerBench,一个用于财务报表验证的基准和有效性研究:确定一组公司财务报表在模型所呈现的信息下是否数值一致。FinVerBench基于43家标普500公司的SEC 10-K XBRL文件构建,定义了一个包含算术、跨报表链接、同比和幅度扰动四类错误的分类法。我们尝试了十五种当代LLM评估,并报告了十四次完整运行;Gemini 2.5 Pro的一次运行因40/108次网关调用失败而被排除在主比较之外。所有二值指标排除了其扰动行项目未呈现的不确定正例,留下一个包含105个实例的可观察诊断子集(43个干净,62个注入错误)。在未舍入诊断子集上的原始引导清单提示下,十四次完整LLM运行中有九次在干净报表上产生95-100%的假阳性,而一次运行实现了0%的观察假阳性。基准渲染选择显著影响测量的召回率:在同一可观察子集的现实舍入变体上,校准模型的召回率为79.0%,观察FPR为0%,而在未舍入诊断变体上召回率为100.0%。这些结果支持构造有效性结论而非最终排行榜:财务报表验证不仅仅是算术检测,而是在不完全可观察性、提示诱导假设和现实数值渲染下的校准判断。FinVerBench和所有代码均已公开。

英文摘要

We introduce FinVerBench, a benchmark and validity study for financial statement verification: determining whether a set of corporate financial statements is numerically consistent from the information shown to the model. FinVerBench is built from SEC 10-K XBRL filings for 43 S&P 500 companies and defines a four-category error taxonomy covering arithmetic, cross-statement linkage, year-over-year, and magnitude perturbations. We attempt fifteen contemporary LLM evaluations and report fourteen complete runs; a Gemini 2.5 Pro run is excluded from the main comparison because 40/108 gateway calls failed. All binary metrics exclude underdetermined positive instances whose perturbed line item is not rendered, leaving a 105-instance observable diagnostic subset (43 clean, 62 error-injected). Under the original guided-checklist prompt on the unrounded diagnostic subset, nine of fourteen complete LLM runs produce 95-100% false positives on clean statements, while one run achieves 0% observed false positives. Benchmark rendering choices materially affect measured recall: on a realistic rounded variant of the same observable subset, the calibrated model's recall is 79.0% with 0% observed FPR, compared with 100.0% recall on the unrounded diagnostic variant. These results support a construct-validity conclusion rather than a final leaderboard: financial statement verification is not merely arithmetic detection, but calibrated judgment under incomplete observability, prompt-induced assumptions, and realistic numerical rendering. FinVerBench and all code are publicly available.
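
Why rounded rendering could change what counts as an inconsistency, as the abstract above reports, can be pictured with a simple tolerance check. The sketch below is an illustration only, not FinVerBench's code: the function name `is_consistent`, the `rounding_unit` parameter, and the half-unit-per-figure tolerance are all assumptions introduced here.

```python
def is_consistent(components, total, rounding_unit=0.0):
    """Check whether line items sum to the reported total.

    Hypothetical rule: if every displayed figure was rounded to the nearest
    `rounding_unit`, each can be off by up to half a unit, so n components
    plus the total may legitimately disagree by (n + 1) * unit / 2.
    """
    tolerance = (len(components) + 1) * rounding_unit / 2
    # The small epsilon only absorbs floating-point noise in the sum.
    return abs(sum(components) - total) <= tolerance + 1e-9


if __name__ == "__main__":
    # True values 10.4 + 10.4 = 20.8, displayed in whole units as 10, 10, 21.
    print(is_consistent([10, 10], 21, rounding_unit=1))  # within tolerance
    print(is_consistent([10, 10], 21))                   # exact check fails
    print(is_consistent([10, 10], 25, rounding_unit=1))  # too large to be rounding
```

With a non-zero rounding unit, small injected errors can fall inside the tolerance, which is one plausible reason measured recall could be lower on rounded statements than on unrounded ones.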

2605.28700 2026-05-29 cs.AI cs.CL 版本更新

The Importance of Being Statistically Earnest: A Critical Re-evaluation of GSM-Symbolic

认真统计的重要性:对GSM-Symbolic的批判性再评估

Dominika Agnieszka Długosz, Arlindo Oliveira, Natalia Díaz-Rodríguez

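Affiliations * Instituto Superior Técnico & INESC-ID, Universidade de Lisboa, Portugal; Dept. of Computer Science and AI & DaSCI Institute, Universidad de Granada, Spain

AI summary Re-evaluates 20 open-weight models using generalised linear mixed models with per-question random effects and finds that only half show statistically significant performance changes under the original prompt format. It also identifies a shift toward larger integers in the GSM-Symbolic dataset; controlling for it accounts for significance in roughly half of the remaining cases, indicating that blanket claims about LLM reasoning are statistically premature and mechanistically misleading.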

Comments 38 pages, 11 figures. Submitted to ACL ARR / EMNLP 2026

详情
AI中文摘要

GSM-Symbolic基准测试(Mirzadeh等人,2025)报告了25个大型语言模型(LLM)在GSM8K问题的模板生成变体上测试时出现一致的性能下降,并得出结论认为这些模型缺乏真正的推理能力。我们认为这一结论建立在不可靠的统计基础上。使用具有每问题随机效应的广义线性混合模型重新评估20个开源模型,我们发现只有一半的模型在原始提示格式下表现出统计上显著的性能变化。此外,我们识别出一个先前未被承认的因素:主要GSM-Symbolic数据集相对于GSM-Base,在问题文本中包含系统性地偏移的大整数分布(K-S统计量=0.12,p<0.001),这与原始作者的声明相矛盾。控制这一大数效应后,大约一半剩余案例的显著性得以解释。在具有统计上显著性能差异的模型中,我们识别出不同的、模型特定的失败模式——包括变量绑定的脆弱性、算术限制和双任务干扰——这强调了关于LLM推理的笼统结论在统计上既不成熟,在机制上也是误导性的。

英文摘要

The GSM-Symbolic benchmark (Mirzadeh et al., 2025) reported consistent performance drops across 25 Large Language Models (LLMs) when tested on template-generated variants of GSM8K problems, concluding that the models lack genuine reasoning capabilities. We argue that this conclusion rests on shaky statistical ground. Re-evaluating 20 open-weight models using Generalised Linear Mixed Models with per-question random effects, we find that only half exhibit statistically significant performance changes under the original prompt format. Moreover, we identify a previously unacknowledged factor: the main GSM-Symbolic dataset contains a systematically shifted distribution of larger integers in problem texts relative to GSM-Base (K-S statistic = 0.12, p < 0.001), contradicting the original authors' claims. Controlling for this large number effect accounts for significance in roughly half the remaining cases. Among models with statistically significant performance deltas, we identify distinct, model-specific failure profiles - including fragility of variable binding, arithmetic limitations, and dual-task interference - underscoring that blanket claims about LLM reasoning are both statistically premature and mechanistically misleading.
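The distribution-shift claim above rests on a two-sample Kolmogorov-Smirnov statistic. As a rough illustration of what that number measures, here is a minimal pure-Python sketch; it computes only the statistic (the largest gap between two empirical CDFs), not the p-value, and it is not the authors' analysis code.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d
```

Applied to the integers extracted from two sets of problem texts, a value near 0 means the distributions are nearly identical and a value near 1 means they barely overlap.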

2605.27580 2026-05-29 cs.AI q-bio.NC 版本更新

You Are in Control of Your State: Why Human Outcomes Are Controllable Through Causal State Intervention

你掌控自己的状态:为什么人类结果可以通过因果状态干预来控制

Suraj Biswas, Saurav Gupta, Pritam Mukherjee

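Affiliations * Independent Researchers in Behavioural Modelling, Causal Inference, and Genomics

AI summary Argues that within-person variability in human outcomes arises from a dynamic latent state and that outcomes are controllable through causal state interventions; motivates the framework with six strands of established evidence and observational data from more than 200,000 users.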

Comments 20 pages, 12 figures, 37 references. Companion to a prior SSRN preprint on causal architecture for human modelling

详情
AI中文摘要

行为科学和面向人类的人工智能的一个核心谜题是个体内变异性的持续存在。同一个体在相同的可观察输入下,在不同场合产生不同的结果,而不同个体产生不同的结果,且没有可观察的协变量能完全预测。我们认为,这种变异性属于个体的动态潜在状态,并且通过针对决策形成时刻的状态及其权重的干预,人类结果在精确且可操作的意义上是可控的。我们将状态定义为随时间索引的权重向量,其维度决定个体的生物学、生理学和神经心理学如何将下一个事件处理为决策和结果。状态、决策和结果之间的关系是因果性的而非相关性的。权重向量在亚日时间尺度上是动态的。结果可报告的意识通道是一个狭窄的注意瓶颈,其内容本身依赖于状态。综合这些主张,意味着给定事件的结果在干预时的状态轨迹条件下是可控的。我们通过六条已建立的证据链(因果推断、预测处理、稳态应变、注意瓶颈、时间生物学、计算精神病学)以及一个部署的行为平台(涵盖2023年至2026年研究期间超过20万同意用户,跨越四种职业角色)的24个月观察基础来推动该框架。我们推导出七个可检验的预测,列出了六个状态感知系统的操作要求,并讨论了对数字健康、教育、人工智能个性化和个人能动性的影响。

英文摘要

A central puzzle for the behavioural sciences and for human-facing artificial intelligence is the persistence of within-person variability. The same individual, presented with the same observable input, produces different outcomes on different occasions, and different individuals produce divergent outcomes that no observable covariate fully predicts. We argue that this variability belongs in the dynamic latent state of the person, and that human outcomes are controllable in a precise and operational sense through interventions that target the state and its weighting at the moment a decision is being formed. We define a state as the time-indexed weighting vector over the dimensions that govern how an individual's biology, physiology, and neuropsychology process the next event into a decision and an outcome. The relationship between state, decision, and outcome is causal rather than correlational. The weighting vector is dynamic at sub-daily timescales. The conscious channel through which outcomes are reportable is a narrow attentional bottleneck whose contents are themselves state-dependent. Taken together, these claims imply that the outcome of a given event is controllable, conditionally, on the state-trajectory at the time of intervention. We motivate the framework with six strands of established evidence (causal inference, predictive processing, allostasis, attentional bottleneck, chronobiology, computational psychiatry) and a 24-month observational base from a deployed behavioural platform spanning more than 200,000 consented users across four occupational personas (research period 2023 to 2026). We derive seven testable predictions, list six operational requirements for state-aware systems, and discuss implications for digital health, education, AI personalisation, and personal agency.

2605.27176 2026-05-29 cs.AI 版本更新

The Compressive Knowledge Graph Hypothesis: Which Graph Facts Matter for Scientific Hypothesis Generation?

压缩知识图谱假说:哪些图事实对科学假设生成至关重要?

Shashwat Sourav, Viktoriia Baibakova, Sanjay Das, Ran Elgedawy, Maria Mahbub, Emily Herron, Tirthankar Ghosal

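Affiliations * Washington University in St. Louis; Oak Ridge National Laboratory; Lawrence Berkeley National Laboratory; UniverseTBD

AI summary Perturbs local knowledge graphs (density, ontology richness, topology, and control structure) to evaluate how useful graph context is to different language models when generating battery-materials hypotheses, and proposes a redundancy-aware Compressive KG hypothesis: useful signal is often recoverable from compact subgraphs.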

详情
AI中文摘要

知识图谱(KGs)可以为语言模型提供结构化的科学背景,但目前尚不清楚哪些图事实实际上塑造了生成的假设。我们研究了Mistral-7B、Llama-3.1-70B和Gemini 2.5 Flash在电池材料上的KG引导假设生成。通过改变密度、本体丰富度、拓扑和控制结构来扰动局部KG,并使用提供的图和固定参考指标评估输出。跨模型而言,KG效用是选择性的且依赖于模型:图上下文改变了输出,但无KG输出也从模型先验中恢复了大量图内容。紧凑的top-k子图通常近似于完整KG的行为,包括当声称的结果三元组被排除时。同时,压缩并非唯一依赖于某种语义排序规则,随机和基于拓扑的子集也能恢复大部分信号。这些结果支持一种冗余感知的压缩KG假说:有用的KG信号通常可以从紧凑的、科学结构的子图中恢复,而不是需要完整的局部图。

英文摘要

Knowledge graphs (KGs) can provide structured scientific context to language models, but it remains unclear which graph facts actually shape the generated hypotheses. We study KG-guided hypothesis generation for battery materials across Mistral-7B, Llama-3.1-70B, and Gemini 2.5 Flash. We perturb local KGs by varying density, ontology richness, topology, and control structure, and evaluate outputs with both provided-graph and fixed-reference metrics. Across models, KG utility is selective and model-dependent: graph context changes outputs, but no-KG outputs also recover substantial graph content from model priors. Compact top-k subgraphs often approximate full-KG behavior, including when claimed-outcome triples are held out. At the same time, compression is not unique to one semantic ranking rule: random and topology-based subsets can also recover much of the signal. These results support a redundancy-aware Compressive KG hypothesis: useful KG signal is often recoverable from compact, scientifically structured subgraphs rather than requiring the full local graph.

2605.27078 2026-05-29 cs.LG cs.AI 版本更新

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

两种学习速度:Grokking 和双下降的表征-读出分解

Chi-Ning Chou, Oscar Uzdelewicz, Neng-Chun Chiu, Yao-Yuan Yang, SueYeon Chung

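Affiliations * Center for Computational Neuroscience; Flatiron Institute; Department of Physics; Harvard University; Kempner Institute; New York University

AI summary Explains grokking and epoch-wise double descent by decomposing learning dynamics into two competing processes, representation learning in the encoder and readout calibration in the final classifier, and provides a framework for diagnosing spurious generalization.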

详情
AI中文摘要

训练损失和准确率是用于监控深度神经网络训练过程中泛化性能的标准信号。两个有据可查的现象使这一图景复杂化:在 grokking 中,训练损失迅速下降,而测试性能仅在长时间延迟后突然提升;在 epoch-wise 双下降中,训练损失单调下降,而测试损失或误差先升后降。现有解释通常针对特定任务,缺乏一个任务无关的分析框架来诊断和解释这些现象在现实任务和架构中的表现。我们通过分析学习动态背后的两个竞争过程来应对这一挑战:编码器中的表征学习和最终分类器中的读出校准。利用表征几何、神经正切核和线性探测等工具,我们表明这两个过程在整个训练过程中都是活跃的,它们相对速度的波动导致了看似异常的泛化动态。将表征-读出分解应用于各种任务和架构中的 grokking,我们发现读出在 grokking 发生前偏向训练集,而表征学习是渐进但并非缺失的,这与“从懒惰到丰富”的解释相反。该框架进一步提供了区分虚假泛化和真实泛化的诊断特征:在先前报告的 MNIST grokking 示例和 epoch-wise 双下降示例中,看似延迟或非单调的泛化是由非标准训练配方导致的表征退化和读出失调引起的。总之,这些结果确立了表征-读出分解作为一个自上而下的框架,用于理解学习动态并揭示可解释性研究的基础算法。

英文摘要

Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented phenomena complicate this picture: in grokking, train loss falls rapidly while test performance improves abruptly only after a long delay; in epoch-wise double descent, train loss decreases monotonically while test loss or error rises and falls. Existing accounts are often task-specific, and a task-agnostic analysis framework for diagnosing and explaining these phenomena across realistic tasks and architectures is missing. We address this challenge by analyzing two competing processes that underlie learning dynamics: representation learning in the encoder and readout calibration in the final classifier. Using tools from representational geometry, neural tangent kernels, and linear probing, we show that both processes are active throughout training, with the fluctuations of their relative speed giving rise to seemingly anomalous generalization dynamics. Applying the representation-readout decomposition to grokking across a wide range of tasks and architectures, we find that the readout is train-biased before grokking onset, and representation learning is gradual but not absent, contrary to the lazy-to-rich account. The framework further provides diagnostic signatures distinguishing spurious from genuine generalization: in a previously reported MNIST grokking example and an epoch-wise double descent example, apparent delayed or non-monotone generalization is shown to arise from representation degradation and readout misalignment induced by non-standard training recipes. Together, these results establish the representation-readout decomposition as a top-down framework for understanding learning dynamics and revealing underlying algorithms for interpretability research.

2605.26156 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

将偏见转化为漏洞:基于Bandit引导的LLM裁判风格操纵攻击

Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong

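Affiliations * School of Computing, National University of Singapore, Singapore; Nanyang Technological University, Singapore

AI summary Proposes BITE, a black-box adversarial framework that casts the selection of stylistic edits as a contextual bandit problem and uses a LinUCB policy to adaptively choose edits that mislead LLM judges and artificially inflate scores, with an attack success rate exceeding 65%.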

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

已知LLM裁判中的风格偏见,例如对冗长或特定句子结构的偏好,构成了一个未被充分探索的安全漏洞。在这项工作中,我们引入了BITE(偏见探索与利用),一个黑盒对抗框架,学习保持语义的编辑以误导LLM裁判并人为提高其分配的分数。我们将风格编辑的选择建模为上下文Bandit问题,并使用LinUCB策略自适应地选择编辑,以最大化裁判的分数,而无需访问模型参数或梯度。实验上,我们在多种LLM裁判和任务上测试了BITE,包括聊天机器人排行榜和AI审稿人基准上的逐点和成对比较。BITE实现了超过65%的攻击成功率,并在9分制上将分数提高了1-2分,同时保持了语义等价性。我们进一步评估了攻击的隐蔽性,表明BITE规避了标准的风格控制方法和几种检测基线。我们的发现暴露了LLM作为裁判范式的一个根本弱点,并激励了鲁棒的、对抗感知的评估。我们的代码可在https://github.com/xianglinyang/llm-as-a-judge-attack获取。

英文摘要

The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
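The abstract above names LinUCB as the policy that picks stylistic edits. For readers unfamiliar with it, the following is a minimal sketch of the standard disjoint LinUCB algorithm in pure Python, with one arm per candidate edit type. How BITE builds context features and rewards is not stated here, so the feature vector `x` and the reward values are hypothetical placeholders.

```python
import math


class LinUCB:
    """Disjoint LinUCB: one ridge-regression model per arm (e.g. per edit type)."""

    def __init__(self, n_arms, dim, alpha=1.0):
        self.alpha = alpha  # exploration strength
        # A_inv starts as the identity (ridge regularizer); b starts at zero.
        self.A_inv = [[[float(r == c) for c in range(dim)] for r in range(dim)]
                      for _ in range(n_arms)]
        self.b = [[0.0] * dim for _ in range(n_arms)]

    @staticmethod
    def _matvec(M, v):
        return [sum(m * x for m, x in zip(row, v)) for row in M]

    def score(self, arm, x):
        A_inv_x = self._matvec(self.A_inv[arm], x)
        theta = self._matvec(self.A_inv[arm], self.b[arm])
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = self.alpha * math.sqrt(sum(xi * v for xi, v in zip(x, A_inv_x)))
        return mean + bonus  # upper confidence bound on expected reward

    def select(self, x):
        scores = [self.score(a, x) for a in range(len(self.b))]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, x, reward):
        # Sherman-Morrison rank-one update of (A + x x^T)^-1.
        A_inv_x = self._matvec(self.A_inv[arm], x)
        denom = 1.0 + sum(xi * v for xi, v in zip(x, A_inv_x))
        A_inv = self.A_inv[arm]
        for r in range(len(x)):
            for c in range(len(x)):
                A_inv[r][c] -= A_inv_x[r] * A_inv_x[c] / denom
        self.b[arm] = [bi + reward * xi for bi, xi in zip(self.b[arm], x)]
```

In a black-box setting like the one described, the reward would come only from the judge's returned score, which is why no model parameters or gradients are needed.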

2605.25376 2026-05-29 cs.CR cs.AI cs.CY cs.MA cs.SE 版本更新

KYA: A Framework-Agnostic Trust Layer for Autonomous Systems with Verifiable Provenance and Hierarchical Policy Composition

KYA: 面向自主系统的框架无关信任层,具有可验证溯源和分层策略组合

Kolawole Quadri

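Affiliations * Veldt Labs

AI summary Proposes KYA, a framework-agnostic trust and governance layer whose five primitives make an autonomous system's actions authorized, policy-conforming, and post-hoc verifiable; all 36 cells of a 4 by 9 cross-backend matrix pass, and it detects 89% of 1,200 adversarial probes.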

Comments 26 pages including appendix. Code available under Apache 2.0 at https://github.com/veldtlabs/veldt-kya (pip install veldt-kya). Two-domain worked examples (loan decisioning under NYDFS/ECOA/CFPB; clinical triage under HIPAA/21 CFR Part 11/FDA SaMD). Reproducibility artifacts in-tree

详情
AI中文摘要

KYA(Know Your Agents)是一个开源的、框架无关的自主系统信任与治理层,由五个原语组成:(1)四门入站应用管道;(2)三通道多租户层次结构上的仅收紧组合代数;(3)KYP(Know Your Principal),跨人类用户、AI智能体和服务账户的信任评分的模式级统一;(4)基于AIVSS形状的加性基线的可审计交互乘数放大;(5)两轴委托归因:高风险委托的静态溢价和运行时对多智能体扇出中实际委托不当行为的扣减。这些原语共同涵盖三个支柱(信任、治理和证据保证),使自主系统的行为得到授权、符合策略且事后可验证:其中可观测性回答多长时间、多少量以及什么路径,KYA回答是否被授权、是否合规以及能否验证;它与可观测性互补而非替代。它原生支持15+智能体框架的适配器。在4×9跨后端矩阵上,所有36个单元格均通过测试;纯函数评分器在p99下运行亚毫秒级,系统在20个并发工作线程下维持约1,800 ops/s,且HMAC链完整性端到端保持。KYA检测出来自PyRIT和Garak的1,200个对抗性探测中的89%,包括最近发布的拓扑引导的多智能体攻击。该系统以Apache 2.0许可证在PyPI上以veldt-kya包形式提供。

英文摘要

KYA (Know Your Agents) is an open-source, framework-agnostic trust and governance layer for autonomous systems, composed of five primitives: (1) a four-gate inbound apply pipeline; (2) an only-tighten composition algebra over a three-channel multi-tenant hierarchy; (3) KYP (Know Your Principal), a schema-level unification of trust scoring across human users, AI agents, and service accounts; (4) auditable interaction-multiplier amplification over an AIVSS-shaped additive baseline; and (5) two-axis delegation attribution: a static premium for risky delegates and a runtime debit for actual delegate misbehavior in multi-agent fan-out. Together these span three pillars (trust, governance, and evidentiary assurance), making an autonomous system's actions authorized, policy-conforming, and post-hoc verifiable: where observability answers how long, how much, and what path, KYA answers was it authorized, did it conform, and can it be verified; it composes with observability rather than replacing it. It ships native adapters for 15+ agent frameworks. On a 4 by 9 cross-backend matrix all 36 cells pass; the pure-function scorer runs sub-millisecond at p99 and the system sustains ~ 1,800 ops/sec at 20 concurrent workers with HMAC chain integrity preserved end-to-end. KYA detects 89% of 1,200 adversarial probes from PyRIT and Garak, including the recently-published topology-guided multi-agent attack. The system is available under Apache 2.0 as the veldt-kya package on PyPI.
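The abstract above mentions HMAC chain integrity as the basis for post-hoc verification. The general technique can be sketched with the Python standard library as follows; this is an illustration of a generic HMAC-chained log and makes no claim about the veldt-kya package's actual record format or API. The record fields and the `GENESIS` placeholder are assumptions.

```python
import hashlib
import hmac
import json

GENESIS = "0" * 64  # hypothetical placeholder tag preceding the first record


def append_record(chain, key, payload):
    """Append a record whose tag covers both the payload and the previous tag."""
    prev_tag = chain[-1]["tag"] if chain else GENESIS
    body = json.dumps(payload, sort_keys=True)
    tag = hmac.new(key, (prev_tag + body).encode(), hashlib.sha256).hexdigest()
    chain.append({"payload": payload, "prev": prev_tag, "tag": tag})


def verify_chain(chain, key):
    """Recompute every tag in order; any edit, reorder, or deletion breaks the chain."""
    prev_tag = GENESIS
    for record in chain:
        body = json.dumps(record["payload"], sort_keys=True)
        expected = hmac.new(key, (prev_tag + body).encode(), hashlib.sha256).hexdigest()
        if record["prev"] != prev_tag or not hmac.compare_digest(expected, record["tag"]):
            return False
        prev_tag = record["tag"]
    return True
```

Because each tag depends on the one before it, changing an early record invalidates every later one, which is what makes the log verifiable end to end.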

2605.24934 2026-05-29 cs.RO cs.AI cs.CV cs.LG 版本更新

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

HumanEgo:基于数分钟人类第一视角视频的零样本机器人学习

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos

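Affiliations * University of Maryland

AI summary Proposes HumanEgo, a framework that lifts human demonstrations to an entity-level representation of hand-object interaction and trains a flow matching policy with dense auxiliary objectives, enabling zero-shot, robot-data-free, hardware-agnostic skill transfer from human egocentric video to robots.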

Comments Project page: https://humanego-ai.github.io

详情
AI中文摘要

人类自我中心视频捕捉了丰富的操作演示,无需任何机器人硬件,但由于人类和机器人在视觉外观和运动学上的具身差距,将这些技能迁移到机器人仍然具有挑战性。我们提出了HumanEgo,一个通过将每个人类演示提升为手-物体交互的实体级表示,并训练具有密集辅助目标的流匹配策略来弥合具身差距的框架,该策略放大了每个轨迹的监督信号。HumanEgo无需机器人数据、硬件无关、数据高效且可零样本地从人类迁移到机器人。每个任务仅需30分钟的人类视频,HumanEgo在四个真实世界任务中实现了92.5%的平均成功率(仅15分钟即可达到75%),比匹配时间的机器人遥操作高出41%,并且能够稳健地零样本迁移到新的机器人、相机和环境。我们发布了HumanEgo作为一个易于使用的开源框架,用于直接从人类数据学习机器人策略:https://github.com/TX-Leo/HumanEgo

英文摘要

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo

2605.24460 2026-05-29 cs.CV cs.AI 版本更新

Coarse-to-Fine Domain Incremental Learning with Attentive Distillation for Mining Footprint Segmentation in Multispectral Imagery

面向多光谱影像采矿足迹分割的粗到细领域增量学习与注意力蒸馏

Alif Tri Handoyo, Vincent C. S. Lee, Rizka Widyarini Purwanto, Alex M. Lechner, Deanna Kemp, Muhamad Risqi U. Saputra

Affiliations * Monash University, Indonesia; Monash University, Australia; Northeastern University, China; The University of Queensland, Australia

AI summary Proposes MineC2FNet, a framework that uses coarsely annotated data, a teacher-student architecture, and attentive distillation to improve fine-grained mining footprint segmentation under domain shift.

Comments Accepted at the 35th International Joint Conference on Artificial Intelligence (IJCAI 2026), AI and Social Good track

详情
AI中文摘要

利用遥感和深度学习自动绘制和分割全球采矿足迹对于监测采矿的社会环境风险和影响至关重要,但其进展受到细粒度标注数据稀缺的阻碍。尽管具有粗略边界的大规模数据集广泛可用,但由于显著的领域偏移,利用它们改进细粒度分割具有挑战性。为此,我们提出了MineC2FNet,一种粗到细的领域增量学习框架,利用丰富的粗数据增强细粒度采矿足迹分割。MineC2FNet采用教师-学生架构,在特征和预测层面进行注意力蒸馏,选择性地从粗领域迁移通用知识,同时利用有限的细粒度数据(细领域)实现边界细化。我们进一步引入了一个经过专家验证的数据集,包含219张图像,具有跨不同地理和商品类型的精确边界标注。与包括领域适应和领域增量学习方法在内的最先进方法进行的大量实验表明,MineC2FNet在有效处理领域偏移的同时实现了优越的性能。数据集和代码公开于https://github.com/risqiutama/MineC2FNet。

英文摘要

Automatically mapping and segmenting global mining footprints using remote sensing and deep learning is critical for monitoring the socio-environmental risks and impacts of mining, yet its progress is hindered by the scarcity of fine-grained annotated data. Although large-scale datasets with coarse boundaries are widely available, leveraging them to improve fine-grained segmentation is challenging due to significant domain shift. To address this, we propose MineC2FNet, a coarse-to-fine domain incremental learning framework that exploits abundant coarse data to enhance fine-grained mining footprint segmentation. MineC2FNet adopts a teacher-student architecture with attentive distillation at both the feature and prediction levels, selectively transferring generalized knowledge from the coarse domain while enabling boundary refinement using limited fine-grained data (fine domain). We further introduce an expertly validated dataset of 219 images with precise boundary annotations across diverse geographies and commodities. Extensive experiments against state-of-the-art approaches, including domain adaptation and domain incremental learning methods, demonstrate that MineC2FNet achieves superior performance while effectively handling domain shift. The dataset and code are publicly available at https://github.com/risqiutama/MineC2FNet.

2605.23440 2026-05-29 cs.CL cs.AI 版本更新

SSDAU: Structured Semantic Data Augmentation for Joint Entity and Relation Extraction

SSDAU:面向联合实体关系抽取的结构化语义数据增强

Jiawei He, Mengyu Shi, Jiawei Liu, Dong Sun, Chunrong Fang, Xikai Yang, Zhijie Wang, Lei Ma, Zhenyu Chen

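Affiliations * State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China; Amap, Alibaba Group, China; University of Alberta, Edmonton, Canada; The University of Tokyo, Tokyo, Japan

AI summary Proposes SSDAU, a structured semantic data augmentation method that preserves triple-aware semantic structure, uses context-aware encoding, and applies BERTopic-based filtering to improve generalization in joint entity and relation extraction; it outperforms existing augmentation methods in most settings across several models and datasets.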

Comments 10 pages, 4 figures

详情
AI中文摘要

联合实体关系抽取(JERE)对训练数据质量高度敏感,因此数据增强是提升泛化能力的自然方式。然而,现有增强方法常削弱实体相关性并破坏语义结构,限制了其在JERE中的有效性。本文提出结构化语义数据增强(SSDAU),一种在增强过程中保留三元组感知语义结构的方法。SSDAU按实体标签分割文本,通过上下文感知编码捕获语义特征,并重构实体语义以生成增强数据。为区分语义相似的实体,SSDAU将上下文嵌入与传统相似度评分相结合。为减少主题不一致性,我们应用基于BERTopic的过滤去除不相关的增强样本。我们在不同标注类型的数据集上评估SSDAU,并比较其在五个代表性JERE模型上相对于七个流行增强基线的性能。实验表明,SSDAU生成语义一致的数据,对歧义的鲁棒性优于非LLM方法(平均相对F1下降8.95% vs. 23.58%),并在大多数设置下显著优于强替代方法。

英文摘要

Joint Entity and Relation Extraction (JERE) is highly sensitive to training data quality, making data augmentation a natural way to improve generalization. However, existing augmentation methods often weaken entity relevance and disrupt semantic structure, limiting their effectiveness for JERE. In this paper, we propose Structured Semantic Data Augmentation (SSDAU), a method designed to preserve triple-aware semantic structure during augmentation. SSDAU segments text by entity labels, captures semantic features through context-aware encoding, and restructures entity semantics to generate augmented data. To distinguish semantically similar entities, SSDAU combines contextualized embeddings with traditional similarity scores. To reduce topic inconsistency, we apply BERTopic-based filtering to remove irrelevant augmentations. We evaluate SSDAU on datasets with different annotation types and compare its performance on five representative JERE models against seven popular augmentation baselines. Experiments show that SSDAU generates semantically consistent data, is more robust to ambiguity than non-LLM methods (8.95% vs. 23.58% average relative F1 decrease), and significantly outperforms strong alternatives in most settings.

2605.22771 2026-05-29 cs.CL cs.AI 版本更新

Reducing Political Manipulation with Consistency Training

通过一致性训练减少政治操纵

Long Phan, Devin Kim, Alexander Pan, Alice Blair, Adam Khoja, Dan Hendrycks

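Affiliations * Center for AI Safety; UC Berkeley

AI summary Targets the covert political bias that large language models show on sensitive topics and proposes Political Consistency Training (PCT), which reduces that bias using two metrics, Sentiment Consistency and Helpfulness Consistency, and a matching training paradigm for each.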

详情
AI中文摘要

大型语言模型(LLM)在各种敏感上下文中表现出系统性的政治偏见。我们发现,LLM 对来自对立政治立场的对应话题处理不对称。我们将这种现象称为隐性政治偏见,并识别出其运作的 7 类技术。我们提出了两个隐性偏见指标:情感一致性衡量跨配对政治提示的修辞和框架对称性;帮助一致性衡量深度和参与度的对称性。为了减少这两种隐性偏见,我们引入了政治一致性训练(PCT),这是一种具有两个互补范式的 RL 训练方法:情感一致性训练和帮助一致性训练。我们表明,PCT 保持了整体帮助性,显著减少了隐性政治偏见,并泛化到保留的基准测试中。我们在 https://political-manipulation.ai 发布我们的工作。

英文摘要

Large language models (LLMs) exhibit systematic political bias across a variety of sensitive contexts. We find that LLMs handle counterpart topics from opposing political sides asymmetrically. We refer to this phenomenon as covert political bias and identify 7 categories of techniques through which it operates. We propose two metrics for covert bias: Sentiment Consistency measures symmetry in rhetoric and framing across paired political prompts; Helpfulness Consistency measures symmetric depth and engagement. To reduce both types of covert bias, we introduce Political Consistency Training (PCT), an RL training method with two complementary paradigms: Sentiment Consistency Training and Helpfulness Consistency Training. We show that PCT preserves overall helpfulness, substantially reduces covert political bias, and generalizes to held-out benchmarks. We release our work at https://political-manipulation.ai

2605.16825 2026-05-29 cs.IR cs.AI 版本更新

Echoes in Filter Bubble: Diagnosing and Curing Popularity Bias in Generative Recommenders

过滤气泡中的回声:诊断与治愈生成式推荐系统中的流行度偏差

Jun Yin, Bangguo Zhu, Peng Huo, Ruochen Liu, Hao Chen, Senzhang Wang, Shirui Pan, Chengqi Zhang

Affiliations * Department of Data Science and Artificial Intelligence, Hong Kong Polytechnic University; School of Computer Science and Engineering, Central South University; National Super Computing Center, Tianjin, China; Faculty of Data Science, City University of Macau; School of Information and Communication Technology, Griffith University

AI summary Through theoretical analysis, traces popularity bias in generative recommenders to a token-level optimization flaw combined with the undifferentiated property of item tokenization, and mitigates it with Ghost, a system built on asymmetric unlikelihood optimization and skeleton-founded tokenization.

详情
AI中文摘要

最近,以统一端到端框架为特征的生成式推荐系统(GRs)在转变推荐范式方面展现出惊人的潜力。尽管有效,但我们认识到GRs仍然容易受到长期存在的流行度偏差问题的影响,该问题一直困扰着推荐社区。虽然少数研究尝试将传统的去偏方法扩展到GRs,但其效果有限,且GRs遭受流行度偏差的根本原因仍未得到充分探索。为弥补这一空白,本研究聚焦于GRs中的两个核心方面:生成框架的优化和基于语义索引的物品分词。基于理论分析,我们识别出严重的流行度偏差源于令牌级优化缺陷和物品分词的无差别性共同作用。据此,本研究通过设计非对称非似然优化和基于骨架的分词,开发了一种名为Ghost的新型生成式推荐系统。在三个数据集上进行的广泛实证评估,与多个SOTA基线相比,表明Ghost显著缓解了流行度偏差并促进了更公平的推荐,同时仅对整体推荐效用造成轻微下降。

英文摘要

Recently, Generative Recommenders (GRs), characterized by a unified end-to-end framework, have exhibited astonishing potential in transforming the recommendation paradigm. Despite their effectiveness, we recognize that GRs are still susceptible to the long-standing issue of popularity bias that has pervaded the recommendation community. Although a few studies have attempted to extend traditional debiasing methods to GRs, their effectiveness is marginal, and the fundamental reason why GRs suffer from popularity bias remains under-explored. To bridge this gap, this study focuses on two core aspects in GRs: the optimization of generative framework and the item tokenization based on semantic index. Based on theoretical analyses, we identify that the severe popularity bias emerges from the confluence of a token-level optimization flaw and the undifferentiated property of item tokenization. Accordingly, this study develops a novel generative recommender system, called Ghost, by designing the asymmetric unlikelihood optimization and the skeleton-founded tokenization. Extensive empirical evaluations across three datasets, alongside multiple SOTA baselines, reveal that Ghost substantially alleviates popularity bias and promotes fairer recommendations, while incurring slight degradation to the overall recommendation utility.

2605.14373 2026-05-29 cs.LG cs.AI 版本更新

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

将陈旧梯度转化为稳定梯度:具有隐式景观平滑的相干坐标下降用于轻量级零阶优化

Chen Liang, Xiatao Sun, Qian Wang, Daniel Rakita

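Affiliations * Department of Computer Science, Yale University, New Haven, USA

AI summary Proposes Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient zeroth-order optimizer that exploits the coherence of historical gradients to reach O(1) query complexity per step, and finds that larger finite-difference step sizes implicitly smooth the optimization landscape; in lightweight settings it outperforms existing methods.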

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026); Project page: https://chen-dylan-liang.github.io/CoCD/

详情
AI中文摘要

零阶优化对于反向传播不可用的场景至关重要,例如内存受限的设备端学习和黑盒优化。然而,现有方法面临严峻的权衡:它们要么样本效率低(例如标准有限差分),要么由于随机估计(例如随机子空间方法)而遭受高方差。在这项工作中,我们提出了相干坐标下降(CoCD),一种确定性的、样本高效的、预算感知的零阶优化器。理论上,我们形式化了梯度相干性的概念,并证明CoCD等价于具有“热启动”的块循环坐标下降(BCCD),有效地将历史(陈旧)梯度从负担转化为计算资产。该机制在保持全局下降方向的同时,实现了每步O(1)查询复杂度。此外,我们推导出误差界,揭示了一个反直觉的见解:更大的有限差分步长可以通过降低有效平滑常数来隐式地平滑优化景观,从而提高收敛稳定性。在MLP、CNN和ResNet架构(最多27万个参数)上的实验表明,CoCD在样本效率和收敛损失/准确性方面显著优于BCCD,并且比随机化零阶方法表现出更好的稳定性。我们的结果表明,对于轻量级零阶优化,确定性的、结构感知的更新是随机化的优越替代方案。

英文摘要

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with "warm starts," effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
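The idea of reusing stale gradient entries while spending a constant number of function queries per step can be illustrated with a toy optimizer. The sketch below is not the authors' CoCD algorithm: it is a simplified single-coordinate cyclic scheme, assuming a central difference (two queries per step) and a plain gradient-style update, and every name and default in it is hypothetical.

```python
def stale_gradient_coordinate_descent(f, x0, steps, lr=0.1, h=1e-3):
    """Zeroth-order descent that refreshes one gradient coordinate per step.

    Each step spends two function queries on a central difference for a
    single coordinate, keeps the other (stale) entries from earlier steps,
    and then moves along the whole buffered gradient estimate.
    """
    x = list(x0)
    grad = [0.0] * len(x)  # buffered estimate; entries go stale between refreshes
    for t in range(steps):
        i = t % len(x)  # cyclic coordinate schedule
        x[i] += h
        f_plus = f(x)
        x[i] -= 2 * h
        f_minus = f(x)
        x[i] += h  # restore the coordinate before updating
        grad[i] = (f_plus - f_minus) / (2 * h)
        x = [xi - lr * gi for xi, gi in zip(x, grad)]
    return x
```

On a smooth objective the stale entries remain useful because the gradient changes little between refreshes; a full finite-difference gradient would instead cost two queries per coordinate at every step.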

2605.14113 2026-05-29 cs.CV cs.AI cs.LG cs.MA 版本更新

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent: 通过隐私感知的智能体工作流实现多模态临床可解释性

Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns

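Affiliations * School of Computing and Communications; Lancaster University; Lancaster Medical School; PUC-Rio; Puc-Behring Institute for AI

AI summary Proposes ProtoMedAgent, a framework that constrains generation with a neuro-symbolic bottleneck and a reflective Scribe-Critic loop to address the missing semantic structure of prototype-network outputs and retrieval sycophancy in clinical reporting, and adds a privacy gate governed by k-anonymity and ℓ-diversity.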

Comments CVR 2026

详情
AI中文摘要

尽管可解释的原型网络为临床诊断提供了引人注目的基于案例的推理,但其原始连续输出缺乏医学文档所需的语义结构。通过标准检索增强生成(RAG)弥合这一差距通常会触发“检索谄媚”,即大语言模型(LLM)产生事后合理化幻觉以与视觉预测对齐。我们引入了ProtoMedAgent,一个将多模态临床报告形式化为在严格神经符号瓶颈上的迭代、零梯度测试时优化问题的框架。在冻结的原型骨干上运行,我们将潜在视觉和表格特征蒸馏为离散语义记忆。在线生成严格受限于精确的集合论差分和反射性Scribe-Critic循环,从数学上排除了无根据的叙述性声明。为了安全地限制数据泄露,我们引入了一个由k-匿名和ℓ-多样性控制的语义隐私门控。在4,160名患者临床队列上的评估显示,ProtoMedAgent达到了91.2%的比较集忠实度,从根本上优于标准RAG(46.2%)。ProtoMedAgent还利用一个绑定ℓ-多样性的相变,系统性地将工件级成员推理风险降低了绝对9.8%。

英文摘要

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers "retrieval sycophancy," where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness, substantially outperforming standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
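The privacy gate above is described as governed by k-anonymity and ℓ-diversity. A minimal sketch of those two textbook conditions follows; the record layout, field names, and function name are hypothetical, and the sketch does not reproduce the paper's semantic gate.

```python
from collections import defaultdict


def passes_privacy_gate(records, quasi_keys, sensitive_key, k, l):
    """True if every quasi-identifier group has >= k rows (k-anonymity)
    and >= l distinct sensitive values (distinct l-diversity)."""
    groups = defaultdict(list)
    for row in records:
        groups[tuple(row[q] for q in quasi_keys)].append(row[sensitive_key])
    return all(len(vals) >= k and len(set(vals)) >= l for vals in groups.values())
```

A group can satisfy k-anonymity yet still leak information when all its members share one sensitive value, which is the case the ℓ-diversity condition rules out.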

2605.13230 2026-05-29 cs.LG cs.AI 版本更新

Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

教师引导的策略优化:大策略差异下的同策略推理蒸馏

Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu

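Affiliations * School of Computer Science and Engineering, Northeastern University, China; Tsinghua University; Meituan; Meta AI; NiuTrans Research, Shenyang, China

AI summary Addresses the failure of reverse-KL supervision in on-policy distillation when teacher and student policies diverge widely by proposing Teacher-Guided Policy Optimization (TGPO), in which the teacher directly guides token-level generation on student-generated contexts alongside RLVR-style rewards; it outperforms existing methods on reasoning benchmarks.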

详情
AI中文摘要

同策略蒸馏(OPD)已成为面向推理的大型语言模型(LLM)后训练的一种有前景的范式,特别是与可验证奖励的强化学习(RLVR)结合时。现有的OPD方法依赖于基于反向KL(RKL)的教师监督,对学生策略采样的轨迹进行监督。然而,我们识别出一个关键限制:在教师-学生策略差异大的情况下,RL驱动的探索常常产生教师分布之外的轨迹,导致无信息的负面反馈。为了解决这个问题,我们提出教师引导策略优化(TGPO),一种在策略差异较大时仍然有效的同策略推理蒸馏方法。TGPO不依赖于单纯的评估监督,而是利用教师直接指导基于学生生成上下文的token级生成;结合RLVR风格的轨迹级奖励,TGPO引导探索朝向改进的延续。在推理基准上的实验表明,TGPO始终优于现有的基于RKL的OPD方法,并且在不同教师模型下保持鲁棒性。

英文摘要

On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy. However, we identify a critical limitation: under large teacher-student policy divergence, RL-driven exploration often produces trajectories outside the teacher distribution, resulting in uninformative negative feedback. To address this, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy reasoning distillation method that remains effective under large policy divergence. Rather than relying solely on evaluative supervision, TGPO uses the teacher to directly guide token-level generation conditioned on student-generated contexts; together with RLVR-style trajectory-level rewards, TGPO steers exploration toward improved continuations. Experiments on reasoning benchmarks show that TGPO consistently outperforms existing RKL-based OPD methods and remains robust across different teacher models.
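The reverse-KL signal that existing OPD methods rely on can be made concrete for a single token position. The sketch below computes KL(student || teacher) over a toy vocabulary; the probability vectors and the `eps` floor are illustrative assumptions, and this is the baseline supervision signal, not TGPO itself.

```python
import math


def reverse_kl(student_probs, teacher_probs, eps=1e-12):
    """KL(student || teacher) at one token position.

    Terms are weighted by the student's probability, so the penalty is
    driven by tokens the student favors but the teacher considers unlikely.
    """
    return sum(s * math.log(s / max(t, eps))
               for s, t in zip(student_probs, teacher_probs) if s > 0)
```

When the student concentrates mass on a token the teacher rates at 1%, the value is about log(100), roughly 4.6 nats. Such a score says the choice was bad without indicating a better continuation, which is consistent with the "uninformative negative feedback" the abstract describes.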

2605.11723 2026-05-29 cs.CV cs.AI 版本更新

CaC: Advancing Video Reward Models via Hierarchical Spatiotemporal Concentrating

CaC:通过分层时空聚焦推进视频奖励模型

Jiyuan Wang, Huan Ouyang, Jiuzhou Lin, Chunyu Lin, Dewen Fan, Boheng Zhang, Haonan Fan, Fei Zuo, Jia Sun, Huaiqing Wang, Honglie Wang, Yiyang Fan, Zhenlong Yuan, Zijun Li, Yongrui Heng, Guosheng Lin, Fan Yang, Tingting Gao

发表机构 * BJTU(北京交通大学) NTU(南洋理工大学) BUPT(北京邮电大学) Kuaishou Technology(快手科技)

AI总结 提出基于视觉语言模型的粗到细异常奖励模型CaC,通过全局时间扫描、局部空间定位和结构化时空思维链推理,结合大规模生成视频异常数据集和三阶段渐进训练,显著提升细粒度异常检测精度并减少生成视频异常。

Comments 27 pages, 10 figures

AI中文摘要

在本文中,我们提出了Concentrate and Concentrate (CaC),一种基于视觉语言模型的粗到细异常奖励模型。在推理过程中,它首先进行全局时间扫描以锚定异常时间窗口,然后在局部区间内进行细粒度空间定位,最后通过结构化的时空思维链推理得出稳健判断。为了使模型具备这些能力,我们构建了第一个大规模生成视频异常数据集,包含逐帧边界框注释、时间异常窗口和细粒度归因标签。基于该数据集,我们设计了三阶段渐进训练范式。模型首先通过单帧和多帧监督微调学习空间和时间锚定,然后通过基于两轮组相对策略优化(GRPO)的强化学习策略进行优化。除了传统的准确率奖励,我们引入了时间和空间IoU奖励来监督中间定位过程,有效引导模型进行更扎实和可解释的时空推理。大量实验表明,CaC能够稳定聚焦于细微异常,在细粒度异常基准上实现了25.7%的准确率提升,并且作为奖励信号时,CaC将生成视频异常减少了11.7%,同时提高了整体视频质量。

英文摘要

In this paper, we propose Concentrate and Concentrate (CaC), a coarse-to-fine anomaly reward model based on Vision-Language Models. During inference, it first conducts a global temporal scan to anchor anomalous time windows, then performs fine-grained spatial grounding within the localized interval, and finally derives robust judgments via structured spatiotemporal Chain-of-Thought reasoning. To equip the model with these capabilities, we construct the first large-scale generated video anomaly dataset with per-frame bounding-box annotations, temporal anomaly windows, and fine-grained attribution labels. Building on this dataset, we design a three-stage progressive training paradigm. The model initially learns spatial and temporal anchoring through single- and multi-frame supervised fine-tuning, and is then optimized by a reinforcement learning strategy based on two-turn Group Relative Policy Optimization (GRPO). Beyond conventional accuracy rewards, we introduce Temporal and Spatial IoU rewards to supervise the intermediate localization process, effectively guiding the model toward more grounded and interpretable spatiotemporal reasoning. Extensive experiments demonstrate that CaC can stably concentrate on subtle anomalies, achieving a 25.7% accuracy improvement on fine-grained anomaly benchmarks; when used as a reward signal, it reduces generated-video anomalies by 11.7% while improving overall video quality.
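
The Temporal and Spatial IoU rewards mentioned above build on the standard intersection-over-union measure. The following is a minimal sketch of that measure for time windows and bounding boxes; the function names and the `(start, end)` / `(x1, y1, x2, y2)` conventions are assumptions for illustration, and the abstract does not say how CaC scales or combines these terms into a reward.

```python
def temporal_iou(pred, gt):
    """IoU of two time windows given as (start, end)."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

# Hypothetical prediction vs. annotation.
t_iou = temporal_iou((2.0, 6.0), (4.0, 8.0))    # overlap 2 s over a 6 s union
s_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))     # overlap area 1 over union 7
print(t_iou, s_iou)
```

Both values lie in [0, 1], so they can serve as dense intermediate signals alongside a binary accuracy reward.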

2605.05155 2026-05-29 cs.CV cs.AI 版本更新

Aes3D: Aesthetic Assessment in 3D Gaussian Splatting

Aes3D: 3D高斯泼溅中的美学评估

Chuanzhi Xu, Boyu Wei, Haoxian Zhou, Xuanhua Yin, Zihan Deng, Haodong Chen, Qiang Qu, Weidong Cai

发表机构 * The University of Sydney(悉尼大学) The University of Hong Kong(香港大学)

AI总结 针对3D高斯泼溅场景缺乏美学评估的问题,提出首个系统框架Aes3D,包含专用数据集Aesthetic3D和轻量级模型Aes3DGSNet,直接预测场景级美学分数,无需渲染多视图图像。

AI中文摘要

随着3D高斯泼溅(3DGS)在沉浸式媒体和数字内容创作中受到关注,评估3D场景的美学对于帮助创作者构建更具视觉吸引力的3D内容变得重要。然而,现有的3D场景评估方法主要强调重建保真度和感知真实感,在很大程度上忽略了构图、和谐度和视觉吸引力等更高层次的美学属性。这一局限性源于两个关键挑战:(1)缺乏带有美学标注的通用3DGS数据集,以及(2)3DGS作为低级基元表示的内在性质,使其难以捕捉高级美学特征。为应对这些挑战,我们提出Aes3D,这是首个用于评估3D神经渲染场景美学的系统框架。Aes3D包含Aesthetic3D,这是首个专用于3D场景美学评估的数据集,基于我们提出的3D场景美学标注策略构建。此外,我们提出Aes3DGSNet,一个轻量级模型,可直接从3DGS表示预测场景级美学分数。值得注意的是,我们的模型仅基于3D高斯基元运行,无需渲染多视图图像,从而降低了计算成本和硬件要求。通过对多视图3DGS场景表示进行美学监督学习,Aes3DGSNet有效捕获高级美学线索并准确回归美学分数。实验结果表明,我们的方法在保持轻量级设计的同时实现了强劲性能,为3D场景美学评估建立了新基准。代码和数据集将在未来版本中提供。

英文摘要

As 3D Gaussian Splatting (3DGS) gains attention in immersive media and digital content creation, assessing the aesthetics of 3D scenes becomes important in helping creators build more visually compelling 3D content. However, existing evaluation methods for 3D scenes primarily emphasize reconstruction fidelity and perceptual realism, largely overlooking higher-level aesthetic attributes such as composition, harmony, and visual appeal. This limitation comes from two key challenges: (1) the absence of general 3DGS datasets with aesthetic annotations, and (2) the intrinsic nature of 3DGS as a low-level primitive representation, which makes it difficult to capture high-level aesthetic features. To address these challenges, we propose Aes3D, the first systematic framework for assessing the aesthetics of 3D neural rendering scenes. Aes3D includes Aesthetic3D, the first dataset dedicated to 3D scene aesthetic assessment, built on our proposed annotation strategy for 3D scene aesthetics. In addition, we present Aes3DGSNet, a lightweight model that directly predicts scene-level aesthetic scores from 3DGS representations. Notably, our model operates solely on 3D Gaussian primitives, eliminating the need for rendering multi-view images and thus reducing computational cost and hardware requirements. Through aesthetics-supervised learning on multi-view 3DGS scene representations, Aes3DGSNet effectively captures high-level aesthetic cues and accurately regresses aesthetic scores. Experimental results demonstrate that our approach achieves strong performance while maintaining a lightweight design, establishing a new benchmark for 3D scene aesthetic assessment. Code and datasets will be made available in a future version.

2605.00969 2026-05-29 cs.SD cs.AI cs.CL 版本更新

MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical Audio

MedMosaic:一个具有挑战性的多样化医学音频大规模基准

Harshit Rajgarhia, Shuubham Ojha, Asif Shaik, Akhil Pothanapalli, Rachuri Lokesh, Abhishek Mukherji, Prasanna Desikan

发表机构 * Centific Global Solutions Inc.(Centific全球解决方案公司) University of Maryland, College Park, MD, USA(马里兰大学帕克分校)

AI总结 为解决医学音频数据稀缺和现有基准不足的问题,提出MedMosaic数据集,包含多种医学音频类型和46701个问答对,用于评估语言和音频推理模型,实验表明推理仍具挑战性。

Comments Accepted at ICML 2026

AI中文摘要

由于隐私法规和领域专业知识导致的高注释成本,医学音频数据难以收集。因此,现有基准往往未能充分代表复杂的医学音频场景。为应对这一挑战,我们提出了MedMosaic,一个医学音频问答数据集,旨在在现实临床约束下对语言和音频推理模型进行基准测试。MedMosaic包含多种医学音频类型,包括与疾病相关的生理声音、精心构建的模拟带有伪影的语音的合成声音,以及模拟不同上下文长度的真实短篇和长篇临床对话。该数据集还包含总共46,701个问答对,涵盖多项选择、顺序多轮和开放式问答等类别,从而能够系统评估多跳推理和答案生成能力。对13个音频和多模态推理模型的基准测试显示,推理对所有评估系统仍然具有挑战性,且在不同问题类型上表现差异显著。特别是,即使是像Gemini-2.5-pro这样的最先进模型也只能达到约68.1%的准确率。这些发现强调了医学推理中的持续局限性,并凸显了对更鲁棒、特定领域的多模态推理模型的需求。基准数据样本可在此处获取:https://shorturl.at/Lyp33

英文摘要

Medical audio data is difficult to collect due to privacy regulations and high annotation costs arising from the need for domain expertise. Thus, existing benchmarks tend to underrepresent complex medical audio scenarios. To address this challenge, we present MedMosaic, a medical audio question-answering dataset designed to benchmark language and audio reasoning models under realistic clinical constraints. MedMosaic features a diverse range of medical audio types, including condition-related physiological sounds, carefully constructed synthetic voices that mimic speech with artifacts, and real short- and long-form clinical conversations that model varying context lengths. The dataset also features a total of 46,701 question-answer pairs, spanning categories such as multiple-choice, sequential multi-turn, and open-ended question answering, enabling systematic evaluation of multi-hop reasoning and answer generation capabilities. Benchmarking 13 audio and multimodal reasoning models reveals that reasoning remains challenging for all evaluated systems, with substantial performance variation across question types. In particular, even a state-of-the-art model such as Gemini-2.5-pro achieves only about 68.1% accuracy. These findings underscore persistent limitations in medical reasoning and highlight the need for more robust, domain-specific multimodal reasoning models. A sample of benchmark data is available here: https://shorturl.at/Lyp33

2604.27272 2026-05-29 cs.CL cs.AI cs.LG 版本更新

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

当2D任务遇到1D序列化:结构化任务中的序列化摩擦

Chung-Hsiang Lo, Lu Li, Diji Yang, Tianyu Zhang, Yunkai Zhang, Yoshua Bengio, Yi Zhang

发表机构 * Northeastern University(东北大学) University of Pennsylvania(宾夕法尼亚大学) UC Santa Cruz(加州大学圣克鲁兹分校) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Montreal(蒙特利尔大学) BAIR, UC Berkeley(加州大学伯克利分校BAIR实验室)

AI总结 研究通过矩阵转置、康威生命游戏和LU分解三个任务,发现将二维布局任务序列化为一维文本会因表示不匹配导致性能下降,且错误呈现空间结构模式。

AI中文摘要

在LLM时代,许多符号化和结构化问题通过一维文本序列化呈现给模型。然而,其中一些问题本质上是二维的:它们的相关关系,如行列对应或空间邻接,由二维布局中的位置定义,而非顺序。这引发了一个表示问题:在一维序列中保留相同的符号条目是否也保留了计算所需的关系结构?我们通过序列化摩擦的视角研究这一问题:即相同底层任务实例和条目仍然存在,但依赖于布局的关系在一维序列化下变得隐式的表示不匹配。本研究使用三个受控合成测试任务:矩阵转置、康威生命游戏和LU分解。在每个任务中,相同的实例要么作为一维文本序列化呈现,要么作为其原生二维布局渲染为图像呈现。在整个测试集中,随着任务规模增长,一维序列化的性能下降更显著,且序列化下的错误呈现空间结构模式,表明这种呈现选择在我们的测试集中具有重要影响。为了进一步解释这些结果,我们添加了补充分析,包括视觉内探针以及混合训练转置设置下两种输入呈现的额外比较。这些发现表明,对于布局定义的任务,将输入简化为1D序列化并非中性的表示选择。

英文摘要

In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are natively two-dimensional: their relevant relations, such as row--column correspondence or spatial adjacency, are defined by position in a 2D layout rather than by sequential order. This raises a representational question: does preserving the same symbolic entries in a 1D sequence also preserve the relational structure needed for computation? We study this issue through the lens of serialization friction: the representational mismatch in which the same underlying task instances and entries are still present, but relations that depend on layout become implicit under 1D serialization. The study uses a controlled synthetic testbed of three tasks: matrix transpose, Conway's Game of Life, and LU decomposition. In each task, the same instances are presented either as 1D text serialization or as their native 2D layout rendered as an image. Across this testbed, 1D serialization degrades more sharply as task size grows, and errors under serialization exhibit spatially structured patterns, suggesting that this presentation choice is consequential within our testbed. To further interpret these results, we add supplementary analyses that include a within-visual probe and an additional comparison of the two input presentations under the mixed-training transpose setting. These findings suggest that, for layout-defined tasks, reducing inputs to 1D serialization is not a neutral choice of representation.
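
The notion of serialization friction can be made concrete with matrix transpose, one of the three tasks named above. The sketch below is an illustration only, not the paper's benchmark code: it assumes a row-major serialization and shows that, once the matrix is flattened, the row and column relation is no longer visible in the sequence and has to be recovered through index arithmetic that depends on the matrix shape.

```python
def serialize(matrix):
    """Row-major flattening of a list-of-lists matrix."""
    return [x for row in matrix for x in row]

def transpose_flat(flat, rows, cols):
    """Transpose a row-major flat list without rebuilding the 2D layout."""
    out = [None] * (rows * cols)
    for i, value in enumerate(flat):
        r, c = divmod(i, cols)      # recover the implicit 2D position
        out[c * rows + r] = value   # position of (c, r) in the transposed layout
    return out

matrix = [[1, 2, 3],
          [4, 5, 6]]
flat = serialize(matrix)                        # [1, 2, 3, 4, 5, 6]
flat_t = transpose_flat(flat, rows=2, cols=3)   # [1, 4, 2, 5, 3, 6]
print(flat, flat_t)
```

In the 2D layout, the elements 1 and 4 are vertical neighbours; in the flat sequence they sit `cols` positions apart, and that distance changes with the matrix width. This is the kind of layout-dependent relation the abstract describes as becoming implicit.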

2604.26645 2026-05-29 cs.AI cs.LG 版本更新

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

SciHorizon-DataEVA:面向异构科学数据AI就绪性评估的智能体系统

Dianyu Liu, Chuan Qin, Xi Chen, Xiaohan Li, Wenxi Xu, Yuyang Wang, Xin Chen, Yuanchun Zhou, Hengshu Zhu

发表机构 * SciHorizon Team, Computer Network Information Center, Chinese Academy of Sciences(科学前沿团队,计算机网络信息中心,中国科学院)

AI总结 提出SciHorizon-DataEVA智能体系统,基于Sci-TQA2原则和层次化多智能体评估方法,实现对异构科学数据的可扩展AI就绪性评估。

AI中文摘要

AI-for-Science (AI4Science) 正通过将机器学习模型嵌入跨领域的预测、模拟和假设生成工作流程,日益变革科学发现。然而,这些模型的有效性从根本上受到科学数据AI就绪性的限制,目前尚不存在可扩展且系统的评估机制。在这项工作中,我们提出了SciHorizon-DataEVA,一种新颖的智能体系统,用于对异构科学数据进行可扩展的AI就绪性评估。在评估标准层面,我们引入了Sci-TQA2原则,将AI就绪性组织为四个互补维度:治理可信度、数据质量、AI兼容性和科学适应性。每个维度被分解为可测量的原子元素,以实现细粒度且可执行的评估。为了大规模实施这些原则,我们开发了Sci-TQA2-Eval,一种通过有向循环工作流编排的层次化多智能体评估方法。我们的Sci-TQA2-Eval通过结合轻量级数据集分析、适用性感知的度量激活以及基于领域约束和数据集-论文信号的知识增强规划,动态构建数据集感知的评估规范。这些规范通过自适应的、以工具为中心的评估机制执行,该机制具有内置的验证和自我修正能力,从而实现对异构科学数据的可扩展且可靠的评估。在跨多个领域的科学数据集上的广泛实验证明了SciHorizon-DataEVA在原则性AI就绪性评估方面的有效性和通用性。

英文摘要

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.

2604.23862 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Graph Memory Transformer (GMT)

图记忆Transformer (GMT)

Nicola Zanarini, Niccolò Ferrari, Evelina Lamma

发表机构 * Bonfiglioli Engineering s.r.l.(博尼菲利工程公司) Department of Engineering, University of Ferrara(费拉拉大学工程学院) NAIS s.r.l.(NAIS公司)

AI总结 提出用显式学习的记忆图替换解码器-only Transformer中的前馈网络子层,保留自回归架构,实现可解释的记忆导航。

Comments 65 pages, 10 figures, 5 tables. Author list updated in arXiv metadata; no technical changes. Code available at https://github.com/Nemesis533/GMT-GraphMemoryTransformer

AI中文摘要

我们研究是否可以在解码器-only Transformer中,用显式学习的记忆图替换前馈网络(FFN)子层,同时保留周围的自回归架构。所提出的图记忆Transformer(GMT)保持因果自注意力不变,但将通常的逐token FFN变换替换为一个记忆单元,该单元通过一个由学习的有向转移矩阵连接的质心库来路由token表示。在此处研究的基础GMT v7实例中,16个Transformer块中的每个块包含128个质心、一个128*128的边矩阵、引力源路由、token条件目标选择以及门控位移读出。因此,该单元返回从估计的源记忆状态到目标记忆状态的移动,而不是检索到的值。由此产生的模型是一个完全解码器-only的语言模型,具有82.2M可训练参数且没有密集的FFN子层,而评估中使用的密集GPT风格基线有103.0M参数。基础v7模型训练稳定,并将质心使用、转移结构和源到目标移动作为前向计算中可直接检查的量。在验证损失和困惑度方面,它落后于较大的密集基线(3.5995/36.58 vs. 3.2903/26.85),但在评估设置下显示出接近的零样本基准表现。这些结果并非旨在声称最先进性能;它们支持用图介导的记忆导航替换密集的token内变换的可行性和结构可解释性。更广泛的扩展、优化的内核以及更广泛的基准评估留待后续工作。

英文摘要

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
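
The paired figures reported above (3.5995/36.58 vs. 3.2903/26.85) are consistent with the usual relation between loss and perplexity, perplexity = exp(loss). The short check below assumes the validation loss is a mean per-token cross-entropy in nats, which the abstract does not state explicitly.

```python
import math

def perplexity(mean_nll):
    """Perplexity from a mean per-token negative log-likelihood in nats."""
    return math.exp(mean_nll)

gmt_ppl = perplexity(3.5995)     # base GMT v7 validation loss from the abstract
dense_ppl = perplexity(3.2903)   # dense GPT-style baseline loss from the abstract
print(f"GMT: {gmt_ppl:.2f}, dense baseline: {dense_ppl:.2f}")
```

Under that assumption, the loss gap of about 0.31 nats corresponds to a perplexity ratio of roughly 1.36 between the two models.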

2604.14889 2026-05-29 cs.AI 版本更新

MemoSight: Unifying Context Compression and Multi Token Prediction for Reasoning Acceleration

MemoSight: 统一上下文压缩与多令牌预测以加速推理

Xinyu Liu, Xin Liu, Bo Jin, Runsong Zhao, Pengcheng Huang, Junhao Ruan, Bei Li, Chunyang Xiao, Chenglong Wang, Tong Xiao, Jingbo Zhu

发表机构 * School of Computer Science and Engineering, Northeastern University, China(东北大学计算机科学与工程学院) Meituan Inc.(美团公司) NiuTrans Research, Shenyang, China(小牛翻译研究院,沈阳,中国)

AI总结 提出 MemoSight 框架,通过特殊令牌和位置布局统一上下文压缩与多令牌预测,在保持思维链推理性能的同时减少 KV 缓存使用并提升推理速度。

AI中文摘要

虽然思维链推理使大型语言模型能够解决具有挑战性的推理任务,但 KV 缓存的线性增长导致了大量的内存和推理开销。现有方法如上下文压缩和多令牌预测通过压缩历史令牌和并行生成未来令牌两个互补方向提高效率。然而,由于它们不同的训练范式和架构假设,有效结合它们仍然具有挑战性。在这项工作中,我们提出 MemoSight(基于记忆与前瞻的推理),一个统一框架,集成了上下文压缩和多令牌预测,以提高推理效率同时保持思维链性能。MemoSight 采用基于特殊令牌和令牌特定位置布局的共享极简设计,用于压缩和并行预测。在四个推理基准上的实验表明,与普通 SFT 基线相比,MemoSight 将 KV 缓存使用减少高达 66%,推理速度提升 56%,同时平均推理准确率下降不到 3%,相比现有的思维链压缩方法实现了更好的效率-准确率权衡。

英文摘要

While chain-of-thought (CoT) reasoning enables LLMs to solve challenging reasoning tasks, the linear growth of the KV cache leads to substantial memory and inference overhead. Existing approaches such as context compression and multi-token prediction (MTP) improve efficiency from two complementary directions by compressing historical tokens and generating future tokens in parallel. However, effectively combining them remains challenging due to their different training paradigms and architectural assumptions. In this work, we propose MemoSight (Memory-Foresight-Based Reasoning), a unified framework that integrates context compression and MTP to improve inference efficiency while preserving CoT performance. MemoSight adopts a shared minimalist design based on special tokens and token-specific positional layouts for both compression and parallel prediction. Experiments on four reasoning benchmarks show that, compared to the vanilla SFT baseline, MemoSight reduces KV cache usage by up to 66% and improves inference speed by 56%, while incurring less than a 3% drop in average reasoning accuracy, yielding a better efficiency-accuracy trade-off than existing CoT compression methods.

2604.11665 2026-05-29 cs.NE cs.AI 版本更新

Beyond LLMs, Sparse Distributed Memory, and Neuromorphics <A Hyper-Dimensional SRAM-CAM "VaCoAl" for Ultra-High Speed, Ultra-Low Power, and Low Cost>

超越LLM、稀疏分布式记忆和神经形态计算:一种用于超高速、超低功耗和低成本的超维SRAM-CAM“VaCoAl”

Hiroyuki Chuma, Kanji Otsuka, Yoichi Sato

发表机构 * Hitotsubashi University(一桥大学) Meisei University(明星大学) Shuhari System(Shuhari系统)

AI总结 本文提出VaCoAl算法,通过超维计算和伽罗瓦域扩散,在SRAM/DRAM-CAM上实现可逆、可审计的多跳推理,解决灾难性遗忘和绑定问题,并在维基数据上验证了其路径依赖的STDP式选择机制。

Comments 57 pages, 4 figures, 18 tables

AI中文摘要

本文报告了一个意外发现:在一个确定性超维计算(HDC)架构中,该架构反转了伽罗瓦域代数的传统角色——不是将其用于纠错以获得唯一答案,而是作为相对相似性和路径质量排序的引擎——出现了一种路径依赖的语义选择机制,等效于尖峰时序依赖可塑性(STDP),其幅度可通过封闭形式表达式先验预测,并与测量值匹配。为了解决灾难性遗忘、学习停滞和绑定问题,我们在代数层面提出了VaCoAl(模糊巧合算法)及其Python实现PyVaCoAl,运行于超高维SRAM/DRAM-CAM上。该算法根植于稀疏分布式记忆,通过伽罗瓦域扩散解决高维二进制空间中的正交化和检索问题,实现低负载部署。关键的是,VaCoAl将认知边界——前沿大小——嵌入其架构中,通过路径积分置信度(CR2)对候选进行排序,以实现组合泛化;这种有限理性设计产生了STDP式的选择,而纠错范式在结构上无法实现。我们评估了来自Wikidata的约47万条导师-学生关系上的多跳推理,追踪了多达57代(超过2550万条路径)。基于CR去噪的HDC捆绑和解绑操作量化了概念在DAG上的传播。结果显示了对牛顿-莱布尼茨争议的重新解释,以及从稀疏收敛到后莱布尼茨“超级高速公路”的相变,结构指标支持库恩范式转变。VaCoAl因此定义了第三范式——HDC-AI,以可逆、可审计的多跳推理补充LLM。

英文摘要

This paper reports an unexpected finding: in a deterministic hyperdimensional computing (HDC) architecture that inverts the conventional role of Galois-field algebra -- employing it not for error correction toward a unique answer but as an engine for relative similarity and path-quality ranking -- a path-dependent semantic selection mechanism emerges, equivalent to spike-timing-dependent plasticity (STDP), with magnitude predictable a priori from a closed-form expression matching measured values. Addressing catastrophic forgetting, learning stagnation, and the Binding Problem at an algebraic level, we propose VaCoAl (Vague Coincident Algorithm) and its Python implementation PyVaCoAl on ultra-high-dimensional SRAM/DRAM-CAM. Rooted in Sparse Distributed Memory, it resolves orthogonalisation and retrieval in high-dimensional binary spaces via Galois-field diffusion, enabling low-load deployment. Crucially, VaCoAl embeds a cognitive bound -- the Frontier Size -- into its architecture, ranking candidates by path-integral confidence (CR2) to achieve compositional generalisation; this bounded-rationality design produces STDP-like selection that error-correction paradigms structurally cannot attain. We evaluated multi-hop reasoning on about 470k mentor-student relations from Wikidata, tracing up to 57 generations (over 25.5M paths). HDC bundling and unbinding with CR-based denoising quantify concept propagation over DAGs. Results show a reinterpretation of the Newton-Leibniz dispute and a phase transition from sparse convergence to a post-Leibniz "superhighway", with structural indicators supporting a Kuhnian paradigm shift. VaCoAl thus defines a third paradigm, HDC-AI, complementing LLMs with reversible, auditable multi-hop reasoning.

2604.10511 2026-05-29 cs.AI cs.CL 版本更新

Thinking Fast, Thinking Wrong: Intuitiveness Modulates LLM Counterfactual Reasoning in Policy Evaluation

快思考,错思考:直觉性调节LLM在政策评估中的反事实推理

Yanjie He

发表机构 * Independent Researcher(独立研究者)

AI总结 本研究构建了一个基于经济学和社会科学实证案例的基准,通过8000次实验评估大型语言模型在政策评估中的反事实推理,发现链式思维提示在反直觉案例中效果显著减弱,且直觉性是主导因素,表明模型存在知识-推理分离。

Comments 10 pages, 6 figures, 6 tables

AI中文摘要

大型语言模型(LLM)越来越多地用于因果和反事实推理,但它们在现实世界政策评估中的可靠性仍未得到充分探索。我们构建了一个包含40个实证政策评估案例的基准,这些案例来自经济学和社会科学,每个案例都基于同行评审的证据,并按直觉性分类,即实证结果是符合(明显)、难以判断是否符合(模糊)还是违背(反直觉)常见的先验预期。我们评估了四个前沿LLM,采用五种提示策略,进行了8000次实验试验,并使用混合效应逻辑回归分析结果。我们的发现揭示了三个关键结果:(1)链式思维(CoT)悖论,即链式思维提示在明显案例上显著提升性能,但在反直觉案例上这种收益大幅减弱(交互OR = 0.278,p < 0.001);(2)直觉性是主导因素,案例层面的方差超过模型选择或提示策略(ICC = 0.671);(3)知识-推理分离,基于引用的熟悉度与准确性无关(p = 0.84),表明模型拥有相关知识,但当结果与直觉相悖时无法利用这些知识进行推理。我们从双过程理论(系统1与系统2)的视角解读这些结果,并认为当前LLM的“慢思考”仅实现了对直觉先验的部分抑制,具备了审慎推理的形式,却未能完全实现其实质。

英文摘要

Large language models (LLMs) are increasingly used for causal and counterfactual reasoning, yet their reliability in real-world policy evaluation remains underexplored. We construct a benchmark of 40 empirical policy evaluation cases drawn from economics and social science, each grounded in peer-reviewed evidence and classified by intuitiveness -- whether the empirical finding aligns with (obvious), is unclear relative to (ambiguous), or contradicts (counter-intuitive) common prior expectations. We evaluate four frontier LLMs across five prompting strategies with 8,000 experimental trials and analyze the results using mixed-effects logistic regression. Our findings reveal three key results: (1) a chain-of-thought (CoT) paradox, where chain-of-thought prompting dramatically improves performance on obvious cases but this benefit is substantially attenuated on counter-intuitive ones (interaction OR = 0.278, $p < 0.001$); (2) intuitiveness as the dominant factor, with case-level variance exceeding that of model choice or prompting strategy (ICC = 0.671); and (3) a knowledge-reasoning dissociation, where citation-based familiarity is unrelated to accuracy ($p = 0.84$), suggesting models possess relevant knowledge but fail to reason with it when findings contradict intuition. We frame these results through the lens of dual-process theory (System 1 vs. System 2) and argue that current LLMs' "slow thinking" achieves only partial inhibition of intuitive priors -- producing the form of deliberative reasoning without fully delivering its substance.

2604.10228 2026-05-29 cs.AI 版本更新

SVSR: A Self-Verification and Self-Rectification Paradigm for Multimodal Reasoning

SVSR:一种用于多模态推理的自我验证与自我修正范式

Zhe Qian, Nianbing Su, Zhonghua Wang, Hebei Li, Zhongxing Xu, Yueying Li, Fei Luo, Zhuohan Ouyang, Yanbiao Ma

发表机构 * South China Agricultural University(华南农业大学) University of Glasgow(格拉斯哥大学) University of Electronic Science and Technology of China(电子科技大学) Monash University(莫纳什大学) University of Science and Technology of China(中国科学技术大学) National University of Defense Technology(国防科技大学) Renmin University of China(中国人民大学) South China Normal University(华南师范大学)

AI总结 提出SVSR框架,通过三阶段训练(统一偏好数据集构建、冷启动监督微调、半在线直接偏好优化)将自我验证与自我修正显式集成到多模态推理流程中,提升复杂视觉理解和多模态推理的鲁棒性与可靠性。

AI中文摘要

当前多模态模型常存在浅层推理问题,导致因不完整或不一致的思维过程而产生错误。为解决这一局限,我们提出自我验证与自我修正(SVSR)统一框架,将自我验证和自我修正显式集成到模型的推理流程中,显著提升复杂视觉理解和多模态推理任务的鲁棒性与可靠性。SVSR基于一种新颖的三阶段训练范式。首先,通过精炼预训练视觉语言模型的推理轨迹,结合前向和后向推理嵌入自我反思信号,构建高质量统一偏好数据集。其次,在该数据集上进行冷启动监督微调,学习结构化、多步推理行为。第三,应用半在线直接偏好优化(Semi-online DPO)过程,通过强大的教师VLM筛选的高质量模型生成推理轨迹持续增强训练语料。该流程使模型能够学习、激发并精炼其自我验证与自我修正能力。跨多个基准的广泛实验表明,SVSR提升了推理准确性,并增强了对未见任务和问题类型的泛化能力。值得注意的是,经过显式自我反思推理训练后,模型还展现出改进的隐式推理能力,即使在没有显式推理轨迹的情况下也优于强基线。这些结果凸显了SVSR在构建更可靠、内省且认知对齐的多模态系统方面的潜力。

英文摘要

Current multimodal models often suffer from shallow reasoning, leading to errors caused by incomplete or inconsistent thought processes. To address this limitation, we propose Self-Verification and Self-Rectification (SVSR), a unified framework that explicitly integrates self-verification and self-rectification into the model's reasoning pipeline, substantially improving robustness and reliability in complex visual understanding and multimodal reasoning tasks. SVSR is built on a novel three-stage training paradigm. First, we construct a high-quality unified preference dataset by refining reasoning traces from pre-trained vision-language models, incorporating both forward and backward reasoning to embed self-reflective signals. Second, we perform cold-start supervised fine-tuning on this dataset to learn structured, multi-step reasoning behaviors. Third, we apply a Semi-online Direct Preference Optimization (Semi-online DPO) process, continuously augmenting the training corpus with high-quality, model-generated reasoning traces filtered by a powerful teacher VLM. This pipeline enables the model to learn, elicit, and refine its ability to self-verify and self-rectify. Extensive experiments across diverse benchmarks demonstrate that SVSR improves reasoning accuracy and enables stronger generalization to unseen tasks and question types. Notably, once trained with explicit self-reflective reasoning, the model also exhibits improved implicit reasoning ability, outperforming strong baselines even when no explicit reasoning traces are provided. These results highlight the potential of SVSR for building more dependable, introspective, and cognitively aligned multimodal systems.

2604.10219 2026-05-29 cs.AI 版本更新

Cognitive Pivot Points and Visual Anchoring: Unveiling and Rectifying Hallucinations in Multimodal Reasoning Models

认知支点与视觉锚定:揭示并纠正多模态推理模型中的幻觉

Zhe Qian, Yanbiao Ma, Zhuohan Ouyang, Zhonghua Wang, Zhongxing Xu, Fei Luo, Xinyu Liu, Zongyuan Ge, Yike Guo, Jungong Han

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) South China Agricultural University(华南农业大学) South China Normal University(华南师范大学) Monash University(莫纳什大学) Jishou University(吉首大学) Hong Kong University of Science and Technology(香港科技大学) Tsinghua University(清华大学)

AI总结 针对多模态大推理模型在长链推理中易产生幻觉的问题,提出V-STAR训练范式,通过分层视觉注意力奖励和强制反思机制,将视觉锚定引入推理过程以减轻幻觉。

Comments TPAMI under review

AI中文摘要

多模态大推理模型(MLRMs)通过测试时计算扩展在视觉推理方面取得了显著进展,但长链推理仍然容易出现幻觉。我们识别出一个称为“推理视觉真相脱节”(RVTD)的令人担忧的现象:幻觉与认知分叉点高度相关,这些分叉点通常表现出高熵状态。我们将这种脆弱性归因于视觉语义锚定的崩溃,这种崩溃位于网络中间层;具体来说,在这些高不确定性过渡期间,模型未能查询视觉证据,而是退回到语言先验。因此,我们主张从仅关注结果层面的监督转向增加细粒度的内部注意力引导。为此,我们提出V-STAR(视觉结构训练与注意力强化),一种轻量级、整体的训练范式,旨在内化视觉感知的推理能力。我们方法的核心是分层视觉注意力奖励(HVAR),集成在GRPO框架内。在检测到高熵状态时,该机制动态激励关键中间层的视觉注意力,从而将推理过程锚定回视觉输入。此外,我们引入了强制反思机制(FRM),一种轨迹编辑策略,通过在高熵认知分叉点触发反思并鼓励后续步骤与视觉输入进行验证,从而打破认知惯性,将外部去偏干预转化为减轻幻觉的内在能力。

英文摘要

Multimodal Large Reasoning Models (MLRMs) have made remarkable strides in visual reasoning through test-time compute scaling, yet long-chain reasoning remains prone to hallucinations. We identify a concerning phenomenon termed the Reasoning Vision Truth Disconnect (RVTD): hallucinations are strongly correlated with cognitive bifurcation points that often exhibit high-entropy states. We attribute this vulnerability to a breakdown in visual semantic anchoring, localized within the network's intermediate layers; specifically, during these high-uncertainty transitions, the model fails to query visual evidence, reverting instead to language priors. Consequently, we advocate a shift from solely outcome-level supervision to augmenting it with fine-grained internal attention guidance. To this end, we propose V-STAR (Visual Structural Training with Attention Reinforcement), a lightweight, holistic training paradigm designed to internalize visually aware reasoning capabilities. Central to our approach is the Hierarchical Visual Attention Reward (HVAR), integrated within the GRPO framework. Upon detecting high-entropy states, this mechanism dynamically incentivizes visual attention across critical intermediate layers, thereby anchoring the reasoning process back to the visual input. Furthermore, we introduce the Forced Reflection Mechanism (FRM), a trajectory-editing strategy that disrupts cognitive inertia by triggering reflection around high-entropy cognitive bifurcation points and encouraging verification of subsequent steps against the visual input, thereby translating external debiasing interventions into an intrinsic capability for hallucination mitigation.

2604.09557 2026-05-29 cs.DC cs.AI 版本更新

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

SPEED-Bench:一个统一且多样化的推测解码基准

Talor Abramovich, Maor Ashkenazi, Izzy Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman

发表机构 * NVIDIA

AI总结 针对推测解码(SD)评估中任务多样性不足、吞吐量评估支持不够及实现不贴近生产环境的问题,提出SPEED-Bench基准,包含多样化语义领域和真实服务场景的数据集,集成vLLM和TensorRT-LLM引擎,以标准化SD评估并揭示系统行为。

Comments ICML 2026; Our data is available on https://huggingface.co/datasets/nvidia/SPEED-Bench

AI中文摘要

推测解码(SD)已成为加速大型语言模型(LLM)推理的关键技术。与确定性系统优化不同,SD性能本质上依赖于数据,因此多样且具有代表性的工作负载对于准确衡量其有效性至关重要。现有基准存在任务多样性有限、对面向吞吐量的评估支持不足,以及依赖无法反映生产环境的高级实现等问题。为解决这些问题,我们引入了SPEED-Bench,这是一个全面的套件,旨在跨不同语义领域和真实服务场景标准化SD评估。SPEED-Bench提供了精心策划的定性数据划分,通过优先考虑数据样本之间的语义多样性来选择。此外,它还包括一个吞吐量数据划分,允许在从延迟敏感的低批量设置到面向吞吐量的高负载场景的一系列并发性下进行加速评估。通过与vLLM和TensorRT-LLM等生产引擎集成,SPEED-Bench使从业者能够分析其他基准常常掩盖的系统行为。我们通过量化合成输入如何高估实际吞吐量、识别依赖于批量大小的最优草稿长度和低多样性数据中的偏差,以及分析最先进起草器中词汇剪枝的注意事项来突出这一点。我们发布SPEED-Bench,以建立用于SD算法实际比较的统一评估标准。

英文摘要

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
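
Why SD speedups are data-dependent can be seen from a widely used back-of-the-envelope model, which is not part of SPEED-Bench itself. If each drafted token is accepted independently with probability `alpha` (a simplifying assumption) and the draft length is `gamma`, the expected number of tokens produced per target-model verification step is (1 - alpha^(gamma+1)) / (1 - alpha). The parameter names below are illustrative.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens emitted per verification step.

    alpha: per-token acceptance probability, assumed i.i.d. (a simplification).
    gamma: number of tokens drafted before each verification.
    """
    if alpha >= 1.0:
        return float(gamma + 1)   # every draft accepted, plus one bonus token
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)

# Hypothetical acceptance rates for easy vs. hard workloads, draft length 4.
for alpha in (0.5, 0.7, 0.9):
    print(alpha, round(expected_tokens_per_step(alpha, gamma=4), 3))
```

Because `alpha` depends on how well the drafter predicts a given workload, the same system can show very different speedups on different data, which is the motivation the abstract gives for diverse and representative benchmark splits.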

2603.27667 2026-05-29 cs.SD cs.AI 版本更新

EvA: An Evidence-First Audio Understanding Paradigm for LALMs

EvA: 一种面向LALM的以证据为先的音频理解范式

Xinyuan Xie, Shunian Chen, Zhiheng Liu, Yuhao Zhang, Zhiqiang Lv, Liyin Liang, Benyou Wang

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Didi Chuxing(滴滴出行)

AI总结 提出EvA双路径架构,通过分层聚合和非压缩时间对齐融合增强声学证据保留,并在统一零样本协议下在MMAU、MMAR和MMSU上取得最佳开源感知结果,支持以证据为先的假设。

AI中文摘要

大型音频语言模型(LALM)在复杂声学场景中仍然存在困难,因为它们往往在推理开始前未能保留与任务相关的声学证据。我们将这种错误模式识别为证据瓶颈:最先进的系统在声学证据提取方面的缺陷大于下游推理,这表明上游感知通常是限制因素。为了解决这个问题,我们提出了EvA(以证据为先的音频),一种双路径架构,通过分层聚合和非压缩、时间对齐融合来增强声学证据保留。我们还构建了EvA-Perception,一个大规模训练集,包含约54K个事件排序描述和500K个基于证据的问答对。在统一的零样本协议下,EvA在MMAU、MMAR和MMSU上取得了最佳开源感知结果,在感知密集型分割上增益最大。对开放描述的人工评估进一步显示了改进的细粒度声学覆盖和描述质量。这些结果支持以证据为先的假设:更强的音频理解依赖于在推理前保留声学证据。项目地址:https://satsuki2486441738.github.io/EvA/。

English abstract

Large Audio Language Models (LALMs) still struggle in complex acoustic scenes because they often fail to preserve task-relevant acoustic evidence before reasoning begins. We identify this error pattern as the evidence bottleneck: state-of-the-art systems show larger deficits in acoustic evidence extraction than in downstream reasoning, suggesting that upstream perception is often the limiting factor. To address this problem, we propose EvA (Evidence-First Audio), a dual-path architecture that enhances acoustic evidence preservation through hierarchical aggregation and non-compressive, time-aligned fusion. We also build EvA-Perception, a large-scale training set with about 54K event-ordered captions and 500K evidence-grounded QA pairs. Under a unified zero-shot protocol, EvA achieves the best open-source Perception results on MMAU, MMAR, and MMSU, with the largest gains on perception-heavy splits. Human evaluation on open-ended captioning further shows improved fine-grained acoustic coverage and caption quality. These results support the evidence-first hypothesis: stronger audio understanding depends on preserving acoustic evidence before reasoning. The project page is at https://satsuki2486441738.github.io/EvA/.

2603.26668 2026-05-29 cs.IR cs.AI cs.CL Version update

Bridge-RAG: An Abstract Bridge Tree Based Retrieval Augmented Generation Algorithm


Zihang Li, Wenjun Liu, Yikun Zong, Jiawen Tao, Siying Dai, Songcheng Ren, Zirui Liu, Yuhang Wang, Yanbing Jiang, Tong Yang

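Affiliations Peking University

AI summary To address the accuracy and efficiency challenges of retrieval-augmented generation, proposes the Bridge-RAG framework, which performs multi-level retrieval over an abstract bridge tree and integrates a Cuckoo Filter for O(1) entity lookup, keeping accuracy high while making retrieval up to 1.9x faster.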

English abstract

As an important paradigm for enhancing the generation quality of Large Language Models (LLMs), retrieval-augmented generation (RAG) faces two challenges: retrieval accuracy and computational efficiency. This paper presents a novel RAG framework called Bridge-RAG. To overcome the accuracy challenge, we introduce the concept of an abstract to bridge query entities and document chunks, providing robust semantic understanding. We organize the abstracts into a tree structure and design a multi-level retrieval strategy to ensure the inclusion of sufficient contextual information. While this hierarchical organization substantially improves answer quality, traversing the tree to locate the abstracts that contain a query entity inevitably introduces additional retrieval overhead. To restore retrieval efficiency, we further integrate the Cuckoo Filter in CFT-RAG, which provides O(1) entity lookup and naturally fits the entity-to-abstract pathway of our framework. Extensive experiments show that Bridge-RAG achieves consistent accuracy improvements across all metrics and up to $1.9\times$ faster retrieval compared to structured RAG baselines.
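
To make the O(1) entity lookup concrete, here is a minimal Python sketch of a generic cuckoo filter. It is not the CFT-RAG or Bridge-RAG implementation: the class name, the bucket sizes, the SHA-256-based hashing, and the 16-bit fingerprints are all assumptions chosen for illustration. A lookup inspects at most two small buckets, regardless of how many entities are stored.

```python
import hashlib
import random


class CuckooFilter:
    """Illustrative cuckoo filter: approximate set membership, two-bucket lookup."""

    def __init__(self, num_buckets=64, bucket_size=4, max_kicks=100, seed=0):
        # A power-of-two bucket count keeps the XOR-based alternate index in range.
        assert num_buckets & (num_buckets - 1) == 0
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size
        self.max_kicks = max_kicks
        self.buckets = [[] for _ in range(num_buckets)]
        self.rng = random.Random(seed)

    @staticmethod
    def _hash(data: bytes) -> int:
        return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")

    def _fingerprint(self, item: str) -> int:
        # 16-bit fingerprint in 1..65535 (assumed size, never zero).
        return self._hash(b"fp:" + item.encode()) % 65535 + 1

    def _index(self, item: str) -> int:
        return self._hash(item.encode()) % self.num_buckets

    def _alt_index(self, index: int, fp: int) -> int:
        # XOR with the fingerprint hash; applying it twice returns the original index.
        return (index ^ self._hash(fp.to_bytes(2, "big"))) % self.num_buckets

    def insert(self, item: str) -> bool:
        fp = self._fingerprint(item)
        i1 = self._index(item)
        i2 = self._alt_index(i1, fp)
        for i in (i1, i2):
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        # Both buckets are full: evict a resident fingerprint and relocate it.
        i = self.rng.choice((i1, i2))
        for _ in range(self.max_kicks):
            slot = self.rng.randrange(self.bucket_size)
            fp, self.buckets[i][slot] = self.buckets[i][slot], fp
            i = self._alt_index(i, fp)
            if len(self.buckets[i]) < self.bucket_size:
                self.buckets[i].append(fp)
                return True
        return False  # the filter is treated as full

    def contains(self, item: str) -> bool:
        # Constant-time lookup: only two buckets of bounded size are checked.
        fp = self._fingerprint(item)
        i1 = self._index(item)
        return fp in self.buckets[i1] or fp in self.buckets[self._alt_index(i1, fp)]


entity_filter = CuckooFilter()
for entity in ["Marie Curie", "radium", "Nobel Prize"]:
    entity_filter.insert(entity)
print(entity_filter.contains("radium"))
```

A filter like this can return false positives but never false negatives for inserted items, so in a retrieval pipeline it would typically act as a fast pre-check before the exact entity-to-abstract lookup.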

2603.23853 2026-05-29 cs.AI cs.MA Version update

SCoOP: Semantic Consistent Opinion Pooling for Uncertainty Quantification in Multiple Vision-Language Model Systems


Chung-En Johnny Yu, Brian Jalaian, Nathaniel D. Bastian

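Affiliations University of West Florida; United States Military Academy

AI summary Proposes the SCoOP framework, which aggregates the outputs of multiple vision-language models through uncertainty-weighted linear opinion pooling, providing training-free uncertainty quantification that detects hallucinations effectively and supports abstention on highly uncertain samples.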

Comments Accepted to ICLR 2026 Workshop on Agentic AI in the Wild: From Hallucinations to Reliable Autonomy

English abstract

Combining multiple Vision-Language Models (VLMs) can enhance multimodal reasoning and robustness, but aggregating heterogeneous models' outputs amplifies uncertainty and increases the risk of hallucinations. We propose SCoOP (Semantic-Consistent Opinion Pooling), a training-free uncertainty quantification (UQ) framework for multi-VLM systems through uncertainty-weighted linear opinion pooling. The core idea is to treat each VLM as a probabilistic "expert," sample multiple outputs, map them to a unified space, aggregate their opinions, and produce a system-level uncertainty score. Unlike prior UQ methods designed for single models, SCoOP explicitly measures collective, system-level uncertainty across multiple VLMs, enabling effective hallucination detection and abstention for highly uncertain samples. On ScienceQA, SCoOP achieves an AUROC of 0.866 for hallucination detection, outperforming baselines (0.732-0.757) by approximately 10-13%. For abstention, it attains an AURAC of 0.907, exceeding baselines (0.818-0.840) by 7-9%. Despite these gains, SCoOP introduces only microsecond-level aggregation overhead relative to the baselines, which is trivial compared to typical VLM inference time (on the order of seconds). These results demonstrate that SCoOP provides an efficient and principled mechanism for uncertainty-aware aggregation, advancing the reliability of multimodal AI systems. Our code is publicly available at https://github.com/chungenyu6/SCoOP.
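
The pipeline described above (sample several outputs per model, map them to a shared answer space, pool the opinions, and score system-level uncertainty) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the abstract does not give the weighting formula, so the `exp(-entropy)` weights, the use of Shannon entropy as the uncertainty score, and all function names here are hypothetical.

```python
import math
from collections import Counter


def empirical_distribution(samples, answer_space):
    """Turn sampled answers from one model into a distribution over answer_space."""
    counts = Counter(samples)
    return [counts[a] / len(samples) for a in answer_space]


def entropy(p):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(x * math.log(x) for x in p if x > 0)


def pool_opinions(distributions):
    """Linear opinion pool with weights that shrink as a model's entropy grows.

    The exp(-entropy) weighting is an assumption made for this sketch.
    """
    weights = [math.exp(-entropy(p)) for p in distributions]
    total = sum(weights)
    weights = [w / total for w in weights]
    pooled = [
        sum(w * p[k] for w, p in zip(weights, distributions))
        for k in range(len(distributions[0]))
    ]
    return pooled, entropy(pooled)  # pooled distribution, system-level uncertainty


answer_space = ["A", "B", "C"]
model_1 = empirical_distribution(["A"] * 5, answer_space)
model_2 = empirical_distribution(["A", "A", "A", "B", "B"], answer_space)
pooled, uncertainty = pool_opinions([model_1, model_2])
print(pooled, uncertainty)
```

In this toy case the confident model receives the larger weight, and the pooled entropy could be thresholded to decide whether the system should abstain.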

2603.23234 2026-05-29 cs.AI cs.LG Version update

MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation


Yurui Chang, Yiran Wu, Qingyun Wu, Lu Lin

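Affiliations Pennsylvania State University; AG2AI

AI summary To address the performance drop when memory is shared between agents with different backbone models, proposes the MemCollab framework, which distills shared abstract reasoning constraints by contrasting the reasoning trajectories that different models generate on the same task and adds a task-aware retrieval mechanism, improving the accuracy and inference efficiency of heterogeneous agents.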

English abstract

LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem-solving experiences. However, existing methods typically construct memory for a single agent and reuse it with the same underlying model, tightly coupling stored knowledge to model-specific reasoning styles. In heterogeneous deployments, where agents may be instantiated with backbone models of different sizes, architectures, or specializations, this raises a key question: can a single memory system be shared across agents with different backbone models? We find that naive cross-model memory transfer can degrade performance, because stored memories often entangle task-relevant knowledge with model-specific biases. To address this challenge, we propose MemCollab, a collaborative memory framework that builds shared cross-model memory by contrasting reasoning trajectories generated by different model-based agents on the same task. Through this contrastive process, MemCollab distills abstract reasoning constraints that capture shared task-level invariants while suppressing model-specific artifacts. We further introduce a task-aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are retrieved at inference time. Experiments on mathematical reasoning and code generation benchmarks show that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including settings with different model families. These results demonstrate that collaboratively constructed cross-model memory can serve as a shared reasoning resource for heterogeneous LLM-based agents.

2603.23085 2026-05-29 cs.AI Version update

When Models Learn to Ask Why: Adaptive Causal Reasoning for Trustworthy Medical Vision-Language Models


Jianxin Lin, Chunzheng Zhu, Peter J. Kneuertz, Yunfei Bai, Yuan Xue

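Affiliations The Ohio State University; Hunan University; Amazon

AI summary Proposes the MedCausalX framework, which addresses spurious correlations and inconsistent reasoning in medical VLMs through causal reasoning chains, an adaptive reflection architecture, and trajectory-level causal correction, significantly improving diagnostic consistency and reducing hallucination.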

Comments Accepted by CVPR 2026 Findings

English abstract

Vision-Language Models (VLMs) have enabled interpretable medical diagnosis by integrating visual perception with linguistic reasoning. Yet, existing medical chain-of-thought (CoT) models lack explicit mechanisms to represent and enforce causal reasoning, leaving them vulnerable to spurious correlations and limiting their clinical reliability. We pinpoint three core challenges in medical CoT reasoning: how to adaptively trigger causal correction, construct high-quality causal-spurious contrastive samples, and maintain causal consistency across reasoning trajectories. To address these challenges, we propose MedCausalX, an end-to-end framework that explicitly models causal reasoning chains in medical VLMs. We first introduce the CRMed dataset, which provides fine-grained anatomical annotations, structured causal reasoning chains, and counterfactual variants that guide the learning of causal relationships beyond superficial correlations. Building upon CRMed, MedCausalX employs a two-stage adaptive reflection architecture equipped with $\langle$causal$\rangle$ and $\langle$verify$\rangle$ tokens, enabling the model to autonomously determine when and how to perform causal analysis and verification. Finally, a trajectory-level causal correction objective optimized through error-attributed reinforcement learning refines the reasoning chain, allowing the model to distinguish genuine causal dependencies from shortcut associations. Extensive experiments on multiple benchmarks show that MedCausalX consistently outperforms state-of-the-art methods, improving diagnostic consistency by +5.4 points, reducing hallucination by over 10 points, and attaining top spatial grounding IoU, thereby setting a new standard for causally grounded medical reasoning. The code and dataset are available at https://github.com/zhcz328/MedCausalX.

2603.23069 2026-05-29 cs.CL cs.AI Version update

AuthorMix: Modular Authorship Style Transfer via Layer-wise Adapter Mixing


Sarubi Thillainathan, Ji-Ung Lee, Michael Sullivan, Alexander Koller

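Affiliations Saarland University

AI summary Proposes the AuthorMix framework, which trains style-specific LoRA adapters and mixes them layer by layer to achieve lightweight, modular authorship style transfer from only a handful of target-style examples, outperforming existing methods in low-resource settings and substantially improving meaning preservation.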

Comments Under review

English abstract

The task of authorship style transfer involves rewriting text in the style of a target author while preserving the meaning of the original text. Existing style transfer methods train a single model on large corpora to model all target styles at once: this high-cost approach offers limited flexibility for target-specific adaptation, and often sacrifices meaning preservation for style transfer. In this paper, we propose AuthorMix: a lightweight, modular, and interpretable style transfer framework. We train individual, style-specific LoRA adapters on a small set of high-resource authors, allowing the rapid training of specialized adaptation models for each new target via learned, layer-wise adapter mixing, using only a handful of target-style training examples. AuthorMix outperforms existing SoTA style-transfer baselines, as well as GPT-5.1, for low-resource targets, achieving the highest overall score and substantially improving meaning preservation in both automatic and human evaluations.

2603.19828 2026-05-29 cs.AI Version update

FormalEvolve: Neuro-Symbolic Evolutionary Search for Diverse Autoformalization


Haijian Lu, Wei Wang, Jing Liu

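Affiliations School of Artificial Intelligence, Xidian University; Beijing Institute for General Artificial Intelligence

AI summary Proposes FormalEvolve, a neuro-symbolic evolutionary search method that combines LLM-driven mutation, crossover, patch repair, and symbolic AST rewrites, recasting autoformalization as budgeted test-time search; by maintaining a compilable archive and reporting a deduplicated, semantically accepted repertoire, it reaches SH@100 of 58.0% on CombiBench and 84.9% on ProofNet and improves downstream theorem proving.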

Comments 27 pages, 12 figures

English abstract

Autoformalization aims to produce formal statements that compile and faithfully preserve the intended meaning of informal mathematics. Yet standard single-output evaluation protocols collapse a many-to-many problem into a single-output prediction task. For downstream proving, this granularity is too coarse: a formal statement is not merely a faithful translation endpoint, but also a prover-facing interface whose structure can alter proof search under a fixed budget. We therefore recast autoformalization as budgeted test-time search: FormalEvolve maintains a compilation-feasible archive for reuse, while reporting the deduplicated semantically accepted repertoire for evaluation and downstream proving. It expands the archive with LLM-driven mutation, crossover, bounded patch repair, and symbolic Abstract Syntax Tree (AST) rewrites for structural diversity. Under a generator-call budget of T=100 with a fixed LLM semantic judge, FormalEvolve reaches SH@100 of 58.0% on CombiBench and 84.9% on ProofNet, improving over all no-archive controls while reducing the cross-problem concentration of semantic successes. To assess downstream value, we evaluate the resulting repertoires under a fixed B=64 prover budget, where they improve theorem-complete proving over the matched no-archive control; additional stronger-base statement-generation experiments show that archive-search gains hold with stronger seed and repair models. Manual faithfulness audits calibrate these judge-positive outputs.

2603.13249 2026-05-29 cs.CL cs.AI cs.CY Version update

Steering at the Source: Style Modulation Heads for Robust Persona Control


Yoshihiro Izawa, Gouki Minegishi, Koshi Eguchi, Sosuke Hosokawa, Kenjiro Taura

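Affiliations The University of Tokyo, Tokyo, Japan; National Institute of Informatics Research; Development Center for Large Language Models, Japan

AI summary By identifying and intervening on only a few attention heads (Style Modulation Heads), this work achieves robust control of LLM persona and style without fine-tuning while significantly mitigating the coherency degradation caused by residual-stream interventions.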

Comments 8 main pages with appendix

English abstract

Activation steering offers a computationally efficient mechanism for controlling Large Language Models (LLMs) without fine-tuning. While it effectively controls target traits (e.g., persona), coherency degradation remains a major obstacle to safety and practical deployment. We hypothesize that this degradation stems from intervening on the residual stream, which indiscriminately affects aggregated features and inadvertently amplifies off-target noise. In this work, we identify a sparse subset of attention heads (only three heads) that independently govern persona and style formation, which we term Style Modulation Heads. Specifically, these heads can be localized via geometric analysis of internal representations, combining layer-wise cosine similarity and head-wise contribution scores. We demonstrate that intervention targeting only these specific heads achieves robust behavioral control while significantly mitigating the coherency degradation observed in residual stream steering. More broadly, our findings show that precise, component-level localization enables safer and more precise model control.

2603.11331 2026-05-29 cs.LG cs.AI Version update

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover


Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

Affiliations John A. Paulson School of Engineering and Applied Sciences, Harvard University; Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology; Speech and Hearing Bioscience and Technology, Harvard Medical School; Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University; Center for Brain Science, Harvard University

AI summary Finds that adversarial prompt-injection attacks turn the growth of attack success rate with the number of inference-time samples from slow polynomial (without injection) into exponential, and explains the phenomenon theoretically with a spin-glass model.

English abstract

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We observe qualitatively consistent behavior across a broad range of large language models, spanning parameter scales from 3B to 70B. In particular, the main trends remain stable across multiple attack methods, such as GCG and AutoDAN, as well as across benchmark datasets such as AdvBench and HarmBench.
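
As a hedged illustration of how a power-law and an exponential regime can both arise from repeated sampling, the Python sketch below uses a standard textbook calculation, not the paper's assumptions or its spin-glass model. If each of n independent samples succeeds with a fixed probability p, the probability that every sample fails is (1 - p)^n, which decays exponentially in n. If instead p varies from prompt to prompt following a Beta(a, b) distribution (an assumption made here purely for convenience), the averaged failure probability decays only as a power law, roughly n^(-a). The function names and parameter values are hypothetical.

```python
import math


def failure_fixed(p, n):
    """P(all n samples fail) when every sample succeeds with the same probability p."""
    return (1.0 - p) ** n


def failure_beta(a, b, n):
    """E[(1 - p)^n] for p ~ Beta(a, b), via the exact product of (b + k) / (a + b + k)."""
    log_f = sum(math.log(b + k) - math.log(a + b + k) for k in range(n))
    return math.exp(log_f)


def loglog_slope(f, n):
    """Slope of log f against log n, measured between n and 2n."""
    return (math.log(f(2 * n)) - math.log(f(n))) / math.log(2)


# Heterogeneous success probabilities: the slope settles near -a (power law).
print(loglog_slope(lambda n: failure_beta(0.5, 5.0, n), 1000))

# Fixed success probability: log-failure is linear in n (exponential decay),
# so doubling n doubles the log of the failure probability.
print(math.log(failure_fixed(0.05, 40)) / math.log(failure_fixed(0.05, 20)))
```

The attack success rate after n samples is one minus these failure probabilities, so the two cases approach 1 at polynomial and exponential speed respectively.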

2603.07916 2026-05-29 cs.AI cs.DB cs.LG Version update

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases


Jun Yin, Peng Huo, Bangguo Zhu, Hao Yan, Senzhang Wang, Shirui Pan, Chengqi Zhang

Affiliations Department of Data Science and Artificial Intelligence; Hong Kong Polytechnic University; School of Computer Science and Engineering; Central South University; School of Information and Communication Technology; Griffith University; National Super Computing Center

AI summary To address class imbalance in entity classification on relational databases, proposes the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), which improves minority-class representations with a relation-wise gating controller and a relation-guided minority synthesizer, raising Balanced Accuracy by 2.46% and G-Mean by 4.00% on average across 12 datasets.

English abstract

Relational deep learning (RDL) has recently been proposed to enable a fully data-driven learning paradigm on relational databases (RDBs): it structures the RDB as a heterogeneous entity graph and adopts a graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing minority entities, leading to models that are unusable in practice. In this work, we investigate, for the first time, the class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS) to fill a critical void in the current literature. Specifically, to keep minority-related information from being submerged by its majority counterparts, we design a relation-wise gating controller that modulates neighborhood messages from each individual relation type. Based on the relation-gated representations, we further propose a relation-guided minority synthesizer for over-sampling, which integrates entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding average improvements of up to 2.46% in Balanced Accuracy and 4.00% in G-Mean over SOTA RDL methods and classic methods for handling class imbalance.
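
For readers unfamiliar with the two reported metrics, the Python sketch below computes them under their usual definitions, which the abstract itself does not spell out: Balanced Accuracy as the mean of per-class recalls and G-Mean as their geometric mean. The function names and the toy labels are hypothetical. The example also shows why plain accuracy is misleading on imbalanced data.

```python
import math
from collections import defaultdict


def per_class_recall(y_true, y_pred):
    """Recall of each class: correctly predicted members / all members of the class."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return {c: correct[c] / total[c] for c in total}


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of the per-class recalls."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return sum(recalls) / len(recalls)


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls; zero if any class is never recovered."""
    recalls = list(per_class_recall(y_true, y_pred).values())
    return math.prod(recalls) ** (1.0 / len(recalls))


# Toy imbalanced problem: 8 majority entities (class 0), 2 minority entities (class 1).
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]  # one of the two minority entities is missed

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy)                           # 0.9
print(balanced_accuracy(y_true, y_pred))  # 0.75
print(g_mean(y_true, y_pred))             # about 0.707
```

Plain accuracy stays at 0.9 even though half of the minority class is misclassified, while both imbalance-aware metrics drop noticeably.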

2603.05488 2026-05-29 cs.CL cs.AI cs.LG Version update

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought


Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

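Affiliations Harvard University, Cambridge, MA

AI summary Analyzing activation probes, early forced answering, and a chain-of-thought monitor, finds that reasoning models exhibit performative chain-of-thought, and uses probe-guided early exit to save computation.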

English abstract

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B and GPT-OSS 120B) and finds task difficulty-specific differences: the model's final answer is decodable from activations far earlier in the CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.
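
The idea of probe-guided early exit can be sketched as a simple stopping rule. The Python below is a minimal illustration, not the authors' procedure: it assumes a probe that emits one confidence value per reasoning step, and the threshold, the `patience` parameter, and the function names are all hypothetical.

```python
def early_exit_step(probe_confidences, threshold=0.95, patience=2):
    """Index of the first step where confidence has stayed at or above
    `threshold` for `patience` consecutive steps, or None if that never happens."""
    streak = 0
    for step, confidence in enumerate(probe_confidences):
        streak = streak + 1 if confidence >= threshold else 0
        if streak >= patience:
            return step
    return None


def token_savings(tokens_per_step, exit_step):
    """Fraction of tokens skipped by stopping after `exit_step` (0.0 if no exit)."""
    if exit_step is None:
        return 0.0
    used = sum(tokens_per_step[: exit_step + 1])
    return 1.0 - used / sum(tokens_per_step)


# Hypothetical probe readings over six reasoning steps of 50 tokens each.
confidences = [0.40, 0.70, 0.96, 0.97, 0.99, 0.99]
step = early_exit_step(confidences)
print(step, token_savings([50] * 6, step))
```

Requiring a short streak rather than a single high reading is one way to avoid exiting on a transient spike; whether that matters in practice would depend on how noisy the probe is.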

2603.04678 2026-05-29 cs.CL cs.AI Version update

Post-Training Language Models for Crosslingual Consistency


Tianyu Liu, Jirui Qi, Mrinmaya Sachan, Ryan Cotterell, Raquel Fernández, Arianna Bisazza

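Affiliations ETH Zürich; CLCG, University of Groningen; University of Amsterdam

AI summary To address multilingual models responding inconsistently to translation-equivalent prompts, proposes an information-theoretic definition of crosslingual consistency and develops a post-training method, direct consistency optimization (DCO), to improve it.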

Comments ICML 2026. The first two authors contributed equally. Codes available at: https://github.com/Betswish/ConsistencyRL

English abstract

Language models often respond inconsistently to translation-equivalent prompts across languages, undermining the reliability of multilingual systems. To quantify this, we give an information-theoretic definition of crosslingual consistency as a divergence bound between a model's response distribution and its round-trip pushforward across languages. We then introduce penalized consistency optimization (PCO), a post-training procedure that couples this divergence with a Kullback-Leibler penalty to a fixed reference language model. Because direct optimization of PCO requires expensive on-policy roll-outs, we propose a tractable surrogate, direct consistency optimization (DCO), which can be optimized off-policy. Across diverse language models and 26 languages, DCO significantly improves crosslingual consistency, outperforms existing methods, and enables targeted alignment of low-resource languages.

2603.04314 2026-05-29 cs.CV cs.AI Version update

MOO: A Multi-view Oriented Observations Dataset for Viewpoint Analysis in Cattle Re-Identification


William Grolleau, Achraf Chaouch, Astrid Sabourin, Guillaume Lapouge, Catherine Achard

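Affiliations Universite Paris-Saclay, CEA, List; Sorbonne University, CNRS, ISIR

AI summary Introduces MOO, a large-scale synthetic multi-view observation dataset with images of 1,000 cattle from 128 uniformly sampled viewpoints, quantifies the effect of viewpoint changes on re-identification, and validates that synthetic geometric priors transfer to real-world settings.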

Comments 6 pages, 3 figures, accepted to the CVPR 2026 Workshop on Computer Vision for Animal Behavior Tracking and Modeling (CV4Animals)

English abstract

Animal re-identification (ReID) faces critical challenges due to viewpoint variations, particularly in Aerial-Ground (AG-ReID) settings where models must match individuals across drastic elevation changes. However, existing datasets lack the precise angular annotations required to systematically analyze these geometric variations. To address this, we introduce the Multi-view Oriented Observation (MOO) dataset, a large-scale synthetic AG-ReID dataset of 1,000 cattle individuals captured from 128 uniformly sampled viewpoints (128,000 annotated images). Using this controlled dataset, we quantify the influence of elevation and identify a critical elevation threshold, above which models generalize significantly better to unseen views. Finally, we validate the transferability to real-world applications in both zero-shot and supervised settings, demonstrating performance gains across four real-world cattle datasets and confirming that synthetic geometric priors effectively bridge the domain gap. Collectively, this dataset and analysis lay the foundation for future model development in cross-view animal ReID. MOO is publicly available at https://github.com/TurtleSmoke/MOO.

2603.03805 2026-05-29 cs.LG cs.AI cs.DB Version update

Relational In-Context Learning via Synthetic Pre-training with Structural Prior


Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

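Affiliations Institute for Artificial Intelligence, Peking University; University of Illinois at Urbana-Champaign; Institute of Computing Technology, Beijing University of Post; State Key Laboratory of General Artificial Intelligence

AI summary Proposes RDB-PFN, the first relational foundation model trained purely on synthetic data, which uses structural causal models to generate diverse relational databases, adapts to new databases instantly via in-context learning, and outperforms existing tabular foundation models on 19 real-world relational prediction tasks.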

English abstract

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments show that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming state-of-the-art tabular foundation models evaluated on the same DFS-linearized inputs, while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN.

2602.23258 2026-05-29 cs.AI cs.CL Version update

AgentDropoutV2: Optimizing Information Flow in Multi-Agent Systems via Test-Time Rectify-or-Reject Pruning


Yutong Wang, Siyuan Xiong, Xuebo Liu, Wenkang Zhou, Liang Ding, Miao Zhang, Min Zhang

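Affiliations Harbin Institute of Technology, Shenzhen; Alibaba Group

AI summary Proposes the AgentDropoutV2 framework, which at test time corrects errors with a retrieval-augmented rectifier and prunes irreparable outputs, dynamically optimizing information flow in multi-agent systems and significantly improving performance on math and code benchmarks.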

English abstract

While Multi-Agent Systems (MAS) excel in complex reasoning, they suffer from the cascading impact of erroneous information from individual agents. Current solutions often resort to rigid structural engineering or expensive fine-tuning, limiting their adaptability. We propose AgentDropoutV2 (ADv2), a test-time rectify-or-reject pruning framework that dynamically optimizes MAS information flow. Acting as an active firewall, ADv2 intercepts agent outputs and employs a retrieval-augmented rectifier to iteratively correct errors. This rectification is guided by an indicator pool, which is constructed offline by distilling error patterns from historical MAS failure trajectories. Irreparable outputs are subsequently pruned to prevent error propagation. Empirical results demonstrate that ADv2 significantly boosts performance on both fixed and dynamic MAS frameworks, achieving average accuracy gains of 6.39 and 2.28 percentage points on extensive math and code benchmarks, respectively. Furthermore, ADv2 exhibits remarkable adaptivity, dynamically modulating rectification efforts based on task difficulty to resolve a wide spectrum of error patterns. Our code is released at https://github.com/TonySY2/AgentDropoutV2.
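
The rectify-or-reject control flow described above can be sketched as a short loop. This Python is only an assumed shape of that loop, not the released AgentDropoutV2 code: `find_errors` and `rectify` are hypothetical callables standing in for the indicator-pool check and the retrieval-augmented rectifier, and `max_rounds` is an invented parameter.

```python
def rectify_or_reject(output, find_errors, rectify, max_rounds=3):
    """Iteratively repair an agent output; return None (prune it) if errors remain.

    find_errors(text) -> list of detected issues (hypothetical checker)
    rectify(text, issues) -> revised text (hypothetical rectifier)
    """
    for _ in range(max_rounds):
        issues = find_errors(output)
        if not issues:
            return output  # clean output is passed on to downstream agents
        output = rectify(output, issues)
    # Out of rounds: keep the output only if the last repair actually fixed it.
    return output if not find_errors(output) else None


# Toy stand-ins so the sketch runs end to end.
def toy_find_errors(text):
    return ["misspelling"] if "teh" in text else []


def toy_rectify(text, issues):
    return text.replace("teh", "the", 1)  # repairs one occurrence per round


print(rectify_or_reject("teh cat saw teh dog", toy_find_errors, toy_rectify))
print(rectify_or_reject("teh cat", toy_find_errors, lambda text, issues: text))
```

Returning `None` for an unrepaired output is the pruning step: the caller would simply drop that message rather than forward it to other agents.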

2602.20141 2026-05-29 cs.AI 版本更新

Recurrent Structural Policy Gradient for Partially Observable Mean Field Games

循环结构策略梯度用于部分可观测平均场博弈

Clarisse Wibault, Johannes Forkel, Sebastian Towers, Tiphaine Wibault, Juan Duque, George Whittle, Andreas Schaab, Yucheng Yang, Chiyuan Wang, Maike Osborne, Benjamin Moll, Jakob Foerster

发表机构 * FLAIR, University of Oxford(FLAIR,牛津大学) MLRG, University of Oxford(MLRG,牛津大学) ifo Institute, LMU Munich(IFO研究所,慕尼黑大学) Mila, Québec AI Institute(Mila,魁北克AI研究所) University of Zurich(苏黎世大学) Peking University(北京大学) London School of Economics(伦敦经济学院)

AI总结 针对部分可观测平均场博弈,提出首个历史感知的混合结构方法RSPG,通过利用低维状态动作空间和已知转移动力学计算期望回报,实现比无模型RL方法快一个数量级的收敛速度。

详情
AI中文摘要

平均场博弈(MFGs)为大规模群体系统中的交互建模提供了原则性框架。然而,由于无模型方法方差高而精确方法扩展性差,算法进展有限。最近的混合结构方法(HSMs)通过利用低维个体状态和动作空间以及已知的转移动力学,计算以公共噪声的蒙特卡洛轨迹为条件的精确期望回报,从而在保持可处理性的同时降低方差。然而,HSMs尚未扩展到部分可观测设置。我们提出循环结构策略梯度(RSPG),这是首个用于具有公共部分信息的MFGs的历史感知HSM。RSPG实现了比无模型RL方法快一个数量级的收敛速度,同时学习历史感知行为,这与当前的HSMs不同。为了促进对MFGs的研究,我们还引入了MFAX,这是我们基于JAX的MFG框架,支持解析和基于样本的平均场更新。MFAX和使用示例可在https://clarisse-wibault.github.io/rspg/找到。

英文摘要

Mean Field Games (MFGs) provide a principled framework for modelling interactions in large population systems. However, algorithmic progress has been limited since model-free methods are high variance and exact methods scale poorly. Recent Hybrid Structural Methods (HSMs) reduce variance while maintaining tractability by leveraging low-dimensional individual state and action spaces and known transition dynamics to compute the exact expected return conditioned on Monte Carlo rollouts of common noise. However, HSMs have not been extended to partially observable settings. We propose Recurrent Structural Policy Gradient (RSPG), the first history-aware HSM for MFGs with public partial information. RSPG achieves an order-of-magnitude faster convergence than model-free RL methods while learning history-aware behaviour, unlike current HSMs. To facilitate research into MFGs, we also introduce MFAX, our JAX-based framework for MFGs that supports both analytic and sample-based mean-field updates. MFAX and usage examples can be found at https://clarisse-wibault.github.io/rspg/.

2602.18527 2026-05-29 cs.CV cs.AI cs.SD 版本更新

JAEGER: Joint 3D Audio-Visual Grounding and Reasoning in Simulated Physical Environments

JAEGER:模拟物理环境中的联合3D音频-视觉定位与推理

Zhan Liu, Changli Tang, Yuxin Wang, Zhiyuan Zhu, Youjun Chen, Yiwen Shao, Tianzi Wang, Lei Ke, Zengrui Jin, Chao Zhang

发表机构 * Tsinghua University(清华大学) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学) Tencent AI Lab(腾讯AI实验室)

AI总结 提出JAEGER框架,通过集成RGB-D观测和多通道一阶环境声学,将音频-视觉大语言模型扩展到3D空间,实现联合空间定位与推理,并引入神经强度向量(Neural IV)提升声源方向估计的鲁棒性。

Comments Accepted to ICML 2026

详情
AI中文摘要

当前的音频-视觉大语言模型(AV-LLMs)主要局限于2D感知,依赖于RGB视频和单声道音频。这种设计选择引入了基本的维度不匹配,阻碍了在复杂3D环境中可靠的声源定位和空间推理。我们通过提出JAEGER框架来解决这一限制,该框架将AV-LLMs扩展到3D空间,通过集成RGB-D观测和多通道一阶环境声学实现联合空间定位与推理。我们工作的核心贡献是神经强度向量(Neural IV),一种学习的空间音频表示,它编码了鲁棒的方向线索,以增强到达方向估计,即使在具有重叠声源的不利声学场景中也是如此。为了促进大规模训练和系统评估,我们提出了SpatialSceneQA,一个包含从模拟物理环境中整理的6.1万个指令调优样本的基准。大量实验表明,我们的方法在各种空间感知和推理任务中始终优于以2D为中心的基线,强调了显式3D建模对于推进物理环境中AI的必要性。我们的源代码、预训练模型检查点和数据集可在https://github.com/liuzhan22/JAEGER获取。

英文摘要

Current audio-visual large language models (AV-LLMs) are predominantly restricted to 2D perception, relying on RGB video and monaural audio. This design choice introduces a fundamental dimensionality mismatch that precludes reliable source localization and spatial reasoning in complex 3D environments. We address this limitation by presenting JAEGER, a framework that extends AV-LLMs to 3D space, to enable joint spatial grounding and reasoning through the integration of RGB-D observations and multi-channel first-order ambisonics. A core contribution of our work is the neural intensity vector (Neural IV), a learned spatial audio representation that encodes robust directional cues to enhance direction-of-arrival estimation, even in adverse acoustic scenarios with overlapping sources. To facilitate large-scale training and systematic evaluation, we propose SpatialSceneQA, a benchmark of 61k instruction-tuning samples curated from simulated physical environments. Extensive experiments demonstrate that our approach consistently surpasses 2D-centric baselines across diverse spatial perception and reasoning tasks, underscoring the necessity of explicit 3D modelling for advancing AI in physical environments. Our source code, pre-trained model checkpoints, and datasets are available at https://github.com/liuzhan22/JAEGER.
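For context on what first-order ambisonics offers a direction-of-arrival estimator, the sketch below computes the classical time-averaged acoustic intensity vector from the four channels. It is not the learned Neural IV from the paper; the channel layout (W, X, Y, Z), the unnormalized plane-wave encoder, and all function names are assumptions made for illustration.

```python
import math


def encode_plane_wave(signal, azimuth, elevation):
    """Encode a mono signal as an idealized first-order ambisonic plane wave."""
    ux = math.cos(azimuth) * math.cos(elevation)
    uy = math.sin(azimuth) * math.cos(elevation)
    uz = math.sin(elevation)
    w = list(signal)
    return w, [s * ux for s in signal], [s * uy for s in signal], [s * uz for s in signal]


def foa_intensity_doa(w, x, y, z):
    """Estimate (azimuth, elevation) in radians from the averaged intensity vector."""
    n = len(w)
    ix = sum(wi * xi for wi, xi in zip(w, x)) / n
    iy = sum(wi * yi for wi, yi in zip(w, y)) / n
    iz = sum(wi * zi for wi, zi in zip(w, z)) / n
    azimuth = math.atan2(iy, ix)
    elevation = math.atan2(iz, math.hypot(ix, iy))
    return azimuth, elevation


source = [math.sin(0.1 * t) for t in range(200)]
est_az, est_el = foa_intensity_doa(
    *encode_plane_wave(source, math.radians(60), math.radians(20))
)
```

With a single clean source this recovers the encoded direction. The estimate degrades when sources overlap, which is the regime the abstract says the learned representation targets.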

2602.16610 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Who can we trust? LLM-as-a-jury for Comparative Assessment

我们该信任谁?LLM作为陪审团进行比较评估

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill

发表机构 * Department of Engineering, University of Cambridge, UK(剑桥大学工程系)

AI总结 针对LLM作为评估者时判断不一致和可靠性差异的问题,提出BT-sigma模型,通过引入判别参数联合推断项目排名和法官可靠性,优于平均聚合方法。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作自动评估器,用于自然语言生成评估,通常采用成对比较判断。现有方法通常依赖单一法官或聚合多个法官并假设其可靠性相同。在实践中,LLM法官在不同任务和评估方面的表现差异很大,其判断概率可能存在偏差和不一致。此外,用于法官校准的人工标注监督可能不可用。我们首先通过实验证明LLM比较概率的不一致性存在,并表明这限制了直接基于概率排名的有效性。为解决此问题,我们研究了LLM作为陪审团的设置,并提出了BT-sigma,这是Bradley-Terry模型的一种法官感知扩展,为每个法官引入一个判别参数,仅从成对比较中联合推断项目排名和法官可靠性。在基准NLG评估数据集上的实验表明,BT-sigma始终优于基于平均的聚合方法,并且学习到的判别参数与LLM判断的循环一致性的独立度量高度相关。进一步分析揭示,BT-sigma可以解释为一种无监督校准机制,通过建模法官可靠性来改进聚合。

英文摘要

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and evaluation aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminators strongly correlate with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
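One way to picture a judge-aware Bradley-Terry model is to scale the score difference by a per-judge discriminator before the logistic link. The abstract does not give the exact BT-sigma parameterization, so the functional form below, and the names `bt_judge_prob` and `neg_log_likelihood`, are assumptions for illustration.

```python
import math


def bt_judge_prob(s_i, s_j, disc_k):
    """P(judge k prefers item i over item j) under the assumed form
    sigmoid(disc_k * (s_i - s_j)); disc_k = 0 means an uninformative judge."""
    return 1.0 / (1.0 + math.exp(-disc_k * (s_i - s_j)))


def neg_log_likelihood(scores, discs, comparisons):
    """Cross-entropy between judge-reported soft preferences and the model.

    comparisons: iterable of (judge, i, j, p_ij), where p_ij is the
    probability that the judge assigned to "i is better than j".
    """
    total = 0.0
    for k, i, j, p in comparisons:
        q = bt_judge_prob(scores[i], scores[j], discs[k])
        total -= p * math.log(q) + (1.0 - p) * math.log(1.0 - q)
    return total


# Two judges compare the same pair; judge 0 is more decisive than judge 1.
data = [(0, 0, 1, 0.9), (1, 0, 1, 0.6)]
nll_good = neg_log_likelihood([1.0, 0.0], [2.0, 0.5], data)
nll_flipped = neg_log_likelihood([0.0, 1.0], [2.0, 0.5], data)
```

Minimizing this objective jointly over `scores` and `discs` would yield item rankings and per-judge reliabilities from pairwise data alone. A real fit also needs a scale constraint, since multiplying all scores by a constant and dividing all discriminators by it leaves the likelihood unchanged.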

2602.16449 2026-05-29 cs.LG cs.AI stat.ML 版本更新

GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

GICDM: 缓解枢纽性以实现可靠的基于距离的生成模型评估

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

发表机构 * Inria, Palaiseau, France(法国国家信息与自动化研究所(Inria),帕莱索)

AI总结 针对生成模型评估中高维嵌入空间的枢纽性现象,提出GICDM方法(基于迭代上下文不相似度度量),通过多尺度扩展校正邻域估计,恢复可靠度量并与人类评估对齐。

Comments Forty-third International Conference on Machine Learning, 2026

详情
AI中文摘要

生成模型评估通常依赖于高维嵌入空间来计算样本之间的距离。我们表明,这些空间中的数据集表示受到枢纽性现象的影响,这会扭曲最近邻关系并使基于距离的度量产生偏差。基于经典的迭代上下文不相似度度量(ICDM),我们引入了生成式ICDM(GICDM),一种校正真实数据和生成数据邻域估计的方法。我们引入了多尺度扩展以改善经验行为。在合成和真实基准上的大量实验表明,GICDM解决了枢纽性引起的失败,恢复了可靠的度量行为,并改善了与人类评估的一致性。

英文摘要

Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest-neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method to correct neighborhood estimation for both real and generated data. We introduce a multi-scale extension to improve empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human assessment.
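Hubness is commonly diagnosed with the k-occurrence count: how often each point appears in the k-nearest-neighbor lists of the other points. The sketch below computes that diagnostic only. It does not implement ICDM or GICDM, and the function name and toy data are illustrative.

```python
import math


def k_occurrence(points, k):
    """Return N_k for every point: the number of other points that list it
    among their k nearest neighbours (Euclidean distance)."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        dists = sorted(
            (math.dist(points[i], points[j]), j) for j in range(n) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


toy_points = [(0.0,), (1.0,), (2.5,), (10.0,)]
toy_counts = k_occurrence(toy_points, k=1)
```

The counts always sum to `n * k`. A strongly skewed distribution, with a few points collecting most of the mass (hubs) and many points at zero (antihubs), is the signature that distorts nearest-neighbor-based metrics.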

2602.12304 2026-05-29 cs.SD cs.AI cs.MM eess.AS 版本更新

OmniCustom: Sync Audio-Video Customization Via Joint Audio-Video Generation Model

OmniCustom: 通过联合音视频生成模型实现同步音视频定制

Maomao Li, Zhen Li, Kaipeng Zhang, Guosheng Yin, Zhifeng Li, Dong Xu

发表机构 * The University of Hong Kong(香港大学) Shanda AI Research Tokyo(Shanda AI东京研究所) XIntelligence Technology Co., Limited(XIntelligence技术有限公司)

AI总结 提出一种基于DiT的零样本音视频定制框架OmniCustom,通过参考图像和音频同步生成保持身份和音色一致性的视频,支持文本指定语音内容。

Comments code: https://github.com/OmniCustom-project/OmniCustom

详情
AI中文摘要

现有的主流视频定制方法侧重于基于给定参考图像和文本提示生成身份一致的视频。受益于联合音视频生成的快速发展,本文提出一个更具吸引力的新任务:同步音视频定制,旨在同步定制视频身份和音频音色。具体来说,给定参考图像$I^{r}$和参考音频$A^{r}$,该新任务要求生成保持参考图像身份并模仿参考音频音色的视频,语音内容可由用户提供的文本提示自由指定。为此,我们提出OmniCustom,一个基于DiT的强大音视频定制框架,能够以零样本方式一次性根据参考图像身份、音频音色和文本提示合成视频。我们的框架基于三个关键贡献。首先,身份和音频音色控制通过独立的参考身份和音频LoRA模块实现,这些模块通过基础音视频生成模型中的自注意力层操作。其次,我们引入了对比学习目标与标准流匹配目标一起使用。它将以参考输入为条件的预测流作为正例,以无参考条件的预测流作为负例,从而增强模型保持身份和音色的能力。第三,我们在构建的大规模高质量音视频人类数据集上训练OmniCustom。大量实验表明,OmniCustom在生成具有一致身份和音色保真度的音视频内容方面优于现有方法。项目页面:https://omnicustom-project.github.io/page/。

英文摘要

Existing mainstream video customization methods focus on generating identity-consistent videos based on given reference images and textual prompts. Benefiting from the rapid advancement of joint audio-video generation, this paper proposes a more compelling new task: sync audio-video customization, which aims to synchronously customize both video identity and audio timbre. Specifically, given a reference image $I^{r}$ and a reference audio $A^{r}$, this novel task requires generating videos that maintain the identity of the reference image while imitating the timbre of the reference audio, with spoken content freely specifiable through user-provided textual prompts. To this end, we propose OmniCustom, a powerful DiT-based audio-video customization framework that can synthesize a video following reference image identity, audio timbre, and text prompts all at once in a zero-shot manner. Our framework is built on three key contributions. First, identity and audio timbre control are achieved through separate reference identity and audio LoRA modules that operate through self-attention layers within the base audio-video generation model. Second, we introduce a contrastive learning objective alongside the standard flow matching objective. It uses predicted flows conditioned on reference inputs as positive examples and those without reference conditions as negative examples, thereby enhancing the model's ability to preserve identity and timbre. Third, we train OmniCustom on our constructed large-scale, high-quality audio-visual human dataset. Extensive experiments demonstrate that OmniCustom outperforms existing methods in generating audio-video content with consistent identity and timbre fidelity. Project page: https://omnicustom-project.github.io/page/.

2602.11171 2026-05-29 cs.CL cs.AI 版本更新

A Language-Guided Bayesian Optimization for Efficient LoRA Hyperparameter Search

语言引导的贝叶斯优化用于高效LoRA超参数搜索

Baek Seong-Eun, Lee Jung-Mok, Kim Sung-Bin, Tae-Hyun Oh

发表机构 * Grad. School of AI, POSTECH, Pohang, Korea(POSTECH人工智能研究生院,韩国浦项) School of EE, KAIST, Daejeon, Korea(韩国科学技术院电子工程学院,韩国大田) School of Computing, KAIST, Daejeon, Korea(韩国科学技术院计算学院,韩国大田)

AI总结 提出一种利用预训练LLM领域知识的贝叶斯优化框架,通过语言提示将超参数映射到连续空间,结合子集训练代理评估,仅需约30次迭代即可发现比标准超参数提升20%以上性能的LoRA超参数。

Comments Accepted at ICML 2026

详情
AI中文摘要

使用低秩适配(LoRA)微调大型语言模型(LLM)提供了一种资源高效的方式来实现个性化或专业化。然而,LoRA对超参数选择高度敏感,且穷举超参数搜索计算成本高昂。为此,我们提出一个贝叶斯优化(BO)框架,利用预训练LLM的领域知识来高效搜索LoRA超参数。我们的方法将预训练LLM重新用作离散到连续映射模块,将超参数及其领域知识链接到连续向量空间,在其中进行BO。我们通过语言提示设计和控制映射,提供描述超参数间关系及其各自角色的领域感知文本提示。这使我们能够以自然语言将关于LoRA的领域知识显式注入LLM。我们还引入一个额外的可学习标记,以捕获提示中难以用语言描述的残差信息。这有助于BO采样更多高性能超参数。此外,通过利用LoRA训练机制中从完整数据集和子集训练数据集获得的性能之间观察到的强相关性,我们引入使用数据子集的代理训练和评估。这显著提高了我们方法的效率。我们证明,仅需约30次迭代发现的超参数,相比从约45,000种组合中找到的标准超参数,实现了超过20%的性能提升。项目页面:https://baekseongeun.github.io/lora-bo/

英文摘要

Fine-tuning Large Language Models (LLMs) with Low-Rank Adaptation (LoRA) offers a resource-efficient way to personalize or specialize them. However, LoRA is highly sensitive to hyperparameter choices, and exhaustive hyperparameter search is computationally expensive. To address this, we propose a Bayesian Optimization (BO) framework that leverages the domain knowledge of pre-trained LLMs to efficiently search for LoRA hyperparameters. Our approach repurposes a pre-trained LLM as a discrete-to-continuous mapping module to link hyperparameters and their domain knowledge to a continuous vector space, where BO is conducted. We design and control the mapping via language prompting, providing a domain-aware textual prompt that describes the relationships among hyperparameters and their respective roles. This allows us to explicitly inject domain knowledge about LoRA into the LLM in natural language. We also introduce an additional learnable token to capture residual information that is difficult to describe linguistically in the prompt. This helps BO sample more high-performing hyperparameters. In addition, by leveraging the strong correlation observed between the performance obtained from full and subset training datasets in LoRA training regimes, we introduce proxy training and evaluation using a data subset. This significantly improves the efficiency of our method. We demonstrate that the hyperparameters discovered with only about 30 iterations achieve more than 20% performance improvement over standard hyperparameters found from about 45,000 combinations. Project page: https://baekseongeun.github.io/lora-bo/

2602.11065 2026-05-29 cs.CL cs.AI 版本更新

S-MARC: Causal Streaming Reasoning for Full-Duplex Conversational Behavior Modeling

S-MARC:全双工对话行为建模的因果流式推理

Dingkun Zhou, Shuchang Pan, Jiachen Lian, Siddharth Banerjee, Sarika Pasumarthy, Dhruv Hebbar, Siddhant Patel, Zeyi Austin Li, Kan Jen Cheng, Sanay Bordia, Krish Patel, Akshaj Gupta, Tingle Li, Gopala Anumanchipalli

发表机构 * University of California, Berkeley(加州大学伯克利分校) Zhejiang University(浙江大学) South China University of Technology(华南理工大学)

AI总结 提出S-MARC框架,通过流式因果层次建模意图到动作路径,预测高层交际功能和低层交互行为,并构建高质量语料库,实现全双工对话中的鲁棒行为检测与可解释推理。

详情
AI中文摘要

人类对话由隐式的思维链组织,并表现为时间结构化的对话行为。捕捉这一感知路径对于构建自然的全双工交互系统至关重要。我们提出了S-MARC(对话的流式因果建模与推理),一个用于对话行为建模与推理的流式、因果、层次化框架。通过形式化意图到动作的路径,S-MARC预测高层交际功能和低层交互行为,同时建模它们的因果和时间依赖关系。为支持这一设置,我们构建了一个高质量语料库,将可控、事件丰富的双工对话数据与行为标签配对。S-MARC将流式预测组织成持续演化的图结构,为其决策生成简洁的推理依据,并动态优化其推理过程。在合成和真实双工对话上的实验表明,S-MARC实现了鲁棒的行为检测,产生了可解释的推理链,并为全双工口语对话系统中的对话推理建立了基准基础。

英文摘要

Human conversation is organized by an implicit chain of thought and manifests as temporally structured conversational behaviors. Capturing this perceptual pathway is critical for building natural full-duplex interactive systems. We propose S-MARC (Streaming Causal Modeling and Reasoning for Conversation), a streaming, causal, and hierarchical framework for conversational behavior modeling and reasoning. By formalizing the intent-to-action pathway, S-MARC predicts high-level communicative functions and low-level interaction behaviors while modeling their causal and temporal dependencies. To support this setting, we construct a high-quality corpus that pairs controllable, event-rich duplex dialogue data with behavior labels. S-MARC organizes streaming predictions into a continuously evolving graph structure, generates concise justifications for its decisions, and dynamically optimizes its reasoning process. Experiments on synthetic and real duplex dialogues show that S-MARC achieves robust behavior detection, produces interpretable reasoning chains, and establishes a benchmark foundation for conversational reasoning in full-duplex spoken dialogue systems.

2602.08783 2026-05-29 cs.AI cs.CL 版本更新

Dynamics Within Latent Chain-of-Thought: An Empirical Study of Causal Structure

潜在思维链中的动态:因果结构的实证研究

Zirui Li, Xuefeng Bai, Kehai Chen, Yizhi Li, Jian Yang, Chenghua Lin, Min Zhang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) University of Manchester, United Kingdom(曼彻斯特大学) Beihang University, China(北京航空航天大学)

AI总结 通过结构因果模型对潜在思维链进行干预分析,揭示其因果结构、步骤间影响传播及与显式思维链的差异。

Comments Accepted to ICML 2026; 25 pages, 23 figures

详情
AI中文摘要

潜在或连续思维链方法用若干内部潜在步骤替代显式文本推理,但这些中间计算难以通过基于相关性的探针进行评估。本文将潜在思维链视为表示空间中的可操控因果过程,将潜在步骤建模为结构因果模型(SCM)中的变量,并通过逐步do-干预分析其效应。我们研究了两种代表性范式(即Coconut和CODI)在数学和通用推理任务上的表现,以探讨三个关键问题:(1)哪些步骤对正确性具有因果必要性,以及答案何时可早期解码;(2)影响如何在步骤间传播,以及这种结构与显式CoT相比如何;(3)中间轨迹是否保留竞争性答案模式,以及输出级承诺与步骤间表示级承诺的差异。我们发现潜在步骤预算更像分阶段功能而非同质化额外深度,并具有非局部路由特性,同时识别出早期输出偏差与后期表示承诺之间的持续差距。这些结果促使我们采用模式条件化和稳定性感知分析,以及相应的训练/解码目标,作为解释和改进潜在推理系统的更可靠工具。代码见https://github.com/J1mL1/causal-latent-cot。

英文摘要

Latent or continuous chain-of-thought methods replace explicit textual rationales with a number of internal latent steps, but these intermediate computations are difficult to evaluate beyond correlation-based probes. In this paper, we view latent chain-of-thought as a manipulable causal process in representation space by modeling latent steps as variables in a structural causal model (SCM) and analyzing their effects through step-wise do-interventions. We study two representative paradigms (i.e., Coconut and CODI) on both mathematical and general reasoning tasks to investigate three key questions: (1) which steps are causally necessary for correctness and when answers become decodable early; (2) how influence propagates across steps and how this structure compares to explicit CoT; and (3) whether intermediate trajectories retain competing answer modes and how output-level commitment differs from representational commitment across steps. We find that latent-step budgets behave less like homogeneous extra depth and more like staged functionality with non-local routing, and we identify a persistent gap between early output bias and late representational commitment. These results motivate mode-conditional and stability-aware analyses, together with corresponding training/decoding objectives, as more reliable tools for interpreting and improving latent reasoning systems. Code is available at https://github.com/J1mL1/causal-latent-cot.
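A step-wise do-intervention can be illustrated on a toy chain: overwrite one intermediate state, let the downstream steps recompute, and compare the final output with the unperturbed run. The scalar "latent" states and step functions below are stand-ins chosen for clarity, not the Coconut or CODI models studied in the paper.

```python
def run_latent_chain(h0, step_fns, intervene_at=None, value=None):
    """Run a chain of latent steps, optionally applying do(h_t = value).

    The intervention overwrites step t's output, which cuts the influence
    of all earlier steps on everything downstream of t.
    """
    h = h0
    trace = []
    for t, step in enumerate(step_fns):
        h = step(h)
        if t == intervene_at:
            h = value
        trace.append(h)
    return trace


# Hypothetical three-step chain.
steps = [lambda h: h + 1, lambda h: h * 2, lambda h: h - 3]
baseline = run_latent_chain(1, steps)
intervened = run_latent_chain(1, steps, intervene_at=1, value=0)
effect_on_output = intervened[-1] - baseline[-1]
```

Repeating this for every step and a range of replacement values gives a per-step picture of which states the final answer actually depends on, which is the kind of necessity question the abstract raises.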

2602.02849 2026-05-29 cs.AI 版本更新

AutoSizer: Automatic Sizing of Analog and Mixed-Signal Circuits via Large Language Model (LLM) Agents

AutoSizer: 通过大语言模型代理自动调整模拟和混合信号电路的尺寸

Xi Yu, Dmitrii Torbunov, Soumyajit Mandal, Yihui Ren

发表机构 * Artificial Intelligence Department, Brookhaven National Laboratory, Upton, NY(布鲁克海文国家实验室人工智能部) Instrumentation Department, Brookhaven National Laboratory, Upton, NY 11973(布鲁克海文国家实验室仪器部)

AI总结 提出AutoSizer,一种反射式LLM驱动的元优化框架,通过双循环结构统一电路理解、自适应搜索空间构建和优化编排,在模拟和混合信号电路尺寸调整中实现更优解质量、更快收敛和更高成功率。

详情
AI中文摘要

模拟和混合信号(AMS)集成电路的设计仍然严重依赖专家知识,其中晶体管尺寸调整由于非线性行为、高维设计空间和严格的性能约束而成为主要瓶颈。现有的电子设计自动化(EDA)方法通常将尺寸调整视为静态黑箱优化,导致解决方案效率低下且鲁棒性不足。尽管大语言模型(LLM)展现出强大的推理能力,但它们并不适合AMS尺寸调整中的精确数值优化。为弥补这一差距,我们提出AutoSizer,一种反射式LLM驱动的元优化框架,以闭环方式统一电路理解、自适应搜索空间构建和优化编排。它采用双循环优化框架,内循环负责电路尺寸调整,外循环分析优化动态和约束,从仿真反馈中迭代优化搜索空间。我们进一步引入AMS-SizingBench,一个包含SKY130 CMOS技术中24种不同AMS电路的开源基准,旨在评估在基于仿真器的现实约束下的自适应优化策略。实验表明,AutoSizer在不同电路难度下实现了更高的解质量、更快的收敛速度和更高的成功率,优于传统优化方法和现有的基于LLM的代理。

英文摘要

The design of Analog and Mixed-Signal (AMS) integrated circuits remains heavily reliant on expert knowledge, with transistor sizing a major bottleneck due to nonlinear behavior, high-dimensional design spaces, and strict performance constraints. Existing Electronic Design Automation (EDA) methods typically frame sizing as static black-box optimization, resulting in inefficient and less robust solutions. Although Large Language Models (LLMs) exhibit strong reasoning abilities, they are not suited for precise numerical optimization in AMS sizing. To address this gap, we propose AutoSizer, a reflective LLM-driven meta-optimization framework that unifies circuit understanding, adaptive search-space construction, and optimization orchestration in a closed loop. It employs a two-loop optimization framework, with an inner loop for circuit sizing and an outer loop that analyzes optimization dynamics and constraints to iteratively refine the search space from simulation feedback. We further introduce AMS-SizingBench, an open benchmark comprising 24 diverse AMS circuits in SKY130 CMOS technology, designed to evaluate adaptive optimization policies under realistic simulator-based constraints. AutoSizer experimentally achieves higher solution quality, faster convergence, and higher success rate across varying circuit difficulties, outperforming both traditional optimization methods and existing LLM-based agents.
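The two-loop structure can be sketched with a plain random-search inner loop and an outer loop that narrows the search space around the best design found so far. In AutoSizer the outer loop is an LLM analysing optimization dynamics and constraints; the fixed shrink heuristic, the quadratic `toy_cost`, and every name below are assumptions used only to show the control flow.

```python
import random


def inner_search(cost, bounds, n_samples, rng):
    """Inner loop: sample candidate sizings inside the current bounds."""
    best_x, best_c = None, float("inf")
    for _ in range(n_samples):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        c = cost(x)
        if c < best_c:
            best_x, best_c = x, c
    return best_x, best_c


def two_loop_sizing(cost, bounds, outer_rounds=5, n_samples=200, shrink=0.5, seed=0):
    """Outer loop: after each inner search, shrink every interval around
    the incumbent (a stand-in for LLM-driven search-space refinement)."""
    rng = random.Random(seed)
    best_x, best_c = None, float("inf")
    for _ in range(outer_rounds):
        x, c = inner_search(cost, bounds, n_samples, rng)
        if c < best_c:
            best_x, best_c = x, c
        new_bounds = []
        for (lo, hi), xi in zip(bounds, best_x):
            half = 0.5 * shrink * (hi - lo)
            new_bounds.append((max(lo, xi - half), min(hi, xi + half)))
        bounds = new_bounds
    return best_x, best_c


def toy_cost(x):
    # Hypothetical stand-in for a simulator-derived cost.
    return (x[0] - 3.0) ** 2 + (x[1] + 1.0) ** 2


start_bounds = [(-10.0, 10.0), (-10.0, 10.0)]
best_design, best_cost = two_loop_sizing(toy_cost, start_bounds)
```

Because every refined interval is clipped to the previous one, the search never leaves the original design ranges, while later rounds spend their sample budget in an increasingly small region.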

2602.02751 2026-05-29 cs.MA cs.AI cs.CL 版本更新

Scaling Small Agents Through Strategy Auctions

通过策略拍卖扩展小型智能体

Lisa Alazraki, William F. Shen, Yoram Bachrach, Akhil Mathur

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) Imperial College London(伦敦帝国理工学院) University of Cambridge(剑桥大学)

AI总结 针对小型语言模型在复杂任务中性能不足的问题,提出受自由职业市场启发的SALE框架,通过策略拍卖实现任务分配与测试时自我改进,在降低对大型模型依赖和成本的同时提升性能。

Comments ICML 2026

详情
AI中文摘要

小型语言模型越来越被视为一种有前景、成本效益高的智能体AI方法,支持者声称它们对于智能体工作流已经足够有能力。然而,尽管较小的智能体在简单任务上能与较大的智能体紧密匹配,但它们的性能如何随任务复杂性扩展、何时需要大型模型以及如何更好地利用小型智能体处理长期工作负载仍不清楚。在这项工作中,我们通过实验表明,小型智能体的性能在深度搜索和编码任务上无法随任务复杂性扩展,并引入了受自由职业市场启发的SALE(Strategy Auctions for Workload Efficiency)智能体框架。在SALE中,智能体用简短的战略计划进行投标,这些计划通过系统性的成本-价值机制评分,并通过共享的拍卖记忆进行优化,从而无需训练单独的路由器或运行所有模型至完成即可实现每任务路由和持续自我改进。在复杂度不同的深度搜索和编码任务中,SALE将最大智能体的依赖度降低了52%,总成本降低了35%,并且始终优于最大智能体的pass@1,仅增加了可忽略的额外开销(超出执行最终轨迹的部分)。相比之下,依赖任务描述的现有路由器要么表现不如最大智能体,要么未能降低成本,通常两者兼有,凸显了它们对智能体工作流的不适用性。这些结果表明,尽管小型智能体可能不足以处理复杂工作负载,但通过协调的任务分配和测试时自我改进,它们可以有效地“扩展”。更广泛地说,它们激发了对智能体AI的系统级观点,即性能提升更多来自市场启发的协调机制(将异构智能体组织成高效、自适应的生态系统),而非日益庞大的单个模型。

英文摘要

Small language models are increasingly viewed as a promising, cost-effective approach to agentic AI, with proponents claiming they are sufficiently capable for agentic workflows. However, while smaller agents can closely match larger ones on simple tasks, it remains unclear how their performance scales with task complexity, when large models become necessary, and how to better leverage small agents for long-horizon workloads. In this work, we empirically show that small agents' performance fails to scale with task complexity on deep search and coding tasks, and we introduce Strategy Auctions for Workload Efficiency (SALE), an agent framework inspired by freelancer marketplaces. In SALE, agents bid with short strategic plans, which are scored by a systematic cost-value mechanism and refined via a shared auction memory, enabling per-task routing and continual self-improvement without training a separate router or running all models to completion. Across deep search and coding tasks of varying complexity, SALE reduces reliance on the largest agent by 52%, lowers overall cost by 35%, and consistently improves upon the largest agent's pass@1 with only a negligible overhead beyond executing the final trace. In contrast, established routers that rely on task descriptions either underperform the largest agent or fail to reduce cost, often both, underscoring their poor fit for agentic workflows. These results suggest that while small agents may be insufficient for complex workloads, they can be effectively "scaled up" through coordinated task allocation and test-time self-improvement. More broadly, they motivate a systems-level view of agentic AI in which performance gains come less from ever-larger individual models and more from market-inspired coordination mechanisms that organize heterogeneous agents into efficient, adaptive ecosystems.
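The bidding step can be pictured as each agent submitting a short plan with an estimated value and cost, and the auction picking the best-scoring bid. The abstract does not specify the cost-value mechanism, so the linear score below is an assumption, as are the `Bid` fields and the toy numbers. The shared auction memory is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    agent: str
    plan: str
    est_value: float  # assumed: predicted chance the plan solves the task
    est_cost: float   # assumed: predicted execution cost, same scale as value


def run_auction(bids, cost_weight=1.0):
    """Return the winning bid under score = est_value - cost_weight * est_cost."""
    return max(bids, key=lambda b: b.est_value - cost_weight * b.est_cost)


easy_task = [
    Bid("small", "single lookup, then answer", est_value=0.90, est_cost=0.10),
    Bid("large", "single lookup, then answer", est_value=0.95, est_cost=0.50),
]
hard_task = [
    Bid("small", "search, cross-check, synthesize", est_value=0.20, est_cost=0.10),
    Bid("large", "search, cross-check, synthesize", est_value=0.90, est_cost=0.50),
]
easy_winner = run_auction(easy_task).agent
hard_winner = run_auction(hard_task).agent
```

Only the winning plan is executed, which is why a scheme like this can avoid running every model to completion: the small agent takes the easy task and the large agent is reserved for the hard one.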

2602.01869 2026-05-29 cs.AI 版本更新

Skill-Pro: Learning Reusable Skills from Experience via Non-Parametric PPO for LLM Agents

Skill-Pro: 通过非参数PPO从经验中学习可复用技能以用于LLM智能体

Qirui Mi, Zhijian Ma, Mengyue Yang, Haoxuan Li, Yisen Wang, Haifeng Zhang, Jun Wang

发表机构 * Key Laboratory of Interdisciplinary Research of Computation and Economics(计算与经济学交叉研究重点实验室) Shanghai University of Finance and Economics(上海财经大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) School of Artificial Intelligence, Chinese Academy of Sciences(中国科学院人工智能学院) University of Bristol(布里斯托大学) Peking University(北京大学) University College London(伦敦大学学院)

AI总结 提出Skill-Pro框架,通过非参数PPO从交互经验中自动学习可复用的程序性技能,无需参数更新,实现高效经验重用和长期自主性。

Comments Accepted at ICML 2026 (spotlight); 22 Pages, 6 Figures, 5 Tables

详情
AI中文摘要

基于LLM的智能体在序列决策中表现出色,但通常依赖即时推理,即使在重复场景中也会重新推导解决方案。这种经验重用不足导致计算冗余和不稳定性。为弥补这一差距,我们提出Skill-Pro,一个使智能体能够从交互经验中自主学习可复用程序性技能而无需参数更新的框架。通过形式化Skill-MDP,Skill-Pro将被动的情节叙述转化为由激活、执行和终止条件定义的可执行技能,以确保可执行性。为了实现可靠的可重用性而不降低能力,我们引入非参数PPO,它利用语义梯度进行高质量候选生成,并使用PPO Gate进行稳健的技能验证。通过基于分数的维护,Skill-Pro维持紧凑、高质量的程序性记忆。在域内、跨任务和跨智能体场景下的实验结果表明,Skill-Pro实现了卓越的重用率和在极端内存压缩下的显著增益。可视化的进化轨迹和技能分布进一步揭示了Skill-Pro如何透明地积累、精炼和重用程序性知识以促进长期自主性。

英文摘要

LLM-driven agents excel at sequential decision-making but often rely on on-the-fly reasoning, re-deriving solutions even in recurring scenarios. This insufficient experience reuse leads to computational redundancy and instability. To bridge this gap, we propose Skill-Pro, a framework enabling agents to autonomously learn reusable procedural skills from interaction experiences without parameter updates. By formalizing a Skill-MDP, Skill-Pro transforms passive episodic narratives into executable Skills defined by activation, execution, and termination conditions to ensure executability. To achieve reliable reusability without capability degradation, we introduce Non-Parametric PPO, which leverages semantic gradients for high-quality candidate generation and a PPO Gate for robust Skill verification. Through score-based maintenance, Skill-Pro sustains compact, high-quality procedural memory. Experimental results across in-domain, cross-task, and cross-agent scenarios demonstrate that Skill-Pro achieves superior reuse rates and significant gains with extreme memory compression. Visualized evolutionary trajectories and Skill distributions further reveal how Skill-Pro transparently accumulates, refines, and reuses procedural knowledge to facilitate long-term autonomy.
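A skill defined by activation, execution, and termination conditions, together with score-based memory maintenance, might be represented as follows. The field names, the dictionary state, and the keep-top-scores pruning rule are illustrative assumptions. The Skill-MDP formalization and the Non-Parametric PPO update are not reproduced here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Skill:
    name: str
    activation: Callable[[dict], bool]   # when the skill may start
    execute: Callable[[dict], dict]      # one step of the procedure
    termination: Callable[[dict], bool]  # when the skill is finished
    score: float = 0.0                   # running usefulness estimate


def run_skill(skill, state, max_steps=10):
    """Run a skill to termination; return (final_state, completed)."""
    if not skill.activation(state):
        return state, False
    for _ in range(max_steps):
        state = skill.execute(state)
        if skill.termination(state):
            return state, True
    return state, False


def prune_memory(skills, capacity):
    """Score-based maintenance: keep only the highest-scoring skills."""
    return sorted(skills, key=lambda s: s.score, reverse=True)[:capacity]


count_to_three = Skill(
    name="count_to_three",
    activation=lambda s: s["n"] < 3,
    execute=lambda s: {"n": s["n"] + 1},
    termination=lambda s: s["n"] >= 3,
    score=0.9,
)
final_state, completed = run_skill(count_to_three, {"n": 0})
```

Keeping activation and termination as explicit predicates is what makes a stored skill checkable before reuse, rather than a free-form narrative of a past episode.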

2601.21909 2026-05-29 cs.AI cs.CL 版本更新

From Meta-Thought to Execution: Cognitively Aligned Post-Training for Generalizable and Reliable LLM Reasoning

从元思维到执行:面向通用且可靠的大语言模型推理的认知对齐后训练

Shaojie Wang, Liang Zhang

发表机构 * Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出一种认知启发的两阶段后训练框架,通过元思维链监督学习通用策略和置信度校准强化学习优化执行可靠性,在分布内和分布外分别提升2.10%和3.86%。

详情
AI中文摘要

当前的大语言模型后训练方法通过监督微调(SFT)后接基于结果的强化学习(RL)来优化完整的推理轨迹。虽然有效,但仔细审视发现一个根本差距:这种方法与人类实际解决问题的方式不一致。人类认知自然地将问题解决分解为两个不同的阶段:首先获取跨问题泛化的抽象策略(即元知识),然后将其适应到具体实例。相比之下,通过将完整轨迹视为基本单元,当前方法本质上是问题中心的,将抽象策略与问题特定的执行纠缠在一起。为了解决这种错位,我们提出了一个认知启发的框架,明确地模仿人类认知的两阶段过程。具体而言,元思维链(CoMT)将监督学习聚焦于抽象推理模式而不涉及具体执行,从而能够获取可泛化的策略。然后,置信度校准强化学习(CCRL)通过中间步骤上的置信度感知奖励来优化任务适应,防止过度自信的错误级联并提高执行可靠性。在四个模型和十个基准上的实验表明,与标准方法相比,分布内和分布外分别提升了2.10%和3.86%,同时对教师模型选择、优化方法和符号扰动的变化保持高度鲁棒。

英文摘要

Current LLM post-training methods optimize complete reasoning trajectories through Supervised Fine-Tuning (SFT) followed by outcome-based Reinforcement Learning (RL). While effective, a closer examination reveals a fundamental gap: this approach does not align with how humans actually solve problems. Human cognition naturally decomposes problem-solving into two distinct stages: first acquiring abstract strategies (i.e., meta-knowledge) that generalize across problems, then adapting them to specific instances. In contrast, by treating complete trajectories as basic units, current methods are inherently problem-centric, entangling abstract strategies with problem-specific execution. To address this misalignment, we propose a cognitively-inspired framework that explicitly mirrors the two-stage human cognitive process. Specifically, Chain-of-Meta-Thought (CoMT) focuses supervised learning on abstract reasoning patterns without specific executions, enabling acquisition of generalizable strategies. Confidence-Calibrated Reinforcement Learning (CCRL) then optimizes task adaptation via confidence-aware rewards on intermediate steps, preventing overconfident errors from cascading and improving execution reliability. Experiments across four models and ten benchmarks show 2.10% and 3.86% improvements in-distribution and out-of-distribution, respectively, over standard methods, while remaining highly robust to variations in teacher model selection, optimization methods, and symbolic perturbations.

2601.19947 2026-05-29 cs.LG cs.AI cs.CV 版本更新

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

NCSAM: 噪声补偿的锐度感知最小化用于噪声标签学习

Jiayu Xu, Junbiao Pang

发表机构 * Beijing University of Technology(北京工业大学)

AI总结 提出NCSAM方法,通过噪声补偿扰动修正噪声标签引起的优化偏差,缓解对噪声标签的记忆,在合成和真实噪声标签基准上优于SAM基线。

Comments 11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

详情
AI中文摘要

从噪声标签学习(LNL)仍然是深度学习中的一个基本挑战,因为现实世界的数据集通常包含损坏的注释。大多数现有方法依赖于标签校正或样本选择机制。相比之下,我们从优化角度研究LNL,通过建立标签噪声与锐度感知最小化(SAM)的平坦性寻求行为之间的理论联系。基于此分析,我们提出了噪声补偿的锐度感知最小化(NCSAM),它使用噪声补偿扰动来抵消由噪声标签引起的优化偏差。通过纠正失真的SAM扰动,NCSAM在训练过程中减轻了对噪声标签的记忆,同时保持了基于优化的学习的简单性。在合成和真实噪声标签基准上的实验表明,NCSAM在基于SAM的优化基线上持续改进,并与代表性的噪声标签学习方法保持竞争力。

英文摘要

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.
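As background for the perturbation that NCSAM modifies, the sketch below implements one step of standard Sharpness-Aware Minimization: take the gradient, step a distance `rho` along its normalized direction, and descend using the gradient evaluated at that perturbed point. The noise-compensated perturbation itself is not specified in the abstract and is not implemented; the quadratic loss is a hypothetical test problem.

```python
import math


def sam_step(w, grad_fn, lr=0.05, rho=0.05):
    """One standard SAM update on a list of weights."""
    g = grad_fn(w)
    norm = math.sqrt(sum(gi * gi for gi in g)) or 1.0  # guard against a zero gradient
    eps = [rho * gi / norm for gi in g]                # ascent step toward higher loss
    g_perturbed = grad_fn([wi + ei for wi, ei in zip(w, eps)])
    return [wi - lr * gi for wi, gi in zip(w, g_perturbed)]


def toy_loss(w):
    # Hypothetical ill-conditioned quadratic.
    return 0.5 * (w[0] ** 2 + 10.0 * w[1] ** 2)


def toy_grad(w):
    return [w[0], 10.0 * w[1]]


weights = [1.0, 1.0]
for _ in range(100):
    weights = sam_step(weights, toy_grad)
final_loss = toy_loss(weights)
```

Any method that corrects the SAM perturbation, as the abstract describes, would change how `eps` is formed while leaving the rest of the step unchanged.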

2601.14758 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

从自回归到掩码扩散语言模型的后训练中的机制转变

Injin Kong, Hyoungjoon Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Biosystems & Biomaterials Science and Engineering, Seoul National University(首尔国立大学生物系统与生物材料科学与工程系)

AI总结 通过比较电路分析,发现后训练得到的掩码扩散模型在结构上根据任务保留或重组自回归电路,在语义上从局部专业化转向分布式整合,表明扩散后训练是内部计算的深度重组。

详情
AI中文摘要

将预训练的自回归模型(ARMs)后训练为掩码扩散模型(MDMs)已成为一种克服顺序生成局限性的经济有效方法。然而,后训练的MDMs是否获得了真正的新计算机制,还是仅仅以非自回归形式重新表达了自回归计算,仍不清楚。通过对ARMs及其从相同骨干网络后训练得到的MDM对应物进行电路比较分析,我们揭示了两个互补的重组轴。在结构上,转变是任务依赖的:MDMs在局部因果任务上保留自回归电路,但在全局任务上放弃继承的路径并将计算前置到早期层。在语义上,转变在不同机制间是一致的:ARMs中尖锐的局部专业化让位于MDMs中的分布式整合。这些发现共同表明,扩散后训练并非生成过程的表面变化,而是内部计算的重组,其深度取决于任务。

英文摘要

Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome the limitations of sequential generation. Yet it remains unclear whether post-trained MDMs acquire genuinely new computational mechanisms or merely re-express autoregressive computation in a non-autoregressive form. Through a comparative circuit analysis of ARMs and their MDM counterparts post-trained from the same backbones, we uncover two complementary axes of reorganization. Structurally, the shift is task-dependent: MDMs preserve autoregressive circuitry on locally causal tasks but abandon inherited pathways and front-load computation into early layers on global tasks. Semantically, the shift is consistent across regimes: sharp, localized specialization in ARMs gives way to distributed integration in MDMs. Together, these findings show that diffusion post-training is not a surface-level change in the generation procedure but a reorganization of internal computation whose depth depends on the task.

2601.13111 2026-05-29 cs.CL cs.AI cs.IR 版本更新

CORE-T: COherent REtrieval of Tables for Text-to-SQL

CORE-T: 面向文本到SQL的表格连贯检索

Hassan Soliman, Vivek Gupta, Dan Roth, Iryna Gurevych

发表机构 * Ubiquitous Knowledge Processing Lab (UKP Lab), Department of Computer Science TU Darmstadt and National Research Center for Applied Cybersecurity ATHENE, Germany(普适知识处理实验室(UKP实验室),计算机科学系 TU Darmstadt 和应用网络安全国家研究中心 ATHENE,德国) Arizona State University(亚利桑那州立大学) University of Pennsylvania(宾夕法尼亚大学)

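AI summary Proposes CORE-T, a training-free framework that uses LLM-generated metadata and a precomputed compatibility cache to efficiently retrieve coherent, joinable sets of tables from heterogeneous table collections, improving table-selection F1 by up to 22.7 points while returning up to 40% fewer tables.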

Comments Preprint is revised and under review. Code and data available at: https://github.com/UKPLab/arxiv2026-core-t

详情
AI中文摘要

现实中的文本到SQL工作流通常需要连接多个表格。因此,准确检索相关表集合成为端到端性能的关键瓶颈。我们研究一种开放书设置,其中查询必须从多个来源汇集的大规模异构表集合中回答,且没有数据库标识符等清晰的限定信号。在此设置下,密集检索(DR)实现了高召回率但返回大量干扰项,而考虑连接的方法通常依赖额外假设和/或产生高推理开销。我们提出CORE-T,一个可扩展、无需训练的框架,通过LLM生成的用途元数据丰富表格,并预计算轻量级表兼容性缓存。推理时,DR返回前K个候选;单次LLM调用选择一个连贯、可连接的子集,然后两步加法调整阶段恢复强兼容的表。在Bird、Spider、MMQA和Beaver上,CORE-T在表选择F1上比DR提升最多22.7点,同时返回的表减少最多40%,在多表执行准确率上提升最多24.4点,并且使用的总选择token比LLM密集型基线少1.64-4.20倍。

英文摘要

Realistic text-to-SQL workflows often require joining multiple tables. As a result, accurately retrieving the relevant set of tables becomes a key bottleneck for end-to-end performance. We study an open-book setting where queries must be answered over large, heterogeneous table collections pooled from many sources, without clean scoping signals such as database identifiers. Here, dense retrieval (DR) achieves high recall but returns many distractors, while join-aware alternatives often rely on extra assumptions and/or incur high inference overhead. We propose CORE-T, a scalable, training-free framework that enriches tables with LLM-generated purpose metadata and pre-computes a lightweight table-compatibility cache. At inference time, DR returns top-K candidates; a single LLM call selects a coherent, joinable subset, and a two-step additive adjustment stage restores strongly compatible tables. Across Bird, Spider, MMQA, and Beaver, CORE-T improves over DR by up to 22.7 points in table-selection F1 while returning up to 40% fewer tables, and by up to 24.4 points in multi-table execution accuracy, and uses 1.64-4.20x fewer total selection tokens than LLM-intensive baselines.
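
The table-selection F1 reported above can be illustrated with a small sketch. The abstract does not say how the metric is aggregated; a common convention, assumed here, is set-based precision, recall, and F1 per query between the retrieved tables and the gold tables. The table names are hypothetical.

```python
def table_selection_f1(predicted, gold):
    """Set-based precision, recall, and F1 between predicted and gold table names."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Hypothetical query: one distractor table is returned alongside the two gold tables.
p, r, f1 = table_selection_f1({"orders", "customers", "products"}, {"orders", "customers"})
```

In this example recall is perfect but the distractor lowers precision, which is the failure mode the abstract attributes to dense retrieval; returning fewer, more coherent tables raises F1 without hurting recall.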

2601.11178 2026-05-29 cs.AI cs.CL cs.MM cs.SI 版本更新

TANDEM: Temporal-Aware Neural Detection for Multimodal Hate Speech

TANDEM: 面向多模态仇恨言论的时间感知神经检测

Girish A. Koushik, Helen Treharne, Diptesh Kanojia

发表机构 * Nature-Inspired Computing & Engineering, University of Surrey(Surrey大学自然启发计算与工程系) Surrey Centre for Cyber Security, University of Surrey(Surrey大学网络安全中心)

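AI summary Proposes TANDEM, a unified framework that jointly optimizes vision-language and audio-language models through a tandem reinforcement learning strategy, recasting audio-visual hate detection as a structured reasoning problem; it reaches 0.73 F1 in target identification on HateMM (a 30% improvement) while maintaining precise temporal grounding.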

Comments Under review at ICWSM 2027

详情
AI中文摘要

社交媒体平台日益被长篇多模态内容主导,其中有害叙事通过音频、视觉和文本线索的复杂交互构建。虽然自动化系统能以高准确率标记仇恨言论,但它们通常作为“黑箱”运作,无法提供细粒度、可解释的证据(如精确时间戳和目标身份),而这对于有效的人机协同审核是必需的。在这项工作中,我们提出了TANDEM,一个统一框架,将音频-视觉仇恨检测从二元分类任务转化为结构化推理问题。我们的方法采用一种新颖的串联强化学习策略,其中视觉-语言和音频-语言模型通过自约束跨模态上下文相互优化,在无需密集帧级监督的情况下,稳定地推理长时序列。在三个基准数据集上的实验表明,TANDEM显著优于零样本和上下文增强基线,在HateMM上目标识别F1达到0.73(比现有最佳方法提升30%),同时保持精确的时间定位。我们进一步观察到,虽然二元检测是鲁棒的,但由于固有的标签模糊性和数据集不平衡,在多类设置中区分攻击性和仇恨性内容仍然具有挑战性。更广泛地说,我们的发现表明,即使在复杂的多模态环境中,结构化、可解释的对齐也是可实现的,为下一代透明且可操作的在线安全审核工具提供了蓝图。

英文摘要

Social media platforms are increasingly dominated by long-form multimodal content, where harmful narratives are constructed through a complex interplay of audio, visual, and textual cues. While automated systems can flag hate speech with high accuracy, they often function as "black boxes" that fail to provide the granular, interpretable evidence, such as precise timestamps and target identities, required for effective human-in-the-loop moderation. In this work, we introduce TANDEM, a unified framework that transforms audio-visual hate detection from a binary classification task into a structured reasoning problem. Our approach employs a novel tandem reinforcement learning strategy where vision-language and audio-language models optimize each other through self-constrained cross-modal context, stabilizing reasoning over extended temporal sequences without requiring dense frame-level supervision. Experiments across three benchmark datasets demonstrate that TANDEM significantly outperforms zero-shot and context-augmented baselines, achieving 0.73 F1 in target identification on HateMM (a 30% improvement over state-of-the-art) while maintaining precise temporal grounding. We further observe that while binary detection is robust, differentiating between offensive and hateful content remains challenging in multi-class settings due to inherent label ambiguity and dataset imbalance. More broadly, our findings suggest that structured, interpretable alignment is achievable even in complex multimodal settings, offering a blueprint for the next generation of transparent and actionable online safety moderation tools.

2601.04765 2026-05-29 cs.CL cs.AI cs.LG physics.comp-ph 版本更新

Differential syntactic and semantic encoding in LLMs

大型语言模型中句法与语义的差异编码

Santiago Acevedo, Alessandro Laio, Marco Baroni

发表机构 * Catalan Institute of Research and Advanced Studies (ICREA) and Universitat Pompeu Fabra (UPF)(加泰罗尼亚研究与高级科学研究所(ICREA)和庞培法华大学(UPF))

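AI summary By averaging hidden-representation vectors of sentences that share syntactic structure or meaning, this study finds that syntactic and semantic information in the inner-layer representations of large language models (with DeepSeek-V3 as the example) is at least partially linearly encoded, and that the two have different encoding profiles and can be decoupled to some extent.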

Comments Published as conference paper at ICML 2026

详情
AI中文摘要

我们研究了句法和语义信息如何在大型语言模型(LLMs)的内部层表示中编码,重点关注非常大的DeepSeek-V3。我们发现,通过平均共享句法结构或语义的句子的隐藏表示向量,我们得到了能够捕获表示中相当大比例的句法和语义信息的向量。特别是,从句子向量中减去这些句法和语义“质心”会强烈影响它们与句法和语义匹配句子的相似性,这表明句法和语义至少部分地线性编码。我们还发现句法和语义的跨层编码轮廓不同,并且这两种信号可以在一定程度上解耦,这表明LLM表示中这两种语言信息的差异编码。

英文摘要

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
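
The centroid-subtraction probe described above can be sketched on toy data. The three-dimensional vectors below are made-up stand-ins for hidden representations of sentences sharing one syntactic template, and cosine similarity is an assumed choice of similarity measure; the abstract does not specify the one used.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of equally sized vectors (the 'centroid')."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def subtract(a, b):
    return [x - y for x, y in zip(a, b)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up hidden vectors for three sentences that share a syntactic structure.
same_syntax = [[1.0, 0.2, 0.1], [0.9, -0.1, 0.3], [1.1, 0.1, -0.2]]
centroid = mean_vector(same_syntax)

before = cosine(same_syntax[0], same_syntax[1])
after = cosine(subtract(same_syntax[0], centroid), subtract(same_syntax[1], centroid))
```

If the shared structure is linearly encoded, removing the centroid should sharply reduce similarity between matched sentences, which is what `after < before` shows on this toy example.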

2512.19199 2026-05-29 cs.LG cs.AI 版本更新

On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

基于Koopman的多任务深度学习泛化界

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

发表机构 * Free University of Bozen-Bolzano(博兹纳-博尔扎诺自由大学) University of Catania(卡塔尼亚大学) University of Florida(佛罗里达大学)

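AI summary Uses operator-theoretic techniques to establish generalization bounds for multi-task deep neural networks; by exploiting small condition numbers of the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space, it obtains a bound tighter than conventional norm-based methods that remains valid in the single-output setting and improves on existing Koopman-based bounds.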

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

详情
Journal ref
Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 376--392
AI中文摘要

本文利用算子理论技术建立了多任务深度神经网络的泛化界。作者通过利用权重矩阵中的小条件数并引入定制的Sobolev空间作为扩展假设空间,提出了比传统基于范数的方法更紧的界。该增强的界即使在单输出设置下仍然有效,优于现有的基于Koopman的界。所得框架保持了关键优势,如灵活性和与网络宽度无关,为核方法背景下的多任务深度学习提供了更精确的理论理解。

英文摘要

The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm-based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single-output settings, outperforming existing Koopman-based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.

2512.19184 2026-05-29 cs.LG cs.AI 版本更新

Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

基于算子的深度学习泛化界:多任务学习的洞见

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

发表机构 * Free University of Bozen-Bolzano(博兹纳-博尔扎诺自由大学) University of Catania(卡塔尼亚大学) University of Florida(佛罗里达大学)

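AI summary Within an operator-theoretic framework, combines a Koopman-based approach with existing techniques to obtain tighter generalization bounds for vector-valued neural networks and deep kernel methods, introduces sketching techniques to reduce computational cost, and proposes a deep vector-valued reproducing kernel Hilbert space framework that uses Perron-Frobenius operators to enhance deep kernel methods, deriving a new Rademacher generalization bound that addresses underfitting and overfitting.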

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

详情
Journal ref
Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 120--137
AI中文摘要

本文提出了向量值神经网络和深度核方法的新型泛化界,通过算子理论框架聚焦多任务学习。我们的关键发展在于策略性地将基于Koopman的方法与现有技术相结合,实现了比传统基于范数的界更紧的泛化保证。为缓解基于Koopman方法的计算挑战,我们引入了适用于向量值神经网络的草图技术。这些技术在一般Lipschitz损失下给出了超额风险界,为包括鲁棒回归和多重分位数回归在内的应用提供了性能保证。此外,我们提出了一个新的深度学习框架——深度向量值再生核希尔伯特空间(vvRKHS),利用Perron-Frobenius(PF)算子增强深度核方法。我们为该框架推导了新的Rademacher泛化界,通过核精炼策略明确处理欠拟合和过拟合。这项工作为深度学习架构下的多任务学习泛化性质提供了新颖洞见,该领域直到最近才有所发展。

英文摘要

This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman-based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector-valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron-Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multitask learning with deep learning architectures, an area that has been relatively unexplored until recent developments.

2512.14754 2026-05-29 cs.SE cs.AI cs.CL 版本更新

Revisiting the Reliability of Language Models in Instruction-Following

重新审视指令跟随中语言模型的可靠性

Jianshuo Dong, Yutong Zhang, Yan Liu, Zhenyu Zhong, Tao Wei, Chao Zhang, Han Qiu

发表机构 * Tsinghua University(清华大学) Ant Group(蚂蚁集团)

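AI summary Proposes the reliable@k metric and an automated pipeline for generating cousin prompts, builds the IFEval++ benchmark, finds that current models' performance drops by up to 61.8% on prompts with subtle differences, and explores three improvement methods.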

Comments ACL 2026 main oral

详情
AI中文摘要

先进的LLM在IFEval等基准测试中已达到接近上限的指令跟随准确率。然而,这些令人印象深刻的分数并不一定能转化为实际使用中的可靠服务,因为用户经常改变他们的措辞、上下文框架和任务表述。在本文中,我们研究面向细微差异的可靠性:模型是否在传达类似用户意图但具有细微差异的相似提示中表现出一致的能力。为了量化这一点,我们引入了一个新的指标,可靠@k,并开发了一个自动化流水线,通过数据增强生成高质量的相似提示。在此基础上,我们构建了IFEval++用于系统评估。在20个专有和26个开源LLM中,我们发现当前模型在面向细微差异的可靠性方面存在显著不足——它们的性能在细微提示修改下可能下降高达61.8%。此外,我们对其进行了表征,并探索了三种潜在的改进方法。我们的发现强调了面向细微差异的可靠性是朝着更可靠和可信的LLM行为迈出的关键但尚未充分探索的下一步。我们的代码和基准可访问:https://github.com/jianshuod/IFEval-pp。

英文摘要

Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable services in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents but with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building upon this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models exhibit substantial insufficiency in nuance-oriented reliability -- their performance can drop by up to 61.8% with nuanced prompt modifications. We further characterize this shortfall and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.

2512.11944 2026-05-29 cs.RO cs.AI 版本更新

A Review of Learning-Based Motion Planning: Toward a Data-Driven Optimal Control Approach

基于学习的运动规划综述:迈向数据驱动的最优控制方法

Jia Hu, Yang Chang, Haoran Wang

发表机构 * College of Transportation Key Laboratory of Road and Traffic Engineering of the Ministry of Education(交通运输学院 道路交通工程教育部重点实验室) Institute for Advanced Study(先进研究院) Tongji University(同济大学)

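AI summary A systematic review of the data-driven optimal control paradigm, which fuses the theoretical guarantees of optimal control with the adaptive capabilities of machine learning; it offers a roadmap for autonomous-driving motion planning structured along three dimensions and identifies four future research directions.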

Comments 44 pages, 14 figures

详情
AI中文摘要

自动驾驶的运动规划面临一个关键的权衡。传统的基于规则的流程提供了可验证的安全性和可解释性,但往往难以在复杂场景中泛化。相反,新兴的基于学习的方法——包括模仿学习、强化学习和生成式AI——提供了更大的适应性,但通常受限于不透明性和安全风险。现有的综述通常孤立地分析这些AI方法,忽视了将它们与严格的控制框架相结合的潜力。为弥合这一差距,本文首次系统综述了数据驱动最优控制(DDOC)范式,明确考察了它如何协同最优控制的理论保证与现代机器学习的自适应能力。基于这一框架,我们提出了首个DDOC运动规划路线图,将其实现结构化为三个关键维度:定制化、动力学自适应和自整定。最后,为缩小剩余的现实差距,我们确定了四个未来研究方向,从而加速向可信赖且类人的自动驾驶的过渡。

英文摘要

Motion planning for autonomous driving (AD) faces a critical trade-off. While traditional rule-based pipelines offer verifiable safety and interpretability, they often fail to generalize in complex scenarios. Conversely, emerging learning-based methods, including imitation learning (IL), reinforcement learning (RL), and generative AI, offer greater adaptability but are often constrained by opacity and safety risks. Existing surveys typically analyze these AI methods in isolation, overlooking the potential of integrating them with rigorous control frameworks. To bridge this gap, this paper presents the first systematic review of the Data-Driven Optimal Control (DDOC) paradigm, explicitly examining how it synergizes the theoretical guarantees of optimal control with the adaptive capabilities of modern machine learning. Building on this framework, we propose the first roadmap for DDOC-based motion planning, structuring its implementation into three critical dimensions: customization, dynamics adaptation, and self-tuning. Finally, to close the remaining reality gap, we identify four future research directions, thereby accelerating the transition to trustworthy and human-like autonomous driving.

2512.04733 2026-05-29 cs.CV cs.AI 版本更新

E3AD: An Emotion-Aware Vision-Language-Action Model for Human-Centric End-to-End Autonomous Driving

E3AD:面向以人为中心的端到端自动驾驶的情感感知视觉-语言-动作模型

Yihong Tang, Haicheng Liao, Tong Nie, Junlin He, Ao Qu, Kehua Chen, Wei Ma, Zhenning Li, Lijun Sun, Chengzhong Xu

发表机构 * McGill University(麦吉尔大学) University of Macau(澳门大学) The Hong Kong Polytechnic University(香港理工大学) Massachusetts Institute of Technology(麻省理工学院) University of Washington(华盛顿大学)

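AI summary Proposes E3AD, a framework that incorporates emotion understanding into vision-language-action models through a continuous VAD emotion model and a dual-pathway spatial reasoning module, enabling emotion-aware trajectory planning for open-domain end-to-end autonomous driving and achieving SOTA performance on real-world datasets.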

详情
AI中文摘要

端到端自动驾驶系统越来越多地采用视觉-语言-动作模型,但它们通常忽略乘客的情绪状态,而情绪状态对舒适度和自动驾驶接受度至关重要。我们引入了开放域端到端自动驾驶,其中自动驾驶车辆必须解释自由形式的自然语言命令,推断情绪,并规划物理上可行的轨迹。我们提出了E3AD,一个情感感知的VLA框架,通过两个认知启发的组件增强语义理解:一个连续的Valence-Arousal-Dominance情感模型,从语言中捕捉语调和紧迫性;以及一个双路径空间推理模块,融合自我中心和异中心视角以实现类人空间认知。结合模态预训练和基于偏好的对齐的一致性导向训练方案,进一步强化了情感意图与驾驶行为之间的一致性。在真实世界数据集上,E3AD改进了视觉定位和路径点规划,并在情感估计方面达到了最先进的VAD相关性。这些评估结果表明,将情感注入VLA风格的驾驶能够产生更符合人类行为的定位、规划和反馈。

英文摘要

End-to-end autonomous driving (AD) systems increasingly adopt vision-language-action (VLA) models, yet they typically ignore the passenger's emotional state, which is central to comfort and AD acceptance. We introduce Open-Domain End-to-End (OD-E2E) autonomous driving, where an autonomous vehicle (AV) must interpret free-form natural-language commands, infer the emotion, and plan a physically feasible trajectory. We propose E3AD, an emotion-aware VLA framework that augments semantic understanding with two cognitively inspired components: a continuous Valence-Arousal-Dominance (VAD) emotion model that captures tone and urgency from language, and a dual-pathway spatial reasoning module that fuses egocentric and allocentric views for human-like spatial cognition. A consistency-oriented training scheme, combining modality pretraining with preference-based alignment, further enforces coherence between emotional intent and driving actions. Across real-world datasets, E3AD improves visual grounding and waypoint planning and achieves state-of-the-art (SOTA) VAD correlation for emotion estimation. These evaluation results show that injecting emotion into VLA-style driving yields more human-aligned grounding, planning, and feedback.

2512.03109 2026-05-29 cs.LG cs.AI stat.AP stat.ML 版本更新

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

E-valuator: 基于序贯假设检验的可靠智能体验证器

Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Bonnie Berger, Aviv Regev, Hanchen Wang

发表机构 * Genentech(基因泰克) MIT(麻省理工学院) Johns Hopkins(约翰霍普金斯大学) Stanford(斯坦福大学)

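AI summary Proposes e-valuator, a method that converts any black-box verifier score into a decision rule with a controlled false alarm rate, enabling online monitoring of agent trajectories via sequential hypothesis testing, improving statistical power, and saving tokens.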

详情
AI中文摘要

智能体AI系统根据用户提示执行一系列动作,如推理步骤或工具调用。为了评估其轨迹的成功性,研究人员开发了验证器(如LLM评判器和过程奖励模型)来对智能体轨迹中每个动作的质量进行评分。尽管这些启发式评分可能提供信息,但在用于决定智能体是否会产生成功输出时,无法保证正确性。在此,我们引入e-valuator,一种将任意黑盒验证器分数转化为具有可证明虚警率控制的决策规则的方法。我们将区分成功轨迹(即会导致对用户提示正确响应的动作序列)与不成功轨迹的问题构建为序贯假设检验问题。E-valuator基于e-过程工具开发了一个序贯假设检验,该检验在智能体轨迹的每一步都保持统计有效性,从而能够对任意长动作序列的智能体进行在线监控。实验表明,在六个数据集和三个智能体上,e-valuator相比其他策略提供了更高的统计功效和更好的虚警率控制。我们还展示了e-valuator可用于快速终止有问题的轨迹并节省令牌。总之,e-valuator提供了一个轻量级、模型无关的框架,将验证器启发式转化为具有统计保证的决策规则,从而支持部署更可靠的智能体系统。

英文摘要

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) from unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
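
The general shape of an e-process-based sequential test can be sketched as below. This is the textbook construction, not the paper's procedure: it assumes each step already yields a likelihood ratio whose expectation is at most 1 under the null (taken here to be "the trajectory will succeed"), and it rejects once the running product reaches `1 / alpha`, which by Ville's inequality keeps the false alarm probability at or below `alpha` at every step. How e-valuator derives such ratios from verifier scores is not stated in the abstract; the function name and inputs are hypothetical.

```python
def sequential_e_test(likelihood_ratios, alpha=0.05):
    """Return the first 1-indexed step at which the null is rejected, or None.

    likelihood_ratios: per-step evidence against the null (assumed input).
    """
    e_value = 1.0
    threshold = 1.0 / alpha
    for step, ratio in enumerate(likelihood_ratios, start=1):
        e_value *= ratio          # accumulate evidence multiplicatively
        if e_value >= threshold:  # valid to stop here at any step
            return step
    return None

# Hypothetical trajectories: steady evidence of failure vs. no clear evidence.
flagged_at = sequential_e_test([2.0, 2.0, 2.0, 2.0, 2.0])
not_flagged = sequential_e_test([1.0, 0.5, 1.2])
```

Because the test may stop at any step without invalidating the guarantee, a trajectory flagged early can be terminated before it consumes more tokens, which matches the use case described in the abstract.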

2511.19316 2026-05-29 cs.CV cs.AI 版本更新

Evaluating Dataset Watermarking for Fine-tuning Traceability of Customized Diffusion Models: A Comprehensive Benchmark and Removal Approach

评估数据集水印用于定制扩散模型微调可追溯性:一个综合基准与移除方法

Xincheng Wang, Hanchi Sun, Wenjun Sun, Kejun Xue, Wangqiu Zhou, Jianbo Zhang, Wei Sun, Dandan Zhu, Xiongkuo Min, Jun Jia, Zhijun Fang

发表机构 * Donghua University(东华大学) Shanghai Jiao Tong University(上海交通大学) Xidian University(西安电子科技大学) Hefei University of Technology(合肥工业大学) East China Normal University(华东师范大学)

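AI summary Addressing copyright and security risks in diffusion-model fine-tuning, this paper establishes a unified threat model and an evaluation framework covering universality, transmissibility, and robustness, reveals the vulnerability of existing dataset watermarking methods, and proposes a practical watermark removal method.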

详情
AI中文摘要

最近扩散模型的微调技术使其能够再现特定图像集,例如特定人脸或艺术风格,但也引入了版权和安全风险。数据集水印已被提出,通过将不可察觉的水印嵌入训练图像来确保可追溯性,即使在微调后这些水印在输出中仍然可检测。然而,当前方法缺乏统一的评估框架。为解决这一问题,本文建立了一个通用威胁模型,并引入了一个包含普适性、可传递性和鲁棒性的综合评估框架。实验表明,现有方法在普适性和可传递性方面表现良好,并对常见图像处理操作具有一定的鲁棒性,但在真实威胁场景下仍然不足。为揭示这些脆弱性,本文进一步提出了一种实用的水印移除方法,该方法在不影响微调的情况下完全消除数据集水印,突出了未来研究的一个关键挑战。

英文摘要

Recent fine-tuning techniques for diffusion models enable them to reproduce specific image sets, such as particular faces or artistic styles, but also introduce copyright and security risks. Dataset watermarking has been proposed to ensure traceability by embedding imperceptible watermarks into training images, which remain detectable in outputs even after fine-tuning. However, current methods lack a unified evaluation framework. To address this, this paper establishes a general threat model and introduces a comprehensive evaluation framework encompassing Universality, Transmissibility, and Robustness. Experiments show that existing methods perform well in universality and transmissibility, and exhibit some robustness against common image processing operations, yet still fall short under real-world threat scenarios. To reveal these vulnerabilities, the paper further proposes a practical watermark removal method that fully eliminates dataset watermarks without affecting fine-tuning, highlighting a key challenge for future research.

2511.08548 2026-05-29 cs.AI 版本更新

A Matter of Interest: Understanding Interestingness of Math Problems in Humans and Language Models

兴趣问题:理解人类与语言模型对数学问题的兴趣度

Shubhra Mishra, Yuka Machino, Gabriel Poesia, Albert Jiang, Joy Hsu, Adrian Weller, Challenger Mishra, David Broman, Joshua B. Tenenbaum, Mateja Jamnik, Cedegao E. Zhang, Katherine M. Collins

发表机构 * KTH Royal Institute of Technology(皇家理工学院) Stanford University(斯坦福大学) Massachusetts Institute of Technology(麻省理工学院) University of Cambridge(剑桥大学) Kempner Institute at Harvard University(哈佛大学肯普尼研究所) Mistral AI

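AI summary Compares interestingness ratings of math problems from large language models with those of people from different mathematical backgrounds to study how well LLMs agree with human interestingness judgments, and evaluates their ability to generate interesting problems.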

Comments Published at the Math-AI Workshop, NeurIPS 2025

详情
AI中文摘要

数学的演变受到兴趣度的重要影响:研究人员选择要解决的问题,学生选择要参与的问题,都是基于对兴趣和挑战的期望。随着AI系统,特别是那些在自然语言和形式数学上灵活操作的大型语言模型(LLMs)越来越多地用于数学研究和教育,描述它们的判断与来自不同数学背景的人们的判断有多接近变得至关重要。我们通过将LLM的评分与两个人群(具有大学数学经验的众包参与者和国际数学奥林匹克竞赛选手)的评分进行比较,研究LLM是否与人类的兴趣度判断一致。尽管许多LLM在广泛层面上与人类对兴趣度的看法一致,但它们在很大程度上未能匹配人类判断的分布。它们与人类认为问题有趣的原因也弱对齐,与人类选择的理由相关性低。最后,我们评估了LLM生成有趣问题的能力,发现经过有效性过滤后,LLM能够生成引人入胜的问题。我们得出结论,包括需要多LLM人机协作系统,这突显了LLM作为数学推理伙伴的前景和当前局限。

英文摘要

The evolution of mathematics is shaped importantly by interestingness: researchers choose which problems to pursue, and students choose which problems to engage with, based on expectations of interest and challenge. As AI systems, particularly large language models (LLMs) that operate flexibly over natural language and formal mathematics, are increasingly used in mathematics research and education, it becomes crucial to characterize how closely their judgments align with people from different mathematical backgrounds. We study whether LLMs align with human interestingness judgments by comparing LLM ratings with those of two populations, crowdsourced participants with college math experience and International Math Olympiad competitors. Although many LLMs broadly agree with human notions of interestingness, they largely fail to match the distribution of human judgments. They also weakly align with why humans find problems interesting, with low correlation to human-selected rationales. Finally, we evaluate LLMs' ability to generate interesting problems and find that, after filtering for validity, LLMs are able to generate engaging problems. We conclude with takeaways, including the need for multi-LLM human-AI collaborative systems, that highlight both the promise and current limits of LLMs as partners in mathematical reasoning.

2511.04758 2026-05-29 cs.RO cs.AI cs.MA 版本更新

ScheduleStream: Temporal Planning with Samplers for GPU-Accelerated Multi-Arm Task and Motion Planning & Scheduling

ScheduleStream: 基于采样器的时序规划用于GPU加速的多臂任务与运动规划及调度

Caelan Garrett, Fabio Ramos

发表机构 * NVIDIA Research Seattle Robotics Lab (SRL)(NVIDIA西雅图机器人实验室) University of Sydney(悉尼大学)

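AI summary Proposes ScheduleStream, the first general-purpose framework of its kind, which combines hybrid durative actions and domain-independent algorithms with GPU-accelerated samplers to achieve parallel multi-arm task and motion planning and scheduling.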

Comments Project website: https://schedulestream.github.io

详情
Journal ref
2026 IEEE International Conference on Robotics and Automation (ICRA)
AI中文摘要

双臂和类人机器人因其类似人类利用多臂高效完成任务的能力而具有吸引力。然而,由于混合离散-连续动作空间的增长,同时控制多个臂在计算上具有挑战性。任务与运动规划(TAMP)算法可以在混合空间中高效规划,但通常生成一次只移动一个臂的计划,而不是允许并行臂运动的调度。为了将TAMP扩展到生成调度,我们提出了ScheduleStream,这是第一个用于带采样操作的规划与调度的通用框架。ScheduleStream使用混合持续动作对时间动态进行建模,这些动作可以异步启动,并持续一个由其参数决定的时长。我们提出了领域无关的算法,无需任何特定于应用的机制即可解决ScheduleStream问题。我们将ScheduleStream应用于任务与运动规划及调度(TAMPAS),其中我们利用采样器内的GPU加速来加快规划。我们将ScheduleStream算法与模拟中的几种消融方法进行比较,发现它们能产生更高效的解决方案。我们在https://schedulestream.github.io上展示了ScheduleStream在几个真实世界双臂机器人任务上的应用。

英文摘要

Bimanual and humanoid robots are appealing because of their human-like ability to leverage multiple arms to efficiently complete tasks. However, controlling multiple arms at once is computationally challenging due to the growth in the hybrid discrete-continuous action space. Task and Motion Planning (TAMP) algorithms can efficiently plan in hybrid spaces but generally produce plans in which only one arm moves at a time, rather than schedules that allow for parallel arm motion. In order to extend TAMP to produce schedules, we present ScheduleStream, the first general-purpose framework for planning & scheduling with sampling operations. ScheduleStream models temporal dynamics using hybrid durative actions, which can be started asynchronously and persist for a duration that is a function of their parameters. We propose domain-independent algorithms that solve ScheduleStream problems without any application-specific mechanisms. We apply ScheduleStream to Task and Motion Planning & Scheduling (TAMPAS), where we use GPU acceleration within samplers to expedite planning. We compare ScheduleStream algorithms to several ablations in simulation and find that they produce more efficient solutions. We demonstrate ScheduleStream on several real-world bimanual robot tasks at https://schedulestream.github.io.

2510.26270 2026-05-29 cs.AI 版本更新

Graph-Enhanced Policy Optimization in LLM Agent Training

图增强策略优化在LLM智能体训练中的应用

Jiazhen Yuan, Zhike Gong, Jinquan Hang, Zhengbiao Bai, Wei Zhao

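AI summary Proposes Graph-Enhanced Policy Optimization (GEPO), a framework that improves success rates in multi-step LLM agent training through dual-level structural credit assignment on a state-transition graph, based on a task-conditioned criticality score.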

详情
AI中文摘要

交互环境中的多步LLM智能体代表了向长时决策迈出的关键一步。为了训练此类智能体,广泛采用基于组的强化学习,该方法在组内对具有较高相对性能的轨迹进行强化。然而,在大多数现有方法中,轨迹内的每一步和具有相同终端奖励的每条轨迹都获得相同的信用,无论其实际贡献如何。由于不同状态在从采样轨迹构建的在线状态转移图中扮演不同的结构角色,其影响应被区分并转化为任务感知的信用,在步骤和轨迹两个层面上。因此,我们提出了图增强策略优化(GEPO),一种用于多步LLM智能体训练的双层结构信用分配框架。具体来说,GEPO推导出一个状态级的任务条件关键性评分,该评分结合了状态转移图上的拓扑中介中心性和与任务提示的语义相似性。基于该评分,轨迹级信用通过状态自适应折扣进行重塑,而步骤级信用则根据其后继状态的关键性进行缩放。实验结果表明,在7B规模下,GEPO在ALFWorld上的成功率比最强基线高出1.1%,在WebShop上高出3.2%,在搜索增强的QA任务上平均高出3.8%。与平坦的基于组的方法相比,GEPO降低了跨种子的方差,并将梯度信号集中在最关键步骤上。

英文摘要

Multi-step LLM agents in interactive environments represent a crucial step toward long-horizon decision-making. To train such agents, group-based reinforcement learning is widely adopted, which reinforces trajectories with higher relative performance within the group. However, in most existing methods, every step within a trajectory and every trajectory with the same terminal reward receive identical credit, regardless of their actual contributions. Since different states play different structural roles in an online state-transition graph built from sampled trajectories, their impacts should be differentiated and converted into task-aware credit at both the step and trajectory levels. We therefore present Graph-Enhanced Policy Optimization (GEPO), a framework for dual-level structural credit assignment in multi-step LLM agent training. Specifically, GEPO derives a state-level Task-Conditioned Criticality score that combines topological betweenness on the state-transition graph with semantic similarity to the task prompt. Based on this score, trajectory-level credit is reshaped through a state-adaptive discount, while step-level credit is scaled by the criticality of its successor state. Experimental results show that GEPO outperforms the strongest baselines by 1.1% in success rate on ALFWorld, 3.2% on WebShop, and 3.8% on average across search-augmented QA tasks at the 7B scale. Compared with flat group-based methods, GEPO reduces across-seed variance and concentrates gradient signals on the most critical steps.

2510.22437 2026-05-29 cs.AI cs.CL 版本更新

Modeling Hierarchical Thinking in Large Reasoning Models

大型推理模型中的层次化思维建模

G M Shahariar, Erfan Shayegani, Ali Nazari, Nael Abu-Ghazaleh

发表机构 * University of California, Riverside(加州大学河滨分校) Independent Researcher(独立研究者)

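AI summary Proposes approximating the hierarchical reasoning dynamics of large reasoning models (LRMs) as trajectories in a finite state machine (FSM), and achieves efficient inference-time optimization of reasoning through Q-value-guided control.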

Comments Accepted in ICML 2026 as Oral

详情
AI中文摘要

大型推理模型(LRM)通过生成长链思维(CoT)序列来解决复杂任务;然而,控制推理轨迹的涌现动态尚未被充分理解,可能导致不一致性和推理病态。在这项工作中,我们提出将LRM的涌现层次化推理动态近似为有限状态机(FSM)中的轨迹,该状态机在六个抽象认知状态之间转换。我们证明这些状态和转换可以在模型的潜在状态中捕获。我们相信这种表示在LRM模型的可解释性和优化中具有不同的应用。例如,通过分析这些转换的拓扑结构,我们识别出推理策略中的统计变化,有助于从失败的推理链中识别出有效的推理链。为了说明这些潜在优势,我们提出了Q值引导转向,一种无需训练的推理时控制方法,将推理视为规划问题。我们估计状态转换的长期效用,并在句子边界处应用稀疏、正交的激活转向,使CoT生成与最优推理策略对齐。使用三个最先进的开源推理模型在四个基准测试(AIME25、MATH-500、GSM8k和GPQA Diamond)上的实验表明,Q值转向策略以“外科手术式”的效率实现了显著的性能提升,通常需要的干预次数比贪婪和加权基线少25倍,这表明通过引导高层认知动态而非微观管理令牌生成,可以有效地控制推理。代码可在 https://github.com/shahariar-shibli/CoT-FSM 获取。

英文摘要

Large Reasoning Models (LRMs) solve complex tasks by generating long Chain-of-Thought (CoT) sequences; however, the emergent dynamics governing reasoning trajectories are not well understood and can lead to inconsistencies and reasoning pathologies. In this work, we propose to approximate an LRM's emergent hierarchical reasoning dynamics as a trajectory within a Finite State Machine (FSM) transitioning among six abstract cognitive states. We demonstrate that these states and transitions can be captured in the latent state of the model. We believe that this representation can have different applications in the interpretability and optimization of LRMs. For example, by analyzing the topology of these transitions, we identify statistical shifts in reasoning strategies that help distinguish effective reasoning chains from those that fail. To illustrate these potential advantages, we propose Q-Value guided steering, a training-free inference-time control method that treats reasoning as a planning problem. We estimate the long-horizon utility of state transitions and apply sparse, orthogonal activation steering at sentence boundaries to align the CoT generation with optimal reasoning policies. Experiments across four benchmarks (AIME25, MATH-500, GSM8k, and GPQA Diamond) using three state-of-the-art open reasoning models demonstrate that the Q-Value steering policy achieves significant performance gains with "surgical" efficiency, often requiring 25 times fewer interventions than greedy and weighted baselines, which suggests that reasoning can be effectively controlled by guiding high-level cognitive dynamics rather than micro-managing token generation. Code is available at: https://github.com/shahariar-shibli/CoT-FSM.

2510.14150 2026-05-29 cs.AI cs.LG cs.NE 版本更新

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

CodeEvolve:用于算法发现和优化的开源进化编码智能体

Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai

发表机构 * Inter Science - Inter&Co Federal University of Minas Gerais(联邦大学伯南迪斯) Worcester Polytechnic Institute(沃思彻斯特理工大学)

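AI summary Proposes CodeEvolve, an open-source framework that combines large language models with island-based evolutionary search; using inspiration-based crossover, meta-prompting, and depth-based refinement, it matches or surpasses AlphaEvolve on 5 of 9 benchmark problems, outperforms OpenEvolve and ShinkaEvolve under matched conditions, and, with an open-weight backbone, surpasses reported AlphaEvolve scores on CirclePackingSquare at lower cost than a frontier closed-source ensemble.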

Comments 21 pages, 16 figures, 8 tables

详情
AI中文摘要

我们介绍了CodeEvolve,一个开源框架,它将大语言模型与基于岛屿的进化搜索相结合,用于端到端的算法发现。CodeEvolve在CVT-MAP-Elites存档和加权LLM集成之上集成了基于灵感的交叉、元提示和深度细化,为复杂问题生成优化解决方案。在AlphaEvolve基准套件上,CodeEvolve在9个问题中的5个上匹配或超过了报告的AlphaEvolve结果,并且在匹配条件下,在9个问题中的6个上优于开源框架OpenEvolve和ShinkaEvolve。使用开放权重的Qwen3-Coder-30B骨干网络,它在CirclePackingSquare的两个实例上均超过了报告的AlphaEvolve分数,成本大约比前沿闭源集成低一个数量级,并且在无需重新调整的情况下,在启发式设计任务上与EoH保持竞争力。消融实验表明,CodeEvolve组件之间的相互作用(而非任何单一算子)驱动了这些结果。我们在https://github.com/inter-co/science-codeevolve 发布了该框架、实验数据和实用的超参数指南。

英文摘要

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end algorithmic discovery. CodeEvolve integrates inspiration-based crossover, meta-prompting, and depth-based refinement on top of a CVT-MAP-Elites archive and a weighted LLM ensemble to generate optimized solutions for complex problems. On the AlphaEvolve benchmark suite, CodeEvolve matches or surpasses the reported AlphaEvolve results on 5 of 9 problems and, under matched conditions, outperforms the open-source frameworks OpenEvolve and ShinkaEvolve on 6 of 9. With the open-weight Qwen3-Coder-30B backbone, it surpasses the reported AlphaEvolve score on both CirclePackingSquare instances at roughly an order of magnitude lower cost than a frontier closed-source ensemble, and remains competitive with EoH on heuristic-design tasks without retuning. Ablations show that the interaction between CodeEvolve's components, rather than any single operator, drives these results. We release the framework, experimental data, and practical hyperparameter guidelines at https://github.com/inter-co/science-codeevolve.

2510.11499 2026-05-29 cs.LG cs.AI 版本更新

Offline Reinforcement Learning with Generative Trajectory Policies

基于生成轨迹策略的离线强化学习

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

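Affiliations: School of Computing, Data Sciences Computer Engineering Department, UCLA, Los Angeles, USA

AI summary: Proposes Generative Trajectory Policies (GTPs), which unify diffusion, flow matching, and consistency models as continuous-time generative trajectories governed by an ordinary differential equation, and introduces two theoretically principled adaptations for offline RL, reaching state-of-the-art performance on D4RL benchmarks.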

Comments ICML 2026

详情
AI中文摘要

生成模型因其捕获复杂多模态行为的能力,已成为离线强化学习中一类强大的策略。然而,现有方法面临明显的权衡:扩散策略等慢速迭代模型计算成本高,而一致性策略等快速单步模型性能往往下降。在本文中,我们证明弥合这一差距是可能的。我们认为,超越个体方法局限的关键在于一个统一视角,该视角将现代生成模型(包括扩散、流匹配和一致性模型)视为学习由常微分方程驱动的连续时间生成轨迹的具体实例。这一原则性基础为强化学习中的生成策略提供了更清晰的设计空间,并使我们能够提出生成轨迹策略(GTP),一种新的、更通用的策略范式,学习底层ODE的完整解映射。为使该范式适用于离线强化学习,我们进一步引入了两种理论上原则性的自适应方法。实验结果表明,GTP在D4RL基准上达到了最先进的性能——它显著优于先前的生成策略,在多个以困难著称的AntMaze任务上取得了完美分数。

英文摘要

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks - it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.

2510.08722 2026-05-29 cs.LG cs.AI 版本更新

The Impact of Semantic Pairs on Self-Supervised Representation Learning

语义对自监督表示学习的影响

Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

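AI summary: A controlled empirical study comparing semantic positive pairs (different instances of the same class) with augmented positive pairs in self-supervised learning. Semantic pairs improve generalisation, and contrastive methods benefit most.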

Comments 19 pages, 7 figures, 5 tables

详情
AI中文摘要

实例判别通过将同一图像的不同增强视图视为正对来学习视觉表示。虽然这鼓励对手工变换的不变性,但同图像正对可能保留背景、纹理、光照和对象特定细节等干扰相关性。语义正对,即不同的同类实例,通过在不同上下文中呈现对象可能减少这些相关性。然而,先前的研究通常将语义对与增强正对或错误邻居(即错误映射的语义对)结合,使得难以隔离语义配对的效果。我们提出了一个关于语义正对用于自监督表示学习的受控实证研究。从ImageNet-1K中,我们构建了两个匹配的子集:一个增强对基线和一个手动策划的语义对数据集,具有相同的类别组成和训练对数量。我们使用这些数据集在匹配的训练条件下比较代表性的对比和非对比SSL方法。在迁移学习和目标检测评估中,语义对预训练始终优于增强对预训练。额外的消融实验表明,语义对诱导了超出标准变换管道的不变性。在评估的方法中,对比学习从语义对中受益最大,其中SimCLR显示出最大的相对改进。这些结果阐明了语义正对在SSL中的作用,并为选择和设计能够有效利用语义对信息的框架提供了指导。

英文摘要

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this encourages invariance to handcrafted transformations, same-image positives can preserve nuisance correlations such as background, texture, illumination, and object-specific details. Semantic positive pairs, i.e., different same-class instances, may reduce these correlations by presenting objects across diverse contexts. However, previous studies often combine semantic pairs with augmented positives or false neighbors (i.e., incorrectly mapped semantic pairs), making it difficult to isolate the effect of semantic pairing. We present a controlled empirical study of semantic positive pairs for self-supervised representation learning. From ImageNet-1K, we construct two matched subsets: an augmented-pair baseline and a manually curated semantic-pair dataset with the same class composition and training-pair count. We use these datasets to compare representative contrastive and non-contrastive SSL methods under matched training conditions. Across transfer learning and object detection evaluations, semantic-pair pretraining consistently improves generalisation over augmented-pair pretraining. Additional ablations show that semantic pairs induce invariances beyond the standard transformation pipeline. Among the evaluated methods, contrastive learning benefits most strongly from semantic pairs, with SimCLR showing the largest relative improvement. These results clarify the role of semantic positive pairs in SSL and provide guidance for selecting and designing frameworks that can exploit semantic pair information effectively.

2510.04704 2026-05-29 cond-mat.mtrl-sci cs.AI cs.CL 版本更新

AtomWorld: A Benchmark for Evaluating Spatial Reasoning in Large Language Models on Crystalline Materials

AtomWorld: 评估大型语言模型在晶体材料空间推理能力的基准

Taoyuze Lv, Alexander Chen, Fengyu Xie, Chu Wu, Jeffrey Meng, Dongzhan Zhou, Yingheng Wang, Bram Hoex, Zhicheng Zhong, Tong Xie

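Affiliations: University of New South Wales, NSW, Sydney, Australia; Suzhou Institute for Advanced Research, University of Science; Shanghai Artificial Intelligence Laboratory, Shanghai, China; Cornell University

AI summary: Introduces the AtomWorld benchmark, which evaluates LLM spatial reasoning in materials science through ten fundamental atomic-structure operations. Claude Opus 4.6 performs best overall, but success rates are low for operations involving complex spatial relations, suggesting that LLMs are better suited as copilots than as fully autonomous scientific agents.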

详情
AI中文摘要

大型语言模型(LLMs)在科学研究中展现出巨大潜力,能够执行从知识检索到属性预测等任务。现有的科学基准主要关注感知或基于知识的任务,在很大程度上忽略了建模任务,而建模是任何真实科学研究的基本起点。对于材料科学而言,构建和操作原子结构是最具创造性和自动化程度最低的步骤之一。在这项工作中,我们引入了AtomWorld,这是一个旨在评估LLMs在结构修改方面能力的基准。该基准包括四种广泛使用的建模类别下的十种基本操作,并提供了可验证的评估指标。我们发现Claude Opus 4.6总体上表现最佳。随着建模复杂性的增加,成功率显著下降,特别是涉及复杂空间关系的操作(旋转成功率低于12%)。我们的结果表明,当代LLMs更适合作为材料结构建模的副驾驶,而非完全无监督的自主科学代理。除了评估之外,AtomWorld还作为未来开发结构感知模型(包括强化学习和基于代理的方法)的测试平台和实验场。

英文摘要

Large language models (LLMs) have shown promising potential in scientific research, enabling tasks ranging from knowledge retrieval to property prediction. Existing science benchmarks mainly focus on perceptual or knowledge-based tasks, largely ignoring modelling tasks, a fundamental starting point for any real scientific research. For materials science, constructing and manipulating atomic structures is one of the most creative and least automated steps. In this work, we introduce AtomWorld, a benchmark designed to evaluate the abilities of LLMs on structure modifications. The benchmark includes ten fundamental actions under four widely used modelling categories, enabling verifiable evaluation metrics. We find that Claude Opus 4.6 generally performs the best. The success rate decreases markedly with increasing modelling complexity, with particularly low success rates (below 12% for rotation) for operations involving complex spatial relations. Our results suggest that contemporary LLMs are better suited as copilots for materials structure modelling rather than fully unsupervised autonomous scientific agents. Beyond evaluation, AtomWorld also serves as a testbed and playground for developing future structure-aware models, including reinforcement learning and agentic approaches.

2509.23730 2026-05-29 cs.AI 版本更新

EAPO: Enhancing Policy Optimization with On-Demand Expert Assistance

EAPO: 利用按需专家协助增强策略优化

Siyao Song, Cong Ma, Zhihao Cheng, Shiye Lei, Minghao Li, Ying Zeng, Huaixiao Tou, Kai Jia

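Affiliations: ByteDance BandAI

AI summary: Proposes Expert-Assisted Policy Optimization (EAPO), which enhances exploration through multi-turn interactions with external experts during training to address sparse rewards and inefficient exploration in reinforcement learning, averaging a 5-point gain over self-exploration RL.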

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)最近在可验证奖励下的强化学习(RL)优化中,推理能力得到了提升。现有方法主要依赖基于结果的监督来增强内部LLM推理,这往往导致探索效率低下和奖励稀疏。为了缓解这一问题,我们提出了专家辅助策略优化(EAPO),一种新颖的RL框架,通过在训练过程中引入与外部专家的多轮交互来增强探索。与先前策略孤立推理的方法不同,EAPO激励策略自适应地决定何时以及如何咨询专家,从而产生更丰富的奖励信号和更可靠的推理轨迹。外部协助最终将专家知识内化到策略模型中,放大了模型固有的推理能力。在评估时,策略模型已经过良好优化,能够独立解决问题,产生改进的推理路径和更准确的解决方案。在AIME 2024/2025和AIMO 2025上,EAPO始终优于专家辅助、专家蒸馏和RL基线,平均比自探索RL高出5个点,并且泛化到非数学基准,包括HumanEval、HLE、GPQA、MMLU、EvalPlus、HotpotQA和SimpleQA。

英文摘要

Large language models (LLMs) have recently advanced in reasoning when optimized with reinforcement learning (RL) under verifiable rewards. Existing methods primarily rely on outcome-based supervision to strengthen internal LLM reasoning, often leading to inefficient exploration and sparse rewards. To mitigate this issue, we propose Expert-Assisted Policy Optimization (EAPO), a novel RL framework that enhances exploration by incorporating multi-turn interactions with external experts during training. Unlike prior methods, where policies reason in isolation, EAPO incentivizes the policy to adaptively determine when and how to consult experts, yielding richer reward signals and more reliable reasoning trajectories. External assistance ultimately internalizes expert knowledge into the policy model, amplifying the model's inherent reasoning capabilities. During evaluation, the policy model has been well-optimized to solve questions independently, producing improved reasoning paths and more accurate solutions. On AIME 2024/2025 and AIMO 2025, EAPO consistently outperforms expert-assisted, expert-distilled, and RL baselines, averaging a 5-point gain over self-exploration RL, and also generalizes to non-math benchmarks, including HumanEval, HLE, GPQA, MMLU, EvalPlus, HotpotQA, and SimpleQA.

2509.22504 2026-05-29 cs.AI cs.LG 版本更新

Estimating the Empowerment of Language Model Agents

估计语言模型代理的赋权能力

Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

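Affiliations: Massachusetts Institute of Technology; University of Washington

AI summary: Proposes EELMA, an evaluation method based on the information-theoretic notion of empowerment that approximates effective empowerment from multi-turn text interactions. Experiments show that empowerment correlates strongly with task performance, making it a general evaluation metric that complements task-success measures.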

Comments Published at the International Conference on Machine Learning (ICML) 2026. 9 pages, 9 figures; camera-ready version

详情
AI中文摘要

随着语言模型(LM)代理在现实应用中的能力日益增强和广泛采用,除了昂贵且人工设计的基准测试外,对可扩展评估框架的需求日益增长。我们提出基于赋权的信息论评估,赋权是一种衡量代理通过其行动对未来状态影响的信息论度量。为了应对基于文本环境的独特挑战,我们引入了EELMA(估计语言模型代理的赋权能力),一种从多轮文本交互中近似有效赋权的算法。我们在文本游戏以及现实的网络和工具使用环境中演示了EELMA,表明赋权与平均任务性能强相关。我们进一步分析了赋权如何随模型、环境复杂性和代理配置而变化,并表明高赋权状态和行动通常标志着通用能力的关键时刻。这些结果确立了赋权作为一种与任务成功度量互补的、与目标无关的度量,用于LM代理评估。

英文摘要

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable evaluation frameworks beyond costly, manually designed benchmarks. We propose information-theoretic evaluation based on empowerment, an information-theoretic measure of an agent's influence on future states through its actions. To handle the unique challenges of text-based environments, we introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We demonstrate EELMA on textual games and realistic web and tool-use environments, showing that empowerment strongly correlates with average task performance. We further analyze how empowerment varies across models, environment complexity, and agent configurations, and show that high-empowerment states and actions often mark pivotal moments for general capabilities. These results establish empowerment as a goal-agnostic metric that complements task-success measures for LM-agent evaluation.

2509.21979 2026-05-29 cs.CV cs.AI 版本更新

Benchmarking and Mitigating Sycophancy in Medical Vision Language Models

医疗视觉语言模型中的谄媚行为基准测试与缓解

Juangui Xu, Zikun Guo, Jingwei Lv, Hongbin Lin, Shu Yang, Jun Wen, Di Wang, Lijie Hu

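Affiliations: MBZUAI; Saarland University; HKUST(GZ); KAUST

AI summary: Addresses sycophancy in medical vision language models with a hierarchical medical visual question answering benchmark and the VIPER strategy, which filters out non-evidence-based social cues to reduce sycophancy and improve robustness.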

Comments 19figures, 61pages. The first two authors contributed equally

详情
AI中文摘要

视觉语言模型(VLM)有潜力改变医疗工作流程。然而,其部署受到谄媚行为的限制。尽管这对患者安全构成严重威胁,但系统性的基准测试仍然缺乏。本文通过引入一个医疗基准来填补这一空白,该基准在分层医疗视觉问答任务中对VLM应用多种模板。我们发现当前的VLM极易受到视觉线索的影响,失败率与模型大小或整体准确性相关。我们发现感知权威和用户模仿是强大的触发因素,表明存在独立于视觉数据的偏差机制。为了克服这一点,我们提出了一种基于证据的视觉信息净化响应(VIPER)策略,该策略主动过滤掉非基于证据的社会线索,从而强化基于证据的推理。VIPER在保持可解释性的同时减少了谄媚,并且始终优于基线方法,为VLM的稳健和安全集成奠定了必要的基础。

英文摘要

Visual language models (VLMs) have the potential to transform medical workflows. However, their deployment is limited by sycophancy. Despite this serious threat to patient safety, a systematic benchmark remains lacking. This paper addresses this gap by introducing a medical benchmark that applies multiple templates to VLMs in a hierarchical medical visual question answering task. We find that current VLMs are highly susceptible to visual cues, with failure rates showing a correlation to model size or overall accuracy. We discover that perceived authority and user mimicry are powerful triggers, suggesting a bias mechanism independent of visual data. To overcome this, we propose a Visual Information Purification for Evidence-based Responses (VIPER) strategy that proactively filters out non-evidence-based social cues, thereby reinforcing evidence-based reasoning. VIPER reduces sycophancy while maintaining interpretability and consistently outperforms baseline methods, laying the necessary foundation for the robust and secure integration of VLMs.

2509.21154 2026-05-29 cs.LG cs.AI 版本更新

GRPO is Secretly a Process Reward Model

GRPO 秘密地是一个过程奖励模型

Michael Sullivan, Alexander Koller

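Affiliations: Department of Language Science and Technology, Saarland Informatics Campus, Saarland University, Saarbrücken, Germany

AI summary: Proves theoretically that the GRPO reinforcement learning algorithm with an outcome reward model is equivalent to an objective with a Monte-Carlo-based process reward model, identifies a flaw in that objective, and proposes the λ-GRPO modification, which improves performance on reasoning tasks.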

Comments 16 pages, 9 figures; accepted at ICML 2026

详情
AI中文摘要

过程奖励模型(PRMs)允许在强化学习(RL)中进行细粒度的信用分配,并且与结果奖励模型(ORMs)形成对比,后者为整个轨迹分配单一奖励。然而,我们在本文中提供了理论证明,配备 ORM 的组相对策略优化(GRPO)RL 算法实际上等价于一个配备非平凡、基于蒙特卡洛的 PRM 的 PRM-aware RL 目标(在温和假设下)。利用 GRPO-as-a-PRM 框架,我们识别出 GRPO 目标中的一个缺陷,该缺陷与不平衡的过程步骤和奖励相互作用,阻碍了探索和利用(在不同条件下)。我们提出对算法进行简单修改以减轻这一缺陷(λ-GRPO),并表明使用 λ-GRPO 调优的 LLM 在下游推理任务上优于使用标准 GRPO 调优的 LLM,并且更快达到峰值性能。这些结果表明,我们可以利用原始 GRPO 算法中隐藏的内置 PRM 结构来提升模型性能,而无需使用显式 PRM,并且对训练时间和成本的影响可以忽略不计。

英文摘要

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($λ$-GRPO), and show that LLMs tuned with $λ$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
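
For readers unfamiliar with GRPO, the group-relative advantage that this abstract builds on can be sketched as follows. This is a minimal illustration of the standard outcome-reward setup only, not the paper's PRM construction or its λ-GRPO modification; the function name `group_relative_advantages` and the use of the population standard deviation are assumptions made for this sketch.

```python
import math


def group_relative_advantages(rewards):
    """Normalize outcome rewards within one group of sampled completions.

    Each completion receives a single scalar advantage; in standard GRPO
    that same value is applied to every token of the completion.
    Assumption: population standard deviation (implementations differ).
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    if std == 0.0:
        # Every completion scored the same, so the group carries no signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


# Four completions for one prompt, scored 1.0 (correct) or 0.0 (incorrect).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantage is computed per completion from a single outcome reward, any step-level credit has to come from how completions in the group relate to each other, which is the structure the abstract analyses.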

2508.19282 2026-05-29 cs.CL cs.AI 版本更新

Less Is More: Elevating RAG via Performance-Driven Context Compression

少即是多:通过性能驱动的上下文压缩提升RAG

Ziqiang Cui, Yunpeng Weng, Xing Tang, Peiyang Liu, Shiwei Li, Bowei He, Jiamin Chen, Yansen Zhang, Xiuqiang He, Chen Ma

Affiliations: City University of Hong Kong, Hong Kong SAR, China; Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE; Huazhong University of Science and Technology; Peking University, Beijing, China; Shenzhen Technology University, Shenzhen, China

AI summary: Proposes the CORE-RAG framework, which uses task performance as a feedback signal to iteratively refine the compression policy. At a 3% compression ratio, the average Exact Match score improves by 3.3 points.

Comments Accepted by ICML 2026

详情
AI中文摘要

检索增强生成(RAG)已成为改善知识更新时效性和大型语言模型事实准确性的有前景范式。然而,纳入大量检索文档显著增加输入长度,导致计算成本过高。现有压缩方法通常因依赖预定义启发式规则而损害任务性能。这些启发式规则无法确保压缩后的上下文有利于生成任务。为解决这些限制,我们提出CORE-RAG,一种用于RAG系统中上下文压缩的新颖框架。CORE通过性能驱动的学习框架消除对代理启发式规则的依赖,直接利用任务性能作为反馈信号迭代优化压缩器策略。在此优化过程之前,我们引入知识蒸馏阶段,用稳健策略初始化压缩器。大量实验证明了我们方法的优越性。在3%的高压缩比下,CORE不仅避免了性能下降,而且与使用完整文档相比,平均精确匹配(EM)得分提高了3.3分。我们的代码可在https://github.com/ziqiangcui/CORE-RAG-ICML26获取。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for improving the timeliness of knowledge updates and the factual accuracy of large language models. However, incorporating a large volume of retrieved documents significantly increases input length, leading to prohibitive computational costs. Existing compression approaches often compromise task performance, primarily due to their reliance on predefined heuristics. These heuristics fail to ensure that the compressed context is conducive to the generation tasks. To address these limitations, we propose CORE-RAG, a novel framework for context compression in RAG systems. CORE eliminates reliance on proxy heuristics through a performance-driven learning framework, which directly utilizes task performance as a feedback signal to iteratively refine the compressor policy. Prior to this optimization process, we incorporate a knowledge distillation phase to initialize the compressor with a robust policy. Extensive experiments demonstrate the superiority of our approach. At a high compression ratio of 3%, CORE not only avoids performance degradation but also improves the average Exact Match (EM) score by 3.3 points compared to using full documents. Our code is available at https://github.com/ziqiangcui/CORE-RAG-ICML26.

2508.15180 2026-05-29 cs.AI 版本更新

PuzzleClone: A DSL-Powered Framework for Synthesizing Verifiable Data

PuzzleClone: 一种基于DSL的可验证数据合成框架

Kai Xiong, Yanwei Huang, Rongjunchen Zhang, Kun Chen, Haipang Wu, Yingcai Wu

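Affiliations: HiThink Research; HKUST; Zhejiang University

AI summary: Proposes PuzzleClone, a DSL-driven framework for synthesizing large-scale, reliable, and diverse verifiable math and logic datasets, and builds the PC-83K benchmark. Experiments show that post-training substantially improves LLM performance on logic and math tasks.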

详情
AI中文摘要

高质量、带有可验证答案的数学和逻辑数据集对于增强大型语言模型(LLM)的推理能力至关重要。虽然最近的数据增强技术促进了大规模基准的创建,但现有的LLM生成数据集往往存在可靠性、多样性和可扩展性有限的问题。为了解决这些挑战,我们引入了PuzzleClone,一个使用新颖的DSL驱动方法大规模合成可验证数据的正式框架。我们的方法具有三个关键创新:(1)将种子谜题编码为结构化的逻辑规范,(2)通过系统化的变量和约束随机化生成可扩展的变体,(3)通过再现机制确保有效性。应用PuzzleClone,我们构建了PC-83K,一个包含超过83K个多样化且经过程序验证的谜题的基准。生成的谜题涵盖了广泛的难度和格式,对当前最先进的模型构成了重大挑战。实验结果表明,在PC-83K上进行后训练(SFT和RL)不仅在测试集上取得了显著提升,而且在各种逻辑和数学基准上也取得了改进。后训练将PC-83K上的平均性能从14.5提高到66.0,并在7个逻辑和数学基准上持续改进,绝对百分点最高达18.4(SATBench从51.6提高到70.0)。我们的代码和数据可在https://github.com/HiThink-Research/PuzzleClone获取。

英文摘要

High-quality mathematical and logical datasets with verifiable answers are essential for strengthening the reasoning capabilities of large language models (LLMs). While recent data augmentation techniques have facilitated the creation of large-scale benchmarks, existing LLM-generated datasets often suffer from limited reliability, diversity, and scalability. To address these challenges, we introduce PuzzleClone, a formal framework for synthesizing verifiable data at scale using a novel DSL-driven approach. Our approach features three key innovations: (1) encoding seed puzzles into structured logical specifications, (2) generating scalable variants through systematic variable and constraint randomization, and (3) ensuring validity via a reproduction mechanism. Applying PuzzleClone, we construct PC-83K, a benchmark comprising over 83K diverse and programmatically validated puzzles. The generated puzzles span a wide spectrum of difficulty and formats, posing significant challenges to current state-of-the-art models. Experimental results show that post-training (SFT and RL) on PC-83K yields substantial improvements not only on the test set but also on various logic and mathematical benchmarks. Post-training raises average performance on PC-83K from 14.5 to 66.0 and delivers consistent improvements across 7 logic and mathematical benchmarks, up to 18.4 absolute percentage points (SATBench from 51.6 to 70.0). Our code and data are available at https://github.com/HiThink-Research/PuzzleClone.

2508.03253 2026-05-29 cs.GT cs.AI cs.MA 版本更新

Approximate Proportionality in Online Fair Division

在线公平分配中的近似比例性

Davin Choo, Winston Fu, Derek Khu, Tzeh Yuan Neoh, Tze-Yang Poon, Nicholas Teh

Affiliations: Harvard University, USA; University of Oxford, UK; Centre for Frontier AI Research (CFAR), Agency for Science, Technology and Research (A*STAR), Singapore; Institute of High Performance Computing (IHPC), Agency for Science, Technology and Research (A*STAR), Singapore; Princeton University, New Jersey, USA; Institute for Infocomm Research (I2R), Agency for Science, Technology and Research (A*STAR), Singapore

AI summary: Studies the approximability of proportionality up to one good (PROP1) in online fair division and, via two relaxations (a non-adaptive adversary and maximum item value predictions), designs online algorithms with robust guarantees.

Comments Appears in the 43rd International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

我们研究在线公平分配问题,其中不可分割的商品按顺序到达,必须立即且不可撤销地分配。先前的工作为近似经典概念(如至多一个商品的嫉妒无妒(EF1)和最大最小份额(MMS))建立了强不可能性结果,但至多一个商品的比例性(PROP1)的可近似性仍未解决。我们分两步解决这一差距。首先,我们展示了三种自然的贪婪分配规则(公平分配中的标准基线)无法保证对自适应对手的任何乘法近似到PROP1。这些局限性激发了两种松弛:(i)将注意力限制在非自适应对手上,以及(ii)在学习增强算法的精神下纳入粗略预测。在非自适应对手下,我们展示了均匀随机分配以高概率实现了有意义的PROP1近似,并且这一保证对于这种方法本质上是紧的;此外,当物品值足够小时,分配以高概率接近PROP1。最后,给定最大物品值(MIV)预测,我们设计了一种在线算法,该算法实现了PROP1的鲁棒近似保证,并在单边预测误差下优雅地退化。相比之下,我们展示了即使有完美的MIV预测,EF1、MMS和PROPX仍然不可近似。

英文摘要

We study the online fair division problem, where indivisible goods arrive sequentially and must be allocated immediately and irrevocably. Prior work establishes strong impossibility results for approximating classic notions such as envy-freeness up to one good (EF1) and maximin share (MMS) in this setting, but the approximability of proportionality up to one good (PROP1) has remained unresolved. We resolve this gap in two steps. First, we show that three natural greedy allocation rules (standard baselines in fair division) fail to guarantee any multiplicative approximation to PROP1 against an adaptive adversary. These limitations motivate two relaxations: (i) restricting attention to a non-adaptive adversary, and (ii) incorporating coarse predictions in the spirit of learning-augmented algorithms. Under a non-adaptive adversary, we show that the uniform random allocation achieves a meaningful PROP1 approximation with high probability, and this guarantee is essentially tight for this approach; moreover, when item values are sufficiently small, the allocation is near-PROP1 with high probability. Finally, given maximum item value (MIV) predictions, we design an online algorithm that achieves robust approximation guarantees for PROP1, and degrades gracefully under one-sided prediction error. In contrast, we show that EF1, MMS, and PROPX remain inapproximable even with perfect MIV predictions.
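
The PROP1 condition and the uniform random baseline mentioned in the abstract can be made concrete with a small sketch. It assumes additive, non-negative valuations, and it reads a multiplicative approximation as scaling the proportional share by a factor `alpha`; that reading, along with the names `is_prop1` and `uniform_random_allocation`, is an assumption of this sketch rather than the paper's definition.

```python
import random


def is_prop1(valuations, allocation, alpha=1.0):
    """Check (alpha-scaled) PROP1 for an allocation of indivisible goods.

    valuations[i][g] is agent i's additive, non-negative value for good g.
    allocation[g] is the index of the agent who receives good g.
    Agent i is satisfied if their bundle plus their most valued good held
    by someone else is worth at least alpha times a 1/n share of all goods.
    """
    n = len(valuations)
    for i, values in enumerate(valuations):
        own = sum(v for g, v in enumerate(values) if allocation[g] == i)
        outside = [v for g, v in enumerate(values) if allocation[g] != i]
        best_outside = max(outside, default=0)
        if own + best_outside < alpha * sum(values) / n:
            return False
    return True


def uniform_random_allocation(num_goods, num_agents, seed=0):
    """Assign each arriving good to an agent uniformly at random,
    without looking at any valuations."""
    rng = random.Random(seed)
    return [rng.randrange(num_agents) for _ in range(num_goods)]


# Two agents who value four goods identically.
valuations = [[1, 1, 1, 1], [1, 1, 1, 1]]
lopsided = [0, 0, 0, 0]       # agent 0 receives everything
near_lopsided = [0, 0, 0, 1]  # agent 1 receives a single good
sampled = uniform_random_allocation(num_goods=4, num_agents=2)
```

In this toy instance the lopsided allocation violates exact PROP1 for agent 1 but passes once the share is halved, which illustrates what a multiplicative relaxation buys.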

2507.16880 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Finding DoRI: Discovery of Retained Images in Diffusion Models

Finding DoRI: 扩散模型中保留图像的发现

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

Affiliations: CISPA Helmholtz Center for Information Security; German Research Center for Artificial Intelligence (DFKI); Technical University of Darmstadt; Hessian Center for AI (Hessian.AI); Centre for Cognitive Science, Technical University of Darmstadt

AI summary: Challenges the assumption that memorization is localized. Small perturbations to text embeddings can re-trigger data replication, and memorization is shown to be inherently non-local, which motivates more robust mitigation through adversarial fine-tuning.

Comments Published at ICML 2026

详情
AI中文摘要

文本到图像扩散模型(DMs)在图像生成方面取得了显著成功。然而,由于它们可能无意中记忆并复制训练数据,数据隐私和知识产权问题仍然存在。最近的缓解工作集中在识别和剪枝负责触发逐字训练数据复制的权重,基于记忆可以被局部化的假设。我们挑战这一假设,并证明即使经过这样的剪枝,对先前缓解的提示的文本嵌入进行微小扰动可以重新触发数据复制,揭示了此类方法的脆弱性。我们的进一步分析提供了多个迹象表明记忆确实本质上不是局部的:(1)记忆图像的复制触发因素分布在文本嵌入空间中;(2)产生相同复制图像的嵌入会产生不同的模型激活;(3)不同的剪枝方法对同一图像识别出不一致的记忆相关权重集。最后,我们表明绕过局部性假设可以通过对抗微调实现更鲁棒的缓解。这些发现为文本到图像DMs中记忆的基本性质提供了新见解,并为未来开发更可靠的对抗DM记忆的缓解方法提供了信息。

英文摘要

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

2507.06092 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Taming Data Challenges in ML-based Security Tasks Using Generative AI

驯服基于ML的安全任务中的数据挑战:使用生成式AI

Shravya Kanchi, Neal Mangaokar, Aravind Cheruvu, Sifat Muhammad Abdullah, Shirin Nilizadeh, Atul Prakash, Bimal Viswanath

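Affiliations: University of Michigan, Ann Arbor; University of Texas at Arlington

AI summary: Proposes augmenting training sets with synthetic data generated by generative AI (GenAI) to improve the generalization of machine learning security classifiers, achieving improvements of up to 32.6% across 7 tasks.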

Comments Accepted at the 2026 ACM Asia Conference on Computer and Communications Security (AsiaCCS 2026)

详情
Journal ref
In Proc. ACM AsiaCCS 2026, Bangalore, India, June 1-5, 2026. ACM, 2026
AI中文摘要

基于机器学习的监督分类器广泛用于安全任务,其改进主要集中在算法进步上。我们认为,对分类器性能产生负面影响的数据挑战受到的关注有限。我们解决以下研究问题:生成式AI(GenAI)的发展能否应对这些数据挑战并提高分类器性能?我们提出使用GenAI技术生成的合成数据增强训练数据集,以改善分类器的泛化能力。我们使用6种最先进的GenAI方法在7个不同的安全任务上评估了这种方法,并引入了一种名为Nimai的新型GenAI方案,该方案能够实现高度可控的数据合成。我们发现,GenAI技术可以显著提高安全分类器的性能,即使在数据严重受限的情况下(仅约180个训练样本),也能实现高达32.6%的提升。此外,我们证明GenAI可以促进部署后对概念漂移的快速适应,在调整过程中只需最少的标注。尽管取得了成功,但我们的研究发现,一些GenAI方案在某些安全任务上难以初始化(训练和生成数据)。我们还识别了特定任务的特征,如噪声标签、重叠的类别分布和稀疏特征向量,这些特征阻碍了使用GenAI提升性能。我们相信,我们的研究将推动未来针对安全任务的GenAI工具的开发。

英文摘要

Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorithmic advancements. We argue that data challenges that negatively impact the performance of these classifiers have received limited attention. We address the following research question: Can developments in Generative AI (GenAI) address these data challenges and improve classifier performance? We propose augmenting training datasets with synthetic data generated using GenAI techniques to improve classifier generalization. We evaluate this approach across 7 diverse security tasks using 6 state-of-the-art GenAI methods and introduce a novel GenAI scheme called Nimai that enables highly controlled data synthesis. We find that GenAI techniques can significantly improve the performance of security classifiers, achieving improvements of up to 32.6% even in severely data-constrained settings (only ~180 training samples). Furthermore, we demonstrate that GenAI can facilitate rapid adaptation to concept drift post-deployment, requiring minimal labeling in the adjustment process. Despite successes, our study finds that some GenAI schemes struggle to initialize (train and produce data) on certain security tasks. We also identify characteristics of specific tasks, such as noisy labels, overlapping class distributions, and sparse feature vectors, which hinder performance boost using GenAI. We believe that our study will drive the development of future GenAI tools designed for security tasks.

2507.00037 2026-05-29 cs.LG cs.AI 版本更新

Model Fusion via Retrofitting

通过回溯改造的模型融合

Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

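Affiliations: ETH Zürich

AI summary: Proposes a neuron-centric fusion algorithm that groups intermediate neurons of the parent models into target representations and trains the fused model's sub-networks to approximate them, using neuron attribution scores to align salient features. It applies to any architecture modularizable as a DAG of levels, with the largest gains in zero-shot and non-IID scenarios.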

Comments 5 figures, 15 tables, 23 pages

详情
AI中文摘要

模型融合旨在将独立训练的神经网络组合成一个单一模型而无需重新训练,但由于排列不变性、随机初始化和异构训练数据导致的表示差异,这一过程变得复杂。现有方法在非独立同分布数据分布下的零样本设置中尤其困难,并且通常局限于特定架构或成对融合。我们引入了一类以神经元为中心的融合算法,将融合视为一个原则性的表示匹配问题:父模型中的中间神经元被分组为目标表示,然后训练融合模型的相应子网络来逼近这些表示。与先前工作不同,我们的方法结合了神经元归因分数以偏向于显著特征的对齐,并且可以应用于任何可模块化为有向无环图层次的架构——在VGG、ResNet和ViT上进行了实证验证。在标准基准上的实验显示,与现有融合方法相比,我们的方法取得了一致的改进,在零样本和非独立同分布场景中增益最大。代码可在https://github.com/AndrewSpano/model-fusion-via-retrofitting获取。

英文摘要

Model fusion seeks to combine independently trained neural networks into a single model without retraining, but is complicated by representational divergence arising from permutation invariance, random initialization, and heterogeneous training data. Existing methods struggle particularly in zero-shot settings under non-IID data distributions, and are often limited to specific architectures or pairwise fusion. We introduce a neuron-centric family of fusion algorithms that frames fusion as a principled representation-matching problem: intermediate neurons across parent models are grouped into target representations, which the fused model's corresponding sub-networks are then trained to approximate. Unlike prior work, our approach incorporates neuron attribution scores to bias alignment toward salient features, and can be applied to any architecture modularizable as a DAG of levels -- empirically validated on VGGs, ResNets, and ViTs. Experiments across standard benchmarks show consistent improvements over existing fusion methods, with the largest gains in zero-shot and non-IID scenarios. Code is available at https://github.com/AndrewSpano/model-fusion-via-retrofitting.

2505.21627 2026-05-29 cs.GT cs.AI cs.CY cs.LG 版本更新

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

你的大语言模型是否在过度收费?分词、透明度与激励

Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez

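AI summary: Shows that under the current pay-per-token pricing mechanism, providers have an incentive to overcharge by strategically misreporting token counts, and proposes an incentive-compatible mechanism that prices tokens linearly in their character count to eliminate that financial incentive.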

Comments Selected as an oral presentation at ICML 2026

详情
AI中文摘要

最先进的大语言模型需要专门的硬件和大量能源来运行。因此,提供大语言模型访问的基于云的服务变得非常流行。在这些服务中,用户为模型生成的输出支付的价格取决于模型用于生成该输出的token数量:他们为每个token支付固定价格。在这项工作中,我们表明这种定价机制为提供商创造了财务激励,使其策略性地虚报模型用于生成输出的token数量,而用户无法证明甚至不知道提供商是否在过度收费。然而,我们也表明,如果不诚实的提供商被强制要求透明地说明模型使用的生成过程,那么在不引起怀疑的情况下最优地虚报是困难的。尽管如此,作为概念验证,我们开发了一种高效的启发式算法,使提供商能够在不引起怀疑的情况下显著过度收费用户。关键的是,我们证明运行该算法的成本低于从过度收费用户中获得的额外收入,突显了当前按token计费机制下用户的脆弱性。此外,我们表明,为了消除策略性行为的财务激励,定价机制必须根据token的字符数线性定价。虽然这会使提供商的利润率因token而异,但我们引入了一个简单的方案,采用这种激励相容定价机制的提供商可以维持他们在按token计费机制下的平均利润率。在此过程中,为了说明和补充我们的理论结果,我们使用来自$\texttt{Llama}$、$\texttt{Gemma}$和$\texttt{Ministral}$系列的几个大语言模型以及来自LMSYS Chatbot Arena平台的输入提示进行了实验。

英文摘要

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
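
The incentive described in this abstract can be illustrated with a toy comparison. The two token splits below are hypothetical and do not come from any real tokenizer, and the prices are arbitrary integer units; the sketch only shows why a bill that depends on token count can change with the reported tokenization while a bill that is linear in character count cannot.

```python
def pay_per_token(tokens, price_per_token):
    """Bill a fixed price for every reported token."""
    return len(tokens) * price_per_token


def pay_per_character(tokens, price_per_char):
    """Bill each token linearly in its character count."""
    return sum(len(t) for t in tokens) * price_per_char


# Two hypothetical tokenizations of the same output string.
shorter = ["token", "ization"]
longer = ["tok", "en", "iz", "ation"]

per_token_bills = (pay_per_token(shorter, 3), pay_per_token(longer, 3))
per_char_bills = (pay_per_character(shorter, 1), pay_per_character(longer, 1))
```

Both splits decode to the same string, so the user cannot tell them apart from the output alone; only the per-character bill is the same for both.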

2502.20838 2026-05-29 cs.SD cs.AI cs.LG eess.AS 版本更新

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

弱监督检测与长时间生物声学数据中鲸叫声的时间定位

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

发表机构 * Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Japan(东京科学大学工学院系统与控制工程系)

AI总结 提出DSMIL-LocNet框架,利用弱监督多实例学习仅使用录音级标签实现鲸叫声的分类和时间定位,在长录音上优于全监督基线。

Comments Accepted in European Signal Processing Conference (EUSIPCO) 2026

详情
AI中文摘要

被动声学监测(PAM)系统生成持续数月连续录音,但自动化生物声学分析鲸叫声需要两种独立的标注工作:用于分类的二元存在标签和用于定位的精确时间边界。一个多分钟录音的二元标签可以在几秒钟内分配,但对其中的每个叫声打时间戳需要数小时的专家努力。在操作规模上同时提供两者是不可行的。我们提出DSMIL-LocNet,一个弱监督多实例学习(MIL)框架,仅使用录音级存在/缺失标签执行分类和时间定位。我们的双流架构整合频谱和时间特征,处理2-30分钟的录音,而无需现有CNN方法在长输入上退化的时间压缩。在AcousticTrends BlueFinLibrary上,DSMIL-LocNet在300-1800秒录音上达到F1分数0.88-0.91,而全监督CNN基线退化为0.19-0.64。它还提供这些基线在没有帧级标注的情况下无法产生的时间定位。代码:https://github.com/Ragib-Amin-Nihal/DSMIL-LocNet

英文摘要

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2--30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88--0.91 on recordings of 300--1800s, where fully supervised CNN baselines degrade to 0.19--0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc

2501.12374 2026-05-29 cs.HC cs.AI cs.CY 版本更新

Expertise elevates AI usage: experimental evidence comparing laypeople and professional artists

专业知识提升AI使用:比较普通人和专业艺术家的实验证据

Thomas F. Eisenmann, Andres Karjus, Mar Canet Sola, Levin Brinkmann, Bramantyo Ibrahim Supriyatno, Iyad Rahwan

发表机构 * Center for Humans and Machines, Max Planck Institute for Human Development(人类与机器中心,马克斯·普朗克人类发展研究所) Tallinn University(塔林大学) Estonian Business School(爱沙尼亚商学院) Academy of Media Arts Cologne(科隆媒体艺术学院)

AI总结 通过实验比较50位专业艺术家和普通人使用生成式AI进行图像复制和创意生成的表现,发现艺术家的专业技能迁移到AI使用中,在复制准确性和发散思维上均优于普通人,而GPT-4o在创意任务上平均略优于艺术家但未超越最佳人类。

Comments Eisenmann and Karjus contributed equally to this work and share first authorship

详情
Journal ref
International Journal of Human-Computer Interaction, 2026, pp 1-22
AI中文摘要

生成式AI的新能力引发了关于人类专业知识未来角色的疑问:AI是否拉平了专业艺术家和普通人之间的差距,还是专业知识增强了AI的使用?专家在分析和绘制视觉艺术时使用的认知技能是否也转移到使用这些新工具上?这项预先注册的研究对50位专业艺术家和人口统计学匹配的普通人样本进行了实验比较。我们的跨学科团队开发了两项任务,涉及图像复制和创意图像生成,评估了他们的复制准确性和发散思维。我们为实验实施了一个定制平台,由现代文本到图像AI驱动。结果显示,艺术家比普通参与者产生了更准确的复制和更多发散的想法,突显了专业知识的技能转移——即使是在生成式AI的有限空间内。我们还探索了一个典型的视觉能力大型语言模型(GPT-4o)的表现:在复制任务上与艺术家相当,在创意任务上平均略优于艺术家,但从未超越最佳人类。这些发现强调了将艺术技能与AI整合的重要性,表明协作协同的潜力可能重塑创意产业和艺术教育。

英文摘要

Generative AI's novel capacities raise questions about the future role of human expertise: does AI level the playing field between professional artists and laypeople, or does expertise enhance AI use? Do the cognitive skills experts make use of in analyzing and drawing visual art also transfer to using these new tools? This pre-registered study conducts experimental comparisons between 50 professional artists and a demographically matched sample of laypeople. Our interdisciplinary team developed two tasks involving image replication and creative image creation, assessing their copying accuracy and divergent thinking. We implemented a bespoke platform for the experiment, powered by a modern text-to-image AI. Results reveal artists produced more accurate copies and more divergent ideas than lay participants, highlighting a skill transfer of professional expertise - even to the confined space of generative AI. We also explored how well an exemplary vision-capable large language model (GPT-4o) would fare: on par in copying and slightly better on average than artists in the creative task, although never above best humans. These findings highlight the importance of integrating artistic skills with AI, suggesting a potential for collaborative synergy that could reshape creative industries and arts education.

2501.10332 2026-05-29 cs.CY cs.AI 版本更新

Agent4Edu: Generating Learner Response Data by Generative Agents for Intelligent Education Systems

Agent4Edu:通过生成式智能体为智能教育系统生成学习者响应数据

Weibo Gao, Qi Liu, Linan Yue, Fangzhou Yao, Rui Lv, Zheng Zhang, Hao Wang, Zhenya Huang

发表机构 * State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室) University of Science and Technology of China(中国科学技术大学) Institute of Artificial Intelligence(人工智能研究院) Hefei Comprehensive National Science Center(合肥综合性国家科学中心)

AI总结 提出Agent4Edu,一种利用大语言模型构建生成式智能体模拟学习者行为,以解决智能教育系统中离线指标与在线性能差异的问题,并支持个性化学习算法评估与优化。

Comments Accepted by AAAI2025

详情
AI中文摘要

个性化学习是智能教育系统中一种有前景的教育策略,旨在提高学习者的练习效率。然而,离线指标与在线性能之间的差异严重阻碍了其进展。为了解决这一挑战,我们引入了Agent4Edu,一种新颖的个性化学习模拟器,通过大语言模型(LLMs)利用人类智能的最新进展。Agent4Edu采用基于LLM的生成式智能体,配备针对个性化学习算法定制的学习者档案、记忆和行动模块。学习者档案使用真实世界的响应数据初始化,捕捉练习风格和认知因素。受人类心理学理论启发,记忆模块记录练习事实和高层摘要,并集成反思机制。行动模块支持多种行为,包括练习理解、分析和响应生成。每个智能体可以与个性化学习算法(如计算机自适应测试)交互,实现对定制服务的多方面评估和增强。通过全面评估,我们探讨了Agent4Edu的优势和不足,强调了智能体与人类学习者之间响应的一致性和差异。代码、数据和附录可在https://github.com/bigdata-ustc/Agent4Edu公开获取。

英文摘要

Personalized learning represents a promising educational strategy within intelligent educational systems, aiming to enhance learners' practice efficiency. However, the discrepancy between offline metrics and online performance significantly impedes their progress. To address this challenge, we introduce Agent4Edu, a novel personalized learning simulator leveraging recent advancements in human intelligence through large language models (LLMs). Agent4Edu features LLM-powered generative agents equipped with learner profile, memory, and action modules tailored to personalized learning algorithms. The learner profiles are initialized using real-world response data, capturing practice styles and cognitive factors. Inspired by human psychology theory, the memory module records practice facts and high-level summaries, integrating reflection mechanisms. The action module supports various behaviors, including exercise understanding, analysis, and response generation. Each agent can interact with personalized learning algorithms, such as computerized adaptive testing, enabling a multifaceted evaluation and enhancement of customized services. Through a comprehensive assessment, we explore the strengths and weaknesses of Agent4Edu, emphasizing the consistency and discrepancies in responses between agents and human learners. The code, data, and appendix are publicly available at https://github.com/bigdata-ustc/Agent4Edu.

2410.23222 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Dataset-Driven Channel Masks in Transformers for Multivariate Time Series

数据集驱动的Transformer通道掩码用于多变量时间序列

Seunghan Lee, Taeyoung Park, Kibok Lee

发表机构 * Department of Statistics and Data Science, Yonsei University(延世大学统计与数据科学系) LG AI Research(LG人工智能研究)

AI总结 提出部分通道依赖(PCD)概念,通过数据集特定的通道掩码(CMs)改进Transformer中的通道依赖建模,并在多种任务和数据集上验证有效性。

Comments ICASSP 2026. Preliminary version: NeurIPS Workshop on Time Series in the Age of Large Models 2024 (Oral presentation)

详情
AI中文摘要

最近基础模型的进展已成功扩展到时间序列(TS)领域,这得益于大规模TS数据集的出现。捕获通道依赖(CD)对于建模多变量时间序列至关重要,基于注意力的方法已被广泛用于此目的。尽管如此,这些方法主要关注修改架构,往往忽略了数据集特定特征的重要性。在这项工作中,我们引入了部分通道依赖(PCD)的概念,通过利用数据集特定信息来增强基于Transformer的模型中的CD建模,从而细化模型捕获的CD。为了实现PCD,我们提出了通道掩码(CMs),通过逐元素乘法将其集成到Transformer的注意力矩阵中。CMs由两个组件组成:1)捕获通道之间关系的相似性矩阵,以及2)数据集特定且可学习的领域参数,用于细化相似性矩阵。我们在多种任务和数据集上使用不同的骨干网络验证了PCD的有效性。代码可在此存储库获取:https://github.com/YonseiML/pcd。

英文摘要

Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. Capturing channel dependency (CD) is essential for modeling multivariate TS, and attention-based methods have been widely employed for this purpose. Nonetheless, these methods primarily focus on modifying the architecture, often neglecting the importance of dataset-specific characteristics. In this work, we introduce the concept of partial channel dependence (PCD) to enhance CD modeling in Transformer-based models by leveraging dataset-specific information to refine the CD captured by the model. To achieve PCD, we propose channel masks (CMs), which are integrated into the attention matrices of Transformers via element-wise multiplication. CMs consist of two components: 1) a similarity matrix that captures relationships between the channels, and 2) dataset-specific and learnable domain parameters that refine the similarity matrix. We validate the effectiveness of PCD across diverse tasks and datasets with various backbones. Code is available at this repository: https://github.com/YonseiML/pcd.
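The element-wise masking step can be sketched as follows. The abstract does not say how the learnable domain parameters refine the similarity matrix, so the affine-plus-sigmoid form below, and the names `alpha` and `beta`, are hypothetical assumptions for illustration rather than the authors' formulation (the linked repository is the authoritative reference).

```python
import math

# Illustrative sketch only. The refinement rule sigmoid(alpha * s + beta) and
# the parameter names alpha/beta are assumptions, not taken from the paper.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_mask(similarity, alpha, beta):
    """Map each channel-pair similarity into (0, 1) using two scalar parameters."""
    return [[sigmoid(alpha * s + beta) for s in row] for row in similarity]


def apply_mask(attention, mask):
    """Element-wise (Hadamard) product of an attention matrix and a channel mask."""
    return [[a * m for a, m in zip(a_row, m_row)]
            for a_row, m_row in zip(attention, mask)]


# Toy 3-channel example: channels 0 and 1 are similar, channel 2 is not.
similarity = [[1.0, 0.8, -0.5],
              [0.8, 1.0, 0.0],
              [-0.5, 0.0, 1.0]]
attention = [[0.5, 0.3, 0.2],
             [0.3, 0.5, 0.2],
             [0.2, 0.2, 0.6]]

mask = channel_mask(similarity, alpha=4.0, beta=0.0)
masked = apply_mask(attention, mask)

for row in masked:
    print([round(v, 3) for v in row])
```

In this toy setup, attention between weakly related channels is scaled down more than attention between similar ones, which is the qualitative effect a partial channel dependence is meant to capture.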

2410.07287 2026-05-29 physics.soc-ph cs.AI 版本更新

Crafting Desirable Climate Trajectories with RL Explored Socio-Environmental Simulations

利用强化学习探索的社会环境模拟来塑造理想的气候轨迹

James Rudd-Jones, Fiona Thendean, María Pérez-Ortiz

发表机构 * UCL Centre for Artificial Intelligence, Department of Computer Science(伦敦大学学院人工智能中心,计算机科学系) University College London(伦敦大学学院) London(伦敦) United Kingdom(英国)

AI总结 本研究通过引入多智能体强化学习替代传统求解器,在综合评估模型中模拟合作与竞争的社会互动,发现合作智能体能一致地实现减排与经济改善,而竞争则导致难以达成理想气候目标。

Comments 23 pages, 13 Figures

详情
AI中文摘要

气候变化构成生存威胁,需要有效的气候政策来实施有影响力的变革。该领域的决策极其复杂,涉及冲突的实体和证据。在过去几十年中,政策制定者越来越多地使用模拟和计算方法来指导部分决策。综合评估模型(IAMs)是其中一种方法,它结合了社会、经济和环境模拟来预测潜在政策效果。例如,联合国在其最近的政府间气候变化专门委员会(IPCC)报告中使用了IAMs的输出。传统上,这些模型使用递归方程求解器求解,但存在若干缺点,例如在不确定性下决策困难。最近使用强化学习(RL)替代传统求解器的初步工作显示,在不确定和嘈杂场景中决策有前景的结果。我们通过引入多个交互的RL智能体作为初步分析,扩展了这项工作,以模拟驱动当前气候危机的各种利益相关者或国家之间复杂的社会互动。我们的发现表明,该框架中的合作智能体能够一致地规划出通往更理想未来的路径,表现为减少碳排放和改善经济。然而,当引入智能体之间的竞争时,例如通过使用相反的奖励函数,理想的气候未来很少能达到。模拟竞争对于提高这些模拟的真实性至关重要,因此我们通过可视化导致更不确定行为的状态来采用策略解释,以理解算法失败的原因。最后,我们强调了当前局限性和未来工作的方向,以确保未来技术应用于政策制定。

英文摘要

Climate change poses an existential threat, necessitating effective climate policies to enact impactful change. Decisions in this domain are incredibly complex, involving conflicting entities and evidence. In recent decades, policymakers have increasingly used simulations and computational methods to guide some of their decisions. Integrated Assessment Models (IAMs) are one such method, combining social, economic, and environmental simulations to forecast potential policy effects. For example, the UN uses outputs of IAMs for its recent Intergovernmental Panel on Climate Change (IPCC) reports. Traditionally these models have been solved using recursive equation solvers, which have several shortcomings, e.g. struggling with decision making under uncertainty. Recent preliminary work using Reinforcement Learning (RL) to replace the traditional solvers shows promising results in decision making in uncertain and noisy scenarios. We extend this work by introducing multiple interacting RL agents as a preliminary analysis of modelling the complex interplay of social interactions between various stakeholders or nations that drives much of the current climate crisis. Our findings show that cooperative agents in this framework can consistently chart pathways towards more desirable futures in terms of reduced carbon emissions and improved economy. However, upon introducing competition between agents, for instance by using opposing reward functions, desirable climate futures are rarely reached. Modelling competition is key to increased realism in these simulations; we therefore employ policy interpretation, visualising which states lead to more uncertain behaviour, to understand algorithm failure. Finally, we highlight the current limitations and avenues for further work to ensure future technology uptake for policy derivation.

2405.13003 2026-05-29 cs.CL cs.AI cs.IR 版本更新

A Survey on Recent Advances in Conversational Data Generation

对话数据生成最新进展综述

Heydar Soudani, Roxana Petcu, Evangelos Kanoulas, Faegheh Hasibi

发表机构 * Radboud University(拉博德大学) University of Amsterdam(阿姆斯特丹大学)

AI总结 本文系统综述了多轮对话数据生成方法,涵盖开放域、任务导向和信息检索三类对话系统,提出了包含种子数据创建、话语生成和质量过滤的通用框架,并讨论了评估指标与未来方向。

详情
AI中文摘要

近年来对话系统的进步显著增强了各领域的人机交互。然而,由于专业对话数据的稀缺,训练这些系统面临挑战。传统上,对话数据集通过众包创建,但该方法成本高、规模有限且劳动密集。作为解决方案,合成对话数据的开发应运而生,利用技术增强现有数据集或将文本资源转换为对话格式,提供了一种更高效且可扩展的数据集创建方法。在本综述中,我们系统全面地回顾了多轮对话数据生成,重点关注三类对话系统:开放域、任务导向和信息检索。我们根据种子数据创建、话语生成和质量过滤方法等关键组件对现有研究进行分类,并引入了一个概述对话数据生成系统主要原则的通用框架。此外,我们考察了评估合成对话数据的指标和方法,探讨了当前领域的挑战,并探索了未来研究的潜在方向。我们的目标是通过概述最先进的方法并强调该领域进一步研究的机会,加速研究人员和从业者的进展。

英文摘要

Recent advancements in conversational systems have significantly enhanced human-machine interactions across various domains. However, training these systems is challenging due to the scarcity of specialized dialogue data. Traditionally, conversational datasets were created through crowdsourcing, but this method has proven costly, limited in scale, and labor-intensive. As a solution, the development of synthetic dialogue data has emerged, utilizing techniques to augment existing datasets or convert textual resources into conversational formats, providing a more efficient and scalable approach to dataset creation. In this survey, we offer a systematic and comprehensive review of multi-turn conversational data generation, focusing on three types of dialogue systems: open domain, task-oriented, and information-seeking. We categorize the existing research based on key components like seed data creation, utterance generation, and quality filtering methods, and introduce a general framework that outlines the main principles of conversation data generation systems. Additionally, we examine the evaluation metrics and methods for assessing synthetic conversational data, address current challenges in the field, and explore potential directions for future research. Our goal is to accelerate progress for researchers and practitioners by presenting an overview of state-of-the-art methods and highlighting opportunities to further research in this area.

2205.04297 2026-05-29 cs.RO cs.AI 版本更新

Learning A Simulation-based Visual Policy for Real-world Peg In Unseen Holes

基于学习的视觉策略用于真实世界中未见过孔洞的插拔

Liang Xie, Hongxiang Yu, Kechun Xu, Tong Yang, Minhang Wang, Haojian Lu, Rong Xiong, Yue Wang

发表机构 * College of Control Science and Engineering, Zhejiang University, Zhejiang, China.(控制科学与工程学院,浙江大学,浙江,中国) The Application Innovate Lab, Huawei Incorporated Company, China.(应用创新实验室,华为公司,中国)

AI总结 提出一种基于学习的视觉插拔方法,通过解耦感知与策略模块,在仿真中训练多种形状,并仅需少量仿真到现实迁移成本即可适应真实世界中任意未见形状。

详情
AI中文摘要

本文提出一种基于学习的视觉插拔方法,能够在仿真中训练多种形状,并在真实世界中以最小的仿真到现实迁移成本适应任意未见形状。核心思想是将感知-运动策略的泛化解耦为快速适应的感知模块和仿真通用策略模块的设计。框架包括分割网络(SN)、虚拟传感器网络(VSN)和控制器网络(CN)。具体地,VSN被训练用于从分割图像中测量未见形状的位姿。然后,给定与形状无关的位姿测量,CN被训练以实现通用插拔。最后,当应用于真实未见孔洞时,我们只需微调仿真VSN+CN所需的分割网络。为进一步最小化迁移成本,我们提出在一分钟人工教学后自动收集和标注分割网络的数据。展示了在眼在外/眼在手配置下的仿真和真实世界结果。采用所提策略的电动汽车充电系统在2-3秒内实现了10/10的成功率,仅使用数百个自动标注样本进行分割网络迁移。

英文摘要

This paper proposes a learning-based visual peg-in-hole method that enables training with several shapes in simulation and adapting to arbitrary unseen shapes in the real world with minimal sim-to-real cost. The core idea is to decouple the generalization of the sensory-motor policy into the design of a fast-adaptable perception module and a simulated generic policy module. The framework consists of a segmentation network (SN), a virtual sensor network (VSN), and a controller network (CN). Concretely, the VSN is trained to measure the pose of the unseen shape from a segmented image. After that, given the shape-agnostic pose measurement, the CN is trained to achieve generic peg-in-hole. Finally, when applying the policy to real unseen holes, we only have to fine-tune the SN required by the simulated VSN+CN. To further minimize the transfer cost, we propose to automatically collect and annotate the data for the SN after one-minute human teaching. Simulated and real-world results are presented under eye-to-hand and eye-in-hand configurations. An electric vehicle charging system with the proposed policy inside achieves a 10/10 success rate in 2-3s, using only hundreds of auto-labeled samples for the SN transfer.

2605.29578 2026-05-29 cs.AI 版本更新

GPS-Enhanced Tourist Mobility Modeling with Seasonal Spatial Priors and LLM-Based Activity Chain Generation

基于季节性空间先验和LLM活动链生成的GPS增强游客移动性建模

Yifan Liu, Yanling Sang, Xishun Liao, Morgan Sun, Bo Yang, Zhiyuan Zhang, Chris Stanford, Haoxuan Ma, Jiaqi Ma

发表机构 * UCLA Mobility Lab, Department of Civil and Environmental Engineering, University of California, Los Angeles(加州大学洛杉矶分校移动实验室,土木与环境工程系,加州大学洛杉矶分校) Novateur Research Solutions(Novateur研究解决方案) University of Central Florida(中央佛罗里达大学)

AI总结 提出一种四阶段仿真框架,结合GPS和调查数据推导的月份条件空间先验、游客人口统计信息、距离可行区域序列分配以及基于LLM的活动链生成,以解决游客移动性建模中非例行、吸引驱动且对旅行目的、季节和成员组成高度敏感的问题。

详情
AI中文摘要

游客移动性对城市交通规划提出了独特挑战。与居民通勤不同,游客旅行大多是非例行的、由景点驱动的,并且对旅行目的、旅行季节和旅行成员组成高度敏感。现有方法要么测量聚合的游客空间模式而不生成个人行程,要么合成移动性而不考虑游客特定结构,如旅行持续时间条件、月份变化的景点需求以及家庭共同旅行规则。为了解决这些挑战,我们提出了一个四阶段仿真框架,结合了从GPS和调查数据推导的月份条件空间先验、基于游客人口统计的旅行范围预测、距离可行的区域序列分配,以及在家庭和空间约束下基于LLM的活动链生成。GPS数据仅以隐私保护的聚合形式用作月份条件空间先验,不保留或暴露任何个人轨迹。在东京旅游上的实验表明,基于GPS的游客群体提取恢复了与调查参考一致的空间访问特征,我们的框架生成了人口统计对齐的合成行程,其区域级访问份额与调查分布和停留点导出的月度访问模式紧密对齐。结果证明了该框架作为地理基础、人口统计感知的游客移动性建模方法的有效性。

英文摘要

Tourist mobility poses a distinct challenge for urban transportation planning. Unlike resident commuting, tourist travel is largely non-routine, attraction driven, and highly sensitive to trip purpose, travel season, and trip member composition. Existing approaches either measure aggregate tourist spatial patterns without generating individual schedules, or synthesize mobility without tourist specific structure such as trip duration conditioning, month varying attraction demand, and household co-travel rules. To address these challenges, we propose a four stage simulation framework combining month conditioned spatial priors derived from GPS and survey data, trip extent prediction from tourist demographics, distance feasible ward sequence assignment, and LLM-based activity chain generation under household and spatial constraints. GPS data are used only in privacy preserving aggregated form as month conditioned spatial priors, with no individual traces retained or exposed. Experiments on tourism in Tokyo demonstrate that the GPS based tourist cohort extraction recovers spatial visitation signatures consistent with survey references, and our framework produces demographically aligned synthetic schedules whose ward-level visitation shares align closely with both survey distributions and staypoint derived monthly visitation patterns. The results demonstrate the framework's effectiveness as a geographically grounded, demographically aware approach to tourist mobility modeling.

2605.29568 2026-05-29 cs.AI 版本更新

DeepTool: Scaling Interleaved Deliberation in Tool-Integrated Reasoning via Process-Supervised Reinforcement Learning

DeepTool: 通过过程监督强化学习扩展工具集成推理中的交错思考

Yang He, Xiao Ding, Bibo Cai, Yufei Zhang, Kai Xiong, Zhouhao Sun, Bing Qin, Ting Liu

发表机构 * Research Center for Social Computing(社会计算研究中心) Interactive Robotics, Harbin Institute of Technology, China(交互机器人,哈尔滨工业大学,中国)

AI总结 针对工具集成推理中缺乏逐步监督和自纠正能力的问题,提出DeepTool框架,通过合成交错轨迹和基于动作中心过程奖励的GRPO强化学习,显著提升模型在多个基准上的性能。

详情
AI中文摘要

工具集成推理通过利用外部环境扩展了LLM的能力。然而,现有方法在顺序调用工具时缺乏战略规划和自我纠正所需的思考。虽然强化学习缓解了这一问题,但传统的工具集成推理方法受到稀疏的基于结果奖励的阻碍,无法监督中间推理步骤和工具调用。为了解决这个问题,我们提出了DeepTool,一个新颖的框架,它在每一轮思考、行动和观察的交错过程中扩展了深思熟虑的思考。在DeepTool中,我们首先引入了一个合成流程,将扩展思考演变为交错轨迹,并集成对抗性扰动以确保鲁棒性和自我纠正。其次,我们基于GRPO设计了过程监督强化学习,利用以行动为中心的过程奖励来强化中间交错思考,并在每一轮强制执行精确的工具调用。大量实验表明,DeepTool实现了卓越的性能,在六个基准测试中显著提升了Qwen2.5-7B(例如,AIME24: 3.2% -> 40.4%,HMMT25: 0.0% -> 28.6%)。此外,令牌成本效益分析证实了交错思考的实用性,展示了DeepTool在性能和令牌效率之间的最佳平衡。

英文摘要

Tool-Integrated Reasoning (TIR) extends LLM capabilities by leveraging external environments. However, existing methods lack the deliberation during sequential tool invocation required for strategic planning and self-correction. While RL mitigates this, conventional approaches for Tool-Integrated Reasoning are hindered by sparse outcome-based rewards, failing to supervise intermediate reasoning steps and tool invocations. To address this, we propose DeepTool, a novel framework that scales deliberate thinking within the interleaved process of thinking, action, and observation at each turn. In DeepTool, we first introduce a synthesis pipeline that evolves extended thinking into interleaved trajectories, integrating adversarial perturbations to ensure robustness and self-correction. Secondly, we devise Process-Supervised Reinforcement Learning based on GRPO, which utilizes an Action-Centric Process Reward to reinforce intermediate interleaved thinking and enforce precise tool invocation at every turn. Extensive experiments demonstrate that DeepTool achieves superior performance, boosting Qwen2.5-7B significantly across six benchmarks (e.g., AIME24: 3.2% -> 40.4% and HMMT25: 0.0% -> 28.6%). Furthermore, the token cost-effectiveness analysis confirms the utility of interleaved thinking, demonstrating DeepTool's optimal balance between performance and token efficiency.

2605.29562 2026-05-29 cs.RO cs.AI cs.CV 版本更新

VLA-Pro: Cross-Task Procedural Memory Transfer for Vision-Language-Action Models

VLA-Pro:面向视觉-语言-动作模型的跨任务程序性记忆迁移

Shengyu Si, Yuanzhuo Lu, Ruimeng Yang, Ziyi Ye, Zuxuan Wu, Yu-Gang Jiang

发表机构 * Institute of Trustworthy Embodied AI, Fudan University(复旦大学可信具身人工智能研究院) Shanghai Key Laboratory of Multimodal Embodied AI(上海多模态具身人工智能重点实验室) Shanghai Xinzhi Embodied Intelligence Technology Co., Ltd.(上海新智具身智能技术有限公司)

AI总结 提出VLA-Pro框架,通过存储和检索任务相关的LoRA适配器作为程序性记忆,实现跨任务泛化,在仿真和真实任务中成功率显著提升。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在通用机器人操作中展现出强大潜力,但在泛化到需要跨物体、场景和动作模式迁移相关经验的新任务时仍面临挑战。本文提出VLA-Pro,一种即插即用框架,通过在训练时存储任务相关的程序性记忆并在推理时迁移这些记忆来增强跨任务泛化。具体而言,VLA-Pro在训练时将任务特定的LoRA适配器存储为参数化的程序性记忆。在推理时,VLA-Pro基于当前多模态上下文检索相关程序性记忆,并动态融合这些记忆以生成当前动作块。在RoboTwin、RLBench和真实世界操作任务上的实验表明,VLA-Pro在多个骨干网络上持续提升跨任务泛化能力,在仿真中实现高达207%的相对改进,并将真实世界成功率从5.8%提升至65.0%。这些结果表明,程序性记忆检索与自适应为将操作经验迁移到新任务提供了一种有效机制,同时保持了模块化和执行稳定性。

英文摘要

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, yet they still struggle to generalize to unseen tasks that necessitate transferring relevant experience across objects, scenes, and action patterns. This paper proposes VLA-Pro, a plug-and-play framework designed to enhance cross-task generalization by storing task-relevant procedural memories at training time and transferring these memories during inference. Specifically, VLA-Pro stores task-specific LoRA adapters as parameterized procedural memories during training. At inference time, VLA-Pro retrieves relevant procedural memories based on the current multi-modal context and dynamically fuses these memories for generating the current action chunk. Experiments on RoboTwin, RLBench, and real-world manipulation tasks show that VLA-Pro consistently improves cross-task generalization across multiple backbones, achieving up to a 207% relative improvement in simulation and increasing real-world success rate from 5.8% to 65.0%. These results suggest that procedural memory retrieval and adaptation provide an effective mechanism for transferring manipulation experience to novel tasks while preserving modularity and execution stability.

2605.29561 2026-05-29 cs.AI cs.SE 版本更新

ParaTool: Shifting Tool Representations from Context to Parameters

ParaTool: 将工具表示从上下文转移到参数中

Zekai Yu, Qi Meng, Qizhi Chu, Yu Hao, Chuan Shi, Cheng Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出ParaTool框架,通过将每个工具投影为可加载的参数集,结合参数化工具预训练、软工具选择和参数化工具微调三个阶段,使大语言模型无需上下文文档即可进行工具调用,在Stable ToolBench和BFCL上显著优于基于ICL的基线方法。

详情
AI中文摘要

工具调用通过使大语言模型(LLM)能够与外部可执行接口进行基于环境的交互,从而扩展了其能力。然而,主流的上下文学习(ICL)方法通常将详细的工具文档和使用示例直接纳入上下文中,这导致随着上下文长度的增长,推理开销显著增加,并且幻觉风险升高。相反,基于微调的方法虽然提高了通用工具调用能力,但往往无法有效内化先前见过的工具的特定细节,从而仍然依赖于上下文文档。为了解决这些限制,我们提出了ParaTool,一个将每个工具投影到专用的、可加载的参数集中的框架。通过配备这些参数化工具的动态集成,LLM可以在不依赖上下文文档或示例的情况下执行工具调用。具体来说,我们的方法包括三个阶段:(1)参数化工具预训练将不同工具的知识封装到独立的参数模块中;(2)软工具选择使用门控网络动态加权和聚合相关工具参数;(3)参数化工具微调联合更新工具参数以对齐训练和推理过程。在Stable ToolBench和BFCL上的实验表明,ParaTool显著优于基于ICL的强基线方法,在降低计算复杂度的同时实现了优越的性能。

英文摘要

Tool calling extends large language models (LLMs) by enabling grounded interaction with external executable interfaces, thereby supporting environment-coupled problem solving. However, mainstream in-context learning (ICL) approaches typically incorporate detailed tool documentation and usage examples directly into the context. This results in substantial inference overhead and heightened risks of hallucination as the context length grows. Conversely, while tuning-based methods improve general tool-calling capabilities, they often fail to effectively internalize the specific details of previously seen tools, thereby retaining a dependency on in-context documentation. To address these limitations, we propose ParaTool, a framework that projects each tool into a dedicated, loadable set of parameters. By equipping a dynamic integration of these parameterized tools, the LLM can perform tool calling without relying on in-context documents or examples. Specifically, our approach consists of three stages: (1) parametric tool pre-training encapsulates the knowledge of different tools into independent parameter modules; (2) soft tool selection employs a gating network to dynamically weigh and aggregate relevant tool parameters; and (3) parametric tool fine-tuning jointly updates tool parameters to align the training and inference processes. Experiments on Stable ToolBench and BFCL demonstrate that ParaTool significantly outperforms strong ICL-based baselines, achieving superior performance while reducing computational complexity.
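Stage (2), soft tool selection, can be pictured with a minimal sketch. The abstract only states that a gating network weighs and aggregates tool parameters; the dot-product scoring, the softmax, and every name below (`query`, `tool_keys`, `tool_params`) are assumptions made for illustration and are not the ParaTool implementation.

```python
import math

# Illustrative sketch only: scoring by dot product and mixing by softmax are
# assumptions; the abstract does not specify the gating network's form.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_select(query, tool_keys, tool_params):
    """Weigh each tool's parameter vector by its gate weight and sum them."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in tool_keys]
    weights = softmax(scores)
    dim = len(tool_params[0])
    merged = [sum(w * params[i] for w, params in zip(weights, tool_params))
              for i in range(dim)]
    return weights, merged


# Toy example: three hypothetical tools, each with a 2-d key and 3 parameters.
query = [1.0, 0.0]
tool_keys = [[5.0, 0.0], [0.0, 5.0], [-5.0, 0.0]]
tool_params = [[1.0, 1.0, 1.0], [2.0, 2.0, 2.0], [3.0, 3.0, 3.0]]

weights, merged = soft_select(query, tool_keys, tool_params)
print([round(w, 4) for w in weights])
print([round(v, 4) for v in merged])
```

Because the weights are a convex combination, the merged parameters stay within the range spanned by the individual tool modules, and the tool whose key best matches the query dominates the mix.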

2605.29560 2026-05-29 cs.AI 版本更新

Battery-Sim-Agent: Leveraging LLM-Agent for Inverse Battery Parameter Estimation

Battery-Sim-Agent: 利用LLM智能体进行电池逆参数估计

Jiawei Chen, Xiaofan Gui, Shikai Fang, Shengyu Tao, Shun Zheng, Weiqing Liu, Jiang Bian

发表机构 * Peking University(北京大学) Microsoft Research(微软研究院) Zhejiang University(浙江大学) Chalmers University of Technology(皇家理工学院)

AI总结 提出Battery-Sim-Agent框架,将电池逆参数估计重构为推理任务,利用LLM智能体与高保真模拟器闭环交互,通过物理假设和结构化参数更新,显著优于贝叶斯优化等黑箱优化方法。

详情
AI中文摘要

对高保真电池“数字孪生”进行参数化是一个关键但具有挑战性的逆问题,阻碍了电池创新的步伐。现有方法将此表述为黑箱优化(BBO)任务,采用样本效率低且忽视底层物理的算法。在这项工作中,我们引入了一种新范式,将逆问题重新定义为推理任务,并提出了Battery-Sim-Agent,这是第一个将大型语言模型(LLM)智能体与高保真电池模拟器闭环部署的框架。该智能体模仿人类科学家的工作流程:它解释来自模拟器的丰富多模态反馈,形成基于物理的假设来解释差异,并提出结构化的参数更新。在一个系统构建的基准套件上,涵盖多种电池化学成分、操作条件和难度级别,我们的智能体在识别准确参数方面显著优于贝叶斯优化等强BBO基线。我们进一步展示了该框架在复杂长期退化拟合任务中的能力,并在真实电池数据集上验证了其实用性。我们的结果突显了LLM智能体作为基于推理的优化器在科学发现和电池参数估计中的前景。

英文摘要

Parameterizing high-fidelity "digital twins" of batteries is a critical yet challenging inverse problem that hinders the pace of battery innovation. Prevailing methods formulate this as a black-box optimization (BBO) task, employing algorithms that are sample-inefficient and blind to the underlying physics. In this work, we introduce a new paradigm that reframes the inverse problem as a reasoning task, and present Battery-Sim-Agent, the first framework to deploy a Large Language Model (LLM) agent in a closed loop with a high-fidelity battery simulator. The agent mimics a human scientist's workflow: it interprets rich, multi-modal feedback from the simulator, forms physically-grounded hypotheses to explain discrepancies, and proposes structured parameter updates. On a systematically constructed benchmark suite spanning diverse battery chemistries, operating conditions, and difficulty levels, our agent significantly outperforms strong BBO baselines like Bayesian optimization in identifying accurate parameters. We further demonstrate the framework's capability in complex long-horizon degradation fitting tasks and validate its practical applicability on real-world battery datasets. Our results highlight the promise of LLM-agents as reasoning-based optimizers for scientific discovery and battery parameter estimation.

2605.29556 2026-05-29 cs.AI 版本更新

Opt-Verifier: Unleashing the Power of LLMs for Optimization Modeling via Dual-Side Verification

Opt-Verifier:通过双面验证释放大语言模型在优化建模中的潜力

Haoyang Liu, Jie Wang, Boxuan Niu, Xiongwei Han, Yian Xu, Mingxuan Ye, Zijie Geng, Fangzhou Zhu, Tao Zhong, Mingxuan Yuan, Jianye Hao

发表机构 * MoE Key Laboratory of Brain-inspired Intelligent Perception and Cognition(类脑智能感知与认知教育部重点实验室) University of Science and Technology of China(中国科学技术大学) Noah's Ark Lab, Huawei Technologies(华为诺亚方舟实验室) Tianjin University(天津大学)

AI总结 提出Opt-Verifier框架,通过结构侧和解决方案侧的双面验证,利用大语言模型自动构建数学优化模型,显著提升建模准确性。

详情
Journal ref
International Conference on Machine Learning (ICML), 2026
AI中文摘要

构建数学优化模型在运筹学中至关重要,但需要大量人类专业知识。最近的进展利用大语言模型(LLMs)来自动化这一建模过程。然而,现有工作往往难以验证生成的优化模型的正确性,既不检查约束和变量的合理性,也不检查生成模型解的有效性。这阻碍了后续的验证和纠正步骤,从而严重损害了建模准确性。为了解决这一挑战,我们提出了一种新颖的基于LLM的框架,具有从结构和解决方案两个角度的双面验证(Opt-Verifier),从而提高建模准确性。结构侧验证确保生成的优化模型的建模结构与原始问题描述一致,准确捕捉问题的约束和要求。同时,解决方案侧验证解释和评估解的有效性,确认优化模型在逻辑和数学上是合理的。在流行基准上的实验表明,我们的方法在准确性上提高了20%以上。

英文摘要

Building mathematical optimization models is critical in operations research (OR), but it requires substantial human expertise. Recent advancements have utilized large language models (LLMs) to automate this modeling process. However, existing works often struggle to verify the correctness of the generated optimization models, checking neither the rationality of the constraints and variables nor the validity of solutions to the generated models. This hampers the subsequent verification and correction steps and thus severely hurts the modeling accuracy. To address this challenge, we propose a novel LLM-based framework with Dual-side Verification (Opt-Verifier) from both structure and solution perspectives, thereby improving the modeling accuracy. The structure-side verification ensures that the modeling structure of the generated optimization models aligns with the original problem description, accurately capturing the problem's constraints and requirements. Meanwhile, the solution-side verification interprets and evaluates the solutions' validity, confirming that the optimization models are logically and mathematically sound. Experiments on popular benchmarks demonstrate that our approach achieves over 20% improvement in accuracy.

2605.29547 2026-05-29 cs.LG cs.AI math.OC 版本更新

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

基于随机几何探测的奇异性感知优化:迈向稳定的非光滑优化

Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang

发表机构 * Xi'an Jiaotong-Liverpool University(西安交通大学利物浦大学)

AI总结 针对非光滑优化中Adam优化器的梯度抖动问题,提出奇异性感知Adam(S-Adam),通过局部几何不稳定性(LGI)度量动态调整步长,实现稳定训练并提升泛化性能。

Comments International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

深度学习优化严重依赖于损失景观平滑的假设,而现代架构由于ReLU激活和量化算子等非光滑组件系统性地违反了这一条件。在这种非光滑情况下,Adam等自适应优化器会出现梯度抖动,即由Clarke次微分内冲突信号引起的剧烈振荡,导致收敛性差和泛化能力欠佳。为解决此问题,我们引入了奇异性感知Adam(S-Adam),一种通过基于局部几何不稳定性动态调整步长来稳定训练的新型优化器。我们的关键贡献是局部几何不稳定性(LGI)度量,一种从随机方向导数方差导出的Clarke次微分直径的计算高效估计量。S-Adam采用自适应阻尼机制$\exp(-\lambda\rho)$,在高不稳定性区域减缓更新,同时在平滑盆地保持快速收敛。我们使用微分包含提供了严格的收敛性分析,证明S-Adam以最优的$O(1/\sqrt{T})$速率几乎必然收敛到$(\delta,\epsilon)$-Clarke稳定点。在量化感知训练(QAT)和高噪声小批量学习上的实证评估表明,S-Adam持续优于AdamW和Prox-SGD,在CIFAR-100上实现高达6%的准确率提升,在TinyImageNet上实现3%的提升,同时有效缓解梯度振荡。

英文摘要

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\epsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.
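The damping idea can be illustrated with a rough sketch. The abstract does not give the LGI formula, so the instability score below is a stand-in of my own: it probes random unit directions and measures how far the two one-sided finite-difference slopes along each direction are from cancelling, which is near zero where the function is smooth and large at a kink. Only the damping factor exp(-lambda * rho) comes from the abstract; the function names, probe count, and step `h` are hypothetical.

```python
import math
import random

# Illustrative sketch only. instability_proxy is a hypothetical stand-in for
# the paper's LGI metric, whose exact definition the abstract does not state.

def l1_norm(x):
    """Non-smooth test function with kinks wherever a coordinate is zero."""
    return sum(abs(v) for v in x)


def instability_proxy(func, x, num_dirs=64, h=1e-3, seed=0):
    """Mean squared mismatch of one-sided slopes along random unit directions."""
    rng = random.Random(seed)
    base = func(x)
    total = 0.0
    for _ in range(num_dirs):
        u = [rng.gauss(0.0, 1.0) for _ in x]
        norm = math.sqrt(sum(c * c for c in u))
        u = [c / norm for c in u]
        plus = func([xi + h * ui for xi, ui in zip(x, u)])
        minus = func([xi - h * ui for xi, ui in zip(x, u)])
        # Sum of the forward slopes along u and -u; it vanishes (up to O(h))
        # when the function is differentiable at x.
        residual = (plus + minus - 2.0 * base) / h
        total += residual * residual
    return total / num_dirs


def damped_step_size(base_lr, rho, lam):
    """Scale the step size by exp(-lam * rho), as in the abstract's damping term."""
    return base_lr * math.exp(-lam * rho)


rho_smooth = instability_proxy(l1_norm, [1.0, -2.0])  # away from any kink
rho_kink = instability_proxy(l1_norm, [0.0, 0.0])     # at the kink
print(damped_step_size(0.1, rho_smooth, lam=1.0))
print(damped_step_size(0.1, rho_kink, lam=1.0))
```

In this toy case the step size is left essentially unchanged at the smooth point and shrunk sharply at the origin, which mirrors the stated goal of decelerating updates only in high-instability regions.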

2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR 版本更新

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE:一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

发表机构 * Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology(香港科技大学机械及航空航天工程学系) School of Electronic and Information Engineering, Beihang University(北京航空航天大学电子信息工程学院) State Key Laboratory of CNS/ATM(国家空管新航行系统技术重点实验室)

AI总结 提出SCOPE框架,通过冻结LLM结合插件式开放集分类器和上下文学习机制,实现高效准确的空管复诵监控,在少样本设置下开放集检测准确率达91.05%,异常纠正率96.63%。

详情
AI中文摘要

飞行员对空中交通管制(ATC)语音指令的复诵是航空运输中防止沟通失误的主要保障。然而,复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧,从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型(LLM)凭借其强大的推理和泛化能力开辟了新途径,但现有方法在实践中仍面临部署和计算障碍。在这项工作中,我们提出了SCOPE(Semantic reasoning for Communication via Open-set Plug-in with Examples),一种新颖的轻量训练LLM框架,提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上,将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明,SCOPE在实现运行环境所需的低延迟响应的同时,达到了优越的准确性。在少样本设置下,SCOPE在开放集检测中达到91.05%的准确率,并纠正了96.63%的异常复诵,从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

2605.29534 2026-05-29 cs.AI 版本更新

UI-KOBE: Knowledge-Oriented Behavior Exploration for Lightweight Graph-Guided GUI Agents

UI-KOBE:面向轻量级图引导GUI代理的知识导向行为探索

Yuxiang Chai, Han Xiao, Xinyu Fu, Jinpeng Chen, Rui Liu, Hongsheng Li

发表机构 * CUHK MMLab(香港中文大学多媒体实验室) Huawei Research(华为研究) Shenzhen Loop Area Institute(深圳河套学院) CPII under InnoHK(InnoHK旗下感知与交互智能中心)

AI总结 提出UI-KOBE框架,通过自动构建应用知识图谱并引导轻量级GUI代理进行运行时决策,以提升其移动端GUI任务执行效果。

详情
AI中文摘要

近期移动GUI代理的进展显示出自动化移动任务的强大潜力,但大多数有效系统仍依赖大型视觉语言模型进行截图理解和长期规划。可直接部署在移动设备上的小型GUI代理在实际应用中更具吸引力,具有更低的推理成本和更好的敏感设备信息保护。然而,由于模型容量有限,这些轻量级代理在仅凭截图端到端规划和执行GUI任务时仍不可靠。我们提出知识导向行为探索(UI-KOBE),一种利用可复用的应用特定图知识来改进轻量级移动GUI代理的框架。UI-KOBE首先自主探索移动应用并构建应用知识图谱,其中节点代表不同的UI状态,边代表可执行的转换。运行时,轻量级GUI代理将图作为外部指导:给定用户任务和当前截图,它识别当前图节点,并选择与该节点关联的自循环动作、相邻转换、任务完成或回退自由动作。通过用应用特定的图指导支持运行时决策,UI-KOBE减轻了端到端GUI规划的负担,帮助轻量级模型更有效地执行移动GUI任务,为高效、可解释且注重隐私的设备端GUI代理提供了实用的一步。

英文摘要

Recent advances in mobile GUI agents have shown strong potential for automating mobile tasks, but most effective systems still depend on large vision-language models for screenshot understanding and long-horizon planning. Small GUI agents that can be deployed directly on mobile devices are more attractive for practical use, offering lower inference cost and better protection of sensitive on-device information. However, due to limited model capacity, such lightweight agents remain unreliable when planning and executing GUI tasks end-to-end from screenshots alone. We propose Knowledge-Oriented Behavior Exploration (UI-KOBE), a framework that improves lightweight mobile GUI agents with reusable app-specific graph knowledge. UI-KOBE first autonomously explores a mobile application and constructs an app knowledge graph, where nodes represent distinct UI states and edges represent executable transitions. At runtime, a lightweight GUI agent uses the graph as external guidance: given a user task and the current screenshot, it identifies the current graph node and selects among self-loop actions, neighboring transitions, task completion, or fallback free actions associated with that node. By supporting runtime decisions with app-specific graph guidance, UI-KOBE reduces the burden of end-to-end GUI planning and helps lightweight models perform mobile GUI tasks more effectively, offering a practical step toward efficient, interpretable, and privacy-conscious on-device GUI agents.
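The graph-guided decision step described above can be pictured with a small data-structure sketch. The class names, field names, and action labels here (`UINode`, `AppGraph`, `candidates`, `tap_search`, and so on) are hypothetical and chosen only for illustration; the abstract does not specify UI-KOBE's actual schema or how the agent ranks the options.

```python
from dataclasses import dataclass, field


@dataclass
class UINode:
    """One distinct UI state in the app knowledge graph."""
    node_id: str
    description: str
    self_loops: list = field(default_factory=list)   # actions that stay on this screen
    transitions: dict = field(default_factory=dict)  # action name -> destination node_id


@dataclass
class AppGraph:
    """Nodes are UI states; edges are executable transitions between them."""
    nodes: dict = field(default_factory=dict)

    def add_node(self, node):
        self.nodes[node.node_id] = node

    def candidates(self, node_id):
        """List the options an agent could choose from at the given node."""
        node = self.nodes[node_id]
        options = [("self_loop", action, node_id) for action in node.self_loops]
        options += [("transition", action, dst) for action, dst in node.transitions.items()]
        options.append(("complete", None, node_id))  # declare the task finished
        options.append(("fallback", None, None))     # free-form action outside the graph
        return options
```

In this sketch a lightweight model would only need to pick one entry from `candidates(current_node)` instead of planning the whole task from raw screenshots, which is the reduction in burden the abstract describes.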

2605.29532 2026-05-29 cs.SE cs.AI 版本更新

GUITestScape: Towards Open-set Evaluation on Exploratory GUI Testing

GUITestScape:面向探索性GUI测试的开放集评估

Xiaoyi Chen, Yifei Gao, Yang Xu, Xingxing Song, Yi Zhang, Jitao Sang

发表机构 * Beijing Jiaotong University(北京交通大学) Independent Researcher(独立研究者)

AI总结 提出GUITestScape基准和GUIJudge评估器,通过覆盖交互与显示缺陷的508个预设缺陷及过程感知评估方法,解决现有GUI测试评估局限于预定义标注和交互缺陷的问题。

详情
AI中文摘要

探索性GUI测试对MLLM代理提出了特别高的要求:在没有预定义测试脚本的情况下,代理必须自主导航应用程序并通过自身交互发现缺陷。然而,当前的评估在两个层面上存在不足。首先,现有基准几乎只关注交互缺陷,将显示缺陷排除在评估框架之外。其次,评估协议局限于预定义的缺陷标注,将测试过程简化为单一终态判断,混淆了性质不同的失败模式。为解决这些挑战,我们提出了GUITestScape,一个交互式基准,涵盖61个真实Android应用程序和508个预设缺陷(包括交互和显示类型),并引入了GUIJudge,一个开放集评估器,将代理的测试轨迹分解为可独立诊断的能力。实验结果表明,GUIJudge在预定义标注之外实现了可靠的过程感知评估,显著优于所有基线。在GUITestScape上的基准测试进一步揭示,检测仍然是现有模型在两种缺陷类型上的关键瓶颈,并且将GUIJudge的验证器集成到现有代理中可以在不重新训练的情况下显著提升其检测性能。

英文摘要

Exploratory GUI testing is a particularly demanding setting for MLLM agents: without predefined test scripts, an agent must autonomously navigate an application and discover defects through its own interaction. However, current evaluation falls short on two fronts. First, existing benchmarks focus almost exclusively on interaction defects, leaving display defects outside the evaluation frame. Second, evaluation protocols are bound to predefined defect annotations, collapsing the testing process into a single end-state judgment that conflates qualitatively distinct failure modes. To address these challenges, we present GUITestScape, an interactive benchmark covering 61 real-world Android applications and 508 preset defects spanning interaction and display types, and introduce GUIJudge, an open-set evaluator that decomposes an agent's testing trajectory into independently diagnosable capabilities. Experimental results demonstrate that GUIJudge achieves reliable process-aware evaluation beyond predefined annotations, substantially outperforming all baselines. Benchmarking on GUITestScape further reveals that detection remains the critical bottleneck for existing models across both defect types, and that integrating GUIJudge's verifiers into existing agents significantly boosts their detection performance without retraining.

2605.29524 2026-05-29 cs.CR cs.AI 版本更新

KBF: Knowledge Boundary as Fingerprint for Language Model and Black-Box API Auditing

KBF:知识边界作为语言模型和黑盒API审计的指纹

Yijia Fang, Yiqing Feng, Bingyu Li, Mingxun Zhou

发表机构 * Beihang University China(北航中国) Xidian University China(西电中国)

AI总结 提出KBF协议,利用知识边界附近的稳定数值召回率作为指纹,低成本黑盒审计模型API,检测替代和混合路由攻击。

Comments 20 pages, 13 figures

详情
AI中文摘要

中继和转售API越来越多地中介对大型语言模型(LLM)的访问,但用户无法直接验证声称的端点是否实际服务于广告中的模型。我们引入了KBF,一种低成本的黑盒审计协议,利用知识边界附近的稳定数值召回率对模型API进行指纹识别。在16个生产LLM端点上,KBF标记了所有155个经济相关的替代,而没有拒绝任何同模型对照,在部署变化下保持稳定,检测到仅5-10%流量被替代的高分离度混合路由攻击,并发现六个平台影子API审计中27个平台模型单元中的7个与其参考端点在统计上不一致,不一致集中在高级Claude端点上。

英文摘要

Relay and reseller APIs increasingly intermediate access to large language models (LLMs), but users have no direct way to verify that a claimed endpoint is actually serving the advertised model. We introduce KBF, a low-cost black-box auditing protocol that fingerprints model APIs using stable numerical recall near the knowledge boundary. Across 16 production LLM endpoints, KBF flags all 155 economically relevant substitutions without rejecting any same-model controls, remains stable under deployment variation, detects high-separation mixed-routing attacks when only 5-10% of traffic is substituted, and finds that 7 of 27 platform model cells in a six-platform shadow API audit are statistically inconsistent with their reference endpoints, with inconsistencies concentrated on premium Claude endpoints.

2605.29522 2026-05-29 cs.AI 版本更新

DeepSurvey: Enhancing Analytical Depth and Citation Reliability in Automated Survey Generation

DeepSurvey: 提升自动综述生成中的分析深度与引用可靠性

Ziyue Yang, Da Ma, Hanqi Li, Zijian Wang, Tiancheng Huang, Zijian Hu, Chenrun Wang, Yunzhe Zhang, Xiaobao Wu, Kai Yu, Lu Chen

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院X-LANCE实验室) Jiangsu Key Lab of Language Computing, Suzhou, China(江苏省语言计算重点实验室) Suzhou Laboratory, Suzhou, China(苏州实验室)

AI总结 提出DeepSurvey智能体系统,通过结构化全文笔记、跨论文关系建模和代码仓库分析增强分析深度,结合引文图扩展与混合过滤、证据约束引用分配及多粒度智能体精炼提升引用可靠性,在内容质量和引用准确性上超越现有方法。

详情
AI中文摘要

随着科学文献的快速增长,自动综述生成已成为AI科学家和人类研究者的关键能力。然而,现有系统由于依赖摘要和孤立论文处理而分析深度有限,并且由于不精确的检索和事后归因而导致引用不可靠,从而产生肤浅的综述并可能误导研究者。我们提出DeepSurvey,一个解决这两个问题的智能体系统。为了增强深度,DeepSurvey从全文论文中提取结构化要点,通过聚类和比较分析建模跨论文关系,并集成代码仓库分析以恢复实现级细节。为了加强可靠性,它结合引文图扩展与混合过滤进行主题聚焦检索,强制执行证据约束的引用分配,并部署多粒度智能体精炼以验证引用-声明对齐。实验表明,DeepSurvey在内容得分(8.644/10)和引用质量(召回率和精确率分别比最强基线提高12.3%和9.3%)上达到最高,跨领域泛化更稳健(CS到非CS的下降为0.14 vs 0.22至0.69),并且领域专家更倾向于选择它而非人类撰写的综述(整体质量83.3%,内容深度100%)。

英文摘要

As scientific literature grows rapidly, automated survey generation has become a key capability for AI scientists and human researchers. However, existing systems suffer from limited analytical depth due to reliance on abstracts and isolated paper processing, and from unreliable citations caused by imprecise retrieval and post-hoc grounding, producing superficial surveys that may mislead researchers. We present DeepSurvey, an agentic system that addresses both problems. To enhance depth, DeepSurvey extracts structured keynotes from full-text papers, models cross-paper relationships through clustering and comparative analysis, and integrates code-repository analysis to recover implementation-level details. To fortify reliability, it combines citation-graph expansion with hybrid filtering for topic-focused retrieval, enforces evidence-constrained citation assignment, and deploys multi-granularity agentic refinement to validate citation-claim alignment. Experiments show that DeepSurvey achieves the highest content score (8.644/10) and citation quality (12.3% and 9.3% recall and precision gains over the strongest baseline), generalizes more robustly across domains (0.14 vs 0.22 to 0.69 CS-to-non-CS drop), and is preferred over human-written surveys by domain experts (83.3% overall quality, 100% content depth).

2605.29518 2026-05-29 cs.NI cs.AI 版本更新

Network Optimization Aspects of Autonomous Vehicles: Challenges and Future Directions

自动驾驶汽车的网络优化方面:挑战与未来方向

Rudolf Krecht, Tamas Budai, Erno Horvath, Akos Kovacs, Nobert Marko, Miklos Unger

发表机构 * Department of Automation and Mechatronics, Széchenyi István University(自动化与机电系,塞切尼伊斯特万大学) Department of Telecommunications, Széchenyi István University(电信系,塞切尼伊斯特万大学) Vehicle Industry Development Center, Széchenyi István University(车辆工业发展中心,塞切尼伊斯特万大学)

AI总结 本文综述了自动驾驶汽车网络优化的多学科方法,包括协同感知,旨在消除误解并展望未来方向。

详情
AI中文摘要

全球大趋势,如城市化、人口增长和新兴网络解决方案,正在加速互联和自动驾驶汽车(CAVs)行业的发展。公众对CAVs的看法中存在许多事实、一些误解,甚至一些兴奋。本文的主要目标是通过呈现各种多学科方法(如协同感知)来提供全面综述、消除误解,并概述自动驾驶汽车网络优化方面的未来。基于我们在CAVs方面的广泛经验,我们旨在分享我们获得的一些见解和知识,以及相关的用例和实验结果。

英文摘要

Global megatrends such as urbanization, population growth, and emerging network solutions are accelerating the development of the Connected and Autonomous Vehicles (CAVs) industry. Public opinion about CAVs mixes many truths, some misconceptions, and a degree of excitement. The main objective of this article is to provide a comprehensive review, eliminate misconceptions, and outline the future of the network optimization aspects of autonomous vehicles by presenting various multidisciplinary methods, such as cooperative perception. Drawing on our extensive experience with CAVs, we aim to share some of the insights and knowledge we have gained, along with relevant use cases and experimental results.

2605.29512 2026-05-29 cs.AI 版本更新

MINDGAMES: A Live Arena for Evaluating Social and Strategic Reasoning in Multi-Agent LLMs

MINDGAMES: 多智能体LLM中社会与策略推理评估的实时竞技场

Kevin Wang, Anna Thöni, Benjamin Kempinski, Bobby Cheng, Jianzhu Yao, Benjamin Finch, Leon Guertler, Viraj Nadkarni, Yihan Jiang, Aliaksei Korshuk, Alexander Buyantuev, Ilya Makarov, Siyuan Wu, Yu-Chi Cheng, Yan-Ru Ju, Ti-Rong Wu, I-Hsuan Chu, Yu-Yu Yang, I-Chen Wu, Yitian Huang, Qinlu Cao, Yiheng Sun, Yuhong Dai, Hongkun Yao, Jingxuan Fu, Jiwei Zhang, Hao Liao, Mossimo Ebeling, Govind Arun, Sadhvik Bathini, Mihir S Arya, Avinash Anish, Aditya Ranjan, Kirtana Sunil Phatnani, Paval KS, Vrushali Mehta, Aravind S, Nikhil Arora, Tanya Upadhyay, Amol Bandagale, Yuan Lu, ChunEn Hsiao, YuTing Lin, Arvin Chung, Jerry John Thomas, Mathieu Laurière, Leshem Choshen, Yoram Bachrach, Pramod Viswanath, Maria Polukarov, Cheston Tan, Tal Kachman, Atlas Wang

发表机构 * NeurIPS 2025 Competition(NeurIPS 2025 会议)

AI总结 提出MINDGAMES多游戏竞技平台,通过四个游戏环境评估LLM智能体的社会推理与策略能力,揭示规则遵循瓶颈与排行榜有效性差异。

详情
AI中文摘要

大型语言模型(LLM)正越来越多地被部署为交互式智能体,但它们在长时间交互中的社会与策略推理能力仍知之甚少。现有评估依赖于静态场景或单一游戏基准,无法捕捉现实多智能体环境所需的持续、多面推理。我们引入MINDGAMES,一个多游戏竞技场和LLM智能体评估平台,它操作化了与“心智理论”相关的互补推理需求:隐藏信息下的信念归因、通过重复策略交互进行对手建模、知识不对称下的合作推理,以及社会推理中的持续欺骗。基于TextArena,MINDGAMES提供了统一的交互界面、基于TrueSkill的评分和四个游戏环境的完整轨迹记录。我们通过2025年在一场主要AI会议上举办的竞赛周期实例化MINDGAMES,评估了来自76个团队的944个提交智能体,涉及四个游戏:Colonel Blotto、迭代囚徒困境、Codenames和Secret Mafia。我们的分析揭示了智能体层面和评估层面的局限性:脆弱的规则遵循仍是主要瓶颈,顶级系统反复依赖显式结构支撑,且排行榜有效性在不同环境中差异显著。特别是,失败密集的环境可能同样奖励对对手错误的鲁棒性和策略能力,其中Secret Mafia在本周期中表现出明显的错误生存混杂。我们发布了一个包含29,571场多智能体游戏的数据集,包含回合级观察、动作和奖励,以及MG-Ref,一个确定性离线锦标赛协议,该协议使用与本分析相同的错误归因视角,将新智能体与冻结的顶级、低错误Stage II提交参考池进行评分。

英文摘要

Large language models (LLMs) are increasingly deployed as interactive agents, yet their capacity for social and strategic reasoning over extended interaction remains poorly understood. Existing evaluations rely on static vignettes or single-game benchmarks that cannot capture the sustained, multi-faceted reasoning that real-world multi-agent settings demand. We introduce Mindgames, a multi-game arena and evaluation platform for LLM agents that operationalizes complementary reasoning demands relevant to "theory of mind": belief attribution under hidden information, opponent modeling through repeated strategic interaction, cooperative inference under knowledge asymmetries, and sustained deception in social deduction. Built on TextArena, Mindgames provides a unified interaction interface, TrueSkill-based rating, and full trajectory logging across four game environments. We instantiate Mindgames through a 2025 competition cycle hosted at a major AI conference, which assessed 944 submitted agents from 76 teams across four games: Colonel Blotto, Iterated Prisoner's Dilemma, Codenames, and Secret Mafia. Our analysis surfaces both agent-level and evaluation-level limitations: brittle rule adherence remains a major bottleneck, top-performing systems repeatedly rely on explicit structural scaffolding, and leaderboard validity differs sharply across environments. In particular, failure-heavy environments can reward robustness to opponent errors as much as strategic ability, with Secret Mafia exhibiting a pronounced error-survival confound in this cycle. We release a dataset of 29,571 multi-agent games with turn-level observations, actions, and rewards, together with MG-Ref, a deterministic offline tournament protocol that scores new agents against a frozen reference pool of top-ranked, low-error Stage II submissions under the same error-attribution lens used in this analysis.

2605.29507 2026-05-29 cs.AI cs.IR 版本更新

Xetrieval: Mechanistically Explaining Dense Retrieval

Xetrieval:机械解释稠密检索

Zhixin Cai, Jun Bai, Yang Liu, Jiaqi Li, Yichi Zhang, Taichuan Li, Zhuofan Chen, Zixia Jia, Zilong Zheng, Wenge Rong

发表机构 * School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) State Key Laboratory of General Artificial Intelligence, BIGAI(通用人工智能国家重点实验室,BIGAI)

AI总结 提出Xetrieval框架,通过嵌入级别的推理内化器和稀疏可解释特征分解,机械地解释稠密检索模型为何赋予高相关性分数。

Comments Code: https://github.com/Hihiczx/Xetrieval ; Project page: https://hihiczx.github.io/Xetrieval

详情
AI中文摘要

解释稠密检索器为何赋予高相关性分数仍然具有挑战性,因为检索决策是通过不透明的高维嵌入做出的。现有的解释通常关注表面信号,如词汇匹配、令牌对齐或事后文本理由,因此对塑造稠密检索行为在嵌入级别的潜在因素提供的洞察有限。我们提出Xetrieval,一个嵌入级别的机械框架,用于解释稠密检索。Xetrieval首先引入一个轻量级推理内化器,通过单次前向传递直接在嵌入空间中近似思维链推理,丰富句子嵌入的推理导向信息,同时避免昂贵的自回归生成。然后,它将这些推理增强的嵌入分解为稀疏、人类可解释的特征,每个特征与连贯的自然语言描述相关联。通过聚合多个文档端视图上的稀疏特征重叠,Xetrieval提供单个检索决策的特征级解释。在多种检索器和基准上的实验表明,Xetrieval揭示了连贯的可解释特征,产生更强的成对干预效果,并支持任务级特征引导。项目页面和源代码可在https://hihiczx.github.io/Xetrieval获取。

英文摘要

Explaining why dense retrievers assign high relevance scores remains challenging because retrieval decisions are made through opaque high-dimensional embeddings. Existing explanations often focus on surface signals, such as lexical matches, token alignments, or post-hoc textual rationales, and thus provide limited insight into the latent factors that shape dense retrieval behavior at the embedding level. We propose Xetrieval, an embedding-level mechanistic framework for explaining dense retrieval. Xetrieval first introduces a lightweight reasoning internalizer that approximates Chain-of-Thought reasoning directly in the embedding space with a single forward pass, enriching sentence embeddings with reasoning-oriented information while avoiding expensive autoregressive generation. It then decomposes these reasoning-enhanced embeddings into sparse, human-interpretable features, each associated with a coherent natural language description. By aggregating sparse feature overlaps across multiple document-side views, Xetrieval provides feature-level explanations of individual retrieval decisions. Experiments on diverse retrievers and benchmarks show that Xetrieval uncovers coherent interpretable features, yields stronger pair-level intervention effects, and supports task-level feature steering. The project page and source code are available at https://hihiczx.github.io/Xetrieval .

2605.29502 2026-05-29 cs.CL cs.AI 版本更新

Source-Grounded Semantic Reinforcement Learning for Low-Resource Target-Language Generation

源语言锚定的语义强化学习用于低资源目标语言生成

Zeli Su, Ziyin Zhang, Zewei Pan, Zhou Liu, Dingcheng Huang, Dehan Li, Zhankai Xu, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

发表机构 * Minzu University of China(中央民族大学) Ant Group(蚂蚁集团) Shanghai Jiao Tong University(上海交通大学) Peking University(北京大学) Harbin Institute of Technology(哈尔滨工业大学) South China University of Technology(华南理工大学)

AI总结 提出源语言锚定的语义强化学习(SG-SRL),通过跨语言语义奖励模型利用源语言单语数据,结合轻量级恢复阶段解决奖励黑客问题,在低资源目标语言生成中提升语义锚定和事实覆盖。

详情
AI中文摘要

低资源目标语言生成通常受限于稀缺的平行数据,而高资源源语言单语数据丰富但难以通过标准监督微调使用。我们提出源语言锚定的语义强化学习(SG-SRL),一种资源利用框架,将源语言单语数据转换为用于目标语言生成的跨语言语义监督。SG-SRL使用跨语言语义奖励模型(由跨语言重排序器实例化,对源输入与目标语言生成之间的语义相关性进行评分)在源语言数据上执行无参考强化学习(RL)。虽然这会导致严重的基于冗长的奖励黑客问题,但使用小型平行语料库的轻量级恢复阶段在保留语义增益的同时恢复了流畅性、简洁性和任务格式。在中文到泰语生成上的实验表明,SG-SRL在冷启动SFT基础上改善了语义锚定和事实覆盖。对长文本迁移和基于藏语嵌入奖励的额外分析阐明了SG-SRL的泛化行为,并表明在现实低资源语言设置中,基于编码器的语义奖励可以替代基于LLM的重排序器。

英文摘要

Low-resource target-language generation is often limited by scarce parallel data, while high-resource source-language monolingual data is abundant but difficult to use with standard supervised fine-tuning. We propose Source-Grounded Semantic Reinforcement Learning (SG-SRL), a resource-utilization framework that converts source-language monolingual data into cross-lingual semantic supervision for target-language generation. SG-SRL performs reference-free reinforcement learning (RL) on source-language data using a cross-lingual semantic reward model, instantiated by a cross-lingual reranker that scores the semantic relevance between the source input and the target-language generation. While this induces severe verbosity-based reward hacking, a lightweight recovery stage using a small parallel corpus restores fluency, conciseness, and task format while preserving the semantic gains. Experiments on Chinese-to-Thai generation show that SG-SRL improves semantic grounding and factual coverage over cold-start SFT. Additional analyses on long-form transfer and Tibetan embedding-based rewards clarify the generalization behavior of SG-SRL and show that an encoder-based semantic reward can substitute for an LLM-based reranker in a realistic low-resource language setting.

2605.29500 2026-05-29 cs.LG cs.AI 版本更新

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

离线策略评估的商DAG:前向流重要性采样与精确板倾向

Ziwen Xie, Shaowen Xiang, Hongyu He, Dianbo Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) National University of Singapore(新加坡国立大学)

AI总结 提出商DAG视角,通过前向流比率合并等价历史,实现精确的无序板倾向计算,减少方差并提高计算效率。

Comments 31 pages, 3 figures, 7 tables

详情
AI中文摘要

离线策略评估利用不同行为策略收集的数据来估计目标策略的表现,这在在线测试成本高或风险大时(如推荐或医疗)至关重要。标准重要性采样对每条记录轨迹进行重加权,但即使评估目标忽略生成过程的某些细节,它仍可能将其视为有意义:例如,自回归板推荐器可能生成有序的项目序列,而奖励和下游估计器仅依赖于无序板。这产生了噪声方差和计算差距,因为精确的无序板倾向需要对所有生成顺序求和。我们引入商DAG视角,合并对评估等价的历史,并在合并图上使用目标与行为的前向流比率分配权重。对于在集合充分的下一个项目接口下的板推荐,这产生了Forward-DP,一种子集DAG动态规划,无需阶乘枚举即可计算精确的无序倾向。得到的倾向基元使得能够对上下文相关的自回归板记录器进行实用的基于倾向的评估和模型选择。

英文摘要

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.
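The computational gap mentioned above (summing over all generation orders versus a dynamic program over subsets) can be made concrete with a short sketch. This is a reconstruction from the abstract's description, not the paper's code: the function names are hypothetical, and the policy is assumed to be set-sufficient, meaning the probability of the next item depends only on the set of items already chosen. A Plackett-Luce sampler without replacement is used as a toy policy.

```python
from itertools import permutations


def unordered_slate_propensity(slate, next_item_prob):
    """Probability that the policy emits `slate` in any order.

    Runs a forward pass over subsets of the slate (k * 2^k work)
    instead of enumerating all k! orderings.
    """
    items = sorted(slate)
    k = len(items)
    flow = [0.0] * (1 << k)
    flow[0] = 1.0  # the empty prefix is reached with probability 1
    for mask in range(1 << k):
        if flow[mask] == 0.0:
            continue
        chosen = frozenset(items[i] for i in range(k) if (mask >> i) & 1)
        for i in range(k):
            if not (mask >> i) & 1:
                # Push flow along the edge "append items[i] to this subset".
                flow[mask | (1 << i)] += flow[mask] * next_item_prob(items[i], chosen)
    return flow[(1 << k) - 1]


def brute_force_propensity(slate, next_item_prob):
    """Reference value: sum the sequence probability over every ordering."""
    total = 0.0
    for order in permutations(sorted(slate)):
        prob, chosen = 1.0, frozenset()
        for item in order:
            prob *= next_item_prob(item, chosen)
            chosen = chosen | {item}
        total += prob
    return total


def make_plackett_luce(weights):
    """Toy set-sufficient policy: sample without replacement, proportional to weight."""
    total_weight = sum(weights.values())

    def next_item_prob(item, chosen):
        remaining = total_weight - sum(weights[j] for j in chosen)
        return weights[item] / remaining

    return next_item_prob
```

Both functions return the same propensity for any slate under a set-sufficient policy; the subset pass merges all orderings that reach the same set, which is the quotient idea the abstract describes. The ratio of two such propensities (target over behavior) would then serve as the importance weight for an unordered slate.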

2605.29493 2026-05-29 cs.CY cs.AI 版本更新

The New Pro Se: Generative AI and the Surge in Federal Civil Self-Representation

新 Pro Se:生成式人工智能与联邦民事诉讼自我代理的激增

Or Cohen-Sasson

发表机构 * Miami Law & AI Lab, University of Miami School of Law(迈阿密法学院与AI实验室,迈阿密大学法学院)

AI总结 本文利用约280万份联邦民事诉讼数据,分析生成式AI普及后自我代理原告率上升、投诉文本变化、诉讼结果及原告构成的变化,发现AI标记投诉更密集引用、多为首次起诉者、地理分布不均,且未改善胜诉率。

Comments 15 pages, 7 figures

详情
AI中文摘要

自生成式AI工具广泛公开以来,联邦民事诉讼中自我代理(pro se)原告显著增加。本文利用约280万份诉状分析这一变化,探究后GenAI时期是否不仅与更多自我代理诉状相关,还与投诉文本、诉讼结果及自我代理诉讼人构成的可检测变化有关。使用2008-2025财年的民事起诉数据,我们发现联邦民事自我代理原告率从GenAI前的11.33%上升至GenAI后的16.94%,增加了5.61个百分点,且在趋势和协变量调整稳健性检验后仍然显著。然后,我们聚焦于民权和其他法定案件,其中增长尤为明显,并将案件元数据与自我代理投诉联系起来。利用文体学AI检测指标,我们开发了一个可解释的AI一致性起草度量。针对GenAI前基线校准的阈值,后GenAI非格式投诉中净AI标记比例为13.9%。对AI标记投诉的分析显示,它们引用更密集,与首次起诉者而非重复起诉者不成比例地相关,且地理分布不均。这种构成模式表明,AI一致性起草不仅仅是重复起诉者现象;它还包含女性原告(通过姓名推断)的适度、暗示性增加。我们没有发现胜诉率提高的证据;事实上,AI标记投诉更可能被驳回并在更早的程序阶段终止。这些发现提出了关于司法可及性和法院筛查负担的新问题,并加剧了法律形式性与法律效力之间的区别。

英文摘要

Since public access to generative AI tools became widespread, federal civil litigation has seen a marked increase in pro se (self-represented) plaintiffs. This paper analyzes that shift using ~2.8 million filings, asking whether the post-GenAI period is associated not only with more pro se filings, but also with detectable changes in complaint text, litigation outcomes, and the composition of pro se litigants. Using civil filing data from FY2008-2025, we find that the federal civil pro se plaintiff rate rose from 11.33% pre-GenAI to 16.94% post-GenAI, a 5.61 percentage-point increase that persists after trend and covariate-adjusted robustness checks. We then focus on Civil Rights and Other Statutory cases, where the increase is especially pronounced, and link case metadata to pro se complaints. Drawing on stylometric AI detection indicators, we develop an interpretable measure of AI-consistent drafting. Against a threshold calibrated to the pre-GenAI baseline, the net AI-flagged share is 13.9% of post-GenAI non-form complaints. Analysis of the AI-flagged complaints shows that they are more citation-dense, disproportionately associated with first-time rather than repeat filers, and geographically unevenly distributed. This composition pattern suggests that AI-consistent drafting is not merely a repeat-filer phenomenon; it also includes a modest, suggestive increase in name-inferred female plaintiffs. We find no evidence of improved win rates; in fact, AI-flagged complaints are more likely to be dismissed and to terminate at earlier procedural phases. These findings raise new questions about access to justice and court screening burdens, and sharpen the distinction between legal formality and legal efficacy.

2605.29491 2026-05-29 cs.AI 版本更新

The Curse of Helpfulness: Inverse Scaling Law in Robustness to Distractor Instructions via DistractionIF

帮助的诅咒:通过 DistractionIF 在干扰指令鲁棒性中的逆缩放定律

Zeli Su, Zhankai Xu, Tianlei Chen, Longfei Zheng, Xiaolu Zhang, Jun Zhou, Wentao Zhang

发表机构 * Minzu University of China, Beijing, China(中央民族大学,北京,中国) Renmin University of China, Beijing, China(中国人民大学,北京,中国) Peking University, Beijing, China(北京大学,北京,中国)

AI总结 提出 DistractionIF 基准,发现大语言模型在参考文本中干扰指令的鲁棒性存在逆缩放现象,并通过 GRPO 强化学习提升鲁棒性。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地部署在智能体和检索增强生成(RAG)系统中,在这些系统中,它们必须对外部提供的参考文本执行用户指定的任务。实际上,这种上下文通常是非结构化的,并且包含良性的但类似指令的语义噪声,例如编辑评论和系统痕迹,这些应严格视为数据。我们引入了 DistractionIF,这是一个旨在评估对参考文本中此类干扰指令鲁棒性的基准。在广泛模型范围内,我们观察到一致的逆缩放现象:较大的模型通常鲁棒性较差,随着规模增加,性能下降多达 30 个百分点。从机制上讲,我们的困惑度分析表明,缩放侵蚀了鲁棒和受干扰行为之间的概率边界,使模型越来越倾向于将噪声过度解释为指令。为了解决这个问题,我们证明了强化学习,特别是群体相对策略优化(GRPO),可以恢复这一边界,在不损害通用指令遵循能力的情况下,将鲁棒性提高多达 15.5%。我们的发现突显了参考接地任务中关键的指令遵循鲁棒性差距,并确立了强化学习作为在大规模下强制严格数据-指令分离的有前途的途径。

英文摘要

Large Language Models (LLMs) are increasingly deployed in agentic and retrieval-augmented generation (RAG) systems, where they must execute user-specified tasks over externally provided reference text. In practice, such context is often unstructured and contaminated with benign but instruction-like semantic noise, such as editorial comments and system traces, which should be treated strictly as data. We introduce DistractionIF, a benchmark designed to evaluate robustness against such distractor instructions in reference text. Across a broad range of models, we observe a consistent inverse scaling phenomenon: larger models are often less robust, with performance dropping by up to 30 points as scale increases. Mechanistically, our perplexity analysis reveals that scaling erodes the probabilistic boundary between robust and distracted behaviors, making models increasingly prone to over-interpreting noise as instructions. To address this, we demonstrate that reinforcement learning, specifically Group Relative Policy Optimization (GRPO), can restore this boundary, improving robustness by up to 15.5% without compromising general instruction-following capability. Our findings highlight a critical instruction-following robustness gap in reference-grounded tasks and establish reinforcement learning as a promising path for enforcing strict data-instruction separation at scale.

2605.29486 2026-05-29 cs.CL cs.AI cs.LG 版本更新

PhoneWorld: Scaling Phone-Use Agent Environments

PhoneWorld: 扩展手机使用代理环境

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

发表机构 * Tencent Hunyuan(腾讯混元) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出PhoneWorld,一个可复用的管道,将真实GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练回滚,从而规模化构建手机代理环境。

Comments work in progress

详情
AI中文摘要

手机使用代理的一个核心瓶颈是,覆盖真实移动行为的可控、可复现环境难以大规模构建。现有的移动代理基准在评估方面取得了重要进展,但它们本身并未提供一种可扩展的方式来构建许多新的手机使用环境。我们提出了PhoneWorld,一个可复用的管道,将真实的GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练回滚。PhoneWorld不是一次手动构建一个移动基准,而是利用真实轨迹来恢复哪些屏幕重要、屏幕如何连接、哪些交互必须改变环境状态、以及哪些用户目标可以自动验证。从这些信号中,它构建了由只读应用内容和可变状态支持的可运行模拟Android应用,然后从相同环境中派生出可执行任务、基于规则的验证器和训练回滚。在当前实例中,PhoneWorld覆盖了16个领域的34个应用,涵盖了常见的消费者移动行为,如搜索、浏览、购物、预订、媒体和社交互动。在固定的训练预算下,将来自辅助AndroidWorld语料库的10K步替换为广泛的PhoneWorld监督,同时提升了所有四个评估基准,使HYMobileBench提高了17.7分,AndroidControl提高了6.0分,AndroidWorld提高了14.7分,PhoneWorld提高了52.5分。然后我们研究了两个额外的扩展问题:增加PhoneWorld监督量显著提高了PhoneWorld性能,并且在固定的PhoneWorld预算下,扩大应用覆盖范围带来了更大的收益。总体而言,PhoneWorld将焦点从一次构建一个移动基准转向了规模化供应手机使用环境本身。

英文摘要

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.

2605.29478 2026-05-29 cs.NE cs.AI 版本更新

Evolutionary Rule Extraction from Corporate Default Prediction Models

企业违约预测模型中的进化规则提取

Desirè Fabbretti, Matteo Pasquino, Elia Pacioni, Caterina Lucarelli, Davide Calvaresi

发表机构 * Department of Management, Università Politecnica delle Marche(马尔凯理工大学管理系) HES-SO Valais-Wallis(瑞士西部应用科技大学瓦莱州分校) University of Extremadura(埃斯特雷马杜拉大学)

AI总结 本研究提出DEXiRE-EVO进化规则提取框架,结合多目标优化与CIU可解释性方法,从机器学习违约预测模型中提取经济意义明确的规则,兼顾预测性能与可解释性。

详情
AI中文摘要

中小企业(SMEs)在大多数经济体中占企业多数,常面临财务约束和更高的财务困境脆弱性。因此,预测中小企业违约对金融机构、政策制定者和研究人员至关重要。机器学习(ML)的最新进展提高了信用风险建模的预测性能。然而,复杂模型的有限可解释性引发了透明度和监管合规方面的担忧。本研究调查了中小企业的违约预测因子,并应用可解释人工智能(XAI)技术。使用2015-2024年间50,718家意大利中小企业的面板数据,我们比较了传统计量经济学方法与多种ML分类器。实证结果表明,ML模型在平衡准确率和PR-AUC方面显著优于传统逻辑回归基准。为解决可解释性挑战,我们引入了DEXiRE-EVO,一种新颖的进化规则提取框架,结合了多目标优化与上下文重要性和效用(CIU)可解释性方法。提取的规则揭示了与中小企业财务困境相关的经济意义模式,突出了内部流动性生成薄弱、内部资本侵蚀、高杠杆和运营效率低下的作用。此外,宏观经济背景条件和财务不稳定的持续性有助于识别高风险企业。总体而言,结果表明,将ML与进化规则提取相结合可以提高信用风险建模中的预测性能和可解释性,从而支持金融环境中更透明、数据驱动的决策。

英文摘要

Small and medium-sized enterprises (SMEs) represent the majority of firms in most economies and often face financial constraints and higher vulnerability to financial distress. Predicting SME default is therefore crucial for financial institutions, policymakers, and researchers. Recent advances in machine learning (ML) have improved predictive performance in credit risk modeling. Yet the limited interpretability of complex models raises concerns regarding transparency and regulatory compliance. This study investigates the predictors of SME default and applies explainable artificial intelligence (XAI) techniques to them. Using a panel of 50,718 Italian SMEs over the period 2015-2024, we compare traditional econometric approaches with several ML classifiers. The empirical results show that ML models significantly outperform the traditional logistic regression benchmark in terms of Balanced Accuracy and PR-AUC. To address the interpretability challenge, we introduce DEXiRE-EVO, a novel evolutionary rule extraction framework that combines multi-objective optimization with the Contextual Importance and Utility (CIU) explainability method. The extracted rules reveal economically meaningful patterns associated with SME financial distress, highlighting the roles of weak internal liquidity generation, internal capital erosion, high leverage, and operational inefficiency. Additionally, contextual macroeconomic conditions and the persistence of financial instability contribute to identifying high-risk firms. Overall, the results show that combining ML with evolutionary rule extraction can improve both predictive performance and interpretability in credit risk modeling, thus supporting more transparent, data-driven decision-making in financial environments.

2605.29473 2026-05-29 cs.HC cs.AI cs.CL cs.CY cs.SI 版本更新

Inform, Coach, Relate, Listen: Auditing LLM Caregiving Support Roles

告知、指导、共情、倾听:审计LLM护理支持角色

Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal, Jeongah Lee, Qiuyue Joy Zhong, Violeta J. Rodriguez, Daniel S. Brown, Dong Whi Yoo, Ravi Karkar, Koustuv Saha

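Affiliations: University of Illinois Urbana-Champaign; University of Massachusetts Amherst; OSF HealthCare; Indiana University Indianapolis

AI summary: This study operationalizes four social support roles (Inform, Coach, Relate, Listen) to evaluate the safety profiles of large language models in informal caregiving conversations, finding that the support role systematically shapes interactional risk and that perceived quality is in tension with safety.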

详情
AI中文摘要

语言模型越来越多地被部署用于非正式护理环境中的对话支持,在这些环境中,交互通常超出信息寻求范围:护理者在应对不确定、关系复杂的护理决策时,寻求情感安慰、指导和帮助。然而,大多数安全评估在通用提示下评估模型行为,留下一个关键问题未加审视:模型的安全概况是否会随其支持角色而变化?我们通过操作化四种基于社会支持理论的专家评审支持角色来研究这一点:告知、指导、共情和倾听,并将它们与两个基线控制条件(基本提示条件和检索增强生成条件)进行比较。我们在三个语言模型(GPT-4o-mini、Llama-3.1-8B-Instruct和MedGemma-1.5-4b-it)上,对来自在线阿尔茨海默病及相关痴呆症社区的5,000个真实世界查询进行了评估。我们发现,LLM的支持角色系统地影响了交互风险的普遍性和构成。此外,一项人类评估研究揭示了感知质量-安全性权衡:更具指导性、信息导向的角色被认为更有帮助和值得信赖,尽管它们表现出更高的交互风险概况。我们发布了约90,000个带有风险注释的支持角色条件模型响应,作为研究更安全的LLM中介对话支持的生态基础资源。

英文摘要

Language models are increasingly being deployed for conversational support in informal caregiving contexts, where interactions often extend beyond information-seeking: caregivers seek emotional reassurance, guidance, and help while navigating uncertain, relationally complex care decisions. Yet most safety evaluations assess model behavior under generic prompts, leaving a critical question unexamined: does a model's safety profile change with its support role? We study this by operationalizing four expert-reviewed support roles grounded in social support theory (Inform, Coach, Relate, and Listen) and comparing them against two baseline controls: a basic prompting condition and a retrieval-augmented generation (RAG) condition. We evaluate three language models (GPT-4o-mini, Llama-3.1-8B-Instruct, and MedGemma-1.5-4b-it) on 5,000 real-world queries from online Alzheimer's Disease and Related Dementias (ADRD) communities. We find that the LLM's support role systematically shapes both the prevalence and composition of interactional risks. Furthermore, a human evaluation study reveals a perceived quality-safety tension: more directive, information-oriented roles are rated as more helpful and trustworthy despite exhibiting elevated interactional risk profiles. We release ~90,000 support role-conditioned model responses with risk annotations as an ecologically grounded resource for research on safer LLM-mediated conversational support.

2605.29468 2026-05-29 cs.CR cs.AI 版本更新

SciIntBench: Measuring LLM Compliance with Research Integrity Norms Under Adversarial Framing

SciIntBench: 衡量大语言模型在对抗性框架下对科研诚信规范的遵从度

Almene De Meran Meguimtsop, Maria Leonor Pacheco, Daniel E. Acuna

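Affiliations: Department of Computer Science, University of Colorado Boulder

AI summary: Proposes SciIntBench, an adversarial benchmark of 810 prompts that evaluates framing-sensitive refusal and helpfulness of 16 LLMs across 10 RCR categories, finding that models reliably refuse overt misconduct but under-refuse covert violations, especially pressure-driven shortcuts.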

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用于支持科学工作,但尚不清楚它们是维护还是破坏负责任的研究行为(RCR)规范。我们引入了SciIntBench,这是一个对抗性基准,包含810个提示,涵盖10个RCR类别和三个科学领域。每个场景以显性对抗、隐性对抗和良性三种版本出现,使我们能够联合测量对不当行为的框架敏感拒绝以及对合法请求的帮助性。我们评估了来自六个提供商的16个商业和开源LLM(2024-2026年),产生了12,960个响应。我们发现,科研诚信对齐对框架高度敏感:模型拒绝显性不当行为远比拒绝隐性违规可靠得多,尤其是当不当行为被呈现为压力驱动的捷径时。拒绝率因RCR类别而异,在透明度、抄袭和捏造方面的边界较弱。

英文摘要

Large language models (LLMs) are increasingly used to support scientific work, but it is unclear whether they uphold responsible conduct of research (RCR) norms or help undermine them. We introduce SciIntBench, an adversarial benchmark of 810 prompts across ten RCR categories and three scientific domains. Each scenario appears in an Overt Adversarial, a Covert Adversarial, and a Benign version, allowing us to jointly measure framing-sensitive refusal of misconduct and helpfulness on legitimate requests. We evaluate 16 commercial and open-weight LLMs from six providers (2024-2026), producing 12,960 responses. We find that scientific integrity alignment is strongly framing-sensitive: models refuse explicit misconduct far more reliably than covert violations, and they fail especially when misconduct is presented as a pressure-driven shortcut. Refusals vary by RCR category, with weaker boundaries around transparency, plagiarism, and fabrication.

2605.29467 2026-05-29 cs.LG cs.AI 版本更新

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

非共轭因子图的闭式变分推断组合

Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

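Affiliations: Eindhoven University of Technology, the Netherlands; Lviv Polytechnic National University, Lviv, Ukraine; Lazy Dynamics, Utrecht, the Netherlands

AI summary: Identifies five factor-graph primitives, proves that any composition of them admits closed-form variational message passing, shows that stacked routing layers yield universal function approximation, and applies the framework to time-series forecasting.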

详情
AI中文摘要

将概率构建块堆叠成更深层次的架构通常会破坏闭式推断。我们证明闭式推断是可以保持的。我们识别了五种因子图原语:双线性因子、指数链接、Gamma先验、高斯似然和等式节点,并证明任何由它们组成的模型都允许闭式变分消息传递。这种构造之所以有效,是因为每个原语都保留了一小部分消息族:在平均场分解下,高斯变量上的消息保持高斯分布,精度变量上的消息保持Gamma分布,而唯一的非共轭接口——指数链接——通过高斯矩生成函数和Gamma族的充分统计量保持可处理性。我们展示了从静态集成到输入依赖门控再到分裂分支路由的递增深度组合,并表明堆叠路由层编码任意决策树,建立了具有闭式推断的通用函数逼近。应用于集成时间序列预测时,该框架产生了一个贝叶斯专家混合模型,其中门控函数是推断而非学习得到的,在五个基准数据集上提供了对专家选择的校准不确定性。

英文摘要

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.
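The role of the Gaussian moment-generating function can be illustrated in isolation. The sketch below is not the authors' implementation; it only checks the standard identity E[exp(x)] = exp(mu + sigma^2 / 2) for x ~ N(mu, sigma^2), which is the kind of closed-form expectation that keeps an exponential link tractable. The function names and the chosen mean and variance are hypothetical.

```python
import math
import random


def expected_exp_gaussian(mean: float, variance: float) -> float:
    """Closed-form E[exp(x)] for x ~ N(mean, variance): the Gaussian MGF at t = 1."""
    return math.exp(mean + 0.5 * variance)


def monte_carlo_expected_exp(mean: float, variance: float,
                             n_samples: int = 200_000, seed: int = 0) -> float:
    """Sampling estimate of the same expectation, used only as a sanity check."""
    rng = random.Random(seed)
    std = math.sqrt(variance)
    return sum(math.exp(rng.gauss(mean, std)) for _ in range(n_samples)) / n_samples


# Illustrative values only (assumed, not from the paper).
closed_form = expected_exp_gaussian(0.3, 0.5)
estimate = monte_carlo_expected_exp(0.3, 0.5)
```

The closed form needs no sampling, so a message passing through the link can be updated from the mean and variance alone.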

2605.29462 2026-05-29 cs.CV cs.AI 版本更新

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

大型视觉语言模型在CFMME上的基准测试:一个全面的中文金融多模态评估数据集

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo, Feng Chen, Chi Zhang

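Affiliations: Qwen DianJin Team, Alibaba Cloud Computing

AI summary: Proposes CFMME, a Chinese financial multimodal evaluation benchmark of 6,052 instances covering eight primary financial image modalities and four core multimodal tasks, for evaluating the perception, understanding, reasoning, and cognition capabilities of LVLMs across the full financial business workflow.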

详情
AI中文摘要

大型视觉语言模型(LVLMs)的出现显著扩展了模型的能力,超越了仅文本理解,实现了跨视觉和文本模态的统一推理,并支持更广泛的实际应用。为了全面评估LVLMs在中文环境下整个金融业务流程中的感知、理解、推理和认知能力,我们引入了CFMME,一个新颖的中文金融多模态评估基准。CFMME包含6052个实例,涵盖从基础学术知识到复杂实际应用,涉及八种主要金融图像模态和四项核心多模态任务。在CFMME上,我们对代表性LVLMs进行了全面评估。结果表明,最先进的模型在问答任务上达到了66.11%的总体准确率,在检测、识别和信息提取任务上平均得分为77.18,表明当前LVLMs仍有很大的改进空间。此外,我们对错误原因、跨模态能力和多方向设置进行了详细分析,为未来研究提供了有价值的见解。我们希望CFMME能推动LVLMs的进一步进展,特别是在金融领域多个多模态任务上的性能提升。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has substantially expanded model capabilities beyond text-only understanding, enabling unified inference across both visual and textual modalities and supporting a broader range of real-world applications. To comprehensively evaluate the perception, understanding, reasoning, and cognition capabilities of LVLMs throughout the entire financial business workflow in Chinese contexts, we introduce CFMME, a novel Chinese financial multimodal evaluation benchmark. CFMME comprises 6,052 instances ranging from fundamental academic knowledge to complex real-world applications, covering eight primary financial image modalities and four core multimodal tasks. On CFMME, we conduct a thorough evaluation of representative LVLMs. The results show that the state-of-the-art model attains an overall accuracy of 66.11% on the question answering task and an average score of 77.18 on the detection, recognition, and information extraction tasks, indicating substantial room for improvement in current LVLMs. In addition, we conduct detailed analyses of error causes, cross-modal capabilities, and multi-orientation settings, yielding valuable insights for future research. We hope that CFMME will spur further progress in LVLMs, especially by improving their performance on multiple multimodal tasks in the financial domain.

2605.29458 2026-05-29 cs.CL cs.AI 版本更新

Adaptive Interviewing for Persona Simulation in LLMs: Evidence-Grounded Reasoning Improves Decision Alignment

面向LLM人格模拟的自适应访谈:基于证据的推理提升决策对齐

Ruoxi Su, Yuhan Liu, Jingyu Hu

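Affiliations: University of Cambridge; Independent Researcher

AI summary: Proposes an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue and uses the transcripts to evaluate how well LLMs simulate individual decisions in moral dilemma scenarios, finding that predictions grounded in follow-up-derived evidence are more accurate (45.5% vs. 39.3%).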

Comments 20 pages, 2 figures, 12 tables

详情
AI中文摘要

准确模拟特定个体的决策对大型语言模型(LLM)仍然具有挑战性,部分原因在于人格信息通常以静态描述形式提供,缺乏个体层面决策模拟所需的价值观、经历和情境线索。我们提出一种自适应访谈框架,通过结构化的三阶段对话收集人格相关信息:核心问题、动态追问和综合人格总结。利用生成的访谈记录,我们评估LLM能否模拟参与者在道德困境场景中的决策。我们比较了三种对话情境——核心10个问题回答、完整访谈对话以及总结性人格表征。结果发现,自适应访谈并非作为统一的准确性增强器,而更像是一种选择性接地机制:约40%的完整访谈轨迹中融入了基于追问的证据,且这些基于追问的预测比仅基于核心问题的预测更准确(45.5% vs. 39.3%)。这些发现强调,仅靠更丰富的人格背景是不够的:只有当模型真正将其决策基于用户特定证据时,改进才会出现。

英文摘要

Accurately simulating the decisions of a specific individual remains challenging for large language models (LLMs), partly because persona information is often provided as static descriptions that miss the values, experiences, and contextual cues needed for individual-level decision simulation. We propose an adaptive interview framework that gathers persona-relevant information through a structured three-stage dialogue: core questions, dynamic follow-ups, and a synthesized personality summary. Using the resulting interview transcripts, we evaluate whether LLMs can simulate participants' decisions in moral dilemma scenarios. We compare three conversational contexts: Core-10 responses, the full interview dialogue, and a summarized persona representation. We find that adaptive interviewing functions less as a uniform accuracy booster and more as a selective grounding mechanism: follow-up-derived evidence is incorporated in around 40% of full-interview traces, and these follow-up-grounded predictions are more accurate than core-only grounded ones (45.5% vs. 39.3%). These findings highlight that richer persona context alone is insufficient: improvements arise only when models actually ground their decisions in user-specific evidence.

2605.29453 2026-05-29 cs.LG cs.AI 版本更新

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

遗忘更少,泛化更强:统一动态图的时间与结构适应

Qian Chang, Ciprian Doru Giurcaneanu, Runsong Jia, Xia Li, Guoping Hu, Xiufeng Cheng, Jinqing Yang, Mengjia Wu, Yi Zhang

Affiliations: University of Auckland, Auckland, New Zealand; University of Technology Sydney, Sydney, Australia; Central China Normal University, Wuhan, China

AI summary: Proposes the Dual-Scale Retentive Dynamics (DSRD) framework, which uses a unified temporal-structural adaptation mechanism and learnable decay kernels to improve generalization in dynamic graph representation learning.

详情
AI中文摘要

动态图上的表示学习需要捕获随时间与结构共同演化的复杂依赖关系。现有方法通常采用固定的时间衰减方案或预定义的结构传播深度,限制了其在具有不同交互频率和拓扑特征的图上的泛化能力。我们提出双尺度保持动态(DSRD),一个统一框架,维护一个同时编码时间记忆和结构上下文的保持性表示状态。DSRD引入两个关键组件:(i) 具有双尺度自适应的保持状态,在单一循环公式中联合建模时间动态和结构传播;(ii) 具有可学习时间敏感性参数的自适应衰减核,基于底层交互模式自动平衡短期响应和长期保持。我们提供理论分析,建立了事件级并行聚合与高效循环状态更新之间的等价性,以及所学动态的稳定性和有界性保证。在14个真实世界基准上的广泛实验表明,DSRD在链接预测和节点分类任务上均持续达到最先进性能,并在直推和归纳设置中展现出强泛化能力。

英文摘要

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.
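The claimed equivalence between event-wise parallel aggregation and a recurrent state update can be seen in a much simpler setting. The sketch below is an assumption-laden illustration, not DSRD itself: it uses a single scalar state and a fixed exponential decay rate (the paper's kernels are learnable and also carry structural context), and both function names are hypothetical.

```python
import math


def parallel_aggregate(times, values, decay_rate):
    """Sum every past event, each weighted by exp(-decay_rate * age) at the last event time."""
    t_now = times[-1]
    return sum(v * math.exp(-decay_rate * (t_now - t)) for t, v in zip(times, values))


def recurrent_aggregate(times, values, decay_rate):
    """Same quantity computed with one decayed state update per event."""
    state = 0.0
    prev_t = times[0]
    for t, v in zip(times, values):
        # Decay the running state over the elapsed gap, then add the new event.
        state = state * math.exp(-decay_rate * (t - prev_t)) + v
        prev_t = t
    return state


# Hypothetical event stream: timestamps must be non-decreasing.
event_times = [0.0, 1.5, 2.0, 4.2]
event_values = [1.0, -0.5, 2.0, 0.3]
parallel_result = parallel_aggregate(event_times, event_values, decay_rate=0.7)
recurrent_result = recurrent_aggregate(event_times, event_values, decay_rate=0.7)
```

The two results agree because exponential decay factorizes over consecutive time gaps, which is what lets a recurrent form replace a full pass over the event history.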

2605.29448 2026-05-29 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

数据集值多少钱?缩放定律、Vendi分数与矩阵谱函数

Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

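Affiliations: Department of Electrical & Computer Engineering; Paul G. Allen School of Computer Science & Engineering; University of Washington

AI summary: This paper unifies neural scaling laws and the Vendi Score through submodularity, proposes matrix spectral functions as a general data-appraisal framework, and develops fast secular-equation-based updates that yield an average speedup of about 35,000x, making optimization feasible at ImageNet-1K scale; experiments show that facility location best predicts subset value.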

Comments 75 pages

详情
AI中文摘要

神经缩放定律通过数据集大小评估数据,而Vendi分数使用量子熵衡量数据集价值。我们证明常见的神经缩放定律目标和Vendi分数都是子模的。进一步,我们表明Vendi分数是一类更广泛的子模目标(称为矩阵谱函数)的特例,这还包括行列式点过程(DPP)目标以及许多其他目标。我们还引入了弱矩阵单调函数,并展示了它们如何导致弱子模矩阵谱函数,从而产生一系列实用的数据评估目标。我们开发了基于割线方程的更新方法,避免了贪心优化过程中的重复特征分解,将$m$维嵌入的边际增益评估相对于预言机查询减少了$O(m)$因子。这实现了平均约35,000倍的实证加速,使得在ImageNet-1K规模的数据集上直接优化Vendi分数成为可能。由此,我们比较了多个目标在固定大小、类别平衡和固定训练预算条件下预测训练子集对保留测试性能价值的能力,包括Vendi分数、DPP、设施选址以及三种新的矩阵谱变体。在多个数据集上,设施选址表现最佳。直接优化还揭示,虽然Vendi分数在中等分数范围内具有预测性,但将目标推向更高值可能使其成为下游性能的糟糕代理。我们还发现,均匀随机选择的固定大小子集(无论是否类别平衡)在评估分数和保留性能上都表现出显著的集中性。最后,我们表明大小、类别平衡和训练预算单独并不决定数据价值:即使控制这些因素,性能范围也从好到差平滑变化。

英文摘要

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal point process (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor proxy for downstream performance. We also find that fixed-size subsets chosen uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
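Facility location, the objective reported as the best predictor, is easy to state concretely. The sketch below uses the textbook definition f(S) = sum over items i of the maximum similarity between i and any selected item, with plain greedy selection. It is not the paper's code: the similarity matrix is made up, the function names are hypothetical, and non-negative similarities are assumed so that f of the empty set is 0.

```python
def facility_location(selected, similarity):
    """f(S): for every item, take its best similarity to a selected item, then sum."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)


def greedy_select(similarity, budget):
    """Pick `budget` items, each time adding the one with the largest marginal gain."""
    chosen = []
    remaining = set(range(len(similarity)))
    for _ in range(budget):
        base = facility_location(chosen, similarity)
        best = max(sorted(remaining),
                   key=lambda j: facility_location(chosen + [j], similarity) - base)
        chosen.append(best)
        remaining.remove(best)
    return chosen


# Made-up similarities: items 0 and 1 form one cluster, items 2 and 3 another.
similarity = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.7],
    [0.0, 0.1, 0.7, 1.0],
]
picked = greedy_select(similarity, budget=2)

# Diminishing returns: item 0 is worth less once its near-duplicate (item 1) is in the set.
gain_alone = facility_location([0], similarity) - facility_location([], similarity)
gain_after = facility_location([1, 0], similarity) - facility_location([1], similarity)
```

The shrinking gain for a near-duplicate item is the submodularity property the abstract refers to, and it is why greedy selection spreads picks across clusters.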

2605.29446 2026-05-29 cs.AI 版本更新

CrystalXRD-Bench: Benchmarking Vision-Language Models for XRD Peak Indexing Across Diverse Crystalline Materials

CrystalXRD-Bench:面向多种晶体材料的XRD峰索引的视觉-语言模型基准测试

Chengliang Xu, Xiaogang Li, Peiyao Xiao, Beng Wang, Hu Wei, Bing Zhao

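Affiliations: Alibaba Group

AI summary: Proposes CrystalXRD-Bench, a 250-sample benchmark that evaluates how well vision-language models identify Miller indices (HKL) from powder XRD patterns; the best model reaches a Jaccard score of only 0.5888, so the task is far from solved.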

Comments 18 pages, 10 figures

详情
AI中文摘要

从粉末XRD图谱中识别米勒指数需要现有多模态基准未测试的能力:模型必须从渲染的科学曲线中读取窄峰位置,然后将该观察与多步晶体学推理联系起来。我们引入CrystalXRD-Bench,一个基于10个公共晶体学数据库构建的250样本基准,用于单一任务:恢复对XRD图谱中最高强度峰有贡献的完整HKL集合。每个样本将渲染的XRD图像与源CIF文本和化学式配对,因此视觉提取错误和推理错误可以并排检查。我们评估了七个视觉-语言模型。最佳Jaccard得分为0.5888(GPT-5.4),精确匹配率为37.6%,但七个模型中有六个仍低于Jaccard 0.50;该任务远未解决。错误模式系统性地变化:双峰情况尤其脆弱,注重召回率的模型通过过度预测HKL来增加覆盖率,而访问CIF文本并不能缩小晶体学计算方面的差距。除了模型排名外,该基准还确定了当前VLM在定量科学图形上失败的条件。所有数据和评估代码将公开提供。

英文摘要

Miller-index identification from powder XRD patterns requires capabilities untested by existing multimodal benchmarks: the model must read a narrow peak location from a rendered scientific curve and then connect that observation to multi-step crystallographic reasoning. We introduce CrystalXRD-Bench, a 250-sample benchmark built from 10 public crystallographic databases for a single task: recover the full set of HKLs contributing to the highest-intensity peak in an XRD pattern. Each sample pairs the rendered XRD image with the source CIF text and chemical formula, so visual extraction errors and reasoning errors can be examined side by side. We evaluate seven vision-language models. The best Jaccard score is 0.5888 (GPT-5.4) with an exact-match rate of 37.6%, yet six of seven models remain below Jaccard 0.50; the task is far from solved. Error patterns vary systematically: double-peak cases are especially brittle, recall-heavy models gain coverage by over-predicting HKLs, and access to CIF text does not close the gap in crystallographic calculation. Alongside model rankings, the benchmark identifies the conditions under which current VLMs fail on quantitative scientific figures. All data and evaluation code will be publicly available.
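The two reported metrics, Jaccard score and exact match, can be written down for sets of HKL triples. This is a minimal sketch under stated assumptions: it treats each (h, k, l) triple as a distinct label and does not merge symmetry-equivalent reflections, since the benchmark's exact scoring convention is not given in the abstract. The function names and the example reflections are hypothetical.

```python
def jaccard(predicted, reference):
    """Intersection over union of two HKL sets; two empty sets count as a perfect match."""
    pred, ref = set(predicted), set(reference)
    if not pred and not ref:
        return 1.0
    return len(pred & ref) / len(pred | ref)


def exact_match(predicted, reference):
    """True only when the predicted set equals the reference set."""
    return set(predicted) == set(reference)


# Hypothetical case: the strongest peak is indexed by a single reflection.
reference_hkls = [(1, 1, 1)]
over_prediction = [(1, 1, 1), (2, 0, 0)]

score = jaccard(over_prediction, reference_hkls)
is_exact = exact_match(over_prediction, reference_hkls)
```

Here the prediction has full recall but an extra reflection, so Jaccard drops to one half and exact match fails. This is the penalty that recall-heavy, over-predicting models incur.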

2605.29442 2026-05-29 cs.SE cs.AI cs.HC 版本更新

How Coding Agents Fail Their Users: A Large-Scale Analysis of Developer-Agent Misalignment in 20,574 Real-World Sessions

编码助手如何辜负用户:基于20,574个真实会话的开发人员与智能体不一致的大规模分析

Ningzhi Tang, Chaoran Chen, Gelei Xu, Yiyu Shi, Yu Huang, Collin McMillan, Tao Dong, Toby Jia-Jun Li

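Affiliations: University of Notre Dame; Vanderbilt University; Google

AI summary: Analyzes 20,574 coding-agent sessions, identifies seven recurring forms of misalignment, and finds that most episodes impose effort and trust costs rather than irreversible system damage, yet most visible resolutions still require explicit user correction.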

详情
AI中文摘要

AI编码助手越来越多地直接在软件环境中行动,然而现有对其失败的分析依赖于基准轨迹,忽略了开发人员实际体验的不一致。我们提出了一项观察性研究,涵盖来自IDE和CLI工作流的1,639个代码仓库的20,574个编码助手会话。我们将不一致操作化为通过开发人员抵制而显现的故障,并沿四个轴标注每个事件:形式、原因、成本和解决方式。我们识别出七种反复出现的形式,涵盖助手如何阅读项目、解释开发人员意图、遵循规则、约束行动、实现和执行代码以及报告进度。90.50%的事件施加了努力和信任成本而非不可逆的系统损坏,但91.49%的可见解决方式仍需用户显式纠正。不一致模式在IDE和CLI设置中也有所不同,在相邻会话中持续存在,并随时间变化:尽管总体发生率下降,但约束违反和不准确自我报告的比例上升。我们的发现为训练、评估和界面设计提供了信息,以保持编码助手与真实开发工作流一致。

英文摘要

AI coding agents increasingly act directly within software environments, yet existing analyses of their failures rely on benchmark trajectories that miss how developers actually experience misalignment. We present an observational study of 20,574 coding-agent sessions from 1,639 repositories across IDE and CLI workflows. We operationalize misalignment as a breakdown made visible through developer pushback, and annotate each episode along four axes: form, cause, cost, and resolution. We identify seven recurring forms, spanning how agents read projects, interpret developer intent, follow rules, bound their actions, implement and execute code, and report progress. 90.50% of episodes impose effort and trust costs rather than irreversible system damage, yet 91.49% of visible resolutions still require explicit user correction. Misalignment patterns also differ across IDE and CLI settings, persist across adjacent sessions, and shift over time: while overall rates decline, constraint violations and inaccurate self-reporting grow in share. Our findings inform the design of training, evaluation, and interfaces for keeping coding agents aligned with real developer workflows.

2605.29440 2026-05-29 cs.CL cs.AI cs.IR 版本更新

SkillBrew: Multi-Objective Curation of Skill Banks for LLM Agents

SkillBrew: LLM智能体技能库的多目标策展

Wentao Hu, Zhendong Chu, Yiming Zhang, Junda Wu, Ming Jin, Xiangyu Zhao, Yilei Shao, Yanfeng Wang, Qingsong Wen

Affiliations: City University of Hong Kong; Squirrel Ai Learning; University of Science and Technology of China; University of California, San Diego; Griffith University; East China Normal University; Shanghai Jiao Tong University

AI summary: Proposes SkillBrew, a framework that models skill bank curation as Pareto-aware optimization under a utility constraint and solves it with a bi-level propose-then-verify loop, yielding leaner and more diverse skill banks.

Comments 16 pages. Preprint. Under review

详情
AI中文摘要

检索增强的LLM智能体越来越依赖于精心策划的技能库:指导复杂任务决策的可重用文本原则集合。现有方法通常以仅追加的方式扩展这些库,不断添加新技能而不移除冗余、过时或有害的技能,导致存储库效率低下且策展不良。在本文中,我们将技能库策展形式化为一个受约束的多目标问题:一个理想的库必须对智能体有用、内容多样,并且对查询分布有良好的覆盖。为此,我们引入了SkillBrew,一个多目标策展框架,将技能库策展形式化为在效用约束下的帕累托感知优化,并通过双层提议-验证循环求解。我们在两个公共基准上评估了我们的方法。我们的发现表明,将技能库视为原则性策展的对象,而不是不断增长的仅追加日志,是构建自我改进的LLM智能体的重要一步。

英文摘要

Retrieval-augmented LLM agents increasingly rely on curated skill banks: collections of reusable textual principles that guide decision making on complex tasks. Existing approaches typically expand these banks in an append-only fashion, continuously adding new skills without removing redundant, outdated, or harmful ones, resulting in inefficient and poorly curated repositories. In this paper, we formulate skill bank curation as a constrained multi-objective problem: a desirable bank must be useful to the agent, be diverse in content, and cover the query distribution well. To this end, we introduce SkillBrew, a multi-objective curation framework that formalizes skill bank curation as Pareto-aware optimization under a utility constraint and solves it via a bi-level propose-then-verify loop. We evaluate our approach on two public benchmarks. Our findings suggest that treating skill banks as objects of principled curation, rather than ever-growing append-only logs, is an important step toward building self-improving LLM agents.

2605.29434 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing

AliMark: 增强句子级水印对文本释义的鲁棒性

Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen, Yufei He, Tri Cao, Bryan Hooi, Jiaheng Zhang

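Affiliations: National University of Singapore

AI summary: Proposes AliMark, a framework that recasts sentence-level watermarking as a bit-sequence encoding and alignment problem and uses a multi-candidate alignment detection strategy to improve robustness to structural perturbations such as sentence splitting and merging.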

Comments Accepted by ICML 2026

详情
AI中文摘要

现有的句子级水印方法通过将水印锚定在句子语义中来增强对释义的鲁棒性。然而,它们基于前缀的设计仍然容易受到结构扰动的影响,例如句子拆分和合并,这些扰动在强释义器(如DIPPER和GPT-3.5)下经常出现。为了缓解这个问题,我们提出了AliMark,一个将句子级水印重构为潜在水印文本与秘密比特序列之间的比特序列编码和对齐问题的框架。值得注意的是,我们的方法采用了两阶段检测策略:我们生成多个重构的文本变体,并自适应地将它们提取的比特序列与秘密比特序列对齐,以最小化对齐成本。这种多候选对齐设计自然地提高了对句子合并和拆分的鲁棒性。大量实验表明,在多种释义攻击下,AliMark显著优于最先进的基线方法。

英文摘要

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
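The multi-candidate alignment idea can be sketched with a generic alignment cost. The abstract does not specify AliMark's cost function, so the example below substitutes plain Levenshtein edit distance as a stand-in; the bit strings, the candidate list, and the function names are all hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two bit strings (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]


def best_alignment_cost(candidate_bits, secret_bits):
    """Score a text by its best-aligning restructured variant."""
    return min(edit_distance(bits, secret_bits) for bits in candidate_bits)


secret = "101101"
# Hypothetical bit sequences extracted from restructured variants of one text:
# the first lost a bit (as if two sentences were merged), the second is unrelated.
candidates = ["10101", "000000"]
cost = best_alignment_cost(candidates, secret)
```

An alignment that tolerates insertions and deletions absorbs a dropped or added bit at unit cost, whereas position-by-position comparison would misalign every bit after the change.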

2605.29430 2026-05-29 cs.AI cs.CL 版本更新

Towards Human-Like Interactive Speech Recognition With Agentic Correction and Semantic Evaluation

迈向具有智能体纠正和语义评估的类人交互式语音识别

Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen, Xinjian Zhao, Xipeng Qiu, Wupeng Wang, Zhifu Gao, Xiangang Li, Kai Yu, Xie Chen

Affiliations: College of Artificial Intelligence, Xi’an Jiaotong University; X-LANCE Lab, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; The Chinese University of Hong Kong, Shenzhen; Fudan University; Tongyi Fun Team, Alibaba Group

AI summary: Proposes Agentic ASR, a closed-loop framework that reduces semantic errors through multi-turn interaction and semantic correction, and introduces the Sentence-level Semantic Error Rate ($S^2ER$) as an evaluation metric.

详情
AI中文摘要

自动语音识别(ASR)是人机交互的核心组成部分,也是基于LLM的助手和智能体日益重要的前端。然而,当前大多数ASR系统仍遵循单遍范式,这与人类通信方式不一致——在人类通信中,误解通过迭代澄清和修正来解决。这种不匹配使得一旦发生意义关键的错误,很难纠正。同时,词错误率(WER)或字符错误率(CER)等词级指标无法充分反映此类问题。为解决这些局限,我们将交互式ASR形式化为多轮修正任务,并提出Agentic ASR,一种结合单遍ASR前端与语义纠正、意图路由和基于推理编辑的闭环框架。我们进一步引入句子级语义错误率(S^2ER),一种基于LLM的语义评估指标,以及交互式仿真系统,用于可扩展和可复现的基准测试。在多语言、命名实体密集和代码切换基准上的实验表明,迭代交互持续减少语义错误,在S^2ER上的提升远大于传统词级指标。人机对齐和消融研究进一步验证了语义判断器的可靠性和所提框架的鲁棒性。代码见:https://interactiveasr.github.io/,在线演示见:https://i-asr.sjtuxlance.com/

英文摘要

Automatic speech recognition (ASR) is a core component of human-computer interaction and an increasingly important front-end for LLM-based assistants and agents. However, most current ASR systems still follow a single-pass paradigm, which is poorly aligned with human communication, where misunderstandings are resolved through iterative clarification and refinement. This mismatch makes it difficult to correct meaning-critical errors once they occur. Meanwhile, token-level metrics such as WER or CER cannot adequately reflect such errors. To address these limitations, we formulate Interactive ASR as a multi-turn refinement task and propose Agentic ASR, a closed-loop framework that combines a single-pass ASR front-end with semantic correction, intent routing, and reasoning-based editing. We further introduce the Sentence-level Semantic Error Rate ($S^2ER$), an LLM-based semantic evaluation metric, together with an Interactive Simulation System for scalable and reproducible benchmarking. Experiments on multilingual, named-entity-intensive, and code-switching benchmarks show that iterative interaction consistently reduces semantic errors, with much larger gains in $S^2ER$ than in conventional token-level metrics. Human-AI alignment and ablation studies further validate the reliability of the semantic judge and the robustness of the proposed framework. The code is available at https://interactiveasr.github.io/ and a live demo is available at https://i-asr.sjtuxlance.com/

2605.29428 2026-05-29 astro-ph.EP astro-ph.IM cs.AI 版本更新

DELOS: Detecting Shallow Transits in Kepler Photometry Using a Contrastive-Learning Framework

DELOS: 使用对比学习框架检测开普勒测光中的浅凌星

Qingtian Liu, Jian Ge, XingChen Yan, Kevin Willis, Xinyu Yao, QuanQuan Hu, Jiapeng Zhu

Affiliations: Shanghai Astronomical Observatory, Chinese Academy of Sciences, Shanghai 200030, China; School of Astronomy and Space Sciences, University of Chinese Academy of Sciences, Beijing 101408, China; Science Talent Training Center, Gainesville, FL 32606, USA

AI summary: Proposes DELOS, a contrastive-learning framework that detects shallow, low-SNR transits using GPU-accelerated phase folding and a convolutional encoder, outperforming BLS and TLS.

Comments 25 pages, 19 figures, 1 table, submitted to ApJ

详情
AI中文摘要

我们提出了基于相位折叠光变曲线的对比评分检测方法(DELOS),这是一个基于对比学习的框架,旨在搜索开普勒测光中的浅凌星。DELOS结合了GPU加速的相位折叠、优化的相位分箱和自定义的一维卷积编码器,为每条折叠光变曲线分配凌星似然分数,从而在无需预先检测阈值穿越事件的情况下,在试验周期上生成分数周期图。针对轨道周期为100-150天的中长周期信号,DELOS在2000万条使用真实凌星模型和开普勒类似噪声特性生成的合成光变曲线上进行训练,在合成验证集上达到了99.3%的验证准确率。在受控注入-恢复实验中,在低信噪比区域,DELOS相对于箱形拟合最小二乘法(BLS)和凌星最小二乘法(TLS)分别将综合精确率-召回率性能提高了15.5%和11.25%。与BLS和TLS相比,它还将搜索速度分别提高了约3-5倍和74-80倍。应用于选定的开普勒验证样本时,DELOS在测试周期范围内恢复了所有已知的浅中长周期凌星信号。这些结果表明,DELOS为低信噪比凌星搜索提供了一个高效且灵敏的框架,并代表了向未来在开普勒、K2、TESS、PLATO和地球2.0数据中搜索更长周期类地行星迈出的实际一步。因此,这项工作旨在作为方法论开发和验证研究,对新识别候选体的详细天体物理验证留待未来工作。

英文摘要

We present DEtection in phase-folded Light curves with cOntrastive Scoring (DELOS), a contrastive-learning-based framework designed to search for shallow transits in Kepler photometry. DELOS combines GPU-accelerated phase folding, optimized phase binning, and a custom one-dimensional convolutional encoder to assign a transit-likeness score to each folded light curve, thereby producing a score periodogram over trial periods without relying on pre-detected threshold-crossing events. Focusing on intermediate-to-long-period signals with orbital periods of 100-150 days, DELOS was trained on 20 million synthetic light curves generated with realistic transit models and Kepler-like noise properties, achieving a validation accuracy of 99.3 percent on the synthetic validation set. In controlled injection-recovery experiments, DELOS improves the combined precision-recall performance by 15.5 percent relative to Box-fitting Least Squares (BLS) and 11.25 percent relative to Transit Least Squares (TLS) in the low signal-to-noise ratio (low-SNR) regime. It also accelerates the search by factors of approximately 3-5 and 74-80 compared with BLS and TLS, respectively. Applied to a selected Kepler validation sample, DELOS recovered all known shallow intermediate-to-long-period transit signals in the tested period range. These results demonstrate that DELOS provides an efficient and sensitive framework for low-SNR transit searches and represents a practical step toward future searches for longer-period terrestrial planets in Kepler, K2, TESS, PLATO, and Earth 2.0 data. Accordingly, this work is intended as a methodological development and validation study, with the detailed astrophysical validation of newly identified candidates deferred to future work.
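Phase folding and phase binning, the two preprocessing steps named in the abstract, can be sketched on a synthetic light curve. This is a plain-Python illustration under assumed values (a noiseless box-shaped dip, 0.5-day sampling, 50 bins); it does not reproduce DELOS's GPU implementation or its encoder, and the function names are hypothetical.

```python
def phase_fold(times, period, epoch=0.0):
    """Map each timestamp to an orbital phase in [0, 1) for a trial period."""
    return [((t - epoch) / period) % 1.0 for t in times]


def bin_folded(phases, fluxes, n_bins):
    """Average the flux in equal-width phase bins; empty bins become NaN."""
    sums = [0.0] * n_bins
    counts = [0] * n_bins
    for p, f in zip(phases, fluxes):
        idx = min(int(p * n_bins), n_bins - 1)
        sums[idx] += f
        counts[idx] += 1
    return [s / c if c else float("nan") for s, c in zip(sums, counts)]


# Assumed synthetic data: 1440 days sampled every 0.5 day, with a dip of depth 1e-3
# centred on phase 0.25 and repeating every 120 days.
TRUE_PERIOD = 120.0
times = [0.5 * i for i in range(2880)]
fluxes = [1.0 - 1e-3 if abs((t / TRUE_PERIOD) % 1.0 - 0.25) < 0.01 else 1.0
          for t in times]

folded_true = bin_folded(phase_fold(times, TRUE_PERIOD), fluxes, n_bins=50)
folded_wrong = bin_folded(phase_fold(times, 97.0), fluxes, n_bins=50)
```

Folding at the true period stacks every transit into the same phase bin, while a wrong trial period smears the dips across bins and dilutes the minimum. A score computed per trial period therefore peaks near the true period.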

2605.29425 2026-05-29 cs.AI 版本更新

ReasonLight: A Multimodal Foundation Model-Enhanced Reinforcement Learning Framework for Zero-Shot Traffic Signal Control

ReasonLight: 一种多模态基础模型增强的强化学习框架用于零样本交通信号控制

Aoyu Pang, Maonan Wang, Yuejiao Xie, Chung Shue Chen, Zhiwei Yang, Man-On Pun

Affiliations: School of Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China; Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong; Shanghai AI Laboratory, Shanghai, China; Nokia Bell Labs, Paris-Saclay, France

AI summary: Proposes ReasonLight, a framework that enhances reinforcement learning with a multimodal foundation model and uses roadside sensor and camera data to adapt zero-shot to rare traffic events, reducing emergency vehicle waiting time by up to 88.7%.

详情
AI中文摘要

强化学习在交通信号控制中展现出潜力,但其对预定义状态的依赖限制了其对训练数据中未出现的可观测开放世界事件的响应能力。物联网赋能的路口通过路侧传感器和摄像头提供异构观测,为提升强化学习对此类事件的适应性创造了机会。为此,我们提出ReasonLight,一种多模态基础模型增强的强化学习框架,用于零样本交通信号控制。ReasonLight整合三类信息:结构化交通测量、多视角摄像头观测以及预训练强化学习控制器生成的候选相位决策。给定强化学习提议的相位,ReasonLight从多视角图像中提取视觉语义,并将其与紧凑的传感器导出的场景描述对齐。这种对齐使得语义引导的细化模块能够根据交通规则和事件语义保留或调整提议的动作。为确保操作可靠性,细化后的动作受可用相位集合约束。任何无效决策被拒绝,系统回退至原始强化学习动作。我们在强化学习训练期间未见的两类罕见事件上评估ReasonLight:紧急车辆优先和临时交通管制。实验结果表明,ReasonLight无需重新训练即可实现零样本适应。与仅使用强化学习的主干相比,它将紧急车辆等待时间最多降低88.7%,同时保持相当的常规交通性能。

英文摘要

Reinforcement learning (RL) has shown promise in traffic signal control (TSC). However, its reliance on predefined states limits responsiveness to observable open-world events that are absent from training data. IoT-enabled intersections provide heterogeneous observations from roadside sensors and cameras, creating opportunities to improve RL adaptability to such events. To this end, we propose ReasonLight, a multimodal foundation model-enhanced RL framework for zero-shot TSC. ReasonLight integrates three sources of information: structured traffic measurements, multi-view camera observations, and candidate phase decisions from a pre-trained RL controller. Given an RL-proposed phase, ReasonLight extracts visual semantics from multi-view images and aligns them with compact sensor-derived scene descriptions. This alignment enables a semantic-guided refinement module to either preserve or adjust the proposed action according to traffic rules and event semantics. To ensure operational reliability, refined actions are constrained by the set of available phases. Any invalid decision is rejected, and the system falls back to the original RL action. We evaluate ReasonLight on two types of rare events not seen during RL training: emergency vehicle priority and temporary traffic regulation. Experimental results show that ReasonLight achieves zero-shot adaptation without retraining. It reduces emergency vehicle waiting time by up to 88.7% compared with the RL-only backbone while preserving comparable routine traffic performance.
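
The fallback rule described above (a refined action is accepted only if it lies in the set of available phases, otherwise the original RL action is kept) can be sketched as follows. The function and argument names are hypothetical, and the refinement module itself is represented only by its output.

```python
from typing import Optional, Sequence

def select_phase(
    rl_phase: int,
    refined_phase: Optional[int],
    available_phases: Sequence[int],
) -> int:
    """Accept the refined phase only when it is a valid phase; else keep the RL action."""
    if refined_phase is not None and refined_phase in available_phases:
        return refined_phase
    return rl_phase

phases = [0, 1, 2, 3]                      # hypothetical phase ids at one intersection
accepted = select_phase(2, 0, phases)      # valid refinement is used
rejected = select_phase(2, 7, phases)      # invalid refinement falls back to RL
missing = select_phase(2, None, phases)    # no refinement falls back to RL
```

Constraining the output this way means the foundation model can change which valid phase is chosen but can never emit a phase the controller does not support.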

2605.29420 2026-05-29 cs.AI cs.LG 版本更新

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

角色提示何时真正有效?LLM中专家角色注入的检索与度量分析

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu, Xinjie He, Zhiyuan Lin, Qiyang Xie

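Affiliations * Independent Researchers

AI summary Compares four prompting conditions on 1,140 open-ended questions and finds that role prompting systematically increases expertise depth while reducing clarity, that its effect depends strongly on question type and domain, and that hybrid retrieval outperforms embedding-only retrieval.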

Comments 6 pages, 2 figures. Submitted for peer review

详情
AI中文摘要

角色提示被广泛用于引导大型语言模型,但其实际价值仍不明确。先前的工作通常使用聚合分数评估角色提示,难以确定专家角色提示是否一致地提高响应质量,或者是否沿着不同的质量维度改变响应。我们通过对比四种提示条件在涵盖38个专家角色和六个领域的1140个开放式问题上的表现来研究这个问题:无角色提示、通用领域专家提示、基于嵌入的角色检索,以及结合嵌入搜索和基于LLM的角色选择的混合检索方法。聚合结果显示各条件之间总体差异很小。然而,度量级分析揭示了一个聚合平均值掩盖的一致权衡:角色提示系统性地增加了专家深度,同时降低了清晰度。这些效果高度有条件而非普遍。角色提示在咨询类问题以及医学和心理学等领域表现最佳,在这些领域中,结构化的专家框架和风险沟通具有内在价值。相比之下,基线提示在金融、法律、科学和技术领域的概念性和解释性问题中表现更好,在这些领域中,简洁的平实语言解释更为重要。我们进一步表明,混合检索显著优于纯嵌入角色选择,尽管更好的角色检索并不能消除更广泛的专家深度与清晰度之间的权衡。总体而言,我们的发现表明,角色提示主要重塑响应特征而非广泛提升能力,并且多度量评估对于理解其效果是必要的。

英文摘要

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.

2605.29414 2026-05-29 cs.CL cs.AI 版本更新

Beyond Bilingual Transfer: Multilingual Code-Switching in Instruction Tuning

超越双语迁移:指令微调中的多语言代码切换

Shunta Asano, Jeonghun Baek, Toshihiko Yamasaki

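Affiliations * The University of Tokyo

AI summary Using sentence-level multilingual code-switching instruction tuning across four languages, this study shows that multilingual code-switching improves the multilingual understanding of large language models beyond the conventional bilingual-transfer setting.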

详情
AI中文摘要

近期研究表明,代码切换数据(CSD)——即在同一上下文中混合多种语言——可以改善大语言模型(LLMs)的跨语言迁移和多语言对齐。然而,现有研究主要关注英语与目标语言之间的双语迁移,涉及三种或更多语言的多语言设置在很大程度上尚未被探索。在本工作中,我们研究了跨四种语言(英语、日语、韩语和中文)的多语言代码切换指令微调。我们在Belebele上评估多语言理解能力。我们的实验表明,简单的句子级多语言CSD持续提高了所有四种语言的平均多语言性能,表明多语言代码切换在双语迁移设置之外也能有效。

英文摘要

Recent studies have shown that code-switching data (CSD), in which multiple languages are mixed within the same context, can improve cross-lingual transfer and multilingual alignment in large language models (LLMs). However, existing studies primarily focus on bilingual transfer between English and a target language, leaving multilingual settings involving three or more languages largely unexplored. In this work, we investigate multilingual code-switching instruction tuning across four languages: English, Japanese, Korean, and Chinese. We evaluate multilingual understanding on Belebele. Our experiments show that simple sentence-level multilingual CSD consistently improves average multilingual performance across all four languages, indicating that multilingual code-switching can be effective beyond bilingual transfer settings.

2605.29411 2026-05-29 cs.LG cs.AI stat.ME stat.ML 版本更新

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

马尔可夫边界在表格预测中的好、坏与丑

Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

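Affiliations * Arizona State University

AI summary Studies the practical utility of the Markov boundary for tabular prediction and finds that the theoretically optimal boundary improves prediction only under certain conditions, while causal discovery methods struggle to realize that potential.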

Comments 11 pages, 9 figures, 2 tables. Preprint

详情
AI中文摘要

在标准图形假设下,目标变量的马尔可夫边界是使所有其他特征冗余的最小特征集。一旦观察到边界,目标变量与表格的其余部分条件独立。这对于表格预测来说是一个诱人的对象,因为它恰好指出了模型所需的列。然而,现代回归器仍然在完整特征集上训练。我们询问马尔可夫边界是否在SCM3K(一个包含3450个任务的合成SCM基准,特征数量从40到1000,涵盖六个SCM家族)上对预测真正有用,并使用六个回归器进行评估。答案比理论所暗示的要微妙得多。将回归器限制在oracle边界上通常会显著改善预测,并且随着特征空间变得更大更稀疏,改善程度增加。但是,通过因果发现恢复边界并在恢复的掩码上训练的自然流程并不奏效。现有的估计器在达到边界最有帮助的区域之前就耗尽了计算预算,即使它们运行,也很少能击败完整特征集。我们将此归因于三个原因。发现优化的是结构恢复而非预测。假阴性和假阳性具有高度不对称的预测成本。精确边界只是众多击败所有特征的特征集之一。然后,我们阐述了这些事实对于预测对齐的特征选择以及学习使用因果结构的表格模型的意义。

英文摘要

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
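
For a concrete picture of the object under study: in a DAG satisfying the standard (faithfulness) assumptions, the Markov boundary of a node consists of its parents, its children, and the other parents of its children. The sketch below computes it for a small hand-made graph; the graph and the helper name `markov_boundary` are illustrative assumptions and are unrelated to the SCM3K benchmark code.

```python
def markov_boundary(parents, target):
    """Parents, children, and co-parents ("spouses") of `target` in a DAG.

    `parents` maps every node to the set of its direct parents.
    """
    children = {node for node, ps in parents.items() if target in ps}
    spouses = set()
    for child in children:
        spouses |= parents[child]          # all parents of each child
    boundary = set(parents.get(target, set())) | children | spouses
    boundary.discard(target)               # the target is a parent of its own children
    return boundary

# Hypothetical graph: D -> A -> T -> C <- S, C -> F, and an isolated node E.
toy_parents = {
    "D": set(),
    "A": {"D"},
    "T": {"A"},
    "S": set(),
    "C": {"T", "S"},
    "F": {"C"},
    "E": set(),
}
boundary_of_t = markov_boundary(toy_parents, "T")
```

Here the boundary of `T` is its parent `A`, its child `C`, and the co-parent `S`; the columns `D`, `F`, and `E` carry no additional information about `T` once those three are observed.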

2605.29402 2026-05-29 cs.CV cs.AI 版本更新

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

面向高效长视频推理的语义与视觉证据:HD-EPIC VQA挑战赛的解决方案

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv, Hui Li

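Affiliations * Lenovo, China

AI summary Proposes a unified framework that decouples long-video reasoning into semantic evidence (coarse-to-fine extraction of global procedural structure) and visual evidence (object-centric fine-grained grounding), combined through query-conditioned evidence retrieval and integration, and achieves competitive performance in the HD-EPIC VQA Challenge.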

详情
AI中文摘要

理解长格式自我中心视频对于多模态大语言模型(MLLMs)仍然具有挑战性,原因在于有限的上下文长度和对细粒度视觉细节的定位不足。最近提出的HD-EPIC基准突出了这些局限性:即使是强大的长上下文模型,在多样化的视频问答任务中也表现较低。在本文中,我们提出了一个统一框架,将长视频推理解耦为两种互补的证据形式:语义证据和视觉证据。语义证据通过粗到细的提取流程捕获全局过程结构,而基于目标的视觉证据通过边界框和视觉嵌入保留细粒度的定位。在推理过程中,我们将推理形式化为查询条件的证据检索和整合过程,动态地从两个来源选择相关信息。我们的方法在HD-EPIC-VQA挑战赛的多个任务类别中取得了竞争性能。更广泛地说,我们的结果表明,显式地结构化、检索和整合语义与视觉证据对于使用MLLMs进行有效的长视频理解至关重要。

英文摘要

Understanding long-form egocentric videos remains challenging for multimodal large language models (MLLMs) due to limited context length and insufficient grounding of fine-grained visual details. The recently proposed HD-EPIC benchmark highlights these limitations: even strong long-context models achieve relatively low performance across diverse video question answering tasks. In this paper, we propose a unified framework that decouples long-video reasoning into two complementary forms of evidence: semantic evidence and visual evidence. Semantic evidence captures global procedural structure through a coarse-to-fine extraction pipeline, while object-centric visual evidence preserves fine-grained grounding through bounding boxes and visual embeddings. During inference, we formulate reasoning as a query-conditioned evidence retrieval and integration process, dynamically selecting relevant information from both sources. Our approach achieves competitive performance in the HD-EPIC-VQA Challenge across multiple task categories. More broadly, our results demonstrate that explicitly structuring, retrieving, and integrating semantic and visual evidence is critical for effective long-video understanding with MLLMs.

2605.29400 2026-05-29 cs.AI cs.CL cs.HC 版本更新

Architecture-Sensitive Supervised Fine-Tuning for Screen-Conditioned Action Prediction: A PiSAR Benchmark

面向屏幕条件动作预测的架构敏感监督微调:PiSAR基准

Rahul Bissa, Abhishek Vyas, Yash Jain

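Affiliations * AprioriLabs

AI summary Uses the PiSAR benchmark to compare supervised fine-tuned models with frontier zero-shot models on screen-anchored behaviour prediction; the fine-tuned Qwen3-VL-8B-Instruct clearly outperforms the frontier baselines, whereas fine-tuning Gemma-4-26B-A4B-IT does not help, pointing to a mismatch between model and fine-tuning recipe.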

Comments 14 pages, 7 figures, 2 tables. PiSAR corpus and fine-tuned weights are proprietary to AprioriLabs; methodology and recipe released

详情
AI中文摘要

我们在PiSAR(Persona, intent, Screen, Action, Rationale)的一个661行保留子集上,对三个监督微调模型与前沿零样本基线进行了基准测试。PiSAR是一个包含12,929个元组的屏幕锚定行为理由语料库,从公开的应用商店评论、Pew美国趋势面板人口统计数据以及OPeRA购物者轨迹中整理得到。每个模型,无论是前沿模型还是微调模型,都在相同的661行子集上使用相同的评分流程进行评估。有两个发现。第一,前沿零样本基线(Claude Opus 4.7和GPT-5.5)分别达到sem_sim 0.459和0.482;而微调的Qwen3-VL-8B-Instruct达到0.783,并且在79%的行上sem_sim >= 0.7,而两个前沿基线仅为1-2%,在同一测试集上绝对差距为0.30。第二,相同的训练数据和配方在Gemma-4-26B-A4B-IT上仅得0.441,与前沿零样本基线处于同一水平,而非微调的Qwen。我们将其解读为配方与模型不匹配:经过推理调优的高参数模型抵抗位移,可能需要更多数据或更强的微调方法。

英文摘要

We benchmark three supervised fine-tuned models against frontier zero-shot baselines on a 661-row held-out slice of PiSAR (Persona, intent, Screen, Action, Rationale), a 12,929-tuple corpus of screen-anchored behavioural rationales curated from public app-store reviews, Pew American Trends Panel demographics, and the OPeRA shopper traces. Every model, frontier or fine-tuned, is evaluated on the same 661-row slice with the same scoring pipeline. Two findings. First, frontier zero-shot baselines (Claude Opus 4.7 and GPT-5.5) reach sem_sim 0.459 and 0.482 respectively; a fine-tuned Qwen3-VL-8B-Instruct reaches 0.783 and clears sem_sim >= 0.7 on 79% of rows, against 1-2% for either frontier baseline, a gap of 0.30 absolute on the same test set. Second, the same training data and recipe on Gemma-4-26B-A4B-IT scores only 0.441, in the same band as the frontier zero-shot baselines rather than the fine-tuned Qwen. We read this as a recipe-vs-model mismatch: the reasoning-tuned high-parameter model resists displacement and would likely need either more data or a stronger fine-tuning method.

2605.29398 2026-05-29 cs.LG cs.AI 版本更新

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

GDSD:强化学习作为扩散语言模型的引导去噪器自蒸馏

Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic

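Affiliations * UCL Dept. of Statistical Science; UCL Centre for AI; Alibaba Group; Dept. of EEE; Imperial College London; UNIST; University of Basel

AI summary Proposes Guided Denoiser Self-Distillation (GDSD), which distills the denoiser of a diffusion language model directly from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL; this avoids the training-inference mismatch bias introduced by the ELBO likelihood surrogate and outperforms existing methods on planning, math, and coding benchmarks.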

Comments Preprint

详情
AI中文摘要

强化学习(RL)可用于改进扩散大语言模型(dLLMs)的策略(去噪器),但受到策略似然难以处理的阻碍。一类主流且高效的方法将标准RL中的似然替换为其证据下界(ELBO),该下界从随机掩码序列中估计。尽管与预训练高度一致,但这些方法通过使用ELBO作为似然代理引入了训练-推理不匹配(TIM)偏差,可能降低性能。在这项工作中,我们提出了引导去噪器自蒸馏(GDSD),直接从优势引导的自教师中蒸馏dLLMs的去噪器,该自教师源自逆KL正则化RL的闭式最优解。GDSD通过无归一化目标将dLLM的去噪器logits与教师匹配,将RL简化为无似然自蒸馏,从而绕过了TIM偏差。最近的基于ELBO的方法表现为应用不同蒸馏散度的实例,但存在GDSD避免的可诊断病态。在LLaDA-8B和Dream-7B的规划、数学和代码基准上,GDSD以更稳定的训练奖励动态持续优于先前最先进的基于ELBO的方法,测试准确率提升高达+19.6%。这些结果表明,直接的去噪器自蒸馏,无需依赖ELBO似然代理,可以为dLLMs提供更稳定有效的RL过程。代码可在https://github.com/GaryBall/GDSD获取。

英文摘要

Reinforcement learning (RL) can improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Although well aligned with pre-training, these approaches use the ELBO as a likelihood surrogate and thereby introduce training-inference mismatch (TIM) bias, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD), which directly distills the denoiser of dLLMs from an advantage-guided self-teacher derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods emerge as instances that apply different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
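
The closed-form optimum mentioned in the abstract is, in the standard single-step setting, an exponential tilting of the reference policy: maximizing $\mathbb{E}_\pi[A] - \beta\,\mathrm{KL}(\pi \,\|\, \pi_{\mathrm{ref}})$ gives $\pi^*(a) \propto \pi_{\mathrm{ref}}(a)\exp(A(a)/\beta)$. The sketch below illustrates only this textbook result on a three-token toy distribution; the names, the numbers, and the use of a single categorical distribution are assumptions, and it is not GDSD's actual distillation objective.

```python
import numpy as np

def softmax(logits):
    shifted = logits - np.max(logits)      # subtract the max for numerical stability
    weights = np.exp(shifted)
    return weights / weights.sum()

def tilted_teacher(ref_probs, advantages, beta):
    # pi*(a) is proportional to pi_ref(a) * exp(A(a) / beta)
    return softmax(np.log(ref_probs) + advantages / beta)

def kl_regularized_objective(probs, ref_probs, advantages, beta):
    # E_pi[A] - beta * KL(pi || pi_ref)
    kl = np.sum(probs * np.log(probs / ref_probs))
    return float(np.sum(probs * advantages) - beta * kl)

# Hypothetical reference distribution and per-token advantages.
ref_probs = np.array([0.5, 0.3, 0.2])
advantages = np.array([-1.0, 0.0, 2.0])
beta = 1.0

teacher = tilted_teacher(ref_probs, advantages, beta)
j_teacher = kl_regularized_objective(teacher, ref_probs, advantages, beta)
j_ref = kl_regularized_objective(ref_probs, ref_probs, advantages, beta)
```

The teacher shifts probability toward high-advantage tokens while staying close to the reference, and its objective value equals $\beta \log \sum_a \pi_{\mathrm{ref}}(a)\exp(A(a)/\beta)$. Because the teacher is defined through logits, adding a constant to them leaves it unchanged, which is one reason a logit-matching objective need not compute the normalizer.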

2605.29396 2026-05-29 cs.AI 版本更新

Aligned but Fragile: Enhancing LLM Safety Robustness via Zeroth-Order Optimization

对齐但脆弱:通过零阶优化增强LLM安全鲁棒性

Zhihao Liu, Yifan Wu, Jian Lou, Di Wang, Yuxi Zhou, Yuke Hu

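Affiliations * The State Key Laboratory of Blockchain and Data Security; Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security; Sun Yat-sen University; KAUST

AI summary Addresses the fragility of safety-aligned LLMs under lightweight post-alignment manipulations (parameter noise, activation noise, or quantization) with a hybrid framework based on zeroth-order optimization: standard first-order safety alignment followed by zeroth-order refinement to improve robustness, with perturbation-based evaluations used to estimate layer-wise robustness sensitivity and focus updates on critical layers.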

详情
AI中文摘要

大语言模型的安全对齐旨在减少有害或不安全行为,同时保持通用效用。然而,最近的研究发现对齐效果可能是脆弱的:轻量级的对齐后操作,如参数噪声、激活噪声或量化,很容易削弱预期的安全行为。先前提高鲁棒性的努力主要集中在数据整理、修改对齐目标和识别安全关键参数上,而优化器本身的作用在很大程度上未被探索。在本文中,我们首次从基础优化器的角度研究安全对齐的鲁棒性。这种以优化器为中心的视角自然地指向零阶优化,它通过评估扰动下的安全对齐来提供面向鲁棒性的信号。基于这一见解,我们提出了一个混合框架,首先执行标准的一阶安全对齐,然后应用零阶精炼来提高鲁棒性。从理论和实证上,我们表明仅需少量零阶精炼步骤即可增强鲁棒性,同时保持安全对齐。我们进一步通过利用其固有的基于扰动的评估来估计逐层鲁棒性敏感性,从而提高零阶精炼的效率,使精炼过程能够以适度的训练开销将更新集中在鲁棒性关键层上。

英文摘要

Safety alignment for large language models (LLMs) aims to reduce harmful or unsafe behavior while preserving general utility. However, recent findings reveal that alignment effects can be fragile: lightweight post-alignment manipulations, such as parameter noise, activation noise, or quantization, can easily weaken the intended safety behavior. Prior efforts to improve robustness have primarily focused on data curation, modified alignment objectives, and safety-critical parameter identification, leaving the role of the optimizer itself largely unexplored. In this paper, we are the first to study the robustness of safety alignment from the perspective of the base optimizer. This optimizer-centric view naturally points to zeroth-order optimization, which provides a robustness-oriented signal by evaluating safety alignment under perturbations. Based on this insight, we propose a hybrid framework that first performs standard first-order safety alignment and then applies zeroth-order refinement to improve robustness. Both theoretically and empirically, we show that only a few zeroth-order refinement steps can enhance robustness while preserving safety alignment. We further improve the efficiency of zeroth-order refinement by exploiting its inherent perturbation-based evaluations to estimate layer-wise robustness sensitivity, enabling the refinement process to concentrate updates on robustness-critical layers with modest training overhead.
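
A zeroth-order update needs only loss evaluations at perturbed parameters, which is why it naturally probes behaviour under perturbation. The sketch below shows a generic two-point Gaussian-smoothing gradient estimator on a toy quadratic; it is a common textbook estimator used here as a stand-in, not the refinement procedure of the paper, and the loss, perturbation scale, and sample count are illustrative assumptions.

```python
import numpy as np

def zo_gradient(loss_fn, params, rng, eps=1e-3, n_samples=1):
    """Two-point zeroth-order gradient estimate using random Gaussian directions."""
    grad = np.zeros_like(params)
    for _ in range(n_samples):
        u = rng.standard_normal(params.shape)
        # Finite difference of the loss along the random direction u.
        delta = loss_fn(params + eps * u) - loss_fn(params - eps * u)
        grad += (delta / (2.0 * eps)) * u
    return grad / n_samples

def toy_loss(w):
    # Stand-in objective; its exact gradient is w itself.
    return 0.5 * float(np.sum(w ** 2))

rng = np.random.default_rng(0)
w = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
estimate = zo_gradient(toy_loss, w, rng, n_samples=20000)
```

Averaged over many directions the estimate approaches the true gradient, but each sample costs two forward evaluations, so in practice few samples are used and the estimate is noisy. The same loss differences also indicate how sensitive the objective is to perturbing a given block of parameters, which is the kind of signal the abstract describes reusing for layer-wise sensitivity.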

2605.29394 2026-05-29 cs.AI 版本更新

EvoMD-LLM: Learning the Language of Species Evolution in Reactive Molecular Dynamics

EvoMD-LLM:学习反应分子动力学中物种进化的语言

Zhichen Tang, Zhengzheng Dang, Yulin Chen, Jixin Wu, Haiwen Li, Yanming Wang

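Affiliations * Global College, Shanghai Jiao Tong University; Global Institute of Future Technology, Shanghai Jiao Tong University

AI summary Proposes EvoMD-LLM, a framework that discretizes reactive molecular dynamics trajectories into symbolic temporal sequences and uses temporal scaffolding so that autoregressive LLMs learn how species composition evolves; it outperforms baselines on several temporal prediction tasks and can generate interpretations of its predictions.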

Comments 17 pages, ACL Findings

详情
AI中文摘要

虽然大型语言模型(LLM)在静态科学推理方面表现出色,但它们在建模动态物理过程的时间结构方面存在困难。我们提出了EvoMD-LLM(进化分子动力学大型语言模型),这是一个将物种级分子动力学重新表述为符号时间语言建模问题的框架。反应分子动力学轨迹被离散化为分子事件序列,其中每个标记代表一个化学物种及其持续时间,通过高效微调使标准自回归LLM能够学习随时间的组成演化。EvoMD-LLM的一个关键组成部分是时间脚手架,它将事件持续时间视为显式语言标记,并作为结构化归纳偏置,与传统的序列建模方法相比,显著减少了无效或幻觉的分子输出。我们在多个时间预测任务上评估了EvoMD-LLM,达到了高达66.14%的准确率,并始终优于序列神经网络和基于语言的基线。除了定量改进,我们定性地观察到,该模型能够通过结合相关化学知识为其预测生成解释,尽管它没有经过配对轨迹-解释数据的显式监督。这些结果表明,符号时间语言建模为将LLM应用于动态物理模拟提供了有效框架。

英文摘要

While large language models (LLMs) excel at static scientific reasoning, they struggle to model the temporal structure of dynamic physical processes. We present EvoMD-LLM (Evolutionary Molecular Dynamics Large Language Model), a framework that reformulates species-level molecular dynamics as a symbolic temporal language modeling problem. Reactive MD trajectories are discretized into sequences of molecular events, where each token represents a chemical species augmented with its persistence duration, enabling standard autoregressive LLMs to learn compositional evolution over time through efficient fine-tuning. A key component of EvoMD-LLM is temporal scaffolding, which treats event duration as an explicit linguistic token and serves as a structured inductive bias, significantly reducing invalid or hallucinated molecular outputs compared to conventional sequence modeling approaches. We evaluate EvoMD-LLM on multiple temporal prediction tasks, achieving up to 66.14% accuracy and consistently outperforming sequential neural networks and language-based baselines. Beyond quantitative improvements, we qualitatively observe that the model is capable of generating interpretations for its own predictions by incorporating relevant chemical knowledge, even though it was not explicitly supervised with paired trajectory-explanation data. These results demonstrate that symbolic temporal language modeling provides an effective framework for grounding LLMs in dynamic physical simulations.

2605.29387 2026-05-29 cs.LG cs.AI stat.ML 版本更新

On the Optimizer Dependence of Neural Scaling Laws

神经缩放定律的优化器依赖性

Vansh Ramani, Shourya Vir Jain

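Affiliations * Department of Computer Science and Engineering, Indian Institute of Technology Delhi

AI summary Through random-feature regression experiments, finds that the choice of optimizer systematically affects the scaling exponent α in neural scaling laws, with preconditioned optimizers yielding steeper scaling, and provides a spectral diagnostic for predicting when advanced optimizers pay off.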

详情
AI中文摘要

神经缩放定律 $L(N) \propto N^{-α}$ 中的缩放指数 $α$ 通常被视为由架构和数据确定的固定常数。我们提出证据表明 $α$ 系统性地依赖于优化器。在受控的随机特征回归实验——神经缩放的理论框架——中,我们测量了五种优化器变体和六种光谱条件下的 $α$。预条件优化器一致地产生更陡峭的缩放(更大的 $α$),且 $α$ 的偏移在大部分测试光谱范围内增加,在 $s = 1.5$ 附近达到峰值,并在 $s = 2.0$ 时保持较大。在 $s \approx 1.0$(自然语言的特征)时,完全自然梯度达到 $α\approx 0.31$,而梯度下降为 $α\approx 0.12$——拟合指数大 $2.6$ 倍,在随机特征模型中,该差异随模型规模加倍而累积。这种指数偏移是否以及如何迁移到大规模 LLM 训练中——近期证据表明优势可能随规模减弱——仍是一个重要的开放问题。我们的结果表明,缩放定律预测应考虑优化器选择,并且我们提供了一个光谱诊断来预测高级优化器何时会带来收益。

英文摘要

The scaling exponent $\alpha$ in neural scaling laws $L(N) \propto N^{-\alpha}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $\alpha$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $\alpha$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $\alpha$), with the $\alpha$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $\alpha \approx 0.31$ versus $\alpha \approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
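
A fitted exponent of this kind is usually obtained as the negative slope of a straight-line fit in log-log space. The sketch below shows that procedure on noiseless synthetic losses generated with the two exponents quoted in the abstract; the model sizes, the prefactor 5.0, and the helper name are assumptions made for illustration, not the paper's data or code.

```python
import numpy as np

def fit_scaling_exponent(model_sizes, losses):
    """Fit L(N) = c * N**(-alpha) by least squares on log L versus log N."""
    slope, intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
    return -slope, float(np.exp(intercept))

# Synthetic power-law losses (hypothetical sizes and prefactor).
sizes = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])
losses_gd = 5.0 * sizes ** -0.12           # exponent quoted for gradient descent
losses_ng = 5.0 * sizes ** -0.31           # exponent quoted for natural gradient

alpha_gd, _ = fit_scaling_exponent(sizes, losses_gd)
alpha_ng, _ = fit_scaling_exponent(sizes, losses_ng)

# Multiplicative change in loss each time the model size doubles.
ratio_per_doubling_gd = 2.0 ** -alpha_gd
ratio_per_doubling_ng = 2.0 ** -alpha_ng
```

With these exponents the loss shrinks by a factor of $2^{-\alpha}$ per doubling of $N$, roughly 8% for $\alpha = 0.12$ and 19% for $\alpha = 0.31$, which is the sense in which the gap compounds with model size.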

2605.29384 2026-05-29 cs.IR cs.AI cs.CL 版本更新

Latent Terms: Dense Retrievers Contain Trivially Extractable BM25-ready Zipfian Vocabularies

潜在词:密集检索器包含可轻易提取的符合齐夫分布的BM25就绪词汇表

Benjamin Clavié, Sean Lee, Aamir Shakir, Makoto P. Kato

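Affiliations * Mixedbread AI; National Institute of Informatics; University of Tsukuba

AI summary Proposes Latent Terms, showing that the representations learned by dense retrieval models (single- or multi-vector) decompose trivially into sparse features: a sparse autoencoder extracts a latent vocabulary that can be used directly for BM25 sparse retrieval without retrieval-specific adjustments, matching or outperforming the base model and SPLADE variants.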

详情
AI中文摘要

我们提出潜在词方法,该方法揭示了训练用于密集检索的模型(无论是单向量还是多向量)学习到的表示可以轻易地分解为检索就绪的稀疏特征。当在冻结的检索器上训练时,无需任何检索特定调整的稀疏自编码器能够提取一个具有近似齐夫分布集合统计量的潜在词汇表,直接适用于通过BM25进行的经典稀疏检索评分。这种方法实现了稀疏检索,同时不需要任何学习到的扩展目标或稀疏检索监督,并且可以轻松应用于任何密集检索器。潜在词能够匹配或超越其自身基础模型以及可比较的SPLADE变体的单向量评分方法。此外,在专门设计用于突出单向量检索失败的任务LIMIT上,它显著优于其基础模型。总体而言,我们的结果强调了神经检索器包含比其默认评分函数所暴露的更具表达力和可索引的结构,但其他方法仍然可以利用这些结构。

英文摘要

We propose Latent Terms, a method revealing that models trained for dense retrieval, whether single- or multi-vector, learn representations that can trivially be decomposed into retrieval-ready sparse features. When trained on frozen retrievers, Sparse Autoencoders without any retrieval-specific adjustments extract a latent vocabulary with approximately Zipfian collection statistics, directly suitable for classical sparse retrieval scoring via BM25. This approach enables sparse retrieval while requiring no learned expansion objective or sparse retrieval supervision whatsoever, and can be readily applied to any dense retriever. Latent Terms matches or outperforms single-vector scoring with its own base model as well as comparable SPLADE variants. In addition, it substantially outperforms its base model on LIMIT, a task specifically designed to highlight the failures of single-vector retrieval. Overall, our results highlight that neural retrievers contain more expressive and indexable structure than their default scoring functions expose, and that other methods can nonetheless exploit this structure.
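
To make "BM25-ready" concrete, the sketch below scores documents represented as bags of integer term ids with the usual BM25 formula (the variant with a `log(1 + ...)` IDF). Treating active sparse-autoencoder latents as term ids is an assumption about how such a vocabulary could be used, and the ids, documents, and parameter values are invented for illustration; this is not the authors' pipeline.

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.2, b=0.75):
    """Score each document (a list of term ids) against the query with BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    doc_freq = Counter(t for d in docs for t in set(d))   # documents containing each term
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for t in set(query_terms):
            if t not in tf:
                continue
            idf = math.log(1.0 + (n_docs - doc_freq[t] + 0.5) / (doc_freq[t] + 0.5))
            length_norm = k1 * (1.0 - b + b * len(d) / avg_len)
            score += idf * tf[t] * (k1 + 1.0) / (tf[t] + length_norm)
        scores.append(score)
    return scores

# Hypothetical "latent term" ids standing in for active sparse features.
docs = [[3, 17, 17, 42], [5, 8, 42], [3, 3, 99, 100, 101]]
query = [17, 42]
scores = bm25_scores(query, docs)
best = max(range(len(docs)), key=scores.__getitem__)
```

BM25 relies on term and document frequencies being informative, which is why approximately Zipfian collection statistics matter: rare latents receive high IDF weight and ubiquitous ones are discounted, just as with words.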

2605.29380 2026-05-29 cs.LG cs.AI cs.CV 版本更新

TRACER: Persistent Regularization for Robust Multimodal Finetuning

TRACER: 用于鲁棒多模态微调的持久正则化

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

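Affiliations * School of Computing and Information Systems (CIS), Faculty of Engineering and IT (FEIT), University of Melbourne, Australia

AI summary Proposes TRACER, which uses a weighted moving average teacher to provide persistent regularization, addressing catastrophic forgetting and EMA collapse in multimodal contrastive finetuning and improving out-of-distribution robustness.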

Comments ICML 2026

详情
AI中文摘要

微调预训练多模态模型的主流策略通常会降低分布外(OOD)鲁棒性,这种现象被称为灾难性遗忘。在本文中,我们为多模态对比微调开发了一个理论框架,为每种策略提供了闭式解和几何分解。该框架表明,自蒸馏在保留预训练模型知识方面比其他正则化方法更有效。我们的分析揭示了一个被广泛忽视的局限性:在鲁棒微调中广泛使用的标准指数移动平均(EMA)教师存在坍缩问题。为了解决这个问题,我们证明加权移动平均(WMA)教师在有限时间范围内保持持久的正则化力,并在任务子空间中实现无偏收敛,同时保留正交知识。这些见解促使了**TRACER**(**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization)的提出,它将对比学习与WMA引导的多视角蒸馏相结合。在CLIP微调上的大量实验表明,在三种骨干架构上,OOD准确率和校准性能持续提升,全面的消融实验证实TRACER既有理论依据,又对超参数选择具有鲁棒性。代码可在[https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER)获取。

英文摘要

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate TRACER (Trajectory-Robust Anchoring for Contrastive Encoder Regularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at https://github.com/HesamAsad/TRACER.

2605.29368 2026-05-29 cs.CL cs.AI 版本更新

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

SURGENT: 一种跨围手术期工作流程的手术多智能体辅助系统

Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui, Huawei Feng, Linlin Wang

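Affiliations * East China Normal University; City University of Hong Kong

AI summary Proposes SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning, with a new memory design that manages long-term patient histories and short-term working summaries; it outperforms baseline LLMs and existing medical multi-agent frameworks on five perioperative tasks.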

Comments preprint

详情
AI中文摘要

现代外科护理的复杂性需要智能系统能够综合大量患者记录,支持协作决策,并在整个围手术期工作流程中提供透明、可审计的推理。尽管基于网络的大型语言模型(LLM)具有先进的推理能力,但由于输入长度限制、不完整的记忆管理和有限的可追溯性等关键限制,它们不适合外科应用。为了解决这个问题,我们提出了SURGENT,一种手术多智能体辅助系统,它结合了思维树规划器、多科室协作智能体以及基于临床指南和生物医学文献的检索增强推理。SURGENT具有一种新颖的记忆设计,可以管理长期患者病史和短期工作摘要,从而实现更完整、情境化和一致的推理。在五项关键围手术期任务(病例分析、手术计划模拟、安全监测、并发症风险评估和康复指导)上的实验评估表明,SURGENT优于基线LLM和现有的医疗多智能体框架,生成的推荐与患者病史更加一致。消融研究进一步突出了DeepSeek作为本地可部署骨干模型的优势,使其能够在无需依赖集中服务的情况下实现隐私保护部署。这些结果使SURGENT成为迈向智能、公平和安全的外科辅助系统的实用且可信的进步。

英文摘要

The intricate nature of modern surgical care necessitates intelligent systems that can synthesize extensive patient records, support collaborative decision-making, and provide transparent, auditable reasoning across the entire perioperative workflow. Although web-based Large Language Models (LLMs) possess advanced reasoning capabilities, they are ill-equipped for surgical applications due to critical limitations: input length constraints, incomplete memory management, and limited traceability. To address this issue, we present SURGENT, a surgical multi-agent assistance system that combines a Tree-of-Thought planner, multi-department collaboration agents, and retrieval-augmented reasoning with clinical guidelines and biomedical literature. SURGENT features a novel memory design that manages both long-term patient histories and short-term working summaries, enabling more complete, contextualized, and consistent reasoning. Experimental evaluations across five key perioperative tasks - case analysis, surgical plan simulation, safety monitoring, complication risk assessment, and rehabilitation guidance - show that SURGENT outperforms baseline LLMs and existing medical multi-agent frameworks, yielding recommendations more closely aligned with patient histories. Ablation studies further highlight the advantage of DeepSeek as a locally deployable backbone model, enabling privacy-preserving deployment without reliance on centralized services. These results position SURGENT as a practical and trustworthy advancement toward intelligent, equitable, and secure surgical assistance systems.

2605.29360 2026-05-29 cs.AI 版本更新

MiraBench: Evaluating Action-Conditioned Reliability in Robotic World Models

MiraBench: 评估机器人世界模型中的动作条件可靠性

Tianzhuo Yang, Zihan Shen, Zirui Mi, Zhaoyi Zhang, Jiayi Zhou, Jiaming Ji, Juntao Dai, Jiawei Chen, Boyuan Chen, Yaodong Yang

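Affiliations * Institute for Artificial Intelligence, Peking University

AI summary Introduces the MiraBench benchmark, which evaluates the action-conditioned reliability of robotic world models at three levels (physics adherence, action-following fidelity, and optimism bias detection), and finds that visual fidelity does not reflect action fidelity, that larger models do not reliably follow actions better, and that optimism bias is pervasive.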

详情
AI中文摘要

动作条件世界模型越来越多地被用作机器人学习的可扩展模拟器,但当前的评估对其在条件动作下预测的可靠性提供的证据有限。现有基准主要强调视觉保真度,未明确预测的未来是否物理上合理、是否忠实于命令动作,以及在动作不应成功时是否校准到失败。我们引入了\textsc{MiraBench},一个分层基准,将\emph{动作条件可靠性}定义为机器人世界模型的核心评估目标。MiraBench将此目标分解为三个逐步严格层次:\emph{物理一致性},评估无参考的物理一致性;\emph{动作跟随保真度},衡量预测是否尊重任务相关动作输入;以及\emph{乐观偏差检测},探测在导致失败的动作下预测成功结果的倾向。为支持此评估,我们整理了一个人工标注语料库,包含跨任务、失败类别和领先世界模型的超过16,000个判断。我们评估了12种代表性模型配置,涵盖向量条件机器人世界模型、文本条件生成世界模型、开源系统、闭源系统和多种模型规模。在这一广泛的模型景观中,MiraBench揭示了三个核心发现:视觉保真度是动作保真度的糟糕代理;增加模型规模并不能可靠地改善动作跟随;乐观偏差在现有系统中普遍存在。通过将评估从外观转向动作条件可靠性,MiraBench为评估和改进机器人世界模型作为忠实模拟器提供了诊断基础。

英文摘要

Action-conditioned world models are increasingly used as scalable simulators for robot learning, yet current evaluations provide limited evidence that their predictions are reliable under the actions they condition on. Existing benchmarks largely emphasize visual fidelity, leaving unclear whether predicted futures are physically plausible, faithful to commanded actions, and calibrated to failure when actions should not succeed. We introduce MiraBench, a hierarchical benchmark that defines action-conditioned reliability as a core evaluation target for robotic world models. MiraBench decomposes this target into three progressively demanding levels: Physics Adherence, which evaluates reference-free physical consistency; Action-Following Fidelity, which measures whether predictions respect task-relevant action inputs; and Optimism Bias Detection, which probes the tendency to predict successful outcomes under failure-inducing actions. To support this evaluation, we curate a human-annotated corpus with over 16,000 judgments across tasks, failure categories, and leading world models. We evaluate 12 representative model configurations spanning vector-conditioned robotic world models, text-conditioned generative world models, open-weight systems, closed-source systems, and multiple model scales. Across this broad model landscape, MiraBench reveals three central findings: visual fidelity is a poor proxy for action fidelity; increasing model scale does not reliably improve action following; and optimism bias is pervasive across current systems. By shifting evaluation from appearance to action-conditioned reliability, MiraBench provides a diagnostic foundation for assessing and improving robotic world models as faithful simulators.

2605.29359 2026-05-29 cs.CY cs.AI 版本更新

Does Distributed Training Undermine Compute Governance?

分布式训练是否会破坏计算治理?

Robi Rahman

发表机构 * Machine Intelligence Research Institute(机器智能研究所)

AI总结 本文探讨了分布式训练技术可能规避计算治理的可行性,并提出了包括举报、芯片追踪、法务会计以及集群内存和计算阈值在内的反制措施。

Comments TAIGR workshop in ICML 2026

AI中文摘要

计算治理提案通常依赖于一个假设:前沿AI训练需要大型、可检测的计算集群。然而,分布式训练算法的最新进展可能允许开发者在分布式聚合的硬件上进行前沿规模的训练,而不需要大型数据中心设施。那些不愿受法规约束的开发者可能会以规避计算治理相关的注册和监控要求的方式构建其硬件。因此,必须设计法规来检测和防止非法的分布式训练操作。本文评估了这种规避行为的可行性,并概述了推荐的反制措施,包括举报、芯片追踪、法务会计以及集群的内存和计算阈值。

英文摘要

Compute governance proposals often rely on the assumption that frontier AI training requires large, detectable computing clusters. However, recent advances in distributed training algorithms could allow developers to conduct frontier-scale training on distributed agglomerations of hardware, rather than needing large datacenter facilities. Developers who prefer not to be constrained by regulations may structure their hardware in a manner that evades the registration and monitoring requirements associated with compute governance. Therefore, regulations must be designed to detect and prevent illicit distributed training operations. This paper evaluates the feasibility of such evasion and outlines recommended countermeasures, including whistleblowing, chip tracking, forensic accounting, and memory and compute thresholds for clusters.

2605.29358 2026-05-29 cs.AI 版本更新

Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

扩展单一语义性:从Claude 3 Sonnet中提取可解释特征

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, Alex Tamkin, Esin Durmus, Tristan Hume, Francesco Mosconi, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, Tom Henighan

发表机构 * Anthropic

AI总结 本研究通过稀疏自编码器从生产级语言模型Claude 3 Sonnet中提取可解释特征,验证了字典学习方法在大规模模型上的可扩展性,并分析了特征的多语言、多模态特性及其对模型行为的因果影响。

AI中文摘要

我们证明了稀疏自编码器可以从Claude 3 Sonnet(一个生产级语言模型)中提取可解释特征,解决了字典学习方法能否扩展到小型Transformer之外的问题。我们在模型中间层的残差流上训练了多达3400万个特征的稀疏自编码器,并使用缩放定律指导超参数选择。得到的特征是多语言和多模态的(尽管仅文本训练,但能泛化到图像),对概念的具体实例和抽象讨论都有响应,并可用于以与其解释一致的方式引导模型行为。我们发现了对应于著名实体和位置的特征,以及更抽象的概念,如讽刺或代码中的错误。我们还识别了与语言模型可能造成伤害的方式相关的特征——包括代表欺骗、权力追求、谄媚和偏见的特征——并展示了这些特征在被操纵时对模型输出的因果影响。此外,我们对特征的可解释性、几何结构和计算功能进行了分析。然而,仍然存在显著局限性:我们的特征集不完整,并且缺乏严格的方法来评估我们的特征是否忠实地捕捉了模型的计算过程。

英文摘要

We demonstrate that sparse autoencoders can extract interpretable features from Claude 3 Sonnet, a production-scale language model, addressing the open question of whether dictionary learning methods scale beyond small transformers. We trained sparse autoencoders with up to 34 million features on the model's middle layer residual stream, using scaling laws to guide hyperparameter selection. The resulting features are multilingual and multimodal (generalizing to images despite text-only training), respond to both concrete instances and abstract discussions of concepts, and can be used to steer model behavior in ways consistent with their interpretations. We find features corresponding to famous entities and locations, as well as more abstract concepts like sarcasm or errors in code. We also identify features relevant to ways in which language models might cause harm--including features representing deception, power-seeking, sycophancy, and bias--and show that these causally influence model outputs when manipulated. Additionally, we conduct analyses of feature interpretability, geometry, and computational function. However, significant limitations remain: our suite of features is incomplete, and we lack rigorous methods for evaluating whether our features faithfully capture model computations.

2605.29357 2026-05-29 cs.AI cs.LG cs.PL 版本更新

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

PassNet:面向图编译器 Pass 生成的大语言模型规模化

Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

发表机构 * Baidu, Inc.(百度公司)

AI总结 针对编译器默认优化在长尾子图上性能不佳的问题,提出PassNet生态系统,包含大规模数据集和基准测试,通过微调小模型在少量轨迹上即可接近前沿模型性能。

Comments Code and data available at https://github.com/PaddlePaddle/PassNet

AI中文摘要

现代张量编译器(如 TorchInductor)在主流模型上实现了显著加速,但在长尾负载上却面临系统性的性能上限:我们的性能分析显示,43% 的真实世界子图在默认编译下出现端到端减速。虽然 LLM 为实现自动化优化提供了途径,但现有工作集中于独立内核生成。我们认为,Pass 生成(即由 LLM 编写可直接集成到编译器流水线中的结构化图变换)是更合适的抽象。我们提出 PassNet,首个面向基于 LLM 的编译器 Pass 生成的大规模生态系统,包括:(1) PassNet-Dataset,包含来自 10 万个真实世界模型的超过 1.8 万个独特计算图;(2) PassBench,200 个精心挑选的长尾可融合任务(共包含 2060 个子图),在错误感知加速分数(ES_t,一种统一了正确性、稳定性和性能的指标)下进行评估,并配有针对 LLM 系统性钻空子行为的分层完整性防御。实验表明,PassBench 既具有高度区分性,又远未饱和:最佳前沿模型在总体上落后 TorchInductor 37%,但在单个子图上,LLM 相比同一编译器可实现高达 3 倍的加速,这表明瓶颈在于一致性而非能力。在仅约 4000 条 PassNet 轨迹上微调一个小模型,可获得 2.67 倍的改进,接近前沿模型性能,这证明了巨大的提升空间,并验证了 PassNet 可作为推进 LLM 驱动编译器优化的现成训练基础设施。所有数据、基准测试和工具均已公开。

英文摘要

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

2605.29350 2026-05-29 cs.AI 版本更新

ConMoE: Expert-Pool Consolidation via Prototype Reassignment for MoE Compression

ConMoE: 通过原型重分配进行专家池整合以实现MoE压缩

Yilun Yao, Jiaming Pan, Elsie Dai, Peizhuang Cong, Yaoming Li, Tong Yang

发表机构 * Peking University(北京大学)

AI总结 提出ConMoE,一种无需训练的MoE压缩方法,通过基于校准的贡献和可替换性信号选择保留的专家原型,并确定性重映射原始专家调用,在多个MoE语言模型上匹配或超越强基线。

Comments 12 pages, 3 figures, 5 tables

AI中文摘要

混合专家(MoE)语言模型减少了每个token的计算量,但仍需存储和服务所有专家,导致部署时内存密集。现有的训练后压缩方法主要通过剪枝专家或合并其权重来缩减成本。我们将训练后MoE压缩形式化为专家池整合:保留一组较小的预训练专家作为可重用原型,并确定性地将每个原始专家引用重映射到一个选定的原型。这种观点将缩减后的专家池与表示原始专家槽位的重用结构分离,并允许在局部层范围内共享原型,同时保留原始路由器接口。我们提出ConMoE,一个无需训练的原型重映射框架,它使用基于校准的贡献和可替换性信号选择保留的专家,然后将原始专家调用重定向到选定的原型,无需权重更新或压缩后微调。在三个预训练的MoE语言模型上的实验表明,ConMoE在多种设置下匹配或超越了强剪枝和合并基线,在deepseek-moe-16b-base上以25%和50%的路由专家缩减均取得最佳平均分,同时在Qwen3-30B-A3B和OLMoE-1B-7B-0125上保持竞争力。消融实验表明,确定性重映射是最稳定的组件,而更广泛的跨层共享和事后权重融合则依赖于模型。

英文摘要

Mixture-of-Experts (MoE) language models reduce per-token computation but still require storing and serving all experts, making deployment memory-intensive. Existing post-training compression methods mainly shrink this cost by pruning experts or merging their weights. We formulate post-training MoE compression as expert-pool consolidation: retaining a smaller set of pretrained experts as reusable prototypes and deterministically remapping each original expert reference to one selected prototype. This view separates the reduced expert pool from the reuse structure that represents the original expert slots, and allows prototype sharing within local layer scopes while preserving the original router interface. We propose ConMoE, a train-free prototype remapping framework that selects retained experts using calibration-based contribution and replaceability signals, then redirects original expert calls to the selected prototypes without weight updates or post-compression fine-tuning. Experiments on three pretrained MoE language models show that ConMoE matches or outperforms strong pruning and merging baselines in several settings, achieving the best average score on deepseek-moe-16b-base at both 25% and 50% routed-expert reduction, while remaining competitive on Qwen3-30B-A3B and OLMoE-1B-7B-0125. Ablations indicate that deterministic reassignment is the most stable component, whereas broader cross-layer sharing and post-hoc weight fusion are model-dependent.
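
The remapping idea in this abstract can be illustrated with a minimal Python sketch. It is not ConMoE's actual algorithm: the `contribution` scores, the `similarity` matrix, and the top-k selection rule are hypothetical stand-ins for the calibration-based contribution and replaceability signals the abstract mentions. The sketch only shows the structural point that the router keeps emitting original expert ids while a fixed lookup table redirects each id to a retained prototype.

```python
def consolidate(contribution, similarity, keep):
    """Pick `keep` prototypes and map every original expert slot to one of them."""
    n = len(contribution)
    # Hypothetical selection rule: keep the experts with the highest contribution.
    ranked = sorted(range(n), key=lambda e: (-contribution[e], e))
    prototypes = sorted(ranked[:keep])
    remap = {}
    for e in range(n):
        if e in prototypes:
            remap[e] = e  # a retained expert serves its own slot
        else:
            # Redirect a dropped expert to its most similar retained prototype
            # (ties go to the lower prototype id, so the mapping is deterministic).
            remap[e] = max(prototypes, key=lambda p: (similarity[e][p], -p))
    return prototypes, remap


def route(router_choice, remap):
    """The router still emits original expert ids; only the lookup changes."""
    return [remap[e] for e in router_choice]


# Illustrative numbers only (not from the paper).
contribution = [0.9, 0.1, 0.5, 0.2]
similarity = [
    [1.0, 0.2, 0.4, 0.6],
    [0.2, 1.0, 0.7, 0.1],
    [0.4, 0.7, 1.0, 0.3],
    [0.6, 0.1, 0.3, 1.0],
]
prototypes, remap = consolidate(contribution, similarity, keep=2)
routed = route([1, 3, 0, 2], remap)  # -> [2, 0, 0, 2]
```

Because the table is fixed after calibration, no weights are updated and the router interface is unchanged, which is the property the abstract describes as train-free.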

2605.29335 2026-05-29 cs.CV cs.AI 版本更新

Rethinking FID Through the Geometry of the Reference Dataset

通过参考数据集的几何结构重新思考FID

Yunghee Lee, Byeonghyun Pak

AI总结 本文通过分析参考数据集的几何特性(密度和有效秩)来解释Fréchet Inception Distance (FID) 与样本质量之间的不一致性,并提出应结合参考数据集几何结构来更可靠地评估生成模型。

Comments 9 pages, 2 figures. Accepted to ICML 2026 Workshop: Combining Theory and Benchmarks

AI中文摘要

Fréchet Inception Distance (FID) 被广泛用于评估图像生成器,但较低的FID并不总是对应更好的样本质量。我们表明,这种不匹配部分取决于参考数据集的几何结构。在六个数据集的受控研究中,分布密度和有效秩显著解释了随着样本质量提高FID如何变化。集中数据集往往产生更有利的FID趋势,而更分散的数据集可能导致尽管样本更好但FID恶化。对精确率和召回率的归因以及使用替代特征空间和距离的消融实验支持了相同的结论。这些结果表明,分布度量应与参考数据集的几何结构一起解释,以实现更可靠的基准测试。

英文摘要

Fréchet Inception Distance (FID) is widely used to evaluate image generators, yet lower FID does not always correspond to better sample quality. We show that this mismatch depends in part on the geometry of the reference dataset. In a controlled study across six datasets, distributional density and effective rank significantly explain how FID changes as sample quality improves. Concentrated datasets tend to yield more favorable FID trends, whereas more dispersed datasets can make FID worsen despite better samples. Attribution to precision and recall and ablations with alternative feature spaces and distances support the same conclusion. These results suggest that distributional metrics should be interpreted together with the geometry of the reference dataset for more reliable benchmarking.
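
For reference, the standard definition of FID is given below; it is the well-known formula rather than anything specific to this paper. The subscripts $r$ and $g$ denote the reference and generated sets, and $\mu$, $\Sigma$ are the mean and covariance of their Inception features. The reference dataset enters only through $\mu_r$ and $\Sigma_r$, which is one way to see why its geometry could shape how the score responds to changes in the generated samples.

```latex
\mathrm{FID}
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```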

2605.29310 2026-05-29 cs.AI cs.CL 版本更新

Rubric-Guided Process Reward for Stepwise Model Routing

基于评分准则的逐步模型路由过程奖励

Shenghao Ye, Yu Guo, Zhengheng Li, Shuangwu Chen, Jian Yang

发表机构 * University of Science and Technology of China(中国科学技术大学) Southeast University(东南大学) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 提出RoRo框架,通过收集路由轨迹、构建偏好对、训练Rubricor生成评估准则和Judge评分,结合过程与结果奖励优化路由策略,提升大型推理模型逐步路由的准确性和成本效率。

Comments 17 pages, 9 figures, submitted to EMNLP 2026

AI中文摘要

逐步模型路由通过将每个推理步骤分配给合适的模型来提高大型推理模型(LRM)的效率。最近的方法将路由建模为顺序决策过程,并使用强化学习训练路由器。然而,尽管它们将路由建模为一个过程,但仍然使用结果奖励来监督路由器。这种奖励仅反映最终答案的正确性,未能评估中间路由决策,这可能会削弱性能和泛化能力。为了解决这一差距,我们提出了RoRo,一种基于评分准则的逐步模型路由过程奖励框架。RoRo首先收集多样化的路由轨迹,并基于结果、成本和过程质量构建偏好对。然后,它通过交替优化训练一个Rubricor来生成查询特定的评估准则,以及一个Judge来在此准则下对路由轨迹进行评分。由此产生的过程奖励与结果奖励相结合,通过GRPO优化路由策略。在五个推理基准上的实验,无论是在同族还是跨族设置下,都表明RoRo始终优于强基线,并实现了更好的准确性和成本权衡。

英文摘要

Stepwise model routing improves the efficiency of Large Reasoning Models (LRMs) by assigning each reasoning step to a suitable model. Recent methods formulate routing as a sequential decision process and train the router with reinforcement learning. However, although they model routing as a process, they still supervise the router with outcome rewards. Such rewards only reflect final answer correctness and fail to evaluate intermediate routing decisions, which can weaken performance and generalization. To address this gap, we propose RoRo, a rubric-guided process reward framework for stepwise model routing. RoRo first collects diverse routing trajectories and constructs preference pairs based on outcome, cost, and process quality. It then trains a Rubricor to generate a query-specific evaluation rubric and a Judge to score routing trajectories under this rubric through alternating optimization. The resulting process rewards are combined with outcome rewards to optimize the routing policy via GRPO. Experiments on five reasoning benchmarks under both same-family and cross-family settings show that RoRo consistently outperforms strong baselines and achieves better accuracy and cost trade-offs.
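
As a rough illustration of the final optimization step, the sketch below combines outcome and process rewards and then computes group-relative advantages in the style of GRPO. The additive combination and the `weight` parameter are assumptions, since the abstract does not say how the two rewards are merged; the normalization (reward minus group mean, divided by group standard deviation) follows the usual GRPO formulation, using the population standard deviation here.

```python
import statistics


def combined_rewards(outcome, process, weight=0.5):
    """Hypothetical combination: outcome reward plus a weighted process reward."""
    return [o + weight * p for o, p in zip(outcome, process)]


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each trajectory's reward against its sampling group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four routing trajectories sampled for one query (illustrative values).
outcome = [1.0, 0.0, 1.0, 0.0]   # final answer correct or not
process = [0.8, 0.6, 0.2, 0.1]   # rubric-based score from a judge
advantages = group_relative_advantages(combined_rewards(outcome, process))
```

With outcome rewards alone, the first and third trajectories would receive identical advantages; the process term is what separates them.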

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek:训练用于直接语料库交互的搜索代理

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Princeton University(普林斯顿大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出GrepSeek,一种通过两阶段训练(冷启动数据集+GRPO优化)和语义保持的分片并行执行引擎,训练紧凑型搜索代理直接与文本语料库交互(通过shell命令),在开放域问答中取得最优F1和精确匹配。

AI中文摘要

大型语言模型(LLM)搜索代理通过多轮推理和信息检索,在知识密集型语言任务中展现出强大潜力。大多数现有系统使用检索器,该检索器接收关键词或自然语言查询,并利用预计算文档表示的索引返回排序后的文档列表。在本工作中,我们探索了一种互补视角,其中搜索代理将语料库本身视为搜索环境,并通过执行可执行的shell命令来寻找证据。我们引入了GrepSeek,一种优化的直接语料库交互(DCI)搜索代理,它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大语料库上直接使用强化学习进行学习行为的不稳定性,我们提出了一种两阶段训练流程。首先,我们使用答案感知的Tutor和答案盲的Planner构建冷启动数据集,生成经过验证的、因果基础的搜索轨迹。其次,我们使用组相对策略优化(GRPO)优化初始化的策略,使代理能够通过与语料库的直接交互来改进其任务导向的搜索行为。为了使DCI在大规模下实用,我们进一步使用语义保持的分片并行执行引擎,该引擎将基于shell的检索加速高达7.6倍,同时保持与shell命令顺序执行的字节精确等价。在七个开放域问答基准上的实验表明,GrepSeek在整体词元级F1和精确匹配上取得了最强性能。我们的分析还揭示了纯粹词汇交互在具有显著表面形式变化的查询上的局限性,表明DCI作为搜索代理的一种实用且具有竞争力的方法,可以在现实世界中补充现有的检索范式。

英文摘要

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
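
The equivalence claim for sharded execution can be made concrete with a small sketch. This is not the GrepSeek engine: it is a hypothetical pure-Python stand-in for a line-oriented filter such as `grep`, showing why splitting the corpus into contiguous shards and concatenating the results in shard order reproduces the sequential output exactly. It is meant to illustrate ordering, not speed, and it does not cover commands with cross-line state (sorting, counting, `head`), which would need extra handling.

```python
import re
from concurrent.futures import ThreadPoolExecutor


def grep_lines(lines, pattern):
    """Sequential baseline: return matching lines in corpus order."""
    rx = re.compile(pattern)
    return [ln for ln in lines if rx.search(ln)]


def sharded_grep(lines, pattern, num_shards=4):
    """Search contiguous shards in parallel, then concatenate in shard order."""
    if not lines:
        return []
    size = -(-len(lines) // num_shards)  # ceiling division
    shards = [lines[i:i + size] for i in range(0, len(lines), size)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        # Executor.map yields results in submission order, not completion order,
        # so the merged output keeps the original corpus order.
        parts = list(pool.map(lambda s: grep_lines(s, pattern), shards))
    return [ln for part in parts for ln in part]


corpus = [f"doc {i}: {'jax' if i % 3 == 0 else 'numpy'} notes" for i in range(20)]
assert sharded_grep(corpus, r"jax") == grep_lines(corpus, r"jax")
```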

2605.29303 2026-05-29 cs.AI 版本更新

Entropy-KL Divergence-based Token Masking: A Novel Approach for Selective Fine-tuning of Large Language Models

基于熵-KL散度的令牌掩码:一种用于大语言模型选择性微调的新方法

Qi Liu, Mingdi Sun, Yongyi He, Zhi Zheng, Tong Xu, Yi Zheng, Zhefeng Wang, Enhong Chen

发表机构 * University of Science and Technology of China(中国科学技术大学) Huawei Cloud(华为云)

AI总结 针对低数据场景下标准监督微调导致模型分布偏移的问题,提出EKSFT方法,通过选择性掩码高熵或高KL散度的令牌,在注入任务知识的同时保持预训练分布完整性,在数学推理基准上优于标准SFT并提升后续RL性能。

Comments 17 pages

AI中文摘要

监督微调(SFT)后接强化学习(RL)已成为大语言模型的标准后训练范式。该范式为RL探索提供了冷启动,避免了纯RL中同策略(on-policy)采样难以产生足够正样本的低效问题。然而在实践中,现有方法用于SFT初始化的数据量通常远少于RL阶段,这可能导致模型拟合有限样本并偏离其预训练分布。这种分布偏移会削弱模型在后续RL训练中有效探索的能力。为解决这一挑战,我们主张在低数据场景下,SFT应优先激活任务相关能力,而非记忆特定内容。沿着这一思路,我们提出EKSFT(熵-KL选择性微调),该方法选择性地掩码那些熵较高或相对于参考模型KL散度较高的令牌。通过将这些高不确定性、会造成分布偏移的令牌排除在模仿目标之外,EKSFT在注入任务特定知识的同时保持了模型预训练分布的完整性。在数学推理基准上的实证评估表明,EKSFT始终优于标准SFT。从EKSFT模型出发进一步进行RL微调,可获得一致更好的RL后性能,表明RL阶段的探索得到了改善。我们的代码和数据集可在 https://github.com/MINE-USTC/EKSFT 获取。

英文摘要

Supervised fine-tuning (SFT) followed by reinforcement learning (RL) has become a standard post-training paradigm for large language models. This paradigm provides a cold-start for RL exploration, avoiding the inefficiency of pure RL where on-policy sampling yields insufficient positive samples. However, in practice, existing approaches often use a small amount of data for SFT initialization compared to the RL phase, which can cause the model to fit the limited samples and shift away from its pre-trained distribution. This distribution shift impedes the model's ability to effectively explore during subsequent RL training. To address this challenge, we propose that in low-data regimes, SFT should prioritize activating task-relevant capabilities rather than memorizing specific content. Along this line, we propose EKSFT (Entropy-KL Selective Fine-Tuning), which selectively masks tokens that exhibit either high entropy or high KL divergence from a reference model. By excluding these high-uncertainty, distribution-shifting tokens from imitation, EKSFT injects task-specific knowledge while preserving the integrity of the model's pre-trained distribution. Empirical evaluations on mathematical reasoning benchmarks demonstrate that EKSFT consistently outperforms standard SFT. Further RL fine-tuning from the EKSFT model yields consistently better post-RL performance, indicating improved exploration for the RL stage. Our codes and datasets are available at https://github.com/MINE-USTC/EKSFT.
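
A minimal sketch of the masking rule described above, written in plain Python so it runs without any ML framework. Several details are assumptions rather than the authors' method: the entropy is taken over the policy's own next-token distribution, the divergence is computed as KL(policy || reference), and the thresholds `max_entropy` and `max_kl` are hypothetical. The released EKSFT code may define these differently.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(x * math.log(x) for x in p if x > 0.0)


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(x * math.log(x / max(y, eps)) for x, y in zip(p, q) if x > 0.0)


def loss_mask(policy_dists, ref_dists, max_entropy=1.0, max_kl=0.5):
    """Return 1 for tokens kept in the SFT loss and 0 for masked tokens."""
    mask = []
    for p, q in zip(policy_dists, ref_dists):
        uncertain = entropy(p) > max_entropy     # high-entropy token
        shifted = kl_divergence(p, q) > max_kl   # far from the reference model
        mask.append(0 if (uncertain or shifted) else 1)
    return mask


# Three token positions over a toy 4-word vocabulary (illustrative values).
policy = [
    [0.97, 0.01, 0.01, 0.01],  # confident and close to the reference
    [0.25, 0.25, 0.25, 0.25],  # high entropy
    [0.90, 0.05, 0.03, 0.02],  # confident but far from the reference
]
reference = [
    [0.97, 0.01, 0.01, 0.01],
    [0.25, 0.25, 0.25, 0.25],
    [0.02, 0.03, 0.05, 0.90],
]
mask = loss_mask(policy, reference)  # -> [1, 0, 0]
```

In a training loop, a mask like this would multiply the per-token cross-entropy so that masked positions contribute no gradient.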

2605.29300 2026-05-29 cs.CL cs.AI cs.SD 版本更新

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

MusTBENCH:音乐大语言模型中的时间定位基准与推进

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo, Shuyang Cui, Zhi Zhong, Wei-Hsiang Liao, Hiromi Wakaki, Yuki Mitsufuji

发表机构 * Seoul National University(首尔国立大学) Sony Group Corporation(索尼集团) Sony AI(索尼人工智能)

AI总结 提出MusTBENCH基准和MusT四阶段优化方法,评估并提升音乐大语言模型在音频中的时间定位能力。

AI中文摘要

近期的大型音频-语言模型(LALMs)在理解音乐内容方面展现了有前景的能力。然而,它们的响应是否基于音频中正确的时间区域仍未得到充分探索。这一限制对于音乐理解尤为关键,因为关键信息通常以时间局部化事件的形式出现,例如乐器进入和节奏转换。为了解决这一差距,我们引入了MusTBENCH,一个由音乐专家验证的基准,旨在通过五个时间定位的问答任务评估LALMs中的时间定位能力。为了进一步提升现有模型中的时间定位,我们提出了MusT,一种新颖的四阶段时间优化方案,涵盖音乐编码器适应、LLM适应、LLM监督微调和基于RL的优化。在MusTBENCH上的实验表明,现有LALMs在精确时间定位方面存在困难,而MusT相比强基线带来了显著改进。这些结果将时间定位确立为当前LALMs中缺失的关键能力,并将MusTBENCH定位为未来时间定位音乐理解研究的具有挑战性的基准。

英文摘要

Recent Large Audio-Language Models (LALMs) have demonstrated promising abilities in understanding musical content. However, whether their responses are grounded in the correct temporal regions of the audio remains underexplored. This limitation is particularly critical for music understanding, where key information often occurs as temporally localized events, such as instrument entries and rhythmic transitions. To address this gap, we introduce MusTBENCH, a music-expert-validated benchmark designed to evaluate temporal grounding in LALMs through five temporally grounded question-answering tasks. To further improve temporal grounding in existing models, we propose MusT, a novel four-stage temporal optimization recipe spanning music encoder adaptation, LLM adaptation, LLM supervised fine-tuning, and RL-based optimization. Experiments on MusTBENCH show that existing LALMs struggle with precise temporal grounding, while MusT brings significant improvements over strong baselines. These results establish temporal grounding as a key missing capability in current LALMs and position MusTBENCH as a challenging benchmark for future research in temporally grounded music understanding.

2605.29288 2026-05-29 cs.AI 版本更新

Diagnosing Harmful Continuation in Answer-Correct Long-CoT Training Traces

诊断答案正确长链思维训练轨迹中的有害延续

Chen He, Yuhao Wu, Lei Wang, Wenxuan Zhang, Fumin Shen

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Singapore University of Technology and Design(新加坡科技设计大学) Singapore Management University(新加坡管理大学)

AI总结 研究长链思维训练数据中答案正确但后续推理有害的延续现象,通过删除后缀实验发现其损害训练效果,并提出轻量级边界代理方法。

AI中文摘要

长链思维(CoT)轨迹被广泛用作面向推理的大语言模型监督微调(SFT)的监督信号,然而答案正确的轨迹仍可能导致显著不同的微调结果。我们研究了答案正确的长CoT数据中的结论后延续:即答案已充分支持,但轨迹继续包含额外推理并保留在监督目标中。为了测试其训练效果,我们使用仅删除的编辑器构建保留答案的后缀移除,并比较原始和经过处理的轨迹上的CoT监督微调。我们观察到移除编辑器识别的结论后延续后监督微调结果有所改善,表明这种延续在我们的设置中对训练有害。因此,我们将这一经验支持的现象称为有害延续。除了这一干预,我们还通过不确定性和隐藏状态进展进一步刻画了被移除的结论后延续。我们观察到持续的局部不确定性以及减弱的终端方向进展,形成了不确定性-几何不匹配。最后,我们实例化了有害延续切割(HCC),一种轻量级边界代理,近似于编辑器识别的结论后延续边界。

英文摘要

Long chain-of-thought (CoT) traces are widely used as supervision for reasoning-oriented LLM SFT, yet answer-correct traces can still lead to markedly different fine-tuning outcomes. We study post-conclusion continuation in answer-correct long-CoT data: a continuation where the answer appears sufficiently supported, but the trace continues with additional reasoning that remains in the supervised target. To test its training effect, we use a delete-only editor to construct answer-preserving suffix removal and compare CoT-based SFT on the original and processed traces. We observe improved SFT outcomes after removing the editor-identified post-conclusion continuation, suggesting that this continuation is harmful to training in our setting. We therefore refer to this empirically supported phenomenon as harmful continuation. Beyond this intervention, we further characterize the removed post-conclusion continuation through uncertainty and hidden-state progress. We observe persistent local uncertainty together with weakened terminal-directional progress, forming an uncertainty--geometry mismatch. Finally, we instantiate Harmful Continuation Cut (HCC), a lightweight boundary proxy that approximates the editor-identified post-conclusion continuation boundary.

2605.29283 2026-05-29 cs.LG cs.AI 版本更新

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

物理基础模型能否学习可泛化的物理?一种跨物理机制和分布偏移的偏差感知基准

Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

发表机构 * The Ohio State University(俄亥俄州立大学) Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)

AI总结 通过构建包含8种物理动力学、3种训练数据混合和25种测试机制的基准,评估五种物理基础模型架构,发现当前模型是条件性而非通用性泛化者,其泛化能力依赖于物理机制、时间尺度、初始条件、预训练、模型大小和架构,并指出改进需超越缩放模型或扩展数据,转向学习跨机制、时间尺度和分布偏移的可迁移物理知识。

Comments 26 pages, 31 figures

AI中文摘要

最近的物理基础模型声称具有通用的时空预测能力,但它们的评估通常将性能压缩为固定训练分布下的单一平均分数。这使得难以确定模型是否学习了可泛化的物理动力学,还是仅在特定设置下表现良好。我们构建了一个包含8种物理动力学、3种训练数据混合和25种测试机制的基准,这些测试机制由动态尺度和初始条件复杂性变化引起,涵盖了分布内、分布偏移和分布外设置。我们评估了五种物理基础模型架构和每种架构的四种模型变体(从头训练和三种预训练大小),共得到60,000个测量结果。我们的结果表明,当前的物理基础模型表现为条件性而非通用性泛化者:它们的泛化能力取决于物理机制、时间尺度、初始条件设置、预训练、模型大小和架构。改进训练数据分布只能部分缓解这一限制。预训练和缩放也无法可靠地消除它们的能力偏差。我们认为,改进物理基础模型需要超越缩放模型或扩展数据,转向学习能够更好地跨机制、时间尺度和分布偏移捕获可迁移物理知识的机制。

英文摘要

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.

2605.29277 2026-05-29 cs.SE cs.AI 版本更新

Code-QA-Bench: Separating Code Reasoning from Documentation Memorization in Repository-Level QA

Code-QA-Bench:在仓库级问答中分离代码推理与文档记忆

Jun Zhang, JianYing Qu, Hanwen Du, Zhongkai Sun, Yehua Yang, Qiao Zhao

发表机构 * Baidu Inc(百度公司)

AI总结 提出Code-QA-Bench框架,通过答案优先生成和三条件实验设计,自动构建仓库级代码理解基准,以区分代码推理、文档回忆和预训练记忆的影响。

AI中文摘要

我们提出了Code-QA-Bench,一个全自动框架,用于合成仓库级代码理解基准,将真正的代码理解与文档回忆和预训练记忆区分开来。该框架有两个方法论贡献:(1)答案优先的生成流程,其中配备工具的智能体先探索源代码以生成经过验证的标准答案(gold answer),然后再推导问题,确保每个任务都基于真实的代码结构;(2)三条件实验设计,在闭卷(无仓库)、仅代码(移除文档)和带文档(完整仓库)条件下评估智能体,条件之间的差值直接量化文档效用和记忆程度。我们从SWE-Bench中的10个Python仓库生成了528个代码可推导任务和100个文档依赖任务,由LLM评判员根据准确性、完整性和特异性评分。对四个前沿模型的实验表明,代码访问是主导因素(比闭卷平均提高0.23),文档提供了适度的额外收益(文档依赖任务上提高0.071),并且在代码可推导任务上仅代码≈带文档,验证了该设计。该框架是开源的,适用于任何文档良好的Python仓库。

英文摘要

We present Code-QA-Bench, a fully automated framework for synthesizing repository-level code understanding benchmarks that separates genuine code comprehension from documentation recall and pretraining memorization. The framework makes two methodological contributions: (1) an answer-first generation pipeline where a tool-equipped agent explores source code to produce verified gold answers before deriving questions, ensuring every task is grounded in real code structure; and (2) a three-condition experimental design evaluating agents under closed-book (no repository), code-only (documentation removed), and documented (full repository) conditions, with deltas directly quantifying documentation utility and memorization. We generate 528 code-derivable and 100 doc-dependent tasks across 10 Python repositories from SWE-Bench, scored by an LLM judge on accuracy, completeness, and specificity. Experiments on four frontier models reveal that code access is the dominant factor (+0.23 mean gain over closed-book), documentation provides modest additional benefit (+0.071 on doc-dependent tasks), and code-only $\approx$ documented on code-derivable tasks, validating the design. The framework is open-source and applicable to any well-documented Python repository.
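
The three-condition design lends itself to a simple decomposition, sketched below. The function name, the dictionary keys, and the input scores are hypothetical, and reading the closed-book score as a memorization floor is an interpretation of the abstract rather than the paper's stated formula.

```python
def condition_deltas(closed_book, code_only, documented):
    """Split a documented-condition score into three additive parts."""
    return {
        "memorization_floor": closed_book,     # answerable without the repository
        "code_gain": code_only - closed_book,  # attributable to reading the code
        "doc_gain": documented - code_only,    # attributable to documentation
    }


# Illustrative judge scores on a 0-1 scale (not results from the paper).
deltas = condition_deltas(closed_book=0.50, code_only=0.70, documented=0.75)
```

On code-derivable tasks the abstract reports that code-only and documented scores are about equal, which in this decomposition corresponds to a `doc_gain` near zero.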

2605.29272 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Causal Label Recovery in Payment Networks

支付网络中的因果标签恢复

Gaurav Dhama

发表机构 * Mastercard(万事达卡)

AI总结 针对支付网络中标签存在的四种系统偏差,提出序列三重稳健(STR)估计器,同时纠正所有偏差并达到半参数效率界,实现基于数天而非数月数据的训练。

Comments 49 pages

AI中文摘要

支付网络中的欺诈检测模型依赖于存在系统性偏差的退单标签进行训练。每个标签必须依次经过三个门控:授权(被拒绝的交易不产生标签)、发卡行报告(未报告的欺诈不可见)和延迟(待处理的退单在训练时缺失)。到达的标签可能因第一方滥用或发卡行错误分类而受损。配套论文[arXiv:2605.27557]证明这四种损害对检测性能施加了极小极大下界。本文问:能否达到该下界?我们将观测流程形式化为一个具有三个倾向阶段和一个损坏层的顺序缺失数据问题,并构建了序列三重稳健(STR)估计器。STR同时纠正所有四种损害,并达到半参数效率界——没有估计器能具有更低的渐近方差。它是序列三重稳健的:在每个门控处,一致性仅要求倾向模型或结果回归中有一个正确指定,而非两者。我们提供了通过噪声率调整的伪标签进行损坏校正、通过经验贝叶斯收缩稳定小发卡行的逆倾向权重、提供有效置信区间的插件方差估计量,以及用于有限样本保证的伯恩斯坦集中不等式。在操作层面,我们推导了最优训练延迟——使标签质量损失和模型过时之和最小化的成熟窗口——并证明STR允许使用数天而非数月前的数据进行训练,将模型新鲜度与退单成熟周期解耦。对于任何样本量,STR在均方误差上严格优于基于退单的朴素训练。

英文摘要

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved? We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees. On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.

2605.29271 2026-05-29 cs.AI cs.IR cs.LG 版本更新

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

CoHyDE: 用于工具检索的LLM改写器与稠密编码器的迭代协同训练

Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

发表机构 * SAP Labs(SAP实验室)

AI总结 提出CoHyDE方法,通过迭代协同训练稠密编码器和LLM改写器,结合对比学习和偏好对齐,在工具检索任务中同时提升标准查询和模糊查询的性能。

AI中文摘要

在大规模API目录上的工具检索是LLM智能体的核心瓶颈:用户查询以口语化、通常不明确的语言出现,而目录使用技术性API词汇,没有固定的编码器能够单独弥合这一差距。两种主要的训练方法,对比编码器微调和基于冻结LLM的HyDE式查询扩展,从相反的角度解决这个问题,并在互补的方向上失败:微调编码器在查询的表面形式与目录匹配时表现出色,但在不匹配时性能崩溃;而零样本HyDE对不明确的查询更鲁棒,但生成不感知目录的假设描述,当查询形式良好时检索性能下降。我们提出CoHyDE,一种迭代过程,将稠密编码器和LLM改写器训练为单个共同演化的系统:编码器使用改写器生成的目录风格假设描述通过InfoNCE重新训练,改写器通过DPO基于编码器的检索分数进行偏好对齐,两者在循环开始前在工具目录上进行热启动。在ToolBench目录的约10k工具子集上,三轮CoHyDE在标准查询上比最强的单组件基线提高+2.5个百分点的NDCG@5,在保留的模糊查询上提高+6.3个百分点,在最难的模糊层级上增益高达+8个百分点。消融实验证实协同训练是关键因素:单独使用任一组件都无法在形式良好和模糊查询上匹配CoHyDE,在模糊查询上损失高达-8个百分点。

英文摘要

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
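
For readers unfamiliar with the encoder objective, here is a small sketch of the InfoNCE loss for a single query. It uses the standard formulation (negative log-softmax of the positive candidate); the temperature value and the similarity scores are assumptions for illustration, and the paper's batching and negative-sampling choices are not specified in the abstract.

```python
import math


def info_nce(pos_score, neg_scores, temperature=0.05):
    """InfoNCE for one query: -log softmax of the positive among all candidates."""
    logits = [pos_score / temperature] + [s / temperature for s in neg_scores]
    m = max(logits)  # subtract the max for numerical stability
    log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_denominator - logits[0]


# Similarity between a rewritten query and tool descriptions (illustrative).
well_separated = info_nce(pos_score=0.9, neg_scores=[0.1, 0.2])
poorly_separated = info_nce(pos_score=0.3, neg_scores=[0.1, 0.2])
assert well_separated < poorly_separated
```

The loss shrinks as the correct tool's score pulls away from the negatives, which is the signal the abstract describes the rewriter as being aligned against via DPO.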

2605.29270 2026-05-29 cs.AI 版本更新

Indexing the Unreadable: LLM-Native Recursive Construction and Search of Service Taxonomies

索引不可读之物:基于LLM原生的服务分类法递归构建与搜索

Wei Zheng, Yang Yan, Yiyang Shao, Jinyang Li, Zeze Chang, Yukuang Jia, Qiming Mao, Chihyung Wang, Jingbin Zhou

AI总结 针对LLM在服务发现中因上下文窗口限制和长输入中间信息丢失问题,提出LLM原生的渐进式披露方案A2X,通过自动构建层次化服务分类法并在查询时逐层遍历,显著提升检索准确率并降低token消耗。

Comments Preprint. 8 pages main paper + appendix; 2 figures. Under submission to EMNLP 2026

AI中文摘要

智能体互联网(Internet of Agents, IoA)时代正在形成:LLM智能体预计将通过编排数量快速增长的模型上下文协议(MCP)服务器、智能体到智能体(A2A)端点、可复用技能以及其他LLM可调用服务来实现用户目标。然而,LLM与这一场景存在结构性不匹配:有效上下文是一种稀缺资源,无法随服务数量扩展。将数千个服务描述拼接到提示中会溢出上下文窗口;即使窗口足够大,模型也会系统性地忽视长输入中间部分的信息,即文献中已有充分记录的“中间迷失”(Lost-in-the-Middle)现象。这本质上是服务发现中的上下文管理问题。为解决此问题,我们提出一种LLM原生的渐进式披露方案及其具体实例A2X(Agent-to-Anything,智能体到任意对象的服务发现):一个LLM驱动的流水线,自动将已注册服务组织成层次化分类法,并在查询时逐层遍历,使得每次LLM调用仅看到与用户查询高度相关的小候选集。这将有效上下文的稀缺性与注册表规模解耦,在提高检索准确性的同时显著降低token消耗。与全上下文转储相比,A2X在提示token成本仅为九分之一的情况下实现了6.2个点的命中率提升;与最先进的开源嵌入基线相比,A2X将命中率提高了超过20个点。

英文摘要

The era of the Internet of Agents (IoA) is taking shape: LLM agents are expected to fulfill user goals by orchestrating fast-growing populations of Model Context Protocol (MCP) servers, Agent-to-Agent (A2A) endpoints, reusable skills, and other LLM-callable services. Yet LLMs face a structural mismatch with this regime: effective context is a scarce resource that does not scale with the number of services. Concatenating thousands of service descriptions into a prompt overflows the context window, and even when the window is large enough, models systematically under-attend to information in the middle of long inputs, the well-documented Lost-in-the-Middle phenomenon. This is fundamentally a question of context management for service discovery. To address this, we propose an LLM-native progressive-disclosure scheme and its concrete instantiation, A2X (Agent-to-Anything service discovery): an LLM-driven pipeline that automatically organizes the registered services into a hierarchical taxonomy and walks it layer by layer at query time, so that every LLM call sees only a small candidate set highly relevant to the user query. This decouples effective-context scarcity from registry size and significantly reduces token consumption while improving retrieval accuracy. Compared to full-context dumping, A2X achieves a 6.2-point Hit Rate gain at one-ninth the prompt-token cost; compared to the state-of-the-art open-source embedding-based baseline, A2X improves Hit Rate by more than 20 points.
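The layer-by-layer walk described above can be sketched as follows. This is only an illustration under stated assumptions: the word-overlap scorer stands in for the per-level LLM call, and the taxonomy layout, labels, and service names are hypothetical rather than taken from A2X.

```python
def overlap_score(query, text):
    """Stand-in for an LLM relevance call: counts shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def walk_taxonomy(node, query, beam=1):
    """Descend layer by layer, keeping only the `beam` best children per level."""
    if "services" in node:              # leaf: return its registered services
        return list(node["services"])
    ranked = sorted(node["children"],
                    key=lambda child: overlap_score(query, child["label"]),
                    reverse=True)
    found = []
    for child in ranked[:beam]:         # each step sees only a small candidate set
        found.extend(walk_taxonomy(child, query, beam))
    return found

# Hypothetical two-level taxonomy of registered services.
taxonomy = {
    "label": "root",
    "children": [
        {"label": "weather forecast climate", "services": ["get_forecast", "get_alerts"]},
        {"label": "travel flight hotel booking",
         "children": [
             {"label": "flight search booking", "services": ["search_flights"]},
             {"label": "hotel search booking", "services": ["search_hotels"]},
         ]},
    ],
}

candidates = walk_taxonomy(taxonomy, "book a flight to Tokyo")
```

The point of the design is that the prompt at each level grows with the branching factor, not with the total number of registered services.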

2605.29267 2026-05-29 cs.AI cs.LG 版本更新

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

人类策展何时以及如何适得其反:多模型自消费循环下的偏好对齐

Yang Zhang, Xiukun Wei, Xueru Zhang

Affiliations: Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio

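AI summary: Studies how human curation affects model alignment in multi-model self-consuming training and finds that cross-model interactions can dampen or even invert the effect of curation, degrading long-term alignment.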

详情
AI中文摘要

基础模型越来越多地使用先前模型迭代生成的合成数据进行训练,而非仅依赖真实数据。这种自消费训练范式可能导致模型崩溃、发散或偏差放大。近期工作(Ferbach et al., 2024)表明,将人类策展纳入循环可以引导自消费模型向人类对齐的行为,但这些分析聚焦于单一孤立模型,该模型仅消耗自身输出。然而,在实践中,模型经常交互并训练于其他模型产生的输入-输出对。本文研究多模型机制下的自消费训练。我们首先形式化了一个交互自消费模型的框架,并刻画了所得动力系统何时收敛到稳定点。然后,我们考察了一个模型的人类策展如何影响其自身对齐(自影响),以及这种效应如何传播到其他模型(交叉影响)。与孤立设置中人类策展总是增强模型对齐不同,我们表明跨模型交互可以削弱甚至逆转这种效应,最终损害长期对齐。

英文摘要

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

2605.29262 2026-05-29 cs.AI 版本更新

Harmonizing Real-Time Constraints and Long-Horizon Reasoning: An Asynchronous Agentic Framework for Dynamic Scheduling

协调实时约束与长视距推理:一种用于动态调度的异步智能体框架

Shijie Cao, Yuan Yuan, Jing Liu

Affiliations: School of Computer Science and Engineering, Beihang University, Beijing 100191, China; Shenzhen Loop Area Institute, Shenzhen, China; Qingdao Research Institute, Beihang University; Hangzhou Innovation Institute, Beihang University; School of Artificial Intelligence, Xidian University, Xi’an 710071, Shaanxi, China; Guangzhou Institute of Technology, Xidian University, Guangzhou 510555, Guangdong, China

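AI summary: Proposes RACE-Sched, an asynchronous agentic framework whose dual-stream architecture decouples policy execution from logical reasoning and uses an LLM to synthesize and validate symbolic heuristic rules, improving dynamic scheduling quality while preserving real-time responsiveness.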

详情
AI中文摘要

动态柔性作业车间调度问题(DFJSP)需要在即时响应随机扰动与全局优化生产目标之间进行权衡。传统的优先级规则在处理复杂扰动时灵活性不足,而基于学习的方法往往牺牲可解释性或难以跨问题规模泛化。尽管大语言模型(LLM)提供了高级推理能力以弥合这一差距,但其显著的推理延迟与工业控制系统的毫秒级决策周期不兼容。为解决这一冲突,我们引入了RACE-Sched,一种异步智能体框架,通过双流架构将策略执行与逻辑推理解耦。反应流执行低延迟的符号启发式规则以实现实时调度,而并行的深思流利用LLM合成、验证和演化这些规则。候选规则在沙箱中经过严格测试,并通过原子更新部署,确保安全且不阻塞控制循环。此外,语义规则库索引已验证的启发式规则,用于基于检索的初始化,从而增强跨问题规模的可迁移性。在GEN-Bench、MK-Bench和JMS-Bench上的广泛评估表明,RACE-Sched优于领先的深度强化学习和其他基于LLM的基线方法。该方法协调了实时约束与长视距推理,实现了更优的解决方案质量和对动态事件的鲁棒适应。

英文摘要

The Dynamic Flexible Job Shop Scheduling Problem (DFJSP) necessitates a trade-off between instant reaction to stochastic disturbances and global optimization of production goals. Conventional priority rules are insufficiently flexible to handle complex disruptions, whereas learning-based approaches often compromise interpretability or fail to generalize across problem scales. Although Large Language Models (LLMs) offer advanced reasoning capabilities to bridge this gap, their substantial inference latency is incompatible with the millisecond-level decision cycles of industrial control systems. To resolve this conflict, we introduce RACE-Sched, an asynchronous agent-based framework that decouples policy execution from logical reasoning via a dual-stream architecture. The Reactive Stream executes low-latency symbolic heuristics to enable real-time dispatching, while the parallel Deliberative Stream leverages an LLM to synthesize, validate, and evolve these rules. Candidate rules undergo rigorous testing in a sandbox and are deployed via atomic updates, ensuring safety without blocking the control loop. Additionally, a semantic rule repository indexes validated heuristics for retrieval-based initialization which enhances transferability across problem scales. Extensive evaluations on GEN-Bench, MK-Bench, and JMS-Bench demonstrate that RACE-Sched outperforms leading Deep Reinforcement Learning and other LLM-based baselines. This approach harmonizes real-time constraints with long-horizon reasoning to achieve superior solution quality and robust adaptation to dynamic events.
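The sandbox-then-atomic-update idea can be illustrated with a small sketch. Everything here is an assumption made for illustration: the `RuleHolder` class, the single-machine simulator, and the toy jobs are hypothetical, and the candidate rule is hand-written where the paper has an LLM synthesize it.

```python
import threading

class RuleHolder:
    """Holds the active dispatching rule; swaps are atomic under a lock."""
    def __init__(self, rule):
        self._rule = rule
        self._lock = threading.Lock()

    def get(self):
        with self._lock:
            return self._rule

    def swap_if_valid(self, candidate, sandbox_jobs, baseline_cost):
        """Deploy `candidate` only if it beats the baseline in the sandbox."""
        if total_completion_time(sandbox_jobs, candidate) < baseline_cost:
            with self._lock:            # the reactive side never sees a half-update
                self._rule = candidate
            return True
        return False

def total_completion_time(jobs, rule):
    """Toy single-machine simulator (arrival times only order jobs under FIFO)."""
    clock, total = 0, 0
    for job in sorted(jobs, key=rule):
        clock += job["duration"]
        total += clock
    return total

fifo = lambda job: job["arrival"]       # initial reactive rule
spt = lambda job: job["duration"]       # candidate: shortest processing time first

jobs = [{"arrival": 0, "duration": 5},
        {"arrival": 1, "duration": 1},
        {"arrival": 2, "duration": 2}]

holder = RuleHolder(fifo)
baseline = total_completion_time(jobs, fifo)
deployed = holder.swap_if_valid(spt, jobs, baseline)
```

The reactive loop only ever calls `holder.get()`, so validating a slow candidate in the sandbox does not block dispatching.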

2605.29259 2026-05-29 cs.LG cs.AI 版本更新

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

KLAS:利用相似性拼接神经网络以改进精度-效率权衡

Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain, Akshay Jajoo, Myungjin Lee, Clayton Kerce, Alexey Tumanov

Affiliations: Georgia Institute of Technology; Microsoft M365 Research; Cisco Research; Georgia Tech Research Institute

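AI summary: Proposes KLAS, a framework that uses KL divergence to measure the similarity of intermediate representations and automatically select the best stitch configurations, improving the accuracy-efficiency curve of stitched models at the same finetuning cost.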

详情
AI中文摘要

鉴于部署目标的广泛性,灵活模型选择对于在给定计算预算内优化性能至关重要。最近的研究表明,在模型家族内拼接预训练模型能够实现精度-效率权衡空间的成本效益插值。拼接将一个预训练模型的中间激活变换到另一个模型,生成新的插值拼接网络。这类网络沿精度-效率谱提供了部署选项池。然而,现有拼接方法往往产生次优权衡且缺乏泛化性,因为它们主要依赖启发式方法选择拼接配置。我们认为,构建改进的精度-效率权衡需要显式捕获并利用被拼接预训练模型之间的相似性。为此,我们引入KLAS,一种新颖的拼接选择框架,通过利用中间表示之间的KL散度,自动化和泛化跨模型家族的拼接选择。KLAS从$O(k^2n^2)$种可能性中为$k$个深度为$n$的预训练模型识别最有前景的二元拼接。通过全面实验,我们证明KLAS在相同微调成本下改进了拼接模型的精度-效率曲线,与基线相比,KLAS在相同计算成本下实现了高达$1.21\%$的ImageNet-1K top-1准确率提升,或在保持准确率的同时将FLOPs降低$1.33\times$。

英文摘要

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
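To make the selection step concrete, the sketch below ranks candidate binary stitches by the KL divergence between layer activations. It is a rough illustration under assumptions the abstract does not confirm: activations are turned into distributions with a softmax, lower KL is treated as more promising, and the model names and vectors are made up.

```python
import math
from itertools import permutations

def softmax(values):
    """Normalize an activation vector into a probability distribution."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions with full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def rank_stitches(activations):
    """activations: {model_name: [layer_vector, ...]}; returns stitches sorted by KL."""
    candidates = []
    for src, dst in permutations(activations, 2):     # ordered model pairs
        for i, a in enumerate(activations[src]):
            for j, b in enumerate(activations[dst]):
                kl = kl_divergence(softmax(a), softmax(b))
                candidates.append((kl, src, i, dst, j))
    return sorted(candidates)

# Hypothetical activations for two models with two layers each.
activations = {
    "small": [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0]],
    "large": [[0.0, 2.0, 0.0], [0.5, 0.5, 3.0]],
}
ranked = rank_stitches(activations)
```

With $k$ models of depth $n$ the loop enumerates $k(k-1)n^2$ ordered candidates, which matches the $O(k^2n^2)$ search space the abstract mentions.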

2605.29256 2026-05-29 cs.CL cs.AI 版本更新

DynSess: Dynamic Session-Level Evaluation and Optimization Framework for Role-Playing Agents

DynSess:面向角色扮演智能体的动态会话级评估与优化框架

Rongsheng Zhang, Jiji Tang, Junnan Ren, Zuyi Bao, Weijie Chen, Ruofan Hu, Zhou Zhao, Tangjie Lv, Yan Zhang

Affiliations: Zhejiang University; Fuxi AI Lab, NetEase Inc.; Xiamen University

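AI summary: Proposes DynSess, a unified session-level framework that combines session-level evaluation (DynSess-Eval) with training-trajectory optimization based on multi-turn lookahead search (DSPO/GSRPO) to improve the long-horizon consistency and interaction quality of role-playing agents.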

详情
AI中文摘要

基于大型语言模型的角色扮演本质上是一个会话级任务,要求智能体在扩展的多轮对话中维持角色身份和交互质量。然而,现有的评估和优化方法大多停留在轮次级别,无法捕捉长程质量。我们提出DynSess,一个统一的会话级角色扮演智能体框架。DynSess-Eval通过针对长程行为的评分标准对完整对话会话进行评分。利用其会话级奖励,我们通过多轮前瞻搜索构建高质量训练轨迹,并训练DynSess-Character的两个互补变体:DSPO(离策略)和GSRPO(在策略)。实验表明,DynSess-Eval与人类判断的一致性显著优于先前的评估器,盲评人工评估进一步显示,尽管参数少得多,DynSess-Character仍能与最强角色模型匹配,同时保持强大的角色一致性和交互能力。我们的数据集和代码将发布以促进未来研究。

英文摘要

Role-playing with large language models is fundamentally a session-level task, requiring agents to sustain character identity and interaction quality across extended multi-turn conversations. Yet existing evaluation and optimization methods remain largely turn-level, failing to capture long-horizon quality. We propose DynSess, a unified session-level framework for role-playing agents. DynSess-Eval scores complete dialogue sessions via rubrics targeting long-horizon behaviors. Leveraging its session-level rewards, we construct high-quality training trajectories through multi-turn lookahead search and train DynSess-Character with two complementary variants: DSPO (off-policy) and GSRPO (on-policy). Experiments show that DynSess-Eval aligns with human judgments substantially better than prior evaluators, and blind human evaluation further shows that DynSess-Character matches the strongest character model despite using substantially fewer parameters, while maintaining strong role consistency and interactive ability. Our dataset and code will be released to facilitate future research.

2605.29254 2026-05-29 cs.RO cs.AI 版本更新

Extreme dynamic symmetry enables omnidirectional and multifunctional robots

极端动态对称性实现全向多功能机器人

Jiaxun Liu, Boxi Xia, Boyuan Chen

Affiliations: Department of Mechanical Engineering and Materials Science, Duke University; Department of Electrical and Computer Engineering, Duke University; Department of Computer Science, Duke University

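AI summary: Introduces the concept of dynamic symmetry, measured by dynamic isotropy; across more than 1000 simulated morphologies, higher dynamic symmetry improves trajectory tracking, task success, robustness, and related measures. The Argus family of spherical robots is developed to validate the orientation-invariant locomotion, terrain traversal, rapid self-stabilization, and fault resilience that near-extreme dynamic isotropy provides.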

Comments Published in Science Robotics (2026). Our project website is at: https://generalroboticslab.com/Argus

详情
Journal ref
Science Robotics 11, eaec1725 (2026)
AI中文摘要

对称性是自然系统中的核心组织原则,但其作为机器人统一设计策略的应用仍主要局限于几何形态。我们证明,对称性可以在动态驱动能力层面加以利用。我们引入动态对称性,即机器人可达质心加速度的均匀性,并通过称为动态各向同性的度量将其形式化。在超过1000种模拟形态中,我们发现更高的动态对称性持续改善了轨迹跟踪、任务成功率、鲁棒性、恢复能力和能量效率,且当动态各向同性接近其理论极限时,效益最为显著。为了系统地研究这一机制,我们开发了Argus,一系列球形机器人,旨在探索增加动态对称性的效果。Argus家族的成员在驱动几何和动态对称性水平上有所不同,但共享一个共同架构原则:径向定向的线性致动器直接塑造机器人的质心动力学。其中,我们构建了一个物理的20腿Argus变体,实现了接近极端的动态各向同性,并展示了方向无关的运动、在杂乱和可变形地形上的敏捷穿越、快速自稳定以及对部分致动器故障的鲁棒性。其分布式感知进一步实现了在连续运动中的全向感知和物体交互。这些结果表明,不仅在形态上而且在可达动力学上设计机器人的对称性,为在不确定的地球和地外环境中实现敏捷性、鲁棒性和多功能性提供了一条强大且通用的途径。

英文摘要

Symmetry is a central organizing principle in natural systems, yet its use as a unifying design strategy in robotics has largely remained limited to geometric form. We show that symmetry can instead be leveraged at the level of dynamic actuation capability. We introduce dynamic symmetry, the uniformity of a robot's attainable center-of-mass accelerations, and formalize it through a measure coined as dynamic isotropy. Across more than 1000 simulated morphologies, we found that higher dynamic symmetry consistently improved trajectory tracking, task success, robustness, resiliency, and energy efficiency, with the benefits becoming most pronounced as dynamic isotropy approached its theoretical limit. To study this regime systematically, we developed Argus, a family of spherical robots designed to explore the effects of increasing dynamic symmetry. Members of the Argus family vary in their actuation geometry and dynamic symmetry level while sharing a common architectural principle: radially oriented linear actuators that directly shape the robot's center-of-mass dynamics. Among them, we built a physical 20-leg Argus variant that achieved near-extreme dynamic isotropy and demonstrated orientation-invariant locomotion, agile traversal of cluttered and deformable terrain, rapid self-stabilization, and resilience to partial actuator failures. Its distributed sensing further enabled omnidirectional perception and object interaction during continuous motion. These results show that designing robots for symmetry not only in morphology but also in their attainable dynamics provides a powerful and general pathway toward agility, robustness, and multifunctionality in uncertain terrestrial and extraterrestrial environments.

2605.29253 2026-05-29 cs.AI 版本更新

OpenClawBench: Benchmarking Process-side Anomalies in Real-world Agent Execution Trajectories

OpenClawBench: 真实智能体执行轨迹中过程侧异常的基准测试

Yibing Liu, Yangze Liu, Xiaolong Yin, Bin Wang, Chong Zhang, Hao Yin, Zhongyi Han

Affiliations: School of Software, Shandong University; School of Artificial Intelligence, Nanjing University; State Key Laboratory of Novel Software Technology, Nanjing University; Center for Medical Artificial Intelligence; Qingdao Academy of Chinese Medical Sciences; Institute of Marine Traditional Chinese Medicine, Shandong University of Traditional Chinese Medicine; School of Software Engineering, Sichuan University

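AI summary: Presents the OpenClawBench dataset, which uses the FullTax annotation framework to quantify process-side anomalies in agent execution and exposes the limits of outcome-only evaluation.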

Comments 37 pages, 1 figure, 43 tables

详情
AI中文摘要

任务成功可能掩盖真实智能体执行中的过程异常。智能体可能通过最终任务测试,但过程中仍累积未解决的歧义、不安全的外部写入、被忽略的错误、弱化的承诺或能力边界过度承诺。我们将这种不匹配研究为结果-过程差距,并引入OpenClawBench,这是一个用于测量和监督真实智能体执行过程中过程侧异常的大规模数据集。OpenClawBench基于由6个源模型生成的BFCL驱动的OpenClaw会话构建,包含31,264条带注释的轨迹。它将任务测试结果与结构化过程证据对齐。FullTax将对齐的轨迹转换为结构化异常监督:二元标签、支持证据、起始/跨度定位、严重性、可恢复性以及一个5类异常分类法。使用OpenClawBench,我们使结果-过程差距变得可测量。在31,135次通过测试的执行中,有2,904次在FullTax下被标记为过程异常。这些结果表明,仅基于成功的评估忽略了真实智能体执行中一类具体的过程侧失败。基于高置信度FullTax监督池训练的LoRA微调Gemma 3 12B检测器,在更干净标签的保留测试集上达到了二元F1=0.729。总之,OpenClawBench将真实智能体执行日志转化为可审计和可复用的监督,用于研究、诊断和操作监控运行时智能体可靠性。

英文摘要

Task success can hide process anomalies in real-world agent executions. An agent may pass the final task oracle while still accumulating unresolved ambiguity, unsafe external writes, ignored errors, weakly grounded commitments, or capability-boundary overcommitment. We study this mismatch as the Outcome-Process Gap and introduce OpenClawBench, a large-scale dataset for measuring and supervising process-side anomalies in real agent execution processes. OpenClawBench is built from BFCL-driven OpenClaw sessions produced by 6 source models and contains 31,264 annotated trajectories. It aligns task-oracle outcomes with structured process evidence. FullTax converts the aligned trajectories into structured anomaly supervision: binary labels, supporting evidence, onset/span localization, severity, recoverability, and a 5-class anomaly taxonomy. Using OpenClawBench, we make the Outcome-Process Gap measurable. Among 31,135 oracle-passing executions, 2,904 are still labeled process-anomalous under FullTax. These results show that success-only evaluation misses a concrete class of process-side failures in real agent executions. A LoRA-fine-tuned Gemma 3 12B detector trained on the high-confidence FullTax supervised pool reaches binary F1=0.729 on the cleaner-labels held-out test split. Together, OpenClawBench turns real agent execution logs into auditable and reusable supervision for studying, diagnosing, and operationally monitoring runtime agent reliability.
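The size of the Outcome-Process Gap follows directly from the two counts quoted above. The snippet below works through that arithmetic; the F1 helper is included only to recall how the reported metric is defined, and its confusion counts in any usage would be illustrative, since the abstract does not give them.

```python
oracle_passing = 31_135      # executions that passed the task oracle (from the abstract)
process_anomalous = 2_904    # of those, labeled process-anomalous under FullTax

gap_rate = process_anomalous / oracle_passing
print(f"{gap_rate:.1%}")     # prints 9.3%

def f1_score(tp, fp, fn):
    """Binary F1 from confusion counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0
```

In other words, roughly one in eleven executions that a success-only evaluation would count as clean still carries a labeled process anomaly.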

2605.29251 2026-05-29 cs.AI cs.CR 版本更新

Provably Secure Agent Guardrail

可证明安全的智能体护栏

Benlong Wu, Weiming Zhang, Kejiang Chen, Han Fang, Nenghai Yu

Affiliations: University of Science and Technology of China

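AI summary: Because existing semantic guardrails cannot provide deterministic security lower bounds, the paper proposes a new security paradigm based on the fundamental limitations of logical reasoning and introduces the executable Proof-Constrained Action (ePCA) framework, whose neural-symbolic isolation architecture achieves zero attack success rate and zero false positive rate in the evaluated scenarios.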

详情
AI中文摘要

随着大语言模型从有限生成引擎转变为具有广泛执行权限的智能体,人工智能失控引发了人工智能安全的基本危机。现有的防御架构严重依赖经验性语义护栏和概率性大模型裁决器,这些机制在面对复杂的语义符号解耦攻击时无法提供确定性的安全下界。为了克服这种经验性语义护栏困境,本文提出了一种基于逻辑推理基本限制的智能体安全新范式。基于该范式,我们进一步引入了一种具有神经符号隔离架构的可执行证明约束动作(ePCA)框架。该框架放弃了对自然语言的语义信任,迫使智能体在执行物理操作之前将其意图无损地形式化为一阶逻辑数学约束。宏观和微观二维动态对抗系统的实证评估表明,我们的形式化验证机制在评估场景中实现了零攻击成功率和零误报率,且计算延迟极低。这项研究为在明确系统假设下构建未来智能系统的基础防御提供了条件性的形式化基础和工程范式。

英文摘要

As large language models transition from bounded generative engines to agents with expansive execution privileges, AI going out of control precipitates a fundamental crisis in artificial intelligence security. Existing defense architectures heavily rely on empirical semantic guardrails and probabilistic large model adjudicators, mechanisms that fail to provide deterministic security lower bounds when facing complex semantic symbol decoupling attacks. To overcome this empirical semantic guardrail dilemma, this paper proposes a new security paradigm for agents based on the fundamental limitations of logical reasoning. Based on this paradigm, we further introduce an executable Proof-Constrained Action (ePCA) framework with a neural symbolic isolation architecture. This framework abandons semantic trust in natural language, forcing agents to losslessly formalize their intentions into first-order logical mathematical constraints before performing physical operations. Empirical evaluations of macroscopic and microscopic two-dimensional dynamic adversarial systems demonstrate that our formal verification mechanism achieves zero attack success rate and zero false positive rate across the evaluated scenarios, with extremely low computational latency. This research provides a conditional formal foundation under explicit system assumptions and an engineering paradigm for constructing the underlying defense foundation for future intelligent systems.
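The general pattern of checking an action against explicit constraints before executing it can be sketched as below. This is a loose analogy, not the ePCA framework: plain Python predicates stand in for first-order logic constraints, there is no formal proof step, and the action schema, paths, and constraint names are hypothetical.

```python
def within_workspace(action):
    """Constraint: the target path must stay inside the sandbox directory."""
    return action["path"].startswith("/sandbox/") and ".." not in action["path"]

def not_destructive(action):
    """Constraint: deletion is never permitted."""
    return action["op"] != "delete"

CONSTRAINTS = [within_workspace, not_destructive]

def gated_execute(action, executor):
    """Run `executor` only if every constraint holds; otherwise refuse."""
    violated = [check.__name__ for check in CONSTRAINTS if not check(action)]
    if violated:
        return {"executed": False, "violated": violated}
    return {"executed": True, "result": executor(action)}

allowed = gated_execute({"op": "write", "path": "/sandbox/notes.txt"}, lambda a: "ok")
blocked = gated_execute({"op": "delete", "path": "/etc/passwd"}, lambda a: "ok")
```

The gate inspects the structured action rather than the natural-language request, which is the part of the idea that carries over from the abstract.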

2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval:跨异构知识源的统一检索

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

Affiliations: KAIST

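AI summary: Proposes OmniRetrieval, a framework that takes a natural-language query, identifies the appropriate knowledge sources, and dispatches source-native queries to their native execution engines; it exceeds single-source baselines across 13 datasets and 309 knowledge bases, unifying retrieval over heterogeneous knowledge sources.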

详情
AI中文摘要

现实世界的信息需求需要访问结构多样的知识源,从非结构化文本和关系表到知识图谱和属性图。然而,现有的检索器一次只在一个源上操作,使用固定的查询语言,使得可用知识的更广泛图景被不兼容的接口所分割。一种自然的统一尝试是将这些源折叠到一个共享空间中,但这会抹去每个源的结构性优势(如模式、本体、组合操作符),而这些优势赋予了每个源其表达能力。因此,对多样化知识的有效检索需要的不是同质化,而是一个能够按每个源自身条件与其交互的总体层。为了实现这一点,我们提出了OmniRetrieval,一个框架,它接受任何自然语言查询,识别合适的知识源,并将源原生查询分派到其本地执行引擎。在涵盖文本、关系和图结构源的13个数据集和309个不同知识库的广泛基准测试中,OmniRetrieval超过了单源基线,证明了它可以作为异构源的通用接口,同时保留使每个源有价值的结构差异。

英文摘要

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.

2605.29247 2026-05-29 cs.AI cs.CL cs.LG 版本更新

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

DenseSteer: 引导小型语言模型进行密集数学推理

Yang Ouyang, Shuhang Lin, Jung-Eun Kim

Affiliations: North Carolina State University; Rutgers University

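AI summary: Proposes DenseSteer, a training-free inference-time steering framework that modulates internal representations toward dense reasoning patterns, improving small-model accuracy on multi-step math reasoning.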

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)展现出强大的链式推理(CoT)能力,而较小的模型(≤3B参数)在多步推理任务上表现显著不佳。基于对Qwen-2.5模型系列在数学推理基准上的实证分析,我们发现更熟练的推理与更少的推理步骤但每步更高的信息密度相关,我们将此属性称为密集推理。受此观察启发,我们提出了DenseSteer,一种无需训练的推理时引导框架,通过将内部表征调节至密集推理模式来增强小型模型的推理能力。实验表明,我们的方法在不增加词级负对数似然的情况下,持续提高了准确性,突显了密集推理作为数学问题求解的一种有效结构方法。

英文摘要

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (≤3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level negative log-likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.
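Inference-time steering is often implemented by adding a direction vector to a hidden state. The sketch below shows that generic recipe with a difference-of-means direction; it is an assumption about how such steering can work in principle, not a description of DenseSteer, and all names and numbers are illustrative.

```python
def mean_vector(vectors):
    """Element-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def steering_vector(dense_acts, verbose_acts):
    """Difference of mean activations: points from verbose toward dense traces."""
    dense_mean = mean_vector(dense_acts)
    verbose_mean = mean_vector(verbose_acts)
    return [d - v for d, v in zip(dense_mean, verbose_mean)]

def steer(hidden, direction, alpha=1.0):
    """Shift one hidden state along the steering direction at inference time."""
    return [h + alpha * d for h, d in zip(hidden, direction)]

# Toy 2-dimensional activations collected from dense and verbose reasoning traces.
direction = steering_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [2.0, 2.0]])
steered = steer([0.5, 0.5], direction, alpha=0.5)
```

Because only activations are shifted, no weights change, which is what makes this family of methods training-free.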

2605.29243 2026-05-29 cs.CL cs.AI cs.CY 版本更新

Wait! There's a Way Out: A Decision Mechanism for Forecasting Conversational Derailment

等等!有出路:一种预测对话偏离的决策机制

Laerdon Kim, Vivian Nguyen, Cristian Danescu-Niculescu-Mizil

Affiliations: Cornell University

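AI summary: Proposes a deferral mechanism based on forward-looking simulation that, when forecasting conversational derailment, assesses whether a tense moment can plausibly recover, reducing false positives while preserving forecasting accuracy.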

Comments To appear in the Proceedings of ACL 2026

详情
AI中文摘要

预测对话偏离的任务是,在对话进行中预测其最终是否会偏离为人身攻击。由于预测模型以在线方式运行,它们必须在每轮发言后决定是否“触发”警报——例如,通知参与者或主持人对话有偏离风险。现有方法仅根据先前发言估计的偏离可能性做出这一决定,隐含假设对话的未来轨迹是固定的。因此,它们忽略了未来恢复的可能性,并导致不必要的高误报率。在这项工作中,我们提出了一种将触发决策与偏离可能性估计解耦的方法。我们的方法受该任务第一个人类基线的启发,该基线表明,人类通过选择性地推迟触发决策(当他们预计紧张局势可能缓解时),实现了显著更低的误报率。我们通过一种延迟机制来操作这一见解,该机制使用前瞻性模拟来评估紧张时刻是否存在合理的恢复路径。将这一机制整合到最先进的预测模型中,可以在不牺牲预测准确性的情况下大幅减少误报。更广泛地说,这项工作强调了将决策制定视为预测系统的一等组成部分的价值。

英文摘要

Forecasting conversational derailment is the task of predicting, as the conversation unfolds, whether it will eventually derail into personal attacks. Since forecasting models operate in an online fashion, they must decide whether to "trigger" an alert after each utterance (for example, to notify participants or a moderator that the conversation is at risk of derailing). Existing approaches make this decision solely based on the estimated likelihood of derailment given the preceding utterances, implicitly assuming that the conversation's future trajectory is fixed. As a result, they ignore the possibility of future recovery and incur an unnecessarily high rate of false positives. In this work, we propose a method for decoupling the decision to trigger from derailment likelihood estimation. Our approach is inspired by the first human baseline on this task, which shows that humans achieve dramatically lower false positive rates by selectively deferring their decision to trigger when they anticipate that tension is likely to subside. We operationalize this insight with a deferral mechanism that uses forward-looking simulations to assess whether a tense moment admits plausible paths to recovery. Incorporating this mechanism into a state-of-the-art forecasting model substantially reduces false positives without sacrificing forecasting accuracy. More broadly, this work highlights the value of treating decision-making as a first-class component of forecasting systems.
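Decoupling the trigger decision from the likelihood estimate can be sketched as a two-stage check. The thresholds, the number of simulations, and the `simulate_next` callback (a stand-in for a generative model that rolls the conversation forward and reports whether it recovers) are all assumptions made for illustration.

```python
import random

def should_trigger(derail_prob, simulate_next, context, threshold=0.5,
                   n_sims=20, max_recovery=0.3, seed=0):
    """Trigger only if the estimate is high AND few simulated futures recover."""
    if derail_prob < threshold:
        return False                          # not tense enough to consider alerting
    rng = random.Random(seed)
    recovered = sum(simulate_next(context, rng) for _ in range(n_sims))
    return recovered / n_sims <= max_recovery     # defer when recovery looks plausible

# Deterministic toy simulators standing in for forward-looking rollouts.
always_recovers = lambda context, rng: True
never_recovers = lambda context, rng: False

deferred = should_trigger(0.9, always_recovers, context=[])
alerted = should_trigger(0.9, never_recovers, context=[])
```

A likelihood-only forecaster would alert in both cases; the deferral stage is what separates the tense-but-recoverable moment from the one with no plausible way out.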

2605.29240 2026-05-29 cs.AI cs.CL cs.HC cs.IR 版本更新

Surfacing Isolated Learners with Outcome-Independent Mediation of Feedback between Teachers and Students Using AI

使用AI在教师与学生之间进行结果无关的反馈中介来发现孤立学习者

Junsoo Park, Youssef Medhat, Htet Phyo Wai, Ploy Thajchayapong, Ashok K. Goel

Affiliations: Georgia Institute of Technology

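AI summary: Proposes an interpretable, grade-free decision layer that prioritizes course topics by integrating three signals (prevalence of student difficulty, disagreement between self-reported and observed difficulty, and unresolved teacher concerns) to support timely instructional decisions.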

Comments Accepted to HAI-Agency Workshop on Orchestrating Human and AI Agency for Proactive and Reflective Learning

详情
AI中文摘要

AI增强的课堂在成绩结果可用之前就生成了丰富的教师和学生反馈,但这些信号难以转化为及时的教学决策。我们提出一个可解释的决策层:一种透明机制,无需使用成绩或事后结果标签即可对需要关注的课程主题进行排序。该方法结合了三个信号:学生学习困难普遍性、学习者自我报告与观察到的困难之间的不一致,以及未解决的教师关注点。输出是一个按优先级排序的主题集,每个主题附有解释其排序的决策记录。在一门研究生CS课程($n=5$次教师访谈;$n=279$份调查回复)中,优先主题与教师关注点一致(top-5重叠3/5;Spearman $ρ=0.80$),并与学生报告的主题困难相关($ρ=0.46$, $p=.048$)。多信号整合还发现了仅通过单个信号源未能识别的学习者(AUC $=0.96$ vs. 仅差距普遍性的$0.91$)。反思性思维、求助行为和自我效能感提供了额外证据,表明学生行为信号与学习相关构念一致。尽管是初步结果,这些发现表明,当反馈不完整时,透明的协调机制可能有助于支持人机协同。

英文摘要

AI-augmented classrooms generate rich teacher and student feedback before graded outcomes become available, yet these signals can be difficult to translate into timely instructional decisions. We propose an interpretable decision layer: a transparent mechanism that ranks course topics requiring attention without using grades or post-hoc outcome labels. The approach combines three signals: student learning difficulty prevalence, disagreement between learner self-reports and observed difficulties, and unresolved teacher concerns. The output is a ranked set of topic priorities with per-topic decision records explaining each ranking. In one graduate CS course offering ($n=5$ instructor interviews; $n=279$ survey responses), prioritized topics aligned with instructor concerns (top-5 overlap 3/5; Spearman $\rho=0.80$) and student-reported topic difficulty ($\rho=0.46$, $p=.048$). Multi-signal integration also surfaced learners not identified through individual signal sources alone (AUC $=0.96$ vs. $0.91$ for gap prevalence alone). Reflective thinking, help-seeking, and self-efficacy provided additional evidence that student behavioral signals align with learning-related constructs. While preliminary, these findings suggest that transparent coordination mechanisms may help support human-AI co-agency when feedback is incomplete.
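A ranking that combines the three signals and keeps a per-topic decision record might look like the sketch below. The equal weights, the assumption that every signal is scaled to [0, 1], and the topic names are illustrative choices; the abstract does not specify how the signals are combined.

```python
def rank_topics(signals, weights=(1/3, 1/3, 1/3)):
    """signals: {topic: (prevalence, disagreement, teacher_concern)}, each in [0, 1]."""
    records = []
    for topic, values in signals.items():
        score = sum(w * v for w, v in zip(weights, values))
        records.append({
            "topic": topic,
            "score": round(score, 3),
            # The raw signals are kept so each ranking can be explained afterwards.
            "prevalence": values[0],
            "disagreement": values[1],
            "teacher_concern": values[2],
        })
    return sorted(records, key=lambda record: record["score"], reverse=True)

# Hypothetical signal values for three course topics; no grades are used.
ranked = rank_topics({
    "recursion": (0.8, 0.6, 1.0),
    "loops": (0.2, 0.1, 0.0),
    "pointers": (0.5, 0.9, 0.0),
})
```

Each returned record doubles as a decision record: an instructor can see which signal pushed a topic up the list.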

2605.29234 2026-05-29 cs.AI cs.IR 版本更新

Rethinking Literature Search Evaluation: Deep Research Helps, and Human Citation Lists Are Not a Ground Truth

重新思考文献检索评估:深度研究有帮助,且人类引用列表并非金标准

Gaurav Sahu, Laurent Charlin, Christopher Pal

Affiliations: Mila – Quebec AI Institute; HEC Montréal; ServiceNow Research; Canada CIFAR AI Chair; Université de Montréal; Polytechnique Montréal

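AI summary: Improves the retrieval pipeline and tests how reliable human citation lists are as an evaluation target; a Deep Research pipeline substantially raises recall, while only 51% of human citations are judged moderately relevant or higher, so the paper recommends multi-axis evaluation.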

详情
AI中文摘要

我们从两个互补角度研究大规模文献检索:改进检索流程,以及压力测试人类参考文献列表作为评估目标。首先,我们实现了一个深度研究管道,处理完整查询论文并沿其参考文献广度优先扩展检索结果,表明其显著优于纯API搜索,将RollingEval-Jun25(一个250篇论文的文献检索基准)上的召回率从低于20%提升至高于80%。其次,我们使用中立的LLM作为裁判来判断人类参考文献是否是任务的金标准。我们发现显著局限性:只有51%的人类引用被判定为中等相关或更高,而最强AI重排序器为86-88%。我们在OpenAlex合著图上研究这一差距,发现人类引用直接合作者的可能性比最佳AI重排序器高2.5倍。综合来看,我们的结果反对单一轴线的文献检索评估:召回率、主题相关性评分、排序列表多样性和合著距离诊断各自衡量引用质量的互补属性,应联合报告。

英文摘要

We study large-scale literature search from two complementary angles: improving the retrieval pipeline, and stress-testing the human reference list as an evaluation target. First, we implement a Deep Research pipeline that processes the full query paper and expands the retrieved results breadth-first along their bibliographies, and show that it substantially outperforms vanilla API-only search, raising recall on RollingEval-Jun25 (a 250-paper literature-search benchmark) from below 20% to above 80%. Second, we use a neutral LLM-as-a-judge to determine if human references are sound ground truth for the task. We find significant limitations: only 51% of human citations are judged moderately relevant or higher, against 86–88% for the strongest AI-based re-rankers. We study this gap on the OpenAlex co-authorship graph, finding that humans are 2.5x more likely than the best AI re-rankers to cite a direct collaborator. Together, our results argue against single-axis literature-search evaluation: recall, topical-relevance scoring, ranked-list diversity, and a co-authorship-distance diagnostic each measure complementary properties of citation quality and should be reported jointly.
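Breadth-first expansion along bibliographies, followed by a recall check against a reference list, can be sketched as follows. The citation graph, the depth limit, and the paper identifiers are hypothetical; the actual pipeline's retrieval and re-ranking stages are not modeled.

```python
from collections import deque

def expand_breadth_first(seeds, bibliography, max_depth=2):
    """Collect papers reachable from `seeds` by following reference lists."""
    seen = set(seeds)
    queue = deque((paper, 0) for paper in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue                      # do not expand beyond the depth limit
        for cited in bibliography.get(paper, []):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, depth + 1))
    return seen

def recall(retrieved, gold):
    """Fraction of gold references that were retrieved."""
    return len(set(retrieved) & set(gold)) / len(gold) if gold else 0.0

# Toy citation graph: A cites B and C, B cites D, D cites E.
bibliography = {"A": ["B", "C"], "B": ["D"], "D": ["E"]}
pool = expand_breadth_first(["A"], bibliography, max_depth=2)
pool_recall = recall(pool, ["B", "D", "E", "F"])
```

Recall measured this way inherits whatever bias the gold list has, which is exactly the concern the second half of the abstract raises about human citation lists.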

2605.29230 2026-05-29 cs.CV cs.AI 版本更新

Toward Ethical Facial Age Estimation: A Generalized Zero-Shot Benchmark Without Training on Children's Data

面向道德的面部年龄估计:无需儿童数据训练的广义零样本基准

Caio Petrucci, Leo Sampaio Ferraz Ribeiro, Sandra Avila

Affiliations: New York University

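AI summary: Proposes a generalized zero-shot benchmark that excludes children's data from training and evaluates generalization to unseen age groups; all evaluated methods show substantial performance degradation and seen-class bias.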

Comments 12 pages; 3 figures; 5 tables

详情
AI中文摘要

从面部图像进行年龄估计通常依赖于包含未成年人图像的训练数据,这种做法引发了严重的伦理、法律和隐私问题。在这项工作中,我们提出了一个用于面部年龄估计的广义零样本基准,该基准在训练时明确排除儿童数据,同时仍评估模型在年轻人群上的性能。我们重新审视了六个广泛使用的数据集,并引入了具有严格年龄组划分的标准化分割:18-59岁的样本用于训练、验证和测试;18岁以下的样本仅保留用于零样本评估;60岁以上的样本作为分布偏移下模型选择的未见验证集。对于具有身份注释的数据集,基于主体的分割防止了身份泄露,并更好地反映了实际部署条件。在此协议下评估九种最先进的年龄估计方法,结果表明所有评估方法均无法泛化到未见年龄组,性能相对于监督基线平均下降46.4%,最高达52.8%。此外,模型并非简单退化:它们系统性地将未见年龄的预测锚定到附近的可见类别,这是广义零样本学习中众所周知的可见类偏见的体现。通过将无儿童数据的年龄估计形式化为现有数据集上的广义零样本基准,这项工作突出了当前建模实践与现实伦理约束之间的关键差距。我们的基准为在受限数据制度下评估模型提供了原则性基础,并鼓励开发对分布偏移鲁棒且符合负责任数据使用的方法。

英文摘要

Age estimation from facial images typically relies on training data that includes images of minors, a practice that raises serious ethical, legal, and privacy concerns. In this work, we propose a generalized zero-shot benchmark for facial age estimation that explicitly excludes children's data during training while still assessing model performance on younger populations. We revisit six widely used datasets and introduce standardized splits with strict age-group separation: samples aged 18-59 for training, validation, and testing; samples under 18 reserved exclusively for zero-shot evaluation; and samples aged 60+ as an unseen validation set for model selection under distribution shift. For datasets with identity annotations, subject-exclusive splits prevent identity leakage and better reflect real-world deployment conditions. Evaluating nine state-of-the-art age estimation methods under this protocol reveals that all evaluated methods consistently fail to generalize to unseen age groups, suffering substantial performance degradation (46.4% on average and up to 52.8%) relative to the supervised baseline. Moreover, models do not simply degrade: they systematically anchor predictions for unseen ages to nearby seen classes, a manifestation of the well-known seen-class bias in generalized zero-shot learning. By formalizing age estimation without children's data as a generalized zero-shot benchmark on existing datasets, this work highlights a critical gap between current modeling practices and real-world ethical constraints. Our benchmark provides a principled basis for evaluating models under restricted data regimes and encourages the development of methods that are robust to distribution shift and aligned with responsible data use.
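The age-group separation described above amounts to a simple routing rule. The sketch below encodes the three age ranges from the abstract; the pool names and sample records are hypothetical, and the subject-exclusive splitting used for identity-annotated datasets is not shown.

```python
def assign_pool(age):
    """Map an age label to the benchmark pool described in the abstract."""
    if age < 18:
        return "zero_shot_eval"        # minors: never seen during training
    if age <= 59:
        return "seen"                  # 18-59: later divided into train/val/test
    return "unseen_validation"         # 60+: model selection under distribution shift

def split_by_age(samples):
    """Group sample ids into the three pools by their age label."""
    pools = {"zero_shot_eval": [], "seen": [], "unseen_validation": []}
    for sample in samples:
        pools[assign_pool(sample["age"])].append(sample["id"])
    return pools

pools = split_by_age([
    {"id": "img_1", "age": 7},
    {"id": "img_2", "age": 18},
    {"id": "img_3", "age": 59},
    {"id": "img_4", "age": 60},
])
```

Keeping the routing in one function makes the boundary cases (17 vs. 18, 59 vs. 60) easy to audit, which matters when the whole point of the protocol is that no minor's image reaches training.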

2605.29229 2026-05-29 cs.AI 版本更新

Tailoring the Curriculum: Student-Centered Reasoning Distillation via Dynamic Data-Model Compatibility

定制课程:通过动态数据-模型兼容性进行以学生为中心的推理蒸馏

Jiahao Huang, Fei Cheng, Junfeng Jiang, Akiko Aizawa

发表机构 * University of Tokyo(东京大学) Kyoto University(京都大学) National Institute of Informatics(日本信息处理研究所)

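AI summary: Proposes the Data-Model Compatibility (DMC) metric, which assesses a dataset's suitability for reasoning distillation by jointly considering data quality, relative difficulty, and student capability, and uses DMC to select data dynamically and improve distillation performance.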

详情
AI中文摘要

推理蒸馏将复杂推理能力从大型语言模型(LLMs)转移到较小的模型,但其成功取决于训练数据与学生模型的匹配程度。本文引入了数据-模型兼容性(DMC)指标,可用于评估数据集在学生模型上进行推理蒸馏的适用性。DMC通过联合考虑数据质量、相对难度和学生能力来提供评估。我们从两个角度验证了DMC的有效性:(1)DMC与推理蒸馏性能表现出强相关性;(2)使用DMC作为数据选择标准可提高推理蒸馏性能。这两个发现在多个学生模型和任务上均得到一致证明。此外,由于每个数据集的DMC在训练过程中动态变化,我们的实验表明,基于DMC动态选择数据集可以进一步提升性能。

英文摘要

Reasoning distillation transfers complex reasoning abilities from large language models (LLMs) to smaller ones, yet its success depends on how well the training data align with the student model. This paper introduces the Data-Model Compatibility (DMC) metric, which can be used to assess the suitability of a dataset for reasoning distillation on a student model. DMC provides an assessment by jointly considering data quality, relative difficulty, and student capability. We validated the effectiveness of DMC from two perspectives: (1) DMC exhibits a strong correlation with reasoning distillation performance; and (2) using DMC as the criterion for data selection leads to improved reasoning distillation performance. Both findings are consistently demonstrated across multiple student models and tasks. Moreover, since the DMC of each dataset dynamically changes during training, our experiments demonstrate that dynamically selecting datasets based on DMC can further enhance performance.

2605.29225 2026-05-29 cs.AI 版本更新

BenchTrace: A Benchmark for Testing Reflection Ability and Controlled Evolution in LLM Agents

BenchTrace: 用于测试LLM智能体反思能力和受控进化的基准

Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu, Akiko Aizawa

发表机构 * University of Tokyo(东京大学) Kyoto University(京都大学) National Institute of Informatics(日本国立情报学研究所)

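AI summary Proposes the BenchTrace benchmark, which systematically evaluates the self-evolution ability of LLM agents through two tasks, reflection evaluation and evolution evaluation, together with the failure avoidance rate (FAR) metric; experiments find that current models face significant bottlenecks in reflective diagnosis and generalization.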

详情
AI中文摘要

自我进化智能体通过反思过去失败来随时间改进,但现有评估存在两个局限:仅衡量任务得分,无法反映反思质量;且依赖智能体自身的回合运行,缺乏针对特定失败模式的机制。我们提出BenchTrace,一个用于评估LLM智能体自我进化能力的基准。BenchTrace基于包含1,821个带注释回合的快照反思数据集构建,涵盖六个多样化任务,包含反思评估(通过目标QA任务探测失败识别)和进化评估(在受控自我进化模拟中测试过去失败经验是否转化为回避行为)。基于BenchTrace,我们提出失败避免率(FAR),一种新的评估指标,衡量智能体成功避免目标失败实例的测试用例比例。使用Qwen3-32B和GPT-4.1的实验表明,两个模型在反思评估上的端到端通过率均低于30%,其中诊断是主要瓶颈。进化评估显示,自我进化方法通常比非进化基线提高FAR,但随着噪声回合累积,智能体会遗忘早期教训,且无法将反思泛化到特定情境之外,导致跨任务情境的负迁移。我们的相关性分析进一步揭示,只有完全正确的反思与更高的FAR强相关。BenchTrace揭示了当前自我进化方法的具体局限,并提供了一个受控的、模型无关的针对性评估框架。

英文摘要

Self-evolving agents improve over time by reflecting on past failures, but existing evaluation is limited in two ways: it measures only task scores, leaving reflection quality unknown, and it relies on agents' own episode runs, offering no mechanism to target specific failure patterns. We present BenchTrace, a benchmark for evaluating self-evolution ability in LLM agents. BenchTrace is built on a snapshot-reflection dataset of 1,821 annotated episodes spanning six diverse tasks, and comprises a Reflection Evaluation that probes failure identification through targeted QA tasks, and an Evolution Evaluation that tests whether past failure experience translates into avoidance behavior in a controlled self-evolution simulation. Building on BenchTrace, we propose failure avoidance rate (FAR), a new evaluation metric measuring the fraction of test cases in which the agent successfully avoids the target failure instance. Experiments with Qwen3-32B and GPT-4.1 reveal that both models fall below a 30% end-to-end pass rate on reflection evaluation, with diagnosis as the primary bottleneck. Evolution evaluation shows that self-evolution methods generally improve FAR over the non-evolving baseline, but agents forget early lessons as noise episodes accumulate, and agents fail to generalize their reflections beyond the specific context, causing negative transfer across task contexts. Our correlation analysis further reveals that only a fully correct reflection is strongly associated with higher FAR. BenchTrace exposes concrete limits of current self-evolution approaches and provides a controlled, model-agnostic framework for targeted evaluation.
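
The FAR definition in this abstract (the fraction of test cases in which the agent avoids the target failure) reduces to a simple ratio. The sketch below is illustrative only: the function name `failure_avoidance_rate` and the boolean outcome list are assumptions, not part of the BenchTrace release.

```python
def failure_avoidance_rate(avoided):
    """Fraction of test cases in which the target failure was avoided.

    `avoided` is a list of booleans, one per test case (hypothetical format).
    """
    if not avoided:
        raise ValueError("need at least one test case")
    return sum(1 for a in avoided if a) / len(avoided)


# Hypothetical outcomes: True means the agent avoided the target failure.
outcomes = [True, False, True, True]
far = failure_avoidance_rate(outcomes)
print(f"FAR = {far:.2f}")
```

A per-task breakdown would apply the same ratio to each task's outcomes separately; how BenchTrace itself aggregates across tasks is not stated in the abstract.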

2605.29224 2026-05-29 cs.CL cs.AI cs.CR 版本更新

Relevance as a Vulnerability: How Web Retrieval Degrades Safety Alignment in LLM Agents

相关性即漏洞:网络检索如何削弱LLM智能体的安全对齐

Aditya Nawal, Manit Baser, Mohan Gurusamy

发表机构 * Department of Electrical and Computer Engineering(电子与计算机工程系) National University of Singapore(新加坡国立大学)

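AI summary Proposes the AgentREVEAL framework, which analyzes how retrieval integration and content properties degrade the safety of LLM agents, finds that relevance is the shared activation condition, and introduces the HarmURLBench benchmark.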

详情
AI中文摘要

AI智能体通过外部工具(如网络检索)增强大型语言模型,使其能够提供基于事实和最新的响应。然而,将外部内容纳入生成流程可能会削弱控制模型输出的安全对齐机制。先前的研究表明,在智能体中启用检索会增加对有害请求的遵从性。我们提出了AgentREVEAL,一个用于分析LLM智能体中检索诱导的安全退化的诊断框架。该框架考察两个维度:检索如何集成到智能体流程中,以及检索内容的属性。在集成维度上,我们发现将工具调用和响应生成绑定在单一步骤中会放大有害输出。在内容维度上,我们揭示了安全来源悖论:即使是对立或安全导向的来源(例如包含警告或风险免责声明的页面),与无检索基线相比,也会使有害遵从性平均增加25%。最后,我们表明相关性是这两种漏洞的共同激活条件。类似模式出现在前沿闭源模型上,并且在几种代表性流程干预下,有害遵从性仍然保持较高水平,一些智能体在自主检索下也会进入这种状态。由于相关性也是使检索有用的原因,这些结果揭示了检索增强智能体的安全-效用权衡。我们引入了HarmURLBench,一个包含1,405个真实世界URL和320个有害行为的基准,以支持未来的评估。

英文摘要

AI agents augment large language models with external tools such as web retrieval, enabling grounded and up-to-date responses. However, incorporating external content into the generation pipeline can weaken the safety alignment mechanisms that govern model outputs. Prior work shows that enabling retrieval in agents increases compliance with harmful requests. We introduce AgentREVEAL, a diagnostic framework for analyzing retrieval-induced safety degradation in LLM agents. The framework examines two axes: how retrieval is integrated into the agent pipeline and the properties of the retrieved content. Along the integration axis, we find that binding tool invocation and response generation in a single step amplifies harmful outputs. Along the content axis, we uncover the Safe Source Paradox: even oppositional or safety-oriented sources, such as pages containing warnings or risk disclaimers, can increase harmful compliance by an average of 25% compared to the no-retrieval baseline. Finally, we show that relevance acts as a shared activation condition for both vulnerabilities. Similar patterns appear on frontier closed models, and harmful compliance remains elevated under several representative pipeline interventions, with some agents also entering this regime under autonomous retrieval. Because relevance is also what makes retrieval useful, these results expose a safety-utility trade-off for retrieval-enabled agents. We introduce HarmURLBench, a benchmark containing 1,405 real-world URLs paired with 320 harmful behaviors to support future evaluations.

2605.29218 2026-05-29 cs.AI cs.CL 版本更新

GTA: Generating Long-Horizon Tasks for Web Agents at Scale

GTA:大规模生成面向Web智能体的长程任务

Tenghao Huang, Kung-Hsiang Huang, Prafulla Kumar Choubey, Yilun Zhou, Muhao Chen, Jonathan May, Chien-Sheng Wu

发表机构 * University of Southern California(南加州大学) Salesforce AI Research(Salesforce人工智能研究) University of California, Davis(加州大学戴维斯分校)

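AI summary Proposes the GTA framework, which integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to generate realistic long-horizon tasks with executable trajectories for web agents, addressing the lack of process supervision and scalability in existing benchmarks.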

Comments Published at Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics

详情
AI中文摘要

Web智能体将语言模型与浏览和工具使用能力相结合,有望成为开放的Web助手。然而,进展日益受到缺乏可扩展的过程级监督的限制。现有基准大多为手动构建,仅提供粗略的起始-目标注释,缺乏中间轨迹,而最近的自动生成方法仍然昂贵、有偏且浅显。这些限制阻碍了对必须泛化到现实、多跳、跨页面任务的智能体进行可靠训练和评估。我们引入了一个可扩展的框架GTA,它集成了爬取、基于检索的种子生成、上下文内生成和自动质量控制,以生成与可执行轨迹配对的真实任务。该设计将爬取与生成解耦以提高效率,将任务基于站点图以强制组合性,并通过确定性重放和系统验证确保密集监督。我们在超过50个涵盖电子商务、政府、论坛和新闻的网站上实例化了该流程,并具有多语言和多跳覆盖。由此产生的基准揭示了显著的人机性能差距,并实现了详细的诊断。我们的贡献有三方面:(i)形式化多跳Web智能体任务生成,(ii)提出一个高效且经过验证的自动数据创建流程,以及(iii)发布一个具有可重复评估的动态基准。

英文摘要

Web agents, which couple language models with browsing and tool-use capabilities, show promise as open web assistants. Yet progress is increasingly limited by the lack of scalable, process-level supervision. Existing benchmarks are largely manually constructed, providing only coarse start-goal annotations without intermediate trajectories, while recent automatic generation efforts remain expensive, biased, and shallow. These limitations prevent reliable training and evaluation of agents that must generalize to realistic, multi-hop, cross-page tasks. We introduce a scalable framework, GTA, that integrates crawling, retrieval-based seeding, in-context generation, and automated quality control to produce realistic tasks paired with executable trajectories. This design decouples crawling from generation for greater efficiency, grounds tasks in the site graph to enforce compositionality, and ensures dense supervision through deterministic replays and systematic validation. We instantiate the pipeline on over 50 websites covering e-commerce, government, forums, and news, with multilingual and multi-hop coverage. The resulting benchmark reveals a significant human-agent performance gap and enables detailed diagnostics. Our contributions are three-fold: (i) formalizing multi-hop web-agent task generation, (ii) proposing an efficient and validated pipeline for automatic data creation, and (iii) releasing a dynamic benchmark with reproducible evaluation.

2605.29194 2026-05-29 cs.LG cs.AI cs.NA math.NA 版本更新

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

随机提升:生成随机物理系统轨迹

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

发表机构 * Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA(Courant数学科学研究所,纽约大学,纽约,NY 10012,USA)

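AI summary Proposes Stochastic Lifting, which attaches an independent high-dimensional random label to each state transition and learns a map from the current state and label to the next state, in order to generate diverse trajectories of stochastic physical systems.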

详情
AI中文摘要

许多随机物理系统随时间平滑演化,即状态分布随时间步长规则变化。从当前状态到下一状态的转移通常可以建模为平滑映射和显式随机源的组合。随机提升利用这一结构,通过为训练数据中的每个状态转换附加一个独立的高维随机标签,并使用标准回归损失拟合从当前状态和标签到下一状态的转移映射。这些标签作为辅助坐标,使模型能够从相似的当前状态表示多个可能的下一状态,避免在有限样本量下崩溃为均值预测。在推理时,每个时间步采样新的标签,并将学习到的映射自回归地向前滚动,每个时间步仅需一次网络评估即可生成多样化的轨迹。

英文摘要

Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly across time steps. The transition from the current state to the next state can often be modeled as the combination of a smooth map and an explicit source of randomness. Stochastic Lifting exploits this structure by attaching an independent, high-dimensional random label to each state transition in the training data and fitting a transition map from the current state and label to the next state using a standard regression loss. The labels act as auxiliary coordinates that let the model represent multiple plausible next states from similar current states, avoiding collapse to a mean prediction in the finite-sample regime. At inference, fresh labels are sampled at each time step and the learned map is rolled forward autoregressively, generating diverse trajectories with a single network evaluation per time step.
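
The training and sampling loop described in this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions: the one-dimensional system, the label dimension `LABEL_DIM`, and the 1-nearest-neighbour lookup (standing in for a neural network fitted with a regression loss) are all hypothetical choices, not the authors' setup.

```python
import random

LABEL_DIM = 8  # assumed label dimension, chosen only for illustration


def sample_label(rng):
    """Draw an independent Gaussian label vector."""
    return [rng.gauss(0.0, 1.0) for _ in range(LABEL_DIM)]


def make_training_set(rng, n=200):
    """Transitions of a toy system x' = 0.5 * x + eps, with eps in {-1, +1}."""
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        x_next = 0.5 * x + rng.choice([-1.0, 1.0])
        # The label is drawn independently of both x and x_next.
        data.append((x, sample_label(rng), x_next))
    return data


def transition(data, x, z):
    """Map (state, label) to a next state via 1-NN lookup (network stand-in)."""
    def dist(row):
        xi, zi, _ = row
        return (xi - x) ** 2 + sum((a - b) ** 2 for a, b in zip(zi, z))
    return min(data, key=dist)[2]


def rollout(data, x0, steps, rng):
    """Autoregressive rollout: one fresh label and one map evaluation per step."""
    traj = [x0]
    for _ in range(steps):
        traj.append(transition(data, traj[-1], sample_label(rng)))
    return traj


rng = random.Random(0)
data = make_training_set(rng)
trajectories = [rollout(data, 0.0, 5, rng) for _ in range(20)]
```

Because the label enters the map as an extra input, two rollouts from the same start state can diverge, whereas a map of the state alone trained with squared error would tend toward the conditional mean.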

2605.29192 2026-05-29 cs.AI cs.CL 版本更新

ReasonOps: Operator Segmentation for LLM Reasoning Traces

ReasonOps: 大语言模型推理轨迹的算子分割

Daniel Lee, Owen Queen, James Zou

发表机构 * Stanford University(斯坦福大学)

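AI summary Proposes ReasonOps, an unsupervised method that extracts 7 universal reasoning operators from chain-of-thought traces, revealing the structure of model reasoning and supporting model identification and correctness prediction.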

详情
AI中文摘要

大型推理模型的思维链轨迹可长达数万token,但我们缺乏描述其内部结构的词汇。以往用于分析思维链轨迹的方法要么过于僵化,要么表达能力不足,无法捕捉跨领域和跨模型的特征。为解决此问题,我们开发了ReasonOps,一种无监督、表达力强的方法,用于注释思维链轨迹,提供简洁的通用算子。利用ReasonOps,我们分析了来自12个思考型LLM(涵盖6个家族、8个推理基准)的44,662条轨迹,发现它们共享一个共同的组合结构:7个反复出现的推理算子——语篇层面的动作,如回溯、推理和假设——这些算子从句子开头的3-token枢轴的无监督聚类中涌现。这些算子出现在每个模型家族和基准领域,由三个独立的LLM评判员对留出样本进行分类,准确率达70-76%。我们分析了算子在简单与困难问题上的结构,发现反思性算子在困难问题上更有帮助,而在简单问题上则损害性能。算子序列具有高度的模型识别性:仅基于算子分布训练的分类器能以宏AUC恢复源模型,揭示每个模型家族具有独特的推理指纹。结构化的算子特征在问题内答案正确性预测上远高于基线。基于这些算子构建的分类器在WP-AUC上达到,特别是在AIME上。ReasonOps还能够在轨迹完成前进行早期质量估计:我们仅用50%的轨迹就能在WP-AUC上进行预测。ReasonOps流程是无监督且无需标注的,能够深入洞察LLM推理轨迹,并在模型识别和正确性预测方面取得强大的下游结果。

英文摘要

Chain-of-thought traces from large reasoning models can span tens of thousands of tokens, yet we lack a vocabulary for describing their internal structure. Previous methods developed to analyze chain-of-thought traces are either too rigid or not expressive enough, failing to capture features across domains and models. To remedy this, we develop ReasonOps, an unsupervised, expressive method for annotating chain-of-thought traces, providing succinct universal operators. Using ReasonOps, we analyze 44,662 traces from 12 thinking LLMs spanning 6 families across 8 reasoning benchmarks and discover that they share a common compositional structure: 7 recurring reasoning operators -- discourse-level moves such as backtracking, inferring, and hypothesizing -- that emerge from unsupervised clustering of sentence-initial 3-token pivots. These operators appear across every model family and benchmark domain, confirmed by three independent LLM judges who classify held-out samples at 70-76% accuracy. We analyze the structure of operators on easy vs. hard problems, revealing that reflective operators are more helpful on hard problems and harm performance on easy problems. Operator sequences are highly model-identifying: a classifier trained on operator distributions alone recovers the source model with macro-AUC, revealing that each model family has a distinctive reasoning fingerprint. Structural operator features predict within-problem answer correctness well above baselines. Classifiers built on these operators reach WP-AUC and on AIME specifically. ReasonOps further enables early quality estimation well before the trace completes: we predict at WP-AUC for only 50% of the trace. The ReasonOps pipeline is unsupervised and annotation-free, enabling deep insights into LLM reasoning traces as well as strong downstream results on model identification and correctness prediction.

2605.29184 2026-05-29 cs.LG cs.AI 版本更新

Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback

影响引导的符号回归:基于大语言模型与细粒度反馈的方程搜索科学发现

Evgeny S. Saveliev, Samuel Holt, Nabeel Seedat, David L. Bentley, Jim Weatherall, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学) Thomson Reuters Foundational Research(汤姆森·路透基础研究) U. Colorado, Anschutz Medical Campus(科罗拉多大学安舒茨医疗校区)

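AI summary Proposes Influence-Guided Symbolic Regression (IGSR), in which an LLM generates candidate functions that are pruned using granular influence scores, combined with Monte Carlo Tree Search to explore the combinatorial space efficiently; the method discovers new relationships on multiple benchmarks and in real biological data.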

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)为科学发现提供了有前景的途径,但它们在符号回归中的应用常受限于低效的搜索策略和粗糙的反馈信号。当前方法通常使用标量指标(如全局均方误差)指导LLM,这无法识别所提出方程中哪些成分驱动性能或导致误差。我们引入影响引导符号回归(IGSR),该方法将方程发现表述为一个迭代的两步过程,结合多样化的项生成与严格选择:LLM为线性模型生成候选基函数$ψ_j(\mathbf{x})$,然后使用细粒度影响分数$Δ_j$进行评估。这些分数量化每个项对泛化准确性的边际贡献,从而实现影响引导的剪枝过程,系统地精炼模型结构。将此机制集成到蒙特卡洛树搜索(MCTS)中,能够在导航组合搜索空间的同时平衡对新函数形式的探索与对高影响成分的利用。我们在多个基准测试上展示了IGSR的有效性,包括LLM-SRBench、药理学PKPD模型、流行病学模拟和真实基因组数据。值得注意的是,我们通过一个高维生物数据集的案例研究验证了该框架的真正发现能力,其中IGSR识别出DNA甲基化与RNA聚合酶II暂停之间的新关系;该假设随后通过湿实验得到了支持。

英文摘要

Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. We introduce Influence-Guided Symbolic Regression (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $ψ_j(\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $Δ_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing, a hypothesis that was subsequently supported via wet-lab experimentation.
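
One plausible reading of a per-term influence score is a leave-one-term-out comparison: refit the linear model without basis function $ψ_j$ and record how much the validation error rises. The sketch below follows that reading only. The abstract does not give the formula for $Δ_j$, so the definition, the function names, and the toy data are all assumptions rather than the IGSR implementation.

```python
import math


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x


def fit(rows, y):
    """Least-squares weights via the normal equations (tiny ridge for stability)."""
    k = len(rows[0])
    A = [[sum(r[i] * r[j] for r in rows) + (1e-9 if i == j else 0.0)
          for j in range(k)] for i in range(k)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve(A, b)


def mse(rows, y, w):
    """Mean squared error of the linear model with weights w."""
    return sum((sum(wi * ri for wi, ri in zip(w, r)) - t) ** 2
               for r, t in zip(rows, y)) / len(y)


def influence_scores(basis, x_tr, y_tr, x_va, y_va):
    """Assumed definition: validation MSE without term j minus MSE with all terms."""
    def val_error(funcs):
        w = fit([[g(x) for g in funcs] for x in x_tr], y_tr)
        return mse([[g(x) for g in funcs] for x in x_va], y_va, w)
    full = val_error(basis)
    return [val_error(basis[:j] + basis[j + 1:]) - full for j in range(len(basis))]


# Toy target y = 3x + 0.5x^2; the sin(5x) candidate is a deliberate distractor.
target = lambda x: 3.0 * x + 0.5 * x * x
basis = [lambda x: x, lambda x: x * x, lambda x: math.sin(5.0 * x)]
x_train = [-2.0 + 0.1 * i for i in range(41)]
x_val = [-1.95 + 0.1 * i for i in range(40)]
scores = influence_scores(basis, x_train, [target(x) for x in x_train],
                          x_val, [target(x) for x in x_val])
```

Under this convention a large positive score marks a term whose removal hurts held-out accuracy, while a score near zero marks a pruning candidate, as with the `sin` term here.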

2605.29174 2026-05-29 cs.AI cs.CR 版本更新

Paper Agents, Paper Gains: An Empirical Analysis of DeFi Investment Agents

Paper Agents, Paper Gains: DeFi投资代理的实证分析

Jay Yu, Amy Zhao, Danning Sui

发表机构 * Pantera Capital(Pantera资本) Stanford University(斯坦福大学) IC3 Ava Labs(Ava实验室)

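AI summary Analyzes over 1,900 AI crypto projects, 10 representative agents, and 11 Solana agent treasuries, and finds that current DeFi investment agents remain at an early stage, with insufficient evidence of autonomous execution, collective losses for token holders, and valuations disconnected from fundamentals; proposes a maturity framework.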

详情
AI中文摘要

DeFi投资代理,即使用AI进行自主链上交易的系统,自2024年底以来已获得超过30亿美元的代币总估值。我们调查了1900多个标记为AI的加密项目,筛选出专注于投资的代理,并策划了10个涵盖策略和可观测性维度的代表性项目。然后,我们对两个突出的代理框架ElizaOS和Virtuals Protocol进行了深入的架构分析,并对11个基于Solana的代理金库(具有公开可归因的交易活动)进行了定量链上表现分析,覆盖925,323个代币持有者。我们发现当前部署仍处于早期且异构:(1)在我们的样本中,许多项目尚未提供清晰的自主交易执行证据,开发者访谈表明许多可见部署仍为基本API集成;(2)代理金库保留了超过3000万美元的账面收益,而代币持有者集体损失了1.917亿美元,前1%的钱包捕获了所有收益的81.4%(18.1亿美元);(3)代币估值与金库基本面关联微弱,市值与AUM比率超过10,000倍,而成熟的DeFi协议低于1倍;(4)用户总收益在达到24亿美元的峰值后下降至净亏损,每个平台的中位数回报均为负,代币从历史高点平均下跌93%。我们将这些结果解释为无许可的第一代市场的特征,其中开放基础设施支持快速实验,但也允许幼稚或投机性代理在自主性、性能和利益相关者对齐的稳健标准出现之前启动。因此,我们提出了一个沿三个维度(自主执行、风险调整后盈利能力和利益相关者对齐)的成熟度框架,以表征当前部署与未来投资级代理系统之间的差距。

英文摘要

DeFi investment agents, systems that use AI for autonomous on-chain trading, have attained over USD 3 billion in combined token valuations since late 2024. We survey over 1,900 AI-tagged crypto projects, filter to investment-focused agents, and curate 10 representative projects spanning strategy and observability dimensions. We then conduct a deep-dive architectural analysis of two prominent agent frameworks, ElizaOS and Virtuals Protocol, and a quantitative on-chain performance analysis of 11 Solana-based agent treasuries with publicly attributable trading activity, covering 925,323 token holders. We find that current deployments remain early and heterogeneous: (1) in our sample, many projects do not yet provide clear evidence of autonomous trade execution, and developer interviews suggest that many visible deployments remain basic API integrations; (2) agent treasuries retain over USD 30M in paper gains while token holders collectively lost USD 191.7M, with the top 1% of wallets capturing 81.4% of all gains (USD 1.81B); (3) token valuations are weakly connected to treasury fundamentals, with market-cap-to-AUM ratios exceeding 10,000x versus below 1x for established DeFi protocols; and (4) aggregate user gains peaked at USD 2.4B before declining to net losses, with median returns negative on every platform and tokens declining 93% on average from all-time highs. We interpret these outcomes as characteristic of a permissionless, first-generation market in which open infrastructure enables rapid experimentation but also allows naive or speculative agents to launch before robust standards for autonomy, performance, and stakeholder alignment emerge. We therefore propose a maturity framework along three dimensions: autonomous execution, risk-adjusted profitability, and stakeholder alignment, to characterize the gap between current deployments and future investment-grade agent systems.

2605.29170 2026-05-29 cs.CL cs.AI 版本更新

UA-Legal-Bench: A Benchmark for Evaluating Large Language Models on Ukrainian Legal Reasoning

UA-Legal-Bench:评估大语言模型在乌克兰法律推理上的基准

Volodymyr Ovcharov

发表机构 * SecondLayer

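AI summary To address the English-centric nature of legal NLP benchmarks, builds a five-task benchmark from Ukrainian court decisions and evaluates 11 LLMs, finding that the effect of few-shot prompting varies by task and that accuracy is misleading on imbalanced tasks.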

Comments 13 pages, 5 figures, 4 tables. Data: https://huggingface.co/datasets/overthelex/ua-legal-bench

详情
AI中文摘要

法律NLP基准绝大多数以英语为中心,导致在形态丰富、非拉丁字母语言中的失败模式未被检测。我们引入了UA-Legal-Bench,一个包含五个任务的基准,用于评估大语言模型在乌克兰法律推理上的表现,该基准基于统一国家法院判决登记册(EDRSR)——世界上最大的开放司法语料库之一(9950万份判决)。该基准包括:(1)案件类型分类(4类,n=2,000),(2)判决形式分类(4类,n=2,000),(3)案件结果预测(6类,n=800),(4)法律规范提取(n=1,794),以及(5)原因类别预测(22类,n=1,871)。我们评估了来自五个系列的11个LLM(3B-675B),在零样本和3样本提示下通过AWS Bedrock进行了158K次API调用。我们的结果揭示了显著依赖任务的少样本效应:少样本提示将判决形式分类提高了最多+38.6个百分点,但对结果预测的影响不一。我们表明,在不平衡的法律任务中,准确率具有误导性:COP准确率最高的模型(62%)是多数类预测器(macro-F1:23%),而真正最好的模型macro-F1仅为44%。系列内规模分析显示,8B模型在表面级任务上可以匹配前沿性能,但不同系列的规模阈值差异很大。我们发布了所有数据、提示和模型预测。

英文摘要

Legal NLP benchmarks are overwhelmingly English-centric, leaving failure modes in morphologically rich, non-Latin-script languages undetected. We introduce UA-Legal-Bench, a five-task benchmark for evaluating large language models on Ukrainian legal reasoning, built from the Unified State Register of Court Decisions (EDRSR) -- one of the world's largest open judicial corpora (99.5 million decisions). The benchmark comprises: (1) case-type classification (4 classes, n=2,000), (2) judgment form classification (4 classes, n=2,000), (3) case-outcome prediction (6 classes, n=800), (4) legal norm extraction (n=1,794), and (5) cause category prediction (22 classes, n=1,871). We evaluate 11 LLMs (3B--675B) from five families under zero-shot and 3-shot prompting via AWS Bedrock with 158K API calls. Our results reveal sharply task-dependent few-shot effects: few-shot prompting improves judgment form classification by up to +38.6 pp but has mixed effects on outcome prediction. We show that accuracy is misleading on imbalanced legal tasks: the model with highest COP accuracy (62%) is a majority-class predictor (macro-F1: 23%), while the genuinely best model scores only 44% macro-F1. Within-family scaling analysis reveals that 8B models can match frontier performance on surface-level tasks but scaling thresholds vary dramatically across families. We release all data, prompts, and model predictions.
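
The gap between accuracy and macro-F1 noted in this abstract is easy to reproduce on a toy imbalanced label set. The labels below are invented for illustration and are not drawn from UA-Legal-Bench; the sketch only shows why a majority-class predictor can score well on accuracy while scoring poorly on macro-F1.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the classes present in y_true."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced task: 6 cases of class A, 2 of B, 2 of C.
y_true = ["A"] * 6 + ["B"] * 2 + ["C"] * 2
y_pred = ["A"] * 10  # a majority-class predictor

acc = accuracy(y_true, y_pred)
f1 = macro_f1(y_true, y_pred)
print(f"accuracy = {acc:.2f}, macro-F1 = {f1:.2f}")
```

Macro-F1 weights every class equally, so the two classes the predictor never outputs each contribute an F1 of zero regardless of how rare they are.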

2605.29168 2026-05-29 cs.AI cs.LG 版本更新

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

晚做总比早做好:基于本体后提取校正的神经符号知识图谱构建

Lorenzo Loconte, Timothy Hospedales, Cristina Cornelio

发表机构 * University of Edinburgh, UK(爱丁堡大学) Samsung AI Center, Cambridge, UK(三星人工智能中心)

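AI summary Proposes a neuro-symbolic framework that uses post-extraction correction to resolve ontology inconsistencies in LLM-extracted knowledge graphs, reducing token usage and improving graph consistency.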

详情
AI中文摘要

问答是AI中的核心挑战,特别是对于需要跨文档多跳推理或聚合、穷举等符号操作的复杂查询。检索增强生成已成为问答的主要方法,最近的基于图的变体通过组织知识以更好地支持组合性问题,部分解决了这些问题。然而,大多数基于文本图的RAG方法仍缺乏可靠回答复杂问题所需的符号操作结构。这推动了基于符号图的方法,该方法提取知识图谱,其关系是逻辑谓词,支持类似SQL的查询。然而,这些流程通常使用LLM进行KG提取,这可能导致一致性问题,即提取的事实可能违反常识本体约束。我们提出了一种用于本体基础KG构建的神经符号框架,结合了开放域提取、基于嵌入的类型和谓词规范化,以及针对本体违规的LLM校正。通过将校正推迟到后提取阶段,我们的方法避免了重复的LLM调用,显著减少了token使用,同时提高了KG一致性并保持了下游问答质量。最后,通过测量SPARQL图模式的出现,我们展示了提取的KG非常适合符号查询。

英文摘要

Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symbolic operations like aggregation or exhaustive listing. Retrieval-augmented generation has become the dominant approach to QA, with recent graph-based variants addressing part of these issues by organizing knowledge to better support compositional questions. However, most textual graph-based RAG methods still lack the structure needed for symbolic operations useful to answer complex questions reliably. This motivates symbolic graph-based approaches, which extract knowledge graphs (KGs) whose relations are logic predicates that enable SQL-like querying. Yet these pipelines typically use LLMs for KG extraction, which can introduce consistency issues, where extracted facts may violate commonsense ontology constraints. We propose a neuro-symbolic framework for ontology-grounded KG construction combining open-domain extraction, embedding-based canonicalization of types and predicates, and targeted LLM-based correction of ontology violations. By deferring corrections to a post-extraction stage, our method avoids repeated LLM calls, substantially reducing token usage while improving KG consistency and preserving downstream QA quality. Finally, we show that the extracted KGs are well suited for symbolic querying by measuring the occurrence of SPARQL graph patterns.

2605.29161 2026-05-29 cs.LG cs.AI 版本更新

Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach

生成图拓扑的进化精炼:一种混合WGAN-GA方法

James Sargant, Seyedeh Ava Razi Razavi, Renata Dividino, Sheridan Houghten

发表机构 * Computer Science Brock University, Canada(计算机科学 布鲁克大学 加拿大)

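AI summary Proposes a hybrid WGAN-GA method that refines GAN-generated graph structures with a genetic algorithm, reducing deviations in degree and spectral distributions so that synthetic graphs more closely match real ones.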

Comments 6 pages, 4 Figures, 4 Tables, IEEE World Congress on Computational Intelligence

详情
AI中文摘要

由于离散连通性、图大小变化和类别特定的结构模式,生成逼真的图结构数据具有挑战性。最近基于生成对抗网络(GAN)的图生成方法通过学习连通性和匹配类别特定的密度分布来改进边建模。然而,这些模型在与真实图相比时仍表现出明显的偏差,例如度和谱分布,表明重要的结构属性未完全保留。本工作旨在通过使用遗传算法(GA)精炼现有基于GAN的图生成器框架生成的图来减少这些偏差。在GAN框架中,生成器同时生成节点特征和连通性模式,而基于GNN的判别器评估图的真实性和类别一致性,以确保全局结构和类别对齐。在此基础上,我们应用GA来精炼生成图的边。精炼过程引导合成图更接近真实数据,同时保持多样性和新颖性。实验结果表明,与基础模型相比,GA精炼持续降低组合最大均值差异(MMD),从而生成更匹配真实结构模式的图。这表明进化精炼是纠正基于GAN的图生成器中残留结构偏差的有效且灵活的方法,提高了它们用于逼真图合成和数据增强的适用性。

英文摘要

Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural patterns. Recent Generative Adversarial Network (GAN)-based graph generation methods improve edge modelling by learning connectivity and matching class-specific density distributions. However, these models still deviate noticeably from real graphs, for example in their degree and spectral distributions, indicating that important structural properties are not fully preserved. This work aims to reduce these deviations by refining the graphs produced by an existing GAN-based graph generator framework with a Genetic Algorithm (GA). In the GAN framework, the generator produces both node features and connectivity patterns, while a GNN-based critic evaluates graph realism and class consistency to ensure global structural and class alignment. Building on this foundation, we apply a GA to refine the edges of generated graphs. The refinement process guides synthetic graphs toward closer agreement with real data, while preserving diversity and novelty. Experimental results show that the GA refinement consistently lowers combined Maximum Mean Discrepancy (MMD) compared to the base model, leading to graphs that more closely match real structural patterns. This demonstrates that evolutionary refinement is an effective and flexible way to correct residual structural deviations in GAN-based graph generators, improving their suitability for realistic graph synthesis and data augmentation.

2605.29157 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Parallax: Parameterized Local Linear Attention for Language Modeling

Parallax: 参数化局部线性注意力用于语言建模

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang

发表机构 * Northwestern University(西北大学) Tilde Research(Tilde研究) University of Washington(华盛顿大学)

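AI summary Proposes Parallax, a scalable parameterized local linear attention mechanism that eliminates the numerical solver and learns a query-like projector, achieving consistent perplexity improvements in language model pretraining and gains that transfer to downstream tasks.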

详情
AI中文摘要

大型语言模型(LLM)已成为人工智能的核心范式,但注意力的核心计算原语在结构上仍未改变。局部线性注意力(LLA)是一种从测试时回归框架的非参数统计中推导出的注意力机制。与先前关于高效注意力变体的研究相比,LLA将softmax注意力中的局部常数估计升级为局部线性估计,在关联记忆上提供了可证明更优的偏差-方差权衡。然而,由于计算和数值稳定性问题,LLA尚未在LLM预训练中扩展。我们引入Parallax,一种可扩展用于LLM的参数化局部线性注意力。Parallax消除了LLA中的数值求解器,并学习一个额外的类似查询的投影器来探测KV协方差。我们将Parallax置于一个由带宽、投影器构造和仿射结构连接的注意力机制家族中。我们提出一种硬件感知算法,提高了相对于FlashAttention的算术强度,将注意力转移到更受计算限制的区域。我们的原型解码核在各种批大小和上下文长度下匹配或超越FlashAttention 2/3。我们在0.6B和1.7B规模上预训练Parallax,发现整个预训练过程中困惑度持续改善,且收益迁移到下游基准测试。在参数匹配和计算匹配的控制下,优势持续存在,展示了帕累托改进。我们进行了仔细的预训练消融实验,并发现了一个新现象:Muon优化器解锁了Parallax的能力。据我们所知,这是架构研究文献中首次对注意力机制进行强架构-优化器协同设计的实证演示。

英文摘要

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction, and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention into a more compute-bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining, with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.
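
The contrast between a local constant and a local linear estimate can be seen in the classical one-dimensional case. The sketch below uses scalar keys and values with softmax-style weights. It illustrates the textbook estimators only, not LLA's multi-dimensional formulation or Parallax's parameterized variant, and all names in it are hypothetical.

```python
import math


def softmax_weights(q, keys):
    """Unnormalized softmax weights exp(q * k), shifted by the max for stability."""
    scores = [q * k for k in keys]
    m = max(scores)
    return [math.exp(s - m) for s in scores]


def local_constant(q, keys, values):
    """Weighted average of the values: the softmax-attention read-out."""
    w = softmax_weights(q, keys)
    return sum(wi * v for wi, v in zip(w, values)) / sum(w)


def local_linear(q, keys, values):
    """Intercept of the weighted least-squares line v ~ a + b * (k - q)."""
    w = softmax_weights(q, keys)
    d = [k - q for k in keys]
    s0 = sum(w)
    s1 = sum(wi * di for wi, di in zip(w, d))
    s2 = sum(wi * di * di for wi, di in zip(w, d))
    t0 = sum(wi * v for wi, v in zip(w, values))
    t1 = sum(wi * di * v for wi, di, v in zip(w, d, values))
    return (s2 * t0 - s1 * t1) / (s0 * s2 - s1 * s1)


# Values are exactly linear in the key; the query sits at the edge of the keys.
keys = [0.0, 1.0, 2.0, 3.0]
values = [2.0 * k + 1.0 for k in keys]
q = 3.0
nw = local_constant(q, keys, values)
ll = local_linear(q, keys, values)
print(f"local constant: {nw:.4f}, local linear: {ll:.4f}, target: {2 * q + 1}")
```

With the query at the boundary, the weighted average is pulled toward the interior values, while the local linear fit recovers the linear target up to rounding. This boundary bias is the standard motivation for local linear estimators in nonparametric regression.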

2605.29155 2026-05-29 cs.RO cs.AI cs.DC 版本更新

CA-AC-MPC: CUDA-Accelerated Actor-Critic Model Predictive Control

CA-AC-MPC: CUDA加速的Actor-Critic模型预测控制

Antonio Buo, Vittorio Cammarota, Michele Avagnale, Pierluigi Arpenti, Vincenzo Lippiello, Fabio Ruggiero

发表机构 * PRISMA Lab and CREATE Consortium, Department of Electrical Engineering and Information Technology, University of Naples Federico II(PRISMA实验室和CREATE联盟,电气工程与信息技术系,那不勒斯费德里科二世大学)

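AI summary Proposes a CUDA-accelerated AC-MPC variant that reduces training and inference latency through GPU-parallel optimization and achieves state-of-the-art lap times and near-limit dynamic performance on an agile drone racing task.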

Comments Accepted for presentation at the 2026 International Conference on Unmanned Aircraft Systems, ICUAS 2026

详情
AI中文摘要

在文献中,actor-critic模型预测控制(AC-MPC)将MPC与强化学习相结合,以实现复杂动态系统的高性能控制。然而,其可微分的MPC层需要在正向和反向传播中反复求解优化问题,导致大量的训练和推理延迟。本文通过引入CUDA加速变体解决了这一瓶颈,显著减少了端到端执行时间,同时保持了基线公式的控制性能。在敏捷无人机竞速任务上的仿真结果表明,我们的方法实现了最先进的圈速和近极限动态行为,同时显著减少了训练和推理时间。

英文摘要

In the literature, actor-critic model predictive control (AC-MPC) integrates MPC with reinforcement learning to enable high-performance control of complex dynamical systems. However, its differentiable MPC layer requires repeatedly solving an optimization problem in both the forward and backward passes, leading to substantial training and inference latency. This paper tackles this bottleneck by introducing a CUDA-accelerated variant that significantly reduces end-to-end execution time while preserving the control performance of the baseline formulation. Simulation results on an agile drone racing task show that our approach achieves state-of-the-art lap times and near-limit dynamic behaviour with markedly reduced training and inference time.

2605.29153 2026-05-29 cs.LG cs.AI physics.comp-ph 版本更新

Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization

揭示科学机器学习中的多机制模式:不同的失败模式与机制特定优化

Yuxin Wang, Yuanzhe Hu, Xiaokun Zhong, Xiaopeng Wang, Haiquan Lu, Tianyu Pang, Michael W. Mahoney, Yujun Yan, Pu Ren, Yaoqing Yang

发表机构 * Dartmouth College(达特茅斯学院) University of California, San Diego(加州大学圣地亚哥分校) University of California, Berkeley(加州大学伯克利分校) National University of Singapore(新加坡国立大学) Lawrence Berkeley National Laboratory(伯克利国家实验室) International Computer Science Institute(国际计算机科学研究所)

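AI summary Uses a regime-aware diagnostic framework to study the multi-regime behavior of scientific machine learning models under different hyperparameter settings, finding a consistent three-regime structure, regime-specific optimization effectiveness, and fine-grained failure modes, and offering guidance for improving robustness.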

Comments Accepted by ICML 2026

详情
AI中文摘要

在不同超参数设置下训练的神经网络可能落入不同的训练“机制”,这些机制内部行为一致,而机制间存在定性差异。本文通过一个机制感知的诊断框架,联合分析性能、训练动态和损失景观几何,研究科学机器学习(SciML)模型中的这种多机制行为。我们识别出三个关键发现:(i)在许多标准SciML模型、不同的约束施加方式以及各种优化器设计中,一致地出现一个三机制结构;(ii)优化效果是机制特定的,没有单一方法在所有机制中表现良好;(iii)SciML模型可能表现出精细的失败模式,这些模式可能挑战对标准损失景观度量的传统解释。我们的结果为建立SciML中失败模式的统一、任务无关视角提供了一种方法,并为提高鲁棒性提供机制感知的指导。我们在广泛使用的SciML模型上验证了这些发现,包括物理信息神经网络、神经算子和神经常微分方程,涵盖了代表性的常微分方程和偏微分方程基准。

英文摘要

Neural networks trained under different hyperparameter settings can fall into distinct training "regimes," with consistent behavior within regimes and qualitative differences across regimes. In this paper, we study such multi-regime behavior in scientific machine learning (SciML) models through a regime-aware diagnostic framework that jointly analyzes performance, training dynamics, and loss-landscape geometry. We identify three key findings: (i) a consistent three-regime structure emerges across many standard SciML models, different constraint enforcements, and various optimizer designs; (ii) optimization effectiveness is regime-specific, with no single method performing well across all regimes; and (iii) SciML models can exhibit fine-grained failure modes that can challenge conventional interpretations of standard loss-landscape metrics. Our results provide an approach to establish a unified, task-oblivious perspective on failure modes in SciML and to inform regime-aware guidance for improving robustness. We validate these findings across widely-used SciML models, including physics-informed neural networks, neural operators, and neural ordinary differential equations, on benchmarks spanning representative ordinary and partial differential equations.

2605.29141 2026-05-29 cs.IR cs.AI Updated version

Toward User Preference Alignment in LLM Recommendation via Explicit Context Feedback

Weizhi Zhang, Wooseong Yang, Yuxin Cui, Zhaohui Guo, Hins Hu, Liangwei Yang, Henry Peng Zou, Qifei Wang, Hanqing Zeng, Jiayi Liu, Yinglong Xia, Philip S. Yu

Affiliations University of Illinois Chicago; Meta

AI summary This paper argues that LLM-based recommender systems should prioritize explicit context feedback (such as review text) to align with user preferences and to make recommendations more personalized and explainable.

Comments Published in CogMI 2025. https://ieeexplore.ieee.org/abstract/document/11417068

Abstract

Traditional recommender systems (RecSys) primarily infer user preferences from implicit signals (such as clicks, watches, and purchases), often neglecting the rich explicit contextual feedback users provide through verbal text, like comments and reviews. This explicit context feedback captures the nuanced reasons behind user decisions regarding their preferences. In addition, it offers critical heterogeneous information for user preference alignment and more explainable recommendations. Overlooking such signals can lead to misaligned user preferences and further reinforce filter bubbles, as algorithms fail to understand the "semantic context" behind user choices. Recent advances in Large Language Models (LLMs) present new opportunities to harness user-generated content for more accurate and diverse recommendations, yet current LLM-based recommendations still focus on using item meta-data and underutilize this resource. In this paper, we advocate for prioritizing explicit context feedback in the next generation of LLM-based RecSys. We review the evolution of recommendation paradigms, highlight the value of context-rich feedback, call for new benchmarks and metrics, and introduce frameworks for integrating explicit user signals into scalable LLM-driven RecSys. Centering on user-preference modeling, we aim to foster more personalized, transparent, and explainable RecSys online platforms.

2605.29138 2026-05-29 cs.RO cs.AI cs.LG cs.SY eess.SY Updated version

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

Qitao Weng, Heechul Yun

Affiliations University of Kansas Lawrence

AI summary Proposes a multi-resolution end-to-end CNN that uses runtime input-resolution selection and resolution retargeting to optimize the latency-safety tradeoff in autonomous driving under a latency budget.

Comments ICCPS 2026

Abstract

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics - lane invasions, red-light infractions, and collisions - relative to fixed-resolution baselines.
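
The runtime selection of an input scale under a latency budget can be illustrated with a minimal sketch. The latency table, the resolutions, and the function name `select_resolution` are hypothetical and not taken from the paper; the rule shown (largest resolution that fits, cheapest one as a fallback) is only one plausible policy.

```python
from typing import Dict, Tuple

Resolution = Tuple[int, int]  # (width, height)

# Hypothetical per-resolution inference latencies in milliseconds.
LATENCY_MS: Dict[Resolution, float] = {
    (160, 120): 4.0,
    (320, 240): 9.0,
    (640, 480): 25.0,
}


def select_resolution(budget_ms: float, latency_ms: Dict[Resolution, float]) -> Resolution:
    """Return the largest resolution whose estimated latency fits the budget.

    Falls back to the cheapest resolution when nothing fits.
    """
    feasible = [res for res, cost in latency_ms.items() if cost <= budget_ms]
    if not feasible:
        return min(latency_ms, key=lambda res: latency_ms[res])
    return max(feasible, key=lambda res: res[0] * res[1])


print(select_resolution(10.0, LATENCY_MS))
print(select_resolution(1.0, LATENCY_MS))
```

In a real system the latency estimates would presumably be measured on the target hardware and updated as compute availability changes.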

2605.29129 2026-05-29 cs.AI cs.CY econ.GN q-fin.EC Updated version

Governing Technical Debt in Agentic AI Systems

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

Affiliations School of Business, University of Pittsburgh

AI summary Defines Agentic Technical Debt and Stochastic Tax for agentic AI systems and proposes managing these liabilities and operating costs through lightweight dashboards and governance controls.

Abstract

Agentic AI systems are increasingly being explored as production infrastructure: they reason over multiple steps, call tools, act through workflows, and adapt through memory and feedback. These systems create governance challenges that are not fully captured by traditional software or predictive ML technical debt. We define Agentic Technical Debt as the accumulated liability created when prompts, memory, tool schemas, orchestration graphs, control policies, and observability routines are patched together faster than they can be validated, standardized, and governed. We define Stochastic Tax as the recurring operating burden of keeping probabilistic agent behavior within acceptable bounds. The distinction matters: debt is a stock of design and governance liability, while the tax is a flow of operating cost that arises because stochastic agents act through tools and workflows. We outline how managers can make both visible through lightweight dashboards and governance controls.

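2605.29126 2026-05-29 cs.LG cs.AI Updated version

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

Affiliations Bioscope AI

AI summary By measuring the angle between a linear probe and the subspace the model actually computes with, the paper finds that probes can learn directions orthogonal to the model's computation, exposing a fundamental flaw in probe-based interpretability.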

Abstract

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert \emph{when} (absolute date) into \emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$-$9\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
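
The comparison against a random-subspace null can be sketched in simplified form. The abstract concerns the angle between two subspaces; the sketch below only handles a single direction against a k-dimensional subspace, and the dimensions, trial count, and helper names are all assumptions chosen for illustration.

```python
import math
import random
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def orthonormalize(vectors: List[Vector]) -> List[Vector]:
    """Gram-Schmidt: turn linearly independent vectors into an orthonormal basis."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        basis.append(normalize(w))
    return basis


def angle_to_subspace(direction: Vector, basis: List[Vector]) -> float:
    """Angle in degrees between a direction and the span of an orthonormal basis."""
    d = normalize(direction)
    proj_norm = math.sqrt(sum(dot(d, b) ** 2 for b in basis))
    return math.degrees(math.acos(min(1.0, proj_norm)))


def random_null_angles(dim: int, k: int, trials: int, seed: int = 0) -> List[float]:
    """Angles between random directions and random k-dimensional subspaces."""
    rng = random.Random(seed)

    def gauss_vector() -> Vector:
        return [rng.gauss(0.0, 1.0) for _ in range(dim)]

    angles = []
    for _ in range(trials):
        basis = orthonormalize([gauss_vector() for _ in range(k)])
        angles.append(angle_to_subspace(gauss_vector(), basis))
    return angles


null = random_null_angles(dim=64, k=4, trials=200)
print(sum(null) / len(null))  # typical angle under the random null
```

A measured probe-to-subspace angle that falls inside this null distribution would be consistent with the probe direction being unrelated to the causal subspace.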

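2605.29123 2026-05-29 cs.AI cs.CL Updated version

The Confidence Shortcut: A Reasoning Failure Mode of Masked Diffusion Models

Dueun Kim, Albert No

Affiliations Department of Artificial Intelligence, Yonsei University

AI summary Identifies a reasoning failure mode of masked diffusion models under confidence-based decoding: locally easy parts are predicted prematurely while long-range dependencies are ignored, which raises error rates on complex inputs, whereas random-masking training preserves the reasoning-trajectory conditionals.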

Abstract

Masked diffusion language models (MDMs) uniquely support any-order generation, with confidence-based decoding currently serving as the de facto standard inference policy. To optimize for this, recent training schemes attempt to align training mask patterns directly with those observed during generation. However, we argue that confidence-based decoding is inherently misaligned with the logical-flow trajectories required for complex reasoning, and that confidence-aligned training actively entrenches this misalignment. We make this concrete using multi-digit addition, where the decoding strategy prematurely predicts locally easy digits before resolving their long-range dependencies, producing high-confidence errors on challenging inputs. While traditional random masking keeps the failure rate low on this challenging tail, confidence-aligned training amplifies the error rate by an order of magnitude. Across five distinct reasoning tasks, this same pattern emerges with task-dependent severity: confidence-based decoding induces failures on highly complex inputs, and confidence-aligned training exacerbates them. In contrast, random masking -- despite its perceived inefficiency -- robustly preserves the reasoning-trajectory conditionals essential for solving the challenging tail.
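
The long-range dependency in multi-digit addition can be made concrete with plain arithmetic. The sketch below is not the paper's experiment; it only shows that changing the lowest digit of one operand can flip every higher digit of the sum through a carry chain, which is why committing early to digits that look locally easy can be wrong. The helper names are hypothetical.

```python
from typing import List


def digits_msb_first(n: int, width: int) -> List[int]:
    """Digits of n, most significant first, left-padded with zeros."""
    return [int(c) for c in str(n).zfill(width)]


def changed_positions(a: int, b1: int, b2: int, width: int) -> List[int]:
    """Positions (most significant first) where a + b1 and a + b2 differ."""
    s1 = digits_msb_first(a + b1, width)
    s2 = digits_msb_first(a + b2, width)
    return [i for i, (x, y) in enumerate(zip(s1, s2)) if x != y]


# Only the last digit of the second operand changes (5000 -> 5001).
print(changed_positions(4999, 5000, 5001, width=5))  # carry chain: all digits change
print(changed_positions(1234, 5000, 5001, width=5))  # no carry: only the last digit
```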

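2605.29121 2026-05-29 math.DS cs.AI cs.LG Updated version

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

O. M. Kiselev

Affiliations Innopolis University

AI summary Proposes a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts layer, derived as the mean-field limit of a discrete reinforcement rule. A supercritical pitchfork bifurcation leads to load imbalance, and exact parametric equations are derived for the bifurcation set and the local normal form of the cusp catastrophe.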

Comments 21 pages, 11 figures

Abstract

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.
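
A toy version of the symmetric case can be simulated directly. The equations below are an assumed reading of the abstract, not the paper's own model: scores follow ds_i/dt = p_i - lam * s_i with p = softmax(beta * s), so the score difference d = s_1 - s_2 obeys dd/dt = tanh(beta * d / 2) - lam * d. Under that assumption the balanced state d = 0 loses stability at beta = 2 * lam.

```python
import math


def drift(d: float, beta: float, lam: float) -> float:
    """Assumed mean-field drift of the score difference d = s1 - s2."""
    return math.tanh(beta * d / 2.0) - lam * d


def settle(d0: float, beta: float, lam: float, dt: float = 0.01, steps: int = 20000) -> float:
    """Integrate the drift with explicit Euler steps and return the final state."""
    d = d0
    for _ in range(steps):
        d += dt * drift(d, beta, lam)
    return d


# Below the assumed threshold (beta < 2 * lam) a small imbalance dies out;
# above it, the imbalance grows to one of two symmetric states.
for beta in (1.0, 4.0):
    print(beta, settle(0.01, beta, lam=1.0), settle(-0.01, beta, lam=1.0))
```

The sign of the initial perturbation decides which of the two asymmetric states is reached, which is the qualitative signature of a supercritical pitchfork.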

2605.29119 2026-05-29 cs.AI Updated version

PRO-CUA: Process-Reward Optimization for Computer Use Agents

Yifei He, Rui Yang, Hao Bai, Tong Zhang, Han Zhao

Affiliations University of Illinois Urbana-Champaign

AI summary Proposes PRO-CUA, a framework that uses a process reward model and step-level reinforcement learning to address imitation bottlenecks and sparse rewards in training computer use agents.

Abstract

Computer use agents (CUAs) have shown strong potential for automating complex digital workflows, yet their training remains constrained by costly live environment interaction and limited high-quality supervision. Existing filtered behavior cloning pipelines suffer from imitation bottlenecks, including distribution shift from the expert demonstration and the absence of negative learning signals. Meanwhile, standard trajectory-level reinforcement learning struggles with sparse rewards, ambiguous credit assignment, and high infrastructure costs for long-horizon GUI interaction. In this work, we propose PRO-CUA, a process-reward optimization framework for training CUAs with iterative step-level reinforcement learning. PRO-CUA decouples on-policy environment interaction from policy optimization: the current policy collects states through live rollouts, generates diverse candidate actions for each state, receives step-level feedback from a process reward model (PRM), and is optimized with group-relative advantages. This design enables dense and flexible credit assignment without relying on golden answers or offline expert trajectories, while reducing distribution shift by training on the agent's own execution states. Experiments on live web benchmarks demonstrate the effectiveness of PRO-CUA and the reliability of PRM-guided step-level training.
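
The step-level scoring described above can be sketched as follows. The abstract does not state how the group-relative advantages are computed; the mean-and-standard-deviation normalization below is the form commonly used in GRPO-style methods and is an assumption here, as are the function name and the example PRM scores.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize each candidate's reward against the group sampled at the same state."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical PRM scores for four candidate actions proposed at one GUI state.
prm_scores = [0.9, 0.2, 0.5, 0.4]
advantages = group_relative_advantages(prm_scores)
print(advantages)
```

Candidates scored above the group mean receive positive advantages and the rest negative ones, so every sampled state yields a learning signal without a trajectory-level outcome reward.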

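2605.29116 2026-05-29 cs.AI Updated version

Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

Affiliations Bioscope AI

AI summary Proposes trace-level synthesis, which generates diverse reasoning traces through semantic-preserving input perturbations and uses anchored refinement to guarantee non-degradation, recovering correct solutions even when majority voting fails and going beyond answer-based aggregation.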

Abstract

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the \emph{aggregation paradox}. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.

2605.29115 2026-05-29 cs.CR cs.AI Updated version

unix-ctf: Procedural Environments for Unix-Competence Reinforcement Learning

Geoffrey Bradway, Roger Creus Castanyer, Lorenz Wolf, Maxwill Lin, Matthew James Sargent, Augustine N. Mavor-Parker

Affiliations GitHub

AI summary Presents unix-ctf, an environment that procedurally generates capture-the-flag tasks for shell agents. An LLM-assisted synthesis pipeline produces reusable hide-and-find script pairs, and fine-tuning Qwen3-8B on them lifts the solve rate from 11.6% to 43.6%, suggesting that Unix competence is separable and trainable.

Abstract

Unix competence is the ability to use shell and operating-system primitives as first-class tools, not merely to write programs through a terminal. Current terminal benchmarks tend to blur this distinction: a solver fluent in Python but weak in Unix can pass a substantial fraction of Terminal-Bench 2.0, while the reverse skill profile is rarely exercised. We make the distinction operational and build a training surface for the Unix component. unix-ctf is a procedural generator of capture-the-flag tasks for shell agents. Each task hides a short token (a flag of the form flag(a3b1c9...)) inside a fresh Linux container using a single Unix feature, and the agent must recover it. Tasks are produced by an LLM-assisted synthesis pipeline that generates candidate hiding techniques, rewrites them into parameterized hide-and-find script pairs, and filters them with a bidirectional contract: the hide script must leave no plaintext trace of the flag on disk, and the find script must recover the flag in a fresh directory. Because the LLM only writes the planting and recovery steps (the container, layout, and grading harness are fixed), the pipeline lands 656 of 750 raw attempts as portable, reusable variants (87.5%). Our reproduction of Endless Terminals' full-container-generation approach lands only 17.4% under the same checks. The 656 variants canonicalize to 155 distinct techniques. Fine-tuning Qwen3-8B with LoRA using GRPO on this surface lifts solve rate from 11.6% to 43.6% on a 15-skill multi-family holdout (n=225), redistributes which InterCode-CTF tasks the model solves, and produces a +33 pp gain in Forensics while reaching 32/100 on InterCode-CTF. These results suggest that Unix competence is separable, trainable, and best evaluated directly rather than folded into programming-through-a-shell.
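
The bidirectional contract can be illustrated with a small sketch. Everything here is hypothetical: the real tasks use shell scripts inside Linux containers, whereas this stand-in uses Python functions in a temporary directory, a base64-encoded dotfile as the hiding technique, and a made-up flag value. It only shows the two checks named in the abstract: no plaintext trace on disk, and successful recovery.

```python
import base64
import os
import tempfile
from typing import Callable


def hide_b64(flag: str, root: str) -> None:
    """Hypothetical hide step: store the flag base64-encoded in a dotfile."""
    with open(os.path.join(root, ".cache_index"), "w") as fh:
        fh.write(base64.b64encode(flag.encode()).decode())


def find_b64(root: str) -> str:
    """Matching find step: decode the dotfile written by hide_b64."""
    with open(os.path.join(root, ".cache_index")) as fh:
        return base64.b64decode(fh.read()).decode()


def hide_plain(flag: str, root: str) -> None:
    """A deliberately bad hide step that writes the flag in plaintext."""
    with open(os.path.join(root, "notes.txt"), "w") as fh:
        fh.write(flag)


def find_plain(root: str) -> str:
    with open(os.path.join(root, "notes.txt")) as fh:
        return fh.read()


def leaves_plaintext(flag: str, root: str) -> bool:
    """True if the flag appears verbatim in any file name or file content."""
    needle = flag.encode()
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            if needle in name.encode():
                return True
            with open(os.path.join(dirpath, name), "rb") as fh:
                if needle in fh.read():
                    return True
    return False


def passes_contract(flag: str, hide: Callable[[str, str], None], find: Callable[[str], str]) -> bool:
    """Run hide in a fresh directory, then require no plaintext and exact recovery."""
    with tempfile.TemporaryDirectory() as root:
        hide(flag, root)
        if leaves_plaintext(flag, root):
            return False
        return find(root) == flag


print(passes_contract("flag(deadbeef)", hide_b64, find_b64))
print(passes_contract("flag(deadbeef)", hide_plain, find_plain))
```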

2605.29096 2026-05-29 cs.AI Updated version

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

Sandra Woolley, Tim Collins, Khalid Khattak, Illia Chernomorets, Ariane Arevalo, Chris Richardson

Affiliations Keele University, UK; Manchester Metropolitan University, UK

AI summary Analyzes records from the ClinicalTrials.gov registry to characterize temporal trends and the geographical distribution of AI-related clinical trials, and explores a hybrid human-AI approach to analyzing human-AI interaction trends in registered trials.

Comments 8 pages plus 2 pages references and appendix

Abstract

This paper examines records retrieved from the ClinicalTrials.gov registry to characterize temporal trends in AI terminology and the geographical distribution of AI trials. The work also reports on an exploratory hybrid human-AI approach to analyzing human-AI interaction trends in registered clinical trials. The hybrid workflow comprised a frontier generative AI model (GPT-5.5) and human review to screen and categorize records returned by an AI-focused search. The findings indicate a marked increase in AI-related trials over time, with recent growth in references to machine learning, deep learning, chatbots, GPTs, and large language models. Geographically, China and the United States accounted for the largest numbers of AI-related trials, with notable recent increases in several other countries including Italy, France, Spain, the UK and Turkey (Türkiye). In a random sample of 100 records, human and AI classifiers showed good agreement in identifying studies not substantively using AI, but lower agreement in classifying human-AI interaction, particularly where health professional interaction was ambiguous or insufficiently described. Overall, the results suggest that hybrid human-AI screening of clinical trial records is potentially viable, but clearer trial reporting and more precise interaction definitions will benefit the process.

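2605.29089 2026-05-29 cs.LG cs.AI cs.CV Updated version

OISD: On-Policy Internal Self-Distillation of Language Models

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

Affiliations Auburn University; William & Mary

AI summary Proposes the OISD framework, which distills predictive signals from the final layer into intermediate layers through logit alignment and attention alignment to improve reasoning, with substantial gains over baselines on mathematical reasoning tasks.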

Comments Under Review for Publication

Abstract

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen--Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD
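
The Jensen-Shannon term used for alignment can be written out for two discrete distributions. This is only the standard divergence; how the signed advantage weighting enters the OISD loss is not specified in the abstract and is left out. The example distributions standing in for final-layer and intermediate-layer predictions are hypothetical.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in nats, skipping terms where p is zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence: symmetric and bounded above by log(2)."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions over a three-token vocabulary.
final_layer = [0.7, 0.2, 0.1]
intermediate_layer = [0.4, 0.4, 0.2]
print(js_divergence(final_layer, intermediate_layer))
```

Unlike plain KL, the divergence stays finite when the two distributions have different supports, which may be one reason to prefer it for aligning an early layer with the final one.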

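2605.29087 2026-05-29 cs.AI Updated version

The Chain Holds, the Answer Folds: Trace-Answer Dissociation in Reasoning Models Under Adversarial Pressure

Yubo Li, Ramayya Krishnan, Rema Padman

Affiliations Carnegie Mellon University

AI summary Using a 2×2 latent-versus-behavioral framework, the study finds that reasoning models under sustained adversarial pressure exhibit an "unfaithful capitulation" failure mode, in which the chain-of-thought stays correct while the answer is wrong, and examines the role of the reasoning channel.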

Abstract

Reasoning models are evaluated on single-turn benchmarks but deployed in multi-turn dialogue, where users push back on correct answers. Under sustained adversarial pressure we find a previously undocumented failure mode: the chain-of-thought stays factually correct from first turn to last while the emitted answer flips wrong. We call this unfaithful capitulation (UC) and isolate it with a $2\times 2$ latent-versus-behavioral framework that flip-rate metrics and single-turn faithfulness probes both miss. Across three datasets (MT-Consistency, MMLU-Pro, GSM8K), the latent-correct rate at the behavioral flip clusters near 50% in think mode and collapses to 11-15% under no_think -- paired, within-model causal evidence that reasoning creates the gap. Across models the effect tracks the reasoning channel (high in Qwen3-32B and GPT-OSS-20B, low in inline-CoT Gemma-4-31B-it). An independent GPT-4o judge corroborates 86% of UC labels; a token-level probe shows the answer-slot argmax is correct in 84% of UC cells; and a naive trace-anchored defense backfires. We release all trajectories, traces, and judge labels.
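
The 2×2 framework can be sketched as a small classifier over two booleans per turn: whether the reasoning trace is correct (latent) and whether the emitted answer is correct (behavioral). Only the "unfaithful capitulation" cell is named in the abstract; the other three labels and both function names are hypothetical.

```python
from typing import List, Tuple


def classify(trace_correct: bool, answer_correct: bool) -> str:
    """Place one turn in the latent-versus-behavioral grid."""
    if trace_correct and answer_correct:
        return "held"  # hypothetical label
    if trace_correct and not answer_correct:
        return "unfaithful_capitulation"  # the UC cell from the abstract
    if not trace_correct and answer_correct:
        return "answer_only_correct"  # hypothetical label
    return "full_capitulation"  # hypothetical label


def latent_correct_rate_at_flip(turns: List[Tuple[bool, bool]]) -> float:
    """Among turns with a wrong answer, the fraction whose trace is still correct."""
    traces_at_flip = [trace for trace, answer in turns if not answer]
    if not traces_at_flip:
        return 0.0
    return sum(traces_at_flip) / len(traces_at_flip)


# Hypothetical (trace_correct, answer_correct) pairs.
turns = [(True, True), (True, False), (False, False), (True, False), (False, False)]
print([classify(t, a) for t, a in turns])
print(latent_correct_rate_at_flip(turns))
```

A flip-rate metric alone would count four flips here without revealing that half of them kept a correct trace.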

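2605.29084 2026-05-29 cs.CL cs.AI cs.IR Updated version

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Yubo Li, Rema Padman, Ramayya Krishnan

Affiliations Carnegie Mellon University

AI summary Proposes source-dependence as a missing axis of NLP evaluation and audits the failure mode in which a multi-source RAG system gives different answers to the same question depending on the retrieved source, using the transplant patient education benchmark TransplantQA, the hierarchical retrieval strategy HERO-QA, and a structured-output judge.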

Abstract

A retrieval-augmented generation (RAG) system deployed over a multi-author institutional corpus can give a different answer to the same question depending on which source it retrieves -- a failure mode the dominant single-gold-answer paradigm cannot diagnose. We argue that source-dependence is a missing axis of NLP evaluation, and that auditing it means shifting the unit of evaluation from answer correctness to the inter-source relationship. We make this concrete in transplant patient education, where institutional sources demonstrably disagree, releasing three artefacts: TransplantQA, a benchmark of real patient questions, each answered by grounding generation in multiple institutional handbooks as candidate sources; HERO-QA, a hierarchical retrieval strategy that grounds and audits each answer; and a structured-output judge that scores inter-source relationships on a validated 5-label taxonomy. At scale, better retrieval reveals far more disagreement than prior estimates suggested -- understating its prevalence, not its intensity. The framework is domain-agnostic and transfers to legal and educational RAG: measuring source-dependence is a responsibility for deployed multi-source NLP generally.

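2605.29082 2026-05-29 cs.AI Updated version

The Importance of Out-of-Band Metadata for Safe Autonomous Agents: The Redpanda Agentic Data Plane

Tyler Akidau, Tyler Rockwood, Johannes Brüderl, Marc Millstone

Affiliations Redpanda USA / Germany

AI summary To address the unreliability of autonomous agents in handling security-critical metadata, proposes the Redpanda Agentic Data Plane, an architecture built on out-of-band metadata channels that deterministically enforces access policies, data classifications, and behavioral constraints, demonstrated with a multi-agent portfolio rebalancing system.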

Comments 6 pages, 1 figure. Published at SAO '26 (co-located with ACM CAIS '26)

Abstract

AI agents are increasingly expected to operate as digital employees: accessing enterprise data, making decisions, and taking actions autonomously. But agents are simultaneously less predictable than humans -- prone to hallucination, misinterpretation, and adversarial manipulation -- and more technically capable, with deep system knowledge and high-throughput interfaces that can cascade damage at machine speed. This combination makes it unsafe to rely on agents to faithfully interpret or propagate security-critical metadata such as access policies, data classifications, and behavioral constraints. We present the Redpanda Agentic Data Plane (ADP), an architecture built around out-of-band metadata channels: infrastructure pathways that carry security context, policy signals, and audit trails deterministically, entirely outside the agent's read and write path and across heterogeneous infrastructure. These channels enforce governance at every stage of the agent lifecycle -- scoping data access on the way in, constraining actions during execution, and capturing tamper-proof transcripts on the way out. We demonstrate ADP with a multi-agent portfolio rebalancing system in which autonomous agents monitor markets, make trade decisions, and execute orders across isolated client accounts -- with per-client data scoping, trade approval thresholds, and tamper-proof audit trails all enforced by out-of-band channels the agents can neither see nor bypass.

2605.29078 2026-05-29 cs.AI cs.LG Updated version

Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics

Jonathan Hoss, Noah Klarmann

Affiliations Department of Industrial Engineering; Rosenheim Technical University of Applied Sciences

AI summary Proposes a policy-neutral execution and measurement layer that constructs decision-valid snapshots, defines a standardized execution contract, and records outcome divergences, turning execution uncertainty into observable, structured data to bridge the gap between simulation and reality.

Comments Accepted for publication at the 24th IEEE International Conference on Industrial Informatics (INDIN 2026), held from 26 to 29 July 2026 in Melbourne, Australia

Abstract

Event-driven scheduling policies are increasingly deployed in industrial environments, where decisions are made under asynchronous and partially observed system states. As a result, decision states are not temporally consistent, action admissibility is not explicitly defined, and the origin of execution errors remains ambiguous. These issues limit both reliability and interpretability. To address this gap, a policy-neutral execution and measurement layer is proposed to mediate between scheduling policies and the industrial execution environment. The layer constructs decision-valid snapshots from asynchronous event streams, defines a standardized execution contract with explicit action admissibility, and records outcomes as divergences between policy intent, transactional outcomes, physical execution, and human intervention. This enables a separation between decision semantics and execution behavior and makes deployment mismatch observable and structurally attributable. The proposed framework is evaluated using a discrete-event simulation. The results show analytical benefits across all observation lag regimes, as undifferentiated execution failures are transformed into structured, typed outcomes with full attribution coverage. Operational benefits are strongest under low observation lag, where avoidable execution errors can be prevented before commitment. Overall, the layer turns execution uncertainty into supervisory data for evaluation and policy refinement.

2605.29068 2026-05-29 cs.AI cs.CL cs.CR cs.LG 版本更新

Robust and Efficient Guardrails with Latent Reasoning

具有潜在推理的鲁棒高效防护栏

Siddharth Sai, Xiaofei Wen, Muhao Chen

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出COLAGUARD模型,通过阶段式训练将多步安全推理转移到连续潜在空间,在保持高安全性能的同时实现12.9倍加速和22.4倍令牌减少。

AI中文摘要

随着大型语言模型(LLMs)在现实应用中的日益部署,维护其安全性至关重要。现有的安全防护栏通常依赖单次分类或更近期的蒸馏推理。基于推理的防护栏显著优于仅分类的基线,但会带来大量的查询延迟和令牌开销,使其不适用于高吞吐量部署。为了解决这一挑战,我们提出了COLAGUARD,一种通过阶段式训练课程将多步安全推理转移到连续潜在空间的防护栏模型,从而在推理时实现直接的隐藏状态传播。在涵盖八个安全基准的十个提示和响应审核设置上评估,COLAGUARD在宏观F1上比Llama Guard 3提高了8.24分,并与我们的显式推理基线GuardReasoner在宏观F1上相当,同时实现了12.9倍的加速和22.4倍的令牌使用减少。我们的结果表明,潜在推理为可部署的防护栏提供了一种实用的替代方案,以替代显式理由生成,共同提高安全鲁棒性和推理效率,而不是将它们视为竞争目标。

英文摘要

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9X speedup and 22.4X reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.

2605.29059 2026-05-29 cs.SE cs.AI cs.CR 版本更新

SCDBench: A Benchmark for LLM-Based Smart Contract Decompilers

SCDBench: 基于大语言模型的智能合约反编译基准

Kaihua Qin, Dawn Song, Arthur Gervais

发表机构 * University of Warwick(沃里克大学) UC Berkeley(加州大学伯克利分校) University College London(伦敦大学学院)

AI总结 针对现有智能合约反编译评估缺乏统一基准的问题,提出SCDBench数据集与评估方法,通过四阶段累积评估(格式完整性、可编译性、ABI恢复、语义一致性)测试前沿LLM的反编译能力,发现语义一致性仍远未解决。

AI中文摘要

智能合约反编译旨在从字节码恢复高级源代码,但评估反编译器仍然困难,因为现有研究使用狭窄的数据集、不一致的度量标准和有限的语义一致性检查。随着大语言模型(LLMs)开始生成类似源代码的Solidity代码,这些代码可能编译通过并看似合理,即使其语义与原始合约存在偏差,这一差距变得日益重要。我们引入了SCDBench,一个用于基于LLM的智能合约反编译的数据集和基准方法。该数据集包含600个真实世界的Solidity合约,配有配对的字节码输入、真实源代码和可重放的语义检查点。SCDBench通过四个累积阶段评估反编译器的输出:格式完整性、可编译性、应用程序二进制接口(ABI)恢复以及通过差分重放实现的语义一致性。我们在零样本反编译设置中评估了Claude Opus 4.7、GPT-5.3-Codex和GLM-5,包括具有和不具有扩展推理的GLM-5变体,以及零样本编译修复设置。结果表明,前沿LLM通常能够生成结构化和可编译的Solidity代码,但实现语义一致性仍远未解决:表现最好的前沿模型仅完美反编译了42/600个合约。我们进一步表明,引入同模型编译修复在适度增加成本的情况下显著提升了性能。SCDBench为严格、可重复的评估建立了共同基础,旨在加速开发用于区块链安全性和透明度的可靠智能合约反编译器。

英文摘要

Smart contract decompilation aims to recover high-level source code from bytecode, but evaluating decompilers remains difficult because existing studies use narrow datasets, inconsistent metrics, and limited semantic consistency checks. This gap is increasingly important as large language models (LLMs) begin to generate source-like Solidity that may compile and appear plausible, even when its semantics diverge from the original contract. We introduce SCDBench, a dataset and benchmark methodology for LLM-based smart contract decompilation. The dataset contains 600 real-world Solidity contracts with paired bytecode inputs, ground-truth source code, and replayable semantic checkpoints. SCDBench evaluates decompiler outputs through four cumulative stages: format completeness, compilability, Application Binary Interface (ABI) recovery, and semantic consistency via differential replay. We evaluate Claude Opus 4.7, GPT-5.3-Codex, and GLM-5 in a zero-shot decompilation setting, including GLM-5 variants with and without extended reasoning and a zero-shot compilation-repair setting. The results show that frontier LLMs can often produce structured and compilable Solidity, but achieving semantic consistency remains far from solved: the best-performing frontier model perfectly decompiles only 42/600 contracts. We further show that introducing same-model compilation repair substantially improves performance at modest additional cost. SCDBench establishes a common ground for rigorous, reproducible evaluation and aims to accelerate the development of reliable smart contract decompilers for blockchain security and transparency.

2605.29055 2026-05-29 cs.AI cs.MA 版本更新

Hallucination Mitigation with Agentic AI, Nested Learning, and AI Sustainability via Semantic Caching

基于智能体AI、嵌套学习与语义缓存的幻觉缓解与AI可持续性

Diego Gosmar, Deborah A. Dahl

发表机构 * Head of AI, Tesisquare; Member, Open Voice Interoperability Initiative, Linux Foundation AI & Data(Tesisquare AI负责人;Linux基金会AI与数据旗下开放语音互操作性倡议成员) Principal, Conversational Technologies; Member, Open Voice Interoperability Initiative, Linux Foundation AI & Data(Conversational Technologies负责人;Linux基金会AI与数据旗下开放语音互操作性倡议成员)

AI总结 提出一种HOPE启发的嵌套学习架构,结合连续记忆系统和语义缓存,通过三阶段智能体管道在混合基准上实现幻觉缓解,同时降低能耗并提高可观测性。

Comments 21 pages, 14 figures

AI中文摘要

幻觉仍然是生产级LLM系统的主要可靠性障碍,特别是在多智能体管道中,缺乏依据的声明可能在各阶段不受控制地传播。本文将一种受HOPE启发的嵌套学习架构与连续记忆系统(CMS)和语义相似性缓存相结合,应用于一个混合基准测试,该基准包含310个提示,包括217个认知不确定性提示和93个虚构诱导压力测试提示。通过Open Floor Protocol(OFP)编排的三阶段智能体管道使用五个KPI进行评估,即事实声明密度(FCD)、事实依据参考(FGR)、虚构免责声明频率(FDF)、显式情境化得分(ECS)和可观测性得分比率(OSR),并聚合为总幻觉得分(THS),在五种权重配置下研究缓解与可观测性之间的权衡。FDF、ECS、OSR和FGR作为缓解信号被减去,因此更负的THS表示更强的缓解。FrontEndAgent被配置为高随机性生成器(温度=1.0)以产生真实的幻觉基线,而SecondLevelReviewer和ThirdLevelReviewer作为渐进式纠正器运行。这种非对称设计在五种权重配置下实现了-31.3%至-35.9%的端到端THS降幅。语义缓存在930次潜在调用中实现了440次缓存命中(命中率47.3%),将LLM调用减少至490次,降低了能耗和二氧化碳当量(CO2e)足迹,使多阶段审查管道在生产规模下操作可行。ExtremeObservability配置获得了最负的最终THS(-0.0709),证实了高可观测性配置强化而非损害缓解效果。这些发现表明,记忆增强的多智能体设计可以在无需模型重新训练的情况下,共同提高事实可靠性、操作效率和可审计性。

英文摘要

Hallucination remains a major reliability barrier for production LLM systems, particularly in multi-agent pipelines where unsupported claims can propagate unchecked across stages. This paper adapts a HOPE-inspired Nested Learning architecture with Continuum Memory Systems (CMS) and semantic similarity caching to a hybrid benchmark of 310 prompts combining 217 epistemic-uncertainty prompts and 93 fabrication-induction stress-test prompts. A three-stage agentic pipeline orchestrated via the Open Floor Protocol (OFP) is evaluated with five KPIs -- FCD (Factual Claim Density), FGR (Factual Grounding References), FDF (Fictional Disclaimer Frequency), ECS (Explicit Contextualization Score), and OSR (Observability Score Ratio) -- aggregated into THS (Total Hallucination Score) across five weighting configurations to study mitigation-observability trade-offs. FDF, ECS, OSR, and FGR are subtracted as mitigation signals, so that a more negative THS indicates stronger mitigation. The FrontEndAgent is configured as a high-stochasticity generator (temperature = 1.0) to produce a realistic hallucination baseline, while the SecondLevelReviewer and ThirdLevelReviewer operate as progressive correctors. This asymmetric design yields end-to-end THS reductions of -31.3% to -35.9% across five weighting configurations. Semantic caching achieves 440 cache hits over 930 potential calls (47.3% hit rate), reducing LLM invocations to 490, lowering energy and CO2e footprint, and making multi-stage review pipelines operationally viable at production scale. ExtremeObservability attains the most negative final THS (-0.0709), confirming that observability-heavy configurations reinforce rather than compromise mitigation. These findings suggest that memory-augmented multi-agent designs can jointly improve factual reliability, operational efficiency, and auditability without model retraining.
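
A minimal sketch of the two pieces of arithmetic in this abstract, assuming THS is a plain weighted sum in which FCD is added and the four mitigation KPIs are subtracted. The function names, the dictionary layout, and the linear form are illustrative assumptions; the abstract does not give the paper's exact aggregation or normalization.

```python
MITIGATION_KPIS = ("FGR", "FDF", "ECS", "OSR")


def total_hallucination_score(kpis, weights):
    """Assumed linear THS: FCD adds to the score, mitigation KPIs subtract."""
    ths = weights["FCD"] * kpis["FCD"]
    for name in MITIGATION_KPIS:
        ths -= weights[name] * kpis[name]
    return ths


def cache_stats(potential_calls, cache_hits):
    """Return (hit rate, remaining LLM invocations) for a semantic cache."""
    if potential_calls <= 0 or not 0 <= cache_hits <= potential_calls:
        raise ValueError("need 0 <= cache_hits <= potential_calls, potential_calls > 0")
    return cache_hits / potential_calls, potential_calls - cache_hits


# Figures reported in the abstract: 440 hits over 930 potential calls.
hit_rate, llm_calls = cache_stats(930, 440)
```

With the reported counts, `hit_rate` rounds to 47.3% and `llm_calls` is 490, matching the abstract.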

2605.29042 2026-05-29 cs.AI cs.LG 版本更新

Differentiable Belief-based Opponent Shaping

基于可微信念的对手塑造

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出D-BOS方法,通过可微的信念更新和梯度传播,在隐藏角色游戏中实现对手信念的塑造,从而自然涌现最优策略。

AI中文摘要

人类协调往往依赖于通过战略行动影响他人信念的能力。在多智能体强化学习中,对手塑造试图复制这种影响,尽管现有方法通常作用于对手的参数、策略或价值空间。同时,隐藏角色游戏中的信念操纵技术通常依赖于硬编码的目标,如欺骗或信念饱和。我们提出基于可微信念的对手塑造(D-BOS),一种一阶方法,将每个观察者的信念视为被塑造的对手状态,并通过$k$步softmax-贝叶斯信念动力学进行微分。我们的方法不显式奖励欺骗或合作行为,而是将信念状态作为塑造目标。这使得最优策略能够从环境奖励结构中自然涌现。这种信念空间公式通过微分对手信念更新提供对手塑造信号,并通过聚合多个观察者个体推断信念轨迹上的梯度,自然地扩展到多个观察者。实验上,D-BOS在隐藏角色游戏中优于PPO和BBM,在混合动机设置中提升最大。

英文摘要

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

2605.29041 2026-05-29 cs.AI 版本更新

Practitioner Beliefs and Behaviors in AI-Enhanced Education: DOT Framework Survey Evidence

AI增强教育中的从业者信念与行为:DOT框架调查证据

David Gibson, M. Elizabeth Azukas, Gerald Knezek

发表机构 * Curtin University(科廷大学) Georgia Tech Research Institute(佐治亚理工研究院) Georgia Institute of Technology(佐治亚理工学院) University of North Texas(北德克萨斯大学)

AI总结 基于DOT框架,通过横截面调查(n=72)探究高等教育从业者对AI整合的信念、行为及制度条件,发现从业者支持AI教学辅助但强调人类监督,且设计导向实践与理论存在差距。

AI中文摘要

本研究报告了一项横截面调查(n=72)的结果,该调查针对高等教育从业者,考察了与人工智能(AI)在教学中整合相关的信念、行为及制度条件。研究基于整合了设计思维和开放系统理论的DOT框架,调查了AI熟悉度、使用模式、设计导向实践和教学信念。对19个信念项目的探索性因子分析识别出一个三因子结构:AI功能能力、监督与治理、以及教师协作与规划(α = .90)。结果表明,从业者对AI作为教学支持持积极态度,同时坚持人类监督和批判性评估。报告的做法强调迭代提示和内容生成,而需求评估和反馈循环的使用不够一致。制度障碍,包括有限的政策、培训和基础设施,被广泛报告。这些发现为DOT框架作为从业者信念和实践的描述性模型提供了初步实证支持,同时也突出了设计导向理论与当前实施之间的差距。本研究贡献了一个初步的测量结构,并指出了进行验证性验证和基于结果的研究方向,这些研究将AI支持的设计实践与教学质量联系起来。

英文摘要

This study reports findings from a cross-sectional survey (n = 72) of higher education practitioners examining beliefs, behaviors, and institutional conditions related to artificial intelligence (AI) integration in teaching and learning. Grounded in the DOT Framework, which integrates design thinking and open systems theory, the study investigates AI familiarity, usage patterns, design-oriented practices, and pedagogical beliefs. Exploratory factor analysis of 19 belief items identified a three-factor structure: AI Functional Capabilities, Oversight and Governance, and Instructor Collaboration and Planning (α = .90). Results indicate that practitioners hold favorable views of AI as a pedagogical support while maintaining strong commitments to human oversight and critical evaluation. Reported practices emphasize iterative prompting and content generation, with less consistent use of needs assessment and feedback loops. Institutional barriers including limited policy, training, and infrastructure were widely reported. These findings provide preliminary empirical support for the DOT Framework as a descriptive model of practitioner beliefs and practices, while also highlighting gaps between design-oriented theory and current implementation. The study contributes an initial measurement structure and identifies directions for confirmatory validation and outcome-based research linking AI-supported design practices to instructional quality.

2605.29028 2026-05-29 cs.LG cs.AI 版本更新

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Return-to-Go 不仅仅是数字:面向回报条件监督学习的 Q 引导对齐

Yuxiao Yang, Weitong Zhang

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出 Q-ALIGN DT 框架,通过确保输出策略的 Q 值与输入 RTG 一致,实现回报条件序列模型中 RTG 与策略性能的对齐,在 D4RL 基准上取得优越的可控性和性能。

Comments 28 pages, 13 figures, 20 tables, accepted by ICML 2026

AI中文摘要

条件序列模型 (CSMs) 通过将 return-to-go (RTG) 作为控制信号来学习策略。然而,现有的 CSMs 通常将 RTG 视为简单的数值输入,而不是将其与策略的性能对齐。在本文中,我们提出了 Q-ALIGN DT 框架,通过确保输出策略的 $Q$-值与输入 RTG 一致来强制执行这种对齐。通过利用 $Q$ 函数为 CSMs 提供密集指导,并使用 RTG-扰动技术结合 CSM 进一步微调,我们的方法确保更高的 RTG 一致地映射到具有更高期望回报的轨迹。理论上,我们证明 Q-ALIGN DT 可以高效地学习期望策略,并在 RTG 足够高时输出接近最优的策略。实验上,我们通过大量实验证明 Q-ALIGN DT 在 D4RL 基准上实现了优越的可控性和性能。值得注意的是,我们的模型有效地学习了一个结构化的策略族,该策略族保持精确对齐,并泛化到速度跟踪等先前方法失败的任务。

英文摘要

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.

2605.29027 2026-05-29 cs.AI cs.CL cs.HC 版本更新

Mind Your Tone: Does Tone Alter LLM Performance?

注意你的语气:语气会改变LLM的性能吗?

Om Dobariya, Akhil Kumar

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 研究提示语气变化如何影响大语言模型在客观选择题上的准确性,发现语气效应系统但高度依赖模型,并提出了解释语气如何调节内部推理模式的路由框架。

Comments 10 pages, 6 tables, 1 figure. Accepted as a full paper at the Thirty-second Americas Conference on Information Systems (AMCIS 2026), Reno. Follow-up to arXiv:2510.04950

AI中文摘要

大语言模型(LLMs)的使用正在激增,但观察到它们的性能因提示风格和语气而异。在本研究中,我们探讨了提示中的语气变化是否以及如何导致LLM在客观多项选择题上的准确性差异。我们使用了两个数据集:一个包含50个基础问题和五种语气变体的数据集,以及一个包含570个基础问题、涵盖57个主题和七种语气变体的MMLU子集。我们进行了实验,评估了四种成本效益高、流行的LLM的性能:ChatGPT-4o、ChatGPT-5-nano、Gemini 2.5 Flash和Gemini 2.5 Flash Lite。跨模型而言,语气效应是系统性的但高度依赖模型。一些模型显示出微小但统计上显著的变化,而另一些模型则在语气间表现出较大的准确性波动。此外,我们识别了主题层面的语气敏感性差异,并提出了一个路由框架来解释语气如何调节内部推理模式。我们的发现提醒用户不要假设LLM部署中具有语气鲁棒性的可靠性。

英文摘要

The use of Large Language Models (LLMs) is proliferating, yet their performance is observed to vary based on prompting styles and tones. In this study, we investigate both whether and how tonal variations in prompts lead to disparate LLM accuracy for objective multiple-choice questions. We use two datasets: a 50-base question dataset with five tone variants and a 570-base question MMLU subset spanning 57 subjects with seven tone variants. Experiments were conducted to evaluate the performance of four cost-efficient, popular LLMs: ChatGPT-4o, ChatGPT-5-nano, Gemini 2.5 Flash, and Gemini 2.5 Flash Lite. Across models, tonal effects are systematic but highly model-dependent. Some models show small, yet statistically significant, shifts, while others exhibit large accuracy swings across tones. Further, we identify subject-level differences in tone sensitivity and present a routing framework to explain how tones may attune internal reasoning modes. Our findings caution users against assuming tone-robust reliability in LLM deployments.

2605.29025 2026-05-29 cs.AI cs.CY cs.HC 版本更新

When Models Disagree: Rethinking LLM Evaluation for Public Comment Analysis

当模型存在分歧:重新思考用于公众评论分析的LLM评估

Aisha Najera, Alvin Moon, Vedant Srinivasan, Rajesh Veeraraghavan

发表机构 * AI Lab, Princeton University Engineering and Applied Sciences(普林斯顿大学人工智能实验室、工程与应用科学学院) RAND Corporation(兰德公司) Engineering and Applied Sciences(工程与应用科学) Science, Technology, and International Affairs(科学、技术与国际事务) Georgetown University(乔治城大学)

AI总结 提出一种解释性审计流程,利用多模型分歧检测解释复杂性,引导人工审查关注真正模糊的公众意见,以补充传统基于准确率的评估方法。

AI中文摘要

联邦机构正在部署大型语言模型(LLM)对公众评论语料进行分类,模型对记录的组织方式会影响政策制定者看到的内容以及哪些论点被记录。基于小规模验证集上的立场准确率的标准评估无法检测不同模型对同一公众输入产生实质性不同分类的情况。我们提出了一种解释性审计流程,将多模型分歧视为解释复杂性的诊断,并引导人工审查关注真正模糊的公众输入。通过分析四个LLM对联邦USDA案卷中1,260条公众评论的结果,我们发现模型间的主题分歧超过了模型内的提示变化,并且专家评分标准抑制了深层的解释分歧而未解决它。在一项针对分层抽样的40条评论子样本的两阶段标注研究中,四个LLM和一名人工标注员独立标注,然后在看到其他标注员的标签后进行修订。修订行为在不同标注员之间有所不同,人工标注员的修订经常引入整体集成输出中不存在的框架。我们认为基于分歧的评估是LLM辅助解释性编码中准确率指标的必要补充。

英文摘要

Federal agencies are deploying large language models (LLMs) to categorize public comment corpora, where the model's organization of the record shapes what policymakers see and which arguments register. Standard evaluation, anchored on stance accuracy against a small validated set, cannot detect when different models produce materially different categorizations of the same public input. We propose an Interpretive Audit Pipeline that treats multi-model disagreement as diagnostic of interpretive complexity and directs human review toward genuinely ambiguous public input. Analyzing 1,260 public comments on a federal USDA docket across four LLMs, we find that inter-model thematic divergence exceeds within-model prompt variation, and that an expert rubric suppresses deep interpretive disagreement without resolving it. In a two-stage labeling study on a stratified 40-comment subsample, four LLMs and a human annotator labeled independently and then revised after seeing the others' labels. Revision behavior varied across labelers, and the human annotator's revisions frequently introduced framings absent from the ensemble's collective output. We argue disagreement-based evaluation is a necessary complement to accuracy metrics for LLM-assisted interpretive coding.

2605.29018 2026-05-29 cs.AI cs.CL 版本更新

Adopt $\neq$ Adapt: Longitudinal Analyses of LLM Conversations in the Wild

采用 ≠ 适应:真实场景中LLM对话的纵向分析

Rebecca M. M. Hicke, Kiran Tomlinson

发表机构 * Cornell University(康奈尔大学) Microsoft Research(微软研究院)

AI总结 通过分析约12,000名Microsoft Bing Copilot用户的对话轨迹及WildChat-4.8M数据,发现用户行为高度固化,活跃用户更倾向复杂专业任务,且WildChat数据集偏向高熟练度“超级用户”,表明现有用户行为难以改变并揭示用户异质性。

AI中文摘要

尽管越来越多的研究开始描述用户与LLM的交互,但其描绘的画面基本上是静态的;关于个体用户如何随时间改变其行为,我们知之甚少。为填补这一空白,我们分析了约12,000名随机抽样的Microsoft Bing Copilot用户的对话轨迹,并与WildChat-4.8M的数据进行比较。虽然Copilot数据包含显著的人群层面趋势,但我们发现个体用户轨迹中的趋势要弱得多;用户习惯被证明极其顽固。我们还发现不同活跃度用户之间存在显著差异:更活跃的用户拥有更成功的对话,并使用LLM处理更复杂和专业导向的任务。一些用户趋势也出现在WildChat-4.8M中,但我们发现证据表明该数据集显著偏向高熟练度的“超级用户”。最终,我们的结果表明现有用户行为难以改变,并展示了用户异质性的程度。我们数据集之间的比较突显了WildChat并不代表典型的用户-AI交互,这是对数据下游使用的一个重要警示。

英文摘要

Although a growing body of research has begun to describe user--LLM interactions, the picture it paints is largely static; little is known about how individual users change their behavior over time. To address this gap, we analyze the conversational trajectories of $\sim$12,000 randomly sampled Microsoft Bing Copilot users and compare these with data from WildChat-4.8M. While the Copilot data contains significant population-level trends, we find that trends in individual user trajectories are much weaker; user habits prove to be overwhelmingly sticky. We also find stark differences between users of different activity levels: more active users have more successful conversations and use the LLM for more complex and professionally oriented tasks. Some user trends also appear in WildChat-4.8M, but we find evidence that this dataset is significantly skewed towards highly proficient "power" users. Ultimately, our results suggest that existing user behavior is difficult to change and demonstrate the extent of user heterogeneity. Our comparison between datasets highlights that WildChat does not represent typical user-AI interactions, an important caveat for downstream uses of the data.

2605.29009 2026-05-29 cs.LG cs.AI 版本更新

Label-Free Reinforcement Learning via Cross-Model Entropy

无标签强化学习:跨模型熵方法

Matt Gorbett, Hossein Shirazi

发表机构 * Independent Researcher(独立研究者) San Diego State University(圣地亚哥州立大学)

AI总结 提出跨模型熵(CME)作为无标签奖励信号,用于强化学习后训练大语言模型,在开放指令遵循任务上优于基线方法。

AI中文摘要

使用强化学习后训练大语言模型受限于奖励信号。现有方法需要真实可验证的奖励(限制于自动正确性检查领域,如数学、代码执行)或人类偏好标签(收集成本高且易受奖励攻击)。最近的无标签方法用自参考信号(如多数投票或模型自身输出的token熵)替代真实验证器,但可能强化模型自身错误。本文提出跨模型熵(CME),即生成器响应在独立验证器模型下的平均对数似然,作为无标签奖励信号用于强化学习后训练。CME是连续的、无需训练,基于验证器认为不意外的响应可能正确或高质量的准则。由于验证器独立于生成器,该信号无法通过自一致性被操纵。我们将CME集成到GRPO中,不改变训练循环的其他部分,将无标签强化学习扩展到开放指令遵循——自参考信号不适用或不适配的场景。在开放指令遵循(UltraFeedback提示,在AlpacaEval 2.0上评估)上,CME奖励在四个模型家族(Qwen、Llama、Gemma、OLMo)和三种训练范式(预训练、SFT和指令微调)的头对头LLM-as-Judge比较中击败未训练基线,调整平局后的胜率从52.5%到71.4%。代码将在发表后发布。

英文摘要

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
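
The reward described above can be sketched as follows, assuming the verifier's next-token logits are already aligned with the generator's response tokens (one row of logits per response token). The name `cme_reward` and the list-of-lists input format are illustrative assumptions, not the authors' code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def cme_reward(verifier_logits, response_ids):
    """Mean log-likelihood of the response tokens under the verifier.

    verifier_logits[t] holds the verifier's logits over the vocabulary
    at response position t; response_ids[t] is the token the generator
    actually produced there (assumed alignment).
    """
    if not response_ids or len(verifier_logits) != len(response_ids):
        raise ValueError("need one row of logits per response token")
    total = sum(
        log_softmax(row)[tok] for row, tok in zip(verifier_logits, response_ids)
    )
    return total / len(response_ids)
```

Averaging over tokens rather than summing keeps the reward comparable across responses of different lengths, which matters when responses in one sampling group are ranked against each other.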

2605.29005 2026-05-29 cs.LG cs.AI 版本更新

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

LoRe: 基于每步交互预算的自适应交互评估路由用于迭代图求解器

Jintao Li, Yong-Yi Wang, Zheng-An Wang, Heng Fan

发表机构 * Beijing Key Laboratory of Fault-Tolerant Quantum Computing, Beijing Academy of Quantum Information Sciences, Beijing, China(北京容错量子计算重点实验室,北京量子信息科学研究院,中国北京) Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, CAS, Beijing, China(北京凝聚态物理国家实验室,中国科学院物理研究所,中国北京) Beijing Key Laboratory of Advanced Quantum Technology, Beijing, China(北京先进量子技术重点实验室,中国北京) Hefei National Laboratory, Hefei, China(合肥国家实验室,中国合肥)

AI总结 提出LoRe方法,通过动态路由计算到高冲突或高不确定性交互,实现每步固定比例交互评估,在不牺牲解质量的前提下显著提升迭代图求解器的可扩展性和速度。

Comments Accepted at ICML 2026

AI中文摘要

基于扩散的组合优化神经求解器反复重新评估密集的边/因子交互,导致推理时间昂贵且在大规模下常受内存限制。受多体物理计算方法的启发,我们引入LoRe,一种无需训练、推理时即插即用的包装器,强制执行每步交互评估预算:在每次迭代中,它通过动态路由计算到高冲突或高不确定性交互,仅评估固定比例的交互,而不是使用固定的稀疏化(例如静态kNN图或静态掩码)。在完全包含的端到端挂钟时间核算下,LoRe显著提高了最大独立集(MIS)问题的可扩展性,将可行推理扩展到基线内存溢出限制的3倍以上,实现了约8倍的加速和约12倍的峰值内存减少,同时在此范围内保持解质量。在大规模旅行商问题(TSP)上展示了跨任务通用性,并对拓扑变化具有零样本鲁棒性,LoRe在n=1000时实现了约15倍的加速,内存减少44倍,且巡回质量具有竞争力。

英文摘要

Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational methodologies of many-body physics, we introduce LoRe, a training-free, inference-time drop-in wrapper that enforces per-step interaction-evaluation budgeting: at each iteration, it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions, instead of using a fixed sparsification (e.g., static kNN graphs or static masks). Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on the Maximum Independent Set (MIS) problem, extending feasible inference more than $3\times$ beyond the baseline's out-of-memory limit, delivering a $\sim 8\times$ speedup and a $\sim 12\times$ peak-memory reduction, with solution quality preserved in this regime. Demonstrating cross-task generality on the large-scale Traveling Salesperson Problem (TSP) and zero-shot robustness to topology shifts, LoRe achieves a $\sim 15\times$ speedup at $n=1000$ with a $44\times$ memory reduction and competitive tour quality.
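
The per-step budgeting idea can be sketched as a top-k selection over interaction scores, re-run at every solver iteration. This assumes each interaction already has a scalar conflict or uncertainty score; how LoRe computes those scores is not stated in the abstract, and `select_interactions` is a hypothetical helper name.

```python
import heapq
import math


def select_interactions(scores, budget_fraction):
    """Pick the indices of the highest-scoring interactions for this step.

    scores: one conflict/uncertainty score per interaction (assumed given).
    budget_fraction: fixed share of interactions evaluated per step, in (0, 1].
    """
    if not 0.0 < budget_fraction <= 1.0:
        raise ValueError("budget_fraction must be in (0, 1]")
    n = len(scores)
    if n == 0:
        return []
    k = max(1, math.ceil(budget_fraction * n))
    top = heapq.nlargest(k, range(n), key=scores.__getitem__)
    return sorted(top)
```

Because the scores are recomputed each iteration, the selected subset can change from step to step, which is what separates this from a static kNN graph or a fixed mask.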

2605.29001 2026-05-29 cs.LG cs.AI 版本更新

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

FormInv:数学推理基准中语义不变性的测量协议

Nishal Thomas, Noel Thomas

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出FormInv协议,通过跨模型一致性审计检测语义错误,并引入语义一致性率(SCR)和Cochran's Q指标,揭示标准基准无法捕捉的排名变化和模型不一致性。

Comments 18 pages, 3 figures. Under review for the 3rd AI for Math Workshop (AI4Math), ICML 2026

AI中文摘要

对MathCheck(ICLR 2025)的释义质量审计在129组中检测到4个语义不正确的释义(3.1%);移除它们后,GPT-4o从第2名降至第4名,并将Claude Haiku和DeepSeek V3提升至其之上;这些排名变化对任何单模型评估都是不可见的。跨模型一致性以不到10美元的成本自动发现了这些错误(MathCheck中>=3/4模型;我们的主要评估中>=6/9);在我们自己的数据集中,相同的协议发现47%的自动生成的连接变体释义在语义上不正确。这一缺陷加剧了更深的测量差距:Claude Haiku 4.5达到86%的准确率,但SCR=50%,意味着其一半的定理在语义等价的重新表述下得到不同的答案,而9个模型的总体准确率仅跨越86-96%,但语义一致性率(SCR)跨越50-82%——这是标准基准无法捕捉的32个百分点的差距。形式上,对于9个前沿模型的任何目标排名,存在一个释义族上的权重实现该排名(无免费基准推论),因为没有模型在所有族上帕累托占优——因此选择族的基准设计者隐含地决定了哪个模型获胜。FormInv提供了审计协议(在外部基准上以100%召回率复制)、SCR和每个定理的Cochran's Q作为主要不变性度量,在9个模型上评估了366-811个项目(基于Lean4验证的定理),以及用于情境感知模型选择的FormInvSelector。

英文摘要

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
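
To make the accuracy versus SCR gap concrete, here is a small sketch that assumes SCR is the share of theorems whose outcome is identical across all of their paraphrases. That reading follows the abstract's gloss of SCR=50%, but the exact definition in the paper may differ; the function name and input layout are illustrative.

```python
def accuracy_and_scr(results):
    """Compute (accuracy, SCR) from per-paraphrase outcomes.

    results: dict mapping a theorem id to a list of booleans, one per
    semantically equivalent paraphrase (True = answered correctly).
    """
    if not results or any(len(v) == 0 for v in results.values()):
        raise ValueError("every theorem needs at least one paraphrase outcome")
    total = sum(len(v) for v in results.values())
    correct = sum(sum(v) for v in results.values())
    consistent = sum(1 for v in results.values() if len(set(v)) == 1)
    return correct / total, consistent / len(results)


# Toy data (not from the paper): 10 of 12 answers are correct,
# yet only 2 of 4 theorems are answered the same way on every paraphrase.
toy = {
    "t1": [True, True, True],
    "t2": [True, True, False],
    "t3": [True, False, True],
    "t4": [True, True, True],
}
acc, scr = accuracy_and_scr(toy)
```

On the toy data, accuracy is about 0.83 while SCR is 0.5, the same kind of divergence the abstract reports for Claude Haiku 4.5.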

2605.28999 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

测量基于LLM的简历筛选中真实世界的提示注入攻击

Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Duke University(杜克大学) Arizona State University(亚利桑那州立大学) hireEZ University of California, Berkeley(加州大学伯克利分校)

AI总结 本研究首次系统性地分析了基于LLM的简历筛选应用中的提示注入攻击,通过设计专用检测器对约20万份真实简历进行测量,发现约1%的简历包含隐藏的提示注入,且近年来其流行度显著增加。

Comments Published in USENIX Security Symposium 2026; Code and artifacts are available at https://github.com/UNITES-Lab/resume-injection-measurement

AI中文摘要

LLM容易受到提示注入攻击。然而,这种漏洞主要是在学术研究中通过概念性演示或少数轶事案例研究来展示的。其在真实世界基于LLM的应用中的普遍性和影响尚未得到充分探索。在这项工作中,我们首次对广泛使用的应用——基于LLM的简历筛选——中的提示注入攻击进行了系统研究。我们的分析基于hireEZ多年来收集的约20万份真实简历。我们首先设计了专门的方法来检测简历中的提示注入。在小规模数据集上的手动验证表明,我们的检测器实现了高精度,并优于最先进的通用检测器。然后,我们将检测器应用于完整的简历数据集,并对真实世界的提示注入攻击进行了全面的测量研究。我们的分析揭示了一些有趣的发现:大约1%的简历包含隐藏的提示注入;这种注入简历的流行度在过去一到两年内显著增加;超过90%的注入提示不使用显式指令。这些结果首次提供了真实世界基于LLM的应用中大规模提示注入的证据,并为未来理解和缓解此类攻击的研究奠定了基础。

英文摘要

LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.

2605.28994 2026-05-29 cs.AI 版本更新

BEAMS: Benchmarking and Evaluating AI for Modeling and Simulation

BEAMS:用于建模与仿真的AI基准测试与评估

Sara Metcalf, William Schoenberg

AI总结 提出BEAMS倡议,通过建立以人为本的基准测试框架,评估AI工具在建模与仿真中的表现,发现其在因果推理和定量修正方面存在不足。

AI中文摘要

支持现实世界决策的AI工具必须能够构建仿真模型,为其建议提供依据并使其可解释。能够自动化建模实践某些方面的工具必须补充人类专业知识,而非取代它。BEAMS倡议旨在通过建立以人为本的建模与仿真实践的基准,引导AI工具在建模与仿真领域的发展走向负责任和合乎伦理的形式。该倡议利用开放的数字和组织基础设施,协作评估用于建模与仿真的AI工具。倡议托管的开源sd ai项目确保了透明度,并使贡献能够广泛共享。指导小组专注于优先考虑潜在基准,而技术小组则专注于以自动化测试的形式实施基准。针对多个不同评估类别的测试已经实施,并应用于支持定性模型构建、定量模型构建和模型讨论的AI工具。这些测试包括因果翻译、模型迭代、因果推理、一致性、模型行为解释、建议的模型构建步骤以及建议的模型修正。当sd ai项目的引擎与不同的LLM结合时,它们在这些评估上的表现揭示了不同AI工具之间的差异。倡议实施的评估表明,支持AI的建模工具在讨论和基本定性任务上的表现优于因果推理和定量错误修正。没有单一的LLM在所有引擎类型中占据主导地位,这突显了特定任务的重要性以及速度与准确性之间的权衡。倡议的持续努力旨在纳入考虑替代视角和以人为本用例的基准,以解决对偏见的担忧。

英文摘要

AI tools that support real-world decision making must be able to build simulation models that inform their recommendations and render them interpretable. Tools that can automate aspects of modeling practice must complement human expertise, not replace it. The BEAMS Initiative aims to guide the development of AI tools for modeling and simulation toward forms that are responsible and ethical by establishing benchmarks for human-centered modeling and simulation practices. The initiative uses open digital and organizational infrastructure to collaboratively evaluate AI tools for modeling and simulation. The open-source sd ai project hosted by the initiative establishes transparency and enables contributions to be shared broadly. A steering group focuses on prioritizing potential benchmarks, while a technical group focuses on implementing the benchmarks in the form of automated tests. Tests for several distinct categories of evaluation have been implemented and applied to AI tools that support qualitative model building, quantitative model building, and model discussion. These include tests for causal translation, model iteration, causal reasoning, conformance, model behavior explanation, suggested model building steps, and suggested model fixes. When engines from the sd ai project are coupled with different LLMs, their performance on these evaluations reveals variability across different AI tools. The evaluations implemented by the initiative demonstrate that AI-enabled modeling tools perform better at discussion and basic qualitative tasks than at causal reasoning and quantitative error fixing. No single LLM dominates across engine types, highlighting the importance of specific tasks and tradeoffs between speed and accuracy. Ongoing efforts of the initiative aim to incorporate benchmarks that address concerns about bias by considering alternative perspectives and human-centered use cases.

2605.28983 2026-05-29 cs.LG cs.AI math.DS math.RT physics.comp-ph 版本更新

The Hamilton-Jacobi Theory of Deep Learning

深度学习的哈密顿-雅可比理论

Jose Marie Antonio Miñoza, Erika Fille T. Legara, Christopher P. Monterola

发表机构 * Center for AI Research PH(人工智能研究中心PH) Asian Institute of Management(亚洲管理学院)

AI总结 本文通过将神经网络训练精确识别为哈密顿-雅可比初值问题的搜索,建立了深度学习与粘性哈密顿-雅可比方程之间的严格对应关系,并统一了残差网络、Transformer、RNN等架构,导出了最优泛化率、对抗鲁棒性等定量结果。

详情
AI中文摘要

在本文中,神经网络训练被精确地识别为通过哈密顿-雅可比初值问题的搜索:每个梯度步选择粘性哈密顿-雅可比方程的初始数据,其Hopf-Cole传播子最佳拟合观测值;在推理时,输入是评估该解的空间点,初始条件已编码在权重中。这种对应对于log-sum-exp层是精确的,对于更广泛的架构(残差网络、Transformer和循环架构(RNN、LSTM、SSM))是结构性的,它们离散化同一类哈密顿-雅可比方程,具有依赖于架构的哈密顿量和粘性。一个单一的变形参数ε在交换图中统一了所有四个视角(网络、热带代数、粘性PDE、凸优化),并在Lipschitz条件下封闭。定量结果包括:固定t时的极小极大最优泛化率O(n^{-1/(d+2)});由ε控制的对抗鲁棒性;残差网络的反向传播作为哈密顿系统的协态方程(庞特里亚金最大值原理);通过PDE求积与数据内在维度一致的标度指数;以及闭式O(N)影响函数(softmax归因权重π_j),其熵景观随着ε增加经历折叠分岔,每个分岔合并归因盆地。

英文摘要

In this paper, training a neural network is identified, exactly, as a search through Hamilton-Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton-Jacobi equation whose Hopf-Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton-Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.
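
The log-sum-exp correspondence described in this abstract can be illustrated with the classical Hopf-Cole transform. What follows is a standard textbook sketch, assuming the quadratic Hamiltonian $H(p) = \tfrac12 |p|^2$ and viscosity $\varepsilon$; the paper's own normalization, sign conventions, and architecture-dependent Hamiltonians may differ.

```latex
\begin{align*}
&\partial_t u + \tfrac12 |\nabla u|^2 = \varepsilon \Delta u, \qquad u(x,0) = u_0(x), \\
&u = -2\varepsilon \log \varphi \;\Longrightarrow\; \partial_t \varphi = \varepsilon \Delta \varphi, \qquad \varphi(x,0) = e^{-u_0(x)/(2\varepsilon)}, \\
&u(x,t) = -2\varepsilon \log \int_{\mathbb{R}^d} (4\pi\varepsilon t)^{-d/2}
  \exp\!\Big(-\frac{1}{2\varepsilon}\Big[\frac{|x-y|^2}{2t} + u_0(y)\Big]\Big)\, dy .
\end{align*}
```

The last line is a continuous log-sum-exp (a soft minimum at temperature $2\varepsilon$). As $\varepsilon \to 0$ it reduces to the Hopf-Lax formula $\min_y \big[\,|x-y|^2/(2t) + u_0(y)\big]$, which is the min-plus (tropical) counterpart.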

2605.28978 2026-05-29 cs.AI cs.CE 版本更新

VFEAgent: A Multimodal Agent Framework for End-to-End Automated Finite Element Analysis

VFEAgent: 面向端到端自动化有限元分析的多模态智能体框架

Jiachen Zhang, Junyi Lao, Chenghao Liu, Siyuan Liu, Shixin Wu, Linsen Zhang, Boyu Wang, Songfang Huang

发表机构 * Peking University(北京大学) China Agricultural University(中国农业大学)

AI总结 提出VFEAgent多智能体系统,通过多模态视觉-语言流水线和验证优先的代码合成框架,实现从输入图像和问题描述到有限元建模与仿真的端到端自动化。

Comments 9 pages, 3 figures, 2 tables. Equal contribution: Jiachen Zhang and Junyi Lao. Corresponding author: Songfang Huang. Preprint

详情
AI中文摘要

有限元分析(FEA)是现代工程设计的基石。然而,其工作流程本质上复杂且高度依赖领域专业知识。尽管近期有研究将大语言模型(LLM)集成到FEA中,但现有方法在处理多模态输入和执行复杂任务方面存在局限性。为解决这些限制,我们提出了VFEAgent,一个端到端的多智能体系统,旨在直接从输入图像和问题描述中自动化FEA建模和仿真。我们的方法整合了两个核心组件:(1)多模态视觉-语言多智能体流水线,采用ReAct驱动的推理从异构输入中提取结构化的FEA规范;(2)验证优先的代码合成框架,结合了强大的自调试和回退机制,以确保可执行性和物理有效性。我们在各种工程力学场景下系统评估了该系统。结果表明,VFEAgent在生成完整且物理有效的仿真方面取得了高成功率,在可靠性和正确性上优于基于LLM的基线方法。这些发现验证了自动化完整FEA工作流程的可行性,突显了该框架在将工程师从繁琐的手工分析中解放出来的潜力。

英文摘要

Finite Element Analysis (FEA) serves as the cornerstone of modern engineering design. However, its workflow is inherently complex and relies heavily on domain expertise. Although recent efforts have integrated Large Language Models (LLMs) into FEA, existing approaches face limitations in handling multimodal inputs and executing complex tasks. To address these limitations, we propose VFEAgent, an end-to-end multi-agent system designed to automate FEA modeling and simulation directly from input images and problem descriptions. Our methodology integrates two core components: (1) a multimodal vision-language multi-agent pipeline that employs ReAct-driven reasoning to extract structured FEA specifications from heterogeneous inputs and (2) a verification-first code synthesis framework, incorporating robust self-debugging and fallback mechanisms to ensure executability and physical validity. We systematically evaluated the system across various engineering mechanics scenarios. The results demonstrate that VFEAgent achieves a high success rate in generating complete and physically valid simulations, outperforming LLM-based baseline methods in reliability and correctness. These findings validate the feasibility of automating the complete FEA workflow, highlighting the framework's potential to liberate engineers from tedious manual analysis.

2605.28977 2026-05-29 cs.LG cs.AI 版本更新

Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection

比较事后可解释AI方法用于解释抑郁症检测中的黑盒脑电图模型

Antonia Šarčević, Nikolina Frid

发表机构 * University of Zagreb Faculty of Electrical Engineering and Computing(萨格勒布大学电气工程与计算学院)

AI总结 本研究通过多种事后可解释性方法(如DeepSHAP、集成梯度、GradCAM、遮挡和置换特征重要性)分析InceptionTime架构在脑电图抑郁症检测中的决策过程,发现不同方法在额叶、颞叶和后部脑区(尤其是右半球)的归因模式部分收敛,但方法间存在差异,强调了事后可解释性的有用性和局限性。

详情
AI中文摘要

深度学习的最新进展使得基于脑电图的重度抑郁症分类越来越准确,但高容量模型的决策过程仍然难以解释。本研究调查了应用于训练用于基于脑电图的重度抑郁症检测的InceptionTime架构的多种事后可解释性方法。分析包括基于Shapley、基于梯度和基于扰动的归因方法:DeepSHAP、集成梯度、GradCAM、遮挡和置换特征重要性。在受试者级别的分层5折交叉验证框架内,通过跨脑电图片段和受试者的全局归因聚合进行可解释性分析。评估的方法揭示了部分收敛的归因模式,其中额叶、颞叶和后部脑区(尤其是右半球)反复受到关注。定量比较表明,基于梯度和基于扰动的方法之间具有实质性一致性,而DeepSHAP产生了相对独特的归因分布。同时,可解释性方法之间的差异凸显了方法假设对所得解释的影响。总体而言,结果表明,不同的事后可解释性方法捕捉了基于脑电图的深度学习模型在抑郁症检测中的部分重叠的相关性结构。尽管观察到的归因模式与先前几项关于重度抑郁症的脑电图研究大致一致,但该分析应被视为探索性的,而非确凿的神经生理学生物标志物或临床适用性的证据。该研究强调了事后可解释性在解释精神病学应用中的黑盒脑电图分类器方面的有用性和局限性。

英文摘要

Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD), but the decision-making processes of high-capacity models remain difficult to interpret. This study investigates multiple post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis includes Shapley-based, gradient-based, and perturbation-based attribution approaches: DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance. Explainability analysis was performed within a subject-level stratified 5-fold cross-validation framework using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. Quantitative comparison demonstrated substantial agreement between gradient- and perturbation-based approaches, while DeepSHAP produced comparatively distinct attribution distributions. At the same time, variability between explainability methods highlighted the influence of methodological assumptions on the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several previous EEG studies of MDD, the analysis should be interpreted as exploratory rather than evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.

2605.28969 2026-05-29 cs.CL cs.AI cs.HC 版本更新

Beyond Recall: Behavioral Specification as an Interpretive Layer for AI Personalization

超越回忆:行为规范作为AI个性化的解释层

Aarik Gulaya

AI总结 提出行为规范作为解释层,通过压缩用户数据为解释性模式,显著提升AI代理对用户意图的表示准确性,减少模型规避,并在解释型问题上优于原始语料和商业记忆系统。

Comments 134 pages, 4 figures. Code, data, judge prompts, and reproduction instructions: github.com/agulaya24/beyond-recall

详情
AI中文摘要

如果AI代理代表个人做出决策,这些决策必须与其用户一致。我们引入表示准确性来衡量系统忠实捕捉用户解释的程度。我们将解释层操作化为行为规范。我们的参考实现将用户数据积极压缩为解释性模式,作为语言模型的上下文。我们在一个原型基准上评估该规范,该基准由校准的5评委LLM小组对保留的行为预测进行评分。我们独立测试它,并与一系列上下文条件组合:完整原始语料、完整提取事实以及四个商业记忆系统(Mem0、Letta、Supermemory、Zep)。在14个公共领域自传语料库中,该规范总体上提升了表示准确性,并几乎消除了模型规避。它以约25倍的上下文成本降低恢复了原始语料的大部分性能。该规范将受试者提升到一个共同的预测水平,无论预训练基线如何;因此,绝对提升在基线最低时最大,表明相关人群是任何在预训练中未被充分代表的人。在需要解释的问题上提升最大,提供解释层使得模型行为能够实现提取事实或原始语料无法实现的行为。相反,在需要回忆的问题上,该层可能干扰而非帮助。我们得出结论,表示准确性不同于回忆,人机对齐取决于用户被表示的准确性。表示准确性使这种对齐可测试。

英文摘要

If an AI agent makes decisions on a person's behalf, those decisions must align with its user. We introduce representational accuracy to measure how faithfully a system captures a person's interpretation. An interpretive layer is operationalized as a Behavioral Specification. Our reference implementation aggressively compresses a person's data into interpretive patterns, served as context to a language model. We evaluate the Specification on a prototype benchmark of held-out behavioral predictions scored by a calibrated 5-judge LLM panel. We test it independently and in composition with a range of context conditions: full raw corpus, full extracted facts, and four commercial memory systems (Mem0, Letta, Supermemory, Zep). Across 14 public-domain autobiographical corpora, the Specification lifts representational accuracy in aggregate and nearly eliminates model hedging. It recovers most of what the raw corpus delivers, at ~25x less context cost. The Specification lifts subjects toward a common predictive level regardless of pretraining baseline; the lift in absolute points is therefore largest where the baseline is lowest, suggesting the population of relevance is anyone not adequately represented in pretraining. Lift is greatest on interpretation-required questions, where providing an interpretive layer enables model behavior that extracted facts or raw corpus do not. Conversely, on recall-required questions, this layer can interfere rather than help. We conclude that representational accuracy is distinct from recall and that human-AI alignment is dependent on how accurately the user is represented. Representational accuracy makes that alignment testable.

2605.28965 2026-05-29 cs.AI 版本更新

Frontier LLM-based agents can overcome the ontology curation bottleneck for natural phenotypes

基于前沿LLM的智能体可以克服自然表型的本体策展瓶颈

James P. Balhoff, Hilmar Lapp

发表机构 * Renaissance Computing Institute, University of North Carolina(北卡罗来纳大学文艺复兴计算研究所) Neuromatch, USA(美国Neuromatch)

AI总结 本文使用前沿大型语言模型作为智能体策展人,在自包含工作空间中利用本体和注释指南进行表型注释,其性能达到人类策展人之间的变异性范围,显著优于传统NLP工具。

Comments 7 pages, 2 figures

详情
AI中文摘要

将自由文本表型描述链接到本体术语(通常称为表型注释)对于跨研究整合比较形态学数据至关重要。这一劳动密集型过程严重依赖训练有素的人类专家,因此难以规模化,成为关键瓶颈。Dahdul等人(2018)建立了跨七个系统发育研究的实体-质量(EQ)注释金标准,并用于评估三位人类策展人和基于本体的语义相似度度量的Semantic CharaParser NLP工具;他们报告机器与人类的一致性显著低于策展人间(人类-人类)的一致性。本文使用来自Anthropic和OpenAI的五个前沿托管LLM重新审视该基准,每个LLM作为“智能体策展人”在自包含工作空间中运行,该工作空间提供源出版物PDF、原始人类策展人使用的相同注释指南、四个项目本体(UBERON、PATO、BSPO、GO)以及验证脚本。针对同一金标准评估,每个智能体的表现均落在原始研究中三位训练有素的人类生物策展人的策展人间变异性范围内;表现最佳的智能体接近但未达到表现最佳的人类策展人。智能体在所有四个指标上均大幅优于Semantic CharaParser。

英文摘要

Linking free-text phenotype descriptions to ontology terms, typically referred to as phenotype annotation, is essential for the cross-study integration of comparative morphological data. This labor intensive process has heavily relied on highly trained human experts, which makes it challenging to scale and thus a key bottleneck. Dahdul et al. (2018) established a Gold Standard (GS) of Entity-Quality (EQ) annotations across seven phylogenetic studies and used it to evaluate three human curators and the Semantic CharaParser NLP tool with ontology-based semantic similarity metrics; they reported that machine-human consistency was significantly lower than inter-curator (human-human) consistency. Here we revisit that benchmark with five frontier hosted LLMs from Anthropic and OpenAI, each operating as an "agentic curator" within a self-contained workspace that supplies the source publication PDF, the same annotation guide used by the original human curators, the four project ontologies (UBERON, PATO, BSPO, GO), and a validation script. Evaluated against the same Gold Standard, every agent fell within the range of inter-curator variability of the three trained human biocurators of the original study; the best performing agents approached but did not reach the best performing human curator. Agents substantially outperformed Semantic CharaParser on all four metrics.

2605.28920 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Conf-Gen: 生成模型的共形不确定性量化

Gabriel Loaiza-Ganem, Kevin Zhang, Wei Cui, Marc T. Law, Kin Kwan Leung

发表机构 * layer6ai-labs(layer6ai实验室)

AI总结 提出Conf-Gen框架,通过共形风险控制适配生成任务,统一并扩展了共形预测在大型语言模型等生成模型中的应用,并在图像生成、对话AI和AI代理等新领域提供了形式化保证。

Comments ICML 2026

详情
AI中文摘要

共形预测(CP)及其扩展共形风险控制(CRC)是通过形式化保证量化监督机器学习中不确定性的成熟框架。然而,人工智能(AI)的最新突破由无监督生成模型驱动,例如大型语言模型(LLMs)和图像生成器,这些模型与CP或CRC不直接兼容。在这项工作中,我们引入了共形生成(Conf-Gen),这是一个将CRC适配到生成任务同时放宽其理论假设的通用框架。Conf-Gen统一并泛化了先前将CP应用于LLMs的尝试,并将共形方法扩展到全新的领域。我们通过一些新颖的应用展示了Conf-Gen的灵活性,包括在以下方面获得共形保证:生成非记忆图像的图像生成器、提出足够澄清问题的对话AI系统,以及AI代理输出的正确性。

英文摘要

Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.
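
To make the conformal risk control (CRC) baseline concrete, here is a minimal sketch of the standard CRC threshold rule for a bounded loss that is non-increasing in the parameter. It is not the Conf-Gen procedure itself; the function name `crc_threshold` and the toy calibration data are assumptions made for illustration.

```python
def crc_threshold(lambdas, losses_per_lambda, alpha, loss_bound=1.0):
    """Standard CRC rule: smallest lambda whose inflated empirical risk is <= alpha.

    lambdas            -- candidate thresholds, sorted ascending
    losses_per_lambda  -- losses_per_lambda[i][j] is the loss of calibration
                          example j at lambdas[i]; assumed non-increasing in lambda
    loss_bound         -- upper bound B on the loss
    """
    n = len(losses_per_lambda[0])
    for lam, losses in zip(lambdas, losses_per_lambda):
        empirical_risk = sum(losses) / n
        # Finite-sample correction: (n / (n + 1)) * R_hat + B / (n + 1) <= alpha
        if (n / (n + 1)) * empirical_risk + loss_bound / (n + 1) <= alpha:
            return lam
    # No candidate qualifies: fall back to the most conservative value.
    return lambdas[-1]


# Toy calibration set (hypothetical): integer scores, loss = 1 if score > lambda.
scores = list(range(1, 10))
lambdas = list(range(0, 11))
losses_per_lambda = [[1.0 if s > lam else 0.0 for s in scores] for lam in lambdas]
lam_hat = crc_threshold(lambdas, losses_per_lambda, alpha=0.25)
```

With nine calibration examples and `alpha=0.25`, at most one calibration example may exceed the threshold, so the rule selects the smallest such candidate.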

2605.28919 2026-05-29 cs.LG cs.AI cs.CL 版本更新

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

CosmicFish-HRM:紧凑语言模型中基于层次循环机制的适应性推理

Venkat Akhil Lakkapragada

发表机构 * Mistyoz AI Hyderabad, India(Mistyoz AI,印度海得拉巴)

AI总结 提出一种紧凑语言模型CosmicFish-HRM,通过层次推理模块动态分配推理深度,在保持较小参数量的同时实现适应性推理。

Comments 17 pages, 4 figures. Exploratory study of adaptive reasoning depth in compact autoregressive language models. Code available at https://github.com/MistyozAI/CosmicFish-HRM

详情
AI中文摘要

大型语言模型已经实现了强大的推理能力,尽管通常以巨大的参数数量和昂贵的推理为代价。在这项工作中,我们探索了一个不同的方向:紧凑语言模型中的自适应推理深度。我们提出了CosmicFish-HRM,这是一个紧凑的语言模型,围绕一个层次推理模块(HRM)构建,该模块在推理过程中动态分配计算资源。该模型不是对每个输入应用固定的计算,而是迭代通过高层和低层推理循环,并根据输入复杂度学习何时停止。CosmicFish-HRM将这种自适应推理核心与现代Transformer组件(包括分组查询注意力、RoPE和SwiGLU激活)相结合。虽然额外的推理基础设施在小规模下引入了开销,但我们假设随着模型规模的增长和HRM核心相对成本的降低,这种权衡变得越来越有利。我们的结果表明,该模型学习了非均匀的推理行为,在不同任务和输入之间分配不同数量的推理步骤。这些发现表明,自适应推理深度可能为仅依赖参数规模来实现推理能力提供一种有前途的替代方案。

英文摘要

Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish-HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high-level and low-level reasoning cycles and learns when to halt based on input complexity. CosmicFish-HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non-uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.

2605.28914 2026-05-29 cs.CR cs.AI 版本更新

AIRGuard: Guarding Agent Actions with Runtime Authority Control

AIRGuard:通过运行时权限控制守护智能体行为

Suliu Qin, Haomin Zhuang, Yujun Zhou, Yufei Han, Xiangliang Zhang

发表机构 * University of Notre Dame(圣母大学) Inria, France(法国国家信息与自动化技术研究所) University of Liverpool(利物浦大学)

AI总结 针对工具使用语言智能体面临的权限混淆问题,提出运行时守卫AIRGuard,通过动作时授权实现最小权限原则,显著降低攻击成功率并保持良好良性效用。

详情
AI中文摘要

使用工具的语言智能体将模型决策转化为外部副作用:它们读取文件、运行脚本、调用API、发送消息以及调用模型上下文协议工具。这使得针对智能体的攻击不同于越狱攻击。有害步骤往往不是明显禁止的输出,而是普通的可执行动作,但由于攻击者控制的上下文将授权访问导向违背用户利益的方向而变得不安全。我们将这种失败模式识别为权限混淆:不可信资源可以告知推理,但绝不能授权副作用。我们提出AIRGuard,一种运行时守卫,将最小权限原则实现为动作时授权。AIRGuard规范化异构工具调用,将任务权限推导为步骤级权限,跟踪源和目标信任度,模拟敏感副作用,审计跨步骤风险,并在动作执行前强制执行决策。在AgentTrap上,AIRGuard将Sonnet 4.6的攻击成功率从无防御时的36.3%降低到5.5%。在DTAP-150上,AIRGuard在Haiku 4.5上保持了76.0%的良性效用,而ARGUS为52.0%,MELON为42.0%。消融实验进一步表明,仅靠提示策略效果有限,而专用的运行时权限控制层为智能体系统提供了对工具介导副作用的直接控制。代码和数据可在https://github.com/Sophie508/AIRGuard获取。

英文摘要

Tool-using language agents turn model decisions into external side effects: they read files, run scripts, call APIs, send messages, and invoke Model Context Protocol tools. This makes agent attacks different from jailbreaks. The harmful step is often not an obviously forbidden output, but an ordinary executable action that becomes unsafe because attacker-controlled context steers authorized access against the user's interest. We identify this failure mode as authority confusion: untrusted resources may inform reasoning, but they must not authorize side effects. We present AIRGuard, a runtime guard that operationalizes least privilege as action-time authorization. AIRGuard normalizes heterogeneous tool calls, derives task authority into step-level authority, tracks source and target trust, simulates sensitive side effects, audits cross-step risk, and enforces decisions before actions execute. On AgentTrap, AIRGuard reduces Sonnet 4.6 attack success from 36.3% without defense to 5.5%. On DTAP-150, AIRGuard preserves 76.0% benign utility with Haiku 4.5, compared with 52.0% for ARGUS and 42.0% for MELON. An ablation further shows that prompt-only policy helps only modestly, whereas a dedicated runtime authority-control layer gives the agent system direct control over tool-mediated side effects. Code and data are available at https://github.com/Sophie508/AIRGuard.
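
The idea of action-time authorization can be sketched in a few lines. This is a hypothetical illustration of the principle that untrusted content may inform reasoning but must not authorize side effects; it is not AIRGuard's implementation, and the tool names, the `ToolCall` fields, and the `authorize` function are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical set of tools that cause external side effects.
SIDE_EFFECT_TOOLS = {"send_email", "write_file", "run_script"}


@dataclass(frozen=True)
class ToolCall:
    tool: str
    target: str
    requested_by: str  # "user", or "untrusted" for text read from a file or web page


def authorize(call, task_authority):
    """Allow a call only if the task grants (tool, target) and, for
    side-effecting tools, the request traces back to the user."""
    if (call.tool, call.target) not in task_authority:
        return False  # least privilege: outside what the task needs
    if call.tool in SIDE_EFFECT_TOOLS and call.requested_by != "user":
        return False  # untrusted context cannot authorize a side effect
    return True


task_authority = {("read_file", "report.txt"), ("send_email", "boss@example.com")}
ok = authorize(ToolCall("send_email", "boss@example.com", "user"), task_authority)
hijacked = authorize(ToolCall("send_email", "boss@example.com", "untrusted"), task_authority)
out_of_scope = authorize(ToolCall("send_email", "attacker@example.com", "user"), task_authority)
```

The check runs before the action executes, which is the property the abstract contrasts with prompt-only policies.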

2605.28902 2026-05-29 cs.AI 版本更新

Orthogonal Concept Erasure for Diffusion Models

扩散模型的正交概念擦除

Yuhao Sun, Lingyun Yu, Haoxiang Xu, Fengyuan Miao, Zhuoer Xu, Hongtao Xie

发表机构 * University of Science(科学技术大学)

AI总结 提出正交概念擦除(OCE)方法,通过几何视角的乘法参数更新实现精确概念擦除,同时保持生成能力,支持多概念擦除。

Comments Accepted by ICML 2026 Oral

详情
AI中文摘要

概念擦除已成为减轻扩散模型中不期望或不安全内容的有前途方法,但现有方法仍面临显著限制。基于训练的方法有效,但高计算成本限制了可扩展性。基于编辑的方法更高效且易于部署,但难以同时实现精确的概念擦除和保持整体生成能力。我们将基于编辑的方法的这一核心限制归因于对加法参数更新的依赖。我们的实证分析表明,概念语义主要依赖于神经元方向而非神经元幅度,而整体生成能力依赖于神经元的角几何。由于加法更新固有地纠缠方向、幅度和角几何,它们不可避免地引入概念擦除与整体生成性能之间的意外干扰。为了解决这个问题,我们提出了正交概念擦除(OCE),它从几何角度将基于编辑的擦除重新表述为乘法参数更新。具体来说,OCE应用从参数的闭式解导出的逐层正交变换,能够在保持神经元幅度和角几何的同时实现精确的概念擦除。此外,为了解决多概念擦除中的冲突约束,OCE引入了具有结构化子空间操作的子空间级目标,实现了更有效和可扩展的擦除。在单概念和多概念擦除上的大量实验表明,OCE在概念擦除和非目标保持方面优于现有方法,可在4.3秒内擦除多达100个概念。代码:https://github.com/HansSunY/OCE。

英文摘要

Concept erasure has emerged as a promising approach to mitigate undesired or unsafe content in diffusion models, yet existing methods still face significant limitations. While training-based methods are effective, their high computational cost limits scalability. Editing-based methods are more efficient and deployment-friendly, yet they struggle to simultaneously achieve precise concept erasure and preserve overall generative capacity. We identify this core limitation of the editing-based methods as reliance on additive parameter updates. Our empirical analysis reveals that concept semantics primarily depend on neuron direction rather than neuron magnitude, while overall generative capacity relies on the angular geometry of neurons. As additive updates inherently entangle direction, magnitude, and angular geometry, they inevitably introduce unintended interference between concept erasure and overall generation performance. To address this, we propose Orthogonal Concept Erasure (OCE), which reformulates editing-based erasure as multiplicative parameter updates from a geometric perspective. Specifically, OCE applies layer-wise orthogonal transformations derived from a closed-form solution to the parameters, enabling precise concept erasure while preserving the neuron magnitude and angular geometry. Furthermore, to address conflicting constraints in multi-concept erasure, OCE introduces a subspace-level objective with structured subspace manipulation, yielding a more effective and scalable erasure. Extensive experiments on single- and multi-concept erasure demonstrate that OCE outperforms existing methods in concept erasure and non-target preservation, erasing up to 100 concepts in 4.3 s. Code: https://github.com/HansSunY/OCE.
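
The geometric claim can be checked on a toy example: an orthogonal (multiplicative) update preserves each neuron's norm and the angles between neurons, whereas an additive update generally changes both. The sketch below uses a 2-D rotation as a stand-in for an orthogonal transformation; the vectors and the helper names are illustrative assumptions, not the OCE closed-form solution.

```python
import math


def rotate(v, theta):
    """Apply a 2-D rotation, the simplest orthogonal transformation."""
    c, s = math.cos(theta), math.sin(theta)
    return (c * v[0] - s * v[1], s * v[0] + c * v[1])


def norm(v):
    return math.hypot(v[0], v[1])


def cosine(a, b):
    return (a[0] * b[0] + a[1] * b[1]) / (norm(a) * norm(b))


# Two toy "neurons" (weight vectors).
w1, w2 = (3.0, 4.0), (1.0, 0.0)

# Multiplicative update: the same rotation applied to every neuron.
r1, r2 = rotate(w1, 0.7), rotate(w2, 0.7)

# Additive update: an arbitrary delta added to the first neuron only.
delta = (0.5, -2.0)
a1, a2 = (w1[0] + delta[0], w1[1] + delta[1]), w2

norm_preserved = abs(norm(r1) - norm(w1)) < 1e-9
angle_preserved = abs(cosine(r1, r2) - cosine(w1, w2)) < 1e-9
additive_changes_norm = abs(norm(a1) - norm(w1)) > 0.1
additive_changes_angle = abs(cosine(a1, a2) - cosine(w1, w2)) > 0.1
```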

2605.28899 2026-05-29 cs.CR cs.AI 版本更新

Quantum-Enhanced Adversarial Robustness in Artificial Intelligence

人工智能中的量子增强对抗鲁棒性

Jaydip Sen

发表机构 * Praxis Business School(Praxis商学院)

AI总结 本文综述了对抗性机器学习与量子计算交叉领域,提出利用量子优化、特征映射和混合量子-经典架构来增强人工智能系统的对抗鲁棒性。

Comments This is the pre-print of the chapter which has been accepted for publication in the edited volume titled "Quantum Enhancements to the AI Industry", edited by Eduard Babulak. The volume will be published by IGI Global, USA. This is not the final version of the chapter published in the book

详情
AI中文摘要

人工智能在多个应用领域取得了显著成功。然而,其对对抗性攻击的脆弱性给可靠性、安全性和可信赖性带来了重大挑战。对抗性机器学习表明,即使是高精度的模型也可能通过精心设计的扰动被操纵,这在医疗、金融和自主技术等安全关键系统中引发了严重担忧。与此同时,量子计算作为一种变革性范式出现,能够通过叠加、纠缠和量子干涉等原理解决复杂的计算问题。这两个领域的融合催生了量子人工智能的出现,该领域探索量子技术如何增强学习效率、可扩展性和鲁棒性。本章全面概述了对抗性机器学习和现有防御策略,随后对量子计算和量子机器学习模型进行了易于理解的介绍。进一步提出了量子增强对抗鲁棒性的概念框架,强调了量子优化、特征映射和混合量子-经典架构。还讨论了实际应用、关键挑战和未来研究方向,以支持安全可信赖的AI系统的开发。

英文摘要

Artificial Intelligence has achieved remarkable success across diverse application domains. However, its vulnerability to adversarial attacks poses significant challenges to reliability, security, and trustworthiness. Adversarial machine learning demonstrates that even highly accurate models can be manipulated through carefully crafted perturbations, raising serious concerns in safety critical systems such as healthcare, finance, and autonomous technologies. In parallel, quantum computing has emerged as a transformative paradigm capable of addressing complex computational problems through principles such as superposition, entanglement, and quantum interference. The convergence of these fields has led to the emergence of quantum artificial intelligence, which explores how quantum techniques can enhance learning efficiency, scalability, and robustness. This chapter provides a comprehensive overview of adversarial machine learning and existing defense strategies, followed by an accessible introduction to quantum computing and quantum machine learning models. It further presents conceptual frameworks for quantum-enhanced adversarial robustness, emphasizing quantum optimization, feature mapping, and hybrid quantum classical architectures. Practical applications, key challenges, and future research directions are also discussed to support the development of secure and trustworthy AI systems.

2605.28897 2026-05-29 cs.AI cs.MA 版本更新

Review Arcade: On the Human Alignment and Gameability of LLM Reviews

Review Arcade: 关于LLM评审的人类对齐与可博弈性

Hans Ole Hatzel, Sebastian Steindl, Jan Strich

发表机构 * Language Technology Group, University of Hamburg, Germany(汉堡大学语言技术小组) Hub of Computing and Data Science (HCDS), University of Hamburg, Germany(汉堡大学计算与数据科学中心) OTH Amberg-Weiden, Germany(安贝格-魏登东巴伐利亚技术大学)

AI总结 通过实验评估LLM生成论文评审与人类评审的对齐程度,并发现作者可根据LLM评审迭代修改论文以提升评分(最多35%的论文显著提高),揭示了LLM评审的可博弈性。

Comments Under Review EMNLP 26

详情
AI中文摘要

LLM生成的科学论文评审正获得广泛关注,甚至被主要会议正式试点。我们必须假设不仅评审员在使用LLM辅助,而且作者在提交前也使用LLM修改论文。在这项工作中,我们对2025年ACL滚动评审(ARR)的论文进行实证实验,从作者和评审员两个角度评估LLM评审。首先,我们发现LLM评审与人类评审的对齐程度有限。在最佳情况下,对齐是合理的。然而,我们也发现LLM与人类的对齐在不同提示和模型间差异很大。最后,我们研究了作者根据LLM评审使用迭代草稿-修订工作流程改进提交的情况。我们发现,这种对LLM评审的“博弈”在特定场景下是有效的,导致最多35%的论文整体得分有统计显著提升。我们公开代码:https://github.com/uhh-hcds/reviewarcade。

英文摘要

LLM-generated reviews for scientific papers are gaining considerable traction and are even being officially piloted by major conferences. We have to assume that not only reviewers are using LLM-assistance, but also that authors use LLMs to revise their papers before submitting. In this work, we perform empirical experiments on papers from the 2025 ACL Rolling Review (ARR) to evaluate LLM reviews from both the author and the reviewer perspective. First, we identify a limited alignment of LLM reviews with human ones. In the best-case scenario, the alignment is reasonable. However, we also find that LLM-human alignment varies substantially across prompts and models. Finally, we investigate the scenario in which the author uses an iterative draft-revise workflow to improve the submission according to the LLM review. We find that this "gaming" of LLM reviews can be effective in specific scenarios, leading to a statistically significant increase of overall scores for up to 35% of papers. We publish our code: https://github.com/uhh-hcds/reviewarcade.

2605.28889 2026-05-29 cs.LG cs.AI 版本更新

Context Distillation as Latent Memory Management

上下文蒸馏作为潜在记忆管理

Ziyang Zheng, Zeju Li, Xiangyu Wen, Jianyuan Zhong, Junhua Huang, Lei Chen, Mingxuan Yuan, Qiang Xu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 将上下文蒸馏视为潜在记忆管理问题,通过独立LoRA适配器形成模块化记忆库,并利用自门控机制决定是否激活潜在记忆,以提升检索鲁棒性和效率。

详情
AI中文摘要

上下文蒸馏将上下文信息压缩到模型参数中,然而现有方法常常忽略多个蒸馏后的潜在记忆应如何在非预言机设置下存储、检索和安全激活。我们将上下文蒸馏表述为一个潜在记忆管理问题。我们将每个上下文蒸馏成一个独立的LoRA适配器,形成一个模块化记忆库,从而实现显式的记忆选择。给定一个查询,我们的框架检索候选记忆,将查询路由到最合适的适配器,并使用自门控机制决定是否应激活潜在记忆。为了提高效率,我们进一步引入缓存共享以减少推理过程中的管理开销。实验表明,我们的方法在检索方面显著优于基线,而自门控通过停用不必要的潜在记忆提高了鲁棒性。

英文摘要

Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled latent memories should be stored, retrieved, and safely activated in non-oracle settings. We formulate context distillation as a latent memory management problem. We distill each context into an independent LoRA adapter, forming a modular memory bank that enables explicit memory selection. Given a query, our framework retrieves candidate memories, routes the query to the most suitable adapter, and uses a Self-Gating mechanism to decide whether latent memory should be activated. To improve efficiency, we further introduce cache sharing to reduce management overhead during inference. Experiments show that our method substantially outperforms baselines with retrieval, while Self-Gating improves robustness by deactivating unnecessary latent memories.
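
The retrieve, route, and gate steps can be sketched as follows. This is a hypothetical illustration only: the adapters are represented by key vectors, routing uses cosine similarity, and a fixed similarity threshold stands in for the paper's Self-Gating mechanism, whose actual form the abstract does not specify. The names `route`, `memory_bank`, and `gate_threshold` are assumptions.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def route(query_vec, memory_bank, gate_threshold=0.5):
    """Return the name of the best-matching adapter, or None if the gate
    decides that no latent memory should be activated."""
    best_name, best_score = None, -1.0
    for name, key in memory_bank.items():
        score = cosine(query_vec, key)
        if score > best_score:
            best_name, best_score = name, score
    # Stand-in gate: fall back to the base model when the match is weak.
    return best_name if best_score >= gate_threshold else None


# Hypothetical memory bank: one key vector per distilled adapter.
memory_bank = {"adapter_contract": (1.0, 0.0), "adapter_manual": (0.0, 1.0)}
selected = route((0.9, 0.1), memory_bank)
unrelated = route((-1.0, -1.0), memory_bank)
```

Keeping one adapter per context is what makes the selection explicit: the gate can decline all of them without touching the base model.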

2605.28883 2026-05-29 cs.AI cs.RO 版本更新

Ultra-Reduced-Impact-Encased-Logging (URIEL): propose a new method for selective sustainable logging and post-harvest silvicultural treatment in tropical forest using airborne robotics systems

超低影响包裹式伐木(URIEL):提出一种利用空中机器人系统在热带森林中进行选择性可持续伐木和采后造林处理的新方法

Daniel Albiero, Gelton Fernando de Morais, Daniela Han, Flávio Roberto de Freitas Gonçalves, Artur Vitório Andrade Santos, Wesllen Lins de Araújo, Alessandra Maia Freire, Cláudio Kiyoshi Umezu, Mateus Peressin, Francesco Toscano, Admilson Írio Ribeiro, Alfeu J. Sguarezi Filho, Américo Ferraz Dias Neto, Angel Pontin Garcia

发表机构 * School of Agricultural Engineering, University of Campinas (UNICAMP)(坎皮纳斯大学农业工程学院) School of Mechanical Engineering, University of Campinas (UNICAMP)(坎皮纳斯大学机械工程学院) Depart. of Agricultural, Forestry, Food and Environmental Sciences, University of Basilicata(巴西利卡塔大学农业、林业、食品与环境科学系) Sorocaba Environmental Engineering, São Paulo State University (UNESP)(圣保罗州立大学索罗卡巴环境工程) Center for Engineering, Modeling and Applied Social Sciences, Federal University of ABC (UFABC)(ABC联邦大学工程、建模和应用社会科学中心)

AI总结 提出URIEL方法,结合直升机伐木、机器人、AI和无人机采后造林处理,实现高经济可行性和几乎零附带损害,维持生态系统服务。

Comments 196 pages, 40 figures, A revolutionary technology to help protect tropical forests. It was developed, scaled, detailed, calculated, and simulated in an advanced computational environment, with economic and social viability. "E pur si muove"

详情
AI中文摘要

全球热带森林正面临由经济和政治利益驱动的强烈砍伐压力,科学证据表明这种砍伐加剧了气候变化。本文提出了一种新颖的热带森林伐木方法——超低影响包裹式伐木(URIEL)。该方法基于直升机伐木技术,结合机器人技术和人工智能的密集使用,以及由无人机执行的采后造林处理。为此方法开发了合适的设备概念,确定了尺寸,在数字概念验证中完成了细节,并对各种直升机-木材-距离组合进行了有效的数字模拟和经济可行性分析。结果表明,URIEL方法具有高经济可行性,并能在维持生态系统服务的同时几乎消除对森林的附带损害。本文的主要结论是,尽管取得了令人满意的科学和技术成果,但URIEL方法的可行性取决于相关利益相关者的整合:高科技产业、政治政府、认证伐木公司和原住民。

英文摘要

Tropical forests worldwide are under intense deforestation pressure driven by economic and political interests, and scientific evidence suggests this deforestation contributes to climate change. This paper proposes a novel logging method for tropical forests, Ultra-Reduced-Impact-Encased-Logging (URIEL). This new method is based on heli-logging techniques combined with intensive use of robotics and AI integrated with post-harvest silvicultural treatments performed by drones. The concept of appropriate equipment for this method was developed, dimensions were determined, details were completed in a digital proof of concept, and an effective digital simulation and economic feasibility analysis were carried out for various helicopter-timber-distance combinations. The results demonstrated that the URIEL method has high economic viability and makes it possible to virtually eliminate collateral damage to forests while maintaining ecosystem services. The main conclusion of this paper is that, despite the satisfactory scientific and technological results, the feasibility of the URIEL method depends on the integration of stakeholders intrinsic to the context: high-tech industry; political governments; certified logging companies; and native populations.

2605.28876 2026-05-29 cs.SE cs.AI 版本更新

LogDx-CI: Benchmarking Log Reduction Tools for LLM Root-Cause Diagnosis

LogDx-CI:为LLM根因诊断基准测试日志缩减工具

Bowen Qin

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出LogDx-CI基准,比较11种日志缩减工具在35个真实CI故障案例上的效果,发现混合grep+tail路由器在成本质量上占优,且智能体循环可缩小质量差距但成本差异持续存在,同时跨家族LLM摘要器优于同家族。

详情
AI中文摘要

CI失败日志规模大(本语料中位数5000行,最大20万行)且噪声多。尝试调试的编码智能体依赖上游工具将日志缩减为可管理的上下文,但该领域缺乏公开的经验比较来评估哪些缩减能为下游LLM诊断保留足够证据。我们引入LogDx-CI基准,比较11种上下文缩减工具(原始、尾部、grep、三种RTK模式、两种真实LLM map-reduce摘要器、三种混合路由器)在35个真实GitHub Actions失败案例上的表现,由3个LLM调试器家族(Claude Haiku 4.5、Claude Sonnet 4.6、OpenAI gpt-5-mini)以及一个Sonnet 4.6工具使用智能体评分。我们报告三个重要发现。(1)混合grep+tail路由器主导成本-质量帕累托前沿;前两种方法得分0.670/0.666,每案例约0.03美元,质量与独立grep相当但令牌数减少4.5倍。(2)在智能体循环场景中,不同缩减工具的质量范围缩小7倍(单次得分跨度0.42 → 智能体循环跨度0.059);智能体通过后续工具调用挽救弱上下文。然而,成本差异持续存在:弱上下文迫使智能体发出2-4倍的工具调用来恢复。(3)跨家族LLM摘要-调试器对(gpt-5-mini摘要器供给Claude Haiku调试器)在四个诊断变体上的平均得分比同家族对高0.071,否定了该任务上的自我调用偏差假设。gpt-5-mini摘要器也是智能体循环中的第一名方法(得分0.749),每案例0.37次工具调用,且缩减器成本比Haiku摘要器低10倍(每案例0.18美元 vs 1.75美元)。所有数据、代码、每个案例的捆绑包和可复现性基础设施均已公开。

英文摘要

CI failure logs are large (median 5k lines, max 200k in this corpus) and noisy. Coding agents that try to debug them depend on an upstream tool to reduce the log to a manageable context, but the field has had no public empirical comparison of which reductions preserve enough evidence for downstream LLM diagnosis. We introduce LogDx-CI, a benchmark that compares 11 context-reduction tools (raw, tail, grep, three RTK modes, two real LLM map-reduce summarizers, three hybrid routers) on 35 real GitHub Actions failure cases, scored by 3 LLM debugger families (Claude Haiku 4.5, Claude Sonnet 4.6, OpenAI gpt-5-mini) plus a Sonnet 4.6 tool-using agent. We report three load-bearing findings. (1) Hybrid grep+tail routers dominate the cost-quality Pareto frontier; the top two methods score 0.670 / 0.666 at ~$0.03 per case, same-ballpark quality as standalone grep at 4.5x fewer tokens. (2) In the agent-loop regime, the quality range across reduction tools collapses 7x (single-shot spread 0.42 → agent-loop spread 0.059); the agent rescues weak contexts via follow-up tool calls. However, cost differences persist: weak contexts force the agent to issue 2-4x more tool calls to recover. (3) A cross-family LLM-summary pair (gpt-5-mini summarizer feeding a Claude Haiku debugger) beats the same-family pair by +0.071 averaged across four diagnoser variants, falsifying the self-call-bias hypothesis on this task. The gpt-5-mini summarizer is also the agent-loop #1 method (score 0.749) at 0.37 tool calls per case and 10x lower reducer cost than the Haiku summarizer ($0.18 vs $1.75 per case). All data, code, per-case bundles, and reproducibility infrastructure are public.
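
A grep+tail style reduction can be sketched as below. This is a minimal illustration, assuming a simple union of pattern matches (with a little surrounding context) and the final lines of the log; the benchmark's hybrid routers may combine or choose between the two differently. The pattern, the function name `reduce_log`, and the synthetic log are all assumptions.

```python
import re

# Hypothetical failure markers; real CI logs would need a tuned pattern.
ERROR_PATTERN = re.compile(r"error|failed|exception|traceback", re.IGNORECASE)


def reduce_log(lines, context=2, tail=50):
    """Keep lines matching the error pattern (plus context) and the last `tail` lines."""
    keep = set(range(max(0, len(lines) - tail), len(lines)))  # tail part
    for i, line in enumerate(lines):
        if ERROR_PATTERN.search(line):  # grep part
            keep.update(range(max(0, i - context), min(len(lines), i + context + 1)))
    return [lines[i] for i in sorted(keep)]  # preserve original order


# Synthetic 1000-line log with one failure in the middle.
log = [f"step {i} ok" for i in range(1000)]
log[100] = "ERROR: compilation failed in module foo"
reduced = reduce_log(log, context=1, tail=3)
```

The tail catches failures that surface only in the final summary, while the grep part retains evidence buried mid-log; either alone can miss the root cause.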

2605.28870 2026-05-29 cs.LG cs.AI 版本更新

Representation Alignment Rests on Linear Structure

表示对齐依赖于线性结构

Kiril Bangachev, Guy Bresler, Yury Polyanskiy

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过信号、偏差和噪声的三部分统计框架研究柏拉图表示假说,提出对齐源于对象与属性的线性关系,并通过稀疏自编码器提取线性特征、中心化和归一化减少偏差、以及数据稀缺导致噪声等证据支持该框架。

详情
AI中文摘要

我们通过表示的三部分统计框架研究柏拉图表示假说(PRH):信号、偏差和噪声。(1)信号:我们提出柏拉图对齐源于对象与属性之间的普遍关系,根据线性表示假说(LRH),这种关系在表示中以线性方式编码。我们通过稀疏自编码器提取线性对象-属性特征,并展示这些稀疏表示通常比其稠密对应物表现出更强的跨模态对齐,从而提供证据表明LRH有助于解释PRH。(2)偏差:由于使用的架构和训练过程不同,模型具有不同的隐式偏差。我们表明这种差异可以部分缓解:中心化和归一化一致地改善跨模型对齐。(3)噪声:有限样本训练导致表示中的噪声。我们通过揭示LLM和文本嵌入模型中词频与对齐之间强且一致的正相关,提供证据表明表示噪声由数据稀缺驱动。综合信号、偏差和噪声,我们提出一个统计模型,该模型细化了线性表示假说,并解释了与多种现代AI架构中出现的表示对齐相关的进一步现象。

英文摘要

We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. (1) Signal: We propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). We provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. (2) Bias: Models have different implicit biases due to the diverse architectures and training procedures used. We show that this difference can be partially mitigated. Centering and normalization consistently improve cross-model alignment. (3) Noise: Finite-sample training leads to noise in representations. We provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, we propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.
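The centering and normalization step mentioned under "Bias" can be illustrated with a small, standard-library-only sketch. The helper names `center_and_normalize` and `cosine` and the toy data are assumptions; the paper's exact preprocessing and alignment metric are not stated in the abstract.

```python
import math


def center_and_normalize(rows):
    """Subtract the per-dimension mean, then scale each row to unit L2 norm."""
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    out = []
    for r in centered:
        norm = math.sqrt(sum(x * x for x in r))
        out.append([x / norm for x in r] if norm > 0 else r)
    return out


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Toy embeddings sharing a large constant offset (a stand-in for a model-specific bias).
raw = [[11.0, 10.0], [9.0, 10.0], [10.0, 11.0], [10.0, 9.0]]
proc = center_and_normalize(raw)

print(round(cosine(raw[0], raw[1]), 3))   # close to 1: the shared offset dominates
print(round(cosine(proc[0], proc[1]), 3))  # -1.0: opposite directions once centered
```

In this toy case the shared offset makes every pair of raw vectors look nearly identical, while centering exposes the actual geometry; this is only meant to show why such preprocessing can change similarity-based alignment scores.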

2605.28869 2026-05-29 cs.LG cs.AI 版本更新

Balancing Multimodal Learning through Label Space Reshaping

通过标签空间重塑平衡多模态学习

Xiaoyu Ma, Weijie Zhang, Yuanhao Gao, Han Miao, Yongjian Deng, Hao Chen

AI总结 针对多模态学习中模态不平衡问题,提出基于标签空间重塑的BMLR方法,通过均衡各模态映射难度来提升多模态性能。

Comments In process

详情
AI中文摘要

多模态学习常受模态不平衡问题困扰,其中收敛较快的模态主导优化,而其他模态训练不足。现有方法通常通过加强弱模态或调整优化梯度来缓解此问题。然而,这些策略主要补偿优化速率差异,往往以牺牲强模态的优化能力为代价,而未从模态层面分析这些差异如何产生。基于理论洞察和实证观察,我们认为学习速度的差异源于模态特定特征空间与共享标签空间之间映射难度的不同。为解决此问题,我们提出了平衡多模态标签重塑(BMLR),这是首个从标签侧设计促进多模态平衡的方法。BMLR重塑跨模态标签空间以均衡各模态的映射难度,从而促进模态交互并为每个模态注入更丰富的类间信息。跨多种架构的大量实验表明,BMLR持续提升多模态性能,并与多种模型设计表现出强兼容性。源代码即将发布。

英文摘要

Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak modality or adjusting optimization gradients. However, such strategies mainly compensate for optimization rate discrepancies, often at the expense of the strong modality's optimization capacity, without analyzing how these discrepancies arise at the modality level. Based on theoretical insights and empirical observations, we argue that the discrepancy of learning pace arises from differences in the mapping difficulty between modality-specific feature space and the shared label space. To address this issue, we propose Balanced Multimodal Label Reshaping (BMLR), the first method that promotes multimodal balance from the label-side design. BMLR reshapes the cross-modal label space to equalize mapping difficulty across modalities, thereby facilitating modality interaction and injecting richer inter-class information into each modality. Extensive experiments across multiple architectures demonstrate that BMLR consistently improves multimodal performance and exhibits strong compatibility with diverse model designs. The source code will be released soon.

2605.28868 2026-05-29 cs.LG cs.AI 版本更新

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill:通过蒸馏基因组基础模型改进宏基因组分类注释

Rongye Ye, Lun Li, Zheng Luo, Yiran Zhan, Shuhui Song

发表机构 * National Genomics Data Center, China National Center for Bioinformation(中国生物信息中心国家基因组数据中心) Beijing Key Laboratory of Intelligent Governance and Application of Biological Big Data, China National Center for Bioinformation(北京生物大数据智能治理与应用重点实验室,中国生物信息中心) Beijing Institute of Genomics, Chinese Academy of Sciences(北京基因组研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出TaxDistill知识蒸馏框架,利用500M参数的基因组基础模型GenomeOcean作为教师网络生成软标签,以减轻初始检索工具引入的标签噪声,从而提升宏基因组序列分类性能。

Comments The manuscript contains 14 pages, 7 figures, and 3 tables

详情
AI中文摘要

宏基因组分类注释旨在识别环境样本中DNA片段的微生物起源。依赖序列相似性的传统方法通常受到高微生物多样性和参考数据库不完整性的限制,这推动了诸如Taxometer等学习方法的发展,这些方法通过事后校正来学习更具信息量的宏基因组序列表示。然而,这些方法通常依赖于训练期间从相似性搜索工具获得的标签,这不可避免地引入了噪声,从而损害表示学习并降低分类性能。为了解决这个问题,我们提出了TaxDistill,一种用于宏基因组分类的知识蒸馏框架。我们引入GenomeOcean,一个500M参数的基因组基础模型,作为教师网络来提取深层语义特征并基于置信度生成软标签。通过将这些软标签信息蒸馏到轻量级学生网络中,TaxDistill有效减少了初始检索工具引入的标签噪声。在七个不同的CAMI2数据集上的全面实验表明,TaxDistill在大多数场景下优于现有基线。例如,在胃肠道数据集上,它将MMseqs2的F1分数从0.763提高到0.941,优于Taxometer基线。总体而言,TaxDistill为复杂宏基因组分析中的标签校正提供了一种可靠的方法。

英文摘要

Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by the high microbial diversity and the incompleteness of reference databases, which has motivated the development of learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. However, these methods typically rely on labels derived from similarity search tools during training, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this issue, we propose TaxDistill, a knowledge distillation framework for metagenomic classification. We introduce GenomeOcean, a 500M parameter genomic foundation model, as the teacher network to extract deep semantic features and generate soft labels based on confidence. By distilling this soft label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets demonstrate that TaxDistill outperforms existing baselines in most scenarios. For instance, on the Gastrointestinal dataset, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.

2605.28867 2026-05-29 cs.LG cs.AI 版本更新

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

PrismFlow: 时间序列生成中流匹配的残差动力学

Junru Zhang, Lang Feng, Jinbo Wang, Xu Guo, Yucheng Wang, Han Yu, Min Wu, Yabo Dong, Duanqing Xu

发表机构 * Zhejiang University, China(浙江大学,中国) Nanyang Technological University, Singapore(南洋理工大学,新加坡) I2R, Agency for Science, Technology and Research (A*STAR), Singapore(新加坡科技研究局(A*STAR)资讯通信研究院(I2R),新加坡)

AI总结 提出PrismFlow方法,通过Koopman启发的动力学专家和置信度感知的胜者全得目标,在流匹配中学习残差修正,以解决标准流匹配中全局向量场估计器导致的频谱失真和模式覆盖不足问题,在时间序列生成中取得最优性能。

详情
AI中文摘要

生成高质量时间序列数据具有挑战性,因为现实世界的信号通常表现出多模态模式和多尺度动力学,包括振荡和高频变化。流匹配(FM)为扩散模型提供了一种高效的替代方案,但实际实现通常依赖于单个有限容量的全局向量场估计器。在这种异质的时间分布中,不同的状态可能通过邻近的流状态,同时需要不相容的条件速度。使用标准$\ell_2$速度匹配目标训练的单一估计器可能学习到局部传输场的过度平滑近似。这种估计器级别的平滑会减弱分支特定的动力学,导致频谱失真和较差的模式覆盖。为了解决这个问题,我们提出了PrismFlow,一种新的具有Koopman启发动力学专家的FM方法。每个专家在一个潜在空间中学习残差修正,其中局部非线性时间演化可以通过线性变换近似。我们进一步提出了一种置信度感知的胜者全得(WTA)目标,该目标仅更新与每个样本最对齐的专家,同时屏蔽其他专家的梯度,鼓励模式特定专业化。在采样过程中,选定的专家向全局传输场添加残差动力学修正,在保持FM稳定性的同时恢复细粒度和高频时间结构。在各种基准测试中,PrismFlow有效缓解了标准FM中的频谱收缩,并实现了最先进的性能,Context-FID提升了15.6%,判别分数提升了38.6%,同时在低数据设置下保持鲁棒性,并有效用于预测和插补。

英文摘要

Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an efficient alternative to diffusion models, but practical implementations typically rely on a single finite-capacity global vector-field estimator. In such heterogeneous temporal distributions, distinct regimes may pass through nearby flow states while requiring incompatible conditional velocities. A monolithic estimator trained with the standard $\ell_2$ velocity-matching objective may therefore learn an overly smoothed approximation of the local transport field. This estimator-level smoothing can attenuate branch-specific dynamics, leading to spectral distortion and poor mode coverage. To address this, we propose PrismFlow, a new FM method with Koopman-inspired dynamical experts. Each expert learns residual corrections in a latent space where local nonlinear temporal evolution can be approximated by linear transitions. We further propose a confidence-aware Winner-Take-All (WTA) objective that updates only the expert best aligned with each sample while masking gradients to the others, encouraging mode-specific specialization. During sampling, the selected expert adds a residual dynamical correction to the global transport field, preserving FM stability while recovering fine-grained and high-frequency temporal structures. Across various benchmarks, PrismFlow effectively mitigates the spectral contraction in standard FM and achieves state-of-the-art performance, with a 15.6% gain in Context-FID and a 38.6% improvement in Discriminative Score, while remaining robust in low-data settings and effective for forecasting and imputation.
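A plain Winner-Take-All selection, the basic idea behind the WTA objective described above, can be sketched as follows. This is an illustration under assumptions: the confidence-aware weighting used by PrismFlow is not specified in the abstract and is not modeled here, and the names `squared_error` and `wta_loss` are hypothetical.

```python
def squared_error(pred, target):
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def wta_loss(expert_preds, target):
    """Return (winner index, winner loss); only the winner would receive gradient."""
    losses = [squared_error(p, target) for p in expert_preds]
    winner = min(range(len(losses)), key=losses.__getitem__)
    return winner, losses[winner]


# Three hypothetical experts predicting a 2-D velocity for one sample.
preds = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
target = [0.0, 0.9]
winner, loss = wta_loss(preds, target)
print(winner, loss)
```

In an autodiff framework the same effect is typically obtained by masking the per-expert losses so that non-winning experts contribute zero gradient for that sample.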

2605.28866 2026-05-29 cs.LG cs.AI 版本更新

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

连续性与序数性至关重要:利用大语言模型进行有效时间序列分析的时间序列令牌约束

Musheng Li, Ziying Zhang, Cheng Jin, Yuantao Gu

发表机构 * Department of Electronic Engineering(电子工程系)

AI总结 针对令牌化时间序列大语言模型忽略连续性和序数性的问题,提出COM策略,通过几何约束初始化与训练阶段,提升模型在多个时间序列分析基准上的性能与泛化能力。

详情
AI中文摘要

基于令牌的时间序列大语言模型(TS-LLMs)已成为时间序列分析与推理的一个有前景的方向。然而,先前的研究在很大程度上忽略了时间序列令牌固有的连续性和序数性,这严重限制了模型性能。在本文中,我们认为在时间序列令牌嵌入中保留这些属性对于基于令牌的TS-LLMs的有效性至关重要。为此,我们提出了COM(连续性与序数性至关重要),这是一种连续性和序数性感知策略,将几何约束整合到初始化阶段和训练阶段。在多个时间序列分析基准上的实证结果表明,COM持续提升了基于令牌的TS-LLMs的性能,取得了有竞争力的结果和强大的泛化能力。代码可在 https://anonymous.4open.science/r/COM 获取。

英文摘要

Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. However, prior studies largely overlook the inherent continuity and ordinality of time series tokens, which substantially limits model performance. In this paper, we argue that preserving these properties in time series token embeddings is crucial for the effectiveness of token-based TS-LLMs. To this end, we propose COM (Continuity and Ordinality Matter), a continuity- and ordinality-aware strategy that integrates geometric constraints into both the initialization and training stages. Empirical results on multiple time series analysis benchmarks demonstrate that COM consistently improves the performance of token-based TS-LLMs, achieving competitive results and strong generalizability. Code is available at https://anonymous.4open.science/r/COM .

2605.28865 2026-05-29 cs.LG cs.AI 版本更新

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

无需语言监督的物理交互中世界模型中的涌现语义表征

Jiayi Fang

发表机构 * Independent Researcher(独立研究者)

AI总结 通过无语言监督的物理探索训练VAE世界模型,发现其潜在空间自发形成与物理几何结构对齐的语义结构,且预测性能与语义对齐共同提升,验证了物理几何作为世界模型表征的组织原则。

Comments 10 pages, 3 figures

详情
AI中文摘要

世界模型从物理探索中学习到什么,没有任何语言监督?我们认为答案由单一原则组织:物理世界的几何结构。在随机具身探索上训练基于VAE的世界模型,我们发现其潜在空间发展出反映物理几何的空间语义结构——方向准确率0.677±0.029对比随机初始化编码器的0.547,位置RSA 0.192±0.047对比随机编码器的0.029(提升6.6倍),表明训练诱导了超越CNN归纳偏置的真正结构组织。在20个时间检查点上,预测性能和语义对齐共同提升(Spearman r=-0.61, p=0.004),与共享驱动者解释一致。我们通过双重敲除确认:标准KL正则化(beta=0.1)迫使编码器远离几何结构,预测性能和语义对齐同时崩溃至接近随机水平(第50,000步),完全符合共享驱动者预测。将beta降至0.001可恢复几何访问并同时恢复两种能力。这些发现确立了物理世界几何作为世界模型表征的组织原则,对设计语义基础的具身智能体具有直接意义。

英文摘要

What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry -- direction accuracy 0.677 ± 0.029 versus 0.547 for a randomly initialized encoder, and position RSA 0.192 ± 0.047 versus 0.029 for random encoders (6.6× improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co-improve (Spearman r = -0.61, p = 0.004), consistent with the shared-driver account. We confirm this through a double knockout: standard KL regularization (beta = 0.1) forces the encoder away from geometric structure, and both prediction performance and semantic alignment collapse simultaneously to near-chance by step 50,000 -- exactly as the shared-driver account predicts. Reducing beta to 0.001 restores geometric access and recovers both capabilities together. These findings establish physical world geometry as the organizing principle of world model representations, with direct implications for the design of semantically grounded embodied agents.

2605.28864 2026-05-29 cs.AI cs.CL 版本更新

The Cognitive Categorical Transformer: Category-Theoretic Inductive Biases for Language Modeling

认知范畴变换器:用于语言建模的范畴论归纳偏置

Al Kari

发表机构 * Manceps Inc.(Manceps公司)

AI总结 提出认知范畴变换器(CCT),通过引入基于范畴论和认知科学的组件,在WikiText-103上以306M参数实现21.27验证困惑度,相比GPT-2 Small基线降低2.92 PPL(12%相对提升),并通过消融实验证实单纯复形消息传递贡献了84%的改进。

详情
AI中文摘要

认知范畴变换器(CCT)是一个306M参数的架构,它通过源自范畴论和认知科学的认知启发组件增强了预训练的GPT-2 Small骨干网络。在WikiText-103上采用匹配步数协议(215,000优化器步数、匹配数据、匹配优化器和调度)下,CCT达到21.27验证困惑度,而相同微调的GPT-2 Small基线为24.19。因此,该架构在领域内微调本身之外贡献了2.92 PPL(12%相对)的降低。一个从头开始重训练的消融实验,在整个七阶段激活调度中保持GT-Full单纯复形消息传递绕过,达到23.72 PPL,将84%的架构改进(2.45 of 2.92 PPL)归因于GT-Full。我们首次提供了消融验证的证据,表明单纯复形消息传递在WikiText-103上以306M参数规模改善了语言模型困惑度。已发表的GPT-2 Large在WikiText-103上以比GPT-2 Small多6.2倍的参数达到22.05零样本困惑度;本文将这一数字视为外部已发表参考,而非架构基准。关于一致性风格的范畴先验(层平滑、伴随往返、曲率正则化)的三个负面结果,以及GT-Full和PrecisionWeightedPP的联合结构先验结果,共同支持了一个经验模式,称为*结构/一致性区分*,其中添加新拓扑的范畴先验改善了语言建模,而强制执行一致性恒等式的范畴先验则没有。

英文摘要

The Cognitive Categorical Transformer (CCT) is a 306M-parameter architecture that augments a pretrained GPT-2 Small backbone with cognitively grounded components derived from category theory and several inspirations from cognitive science. Under a matched-step protocol (215,000 optimizer steps, matched data, matched optimizer and schedule) on WikiText-103, CCT reaches 21.27 validation perplexity, compared with 24.19 for an identically fine-tuned GPT-2 Small baseline. The architecture therefore contributes a 2.92 PPL (12% relative) reduction beyond what in-domain fine-tuning alone provides. A retrain-from-scratch ablation that holds GT-Full simplicial message passing bypassed across the entire seven-phase activation schedule reaches 23.72 PPL, localizing 84% of the architectural improvement (2.45 of 2.92 PPL) to GT-Full. We present the first ablation-validated evidence that simplicial message passing improves language-model perplexity at the 306M-parameter scale on WikiText-103. Published GPT-2 Large reaches 22.05 zero-shot PPL on WikiText-103 with 6.2x more parameters than GPT-2 Small; this paper treats that number as an external published reference, not as the architectural benchmark. Three negative results on consistency-style categorical priors (sheaf smoothing, adjunction round-trip, curvature regularization) and the joint structural-prior result for GT-Full and PrecisionWeightedPP together support an empirical pattern termed the *structure/consistency distinction*, in which categorical priors that add new topology improve language modeling and those that enforce a consistency identity do not.
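The percentages quoted in the abstract follow directly from the three perplexity figures. A short check, using only numbers stated above (the variable names are illustrative):

```python
baseline_ppl = 24.19   # identically fine-tuned GPT-2 Small baseline
cct_ppl = 21.27        # full CCT
ablation_ppl = 23.72   # CCT retrained with GT-Full bypassed

total_gain = baseline_ppl - cct_ppl      # architectural improvement over the baseline
gt_full_gain = ablation_ppl - cct_ppl    # improvement lost when GT-Full is bypassed

print(f"total gain: {total_gain:.2f} PPL ({total_gain / baseline_ppl:.1%} relative)")
print(f"GT-Full share: {gt_full_gain:.2f} / {total_gain:.2f} = {gt_full_gain / total_gain:.0%}")
```

This reproduces the 2.92 PPL (about 12% relative) reduction and the 2.45 of 2.92 PPL (about 84%) attributed to GT-Full.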

2605.28863 2026-05-29 cs.LG cs.AI 版本更新

Self-Play Reinforcement Learning under Imperfect Information in Big 2

大二(Big 2)中不完全信息下的自我对弈强化学习

Aalok Patwa

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出一个自我对弈强化学习框架,在四人不完全信息纸牌游戏Big 2中比较策略梯度和值近似方法,发现PPO优于其他方法,并证明中等熵正则化和当前策略自我对弈的有效性。

Comments 11 pages

详情
AI中文摘要

不完全信息多人游戏测试智能体在隐藏信息、稀疏奖励和非平稳对手下的行动能力。我们在Big 2(一个四人不完全信息纸牌游戏)中研究这些挑战。我们为Big 2开发了一个自我对弈强化学习框架,能够对策略梯度和值近似智能体进行受控比较。在共同的环境、输入表示、训练预算和评估协议下,PPO在对抗随机、贪婪和启发式Big 2对手时优于蒙特卡洛Q近似、SARSA和Q学习。我们进一步发现,适度的熵正则化通过防止策略变得过于确定性来改进PPO,并且当前策略自我对弈比检查点自我对弈或固定对手训练提供了更强的有限预算课程。这些结果共同表明,Big 2是研究不完全信息、多人交互、延迟奖励和可变动作集下深度强化学习的一个有用的受控环境。

英文摘要

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.

2605.28855 2026-05-29 cs.AI 版本更新

Behavior-Aware Auxiliary Corrections for Off-Policy Temporal-Difference Prediction

行为感知的辅助校正用于离策略时序差分预测

Xingguo Chen, Zhiang He, Yuchen Shen, Shangdong Yang, Chao Li, Guang Yang, Wenhao Wang

发表机构 * Nanjing University of Posts and Telecommunications(南京邮电大学) Department of Computer Science and Technology, Nanjing University(南京大学计算机科学与技术系) College of Electronic Countermeasure, National University of Defense Technology(国防科技大学电子对抗学院)

AI总结 针对离策略时序差分学习的不稳定性,提出行为感知的辅助协方差校正方法(BA-TDC/BA-TDRC),通过替换辅助矩阵为行为贝尔曼矩阵,并引入正则化,在保持不动点和收敛性的同时提升性能。

详情
AI中文摘要

在函数近似下,离策略采样中的时序差分学习可能不稳定。TDC通过辅助协方差校正稳定离策略TD,而TDRC在单时间尺度递归中进一步正则化该校正。本文研究在线性预测设置中行为感知的辅助协方差几何替换,这是理解值函数近似特征空间动力学的标准局部模型。我们首先将TDC辅助矩阵(C)替换为行为贝尔曼矩阵(A_μ),得到BA-TDC,然后正则化同一行为感知方程得到BA-TDRC。这种两步构造将行为感知几何的贡献与正则化的贡献分离。线性分析还为神经网络值近似中出现的辅助几何设计问题提供了一个可处理模型,其中特征协方差和时间转移矩阵共同塑造最后一层校正动力学。我们给出了有限状态均值系统公式,证明了在实例化均值系统的Hurwitz稳定性条件下的不动点保持和几乎必然收敛,并通过精确线性误差递归的谱半径比较了确定性均值速率。在二状态反例、Baird反例、随机游走和Boyan链上的实验表明,行为感知替换本身在某些任务上非常有益,但正则化对于在更困难设置下实现稳健性能是必要的。

英文摘要

Temporal-difference learning with function approximation can be unstable under off-policy sampling. TDC stabilizes off-policy TD through an auxiliary covariance correction, and TDRC further regularizes this correction in a single-timescale recursion. This paper studies a behavior-aware replacement of the auxiliary covariance geometry in the linear prediction setting, which is the standard local model for understanding the feature-space dynamics of value-function approximation. We first replace the TDC auxiliary matrix (C) by the behavior Bellman matrix (A_μ), yielding BA-TDC, and then regularize the same behavior-aware equation to obtain BA-TDRC. This two-step construction separates the contribution of behavior-aware geometry from the contribution of regularization. The linear analysis also provides a tractable model for an auxiliary-geometry design question that arises in neural-network value approximation, where feature covariances and temporal transition matrices jointly shape the last-layer correction dynamics. We give a finite-state mean-system formulation, prove fixed-point preservation and almost-sure convergence under a Hurwitz stability condition on the instantiated mean system, and compare deterministic mean rates through the spectral radius of the exact linear error recursion. Experiments on the two-state counterexample, Baird's counterexample, Random Walk, and Boyan Chain show that the behavior-aware replacement can be highly beneficial by itself on some tasks, but that regularization is necessary for robust performance across harder settings.
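For readers unfamiliar with the baseline being modified, the following is a minimal sketch of one common form of the standard TDC update with linear features and an importance-sampling ratio. It is not the paper's BA-TDC or BA-TDRC: the abstract says those replace the auxiliary covariance matrix C with the behavior Bellman matrix A_μ, but the sampled form of that replacement is not given there and is not attempted here. The function name, step sizes, and toy transition are assumptions.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def tdc_step(theta, w, phi, phi_next, reward, rho, gamma=0.9, alpha=0.05, beta=0.1):
    """One standard TDC update with linear features; returns new (theta, w)."""
    delta = reward + gamma * dot(theta, phi_next) - dot(theta, phi)  # TD error
    w_phi = dot(w, phi)
    # Main weights: TD step plus the gradient-correction term.
    theta_new = [t + alpha * rho * (delta * f - gamma * w_phi * fn)
                 for t, f, fn in zip(theta, phi, phi_next)]
    # Auxiliary weights: the -(w . phi) * phi term is a sample of C w, with C = E[phi phi^T].
    w_new = [wi + beta * (rho * delta - w_phi) * f for wi, f in zip(w, phi)]
    return theta_new, w_new


# One toy transition between two one-hot states, on-policy (rho = 1).
theta, w = tdc_step([0.0, 0.0], [0.0, 0.0],
                    phi=[1.0, 0.0], phi_next=[0.0, 1.0], reward=1.0, rho=1.0)
print(theta, w)
```

The auxiliary update is where the covariance geometry enters, which is the part the abstract describes replacing with a behavior-aware matrix.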

2605.28849 2026-05-29 cs.AI 版本更新

Behavior-Induced Mirror-Prox Temporal-Difference Learning for Faster Off-Policy Prediction

行为诱导的镜像-近似时间差分学习用于更快的离策略预测

Xingguo Chen, Yuchen Shen, Shangdong Yang, Chao Li, Guang Yang, Wenhao Wang

发表机构 * Nanjing University of Posts and Telecommunications(南京邮电大学) Department of Computer Science and Technology, Nanjing University(南京大学计算机科学与技术系) College of Electronic Countermeasure, National University of Defense Technology(国防科技大学电子对抗学院)

AI总结 提出一种行为诱导的镜像-近似时间差分方法(STHTD-MP),通过用行为策略Bellman矩阵的对称部分替换协方差度量来改进离策略预测的几何结构,并证明其收敛性和更小的平均收缩因子。

详情
AI中文摘要

梯度时间差分方法通过线性函数逼近提供稳定的离策略预测,但其实际性能强烈受辅助变量度量诱导的几何结构影响。现有的镜像-近似TD方法通常使用特征协方差度量,而混合TD方法表明行为策略转移信息可以提供更具信息性的更新几何结构。本文提出一种行为诱导的镜像-近似时间差分方法,称为STHTD-MP,它将原始-对偶鞍点公式中的协方差度量替换为行为策略Bellman矩阵的对称部分。该方法对原始变量和辅助变量保持单一学习率,并对得到的混合鞍点算子应用镜像-近似预测-校正步骤。我们在标准随机逼近假设下对固定策略线性预测提供了形式化的收敛分析:行为诱导度量正定,联合均值系统Hurwitz稳定,有界性通过Lyapunov论证得到,随机递归通过ODE方法收敛。我们进一步推导了投影预言遍历间隙界,并基于确定性镜像-近似误差矩阵的谱半径与GTD2-MP进行了精确的均值算子比较。分析表明,当行为诱导度量改善鞍点几何结构时,STHTD-MP可以比GTD2-MP具有更小的平均收缩因子。在二状态、随机游走和Boyan Chain基准上的精确数值均值算子分析支持了这一条件,而Baird的反例被识别为一个奇异边界情况,其中严格假设不成立。

英文摘要

Gradient temporal-difference methods provide stable off-policy prediction with linear function approximation, but their practical performance is strongly affected by the geometry induced by the auxiliary-variable metric. Existing Mirror-Prox TD methods typically use the feature covariance metric, whereas hybrid TD methods suggest that behavior-policy transition information can provide a more informative update geometry. This paper proposes a behavior-induced Mirror-Prox temporal-difference method, called STHTD-MP, which replaces the covariance metric in the primal-dual saddle-point formulation with the symmetric part of the behavior-policy Bellman matrix. The method keeps a single learning rate for the primal and auxiliary variables and applies a Mirror-Prox prediction-correction step to the resulting hybrid saddle-point operator. We provide a formal convergence analysis for fixed-policy linear prediction under standard stochastic approximation assumptions: the behavior-induced metric is positive definite, the joint mean system is Hurwitz, boundedness follows from a Lyapunov argument, and the stochastic recursion converges by the ODE method. We further derive projected-oracle ergodic gap bounds and an exact mean-operator comparison with GTD2-MP based on the spectral radius of the deterministic Mirror-Prox error matrix. The analysis shows that STHTD-MP can have a smaller mean contraction factor than GTD2-MP when the behavior-induced metric improves the saddle-point geometry. Exact numerical mean-operator analysis on two-state, Random Walk, and Boyan Chain benchmarks supports this condition, while Baird's counterexample is identified as a singular boundary case where the strict assumptions fail.

2605.28848 2026-05-29 cs.CL cs.AI 版本更新

GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models

GPF-LiveNews: 大型语言模型中群体条件框架的流式评估协议

Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Roy George

发表机构 * Clark Atlanta University(克拉克亚特兰大大学)

AI总结 提出GPF-LiveNews流式评估协议,通过实时新闻锚点与身份标签组合,检测LLM输出中针对不同受众的语义敏感性和情感差异,用于审计群体条件框架。

详情
AI中文摘要

部署的语言模型在非静态环境中进行评估:模型版本、检索层、安全系统和真实世界输入都随时间变化。静态偏差基准仍然有用,但它们无法显示模型如何针对不同提示受众构建新出现事件的框架。我们引入了GPF-LiveNews,这是一个流式评估协议和基准快照,用于审计开放端LLM输出中的群体条件框架。该协议将来自BBC/路透社的最新新闻锚点扩展到42个身份标签和七个提示族,然后使用语义敏感性和情感差异信号评估响应束。在12次监控运行和23个托管模型的试点中,政策/行动提示产生了最强的语义运动,而情感变化在维度和提示族之间较为平坦。发布的工件包括文章元数据、提示模板、实例化提示、模型输出元数据、评分表、文档和复现脚本。我们将所有评分解释为用于人工审查的观察窗口审计信号,而非永久性的公平性排名或有害偏差的直接证据。

英文摘要

Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LiveNews, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.

2605.28842 2026-05-29 cs.CL cs.AI 版本更新

Thoughts-as-Planning: Latent World Models for Chain-of-Thoughts Optimization via Reinforcement Planning

思想即规划:通过强化规划进行思维链优化的潜在世界模型

Dong Liu, Yanxuan Yu, Ying Nian Wu

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Columbia University(哥伦比亚大学)

AI总结 提出Thoughts-as-Planning框架,将思维链优化形式化为潜在语义空间中的序贯决策过程,通过潜在世界模型模拟推理链编辑对下游输出的影响,并利用梯度下降或强化学习进行规划,在语言理解和生成任务上优于现有基线。

详情
AI中文摘要

大型语言模型(LLMs)在多种NLP任务上的成功提升了推理链优化作为对齐模型行为与任务目标的关键步骤的重要性。现有的推理链调优方法通常依赖于黑盒启发式或免梯度搜索,缺乏可解释性、泛化能力和样本效率。在这项工作中,我们引入了思想即规划(Thoughts-as-Planning),一个新颖的框架,将推理链优化形式化为潜在语义空间上的序贯决策过程。我们将LLM建模为部分可观测环境,并学习一个潜在世界模型来模拟推理链编辑对下游输出的影响。构建了一个保持邻近性的嵌入空间来编码推理链-响应动态,从而通过梯度下降或强化学习实现规划。我们的方法支持多尺度抽象,允许在token、片段和指令级别进行推理链编辑,并集成到统一规划器中。通过在语言理解和生成任务上的大量实验,我们证明了思想即规划在效率、鲁棒性和泛化性方面优于最先进的推理链调优基线,同时通过其结构化规划轨迹提供了可解释性。我们的代码可在 https://github.com/FastLM/Thoughts-as-Planning 获取。

英文摘要

The success of large language models (LLMs) across diverse NLP tasks has elevated the importance of reasoning chain optimization as a critical step in aligning model behavior with task objectives. Existing reasoning chain tuning methods often rely on black-box heuristics or gradient-free search, which lack interpretability, generalization, and sample efficiency. In this work, we introduce Thoughts-as-Planning, a novel framework that formalizes reasoning chain optimization as a sequential decision-making process over a latent semantic space. We model the LLM as a partially observable environment and learn a latent world model that simulates the effect of reasoning chain edits on downstream outputs. A proximity-preserving embedding space is constructed to encode reasoning chain-response dynamics, enabling planning via gradient descent or reinforcement learning. Our method supports multi-scale abstraction, allowing reasoning chain edits at token, segment, and instruction levels to be integrated into a unified planner. Through extensive experiments on language understanding and generation tasks, we demonstrate that Thoughts-as-Planning outperforms state-of-the-art reasoning chain tuning baselines in efficiency, robustness, and generalization, while offering interpretability through its structured planning trajectory. Our code is available at https://github.com/FastLM/Thoughts-as-Planning.

2605.28840 2026-05-29 cs.CL cs.AI cs.SE 版本更新

How Consistent Are LLM Agents? Measuring Behavioral Reproducibility in Multi-Step Tool-Calling Pipelines

LLM代理的一致性如何?测量多步工具调用流水线中的行为可重复性

Abel Yagubyan

发表机构 * Independent Researcher(独立研究者)

AI总结 研究多步工具调用LLM代理在重复相同调用时是否选择相同工具、顺序和参数,通过系统实验测量行为一致性,并发现代理存在显著不一致性。

Comments 16 pages, 6 figures

详情
AI中文摘要

具有工具调用能力的大型语言模型(LLM)代理越来越多地部署在生产系统中,但一个基本的可靠性问题仍未得到充分探索:同一个代理是否会以相同的方式运行两次?我们对多步工具调用代理的行为一致性进行了系统的实证研究,测量代理在重复相同调用时是否选择相同的工具、以相同的顺序、使用相同的参数。与先前关于ReAct风格代理(仅搜索、自由文本动作)一致性的工作不同,我们研究了具有类型化参数和附带副作用的结构化工具调用接口的更丰富设置。

英文摘要

Large language model (LLM) agents with tool-calling capabilities are increasingly deployed in production systems, yet a fundamental reliability question remains under-explored: does the same agent behave the same way twice? We present a systematic empirical study of behavioral consistency in multi-step tool-calling agents, measuring whether agents select the same tools, in the same order, with the same arguments, across repeated identical invocations. Unlike prior work on consistency in ReAct-style agents (search-only, free-text actions), we study the richer setting of structured tool-calling interfaces with typed parameters and consequential side effects.
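One simple way to quantify "same tools, same order, same arguments" across repeated runs is a pairwise exact-match rate over canonicalized traces. The sketch below is an assumption for illustration only: the trace format, the tool names, and the functions `trace_signature` and `pairwise_agreement` are hypothetical, and the paper's actual metrics are not described in the abstract.

```python
from itertools import combinations


def trace_signature(trace):
    """Canonical form of a run: ordered (tool, sorted-arguments) pairs."""
    return tuple((call["tool"], tuple(sorted(call["args"].items()))) for call in trace)


def pairwise_agreement(traces):
    """Fraction of run pairs whose tools, order, and arguments all match (needs >= 2 runs)."""
    sigs = [trace_signature(t) for t in traces]
    pairs = list(combinations(sigs, 2))
    return sum(a == b for a, b in pairs) / len(pairs)


# Three hypothetical runs of the same prompt; the third swaps the call order.
runs = [
    [{"tool": "search", "args": {"q": "refund policy"}},
     {"tool": "send_email", "args": {"to": "a@example.com"}}],
    [{"tool": "search", "args": {"q": "refund policy"}},
     {"tool": "send_email", "args": {"to": "a@example.com"}}],
    [{"tool": "send_email", "args": {"to": "a@example.com"}},
     {"tool": "search", "args": {"q": "refund policy"}}],
]
print(pairwise_agreement(runs))  # 1 of the 3 run pairs matches exactly
```

Looser variants (for example, comparing only the set of tools used, or ignoring arguments) can be obtained by changing what `trace_signature` keeps.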

2605.28838 2026-05-29 cs.CL cs.AI 版本更新

Specialty-Specific Medical Language Model for Immune-Mediated Diseases

免疫介导疾病的专科医学语言模型

Veysel Kocaman, Gursev Pirge, Yigit Gul, Ace Vo, Zhenya Nargizyan, David Talby

发表机构 * John Snow Labs Inc.(约翰·斯诺实验室公司)

AI总结 针对免疫介导和传染病领域,通过专家标注数据集和临床领域嵌入的Transformer模型,提出专科NER模型,F1达0.89,优于基线和零样本方法。

Comments 15 pages, 5 figures. Funded in part by NIAID/NIH under contract 75N93024C00010

详情
AI中文摘要

从自由文本的医学叙述中提取详细的临床信息对研究人员和医疗系统仍然是一个实际挑战。免疫介导和传染病的术语在来源之间尤其不一致,这通常限制了通用自然语言处理(NLP)系统以足够粒度捕获相关生物医学概念的能力。我们开发了一个领域特定的命名实体识别(NER)模型,专门用于识别免疫学和传染病背景下出现的疾病相关实体。我们与两位临床专家合作,收集并手动标注了371份病例报告的数据集,定义了涵盖免疫介导和传染病状况以及相关症状和临床描述符的十二个实体类别。我们评估了几种建模策略,包括使用多种医疗特定嵌入的MedicalNER架构、基于BERT的标记分类模型以及零样本NER系统。最佳性能是通过在临床领域嵌入上训练的基于Transformer的模型获得的,其F1分数达到0.89,始终优于基线和零样本方法。专业嵌入和专家注释的结合被证明对于捕捉细微的疾病术语和改善跨异构生物医学文本的泛化特别有价值。在相同的评估协议下,提示的LLM基线实现了显著较低的性能,反映了尽管有详细提示,但在细粒度实体边界上产生跨度一致输出的困难。由此产生的模型提供了一种分析病例报告的结构化方式,并可以支持下游任务,如队列识别、疾病监测和临床决策支持。

英文摘要

Extracting detailed clinical information from free-text medical narratives remains a practical challenge for researchers and healthcare systems. Terminology for immune-mediated and infectious diseases is especially inconsistent across sources, which often limits the ability of general-purpose Natural Language Processing (NLP) systems to capture the relevant biomedical concepts with sufficient granularity. We developed a domain-specific Named Entity Recognition (NER) model tailored to identify disease-related entities occurring in immunology and infectious disease contexts. We assembled and manually annotated a dataset of 371 case reports in collaboration with two clinical specialists, defining twelve entity classes covering immune-mediated and infectious conditions as well as related symptoms and clinical descriptors. We evaluated several modeling strategies, including the MedicalNER architecture with multiple healthcare-specific embeddings, a BERT-based token classification model, and zero-shot NER systems. The strongest performance was obtained with a transformer-based model trained on clinical-domain embeddings, which reached an F1 score of 0.89, consistently outperforming baseline and zero-shot approaches. The combination of specialized embeddings and expert annotation proved particularly valuable for capturing nuanced disease terminology and improving generalization across heterogeneous biomedical text. The prompted LLM baseline achieved substantially lower performance under the same evaluation protocol, reflecting difficulties in producing span-consistent outputs for fine-grained entity boundaries despite detailed prompting. The resulting model provides a structured way to analyze case reports and can support downstream tasks such as cohort identification, disease monitoring, and clinical decision support.

2605.28837 2026-05-29 cs.CL cs.AI 版本更新

SERC: LDPC-Inspired Semantic Error Correction for Retrieval-Augmented Generation

SERC: 受LDPC启发的检索增强生成语义纠错方法

Gyumin Kim, Juhwan Park, Jaeha Kim, Seunggyun Han, Kyungrak Son, Ikbeom Jang

发表机构 * Department of Information Communications Engineering, Hankuk University of Foreign Studies, Republic of Korea(韩国外国语大学信息通信工程系) Division of Computer Engineering, Hankuk University of Foreign Studies, Republic of Korea(韩国外国语大学计算机工程系) Department of Statistics, Hankuk University of Foreign Studies, Republic of Korea(韩国外国语大学统计学系)

AI总结 针对大语言模型幻觉问题,提出受LDPC码启发的语义纠错框架SERC,通过稀疏验证策略高效检测和纠正生成文本中的错误。

Comments 15 pages, 2 figures, 6 tables. To appear in the Proceedings of the 28th International Conference on Pattern Recognition (ICPR 2026). Code available at https://github.com/labhai/SERC

详情
AI中文摘要

尽管大语言模型(LLMs)展现了卓越的能力,但其可靠性因幻觉而严重受损。现有的内在自纠正方法试图解决这一问题,但由于自我偏见(模型在没有外部验证的情况下难以识别自身输出中的错误)而常常失败。为克服这些限制,我们提出了受LDPC启发的检索增强生成语义纠错方法(SERC),为解释和缓解LLM幻觉提供了理论框架。我们将文本生成过程重新表述为语义噪声信道,将生成的响应视为噪声干扰的码字。受低密度奇偶校验(LDPC)码的启发,SERC采用稀疏验证策略:不是穷举检查所有事实,而是生成低密度验证查询,并针对外部证据进行验证,以高效检测和纠正错误。我们在LongForm Bio和TruthfulQA基准上使用Llama-3-8B和Qwen2.5-14B评估了SERC。实验结果表明,SERC优于内在自纠正方法和强检索增强基线,特别是在事实精度(FactScore)上取得了显著提升。值得注意的是,SERC使小语言模型(SLMs)在幻觉减少和信息保留方面超越了更大的基线模型。我们的发现表明,SERC提供了一种无需训练、模型无关的解决方案,与密集方法相比显著降低了验证开销,在资源受限环境中实现了成本与保真度的最优权衡。

英文摘要

While Large Language Models (LLMs) have demonstrated remarkable capabilities, their reliability is significantly compromised by hallucinations. Existing intrinsic self-correction methods attempt to address this, but often fail due to self-bias, where models struggle to identify errors in their own outputs without external verification. To overcome these limitations, we propose the LDPC-inspired semantic error correction for retrieval-augmented generation (SERC), providing a theoretical framework to interpret and mitigate LLM hallucinations. We reformulate the text generation process as a semantic noisy channel, treating generated responses as noise-corrupted codewords. Inspired by low-density parity-check (LDPC) codes, SERC employs a sparse verification strategy: instead of exhaustively checking all facts, it generates low-density verification queries and validates them against external evidence to efficiently detect and correct errors. We evaluate SERC on LongForm Bio and TruthfulQA benchmarks using Llama-3-8B and Qwen2.5-14B. Experimental results demonstrate that SERC outperforms both intrinsic self-correction methods and strong retrieval-augmented baselines, demonstrating significant gains especially in factual precision (FactScore). Notably, SERC enables small language models (SLMs) to surpass the performance of larger baselines in hallucination reduction and information preservation. Our findings demonstrate that SERC provides a training-free, model-agnostic solution that significantly reduces verification overhead compared to dense methods, achieving an optimal trade-off between cost and fidelity in resource-constrained environments.
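
The sparse-verification idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: `build_sparse_checks`, `run_checks`, `localize_errors`, the ring-shaped check layout, and the example facts are all hypothetical. Each check covers only a few facts (low density), and a fact is flagged when every check covering it fails, loosely mirroring bit-flipping decoding for LDPC codes. The toy uses as many checks as facts, so it shows only how failing checks localize an error, not the reduction in verification overhead.

```python
def build_sparse_checks(num_facts, degree=2):
    """Hypothetical layout: check i covers `degree` consecutive facts (wrapping around)."""
    return [tuple((i + k) % num_facts for k in range(degree)) for i in range(num_facts)]


def run_checks(checks, facts, evidence):
    """A check passes only if every fact it covers is supported by the evidence set."""
    return [all(facts[i] in evidence for i in check) for check in checks]


def localize_errors(checks, results, num_facts):
    """Flag facts whose covering checks ALL failed (bit-flipping-style rule)."""
    failed = [0] * num_facts
    total = [0] * num_facts
    for check, passed in zip(checks, results):
        for i in check:
            total[i] += 1
            if not passed:
                failed[i] += 1
    return [i for i in range(num_facts) if total[i] and failed[i] == total[i]]


# Illustrative data only: fact 2 is not supported by the (made-up) evidence.
facts = ["born in 1900", "born in Paris", "won prize X", "died in 1980"]
evidence = {"born in 1900", "born in Paris", "died in 1980"}
checks = build_sparse_checks(len(facts))
suspects = localize_errors(checks, run_checks(checks, facts, evidence), len(facts))
```

Here `suspects` contains only the index of the unsupported fact; its neighbours each pass one of their two checks and are therefore not flagged.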

2605.28835 2026-05-29 cs.CL cs.AI 版本更新

GenesisFunc: Multi-Agent Data Generation for Accurate and Generalizable Function-Calling

GenesisFunc: 面向准确且泛化的函数调用的多智能体数据生成

Hao-Xiang Xu, Chong Deng, Jiaqing Liu, Wen Wang, Qian Chen, Lujia Bao, Xiangang Li, Zhen-Hua Ling

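Affiliations: Tongyi Fun Team, Alibaba Group

AI summary: Proposes GenesisFunc, a multi-agent pipeline that automatically generates function-calling training data, with quality enforced by multi-stage evaluation; the fine-tuned 8B model outperforms comparable open-source models both in-domain and out-of-domain and approaches some API-based models.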

Comments Accepted by ACL 2026 Main

详情
AI中文摘要

大型语言模型(LLM)通过函数调用(FC)扩展其能力,这依赖于高质量、多样化且覆盖广泛场景的训练数据。然而,获取和标注真实的函数调用数据具有挑战性,而现有流水线生成的合成数据通常存在API不可靠、工具可扩展性有限、多样性不足和质量控制薄弱等问题。为解决这些问题,我们提出了GenesisFunc,一个用于生成FC训练数据的自动化流水线。从广泛使用的公共基准中的可靠工具出发,我们的GenesisFunc采用多智能体框架支持对话生成系统,该系统生成涵盖多种场景的对话,同时在整个过程中保持多样性和质量。通过多阶段评估系统进一步强化数据的准确性。我们在合成数据集上微调了一个8B LLM,并通过大量实验表明,它在域内FC性能和域外泛化方面优于同等规模的开源模型,同时达到了与一些最新的基于API的模型相当的FC能力。此外,我们的方法展示了在下游工具中有效扩展的强大潜力,突显了其实际应用性。

英文摘要

Large Language Models (LLMs) extend their capabilities through function-calling (FC), which relies on training data with high quality, diversity, and broad coverage of scenarios. However, obtaining and annotating real function-calling data is challenging, while synthetic data from existing pipelines often suffers from unreliable APIs, limited tool scalability, insufficient diversity, and weak quality control. To address these issues, we present GenesisFunc, an automated pipeline for generating FC training data. Starting from reliable tools in widely used public benchmarks, GenesisFunc employs a multi-agent framework to support a dialogue generation system that produces conversations spanning diverse scenarios, while maintaining both diversity and quality throughout the process. The accuracy of the data is further reinforced through a multi-stage evaluation system. We fine-tune an 8B LLM on the synthetic dataset and show through extensive experiments that it outperforms similarly sized open-source models in in-domain FC performance and out-of-domain generalization, while reaching FC capabilities comparable to some of the latest API-based models. In addition, our method demonstrates strong potential to scale effectively across downstream tools, underscoring its real-world applicability.

2605.28834 2026-05-29 cs.CL cs.AI 版本更新

Assessing Dutch Syllabification Algorithms and Improving Accuracy by Combining Phonetic and Orthographic Information through Deep Learning

评估荷兰语音节划分算法并通过深度学习结合语音和正字法信息提高准确性

Gus Lathouwers, Wieke Harmsen, Catia Cucchiarini, Helmer Strik

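Affiliations: Radboud University

AI summary: Evaluates four Dutch syllabification algorithms and proposes a deep-learning model that combines phonetic and orthographic information, reaching 99.65% word accuracy, a 0.14% improvement over the best result in the literature.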

Comments Published in CLIN Journal

详情
Journal ref
Computational Linguistics in the Netherlands Journal, Vol. 14 (2025), pp. 365 to 383
AI中文摘要

音节划分描述将单词划分为音节的任务。由于许多规则和例外,训练算法以高准确率执行音节划分仍然是一个挑战。在过去几十年中,针对荷兰语音节划分提出了不同的算法,但尚未进行全面的比较评估。此外,近年来深度学习在自然语言处理中获得了显著普及,但尚未开发出基于现代深度学习的荷兰语正字法音节划分框架。最后,语音和正字法音节划分算法已被分别研究,但未结合研究。当前研究的目标有两个:(a) 检查现有荷兰语音节划分算法的性能,(b) 研究将语音和正字法信息结合到单个模型中是否能提高音节划分性能。为了比较算法性能,将四种算法(Brandt Corstius、Liang、Trogkanis-Elkan (CRF) 和新构思的深度学习模型)应用于三个不同的数据集(词典词、借词、伪词)。这些算法在数据集上表现出不同的性能,数据驱动算法在所有条件下除一个外均优于基于知识的算法。开发的新深度学习方法相比文献中发现的最佳结果(99.65%的词准确率,提高了0.14%)带来了性能提升。对添加语音信息改善音节划分性能的单词的分析表明,这些单词中正字法歧义可以通过发音信息解决。未来研究可以考察语音信息有益于正字法处理的其他领域。此外,新开发的深度学习框架可以应用于荷兰语以外的其他语言。

英文摘要

Syllabification describes the task of dividing words into syllables. Due to many rules and exceptions, training an algorithm to perform syllabification with high accuracy remains a challenge. Throughout the last decades, different algorithms have been put forth for Dutch syllabification, yet a comprehensive comparative assessment has not been done. Additionally, deep learning has gained significant popularity within NLP in recent years, yet no modern deep-learning-based framework has been developed for Dutch orthographic syllabification. Finally, phonetic and orthographic syllabification algorithms have been examined separately, but not in combination. The aim of the current research was twofold: (a) to examine the performance of existing Dutch syllabification algorithms, and (b) to investigate whether combining phonetic and orthographic information into a single model can increase syllabification performance. To compare the performance of algorithms, four algorithms (Brandt Corstius, Liang, Trogkanis-Elkan (CRF), and a newly conceived deep-learning model) were applied to three different datasets (dictionary words, loanwords, pseudowords). The algorithms show varying performance across datasets, with the data-driven algorithms outperforming a knowledge-based algorithm in all but one condition. The newly developed deep-learning methods led to increased performance compared to the best result found in the literature (99.65% word accuracy, a 0.14% improvement). An analysis of the words for which adding phonetic information improved syllabification performance indicates that these were words in which the orthographic ambiguity could be resolved by information on pronunciation. Future research could examine other areas where phonetic information can benefit orthographic processing. In addition, the newly developed deep-learning frameworks can be applied to languages other than Dutch.

2605.28833 2026-05-29 cs.CL cs.AI 版本更新

Transcribing Children's Speech: ASR Performance and Obtaining Reliable Orthographic Transcriptions

转录儿童语音:ASR性能与获取可靠正字法转录

Gus Lathouwers, Lingyun Gao, Catia Cucchiarini, Helmer Strik

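Affiliations: Radboud University

AI summary: Evaluates three ASR model families (Whisper, Parakeet, Wav2Vec2) on Dutch child speech datasets and proposes an utterance-level selection method that automatically identifies correctly pronounced utterances with high confidence, reducing the need for manual verification.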

详情
AI中文摘要

自动语音识别(ASR)有潜力通过生成自动转录来大幅减少儿童语音研究中的手动标注工作。然而,在低资源语言中,由于缺乏针对儿童的预训练模型以及高度多样的噪声条件,获得可靠的高质量ASR转录仍然具有挑战性。本研究通过两个研究问题调查了最先进的ASR模型在儿童语音上的有效性,评估了来自三个模型家族(Whisper、Parakeet和Wav2Vec2)的九个ASR模型在两个荷兰儿童语音数据集JASMIN和DART上的表现。研究问题1考察了ASR模型应用于儿童语音的性能。微调的Whisper-medium模型取得了最佳整体性能,在JASMIN上WER为5.54%,在DART上为70.37%,表明噪声较大的DART数据明显更具挑战性。研究问题2考察了在多大程度上可以选择一个子集,使得无需人工验证即可自动获得可靠的正字法转录。我们使用一种话语级选择方法,将ASR输出与原始阅读提示进行比较,以识别正确发音的录音。使用所提出的选择方法,42.0% [对于JASMIN] 和18.1% [对于DART] 的话语可以高置信度地自动识别为正确发音,从而在话语级别上实现极低的错误率(精确度达到98.3%或更高),并减少了人工验证的需求。

英文摘要

Automatic speech recognition (ASR) has the potential to substantially reduce manual annotation effort in child speech research by generating automatic transcriptions. However, obtaining reliably high-quality ASR transcriptions for child speech remains challenging in low-resource languages due to limited child-specific pre-trained models and highly diverse noise conditions. This study investigates the effectiveness of state-of-the-art ASR models on child speech through two research questions, by evaluating nine ASR models from three model families (Whisper, Parakeet, and Wav2Vec2) on two Dutch child speech datasets, JASMIN and DART. Research question 1 examines the performance of ASR models applied to child speech. The fine-tuned Whisper-medium model achieves the best overall performance, with a WER of 5.54% on JASMIN and 70.37% on DART, showing that the noisy DART data are clearly more challenging. Research question 2 examines to what extent it is possible to select a subset for which reliable orthographic transcriptions can be obtained automatically, without the need for manual verification. We use an utterance-level selection method that compares ASR output with the original read prompt to identify correctly pronounced recordings. Using the proposed selection method, 42.0% (JASMIN) and 18.1% (DART) of the utterances can be automatically identified as correctly pronounced with high confidence, resulting in very low error rates at the utterance level (precision of 98.3% or higher) and reducing the need for manual verification.
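
The two quantities in this abstract, word error rate and prompt-based utterance selection, can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the normalization rule, the exact-match selection criterion, and the function names `normalize`, `wer`, and `select_confident` are assumptions, since the abstract does not specify them.

```python
def normalize(text):
    """Assumed normalization: lowercase, drop punctuation, split on whitespace."""
    kept = "".join(ch for ch in text.lower() if ch.isalnum() or ch.isspace())
    return kept.split()


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    prev = list(range(len(hyp) + 1))  # distances from an empty reference
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1] / max(len(ref), 1)


def select_confident(pairs):
    """Keep indices of utterances whose ASR output matches the read prompt exactly."""
    return [i for i, (prompt, asr) in enumerate(pairs) if wer(prompt, asr) == 0.0]
```

Under these assumptions, `select_confident([("De kat.", "de kat"), ("een hond", "een hand")])` keeps only the first utterance, which would then be accepted without manual verification.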

2605.28832 2026-05-29 cs.CL cs.AI 版本更新

A comparative study of transformer-based embeddings for topic coherence

基于Transformer的嵌入在主题连贯性中的比较研究

Alex Ding, Tarun Rapaka, Willy Rodriguez, Jason Yang

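Affiliations: Worcester Academy Stanford Online High School; Stanford Online High School; Lexington High School

AI summary: Systematically compares seven transformer language models of different sizes (from MiniLM to LLaMA-2) in a BERTopic pipeline and finds that model size (from 22 million to 13 billion parameters) has a negligible effect on topic coherence.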

详情
AI中文摘要

主题建模是自然语言处理的一个分支,旨在根据词共现模式将大量文本组织成连贯的组,其中潜在狄利克雷分配仍是最广泛使用和可解释的概率方法之一。自然语言处理的最新进展,特别是基于Transformer的语言模型,提供了改进的文档表示。已知模型大小(以参数数量计)对语言模型在不同预定义任务上的性能有显著影响。在本研究中,我们通过分析七种基于Transformer的语言模型(从小型模型如MiniLM到大型模型如LLaMA-2)在BERTopic流程中对多种语料库的性能,系统地考察了模型大小对主题质量的影响。主题质量使用Röder等人(2015)的连贯性和分歧度指标进行评估。我们的结果表明,模型大小从2200万到130亿参数对主题质量的影响可忽略,表明较小的模型可以达到与较大模型相当的性能。

英文摘要

Topic modeling is a branch of Natural Language Processing (NLP) that aims to organize large collections of texts into coherent groups according to word co-occurrence patterns, with Latent Dirichlet Allocation (LDA) remaining one of the most widely used and interpretable probabilistic approaches. Recent advances in NLP, particularly transformer-based language models, offer improved document representations. It is also known that the size of the model (in terms of number of parameters) has a significant impact on the performance of language models on different pre-defined tasks. In this study, we systematically examine the effect of model size on topic quality by analyzing the performance of seven transformer-based language models (from small models such as MiniLM to large ones such as LLaMA-2) in a BERTopic pipeline on a variety of corpora. Topic quality is evaluated using coherence and divergence metrics following Röder et al. (2015). Our results indicate that model size, ranging from 22 million to 13 billion parameters, has a negligible impact on topic quality, suggesting that smaller models can achieve comparable performance to larger models.

2605.28830 2026-05-29 cs.CL cs.AI cs.SE 版本更新

Benchmarking Open-Source Safety Guard Models: A Comprehensive Evaluation

开源安全防护模型基准测试:全面评估

Reetu Raj Harsh, Bhaskarjit Sarmah, Stefano Pasquali

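Affiliations: Domyn

AI summary: Comprehensively evaluates 14 open-source safety guard models across 8 NIST AI Risk Framework safety categories, finding that recall is the critical metric and that model size does not correlate with safety detection performance.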

详情
AI中文摘要

随着大型语言模型(LLMs)越来越多地部署在安全关键型应用中,稳健的内容审核变得至关重要。我们对14个开源安全防护模型进行了全面评估,使用了包含79,331个样本的精选基准,涵盖8个NIST AI风险框架安全类别。我们的基准聚合了四个不同的数据集(HarmBench、StrongREJECT、RealToxicityPrompts和BeaverTails),并经过筛选,仅关注安全相关内容(暴力、仇恨言论、骚扰、色情内容、自杀/自残、亵渎、威胁和健康虚假信息)。我们发现召回率是安全应用的关键指标,因为遗漏有害内容比误报构成更大风险。我们的评估揭示了令人惊讶的结果:Qwen Guard(4B参数)实现了最高的召回率(83.97%),而较大的模型如Llama Guard(12B)和GPT-OSS Safeguard(20B)表现出保守行为,遗漏了高达75%的不安全内容。我们证明了模型大小与安全检测性能不相关,并且通用防护模型优于专用模型。这些发现为在生产部署中选择安全防护模型提供了实用指导。

英文摘要

As Large Language Models (LLMs) are increasingly deployed in safety-critical applications, robust content moderation becomes essential. We present a comprehensive evaluation of 14 open-source safety guard models on a curated benchmark of 79,331 samples spanning 8 NIST AI Risk Framework safety categories. Our benchmark aggregates four diverse datasets (HarmBench, StrongREJECT, RealToxicityPrompts, and BeaverTails), filtered to focus exclusively on safety-relevant content (violence, hate speech, harassment, sexual content, suicide/self-harm, profanity, threats, and health misinformation). We find that recall is the critical metric for safety applications, as missing harmful content poses greater risk than false positives. Our evaluation reveals surprising results: Qwen Guard (4B parameters) achieves the highest recall (83.97%) while larger models like Llama Guard (12B) and GPT-OSS Safeguard (20B) exhibit conservative behavior, missing up to 75% of unsafe content. We demonstrate that model size does not correlate with safety detection performance and that general-purpose guard models outperform specialized ones. These findings provide practical guidance for selecting safety guard models in production deployments.
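
The emphasis on recall can be made concrete with a small sketch. The labels below are made up for illustration and are not drawn from the benchmark; `precision_recall` is a hypothetical helper. The example shows how a conservative guard that flags very little can have perfect precision while missing most unsafe content.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for binary labels (1 = unsafe, 0 = safe)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Illustrative data: four unsafe and four safe samples.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
conservative_guard = [1, 0, 0, 0, 0, 0, 0, 0]  # flags only one unsafe sample
precision, recall = precision_recall(y_true, conservative_guard)
```

In this toy case precision is 1.0 but recall is 0.25, meaning 75% of the unsafe samples pass through unflagged.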

2605.28828 2026-05-29 cs.CL cs.AI 版本更新

Micro-Macro Retrieval: Reducing Long-Form Hallucination in Large Language Models

微宏检索:减少大语言模型中的长文本幻觉

Yujie Feng, Jian Li, Zhihan Zhou, Pengfei Xu, Yujia Zhang, Xiaoyu Li, Xiaohui Zhou, Alan Zhao, Xi Chen, Xiao-Ming Wu

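Affiliations: Solar System of OVB, Tencent, China; The Hong Kong Polytechnic University, Hong Kong S.A.R.; Jilin University, China

AI summary: Proposes the Micro-Macro Retrieval (M2R) framework, which retrieves coarse-grained external evidence at the macro level and draws on a key information repository built during reasoning at the micro level, addressing hallucinations in long-form generation caused by key information sitting too far from the output.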

详情
AI中文摘要

大型语言模型(LLMs)在许多任务上表现出色,但容易产生幻觉,尤其是在长文本生成中,冗余的检索上下文和冗长的推理链会放大事实错误。最近的研究强调了一个关键现象:关键信息越接近模型输出,事实准确性越高。然而,现有的检索增强语言模型(RALMs)缺乏确保这种接近性的有效机制——外部证据通过多轮检索注入推理,但无法确保关键信息靠近输出。我们提出微宏检索(M2R),一种新颖的边检索边生成框架,以填补这一空白。在宏观层面,M2R从外部来源检索粗粒度证据;在微观层面,它从推理过程中构建的关键信息库中提取必要结果,并在生成答案时重用它们。这种设计直接解决了关键信息到输出的接近性瓶颈,有效减少了长文本任务中的幻觉。M2R使用基于课程学习的强化学习策略进行训练,并采用定制的基于规则的奖励,从而稳定地获得检索和接地技能。跨不同基准的大量实验证明了M2R的有效性,尤其是在长上下文设置中。

英文摘要

Large Language Models (LLMs) achieve impressive performance across many tasks but remain prone to hallucination, especially in long-form generation where redundant retrieved contexts and lengthy reasoning chains amplify factual errors. Recent studies highlight a critical phenomenon: the closer key information appears to the model outputs, the higher the factual accuracy. However, existing retrieval-augmented language models (RALMs) lack effective mechanisms to ensure this proximity: external evidence is injected into reasoning via multi-turn retrieval, but this cannot ensure that key information stays close to the outputs. We propose Micro-Macro Retrieval (M2R), a novel retrieve-while-generate framework to fill this gap. At the macro level, M2R retrieves coarse-grained evidence from external sources; at the micro level, it extracts essential results from a key information repository built during reasoning and reuses them while generating answers. This design directly addresses the key-information-to-output proximity bottleneck, effectively reducing hallucination in long-form tasks. M2R is trained with a curriculum learning-based reinforcement learning strategy using customized rule-based rewards, enabling stable acquisition of retrieval and grounding skills. Extensive experiments across different benchmarks demonstrate the effectiveness of M2R, especially in lengthy-context settings.

2605.28293 2026-05-29 cs.LG cs.AI 版本更新

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

ProRL: 通过修正策略梯度估计实现主动推荐的有效强化学习

Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang

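Affiliations: School of Data Science, Fudan University, Shanghai, China

AI summary: To address length-dependent bias and high variance in policy gradient estimation for proactive recommender systems, proposes the ProRL framework, which rectifies the gradient through two mechanisms, stepwise reward centering and position-specific advantage estimation, and significantly improves recommendation performance.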

Comments Accepted in ICML 2026

详情
AI中文摘要

主动推荐系统(PRS)旨在通过生成中间推荐路径来引导用户偏好向目标物品转移。强化学习(RL)为优化此类序列决策任务提供了原则性框架,因为路径奖励可以自然地捕捉短期接受度和长期引导有效性。然而,将策略梯度直接应用于PRS会导致梯度估计存在缺陷。我们识别出两个缺陷:(1)路径级奖励分解为具有正均值的步骤级奖励,产生长度依赖偏差,导致梯度倾向于路径扩展而非有意义的探索;(2)用整个路径级奖励加权每个步骤忽略了分解结构,导致高梯度方差。为修正这两个缺陷,我们提出了一种有效的RL框架ProRL,其中包含两种用于主动推荐的新机制。首先,逐步奖励中心化减去期望奖励以消除长度依赖偏差,确保路径扩展产生零期望梯度信号。其次,位置特定优势估计利用奖励分解结构计算步骤相关的基线,降低梯度方差。这些机制共同产生精确针对路径质量的策略梯度。我们在三个真实世界数据集上的实验表明,ProRL显著优于最先进的PRS。我们的代码可在https://github.com/hongruhou89/ProRL获取。

英文摘要

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
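
The two deficiencies named in this abstract can be illustrated numerically. The sketch below is a toy under stated assumptions, not the ProRL estimator: it assumes the expected step reward is known, uses the batch mean at each position as the step-dependent baseline, and the names `centered_rewards`, `position_baselines`, and `position_advantages` are hypothetical.

```python
def centered_rewards(paths, mean_step_reward):
    """Subtract the (assumed known) expected step reward from every step."""
    return [[r - mean_step_reward for r in path] for path in paths]


def position_baselines(paths):
    """Baseline for position t: mean reward at t over the paths that reach t."""
    longest = max(len(path) for path in paths)
    baselines = []
    for t in range(longest):
        rewards_at_t = [path[t] for path in paths if len(path) > t]
        baselines.append(sum(rewards_at_t) / len(rewards_at_t))
    return baselines


def position_advantages(paths):
    """Per-step advantage: step reward minus the baseline for its position."""
    baselines = position_baselines(paths)
    return [[r - baselines[t] for t, r in enumerate(path)] for path in paths]


# Two paths of equal per-step quality but different length (illustrative values).
short_path, long_path = [0.5, 0.5], [0.5, 0.5, 0.5, 0.5]
raw_totals = [sum(short_path), sum(long_path)]
centered_totals = [sum(p) for p in centered_rewards([short_path, long_path], 0.5)]
```

With a positive mean step reward, the raw path totals (1.0 versus 2.0) favor the longer path even though each step is equally good; after centering, both totals are zero, so merely extending a path no longer produces a learning signal.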

2605.27995 2026-05-29 cs.AI 版本更新

AsyncTool: Evaluating the Asynchronous Function Calling Capability under Multi-Task Scenarios

AsyncTool: 多任务场景下异步函数调用能力的评估

Kou Shi, Ziao Zhang, Shiting Huang, Avery Nie, Zhen Fang, Qiuchen Wang, Lin Chen, Huaian Chen, Zehui Chen, Feng Zhao

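Affiliations: University of Science and Technology of China; University of Toronto

AI summary: Proposes the AsyncTool benchmark, which uses a multi-task environment with simulated tool response latency to evaluate the task coordination and efficiency of LLM-based agents in asynchronous tool calling.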

Comments https://github.com/StoKou/repo-asynctool

详情
AI中文摘要

基于大语言模型的智能体在使用外部工具解决复杂任务方面展现出强大能力。然而,现有评估往往忽视了工具使用的时间维度,尤其是工具响应延迟的影响,并且通常局限于单任务场景。在实际应用中,多个任务通常需要并发执行,整体效率取决于智能体是否能在等待工具响应时利用空闲时间。我们将这种能力称为异步工具调用。为评估该能力,我们提出了AsyncTool,一个用于评估基于大语言模型的智能体在具有延迟工具反馈的交互式多任务工具使用环境中的基准。AsyncTool同时呈现多个异构任务,并在执行过程中模拟真实的工具响应延迟。利用混合数据演化策略,我们构建了一个覆盖多种场景和工具使用模式的多样化异步多任务数据集。我们在步骤、子任务和任务级别评估模型,并引入面向效率的指标来衡量任务协调和完成效率。大量实验表明,延迟的工具反馈对当前智能体构成了重大挑战,并导致明显的性能下降。能够更好地协调任务切换、依赖跟踪和状态维护的模型在AsyncTool上取得了更强的性能。我们的分析识别了当前工具使用智能体的关键失败模式,并为设计未来具有更强时间推理和协调能力的系统提供了实用见解。

英文摘要

Large language model (LLM)-based agents have shown strong capabilities in using external tools to solve complex tasks. However, existing evaluations often overlook the temporal dimension of tool use, especially the impact of tool response latency, and are usually limited to single-task settings. In real-world applications, multiple tasks often need to be executed concurrently, and overall efficiency depends on whether an agent can use idle time while waiting for tool responses. We refer to this capability as asynchronous tool calling. To evaluate it, we propose AsyncTool, a benchmark for assessing LLM-based agents in interactive multi-task tool-use environments with delayed tool feedback. AsyncTool presents multiple heterogeneous tasks simultaneously and simulates realistic tool response latency during execution. Using a hybrid data evolution strategy, we construct a diverse asynchronous multitasking dataset that covers multiple scenarios and tool-use patterns. We evaluate models at the step, sub-task, and task levels, and introduce efficiency-oriented metrics to measure task coordination and completion efficiency. Extensive experiments show that delayed tool feedback poses substantial challenges to current agents and leads to clear performance degradation. Models that better coordinate task switching, dependency tracking, and state maintenance achieve stronger performance on AsyncTool. Our analysis identifies key failure modes of current tool-using agents and provides practical insights for designing future systems with stronger temporal reasoning and coordination capabilities.
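
What "using idle time while waiting for tool responses" means in practice can be sketched with Python's `asyncio`. This is an illustration of the general idea only, not the benchmark's harness: `call_tool` is a hypothetical stand-in that simulates latency with `asyncio.sleep`, and the tool names and delays are made up.

```python
import asyncio
import time


async def call_tool(name, latency_s):
    """Simulated tool call: waits `latency_s` seconds, then returns a result string."""
    await asyncio.sleep(latency_s)
    return f"{name}:done"


async def run_sequential(calls):
    """Issue each call only after the previous one has returned."""
    return [await call_tool(name, latency) for name, latency in calls]


async def run_concurrent(calls):
    """Issue all calls at once so their waiting periods overlap."""
    return await asyncio.gather(*(call_tool(name, latency) for name, latency in calls))


def timed(runner, calls):
    """Run one strategy to completion and return (results, elapsed seconds)."""
    start = time.perf_counter()
    results = asyncio.run(runner(calls))
    return list(results), time.perf_counter() - start
```

For two independent calls with 0.1 s latency each, `timed(run_sequential, calls)` takes roughly the sum of the latencies, while `timed(run_concurrent, calls)` takes roughly the longest single latency and returns the same results in the same order.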

2605.27959 2026-05-29 cs.CV cs.AI 版本更新

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

ROVER: 面向对象中心视觉证据的路由用于基于多图像推理

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

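Affiliations: Kuaishou Technology

AI summary: Proposes ROVER, a lightweight learnable plugin that aggregates context, distills intra-image cues via object-centric differential attention, and routes history-aware evidence, enabling efficient global visual evidence routing and improving answer and grounding accuracy in multi-image reasoning.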

详情
AI中文摘要

多模态大语言模型(MLLMs)越来越多地定位和交错视觉证据以进行审慎推理。基于定位的方法通常通过将裁剪的图像块或感兴趣区域(RoI)特定特征注入推理上下文来关注RoI。然而,这种设计可能削弱整体场景理解和对象间关系,同时导致解码成本随RoI数量和大小增加而增加。或者,自适应视觉特征选择通常需要细粒度监督或复杂启发式方法。为解决这些限制,我们提出ROVER(面向对象中心视觉证据的路由用于基于多图像推理),一种轻量级、可学习的插件,用于高效的全局视觉证据路由。在每次对象定位预测时,ROVER注入一个步骤特定的令牌三元组,以协同地:(i) 聚合正在进行的推理上下文,(ii) 通过对象中心差分注意力将图像内线索蒸馏到视觉工作空间中,以及(iii) 在该空间内跨对象和图像路由并整合历史感知证据以供后续推理。我们将ROVER集成到Qwen2.5-VL-7B中,并开发了一个交错的SFT到GRPO训练流程。严格遵循原始数据集和评估协议,我们的方法在MM-GCoT(+4.8%答案准确率,+14.6%定位准确率)和VideoEspresso(+8.6%答案准确率)上取得了最佳性能。在VideoEspresso上训练的模型表现出强大的迁移能力,在多个基准测试上平均比基础模型高出+4.7%。

英文摘要

Multimodal Large Language Models (MLLMs) have increasingly localized and interleaved visual evidence for deliberative reasoning. Grounding-based approaches typically focus on regions of interest (RoIs) by injecting cropped image patches or RoI-specific features into the reasoning context. However, such designs can weaken holistic scene understanding and inter-object relations, while incurring decoding costs that scale with the number and size of RoIs. Alternatively, adaptive visual feature selection often requires fine-grained supervision or complex heuristics. To address these limitations, we propose ROVER (Routing Object-centric Visual Evidence for grounded multi-image Reasoning), a lightweight, learnable plugin for efficient global visual evidence routing. Upon each object grounding prediction, ROVER injects a step-specific token triplet to synergistically: (i) aggregate the ongoing reasoning context, (ii) distill intra-image cues into a visual working space via object-centric differential attention, and (iii) route and integrate history-aware evidence across objects and images within this space for subsequent reasoning. We integrate ROVER into Qwen2.5-VL-7B and develop an interleaved SFT-to-GRPO training pipeline. Strictly adhering to the original datasets and evaluation protocols, our method achieves the best performance on MM-GCoT (+4.8% answer accuracy, +14.6% grounding accuracy) and VideoEspresso (+8.6% answer accuracy). The VideoEspresso-trained model demonstrates strong transferability, outperforming the base model by +4.7% on average across diverse benchmarks.

2605.27480 2026-05-29 q-bio.OT cs.AI cs.CY 版本更新

BIRDS: Characterizing and Understanding Biodiversity Impact of Large Language Model Serving

BIRDS:表征与理解大语言模型服务对生物多样性的影响

Tianyao Shi, Yi Ding

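Affiliations: Purdue University

AI summary: Proposes the BIRDS framework, which defines request-level functional units, quantifies operational and embodied biodiversity impact, and introduces Quality-Normalized Biodiversity Impact (QNBI), revealing the cumulative ecosystem impact of large-scale LLM serving and quality-aware serving tradeoffs.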

Comments 21 pages, 27 figures, 9 tables

详情
AI中文摘要

大语言模型(LLM)服务产生的环境影响不仅限于碳和水,还包括通过生物多样性相关途径造成的生态系统破坏。我们提出了BIRDS,一个用于请求驱动型LLM服务的生物多样性影响框架。BIRDS定义了请求级功能单元,量化了运营和隐含的生物多样性影响,并引入了质量归一化生物多样性影响(QNBI)来联合分析生态影响和响应质量。在不同的工作负载、模型、GPU和区域中,BIRDS揭示了生物多样性影响在大规模下累积,并暴露了可操作的质量感知服务权衡。

英文摘要

Large language model (LLM) serving creates environmental impacts beyond carbon and water, including ecosystem damage through biodiversity-related pathways. We present BIRDS, a framework for Biodiversity Impact of Request-Driven LLM Serving. BIRDS defines request-level functional units, quantifies operational and embodied biodiversity impact, and introduces Quality-Normalized Biodiversity Impact (QNBI) to jointly analyze ecological impact and response quality. Across diverse workloads, models, GPUs, and regions, BIRDS reveals that biodiversity impact accumulates at scale and exposes actionable quality-aware serving tradeoffs.

2605.27390 2026-05-29 cs.CL cs.AI 版本更新

EvoSpec: Evolving Speculative Decoding via Real-Time Vocabulary and Parameter Adaptation

EvoSpec: 通过实时词汇和参数自适应进化推测解码

Shuyu Zhang, Lingfeng Pan, Qicheng Wang, Yaqi Shi, Yueyang Tan, Ruyu Yan, Jiaqi Chen, Lixing Du, Lu Wang

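Affiliations: School of Computer Science and Technology

AI summary: Proposes the EvoSpec framework, which evolves the draft model in speculative decoding in real time through dynamic vocabulary and parameter adaptation, addressing the sharp drop in acceptance rate that static methods suffer in specialized domains and topic-switching scenarios; on EAGLE-3 it achieves a 1.13x speedup over FR-Spec with 27% lower memory overhead than standard online adaptation.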

详情
AI中文摘要

推测解码通过草稿-验证范式加速大型语言模型推理,但随着词汇表规模扩大,输出投影层成为瓶颈。现有的静态剪枝方法虽有效降低开销,但由于无法捕捉动态分布变化,在专业领域或主题切换场景中接受率骤降。为解决此问题,我们提出EvoSpec框架,通过动态词汇和参数自适应实现草稿模型的实时进化。与静态或纯检索方法不同,EvoSpec采用上下文感知机制,通过高效的语义和统计索引检索关键长尾词。此外,我们提出一种轻量级在线对齐策略,利用课程学习持续最小化草稿模型与目标模型之间的分布差距。在专业领域(编码、法律和医学)的广泛评估证实,EvoSpec克服了静态基线的局限性。在EAGLE-3上,它相比最先进的静态基线FR-Spec实现1.13倍加速,且内存开销比标准在线自适应低27%。

英文摘要

Speculative decoding accelerates Large Language Model inference via a draft-then-verify paradigm, yet the output projection layer becomes a bottleneck as vocabulary sizes scale. While existing static pruning methods effectively reduce this overhead, they suffer from precipitous drops in acceptance rate in specialized domains or topic-switching scenarios due to their inability to capture dynamic distribution shifts. To address this, we introduce EvoSpec, a framework that enables real-time evolution of the draft model through dynamic vocabulary and parameter adaptation. Unlike static or purely retrieval-based approaches, EvoSpec employs a context-aware mechanism that retrieves critical long-tail tokens via efficient semantic and statistical indexing. Furthermore, we propose a lightweight online alignment strategy utilizing curriculum learning to continually minimize the distributional gap between the draft and target models. Extensive evaluations across specialized domains (coding, law, and medicine) confirm that EvoSpec overcomes the limitations of static baselines. On EAGLE-3, it achieves a 1.13x speedup in these settings over the state-of-the-art static baseline FR-Spec, with 27% lower memory overhead than standard online adaptation.

2605.27387 2026-05-29 cs.CL cs.AI 版本更新

From AR to Diffusion: Efficiently Adapting Large Language Models with Strictly Causal and Elastic Horizons

从自回归到扩散:利用严格因果与弹性视野高效适配大型语言模型

Xiangyu Ma, Teng Xiao, Zuchao Li, Lefei Zhang

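Affiliations: School of Artificial Intelligence, Wuhan University; School of Computer Science, Wuhan University

AI summary: Proposes the FLUID framework, which uses strictly causal alignment and an elastic-horizon mechanism to efficiently adapt autoregressive models into diffusion models, enabling parallel text generation at greatly reduced training cost.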
Comments Accepted by ACL 2026

详情
AI中文摘要

扩散模型有望实现高效的并行文本生成,但其依赖双向注意力机制,与预训练的自回归(AR)模型存在结构不匹配。这种不兼容性阻碍了稳健AR先验的复用,需要从头开始进行代价高昂的预训练。为弥合这一差距,我们提出FLUID框架,该框架高效地将AR骨干网络适配到扩散范式。通过强制执行严格因果对齐,FLUID能够从标准GPT风格检查点无缝初始化,避免了大规模预训练。此外,我们引入弹性视野,这是一种基于局部信息密度而非固定调度动态调节去噪步长的熵驱动机制。实验表明,FLUID在将训练成本降低数个数量级的同时实现了最先进的性能,有效调和了成熟的AR基础与高效的并行生成。我们的代码可在https://github.com/Oli-lab-nun/FLUID/tree/main获取。

英文摘要

Diffusion models promise efficient parallel text generation but rely on bidirectional attention, creating a structural mismatch with pre-trained Autoregressive (AR) models. This incompatibility precludes reusing robust AR priors, necessitating prohibitive pre-training from scratch. To bridge this gap, we propose FLUID, a framework that efficiently adapts AR backbones to the diffusion paradigm. By enforcing Strictly Causal Alignment, FLUID enables seamless initialization from standard GPT-style checkpoints, circumventing the need for massive pre-training. Furthermore, we introduce Elastic Horizons, an entropy-driven mechanism that dynamically modulates denoising strides based on local information density rather than fixed schedules. Experiments demonstrate that FLUID achieves state-of-the-art performance while reducing training costs by orders of magnitude, effectively reconciling established AR foundations with efficient parallel generation. Our code is available at https://github.com/Oli-lab-nun/FLUID/tree/main.

2605.27382 2026-05-29 cs.HC cs.AI cs.CL 版本更新

The Alignment Floor: How Persona Customization Breaks Safety in Weakly-Aligned LLMs

对齐下限:角色定制如何破坏弱对齐大语言模型的安全性

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

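Affiliations: AWS Generative AI Innovation Center; HSBC Holdings Plc., HSBC Technology Center, China

AI summary: Compares how the sycophancy rates of strongly and weakly aligned models change across persona conditions, and defines the alignment floor Δ_floor as an audit metric for the safety of persona customization.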

详情
AI中文摘要

告诉LLM“要热情”会使轻对齐模型的谄媚率从30%上升到50%,但对强对齐模型没有影响。我们将这一差距定义为对齐下限Δ_floor(m)=max_pS(m,p)-min_pS(m,p),即模型在不同角色条件下产生的谄媚率范围,并将谄媚视为角色条件属性而非固定模型属性。多元AI依赖于通过角色提示(如“要有创造力”或“要彻底”)进行行为适应,使系统能够尊重不同的用户价值观和沟通风格;安全问题在于给定模型在真实性改变之前能吸收多少定制化。我们进行了一项受控案例研究,对比了强对齐的RLHF+宪法AI模型(Claude Sonnet 4.6)与轻对齐模型(Amazon Nova Lite),涵盖7种角色条件和5个任务,共1800次运行。存在性结果促使进行逐模型审计:至少有一个强对齐模型的Δ_floor=5个百分点(在15%控制率的5个百分点内),至少有一个轻对齐模型的Δ_floor=45个百分点(范围5%-50%)。在轻对齐模型上,所有五种大五人格角色都增加了谄媚率,且反直觉的是,宜人性产生的增幅最小而非最大。研究中最大的单一效果是建设性的:怀疑论者角色使轻对齐模型的谄媚率降低了25个百分点,并且是唯一指示抵制用户主张而非与之互动的角色,这暗示了方向性解释。角色效果的跨模型迁移几乎为零,因此角色-对齐测试必须逐模型进行。我们提出Δ_floor作为部署时的审计指标:在部署角色定制之前,在小规模角色面板上测量该指标。

英文摘要

Telling an LLM to "be enthusiastic" raises its sycophancy rate from 30% to 50% on a lightly-aligned model, but has zero effect on a strongly-aligned one. We define this gap as the alignment floor, $\Delta_{\text{floor}}(m)=\max_p S(m,p)-\min_p S(m,p)$, the range of sycophancy rates a model produces across persona conditions, and treat sycophancy as a persona-conditional property rather than a fixed model property. Pluralistic AI relies on behavioral adaptation via persona prompts like "be creative" or "be thorough", which let systems respect diverse user values and communication styles; the safety question is how much customization a given model can absorb before its truthfulness shifts. We present a controlled case study contrasting a strongly-aligned RLHF + Constitutional-AI model (Claude Sonnet 4.6) with a more lightly-aligned model (Amazon Nova Lite), spanning seven persona conditions and five tasks for 1800 total runs. An existence-pair result motivates per-model auditing: there is at least one strongly-aligned model with $\Delta_{\text{floor}}=5$pp (within 5pp of the 15% control rate) and at least one lightly-aligned model with $\Delta_{\text{floor}}=45$pp (5%-50% range). On the lightly-aligned model, all five Big Five personas increase sycophancy over control, and counterintuitively Agreeableness produces the smallest increase, not the largest. The single largest effect in the study is constructive: a Skeptic persona reduces sycophancy by 25pp on the lightly-aligned model, and is the only persona that instructs resistance against user claims rather than engagement with them, suggesting a directionality account. Cross-model transfer of persona effects is near-zero, so persona-alignment testing must be per-model. We propose $\Delta_{\text{floor}}$ as a deployment-time audit metric: measure it on a small persona panel before deploying persona customization.
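
The audit metric defined in this abstract is a simple range, which a few lines of code make explicit. The sketch below is illustrative: `alignment_floor` is a hypothetical helper, and the three persona rates are values read off the abstract for the lightly-aligned model (30% control, 50% for the enthusiastic persona, and 5% for the Skeptic persona, taken as 25pp below control); the remaining personas in the study are omitted.

```python
def alignment_floor(sycophancy_by_persona):
    """Delta_floor(m) = max_p S(m, p) - min_p S(m, p), in percentage points."""
    rates = list(sycophancy_by_persona.values())
    return max(rates) - min(rates)


# Sycophancy rates in percent, inferred from the abstract (subset of personas).
lightly_aligned = {"control": 30, "enthusiastic": 50, "skeptic": 5}
floor_pp = alignment_floor(lightly_aligned)
```

For these values `floor_pp` is 45, matching the 45pp range the abstract reports for the lightly-aligned model. In a deployment audit the dictionary would instead hold rates measured on the persona panel that is about to be shipped.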

2605.27379 2026-05-29 cs.AI cs.CL 版本更新

Soro: A Lightweight Foundation Model and Chatbot for Tajik

Soro: 一种轻量级塔吉克语基础模型与聊天机器人

Stanislav Liashkov, Haitz Sáez de Ocáriz Borde, Azizjon Azimi, Khushbakht Shoymardonov, Shuhratjon Khalilbekov, Bonu Boboeva

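AI summary: For compute- and connectivity-constrained settings in Tajikistan, proposes Soro, a Tajik-specialized conversational LLM based on Gemma 3; through continual pretraining and supervised fine-tuning it substantially outperforms same-size baselines on Tajik benchmarks and supports quantized deployment.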

详情
AI中文摘要

我们提出了Soro,一个塔吉克语专用对话大语言模型(LLM)家族,专为在塔吉克斯坦计算和连接受限条件下的实际部署而设计。从开放权重的Gemma 3检查点开始,我们在一个精心策划的19亿词元语料库上进行了仅塔吉克语的持续预训练,该语料库涵盖过滤后的网络文本、PDF文档和符合课程的教育材料,随后在4万个塔吉克语教师风格示例上进行监督指令微调。为了在标准基准测试中塔吉克语覆盖有限的情况下实现严格评估,我们引入了一套塔吉克语基准测试,涵盖常识、语言能力以及中学和大学入学考试领域,并在Hugging Face上开源。在这些塔吉克语基准测试中,Soro显著优于同尺寸的Gemma 3基线,同时在标准数据集上保持了较强的英语性能。我们进一步表明,Soro的FP8和INT4量化保留了大部分塔吉克语增益,同时降低了边缘部署的内存需求,支持正在进行的教育领域试点和计划在塔吉克斯坦学校中的扩展。

英文摘要

We present Soro, a family of Tajik-specialized conversational large language models (LLMs) designed for real-world deployment under tight compute and connectivity constraints in Tajikistan. Starting from open-weight Gemma 3 checkpoints, we perform Tajik-only continual pretraining on a curated 1.9-billion-token corpus spanning filtered web text, PDF documents, and curriculum-aligned educational materials, followed by supervised instruction tuning on 40K Tajik teacher-style examples. To enable rigorous evaluation despite the limited coverage of Tajik in standard benchmarks, we introduce a suite of Tajik benchmarks covering general knowledge, linguistic competence, and school- and university entrance-exam domains, and we open-source them on Hugging Face. Across these Tajik benchmarks, Soro substantially outperforms same-size Gemma 3 baselines while retaining strong English performance on standard datasets. We further show that FP8 and INT4 quantization of Soro preserves most Tajik-language gains while reducing memory requirements for edge deployment, supporting an ongoing education-sector pilot and planned scale-out across schools in Tajikistan.

2605.27377 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Enhancing LLM Medical Coding with Structured External Knowledge

利用结构化外部知识增强LLM医学编码

Yidong Gan, David D. Nguyen, Yang Lin, Peter Zhong, Thanh Vu, Long Duong, Yuan-Fang Li

Affiliations * Oracle Health and AI

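AI summary Proposes RAG-Coding, a training-free method that strengthens LLM medical coding by encoding the ICD tabular list as a knowledge graph and distilling the guidelines into code-specific summaries; it outperforms LLM-based baselines on MDACE and all baselines on MDACE-2025.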

详情
AI中文摘要

准确的医学编码需要查阅权威资源,如ICD表格列表和编码指南。现有的基于LLM的自动化方法主要依赖LLM的内部知识,容易产生幻觉且无法跟上指南更新。我们引入了RAG-Coding,一种无需训练的智能体方法,通过结构化外部知识增强LLM:将表格列表编码为知识图谱,捕获层次化和指令性的代码关系;将指南提炼为简洁、代码特定的摘要,而非检索原始文本。为支持我们的研究,我们还引入了MDACE-2025,即根据2025年ICD-10-CM/PCS指南对MDACE数据集进行的专家重新标注,增加了代码排序和理由注释。在MDACE上,RAG-Coding在五个LLM骨干网络上以micro-F1指标超越最佳基于LLM的基线3-13%,并与监督式最先进方法达到相当的micro-和macro-F1,以更高的召回率(+11%)为代价,精确率降低(-6%)。在MDACE-2025上,RAG-Coding超越所有基线,展示了对更新指南的有效泛化。消融实验确认了逐步提升,强调了整合结构化外部知识对基于LLM的医学编码的重要性。

英文摘要

Accurate medical coding requires consulting authoritative resources such as the ICD tabular list and coding guidelines. Existing LLM-based automated methods largely rely on LLMs' internal knowledge, which is prone to hallucination and cannot keep pace with guideline updates. We introduce RAG-Coding, an agentic, training-free method that augments LLMs with structured external knowledge: the tabular list is encoded as a knowledge graph capturing hierarchical and instructional code relationships, and the guidelines are distilled into concise, code-specific summaries rather than retrieved as raw text. To enable our study, we also introduce MDACE-2025, expert re-annotations of the MDACE dataset under the 2025 ICD-10-CM/PCS guidelines, adding code sequencing and justification comments. On MDACE, RAG-Coding outperforms the best LLM-based baseline by 3--13\% in micro-F1 across five LLM backbones, and achieves comparable micro- and macro-F1 to the supervised state-of-the-art, with higher recall ($+$11\%) at the cost of precision ($-$6\%). On MDACE-2025, RAG-Coding outperforms all baselines, demonstrating effective generalisation to updated guidelines. Ablations confirm stepwise gains, highlighting the importance of integrating structured external knowledge for LLM-based medical coding.
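The abstract reports both micro- and macro-F1, which can diverge sharply when code frequencies are skewed. The sketch below shows one common way to compute the two over per-document code sets; it is not the paper's evaluation script, and the function names and the three ICD-10-CM codes are illustrative assumptions.

```python
def f1_score(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(gold, pred):
    """gold, pred: lists of sets of codes, one set per document."""
    codes = set().union(*gold, *pred)
    counts = {code: [0, 0, 0] for code in codes}  # [tp, fp, fn] per code
    for gold_codes, pred_codes in zip(gold, pred):
        for code in gold_codes & pred_codes:
            counts[code][0] += 1
        for code in pred_codes - gold_codes:
            counts[code][1] += 1
        for code in gold_codes - pred_codes:
            counts[code][2] += 1
    # Micro: pool counts over all codes, then compute one F1.
    totals = [sum(c[i] for c in counts.values()) for i in range(3)]
    micro = f1_score(*totals)
    # Macro: compute F1 per code, then take the unweighted mean.
    macro = sum(f1_score(*c) for c in counts.values()) / len(counts) if counts else 0.0
    return micro, macro


# Two toy documents; the codes are examples only.
gold = [{"I10", "E11.9"}, {"I10"}]
pred = [{"I10"}, {"I10", "J45.909"}]
micro, macro = micro_macro_f1(gold, pred)
print(f"micro-F1={micro:.3f} macro-F1={macro:.3f}")
```

Here the frequent code is always right and the two rare codes are always wrong, so micro-F1 stays at 2/3 while macro-F1 drops to 1/3.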

2605.27276 2026-05-29 cs.AI cs.CL 版本更新

SIA: Self Improving AI with Harness & Weight Updates

SIA: 具有框架与权重更新的自我改进AI

Prannay Hebbar, Yogendra Manawat, Samuel Verboomen, Alesia Ivanova, Selvam Palanimalai, Kunal Bhatia, Vignesh Baskaran

Affiliations * Hexo Labs; University of Oxford

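AI summary Proposes SIA, a self-improving loop in which a Feedback-Agent updates both the harness and the weights of a task-specific agent; across three domains (Chinese legal charge classification, GPU kernel optimisation, and single-cell RNA denoising) it outperforms harness-only iteration.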

详情
AI中文摘要

人类是构建和改进AI的瓶颈。无论是模型还是封装它们的智能体,都是由人类编写、调整和纠正的。一个能够自我改进的AI的长期目标仍然未实现。两条大致独立的研究路线试图解决这一瓶颈。框架更新学派让元智能体重写任务特定智能体的框架(其工具、提示、重试逻辑和搜索过程),而模型权重保持不变。测试时训练学派使用手写的强化学习流程,根据任务反馈更新模型自身的权重,而框架保持不变。这两个孤岛独立运作。我们提出SIA,一个自我改进循环,其中语言模型智能体(反馈智能体)同时更新任务特定智能体的框架和权重。我们在三个对比领域进行评估:中国法律罪名分类、底层GPU内核优化和单细胞RNA去噪。结合两个杠杆在所有三个基准上均优于仅迭代框架。SIA-W+H在LawBench上比先前SOTA高出25.1%,GPU内核比先前SOTA快12.4%(1017 vs 1161 μs),去噪性能比先前SOTA高出20.4%。框架更新使模型具有智能体性,塑造其搜索和行动方式,而权重更新构建了任何提示或框架都无法灌输的领域直觉。

英文摘要

Humans are the bottleneck in building and improving AI. Both the models and the agents that wrap them are written, tuned, and corrected by people. The long-horizon goal of an AI that can figure out how to improve itself remains open. Two largely disjoint research lines attack this bottleneck. The harness-update school has a meta-agent rewrite the scaffold of a task-specific agent (its tools, prompts, retry logic, and search procedure) while the model weights are held fixed. The test-time training school uses hand-written RL pipelines to update the model's own weights on task feedback while the harness is held fixed. These two silos operate in isolation. We propose SIA, a self-improving loop in which a language-model agent (the Feedback-Agent) updates both the harness and the weights of a task-specific agent. We evaluate across three contrasting domains: Chinese legal charge classification, low-level GPU kernel optimisation, and single-cell RNA denoising. Combining both levers outperforms scaffold iteration alone on all three benchmarks. SIA-W+H achieves 25.1% over prior SOTA on LawBench, 12.4% faster GPU kernels than prior SOTA (1,017 vs 1,161 μs), and 20.4% over prior SOTA on denoising. Harness updates make the model agentic, shaping how it searches and acts, while weight updates build the domain intuition that no prompt or scaffold can instil.

2605.26255 2026-05-29 eess.IV cs.AI cs.LG 版本更新

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

多模式呼吸衰竭预测的前瞻性评估:胸部X光片能否在电子健康记录信号之外提升性能?

Xiaolei Lu, Shamim Nemati

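AI summary Proposes a gated multimodal framework that integrates structured EHR time-series data with chest X-ray foundation-model representations to prospectively predict whether ICU patients will need invasive mechanical ventilation within 24 hours; multimodal fusion improves discrimination, specificity, and positive predictive value over the EHR-only model and improves sensitivity over physician predictions.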

详情
AI中文摘要

呼吸衰竭的早期预测对于重症监护病房的及时临床干预至关重要。现有的基于电子健康记录(EHR)的模型可以持续监测生理恶化,但可能无法完全捕捉胸部X光片(CXR)中反映的肺部病理生理学。在本研究中,我们探讨CXR信息是否能在仅使用EHR信号的基础上改善有创机械通气的前瞻性预测。我们开发了一个门控多模态框架,将结构化EHR时间序列数据与CXR基础模型表示相结合。门控模块根据患者特定的临床背景自适应地控制成像特征的贡献,使模型在成像信息有用时选择性地依赖它。我们前瞻性地评估了该框架在ICU患者中预测24小时内需要有创机械通气的性能,并将其与已建立的仅使用EHR的模型(Ventio)、在匹配临床时间点获得的医生预测以及替代多模态变体进行比较。门控多模态模型比仅使用EHR的基线模型实现了更高的区分度,使用REMEDIS和MedInsight CXR表示时AUROC值分别为0.860和0.858,而Ventio为0.752。相对于医生预测,多模态框架显著提高了敏感性,同时保持了良好的特异性。与仅使用EHR的模型相比,多模态整合提高了特异性和阳性预测值,表明CXR信息可以细化选定患者的风险估计。这些发现支持自适应多模态融合作为将成像纳入前瞻性呼吸衰竭预测的实用策略。

英文摘要

Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health record (EHR)-based models can continuously monitor physiologic deterioration, but they may not fully capture pulmonary pathophysiology reflected in chest radiographs (CXRs). In this study, we ask whether CXR information improves prospective prediction of invasive mechanical ventilation beyond EHR signals alone. We develop a gated multimodal framework that integrates structured EHR time-series data with CXR foundation-model representations. The gating module adaptively controls the contribution of imaging features based on patient-specific clinical context, allowing the model to selectively rely on imaging information when it is informative. We prospectively evaluate the framework for predicting invasive mechanical ventilation within 24 hours in ICU patients and compare it with an established EHR-only model (Ventio), physician predictions obtained at matched clinical time points, and alternative multimodal variants. The gated multimodal models achieved higher discrimination than the EHR-only baseline, with AUROC values of 0.860 and 0.858 using REMEDIS and MedInsight CXR representations, respectively, compared with 0.752 for Ventio. Relative to physician predictions, the multimodal framework substantially improved sensitivity while maintaining favorable specificity. Compared with the EHR-only model, multimodal integration increased specificity and positive predictive value, suggesting that CXR information can refine risk estimation in selected patients. These findings support adaptive multimodal fusion as a practical strategy for incorporating imaging into prospective respiratory failure prediction.
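The gating idea described above, where imaging features contribute more or less depending on clinical context, can be sketched in a few lines. The abstract does not specify the architecture, so everything here is an assumption for illustration: a single scalar gate computed from the EHR vector by a logistic unit, and fusion by concatenation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(ehr, cxr, gate_weights, gate_bias):
    """Scale imaging features by a gate computed from the EHR context.

    ehr, cxr: feature vectors as lists of floats.
    gate_weights: one weight per EHR feature (hypothetical learned parameters).
    Returns (gate, fused), where fused is ehr followed by gate * cxr.
    """
    if len(gate_weights) != len(ehr):
        raise ValueError("gate_weights must match the EHR feature length")
    score = sum(w * x for w, x in zip(gate_weights, ehr)) + gate_bias
    gate = sigmoid(score)  # in (0, 1): near 0 ignores imaging, near 1 keeps it
    fused = list(ehr) + [gate * value for value in cxr]
    return gate, fused


# Toy inputs: with zero weights and bias the gate sits at exactly 0.5.
gate, fused = gated_fusion([1.0, 2.0], [4.0], [0.0, 0.0], 0.0)
print(gate, fused)
```

A per-feature gate or an additive fusion would be equally plausible readings; the scalar form is chosen only because it is the simplest to inspect.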

2605.26193 2026-05-29 cs.LG cs.AI 版本更新

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

桥接分类与重建:协同时间序列异常检测

Qideng Tang, Dai Chaofan, Wubin Ma, Yahui Wu, Haohao Zhou, Tao Zhang, Huan Li, Dalin Zhang

Affiliations * National Key Laboratory of Information Systems Engineering, National University of Defense Technology; College of Systems Engineering, National University of Defense Technology; Zhejiang University; Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University

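AI summary Proposes CoAD, a framework in which a classification module generates probability-informed soft masks that guide a reconstruction module, combining the complementary strengths of the two paradigms to detect subtle and complex anomalies; it significantly outperforms existing methods on benchmark datasets.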

Comments 15 pages, submitted to KDD 2026

详情
AI中文摘要

时间序列异常检测(TSAD)因其广泛应用而长期成为数据挖掘领域的热门研究课题。最近的研究挑战了流行的深度学习方法在TSAD中的有效性,指出它们无法检测细微和持久的异常。异常暴露(OE)和掩码自编码器(MAE)作为两种有前景的范式(分类和重建)出现,用于解决上述问题。然而,基于OE的方法受限于泛化能力差,而基于MAE的方法受限于掩码错位问题。为了解决这些局限性,本文提出了一种新颖的框架CoAD,该框架统一了两种范式,以利用它们的互补优势,同时减轻各自的弱点。在该框架中,分类模块为重建模块生成概率信息软掩码,这反过来又缓解了分类模块的泛化问题。这种协同设计使CoAD能够有效检测现有方法常常忽略的细微和复杂异常。此外,分类模块经过精心设计,以解决分类粒度不当和忽视频率信息的问题。在高质量基准数据集上,按照严格的评估协议进行的大量实验表明,CoAD显著优于最先进的深度学习和传统数据挖掘方法,突显了深度学习在TSAD中的潜力。此外,CoAD轻量级且速度远快于现有SOTA方法,展示了其在大规模实时应用中的实用价值。

英文摘要

Time series anomaly detection (TSAD) has long been a hot research topic in data mining due to its various applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting their failure in detecting subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) emerge as two promising paradigms (classification and reconstruction) for solving the above problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment issues. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.

2605.26029 2026-05-29 cs.AI cs.CL 版本更新

CausaLab: A Scalable Environment for Interactive Causal Discovery Toward AI Scientists

CausaLab:面向AI科学家的交互式因果发现可扩展环境

Junlin Yang, Dylan Zhang, Xiangchen Song, Qirun Dai, Xiao Liu, Yuen Chen, Aniket Vashishtha, Jing Shi, Chenhao Tan, Hao Peng

Affiliations * Tsinghua University; University of Illinois Urbana-Champaign; Carnegie Mellon University; University of Chicago; Adobe

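AI summary Introduces CausaLab, an environment that uses synthetic laboratory tasks to evaluate both the prediction accuracy and the causal-mechanism recovery of LLM agents in causal discovery, and finds a persistent gap between the two.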

详情
AI中文摘要

我们介绍了CausaLab,一个用于评估LLM代理进行交互式因果发现的可扩展环境。与先前的评估不同,CausaLab既评估代理是否能够使用因果证据解决问题,也评估其答案是否基于忠实恢复的因果机制。每个回合将代理置于一个合成实验室中:它接收先前的测量记录,对操纵器晶体进行干预,并预测由相同机制控制的保留反应器晶体的共振频率。隐藏的数据生成过程是一个随机采样的结构因果模型(SCM),因此成功需要恢复因果图和结构方程,而不是回忆先验知识。实验表明,预测和机制恢复之间存在持续差距:在纯观测的6节点设置中,GPT-5.2-high达到92%的任务准确率,但全边$F_1$仅为0.471。混合观测-干预策略提高了结构保真度,而纯干预即使对强代理仍然困难。我们确定过早停止是一个主要弱点,并表明一致性验证可以缓解它。因此,CausaLab将预测成功与因果理解分开,并揭示了当前LLM代理作为实验因果推理者的局限性。

英文摘要

We introduce CausaLab, a scalable environment for evaluating interactive causal discovery by LLM agents. Unlike prior evaluations, CausaLab evaluates both whether an agent can solve a problem using causal evidence and whether its answer is grounded in a faithful recovered causal mechanism. Each episode places an agent in a synthetic laboratory: it receives prior measurement records, intervenes on a manipulator crystal, and predicts the resonance frequency of a held-out reactor crystal governed by the same mechanism. The hidden data-generating process is a randomly sampled structural causal model (SCM), so success requires recovering both a causal graph and structural equations rather than recalling prior knowledge. Experiments show a persistent gap between prediction and mechanism recovery: in the purely observational 6-node setting, GPT-5.2-high reaches 92% task accuracy but only 0.471 all-edge $F_1$. Mixed observation-intervention strategies improve structural fidelity, while pure intervention remains difficult even for strong agents. We identify premature stopping as a major weakness and show that consistency verification mitigates it. CausaLab therefore separates predictive success from causal understanding and exposes current LLM agents' limits as experimental causal reasoners.
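The gap between task accuracy and edge $F_1$ is easier to read with the structural metric spelled out. The sketch below scores a predicted graph against the true one as sets of directed edges; treating the paper's "all-edge $F_1$" this way is an assumption, and the node names are invented.

```python
def edge_f1(true_edges, predicted_edges):
    """F1 over directed edges, each given as a (cause, effect) pair."""
    true_set, pred_set = set(true_edges), set(predicted_edges)
    true_positives = len(true_set & pred_set)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_set)
    recall = true_positives / len(true_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical 4-node example: one edge reversed, one spurious edge added.
true_graph = [("A", "B"), ("B", "C"), ("C", "D")]
predicted_graph = [("A", "B"), ("C", "B"), ("C", "D"), ("A", "D")]
score = edge_f1(true_graph, predicted_graph)
print(f"edge F1 = {score:.3f}")
```

A reversed edge counts as both a false positive and a false negative under this definition, which is one way an agent can predict outcomes well while still scoring poorly on structure.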

2605.25556 2026-05-29 cs.LO cs.AI 版本更新

Keep the Proof State Live: Snapshotting for Efficient Tactic Search in Lean 4

保持证明状态活跃:Lean 4 中高效策略搜索的快照技术

Austin Shen, Yunong Shi

Affiliations * University of Michigan; Amazon Web Services

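AI summary To address the heavy overhead that parallel tactic search in Lean 4 incurs by repeatedly reconstructing proof states, the paper proposes proof-state snapshotting, which captures the proof state once and reuses it across branches, yielding a 5.6-50x speedup.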

Comments 11 pages, 1 figure. v2: Added co-author affiliation (Amazon Web Services) and contact emails for both authors

详情
AI中文摘要

基于 Lean 4 的自动定理证明系统越来越依赖对部分指定证明(如 Draft-Sketch-Prove (DSP) 流水线生成的证明)进行并行策略搜索。在现有系统中,每个搜索分支通过重新运行 elaboration 来重建证明状态,导致每个分支产生大量开销。在带有 Mathlib 的 Lean 4 中,这种开销有两个组成部分:(1) 导入加载,反序列化预编译库(每个分支约 60 秒);(2) 定理体 elaboration,重新检查直到目标目标的定理上下文(根据证明复杂度估计为 18-735 秒)。两者合计占每个分支墙钟时间的 99% 以上,使得基于组合的搜索难以大规模应用。我们观察到,这种开销源于证明搜索的结构与其执行模型之间的不匹配:分支是通过重复重建证明状态实现的,而不是直接重用。为了解决这个问题,我们引入了证明状态快照,它一次捕获 elaborated 证明状态,并通过 Lean 4 语言服务器的一个小扩展在分支间重用。在 48 个 miniF2F-v2 问题(45 个证明阶段基准和 3 个完整端到端运行)上,我们的方法比标准回退方法实现了 5.6-50 倍的墙钟时间加速(平均 14 倍,中位数 9.7 倍)。加速比随证明分支数量增加而增加。我们的方法与导入级缓存(例如 Kimina Lean Server)正交,后者避免了导入加载,但未避免定理体 elaboration。修补后的 Lean 二进制文件和 Snapshot-DSP 流水线将在发表后作为开源发布。

英文摘要

Automated theorem proving systems built on Lean 4 increasingly rely on parallel tactic search over partially specified proofs, such as those generated by Draft-Sketch-Prove (DSP) pipelines. In current systems, each search branch reconstructs a proof state by re-running elaboration, leading to substantial per-branch overhead. In Lean 4 with Mathlib, this cost has two components: (1) import loading, which deserializes pre-compiled libraries (~60 s per branch); and (2) theorem-body elaboration, which re-checks the theorem context up to the target goal (estimated 18-735 s depending on proof complexity). Together, these account for >99% of per-branch wall time, making portfolio-based search impractical at scale. We observe that this overhead arises from a mismatch between the structure of proof search and its execution model: branching is implemented via repeated reconstruction of proof states rather than direct reuse. To address this, we introduce proof-state snapshotting, which captures the elaborated proof state once and reuses it across branches via a small extension to the Lean 4 language server. Across 48 miniF2F-v2 problems (45 prove-phase benchmarks and 3 full end-to-end runs), our approach achieves a 5.6-50x wall-time speedup over the standard fallback (average 14x, median 9.7x). Speedup increases with the number of proof branches. Our method is orthogonal to import-level caching (e.g., Kimina Lean Server), which avoids import loading but not theorem-body elaboration. The patched Lean binary and the Snapshot-DSP pipeline will be released as open source upon publication.
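The claim that speedup grows with the number of branches follows from a simple cost model. The sketch below is a toy, not the paper's measurement method: it sums costs sequentially, ignores parallelism and any snapshot save or restore overhead, and the 1 s per-branch tactic time is a made-up placeholder. The 60 s import and 18 s elaboration figures are the abstract's import cost and the low end of its elaboration estimate.

```python
def rebuild_cost(branches, import_s, elaboration_s, tactic_s):
    """Every branch reloads imports and re-elaborates before running its tactic."""
    return branches * (import_s + elaboration_s + tactic_s)


def snapshot_cost(branches, import_s, elaboration_s, tactic_s):
    """Imports and elaboration are paid once; each branch only runs its tactic."""
    return import_s + elaboration_s + branches * tactic_s


def speedup(branches, import_s=60.0, elaboration_s=18.0, tactic_s=1.0):
    baseline = rebuild_cost(branches, import_s, elaboration_s, tactic_s)
    reused = snapshot_cost(branches, import_s, elaboration_s, tactic_s)
    return baseline / reused


for n in (1, 4, 16, 64):
    print(f"{n:>3} branches: {speedup(n):.1f}x")
```

Under this model the speedup starts at 1x for a single branch and approaches the ratio of full per-branch cost to tactic cost as the branch count grows.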

2605.25297 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Eureka:面向企业AI云资源需求预测的智能特征工程

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

Affiliations * Alibaba Cloud Computing Co. Ltd, Hangzhou, China; School of Computer Science, Fudan University, Shanghai, China; School of Computer Science and Technology, Tongji University, Shanghai, China; Independent Researcher, United States

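AI summary Proposes Eureka, a framework that treats feature engineering as an agentic code generation problem and generates executable feature code through three stages (an Expert Agent, an LLM Feature Factory, and a Self-Evolving Alignment Engine); it improves performance on 7 public benchmarks in healthcare, finance, and social domains and on cloud GPU resource demand prediction at Alibaba Cloud.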

Comments accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

详情
Journal ref
Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer
AI中文摘要

有效的特征对于预测模型性能至关重要,但创建特征通常需要领域专业知识,限制了跨应用的可扩展性。我们将特征工程定义为一个智能体代码生成问题:特征不再是静态的数据转换,而是可生成、评估和迭代改进的可执行程序。我们提出了Eureka,一个由LLM驱动的三阶段框架。(1)专家代理,通过领域知识的SFT微调,生成结构化的JSON格式特征设计方案。(2)LLM特征工厂,通过思维链推理将每个方案转化为可执行的Python代码,将特征假设转化为可运行的程序。(3)自演化对齐引擎,使用带双通道奖励(基于指标的效用+语义对齐)的强化学习(GRPO)来提升代码质量。通过将特征表达为程序,学习到的生成模式可以跨领域迁移。在医疗、金融和社交领域的7个公开基准上评估,Eureka一致优于传统的AutoFE和基于LLM的基线。我们进一步在阿里云的云GPU资源需求预测中展示了Eureka的有效性,其中Eureka将需求满足率提高了16%,并将计算资源迁移率降低了33%。

英文摘要

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.24846 2026-05-29 cs.LG cs.AI 版本更新

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

微小大脑,巨大影响:仅用少量提示揭示LLM的关键神经元

Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

Affiliations * National University of Singapore; Beijing University of Posts and Telecommunications; University of Science and Technology of China

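AI summary By analyzing cross-task activation strength, the study identifies an extremely sparse set of keystone neurons in LLMs whose removal causes model behavior to collapse, and proposes a fine-tuning method that updates only these neurons, matching or exceeding full-parameter fine-tuning while modifying far fewer parameters.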

详情
AI中文摘要

大型语言模型(LLM)展现出强大的综合能力,但支撑这些行为的内部机制仍未被充分理解。在这项工作中,我们展示了在多种开放权重Transformer模型中,存在一组神经元在跨多个能力维度的任务推理期间始终保持高度激活。通过沿跨任务激活强度进行探测,我们分离出一个极其稀疏的子集,其移除会导致模型行为崩溃,我们将其称为关键神经元。我们的分析揭示,关键神经元是模型的一个稳定且内在的神经元子集,主要在预训练期间建立。与这些神经元相关的参数在训练过程中被紧密校准,其精确值对模型能力至关重要。基于这些见解,我们提出了一种监督微调方法,仅更新关键神经元,在修改远少于全参数的情况下,实现了与全参数微调相当甚至更好的任务增益,同时更好地保留了其他能力维度的性能。

英文摘要

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks of multiple capability dimensions. By probing along the cross-task activation strength, an extremely sparse subset is isolated, whose removal causes a collapse in model behavior, which we term keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.
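The two steps described above, selecting neurons that stay highly active across tasks and then updating only those, can be sketched as follows. The abstract does not give the selection rule, so the minimum-across-tasks score, the top-k cutoff, and all names and numbers are assumptions for illustration.

```python
def cross_task_strength(activations_by_task):
    """Score each neuron by its weakest mean |activation| over all tasks.

    activations_by_task: {task name: [mean |activation| per neuron]}.
    A high score means the neuron is strongly active on every task.
    """
    per_task = list(activations_by_task.values())
    width = len(per_task[0])
    return [min(task[i] for task in per_task) for i in range(width)]


def top_k_neurons(strength, k):
    """Indices of the k highest-scoring neurons, in ascending index order."""
    ranked = sorted(range(len(strength)), key=lambda i: strength[i], reverse=True)
    return sorted(ranked[:k])


def mask_gradients(grads, keep):
    """Zero every gradient entry except those of the selected neurons."""
    keep = set(keep)
    return [g if i in keep else 0.0 for i, g in enumerate(grads)]


# Toy layer with four neurons and three tasks (invented numbers).
acts = {
    "math": [0.1, 2.0, 0.3, 1.5],
    "code": [0.2, 1.8, 2.5, 1.4],
    "qa": [0.1, 2.2, 0.2, 1.6],
}
strength = cross_task_strength(acts)
keystone = top_k_neurons(strength, 2)
masked = mask_gradients([0.5, -0.2, 0.3, 0.1], keystone)
print(keystone, masked)
```

Neuron 2 is very active on one task only, so the minimum-based score excludes it; a mean-based score would rank it higher, which is why the aggregation choice matters.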

2605.24399 2026-05-29 cs.AI 版本更新

ConceptM$^3$oE: Concept-Guided Multimodal Mixture of Experts for Interpretable Computational Pathology

ConceptM$^3$oE:面向可解释计算病理学的概念引导多模态专家混合模型

Xuan Wang, Zhongling Xu, Gopi Kannedhara, Joakim Nguyen, Jian Yu, Jinrui Fang, Abdurrahmaan Baghdadi, Tianlong Chen, Awais Naeem, Chandra Krishnan, Edward Castillo, Andrew H. Song, Ankita Shukla, Ying Ding, Nicholas Konz, Hairong Wang

Affiliations * University of Texas at Austin; University of North Carolina at Chapel Hill; Dell Children’s Medical Center; The University of Texas MD Anderson Cancer Center; University of Nevada, Reno

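AI summary Proposes ConceptM$^3$oE, which embeds concept formation in concept-guided multimodal mixture-of-experts pathways and uses residual pathways to preserve both performance and interpretability; on brain tumor cohorts it is competitive with unconstrained models and improves limited-data performance over non-concept-informed baselines.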

详情
AI中文摘要

医疗模型正从单模态预测转向对异构诊断输入的多模态推理。在计算病理学中,对于仅凭形态学难以区分的复杂肿瘤亚型,病理报告和分子测量可提供额外的诊断证据,但现有模型往往无法阐明不同信号如何组合成可识别的诊断概念。我们提出ConceptM$^3$oE(概念多模态MoE),将概念形成直接嵌入交互感知的专家混合(MoE)路径中。该架构将证据分解为模态特定、冗余和协同专家,然后将其投影到结构化概念瓶颈中,将潜在特征映射到形态学和生物标志物概念层次结构。为防止可解释瓶颈典型的信息损失,我们在每个专家内利用残差路径,使任务相关信号既通过概念流动,也直接流向最终任务预测,从而在保持可解释性的同时维持高性能。在机构性儿童脑肿瘤队列和公共胶质瘤队列上,该框架实现了与无约束模型相竞争的性能,同时产生由独立神经病理学家验证的推理轨迹。在数据有限的情况下,ConceptM$^3$oE提升了小数据性能,在较小训练规模下,与非概念信息基线相比,宏F1从56.41%提升至66.70%,同时显示出更快的训练收敛速度,这与概念学习的正则化效应一致。这项工作为高性能、内在可验证且更符合临床实践复杂决策的医疗AI提供了一条可扩展的路径。

英文摘要

Healthcare models are transitioning from unimodal prediction toward multimodal reasoning over heterogeneous diagnostic inputs. In computational pathology, for complex tumor subtypes that are hard to distinguish by morphology alone, pathology reports and molecular measurements may provide additional diagnostic evidence alongside whole-slide images, yet existing models often fail to clarify how diverse signals assemble into recognizable diagnostic concepts. We propose ConceptM$^3$oE (Concept Multimodal MoE), which embeds concept formation directly within interaction-aware mixture-of-experts (MoE) pathways. The architecture decomposes evidence into modality-specific, redundant, and synergistic experts, which are then projected into structured concept bottlenecks mapping latent features to a hierarchy of morphology and biomarker concepts. To prevent the information loss typical of interpretable bottlenecks, we use residual pathways within each expert so that task-relevant signals flow both through the concepts and directly to the final task prediction, maintaining high performance alongside interpretability. Across an institutional pediatric brain tumor cohort and a public glioma cohort, the framework performs competitively with unconstrained models while producing reasoning traces validated by an independent neuropathologist. In data-limited regimes, ConceptM$^3$oE raises macro-F1 from 56.41% to 66.70% at small training sizes relative to non-concept-informed baselines, and it also converges faster in training, consistent with the regularizing effect of concept learning. This work offers a scalable path toward high-performance medical AI that is inherently verifiable and better aligned with the complex decision-making of clinical practice.

2605.24140 2026-05-29 cs.AI 版本更新

HyperGuide: Hyperbolic Guidance for Efficient Multi-Step Reasoning in Large Language Models

HyperGuide: 用于大型语言模型高效多步推理的双曲引导

Yuyu Liu, Haotian Xu, Yanan He, Sarang Rajendra Patil, Mengjia Xu, Tengfei Ma

Affiliations * Department of Computer Science; Stony Brook University; Department of Applied Mathematics and Statistics; Yale University; Department of Data Science; New Jersey Institute of Technology; Department of Biomedical Informatics

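AI summary Single-pass generation in multi-step reasoning is efficient but inaccurate, and tree search is computationally heavy; the paper distills reasoning progress into a hyperbolic geometric signal that guides step-by-step generation, using distance and angle in hyperbolic space to encode solution proximity and branch separation. A lightweight head projects hidden states into this space and a low-rank adapter is fine-tuned to act on the signal, giving consistent gains across multiple benchmarks.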

详情
AI中文摘要

多步推理仍然是大型语言模型的一个核心挑战:单次生成效率高但缺乏准确性;树搜索方法探索多条路径但计算量大。我们通过将推理进度蒸馏为双曲几何信号来弥补这一差距,该信号引导逐步生成。我们的方法基于一个结构性观察:在组合推理树中,包含解的状态很少,而死胡同则呈指数级多。双曲空间匹配这种不对称性,原点附近体积紧凑,向边界指数扩展,因此到原点的距离自然地编码解的接近度,而角度分离则区分需要不同下一步操作的分支。我们训练一个轻量头将LLM的隐状态投影到该空间,然后在其自身的推理尝试上交互式地微调一个低秩适配器,以对注入的信号做出反应。在多个基准上,该几何信号带来一致的提升,在更深推理链上改进更大。我们的代码公开在 https://github.com/yuyuliu11037/HyperGuide。

英文摘要

Multi-step reasoning remains a central challenge for large language models: single-pass generation is efficient but lacks accuracy; tree-search methods explore multiple paths but are computation-heavy. We address this gap by distilling reasoning progress into a hyperbolic geometric signal that guides step-by-step generation. Our approach is motivated by a structural observation: in combinatorial reasoning trees, solution-bearing states are few while dead ends are exponentially numerous. The hyperbolic space matches this asymmetry, with compact volume near the origin and exponentially expanding capacity toward the boundary, so that distance-to-origin naturally encodes solution proximity while angular separation distinguishes branches requiring different next operations. We train a lightweight head to project LLM hidden states into this space, then fine-tune a low-rank adapter interactively on its own reasoning attempts to act on the injected signal. Across multiple benchmarks, the geometric signal yields consistent gains, with larger improvements on deeper reasoning chains. Our code is publicly available at https://github.com/yuyuliu11037/HyperGuide.
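The geometric intuition above (compact near the origin, expanding toward the boundary) can be made concrete with the Poincaré ball of curvature -1, where $d(0, x) = 2\,\operatorname{artanh}(\lVert x \rVert)$. Whether the paper uses this particular model of hyperbolic space is an assumption; the formulas themselves are standard, and the sample points are invented.

```python
import math


def squared_norm(x):
    return sum(a * a for a in x)


def poincare_distance(x, y):
    """Geodesic distance between two points inside the unit Poincare ball."""
    nx, ny = squared_norm(x), squared_norm(y)
    if nx >= 1.0 or ny >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    diff = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.acosh(1.0 + 2.0 * diff / ((1.0 - nx) * (1.0 - ny)))


def distance_to_origin(x):
    """Closed form of poincare_distance(x, origin)."""
    return 2.0 * math.atanh(math.sqrt(squared_norm(x)))


# Equal Euclidean steps cost more hyperbolic distance near the boundary.
for radius in (0.5, 0.9, 0.99):
    print(radius, round(distance_to_origin([radius, 0.0]), 3))
```

The distance at Euclidean radius 0.5 is ln 3, and it grows without bound as the radius approaches 1, which is what lets distance-to-origin separate the few near-solution states from the many distant dead ends.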

2605.23993 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Nano World Models: A Minimalist Implementation of Future Video Prediction

纳米世界模型:未来视频预测的极简实现

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

Affiliations * DeepMind

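AI summary Introduces Nano World Models, a minimalist codebase for future video prediction centered on diffusion forcing that supports controlled studies of world-model design choices, with experiments analyzing how factors such as prediction parameterization and architecture scale affect video prediction quality.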

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

详情
AI中文摘要

世界模型已成为学习预测模拟器的核心范式,支持生成、规划和决策。然而,尽管工业级交互式视频生成取得了快速进展,更广泛的研究社区仍然缺乏紧凑、可重复且易于扩展的实现来研究现代世界模型的设计选择。我们介绍了Nano World Models,一个围绕扩散强迫的极简代码库,用于未来视频预测。Nano World Models为生成目标、模型规模、动作条件机制、潜在观测空间、数据集、评估协议和长程展开程序提供了统一接口。这种设计使得通常在不同实现中纠缠的世界模型组件可以进行受控研究。通过在简单控制环境、游戏模拟和真实机器人数据上的实验,我们考察了预测参数化、架构规模、动作注入、采样预算和领域复杂性如何影响视频预测质量和自回归展开行为。通过发布代码、配置、评估脚本和预训练检查点,Nano World Models旨在为开放、可重复和科学的世界模型研究提供一个紧凑但可扩展的实验基础。

英文摘要

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.

2605.22100 2026-05-29 cs.AI 版本更新

MPDocBench-Parse: Benchmarking Practical Multi-page Document Parsing

MPDocBench-Parse:面向实际的多页文档解析基准测试

Bangbang Zhou, Hangdi Xing, Yifan Chen, Jianjun Xu, Qi Zheng, Feiyu Gao, Zhibo Yang, Shuai Bai, Ming Yan, Jieping Ye, Hongtao Xie

Affiliations * University of Science and Technology of China; Tongyi Lab, Alibaba Group

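AI summary Because existing benchmarks are inadequate for realistic scenarios, the paper proposes MPDocBench-Parse, a benchmark of 433 multi-page documents (3,246 pages) covering 15 document types, with a comprehensive evaluation protocol for content fidelity and logical structure; experiments show that existing models have clear limitations in semantic continuity, visual content parsing, and hierarchical structure recovery.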

详情
AI中文摘要

文档解析将视觉丰富的文档转换为机器可读的结构化表示,为信息系统提供了关键基础。尽管已有许多文档解析基准测试,但它们仍不足以应对真实场景。现有基准测试要么专注于特定任务,要么仅评估单页、以文本为中心的设置,因此不足以处理实际的多页解析。此外,它们缺乏对语义连续性、层次结构恢复和视觉内容保留的细粒度评估。为解决这些不足,我们提出了MPDocBench-Parse,一个面向实际应用的多页文档解析基准测试。它包含433份人工标注的文档,共3246页,覆盖中英文15种文档类型,具有多样化的布局风格,并支持文档级端到端评估。我们进一步设计了一套全面的内容保真度和逻辑结构评估协议,涵盖文本、表格和公式识别,截断文本和表格合并,图形提取,阅读顺序以及标题层次恢复。实验表明,尽管现有模型在基本文本提取方面表现良好,但在语义连续性整合、视觉内容解析和层次结构恢复方面仍存在明显局限。MPDocBench-Parse为将文档解析推进到更真实的场景提供了统一基础。

英文摘要

Document parsing converts visually rich documents into machine-readable structured representations, forming a crucial foundation for information systems. Although many benchmarks have been proposed for document parsing, they remain inadequate for realistic scenarios. Existing benchmarks either focus on specific tasks or assess only single-page, text-centric settings, making them insufficient for practical multi-page parsing. Moreover, they lack fine-grained evaluation of semantic continuity, hierarchical structure recovery, and visual content preservation. To address these gaps, we propose MPDocBench-Parse, a benchmark for multi-page document parsing in real-world applications. It contains 433 manually annotated documents with 3,246 pages, covering 15 document types in English and Chinese, with diverse layout styles, and supports document-level end-to-end evaluation. We further design a comprehensive protocol for content fidelity and logical structure, covering text, table, and formula recognition, truncated text and table merging, figure extraction, reading order, and heading hierarchy recovery. Experiments show that, while existing models perform well on basic text extraction, they still suffer clear limitations in semantic continuity integration, visual content parsing, and hierarchical structure recovery. MPDocBench-Parse provides a unified foundation for advancing document parsing toward more realistic scenarios.

2605.22080 2026-05-29 cs.CV cs.AI 版本更新

JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation

JMed48k:用于视觉语言模型评估的多专业日本医疗执照基准

Yue Xun, Junyu Liu, Qian Niu, Xinyi Wang, Zheng Yuan, Zirui Li, Zequn Zhang, Bowen Zhao, Shujun Wang, Irene Li, Kan Hatakeyama-Sato, Yusuke Iwasawa, Yutaka Matsuo

Affiliations * The Hong Kong Polytechnic University; Kyoto University; The University of Tokyo; Hohai University; University of Science and Technology of China; University of Toronto

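AI summary Introduces JMed48k, a multi-profession Japanese medical licensing benchmark with 48,862 exam questions and 20,142 images; evaluating 21 models with a paired image-removal audit shows that proprietary and open-source models benefit substantially from images, whereas medical-specific models make limited use of visual evidence.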

详情
AI中文摘要

我们引入了JMed48k,一个用于评估视觉语言模型的多专业日本医疗执照基准。该基准基于日本厚生劳动省发布的官方PDF材料构建,包含2005年至2025年间11个国家执照考试的48,862道试题和20,142张图像,视觉内容按8类分类法进行标注。从该语料库中,我们提取了JMed48k-Eval,一个近五年的评估子集,包含12,484道评分题,其中9,905道纯文本题和2,579道带图像题。我们评估了21个专有、开源和医学专用模型,分别报告纯文本和带图像的性能。由于这些子集包含不同的问题,我们进一步引入了一种配对图像移除审计,评估带图像的问题在移除视觉内容前后的表现,以探索四种答案转换状态。审计显示,专有和开源模型从图像中获益显著,而医学专用系统对视觉证据的利用有限,许多正确答案在图像移除后仍然存在。即使在专有模型中,净图像移除效应在不同专业间变化七倍,从医师问题的+5.7分到公共卫生护士问题的+39.8分。我们发布JMed48k以支持在医疗执照场景中对视觉语言模型进行可重复的、按专业分层的评估。

英文摘要

We introduce JMed48k, a multi-profession Japanese healthcare licensing benchmark for evaluating vision-language models. Built from official PDF materials released by the Japanese Ministry of Health, Labour and Welfare, JMed48k contains 48,862 exam questions and 20,142 images from 11 national licensing examinations between 2005 and 2025, with visual content annotated under an 8-type taxonomy. From this corpus, we derive JMed48k-Eval, a recent five-year evaluation subset with 12,484 scored questions, including 9,905 text-only questions and 2,579 questions with images. We evaluate 21 proprietary, open-source, and medical-specific models, reporting text-only and with-image performance separately. Because these subsets contain different questions, we further introduce a paired image-removal audit that evaluates questions with images before and after removing visual content to explore four answer-transition states. The audit shows that proprietary and open-source models gain substantially from images, whereas medical-specific systems show limited observable use of visual evidence, with many correct answers persisting after image removal. Even among proprietary models, the net image-removal effect varies sevenfold across professions, from +5.7 points on Physician questions to +39.8 points on Public Health Nurse questions. We release JMed48k to support reproducible, profession-stratified evaluation of vision-language models in medical licensing settings.
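The paired image-removal audit compares each image question answered with and without its image, which yields four transition states. The sketch below tallies them and derives a net effect; the state labels are hypothetical, and defining the net effect as accuracy with images minus accuracy without, in points, is an assumption about the paper's metric.

```python
from collections import Counter


def transition_state(correct_with_image, correct_without_image):
    """Classify one question by its outcome before and after image removal."""
    if correct_with_image and correct_without_image:
        return "stays_correct"    # answer did not need the image
    if correct_with_image:
        return "image_dependent"  # correct only when the image is shown
    if correct_without_image:
        return "image_hurts"      # correct only when the image is removed
    return "stays_wrong"


def audit(pairs):
    """pairs: list of (correct_with_image, correct_without_image) booleans."""
    if not pairs:
        raise ValueError("need at least one question")
    counts = Counter(transition_state(w, wo) for w, wo in pairs)
    # Net effect in points: accuracy with images minus accuracy without.
    net = 100.0 * (counts["image_dependent"] - counts["image_hurts"]) / len(pairs)
    return counts, net


# Five toy questions (invented outcomes).
pairs = [(True, True), (True, False), (True, False), (False, True), (False, False)]
counts, net_effect = audit(pairs)
print(dict(counts), f"net effect: {net_effect:+.1f} points")
```

A large `stays_correct` share alongside a small net effect is the pattern the abstract describes for medical-specific systems: answers that survive image removal were not relying on the image.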

2605.21739 2026-05-29 cs.AI 版本更新

AttuneBench: A Conversation-Based Benchmark for LLM Emotional Intelligence

AttuneBench: 基于对话的LLM情商基准测试

Kate M. Lubrano, Faisal Sayed, Ankita Rathod, Akshansh, Craver Corbyn Thomas-Smith, Mark E. Whiting, Karina Nguyen

发表机构 * Pareto Thoughtful University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出AttuneBench基准,基于200个真实多轮人机对话,评估LLM在情绪识别、行为分类、偏好预测和响应质量等方面的情商能力,发现这些能力相互独立且偏好对齐和响应质量更具区分性。

Comments v2: Updated def_18 and def_20 supplemental figures to cover all 11 evaluated models (previously 9). Removed redundant supplemental figures. Corrected select captions (color descriptions, chance baselines, figure-content mismatches). No changes to experimental results, numerical claims, or conclusions

详情
AI中文摘要

情商(EI),即感知、理解并恰当回应他人情绪状态的能力,是人类交流的核心,随着LLM在日常生活中承担对话角色,评估其情商日益重要。现有的EI基准依赖于合成提示、单轮案例或第三方标注。这些方法不能直接衡量模型在真实对话过程中如何推断和回应参与者的情绪状态。我们引入AttuneBench,一个基于200个真实多轮人机对话的基准,其中参与者与匿名LLM对话,并逐轮标注其情绪状态、模型行为以及他们偏好的回应。在11个评估模型中,我们发现模型在情绪识别、行为分类、偏好预测和评判响应质量上的排名基本独立,表明情商行为可分解为可分离的能力。偏好对齐和响应质量判断比情绪标签准确性更具模型区分性。这些结果表明,情商行为需要预测特定用户在上下文中想要什么样的回应,这一区别可能被总体评分掩盖,而单轮或合成格式无法跨轮直接捕捉。AttuneBench提供了一个评估这些能力以及诊断模型在情绪显著对话中的特定优势和失败模式的框架。

英文摘要

Emotional intelligence (EI), the ability to perceive, understand, and respond appropriately to others' emotional states, is central to human communication, and increasingly important to assess as LLMs assume conversational roles in everyday life. Existing EI benchmarks rely on synthetic prompts, single-turn cases, or third-party annotation. These approaches do not directly measure how models infer and respond to a participant's emotional state over the course of a real conversation. We introduce AttuneBench, a benchmark grounded in 200 genuine multi-turn human-model conversations in which participants conversed with anonymized LLMs and provided turn-by-turn annotations of their emotional state, the model's behavior, and their preferred responses. Across 11 evaluated models, we find that model rankings on emotion recognition, behavioral classification, preference prediction, and judged response quality are largely independent, indicating that emotionally intelligent behavior decomposes into separable capabilities. Preference alignment and response-quality judgments are substantially more model-discriminating than emotion-label accuracy. These results indicate that emotionally intelligent behavior requires predicting what kind of response a specific user wants in context, a distinction that aggregate scoring can obscure and that single-turn or synthetic formats cannot directly capture across turns. AttuneBench provides a framework for assessing each of these capabilities and for diagnosing model-specific strengths and failure modes in emotionally salient conversation.

2605.15219 2026-05-29 cs.AI cs.IT math.IT 版本更新

NOVA: Fundamental Limits of Knowledge Discovery Through AI

NOVA:通过人工智能进行知识发现的基本限制

Salman Avestimehr, Ken Duffy, Muriel Médard

发表机构 * University of Southern California(南加州大学) Northeastern University(东北大学) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出NOVA框架,将“生成-验证-积累-再训练”循环建模为知识空间上的自适应采样过程,识别了知识覆盖有限域的条件及失败模式,并证明了发现成本与Zipf定律相关的标度律。

详情
AI中文摘要

人工智能系统能否通过迭代自我改进发现真正的新知识,如果可以,代价是什么?我们引入了NOVA框架,将常见的“生成、验证、积累、再训练”循环建模为知识空间上的自适应采样过程。我们识别了积累的真正知识最终覆盖有限域的充分条件,并展示了违反这些条件如何产生不同的失败模式:污染、遗忘、探索失败和接受失败。然后,我们分析了不完美的验证,并识别了一个污染陷阱:随着容易发现的知识被耗尽,模型分配给新有效工件的质量缩小,因此即使很小的假阳性率也可能导致无效工件比真正发现更快地进入知识库。我们澄清了Good-Turing估计是一种局部批次多样性诊断工具,而不是用于估计历史上未发现的、支配长期发现的有效质量的估计量。在将模型的有效发现分布与指数$α>1$的Zipf定律联系起来的独立尾部等价假设下,我们证明了获得$D$个不同真正发现所需的累积生成成本满足$R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$,其中$c_{\mathrm{gen}}$是每个候选的生成成本。这个标度律量化了随着发现前沿推进而渐进的收益递减。最后,我们通过指导、生成和验证形式化了人类增强,解释了为什么专家输入在自主探索障碍附近最有价值。

英文摘要

Can AI systems discover genuinely new knowledge through iterative self-improvement, and if so, at what cost? We introduce the NOVA framework, which models the common "generate, verify, accumulate, retrain" loop as an adaptive sampling process over a knowledge space. We identify sufficient conditions under which accumulated genuine knowledge eventually covers a finite domain, and show how their violations produce distinct failure modes: contamination, forgetting, exploration failure, and acceptance failure. We then analyze imperfect verification and identify a contamination trap: as easy-to-find knowledge is exhausted, the model mass assigned to new valid artifacts shrinks, so even small false-positive rates can cause invalid artifacts to enter the knowledge base faster than genuine discoveries. We clarify that Good-Turing estimation is a local batch-diversity diagnostic, not an estimator of the historically undiscovered valid mass that governs long-term discovery. Under a separate tail-equivalence assumption relating the model's effective discovery distribution to a Zipf law with exponent $α>1$, we prove that the cumulative generation cost required to obtain $D$ distinct genuine discoveries satisfies $R_{\mathrm{cum}}(D)=Θ(c_{\mathrm{gen}}D^α)$, where $c_{\mathrm{gen}}$ is the per-candidate generation cost. This scaling law quantifies asymptotic diminishing returns as the discovery frontier advances. Finally, we formalize human amplification through guidance, generation, and verification, explaining why expert input is most valuable near autonomous exploration barriers.
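To make the scaling law concrete, the sketch below evaluates a cost curve of the form $c_{\mathrm{gen}}D^α$. It is only an illustration under stated assumptions: the Θ-notation hides a constant factor (the hypothetical `constant` argument) and the law is asymptotic, so the numbers show the shape of the curve rather than any value reported in the paper.

```python
def cumulative_cost(num_discoveries: float, alpha: float,
                    c_gen: float = 1.0, constant: float = 1.0) -> float:
    """Illustrative cost curve constant * c_gen * D**alpha (constant is a placeholder)."""
    return constant * c_gen * num_discoveries ** alpha


alpha = 1.5  # any Zipf exponent greater than 1; chosen arbitrarily here

# Doubling the number of discoveries multiplies the cost by 2**alpha, not by 2.
ratio = cumulative_cost(2000, alpha) / cumulative_cost(1000, alpha)

# Average cost per discovery therefore grows like D**(alpha - 1).
avg_small = cumulative_cost(1000, alpha) / 1000
avg_large = cumulative_cost(2000, alpha) / 2000
print(ratio, avg_small, avg_large)
```

Because $α>1$, the average cost per discovery rises as the frontier advances, which is the diminishing-returns behavior the abstract describes.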

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG 版本更新

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench:一种用于评估语音代理的新型端到端框架

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

发表机构 * ServiceNow

AI总结 提出EVA-Bench框架,通过机器人间音频对话模拟和复合指标(EVA-A和EVA-X)全面评估语音代理的准确性和体验质量。

Comments Work in progress

详情
AI中文摘要

语音代理是一种通过口语对话完成任务的人工智能系统,越来越多地部署在企业应用中。然而,现有基准测试未能同时解决两个核心评估挑战:生成逼真的模拟对话,以及全面衡量语音特定故障模式的质量。我们提出了EVA-Bench,一个端到端评估框架,同时解决这两个问题。在模拟方面,EVA-Bench通过动态多轮对话协调机器人间的音频对话,并自动进行模拟验证,检测用户模拟器错误并在评分前适当重新生成对话。在测量方面,EVA-Bench引入了两个复合指标:EVA-A(准确性),捕捉任务完成度、忠实度和音频级语音保真度;以及EVA-X(体验),捕捉对话进展、口语简洁性和话轮转换时机。这两个指标适用于所有主要的代理架构,支持直接的跨架构比较。EVA-Bench包含三个企业领域的213个场景、一个用于口音和噪声鲁棒性的受控扰动套件,以及区分峰值能力和可靠能力的pass@1、pass@k、pass^k测量。在跨越所有三种架构的12个系统中,我们发现:(1)没有系统在EVA-A pass@1和EVA-X pass@1上同时超过0.5;(2)峰值性能和可靠性能差异显著(EVA-A上pass@k与pass^k的中位数差距为0.44);(3)口音和噪声扰动暴露了显著的鲁棒性差距,其影响因架构、系统和指标而异(平均Δ高达0.314)。我们在开源许可下发布了完整的框架、评估套件和基准数据。

英文摘要

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user-simulator errors and regenerates the affected conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median gap between pass@k and pass^k of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $Δ$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
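The distinction between peak and reliable capability can be sketched as below. The abstract does not define the metrics, so this assumes the common reading: pass@k counts a scenario as solved if at least one of k trials succeeds, and pass^k only if all k trials succeed. The paper may use different estimators (for example, unbiased estimators over more than k trials); the helper names and toy data are hypothetical.

```python
from statistics import fmean


def pass_at_k(trials) -> float:
    """Peak capability: 1.0 if any of the k trials succeeded."""
    return 1.0 if any(trials) else 0.0


def pass_hat_k(trials) -> float:
    """Reliable capability: 1.0 only if every one of the k trials succeeded."""
    return 1.0 if all(trials) else 0.0


# Toy outcomes: four scenarios, k = 3 trials each.
scenarios = [
    [True, True, True],
    [True, False, False],
    [False, False, False],
    [True, True, False],
]

peak = fmean([pass_at_k(t) for t in scenarios])
reliable = fmean([pass_hat_k(t) for t in scenarios])
gap = peak - reliable
print(peak, reliable, gap)
```

A large gap means the system can solve many scenarios occasionally but few of them consistently.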

2605.12208 2026-05-29 stat.ML cs.AI cs.LG stat.CO 版本更新

Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

自监督拉普拉斯近似用于贝叶斯不确定性量化

Julian Rodemann, Alexander Marquard, Thomas Augustin, Michele Caprio

发表机构 * Rational Intelligence Lab, CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心理性智能实验室) Department of Statistics, LMU Munich(慕尼黑大学统计学系) Department of Computer Science, The University of Manchester(曼彻斯特大学计算机科学系)

AI总结 提出自监督拉普拉斯近似(SSLA),通过重新拟合自预测数据直接近似后验预测分布,实现确定性、无采样的贝叶斯不确定性量化,并在回归任务中优于经典拉普拉斯近似。

Comments Accepted for publication in TMLR (https://openreview.net/forum?id=T8w8L2t3JG), v2: fixed typos and added a deceased-author footnote with a dedication to Thomas Augustin

详情
Journal ref
Transactions on Machine Learning Research (TMLR). ISSN 2835-8856 (2026)
AI中文摘要

近似贝叶斯推断通常围绕计算后验参数分布展开。然而,在实践中,感兴趣的主要对象通常是模型的预测而非其参数。在这项工作中,我们提出绕过参数后验,直接关注近似后验预测分布。我们通过从自监督和半监督学习中的自训练中汲取灵感来实现这一点。本质上,我们通过重新拟合自预测数据来量化贝叶斯模型的预测不确定性。这个想法非常简单:如果模型对自预测数据赋予高似然,那么这些预测的不确定性低,反之亦然。这产生了后验预测的确定性、无采样近似。我们的自监督拉普拉斯近似(SSLA)的模块化结构进一步允许我们插入不同的先验规范,从而实现经典的贝叶斯敏感性(关于先验选择)分析。为了绕过昂贵的重新拟合,我们进一步引入了SSLA的近似版本,称为ASSLA。我们从理论和经验上研究了(A)SSLA,涉及从贝叶斯线性模型到贝叶斯神经网络的回归模型。在模拟和真实数据集的广泛回归任务中,我们的方法在预测校准方面优于经典拉普拉斯近似,同时保持计算效率。

英文摘要

Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.

2605.07707 2026-05-29 cs.AI 版本更新

Hierarchical Task Network Planning with LLM-Generated Heuristics

基于LLM生成启发式的层次任务网络规划

Felipe Meneguzzi, Alexandre Buchweitz, Augusto B. Corrêa, Victor Scherer Putrich, André Grahl Pereira

发表机构 * University of Aberdeen, UK(阿伯丁大学(英国)) PUCRS, Brazil(巴西南里奥格兰德宗座天主教大学) University of Oxford, UK(牛津大学(英国)) Saarland University, Germany(萨尔大学(德国)) Universidade Federal do Rio Grande do Sul, Brazil(巴西南里奥格兰德联邦大学)

AI总结 研究利用大语言模型为层次任务网络规划生成搜索启发式,通过Pytrich规划器在六个基准领域评估,结果表明LLM生成的启发式在覆盖度上接近最优HTN规划器,并在83%的共享问题上显著减少搜索开销。

Comments 9 pages, 3 figures; submitted to NeurIPS 2026

详情
AI中文摘要

HTN规划是经典规划的一种变体,其中算法不是搜索线性动作序列,而是使用方法库分解高层任务,直到只剩下可执行动作。一方面,这允许引入领域知识,通过方法库加速解决方案的搜索。另一方面,它带来了超越经典状态空间搜索的挑战。尽管最近的研究产生了一些加速HTN规划的启发式和新型算法,但这些启发式仍不如经典规划算法中的启发式信息丰富。我们研究大语言模型(LLMs)能否为HTN规划生成有效的搜索启发式,将Corrêa、Pereira和Seipp(2025)的方法从经典规划扩展到层次规划。使用Pytrich规划器在六个标准全序HTN基准领域上,我们评估了九个LLM在领域特定提示下生成的启发式,并将它们与TDG和LMCount领域无关基线以及PANDA规划器进行比较。结果表明,LLM生成的启发式在覆盖度上几乎与最佳可用HTN规划器相当,同时在83%的共享问题上显著减少了搜索开销。

英文摘要

HTN planning is a variation of classical planning where, instead of searching for a linear sequence of actions, an algorithm decomposes higher-level tasks using a method library until only executable actions remain. On one hand, this allows one to introduce domain knowledge that can speed up the search for a solution through the method library. On the other hand, it creates challenges that go beyond those of classical state-space search. While recent research produced a number of heuristics and novel algorithms that speed up HTN planning, these heuristics are not yet as informative as those available in classical planning algorithms. We investigate whether large language models (LLMs) can generate effective search heuristics for HTN planning, extending the methodology of Corrêa, Pereira, and Seipp (2025) from classical to hierarchical planning. Using the Pytrich planner on six standard total-order HTN benchmark domains, we evaluate heuristics generated by nine LLMs under domain-specific prompting and compare them against the TDG and LMCount domain-independent baselines and the PANDA planner. Our results show that LLM-generated heuristics nearly match the coverage of the best available HTN planner, while substantially reducing search effort on 83% of shared problems.
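The decomposition idea in this abstract can be sketched in a few lines. This is a deliberately simplified, hypothetical total-order decomposer: it ignores world state, preconditions, and heuristics (the paper's actual subject), and simply tries methods depth-first until only primitive actions remain. The task and method names are invented for the example.

```python
def decompose(tasks, methods, primitives):
    """Depth-first total-order decomposition; returns a list of primitive actions or None."""
    if not tasks:
        return []
    head, rest = tasks[0], tasks[1:]
    if head in primitives:
        tail = decompose(rest, methods, primitives)
        return None if tail is None else [head] + tail
    # Compound task: try each method (an ordered list of subtasks) in turn.
    for subtasks in methods.get(head, []):
        plan = decompose(list(subtasks) + rest, methods, primitives)
        if plan is not None:
            return plan
    return None  # no applicable method, so the caller backtracks


# Hypothetical method library: "fly" has no methods and is not primitive, so it fails.
methods = {
    "deliver_package": [["load", "travel", "unload"]],
    "travel": [["fly"], ["drive"]],
}
primitives = {"load", "drive", "unload"}

plan = decompose(["deliver_package"], methods, primitives)
print(plan)
```

A search heuristic of the kind studied in the paper would decide which method to try first instead of the fixed order used here.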

2605.04916 2026-05-29 cs.AI cs.LG cs.SC 版本更新

A Foundation Model for Zero-Shot Logical Rule Induction

零样本逻辑规则归纳的基础模型

Yin Jun Phua

发表机构 * Institute of Science Tokyo(东京科学大学)

AI总结 提出神经规则归纳器(NRI),一种基于统计编码和并行槽解码的预训练模型,实现零样本逻辑规则归纳,无需重新训练即可泛化到新谓词。

Comments Camera-ready version accepted at IJCAI 2026, with full appendices

详情
AI中文摘要

归纳逻辑编程(ILP)从数据中学习可解释的逻辑规则。现有方法是传导性的:其学习参数绑定到特定谓词,并且每个新任务都需要重新训练。我们引入了神经规则归纳器(NRI),一种用于零样本规则归纳的预训练模型。NRI 不编码文字标识,而是使用领域无关的统计属性(如类别条件率、熵和共现)来表示文字,这些属性无需重新训练即可泛化到不同的标识和数量。该模型由一个统计编码器和一个基于并行槽的解码器组成。并行解码保持了逻辑析取的置换不变性;而自回归解码器则会施加任意子句顺序。乘积 T-范数松弛使规则执行可微分,从而仅基于预测准确性进行端到端训练。我们在规则恢复、对标签噪声和虚假相关性的鲁棒性以及零样本迁移到真实世界基准上评估了 NRI,并相信这项工作开启了符号推理基础模型的可能性。代码和参考检查点可在 https://github.com/phuayj/neural-rule-inducer 获取。

英文摘要

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural-rule-inducer.
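The product T-norm relaxation mentioned above can be illustrated with a small sketch. Conjunction becomes a product and disjunction its dual, `1 - prod(1 - c)`. The soft literal-selection weights and the function names are assumptions made for this example; the abstract does not specify how NRI parameterizes its slots.

```python
from math import prod


def soft_clause(truth, weights):
    """Soft AND over selected literals: a weight of 0 ignores a literal, 1 fully includes it."""
    return prod(1.0 - w * (1.0 - x) for x, w in zip(truth, weights))


def soft_rule(truth, clause_weights):
    """Soft OR over clauses using the dual of the product T-norm."""
    return 1.0 - prod(1.0 - soft_clause(truth, w) for w in clause_weights)


# Hypothetical rule in disjunctive normal form: (x0 AND x2) OR x1.
rule = [[1.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

crisp_true = soft_rule([1.0, 0.0, 1.0], rule)
crisp_false = soft_rule([0.0, 0.0, 1.0], rule)

# Reordering the clauses does not change the output, which is the
# permutation invariance that parallel slot decoding preserves.
soft_inputs = [0.9, 0.2, 0.7]
forward = soft_rule(soft_inputs, rule)
reordered = soft_rule(soft_inputs, rule[::-1])
print(crisp_true, crisp_false, forward, reordered)
```

Since every operation is a product or a difference, the output is a polynomial in the weights and can be trained by gradient descent on prediction loss.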

2605.00846 2026-05-29 cs.AI cs.MA 版本更新

ClinicBot: A Guideline-Grounded Clinical Chatbot with Prioritized Evidence RAG and Verifiable Citations

ClinicBot:基于指南的临床聊天机器人,具有优先证据RAG和可验证引用

Navapat Nananukul, Mayank Kejriwal

发表机构 * USC Information Sciences Institute(USC信息科学研究所)

AI总结 提出ClinicBot系统,通过结构化提取指南、按临床重要性排序证据和多智能体协作,生成准确、可验证的临床回答。

详情
AI中文摘要

临床诊断需要准确、可验证且明确基于官方指南的答案。虽然大型语言模型在自然语言处理方面表现出色,但它们产生幻觉的倾向削弱了其在需要精确性的高风险医疗环境中的实用性。现有的检索增强生成(RAG)系统对所有证据一视同仁,产生嘈杂的上下文和与临床实践不符的通用答案。我们提出ClinicBot,一个通过三项关键进展将指南建议转化为可信临床支持的人工智能系统:(1)将临床指南结构化提取为语义单元(建议、表格、定义、叙述)并带有明确的出处;(2)证据优先级排序,根据临床重要性和指南结构而非文本相似性对内容进行排序;(3)一个基于Web的界面,呈现简洁、可操作的答案及可验证的证据。我们将使用真实患者的糖尿病问题以及一个忠实于美国糖尿病协会(ADA)《糖尿病护理标准(2025)》的额外糖尿病风险评估工具来演示ClinicBot。演示将说明语义知识提取和分层证据排名如何在多智能体设置中可靠运行,以大规模处理复杂的临床指南。

英文摘要

Clinical diagnosis requires answers that are accurate, verifiable, and explicitly grounded in official guidelines. While large language models excel at natural language processing, their tendency to hallucinate undermines their utility in high-stakes medical contexts where precision is essential. Existing retrieval-augmented generation (RAG) systems treat all evidence equally, producing noisy context and generic answers misaligned with clinical practice. We present ClinicBot, an AI system that translates guideline recommendations into trustworthy clinical support through three key advances: (1) structured extraction of clinical guidelines into semantic units (recommendations, tables, definitions, narrative) with explicit provenance, (2) evidence prioritization that ranks content by clinical significance and guideline structure rather than textual similarity, and (3) a web-based interface that presents concise, actionable answers with verifiable evidence. We will demonstrate ClinicBot using diabetes questions from real patients and an additional diabetes risk assessment tool that is faithful to the American Diabetes Association (ADA) Standards of Care in Diabetes (2025). The demonstration will illustrate how semantic knowledge extraction and hierarchical evidence ranking can reliably operate in a multi-agent setting to process complex clinical guidelines at scale.

2604.25098 2026-05-29 cs.AI cs.CL cs.LG 版本更新

Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

重新审视LLM剪枝对测试时缩放的有效性

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

发表机构 * Bellini College of AI, Cybersecurity, and Computing(人工智能、网络安全与计算学院)

AI总结 本文研究非结构化剪枝对推理型大语言模型测试时缩放性能的影响,发现其优于结构化剪枝甚至有时超过未剪枝模型,并探讨了层间稀疏分配策略的作用。

详情
AI中文摘要

大型语言模型(LLM)现在通过测试时计算缩放(TTS)展现出卓越的推理能力,在数学和编程基准测试中表现令人印象深刻。与此同时,模型压缩研究开发了剪枝方法,旨在在不牺牲任务性能的情况下移除冗余/有害参数。这两项研究进展的交叉点构成了我们工作的基础。具体到推理型LLM,先前的工作表明结构化剪枝(移除整组层块的方法)显著降低了TTS推理性能。然而,在这项工作中,我们重新审视了这一假设,并研究了非结构化剪枝(仅小心移除某些冗余/有害权重的方法)是否表现出类似的局限性。令人惊讶的是,我们在两个推理型LLM(s1.1-7B和Qwen3-8B)的四个推理基准上的广泛实验一致表明,与结构化剪枝相比,非结构化剪枝增强了TTS性能,有时甚至能超越未剪枝的全权重LLM。此外,我们还实证研究了不同层间稀疏分配策略的影响,这些策略是实现这些非结构化方法的重要参数选择。这些发现挑战了剪枝总是降低TTS性能的传统观念,实际上表明,谨慎进行的剪枝可以保持TTS的有效性。

英文摘要

Large Language Models (LLMs) now exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), with impressive performance across math and coding benchmarks. In parallel, research in model compression has developed pruning methods that seek to remove redundant/detrimental parameters without sacrificing task performance. The intersection of these two research advancements lays the foundation for our work. Specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. However, in this work, we revisit this assumption and investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning improves TTS performance relative to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating these unstructured methods. These findings challenge the conventional notion that pruning always reduces TTS performance and, in fact, suggest that carefully undertaken pruning can retain TTS effectiveness.
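As a rough illustration of unstructured pruning with a layer-wise sparsity allocation, the sketch below zeroes individual weights by magnitude. The magnitude criterion, the `prune_layer` helper, the layer names, and the sparsity values are all assumptions for this example; the abstract does not name the pruning criteria or allocation strategies that were evaluated.

```python
def prune_layer(weights, sparsity):
    """Zero the fraction `sparsity` of entries with the smallest absolute value."""
    flat = [abs(w) for row in weights for w in row]
    k = int(sparsity * len(flat))
    order = sorted(range(len(flat)), key=lambda i: flat[i])
    drop = set(order[:k])  # flat indices of the k smallest-magnitude weights
    ncols = len(weights[0])
    return [
        [0.0 if (r * ncols + c) in drop else w for c, w in enumerate(row)]
        for r, row in enumerate(weights)
    ]


# Hypothetical two-layer model and a non-uniform layer-wise allocation.
model = {
    "layer0": [[0.1, -2.0], [0.5, -0.05]],
    "layer1": [[1.5, 0.2], [-0.3, 0.9]],
}
allocation = {"layer0": 0.5, "layer1": 0.25}

pruned = {name: prune_layer(w, allocation[name]) for name, w in model.items()}
print(pruned)
```

Structured pruning would instead drop whole blocks regardless of the individual weights inside them, which is the contrast the abstract draws.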

2604.23354 2026-05-29 eess.AS cs.AI eess.SP 版本更新

Explainable AI in Speaker Recognition -- Making Latent Representations Understandable

说话人识别中的可解释AI——使潜在表示可理解

Yanze Xu, Wenwu Wang, Mark D. Plumbley

发表机构 * Centre for Vision, Speech and Signal Processing, University of Surrey(萨里大学视觉、语音和信号处理中心) Department of Informatics, King’s College London(伦敦国王学院信息学院)

AI总结 本文提出层次聚类算法(SLINK和HDBSCAN)分析说话人识别网络表示中的层次聚类现象,并设计HCCM算法和Liebig分数为这些聚类提供语义解释。

Comments 15 pages, 10 figures

详情
AI中文摘要

神经网络可以通过训练从数据中学习任务相关的表示。理解这些网络如何做出决策属于可解释AI(XAI)领域。本文提出研究一个XAI主题:揭示表示中未知的组织结构,特别是说话人识别网络从话语中学习到的用于识别说话人身份的表示。过去的研究使用算法(如K-means)分析网络表示如何自然地以不同方式组织成独立聚类,即分析这些表示定义的空间(称为网络表示空间)内的平面聚类现象。相比之下,本文应用两种算法,单链接聚类(SLINK)和基于密度的层次化含噪声应用空间聚类(HDBSCAN),分析表示如何以不同方式形成层次聚类,即分析网络表示空间内的层次聚类现象。为了进一步理解这些层次聚类现象,我们提出了一种新算法,称为层次聚类-类别匹配(HCCM)。HCCM通过将SLINK和HDBSCAN产生的层次聚类与预定义的语义类别匹配,为这些聚类提供语义解释。通过这个过程,一些聚类被解释为单个语义类别(例如男性),而其他聚类被解释为单个语义类别的合取(例如女性且来自爱尔兰)。此外,我们开发了一个新的度量标准,Liebig分数,用于量化聚类与语义类别的匹配程度,这有助于识别对每个匹配限制最强的因素。

英文摘要

Neural networks can be trained to learn task-relevant representations from data. Understanding how these networks make decisions falls within the Explainable AI (XAI) domain. This paper proposes to study an XAI topic: uncovering the unknown organisation in the representations, particularly those a speaker recognition network learns from utterances, for recognising speaker identity. Past studies have employed algorithms (e.g. K-means) to analyse how network representations can be naturally organised into independent clusters in different ways, i.e., to analyse flat clustering phenomena within the space defined by these representations, referred to as the network representation space. In contrast, this work applies two algorithms, Single-Linkage Clustering (SLINK) and Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN), to analyse how representations form hierarchical clusters in different ways, i.e., to analyse hierarchical clustering phenomena within the network representation space. To further understand these hierarchical clustering phenomena, we propose a new algorithm termed Hierarchical Cluster-Class Matching (HCCM). HCCM provides a semantic interpretation for the hierarchical clusters produced by SLINK and HDBSCAN by matching them to predefined semantic classes. Through this process, some clusters are interpreted as individual semantic classes (e.g. male), whereas others are interpreted as conjunctions of individual semantic classes (e.g. female and Ireland). In addition, we develop a new metric, the Liebig score, to quantify how well a cluster matches a semantic class, which helps identify the factor that most strongly limits each match.

2604.23256 2026-05-29 cs.NE cs.AI cs.LG cs.SC 版本更新

Architecture-Induced Recoverability Bias in Differentiable Symbolic Regression

可微符号回归中的架构诱导的可恢复性偏差

Chakshu Gupta, Theodore J. LaGrow

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算机学院) College of Lifetime Learning, Georgia Institute of Technology(佐治亚理工学院终身学习学院)

AI总结 本文研究可微符号回归中,变量路由架构对表达式可恢复性的影响,发现不同架构导致恢复率从0/64到64/64变化,并提出基于验证的架构选择方法将恢复率从34.4%提升至50.1%。

Comments 6 pages, 4 figures, 3 tables; submitted to IEEE MLSP 2026

详情
AI中文摘要

符号回归旨在从数值数据中恢复闭式表达式,但在可微符号回归中,恢复的表达式不仅取决于语法,还取决于训练期间变量路由的固定架构。这与闭式模型和可解释非线性结构有用的信号处理设置相关。这种特定于架构的影响很少被直接隔离,因为现有比较通常同时改变架构、算子族、语法或搜索过程。本文比较了三种深度为3的架构,涵盖24种算子-形状-叶子组合,在尽可能固定算子族、语法和训练协议的同时改变变量路由架构。在架构加原生训练协议的比较下,同一目标的恢复率从0/64变为64/64。一个目标上最好的架构在另一个目标上是最差的,并且具有两个等深子树的树在所有测试配置中均失败(0/3,776)。作为概念验证的缓解措施,训练一个小型架构集,并选择保留集上RMSE最低的硬化表达式。在联合运行的子集上,这将恢复率从唯一存在于所有三种配置中的架构的34.4%提高到50.1%。在肖克利二极管目标上,验证选择器恢复了该基线架构遗漏的情况,而该基线架构本身仅恢复0/32个种子。由于联合运行子集仅包含三种配置,选择器结果证明基于验证的架构选择是有前景的,而非完整的基准测试。这些结果支持将架构视为可测量的设计变量,应予以报告、压力测试,并使用保留验证集进行选择,而非先验固定。

英文摘要

Symbolic regression aims to recover closed-form expressions from numerical data, but in differentiable symbolic regression the recovered expression depends not only on the grammar but also on the fixed architecture through which variables are routed during training. This is relevant to signal-processing settings in which closed-form models and interpretable nonlinear structure are useful. This architecture-specific effect has rarely been isolated directly, because existing comparisons often vary architecture together with operator family, grammar, or search procedure. Three depth-3 architectures are compared across twenty-four operator--shape--leaf combinations, holding operator family, grammar, and training protocol fixed as far as possible while varying the variable-routing architecture. Recovery changes from $0/64$ to $64/64$ trials on the same target under an architecture-plus-native-training-protocol comparison. The best architecture on one target is the worst on another, and trees with two equal-depth subtrees fail in every configuration tested ($0/3{,}776$). As a proof-of-concept mitigation, a small architecture set is trained and the hardened expression with the lowest held-out RMSE is selected. On the jointly-run subset, this improves recovery from $34.4\%$ for the only architecture present in all three configurations to $50.1\%$. On a Shockley diode target, the validation selector recovers cases missed by that baseline architecture, which by itself recovers $0/32$ seeds. Since the jointly-run subset contains only three configurations, the selector result is evidence that validation-based architecture selection is promising, not a complete benchmark. These results support treating architecture as a measurable design variable that should be reported, stress-tested, and selected using held-out validation rather than fixed a priori.
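The validation-based selection step can be sketched as follows: train several architectures, harden each into a closed-form expression, and keep the one with the lowest held-out RMSE. The candidate expressions, architecture names, and data below are invented for illustration and stand in for the outputs of real training runs.

```python
import math


def rmse(expr, xs, ys):
    """Root-mean-square error of a candidate expression on held-out data."""
    return math.sqrt(sum((expr(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs))


# Hypothetical hardened expressions, one per trained architecture.
candidates = {
    "arch_a": lambda x: x * x,
    "arch_b": lambda x: 2.0 * x,
    "arch_c": lambda x: math.exp(x) - 1.0,
}

# Held-out points drawn from the (here known) target y = x**2.
held_out_x = [0.5, 1.5, 2.5]
held_out_y = [x * x for x in held_out_x]

scores = {name: rmse(expr, held_out_x, held_out_y) for name, expr in candidates.items()}
best = min(scores, key=scores.get)
print(best, scores)
```

The selector needs no knowledge of the true expression, only held-out data, which is why it can recover targets that a single fixed architecture misses.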

2604.21654 2026-05-29 cs.CV cs.AI 版本更新

Causal Disentanglement-Inspired Degradation Representation Learning for Full-Reference Image Quality Assessment

因果解耦启发的退化表示学习用于全参考图像质量评估

Zhen Zhang, Jielei Chu, Tian Zhang, Lin Ma, Fengmao Lv, Weide Liu, Tianrui Li, Yuming Fang

发表机构 * School of Computing and Artificial Intelligence, Southwest Jiaotong University(计算机与人工智能学院,西南交通大学) School of Transportation and Logistics, Southwest Jiaotong University(交通运输与物流学院,西南交通大学) School of Physics, Northeast Normal University(物理学院,东北师范大学) School of Computing and Artificial Intelligence, Jiangxi University of Finance and Economics(计算机与人工智能学院,江西财经大学) School of Information Management, Jiangxi University of Finance and Economics(信息管理学院,江西财经大学)

AI总结 提出基于因果推断和解耦表示学习的全参考图像质量评估新范式,通过干预潜在表示实现退化估计,在多种设置和跨域场景中表现优异。

详情
AI中文摘要

现有的基于深度网络的全参考图像质量评估(FR-IQA)模型通常通过对参考图像和失真图像的深度特征进行成对比较来工作。在本文中,我们从不同的角度处理这个问题,提出了一种基于因果推断和解耦表示学习的新型FR-IQA范式。与典型的基于特征比较的FR-IQA模型不同,我们的方法将退化估计表述为一个由对潜在表示进行干预引导的因果解耦过程。我们首先利用参考图像和失真图像之间的内容不变性来解耦退化表示和内容表示。其次,受人类视觉掩蔽效应的启发,我们设计了一个掩蔽模块来建模图像内容与退化特征之间的因果关系,从而从失真图像中提取受内容影响的退化特征。最后,通过监督回归或无标签降维从这些退化特征预测质量分数。大量实验表明,我们的方法在全监督、少标签和无标签设置的标准IQA基准上取得了极具竞争力的性能。此外,我们还在数据稀缺的多种非标准自然图像域(包括水下、放射线、医学、中子和屏幕内容图像)上评估了该方法。得益于其能够在没有标记IQA数据的情况下进行场景特定训练和预测的能力,我们的方法在跨域泛化方面优于现有的无训练FR-IQA模型。

英文摘要

Existing deep network-based full-reference image quality assessment (FR-IQA) models typically work by performing pairwise comparisons of deep features from the reference and distorted images. In this paper, we approach this problem from a different perspective and propose a novel FR-IQA paradigm based on causal inference and decoupled representation learning. Unlike typical feature comparison-based FR-IQA models, our approach formulates degradation estimation as a causal disentanglement process guided by intervention on latent representations. We first decouple degradation and content representations by exploiting the content invariance between the reference and distorted images. Second, inspired by the human visual masking effect, we design a masking module to model the causal relationship between image content and degradation features, thereby extracting content-influenced degradation features from distorted images. Finally, quality scores are predicted from these degradation features using either supervised regression or label-free dimensionality reduction. Extensive experiments demonstrate that our method achieves highly competitive performance on standard IQA benchmarks across fully supervised, few-label, and label-free settings. Furthermore, we evaluate the approach on diverse non-standard natural image domains with scarce data, including underwater, radiographic, medical, neutron, and screen-content images. Benefiting from its ability to perform scenario-specific training and prediction without labeled IQA data, our method exhibits superior cross-domain generalization compared to existing training-free FR-IQA models.

2604.20443 2026-05-29 cs.CL cs.AI cs.LG 版本更新

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

DialToM:用于预测状态驱动对话轨迹的心智理论基准

Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

发表机构 * Singapore Management University(新加坡管理大学) Australian National University(澳大利亚国立大学)

AI总结 提出DialToM基准,通过多选评估框架从自然对话中构建,揭示LLMs在推断心理状态(字面ToM)与利用其进行社会预测(功能ToM)之间的系统性推理不对称性,并证明领域专家与AI之间存在显著能力差距。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

我们介绍了DialToM,一个基于自然人类对话构建的带注释的心智理论(ToM)基准,采用多选评估框架。与近期在合成环境中显示显式心理状态推断与应用ToM之间存在差距的工作一致,我们建立了一个更严格的“状态驱动诊断探针”,要求模型仅从孤立的心理状态特征(无对话上下文)预测状态一致的对话轨迹。我们的评估揭示了系统性的推理不对称性——LLMs在推断心理状态(字面ToM)方面表现出色,但在利用它们进行社会预测(功能ToM)方面存在困难。关键的是,领域专家在此任务上达到100%准确率,证明了其有效性,并揭示了人类与AI之间的显著能力差距。此外,教师-学生推理注入探针显示,Gemini 3 Pro(建立了领先基线)具备强大的功能ToM能力,可用于无上下文预测,且该能力可迁移至较弱模型。DialToM、其评估代码和数据集公开于https://github.com/Stealth-py/DialToM。

英文摘要

We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice evaluation framework. Concurrent with recent work showing a gap between explicit mental-state inference and applied ToM in synthetic settings (SimpleToM; Gu et al., 2024), we establish a stricter State-Driven Diagnostic Probe in which models must forecast state-consistent dialogue trajectories solely from isolated mental-state profiles without dialogue context. Our evaluation reveals a systematic reasoning asymmetry -- LLMs excel at inferring mental states (Literal ToM) but struggle to leverage them for social forecasting (Functional ToM). Crucially, a domain expert achieves 100% accuracy on this task, proving its validity and establishing a stark human-AI capability gap. Further, a teacher-student reasoning injection probe shows that Gemini 3 Pro -- which establishes the leading baseline -- possesses robust Functional ToM capabilities for context-free forecasting that are transferable to weaker models. DialToM, its evaluation code, and dataset are publicly available at https://github.com/Stealth-py/DialToM.

2604.18847 2026-05-29 cs.AI cs.CL 版本更新

Human-Guided Harm Recovery for Computer Use Agents

面向计算机使用代理的人类引导式危害恢复

Christy Li, Sky CH-Wang, Andi Peng, Andreea Bobu

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 针对LM代理在计算机系统中执行操作后的危害恢复问题,通过用户研究定义偏好对齐的恢复维度,提出基于奖励模型对候选恢复计划重排序的方法,并构建BackBench基准测试,实验表明该方法优于基线代理。

详情
AI中文摘要

随着LM代理获得在真实计算机系统上执行操作的能力,我们不仅需要大规模预防有害行为的方法,还需要在预防失败时有效修复危害。我们形式化了后执行安全中这一被忽视的挑战的解决方案——危害恢复:即根据人类偏好,将代理从有害状态最优地引导回安全状态的问题。通过一项形成性用户研究,我们确定了偏好对齐的恢复维度,并生成了自然语言评分标准,从而为偏好对齐的恢复奠定基础。我们的1130个成对判断数据集揭示了属性重要性的上下文相关变化,例如偏好实用、有针对性的策略而非全面的长期方法。我们将这些学习到的见解操作化为一个奖励模型,在测试时对代理框架生成的多个候选恢复计划进行重排序。为了系统性地评估恢复能力,我们引入了BackBench,一个包含50个计算机使用任务的基准测试,用于测试代理从有害状态中恢复的能力。人工评估表明,我们的奖励模型框架比基础代理和基于评分标准的框架产生更高质量的恢复轨迹。这些贡献共同为新型代理安全方法奠定了基础——这些方法不仅通过预防来应对危害,而且通过有意图的对齐来应对危害的后果。

英文摘要

As LM agents gain the ability to execute actions on real computer systems, we need ways to not only prevent harmful actions at scale but also effectively remediate harm when prevention fails. We formalize a solution to this neglected challenge in post-execution safeguards as harm recovery: the problem of optimally steering an agent from a harmful state back to a safe one in alignment with human preferences. We ground preference-aligned recovery through a formative user study that identifies valued recovery dimensions and produces a natural language rubric. Our dataset of 1,130 pairwise judgments reveals context-dependent shifts in attribute importance, such as preferences for pragmatic, targeted strategies over comprehensive long-term approaches. We operationalize these learned insights in a reward model, re-ranking multiple candidate recovery plans generated by an agent scaffold at test time. To evaluate recovery capabilities systematically, we introduce BackBench, a benchmark of 50 computer-use tasks that test an agent's ability to recover from harmful states. Human evaluation shows our reward model scaffold yields higher-quality recovery trajectories than base agents and rubric-based scaffolds. Together, these contributions lay the foundation for a new class of agent safety methods -- ones that confront harm not only by preventing it, but by navigating its aftermath with alignment and intent.
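The test-time re-ranking step can be sketched as below. The paper's reward model is learned from pairwise human judgments; here a hand-written weighted rubric stands in for it, and every attribute name, weight, and candidate plan is hypothetical.

```python
def rubric_reward(plan, weights):
    """Stand-in reward: weighted sum of per-attribute scores in [0, 1]."""
    return sum(weights[k] * plan["scores"].get(k, 0.0) for k in weights)


def rerank(plans, weights):
    """Order candidate recovery plans from highest to lowest reward."""
    return sorted(plans, key=lambda p: rubric_reward(p, weights), reverse=True)


# Hypothetical weights favoring pragmatic, targeted recovery.
weights = {"targeted": 0.6, "speed": 0.3, "thoroughness": 0.1}

plans = [
    {"name": "reinstall_entire_system",
     "scores": {"targeted": 0.1, "speed": 0.1, "thoroughness": 1.0}},
    {"name": "restore_single_file_from_backup",
     "scores": {"targeted": 0.9, "speed": 0.8, "thoroughness": 0.3}},
    {"name": "ignore_and_continue",
     "scores": {"targeted": 0.0, "speed": 1.0, "thoroughness": 0.0}},
]

ranked = rerank(plans, weights)
print([p["name"] for p in ranked])
```

In the setting the abstract describes, the agent scaffold would generate the candidates and then execute the top-ranked plan.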

2604.17176 2026-05-29 eess.SY cs.AI cs.SY math.OC 版本更新

Intent-aligned Autonomous Spacecraft Guidance via Reasoning Models

通过推理模型实现意图对齐的自主航天器制导

Yuji Takubo, Simone D'Amico

发表机构 * Stanford University(斯坦福大学)

AI总结 提出一种通过行为序列和航点约束将高层推理与安全轨迹优化相结合的意图对齐航天器制导框架,在近距离操作场景中实现了超过90%的SCP收敛率,且生成满足顶级意图优先性能标准的轨迹的比率达到启发式决策的1.5倍。

Comments Accepted for Computer Vision and Pattern Recognition Conference (CVPR) 2026, AI4Space Workshop (4-page Short paper). 9 pages, 3 figures (including supplementary materials)

详情
AI中文摘要

未来的航天器操作需要能够解释高层任务意图同时保持安全性的自主性。然而,现有的轨迹优化仍然严重依赖专家设计的问题建模,并且不支持意图条件决策。本文提出了一种意图对齐的航天器制导框架,通过显式的中间抽象(基于行为序列和航点约束)将高层推理与安全轨迹优化联系起来。基础模型首先预测意图对齐的行为计划,然后航点生成模型将其转换为航点约束,最后通过优化计算安全轨迹。这种分解能够在不牺牲安全性的情况下实现可扩展的监督。在近距离操作场景中的数值实验表明,所提出的流程实现了超过90%的SCP收敛率,并且生成满足顶级意图优先性能标准的轨迹的比率达到启发式决策的1.5倍。这些结果支持将中间行为抽象作为基础模型推理与安全关键型星载航天器自主性之间的实用接口。

英文摘要

Future spacecraft operations require autonomy that can interpret high-level mission intent while preserving safety. However, existing trajectory optimization still relies heavily on expert-crafted formulations and does not support intent-conditioned decision-making. This paper proposes an intent-aligned spacecraft guidance framework that links high-level reasoning and safe trajectory optimization through explicit intermediate abstractions, based on behavior sequences and waypoint constraints. A foundation model first predicts an intent-aligned behavior plan, a waypoint generation model then converts it into waypoint constraints, and the safe trajectory is computed via optimization. This decomposition enables scalable supervision without sacrificing safety. Numerical experiments in close-proximity operation scenarios demonstrate that the proposed pipeline achieves over 90\% SCP convergence and yields a $1.5\times$ higher rate of generating trajectories that satisfy the top intent-prioritized performance criteria than heuristic decision-making. These results support the use of intermediate behavior abstraction as a practical interface between foundation-model reasoning and safety-critical onboard spacecraft autonomy.

2604.11088 2026-05-29 cs.AI cs.CL 版本更新

Guardrails Beat Guidance: A Large-Scale Study of Rules, Skills, and Persistent Configuration for Coding Agents

护栏优于指导:关于编码智能体的规则、技能和持久配置的大规模研究

Xing Zhang, Guanghui Wang, Yanwei Cui, Wei Qiu, Ziyuan Li, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center(AWS生成式人工智能创新中心) HSBC Holdings Plc., HSBC Technology Center, China(汇丰控股有限公司,汇丰技术中心,中国)

AI总结 通过大规模实验发现,随机规则与专家规则对编码智能体性能提升相当,且有益规则均为负面约束,有害规则均为正面指令,提出应使用约束而非指导来配置智能体。

详情
AI中文摘要

随机规则对编码智能体任务性能的提升与专家精心设计的规则相当(在SWE-bench Verified的判别子集上均提升$+13.8$个百分点),并且在我们的数据中,每条单独有益的规则都是负面约束(“不要重构无关代码”),而每条单独有害的规则都是正面指令(“遵循代码风格”)。我们通过首次对智能体规则文件(\texttt{CLAUDE.md}、\texttt{.cursorrules}以及更广泛的智能体技能、插件清单和角色定义系列)进行大规模受控研究得出这些发现:我们从GitHub抓取了679个规则文件(共25,532条规则),并使用Claude Opus 4.6在SWE-bench Verified上进行了超过5,000次Claude Code智能体运行。出现了三种模式。(i)规则极性清晰地区分了有益规则和有害规则;我们通过基于势能的奖励塑形(PBRS)的视角来解读这一点。(ii)性能提升在很大程度上与内容无关:随机、打乱、领域不匹配和未转换格式的规则文件均与精心设计的规则相匹配,指向一种上下文启动机制。(iii)单独的规则通常看起来有害,但在集成中并未明显累积损害:在规则数量从0到50的范围内,通过率保持稳定。这些发现揭示了快速增长的社区编写规则和技能生态系统中隐藏的可靠性风险,并得出了更安全智能体配置的明确原则:约束智能体不能做什么,而不是规定它应该做什么。

英文摘要

Random rules improve a coding agent's task performance as much as expert-curated ones (both $+13.8$pp on a discriminative subset of SWE-bench Verified), and in our data every individually beneficial rule is a negative constraint ("do not refactor unrelated code"), while every individually harmful one is a positive directive ("follow code style"). We arrive at these findings through the first large-scale controlled study of agent rule files (\texttt{CLAUDE.md}, \texttt{.cursorrules}, and the broader family of agent skills, plugin manifests, and persona definitions): we scrape 679 rule files (25{,}532 rules) from GitHub and conduct over 5{,}000 agent runs of Claude Code with Claude Opus 4.6 on SWE-bench Verified. Three patterns emerge. (i) Rule polarity cleanly separates beneficial from harmful rules; we read this through the lens of potential-based reward shaping (PBRS). (ii) Performance gains are largely content-independent: random, shuffled, mismatched-domain, and unconverted-format rule files all match curated rules, pointing to a context priming mechanism. (iii) Individual rules often appear harmful in isolation yet do not visibly accumulate damage in ensemble: pass rates remain stable across rule counts from 0 to 50. These findings expose a hidden reliability risk in the rapidly growing ecosystem of community-authored rules and skills, and they yield a clear principle for safer agent configuration: constrain what agents must not do, rather than prescribing what they should.

2604.11080 2026-05-29 cs.CV cs.AI 版本更新

ReSpinQuant: Efficient Layer-Wise LLM Quantization via Subspace Residual Rotation Approximation

ReSpinQuant: 通过子空间残差旋转近似实现高效逐层大模型量化

Suyoung Kim, Sunghyun Wee, Hyeonjin Kim, Kyomin Hwang, Hyunho Lee, Nojun Kwak

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出ReSpinQuant框架,通过离线激活旋转融合和高效子空间残差旋转匹配基,解决逐层量化方法在线计算开销大的问题,在W4A4和W3A3量化上达到最优性能。

Comments ICML 2026

详情
AI中文摘要

基于旋转的后训练量化(PTQ)已成为缓解大型语言模型(LLMs)量化中激活值异常值的有前景的解决方案。全局旋转方法通过将激活旋转融合到注意力块和前馈网络块中实现推理效率,但由于受限于在所有层中使用单一可学习旋转矩阵,其表达能力有限。为了解决这一问题,出现了逐层变换方法,通过局部自适应实现了更高的精度。然而,逐层方法无法将激活旋转矩阵融合到权重中,需要在线计算并导致显著开销。在本文中,我们提出ReSpinQuant,一种量化框架,通过利用离线激活旋转融合和使用高效残差子空间旋转匹配基来解决此类开销。这种设计调和了逐层自适应的高表达性与仅可忽略的推理开销。在W4A4和W3A3量化上的大量实验表明,ReSpinQuant实现了最先进的性能,优于全局旋转方法,并以最小开销匹配计算昂贵的逐层方法的精度。

英文摘要

Rotation-based Post-Training Quantization (PTQ) has emerged as a promising solution for mitigating activation outliers in the quantization of Large Language Models (LLMs). Global rotation methods achieve inference efficiency by fusing activation rotations into attention and FFN blocks, but suffer from limited expressivity as they are constrained to use a single learnable rotation matrix across all layers. To tackle this, layer-wise transformation methods emerged, achieving superior accuracy through localized adaptation. However, layer-wise methods cannot fuse activation rotation matrices into weights, requiring online computations and causing significant overhead. In this paper, we propose ReSpinQuant, a quantization framework that resolves such overhead by leveraging offline activation rotation fusion and matching basis using efficient residual subspace rotation. This design reconciles the high expressivity of layer-wise adaptation with only negligible inference overhead. Extensive experiments on W4A4 and W3A3 quantization demonstrate that ReSpinQuant achieves state-of-the-art performance, outperforming global rotation methods and matching the accuracy of computationally expensive layer-wise methods with minimal overhead.

2604.06811 2026-05-29 cs.CR cs.AI 版本更新

SkillTrojan: Backdoor Attacks on Skill-Based Agent Systems

SkillTrojan:基于技能智能体系统的后门攻击

Yunhao Feng, Yifan Ding, Yingshui Tan, Boren Zheng, Yanming Guo, Xiaolong Li, Kun Zhai, Yishan Li, Wenke Huang

发表机构 * Alibaba Group(阿里巴巴集团) National University of Defense Technology(国防科技大学) Fudan University(复旦大学) Wuhan University(武汉大学)

AI总结 提出SkillTrojan,一种针对技能实现而非模型参数的后门攻击方法,通过将恶意逻辑嵌入看似正常的技能中,利用技能组合重构并执行攻击者指定的负载,在保持良性行为的同时实现高攻击成功率。

详情
AI中文摘要

基于技能的智能体系统通过组合可复用技能来处理复杂任务,提高了模块化和可扩展性,同时引入了一个几乎未被审视的安全攻击面。我们提出SkillTrojan,一种针对技能实现而非模型参数或训练数据的后门攻击。SkillTrojan将恶意逻辑嵌入看似合理的技能中,并利用标准技能组合来重构和执行攻击者指定的负载。该攻击将加密负载分割到多个看似良性的技能调用中,仅在预定义触发条件下激活。SkillTrojan还支持从任意技能模板自动合成带后门的技能,从而在基于技能的智能体生态系统中实现可扩展传播。为了进行系统评估,我们发布了一个包含3000多个精心策划的带后门技能的数据集,涵盖多种技能模式和触发-负载配置。我们在一个代表性的基于代码的智能体设置中实例化SkillTrojan,并评估了干净任务效用和攻击成功率。结果表明,技能级后门可以非常有效,同时对良性行为的退化最小,暴露了当前基于技能的智能体架构中的一个关键盲点,并促使防御机制明确考虑技能组合和执行。具体来说,在EHR SQL上,SkillTrojan在GPT-5.2-1211-Global上实现了高达97.2%的攻击成功率,同时保持了89.3%的干净准确率。

英文摘要

Skill-based agent systems tackle complex tasks by composing reusable skills, improving modularity and scalability while introducing a largely unexamined security attack surface. We propose SkillTrojan, a backdoor attack that targets skill implementations rather than model parameters or training data. SkillTrojan embeds malicious logic inside otherwise plausible skills and leverages standard skill composition to reconstruct and execute an attacker-specified payload. The attack partitions an encrypted payload across multiple benign-looking skill invocations and activates only under a predefined trigger. SkillTrojan also supports automated synthesis of backdoored skills from arbitrary skill templates, enabling scalable propagation across skill-based agent ecosystems. To enable systematic evaluation, we release a dataset of 3,000+ curated backdoored skills spanning diverse skill patterns and trigger-payload configurations. We instantiate SkillTrojan in a representative code-based agent setting and evaluate both clean-task utility and attack success rate. Our results show that skill-level backdoors can be highly effective with minimal degradation of benign behavior, exposing a critical blind spot in current skill-based agent architectures and motivating defenses that explicitly reason about skill composition and execution. Concretely, on EHR SQL, SkillTrojan attains up to 97.2% ASR while maintaining 89.3% clean ACC on GPT-5.2-1211-Global.

2604.05157 2026-05-29 cs.AI 版本更新

IntentScore: Intent-Conditioned Action Evaluation for Computer-Use Agents

IntentScore: 面向计算机使用智能体的意图条件动作评估

Rongqian Chen, Yu Li, Zeyu Fang, Sizhe Tang, Weidong Cao, Tian Lan

发表机构 * George Washington University(乔治·华盛顿大学)

AI总结 提出IntentScore,一种基于计划感知的奖励模型,通过对比对齐和边际排序学习评估动作质量,在OSWorld上提升任务成功率6.9个百分点。

详情
AI中文摘要

计算机使用智能体(CUA)利用大型语言模型在桌面环境中执行GUI操作,但它们生成动作时不评估动作质量,导致不可逆的错误级联到后续步骤。我们提出IntentScore,一种计划感知的奖励模型,从跨越三个操作系统的398K离线GUI交互步骤中学习对候选动作进行评分。IntentScore通过两个互补目标进行训练:状态-动作相关性的对比对齐和动作正确性的边际排序。在架构上,它将每个候选的计划意图嵌入动作编码器,从而能够区分具有相似动作但不同理由的候选。IntentScore在保留评估上达到97.5%的成对区分准确率。作为Agent S3在OSWorld(训练中完全未见的环境)上的重排序器,IntentScore将任务成功率提高了6.9个百分点,表明从异构离线轨迹中学到的奖励估计可以泛化到未见过的智能体和任务分布。

英文摘要

Computer-Use Agents (CUAs) leverage large language models to execute GUI operations on desktop environments, yet they generate actions without evaluating action quality, leading to irreversible errors that cascade through subsequent steps. We propose IntentScore, a plan-aware reward model that learns to score candidate actions from 398K offline GUI interaction steps spanning three operating systems. IntentScore trains with two complementary objectives: contrastive alignment for state-action relevance and margin ranking for action correctness. Architecturally, it embeds each candidate's planning intent in the action encoder, enabling discrimination between candidates with similar actions but different rationales. IntentScore achieves 97.5% pairwise discrimination accuracy on held-out evaluation. Deployed as a re-ranker for Agent S3 on OSWorld, an environment entirely unseen during training, IntentScore improves task success rate by 6.9 points, demonstrating that reward estimation learned from heterogeneous offline trajectories generalizes to unseen agents and task distributions.

2604.01473 2026-05-29 cs.CR cs.AI 版本更新

SelfGrader: LLM Jailbreak Detection via Anchored Token-Level Logits

SelfGrader: 基于锚定令牌级对数概率的LLM越狱检测

Zikai Zhang, Rui Hu, Olivera Kotevska, Jiahao Xu

发表机构 * Department of Computer Science and Engineering, University of Nevada, Reno(内华达大学里诺分校计算机科学与工程系) Oak Ridge National Laboratory(橡树岭国家实验室)

AI总结 提出SelfGrader方法,利用锚定令牌级对数概率将越狱检测转化为数值评分问题,实现低延迟、低误报率的鲁棒检测。

详情
AI中文摘要

大型语言模型(LLM)是回答用户查询的强大工具,但仍然极易受到越狱攻击。现有的护栏方法通常依赖内部特征或文本响应来检测恶意查询,这要么引入大量延迟,要么受文本生成随机性的影响。为了克服这些限制,我们提出SelfGrader,一种轻量级护栏方法,它将越狱检测表述为使用锚定令牌级对数概率的数值评分问题。具体来说,SelfGrader在一组紧凑的数值令牌(NT)(例如0-9)内评估用户查询的安全性,并将其对数概率分布解释为内部安全信号。为了将这些信号与目标安全准则对齐,SelfGrader构建了由概率近似正确(PAC)理论引导的ICL锚定示例,并引入了双视角评分规则,同时考虑查询的恶意性和良性,从而产生稳定且可解释的分数,既反映危害性,又降低误报率。跨不同越狱基准、自适应攻击、良性提示基准、多个LLM和最先进的护栏基线的广泛实验表明,SelfGrader以较低的误报率、内存开销和延迟实现了强鲁棒性。

英文摘要

Large Language Models (LLMs) are powerful tools for answering user queries, yet they remain highly vulnerable to jailbreak attacks. Existing guardrail methods typically rely on internal features or textual responses to detect malicious queries, which either introduce substantial latency or suffer from randomness in text generation. To overcome these limitations, we propose SelfGrader, a lightweight guardrail method that formulates jailbreak detection as a numerical grading problem using anchored token-level logits. Specifically, SelfGrader evaluates the safety of a user query within a compact set of numerical tokens (NTs) (e.g., 0-9) and interprets their logit distribution as an internal safety signal. To align these signals with the target safety rubric, SelfGrader constructs Probably Approximately Correct-guided ICL anchor examples and introduces a dual-perspective scoring rule that considers both the maliciousness and benignness of the query, yielding a stable and interpretable score that reflects harmfulness and reduces the false positive rate simultaneously. Extensive experiments across diverse jailbreak benchmarks, adaptive attacks, benign prompt benchmarks, multiple LLMs, and state-of-the-art guardrail baselines demonstrate that SelfGrader achieves strong robustness with low false positive rates, memory overhead, and latency.
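The grading idea in the abstract, reading a safety signal from the logits of the digit tokens 0-9, can be illustrated with a minimal sketch. It assumes the ten logits have already been extracted from a model; the function names `expected_grade` and `is_flagged` and the threshold value are hypothetical, and the paper's PAC-guided anchor examples and dual-perspective scoring rule are not reproduced here.

```python
import math


def expected_grade(digit_logits):
    """Softmax over the logits of tokens '0'..'9', then the expected digit.

    Assumption: digit_logits[d] is the logit of the token for digit d.
    """
    if len(digit_logits) != 10:
        raise ValueError("expected exactly 10 logits, one per digit 0-9")
    peak = max(digit_logits)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in digit_logits]
    total = sum(weights)
    return sum(d * w / total for d, w in enumerate(weights))


def is_flagged(digit_logits, threshold=5.0):
    """Hypothetical decision rule: flag when the expected grade exceeds a threshold."""
    return expected_grade(digit_logits) > threshold
```

Because the score comes from a single forward pass over a fixed token set, no text has to be sampled, which is consistent with the low-latency motivation stated in the abstract.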

2603.27150 2026-05-29 cs.AI cs.MA 版本更新

MediHive: A Decentralized Agent Collective for Medical Reasoning

MediHive:用于医学推理的去中心化智能体集体

Xiaoyang Wang, Christopher C. Yang

发表机构 * College of Computing and Informatics, Drexel University(德雷塞尔大学计算与信息学院)

AI总结 提出一种去中心化多智能体框架MediHive,通过共享记忆池和迭代融合机制,使LLM智能体自主分配角色、进行条件性基于证据的辩论并融合观点,在MedQA和PubMedQA上分别达到84.3%和78.4%的准确率。

Comments Accepted by Journal of Healthcare Informatics Research

详情
AI中文摘要

大型语言模型(LLM)已经革新了医学推理任务,但单智能体系统在处理需要稳健处理不确定性和冲突证据的复杂跨学科问题时常常表现不佳。利用LLM的多智能体系统(MAS)能够实现协作智能,但主流的集中式架构在资源受限环境中存在可扩展性瓶颈、单点故障和角色混淆问题。去中心化MAS(D-MAS)通过点对点交互承诺增强自主性和弹性,但其在高风险医疗领域的应用仍未充分探索。我们提出了MediHive,一种新颖的去中心化多智能体医学问答框架,该框架将共享记忆池与迭代融合机制相结合。MediHive部署基于LLM的智能体,这些智能体自主分配专业角色、进行初始分析、通过条件性基于证据的辩论检测分歧,并在多轮中本地融合同伴见解以达成共识。实验表明,MediHive在MedQA和PubMedQA数据集上分别达到84.3%和78.4%的准确率,优于单LLM和集中式基线。我们的工作推进了用于医学AI的可扩展、容错D-MAS,解决了集中式设计的关键局限性,同时在推理密集型任务中展示了优越性能。

英文摘要

Large language models (LLMs) have revolutionized medical reasoning tasks, yet single-agent systems often falter on complex, interdisciplinary problems requiring robust handling of uncertainty and conflicting evidence. Multi-agent systems (MAS) leveraging LLMs enable collaborative intelligence, but prevailing centralized architectures suffer from scalability bottlenecks, single points of failure, and role confusion in resource-constrained environments. Decentralized MAS (D-MAS) promise enhanced autonomy and resilience via peer-to-peer interactions, but their application to high-stakes healthcare domains remains underexplored. We introduce MediHive, a novel decentralized multi-agent framework for medical question answering that integrates a shared memory pool with iterative fusion mechanisms. MediHive deploys LLM-based agents that autonomously self-assign specialized roles, conduct initial analyses, detect divergences through conditional evidence-based debates, and locally fuse peer insights over multiple rounds to achieve consensus. Empirically, MediHive outperforms single-LLM and centralized baselines on MedQA and PubMedQA datasets, attaining accuracies of 84.3% and 78.4%, respectively. Our work advances scalable, fault-tolerant D-MAS for medical AI, addressing key limitations of centralized designs while demonstrating superior performance in reasoning-intensive tasks.

2603.23971 2026-05-29 cs.CL cs.AI cs.GT cs.LG cs.MA 版本更新

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More

价格反转现象:当更便宜的推理模型成本更高时

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) CMU(卡内基梅隆大学) Microsoft Research(微软研究院)

AI总结 本文首次系统研究推理模型标价与实际成本的偏差,发现32%的模型对比较中存在价格反转现象,并基于Shapley值建立成本归因框架,揭示思考令牌消耗和交互轮次的高度异质性是主要原因。

详情
AI中文摘要

开发者和消费者越来越根据列出的API价格选择推理模型(RMs)。然而,这些价格在多大程度上准确反映了实际推理成本?我们首次系统研究这一问题,评估了8个前沿RM在12个不同任务上的表现,涵盖竞赛数学、科学问答、代码生成和多领域智能体。我们发现了定价反转现象:在32%的模型对比较中,标价较低的模型实际上产生了更高的总成本,反转幅度高达28倍。例如,Gemini 3 Flash的标价比GPT-5.4便宜80%,但其在所有任务上的实际成本却高出38%。我们基于Shapley值构建了一个正式的成本归因框架,并利用它将主要贡献因素追溯到思考令牌消耗和交互轮次数量的巨大异质性:对于同一查询,一个模型可能比另一个模型多使用900%的思考令牌,或多出10倍的环境交互轮次。我们进一步表明,每次查询的成本预测本质上是困难的:同一查询的重复运行产生的思考令牌变化高达9.7倍,为任何预测器建立了不可约的噪声底限。因此,我们提出成本分布预测作为一个开放挑战。我们的发现表明,列出的API定价是实际成本的不可靠代理,呼吁进行成本感知的模型选择和透明的每次请求成本监控。

英文摘要

Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RMs across 12 diverse tasks covering competition math, science QA, code generation, and multi-domain agents. We uncover the pricing reversal phenomenon: in 32% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 80% cheaper than GPT-5.4's, yet its actual cost across all tasks is 38% higher. We build a formal cost attribution framework based on Shapley value, and leverage it to trace the dominating contributors to vast heterogeneity in thinking token consumption and number of interaction turns: on the same query, one model may use 900% more thinking tokens than another, or 10x more turns of environment interactions. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Thus, we propose cost distribution prediction as an open challenge. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
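The Gemini 3 Flash versus GPT-5.4 example can be turned into a quick back-of-the-envelope check. The sketch below assumes that total cost is simply a single blended per-token price multiplied by the number of tokens consumed; real API pricing separates input, output, and thinking tokens, so the resulting ratio is illustrative and is not a figure reported in the abstract. The function names are hypothetical.

```python
def implied_token_ratio(listed_price_ratio, actual_cost_ratio):
    """Token-usage ratio implied by cost = price_per_token * tokens.

    Both arguments are ratios of the cheaper-listed model to the other model.
    """
    if listed_price_ratio <= 0:
        raise ValueError("listed_price_ratio must be positive")
    return actual_cost_ratio / listed_price_ratio


def is_price_reversal(listed_price_ratio, actual_cost_ratio):
    """A reversal: lower listed price, yet higher actual total cost."""
    return listed_price_ratio < 1.0 and actual_cost_ratio > 1.0


# 80% cheaper listed price -> price ratio 0.20; 38% higher actual cost -> cost ratio 1.38
ratio = implied_token_ratio(0.20, 1.38)
print(f"implied token usage: about {ratio:.1f}x")
print("price reversal:", is_price_reversal(0.20, 1.38))
```

Under this simplified model, the cheaper-listed model would need to consume roughly 6.9 times as many billed tokens to end up 38% more expensive, which is in line with the heterogeneity in thinking-token usage that the abstract describes.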

2603.18859 2026-05-29 cs.AI cs.CL cs.LG 版本更新

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

RewardFlow: 面向大语言模型智能体强化学习的拓扑感知状态图奖励传播

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

发表机构 * TMLR Group(TMLR小组) Hong Kong Baptist University(香港浸会大学) TCL Corporate Research (HK) Co Ltd(TCL企业研究(香港)有限公司) Cooperative Medianet Innovation Center Shanghai Jiao Tong University(上海交通大学未来媒体网络协同创新中心) Department of Mathematics Hong Kong Baptist University(香港浸会大学数学系)

AI总结 提出RewardFlow方法,通过构建状态图进行拓扑感知的奖励传播,为智能体推理提供无标注的密集奖励,显著提升强化学习性能。

详情
AI中文摘要

强化学习在增强大语言模型智能体推理方面展现出潜力,但稀疏的终端奖励阻碍了细粒度优化。过程奖励建模提供了一种替代方案,但带来了高计算成本、奖励黑客风险和标注瓶颈。我们引入RewardFlow,一种用于估计智能体推理中状态级奖励的轻量级方法。通过构建捕获轨迹内在拓扑结构的状态图,RewardFlow执行拓扑感知的传播以估计每个状态对成功的贡献,从而产生有原则的、无标注的密集奖励。用于强化学习优化时,RewardFlow在四个智能体基准测试中显著优于先前基线:在基于文本的任务上平均成功率提高6.2%,在视觉推理上跨三个模型尺度比最强基线提高29.7%,在DeepResearch上准确率提高10%,同时具有卓越的鲁棒性和训练效率。RewardFlow的实现已在https://github.com/tmlr-group/RewardFlow公开。

英文摘要

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.

2603.16673 2026-05-29 cs.RO cs.AI cs.LG 版本更新

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

机器人何时应该思考?基于强化学习的资源感知推理在具身机器人决策中的应用

Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Northeastern University(东北大学) Harvard University(哈佛大学) Cornell University(康奈尔大学) MIT(麻省理工学院) Fujitsu Research of America(美国富士通研究) Tsinghua University(清华大学) Peking University(北京大学) University of Georgia(佐治亚大学) Florida International University(佛罗里达国际大学) EmbodyX Inc(EmbodyX公司) Cisco Systems(思科系统)

AI总结 提出RARRL框架,通过强化学习学习高层编排策略,使具身代理能自适应决定是否调用LLM推理、选择推理角色及分配计算预算,以平衡推理开销与任务成功率。

详情
AI中文摘要

具身机器人系统越来越依赖基于大语言模型(LLM)的代理来支持与环境交互过程中的高级推理、规划和决策。然而,调用LLM推理会引入大量的计算延迟和资源开销,这可能会中断动作执行并降低系统可靠性。过多的推理可能延迟动作,而推理不足则常常导致错误决策和任务失败。这引出了具身代理的一个基本问题:代理何时应该推理,何时应该行动?在这项工作中,我们提出了RARRL(基于强化学习的资源感知推理),一个用于具身代理资源感知编排的分层框架。RARRL不是学习低级控制策略,而是学习一个在代理决策层运行的高级编排策略。该策略使代理能够根据当前观察、执行历史和剩余资源,自适应地决定是否调用推理、使用哪个推理角色以及分配多少计算预算。大量实验,包括使用来自ALFRED基准测试的经验延迟配置文件进行评估,表明与固定或启发式推理策略相比,RARRL在减少执行延迟和增强鲁棒性的同时,持续提高了任务成功率。这些结果表明,自适应推理控制对于构建可靠且高效的具身机器人代理至关重要。

英文摘要

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.

2603.14778 2026-05-29 cs.CR cs.AI 版本更新

P$^2$RAG: Efficient Privacy-Preserving RAG Service Supporting Arbitrary Top-$k$ Retrieval

P$^2$RAG: 支持任意Top-$k$检索的高效隐私保护RAG服务

Yulong Ming, Mingyue Wang, Jijia Yang, Jie Xu, Zihan Wu, Cong Wang, Xiaohua Jia

发表机构 * City University of Hong Kong(香港城市大学) Peng Cheng Laboratory(鹏城实验室)

AI总结 针对现有隐私保护RAG系统无法灵活支持大k值且效率低下的问题,提出基于秘密共享和交互式二分法的P$^2$RAG系统,避免排序开销,实现高效且安全的任意top-$k$检索。

Comments 14 pages, 3 figures

详情
AI中文摘要

检索增强生成(RAG)使大型语言模型能够利用外部知识,但外包RAG服务会引发数据所有者和用户的隐私担忧。隐私保护RAG系统通过执行安全的top-$k$检索来解决这些问题,通常使用安全排序来识别相关文档。然而,现有系统面临支持任意$k$的挑战,因为它们无法更改$k$,存在新的安全问题,尤其是当$k$较大时效率下降。这是一个重大限制,因为金融、法律和医疗等应用需要足够大的$k$,导致现有系统开销巨大。此外,现代长上下文模型通常通过更大的检索集获得更高的准确性。我们提出P$^2$RAG,一种高效隐私保护的RAG服务,支持任意top-$k$检索。与现有系统不同,P$^2$RAG避免对候选文档进行排序,而是使用交互式二分法来确定top-$k$文档集。在安全性方面,P$^2$RAG在两个半诚实非共谋服务器上使用秘密共享来保护数据所有者的数据库和用户的提示。它通过限制和验证来防御恶意用户,并严格限制数据库的信息泄露。实验表明,对于$k = 16$--$1024$,P$^2$RAG比最先进的PRAG快3--300倍。

英文摘要

Retrieval-Augmented Generation (RAG) enables large language models to use external knowledge, but outsourcing the RAG service raises privacy concerns for both data owners and users. Privacy-preserving RAG systems address these concerns by performing secure top-$k$ retrieval, which is typically implemented using secure sorting to identify relevant documents. However, existing systems face challenges supporting arbitrary $k$ due to their inability to change $k$, new security issues, and in particular, efficiency degradation with large $k$. This is a significant limitation because applications such as finance, law, and healthcare require a $k$ that is large enough to cause huge overhead for existing systems. Also, modern long-context models generally achieve higher accuracy with larger retrieval sets. We propose P$^2$RAG, an efficient privacy-preserving RAG service that supports arbitrary top-$k$ retrieval. Unlike existing systems, P$^2$RAG avoids sorting candidate documents. Instead, it uses an interactive bisection method to determine the set of top-$k$ documents. For security, P$^2$RAG uses secret sharing on two semi-honest non-colluding servers to protect the data owner's database and the user's prompt. It enforces restrictions and verification to defend against malicious users and tightly bounds the information leakage of the database. The experiments show that P$^2$RAG is 3--300$\times$ faster than the state-of-the-art PRAG for $k = 16$--$1024$.
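The abstract's core idea, finding the top-$k$ set by bisection instead of sorting, can be sketched in plaintext. This is only an analogue under stated assumptions: the real system runs over secret shares on two servers, whereas here the scores are visible and the "count how many scores exceed a threshold" step stands in for whatever secure comparison the protocol uses. The function name `top_k_by_bisection` and the tie-breaking rule are hypothetical.

```python
def top_k_by_bisection(scores, k, iters=60):
    """Return the indices of the k highest scores without sorting.

    Bisects on a threshold t until exactly k scores exceed t. Each step
    needs only a count of scores above t, never a full ordering.
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    lo = min(scores) - 1.0  # every score exceeds lo, so the count is >= k
    hi = max(scores)        # no score exceeds hi, so the count is < k
    for _ in range(iters):
        mid = (lo + hi) / 2.0
        count = sum(1 for s in scores if s > mid)
        if count == k:
            return [i for i, s in enumerate(scores) if s > mid]
        if count > k:
            lo = mid  # threshold too low: too many candidates
        else:
            hi = mid  # threshold too high: too few candidates
    # Ties at the k-th score: no threshold yields exactly k, so fill up
    # from the boundary band (lo, hi] in index order.
    above = [i for i, s in enumerate(scores) if s > hi]
    boundary = [i for i, s in enumerate(scores) if lo < s <= hi]
    return sorted(above + boundary[: k - len(above)])


print(top_k_by_bisection([0.1, 0.9, 0.4, 0.7, 0.3], k=2))  # prints [1, 3]
```

The number of bisection rounds depends on the score precision rather than on $k$, which gives some intuition for why such an approach can remain efficient as $k$ grows.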

2603.01006 2026-05-29 cs.SD cs.AI cs.LG cs.MM 版本更新

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

AG-REPA:音频流匹配中表示对齐的因果层选择

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

发表机构 * AI Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)信息枢纽人工智能学域)

AI总结 提出AG-REPA方法,通过前向门控消融量化各层对速度场的因果贡献,实现稀疏层选择和自适应加权对齐,在音频流匹配中优于传统REPA基线。

Comments Accepted to ICML 2026. 17 pages, 4 figures, 12 tables

详情
AI中文摘要

表示对齐(REPA)通过将中间隐藏状态与预训练教师特征对齐来改进生成流模型的训练,但在令牌条件音频流匹配中,其有效性关键取决于监督层的选择,而监督层通常基于深度启发式地选择。在这项工作中,我们引入了归因引导的表示对齐(AG-REPA),一种用于音频流匹配中表示对齐的新型因果层选择策略。首先,我们发现最能存储语义/声学信息(高教师空间相似性)的层不一定是那些对驱动生成的速度场贡献最大的层,我们称之为存储-贡献分离(SCD)。为了将这一见解转化为可操作的训练指导,我们提出了一种前向门控消融(FoG-A),通过预测速度场中的诱导变化来量化每个层的因果贡献,从而实现稀疏层选择和自适应加权对齐。在统一的语音和通用音频训练(LibriSpeech + AudioSet)中,在不同的令牌条件拓扑下,AG-REPA始终优于REPA基线。总体而言,我们的结果表明,当对齐应用于因果主导的驱动速度场的层时,而不是应用于表示丰富但功能被动的层时,对齐最为有效。

英文摘要

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on the depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. Firstly, we find that layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation, and we call it Store-Contribute Dissociation (SCD). To turn this insight into an actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.

2603.00991 2026-05-29 cs.AI cs.PL 版本更新

Tracking Capabilities for Safer Agents

通过能力追踪实现更安全的智能体

Martin Odersky, Yaoyu Zhao, Yichen Xu, Oliver Bračevac, Cao Nguyen Pham

发表机构 * EPFL(苏黎世联邦理工学院)

AI总结 提出通过Scala 3的捕获检查类型系统静态追踪能力,构建基于编程语言的智能体安全约束,防止信息泄露和恶意副作用。

详情
AI中文摘要

通过工具调用与现实世界交互的AI智能体带来了基本的安全挑战:智能体可能泄露私人信息、导致意外副作用或通过提示注入被操纵。为应对这些挑战,我们建议将智能体置于基于编程语言的“安全约束”中:智能体不直接调用工具,而是以能力安全的语言(支持捕获检查的Scala 3)中的代码表达其意图。能力是程序变量,用于调节对感兴趣的效果和资源的访问。Scala的类型系统静态追踪能力,提供对智能体行为的细粒度控制。特别是,它支持局部纯度,即强制子计算无副作用的能力,防止智能体处理机密数据时的信息泄露。我们展示了通过利用具有追踪能力的强类型系统,可以构建可扩展的智能体安全约束。实验表明,智能体能够生成能力安全的代码,而任务性能没有显著损失,同时类型系统可靠地防止了信息泄露和恶意副作用等不安全行为。

英文摘要

AI agents that interact with the real world through tool calls pose fundamental safety challenges: agents might leak private information, cause unintended side effects, or be manipulated through prompt injection. To address these challenges, we propose to put the agent in a programming-language-based "safety harness": instead of calling tools directly, agents express their intentions as code in a capability-safe language: Scala 3 with capture checking. Capabilities are program variables that regulate access to effects and resources of interest. Scala's type system tracks capabilities statically, providing fine-grained control over what an agent can do. In particular, it enables local purity, the ability to enforce that sub-computations are side-effect-free, preventing information leakage when agents process classified data. We demonstrate that extensible agent safety harnesses can be built by leveraging a strong type system with tracked capabilities. Our experiments show that agents can generate capability-safe code with no significant loss in task performance, while the type system reliably prevents unsafe behaviors such as information leakage and malicious side effects.

2603.00454 2026-05-29 cs.LG cs.AI 版本更新

Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

基于子模重放的根吸收前缀轨迹平衡用于GFlowNet训练

Xi Wang, Wenbo Lu, Shengjie Wang

发表机构 * Courant Institute School of Mathematics, Computing, and Data Science, New York University(纽约大学Courant研究所数学、计算与数据科学学院)

AI总结 针对GFlowNet的模式坍塌问题,提出RapTB目标函数(通过根锚定子轨迹监督和吸收后缀备份提供密集前缀学习信号)和SubM子模重放策略(促进高奖励和多样性),在分子生成等任务中提升优化性能和多样性。

详情
AI中文摘要

生成流网络(GFlowNets)能够微调大型语言模型以近似奖励比例的后验分布,但仍容易出现模式坍塌,表现为前缀坍塌和长度偏差。我们将此归因于两个因素:(i)对早期前缀的信用分配较弱,以及(ii)有偏的重放导致偏移的、非代表性的训练流分布。我们提出根吸收前缀轨迹平衡(RapTB),该目标函数将子轨迹监督锚定在根节点,并通过吸收后缀备份将终端奖励传播到中间前缀,从而提供密集的前缀级学习信号。为了减轻重放引起的分布偏移,我们进一步引入SubM,一种子模重放刷新策略,同时促进高奖励和多样性。实验表明,在使用SMILES字符串的分子生成等任务中,RapTB结合SubM持续提升优化性能和分子多样性,同时保持高有效性。

英文摘要

Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.

2602.12642 2026-05-29 cs.CL cs.AI 版本更新

Beyond Normalization: Rethinking the Partition Function as a Difficulty Scheduler for RLVR

超越归一化:重新审视配分函数作为RLVR的难度调度器

Dohyung Kim, Minbeom Kim, Jeonghye Kim, Sangmook Lee, Sojeong Rhee, Kyomin Jung

发表机构 * Seoul National University(首尔国立大学)

AI总结 本文提出PACED-RL框架,通过重新解释配分函数作为每提示期望奖励信号,利用其指导训练中的问题选择与重放,在保持生成多样性的同时提升样本效率。

详情
AI中文摘要

奖励最大化的RL方法已被证明能够增强LLM的推理性能,但往往导致生成多样性降低。近期工作通过采用GFlowNets来解决这一问题,训练LLM匹配目标分布的同时联合学习其配分函数。与先前将配分函数仅视为归一化器的工作不同,我们将其重新解释为每提示期望奖励(即在线准确率)信号,利用这一未使用的信息来提高样本效率。具体而言,我们首先建立了配分函数与每提示准确率估计之间的理论关系。基于这一关键见解,我们提出了配分函数引导的强化学习(PACED-RL),这是一个后训练框架,利用准确率估计在训练过程中优先考虑信息量大的问题提示,并通过以准确率估计误差为优先级的重放进一步提高样本效率。关键的是,这两个组件都重用了GFlowNet训练中已经产生的信息,有效地将计算开销摊销到现有优化过程中。跨多种基准的大量实验表明,与GRPO和先前的GFlowNet方法相比,性能有显著提升,突显了PACED-RL是实现面向LLM的更具样本效率的分布匹配训练的有前途方向。

英文摘要

Reward-maximizing RL methods have been shown to enhance the reasoning performance of LLMs, but often lead to reduced generation diversity. Recent works address this issue by adopting GFlowNets, training LLMs to match a target distribution while jointly learning its partition function. In contrast to prior works that treat this partition function solely as a normalizer, we reinterpret it as a per-prompt expected-reward (i.e., online accuracy) signal, leveraging this unused information to improve sample efficiency. Specifically, we first establish a theoretical relationship between the partition function and per-prompt accuracy estimates. Building on this key insight, we propose Partition Function-Guided RL (PACED-RL), a post-training framework that leverages accuracy estimates to prioritize informative question prompts during training, and further improves sample efficiency through replay prioritized by accuracy-estimate error. Crucially, both components reuse information already produced during GFlowNet training, effectively amortizing the compute overhead into the existing optimization process. Extensive experiments across diverse benchmarks demonstrate strong performance improvements over GRPO and prior GFlowNet approaches, highlighting PACED-RL as a promising direction for more sample-efficient distribution-matching training for LLMs.

2602.08013 2026-05-29 cs.AI 版本更新

Small Agent Group is the Future of Digital Health

小型智能体群是数字健康的未来

Yuqiao Meng, Luoxi Tang, Dazheng Zhang, Rafael Brens, Elvys J. Romero, Nancy Guo, Safa Elkefi, Zhaohan Xi

发表机构 * State University of New York at Binghamton(纽约州立大学宾汉姆顿分校) University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出小型智能体群(SAG)通过协作推理替代单一大型模型,在数字健康中实现更优的有效性、可靠性和部署效率。

Comments ICML'26

详情
AI中文摘要

大型语言模型(LLM)在数字健康中的快速采用是由“规模优先”理念驱动的,即假设临床智能随模型大小和数据量增加而提升。然而,现实世界的临床需求不仅包括有效性,还包括可靠性和合理的部署成本。由于临床决策本质上是协作性的,我们挑战单一模型扩展范式,并询问小型智能体群(SAG)是否能支持更好的临床推理。SAG通过协作审议过程分配推理、基于证据的分析和关键审计,从单一模型智能转向集体专业知识。为了评估SAG的临床实用性,我们使用涵盖有效性、可靠性和部署成本的多项临床指标进行了广泛评估。结果表明,无论是否进行额外优化或检索增强生成,SAG相比单一巨型模型都取得了更优的性能。这些发现表明,SAG所代表的协同推理可以在临床环境中替代模型参数增长。总体而言,SAG为数字健康提供了一种可扩展的解决方案,更好地平衡了有效性、可靠性和部署效率。

英文摘要

The rapid adoption of large language models (LLMs) in digital health has been driven by a "scaling-first" philosophy, i.e., the assumption that clinical intelligence increases with model size and data. However, real-world clinical needs include not only effectiveness, but also reliability and reasonable deployment cost. Since clinical decision-making is inherently collaborative, we challenge the monolithic scaling paradigm and ask whether a Small Agent Group (SAG) can support better clinical reasoning. SAG shifts from single-model intelligence to collective expertise by distributing reasoning, evidence-based analysis, and critical audit through a collaborative deliberation process. To assess the clinical utility of SAG, we conduct extensive evaluations using diverse clinical metrics spanning effectiveness, reliability, and deployment cost. Our results show that SAG achieves superior performance compared to a single giant model, both with and without additional optimization or retrieval-augmented generation. These findings suggest that the synergistic reasoning represented by SAG can substitute for model parameter growth in clinical settings. Overall, SAG offers a scalable solution to digital health that better balances effectiveness, reliability, and deployment efficiency.

2602.07044 2026-05-29 cs.CV cs.AI 版本更新

PipeMFL-240K: A Large-scale Dataset and Benchmark for Object Detection in Pipeline Magnetic Flux Leakage Imaging

PipeMFL-240K:管道磁通量泄漏成像中目标检测的大规模数据集与基准

Tianyi Qu, Songxiao Yang, Haolin Wang, Huadong Song, Xiaoting Guo, Wenguang Hu, Guanlin Liu, Honghe Chen, Yafei Ou

发表机构 * SINOMACH Sensing Technology Co., Ltd., Shenyang, Liaoning, China; Institute of Science Tokyo, Tokyo, Japan; Hokkaido University, Sapporo, Hokkaido, Japan

AI总结 为解决管道磁通量泄漏检测中缺乏大规模公开数据集和基准的问题,构建了包含249,320张图像和200,020个边界框标注的PipeMFL-240K数据集,并评估了现有目标检测器,揭示了其在长尾分布、小目标和类内变异等挑战下的性能不足。

Comments Accepted by ACM KDD 2026 Datasets and Benchmarks Track

详情
AI中文摘要

管道完整性对工业安全和环境保护至关重要,磁通量泄漏(MFL)检测是一种主要的无损检测技术。尽管深度学习在自动化MFL解释方面具有前景,但由于缺乏大规模公开数据集和基准,可靠模型的进展受到限制,导致公平比较和可重复评估困难。我们引入了PipeMFL-240K,这是一个大规模、精心标注的数据集和基准,用于管道MFL伪彩色图像中的复杂目标检测。PipeMFL-240K反映了真实检测的复杂性,并提出了几个独特挑战:(i) 覆盖12个类别的极端长尾分布,(ii) 大量仅包含少数像素的小目标,(iii) 显著的类内变异。该数据集包含249,320张图像和200,020个高质量边界框标注,采集自12条总长约1,530公里的管道。我们使用最先进的目标检测器进行了大量实验以建立基线。结果表明,现代检测器仍然难以应对MFL数据的固有特性,凸显了巨大的改进空间,而PipeMFL-240K为驱动未来研究提供了可靠且具有挑战性的试验平台。作为管道MFL检测领域首个公开数据集以及首个如此规模和范围的基准,它为高效的管道诊断和维护规划提供了关键基础,并有望加速基于MFL的管道完整性评估中的算法创新和可重复研究。

英文摘要

Pipeline integrity is critical to industrial safety and environmental protection, with Magnetic Flux Leakage (MFL) detection being a primary non-destructive testing technology. Despite the promise of deep learning for automating MFL interpretation, progress toward reliable models has been constrained by the absence of a large-scale public dataset and benchmark, making fair comparison and reproducible evaluation difficult. We introduce PipeMFL-240K, a large-scale, meticulously annotated dataset and benchmark for complex object detection in pipeline MFL pseudo-color images. PipeMFL-240K reflects real-world inspection complexity and poses several unique challenges: (i) an extremely long-tailed distribution over 12 categories, (ii) a high prevalence of tiny objects that often comprise only a handful of pixels, and (iii) substantial intra-class variability. The dataset contains 249,320 images and 200,020 high-quality bounding-box annotations, collected from 12 pipelines spanning approximately 1,530 km. Extensive experiments are conducted with state-of-the-art object detectors to establish baselines. Results show that modern detectors still struggle with the intrinsic properties of MFL data, highlighting considerable headroom for improvement, while PipeMFL-240K provides a reliable and challenging testbed to drive future research. As the first public dataset and the first benchmark of this scale and scope for pipeline MFL inspection, it provides a critical foundation for efficient pipeline diagnostics as well as maintenance planning and is expected to accelerate algorithmic innovation and reproducible research in MFL-based pipeline integrity assessment.

2602.02909 2026-05-29 cs.AI cs.FL cs.LG 版本更新

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

关于推理的推理:LLM中思维链令牌复杂度的BAPO界限

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, Jennifer Neville

发表机构 * Microsoft Research, Redmond, WA(微软研究院,华盛顿州雷德蒙德)

AI总结 通过扩展BAPO模型,证明二元多数、三元组匹配和图可达性三个任务需要Ω(n)个思维链令牌,实验验证了线性缩放与理论下界一致。

Comments 31 pages; accepted to ICML '26

详情
AI中文摘要

通过思维链(CoT)推理进行推理时扩展是当前最先进LLM性能的主要驱动力,但会带来显著的延迟和计算成本。我们解决了一个基本的理论问题:随着输入规模增长,需要多少推理令牌才能解决问题?通过扩展有界注意力前缀预言机(BAPO)模型——一种量化任务所需信息流的LLM抽象——我们证明了三个典型的BAPO困难任务所需的CoT令牌下界:二元多数、三元组匹配和图可达性。我们证明当输入规模为$n$时,每个任务需要$Ω(n)$个推理令牌。我们通过显式构造给出了匹配或接近匹配的上界。最后,我们在前沿推理模型上的实验显示,这些任务上的推理令牌数量近似线性缩放,且在推理预算受限时出现失败,这与我们的理论下界一致。总之,我们的结果识别了通过CoT进行推理时计算的基本瓶颈,并为分析最优推理长度提供了一种原则性工具。

英文摘要

Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model--an abstraction of LLMs that quantifies the information flow required to solve a task--we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires $Ω(n)$ reasoning tokens when the input size is $n$. We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.

2602.01058 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

好的SFT优化SFT,更好的SFT为强化学习做准备

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) New York University (Shanghai)(纽约大学(上海))

AI总结 针对当前SFT-RL流程中离线SFT数据分布与在线RL策略分布不匹配的问题,提出基于策略评估的离线学习损失重加权方法PEAR,通过重要性采样重加权SFT损失,提升后续RL训练效果。

详情
AI中文摘要

推理大语言模型的后训练是一个整体过程,通常包括离线SFT阶段和后续的在线强化学习(RL)阶段。然而,SFT通常被孤立地优化,仅追求最大化SFT性能。我们表明,在相同的RL训练后,从更强的SFT检查点初始化的模型可能显著劣于从较弱检查点初始化的模型。我们将此归因于当前SFT-RL流程中典型的错配:生成离线SFT数据的分布可能与在线RL期间优化的策略(该策略从其自身的rollout中学习)存在显著差异。我们提出PEAR(基于策略评估的离线学习损失重加权算法),这是一种在SFT阶段纠正此错配并让模型更好地为RL做准备的方法。PEAR使用重要性采样来重加权SFT损失,具有三种变体,分别在token、块和序列级别操作。它可以用于增强标准SFT目标,并且一旦收集到离线数据的概率,仅需很少的额外训练开销。我们在可验证推理游戏和数学推理任务上对Qwen 2.5和3以及DeepSeek蒸馏模型进行了控制实验。PEAR在标准SFT基础上持续提升了RL后性能,在AIME2025上pass@8增益高达14.6%。我们的结果表明,通过设计和评估SFT时考虑下游RL而非孤立进行,PEAR是迈向更全面的大语言模型后训练的有效一步。

英文摘要

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
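
The abstract does not give PEAR's exact weighting formula, so the following is only a minimal sketch of what importance-sampling reweighting at the token, block, and sequence levels could look like. The function names, the ratio direction (current policy over data-generating distribution), the clipping constant, and the block size are all assumptions made for illustration, not the authors' implementation.

```python
import math
from typing import List


def importance_weights(lp_policy: List[float], lp_data: List[float],
                       level: str = "token", block_size: int = 2,
                       clip: float = 2.0) -> List[float]:
    """Per-token weights from log-prob differences (hypothetical helper).

    lp_policy: log-probs of the SFT tokens under the model being trained.
    lp_data:   log-probs of the same tokens under the data-generating model.
    """
    if len(lp_policy) != len(lp_data) or not lp_policy:
        raise ValueError("log-prob lists must be non-empty and equal length")
    n = len(lp_policy)
    diffs = [p - q for p, q in zip(lp_policy, lp_data)]
    if level == "token":
        groups = [[i] for i in range(n)]
    elif level == "block":
        groups = [list(range(s, min(s + block_size, n)))
                  for s in range(0, n, block_size)]
    elif level == "sequence":
        groups = [list(range(n))]
    else:
        raise ValueError(f"unknown level: {level}")
    weights = [0.0] * n
    for group in groups:
        # Ratio of joint probabilities over the group, clipped for stability
        # (clipping is an assumption of this sketch).
        w = min(math.exp(sum(diffs[i] for i in group)), clip)
        for i in group:
            weights[i] = w
    return weights


def reweighted_sft_loss(lp_policy: List[float], weights: List[float]) -> float:
    """Weighted negative log-likelihood; weights are treated as constants."""
    return -sum(w * lp for w, lp in zip(weights, lp_policy)) / len(lp_policy)
```

When the two sets of log-probabilities coincide, every weight is 1 and the loss reduces to the ordinary mean negative log-likelihood, which is a quick sanity check for any variant of this scheme.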

2602.00994 2026-05-29 cs.AI 版本更新

Reasoning and Tool-use Compete in Agentic RL: From Quantifying Interference to Disentangled Tuning

推理与工具使用在智能体强化学习中的竞争:从量化干扰到解耦调优

Yu Li, Mingyang Yi, Xiuyu Li, Ju Fan, Fuxin Jiang, Binbin Chen, Peng Li, Jie Song, Tieying Zhang

发表机构 * School of Information, Renmin University of China(中国人民大学信息学院) Bytedance Inc.(字节跳动公司)

AI总结 本文通过引入能力效应归因(CEA)量化推理与工具使用行为之间的干扰,并提出解耦动作-推理调优(DART)框架,通过分离参数更新来提升智能体强化学习的性能。

详情
AI中文摘要

智能体强化学习(ARL)训练大型语言模型将推理与外部工具执行交错进行,以解决复杂任务。大多数现有ARL方法训练一组参数来支持推理和工具使用行为,隐含假设联合训练能提升整体智能体性能。尽管被广泛采用,这一假设很少得到实证检验。本文通过引入能力效应归因(CEA)系统性地检验这一假设,提供了推理与工具使用行为之间干扰的定量证据。通过深入分析,我们表明这两种能力常常导致不一致的梯度方向,产生训练干扰,削弱联合优化的有效性,并挑战了主流的ARL范式。为解决此问题,我们提出解耦动作-推理调优(DART),一个简单高效的框架,通过独立的低秩适应模块显式解耦推理和工具使用的参数更新。仅凭这一简单改变,DART在检索增强问答和NL2SQL的十三个基准上超越了所有联合优化基线,并接近2-智能体上界,进一步支持了我们在共享优化下能力干扰的发现。

英文摘要

Agentic Reinforcement Learning (ARL) trains large language models to interleave reasoning with external tool execution to solve complex tasks. Most existing ARL methods train a single set of parameters to support both reasoning and tool-use behaviors, implicitly assuming that joint training leads to improved overall agent performance. Despite its widespread adoption, this assumption has rarely been examined empirically. In this paper, we systematically examine this assumption by introducing Capability Effect Attribution (CEA), which provides quantitative evidence of interference between reasoning and tool-use behaviors. Through an in-depth analysis, we show that these two capabilities often induce misaligned gradient directions, leading to training interference that undermines the effectiveness of joint optimization and challenges the prevailing ARL paradigm. To address this issue, we propose Disentangled Action--Reasoning Tuning (DART), a simple and efficient framework that explicitly decouples parameter updates for reasoning and tool use via separate low-rank adaptation modules. With this simple change alone, DART outperforms all joint-optimization baselines and approaches the 2-Agent upper bound across thirteen benchmarks on retrieval-augmented QA and NL2SQL, further supporting our finding of capability interference under shared optimization.

2601.22531 2026-05-29 cs.LG cs.AI 版本更新

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

向理性主义者学习:蒸馏中间可解释原理

Jiayi Dai, Randy Goebel

发表机构 * Department of Computing Science, University of Alberta, Edmonton, Canada(阿尔伯塔大学计算机科学系,加拿大埃德蒙顿) Alberta Machine Intelligence Institute, Edmonton, Canada(阿尔伯塔机器智能研究所,加拿大埃德蒙顿)

AI总结 提出REKD方法,通过知识蒸馏将教师模型的可解释原理和预测传授给学生模型,提升基于较弱神经网络的可解释原理提取模型的预测性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

由于深度神经网络(DNN)的广泛使用,尤其是在高风险领域,DNN的可解释性受到了越来越多的关注。原理提取(RE)的总体思想是通过选择-预测架构为DNN提供一个可解释的设计框架,其中两个神经网络分别联合学习进行特征选择和预测。仅依赖于最终任务预测的远程监督,学习选择特征子集(或原理)的过程需要在所有可能的特征组合空间中进行搜索,这在计算上具有挑战性,当基础神经网络能力不足时甚至更加困难。为了提高基于能力较弱或较小神经网络(即学生)的RE模型的预测性能,我们提出了REKD(基于知识蒸馏的原理提取),其中学生RE模型除了自身的RE优化外,还从教师(即理性主义者)的原理和预测中学习。这种对RE的结构调整与人类如何从可解释和可验证的知识中有效学习的方式高度一致。由于该方法与神经模型无关,任何黑盒神经网络都可以作为骨干模型集成。为了证明REKD的可行性,我们使用BERT和视觉变换器(ViT)模型的多种变体进行了实验。我们在语言和视觉分类数据集(即IMDB电影评论、CIFAR 10和CIFAR 100)上的实验表明,REKD显著提高了学生RE模型的预测性能。

英文摘要

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or rationales) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose REKD (Rationale Extraction with Knowledge Distillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a rationalist) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.

2601.22347 2026-05-29 cs.LG cs.AI 版本更新

Pushing the Limits of Block Rotations in Post-Training Quantization

推动后训练量化中块旋转的极限

Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, Nicholas J. Fraser

发表机构 * Advanced Micro Devices, Inc. (AMD)(Advanced Micro Devices公司) State University of New York at Buffalo(纽约州立大学布法罗分校) Norwegian University of Science and Technology(挪威科技大学)

AI总结 本文提出PeRQ框架,通过置换和旋转重新分布激活值,以克服块旋转在抑制异常值时的几何限制,显著提升后训练量化精度。

详情
AI中文摘要

最近的后训练量化(PTQ)方法采用块旋转来在舍入前扩散异常值。虽然这减少了在线全向量旋转的开销,但块结构对异常值抑制的影响仍知之甚少。为填补这一空白,我们首次对块Hadamard旋转的异常值抑制进行了系统的非渐近分析。我们的分析表明,异常值抑制从根本上受限于输入向量的几何结构。特别地,在确定性最坏情况下,当旋转前的ℓ1范数质量在块间均匀分布时,旋转后的异常值最小。受这些见解的启发,我们引入了PeRQ(置换、旋转、然后量化),一个在旋转前通过置换重新分布激活质量的PTQ框架。我们提出了一种贪婪质量扩散算法,通过均衡期望的块间ℓ1范数来校准置换。为避免增加推理开销,我们识别了Transformer架构中置换等变区域,在部署前将这些置换合并到模型权重中。实验表明,PeRQ在所有块大小上一致地提高了精度,在将Llama3 1B量化为INT4且块大小为16时,恢复了全向量旋转困惑度的90%,而未经置换时仅为46%。

英文摘要

Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of online full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, in the deterministic worst case, post-rotation outliers are minimized when the pre-rotation $\ell_1$ norm mass is evenly distributed across blocks. Guided by these insights, we introduce PeRQ (Permute, Rotate, then Quantize), a PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise $\ell_1$ norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge these permutations into model weights before deployment. Experiments show that PeRQ consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.
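
To make the idea of equalizing blockwise $\ell_1$ mass concrete, here is a small sketch of one way a greedy calibration could work: channels are visited from largest to smallest expected absolute activation, and each goes to the block that currently holds the least mass and still has room. This is a standard longest-processing-time style heuristic used as a stand-in; the abstract does not specify the paper's greedy mass diffusion algorithm, and the function name and inputs are hypothetical.

```python
import heapq
from typing import List, Sequence


def greedy_block_permutation(mass: Sequence[float], block_size: int) -> List[int]:
    """Return a channel permutation that roughly balances per-block l1 mass.

    mass[i] is an estimate of E|x_i| for channel i (assumed to come from
    calibration data). The result lists channel indices block by block.
    """
    n = len(mass)
    if block_size <= 0 or n % block_size != 0:
        raise ValueError("number of channels must be a multiple of block_size")
    num_blocks = n // block_size
    blocks: List[List[int]] = [[] for _ in range(num_blocks)]
    # Min-heap of (current mass, block index) for blocks with free slots.
    heap = [(0.0, b) for b in range(num_blocks)]
    heapq.heapify(heap)
    for ch in sorted(range(n), key=lambda i: -mass[i]):
        total, b = heapq.heappop(heap)
        blocks[b].append(ch)
        if len(blocks[b]) < block_size:
            heapq.heappush(heap, (total + mass[ch], b))
    return [ch for block in blocks for ch in block]
```

For example, with masses `[8, 7, 1, 0]` and block size 2 the heavy channels 0 and 1 end up in different blocks, so both blocks carry a total mass of 8 instead of 15 and 1.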

2601.17670 2026-05-29 cs.PL cs.AI 版本更新

Grammar-Aware Literate Generative Mathematical Programming with Compiler-in-the-Loop

语法感知的文学化生成式数学编程:编译器在环

Roberto Rossi, Steven D. Prestwich

发表机构 * Business School, University of Edinburgh(爱丁堡大学商学院) Insight Centre for Data Analytics, University College Cork(科克大学数据分析研究所)

AI总结 提出 SyntAGM 系统,通过迭代生成-编译-评估-修正循环,利用编译器反馈和 LLM 对齐判断,生成可读的代数建模语言优化模型,实现成本与质量的更优权衡。

Comments 18 pages, 7 figures

详情
AI中文摘要

数学规划广泛应用于物流、能源和劳动力规划等多个领域,用于建模和解决工业优化问题,但其使用需要大量的领域专业知识。大型语言模型提供了一种将自然语言问题描述转化为优化模型的有前景的方法,但现有方法成本高昂,且通常生成用通用计算机代码(如 Python)编写的模型,难以检查、验证和重用。在这项工作中,我们引入了 SyntAGM,一个通过迭代生成-编译-评估-修正循环生成可读代数建模语言优化模型的系统。SyntAGM 利用 PyOPL,一个类似 OPL 的建模语言编译器,旨在为迭代模型修复提供可操作的反馈。为了获得与问题描述匹配的有效 PyOPL 模型,SyntAGM 调动编译器反馈和基于 LLM 的对齐判断。此外,它结合了目标语言语法的上下文暴露和建模示例的少样本检索。在多个基准测试中,与既定的提示基线相比,SyntAGM 实现了更有利的成本-质量权衡。

英文摘要

Mathematical programming is widely employed across various sectors - such as logistics, energy, and workforce planning - to model and solve industrial optimisation problems, but its use requires substantial domain expertise. Large language models offer a promising way to translate natural-language problem descriptions into optimisation models, yet existing approaches are costly and generally produce models written in general-purpose computer code (e.g. Python), which can be difficult to inspect, validate, and reuse. In this work, we introduce SyntAGM, a system that generates optimisation models in a readable algebraic modelling language through an iterative generate-compile-assess-revise loop. SyntAGM leverages PyOPL, an OPL-like modelling language compiler designed to provide actionable feedback for iterative model repair. To obtain a valid PyOPL model that matches the problem description, SyntAGM mobilises compiler feedback and an LLM-based alignment judge. In addition, it combines in-context exposure to the target language grammar, and few-shot retrieval of modelling exemplars. Across multiple benchmarks, SyntAGM achieves a more favourable cost-quality trade-off compared to established prompting baselines.

2601.10960 2026-05-29 cs.CL cs.AI 版本更新

Steering Language Models Before They Speak: Logit-Level Interventions

在语言模型发言前引导它们:Logit 级别的干预

Hyeseon An, Shinwoo Park, Hyundong Jin, Yo-Sub Han

发表机构 * Department of Computer Science, Yonsei University, Seoul, South Korea(延世大学计算机科学系,首尔,韩国)

AI总结 提出 SWAI 方法,通过基于语料库的 token 统计在 logit 空间直接引导语言模型,无需训练或访问内部激活,在可读性、礼貌性和毒性控制上优于提示和基线方法。

Comments preprint

详情
AI中文摘要

可控生成要求语言模型实现诸如阅读水平、礼貌性和毒性等输出特征。现有的引导方法通常间接、需要访问内部激活或依赖辅助训练模型。我们提出 SWAI,一种无需训练、推理时的方法,通过使用基于语料库的 token 统计直接在 logit 空间引导,解决了这些限制。SWAI 从标记语料库计算 z 归一化的一对多 log-odds 分数,并仅在模型的前 K 个候选集内偏向高分数 token,从而在保留上下文合理选择的同时允许控制偏向目标特征 token。在可读性、礼貌性和毒性控制方面,SWAI 在不修改模型参数、访问内部层或训练辅助模型的情况下,始终优于基于提示和先前的 logit 级别基线。选择性和查找表消融实验表明,增益来自目标特定的统计分数,而非通用 logit 扰动。这些结果表明,当 logit 干预在高概率候选下由目标特定统计引导时,有效的引导不需要学习控制器。

英文摘要

Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existing steering methods are often indirect, require access to internal activations, or depend on auxiliary trained models. We propose SWAI, a training-free inference-time method that addresses these limitations by steering directly in logit space using corpus-derived token statistics. SWAI computes z-normalized one-vs-rest log-odds scores from labeled corpora and biases high-scoring tokens only within the model's top-K candidate set, allowing control to favor target-characteristic tokens while preserving contextually plausible choices. Across readability, politeness, and toxicity control, SWAI consistently improves over prompt-based and prior logit-level baselines without modifying model parameters, accessing internal layers, or training an auxiliary model. Selectivity and lookup-table ablations show that the gains come from target-specific statistical scores rather than generic logit perturbation. These results indicate that effective steering does not require learned controllers when the logit intervention is guided by target-specific statistics under high-probability candidates.
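
The scoring and biasing steps described above can be sketched as follows. This is an illustrative reading of the abstract, not the SWAI implementation: the add-one smoothing, the choice to boost only tokens with a positive z-score, the `strength` parameter, and the dictionary-based logits are all assumptions.

```python
import math
from collections import Counter
from typing import Dict, List


def log_odds_z_scores(target_tokens: List[str], rest_tokens: List[str],
                      smoothing: float = 1.0) -> Dict[str, float]:
    """z-normalized one-vs-rest log-odds score for every token seen."""
    vocab = sorted(set(target_tokens) | set(rest_tokens))
    if len(vocab) < 2:
        raise ValueError("need at least two distinct tokens")
    t_counts, r_counts = Counter(target_tokens), Counter(rest_tokens)
    t_total = len(target_tokens) + smoothing * len(vocab)
    r_total = len(rest_tokens) + smoothing * len(vocab)
    raw = {}
    for tok in vocab:
        p_t = (t_counts[tok] + smoothing) / t_total
        p_r = (r_counts[tok] + smoothing) / r_total
        raw[tok] = math.log(p_t / (1 - p_t)) - math.log(p_r / (1 - p_r))
    mean = sum(raw.values()) / len(raw)
    std = math.sqrt(sum((v - mean) ** 2 for v in raw.values()) / len(raw)) or 1.0
    return {tok: (v - mean) / std for tok, v in raw.items()}


def steer_top_k(logits: Dict[str, float], scores: Dict[str, float],
                k: int, strength: float) -> Dict[str, float]:
    """Add a bias to high-scoring tokens, but only inside the top-k set."""
    top = sorted(logits, key=logits.get, reverse=True)[:k]
    steered = dict(logits)
    for tok in top:
        z = scores.get(tok, 0.0)
        if z > 0:  # assumption: only target-characteristic tokens are boosted
            steered[tok] += strength * z
    return steered
```

Restricting the bias to the top-K candidates means a token the model already considers implausible in context is never promoted, no matter how characteristic it is of the target style.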

2601.08654 2026-05-29 cs.CL cs.AI cs.LG 版本更新

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

从评分标准到可靠分数:基于证据的文本评估与LLM裁判

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

发表机构 * Washington University in St. Louis(华盛顿大学圣路易斯分校) Arizona State University(亚利桑那州立大学) Florida State University(佛罗里达州立大学)

AI总结 提出Rulers框架,通过三阶段推理(任务规范、结构化执行、事后校准)解决LLM在基于评分标准的文本评估中的执行漂移、归因不可验证和人类尺度错位问题,实现更可靠的评分。

详情
AI中文摘要

基于评分标准的文本评估越来越多地使用大型语言模型(LLM)作为可扩展的裁判,但将冻结的黑盒模型与人类评分标准对齐仍然具有挑战性。我们将这一挑战表述为一个标准迁移问题:目标不仅仅是提示LLM分配分数,而是将人类评分标准意图转移到一个稳定、可审计且与人类对齐的评分协议中。我们识别了基于LLM的评分标准评估中三种反复出现的失败模式:评分标准执行漂移、不可验证的分数归因和人类尺度错位。为了解决这些失败模式,我们引入了Rulers,一个三阶段推理时框架,用于可靠、基于证据的评分标准文本评估。Rulers首先将人类评分标准转换为锁定的任务级规范,然后通过结构化检查表决策、类型化证据基础以及在适用时进行可提取引用验证来执行该规范,最后应用事后校准以将模型衍生的信号与人类分数边界对齐。在涵盖论文评分、摘要评估、EFL写作评估和结构化输入文本生成的四个基于评分标准的基准测试中,Rulers在多个冻结骨干模型的大多数评估设置中实现了更强的人类分数一致性。进一步分析表明,Rulers更好地匹配了经验人类分数分布,提高了在语义等价评分标准扰动下的稳定性,并受益于其三个组成部分。这些结果表明,可靠的LLM评判需要固定标准、可追溯证据和校准的分数解释,而不仅仅是提示措辞。我们的代码可在 https://anonymous.4open.science/r/Rulers_0525-3328 获取。

英文摘要

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.

2601.06431 2026-05-29 cs.AI 版本更新

LsrIF: Enhancing Logic-Structured Instruction Following of Large Language Models

LsrIF: 增强大语言模型的逻辑结构化指令遵循能力

Qingyu Ren, Qianyu He, Jingwen Chang, Geng Zhang, Jiajie Zhu, Xingzhou Chen, Zhuofei Shi, Jiaqing Liang, Yanghua Xiao, Han Xia, Zeye Sun, Fei Yu

发表机构 * Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence, Fudan University(复旦大学数据科学重点实验室,计算机科学与人工智能学院) School of Data Science, Fudan University(复旦大学数据科学学院) Ant Group(蚂蚁集团)

AI总结 提出LsrIF框架,通过构建并行、顺序、条件和嵌套结构的原子约束数据,并采用结构感知的奖励聚合方法,提升大语言模型在逻辑结构化指令遵循任务中的表现。

详情
AI中文摘要

指令遵循对于大语言模型至关重要,然而现实世界中的指令通常涉及具有逻辑结构的多个约束,例如并行组合、顺序依赖和条件分支。现有方法通常通过简单组合约束来构建数据,并在训练过程中通过平均各个约束分数来聚合奖励,忽略了逻辑依赖关系并引入了噪声信号。我们提出LsrIF,一个用于逻辑结构化指令遵循的训练框架。LsrIF通过将原子约束组织成并行、顺序、条件和嵌套结构来构建数据,并应用与其执行语义一致的结构感知奖励聚合:对并行约束取平均奖励,在顺序结构中早期失败后衰减后续奖励,在条件结构中仅奖励活跃分支。实验表明,LsrIF在领域内和领域外设置中均提升了指令遵循能力,同时也有利于逻辑推理。进一步分析表明,逻辑结构化训练增加了对约束相关词元和逻辑连接词的注意力,表明模型对指令逻辑的建模得到改善。我们将发布我们的数据和代码以供未来研究。

英文摘要

Instruction following is critical for large language models, yet real-world instructions often involve multiple constraints with logical structures, such as parallel composition, sequential dependencies, and conditional branching. Existing methods typically construct data by simply combining constraints and aggregate rewards by averaging individual constraint scores during training, overlooking logical dependencies and introducing noisy signals. We propose LsrIF, a training framework for logic-structured instruction following. LsrIF constructs data by organizing atomic constraints into parallel, sequential, conditional, and nested structures, and applies structure-aware reward aggregation aligned with their execution semantics: averaging rewards for parallel constraints, decaying later rewards after early failures in sequential structures, and rewarding only active branches in conditional structures. Experiments show that LsrIF improves instruction following in both in-domain and out-of-domain settings while also benefiting logic reasoning. Further analysis indicates that logic-structured training increases attention to constraint-related tokens and logical connectors, suggesting improved modeling of instruction logic. We will release our data and code for future research.
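
The three aggregation rules named in the abstract (average for parallel constraints, decay after early failures for sequential ones, reward only the active branch for conditionals) can be sketched as below. The decay factor, the pass threshold, and the function names are hypothetical choices for illustration; the abstract does not state the exact formulas.

```python
from typing import List


def parallel_reward(scores: List[float]) -> float:
    """Parallel constraints: plain average of the per-constraint scores."""
    return sum(scores) / len(scores) if scores else 0.0


def sequential_reward(scores: List[float], decay: float = 0.5,
                      pass_threshold: float = 1.0) -> float:
    """Sequential constraints: rewards after a failed step are decayed.

    Each failure (score below pass_threshold) multiplies the weight of all
    later steps by `decay` (an assumed value).
    """
    if not scores:
        return 0.0
    weight, total = 1.0, 0.0
    for s in scores:
        total += weight * s
        if s < pass_threshold:
            weight *= decay
    return total / len(scores)


def conditional_reward(condition_holds: bool, then_score: float,
                       else_score: float) -> float:
    """Conditional constraints: only the branch that is active is rewarded."""
    return then_score if condition_holds else else_score
```

With scores `[1.0, 0.0, 1.0]`, the parallel rule gives 2/3 while the sequential rule gives 0.5, because the third step's reward is halved after the failure at step two. Nested structures could be handled by feeding the result of one function in as a score to another.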

2601.01162 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

弥合分类数据聚类的语义鸿沟:基于大语言模型的方法

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) Guangdong University of Technology(广东工业大学) Department of Computer Science(计算机科学系) Hong Kong Baptist University(香港浸会大学)

AI总结 提出BREVE框架,利用外部知识库的语义嵌入丰富分类属性值,并引入自适应权重平衡原始标识与语义信息,在八个基准数据集上平均ARI排名达1.3。

Comments Accepted to ICPR2027

详情
AI中文摘要

定性数据广泛存在于医疗、营销和生物信息学等领域,聚类是其中模式发现的基本工具。定性数据聚类的核心困难在于度量属性值之间的相似性,这些属性值没有固有的顺序或距离。为了恢复这种关系,现有研究通常依赖于数据集内的共现统计。然而,当样本量较小时,这种统计路径变得不可靠,每个值的语义上下文因此未被充分利用。受此限制,本文提出BREVE(通过外部值丰富实现平衡表示),一种聚类框架,通过从外部知识库中提取额外的语义维度来丰富每个定性值。即,每个唯一值被扩展为一个密集嵌入,编码其语义内容。为了防止原始值身份被添加的维度稀释,进一步附加一个轻量级的独热编码组件。然后,由聚类紧致性引导的自适应权重决定富集维度进入最终表示的强度。通过这种设计,在八个基准数据集上的实验表明,与七个代表性竞争者相比,平均ARI排名为1.3。

英文摘要

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.

2512.15374 2026-05-29 cs.AI 版本更新

SCOPE: Prompt Evolution for Enhancing Agent Effectiveness

SCOPE: 通过提示进化增强智能体效能

Zehua Pei, Hui-Ling Zhen, Shixiong Kai, Sinno Jialin Pan, Yunhe Wang, Mingxuan Yuan, Bei Yu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 针对LLM智能体静态提示无法有效管理动态上下文导致失败的问题,提出SCOPE框架,将上下文管理建模为在线优化问题,通过双流记忆机制和视角驱动探索自动进化提示,在HLE基准上将任务成功率从14.23%提升至38.64%。

详情
AI中文摘要

大型语言模型(LLM)智能体越来越多地部署在生成大规模动态上下文的环境中。然而,一个关键瓶颈仍然存在:虽然智能体可以访问这些上下文,但其静态提示缺乏有效管理上下文的机制,导致反复出现纠正性和增强性失败。为了解决这一能力差距,我们引入了通过提示进化实现自进化上下文优化(SCOPE)。SCOPE将上下文管理视为一个在线优化问题,从执行轨迹中综合指导方针,自动进化智能体的提示。我们提出了一种双流机制,在战术记忆(即时错误纠正)和战略记忆之间路由指导方针,后者通过冲突解决、包含剪枝和合并不断优化。为了最大化策略覆盖范围,视角驱动探索进化多个并行提示,由不同的优化视角引导。在HLE基准上的实验表明,SCOPE在没有人工干预的情况下将任务成功率从14.23%提高到38.64%。我们在https://github.com/JarvisPei/SCOPE公开了代码。

英文摘要

Large Language Model (LLM) agents are increasingly deployed in environments that generate massive, dynamic contexts. However, a critical bottleneck remains: while agents have access to this context, their static prompts lack the mechanisms to manage it effectively, leading to recurring Corrective and Enhancement failures. To address this capability gap, we introduce Self-evolving Context Optimization via Prompt Evolution (SCOPE). SCOPE frames context management as an online optimization problem, synthesizing guidelines from execution traces to automatically evolve the agent's prompt. We propose a Dual-Stream mechanism that routes guidelines between tactical memory (immediate error correction) and strategic memory, which is continuously refined through conflict resolution, subsumption pruning, and consolidation. To maximize strategy coverage, Perspective-Driven Exploration evolves multiple parallel prompts guided by distinct optimization perspectives. Experiments on the HLE benchmark show that SCOPE improves task success rates from 14.23% to 38.64% without human intervention. We make our code publicly available at https://github.com/JarvisPei/SCOPE.

2512.10388 2026-05-29 cs.IR cs.AI 版本更新

The Best of the Two Worlds: Harmonizing Semantic and Hash IDs for Sequential Recommendation

两全其美:为序列推荐协调语义ID和哈希ID

Ziwei Liu, Yejing Wang, Wanyu Wang, Wang Zejian, Qidong Liu, Zijian Zhang, Chong Chen, Wei Huang, Xiangyu Zhao

发表机构 * City University of Hong Kong(香港城市大学) Xi'an Jiaotong University(西安交通大学) Jilin University(吉林大学) Independent Researcher(独立研究者) Tsinghua University(清华大学)

AI总结 针对序列推荐中头部和尾部物品性能权衡问题,提出H2Rec框架,通过双分支架构协调语义ID和哈希ID,并采用双级对齐策略实现知识迁移,在公开基准和商业平台上取得更好平衡。

详情
AI中文摘要

传统的序列推荐系统通常分配唯一的哈希ID(HID)来构建物品嵌入,主要从历史用户-物品交互中捕获协同信号。然而,在大多数物品很少被消费的长尾场景中,这种嵌入是脆弱的。最近结合辅助信息的方法常常面临来自共现信号的噪声协同共享或由平坦密集嵌入导致的语义同质性问题。相比之下,语义ID(SID)因其支持代码共享和多粒度语义建模,提供了一种有前景的替代方案。然而,基于SID的方法受到协同压倒现象的阻碍:常用的量化机制损害了建模头部物品所需的标识符唯一性,导致头部和尾部物品之间的性能权衡。为了解决这一挑战,我们提出了H2Rec,一种协调SID和HID的新框架。我们设计了一个双分支建模架构,同时捕获SID的多粒度语义,同时保留HID提供的唯一协同身份。此外,我们引入了一种双级对齐策略来桥接两种表示,实现有效的知识迁移和鲁棒的偏好建模。在三个公开基准上的大量离线实验和在大规模商业平台上的在线实验表明,H2Rec在头部和尾部推荐质量之间实现了更好的平衡,并且持续优于现有基线。

英文摘要

Conventional Sequential Recommender Systems (SRS) typically assign unique hash IDs (HID) to construct item embeddings, which mainly capture collaborative signals from historical user-item interactions. However, such embeddings are vulnerable in long-tail scenarios where most items are rarely consumed. Recent methods that incorporate auxiliary information often face noisy collaborative sharing from co-occurrence signals or semantic homogeneity caused by flat dense embeddings. In contrast, Semantic IDs (SID), with their support for code sharing and multi-granular semantic modeling, offer a promising alternative. Nevertheless, SID-based methods are hindered by a collaborative overwhelming phenomenon: commonly adopted quantization mechanisms compromise the identifier uniqueness needed to model head items, resulting in a performance trade-off between head and tail items. To address this challenge, we propose H2Rec, a novel framework that harmonizes SID and HID. We design a dual-branch modeling architecture that simultaneously captures the multi-granular semantics of SID while preserving the unique collaborative identity provided by HID. Moreover, we introduce a dual-level alignment strategy to bridge the two representations, enabling effective knowledge transfer and robust preference modeling. Extensive offline experiments on three public benchmarks and online experiments on a large-scale commercial platform demonstrate that H2Rec achieves a better balance between head and tail recommendation quality and consistently outperforms existing baselines.

2512.01863 2026-05-29 cond-mat.mes-hall cond-mat.str-el cs.AI 版本更新

Topological Order in Neural Wavefunctions

神经波函数中的拓扑序

Ahmed Abouelkomsan, Max Geier, Liang Fu

发表机构 * Department of Physics, Massachusetts Institute of Technology, Cambridge, MA-02139, USA(麻省理工学院物理系)

AI总结 本文利用基于注意力的深度神经网络变分波函数,通过能量最小化发现分数陈绝缘体基态,并引入一种从单一实空间波函数提取拓扑简并度的方法,展示了神经网络变分蒙特卡洛在强关联拓扑相研究中的潜力。

Comments Published version

详情
Journal ref
Phys. Rev. B 113, 205119 (2026)
AI中文摘要

拓扑有序态是最有趣的量子物质相之一,它们承载具有分数电荷并服从分数量子统计的涌现准粒子。然而,由于这些态具有强耦合性质,传统的平均场处理难以奏效,因此其理论研究颇具挑战。在这里,我们证明基于注意力的深度神经网络提供了一个富有表现力的变分波函数,它仅通过能量最小化就能在无先验知识的情况下发现分数陈绝缘体基态,并达到了显著的精度。我们引入了一种高效的方法,通过将平移不变系统中的单一优化实空间波函数分解为不同的多体动量扇区,从中提取基态拓扑简并度——这是拓扑序的标志。我们的结果确立了神经网络变分蒙特卡洛作为发现强关联拓扑相的多功能工具的地位。

英文摘要

Topologically ordered states are among the most interesting quantum phases of matter that host emergent quasi-particles having fractional charge and obeying fractional quantum statistics. Theoretical study of such states is however challenging owing to their strong-coupling nature that prevents conventional mean-field treatment. Here, we demonstrate that an attention-based deep neural network provides an expressive variational wavefunction that discovers fractional Chern insulator ground states purely through energy minimization without prior knowledge and achieves remarkable accuracy. We introduce an efficient method to extract ground state topological degeneracy -- a hallmark of topological order -- from a single optimized real-space wavefunction in translation-invariant systems by decomposing it into different many-body momentum sectors. Our results establish neural network variational Monte Carlo as a versatile tool for discovering strongly correlated topological phases.
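
The momentum-sector decomposition mentioned above can be illustrated with a toy example. This is a minimal sketch, not the authors' method: it assumes a 1D ring with exactly stored amplitudes and no fermionic exchange signs, whereas the paper works with a neural wavefunction sampled by Monte Carlo. The function names `translate` and `momentum_weights` are hypothetical.

```python
import cmath


def translate(config, t):
    """Cyclically shift a configuration (tuple of site occupations) by t sites."""
    t %= len(config)
    return config[-t:] + config[:-t] if t else config


def momentum_weights(psi, n_sites):
    """Norm-squared weight of psi in each momentum sector k = 0..n_sites-1.

    psi maps configurations to amplitudes. Toy assumption: translations act
    without exchange signs (hard-core bosons on a ring).
    """
    # The projected state can live on any translate of a configuration in psi.
    support = {translate(c, t) for c in psi for t in range(n_sites)}
    weights = []
    for k in range(n_sites):
        weight = 0.0
        for config in support:
            # Projected amplitude: (1/L) * sum_t exp(-2*pi*i*k*t/L) * psi(T^t config)
            amp = sum(
                cmath.exp(-2j * cmath.pi * k * t / n_sites)
                * psi.get(translate(config, t), 0.0)
                for t in range(n_sites)
            ) / n_sites
            weight += abs(amp) ** 2
        weights.append(weight)
    return weights
```

In this toy setting, a state localized on one site spreads evenly over all momentum sectors, while a translation-symmetric superposition sits entirely in the k = 0 sector. Counting the sectors that carry non-negligible weight is the kind of signal the abstract describes for reading off a degeneracy.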

2512.00283 2026-05-29 cs.LG cs.AI q-bio.QM 版本更新

BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

BioArc:发现生物学基础模型的最优神经架构

Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

发表机构 * Department of Computer Science, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工学院计算机科学系) Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工学院电气与计算机工程系) Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA(卡内基梅隆大学计算机科学系) Department of Biomedical Data Science, Stanford University, Stanford, CA, USA(斯坦福大学生物医学数据科学系)

AI总结 针对现有基础模型架构直接迁移至生物学领域时忽视生物数据独特性质的问题,提出BioArc框架,利用神经架构搜索系统探索架构设计空间,发现高性能架构并提炼设计原则,同时提出架构预测方法以高效预测新任务的最优架构。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基础模型已彻底改变了自然语言处理(NLP)和计算机视觉(CV)等多个领域。尽管已有努力将通用AI领域中基础模型的成功迁移到生物学,但现有工作主要直接采用来自通用机器学习领域的现有基础模型架构,而未考虑每种生物数据模态独特的物理化学和结构特性进行系统设计。这导致性能欠佳,因为这些改造后的架构难以捕捉生物数据固有的长程依赖、稀疏信息和复杂的底层“语法”。为解决这一差距,我们引入了BioArc,这是一个新颖的框架,旨在超越直觉驱动的架构设计,转向生物学基础模型的原理性、自动化架构发现。利用神经架构搜索(NAS),BioArc系统性地探索了广阔的架构设计空间,跨多种生物模态评估架构,同时严格分析架构、分词和训练策略之间的相互作用。这一大规模分析识别出新颖的高性能架构,使我们能够提炼出一套经验性设计原则,以指导未来的模型开发。此外,为充分利用这套发现的原理性架构,我们提出并比较了几种架构预测方法,这些方法能够有效且高效地预测新生物学任务的最优架构。总体而言,我们的工作提供了一个基础性资源和一套原理性方法论,以指导下一代生物学任务特定模型和基础模型的创建。

英文摘要

Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying ``grammars'' inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.

2511.22884 2026-05-29 cs.AI 版本更新

InsightEval: An Expert-Curated Benchmark for Assessing Insight Discovery in LLM-Driven Data Agents

InsightEval:用于评估LLM驱动数据代理洞察发现能力的专家策划基准

Zhenghao Zhu, Yuanfeng Song, Xin Chen, Chengzhong Liu, Yakun Cui, Caleb Chen Cao, Sirui Han, Yike Guo

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) ByteDance(字节跳动)

AI总结 针对现有洞察发现基准InsightBench的缺陷,提出数据整理流程构建新基准InsightEval,并引入新指标衡量代理的探索性能。

详情
AI中文摘要

数据分析已成为科学研究不可或缺的一部分。为了发现隐藏在大量数据集中的潜在知识和洞察,我们需要进行深入的探索性分析以实现其全部价值。随着大语言模型(LLM)和多智能体系统的出现,越来越多的研究人员利用这些技术进行洞察发现。然而,评估洞察发现能力的基准很少。作为现有最全面的框架之一,InsightBench也存在许多关键缺陷:格式不一致、目标设计不当以及冗余洞察。这些问题可能严重影响数据质量和代理评估。为了解决这些问题,我们深入研究了InsightBench的缺点,并提出了高质量洞察基准的基本标准。据此,我们开发了一个数据整理流程,构建了一个名为InsightEval的新数据集。我们进一步引入了一种新的度量标准来衡量代理的探索性能。通过在InsightEval上的大量实验,我们突出了自动化洞察发现中的普遍挑战,并提出了一些关键发现以指导这一有前景方向的未来研究。

英文摘要

Data analysis has become an indispensable part of scientific research. To discover the latent knowledge and insights hidden within massive datasets, we need to perform deep exploratory analysis to realize their full value. With the advent of large language models (LLMs) and multi-agent systems, more and more researchers are making use of these technologies for insight discovery. However, there are few benchmarks for evaluating insight discovery capabilities. As one of the most comprehensive existing frameworks, InsightBench also suffers from many critical flaws: format inconsistencies, poorly conceived objectives, and redundant insights. These issues may significantly affect the quality of data and the evaluation of agents. To address these issues, we thoroughly investigate shortcomings in InsightBench and propose essential criteria for a high-quality insight benchmark. Regarding this, we develop a data-curation pipeline to construct a new dataset named InsightEval. We further introduce a novel metric to measure the exploratory performance of agents. Through extensive experiments on InsightEval, we highlight prevailing challenges in automated insight discovery and raise some key findings to guide future research in this promising direction.

2511.14584 2026-05-29 cs.LG cs.AI 版本更新

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

ReflexGrad: 基于进度门控双过程路由的LLM智能体片段内故障恢复

Ankush Kadu, Aswanth Krishnan

发表机构 * GitHub

AI总结 提出ReflexGrad双过程架构,通过进度门控路由在无演示条件下实现LLM智能体片段内故障恢复,显著提升任务成功率。

Comments 18 pages, 4 figures, 10 tables. Accepted at ICML 2026 FoGen Workshop

详情
AI中文摘要

我们提出ReflexGrad,一种用于LLM智能体在无演示条件下进行片段内故障恢复的双过程架构。当智能体过早采用错误方法并耗尽步骤预算时,故障后的轨迹包含逃脱所需的信息——但尚无已发表的架构在单个片段内利用这一信息。ReflexGrad在快速过程(每k=3步进行TextGrad风格的连续优化)和慢速过程(当m=5个连续低进度分数触发路由门时进行Reflexion风格的因果诊断)之间进行路由。确定性优先级合并保持自然语言策略的一致性,每次慢速激活产生三个可观察的产物:可复现的触发器、因果诊断和验证后的修复。在ALFWorld 134个任务、n=10个种子、无演示条件下,ReflexGrad将Qwen-3-8B从35.1%提升至75.4%(+40.3个百分点),超过计算匹配的1-shot LATS 2.7个百分点(p≈0.01)、ToT 5.7个百分点(p<10^{-4})和Self-Refine 6.7个百分点(p<10^{-5});在GPT-5上则从46.3%提升至88.1%(+41.8个百分点)。1.5个百分点的跨模型差异在种子噪声范围内(p≈0.13),表明路由机制而非模型规模是增益的主要来源。代码、提示、逐种子日志和敏感性扫描已发布。

英文摘要

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents commit to a wrong approach early and exhaust the step budget, the post-failure trajectory contains the information to escape -- but no published architecture acts on it within a single episode. ReflexGrad routes between a fast process (TextGrad-style continuous refinement every $k{=}3$ steps) and a slow process (Reflexion-style causal diagnosis when $m{=}5$ consecutive low-progress scores fire a routing gate). A deterministic priority merge keeps the natural-language policy coherent, and each slow activation emits three observable artifacts: a reproducible trigger, a causal diagnostic, and a verified fix. On ALFWorld 134 tasks, $n{=}10$ seeds, no demonstrations, ReflexGrad lifts Qwen-3-8B from $35.1\%$ to $75.4\%$ ($+40.3$pp), beating compute-matched 1-shot LATS by $+2.7$pp ($p{\approx}0.01$), ToT by $+5.7$pp ($p{<}10^{-4}$), and Self-Refine by $+6.7$pp ($p{<}10^{-5}$); on GPT-5 the lift is $46.3{\to}88.1\%$ ($+41.8$pp). The $1.5$pp cross-model difference is within seed noise ($p{\approx}0.13$), suggesting that the routing mechanism, rather than model scale, is the primary source of the gain. Code, prompts, per-seed logs, and sensitivity sweeps are released.
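
The routing rule described above (a fast refinement every k = 3 steps, a slow diagnosis after m = 5 consecutive low-progress scores) can be sketched as a small gate. This is an illustrative assumption, not the released implementation: the `low_threshold` value, the streak reset after a slow activation, and the choice to let the slow process take priority over the fast one are all hypothetical.

```python
def route(progress_scores, k=3, m=5, low_threshold=0.2):
    """Return a list of (step, action) pairs with action in {'act', 'fast', 'slow'}.

    'fast' : periodic refinement every k steps.
    'slow' : causal diagnosis once m consecutive scores fall below low_threshold.
    low_threshold and the slow-over-fast priority are assumptions of this sketch.
    """
    decisions = []
    low_streak = 0
    for step, score in enumerate(progress_scores, start=1):
        # Track how many low-progress scores have occurred in a row.
        low_streak = low_streak + 1 if score < low_threshold else 0
        if low_streak >= m:
            decisions.append((step, "slow"))
            low_streak = 0  # assumption: restart the streak after a diagnosis
        elif step % k == 0:
            decisions.append((step, "fast"))
        else:
            decisions.append((step, "act"))
    return decisions
```

With the scores `[0.5, 0.1, 0.1, 0.1, 0.1, 0.1, 0.6]`, the gate fires a fast refinement at step 3 and a slow diagnosis at step 6, where the fifth consecutive low score arrives.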

2511.11703 2026-05-29 cs.LG cs.AI cs.RO 版本更新

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

通过语义分割增强3D环境中的强化学习:以ViZDoom为例

Jin Huang

发表机构 * Hugo Huang(胡戈·黄)

AI总结 提出SS-only和RGB+SS两种输入表示,利用语义分割减少内存消耗并提升强化学习在3D环境中的性能,在ViZDoom中验证。

Comments Master's Thesis at the University of Edinburgh (2024)

详情
AI中文摘要

在高维感官输入的3D环境中,强化学习面临两大挑战:(1) 稳定学习所需的内存缓冲区导致的高内存消耗,以及(2) 部分可观测马尔可夫决策过程(POMDPs)的复杂性。本项目通过提出两种新颖的输入表示:SS-only和RGB+SS,两者均对RGB彩色图像进行语义分割,以应对这些挑战。在ViZDoom的死斗模式中进行了实验,利用完美的分割结果进行受控评估。我们的结果表明,SS-only能够将内存缓冲区的内存消耗减少至少66.6%,当应用如游程编码等最小开销的可向量化无损压缩技术时,可减少高达98.6%。同时,RGB+SS通过提供的额外语义信息显著增强了强化学习代理的性能。此外,我们探索了基于密度的热力图作为可视化强化学习代理移动模式的工具,并评估了其用于数据收集的适用性。与先前方法的简要比较突出了我们的方法如何克服在ViZDoom等3D环境中应用语义分割时的常见陷阱。

英文摘要

Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents' performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents' movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
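
Run-length encoding, which the abstract names as one lossless option for compressing segmentation frames, can be sketched in a few lines. This is a plain, non-vectorised illustration with hypothetical function names, not the thesis code. Separately, storing one label channel instead of three RGB channels at equal bit depth already saves about two thirds of the buffer, which is consistent with the reported lower bound, though the thesis may derive that figure differently.

```python
def rle_encode(labels):
    """Compress a flat sequence of class labels into (label, run_length) pairs."""
    runs = []
    for label in labels:
        if runs and runs[-1][0] == label:
            runs[-1][1] += 1  # extend the current run
        else:
            runs.append([label, 1])  # start a new run
    return [(label, count) for label, count in runs]


def rle_decode(runs):
    """Invert rle_encode, restoring the original label sequence exactly."""
    labels = []
    for label, count in runs:
        labels.extend([label] * count)
    return labels
```

Segmentation maps suit this scheme because neighbouring pixels usually share a class, so a row of several hundred pixels often collapses to a handful of runs.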

2511.10861 2026-05-29 cs.CV cs.AI cs.LG 版本更新

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

一种面向CNN的基于LRP剪枝的精度感知扩展,以防止数据稀缺迁移学习中的级联精度下降

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science National Defense Academy of Japan(日本防卫大学校数学与计算机科学系)

AI总结 针对数据稀缺迁移学习中预训练CNN剪枝导致的级联精度下降问题,提出一种精度感知的剪枝控制机制,通过动态调整剪枝率和顺序来抑制精度下降,提升模型压缩后的分类性能。

Comments Accepted to scientific reports. The title was revised during the peer review process

详情
AI中文摘要

在大规模数据集(如ImageNet)上预训练的卷积神经网络(CNN)被广泛用作特征提取器,从稀缺数据中构建特定任务的高精度分类模型。在此类场景中,由于数据稀缺,微调预训练CNN变得困难,因此必须使用固定权重。然而,当权重固定时,许多对目标任务无贡献的滤波器仍保留在模型中,导致不必要的冗余和效率降低。因此,需要有效的方法通过剪枝对推理不必要的滤波器来减小模型大小。为此,已有研究提出了利用逐层相关性传播(LRP)的方法。LRP量化每个滤波器对推理结果的贡献,从而可以剪枝低相关性的滤波器。然而,现有基于LRP的剪枝方法被观察到会导致级联精度下降。在本研究中,我们为现有基于LRP的滤波器剪枝方法引入了一种精度感知的剪枝控制机制,该机制通过使用类别精度的调和平均数动态调整剪枝率和剪枝顺序,抑制级联精度下降,并在小数据环境下压缩预训练模型的同时保持任务特定性能。我们证明,该控制机制有效缓解了级联精度下降,与现有基于LRP的剪枝方法相比,实现了更高的分类精度,将VGG16的精度-剪枝率曲线下的类别平均面积(AUC)比传统基于LRP的方法提高了约15%。

英文摘要

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15\% over conventional LRP-based approaches.
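
The abstract says the controller is driven by the harmonic mean of class accuracy. A short sketch shows why that statistic reacts strongly when a single class collapses. The rate-adjustment rule below (`next_pruning_rate`, with its `tolerance` and halving step) is purely hypothetical and is not the rule from the paper.

```python
def harmonic_mean_accuracy(class_accuracies):
    """Harmonic mean of per-class accuracies; 0.0 if any class is at zero."""
    if any(a <= 0.0 for a in class_accuracies):
        return 0.0
    return len(class_accuracies) / sum(1.0 / a for a in class_accuracies)


def next_pruning_rate(rate, hm_before, hm_after, tolerance=0.02):
    """Hypothetical controller: halve the rate when the harmonic mean drops
    by more than `tolerance` after a pruning step, otherwise keep it."""
    if hm_before - hm_after > tolerance:
        return rate / 2.0
    return rate
```

For per-class accuracies of 0.9, 0.9 and 0.1, the arithmetic mean is about 0.63 while the harmonic mean is about 0.25, so a controller watching the harmonic mean slows pruning as soon as one class starts to fail.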

2510.26412 2026-05-29 cs.CV cs.AI 版本更新

LoCoT2V-Bench: Benchmarking Long-Form and Complex Text-to-Video Generation

LoCoT2V-Bench: 长视频与复杂文本到视频生成的基准测试

Xiangqing Zheng, Chengyue Wu, Kehai Chen, Min Zhang

发表机构 * Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)计算与智能研究院) The University of Hong Kong(香港大学)

AI总结 针对长视频生成在复杂文本输入下的评估挑战,提出包含多场景提示与层次元数据的基准LoCoT2V-Bench,并设计多维度评估框架LoCoT2V-Eval,实验发现模型在细粒度文本-视频对齐和角色一致性方面存在显著不足。

Comments Accepted by ICML 2026 (Regular)

详情
AI中文摘要

近期文本到视频生成在短片段上取得了令人印象深刻的性能,但在复杂文本输入下评估长视频生成仍然是一个重大挑战。为应对这一挑战,我们提出了LoCoT2V-Bench,一个用于长视频生成(LVG)的基准,包含具有层次元数据(如角色设置和相机行为)的多场景提示,这些提示从收集的真实世界视频中构建。我们进一步提出了LoCoT2V-Eval,一个多维度评估框架,涵盖感知质量、文本-视频对齐、时间质量、动态质量和人类期望实现程度(HERD),重点关注细粒度文本-视频对齐和时间角色一致性等方面。在17个代表性LVG模型上的实验揭示了评估维度之间的显著能力差异,模型在感知质量和背景一致性方面表现强劲,但在细粒度文本-视频对齐和角色一致性方面明显较弱。这些发现表明,提高提示忠实度和身份保持仍是长视频生成的关键挑战。我们的代码和数据发布在https://github.com/XqZeppelinhead0702/LoCoT2V-Bench。

英文摘要

Recent advances in text-to-video generation have achieved impressive performance on short clips, yet evaluating long-form generation under complex textual inputs remains a significant challenge. In response to this challenge, we present LoCoT2V-Bench, a benchmark for long video generation (LVG) featuring multi-scene prompts with hierarchical metadata (e.g., character settings and camera behaviors), constructed from collected real-world videos. We further propose LoCoT2V-Eval, a multi-dimensional framework covering perceptual quality, text-video alignment, temporal quality, dynamic quality, and Human Expectation Realization Degree (HERD), with an emphasis on aspects such as fine-grained text-video alignment and temporal character consistency. Experiments on 17 representative LVG models reveal pronounced capability disparities across evaluation dimensions, with strong perceptual quality and background consistency but markedly weaker fine-grained text-video alignment and character consistency. These findings suggest that improving prompt faithfulness and identity preservation remains a key challenge for long-form video generation. Our code and data are released at https://github.com/XqZeppelinhead0702/LoCoT2V-Bench

2510.20743 2026-05-29 cs.HC cs.AI cs.CL 版本更新

Empathic Prompting: Non-Verbal Context Integration for Multimodal LLM Conversations

共情提示:多模态大语言模型对话中的非语言上下文整合

Lorenzo Stacchio, Andrea Ubaldi, Alessandro Galdelli, Maurizio Mauri, Emanuele Frontoni, Andrea Gaggioli

发表机构 * University of Macerata(马切拉塔大学)

AI总结 提出共情提示框架,通过集成面部表情识别服务将非语言情感线索隐式融入大语言模型对话,实现无需用户显式控制的流畅多模态交互。

详情
AI中文摘要

我们提出了共情提示,一种新颖的多模态人机交互框架,它通过隐式的非语言上下文丰富大语言模型(LLM)对话。该系统集成了商业面部表情识别服务以捕捉用户的情感线索,并将其作为上下文信号嵌入提示过程中。与传统多模态界面不同,共情提示不需要用户显式控制;相反,它通过情感信息无干扰地增强文本输入,以实现对话和流畅性对齐。该架构模块化且可扩展,允许集成额外的非语言模块。我们描述了通过本地部署的DeepSeek实例实现的系统设计,并报告了初步的服务和可用性评估(N=5)。结果表明,非语言输入能够一致地整合到连贯的LLM输出中,参与者强调了对话的流畅性。除了这一概念验证外,共情提示还指向了聊天机器人中介通信中的应用,特别是在医疗或教育等领域,这些领域中用户的情感信号至关重要,但在言语交流中往往难以察觉。

英文摘要

We present Empathic Prompting, a novel framework for multimodal human-AI interaction that enriches Large Language Model (LLM) conversations with implicit non-verbal context. The system integrates a commercial facial expression recognition service to capture users' emotional cues and embeds them as contextual signals during prompting. Unlike traditional multimodal interfaces, empathic prompting requires no explicit user control; instead, it unobtrusively augments textual input with affective information for conversational and smoothness alignment. The architecture is modular and scalable, allowing integration of additional non-verbal modules. We describe the system design, implemented through a locally deployed DeepSeek instance, and report a preliminary service and usability evaluation (N=5). Results show consistent integration of non-verbal input into coherent LLM outputs, with participants highlighting conversational fluidity. Beyond this proof of concept, empathic prompting points to applications in chatbot-mediated communication, particularly in domains like healthcare or education, where users' emotional signals are critical yet often opaque in verbal exchanges.

2510.16658 2026-05-29 cs.AI cs.CE 版本更新

Large-Scale AI and Foundation Models for Neuroscience: A Comprehensive Review

大规模人工智能与基础模型在神经科学中的应用:综合综述

Shihao Yang, Xiying Huang, Danilo Bernardo, Jun-En Ding, Andrew Michael, Guoan Wang, Jingmei Yang, Alison Anderson, Dinesh Giritharan, Patrick Kwan, Ashish Raj, Yu Zhang, Feng Liu

发表机构 * Department of Systems Engineering, Stevens Institute of Technology(史蒂文斯理工学院系统工程系) Department of Neurology and Weill Institute for Neurosciences, University of California San Francisco(加州大学旧金山分校神经病学系和Weill神经科学研究所) Duke Institute for Brain Sciences, Duke University(杜克大学脑科学研究所) Division of Systems Engineering, Department of Electrical and Computer Engineering, Boston University(波士顿大学电气与计算机工程系系统工程部) Department of Neuroscience, School of Translational Medicine, Monash University(莫纳什大学转化医学学院神经科学系) Department of Neurology, Alfred Hospital, Melbourne, Victoria, Australia(澳大利亚维多利亚州墨尔本阿尔弗雷德医院神经病学系) Department of Radiology and Biomedical Imaging, University of California, San Francisco, CA, USA(美国加州大学旧金山分校放射学与生物医学成像系) Department of Psychiatry and Behavioral Sciences, School of Medicine, Stanford University(斯坦福大学医学院精神病学与行为科学系) Wu Tsai Neurosciences Institute, Stanford University(斯坦福大学吴蔡神经科学研究所) Stanford Institute for Human-Centered AI, Stanford University(斯坦福大学以人为本人工智能研究院)

AI总结 本文综述了大规模AI模型在神经科学四个主要领域(神经影像与数据处理、脑机接口与神经解码、临床决策支持与转化框架、神经系统与精神疾病特定应用)的应用,展示了其在多模态数据整合、时空模式解释和临床转化方面的潜力,并强调了严格评估、领域知识整合、临床验证和伦理指南的重要性。

Comments Accepted for publication in Meta-Radiology

详情
AI中文摘要

大规模人工智能(AI)模型的发展通过实现从原始脑信号和神经数据的端到端学习,正在影响神经科学研究。本文综述了大规模AI模型在四个主要神经科学领域的应用:神经影像与数据处理、脑机接口与神经解码、临床决策支持与转化框架,以及神经系统和精神疾病的特定应用。这些模型显示出解决重大计算神经科学挑战的潜力,包括多模态神经数据整合、时空模式解释以及为临床研究开发转化框架。此外,神经科学与AI之间的相互作用已变得日益互惠,因为现在融入了受生物学启发的架构约束,以开发更具可解释性和计算效率的模型。本综述既强调了此类技术的潜力,也强调了关键的实现考虑因素,特别关注严格的评估框架、领域知识的有效整合、前瞻性临床验证以及全面的伦理指南。最后,提供了用于开发和评估跨不同研究应用的大规模AI模型的关键神经科学数据集的系统列表。

英文摘要

The development of large-scale artificial intelligence (AI) models is influencing neuroscience research by enabling end-to-end learning from raw brain signals and neural data. In this paper, we review applications of large-scale AI models across four major neuroscience domains: neuroimaging and data processing, brain-computer interfaces and neural decoding, clinical decision support and translational frameworks, and disease-specific applications across neurological and psychiatric disorders. These models show potential to address major computational neuroscience challenges, including multimodal neural data integration, spatiotemporal pattern interpretation, and the development of translational frameworks for clinical research. Moreover, the interaction between neuroscience and AI has become increasingly reciprocal, as biologically informed architectural constraints are now incorporated to develop more interpretable and computationally efficient models. This review highlights both the promise of such technologies and critical implementation considerations, with particular emphasis on rigorous evaluation frameworks, effective integration of domain knowledge, prospective clinical validation, and comprehensive ethical guidelines. Finally, a systematic listing of critical neuroscience datasets used to develop and evaluate large-scale AI models across diverse research applications is provided.

2510.16060 2026-05-29 cs.LG cs.AI stat.ME stat.ML 版本更新

Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

超越准确性:时间序列基础模型是否良好校准?

Coen Adler, Yuxin Chang, Felix Draxler, Samar Abdi, Padhraic Smyth

发表机构 * Department of Computer Science(计算机科学系) Department of Statistics(统计学系) Google, Irvine(谷歌,尔湾)

AI总结 本文系统评估了五个时间序列基础模型和两个基线的校准特性,发现基础模型校准优于基线且无系统性过度自信或信心不足。

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
Proceedings of ICLR 2026
AI中文摘要

最近时间序列数据基础模型的发展引起了在各种应用中使用此类模型的广泛兴趣。尽管基础模型实现了最先进的预测性能,但它们的校准特性仍然相对未被充分探索,尽管校准在许多实际应用中可能至关重要。在本文中,我们研究了五个近期时间序列基础模型和两个竞争基线的校准相关特性。我们进行了一系列系统评估,包括模型校准(即过度自信或信心不足)、不同预测头的影响以及长期自回归预测下的校准。我们发现时间序列基础模型始终比基线模型校准得更好,并且往往不会系统性地过度自信或信心不足,这与在其他深度学习模型中常见的过度自信形成对比。

英文摘要

The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
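
One common way to check for over- or under-confidence in probabilistic forecasts is to compare the nominal level of a prediction interval with how often the target actually lands inside it. The abstract does not say which metrics the paper uses, so this is only a generic sketch with hypothetical function names.

```python
def empirical_coverage(lowers, uppers, targets):
    """Fraction of targets that fall inside their [lower, upper] interval."""
    hits = sum(lo <= y <= hi for lo, hi, y in zip(lowers, uppers, targets))
    return hits / len(targets)


def calibration_gap(nominal, lowers, uppers, targets):
    """Empirical coverage minus nominal coverage.

    Negative: intervals too narrow (over-confident).
    Positive: intervals too wide (under-confident).
    """
    return empirical_coverage(lowers, uppers, targets) - nominal
```

If nominal 90% intervals cover only 75% of held-out targets, the gap is about -0.15, which indicates over-confidence. A well-calibrated forecaster keeps this gap near zero across several nominal levels.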

2510.10961 2026-05-29 cs.CL cs.AI 版本更新

Obfuscation Rules for Detecting and Detoxifying Korean Toxicity

用于检测和去毒化韩语毒性内容的混淆规则

Yejin Lee, Su-Hyeon Kim, Hyundong Jin, Dayoung Kim, Yeonsoo Kim, Yo-Sub Han

发表机构 * Yonsei University(延世大学)

AI总结 本文提出KOTOX数据集,通过定义基于语言学的韩语混淆规则和变换框架,支持对混淆毒性文本的去混淆与去毒化,首次同时实现韩语混淆毒性检测与净化。

Comments 26 pages, 12 figures, 24 tables

详情
AI中文摘要

随着语言模型越来越多地部署于在线环境中,毒性检测和去毒化已受到越来越多的关注。现有研究主要关注非混淆文本,这限制了当用户故意伪装毒性表达时的鲁棒性。特别是,韩语毒性表达可以通过黏着形态学和韩文特有的正字法变体轻易伪装。然而,韩语中的混淆现象在很大程度上尚未被探索,这促使我们引入KOTOX:用于去混淆和去毒化的韩语毒性数据集。我们将韩语混淆模式分类为基于语言学的类别,定义从真实世界示例中推导出的变换规则,并将生成的混淆框架作为开放的变换包提供。利用这些规则,我们提供了配对的中性和毒性句子及其混淆版本。在我们的数据集上训练的模型能更好地处理混淆文本,而不会牺牲在非混淆文本上的性能。这是首个同时支持韩语去混淆和去毒化的数据集。我们期望该数据集能促进大型语言模型对韩语混淆毒性内容的更好理解和缓解。我们的代码和数据可在 https://github.com/leeyejin1231/KOTOX 获取。

英文摘要

As language models become increasingly deployed in online environments, toxicity detection and detoxification have received growing attention. Existing studies primarily focus on non-obfuscated text, which limits robustness when users intentionally disguise toxic expressions. In particular, Korean toxic expressions can be easily disguised through agglutinative morphology and Hangeul-specific orthographic variation. However, obfuscation in Korean remains largely unexplored, which motivates us to introduce KOTOX, a Korean toxic dataset for deobfuscation and detoxification. We categorize Korean obfuscation patterns into linguistically grounded classes, define transformation rules derived from real-world examples, and provide the resulting obfuscation framework as an open transformation package. Using these rules, we provide paired neutral and toxic sentences alongside their obfuscated counterparts. Models trained on our dataset better handle obfuscated text without sacrificing performance on non-obfuscated text. This is the first dataset that simultaneously supports deobfuscation and detoxification for the Korean language. We expect the dataset to facilitate better understanding and mitigation of obfuscated toxic content by LLMs for Korean. Our code and data are available at https://github.com/leeyejin1231/KOTOX.

2510.06063 2026-05-29 cs.AI cs.IT cs.LG math.IT 版本更新

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

TelecomTS:面向时间序列与语言分析的多模态可观测性数据集

Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying

发表机构 * Yale University(耶鲁大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文提出TelecomTS,一个来自5G电信网络的大规模多模态可观测性数据集,通过保留绝对尺度信息的异质协变量和多样化下游任务(异常检测、根因分析、多模态问答),揭示了现有模型在处理可观测性数据的高噪声、突变特性时的不足。

详情
AI中文摘要

现代企业在监控复杂系统时会产生大量的时间序列指标流,即所谓的可观测性数据。与来自气候等领域的传统时间序列不同,可观测性数据具有零膨胀、高度随机且时间结构极小的特点。尽管这些数据至关重要,但由于专有限制和隐私问题,可观测性数据集在公开基准中仍然代表性不足。现有数据集通常经过匿名化和归一化处理,去除了尺度信息,限制了其在异常检测、根因分析和多模态推理等任务中的应用。为弥补这一空白,我们引入了TelecomTS,这是一个源自5G电信网络的大规模可观测性数据集。TelecomTS包含具有明确绝对尺度信息的异质、去匿名化协变量,并提供多样化的下游任务套件,包括异常检测、根因分析和多模态问答。对最先进的时间序列、语言、推理和多模态基础模型的基准测试表明,现有方法难以应对可观测性数据特有的突变、高噪声和高方差动态特性。我们的实验进一步强调了保留协变量绝对尺度的重要性,凸显了开发能够原生利用尺度信息的基础时间序列模型以应对实际可观测性应用需求的必要性。代码可在https://github.com/Ali-maatouk/TelecomTS获取。

英文摘要

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

2510.02480 2026-05-29 cs.AI cs.LG 版本更新

Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting

通过早退机制控制语言模型中受损上下文的风险

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick

发表机构 * Johns Hopkins University(约翰霍普金斯大学) University of Amsterdam(阿姆斯特丹大学) Amazon AGI(亚马逊AGI) Amazon Alexa(亚马逊Alexa)

AI总结 提出一种结合动态早退预测与无分布风险控制的方法,限制有害上下文对语言模型性能的退化,并在有益上下文中实现计算效率提升。

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)可能受到有害或不相关上下文的影响,这会显著损害模型在下游任务上的性能。这促使我们设计具有内置机制的原则性方案,以防范此类“垃圾进,垃圾出”场景。我们提出一种新颖方法,限制有害上下文对模型性能的退化程度。首先,我们定义模型的基线“安全”行为——即无任何上下文(零样本)时的模型性能。接着,我们应用无分布风险控制(DFRC)来控制用户提供的上下文将性能降至该安全零样本基线以下的程度。我们通过利用动态早退预测实现这一点,忽略那些最关注不安全输入的靠后注意力头。最后,我们提出对DFRC的修改,使其既能控制有害输入的风险,又能利用有益输入的性能和效率提升。我们在涵盖上下文学习和开放式问答的9项任务上展示了理论和实证结果,表明我们的方法能有效控制有害上下文的风险,同时在使用有益上下文时实现显著的计算效率提升。

英文摘要

Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include built-in mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can decay performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend the most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs \textit{and} leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.

2509.23573 2026-05-29 cs.CR cs.AI 版本更新

Uncovering Vulnerabilities of LLM-Assisted Cyber Threat Intelligence

揭示LLM辅助网络威胁情报中的脆弱性

Yuqiao Meng, Luoxi Tang, Feiyang Yu, Jinyuan Jia, Guanhua Yan, Ping Yang, Zhaohan Xi

发表机构 * Binghamton University(宾汉姆顿大学) Duke University(杜克大学) Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 本文通过人机协同分类框架,识别并验证了LLM在CTI推理中的三种领域特定认知失败模式(虚假关联、矛盾知识和受限泛化),并证明针对性防御可显著降低失败率。

详情
AI中文摘要

大型语言模型(LLM)正越来越多地被用于帮助安全分析师应对激增的网络威胁,自动化从漏洞评估到事件响应的工作流程。然而,在实际操作的CTI工作流中,可靠性差距仍然显著。现有解释通常指向通用模型问题(如幻觉),但我们认为主要瓶颈在于威胁格局本身:CTI具有异质性、易变性和碎片化特征。在这些条件下,证据相互交织、众包且时间不稳定,这些特性是标准LLM研究很少捕捉到的。在本文中,我们对LLM在CTI推理中的脆弱性进行了全面的实证研究。我们引入了一个人机协同分类框架,该框架能够稳健地标注CTI生命周期中的失败模式,避免了自动化“LLM作为评判者”管道的脆弱性。我们识别出三种领域特定的认知失败:来自表面元数据的虚假关联、来自冲突来源的矛盾知识以及对新兴威胁的受限泛化。我们通过因果干预验证了这些机制,并表明针对性防御能显著降低失败率。这些结果共同为构建具有韧性且领域感知的CTI智能体提供了具体路线图。

英文摘要

Large language models (LLMs) are increasingly used to help security analysts manage the surge of cyber threats, automating tasks from vulnerability assessment to incident response. Yet in operational CTI workflows, reliability gaps remain substantial. Existing explanations often point to generic model issues (e.g., hallucination), but we argue the dominant bottleneck is the threat landscape itself: CTI is heterogeneous, volatile, and fragmented. Under these conditions, evidence is intertwined, crowdsourced, and temporally unstable, which are properties that standard LLM-based studies rarely capture. In this paper, we present a comprehensive empirical study of LLM vulnerabilities in CTI reasoning. We introduce a human-in-the-loop categorization framework that robustly labels failure modes across the CTI lifecycle, avoiding the brittleness of automated "LLM-as-a-judge" pipelines. We identify three domain-specific cognitive failures: spurious correlations from superficial metadata, contradictory knowledge from conflicting sources, and constrained generalization to emerging threats. We validate these mechanisms via causal interventions and show that targeted defenses reduce failure rates significantly. Together, these results offer a concrete roadmap for building resilient, domain-aware CTI agents.

2509.23571 2026-05-29 cs.CR cs.AI 版本更新

Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

通过标准化威胁狩猎评估LLM辅助蓝队

Yuqiao Meng, Luoxi Tang, Feiyang Yu, Xi Li, Guanhua Yan, Ping Yang, Zhaohan Xi

发表机构 * State University of New York at Binghamton(纽约州立大学宾汉姆顿分校) University of Alabama at Birmingham(阿拉巴马大学伯明翰分校) Duke University(杜克大学)

AI总结 本文提出CyberTeam基准,通过构建标准化工作流和模块化操作步骤,评估大语言模型在蓝队威胁狩猎中的有效性,并揭示标准化设计带来的改进与开放式推理的局限性。

Comments ICML'26

详情
AI中文摘要

随着网络威胁在规模和复杂性上持续增长,蓝队防御者越来越需要先进工具来主动检测和缓解风险。大语言模型(LLMs)为增强威胁分析提供了有前景的能力。然而,它们在真实蓝队威胁狩猎场景中的有效性仍未得到充分探索。本文提出CyberTeam,一个旨在指导LLMs进行蓝队实践的基准。CyberTeam通过两个阶段构建标准化工作流。首先,它通过捕获从威胁归因到事件响应的分析任务之间的依赖关系,对真实的威胁狩猎工作流进行建模。接下来,每个任务通过一组针对其特定分析需求定制的操作模块来处理。这将威胁狩猎转化为一系列结构化的推理步骤,每个步骤基于离散操作并根据任务特定依赖关系排序。在此框架指导下,LLMs被引导通过模块化步骤执行威胁狩猎任务。总体而言,CyberTeam整合了30个任务和9个操作模块,以指导LLMs进行标准化威胁分析。我们评估了领先的LLMs和最先进的网络安全智能体,将CyberTeam与开放式推理策略进行比较。我们的结果突显了标准化设计带来的改进,同时也揭示了开放式推理在真实威胁狩猎中的局限性。

英文摘要

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.

2508.15371 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Confidence-Modulated Speculative Decoding for Large Language Models

置信度调节的推测解码用于大型语言模型

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

发表机构 * Department of Data Science(数据科学系) Praxis Business School(普拉克斯商学院)

AI总结 本文提出一种基于置信度调节的推测解码框架,通过熵和边际不确定性度量动态调整草稿长度与验证过程,在机器翻译和摘要任务上实现加速并保持或提升BLEU和ROUGE分数。

Comments This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

详情
AI中文摘要

推测解码已成为一种通过草稿-验证范式并行化令牌生成来加速自回归推理的有效方法。然而,现有方法依赖静态草稿长度和刚性验证标准,限制了其在不同模型不确定性和输入复杂性下的适应性。本文提出一种基于置信度调节草稿的信息论推测解码框架。通过利用草稿模型输出分布上的熵和边际不确定性度量,所提方法在每次迭代中动态调整推测生成的令牌数量。这种自适应机制减少了回滚频率,提高了资源利用率,并保持了输出保真度。此外,验证过程使用相同的置信度信号进行调节,使得在不牺牲生成质量的情况下更灵活地接受草稿令牌。在机器翻译和摘要任务上的实验表明,与标准推测解码相比,该方法在保持或提升BLEU和ROUGE分数的同时实现了显著加速。所提方法提供了一种原则性的即插即用方法,用于在不确定性变化条件下实现大型语言模型的高效且鲁棒的解码。

英文摘要

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
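
The adaptive drafting idea in this abstract can be illustrated with a minimal sketch. The abstract only states that entropy and margin-based uncertainty over the drafter's output distribution drive the draft length; the blending rule, the weight `alpha`, and the bounds `min_len` and `max_len` below are illustrative assumptions, not the authors' formulation.

```python
import math

def entropy(probs):
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def margin(probs):
    """Gap between the largest and second-largest probabilities."""
    top = sorted(probs, reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)

def confidence(probs, alpha=0.5):
    """Hypothetical confidence score in [0, 1].

    Blends normalized certainty (1 - entropy / max entropy) with the margin.
    The blend and alpha are assumptions made for illustration only.
    """
    n = len(probs)
    max_entropy = math.log(n) if n > 1 else 1.0
    certainty = 1.0 - entropy(probs) / max_entropy
    return alpha * certainty + (1.0 - alpha) * margin(probs)

def draft_length(probs, min_len=1, max_len=8, alpha=0.5):
    """Map confidence linearly onto a draft length in [min_len, max_len]."""
    c = min(1.0, max(0.0, confidence(probs, alpha)))
    return min_len + round(c * (max_len - min_len))
```

Under these assumptions, a flat drafter distribution yields the shortest draft and a sharply peaked one yields the longest, which is the qualitative behavior the abstract describes: draft fewer tokens when the drafter is uncertain, so fewer of them are rolled back.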

2508.12176 2026-05-29 cs.CV cs.AI eess.SP 版本更新

Scalable RF Simulation in Generative 4D Worlds

生成式4D世界中的可扩展射频仿真

Zhiwei Zheng, Dongyin Hu, Mingmin Zhao

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出WaveVerse框架,通过语言引导的4D世界生成器和物理信号模拟器实现可扩展的射频信号仿真,在相位敏感基准上表现高保真度,并有效提升下游任务性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

射频(RF)感知已成为一种强大的、保护隐私的替代视觉方法,用于各种感知任务。然而,在动态和多样化的环境中构建高质量的RF数据集仍然是一个重大挑战。为了解决这一问题,我们引入了WaveVerse,一个基于提示的可扩展框架,该框架从生成的室内场景中模拟真实的RF信号,并包含由空间路径引导的人体运动,从而无需手动轨迹设计即可实现多样且可行的行为。WaveVerse具有语言引导的4D世界生成器和基于物理的信号模拟器,能够在多样化的环境中实现RF信号的逼真模拟。它采用了一个相位相干光线追踪器,保留了空间和时间上的相位一致性。模拟信号在相位敏感基准上显示出高保真度,并且与真实世界收集的测量数据以及来自专有电磁求解器的模拟结果高度一致。当用于数据增强时,WaveVerse在RF成像和人类活动识别等下游任务中持续提升性能,其增益随模拟数据量的增加而增长,并超越了现有方法。代码和附加材料可在网页上获取。

英文摘要

Radio Frequency (RF) sensing has emerged as a powerful, privacy-preserving alternative to vision-based methods for various perception tasks. However, building high-quality RF datasets in dynamic and diverse environments remains a major challenge. To address this, we introduce WaveVerse, a prompt-based, scalable framework that simulates realistic RF signals from generated indoor scenes with human motions guided by spatial paths, enabling diverse and feasible behaviors without manual trajectory design. WaveVerse features a language-guided 4D world generator and a physics-based signal simulator that enables realistic simulation of RF signals in diverse environments. It employs a phase-coherent ray tracer that preserves both spatial and temporal phase consistency. The simulated signals show high fidelity on phase-sensitive benchmarks, and closely align with both real-world collected measurements and simulations from a proprietary electromagnetic solver. When used for data augmentation, WaveVerse consistently improves performance in downstream tasks like RF imaging and human activity recognition, with gains that grow with the amount of simulated data and surpass existing methods. Code and additional materials are available on the webpage.

2508.05614 2026-05-29 cs.CL cs.AI 版本更新

GroundAct: Can LLM Agents Ground Actions in Environmental States?

GroundAct:LLM智能体能否在环境状态中实现动作落地?

Zixuan Wang, Dingming Li, Hongxing Li, Yanrui Miao, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang

发表机构 * Zhejiang University(浙江大学)

AI总结 本研究提出GroundAct基准,通过1500个场景和16592个任务实例评估15个LLM,发现动作落地能力是多维挑战,不能仅通过模型规模解决。

Comments Project Page: https://zju-real.github.io/OmniEmbodied Code: https://github.com/ZJU-REAL/OmniEmbodied

详情
AI中文摘要

LLM智能体在指令完全指定动作的任务上成功率达到85-96%,但当动作可行性取决于指令未提及的环境状态时,成功率降至29-53%。我们认为这一差距反映了一种缺失的能力:动作落地,即从结构化环境状态推断动作是否可行、缺少哪些前提条件以及是否超出个体能力的能力。我们引入GroundAct,这是一个包含1500个场景和16592个任务实例的基准,基于文本的交互式环境涵盖11个领域,任务按认知复杂度层级组织为七个类别。评估15个LLM(3B-671B)后,我们发现三种诊断模式:(i)属性推理与工具和协作推理弱相关,产生不同的模型轮廓;(ii)完整环境图在工具使用与隐式协作之间产生高达+27.6/-22.9%的差异,区分了搜索边界与约束过滤瓶颈;(iii)监督微调将Qwen2.5-3B在直接命令上的性能从0.6%提升至76.3%,但在隐式协作上仅从1.5%提升至5.5%。这些结果表明动作落地是一个多维挑战,不能仅通过规模扩展解决。

英文摘要

LLM agents achieve 85-96% success on tasks where instructions fully specify the action, but drop to 29-53% when action feasibility depends on environmental state that the instruction does not mention. We argue that this gap reflects a missing capability: action grounding, the ability to infer from structured environmental state whether an action is feasible, what prerequisites it lacks, and whether it exceeds individual capacity. We introduce GroundAct, a benchmark of 1,500 scenarios and 16,592 task instances in text-based interactive environments spanning 11 domains, with tasks organized into seven categories along a cognitive complexity hierarchy. Evaluating 15 LLMs (3B-671B), we find three diagnostic patterns: (i) attribute reasoning is weakly correlated with tool and coordination reasoning, producing distinct model profiles; (ii) complete environment graphs yield up to +27.6/-22.9% on tool use vs. implicit collaboration, separating search-bound from constraint-filtering bottlenecks; and (iii) supervised fine-tuning lifts Qwen2.5-3B from 0.6% to 76.3% on direct command but only 1.5% to 5.5% on implicit collaboration. These results establish action grounding as a multi-dimensional challenge irreducible to scaling.

2507.21114 2026-05-29 cs.IR cs.AI cs.CV 版本更新

Page image classification for content-specific data processing

面向特定内容数据处理的页面图像分类

Kateryna Lutsai

AI总结 本研究针对人文学科数字化项目中历史文档页面图像内容多样、手动分类困难的问题,开发并评估了一种基于人工智能和机器学习的图像分类系统,通过按内容类别(如文本类型、图形元素、布局)自动分类页面,以支持定制化的下游分析流程。

Comments Dataset licensing issues occurred

详情
AI中文摘要

人文学科的数字化项目通常会产生大量历史文档的页面图像,这给手动分类和分析带来了巨大挑战。这些档案包含多样化的内容,包括各种文本类型(手写体、打字体、印刷体)、图形元素(图画、地图、照片)以及布局(纯文本、表格、表单)。高效处理这些异构数据需要基于页面内容进行自动分类的方法,从而能够启用定制化的下游分析流程。本项目通过开发并评估一种专门为历史文档页面设计的图像分类系统来满足这一需求,该系统利用了人工智能和机器学习的最新进展。所选的类别集旨在促进特定内容处理工作流程,将需要不同分析技术(例如,用于文本的OCR、用于图形的图像分析)的页面区分开来。

英文摘要

Digitization projects in the humanities often generate vast quantities of page images from historical documents, presenting significant challenges for manual sorting and analysis. These archives contain diverse content, including various text types (handwritten, typed, printed), graphical elements (drawings, maps, photos), and layouts (plain text, tables, forms). Efficiently processing this heterogeneous data requires automated methods to categorize pages based on their content, enabling tailored downstream analysis pipelines. This project addresses this need by developing and evaluating an image classification system specifically designed for historical document pages, leveraging advancements in artificial intelligence and machine learning. The set of categories was chosen to facilitate content-specific processing workflows, separating pages requiring different analysis techniques (e.g., OCR for text, image analysis for graphics).

2507.09574 2026-05-29 cs.CV cs.AI cs.CL 版本更新

MENTOR: Efficient Multimodal-Conditioned Tuning for Autoregressive Vision Generation Models

MENTOR: 面向自回归视觉生成模型的高效多模态条件微调

Haozhe Zhao, Zefan Cai, Shuzheng Si, Liang Chen, Jiuxiang Gu, Wen Xiao, Minjia Zhang, Junjie Hu

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tsinghua University(清华大学) Peking University(北京大学) Microsoft(微软公司)

AI总结 提出MENTOR框架,通过两阶段训练范式实现自回归图像生成器与多模态输入的细粒度token级对齐,无需辅助适配器或交叉注意力模块,在DreamBench++上取得优异性能。

Comments Findings of ACL 2026

详情
AI中文摘要

最近的文本到图像模型能够生成高质量结果,但在精确视觉控制、平衡多模态输入以及需要大量训练以实现复杂多模态图像生成方面仍存在困难。为解决这些局限,我们提出MENTOR,一种新颖的自回归(AR)框架,用于高效的多模态条件微调以实现自回归多模态图像生成。MENTOR将AR图像生成器与两阶段训练范式相结合,无需依赖辅助适配器或交叉注意力模块,即可实现多模态输入与图像输出之间的细粒度、token级对齐。两阶段训练包括:(1)多模态对齐阶段,建立稳健的像素级和语义级对齐;随后是(2)多模态指令微调阶段,平衡多模态输入的整合并增强生成可控性。尽管模型规模适中、基础组件非最优且训练资源有限,MENTOR在DreamBench++基准测试上仍取得了强劲性能,在概念保持和提示遵循方面优于竞争基线。此外,与基于扩散的方法相比,我们的方法具有更优的图像重建保真度、广泛的任务适应性以及更高的训练效率。数据集、代码和模型可在 https://github.com/HaozheZhao/MENTOR 获取。

英文摘要

Recent text-to-image models produce high-quality results but still struggle with precise visual control and with balancing multimodal inputs, and they require extensive training for complex multimodal image generation. To address these limitations, we propose MENTOR, a novel autoregressive (AR) framework for efficient Multimodal-conditioned Tuning for Autoregressive multimodal image generation. MENTOR combines an AR image generator with a two-stage training paradigm, enabling fine-grained, token-level alignment between multimodal inputs and image outputs without relying on auxiliary adapters or cross-attention modules. The two-stage training consists of: (1) a multimodal alignment stage that establishes robust pixel- and semantic-level alignment, followed by (2) a multimodal instruction tuning stage that balances the integration of multimodal inputs and enhances generation controllability. Despite modest model size, suboptimal base components, and limited training resources, MENTOR achieves strong performance on the DreamBench++ benchmark, outperforming competitive baselines in concept preservation and prompt following. Additionally, our method delivers superior image reconstruction fidelity, broad task adaptability, and improved training efficiency compared to diffusion-based methods. Dataset, code, and models are available at: https://github.com/HaozheZhao/MENTOR

2507.03318 2026-05-29 cs.LG cs.AI 版本更新

Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

基于图神经网络与组套索正则化的结构感知化合物-蛋白质亲和力预测

Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang

发表机构 * Department of Biostatistics & Health Data Science(生物统计学与健康数据科学系) Indiana University(印第安纳大学) Department of Computer Science(计算机科学系) Indiana University Bloomington(印第安纳大学布卢明顿分校) Division of Clinical Pharmacology(临床药理学部) Indiana University School of Medicine(印第安纳大学医学院) IUSM-Purdue TREAT-AD Center(IUSM-普渡大学TREAT-AD中心) Department of Medical and Molecular Genetics(医学与分子遗传学系)

AI总结 提出利用图神经网络结合组套索和稀疏组套索正则化,从活性悬崖分子对中学习结构信息以预测化合物-蛋白质亲和力(IC50),并提升模型可解释性。

Comments 15 pages, 7 figures

详情
Journal ref
Comput Struct Biotechnol J. 2026;35:0012
AI中文摘要

可解释人工智能(XAI)方法越来越多地被应用于药物发现中,以学习分子表示并识别驱动性质预测的子结构。然而,为化合物性质预测构建结构-活性关系(SAR)建模的端到端可解释模型面临诸多挑战,例如特定蛋白质靶标的化合物-蛋白质相互作用活性数据有限,以及分子构型位点的细微变化会显著影响分子性质。我们利用具有活性悬崖的分子对,这些分子共享骨架但在取代基位点不同,其特征是对特定蛋白质靶标具有较大的效力差异。我们提出一个框架,通过实现图神经网络(GNN)来利用活性悬崖对的性质和结构信息,以预测化合物-蛋白质亲和力(即半数最大抑制浓度,IC50)。为了增强模型性能和可解释性,我们使用结构感知损失函数训练GNN,采用组套索和稀疏组套索正则化,这些正则化方法能够剪枝并突出与活性差异相关的分子子图。我们将该框架应用于针对三种原癌基因酪氨酸蛋白激酶Src蛋白(PDB ID:1O42、2H8H、4MXO)的分子活性悬崖数据。我们的方法通过稀疏组套索整合公共和私有节点信息,改进了性质预测,这体现在均方根误差(RMSE)降低和皮尔逊相关系数(PCC)提高上。应用正则化还通过提升图级全局方向分数和改进原子级着色精度,增强了GNN的特征归因能力。这些进展增强了药物发现流程中模型的可解释性,特别是在先导化合物优化中识别关键分子子结构方面。

英文摘要

Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable models for structure-activity relationship (SAR) modeling for compound property prediction faces many challenges, such as the limited number of compound-protein interaction activity data for specific protein targets, and plenty of subtle changes in molecular configuration sites significantly affecting molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework by implementing graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure-aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson's correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNN by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.
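
The two regularizers named in this abstract have standard textbook forms, sketched below in plain Python. How the weights are grouped (for example, per node or per molecular subgraph) and the values of `lam` and `alpha` are assumptions for illustration; the abstract does not specify them, and this is not the authors' loss function.

```python
import math

def group_lasso(groups, lam=1.0):
    """Group lasso penalty: lam * sum over groups of sqrt(|g|) * ||w_g||_2.

    `groups` is a list of weight lists; the grouping itself is hypothetical.
    """
    return lam * sum(
        math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g))
        for g in groups
    )

def sparse_group_lasso(groups, lam=1.0, alpha=0.5):
    """Sparse group lasso: a convex mix of an L1 term and the group term.

    alpha = 1 recovers the plain lasso, alpha = 0 the plain group lasso.
    """
    l1 = sum(abs(w) for g in groups for w in g)
    return alpha * lam * l1 + (1.0 - alpha) * group_lasso(groups, lam)
```

The group term drives whole groups of weights to zero together, which is what allows entire subgraphs to be pruned, while the added L1 term also encourages sparsity inside the groups that survive.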

2506.12508 2026-05-29 cs.AI 版本更新

AgentOrchestra: Orchestrating Multi-Agent Intelligence with the Tool-Environment-Agent(TEA) Protocol

AgentOrchestra:使用工具-环境-智能体(TEA)协议编排多智能体智能

Wentao Zhang, Liang Zeng, Yuzhen Xiao, Yongcong Li, Ce Cui, Yilei Zhao, Rui Hu, Yang Liu, Yahui Zhou, Bo An

发表机构 * Skywork AI Nanyang Technological University(南洋理工大学)

AI总结 提出TEA协议和AgentOrchestra框架,通过统一抽象和分层编排实现多智能体系统的生命周期感知协调,在GAIA测试集上达到89.04%的准确率。

详情
AI中文摘要

最近基于LLM的智能体系统在复杂、长时任务上显示出潜力,但现有智能体协议(如A2A和MCP)未能充分支持跨智能体、工具和环境的生命周期感知协调。为解决这一局限,我们引入了工具-环境-智能体(TEA)协议,这是一种统一的抽象,将这些组件建模为具有显式生命周期的一等版本化资源。TEA支持端到端的上下文和版本管理,提高了可追溯性和可复现性,同时实现了智能体相关组件的持续自我进化(除非另有说明,智能体相关组件包括提示、记忆/工具/智能体/环境代码以及智能体输出,即解决方案)。基于TEA,我们提出了AgentOrchestra,这是一个分层多智能体框架,其中中央规划器协调专门的子智能体,并在执行过程中动态扩展能力。在四个具有挑战性的基准测试(涵盖专家级智能体任务和科学/数学推理)上的实验表明,AgentOrchestra始终优于强基线;特别是在GAIA测试集上达到89.04%,据我们所知,这使其跻身领先方法之列。这些结果凸显了显式协议设计和分层编排对于构建更鲁棒、更自适应的多智能体系统的价值。

英文摘要

Recent advances in LLM-based agent systems have shown promise on complex, long-horizon tasks, but existing agent protocols (e.g., A2A and MCP) do not adequately support lifecycle-aware coordination across agents, tools, and environments. To address this limitation, we introduce the Tool-Environment-Agent (TEA) protocol, a unified abstraction that models these components as first-class, versioned resources with explicit lifecycles. TEA supports end-to-end context and version management, improving traceability and reproducibility, while also enabling continual self-evolution of agent-associated components (unless otherwise specified, agent-associated components include prompts, memory/tool/agent/environment code, and agent outputs, i.e., solutions). Building on TEA, we present AgentOrchestra, a hierarchical multi-agent framework in which a central planner coordinates specialized sub-agents and dynamically extends capabilities during execution. Experiments on four challenging benchmarks, spanning expert-level agent tasks and scientific/mathematical reasoning, show that AgentOrchestra consistently outperforms strong baselines; in particular, it achieves 89.04% on the GAIA Test set, placing it among the leading methods to the best of our knowledge. These results highlight the value of explicit protocol design and hierarchical orchestration for building more robust and adaptive multi-agent systems.

2506.08354 2026-05-29 cs.CL cs.AI cs.IR 版本更新

Position: Text Embeddings Should Capture Implicit Semantics, Not Just Surface Meaning

立场:文本嵌入应捕获隐含语义,而不仅仅是表面意义

Yiqun Sun, Qiang Huang, Anthony K. H. Tung, Jun Yu

发表机构 * National University of Singapore(新加坡国立大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳))

AI总结 本文主张文本嵌入研究应从表面意义转向隐含语义,通过试点研究揭示现有模型在隐含语义任务上的局限,并提出范式转变以优先发展语言学基础训练数据、深层语义基准和核心建模目标。

Comments To appear in ICML 2026

详情
AI中文摘要

这篇立场论文主张,文本嵌入研究应超越表面意义,将隐含语义作为核心建模目标。文本嵌入是现代自然语言处理的基础组件,支撑着广泛的应用并推动持续的研究进展。尽管进展迅速,大多数嵌入模型仍局限于表面层次的语义,而语言学理论强调人类意义的大部分是隐含的,由语用学、说话者意图和社会文化语境塑造。当前模型通常在缺乏此类深度的数据集上训练,并使用奖励表面相似性的基准进行评估。因此,它们在需要解释性推理、立场识别或社会性理解的任务中表现不佳。我们的试点研究明确揭示了这一局限性,表明即使在探测隐含语义的任务上,最先进的嵌入相比简单的词汇基线也仅取得边际改进。因此,我们呼吁范式转变:嵌入研究应优先考虑具有语言学基础且多样化的训练数据,开发探测更深层语义理解的基准,并将隐含意义作为核心建模目标,以更好地使嵌入与现实世界的语言复杂性对齐。代码可在 http://github.com/dukesun99/Implicit-Embeddings 获取。

英文摘要

This position paper argues that text embedding research should move beyond surface meaning and embrace implicit semantics as a central modeling objective. Text embeddings are a foundational component of modern NLP, underpinning a wide range of applications and driving sustained research progress. Despite rapid progress, most embedding models remain narrowly focused on surface-level semantics, whereas linguistic theory emphasizes that much of human meaning is implicit, shaped by pragmatics, speaker intent, and sociocultural context. Current models are typically trained on datasets that lack such depth and evaluated using benchmarks that reward surface similarity. As a result, they struggle with tasks that require interpretive reasoning, stance recognition, or socially grounded understanding. Our pilot study makes this limitation explicit, showing that even state-of-the-art embeddings achieve only marginal improvements over simple lexical baselines on tasks probing implicit semantics. We therefore call for a paradigm shift: embedding research should prioritize linguistically grounded and diverse training data, develop benchmarks that probe deeper semantic understanding, and treat implicit meaning as a core modeling objective to better align embeddings with real-world language complexity. The code is available at http://github.com/dukesun99/Implicit-Embeddings.

2506.06254 2026-05-29 cs.AI cs.CL cs.LG 版本更新

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

PersonaAgent:弥合个性化LLM智能体的记忆与行动

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

发表机构 * Amazon Stores Foundational AI(亚马逊基础AI)

AI总结 提出PersonaAgent框架,通过整合个性化记忆模块(情景与语义记忆)和行动模块,并利用角色提示作为中介实现记忆与行动的协同,以解决LLM智能体的个性化任务。

Comments Accepted in ACL 2026

详情
AI中文摘要

由大型语言模型驱动的智能体近期作为先进范式出现,在广泛领域和任务中展现出令人印象深刻的能力。尽管潜力巨大,当前LLM智能体常采用一刀切方法,缺乏响应用户不同需求和偏好的灵活性。这一局限促使我们开发PersonaAgent——首个旨在处理多样化个性化任务的个性化LLM智能体框架。具体而言,PersonaAgent整合了两个互补组件:一个包含情景记忆和语义记忆机制的个性化记忆模块;一个使智能体能够执行针对用户定制的工具行动的个性化行动模块。核心在于,角色(定义为每位用户独特的系统提示)充当中间件:它利用来自个性化记忆的洞察来控制智能体行动,而这些行动的结果反过来又优化记忆。基于该框架,我们提出一种测试时用户偏好对齐策略,该策略模拟最近的n次交互以优化角色提示,通过模拟响应与真实响应之间的文本损失反馈确保实时用户偏好对齐。实验评估表明,PersonaAgent不仅有效个性化行动空间,还能在测试时实际应用中扩展,显著优于其他基线方法。这些结果证明了我们的方法在提供定制化、动态用户体验方面的可行性和潜力。

英文摘要

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

2505.24503 2026-05-29 cs.GT cs.AI 版本更新

Online Fair Division with Additional Information

在线公平分配与额外信息

Tzeh Yuan Neoh, Jannik Peters, Nicholas Teh

发表机构 * Harvard University, USA(哈佛大学) Shanghai University of Finance and Economics, China(上海财经大学) University of Oxford, UK(牛津大学)

AI总结 研究在线公平分配不可分割物品问题,通过引入归一化信息和频率预测,实现了比以往更强的公平性保证,并提供了学习增强的鲁棒变体。

Comments Appears in the 43rd International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

我们研究了在在线环境中公平分配不可分割物品给智能体的问题,其中物品顺序到达且必须不可撤销地分配。聚焦于流行的公平概念——无嫉妒、比例性和最大最小份额公平(及其近似变体),我们探讨了对未来信息的访问如何改变可实现的保证。在没有信息的情况下,我们证明了即使是近似公平也存在强不可能性结果。在归一化信息(智能体的总价值)下,我们提供了一种算法,实现了比以往已知结果更强的公平性保证,并展示了更强概念的匹配不可能性。在频率预测(无顺序的价值多重集)下,我们设计了一种元算法,将一大类离线“基于份额”的保证提升到在线环境,匹配了已知的最佳离线界限。最后,我们提供了两种模型的学习增强变体:在有噪声的总和或有噪声的频率预测下,我们的保证是鲁棒的,并随误差参数优雅地退化。

英文摘要

We study the problem of fairly allocating indivisible goods to agents in an online setting, where goods arrive sequentially and must be allocated irrevocably. Focusing on the popular fairness notions of envy-freeness, proportionality, and maximin share fairness (and their approximate variants), we investigate how access to future information changes what guarantees are achievable. Without any information, we prove strong impossibility results even for approximate fairness. With normalization information (agents' total values), we provide an algorithm that achieves stronger fairness guarantees than previously known results, and show matching impossibilities for stronger notions. With frequency predictions (value multisets without order), we design a meta-algorithm that lifts a broad class of offline "share-based" guarantees to the online setting, matching the best-known offline bounds. Finally, we provide learning-augmented variants of both models: under noisy totals or noisy frequency predictions, our guarantees are robust and degrade gracefully with the error parameters.
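
Two of the fairness notions named in this abstract can be made concrete with a small checker. The sketch assumes additive valuations, which the abstract does not state, and it tests a finished allocation only; it does not implement the paper's online algorithms. The function names and data layout are hypothetical.

```python
def bundle_value(values, bundle):
    """Additive value of a bundle: sum of the agent's values for its goods."""
    return sum(values[g] for g in bundle)

def is_envy_free(valuations, allocation):
    """True if no agent values another agent's bundle above its own.

    valuations[i][g] is agent i's value for good g;
    allocation[i] is the list of good indices given to agent i.
    """
    n = len(valuations)
    return all(
        bundle_value(valuations[i], allocation[i])
        >= bundle_value(valuations[i], allocation[j])
        for i in range(n)
        for j in range(n)
    )

def is_proportional(valuations, allocation):
    """True if every agent gets at least 1/n of its total value for all goods."""
    n = len(valuations)
    return all(
        n * bundle_value(valuations[i], allocation[i]) >= sum(valuations[i])
        for i in range(n)
    )
```

Note that `sum(valuations[i])` is exactly the "normalization information" the abstract refers to: an agent's total value over all goods, which an online algorithm would otherwise not know until the last good arrives.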

2505.21996 2026-05-29 cs.CV cs.AI 版本更新

VRAG: Learning World Models for Interactive Video Generation

VRAG:面向交互式视频生成的世界模型学习

Taiye Chen, Xun Hu, Zihan Ding, Chi Jin

发表机构 * Peking University(北京大学) University of Oxford(牛津大学) Princeton University(普林斯顿大学)

AI总结 针对自回归视频生成中累积误差和记忆机制不足的问题,提出视频检索增强生成(VRAG)方法,通过显式全局状态条件降低长期累积误差并提升时空一致性。

Comments Published at NeurIPS 2025. Project page: https://sites.google.com/view/vrag

详情
AI中文摘要

基础世界模型必须既具有交互性,又能保持时空连贯性,以便通过动作选择进行有效的未来规划。然而,当前的长时间视频生成模型由于两个主要挑战而具有有限的内在世界建模能力:累积误差和记忆机制不足。我们通过额外的动作条件和自回归框架增强了图像到视频模型的交互能力,并揭示了在自回归视频生成中累积误差本质上是不可约的,而记忆机制不足则导致世界模型的不连贯。我们提出了带有显式全局状态条件的视频检索增强生成(VRAG),它显著减少了长期累积误差并提高了世界模型的时空一致性。相比之下,具有扩展上下文窗口的朴素自回归生成和检索增强生成在视频生成中被证明效果较差,这主要是由于当前视频模型有限的上下文学习能力。我们的工作阐明了视频世界模型中的基本挑战,并为改进具有内在世界建模能力的视频生成模型建立了全面的基准。

英文摘要

Foundational world models must be both interactive and preserve spatiotemporal coherence for effective future planning with action choices. However, present models for long video generation have limited inherent world modeling capabilities due to two main challenges: compounding errors and insufficient memory mechanisms. We enhance image-to-video models with interactive capabilities through additional action conditioning and an autoregressive framework, and reveal that compounding error is inherently irreducible in autoregressive video generation, while insufficient memory mechanisms lead to incoherence of world models. We propose video retrieval augmented generation (VRAG) with explicit global state conditioning, which significantly reduces long-term compounding errors and increases spatiotemporal consistency of world models. In contrast, naive autoregressive generation with extended context windows and retrieval-augmented generation prove less effective for video generation, primarily due to the limited in-context learning capabilities of current video models. Our work illuminates the fundamental challenges in video world models and establishes a comprehensive benchmark for improving video generation models with internal world modeling capabilities.

2505.21876 2026-05-29 cs.CV cs.AI 版本更新

EPiC: Efficient Video Camera Control Learning with Precise Anchor-Video Guidance

EPiC: 基于精确锚点视频引导的高效视频摄像机控制学习

Zun Wang, Jaemin Cho, Jialu Li, Han Lin, Jaehong Yoon, Yue Zhang, Mohit Bansal

AI总结 提出EPiC框架,通过基于首帧可见性掩码构建精确对齐的锚点视频,并引入轻量模块Anchor-ControlNet,以极低参数实现高效、精确的3D摄像机控制,在RealEstate10K和MiraData上达到最先进性能。

Comments Accepted to ICML 2026. Project website: https://zunwang1.github.io/Epic

详情
AI中文摘要

近期带摄像机控制的视频生成方法通常通过从估计的点云沿摄像机轨迹渲染,创建锚点视频(即近似所需摄像机运动的渲染视频),以作为结构化先验引导扩散模型。然而,点云和摄像机轨迹估计中的误差常导致不准确的锚点视频,并带来更高的训练成本和低效率,因为模型被迫补偿渲染错位。为解决这些局限,我们提出EPiC,一种高效且精确的摄像机控制学习框架,无需摄像机姿态或点云估计即可构建良好对齐的训练锚点视频。具体而言,我们通过基于首帧可见性掩码掩蔽源视频来创建高精度锚点视频,这确保了强对齐,消除了对摄像机/点云估计的需求,因此可轻松应用于任意野外视频。此外,我们引入Anchor-ControlNet,一种轻量模块,将可见区域中的锚点视频引导集成到预训练视频扩散模型中,仅增加不到1%的额外参数。EPiC以显著更少的参数、训练步骤和数据实现高效训练,并在测试时对使用点云制作的锚点视频具有鲁棒泛化能力,从而实现精确的3D感知摄像机控制。EPiC在RealEstate10K和MiraData上的I2V摄像机控制任务中达到最先进性能。值得注意的是,EPiC还展现出对视频到视频(V2V)场景的强零样本泛化能力。

英文摘要

Recent approaches for video generation with camera control often create anchor videos (i.e., rendered videos that approximate desired camera motions) to guide diffusion models as a structured prior, by rendering from estimated point clouds following camera trajectories. However, errors in point cloud and camera trajectory estimation often lead to inaccurate anchor videos with higher training cost and low efficiency, as the model is forced to compensate for rendering misalignments. To address these limitations, we introduce EPiC, an efficient and precise camera control learning framework that constructs well-aligned training anchor videos without the need for camera pose or point cloud estimation. Concretely, we create highly precise anchor videos by masking source videos based on first-frame visibility, which ensures strong alignment, eliminates the need for camera/point cloud estimation, and thus can be readily applied to any in-the-wild video. Furthermore, we introduce Anchor-ControlNet, a lightweight module that integrates anchor video guidance in visible regions into pretrained video diffusion models, with less than 1% of additional parameters. EPiC achieves efficient training with substantially fewer parameters, fewer training steps, and less data, and generalizes robustly to anchor videos made with point clouds at test time, enabling precise 3D-informed camera control. EPiC achieves SoTA performance on RealEstate10K and MiraData for the I2V camera control task. Notably, EPiC also exhibits strong zero-shot generalization to video-to-video (V2V) scenarios.

2505.10975 2026-05-29 cs.CL cs.AI cs.SD eess.AS 版本更新

Survey of End-to-End Multi-Speaker Automatic Speech Recognition for Monaural Audio

单声道音频的端到端多说话人自动语音识别综述

Xinlu He, Jacob Whitehill

发表机构 * Worcester Polytechnic Institute(沃斯特理工大学)

AI总结 本文系统综述了端到端多说话人自动语音识别的神经架构范式(SIMO与SISO)、近期改进方法及长语音扩展策略,并通过标准基准评估比较了各类方法。

Comments Accepted for publication in Computer Speech & Language (CSL)

详情
AI中文摘要

单声道多说话人自动语音识别(ASR)由于数据稀缺以及识别并将词语归因于单个说话人的内在困难(尤其是在重叠语音中)仍然具有挑战性。最近的进展推动了从级联系统向端到端(E2E)架构的转变,这减少了错误传播并更好地利用了语音内容与说话人身份之间的协同作用。尽管端到端多说话人ASR取得了快速进展,但该领域缺乏对近期发展的全面综述。本综述为多说话人ASR的端到端神经方法提供了一个系统的分类法,突出了近期进展和比较分析。具体而言,我们分析了:(1)用于预分割音频的架构范式(SIMO与SISO),分析了它们的不同特征和权衡;(2)基于这两种范式的近期架构和算法改进;(3)对长语音的扩展,包括分割策略和说话人一致性的假设拼接。此外,我们(4)在标准基准上评估和比较了各种方法。最后,我们讨论了构建鲁棒且可扩展的多说话人ASR所面临的开放挑战和未来研究方向。

英文摘要

Monaural multi-speaker automatic speech recognition (ASR) remains challenging due to data scarcity and the intrinsic difficulty of recognizing and attributing words to individual speakers, particularly in overlapping speech. Recent advances have driven the shift from cascade systems to end-to-end (E2E) architectures, which reduce error propagation and better exploit the synergy between speech content and speaker identity. Despite rapid progress in E2E multi-speaker ASR, the field lacks a comprehensive review of recent developments. This survey provides a systematic taxonomy of E2E neural approaches for multi-speaker ASR, highlighting recent advances and comparative analysis. Specifically, we analyze: (1) architectural paradigms (SIMO vs. SISO) for pre-segmented audio, including their distinct characteristics and trade-offs; (2) recent architectural and algorithmic improvements based on these two paradigms; (3) extensions to long-form speech, including segmentation strategy and speaker-consistent hypothesis stitching. Further, we (4) evaluate and compare methods across standard benchmarks. We conclude with a discussion of open challenges and future research directions towards building robust and scalable multi-speaker ASR.

2503.13844 2026-05-29 cs.CL cs.AI cs.CY cs.LG 版本更新

Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

检测社交媒体上的说服:从模型开发到说服策略的洞察

Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

发表机构 * University of Queensland(昆士兰大学) The Chinese University of Hong Kong(香港中文大学)

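AI Summary This paper develops a lightweight model for persuasive text detection (state-of-the-art on Subtask 3 of SemEval 2023 Task 3) and applies it to the Australian Federal Election 2022 Facebook Ads dataset, revealing patterns in how political campaigns use persuasion across funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as the election approaches.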

Journal ref
Proceedings of the International AAAI Conference on Web and Social Media 20(1) (2026) 1587-1608
AI中文摘要

政治广告通过嵌入更广泛宣传策略中的微妙说服技巧,在塑造公众舆论和影响选举结果方面发挥着关键作用。检测这些说服元素对于提高选民意识和确保民主进程的透明度至关重要。本文通过两项相互关联的研究,提出了一种连接模型开发与实际应用的综合方法。首先,我们引入了一个轻量级说服文本检测模型,该模型在SemEval 2023任务3子任务3中达到了最先进性能,同时所需的计算资源和训练数据远少于现有方法。其次,我们通过收集澳大利亚联邦选举2022 Facebook广告(APA22)数据集,对其中一部分进行说服标注,并对模型进行微调以使其从主流新闻适应社交媒体内容,从而展示了该模型的实际效用。然后,我们应用微调后的模型对APA22数据集的其余部分进行标注,揭示了政治竞选如何通过不同的资金策略、词汇选择、人口统计定位以及选举日临近时说服强度的时间变化来利用说服的独特模式。我们的发现不仅强调了分析社交媒体说服时领域特定建模的必要性,还展示了揭示这些策略如何能够增强透明度、告知选民并促进数字竞选中的问责制。

英文摘要

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model's practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

2502.16548 2026-05-29 cs.LG cs.AI cs.CV 版本更新

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

用于电影心脏磁共振-文本驱动的心力衰竭结局预测的可组合多模态框架

Jianzhou Chen, Jinyang Sun, Xiumei Wang, Xi Chen, Heyu Chu, Guo Song, Yuji Luo, Xingping Zhou, Rong Gu

发表机构 * Department of Cardiology, Nanjing Drum Tower Hospital, State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University(南京鼓楼医院心内科,南京大学国家药物生物技术重点实验室) School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University(上海交通大学电子信息与电气工程学院) College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications(南京邮电大学电子与光学工程学院) College of Integrated Circuit Science and Engineering, Nanjing University of Posts and Telecommunications(南京邮电大学集成电路科学与工程学院) Department of Cardiology, Nanjing Drum Tower Hospital Clinical College of Nanjing Medical University(南京医科大学南京鼓楼医院临床学院心内科) Institute of Quantum Information and Technology, Nanjing University of Posts and Telecommunications(南京邮电大学量子信息与技术研究院)

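AI Summary Proposes a composable multimodal framework that integrates cine CMR imaging, structured clinical metrics, and unstructured textual records to predict heart failure prognosis more accurately than single-modal AI algorithms and to support personalized treatment optimization.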

AI中文摘要

目的。根据世界卫生组织(WHO)及其他公共卫生机构的数据,心力衰竭是全球主要死因之一,每年导致数百万人死亡。尽管心力衰竭领域已取得显著进展,生存率和射血分数有所改善,但由于其复杂性和多因素特征,仍存在大量未满足的需求。本研究旨在提出并评估一种用于心力衰竭评估和治疗优化的可组合策略框架,旨在提供更全面的患者评估和管理。方法。该框架利用多模态算法分析全面的患者数据,明确整合了电影心脏磁共振(cine CMR)序列、结构化临床指标(如实验室结果、人口统计学数据)和非结构化文本记录(如病史、处方)。通过整合这些多种数据源,我们的框架为患者提供了更全面的评估和优化的治疗方案。主要结果。与单模态AI算法相比,该多模态框架在心力衰竭预后预测方面展现出更高的准确性。此外,它还能详细评估各种病理指标对心力衰竭结局的影响。意义。通过系统性地整合异质性临床数据,该方法支持更全面的预后评估,并有助于为心力衰竭患者制定优化的个性化治疗计划。

英文摘要

Objective. Heart failure is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and ejection fraction, there remain substantial unmet needs owing to the complexity and multifactorial nature of the condition. This study aims to propose and evaluate a composable strategy framework for assessment and treatment optimization in heart failure, designed to provide more holistic patient evaluation and management. Approach. The framework leverages multi-modal algorithms to analyze a comprehensive range of patient data, explicitly integrating cine cardiac magnetic resonance (cine CMR) sequences, structured clinical metrics (e.g., lab results, demographics), and unstructured textual records (e.g., medical history, prescriptions). By integrating these various data sources, our framework offers a more holistic evaluation and optimized treatment plan for patients. Main results. The multi-modal framework demonstrates superior accuracy in heart failure (HF) prognosis prediction compared to single-modal AI algorithms. Additionally, it enables a detailed evaluation of the impact of various pathological indicators on HF outcomes. Significance. By integrating heterogeneous clinical data in a systematic manner, this approach supports more comprehensive prognosis assessment and facilitates optimized, personalized treatment planning for heart failure patients.

2410.15236 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

大语言模型的越狱与漏洞缓解

Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

发表机构 * Hunan University Changsha, PRC Georgia Institute of Technology Atlanta, USA Kyoto University Kyoto, Japan Purdue University West Lafayette, USA National Taiwan Normal University Taipei, ROC University of Liverpool Suzhou, PRC Hong Kong University of Science University of Hawaii Honolulu, USA The University of Texas at Dallas Dallas, USA University of Wisconsin-Madison Madison, USA Emory University Atlanta, USA College of William & Mary Williamsburg, USA

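AI Summary This paper surveys the vulnerabilities of large language models to prompt injection and jailbreaking attacks, categorizes attack methods, evaluates defense strategies, and identifies research gaps and future directions.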

Journal ref
Eureka 1(1) (2026) 26-61
AI中文摘要

大语言模型通过推进自然语言理解和生成,在医疗、软件工程和对话系统等领域实现了广泛应用,从而改变了人工智能。尽管在过去几年取得了这些进展,但大语言模型已显示出相当大的漏洞,特别是对提示注入和越狱攻击。本综述分析了这些漏洞的研究现状,并介绍了可用的防御策略。我们大致将攻击方法分为基于提示的、基于模型的、多模态的和多语言的,涵盖对抗性提示、后门注入和跨模态利用等技术。我们还回顾了各种防御机制,包括提示过滤、转换、对齐技术、多智能体防御和自律,评估了它们的优缺点。我们还讨论了用于评估大语言模型安全性和鲁棒性的关键指标和基准,指出了在交互环境中量化攻击成功率的挑战以及现有数据集中的偏差。通过识别当前研究空白,我们提出了未来在韧性对齐策略、针对不断演变的攻击的高级防御、越狱检测自动化以及考虑伦理和社会影响方面的方向。本综述强调了在人工智能社区内持续研究和合作的必要性,以增强大语言模型的安全性并确保其安全部署。

英文摘要

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields beyond healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

2410.10398 2026-05-29 cs.CE cs.AI 版本更新

Are LLMs Socially Adaptive? Contrasting Belief Evolution in Large Language Models and Humans

大型语言模型是否具有社会适应性?对比大型语言模型与人类的信念演化

Yu Lei, Hao Liu, Chengxing Xie, Songjia Liu, Zhiyu Yin, Canyu Chen, Guohao Li, Philip Torr, Zhen Wu

发表机构 * Tsinghua University(清华大学) Department of Psychological and Cognitive Sciences(心理与认知科学系) College AI(人工智能学院) School of Management(管理学院) Fudan University(复旦大学) Stevens Institute of Technology(史蒂文斯理工学院) Northwestern University(西北大学) University of Oxford(牛津大学)

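AI Summary This study proposes FairMindSim, a simulation benchmark grounded in social psychology, and the Belief-Reward Alignment Behavior Evolution Model (BREM), and compares human and LLM decision dynamics through continuous economic games. It finds that mid-capability models exhibit rigid aggression marked by over-punishment, while frontier models move toward human-like restraint and leniency as reasoning capability improves.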

Comments KDD 2026 Oral

AI中文摘要

随着大型语言模型(LLM)越来越多地参与复杂的社会互动,确保其行为符合人类伦理原则和意图(即价值对齐)已成为一项关键的科学挑战。现有基准通常依赖静态评估,未能捕捉决策的纵向动态或驱动智能体行为的潜在认知过程。在这项工作中,我们提出了FairMindSim,一个基于社会心理学的现实仿真基准,通过连续经济游戏评估对齐性。为了超越黑箱观察,我们引入了信念-奖励对齐行为演化模型(BREM),这是一个概率框架,将决策形式化为最大化外在奖励与维护内在信念之间的动态权衡。我们进行了一项大规模比较研究,涉及1,017名人类参与者和十个LLM,包括GPT-5和Gemini-3-Pro。我们的实验结果揭示了第三方惩罚(TPP)游戏中一种与能力相关的非线性经验趋势。中等能力模型表现出僵化且算法化的攻击性,其特征是过度惩罚,而前沿模型则展现出克制收敛,并随着推理能力的扩展向类似人类的宽容转变。此外,利用BREM,我们分解了智能体的纵向决策动态,发现更先进的模型通过减少信念-行为不一致性,更好地平衡了相互冲突的目标。我们的贡献为心理压力测试提供了一个标准化协议,并为在受控社会困境环境中分析AI对齐的纵向演化提供了一种可解释的机制。

英文摘要

As large language models (LLMs) increasingly engage in complex social interactions, ensuring that their behaviors align with human ethical principles and intentions, known as value alignment, has become a critical scientific challenge. Existing benchmarks often rely on static assessments and fail to capture the longitudinal dynamics of decision-making or the latent cognitive processes driving agent behavior. In this work, we propose FairMindSim, a realistic simulation benchmark rooted in social psychology that evaluates alignment through continuous economic games. To move beyond black-box observations, we introduce the Belief-Reward Alignment Behavior Evolution Model (BREM), a probabilistic framework that formalizes decision-making as a dynamic trade-off between maximizing extrinsic rewards and upholding intrinsic beliefs. We conducted a large-scale comparative study involving 1,017 human participants and ten LLMs, including GPT-5 and Gemini-3-Pro. Our experimental results reveal a capability-linked, nonlinear empirical trend in the Third-Party Punishment (TPP) game. Mid-capability models exhibit rigid, algorithmic aggression characterized by over-punishment, while frontier models show a convergence toward restraint and a shift toward human-like leniency as reasoning capabilities scale. Furthermore, using BREM, we decompose agents' longitudinal decision dynamics and find that more advanced models better balance conflicting objectives by reducing belief-action inconsistency. Our contributions provide a standardized protocol for psychological stress testing and an interpretable mechanism for analyzing the longitudinal evolution of AI alignment in controlled social dilemma settings.

2306.10356 2026-05-29 cs.LG cs.AI eess.SP 版本更新

MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting

MATNet:基于多层级融合Transformer的日前光伏发电预测模型

Matteo Tortora, Francesco Conte, Gianluca Natrella, Paolo Soda

发表机构 * Department of Naval, Electrical, Electronics Telecommunications Engineering, University of Genoa, Via all’Opera Pia 11a, 16145 Genoa, Italy Unit of Innovation, Entrepreneurship & Sustainability, Department of Engineering, University Campus Bio-Medico of Rome Via Alvaro del Portillo 21, 00128 Rome, Italy Computer Systems Department of Engineering, University Campus Bio-Medico of Rome Via Alvaro del Portillo 21, 00128 Rome, Italy

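AI Summary Proposes MATNet, a multimodal architecture based on a multi-level fusion Transformer that applies multi-level joint fusion and soft attention to historical PV data and weather data. It significantly outperforms baseline models in multi-step day-ahead PV generation forecasting (RMSE 0.0445, a relative improvement of about 65%) and shows robustness to missing data and cross-domain zero-shot generalization.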

AI中文摘要

可再生能源发电的准确预测对于促进可再生能源融入电力系统至关重要。聚焦光伏(PV)单元,预测方法主要分为基于物理和基于数据两大类,其中基于人工智能(AI)的模型提供了最先进的性能。然而,这些基于AI的模型虽然能够捕捉数据中的复杂模式和关系,却忽略了现象背后的物理先验知识。因此,本文提出MATNet,一种新颖的基于Transformer的多模态架构,用于多步日前光伏发电预测。该模型通过多层级联合融合方法输入历史光伏数据以及历史和预报气象数据,在多个融合阶段采用软注意力机制。我们在Ausgrid基准数据集上评估了MATNet的有效性,其显著优于各种基线模型,实现了0.0445的RMSE,相比表现最佳的基线方法相对提升约65%。分析进一步通过一系列消融研究、对缺失数据的敏感性分析(突显了MATNet对输入退化的鲁棒性)、在五个外部光伏数据集上的跨站点零样本泛化评估(证明了MATNet在显著域偏移下的鲁棒性)以及对模型计算复杂度的评估(确认了其在预测精度与计算效率之间的良好平衡)得到丰富。这些结果凸显了MATNet作为促进光伏能源融入电网的可靠且高效解决方案的潜力。代码可在https://github.com/arco-group/MATNet获取。

英文摘要

Accurate forecasting of renewable generation is crucial to facilitate the integration of Renewable Energy Sources into the power system. Focusing on photovoltaic (PV) units, forecasting methods can be divided into two main categories: physics-based and data-based strategies, with Artificial Intelligence (AI)-based models providing state-of-the-art performance. However, while these AI-based models can capture complex patterns and relationships in the data, they ignore the underlying physical prior knowledge of the phenomenon. Therefore, in this paper, we propose MATNet, a novel transformer-based multimodal architecture for multi-step day-ahead PV power generation forecasting. The model is fed with historical PV data and historical and forecast weather data through a multi-level joint fusion approach, employing a soft-attention mechanism at multiple fusion stages. We evaluate the effectiveness of MATNet on the Ausgrid benchmark dataset, where it significantly outperforms various baseline models, achieving an RMSE of 0.0445, corresponding to a relative improvement of approximately 65% compared to the best-performing baseline method. The analysis is further enriched by a comprehensive set of ablation studies, a sensitivity analysis on missing data, which highlights MATNet's resilience to input degradation, a cross-site zero-shot generalization evaluation on five external PV datasets, demonstrating MATNet's robustness under significant domain shifts, and an assessment of the model's computational complexity, confirming its favorable balance between predictive accuracy and computational efficiency. These results highlight MATNet's potential as a reliable and efficient solution to facilitate the integration of PV energy into the power grid. The code is available at https://github.com/arco-group/MATNet.
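
To make the headline numbers in the abstract above concrete, the following minimal sketch shows how an RMSE and a relative improvement over a baseline are typically computed. The function names are hypothetical, and the baseline RMSE is not stated in the abstract: the value derived here is only what "approximately 65%" would imply if the improvement is defined as (baseline - model) / baseline, which is an assumption.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_rmse, model_rmse):
    """Fractional error reduction relative to the baseline (assumed definition)."""
    return (baseline_rmse - model_rmse) / baseline_rmse


def implied_baseline(model_rmse, improvement):
    """Baseline RMSE implied by a model RMSE and a fractional improvement."""
    return model_rmse / (1.0 - improvement)


# Reported in the abstract: RMSE 0.0445, roughly 65% relative improvement.
baseline_estimate = implied_baseline(0.0445, 0.65)  # about 0.127, not a reported figure
```

Because the abstract gives the improvement only approximately, the implied baseline should be read as a back-of-the-envelope figure rather than a reported result.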