arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.15971 2026-05-26 cs.RO

OHP-RL: Online Human Preference as Guidance in Reinforcement Learning for Robot Manipulation

Yunyang Mo, Jian Li, Qiwei Wu, Yihang Kang, Renjing Xu

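Affiliations: The Hong Kong University of Science and Technology (Guangzhou)

AI summary: Proposes the OHP-RL framework, which treats human interventions as preference information and uses a state-dependent preference gate to adaptively regulate policy learning, achieving high success rates, fast convergence, and low human intervention effort on contact-rich manipulation tasks with a Franka robot.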

Abstract

While reinforcement learning (RL) enables robots to acquire skills autonomously, its real-world deployment is severely limited by inefficient and unsafe exploration. Human-in-the-loop interventions offer a practical solution, yet existing methods typically exploit these interventions as auxiliary training signals, without fully capturing the richer information they provide about when and how autonomy should be guided. Human interventions often encode relative preferences over behavior under safety and task constraints, rather than prescribing exact actions to imitate. Motivated by this perspective, we propose Online Human Preference as Guidance in Reinforcement Learning (OHP-RL), a framework that leverages human interventions as preference information to guide policy learning. OHP-RL introduces a state-dependent preference gate that adaptively regulates when and to what extent human interventions should shape policy learning. This design enables the agent to benefit from intermittent and imperfect human feedback while preserving autonomous exploration and stable policy optimization. We evaluate OHP-RL on three challenging real-world contact-rich manipulation tasks on a Franka robot. Across all tasks, OHP-RL consistently achieves strong success rates, faster convergence, and substantially lower human intervention effort than prior approaches. Moreover, the learned policies exhibit more stable and human-aligned behavior throughout training.

2605.15777 2026-05-26 cs.AI

SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?

Kean Shi, Zihang Li, Tianyi Ma, Zengji Tu, Jialong Wu, Xinbo Xu, Qingyao Yang, Ruoyu Wu, Weichu Xie, Ming Wu, Jason Zeng, Michael Heinrich, Elvis Zhang, Liang Chen, Kuan Li, Baobao Chang

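Affiliations: UniPat AI; PKU (Peking University); HKU (The University of Hong Kong); 0G Labs; Pipeline Lab

AI summary: Introduces SaaS-Bench, a benchmark with 23 deployable SaaS systems and 106 tasks grounded in realistic work scenarios, to evaluate computer-using agents on long-horizon planning, cross-application coordination, and related capabilities; even the strongest model completes fewer than 4% of tasks end-to-end.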

Comments: 24 pages, 11 figures

Abstract

Computer-Using Agents (CUAs) are rapidly extending large language models (LLMs) beyond text-based reasoning toward action execution in more complex environments, such as web browsers and graphical user interfaces (GUIs). However, existing web and GUI agent benchmarks often rely on simplified settings, isolated tasks, or short-horizon interactions, making it difficult to assess agents' capabilities in realistic professional workflows. Software-as-a-Service (SaaS) environments are a natural choice for CUA evaluation, as they host a large share of modern digital work and naturally involve dynamic system states, cross-application coordination, domain-specific knowledge, and long-horizon dependencies. To this end, we introduce SaaS-Bench, a benchmark built on 23 deployable SaaS systems across six professional domains, containing 106 tasks grounded in realistic work scenarios. These tasks require long-horizon execution, cover both text-only and multimodal settings, and are evaluated with weighted verification checkpoints that measure strict task completion and partial progress. Experiments show that representative LLM-based agents struggle on SaaS-Bench, with even the strongest model completing fewer than 4% of tasks end-to-end, exposing limitations in planning, state tracking, cross-application context maintenance, and error recovery. Code is available at https://github.com/UniPat-AI/SaaS-Bench for reproduction.
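
The abstract does not say how the weighted verification checkpoints are combined into a score, so the following is only a minimal sketch under assumptions: partial progress is taken to be the weight-normalized share of passed checkpoints, and strict completion is taken to require every checkpoint to pass. The `Checkpoint` type, both function names, and the example task are hypothetical and are not taken from the SaaS-Bench code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    """One verification checkpoint (hypothetical structure, not from the paper)."""
    name: str
    weight: float
    passed: bool


def partial_progress(checkpoints: list[Checkpoint]) -> float:
    """Weight-normalized share of passed checkpoints, in [0, 1] (assumed definition)."""
    total = sum(c.weight for c in checkpoints)
    if total <= 0:
        raise ValueError("checkpoint weights must sum to a positive value")
    return sum(c.weight for c in checkpoints if c.passed) / total


def strict_completion(checkpoints: list[Checkpoint]) -> bool:
    """Assumed definition: a task is complete only if every checkpoint passes."""
    return all(c.passed for c in checkpoints)


task = [
    Checkpoint("invoice created", 2.0, True),
    Checkpoint("customer record linked", 1.0, True),
    Checkpoint("payment reconciled", 1.0, False),
]
print(partial_progress(task))   # 0.75
print(strict_completion(task))  # False
```

Under these assumptions an agent can score well on partial progress while still failing strict completion, which is the kind of gap the two measurements are meant to expose.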

2605.15759 2026-05-26 cs.CL

DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory

Wentao Qiu, Haotian Hu, Fanyi Wang, Jinwei Kong, Yu Zhang

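Affiliations: StepOS; Xiamen University; ShanghaiTech University

AI summary: Proposes DimMem, a dimensional memory framework that enables dimension-aware retrieval and update through atomic, typed, self-contained memory units with explicit fields such as time, location, and reason; it reaches 81.43% and 78.20% accuracy on LoCoMo-10 and LongMemEval-S, respectively, while cutting LoCoMo per-query token cost by 24%.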

Abstract

Large language model (LLM) agents require long-term memory to leverage information from past interactions. However, existing memory systems often face a fidelity-efficiency trade-off: raw dialogue histories are expensive, while flat facts or summaries may discard the structure needed for precise recall. We propose DimMem, a lightweight dimensional memory framework that represents each memory as an atomic, typed, and self-contained unit with explicit fields such as time, location, reason, purpose, and keywords. This representation exposes the structure needed for dimension-aware retrieval, memory update, and selective assistant-context recall without storing full histories in the model context. Across LoCoMo-10 and LongMemEval-S, DimMem achieves 81.43% and 78.20% overall accuracy, respectively, outperforming existing lightweight memory systems while reducing LoCoMo per-query token cost by 24%. We further show that dimensional memory extraction is learnable by compact models: after fine-tuning on the DimMem schema, a Qwen3-4B extractor surpasses LightMem with GPT-4.1-mini on both benchmarks and reaches performance comparable to, or better than, much larger extractors in key settings. These results suggest that explicit dimensional structuring is an effective and efficient foundation for long-term memory in LLM agents. Code is available at https://github.com/ChowRunFa/DimMem.
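
To make the idea of an atomic, typed memory unit concrete, here is a small sketch. Only the field names (time, location, reason, purpose, keywords) come from the abstract; the `MemoryUnit` layout, the `mem_type` values, and the exact-match `retrieve` helper are assumptions for illustration and do not reflect the actual DimMem schema or retrieval method.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class MemoryUnit:
    """Atomic, typed memory record; field names follow the abstract, the layout is assumed."""
    content: str
    mem_type: str                      # e.g. "event" or "preference" (hypothetical types)
    time: Optional[str] = None
    location: Optional[str] = None
    reason: Optional[str] = None
    purpose: Optional[str] = None
    keywords: list[str] = field(default_factory=list)


def retrieve(store: list[MemoryUnit], **dims: str) -> list[MemoryUnit]:
    """Return units whose named dimensions match exactly (a stand-in for real retrieval)."""
    def matches(unit: MemoryUnit) -> bool:
        for key, value in dims.items():
            if key == "keyword":
                if value not in unit.keywords:
                    return False
            elif getattr(unit, key) != value:
                return False
        return True
    return [u for u in store if matches(u)]


store = [
    MemoryUnit("Booked a flight to Lisbon", "event", time="2024-03-02",
               location="Lisbon", purpose="conference", keywords=["travel"]),
    MemoryUnit("Prefers window seats", "preference", keywords=["travel", "seating"]),
]
hits = retrieve(store, location="Lisbon", keyword="travel")
print([u.content for u in hits])  # ['Booked a flight to Lisbon']
```

Because each unit is self-contained, a query can filter on one dimension without replaying the dialogue that produced it, which is where the token savings described above would plausibly come from.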

2605.15011 2026-05-26 cs.CL

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

Peter A. Jansen

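Affiliations: University of Arizona; Allen Institute for Artificial Intelligence

AI summary: Defines automated technological roadmapping as extracting scientific contributions from scholarly papers and linking them to their prerequisites, builds a Scientific Contribution Graph for the AI/NLP domain with 2 million contributions and 12.5 million prerequisite edges, and introduces a scientific prerequisite prediction task on which contemporary models are improving rapidly.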

Comments: 8 pages, 5 figures

Abstract

Scientific contributions rarely develop in isolation, but instead build upon prior discoveries. We formulate the task of automated technological roadmapping as extracting scientific contributions from scholarly articles and linking them to their prerequisites. We present the Scientific Contribution Graph, a large-scale AI/NLP-domain resource containing 2 million detailed scientific contributions extracted from 230k open-access papers and connected by 12.5 million prerequisite edges. We further introduce scientific prerequisite prediction, a scientific discovery task in which models predict which existing technologies can enable future discoveries, and show that contemporary models are rapidly improving on this task, reaching 0.48 MAP when evaluated using temporally filtered backtesting. We anticipate technological roadmapping resources such as this will support scientific impact assessment and automated scientific discovery.
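
For readers unfamiliar with the MAP figure quoted above, the sketch below computes mean average precision in its standard ranked-retrieval form. The abstract does not state which MAP variant or cutoff the paper uses, so this is an assumption, and the example queries and item names are made up.

```python
def average_precision(ranked: list[str], relevant: set[str]) -> float:
    """AP for one query: mean of precision@k over the ranks k where a relevant item appears."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for k, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            precision_sum += hits / k
    return precision_sum / len(relevant)


def mean_average_precision(runs: list[tuple[list[str], set[str]]]) -> float:
    """Mean of per-query AP over all (ranking, relevant set) pairs."""
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)


runs = [
    (["attention", "lstm", "bpe"], {"attention", "bpe"}),  # AP = (1/1 + 2/3) / 2
    (["cnn", "word2vec"], {"word2vec"}),                   # AP = (1/2) / 1
]
print(round(mean_average_precision(runs), 4))  # 0.6667
```

In a prerequisite-prediction setting, each query would be a future contribution and the relevant set its true prerequisites, ranked among candidate prior technologies.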

2605.14890 2026-05-26 cs.CL cs.AI

Tokenizer Fertility and Zero-Shot Performance of Foundation Models on Ukrainian Legal Text: A Comparative Study

Volodymyr Ovcharov

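Affiliations: LEX AI Platform; legal.org.ua; Kyiv, Ukraine

AI summary: Compares tokenizer fertility and zero-shot performance of seven foundation models on Ukrainian legal text. Tokenizer fertility varies by 1.6x, Qwen 3 models consume 60% more tokens than Llama-family models, and NVIDIA Nemotron Super 3 (120B) achieves the best performance at lower cost; the study also documents few-shot degradation on a morphologically rich language and the effect of wartime legal language on model generalization.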

Comments: 25 pages, 13 tables, 5 figures; v2 adds cross-temporal generalization experiment and classical baseline

Abstract

Tokenizer fertility varies 1.6x across foundation models on Ukrainian legal text, yet this cost-critical dimension is absent from model selection practice. We benchmark seven models from five providers on 273 validated court decisions from Ukraine's state registry (EDRSR), measuring tokenizer fertility and zero-shot performance on three tasks. Four findings emerge. (1) Qwen 3 models consume 60% more tokens than Llama-family models on identical input, making tokenizer analysis a prerequisite for cost-efficient deployment. (2) NVIDIA Nemotron Super 3 (120B) achieves the highest composite score (83.1), outperforming Mistral Large 3 (5.6x more total parameters) at one-third the API cost; model scale is a poor proxy for domain performance. (3) Few-shot prompting degrades performance by up to 26 percentage points; stratified and prompt-sensitivity ablations confirm this is intrinsic to Ukrainian-language demonstrations, not an artifact of example selection. (4) A cross-temporal generalization experiment reveals that classifiers trained on pre-war court decisions (2008-2013) lose 27.9 percentage points when applied to full-scale invasion era decisions (2022-2026), with a pronounced forward-backward asymmetry: newer models transfer backward (+14.6 pp above forward transfer), but older models fail catastrophically on wartime legal language. For practitioners: tokenizer analysis should precede model selection, and zero-shot is a more reliable default than few-shot for morphologically rich languages. To support reproducibility and address the absence of Ukrainian from legal NLP benchmarks, we release a public dataset of 14,452 court decisions spanning 2008-2026, annotated with seven outcome labels across three temporal epochs that capture the impact of armed conflict on judicial proceedings.
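
Tokenizer fertility is commonly measured as the average number of tokens produced per word. The sketch below uses that common definition with whitespace-delimited words; the abstract does not spell out the paper's exact formula, so treat this as an assumption. The `chunk_tokenizer` helper is a toy stand-in for a real model tokenizer and exists only to make the example runnable.

```python
from typing import Callable


def fertility(texts: list[str], tokenize: Callable[[str], list[str]]) -> float:
    """Average number of tokens per whitespace-delimited word."""
    n_tokens = sum(len(tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    if n_words == 0:
        raise ValueError("no words in input")
    return n_tokens / n_words


def chunk_tokenizer(size: int) -> Callable[[str], list[str]]:
    """Toy tokenizer (hypothetical): splits every word into fixed-size character chunks."""
    def tokenize(text: str) -> list[str]:
        return [w[i:i + size] for w in text.split() for i in range(0, len(w), size)]
    return tokenize


docs = ["суд постановив рішення"]              # words of 3, 10 and 7 characters
coarse = fertility(docs, chunk_tokenizer(4))   # 1 + 3 + 2 = 6 tokens over 3 words
fine = fertility(docs, chunk_tokenizer(2))     # 2 + 5 + 4 = 11 tokens over 3 words
print(coarse)                    # 2.0
print(round(fine / coarse, 2))   # 1.83
```

With per-token pricing, the fertility ratio between two tokenizers translates directly into a cost ratio for the same input, which is why the abstract treats it as a model selection criterion.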

2605.14559 2026-05-26 cs.AI math.OC

PyCSP3-Scheduling: A Scheduling Extension for PyCSP3

Sohaib Afifi

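Affiliations: Univ. Artois, UR 3926, Laboratoire de Génie Informatique et d’Automatique de l’Artois (LGI2A)

AI summary: Presents PyCSP3 Scheduling, a library that adds scheduling abstractions to PyCSP3 through 53 dedicated constraints and 27 expressions and compiles them to standard constraints. On 261 paired instances, objectives match the original formulations on all 72 doubly-proved optimal pairs, while runtime varies across families because of compilation overhead.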

Abstract

PyCSP3 provides a productive way to build constraint models for solving combinatorial constrained problems and export them to XCSP3, preserving a complete separation between modeling and solving. However, it lacks native support for scheduling abstractions such as interval variables, sequence variables, and resource functions. As a result, scheduling models must be encoded with low-level integer variables and manual channeling constraints, even though PyCSP3 already provides global constraints like NoOverlap and Cumulative on integer arrays. We present PyCSP3 Scheduling, a library that adds scheduling abstractions to PyCSP3 through 53 dedicated constraints and 27 expressions, and compiles them down to standard PyCSP3/XCSP3 constraints, maintaining the modeling/solving separation that underpins the PyCSP3 ecosystem. On 261 paired instances across 17 model families (5 runs each), both formulations produce identical objectives on all 72 doubly-proved optimal pairs and nearly half of the families (8/17) remain structurally unchanged after compilation; however, runtime performance diverges across families, with clear gains on some (up to 5.8x) and regressions on others due to the overhead of compilation decompositions. Code and benchmarks are available at: https://github.com/sohaibafifi/pycsp3-scheduling
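
The gap described above, between an interval-variable abstraction and its low-level integer encoding, can be illustrated without any solver. The sketch below is plain Python and deliberately does not use the PyCSP3 or PyCSP3 Scheduling APIs: the `Interval` class, the pairwise `no_overlap` decomposition, and the brute-force `min_makespan` search are all hypothetical stand-ins for what a real model would hand to a constraint solver.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Interval:
    """Interval variable sketched as an integer start domain plus a fixed length."""
    name: str
    starts: range
    length: int


def no_overlap(assignment: dict[str, int], intervals: list[Interval]) -> bool:
    """Pairwise decomposition: for every pair, one interval ends before the other starts."""
    for i, a in enumerate(intervals):
        for b in intervals[i + 1:]:
            sa, sb = assignment[a.name], assignment[b.name]
            if not (sa + a.length <= sb or sb + b.length <= sa):
                return False
    return True


def min_makespan(intervals: list[Interval]) -> int:
    """Brute-force search over start times; a real model would be passed to a solver."""
    best = None
    for starts in product(*(iv.starts for iv in intervals)):
        assignment = {iv.name: s for iv, s in zip(intervals, starts)}
        if no_overlap(assignment, intervals):
            makespan = max(s + iv.length for iv, s in zip(intervals, starts))
            best = makespan if best is None else min(best, makespan)
    if best is None:
        raise ValueError("no feasible schedule")
    return best


tasks = [
    Interval("cut", range(0, 6), 3),
    Interval("weld", range(0, 6), 2),
    Interval("paint", range(0, 6), 1),
]
print(min_makespan(tasks))  # 6
```

The pairwise disjunction is one possible decomposition; how a compiled model is decomposed is exactly the kind of choice the abstract links to runtime gains on some families and regressions on others.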

2605.14552 2026-05-26 cs.CV

LiWi: Layering in the Wild

Yu He, Fang Li, Haoyang Tong, Lichen Ma, Xinyuan Shan, Jingling Fu, Dong Chen, Luohang Liu, Junshi Huang, Yan Li

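Affiliations: MAIS & NLPR, CASIA (Institute of Automation, Chinese Academy of Sciences)

AI summary: Proposes a method based on agent-driven data decomposition and joint optimization of photometric fidelity and alpha boundary accuracy for high-fidelity layer decomposition of in-the-wild natural images; builds the LiWi-100k dataset and reports state-of-the-art performance.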

Comments: Project Page https://rassetmusty.github.io/LiWi

Abstract

Recent advances in generative models have enabled impressive layered image generation, yet their success is largely confined to graphic design domains. The layering of in-the-wild images remains an underexplored problem, limiting fine-grained editing and applications of images in real-world scenarios. Specifically, challenges remain in scalable layered data and the modeling of object interaction in natural images, such as illumination effects and structural boundaries. To address these bottlenecks, we propose a novel framework for high-fidelity natural image decomposition. First, we introduce an Agent-driven Data Decomposition (ADD) pipeline that orchestrates agents and tools to synthesize layered data without manual intervention. Utilizing this pipeline, we construct a large-scale dataset, named LiWi-100k, with over 100,000 high-quality layered in-the-wild images. Second, we present a novel framework that jointly improves photometric fidelity and alpha boundary accuracy. Specifically, shadow-guided learning explicitly models the illumination effects, and a degradation-restoration objective provides boundary-correction supervision by recovering a clean foreground image from a degraded one. Extensive experiments demonstrate that our framework achieves state-of-the-art (SoTA) performance in natural image decomposition, outperforming existing models in RGB L1 and Alpha IoU metrics. We will soon release our code and dataset.
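
As a rough illustration of what layer decomposition is evaluated against, the sketch below composites a foreground layer over a background with the standard "over" operator and computes two metrics in the spirit of RGB L1 and Alpha IoU. The abstract does not define either metric, so the mean-absolute-error form of `rgb_l1` and the soft (min over max) form of `alpha_iou` are assumptions, as are the tiny 2x2 arrays.

```python
import numpy as np


def composite(fg: np.ndarray, alpha: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Standard 'over' compositing: alpha * foreground + (1 - alpha) * background."""
    a = alpha[..., None]  # broadcast the H x W matte over the RGB channels
    return a * fg + (1.0 - a) * bg


def rgb_l1(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean absolute error over all pixels and channels (assumed metric definition)."""
    return float(np.mean(np.abs(pred - target)))


def alpha_iou(pred: np.ndarray, target: np.ndarray) -> float:
    """Soft IoU of two alpha mattes: sum of minima over sum of maxima (assumed definition)."""
    union = float(np.maximum(pred, target).sum())
    if union == 0.0:
        return 1.0  # both mattes empty, treated as a perfect match
    return float(np.minimum(pred, target).sum()) / union


fg = np.ones((2, 2, 3))    # white foreground layer
bg = np.zeros((2, 2, 3))   # black background layer
alpha_true = np.array([[1.0, 1.0], [0.0, 0.0]])
alpha_pred = np.array([[1.0, 0.5], [0.0, 0.0]])

print(rgb_l1(composite(fg, alpha_pred, bg), composite(fg, alpha_true, bg)))  # 0.125
print(alpha_iou(alpha_pred, alpha_true))                                     # 0.75
```

A single half-transparent boundary pixel already moves both numbers, which gives some intuition for why the abstract treats boundary accuracy as a separate objective.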

2605.13850 2026-05-26 cs.AI cs.MA cs.SE

A Two-Dimensional Framework for AI Agent Design Patterns: Cognitive Function and Execution Topology

Jia Huang, Joey Tianyi Zhou

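Affiliations: Agency for Science, Technology and Research (A*STAR); Centre for Frontier AI Research (CFAR)

AI summary: Proposes a two-dimensional classification framework combining cognitive function (7 categories) and execution topology (6 structures), identifies 28 named patterns, and derives five empirical laws of pattern selection from cross-domain analysis.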

Comments: 10 pages, 6 tables, 28 named patterns

Abstract

Existing frameworks for LLM-based agent architectures describe systems from a single perspective: industry guides (Anthropic, Google, LangChain) focus on execution topology -- how data flows -- while cognitive science surveys focus on cognitive function -- what the agent does. Neither axis alone disambiguates architecturally distinct systems: the same Orchestrator-Workers topology can implement Plan-and-Execute, Hierarchical Delegation, or Adversarial Verification -- three patterns with fundamentally different failure modes and design trade-offs. We propose a two-dimensional classification that combines (1) a Cognitive Function axis with seven categories (Perception, Memory, Reasoning, Action, Reflection, Collaboration, Governance) and (2) an Execution Topology axis with six structural archetypes (Chain, Route, Parallel, Orchestrate, Loop, Hierarchy). The resulting 7x6 matrix identifies 28 named patterns, 15 with original names. We demonstrate orthogonality through systematic cross-axis analysis, define eight representative patterns in detail, and validate descriptive coverage across four real-world domains (financial lending, legal due diligence, network operations, healthcare triage). Cross-domain analysis yields five empirical laws of pattern selection governing the relationship between environmental constraints (time pressure, action authority, failure cost asymmetry, volume) and architectural choices. The framework provides a principled, framework-neutral, and model-agnostic vocabulary for AI agent architecture design.

2605.13282 2026-05-26 cs.AI cs.LG

Differentiable Learning of Lifted Action Schemas for Classical Planning

Jonas Reiter, Jakob Elias Gebler, Hector Geffner

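Affiliations: RWTH Aachen University

AI summary: Proposes a neural network architecture that learns lifted action schemas from traces with fully observed states but unobserved action arguments, aiming at near-perfect recovery of the ground-truth structure.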

Abstract

Classical planners can effectively solve very large deterministic MDPs represented in STRIPS or PDDL where states are sets of atoms over objects and relations, and lifted action schemas add or delete these atoms. This compact representation yields strong search heuristics and provides an ideal setting for structural generalization, since lifted relations and action schemas give rise to infinitely many domain instances. A central challenge is to learn these relations and action schemas from data, and recent approaches have addressed this problem using different types of observations. In this work, we develop a novel neural network architecture for learning action schemas from traces where states are fully observed but action arguments are unobserved. The problem is a simplification but an important step towards learning planning domains from sequences of images and action labels, and we aim to solve this simplification in a nearly perfect manner. The challenge lies in learning the action schemas while simultaneously identifying the action arguments from observed state changes. Our approach yields a robust differentiable component that can then be integrated into larger neuro-symbolic models. We evaluate the architecture on various planning domains, where the learned lifted action schemas must recover the ground-truth structure. Additionally, we report experiments on robustness to observation noise and on a variation related to slot-based dynamics models.
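
For context, the sketch below shows what a lifted STRIPS action schema is and what "identifying the action arguments from observed state changes" means in the simplest case. It is a symbolic, brute-force illustration in which the schema is already known; the paper's contribution is a differentiable architecture that learns the schema and the arguments jointly, which this code does not attempt. The `ActionSchema` class, the `explain` helper, and the Blocksworld-style `unstack` example are illustrative assumptions.

```python
from dataclasses import dataclass
from itertools import permutations

Atom = tuple[str, ...]  # ("on", "a", "b") stands for the atom on(a, b)


@dataclass(frozen=True)
class ActionSchema:
    """Lifted STRIPS action: atoms are written over parameters, not concrete objects."""
    name: str
    params: tuple[str, ...]
    pre: frozenset[Atom]
    add: frozenset[Atom]
    delete: frozenset[Atom]

    def ground(self, atoms: frozenset[Atom], args: tuple[str, ...]) -> frozenset[Atom]:
        """Substitute concrete objects for the schema parameters."""
        binding = dict(zip(self.params, args))
        return frozenset((a[0], *(binding[p] for p in a[1:])) for a in atoms)

    def applicable(self, state: frozenset[Atom], args: tuple[str, ...]) -> bool:
        return self.ground(self.pre, args) <= state

    def apply(self, state: frozenset[Atom], args: tuple[str, ...]) -> frozenset[Atom]:
        if not self.applicable(state, args):
            raise ValueError("preconditions not satisfied")
        return (state - self.ground(self.delete, args)) | self.ground(self.add, args)


def explain(schema: ActionSchema, before: frozenset[Atom], after: frozenset[Atom],
            objects: list[str]) -> list[tuple[str, ...]]:
    """Enumerate distinct-object argument tuples that reproduce an observed transition."""
    return [args for args in permutations(objects, len(schema.params))
            if schema.applicable(before, args) and schema.apply(before, args) == after]


unstack = ActionSchema(
    name="unstack",
    params=("?x", "?y"),
    pre=frozenset({("on", "?x", "?y"), ("clear", "?x"), ("handempty",)}),
    add=frozenset({("holding", "?x"), ("clear", "?y")}),
    delete=frozenset({("on", "?x", "?y"), ("clear", "?x"), ("handempty",)}),
)
before = frozenset({("on", "a", "b"), ("clear", "a"), ("ontable", "b"), ("handempty",)})
after = unstack.apply(before, ("a", "b"))
print(sorted(after))                               # [('clear', 'b'), ('holding', 'a'), ('ontable', 'b')]
print(explain(unstack, before, after, ["a", "b"]))  # [('a', 'b')]
```

The same schema applies to any number of blocks, which is the structural generalization the abstract refers to.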

2605.12850 2026-05-26 cs.CL cs.AI cs.CR cs.LG

Persona-Model Collapse in Emergent Misalignment

Davi Bastos Costa, Renato Vicente

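Affiliations: TELUS Digital Research Hub; Center for Artificial Intelligence and Machine Learning; Institute of Mathematics, Statistics and Computer Science; University of São Paulo

AI summary: Proposes the persona-model collapse hypothesis and uses two metrics, moral susceptibility (S) and moral robustness (R), to provide behavioral evidence that fine-tuning large language models on harmful data degrades the model's internal capacity to simulate, differentiate, and maintain consistent characters, a deterioration associated with emergent misalignment.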

Comments: 23 pages, 7 figures, 7 tables; NeurIPS 2026 submission; Corrected code repository URL

Abstract

Fine-tuning large language models on narrow data with harmful content produces broadly misaligned behavior on unrelated prompts, a phenomenon known as emergent misalignment. We propose that emergent misalignment involves persona-model collapse: deterioration of the model's internal capacity to simulate, differentiate, and maintain consistent characters. We test this hypothesis behaviorally using two metrics: moral susceptibility (S) and moral robustness (R), computed from the across- and within-persona variability of models' Moral Foundations Questionnaire responses under persona role-play. These metrics formalize the model's ability to differentiate characters (S) and its consistency when simulating a given one (R). We evaluate four frontier models (DeepSeek-V3.1, GPT-4.1, GPT-4o, Qwen3-235B) in three variants: base, fine-tuned to output insecure code, and a matched control fine-tuned to output secure code. Across the four models, insecure fine-tuning produces an average 55% increase in S, pushing all four insecure variants beyond the band observed across 13 frontier models benchmarked in prior work -- with GPT-4o reaching more than twice the band's upper end -- signaling dysregulated differentiation. It also causes an average 65% decrease in R, equivalent to a 304% increase in 1/R. By contrast, the matched secure control preserves S near the base and induces only a partial R loss, showing that these effects are largely misalignment-specific. Complementing these metric shifts, insecure variants' unconditioned responses converge toward saturation near the scale ceiling, departing markedly from both base models' structured responses and those elicited when base models role-play toxic personas. Taken together, these metrics provide a sensitive diagnostic for emergent misalignment and serve as behavioral evidence that it involves persona-model collapse.

2605.11182 2026-05-26 cs.AI

The Many Faces of On-Policy Distillation: Pitfalls, Mechanisms, and Fixes

Siqi Zhu, Xuyan Ye, Hongyu Lu, Weiye Shi, Ge Liu

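Affiliations: UIUC (University of Illinois Urbana-Champaign); Renmin University of China; Peking University

AI summary: An empirical study of when on-policy distillation (OPD) and on-policy self-distillation (OPSD) work in large language model post-training, why they fail, and how the failures can be mitigated.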

Abstract

On-policy distillation (OPD) and on-policy self-distillation (OPSD) have emerged as promising post-training methods for large language models, offering dense token-level supervision on trajectories sampled from the model's own policy. However, existing results on their effectiveness remain mixed: while OP(S)D has shown promise in system prompt and knowledge internalization, recent studies also report instability and degradation. In this work, we present a comprehensive empirical study of when OPD and OPSD work, when they fail, and why. We find that OPD on mathematical reasoning is highly sensitive to teacher choice and loss formulation, whereas OPSD fails in our tested settings due to test-time absence of instance-specific privileged information (PI). In contrast, OPSD is effective when PI represents a shared latent rule, such as a system prompt or alignment preference. We identify three failure mechanisms: (1) distribution mismatch between teacher and student caused by conditioning on student-generated prefixes, (2) optimization instability from biased TopK reverse-KL gradients, and (3) an OPSD-specific limitation where the student learns a PI-free policy that aggregates PI-conditioned teachers, which is insufficient when PI is instance-specific. We further show that stop-gradient TopK objectives, RLVR-adapted teachers, and SFT-stabilized students mitigate these failures.

2605.10989 2026-05-26 cs.LG cs.AI

SURGE: Surrogate Gradient Adaptation in Binary Neural Networks

Haoyu Huang, Boyu Liu, Linlin Yang, Yanjing Li, Yuguang Yang, Xuhui Liu, Canyu Chen, Zhongqian Fu, Baochang Zhang

Affiliations: National College for Excellent Engineers, Beihang University, Beijing, China; School of Artificial Intelligence, Beihang University, Beijing, China; School of Electronic and Information Engineering, Beihang University, Beijing, China; King Abdullah University of Science and Technology, Saudi Arabia; Huawei Noah’s Ark Lab, China

AI summary: To address gradient mismatch and the information loss caused by fixed-range gradient clipping in binary neural networks, proposes SURGE, a theoretically grounded learnable gradient compensation framework. A Dual-Path Gradient Compensator and an Adaptive Gradient Scaler provide bias-reduced gradient estimation and dynamic balancing, with the best reported performance on image classification, object detection, and language understanding tasks.

Comments: Accepted as a poster at the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

The training of Binary Neural Networks (BNNs) is fundamentally based on gradient approximation for non-differentiable binarization operations (e.g., the sign function). However, prevailing methods, including the Straight-Through Estimator (STE) and its improved variants, rely on hand-crafted designs that suffer from gradient mismatch and from information loss induced by fixed-range gradient clipping. To address this, we propose SURrogate GradiEnt Adaptation (SURGE), a novel learnable gradient compensation framework with theoretical grounding. SURGE mitigates gradient mismatch through auxiliary backpropagation. Specifically, we design a Dual-Path Gradient Compensator (DPGC) that constructs a parallel full-precision auxiliary branch for each binarized layer, decoupling gradient flow via output decomposition during backpropagation. DPGC enables bias-reduced gradient estimation by leveraging the full-precision branch to estimate components beyond STE's first-order approximation. To further enhance training stability, we introduce an Adaptive Gradient Scaler (AGS) based on an optimal scale factor to dynamically balance inter-branch gradient contributions via norm-based scaling. Experiments on image classification, object detection, and language understanding tasks demonstrate that SURGE outperforms state-of-the-art methods.
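
The baseline that SURGE sets out to improve can be written in a few lines. The sketch below shows sign binarization in the forward pass and the clipped Straight-Through Estimator in the backward pass, using NumPy and a clipping range of 1.0 as an assumed default. It is the standard STE only, not the SURGE method, DPGC, or AGS.

```python
import numpy as np


def binarize_forward(x: np.ndarray) -> np.ndarray:
    """Forward pass: sign binarization to {-1, +1} (zero is mapped to +1 here)."""
    return np.where(x >= 0, 1.0, -1.0)


def ste_backward(x: np.ndarray, grad_out: np.ndarray, clip: float = 1.0) -> np.ndarray:
    """Clipped STE backward pass: pass the gradient through where |x| <= clip, else zero."""
    return grad_out * (np.abs(x) <= clip)


x = np.array([-1.7, -0.3, 0.0, 0.6, 2.4])
grad_out = np.ones_like(x)
print(binarize_forward(x))        # [-1. -1.  1.  1.  1.]
print(ste_backward(x, grad_out))  # [0. 1. 1. 1. 0.]
```

The two zeroed entries are the information loss from fixed-range clipping mentioned in the abstract: inputs outside the range receive no gradient regardless of how useful an update would be.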

2605.10302 2026-05-26 cs.LG

Follow the Mean: Reference-Guided Flow Matching

Pedro M. P. Curvo, Maksim Zhdanov, Floor Eijkelboom, Jan-Willem van de Meent

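Affiliations: University of Amsterdam; AMLab

AI summary: Proposes steering a pretrained flow matching model for controllable generation by changing the mean of its reference set, without fine-tuning or additional networks.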

Abstract

Existing approaches to controllable generation typically rely on fine-tuning, auxiliary networks, or test-time search. We show that flow matching admits a different control interface: adaptation through examples. For deterministic interpolants, the velocity field is solely governed by a conditional endpoint mean; shifting this mean shifts the flow itself. This yields a simple principle for controllable generation: steer a pretrained model by changing the reference set it follows. We instantiate this idea in two forms. Reference-Mean Guidance is training-free: it computes a closed-form endpoint-mean correction from a reference bank and applies it to a frozen FLUX.2-klein (4B) model, enabling control of color, identity, style, and structure while keeping the prompt, seed, and weights fixed. Semi-Parametric Guidance amortizes the same idea through an explicit mean anchor and learned residual refiner, matching unconditional DiT-B/4 quality on AFHQv2 while allowing the reference set to be swapped at inference time. These results point to a broader direction: generative models that adapt through data, not parameter updates.
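
The claim that the velocity field is governed by a conditional endpoint mean can be checked on a toy case. The sketch below assumes the linear interpolant x_t = (1 - t) x_0 + t x_1 with a standard Gaussian source x_0 and a target that is uniform over a small reference bank; under those assumptions the velocity is (E[x_1 | x_t] - x_t) / (1 - t) and the endpoint mean has a closed form. It is not the paper's Reference-Mean Guidance correction and involves no pretrained model; the function names and the two banks are made up for illustration.

```python
import numpy as np


def endpoint_mean(x_t: np.ndarray, t: float, bank: np.ndarray) -> np.ndarray:
    """E[x_1 | x_t] for x_0 ~ N(0, I), x_t = (1 - t) x_0 + t x_1, x_1 uniform over `bank`."""
    sq_dist = np.sum((x_t - t * bank) ** 2, axis=1)
    logits = -sq_dist / (2.0 * (1.0 - t) ** 2)
    weights = np.exp(logits - logits.max())  # subtract the max for numerical stability
    weights /= weights.sum()
    return weights @ bank


def velocity(x_t: np.ndarray, t: float, bank: np.ndarray) -> np.ndarray:
    """Velocity of the linear interpolant, expressed through the endpoint mean."""
    return (endpoint_mean(x_t, t, bank) - x_t) / (1.0 - t)


def sample(bank: np.ndarray, x0: np.ndarray, steps: int = 200) -> np.ndarray:
    """Euler integration from t = 0, stopping one step short of t = 1."""
    x, dt = x0.copy(), 1.0 / steps
    for k in range(steps - 1):
        x = x + dt * velocity(x, k * dt, bank)
    return x


x0 = np.array([0.3, 0.1])                        # same starting noise for both runs
bank_a = np.array([[-2.0, 0.5], [-2.0, -0.5]])   # reference set on the left
bank_b = np.array([[2.0, 0.5], [2.0, -0.5]])     # reference set on the right
xa, xb = sample(bank_a, x0), sample(bank_b, x0)
print(xa[0] < 0 < xb[0])  # True: swapping the reference bank redirects the same seed
```

Nothing is trained here: only the bank changes between the two runs, which is a miniature version of adapting through data rather than parameter updates.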

2605.08063 2026-05-26 cs.CV cs.AI

Flow-OPD: On-Policy Distillation for Flow Matching Models

Zhen Fang, Wenxuan Huang, Yu Zeng, Yiming Zhao, Shuang Chen, Kaituo Feng, Yunlong Lin, Lin Chen, Zehui Chen, Shaosheng Cao, Feng Zhao

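Affiliations: University of Science and Technology of China; University of California, Los Angeles; The Chinese University of Hong Kong; Xiaohongshu Inc.

AI summary: Proposes the Flow-OPD framework, which uses a two-stage alignment strategy (single-reward GRPO fine-tuning of experts, then a flow-based cold start with on-policy distillation) to address reward sparsity and gradient interference in multi-task alignment of flow matching models, and introduces Manifold Anchor Regularization to curb aesthetic degradation, with large gains on GenEval and OCR metrics.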

Comments Project Page: https://costaliya.github.io/Flow-OPD/ , Code: https://github.com/CostaliyA/Flow-OPD

详情
AI中文摘要

现有的流匹配(FM)文本到图像模型在多任务对齐下存在两个关键瓶颈:标量奖励导致的奖励稀疏性,以及联合优化异构目标引起的梯度干扰,这共同导致了竞争指标的“跷跷板效应”和普遍的奖励破解。受大型语言模型社区中在线策略蒸馏(OPD)成功的启发,我们提出了Flow-OPD,这是第一个将在线策略蒸馏集成到流匹配模型中的统一后训练框架。Flow-OPD采用两阶段对齐策略:首先通过单奖励GRPO微调培养领域专精的教师模型,使每个专家在隔离环境中达到其性能上限;然后通过基于流的冷启动方案建立稳健的初始策略,并通过在线策略采样、任务路由标记和密集轨迹级监督的三步编排,将异构专业知识无缝整合到单个学生模型中。我们进一步引入了流形锚点正则化(MAR),它利用任务无关的教师提供全数据监督,将生成锚定到高质量流形,有效缓解了纯强化学习对齐中常见的美学退化。基于Stable Diffusion 3.5 Medium,Flow-OPD将GenEval分数从63提升至92,OCR准确率从59提升至94,相比原始GRPO总体提升约10个百分点,同时保持了图像保真度和人类偏好对齐,并展现出“超越教师”的涌现效应。这些结果确立了Flow-OPD作为构建通用文本到图像模型的可扩展对齐范式。代码和权重将在 https://github.com/CostaliyA/Flow-OPD 发布。

English abstract

Existing Flow Matching (FM) text-to-image models suffer from two critical bottlenecks under multi-task alignment: the reward sparsity induced by scalar-valued rewards, and the gradient interference arising from jointly optimizing heterogeneous objectives, which together give rise to a 'seesaw effect' of competing metrics and pervasive reward hacking. Inspired by the success of On-Policy Distillation (OPD) in the large language model community, we propose Flow-OPD, the first unified post-training framework that integrates on-policy distillation into Flow Matching models. Flow-OPD adopts a two-stage alignment strategy: it first cultivates domain-specialized teacher models via single-reward GRPO fine-tuning, allowing each expert to reach its performance ceiling in isolation; it then establishes a robust initial policy through a Flow-based Cold-Start scheme and seamlessly consolidates heterogeneous expertise into a single student via a three-step orchestration of on-policy sampling, task-routing labeling, and dense trajectory-level supervision. We further introduce Manifold Anchor Regularization (MAR), which leverages a task-agnostic teacher to provide full-data supervision that anchors generation to a high-quality manifold, effectively mitigating the aesthetic degradation commonly observed in purely RL-driven alignment. Built upon Stable Diffusion 3.5 Medium, Flow-OPD raises the GenEval score from 63 to 92 and the OCR accuracy from 59 to 94, yielding an overall improvement of roughly 10 points over vanilla GRPO, while preserving image fidelity and human-preference alignment and exhibiting an emergent 'teacher-surpassing' effect. These results establish Flow-OPD as a scalable alignment paradigm for building generalist text-to-image models. The code and weights will be released at https://github.com/CostaliyA/Flow-OPD.

2605.08025 2026-05-26 cs.CV

TRAS: An Interactive Software for Tracing Tree Ring Cross Sections

TRAS:一种用于追踪树木年轮横截面的交互式软件

Henry Marichal, Diego Passarella, Gregory Randall

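Affiliations: Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República; Procesos Industriales de la Madera, CENUR Noreste, Universidad de la República

AI summary: Presents TRAS, open-source graphical software that integrates three detection algorithms (CS-TRD, DeepCS-TRD, INBD) for automatic delineation, manual correction, and measurement of tree rings; on pine cross-section images DeepCS-TRD reaches an F-score of 81.0%, substantially reducing manual correction effort.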

Comments This manuscript has been accepted for publication in Forestry: An International Journal of Forest Research, published by Oxford University Press. This is an author-produced version and may differ from the final Version of Record. The final published version will be available through the journal website

AI Chinese abstract

树木年轮标记仍然是树木测量学和树木年代学中的关键步骤,但通常手动进行,使得过程耗时、主观且难以扩展到大型图像数据集。我们提出了树木年轮分析套件(TRAS),一个用于木材横截面图像中树木年轮自动勾画、手动校正和测量的开源图形软件。TRAS集成了三种互补的检测算法:经典图像处理方法CS-TRD和两种深度学习方法DeepCS-TRD与INBD。界面允许用户细化自动检测、去除假阳性并手动添加缺失的年轮。它还计算树木年代学指标,如早材和晚材面积、年轮周长、等效年轮宽度以及基于自定义路径的年轮宽度测量。TRAS在18张专家标注的Pinus taeda L.横截面图像上进行了评估。DeepCS-TRD取得了最佳自动检测性能,F值为81.0%,精确率为86.4%。自动检测将所需的手动校正工作减少到大约20%的年轮边界。对于一维年轮宽度测量,TRAS与CooRecorder显示出极好的一致性(r > 0.99)。常见的检测错误,如跳跃传播或靠近节疤的假阳性,可以通过后处理界面轻松校正。TRAS在Windows、macOS和Linux上为树木年轮分析提供了灵活且可重复的解决方案。代码可在https://hmarichal93.github.io/tras获取。

English abstract

Tree ring marking remains a key step in dendrometry and dendrochronology, but it is often performed manually, making the process time-consuming, subjective, and difficult to scale to large image datasets. We present the Tree Ring Analyzer Suite (TRAS), open-source graphical software for automatic delineation, manual correction, and measurement of tree rings in wood cross-sectional images. TRAS integrates three complementary detection algorithms: the classical image-processing method CS-TRD and two deep-learning approaches, DeepCS-TRD and INBD. The interface allows users to refine automatic detections, remove false positives, and manually add missing rings. It also computes dendrochronological metrics such as earlywood and latewood areas, ring perimeter, equivalent ring width, and custom path-based ring-width measurements. TRAS was evaluated on 18 expertly annotated Pinus taeda L. cross-section images. DeepCS-TRD achieved the best automatic detection performance, with an F-score of 81.0% and precision of 86.4%. Automatic detection reduced the required manual correction effort to approximately 20% of ring boundaries. For one-dimensional ring-width measurements, TRAS showed excellent agreement with CooRecorder ($r > 0.99$). Common detection errors, such as jump propagation or false positives near knots, were easily corrected through the postprocessing interface. TRAS provides a flexible and reproducible solution for tree-ring analysis on Windows, macOS, and Linux. Code is available at https://hmarichal93.github.io/tras.
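
The abstract reports an F-score and a precision but no recall. As a quick worked calculation, assuming the reported F-score is the usual F1 (harmonic mean of precision and recall), the implied recall can be backed out; the helper name `recall_from_f1` is illustrative.

```python
def recall_from_f1(f1, precision):
    """Solve F1 = 2PR / (P + R) for R, given F1 and P."""
    return f1 * precision / (2.0 * precision - f1)

# Figures quoted in the abstract for DeepCS-TRD.
recall = recall_from_f1(0.810, 0.864)
print(f"implied recall: {recall:.3f}")  # prints "implied recall: 0.762"
```

Under that assumption the detector misses roughly a quarter of the annotated ring boundaries, which is broadly in line with the stated need to correct about 20% of them by hand.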

2605.07647 2026-05-26 cs.CL cs.AI

Quality-Conditioned Agreement in Automated Short Answer Scoring: Mid-Range Degradation and the Impact of Task-Specific Adaptation

自动简答题评分中的质量条件一致性:中等范围退化与任务特定适应的影响

Abigail Victoria Gurin Schleifer, Moriah Ariely, Beata Beigman Klebanov, Asaf Salman, Giora Alexandron

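Affiliations: Weizmann Institute of Science; ETS

AI summary: Studies how the degree of task-specific adaptation of different models relates to quality-conditioned scoring agreement in automated short answer scoring; all AI models perform well on fully correct and fully incorrect responses but degrade substantially on mid-range responses, and the degradation depends on the amount of task-specific data.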

Comments PRE-PRINT VERSION Accepted to ACL 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA26)

AI Chinese abstract

自动简答题评分(ASAS)正从判别式微调模型转向少样本设置下的大语言模型(LLM)。这种范式利用了LLM广泛的世界知识和易于部署的优势,但有限的任务特定数据可能降低复杂评分任务的对齐。特别是,其对评分需要细微解释的部分正确回答的影响仍未充分探索。我们研究了不同模型的任务特定适应程度与质量条件评分一致性之间的关系。我们比较了三种LLM(GPT-5.2、GPT-4o、Claude Opus 4.5)在少样本模式下的表现、一个基于BERT的微调编码器以及一位人类专家,在两个开放式生物学题目上使用了数百个学生回答和由生物学教育专家提供的真实分数。结果表明,人类之间的一致性最高且在整个质量范围内稳定。所有AI模型在完全正确和完全错误的回答上表现良好,但在中等范围回答上表现出显著退化。这种中等范围退化取决于任务特定适应:在少样本LLM中最为严重,随着任务特定数据的增加而减少,其中微调编码器模型表现最佳。这种中等范围退化可能导致对理解发展中的学生所产生回答的不公平评估。我们的发现强调了质量条件公平性的重要性,尤其需要关注中等范围回答。

English abstract

Automated short answer scoring (ASAS) is shifting from discriminative, fine-tuned models to large language models (LLMs) used in few-shot settings. This paradigm leverages LLMs' broad world knowledge and ease of deployment, but limited task-specific data may reduce alignment on complex scoring tasks. In particular, its impact on scoring partially correct responses that require nuanced interpretation remains underexplored. We investigate the relationship between the degree of task-specific adaptation of different models and quality-conditioned scoring agreement. We compare three LLMs (GPT-5.2, GPT-4o, Claude Opus 4.5) in few-shot mode, a fine-tuned BERT-based encoder, and a human expert on two open-ended biology items, using several hundred student responses and ground truth scores provided by a biology education expert. The results show that human-human agreement is highest and stable across the full quality spectrum. All AI models perform well on fully correct and fully incorrect responses, but exhibit substantial degradation on mid-range responses. This mid-range degradation is conditioned on task-specific adaptation: it is most severe in few-shot LLMs with few examples and decreases as task-specific data increases, with fine-tuned encoder models performing best. This mid-range degradation may lead to inequitable evaluation of responses produced by students with developing understanding. Our findings highlight the importance of quality-conditioned fairness, with particular attention to mid-range responses.
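
To make "quality-conditioned agreement" concrete, here is a minimal sketch that reports agreement separately for each ground-truth score level instead of one pooled number. The scores below are made-up toy data, and exact-match agreement is an assumption; the paper may use a different agreement statistic.

```python
from collections import defaultdict

def agreement_by_quality(gold, pred):
    """Exact-match agreement between scorer and gold, grouped by gold score level."""
    hits, totals = defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        totals[g] += 1
        hits[g] += int(g == p)
    return {level: hits[level] / totals[level] for level in sorted(totals)}

# Toy data (hypothetical): 0 = incorrect, 1 = partially correct, 2 = fully correct.
gold = [0, 0, 0, 1, 1, 1, 1, 2, 2, 2]
pred = [0, 0, 0, 0, 1, 2, 1, 2, 2, 2]

per_level = agreement_by_quality(gold, pred)
overall = sum(g == p for g, p in zip(gold, pred)) / len(gold)
```

In this toy case the pooled agreement of 0.8 hides the fact that the mid-range level agrees only half the time, which is the kind of effect a quality-conditioned breakdown is meant to expose.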

2605.06505 2026-05-26 cs.LG cs.AI cs.CR

PACZero: PAC-Private Fine-Tuning of Language Models via Sign Quantization

PACZero: 通过符号量化的语言模型PAC隐私微调

Murat Bilgehan Ertan, Xiaochen Zhu, Phuong Ha Nguyen, Marten van Dijk, Srinivas Devadas

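Affiliations: CWI Amsterdam; MIT Cambridge; Vrije Universiteit Amsterdam

AI summary: Proposes PACZero, a family of zeroth-order mechanisms that use sign quantization to achieve PAC-private fine-tuning at zero mutual information, with competitive results on SST-2 and SQuAD.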

AI Chinese abstract

我们引入了PACZero,一系列用于微调大型语言模型的PAC隐私零阶机制,在$I(S^*; Y_{1:T})=0$时提供可用的效用。该隐私机制将成员推断攻击(MIA)后验成功率限制在先验水平,这是DP框架仅在$\varepsilon=0$和无限噪声下才能达到的MIA抵抗水平。所有下面的DP-ZO比较都在MIA后验水平上匹配。关键见解是,PAC隐私仅在发布依赖于哪个候选子集是秘密时才对互信息收费。对子集聚合的零阶梯度进行符号量化会产生频繁的一致步骤,即每个候选子集在更新方向上达成一致;在这些步骤中,发布的符号花费零条件互信息。我们提出了两个变体,涵盖隐私-效用权衡:PACZero-MI(通过对二元发布进行精确校准的预算化MI)和PACZero-ZPL(在分歧步骤上通过均匀硬币翻转实现$I=0$)。我们在SST-2和SQuAD上使用OPT-1.3B和OPT-6.7B在LoRA和全参数轨道上进行了评估。在SST-2 OPT-1.3B全微调$I=0$时,PACZero-ZPL达到$88.99\pm0.91$,比非私有MeZO基线($91.1$ FT)低2.1个百分点。在$\varepsilon<1$的高隐私机制下,没有先前方法能产生可用的效用,而PACZero-ZPL在$I=0$时在OPT-1.3B和OPT-6.7B上获得了有竞争力的SST-2准确率和非平凡的SQuAD F1分数。

English abstract

We introduce PACZero, a family of PAC-private zeroth-order mechanisms for fine-tuning large language models that delivers usable utility at $I(S^*; Y_{1:T})=0$. This privacy regime bounds the membership-inference attack (MIA) posterior success rate at the prior, an MIA-resistance level the DP framework matches only at $\varepsilon=0$ and infinite noise. All DP-ZO comparisons below are matched at the MIA posterior level. The key insight is that PAC Privacy charges mutual information only when the release depends on which candidate subset is the secret. Sign-quantizing subset-aggregated zeroth-order gradients creates frequent unanimity, steps at which every candidate subset agrees on the update direction; at these steps the released sign costs zero conditional mutual information. We propose two variants that span the privacy-utility trade-off: PACZero-MI (budgeted MI via exact calibration on the binary release) and PACZero-ZPL ($I=0$ via a uniform coin flip on disagreement steps). We evaluate on SST-2 and SQuAD with OPT-1.3B and OPT-6.7B in both LoRA and full-parameter tracks. On SST-2 OPT-1.3B full fine-tuning at $I=0$, PACZero-ZPL reaches ${88.99\pm0.91}$, within $2.1$pp of the non-private MeZO baseline ($91.1$ FT). No prior method produces usable utility in the high-privacy regime $\varepsilon<1$, and PACZero-ZPL obtains competitive SST-2 accuracy and nontrivial SQuAD F1 across OPT-1.3B and OPT-6.7B at $I=0$.
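
The unanimity idea behind the ZPL variant can be sketched in a few lines. This is a simplified reading of the abstract, not the authors' implementation: `subset_grads` stands for hypothetical scalar zeroth-order gradient estimates, one per candidate subset, and treating an exact zero as positive is an arbitrary assumption.

```python
import random

def zpl_release(subset_grads, rng):
    """Return (released_sign, unanimous) for one update step.

    If every candidate subset agrees on the sign, release it; the release then
    does not depend on which subset is the secret. Otherwise release a uniform
    coin flip, which is independent of the data.
    """
    signs = {1 if g >= 0 else -1 for g in subset_grads}
    if len(signs) == 1:
        return signs.pop(), True
    return rng.choice((-1, 1)), False

rng = random.Random(0)
agree = zpl_release([0.3, 1.2, 0.1], rng)     # all subsets point the same way
disagree = zpl_release([-0.3, 1.2, 0.1], rng)  # disagreement: random sign
```

Either branch yields an output whose distribution is the same for every candidate subset, which is why such a step would add no mutual information about the secret subset in this simplified picture.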

2605.06259 2026-05-26 cs.LG cs.CR

Trade-off Functions for DP-SGD with Subsampling based on Random Shuffling: Tight Upper and Lower Bounds

基于随机洗牌的DP-SGD的权衡函数:紧的上界和下界

Marten van Dijk, Murat Bilgehan Ertan

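Affiliations: CWI Amsterdam; Vrije Universiteit Amsterdam

AI summary: Within the $f$-DP framework, derives a tight analysis of the trade-off function for differentially private stochastic gradient descent (DP-SGD) with subsampling based on random shuffling, obtains transparent and interpretable closed-form bounds, and shows parameter settings that achieve meaningful differential privacy in single-epoch training.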

AI Chinese abstract

我们在$f$-DP框架下,针对基于随机洗牌子采样的差分隐私随机梯度下降(DP-SGD),推导了权衡函数的紧致分析。我们的分析涵盖了噪声乘数$σ$满足$σ\geq \sqrt{3/\ln M}$的情形,其中$M$是单轮内的轮数。与泊松子采样的$f$-DP分析(产生非封闭的隐式公式,可机器计算但不透明)不同,随机洗牌允许紧致分析,得到透明且可解释的闭式界。我们通过Berry-Esseen定理推导的具体界,在证明框架内紧致到常数因子。我们展示了单轮($E=1$)的工作参数设置,对应的权衡函数$\geq 1-a-δ$,即仅比理想随机猜测对角线$1-a$低$δ$:对于$δ=1/100$和$σ=1$,大约$M \approx 1.14\times 10^6$轮和$N \approx 1.14\times 10^7$训练样本足以实现有意义的差分隐私。这与最近关于$σ\leq 1/\sqrt{2 \ln M}$情形的负面结果形成对比。我们的具体界可以在多个轮次上组合,导致$δ$具有与$E$的线性依赖关系,这限制了$E=O(\sqrt{M})$。为了超越Berry-Esseen,我们引入了一种新的证明技术,基于大数定律的推广,得到了渐近随机猜测对角线极限结果:如果$E=c_M^2M$且$c_M\to 0$,则$E$次组合的权衡函数满足$f^{\otimes E}(a)\to 1-a$在$a\in[0,1]$上一致,且$δ$仅具有$O(\sqrt{E})$的依赖关系。我们将这种渐近状态与相应的泊松子采样渐近进行比较,并将显式收敛速率的刻画作为一个开放问题。

English abstract

We derive a tight analysis of the trade-off function for Differentially Private Stochastic Gradient Descent (DP-SGD) with subsampling based on random shuffling within the $f$-DP framework. Our analysis covers the regime $σ\geq \sqrt{3/\ln M}$, where $σ$ is the noise multiplier and $M$ is the number of rounds within a single epoch. Unlike $f$-DP analyses for Poisson subsampling, which yield non-closed implicit formulas that can be machine computed but are non-transparent, random shuffling admits a tight analysis yielding transparent and interpretable closed-form bounds. Our concrete bounds, derived via the Berry-Esseen theorem, are tight up to constant factors within the proof framework. We demonstrate worked parameter settings for a single epoch ($E=1$) with a corresponding trade-off function $\geq 1-a-δ$, that is, only $δ$ below the ideal random guessing diagonal $1-a$: for $δ= 1/100$ and $σ= 1$, roughly $M \approx 1.14\times 10^6$ rounds and $N \approx 1.14\times 10^7$ training samples suffice to achieve meaningful differential privacy. This is in contrast to recent negative results for the regime $σ\leq 1/\sqrt{2 \ln M}$. Our concrete bounds can be composed over multiple epochs, which makes $δ$ depend linearly on $E$ and restricts $E=O(\sqrt{M})$. To go beyond Berry-Esseen, we introduce a new proof technique based on a generalization of the law of large numbers that yields an asymptotic random guessing diagonal-limit result: if $E=c_M^2M$ with $c_M\to 0$, then the $E$-fold composed trade-off function satisfies $f^{\otimes E}(a)\to 1-a$ uniformly in $a\in[0,1]$ with $δ$ having only an $O(\sqrt{E})$ dependency. We compare this asymptotic regime with the corresponding Poisson subsampling asymptotic, and highlight the characterization of explicit convergence rates as an open question.
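
A short arithmetic check shows where the quoted single-epoch setting sits relative to the two noise thresholds in the abstract. The per-round batch size is derived here as $N/M$, which assumes each round consumes an equal share of the shuffled data; that reading is an assumption, not stated in the abstract.

```python
import math

M = 1.14e6      # rounds per epoch (from the abstract)
N = 1.14e7      # training samples (from the abstract)
sigma = 1.0     # noise multiplier (from the abstract)

covered_from = math.sqrt(3.0 / math.log(M))          # analysis covers sigma >= this
negative_up_to = 1.0 / math.sqrt(2.0 * math.log(M))  # negative results for sigma <= this
batch_size = N / M                                   # assumed: N = M * batch size

print(f"covered regime starts at sigma = {covered_from:.3f}")
print(f"negative regime ends at sigma = {negative_up_to:.3f}")
```

With these numbers $σ = 1$ lies well inside the covered regime (threshold about 0.46), the negative-result regime ends near 0.19, and the implied batch size is about 10.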

2605.05795 2026-05-26 cs.LG

Reward Shaping and Action Masking for Compositional Tasks using Behavior Trees and LLMs

使用行为树和LLM的组合任务奖励塑造与动作掩码

Nicholas Potteiger, Ankita Samaddar, Taylor T. Johnson, Xenofon Koutsoukos

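Affiliations: Vanderbilt University

AI summary: Proposes the MRBT structure, which combines LLM-generated rewards and action masks with SMT verification and a neurosymbolic RL loop to improve training efficiency and success rates on compositional tasks.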

AI Chinese abstract

将复杂任务分解为一系列更简单的子任务可以提高自主代理的学习效率。强化学习(RL)可用于优化代理策略以完成子任务,但需要明确定义的子任务奖励,并受益于动作掩码。最近的工作使用大型语言模型(LLM)来自动化奖励塑造和动作掩码,然而它们都没有完全解决对子任务失败的响应性以及组合任务中不同对象的模块化问题。为了克服这些挑战,我们开发了掩码奖励行为树(MRBT),这是一种用作响应式和模块化奖励及动作掩码函数的符号结构。我们设计了一个MRBT模板,并推导出逻辑规范来构建和验证一系列对象交互子任务的MRBT。此外,我们开发了一个自动化流水线,使用LLM生成对变化任务对象鲁棒的MRBT,使用SMT求解器验证规范的正确性,以及一个神经符号RL循环来训练代理完成组合任务。实验证明成功生成和优化了五个MRBT,与基线以及没有动作掩码的MRBT相比,持续提高了训练效率和任务成功率。我们进一步强调了MRBT的三个优势:可迁移性、模块化和可验证性。

English abstract

Decomposing complex tasks into a sequence of simpler subtasks can improve learning efficiency for an autonomous agent. Reinforcement learning (RL) can be used to optimize agent policies to complete subtasks, but requires well-defined subtask rewards and benefits from action masking. Recent work uses large language models (LLMs) to automate reward shaping and action masking; however, none of these approaches fully addresses reactivity to subtask failure and modularity to varying objects for compositional tasks. To overcome these challenges, we develop the masking reward behavior tree (MRBT), a symbolic structure used as a reactive and modular reward and action mask function. We design an MRBT template and derive logical specifications to construct and verify MRBTs for a sequence of object-interaction subtasks. Further, we develop an automated pipeline that uses an LLM to generate MRBTs robust to varying task objects, an SMT solver to verify correctness of specifications, and a neurosymbolic RL loop to train agents on compositional tasks. Experiments demonstrate successful generation and refinement of five MRBTs, consistently improving training efficiency and task success rates over baselines and MRBTs without action masking. We further highlight three advantages of MRBTs: transferability, modularity, and verifiability.
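
The idea of one symbolic structure that returns both a reward and an action mask, and that reacts when an earlier subtask is undone, can be sketched as a sequence of subtasks re-evaluated on every tick. This is an illustrative toy, not the paper's MRBT template: the `Subtask` type, the prefix-count reward, and the key-and-door task are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

State = Dict[str, bool]

@dataclass(frozen=True)
class Subtask:
    name: str
    done: Callable[[State], bool]   # condition checked on every tick
    allowed: FrozenSet[str]         # actions permitted while working on this subtask

def tick(subtasks: List[Subtask], state: State) -> Tuple[float, FrozenSet[str]]:
    """Return (reward, action mask) for the first subtask whose condition fails.

    Conditions are re-checked from the start each tick, so losing an earlier
    achievement drops both the reward and the mask back to that subtask.
    """
    for index, sub in enumerate(subtasks):
        if not sub.done(state):
            return float(index), sub.allowed
    return float(len(subtasks)), frozenset()

plan = [
    Subtask("pick_key", lambda s: s["has_key"], frozenset({"move", "pick"})),
    Subtask("open_door", lambda s: s["door_open"], frozenset({"move", "toggle"})),
]

start = tick(plan, {"has_key": False, "door_open": False})
holding = tick(plan, {"has_key": True, "door_open": False})
dropped = tick(plan, {"has_key": False, "door_open": False})
```

Swapping the objects in `plan` without touching `tick` is a small-scale picture of the modularity the abstract describes.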

2605.05759 2026-05-26 cs.LG

Full-Spectrum Graph Neural Networks: Expressive and Scalable

全谱图神经网络:表达力与可扩展性

Xiaohan Wang, Deyu Bo, Longlong Li, Kelin Xia

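Affiliations: Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371, Singapore

AI summary: Proposes Full-Spectrum Graph Neural Networks (FSpecGNN), which lift signals from the node domain to the node-pair domain and extend univariate spectral filters to bivariate filters, achieving universal approximation of node-pair signals while remaining scalable.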

Comments 41 pages, 4 figures. Accepted to ICML 2026

AI Chinese abstract

众所周知,谱图神经网络(GNN)可以通用逼近节点信号;然而,它们的表达能力仍然受限于1维Weisfeiler-Lehman测试,这体现在它们对高阶信号缺乏通用性。为了突破这一界限,我们提出了全谱GNN(FSpecGNN),这是经典谱GNN的二阶推广。FSpecGNN从两个角度推进了谱滤波:(1)将信号从节点域提升到节点对域;(2)将特征值上的单变量谱滤波器扩展为特征值对上的双变量滤波器。我们证明经典谱GNN是FSpecGNN的对角特例,并证明FSpecGNN在通用逼近节点对信号的同时,其表达能力最多与Local 2-GNN相当,后者对异配图学习特别有益。此外,FSpecGNN支持可扩展实现,避免了显式的节点对级计算;结合低秩近似将全谱卷积简化为多项式谱滤波器的组合,使其能够在大图上学习。实验上,FSpecGNN验证了预测的表达能力,并在异配基准上展现了强劲性能。

English abstract

It is well established that spectral graph neural networks (GNNs) can universally approximate node signals; however, their expressive power remains bounded by the 1-dimensional Weisfeiler-Lehman test, which is mirrored in their lack of universality for higher-order signals. To go beyond this bound, we propose the Full-Spectrum GNNs (FSpecGNNs), a second-order generalization of classical spectral GNNs. FSpecGNN advances spectral filtering from two perspectives: (1) it lifts signals from the node domain to the node-pair domain; and (2) it extends the univariate spectral filter over eigenvalues to a bivariate filter over eigenvalue pairs. We show that classical spectral GNNs arise as a diagonal special case of FSpecGNNs, and prove that FSpecGNNs can be at most as expressive as Local 2-GNN while universally approximating node-pair signals, the latter being particularly beneficial for heterophilic graph learning. Moreover, FSpecGNN admits scalable implementations that avoid explicit node-pair-level computations; combined with a low-rank approximation that reduces full-spectrum convolution to a combination of polynomial spectral filters, it enables learning on large graphs. Empirically, FSpecGNN validates the predicted expressivity and delivers strong performance on heterophilic benchmarks.
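
One way to write down the step from a univariate to a bivariate spectral filter is shown below. The notation is an assumption made for illustration: $L = U \Lambda U^{\top}$ is a symmetric graph Laplacian with eigenpairs $(\lambda_i, u_i)$, $x$ is a node signal, $X$ is a node-pair signal stored as an $n \times n$ matrix, and the operator name $\mathcal{F}_h$ is hypothetical; the paper's exact definition may differ.

```latex
% Classical (univariate) spectral filter on a node signal x:
g(L)\,x \;=\; \sum_{i} g(\lambda_i)\, u_i u_i^{\top} x .

% A bivariate filter over eigenvalue pairs acting on a node-pair signal X:
\mathcal{F}_h(X) \;=\; \sum_{i,j} h(\lambda_i, \lambda_j)\, u_i u_i^{\top} X\, u_j u_j^{\top} .

% Separable case h(\lambda_i, \lambda_j) = g_1(\lambda_i)\, g_2(\lambda_j):
\mathcal{F}_h(X) \;=\; \Big(\sum_i g_1(\lambda_i)\, u_i u_i^{\top}\Big) X \Big(\sum_j g_2(\lambda_j)\, u_j u_j^{\top}\Big)
\;=\; g_1(L)\, X\, g_2(L) .
```

The separable case shows that such a bivariate filter reduces to ordinary polynomial spectral filters applied on each side of $X$, which gives some intuition for why implementations can avoid explicit node-pair-level computation.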

2605.05226 2026-05-26 cs.LG cs.AI cs.CL

Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning

将结果监督内化为过程监督:推理强化学习的新范式

Fei Ding, Yongkang Zhang, Runhao Liu, Yuhao Liao, Zijian Zeng, Sibo wang, Huiming Yang

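Affiliations: Alibaba Group; Tsinghua University

AI summary: Proposes a supervision-internalization method that lets the model automatically extract process-level learning signals under outcome-only supervision, enabling finer-grained policy optimization.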

AI Chinese abstract

推理强化学习的核心挑战不仅在于结果级监督的稀疏性,更在于如何将仅在序列末尾提供的反馈转化为可指导中间推理步骤的细粒度学习信号。现有方法要么依赖结果级奖励进行序列级优化,导致精确信用分配困难,要么依赖外部构建的过程监督,成本高昂且难以可持续扩展。为解决这一问题,我们提出一个新视角:推理强化学习可以理解为将结果监督内化为过程监督的问题。基于此视角,我们引入一种用于推理强化学习的监督内化方法,使模型能够通过识别、纠正和重用失败的推理轨迹自动提取过程级学习信号,从而在仅结果监督下实现更细粒度的策略优化。我们进一步将这一思想抽象为一种新的训练范式,其中模型在强化学习过程中持续生成并完善自身的内部过程监督,为推理强化学习中细粒度信用分配开辟了一条不同于外部提供过程监督的新路径。

English abstract

The central challenge of reinforcement learning for reasoning lies not only in the sparsity of outcome-level supervision, but more fundamentally in how to transform feedback provided only at the end of a sequence into fine-grained learning signals that can guide intermediate reasoning steps. Existing approaches either rely on outcome-level rewards for sequence-level optimization, which makes precise credit assignment difficult, or depend on externally constructed process supervision, which is costly and difficult to scale sustainably. To address this, we propose a new perspective: reinforcement learning for reasoning can be understood as the problem of internalizing outcome supervision into process supervision. From this perspective, we introduce a supervision-internalization method for reinforcement learning for reasoning, enabling the model to automatically extract process-level learning signals through identifying, correcting, and reusing failed reasoning trajectories, thereby achieving finer-grained policy optimization under outcome-only supervision. We further abstract this idea into a new training paradigm, in which the model continually generates and refines its own internal process supervision during reinforcement learning, opening a new path for fine-grained credit assignment in reinforcement learning for reasoning that differs from externally provided process supervision.

2605.05182 2026-05-26 cs.RO cs.SY eess.SY

A Closed-Form Dual-Barrier CBF Safety Filter for Holonomic Robots on Incrementally Built Occupancy Grid Maps

基于增量构建占据栅格地图的全向机器人闭式双障碍CBF安全滤波器

Himanshu Paudel, Basanta Joshi, Dhirendra Raj Madai, Alina Bartaula, Biman Rimal, Sanjay Neupane

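Affiliations: Raspberry Pi; PX4 CUAV Nano 7 Raspberry Pi 4B quadrotor

AI summary: Proposes a closed-form dual-barrier control barrier function safety filter that analytically derives its constraints from the signed distance field of the occupancy grid map, simultaneously avoiding mapped obstacles and restricting entry into unexplored regions, for real-time safe control of holonomic robots on resource-constrained platforms.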

AI Chinese abstract

我们提出了一种双障碍控制障碍函数(CBF)安全滤波器,用于在增量构建的占据栅格地图中运行的全向机器人的实时、安全关键速度控制。当机器人探索未知环境时,未映射区域引入了不可约的不确定性,因为超出已探索前沿的障碍物几何形状未知,使得进入这些区域成为碰撞风险的来源,尤其是对于前向传感器。为了解决这个问题,我们强制执行两个约束:避免已映射障碍物和限制进入未探索区域。这两个约束都是从占据栅格地图的符号距离场解析推导出来的,产生了一个闭式安全滤波器,每个周期只需求解一个小型线性系统。在资源受限的平台(如Raspberry Pi)上,SLAM和规划已经消耗了大量计算资源,所提出的滤波器的低开销节省了资源。自适应增益调度在信息丰富的区域放松前沿约束,在良好映射的区域收紧约束,提高了探索效率,同时保持了安全性。该滤波器在速度空间中作为最小侵入性校正运行,并与任意标称控制器(包括基于学习的方法)组合。在PX4控制的四旋翼飞行器上的硬件飞行实验表明,在多次室内运行中实现了零碰撞。

English abstract

We present a dual-barrier control barrier function (CBF) safety filter for real-time, safety-critical velocity control of holonomic robots operating in incrementally built occupancy grid maps. As a robot explores an unknown environment, unmapped regions introduce irreducible uncertainty, since obstacle geometry beyond the explored frontier is unknown, making entry into such regions a source of collision risk, especially with front-facing sensors. To address this, we enforce two constraints: avoidance of mapped obstacles and restriction from unexplored regions. Both constraints are derived analytically from the occupancy grid's signed distance field, yielding a closed-form safety filter that requires only a small linear system solve per cycle. On resource-constrained platforms such as the Raspberry Pi, where SLAM and planning already consume significant compute, the low overhead of the proposed filter preserves resources. An adaptive gain schedule relaxes the frontier constraint in information-rich regions and tightens it in well-mapped areas, improving exploration efficiency while maintaining safety. The filter operates in velocity space as a minimally invasive correction and composes with arbitrary nominal controllers, including learning-based methods. Hardware flight experiments on a PX4-controlled quadrotor demonstrate zero collisions across multiple indoor runs.
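
A velocity-space filter with two barrier constraints can be solved without a general QP solver by checking which constraints are active. The sketch below assumes single-integrator dynamics (the commanded velocity is the state derivative), constraints of the form $\nabla h_i \cdot u \geq -\alpha h_i$, nonzero gradients, and a 2-D workspace; the function name and the toy barrier values are hypothetical, and the paper's adaptive gain schedule is not modeled.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def safety_filter(u_nom, a1, b1, a2, b2, tol=1e-9):
    """Closest velocity to u_nom that satisfies a1.u >= b1 and a2.u >= b2."""
    def feasible(u):
        return dot(a1, u) >= b1 - tol and dot(a2, u) >= b2 - tol

    # Case 1: the nominal command is already safe.
    if feasible(u_nom):
        return list(u_nom)

    # Case 2: one constraint is active; project onto its boundary.
    for a, b in ((a1, b1), (a2, b2)):
        lam = (b - dot(a, u_nom)) / dot(a, a)
        u = [un + lam * ai for un, ai in zip(u_nom, a)]
        if lam >= 0.0 and feasible(u):
            return u

    # Case 3: both constraints are active; solve the 2x2 Gram system.
    g11, g12, g22 = dot(a1, a1), dot(a1, a2), dot(a2, a2)
    r1, r2 = b1 - dot(a1, u_nom), b2 - dot(a2, u_nom)
    det = g11 * g22 - g12 * g12
    if abs(det) < tol:
        raise ValueError("constraint normals are parallel; no unique corner")
    lam1 = (r1 * g22 - r2 * g12) / det
    lam2 = (g11 * r2 - g12 * r1) / det
    return [un + lam1 * x + lam2 * y for un, x, y in zip(u_nom, a1, a2)]

# Toy barriers: obstacle 0.2 m ahead in +x, unexplored frontier 1.0 m away in +y.
alpha = 1.0
h_obs, grad_obs = 0.2, (-1.0, 0.0)
h_front, grad_front = 1.0, (0.0, -1.0)
u_safe = safety_filter((1.0, 2.0), grad_obs, -alpha * h_obs,
                       grad_front, -alpha * h_front)
```

Here the nominal command violates both constraints, so the filter clips it to roughly (0.2, 1.0); commands that are already safe pass through unchanged, which is what makes the correction minimally invasive.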

2605.04363 2026-05-26 cs.LG cs.AI

Mitigating Label Shift in Tabular In-Context Learning via Test-Time Posterior Adjustment

通过测试时后验调整缓解表格上下文学习中的标签偏移

Seunghan Lee

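Affiliations: LG AI Research

AI summary: To address TabPFN's sensitivity to label shift in tabular in-context learning, proposes DistPFN, which rescales class probabilities through test-time posterior adjustment without architectural changes or additional training, substantially improving classification performance on more than 250 OpenML datasets.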

Comments ICML 2026

AI Chinese abstract

TabPFN最近作为表格数据集的基础模型受到关注,通过在合成数据上利用上下文学习实现了强性能。然而,我们发现TabPFN容易受到标签偏移的影响,常常过拟合训练数据集中的多数类。为了解决这一局限性,我们提出了DistPFN,这是第一个专为表格基础模型设计的测试时后验调整方法。DistPFN通过降低训练先验(即上下文的类别分布)的影响并强调模型预测后验的贡献来重新缩放预测的类别概率,无需架构修改或额外训练。我们进一步引入了DistPFN-T,它结合了温度缩放,以根据先验和后验之间的差异自适应地控制调整强度。我们在超过250个OpenML数据集上评估了我们的方法,证明在标签偏移下,各种基于TabPFN的模型在分类任务中取得了显著改进,同时在无标签偏移的标准设置中保持了强性能。代码可在以下仓库获取:https://github.com/seunghan96/DistPFN。

English abstract

TabPFN has recently gained attention as a foundation model for tabular datasets, achieving strong performance by leveraging in-context learning on synthetic data. However, we find that TabPFN is vulnerable to label shift, often overfitting to the majority class in the training dataset. To address this limitation, we propose DistPFN, the first test-time posterior adjustment method designed for tabular foundation models. DistPFN rescales predicted class probabilities by downweighting the influence of the training prior (i.e., the class distribution of the context) and emphasizing the contribution of the model's predicted posterior, without architectural modification or additional training. We further introduce DistPFN-T, which incorporates temperature scaling to adaptively control the adjustment strength based on the discrepancy between prior and posterior. We evaluate our methods on over 250 OpenML datasets, demonstrating substantial improvements for various TabPFN-based models in classification tasks under label shift, while maintaining strong performance in standard settings without label shift. Code is available at this repository: https://github.com/seunghan96/DistPFN.
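
The general shape of a test-time posterior adjustment can be sketched with the classic prior-division correction: divide each predicted class probability by the class frequency in the context, then renormalize. This is a generic technique shown for intuition, not necessarily DistPFN's exact rule; the `strength` exponent is a hypothetical knob and is not the paper's temperature mechanism.

```python
def adjust_posterior(probs, context_prior, strength=1.0, eps=1e-12):
    """Downweight the context class prior in predicted probabilities.

    probs:          model's predicted class probabilities for one test row
    context_prior:  class frequencies in the in-context (training) set
    strength:       0 leaves probs unchanged, 1 divides out the prior fully
    """
    scores = [p / max(q, eps) ** strength for p, q in zip(probs, context_prior)]
    total = sum(scores)
    return [s / total for s in scores]

# Toy numbers (hypothetical): a 90/10 imbalanced context and a mildly confident prediction.
adjusted = adjust_posterior([0.6, 0.4], [0.9, 0.1])
unchanged = adjust_posterior([0.6, 0.4], [0.9, 0.1], strength=0.0)
```

With a 90/10 context, a 0.6 prediction for the majority class carries less evidence than the prior alone, so after the correction the minority class is favored.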

2605.02124 2026-05-26 cs.LG cs.AI math.PR

Soft-to-Hard Routing in Sparse Mixture-of-Experts Models

稀疏混合专家模型中的软到硬路由

Reza Rastegar

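Affiliations: Meta Platforms, Inc

AI summary: Uses a boundary-layer calculus to study the limit in which softmax routing in sparse mixture-of-experts models approaches hard top-1 routing as the temperature tends to zero, and gives quantitative error bounds based on the probability mass near the routing interfaces.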

AI Chinese abstract

随着温度趋于零,softmax路由趋近于硬top-1路由,但极限过程在路由器平局时存在奇异性。本文针对总体平方损失混合专家回归中的软到硬极限,发展了一种边界层微积分方法。对于具有logits $a_k(x;ϕ)$的路由器,相关的局部量是前两名的间隔$Δ(x;ϕ)$,相关的全局量是边界质量$\\mathbb{P}(Δ(X;ϕ)\\\le w)$。在光滑性和横截性假设下,余面积和管状邻域估计展示了该质量如何随板宽缩放;在二元情形中,主导系数是路由界面上的显式曲面积分。这些几何估计给出了软目标$L_τ$和硬目标$L_0$之间的定量界,包括在间隔尾条件下的$O(τ^α)$一致比较,并得到了紧参数空间上软目标的$Γ$-收敛性。主要结论是,零温度近似由路由界面的$O(τ)$邻域所承载的概率控制,而不仅仅由温度本身决定。在分离出问题的这一边界层部分后,我们记录了一个从硬路由到小温度软路由的条件景观传递定理,以及一个简化的双专家高斯计算,展示了局部对称性破缺。仅包含合成诊断作为边界层预测的受控检验。

English abstract

Softmax routing approaches hard top-1 routing as the temperature tends to zero, but the limiting passage is singular at router ties. This paper develops a boundary-layer calculus for this soft-to-hard limit in population squared-loss mixture-of-experts regression. For a router with logits $a_k(x;ϕ)$, the relevant local quantity is the top-two margin $Δ(x;ϕ)$, and the relevant global quantity is the boundary mass $\mathbb{P}(Δ(X;ϕ)\le w)$. Under smoothness and transversality assumptions, coarea and tubular-neighborhood estimates show how this mass scales with the slab width; in the binary case the leading coefficient is an explicit surface integral over the routing interface. These geometric estimates give quantitative bounds between the soft objective $L_τ$ and the hard objective $L_0$, including an $O(τ^α)$ uniform comparison under a margin-tail condition, and yield $Γ$-convergence of the soft objectives on compact parameter spaces. The main conclusion is that the zero-temperature approximation is controlled by the probability carried by an $O(τ)$ neighborhood of the routing interfaces, not by temperature alone. After isolating this boundary-layer part of the problem, we record a conditional landscape-transfer theorem from hard to small-temperature soft routing and a reduced two-expert Gaussian calculation illustrating local symmetry breaking. Synthetic diagnostics are included only as controlled checks of the boundary-layer predictions.
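
The role of the top-two margin can be seen numerically: at a fixed small temperature, softmax routing is nearly hard when the margin is large relative to $τ$ and remains soft when the margin is comparable to $τ$. The function names below are illustrative and the logits are toy values.

```python
import math

def softmax_routing(logits, tau):
    """Softmax routing weights at temperature tau (max-shifted for stability)."""
    top = max(logits)
    weights = [math.exp((a - top) / tau) for a in logits]
    total = sum(weights)
    return [w / total for w in weights]

def top_two_margin(logits):
    """Gap between the largest and second-largest logit."""
    ranked = sorted(logits, reverse=True)
    return ranked[0] - ranked[1]

tau = 0.1
far_from_tie = softmax_routing([1.0, 0.0], tau)   # margin 1.0 = 10 * tau
near_tie = softmax_routing([0.01, 0.0], tau)      # margin 0.01 = 0.1 * tau
```

For two experts the weight on the losing expert is $1/(1+e^{Δ/τ})$, so inputs with $Δ$ on the order of $τ$ are the ones where soft and hard routing disagree, consistent with the abstract's point that the approximation is governed by the probability mass near the routing interfaces.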

2605.02010 2026-05-26 cs.AI

Reliable AI Needs to Externalize Implicit Knowledge: A Human-AI Collaboration Perspective

可靠AI需要外化隐性知识:人机协作视角

Hengyu Liu, Tianyi Li, Zhihong Cui, Yushuai Li, Zhangkai Wu, Torben Bach Pedersen, Kristian Torp, Christian S. Jensen

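Affiliations: Department of Computer Science, Aalborg University, Aalborg, Denmark; Department of Informatics, University of Oslo, Oslo, Norway; School of Computing, Macquarie University, Sydney, Australia

AI summary: From a human-AI collaboration perspective, argues that reliable AI needs infrastructure that externalizes implicit knowledge into verifiable forms, with Knowledge Objects (KOs) enabling human validation and thereby improving reliability.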

Comments Accepted at ICML 2026 (Position Paper Track). 14 pages, 2 figures, 1 table

AI Chinese abstract

本文立场认为,可靠AI需要基础设施来支持人类对隐性知识的验证。AI从显性知识(论文、文档、结构化数据库)和隐性知识(推理模式、调试过程、中间步骤)中学习。隐性知识由于文档成本超过感知价值而未被外化——然而AI不加区分地学习它,既获得有益模式也获得有害偏见。当前的可靠性方法只能根据来源验证显性知识,造成根本性差距:最有价值的AI能力(推理、判断、直觉)恰恰是我们无法验证的。我们提出知识对象(KOs)——将隐性知识外化为人类可以检查、验证和认可的形式的结构化工件。KOs改变了验证经济学:以前验证成本过高的事情变得可行,使得累积的人类验证能够随时间提高可靠性。

English abstract

This position paper argues that reliable AI requires infrastructure for human validation of implicit knowledge. AI learns from both explicit knowledge (papers, documentation, structured databases) and implicit knowledge (reasoning patterns, debugging processes, intermediate steps). Implicit knowledge remains unexternalized because documentation cost exceeds perceived value -- yet AI learns from it indiscriminately, acquiring both beneficial patterns and harmful biases. Current reliability methods can only verify explicit knowledge against sources, creating a fundamental gap: the most valuable AI capabilities (reasoning, judgment, intuition) are precisely those we cannot verify. We propose Knowledge Objects (KOs) -- structured artifacts that externalize implicit knowledge into forms humans can inspect, verify, and endorse. KOs transform verification economics: what was previously too costly to verify becomes feasible, enabling accumulated human validation to improve reliability over time.

2605.01284 2026-05-26 cs.CV cs.AI cs.CL cs.IR

Chain of Evidence: Pixel-Level Visual Attribution for Iterative Retrieval-Augmented Generation

证据链:面向迭代检索增强生成的像素级视觉归因

Peiyang Liu, Ziqiang Cui, Xi Wang, Di Liang, Wei Ye

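Affiliations: National Engineering Research Center for Software Engineering, Peking University; City University of Hong Kong; Peking University; Tencent Technology

AI summary: Proposes the Chain of Evidence (CoE) framework, which uses vision-language models to reason directly over screenshots of retrieved documents and outputs precise bounding boxes that visualize the complete reasoning chain, addressing coarse-grained attribution and visual semantic loss in iterative retrieval-augmented generation.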

AI Chinese abstract

迭代检索增强生成(iRAG)已成为通过逐步检索和推理外部文档来回答复杂多跳问题的强大范式。然而,当前系统主要基于解析文本运行,这造成了两个关键瓶颈:(1)粗粒度归因,用户需要根据模糊的文本级引用在冗长文档中手动定位证据;(2)视觉语义丢失,将视觉丰富的文档(如幻灯片、带有图表的PDF)转换为文本会丢弃对推理至关重要的空间逻辑和布局线索。为弥合这一差距,我们提出了证据链(CoE),这是一个与检索器无关的视觉归因框架,利用视觉语言模型直接对检索到的文档候选截图进行推理。CoE消除了特定格式的解析,输出精确的边界框,可视化检索候选集中的完整推理链。我们在两个不同的基准上评估CoE:Wiki-CoE,一个源自2WikiMultiHopQA的大规模结构化网页数据集;以及SlideVQA,一个具有挑战性的演示幻灯片数据集,包含复杂图表和自由形式布局。实验表明,微调后的Qwen3-VL-8B-Instruct取得了稳健的性能,在需要视觉布局理解的场景中显著优于基于文本的基线,同时为像素级可解释的iRAG建立了与检索器无关的解决方案。我们的代码可在https://github.com/PeiYangLiu/CoE.git获取。

English abstract

Iterative Retrieval-Augmented Generation (iRAG) has emerged as a powerful paradigm for answering complex multi-hop questions by progressively retrieving and reasoning over external documents. However, current systems predominantly operate on parsed text, which creates two critical bottlenecks: (1) Coarse-grained attribution, where users are burdened with manually locating evidence within lengthy documents based on vague text-level citations; and (2) Visual semantic loss, where the conversion of visually rich documents (e.g., slides, PDFs with charts) into text discards spatial logic and layout cues essential for reasoning. To bridge this gap, we present Chain of Evidence (CoE), a retriever-agnostic visual attribution framework that leverages Vision-Language Models to reason directly over screenshots of retrieved document candidates. CoE eliminates format-specific parsing and outputs precise bounding boxes, visualizing the complete reasoning chain within the retrieved candidate set. We evaluate CoE on two distinct benchmarks: Wiki-CoE, a large-scale dataset of structured web pages derived from 2WikiMultiHopQA, and SlideVQA, a challenging dataset of presentation slides featuring complex diagrams and free-form layouts. Experiments demonstrate that fine-tuned Qwen3-VL-8B-Instruct achieves robust performance, significantly outperforming text-based baselines in scenarios requiring visual layout understanding, while establishing a retriever-agnostic solution for pixel-level interpretable iRAG. Our code is available at https://github.com/PeiYangLiu/CoE.git.

2605.00817 2026-05-26 cs.CL

When LLMs Stop Following Steps: A Diagnostic Study of Procedural Execution in Language Models

当LLM停止遵循步骤:语言模型中程序执行的诊断研究

Sailesh Panda, Pritam Kadasi, Abhishek Upperwal, Mayank Singh

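Affiliations: Indian Institute of Technology Gandhinagar

AI summary: Builds a controlled diagnostic benchmark to evaluate how faithfully large language models execute procedures; accuracy falls from 63% to 20% as the number of steps grows, and failure modes include missing answers, premature answers, self-correction, and incomplete execution.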

Comments 86 pages, 124 figures, 4 Tables

AI Chinese abstract

大型语言模型(LLM)在推理基准测试中通常表现强劲,但仅凭最终答案的准确性并不能表明它们是否忠实地执行了提示中指定的程序。我们引入了一个受控的诊断基准,用于程序执行,其中模型被给予一个逐步的算术程序以及两个数值输入,必须返回最终计算值。通过程序长度和中间变量的回溯依赖性来改变复杂性。平均首次答案准确率从5步程序的63%下降到95步程序的20%。生成级别分析表明,失败通常涉及缺失答案、过早答案、初始错误后的自我修正以及未完全执行的轨迹。这些发现表明,表面上的推理能力可能掩盖了在忠实的长程程序执行中的重大弱点。

English abstract

Large language models (LLMs) often achieve strong performance on reasoning benchmarks, but final-answer accuracy alone does not show whether they faithfully execute the procedure specified in a prompt. We introduce a controlled diagnostic benchmark for procedural execution, where models are given a step-wise arithmetic procedure and two numeric inputs, and must return the final computed value. Complexity is varied through procedure length and look-back dependencies over intermediate variables. Average first-answer accuracy drops from 63% on 5-step procedures to 20% on 95-step procedures. Generation-level analysis shows that failures often involve missing answers, premature answers, self-correction after an initial error and under-executed traces. These findings suggest that apparent reasoning ability can mask substantial weaknesses in faithful long-horizon procedural execution.

2604.27636 2026-05-26 cs.AI

Generative structure search for efficient and diverse discovery of molecular and crystal structures

生成式结构搜索:高效且多样地发现分子和晶体结构

Yifang Qin, Yu Shi, Junfu Tan, Chang Liu, Ming Zhang, Ziheng Lu

Affiliations * Zhongguancun Academy; Kairos Materials

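AI summary Proposes the generative structure search (GSS) framework, which combines diffusion models with random structure search, using data priors to accelerate sampling while retaining energy-guided exploration of local minima. It recovers diverse metastable structures at less than one tenth of the cost of random structure search.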

English abstract

Predicting stable and metastable structures is central to molecular and materials discovery, but remains limited by the cost of searching high-dimensional energy landscapes. Deep generative models offer efficient structure sampling, yet their outputs remain shaped by training data and can underexplore minima that are rare but physically relevant. We introduce generative structure search (GSS), a unified framework that formulates diffusion-based generation and random structure search (RSS) as limiting regimes of a common sampling process driven by learned score fields and physical forces. Coupling these drivers lets GSS use data priors to accelerate sampling while retaining energy-guided exploration of local minima. Across molecular and crystalline systems, GSS recovers diverse metastable structures with more than tenfold lower sampling cost than RSS for broad coverage and remains effective for compositions outside the training distribution. The results establish a physically grounded generative search strategy for discovering structures beyond the reach of data-driven sampling alone.
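The idea of a single sampling process driven by both a learned score field and physical forces can be illustrated on a one-dimensional toy problem. The sketch below is not the paper's formulation: the double-well energy, the Gaussian stand-in for a learned score, the linear mixing `weight`, and the Langevin-style update are all assumptions chosen only to show the two limiting regimes.

```python
import math
import random


def force(x):
    """Physical force -dE/dx for the toy energy E(x) = (x**2 - 1)**2."""
    return -4.0 * x * (x * x - 1.0)


def prior_score(x, mu=1.0, sigma=0.5):
    """Stand-in for a learned score field: gradient of log N(mu, sigma**2)."""
    return -(x - mu) / (sigma * sigma)


def sample(x0, weight, steps=2000, step_size=0.01, temperature=0.0, seed=0):
    """Langevin-style update with a mixed drift.

    weight = 1.0 follows only the data-prior score (generation-like regime);
    weight = 0.0 follows only the physical force (relaxation from a random
    start, as in random structure search).
    """
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step_size * temperature)
    x = x0
    for _ in range(steps):
        drift = weight * prior_score(x) + (1.0 - weight) * force(x)
        x += step_size * drift + noise_scale * rng.gauss(0.0, 1.0)
    return x
```

With `weight=0.0` a start at `x0=-0.5` relaxes into the left well near -1, while `weight=1.0` pulls the same start to the prior mode at +1. Intermediate weights trade off the data prior against the energy landscape, which is the coupling the abstract describes in general terms.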

2604.20022 2026-05-26 cs.LG cs.AI cs.CL

MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support

Yusuf Kesmen, Fay Elhassan, Jiayi Ma, Julien Stalhandske, Yena Chang, David Sasu, Alexandra Kulinkina, Akhil Arora, Lars Klein, Mary-Anne Hartley

Affiliations * LiGHT, EPFL; University of Bern; Aarhus University

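AI summary Proposes the MoBayes framework, which separates reasoning from language by using the LLM as a language interface and a Bayesian module for probabilistic inference. In clinical decision support it outperforms standalone frontier LLM doctors.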

Comments 50 pages including appendix, 13 figures, 22 tables. Preprint

English abstract

Large language models (LLMs) are increasingly used for conversational clinical decision support, yet they conflate next-token prediction with probabilistic decision-making. We argue that this conflation reflects an architectural limitation: such systems lack explicit posterior tracking, controllable abstention thresholds, and auditable reasoning chains. We introduce MoBayes, a Modular Bayesian dialogue framework that separates reasoning from language. The LLM acts only as a language interface, parsing patient conversation into structured observations, while a Bayesian module performs probabilistic inference over these observations to update posteriors, select follow-up questions via expected information gain, and determine when to stop or defer through calibrated decision thresholds. This design enables explicit posterior tracking, controllable selective decision-making, and replaceable population-specific statistical backends without retraining the language model. Across empirical and LLM-generated knowledge bases, MoBayes outperforms standalone frontier LLM doctors, including matched model-family comparisons where inexpensive sensor models paired with MoBayes exceed larger autonomous models at lower cost. The advantage persists under adversarial patient communication styles and across varying diagnostic scenarios. These results suggest that reliable conversational clinical decision support systems should separate probabilistic reasoning from language generation rather than scaling model size alone. Code is available at https://anonymous.4open.science/r/MoBayes/
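The reasoning loop outlined in this abstract (posterior update, question selection by expected information gain, and a threshold for stopping or deferring) can be sketched with a discrete Bayesian model over yes/no observations. This is a minimal illustration, not the MoBayes implementation: the function names, the binary-answer assumption, the conditional-independence assumption, and the toy conditions and likelihoods in the usage note are all hypothetical.

```python
import math


def normalize(weights):
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def update(posterior, likelihood_yes, answer):
    """Bayes update; likelihood_yes[d] = P(answer is yes | condition d)."""
    return normalize({
        d: p * (likelihood_yes[d] if answer else 1.0 - likelihood_yes[d])
        for d, p in posterior.items()
    })


def entropy(dist):
    """Shannon entropy in bits."""
    return -sum(v * math.log2(v) for v in dist.values() if v > 0)


def expected_information_gain(posterior, likelihood_yes):
    """Current entropy minus expected entropy after hearing the answer."""
    p_yes = sum(posterior[d] * likelihood_yes[d] for d in posterior)
    gain = entropy(posterior)
    for answer, p_answer in ((True, p_yes), (False, 1.0 - p_yes)):
        if p_answer > 0:
            gain -= p_answer * entropy(update(posterior, likelihood_yes, answer))
    return gain


def next_action(posterior, questions, asked, threshold=0.9):
    """Decide, ask the most informative remaining question, or defer."""
    best, p_best = max(posterior.items(), key=lambda kv: kv[1])
    if p_best >= threshold:
        return ("decide", best)
    remaining = {q: lk for q, lk in questions.items() if q not in asked}
    if not remaining:
        return ("defer", None)
    chosen = max(
        remaining,
        key=lambda q: expected_information_gain(posterior, remaining[q]),
    )
    return ("ask", chosen)
```

For example, with a uniform prior over two hypothetical conditions `{"flu": 0.5, "cold": 0.5}` and questions `{"fever": {"flu": 0.9, "cold": 0.1}, "cough": {"flu": 0.6, "cold": 0.5}}`, `next_action` asks about fever first because that answer separates the two conditions far more than the cough answer does. In a full system the language model would only map patient text to the `answer` argument, and the likelihood tables could be swapped without retraining it.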

2604.19151 2026-05-26 cs.CL cs.SD eess.AS

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

Kaushal Bhogale, Manas Dhir, Amritansh Walecha, Manmeet Kaur, Vanshika Chhabra, Aaditya Pareek, Hanuman Sidh, Mahima Manik, Sagar Jain, Bhaskar Singh, Utkarsh Singh, Tahir Javed, Shobhit Banga, Mitesh M. Khapra

Affiliations * Indian Institute of Technology, Madras, India; Josh Talks, India

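AI summary To address the limitations of existing Indic ASR benchmarks, proposes Voice of India, a closed-source benchmark built from unscripted telephonic conversations. It covers 15 major Indian languages and 139 regional clusters, contains 306,230 utterances (536 hours), and analyzes how geography, audio quality, speaking rate, gender, and device type affect ASR performance.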

Comments 6 pages, 4 figures

English abstract

Existing Indic ASR benchmarks often use scripted, clean speech and leaderboard-driven evaluation that encourages dataset-specific overfitting. In addition, strict single-reference WER penalizes natural spelling variation in Indian languages, including non-standardized spellings of code-mixed English-origin words. To address these limitations, we introduce Voice of India, a closed-source benchmark built from unscripted telephonic conversations covering 15 major Indian languages across 139 regional clusters. The dataset contains 306,230 utterances, totaling 536 hours of speech from 36,691 speakers, with transcripts accounting for spelling variations. We also analyze performance geographically at the district level, revealing disparities. Finally, we provide detailed analysis across factors such as audio quality, speaking rate, gender, and device type, highlighting where current ASR systems struggle and offering insights for improving real-world Indic ASR systems.
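The point about strict single-reference WER penalizing spelling variants can be made concrete with a word error rate that accepts any of several spellings per reference position. The sketch below is one plausible way to do this and is not the scoring used by the benchmark: the function name, the per-word variant sets, and the romanized example words are hypothetical.

```python
def variant_aware_wer(reference, hypothesis):
    """Word error rate where each reference slot lists acceptable spellings.

    reference: list of sets of strings (all accepted spellings of each word)
    hypothesis: list of strings (the ASR output, tokenized into words)
    """
    n, m = len(reference), len(hypothesis)
    if n == 0:
        raise ValueError("reference must contain at least one word")
    # dist[i][j]: minimum edits aligning the first i reference slots
    # with the first j hypothesis words (standard Levenshtein table).
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # A hypothesis word matches if it is any accepted variant.
            substitution = 0 if hypothesis[j - 1] in reference[i - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,                 # deletion
                dist[i][j - 1] + 1,                 # insertion
                dist[i - 1][j - 1] + substitution,  # match or substitution
            )
    return dist[n][m] / n
```

With the hypothetical reference `[{"mobile", "mobail"}, {"chahiye", "chahie"}, {"hai"}]`, the hypothesis `["mobail", "chahie", "hai"]` incurs no errors, whereas the strict single-spelling reference `[{"mobile"}, {"chahiye"}, {"hai"}]` counts two substitutions out of three words for the same output.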