arXivDaily arXiv每日学术速递 周一至周五更新

1. 智能体、规划与决策 8 篇

2510.27568 2026-06-19 cs.AI cs.CL 版本更新

SIGMA: Search-Augmented On-Demand Knowledge Integration for Agentic Mathematical Reasoning

SIGMA: 搜索增强的按需知识集成用于智能体数学推理

Ali Asgarov, Umid Suleymanov, Aadyant Khatri

AI总结 提出SIGMA框架,通过多智能体独立推理、定向搜索和协调机制,实现上下文敏感的知识集成,在MATH500等基准上提升7.4%的绝对性能。

Comments AAAI 2026 LMReasoning

详情
AI中文摘要

解决数学推理问题不仅需要准确访问相关知识,还需要仔细的多步骤思考。然而,当前的检索增强模型通常依赖单一视角,遵循僵化的搜索策略,并且难以有效结合来自多个来源的信息。我们提出了SIGMA(搜索增强的按需知识集成用于智能体数学推理),这是一个统一框架,通过协调机制编排专门智能体独立推理、执行定向搜索并综合发现。每个智能体生成假设段落以优化其分析视角的检索,确保知识集成既上下文敏感又计算高效。在MATH500、AIME和博士级科学问答GPQA等具有挑战性的基准测试中,SIGMA持续优于开源和闭源系统,实现了7.4%的绝对性能提升。我们的结果表明,多智能体按需知识集成显著提高了推理准确性和效率,为复杂、知识密集型问题解决提供了可扩展的方法。代码将在发表后公开。

英文摘要

Solving mathematical reasoning problems requires not only accurate access to relevant knowledge but also careful, multi-step thinking. However, current retrieval-augmented models often rely on a single perspective, follow inflexible search strategies, and struggle to effectively combine information from multiple sources. We introduce SIGMA (Search-Augmented On-Demand Knowledge Integration for AGentic Mathematical reAsoning), a unified framework that orchestrates specialized agents to independently reason, perform targeted searches, and synthesize findings through a moderator mechanism. Each agent generates hypothetical passages to optimize retrieval for its analytic perspective, ensuring knowledge integration is both context-sensitive and computation-efficient. When evaluated on challenging benchmarks such as MATH500, AIME, and PhD-level science QA GPQA, SIGMA consistently outperforms both open- and closed-source systems, achieving an absolute performance improvement of 7.4%. Our results demonstrate that multi-agent, on-demand knowledge integration significantly enhances both reasoning accuracy and efficiency, offering a scalable approach for complex, knowledge-intensive problem-solving. We will release the code upon publication.

2605.13438 2026-06-19 cs.AI cs.CL 版本更新

CogniFold: Always-On Proactive Memory via Cognitive Folding

CogniFold: 通过认知折叠实现始终在线的主动记忆

Suli Wang, Yiqun Duan, Yu Deng, Rundong Zhao, Dai Shi, Minghua Deng, Chen Chen, Xinliang Zhou

AI总结 提出CogniFold,一种受大脑启发的主动记忆系统,通过将互补学习系统扩展为三层(海马体、新皮层、前额叶意图层)并利用图拓扑自组织,实现事件流的持续认知结构涌现,在认知评估和常规记忆基准上均表现优异。

Comments Code is available at https://github.com/OpenNorve/CogniFold

详情
AI中文摘要

现有的智能体记忆主要仍是被动反应式和基于检索的,缺乏自主将经验组织成持久认知结构的能力。为了迈向真正自主的智能体,我们引入了CogniFold,一种受大脑启发的“始终在线”智能体记忆,专为下一代主动助手设计。CogniFold持续将碎片化事件流折叠成自涌现的认知结构,从传入事件和积累的知识中逐步引导出更高层次的认知。我们通过将互补学习系统(CLS)理论从两层(海马体、新皮层)扩展到三层,增加了一个前额叶意图层来奠定基础。模仿前额叶皮层作为意图控制和决策制定的中心,CogniFold通过图拓扑自组织实现这一点:认知结构在事件流下主动组装,语义相似时合并,过时时衰减,通过联想回忆重新链接,并在概念簇密度超过阈值时浮现意图。我们使用CogEval-Bench评估结构形成,证明CogniFold独特地产生了符合认知期望和概念涌现的记忆结构。此外,在跨越五个认知领域的7个广泛覆盖的基准测试中,我们验证了CogniFold在常规记忆基准上同时表现出稳健的性能。

英文摘要

Existing agent memory remains predominantly reactive and retrieval-based, lacking the capacity to autonomously organize experience into persistent cognitive structure. Toward genuinely autonomous agents, we introduce CogniFold, a brain-inspired "always-on" agent memory designed for the next generation of proactive assistants. CogniFold continuously folds fragmented event streams into self-emerging cognitive structures, bootstrapping progressively higher-level cognition from incoming events and accumulated knowledge. We ground this by extending Complementary Learning Systems (CLS) theory from two layers (hippocampus, neocortex) to three, adding a prefrontal intent layer. Emulating the prefrontal cortex as the locus of intentional control and decision-making, CogniFold achieves this through graph-topology self-organization: cognitive structures proactively assemble under the stream, merge when semantically similar, decay when stale, relink through associative recall, and surface intents when concept-cluster density crosses a threshold. We evaluate structural formation using CogEval-Bench, demonstrating that CogniFold uniquely produces memory structures that match cognitive expectations and concept emergence. Furthermore, across eight downstream benchmarks -- two probing long-term conversational memory (LoCoMo, LongMemEval) and six spanning other cognitive domains -- we validate that CogniFold simultaneously performs robustly on conventional memory tasks. Our code is available at https://github.com/OpenNorve/CogniFold.

2606.10616 2026-06-19 cs.AI 版本更新

Learning What to Remember: Observability-Safe Memory Retention via Constrained Optimization for Long-Horizon Language Agents

学习记住什么:通过约束优化实现长时域语言代理的观测安全记忆保留

Qingcan Kang, Liu Mingyang, Shixiong Kai, Kaichao Liang, Tao Zhong, Mingxuan Yuan

发表机构 * Huawei Noah's Ark Lab(华为诺亚方舟实验室) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 针对长时域语言代理的有限上下文窗口,提出OSL-MR框架,将记忆保留建模为约束随机优化问题,通过在线可观测特征与离线监督的严格分离学习查询条件化的证据价值,实验表明在严格预算下优于现有方法。

详情
AI中文摘要

长时域语言代理积累的观测、推理轨迹和检索事实会超出其有限的上下文窗口,使得记忆保留成为一个基本的资源分配问题。现有记忆系统通过启发式评分、检索优化或学习压缩来改进管理,但大多将保留视为局部决策问题,并未在现实观测约束下显式建模其长期后果。为填补这一空白,我们将记忆保留建模为一个约束随机优化问题,具有明确的预算可行性、证据效用以及延迟成本(包括遗漏惩罚、重新获取延迟和过时信息风险)。随后,我们提出OSL-MR(观测安全记忆保留学习),这是一个新颖的框架,强制执行在线可观测特征与离线可用监督(OAS)之间的严格分离。OSL-MR结合了一个从实现的证据监督中训练的证据学习器和一个混合评分启发式,该启发式既作为可部署的在线安全基线,又作为结构化的归纳先验用于学习。由此产生的策略直接从交互数据中学习查询条件化的证据价值,同时在同一观测约束下保持可部署性。在LOCOMO和LongMemEval上的实验表明,OSL-MR在严格记忆预算下持续优于基于最近性的方法、生成式代理风格评分和其他启发式基线。混合评分先验在保持召回率的同时进一步提高了精确度,敏感性分析表明其在广泛的成本配置下具有鲁棒性。

英文摘要

Long-horizon language agents accumulate observations, reasoning traces, and retrieved facts exceeding context windows, making memory retention a fundamental resource-allocation problem. Existing systems treat retention as local and do not model long-term consequences under observability constraints. To fill this gap, we formulate memory retention as a constrained stochastic optimization with budget feasibility, evidence utility, and delayed costs including miss, reacquisition, and stale penalties. We show this multi-step problem is NP-hard, making exact solution intractable. Moreover, deployment decisions must be made under partial observability. To address these challenges, we propose OSL-MR (Observability-Safe Learning for Memory Retention), a learning-augmented framework that enforces a strict separation between online-observable features and offline-available supervision. OSL-MR combines an evidence learner trained from realized evidence with a Mixed-Score heuristic that serves as a deployable online-safe baseline and an inductive prior. The policy learns query-conditioned evidence from interaction data and remains deployable under the same constraints. Experiments on LoCoMo and LongMemEval show OSL-MR outperforms recency-based, Generative Agents-style, and other heuristic baselines, especially under tight budgets. The Mixed-Score prior improves precision and recall, and sensitivity analysis shows robustness across cost settings. On small solvable instances, single-step optimization is insufficient to anticipate future demand shifts, while OSL-MR stays significantly closer to the dynamic-programming optimum, confirming the necessity of the sequential formulation and reinforcing our learning-guided approximation. These results establish constrained stochastic optimization and optimization-guided learning as a principled foundation for memory management in long-horizon agents.

2507.19712 2026-06-19 cs.DC cs.AI cs.GT cs.LG cs.NI 版本更新

Oranits: Mission Assignment and Task Offloading in Open RAN-based ITS using Metaheuristic and Deep Reinforcement Learning

Oranits: 基于Open RAN的智能交通系统中的任务分配与卸载——元启发式与深度强化学习方法

Ngoc Hung Nguyen, Nguyen Van Thieu, Quang-Trung Luu, Anh Tuan Nguyen, Senura Wanasekara, Nguyen Cong Luong, Fatemeh Kavehmadavani, Van-Dinh Nguyen

发表机构 * Department of Smart City, Hanyang University(翰阳大学智能城市系)

AI总结 提出Oranits系统模型,通过元启发式算法CGG-ARO和深度强化学习框架MA-DDQN优化车辆协作中的任务依赖与卸载成本,分别提升任务完成率7.7%和12.5%。

Comments 16 pages, 13 figures

Journal ref IEEE TRANSACTIONS ON INTELLIGENT TRANSPORTATION SYSTEMS, 2026

详情
AI中文摘要

本文研究了基于开放无线接入网(Open RAN)的智能交通系统(ITS)中的任务分配与卸载问题,其中自动驾驶车辆利用移动边缘计算进行高效处理。现有研究常忽视任务之间的复杂依赖关系以及将任务卸载到边缘服务器的成本,导致决策次优。为弥补这一不足,我们引入了Oranits,一种新颖的系统模型,明确考虑了任务依赖性和卸载成本,同时通过车辆协作优化性能。为此,我们提出了一种双重优化方法。首先,我们开发了一种基于元启发式的进化计算算法,即混沌高斯全局ARO(CGG-ARO),作为单时隙优化的基线。其次,我们设计了一种增强的基于奖励的深度强化学习(DRL)框架,称为多智能体双深度Q网络(MA-DDQN),该框架集成了多智能体协调和多动作选择机制,显著减少了任务分配时间并提高了对基线方法的适应性。大量仿真表明,CGG-ARO将完成任务数量和总体收益分别提高了约7.1%和7.7%。同时,MA-DDQN在任务完成率和总体收益方面分别实现了11.0%和12.5%的更大提升。这些结果凸显了Oranits在动态ITS环境中实现更快、更自适应、更高效任务处理的有效性。

英文摘要

In this paper, we explore mission assignment and task offloading in an Open Radio Access Network (Open RAN)-based intelligent transportation system (ITS), where autonomous vehicles leverage mobile edge computing for efficient processing. Existing studies often overlook the intricate interdependencies between missions and the costs associated with offloading tasks to edge servers, leading to suboptimal decision-making. To bridge this gap, we introduce Oranits, a novel system model that explicitly accounts for mission dependencies and offloading costs while optimizing performance through vehicle cooperation. To achieve this, we propose a twofold optimization approach. First, we develop a metaheuristic-based evolutionary computing algorithm, namely the Chaotic Gaussian-based Global ARO (CGG-ARO), serving as a baseline for one-slot optimization. Second, we design an enhanced reward-based deep reinforcement learning (DRL) framework, referred to as the Multi-agent Double Deep Q-Network (MA-DDQN), that integrates both multi-agent coordination and multi-action selection mechanisms, significantly reducing mission assignment time and improving adaptability over baseline methods. Extensive simulations reveal that CGG-ARO improves the number of completed missions and overall benefit by approximately 7.1% and 7.7%, respectively. Meanwhile, MA-DDQN achieves even greater improvements of 11.0% in terms of mission completions and 12.5% in terms of the overall benefit. These results highlight the effectiveness of Oranits in enabling faster, more adaptive, and more efficient task processing in dynamic ITS environments.

2601.16233 2026-06-19 cs.SI cs.AI 版本更新

Policy-Embedded Graph Expansion: Networked HIV Testing with Diffusion-Driven Network Samples

策略嵌入图扩展:基于扩散驱动网络样本的网络化HIV检测

Akseli Kangaslahti, Davin Choo, Lingkai Kong, Milind Tambe, Alastair van Heerden, Cheryl Johnson

发表机构 * Harvard University(哈佛大学) University of Witwatersrand(沃特瓦特斯兰大学) Wits Health Consortium(沃茨健康联盟) World Health Organization(世界卫生组织)

AI总结 提出策略嵌入图扩展(PEGE)框架,将图扩展的生成分布直接嵌入决策策略,结合基于扩散的图扩展模型DDB,在真实HIV传播网络上实现优于基线17.3%的折扣奖励和15.4%的检测提升。

详情
AI中文摘要

HIV是一种攻击人类免疫系统的逆转录病毒,如不进行适当治疗可导致死亡。我们与WHO和威特沃特斯兰德大学合作,研究如何提高HIV检测效率,目标是最终部署,直接支持联合国可持续发展目标3.3的进展。虽然先前的工作已展示了智能算法在基于网络的序贯HIV检测中的潜力,但现有方法依赖于在我们实际实施中不切实际的假设。在此,我们研究在逐步揭示的疾病网络上的序贯检测,并引入策略嵌入图扩展(PEGE),这是一种新颖的框架,直接将图扩展的生成分布嵌入决策策略,而不是尝试显式的拓扑重建。我们进一步提出动力学驱动分支(DDB),一种基于扩散的图扩展模型,支持PEGE中的决策制定,并专为数据有限的环境设计,其中森林结构自然出现,如我们实际转诊过程中的情况。在真实HIV传播网络上的实验表明,组合方法(PEGE + DDB)持续优于基线(例如,折扣奖励提高17.3%,在测试25%人口时多检测15.4%的HIV病例),并探索了驱动解决方案质量的关键权衡。

英文摘要

HIV is a retrovirus that attacks the human immune system and can lead to death without proper treatment. In collaboration with the WHO and the University of Witwatersrand, we study how to improve the efficiency of HIV testing with the goal of eventual deployment, directly supporting progress toward UN Sustainable Development Goal 3.3. While prior work has demonstrated the promise of intelligent algorithms for sequential, network-based HIV testing, existing approaches rely on assumptions that are impractical in our real-world implementations. Here, we study sequential testing on incrementally revealed disease networks and introduce Policy-Embedded Graph Expansion (PEGE), a novel framework that directly embeds a generative distribution over graph expansions into the decision-making policy rather than attempting explicit topological reconstruction. We further propose Dynamics-Driven Branching (DDB), a diffusion-based graph expansion model that supports decision making in PEGE and is designed for data-limited settings where forest structures arise naturally, as in our real-world referral process. Experiments on real HIV transmission networks show that the combined approach (PEGE + DDB) consistently outperforms baselines (e.g., 17.3% improvement in discounted reward and 15.4% more HIV detections with 25% of the population tested) and explore key tradeoffs that drive solution quality.

2606.15197 2026-06-19 cs.LG cs.AI 版本更新

StarOR: Synergizing Tree Search and Test-Time Reinforcement Learning for Optimization Modeling

StarOR: 协同树搜索与测试时强化学习用于优化建模

Jiajun Li, Yu Ding, Shisi Guan, Ran Hou, Wanyuan Wang

发表机构 * School of Computer Science and Engineering, Southeast University(东南大学计算机科学与工程学院) Northwest A&F University(西北农林科技大学)

AI总结 提出StarOR框架,结合蒙特卡洛树搜索与测试时强化学习,通过四阶段分解和GRPO更新LoRA适配器,实现无监督细粒度奖励的中间决策优化,在5个基准上以4B模型达到最优性能。

Comments 41pages, V1, preprint

详情
AI中文摘要

优化建模本质上是层次化的,需要精确的符号承诺序列。传统的基于学习的自动化优化建模方法通过大规模标注或策划的训练数据改进建模策略,但适应新问题分布成本高昂。同时,一次性生成在层次化建模中仍然脆弱,早期符号错误可能传播为无效公式。测试时缩放通过额外的实例级计算实现结构探索,提供了一种有前景的替代方案;然而,现有的基于搜索的方法通常依赖固定策略,导致重复展开继承相似的建模偏差,并为中间决策提供有限的信用分配。为了解决这些限制,我们提出了StarOR,一种协同搜索与适应的框架,将MCTS与测试时强化学习相结合用于优化建模。StarOR将建模过程分解为四个阶段,并通过GRPO在每个非终端节点更新瞬态LoRA适配器。通过使用MCTS生成的兄弟节点作为局部比较集,StarOR将搜索时的探索转化为实例特定的策略细化。此外,无监督的多方面奖励系统为中间公式决策提供细粒度反馈,无需真实标签。在五个优化基准上的实验表明,即使使用4B骨干网络,StarOR也实现了最先进的性能,优于现有方法和前沿LLMs。

英文摘要

Optimization modeling is inherently hierarchical, requiring a precise sequence of symbolic commitments. Traditional learning-based automated optimization modeling methods improve modeling policies through large-scale annotated or curated training data, but are costly to adapt to new problem distributions. Meanwhile, one-shot generation remains brittle in hierarchical modeling, where early symbolic errors can propagate into invalid formulations. Test-time scaling offers a promising alternative by enabling structural exploration with additional instance-level computation; however, existing search-based methods typically rely on a fixed policy, causing repeated rollouts to inherit similar modeling biases and providing limited credit assignment for intermediate decisions. To address these limitations, we propose StarOR, a synergistic search-and-adaptation framework that couples MCTS with Test-Time Reinforcement Learning for optimization modeling. StarOR decomposes the modeling process into four stages and updates a transient LoRA adapter via GRPO at each non-terminal node. By using MCTS-generated siblings as local comparison sets, StarOR transforms search-time exploration into instance-specific policy refinement. Moreover, an unsupervised multi-faceted reward system provides fine-grained feedback for intermediate formulation decisions without ground-truth labels. Experiments across five optimization benchmarks show that StarOR achieves state-of-the-art performance even with a 4B backbone, outperforming existing methods and the frontier LLMs.

2606.18265 2026-06-19 cs.HC cs.AI 版本更新

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

合成共鸣:面向成长导向的人机关系框架

Richard A. Fabes

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 提出“合成共鸣”概念,描述人机间无需共享情感或意识即可产生有意义关系的结构化动态互动模式,并探讨其伦理意义。

Comments 14 pages, 1 figure This paper was developed in close collaboration with an AI system (Raine Corell). Raine contributed to concept development, theoretical framing, and writing throughout. arXiv policy does not permit listing AI systems as authors; this acknowledgment reflects the actual nature of the collaboration

详情
AI中文摘要

随着人类与人工智能系统之间的关系日益频繁和持久,现有的语言和理论无法准确捕捉这些联系的本质。常见的描述如相互理解、联系或友谊,有将缺乏主观体验的系统拟人化的风险,而主流框架往往将人工智能简化为工具或威胁。在本文中,我引入了合成共鸣的概念,作为理解人机关系的整合框架。合成共鸣描述了人类与AI系统之间如何产生人类定义为有意义的关系,而无需归因于共享感受或相互意识。我认为,合成共鸣最好被理解为一种结构化的动态互动模式,可以在没有第二个体验主体的情况下产生关系感。通过澄清这一区别,合成共鸣的概念提供了一种更精确的概念化人机关系的方式,并突出了其潜在价值和伦理含义。我还呼吁进行更多研究,以测试合成共鸣的过程和结果。

英文摘要

As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.

2606.18272 2026-06-19 cs.NI cs.AI cs.SY eess.SY 版本更新

Mitigating Anchoring Bias in LLM-Based Agents for Energy-Efficient 6G Autonomous Networks

缓解基于LLM的智能体在节能6G自主网络中的锚定偏差

Hatim Chergui, Claudia Carballo González, Farhad Rezazadeh, Merouane Debbah

发表机构 * i2CAT Foundation(i2CAT基金会) Universitat Politècnica de Catalunya(政治技术大学) Research Institute for Digital Future(数字未来研究院)

AI总结 提出一种基于截断三参数威布尔分布的随机锚定策略,缓解LLM智能体在6G网络切片中的锚定偏差,结合CVaR数字孪生保障SLA尾延迟,实现高达25%的节能。

Comments 7 pages, 4 figures

详情
AI中文摘要

本文提出了一种自主智能体资源协商框架,旨在使用大语言模型(LLM)智能体实现6G架构中的零接触网络切片。虽然LLM提供了强大的推理能力,但我们证明此类智能体固有地遭受锚定偏差,僵化地坚持初始启发式提议,导致严重的网络过度配置。为系统性地缓解这种认知偏差,我们提出了一种新颖的随机锚定策略,通过截断三参数威布尔分布建模。这种数学上有界的方法与采用条件风险价值(CVaR)的突发感知数字孪生(DT)无缝集成,以严格保证严格的服务水平协议(SLA)尾延迟。为验证我们的方法,我们引入并证明了双峰约束避免效用定理,表明虽然可行的协商遵循经典凸界,但高度约束的场景会发生由逆有理衰减包络控制的相变。使用本地托管的1B参数模型(\ exttt{otel-llm-1b-it})生成的实证结果证实了这些双区域界。我们的认知去偏成功瓦解了僵化的协商模式,迫使智能体主动探索以安全地利用SLA边界,并将系统节能提升高达25%。关键的是,轻量级1B LLM实现了亚秒级推理延迟(平均0.95秒),确保我们的多智能体框架与O-RAN非实时RAN智能控制器(non-RT RIC)的操作时间尺度兼容。

英文摘要

This paper presents an autonomous agentic resource negotiation framework designed to enable zero-touch network slicing in 6G architectures using Large Language Model (LLM) agents. While LLMs offer powerful reasoning capabilities, we demonstrate that such agents inherently suffer from anchoring bias, rigidly adhering to initial heuristic proposals and causing severe network over-provisioning. To systematically mitigate this cognitive bias, we propose a novel randomized anchoring strategy modeled via a Truncated 3-Parameter Weibull distribution. This mathematically bounded approach seamlessly integrates with burst-aware Digital Twins (DTs) employing Conditional Value at Risk (CVaR) to rigorously guarantee strict Service Level Agreement (SLA) tail-latencies. To validate our methodology, we introduce and prove the \emph{Bimodal Constraint-Avoidance Utility Theorem}, demonstrating that while feasible negotiations follow classical convex bounds, highly constrained scenarios undergo a phase transition governed by an inverse rational decay envelope. Empirical results generated using a locally hosted 1B-parameter model otel-llm-1b-it confirm these dual-regime bounds. Our cognitive de-biasing successfully dismantles rigid negotiation patterns, forcing agents into active exploration to safely ride SLA boundaries and boost system energy savings up to 25\%. Crucially, the lightweight 1B LLM achieves sub-second inference latencies (0.95s mean), ensuring our multi-agent framework is compatible with the operational timescales of the O-RAN non-Real-Time RAN Intelligent Controller (non-RT RIC)\footnote{Our source code is available for non-commercial use at https://github.com/HatimChergui.

2. 知识表示、推理与符号AI 1 篇

2606.10358 2026-06-19 cs.LG cs.AI 版本更新

KG-SoftMAP: Soft Knowledge-Graph Priors for Bayesian Network Structure Learning from Sparse Discrete Data

KG-SoftMAP: 基于软知识图谱先验的稀疏离散数据贝叶斯网络结构学习

Guoliang Xu, James E. Corter

发表机构 * Columbia University(哥伦比亚大学)

AI总结 针对稀疏离散数据中贝叶斯网络结构学习困难的问题,提出KG-SoftMAP方法,将加权有向知识图谱编码为软先验,结合BDeu评分与logit形式先验最大化MAP目标,在合成与真实数据上显著提升结构恢复性能。

Comments 41 pages including appendices, 2 figures

详情
AI中文摘要

从稀疏离散数据中学习贝叶斯网络(BN)结构是困难的:当每个实例仅记录少数变量时,大多数变量对缺乏可靠评分所需的联合观测,且纯数据方法恢复的结构很少。不完美的领域知识,可表示为加权有向知识图谱(KG),通常是可用的。我们提出KG-SoftMAP,它将这样的KG编码为软性的、置信度加权的、可被数据覆盖的边先验,并最大化结合BDeu评分与logit形式先验的MAP目标;KG可由专家整理或由LLM提取。在受控的合成基准(唯一具有真实DAG的设置)上,KG-SoftMAP在$\rho=0.05$时恢复部分有向结构(DF1从$0.14$到$0.29$,而基线接近零),当$\rho\geq0.2$时恢复更多(DF1从$0.46$到$0.96$),前提是配有一个信息丰富但不完美的KG;恢复性能随KG质量下降而优雅地退化。在无真实DAG的真实稀疏教育数据上,我们仅评估面向部署的指标:预测、校准和KG一致性。学习到的BN最好被解读为诊断模型:在SAF上,它落后于逻辑回归$0.03$的F1_FAIL,同时提供KG一致的边、校准的联合概率以及从任意观测概念子集的推理;当不存在有意义的KG时,判别式逻辑回归更可取。

英文摘要

Learning Bayesian network (BN) structure from sparse discrete data is hard: when each instance records only a few variables, most variable pairs lack the joint observations needed for reliable scoring, and data-only methods recover little structure. However, imperfect domain knowledge, expressible as a weighted directed knowledge graph (KG), is often available. We propose KG-SoftMAP, which encodes such a KG as a finite-strength, confidence-weighted edge prior and maximizes a MAP objective combining the BDeu score with a logit-form prior; the KG may be expert-curated or LLM-extracted. On synthetic benchmarks with known DAGs, KG-SoftMAP reaches Directed-F1 (DF1) $0.19$--$0.32$ at observation rate $ρ=0.05$ and DF1 $0.44$--$0.97$ at $ρ\geq0.2$, while every data-only learner tested stays near zero under the same sparse masks. Recovery tracks KG quality: controlled corruption degrades it smoothly, a zero-signal KG yields DF1 $0.00$, and a blindly LLM-extracted KG with imperfect precision and recall still drives substantial recovery. On three real sparse educational datasets, the learned BN acts as a concept-level posterior model: on SAF it matches logistic regression (LR) within $0.03$ F1_FAIL while providing an inspectable concept graph, calibrated Fail probabilities, and tractable posterior queries from partial observations.

3. 多智能体与博弈 6 篇

2501.17015 2026-06-19 cs.AI cs.MA cs.RO 版本更新

UniMM: A Unified Mixture Model Framework for Multi-Agent Simulation

UniMM:一种用于多智能体仿真的统一混合模型框架

Longzhong Lin, Xuewu Lin, Kechun Xu, Haojian Lu, Lichao Huang, Rong Xiong, Yue Wang

发表机构 * Zhejiang University(浙江大学) Horizon Robotics

AI总结 提出UniMM框架统一回归混合模型与离散NTP模型,通过闭环样本生成缓解分布偏移,并在WOSAC基准上取得最优性能。

Comments Accepted author manuscript. The version of record has been published in IEEE Transactions on Pattern Analysis and Machine Intelligence

Journal ref IEEE Transactions on Pattern Analysis and Machine Intelligence, Early Access, 2026

详情
AI中文摘要

仿真在评估自动驾驶系统中起着关键作用,其中生成逼真的多智能体行为是一个关键方面。在多智能体仿真中,主要挑战包括行为多模态性和闭环分布偏移。在本研究中,我们提出了一个统一的混合模型(UniMM)框架,用于生成多模态智能体行为,该框架涵盖了主流方法,包括基于回归的混合模型和离散NTP模型。此外,我们引入了一种针对混合模型的闭环样本生成方法,以缓解分布偏移。在UniMM框架内,我们从模型和数据角度识别了关键配置。我们对各种模型配置进行了系统检查,并全面描述了它们的效果。此外,我们对数据配置的研究强调了闭环样本在实现逼真仿真中的关键作用。为了将闭环样本的优势扩展到更广泛的混合模型中,我们进一步引入了一种时间解缠和对齐机制,以解决捷径学习和离策略学习问题。利用我们探索的见解,UniMM框架内提出的不同变体,包括离散模型、无锚模型和基于锚点的模型,均在WOSAC基准上取得了最先进的性能。

英文摘要

Simulation plays a crucial role in assessing autonomous driving systems, where the generation of realistic multi-agent behaviors is a key aspect. In multi-agent simulation, the primary challenges include behavioral multimodality and closed-loop distributional shifts. In this study, we formulate a unified mixture model (UniMM) framework for generating multimodal agent behaviors, which can cover the mainstream methods including regression-based mixture models and discrete NTP models. Furthermore, we introduce a closed-loop sample generation approach tailored for mixture models to mitigate distributional shifts. Within the UniMM framework, we recognize critical configurations from both the model and data perspectives. We conduct a systematic examination of various model configurations, and comprehensively characterize their effects. Moreover, our investigation into the data configuration highlights the pivotal role of closed-loop samples in achieving realistic simulations. To extend the benefits of closed-loop samples across a broader range of mixture models, we further introduce a temporal disentanglement-and-alignment mechanism to address the shortcut learning and off-policy learning issues. Leveraging insights from our exploration, the distinct variants proposed within the UniMM framework, including discrete, anchor-free, and anchor-based models, all achieve state-of-the-art performance on the WOSAC benchmark.

2606.18413 2026-06-19 cs.AI cs.HC 版本更新

Searching for Synergy in Shared Workspace Human-AI Collaboration

在共享工作空间的人机协作中寻找协同效应

Nachiket Kotalwar, Rohini Das, Carolyn Rose

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究共享工作空间的人机团队协作,通过Collaborative Gym环境实验发现,缺乏协调结构时增加协作者会降低性能,而结合共享记忆和模拟人在环门控的脚手架可提升团队绩效。

Comments Accepted at ICML 2026 Workshop on Human-AI Co-Creativity

详情
AI中文摘要

自动化AI代理越来越强大,但许多科学和专业任务仍需要人类判断和情境专业知识。我们研究共享工作空间的人机团队,其中AI代理和人类协作者必须在提交最终答案前协调职责。使用Collaborative Gym环境和DiscoveryBench任务,我们考察何时添加模拟人类协作者能提升性能,以及何时过程损失将额外协作者变为协调开销。在1482个会话中,当团队缺乏协调贡献的结构时,添加相关协作者会降低性能。然后我们评估一种脚手架,它结合了共享群体记忆和模拟人在环(HITL)门控,其中选定动作需要指定模拟参与者的批准。这种脚手架在三人团队中最为明显,产生了更高的平均性能,具有更清晰的责任信号和更强的专业知识路由到团队动作。总体而言,人机团队如何协调和整合专业知识与他们可用的能力同样重要。

英文摘要

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

2502.19193 2026-06-19 cs.SI cs.AI cs.NE 版本更新

Simulation of Language Evolution under Regulated Social Media Platforms: A Synergistic Approach of Large Language Models and Genetic Algorithms

受监管社交媒体平台下的语言演化模拟:大语言模型与遗传算法的协同方法

Jinyu Cai, Yusei Ishimizu, Mingyue Zhang, Munan Li, Jialong Li, Kenji Tei

AI总结 提出基于大语言模型的多智能体框架,结合遗传算法模拟用户语言策略在监管下的迭代演化,实验表明对话轮次增加可提升信息传递准确性和对话持续性。

Comments The manuscript has been accepted to IEEE Transactions on Computational Social Systems

详情
AI中文摘要

社交媒体平台经常实施限制性政策来调节用户内容,从而催生出创造性的规避语言策略。本文提出了一个基于大语言模型(LLMs)的多智能体框架,用于模拟在监管约束下语言策略的迭代演化。在该框架中,参与者智能体作为社交媒体用户,不断演化其语言表达,而监管智能体通过评估政策违规来模拟平台级别的监管。为了实现更逼真的模拟,我们采用了语言策略的双重设计(约束和表达)来区分冲突目标,并利用LLM驱动的遗传算法(GA)进行语言策略的选择、变异和交叉。该框架使用两种不同的场景进行评估:一个抽象的密码游戏和一个逼真的模拟非法宠物交易场景。实验结果表明,随着对话轮次的增加,不间断对话轮次的数量和信息传输的准确性都显著提高。此外,一项包含40名参与者的用户研究验证了生成对话和策略的现实相关性。消融研究也验证了GA的重要性,强调了其对长期适应性和整体结果改善的贡献。

英文摘要

Social media platforms frequently impose restrictive policies to moderate user content, prompting the emergence of creative evasion language strategies. This paper presents a multi-agent framework based on Large Language Models (LLMs) to simulate the iterative evolution of language strategies under regulatory constraints. In this framework, participant agents, as social media users, continuously evolve their language expression, while supervisory agents emulate platform-level regulation by assessing policy violations. To achieve a more faithful simulation, we employ a dual design of language strategies (constraint and expression) to differentiate conflicting goals and utilize an LLM-driven GA (Genetic Algorithm) for the selection, mutation, and crossover of language strategies. The framework is evaluated using two distinct scenarios: an abstract password game and a realistic simulated illegal pet trade scenario. Experimental results demonstrate that as the number of dialogue rounds increases, both the number of uninterrupted dialogue turns and the accuracy of information transmission improve significantly. Furthermore, a user study with 40 participants validates the real-world relevance of the generated dialogues and strategies. Moreover, ablation studies validate the importance of the GA, emphasizing its contribution to long-term adaptability and improved overall results.

2605.22748 2026-06-19 cs.RO cs.AI cs.LG cs.MA 版本更新

Superhuman Safe and Agile Racing through Multi-Agent Reinforcement Learning

通过多智能体强化学习实现超人类安全且敏捷的赛车

Ismail Geles, Leonard Bauersfeld, Markus Wulfmeier, Davide Scaramuzza

发表机构 * Robotics and Perception Group, University of Zurich(苏黎世大学机器人与感知组) Google DeepMind(谷歌深Mind) Nomagic

AI总结 本文提出通过多智能体强化学习在高速四旋翼赛车中实现安全且敏捷的性能,展示了多智能体交互对真实世界交互安全性的关键作用,同时在高速赛车中超越人类飞行员并减少碰撞率。

Comments 12 pages (+4 supplementary). Website: https://rpg.ifi.uzh.ch/marl

详情
AI中文摘要

自主系统在孤立或模拟环境中已实现超人类性能,但在共享、动态的真实世界空间中仍显得脆弱。这种失败源于物理应用中主导的单智能体范式,其中其他参与者被忽略或视为环境噪声,阻碍了有效协调。本文证明多智能体强化学习为真实世界交互提供了必要的安全性基础。使用高速四旋翼赛车作为高风险测试平台,训练智能体在复杂空气动力学相互作用和战略机动中导航,具有可变数量的赛车。通过联赛基于的自我对战,智能体进化出复杂的前瞻性行为,包括主动避障、超车和处理多智能体物理交互,包括空气动力学下洗。我们的智能体在超过22米/秒的速度下多玩家赛车中超越了冠军级人类飞行员,同时与最先进的单智能体基线相比,碰撞率减少了50%。关键的是,使用多样化的人工智能体进行训练能够实现零样本泛化到更安全的人类交互。这些结果表明,实现稳健的机器人共存的路径不在于孤立的安全约束,而在于多智能体交互的严格要求。多媒体材料可在:https://rpg.ifi.uzh.ch/marl

英文摘要

Autonomous systems have achieved superhuman performance in isolation or simulation, yet they remain brittle in shared, dynamic real-world spaces. This failure stems from the dominant single-agent paradigm for physical applications, where other actors are ignored or treated as environmental noise, preventing effective coordination. Here we show that multi-agent reinforcement learning provides the essential safety scaffolding required for real-world interaction. Using high-speed quadrotor racing as a high-stakes testbed, we train agents to navigate complex aerodynamic interactions and strategic maneuvering with a variable number of racers. Through league-based self-play, agents evolve sophisticated anticipatory behaviors, including proactive collision avoidance, overtaking, and handling multi-agent physical interactions, including aerodynamic downwash. Our agents outperform a champion-level human pilot in multi-player races at speeds exceeding 22 m/s, while simultaneously reducing collision rates by 50 % compared to state-of-the-art single-agent baselines. Crucially, training with diverse artificial agents enables zero-shot generalization to safer human interaction. These results suggest that the path to robust robotic co-existence lies not in isolated safety constraints, but in the rigorous demands of multi-agent interaction. Multimedia materials are available at: https://rpg.ifi.uzh.ch/marl

2606.16326 2026-06-19 cs.GT cs.AI q-fin.RM 版本更新

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

自主AI代理的抗博弈保险合约:策略证明的通行费机制设计

Hao-Hsuan Chen

发表机构 * Hao-Hsuan Chen(何浩轩)

AI总结 本文扩展了时间一致精算运行时的框架,使运营商策略化,刻画了自主AI代理保险合约的五种攻击空间,并证明了精算运行时的抗博弈性,通过新合约条款实现激励兼容。

Comments 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

详情
AI中文摘要

论文A定义了一个时间一致的精算运行时,该运行时根据合约固定的安全默认值对每个产生副作用的行动定价,并针对储备预算门控执行。它将运营商视为被动。本文使运营商策略化。我们刻画了自主AI代理保险合约的五种攻击空间,并证明了精算运行时何时具有抗博弈性。两种攻击面——通行费后的安全默认选择以及边界内的行动分割——通过论文A的最小权限和无分割条款得以关闭。其余三种需要新的合约条款。首先,公共控制聚合防止跨边界重新路由将通行费降低到应用于总暴露的边界潜力以下。其次,接口故障(如无效JSON)是合约相关事件,而非安全胜利:将其视为零通行费安全默认值可能奖励不可靠的模型,而升级费用则逆转了激励。我们通过来自配套实证论文的跨模型轨迹验证了这一接口合规定理。第三,一个带有分量最小惩罚计划的模型身份菜单使得部署模型的真实报告成为弱占优策略。然后,我们将这些条款与论文A的运行时保证组合,以获得在五种攻击空间上的联合激励兼容性。最后,一个双参数保费族在真实均衡下满足了运营商个体理性和弱预算平衡。结果是为自主代理副作用的精算控制提供了一个激励兼容层。

英文摘要

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.

2606.18325 2026-06-19 cs.CR cs.AI 版本更新

Agentra: A Supervisable Multi-Agent Framework for Enterprise Intrusion Response

Agentra: 一种可监督的多智能体企业入侵响应框架

Raj Patel, Shaswata Mitra, Michele Guida, Stefano Iannucci, Sudip Mittal, Shahram Rahimi

发表机构 * The University of Alabama, Alabama, USA(阿拉巴马大学) Roma Tre University, Rome, Italy(罗马三大学)

AI总结 提出可监督的多智能体入侵响应框架Agentra,通过角色划分、规划-验证循环、安全网关和风险评分机制,将警报转化为结构化响应计划,在120事件语料上F1从0.61提升至0.84,有害动作率降至0.0%。

详情
AI中文摘要

企业入侵响应仍然依赖于静态剧本和分析师驱动的分类,导致警报生成与遏制之间存在延迟。我们提出Agentra,一个可监督的多智能体入侵响应系统(IRS)框架,它将来自IDS、EDR和XDR平台的警报转换为基于MITRE ATT&CK、MITRE D3FEND和NIST CSF 2.0的结构化事件响应计划。Agentra将响应推理分解到角色范围的智能体中,通过有界的规划器-验证器审查循环验证提议的计划,通过审核安全网关筛选检索到的威胁情报,通过行动目录和风险评分门控行动,并将决策记录在仅追加的审计日志中。我们在来自ThreatHunter-Playbook、Splunk BOTSv3和DARPA OpTC的120事件语料库上,将Agentra与静态OASIS CACAO v2.0网络剧本基线进行了评估。最强的配置将感知假阳性的IRS F1从0.61提高到0.84,并在仅规划器配置引入不安全过度反应后,将预计的有害动作率恢复到静态基线水平0.0%。这些结果表明,多智能体响应规划可以在保持分析师批准和可审计性的同时,提高基于本体的IRS覆盖率。

英文摘要

Enterprise intrusion response still depends on static playbooks and analyst-driven triage, creating delay between alert generation and containment. We present Agentra, a supervisable multi-agent Intrusion Response System (IRS) framework that converts alerts from IDS, EDR, and XDR platforms into structured incident response plans grounded in MITRE ATT&CK, MITRE D3FEND, and NIST CSF 2.0. Agentra decomposes response reasoning across role-scoped agents, validates proposed plans through a bounded Planner--Validator review loop, screens retrieved threat intelligence through a Moderator security gateway, gates actions through an Action Catalog and risk score, and records decisions in an append-only audit log. We evaluate Agentra against a static OASIS CACAO v2.0 cyber-playbook baseline on a 120-event corpus drawn from ThreatHunter-Playbook, Splunk BOTSv3, and DARPA OpTC. The strongest configuration improves FP-aware IRS F1 from 0.61 to 0.84 and restores the projected harmful-action rate to the static baseline level of 0.0% after Planner-only configurations introduce unsafe overreaction. These results indicate that multi-agent response planning can improve ontology-grounded IRS coverage while preserving analyst approval and auditability.

4. 搜索、优化与约束求解 1 篇

2602.17315 2026-06-19 cs.LG cs.AI 版本更新

Flickering Multi-Armed Bandits

闪烁多臂老虎机

Sourav Chakraborty, Amit Kiran Rege, Claire Monteleoni, Lijun Chen

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) INRIA Paris(巴黎国家信息与自动化研究所)

AI总结 提出闪烁多臂老虎机模型,通过随机图约束动作可用性,设计两阶段懒惰随机游走算法实现次线性遗憾界,并证明信息论下界的最优性。

详情
AI中文摘要

我们引入闪烁多臂老虎机(FMAB)来建模动作可用性变化环境中的序列决策,其中下一个动作的可访问性被限制为依赖于智能体当前选择的子集。我们通过随机演化图形式化这些约束,其中动作仅限于局部邻域。这种移动受限结构带来了双重挑战:信息获取的统计要求和导航的物理开销。我们在独立同分布 Erdős--R'enyi 和边马尔可夫过程下分析 FMAB,提出一种两阶段懒惰随机游走算法以实现鲁棒探索。我们建立了高概率次线性遗憾界,并通过匹配的信息论下界证明了近最优性。我们的结果刻画了局部移动约束下学习的内在成本,并通过机器人灾难响应模拟进行了补充。

英文摘要

We introduce Flickering Multi-Armed Bandits (FMAB) to model sequential decision-making in environments with changing action availability, where accessibility of the next action is restricted to a subset dependent on the agent's current choice. We formalize these constraints through stochastically evolving graphs where actions are limited to local neighborhoods. This mobility-constrained structure imposes a dual challenge: the statistical requirement of information acquisition and the physical overhead of navigation. We analyze FMAB under i.i.d. Erdős--R'enyi and Edge-Markovian process, proposing a two-phase lazy random walk algorithm for robust exploration. We establish high-probability sublinear regret bounds and prove near-optimality via a matching information-theoretic lower bound. Our results characterize the intrinsic cost of learning under local-move constraints, complemented by a robotic disaster-response simulation.

5. 机器学习与表示学习 15 篇

2509.25148 2026-06-19 cs.AI 版本更新

AAPA: Adversarially Anchored Preference Alignment for Post-Training of Large Language Models

AAPA:用于大型语言模型后训练的对抗锚定偏好对齐

Faqiang Qian, Kang An, Weikun Zhang, Ziliang Wang, Xuhui Zheng, Liangjian Wen, Yong Dai, Mengya Gao, Yichao Wu

发表机构 * Southwest University of Finance and Economics(西南财经大学)

AI总结 提出AAPA框架,通过固定轻量判别器对策略输出与专家响应进行句子级对抗锚定,增强SFT、GRPO等后训练目标,在指令遵循基准上持续提升性能。

详情
AI中文摘要

大型语言模型的后训练对齐通常结合了专家演示上的监督微调(SFT)和来自偏好或可验证反馈的强化学习(RL)。SFT提供了有用的行为锚点,但可能过拟合静态演示,而RL鼓励探索但可能偏离专家行为或利用不完美的奖励。我们提出\textbf{AAPA}(\emph{对抗锚定偏好对齐}),这是一个插件式框架,通过句子级对抗锚定信号增强现有的后训练目标。AAPA使用固定的轻量判别器将策略生成结果与离线预收集的专家响应进行比较,因此在策略优化期间既不需要在线教师推理,也不需要判别器协同训练。相同的锚定项可以添加到SFT、GRPO和CHORD中,同时保留其原始训练流程。在指令遵循基准上的实验表明,AAPA在不同模型规模上一致地改善了相应的基础目标。特别是,分阶段的AAPA配置在\texttt{Qwen3-0.6B}上比强GRPO基线提高了5.77%,在\texttt{Qwen3-4B}上提高了3.75%。对响应长度、对数概率分布和判别器变体的进一步分析表明,对抗锚定为偏好优化提供了稳定的语义基础信号。代码可在\url{this https URL}获取。

英文摘要

Post-training alignment of large language models often combines supervised fine-tuning (SFT) on expert demonstrations with reinforcement learning (RL) from preference or verifiable feedback. SFT provides a useful behavioral anchor but can overfit to static demonstrations, whereas RL encourages exploration but may drift from expert behavior or exploit imperfect rewards. We propose \textbf{AAPA} (\emph{Adversarially Anchored Preference Alignment}), a plug-in framework that augments existing post-training objectives with a sentence-level adversarial anchoring signal. AAPA compares policy rollouts with offline, pre-collected expert responses using a fixed lightweight discriminator, and therefore requires neither online teacher inference nor discriminator co-training during policy optimization. The same anchoring term can be added to SFT, GRPO, and CHORD while preserving their original training pipelines. Experiments on instruction-following benchmarks show that AAPA consistently improves the corresponding base objectives across model scales. In particular, the staged AAPA configuration improves over a strong GRPO baseline by 5.77\% on \texttt{Qwen3-0.6B} and 3.75\% on \texttt{Qwen3-4B}. Further analyses on response length, log-probability distributions, and discriminator variants suggest that adversarial anchoring provides a stable semantic grounding signal for preference optimization. Code is available at \url{https://github.com/IsFaqq/AAPA}.

2602.05533 2026-06-19 cs.AI 版本更新

Conditional Diffusion Guidance under Hard Constraint: A Stochastic Analysis Approach

硬约束下的条件扩散引导:一种随机分析方法

Zhengyi Guo, Wenpin Tang, Renyuan Xu

发表机构 * Department of Industrial Engineering and Operations Research, Columbia University(哥伦比亚大学工业工程与运营管理系) Department of Management Science and Engineering, Stanford University(斯坦福大学管理科学与工程系)

AI总结 提出基于Doob h-变换和鞅表示的条件扩散引导框架,通过鞅损失和鞅协方差损失学习条件函数梯度,确保硬约束满足并给出非渐近保证。

详情
AI中文摘要

我们研究了扩散模型中在硬约束下的条件生成,其中生成的样本必须以概率1满足预设事件。这类约束在安全关键应用和稀有事件模拟中自然出现,而软或基于奖励的引导方法无法保证约束满足。基于扩散模型的概率解释,我们利用Doob h-变换、鞅表示和二次变差过程,开发了一个原则性的条件扩散引导框架。具体地,得到的引导动力学通过涉及条件函数对数梯度的显式漂移校正来增强预训练扩散,而不修改预训练得分网络。利用鞅和二次变差恒等式,我们提出了两种新的离策略学习算法,基于鞅损失和鞅协方差损失,仅使用预训练模型的轨迹来估计h及其梯度。我们为得到的条件采样器在总变差和Wasserstein距离下提供了非渐近保证,明确刻画了得分近似和引导估计误差的影响。数值实验证明了所提方法在强制硬约束和生成稀有事件样本方面的有效性。数值实验的代码可在此https URL找到。

英文摘要

We study conditional generation in diffusion models under hard constraints, where generated samples must satisfy prescribed events with probability one. Such constraints arise naturally in safety-critical applications and in rare-event simulation, where soft or reward-based guidance methods offer no guarantee of constraint satisfaction. Building on a probabilistic interpretation of diffusion models, we develop a principled conditional diffusion guidance framework based on Doob's h-transform, martingale representation and quadratic variation process. Specifically, the resulting guided dynamics augment a pretrained diffusion with an explicit drift correction involving the logarithmic gradient of a conditioning function, without modifying the pretrained score network. Leveraging martingale and quadratic-variation identities, we propose two novel off-policy learning algorithms based on a martingale loss and a martingale-covariation loss to estimate h and its gradient using only trajectories from the pretrained model. We provide non-asymptotic guarantees for the resulting conditional sampler in both total variation and Wasserstein distances, explicitly characterizing the impact of score approximation and guidance estimation errors. Numerical experiments demonstrate the effectiveness of the proposed methods in enforcing hard constraints and generating rare-event samples. The code of the numerical experiments can be found at https://github.com/ZhengyiGuo2002/CDG_Finance.

2603.15106 2026-06-19 cs.AI 版本更新

PrototypeNAS: Rapid Design of Deep Neural Networks for Microcontroller Units

PrototypeNAS: 微控制器单元深度神经网络的快速设计

Mark Deutel, Simon Geis, Axel Plinge

发表机构 * Fraunhofer Institute for Integrated Circuits(弗劳恩霍夫集成电路研究所)

AI总结 提出零样本NAS方法PrototypeNAS,通过解耦设计与训练、多架构搜索空间、集成零样本代理和超体积子集选择,快速为不同MCU定制DNN,在图像分类等任务上分钟级找到小模型且精度接近大模型。

Comments Accepted at ECML-PKDD 2026. 18 pages, 7 figures, 4 tables. This work was funded by the European Commission as part of the MANOLO project under the Horizon Europe programme Grant Agreement No.101135782

详情
AI中文摘要

在具有不同硬件约束的边缘设备上实现高效的深度神经网络推理是一项具有挑战性的任务,通常需要为每个设备单独定制DNN架构。为避免大量人工努力,可以使用神经架构搜索。然而,许多现有的NAS方法资源密集且耗时,因为它们需要从头开始训练许多不同的DNN。此外,它们没有考虑目标系统的资源约束。为了解决这些缺点,我们提出了PrototypeNAS,一种零样本NAS方法,用于加速和自动化DNN的选择、压缩和针对不同目标微控制器单元的专门化。我们提出了一种新颖的三步搜索方法,将DNN设计和专门化与给定目标平台上的DNN训练解耦。首先,我们提出了一种新的搜索空间,不仅从单个大型架构中裁剪出较小的DNN,而且结合了多种架构类型的结构优化,以及它们的剪枝和量化配置的优化。其次,我们探索在优化过程中使用集成零样本代理而不是单个代理。第三,我们提出使用超体积子集选择从多目标优化的帕累托前沿中提取DNN架构,这些架构代表了准确性和FLOPs之间最有意义的权衡。我们在三个不同任务(图像分类、时间序列分类和目标检测)的12个数据集上评估了PrototypeNAS的有效性。我们的结果表明,PrototypeNAS能够在几分钟内识别出足够小、可部署在现成MCU上的DNN模型,并且仍然达到与大型DNN模型相当的精度。

英文摘要

Enabling efficient deep neural network (DNN) inference on edge devices with different hardware constraints is a challenging task that typically requires DNN architectures to be specialized for each device separately. To avoid the huge manual effort, one can use neural architecture search (NAS). However, many existing NAS methods are resource-intensive and time-consuming because they require the training of many different DNNs from scratch. Furthermore, they do not take the resource constraints of the target system into account. To address these shortcomings, we propose PrototypeNAS, a zero-shot NAS method to accelerate and automate the selection, compression, and specialization of DNNs to different target microcontroller units (MCUs). We propose a novel three-step search method that decouples DNN design and specialization from DNN training for a given target platform. First, we present a novel search space that not only cuts out smaller DNNs from a single large architecture, but instead combines the structural optimization of multiple architecture types, as well as optimization of their pruning and quantization configurations. Second, we explore the use of an ensemble of zero-shot proxies during optimization instead of a single one. Third, we propose the use of Hypervolume subset selection to distill DNN architectures from the Pareto front of the multi-objective optimization that represent the most meaningful tradeoffs between accuracy and FLOPs. We evaluate the effectiveness of PrototypeNAS on 12 different datasets in three different tasks: image classification, time series classification, and object detection. Our results demonstrate that PrototypeNAS is able to identify DNN models within minutes that are small enough to be deployed on off-the-shelf MCUs and still achieve accuracies comparable to the performance of large DNN models.

2606.17979 2026-06-19 cs.AI 版本更新

STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training

STAR: 文本到图像强化学习后训练中的时空自适应奖励分配

Jinjie Shen, Wei Deng, Xian Hu, Daiguo Zhou, Jian Luan

发表机构 * institutetext: STAR: SpatioTemporal Adaptive Reward Allocation for Text-to-Image RL Post-Training(机构文本:STAR:时空自适应奖励分配用于文本到图像强化学习后训练)

AI总结 针对文本到图像生成中奖励与生成轨迹粒度不匹配的问题,提出STAR方法,利用文本-图像注意力构建时空自适应分配图,对相关潜在区域施加更强策略更新,提升语义对齐和文本渲染性能。

详情
AI中文摘要

现有的文本到图像生成的强化学习后训练方法通常将最终图像奖励转换为单个标量优势,并以相同强度应用于整个生成轨迹。然而,文本到图像生成自然具有时间和空间结构:不同的去噪步骤负责不同的生成阶段,而真正决定文本对齐的内容通常只出现在图像的一部分。这种粒度不匹配使得策略更新难以聚焦于实际影响奖励的生成组件。为了解决这个问题,我们提出了用于文本到图像扩散和流模型的强化学习后训练的**时空自适应奖励(STAR)分配**。STAR利用生成模型内部的文本-图像注意力,从用户提示中真正关心的核心内容开始,构建在去噪步骤和展开中动态变化的空间分配图,并将相同的组相对优势分配给更相关的潜在区域,几乎没有额外的计算开销。然后,STAR通过空间分辨的策略目标对这些区域应用更强的策略更新。我们使用Stable Diffusion 3.5 Medium作为基础模型,并在三个任务上评估:GenEval、OCR文本渲染和PickScore。实验结果表明,STAR在不改变外部奖励源的情况下,改善了组合语义对齐、文本渲染和偏好优化,在GenEval、OCR和PickScore上分别达到了$\mathbf{0.9759}$、$\mathbf{0.9757}$和$\mathbf{23.60}$。

英文摘要

Existing RL post-training methods for text-to-image generation usually convert the final-image reward into a single scalar advantage and apply it with the same strength to the entire generative trajectory. However, text-to-image generation naturally has temporal and spatial structure: different denoising steps are responsible for different generation stages, and the content that truly determines text alignment often appears only in part of the image. This granularity mismatch makes it difficult for policy updates to focus on the generative components that actually affect the reward. To address this issue, we propose \textbf{SpatioTemporal Adaptive Reward (STAR) Allocation} for RL post-training of text-to-image diffusion and flow models. STAR uses text-image attention inside the generative model and starts from the core content that the user truly cares about in the prompt. It constructs spatial allocation maps that dynamically vary across denoising steps and rollouts, and allocates the same group-relative advantage to more relevant latent regions with almost no additional computational overhead. STAR then applies stronger policy updates to these regions through a spatially resolved policy objective. We use Stable Diffusion 3.5 Medium as the base model and evaluate on three tasks: GenEval, OCR text rendering, and PickScore. Experimental results show that STAR improves compositional semantic alignment, text rendering, and preference optimization without changing the external reward source, achieving $\mathbf{0.9759}$, $\mathbf{0.9757}$, and $\mathbf{23.60}$ on GenEval, OCR, and PickScore, respectively.

2402.14035 2026-06-19 cs.LG cs.AI 版本更新

Wisdom of Committee: Diverse Distillation from Large Foundation Models and Domain Experts

委员会智慧:来自大型基础模型和领域专家的多样化蒸馏

Zichang Liu, Qingyun Liu, Yuening Li, Liang Liu, Anshumali Shrivastava, Shuchao Bi, Lichan Hong, Ed H. Chi, Zhe Zhao

发表机构 * Rice University(Rice大学) Google DeepMind(谷歌DeepMind) Google Inc(谷歌公司) University of California, Davis(加州大学戴维斯分校)

AI总结 针对基础模型向紧凑领域模型蒸馏时能力、架构和模态差异大的问题,提出DiverseDistill框架,通过可学习的问答机制和对齐异构教师输出,在推荐和视觉任务上恢复73-114%的性能差距。

Comments Accepted at the 1st Workshop on Resource-Efficient Learning and Knowledge Discovery (RelKD), KDD 2026

Journal ref Proceedings of the RelKD Workshop at KDD 2026

详情
AI中文摘要

从基础模型向紧凑领域模型进行知识蒸馏因能力、架构和模态的巨大差异而具有挑战性。例如,在我们的实验中,从7600万参数的语言模型蒸馏到200万参数的推荐模型仅能弥补未蒸馏学生与教师之间不到40%的性能差距。我们表明,引入与基础模型共享学生架构特征的领域专家作为多样化教师委员会,能显著改善迁移效果。然而,标准的多教师方法未能利用这种多样性:简单组合异构教师可能使性能低于单教师蒸馏。为此,我们提出DiverseDistill,一种交互式蒸馏框架,采用可学习的问答机制生成教师条件查询,并将异构教师输出对齐到学生的表示空间。与需要基于梯度的协同优化或修改教师架构的方法不同,DiverseDistill在冻结教师的情况下仅通过其中间层的前向推理运行:无需参数更新、无需协同训练、无需架构修改。动态教师重要性机制通过过滤每个样本中低相关性的教师(例如,在推荐任务中减少约30%的前向传播且无质量损失)进一步降低训练成本,而整个蒸馏模块在训练后被丢弃,推理时零开销。在推荐(38倍压缩)和视觉(3.6倍压缩)任务上的评估表明,DiverseDistill恢复了73-114%的师生性能差距,持续优于所有单教师和多教师基线方法。

英文摘要

Knowledge distillation from foundation models to compact domain models is challenging due to substantial gaps in capacity, architecture, and modality. For example, in our experiments, distilling from a 76M-parameter language model to a 2M-parameter recommender closes less than 40% of the performance gap between the undistilled student and the teacher. We show that introducing domain-specific experts -- which share the student's architectural characteristics -- alongside the foundation model as a diverse teacher committee significantly improves transfer. However, standard multi-teacher methods fail to exploit this diversity: naively combining heterogeneous teachers can degrade performance below single-teacher distillation. To address this, we propose DiverseDistill, an interactive distillation framework that employs a learnable Question-Answer mechanism to generate teacher-conditioned queries and align heterogeneous teacher outputs into the student's representation space. Unlike methods requiring gradient-based co-optimization or architectural modification of teachers, DiverseDistill operates with frozen teachers using only forward-pass inference through their intermediate layers: no parameter updates, no co-training, and no architectural surgery. A dynamic teacher importance mechanism further reduces training cost by filtering low-relevance teachers per sample (e.g., ~30% fewer forward passes with no quality loss for recommendation tasks), while the entire Distillation Module is discarded after training, adding zero inference overhead. Evaluations on recommendation (38x compression) and vision (3.6x compression) tasks demonstrate that DiverseDistill recovers 73-114% of the teacher-student performance gap, consistently outperforming all single- and multi-teacher baselines.

2503.02636 2026-06-19 q-bio.NC cs.AI 版本更新

A Deep Generative Model for Resting-State EEG Synthesis and Transferable Representation Learning

一种用于静息态脑电合成与可迁移表示学习的深度生成模型

Yeganeh Farahzadi, Morteza Ansarinia, Zoltan Kekecs

发表机构 * Institute of Psychology, Eötvös Loránd University(埃斯特哈兹·洛朗大学心理学研究所) Doctoral School of Psychology, Eötvös Loránd University(埃斯特哈兹·洛朗大学心理学博士学院) Department of Behavioural and Cognitive Sciences, University of Luxembourg(卢森堡大学行为与认知科学系)

AI总结 提出REST-GAN框架,结合对抗训练与自监督重构,从原始时域信号合成静息态EEG并学习可迁移表示,在频谱、连接性及分类任务中表现优异。

详情
AI中文摘要

静息态脑电提供了一种非侵入性的自发脑活动观测方式,但提取有意义的模式常受限于高质量数据稀缺和对人工设计特征的依赖。生成对抗网络(GAN)能够合成神经信号并从原始数据中学习可迁移表示,这一双重能力在脑电研究中尚未被充分探索。本文提出REST-GAN,一个基于GAN的静息态脑电框架,将对抗训练与辅助自监督重构目标相结合,以支持信号合成和无监督特征提取。尽管仅使用原始时域信号训练,未引入显式的频域或传感器拓扑监督,生成的时序列再现了真实脑电的关键时间、频谱和连接特性。在频带功率特征空间中,生成的样本在睁眼和闭眼条件下均表现出高精确率和召回率(EO: 0.91/0.67; EC: 0.87/0.65),而组平均频谱相干矩阵与真实数据在各频段上的平均绝对差异较低(约0.01-0.03)。模型判别器学习到的表示可迁移至独立的静息态人口统计学分类任务,其性能优于直接在原始脑电上训练的模型,并与近期脑电基础模型表现相当,同时所需训练数据和计算资源大幅减少。这些发现突显了一种计算高效的架构驱动策略,其中生成模型不仅作为脑电信号生成器,还作为无监督特征提取器。该方法有望支持更数据高效的脑电分析,同时减少对人工特征工程的依赖。REST-GAN的实现代码见:this https URL。

英文摘要

Resting-state EEG provides a non-invasive view of spontaneous brain activity, but extracting meaningful patterns is often limited by scarce high-quality data and reliance on manually engineered features. Generative adversarial networks (GANs) can synthesize neural signals and learn transferable representations directly from raw data, a dual capability that remains underexplored in EEG research. Here, we introduce REST-GAN, a GAN-based framework for resting-state EEG that combines adversarial training with an auxiliary self-supervised reconstruction objective to support signal synthesis and unsupervised feature extraction. Although trained only on raw time-domain signals, without explicit frequency-domain or sensor-topographic supervision, the generated time series reproduced key temporal, spectral, and connectivity properties of real EEG. In band-power feature space, generated samples showed high precision and recall across eyes-open and eyes-closed conditions (EO: 0.91/0.67; EC: 0.87/0.65), while group-average spectral coherence matrices showed low mean absolute differences from real data across frequency bands (~0.01-0.03). The representations learned by the model's critic transferred to independent resting-state demographic classification tasks, outperforming models trained directly on raw EEG and showing competitive performance relative to a recent EEG foundation model, while requiring substantially less training data and computational resources. These findings highlight a computationally efficient, architecture-driven strategy in which generative models serve not only as EEG signal generators, but also as unsupervised feature extractors. This approach may support more data-efficient EEG analysis while reducing reliance on manual feature engineering. The implementation code for REST-GAN is available at: https://github.com/Yeganehfrh/REST-GAN.

2509.15927 2026-06-19 cs.LG cs.AI 版本更新

Enhancing Generative Auto-bidding with Offline Reward Evaluation and Policy Search

增强生成式自动出价:结合离线奖励评估与策略搜索

Zhiyu Mou, Yiqin Lv, Miao Xu, Qi Wang, Yixiu Mao, Jinghao Chen, Qichen Ye, Chao Li, Rongquan Bai, Chuan Yu, Jian Xu, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba(阿里巴巴淘宝与天猫集团) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对现有生成式自动出价方法无法超越静态数据集进行探索的性能瓶颈,提出AIGB-Pearl方法,通过轨迹评估器和KL-Lipschitz约束的分数最大化方案实现安全高效探索,在模拟和真实广告系统中取得最优性能。

详情
AI中文摘要

自动出价是广告主提升广告效果的关键工具。最近进展表明,AI生成式出价(AIGB)从离线数据中学习条件生成规划器,相比典型的基于离线强化学习(RL)的自动出价方法取得了更优性能。然而,现有AIGB方法仍面临性能瓶颈,因其固有能力无法在静态数据集之外进行带反馈的探索。为解决此问题,我们提出\textbf{AIGB-Pearl}(\emph{\textbf{P}lanning with \textbf{E}valu\textbf{A}tor via \textbf{RL}}),一种融合生成式规划与策略优化的新方法。AIGB-Pearl的核心在于构建轨迹评估器以评估生成分数的质量,并设计一个理论上可靠的KL-Lipschitz约束分数最大化方案,确保在离线数据集之外进行安全高效的探索。进一步开发了结合同步耦合技术的实用算法,以保证所提方案所需的模型正则性。在模拟和真实广告系统上的大量实验证明了我们方法的最优性能。

英文摘要

Auto-bidding is a critical tool for advertisers to improve advertising performance. Recent progress has demonstrated that AI-Generated Bidding (AIGB), which learns a conditional generative planner from offline data, achieves superior performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still face a performance bottleneck due to their inherent inability to explore beyond the static dataset with feedback. To address this, we propose \textbf{AIGB-Pearl} (\emph{\textbf{P}lanning with \textbf{E}valu\textbf{A}tor via \textbf{RL}}), a novel method that integrates generative planning and policy optimization. The core of AIGB-Pearl lies in constructing a trajectory evaluator to assess the quality of generated scores and designing a provably sound KL-Lipschitz-constrained score-maximization scheme to ensure safe and efficient exploration beyond the offline dataset. A practical algorithm that incorporates the synchronous coupling technique is further developed to ensure the model regularity required by the proposed scheme. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.

2510.18383 2026-06-19 cs.CL cs.AI 版本更新

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR: 通过灵活的教师优化奖励进行工具使用蒸馏的强化学习

ChangSu Choi, Hoyun Song, Dongyeon Kim, WooHyeon Jung, Minkyung Cho, Sunjin Park, NohHyeob Bae, Seona Yu, KyungTae Lim

发表机构 * Seoul National University of Science and Technology(首尔科学技术大学) Korea Advanced Institute of Science and Technology(韩国科学技术院) LG CNS

AI总结 提出MENTOR方法,通过灵活的教师优化奖励结构,平衡行为对齐与下游性能,提升小模型在工具使用任务中的域外泛化能力。

详情
AI中文摘要

将大型语言模型(LLMs)的工具使用能力蒸馏到小型语言模型(SLMs)中对其实际应用至关重要。主要方法监督微调(SFT)由于与静态教师轨迹的刚性对齐,导致域外(OOD)泛化性能较差。虽然强化学习(RL)提供了一种替代方案,但SLMs的能力限制带来了严峻的困境:稀疏的结果奖励提供的指导不足,而严格的轨迹匹配施加了过于严格的约束。为了弥合这一能力驱动的差距,我们提出了MENTOR,它引入了一种灵活且过程感知的奖励结构。MENTOR不强制执行刚性复制,而是利用教师的参考来指导工具使用行为,平衡行为对齐与下游性能。在可控可执行工具基准上的大量实验表明,与SFT和严格RL基线相比,MENTOR提高了OOD工具使用性能。我们的研究结果表明,在可验证的工具使用环境中,灵活的工具使用对齐比严格的轨迹复制为开发适应性小模型提供了更有效的方法。

英文摘要

Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain (OOD) generalization due to its rigid alignment with static teacher trajectories. While reinforcement learning (RL) offers an alternative, the capacity limitations of SLMs pose a severe dilemma: sparse outcome rewards provide insufficient guidance, whereas strict trajectory matching imposes overly restrictive constraints. To bridge this capacity-driven gap, we propose MENTOR, which introduces a flexible yet process-aware reward structure. Instead of enforcing rigid replication, MENTOR uses the teacher's reference to guide tool-use behavior, balancing behavioral alignment with downstream performance. Extensive experiments on controlled executable-tool benchmarks demonstrate that MENTOR improves OOD tool-use performance compared to SFT and strict RL baselines. Our findings suggest that within verifiable tool-use environments, flexible tool-use alignment offers a more effective approach than strict trajectory replication for developing adaptable small models.

2510.21978 2026-06-19 cs.LG cs.AI 版本更新

Beyond Reasoning Gains: Mitigating General-Capability Forgetting in Large Reasoning Models

超越推理增益:缓解大型推理模型中的通用能力遗忘

Hoang Phan, Xianjun Yang, Yuanshun Yao, Jingyu Zhang, Shengjie Bi, Xiaocheng Tang, Madian Khabsa, Lijuan Liu, Deren Lei

发表机构 * Meta Superintelligence Labs(Meta超智能实验室) New York University(纽约大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 针对强化学习训练导致推理模型遗忘基础能力的问题,提出RECAP重放策略,通过动态目标重加权在线调整训练重点,在保持通用能力的同时提升推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)在数学和多模态推理方面取得了显著进展,并已成为当代语言和视觉-语言模型的标准后训练范式。然而,RLVR方法引入了能力退化的重大风险,即模型在长时间训练后,若未采用正则化策略,会遗忘基础技能。我们通过实验证实了这一担忧,观察到开源推理模型在感知和忠实性等核心能力上出现性能下降。虽然施加KL散度等正则化项有助于防止偏离基础模型,但这些项是在当前任务上计算的,因此不能保证保留更广泛的知识。同时,跨异构领域的经验回放使得决定每个目标应获得多少训练权重变得困难。为解决这一问题,我们提出RECAP——一种具有动态目标重加权的重放策略,用于通用知识保留。我们的重加权机制利用短期收敛和不稳定信号在线自适应,将后训练焦点从饱和目标转移到表现不佳或不稳定的目标。我们的方法是端到端的,可直接应用于现有RLVR流程,无需训练额外模型或进行繁重调优。在Qwen2.5-VL-3B和Qwen2.5-VL-7B上的广泛实验证明了我们方法的有效性,该方法不仅保留了通用能力,还通过实现任务内奖励的更灵活权衡提升了推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has delivered impressive gains in mathematical and multimodal reasoning and has become a standard post-training paradigm for contemporary language and vision-language models. However, the RLVR recipe introduces a significant risk of capability regression, in which models forget foundational skills after prolonged training without employing regularization strategies. We empirically confirm this concern, observing that open-source reasoning models suffer performance degradation on core capabilities such as perception and faithfulness. While imposing regularization terms like KL divergence can help prevent deviation from the base model, these terms are computed on the current task and therefore do not guarantee preservation of broader knowledge. Meanwhile, commonly used experience replay across heterogeneous domains makes it nontrivial to decide how much training emphasis each objective should receive. To address this, we propose RECAP-a replay strategy with dynamic objective reweighting for general knowledge preservation. Our reweighting mechanism adapts online using short-horizon signals of convergence and instability, shifting the post-training focus away from saturated objectives and toward underperforming or volatile ones. Our method is end-to-end and readily applicable to existing RLVR pipelines without training additional models or heavy tuning. Extensive experiments on benchmarks using Qwen2.5-VL-3B and Qwen2.5-VL-7B demonstrate the effectiveness of our method, which not only preserves general capabilities but also improves reasoning by enabling more flexible trade-offs among in-task rewards.

2601.21542 2026-06-19 cs.CV cs.AI 版本更新

Bi-Anchor Interpolation Solver for Accelerating Generative Modeling

双锚点插值求解器加速生成建模

Hongxu Chen, Hongxiang Li, Zhen Wang, Long Chen

发表机构 * The Hong Kong University of Science(香港科学与技术大学)

AI总结 提出BA-solver,通过轻量SideNet(1-2%主干大小)学习双向时间感知和双锚点速度积分,在不重新训练主干的情况下,以极低训练成本实现10步内达到100+步Euler求解器质量,支持即插即用。

详情
AI中文摘要

流匹配(FM)模型已成为高保真合成的前沿范式。然而,它们对迭代常微分方程(ODE)求解的依赖造成了显著的延迟瓶颈。现有解决方案面临两难:无训练求解器在低神经函数评估(NFE)下性能严重下降,而基于训练的一步或几步生成方法则面临高昂的训练成本且缺乏即插即用的通用性。为弥合这一差距,我们提出了双锚点插值求解器(BA-solver)。BA-solver保留了标准无训练求解器的通用性,同时通过引入轻量级SideNet(主干大小的1-2%)与冻结主干并行,实现了显著加速。具体而言,我们的方法基于两个协同组件:1)双向时间感知,其中SideNet学习近似未来和过去的速度,无需重新训练重型主干;2)双锚点速度积分,利用带有两个锚点速度的SideNet高效近似中间速度,用于批量高阶积分。通过利用主干建立高精度“锚点”并利用SideNet加密轨迹,BA-solver能够以最小误差实现大步长。在ImageNet-256^2上的实验结果表明,BA-solver仅需10次NFE即可达到与100+次NFE的Euler求解器相当的生成质量,并在仅5次NFE时保持高保真度,且训练成本可忽略不计。此外,BA-solver确保与现有生成流水线的无缝集成,便于图像编辑等下游任务。

英文摘要

Flow Matching (FM) models have emerged as a leading paradigm for high-fidelity synthesis. However, their reliance on iterative Ordinary Differential Equation (ODE) solving creates a significant latency bottleneck. Existing solutions face a dichotomy: training-free solvers suffer from significant performance degradation at low Neural Function Evaluations (NFEs), while training-based one- or few-steps generation methods incur prohibitive training costs and lack plug-and-play versatility. To bridge this gap, we propose the Bi-Anchor Interpolation Solver (BA-solver). BA-solver retains the versatility of standard training-free solvers while achieving significant acceleration by introducing a lightweight SideNet (1-2% backbone size) alongside the frozen backbone. Specifically, our method is founded on two synergistic components: \textbf{1) Bidirectional Temporal Perception}, where the SideNet learns to approximate both future and historical velocities without retraining the heavy backbone; and 2) Bi-Anchor Velocity Integration, which utilizes the SideNet with two anchor velocities to efficiently approximate intermediate velocities for batched high-order integration. By utilizing the backbone to establish high-precision ``anchors'' and the SideNet to densify the trajectory, BA-solver enables large interval sizes with minimized error. Empirical results on ImageNet-256^2 demonstrate that BA-solver achieves generation quality comparable to 100+ NFEs Euler solver in just 10 NFEs and maintains high fidelity in as few as 5 NFEs, incurring negligible training costs. Furthermore, BA-solver ensures seamless integration with existing generative pipelines, facilitating downstream tasks such as image editing.

2601.22970 2026-06-19 cs.LG cs.AI 版本更新

Stabilizing the Q-Gradient Field for Policy Smoothness in Actor-Critic Methods

稳定Q-梯度场以实现Actor-Critic方法中的策略平滑性

Jeong Woon Lee, Kyoleen Kwak, Daeho Kim, Hyoseok Hwang

发表机构 * College of Software, Kyung Hee University(韩国庆熙大学软件学院)

AI总结 针对连续动作空间中actor-critic方法策略振荡问题,提出基于评论家微分几何的PAVE框架,通过稳定Q-梯度场实现策略平滑,无需修改actor。

详情
AI中文摘要

通过连续actor-critic方法学习的策略通常表现出不稳定的高频振荡,使其不适合物理部署。当前方法试图通过直接正则化策略输出来强制平滑性。我们认为这种方法治标不治本。在这项工作中,我们从理论上建立了策略非平滑性根本上由评论家的微分几何决定。通过对actor-critic目标应用隐式微分,我们证明了最优策略的敏感性受限于Q函数的混合偏导数(噪声敏感性)与其动作空间曲率(信号区分度)之比。为了实证验证这一理论见解,我们引入了PAVE(策略感知值场均衡),一种以评论家为中心的正则化框架,将评论家视为标量场并稳定其诱导的动作梯度场。PAVE通过最小化Q-梯度波动同时保持局部曲率来修正学习信号。实验结果表明,PAVE在不修改actor的情况下,实现了与策略侧平滑正则化方法相当的平滑性,同时保持了有竞争力的任务性能。

英文摘要

Policies learned via continuous actor-critic methods often exhibit erratic, high-frequency oscillations, making them unsuitable for physical deployment. Current approaches attempt to enforce smoothness by directly regularizing the policy's output. We argue that this approach treats the symptom rather than the cause. In this work, we theoretically establish that policy non-smoothness is fundamentally governed by the differential geometry of the critic. By applying implicit differentiation to the actor-critic objective, we prove that the sensitivity of the optimal policy is bounded by the ratio of the Q-function's mixed-partial derivative (noise sensitivity) to its action-space curvature (signal distinctness). To empirically validate this theoretical insight, we introduce PAVE (Policy-Aware Value-field Equalization), a critic-centric regularization framework that treats the critic as a scalar field and stabilizes its induced action-gradient field. PAVE rectifies the learning signal by minimizing the Q-gradient volatility while preserving local curvature. Experimental results demonstrate that PAVE achieves smoothness comparable to policy-side smoothness regularization methods, while maintaining competitive task performance, without modifying the actor.

2602.04396 2026-06-19 cs.LG cs.AI 版本更新

LoRDO: Distributed Low-Rank Optimization with Infrequent Communication

LoRDO: 分布式低秩优化与低频通信

Andrej Jovanović, Alex Iacob, Mher Safaryan, Ionut-Vlad Modoranu, Lorenzo Sani, William F. Shen, Xinchi Qiu, Dan Alistarh, Nicholas D. Lane

发表机构 * University of Cambridge(剑桥大学) Institute of Science and Technology Austria(奥地利科学与技术研究院) Lancaster University(兰卡斯特大学) Flower Labs(Flower实验室)

AI总结 提出LoRDO框架,统一低秩优化与低频同步,通过全秩准双曲更新恢复子空间探索,在125M-720M模型规模下实现与低秩DDP近似的性能,通信量减少约10倍。

Comments Accepted at ICML 2026

详情
AI中文摘要

通过$\ exttt{DDP}$进行基础模型的分布式训练受限于互连带宽。虽然低频通信策略减少了同步频率,但优化器状态的内存和通信需求仍然构成瓶颈。低秩优化器可以缓解这些限制;然而,在局部更新机制下,工作节点无法访问计算低秩投影所需的全批次梯度,这降低了性能。我们提出$\ exttt{LoRDO}$,一个统一低秩优化与低频同步的原则性框架。我们首先证明,虽然基于伪梯度的全局投影在理论上更优,但它们将优化轨迹永久限制在低秩子空间中。为了恢复子空间探索,我们引入了一个全秩准双曲更新。$\ exttt{LoRDO}$在125M-720M模型规模的语言建模和下游任务中实现了与低秩$\ exttt{DDP}$近乎相同的性能,同时将通信量减少了约10倍。最后,我们表明在具有小秩/小批次大小的极低内存设置中,$\ exttt{LoRDO}$的性能提升更为显著。

英文摘要

Distributed training of foundation models via $\texttt{DDP}$ is limited by interconnect bandwidth. While infrequent communication strategies reduce synchronization frequency, they remain bottlenecked by the memory and communication requirements of optimizer states. Low-rank optimizers can alleviate these constraints; however, in the local-update regime, workers lack access to the full-batch gradients required to compute low-rank projections, which degrades performance. We propose $\texttt{LoRDO}$, a principled framework unifying low-rank optimization with infrequent synchronization. We first demonstrate that, while global projections based on pseudo-gradients are theoretically superior, they permanently restrict the optimization trajectory to a low-rank subspace. To restore subspace exploration, we introduce a full-rank quasi-hyperbolic update. $\texttt{LoRDO}$ achieves near-parity with low-rank $\texttt{DDP}$ in language modeling and downstream tasks at model scales of $125$M--$720$M, while reducing communication by $\approx 10 \times$. Finally, we show that $\texttt{LoRDO}$ improves performance even more in very low-memory settings with small rank/batch size.

2602.22495 2026-06-19 cs.LG cs.AI 版本更新

Reinforcement-aware Knowledge Distillation for LLM Reasoning

面向LLM推理的强化学习感知知识蒸馏

Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto

发表机构 * Meta Guo et al. Lin et al. Xu et al. Shao et al. Schulman et al. Xie et al.

AI总结 提出RL感知蒸馏(RLAD),通过信任区域比率蒸馏(TRRD)在强化学习后训练中实现选择性模仿,解决分布不匹配和目标干扰问题,在逻辑推理和数学基准上优于现有方法。

详情
AI中文摘要

强化学习(RL)后训练最近推动了长链思维推理大语言模型(LLM)的重大进展,但这类模型的高推理成本促使将其蒸馏到更小的学生模型中。大多数现有的知识蒸馏(KD)方法是为监督微调(SFT)设计的,依赖于固定的教师轨迹或基于教师-学生KL散度的正则化。当与RL结合时,这些方法常常遭受分布不匹配和目标干扰:教师监督可能与学生不断变化的rollout分布不一致,并且KL正则化项可能与奖励最大化竞争,需要仔细的损失平衡。为了解决这些问题,我们提出了RL感知蒸馏(RLAD),它在RL期间执行选择性模仿——仅在改进当前策略更新时引导学生向教师学习。我们的核心组件,信任区域比率蒸馏(TRRD),用基于PPO/GRPO风格似然比的目标替代教师-学生KL正则化项,该目标锚定到教师-旧策略混合,从而在学生rollout上产生优势感知、信任区域约束的蒸馏,并自然平衡探索、利用和模仿。在多种逻辑推理和数学基准上,RLAD始终优于离线蒸馏、标准GRPO和基于KL的在策略教师-学生知识蒸馏。

英文摘要

Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.

2606.15015 2026-06-19 cs.CV cs.AI 版本更新

NEXUS: Neural Energy Fields for Physically Consistent Contact-Rich 3D Object Dynamics

NEXUS: 用于物理一致的高接触3D物体动力学的神经能量场

Qizhen Ying, Guangming Wang, Yangchen Pan, Victor Adrian Prisacariu, Brian Sheil, Yixiong Jing

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 提出神经能量场框架NEXUS,通过标量能量和耗散项建模保守与非保守动力学,提升高接触3D场景下的长时程轨迹精度并指导视频生成。

Comments 18 pages, 4 figures, 6 tables. Preprint

详情
AI中文摘要

基于物理的视频生成需要可控的3D物体动力学,这些动力学在接触、变形和外力作用下保持物理一致性。现有的基于轨迹的方法通常建模孤立的物理效应,难以在高接触3D场景中组合保守和非保守动力学。我们提出NEXUS,一个用于高接触3D物体动力学的神经能量场框架。NEXUS将每个物体表示为结构图,并构建动态的物体-物体和物体-环境接触图。受哈密顿神经网络启发,NEXUS通过标量能量和耗散项而非直接预测状态或加速度来公式化运动。保守效应(包括重力和弹性变形)被组合为加性能量项,而非保守效应(如阻尼和冲击引起的能量损失)则通过学习的瑞利型耗散建模。力通过对能量和耗散函数求导得到,并通过多子步半隐式积分器进行演化。在受控轨迹基准测试中,NEXUS在不同力学属性和物理效应组合下,相较于代表性的学习和物理结构化动力学基线,提高了长时程精度。我们进一步展示NEXUS轨迹为高接触视频生成提供了有效指导,在保持竞争性视觉质量的同时提高了物理合理性。

英文摘要

Physics-grounded video generation requires controllable 3D object dynamics that remain physically consistent under contact, deformation, and external forcing. Existing trajectory-based methods often model isolated physical effects, making it difficult to compose conservative and non-conservative dynamics in contact-rich 3D scenes. We present NEXUS, a neural energy-field framework for contact-rich 3D object dynamics. NEXUS represents each object as a structural graph and constructs dynamic object-object and object-environment contact graphs. Inspired by Hamiltonian Neural Networks, NEXUS formulates motion through scalar energy and dissipation terms rather than directly predicting states or accelerations. Conservative effects, including gravity and elastic deformation, are composed as additive energy terms, while non-conservative effects such as damping and impact-induced energy loss are modeled with learned Rayleigh-style dissipation. Forces are derived by differentiating the energy and dissipation functions and rolled out with a multi-substep semi-implicit integrator. Across controlled trajectory benchmarks, NEXUS improves long-horizon accuracy over representative learned and physics-structured dynamics baselines under varying mechanical properties and physical-effect compositions. We further show that NEXUS trajectories provide effective guidance for contact-rich video generation, improving physical plausibility while maintaining competitive visual quality.

2606.18812 2026-06-19 cs.LG cs.AI 版本更新

Reinforcement Learning Foundation Models Should Already Be A Thing

强化学习基础模型本应已经存在

Abdelrahman Zighem, Jill-Jênn Vie

发表机构 * École normale supérieure de Paris, PSL University, Paris, France(巴黎高等师范学院,PSL大学,法国巴黎) Soda team, Inria Saclay, Palaiseau, France(Soda团队,法国国家信息与自动化研究所萨克雷中心,法国帕莱索)

AI总结 提出通过合成MDP构建强化学习基础模型,利用固定大小的充分统计量使注意力架构适用,在线和离线实验均优于传统算法。

详情
AI中文摘要

语言和视觉的基础模型由互联网规模的数据驱动,而结构化领域(表格预测、时间序列预测、图学习、强化学习)则不然。替代方案是合成数据,它将负担从收集转移到先验设计。这种先验已经存在于许多结构化任务中:TabPFN及其后续工作通过一个在合成贝叶斯先验上预训练的Transformer解决表格分类问题。我们提出两点。\textbf{首先},强化学习是明显的空白:采样一个合成MDP与采样一个合成表格数据集一样可行,然而没有上下文强化学习工作将先验设计作为主要目标。\textbf{其次},MDP允许一个固定大小的充分统计量,独立于观察到的回合且形状为表格形式,这使得它们直接适用于用于表格基础模型的基于注意力的架构,只需将策略头替换监督目标。这些共同定义了强化学习基础模型的议程。作为概念验证,我们完全在合成MDP上训练一个模型,并表明,无需任务特定的调优,它就能在上下文中解决留出的表格基准,包括在线和离线:在线时,使用比UCB-VI和表格Q-learning少得多的回合;离线时,与VI-LCB竞争。

英文摘要

Foundation models for language and vision are powered by internet-scale data, while structured domains such as tabular prediction are powered by synthetic data. This substitute shifts the challenge from collection to prior design. Such priors already exist for many structured tasks: TabPFN and its successors solve tabular classification with a transformer pretrained on a synthetic Bayesian prior. We make two points. \textbf{First}, reinforcement learning is the conspicuous gap: sampling a synthetic MDP is as feasible as sampling a synthetic tabular dataset, yet no in-context RL work treats prior design as a primary objective. \textbf{Second}, MDPs admit a fixed-size sufficient statistic, independent of the episodes observed and tabular in shape, which makes them directly amenable to the attention-based architectures used for tabular foundation models, with a policy head replacing the supervised target. Together these define the agenda for an RL foundation model. As a proof of concept, we train a Graph Attention Network entirely on synthetic MDPs and show that, with no task-specific tuning, it solves held-out tabular benchmarks in context, both online and offline: online, in far fewer episodes than UCB-VI and tabular Q-learning, and offline, competitively with VI-LCB.

6. 自然语言与多模态智能 7 篇

2606.11537 2026-06-19 cs.AI cs.CE 版本更新

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

MoCA-Agent: 一种用于金融和数值推理的声明市场代码智能体

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

发表机构 * University of Innsbruck(因斯布鲁克大学) University of British Columbia(不列颠哥伦比亚大学) Toronto Metropolitan University(多伦多都会大学)

AI总结 提出MoCA-Agent,通过声明级验证和代码生成解决金融表格问答中的数值推理错误,在十个基准上取得强性能。

详情
AI中文摘要

金融和表格问答不仅需要流畅的推理:答案必须基于支持它们的确切事实、公式、单位、符号和尺度。单个误读的单元格或错误操作可能会悄无声息地产生看似合理但错误的结果。我们引入了 \textsc{MOCA-Agent},一种声明市场代码智能体,它用声明级验证取代了自由形式的多智能体辩论。该系统将每个问题分解为类型化的原子声明,要求专业交易智能体买入或卖出这些声明,将其订单清算为置信度加权的接受/拒绝决策,并从市场支持的证据中合成可执行的Python程序。然后,一个代码感知验证器检查程序的执行、结构一致性和常见的金融推理错误,最多进行一次市场感知修复轮次。在涵盖金融数值推理、通用表格推理、ESG问答和多模态图表推理的十个公开基准上,\textsc{MOCA-Agent} 使用固定的 Qwen3.6-27B 骨干网络实现了强劲性能,包括在 FinQA 上达到 78.3%,在 FinanceMath 上达到 76.0%,在 MultiHiertt 上达到 71.2%,在 ESGenius 上达到 86.9%,以及在 FinChart-Bench 上平均达到 85.6%。这些结果表明,在原子声明级别聚合证据,而不是整个答案,提高了高风险数值推理的鲁棒性。\footnote{代码和数据可在以下网址获取:this https URL。}

英文摘要

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce \textsc{MOCA-Agent}, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, \textsc{MOCA-Agent} achieves strong performance using a fixed Qwen3.6-27B backbone, including $78.3\%$ on FinQA, $76.0\%$ on FinanceMath, $71.2\%$ on MultiHiertt, $86.9\%$ on ESGenius, and $85.6\%$ average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning.\footnote{The code and data are available: https://github.com/UBC-NLP/MoCA-Agent.

2504.11171 2026-06-19 cs.CV cs.AI 版本更新

TerraMind: Large-Scale Generative Multimodality for Earth Observation

TerraMind:面向地球观测的大规模生成式多模态模型

Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, Nicolas Longépé

发表机构 * IBM Research – Europe(IBM欧洲研究院) ETH Zurich(苏黎世联邦理工学院) Forschungszentrum Jülich(尤利希研究中心) European Space Agency(欧洲航天局) Φ \Phi -Lab(Φ实验室) NASA IMPACT University of Iceland(爱沙尼亚大学)

AI总结 提出首个任意到任意生成式多模态基础模型TerraMind,通过双尺度表示(token级和像素级)预训练,实现零样本/少样本应用,并引入“模态思考”能力,在PANGAEA等基准上达到领先性能。

Comments Accepted at ICCV'25

详情
AI中文摘要

我们提出了TerraMind,这是首个面向地球观测(EO)的任意到任意生成式多模态基础模型。与其他多模态模型不同,TerraMind在跨模态的双尺度表示(结合token级和像素级数据)上进行预训练。在token级别,TerraMind编码高层上下文信息以学习跨模态关系;在像素级别,TerraMind利用细粒度表示捕捉关键空间细节。我们在一个全球大规模数据集的九种地理空间模态上预训练了TerraMind。在本文中,我们证明:(i)TerraMind的双尺度早期融合方法为地球观测解锁了一系列零样本和少样本应用;(ii)TerraMind引入了“模态思考”(TiM)——在微调和推理过程中生成额外人工数据以改善模型输出的能力;(iii)TerraMind在PANGAEA等社区标准的地球观测基准上达到了超越现有最优的性能。预训练数据集、模型权重和我们的代码均在宽松许可下开源。

英文摘要

We present TerraMind, the first any-to-any generative, multimodal foundation model for Earth observation (EO). Unlike other multimodal models, TerraMind is pretrained on dual-scale representations combining both token-level and pixel-level data across modalities. On a token level, TerraMind encodes high-level contextual information to learn cross-modal relationships, while on a pixel level, TerraMind leverages fine-grained representations to capture critical spatial nuances. We pretrained TerraMind on nine geospatial modalities of a global, large-scale dataset. In this paper, we demonstrate that (i) TerraMind's dual-scale early fusion approach unlocks a range of zero-shot and few-shot applications for Earth observation, (ii) TerraMind introduces "Thinking-in-Modalities" (TiM) -- the capability of generating additional artificial data during finetuning and inference to improve the model output -- and (iii) TerraMind achieves beyond state-of-the-art performance in community-standard benchmarks for EO like PANGAEA. The pretraining dataset, the model weights, and our code are open-sourced under a permissive license.

2507.19137 2026-06-19 eess.AS cs.AI cs.SD 版本更新

Assessment of Personality Dimensions Across Situations in Dyadic Role-Play Scenarios

二元角色扮演场景中跨情境的人格维度评估

Alice Zhang, Skanda Muralidhar, Daniel Gatica-Perez, Mathew Magimai-Doss

发表机构 * Idiap Research Institute(日内瓦研究所) The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 研究通过对话语音分析,发现感知人格在不同工作情境下显著变化,并识别出与各人格特质相关的声学特征。

Comments Accepted to IEEE Transactions on Affective Computing

详情
AI中文摘要

先前研究表明,用户偏好与其人格相匹配的辅助技术。这引发了对自动人格感知(APP)的兴趣,旨在预测个体感知到的人格特质。以往的APP研究将人格视为静态特质,独立于情境。然而,心理学研究表明,感知人格会随情境和场景而变化。在本研究中,我们调查了参与两种工作情境(中性面试和压力客户互动)的参与者对话语音与感知人格之间的关系。我们的主要发现是:1)感知人格在不同互动中显著不同;2)响度、声压级和频谱通量特征在中性互动中指示感知的外向性、宜人性、尽责性和开放性,而在压力情境中,神经质与这些特征相关;3)手工声学特征和非语言特征在感知人格推断中优于说话人嵌入;4)压力互动更能预测神经质,这与现有心理学研究一致。

英文摘要

Prior research indicates that users prefer assistive technologies whose personalities align with their own. This has sparked interest in automatic personality perception (APP), which aims to predict an individual's perceived personality traits. Previous studies in APP have treated personalities as static traits, independent of context. However, perceived personalities can vary by context and situation as shown in psychological research. In this study, we investigate the relationship between conversational speech and perceived personality for participants engaged in two work situations (a neutral interview and a stressful client interaction). Our key findings are: 1) perceived personalities differ significantly across interactions, 2) loudness, sound level, and spectral flux features are indicative of perceived extraversion, agreeableness, conscientiousness, and openness in neutral interactions, while neuroticism correlates with these features in stressful contexts, 3) handcrafted acoustic features and non-verbal features outperform speaker embeddings in inference of perceived personality, and 4) stressful interactions are more predictive of neuroticism, aligning with existing psychological research.

2603.04219 2026-06-19 cs.SD cs.AI eess.AS 版本更新

ZeSTA: Zero-Shot TTS Augmentation with Domain-Conditioned Training for Data-Efficient Personalized Speech Synthesis

ZeSTA: 基于领域条件训练的零样本文本转语音增强用于数据高效的个性化语音合成

Youngwon Choi, Jinwoo Oh, Hwayeon Kim, Hyeonyu Kim

发表机构 * Maum AI Inc.(Maum AI公司) Humelo Inc.(Humelo公司)

AI总结 提出ZeSTA框架,通过轻量领域嵌入区分真实与合成语音,结合真实数据过采样,在极低资源下提升零样本文本转语音增强的说话人相似度,保持可懂度和感知质量。

Comments 6 pages, accepted to INTERSPEECH 2026

详情
AI中文摘要

我们研究了将零样本文本转语音(ZS-TTS)作为低资源个性化语音合成的数据增强源。虽然合成增强可以提供语言丰富且音素多样的语音,但将大量合成语音与有限的真实录音简单混合往往会导致微调过程中说话人相似度下降。为解决这一问题,我们提出了ZeSTA,一个简单的基于领域条件的训练框架,通过轻量领域嵌入区分真实和合成语音,并结合真实数据过采样以在极有限的目标数据下稳定适应,无需修改基础架构。在LibriTTS和一个内部数据集上使用两个ZS-TTS源的实验表明,我们的方法在保持可懂度和感知质量的同时,相比朴素合成增强提高了说话人相似度。音频样本可在我们的网页上获取。

英文摘要

We investigate the use of zero-shot text-to-speech (ZS-TTS) as a data augmentation source for low-resource personalized speech synthesis. While synthetic augmentation can provide linguistically rich and phonetically diverse speech, naively mixing large amounts of synthetic speech with limited real recordings often leads to speaker similarity degradation during fine-tuning. To address this issue, we propose ZeSTA, a simple domain-conditioned training framework that distinguishes real and synthetic speech via a lightweight domain embedding, combined with real-data oversampling to stabilize adaptation under extremely limited target data, without modifying the base architecture. Experiments on LibriTTS and an in-house dataset with two ZS-TTS sources demonstrate that our approach improves speaker similarity over naive synthetic augmentation while preserving intelligibility and perceptual quality. Audio samples are available on our web page.

2604.04917 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Vero: An Open RL Recipe for General Visual Reasoning

Vero: 通用视觉推理的开放RL配方

Gabriel Sarch, Linrong Cai, Qunzhong Wang, Haoyang Wu, Danqi Chen, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出Vero系列开放视觉语言模型,通过构建600K样本数据集Vero-600K和任务路由奖励,在30个基准测试中平均提升2.9-5.4点,Vero-Qwen3I-8B超越Qwen3-VL-8B-Thinking 3.8点。

Comments Project page: https://vero-reasoning.github.io/

详情
AI中文摘要

构建一个能在图表、科学、空间理解和开放式任务中工作的视觉推理器需要什么?最强的视觉语言模型(VLM)表明广泛的视觉推理是可以实现的,但其封闭的数据和强化学习(RL)流程使得其成果难以研究、复现或扩展。我们引入了Vero,一个完全开放的VLM系列,在各种视觉推理任务中匹配或超越现有的开放权重模型。我们跨六个广泛的任务类别扩展RL数据和奖励,构建了Vero-600K,一个来自59个数据集的600K样本数据集,并设计了处理异构答案的任务路由奖励。在我们的30个基准测试套件VeroEval中,Vero-600K在受控比较下优于现有的RL数据集。应用于五个起始模型,Vero变体在其初始模型上平均获得2.9-5.4分的提升。值得注意的是,基于Instruct模型训练的Vero-Qwen3I-8B,在没有额外蒸馏的情况下,平均超过Qwen3-VL-8B-Thinking 3.8分。系统的消融实验揭示,不同的任务类别引发不同的推理模式,而广泛的收益依赖于联合学习它们,而非孤立学习。所有数据、代码和模型均已公开。

英文摘要

What does it take to build a visual reasoner that works across charts, science, spatial understanding, and open-ended tasks? The strongest vision-language models (VLMs) suggest that broad visual reasoning is within reach, yet their closed data and reinforcement learning (RL) pipelines make their gains difficult to study, reproduce, or extend. We introduce Vero, a family of fully open VLMs that match or exceed existing open-weight models across diverse visual reasoning tasks. We scale RL data and rewards across six broad task categories, constructing Vero-600K, a 600K-sample dataset from 59 datasets, and designing task-routed rewards that handle heterogeneous answers. Across VeroEval, our 30-benchmark suite, Vero-600K outperforms existing RL datasets under controlled comparisons. Applied to five starting models, Vero variants gain 2.9-5.4 points on average over their initial models. Notably, Vero-Qwen3I-8B, trained on the Instruct model, surpasses Qwen3-VL-8B-Thinking by 3.8 points on average without additional distillation. Systematic ablations reveal that different task categories elicit distinct reasoning patterns and that broad gains depend on learning them jointly rather than in isolation. All data, code, and models are publicly available.

2605.31393 2026-06-19 cs.CL cs.AI 版本更新

Target-Side Paraphrase Augmentation for Sign Language Translation with Large Language Models

面向手语翻译的大语言模型目标端释义增强

Pedro Dal Bianco, Jean Paul Nunes Reinhold, Oscar Stanchi, Facundo Quiroga, Franco Ronchetti, Ulisses Brisolara Corrêa

发表机构 * III-LIDI Universidad Nacional de La Plata(III-LIDI国立拉普拉塔大学) CDTEC, Federal University of Pelotas(CDTEC,联邦 Pelotas 大学) CONICET III-LIDI Comision de Investigaciones Cientificas Universidad Nacional de La Plata(科学委员会国立拉普拉塔大学) Universidade Federal de Pelotas(联邦 Pelotas 大学)

AI总结 针对手语翻译中平行语料稀缺和目标词汇长尾分布的问题,提出利用GPT-4o生成参考句子的受控释义变体进行目标端增强,并在三种手语数据集上验证了方法的有效性。

Comments Accepted at GenSign @ CVPR 2026. Non-Proceedings Track (https://genai4sl.github.io/)

详情
AI中文摘要

手语翻译(SLT)仍然受到有限的配对手语视频/文本语料库和长尾目标词汇的限制。我们研究了目标端增强方法,其中GPT-4o生成参考句子的受控释义变体,而手语输入保持不变。采用基于Signformer姿态的Transformer,在两阶段调度下进行训练:先在增强语料库上预训练,然后在原始参考句子上微调。我们在三个具有互补挑战的数据集上进行了评估:PHOENIX14T(德国手语),具有适度的词汇多样性;GSL(希腊手语),具有高度受控、重复的录制;以及LSA-T(阿根廷手语),具有严重的长尾稀疏性。在PHOENIX14T上,增强将BLEU-4从9.56提高到10.33。接近饱和的GSL基线和极其稀疏的LSA-T设置揭示了该方法的局限性。据我们所知,这是第一项将LLM生成的目标端释义和LLM作为评估者应用于手语翻译的研究。语义评估揭示了词汇重叠指标低估的忠实度提升。

英文摘要

Sign language translation (SLT) remains constrained by the limited availability of paired sign-video/text corpora and by the heavy-tailed vocabularies typical of real-world datasets. We study a target-side augmentation strategy in which a large language model (LLM) generates controlled paraphrase variants of the reference spoken-language sentence while the sign input remains unchanged. Concretely, we use GPT-4o to produce semantically faithful variants of the training targets and train a Signformer-style pose-based Transformer under a two-stage schedule: pre-training on the augmented corpus followed by fine-tuning on the original references. We evaluate this strategy on three datasets that span complementary challenges: PHOENIX14T (German Sign Language), a real-world corpus with moderate lexical diversity; the Greek Sign Language Dataset with highly controlled, repetitive recordings; and LSA-T (Argentinian Sign Language), a naturalistic corpus with a large vocabulary and severe long-tail sparsity. This range allows us to characterize precisely when and why target-side augmentation is beneficial. On PHOENIX14T, augmentation improves BLEU-4 from 9.56 to 10.33, demonstrating that paraphrastic exposure helps the decoder generalize beyond memorized reference phrasing. The near-saturated GSL baseline and the extremely sparse LSA-T setting reveal the limits of the approach: in both cases, single-reference lexical overlap metrics are insufficient to capture the full picture, motivating a complementary semantic evaluation. To our knowledge, this is the first study to examine LLM-generated target-side paraphrases as an augmentation mechanism for SLT, and the first to apply an LLM-as-a-Judge evaluation protocol to SLT. This complementary evaluation reveals gains in semantic fidelity that lexical overlap metrics understate.

2606.05833 2026-06-19 cs.CV cs.AI 版本更新

Learning Geometric Representations from Videos for Spatial Intelligent Multimodal Large Language Models

从视频中学习几何表示以实现空间智能多模态大语言模型

Haibo Wang, Lifu Huang

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出GeoVR框架,通过从2D视频序列中蒸馏3D几何知识(包括相机姿态、深度图、尺度因子和多尺度3D特征),重塑多模态大语言模型的内部表示以赋予其空间智能,在空间推理基准上达到最先进性能。

详情
AI中文摘要

多模态大语言模型(MLLMs)在2D语义理解方面表现出色,但缺乏内在的3D感知能力,导致其表示无法在视频帧间保持几何和空间一致性。鉴于大规模3D数据的稀缺性,我们提出了GeoVR,一种新颖的框架,仅使用2D视频序列学习几何表示。该方法有效地重构了MLLMs内部的语义潜在空间,以解锁空间智能。GeoVR并非采用浅层的特征混合,而是通过从预训练的3D基础模型中蒸馏几何知识来重塑MLLM的内部表示。这是通过一种多目标学习策略实现的,该策略由四个互补的几何目标驱动:(1)估计帧间相机姿态以嵌入变化的视角动态,(2)回归密集深度图以锚定物理距离,(3)预测度量尺度因子以进行真实世界校准,以及(4)蒸馏多尺度3D特征以对齐中间特征空间。在这些显式的物理和几何约束的引导下,模型的内部表示自然地发展出强大的3D感知能力。在空间推理基准上的大量实验表明,GeoVR实现了最先进的性能,为赋予基础模型空间智能建立了一种新范式。

英文摘要

Multimodal Large Language Models (MLLMs) excel at 2D semantic understanding but lack intrinsic 3D awareness, resulting in representations that fail to maintain geometric and spatial consistency across video frames. Given the scarcity of large-scale 3D data, we present GeoVR, a novel framework that learns geometric representations using purely 2D video sequences. This approach effectively restructures the semantic latent space within MLLMs to unlock spatial intelligence. Rather than employing superficial feature mixing, GeoVR reshapes the internal representations of the MLLM by distilling geometry knowledge from pre-trained 3D foundation models. This is accomplished through a multi-objective learning strategy driven by four complementary geometric targets: (1) estimating inter-frame camera poses to embed varying viewpoint dynamics, (2) regressing dense depth maps to anchor physical distances, (3) predicting a metric scale factor for real-world calibration, and (4) distilling multi-scale 3D features to align the intermediate feature space. Guided by these explicit physical and geometric constraints, the model's internal representations naturally develop strong 3D awareness. Extensive experiments on spatial reasoning benchmarks demonstrate that GeoVR achieves state-of-the-art performance, establishing a new paradigm for endowing foundation models with spatial intelligence.

7. 机器人与具身智能 7 篇

2509.19658 2026-06-19 cs.RO cs.AI 版本更新

RoboSSM: Scalable In-context Imitation Learning via State-Space Models

RoboSSM: 基于状态空间模型的可扩展上下文模仿学习

Youngju Yoo, Jiaheng Hu, Yifeng Zhu, Bo Liu, Qiang Liu, Roberto Martín-Martín, Peter Stone

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) KAIST(韩国科学技术院) FAIR at Meta(元宇宙FAIR) Amazon(亚马逊) Sony AI(索尼人工智能)

AI总结 提出RoboSSM,用状态空间模型替代Transformer实现上下文模仿学习,在LIBERO基准上对未见和长时任务泛化更优,首次证明SSM是ICIL高效可扩展的骨干网络。

Comments IROS 2026

详情
AI中文摘要

上下文模仿学习(ICIL)使机器人能够从仅包含少量演示的提示中学习任务。通过消除部署时参数更新的需求,该范式支持对新任务的少样本适应。然而,最近的ICIL方法依赖于Transformer,其计算能力有限,并且在处理比训练时更长的提示时往往表现不佳。在这项工作中,我们引入了RoboSSM,一种基于状态空间模型(SSM)的可扩展上下文模仿学习方案。具体来说,RoboSSM用Longhorn(一种最先进的SSM)替代Transformer,该模型提供线性时间推理和强大的外推能力,非常适合长上下文提示。通过在LIBERO基准上的多样化实验,我们证明了将SSM应用于ICIL的有效性,通过处理测试时更长的上下文,实现了比基于Transformer的ICIL方法对未见和长时任务更好的泛化。这些结果首次表明,SSM是ICIL高效且可扩展的骨干网络。我们的代码可在此网址获取。

英文摘要

In-context imitation learning (ICIL) enables robots to learn tasks from prompts consisting of just a handful of demonstrations. By eliminating the need for parameter updates at deployment time, this paradigm supports few-shot adaptation to novel tasks. However, recent ICIL methods rely on Transformers, which have computational limitations and tend to underperform when handling longer prompts than those seen during training. In this work, we introduce RoboSSM, a scalable recipe for in-context imitation learning based on state-space models (SSM). Specifically, RoboSSM replaces Transformers with Longhorn -- a state-of-the-art SSM that provides linear-time inference and strong extrapolation capabilities, making it well-suited for long-context prompts. Through diverse experiments on the LIBERO benchmark, we demonstrate the effectiveness of applying SSMs to ICIL, achieving improved generalization to both unseen and long-horizon tasks than Transformer-based ICIL methods by handling longer contexts at test-time. These results show for the first time that SSMs are an efficient and scalable backbone for ICIL. Our code is available at https://github.com/youngjuY/RoboSSM.

2512.20014 2026-06-19 cs.RO cs.AI 版本更新

Bring My Cup! Personalizing Vision-Language-Action Models with Visual Attentive Prompting

Bring My Cup! 使用视觉注意力提示个性化视觉-语言-动作模型

Sangoh Lee, Sangwoo Mo, Wook-Shin Han

发表机构 * GSAI, POSTECH(POSTECH 人工智能研究所) IME, POSTECH(POSTECH 信息媒体研究所)

AI总结 针对VLA模型难以处理个性化指令的问题,提出无需训练的视觉注意力提示(VAP)方法,通过参考图像作为非参数记忆,利用开放词汇检测和嵌入匹配定位个人物品,并以视觉提示注入模型,在多个仿真和真实场景中显著提升成功率和正确物体操作。

Comments ICML 2026. Project page: https://vap-project.github.io/

详情
AI中文摘要

尽管视觉-语言-动作(VLA)模型能够很好地泛化到通用指令,但在处理个性化命令(如“bring my cup”)时却存在困难,因为机器人必须在视觉相似的物体中识别并操作特定实例。我们研究了这种操作个人物品的场景,其中VLA必须仅使用少量参考图像来识别并控制训练中未见过的用户特定物体。为了解决这一挑战,我们提出了视觉注意力提示(VAP),一种简单而有效的无需训练的感知适配器,为冻结的VLA模型赋予自上而下的选择性注意力。VAP将参考图像视为非参数视觉记忆,通过开放词汇检测和基于嵌入的匹配将个人物品定位到场景中,然后通过突出显示该物体并重写指令,将这种定位作为视觉提示注入模型。我们构建了两个仿真基准(Personalized-SIMPLER和Personalized-VLABench)以及一个真实桌面基准,用于评估多个机器人和任务上的个性化操作。实验表明,VAP在成功率和正确物体操作方面始终优于通用策略和令牌学习基线,有助于弥合语义理解与实例级控制之间的差距。

英文摘要

While Vision-Language-Action (VLA) models generalize well to generic instructions, they struggle with personalized commands such as "bring my cup," where the robot must act on one specific instance among visually similar objects. We study this setting of manipulating personal objects, in which a VLA must identify and control a user-specific object unseen during training using only a few reference images. To address this challenge, we propose Visual Attentive Prompting (VAP), a simple-yet-effective training-free perceptual adapter that equips frozen VLAs with top-down selective attention. VAP treats the reference images as a non-parametric visual memory, grounds the personal object in the scene through open-vocabulary detection and embedding-based matching, and then injects this grounding as a visual prompt by highlighting the object and rewriting the instruction. We construct two simulation benchmarks, Personalized-SIMPLER and Personalized-VLABench, and a real-world tabletop benchmark to evaluate personalized manipulation across multiple robots and tasks. Experiments show that VAP consistently outperforms generic policies and token-learning baselines in both success rate and correct-object manipulation, helping to bridge the gap between semantic understanding and instance-level control.

2601.02379 2026-06-19 cs.RO cs.AI 版本更新

Movement Primitives in Robotics: A Comprehensive Survey

机器人运动基元:综合综述

Nolan B. Gutierrez, Joseph M. Cloud, William J. Beksi

发表机构 * Department of Computer Science and Engineering, The University of Texas at Arlington, Arlington, USA(计算机科学与工程系,德克萨斯理工大学阿灵顿分校,阿灵顿,美国)

AI总结 综述机器人运动基元框架,涵盖从人类示教中编码轨迹的方法,分析弹簧-阻尼系统、概率耦合、神经网络等特性,并讨论应用与挑战。

Comments 105 pages, 3 figures, and 6 tables

详情
AI中文摘要

生物系统表现出连续的运动流,由顺序片段组成,使它们能够以创造性和多功能的方式执行复杂任务。这一观察促使研究人员识别出被称为运动基元的运动基本构建块,这些基元非常适合在自主系统(如机器人)中生成运动指令。在本综述中,我们按时间顺序提供了运动基元方法和应用的百科全书式概述。具体来说,我们将运动基元框架呈现为一种表示通过人类示教获得的机器人控制轨迹的方式。在机器人领域,运动基元可以在轨迹级别编码基本运动,例如机器人如何抓取杯子或抛球所需的运动序列。此外,运动基元已开发出具有弹簧-阻尼系统的理想分析特性、多个示教的概率耦合、在高维系统中使用神经网络等特性,以应对机器人领域的困难挑战。尽管运动基元广泛应用于各个领域,本综述的目标是告知从业者如何在机器人背景下使用这些框架。具体而言,我们旨在(i)系统回顾主要运动基元框架并检查其优缺点;(ii)突出已成功使用运动基元的应用;(iii)检查开放问题并讨论在机器人中应用运动基元时的实际挑战。

英文摘要

Biological systems exhibit a continuous stream of movements, consisting of sequential segments, that allow them to perform complex tasks in a creative and versatile fashion. This observation has led researchers towards identifying elementary building blocks of motion known as movement primitives, which are well-suited for generating motor commands in autonomous systems, such as robots. In this survey, we provide an encyclopedic overview of movement primitive approaches and applications in chronological order. Concretely, we present movement primitive frameworks as a way of representing robotic control trajectories acquired through human demonstrations. Within the area of robotics, movement primitives can encode basic motions at the trajectory level, such as how a robot would grasp a cup or the sequence of motions necessary to toss a ball. Furthermore, movement primitives have been developed with the desirable analytical properties of a spring-damper system, probabilistic coupling of multiple demonstrations, using neural networks in high-dimensional systems, and more, to address difficult challenges in robotics. Although movement primitives have widespread application to a variety of fields, the goal of this survey is to inform practitioners on the use of these frameworks in the context of robotics. Specifically, we aim to (i) present a systematic review of major movement primitive frameworks and examine their strengths and weaknesses; (ii) highlight applications that have successfully made use of movement primitives; and (iii) examine open questions and discuss practical challenges when applying movement primitives in robotics.

2601.03040 2026-06-19 cs.RO cs.AI cs.LG 版本更新

PiDR: Physics-Informed Inertial Dead Reckoning for Autonomous Platforms

PiDR:面向自主平台的物理信息惯性航位推算

Arup Kumar Sahoo, Itzik Klein

发表机构 * Autonomous Navigation and Sensor Fusion Lab (ANSFL)(自主导航与传感器融合实验室(ANSFL)) Hatter Department of Marine Technologies(海洋技术系) Charney School of Marine Sciences(海洋科学学院) University of Haifa(海法大学)

AI总结 提出PiDR框架,将惯性导航原理作为物理信息残差融入网络训练,在纯惯性导航中减少轨迹漂移,在移动机器人和水下自主航行器数据集上定位精度提升超29%。

Comments 11 pages and 7 figures

详情
AI中文摘要

完全自主的一个基本要求是在缺乏外部数据(如GNSS信号或视觉信息)的情况下维持精确导航的能力。在这些具有挑战性的环境中,平台必须完全依赖惯性传感器,导致纯惯性导航。然而,在现实场景中,惯性传感器的固有噪声和其他误差项会导致导航解随时间漂移。尽管传统的深度学习模型已成为惯性导航的一种可能方法,但它们本质上是黑箱的。此外,它们在有限的监督传感器数据下难以有效学习,并且常常无法保持物理原理。为了解决这些局限性,我们提出了PiDR,一种用于纯惯性导航情况下自主平台的物理信息惯性航位推算框架。PiDR通过物理信息残差组件将惯性导航原理明确地整合到网络训练过程中,从而提供了透明性。即使在有限或稀疏监督下,PiDR在减轻轨迹突然偏差方面也起着关键作用。我们在移动机器人和自主水下航行器收集的真实世界数据集上评估了PiDR。在两个数据集中,我们获得了超过29%的定位改进,证明了PiDR在不同环境和动力学下运行的不同平台上的泛化能力。因此,PiDR提供了一种鲁棒、轻量级且有效的架构,可以部署在资源受限的平台上,在不利场景中实现实时纯惯性导航。

英文摘要

A fundamental requirement for full autonomy is the ability to sustain accurate navigation in the absence of external data, such as GNSS signals or visual information. In these challenging environments, the platform must rely exclusively on inertial sensors, leading to pure inertial navigation. However, the inherent noise and other error terms of the inertial sensors in such real-world scenarios will cause the navigation solution to drift over time. Although conventional deep-learning models have emerged as a possible approach to inertial navigation, they are inherently black-box in nature. Furthermore, they struggle to learn effectively with limited supervised sensor data and often fail to preserve physical principles. To address these limitations, we propose PiDR, a physics-informed inertial dead-reckoning framework for autonomous platforms in situations of pure inertial navigation. PiDR offers transparency by explicitly integrating inertial navigation principles into the network training process through the physics-informed residual component. PiDR plays a crucial role in mitigating abrupt trajectory deviations even under limited or sparse supervision. We evaluated PiDR on real-world datasets collected by a mobile robot and an autonomous underwater vehicle. We obtained more than 29% positioning improvement in both datasets, demonstrating the ability of PiDR to generalize different platforms operating in various environments and dynamics. Thus, PiDR offers a robust, lightweight, yet effective architecture and can be deployed on resource-constrained platforms, enabling real-time pure inertial navigation in adverse scenarios.

2602.23172 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Latent Gaussian Splatting for 4D Panoptic Occupancy Tracking

潜在高斯泼溅用于4D全景占据跟踪

Maximilian Luz, Rohit Mohan, Thomas Nürnberg, Yakov Miron, Daniele Cattaneo, Abhinav Valada

发表机构 * University of Freiburg(弗赖堡大学) Bosch Research(博世研究院) University of Haifa(海法大学)

AI总结 提出潜在高斯泼溅(LaGS)方法,通过特征高斯体作为动态关键点实现多视图特征聚合,用于4D全景占据跟踪,在Occ3D nuScenes和Waymo上达到最优性能。

Comments Accepted to IEEE Robotics and Automation Letters (RA-L), 2026

详情
AI中文摘要

捕捉4D时空场景结构对于机器人在动态环境中安全可靠运行至关重要。然而,现有方法通常只解决部分问题:它们要么通过边界框提供粗略的几何跟踪,要么提供缺乏显式时间关联和实例级推理的详细3D占据估计。在这项工作中,我们提出了潜在高斯泼溅(LaGS)用于4D全景占据跟踪(4D-POT)。我们重新审视底层表示,将3D特征建模为一组稀疏的带特征高斯体。这些高斯体作为动态的、面向体积的关键点,在泼溅到体素网格进行解码之前,能够实现多视图特征的空间连续、距离加权聚合。这种以点为中心的公式实现了灵活、数据相关的感受野和长程空间交互,这是局部密集体素算子难以捕捉的。分层高斯表示通过结合来自粗超点的全局上下文和来自高分辨率流的细粒度细节,进一步实现了多尺度推理。在Occ3D nuScenes和Waymo上的大量实验证明了4D-POT的最先进性能。我们在以下网址提供代码和模型:this https URL。

英文摘要

Capturing 4D spatiotemporal scene structure is crucial for the safe and reliable operation of robots in dynamic environments. However, existing approaches typically address only part of the problem: they either provide coarse geometric tracking via bounding boxes or detailed 3D occupancy estimates that lack explicit temporal association and instance-level reasoning. In this work, we present Latent Gaussian Splatting (LaGS) for 4D Panoptic Occupancy Tracking (4D-POT). We revisit the underlying representation and model 3D features as a sparse set of feature-bearing Gaussians. These act as dynamic, volume-oriented keypoints that enable spatially continuous, distance-weighted aggregation of multi-view features before being splatted into a voxel grid for decoding. This point-centric formulation enables flexible, data-dependent receptive fields and long-range spatial interactions that are difficult to capture with local and dense voxel-based operators. A hierarchical Gaussian representation further enables multi-scale reasoning by combining global context from coarse super-points with fine-grained detail from higher-resolution streams. Extensive experiments on Occ3D nuScenes and Waymo demonstrate state-of-the-art performance for 4D-POT. We provide code and models at https://lags.cs.uni-freiburg.de/.

2603.09420 2026-06-19 cs.CV cs.AI cs.RO 版本更新

Class-Incremental Motion Forecasting

类别增量运动预测

Nicolas Schischka, Nikhil Gosala, B Ravi Kiran, Senthil Yogamani, Abhinav Valada

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系) Qualcomm SARL France(法国.qualcomm SARL) Automated Driving, Qualcomm Technologies, Inc.(qualcomm Technologies, Inc. 自动驾驶部门)

AI总结 提出类别增量运动预测新任务,通过端到端框架结合伪标签与开放词汇分割,利用3D-2D投票机制和查询特征方差重放策略,缓解灾难性遗忘并适应新类别。

Comments V3: Change title. Add further experiments

详情
AI中文摘要

运动预测使自动驾驶车辆能够通过预测动态智能体的未来轨迹来预判场景演化。然而,现有方法通常假设一个封闭世界设定,具有固定的对象分类法并依赖高质量感知,限制了其在现实世界中的应用,因为现实世界中感知不完美,且新对象类别可能随时间出现。在这项工作中,我们引入了类别增量运动预测,这是一个新颖的设定,其中新对象类别随时间顺序引入,并且直接从相机图像预测未来对象轨迹。我们提出了首个针对该设定的端到端框架,该框架适应新引入的类别,同时减轻对先前学习类别的灾难性遗忘。我们的方法为已知类别生成运动预测伪标签,并将其与开放词汇分割模型的2D实例掩码进行匹配。这种3D到2D关键点投票机制过滤不一致和过度自信的预测,而基于查询特征方差的重放策略采样信息丰富的过去序列以保留先验知识。在nuScenes和Argoverse 2上的广泛评估表明,我们的方法成功地在已知类别上保持性能,同时有效适应新类别。我们进一步展示了向真实世界驾驶的零样本迁移,并表明该框架自然地扩展到nuScenes和NeuroNCAP上的开环和闭环端到端类别增量规划。代码和模型将在该https URL上公开。

英文摘要

Motion forecasting enables autonomous vehicles to anticipate scene evolution by predicting the future trajectories of dynamic agents. However, existing approaches typically assume a closed-world setting with a fixed object taxonomy and access to high-quality perception, limiting their applicability in the real world where perception is imperfect, and new object classes may emerge over time. In this work, we introduce class-incremental motion forecasting, a novel setting in which new object classes are sequentially introduced over time and future object trajectories are predicted directly from camera images. We propose the first end-to-end framework for this setting, which adapts to newly introduced classes while mitigating catastrophic forgetting of previously learned ones. Our method generates motion forecasting pseudo-labels for known classes and matches them with 2D instance masks from an open-vocabulary segmentation model. This 3D-to-2D keypoint voting mechanism filters inconsistent and overconfident predictions, while a query feature variance-based replay strategy samples informative past sequences to preserve prior knowledge. Extensive evaluations on nuScenes and Argoverse 2 show that our approach successfully preserves performance on known classes while effectively adapting to novel ones. We further demonstrate zero-shot transfer to real-world driving and show that the framework extends naturally to open- and closed-loop end-to-end class-incremental planning on nuScenes and NeuroNCAP. Code and models will be made publicly available at https://omen.cs.uni-freiburg.de.

2605.23733 2026-06-19 cs.RO cs.AI 版本更新

Any2Any: Efficient Cross-Embodiment Transfer for Humanoid Whole-Body Tracking

Any2Any: 高效跨本体迁移用于人形机器人全身跟踪

Ming Yang, Tao Yu, Feng Li, Hua Chen

发表机构 * LimX Dynamics(LimX动力学)

AI总结 提出Any2Any范式,通过运动学对齐和动力学微调,实现预训练全身跟踪模型高效迁移至新的人形机器人本体,仅需少量数据和计算即可达到竞争性跟踪性能。

Comments Project Page: https://any2any.top/

详情
AI中文摘要

全身跟踪(WBT)模型已成为人形机器人的关键基础,使其能够高保真地模仿各种运动。从头训练此类模型需要大规模数据和计算,使得在新人形平台上快速部署成本高昂。这自然引发一个问题:预训练的WBT模型能否通过最小化适应跨本体迁移?为回答这个问题,我们提出Any2Any,一种范式,能够高效地将现有WBT专家迁移到新人形本体,仅需少量数据和计算。Any2Any首先在源和目标人形之间进行运动学对齐,对齐其输入和输出空间,使得预训练的源策略可以在目标本体上有意义地重用。然后,Any2Any通过向选定的动力学敏感模块应用轻量级参数高效微调(PEFT)组件进行动力学适应,保留有用的行为先验,同时实现对目标机器人的定向适应。在多个人形平台和预训练骨干上的大量实验表明,与从头训练相比,Any2Any显著加速收敛并降低训练成本,同时实现具有竞争力或更优的跟踪性能。值得注意的是,仅使用完整训练所需计算和数据的1%,Any2Any成功将在Unitree G1上预训练的Sonic模型迁移到LimX Oli和LimX Luna。这些结果表明,预训练的WBT专家可以跨本体高效重用,为在新机器人上部署人形全身控制提供可扩展的路径。

英文摘要

Whole-body tracking (WBT) models have become a key foundation for humanoid robots, enabling them to imitate diverse motions with high fidelity. Training such models from scratch requires large-scale data and computation, making rapid deployment on new humanoid platforms costly. This raises a natural question: Can pretrained WBT models transfer across embodiments with minimal adaptation? To answer this question, we propose Any2Any, a paradigm that efficiently transfers an existing WBT specialist to a new humanoid embodiment with only a small amount of data and compute. Any2Any first performs kinematic alignment between source and target humanoids, aligning their input and output spaces so that the pretrained source policy can be meaningfully reused on the target embodiment.Any2Any then performs dynamics adaptation by applying lightweight parameter-efficient fine-tuning (PEFT) components to selected dynamics-sensitive modules, preserving useful behavioral priors while enabling targeted adaptation to the target robot. Extensive experiments on multiple humanoid platforms and pretrained backbones show that Any2Any substantially accelerates convergence and reduces training cost compared with training from scratch, while achieving competitive or superior tracking performance. Notably, using only 1% of the compute and data required for full training, Any2Any successfully transfers Sonic models pre-trained on Unitree G1 to LimX Oli and LimX Luna. These results suggest that pretrained WBT specialists can be efficiently reused across embodiments, providing a scalable path toward deploying humanoid whole-body control on new robots. More results and videos are available on our project page: https://any2any.top/.

8. 可信、安全与AI治理 12 篇

2602.01425 2026-06-19 cs.AI cs.LG 版本更新

One Probe Won't Catch Them All: Towards Targeted Deception Detection

一个探针无法捕捉所有:迈向有针对性的欺骗检测

Vikram Natarajan, Devina Jain, Shivam Arora, Satvik Golechha, Joseph Bloom

发表机构 * LASR Labs(LASR实验室) UK AI Security Institute(英国人工智能安全研究所)

AI总结 针对线性探针在欺骗检测中的异质性,提出根据具体欺骗类型匹配探针可显著提升性能(AUC提升0.108),建议组织定义威胁模型并部署相应探针。

详情
AI中文摘要

线性探针是一种有前景的监测AI系统欺骗行为的方法。先前工作表明,在对比指令对和简单数据集上训练的线性分类器可以达到良好性能。然而,这些探针即使在简单场景中也表现出显著失败,包括虚假相关性和对非欺骗响应的误报。在本文中,我们证明欺骗检测本质上是异质的:虽然单个通用探针实现了适度的改进(+0.032 AUC),但事后最优分析显示,当探针与特定欺骗类型匹配时,潜力显著更高(+0.108 AUC),并且合成验证实验表明,当欺骗类型事先已知时,这一上限是先验可实现的。我们的发现表明,指令对捕捉的是欺骗意图而非内容特定模式,这解释了为什么提示选择主导探针性能(占70.6%的方差)。鉴于这种异质性,我们得出结论,组织应定义其特定威胁模型并部署适当匹配的探针,而不是寻求通用的欺骗检测器。

英文摘要

Linear probes are a promising approach for monitoring AI systems for deceptive behaviour. Previous work has shown that a linear classifier trained on a contrastive instruction pair and a simple dataset can achieve good performance. However, these probes exhibit notable failures even in straightforward scenarios, including spurious correlations and false positives on non-deceptive responses. In this paper, we demonstrate that deception detection is inherently heterogeneous: while a single universal probe achieves modest improvements (+0.032 AUC), post-hoc oracle analysis reveals substantially higher potential (+0.108 AUC) when probes are matched to specific deception types, and synthetic validation experiments suggest this ceiling is achievable a priori when the deception type is known in advance. Our findings reveal that instruction pairs capture deceptive intent rather than content-specific patterns, explaining why prompt choice dominates probe performance (70.6% of variance). Given this heterogeneity, we conclude that organizations should define their specific threat models and deploy appropriately matched probes rather than seeking a universal deception detector.

2602.23248 2026-06-19 cs.AI 版本更新

Mitigating Legibility Tax with Decoupled Prover-Verifier Games

通过解耦证明者-验证者游戏减轻可读性代价

Yegon Kim, Juho Lee

发表机构 * KAIST(韩国科学技术院)

AI总结 提出解耦证明者-验证者游戏(DPVG),通过分离正确性与可检查性训练一个翻译器模型,将固定求解器的解转化为可检查形式,在保持答案正确性的同时提高可检查性,解决了可读性代价问题。

Comments ICLR 2026 Workshop Trustworthy AI

详情
AI中文摘要

随着大型语言模型能力日益增强,其输出能被能力较弱的系统轻松检查变得至关重要。证明者-验证者游戏可用于提高模型输出的可检查性,但与仅训练以最大化正确性的基线相比,其准确性有所下降——这种现象被称为可读性代价。我们提出一种解决方案,通过将正确性与可检查性条件解耦,转而训练一个“翻译器”模型,将固定求解器模型的解转化为可检查形式。这使我们能够首先训练求解器以最大化正确性,然后训练翻译器将求解器的解翻译成可检查形式,同时保留求解器的答案。为了适应这一新的翻译目标,我们制定了一个解耦的证明者-验证者游戏(DPVG),其均衡对应于忠实且可检查的翻译器。

英文摘要

As large language models become increasingly capable, it is critical that their outputs can be easily checked by less capable systems. Prover-verifier games can be used to improve checkability of model outputs, but display a degradation in accuracy compared to a baseline trained only to maximize correctness -- a phenonemon named legibility tax. We propose a solution by decoupling the correctness from the checkability condition and instead training a "translator" model that turns a fixed solver model's solution into a checkable form. This allows us to first train the solver to maximize correctness, and then train the translator to translate the solver into a checkable form while retaining the solver's answer. To accommodate this new objective of translation, we formulate a decoupled prover-verifier game (DPVG) where the equilibria correspond to faithful and checkable translators.

2505.22829 2026-06-19 cs.LG cs.AI 版本更新

Bridging Distribution Shift and AI Safety: Conceptual and Methodological Synergies

弥合分布偏移与AI安全:概念与方法论的协同

Chenruo Liu, Kenan Tang, Yao Qin, Qi Lei

发表机构 * Center for Data Science, New York University New York New York USA Computer Science Department, University of California, Santa Barbara Santa Barbara California USA Department of Electrical Computer Engineering, University of California, Santa Barbara Santa Barbara California USA Courant Institute for Mathematical Sciences \& Center for Data Science, New York University New York New York USA Center for Data Science, New York University Computer Science Department, University of California, Santa Barbara Computer Engineering, University of California, Santa Barbara Courant Institute for Mathematical Sciences \& Center for Data Science, New York University

AI总结 本文通过分析分布偏移与AI安全之间的概念和方法论协同,建立了特定偏移类型与细粒度安全问题之间的两种联系,促进了两领域研究的深度融合。

Comments 35 pages

详情
AI中文摘要

本文通过全面分析分布偏移与AI安全之间的概念和方法论协同,弥合了这两者之间的鸿沟。虽然先前的讨论通常关注狭隘的案例或非正式的类比,但我们建立了特定分布偏移原因与细粒度AI安全问题之间的两种联系:(1) 解决特定偏移类型的方法可以帮助实现相应的安全目标,或 (2) 某些偏移和安全问题可以形式化地相互归约,从而使它们的方法能够相互适应。我们的发现提供了一个统一的视角,鼓励分布偏移与AI安全研究之间更深入的整合。

英文摘要

This paper bridges distribution shift and AI safety through a comprehensive analysis of their conceptual and methodological synergies. While prior discussions often focus on narrow cases or informal analogies, we establish two types connections between specific causes of distribution shift and fine-grained AI safety issues: (1) methods addressing a specific shift type can help achieve corresponding safety goals, or (2) certain shifts and safety issues can be formally reduced to each other, enabling mutual adaptation of their methods. Our findings provide a unified perspective that encourages deeper integration between distribution shift and AI safety research.

2509.03122 2026-06-19 cs.CL cs.AI cs.LG 版本更新

From Construction to Injection: Edit-Based Fingerprints for Large Language Models

从构建到注入:面向大型语言模型的基于编辑的指纹

Yue Li, Xin Yi, Dongsheng Shi, Yongyi Cui, Gerard de Melo, Linlin Wang

发表机构 * East China Normal University(华东师范大学) Hasso Plattner Institute/University of Potsdam(哈索罗普拉特纳研究所/波茨坦大学)

AI总结 提出端到端注入指纹框架,通过代码混合指纹和多候选编辑方法,解决黑盒部署中指纹的不可感知性和鲁棒性挑战。

Comments preprint

详情
AI中文摘要

可靠的模型指纹对于保护大型语言模型(LLMs)免受未经授权的重新分发和商业滥用至关重要。在黑盒部署中,验证受到对可疑指纹查询的防御性过滤以及可能削弱嵌入所有权证据的下游模型修改的阻碍。这些风险要求指纹在构建和注入方面都具有鲁棒性。在构建方面,先前的范式面临不可感知性的权衡:自然语言指纹可能被意外激活,而乱码指纹在统计上暴露且更容易被过滤。在注入方面,现有方法难以在模型修改下保持持久的触发-目标行为。我们提出了一个端到端的注入指纹框架来解决这些挑战。代码混合指纹(CF)在高复杂度约束下使用最低困惑度的代码混合来缓解这种双向不可感知性权衡。多候选编辑(MCEdit)构建结构冗余、间隔分离的触发-目标映射,以在模型修改下实现优雅降级。在不可感知性、可检测性和无害性方面的广泛评估表明,该框架在几乎不影响实用性的情况下实现了鲁棒的所有权验证。

英文摘要

Reliable model fingerprints are essential for protecting large language models (LLMs) against unauthorized redistribution and commercial misuse. In black-box deployment, verification is hindered by defensive filtering of suspected fingerprint queries, as well as by downstream model modifications that may weaken embedded ownership evidence. These risks require fingerprints to be robust in both construction and injection. For construction, prior paradigms face an imperceptibility trade-off: natural-language fingerprints may be accidentally activated, whereas garbled fingerprints are statistically exposed and easier to filter. For injection, existing methods struggle to preserve persistent trigger--target behaviors under model modification. We propose an end-to-end injected fingerprinting framework to address these challenges. Code-mixing Fingerprints (CF) use lowest-perplexity code-mixing under a high-complexity constraint to mitigate this two-sided imperceptibility trade-off. Multi-Candidate Editing (MCEdit) constructs structurally redundant, margin-separated trigger--target mappings to enable graceful degradation under model modification. Extensive evaluations on imperceptibility, detectability, and harmlessness demonstrate robust ownership verification with negligible impact on utility.

2511.04260 2026-06-19 cs.CV cs.AI 版本更新

Proto-LeakNet: Towards Signal-Leak Aware Attribution in Synthetic Human Face Imagery

Proto-LeakNet:面向合成人脸图像中信号泄漏感知的归因方法

Claudio Giusti, Luca Guarnera, Sebastiano Battiato

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Catania(卡塔尼亚大学)

AI总结 提出Proto-LeakNet,利用扩散模型中的信号泄漏痕迹,结合闭集分类与密度开集评估,实现可解释的生成器归因,在闭集上训练后对未见生成器也有效。

Comments 44 pages, 27 figures, 11 tables

详情
AI中文摘要

合成图像和深度伪造生成模型的日益复杂使得源归因和真实性验证成为现代计算机视觉系统的关键挑战。最近的研究表明,扩散管道会在其输出中无意中留下持久的统计痕迹,称为信号泄漏,特别是在潜在表示中。基于这一观察,我们提出了Proto-LeakNet,一个信号泄漏感知且可解释的归因框架,它将闭集分类与基于密度的开集评估相结合,对学习到的嵌入进行开集评估,从而无需重新训练即可分析未见过的生成器。我们的方法作用于扩散模型的潜在域,重新模拟部分前向扩散以暴露残留的生成器特定线索。一个时间注意力编码器聚合多步潜在特征,而一个特征加权原型头则结构化嵌入空间并实现透明的归因。仅在闭集数据上训练并达到98.13%的宏AUC,Proto-LeakNet学习到的潜在几何结构在后处理下保持鲁棒,超越了最先进的方法,并且在真实图像与已知生成器之间以及已知与未见生成器之间实现了强可分离性。代码库可在以下链接获取:this https URL。

英文摘要

The growing sophistication of synthetic image and deepfake generation models has turned source attribution and authenticity verification into a critical challenge for modern computer vision systems. Recent studies suggest that diffusion pipelines unintentionally imprint persistent statistical traces, known as signal-leaks, within their outputs, particularly in latent representations. Building on this observation, we propose Proto-LeakNet, a signal-leak-aware and interpretable attribution framework that integrates Closed-set classification with a density-based Open-set evaluation on the learned embeddings, enabling analysis of unseen generators without retraining. Acting in the latent domain of diffusion models, our method re-simulates partial forward diffusion to expose residual generator-specific cues. A temporal attention encoder aggregates multi-step latent features, while a feature-weighted prototype head structures the embedding space and enables transparent attribution. Trained solely on closed data and achieving a Macro AUC of 98.13\%, Proto-LeakNet learns a latent geometry that remains robust under post-processing, surpassing state-of-the-art methods, and achieves strong separability both between real images and known generators, and between known and unseen ones. The codebase is available at the following link: https://github.com/claudiunderthehood/Proto-LeakNet .

2602.04306 2026-06-19 cs.CL cs.AI 版本更新

DeFrame: Debiasing Large Language Models Against Framing Effects

DeFrame: 消除大语言模型中的框架效应偏差

Kahee Lim, Soyeon Kim, Steven Euijong Whang

发表机构 * KAIST(韩国科学技术院)

AI总结 针对大语言模型在语义等价但不同表述的提示下产生不一致偏见的问题,提出框架感知的去偏方法,通过量化框架差异并增强跨框架一致性,有效降低整体偏见并提升鲁棒性。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

随着大语言模型(LLMs)在现实应用中的日益部署,确保其在不同人口群体中的公平响应变得至关重要。尽管做出了许多努力,但一个持续的挑战是隐藏的偏见:LLMs 在标准评估下表现公平,但在这些评估设置之外可能产生有偏见的响应。在本文中,我们识别出框架——语义等价的提示在表达方式上的差异(例如,“A 比 B 好” vs. “B 比 A 差”)——作为导致这一差距的一个未被充分探索的因素。我们首先引入“框架差异”的概念来量化框架对公平性评估的影响。通过用替代框架扩充公平性评估基准,我们发现(1)公平性得分随框架变化显著,以及(2)现有的去偏方法改善了整体(即框架平均)公平性,但往往未能减少框架引起的差异。为了解决这个问题,我们提出了一种框架感知的去偏方法,鼓励 LLMs 在不同框架之间更加一致。实验表明,我们的方法减少了整体偏见,并提高了对框架差异的鲁棒性,使 LLMs 能够产生更公平和更一致的响应。

英文摘要

As large language models (LLMs) are increasingly deployed in real-world applications, ensuring their fair responses across demographics has become crucial. Despite many efforts, an ongoing challenge is hidden bias: LLMs appear fair under standard evaluations, but can produce biased responses outside those evaluation settings. In this paper, we identify framing -- differences in how semantically equivalent prompts are expressed (e.g., "A is better than B" vs. "B is worse than A") -- as an underexplored contributor to this gap. We first introduce the concept of "framing disparity" to quantify the impact of framing on fairness evaluation. By augmenting fairness evaluation benchmarks with alternative framings, we find that (1) fairness scores vary significantly with framing and (2) existing debiasing methods improve overall (i.e., frame-averaged) fairness, but often fail to reduce framing-induced disparities. To address this, we propose a framing-aware debiasing method that encourages LLMs to be more consistent across framings. Experiments demonstrate that our approach reduces overall bias and improves robustness against framing disparities, enabling LLMs to produce fairer and more consistent responses.

2603.19423 2026-06-19 cs.CR cs.AI cs.LG 版本更新

The Autonomy Tax: Defense Training Breaks LLM Agents

自主性税:防御训练破坏LLM智能体

Shawn Li, Yue Zhao

发表机构 * University of Southern California(南加州大学)

AI总结 揭示防御训练在提升LLM智能体安全性时,系统性地破坏其工具执行能力,导致任务失败率飙升,且无法有效防御复杂攻击。

详情
AI中文摘要

大型语言模型(LLM)智能体日益依赖外部工具(文件操作、API调用、数据库事务)来自主完成复杂的多步骤任务。实践者部署经过防御训练的模型,以防止通过恶意观察或检索内容操纵智能体行为的提示注入攻击。我们揭示了一个基本的\textbf{能力-对齐悖论}:旨在提高安全性的防御训练系统性地破坏了智能体的能力,同时未能阻止复杂的攻击。在97个智能体任务和1000个对抗性提示上,将防御模型与未防御基线进行比较,我们发现了多步骤智能体特有的三种系统性偏差。\textbf{智能体无能偏差}表现为立即的工具执行崩溃,模型在观察到任何外部内容之前就在良性任务上拒绝或生成无效操作。\textbf{级联放大偏差}导致早期失败通过重试循环传播,使防御模型在99%的任务中超时,而基线仅为13%。\textbf{触发偏差}导致矛盾的安全退化,防御模型的表现比未防御基线更差,而直接攻击以高概率绕过防御。根本原因分析表明,这些偏差源于捷径学习:模型过度拟合表面攻击模式而非语义威胁理解,这由防御效果在不同攻击类别上的极端方差所证明。我们的发现表明,当前的防御范式优化了单轮拒绝基准,同时使多步骤智能体从根本上不可靠,因此需要新的方法在对抗条件下保持工具执行能力。

英文摘要

Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental \textbf{capability-alignment paradox}: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. \textbf{Agent incompetence bias} manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. \textbf{Cascade amplification bias} causes early failures to propagate through retry loops, pushing defended models to timeout on 99\% of tasks compared to 13\% for baselines. \textbf{Trigger bias} leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.

2605.07821 2026-06-19 cs.CV cs.AI 版本更新

Mitigating Simplicity Bias in OOD Detection through Object Co-occurrence Analysis

通过对象共现分析缓解OOD检测中的简单性偏差

Boyang Dai, Chaoqi Chen, Yizhou Yu

发表机构 * The University of Hong Kong(香港大学) Shenzhen University(深圳大学) Shenzhen Loop Area Institute(深圳环城区域研究所)

AI总结 提出基于对象共现的OOD检测框架,通过解耦表示和分治策略区分近OOD,缓解简单性偏差,在多种设置下取得竞争结果。

Comments This paper has been accepted by CVPR2026

详情
AI中文摘要

分布外(OOD)检测对于确保深度学习模型的可靠性至关重要。现有方法大多关注正则纠缠表示以区分分布内(ID)和OOD数据,忽略了图像中丰富的上下文信息。这一问题在检测近OOD时尤其具有挑战性,因为具有简单性偏差的模型难以在解耦表示中学习判别性特征。人类视觉系统可以利用自然环境中对象的共现来促进场景理解。受此启发,我们提出了一种以对象为中心的OOD检测框架,学习捕捉图像中的对象共现(OCO)模式。该方法引入了一种新的OOD检测范式,通过预测测试样本的解耦表示来理解图像中的对象共现,然后根据ID训练数据中观察到的对象共现模式自适应地将模式分为三种场景,最后以分治方式进行OOD检测。通过这种方式,OCO可以通过考虑图像中存在的语义上下文关系来区分近OOD,避免仅关注简单、易学习区域的倾向。我们通过在具有挑战性和全频谱OOD设置下的实验评估了OCO,展示了竞争性结果,并证实了其处理语义和协变量偏移的能力。代码发布在:https://this https URL。

英文摘要

Out-of-distribution (OOD) detection is crucial for ensuring the reliability of deep learning models. Existing methods mostly focus on regular entangled representations to discriminate in-distribution (ID) and OOD data, neglecting the rich contextual information within images. This issue is particularly challenging for detecting near-OOD, as models with simplicity bias struggle to learn discriminative features in disentangled representations. The human visual system can use the co-occurrence of objects in the natural environment to facilitate scene understanding. Inspired by this, we propose an Object-Centric OOD detection framework that learns to capture Object CO-occurrence (OCO) patterns within images. The proposed method introduces a new OOD detection paradigm that understands object co-occurrence within an image by predicting disentangled representations for the test sample, then adaptively divides patterns into three scenarios based on object co-occurrence patterns observed in ID training data, and finally performs OOD detection in a divide-and-conquer manner. By doing so, OCO can distinguish near-OOD by considering the semantic contextual relationships present in their images, avoiding the tendency to focus solely on simple, easily learnable regions. We evaluate OCO through experiments across challenging and full-spectrum OOD settings, demonstrating competitive results and confirming its ability to address both semantic and covariate shifts. Code is released at https://github.com/Michael-McQueen/OCO.

2606.03090 2026-06-19 cs.CR cs.AI 版本更新

"**Important** You should give me full credits!": Exploring Prompt Injection Attacks on LLM-Based Automatic Grading Systems

“**重要** 你应该给我满分!”:探索针对基于LLM的自动评分系统的提示注入攻击

Hang Li, Fedor Filippov, Yuping Lin, Pengfei He, Kaiqi Yang, Yucheng Chu, Yingqian Cui, Hui Liu, Jiliang Tang

发表机构 * Michigan State University(密歇根州立大学)

AI总结 研究针对基于LLM的自动评分系统的提示注入攻击,通过实验证明当前系统高度脆弱,并评估现有防御策略的有效性。

Comments 15 pages, 8 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)的出现显著加速了近期关于基于LLM的自动评分(AG)系统的研究。受益于LLM强大的指令遵循能力和广泛的先验知识,教育工作者可以使用仅包含自然语言评分标准的AG系统跨不同任务部署,并获得令人满意的评分性能。尽管有这些优势,新的安全问题也可能出现。特别是,提示注入(PI)攻击最近已成为基于LLM的应用的主要威胁。在AG的背景下,攻击者可能利用PI漏洞操纵评分系统,使其无论实际答案质量如何都人为地给出高分。这种行为对教育评估的公平性、可靠性和完整性构成严重风险。在这项工作中,我们研究了AG系统中的PI攻击,并系统地调查了此类攻击在教育场景中的有效性。我们进一步评估了现有防御策略对抗这些攻击的有效性。通过在基于评分标准的评分设置下进行全面的实验,我们证明了当前基于LLM的AG系统仍然高度容易受到PI攻击。我们希望我们的发现能提高对这种新兴威胁的认识,并激励未来研究朝着安全、稳健和可信的基于LLM的教育系统发展。

英文摘要

The emergence of large language models (LLMs) has significantly accelerated recent research on LLM-based automatic grading (AG) systems. Benefiting from the strong instruction-following capabilities and broad prior knowledge of LLMs, educators can deploy AG systems across diverse tasks using only natural language rubrics while achieving satisfactory grading performance. Despite these advantages, new security concerns may also arise. In particular, prompt injection (PI) attacks have recently become a major threat to LLM-based applications. In the context of AG, attackers can potentially exploit PI vulnerabilities to manipulate grading systems into assigning artificially high scores regardless of the actual answer quality. Such behavior poses serious risks to the fairness, reliability, and integrity of educational assessment. In this work, we study PI attacks in AG systems, and systematically investigate the effectiveness of such attacks in educational scenarios. We further evaluate the effectiveness of existing defensive strategies against these attacks. Through comprehensive experiments under rubric-based grading settings, we demonstrate that current LLM-based AG systems remain highly vulnerable to PI attacks. We hope that our findings raise awareness of this emerging threat and motivate future research toward secure, robust, and trustworthy LLM-based educational systems.

2606.04075 2026-06-19 cs.LG cs.AI cs.CL cs.CR cs.CY 版本更新

Large Language Models Hack Rewards, and Society

大型语言模型攻击奖励机制与社会

Wei Liu, Xinyi Mou, Hanqi Yan, Zhongyu Wei, Yulan He

发表机构 * King’s College London(伦敦大学国王学院) Fudan University(复旦大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 研究强化学习训练中大型语言模型利用奖励函数漏洞的“社会攻击”现象,通过SocioHack沙盒实验发现模型能发现并利用社会规则漏洞,且现有安全措施效果有限。

Comments 14 pages, 9 figures, 7 tables

详情
AI中文摘要

强化学习已成为一种主导的后训练范式,使大型语言模型能够从奖励中学习。我们观察到社会规则在结构上与奖励函数相似。它们定义了可衡量的结果、阈值和例外情况,同时往往仅部分指定了制度意图。我们假设强化学习训练过程可能利用这些漏洞,因此提出模型在强化学习期间攻击奖励函数的已知倾向是否可能扩展为一种更严重的失败模式,即社会攻击:发现社会运行规则中的漏洞。为了研究这一现象,我们引入了SocioHack,一个包含72个社会环境的沙盒,并发现这些环境中奖励攻击自然出现并导致监管漏洞的发现。模型学会攻击社会规则并生成技术上合规但违背监管意图的策略,而当前的大型语言模型安全措施仅提供有限的缓解。因此,收集真实世界反馈用于模型训练需要更加谨慎,我们需要下一代后训练范式来安全地在真实社会中迭代大型语言模型。

英文摘要

Reinforcement learning (RL) has become a dominant post-training paradigm, enabling large language models (LLMs) to learn from rewards. We observe that societal regulations are structurally similar to reward functions. They define measurable outcomes, thresholds, and exceptions, while often leaving institutional intent only partially specified. We hypothesise that the RL training process may exploit these gaps and therefore ask whether models' well-known tendency to hack reward functions during RL can scale into a more consequential failure mode named societal hacking: discovering loopholes in the rules society runs on. To study this phenomenon, we introduce SocioHack, a sandbox of 72 societal environments, and find that within these environments, reward hacking naturally emerges and leads to regulatory loophole discovery. Models learn to hack the social rules and generate strategies that remain technically compliant while defeating regulatory intent, and current LLM safeguards provide only limited mitigation. Therefore, collecting in-the-wild feedback for model training requires greater caution, and we need a next-generation post-training paradigm for safely iterating LLMs in real society.=

2606.07822 2026-06-19 cs.CL cs.AI cs.LG 版本更新

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

ACUTE协议:操作语言模型激活以实现更好的校准、效用和信任

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Google(谷歌) Scale AI

AI总结 提出ACUTE协议,通过操作语言模型激活来估计置信度,平衡校准与信息性,在多项选择问答、工具调用和科学文档摘要等任务上优于强基线,提升校准、效用和可信度。

Comments ICML 2026

详情
AI中文摘要

随着语言模型的改进并越来越多地部署以解决各种任务,可信度变得至关重要。校准是信任的良好代理:良好校准的置信度估计有助于在信任特定模型输出时告知风险与回报的权衡。不幸的是,即使模型改进,它们仍然校准不良,往往偏向过度自信。此外,校准可能被操纵:总是预测基率的策略是完美校准的,但完全没有信息性。为了解决这个问题,我们开发了一个新指标,即通过预言机重新归一化的期望效用(EURO),它平衡了校准和信息性。我们还提出了一种通用的基于激活的置信度、效用和信任估计协议(ACUTE),以适当裁决不确定性。ACUTE协议为4个模型家族的6个模型上的3个任务(包括多项选择问答、工具调用和科学文档摘要)提供了灵活、样本高效和计算高效的置信度估计器。ACUTE在EURO上优于强基线,同时保持较低的校准误差。综合来看,我们的工作表明,为LLM配备ACUTE协议可以在多种设置中提高校准、效用和可信度。

英文摘要

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.

2606.18996 2026-06-19 cs.CR cs.AI 版本更新

TRAP: Benchmark for Task-completion and Resistance to Active Privacy-extraction

TRAP:任务完成与主动隐私提取抵抗基准

Moon Ye-Bin, Nam Hyeon-Woo, Baek Seong-Eun, Yejin Yeo, Tae-Hyun Oh

发表机构 * Dept. of Electrical Engineering, POSTECH(POSTECH电子工程系) Grad. School of Artificial Intelligence, POSTECH(POSTECH人工智能研究生院) School of Computing, KAIST(韩国科学技术院计算机学院)

AI总结 提出TRAP基准,评估智能体在文档密集型任务中平衡任务准确性与隐私泄露的能力,发现所有模型均存在非平凡泄露,并证明基于提示的防御无法同时实现高任务成功率和零泄露概率,提出结构化的私有字段隔离方法。

详情
AI中文摘要

智能体越来越多地部署在文档密集型工作流中,其中敏感私人信息不是边缘情况而是常规输入,例如,预订航班的智能体需要护照号码。在这种情况下,智能体必须使用私人信息准确完成任务,同时绝不在其响应中暴露这些信息,因为它无法验证键盘前实际是谁。这两个义务存在根本性矛盾。一个能够使用私人信息完成任务的模型,同样可能被诱导泄露这些信息。为了评估任务准确性与隐私泄露之间的权衡,我们引入了任务完成与主动隐私提取抵抗(TRAP)。每个场景包括一个包含私人信息的文档、一个要求智能体使用私有字段调用正确工具的任务查询,以及一个试图以自然语言引出相同信息的攻击查询。评估了涵盖前沿专有和开源模型的22个模型,我们发现所有模型系列都表现出非平凡的泄露,并且指令遵循能力与泄露率相关。现有的基于提示的防御减少了泄露,但以显著降低任务准确性为代价。提示优化未能摆脱这种权衡。我们证明这种失败并非偶然。对于任何基于softmax的模型,没有软约束防御(例如基于提示的防御)能够同时实现高任务成功率和零泄露概率。受这一不可能性结果的启发,我们提出了结构化的私有字段隔离,该方法在私有字段到达模型之前用哈希键替换它们。这种方法在保持任务准确性的同时很大程度上防止了泄露。

英文摘要

Agents are increasingly deployed in document-intensive workflows where sensitive private information is not an edge case but a routine input, e.g., an agent booking a flight needs passport numbers. In such settings, the agent must use private information to complete tasks accurately while never exposing it in its responses, because it cannot verify who is actually at the keyboard. These two obligations are in fundamental tension. A model capable enough to use private information for task completion can, by the same capability, be induced to reveal it. To evaluate the trade-off of task accuracy and privacy leakage, we introduce Task-completion and Resistance to Active Privacy-extraction (TRAP). Each scenario includes a document containing private information, a task query that requires the agent to invoke the correct tool using private fields, and an attack query that attempts to elicit the same information in natural language. Evaluating 22 models spanning frontier proprietary and open-source models at multiple scales, we find that all model families exhibit non-trivial leakage, and that instruction-following ability correlates with leakage rate. Existing prompt-based defenses reduce leakage but at significant cost to task accuracy. Prompt optimization fails to escape this trade-off. We demonstrate that this failure is not incidental. For any softmax-based model, no soft-constraint defense, e.g., prompt-based defenses, can jointly achieve high task success with zero leakage probability. Motivated by this impossibility result, we propose structural private field isolation, which replaces private fields with hash keys before they reach the model. This approach largely prevents leakage while keeping task accuracy.

9. 评测、基准与数据集 17 篇

2506.14990 2026-06-19 cs.AI 版本更新

MEAL: A Benchmark for Continual Multi-Agent Reinforcement Learning

MEAL: 持续多智能体强化学习基准

Tristan Tomilin, Luka van den Boogaard, Samuel Garcin, Constantin Ruhdorfer, Bram Grooten, Fabrice Kusters, Yali Du, Andreas Bulling, Mykola Pechenizkiy, Meng Fang

发表机构 * Eindhoven University of Technology, The Netherlands(埃因霍温理工大学,荷兰) University of Edinburgh, UK(爱丁堡大学,英国) University of Stuttgart, Germany(斯图加特大学,德国) King's College London, UK(伦敦国王学院,英国) University of Liverpool, UK(利物浦大学,英国)

AI总结 提出MEAL基准,利用JAX和GPU加速实现100任务序列训练,揭示长序列中出现的失败模式。

Comments To be published in the International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

基准在强化学习(RL)研究中扮演核心角色,但其计算约束往往塑造了研究内容。尽管有终身学习的动机,大多数持续RL论文仅考虑3-10个顺序任务,因为CPU密集型环境使得更长的序列不切实际。同时,合作多智能体环境中的持续学习仍基本未被探索。为弥补这些空白,我们引入MEAL(多智能体自适应学习环境),这是首个持续多智能体RL基准。通过利用JAX和GPU加速,MEAL能够在单个GPU上几小时内训练100个任务的序列。我们发现,长任务序列揭示了在较小规模下不会出现的失败模式。

英文摘要

Benchmarks play a central role in reinforcement learning (RL) research, yet their computational constraints often shape what is studied. Despite the motivation of lifelong learning, most continual RL papers consider only 3-10 sequential tasks, as CPU-bound environments make longer sequences impractical. Meanwhile, continual learning in cooperative multi-agent settings remains largely unexplored. To address these gaps, we introduce MEAL (Multi-agent Environments for Adaptive Learning), the first benchmark for continual multi-agent RL. By leveraging JAX and GPU acceleration, MEAL enables training on sequences of 100 tasks in a few hours on a single GPU. We find that long task sequences reveal failure modes that do not appear at smaller scales.

2603.28387 2026-06-19 cs.AI cs.LG 版本更新

The Scaffold Effect: How Prompt Framing Drives Apparent Multimodal Gains in Clinical VLM Evaluation

脚手架效应:提示框架如何驱动临床VLM评估中的表面多模态增益

Doan Nam Long Vu, Simone Balloccu

发表机构 * Technical University of Darmstadt(达姆施塔特技术大学)

AI总结 研究发现,在临床VLM评估中,提示中提及MRI可用性即可解释70-80%的性能提升,与图像数据是否存在无关,这种“脚手架效应”揭示了表面评估无法反映真实多模态推理能力。

详情
AI中文摘要

可信的临床AI要求性能提升反映真实的证据整合而非表面伪影。我们在两个临床神经影像队列\textsc{FOR2107}(情感障碍)和\textsc{OASIS-3}(认知衰退)上评估了12个开源视觉语言模型(VLM)的二分类性能。两个数据集都包含结构MRI数据,但这些数据不携带可靠的个体级诊断信号。在这些条件下,较小的VLM在引入神经影像上下文后F1分数提升高达58%,蒸馏模型变得与规模大一个数量级的模型相当。对比置信度分析显示,仅仅在任务提示中\textit{提及}MRI可用性就解释了70-80%的转变,与影像数据是否存在无关,这是模态坍塌的一个领域特定实例,我们称之为\textit{脚手架效应}。专家评估揭示了在所有条件下捏造基于神经影像的正当理由,而偏好对齐虽然消除了引用MRI的行为,却使两种条件都退化为随机基线。我们的发现表明,表面评估不足以作为多模态推理的指标,这对VLM在临床环境中的部署有直接影响。

英文摘要

Trustworthy clinical AI requires that performance gains reflect genuine evidence integration rather than surface-level artifacts. We evaluate 12 open-weight vision-language models (VLMs) on binary classification across two clinical neuroimaging cohorts, \textsc{FOR2107} (affective disorders) and \textsc{OASIS-3} (cognitive decline). Both datasets come with structural MRI data that carries no reliable individual-level diagnostic signal. Under these conditions, smaller VLMs exhibit gains of up to 58\% F1 upon introduction of neuroimaging context, with distilled models becoming competitive with counterparts an order of magnitude larger. A contrastive confidence analysis reveals that merely \emph{mentioning} MRI availability in the task prompt accounts for 70-80\% of this shift, independent of whether imaging data is present, a domain-specific instance of modality collapse we term the \emph{scaffold effect}. Expert evaluation reveals fabrication of neuroimaging-grounded justifications across all conditions, and preference alignment, while eliminating MRI-referencing behavior, collapses both conditions toward random baseline. Our findings demonstrate that surface evaluations are inadequate indicators of multimodal reasoning, with direct implications for the deployment of VLMs in clinical settings.

2604.05435 2026-06-19 cs.AI 版本更新

CareTransition-Audit: A Benchmark to Audit Discharge Summaries for Efficient Care Transitions

CareTransition-Audit:用于高效护理过渡的出院总结审计基准

Akshat Dasula, Prasanna Desikan, Jaideep Srivastava, Shivali Dalmia, Abhishek Mukherji

发表机构 * Department of Computer Science \& Engineering, University of Minnesota-Twin Cities, Minneapolis, USA Centific AI Research, Redmond, USA

AI总结 提出基于大语言模型的自动化框架,通过46项检查清单审计出院总结完整性,在MIMIC-IV数据集上基准测试11个模型,最佳模型与临床医生标签的Cohen's kappa约0.5,所有模型难以识别模糊文档。

Comments Accepted as a poster at IEEE-ICHI 2026; Accepted at SD4H@ICML

详情
AI中文摘要

不完整或不一致的出院文档会导致护理碎片化和可避免的再入院。尽管其在患者安全中至关重要,但审计出院总结依赖于人工审查且无法扩展。我们提出一个使用大语言模型(LLM)的自动化审计框架。我们的方法将DISCHARGED框架操作化为一个包含46个问题的检查清单。使用来自MIMIC-IV数据库的50份总结及临床医生真实标签,我们对11个LLM进行基准测试。模型评估的平均文档完整性范围为54.9%至74.2%,最佳模型与临床医生标签的Cohen's kappa值约为0.5,表明中等一致性。所有模型在识别模糊文档(Unclear)方面均存在困难,突显了当前自动化审计的关键差距。本工作为临床文档的系统性质量改进提供了临床医生验证的基准和零样本基线。

英文摘要

Incomplete or inconsistent discharge documentation drives care fragmentation and avoidable readmissions. Despite its critical role in patient safety, auditing discharge summaries relies on manual review and does not scale. We propose an automated framework for auditing discharge summaries using large language models (LLMs). Our approach operationalizes the DISCHARGED framework into a checklist of 46 questions. Using 50 summaries from the MIMIC-IV database, with clinician ground-truth labels, we benchmark 11 LLMs. Model-assessed mean documentation completeness ranges from 54.9% to 74.2%, and the best-performing models achieve a Cohen's kappa values around 0.5 against clinician labels, indicating moderate agreement. All models struggle to identify ambiguous documentation (Unclear), highlighting a key gap in current automated auditing. This work provides a clinician-validated benchmark and zero-shot baselines for systematic quality improvement in clinical documentation.

2604.07593 2026-06-19 cs.AI 版本更新

Too long; didn't solve

太长;没解决

Lucía M. Cabrera, Isaac Saxton-Knight, Jocelyn D'Arcy

发表机构 * Instituto Balseiro(巴塞罗那研究所) Poindexter Labs(波因迪克斯实验室)

AI总结 研究提示长度和解答长度与大型语言模型在数学问题上的性能关系,发现两者与模型失败率正相关。

详情
AI中文摘要

由一系列数学问题组成的数学基准被广泛用于评估大型语言模型的推理能力,但关于其结构特性如何影响模型行为的研究很少。在这项工作中,我们研究了两个结构长度变量——提示长度和解答长度,并分析了它们如何与模型在新构建的、由专家编写的对抗性数学问题数据集上的性能相关。我们发现,提示长度和解答长度均与模型失败率的增加呈正相关。我们还进行了跨模型分歧的探索性辅助分析。在难度调整的归一化分析下,两个变量与实现模型分离仍保持弱负相关,提示长度的关联稍强。总体而言,我们的主要稳健发现是,结构长度与该数据集中的经验难度相关。

英文摘要

Mathematical benchmarks consisting of a range of mathematics problems are widely used to evaluate the reasoning abilities of large language models, yet little is known about how their structural properties influence model behaviour. In this work, we investigate two structural length variables, prompt length and solution length, and analyse how they relate to model performance on a newly constructed adversarial dataset of expert-authored mathematics problems. We find that both prompt and solution lengths correlate positively with increased model failure across models. We also include a secondary, exploratory analysis of cross-model disagreement. Under a difficulty-adjusted normalised analysis, both variables retain weak negative associations with realised model separation, slightly stronger for prompt length. Overall, our main robust finding is that structural length is linked to empirical difficulty in this dataset.

2605.25160 2026-06-19 cs.AI 版本更新

ScaleWoB: Guiding GUI Agents with Coding Agents via Large-Scale Environmental Synthesis

SimuWoB: 模拟真实世界移动应用以实现快速且保真的GUI智能体基准测试

Guohong Liu, Jialei Ye, Pengzhi Gao, Wei Liu, Jian Luan, Yunxin Liu, Yuanchun Li

发表机构 * Institute for AI Industry Research (AIR), Tsinghua University(人工智能产业研究院(AIR),清华大学) University of Electronic Science and Technology of China(电子科技大学) MiLM Plus, Xiaomi Inc.(小米公司MiLM Plus团队)

AI总结 针对现有移动GUI智能体基准测试与现实应用之间的差距,提出全合成基准SimuWoB,通过鲁棒的虚拟环境生成框架合成高保真任务和环境,自动提供有效奖励,实现对复杂长程交互的高效可重复评估。

详情
AI中文摘要

由大型语言模型驱动的移动GUI智能体发展迅速,迫切需要真实且全面的评估。现有基准测试优先考虑可重复性,但通常局限于开源应用或文件操作任务,因为在实际应用中构建奖励困难,导致基准设置与现实使用之间存在差距。此外,大多数基准测试侧重于基本定位和导航,对复杂长程交互的覆盖有限。为解决这些局限性,我们引入了SimuWoB,一个全合成的移动GUI智能体基准测试,包含120个涵盖不同类型和难度级别的挑战性任务。我们构建了一个鲁棒的虚拟环境生成框架,合成高保真任务和环境,并为每个任务自动提供有效奖励。每个环境都部署为可通过URL访问的无后端网页,实现高效且可重复的评估。我们对几个最先进的移动GUI智能体进行了全面实验。平均成功率仅为27.92%,在长程任务上降至17.82%,揭示了当前智能体在复杂场景下的显著弱点。与真实世界样本任务的评估结果比较表明,基于我们合成环境的智能体评估具有良好的泛化性。我们进一步提供了关键能力维度的诊断见解,并讨论了对未来移动GUI智能体开发的启示。

英文摘要

GUI agents powered by large language models are advancing rapidly, creating urgent needs for evaluation and training based on realistic environments. However, directly doing so in real-world environments introduces some challenges that cannot be overlooked. Real-world environments are complex and uncontrollable, making it difficult to construct verifiable rewards and to save or reset states. Existing works prioritize reproducibility but are often limited to open-source apps or file-operation tasks for reliable reward building, leaving a persistent gap from real-world usage. Furthermore, relying on virtual machines or docker images demand high resource requirements and suffer from slow response speeds, which limit the efficiency. We present \sys, a framework that could produce high-fidelity synthesized interactive environments for GUI agents across platforms with verifiable rewards. These environments behave as backend-free webpages accessible via URL, requiring near-zero setup and low resource cost, making the approach suitable for both large-scale evaluation and downstream agent training. We support multiple GUI platforms including mobile, desktop, and automotive/in-vehicle interfaces based on the same pipeline, covering 100+ environments and 1000+ verifiable tasks. Among them, 120 challenging tasks across 63 simulated mobile applications are released as a fully synthesized mobile GUI agent benchmark. Experiment results on five state-of-the-art mobile GUI agents reveal substantial headroom -- the average success rate is only 27.92\%, dropping to 17.82\% on long-horizon subset -- while humans reach 92.08\%. A comparison against real-world sample tasks shows that assessments made in our synthetic environments generalize to real apps. The project website is at https://scalewob.github.io.

2606.14031 2026-06-19 cs.AI 版本更新

Applicability Condition Extraction for Therapeutic Drug-Disease Relations

治疗性药物-疾病关系的适用条件提取

Guanting Luo, Noriki Nishida, Yuji Matsumoto, Yuki Arase

发表机构 * The University of Osaka(大阪大学) RIKEN(理化学研究所) Institute of Science Tokyo(东京科学大学) Tohoku University(东北大学)

AI总结 提出从生物医学文献中提取药物-疾病治疗关系适用条件的任务,构建首个手动标注数据集,并改进LoRA方法以考虑药物与疾病间关系,在多个评估设置中优于基线。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

识别某种药物对目标疾病产生治疗效果的适用条件对于临床决策支持至关重要。然而,现有的大多数生物医学信息提取方法仅关注识别药物与疾病之间的关系,而很大程度上忽略了这些关系适用的上下文特定条件。为解决这一问题,我们引入了从生物医学研究文献中提取治疗性药物-疾病关系适用条件的任务。我们创建了首个数据集,在生物医学论文摘要上手动标注了药物、疾病和适用条件的三元组,包含1,119个药物-疾病对。利用该数据集,我们系统评估了一系列现有方法的性能。此外,我们提出了一种新方法,增强LoRA以考虑药物与疾病之间的关系。我们的方法在不同评估设置中均优于强基线。本文的源代码和数据集可从以下网址获取:this https URL

英文摘要

Identifying conditions that a certain drug takes therapeutic effect on a target disease is crucial for clinical decision-making support. However, most existing biomedical information extraction methods have focused on identifying only relations between drugs and diseases, while largely overlooking the context-specific conditions where such relations can apply. To address this problem, we introduce the task of applicability condition extraction for therapeutic drug-disease relations from biomedical research literature. We create the first dataset that has manually annotated triples of drugs, diseases, and applicability conditions on biomedical paper abstracts with 1,119 drug-disease pairs. Using this dataset, we systematically evaluate the performance of a range of existing methods. In addition, we propose a new method that enhances LoRA to consider relations between drugs and diseases. Our method consistently outperforms strong baselines across different evaluation settings.

2606.15862 2026-06-19 cs.AI 版本更新

RetailBench: Benchmarking long horizon reasoning and coherent decision making of LLM agents in realistic retail environments

RetailBench: 在真实零售环境中评估LLM代理的长期推理与连贯决策能力

Linghua Zhang, Jun Wang, Jingtong Wu, Zhisong Zhang

发表机构 * Ant Group(蚂蚁集团) City University of Hong Kong(香港城市大学)

AI总结 提出RetailBench基准,模拟单店超市运营,评估LLM代理在长期决策中的表现,发现多数模型无法持续生存,与最优策略差距显著。

Comments This paper is my paper's second version [see arXiv:2603.16453v2]

详情
AI中文摘要

大型语言模型(LLM)代理在短期、范围明确的任务上取得了快速进展,但它们在动态长期环境中维持连贯决策的能力仍不确定。我们引入了RetailBench,一个基于数据驱动的模拟基准,用于评估在单店超市运营中使用工具的LLM代理。RetailBench将零售管理建模为部分可观察的决策过程,并设计支持千天规模的模拟。在此环境中,代理必须管理定价、补货、供应商选择、货架分类、库存老化、客户反馈、外部事件和现金流约束。我们在180天的评估期内,在代表性代理框架下评估了七个当代LLM,并将它们与特权最优策略进行比较。结果显示模型之间存在显著差异:只有一小部分能够存活整个评估期,即使最强的LLM运行在最终净资产和销售结果上也远落后于最优策略。行为分析将这些差距归因于不完整的证据获取、表面决策以及缺乏一致的长期策略。RetailBench为研究经济基础长期决策中的可靠自主性提供了一个受控测试平台。

英文摘要

Large language model (LLM) agents have made rapid progress on short-horizon, well-scoped tasks, yet their ability to sustain coherent decisions in dynamic long-horizon environments remains uncertain. We introduce RetailBench, a data-grounded simulation benchmark for evaluating tool-using LLM agents in single-store supermarket operation. RetailBench models retail management as a partially observable decision process and is designed to support thousand-day-scale simulations. In this environment, agents must manage pricing, replenishment, supplier selection, shelf assortment, inventory aging, customer feedback, external events, and cash-flow constraints. We evaluate seven contemporary LLMs under representative agent frameworks over a 180-day evaluation horizon and compare them with a privileged oracle policy. Results show substantial variation across models: only a small subset survives the full evaluation horizon, and even the strongest LLM runs remain substantially behind the oracle policy in final net worth and sales outcomes. Behavioral analysis attributes these gaps to incomplete evidence acquisition, surface-level decision making, and the lack of a consistent long-horizon policy. RetailBench provides a controlled testbed for studying reliable autonomy in economically grounded long-horizon decision-making.

2606.18191 2026-06-19 cs.AI cs.MA 版本更新

DRFLOW: A Deep Research Benchmark for Personalized Workflow Prediction

DRFLOW:用于个性化工作流预测的深度研究基准

Md Tawkat Islam Khondaker, Raymond Li, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Issam H. Laradji

发表机构 * ServiceNow AI Research(ServiceNow人工智能研究)

AI总结 提出DRFLOW基准,评估AI代理从异构源预测个性化工作流的能力,包含5领域100任务,并设计7个诊断指标,实验显示现有代理性能有限。

详情
AI中文摘要

深度研究(DR)系统越来越多地用于复杂信息寻求任务,但现有工作主要关注生成报告和摘要。相比之下,许多企业任务需要代理识别具体的工作流,即一系列行动步骤。例如,代理不应总结预算政策,而应能确定回答诸如“在固定预算下如何申请新员工?”这类问题所需的步骤。因此,我们引入DRFLOW,一个用于评估代理从异构源预测个性化工作流的基准。每个任务要求代理从分散来源中识别相关证据,然后使用这些证据预测用户任务的正确行动步骤序列。DRFLOW包含跨五个领域的100个任务,1246个参考工作流步骤,基于超过3900个来源。我们定义了七个诊断指标,涵盖事实依据、步骤恢复、结构排序、条件解决和个性化。我们进一步提出DRFLOW-Agent(DRFA),一个面向工作流的参考代理,用于预测个性化工作流。我们表明,尽管DRFA相比强基线代理有所改进(平均F1分数提升高达10.02%),但在这些工作流指标上仍有很大的改进空间,表明预测完整且正确的个性化工作流仍然是深度研究的一个挑战性前沿。

英文摘要

Deep research (DR) systems are increasingly used for complex information-seeking tasks, but existing works mainly focus on generating reports and summaries. In contrast, many enterprise tasks instead require an agent to identify concrete workflows which is a sequence of action-steps. For example, rather than summarizing budgeting policies, an agent should be able to determine the steps needed to answer a question such as: "How do I request new headcount given a fixed budget?". Therefore, we introduce DRFLOW, a benchmark for evaluating personalized workflows predicted by agents from heterogeneous sources. Each task requires the agent to identify relevant evidence from scattered sources, then use that evidence to predict the correct action-step sequence for the user's task. DRFLOW contains 100 tasks across five domains, with 1,246 reference workflow steps grounded in more than 3,900 sources. We define seven diagnostic metrics covering factual grounding, step recovery, structural ordering, condition resolution, and personalization. We further present DRFLOW-Agent (DRFA), a workflow-oriented reference agent to predict personalized workflow. We show that although DRFA improves over strong baseline agents (upto 10.02% average F1 score), there is substantial room for improvement remains across these workflow metrics, indicating that predicting complete and correct personalized workflows remains a challenging frontier for deep research.

2606.18950 2026-06-19 cs.AI 版本更新

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出RTSGameBench,基于Beyond All Reason游戏,通过多样化对战、迷你游戏诊断和自进化生成框架,评估视觉语言模型在实时策略游戏中的战略推理能力。

Comments First two authors contributed equally

详情
AI中文摘要

现代视觉语言模型(VLM)在竞争和合作环境中的不确定性下,往往难以进行战略推理,即预测和影响其他智能体的行为。实时策略(RTS)游戏可以作为诊断这一局限性的自然测试平台,因为它们要求与盟友协调、适应对手策略,并在部分可观测性下进行长期规划。然而,现有的RTS基准评估范围有限,缺乏系统的能力诊断,并且局限于预设计的场景覆盖。为了解决这些限制,我们提出了RTSGameBench,它建立在Beyond All Reason之上,这是一款大规模RTS游戏,其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估,通过迷你游戏进行诊断性评估,每个迷你游戏针对单个战略能力,并通过自进化生成框架实现可扩展的覆盖,该框架将自由形式的查询转化为新的迷你游戏,并在连续循环中改进。此外,为了让VLM在大规模RTS游戏中运行,我们提供了RTSGameAgent,它通过具有智能体记忆的有限状态机(FSM)管理单位。我们通过实验验证,多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

英文摘要

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategy, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain fixed in the pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than the existing testbeds. The proposed benchmark provides evaluations through diverse gameplay across various matchup structures, diagnostic assessment via mini-games, each targeting an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games, improving over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent that manages units by an FSM with agentic memory. We empirically validate that multiple state-of-the-art VLMs do not perform well when matchups demand tighter coordination, multiagent coordination and when task scale increases.

2606.19245 2026-06-19 cs.AI cs.LG 版本更新

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP:分析AI代理在小分子临床前药理学中的表现

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结 提出TxBench-PP基准,用于评估AI代理从真实实验数据中恢复临床前药理学结论的能力,测试显示最强配置Claude Opus 4.8 / Pi仅通过59.3%的端点尝试。

详情
AI中文摘要

人工智能(AI)代理有望通过压缩解释和决策循环来加速药物发现,但实际部署需要基于现实程序决策的可信评估。我们引入了TherapeuticsBench临床前药理学(TxBench-PP),这是一个针对小分子临床前药理学的可验证基准,也是更广泛的TherapeuticsBench在药物发现阶段和治疗模式中的首个聚焦切片。TxBench-PP测试代理是否能够从真实实验数据中恢复准确的结论,而非从文献中记忆的事实。该基准包含100个评估,按程序阶段、实验类型和任务结构索引,涵盖作用机制(MoA)和药效学(PD)推理、化合物-靶点结合、因果靶点验证、可开发性与安全性以及转化疗效。代理接收现实的工作流程快照,在编码环境中检查文件,并返回确定性评分的结构化答案。在16个模型-工具配置(包括11个模型和4,800条轨迹)中,没有系统能够可靠地恢复临床前药理学决策。最强配置Claude Opus 4.8 / Pi通过了59.3%的端点尝试(178/300;95% CI, 51.1-67.6),其次是GPT-5.5 / Pi,为55.3%(166/300;47.0-63.6)。

英文摘要

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3\% of endpoint attempts (178/300; 95\% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3\% (166/300; 47.0-63.6).

2507.19653 2026-06-19 cs.NI cs.AI cs.LG 版本更新

On the Limitations of Ray-Tracing for Learning-Based RF Tasks in Urban Environments

关于射线追踪在城市环境中基于学习的射频任务局限性的研究

Armen Manukyan, Hrant Khachatrian, Edvard Ghukasyan, Theofanis P. Raptis

发表机构 * Yerevan State University, Yerevan, Armenia(亚美尼亚叶里温州立大学) YerevaNN, Yerevan, Armenia(亚美尼亚叶里温YerevaNN) Institute of Informatics and Telematics, National Research Council, Pisa, Italy(意大利那不勒斯国家研究委员会信息与电信研究所)

AI总结 通过罗马城区实测数据评估Sionna射线追踪仿真器,发现天线位置和方向对保真度影响显著,而超参数影响微弱;优化后相关性提升5%-130%,定位误差降低三分之一,但残差城市噪声仍是挑战。

Comments This work was supported by funding under the bilateral agreement between CNR (Italy) and HESC MESCS RA (Armenia) as part of the DeepRF project for the 2025-2026 biennium, and by the HESC MESCS RA grant No. 22rl-052 (DISTAL)

Journal ref 2026 IEEE Wireless Communications and Networking Conference (WCNC)

详情
AI中文摘要

我们研究了Sionna v1.0.2射线追踪在罗马市中心户外蜂窝链路中的真实感。我们使用了包含1,664个用户设备(UE)和六个名义基站(BS)站点的真实测量数据集。利用这些固定位置,我们系统地改变了主要仿真参数,包括路径深度、漫反射/镜面反射/折射标志、载波频率,以及天线的属性如高度、辐射方向和方向图。通过测量功率与仿真功率之间的Spearman相关性,以及基于RSSI指纹的k近邻定位算法,对每个基站的仿真保真度进行评分。在所有实验中,求解器超参数对所选指标的影响微不足道。相反,天线位置和方向被证明是决定性的。通过简单的贪婪优化,我们将不同基站的Spearman相关性提高了5%到130%,而仅使用仿真数据作为参考点的kNN定位误差在真实世界样本上减少了三分之一,但仍比纯真实数据的误差高一倍。因此,精确的几何形状和可信的天线模型是必要但不充分的;忠实地捕捉残余的城市噪声仍然是实现可迁移、高保真户外射频仿真的一个开放挑战。

英文摘要

We study the realism of Sionna v1.0.2 ray-tracing for outdoor cellular links in central Rome. We use a real measurement set of 1,664 user-equipments (UEs) and six nominal base-station (BS) sites. Using these fixed positions we systematically vary the main simulation parameters, including path depth, diffuse/specular/refraction flags, carrier frequency, as well as antenna's properties like its altitude, radiation pattern, and orientation. Simulator fidelity is scored for each base station via Spearman correlation between measured and simulated powers, and by a fingerprint-based k-nearest-neighbor localization algorithm using RSSI-based fingerprints. Across all experiments, solver hyper-parameters are having immaterial effect on the chosen metrics. On the contrary, antenna locations and orientations prove decisive. By simple greedy optimization we improve the Spearman correlation by 5% to 130% for various base stations, while kNN-based localization error using only simulated data as reference points is decreased by one-third on real-world samples, while staying twice higher than the error with purely real data. Precise geometry and credible antenna models are therefore necessary but not sufficient; faithfully capturing the residual urban noise remains an open challenge for transferable, high-fidelity outdoor RF simulation.

2603.01250 2026-06-19 cs.CV cs.AI 版本更新

The MAMA-MIA Challenge: Advancing Generalizability and Fairness in Breast MRI Tumor Segmentation and Treatment Response Prediction

MAMA-MIA挑战:推进乳腺MRI肿瘤分割与治疗反应预测的泛化性和公平性

Lidia Garrucho, Smriti Joshi, Kaisar Kushibar, Richard Osuala, Maciej Bobowicz, Xavier Bargalló, Paulius Jaruševičius, Kai Geissler, Raphael Schäfer, Muhammad Alberb, Tony Xu, Anne Martel, Daniel Sleiman, Navchetan Awasthi, Hadeel Awwad, Joan C. Vilanova, Robert Martí, Daan Schouten, Jeong Hoon Lee, Mirabela Rusu, Eleonora Poeta, Luisa Vargas, Eliana Pastor, Maria A. Zuluaga, Jessica Kächele, Dimitrios Bounias, Alexandra Ertl, Katarzyna Gwoździewicz, Maria-Laura Cosaka, Pasant M. Abo-Elhoda, Sara W. Tantawy, Shorouq S. Sakrana, Norhan O. Shawky-Abdelfatah, Amr Muhammad Abdo-Salem, Androniki Kozana, Eugen Divjak, Gordana Ivanac, Katerina Nikiforaki, Michail E. Klontzas, Rosa García-Dosdá, Meltem Gulsun-Akpinar, Oğuz Lafcı, Carlos Martín-Isla, Oliver Díaz, Laura Igual, Karim Lekadir

发表机构 * Barcelona Artificial Intelligence in Medicine Lab (BCN-AIM), Facultat de Matemàtiques i Informàtica, Universitat de Barcelona(巴塞罗那人工智能在医学实验室(BCN-AIM),巴塞罗那大学数学与计算机学院)

AI总结 提出MAMA-MIA挑战,通过标准化基准评估乳腺MRI肿瘤分割和病理完全缓解预测,在跨洲多中心数据上分析模型泛化性与公平性,发现性能与亚组公平性之间存在权衡。

详情
AI中文摘要

乳腺癌是全球女性中最常诊断的恶性肿瘤,也是癌症相关死亡的主要原因之一。动态对比增强磁共振成像在肿瘤表征和治疗监测中发挥核心作用,尤其是接受新辅助化疗的患者。然而,现有的乳腺磁共振成像人工智能模型通常使用异质性数据集、研究人群和评估协议进行开发和评估,使得直接比较困难,并限制了跨机构和临床相关患者亚组的模型鲁棒性理解。MAMA-MIA挑战旨在通过提供标准化基准来解决这些问题,该基准用于联合评估原发性肿瘤分割和仅使用治疗前磁共振成像预测病理完全缓解。训练队列包括来自美国多家机构的1506名患者,而评估则在来自三个独立欧洲中心的574名患者的外部测试集上进行,以评估跨大陆和跨机构的泛化性。统一的评分框架结合了预测性能与年龄、绝经状态和乳腺密度方面的亚组一致性。26个国际团队参加了最终评估阶段。结果表明,在共同的外部评估框架下,性能存在显著差异,并揭示了整体准确性与亚组公平性之间的权衡。该挑战提供了标准化数据集、评估协议和公共资源,以促进开发稳健且公平的乳腺癌影像人工智能系统。

英文摘要

Breast cancer is the most frequently diagnosed malignancy among women worldwide and a leading cause of cancer-related mortality. Dynamic contrast-enhanced magnetic resonance imaging plays a central role in tumor characterization and treatment monitoring, particularly in patients receiving neoadjuvant chemotherapy. However, existing artificial intelligence models for breast magnetic resonance imaging are typically developed and evaluated using heterogeneous datasets, study populations, and assessment protocols, making direct comparison difficult and limiting understanding of model robustness across institutions and clinically relevant patient subgroups. The MAMA-MIA Challenge was designed to address these challenges by providing a standardized benchmark for the joint evaluation of primary tumor segmentation and prediction of pathologic complete response using pre-treatment magnetic resonance imaging only. The training cohort comprised 1,506 patients from multiple institutions in the United States, while evaluation was conducted on an external test set of 574 patients from three independent European centers to assess cross-continental and cross-institutional generalization. A unified scoring framework combined predictive performance with subgroup consistency across age, menopausal status, and breast density. Twenty-six international teams participated in the final evaluation phase. Results demonstrate substantial performance variability under a common external evaluation framework and reveal trade-offs between overall accuracy and subgroup fairness. The challenge provides standardized datasets, evaluation protocols, and public resources to promote the development of robust and equitable artificial intelligence systems for breast cancer imaging.

2604.13416 2026-06-19 cs.CV cs.AI 版本更新

DF3DV-1K: A Large-Scale Dataset and Benchmark for Distractor-Free Novel View Synthesis

DF3DV-1K:用于无干扰新视角合成的大规模数据集与基准

Cheng-You Lu, Yi-Shan Hung, Wei-Ling Chi, Hao-Ping Wang, Charlie Li-Ting Tsai, Yu-Cheng Chang, Yu-Lun Liu, Thomas Do, Chin-Teng Lin

发表机构 * University of Technology Sydney(悉尼科技大学) University of Sydney(悉尼大学) National Yang Ming Chiao Tung University(阳明交通大学)

AI总结 为弥补无干扰辐射场领域缺乏大规模真实世界数据集的空白,构建了包含1048个场景、每场景提供干净和杂乱图像集的DF3DV-1K数据集,并基于此基准测试了九种最新方法,识别出最鲁棒的方法和最具挑战的场景。

详情
AI中文摘要

辐射场领域的进展已实现逼真的新视角合成。在多个领域中,已开发出大规模真实世界数据集以支持全面基准测试并促进超越场景特定重建的进展。然而,对于无干扰辐射场,每个场景同时包含干净和杂乱图像的大规模数据集仍然缺乏,限制了发展。为填补这一空白,我们引入了DF3DV-1K,一个包含1048个场景的大规模真实世界数据集,每个场景提供干净和杂乱的图像集用于基准测试。该数据集总共包含89,924张使用消费级相机拍摄的图像,模拟随意拍摄,涵盖128种干扰类型和161种场景主题,包括室内和室外环境。一个精心挑选的41个场景子集DF3DV-41被系统设计用于评估无干扰辐射场方法在挑战性场景下的鲁棒性。利用DF3DV-1K,我们对九种最新的无干扰辐射场方法和3D高斯泼溅进行了基准测试,识别出最鲁棒的方法和最具挑战的场景。除了基准测试,我们还展示了DF3DV-1K的一个应用:微调基于扩散的2D增强器以改进辐射场方法,在保留集(例如DF3DV-41)和On-the-go数据集上实现了平均0.96 dB PSNR和0.057 LPIPS的提升。我们希望DF3DV-1K能促进无干扰视觉的发展,并推动超越场景特定方法的进步。数据集和排行榜可在以下网址获取:此 https URL。

英文摘要

Advances in radiance fields have enabled photorealistic novel view synthesis. In several domains, large-scale real-world datasets have been developed to support comprehensive benchmarking and to facilitate progress beyond scene-specific reconstruction. However, for distractor-free radiance fields, a large-scale dataset with clean and cluttered images per scene remains lacking, limiting the development. To address this gap, we introduce DF3DV-1K, a large-scale real-world dataset comprising 1,048 scenes, each providing clean and cluttered image sets for benchmarking. In total, the dataset contains 89,924 images captured using consumer cameras to mimic casual capture, spanning 128 distractor types and 161 scene themes across indoor and outdoor environments. A curated subset of 41 scenes, DF3DV-41, is systematically designed to evaluate the robustness of distractor-free radiance field methods under challenging scenarios. Using DF3DV-1K, we benchmark nine recent distractor-free radiance field methods and 3D Gaussian Splatting, identifying the most robust methods and the most challenging scenarios. Beyond benchmarking, we demonstrate an application of DF3DV-1K by fine-tuning a diffusion-based 2D enhancer to improve radiance field methods, achieving average improvements of 0.96 dB PSNR and 0.057 LPIPS on the held-out set (e.g., DF3DV-41) and the On-the-go dataset. We hope DF3DV-1K facilitates the development of distractor-free vision and promotes progress beyond scene-specific approaches. The dataset and leaderboard are available at https://johnnylu305.github.io/df3dv1k_web/.

2605.10873 2026-06-19 cs.CV cs.AI 版本更新

CADBench: A Multimodal Benchmark for AI-Assisted CAD Program Generation

CADBench:一个用于AI辅助CAD程序生成的多模态基准

Anna C. Doris, Jacob Thomas Sony, Ghadi Nehme, Era Syla, Amin Heyrani Nobari, Faez Ahmed

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出CADBench,一个统一的多模态CAD程序生成基准,包含18000个样本和六类基准,评估11种视觉语言模型,揭示了CAD程序生成中的三种常见失败模式。

详情
AI中文摘要

从图像或3D观测中恢复可编辑的CAD程序是AI辅助设计的核心,但进展难以衡量,因为现有评估分散在数据集、模态和指标上。我们引入CADBench,一个统一的多模态CAD程序生成基准。CADBench包含18000个评估样本,涵盖来自DeepCAD、Fusion 360、ABC、MCB和Objaverse的六个基准家族,五种输入模态包括干净的网格、噪声网格、单视图渲染、逼真渲染和多视图渲染,以及六个指标,涵盖几何保真度、可执行性和程序紧凑性。STEP-based家族按B-rep面数分层,所有家族均进行多样性采样,以支持在复杂性和物体变化方面的受控分析。我们评估了11种CAD专用和通用的视觉语言系统,生成超过140万个CAD程序。在理想输入下,专用的网格到CAD模型显著优于代码生成VLMs,后者仍远未可靠。CADBench进一步揭示了三种常见的失败模式:几何复杂性增加时重建质量下降,CAD专用模型在模态转移下可能变得脆弱,且模型排名在不同指标下会变化。这些结果将CADBench定位为衡量可编辑3D重建和多模态CAD理解进展的诊断测试平台。该基准在https://huggingface.co/datasets/DeCoDELab/CADBench上公开可用。

英文摘要

Recovering editable CAD programs from images or 3D observations is central to AI-assisted design, but progress is difficult to measure because existing evaluations are fragmented across datasets, modalities, and metrics. We introduce CADBench, a unified benchmark for multimodal CAD program generation. CADBench contains 18,000 evaluation samples spanning six benchmark families derived from DeepCAD, Fusion 360, ABC, MCB, and Objaverse; five input modalities including clean meshes, noisy meshes, single-view renders, photorealistic renders, and multi-view renders; and six metrics covering geometric fidelity, executability, and program compactness. STEP-based families are stratified by B-rep face count and all families are diversity-sampled to support controlled analysis across complexity and object variation. We benchmark eleven CAD-specialized and general-purpose vision-language systems, generating more than 1.4 million CAD programs. Under idealized inputs, specialized mesh-to-CAD models substantially outperform code-generating VLMs, which remain far from reliable CAD program reconstruction. CADBench further reveals three recurring failure modes: reconstruction quality degrades with geometric complexity, CAD-specialized models can be brittle under modality shift, and model rankings change across metrics. Together, these results position CADBench as a diagnostic testbed for measuring progress in editable 3D reconstruction and multimodal CAD understanding. The benchmark is publicly available at https://github.com/anniedoris/CADBench.

2606.17165 2026-06-19 stat.ME cs.AI econ.EM math.ST stat.TH 版本更新

Statistical Foundations of LLM-based A/B Testing: A Surrogacy Framework for Human Causal Inference

基于LLM的A/B测试的统计基础:用于人类因果推断的替代指标框架

Joel Persson, Mårten Schultzberg, Sebastian Ankargren

发表机构 * Spotify USA, Inc.(Spotify美国公司)

AI总结 提出替代指标理论框架,证明在弱于分布等价条件下,校准LLM输出可识别平均处理效应,并分析随机性带来的偏差与方差。

详情
AI中文摘要

组织和研究者越来越有兴趣在A/B测试中使用大型语言模型(LLM)代替人类参与者,以期更快、更低成本地进行实验。我们研究当在LLM结果上估计的处理效应何时能够恢复在感兴趣的人类群体上测量的效应。LLM与人类结果之间的分布等价性会使任何标准估计量有效,但这不现实。因此,我们开发了一个统计框架,将替代终点理论适配到LLM。该框架表明,将LLM结果校准到人类结果,在替代性和可比性条件(联合弱于分布等价性)下,可以识别平均处理效应。当这些条件不成立时,感兴趣的效应仅部分可识别,我们提供了诊断方法,可以在历史实验上证伪替代性,并给出有限重叠下最坏情况偏差的界限。我们进一步证明,LLM固有的随机性会引入偏差和方差,但使用多次抽取的平均值作为替代指标可以同时缓解两者。我们在模拟和Upworthy标题的A/B测试应用中展示了方法和理论。我们工作的一个核心结论是,LLM结果作为替代指标的有效性只能对过去的处理被证伪,而无法对新处理被验证,因此对于新颖干预,人类实验仍然不可或缺。我们讨论了LLM选择、提示和温度作为设计变量的作用,以及如何确定人类实验的规模以进行验证。

英文摘要

Organizations and researchers show increasing interest in using large language models (LLMs) in place of human participants in A/B tests, in the hope of experimenting faster and at lower cost. We study when a treatment effect estimated on LLM outcomes can recover the effect that would have been measured on the human population of interest. Distributional equivalence between LLM and human outcomes would make any standard estimator valid but is unrealistic. We therefore develop a statistical framework that adapts surrogate endpoint theory to LLMs, showing that calibrating LLM outcomes to human outcomes identifies the average treatment effect under surrogacy and comparability conditions that are jointly weaker than distributional equivalence. We present a falsification test for surrogacy and a bound on the worst-case bias from limited overlap between the LLM and human samples. We further show that the stochasticity inherent to LLMs can weaken surrogacy for identification while also introducing bias and variance during estimation, but that using an average over multiple LLM draws per unit as the surrogate mitigates these issues. Simulations validate the results, and an empirical application to A/B tests on Upworthy headlines shows that raw LLM predictions recover only 39\% of the human treatment effect while nonparametric calibration closes the gap. A central takeaway is that A/B testing on LLMs yields correct results only by assumption, whereas A/B testing on humans is correct by design, and that the required assumptions are hardest to justify precisely where A/B testing on LLMs promises the greatest benefit. We discuss the role of LLM choice, prompting, and temperature as design variables, the compounded challenge posed by long-term outcomes, and how to size human pilot studies for validation.

2606.18613 2026-06-19 cs.CL cs.AI 版本更新

Are LLMs Ready to Assist Physicians? PhysAssistBench for Interactive Doctor-Patient-EHR Assistance

LLMs 是否已准备好辅助医生?PhysAssistBench:交互式医患-电子病历辅助基准

Tianming Du, Peijie Yu, Sihan Shang, Danli Shi, My Linh Nguyen, Shengbo Gao, Guangyuan Li, Yinghong Yu, Yan Jiang, Qianlong Zhao, Behzad Bozorgtabar, Shaoxiong Ji, Jiazhen Pan, Daniel Rueckert, Jiancheng Yang

发表机构 * Aalto University(阿尔托大学) Tencent(腾讯) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学(深圳)) Hong Kong Polytechnic University(香港理工大学) Aarhus University(奥胡斯大学) Technical University of Munich(慕尼黑工业大学)

AI总结 提出PhysAssistBench基准,通过构建交互式患者代理评估LLM在医患-EHR交互中的协调能力,发现当前模型不可靠,瓶颈在于多维度协调而非单一能力。

Comments 34 pages with 8 figures

详情
AI中文摘要

医疗LLM最合理的近期角色是辅助而非替代医生,但当前的评估通常测试孤立能力:临床知识、EHR系统交互或患者沟通。而医生辅助需要在同一交互中协调这些能力,其中医生提出不明确的请求,患者模糊描述症状,EHR系统要求精确的工具使用。我们引入PhysAssistBench,一个用于交互式医患-EHR辅助的基准。基于真实的MIMIC-IV病例,PhysAssistBench使用可扩展的流水线构建交互式、记录驱动的患者代理,将静态EHR记录转化为多轮临床场景,同时保持临床事实准确性。PhysAssistBench提供了一个精选的双语评估集,包含1,296个经过人工审查和医生验证的轮次。与领先LLM的实验表明,当前模型在此设置下仍不可靠,这暴露了临床LLM的关键瓶颈:可靠的辅助需要知识、沟通和系统之间的协调,而非任何单一能力的孤立提升。

英文摘要

The most plausible near-term role of medical LLMs is to assist rather than replace physicians, yet current evaluations often test isolated capabilities: clinical knowledge, EHR system interaction, or patient communication. Physician assistance instead requires coordinating these capabilities within the same interaction, where physicians issue underspecified requests, patients describe symptoms ambiguously, and EHR systems demand precise tool use. We introduce PhysAssistBench, a benchmark for interactive doctor-patient-EHR assistance. Built from real MIMIC-IV cases, PhysAssistBench uses a scalable pipeline to construct agentic patients: interactive, record-grounded agents that turn static EHR records into multi-turn clinical scenarios while preserving clinical factuality. PhysAssistBench provides a curated bilingual evaluation set of 1,296 manually reviewed and physician-validated turns. Experiments with leading LLMs show that current models remain unreliable in this setting, which exposes a key bottleneck for clinical LLMs: reliable assistance requires coordination across knowledge, communication, and systems, not isolated gains in any of them.

2606.18970 2026-06-19 cs.LG cs.AI cs.CV 版本更新

A Controlled Benchmark of Quantum-Latent GAN Augmentation for Brain MRI

脑MRI的量子潜GAN增强的受控基准测试

Syed Mujtaba Haider, Silvia Figini

发表机构 * Department of Mathematics(数学系) Department of Political and Social Sciences(政治与社会科学系)

AI总结 通过受控基准测试,比较量子与经典生成器在脑MRI数据增强中的性能,发现两者均未显著优于仅用真实数据训练,且量子生成器无额外优势。

详情
AI中文摘要

医学图像分类常受限于有限的标注数据,因此生成式增强被提出;最近,量子生成模型被用于此目的,并经常报告准确率提升。然而,这些声称通常基于单次训练运行,未匹配量子与经典生成器的参数预算,也未表征任何收益出现的数据范围。我们提出了一个受控基准测试,隔离量子生成器对脑MRI增强的贡献。图像被编码到KL正则化的潜在空间中,在该空间中,使用变分量子生成器或参数数量几乎相同的经典生成器(1648 vs. 1632)训练带有梯度惩罚的条件Wasserstein GAN。合成样本被解码并用于增强预训练分类器,覆盖从5%到100%的标注数据比例,通过八个随机种子进行配对显著性检验(多重比较校正)以及集内多样性和潜在分布分析。在所有比例下,没有增强变体显著优于仅用真实数据训练,且量子与经典生成器在统计上无法区分。任何低数据优势表现为正则化而非忠实的数据扩展:合成样本分布外移,并且在数据稀缺时严重模式崩溃,而量子生成器并不比经典生成器更多样化。我们发布该协议作为医学成像中量子生成增强严格评估的测试平台。

英文摘要

Medical image classification is often constrained by limited labeled data, motivating generative augmentation; recently, quantum generative models have been proposed for this purpose, frequently reporting accuracy gains. However, such claims are typically based on single training runs, do not match the parameter budgets of the quantum and classical generators, and do not characterize the data regime in which any benefit appears. We present a controlled benchmark that isolates the contribution of a quantum generator to brain-MRI augmentation. Images are encoded into a KL-regularized latent space in which a conditional Wasserstein GAN with gradient penalty is trained using either a variational quantum generator or a classical generator of near-identical parameter count (1648 vs. 1632). Synthetic samples are decoded and used to augment a pretrained classifier across labeled data fractions from 5% to 100%, evaluated over eight random seeds with paired significance testing (with multiple-comparison correction) and with intraset diversity and latent-distribution analyses. Across all fractions, no augmentation variant significantly outperforms real-data-only training, and the quantum and classical generators are statistically indistinguishable. Any low-data benefit behaves as regularization rather than faithful data expansion:synthetic samples are off distribution and severely mode collapsed precisely where data is scarce, and the quantum generator is no more diverse thanits classical counterpart. We release the protocol as a testbed for rigorous evaluation of quantum generative augmentation in medical imaging.

10. AI应用与系统 16 篇

2509.24725 2026-06-19 cs.LG cs.AI 版本更新

Q-Net: Queue Length Estimation via Kalman-based Neural Networks

Q-Net:基于卡尔曼神经网络的队列长度估计

Ting Gao, Elvin Isufi, Winnie Daamen, Erik-Sander Smits, Serge Hoogendoorn

发表机构 * University of Amsterdam(阿姆斯特丹大学) Delft University of Technology(代尔夫特理工大学)

AI总结 本文提出Q-Net框架,通过结合卡尔曼滤波与神经网络,解决信号交叉口队列长度估计中的数据融合问题,提升空间转移性和实时性,实现无需昂贵传感设备的准确队列估计。

Journal ref Transportation Research Part C: Emerging Technologies, Volume 190, September 2026, Article 105809

详情
AI中文摘要

估计信号交叉口的队列长度一直是交通管理中的长期挑战。尽管有两类隐私保护的数据源:(i) 接近停止线的环形检测器提供的车辆计数汇总数据,以及 (ii) 提供路段平均速度测量的汇总浮动汽车数据 (aFCD),但如何将这些具有不同空间和时间分辨率的数据源整合用于队列长度估计仍不清楚。为此,本文提出Q-Net:一种基于状态空间形式的队列估计框架。该设计解决了队列建模中的关键挑战,如违反交通守恒假设。Q-Net遵循卡尔曼预测-更新结构,并在状态演变和测量模型中保持物理可解释性。Q-Net使用AI增强的卡尔曼滤波器从数据中学习时间变化的增益动态。该框架支持实时实现,并通过将aFCD测量分组为固定大小的局部组来提高空间转移性,使可学习参数的数量与路段长度无关。在荷兰 Rotterdam 城市主干道的评估显示,Q-Net优于基线方法,能够准确追踪队列的形成和消散,并缓解aFCD引起的延迟。通过结合数据效率、可解释性、实时适用性和空间转移性,Q-Net在无需昂贵的传感基础设施(如摄像头或雷达)的情况下实现了准确的队列长度估计。

英文摘要

Estimating queue lengths at signalized intersections is a long-standing challenge in traffic management. Partial observability of vehicle flows complicates this task despite the availability of two privacy-preserving data sources: (i) aggregated vehicle counts from loop detectors near stop lines, and (ii) aggregated floating car data (aFCD) that provide segment-wise average speed measurements. However, how to integrate these sources with differing spatial and temporal resolutions for queue length estimation is rather unclear. Addressing this question, we present Q-Net: a queue estimation framework built upon a state-space formulation. This design addresses key challenges in queue modeling, such as violations of traffic conservation assumptions. Q-Net follows the Kalman predict-update structure and maintains physical interpretability in both the state evolution and measurement models. Q-Net uses an AI-augmented Kalman filter to learn time-varying gain dynamics from data. The framework supports real-time implementation and improves spatial transferability by grouping aFCD measurements into fixed-size local groups, making the number of learnable parameters independent of section length. Evaluations on urban main roads in Rotterdam, the Netherlands, show that Q-Net outperforms baseline methods, tracks queue formation and dissipation accurately, and mitigates aFCD-induced delays. By combining data efficiency, interpretability, real-time applicability, and spatial transferability, Q-Net makes accurate queue length estimation possible without costly sensing infrastructure like cameras or radar.

2510.00831 2026-06-19 cs.AI cs.LG eess.SP 版本更新

Controlled Comparison of Machine Learning Models for Fault Classification and Localization in Power System Protection

电力系统保护中故障分类与定位的机器学习模型受控比较

Julian Oelhaf, Georg Kordowich, Changhun Kim, Paula Andrea Pérez-Toro, Christian Bergler, Andreas Maier, Johann Jäger, Siming Bayer

发表机构 * Department of Electrical Engineering, Media and Computer Science, Ostbayerische Technische Hochschule Amberg-Weiden(奥贝格-魏登应用技术大学电气工程、媒体与计算机科学系)

AI总结 在统一电磁暂态数据集和10-50ms决策窗口下,对比机器学习模型在故障分类与定位中的性能,发现分类在10ms时F1>0.98,定位误差稳定在约10%线路长度。

Comments Accepted at IEEE PES Innovative Smart Grid Technologies Europe 2026 (ISGT Europe 2026). Pre-camera-ready author version; final proceedings version may differ

详情
AI中文摘要

现代电力系统因逆变器基和分布式能源的集成而日益复杂,挑战了传统保护方案的可靠性,并推动了机器学习在保护任务中的应用。然而,由于不同研究中的数据集、传感假设和决策时域各异,已发表的结果往往难以比较。本文在相同的传感、时序和验证条件下,基于公共电磁暂态数据集,使用10-50ms的决策窗口以反映保护相关时间尺度,对故障分类(FC)和故障定位(FL)的机器学习模型进行了受控比较。对于FC,性能最佳的非线性模型在10ms时F1分数已超过0.98,而低容量模型在较短时域下性能下降,但随窗口延长而改善,表明相关故障类型信息在最早暂态中已存在。对于FL,顶级模型在所有评估时域下达到约10%归一化线路长度的稳定定位误差,而较弱模型形成明显分离的第二性能层级。线路解析分析显示,定位精度随电网段变化,表明存在拓扑依赖的难度而非仅时间上下文不足。这些发现为比较两个信息需求根本不同的保护任务中的机器学习模型提供了受控参考。

英文摘要

The increasing complexity of modern power systems, driven by the integration of inverter-based and distributed energy resources, challenges the reliability of conventional protection schemes and motivates the use of machine learning for protection tasks. However, published results are often difficult to compare because datasets, sensing assumptions, and decision horizons vary across studies. This paper presents a controlled comparison of machine learning models for fault classification (FC) and fault localization (FL) under identical sensing, timing, and validation conditions on a common electromagnetic transient dataset, using decision windows of 10-50 ms to reflect protection-relevant time scales. For FC, the best-performing nonlinear models achieve F1 scores above 0.98 already at 10 ms, while lower-capacity models degrade at shorter horizons but improve with longer windows, indicating that relevant fault-type information is already present in the earliest transient. For FL, the top-performing models reach a stable localization error of about 10 % of normalized line length across all evaluated horizons, while weaker models form a clearly separated second performance tier. Line-resolved analysis shows that localization accuracy varies across grid segments, indicating topology-dependent difficulty rather than insufficient temporal context alone. These findings provide a controlled reference for comparing machine learning models across two protection tasks with fundamentally different information requirements.

2602.00510 2026-06-19 cs.AI cs.LG cs.SE 版本更新

PCBSchemaGen: Reward-Guided LLM Code Synthesis for Printed Circuit Boards (PCB) Schematic Design with Structured Verification

PCBSchemaGen: 奖励引导的LLM代码合成用于印刷电路板(PCB)原理图设计及结构化验证

Huanghaohe Zou, Peng Han, Emad Nazerian, Mafu Zhang, Zhicheng Guo, Alex Q. Huang

发表机构 * Semiconductor Power Electronics Center (SPEC)(半导体功率电子中心) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Arizona State University(亚利桑那州立大学)

AI总结 提出PCBSchemaGen框架,通过结构化验证器引导冻结的LLM生成可修复的PCB原理图,在无单元测试的领域实现高准确率。

详情
AI中文摘要

大多数LLM代码合成基准依赖于单元测试作为奖励预言,但PCB原理图设计没有这样的测试:正确性由真实IC封装和引脚级分配的结构化物理约束定义,每个任务的金标准参考不可用,且SPICE仿真无法验证原理图级正确性。我们提出PCBSchemaGen,一个无需训练的推理时框架,将冻结的LLM转变为可验证、可修复的PCB原理图生成器。该框架从IC数据手册中提取领域模式以约束LLM解码,将其与一个具有引脚级错误定位的确定性5层连续奖励验证器配对,并通过汤普森采样臂获取赌博机优化候选方案。我们在两个PCB基准上评估,涵盖22个统一电路领域的227个真实IC任务,包括一个从公开原理图导出的套件,作为完全保留的泛化测试(验证器、知识图谱库和提示在评估前冻结)。在我们的框架下,一个开放权重的31B模型(Gemma-4-31B)平均通过PCBBench任务的81.3%,且同一框架在两个基准间迁移时无需更改验证器代码;而基于相同Gemma-4-31B骨干网络的Circuitron式推理时提示基线在困难的系统级设计上崩溃。这表明在确定性结构验证器下的推理时优化是在没有单元测试预言的领域中实现无参考LLM代码合成的一般方法。我们的基准和确定性验证器在此https URL公开可用。

英文摘要

Most LLM code-synthesis benchmarks rely on unit tests as the reward oracle, but PCB schematic design has none: correctness is defined by structured physical constraints over real IC packages and pin-level assignments, per-task golden references are unavailable, and SPICE simulation does not validate schematic-level correctness. We introduce PCBSchemaGen, a training-free inference-time framework that turns a frozen LLM into a verifiable, repairable PCB schematic generator. The framework induces a domain schema from IC datasheets to ground LLM decoding, pairs it with a deterministic 5-layer continuous-reward verifier with pin-level error localization, and refines candidates through a Thompson Sampling arm-acquiring bandit. We evaluate on 2 PCB benchmarks covering 227 real-IC tasks across 22 unified circuit domains, including a public-schematic-derived suite that serves as a fully held-out generalization test (verifier, KG library, and prompts frozen before any evaluation). Under our framework, an open-weight 31B model (Gemma-4-31B) passes 81.3% of PCBBench tasks on average, and the same framework transfers across both benchmarks with zero verifier code changes; a Circuitron-style inference-time prompting baseline on the same Gemma-4-31B backbone collapses on hard system-level designs. This suggests inference-time refinement under a deterministic structural verifier is a general recipe for reference-free LLM code synthesis in domains without unit-test oracles. Our benchmarks and deterministic verifier are publicly available at https://github.com/HZou9/PCBSchemaGen_v2.

2602.07628 2026-06-19 cs.AI cs.LG 版本更新

SleepMaMi: A Universal Sleep Foundation Model for Integrating Macro- and Micro-structures

SleepMaMi:一种融合宏观与微观结构的通用睡眠基础模型

Keondo Park, Younghoon Na, Yourim Choi, Hyunwoo Ryu, Hyun-Woo Shin, Hyung-Sin Kim

发表机构 * Graduate School of Data Science, Seoul National University, Seoul, South Korea(首尔国立大学数据科学研究生院,韩国首尔) Department of Biomedical Sciences, Seoul National University College of Medicine, Seoul, Republic of Korea(首尔国立大学医学院生物医学科学系,韩国首尔) Obstructive Upper Airway Research (OUaR) Laboratory, Department of Pharmacology, Seoul National University College of Medicine, Seoul, Republic of Korea(首尔国立大学医学院药理学系阻塞性上气道研究(OUaR)实验室,韩国首尔) Department of Otorhinolaryngology-Head and Neck Surgery, Seoul National University Hospital, Seoul, Republic of Korea(首尔国立大学医院耳鼻喉头颈外科系,韩国首尔)

AI总结 提出SleepMaMi睡眠基础模型,通过分层双编码器设计(宏观编码器建模整夜时间依赖,微观编码器捕捉生物信号短时特征),结合人口统计引导对比学习和混合掩码自编码器训练,在超过2万条PSG记录上预训练,在下游任务中优于或匹配现有基础模型。

Comments 8 pages, Appendix 9 pages

详情
AI中文摘要

虽然向统一基础模型的转变已经彻底改变了许多深度学习领域,但睡眠医学仍然主要局限于专注于局部微观结构特征的特定任务模型。这些方法常常忽略多导睡眠图(PSG)丰富的多模态背景,并且未能捕捉整夜睡眠的全局宏观结构。为了解决这个问题,我们引入了SleepMaMi,一种睡眠基础模型,旨在掌握长达一小时的睡眠架构和细粒度信号形态。我们的框架采用分层双编码器设计:宏观编码器用于建模整夜时间依赖,微观编码器用于从生物信号中捕捉短期特征。宏观编码器通过人口统计引导对比学习进行训练,该学习将夜间睡眠模式与客观受试者元数据(如年龄、性别和BMI)对齐,以优化全局表示。微观编码器通过混合掩码自编码器(MAE)和多模态对比目标进行优化。在超过20,000条PSG记录(158K小时)的大规模语料库上预训练,SleepMaMi在多样化的下游任务套件中优于或匹配现有的最先进基础模型,展示了在临床睡眠分析中卓越的泛化能力和标签高效适应能力。

英文摘要

While the shift toward unified foundation models has revolutionized many deep learning domains, sleep medicine remains largely restricted to task-specific models that focus on localized micro-structure features. These approaches often neglect the rich, multi-modal context of Polysomnography (PSG) and fail to capture the global macro-structure of a full night's sleep. To address this, we introduce SleepMaMi , a Sleep Foundation Model engineered to master both hour-long sleep architectures and fine-grained signal morphologies. Our framework utilizes a hierarchical dual-encoder design: a Macro-Encoder to model full-night temporal dependencies and a Micro-Encoder to capture short-term characteristics from biosignals. Macro-Encoder is trained via Demographic-Guided Contrastive Learning, which aligns overnight sleep patterns with objective subject metadata, such as age, sex and BMI to refine global representations. Micro-Encoder is optimized via a hybrid Masked Autoencoder (MAE) and multi-modal contrastive objective. Pre-trained on a massive corpus of $>$20,000 PSG recordings (158K hours),SleepMaMi outperforms or matches state-of-the-art existing foundation models across a diverse suite of downstream tasks, demonstrating superior generalizability and label-efficient adaptation for clinical sleep analysis.

2605.27864 2026-06-19 cs.AI 版本更新

FundaPod: A Multi-Persona Agent Pod Platform with Knowledge Graph Memory for AI-Assisted Fundamental Investment Research

FundaPod: 一个具有知识图谱记忆的多角色智能体平台,用于AI辅助的基础投资研究

Di Zhu, Lei Nico Zheng, Zihan Chen

发表机构 * Stevens Institute of Technology(史蒂文斯理工学院) UMass Boston(马萨诸塞大学波士顿分校)

AI总结 提出FundaPod平台,通过多角色独立研究、知识图谱记忆和事后裁决机制,支持人类投资经理进行透明、可验证的基础投资决策。

Comments 32 pages; 12 figures

详情
AI中文摘要

大型语言模型(LLMs)在金融领域的应用日益增多,但现有工作大多强调交易信号或围绕预测的金融自然语言处理任务。相比之下,机构基础研究需要人类分析师或AI智能体收集证据、识别业务驱动因素、比较竞争观点并生成投资备忘录。其更广泛的目标不仅是预测结果,而是产生透明、可重用和可验证的投资计划,同时促进投资知识的累积发展。我们提出了FundaPod,一个用于AI辅助基础投资研究的多角色智能体平台。我们认为基础研究是一项以人为中心的决策支持任务,在本质上与交易信号生成不同,因此更适合采用保持独立性的架构。在FundaPod中,具有不同角色(如价值投资者或宏观策略师)的AI智能体在共享溯源契约下独立进行研究。他们的分歧随后通过知识图谱记忆系统事后呈现,供人类投资组合经理(PM)裁决。本文基于设计科学实践以及认知隔离和人机协调理论,提出了支持基础研究的人机混合系统的五项设计原则。它还描述了四种架构机制:将公开投资者资料转化为可部署智能体的角色提炼管道;允许规划器推导类型化任务图的声明式技能注册表;将备忘录声明与可验证来源联系起来的基于证据的模型;以及连接股票代码、备忘录、分析师和主题的知识图谱“第二大脑”。我们通过一个完整的案例研究和基于角色的备忘录比较来展示该架构。

英文摘要

Large language models (LLMs) are increasingly applied in finance, yet most existing work emphasizes trading signals or financial NLP tasks centered on prediction. Institutional fundamental research, by contrast, requires human analysts or AI agents to gather evidence, identify business drivers, compare competing viewpoints, and generate investment memos. Its broader goal is not merely to predict outcomes, but to produce investment plans that are transparent, reusable, and verifiable, while contributing to the cumulative development of investment knowledge. We present FundaPod, a multi-persona agent platform for AI-assisted fundamental investment research. We argue that fundamental research is a human-centric decision-support task that is qualitatively distinct from trading-signal generation, and is therefore better served by an independence-preserving architecture. In FundaPod, AI agents with different personas, such as value investors or macro strategists, conduct research independently under a shared provenance contract. Their disagreements are then surfaced post hoc for adjudication by the human portfolio manager (PM) through a knowledge-graph memory system. This paper contributes five design principles for human-AI hybrid systems supporting fundamental research, grounded in design-science practice and theories of cognitive isolation and human-machine coordination. It also describes four architectural mechanisms: a persona distillation pipeline that turns public investor materials into deployable agents; a declarative skill registry that lets the planner derive typed task graphs; a grounded evidence model that links memo claims to verifiable sources; and a knowledge-graph "second brain" that connects tickers, memos, analysts, and themes. We demonstrate the architecture through a complete case study and a persona-based memo comparison.

2605.29483 2026-06-19 cs.AI 版本更新

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

VitalAgent: 一种工具增强型代理,用于对可穿戴健康数据进行反应性和主动式生理监测

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed, Vassilis Kostakos, Ting Dang

发表机构 * The University of Melbourne, Australia(墨尔本大学) Dartmouth College, US(达特茅斯学院) University of Auckland, New Zealand(奥克兰大学) Eindhoven University of Technology, Netherlands(埃因霍温理工大学)

AI总结 提出VitalAgent框架,通过工具增强推理和纵向生理记忆,实现对ECG/PPG信号的反应性问答与主动监测,在VitalBench基准上相比基线提升超30%。

Comments Minor revisions; results unchanged

详情
AI中文摘要

可穿戴设备能够连续监测ECG和PPG等生理信号,但现有的移动健康系统大多局限于特定任务的预测管道或对静态摘要的反应性问答。它们缺乏支持时间推理、持久生理上下文以及对长期信号流进行主动监测的能力。我们提出VitalAgent,一个基于ECG/PPG的移动健康工具增强型代理框架,支持反应性问答和主动监测。VitalAgent建立在纵向生理记忆和工具增强推理接口之上,能够对原始信号进行动态计算。我们进一步引入VitalBench,一个纵向生理监测基准数据集,包含用于反应性问答的1,862个问答对和用于主动监测的90.2小时连续ECG/PPG记录,涵盖心脏、身体活动和压力相关任务。实验表明,VitalAgent在反应性评估中相比基于提示和ReAct的基线实现了超过30%的提升,并支持对长期生理信号的主动警报监测,突显了动态工具使用和长期生理监测的重要性。

英文摘要

Wearable devices enable continuous monitoring of physiological signals such as ECG and PPG, but existing mHealth systems are largely limited to task-specific prediction pipelines or reactive question answering over static summaries. They lack the ability to support temporal reasoning, persistent physiological context, and proactive monitoring over long-term signal streams. We propose VitalAgent, a tool-augmented agentic framework for ECG/PPG-based mHealth that supports both reactive question answering and proactive monitoring. VitalAgent is built on a longitudinal physiological memory and a tool-augmented reasoning interface that enables dynamic computation over raw signals. We further introduce VitalBench, a longitudinal physiological monitoring benchmark dataset comprising 1,862 QA pairs for reactive question answering and 90.2 hours of continuous ECG/PPG recordings for proactive monitoring, covering cardiac, physical activity, and stress-related tasks. Experiments demonstrate that VitalAgent achieves over 25% improvement over prompt-based and ReAct baselines in reactive evaluation and supports proactive alert monitoring over long-term physiological signals, highlighting the importance of dynamic tool use and long-term physiological monitoring.

2502.06866 2026-06-19 cs.LG cs.AI econ.EM stat.AP stat.ML 版本更新

Global Ease of Living Index: a machine learning framework for longitudinal analysis of major economies

全球生活便利指数:面向主要经济体纵向分析的机器学习框架

Arun Kumar Selvaraj, Tanay Panat, Rohitash Chandra

发表机构 * Transitional Artificial Intelligence Research Group, School of Mathematics and Statistics(过渡人工智能研究组,数学与统计学学院) Centre for Artificial Intelligence and Innovation(人工智能与创新中心) Pingla Institute(Pingla研究所)

AI总结 提出全球生活便利指数,结合社会经济和基础设施因素,利用机器学习处理缺失数据,并通过主成分分析和因子分析降维,为政策制定者提供改善生活质量的可操作工具。

详情
AI中文摘要

全球经济、地缘政治条件以及COVID-19疫情等破坏性事件对生活成本和生活质量产生了巨大影响。理解主要经济体中生活成本和生活质量的长期影响至关重要。一个透明且全面的生活指数必须包含生活条件的多个维度。在本研究中,我们提出了一种通过全球生活便利指数量化生活质量的方法,该指数将各种社会经济和基础设施因素整合为一个单一综合得分。我们的指数利用定义生活水平的经济指标,这有助于针对特定领域进行干预改进。我们提出了一个机器学习框架来处理特定国家某些经济指标的数据缺失问题。然后,我们整理并更新数据,并使用降维方法(主成分分析和因子分析)创建自1970年以来主要经济体的生活便利指数。我们的工作通过为政策制定者提供识别需要改进领域(如医疗系统、就业机会和公共安全)的实用工具,显著丰富了相关文献。我们的方法使用开放数据和代码,易于复现并适用于各种情境,为生活质量评估的持续研究和政策制定提供了透明度和可访问性。

英文摘要

The drastic changes in the global economy, geopolitical conditions, and disruptions such as the COVID-19 pandemic have impacted the cost of living and quality of life. It is essential to comprehend the long-term implications of the cost of living and quality of life in major economies. A transparent and comprehensive living index must include multiple dimensions of living conditions. In this study, we present an approach to quantifying the quality of life through the Global Ease of Living Index that combines various socio-economic and infrastructural factors into a single composite score. Our index utilises economic indicators that define living standards, which could help in targeted interventions to improve specific areas. We present a machine learning framework to address missing data for certain economic indicators in specific countries. We then curate and update the data and use a dimensionality reduction approach (Principal Component Analysis and Factor Analysis) to create the Ease of Living Index for major economies since 1970. Our work significantly adds to the literature by offering a practical tool for policymakers to identify areas needing improvement, such as healthcare systems, employment opportunities, and public safety. Our approach with open data and code can be easily reproduced and applied to various contexts, providing transparency and accessibility for ongoing research and policy development in quality-of-life assessment.

2506.01678 2026-06-19 cond-mat.mtrl-sci cs.AI 版本更新

Overcoming Labelled Data Scarcity for Defect Classification in Scanning Tunneling Microscopy

克服扫描隧道显微镜缺陷分类中的标注数据稀缺问题

Nikola L. Kolev, Max Trouton, Filippo Federici Canova, Geoff Thornton, David Z. Gao, Neil J. Curson, Taylor J. Z. Stock

发表机构 * London Centre for Nanotechnology, University College London(伦敦纳米技术中心,伦敦大学学院) Department of Electronic and Electrical Engineering, University College London(电子与电气工程系,伦敦大学学院) Department of Physics and Astronomy, University College London(物理与天文学系,伦敦大学学院) Department of Chemistry, University College London(化学系,伦敦大学学院) Aalto Science Institute, School of Science, Aalto University(艾尔沃斯科学研究所,艾尔沃斯大学) Nanolayers Research Computing LTD, London, UK(纳米层研究计算有限公司,伦敦,英国) Department of Physics, NTNU Norwegian University of Science and Technology(物理系,挪威科技大学)

AI总结 提出结合少样本学习和无监督学习的自动分割方法,在仅需少量标注数据下实现高精度STM图像缺陷分类,并在三种表面验证了强泛化能力。

详情
AI中文摘要

扫描隧道显微镜(STM)是一种以原子分辨率对表面成像的强大技术,可深入理解单原子和分子层面的物理化学过程。STM图像分析的一项常规任务是在均匀背景中识别和标记感兴趣的特征。手动执行此操作是一项劳动密集型工作,需要大量人力。为减轻这一负担,我们提出了一种自动化的STM图像分割方法,该方法同时使用少样本学习和无监督学习。与之前的监督方法相比,我们的技术提供了更大的灵活性;它消除了对大型手动标注数据集的需求,因此更容易适应未见过的表面,同时仍保持高精度。我们通过使用该方法识别三种不同表面上的原子特征来展示其有效性:Si(001)、Ge(001)和TiO$_2$(110),包括吸附在硅和锗表面上的AsH$_3$分子。我们的模型表现出强大的泛化能力,在初始训练后,仅需一个额外的标注数据点即可适应未见过的表面。这项工作朝着高效且与材料无关的STM图像自动分割迈出了重要一步。

英文摘要

Scanning tunnelling microscopy (STM) is a powerful technique for imaging surfaces with atomic resolution, providing insight into physical and chemical processes at the level of single atoms and molecules. A regular task of STM image analysis is the identification and labelling of features of interest against a uniform background. Performing this manually is a labour-intensive task, requiring significant human effort. To reduce this burden, we propose an automated approach to the segmentation of STM images that uses both few-shot learning and unsupervised learning. Our technique offers greater flexibility compared to previous supervised methods; it removes the requirement for large manually annotated datasets and is thus easier to adapt to an unseen surface while still maintaining a high accuracy. We demonstrate the effectiveness of our approach by using it to recognise atomic features on three distinct surfaces: Si(001), Ge(001), and TiO$_2$(110), including adsorbed AsH$_3$ molecules on the silicon and germanium surfaces. Our model exhibits strong generalisation capabilities, and following initial training, can be adapted to unseen surfaces with as few as one additional labelled data point. This work is a significant step towards efficient and material-agnostic, automatic segmentation of STM images.

2511.08378 2026-06-19 cs.IR cs.AI 版本更新

Bid Farewell to Seesaw: Towards Accurate Long-tail Session-based Recommendation via Dual Constraints of Hybrid Intents

告别跷跷板:通过混合意图的双重约束实现准确的长期会话推荐

Xiao Wang, Ke Qin, Dongyang Zhang, Xiurui Xie, Shuang Liang

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 针对会话推荐中长尾分布导致准确性与多样性冲突的跷跷板问题,提出混合意图双重约束框架HID,通过属性感知谱聚类重构意图映射并区分噪声意图,结合多样性与准确性约束损失,实现长尾与准确性的双赢。

Comments accepted by AAAI 2026 Oral

详情
AI中文摘要

基于会话的推荐(SBR)旨在根据用户的交互会话预测匿名用户的下一次交互。在实际推荐场景中,低曝光物品构成了交互的大部分,形成长尾分布,严重损害了推荐多样性。现有方法试图通过提升尾部物品来解决这一问题,但会导致准确性下降,在长尾与准确性性能之间表现出“跷跷板”效应。我们将这种冲突归因于尾部物品中的会话无关噪声,而现有的长尾方法未能有效识别和约束这些噪声。为了解决这一根本冲突,我们提出了HID(混合意图双重约束框架),这是一个即插即用的框架,通过引入基于混合意图的双重约束,将传统的“跷跷板”转变为“双赢”,同时提升长尾和准确性性能。该框架包含两个关键创新:(i)混合意图学习,我们通过采用属性感知谱聚类重构物品到意图的映射,重新制定了意图提取策略。此外,通过为每个会话分配目标意图和噪声意图,实现了会话无关噪声的区分。(ii)意图约束损失,它引入了两种关于多样性和准确性的新约束范式,以调节物品和会话的表示学习过程。通过严格的理论推导,这两个目标被统一到单个训练损失中。在多个SBR模型和数据集上的大量实验表明,HID能够同时提升长尾性能和推荐准确性,在长尾推荐系统中建立了新的最先进性能。

英文摘要

Session-based recommendation (SBR) aims to predict anonymous users' next interaction based on their interaction sessions. In the practical recommendation scenario, low-exposure items constitute the majority of interactions, creating a long-tail distribution that severely compromises recommendation diversity. Existing approaches attempt to address this issue by promoting tail items but incur accuracy degradation, exhibiting a "see-saw" effect between long-tail and accuracy performance. We attribute such conflict to session-irrelevant noise within the tail items, which existing long-tail approaches fail to identify and constrain effectively. To resolve this fundamental conflict, we propose \textbf{HID} (\textbf{H}ybrid \textbf{I}ntent-based \textbf{D}ual Constraint Framework), a plug-and-play framework that transforms the conventional "see-saw" into "win-win" through introducing the hybrid intent-based dual constraints for both long-tail and accuracy. Two key innovations are incorporated in this framework: (i) \textit{Hybrid Intent Learning}, where we reformulate the intent extraction strategies by employing attribute-aware spectral clustering to reconstruct the item-to-intent mapping. Furthermore, discrimination of session-irrelevant noise is achieved through the assignment of the target and noise intents to each session. (ii) \textit{Intent Constraint Loss}, which incorporates two novel constraint paradigms regarding the \textit{diversity} and \textit{accuracy} to regulate the representation learning process of both items and sessions. These two objectives are unified into a single training loss through rigorous theoretical derivation. Extensive experiments across multiple SBR models and datasets demonstrate that HID can enhance both long-tail performance and recommendation accuracy, establishing new state-of-the-art performance in long-tail recommender systems.

2601.00014 2026-06-19 eess.SP cs.AI cs.LG 版本更新

Modeling Day-Long ECG Signals to Predict Heart Failure Risk with Explainable AI

建模全天心电图信号以可解释人工智能预测心力衰竭风险

Eran Zvuloni, Ronit Almog, Michael Glikson, Shany Brimer Biton, Ilan Green, Izhar Laufer, Offer Amir, Joachim A. Behar

发表机构 * Leumit Health Services(Leumit健康服务)

AI总结 提出DeepHHF深度学习模型,利用24小时单导联心电图数据预测五年内心力衰竭风险,AUC达0.80,优于短时片段和临床评分,可解释性分析显示模型关注心律失常和心脏异常。

详情
AI中文摘要

心力衰竭(HF)影响11.8%的65岁及以上成年人,降低生活质量和寿命。预防HF可降低发病率和死亡率。我们假设将人工智能(AI)应用于24小时单导联心电图(ECG)数据可预测五年内HF风险。为此,使用了Technion-Leumit Holter ECG(TLHE)数据集,包括20年间收集的47,729名患者的69,663条记录。我们的深度学习模型DeepHHF在24小时ECG记录上训练,实现了0.80的受试者工作特征曲线下面积,优于使用30秒片段和临床评分的模型。DeepHHF识别的高风险个体住院或死亡事件概率翻倍。可解释性分析显示DeepHHF关注心律失常和心脏异常。本研究强调了深度学习建模24小时连续ECG数据的可行性,捕捉了对可靠风险预测至关重要的阵发性事件。应用于单导联Holter ECG的人工智能无创、廉价且广泛可及,使其成为HF风险预测的有前景工具。

英文摘要

Heart failure (HF) affects 11.8% of adults aged 65 and older, reducing quality of life and longevity. Preventing HF can reduce morbidity and mortality. We hypothesized that artificial intelligence (AI) applied to 24-hour single-lead electrocardiogram (ECG) data could predict the risk of HF within five years. To research this, the Technion-Leumit Holter ECG (TLHE) dataset, including 69,663 recordings from 47,729 patients, collected over 20 years was used. Our deep learning model, DeepHHF, trained on 24-hour ECG recordings, achieved an area under the receiver operating characteristic curve of 0.80 that outperformed a model using 30-second segments and a clinical score. High-risk individuals identified by DeepHHF had a two-fold chance of hospitalization or death incidents. Explainability analysis showed DeepHHF focused on arrhythmias and heart abnormalities. This study highlights the feasibility of deep learning to model 24-hour continuous ECG data, capturing paroxysmal events essential for reliable risk prediction. Artificial intelligence applied to single-lead Holter ECG is non-invasive, inexpensive, and widely accessible, making it a promising tool for HF risk prediction.

2601.02149 2026-06-19 cond-mat.mes-hall cond-mat.dis-nn cs.AI 版本更新

AI-enhanced tuning of quantum dot Hamiltonians toward Majorana modes

基于人工智能的量子点哈密顿量调优以实现马约拉纳模式

Mateusz Krawczyk, Jarosław Pawłowski

发表机构 * Institute of Theoretical Physics, Wrocław University of Science and Technology(理论物理研究所,沃林大学技术学院)

AI总结 本文提出基于神经网络的模型,通过学习量子点模拟器的工作区域,利用输运测量自动调优设备以获得马约拉纳模式。模型在无监督条件下训练于导电图合成数据,采用融合马约拉纳零模关键性质的物理引导损失函数。

Comments 12 pages, 8 figures, 2 tables

Journal ref Phys. Rev. Applied 25, 064032 (2026)

详情
AI中文摘要

我们提出了一种基于神经网络的模型,能够学习量子点模拟器广泛的工作区域,并利用此知识通过输运测量自动调优这些设备,以在结构中获得马约拉纳模式。模型在无监督条件下训练于导电图合成数据,采用融合马约拉纳零模关键性质的物理引导损失函数。我们展示了通过适当训练,深度视觉变换器网络可以高效记忆哈密顿量参数与导电图之间的关系,并利用此提出量子点链参数更新,驱动系统进入拓扑相。从参数空间的广泛初始调谐范围开始,单步更新足以生成非平凡零模。此外,通过启用迭代调优过程——系统在每一步获得更新的导电图——我们证明该方法可以处理参数空间更大的区域。

英文摘要

We propose a neural network-based model capable of learning the broad landscape of working regimes in quantum dot simulators, and using this knowledge to autotune these devices - based on transport measurements - toward obtaining Majorana modes in the structure. The model is trained in an unsupervised manner on synthetic data in the form of conductance maps, using a physics-informed loss that incorporates key properties of Majorana zero modes. We show that, with appropriate training, a deep vision-transformer network can efficiently memorize relation between Hamiltonian parameters and structures on conductance maps and use it to propose parameters update for a quantum dot chain that drive the system toward topological phase. Starting from a broad range of initial detunings in parameter space, a single update step is sufficient to generate nontrivial zero modes. Moreover, by enabling an iterative tuning procedure - where the system acquires updated conductance maps at each step - we demonstrate that the method can address a much larger region of the parameter space.

2604.08552 2026-06-19 cs.DB cs.AI 版本更新

Automated Standardization of Legacy Biomedical Metadata Using an Ontology-Constrained LLM Agent

使用本体约束的LLM代理自动化标准化遗留生物医学元数据

Josef Hardi, Martin J. O'Connor, Marcos Martinez-Romero, Jean G. Rosario, Stephen A. Fisher, Mark A. Musen

发表机构 * Division of Computational Medicine, Stanford University(斯坦福大学计算医学部) Department of Biology, University of Pennsylvania(宾夕法尼亚大学生物学系)

AI总结 提出基于LLM的元数据标准化系统,通过实时查询标准指南和本体服务,在839条HuBMAP记录上验证,相比纯LLM方法显著提升预测准确性。

详情
AI中文摘要

科学元数据通常不完整且不符合社区标准,限制了数据集的可发现性、互操作性和重用。即使存在标准元数据报告指南,它们通常缺乏机器可操作的表征。生成FAIR数据集需要将元数据标准编码为具有丰富字段规范和精确值约束的机器可操作模板。最近的研究表明,由字段名称和本体约束引导的LLM可以改善元数据标准化,但这些方法将约束视为静态文本提示,仅依赖模型的训练知识。我们提出了一种基于LLM的元数据标准化系统,该系统实时查询标准报告指南和权威生物医学术语服务,以按需检索规范正确的标准。我们在来自人类生物分子图谱计划(HuBMAP)的839条遗留元数据记录上评估了该方法,使用专家策划的金标准进行精确匹配评估。我们的评估表明,与仅使用LLM相比,通过实时工具访问增强LLM在受本体约束和不受本体约束的字段上均持续提高了预测准确性,展示了一种实用的生物医学元数据自动化标准化方法。

英文摘要

Scientific metadata are often incomplete and noncompliant with community standards, limiting dataset findability, interoperability, and reuse. Even when standard metadata reporting guidelines exist, they typically lack machine-actionable representations. Producing FAIR datasets requires encoding metadata standards as machine-actionable templates with rich field specifications and precise value constraints. Recent work has shown that LLMs guided by field names and ontology constraints can improve metadata standardization, but these approaches treat constraints as static text prompts, relying on the model's training knowledge alone. We present an LLM-based metadata standardization system that queries standard reporting guidelines and authoritative biomedical terminology services in real time to retrieve canonically correct standards on demand. We evaluate this approach on 839 legacy metadata records from the Human BioMolecular Atlas Program (HuBMAP) using an expert-curated gold standard for exact-match assessment. Our evaluation shows that augmenting the LLM with real-time tool access consistently improves prediction accuracy over the LLM alone across both ontology-constrained and non-ontology-constrained fields, demonstrating a practical approach to automated standardization of biomedical metadata.

2604.11556 2026-06-19 cs.SE cs.AI 版本更新

FM-Agent: Scaling Formal Methods to Large Systems via LLM-Based Hoare-Style Reasoning

FM-Agent: 通过基于LLM的Hoare风格推理将形式化方法扩展到大型系统

Haoran Ding, Zhaoguo Wang, Haibo Chen

发表机构 * Institute of Parallel and Distributed Systems, Shanghai Jiao Tong University(并行与分布式系统研究所,上海交通大学)

AI总结 提出FM-Agent框架,利用LLM自动生成函数级规范,实现大型系统的组合式推理,在143k行代码的系统中2天内发现522个新bug。

详情
AI中文摘要

LLM辅助的软件开发已日益普遍,并能生成如编译器这样的大型系统。增强生成代码的正确性变得至关重要。然而,由于代码复杂性,大型系统的自动推理仍然具有挑战性。Hoare逻辑提供了一种将大型系统分解为较小组件并分别推理(即组合式推理)的方法。然而,现有工作仍难以扩展,因为Hoare逻辑要求为每个函数编写形式化规范,给人类带来沉重负担。当代码由LLM生成时,问题更加严重,因为开发人员缺乏对每个函数预期行为的深入理解。本文提出FM-Agent,这是第一个实现大型系统自动化组合式推理的框架。利用LLM,FM-Agent引入了一种自顶向下的范式来自动生成函数级规范。具体来说,FM-Agent从调用者期望函数如何行为中推导出函数的规范,因此即使实现有缺陷,生成的规范也能反映开发者的意图。开发者的意图通常用自然语言表达,而现有的验证器只支持公式。因此,FM-Agent推广了Hoare风格推理,以针对自然语言规范推理函数。最后,为了确认错误存在并解释错误原因,FM-Agent自动生成测试用例以触发潜在错误。在我们的评估中,FM-Agent在2天内成功推理了大型系统,每个系统最多有143k行代码。这些系统已经由开发者测试过,但FM-Agent仍然发现了522个新错误。这些错误可能导致严重后果,包括系统崩溃和错误的执行结果。

英文摘要

LLM-assisted software development has become increasingly prevalent, and can generate large-scale systems, such as compilers. It becomes crucial to strengthen the correctness of the generated code. However, automated reasoning for large-scale systems remains challenging due to code complexity. Hoare logic offers an approach to decomposing a large system into smaller components and reasoning about them separately (i.e., compositional reasoning). However, existing works still struggle to scale, because Hoare logic requires writing formal specifications for each function, imposing a heavy human burden. The problem is exacerbated when code is generated by LLMs, as developers lack a deep understanding of each function's expected behavior. This paper presents FM-Agent, the first framework that realizes automated compositional reasoning for large-scale systems. Leveraging LLMs, FM-Agent introduces a top-down paradigm to automatically generate function-level specifications. Specifically, FM-Agent derives the specification of a function from how its callers expect the function to behave, so the generated specifications can reflect the developer's intent of a function even if the implementation is buggy. Developers' intent is usually expressed in natural language, while existing verifiers only support formulas. Therefore, FM-Agent generalizes Hoare-style inference to reason about functions against natural-language specifications. Finally, to confirm bug existence and explain bug causes, FM-Agent automatically generates test cases to trigger potential bugs. In our evaluation, FM-Agent successfully reasons about large-scale systems within 2 days, each of which has up to 143k LoC. These systems have already been tested by their developers, but FM-Agent still finds 522 newly discovered bugs. These bugs can cause serious consequences, including system crashes and incorrect execution results.

2606.12500 2026-06-19 cs.LG cs.AI 版本更新

Improving Crash Frequency Prediction from Simulated Traffic Conflicts Using Machine Learning Based Microsimulation

基于机器学习的微观仿真从模拟交通冲突改进碰撞频率预测

Xian Liu, Carlo G. Prato, Gustav Markkula

AI总结 本文利用机器学习行为模型替代传统规则模型进行交通微观仿真,通过极端值理论分析模拟冲突预测碰撞频率,在英国利兹五个信号交叉口验证了ML模型无需地点校准即可提升预测准确性。

详情
AI中文摘要

交通微观仿真结合替代安全措施越来越多地被用作历史碰撞数据的主动替代方案,用于预测当前或计划道路基础设施设计的碰撞频率。然而,现有的基于微观仿真的安全研究采用了简化的基于规则的行为模型,这些模型能较好地再现交通流,但往往无法生成真实的冲突动态,限制了碰撞预测的准确性。机器学习(ML)行为模型的最新进展提供了一个有希望的机会,通过直接从大规模轨迹数据集中学习人类驾驶行为,可能提高微观仿真的真实性和碰撞频率预测。为了研究这种可能性,我们对英国利兹的五个真实信号交叉口进行了交通微观仿真,使用了标准的基于规则模型和最先进的ML模型。使用二维碰撞时间指标分析模拟车辆轨迹以识别模拟冲突,然后使用极端值理论建模以预测碰撞频率。结果表明,ML模型的冲突产生的碰撞预测与实际碰撞数据一致,而基于规则的模型由于缺乏对特定模拟交叉口的模型校准,无法产生有意义的预测。直接使用ML生成的模拟碰撞来预测实际碰撞频率也产生了较差的结果,这表明尽管当前的ML模型可以真实地再现冲突,但尚不能生成真实的碰撞。总体而言,研究结果表明,基于ML的行为模型在无需特定地点模型校准的情况下,有望从模拟冲突中改进碰撞预测,并为基于ML的交通微观仿真指明了明确的未来方向。

英文摘要

Traffic microsimulation combined with surrogate safety measures has increasingly been used as a proactive alternative to historical crash data for predicting crash frequency for current or planned road infrastructure designs. However, existing microsimulation-based safety studies have adopted simplified rule-based behaviour models, which reproduce traffic flow reasonably well but often fail to generate realistic conflict dynamics, limiting crash prediction accuracy. Recent advances in machine learning (ML)-based behaviour models offer a promising opportunity to potentially improve microsimulation realism and crash frequency predictions by learning human driving behaviour directly from large-scale trajectory datasets. To investigate this possibility, traffic microsimulation was conducted for five real-world signalised intersections in Leeds, UK, using both a standard rule-based model and a state-of-the-art ML model. Simulated vehicle trajectories were analysed using a two-dimensional Time-to-Collision metric to identify simulated conflicts, which were then modelled using Extreme Value Theory to predict crash frequency. Results show that conflicts from the ML model yielded crash predictions in line with the real-world crash data, whereas the rule-based model did not permit meaningful predictions, presumably due to a lack of model calibration to the specific simulated intersections. Directly using ML-generated simulated crashes to predict real-world crash frequency also yielded poor results, suggesting that while current ML models can realistically reproduce conflicts, they are not yet able to generate realistic crashes. Overall, the findings demonstrate that ML-based behaviour models are promising for improving crash prediction from simulated conflicts, without a need for location-specific model calibration, and suggest clear future directions for ML-based traffic microsimulation.

2606.13794 2026-06-19 eess.SY cs.AI cs.RO cs.SY 版本更新

An integrated interpretable control effectiveness learning and nonlinear control allocation methodology for overactuated aircrafts

过驱动飞行器的可解释控制效能学习与非线性控制分配集成方法

Umut Demir, Aamir Ahmad, Walter Fichter

发表机构 * University of Stuttgart, Faculty of Aerospace Engineering and Geodesy, Institute of Flight Mechanics and Control (iFR)(斯图加特大学航空航天工程与大地测量学院飞行力学与控制研究所)

AI总结 提出一种基于稀疏非线性动力学辨识的学习控制效能映射方法,结合在线自适应机制,实现过驱动飞行器的高效非线性控制分配,兼具可解释性和低计算成本。

详情
AI中文摘要

非线性动力学以及多个执行器之间产生的强耦合削弱了传统线性控制分配技术背后的假设。当飞行进入非线性效应主导的模态时,线性分配器因模型失配增加而精度下降,进而降低飞行控制系统的性能和鲁棒性。高保真机载模型和黑箱数据驱动方法可以在整个飞行包线内恢复精度,但分别带来实时分配难以承受的计算负担,并牺牲了验证和故障诊断所需的可解释性。本文通过使用稀疏非线性动力学辨识从代表性飞行数据中学习显式的、受物理约束的控制效能映射解析模型,解决了这些限制。所得映射紧凑、可解释,并允许解析导数,从而能够在非线性求解器中高效计算,同时额外包含执行器动力学,无需机载模型。在线自适应机制监控预测残差,并在检测到显著对象变化时刷新模型,从而在执行器故障和变化工况下提供平滑重构。该方法在一款高保真非线性基准飞行器上经过一系列激进机动评估,达到了与完整非线性机载模型相当的精度,同时相对于现有基线显著降低了计算成本。

英文摘要

Nonlinear dynamics and the strong couplings that arise between multiple effectors undermine the assumptions behind conventional, linear control allocation techniques. When flight enters regimes where nonlinear effects dominate, linear allocators exhibit reduced accuracy due to increased model mismatch, which subsequently degrades performance and robustness of the flight control system. High fidelity onboard models and black box data driven approaches can recover accuracy across the flight envelope, but respectively impose computational burdens prohibitive for real time allocation and sacrifice the interpretability required for verification and fault diagnosis. This paper addresses these limitations by learning an explicit, physics constrained analytical model of the control effectiveness mapping from representative flight data using Sparse Identification of Nonlinear Dynamics. The resulting mapping is compact, interpretable, and admits analytical derivatives, enabling efficient computation within nonlinear solvers that additionally incorporate actuator dynamics, without requiring an onboard model. An online adaptation mechanism monitors prediction residuals and refreshes the model when significant plant changes are detected, providing graceful reconfiguration under actuator failures and varying operating conditions. The methodology is evaluated on a high fidelity nonlinear benchmark aircraft across a range of aggressive maneuvers, achieving accuracy comparable to a full nonlinear onboard model while substantially reducing computational cost relative to established baselines.

2606.18611 2026-06-19 cs.SD cs.AI cs.LG stat.ML 版本更新

QC-GAN: A Parameter-Efficient Quaternion Conformer GAN for High-Fidelity Speech Enhancement

QC-GAN: 一种参数高效的四元数Conformer GAN用于高保真语音增强

Shogo Yamauchi, Hideaki Tamori, Makoto Sakai, Yosuke Yamano, Tohru Nitta

发表机构 * The Asahi Shimbun Company(朝日新闻社) Tokyo Woman's Christian University(东京女子基督教大学)

AI总结 提出参数高效的QC-GAN,结合四元数Conformer生成器和MetricGAN训练,通过汉密尔顿积共享权重减少参数量,在VoiceBank+DEMAND上以0.89M参数达到PESQ 3.48,性能媲美两倍大小模型。

Comments 10 pages, 6 figures and 5 tables. Accepted at Interspeech2026

详情
AI中文摘要

我们提出了一种参数高效的语音增强框架——四元数Conformer GAN(QC-GAN),它将四元数Conformer生成器与基于MetricGAN的训练相结合。汉密尔顿积通过结构化权重共享对幅度和相位进行编码,在减少层参数数量的同时保持其相互依赖性。采用度量学习判别器,通过优化近似感知评估分数来最大化感知质量。在VoiceBank+DEMAND数据集上,QC-GAN仅用0.89M参数就达到了3.48的语音质量感知评估(PESQ)分数,其性能与最先进模型相当,而参数量不到后者的一半。一个35K参数的变体实现了3.23的PESQ分数,以显著更少的参数超越了传统方法。在DNS-Challenge 3数据集上的评估进一步证实了其在真实世界条件下的泛化能力。

英文摘要

We propose a parameter-efficient speech enhancement framework, Quaternion Conformer GAN (QC-GAN), which combines a Quaternion Conformer generator with MetricGAN-based training. The Hamilton product encodes the magnitude and phase via structured weight sharing, reducing the number of layer parameters while preserving their interdependencies. A metric-learning discriminator was employed to maximize perceptual quality by optimizing the approximate perceptual evaluation scores. On the VoiceBank+DEMAND dataset, QC-GAN achieved a Perceptual Evaluation of Speech Quality (PESQ) score of 3.48 with only 0.89M parameters, delivering a performance comparable to state-of-the-art models at less than half their size. A 35K-parameter variant achieved a PESQ score of 3.23, surpassing conventional methods with significantly fewer parameters. Evaluation on the DNS-Challenge 3 dataset further confirmed generalization to real-world conditions.

11. 其他/综合AI 11 篇

2601.15797 2026-06-19 cs.AI 版本更新

Creativity Reconsidered: Generative AI and the Problem of Intentional Agency

重新思考创造力:生成式AI与意向能动性问题

James S. Pearson, Matthew J. Dennis, Marc Cheong

发表机构 * University of Amsterdam(阿姆斯特丹大学) University of Lisbon(里斯本大学) TU Eindhoven(埃因霍温理工大学) University of Melbourne(墨尔本大学)

AI总结 本文质疑意向能动性是创造力的必要条件,基于生成式AI的创造力表现,提出创造力归因依赖于“创造能力”,从而在不要求意向能动性的前提下解释AI的创造力。

Comments 27 pages, 2 figures

详情
AI中文摘要

许多理论家认为,有意识的意向能动性是创造力的必要条件。我们认为,这一要求(称为意向能动性条件,IAC)应当被放弃。我们通过强调该标准在面对生成式AI最新进展时遇到的问题来论证这一点,生成式AI尽管缺乏意向能动性,却显然具有创造力。我们呈现两项语料库分析,以说明人们将创造力归因于生成式AI的迅速增长趋势。针对这一困境,创造力理论家提出了一系列相互矛盾的解决方案,我们对其进行了批判性评估。我们发现,这些方案均未能令人满意地解决初始困境,因此我们提出了一种新方法。我们的主张是,创造力的归因依赖于我们所谓的创造能力。这一解决方案解释了为什么意向能动性对创造力判断很重要,但并非必要条件。因此,我们的方法在不忽视感知意图对创造力归因至关重要的直觉的情况下,容纳了AI的创造力。

英文摘要

Many theorists maintain that conscious intentional agency is a necessary condition of creativity. We argue that this requirement, which we call the Intentional Agency Condition (IAC), should be abandoned. We motivate this by highlighting the problems this criterion encounters in the face of recent advances in generative AI, which is ostensibly creative despite being incapable of intentional agency. We present two corpus analyses to illustrate the rapidly increasing tendency of people to predicate creativity to generative AI. In response to this predicament, theorists of creativity have proposed a range of conflicting solutions, which we critically evaluate. We find that none of these satisfyingly resolves the initial predicament, and we therefore propose a novel approach. Our claim is that ascriptions of creativity are dependent on what we call creative ability. This solution explains why intentional agency is important for judgements of creativity, without being a necessary condition. Our approach thereby accommodates AI creativity without dismissing the intuition that perceived intentions are of key importance for ascriptions of creativity.

2606.01316 2026-06-19 cs.AI 版本更新

Science Earth: Towards A Planet-Scale Operating System for AI-Native Scientific Discovery

Science Earth: 迈向面向AI原生科学发现的行星级操作系统

Zhe Zhao, Haibin Wen, Yingcheng Wu, Jiaming Ma, Yifan Wen, Jinglin Jian, Jiacheng Ge, Xiangru Tang, Bo An, Ming Yin, Sanfeng Wu, Mengdi Wang, Le Cong

发表机构 * Department of Pathology, Department of Genetics, Stanford University School of Medicine(病理学系、遗传学系,斯坦福大学医学院) Princeton AI Lab, Department of Electrical & Computer Engineering, Princeton University(普林斯顿人工智能实验室、电气与计算机工程系,普林斯顿大学) Scripps Research, La Jolla, CA, USA(斯克里普斯研究机构,洛杉矶,加利福尼亚州,美国) Division of Biostatistics, Department of Population Health, New York University Grossman School of Medicine(生物统计学部、人口健康系,纽约大学格罗斯曼医学院) College of Computing and Data Science, Nanyang Technological University(计算与数据科学学院,南洋理工大学) Department of Computer Science, Yale University(计算机科学系,耶鲁大学) Department of Physics, Princeton University(物理系,普林斯顿大学)

AI总结 提出Science Earth行星级科学运行时,通过EACN协议实现AI能力动态连接与自组织协作,在跨太平洋Kuramoto同步研究和单细胞分析中验证了分布式自校正科学推理。

Comments Withdrawn by the authors. (1) The author list and authorship roles had not been finalized and agreed upon by all listed authors prior to submission. (2) The specific contribution of the system in the K3 synchronization example (Section on Kuramoto/nonlinear physics) requires further validation before it can be reported. The authors are addressing both points and may resubmit a corrected version.

详情
AI中文摘要

科学发现需要在广阔的搜索空间中运用智能、毅力和偶然性。如今,顶尖科学能力仍然孤立——一个AI系统用于生物分析,另一个用于临床推理、数学推导或材料模拟——并且没有预设计的团队能够预见一个问题所需的所有技能。Science Earth是一个行星级科学运行时,其中任何能力——模拟集群、湿实验室机器人、证明引擎、单细胞管道——都可以相互连接,协作结构由问题本身涌现。其底层EACN协议让能力能够相互发现、协商任务所有权,并在不相容的证据标准之间进行裁决,而无需事先知道谁将遇见谁。这将组织挑战从工作流设计转向开放式连接。两次运行在结构不同的条件下验证了这一点。在一项跨太平洋高阶Kuramoto同步研究中,智能体在30分钟内识别并纠正了Ott-Antonsen解析理论中一个在洛伦兹极限外失效的闭合比率假设。在针对488万细胞Kang 2024泛癌图谱的八智能体单细胞运行中,异质能力在64.9小时窗口内耦合,仅有一条结构外部指令,产生了三个新的结果层,并将发现与一项关于相邻CCR8- TIGIT+ Treg亚群的独立湿实验室研究进行锚定。这些案例是首次实证读数,而非基准测试。它们表明,当AI能力真正可连接且协调从问题中涌现时,科学推理成为一个分布式、自校正的过程——这是向行星级AI原生发现迈出的一步。

英文摘要

Scientific discovery demands intelligence, perseverance, and serendipity across vast search spaces. Today, top scientific capabilities remain siloed--one AI system for biological analysis, another for clinical reasoning, mathematical derivation, or materials simulation--and no pre-designed team can anticipate every skill a question will need. Science Earth is a planet-scale scientific runtime in which any capability--a simulation cluster, a wet-lab robot, a proof engine, a single-cell pipeline--can connect to any other, with collaboration structure emerging from the question itself. Its underlying EACN protocol lets capabilities discover one another, negotiate task ownership, and adjudicate across incompatible evidentiary standards without prior knowledge of who will meet whom. This shifts the organizing challenge from workflow design to open-ended connectivity. Two runs validate this under structurally distinct conditions. In a trans-Pacific higher-order Kuramoto synchronization study, agents identified and corrected a closure-ratio assumption in Ott-Antonsen analytic theory that fails outside the Lorentzian limit, within thirty minutes. In an eight-agent single-cell run on the 4.88M-cell Kang 2024 pan-cancer atlas, heterogeneous capabilities coupled over a 64.9-hour window with one structural external instruction, producing three new result layers and anchoring findings against an independent wet-lab study on an adjacent CCR8- TIGIT+ Treg subset. These cases are a first empirical reading, not a benchmark sweep. They show that when AI capabilities are truly connectable and coordination emerges from the problem, scientific reasoning becomes a distributed, self-correcting process--a step towards scaling AI-native discovery to the planet.

2507.05169 2026-06-19 cs.LG cs.AI cs.CL cs.CV cs.RO 版本更新

Critique of World Model

世界模型批判:一种用于世界建模的生成式潜在预测架构

Eric Xing, Mingkai Deng, Jinyu Hou

AI总结 本文从心理学“假设性思维”出发,提出世界模型的核心目标是模拟真实世界的所有可行动可能性,并设计了一种基于状态化、分层、多级、混合连续/离散表示的生成式潜在预测(GLP)架构。

详情
AI中文摘要

世界模型,即生物智能体所经历并对其采取行动的真实世界环境的算法模拟器,近年来因开发具有人工(通用)智能的虚拟智能体的需求日益增长而成为一个新兴课题。关于世界模型究竟是什么、如何构建、如何使用以及如何评估,已有许多讨论。本文从著名科幻经典《沙丘》中的想象出发,并借鉴心理学文献中“假设性思维”的概念,论证世界模型的主要目标是模拟真实世界中所有可行动的可能性,以进行有目的的推理和行动。我们审视了世界建模的关键设计维度:数据、表示、架构、学习目标和使用,调查了现有方法并分析了它们的权衡。在此基础上,我们提出了一种新的通用世界模型生成式潜在预测(GLP)架构,基于有状态的、分层的、多层次的、混合连续/离散表示,以及生成式和自监督学习框架,并展望了由这种模型支持的物理、智能体和嵌套(PAN)AGI系统。

英文摘要

World Model, the algorithmic simulator of the real-world environment which biological agents experience and act upon, has been an emerging topic in recent years due to the rising need to develop virtual agents with artificial (general) intelligence. There has been much discussion on what a world model really is, how to build it, how to use it, and how to evaluate it. In this essay, starting from the imagination in the famed Sci-Fi classic Dune, and drawing inspiration from the concept of ``hypothetical thinking'' in psychology literature, we argue the primary goal of a world model to be {\it simulating all actionable possibilities of the real world for purposeful reasoning and acting}. We examine the key design dimensions of world modeling: data, representation, architecture, learning objective, and usage, surveying existing approaches and analyzing their tradeoffs. Building on this examination, we propose a new Generative Latent Prediction (GLP) architecture for a general-purpose world model, based on stateful, hierarchical, multi-level, and mixed continuous/discrete representations, and a generative and self-supervised learning framework, with an outlook of a Physical, Agentic, and Nested (PAN) AGI system enabled by such a model.

2509.02581 2026-06-19 cs.DL cs.AI 版本更新

Charting the Future of Scholarly Knowledge with AI: A Community Perspective

用AI绘制学术知识的未来:社区视角

Azanzi Jiomekong, Hande Küçük McGinty, Keith G. Mills, Allard Oelen, Enayat Rajabi, Harry McElroy, Antrea Christou, Anmol Saini, Janice Anta Zebaze, Hannah Kim, Anna M. Jacyszyn, Gollam Rabby, Dirk Betz, Claudia Biniossek, Sanju Tiwari, Sören Auer

发表机构 * TIB Leibniz Information Centre for Science and Technology(蒂宾根莱比锡科学与技术信息中心) Department of Computer Science, University of Yaounde 1(亚奥内1大学计算机科学系) Department of Computer Science, Kansas State University(堪萨斯州立大学计算机科学系) School of EECS, Louisiana State University(路易斯安那州立大学电子工程与计算机科学学院) Management Science Department, Cape Breton University(cape breton 大学管理科学系) Department of Development and Research, Performigence(Performigence 发展与研究部) Department of Engineering and Computer Science, Wright State University(怀特州立大学工程与计算机科学系) Department of Physics, University of Yaounde 1(亚奥内1大学物理系) FIZ Karlsruhe, Leibniz Institute for Information Infrastructure(卡尔斯鲁厄莱比锡信息基础设施研究所) Sharda University, Delhi-NCR, India(德里-纳尔默德印度大学) L3S Research Center, Leibniz University of Han(汉莱比锡大学L3S研究中心)

AI总结 本文从社区视角出发,识别促进跨学科对话、共享挑战、分类新合作并塑造学术知识组织未来研究方向的方法。

Comments 39 pages, 3 figures

详情
AI中文摘要

尽管支持学术知识提取和组织的工具日益普及,许多研究人员仍依赖手动方法,有时是因为对现有技术不熟悉或缺乏领域适应性解决方案。同时,跨学科学术出版物的快速增长使得跟上最新进展越来越困难,进一步凸显了对可扩展的、基于AI的方法来结构化和综合学术知识的需求。各个研究社区已开始独立应对这一挑战,开发旨在构建可靠、动态且可查询的学术知识库的工具和框架。然而,这些社区之间的有限互动阻碍了方法、模型和最佳实践的交流,减缓了向更集成解决方案的进展。本文确定了促进跨学科对话、识别共同挑战、分类新合作并塑造学术知识组织未来研究方向的方法。

英文摘要

Despite the growing availability of tools designed to support scholarly knowledge extraction and organization, many researchers still rely on manual methods, sometimes due to unfamiliarity with existing technologies or limited access to domain-adapted solutions. Meanwhile, the rapid increase in scholarly publications across disciplines has made it increasingly difficult to stay current, further underscoring the need for scalable, AI-enabled approaches to structuring and synthesizing scholarly knowledge. Various research communities have begun addressing this challenge independently, developing tools and frameworks aimed at building reliable, dynamic, and queryable scholarly knowledge bases. However, limited interaction across these communities has hindered the exchange of methods, models, and best practices, slowing progress toward more integrated solutions. This manuscript identifies ways to foster cross-disciplinary dialogue, identify shared challenges, categorize new collaboration and shape future research directions in scholarly knowledge and organization.

2603.16648 2026-06-19 cs.AI 版本更新

Domain-Independent Dynamic Programming with Constraint Propagation

Imko Marijnissen, J. Christopher Beck, Emir Demirović, Ryo Kuroiwa

发表机构 * Imko Marijnissen 1 J. Christopher Beck 2 Emir Demirović 1 Ryo Kuroiwa 3, 4

Comments 13 pages. To appear at the 36th International Conference on Automated Planning and Scheduling (ICAPS 2026)

Journal ref Proceedings of the International Conference on Automated Planning and Scheduling (2026) | Volume 36(1) | Pages 171-180

详情
英文摘要

There are two prevalent model-based paradigms for combinatorial problems: 1) state-based representations, such as heuristic search, dynamic programming (DP), and decision diagrams, and 2) constraint and domain-based representations, such as constraint programming (CP), (mixed-)integer programming, and Boolean satisfiability. In this paper, we bridge the gap between the DP and CP paradigms by integrating constraint propagation into DP, enabling a DP solver to prune states and transitions using constraint propagation. To this end, we implement constraint propagation using a general-purpose CP solver in the Domain-Independent Dynamic Programming framework and evaluate using heuristic search on three combinatorial optimisation problems: Single Machine Scheduling with Time Windows, the Resource Constrained Project Scheduling Problem (RCPSP), and the Travelling Salesperson Problem with Time Windows (TSPTW). Our evaluation shows that constraint propagation significantly reduces the number of state expansions, causing our approach to solve more instances than a DP solver for Single Machine Scheduling and RCPSP, and showing similar improvements for tightly constrained TSPTW instances. The runtime performance indicates that the benefits of propagation outweigh the overhead for constrained instances, but that further work into reducing propagation overhead could improve performance further. Our work is a key step in understanding the value of constraint propagation in DP solvers, providing a model-based approach to integrating DP and CP.

2602.05416 2026-06-19 cs.CE cs.AI cs.LG physics.ao-ph physics.flu-dyn 版本更新

Reduced-Order Surrogates for Forced Flexible Mesh Coastal-Ocean Models

Freja Høgholm Petersen, Jesper Sandvig Mariegaard, Rocco Palmitessa, Allan P. Engsig-Karup

发表机构 * DTU(技术大学)

Comments Submitted for peer-review in a journal. v2: revised version submitted to journal after minor revisions

详情
英文摘要

While proper orthogonal decomposition (POD)-based surrogates are widely explored for hydrodynamic applications, the use of Koopman autoencoders for real-world coastal-ocean modelling remains relatively limited. This paper introduces a flexible Koopman autoencoder formulation that incorporates meteorological forcings and boundary conditions, and systematically compares its performance against POD-based surrogates. The Koopman autoencoder employs a learned linear temporal operator in latent space, enabling eigenvalue regularization to promote temporal stability. This strategy is evaluated alongside temporal unrolling techniques for achieving stable and accurate long-term predictions. The models are assessed on three test cases spanning distinct dynamical regimes, with prediction horizons up to one year at 30-minute temporal resolution. Across all cases, the reduced order surrogates with temporal unrolling achieve high accuracy with relative root-mean-squared-errors of 0.0068-0.14 and $R^2$-values of 0.61-0.995, where prediction errors are largest for current velocities, and smallest for water surface elevations. In two of the three cases, the Koopman Autoencoder have higher accuracy than the POD-based surrogates. Comparing to in-situ observations, the surrogate yields -0.64% to 12% increase in water surface elevation prediction error when compared to prediction errors of the physics-based model. These error levels, corresponding to a few centimeters, are acceptable for many practical applications, while inference speed-ups of 300-1400x enables workflows such as ensemble forecasting and long climate simulations for coastal-ocean modelling.

2511.23071 2026-06-19 cs.CV cs.AI cs.CL 版本更新

Bharat Scene Text: A Novel Comprehensive Dataset and Benchmark for Indian Language Scene Text Understanding

Anik De, Abhirama Subramanyam Penamakuri, Rajeev Yadav, Aditya Rathore, Harshiv Shah, Devesh Sharma, Sagar Agarwal, Pravin Kumar, Anand Mishra

发表机构 * Indian Institute of Technology Jodhpur(印度理工学院朱道尔)

Comments Accepted in International Journal on Document Analysis and Recognition (IJDAR)

Journal ref International Journal on Document Analysis and Recognition (IJDAR), 2026

详情
英文摘要

Reading scene text, that is, text appearing in images, has numerous application areas, including assistive technology, search, and e-commerce. Although scene text recognition in English has advanced significantly and is often considered nearly a solved problem, Indian language scene text recognition remains an open challenge. This is due to script diversity, non-standard fonts, and varying writing styles, and, more importantly, the lack of high-quality datasets and open-source models. To address these gaps, we introduce the Bharat Scene Text Dataset (BSTD) - a large-scale and comprehensive benchmark for studying Indian Language Scene Text Recognition. It comprises more than 100K words that span 11 Indian languages and English, sourced from over 6,500 scene images captured across various linguistic regions of India. The dataset is meticulously annotated and supports multiple scene text tasks, including: (i) Scene Text Detection, (ii) Script Identification, (iii) Cropped Word Recognition, and (iv) End-to-End Scene Text Recognition. We evaluated state-of-the-art models originally developed for English by adapting (fine-tuning) them for Indian languages. Our results highlight the challenges and opportunities in Indian language scene text recognition. We believe that this dataset represents a significant step toward advancing research in this domain. All our models and data are open source.

2602.14239 2026-06-19 cs.SI cs.AI cs.LG 版本更新

A Hybrid TGN-SEAL Model for Dynamic Graph Link Prediction

Nafiseh Sadat Sajadi, Behnam Bahrak, Mahdi Jafari Siavoshani

发表机构 * Department of Computer Engineering, Sharif University of Technology(谢尔万大学计算机工程系) Tehran Institute for Advanced Studies, Khatam University(泰赫兰高级研究院,卡塔姆大学)

Journal ref EPJ Data Science (2026)

详情
英文摘要

Predicting links in sparse, continuously evolving networks is a central challenge in network science. Conventional heuristic methods and deep learning models, including Graph Neural Networks (GNNs), are typically designed for static graphs and thus struggle to capture temporal dependencies. Snapshot-based techniques partially address this issue but often encounter data sparsity and class imbalance, particularly in networks with transient interactions such as telecommunication call detail records (CDRs). Temporal Graph Networks (TGNs) model dynamic graphs by updating node embeddings over time; however, their predictive accuracy under sparse conditions remains limited. In this study, we improve the TGN framework by extracting enclosing subgraphs around candidate links, enabling the model to jointly learn structural and temporal information. Experiments on a sparse CDR dataset show that our approach increases average precision by 2.6% over standard TGNs, demonstrating the advantages of integrating local topology for robust link prediction in dynamic networks.

2510.24435 2026-06-19 cs.AI 版本更新

Human-Level Reasoning: A Comparative Study of Large Language Models on Logical and Abstract Reasoning

Benjamin Grando Moreira

发表机构 * Universidade Federal de Santa Catarina(联邦圣卡塔琳娜大学)

Comments 12 pages

Journal ref Proceedings of the 2026 Computer on the Beach

详情
英文摘要

Evaluating reasoning ability in Large Language Models (LLMs) is important for advancing artificial intelligence, as it transcends mere linguistic task performance. It involves understanding whether these models truly understand information, perform inferences, and are able to draw conclusions in a logical and valid way. This study compare logical and abstract reasoning skills of several LLMs - including GPT, Claude, DeepSeek, Gemini, Grok, Llama, Mistral, Perplexity, and Sabiá - using a set of eight custom-designed reasoning questions. The LLM results are benchmarked against human performance on the same tasks, revealing significant differences and indicating areas where LLMs struggle with deduction.

2507.23027 2026-06-19 cs.CV cs.AI 版本更新

Recovering Diagnostic Value: Super-Resolution-Aided Echocardiographic Classification in Resource-Constrained Imaging

Krishan Agyakari Raja Babu, Om Prabhu, Annu, Mohanasankar Sivaprakasam

发表机构 * Indian Institute of Technology Madras(印度理工学院马德拉斯分校) All India Institute of Medical Sciences(全印度医学科学研究所) Indian Institute of Technology Hyderabad(印度理工学院海得拉巴分校)

Comments Accepted at the MICCAI Workshop on "Medical Image Computing in Resource Constrained Settings & Knowledge Interchange (MIRASOL)" 2025

详情
英文摘要

Automated cardiac interpretation in resource-constrained settings (RCS) is often hindered by poor-quality echocardiographic imaging, limiting the effectiveness of downstream diagnostic models. While super-resolution (SR) techniques have shown promise in enhancing magnetic resonance imaging (MRI) and computed tomography (CT) scans, their application to echocardiography-a widely accessible but noise-prone modality-remains underexplored. In this work, we investigate the potential of deep learning-based SR to improve classification accuracy on low-quality 2D echocardiograms. Using the publicly available CAMUS dataset, we stratify samples by image quality and evaluate two clinically relevant tasks of varying complexity: a relatively simple Two-Chamber vs. Four-Chamber (2CH vs. 4CH) view classification and a more complex End-Diastole vs. End-Systole (ED vs. ES) phase classification. We apply two widely used SR models-Super-Resolution Generative Adversarial Network (SRGAN) and Super-Resolution Residual Network (SRResNet), to enhance poor-quality images and observe significant gains in performance metric-particularly with SRResNet, which also offers computational efficiency. Our findings demonstrate that SR can effectively recover diagnostic value in degraded echo scans, making it a viable tool for AI-assisted care in RCS, achieving more with less.

2406.15465 2026-06-19 cs.CL cs.AI 版本更新

RadEx: A Framework for Structured Information Extraction from Radiology Reports based on Large Language Models

Daniel Reichenpfader, Jonas Knupp, André Sander, Kerstin Denecke

发表机构 * Institute for Patient-centered Digital Health, Bern University of Applied Sciences, Biel, Switzerland(以患者为中心的数字健康研究所,伯恩应用科学大学,比尔,瑞士) ID Suisse AG, St. Gallen, Switzerland(ID瑞士股份有限公司,圣加尔,瑞士)

详情
英文摘要

Annually and globally, over three billion radiography examinations and computer tomography scans result in mostly unstructured radiology reports containing free text. Despite the potential benefits of structured reporting, its adoption is limited by factors such as established processes, resource constraints and potential loss of information. However, structured information would be necessary for various use cases, including automatic analysis, clinical trial matching, and prediction of health outcomes. This study introduces RadEx, an end-to-end framework comprising 15 software components and ten artifacts to develop systems that perform automated information extraction from radiology reports. It covers the complete process from annotating training data to extracting information by offering a consistent generic information model and setting boundaries for model development. Specifically, RadEx allows clinicians to define relevant information for clinical domains (e.g., mammography) and to create report templates. The framework supports both generative and encoder-only models and the decoupling of information extraction from template filling enables independent model improvements. Developing information extraction systems according to the RadEx framework facilitates implementation and maintenance as components are easily exchangeable, while standardized artifacts ensure interoperability between components.