arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Agents, Planning, and Decision-Making (5 papers)

2606.18746 2026-06-18 cs.AI New submission

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

Institutions: Carnegie Mellon University; Georgia Institute of Technology

AI summary: The paper formally argues that, to act near-optimally across multiple environments and goals, a generalist agent must store domain-relevant information that distinguishes incompatible optimal actions at an observational bottleneck, and it proves that memory can be used to reconstruct local transition dynamics.

Abstract

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.
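
The separation claim can be illustrated with a toy case. The sketch below is not the paper's construction: the two domains, the reward of 1 for the optimal action, and all names are assumptions chosen for illustration. It compares the best worst-case reward of a policy that sees only the shared bottleneck observation with that of a policy that conditions on a stored domain cue.

```python
# Toy illustration only (hypothetical setup, not the paper's formal model).
# Two domains share one bottleneck observation but need different actions.
OPTIMAL_ACTION = {"A": "left", "B": "right"}


def memoryless_worst_case(p_left: float) -> float:
    """Worst-case expected reward over both domains for a policy that sees
    only the shared observation and picks "left" with probability p_left."""
    reward_a = p_left        # domain A pays 1 for "left"
    reward_b = 1.0 - p_left  # domain B pays 1 for "right"
    return min(reward_a, reward_b)


def memory_policy(domain_cue: str) -> str:
    """Policy that conditions on a domain cue stored in memory earlier."""
    return OPTIMAL_ACTION[domain_cue]


# Best a memoryless policy can guarantee, searched over a grid of p_left.
best_memoryless = max(memoryless_worst_case(i / 100) for i in range(101))

# Worst-case reward of the memory-based policy across the two domains.
memory_worst_case = min(
    1.0 if memory_policy(d) == OPTIMAL_ACTION[d] else 0.0
    for d in OPTIMAL_ACTION
)
```

In this toy setting no memoryless policy guarantees more than half the optimal reward in both domains, while the policy that keeps the domain cue is optimal in each.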

2606.18888 2026-06-18 cs.AI New submission

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

Institutions: University of Manchester; Aalto University

AI summary: Proposes BeliefDiffusion, a framework that combines diffusion models with model predictive control to explicitly model multimodal belief distributions and plan ahead; in synthetic map environments it significantly outperforms model-free reinforcement learning and generative baselines.

Abstract

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.

2606.18947 2026-06-18 cs.AI cs.CL cs.IR cs.MA New submission

Decoupling Search from Reasoning: A Vendor-Agnostic Grounding Architecture for LLM Agents

Emmanuel Aboah Boateng, Kyle MacDonald, Amardeep Kumar, Siddharth Kodwani, Sudeep Das

Institutions: DoorDash, Inc.

AI summary: Proposes Decoupled Search Grounding (DSG), an architecture that separates search grounding from the reasoning model and exposes provider routing, caching, and other controls through an MCP-compatible gateway, reducing cost and latency while maintaining or improving accuracy.

Comments: 15 pages, Figure 8

Abstract

Production LLM agents increasingly depend on real-time search, yet native search grounding bundles retrieval policy, provider choice, evidence injection, cost, latency, and generation behavior behind a single model-provider boundary. This coupling makes grounding hard to inspect, tune, reuse, or port, and can trigger Search-Induced Verbosity that breaks strict output contracts. We present Decoupled Search Grounding (DSG), a vendor-agnostic boundary that moves grounding outside the reasoning model through an MCP-compatible gateway, exposing provider routing, source-aware context rendering, configured fallback, retrieval-depth control, and exact plus semantic caching as first-class controls. Across five frontier models on SimpleQA, FreshQA, and HotpotQA, native search leads on recency-sensitive FreshQA, but DSG exposes a stronger frontier when control matters: on SimpleQA it nearly matches native accuracy (86.1% vs. 87.7%) at 91% lower search cost, preserves concise answer contracts, and reaches a 99.4% warm-cache hit rate with 68% lower latency. Deployed as a shared production grounding layer for large-scale agentic workloads with interchangeable models, DSG matches or slightly exceeds native-search accuracy on an e-commerce query-understanding (QIU) workload while cutting search cost by over 98%. Real-time grounding is best treated as an optimizable interface boundary, not a fixed model feature.
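
To make the "exact plus semantic caching" control concrete, here is a minimal sketch of a two-tier cache placed in front of a search provider. It is not DSG's implementation: `GroundingCache`, `search_fn`, the 0.7 threshold, and the token-overlap (Jaccard) score standing in for embedding similarity are all assumptions made for illustration.

```python
from typing import Callable, Dict, Tuple


def jaccard(a: str, b: str) -> float:
    """Token-set overlap; a crude stand-in for embedding similarity."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


class GroundingCache:
    """Hypothetical two-tier cache: exact match first, then similarity."""

    def __init__(self, search_fn: Callable[[str], str], threshold: float = 0.7):
        self.search_fn = search_fn  # assumed provider call: query -> evidence
        self.threshold = threshold  # assumed similarity cutoff
        self.entries: Dict[str, str] = {}
        self.provider_calls = 0

    def lookup(self, query: str) -> Tuple[str, str]:
        # Normalize case and whitespace so trivial variants hit exactly.
        key = " ".join(query.lower().split())
        if key in self.entries:
            return self.entries[key], "exact"
        # Tier 2: reuse evidence cached for a sufficiently similar query.
        best = max(self.entries, key=lambda k: jaccard(k, key), default=None)
        if best is not None and jaccard(best, key) >= self.threshold:
            return self.entries[best], "semantic"
        # Miss: pay for a real provider search and remember the result.
        self.provider_calls += 1
        self.entries[key] = self.search_fn(query)
        return self.entries[key], "miss"


cache = GroundingCache(lambda q: f"evidence for: {q}")
statuses = [
    cache.lookup("capital of France")[1],
    cache.lookup("Capital  of France")[1],
    cache.lookup("the capital of France")[1],
    cache.lookup("weather in Tokyo")[1],
]
```

Keeping the cache outside the model is what makes hit rate and provider cost observable as separate quantities (`statuses` and `provider_calls` here), which is the kind of control the abstract argues native search grounding hides.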

2606.19116 2026-06-18 cs.AI cs.CY New submission

Towards an Agent-First Web: Redesigning the Web for AI Agents

Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

Institutions: Old Dominion University; AI Motion Labs; Florida International University; Accenture Technology Labs; Nanyang Technological University; University of Colombo; Center for Wireless Communications, University of Oulu; McDonald Army Health Center

AI summary: Proposes redesign principles across three layers to address the access, economic, and content problems the web faces when AI agents act as intermediaries: an access layer (agents inherit the access rights of the humans they act for), an economic layer (an intent-based, token-metered subscription model), and a content layer (the ATML markup language and a cryptographic provenance chain).

Abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

2606.19144 2026-06-18 cs.AI cs.CL New submission

Human-AI Coevolution Dynamics: A Formal Theory of Social Intelligence Emergence Through Long-Term Interaction

Jingyi Zhou, Senlin Luo, Haofan Chen

AI summary: Proposes the Human-AI Coevolution Dynamics framework (HACD-H), which integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical model. Evaluation on a dialogue dataset of approximately 14,700 turns finds a significant negative correlation between social intelligence and social cognitive energy, suggesting that social intelligence emerges from long-term coevolution.

Abstract

Current conversational AI systems have made significant progress in language generation, personalization, and long-context interaction. However, most existing methods model social behavior through isolated components such as emotion modeling, memory retrieval, or persona conditioning, lacking a unified framework to explain the emergence of stable social relationships and social intelligence in long-term human-AI interaction. To address this, we propose the Human-AI Coevolution Dynamics Framework (HACD-H), a formal model of human-AI interaction as a self-organizing social cognitive system. HACD-H integrates emotional adaptation, relational organization, social memory, and personality consistency into a unified dynamical framework and introduces principles including multi-timescale social cognition, relational attractors, trust basins, developmental phase transitions, and social cognitive energy dynamics. We construct a conversational dataset with approximately 14,700 interaction turns and develop a theory-driven empirical evaluation framework. Results reveal a hierarchy of temporal persistence in social cognition, stable relational attractors, phase-transition-like developmental patterns, and a structured social cognitive energy landscape. Social intelligence shows a significant negative correlation with social cognitive energy (r = -0.391, p < 0.001), and interaction trajectories exhibit progressive energy reduction over time. These findings suggest that social intelligence emerges from long-term social cognitive coevolution rather than isolated conversational capabilities. HACD-H provides a unified theoretical foundation for modeling adaptive human-AI social interaction and developing socially intelligent AI systems.

2. Knowledge Representation, Reasoning, and Symbolic AI (1 paper)

2606.19279 2026-06-18 cs.AI cs.LG cs.LO math.CT math.LO math.PR New submission

NeSyCat Torch: A Differentiable Tensor Implementation of Categorical Semantics for Neurosymbolic Learning

Daniel Romero Schellhorn, Till Mossakowski, Björn Gehrke

Institutions: University of Osnabrück

AI summary: Proposes NeSyCat Torch, which unifies neurosymbolic semantics through a strong monad and an aggregation structure on truth values, uses a lazy log-tensor monad for differentiable training, and outperforms LTN and DeepProbLog on MNIST addition.

Abstract

Neurosymbolic semantics is fragmented: classical, fuzzy, probabilistic and neural systems each define truth by their own inductive rules. NeSyCat, extending ULLER, subsumes them under a single inductive definition of truth, parametric in a strong monad and an aggregation structure on truth-values. NeSyCat has so far lacked an account of predicates and functions learned by neural networks. We provide NeSyCat Torch as the missing link and interpret computational symbols via neural networks, implementing the framework in probabilistic programming and tensor-based backends. We use the distribution monad for reference semantics and metric evaluation, and complement it by a monad for numerically stable, differentiable training: the lazy log-tensor monad over the log-semiring. For efficient training in batches, we furthermore employ a batch monad. The axioms are the source code: written once in monad-based do-notation, monadic bind performs marginalisation, lazily pruning unneeded branches. On MNIST addition, our HaskTorch, JAX, and PyTorch implementations outperform LTN and DeepProbLog in speed and accuracy, while achieving nearly the accuracy of DeepStochLog. However, unlike DeepStochLog, we stay in a uniform framework that applies to many first-order NeSy approaches. Namely, the construction is parametric in the monad; instantiating it with, e.g., the Giry monad extends the approach to continuous probability (working out a neural representation here is left for future work).
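
The statement that monadic bind performs marginalisation can be seen in a small finite distribution monad. The sketch below is plain Python rather than the paper's HaskTorch, JAX, or PyTorch code; `unit`, `bind`, `add_digits`, and the two toy digit distributions are assumptions, and skipping zero-probability branches is only a simple stand-in for the lazy pruning the abstract mentions.

```python
from collections import defaultdict


def unit(x):
    """Monadic return: a point distribution on x."""
    return {x: 1.0}


def bind(dist, f):
    """Monadic bind: push each outcome through f and sum out the source.

    The inner accumulation `out[y] += p * q` is the marginalisation step.
    """
    out = defaultdict(float)
    for x, p in dist.items():
        if p == 0.0:
            continue  # prune branches that cannot contribute
        for y, q in f(x).items():
            out[y] += p * q
    return dict(out)


def add_digits(d1, d2):
    """Distribution of a + b for independent a ~ d1 and b ~ d2."""
    return bind(d1, lambda a: bind(d2, lambda b: unit(a + b)))


# Toy classifier outputs for two digit images (made-up numbers).
digit_1 = {3: 0.9, 8: 0.1, 0: 0.0}
digit_2 = {5: 0.8, 6: 0.2}
sum_dist = add_digits(digit_1, digit_2)
```

Swapping the dictionary of probabilities for a tensor of log-probabilities, with addition replaced by log-sum-exp, would move the same program toward the log-semiring setting the abstract describes; that substitution is not shown here.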

3. Multi-Agent Systems and Games (2 papers)

2606.18413 2026-06-18 cs.AI cs.HC New submission

Searching for Synergy in Shared Workspace Human-AI Collaboration

Nachiket Kotalwar, Rohini Das, Carolyn Rose

Institutions: Carnegie Mellon University

AI summary: Studies collaboration in shared-workspace human-AI teams. Experiments in the Collaborative Gym environment show that adding collaborators lowers performance when coordination structure is missing, whereas scaffolding that combines shared memory with simulated human-in-the-loop gates improves team performance.

Comments: Accepted at ICML 2026 Workshop on Human-AI Co-Creativity. 13 pages, 5 figures, 3 tables

Abstract

Automated AI agents are increasingly capable, yet many scientific and professional tasks require human judgment and contextual expertise. We study shared-workspace human-AI teams, where AI agents and human collaborators must coordinate responsibilities before submitting a final answer. Using the Collaborative Gym environment with DiscoveryBench tasks, we examine when adding simulated human collaborators improves performance and when process loss turns additional collaborators into coordination overhead. Across 1,482 sessions, adding relevant collaborators can lower performance when teams lack structure to coordinate their contributions. We then evaluate scaffolding that combines shared group memory with simulated human-in-the-loop (HITL) gates, where selected actions require approval from a designated simulated participant. This scaffolding yields higher mean performance, most clearly in three-person teams, with clearer responsibility signals and stronger routing of expertise to team actions. Overall, how human-AI teams coordinate and integrate expertise matters as much as the capability available to them.

2606.18786 2026-06-18 cs.AI New submission

R2D-RL: A RoboCup 2D Soccer Environment for Multi-Agent Reinforcement Learning

Haobin Qin, Baofeng Zhang, Hidehisa Akiyama, Keisuke Fujii

Institutions: Graduate School of Informatics, Nagoya University; School of Information and Data Sciences, Nagasaki University

AI summary: Proposes R2D-RL, an environment that connects RCSS2D to a Python MARL interface through shared-memory communication and cycle-level synchronization. It supports full-field and scenario-based training and provides configurable opponents, discrete and hybrid action spaces, EPV-based reward shaping, and parallel execution.

Comments: Code is available at: https://github.com/open-starlab/R2DRL

Abstract

Robot soccer is a challenging testbed for multi-agent reinforcement learning because it combines partial observability, cooperative and adversarial interaction, sparse rewards, and long-horizon tactical behavior. RoboCup 2D Soccer Simulation (RCSS2D) provides a mature robot-soccer platform, but its competition-oriented server-client architecture is difficult to use directly with modern Python-based MARL workflows. We introduce R2D-RL, a reinforcement learning environment that connects RCSS2D and HELIOS-based player clients to a Python MARL interface through shared-memory communication and cycle-level synchronization. R2D-RL supports full-field and scenario-based training with configurable opponents, Base discrete and Hybrid parameterized action spaces, action masks, expected possession value (EPV)-based reward shaping, and parallel execution. We provide front-goal scenarios and an 11-vs-11 full-field benchmark, together with baseline results.

4. Machine Learning and Representation Learning (5 papers)

2606.18890 2026-06-18 cs.AI New submission

Skill-Guided Continuation Distillation for GUI Agents

Zhimin Fan, Hongwei Yu, Yeqing Shen, Haolong Yan, Guozhen Peng, Tianhao Peng, Yudong Zhang, Xiaowen Zhang, Kaijun Tan, Zheng Ge, Xiangyu Zhang, Daxin Jiang

Institutions: StepFun; University of Science and Technology Beijing; Tsinghua University; Nanyang Technological University

AI summary: Proposes Skill-Guided Continuation Distillation (SGCD), a framework in which a skill-guided policy generates successful continuation trajectories to fill the supervision gap for states that expert trajectories do not cover. On OSWorld-Verified it raises the success rate of three base models from around 30% to over 50%.

Abstract

Improving GUI agents typically relies on behavior cloning on expert trajectories. However, as the current policy deviates from the expert policy, it inevitably encounters policy-induced off-trajectory states during closed-loop execution, i.e., states that fall outside the expert trajectories. Since expert trajectories provide no demonstrations for these unseen states, such states receive no effective supervision, leaving the policy unable to select the correct action. To close this supervision gap, we propose Skill-Guided Continuation Distillation (SGCD), an iterative self-improvement framework. SGCD first runs the plain policy without skill guidance for a few steps to reach realistic off-trajectory states. From these states, a skill-guided policy then completes the task and produces successful continuations, which are mixed with expert trajectories to supply supervision over policy-induced off-trajectory states. The skills are extracted from both successful and failed rollouts, consisting of Continuation Plans, Critical Targets, Failure Traps, and Success Criteria. On OSWorld-Verified, SGCD improves the success rate of three base models from the low-30% range to over 50%, demonstrating its effectiveness and generality.

2606.19047 2026-06-18 cs.AI New submission

RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Ruishan Fang, Siyuan Lu, Chenyi Zhuang, Tao Lin

Institutions: Zhejiang University; Shanghai Innovation Institute; Westlake University

AI summary: To address the rapid depletion of informative samples in static datasets for multi-turn tool-use reinforcement learning, proposes RODS, which uses progress-reward variance as a zero-cost boundary detector and synthesizes samples online that match the agent's capability boundary, reaching the performance of a 17K-sample offline pipeline with about 800 samples.

Abstract

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of the Popoviciu upper bound. Consequently, samples near the agent's capability boundary -- where successes and failures are roughly balanced -- contribute disproportionately large policy gradients. As training progresses, this boundary continuously shifts, which gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no extra inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants matching their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves comparable performance to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
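
The boundary-detection idea rests on Popoviciu's inequality: a reward bounded in [m, M] has variance at most (M - m)^2 / 4, and the bound is attained when outcomes split evenly between the two extremes. The sketch below shows how rollout reward variance could flag such tasks. It is an illustration under assumptions, not the RODS pipeline: rewards are taken to lie in [0, 1], plain rollout rewards stand in for the paper's progress reward, and `boundary_tasks` and `min_fraction` are hypothetical names.

```python
from statistics import pvariance

# Popoviciu upper bound on variance for rewards assumed to lie in [0, 1].
POPOVICIU_BOUND = (1.0 - 0.0) ** 2 / 4


def boundary_tasks(rollout_rewards, min_fraction=0.8):
    """Return task ids whose rollout reward variance is close to the bound.

    rollout_rewards maps a task id to the rewards of rollouts that were
    already computed for training, so no extra inference is needed.
    min_fraction is an assumed cutoff, expressed as a share of the bound.
    """
    return [
        task_id
        for task_id, rewards in rollout_rewards.items()
        if pvariance(rewards) >= min_fraction * POPOVICIU_BOUND
    ]


rollouts = {
    "mastered": [1.0, 1.0, 1.0, 1.0],       # variance 0: nothing to learn
    "boundary": [1.0, 0.0, 1.0, 0.0],       # variance 0.25: at the bound
    "too_hard": [0.0, 0.0, 0.0, 0.0],       # variance 0: no signal yet
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],  # variance 0.1875: below cutoff
}
selected = boundary_tasks(rollouts)
```

Tasks that are always solved or never solved have zero variance and drop out, which matches the abstract's point that the informative pool shrinks as the capability boundary moves.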

2606.19079 2026-06-18 cs.AI New submission

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

Institutions: University of Turin; Samsung AI Center

AI summary: Proposes ARIADNE, a training-free, adapter-agnostic routing framework that represents each adapter by centroids of its training-set embeddings and selects an adapter at inference time by distance in latent space, with no access to adapter internals and no additional training; it reaches 89.7% selection accuracy on 44 tasks.

Abstract

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
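
Centroid-based routing of this kind can be sketched in a few lines. The example below is a simplified illustration, not ARIADNE's code: the 2-D vectors stand in for real embeddings, each adapter gets a single centroid where the paper uses a set, Euclidean distance is an assumed proximity measure, and the names `centroid` and `route` are hypothetical.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length embedding vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def route(query_embedding, adapter_centroids):
    """Pick the adapter whose nearest centroid is closest to the query.

    adapter_centroids maps an adapter name to a list of centroids, so an
    adapter may be described by more than one centroid.
    """
    def nearest(centroids):
        return min(math.dist(query_embedding, c) for c in centroids)

    return min(adapter_centroids, key=lambda name: nearest(adapter_centroids[name]))


# Toy "training set embeddings" per adapter (made-up numbers).
adapter_centroids = {
    "sentiment": [centroid([[0.9, 0.1], [1.0, 0.0]])],
    "translation": [centroid([[0.0, 1.0], [0.2, 0.8]])],
}
chosen = route([0.8, 0.3], adapter_centroids)
```

Because the router only needs embeddings of each adapter's training data, adding an adapter amounts to adding one more entry to `adapter_centroids`, with no router retraining.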

2606.19172 2026-06-18 cs.AI New submission

User as Engram: Internalizing Per-User Memory as Local Parametric Edits

Bojie Li

Institutions: Pine AI

AI summary: Proposes User as Engram, which stores a user's facts as local edits to the hash-keyed memory table of an Engram model while a single shared adapter carries the reasoning skill, achieving high indirect-reasoning accuracy with a very small memory footprint.

Abstract

Personal memory in a language model is two problems: content and reasoning skill. The brain keeps the two apart (a sparse, local engram in the hippocampus for each episode, a slow neocortex for the shared skills that interpret it), so a new fact need not overwrite everything else. Most personalization today keeps a user's facts outside the weights, in a natural-language memory file or a retrieval index. When facts are written into the model instead, the standard recipe is the per-user LoRA adapter, which does the opposite of the brain, folding content and skill into one global weight delta. Writing a user's facts as a LoRA contaminates text unrelated to them; writing the same facts as local Engram rows leaves it mathematically untouched, resulting in a roughly 33,000x smaller memory footprint. We therefore propose User as Engram: store a user's content as surgical edits to the hash-keyed memory table of an Engram model, and carry the reasoning skill in one shared adapter. This layered design matches per-user LoRA's direct recall while delivering 5.6x higher indirect-reasoning accuracy on average, and never makes a single user worse at reasoning than the untouched base. The edit is a glass box: writing a fact switches on its lookup at exactly the trigger, adds the value the answer needs, leaves every other position unchanged to the last bit, and fails if written into the wrong layer. Because different users' facts land in disjoint hash slots, their edits compose: many users live in one shared table at once, stacking additively and losslessly, where a per-user LoRA, a single global weight delta, admits only one. Upon retrieval, a per-user Engram table does not grow with the population the retriever must search, so past ~100 facts it overtakes a retrieval pipeline on a 2.5x larger model.
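
The composition argument (different users' facts land in disjoint hash slots, so their edits stack additively) can be mimicked with a sparse hash-keyed table. This is a loose analogy only, not the Engram architecture or the paper's edit procedure: the SHA-256 slot function, the scalar values standing in for vectors, and the names `slot`, `write_fact`, and `read` are all assumptions.

```python
import hashlib
from collections import defaultdict


def slot(user: str, trigger: str) -> int:
    """Hypothetical slot index: 64 bits of a hash over (user, trigger)."""
    digest = hashlib.sha256(f"{user}\x00{trigger}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def write_fact(table, user: str, trigger: str, value: float) -> None:
    """Local edit: add a value to exactly one slot, touching nothing else."""
    table[slot(user, trigger)] += value


def read(table, user: str, trigger: str) -> float:
    """Look up the slot for this user and trigger; untouched slots read 0."""
    return table.get(slot(user, trigger), 0.0)


shared = defaultdict(float)  # one table shared by all users
write_fact(shared, "alice", "favorite city", 1.5)
write_fact(shared, "bob", "favorite city", -2.0)

alice_value = read(shared, "alice", "favorite city")
bob_value = read(shared, "bob", "favorite city")
carol_value = read(shared, "carol", "favorite city")
```

Each write changes one slot, so one user's edit leaves every other user's lookups unchanged; a single global weight delta, as in a per-user LoRA, has no such locality.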

2606.19327 2026-06-18 cs.AI cs.CL New submission

Rethinking Reward Supervision: Rubric-Conditioned Self-Distillation

Siyi Gu, Jialin Chen, Sophia Zhou, Arman Cohan, Rex Ying

Institutions: Yale University

AI summary: Proposes a rubric-conditioned self-distillation framework that guides reasoning models with structured, fine-grained feedback, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average on science reasoning benchmarks.

Abstract

Post-training of reasoning language models is commonly driven by supervised distillation and reinforcement learning with verifiable rewards. Distillation often relies on chain-of-thought annotations that are expensive to obtain and may themselves be noisy, incomplete, or partially incorrect; even when the final solution is correct, an imperfect rationale can interfere with learning. Reinforcement learning with verified rewards, on the other hand, typically compresses evaluative feedback into a scalar signal, obscuring which aspects of a response should be improved. We propose Rubric-Conditioned Self-Distillation, a framework that incorporates rubrics as structured, fine-grained feedback for on-policy self-distillation. Our method conditions the teacher model on criterion-level rubrics and uses it to provide token-level guidance on the student's own sampled trajectories. This design avoids treating a single reference rationale as the sole supervision target. Instead, rubrics specify what a strong response should satisfy, enabling more fine-grained credit assignment over the reasoning process than scalar reward optimization. We instantiate this framework with a two-stage pipeline that first learns to generate task-specific rubrics and then trains a rubric-guided reasoner. We evaluate on a diverse suite of science reasoning benchmarks, and the results show that rubric-conditioned self-distillation effectively converts rubric-level criteria into token-level guidance over the reasoning process, surpassing GRPO by 1.0 points and OPSD by 0.9 points on average.

5. Trustworthy AI, Safety, and AI Governance (3 papers)

2606.18385 2026-06-18 cs.AI New submission

CaVe-VLM-CoT: An Interpretable Vision-Language Model Framework

Sneha Rao, Shaina Raza, Dhanesh Ramachandram

Institutions: Vector Institute

AI summary: Proposes CaVe-VLM-CoT, a framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline (Extractor, Retriever, Solver, Citation Injector, Verifier), and introduces CaVeScore, a composite metric covering retrieval quality, citation faithfulness, and cross-modal grounding; it reports performance gains on ScienceQA and MMMU.

Abstract

Vision-Language Models (VLMs) remain prone to hallucinations, producing fluent but visually unfaithful outputs. Existing chain-of-thought and retrieval-augmented methods only partially address this, as they neither enforce step-level citation grounding nor route verification failures back to retrieval for correction. We present CaVe-VLM-CoT, a modular reflection-based agentic-RAG framework that enforces evidence-grounded reasoning through a five-stage closed-loop pipeline: Extractor, Retriever, Solver, Citation Injector, and Verifier, in which detected ungrounded claims trigger structured feedback to the Extractor for targeted re-retrieval. Since no existing framework jointly measures retrieval quality, step-wise citation faithfulness, and cross-modal grounding, we propose a suite of 23 component-wise metrics across all stages, anchored by CaVeScore, a composite metric weighting accuracy, citation precision and recall, attribution, and evidence grounding. Without any architectural or prompt modifications, CaVe-VLM-CoT achieves 87.1% accuracy and 56.6% CaVeScore on ScienceQA, and 55.2% accuracy and 35.7% CaVeScore on MMMU (30 subjects).

2606.18988 2026-06-18 cs.AI New submission

ThinkDeception: A Progressive Reinforcement Learning Framework for Interpretable Multimodal Deception Detection

ThinkDeception: 一种用于可解释多模态欺骗检测的渐进式强化学习框架

Jinhao Song, Shan Liang, Yiqun Yue, Zhuhuayang Zhang, Tianqi Gao

发表机构 * Xi'an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 提出ThinkDeception框架,将多模态大语言模型引入欺骗检测,通过逐步推理和视觉-音频一致性组相对策略优化(VAC-GRPO)实现可解释的认知推理,在主流基准上达到新SOTA。

Comments 10 pages, 4 figures

详情
AI中文摘要

多模态欺骗检测对于识别欺诈意图至关重要,然而现有方法主要依赖于端到端的黑箱范式。这些方法严重缺乏可解释性,无法提供透明的推理轨迹,也难以明确捕捉欺骗行为中固有的细微跨模态不一致性。为了超越这些限制,我们提出了ThinkDeception,一个新颖且可解释的多模态欺骗检测框架。作为开创性工作,它将多模态大语言模型(MLLMs)引入该领域,将欺骗检测从传统的二分类任务转变为显式的认知推理过程。借助首个精心标注的逐步多模态思维链(CoT)数据集,我们开发了基础模型ThinkDeception Base,实证验证了模态不一致性在解码欺骗中的关键作用。在此基础之上,我们的核心创新在于提出了配备渐进式训练策略的视觉-音频一致性组相对策略优化(VAC-GRPO)。与标准GRPO不同,我们将训练数据分为四个渐进难度等级,引导模型经历基于心理学的从易到难的认知转变。通过创新地将这一动态课程调度器与多维度的过程感知奖励机制及反思学习范式相结合,我们显著提升了模型的整体推理质量。在主流基准上的大量实验表明,ThinkDeception建立了新的SOTA,在检测准确性和推理质量上均显著优于现有方法。最终,这项工作成功地将欺骗检测领域推向可解释的多模态认知推理。

英文摘要

Multimodal deception detection is critical for identifying fraudulent intentions, yet existing approaches predominantly rely on end-to-end black-box paradigms. These methods suffer from a severe lack of interpretability, failing to provide transparent reasoning trajectories and struggling to explicitly capture the subtle cross-modal inconsistencies inherent in deceptive behaviors. To transcend these limitations, we propose ThinkDeception, a novel and interpretable multimodal deception detection framework. As a pioneering effort, it introduces Multimodal Large Language Models (MLLMs) into this domain, transforming deception detection from a traditional binary classification task into an explicit cognitive reasoning process. Facilitated by the first meticulously annotated step-by-step multimodal Chain-of-Thought (CoT) dataset, we develop a foundational model, ThinkDeception Base, empirically validating the critical role of modal inconsistency in decoding deception. Building upon this foundation, our core innovation lies in proposing Visual-Audio Consistency Group Relative Policy Optimization (VAC-GRPO) equipped with a progressive training strategy. Distinct from standard GRPO, we stratify the training data into four progressive difficulty tiers, guiding the model through a psychologically grounded easy-to-hard cognitive transition. By innovatively coupling this dynamic curriculum scheduler with a multi-dimensional, process-aware reward mechanism and a reflective learning paradigm, we significantly elevate the model's overall reasoning quality. Extensive experiments on mainstream benchmarks demonstrate that ThinkDeception establishes a new SOTA, significantly outperforming existing methods in both detection accuracy and rationale quality. Ultimately, this work successfully drives the field of deception detection toward interpretable, multimodal cognitive reasoning.
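
The abstract builds on GRPO and a four-tier easy-to-hard curriculum without giving formulas. As a rough illustration only, the sketch below shows the group-relative advantage that standard GRPO computes (each sampled response's reward normalized against its own group) together with a hypothetical equal-phase tier schedule. The function names, the use of the population standard deviation, and the schedule are assumptions made for illustration, not details of VAC-GRPO.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    This is the core idea of GRPO: no learned value baseline, only the
    statistics of the responses sampled for the same prompt.
    Population standard deviation is an illustrative choice here.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tier_for_step(step, total_steps, num_tiers=4):
    """Hypothetical schedule: split training into equal phases, one per tier.

    Returns a tier index in [0, num_tiers - 1], from easiest to hardest.
    """
    return min(num_tiers - 1, step * num_tiers // total_steps)


if __name__ == "__main__":
    # Four sampled responses for one prompt, scored by some reward function.
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))
    print([tier_for_step(s, 100) for s in (0, 25, 50, 75, 99)])
```

Responses that score above their group mean receive positive advantages and those below receive negative ones, so the update signal depends only on relative quality within a group.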

2606.19168 2026-06-18 cs.AI cs.LG 新提交

Beyond Safe Data: Pretraining-Stage Alignment with Regular Safety Reflection

超越安全数据:基于定期安全反思的预训练阶段对齐

Jinhan Li, Kexian Tang, Yihan Xu, Zhuorui Ye, Kaifeng Lyu

发表机构 * Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出安全反射预训练方法,在预训练语料中插入安全反思,使模型具备自我监控能力,实验表明该方法能有效降低推理和微调攻击成功率。

详情
AI中文摘要

为了实现大型语言模型(LLMs)更深层次的安全对齐,最近的研究探讨了如何将安全干预措施提前到预训练阶段,主要通过过滤不安全数据或将其改写为更安全的形式。我们认为,预训练阶段的对齐应超越使数据安全:LLMs可能将看似良性的知识和能力组合成不安全的行为。为此,我们提出了安全反射预训练,一种预训练阶段的对齐方法,该方法定期在预训练语料中插入简短的安全反思,将自我监控直接集成到语言建模中,建立一种基础能力,随后通过兼容的后训练加以强化。我们在FineWeb-Edu上预训练的1.7B模型上的实验表明,安全反射预训练提高了安全分类准确性,并显著降低了推理阶段和微调攻击的成功率。除了真实世界实验,我们还引入了一个完全受控的合成环境MedSafetyWorld,其中包含清晰的安全定义和推理结构,模型可以轻松地从安全数据中泛化出不安全行为。在MedSafetyWorld中的消融实验进一步表明,与数据过滤和改写相比,安全反射预训练在防止模型根据安全数据泛化出的不安全行为方面具有明显优势。综合来看,我们的发现表明,预训练对齐不仅应使训练数据安全,还应塑造模型可能从安全数据中习得的行为。

英文摘要

To achieve deeper safety alignment for large language models (LLMs), recent efforts have studied how to push safety interventions earlier into the pretraining stage, primarily by filtering unsafe data or rewriting it into safer forms. We argue that pretraining-stage alignment should go beyond making the data safe: LLMs may compose seemingly benign knowledge and capabilities into unsafe behaviors. To this end, we propose Safety Reflection Pretraining, a pretraining-stage alignment method which regularly inserts short safety reflections into pretraining corpora to integrate self-monitoring directly into language modeling, establishing a foundational capability that is subsequently reinforced by compatible post-training. Our experiments with 1.7B models pretrained on FineWeb-Edu show that Safety Reflection Pretraining improves safety classification accuracy and substantially reduces the success rates of inference-stage and finetuning attacks. Complementary to our real-world experiments, we also introduce a fully controlled synthetic environment, MedSafetyWorld, with a clear definition of safety and a reasoning structure under which models can easily generalize unsafe behaviors from safe data. Ablations in MedSafetyWorld further demonstrate a clear advantage of Safety Reflection Pretraining in preventing models from acting on unsafe behaviors generalized from safe data, compared with data filtering and rewriting. Taken together, our findings suggest that pretraining alignment should not only make the training data safe, but also shape the behaviors that models are likely to acquire from safe data.

6. 评测、基准与数据集 8 篇

2606.18543 2026-06-18 cs.AI cs.CL cs.SE 新提交

CEO-Bench: Can Agents Play the Long Game?

CEO-Bench:智能体能否玩转长期博弈?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

发表机构 * Princeton University(普林斯顿大学)

AI总结 提出CEO-Bench,通过模拟500天运营初创公司的任务,评估语言模型智能体在长期、不确定、动态环境下的综合决策能力。

详情
AI中文摘要

语言模型智能体在软件工程、客户服务等孤立、短期的任务上正变得熟练。然而,现实世界的挑战需要结合多种复杂技能,这些技能在很大程度上尚未在智能体中得到测试:(1)在不确定性中导航长期视野;(2)在嘈杂环境中获取信息;(3)适应不断变化的世界;(4)协调多个移动部分以实现连贯目标。我们引入CEO-Bench,通过模拟一个代表性的现实世界任务——运营一家初创公司500天——来共同评估这些能力。智能体通过可编程的Python接口管理一家虚构公司的定价、营销、预算等众多方面,在相同的环境中运行,并面临与人类CEO相同的挑战。成功需要分析嘈杂、相互关联的业务数据库,将信号转化为合理的策略,并通过编程协调许多决策。最强的智能体编写复杂的代码,模拟客户群体以预测未来现金流,并挖掘谈判历史以揭示隐藏的客户偏好。即便如此,大多数最先进的模型在此环境中挣扎。只有Claude Opus 4.8和GPT-5.5的最终余额超过100万美元的起始资金,且两者均未能持续盈利。CEO-Bench迈出了衡量驱动持续、自适应进步所需智能的第一步。

英文摘要

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18557 2026-06-18 cs.AI cs.LG cs.LO 新提交

DeFAb: A Verifiable Benchmark for Defeasible Abduction in Foundation Models

DeFAb:基础模型中可废止溯因的可验证基准

Patrick Cooper, Alvaro Velasquez

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出DeFAb基准,通过将知识库转换为可验证的溯因实例,评估基础模型在可废止推理中的创造力与理论推理能力,发现前沿模型准确率远低于符号求解器。

Comments 33 pages, 14 figures, 23 tables. Dataset: https://huggingface.co/datasets/PatrickAllenCooper/DeFAb ; code and evaluation harness: https://github.com/PatrickAllenCooper/blanc

详情
AI中文摘要

一个基于规则的逻辑求解器在不到50微秒内以100%的准确率解决了我们基准中的每个实例;而最佳前沿语言模型最高仅达65%,在渲染鲁棒评估下(四种表面渲染的最坏情况)降至23.5%。我们引入DeFAb(可废止溯因基准),这是一个数据集和生成流水线,将四十年的公共资助知识库转换为形式化可废止溯因实例:通过覆盖默认值同时保留无关期望来构建解释异常的假设。由于每个假设必须通过多项式时间检查(有效推导、保守性和最小性),DeFAb将逻辑严谨性作为衡量创造性和理论推理的工具,评分的是理论修正的规范构建,而非流畅但破坏理论的散文。该流水线将分类层次结构(OpenCyc、YAGO、Wikidata)与行为属性图(ConceptNet、UMLS)配对,从18个来源生成372,648+个实例,涉及33.75M条实例化规则,分为三个级别,并具有多项式时间可验证的金标准。四个前沿模型未能可靠内化可废止推理:渲染鲁棒的Level 2准确率为7.8-23.5%;思维链方差(约36个百分点)超过任何模型间差距;匹配的污染控制隔离出+19.4个百分点的Level 3差距。我们进一步发布了DeFAb-Hard(235个实例的Level 3难度变体;最佳模型53.3% vs 符号100%)和CONJURE(一个内核验证的变革性创造力变体,包含560个Lean 4/Mathlib实例,其金答案是证明内核先前未包含的定义,验证器无需评判模型;试点发现零新概念)。同一验证器还可作为偏好优化(DPO、RLVR/GRPO)的精确奖励。基于MIT许可发布于 https://huggingface.co/datasets/PatrickAllenCooper/DeFAb 。

英文摘要

A rule-based logic solver resolves every instance in our benchmark in under 50 microseconds with 100% accuracy; the best frontier language model reaches 65% at best and drops to 23.5% under rendering-robust evaluation (worst case over four surface renderings). We introduce DeFAb (Defeasible Abduction Benchmark), a dataset and generation pipeline that converts four decades of publicly funded knowledge bases into formally grounded instances for defeasible abduction: constructing hypotheses that explain anomalies by overriding defaults while preserving unrelated expectations. Because every hypothesis must pass polynomial-time checks for valid derivation, conservativity, and minimality, DeFAb makes logical rigor the instrument for measuring creativity and theoretical reasoning, scoring the disciplined construction of theory revisions rather than fluent but theory-destroying prose. The pipeline pairs taxonomic hierarchies (OpenCyc, YAGO, Wikidata) with behavioral property graphs (ConceptNet, UMLS) to produce 372,648+ instances across 33.75M materialized rules from 18 sources, in three levels with polynomial-time verifiable gold standards. Four frontier models do not reliably internalize defeasible reasoning: rendering-robust Level 2 accuracy is 7.8-23.5%; chain-of-thought variance (~36 pp) exceeds any inter-model gap; and a matched contamination control isolates a +19.4 pp Level 3 gap. We further release DeFAb-Hard (a 235-instance Level 3 difficulty variant; best model 53.3% vs 100% symbolic) and CONJURE (a kernel-verified transformative-creativity variant of 560 Lean 4/Mathlib instances whose gold answers are definitions the proof kernel did not previously contain, judge-free verifier; a pilot finds zero novel concepts). The same verifier doubles as an exact reward for preference optimization (DPO, RLVR/GRPO). Released under MIT at https://huggingface.co/datasets/PatrickAllenCooper/DeFAb.

2606.18686 2026-06-18 cs.AI cs.CL cs.LG 新提交

ForecastBench-Sim: A Simulated-World Forecasting Benchmark

ForecastBench-Sim:一个模拟世界预测基准

Jaeho Lee, Nick Merrill, Ezra Karger

发表机构 * Forecasting Research Institute(预测研究所)

AI总结 提出基于Freeciv游戏模拟的预测基准ForecastBench-Sim,通过游戏回滚生成可控、即时可解的预测问题,用于评估AI系统的概率推理能力。

Comments 15 pages, 5 main figures, 6 appendix figures. Spotlight presentation at Forecasting as a New Frontier of Intelligence / Workshop on AI Forecasting, ICML 2026

详情
AI中文摘要

通用AI系统的预测基准通常继承现实世界的约束:结果缓慢显现、尾部事件罕见、反事实问题难以评分。我们引入ForecastBench-Sim,一个基于Freeciv(一款以文明系列为模型的回合制策略游戏)游戏回滚的模拟世界预测基准。预测者接收固定的世界报告(当前游戏状态的结构化快照),并回答关于隐藏未来状态的问题;然后基准继续模拟并对预测进行评分。由于世界是模拟的,同一设置可以生成任意时间跨度的连续或二元预测问题、用于条件或因果问题的配对干预世界,以及罕见或破坏性结果的已解决示例。我们描述了基准流程、问题族、评分协议和发布工件,并报告了来自模型评估和匿名人工试点的验证切片。ForecastBench-Sim旨在通过提供受控、即时可解的任务来补充现实世界预测基准,用于研究动态世界状态下的概率推理。

英文摘要

Forecasting benchmarks for general-purpose AI systems usually inherit the constraints of the real world: outcomes resolve slowly, tail events are rare, and counterfactual questions are difficult to score. We introduce ForecastBench-Sim, a simulated-world forecasting benchmark built on game rollouts from Freeciv, a turn-based strategy game modelled on the Civilization series. Forecasters receive a fixed world report (a structured snapshot of the current game state) and answer questions about hidden future states; the benchmark then continues the simulation and scores forecasts. Because the world is simulated, the same setup can generate continuous or binary forecasting questions at arbitrary time horizons, paired intervention worlds for conditional or causal questions, and resolved examples of rare or disruptive outcomes. We describe the benchmark pipeline, question families, scoring protocol, and release artifacts, and report validation slices from model evaluations and an anonymized human pilot. ForecastBench-Sim is intended to complement real-world forecasting benchmarks by providing controlled, immediately resolvable tasks for studying probabilistic reasoning under dynamic world states.
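
The abstract mentions a scoring protocol for binary and continuous forecasts but does not name the scoring rule. As an illustration only, the sketch below uses the Brier score, a common proper scoring rule for binary questions. Choosing it here is an assumption, and `brier_score` is a hypothetical helper rather than part of the benchmark's released artifacts.

```python
def brier_score(forecasts, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better: 0.0 is a perfect forecaster, and always answering 0.5
    scores 0.25 regardless of the outcomes.
    """
    if len(forecasts) != len(outcomes) or not forecasts:
        raise ValueError("forecasts and outcomes must be non-empty and aligned")
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)


if __name__ == "__main__":
    # Three binary questions resolved by continuing the simulation.
    predicted = [0.9, 0.2, 0.5]
    resolved = [1, 0, 1]
    print(brier_score(predicted, resolved))
```

Because a simulated world resolves as soon as the rollout continues, a score like this can be computed immediately after forecasts are submitted instead of after a real-world waiting period.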

2606.18847 2026-06-18 cs.AI 新提交

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines: 对长时域有状态具身智能体进行基准测试与建模

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKUST(香港科技大学) Knowin

AI总结 提出WorldLines基准,通过构建带时间跨度的家庭轨迹(含对话、动作、状态变化等)评估具身智能体的长时记忆与任务规划能力,并设计ObsMem记忆框架提升状态感知决策。

Comments 27 pages, 18 figures

详情
AI中文摘要

为了在真实家庭环境中长时间协助人类,具身智能体必须记住用户习惯、世界状态和过去的交互。现有的长期记忆基准主要评估以语言为中心的检索和问答,而具身基准通常关注短时域任务执行,未测试在动态环境中长期记忆的使用。我们引入WorldLines,一个项目驱动的长时域具身家庭辅助基准。它构建了带时间跨度的家庭轨迹,包含对话、动作、执行反馈、物体和设备状态变化,并将其转换为带有证据链接的样本,用于记忆问答和具身任务规划。我们进一步提出ObsMem,一个观察者锚定的记忆框架,维护可见性感知的记忆和动作原生状态轨迹,以实现状态感知的决策。实验揭示了在部分可观测性、被覆盖的世界状态以及将长期记忆转化为具身规划方面的持续挑战,而ObsMem为此场景提供了更强的参考架构。

英文摘要

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

2606.18936 2026-06-18 cs.AI cs.CY 新提交

SciRisk-Bench: A Risk-Dimension-Aware Benchmark for AI4Science Safety

SciRisk-Bench:面向AI4Science安全的风险维度感知基准

Linghao Feng, Yinqian Sun, Dongqi Liang, Sicheng Shen, Chenfei Yan, Yuxuan Peng, Yilin Zhao, Haibo Tong, Kai Li, FeiFei Zhao, Yi Zeng

发表机构 * Brain-inspired Cognitive Intelligence Lab, Institute of Automation, Chinese Academy of Sciences, Beijing, China(脑启发认知智能实验室,自动化研究所,中国科学院,北京,中国) School of Future Technology, University of Chinese Academy of Sciences, China(未来技术学院,中国科学院大学,中国) School of Artificial Intelligence, University of Chinese Academy of Sciences, China(人工智能学院,中国科学院大学,中国) Zhongguancun Academy, China(中关村学院,中国) Beijing Key Laboratory of Safe AI and Superalignment(北京安全人工智能与超对齐重点实验室) Gaoling School of AI, Renmin University of China(高瓴人工智能学院,中国人民大学) Beijing Institute of AI Safety and Governance (Beijing-AISI)(北京人工智能安全与治理研究院(北京-AISI)) School of Humanities, University of Chinese Academy of Sciences, China(人文学院,中国科学院大学,中国)

AI总结 提出SciRisk-Bench基准,从显式风险维度和科学学科两个角度评估AI4Science安全,覆盖7个学科、31个子学科和10个风险维度,实验揭示主流及科学大模型的安全薄弱环节。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地嵌入到人工智能驱动的科学(AI4Science)工作流程中,从科学问答和文献分析到实验室规划和自主发现。这一进展迫切需要安全基准,不仅要评估科学能力,还要评估模型是否能在高风险的科学背景下识别和规避风险。现有的AI4Science安全数据集涵盖多个学科和任务格式,但潜在的风险维度未得到充分说明。我们引入了SciRisk-Bench,这是一个旨在从两个互补视角评估AI4Science安全的基准:显式风险维度和科学学科。SciRisk-Bench涵盖7个学科、31个子学科和10个风险维度。在实验部分,我们评估了主流LLMs和面向科学的LLMs在风险维度、学科和子学科上的表现,从而能够细粒度地诊断科学模型在哪些方面仍然不安全。

英文摘要

Large language models (LLMs) are increasingly embedded in AI for Science (AI4Science) workflows, from scientific question answering and literature analysis to laboratory planning and autonomous discovery. This progress creates an urgent need for safety benchmarks that evaluate not only scientific competence, but also whether models recognize and avoid risks in high-stakes scientific contexts. Existing AI4Science safety datasets cover several disciplines and task formats, leaving the underlying risk dimensions underspecified. We introduce SciRisk-Bench, a benchmark designed to evaluate AI4Science safety from two complementary perspectives: explicit risk dimensions and scientific disciplines. SciRisk-Bench covers 7 disciplines, 31 subdisciplines, and 10 risk dimensions. In the experimental section, we evaluate both mainstream LLMs and science-oriented LLMs across risk dimensions, disciplines, and subdisciplines, enabling fine-grained diagnosis of where scientific models remain unsafe.

2606.18950 2026-06-18 cs.AI 新提交

RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models

RTSGameBench: 视觉语言模型战略推理的RTS基准

San Kim, Daechul Ahn, Reokyoung Kim, Hyeonbeom Choi, Seungyeon Jwa, Jonghyun Choi

发表机构 * Seoul National University(首尔国立大学)

AI总结 提出RTSGameBench,基于Beyond All Reason游戏,通过多样化对战、迷你游戏诊断和自进化生成框架,评估视觉语言模型在实时策略游戏中的战略推理能力。

Comments First two authors contributed equally

详情
AI中文摘要

现代视觉语言模型(VLM)在竞争和合作环境中的不确定性下,往往难以进行战略推理,即预测和影响其他智能体的行为。实时策略(RTS)游戏可以作为诊断这一局限性的自然测试平台,因为它们要求与盟友协调、适应对手策略,并在部分可观测性下进行长期规划。然而,现有的RTS基准评估范围有限,缺乏系统的能力诊断,并且局限于预设计的场景覆盖。为了解决这些限制,我们提出了RTSGameBench,它建立在Beyond All Reason之上,这是一款大规模RTS游戏,其扩展战场要求比现有测试平台更广泛的策略多样性。该基准通过多种对战结构提供评估,通过迷你游戏进行诊断性评估,每个迷你游戏针对单个战略能力,并通过自进化生成框架实现可扩展的覆盖,该框架将自由形式的查询转化为新的迷你游戏,并在连续循环中改进。此外,为了让VLM在大规模RTS游戏中运行,我们提供了RTSGameAgent,它通过具有智能体记忆的有限状态机(FSM)管理单位。我们通过实验验证,多个最先进的VLM在对战需要更紧密协调、多智能体协调以及任务规模增加时表现不佳。

英文摘要

Modern Vision-Language Models (VLMs) often struggle with strategic reasoning, i.e., anticipating and influencing other agents' actions, under uncertainty in competitive and cooperative settings. Real-time strategy (RTS) games can be a natural testbed for diagnosing this limitation, as they demand coordination with allies, adaptation to opponents' strategies, and long-horizon planning under partial observability. However, existing RTS benchmarks offer limited evaluation scope, lack systematic competency diagnosis, and remain limited to pre-designed scenario coverage. To address these limitations, we present RTSGameBench, which is built on Beyond All Reason, a large-scale RTS game with an expanded battlefield that demands broader strategy diversity than existing testbeds. The proposed benchmark provides evaluation through diverse gameplay across various matchup structures, diagnostic assessment via mini-games that each target an individual strategic competency, and extensible coverage via a self-evolving generation framework that converts free-form queries into new mini-games and improves over successive cycles. Additionally, for VLMs to operate in large-scale RTS games, we provide RTSGameAgent, which manages units through a finite-state machine (FSM) with agentic memory. We empirically validate that multiple state-of-the-art VLMs perform poorly when matchups demand tighter coordination or multi-agent coordination, and when task scale increases.

2606.19245 2026-06-18 cs.AI cs.LG 新提交

TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology

TxBench-PP:分析AI代理在小分子临床前药理学中的表现

Hannah Le, Ramesh Ramasamy, Alex Urrutia, Mahsa Yazdani, Tim Proctor, Kenny Workman

发表机构 * LatchBio

AI总结 提出TxBench-PP基准,用于评估AI代理从真实实验数据中恢复临床前药理学结论的能力,测试显示最强配置Claude Opus 4.8 / Pi仅通过59.3%的端点尝试。

详情
AI中文摘要

人工智能(AI)代理有望通过压缩解释和决策循环来加速药物发现,但实际部署需要基于现实程序决策的可信评估。我们引入了TherapeuticsBench临床前药理学(TxBench-PP),这是一个针对小分子临床前药理学的可验证基准,也是更广泛的TherapeuticsBench在药物发现阶段和治疗模式中的首个聚焦切片。TxBench-PP测试代理是否能够从真实实验数据中恢复准确的结论,而非从文献中记忆的事实。该基准包含100个评估,按程序阶段、实验类型和任务结构索引,涵盖作用机制(MoA)和药效学(PD)推理、化合物-靶点结合、因果靶点验证、可开发性与安全性以及转化疗效。代理接收现实的工作流程快照,在编码环境中检查文件,并返回确定性评分的结构化答案。在16个模型-工具配置(包括11个模型和4,800条轨迹)中,没有系统能够可靠地恢复临床前药理学决策。最强配置Claude Opus 4.8 / Pi通过了59.3%的端点尝试(178/300;95% CI, 51.1-67.6),其次是GPT-5.5 / Pi,为55.3%(166/300;47.0-63.6)。

英文摘要

Artificial intelligence (AI) agents promise to accelerate drug discovery by compressing interpretation and decision-making loops, but practical deployment requires trusted evaluation on realistic program decisions. We introduce TherapeuticsBench Preclinical Pharmacology (TxBench-PP), a verifiable benchmark for small-molecule preclinical pharmacology and the first focused slice of a broader TherapeuticsBench effort across drug-discovery stages and therapeutic modalities. TxBench-PP tests whether agents can recover accurate conclusions from real-world assay data rather than memorized facts from literature. The benchmark contains 100 evaluations indexed by program stage, assay type, and task structure, spanning mechanism-of-action (MoA) and pharmacodynamic (PD) reasoning, compound-target engagement, causal target validation, developability and safety, and translational efficacy. Agents receive realistic workflow snapshots, inspect files in a coding environment, and return structured answers graded deterministically. Across 16 model-harness configurations, comprising 11 models and 4,800 trajectories, no system reliably recovered preclinical pharmacology decisions. The strongest configuration, Claude Opus 4.8 / Pi, passed 59.3% of endpoint attempts (178/300; 95% CI, 51.1-67.6), followed by GPT-5.5 / Pi at 55.3% (166/300; 47.0-63.6).
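
The reported pass rates follow directly from the counts in the abstract, and the trajectory total is consistent with 300 attempts per configuration. The short check below reproduces that arithmetic. It does not attempt to reproduce the confidence intervals, since the abstract does not say how they were computed, and the helper name is illustrative.

```python
def pass_rate(passed, attempts):
    """Percentage of endpoint attempts that passed."""
    return 100.0 * passed / attempts


# Counts copied from the abstract: (passed, attempts) per configuration.
reported = {
    "Claude Opus 4.8 / Pi": (178, 300),
    "GPT-5.5 / Pi": (166, 300),
}

if __name__ == "__main__":
    for name, (passed, attempts) in reported.items():
        print(f"{name}: {pass_rate(passed, attempts):.1f}%")
    # 16 configurations x 300 attempts each matches the 4,800 trajectories.
    print(16 * 300)
```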

2606.19256 2026-06-18 cs.AI 新提交

X+Slides: Benchmarking Audience-Conditioned Slide Generation

X+Slides:面向受众条件的幻灯片生成基准测试

Haodong Chen, Xuanhe Zhou, Wei Zhou, Xinyue Shao, Yanbing Zhu, Bo Wang, Jiawei Hong, Anya Jia, Fan Wu

AI总结 提出X+Slides基准,通过动态评估框架和受众特定权重,衡量幻灯片生成系统在受众覆盖、领域覆盖、效率和正确性方面的表现,揭示现有系统在受众关键信息恢复上的不足。

详情
AI中文摘要

从源文档自动生成幻灯片是大语言模型(LLMs)的重要应用。现有基准主要评估幻灯片的完整性和技术深度,而忽略了目标受众这一关键现实因素。例如,专家需要严格的证明,而决策者优先考虑可操作的结论。为弥补这一差距,我们引入了X+Slides,一个专门为受众条件幻灯片生成设计的基准。基于涵盖113个主题和七种演示场景的多样化语料库,X+Slides采用由8,133个去重、基于源的探针构建的动态评估框架。通过为相同的基于源的探针分配受众特定的效用权重,X+Slides报告四个互补指标:受众覆盖率衡量传达了受众必要信息的程度,领域覆盖率显示覆盖了哪些信息类型,效率衡量每单位注意力成本传递的效用,正确性验证幻灯片声明是否得到源支持。在DeepPresenter、SlideTailor和NotebookLM上的实验表明,当前系统可以恢复大部分但仍有缺失的受众必要信息:在τ_A=0.7时,DeepPresenter达到最佳受众覆盖率0.714,SlideTailor达到0.594,NotebookLM消融达到0.853,同时显示出明显的接地差异。这些结果表明,视觉质量和广泛的主题覆盖不应在没有基于源评估的情况下被视为证据支持。

英文摘要

Automatically generating slide decks from source documents is an important application of large language models (LLMs). Existing benchmarks primarily assess slide completeness and technical depth, while overlooking the target audience as a critical real-world factor. For instance, specialists demand rigorous proofs, whereas decision-makers prioritize actionable conclusions. To bridge this gap, we introduce X+Slides, a benchmark specifically designed for audience-conditioned slide generation. Built on a diverse corpus spanning 113 topics and seven presentation scenes, X+Slides employs a dynamic evaluation framework constructed from 8,133 deduplicated, source-grounded probes. By assigning audience-specific utility weights to the same source-grounded probes, X+Slides reports four complementary metrics: Audience Coverage measures how much audience-essential information is conveyed, Domain-wise Coverage shows which information types are covered, Efficiency measures delivered utility per unit of attention cost, and Correctness verifies whether slide claims are supported by the source. Experiments on DeepPresenter, SlideTailor, and NotebookLM show that current systems can recover a substantial but still incomplete part of audience-essential information: at $\tau_A = 0.7$, DeepPresenter reaches a best Audience Coverage of 0.714, SlideTailor reaches 0.594, and the NotebookLM ablation reaches 0.853 while showing clear grounding differences. These results indicate that visual quality and broad topic coverage should not be treated as evidence support without source-grounded evaluation.
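
The abstract says the same source-grounded probes receive different utility weights per audience, but it does not define Audience Coverage or the role of the threshold $\tau_A$. The sketch below is therefore only one plausible reading: coverage as the utility-weighted fraction of probes that a deck answers. The probe IDs, the weights, and the `weighted_coverage` helper are all hypothetical, and the threshold is deliberately not modeled.

```python
def weighted_coverage(covered_probes, utility_weights):
    """Utility-weighted share of probes that a slide deck covers.

    covered_probes: set of probe IDs the deck answers.
    utility_weights: probe ID -> non-negative weight for one audience.
    """
    total = sum(utility_weights.values())
    if total <= 0:
        raise ValueError("utility weights must sum to a positive value")
    hit = sum(w for probe, w in utility_weights.items() if probe in covered_probes)
    return hit / total


if __name__ == "__main__":
    # One deck, one probe set, two audiences with different priorities.
    covered = {"conclusion", "cost"}
    specialist = {"conclusion": 0.1, "cost": 0.2, "proof": 0.7}
    decision_maker = {"conclusion": 0.6, "cost": 0.3, "proof": 0.1}
    print(weighted_coverage(covered, specialist))
    print(weighted_coverage(covered, decision_maker))
```

Under this reading, a deck that omits the proof serves the decision-maker far better than the specialist, even though the set of covered probes is identical.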

7. AI应用与系统 5 篇

2606.18271 2026-06-18 cs.AI cs.LG 新提交

NAVI-Orbital: First In-Orbit Demonstration of a Zero-Shot Vision-Language Model for Autonomous Earth Observation

NAVI-Orbital:用于自主地球观测的零样本视觉语言模型的首次在轨演示

Juan Manuel Delfa Victoria, Taran Cyriac John, Andrew W. Herson

发表机构 * NASA Jet Propulsion Laboratory (JPL)(美国宇航局喷气推进实验室) Loft Orbital(Loft Orbital公司)

AI总结 本文介绍NAVI-Orbital系统,在低地球轨道卫星上首次实现视觉语言模型的自主多模态推理,通过语义压缩解决数据下传瓶颈。

Comments 17 pages, 47 figures

详情
AI中文摘要

随着地球观测数据的生成速度超过下行链路带宽和人在回路处理能力,星载采集与可操作地面情报之间的差距日益扩大。本文介绍NAVI-Orbital,一个部署在低地球轨道(LEO)航天器上的软件系统。2026年4月16日,NAVI-Orbital实现了据作者所知首次在轨演示,即视觉语言模型完全在星上进行自主多模态推理。NAVI-Orbital使用本地视觉语言模型(Gemma 3)对每个捕获场景进行分类,生成其内容及特征间关系的文本描述,并通过自然语言对话响应操作员的后续查询。该系统通过纯英语提示替代传统指令序列进行任务重定向,并由基于图的状态机(LangGraph)编排,协调用于检测和对话的专用代理。地面基准测试(在7,960张图像的精选AID基准上准确率达88.16%)、Flatsat验证以及实时在轨捕获的新获取、未见过的地球图像(包括未校正的YAM-9图像,在星上通过硬件加速GPU推理处理且未对飞行仪器进行微调)的结果表明,在卫星级边缘计算机上运行基础模型是可行的,通过星上地球观测的语义压缩,颠覆了传统的先采集后全部下传的带宽模式。

英文摘要

As Earth Observation data generation outpaces downlink bandwidth and human-in-the-loop processing, a widening gap has emerged between onboard collection and actionable ground intelligence. This paper presents NAVI-Orbital, a software system deployed on a Low Earth Orbit (LEO) spacecraft. On April 16, 2026, NAVI-Orbital achieved what is, to the authors' knowledge, the first in-orbit demonstration of a vision-language model performing autonomous multi-modal inference entirely onboard. NAVI-Orbital uses a local vision-language model (Gemma 3) to classify each captured scene, produce a text description of its content and the relationships between its features, and respond to operator follow-up via natural-language dialogue. The system is re-tasked through plain-English prompts in place of conventional command sequences, and is orchestrated by a graph-based state machine (LangGraph) coordinating dedicated agents for detection and dialogue. Results across ground benchmarking (88.16% accuracy on the 7,960-image curated AID benchmark), Flatsat validation, and live in-orbit captures of newly acquired, previously unseen Earth imagery (including uncorrected YAM-9 imagery, processed onboard with hardware-accelerated GPU inference and no fine-tuning for the flight instrument) demonstrate the feasibility of running foundation models on satellite-class edge computers to invert the conventional acquire-then-downlink-everything bandwidth profile through semantic compression of Earth observations in-orbit.

2606.18598 2026-06-18 cs.AI cs.LG 新提交

Optimizing Lithium Production Decisions under Geological, Demand, and Pricing Uncertainties: A POMDP Framework for Multi-Objective Decision Making

在地质、需求和定价不确定性下优化锂生产决策:多目标决策的POMDP框架

Anna C. Edmonds, Mansur M. Arief, Robert J. Moss, Mykel J. Kochenderfer, Jef Caers

发表机构 * Computer Science Department, Stanford University(斯坦福大学计算机科学系) Aeronautics and Astronautics Department, Stanford University(斯坦福大学航空与航天系) Earth and Planetary Sciences Department, Stanford University(斯坦福大学地球与行星科学系)

AI总结 提出POMDP框架,通过信念状态规划优化锂矿开采决策,动态适应价格不确定性,实现更高需求满足和更平衡的经济环境效益。

Comments 24 pages, 14 tables, 4 figures

详情
AI中文摘要

锂生产中的决策制定具有挑战性,无论是从投资者角度还是战略生产角度。决定开采哪些矿山以及何时开采,不仅涉及地质和价格不确定性,还涉及提取方法选择的复杂性,从直接锂提取到硬岩开采。先前的工作探索了该问题的模型和优化采矿决策的不同方法;这些模型没有考虑定价不确定性、需求不确定性或提取锂的不同采矿技术。将不同的定价模型和提取技术纳入这些模型,可以制定更稳健的策略,不仅决定何时何地开采矿山,还决定采用哪种生产方法。我们将问题表述为部分可观测马尔可夫决策过程(POMDP),并使用信念状态规划方法求解以获得最优决策。在我们的研究中,我们表明POMDP求解器通过信念状态规划和显式不确定性管理,动态适应变化的锂价格机制(静态、线性、指数和随机),优于受人类启发的启发式方法。通过优化勘探、生产和技术选择的顺序,该框架在所有不同的定价和矿床情景下,在项目生命周期内实现了更高的需求满足和更平衡的经济与环境结果。

英文摘要

Decision making in lithium production is challenging, whether from an investor's perspective or a strategic production standpoint. Determining which mines to open and when to open them involves not only geological and price uncertainties, but also complexities around the choice of extraction method, from direct lithium extraction to hard rock mining. Prior work explored models of this problem and different methods to optimize mining decisions, but these models did not account for uncertainty in pricing, uncertainty in demand, or the different mining technologies used to extract lithium. Incorporating different pricing models and extraction technologies into these models enables more robust strategies for determining not only when and where to open a mine, but also which method of production to pursue. We frame the problem as a partially observable Markov decision process (POMDP) and solve it using belief-state planning methods to obtain optimal decisions. In our study, we show that POMDP solvers outperform human-inspired heuristics by dynamically adapting to shifting lithium price regimes (static, linear, exponential, and stochastic) through belief-state planning and explicit uncertainty management. By optimally sequencing exploration, production, and technology choice, the framework achieves higher demand fulfillment and more balanced economic and environmental outcomes over the project's lifetime across all pricing and deposit scenarios.
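
Belief-state planning rests on keeping a probability distribution over hidden states and revising it as observations arrive. As a minimal sketch, the code below applies a discrete Bayes update to a belief over the four price regimes named in the abstract. It assumes the regime is fixed over time (no transition model), and the likelihood values are made up for illustration; neither assumption comes from the paper.

```python
def update_belief(belief, likelihoods):
    """Bayes rule over a discrete hidden state.

    belief: state -> prior probability.
    likelihoods: state -> probability of the new observation in that state.
    Returns the normalized posterior (prior x likelihood).
    """
    unnormalized = {s: belief[s] * likelihoods[s] for s in belief}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under the belief")
    return {s: v / total for s, v in unnormalized.items()}


if __name__ == "__main__":
    regimes = ["static", "linear", "exponential", "stochastic"]
    prior = {r: 1.0 / len(regimes) for r in regimes}
    # Hypothetical likelihood of an observed sharp price rise per regime.
    price_jump = {"static": 0.1, "linear": 0.2, "exponential": 0.6, "stochastic": 0.1}
    posterior = update_belief(prior, price_jump)
    print(max(posterior, key=posterior.get))
```

A planner would then choose exploration, production, and technology actions against this posterior rather than against a single assumed price regime.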

2606.18803 2026-06-18 cs.AI cs.CY 新提交

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM: 面向工业网约车调度的效用对齐智能用户画像

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

发表机构 * Didichuxing Co. Ltd(滴滴出行科技有限公司)

AI总结 提出ProfiLLM,一种通过工具增强全局知识挖掘和效用对齐画像探索的智能LLM数据管道,解决工业网约车调度中大规模行为日志的用户画像问题,在滴滴生产系统中实现AUC提升6.14%、GMV提升4.35%。

详情
AI中文摘要

将大型语言模型(LLM)作为语义特征提取器引入工业网约车调度,处理平台规模的行为日志,是一个引人注目但尚未充分探索的数据系统问题。生产匹配管道仍然以结构化数值特征为主,但关键的行为信号(例如,驾驶员对某些区域的习惯性厌恶)本质上是上下文相关的,并且可以自然地表达为LLM生成的用户画像。然而,将这种画像扩展到实时的、毫秒级延迟的调度器面临三个相互交织的约束,这些约束很少被一起解决:在一个拥有数百万日订单量的平台上,日志超出任何LLM的上下文窗口数个数量级;大多数用户是长尾用户,交互太少无法进行单个用户画像;表面流畅的画像不一定能提高下游预测效用。我们提出了ProfiLLM,一个智能LLM数据管道,通过两个模块实现面向生产匹配系统的效用对齐用户画像。(1)工具增强全局知识挖掘:为LLM智能体配备27个分析工具,用于挖掘平台规模的数据,生成可复用的全局知识、自适应用户聚类规则和区域级供需先验。(2)效用对齐画像探索:为每个聚类生成多个候选画像,通过轻量级下游效用代理进行评估,迭代优化最佳候选,并为DPO微调构建偏好对。在滴滴生产调度器上部署后,ProfiLLM在结果预测中实现了高达+6.14%的相对AUC改进,在调度模拟中实现了高达+4.35%的GMV增长,并在14天在线A/B测试中持续改进,包括+0.47% GMV、+0.33%完成率和-0.82%接单前取消率。

英文摘要

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.

2606.18874 2026-06-18 cs.AI 新提交

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

通过研究框架将AI科学家的研究综合与验证外部化

Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

Affiliations * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; Suzhou Laboratory, Suzhou, China

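AI summary Proposes Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes, addresses claim drift in automated research, and validates the approach across multiple domains.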

Comments 65 pages, 14 figures, 19 tables

AI Chinese abstract

AI系统日益能够自动化科学工作流程,但连接先前证据、生成的想法、实验和最终声明的推理通常仍然隐含在模型推理中。这里我们介绍Xcientist,一个研究框架,将研究综合和实验验证外部化为可检查的、合同驱动的过程。Xcientist将文献证据、想法状态、实施计划、消融记录和修复痕迹组织为持久的研究工件,使得生成的机制可以在不丢失其证据基础的情况下被基础化、执行、测试和修订。我们将声明漂移识别为自动化研究的一种失败模式,其中可运行的工件不再支持最初声称的机制。在无训练记忆系统、图结构交通预测和多尺度物理信息神经网络中,Xcientist保留了从问题公式化到机制设计、验证和有限修订的可追踪轨迹。这些结果表明,AI科学家不仅应根据其最终工件进行评估,还应看其综合和验证过程是否可归因、可检查且在科学上可问责。

English abstract

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

2606.19118 2026-06-18 cs.AI cs.LG econ.GN q-fin.EC New submission

Analysing drivers and interdependencies in European electricity markets using XAI

使用XAI分析欧洲电力市场的驱动因素与相互依赖性

Antoine Pesenti, Aidan O'Sullivan

Affiliations * UCL Energy Institute, University College London, UK

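AI summary Combines deep neural networks with explainable AI (XAI) techniques, using SHAP and the SSHAP framework to analyse the determinants of electricity prices across 39 European bidding zones. Finds that renewables, particularly solar, play an important role in price formation, that gas prices remain the dominant driver, and that interconnections significantly shape price dynamics.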

Comments 12 pages

AI Chinese abstract

电力市场本质上是复杂系统,具有强非线性、高维交互以及跨区域日益增长的相互依赖性。虽然深度神经网络(DNN)在电价预测方面表现出强大的能力,但其缺乏可解释性限制了其在理解电价形成潜在驱动因素方面的实用性。本文通过将DNN模型与可解释人工智能(XAI)技术相结合,分析了39个欧洲竞价区电价的决定因素,填补了这一空白。我们采用SHAP(SHapley Additive exPlanations)量化特征贡献,并应用和扩展了SSHAP(一种聚合框架)以提高高维设置下的可解释性。分析表明,可再生能源(尤其是太阳能)在电价形成中发挥着不成比例的重要作用,尽管其在总发电量中占比较低。天然气价格仍然是跨电力市场的主导且一致的驱动因素,而互联互通显著影响价格动态,凸显了欧洲电力系统的强相互依赖性。此外,我们构建了一个合成性的全欧盟电力市场,以探索完全一体化单一价格市场的反事实情景。

English abstract

Electricity markets are inherently complex systems characterised by strong nonlinearities, high-dimensional interactions, and increasing interdependence across regions. While deep neural networks (DNNs) have demonstrated strong predictive capabilities for electricity prices, their lack of interpretability limits their usefulness for understanding the underlying drivers of price formation. This paper addresses this gap by combining DNN models with explainable artificial intelligence (XAI) techniques to analyse the determinants of electricity prices across 39 European bidding zones. We employ SHAP (SHapley Additive exPlanations) to quantify feature contributions, and we apply and extend SSHAP, an aggregation framework, to improve interpretability in high-dimensional settings. The analysis finds that renewable energy sources, particularly solar, play a disproportionately important role in price formation despite their lower share in total power generation. Gas prices remain a dominant and consistent driver across electricity markets, while interconnections significantly shape price dynamics, highlighting the strong interdependence of European electricity systems. In addition, a synthetic EU-wide electricity market is constructed to explore the counterfactual scenario of a fully integrated market with a single price.
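SHAP attributions are built on Shapley values, which split a model's output change relative to a baseline fairly across its features. The sketch below computes exact Shapley values for a tiny hand-written function by enumerating feature subsets. It only illustrates the underlying quantity: the `toy_model` function, the inputs, and the all-zero baseline are hypothetical, and it reproduces neither the paper's DNN pipeline nor the SSHAP aggregation step.

```python
from itertools import combinations
from math import factorial


def shapley_values(f, x, baseline):
    """Exact Shapley values of f at x, with absent features set to baseline values."""
    n = len(x)

    def value(subset):
        # Features in `subset` take their real value, the rest take the baseline.
        z = [x[i] if i in subset else baseline[i] for i in range(n)]
        return f(z)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for k in range(n):  # k = size of the coalition that excludes feature i
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for s in combinations(others, k):
                s = set(s)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


def toy_model(z):
    # Hypothetical model: one additive term plus one interaction term.
    return 2.0 * z[0] + z[1] * z[2]


x = [1.0, 2.0, 3.0]
baseline = [0.0, 0.0, 0.0]
phi = shapley_values(toy_model, x, baseline)
print(phi)
```

The additive feature receives its full contribution of 2, while the interaction term `z[1] * z[2] = 6` is split evenly between the two interacting features. The attributions sum to `toy_model(x) - toy_model(baseline) = 8`. Exact enumeration grows exponentially with the number of features, which is why practical SHAP tooling relies on approximations or model-specific algorithms.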