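arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Indexed for today / current date: 26. Sources: cs.AI, cs.CL, cs.LG, cs.SE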
2606.18543 2026-06-18 cs.AI cs.CL cs.SE New submission 90%

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

Affiliations * Princeton University

Topic match (planning and decision-making): simulated task of operating a startup for 500 days

AI summary: Proposes CEO-Bench, which evaluates the integrated decision-making of language model agents in long-horizon, uncertain, dynamic environments by simulating the task of operating a startup for 500 days.

Abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.

2606.18633 2026-06-18 cs.MA New submission 85%

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi

Topic match (planning and decision-making): multi-agent planner for personalized programming learning

AI summary: Proposes PersonalPlan, a two-stage multi-agent planner that uses hierarchical SFT and reward-adaptive GRPO to generate executable, personalized, and pedagogically scaffolded plans, outperforming existing methods on the MAP-PPL dataset.

Abstract

Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill this gap, we first introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned multi-agent planning dataset with 3,043 query-profile-plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.

2605.30880 2026-06-18 cs.CL cs.AI Version update 85%

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations * Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

Topic match (planning and decision-making): executable world models for agent planning and prediction

AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, producing symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success on AgentGym environments.

Comments 40 pages

Abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.
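The idea of checking an executable world model against offline trajectories can be illustrated with a small sketch. This is not PatchWorld's code: the function `first_counterexample`, the toy model, and the trajectory format are all hypothetical, chosen only to show how replaying a trajectory through a program locates the step where a repair would be needed.

```python
def first_counterexample(step_fn, init_state, trajectory):
    """Replay (action, observation) pairs and return the index of the
    first step the model mispredicts, or None if it reproduces them all."""
    state = init_state
    for i, (action, observed) in enumerate(trajectory):
        state, predicted = step_fn(state, action)
        if predicted != observed:
            return i
    return None


def toy_model(state, action):
    # Hypothetical belief state: the set of items believed to be held.
    # This toy model handles "take" but has no rule for "drop".
    if action.startswith("take "):
        state = state | {action[5:]}
    return state, "holding: " + ", ".join(sorted(state))


trajectory = [("take key", "holding: key"), ("drop key", "holding: ")]
idx = first_counterexample(toy_model, frozenset(), trajectory)
```

Here the replay fails at the "drop key" step, which is the kind of localized mismatch a repair step could target while leaving the rest of the program untouched.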

2603.00656 2026-06-18 cs.AI Version update 85%

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

Affiliations * Peking University; The Hong Kong University of Science

Topic match (planning and decision-making): information-driven policy optimization for user-centric agents

AI summary: To address credit assignment problems and insufficient advantage signals in multi-turn interaction, proposes InfoPO, which combines an information-gain reward with adaptive variance-gated fusion and outperforms existing baselines on tasks such as intent clarification and collaborative coding.

Abstract

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
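A minimal sketch of a turn-level information-gain signal in the spirit of the description above. The abstract does not say which divergence InfoPO uses, so the KL divergence here is an assumption, and the names `kl_divergence` and `info_gain_reward` are hypothetical. The sketch only shows the comparison between the action distribution with feedback and a masked-feedback counterfactual.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def info_gain_reward(dist_with_feedback, dist_masked):
    # Assumed form of the reward: how far the real feedback moves the
    # next-action distribution away from the masked-feedback counterfactual.
    return kl_divergence(dist_with_feedback, dist_masked)


# A turn whose feedback changes nothing, and one whose feedback sharpens the choice.
uninformative = info_gain_reward([0.5, 0.5], [0.5, 0.5])
informative = info_gain_reward([0.9, 0.1], [0.5, 0.5])
```

Under this assumed form, a turn whose feedback leaves the action distribution unchanged earns no credit, while a turn that shifts it earns a positive reward.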

2603.00026 2026-06-18 cs.CL cs.AI cs.IR Version update 85%

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

Affiliations * State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

Topic match (planning and decision-making): combining memory retrieval with reasoning; active causal reasoning

AI summary: Proposes the ActMem framework, which transforms unstructured dialogue history into a structured causal and semantic graph and applies counterfactual reasoning and commonsense completion to enable active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.

Abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2510.05107 2026-06-18 cs.AI Version update 85%

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Myung Ho Kim

Affiliations * JEI University

Topic match (planning and decision-making): Structured Cognitive Loop for accountable LLM agent behavior

AI summary: Proposes the Structured Cognitive Loop (SCL) architecture, which separates cognition, memory, control, and action into distinct modules to make LLM agent behavior accountable; it reaches 86.3% task success across 360 episodes, outperforming baseline methods.

Comments This revised version extends the original SCL framework from a behavioral architecture for reliable LLM agents into a broader architecture of epistemic accountability, integrating context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment structure

Abstract

The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop (SCL) as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint-guided image generation. Across 360 episodes, SCL achieves 86.3 percent task success compared with 70.5 to 76.8 percent for prompt-based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.
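The controller role described above (check preconditions, block redundant actions, authorize before a tool runs) can be sketched as a small gate. This is an illustrative assumption, not the paper's implementation: the `Controller` class, its fields, and the example tool name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Controller:
    """Hypothetical gate between a model's proposals and tool execution."""
    memory: dict = field(default_factory=dict)   # verified state, keyed by fact name
    executed: set = field(default_factory=set)   # (tool, args) pairs already run

    def authorize(self, tool, args, requires=()):
        key = (tool, tuple(sorted(args.items())))
        missing = [r for r in requires if r not in self.memory]
        if missing:
            return False, f"missing preconditions: {missing}"
        if key in self.executed:
            return False, "redundant action"
        self.executed.add(key)
        return True, "authorized"


ctrl = Controller()
call = ("book_flight", {"city": "Seoul"})

first = ctrl.authorize(*call, requires=("dates",))   # precondition not yet verified
ctrl.memory["dates"] = "2026-07-01"                  # external memory records verified state
second = ctrl.authorize(*call, requires=("dates",))  # now permitted
third = ctrl.authorize(*call, requires=("dates",))   # same call again is blocked
```

Each decision comes back with a reason string, so the record of why an action was or was not permitted sits alongside the action itself.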

2606.18847 2026-06-18 cs.AI New submission 80%

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

Affiliations * HKUST(GZ); HKUST; Knowin

Topic match (planning and decision-making): long-term memory and task planning for embodied agents

AI summary: Proposes the WorldLines benchmark, which builds temporally extended household traces (dialogues, actions, state changes, and more) to evaluate the long-term memory and task planning of embodied agents, and designs the ObsMem memory framework to improve state-aware decisions.

Comments 27 pages, 18 figures

Abstract

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

2606.18746 2026-06-18 cs.AI New submission 80%

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

Affiliations * Carnegie Mellon University; Georgia Institute of Technology

Topic match (planning and decision-making): formal analysis of the memory requirements of generalist agents

AI summary: Formally argues that, to act near-optimally across multiple environments and goals, generalist agents must store domain-relevant information in order to distinguish incompatible optimal actions at an observational bottleneck, and shows that memory can be used to reconstruct local transition dynamics.

Abstract

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

2606.18105 2026-06-18 cs.NI cs.LG New submission 80%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

Affiliations * Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

Topic match (planning and decision-making): adaptive framework that dynamically selects solvers for planning

AI summary: Proposes OmniPlan, an adaptive framework that uses a large language model to interpret user intent and a mixture-of-experts architecture to dynamically select among MIP solvers, heuristics, and deep reinforcement learning models, achieving timely and near-optimal network planning optimization; on distributed ML inference offloading it reduces latency by up to 97.8% and resource consumption by up to 11.5%.

Comments Accepted by ACM KDD 2026

Abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to a trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.
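To make the preference-vector idea concrete, here is a minimal sketch of choosing an expert from a two-component preference (weight on execution time, weight on optimality gap). It is only an illustration under stated assumptions: the abstract describes a learned mixture-of-experts selection, whereas this uses a fixed weighted-cost rule, and the expert profiles, numbers, and the name `select_expert` are invented for the example.

```python
# Hypothetical normalized profiles: relative execution time and optimality gap.
EXPERTS = {
    "mip_solver": {"time": 1.00, "gap": 0.00},
    "heuristic": {"time": 0.05, "gap": 0.30},
    "drl_model": {"time": 0.10, "gap": 0.10},
}


def select_expert(preference, experts=EXPERTS):
    """Pick the expert with the lowest preference-weighted cost.

    preference is (w_time, w_gap): how much the user cares about
    speed versus closeness to the optimum.
    """
    w_time, w_gap = preference

    def cost(name):
        profile = experts[name]
        return w_time * profile["time"] + w_gap * profile["gap"]

    return min(experts, key=cost)


optimality_first = select_expert((0.0, 1.0))
speed_first = select_expert((1.0, 0.0))
balanced = select_expert((0.5, 0.5))
```

With these made-up profiles, a user who only cares about optimality is routed to the exact solver, one who only cares about speed to the heuristic, and a balanced preference to the DRL model.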

2606.17453 2026-06-18 cs.AI New submission 80%

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

Affiliations * University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

Topic match (planning and decision-making): evaluating how well map agents satisfy implicit needs

AI summary: Proposes the MapSatisfyBench benchmark, which evaluates the satisfaction awareness of map agents by recovering implicit decision factors from user behavior chains; experiments show that existing agents perform well on explicit task completion but remain limited in satisfying implicit needs.

Abstract

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2606.14202 2026-06-18 cs.NE cs.AI New submission 80%

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

Affiliations * School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham

Topic match (planning and decision-making): automatic heuristic design framework combining evolution and metacognition

AI summary: Proposes the MeEvo framework, which cyclically couples Natural Evolution (exploring heuristic code) with Metacognitive Evolution (reflecting on history to generate improved heuristics) to address the weak knowledge inheritance and limited exploration of existing methods, and performs better on five optimization problems.

Abstract

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.

2605.22142 2026-06-18 cs.LG cs.AI Version update 80%

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Taewoon Kim, Vincent François-Lavet, Michael Cochez

Topic match (planning and decision-making): memory transfer in reinforcement learning, part of agent decision-making

AI summary: Studies short-term-to-long-term memory transfer for knowledge graphs under partial observability, casting it as a neuro-symbolic value-based decision in which the agent keeps or drops each observed triple before long-term insertion; the learned decisions outperform symbolic and neural baselines on the RoomKG benchmark.

Abstract

Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.

2604.03208 2026-06-18 cs.LG Version update 80%

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas

Affiliations * FAIR at Meta; New York University; Mila - Québec AI Institute; Brown University

Topic match (planning and decision-making): hierarchical world models for long-horizon planning, part of agent planning

AI summary: Proposes the HWM architecture, which performs hierarchical model predictive control using latent world models at multiple temporal scales and latent matching, addressing the failure of single-level planning and the exponential search cost on long-horizon tasks.

Abstract

World models are a promising path to zero-shot embodied control through planning. However, existing world model planners struggle on long-horizon, multi-stage tasks: prediction errors compound and naive search is exponential in the planning horizon. Hierarchy mitigates both by decomposing tasks into shorter, tractable subproblems; yet prior hierarchical approaches either amortize control into task-specific policies (hierarchical RL) or assume low-dimensional states and known dynamics (classical hierarchical MPC). We present Hierarchical Planning with Latent World Models (HWM), an architecture and planning paradigm for hierarchical model predictive control (MPC) directly on visual world models trained solely via next-latent prediction. HWM learns world models at multiple temporal scales within a shared latent space, so predictions from the long-horizon model serve as subgoals for the short-horizon model via latent matching, without task-specific rewards, skill learning, or hierarchical policies. To keep long-horizon search tractable, HWM learns an action encoder that compresses primitive action chunks into latent macro-actions. On real-world Franka manipulation, HWM solves pick-and-place from a single goal image at 70% success vs. 0% for single-level planning. Across simulated push manipulation and maze navigation, HWM consistently improves performance on long-horizon tasks while requiring up to 3x less planning compute.

2411.10399 2026-06-18 cs.GT cs.CR cs.DC Version update 80%

Game Theoretic Liquidity Provisioning in Concentrated Liquidity Market Makers

Weizhao Tang, Rachid El-Azouzi, Cheng Han Lee, Ethan Chan, Giulia Fanti

Topic match (planning and decision-making): game-theoretic model for analyzing liquidity provision strategies

AI summary: Builds a game-theoretic model of the strategic interaction among liquidity providers in concentrated liquidity market makers, shows that it reduces to a game of linear complexity with a unique Nash equilibrium that follows a waterfilling strategy, and finds from real-world data that LP strategies deviate from the equilibrium and that moving closer to it can improve daily returns.

Abstract

Automated market makers (AMMs) are a class of decentralized exchanges that enable the automated trading of digital assets. They accept deposits of digital tokens from liquidity providers (LPs); tokens can be used by traders to execute trades, which generate fees for the investing LPs. The distinguishing feature of AMMs is that trade prices are determined algorithmically, unlike classical limit order books. Concentrated liquidity market makers (CLMMs) are a major class of AMMs that offer liquidity providers flexibility to decide not only how much liquidity to provide, but in what ranges of prices they want the liquidity to be used. This flexibility can complicate strategic planning, since fee rewards are shared among LPs. We formulate and analyze a game-theoretic model to study the incentives of LPs in CLMMs. Our main results show that while our original formulation admits multiple Nash equilibria and has complexity quadratic in the number of price ticks in the contract, it can be reduced to a game with a unique Nash equilibrium whose complexity is only linear. We further show that the Nash equilibrium of this simplified game follows a waterfilling strategy, in which low-budget LPs use up their full budget, but rich LPs do not. Finally, by fitting our game model to real-world CLMMs, we observe that in liquidity pools with risky assets, LPs adopt investment strategies far from the Nash equilibrium. Under price uncertainty, they generally invest in fewer and wider price ranges than our analysis suggests, with lower-frequency liquidity updates. We show that across several pools, by updating their strategy to more closely match the Nash equilibrium of our game, LPs can improve their median daily returns by $116, which corresponds to an increase of 0.009% in median daily return on investment.

2606.18888 2026-06-18 cs.AI 新提交 75%

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

部分可观测环境下导航的生成模型预测规划

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

发表机构 * University of Manchester(曼彻斯特大学) Aalto University(阿尔托大学)

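Topic match (planning and decision-making): generative-model predictive planning for navigation

AI summary: Proposes BeliefDiffusion, a framework that combines diffusion models with model predictive control to explicitly model multimodal belief distributions and plan ahead. In synthetic map environments it significantly outperforms model-free reinforcement learning baselines and other generative approaches.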

详情
AI中文摘要

部分可观测环境中的导航对自主智能体构成重大挑战,需要在未知环境中利用有限的感知信息做出有效决策。基于信念的方法,特别是那些使用神经网络近似信念空间的方法,往往无法捕捉信念空间固有的多模态性,尤其是在具有感知混淆的高维情况下。虽然生成模型提供了一种有吸引力的替代方案,但它们通常需要大量数据或专家演示,并且缺乏长期规划的显式机制。在本文中,我们介绍了BeliefDiffusion,一种结合了生成和规划优势的新框架。BeliefDiffusion利用扩散模型显式表征多模态信念分布,并利用模型预测控制(MPC)同时进行前瞻规划。它包含两个步骤:(1)基于观测历史想象合理的环境配置;(2)在聚合的配置上规划高效的导航策略。通过在合成地图环境中的大量实验,我们证明BeliefDiffusion在导航成功率和路径效率上显著优于无模型强化学习基线和其它生成方法。我们的结果验证了将多模态信念表示显式纳入规划能够在部分可观测设置中实现更鲁棒的导航。

英文摘要

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
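The two steps described in the abstract (imagine configurations, then plan across them) can be sketched in a toy form. The Python sketch below is an illustration under strong assumptions and is not the authors' implementation: a hard-coded list of sampled maps stands in for draws from a diffusion model, and an exhaustive one-step lookahead with breadth-first search stands in for MPC. All names (`plan_first_move`, `samples`, `penalty`) are hypothetical.

```python
from collections import deque

MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def shortest_path_length(blocked, start, goal, size):
    """Breadth-first search on a size x size grid; returns None if unreachable."""
    if start in blocked:
        return None
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        cell, dist = queue.popleft()
        if cell == goal:
            return dist
        for dr, dc in MOVES.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if (0 <= nxt[0] < size and 0 <= nxt[1] < size
                    and nxt not in blocked and nxt not in seen):
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None


def plan_first_move(sampled_maps, start, goal, size, penalty=100):
    """Pick the move with the lowest average cost-to-go over all sampled maps."""
    best_move, best_cost = None, float("inf")
    for name, (dr, dc) in MOVES.items():
        nxt = (start[0] + dr, start[1] + dc)
        if not (0 <= nxt[0] < size and 0 <= nxt[1] < size):
            continue  # the move would leave the grid
        total = 0
        for blocked in sampled_maps:
            length = shortest_path_length(blocked, nxt, goal, size)
            # A move into a wall or a dead end is charged a fixed penalty.
            total += penalty if length is None else 1 + length
        average = total / len(sampled_maps)
        if average < best_cost:
            best_move, best_cost = name, average
    return best_move


# Three hypothetical samples of the unseen map (sets of blocked cells).
samples = [{(0, 1)}, {(0, 1), (1, 1)}, {(1, 0)}]
print(plan_first_move(samples, start=(0, 0), goal=(2, 2), size=3))
```

Averaging the cost over several sampled maps, rather than planning on a single best guess, is the part of the sketch that corresponds to keeping a multimodal belief in the planning loop.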

2606.19214 2026-06-18 econ.GN q-fin.EC 新提交 70%

Testing Centralized and Polycentric Computational Planning

测试集中式和多中心计算规划

Ricardo Alonzo Fernández Salguero

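Topic match (planning and decision-making): compares a computational planner with an agent-based market, which involves planning decisions

AI summary: Presents a reproducible synthetic benchmark that compares a computational planner, an agent-based market, and a hybrid meta-market in a simulated economy. The planner achieves lower welfare losses, but the results are influenced by design choices; the main contribution is methodological rather than ideological.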

详情
AI中文摘要

本文提出了一个可复现的合成基准,在共同的模拟经济中比较计算规划者、基于代理的市场和混合元市场。该基准包含投入产出生产网络、异质企业、产能约束、内生价格、福利指标、结构性冲击、对抗性压力测试和信息报告实验。在训练、保留和对抗性场景中,规划者始终比分散化替代方案实现更低的福利损失。主要贡献是方法论而非意识形态的。虽然该基准展示了一个可证伪的框架用于比较经济协调机制,但它并未确立规划的实证优越性。若干设计选择机械地偏向规划者,包括信息不对称、不完整的市场表示和简化的制度假设。因此,结果应被解释为对合成实验架构的验证,以及作为未来研究的原型。本文最后概述了一个基于实证校准、结构性保留、敏感性分析、不确定性量化、机制设计测试和独立复制的验证议程。

英文摘要

This paper presents a reproducible synthetic benchmark comparing a computational planner, an agent-based market, and a hybrid meta-market within a common simulated economy. The benchmark incorporates input-output production networks, heterogeneous firms, capacity constraints, endogenous prices, welfare metrics, structural shocks, adversarial stress testing, and information-reporting experiments. Across training, holdout, and adversarial scenarios, the planner consistently achieves lower welfare losses than the decentralized alternatives. The main contribution is methodological rather than ideological. While the benchmark demonstrates a falsifiable framework for comparing economic coordination mechanisms, it does not establish the empirical superiority of planning. Several design choices mechanically favor the planner, including informational asymmetries, incomplete market representation, and simplified institutional assumptions. The results should therefore be interpreted as validation of a synthetic experimental architecture and as a prototype for future research. The paper concludes by outlining a validation agenda based on empirical calibration, structural holdouts, sensitivity analysis, uncertainty quantification, mechanism-design tests, and independent replication.

2606.18963 2026-06-18 cs.LG 新提交 70%

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

无环境奖励的固定通道感知事件流在线奖惩学习

Zirong Li

发表机构 * Zirong Li

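Topic match (planning and decision-making): proposes an online reward-punishment learning framework without environment rewards

AI summary: Proposes the OHIRL framework for online reward-punishment learning from fixed-channel perceptual streams without scalar rewards. A fixed internal trajectory evaluator is used to infer the valence of perceptual dimensions, and the framework reaches high accuracy on an XOR task and on control tasks such as CartPole.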

Comments 9 pages, 5 figures, 6 tables; 13-page technical supplement

详情
AI中文摘要

我们研究当环境不提供标量奖励或评估标签时的在线奖惩学习。在每一步,智能体仅接收一个固定通道的感知数据包,诸如疼痛、能量、接触、损伤或认知错误等量被视为感知维度,其效价必须从转移后果中推断。OHIRL分离了四个角色:M_psi学习下一数据包预测,D_omega建模残差动力学,C_eta是一个固定的内部转移后轨迹评估器,B_xi学习使用由此产生的价值证据进行后续策略更新和动作评分。C_eta采用恢复正性、持久/增长负性的残差调节取向;系数来源审计显示,等单元、原始等值和随机单调变体保留了超过92%的已发布顶级动作排名,而符号反转保留了0%。无奖励协议暴露观察转移,同时隐藏环境奖励、延迟外部评估器、成功标签和动作好坏标签。条件误差分解将B_xi的证据估计误差与残差策略优化误差分离。在2x2-XOR数据包任务中,药物和辣椒在视觉XOR上下文中获得相反的价值,并且相同的疼痛或辣度增加可能根据后果结构为正或负;B_xi达到0.952的平衡奖励符号准确率。在完整的在线交错审计中,M_psi达到留出R2=0.907,B_xi达到0.940的符号准确率,策略达到0.979的最优动作准确率,而即时数据包分数、预测误差奖励、打乱目标、零奖励和误差减少控制均崩溃。隐藏奖励的CartPole和Taxi控制、公共上下文无泄漏审计以及模块角色消融进一步测试了信息边界和组件必要性。

英文摘要

We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 新提交 70%

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero: 通过LLM智能体发现RL后训练的自适应训练策略

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

发表机构 * Amazon(亚马逊)

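Topic match (planning and decision-making): uses LLM agents with tree search to discover training strategies

AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training. It reveals that capacity parameters accumulate monotonically while regularization parameters oscillate, and on 4 GRPO tasks it improves over the base model by 9% to 140% (relative).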

详情
AI中文摘要

RL后训练策略依赖于数据集,并揭示了一个反复出现的经验模式:容量参数在阶段间单调累积,而正则化参数主要根据训练动态的变化而振荡。这种区别很重要,因为固定调度将所有参数提交到固定轨迹,因此无法表达正则化必须跟踪的非平稳探索-利用权衡;该原则为多阶段训练提供了可操作的设计规则。我们通过LLMZero发现了这一点,该系统通过树搜索让LLM智能体搜索训练轨迹,诊断每个检查点的病理并提出协调的多参数转换。在4个不同的GRPO任务中,LLMZero发现的策略相对基础模型提升9%到140%,相对网格搜索提升6%到15%,始终优于随机搜索和基于技能的智能体。该结构原则跨任务迁移,解释了为什么发现的策略形式不同但参数动态相似。

英文摘要

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2510.03635 2026-06-18 eess.SY cs.SY 版本更新 70%

Cyber Resilience of Three-phase Unbalanced Distribution System Restoration under Sparse Adversarial Attack on Load Forecasting

三相不平衡配电系统恢复在负荷预测稀疏对抗攻击下的网络弹性

Chen Chao, Zixiao Ma, Ziang Zhang

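Topic match (planning and decision-making): restoration planning under attack, which involves decision-making

AI summary: Quantifies the impact of adversarial attacks on load forecasting, proposes a gradient-based sparse attack, and builds a restoration-aware validation framework that reveals system-level failures, offering insights for designing cybersecurity-aware restoration planning.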

Comments 10 pages, 7 figures

详情
AI中文摘要

系统恢复对于电力系统弹性至关重要,然而,其对基于人工智能的负荷预测的日益依赖引入了显著的网络安全风险。不准确的预测可能导致不可行的规划、电压和频率违规以及断电段落的恢复失败,但恢复过程对此类攻击的弹性在很大程度上仍未探索。本文通过量化对抗性操纵的预测如何影响恢复可行性和电网安全性来填补这一空白。我们开发了一种基于梯度的稀疏对抗攻击,该攻击策略性地扰动最具影响力的时空输入,在保持隐蔽性的同时暴露预测模型的脆弱性。我们进一步创建了一个恢复感知验证框架,将这些受损的预测嵌入到顺序恢复模型中,并使用不平衡三相最优潮流公式评估操作可行性。仿真结果表明,所提出的方法比基线攻击更高效、更隐蔽。它揭示了系统级故障,例如电压和功率爬坡违规,这些故障阻止了关键负荷的恢复。这些发现为设计网络安全感知的恢复规划框架提供了可行的见解。

英文摘要

System restoration is critical for power system resilience; nonetheless, its growing reliance on artificial intelligence (AI)-based load forecasting introduces significant cybersecurity risks. Inaccurate forecasts can lead to infeasible planning, voltage and frequency violations, and unsuccessful recovery of de-energized segments, yet the resilience of restoration processes to such attacks remains largely unexplored. This paper addresses this gap by quantifying how adversarially manipulated forecasts impact restoration feasibility and grid security. We develop a gradient-based sparse adversarial attack that strategically perturbs the most influential spatiotemporal inputs, exposing vulnerabilities in forecasting models while maintaining stealth. We further create a restoration-aware validation framework that embeds these compromised forecasts into a sequential restoration model and evaluates operational feasibility using an unbalanced three-phase optimal power flow formulation. Simulation results show that the proposed approach is more efficient and stealthier than baseline attacks. It reveals system-level failures, such as voltage and power ramping violations, that prevent the restoration of critical loads. These findings provide actionable insights for designing cybersecurity-aware restoration planning frameworks.

2402.08128 2026-06-18 cs.AI cs.GT 版本更新 70%

Recursive Joint Simulation in Games

博弈中的递归联合模拟

Vojtech Kovarik, Caspar Oesterheld, Vincent Conitzer

发表机构 * Foundations of Cooperative AI Lab (FOCAL), Computer Science Department(合作人工智能基础实验室(FOCAL),计算机科学系) Carnegie Mellon University(卡内基梅隆大学) AI Center(人工智能中心) Czech Technical University(捷克技术大学) Center for Theoretical Study(理论研究中心) Charles University(查理大学)

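Topic match (planning and decision-making): studies cooperation among AI agents through recursive joint simulation

AI summary: Studies how AI agents can achieve cooperation through recursive joint simulation and shows that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, so existing results such as the folk theorems transfer directly.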

详情
AI中文摘要

AI智能体之间的博弈动力学可能以多种方式不同于传统的人类-人类互动。其中一个差异是,可能能够精确模拟一个AI智能体,例如因为其源代码已知。这样的智能体将从根本上不确定自己是在现实世界还是在模拟中。我们的目标是探索利用这种可能性在战略环境中实现更合作的结果。在本文中,我们研究了AI智能体之间的交互,其中智能体运行递归联合模拟。也就是说,智能体首先共同观察它们所面临情境的模拟。这个模拟递归地包含额外的模拟(带有小的失败概率以避免无限递归),并且在选择行动之前观察所有这些嵌套模拟的结果。我们表明,由此产生的交互在策略上等价于原始博弈的无限重复版本,允许直接转移现有结果,如各种民间定理。作为该等价性稳健性的证据,我们表明即使放宽一些假设,它仍然成立,并且“从内部”也成立——即对于发现自己处于博弈中并具有自定位不确定性的智能体而言。

英文摘要

Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds "from the inside", that is, for an agent that finds itself inside the game and has self-locating uncertainty.
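The remark that a small chance of failure avoids infinite recursion can be illustrated numerically. The Python sketch below assumes, purely for illustration, that each level independently fails to spawn the next simulation with a fixed probability; the nesting depth is then geometrically distributed and finite with probability 1. The independence assumption and the function names are hypothetical, not details taken from the paper.

```python
import random


def expected_depth(failure_prob):
    """Mean number of nested simulations when each level fails to spawn
    the next one independently with probability `failure_prob`."""
    return (1.0 - failure_prob) / failure_prob


def sample_depth(failure_prob, rng):
    """Draw one nesting depth by repeatedly flipping the 'spawn next level' coin."""
    depth = 0
    while rng.random() >= failure_prob:
        depth += 1
    return depth


rng = random.Random(0)
draws = [sample_depth(0.1, rng) for _ in range(20000)]
# Closed-form mean next to the empirical mean of the simulated depths.
print(expected_depth(0.1), sum(draws) / len(draws))
```

The per-level survival probability in this toy model plays a role loosely similar to a continuation probability in a repeated game, which gives some intuition for the equivalence stated in the abstract.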

2606.19134 2026-06-18 cs.LG cs.AI 新提交 65%

Pareto Q-Learning with Reward Machines

带奖励机的帕累托Q学习

Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières

发表机构 * Linköping University, Sweden(瑞典林雪平大学) Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France(法国里尔大学、CNRS、中央里尔学院、UMR 9189 CRIStAL、法国里尔) Univ. Toulouse, INRAE-MIAT, Toulouse, France(法国图卢兹大学、INRAE-MIAT、图卢兹)

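Topic match (planning and decision-making): a multi-objective reinforcement learning algorithm for agent decision-making

AI summary: Proposes PQLRM, which combines Pareto Q-Learning with reward machines to approximate the Pareto front sample-efficiently in multi-objective reinforcement learning while handling non-Markovian rewards.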

Comments Accepted at the ICAPS 2026 Workshop on Bridging the Gap Between AI Planning and (Reinforcement) Learning (PRL)

详情
AI中文摘要

我们提出了带奖励机的帕累托Q学习(PQLRM),这是一种用于任务的多目标强化学习算法,其奖励结构由一组奖励机(RMs)指定。PQLRM结合了帕累托Q学习(PQL)(该方法维护向量值Q估计的集合以逼近帕累托前沿)和带奖励机的Q学习(QRM)的增强(该方法利用奖励信号的因子化自动机结构)。这产生了一种多策略算法,在非马尔可夫、RM编码的奖励下保持样本效率。实验表明,PQLRM比应用于叉积MDP的朴素PQL基线收敛更快,并且可以合成QRM无法获得的帕累托最优策略。

英文摘要

We present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for tasks whose reward structure is specified by a set of reward machines (RMs). PQLRM combines Pareto Q-Learning (PQL), which maintains sets of vector-valued Q-estimates to approximate the Pareto front, with enhancements from Q-Learning with Reward Machines (QRM), which exploits the factored automaton structure of the reward signal. This yields a multi-policy algorithm that remains sample-efficient under non-Markovian, RM-encoded rewards. Experimental trials show that PQLRM converges faster than a naive PQL baseline applied to the cross-product MDP and can synthesize Pareto-optimal policies that QRM cannot.
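Maintaining sets of vector-valued Q-estimates relies on filtering out dominated vectors. The Python sketch below shows only that building block, assuming every objective is to be maximized; it is not the PQLRM update itself, and the example vectors are hypothetical.

```python
def dominates(u, v):
    """True if u is at least as good as v on every objective and better on one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))


def pareto_front(vectors):
    """Keep only the vectors that no other vector dominates."""
    return [u for u in vectors if not any(dominates(v, u) for v in vectors)]


# Hypothetical two-objective value estimates for one state-action pair.
q_estimates = [(1.0, 5.0), (2.0, 4.0), (1.5, 3.0), (3.0, 1.0), (0.5, 0.5)]
print(pareto_front(q_estimates))
```

Here `(1.5, 3.0)` and `(0.5, 0.5)` are dropped because `(2.0, 4.0)` is better on both objectives, while the remaining three vectors represent genuine trade-offs.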

2606.18537 2026-06-18 cs.LG 新提交 65%

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

入乡随俗:从异构智能体学习通用行为

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

发表机构 * University of Washington(华盛顿大学) NVIDIA(英伟达)

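Topic match (planning and decision-making): extracts a general reward to train a generalist agent

AI summary: Proposes GRID, which extracts a general reward from heterogeneous demonstrators pursuing different goals and trains a generalist agent that learns universal environmental competencies, avoids mode-averaging bias, and makes fine-tuning on downstream tasks more efficient.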

详情
AI中文摘要

人类通常通过观察他人来获取新技能,因为观察到的行为隐含地揭示了如何在环境中行动。然而,从异构群体中获得的观察会引入冲突的行为信号,使得难以确定哪些行为值得模仿。我们通过通用奖励推断与解耦(GRID)来解决这一挑战,这是一种从追求不同目标的异构示范者群体中提取普遍有用行为的社会学习方法。GRID将每个智能体的奖励函数分解为通用奖励(捕捉所有智能体共享的行为)和特定奖励(捕捉个体偏好和目标)。仅基于通用奖励进行训练提供了一种通用预训练的新范式。它产生了一个通用智能体,该智能体内化了通用的环境能力,如安全性和基本任务熟练度,而不会出现困扰标准从示范学习技术的模式平均偏差。这个通用智能体作为微调到下游任务(包括训练中未见过的偏好)的优越先验。在合成基函数分解、多智能体Craftax和连续自动驾驶模拟器(Highway-Env)上的实验证实,GRID以语义上有意义的方式成功解耦了奖励结构,优于标准的从示范学习基线,并实现了更高效和稳定的特化。

英文摘要

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

2603.09344 2026-06-18 cs.AI stat.ML 版本更新 65%

Robust Regularized Policy Iteration under Transition Uncertainty

转移不确定性下的鲁棒正则化策略迭代

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

发表机构 * College of Computer Science and Technology, Zhejiang University, Hangzhou, China(浙江大学计算机科学与技术学院) School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China(西北工业大学人工智能、光学与电子学院(iOPEN)) School of Software Technology, Zhejiang University, Hangzhou, China(浙江大学软件技术学院) School of Software Engineering, Xi'an Jiaotong University, Xi'an, China(西安交通大学软件工程学院) School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China(中山大学系统科学与工程学院)

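Topic match (planning and decision-making): offline reinforcement learning for agent decision-making

AI summary: Proposes Robust Regularized Policy Iteration (RRPI), which formulates offline reinforcement learning as robust policy optimization, replaces the intractable bilevel objective with a KL-regularized surrogate, and derives an efficient policy iteration procedure from a robust regularized Bellman operator. Convergence is guaranteed in theory, and the method performs strongly on D4RL benchmarks.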

详情
AI中文摘要

离线强化学习(RL)无需在线探索即可实现数据高效且安全的策略学习,但其性能常因分布偏移而下降。学习到的策略可能访问分布外的状态-动作对,其中价值估计和学习到的动态不可靠。为了在统一框架中处理策略引发的外推和转移不确定性,我们将离线RL建模为鲁棒策略优化,将转移核视为不确定性集内的决策变量,并针对最坏情况动态优化策略。我们提出鲁棒正则化策略迭代(RRPI),用可处理的KL正则化替代难解的最大-最小双层目标,并基于鲁棒正则化贝尔曼算子推导出高效的策略迭代过程。我们提供了理论保证,证明所提出的算子是$\gamma$-压缩算子,且迭代更新替代目标能单调改进原始鲁棒目标并收敛。在D4RL基准上的实验表明,RRPI实现了强大的平均性能,在大多数环境中优于包括基于百分位数方法在内的最新基线,并在其余环境中保持竞争力。此外,RRPI通过将较低的$Q$值与高认知不确定性对齐,展现出鲁棒性能,从而防止策略执行不可靠的分布外动作。

英文摘要

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $\gamma$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
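One well-known way a KL penalty makes a worst-case expectation tractable is the closed form min over P of E_P[V] + (1/beta) KL(P || P0) = -(1/beta) log E_P0[exp(-beta V)]. The Python sketch below evaluates that expression for a single state-action pair. It is an illustration of this general identity under assumed inputs, and the abstract does not confirm that RRPI's operator takes exactly this form; the function name, `beta`, and the numbers are hypothetical.

```python
import math


def kl_robust_expectation(values, nominal_probs, beta):
    """Worst-case expected value with a KL penalty of strength 1/beta.

    Closed form of  min_P  E_P[V] + (1/beta) * KL(P || P0):
        -(1/beta) * log E_P0[exp(-beta * V)]
    """
    shift = min(values)  # subtract the minimum for numerical stability
    total = sum(p * math.exp(-beta * (v - shift))
                for v, p in zip(values, nominal_probs))
    return shift - math.log(total) / beta


next_values = [1.0, 4.0, 10.0]   # hypothetical V(s') for three successor states
nominal = [0.2, 0.5, 0.3]        # hypothetical learned transition probabilities
for beta in (0.01, 1.0, 100.0):
    print(beta, kl_robust_expectation(next_values, nominal, beta))
```

A small `beta` penalizes deviation from the nominal model heavily and stays close to the nominal mean of 5.2, while a large `beta` approaches the worst successor value of 1.0, so the parameter interpolates between trusting the learned dynamics and guarding against them.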

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC 新提交 60%

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

带移动障碍物的移动目标旅行商问题的两阶段双层搜索

Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

发表机构 * Texas A&M University(德克萨斯A&M大学) Carnegie Mellon University(卡内基梅隆大学)

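Topic match (planning and decision-making): a two-phase bilevel search algorithm for the moving-target TSP

AI summary: For the moving-target traveling salesman problem with moving obstacles, proposes a mixed-integer conic programming formulation and a two-phase bilevel search algorithm, both of which significantly outperform the baseline.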

详情
AI中文摘要

移动目标旅行商问题(MT-TSP)寻求从静态仓库出发、访问一组移动目标(每个目标在其分配的时间窗口内)并返回仓库的代理的最小成本轨迹。在本文中,我们研究了带移动障碍物的移动目标旅行商问题(MT-TSP-MO),这是MT-TSP的推广,其中代理轨迹必须避开移动障碍物。我们提出了一个混合整数锥规划(MICP)公式,可以使用现成的求解器求解,以及一个快速且可扩展的两阶段双层搜索(TPBS)算法,该算法为问题计算高质量可行解。我们在多达40个目标和40个障碍物的广泛问题实例上评估了我们的方法,与现有基线算法相比。结果表明,所提出的两种方法在成功率、解决方案成本和计算时间方面均显著优于基线。

英文摘要

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.
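A basic geometric subproblem behind any moving-target tour is finding when an agent can meet one target. The Python sketch below solves it for a target moving at constant velocity and an agent moving in a straight line at constant speed. It ignores time windows and obstacles, so it is a simplified building block under stated assumptions and not the paper's MICP formulation or TPBS algorithm; the function name and example values are hypothetical.

```python
import math


def intercept_time(agent_pos, agent_speed, target_pos, target_vel):
    """Earliest t >= 0 at which an agent moving at `agent_speed` in a straight
    line can meet a target moving at constant velocity; None if impossible."""
    dx, dy = target_pos[0] - agent_pos[0], target_pos[1] - agent_pos[1]
    vx, vy = target_vel
    # |d + v t| = s t  ->  (v.v - s^2) t^2 + 2 (d.v) t + d.d = 0
    a = vx * vx + vy * vy - agent_speed ** 2
    b = 2.0 * (dx * vx + dy * vy)
    c = dx * dx + dy * dy
    if abs(a) < 1e-12:                       # equal speeds: equation is linear
        if abs(b) < 1e-12:
            return 0.0 if c < 1e-12 else None
        t = -c / b
        return t if t >= 0 else None
    disc = b * b - 4.0 * a * c
    if disc < 0:
        return None
    roots = [(-b - math.sqrt(disc)) / (2 * a), (-b + math.sqrt(disc)) / (2 * a)]
    valid = [t for t in roots if t >= 0]
    return min(valid) if valid else None


# Hypothetical case: target starts 3 units to the right and moves upward.
print(intercept_time((0.0, 0.0), 2.0, (3.0, 0.0), (0.0, 1.0)))
```

In a full solver, a check like this would still have to be combined with each target's time windows and with collision checks against the moving obstacles.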

2412.15472 2026-06-18 cs.GT econ.TH 60%

On the Fairness of Additive Welfarist Rules

关于加法福利主义规则的公平性

Karen Frilya Celine, Warut Suksompong, Sheung Man Yuen

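Topic match (planning and decision-making): studies fair allocation rules, related to multi-agent systems

AI summary: Studies the fairness of additive welfarist rules in fair division. Strengthening the prior result that MNW is the unique additive welfarist rule guaranteeing EF1, it shows that MNW remains the only such rule on several restricted classes of instances, and it characterizes the other rules that guarantee EF1 when utilities are integers.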

Comments Appears in the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025

Journal ref ACM Transactions on Economics and Computation, 14(2):5 (2026)

详情
AI中文摘要

分配不可分割的商品是公平分割中的常见任务。我们研究了加法福利主义规则,这类规则选择使某些效用函数总和最大的分配。先前研究显示,最大纳什福利(MNW)规则是唯一能保证 envy-freeness up to one good(EF1)的加法福利主义规则。我们加强这一结论,证明MNW规则在相同商品实例、二值实例以及三个或更多代理人归一化实例中仍唯一保证EF1。另一方面,如果代理人的效用是整数,我们证明其他规则也能提供EF1保证,并为各种实例类型提供了这些规则的特征化。

英文摘要

Allocating indivisible goods is a ubiquitous task in fair division. We study additive welfarist rules, an important class of rules which choose an allocation that maximizes the sum of some function of the agents' utilities. Prior work has shown that the maximum Nash welfare (MNW) rule is the unique additive welfarist rule that guarantees envy-freeness up to one good (EF1). We strengthen this result by showing that MNW remains the only additive welfarist rule that ensures EF1 for identical-good instances, two-value instances, as well as normalized instances with three or more agents. On the other hand, if the agents' utilities are integers, we demonstrate that several other rules offer the EF1 guarantee, and provide characterizations of these rules for various classes of instances.
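The two notions in the abstract, MNW and EF1, can be checked by brute force on a tiny instance. The Python sketch below assumes additive utilities and an instance where some allocation gives every agent positive utility (the usual tie-breaking for zero Nash welfare is omitted). The instance and function names are hypothetical.

```python
from itertools import product
from math import prod


def is_ef1(utilities, bundles):
    """Check envy-freeness up to one good for additive utilities.

    utilities[i][g] is agent i's value for good g; bundles[i] is a list of goods.
    """
    n = len(utilities)
    for i in range(n):
        own = sum(utilities[i][g] for g in bundles[i])
        for j in range(n):
            if i == j or not bundles[j]:
                continue
            other = sum(utilities[i][g] for g in bundles[j])
            # Envy must vanish after removing i's favorite good from j's bundle.
            best_removed = max(utilities[i][g] for g in bundles[j])
            if own < other - best_removed:
                return False
    return True


def max_nash_welfare(utilities):
    """Brute-force the allocation maximizing the product of utilities."""
    n, m = len(utilities), len(utilities[0])
    best, best_bundles = -1, None
    for owners in product(range(n), repeat=m):
        bundles = [[g for g in range(m) if owners[g] == i] for i in range(n)]
        welfare = prod(sum(utilities[i][g] for g in bundles[i]) for i in range(n))
        if welfare > best:
            best, best_bundles = welfare, bundles
    return best_bundles


utilities = [[6, 1, 3], [2, 5, 4]]   # hypothetical two-agent, three-good instance
allocation = max_nash_welfare(utilities)
print(allocation, is_ef1(utilities, allocation))
```

The enumeration grows as the number of agents raised to the number of goods, so this is only suitable for checking small examples by hand.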

2606.19175 2026-06-18 econ.TH 新提交 55%

To Gamble, Perchance to Grow

赌博,或许为了增长

Mark Whitmeyer

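Topic match (planning and decision-making): studies the growth-optimal portfolio problem, which involves decision optimization

AI summary: Studies transformations of returns in the growth-optimal (Kelly) portfolio problem, characterizes the transforms that yield more conservative portfolios, and derives a comparison of risk aversion for a rationally inattentive agent.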

详情
AI中文摘要

我研究了增长最优(凯利)投资组合问题中的收益变换。在一安全一风险资产问题中,收益变换 f 普遍产生更保守的投资组合当且仅当 f 是凹且严格递增的,并且 r/f 是凸的。作为推论,我刻画了理性疏忽代理人的比较风险厌恶:一个更风险厌恶的代理人是在 Pratt (1964) 意义上足够更风险厌恶的代理人。

英文摘要

I study transformations of returns in the growth-optimal (Kelly) portfolio problem. In the one-safe-one-risky-asset problem, a return transform f universally produces a more conservative portfolio if and only if f is concave and strictly increasing and r/f is convex. As a corollary, I characterize comparative risk aversion for a rationally-inattentive agent: a more risk-averse agent is one who is sufficiently more risk averse in the Pratt (1964) sense.
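For readers unfamiliar with the growth-optimal criterion, the Python sketch below works through the textbook binary-bet special case, in which the Kelly fraction has the closed form p - (1 - p) / b. This is an assumed simplification for illustration and not the paper's general one-safe-one-risky-asset setting with a return transform f; the function names and parameter values are hypothetical.

```python
import math


def kelly_fraction(win_prob, net_odds):
    """Growth-optimal fraction for a bet that pays `net_odds` per unit staked
    with probability `win_prob` and loses the stake otherwise."""
    return win_prob - (1.0 - win_prob) / net_odds


def expected_log_growth(fraction, win_prob, net_odds):
    """Expected log of wealth after one bet when staking `fraction` of wealth."""
    return (win_prob * math.log(1.0 + fraction * net_odds)
            + (1.0 - win_prob) * math.log(1.0 - fraction))


p, b = 0.6, 1.0
f_star = kelly_fraction(p, b)
# Cross-check the closed form against a grid search over stake fractions.
grid = [i / 1000 for i in range(0, 999)]
f_grid = max(grid, key=lambda f: expected_log_growth(f, p, b))
print(f_star, f_grid)
```

Staking less than `f_star` gives up some growth in exchange for lower variance, which is the sense in which a portfolio can be called more conservative than the Kelly benchmark.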