AI Agent - arXivDaily Topic

2606.18543 2026-06-18 cs.AI cs.CL cs.SE New submission 90%

CEO-Bench: Can Agents Play the Long Game?

Haozhe Chen, Karthik Narasimhan, Zhuang Liu

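Institution: Princeton University

Topic match (planning and decision-making): a simulated task of running a startup for 500 days

AI summary: Proposes CEO-Bench, which evaluates the integrated decision-making of language model agents in long-horizon, uncertain, dynamic environments by simulating the task of running a startup for 500 days.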

English abstract

Language model agents are becoming proficient executors at isolated, short-horizon tasks such as software engineering and customer service. Yet real-world challenges require a combination of sophisticated skills that remain largely untested in agents: (1) navigating long horizons amid uncertainty; (2) acquiring information in noisy environments; (3) adapting to a changing world; (4) orchestrating multiple moving parts toward a coherent goal. We introduce CEO-Bench, which evaluates these capabilities together by simulating a representative real-world task: operating a startup for 500 days. An agent manages pricing, marketing, budgeting, and many other aspects of a fictional company through a programmable Python interface, operating in the same environment and facing the same challenges as a human CEO. Success demands analyzing noisy, interconnected business databases, translating signals into sound strategy, and coordinating many decisions with programming. The strongest agents write sophisticated code that simulates customer cohorts to forecast future cash and mines negotiation history to uncover hidden customer preferences. Even so, most state-of-the-art models struggle in this environment. Only Claude Opus 4.8 and GPT-5.5 finish above the $1M starting balance, and neither consistently turns a profit. CEO-Bench takes a first step toward measuring the intelligence required to drive sustained, adaptive progress over time.
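
The abstract mentions agents that simulate customer cohorts to forecast future cash. A minimal sketch of that idea follows, assuming a constant monthly churn rate and a fixed revenue per customer. The function name, parameters, and figures are hypothetical and are not taken from CEO-Bench.

```python
def forecast_cash(start_cash, cohorts, monthly_churn, revenue_per_customer,
                  fixed_costs, months):
    """Project the cash balance month by month.

    cohorts: customer counts, one per acquisition cohort (hypothetical input).
    Every cohort shrinks by the same churn rate each month (an assumption).
    """
    cash = start_cash
    sizes = [float(c) for c in cohorts]
    history = []
    for _ in range(months):
        revenue = sum(sizes) * revenue_per_customer
        cash += revenue - fixed_costs
        history.append(cash)
        # Apply churn after billing, so this month's customers all paid.
        sizes = [s * (1.0 - monthly_churn) for s in sizes]
    return history
```

A real forecast would likely use per-cohort churn curves estimated from the business database; the single churn rate here only keeps the example short.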

2606.18633 2026-06-18 cs.MA New submission 85%

PersonalPlan: Planning Multi-Agent Systems for Personalized Programming Learning

Zhiyuan Wen, Jiannong Cao, Peng Gao, Haochen Shi, Wengpan Kuan, Bo Yuan, Xiuxiu Qi

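Topic match (planning and decision-making): a multi-agent planner for personalized programming learning

AI summary: Proposes PersonalPlan, a two-stage multi-agent planner that uses hierarchical SFT and reward-adaptive GRPO to generate executable, personalized, pedagogically scaffolded plans, and outperforms existing methods on the MAP-PPL dataset.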

English abstract

Effective programming education requires personalized instruction adapted to diverse learner backgrounds. However, while LLM-based multi-agent systems (MAS) excel at complex planning, existing planners often lack profile-grounding and pedagogical scaffolding, thereby undermining personalized programming learning. To fill in the gap, we first introduce MAP-PPL (Multi-Agent Plans for Personalized Programming Learning), a profile-conditioned multi-agent planning dataset with 3,043 query-profile-plan instances from 1,730 Stack Overflow question groups and 2,738 learner profiles. Each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. Then, we propose PersonalPlan, a two-stage MAS planner that first performs hierarchical SFT with separate LoRA adapters for profile-aware task decomposition and step dependency planning, then applies a Reward-Adaptive GRPO to encourage the model to generate executable, personalized, and pedagogically scaffolded plans. Extensive experiments on MAP-PPL comparing PersonalPlan against frontier LLMs, generic MAS frameworks, and agentic planners demonstrate its superiority. With only 8B and 32B variants, PersonalPlan achieves state-of-the-art plan executability, personalization, and pedagogical quality, effectively orchestrating MAS for agent-student interactions.
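
The abstract says each plan specifies agents, subtasks, executable steps, and prerequisite dependencies. One simple way to check that such a plan can be executed is to order its steps topologically. The sketch below assumes a made-up plan layout (the step names, agent names, and `requires` field are hypothetical, not the MAP-PPL schema) and uses the standard-library `graphlib` module (Python 3.9+).

```python
from graphlib import TopologicalSorter

# Hypothetical plan: each step names its agent and the steps it depends on.
plan = {
    "explain_loops": {"agent": "tutor", "requires": []},
    "write_exercise": {"agent": "coder", "requires": ["explain_loops"]},
    "review_solution": {"agent": "reviewer", "requires": ["write_exercise"]},
}

def execution_order(plan):
    """Return step names in an order that respects prerequisites.

    Raises graphlib.CycleError if the dependencies are circular,
    which is one basic check of plan executability.
    """
    graph = {step: set(info["requires"]) for step, info in plan.items()}
    return list(TopologicalSorter(graph).static_order())

order = execution_order(plan)
```

Here `order` lists the tutor's explanation first and the review last, because each step depends on the one before it.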

2606.18847 2026-06-18 cs.AI New submission 80%

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

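Institutions: HKUST(GZ) (The Hong Kong University of Science and Technology (Guangzhou)); HKUST (The Hong Kong University of Science and Technology); Knowin

Topic match (planning and decision-making): long-term memory and task planning for embodied agents

AI summary: Proposes the WorldLines benchmark, which builds temporally extended household traces (dialogues, actions, state changes, and more) to evaluate the long-term memory and task-planning abilities of embodied agents, and designs the ObsMem memory framework to improve state-aware decision-making.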

Comments 27 pages, 18 figures

English abstract

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.
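
The abstract highlights overwritten world states as a persistent challenge. A minimal sketch of a timestamped state trail that can answer "what was the state at time t" is shown below. The class and method names are hypothetical, and this is not the ObsMem implementation; it only illustrates why the latest write must not erase earlier ones.

```python
from bisect import bisect_right
from collections import defaultdict

class StateTrail:
    """Timestamped log of object/device state changes (hypothetical structure)."""

    def __init__(self):
        self._times = defaultdict(list)   # object -> sorted timestamps
        self._values = defaultdict(list)  # object -> states, aligned with _times

    def record(self, obj, time, state):
        # Assumes events arrive in chronological order per object.
        self._times[obj].append(time)
        self._values[obj].append(state)

    def state_at(self, obj, time):
        """Return the most recent state recorded at or before `time`, else None."""
        idx = bisect_right(self._times[obj], time)
        return self._values[obj][idx - 1] if idx else None
```

For example, after recording a lamp as "on" at time 1 and "off" at time 5, a query for time 3 still returns "on".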

2606.18746 2026-06-18 cs.AI New submission 80%

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

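Institutions: Carnegie Mellon University; Georgia Institute of Technology

Topic match (planning and decision-making): a formal analysis of the memory requirements of generalist agents

AI summary: Formally argues that, to act near-optimally across multiple environments and goals, a generalist agent must store domain-relevant information that distinguishes incompatible optimal actions at observation bottlenecks, and proves that this memory can be used to reconstruct local transition dynamics.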

2606.18105 2026-06-18 cs.NI cs.LG New submission 80%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

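Institutions: Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

Topic match (planning and decision-making): an adaptive framework that dynamically selects solvers for planning

AI summary: Proposes OmniPlan, an adaptive framework that uses a large language model to parse user intent and a mixture-of-experts architecture to dynamically select among MIP solvers, heuristics, and deep reinforcement learning models, achieving timely and near-optimal network planning optimization; on distributed machine learning inference offloading it reduces latency by up to 97.8% and resource consumption by up to 11.5%.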

Comments Accepted by ACM KDD 2026

English abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.
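
The abstract describes turning user intent into a preference vector and then selecting an expert (MIP solver, heuristic, or DRL model). A minimal sketch of preference-weighted selection follows. The expert profiles, the two-element preference vector, and the linear cost are all assumptions made for illustration; the abstract does not specify OmniPlan's actual selection rule.

```python
# Hypothetical expert profiles: normalized runtime and optimality gap in [0, 1].
EXPERTS = {
    "mip_solver": {"runtime": 1.0, "gap": 0.0},
    "heuristic": {"runtime": 0.1, "gap": 0.4},
    "drl_model": {"runtime": 0.2, "gap": 0.1},
}

def select_expert(preference, experts=EXPERTS):
    """Pick the expert with the lowest preference-weighted cost.

    preference: (weight_on_time, weight_on_optimality), assumed to be the
    output of an intent interpreter; here it is passed in directly.
    """
    w_time, w_opt = preference

    def cost(name):
        profile = experts[name]
        return w_time * profile["runtime"] + w_opt * profile["gap"]

    return min(experts, key=cost)
```

With these made-up profiles, a user who only cares about optimality gets the MIP solver, one who only cares about speed gets the heuristic, and a balanced preference picks the DRL model.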

2606.17453 2026-06-18 cs.AI New submission 80%

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

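Institutions: University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

Topic match (planning and decision-making): evaluating how well map agents satisfy implicit needs

AI summary: Proposes the MapSatisfyBench benchmark, which evaluates the satisfaction awareness of map agents by recovering implicit decision factors from user behavior chains; experiments show that existing agents perform well on explicit task completion but remain limited in meeting implicit needs.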

English abstract

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.
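
The abstract states that a factor is evaluable only if it affects user acceptance and can be recovered from information available before the agent responds. The filter step can be sketched as below. The `Factor` record and its fields are hypothetical and only illustrate the two conditions; they are not the benchmark's data format.

```python
from dataclasses import dataclass

@dataclass
class Factor:
    name: str
    affects_acceptance: bool
    evidence_times: list  # timestamps of supporting behavior evidence

def evaluable_factors(factors, query_time):
    """Keep factors that matter to the user and have evidence before the query."""
    return [
        f.name
        for f in factors
        if f.affects_acceptance and any(t < query_time for t in f.evidence_times)
    ]
```

A factor whose only evidence appears after the query is dropped, since the agent could not have recovered it in time.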

2606.14202 2026-06-18 cs.NE cs.AI New submission 80%

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

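Institutions: School of Computer Science, University of Nottingham Ningbo China; School of Computer Science, University of Nottingham

Topic match (planning and decision-making): an automatic heuristic design framework combining evolution and metacognition

AI summary: Proposes the MeEvo framework, which cyclically couples natural evolution (exploring heuristic code) with metacognitive evolution (reflecting on history to generate improved heuristics) to address the weak knowledge inheritance and insufficient exploration of existing methods, and performs better on five optimization problems.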

English abstract

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.
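
The cyclic coupling described in the abstract (evolution writes to a shared history, reflection reads it, and the refined candidate re-enters the parent pool) can be sketched with a toy numeric problem. Everything below is a stand-in: candidates are plain numbers rather than heuristic programs, and the "reflection" step is simple arithmetic where the paper uses an LLM.

```python
import random

def fitness(x):
    # Toy objective standing in for heuristic quality; peak at x = 3.
    return -(x - 3.0) ** 2

def natural_evolution(parents, history, rng):
    """Crossover (average) plus mutation (Gaussian noise); log every child."""
    children = []
    for _ in range(len(parents)):
        a, b = rng.sample(parents, 2)
        child = (a + b) / 2 + rng.gauss(0, 0.5)
        history.append((child, fitness(child)))
        children.append(child)
    return children

def metacognitive_evolution(history):
    """Stand-in for LLM reflection: extrapolate from the worst toward the best."""
    best = max(history, key=lambda rec: rec[1])[0]
    worst = min(history, key=lambda rec: rec[1])[0]
    return best + 0.1 * (best - worst)

def run(cycles=20, seed=0):
    rng = random.Random(seed)
    parents = [rng.uniform(-10, 10) for _ in range(4)]
    history = [(p, fitness(p)) for p in parents]
    for _ in range(cycles):
        children = natural_evolution(parents, history, rng)
        refined = metacognitive_evolution(history)
        history.append((refined, fitness(refined)))
        # Refined candidate re-enters the parent pool; keep the fittest four.
        parents = sorted(parents + children + [refined], key=fitness)[-4:]
    return max(parents, key=fitness)
```

Because the previous parents stay in the selection pool, the best fitness never decreases from one cycle to the next in this sketch.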

2606.18888 2026-06-18 cs.AI New submission 75%

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

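Institutions: University of Manchester; Aalto University

Topic match (planning and decision-making): generative-model predictive planning for navigation

AI summary: Proposes the BeliefDiffusion framework, which combines diffusion models with model predictive control to explicitly model multimodal belief distributions and plan ahead, significantly outperforming model-free reinforcement learning and generative methods in synthetic map environments.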

English abstract

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
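
The two steps in the abstract (imagine plausible configurations, then plan across them) can be sketched as choosing the path with the lowest average cost over sampled maps. The sampler is left to the caller here, whereas the abstract attributes it to a diffusion model; the function names and the cost function are hypothetical.

```python
def plan_over_samples(sampled_maps, candidate_paths, path_cost):
    """Pick the path with the lowest mean cost across sampled environments.

    sampled_maps: plausible environment configurations, supplied by the caller.
    path_cost: hypothetical function (path, map) -> cost.
    """
    def mean_cost(path):
        return sum(path_cost(path, m) for m in sampled_maps) / len(sampled_maps)

    return min(candidate_paths, key=mean_cost)

def blocked_cell_cost(path, blocked):
    # Path length plus a large penalty for every blocked cell on the path.
    return len(path) + 100 * sum(1 for cell in path if cell in blocked)
```

In a receding-horizon setup the agent would execute only the first step of the chosen path, observe, resample, and replan; that loop is omitted for brevity.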

2606.19214 2026-06-18 econ.GN q-fin.EC New submission 70%

Testing Centralized and Polycentric Computational Planning

Ricardo Alonzo Fernández Salguero

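Topic match (planning and decision-making): compares a computational planner with an agent-based market, which involves planning decisions

AI summary: Presents a reproducible synthetic benchmark that compares a computational planner, an agent-based market, and a hybrid meta-market in a simulated economy; the planner achieves lower welfare losses, but the results depend on design choices, and the main contribution is methodological rather than ideological.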

English abstract

This paper presents a reproducible synthetic benchmark comparing a computational planner, an agent-based market, and a hybrid meta-market within a common simulated economy. The benchmark incorporates input-output production networks, heterogeneous firms, capacity constraints, endogenous prices, welfare metrics, structural shocks, adversarial stress testing, and information-reporting experiments. Across training, holdout, and adversarial scenarios, the planner consistently achieves lower welfare losses than the decentralized alternatives. The main contribution is methodological rather than ideological. While the benchmark demonstrates a falsifiable framework for comparing economic coordination mechanisms, it does not establish the empirical superiority of planning. Several design choices mechanically favor the planner, including informational asymmetries, incomplete market representation, and simplified institutional assumptions. The results should therefore be interpreted as validation of a synthetic experimental architecture and as a prototype for future research. The paper concludes by outlining a validation agenda based on empirical calibration, structural holdouts, sensitivity analysis, uncertainty quantification, mechanism-design tests, and independent replication.

2606.18963 2026-06-18 cs.LG New submission 70%

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

Zirong Li

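Institution: Zirong Li

Topic match (planning and decision-making): proposes an online reward-punishment learning framework without environment rewards

AI summary: Proposes the OHIRL framework, which performs online reward-punishment learning from a fixed-channel perceptual stream without scalar rewards, using an internal trajectory evaluator to infer the valence of perceptual dimensions, and reaches high accuracy on an XOR task and on control tasks such as CartPole.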

Comments 9 pages, 5 figures, 6 tables; 13-page technical supplement

English abstract

We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA New submission 70%

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

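Institution: Amazon

Topic match (planning and decision-making): tree search by LLM agents to discover training strategies

AI summary: Proposes LLMZero, a system in which LLM agents use tree search to discover adaptive strategies for multi-stage RL post-training; it reveals that capacity parameters accumulate monotonically while regularization parameters oscillate, and improves over the base model by 9% to 140% (relative) on 4 GRPO tasks.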

English abstract

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.19134 2026-06-18 cs.LG cs.AI New submission 65%

Pareto Q-Learning with Reward Machines

Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières

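Institutions: Linköping University, Sweden; Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France; Univ. Toulouse, INRAE-MIAT, Toulouse, France

Topic match (planning and decision-making): a multi-objective reinforcement learning algorithm for agent decision-making

AI summary: Proposes the PQLRM algorithm, which combines Pareto Q-learning with reward machines to efficiently approximate the Pareto front in multi-objective reinforcement learning while handling non-Markovian rewards.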

Comments Accepted at the ICAPS 2026 Workshop on Bridging the Gap Between AI Planning and (Reinforcement) Learning (PRL)
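
The summary above refers to approximating the Pareto front. As background, the front of a finite set of value vectors is the subset that no other vector dominates. The sketch below shows only that filtering step, assuming every objective is maximized; it is not the PQLRM algorithm.

```python
def dominates(u, v):
    """True if u is at least as good as v in every objective and better in one."""
    return all(a >= b for a, b in zip(u, v)) and any(a > b for a, b in zip(u, v))

def pareto_front(points):
    """Return the non-dominated points, assuming all objectives are maximized."""
    return [p for p in points if not any(dominates(q, p) for q in points)]
```

For instance, among (1, 5), (3, 3), (2, 2), and (5, 1), only (2, 2) is dropped, because (3, 3) beats it on both objectives.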

2606.18537 2026-06-18 cs.LG New submission 65%

Do as the Romans Do: Learning Universal Behaviors from Heterogeneous Agents

Caleb Chang, Davin Win Kyi, Natasha Jaques, Karen Leung

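Institutions: University of Washington; NVIDIA

Topic match (planning and decision-making): extracting a general reward to train a generalist agent

AI summary: Proposes GRID, which extracts a general reward from heterogeneous demonstrators pursuing different goals and trains a generalist agent on it to learn universal environmental competencies, avoiding mode-averaging bias and improving fine-tuning efficiency on downstream tasks.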

English abstract

Humans often acquire new skills by observing others, since observed behaviors implicitly reveal how to act in an environment. However, observations drawn from a heterogeneous population introduce conflicting behavioral signals, making it difficult to determine which behaviors are worth imitating. We address this challenge with General Reward Inference and Disentanglement (GRID), a social learning method that extracts universally useful behaviors from a heterogeneous population of demonstrators pursuing different goals. GRID decomposes per-agent reward functions into a general reward, capturing behaviors shared across all agents, and specific rewards, capturing individual preferences and objectives. Training exclusively on the general reward provides a new paradigm of generalist pretraining. It yields a generalist agent that internalizes universal environmental competencies, such as safety and basic task proficiency, without the mode-averaging bias that afflicts standard learning from demonstration techniques. This generalist serves as a superior prior for fine-tuning to downstream tasks, including preferences unseen during training. Experiments across a synthetic basis function decomposition, multi-agent Craftax, and a continuous autonomous driving simulator (Highway-Env) confirm that GRID successfully disentangles reward structure in a semantically meaningful way, outperforms standard learning from demonstration baselines, and enables more efficient and stable specialization.

2606.18730 2026-06-18 cs.RO cs.AI math.CO math.OC New submission 60%

Two-Phase Bilevel Search for the Moving-Target Traveling Salesman Problem with Moving Obstacles

Allen George Philip, Anoop Bhat, Sivakumar Rathinam, Howie Choset

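Institutions: Texas A&M University; Carnegie Mellon University

Topic match (planning and decision-making): a two-phase bilevel search algorithm for the moving-target TSP

AI summary: For the moving-target traveling salesman problem with moving obstacles, proposes a mixed-integer conic programming formulation and a two-phase bilevel search algorithm, both of which significantly outperform the baseline.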

English abstract

The Moving-Target Traveling Salesman Problem (MT-TSP) seeks a minimum cost trajectory for an agent that departs from a static depot, visits a set of moving targets, each within one of their assigned time windows, and returns to the depot. In this article, we study the Moving-Target Traveling Salesman Problem with Moving Obstacles (MT-TSP-MO), a generalization of the MT-TSP where the agent trajectory must avoid moving obstacles. We present a Mixed-Integer Conic Programming (MICP) formulation that can be solved using off-the-shelf solvers, as well as a fast and scalable Two-Phase Bilevel Search (TPBS) algorithm that computes high-quality feasible solutions for the problem. We evaluate our approaches against an existing baseline algorithm on a broad range of problem instances with up to 40 targets and 40 obstacles. The results demonstrate that both the proposed methods significantly outperform the baseline with respect to success rates, solution costs, and computation time.
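
A basic building block for moving-target routing is the earliest time at which an agent can meet a single target. The sketch below solves that sub-problem in closed form under strong simplifying assumptions: 2D positions, a target with constant velocity, an agent that is strictly faster, and no obstacles or time windows. It illustrates the geometry only and is not the MICP formulation or the TPBS algorithm from the paper.

```python
import math

def intercept_time(agent_pos, agent_speed, target_pos, target_vel):
    """Earliest time at which an agent moving in a straight line at
    `agent_speed` can meet a target moving with constant velocity.

    Assumes the agent is strictly faster than the target, so exactly one
    non-negative solution exists.
    """
    dx = target_pos[0] - agent_pos[0]
    dy = target_pos[1] - agent_pos[1]
    vx, vy = target_vel
    # Solve |d + v t|^2 = (s t)^2, i.e. a t^2 + b t + c = 0.
    a = vx * vx + vy * vy - agent_speed ** 2
    if a >= 0:
        raise ValueError("agent must be faster than the target")
    b = 2 * (dx * vx + dy * vy)
    c = dx * dx + dy * dy
    disc = b * b - 4 * a * c
    return (-b - math.sqrt(disc)) / (2 * a)
```

With an agent at the origin moving at speed 2 and a target starting at (3, 0) that moves away at speed 1, the gap closes at 1 unit per time step, so interception happens at t = 3.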

2412.15472 2026-06-18 cs.GT econ.TH 60%

On the Fairness of Additive Welfarist Rules

Karen Frilya Celine, Warut Suksompong, Sheung Man Yuen

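Topic match (planning and decision-making): a study of fair allocation rules, relevant to multi-agent systems

AI summary: Studies the fairness of additive welfarist rules in fair allocation, proves that the MNW (maximum Nash welfare) rule is the only such rule that guarantees EF1 (envy-freeness up to one good), and examines how the rules behave on different types of instances.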

Comments Appears in the 24th International Conference on Autonomous Agents and Multiagent Systems (AAMAS), 2025

Journal ref ACM Transactions on Economics and Computation, 14(2):5 (2026)
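
To make the two notions in the summary concrete, the sketch below brute-forces a maximum Nash welfare allocation for a tiny instance with additive utilities and then checks envy-freeness up to one good. It assumes an instance where every agent can receive positive utility (the usual tie-breaking for zero-product cases is omitted), and the function names are illustrative.

```python
from itertools import product
from math import prod

def bundle_value(utilities, agent, bundle):
    return sum(utilities[agent][g] for g in bundle)

def max_nash_welfare(utilities):
    """Brute-force the allocation maximizing the product of agents' utilities."""
    n, m = len(utilities), len(utilities[0])
    best, best_alloc = -1, None
    for owners in product(range(n), repeat=m):
        alloc = [[g for g in range(m) if owners[g] == i] for i in range(n)]
        welfare = prod(bundle_value(utilities, i, alloc[i]) for i in range(n))
        if welfare > best:
            best, best_alloc = welfare, alloc
    return best_alloc

def is_ef1(utilities, alloc):
    """Envy-freeness up to one good: any envy vanishes after removing some good."""
    for i, mine in enumerate(alloc):
        for j, theirs in enumerate(alloc):
            if i == j or not theirs:
                continue
            own = bundle_value(utilities, i, mine)
            other = bundle_value(utilities, i, theirs)
            if own < other - max(utilities[i][g] for g in theirs):
                return False
    return True
```

The search enumerates all n^m assignments, so it is only practical for a handful of goods.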

2606.19175 2026-06-18 econ.TH New submission 55%

To Gamble, Perchance to Grow

Mark Whitmeyer

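Topic match (planning and decision-making): studies the growth-optimal portfolio problem, which involves decision optimization

AI summary: Studies payoff transformations in the growth-optimal (Kelly) portfolio problem, characterizes the transformations that yield more conservative portfolios, and derives risk-aversion comparisons for rationally inattentive agents.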
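
As background for the growth-optimal (Kelly) problem named in the summary, the textbook single-bet case has a closed form: for a bet that pays b-to-1 with win probability p, the stake that maximizes expected log-growth is f* = p - (1 - p) / b. The sketch below covers only this classic binary case, not the payoff transformations studied in the paper.

```python
import math

def kelly_fraction(p, b):
    """Growth-optimal fraction of wealth to stake on a bet that pays b-to-1
    with win probability p (classic binary Kelly formula)."""
    return p - (1 - p) / b

def growth_rate(f, p, b):
    """Expected log-growth per bet when staking fraction f of wealth."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)
```

For an even-money bet won 60% of the time, the formula gives a stake of 20% of wealth, and staking either 10% or 30% yields a lower expected log-growth.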