AI Agent - arXivDaily Topic

2605.30880 2026-06-18 cs.CL cs.AI Updated version 85%

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

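Affiliations: Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

Topic match (Planning and Decision-Making): executable world models for agent planning and prediction

AI summary: Proposes the PatchWorld framework, which converts offline trajectories into executable Python world models through counterexample-guided code repair, yielding symbolic belief-state programs without gradient-based optimization; it reaches a 76.4% macro success rate on AgentGym environments.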

Comments 40 pages

Abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.
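The replay-and-patch idea in this abstract can be illustrated with a minimal sketch. It is not taken from the PatchWorld repository: the belief representation, the `find_counterexample` helper, and the toy door environment are all hypothetical, and the repair step (which the paper performs through code repair) is replaced here by a hand-written patch.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical types: a belief state is a plain dict, and each action maps to
# an update function that returns (next_belief, predicted_observation).
Belief = Dict[str, object]
Update = Callable[[Belief], Tuple[Belief, str]]
Step = Tuple[str, str]  # (action, observation recorded in the offline trajectory)


def find_counterexample(
    model: Dict[str, Update], init: Belief, trajectory: List[Step]
) -> Optional[int]:
    """Replay a trajectory; return the index of the first mispredicted step."""
    belief = dict(init)
    for i, (action, observed) in enumerate(trajectory):
        update = model.get(action)
        if update is None:
            return i  # the model has no rule for this action yet
        belief, predicted = update(belief)
        if predicted != observed:
            return i
    return None  # the model reproduces the whole trajectory


def open_door(b: Belief) -> Tuple[Belief, str]:
    return {**b, "door_open": True}, "The door is open."


def buggy_enter(b: Belief) -> Tuple[Belief, str]:
    # Bug: ignores whether the door is open.
    return b, "You are inside."


def patched_enter(b: Belief) -> Tuple[Belief, str]:
    if b.get("door_open"):
        return {**b, "inside": True}, "You are inside."
    return b, "The door is closed."


trajectory = [
    ("enter", "The door is closed."),
    ("open", "The door is open."),
    ("enter", "You are inside."),
]
model = {"open": open_door, "enter": buggy_enter}
first_failure = find_counterexample(model, {"door_open": False}, trajectory)
model["enter"] = patched_enter  # local patch of the single failing rule
after_patch = find_counterexample(model, {"door_open": False}, trajectory)
```

Because each action update is an ordinary function, the failing step can be located by replay and only the offending rule needs to change, which is the kind of inspectability the abstract emphasizes.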

2603.00656 2026-06-18 cs.AI Updated version 85%

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

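Affiliations: Peking University; The Hong Kong University of Science

Topic match (Planning and Decision-Making): information-driven policy optimization for user-centric agents

AI summary: To address credit assignment problems and insufficient advantage signals in multi-turn interaction, proposes InfoPO, which combines an information-gain reward with task outcomes through adaptive variance-gated fusion; it outperforms existing baselines on tasks such as intent clarification and collaborative coding.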

Abstract

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
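As a rough illustration of the information-gain reward described above, the sketch below scores a turn by how far the agent's next-action distribution moves once the user's feedback is visible, relative to a masked-feedback counterfactual. The use of KL divergence and the function names are assumptions made for this example; the abstract does not state which divergence InfoPO uses, and the variance-gated fusion with task outcomes is not modeled here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) in nats for two discrete distributions over the same actions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def turn_info_gain(with_feedback: Sequence[float], masked: Sequence[float]) -> float:
    """Hypothetical per-turn reward: shift in the action distribution caused by feedback."""
    return kl_divergence(with_feedback, masked)


# Feedback that resolves the user's intent sharpens the action distribution.
informative = turn_info_gain([0.90, 0.05, 0.05], [0.34, 0.33, 0.33])
# Feedback that changes nothing earns no information-gain reward.
uninformative = turn_info_gain([0.34, 0.33, 0.33], [0.34, 0.33, 0.33])
```

A turn-level signal of this kind is what lets credit go to individual clarifying questions instead of being spread evenly over a whole trajectory.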

2603.00026 2026-06-18 cs.CL cs.AI cs.IR Updated version 85%

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

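Affiliations: State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

Topic match (Planning and Decision-Making): combining memory retrieval with reasoning; active causal reasoning

AI summary: Proposes the ActMem framework, which converts unstructured dialogue history into a structured causal and semantic graph and uses counterfactual reasoning and commonsense completion to perform active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.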

Abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2510.05107 2026-06-18 cs.AI Updated version 85%

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Myung Ho Kim

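Affiliations: JEI University

Topic match (Planning and Decision-Making): a Structured Cognitive Loop for accountable behavior in LLM agents

AI summary: Proposes the Structured Cognitive Loop (SCL) architecture, which separates cognition, memory, control, and action into distinct modules to make LLM agent behavior accountable; it achieves 86.3% task success across 360 episodes, outperforming the baselines.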

Comments This revised version extends the original SCL framework from a behavioral architecture for reliable LLM agents into a broader architecture of epistemic accountability, integrating context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment structure

Abstract

The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop (SCL) as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint-guided image generation. Across 360 episodes, SCL achieves 86.3 percent task success compared with 70.5 to 76.8 percent for prompt-based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.
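The controller role described in this abstract (check preconditions, block redundant actions, authorize before a tool runs) might look roughly like the sketch below. The `Controller` class, its fields, and the `book_hotel` tool are hypothetical names invented for illustration; they are not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class Controller:
    """Hypothetical controller: the model proposes, this structure decides."""
    memory: Dict[str, str] = field(default_factory=dict)         # verified state
    executed: Set[Tuple[str, str]] = field(default_factory=set)  # past tool calls
    trace: List[str] = field(default_factory=list)               # inspectable log

    def authorize(self, tool: str, arg: str, requires: List[str]) -> bool:
        missing = [key for key in requires if key not in self.memory]
        if missing:
            self.trace.append(f"deny {tool}({arg}): missing {missing}")
            return False
        if (tool, arg) in self.executed:
            self.trace.append(f"deny {tool}({arg}): redundant")
            return False
        self.trace.append(f"allow {tool}({arg})")
        return True

    def run(self, tool: str, arg: str, requires: List[str],
            fn: Callable[[str], str], result_key: str) -> bool:
        if not self.authorize(tool, arg, requires):
            return False
        self.memory[result_key] = fn(arg)  # store the result as verified state
        self.executed.add((tool, arg))
        return True


def book_hotel(city: str) -> str:
    return f"hotel in {city}"  # stand-in for a real tool call


ctrl = Controller()
first = ctrl.run("book_hotel", "Paris", ["dates"], book_hotel, "hotel")
ctrl.memory["dates"] = "2026-07-01 to 2026-07-05"
second = ctrl.run("book_hotel", "Paris", ["dates"], book_hotel, "hotel")
third = ctrl.run("book_hotel", "Paris", ["dates"], book_hotel, "hotel")
```

Here the first call is denied because the travel dates are not yet in verified memory, the second is allowed, and the third is denied as redundant. Every decision lands in `trace`, so a reviewer can see why each action was or was not permitted.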

2605.22142 2026-06-18 cs.LG cs.AI Updated version 80%

Short-Term-to-Long-Term Memory Transfer for Knowledge Graphs under Partial Observability

Taewoon Kim, Vincent François-Lavet, Michael Cochez

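Topic match (Planning and Decision-Making): memory transfer in reinforcement learning, which falls under agent decision-making

AI summary: Studies short-term-to-long-term memory transfer for knowledge graphs under partial observability and proposes a neuro-symbolic value-based decision approach that decides whether to keep or drop each observed triple before long-term insertion, improving memory efficiency and outperforming symbolic and neural baselines on the RoomKG benchmark.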

Abstract

Reinforcement learning under partial observability requires deciding what information to retain, yet most memory-based approaches do not explicitly model short-term-to-long-term transfer of symbolic observations. We study this transfer process in a temporal knowledge-graph memory setting and cast it as a neuro-symbolic value-based decision problem: for each observed triple, the agent chooses whether to keep or drop it before long-term insertion. To handle variable-sized short-term buffers, we use a per-item Q-learning design with shared parameters and a practical temporal-difference update over matched items across consecutive steps. On the RoomKG benchmark at long-term memory capacity 128, learned transfer decisions outperform symbolic and neural baselines, including symbolic baselines with temporal annotations and history-based LSTM/Transformer baselines. Across transfer-policy ablations, a lightweight local short-term-only variant performs best, and step-level behavior shows that the policy keeps navigation- and query-relevant facts while discarding lower-value candidate facts, supporting explicit and interpretable memory decisions under memory constraints.

2604.03208 2026-06-18 cs.LG Updated version 80%

Hierarchical Planning with Latent World Models

Wancong Zhang, Basile Terver, Artem Zholus, Soham Chitnis, Harsh Sutaria, Mido Assran, Randall Balestriero, Amir Bar, Adrien Bardes, Yann LeCun, Nicolas Ballas

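Affiliations: FAIR at Meta; New York University; Mila - Québec AI Institute; Brown University

Topic match (Planning and Decision-Making): hierarchical world models for long-horizon planning, which falls under agent planning

AI summary: Proposes the HWM architecture, which performs hierarchical model predictive control using latent world models at multiple temporal scales together with latent matching, addressing the failure of single-level planning and the blow-up in search cost on long-horizon tasks.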

Abstract

World models are a promising path to zero-shot embodied control through planning. However, existing world model planners struggle on long-horizon, multi-stage tasks: prediction errors compound and naive search is exponential in the planning horizon. Hierarchy mitigates both by decomposing tasks into shorter, tractable subproblems; yet prior hierarchical approaches either amortize control into task-specific policies (hierarchical RL) or assume low-dimensional states and known dynamics (classical hierarchical MPC). We present Hierarchical Planning with Latent World Models (HWM), an architecture and planning paradigm for hierarchical model predictive control (MPC) directly on visual world models trained solely via next-latent prediction. HWM learns world models at multiple temporal scales within a shared latent space, so predictions from the long-horizon model serve as subgoals for the short-horizon model via latent matching, without task-specific rewards, skill learning, or hierarchical policies. To keep long-horizon search tractable, HWM learns an action encoder that compresses primitive action chunks into latent macro-actions. On real-world Franka manipulation, HWM solves pick-and-place from a single goal image at 70% success vs. 0% for single-level planning. Across simulated push manipulation and maze navigation, HWM consistently improves performance on long-horizon tasks while requiring up to 3x less planning compute.

2411.10399 2026-06-18 cs.GT cs.CR cs.DC Updated version 80%

Game Theoretic Liquidity Provisioning in Concentrated Liquidity Market Makers

Weizhao Tang, Rachid El-Azouzi, Cheng Han Lee, Ethan Chan, Giulia Fanti

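Topic match (Planning and Decision-Making): a game-theoretic model analyzing liquidity provision strategies

AI summary: Builds a game-theoretic model of the strategic interaction among liquidity providers in concentrated liquidity market makers, proves that it reduces to a game of linear complexity with a unique Nash equilibrium, shows that the equilibrium follows a waterfilling strategy, and finds from real-world data that LP strategies deviate from equilibrium and that adjusting them can raise daily returns.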

Abstract

Automated market makers (AMMs) are a class of decentralized exchanges that enable the automated trading of digital assets. They accept deposits of digital tokens from liquidity providers (LPs); tokens can be used by traders to execute trades, which generate fees for the investing LPs. The distinguishing feature of AMMs is that trade prices are determined algorithmically, unlike classical limit order books. Concentrated liquidity market makers (CLMMs) are a major class of AMMs that offer liquidity providers flexibility to decide not only how much liquidity to provide, but in what ranges of prices they want the liquidity to be used. This flexibility can complicate strategic planning, since fee rewards are shared among LPs. We formulate and analyze a game theoretic model to study the incentives of LPs in CLMMs. Our main results show that while our original formulation admits multiple Nash equilibria and has complexity quadratic in the number of price ticks in the contract, it can be reduced to a game with a unique Nash equilibrium whose complexity is only linear. We further show that the Nash equilibrium of this simplified game follows a waterfilling strategy, in which low-budget LPs use up their full budget, but rich LPs do not. Finally, by fitting our game model to real-world CLMMs, we observe that in liquidity pools with risky assets, LPs adopt investment strategies far from the Nash equilibrium. Under price uncertainty, they generally invest in fewer and wider price ranges than our analysis suggests, with lower-frequency liquidity updates. We show that across several pools, by updating their strategy to more closely match the Nash equilibrium of our game, LPs can improve their median daily returns by $116, which corresponds to an increase of 0.009% in median daily return on investment.

2510.03635 2026-06-18 eess.SY cs.SY Updated version 70%

Cyber Resilience of Three-phase Unbalanced Distribution System Restoration under Sparse Adversarial Attack on Load Forecasting

Chen Chao, Zixiao Ma, Ziang Zhang

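Topic match (Planning and Decision-Making): restoration planning under attack, which involves decision-making

AI summary: Quantifies the impact of adversarial attacks on load forecasting, proposes a gradient-based sparse attack, and builds a restoration-aware validation framework that reveals system-level failures, offering insights for designing cybersecurity-aware restoration planning.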

Comments 10 pages, 7 figures

Abstract

System restoration is critical for power system resilience; nonetheless, its growing reliance on artificial intelligence (AI)-based load forecasting introduces significant cybersecurity risks. Inaccurate forecasts can lead to infeasible planning, voltage and frequency violations, and unsuccessful recovery of de-energized segments, yet the resilience of restoration processes to such attacks remains largely unexplored. This paper addresses this gap by quantifying how adversarially manipulated forecasts impact restoration feasibility and grid security. We develop a gradient-based sparse adversarial attack that strategically perturbs the most influential spatiotemporal inputs, exposing vulnerabilities in forecasting models while maintaining stealth. We further create a restoration-aware validation framework that embeds these compromised forecasts into a sequential restoration model and evaluates operational feasibility using an unbalanced three-phase optimal power flow formulation. Simulation results show that the proposed approach is more efficient and stealthier than baseline attacks. It reveals system-level failures, such as voltage and power ramping violations, that prevent the restoration of critical loads. These findings provide actionable insights for designing cybersecurity-aware restoration planning frameworks.
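A gradient-based sparse perturbation of the kind the abstract mentions can be sketched as follows. This is a generic top-k sign attack on a toy linear forecaster, chosen because its input gradient is simply the weight vector; the function name, the selection rule, and the numbers are assumptions for illustration and do not reproduce the paper's attack.

```python
from typing import List


def sparse_attack(x: List[float], grad: List[float], k: int, eps: float) -> List[float]:
    """Perturb only the k inputs with the largest |gradient|, by eps along the gradient sign."""
    top = sorted(range(len(x)), key=lambda i: abs(grad[i]), reverse=True)[:k]
    x_adv = list(x)
    for i in top:
        x_adv[i] += eps if grad[i] > 0 else -eps
    return x_adv


# Toy linear forecaster (hypothetical): forecast = sum(w_i * x_i),
# so the gradient of the forecast with respect to the inputs is w itself.
weights = [0.1, -2.0, 0.05, 1.5]


def forecast(inputs: List[float]) -> float:
    return sum(w * v for w, v in zip(weights, inputs))


x = [1.0, 1.0, 1.0, 1.0]
x_adv = sparse_attack(x, weights, k=2, eps=0.1)
changed = [i for i, (a, b) in enumerate(zip(x, x_adv)) if a != b]
```

Only the two most influential inputs move, which is what keeps the perturbation sparse while still shifting the forecast upward.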

2402.08128 2026-06-18 cs.AI cs.GT Updated version 70%

Recursive Joint Simulation in Games

Vojtech Kovarik, Caspar Oesterheld, Vincent Conitzer

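Affiliations: Foundations of Cooperative AI Lab (FOCAL), Computer Science Department; Carnegie Mellon University; AI Center; Czech Technical University; Center for Theoretical Study; Charles University

Topic match (Planning and Decision-Making): studies cooperation among AI agents through recursive joint simulation

AI summary: Studies how AI agents can achieve cooperation through recursive joint simulation and proves that the process is equivalent to an infinitely repeated version of the original game, so existing results such as the folk theorems apply directly.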

Abstract

Game-theoretic dynamics between AI agents could differ from traditional human-human interactions in various ways. One such difference is that it may be possible to accurately simulate an AI agent, for example because its source code is known. Such an agent would then be fundamentally uncertain whether it is in the real world or in a simulation. Our aim is to explore ways of leveraging this possibility to achieve more cooperative outcomes in strategic settings. In this paper, we study an interaction between AI agents where the agents run a recursive joint simulation. That is, the agents first jointly observe a simulation of the situation they face. This simulation in turn recursively includes additional simulations (with a small chance of failure, to avoid infinite recursion), and the results of all these nested simulations are observed before an action is chosen. We show that the resulting interaction is strategically equivalent to an infinitely repeated version of the original game, allowing a direct transfer of existing results such as the various folk theorems. As evidence that the equivalence is robust, we show that it holds even when we relax some of the assumptions and that it also holds "from the inside", that is, for an agent that finds itself inside the game and has self-locating uncertainty.

2603.09344 2026-06-18 cs.AI stat.ML Updated version 65%

Robust Regularized Policy Iteration under Transition Uncertainty

Hongqiang Lin, Zhenghui Fu, Weihao Tang, Pengfei Wang, Yiding Sun, Qixian Huang, Dongxu Zhang

Affiliations: College of Computer Science and Technology, Zhejiang University, Hangzhou, China; School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University, Xi'an, China; School of Software Technology, Zhejiang University, Hangzhou, China; School of Software Engineering, Xi'an Jiaotong University, Xi'an, China; School of Systems Science and Engineering, Sun Yat-sen University, Guangzhou, China

Topic match (Planning and Decision-Making): offline reinforcement learning for agent decision-making

AI summary: Proposes Robust Regularized Policy Iteration (RRPI), which formulates offline reinforcement learning as robust policy optimization, replaces the intractable bilevel objective with a KL-regularized surrogate, and derives an efficient policy iteration based on a robust regularized Bellman operator, with theoretical convergence guarantees and strong results on D4RL benchmarks.

Abstract

Offline reinforcement learning (RL) enables data-efficient and safe policy learning without online exploration, but its performance often degrades under distribution shift. The learned policy may visit out-of-distribution state-action pairs where value estimates and learned dynamics are unreliable. To address policy-induced extrapolation and transition uncertainty in a unified framework, we formulate offline RL as robust policy optimization, treating the transition kernel as a decision variable within an uncertainty set and optimizing the policy against the worst-case dynamics. We propose Robust Regularized Policy Iteration (RRPI), which replaces the intractable max-min bilevel objective with a tractable KL-regularized surrogate and derives an efficient policy iteration procedure based on a robust regularized Bellman operator. We provide theoretical guarantees by showing that the proposed operator is a $γ$-contraction and that iteratively updating the surrogate yields monotonic improvement of the original robust objective with convergence. Experiments on D4RL benchmarks demonstrate that RRPI achieves strong average performance, outperforming recent baselines including percentile-based methods on the majority of environments while remaining competitive on the rest. Moreover, RRPI exhibits robust performance by aligning lower $Q$-values with high epistemic uncertainty, which prevents the policy from executing unreliable out-of-distribution actions.
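To see why a KL-regularized surrogate can make the worst-case inner problem tractable, consider the standard identity for a KL-penalized minimization over next-state distributions: $\min_P \{ \mathbb{E}_P[V] + \tau\,\mathrm{KL}(P \,\|\, P_0) \} = -\tau \log \mathbb{E}_{P_0}[e^{-V/\tau}]$, with minimizer $P^* \propto P_0\, e^{-V/\tau}$. The sketch below evaluates this closed form on toy numbers. It is an assumption that RRPI's surrogate takes this shape; the abstract does not give the operator, and the temperature `tau`, the function name, and the values are hypothetical.

```python
import math
from typing import List, Tuple


def kl_worst_case(values: List[float], nominal: List[float],
                  tau: float) -> Tuple[float, List[float]]:
    """Closed form of min_P { E_P[V] + tau * KL(P || nominal) }.

    Returns the penalized worst-case value and the minimizing distribution,
    which is the nominal model reweighted by exp(-V / tau).
    """
    scaled = [-v / tau for v in values]
    shift = max(scaled)  # subtract the max for a stable log-sum-exp
    weights = [p * math.exp(s - shift) for p, s in zip(nominal, scaled)]
    total = sum(weights)
    adversary = [w / total for w in weights]
    return -tau * (shift + math.log(total)), adversary


values = [1.0, 0.0, 2.0]    # toy next-state values V(s')
nominal = [0.5, 0.3, 0.2]   # toy nominal transition probabilities
robust_value, adversary = kl_worst_case(values, nominal, tau=0.5)

# Check the identity by evaluating the penalized objective at the minimizer.
expected = sum(p * v for p, v in zip(adversary, values))
kl = sum(p * math.log(p / q) for p, q in zip(adversary, nominal))
objective = expected + 0.5 * kl
```

The adversarial distribution shifts mass toward low-value next states, and the resulting value sits between the hard worst case and the nominal expectation. A smaller `tau` moves it toward the worst case.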
