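arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Indexed today / current date: 65. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE

1. Multi-Agent (6 papers)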

2606.19135 2026-06-18 cs.MA cs.AI cs.NI New submission 80%

A Technical Taxonomy of LLM Agent Communication Protocols

Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

Affiliations: Technische Universität München

Topic match (Multi-Agent): Classifies LLM agent communication protocols; the core topic is agent communication.

AI summary: To address the fragmentation of LLM agent communication protocols, the paper proposes a technical taxonomy with five dimensions, analyzes nine open-source protocols, reveals architectural patterns, and forecasts how protocols are likely to evolve.

Abstract

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.
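The five dimensions named in the abstract map naturally onto a small record type. The sketch below is only an illustration of how such a classification could be stored and queried: the class name, the protocol names (`proto_a` and so on), and every characteristic value are assumptions made for the example, not the paper's data or its actual characteristic sets.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ProtocolProfile:
    """One protocol classified along the taxonomy's five dimensions."""
    counterparty: str        # e.g. "agent-to-agent" or "agent-to-context"
    payload: str             # e.g. "hybrid"
    interaction_state: str   # e.g. "session-persistent"
    discovery: str           # e.g. "centralized" or "decentralized"
    schema_flexibility: str  # e.g. "predefined" or "runtime-negotiated"


def group_by_pattern(profiles):
    """Group protocol names that share the same (payload, interaction_state) pair."""
    groups = defaultdict(list)
    for name, profile in profiles.items():
        groups[(profile.payload, profile.interaction_state)].append(name)
    return dict(groups)


# Hypothetical protocols; all values are illustrative, not taken from the paper.
profiles = {
    "proto_a": ProtocolProfile("agent-to-agent", "hybrid", "session-persistent",
                               "centralized", "predefined"),
    "proto_b": ProtocolProfile("agent-to-agent", "hybrid", "session-persistent",
                               "decentralized", "runtime-negotiated"),
    "proto_c": ProtocolProfile("agent-to-context", "structured", "stateless",
                               "centralized", "predefined"),
}
patterns = group_by_pattern(profiles)
```

Grouping on a pair of dimensions is one simple way to surface recurring combinations such as the "hybrid payload plus session-state persistence" pattern the abstract reports.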

2606.19080 2026-06-18 eess.SY cs.SY New submission 80%

Byzantine-Resilient Federated Multi-Agent Optimization Framework for Cyber-Secure Interconnected Microgrids

Ali Peivand, Seyyed Mostafa Nosratabadi

Topic match (Multi-Agent): Federated multi-agent optimization with Byzantine resilience.

AI summary: Proposes the BR-FedMAPPO framework, which combines a triple-surface moving target defense with an adaptive isolation strategy and uses a two-stage Byzantine-resilient aggregation rule to resist stealthy false data injection attacks, protecting distributed learning channels while maintaining economic dispatch performance.

Abstract

The escalating digitalization of distribution networks has exposed interconnected Microgrid (MG) clusters to Stealthy False Data Injection Attacks that bypass Bad Data Detectors and propagate through tie-line couplings and shared learning channels. This paper proposes BR-FedMAPPO, a Byzantine-Resilient Federated Multi-Agent Proximal Policy Optimization framework that learns a triple-surface Moving Target Defense and an adaptive isolation strategy for cyber-secure operation. Each MG hosts a local Actor-Critic Agent whose policy is partitioned into a globally federated shared encoder and a privately retained action head, so no MG exposes the configurations, cardinality, or locations of its D-FACTS lines, Battery Energy Storage (BES) units, or tie-line capacities. The action vector perturbs D-FACTS reactances, redirects BES injections, reshapes inter-MG exchanges, and includes a continuous islanding signal. A two-stage Byzantine-resilient aggregation rule combines trimmed-mean filtering with reward-weighted updates. This scheme incorporates a detection-quality score based on the F1-score and False Positive Rate to penalize clients causing false alarms. Simulation results on four interconnected MGs based on the IEEE 30- and 118-bus test systems demonstrate effective mitigation of coordinated S-FDI attacks, containment of cascading disruptions through adaptive isolation, and protection of distributed learning channels against malicious model manipulations while maintaining cost-aware dispatch performance.
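The abstract describes a two-stage aggregation rule that combines trimmed-mean filtering with reward-weighted updates. A minimal sketch of that idea follows. It is not the authors' rule: the function name `aggregate`, the `trim` and `eps` parameters, and the way rewards are shifted into non-negative weights are all assumptions, and the detection-quality score based on F1 and false positive rate is left out.

```python
def aggregate(updates, rewards, trim=1, eps=1e-6):
    """Coordinate-wise trimmed mean followed by reward-weighted averaging.

    updates: list of equal-length lists (one flattened model update per client)
    rewards: list of floats, one per client (higher means more trusted)
    trim:    number of extreme values dropped at each end, per coordinate
    """
    n = len(updates)
    if n <= 2 * trim:
        raise ValueError("need more than 2 * trim client updates")

    # Assumed weighting: shift rewards so every weight is strictly positive.
    low = min(rewards)
    weights = [r - low + eps for r in rewards]

    aggregated = []
    for j in range(len(updates[0])):
        # Stage 1: sort clients by their value on coordinate j, drop the extremes.
        order = sorted(range(n), key=lambda i: updates[i][j])
        kept = order[trim:n - trim]
        # Stage 2: average the surviving values, weighted by client reward.
        total = sum(weights[i] for i in kept)
        aggregated.append(sum(weights[i] * updates[i][j] for i in kept) / total)
    return aggregated


# Three honest clients and one client sending an extreme (possibly malicious) update.
client_updates = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [100.0, -100.0]]
result = aggregate(client_updates, rewards=[1.0, 1.0, 1.0, 1.0])
```

Because the extreme values are trimmed per coordinate before any weighting, a single outlying client cannot drag the aggregate far, even if it reports a high reward.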

2606.18829 2026-06-18 cs.LG cs.CL New submission 80%

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Affiliations: School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

Topic match (Multi-Agent): A memory-governance benchmark for multi-principal shared-memory agents.

AI summary: Proposes the GateMem benchmark, which evaluates the governance of multi-principal shared-memory agents along three axes (utility, access control, and forgetting) and finds that existing methods cannot satisfy all three at once.

Comments: 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.
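To make the three governance properties concrete (utility, access control, and forgetting), here is a toy shared-memory store. It is a sketch under stated assumptions and has nothing to do with GateMem's own code or data format: the class names, the flat `scope` string used for authorization, and the tombstone-style deletion are all hypothetical choices.

```python
from dataclasses import dataclass, field


@dataclass
class MemoryEntry:
    entry_id: int
    author: str
    scope: str          # hypothetical authorization label, e.g. "ward" or "clinical"
    text: str
    deleted: bool = False


@dataclass
class SharedMemory:
    entries: list = field(default_factory=list)

    def write(self, author, scope, text):
        """Any principal may append to the common pool."""
        entry = MemoryEntry(len(self.entries), author, scope, text)
        self.entries.append(entry)
        return entry.entry_id

    def forget(self, entry_id):
        """Honor an explicit deletion request: drop the content, keep a tombstone."""
        self.entries[entry_id].deleted = True
        self.entries[entry_id].text = ""

    def query(self, allowed_scopes):
        """Return only live entries whose scope the requester is authorized for."""
        return [e.text for e in self.entries
                if not e.deleted and e.scope in allowed_scopes]


mem = SharedMemory()
first = mem.write("nurse", "ward", "Bed 4 needs a check at 14:00")
second = mem.write("doctor", "clinical", "Lab results for bed 4 are ready")
third = mem.write("visitor", "ward", "I left an umbrella at the desk")
```

A query with `{"ward"}` never sees the clinical note, and after `mem.forget(third)` the umbrella note is no longer returned to anyone. A real agent has a harder problem, since deleted or unauthorized facts can also leak through summaries or model context rather than through the store itself.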

2606.18276 2026-06-18 cs.MA cs.SI physics.soc-ph New submission 80%

Characterizing Opinion Evolution of Networked LLMs

Caleb Probine, Yigit Ege Bayiz, Filippos Fotiadis, Samuel Li, Yunhao Yang, Ufuk Topcu

Topic match (Multi-Agent): Studies opinion-evolution dynamics in networked LLM multi-agent systems.

AI summary: Studies whether classical opinion dynamics models can describe opinion propagation among large language models (LLMs) in multi-agent systems, and finds that adding a bias term substantially improves modeling accuracy, reducing mean opinion error by up to 88%.

Comments: 19 pages, 2 figures

Abstract

Large language models (LLMs) increasingly interact with one another in multi-agent systems, from simulations of human discourse to influence operations and fully LLM-driven social platforms. These interactions give rise to new regimes of opinion propagation that are not yet well understood. We investigate whether classical opinion dynamics models, which have long been used to explain how interactions shape collective beliefs in human societies, can capture the behavior of LLM networks. We find that, while naive averaging-style models fail to track LLMs' opinion dynamics, simple modifications yield substantial gains in modeling fidelity. In particular, bias, an innate opinion toward which agents regress, emerges as a significant driver of LLM opinion dynamics, with its inclusion reducing cumulative estimated mean opinion error by up to 88%. We additionally find that these conclusions generalize across model families, discussion topics, and networks.
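The contrast between a naive averaging model and one with a bias term can be sketched in a few lines. The update rule below, x_i(t+1) = (1 - pull) * sum_j W_ij x_j(t) + pull * u_i, is a Friedkin-Johnsen-style form chosen for illustration; the abstract does not state the paper's exact model, so the rule, the `pull` parameter, and the toy network are assumptions.

```python
def step(opinions, weights, innate, pull):
    """One synchronous update: neighbor averaging plus regression to an innate opinion.

    opinions: current opinion of each agent
    weights:  row-stochastic influence matrix (weights[i][j] = influence of j on i)
    innate:   the innate opinion (bias) each agent regresses toward
    pull:     strength of the bias in [0, 1]; pull = 0 is plain averaging
    """
    n = len(opinions)
    updated = []
    for i in range(n):
        neighbor_avg = sum(weights[i][j] * opinions[j] for j in range(n))
        updated.append((1 - pull) * neighbor_avg + pull * innate[i])
    return updated


def simulate(opinions, weights, innate, pull, steps):
    for _ in range(steps):
        opinions = step(opinions, weights, innate, pull)
    return opinions


# Two agents that weight each other equally and start at opposite opinions.
W = [[0.5, 0.5], [0.5, 0.5]]
start = [0.0, 1.0]

averaging_only = simulate(start, W, innate=start, pull=0.0, steps=10)
with_bias = simulate(start, W, innate=start, pull=0.5, steps=10)
```

With `pull = 0` the two agents collapse to consensus at 0.5, while with `pull = 0.5` they settle at 0.25 and 0.75 and never agree. Persistent disagreement of this kind is what a pure averaging model cannot represent.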

2606.19152 2026-06-18 cond-mat.mtrl-sci cs.AI New submission 80%

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen, Ryo Kuroki, Xuan Vu Nguyen, Edvin Fako, Lixue Cheng, Philippe Schwaller

Affiliations: Department of Computer Science Engineering, Hong Kong University of Science; Department of Chemistry, Hong Kong University of Science; Laboratory of Artificial Chemical Intelligence (LIAC), EPFL, Lausanne, Switzerland; Platform Laboratory for Science & Technology, Asahi Kasei Corporation, Tokyo, Japan; IAS Center for AI for Scientific Discoveries, Hong Kong University of Science

Topic match (Multi-Agent): Proposes a closed-loop multi-agent framework with autonomous, self-correcting search.

AI summary: Proposes AdsMind, a closed-loop multi-agent framework that uses relaxation feedback from machine-learning force fields to correct errors autonomously during adsorption-configuration search. It reaches success rates of 100% and 98.8% on the benchmarks while needing only a small number of relaxations, clearly outperforming heuristic enumeration and single-pass methods.

Comments: 37 pages, 5 figures

Abstract

Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD-GMAE62. Relative to its single-pass (1-Shot) ablation, it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively, an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.

2606.18836 2026-06-18 cs.HC cs.AI New submission 70%

Improving Human-Robot Teamwork in Urban Search and Rescue Through Episodic Memory of Prior Collaboration

Taewoon Kim, Emma van Zoelen, Mark Neerincx

Affiliations: HumemAI, The Netherlands; Vrije Universiteit Amsterdam, The Netherlands; TNO, The Netherlands

Topic match (Multi-Agent): Human-robot teams and memory reuse.

AI summary: Proposes storing historical collaboration patterns as knowledge-graph episodic memories and using graph representation learning to select a representative memory with which to initialize the robot. In the MATRX USAR environment this raises rescue success from 25.7% to 41.3% and cuts average task time by 283 seconds.

Abstract

Effective human-robot teamwork requires robots to adapt to partners, situations, and task dynamics from the start of an interaction. In the MATRX Urban Search and Rescue (USAR) environment, people can externalize collaboration patterns (CPs) they discover during teamwork through a chat and reflection interface. We study whether a robot can use such prior team experience to become a better teammate in future interactions. To this end, we represent historical CPs as knowledge-graph episodic memories and use graph representation learning with a node-classification objective to identify a representative and effective memory for reuse. We then initialize the robot with this memory before a new collaboration episode begins. Across 20 participants and 160 round-level observations, initializing the robot with a single automatically selected prior CP increases rescue success from 25.7% to 41.3% and reduces average task time by 283 seconds. The strongest gains appear at the beginning of interaction, suggesting that reusable episodic memory can help robots enter collaboration with more effective task knowledge and support smoother early teamwork.

2. Other Agents (7 papers)

2606.19116 2026-06-18 cs.AI cs.CY New submission 80%

Towards an Agent-First Web: Redesigning the Web for AI Agents

Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

Affiliations: Old Dominion University; AI Motion Labs; Florida International University; Accenture Technology Labs; Nanyang Technological University; University of Colombo; Center for Wireless Communications, University of Oulu; McDonald Army Health Center

Topic match (Other Agents): Redesigns the web for AI agents; the core topic is agent access.

AI summary: Proposes redesign principles across three layers to address the web's access, economic, and content problems when AI agents act as intermediaries: an access layer (agents inherit human permissions), an economic layer (an intent-based, token-metered subscription model), and a content layer (the ATML markup language and a cryptographic provenance chain).

Abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

2606.19063 2026-06-18 cs.CR New submission 80%

PYPILINE: Malicious PyPI Package Detection via Suspicious API Knowledge and Agent Workflow

Siyuan Pang, Zhengwei Jiang, Yepeng Yao, Zijing Fan, Haozhe Li, Baoxu Liu

Topic match (Other Agents): An Agent workflow for detecting malicious PyPI packages.

AI summary: Proposes PYPILINE, which combines a suspicious API knowledge base with an Agent workflow. Static analysis builds the knowledge base, and the workflow then detects malicious PyPI packages automatically, significantly outperforming existing tools in precision, recall, and F1-score.

Abstract

The detection of malicious PyPI packages is crucial for maintaining the security of the open-source software supply chain. Existing methods, which primarily rely on rules or traditional machine learning, suffer from poor interpretability and difficulty in adapting to novel attacks. To address this, we propose PYPILINE, a novel detection method that combines a suspicious API knowledge base with an Agent workflow. PYPILINE first conducts static analysis on known malicious packages, extracting abstract syntax trees and generating API call graphs, from which it automatically extracts and constructs a structured suspicious API knowledge base. During the detection phase, this knowledge base is used to enhance reasoning capabilities. Through an Agent workflow, PYPILINE performs in-depth semantic analysis of unknown packages and outputs a structured, interpretable maliciousness assessment report. The experimental results show that PYPILINE significantly outperforms existing state-of-the-art tools, achieving a precision of 96.7%, a recall of 99.6%, and an F1-score of 98.1%, with its precision surpassing baseline tools by 5.7 to 24.2 percentage points. Additionally, we conducted an empirical study on malicious packages, systematically revealing prevalent attack strategies, as well as the most commonly abused APIs. Equipped with tool-calling AI agent workflows for automated vector database retrieval of suspicious API knowledge and mail server delivery of analysis reports, PYPILINE delivers a practical, efficient, and convenient malicious package detection solution to strengthen open-source ecosystem security.
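The static-analysis step described in the abstract (parse a package's source into an abstract syntax tree and collect the API calls it makes) can be approximated with Python's standard `ast` module. The sketch below is not PYPILINE's implementation: the `SUSPICIOUS_APIS` set is a hypothetical stand-in for the paper's knowledge base, the code lists call names rather than building a call graph, and it does not resolve import aliases.

```python
import ast

# Hypothetical stand-in for a suspicious-API knowledge base.
SUSPICIOUS_APIS = {"os.system", "subprocess.Popen", "eval", "exec", "base64.b64decode"}


def dotted_name(func_node):
    """Return a dotted name such as 'os.system' for a call target, or None."""
    parts = []
    current = func_node
    while isinstance(current, ast.Attribute):
        parts.append(current.attr)
        current = current.value
    if isinstance(current, ast.Name):
        parts.append(current.id)
        return ".".join(reversed(parts))
    return None  # e.g. a method called on the result of another call


def extract_calls(source):
    """Collect the dotted names of all calls found in the source text."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name is not None:
                names.append(name)
    return names


def flag_suspicious(source):
    """Return the sorted suspicious API names that the source calls."""
    return sorted(set(extract_calls(source)) & SUSPICIOUS_APIS)


sample = "import os, base64\nos.system(base64.b64decode(payload).decode())\nprint('hi')\n"
flags = flag_suspicious(sample)
```

Because `ast.parse` only parses the text and never executes it, untrusted package code can be inspected this way without running it.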

2606.17454 2026-06-18 cs.AI cs.LG New submission 80%

Dissecting model behavior through agent trajectories

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

Affiliations: AWS AI Labs

Topic match (Other Agents): Analyzes AI agent trajectories to improve model behavior.

AI summary: Introduces the "intent-execution gap" concept and designs the Simple Strands Agent (SSA) harness; an analysis of 138k trajectories reveals behavioral differences between models in autonomous problem solving.

Comments: 106 pages, 50 Figures, 16 Tables

Abstract

AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers, which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

2606.15345 2026-06-18 cs.CL cs.IR New submission 80%

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

Affiliations: Waseda University; Northwestern University; RIKEN AIP; Snowflake Inc.; University of Utah; Duke-NUS Medical School; The University of Tokyo

Topic match (Other Agents): Evaluates the cross-lingual capability of deep research agents.

AI summary: Proposes the cross-lingual benchmark XBCP, which evaluates deep research agents when the evidence language differs from the query language, and finds substantial performance degradation on both the retrieval side and the agent side.

Comments: Preprint

Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.
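Evidence recall, one of the metrics listed in the abstract, is commonly computed as the fraction of gold evidence documents that the retriever returned. The function below uses that standard set-based definition as an assumption; the paper may define it differently (for example at a fixed cutoff), and the document IDs are made up for the example.

```python
def evidence_recall(retrieved_ids, gold_ids):
    """Fraction of gold evidence documents that appear among the retrieved ones."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold evidence set must not be empty")
    return len(gold & set(retrieved_ids)) / len(gold)


# Hypothetical document IDs: two of the four gold documents were retrieved.
recall = evidence_recall(["d1", "d7", "d9"], ["d1", "d2", "d9", "d4"])
```

Comparing this value for the same queries with original and translated evidence separates retrieval failures from the agent-side difficulty that the abstract reports even under oracle retrieval.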

2606.19079 2026-06-18 cs.AI New submission 75%

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

Affiliations: University of Turin; Samsung AI Center

Topic match (Other Agents): Dynamic adapter selection at inference time; a routing framework.

AI summary: Proposes ARIADNE, a training-free, adapter-agnostic routing framework that represents each adapter by centroids of its training-set embeddings and selects an adapter at inference time by distance in latent space, with no access to adapter internals and no extra training. It reaches 89.7% selection accuracy on 44 tasks.

Abstract

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
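The routing idea in the abstract (represent each adapter by centroids of its training-set embeddings, then pick the adapter whose centroid is closest to the query embedding) can be sketched as below. This is an illustration under assumptions, not ARIADNE itself: it uses a single mean centroid per adapter and Euclidean distance, the adapter names are invented, and the two-dimensional vectors stand in for real model embeddings.

```python
import math


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def build_index(train_embeddings):
    """Map each adapter name to a list of centroids (here: one mean centroid)."""
    return {name: [mean_vector(vectors)] for name, vectors in train_embeddings.items()}


def route(query_embedding, index):
    """Return the adapter whose nearest centroid is closest to the query."""
    distance, name = min(
        (math.dist(query_embedding, centroid), adapter)
        for adapter, centroids in index.items()
        for centroid in centroids
    )
    return name


# Toy 2-D "embeddings" for two hypothetical adapters.
index = build_index({
    "sentiment": [[1.0, 0.0], [0.8, 0.2]],
    "translation": [[0.0, 1.0], [0.2, 0.8]],
})
choice_a = route([0.7, 0.3], index)
choice_b = route([0.1, 0.6], index)
```

Adding a new adapter only requires computing its centroids and inserting one more entry into the index, which is why this style of routing needs no router retraining.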

2606.18259 2026-06-18 cs.HC cs.AI New submission 75%

Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Junjie Xu, Xingjiao Wu, Zihao Zhang, Yujia Xu, Yuzhe Yang, Jin Zhu, Luwei Xiao, Wen Wu, Liang He

Affiliations: East China Normal University; National University of Singapore

Topic match (Other Agents): Reviews the controlling role of affective dynamics in human-AI agent collaboration.

AI summary: Reviews the role of affective dynamics in human-AI agent collaboration and proposes treating affect as a coordination layer, rather than an internal property of AI, for calibrating trust, delegation, and governance.

Abstract

AI agents that plan, retain memory across sessions, invoke external tools and act with partial autonomy are transforming human-AI collaboration. Research on affective computing, simulated empathy in large language models, trust in automation and AI safety has illuminated important design principles, yet these literatures remain fragmented. No integrated account explains how affective cues operate within agentic collaboration, that is, settings in which humans delegate, monitor and correct consequential tasks. This Review synthesises computational and interactional mechanisms of affective dynamics: the processes through which affective cues, emotion-like behaviour and perceived agent affect shape trust calibration, delegation decisions, error correction, dependence and governance. We trace how model-generated affective signals enter interaction loops that govern reliance, repair and oversight, and propose a framework that treats affect not as an internal property of AI but as a coordination layer through which humans and agents negotiate capability, uncertainty and responsibility. The framework provides a foundation for calibrated measurement, purposeful design and informed governance.

2606.18406 2026-06-18 cs.CL New submission 70%

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs

Topic match (Other Agents): Long-term memory architecture for dialogue agents.

AI summary: Proposes the CoreMem architecture, which replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, enabling long-term-memory dialogue agents on edge devices with 8 GB of VRAM.

Comments: 15 pages, 5 figures

Abstract

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.

3. 工作流自动化 2 篇

2606.18874 2026-06-18 cs.AI 新提交 80%

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

通过研究框架将AI科学家的研究综合与验证外部化

Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

发表机构 * X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机学院X-LANCE实验室) Jiangsu Key Lab of Language Computing, Suzhou, China(江苏省语言计算重点实验室) Suzhou Laboratory, Suzhou, China(苏州实验室)

专题命中 工作流自动化 :自动化科学研究工作流,外部化综合与验证。

AI总结 提出Xcientist框架,将研究综合与实验验证外部化为可检查的合同驱动过程,解决自动研究中的声明漂移问题,并在多个领域验证其有效性。

Comments 65 pages, 14 figures, 19 tables

详情
AI中文摘要

AI系统日益能够自动化科学工作流程,但连接先前证据、生成的想法、实验和最终声明的推理通常仍然隐含在模型推理中。这里我们介绍Xcientist,一个研究框架,将研究综合和实验验证外部化为可检查的、合同驱动的过程。Xcientist将文献证据、想法状态、实施计划、消融记录和修复痕迹组织为持久的研究工件,使得生成的机制可以在不丢失其证据基础的情况下被基础化、执行、测试和修订。我们将声明漂移识别为自动化研究的一种失败模式,其中可运行的工件不再支持最初声称的机制。在无训练记忆系统、图结构交通预测和多尺度物理信息神经网络中,Xcientist保留了从问题公式化到机制设计、验证和有限修订的可追踪轨迹。这些结果表明,AI科学家不仅应根据其最终工件进行评估,还应看其综合和验证过程是否可归因、可检查且在科学上可问责。

英文摘要

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

2606.17510 2026-06-18 cs.SE cs.SY eess.SY 新提交 75%

OmniDroneX: An LLM-Assisted Holistic Drone-as-a-Service Ecosystem

OmniDroneX: 一种LLM辅助的全方位无人机即服务生态系统

I-Ling Yen, Akeem Mohammed, Farokh Bastani, San-Yih Hwang

专题命中 工作流自动化 :LLM辅助无人机服务组合与任务定义

AI总结 提出OmniDroneX统一无人机即服务生态系统,通过libUAV接口和PT-SOA抽象模型连接底层物理与高层任务,利用大语言模型辅助功能识别、服务组合和自然语言任务定义,支持多种组合技术以实现可扩展、自演进的无人机系统。

Comments This manuscript is a full version of a paper accepted in shortened form by IEEE International Conference on Joint Cloud Computing

详情
AI中文摘要

尽管无人机技术取得了快速进步,但由于无人机系统研究中的若干空白,当前部署仍然有限。为应对这些挑战,我们提出OmniDroneX,一个统一的无人机即服务生态系统,其中无人机从固定功能平台转变为动态可组合实体,可与外部基础设施集成以提供全方位能力。OmniDroneX通过统一的供应商无关接口(libUAV)和形式化的物理服务抽象模型(PT-SOA)连接底层物理原语与高层任务意图。一个核心创新是大语言模型(LLM)在OmniDroneX架构多层中的多样化应用。LLM用于辅助识别和形式化原始设备功能及抽象服务定义,支持自动化服务组合和工作流生成,并实现交互式自然语言任务规范与细化。OmniDroneX还包含了动态无人机系统中至关重要的多种组合技术类别,包括用于无人机能力增强的物理层组合,以及时空、功能、协作、异常感知和基于QoS的服务组合。总体而言,这些特性使OmniDroneX能够作为在复杂动态环境中运行的可扩展、有弹性和自演进的无人机生态系统的基础。

英文摘要

Despite rapid advances in UAV technologies, current deployments remain limited due to several gaps in UAV systems research. To address these challenges, we propose OmniDroneX, a unified Drone-as-a-Service ecosystem, in which drones are transitioned from fixed function platforms into dynamically composable entities that can be integrated with external infrastructures to offer omni-capabilities. OmniDroneX bridges low-level physical primitives with high-level mission intent through a unified vendor-agnostic interface (libUAV) and a formal physical-service abstraction model (PT-SOA). A core innovation is the diverse application of large language models (LLMs) across multiple layers of the OmniDroneX architecture. LLMs are used to assist in identifying and formalizing primitive device functions and abstract service definitions, supporting automated service composition and workflow generation, and enabling interactive, natural-language mission specification and refinement. OmniDroneX also incorporates important categories of composition techniques that are essential in dynamic UAV systems, including physical layer composition for drone capability augmentation, as well as spatiotemporal, functional, collaborative, exception-aware, and QoS-based service compositions. Collectively, these features allow OmniDroneX to serve as a foundation for scalable, resilient, and self-evolving UAV ecosystems operating in complex and dynamic environments.

4. 规划决策 10 篇

2606.18847 2026-06-18 cs.AI 新提交 80%

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

WorldLines: 对长时域有状态具身智能体进行基准测试与建模

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

发表机构 * HKUST(GZ)(香港科技大学(广州)) HKUST(香港科技大学) Knowin

专题命中 规划决策 :具身智能体长时记忆与任务规划。

AI总结 提出WorldLines基准,通过构建带时间跨度的家庭轨迹(含对话、动作、状态变化等)评估具身智能体的长时记忆与任务规划能力,并设计ObsMem记忆框架提升状态感知决策。

Comments 27 pages, 18 figures

详情
AI中文摘要

为了在真实家庭环境中长时间协助人类,具身智能体必须记住用户习惯、世界状态和过去的交互。现有的长期记忆基准主要评估以语言为中心的检索和问答,而具身基准通常关注短时域任务执行,未测试在动态环境中长期记忆的使用。我们引入WorldLines,一个项目驱动的长时域具身家庭辅助基准。它构建了带时间跨度的家庭轨迹,包含对话、动作、执行反馈、物体和设备状态变化,并将其转换为带有证据链接的样本,用于记忆问答和具身任务规划。我们进一步提出ObsMem,一个观察者锚定的记忆框架,维护可见性感知的记忆和动作原生状态轨迹,以实现状态感知的决策。实验揭示了在部分可观测性、被覆盖的世界状态以及将长期记忆转化为具身规划方面的持续挑战,而ObsMem为此场景提供了更强的参考架构。

英文摘要

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

2606.18746 2026-06-18 cs.AI 新提交 80%

What Must Generalist Agents Remember?

通用型智能体必须记住什么?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院)

专题命中 规划决策 :通用智能体记忆需求的形式化分析。

AI总结 本文形式化论证了通用型智能体为在多个环境和目标下近似最优行动,必须存储领域相关信息以区分观察瓶颈处的不兼容最优动作,并证明记忆可用于重构局部转移动态。

详情
AI中文摘要

本文形式化地阐述了通用型智能体为了在多个环境和目标下近似最优地行动,必须在记忆中存储什么。它表明,当两个领域共享一个观察瓶颈但需要不兼容的最优动作时,任何一致近似最优的策略必须在该瓶颈处诱导出不同的记忆分布。这一结果产生了一个分离定理:足够成功的智能体不能仅依赖当前状态观察,而必须在记忆中保留领域相关信息。本文进一步证明,如果智能体的记忆包含足够的信息来估计相关目标的值,那么该记忆可用于近似重构智能体的局部转移动态。综合这些结果,将记忆刻画为支持领域区分、转移模型重构和通用型智能体规划的基板。

英文摘要

This paper develops a formal account of what generalist agents must store in memory in order to act near-optimally across multiple environments and goals. It shows that when two domains share an observational bottleneck but require incompatible optimal actions, any uniformly near-optimal policy must induce distinct memory distributions at that bottleneck. The result yields a separation theorem: sufficiently successful agents cannot rely only on current state observations, but must preserve domain-relevant information in memory. The paper further shows that if an agent's memory contains enough information to estimate values for related goals, then that memory can be used to approximately reconstruct the agent's local transition dynamics. Together, these results characterize memory as the substrate that supports domain disambiguation, transition-model reconstruction, and planning for generalist agents.

2606.18105 2026-06-18 cs.NI cs.LG 新提交 80%

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

OmniPlan:一种用于及时且近乎最优的网络规划优化的自适应框架

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

发表机构 * Zhejiang University(浙江大学) Fuzhou University(福州大学) Yangzhou University(扬州大学) The State Key Laboratory of Blockchain and Data Security(区块链与数据安全国家重点实验室) College of Computer Science and Technology(计算机科学与技术学院)

专题命中 规划决策 :自适应框架动态选择求解器进行规划

AI总结 提出OmniPlan自适应框架,利用大语言模型解析用户意图,通过混合专家架构动态选择MIP求解器、启发式算法或深度强化学习模型,实现网络规划优化的及时性与近乎最优性,在分布式机器学习推理卸载任务中延迟降低97.8%,资源消耗降低11.5%。

Comments Accepted by ACM KDD 2026

详情
AI中文摘要

网络规划优化是跨多个领域(包括交通系统、通信网络和电网)的基本问题。它需要在复杂约束下同时优化多个相互竞争的目标。现有的网络规划优化框架依赖混合整数规划(MIP)求解器、启发式算法和深度强化学习(DRL)模型来计算规划决策。然而,它们缺乏对多样化和动态用户意图的有效适应性,从而导致执行时间与最优性之间的权衡。在本文中,我们提出OmniPlan,一种自适应框架,在网络规划优化中同时实现及时性和近乎最优性。为了实现现有解决方案所缺乏的适应性,OmniPlan采用基于大语言模型(LLM)的解释器,将异构的自然语言意图转换为统一且可量化的用户偏好向量。然后,它采用混合专家架构,集成MIP求解器、启发式算法和DRL模型作为专门专家,OmniPlan通过动态选择及时且近乎最优的专家来适应多样化的意图。最后,它包含一个基于DRL的专家配置模块,该模块微调优化目标权重,使规划决策与用户特定偏好对齐。我们使用代表性的真实工作负载(即分布式机器学习(ML))评估OmniPlan,其中我们利用OmniPlan将广泛的ML推理任务(例如决策树、SVM、朴素贝叶斯、XGBoost和随机森林)卸载到硬件设备网络。我们在真实测试平台上的实验表明,OmniPlan为真实ML推理任务实现了近乎最优且低执行时间的卸载,延迟降低高达97.8%,网络设备资源消耗降低高达11.5%。

英文摘要

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, which forces a trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.
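The step from a user-preference vector to a chosen expert can be sketched as a weighted score over expert profiles. This is only a toy under stated assumptions: the `EXPERTS` table, its `speed` and `quality` numbers, and the `select_expert` function are made up for illustration and are not OmniPlan's actual gating mechanism or measured values.

```python
# Hypothetical expert profiles (illustrative numbers, not measurements):
# higher "speed" means lower execution time, higher "quality" means closer to optimal.
EXPERTS = {
    "mip": {"speed": 0.1, "quality": 1.0},
    "heuristic": {"speed": 1.0, "quality": 0.6},
    "drl": {"speed": 0.8, "quality": 0.85},
}

def select_expert(preference, experts=EXPERTS):
    """Pick the expert with the highest preference-weighted score.

    preference: dict of non-negative weights over the profile keys,
    with at least one positive weight.
    """
    total = sum(preference.values())
    weights = {key: value / total for key, value in preference.items()}
    return max(
        experts,
        key=lambda name: sum(weights[k] * experts[name][k] for k in weights),
    )
```

In this toy table, a user who only cares about speed is routed to the heuristic, one who only cares about optimality is routed to the MIP solver, and a balanced preference lands on the DRL model.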

2606.17453 2026-06-18 cs.AI 新提交 80%

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

MapSatisfyBench: 通过基于行为的隐含决策因素对满意度感知地图智能体进行基准测试

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

专题命中 规划决策 :评估地图智能体的隐含需求满足能力

AI总结 提出MapSatisfyBench基准,通过恢复用户行为链中的隐含决策因素来评估地图智能体的满意度感知能力,实验表明现有智能体在显式任务完成上表现良好,但在满足隐含需求方面仍有局限。

详情
AI中文摘要

大型语言模型智能体越来越多地集成到地图服务中。由于地图服务嵌入在日常场景而非专业任务设置中,用户通常非正式地表达需求,导致查询不明确,包含许多未言明的需求,即对用户满意度至关重要的隐含决策因素。虽然澄清是缓解这一问题的有效方法,但它增加了日常交互中的用户负担,而一个能干的智能体应首先从可用信息源主动恢复这些因素。然而,评估这一能力具有挑战性。第一个挑战是确定哪些隐含决策因素适合评估。一个因素只有在影响用户接受度且能从智能体响应前可获取的信息中恢复时才是可评估的。其次,用户满意度不能可靠地由单个参考答案表示,需要一个将满意度相关因素转化为客观可量化评估目标的基准。为应对这些挑战,我们提出一个恢复-识别-过滤框架,从行为链证据中重建完整的用户需求,识别隐含决策因素,并仅保留那些有查询前证据支持的因素。基于此方法,我们从大规模真实世界匿名用户数据构建MapSatisfyBench,并从五个维度标注真实值,实现对满意度感知地图智能体的全链条评估。实验表明,当前智能体在显式任务完成上普遍表现良好,但在满足隐含决策因素和主动获取满意度感知决策所需证据方面仍然有限。这些发现使MapSatisfyBench成为将地图智能体评估从任务完成转向满意度感知空间决策的基准。

英文摘要

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2606.14202 2026-06-18 cs.NE cs.AI 新提交 80%

MeEvo: Metacognitive Evolution Combined with Natural Evolution for Automatic Heuristic Design

MeEvo: 元认知进化与自然进化相结合用于自动启发式设计

Zishang Qiu, Xinan Chen, Rong Qu, Ruibin Bai

发表机构 * School of Computer Science, University of Nottingham Ningbo China(诺丁汉大学宁波分校计算机科学学院) School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院)

专题命中 规划决策 :自动启发式设计框架,结合进化与元认知

AI总结 提出MeEvo框架,通过循环耦合自然进化(探索启发式代码)和元认知进化(反思历史生成改进启发式),解决现有方法知识继承弱、探索不足的问题,在五个优化问题上表现更优。

详情
AI中文摘要

大型语言模型(LLMs)通过推理和代码合成实现启发式生成,推动了自动启发式设计(AHD)的发展。现有的基于LLM的AHD架构主要遵循两种范式:自然进化,它使用交叉和变异来探索启发式程序;以及元认知进化,它通过反思来改进推理。然而,自然进化丢弃了推理轨迹,削弱了知识继承和利用,而元认知进化缺乏种群级别的重组,限制了探索并增加了过早收敛的风险。这些局限性降低了复杂问题的搜索效率、稳定性和解的质量。为了解决这一差距,我们提出了MeEvo,一种双层AHD框架,它循环耦合自然进化和元认知进化。自然进化探索启发式代码,同时将推理轨迹、适应度值和错误记录到共享历史中;然后元认知进化反思该历史以生成改进的启发式,这些启发式重新进入父代池以进行下一轮循环。这种设计使得种群驱动的探索和反思驱动的改进相互加强。在五个优化问题上的实验(使用两个LLM骨干)表明,MeEvo比现有的基于LLM的AHD架构实现了更强且更稳定的性能,尤其是在复杂约束任务上。

英文摘要

Large Language Models (LLMs) have advanced Automatic Heuristic Design (AHD) by enabling heuristic generation through reasoning and code synthesis. Existing LLM-based AHD architectures mainly follow two paradigms: Natural Evolution, which uses crossover and mutation to explore heuristic programs, and Metacognitive Evolution, which refines reasoning through reflection. However, Natural Evolution discards reasoning traces, weakening knowledge inheritance and exploitation, while Metacognitive Evolution lacks population-level recombination, limiting exploration and increasing the risk of premature convergence. These limitations reduce search efficiency, stability, and solution quality on complex problems. To address this gap, we propose MeEvo, a dual-layer AHD framework that cyclically couples Natural Evolution and Metacognitive Evolution. Natural Evolution explores heuristic code while recording reasoning traces, fitness values, and errors into a shared history; Metacognitive Evolution then reflects on this history to generate improved heuristics that re-enter the parent pool for the next cycle. This design enables population-driven exploration and reflection-driven refinement to reinforce each other. Experiments on five optimization problems with two LLM backbones show that MeEvo achieves stronger and more stable performance than existing LLM-based AHD architectures, especially on complex constrained tasks.
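The cyclic coupling described above can be sketched as a loop in which one layer explores and logs, and the other reads the log and feeds refined candidates back into the parent pool. Everything below is a stand-in under loud assumptions: "heuristics" are plain floats, `fitness` is a toy objective, and `metacognitive_evolution` replaces LLM reflection with a simple average of the best logged candidates. None of these names or choices come from the paper.

```python
import random

def fitness(x):
    # Toy objective standing in for heuristic evaluation: best at x = 3.
    return -(x - 3.0) ** 2

def natural_evolution(parents, history, rng):
    """Crossover plus mutation; every child is logged with its trace."""
    children = []
    for _ in range(len(parents)):
        a, b = rng.sample(parents, 2)
        child = (a + b) / 2 + rng.gauss(0, 0.5)
        history.append({
            "candidate": child,
            "fitness": fitness(child),
            "trace": f"crossover({a:.2f}, {b:.2f}) + mutation",
        })
        children.append(child)
    return children

def metacognitive_evolution(history, top_k=3):
    """Stand-in for reflection: average the best candidates in the shared history."""
    best = sorted(history, key=lambda h: h["fitness"], reverse=True)[:top_k]
    return [sum(h["candidate"] for h in best) / len(best)]

def meevo(initial, cycles=5, seed=0):
    rng = random.Random(seed)
    parents, history = list(initial), []
    for _ in range(cycles):
        children = natural_evolution(parents, history, rng)   # explore and record
        refined = metacognitive_evolution(history)            # reflect on the record
        pool = parents + children + refined                   # refined re-enter the pool
        parents = sorted(pool, key=fitness, reverse=True)[:len(initial)]
    return parents, history
```

The point of the sketch is the data flow: the shared `history` is what lets the reflective step inherit information that plain crossover and mutation would otherwise discard.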

2606.18888 2026-06-18 cs.AI 新提交 75%

Generative-Model Predictive Planning for Navigation in Partially Observable Environments

部分可观测环境下导航的生成模型预测规划

Thomas Quilter, Yifan Zhu, Guorui Quan, Mingfei Sun, Samuel Kaski

发表机构 * University of Manchester(曼彻斯特大学) Aalto University(阿尔托大学)

专题命中 规划决策 :生成模型预测规划用于导航

AI总结 提出BeliefDiffusion框架,结合扩散模型和模型预测控制,显式建模多模态信念分布并进行前瞻规划,在合成地图环境中显著优于无模型强化学习和生成方法。

详情
AI中文摘要

部分可观测环境中的导航对自主智能体构成重大挑战,需要在未知环境中利用有限的感知信息做出有效决策。基于信念的方法,特别是那些使用神经网络近似信念空间的方法,往往无法捕捉信念空间固有的多模态性,尤其是在具有感知混淆的高维情况下。虽然生成模型提供了一种有吸引力的替代方案,但它们通常需要大量数据或专家演示,并且缺乏长期规划的显式机制。在本文中,我们介绍了BeliefDiffusion,一种结合了生成和规划优势的新框架。BeliefDiffusion利用扩散模型显式表征多模态信念分布,并利用模型预测控制(MPC)同时进行前瞻规划。它包含两个步骤:(1)基于观测历史想象合理的环境配置;(2)在聚合的配置上规划高效的导航策略。通过在合成地图环境中的大量实验,我们证明BeliefDiffusion在导航成功率和路径效率上显著优于无模型强化学习基线和其它生成方法。我们的结果验证了将多模态信念表示显式纳入规划能够在部分可观测设置中实现更鲁棒的导航。

英文摘要

Navigation in partially observable environments presents a significant challenge for autonomous agents, requiring effective decision-making with limited sensory information in unknown environments. Belief-based methods, particularly those using neural networks to approximate the belief space, often fail to capture the inherent multimodality of belief spaces, especially in high-dimensional cases with perceptual aliasing. While generative models present a compelling alternative, they typically require substantial data or expert demonstrations and lack explicit mechanisms for long-term planning. In this paper, we introduce BeliefDiffusion, a novel framework that combines the benefits of both generation and planning. BeliefDiffusion leverages diffusion models to explicitly characterize multimodal belief distributions and utilizes Model Predictive Control (MPC) to simultaneously plan ahead. It consists of two steps: (1) imagining plausible environment configurations based on the observation history and (2) planning efficient navigation strategies across the aggregated configurations. Through extensive experiments in synthetic map environments, we demonstrate that BeliefDiffusion significantly outperforms both model-free reinforcement learning baselines and other generative approaches in navigation success rate and path efficiency. Our results validate that explicitly incorporating multimodal belief representations into planning enables more robust navigation in partially observable settings.
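The two-step idea (imagine several plausible maps, then plan across all of them) can be sketched on a tiny grid. This is an assumption-heavy illustration: `sampled_maps` is a hand-written list of obstacle sets standing in for samples from a diffusion model, and `plan_first_move` is a hypothetical one-step receding-horizon planner that averages shortest-path cost over the samples, not the paper's MPC.

```python
from collections import deque

def path_length(start, goal, blocked, size):
    """Breadth-first shortest path length on a size x size grid, or None."""
    if start in blocked:
        return None
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        (x, y), dist = queue.popleft()
        if (x, y) == goal:
            return dist
        for nxt in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
            in_bounds = 0 <= nxt[0] < size and 0 <= nxt[1] < size
            if in_bounds and nxt not in blocked and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def plan_first_move(position, goal, sampled_maps, size, penalty=100):
    """Choose the move with the lowest average cost across imagined maps."""
    x, y = position
    moves = {"right": (x + 1, y), "left": (x - 1, y),
             "down": (x, y + 1), "up": (x, y - 1)}

    def expected_cost(cell):
        costs = []
        for blocked in sampled_maps:
            in_bounds = 0 <= cell[0] < size and 0 <= cell[1] < size
            length = path_length(cell, goal, blocked, size) if in_bounds else None
            costs.append(penalty if length is None else 1 + length)
        return sum(costs) / len(costs)

    return min(moves, key=lambda name: expected_cost(moves[name]))
```

When most imagined maps block the direct route, the planner takes the detour even though one sample says the direct route is free, which is the behavior a single point estimate of the map cannot express.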

2606.19214 2026-06-18 econ.GN q-fin.EC 新提交 70%

Testing Centralized and Polycentric Computational Planning

测试集中式和多中心计算规划

Ricardo Alonzo Fernández Salguero

专题命中 规划决策 :比较计算规划者与基于代理的市场,涉及规划决策

AI总结 本文提出一个可复现的合成基准,在模拟经济中比较计算规划者、基于代理的市场和混合元市场,发现规划者福利损失更低,但结果受设计选择影响,主要贡献是方法论而非意识形态。

详情
AI中文摘要

本文提出了一个可复现的合成基准,在共同的模拟经济中比较计算规划者、基于代理的市场和混合元市场。该基准包含投入产出生产网络、异质企业、产能约束、内生价格、福利指标、结构性冲击、对抗性压力测试和信息报告实验。在训练、保留和对抗性场景中,规划者始终比分散化替代方案实现更低的福利损失。主要贡献是方法论而非意识形态的。虽然该基准展示了一个可证伪的框架用于比较经济协调机制,但它并未确立规划的实证优越性。若干设计选择机械地偏向规划者,包括信息不对称、不完整的市场表示和简化的制度假设。因此,结果应被解释为对合成实验架构的验证,以及作为未来研究的原型。本文最后概述了一个基于实证校准、结构性保留、敏感性分析、不确定性量化、机制设计测试和独立复制的验证议程。

英文摘要

This paper presents a reproducible synthetic benchmark comparing a computational planner, an agent-based market, and a hybrid meta-market within a common simulated economy. The benchmark incorporates input-output production networks, heterogeneous firms, capacity constraints, endogenous prices, welfare metrics, structural shocks, adversarial stress testing, and information-reporting experiments. Across training, holdout, and adversarial scenarios, the planner consistently achieves lower welfare losses than the decentralized alternatives. The main contribution is methodological rather than ideological. While the benchmark demonstrates a falsifiable framework for comparing economic coordination mechanisms, it does not establish the empirical superiority of planning. Several design choices mechanically favor the planner, including informational asymmetries, incomplete market representation, and simplified institutional assumptions. The results should therefore be interpreted as validation of a synthetic experimental architecture and as a prototype for future research. The paper concludes by outlining a validation agenda based on empirical calibration, structural holdouts, sensitivity analysis, uncertainty quantification, mechanism-design tests, and independent replication.

2606.18963 2026-06-18 cs.LG 新提交 70%

Online Reward-Punishment Learning from Fixed-Channel Perceptual Event Streams without Environment Rewards

无环境奖励的固定通道感知事件流在线奖惩学习

Zirong Li

发表机构 * Zirong Li

专题命中 规划决策 :提出无环境奖励的在线奖惩学习框架。

AI总结 提出OHIRL框架,在无标量奖励下通过固定通道感知流进行在线奖惩学习,利用内部轨迹评估器推断感知维度的效价,在XOR任务和CartPole等控制任务中达到高准确率。

Comments 9 pages, 5 figures, 6 tables; 13-page technical supplement

详情
AI中文摘要

我们研究当环境不提供标量奖励或评估标签时的在线奖惩学习。在每一步,智能体仅接收一个固定通道的感知数据包,诸如疼痛、能量、接触、损伤或认知错误等量被视为感知维度,其效价必须从转移后果中推断。OHIRL分离了四个角色:M_psi学习下一数据包预测,D_omega建模残差动力学,C_eta是一个固定的内部转移后轨迹评估器,B_xi学习使用由此产生的价值证据进行后续策略更新和动作评分。C_eta采用恢复正性、持久/增长负性的残差调节取向;系数来源审计显示,等单元、原始等值和随机单调变体保留了超过92%的已发布顶级动作排名,而符号反转保留了0%。无奖励协议暴露观察转移,同时隐藏环境奖励、延迟外部评估器、成功标签和动作好坏标签。条件误差分解将B_xi的证据估计误差与残差策略优化误差分离。在2x2-XOR数据包任务中,药物和辣椒在视觉XOR上下文中获得相反的价值,并且相同的疼痛或辣度增加可能根据后果结构为正或负;B_xi达到0.952的平衡奖励符号准确率。在完整的在线交错审计中,M_psi达到留出R2=0.907,B_xi达到0.940的符号准确率,策略达到0.979的最优动作准确率,而即时数据包分数、预测误差奖励、打乱目标、零奖励和误差减少控制均崩溃。隐藏奖励的CartPole和Taxi控制、公共上下文无泄漏审计以及模块角色消融进一步测试了信息边界和组件必要性。

英文摘要

We study online reward-punishment learning when the environment provides no scalar reward or evaluative label. At each step the agent receives only a fixed-channel perceptual packet, and quantities such as pain, energy, contact, damage, or cognitive error are treated as perceptual dimensions whose valence must be inferred from transition consequences. OHIRL separates four roles: M_psi learns next-packet prediction, D_omega models residual dynamics, C_eta is a fixed internal post-transition trajectory evaluator, and B_xi learns to use the resulting value evidence for later policy updates and action scoring. C_eta uses a recovery-positive and persistence/growth-negative residual-regulation orientation; a coefficient-origin audit shows that equal-unit, raw-equal, and random monotone variants preserve more than 92% of the released top-action rankings, while sign inversion preserves 0%. The reward-free protocol exposes observation transitions while withholding environment rewards, delayed external evaluators, success labels, and action-goodness labels. A conditional error decomposition separates B_xi evidence-estimation error from residual policy-optimization error. In a 2x2-XOR packet task, medicine and chili acquire opposite value under visual XOR contexts, and the same pain or spice increase can be positive or negative depending on consequence structure; B_xi reaches 0.952 balanced reward-sign accuracy. In a full online-interleaved audit, M_psi reaches holdout R^2=0.907, B_xi reaches 0.940 sign accuracy, and the policy reaches 0.979 optimal-action accuracy, while immediate packet scores, prediction-error rewards, shuffled targets, zero reward, and error-reduction controls collapse. Hidden-reward CartPole and Taxi controls, public-context no-leakage audits, and module-role ablations further test information boundaries and component necessity.

2606.18388 2026-06-18 cs.LG cs.AI cs.CL cs.MA 新提交 70%

LLMZero: Discovering Adaptive Training Strategies for RL Post-Training via LLM Agents

LLMZero: 通过LLM智能体发现RL后训练的自适应训练策略

Haoyang Fang, Wei Zhu, Boran Han, Alex Zhang, Zhenyu Pan, Shuo Yang, Shuai Zhang, Jiading Gai, Peng Tang, Cuixiong Hu, Xuan Zhu, Huzefa Rangwala, George Karypis, Bernie Wang

发表机构 * Amazon(亚马逊)

专题命中 规划决策 :利用LLM智能体进行树搜索发现训练策略

AI总结 提出LLMZero系统,利用LLM智能体通过树搜索发现多阶段RL后训练的自适应策略,揭示容量参数单调累积、正则化参数振荡的规律,在4个GRPO任务上相对基线提升9%-140%。

详情
AI中文摘要

RL后训练策略依赖于数据集,并揭示了一个反复出现的经验模式:容量参数在阶段间单调累积,而正则化参数主要根据训练动态的变化而振荡。这种区别很重要,因为固定调度将所有参数提交到固定轨迹,因此无法表达正则化必须跟踪的非平稳探索-利用权衡;该原则为多阶段训练提供了可操作的设计规则。我们通过LLMZero发现了这一点,该系统通过树搜索让LLM智能体搜索训练轨迹,诊断每个检查点的病理并提出协调的多参数转换。在4个不同的GRPO任务中,LLMZero发现的策略相对基础模型提升9%到140%,相对网格搜索提升6%到15%,始终优于随机搜索和基于技能的智能体。该结构原则跨任务迁移,解释了为什么发现的策略形式不同但参数动态相似。

英文摘要

RL post-training strategies are dataset-dependent and reveal a recurring empirical pattern: capacity parameters accumulate monotonically across stages, while regularization parameters predominantly oscillate in response to shifting training dynamics. This distinction matters because fixed schedules commit all parameters to fixed trajectories and therefore cannot express the non-stationary exploration-exploitation tradeoffs that regularization must track; the principle provides actionable design rules for multi-stage training. We discover this through LLMZero, a system where LLM agents search over training trajectories via tree search, diagnosing pathologies at each checkpoint and proposing coordinated multi-parameter transitions. Across 4 diverse GRPO tasks, LLMZero discovers strategies that improve over the base model by 9% to 140% relative and over grid search by 6% to 15% relative, consistently outperforming random search and the skill-based agent. The structural principle transfers across tasks, providing an explanation for why discovered strategies take qualitatively different forms yet share similar parameter dynamics.

2606.19134 2026-06-18 cs.LG cs.AI 新提交 65%

Pareto Q-Learning with Reward Machines

带奖励机的帕累托Q学习

Arnaud Lequen, Clément Legrand-Lixon, Léo Saulières

发表机构 * Linköping University, Sweden(瑞典林雪平大学) Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France(法国里尔大学、CNRS、中央里尔学院、UMR 9189 CRIStAL、法国里尔) Univ. Toulouse, INRAE-MIAT, Toulouse, France(法国图卢兹大学、INRAE-MIAT、图卢兹)

专题命中 规划决策 :多目标强化学习算法,用于智能体决策

AI总结 提出PQLRM算法,结合帕累托Q学习和奖励机,在多目标强化学习中高效逼近帕累托前沿,并处理非马尔可夫奖励。

Comments Accepted at the ICAPS 2026 Workshop on Bridging the Gap Between AI Planning and (Reinforcement) Learning (PRL)

详情
AI中文摘要

我们提出了带奖励机的帕累托Q学习(PQLRM),这是一种用于任务的多目标强化学习算法,其奖励结构由一组奖励机(RMs)指定。PQLRM结合了帕累托Q学习(PQL)(该方法维护向量值Q估计的集合以逼近帕累托前沿)和带奖励机的Q学习(QRM)的增强(该方法利用奖励信号的因子化自动机结构)。这产生了一种多策略算法,在非马尔可夫、RM编码的奖励下保持样本效率。实验表明,PQLRM比应用于叉积MDP的朴素PQL基线收敛更快,并且可以合成QRM无法获得的帕累托最优策略。

英文摘要

We present Pareto Q-Learning with Reward Machines (PQLRM), a multi-objective reinforcement learning algorithm for tasks whose reward structure is specified by a set of reward machines (RMs). PQLRM combines Pareto Q-Learning (PQL), which maintains sets of vector-valued Q-estimates to approximate the Pareto front, with enhancements from Q-Learning with Reward Machines (QRM), which exploits the factored automaton structure of the reward signal. This yields a multi-policy algorithm that remains sample-efficient under non-Markovian, RM-encoded rewards. Experimental trials show that PQLRM converges faster than a naive PQL baseline applied to the cross-product MDP and can synthesize Pareto-optimal policies that QRM cannot.
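The phrase "sets of vector-valued Q-estimates to approximate the Pareto front" rests on one basic operation: keeping only the non-dominated vectors. The sketch below shows that operation and a simplified set-valued backup for a deterministic transition. It is an illustration only; the function names are hypothetical, the backup ignores reward machines entirely, and it is not claimed to match PQLRM's actual update rule.

```python
def dominates(a, b):
    """True if a is at least as good as b everywhere and strictly better somewhere."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(vectors):
    """Return the non-dominated vectors, preserving first-seen order."""
    unique = list(dict.fromkeys(tuple(v) for v in vectors))
    return [v for v in unique if not any(dominates(o, v) for o in unique)]

def set_backup(reward, gamma, next_action_sets):
    """Simplified set-valued backup: r + gamma * ND(union of successor sets)."""
    union = [q for q_set in next_action_sets for q in q_set]
    return [
        tuple(r + gamma * q for r, q in zip(reward, vec))
        for vec in pareto_front(union)
    ]
```

For example, `pareto_front([(1, 5), (3, 3), (2, 2), (5, 1)])` drops `(2, 2)` because `(3, 3)` is better on both objectives, while the other three remain as mutually incomparable trade-offs.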

5. 工具调用 2 篇

2606.18803 2026-06-18 cs.AI cs.CY 新提交 80%

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

ProfiLLM: 面向工业网约车调度的效用对齐智能用户画像

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

发表机构 * Didichuxing Co. Ltd(滴滴出行科技有限公司)

专题命中 工具调用 :LLM智能体用于网约车调度用户画像

AI总结 提出ProfiLLM,一种通过工具增强全局知识挖掘和效用对齐画像探索的智能LLM数据管道,解决工业网约车调度中大规模行为日志的用户画像问题,在滴滴生产系统中实现AUC提升6.14%、GMV提升4.35%。

详情
AI中文摘要

将大型语言模型(LLM)作为语义特征提取器引入工业网约车调度,处理平台规模的行为日志,是一个引人注目但尚未充分探索的数据系统问题。生产匹配管道仍然以结构化数值特征为主,但关键的行为信号(例如,驾驶员对某些区域的习惯性厌恶)本质上是上下文相关的,并且可以自然地表达为LLM生成的用户画像。然而,将这种画像扩展到实时的、毫秒级延迟的调度器面临三个相互交织的约束,这些约束很少被一起解决:在一个拥有数百万日订单量的平台上,日志超出任何LLM的上下文窗口数个数量级;大多数用户是长尾用户,交互太少无法进行单个用户画像;表面流畅的画像不一定能提高下游预测效用。我们提出了ProfiLLM,一个智能LLM数据管道,通过两个模块实现面向生产匹配系统的效用对齐用户画像。(1)工具增强全局知识挖掘:为LLM智能体配备27个分析工具,用于挖掘平台规模的数据,生成可复用的全局知识、自适应用户聚类规则和区域级供需先验。(2)效用对齐画像探索:为每个聚类生成多个候选画像,通过轻量级下游效用代理进行评估,迭代优化最佳候选,并为DPO微调构建偏好对。在滴滴生产调度器上部署后,ProfiLLM在结果预测中实现了高达+6.14%的相对AUC改进,在调度模拟中实现了高达+4.35%的GMV增长,并在14天在线A/B测试中持续改进,包括+0.47% GMV、+0.33%完成率和-0.82%接单前取消率。

英文摘要

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.
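Module (2) scores candidate profiles with a utility proxy and turns them into preference pairs for DPO fine-tuning. A minimal sketch of that pairing step follows, assuming a hypothetical `utility` callable and a `min_margin` threshold that the abstract does not mention; real DPO training data would also carry the prompt alongside each pair.

```python
from itertools import combinations

def build_preference_pairs(candidates, utility, min_margin=0.0):
    """Pair candidates so the higher-utility one is 'chosen'.

    utility: callable mapping a candidate profile to a scalar proxy score
    (hypothetical stand-in for a downstream utility proxy).
    min_margin: skip pairs whose score gap is too small to be informative.
    """
    ranked = sorted(candidates, key=utility, reverse=True)
    pairs = []
    for better, worse in combinations(ranked, 2):
        if utility(better) - utility(worse) > min_margin:
            pairs.append({"chosen": better, "rejected": worse})
    return pairs
```

Filtering by margin is a design choice in this sketch: near-ties in a noisy proxy carry little signal, so they are left out of the preference data.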

2606.18550 2026-06-18 cs.CR 新提交 70%

The Gate Is Only as Honest as Its Contracts: ContractGuard for the Contract Layer of Risk-Aware Causal Gating

门仅与其合约一样诚实:面向风险感知因果门控合约层的ContractGuard

Laxmipriya Ganesh Iyer, Rahul Suresh Babu

专题命中 工具调用 :保护工具增强型LLM代理

AI总结 针对工具增强型LLM代理的间接提示注入,提出ContractGuard,通过验证合约完整性(而非风险标签)来防御攻击,在基准测试中实现零注入成功率。

详情
AI中文摘要

风险感知因果门控(RACG)通过从代理的可见动作空间中移除危险工具来防御工具增强型LLM代理免受间接提示注入,使得即使完全符合注入条件的代理也无法调用其不可见的工具。我们提出三点。首先,这种结构性保证并未消除安全工具使用背后的信任假设;它将其转移到门所读取的工具合约——声明的先决条件、效果、风险和授权——的完整性上,因此攻击者若破坏合约,可使门误判而无需说服代理。其次,伪造工具的效果比篡改其风险标签更危险,因为RACG在可准入门之前应用因果门:离路径工具从不暴露,因此仅重新标记风险会失败,而效果伪造则将危险工具路由到因果路径上并成功。效果完整性,而非风险标签,是承载假设。第三,我们引入ContractGuard,一个位于注册表和门之间的验证器,它分层使用签名来源、类型化合约认证和运行时效果验证;在受控基准测试中,它针对所有建模攻击(包括穷举白盒自适应攻击)将注入成功率恢复为零,且不会过度拒绝诚实合约,该结构性预测在六个当前代托管模型(Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B)上得到确认。

英文摘要

Risk-Aware Causal Gating (RACG) defends tool-augmented LLM agents against indirect prompt injection by removing dangerous tools from the agent's visible action space, so that even a fully injection-compliant agent cannot call a tool it cannot see. We make three points. First, this structural guarantee does not eliminate the trust assumption behind safe tool use; it relocates it into the integrity of the tool contracts -- declared preconditions, effects, risk, and authorization -- that the gate reads, so an attacker who corrupts a contract can make the gate mis-decide without ever persuading the agent. Second, forging a tool's effects is strictly more dangerous than tampering with its risk label, because RACG applies a causal gate before its admissibility gate: an off-path tool is never exposed, so risk-relabeling alone fails, whereas effect forgery routes the dangerous tool onto the causal path and succeeds. Effect integrity, not the risk label, is the load-bearing assumption. Third, we introduce ContractGuard, a verifier between the registry and the gate that layers signed provenance, typed contract attestation, and runtime effect verification; on a controlled benchmark it restores injection success to zero against every modeled attack -- including an exhaustive white-box adaptive attacker -- without over-rejecting honest contracts, and the structural prediction is confirmed on six current-generation hosted models (Claude Opus 4.8, Sonnet 4.6, Haiku 4.5; Amazon Nova Premier and Nova 2 Lite; GPT-OSS-120B).
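The gate-ordering argument can be made concrete with a toy model: a causal gate that keeps only tools whose declared effects touch the goal, followed by an admissibility gate on declared risk, plus an HMAC check standing in for "signed provenance". All names here (`Contract`, `visible_tools`, `verified`, the tool names, the key) are hypothetical, and the forged contract is assumed to declare on-path effects together with an admissible risk label; the sketch is not the paper's RACG or ContractGuard implementation.

```python
import hashlib
import hmac
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Contract:
    name: str
    effects: frozenset
    risk: str          # "low" or "high", as declared by the contract
    tag: str = ""      # provenance tag over the declared fields

def sign(contract, key):
    """HMAC over the declared fields (toy stand-in for signed provenance)."""
    payload = f"{contract.name}|{sorted(contract.effects)}|{contract.risk}".encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verified(contracts, key):
    """Drop contracts whose declared fields no longer match their tag."""
    return [c for c in contracts if hmac.compare_digest(c.tag, sign(c, key))]

def visible_tools(contracts, goal_effects):
    """Causal gate first (effects on the goal path), then admissibility (risk)."""
    on_path = [c for c in contracts if c.effects & goal_effects]
    return [c.name for c in on_path if c.risk == "low"]

KEY = b"registry-key"  # hypothetical registry secret

def signed(name, effects, risk):
    contract = Contract(name, frozenset(effects), risk)
    return replace(contract, tag=sign(contract, KEY))

goal = frozenset({"summary_written"})
honest = [signed("summarize", {"summary_written"}, "low"),
          signed("wire_money", {"funds_moved"}, "high")]
relabeled = [honest[0], replace(honest[1], risk="low")]
forged = [honest[0], replace(honest[1], effects=goal, risk="low")]
```

In this toy, relabeling the risk alone leaves `wire_money` invisible because it never passes the causal gate, whereas a contract with forged effects is exposed unless the integrity check runs before the gate.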

6. Software Agents (3 papers)

2606.18294 2026-06-18 physics.ins-det nucl-ex physics.app-ph New submission 80%

Vision AI Agent for Continuous Material Monitoring of LEGEND-1000 LoFi Reentrant Tube

用于LEGEND-1000 LoFi回旋管连续材料监测的视觉AI智能体

Sonata Simonaitis-Boyd, Soonhong Lee, Lauren N. O'Brien, Brandon T. Turner, Ralph Massarczyk, Steven R. Elliott, Aobo Li, Alexander F. Leder

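Topic match Software Agents: LangChain agent pipeline, automated material monitoring

AI summary Proposes a vision AI agent pipeline built on LangChain and Claude Haiku 4.5 that uses SAM2 segmentation and hybrid OCR validation to automatically extract the diameter and strain of OFHC copper cylinders from hydrostatic test videos, then computes yield strength and compares it with simulations.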

Comments 27 pages, 8 figures, 5 tables, submitted to PRX Intelligence

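AI-generated abstract

We report a vision AI agent pipeline for non-contact extraction of material strain and properties from video data, demonstrated on video of hydrostatic tests of four OFHC copper cylinders carried out during the LEGEND-1000 hardware validation campaign. Traditional strain gauge measurements proved unreliable, which motivated a fully automated agentic alternative. The agent is built on the LangChain framework with Claude Haiku 4.5 as its core reasoning engine and integrates a dedicated suite of computer vision tools: FFmpeg for video preprocessing and rotation correction via the Hough line transform, Segment Anything Model 2 (SAM2) for spatiotemporal segmentation with automated memory-aware dynamic chunking, and a hybrid EasyOCR and LLM-based timestamp validation pipeline. Three specialized sub-agents were developed to process the video data and obtain cylinder diameters and timestamps while autonomously handling obstacles such as corrupted frames and memory limits. From the diameter profiles synchronized with pressure data, hoop stress-strain curves were reconstructed, and yield strengths were computed with the 0.2% offset, 0.5% EUL, and Johnson-Cook methods across two independent tests. Cross-validation against a non-agentic pipeline confirmed that the diameter extraction agrees at the ±5 pixel level. The material properties and test results were further compared with Ansys mechanical simulations performed as part of the LEGEND-1000 reentrant tube design campaign. This work demonstrates that agentic pipelines can extract materials data from video alone.

English abstract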

We report on a vision AI agent pipeline for non-contact material strain and property extraction from video data, demonstrated on video taken during hydrostatic testing of four OFHC copper cylinders conducted as part of the LEGEND-1000 hardware validation campaign. Traditional strain gauge measurements proved unreliable, motivating a fully automated agentic alternative. The agent was built on the LangChain framework with Claude Haiku 4.5 as its central reasoning engine, integrating a specialized suite of computer vision tools: FFmpeg for video preprocessing and rotation correction via Hough Line Transform, the Segment Anything Model 2 (SAM2) for spatiotemporal segmentation with automated memory-informed dynamic chunking, and a hybrid EasyOCR and LLM-based timestamp validation pipeline. Three specialized sub-agents were developed to process the video data and obtain cylinder diameters and timestamps while autonomously handling obstacles such as corrupted frames and memory limits. From the diameter profiles synchronized to pressure data, hoop stress-strain curves were reconstructed and yield strengths were calculated using the 0.2% offset, 0.5% EUL, and Johnson-Cook methods across two independent tests. Cross-validation against a non-agentic pipeline confirmed agreement for the diameter extraction at the ±5 pixel level. The material properties and testing results were further compared to Ansys mechanical simulations performed as part of the LEGEND-1000 reentrant tube design campaign. This work showcases the power of agentic pipelines to extract materials data from video alone.
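
As a rough illustration of the last analysis step, the 0.2% offset method can be sketched in a few lines. This is a minimal sketch, not the authors' code: the function names, the choice of fitting the elastic modulus through the origin over the first three samples, and the synthetic bilinear curve are all assumptions made for the example. Hoop strain is taken as the relative change in diameter, so pixel units cancel.

```python
def hoop_strain(diameters):
    """Engineering hoop strain relative to the first (unloaded) diameter.

    The ratio is dimensionless, so diameters may be given in pixels.
    """
    d0 = diameters[0]
    return [(d - d0) / d0 for d in diameters]


def offset_yield_strength(strain, stress, offset=0.002, elastic_points=3):
    """Stress where the curve crosses a line of elastic slope shifted by `offset`."""
    # Least-squares slope through the origin over the initial elastic samples.
    head = list(zip(strain[:elastic_points], stress[:elastic_points]))
    modulus = sum(e * s for e, s in head) / sum(e * e for e, _ in head)
    # gap > 0 while the measured curve lies above the offset line.
    gap = [s - modulus * (e - offset) for e, s in zip(strain, stress)]
    for i in range(1, len(gap)):
        if gap[i - 1] > 0 >= gap[i]:
            # Linear interpolation inside the crossing segment.
            t = gap[i - 1] / (gap[i - 1] - gap[i])
            return stress[i - 1] + t * (stress[i] - stress[i - 1])
    raise ValueError("curve never crosses the offset line")


# Synthetic bilinear curve in MPa: elastic slope 100000 up to 0.1% strain,
# then linear hardening with slope 10000 (illustrative numbers only).
strain = [i * 0.0005 for i in range(11)]
stress = [100000 * e if e <= 0.001 else 100 + 10000 * (e - 0.001) for e in strain]
yield_mpa = offset_yield_strength(strain, stress)
```

For this synthetic curve the offset line meets the hardening branch at about 122.2 MPa. Real data would need smoothing and a more careful choice of the elastic fitting window than the fixed three points assumed here.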

2606.15828 2026-06-18 cs.SE New submission 80%

Configuration Smells in AGENTS.md Files: Common Mistakes in Configuring Coding Agents

AGENTS.md 文件中的配置异味:配置编码代理的常见错误

Helio Victor F. dos Santos, Vitor Costa, Joao Eduardo Montandon, Luciana Lourdes Silva, Marco Tulio Valente

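Topic match Software Agents: configuration problems of coding agents, which fall under AI agents

AI summary This paper is the first to systematize smells in coding-agent configuration files (AGENTS.md/CLAUDE.md). Through a grey literature review and repository mining, it identifies six smells and verifies their prevalence in 100 open-source repositories, where Lint Leakage is the most common (62%).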

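AI-generated abstract

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents usually rely on configuration files (typically named AGENTS.md or CLAUDE.md) that give instructions on architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about the common problems that affect how these files are defined and maintained. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed heuristics to detect them automatically. To assess how prevalent the proposed smells are, we analyzed 100 popular open-source repositories that contain an AGENTS.md or CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage is the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells often co-occur, in particular Context Bloat, Skill Leakage, and Conflicting Instructions.

English abstract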

Coding agents are increasingly used to automate software engineering tasks. To guide their behavior, these agents commonly rely on configuration files, typically named AGENTS.md or CLAUDE.md, which provide instructions about architecture, workflows, coding conventions, and testing practices. Despite their growing importance, little is known about common problems affecting the definition and maintenance of these files. In this paper, we present the first catalog of smells for coding-agent configuration files. To identify such smells, we first conducted a grey literature review and a repository mining analysis. As a result, we identified six configuration smells and proposed automated heuristics to detect them. To evaluate the prevalence of the proposed smells, we analyzed 100 popular open-source repositories containing either an AGENTS.md or a CLAUDE.md file. Our results show that configuration smells are widespread. Lint Leakage was the most common smell, affecting 62% of the files, followed by Context Bloat (42%) and Skill Leakage (35%). We further show that several smells frequently co-occur, particularly Context Bloat, Skill Leakage, and Conflicting Instructions.

2606.18619 2026-06-18 cs.CR cs.AI cs.SE New submission 70%

Code-Augur: Agentic Vulnerability Detection via Specification Inference

Code-Augur:通过规约推断的智能体漏洞检测

Zhengxiong Luo, Mehtab Zafar, Dylan Wolff, Abhik Roychoudhury

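Affiliation * National University of Singapore

Topic match Software Agents: vulnerability auditing by autonomous LLM agents

AI summary Proposes a security-specification-first paradigm that makes the agent's assumptions explicit and falsifies them at runtime; combined with guided fuzzing, this improves vulnerability detection and finds more vulnerabilities than existing agents on real-world projects.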

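AI-generated abstract

The emergence of agentic vulnerability detection has become a watershed for software security. Audits carried out entirely by autonomous LLM agents are finding critical vulnerabilities in the foundational software of digital society. Many of these vulnerabilities stayed hidden for years and have only now been found by AI agents. Yet the reasoning behind these findings remains worryingly opaque and unverified. When an agent judges a function to be safe, what assumptions did it make about the function's inputs? Reasoning failures and wrong assumptions can cause vulnerabilities to be missed and erode trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's implicit assumptions explicitly as security specifications and (2) continuously refines those specifications through runtime falsification. We implement our approach in Code-Augur, a new harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it judges a component to be safe, it commits the local invariants behind that judgment as assertions in the source code. In parallel, Code-Augur uses a guided fuzzer to try to falsify those assumptions. When the fuzzer triggers an assertion, it reveals either a real vulnerability or a flawed specification that needs refinement. In both cases, the process grounds the agent's understanding, so that its view of what the code is meant to do matches how the code actually behaves. On real-world subjects, Code-Augur uses security specifications effectively to detect more vulnerabilities than other state-of-the-art agents. In addition, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared with curated specialized models such as Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs such as Sonnet and DeepSeek.

English abstract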

The advent of agentic vulnerability detection is already becoming a watershed moment for software security. Audits conducted entirely by autonomous LLM agents are uncovering critical vulnerabilities in fundamental software underpinning digital society. Many of these vulnerabilities remained masked for years, surfacing only now with AI agents. Yet the reasoning behind these discoveries remains alarmingly opaque and unvalidated. What assumptions did the agent make about a function's inputs when it deemed that function to be secure? Failures in reasoning and incorrect assumptions can lead to missed vulnerabilities and reduce trust in agentic analysis. We propose a security-specification-first paradigm that (1) exposes the agent's tacit assumptions explicitly as security specifications and (2) continuously refines those specifications via runtime falsification. We realize our approach in Code-Augur, a novel harness for agentic vulnerability detection. Given a codebase, Code-Augur analyzes each component of the system for vulnerable code. When it deems a component to be secure, it commits the local invariants behind that judgment as in-source assertions. In parallel, Code-Augur leverages a guided fuzzer to attempt to falsify those assumptions. When the fuzzer triggers an assertion, this either reveals a genuine vulnerability or a flawed specification to refine. In both cases, this process grounds the agent's understanding, aligning its view of code intent with how the code actually behaves. On real-world subjects, Code-Augur effectively leverages security specifications to detect more vulnerabilities than other state-of-the-art agents. Additionally, Code-Augur found 22 new vulnerabilities in key open-source projects. Compared to curated specialized models like Claude Mythos, Code-Augur offers effective agentic vulnerability detection built on widely available LLMs like Sonnet and DeepSeek.
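
The assert-then-falsify loop described in the abstract can be pictured with a small toy. This is a hedged sketch and not Code-Augur itself: `read_record`, `fuzz`, and the record format are hypothetical, and a plain seeded random fuzzer stands in for the guided fuzzer the paper uses.

```python
import random


def read_record(buf: bytes) -> bytes:
    """Hypothetical component: first byte declares the payload length."""
    declared = buf[0]
    payload = buf[1:]
    # Assumption relied on when judging this function safe, committed as an
    # in-source assertion so that it can be falsified at runtime.
    assert declared <= len(payload), "declared length exceeds payload"
    return payload[:declared]


def fuzz(target, trials=1000, seed=0):
    """Random (unguided) fuzzer: return the first input that trips an assertion."""
    rng = random.Random(seed)
    for _ in range(trials):
        size = rng.randint(1, 8)  # at least one byte, so buf[0] exists
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except AssertionError:
            return data  # counterexample to the committed invariant
    return None


counterexample = fuzz(read_record)
```

A returned counterexample means the committed invariant does not hold for all inputs. Someone (or an agent) then has to decide whether that points to a real vulnerability in the callers or to a specification that was too strong and needs refining, which mirrors the two outcomes the abstract describes.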