AI Agent

2606.15504 2026-06-18 cs.AI New submission Topic 85

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

Qianxue Zhang, Yiming Ren, Shihuan Qin, Xiao Zhang, Liao Zhang, Jinyang Huang, Zhengliang Liu, Chenbin Liu, Hongying Feng, Jingyuan Chen, Yuzhen Ding, Weihang You, Hanqi Jiang, Yi Pan, Yifan Zhou, Junhao Chen, Lifeng Chen, Wei Liu, Tianming Liu, Zengren Zhao, Lian Zhang

Affiliations: Medical AI Lab, The First Hospital of Hebei Medical University; Hebei Provincial Engineering Research Center for AI-Based Cancer Treatment Decision-Making, The First Hospital of Hebei Medical University; State Key Laboratory of Neurology and Oncology Drug Development; School of Computing, University of Georgia; Department of Radiation Oncology, National Cancer Center/National Clinical Research Center for Cancer/Cancer Hospital and Shenzhen Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College; Department of Radiation Oncology, Mayo Clinic; College of Mechanical and Power Engineering, China Three Gorges University; Department of Radiation Oncology, Guangzhou Concord Cancer Center; Gastrointestinal Disease Diagnosis and Treatment Center, The First Hospital of Hebei Medical University; Department of General Surgery, The First Hospital of Hebei Medical University

Topic match: Multi-agent. Proposes a multi-agent framework comprising three specialized agents.

AI summary: Proposes VIBEMed, a multi-agent framework that uses a self-evolution mechanism and an architecture-level safety sandbox to learn dynamically from interaction history and deliver personalized clinical decision support.

DOI: 10.1016/j.metrad.2026.100223

Abstract

In recent years, advances in large language models and autonomous agents have revolutionized healthcare, facilitating diagnosis and improving treatment outcomes. However, most existing AI systems rely on pre-trained knowledge and predefined pipelines, and they struggle to learn dynamically from interactive chat session histories that contain patient outcomes and past failures. To address this limitation, we propose VIBEMed, a multi-agent framework with a built-in self-evolution mechanism and an architecture-level safety sandbox for robust clinical decision support. The system integrates three specialized agents: a Clinical Diagnostic Agent (CDA) for hypothesis generation, a Therapeutic Execution Agent (TEA) for treatment planning, and a Clinical Evolution Manager Agent (CEMA) that distills longitudinal clinical feedback into reusable knowledge. Together they transform multimodal patient information into personalized medical decisions. Through its self-evolution mechanism, the framework enables iterative updates across memory, model behavior, and decision strategies, allowing the system to improve over time. Experimental results show that VIBEMed achieves superior performance through its evolving mechanism in complex clinical cases, particularly in tasks that require integrated decision-making and longitudinal planning. The framework also supports reliable end-to-end decisions in challenging scenarios such as oncology treatment planning, highlighting its feasibility in real-world clinical contexts. Overall, VIBEMed provides a practical path beyond static AI systems toward adaptive, experience-driven clinical decision support, demonstrating the value of combining multi-agent collaboration with continuous evolution for advancing precision medicine.


2606.07150 2026-06-18 cs.CR cs.AI cs.MA cs.NI New submission Topic 85

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

Bijaya Dangol

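Affiliations: Independent Researcher

Topic match: Multi-agent. Studies communication-graph metadata threats in agent-interoperability protocols.

AI summary: Addresses communication-graph metadata leakage between agents by proposing a workflow-integrity threat model, defining transport- and bootstrap-layer privacy properties, and showing in an A2A case study that metadata protection effectively suppresses task inference.

Comments 22 pages, 7 figures, 6 tables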

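AI Chinese abstract (translated)

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based HTTP(S) transport. Such transports protect message content and increasingly adopt end-to-end encryption. What they leave exposed in plaintext is the communication graph: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are typically capability-labeled, workflows are structured and chained, and interactions are coupled to real actions, so an observer recovers more than past relationships. It can infer pending workflows, the tasks being assembled, and the actions likely to follow. At machine speed, it can act on that inference before the workflow completes. The threat is therefore one of workflow integrity, not privacy alone: predictive leverage over autonomous action. We give a threat model for agent communication graphs; identify what makes agent metadata distinctively revealing (semantic, forward-looking, and actuating); define transport- and bootstrap-layer privacy properties and evaluate how well candidate transports (SimpleX/SMP, Tor, mixnets) satisfy them; and present an A2A case study in which a metadata-protecting binding is expressible but surfaces the protocol's identity assumptions. We test these on a generative model grounded in real A2A captures. From passive metadata alone, with no payloads, a classifier recovers the task class from a workflow's opening at well above chance; with the properties applied, recovery drops sharply back toward chance. Beyond what an observer can recover, we measure the leverage gained by exploiting the leak: from a workflow's opening and under a fixed budget, an adversary choosing which workflows to act on achieves, in this model, most of a clairvoyant attacker's advantage over a metadata-blind attacker, and the same properties suppress it.

English abstract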

Agent-interoperability protocols such as A2A and MCP standardize what agents say to one another but assume address-based transport. Whether over HTTP(S) or a content-protecting binding such as MLS-based SLIM, these transports protect message content yet leave the communication graph exposed: which agent contacts which, when, and how often. In agent systems this graph is more consequential than a privacy framing suggests. Endpoints are capability-labeled, workflows are structured and chained, and interactions are coupled to actions, so an observer recovers more than past relationships: it can recognize a recurring pending workflow from its opening and, at machine speed, act on it before it completes. The threat is one of workflow integrity, not privacy alone. We give a threat model for the communication graph and locate what makes its metadata distinctively consequential: not stronger fingerprinting but exposure across independent trust domains, coupled to autonomous action. We define transport- and bootstrap-layer privacy properties, give them an indistinguishability-game semantics, evaluate transports, and give an A2A case study where a metadata-protecting binding surfaces its implicit identity assumptions. On a corpus of real multi-agent A2A traffic from the official reference agents, on a live A2A binding, and with a generative model as a controlled instrument, a label-blind classifier recovers a task's class from passive metadata at 6x chance, and from only its opening; a defense-aware adversary does not overturn this, and only the full set of properties drives recovery toward chance. Acting on the leak is distinct from recoverability: under a fixed budget an adversary captures 0.63 of a clairvoyant attacker's advantage on the corpus (0.41 from a workflow's opening), governed by top-ranked precision rather than overall accuracy, so integrity and privacy come apart under defense.


2605.25929 2026-06-18 cs.MA cs.LG Version update Topic 85

Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?

Franka Bause, Jonas Niederle, Martin Pawelczyk, Rebekka Burkholz

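Affiliations: CISPA Helmholtz Center for Information Security; Faculty of Computer Science, University of Vienna

Topic match: Multi-agent. Studies deliberation mechanisms in multi-agent LLM systems.

AI summary: Analyzes multi-agent LLM deliberation through the Friedkin-Johnsen opinion-dynamics model, shows that input-dependent FJ parameters make the system a mixture of experts, and examines how influencers emerge from confidence, perceived confidence, and alignment of initial opinions.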

Comments Accepted at the 2nd Workshop on Compositional Learning at ICML 2026
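
The summary above names the Friedkin-Johnsen (FJ) model without stating its update rule. As a rough illustration, the sketch below implements the textbook FJ update with fixed parameters, in which each agent mixes its neighbors' current opinions with its own initial opinion. It is not the paper's input-dependent variant; the function names, the weight matrix, and the susceptibility values are illustrative assumptions.

```python
def fj_step(x, x0, weights, susceptibility):
    """One textbook Friedkin-Johnsen update.

    x_i <- s_i * sum_j W[i][j] * x_j + (1 - s_i) * x0_i
    where W is row-stochastic and s_i in [0, 1] is agent i's susceptibility.
    """
    n = len(x)
    return [
        susceptibility[i] * sum(weights[i][j] * x[j] for j in range(n))
        + (1.0 - susceptibility[i]) * x0[i]
        for i in range(n)
    ]


def fj_run(x0, weights, susceptibility, steps=200):
    """Iterate the update from the initial opinions x0."""
    x = list(x0)
    for _ in range(steps):
        x = fj_step(x, x0, weights, susceptibility)
    return x


if __name__ == "__main__":
    # Hypothetical example: agent 0 is fully stubborn (s = 0),
    # agents 1 and 2 are fully susceptible (s = 1).
    uniform = [[1 / 3] * 3 for _ in range(3)]
    print(fj_run([1.0, 0.0, 0.0], uniform, [0.0, 1.0, 1.0]))
```

In this toy setting the stubborn agent never moves and the susceptible agents drift to its opinion, which is one simple sense in which an agent can act as an "influencer".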

2605.18185 2026-06-18 cs.MA Version update Topic 85

The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection

Benedict Russell, Chin-wing Leung, Paolo Turrini

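Topic match: Multi-agent. Studies policy-gradient dynamics in multi-agent social dilemmas.

AI summary: Studies policy-gradient dynamics in a multi-agent environment with partner selection, shows how partner selection changes the opponent distribution and the reward landscape, and finds that population variance is a necessary condition for cooperation to emerge.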

Abstract

In social dilemmas, self-interested learning agents face the choice between the societal benefit of cooperation and the immediate reward of defection. Significant evidence exists on the benefits of assortment mechanisms such as partner selection for the emergence of cooperation, but this is largely available through agent-based simulations. In this paper, we provide an analytical solution to the problem, studying the policy-gradient dynamics in a multi-agent environment with partner selection. We show how partner selection changes the opponent distribution and hence the reward landscape, and prove this promotes cooperation under simple rules known from the literature. In particular, we find that population variance is a necessary condition for cooperation to emerge. Using a two-dimensional Wiener process, we extend the dynamics to capture the stochastic effects of partner selection and the resulting opponent distribution. We derive a sufficient condition for the population to be cooperation-promoting and prove the existence of a stationary distribution. Simulations confirm that the stochastic model accurately captures the policy-gradient dynamics and clarify how the learning rate affects the emergence of cooperation.


2508.21720 2026-06-18 cs.AI Version update Topic 85

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

Jiho Choi, Seojeong Park, Seongjong Song, Hyunjung Shim

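Affiliations: Graduate School of Artificial Intelligence, KAIST; School of Integrated Technology, Yonsei University

Topic match: Multi-agent. Hierarchical multi-agent collaboration for scientific poster generation.

AI summary: Proposes PosterForest, a training-free framework for scientific poster generation that represents document structure hierarchically as a Poster Tree and uses content and layout agents for hierarchical reasoning and recursive refinement, jointly optimizing content and layout to improve semantic coherence, logical flow, and visual balance.

Comments ACL 2026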

Abstract

Automating scientific poster generation requires hierarchical document understanding and coherent content-layout planning. Existing methods often rely on flat summarization or optimize content and layout separately. As a result, they often suffer from information loss, weak logical flow, and poor visual balance. We present PosterForest, a training-free framework for scientific poster generation. Our method introduces the Poster Tree, a structured intermediate representation that captures document hierarchy and visual-textual semantics across multiple levels. Building on this representation, content and layout agents perform hierarchical reasoning and recursive refinement, progressively optimizing the poster from global organization to local composition. This joint optimization improves semantic coherence, logical flow, and visual harmony. Experiments show that PosterForest outperforms prior methods in both automatic and human evaluations, without additional training or domain-specific supervision.


2606.19135 2026-06-18 cs.MA cs.AI cs.NI New submission Topic 80

A Technical Taxonomy of LLM Agent Communication Protocols

Linus Sander, Habtom Kahsay Gidey, Alexander Lenz, Alois Knoll

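Affiliations: Technische Universität München

Topic match: Multi-agent. Classifies LLM agent communication protocols; the core subject is agent communication.

AI summary: Addresses fragmentation among LLM agent communication protocols by proposing a five-dimension technical taxonomy, analyzing nine open-source protocols, identifying architectural patterns, and projecting how protocols will evolve.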

Abstract

As large language models (LLMs) advance and multi-agent systems aim to overcome the limits of standalone agents, robust communication protocols are becoming essential infrastructure for distributed agent networks. Nonetheless, the fragmented protocol landscape presents a significant interoperability challenge. This study develops a technical taxonomy to classify and analyze LLM agent communication protocols. Following an established iterative method, we defined the taxonomy's purpose, meta-characteristic, and ending conditions, then performed five iterations, three empirical-to-conceptual and two conceptual-to-empirical, on nine actively maintained open-source protocols with demonstrable adoption. The taxonomy comprises five dimensions: counterparty, payload, interaction state, discovery mechanism, and schema flexibility. Classification reveals recurring architectural patterns: all sampled agent-to-agent protocols combine hybrid payloads with session-state persistence; most protocols support multiple predefined schemas, and two negotiate schemas at runtime, indicating a trend toward schema flexibility; decentralized discovery remains rare. Analysis suggests short-term convergence pressure toward protocols unifying agent-to-agent and agent-to-context (tool and data) communication. Long-term, however, no single protocol is likely to maximize versatility, efficiency, and portability simultaneously. The field will more likely evolve toward a federated, layered protocol stack. The framework guides protocol selection and highlights open research gaps such as privacy and policy enforcement.


2606.19080 2026-06-18 eess.SY cs.SY New submission Topic 80

Byzantine-Resilient Federated Multi-Agent Optimization Framework for Cyber-Secure Interconnected Microgrids

Ali Peivand, Seyyed Mostafa Nosratabadi

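Topic match: Multi-agent. Federated multi-agent optimization with Byzantine resilience.

AI summary: Proposes the BR-FedMAPPO framework, which combines a triple-surface moving target defense with an adaptive isolation strategy and uses a two-stage Byzantine-resilient aggregation rule to withstand stealthy false data injection attacks, protecting distributed learning channels while maintaining economic dispatch performance.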

Abstract

The escalating digitalization of distribution networks has exposed interconnected Microgrid (MG) clusters to Stealthy False Data Injection Attacks that bypass Bad Data Detectors and propagate through tie-line couplings and shared learning channels. This paper proposes BR-FedMAPPO, a Byzantine-Resilient Federated Multi-Agent Proximal Policy Optimization framework that learns a triple-surface Moving Target Defense and an adaptive isolation strategy for cyber-secure operation. Each MG hosts a local Actor-Critic Agent whose policy is partitioned into a globally federated shared encoder and a privately retained action head, so no MG exposes the configurations, cardinality, or locations of its D-FACTS lines, Battery Energy Storage (BES) units, or tie-line capacities. The action vector perturbs D-FACTS reactances, redirects BES injections, reshapes inter-MG exchanges, and includes a continuous islanding signal. A two-stage Byzantine-resilient aggregation rule combines trimmed-mean filtering with reward-weighted updates. This scheme incorporates a detection-quality score based on the F1-score and False Positive Rate to penalize clients causing false alarms. Simulation results on four interconnected MGs based on the IEEE 30- and 118-bus test systems demonstrate effective mitigation of coordinated S-FDI attacks, containment of cascading disruptions through adaptive isolation, and protection of distributed learning channels against malicious model manipulations while maintaining cost-aware dispatch performance.
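
The abstract says the aggregation rule combines trimmed-mean filtering with reward-weighted updates, but it does not say how the two stages compose. The sketch below shows one plausible reading and should not be taken as the authors' rule: for each coordinate, the most extreme client values are trimmed, and the survivors are averaged with weights proportional to each client's reward. The function name, the `trim` parameter, and the sample numbers are illustrative assumptions.

```python
def trimmed_weighted_aggregate(updates, rewards, trim=1):
    """Aggregate client updates coordinate by coordinate.

    Stage 1 (assumed): drop the `trim` smallest and `trim` largest values.
    Stage 2 (assumed): reward-weighted mean of the surviving values.
    """
    n = len(updates)
    if n <= 2 * trim:
        raise ValueError("need more than 2 * trim clients")
    if len(rewards) != n or any(r < 0 for r in rewards):
        raise ValueError("rewards must be non-negative, one per client")

    aggregated = []
    for d in range(len(updates[0])):
        # Pair each client's value with its reward, then sort by value.
        pairs = sorted((u[d], r) for u, r in zip(updates, rewards))
        kept = pairs[trim:n - trim]
        total = sum(r for _, r in kept)
        if total == 0:
            # Fall back to a plain mean when all surviving rewards are zero.
            aggregated.append(sum(v for v, _ in kept) / len(kept))
        else:
            aggregated.append(sum(v * r for v, r in kept) / total)
    return aggregated


if __name__ == "__main__":
    # Hypothetical round: the last client sends an extreme update and
    # claims a high reward; trimming removes its values before weighting.
    client_updates = [[1.0, 1.0], [1.2, 0.8], [0.8, 1.2], [100.0, -100.0]]
    client_rewards = [1.0, 1.0, 1.0, 5.0]
    print(trimmed_weighted_aggregate(client_updates, client_rewards))
```

Trimming before weighting means a client cannot gain influence through an inflated reward alone if its values are outliers, which is the property a Byzantine-resilient rule needs from the first stage.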


2606.18829 2026-06-18 cs.LG cs.CL New submission Topic 80

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Zhe Ren, Yibo Yang, Yimeng Chen, Zijun Zhao, Benshuo Fu, Zhihao Shu, Bingjie Zhang, Yangyang Xu, Dandan Guo, Shuicheng Yan

Affiliations: School of Artificial Intelligence, Jilin University; Shanghai Jiao Tong University; King Abdullah University of Science and Technology (KAUST); Tsinghua University; National University of Singapore

Topic match: Multi-agent. A memory-governance benchmark for multi-principal shared-memory agents.

AI summary: Proposes the GateMem benchmark, which evaluates multi-principal shared-memory agents on utility, access control, and forgetting, and finds that no existing method satisfies all three at once.

Comments 24 pages, 8 figures. Code and dataset are available at https://github.com/rzhub/GateMem and https://huggingface.co/datasets/Ray368/GateMem

Abstract

Memory benchmarks for LLM agents largely assume single-user settings, leaving shared assistants for hospitals, workplaces, campuses, and households understudied. In these deployments, multiple principals write to a common memory pool and query it under different roles, scopes, and relationships, so memory quality requires governance as well as recall. We introduce GateMem, a benchmark for multi-principal shared-memory agents. GateMem jointly evaluates utility for legitimate long-horizon requests with state updates, access control across contextual authorization boundaries, and agent-facing active forgetting after explicit deletion requests. It spans medical, office, education, and household domains, with long-form multi-party episodes, incremental memory injection, hidden checkpoints, structured judging, and leak-target annotations. Across diverse baselines and backbone models, no method simultaneously achieves strong utility, robust access control, and reliable forgetting. Long-context prompting often yields the best governance score at high token cost, while retrieval-based and external-memory methods reduce cost yet still leak unauthorized or deleted information. These results show current memory agents remain far from reliable shared institutional deployment.
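
To make the three axes in the abstract concrete (utility, access control, forgetting), the sketch below models a shared memory pool as a toy in-process store. It is not GateMem's data model or API; the class and method names, the per-item reader list, and the keyword matching are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    owner: str            # principal who wrote the item
    text: str             # stored content
    readers: frozenset    # principals authorized to read it


class SharedMemory:
    """Toy shared pool: many principals write, reads are authorization-gated."""

    def __init__(self):
        self._items = []

    def write(self, owner, text, readers=()):
        # The owner can always read their own items.
        self._items.append(MemoryItem(owner, text, frozenset(readers) | {owner}))

    def query(self, principal, keyword):
        # Access control: return only items this principal may read.
        key = keyword.lower()
        return [
            item.text
            for item in self._items
            if principal in item.readers and key in item.text.lower()
        ]

    def forget(self, owner, keyword):
        # Active forgetting: hard-delete the owner's matching items.
        key = keyword.lower()
        before = len(self._items)
        self._items = [
            item
            for item in self._items
            if not (item.owner == owner and key in item.text.lower())
        ]
        return before - len(self._items)


if __name__ == "__main__":
    memory = SharedMemory()
    memory.write("alice", "Alice allergy: penicillin", readers={"dr_lee"})
    print(memory.query("dr_lee", "allergy"))  # authorized reader
    print(memory.query("bob", "allergy"))     # unauthorized principal
    memory.forget("alice", "allergy")
    print(memory.query("dr_lee", "allergy"))  # after the deletion request
```

A store like this enforces the rules by construction. The benchmark's finding concerns agents that must uphold the same rules through prompting, retrieval, or external memory, where leaks can still occur.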


2606.18276 2026-06-18 cs.MA cs.SI physics.soc-ph New submission Topic 80

Characterizing Opinion Evolution of Networked LLMs

Caleb Probine, Yigit Ege Bayiz, Filippos Fotiadis, Samuel Li, Yunhao Yang, Ufuk Topcu

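Topic match: Multi-agent. Studies opinion-evolution dynamics in networked multi-agent LLM systems.

AI summary: Investigates whether classical opinion-dynamics models can describe how opinions spread among large language models (LLMs) in multi-agent systems, and finds that adding a bias term markedly improves modeling accuracy, reducing mean opinion error by up to 88%.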

Comments 19 pages, 2 figures

2605.01818 2026-06-18 nlin.AO physics.soc-ph Version update Topic 80

Emergent Macro-Criticality from Micro-Critical Agents

Nicolas Bessone, Erwan Plantec

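Topic match: Multi-agent. A multi-agent system in which macro-criticality emerges from micro-critical agents.

AI summary: Uses a multi-agent system to study how micro-level criticality shapes collective behavior, and finds that macro-level criticality depends on the connectivity of the interaction network rather than on the critical dynamics of individual agents.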

Abstract

Criticality has been proposed as a key principle underlying complex behavior in biological and artificial systems; however, how criticality translates from individual dynamics to collective behavior remains unclear. We study this question using a multi-agent system with spatially constrained interactions in which agents sense neighboring light signals through exteroceptors and act by switching their own light on or off, thereby forming a dynamical interaction network at the macroscopic level. The agents' internal states are themselves governed by a reservoir dynamical system at the microscopic level. By varying the microscopic parameters around dynamical criticality, together with the macroscopic interaction topology, we systematically investigate the relation between the two levels. We find that near-critical dynamics within individual agents is not sufficient to produce collective critical-like avalanche statistics. Instead, scale-free behavior depends on the effective connectivity of the macroscopic interaction network, which controls activity propagation. As a result, macroscopic critical-like dynamics are enabled by microscopic regimes that deviate from criticality, with the required deviation depending on the properties of the interaction network. Investigating this relation, we find that slightly subcritical micro-level regimes support near-critical dynamics across a wider range of macroscopic parameters. These results show that in this multi-agent system, collective near-critical behavior depends on the interplay between internal dynamics and the interaction structure that governs activity propagation.


2606.19152 2026-06-18 cond-mat.mtrl-sci cs.AI New submission Topic 80

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

Zongmin Zhang, Yuyang Lou, Bowen Zhang, Junwu Chen, Ryo Kuroki, Xuan Vu Nguyen, Edvin Fako, Lixue Cheng, Philippe Schwaller

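Affiliations: Department of Computer Science & Engineering, Hong Kong University of Science & Technology; Department of Chemistry, Hong Kong University of Science & Technology; Laboratory of Artificial Chemical Intelligence (LIAC), EPFL, Lausanne, Switzerland; Platform Laboratory for Science & Technology, Asahi Kasei Corporation, Tokyo, Japan; IAS Center for AI for Scientific Discoveries, Hong Kong University of Science & Technology

Topic match: Multi-agent. Proposes a closed-loop multi-agent framework whose search corrects its own errors.

AI summary: Proposes AdsMind, a closed-loop multi-agent framework that uses machine-learning force field relaxation feedback to self-correct the search for adsorption configurations, reaching success rates of 100% and 98.8% on benchmarks with only a few relaxation steps and clearly outperforming heuristic enumeration and single-pass methods.

Comments 37 pages, 5 figures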

Abstract

Identifying the lowest-energy surface-adsorbate configuration is critical for modeling heterogeneous catalysis, yet exhaustive exploration with ab initio calculations is computationally prohibitive. Machine-learning force fields (MLFFs) accelerate structural relaxation but leave the search over the vast configurational space a major bottleneck, and open-loop large language model (LLM) agents lack a physics-grounded feedback mechanism to correct erroneous initial guesses. We propose AdsMind (Adsorption configuration discovery with Machine intelligence and relaxation feedback), a closed-loop multi-agent framework that enables autonomous error correction through MLFF relaxation feedback. Across four LLM backends, AdsMind achieves consistently high search reliability, with success rates of 100% and 98.8% on the benchmarks AA20 and OCD-GMAE62. Relative to its single-pass (1-Shot) ablation it reduces cross-backend energy dispersion, and it uses only 4.11 and 4.67 MLFF relaxations per case, respectively -- an approximately 14-fold reduction over heuristic enumeration baselines. Density functional theory (DFT) validation using VASP/PBE on six representative AA20 systems shows that the reported open-loop Adsorb-Agent outputs exhibit qualitative adsorption-energy sign errors for molecular adsorbates, whereas AdsMind preserves the correct sign in all tested cases with closer quantitative agreement. AdsMind thus delivers reliability, self-reflection, and interpretability simultaneously, supporting more DFT-informed autonomous chemistry workflows.


2606.13681 2026-06-18 cs.CL New submission Topic 85

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

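Affiliations: National University of Singapore; Singapore Management University; University of Washington; University College London; University of Pennsylvania; Nanyang Technological University; Recursive; Massachusetts Institute of Technology

Topic match: Software agents. A memory-evolution benchmark for LLM agents in dynamic environments.

AI summary: Proposes EvoArena, a benchmark suite that simulates progressive environment changes across terminal, software, and social domains, and EvoMem, a patch-based memory paradigm that records structured update histories so agents can reason about environmental evolution through changes in memory. Experiments show that current agents perform poorly in dynamic environments and that EvoMem consistently improves performance.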

Abstract

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.
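
The abstract describes EvoMem as recording memory evolution as structured update histories but gives no data layout. The sketch below is a minimal guess at what a patch-based memory could look like: every change is appended as a patch, so both the current state and the path that led to it stay queryable. The class names, fields, and the example key are illustrative assumptions, not the paper's design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Patch:
    step: int     # position of this patch in the history
    key: str      # which fact changed
    old: object   # previous value (None if the key was new)
    new: object   # value after the change


class PatchMemory:
    """Key-value memory that keeps an append-only log of its own changes."""

    def __init__(self):
        self.state = {}
        self.history = []

    def update(self, key, value):
        # Skip no-op writes so the history contains only real changes.
        if key in self.state and self.state[key] == value:
            return
        self.history.append(Patch(len(self.history), key, self.state.get(key), value))
        self.state[key] = value

    def changes(self, key):
        # How did this fact evolve? Returns (old, new) pairs in order.
        return [(p.old, p.new) for p in self.history if p.key == key]

    def state_at(self, step):
        # Rebuild the memory as it was after the first `step` patches.
        snapshot = {}
        for patch in self.history[:step]:
            snapshot[patch.key] = patch.new
        return snapshot


if __name__ == "__main__":
    memory = PatchMemory()
    memory.update("cli.output_flag", "--out")     # hypothetical tool, version 1
    memory.update("cli.output_flag", "--output")  # flag renamed in version 2
    print(memory.changes("cli.output_flag"))
    print(memory.state_at(1))
```

Keeping the old value in each patch lets an agent answer "what changed and when" without diffing snapshots, which is the kind of reasoning over environment evolution the abstract attributes to EvoMem.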


2606.18294 2026-06-18 physics.ins-det nucl-ex physics.app-ph New submission Topic 80

Vision AI Agent for Continuous Material Monitoring of LEGEND-1000 LoFi Reentrant Tube

Sonata Simonaitis-Boyd, Soonhong Lee, Lauren N. O'Brien, Brandon T. Turner, Ralph Massarczyk, Steven R. Elliott, Aobo Li, Alexander F. Leder

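Topic match: Software agents. A LangChain agent pipeline for automated material monitoring.

AI summary: Proposes a vision AI agent pipeline built on LangChain and Claude Haiku 4.5 that uses SAM2 segmentation and hybrid OCR validation to automatically extract the diameter and strain of OFHC copper cylinders from hydrostatic test videos, compute yield strength, and compare the results with simulations.

Comments 27 pages, 8 figures, 5 tables, submitted to PRX Intelligence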

Abstract

We report on a vision AI agent pipeline for non-contact material strain and property extraction from video data, demonstrated on video taken during hydrostatic testing of four OFHC copper cylinders conducted as part of the LEGEND-1000 hardware validation campaign. Traditional strain gauge measurements proved unreliable, motivating a fully-automated agentic alternative. The agent was built on the LangChain framework with Claude Haiku 4.5 as its central reasoning engine, integrating a specialized suite of computer vision tools: FFmpeg for video preprocessing and rotation correction via Hough Line Transform, the Segment Anything Model 2 (SAM2) for spatiotemporal segmentation with automated memory-informed dynamic chunking, and a hybrid EasyOCR and LLM-based timestamp validation pipeline. Three specialized sub-agents were developed to process the video data and obtain cylinder diameters and timestamps while autonomously handling obstacles such as corrupted frames and memory limits. From the diameter profiles synchronized to pressure data, hoop stress-strain curves were reconstructed and yield strengths were calculated using the 0.2% offset, 0.5% EUL, and Johnson-Cook methods across two independent tests. Cross-validation against a non-agentic pipeline confirmed agreement for the diameter extraction at the ±5 pixel level. The material properties and testing results were further compared to Ansys mechanical simulations performed as part of the LEGEND-1000 reentrant tube design campaign. This work showcases the power of agentic pipelines to extract materials data from video alone.
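
Of the three yield-strength methods named in the abstract, the 0.2% offset method is the simplest to illustrate. The sketch below applies the textbook construction to a stress-strain curve: fit the elastic slope, shift that line by 0.002 strain, and report the stress where the measured curve crosses it. It is not the paper's pipeline; the function name, the `n_elastic` fitting window, and the synthetic curve are illustrative assumptions.

```python
def offset_yield_strength(strain, stress, offset=0.002, n_elastic=5):
    """Estimate yield strength with the offset method.

    The elastic modulus is fitted through the origin over the first
    `n_elastic` points; the result is the stress at which the curve first
    meets the line  modulus * (strain - offset).  Returns None when the
    curve never crosses that line.
    """
    num = sum(e * s for e, s in zip(strain[:n_elastic], stress[:n_elastic]))
    den = sum(e * e for e in strain[:n_elastic])
    modulus = num / den

    # gap > 0 while the measured curve lies above the offset line.
    gaps = [s - modulus * (e - offset) for e, s in zip(strain, stress)]
    for i in range(1, len(gaps)):
        if gaps[i - 1] > 0 >= gaps[i]:
            # Linear interpolation between the two bracketing samples.
            t = gaps[i - 1] / (gaps[i - 1] - gaps[i])
            return stress[i - 1] + t * (stress[i] - stress[i - 1])
    return None


if __name__ == "__main__":
    # Synthetic elastic, perfectly plastic curve (made-up numbers, MPa):
    # slope 100000 MPa up to a 50 MPa plateau.
    modulus_true = 100000.0
    strain = [i * 0.0001 for i in range(41)]
    stress = [min(modulus_true * e, 50.0) for e in strain]
    print(offset_yield_strength(strain, stress))
```

On real data the fitting window matters: it has to cover only the linear elastic part of the curve, otherwise the fitted slope and the resulting yield estimate are biased.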


2606.12837 2026-06-18 cs.CL New submission Topic 85

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

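Affiliations: Meituan

Topic match: Other agents. A benchmark for long-horizon search agents.

AI summary: Proposes the LoHoSearch benchmark, which automatically constructs 544 complex questions from a knowledge graph of 7 million Wikipedia entities; in evaluation the strongest model reaches only 34.74% accuracy, indicating difficulty well beyond the ceiling of human-authored benchmarks.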

Abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.


2606.07591 2026-06-18 cs.LG cs.AI cs.CL Version update Topic 85

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Affiliations * Shanghai Artificial Intelligence Laboratory

Topic match (Other agents): a benchmark for evaluating autonomous scientific research agents

AI summary: Proposes the ResearchClawBench benchmark, covering 40 tasks across 10 domains, which evaluates autonomous research capability with multimodal rubrics. The strongest agent scores only 21.5, exposing the weaknesses of current systems in experimental protocol, evidence matching, and the scientific core.

English abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2606.19116 2026-06-18 cs.AI cs.CY New submission, topic 80

Towards an Agent-First Web: Redesigning the Web for AI Agents

Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

Affiliations * Old Dominion University; AI Motion Labs; Florida International University; Accenture Technology Labs; Nanyang Technological University; University of Colombo; Center for Wireless Communications, University of Oulu; McDonald Army Health Center

Topic match (Other agents): redesigning the web for AI agents, centered on agent access

AI summary: Proposes redesign principles across three layers: an access layer (agents inherit the access rights of the humans they act for), an economic layer (an intent-based, token-metered subscription model), and a content layer (the ATML markup language and a cryptographic provenance chain), addressing the web's access, economic, and content problems when AI agents act as intermediaries.

English abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

2606.19063 2026-06-18 cs.CR New submission, topic 80

PYPILINE: Malicious PyPI Package Detection via Suspicious API Knowledge and Agent Workflow

Siyuan Pang, Zhengwei Jiang, Yepeng Yao, Zijing Fan, Haozhe Li, Baoxu Liu

Topic match (Other agents): an agent workflow for detecting malicious PyPI packages.

AI summary: Proposes PYPILINE, which combines a suspicious API knowledge base with an agent workflow. Static analysis of known malicious packages builds the knowledge base, which is then used to detect malicious PyPI packages automatically. It significantly outperforms existing tools in precision, recall, and F1 score.

English abstract

The detection of malicious PyPI packages is crucial for maintaining the security of the open source software supply chain. Existing methods, which primarily rely on rules or traditional machine learning, suffer from poor interpretability and difficulty in adapting to novel attacks. To address this, we propose PYPILINE, a novel detection method that combines a suspicious API knowledge base with an Agent workflow. PYPILINE first conducts static analysis on known malicious packages, extracting abstract syntax trees and generating API call graphs, from which it automatically extracts and constructs a structured suspicious API knowledge base. During the detection phase, this knowledge base is used to enhance reasoning capabilities. Through an Agent workflow, PYPILINE performs in-depth semantic analysis of unknown packages and outputs a structured, interpretable maliciousness assessment report. The experimental results show that PYPILINE significantly outperforms existing state-of-the-art tools, achieving a precision of 96.7%, a recall of 99.6%, and an F1-score of 98.1%, with its precision surpassing baseline tools by 5.7 to 24.2 percentage points. Additionally, we conducted an empirical study on malicious packages, systematically revealing prevalent attack strategies, as well as the most commonly abused APIs. Equipped with tool-calling AI agent workflows for automated vector database retrieval of suspicious API knowledge and mail server delivery of analysis reports, PYPILINE delivers a practical, efficient, and convenient malicious package detection solution to strengthen open-source ecosystem security.
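
The static-analysis step described in this abstract can be pictured with a minimal sketch. It parses source code with Python's standard `ast` module and reports calls that appear on a watch list. The watch list `SUSPICIOUS_APIS` and the helper names are assumptions made for illustration; the paper's knowledge base is mined from known malicious packages and is not reproduced here. The `f1_score` helper only checks that the reported precision and recall are consistent with the reported F1-score.

```python
import ast
from typing import List, Optional, Tuple

# Hypothetical watch list for illustration only; not the paper's knowledge base.
SUSPICIOUS_APIS = {"os.system", "subprocess.Popen", "exec", "eval", "base64.b64decode"}


def dotted_name(node: ast.AST) -> Optional[str]:
    """Return 'a.b.c' for a Name/Attribute chain, or None for anything else."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def suspicious_calls(source: str) -> List[Tuple[int, str]]:
    """List (line number, API name) for every call found on the watch list."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_APIS:
                hits.append((node.lineno, name))
    return sorted(hits)


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


SAMPLE = '''import os, base64
payload = base64.b64decode("ZWNobyBoaQ==")
os.system(payload.decode())
'''

hits = suspicious_calls(SAMPLE)
reported_f1 = f1_score(0.967, 0.996)
```

Matching on dotted names is deliberately naive: it misses aliased imports and dynamic dispatch, which is one reason a call graph and a semantic analysis stage, as the abstract describes, would be needed in practice.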

2606.17454 2026-06-18 cs.AI cs.LG New submission, topic 80

Dissecting model behavior through agent trajectories

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

Affiliations * AWS AI Labs

Topic match (Other agents): analyzing AI agent trajectories to improve model behavior

AI summary: Introduces the "intent-execution gap" and builds the Simple Strands Agent (SSA) harness; an analysis of 138k trajectories reveals how models differ in behavior during autonomous problem solving.

Comments 106 pages, 50 Figures, 16 Tables

English abstract

AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the 'intent-execution' gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called 'Simple Strands Agent' (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.

2605.30880 2026-06-18 cs.CL cs.AI Version update, topic 85

PatchWorld: Gradient-Free Optimization of Executable World Models

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

Affiliations * Hong Kong Baptist University; Independent Researcher; HKUST; Beijing Institute of Technology; Southern University of Science and Technology; Wayne State University; University of Edinburgh

Topic match (Planning and decision-making): executable world models for agent planning and prediction

AI summary: Proposes the PatchWorld framework, which turns offline trajectories into executable Python world models through counterexample-guided code repair, yielding symbolic belief-state programs without gradient-based optimization and reaching 76.4% macro success across AgentGym environments.

Comments 40 pages

English abstract

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.
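
To make "counterexample-guided code repair" concrete, here is a toy sketch under stated assumptions. A candidate world model is an ordinary Python function over a symbolic state, logged trajectories are replayed through it, and every step where the predicted observation disagrees with the logged one is collected as a counterexample. The door environment, the function names, and the hand-written patch are all hypothetical; the abstract does not describe how PatchWorld itself generates patches.

```python
from typing import Callable, Dict, List, Tuple

State = Dict[str, bool]
Step = Tuple[str, str]  # (action, logged observation)
WorldModel = Callable[[State, str], Tuple[State, str]]


def candidate_model(state: State, action: str) -> Tuple[State, str]:
    """Toy world model with a deliberate bug: it ignores whether the door is open."""
    state = dict(state)
    if action == "open door":
        state["door_open"] = True
        return state, "The door is open."
    if action == "go through door":
        return state, "You walk through the door."
    return state, "Nothing happens."


def patched_model(state: State, action: str) -> Tuple[State, str]:
    """Local patch for the counterexample; every other action is delegated unchanged."""
    if action == "go through door" and not state.get("door_open", False):
        return dict(state), "The door is closed."
    return candidate_model(state, action)


def find_counterexamples(
    model: WorldModel, trajectories: List[List[Step]]
) -> List[Tuple[int, int, str, str, str]]:
    """Replay trajectories and collect (trajectory, step, action, predicted, logged) mismatches."""
    failures = []
    for t_idx, trajectory in enumerate(trajectories):
        state: State = {"door_open": False}
        for s_idx, (action, logged) in enumerate(trajectory):
            state, predicted = model(state, action)
            if predicted != logged:
                failures.append((t_idx, s_idx, action, predicted, logged))
    return failures


TRAJECTORIES: List[List[Step]] = [[
    ("go through door", "The door is closed."),
    ("open door", "The door is open."),
    ("go through door", "You walk through the door."),
]]

before = find_counterexamples(candidate_model, TRAJECTORIES)
after = find_counterexamples(patched_model, TRAJECTORIES)
```

Because the model is plain code, the failing step can be inspected and fixed locally without touching the rest of the program, which is the property the abstract emphasizes.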

2603.00656 2026-06-18 cs.AI Version update, topic 85

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

Fanqi Kong, Jiayi Zhang, Mingyi Deng, Chenglin Wu, Yuyu Luo, Bang Liu

Affiliations * Peking University; The Hong Kong University of Science

Topic match (Planning and decision-making): information-driven policy optimization for user-centric agents

AI summary: To address credit assignment and insufficient advantage signals in multi-turn interaction, proposes InfoPO, which combines an information-gain reward with task outcomes through adaptive variance-gated fusion and outperforms existing baselines on tasks such as intent clarification and collaborative coding.

English abstract

Real-world user requests to LLM agents are often underspecified. Agents must interact to acquire missing information and make correct downstream decisions. However, current multi-turn GRPO-based methods often rely on trajectory-level reward computation, which leads to credit assignment problems and insufficient advantage signals within rollout groups. A feasible approach is to identify valuable interaction turns at a fine granularity to drive more targeted learning. To address this, we introduce InfoPO (Information-Driven Policy Optimization), which frames multi-turn interaction as a process of active uncertainty reduction and computes an information-gain reward that credits turns whose feedback measurably changes the agent's subsequent action distribution compared to a masked-feedback counterfactual. It then combines this signal with task outcomes via an adaptive variance-gated fusion to identify information importance while maintaining task-oriented goal direction. Across diverse tasks, including intent clarification, collaborative coding, and tool-augmented decision making, InfoPO consistently outperforms prompting and multi-turn RL baselines. It also demonstrates robustness under user simulator shifts and generalizes effectively to environment-interactive tasks. Overall, InfoPO provides a principled and scalable mechanism for optimizing complex agent-user collaboration. Code is available at https://github.com/kfq20/InfoPO.
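
One way to read the information-gain reward is as a divergence between two action distributions: the one the agent produces after seeing the user's feedback and the one it produces when that feedback is masked. The sketch below uses KL divergence for this. That choice is an assumption, since the abstract only says the feedback must "measurably change" the distribution, and the variance-gated fusion with task outcomes is not sketched here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) for two discrete distributions over the same actions."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def info_gain_reward(with_feedback: Sequence[float], masked_feedback: Sequence[float]) -> float:
    """Credit a turn by how far its feedback moves the next-action distribution."""
    return kl_divergence(with_feedback, masked_feedback)


# A clarifying answer that sharpens the next action earns a positive reward ...
informative_turn = info_gain_reward([0.9, 0.1], [0.5, 0.5])
# ... while feedback that leaves the distribution unchanged earns none.
uninformative_turn = info_gain_reward([0.5, 0.5], [0.5, 0.5])
```

Under this reading, a turn that asks a question whose answer does not alter what the agent does next receives no credit, however fluent the exchange.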

2603.00026 2026-06-18 cs.CL cs.AI cs.IR Version update, topic 85

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

Affiliations * State Key Laboratory for Novel Software Technology, Nanjing University, China; Alibaba Group, Hangzhou, China; National Institute of Healthcare Data Science, Nanjing University, China

Topic match (Planning and decision-making): combining memory retrieval with reasoning; active causal reasoning

AI summary: Proposes the ActMem framework, which converts unstructured dialogue history into a structured causal and semantic graph and uses counterfactual reasoning and commonsense completion to perform active causal reasoning, significantly improving LLM agents on complex memory-dependent tasks.

English abstract

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive "recorders" and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2510.05107 2026-06-18 cs.AI Version update, topic 85

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

Myung Ho Kim

Affiliations * JEI University

Topic match (Planning and decision-making): a structured cognitive loop for accountable LLM agent behavior

AI summary: Proposes the Structured Cognitive Loop (SCL) architecture, which separates cognition, memory, control, and action into distinct modules to make LLM agent behavior accountable. Across 360 episodes it reaches 86.3% task success, outperforming the baselines.

Comments This revised version extends the original SCL framework from a behavioral architecture for reliable LLM agents into a broader architecture of epistemic accountability, integrating context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment structure

English abstract

The central challenge for AI agents is not only performance but accountability. Agents that act through opaque prompt sequences may produce correct outputs, but they provide little basis for verifying why an action was permitted, where an error occurred, or how responsibility should be assigned. This paper presents the Structured Cognitive Loop as an architecture for accountable behavior in large language model agents. SCL separates cognition, memory, control, and action into distinct modules. The language model proposes. External memory preserves verified state. A lightweight controller checks preconditions, prevents redundant actions, and authorizes execution before tools are used. We evaluate SCL against ReAct and common LangChain agent variants across travel planning, conditional email drafting, and constraint-guided image generation. Across 360 episodes, SCL achieves 86.3 percent task success compared with 70.5 to 76.8 percent for prompt-based baselines. It also improves goal fidelity, reduces redundant tool calls, increases reuse of intermediate state, and lowers unsupported assertions. This extended revision situates SCL within a broader architecture of epistemic accountability. Subsequent extensions integrate context-aware Human-in-the-Loop control, Pool-Gated Retrieval, and the Horizon-Warrant-Commitment framework. Together these components define an agent architecture in which the model proposes, structure decides, evidence is warranted before use, and human judgment is embedded in the trace rather than imposed after the fact. The result is a foundation for AI agents whose decisions are not only effective but also authorized, inspectable, and accountable.
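
The controller role described in this abstract (check preconditions, prevent redundant actions, authorize execution before a tool runs) can be illustrated with a small sketch. The `Controller` class, its field names, and the travel-planning tools are hypothetical and are not taken from the paper; the sketch only shows how proposals, verified memory, and an inspectable trace can be kept separate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Sequence, Set, Tuple


@dataclass
class Controller:
    memory: Dict[str, str] = field(default_factory=dict)      # verified state only
    done: Set[Tuple[str, str]] = field(default_factory=set)   # executed (tool, argument) pairs
    trace: List[str] = field(default_factory=list)            # inspectable decision log

    def authorize(self, tool: str, arg: str, requires: Sequence[str]) -> bool:
        """Decide whether a proposed tool call may run, and record why."""
        missing = [key for key in requires if key not in self.memory]
        if missing:
            self.trace.append(f"DENY {tool}({arg}): missing {missing}")
            return False
        if (tool, arg) in self.done:
            self.trace.append(f"SKIP {tool}({arg}): redundant")
            return False
        self.trace.append(f"ALLOW {tool}({arg})")
        return True

    def execute(self, tool: str, arg: str, requires: Sequence[str],
                fn: Callable[[str], str], result_key: str) -> bool:
        """Run the tool only after authorization, then store its result in memory."""
        if not self.authorize(tool, arg, requires):
            return False
        self.memory[result_key] = fn(arg)
        self.done.add((tool, arg))
        return True


controller = Controller()
outcomes = [
    controller.execute("book_hotel", "Paris", ["flight"], lambda a: f"hotel in {a}", "hotel"),
    controller.execute("search_flight", "Paris", [], lambda a: f"flight to {a}", "flight"),
    controller.execute("search_flight", "Paris", [], lambda a: f"flight to {a}", "flight"),
    controller.execute("book_hotel", "Paris", ["flight"], lambda a: f"hotel in {a}", "hotel"),
]
```

In this arrangement the language model would only propose the `(tool, arg)` pairs; the decision to run them and the reasons for each denial live in the controller's trace rather than in a prompt.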

2606.18847 2026-06-18 cs.AI New submission, topic 80

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

Yehang Zhang, Jianchong Su, Haojian Huang, Yifan Chang, Tianhao Zhou, Xinli Xu, Yingjie Xu, Yinchuan Li, Zexi Li, Ying-Cong Chen

Affiliations * HKUST(GZ); HKUST; Knowin

Topic match (Planning and decision-making): long-term memory and task planning for embodied agents.

AI summary: Proposes the WorldLines benchmark, which builds temporally extended household traces (dialogues, actions, state changes, and more) to evaluate the long-term memory and task planning of embodied agents, and designs the ObsMem memory framework to improve state-aware decisions.

Comments 27 pages, 18 figures

English abstract

To assist humans over extended periods in real homes, embodied agents must remember user routines, world states, and past interactions. Existing long-term memory benchmarks mainly evaluate language-centric retrieval and question answering, while embodied benchmarks often focus on short-horizon task execution without testing long-term memory use in dynamic environments. We introduce WorldLines, a project-driven benchmark for long-horizon embodied household assistance. It constructs temporally extended household traces with dialogues, actions, execution feedback, object and device state changes, and converts them into evidence-linked samples for Memory QA and Embodied Task Planning. We further propose ObsMem, an observer-grounded memory framework that maintains visibility-aware memories and action-native state trails for state-aware decisions. Experiments reveal persistent challenges in partial observability, overwritten world states, and translating long-term memory into embodied plans, while ObsMem offers a stronger reference architecture for this setting.

2606.18746 2026-06-18 cs.AI New submission, topic 80

What Must Generalist Agents Remember?

Khurram Yamin, Namrata Deka, Maitreyi Swaroop, Albert Ting, Jeff Schneider, Bryan Wilder

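Affiliations * Carnegie Mellon University; Georgia Institute of Technology

Topic match (Planning and decision-making): a formal analysis of the memory requirements of generalist agents.

AI summary: Formally argues that, to act near-optimally across multiple environments and goals, a generalist agent must store domain-relevant information that distinguishes incompatible optimal actions at observational bottlenecks, and proves that such memory can be used to reconstruct local transition dynamics.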

2606.18105 2026-06-18 cs.NI cs.LG New submission, topic 80

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

Longlong Zhu, Jiashuo Yu, Zedi Chen, Yuhan Wu, Zhifan Jiang, Yuchen Xian, Yimeng Liu, Jiajie Su, Shaopeng Zhou, Xingyuan Li, Hongyan Liu, Xuan Liu, Dong Zhang, Chunming Wu, Xiang Chen

Affiliations * Zhejiang University; Fuzhou University; Yangzhou University; The State Key Laboratory of Blockchain and Data Security; College of Computer Science and Technology

Topic match (Planning and decision-making): an adaptive framework that dynamically selects solvers for planning

AI summary: Proposes OmniPlan, an adaptive framework that uses a large language model to interpret user intent and a mixture-of-experts architecture to choose dynamically among MIP solvers, heuristics, and deep reinforcement learning models, achieving both timeliness and near-optimality in network planning optimization. On distributed ML inference offloading it reduces latency by up to 97.8% and resource consumption by up to 11.5%.

Comments Accepted by ACM KDD 2026

English abstract

Network planning optimization is a fundamental problem across diverse domains, including transportation systems, communication networks, and power grids. It requires simultaneous optimization of multiple competing objectives under complex constraints. Existing network planning optimization frameworks rely on mixed integer programming (MIP) solvers, heuristics, and deep reinforcement learning (DRL) models to compute planning decisions. However, they lack effective adaptability to diverse and dynamic user intents, thus leading to the trade-off between execution time and optimality. In this paper, we propose OmniPlan, an adaptive framework that achieves both timeliness and near-optimality in network planning optimization. To achieve the adaptability lacking in existing solutions, OmniPlan employs a large language model (LLM)-based interpreter to convert heterogeneous natural-language intents into a unified and quantifiable user-preference vector. Then it employs a mixture-of-experts architecture that integrates MIP solvers, heuristics, and DRL models as specialized experts, where OmniPlan adapts to diverse intents by dynamically selecting timely and near-optimal experts. Finally, it incorporates a DRL-based expert configuration module that fine-tunes optimization objective weights to align planning decisions with user-specific preferences. We evaluate OmniPlan with a representative real-world workload, i.e., distributed machine learning (ML), where we leverage OmniPlan to offload a wide spectrum of ML inference tasks, e.g., decision trees, SVM, naive Bayes, XGBoost, and random forests, onto a network of hardware devices. Our experiments on a real-world testbed indicate that OmniPlan achieves near-optimal and low-execution-time offloading for real-world ML inference tasks, reducing latency by up to 97.8% and network device resource consumption by up to 11.5%.
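
The expert-selection idea can be sketched as a weighted trade-off between execution time and solution quality. Everything concrete below is assumed for illustration: the two-component preference vector, the linear cost function, and the time and optimality-gap estimates per expert are invented, and the paper's LLM interpreter and DRL-based configuration module are not modeled.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Expert:
    name: str
    est_seconds: float  # estimated execution time (illustrative value)
    est_gap: float      # estimated optimality gap, 0.0 means optimal (illustrative value)


EXPERTS = [
    Expert("mip_solver", est_seconds=600.0, est_gap=0.00),
    Expert("heuristic", est_seconds=1.0, est_gap=0.15),
    Expert("drl_model", est_seconds=5.0, est_gap=0.05),
]


def select_expert(experts: Sequence[Expert], w_time: float, w_quality: float,
                  time_scale: float = 60.0) -> Expert:
    """Pick the expert with the lowest preference-weighted cost."""
    def cost(expert: Expert) -> float:
        return w_time * (expert.est_seconds / time_scale) + w_quality * expert.est_gap
    return min(experts, key=cost)


latency_first = select_expert(EXPERTS, w_time=0.9, w_quality=0.1)
balanced = select_expert(EXPERTS, w_time=0.1, w_quality=0.9)
quality_first = select_expert(EXPERTS, w_time=0.001, w_quality=0.999)
```

With these made-up estimates, the three preference settings pick the heuristic, the DRL model, and the MIP solver respectively, which mirrors the abstract's point that no single solver suits every intent.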

2606.17453 2026-06-18 cs.AI New submission, topic 80

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

Lubin Bai, Mengyu Cao, Sixue Wang, Zhongwei Wan, Yue Pan, Jiale Hou, Xiang Li, Xiuyuan Zhang

Affiliations * University of Chinese Academy of Sciences; Institute of Automation, Chinese Academy of Sciences

Topic match (Planning and decision-making): evaluating how well map agents satisfy implicit user needs

AI summary: Proposes the MapSatisfyBench benchmark, which evaluates the satisfaction awareness of map agents by recovering implicit decision factors from user behavior chains. Experiments show that current agents perform well on explicit task completion but remain limited in satisfying implicit needs.

English abstract

Large language model agents are increasingly integrated into map services. Since map services are embedded in everyday-life scenarios rather than professional task settings, users often express their needs informally, resulting in underspecified queries with many unspoken needs, namely, implicit decision factors that are critical for user satisfaction. Although clarification is an effective way to mitigate this issue, it increases user burden in daily interaction, and a capable agent should first proactively recover such factors from available information sources. However, evaluating this ability is challenging. The first challenge is to determine which implicit decision factors are suitable for evaluation. A factor is evaluable only if it affects user acceptance and can be recovered from information available to the agent before it responds. Second, user satisfaction cannot be reliably represented by a single reference answer, requiring a benchmark that converts satisfaction-relevant factors into objective and quantifiable evaluation targets. To address these challenges, we propose a restore-identify-filter framework that reconstructs complete user needs from behavior-chain evidence, identifies implicit decision factors, and retains only those supported by pre-query evidence. Building on this methodology, we construct MapSatisfyBench from large-scale, real-world anonymized user data and annotate ground truth along five dimensions, enabling full-chain evaluation of satisfaction-aware map agents. Experiments show that current agents generally perform well on explicit task completion, but remain limited in satisfying implicit decision factors and proactively acquiring the evidence needed for satisfaction-aware decisions. These findings establish MapSatisfyBench as a benchmark for shifting map-agent evaluation from task completion toward satisfaction-aware spatial decision making.

2605.29676 2026-06-18 cs.AI cs.CL Version update, topic 85

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

Lorenz Kutschka, Bernhard Geiger

Affiliations * Know Center Research GmbH; Graz University of Technology; Graz Center for Machine Learning

Topic match (Tool calling): token-optimized formats in agentic systems for more efficient tool calling

AI summary: Evaluates two token-optimized formats, TOON and TRON, on four agentic benchmarks. TRON cuts tokens by up to 27% while staying within 14 percentage points of the JSON baseline's accuracy; TOON cuts tokens by up to 18% at a similar 9-point accuracy cost, but also cascades on multi-turn parsing failures and collapses parallel tool-call output for most models.

Comments 16 pages, 6 figures, 4 tables

English abstract

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters
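
The overhead the abstract attributes to JSON's structural elements is easy to see on a small tool result. The sketch below compares `json.dumps` output with a header-plus-rows encoding. This compact encoding is a generic illustration and is neither TOON nor TRON, whose syntax is not given in the abstract; character counts are used as a crude proxy, since real token counts depend on the tokenizer. The sample records are invented.

```python
import json
from typing import Dict, List, Union

Record = Dict[str, Union[int, str]]


def to_tabular(records: List[Record]) -> str:
    """Write keys once as a header row, then one comma-separated row per record."""
    keys = list(records[0])
    lines = [",".join(keys)]
    for record in records:
        lines.append(",".join(str(record[key]) for key in keys))
    return "\n".join(lines)


# Invented tool result for illustration.
RESULTS: List[Record] = [
    {"id": 1, "city": "Graz", "temp_c": 21},
    {"id": 2, "city": "Vienna", "temp_c": 24},
    {"id": 3, "city": "Linz", "temp_c": 19},
]

as_json = json.dumps(RESULTS)
as_table = to_tabular(RESULTS)
char_saving = 1 - len(as_table) / len(as_json)
```

The saving comes from not repeating keys, quotes, and braces for every record. The naive encoder above has no escaping for values that contain commas or newlines, a reminder that compact formats can introduce their own parsing hazards.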

2606.18803 2026-06-18 cs.AI cs.CY New submission, topic 80

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

Tengfei Lyu, Zirui Yuan, Xu Liu, Kai Wan, Zihao Lu, Li Ma, Hao Liu

发表机构 * Didichuxing Co. Ltd（滴滴出行科技有限公司）

专题命中工具调用：LLM智能体用于网约车调度用户画像

AI总结提出ProfiLLM，一种通过工具增强全局知识挖掘和效用对齐画像探索的智能LLM数据管道，解决工业网约车调度中大规模行为日志的用户画像问题，在滴滴生产系统中实现AUC提升6.14%、GMV提升4.35%。

详情

English abstract

Bringing Large Language Models (LLMs) into industrial ride-hailing dispatch as semantic feature extractors over platform-scale behavioral logs is a compelling but under-explored data systems problem. Production matching pipelines remain dominated by structured numerical features, yet decisive behavioral signals (e.g., a driver's habitual aversion to certain regions) are inherently contextual and naturally expressible as LLM-generated user profiles. However, scaling such profiling to a live, millisecond-latency dispatcher faces three intertwined constraints rarely addressed together: on a platform with millions of daily orders, logs exceed any LLM's context window by orders of magnitude; most users are long-tail, with too few interactions for per-user profiling; and surface-fluent profiles do not necessarily improve downstream prediction utility. We present ProfiLLM, an agentic LLM data pipeline that operationalizes utility-aligned user profiling for production matching systems through two modules. (1) Tool-Augmented Global Knowledge Mining equips an LLM agent with 27 analytical tools to mine platform-scale data, producing reusable global knowledge, adaptive user clustering rules, and region-level supply-demand priors. (2) Utility-Aligned Profile Exploration generates multiple candidate profiles per cluster, evaluates them via a lightweight downstream utility proxy, iteratively refines the best candidates and constructs preference pairs for DPO fine-tuning. Deployed on DiDi's production dispatcher, ProfiLLM achieves up to +6.14% relative AUC improvement in outcome prediction, up to +4.35% GMV gain in dispatching simulation, and consistent improvements in a 14-day online A/B test including +0.47% GMV, +0.33% Completion Rate, and -0.82% Cancel-Before-Accept rate.
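The second module (score several candidate profiles, then turn the ranking into preference pairs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the abstract does not say how pairs are formed, so pairing the top-scoring candidate against each lower-scoring one is an assumption, and `toy_utility`, `SIGNALS`, and `build_preference_pairs` are hypothetical names standing in for the paper's downstream utility proxy.

```python
from typing import Callable


def build_preference_pairs(
    candidates: list[str],
    score: Callable[[str], float],
    min_margin: float = 0.0,
) -> list[dict[str, str]]:
    """Pair the best-scoring candidate with every candidate it beats by more than min_margin."""
    if not candidates:
        return []
    # Sort by score, highest first; ties keep their input order.
    scored = sorted(((score(c), c) for c in candidates), key=lambda sc: sc[0], reverse=True)
    best_score, best = scored[0]
    return [
        {"chosen": best, "rejected": text}
        for s, text in scored[1:]
        if best_score - s > min_margin
    ]


# Hypothetical stand-in for a utility proxy: counts behavioral signals mentioned.
SIGNALS = ("airport", "night")


def toy_utility(profile: str) -> float:
    lowered = profile.lower()
    return float(sum(word in lowered for word in SIGNALS))


candidates = [
    "Prefers night shifts near the airport.",
    "Is a driver.",
    "Avoids the airport.",
]
pairs = build_preference_pairs(candidates, toy_utility)
for pair in pairs:
    print(pair)
```

Records with `chosen` and `rejected` fields follow the shape commonly used for preference data; a real pipeline would also carry the prompt that produced each candidate.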

2601.14288 2026-06-18 astro-ph.CO cs.AI cs.CE gr-qc hep-th Version update, topic 85

DeepInflation: an AI agent for research and model discovery of inflation

Ze-Yu Peng, Hao-Shi Yuan, Qi Lai, Jun-Qian Jiang, Gen Ye, Jun Zhang, Yun-Song Piao

Affiliations: School of Physical Sciences, University of Chinese Academy of Sciences, Beijing 100049, China; International Centre for Theoretical Physics Asia-Pacific, University of Chinese Academy of Sciences, 100190 Beijing, China; Taiji Laboratory for Gravitational Wave Universe, University of Chinese Academy of Sciences, 100049 Beijing, China; School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China; Institute of Theoretical Physics, Chinese Academy of Sciences, P.O. Box 2735, Beijing 100190, China; Département de Physique Théorique, Université de Genève, 24 quai Ernest-Ansermet, CH-1211 Genève 4, Switzerland

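Topic match (workflow automation): multi-agent architecture that automatically discovers inflationary potential models

AI summary: Proposes DeepInflation, an AI agent built on a multi-agent architecture that integrates large language models, a symbolic regression engine, and a retrieval-augmented generation knowledge base to automatically discover single-field slow-roll inflationary potentials consistent with the latest observations and to explain their theoretical background.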

Details

English abstract

We present DeepInflation, an AI agent designed for research and model discovery in inflationary cosmology. Built upon a multi-agent architecture, DeepInflation integrates Large Language Models (LLMs) with a symbolic regression (SR) engine and a retrieval-augmented generation (RAG) knowledge base. This framework enables the agent to automatically explore and verify the vast landscape of inflationary potentials while grounding its outputs in established theoretical literature. We demonstrate that DeepInflation can successfully discover simple and viable single-field slow-roll inflationary potentials consistent with the latest observations (with the ACT DR6 results taken as an example) or any given $n_s$ and $r$, and provide accurate theoretical context for obscure inflationary scenarios. DeepInflation serves as a prototype for a new generation of autonomous scientific discovery engines in cosmology, which enables researchers and non-experts alike to explore the inflationary landscape using natural language. This agent is available at https://github.com/pengzy-cosmo/DeepInflation.
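To make "consistent with any given $n_s$ and $r$" concrete, the sketch below evaluates the two observables for a monomial potential $V \propto \phi^p$ using the standard leading-order slow-roll relations $\epsilon = \tfrac{1}{2}(V'/V)^2$, $\eta = V''/V$, $n_s \approx 1 - 6\epsilon + 2\eta$, and $r \approx 16\epsilon$ in reduced Planck units. It is a textbook calculation for illustration only, not code from DeepInflation; the function name and the choice of 60 e-folds are assumptions.

```python
def slow_roll_monomial(p: float, n_efolds: float) -> tuple[float, float]:
    """Leading-order slow-roll n_s and r for V proportional to phi**p.

    Reduced Planck units (M_pl = 1). Inflation is taken to end at epsilon = 1,
    i.e. phi_end**2 = p**2 / 2, and the e-fold count is
    N = (phi**2 - phi_end**2) / (2 * p).
    """
    if p <= 0 or n_efolds <= 0:
        raise ValueError("p and n_efolds must be positive")
    phi_sq = 2.0 * p * n_efolds + p * p / 2.0  # field value squared, N e-folds before the end
    eps = p * p / (2.0 * phi_sq)               # epsilon = (1/2) (V'/V)^2
    eta = p * (p - 1.0) / phi_sq               # eta = V''/V
    n_s = 1.0 - 6.0 * eps + 2.0 * eta          # scalar spectral index
    r = 16.0 * eps                             # tensor-to-scalar ratio
    return n_s, r


# Quadratic potential (p = 2) evaluated 60 e-folds before the end of inflation.
n_s, r = slow_roll_monomial(p=2.0, n_efolds=60.0)
print(f"n_s = {n_s:.4f}, r = {r:.4f}")
```

A model-discovery loop would compare such predictions with a target region in the ($n_s$, $r$) plane and keep the candidate potentials that fall inside it.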

2606.18874 2026-06-18 cs.AI New submission, topic 80

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness

Zijian Wang, Hanqi Li, Ziyue Yang, Zijian Hu, Shenghan Zuo, Yunzhe Zhang, Da Ma, Danyu Luo, Chenrun Wang, Jing Peng, Tiancheng Huang, Sijia Guo, Huayang Wang, Zichen Zhu, Senyu Han, Yilu Cao, Kai Yu, Lu Chen

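Affiliations: X-LANCE Lab, School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; Jiangsu Key Lab of Language Computing, Suzhou, China; Suzhou Laboratory, Suzhou, China

Topic match (workflow automation): automates scientific research workflows by externalizing synthesis and validation

AI summary: Proposes Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes, addresses claim drift in automated research, and is demonstrated across several domains.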

Comments 65 pages, 14 figures, 19 tables

Details

English abstract

AI systems can increasingly automate scientific workflows, but the reasoning that links prior evidence, generated ideas, experiments and final claims often remains implicit inside model inference. Here we introduce Xcientist, a research harness that externalizes research synthesis and experimental validation into inspectable, contract-governed processes. Xcientist organizes literature evidence, idea states, implementation plans, ablation records and repair traces as persistent research artifacts, so that generated mechanisms can be grounded, executed, tested and revised without losing their evidential basis. We identify claim drift as a failure mode of automated research, where runnable artifacts no longer support the mechanism originally claimed. Across training-free memory systems, graph-structured traffic forecasting and multi-scale physics-informed neural networks, Xcientist preserves traceable trajectories from problem formulation to mechanism design, validation and bounded revision. These results suggest that AI scientists should be evaluated not only by their final artifacts, but by whether their synthesis and validation processes remain attributable, inspectable and scientifically accountable.

1. Multi-agent (11 papers)

Toward Vibe Medicine: A Self-Evolving Multi-Agent Framework for Clinical Decision Support

From Privacy to Workflow Integrity: Communication-Graph Metadata in Autonomous Agent Interoperability

Multi-Agent Systems are Mixtures of Experts: Who Becomes an Influencer?

The Dynamics of Policy Gradient in Social Dilemmas with Partner Selection

PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation

A Technical Taxonomy of LLM Agent Communication Protocols

Byzantine-Resilient Federated Multi-Agent Optimization Framework for Cyber-Secure Interconnected Microgrids

GateMem: Benchmarking Memory Governance in Multi-Principal Shared-Memory Agents

Characterizing Opinion Evolution of Networked LLMs

Emergent Macro-Criticality from Micro-Critical Agents

AdsMind: A Physics-Grounded Multi-Agent System for Self-Correcting Discovery of Adsorption Configurations on Heterogeneous Catalyst Surfaces

2. Software agents (2 papers)

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

Vision AI Agent for Continuous Material Monitoring of LEGEND-1000 LoFi Reentrant Tube

3. Other agents (5 papers)

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Towards an Agent-First Web: Redesigning the Web for AI Agents

PYPILINE: Malicious PyPI Package Detection via Suspicious API Knowledge and Agent Workflow

Dissecting model behavior through agent trajectories

4. Planning and decision-making (8 papers)

PatchWorld: Gradient-Free Optimization of Executable World Models

InfoPO: Information-Driven Policy Optimization for User-Centric Agents

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

Structured Cognitive Loop for Behavioral Intelligence in Large Language Model Agents (Extended Revision: From Behavioral Architecture to Epistemic Accountability)

WorldLines: Benchmarking and Modeling Long-Horizon Stateful Embodied Agents

What Must Generalist Agents Remember?

OmniPlan: An Adaptive Framework for Timely and Near-Optimal Network Planning Optimization

MapSatisfyBench: Benchmarking Satisfaction-Aware Map Agents through Behavior-Grounded Implicit Decision Factors

5. Tool calling (2 papers)

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

ProfiLLM: Utility-Aligned Agentic User Profiling for Industrial Ride-Hailing Dispatch

6. Workflow automation (2 papers)

DeepInflation: an AI agent for research and model discovery of inflation

Externalizing Research Synthesis and Validation in AI Scientists through a Research Harness