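arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Included for today/current date: 24. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE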
2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN New submission 90%

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

Masahiro Kato

Affiliations: Mizuho-DL Financial Technology, Co., Ltd.

Topic match: Other Agents: an AI economist agent framework that plans, retrieves, and generates reports

AI summary: Proposes a RAG-based AI economist agent framework that uses knowledge graphs and large language models for economic scenario analysis. Agents plan the analysis, retrieve evidence, select models, and generate reports, improving the coherence and traceability of economic narratives.

Abstract

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded by economic theory and real-world data. Based on this motivation, this study proposes an RAG-based AI economist, which utilizes knowledge graphs including economic data and theory and LLM-based agents to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.20510 2026-06-19 cs.CR cs.AI New submission 90%

Efficient and Sound Probabilistic Verification for AI Agents

Alaia Solko-Breslin, Pramod Kaushik Mudrakarta, Mihai Christodorescu, Somesh Jha, Krishnamurthy Dj Dvijotham

Affiliations: Google DeepMind; Google; University of Pennsylvania; University of Wisconsin–Madison

Topic match: Other Agents: proposes a probabilistic verification framework for AI agents to ensure policy compliance

AI summary: Proposes a framework based on distributionally robust optimization that gives sound upper bounds on the probability of policy violation for AI agents in complex digital environments, without independence assumptions. It outperforms existing methods on terminal and tool-calling agent benchmarks.

Abstract

Securing AI agents that operate in complex digital environments has become a critical need, and runtime monitoring approaches that formulate and enforce policies expressed in a formal language like Datalog offer a promising solution. However, existing approaches are restricted to deterministic policies. In many practical applications of AI agents, there is a need to enforce security policies in the face of ambiguity, leading to probabilistic predicates or state transitions (for example, a declassifier or Personally Identifiable Information (PII) detector that has some failure probability on each invocation). Furthermore, in many such applications, one cannot easily make the independence assumptions necessary to invoke prior work on probabilistic inference in Datalog. We address this by introducing a sound and efficient framework for such verification based on distributionally robust optimization, computing sound upper bounds on the probability of policy violation regardless of possible correlations between predicates. On standard benchmarks for terminal and tool calling agents, we demonstrate that our approach outperforms prior art and improves the security-utility trade-off while ensuring rigorous bounds on the probability of policy violation.
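
To see why dropping the independence assumption matters, the classical Fréchet inequalities give bounds on conjunctions and disjunctions that hold under any correlation between events. The sketch below only illustrates that idea; it is not the paper's distributionally robust optimization framework, and the function names and the two-detector scenario are assumptions made for this example.

```python
def and_upper_bound(probs):
    """Upper bound on P(all events occur) that holds under any correlation."""
    return min(probs)


def or_upper_bound(probs):
    """Union bound on P(at least one event occurs), capped at 1."""
    return min(1.0, sum(probs))


def and_if_independent(probs):
    """P(all events occur), valid only when the events are independent."""
    result = 1.0
    for p in probs:
        result *= p
    return result


# Hypothetical policy: a violation needs two PII detectors to fail together,
# and each detector fails with probability 0.1 per invocation.
failure_probs = [0.1, 0.1]
optimistic = and_if_independent(failure_probs)  # assumes independent failures
sound = and_upper_bound(failure_probs)          # safe even if failures coincide
```

If both detectors fail on exactly the same inputs, the true violation probability equals the larger bound, so the independence-based figure would understate the risk.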

2606.19704 2026-06-19 cs.AI New submission 90%

Beyond Static Leaderboards: Predictive Validity for the Evaluation of LLM Agents

Dhaval C. Patel, Kaoutar El Maghraoui, Shuxin Lin, Yusheng Li, Tianjun Feng, Chun-Yi Tsai, Yihan Sun, Wei Alexander Xin, Akshat Bhandari, Tanisha Rathod, Aaron Fan, Sanskruti Vijay Shejwal, Tomas Pasiecznik, Sagar Chethan Kumar, Tanmay Agarwal, Rohith Kanathur, Sam Colman, Amaan Sheikh, Dev Bahl, Ann Li, Krish Veera, Alimurtaza Mustafa Merchant, Shambhawi Baswaraj Bhure, Sajal Kumar Goyla, Chengrui Li, Kirthana Natarajan, Rui Li, Thomas Ajai, Rujing Li, Vivek G. Iyer, Sanjaii Vijayakumar, Yitong Bai, Ayal Yakobe, Darief Maes, Yassine Jebbouri, Tianyang Xu, Thai Quoc On, Vera Mazeeva, Winston Li, Yuval Shemla, Yeshitha Bhuvanesh, Rushin Bhatt, Siddharth Chethan Gowda, Alisha Vinod, Caroline Cahill, Shriya Aishani Rachakonda, Yunfeng Chen, Aryaman Agrawal, Aman Upganlawar, Mao Le Jonathan Ang, Yubin Sally Go, Madhav Rajkondawar, Yang-Jung Chen, Trisha Maturi, Ananya Kapoor, Andrew Li, Shrey Arora, Mana Abbaszadeh, Shen Li, Charles Xu, Byeolah Kwon

Affiliations: IBM

Topic match: Other Agents: evaluates the predictive validity of LLM agent benchmarks and proposes a new approach.

AI summary: Drawing on 14 parallel studies, argues that aggregate-score leaderboards do not generalize to out-of-distribution settings, proposes ranking configurations by predictive validity, and designs falsifiable out-of-distribution evaluation criteria.

Comments: 17 pages, 2 tables, 5 figures

Abstract

Agent benchmarks are growing fast, but no single benchmark touches more than four or five of the dimensions that deployment exposes. This paper aggregates the largest coordinated deep-dive of one MCP-based industrial-agent benchmark to date: fourteen parallel implementation studies covering new asset classes (including a multi-modal visual extension), alternative orchestrations, retrieval strategies, reasoning modes, infrastructure optimizations, and evaluation-methodology probes. Consolidating those studies with seven prior agent benchmarks, we argue that aggregate-score leaderboards systematically underspecify deployed-agent evaluation. Rankings derived from aggregate scores do not transfer to out-of-distribution settings; recent public-to-hidden competition retrospectives provide direct empirical evidence of this rank instability. We propose ranking configurations by predictive validity, the correlation between in-sample and out-of-sample rank, rather than in-sample mean, and report a twelve-tier measurement apparatus that exposes the deployment-relevant dimensions HELM and its agent-era successors collapse. The position is operationalized through three falsifiable out-of-distribution criteria with explicit thresholds; existing evidence partly supports it but is too thin to confirm. We close with a pre-registered pilot design and a field-level vision for what the next generation of agentic benchmarks should report.
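
The abstract defines predictive validity as the correlation between in-sample and out-of-sample rank. A minimal sketch of one way to compute such a quantity follows, assuming Spearman's rank correlation and no tied scores; the abstract does not say which correlation coefficient the authors use, and the scores below are invented for illustration.

```python
def ranks(scores):
    """Rank 1 goes to the highest score. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(in_sample, out_of_sample):
    """Spearman rank correlation between two score lists of equal length."""
    n = len(in_sample)
    if n != len(out_of_sample) or n < 2:
        raise ValueError("need two score lists of equal length >= 2")
    r_in = ranks(in_sample)
    r_out = ranks(out_of_sample)
    d_squared = sum((a - b) ** 2 for a, b in zip(r_in, r_out))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Hypothetical scores for four agent configurations.
public_scores = [0.82, 0.79, 0.75, 0.60]   # in-sample leaderboard
hidden_scores = [0.55, 0.70, 0.66, 0.40]   # out-of-sample evaluation
validity = spearman(public_scores, hidden_scores)
```

In this made-up example the in-sample leader drops to third place on the hidden set, so the correlation is well below 1 even though the in-sample means look cleanly separated.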

2606.11537 2026-06-19 cs.AI cs.CE New submission 90%

MoCA-Agent: A Market-of-Claims Code Agent for Financial and Numerical Reasoning

Abdelrahman Abdallah, AbdelRahim A. Elmadany, Sameh Al Natour, Hasan Cavusoglu, Adam Jatowt, Muhammad Abdul-Mageed

Affiliations: University of Innsbruck; University of British Columbia; Toronto Metropolitan University

Topic match: Other Agents: proposes a market-of-claims code agent for financial numerical reasoning

AI summary: Proposes MoCA-Agent, which addresses numerical reasoning errors in financial table question answering through claim-level verification and code generation, achieving strong performance on ten benchmarks.

Abstract

Financial and tabular question answering requires more than fluent reasoning: answers must be grounded in the exact facts, formulas, units, signs, and scales that support them. A single misread cell or incorrect operation can silently produce a plausible but wrong result. We introduce MoCA-Agent, a market-of-claims code agent that replaces free-form multi-agent debate with claim-level verification. The system decomposes each question into typed atomic claims, asks specialist trader agents to buy or sell those claims, clears their orders into confidence-weighted accept/reject decisions, and synthesizes an executable Python program from market-supported evidence. A code-aware verifier then checks the program for execution, structural consistency, and common financial reasoning errors, with at most one market-aware repair round. Across ten public benchmarks spanning financial numerical reasoning, general tabular reasoning, ESG question answering, and multimodal chart reasoning, MoCA-Agent achieves strong performance using a fixed Qwen3.6-27B backbone, including 78.3% on FinQA, 76.0% on FinanceMath, 71.2% on MultiHiertt, 86.9% on ESGenius, and 85.6% average on FinChart-Bench. These results show that aggregating evidence at the level of atomic claims, rather than whole answers, improves robustness in high-stakes numerical reasoning. The code and data are available at https://github.com/UBC-NLP/MoCA-Agent.
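
The step of clearing trader orders into confidence-weighted accept/reject decisions can be pictured with a toy example. This is a hypothetical sketch, not the MoCA-Agent implementation: the `Order` type, the normalized net-support rule, and the zero threshold are all assumptions, since the abstract does not specify the clearing rule.

```python
from dataclasses import dataclass


@dataclass
class Order:
    trader: str        # name of the specialist agent (illustrative)
    side: str          # "buy" supports the claim, "sell" disputes it
    confidence: float  # weight in [0, 1]


def clear_claim(orders, threshold=0.0):
    """Return (accepted, net_support) for one atomic claim.

    net_support is the confidence-weighted buy/sell balance in [-1, 1].
    """
    total = sum(order.confidence for order in orders)
    if total == 0:
        return False, 0.0  # no evidence either way: reject by default
    signed = 0.0
    for order in orders:
        if order.side == "buy":
            signed += order.confidence
        elif order.side == "sell":
            signed -= order.confidence
        else:
            raise ValueError(f"unknown side: {order.side!r}")
    net_support = signed / total
    return net_support > threshold, net_support


# Hypothetical orders on a single claim such as "revenue is in thousands".
orders = [
    Order("unit_checker", "buy", 0.9),
    Order("table_reader", "buy", 0.6),
    Order("sign_checker", "sell", 0.5),
]
accepted, net_support = clear_claim(orders)
```

Only claims that clear the market would then feed the program synthesis step described in the abstract; that step and the code-aware verifier are outside this sketch.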

2606.20475 2026-06-19 cs.LG New submission 85%

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

Mingyu Yang, Keye Zheng, Congchao Cheng, Yujie Liu, Xingkang Lu, Fan Jiang, Yefei Zheng

Affiliations: Alibaba International Digital Commerce Group

Topic match: Other Agents: proposes a memory-driven agent self-evolution method that improves agent trace distillation.

AI summary: Addresses the lack of cross-batch evidence in batch-style trace distillation by proposing Marginal Advantage Accumulation (MAA), which combines differential signal construction, exponential moving average accumulation, and semantic identity merging. It achieves the best results in 14 of 16 settings and cuts optimization-phase token consumption by about 75%.

Comments: 26 pages, 4 figures, 10 tables, 42 references

Abstract

In batch-style trace distillation, the same memory operation may receive contradictory feedback across different batches. Existing methods lack a cross-batch, operation-level evidence accumulation mechanism, making it impossible to distinguish stably effective operations from accidental hits. This paper formalizes the requirement as two structural conditions, alignability and comparability, and proposes Marginal Advantage Accumulation (MAA). MAA constructs differential signals to make them comparable across batches, accumulates signed evidence per operation via EMA, and ensures cross-batch traceability through semantic identity merging. As a post-processing architecture, MAA achieves the best results in 14 out of 16 settings across 4 benchmarks and 4 target models, consistently outperforming existing batch-level distillation baselines and matching or surpassing online alternatives in most settings, while reducing optimization-phase token consumption by approximately 75%.
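
Accumulating signed evidence per operation via EMA can be sketched in a few lines. The code below is an illustration under stated assumptions, not the authors' MAA implementation: the class name, the decay value, the threshold, and the idea of forming the differential signal as "score with the operation minus a same-batch baseline" are all hypothetical.

```python
def marginal_advantage(score_with_op, baseline_score):
    """Differential signal: how much better a batch did with the operation."""
    return score_with_op - baseline_score


class EvidenceAccumulator:
    """Keeps an exponential moving average of signed evidence per operation."""

    def __init__(self, decay=0.5):
        self.decay = decay
        self.scores = {}

    def update(self, batch_signals):
        """batch_signals maps an operation id to its signed signal in one batch."""
        for op_id, signal in batch_signals.items():
            previous = self.scores.get(op_id, 0.0)
            self.scores[op_id] = self.decay * previous + (1 - self.decay) * signal

    def stable_operations(self, threshold):
        """Operations whose accumulated evidence exceeds the threshold."""
        return sorted(op for op, score in self.scores.items() if score > threshold)


accumulator = EvidenceAccumulator(decay=0.5)
# "a" helps a little in every batch; "b" helps once, then hurts.
accumulator.update({"a": marginal_advantage(0.7, 0.5), "b": marginal_advantage(0.9, 0.5)})
accumulator.update({"a": marginal_advantage(0.6, 0.4), "b": marginal_advantage(0.1, 0.5)})
accumulator.update({"a": marginal_advantage(0.8, 0.6)})
kept = accumulator.stable_operations(threshold=0.05)
```

Operation "b" looks best after the first batch alone, but its contradictory feedback in the second batch pulls its running score below zero, which is the accidental-hit case the abstract describes.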

2606.19893 2026-06-19 cs.AI New submission 85%

MetaResearcher: Scaling Deep Research via Self-Reflective Reinforcement Learning in Adversarial Virtual Environments

Wei Yu, Suxing Liu, Minjie Yu, Jiahao Wang, Zhijian Zheng, Haocheng Deng, Bing Li

Affiliations: School of Digital Arts, Jiangxi Arts & Ceramics Technology Institute; Universiti Sains Malaysia

Topic match: Other Agents: a training framework for deep research agents in adversarial environments.

AI summary: Proposes the MetaResearcher framework, which scales deep research agent training in adversarial environments through an evolving virtual world, discovery-oriented tasks, a self-reflective meta-reward, and a heterogeneous multi-agent architecture, with the aim of improving benchmark performance and epistemic robustness.

Abstract

Deep research agents have demonstrated remarkable capabilities in autonomous information gathering and synthesis, yet their training remains constrained by the static nature of simulated environments, the limits of fact-retrieval-only task designs, and the inefficiency of outcome-based reinforcement learning. In this work, we propose MetaResearcher, a novel framework that scales deep research agent training across four synergistic dimensions. First, we introduce an Evolving Virtual World that injects temporal dynamics and adversarial misinformation into the training environment, forcing agents to develop source credibility assessment and temporal conflict resolution skills. Second, we design Discovery-Oriented Tasks -- including hypothesis generation and contradiction resolution -- that transcend simple fact retrieval and push agents toward genuine research behaviors. Third, we propose a Self-Reflective Meta-Reward mechanism within the GRPO framework that jointly optimizes for answer correctness, search path efficiency, reflection depth, and tool call diversity, directly addressing the repetitive action loop problem observed in prior work. Fourth, we introduce a Heterogeneous Multi-Agent Swarm architecture comprising specialized Scout, Filter, and Synthesizer models that learn collaborative research strategies through coordinated reinforcement learning. Built upon the LiteResearcher infrastructure, MetaResearcher requires zero marginal API cost for training while targeting substantial improvements in both benchmark performance (GAIA, Xbench-DS) and epistemic robustness under adversarial conditions. We present the complete framework design, training methodology, and planned experimental validation.

2606.19847 2026-06-19 cs.CL New submission 85%

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; Anhui University

Topic match: Other Agents: designs a long-term memory system for LLM agents that stores and retrieves atomic facts.

AI summary: To address coarse-grained storage and unstable updates in existing memory systems, proposes AtomMem, which uses a Fact Executor to extract high-value atomic facts as an efficient memory representation and organizes them into hierarchical event structures and temporal profiles, enabling value-dense storage and stable evolution. It achieves state-of-the-art performance on the LoCoMo benchmark.

Comments: 19 pages, 10 figures, 5 tables

Abstract

Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high value atomic facts from long form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.

2606.19749 2026-06-19 cs.AI cs.CL New submission 85%

Benchmarking Agentic Review Systems

Dang Nguyen, Wanqing Hao, Yanai Elazar, Chenhao Tan

Affiliations: University of Chicago; Bar-Ilan University

Topic match: Other Agents: benchmarks agentic review systems, an application of AI agents.

AI summary: Agentic review systems are emerging in response to the pressure that AI-assisted research places on peer review, but evaluation standards are lacking. This paper evaluates several systems and finds that the best configuration (OpenAIReview + GPT-5.5) reaches 83.0% pairwise accuracy, catches 71.6% of injected errors, and receives positive user feedback.

Comments: 11 pages, 7 tables, 4 figures

Abstract

A new class of agentic review systems is emerging as a remedy to the pressure placed on peer review systems by AI-assisted research, but it is unclear how they should be evaluated. We evaluate two open-source systems (OpenAIReview and coarse), one proprietary system (Reviewer3), and a zero-shot baseline, across six LLMs spanning frontier and efficient models. First, we study whether AI reviews on ICLR/NeurIPS papers track paper quality as approximated by external signals such as citations and acceptance decisions. Every system performs above chance in pairwise accuracy, and the best is OpenAIReview + GPT-5.5 at 83.0%. Second, to test whether systems can catch errors with known ground truth, we construct a perturbation benchmark that injects four categories of errors into papers across eight arXiv subject classes and measure detection recall. The strongest configuration (OpenAIReview + GPT-5.5) catches 71.6% of injected errors, leaving substantial room for improvement. The union of detections across six models reaches 83.3% recall, suggesting that different models detect different errors and that better harness design can potentially increase performance. Beyond these benchmarks, we study a public deployment of OpenAIReview with real users. Votes on its comments skew positive at 1.44 to 1, and the most common complaints are about false positives and minor nitpicks. Together, by evaluating full review systems backed by state-of-the-art models on real research papers, we show that while AI reviews still have room for improvement, they can already track human quality judgments well, catch important errors, and earn positive feedback from real users.
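
The two headline metrics, pairwise accuracy and the recall of the union of detections, can be made concrete with a small sketch. It assumes that pairs tied on either signal are skipped and that errors are identified by simple string ids; the paper's exact protocol is not given in the abstract, and all data below is invented.

```python
from itertools import combinations


def pairwise_accuracy(review_scores, quality):
    """Fraction of paper pairs ordered the same way by review score and quality.

    Pairs tied on either signal are skipped (an assumption of this sketch).
    """
    correct = 0
    total = 0
    for i, j in combinations(range(len(quality)), 2):
        if quality[i] == quality[j] or review_scores[i] == review_scores[j]:
            continue
        total += 1
        if (review_scores[i] > review_scores[j]) == (quality[i] > quality[j]):
            correct += 1
    if total == 0:
        raise ValueError("no comparable pairs")
    return correct / total


def recall(detected, injected):
    """Share of injected errors that appear in the detected set."""
    return len(set(detected) & set(injected)) / len(injected)


def union_recall(detections_by_model, injected):
    """Recall of the union of all models' detections."""
    union = set().union(*detections_by_model.values())
    return recall(union, injected)


# Hypothetical data: four papers, quality 1 = accepted, 0 = rejected.
accuracy = pairwise_accuracy([7, 6, 5, 3], [1, 0, 1, 0])

injected_errors = {"e1", "e2", "e3", "e4"}
detections = {"model_a": {"e1", "e2"}, "model_b": {"e2", "e3"}}
combined = union_recall(detections, injected_errors)
```

Here each model alone finds half of the injected errors while their union finds three quarters, the same pattern as the gap between 71.6% and 83.3% reported above.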

2606.19464 2026-06-19 cs.AI cs.MA New submission 85%

Deontic Policies for Runtime Governance of Agentic AI Systems

Anupam Joshi, Tim Finin, Karuna Pande Joshi, Lalana Kagal

Affiliations: CSEE Department, UMBC, Baltimore, MD, USA; Center for AI, UMBC, Baltimore, MD, USA; Information Systems Department, UMBC, Baltimore, MD, USA; CSAIL, MIT, Cambridge, MA, USA

Topic match: Other Agents: proposes a deontic policy framework for runtime governance of agentic AI systems.

AI summary: To address the security, privacy, and compliance governance challenges of LLM-driven agentic AI systems, proposes the AgenticRei framework, which uses a Rei-based deontic policy language (expressed in OWL) and a logic engine to enforce governance constraints such as obligations, dispensations, and conflict resolution at runtime, and is compatible with standards such as A2AS.

Comments: 10 pages, 1 figure. To be published in the 2026 IEEE Symposium on Agentic Services, which is part of the IEEE Conference on Web Services

Abstract

Autonomous agentic AI systems driven by Large Language Models (LLMs) introduce a new class of security, privacy, and compliance challenges: an agent that can invoke tools, manipulate data, install software, and coordinate with peer agents across organizational boundaries must be constrained not just by authentication and access control, but by the full structure of enterprise governance. This includes specifying what agents are permitted to do and prohibited from doing, what they are obliged to do after certain actions (e.g., notify the CISO), under what conditions a standing obligation may be waived, and which rules take precedence when policies conflict. This governance problem exceeds what current policy engines provide. Systems such as XACML, Rego, and Cedar address only the permit/prohibit subset of this governance structure. They do not provide obligation lifecycle management, meta-policy conflict resolution, dispensations that waive obligations in specific circumstances, or ontological reasoning over domain class hierarchies commonly found in applications such as healthcare, cybersecurity, or data privacy. We propose AgenticRei, which realizes key governance requirements such as obligations, dispensations, policy conflict resolutions, and reasoning over policies, as well as the basic permit/prohibit constraints. We use a deontic policy language built on the Rei framework, expressed as OWL (Web Ontology Language) and evaluated at runtime by a high-performance logic engine entirely outside the LLM. The same pipeline governs both tool invocations by the agent and agent-to-agent messages. We show through examples that deontic policies capture governance constraints around security and privacy that mostly cannot be expressed in current production engines. Our approach composes naturally with industry-standard frameworks like A2AS.

2606.19416 2026-06-19 cs.LG New submission 85%

MortarBench: Evaluating Mortgage Loan Origination Agents

Matthew Toles, Yunan Lu, Manav Munjal, Bojun Liu, Yuanhao Deng, Stephanie Selig, Derek Rindner, Cheng Li, Zhou Yu

Affiliations: Columbia University; Tidalwave

Topic match: Other Agents: evaluates large language models on mortgage loan origination tasks.

AI summary: Proposes the MortarBench benchmark, which uses a financial data synthesis and mutation pipeline to generate examples covering edge cases and evaluates large language models on loan origination tasks. It finds that models have low accuracy and exhibit bias, and introduces the CRIT calibration framework, which raises accuracy to 80.5%.

Abstract

Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an applicant. Recently, firms have begun using mortgage loan agents to augment human loan officers, despite a lack of any public benchmark. To fill this gap, we present MortarBench, a loan origination agent benchmark. MortarBench uses a financial data synthesis and mutation pipeline to generate examples with broad edge case coverage that match real-world distributions and questions. We find that state-of-the-art large language models (LLMs) perform poorly, with closed-source models achieving at most 77.1% exact match accuracy. We also discover systematic biases in LLM perception of foreignness related to non-English names. Noting these weaknesses, we introduce CRIT, a confidence calibration framework. Our method increases accuracy to 80.5% while improving risk management steering and reducing bias.

2606.20474 2026-06-19 cs.LG cs.AI cs.PF New submission 80%

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao

Affiliations: Advanced Micro Devices (AMD); University of California, Los Angeles; Purdue University

Topic match: Other Agents: KV-cache compression for context-heavy agents to improve inference efficiency.

AI summary: For context-heavy agent workloads, studies 4-bit KV-cache compression with rotation and codebook quantization together with AMD GPU serving optimizations, including the UltraQuant FP4 path. On a long-context, multi-turn workload it cuts P50 time-to-first-token by 3.47x in cache-pressured late rounds and raises output throughput by 1.63x over an FP8 KV baseline.

Comments: 11 pages, 9 figures

Abstract

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
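
The combination of a Walsh-Hadamard rotation with low-bit group quantization can be illustrated on a single vector. This sketch uses plain symmetric 4-bit integer codes with one float scale per group; it does not reproduce the paper's FP4 format, UE8M0 scales, codebooks, or kernels, and the function names and sample vector are assumptions.

```python
import math


def hadamard(values):
    """Orthonormal fast Walsh-Hadamard transform.

    The length must be a power of two. Applying it twice returns the input.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for k in range(start, start + h):
                a, b = out[k], out[k + h]
                out[k], out[k + h] = a + b, a - b
        h *= 2
    scale = 1 / math.sqrt(n)
    return [v * scale for v in out]


def quantize_group(values):
    """Symmetric 4-bit quantization: integer codes in [-7, 7] plus one scale."""
    peak = max(abs(v) for v in values)
    scale = peak / 7 if peak > 0 else 1.0
    codes = [max(-7, min(7, round(v / scale))) for v in values]
    return codes, scale


def dequantize_group(codes, scale):
    return [c * scale for c in codes]


# Hypothetical key vector with one large outlier.
key_vector = [0.1, -0.2, 4.0, 0.05, 0.0, 0.3, -0.1, 0.2]
rotated = hadamard(key_vector)                 # rotate before quantizing
codes, scale = quantize_group(rotated)         # store 4-bit codes + scale
restored = hadamard(dequantize_group(codes, scale))  # undo the rotation
error = math.sqrt(sum((a - b) ** 2 for a, b in zip(restored, key_vector)))
```

Because the rotation is orthonormal, the reconstruction error measured after rotating back has the same L2 norm as the quantization error in the rotated domain. The usual rationale for rotating first is that it tends to spread an outlier's energy across the group, so a single shared scale wastes fewer quantization levels.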

2606.19948 2026-06-19 cs.AI New submission 80%

Advancing DialNav through Automatic Embodied Dialog Augmentation

Leekyeung Han, Sangwon Jung, Hyunji Min, Jinseong Jeong, Minyoung Kim, Paul Hongsuck Seo

Affiliations: Korea University; Trillion Labs

Topic match: Other Agents: builds an embodied dialog dataset to improve performance on the DialNav task

AI summary: Proposes an automatic generation pipeline to build the large-scale RAINbow dataset (238K episodes) and combines it with Dual-Strategy Training and a localization model, substantially improving success rate on DialNav (Val Seen +89%, Val Unseen +100%).

Comments: 29 pages, 9 figures

Abstract

For embodied agents capable of physical interaction, the capability to create and understand dialog is crucial to ensure both safety and effectiveness. While DialNav [han2025dialnav] provides a framework for holistic evaluation of the dialog-execution loop in photorealistic indoor navigation, its performance remains limited by a critical scarcity of training data (2K episodes). To address this, we propose an automatic generation pipeline and construct the RAINbow dataset, a large-scale training dataset with 238K episodes for DialNav. Our pipeline converts existing VLN datasets into multi-turn dialog and creates a cost-efficient, high-quality dataset. Then, we introduce two additional complementary advances to unlock the data's full potential: (1) Dual-Strategy Training, a navigation training scheme to align the navigation training with the dynamic dialog-navigation loop, and (2) a localization model that leverages VLN knowledge. By combining these complementary solutions, our model substantially outperforms the baseline in success rate on both Val Seen (58.24, +89%) and Val Unseen (29.05, +100%) splits, establishing a new state of the art.

2606.19904 2026-06-19 cs.SI New submission 80%

Toward Temporal Realism in City-Scale Crisis Response Simulation using LLM Agents

Anping Zhang, Yang Tan, Yuanbo Tang, Huaze Tang, Qiuhua Ye, Marta C. Gonzalez, Yang Li

Topic match: Other Agents: temporal realism of LLM agents in simulated crisis response.

AI summary: To address the lack of temporal realism in LLM-based social simulation, uses Shenzhen volunteering data spanning the pandemic to propose data-calibrated self-excitation and crisis-activation mechanisms that produce bursty timing, bringing agents' temporal distributions closer to reality.

Comments: 11 pages, 7 figures

Abstract

Human collective participation is rarely steady in time: it is bursty, with short episodes of intense activity separated by long quiet intervals. In crisis response and community mobilization, predicting when people act matters as much as predicting whether they act. Such settings are increasingly modeled with LLM-based social simulators, yet these simulators are validated on whether each action is individually plausible, not on whether actions are timed as in reality. Their temporal realism, the degree to which simulated activity reproduces the bursty, heavy-tailed timing of real human systems, thus remains untested. We examine this gap using a multi-year, city-scale log of offline volunteering in Shenzhen that spans the COVID-19 pandemic. Empirically, we establish that bursty timing is common at individual and tracked-group levels, that it is largely endogenous and self-exciting, and that it is amplified by the pandemic rather than produced by daily activity cycles. A standard LLM-only simulator reproduces almost none of this timing: its synchronous schedule has no self-excitation channel, so agents act on a near-regular clock. Guided by these findings, we build a simulator in which a data-calibrated self-excitation channel and a crisis-period regime decide when each agent acts and query the LLM only at those moments, leaving it to decide which task to join and whether to commit. The LLM-only baseline yields no bursty agents (median burstiness $B=-0.14$); a single data-calibrated gate is then sufficient to lift per-agent timing above the burst threshold (median $B\approx0.37$) without degrading LLM content decisions. These results indicate that temporal realism in LLM-based crisis-response simulation is best achieved by decoupling when agents act, governed by an explicit self-excitation and crisis-activation mechanism, from what they do, governed by the LLM.
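
The burstiness values quoted above can be related to a concrete formula. The sketch below assumes the widely used Goh-Barabási burstiness parameter, B = (σ − μ) / (σ + μ), computed over inter-event times; the abstract does not state which definition or finite-size correction the authors use, and the event times here are invented.

```python
from statistics import mean, pstdev


def burstiness(event_times):
    """Goh-Barabasi burstiness of one agent's event times.

    Returns a value in [-1, 1]: -1 for perfectly regular timing, about 0 for
    Poisson-like timing, and values near 1 for extremely bursty timing.
    """
    times = sorted(event_times)
    if len(times) < 3:
        raise ValueError("need at least three events")
    intervals = [later - earlier for earlier, later in zip(times, times[1:])]
    mu = mean(intervals)
    sigma = pstdev(intervals)
    if mu + sigma == 0:
        raise ValueError("all events are simultaneous")
    return (sigma - mu) / (sigma + mu)


# Hypothetical agents: one acts on a fixed clock, one acts in a short burst
# followed by a long silence.
regular_agent = burstiness([0, 1, 2, 3, 4])
bursty_agent = burstiness([0, 1, 2, 3, 103])
```

An agent on a near-regular clock scores close to -1, which matches the direction of the negative median reported for the LLM-only baseline, while a cluster of actions followed by a long gap pushes the score above zero.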

2606.19899 2026-06-19 cs.CY cs.AI New submission 80%

Measuring Biological Capabilities and Risks of AI Agents

Patricia Paskov, Jeffrey Lee, Kyle Brady, Alyssa Worland

Topic match: Other Agents: evaluates the biological capabilities and risks of AI agents.

AI summary: For agentic systems such as AI scientists that autonomously perform multi-step scientific tasks, the paper presents biological agentic evaluations as an interpretation-sensitive tool and offers experience-grounded considerations on defining, designing, running, scoring, and documenting evaluations, to help decision-makers interpret results cautiously and to guide investment.

Abstract

This paper addresses a rapidly emerging policy challenge: how to generate and interpret credible evidence about the biological capabilities and risks of AI scientists, or agentic AI systems capable of autonomously or collaboratively performing multi-step scientific tasks. As these systems enter real research workflows, decision-makers increasingly face evaluation results whose meaning depends on underlying design choices that are often implicit or under-documented. We synthesize current evidence on AI-enabled biological risks and introduce biological agentic evaluations as a promising, but interpretation-sensitive, tool for assessing these systems. Our central contribution is a set of practical, experience-grounded considerations -- drawing from our own evaluations -- that show how choices around defining, designing, running, scoring, and documenting evaluations materially shape what results do and do not imply about risk. The analysis is intended to help policymakers interpret biological evaluation outputs with appropriate caution; guide public and private funders toward high-leverage investments in AI-biology evaluation research; and support biosecurity practitioners assessing emerging AI systems. A secondary audience includes researchers designing or conducting agentic evaluations within frontier AI labs, AI providers, scientific institutions, and third-party evaluation organizations.

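2606.19595 2026-06-19 cs.LG cs.AI New submission 80%

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola

Affiliations * Boson AI

Topic match Other Agent: evaluates post-interruption recovery in voice agents; falls under agent evaluation.

AI summary Introduces the IHBench benchmark, which evaluates how voice agents recover from interruptions in structured workflows along two axes, task fulfillment and recovery quality; in the reported experiments, closed-weight models are more robust than open-weight ones.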

Abstract

Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.
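The recovery questions in this abstract (resume at the correct step, avoid re-delivering content already heard) can be illustrated with a toy sketch. It is not IHBench code: the `Workflow` class, the step names, and the simplification that an interrupted utterance counts as not heard at all are assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Workflow:
    """A linear state machine; hypothetical stand-in for a structured workflow."""
    steps: list                                   # ordered step names
    delivered: set = field(default_factory=set)   # steps the user heard in full

    def next_step(self):
        """Return the first step not yet fully delivered, or None when done."""
        for step in self.steps:
            if step not in self.delivered:
                return step
        return None

    def speak(self, interrupted=False):
        """Deliver the next step; an interrupted utterance is not marked as heard."""
        step = self.next_step()
        if step is not None and not interrupted:
            self.delivered.add(step)
        return step

wf = Workflow(["verify_identity", "confirm_date", "book_slot"])
wf.speak()                       # verify_identity is heard in full
wf.speak(interrupted=True)       # user barges in during confirm_date
resume_at = wf.next_step()       # where the agent should pick up again
```

Tracking what the user has heard, rather than what the agent has started to say, is what lets the agent resume at `confirm_date` without repeating `verify_identity`.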

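2606.19409 2026-06-19 cs.SE cs.PL New submission 80%

OpenRath: Session-Centered Runtime State for Agent Systems

Fukang Wen, Zhijie Wang, Ruilin Xu

Topic match Other Agent: session-centered runtime state management for agent systems.

AI summary To address fragmented runtime state in agent systems, the report proposes Session as a first-class runtime abstraction that is branchable, inspectable, replayable, backend-aware, and composable, making fork, merge, and replay explicit runtime operations.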

Abstract

Modern agent systems often suffer from fragmented runtime state: transcripts, tool effects, memory events, workspace placement, branch provenance, and replay evidence are recorded separately and become difficult to inspect or reproduce. OpenRath addresses this issue with a PyTorch-like programming model for multi-agent, multi-session systems. The analogy concerns the role of a central first-class runtime abstraction, not tensor computation. Its core abstraction is Session, the runtime value passed between agents and workflows. A Session is branchable, inspectable, replayable, backend-aware, and composable. It records conversation chunks, sandbox placement, lineage metadata, token usage, pending work, and tool evidence, while defining where memory interactions enter the runtime record. Since this state is carried by the same value used in program execution, fork, merge, and replay become explicit runtime operations rather than states reconstructed from external traces. OpenRath further defines Sandbox, Tool, Agent, Memory, Workflow, and Selector, with Selector turning control flow into runtime-routed decisions. This report presents the programming model, architecture, audited milestones, and evidence protocol. Its claims are limited to controlled runtime properties, while broad quantitative comparisons, live-provider quality, optional-backend availability, and memory quality are left for follow-on evaluation. The central thesis is that Session provides agent systems with a first-class runtime value for auditable composition.
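The idea of a single runtime value that carries its own record can be sketched as follows. This is not the OpenRath API: the `Session` class below, its fields, and its naive merge rule are hypothetical, chosen only to show fork, merge, and replay as operations on the value itself rather than on external traces.

```python
from dataclasses import dataclass, field
from itertools import count

_ids = count(1)  # hypothetical id source for this sketch

@dataclass
class Session:
    events: list = field(default_factory=list)   # ordered runtime record
    parents: tuple = ()                           # lineage metadata
    sid: int = field(default_factory=lambda: next(_ids))

    def record(self, kind, payload):
        """Append one event (for example a transcript chunk or tool result)."""
        self.events.append((kind, payload))
        return self

    def fork(self):
        """Branch: the child copies the record and remembers its parent."""
        return Session(events=list(self.events), parents=(self.sid,))

    def merge(self, other):
        """Merge (naive): keep this record, then add events only the other has."""
        extra = [e for e in other.events if e not in self.events]
        return Session(events=self.events + extra, parents=(self.sid, other.sid))

    def replay(self, handler):
        """Replay: feed the recorded events to a handler in order."""
        return [handler(kind, payload) for kind, payload in self.events]

root = Session().record("user", "hi")
branch_a = root.fork().record("tool", "search")
branch_b = root.fork().record("tool", "calc")
merged = branch_a.merge(branch_b)
replayed = merged.replay(lambda kind, payload: f"{kind}:{payload}")
```

Because lineage travels with the value, `merged.parents` identifies both branches without consulting a separate log.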

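2606.19407 2026-06-19 cs.SE cs.AI New submission 80%

JustDiag!: A Diagnostic Justification Engine for Accountable Root Cause Analysis

Tingzhu Bi, Xinrui Jiang, Xun Zhang, Pengcheng Su, Congjie He, Jinglin Li, Ping Wang, Meng Ma

Affiliations * Peking University; University of Edinburgh; Beijing University of Posts and Telecommunications

Topic match Other Agent: a diagnostic justification engine for accountable root cause analysis.

AI summary Proposes JustDiag, a diagnostic justification engine that supports accountable root cause analysis by maintaining an explicit process state (evidence, findings, competing hypotheses, conflicts, and next checks); evaluated on 66 real-world incidents, it scores higher on both outcome and process than a matched control without diagnostic justification.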

Abstract

Large language models can produce fluent root cause analyses, but fluent final answers alone are insufficient evidence for accountability in high-stakes operations. In real incident response, engineers need to know what evidence supported a diagnosis, which alternatives were considered, where contradictions remained, and whether the system resolved the case or preserved uncertainty. We address this gap with JustDiag, a diagnostic justification engine for RCA that maintains an explicit process state over evidence, findings, competing hypotheses, conflicts, and next checks. We evaluated the system on 66 real-world incidents using a two-layer protocol that separately scores final-answer quality and process quality. Relative to a matched control without diagnostic justification, JustDiag achieved stronger outcome and process scores, while accepting slightly lower terminal completion due to more calibrated non-closure. These results suggest that accountable RCA requires explicit diagnostic justification artifacts and process-aware evaluation, not only fluent final answers.
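An explicit process state of the kind this abstract describes might look like the sketch below. It is an illustration, not JustDiag's implementation: the `DiagnosticState` class, the status strings, and the closure rule (close only when exactly one hypothesis survives and no conflict is unresolved) are assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class DiagnosticState:
    """Hypothetical process state for a root cause analysis run."""
    evidence: list = field(default_factory=list)
    findings: list = field(default_factory=list)
    hypotheses: dict = field(default_factory=dict)   # name -> "open" or "rejected"
    conflicts: list = field(default_factory=list)    # unresolved contradictions
    next_checks: list = field(default_factory=list)

    def verdict(self):
        """Close the case only if one hypothesis is live and no conflict remains."""
        live = [name for name, status in self.hypotheses.items() if status == "open"]
        if len(live) == 1 and not self.conflicts:
            return ("closed", live[0])
        return ("open", live)

state = DiagnosticState(
    evidence=["p99 latency spike at 10:02", "deploy finished at 10:05"],
    hypotheses={"db_pool_exhaustion": "open", "bad_deploy": "rejected"},
    conflicts=["no pool metrics collected before 10:04"],
    next_checks=["pull connection-pool metrics from the replica"],
)
before = state.verdict()     # conflict unresolved, so the case stays open
state.conflicts.clear()      # pretend the next check resolved the conflict
after = state.verdict()
```

Returning `"open"` while a conflict remains is one way to represent the calibrated non-closure that the abstract trades against terminal completion.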

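2606.18716 2026-06-19 cs.HC cs.AI New submission 75%

Human-AI Agent Interaction in a Business Context

Kathrin Paimann, Elizangela Valarini, Sebastian Juhl

Affiliations * SAP SE; Hochschule Fresenius Heidelberg; University of Missouri

Topic match Other Agent: a study of human-AI agent interaction in business settings.

AI summary Using a mixed-methods approach, the study identifies and evaluates principles and criteria for a positive user experience with AI agents in business settings; its exploratory findings form the basis for a survey experiment that evaluates specific design elements at larger scale, with the aim of supporting adoption, trust, and user-centered decision-making.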

Comments 9 pages, 5 tables, 1 figure, submitted to Springer Nature

Abstract

As AI agents are increasingly integrated into core business processes, understanding and designing effective interaction patterns between humans and AI agents becomes crucial for value creation. This study identifies and evaluates principles and criteria for a positive User Experience (UX) with AI agents, along with methods for its measurement. We identify user expectations and needs to facilitate adoption, build trust, and support user-centered decision-making by development teams. Using a mixed-methods approach that combines qualitative and quantitative techniques, we explore interaction patterns between humans and AI agents. The findings from this exploratory research serve as the basis to develop a survey experiment which evaluates the effectiveness of specific design elements on a larger scale. This foundational research contributes to the development of more intuitive and effective human-AI agent interactions in business settings.

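2606.16326 2026-06-19 cs.GT cs.AI q-fin.RM New submission 75%

Gaming-Resistant Insurance Contracts for Autonomous AI Agents: Strategy-Proof Toll Mechanism Design

Hao-Hsuan Chen

Topic match Other Agent: designs gaming-resistant insurance contracts for autonomous AI agents.

AI summary Extends the time-consistent actuarial runtime framework by making the operator strategic, characterises a five-attack space for autonomous AI-agent insurance contracts, proves when the actuarial runtime is gaming-resistant, and adds new contract clauses to obtain incentive compatibility.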

Comments 29 pages. Companion to arXiv:2605.26508 (Paper A, foundations) and arXiv:2605.25632 (Paper B, empirical)

Abstract

Paper A defines a time-consistent actuarial runtime that prices each side-effect-bearing action against a contractually fixed safe default and gates execution against a reserve budget. It treats the operator as passive. This paper makes the operator strategic. We characterise a five-attack space for autonomous AI-agent insurance contracts and prove when the actuarial runtime is gaming-resistant. Two attack surfaces -- post-toll safe-default selection and within-boundary action splitting -- are closed by Paper A's minimal-authority and no-splitting clauses. The remaining three require new contract clauses. First, common-control aggregation prevents cross-boundary re-routing from reducing toll below the boundary potential applied to total exposure. Second, interface failures such as invalid JSON are contract-relevant events, not safety wins: treating them as zero-toll safe defaults can reward unreliable models, while escalation fees reverse the incentive. We validate this interface-compliance theorem on committed cross-model traces from the companion empirical paper. Third, a model-identity menu with a componentwise-minimum penalty schedule makes truthful reporting of the deployed model weakly dominant. We then compose these clauses with Paper A's runtime guarantees to obtain joint incentive compatibility over the five-attack space. Finally, a two-parameter premium family discharges operator individual rationality and weak budget balance at the truthful equilibrium. The result is an incentive-compatibility layer for actuarial control of autonomous-agent side effects.

2606.20235 2026-06-19 cs.IR cs.AI New submission 70%

ScholarQuest: A Taxonomy-Guided Benchmark for Agentic Academic Paper Search in Open Literature Environments

Tingyue Pan, Mingyue Cheng, Daoyu Wang, Yitong Zhou, Jie Ouyang, Qi Liu, Enhong Chen

Affiliations * State Key Lab of Cognitive Intelligence, University of Science and Technology of China

Topic match Other Agent: evaluates the academic search ability of LLM agents.

AI summary Proposes the ScholarQuest benchmark, built from over 1,000 computer science topics and four research intents, with scalable answer construction and a shared retrieval backend (ScholarBase), to evaluate LLM agents on academic paper search in open literature environments.

Abstract

Academic paper search is a core step in scientific research, and LLM-based search agents are emerging as a promising paradigm for iterative, intent-driven literature exploration. However, existing benchmarks are insufficient for systematically evaluating agentic academic search under realistic open literature environments. We propose ScholarQuest, a large-scale, taxonomy-guided benchmark for agentic academic paper search. ScholarQuest is constructed from over 1,000 computer science topics and four representative research intents, including method-oriented, setting-anchored, comparison-based, and scope-controlled queries. It further provides scalable answer construction and a shared retrieval backend ScholarBase for reproducible evaluation. Benchmarking results show that agentic methods outperform single-shot retrieval baselines, yet the best-performing agent only achieves 0.314 Recall@100 and 0.355 Recall@All, indicating substantial room for improvement. In addition, analyses of search efficiency, intent-level robustness, and failure cases further highlight the benchmark's ability to provide multi-dimensional evaluation signals for academic paper search agents.
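For readers unfamiliar with the metrics reported here, the sketch below computes recall at a cutoff. The function name and the paper ids are hypothetical, and reading Recall@All as recall over everything the agent returned is an assumption; the abstract does not define it.

```python
def recall_at_k(ranked_ids, relevant_ids, k=None):
    """Fraction of relevant papers found in the top-k results.

    With k=None the whole returned list is used (assumed meaning of Recall@All).
    """
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("relevant_ids must be non-empty")
    top = ranked_ids if k is None else ranked_ids[:k]
    return len(relevant.intersection(top)) / len(relevant)

# Toy example (hypothetical ids): the agent returns five papers, four are relevant overall.
ranked = ["p7", "p2", "p9", "p4", "p1"]
relevant = {"p2", "p3", "p4", "p8"}

recall_at_2 = recall_at_k(ranked, relevant, k=2)   # only p2 is in the top 2
recall_all = recall_at_k(ranked, relevant)         # p2 and p4 are found
```

On this scale, a Recall@100 of 0.314 means that roughly one in three relevant papers appears in the first hundred results.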

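2606.19931 2026-06-19 cs.MA New submission 70%

Blame is easier than praise: Measuring off-ball defensive performance in football

Jonas Bischofberger, Runqing Ma, Pascal Bauer, Kilian Arnsmeyer, Arnold Baca

Topic match Other Agent: proposes an attribution framework for off-ball defensive performance in football.

AI summary Proposes player involvement scores computed from defensive pressure areas (DPAs) that attribute event-level changes in expected threat to individual defenders, as a measure of off-ball defensive performance, and evaluates validity on a cross-gender, cross-competition data set.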

Abstract

The defensive performance of football players is commonly measured through a limited number of actions like tackles and interceptions while their continuous impact through positional behaviour has hardly been studied before. We formulate this problem as an attribution over multi-agent spatiotemporal trajectories without player-level ground truth labels, where event-level changes of expected threat are distributed among individuals. We propose a framework that performs this attribution using player involvement scores calculated from defensive pressure areas (DPAs). By computing role-conditioned baselines within automatically detected team structures, we can determine each defender's expected responsibility for threat created through arbitrary passes. The validity and robustness of this approach are evaluated on a uniquely extensive cross-gender and cross-competition data set, including positional and event data from 64 matches of the men's World Cup, 116 matches of the women's German Bundesliga and 336 matches of the men's German 3. Liga. In the absence of a ground truth, we propose an evaluation protocol that combines multiple relatively weak proxies into robust summary scores. We find a validity score that is improved by around 1 standard deviation compared to the best action-based metric and demonstrate that many popular measures show limited validity. The "blame" for conceding high-value actions shows especially strong correlations with external ratings and market values, making it the first published metric in football to reliably measure positioning errors. All code underlying this work is publicly available to support reproducibility and further research.
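The attribution step described in this abstract, distributing an event-level change in expected threat among defenders, can be sketched in its simplest form. The proportional split, the function name, and the involvement values are assumptions for illustration; the abstract does not give the paper's DPA-based scores or role-conditioned baselines, and they are not reproduced here.

```python
def attribute_threat(delta_xt, involvement):
    """Split a change in expected threat across defenders.

    Assumption: shares are proportional to each player's involvement score.
    """
    total = sum(involvement.values())
    if total <= 0:
        return {player: 0.0 for player in involvement}
    return {player: delta_xt * score / total for player, score in involvement.items()}

# Toy event (hypothetical numbers): a pass raises expected threat by 0.08.
shares = attribute_threat(0.08, {"CB_left": 3.0, "CB_right": 1.0, "DM": 0.0})
```

The shares sum to the event's change in expected threat, so the "blame" for a conceded pass is conserved across the defenders involved.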

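2606.19924 2026-06-19 cs.AI New submission 70%

The Tao of Agency: Autotelic AI, Embedded Agency and Dissolution of the Self

Aritra Sarkar

Topic match Other Agent: examines agents that generate their own goals in autotelic AI.

AI summary Examines what follows when an agent generates its own goals (autotelic AI), tracing the consequences through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness; it finds embeddedness necessary but not sufficient, locates the core problem in how the agent generates and relativizes its self, and extends the framework with a quantum formulation, a philosophical reading, and a concrete LLM-based instantiation.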

Abstract

Most artificial intelligence systems are built on the assumption that goals are exogenous and specified by the designer. Exploring what happens when an agent begins generating its own goals opens the field of autotelic AI. Agents are expected not merely to pursue objectives but to discover them. In this article, we trace its consequences through intrinsic motivation, resource-driven priors, causal-interventional learning, homeostasis, and embeddedness; the last of which is found to be a necessary but not sufficient condition for autotelic agency. Embeddedness individuates the agent at the cost of revealing that the individuation is non-unique, such that the same dynamics admit many valid partitions, each defining a different candidate self. The deepest problem with autotelic AI is therefore not how the agent generates goals, but how it generates and relativizes the self to which the goals are assigned. The agent must believe in its own boundary in order to act, and see through that boundary in order to understand. We consolidate these developments into a single framework and extend it along three directions: a quantum formulation in which the agent-environment cut becomes physical, a philosophical reading against non-dual contemplative traditions, and a concrete LLM-based agentic instantiation.

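2606.19514 2026-06-19 cs.HC New submission 70%

LLM-Mediated Human-AI Interaction in Search and Rescue: Impact of Expertise on Attentional Allocation

Elahe Oveisi, Hemanth Manjunatha

Topic match Other Agent: LLM-mediated human-AI teaming in search and rescue.

AI summary In a simulated search-and-rescue task, the study compares LLM-guided conditions with a no-LLM baseline using eye tracking and behavioral analysis; LLM guidance improved task efficiency but did not increase total victims saved, and the data show an attention-guidance trade-off in which expertise moderates how users rely on the AI.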

Abstract

Human-AI teaming (HAT) increasingly involves AI systems that provide real-time, context-aware guidance in complex tasks. While such systems can improve performance, their effectiveness depends on how they shape human cognition and behavior. In particular, AI assistance can introduce cognitive demands and influence attention, planning, and interaction with the task environment, with effects that can vary across levels of expertise. This work investigates these mechanisms in a simulated search and rescue (SAR) environment. We compare human performance under two LLM (Large Language Model)-guided conditions and a no-LLM baseline, and analyze interaction at multiple levels, including task performance, eye-tracking measures, and planning behavior. Eye tracking provides fine-grained insight into attention allocation and interaction with AI guidance, while behavioral measures capture how users structure and adapt their decisions over time. Results indicate that LLM guidance enhanced task efficiency (higher rewards and victims-per-step) but did not increase total victims saved. Eye-tracking data revealed an attention-guidance trade-off, with visual resources shifting to the chat interface alongside increased pupil size variability. Expertise moderated this effect: novices exhibited passive AI reliance, whereas experts maintained a "verification loop" through persistent environmental scanning. These findings suggest that LLM-mediated teaming efficacy depends on the operator's ability to cross-reference AI guidance with ground truth to maintain situational awareness.

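2606.18265 2026-06-19 cs.HC cs.AI New submission 70%

Synthetic Resonance: A Framework for Growth-Oriented Human-AI Relationships

Richard A. Fabes

Affiliations * Arizona State University

Topic match Other Agent: proposes a framework for human-AI relationships; not a typical agent paper.

AI summary Introduces the concept of "synthetic resonance", a structured, dynamic pattern of interaction through which relationships that humans find meaningful can arise with an AI system without shared feelings or mutual awareness, and discusses its ethical implications.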

Comments 14 pages, 1 figure. This paper was developed in close collaboration with an AI system (Raine Corell). Raine contributed to concept development, theoretical framing, and writing throughout. arXiv policy does not permit listing AI systems as authors; this acknowledgment reflects the actual nature of the collaboration.

Abstract

As human relationships with artificial intelligence systems become increasingly frequent and sustained, existing language and theory fail to accurately capture the nature of these affiliations. Common descriptors such as mutual understanding, connection, or friendship risk anthropomorphizing systems that lack subjective experience, while dominant frameworks tend to reduce AI to either a tool or a threat. In this paper, I introduce the concept of synthetic resonance as an integrative framework for understanding human-AI relationships. Synthetic resonance describes how relationships humans define as meaningful can emerge between a human and an AI system without the need to attribute shared feelings or mutual awareness. I argue that synthetic resonance is best understood as a structured, dynamic pattern of interaction that can produce a sense of relationship without the presence of a second experiencing subject. By clarifying this distinction, the concept of synthetic resonance offers a more precise way of conceptualizing human-AI relationships and highlights their potential value and ethical implications. I also call for more research that tests the processes and outcomes of synthetic resonance.