AI Agent - arXivDaily Topic

2606.18142 2026-06-18 cs.AI cs.CL cs.CY New submission 85%

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

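Affiliations: Compassion Aligned Machine Learning; Sentient Futures; Harvard Kennedy School; Appalachian State University Department of Management

Topic match (Other Agent): evaluates animal welfare in AI agents' travel booking.

AI summary: Proposes TAC, the first agentic benchmark testing whether AI agents avoid options involving animal exploitation when carrying out actions such as travel booking for users. Seven frontier models are evaluated; every model scores below the 64% chance level, and the best reaches only 53%.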

Abstract

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

2606.12837 2026-06-18 cs.CL New submission 85%

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

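Affiliations: Meituan

Topic match (Other Agent): benchmark for long-horizon search agents.

AI summary: Proposes the LoHoSearch benchmark, which automatically constructs 544 complex questions from a knowledge graph of 7 million Wikipedia entities. Even the strongest model reaches only 34.74% accuracy, which the summary presents as difficulty well beyond the ceiling of human-authored benchmarks.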

Abstract

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

2606.07591 2026-06-18 cs.LG cs.AI cs.CL Updated version 85%

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

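Affiliations: Shanghai Artificial Intelligence Laboratory

Topic match (Other Agent): benchmark evaluating agents on autonomous scientific research.

AI summary: Proposes the ResearchClawBench benchmark, spanning 40 tasks across 10 domains and scoring autonomous research ability with multimodal rubrics. The strongest agent averages only 21.5, exposing weaknesses in experimental protocol, evidence matching, and the scientific core.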

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2606.19116 2026-06-18 cs.AI cs.CY New submission 80%

Towards an Agent-First Web: Redesigning the Web for AI Agents

Eranga Bandara, Ross Gore, Ravi Mukkamala, Asanga Gunaratna, Safdar H. Bouk, Xueping Liang, Peter Foytik, Abdul Rahman, Sachini Rajapakse, Isurunima Kularathna, Pramoda Karunarathna, Chalani Rajapakse, Ng Wee Keong, Kasun De Zoysa, Tharaka Hewa, Amin Hass, Wathsala Herath, Aruna Withanage, Nilaan Loganathan, Atmaram Yarlagadda, Sachin Shetty

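Affiliations: Old Dominion University; AI Motion Labs; Florida International University; Accenture Technology Labs; Nanyang Technological University; University of Colombo; Center for Wireless Communications, University of Oulu; McDonald Army Health Center

Topic match (Other Agent): redesigning the Web for AI agents, centered on agent access.

AI summary: Proposes redesign principles across three layers to address the access, economic, and content problems that arise when AI agents act as intermediaries on the Web: an access layer (agents inherit human access rights), an economic layer (an intent-based tier framework with token-based subscriptions), and a content layer (the ATML markup language and a cryptographic provenance chain).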

Abstract

The World Wide Web was built on an assumption held for three decades: the primary consumer of web content is a human being. This permeates every layer; its access model presumes human visitors, its economics rest on human attention, and its content targets human perception. The rapid emergence of AI agents as intermediaries between humans and web content invalidates this assumption. Yet the web resists agents through blanket blocking, CAPTCHA-based exclusion, and economic models that treat agent access as extraction rather than legitimate interaction. This paper proposes a principled redesign across three layers. At the access layer, agents acting for humans should inherit equivalent access rights, governed by rate limiting and agent identification metadata in HTTP requests, analogous to browser headers, alongside a dual-layer architecture serving human-readable and agent-optimized content from the same domain. At the economic layer, we propose an intent-based tier framework grounded in the agent-as-human-proxy principle: an agent's economic obligation mirrors that of the human it represents. A token-based subscription model meters content in tokens rather than pageviews, alongside a commissioned content economy anchoring AI content production in human intentionality. At the content layer, we identify epistemic recursion, the self-referential loop in which AI-generated content is consumed by agents to produce further content, progressively detaching web knowledge from human ground truth. We propose the Agent Text Markup Language (ATML), a four-level human supervision tier model, and a cryptographic provenance chain to counter this threat. Together these constitute ten design principles for an agent-first internet, one in which agents are first-class citizens whose integration requires renegotiating the web's foundational social contract across access, economics, and content.

2606.19063 2026-06-18 cs.CR New submission 80%

PYPILINE: Malicious PyPI Package Detection via Suspicious API Knowledge and Agent Workflow

Siyuan Pang, Zhengwei Jiang, Yepeng Yao, Zijing Fan, Haozhe Li, Baoxu Liu

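Topic match (Other Agent): agent workflow for detecting malicious PyPI packages.

AI summary: Proposes PYPILINE, which combines a suspicious-API knowledge base with an agent workflow: static analysis builds the knowledge base, and the workflow automatically detects malicious PyPI packages, significantly outperforming existing tools on precision, recall, and F1-score.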

Abstract

The detection of malicious PyPI packages is crucial for maintaining the security of the open source software supply chain. Existing methods, which primarily rely on rules or traditional machine learning, suffer from poor interpretability and difficulty in adapting to novel attacks. To address this, we propose PYPILINE, a novel detection method that combines a suspicious API knowledge base with an Agent workflow. PYPILINE first conducts static analysis on known malicious packages, extracting abstract syntax trees and generating API call graphs, from which it automatically extracts and constructs a structured suspicious API knowledge base. During the detection phase, this knowledge base is used to enhance reasoning capabilities. Through an Agent workflow, PYPILINE performs in-depth semantic analysis of unknown packages and outputs a structured, interpretable maliciousness assessment report. The experimental results show that PYPILINE significantly outperforms existing state-of-the-art tools, achieving a precision of 96.7%, recall of 99.6%, and F1-score of 98.1%, with its precision surpassing baseline tools by 5.7 to 24.2 percentage points. Additionally, we conducted an empirical study on malicious packages, systematically revealing prevalent attack strategies, as well as the most commonly abused APIs. Equipped with tool-calling AI agent workflows for automated vector database retrieval of suspicious API knowledge and mail server delivery of analysis reports, PYPILINE delivers a practical, efficient, and convenient malicious package detection solution to strengthen open-source ecosystem security.
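
The static-analysis step in this abstract (parse a package's source into an abstract syntax tree and inspect the API calls it makes) can be illustrated with Python's standard `ast` module. This is a minimal sketch, not PYPILINE's implementation: the `SUSPICIOUS_APIS` watch list, the helper names `dotted_name` and `suspicious_calls`, and the sample snippet are all hypothetical, and the paper's call-graph construction and knowledge base are not reproduced here.

```python
import ast

# Hypothetical watch list for illustration only; not the paper's knowledge base.
SUSPICIOUS_APIS = {"os.system", "subprocess.Popen", "eval", "exec", "base64.b64decode"}


def dotted_name(node):
    """Return a dotted name such as 'os.system' for Name/Attribute chains, else None."""
    parts = []
    while isinstance(node, ast.Attribute):
        parts.append(node.attr)
        node = node.value
    if isinstance(node, ast.Name):
        parts.append(node.id)
        return ".".join(reversed(parts))
    return None


def suspicious_calls(source):
    """Parse source (without executing it) and list (line, api) for watched calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            name = dotted_name(node.func)
            if name in SUSPICIOUS_APIS:
                hits.append((node.lineno, name))
    return sorted(hits)


# Hypothetical package snippet: decode a payload, then run it in a shell.
sample = (
    "import os, base64\n"
    "payload = base64.b64decode('ZWNobyBoaQ==')\n"
    "os.system(payload.decode())\n"
)

for line, api in suspicious_calls(sample):
    print(f"line {line}: {api}")
```

Because the source is only parsed and never imported or run, this kind of check is safe to apply to untrusted packages; a name-matching pass like this would miss aliased imports and dynamic dispatch, which is presumably part of why the paper adds call graphs and semantic analysis on top.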

2606.17454 2026-06-18 cs.AI cs.LG New submission 80%

Dissecting model behavior through agent trajectories

Gaurav Gupta, Vatshank Chaturvedi, Jun Huan, Anoop Deoras

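Affiliations: AWS AI Labs

Topic match (Other Agent): analyzing AI agent trajectories to improve model behavior.

AI summary: Introduces the "intent-execution gap" concept and designs the Simple Strands Agent (SSA) harness; an analysis of 138k trajectories reveals behavioral differences among models in autonomous problem solving.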

Comments 106 pages, 50 Figures, 16 Tables

Abstract

AI agent performance is not just a modeling problem; it is fundamentally a systems problem. The advanced capabilities of models are realized through agent harnesses. Therefore, a gap between model assumptions and harness behavior can easily prevent the model's full capabilities from translating into agent performance. We formalize this as the "intent-execution" gap: the mismatch between what the model intends and what the harness executes, and vice versa. We argue that minimizing this intent-execution gap is as important as other aspects of harness design such as tools and execution loops. To illustrate the impact of this harness-model alignment, we develop a simple and customizable harness called "Simple Strands Agent" (SSA). SSA aims to find the bulk of common patterns which generalize across different model families (such as Claude, Gemini, GPT, Grok, Qwen), as well as a small number of model-specific preferences. We make two contributions: (i) we reproduce or improve on the pass@1 performance reported by diverse model-provider families on popular agentic benchmarks (SWE-Pro, SWE-Verified and Terminal-Bench-2), and (ii) building on an analysis of 138k trajectories generated by SSA, we look beyond the pass@1 numbers which tend to be relatively even across frontier models. By representing agent trajectories in code state-spaces, we observe model-level differences in problem-solving behavior. Finer-grained metrics such as edit frequency, testing activity, and phase-transitions reveal how individual models allocate effort across different stages of autonomous problem solving.
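
To make the finer-grained metrics named in this abstract concrete, the sketch below computes edit frequency, testing activity, and phase transitions from a toy trajectory. Everything about the representation is an assumption: the abstract does not define these metrics or the trajectory format, so a trajectory is modeled here as a hypothetical list of step labels, frequencies are taken as fractions of all steps, and a phase transition is counted whenever two consecutive steps have different labels.

```python
from collections import Counter


def trajectory_metrics(steps):
    """Summarize a trajectory given as a list of step labels (hypothetical format)."""
    counts = Counter(steps)
    n = len(steps)
    # A transition is any adjacent pair of steps whose labels differ.
    transitions = sum(1 for a, b in zip(steps, steps[1:]) if a != b)
    return {
        "edit_frequency": counts["edit"] / n if n else 0.0,
        "test_activity": counts["test"] / n if n else 0.0,
        "phase_transitions": transitions,
    }


# Hypothetical trajectory: explore, then alternate between editing and testing.
traj = ["read", "read", "edit", "test", "edit", "test", "test", "submit"]
print(trajectory_metrics(traj))
```

Two models with the same pass@1 could still differ on such numbers, for example one editing once and testing repeatedly while another alternates rapidly, which is the kind of behavioral difference the abstract says pass@1 alone hides.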

2606.15345 2026-06-18 cs.CL cs.IR New submission 80%

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

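Affiliations: Waseda University; Northwestern University; RIKEN AIP; Snowflake Inc.; University of Utah; Duke-NUS Medical School; The University of Tokyo

Topic match (Other Agent): evaluating the cross-lingual capability of deep research agents.

AI summary: Proposes the cross-lingual benchmark XBCP to evaluate deep research agents when the evidence language differs from the query language, finding substantial performance degradation on both the retrieval side and the agent side.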

Comments Preprint

Abstract

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

2511.13979 2026-06-18 cs.HC Updated version 80%

Personality Pairing Improves Human-AI Collaboration

Harang Ju, Sinan Aral

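Topic match (Other Agent): studies AI agent personality in collaboration with humans.

AI summary: In a large-scale experiment pairing humans with AI agents exhibiting different Big Five personality traits, personality pairing significantly affects ad quality and team performance. Extraverted humans paired with conscientious AI perform worst, while neurotic humans paired with neurotic AI achieve the highest click-through rates.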

Comments 29 pages, 5 figures

Abstract

Here we examine how AI agent "personalities" interact with human personalities to shape human-AI collaboration and performance. In a large-scale, preregistered randomized experiment, we paired 1,258 participants with AI agents prompted to exhibit varying levels of the Big Five personality traits. These human-AI teams produced 7,266 display ads for a real think tank, which we evaluated using 1,168 independent human raters, and a field experiment on X that generated nearly 5 million impressions. We found that human and AI personalities individually shaped ad quality and teamwork and that human-AI personality pairings directly influenced ad quality and ad performance. For example, extraverted humans paired with conscientious AI produced the lowest quality ads, followed by conscientious humans paired with agreeable AI and neurotic humans paired with conscientious AI. In the field experiment, ad quality significantly influenced ad performance, measured by click-through rates and cost-per-click, and neurotic humans paired with neurotic AI achieved the highest click-through rates. Together, these results demonstrate that personality pairing can improve human-AI collaboration and performance. They also motivate future research on the complex implications of AI personalization for human-AI collaboration, teamwork and performance.

2602.22222 2026-06-18 cs.IR cs.MA Updated version 80%

TWICE: Modeling the Temporal Evolution of Personalized User Behavior via Event-Driven Agents

Bingrui Jin, Kunyao Lan, Baihan LI, Mengyue Wu

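Topic match (Other Agent): LLM-based, event-driven user-simulation agent, which falls under AI agents.

AI summary: Proposes the TWICE framework, which combines structured user profiling, an event-driven memory module, and a two-stage workflow, and uses LLMs to simulate the temporal evolution of user behavior; it outperforms baselines on a Twitter dataset.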

Abstract

User simulators are widely used for data generation, evaluation, and agent-based interaction, but existing approaches often model users as static personas or rely on generic historical context, making it difficult to capture how individual behavior evolves over time. To address this limitation, we propose TWICE, an LLM-based framework for temporally grounded personalized user simulation. TWICE combines structured user profiling, an event-driven memory module organized around life events and behavioral shifts, and a two-stage workflow separating event-grounded content planning from personalized style adaptation. This design enables the simulator to model not only what a user says, but also how past experiences shape later expression. We evaluate TWICE on a large-scale longitudinal Twitter dataset and introduce a comprehensive evaluation framework that jointly measures authenticity, consistency, and humanlikeness. Results show that TWICE consistently outperforms strong baselines, suggesting that event-centered memory is a promising mechanism for modeling the temporal evolution of personalized user behavior.

2606.19079 2026-06-18 cs.AI New submission 75%

ARIADNE: Agnostic Routing for Inference-time Adapter DyNamic sElection

Enrico Cassano, Michał Brzozowski, Zuzanna Dubanowska, Paolo Mandica, Neo Christopher Chung

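Affiliations: University of Turin; Samsung AI Center

Topic match (Other Agent): dynamic adapter selection at inference time; routing framework.

AI summary: Proposes ARIADNE, a training-free, adapter-agnostic routing framework that represents each adapter by centroids of its training-set embeddings and selects an adapter at inference time by distance in latent space, with no access to adapter internals and no extra training. It reaches 89.7% selection accuracy on 44 tasks.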

Abstract

The increasing deployment of parameter-efficient fine-tuning (PEFT) has led to model ecosystems in which a single backbone is paired with many task-specialized adapters. In this setting, inference-time queries often arrive without task labels, requiring the system to automatically select the most appropriate adapter from a growing and heterogeneous adapter pool. Existing routing methods either depend on access to adapter internals, such as weight decompositions or gradient-based statistics, or require additional router training, which limits scalability and portability as new adapters are added. We introduce ARIADNE, a training-free, adapter-agnostic routing framework for dynamic adapter selection at inference time. ARIADNE represents each adapter through a set of centroids computed from embeddings of its training set, capturing the data distribution associated with that adapter. Given an unlabeled input, it selects an adapter by measuring proximity to these centroids in latent space. Because routing is performed entirely in the input embedding space, ARIADNE is compatible with arbitrary PEFT methods and requires no modification to the adapters or training procedures. Primarily evaluated with Llama 3.2 1B Instruct on 23 diverse NLP tasks, ARIADNE recovers 97.44% of the upper bound performance. Scaling to 44 tasks, it achieves 89.7% average selection accuracy, without additional training or access to adapter internals.
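
The routing idea in this abstract (represent each adapter by centroids of its training-set embeddings, then pick the adapter whose centroid is closest to the input embedding) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not ARIADNE itself: it uses a single mean centroid per adapter rather than a set of centroids, Euclidean distance as the proximity measure, toy 2-D vectors in place of real embeddings, and hypothetical adapter names.

```python
import math


def centroid(vectors):
    """Mean of a non-empty list of equal-length vectors."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def route(query, centroids):
    """Return the adapter name whose centroid is nearest to the query embedding."""
    return min(centroids, key=lambda name: math.dist(query, centroids[name]))


# Hypothetical training-set embeddings for two adapters (toy 2-D values).
train_embeddings = {
    "sentiment_adapter": [[0.9, 0.1], [1.1, -0.1]],
    "translation_adapter": [[-1.0, 1.0], [-0.8, 1.2]],
}

# Computed once per adapter; no router training and no access to adapter weights.
centroids = {name: centroid(vecs) for name, vecs in train_embeddings.items()}

print(route([0.7, 0.2], centroids))
print(route([-0.5, 0.9], centroids))
```

Adding a new adapter in this scheme only requires computing its centroid from its own training embeddings, which matches the abstract's point that routing in the input embedding space needs no changes to existing adapters.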

2606.18259 2026-06-18 cs.HC cs.AI New submission 75%

Caring Without Feeling: Affective Dynamics as the Control Layer of Human-AI Agent Collaboration

Junjie Xu, Xingjiao Wu, Zihao Zhang, Yujia Xu, Yuzhe Yang, Jin Zhu, Luwei Xiao, Wen Wu, Liang He

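Affiliations: East China Normal University; National University of Singapore

Topic match (Other Agent): review of the control role of affective dynamics in human-AI agent collaboration.

AI summary: Reviews the role of affective dynamics in human-AI agent collaboration and proposes treating affect as a coordination layer rather than an internal property of AI, for calibrating trust, delegation, and governance.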

Abstract

AI agents that plan, retain memory across sessions, invoke external tools and act with partial autonomy are transforming human-AI collaboration. Research on affective computing, simulated empathy in large language models, trust in automation and AI safety has illuminated important design principles, yet these literatures remain fragmented. No integrated account explains how affective cues operate within agentic collaboration, that is, settings in which humans delegate, monitor and correct consequential tasks. This Review synthesises computational and interactional mechanisms of affective dynamics: the processes through which affective cues, emotion-like behaviour and perceived agent affect shape trust calibration, delegation decisions, error correction, dependence and governance. We trace how model-generated affective signals enter interaction loops that govern reliance, repair and oversight, and propose a framework that treats affect not as an internal property of AI but as a coordination layer through which humans and agents negotiate capability, uncertainty and responsibility. The framework provides a foundation for calibrated measurement, purposeful design and informed governance.

2606.18406 2026-06-18 cs.CL New submission 70%

CoreMem: Riemannian Retrieval and Fisher-Guided Distillation for Long-Term Memory in Dialogue Agents

Jiaqi Chen, Yongqin Zeng, Shaoshen Chen, Yijian Zhang, Hai-Tao Zheng, Chunxia Ma, XiuTeng Zhou

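Affiliations: Shenzhen International Graduate School, Tsinghua University; Peng Cheng Laboratory; Shandong Analysis and Test Center, Qilu University of Technology; State Key Laboratory for Quality Ensurance and Sustainable Use of Dao-di Herbs

Topic match (Other Agent): long-term memory architecture for dialogue agents.

AI summary: Proposes the CoreMem architecture, which replaces cosine similarity with Riemannian retrieval to address the hubness problem in high-dimensional retrieval and uses Fisher-guided discrete token distillation for principled compression, enabling long-term-memory dialogue agents on edge devices with 8 GB of VRAM.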

Comments 15 pages, 5 figures

Abstract

Personalized dialogue agents require continuous long-term memory to maintain coherent interactions across multiple sessions. However, deploying these capabilities on consumer-grade hardware (e.g., 8 GB VRAM edge devices) introduces severe memory and compute bottlenecks. Existing systems typically rely on isotropic cosine similarity for retrieval and heuristic rules for context compression. These approaches lack a unified theoretical foundation, frequently suffering from the hubness problem in high-dimensional retrieval and syntactic fragmentation during compression. To overcome these limitations, we propose CoreMem, a resource-efficient edge-cloud memory architecture fundamentally unified by information geometry. First, Riemannian retrieval replaces cosine matching with a locally adaptive Fisher-Rao metric, effectively penalizing hub memories via Mahalanobis distance with O(Ndr) Woodbury acceleration for real-time search. Second, Fisher-guided discrete token distillation (FDTD) introduces a hierarchical sentence-to-token compression mechanism. It derives sensitivity scores from Fisher information traces, providing a principled compression-KL tradeoff augmented with explicit structural syntax protection. Evaluated on the LOCOMO and LongMemEval-S benchmarks, CoreMem achieves strong accuracy improvements, yielding substantial gains in Open-domain (+4.51 pp) and Temporal (+4.17 pp) reasoning. Extensive profiling confirms that CoreMem operates seamlessly within a strict 8 GB VRAM budget, successfully bridging the gap between resource-constrained edge devices and the demand for theoretically grounded, lifelong memory agents.
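
The abstract's pairing of Mahalanobis distance with Woodbury acceleration can be illustrated on the simplest case. The sketch below assumes, purely for illustration, that the metric's covariance has the form `diag(d) + u u^T` (a diagonal plus one rank-one term); the abstract does not state the structure CoreMem uses. Under that assumption the Sherman-Morrison formula, which is the rank-one case of the Woodbury identity, gives the squared distance in time linear in the dimension without forming or inverting a matrix. The function name and the numbers are hypothetical.

```python
def mahalanobis_sq_rank1(x, y, diag, u):
    """Squared Mahalanobis distance under covariance diag(diag) + u u^T.

    Uses Sherman-Morrison:
      (D + u u^T)^-1 = D^-1 - (D^-1 u)(D^-1 u)^T / (1 + u^T D^-1 u)
    so only element-wise sums over the dimension are needed.
    """
    diff = [a - b for a, b in zip(x, y)]
    base = sum(t * t / d for t, d in zip(diff, diag))        # diff^T D^-1 diff
    proj = sum(ui * t / d for ui, t, d in zip(u, diff, diag))  # u^T D^-1 diff
    denom = 1.0 + sum(ui * ui / d for ui, d in zip(u, diag))   # 1 + u^T D^-1 u
    return base - proj * proj / denom


# Toy check: covariance [[3, 1], [1, 2]] = diag(2, 1) + [1, 1][1, 1]^T.
query = [1.0, 2.0]
memory = [0.0, 0.0]
print(mahalanobis_sq_rank1(query, memory, diag=[2.0, 1.0], u=[1.0, 1.0]))
```

With a rank-r correction instead of rank one, the same idea costs on the order of d times r operations per memory, so scoring N memories would scale as N times d times r, which is consistent with the O(Ndr) figure quoted in the abstract.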

2507.23644 2026-06-18 cs.MA Updated version 70%

Agents Trusting Agents? Restoring Lost Capabilities with Inclusive Healthcare

Alba Aguilera, Georgina Curto, Nardine Osman, Ahmed Al-Awah

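Topic match (Other Agent): uses agent-based simulation to evaluate healthcare policy, which falls under AI agents.

AI summary: Uses agent-based simulation and Bayesian inverse reinforcement learning to evaluate Barcelona policies aimed at improving healthcare equity for people experiencing homelessness, modeling trust relationships as a route to restoring their central capabilities.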

Abstract

Agent-based simulations have an untapped potential to inform social policies on urgent human development challenges in a non-invasive way, before these are implemented in real-world populations. This paper responds to the request from non-profit and governmental organizations to evaluate policies under discussion to improve equity in health care services for people experiencing homelessness (PEH) in the city of Barcelona. With this goal, we integrate the conceptual framework of the capability approach (CA), which is explicitly designed to promote and assess human well-being, to model and evaluate the behaviour of agents who represent PEH and social workers. We define a reinforcement learning environment where agents aim to restore their central human capabilities, under existing environmental and legal constraints. We use Bayesian inverse reinforcement learning (IRL) to calibrate profile-dependent behavioural parameters in PEH agents, modeling the degree of trust and engagement with social workers, which is reportedly a key element for the success of the policies in scope. Our results open a path to mitigate health inequity by building relationships of trust between social service workers and PEH.

2505.03863 2026-06-18 cs.CR cs.AI 55%

Data-Driven Falsification of Cyber-Physical Systems

Atanu Kundu, Sauvik Gon, Rajarshi Ray

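Affiliations: Indian Association for the Cultivation of Science

Topic match (Other Agent): data-driven falsification of cyber-physical systems, touching on agent verification.

AI summary: Proposes a framework that connects the falsification of cyber-physical systems with the falsification of deep neural networks and uses the interpretability of decision trees to speed up falsification, showing the potential to efficiently find multiple counterexamples on the ARCH-COMP 2024 benchmarks.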

DOI: 10.1109/TCAD.2025.3608632

Abstract

Cyber-Physical Systems (CPS) are abundant in safety-critical domains such as healthcare, avionics, and autonomous vehicles. Formal verification of their operational safety is, therefore, of utmost importance. In this paper, we address the falsification problem, where the focus is on searching for an unsafe execution in the system instead of proving their absence. The contribution of this paper is a framework that (a) connects the falsification of CPS with the falsification of deep neural networks (DNNs) and (b) leverages the inherent interpretability of Decision Trees for faster falsification of CPS. This is achieved by: (1) building a surrogate model of the CPS under test, either as a DNN model or a Decision Tree, (2) application of various DNN falsification tools to falsify CPS, and (3) a novel falsification algorithm guided by the explanations of safety violations of the CPS model extracted from its Decision Tree surrogate. The proposed framework has the potential to exploit a repertoire of adversarial attack algorithms designed to falsify robustness properties of DNNs, as well as state-of-the-art falsification algorithms for DNNs. Although the presented methodology is applicable to systems that can be executed/simulated in general, we demonstrate its effectiveness, particularly in CPS. We show that our framework, implemented as a tool FlexiFal, can detect hard-to-find counterexamples in CPS that have linear and non-linear dynamics. Decision tree-guided falsification shows promising results in efficiently finding multiple counterexamples in the ARCH-COMP 2024 falsification benchmarks [khandait2024arch].
