arXivDaily: arXiv daily academic digest, updated Monday through Friday

AI Large Models

AI Agent

Agents, tool calling, planning, workflows, multi-agent systems, and autonomous task execution.

Entries for the current date: 4. Signal sources: cs.AI, cs.CL, cs.LG, cs.SE
2606.07591 2026-06-18 cs.LG cs.AI cs.CL Version update 85%

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

Affiliation: Shanghai Artificial Intelligence Laboratory (上海人工智能实验室)

Topic match (Other Agent): a benchmark for autonomous scientific research that evaluates agents

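AI summary: Proposes ResearchClawBench, a benchmark of 40 tasks across 10 domains that evaluates autonomous research capability with multimodal rubrics. The strongest agent scores only 21.5, exposing weaknesses in current systems around experimental protocol, evidence matching, and the scientific core.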

Abstract

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.
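
The weighted-criteria idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `Criterion` structure, the per-criterion scores in [0, 1], the example rubric items, and the normalization to a 0-100 scale are hypothetical and are not taken from ResearchClawBench.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Criterion:
    """One rubric item (hypothetical structure, not the benchmark's schema)."""
    name: str
    weight: float  # relative importance assigned by the rubric author
    score: float   # grader's judgment in [0, 1]


def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted average of criterion scores, scaled to 0-100 (assumed scale)."""
    total_weight = sum(c.weight for c in criteria)
    if total_weight <= 0:
        raise ValueError("rubric needs a positive total weight")
    for c in criteria:
        if not 0.0 <= c.score <= 1.0:
            raise ValueError(f"score for {c.name!r} must be in [0, 1]")
    weighted = sum(c.weight * c.score for c in criteria)
    return 100.0 * weighted / total_weight


# Made-up rubric for a single task; the values are illustrative only.
example = [
    Criterion("reproduces main result", weight=3.0, score=0.5),
    Criterion("uses matching experimental protocol", weight=2.0, score=0.0),
    Criterion("figures support the claims", weight=1.0, score=1.0),
]
print(f"{rubric_score(example):.1f}")
```

In this sketch a heavily weighted criterion that is only half satisfied pulls the total down more than a lightly weighted one that is fully satisfied, which is the general effect of decomposing a target into weighted criteria.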

2511.13979 2026-06-18 cs.HC Version update 80%

Personality Pairing Improves Human-AI Collaboration

Harang Ju, Sinan Aral

Topic match (Other Agent): studies AI agent personality and collaboration with humans

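AI summary: In a large-scale experiment pairing humans with AI agents that exhibit different Big Five personality traits, personality pairing significantly affected ad quality and team performance. Extraverted humans paired with conscientious AI produced the lowest-quality ads, while neurotic humans paired with neurotic AI achieved the highest click-through rates.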

Comments: 29 pages, 5 figures

Abstract

Here we examine how AI agent "personalities" interact with human personalities to shape human-AI collaboration and performance. In a large-scale, preregistered randomized experiment, we paired 1,258 participants with AI agents prompted to exhibit varying levels of the Big Five personality traits. These human-AI teams produced 7,266 display ads for a real think tank, which we evaluated using 1,168 independent human raters, and a field experiment on X that generated nearly 5 million impressions. We found that human and AI personalities individually shaped ad quality and teamwork and that human-AI personality pairings directly influenced ad quality and ad performance. For example, extraverted humans paired with conscientious AI produced the lowest quality ads, followed by conscientious humans paired with agreeable AI and neurotic humans paired with conscientious AI. In the field experiment, ad quality significantly influenced ad performance, measured by click-through rates and cost-per-click, and neurotic humans paired with neurotic AI achieved the highest click-through rates. Together, these results demonstrate that personality pairing can improve human-AI collaboration and performance. They also motivate future research on the complex implications of AI personalization for human-AI collaboration, teamwork and performance.
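
The two field-experiment metrics named in this abstract follow their standard definitions: click-through rate is clicks divided by impressions, and cost-per-click is spend divided by clicks. The sketch below shows the arithmetic only; the function names and the sample numbers are hypothetical and are not data from the study.

```python
def click_through_rate(clicks: int, impressions: int) -> float:
    """Fraction of impressions that resulted in a click."""
    if impressions <= 0:
        raise ValueError("impressions must be positive")
    return clicks / impressions


def cost_per_click(spend: float, clicks: int) -> float:
    """Average amount paid for each click."""
    if clicks <= 0:
        raise ValueError("clicks must be positive")
    return spend / clicks


# Illustrative values only (not from the paper).
impressions, clicks, spend = 10_000, 120, 60.0
ctr = click_through_rate(clicks, impressions)
cpc = cost_per_click(spend, clicks)
print(f"CTR = {ctr:.2%}, CPC = {cpc:.2f}")
```

A higher click-through rate at the same spend lowers cost-per-click, so the two metrics tend to move in opposite directions when comparing ads under a fixed budget.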

2602.22222 2026-06-18 cs.IR cs.MA Version update 80%

TWICE: Modeling the Temporal Evolution of Personalized User Behavior via Event-Driven Agents

Bingrui Jin, Kunyao Lan, Baihan LI, Mengyue Wu

Topic match (Other Agent): LLM-based, event-driven user simulation agents, which fall under AI Agent

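AI summary: Proposes TWICE, a framework that combines structured user profiling, an event-driven memory module, and a two-stage workflow to simulate the temporal evolution of user behavior with LLMs. It outperforms baselines on a Twitter dataset.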

Abstract

User simulators are widely used for data generation, evaluation, and agent-based interaction, but existing approaches often model users as static personas or rely on generic historical context, making it difficult to capture how individual behavior evolves over time. To address this limitation, we propose TWICE, an LLM-based framework for temporally grounded personalized user simulation. TWICE combines structured user profiling, an event-driven memory module organized around life events and behavioral shifts, and a two-stage workflow separating event-grounded content planning from personalized style adaptation. This design enables the simulator to model not only what a user says, but also how past experiences shape later expression. We evaluate TWICE on a large-scale longitudinal Twitter dataset and introduce a comprehensive evaluation framework that jointly measures authenticity, consistency, and humanlikeness. Results show that TWICE consistently outperforms strong baselines, suggesting that event-centered memory is a promising mechanism for modeling the temporal evolution of personalized user behavior.

2507.23644 2026-06-18 cs.MA Version update 70%

Agents Trusting Agents? Restoring Lost Capabilities with Inclusive Healthcare

Alba Aguilera, Georgina Curto, Nardine Osman, Ahmed Al-Awah

Topic match (Other Agent): uses agent-based simulation to evaluate healthcare policy, which falls under AI Agent

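AI summary: Uses agent-based simulation and Bayesian inverse reinforcement learning to evaluate policies for improving healthcare equity for people experiencing homelessness in Barcelona, modeling trust relationships to help restore their central capabilities.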

Abstract

Agent-based simulations have an untapped potential to inform social policies on urgent human development challenges in a non-invasive way, before these are implemented in real-world populations. This paper responds to the request from non-profit and governmental organizations to evaluate policies under discussion to improve equity in health care services for people experiencing homelessness (PEH) in the city of Barcelona. With this goal, we integrate the conceptual framework of the capability approach (CA), which is explicitly designed to promote and assess human well-being, to model and evaluate the behaviour of agents who represent PEH and social workers. We define a reinforcement learning environment where agents aim to restore their central human capabilities, under existing environmental and legal constraints. We use Bayesian inverse reinforcement learning (IRL) to calibrate profile-dependent behavioural parameters in PEH agents, modeling the degree of trust and engagement with social workers, which is reportedly a key element for the success of the policies in scope. Our results open a path to mitigate health inequity by building relationships of trust between social service workers and PEH.