arXivDaily arXiv每日学术速递 周一至周五更新

AI 大模型

RAG / 检索增强生成

检索增强生成、向量检索、知识库问答和面向大模型的搜索系统。

今日/当前日期收录 1 信号源:cs.IR, cs.CL, cs.AI, cs.DB
2606.13681 2026-06-18 cs.CL 新提交 70%

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

EvoArena: 追踪记忆演化以构建动态环境中的鲁棒LLM智能体

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

发表机构 * National University of Singapore(新加坡国立大学) Singapore Management University(新加坡管理大学) University of Washington(华盛顿大学) University College London(伦敦大学学院) University of Pennsylvania(宾夕法尼亚大学) Nanyang Technological University(南洋理工大学) Recursive Massachusetts Institute of Technology(麻省理工学院)

专题命中 其他RAG :基于补丁的记忆范式用于环境演化推理

AI总结 提出EvoArena基准套件模拟终端、软件和社交领域的渐进环境变化,并设计基于补丁的记忆范式EvoMem记录结构化更新历史,使智能体能通过记忆变化推理环境演化,实验表明当前智能体在动态环境中表现不佳,EvoMem可稳定提升性能。

详情
AI中文摘要

大型语言模型(LLM)智能体在广泛基准测试中取得了强劲性能,但大多数评估假设静态环境。相比之下,实际部署本质上是动态的,要求智能体持续将其知识、技能和行为与不断变化的环境及更新的任务条件对齐。为弥补这一差距,我们引入了EvoArena,一个基准套件,将环境变化建模为终端、软件和社交领域的渐进更新序列。我们进一步提出EvoMem,一种基于补丁的记忆范式,将记忆演化记录为结构化的更新历史,使智能体能够通过记忆中的变化推理环境演化。实验表明,当前智能体在EvoArena上表现不佳,在演化的终端、软件和社交偏好领域平均准确率仅为39.6%。EvoMem持续提升性能,在EvoArena上平均提升1.5%,并在GAIA和LoCoMo等标准基准上分别提升6.1%和4.8%。除单个任务外,EvoMem在EvoArena上还将链级准确率提升3.7%,其中成功需要完成一系列连续的相关演化子任务。机制分析表明,EvoMem改善了记忆中的证据捕获,表明更完整地保留了演化的环境状态。我们的结果强调了在评估和记忆中对演化进行建模对于可靠智能体部署的重要性。

英文摘要

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.