Abstract
Background: Large language models are typically evaluated as standalone models, on benchmarks, or in short conversational episodes. Less is known about what happens when an agent is embedded persistently in a real academic research environment with durable memory, local files, external tools, scheduled routines, delegated roles, and explicit safety protocols.

Methods: A structured, self-observed implementation case study was conducted from January 31 to May 25, 2026. The unit of analysis was the persistent human-agent environment: the researcher, agent runtime, memory layer, tools, repositories, scheduled jobs, specialized agent roles, and governance rules. Outcomes were organized using PARE-M (Persistent Agentic Research Environment Measurement), a measurement framework covering architecture, utilization, artifact production, resource use, reproducibility, and governance.

Results: Recoverable main-agent telemetry contained 75,671 de-duplicated records across 96 active days, including 8,059 user-role and 23,710 assistant-role messages. The workspace included 502 memory-related files, 17 configured agent directories, and 57 skill files. Active system time was 579.7 hours (30-minute capped-gap estimate). Memory-derived records identified 482 output-proxy events and 889 failure, verification, correction, or protocol-proxy events. A strict May 2026 trajectory subset captured 627 model-completed events and 73.95 million recorded tokens, of which 82.9% were cache reads.

Conclusions: The workflow was cache-dominant, suggesting that persistent agentic environments may shift the economic unit from cost per token to cost per completed artifact. Future evaluations should use artifact-level denominators, reproducible parsing rules, correction taxonomies, and independent coding of governance events.
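The abstract reports active system time as a "30-minute capped-gap estimate" without spelling out the rule. A minimal sketch of one plausible reading follows: sort the record timestamps, sum the gaps between consecutive records, and count any gap longer than the cap as exactly the cap. The function name `capped_gap_hours`, the `cap_minutes` parameter, and the toy timestamps are illustrative assumptions, not the study's actual parser or data.

```python
from datetime import datetime, timedelta


def capped_gap_hours(timestamps, cap_minutes=30):
    """Estimate active time in hours from record timestamps.

    Assumed rule (not confirmed by the source): each gap between
    consecutive records counts in full when it is at most
    `cap_minutes` long, and counts as exactly `cap_minutes` otherwise.
    """
    cap = timedelta(minutes=cap_minutes)
    ordered = sorted(timestamps)
    total = timedelta()
    # Walk consecutive pairs; zero or one record yields no gaps.
    for earlier, later in zip(ordered, ordered[1:]):
        total += min(later - earlier, cap)
    return total.total_seconds() / 3600.0


# Hypothetical toy records: gaps of 10, 15, 215, and 20 minutes.
records = [
    datetime(2026, 5, 1, 9, 0),
    datetime(2026, 5, 1, 9, 10),
    datetime(2026, 5, 1, 9, 25),
    datetime(2026, 5, 1, 13, 0),
    datetime(2026, 5, 1, 13, 20),
]

# The 215-minute gap is capped at 30, so 10 + 15 + 30 + 20 = 75 minutes.
print(capped_gap_hours(records))  # 1.25
```

Under this reading the estimate is sensitive to the cap: a larger cap credits more idle time as activity, so the cap value belongs alongside any reported hours figure, as the abstract does.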