Abstract
Long-term memory is the missing layer for LLM agents: they forget across sessions, and the common workaround of replaying the whole history into the prompt is expensive, slow, and, as distractors accumulate, less accurate. Most memory systems win on cost or latency but still lose to the full-context baseline on accuracy, and benchmark numbers are reported on inconsistent, non-reproducible harnesses, so the same system is reported with wildly different scores across sources. We present Engram, an open-source, dual-process memory engine built on a bi-temporal data model. A fast write path appends lossless episodes with no LLM on the critical path; an asynchronous path extracts atomic (subject, predicate, object) facts, builds a bi-temporal knowledge graph, and resolves contradictions without an LLM call per fact. Contradicted facts are invalidated, never deleted, so every fact keeps its provenance and a supersession chain. A hybrid read path fuses dense, lexical, graph, and recency/salience signals, applies a point-in-time ("as-of") filter, and assembles a compact, provenance-tagged context. On the full 500-question LongMemEval_S, graded by the official category-specific judge, Engram's lean configuration, which answers from a ~9.6k-token retrieved slice and never from the full history, scores 83.6% vs. 73.2% for full-context (+10.4 points, McNemar p < 10^-6) at ~8x fewer tokens (9.6k vs. 79k), with 0 of 500 questions errored. The gain requires the hybrid read path: facts alone lose recall, while facts plus retrieved chunks recover detail. We also contribute a neutral, in-repo evaluation harness with the official judge baked in and the full-context baseline in every table, publish the raw per-question logs, and document the measurement-integrity pitfalls (truncation, home-grown judges, full-history leaks) that silently distort memory benchmarks. Every number ships with a command to reproduce it.
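To make the "invalidate, never delete" idea concrete, the following is a minimal sketch of a bi-temporal fact store with an as-of query. It is not Engram's implementation: the names `Fact`, `FactStore`, `assert_fact`, and `as_of` are hypothetical, and the contradiction rule (a new object for the same subject and predicate supersedes the old one) is a simplifying assumption that treats every predicate as single-valued.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class Fact:
    """One atomic (subject, predicate, object) fact with two time axes (hypothetical schema)."""
    subject: str
    predicate: str
    obj: str
    valid_from: datetime                       # when the fact became true in the world
    recorded_at: datetime                      # when the store learned the fact
    source_episode: str                        # provenance: the episode it was extracted from
    invalidated_at: Optional[datetime] = None  # when the store learned it was superseded
    superseded_by: Optional[int] = None        # index of the fact that replaced it


class FactStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def assert_fact(self, subject: str, predicate: str, obj: str,
                    valid_from: datetime, recorded_at: datetime,
                    source_episode: str) -> int:
        """Append a fact; mark contradicted live facts as invalidated instead of deleting them."""
        new_id = len(self.facts)
        for old in self.facts:
            # Assumption: predicates are single-valued, so a different object is a contradiction.
            if (old.subject == subject and old.predicate == predicate
                    and old.obj != obj and old.invalidated_at is None):
                old.invalidated_at = recorded_at
                old.superseded_by = new_id
        self.facts.append(Fact(subject, predicate, obj, valid_from, recorded_at, source_episode))
        return new_id

    def as_of(self, when: datetime) -> List[Fact]:
        """Return the facts the store held as live at time `when` (record-time axis only)."""
        return [f for f in self.facts
                if f.recorded_at <= when
                and (f.invalidated_at is None or f.invalidated_at > when)]


store = FactStore()
t1, t2 = datetime(2024, 1, 1), datetime(2024, 6, 1)
paris = store.assert_fact("alice", "lives_in", "Paris", t1, t1, "ep-1")
berlin = store.assert_fact("alice", "lives_in", "Berlin", t2, t2, "ep-2")
print([f.obj for f in store.as_of(datetime(2024, 3, 1))])
print([f.obj for f in store.as_of(datetime(2024, 7, 1))])
```

Because the older fact is only marked as invalidated, a query as of March still sees "Paris", and following `superseded_by` from that fact recovers the supersession chain. No LLM call is involved in this rule-based resolution step.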
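The abstract does not say how the hybrid read path combines its dense, lexical, graph, and recency/salience signals. As one possible illustration, the sketch below uses reciprocal rank fusion (RRF), a common rank-combination technique substituted here purely as an assumption. The function name `rrf_fuse`, the constant `k = 60`, and the example rankings are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: Dict[str, List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of item ids with reciprocal rank fusion.

    Each item scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken by item id for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranked_ids in rankings.values():
        for rank, item_id in enumerate(ranked_ids, start=1):
            scores[item_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical per-signal rankings for one query (best candidate first).
rankings = {
    "dense": ["a", "b", "c"],
    "lexical": ["b", "a", "d"],
    "graph": ["b", "c"],
    "recency": ["d", "b"],
}
fused = rrf_fuse(rankings)
print(fused)
```

Item `b` ranks first because it appears near the top of all four lists. A rank-based scheme like this avoids calibrating raw scores across retrievers, although a weighted score combination would be an equally plausible design.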