RAG / Retrieval-Augmented Generation - arXivDaily Topic

2606.19396 2026-06-19 q-bio.QM New submission 90%

BioHarness: Substrate-Aware Evidence Assembly for Biomedical Question Answering across Literature, Knowledge Bases, and Biological Atlases

Meng Xiao, Chuan Qin, Jinmiao Chen, Yihang Cheng, Yuanchun Zhou, Hengshu Zhu

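Topic match (knowledge-base QA): evidence assembly across literature, knowledge bases, and biological atlases for biomedical question answering.

AI summary: Proposes BioHarness, which uses a cascade control mechanism to selectively assemble evidence across literature retrieval, knowledge bases, and biological atlases. On 19,302 biomedical QA items, it raises the pooled score from 65.9 (strongest non-oracle baseline) to 71.0.

Comments: 14 pages, 11 figures. Keywords: biomedical question answering; retrieval-augmented generation; large language models; evidence assembly; biomedical knowledge bases; biological atlases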

Abstract

Motivation: Biomedical question answering often requires evidence beyond topically retrieved literature, including gene alias resolution, database identifier normalization, and atlas-derived biological measurements. However, existing retrieval-augmented generation (RAG) systems typically follow a fixed workflow and lack an explicit mechanism for deciding when retrieved text is sufficient, when curated biomedical knowledge is required, or when executable evidence assembly over structured measurements should be invoked. This motivates a substrate-aware large language model (LLM) harness that selectively assembles sufficient evidence across literature, knowledge bases, and biological atlases. Results: We introduce BioHarness, an LLM harness for staged biomedical evidence assembly across literature retrieval, curated biomedical knowledge resources, and atlas-derived structured measurements. BioHarness first attempts to answer from reranked literature evidence and escalates through grounded cascade control to REPL-style evidence assembly only when the current evidence is uncertain, weakly grounded, or substrate-mismatched. Across 19,302 biomedical QA items spanning seven answer formats, BioHarness improves the pooled score from 65.9 to 71.0 over the strongest non-oracle baseline. Ablations, case studies, and backbone-scaling analyses show that these gains arise from repairing evidence-substrate mismatches through reranking, entity grounding, and structured measurement access, rather than from indiscriminately invoking more reasoning steps, retrieving additional literature, or relying on a particular answer-model scale.
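The escalation rule described in the abstract (answer from literature first, escalate only when the evidence is uncertain, weakly grounded, or substrate-mismatched) can be illustrated with a minimal sketch. This is not the authors' implementation: `StageResult`, `needs_escalation`, `cascade`, the 0.7 confidence threshold, and the two toy stages are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StageResult:
    answer: str
    confidence: float   # hypothetical confidence score in [0, 1]
    grounded: bool      # hypothetical flag: answer is supported by the evidence
    substrate_ok: bool  # hypothetical flag: evidence type fits the question


def needs_escalation(result: StageResult, min_confidence: float = 0.7) -> bool:
    """Escalate when evidence is uncertain, weakly grounded, or mismatched."""
    return (
        result.confidence < min_confidence
        or not result.grounded
        or not result.substrate_ok
    )


def cascade(question: str, stages: List[Callable[[str], StageResult]]) -> StageResult:
    """Run stages in order; stop at the first result that needs no escalation."""
    if not stages:
        raise ValueError("at least one stage is required")
    result = stages[0](question)
    for stage in stages[1:]:
        if not needs_escalation(result):
            break
        result = stage(question)
    return result


# Toy stages standing in for literature answering and REPL-style assembly.
def literature_stage(question: str) -> StageResult:
    # Pretend retrieved text cannot supply a structured atlas measurement.
    return StageResult("unknown", confidence=0.4, grounded=False, substrate_ok=False)


def repl_stage(question: str) -> StageResult:
    return StageResult("placeholder measurement", confidence=0.9,
                       grounded=True, substrate_ok=True)


final = cascade("What is the expression of gene X in tissue Y?",
                [literature_stage, repl_stage])
print(final.answer)
```

The point of the sketch is that later, more expensive stages run only when an earlier result fails one of the three checks, rather than on every question.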

2606.20359 2026-06-19 cs.LG New submission 90%

Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act

Ali Asaria, Tony Salomone, Deep Gandhi

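Affiliation: Transformer Lab

Topic match (knowledge-base QA): SFT+RAG hybrid model for statutory citation.

AI summary: Studies how self-represented tenants, landlords, and help-desk staff can be given correct statutory citations. A four-arm experiment compares fine-tuning, retrieval, and a hybrid of the two, and finds that the SFT+RAG hybrid scores highest on exact match with zero hallucinated citations.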

Abstract

Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act, 2006 (RTA) and its core regulation, asking the operator's question empirically: is fine-tuning enough, or is hybrid retrieval needed? We run a four-arm head-to-head on Qwen2.5-7B-Instruct (base zero-shot, LoRA SFT-only, RAG-only, and an SFT+RAG hybrid), scored on citation exact-match (section+subsection) over a small, human-verification-pending real eval set. The base model cannot cite the RTA and SFT-only mis-recalls sections; retrieval is essential and drives hallucination to zero by construction; and the SFT+RAG hybrid scores highest at 0.481 exact-match with zero hallucinated citations. Its edge comes from SFT making provision selection more robust to the higher-recall candidate sets that hurt zero-shot RAG. Notably, this cheap bge-small hybrid matches or beats a pipeline built on bigger, specialized retrieval models (a larger embedder and a cross-encoder reranker), and a larger/improved training set does not help either: strong statutory-citation performance here does not require specialized retrieval models or more data. The artifact zeroes hallucination and clears the lift-over-base bar but does not reach the aspirational 0.70 exact-match target. All results are on a small, human-verification-pending real eval set and are reported as preliminary.
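As a rough illustration of the metric named in the abstract, citation exact-match on section plus subsection could be scored along the following lines. The citation pattern, the function names, the definition of a hallucinated citation as one whose section is absent from a known set, and the section numbers in the example are all assumptions for this sketch; none of them comes from the paper or describes the Act.

```python
import re
from typing import List, Optional, Set, Tuple

# (section, subsection); subsection is None when the citation has none.
Citation = Tuple[str, Optional[str]]

# Hypothetical citation formats: "s. 12(3)" or "Section 12 (3)".
_CITE_RE = re.compile(
    r"\b(?:s\.|section)\s*(\d+(?:\.\d+)?)\s*(?:\((\d+(?:\.\d+)?)\))?",
    re.IGNORECASE,
)


def parse_citation(text: str) -> Optional[Citation]:
    """Return the first (section, subsection) found in text, or None."""
    match = _CITE_RE.search(text)
    if match is None:
        return None
    return (match.group(1), match.group(2))


def exact_match(predicted: str, gold: str) -> bool:
    """True only when both section and subsection agree."""
    parsed = parse_citation(predicted)
    return parsed is not None and parsed == parse_citation(gold)


def is_hallucinated(predicted: str, known_sections: Set[str]) -> bool:
    """Assumed definition: a cited section that is not in the known set."""
    parsed = parse_citation(predicted)
    return parsed is not None and parsed[0] not in known_sections


def exact_match_rate(pairs: List[Tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) pairs that match exactly."""
    if not pairs:
        raise ValueError("pairs must not be empty")
    return sum(exact_match(p, g) for p, g in pairs) / len(pairs)


# Placeholder section numbers, used only to exercise the functions.
known = {"12", "20"}
pairs = [("s. 12(3)", "Section 12 (3)"), ("s. 20(1)", "s. 12(3)")]
rate = exact_match_rate(pairs)
print(rate, is_hallucinated("s. 999(1)", known))
```

Scoring on the parsed pair rather than the raw string keeps formatting differences such as "s." versus "Section" from counting as errors.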

2606.19602 2026-06-19 cs.AI New submission 90%

Configurable Clinical Information Extraction with Agentic RAG: What Works, What Breaks, and Why

Osman Alperen Çinar-Koraş, Marie Bauer, Sameh Khattab, Merlin Engelke, Moon Kim, Stephan Settelmeier, Shigeyasu Sugawara, Fabian Freisleben, Felix Nensa, Jens Kleesiek

Affiliations: Institute for Artificial Intelligence in Medicine (IKIM), University Medicine Essen; Faculty of Computer Science, University of Duisburg-Essen; Department of Physics, TU Dortmund University; Lamarr Institute for Machine Learning and Artificial Intelligence, TU Dortmund University; Advanced Clinical Research Center, Fukushima Medical University; Department of Cardiology and Vascular Medicine, University Hospital Essen

Topic match (knowledge-base QA): proposes the ACIE system for clinical information extraction based on agentic RAG.

AI summary: To address missing metadata in clinical documents, proposes ACIE, an agentic RAG system deployed at University Medicine Essen. It reasons over complete patient contexts and supports verification through source citations, reaching a 96.5% extraction acceptance rate across 7,326 clinician judgments.

Abstract

Patient contexts span hundreds of heterogeneous documents and thousands of structured data points, yet the document-level metadata that AI systems need for retrieval and triage is absent or incomplete. Standard retrieval-augmented generation fails on this data, mishandling temporal reasoning, cross-document dependencies, and missing metadata. We deploy ACIE (Agentic Clinical Information Extraction) at University Medicine Essen: an on-premise agentic RAG pipeline that reasons over complete patient contexts and grounds every answer in source passages for clinician verification. We quantify the metadata gap, trace the architectural decisions it shaped, and evaluate extraction alongside an independent retrospective lymphoma registry study, in which nuclear-medicine physicians verify every extracted value against its cited sources. Across 7,326 judgments, clinicians accepted 96.5% of extractions, with per-type acceptance ranging from 80% to 99%.

2606.20041 2026-06-19 econ.GN cs.AI cs.LG q-fin.EC q-fin.GN New submission 80%

AI Economist Agent: An Agentic Framework for Model-Grounded Economic Analysis with RAG, Knowledge Graphs, and Large Language Models

Masahiro Kato

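Affiliation: Mizuho-DL Financial Technology, Co., Ltd.

Topic match (knowledge-base QA): RAG-based economic analysis that retrieves evidence and generates reports.

AI summary: Proposes a RAG-based AI economist agent framework that uses knowledge graphs and large language models for economic scenario analysis. Agents plan the analysis, retrieve evidence, select models, and generate reports, improving the coherence and traceability of economic narratives.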

Abstract

We propose a model-grounded RAG-based AI economist with an agentic framework for economic scenario analysis using large language models (LLMs) and knowledge graphs. While LLMs can generate fluent economic narratives, economists are often required to make economic claims grounded in economic theory and real-world data. Based on this motivation, this study proposes a RAG-based AI economist, which utilizes knowledge graphs that include economic data and theory, together with LLM-based agents, to plan the analysis, retrieve relevant evidence, select appropriate models, and generate reports. In our framework, we do not produce quantitative claims directly with the language model alone; instead, we generate narratives grounded in explicit model-based computations and linked to the retrieved evidence via AI agents. We refer to our framework as an AI economist agent. We evaluate the AI economist agent in two applications: economist report generation for U.S. inflation persistence and Federal Reserve policy, and bank stress-test narrative generation for U.S. commercial real estate refinancing stress. The results illustrate how grounding the generated reports improves their economic coherence and traceability.

2606.20369 2026-06-19 cs.CL New submission 80%

CATCH-ME if you RAG: a dataset of Contextually Annotated multi-Turn Counterspeech against Hate and Misinformation Exchanges

Helena Bonaldi, Genoveffa Martone, Marco Guerini

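Affiliations: Fondazione Bruno Kessler; Università Cattolica del Sacro Cuore

Topic match (knowledge-base QA): dataset for training counterspeech models with RAG systems.

AI summary: Presents the first large-scale, expert-curated, multilingual dialogue dataset covering the overlap of hate and misinformation. The dialogues are anchored in fact-checking sources and carry span annotations, supporting RAG systems and the training of more credible counterspeech models.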

Abstract

Online hate speech and misinformation frequently overlap, yet NLP research has mainly treated them in isolation. While LLMs represent a scalable solution for assisting humans in the generation of counterspeech for both threats, zero-shot models frequently generate repetitive and vague responses, underscoring the need for high-quality examples to steer model generation. However, existing counterspeech datasets against the overlap of hate and misinformation are scarce and limited to single-turn English dialogues, while real-life interactions span across multiple turns and languages. To bridge this gap, we introduce the first large-scale, expert-curated, multilingual dataset of dialogues tackling the intersection of hate and misinformation. To ensure factual grounding, the dialogues are also anchored in verified external knowledge (i.e., fact-checking articles and NGO reports) and include document- and chunk-level span annotations, making it directly applicable for RAG systems. Covering five languages and targeting hate directed at seven marginalized groups, this novel resource enables the training and evaluation of more persuasive, factually grounded counterspeech models.

2606.19598 2026-06-19 cs.RO New submission 80%

Fail-RAG: A Retrieval Augmented Generation Informed Framework for Robot Failure Identification

Ameya Salvi, Jie Hu

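Affiliation: Hitachi America, Ltd.

Topic match (knowledge-base QA): proposes the Fail-RAG framework, which uses RAG to detect robot failures.

AI summary: Proposes Fail-RAG, a framework that combines retrieval-augmented generation with vision-language models. Failure images and context information are embedded and queried against a failure database to detect robot operation failures, improving average detection accuracy by 25 percentage points on warehouse automation tasks.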

Abstract

Industry automation is witnessing an evolution in robotics driven by both technological breakthroughs and societal changes: progress towards generalist robots, embodied and physical artificial intelligence (AI), and an increasing labor shortage in manufacturing. An intelligent autonomous robot needs to not only act according to planned motions but also react to any unexpected events. In this study, we focus on such unexpected events in warehouses where robots are used for material handling. Specifically, we refer to any unexpected events as failures and develop methods to detect failures related to robot operations. Rule-based detection methods may break since the form of failures could change due to the dynamic nature of both environments and tasks. We propose 'Fail-RAG', a Retrieval Augmented Generation (RAG)-based failure detection framework where failure images and context information are embedded and queried against a failure database by calculating their similarities. Vision-Language Models (VLMs) are further used to analyze failures and provide details by following our instruction template. We evaluated the performance of Fail-RAG by conducting both simulation and physical experiments using fixed robot arms and a mobile manipulator for multiple tasks that are common in warehouse automation. Fail-RAG achieved 25 percentage points higher failure detection accuracy on average across five types of robot operations compared to using off-the-shelf VLMs, indicating its effectiveness for real-world failure detection.
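The retrieval step mentioned in the abstract (embed the observation, then query a failure database by similarity) might look roughly like the sketch below. Cosine similarity is an assumption here, since the abstract does not name the similarity measure, and the three-dimensional vectors, the failure labels, and the function names are invented purely for illustration. A real system would obtain embeddings from an image or text encoder.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_failures(
    query: Sequence[float],
    database: Dict[str, Sequence[float]],
    k: int = 1,
) -> List[Tuple[str, float]]:
    """Return the k database entries most similar to the query embedding."""
    scored = [(label, cosine_similarity(query, vec)) for label, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings; labels and values are placeholders, not data from the paper.
failure_db = {
    "dropped_item": [0.9, 0.1, 0.0],
    "gripper_miss": [0.1, 0.9, 0.1],
    "collision": [0.0, 0.2, 0.9],
}
query_embedding = [0.8, 0.2, 0.1]
best = top_k_failures(query_embedding, failure_db, k=1)
print(best[0][0])
```

In a pipeline like the one the abstract outlines, the retrieved entries would then be passed to a vision-language model as context for the detailed failure analysis.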

2606.19847 2026-06-19 cs.CL New submission 70%

AtomMem: Building Simple and Effective Memory System for LLM Agents via Atomic Facts

Yanyu Yao, Shangze Li, Zhi Zheng, Hui Zheng, Qi Liu, Tong Xu, Enhong Chen

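Affiliations: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China; Anhui University

Topic match (knowledge-base QA): involves fact extraction and hierarchical event structures for memory retrieval.

AI summary: To address coarse-grained storage and unstable updates in existing memory systems, proposes AtomMem. A Fact Executor extracts high-value atomic facts as efficient memory representations, and the facts are organized into hierarchical event structures and temporal profiles, giving value-dense storage and stable evolution. AtomMem achieves state-of-the-art performance on the LoCoMo benchmark.

Comments: 19 pages, 10 figures, 5 tables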

Abstract

Large language models (LLMs) demonstrate strong reasoning and generation abilities, but their fixed context windows limit long-term information accumulation and reuse across multi-session interactions. Existing memory-augmented systems often construct memory in a coarse and unstable manner, relying on inefficient memory representations or unstable unconstrained updates. To address these challenges, we propose AtomMem, a long-term memory system designed for value-dense storage and stable memory evolution. AtomMem introduces a Fact Executor, which selectively extracts high-value atomic facts from long-form interactions to serve as highly efficient memory representations. Subsequently, AtomMem organizes these facts into hierarchical event structures and temporal profiles, capturing coherent episodic contexts and tracking dynamically evolving user attributes over time. During retrieval, the system activates an associative memory graph to connect fragmented memories. Experiments on the LoCoMo benchmark confirm that AtomMem achieves state-of-the-art performance across various reasoning tasks, offering a scalable and economically viable solution for deploying intelligent personalized agents.
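One way to picture a temporal profile that tracks evolving user attributes is a per-attribute history of timestamped atomic facts, where the most recent fact gives the current value. The sketch below is only an illustration under that assumption. `AtomicFact`, `TemporalProfile`, the integer session index used as a timestamp, and the example values are hypothetical and are not taken from AtomMem.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import DefaultDict, List, Optional, Tuple


@dataclass(frozen=True)
class AtomicFact:
    subject: str
    attribute: str
    value: str
    timestamp: int  # hypothetical session index; larger means more recent


class TemporalProfile:
    """Keeps a time-ordered history of facts per (subject, attribute)."""

    def __init__(self) -> None:
        self._history: DefaultDict[Tuple[str, str], List[AtomicFact]] = defaultdict(list)

    def add(self, fact: AtomicFact) -> None:
        key = (fact.subject, fact.attribute)
        self._history[key].append(fact)
        # Keep each history sorted so out-of-order inserts are handled.
        self._history[key].sort(key=lambda f: f.timestamp)

    def current(self, subject: str, attribute: str) -> Optional[str]:
        """Most recent value for the attribute, or None if never recorded."""
        facts = self._history.get((subject, attribute))
        return facts[-1].value if facts else None

    def history(self, subject: str, attribute: str) -> List[str]:
        """All recorded values for the attribute, oldest first."""
        return [f.value for f in self._history.get((subject, attribute), [])]


profile = TemporalProfile()
profile.add(AtomicFact("user", "city", "Hefei", timestamp=1))
profile.add(AtomicFact("user", "city", "Beijing", timestamp=3))
profile.add(AtomicFact("user", "city", "Shanghai", timestamp=2))
print(profile.current("user", "city"), profile.history("user", "city"))
```

Keeping the full history rather than overwriting values is a design choice of this sketch: it lets a query ask either what is true now or what was true earlier.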

2606.19700 2026-06-19 cs.CL New submission 70%

TerraMARS: A Domain-Adapted Small-Language-Model Pipeline for Mars Terraforming Literature

Jyotsna Singh, Ash Black, Jeff Larsen, Scott R. Saleska

Affiliations: University of Arizona; College of Information Science, University of Arizona; Biosphere 2, University of Arizona; Department of Ecology and Evolutionary Biology, University of Arizona; Department of Environmental Sciences, University of Arizona

Topic match (knowledge-base QA): combines retrieval and a chunking framework for information extraction.

AI summary: Proposes the TerraMARS pipeline, which uses a domain-adapted small language model to extract structured information from Mars science literature in support of terraforming research.

Comments: 16 pages, 1 figure, 4 tables

Abstract

Researchers are interested in learning about Mars so that it may eventually become habitable for humans. To achieve this, there is a need for comprehensive knowledge of the planet's atmosphere, hydrology, surface chemistry, radiation environment, and spatial features, drawn from the scientific literature. This literature contains valuable information and meaningful quantitative constraints that can be used in other models and studies, such as habitability assessment and future terraforming studies. We present TerraMARS, an end-to-end information extraction pipeline that uses a domain-adapted Small Language Model to answer Mars terraforming-related questions and to convert unstructured Mars science text into machine-readable structured outputs in JavaScript Object Notation (JSON) format. A corpus of open-access papers is collected and processed using a multistage retrieval and chunking framework. Google Gemma 3 1B was adapted to the domain using Quantized Low-Rank Adaptation (QLoRA) fine-tuning on Mars-specific question-answering and information extraction datasets. The resulting pipeline generates both types of output and provides a foundation for integrating knowledge from scientific literature into downstream applications like digital twins and habitability modeling for Mars. The output from this pipeline looks promising, but further improvements are needed to increase extraction accuracy and factual consistency.
