arXivDaily: arXiv daily academic digest, updated Monday to Friday
cs.IR Information Retrieval (34)
2606.12400 2026-06-11 cs.CL cs.IR new submission

Doc-to-Atom: Learning to Compile and Compose Memory Atoms

Xingjian Diao, Wenbo Li, Yashas Malur Saidutta, Avinash Amballa, Lazar Valkov, Srinivas Chappidi

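Affiliations: AI Center-Mountain View, Samsung Electronics; Dartmouth College

AI summary: Proposes the Doc2Atom framework, which decomposes documents into semantically typed knowledge atoms and compiles each into a micro-LoRA adapter; a lightweight query router selects the relevant atoms and assembles them into a query-specific adapter, addressing interference and scalability problems in document compression. It outperforms Doc-to-LoRA on six QA benchmarks.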

Comments
20 pages
Abstract

Long input sequences are central to document understanding and multi-step reasoning in Large Language Models, yet the quadratic cost of attention makes inference both memory-intensive and slow. Context distillation mitigates this by compressing contextual information into model parameters, and recent work such as Doc-to-LoRA amortizes context distillation into a single forward pass that generates one LoRA adapter per document. However, producing a single monolithic adapter for all queries leads to irrelevant-query interference, limited compositional recall, and poor scalability to long-document reasoning. To address these challenges, we propose Doc-to-Atom (Doc2Atom), a compositional parametric memory framework that decomposes each document into semantically typed knowledge atoms. Each atom is compiled into an independent micro-LoRA adapter and a provenance retrieval key. At inference time, a lightweight query router selects and assembles only the relevant atoms into a query-specific adapter, which is then injected into a frozen base model. The entire system is trained end-to-end through a multi-objective distillation framework. Experiments on six diverse QA benchmarks demonstrate that Doc2Atom outperforms Doc-to-LoRA baselines while reducing the memory cost of document internalization.
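
The routing-and-assembly step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the abstract does not say how selected atoms are combined, so summing their low-rank updates is an assumption, and the atom names, keys, and matrices are all hypothetical toy values.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def matmul(a, b):
    """Plain-Python matrix product of a (m x r) and b (r x n)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def route(query_key, atoms, top_k=2):
    """Return the top_k atoms whose retrieval keys best match the query."""
    ranked = sorted(atoms, key=lambda a: cosine(query_key, a["key"]), reverse=True)
    return ranked[:top_k]

def compose(selected, d_out, d_in):
    """Assumed composition rule: sum the low-rank updates B @ A of the selected atoms."""
    delta = [[0.0] * d_in for _ in range(d_out)]
    for atom in selected:
        update = matmul(atom["B"], atom["A"])  # (d_out x r) @ (r x d_in)
        for i in range(d_out):
            for j in range(d_in):
                delta[i][j] += update[i][j]
    return delta

# Hypothetical rank-1 atoms for a 2x2 weight matrix.
atoms = [
    {"name": "fact", "key": [1.0, 0.0], "A": [[1.0, 0.0]], "B": [[1.0], [0.0]]},
    {"name": "event", "key": [0.0, 1.0], "A": [[0.0, 1.0]], "B": [[0.0], [1.0]]},
    {"name": "definition", "key": [0.7, 0.7], "A": [[1.0, 1.0]], "B": [[0.5], [0.5]]},
]
selected = route([1.0, 0.1], atoms, top_k=2)
delta_w = compose(selected, d_out=2, d_in=2)
```

In this toy setup the unrelated "event" atom is never added to the weight update, which is the interference-avoidance idea the abstract describes.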

2606.12295 2026-06-11 cs.CV cs.CL cs.IR new submission

Findings of the MAGMaR 2026 Shared Task

Alexander Martin, Dengjia Zhang, Joel Brogan, Francis Ferraro, Jeremy Gwinnup, Reno Kriz, Teng Long, Kenton Murray, Andrew Yates, Xiang Xiang

Affiliations: Johns Hopkins University; OpenAI; University of Maryland, Baltimore County; Air Force Research Laboratory; Human Language Technology Center of Excellence, Johns Hopkins University; University of Amsterdam; Huazhong University of Science and Technology

AI summary: Presents the results of the MAGMaR 2026 shared task, which covers video retrieval and article generation grounded in retrieved videos; all 17 retrieval submissions beat a baseline derived from the winner of last year's shared task.

Comments
Findings of the 2nd workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR); Resources at this url: this https URL
Abstract

This overview paper presents the results of the shared task for the second workshop on Multimodal Augmented Generation via Multimodal Retrieval (MAGMaR). In this shared task participants submitted systems focused on either (i) video retrieval or (ii) grounded generation of articles given retrieved videos. Teams could submit to either task. For the retrieval task, we had 2 participating teams that submitted a total of 17 systems -- all of which beat a baseline derived from the winner of last year's shared task. On the generation side, we had 4 teams submit 16 systems. All teams had at least one generated report that was labeled the best by a human annotator.

2606.12246 2026-06-11 cs.DC cs.IR new submission

Efficient and Robust Online Learning to Rank in Decentralized Systems

Marcel Gregoriadis, Martijn de Vos, Sayan Biswas, Anne-Marie Kermarrec, Johan Pouwelse

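AI summary: Proposes RankGuard, a framework in which users exchange model updates directly and use their private click histories to defend against poisoning attacks; it gives the first formal convergence analysis for decentralized online learning to rank and is up to 62x more efficient than its closest competitors.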

Abstract

In Online Learning to Rank (OLTR), ranking models are trained directly from live user interactions, but existing systems rely on a trusted central server to collect and process these interactions. This leaves operators free to introduce biases that conflict with user interests. Decentralized learning offers an attractive alternative, allowing users to collaboratively train a shared ranking model by exchanging model updates directly with one another, without any central authority. In such settings, however, malicious nodes can send poisoned model updates that degrade the ranking quality of honest nodes. We introduce RankGuard, a decentralized OLTR framework in which users collaboratively train ranking models and exchange model updates directly with other nodes. RankGuard defends against poisoning attacks by carefully evaluating incoming models against the user's own private click history, corrected for position bias. An incoming model is only aggregated if it better explains the user's past interactions than the current local model, making it fundamentally hard for malicious nodes to craft updates that pass this test without also genuinely helping the user. We derive a theoretical convergence guarantee of RankGuard. To the best of our knowledge, this is the first formal convergence analysis of a decentralized OLTR algorithm. We evaluate RankGuard against four poisoning attacks, including a powerful adaptive attack, using four standard benchmarks and three click models. RankGuard outperforms all baselines in most settings while being up to 62x more efficient than its closest competitors.
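
The acceptance test described above (aggregate an incoming model only if it explains the user's own clicks better than the local model) might look roughly like the sketch below. Everything concrete here is an assumption made for illustration: the linear scorer, the inverse-propensity-weighted reciprocal-rank utility, the propensity values, and plain averaging as the aggregation rule are not taken from the abstract.

```python
def score(weights, features):
    """Linear ranking score (assumed model form)."""
    return sum(w * x for w, x in zip(weights, features))

def ips_utility(weights, click_log, propensity):
    """Inverse-propensity-weighted reciprocal rank of clicked documents."""
    total = 0.0
    for session in click_log:
        docs = session["docs"]  # feature vectors in the order they were shown
        order = sorted(range(len(docs)),
                       key=lambda i: score(weights, docs[i]), reverse=True)
        rank_of = {doc_idx: r + 1 for r, doc_idx in enumerate(order)}
        for shown_pos, clicked in enumerate(session["clicks"]):
            if clicked:
                # Clicks at rarely examined positions count for more.
                total += (1.0 / propensity[shown_pos]) / rank_of[shown_pos]
    return total

def maybe_aggregate(local, incoming, click_log, propensity):
    """Average the incoming model in only if it beats the local one on private clicks."""
    if ips_utility(incoming, click_log, propensity) > ips_utility(local, click_log, propensity):
        return [(a + b) / 2.0 for a, b in zip(local, incoming)]
    return local

# Hypothetical private history: the user clicked the document shown second.
propensity = [1.0, 0.5, 0.25]
click_log = [{"docs": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], "clicks": [0, 1, 0]}]
local = [1.0, 0.0]
helpful = [0.0, 1.0]    # ranks the clicked document first
poisoned = [2.0, -1.0]  # pushes the clicked document to the bottom
```

With these toy values the helpful update is averaged into the local model and the poisoned one is ignored, because it does not improve the utility on the private click log.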

2606.12245 2026-06-11 cs.IR cs.AI new submission

DiffCold: A Diffusion-based Generative Model for Cold-Start Item Recommendation

Kangning Zhang, Yingjie Qin, Weinan Zhang, Yong Yu, Jianghao Lin

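AI summary: To address the seesaw dilemma in cold-start item recommendation, proposes DiffCold, a conditional-diffusion generative model that reconstructs warm item embeddings from content while preserving manifold structure, and combines a retrieval-enhanced aggregator with a simulation-based representation alignment module to unify cold and warm item representations.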

Comments
Accepted by ECML-PKDD 2026
Abstract

Cold-start item recommendation remains a persistent challenge in real-world systems due to the absence of interaction histories. While prior models attempt to bridge this gap using item content features, they universally suffer from the seesaw dilemma: enhancing performance for cold items inevitably degrades performance for warm items, and vice versa. We identify that this dilemma stems from a fundamental distributional disparity: warm item embeddings occupy a complex "behavioral manifold" shaped by rich interaction signals, whereas cold item embeddings are constrained to a "semantic manifold" derived solely from auxiliary content. Existing methods often force a rigid mapping between these inconsistent spaces, causing the model to sacrifice the precision of warm representations to accommodate cold ones. To address this, we propose DiffCold, a diffusion-based generative model that unifies warm and cold representations. Unlike GANs or VAEs, DiffCold leverages conditional diffusion to reconstruct warm item embeddings from content, preserving the underlying manifold structure without degradation. We further tailor this paradigm with two specific designs: a Retrieval-enhanced Aggregator that initializes generation using semantically similar warm items to bypass inefficient noise, and a Simulation-based Representation Alignment module that enforces distribution consistency between generated and real embeddings via contrastive learning. Experiments on three benchmarks confirm that DiffCold resolves the seesaw dilemma, consistently outperforming state-of-the-art methods across all metrics.

2606.12215 2026-06-11 cs.CV cs.IR cs.LG new submission

MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching

David Yuchen Wang, Haoying Li, Hailun Xu, Wei Chee Yew, Zirui Zhu, Sanjay Saha, Hao Hei, Kanchan Sarkar, Kun Xu

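Affiliations: TikTok Singapore; School of Computing, National University of Singapore; TikTok San Jose

AI summary: Proposes the MLT-Dedup framework, which uses a multi-level video encoder to extract fine-grained frame-level and sparse clip-level embeddings and a differential feature-enhanced similarity module for spatial-temporal matching; it reduces the online repetition rate by 91% at 90% precision and increases indexing capacity 5x.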

Comments
Accepted by KDD-2026 ADS track
Abstract

The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.
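
The coarse-to-fine split described above (cheap clip-level retrieval first, frame-level matching only on the shortlist) can be illustrated as follows. This is a toy sketch, not the paper's ML-VE or DiF-SiM: the cosine-based frame overlap score, both thresholds, and the index contents are assumptions chosen for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def retrieve(query_clip, index, top_n):
    """Coarse stage: shortlist videos by clip-level embedding similarity."""
    ranked = sorted(index, key=lambda vid: cosine(query_clip, index[vid]["clip"]),
                    reverse=True)
    return ranked[:top_n]

def frame_overlap(query_frames, cand_frames, sim_threshold=0.95):
    """Fine stage: fraction of query frames that have a near-identical candidate frame."""
    matched = sum(1 for q in query_frames
                  if any(cosine(q, c) >= sim_threshold for c in cand_frames))
    return matched / len(query_frames)

def find_duplicates(query, index, top_n=2, overlap_threshold=0.5):
    """Run frame-level matching only on the clip-level shortlist."""
    hits = []
    for vid in retrieve(query["clip"], index, top_n):
        if frame_overlap(query["frames"], index[vid]["frames"]) >= overlap_threshold:
            hits.append(vid)
    return hits

# Hypothetical index: one clip-level vector and a few frame-level vectors per video.
index = {
    "a": {"clip": [1.0, 0.0], "frames": [[1.0, 0.0], [0.9, 0.1]]},
    "b": {"clip": [0.9, 0.2], "frames": [[0.0, 1.0], [0.1, 1.0]]},
    "c": {"clip": [0.0, 1.0], "frames": [[0.0, 1.0]]},
}
query = {"clip": [1.0, 0.05], "frames": [[1.0, 0.0], [0.95, 0.05]]}
```

Here video "b" survives the coarse stage but is rejected by frame-level matching, while "c" is never loaded at all, which is where the index-budget saving comes from.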

2606.12198 2026-06-11 cs.IR new submission

LLM-Based User Personas for Recommendations at Scale

Haoting Wang, Haokai Lu, Zheyun Feng, Jenny Huang, Yifat Amir, Gregory Hinkson, Ben Most, Zelong Zhao, Yixin Kelly Cui, Rein Zhang, Fabio Soldo, Yu Xia, Nihar Bhupalam, Minmin Chen, Konstantina Christakopoulou, Lichan Hong, Ed H. Chi

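AI summary: Proposes a framework for generating LLM-based user interest personas in real time, made affordable through knowledge distillation, asynchronous inference, and semantic clustering of inputs; it addresses the exploitation-exploration trade-off and improves large-scale video recommendation.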

Abstract

Large Language Models (LLMs) offer unprecedented potential for enhancing recommendation systems through their world knowledge and reasoning capabilities. However, existing approaches often rely on structured IDs or offline processing, limiting semantic richness, real-time adaptability, and user-facing interpretability. In this paper, we introduce a novel framework that enables real-time generation of LLM-based user interest personas for a large-scale commercial video recommendation platform. Our method generates natural-language user interest personas that address the exploitation-exploration trade-off by combining the summarization of existing interests with novel topics, directly during serving. To overcome the computational challenges of online LLM inference at a billion-user scale, we design a cost-efficient architecture leveraging knowledge distillation, asynchronous inference, and input optimization via semantically clustered video representations. Extensive offline evaluations, user studies, and live A/B tests demonstrate significant improvements in viewer value. This work bridges the gap between high-level semantic understanding and industrial-scale recommendation, paving the way for more dynamic, explainable, and satisfying personalized experiences.

2606.11945 2026-06-11 cs.CL cs.IR new submission

uva-irlab-conv at SemEval-2026 Task 8: Multi-Turn RAG with Learned Sparse Retrieval and Listwise Reranking

Simon Lupart, Kidist Amde Mekonnen, Zahra Abbasiantaeb, Mohammad Aliannejadi

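Affiliations: University of Amsterdam

AI summary: Proposes a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation, built for a conversational task that spans four domains and includes unanswerable queries.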

Comments
SemEval-2026, The 20th International Workshop on Semantic Evaluation, collocated with ACL 2026, 9 pages, 5 figures, 6 tables
Abstract

This report describes our participation in SemEval-2026 Task 8 on multi-turn retrieval and question answering. The task evaluates conversational systems across four domains (finance, cloud documentation, government, Wikipedia), and includes unanswerable queries where the available collection does not contain sufficient evidence to produce a complete response. We propose a multi-turn retrieval-augmented generation pipeline that combines learned sparse retrieval with LLM-based reranking and generation. Using sparse retrieval as the primary retrieval method, we leverage its strong generalization across domains. In addition, we make use of the long-context capabilities of LLMs for conversational query rewriting, pointwise and listwise reranking, and generating the final response, each conditioned on the full conversational history. This multi-step design enables effective integration of conversational context throughout retrieval and generation, improving robustness across domains.

2606.11907 2026-06-11 cs.IR new submission

Tail-Aware Adaptive-k: Query-Adaptive Context Selection for Retrieval-Augmented Generation

Ziyu Song, Jiaming Fang, Kuangyu Li, Tuo Xia, Chuanpeng Wang

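AI summary: To address the failure of fixed Top-K retrieval under query-dependent, heavy-tailed similarity distributions, proposes the TAA-k framework, which uses a localized extreme value theory validation strategy to obtain efficient, stable, query-adaptive truncation; on three datasets it reaches near-oracle retrieval quality with large efficiency gains.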

Comments
First two authors contributed equally. Accepted at ECML PKDD 2026
Abstract

Adaptive context selection is critical for retrieval-augmented generation (RAG) systems, as fixed Top-K retrieval fails under query-dependent and heavy-tailed similarity distributions. While Extreme Value Theory (EVT) offers a principled framework for adaptive truncation, existing approaches apply EVT globally across the entire ranked list, incurring prohibitive computational costs and statistical instability. We propose Tail-Aware Adaptive-k (TAA-k), a training-free framework that operationalizes EVT through a localized validation strategy. The key insight is that ranked similarity curves exhibit a characteristic steep--flat--steep pattern reflecting a transition from relevance-dominated to noise-dominated regimes. TAA-k exploits this geometric structure via knee detection to identify a compact candidate region, then applies EVT-based goodness-of-fit testing within this window to validate the onset of tail behavior. This coarse-to-fine design reduces computational complexity from $O(N^2 M)$ to $O(\sqrt{N \log N} \cdot M)$ while maintaining statistical rigor. Under mild monotone likelihood ratio assumptions, TAA-k yields a stable, query-adaptive cutoff corresponding to the earliest noise-dominated position. Experiments on WebQuestions, 2WikiMultiHopQA, and MuSiQue demonstrate that TAA-k achieves near-oracle retrieval quality (F1 within 2-3% of oracle) with orders-of-magnitude efficiency gains over global EVT methods, while maintaining robustness across embedding models and compression dimensions.
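
The coarse stage described above, knee detection on the ranked similarity curve, can be sketched as below. This uses a simple "largest gap below the chord" heuristic as a stand-in; the abstract does not name the knee detector, and the EVT goodness-of-fit test that would then run inside the candidate window is omitted. The scores and the window half-width are hypothetical.

```python
def knee_index(scores):
    """Index of the point farthest below the chord joining the first and last score.

    `scores` must be sorted in descending order.
    """
    n = len(scores)
    if n < 3:
        return 0
    first, last = scores[0], scores[-1]
    best_i, best_gap = 0, 0.0
    for i, s in enumerate(scores):
        chord = first + (last - first) * i / (n - 1)  # straight line from first to last
        gap = chord - s
        if gap > best_gap:
            best_i, best_gap = i, gap
    return best_i

def candidate_window(scores, half_width=1):
    """Compact index range around the knee where a finer test would be applied."""
    knee = knee_index(scores)
    return max(0, knee - half_width), min(len(scores) - 1, knee + half_width)

# Hypothetical ranked similarities: three relevant hits, then a noise plateau.
scores = [0.95, 0.90, 0.85, 0.40, 0.38, 0.37, 0.36, 0.35]
knee = knee_index(scores)
kept = scores[:knee]  # contexts ranked before the first noise-dominated position
```

Restricting the expensive statistical test to the few positions in `candidate_window(scores)` rather than the whole list is the source of the efficiency gain the abstract claims.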

2606.11864 2026-06-11 cs.IR new submission

CORE-Bench: A Comprehensive Benchmark for Code Retrieval in the Era of Agentic Coding

Fuwei Zhang, Yanzhao Zhang, Mingxin Li, Dingkun Long, Lexiang Hu, Pengjun Xie, Zhao Zhang, Fuzhen Zhuang

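AI summary: For requirement-driven, repository-level code retrieval in agentic coding, builds CORE-Bench, a three-level benchmark with over 180K queries and 106K broader-context relevance labels; experiments show a sharp performance drop for existing embedding models, and supervised fine-tuning improves results.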

Abstract

Code retrieval is becoming central to coding agents, but agentic coding requires more than matching a natural-language query to an isolated snippet. Given a user request, a coding agent needs to navigate a concrete repository state, locate relevant files and functions, gather supporting context, and filter similar in-repository distractors. Existing code retrieval benchmarks mainly evaluate docstring-to-function or snippet-level matching, thereby missing this requirement-driven repository search problem. To address this gap, we introduce CORE-Bench, a comprehensive benchmark for code retrieval in the era of agentic coding. CORE-Bench evaluates code retrieval ability at three levels: code understanding, issue-to-edit localization, and broader context retrieval. Built from curated code-search tasks and SWE-bench-series instances, CORE-Bench contains over 180K queries and 106K broader-context relevance labels. Experiments with representative embedding models show a sharp drop from traditional code search to code retrieval in agentic coding settings. Simple supervised fine-tuning of existing embedding models significantly improves performance in this setting, suggesting substantial room for further progress.

2606.11780 2026-06-11 cs.IR cs.AI cs.IT new submission

What Limits Does Quantization Place on Dense Top-$k$ Retrieval? A Theoretical Study

Koki Okajima, Tsukasa Yoshida

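AI summary: Proves that, at finite precision, the dimension required for perfect top-$k$ retrieval must grow at least logarithmically with corpus size and that a threshold exists on quantization precision, with implications for practical system design.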

Comments
9 pages, 2 figures
Abstract

We establish conditions for embedding a corpus of $N$ documents as $d$-dimensional vectors such that every $k$-subset $S \subseteq [N]$ is realizable as a result of top-$k$ retrieval by some query vector. Recent work shows that $d = O(k)$ suffices for such embeddings to exist in $\mathbb{R}^d$, independently of $N$. We theoretically prove that this corpus-independent bound is specific to infinite precision. With $B$ bits per coordinate, perfect top-$k$ retrieval requires $Bd = \Omega(k \ln N)$; thus, at any fixed precision, the dimension must grow at least logarithmically with $N$. Specializing to an $\ell_2$-normalized $B$-bit uniform scalar quantization model, we also identify a threshold on the precision $B^{*} = O(\ln \ln N)$ below which no dimension suffices, together with two further regimes that bound the feasible $(B, d)$ pairs. Our result implies that in practical vector databases and dense retrieval systems where quantization is standard, the embedding dimension and possibly the precision must grow with the corpus size.
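
To get a feel for the scaling in $Bd = \Omega(k \ln N)$, the sketch below evaluates the smallest dimension consistent with $Bd \ge c \cdot k \ln N$. The constant hidden by the $\Omega$ is not given in the abstract, so `constant=1.0` is purely an assumption and the outputs indicate growth with $N$, not actual required dimensions.

```python
import math

def min_dimension(k, n_docs, bits, constant=1.0):
    """Smallest integer d with bits * d >= constant * k * ln(n_docs).

    `constant` stands in for the unknown factor hidden by the Omega notation.
    """
    return math.ceil(constant * k * math.log(n_docs) / bits)

# Illustrative values only: top-10 retrieval at 8 bits per coordinate.
for n_docs in (10**6, 10**9):
    print(n_docs, min_dimension(k=10, n_docs=n_docs, bits=8))
```

Growing the corpus a thousandfold raises the lower bound by a fixed additive amount rather than a multiplicative one, which is what "logarithmic growth" means in practice; halving the bit width doubles it.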

2606.11749 2026-06-11 cs.IR new submission

FAST-MEL: A Fast, Accurate, and Storage Efficient Solution for Multimodal Entity Linking

Derrien Thomas, Laurent Amsaleg, Pascale Sébillot

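AI summary: Proposes FAST-MEL, which represents textual and visual information as compact fixed-size vectors, matching the accuracy of the best systems while running three orders of magnitude faster and using one order of magnitude less storage than the fastest systems.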

Abstract

Multimodal entity linking (MEL) is the task of matching textual and visual mentions of entities in unstructured data to their corresponding entities in a knowledge base (KB). To be effective in large-scale practical settings, MEL systems must meet three objectives: high linking accuracy, computational efficiency, and storage efficiency, i.e., a compact yet efficient index of the KB. In this paper, we highlight that state-of-the-art systems fail to simultaneously satisfy these three requirements. To meet this three-fold objective, we propose FAST-MEL, a lightweight encoder-based MEL solution that relies on a novel and compact fixed-size vectorized representation of both the textual and visual information of each entity or mention. It matches the accuracy of the best systems but performs three orders of magnitude faster. It also consumes one order of magnitude less storage than the fastest systems.

2606.11700 2026-06-11 cs.IR new submission

CompRank: Efficient LLM Reranking via Token-Level Compression and Decoding-Free Scoring

Xuan Lu, Haohang Huang, Yingqi Fan, Junlong Tong, Yuxuan Zhang, Ping Nie, Rui Meng, Xiaoyu Shen

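AI summary: Proposes the CompRank framework, which reduces redundant computation through token-level compression and decoding-free attention scoring; on BEIR datasets it approaches full-token ranking performance while retaining only 10.2% of document tokens, and it achieves a 4.9x to 9.5x end-to-end speedup over generation-based listwise reranking.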

Abstract

Large language model (LLM) rerankers have become an important component of modern retrieval and retrieval-augmented generation pipelines, but their high computational cost limits their applicability to long candidate lists. In this paper, we propose CompRank, a token-efficient reranking framework that reduces redundant computation by aligning reranker design with the sparsity of ranking signals. CompRank decouples document representations from candidate order and query context, enabling reusable document-side states; applies segment-wise token compression to reduce query--document interaction cost; and introduces a CopyNet-style objective that directly aligns attention-based document scoring with training supervision. Experiments on seven BEIR datasets show that CompRank achieves strong reranking performance while retaining only 10.2% of document tokens, reaching an average NDCG@10 of 39.2 compared with 39.7 under full-token attention. Further scaling experiments on TREC-COVID show that CompRank remains stable when evaluated on candidate lists of up to 500 documents after training on 30-document lists, while achieving $4.9\times$--$9.5\times$ end-to-end speedup over generation-based listwise reranking and approximately $1.3\times$ speedup over the full-token CompRank variant. These results suggest that token-level compression and decoding-free attention scoring provide an effective path toward scalable LLM-based reranking.
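
For readers unfamiliar with the headline metric, a minimal NDCG@k sketch follows. It uses the linear-gain form and, as a simplifying assumption, computes the ideal ordering from the ranked list itself rather than from all judged documents; the reported 39.2 and 39.7 are presumably this kind of score multiplied by 100 and averaged over queries and datasets.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(ranked_relevances, k=10):
    """DCG of the given ranking divided by the DCG of the best possible reordering."""
    ideal = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order a reranker returned the documents.
perfect = ndcg_at_k([3, 2, 1, 0])
swapped = ndcg_at_k([0, 1])  # the only relevant document is ranked second
```

A reranker that keeps relevant documents near the top loses little NDCG@10 even if the tail of the list is shuffled, which is why compressed and full-token variants can score so closely.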

2606.11654 2026-06-11 cs.IR cs.CL cs.HC cs.SI new submission

The Long Tail, Not the Front Page: Cold-Start Prediction of Crowd Highlight Salience

Kazuki Nakayashiki, Keisuke Watanabe

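AI summary: Studies how to predict a document's crowd highlight salience from its text before any reader marks exist; a logistic ranker over sentence embeddings and positional/contextual features beats the lead (position) baseline by 0.044 average precision, and the edge is shown to come from learning on real reader marks.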

Comments
10 pages, 3 figures, 4 tables
Abstract

A social highlighter's most useful signal -- which passages a crowd of readers marks -- exists only for documents people have already read. Can the aggregate crowd salience of a document be predicted from its text before its marks accumulate? Prior work on this data found that zero-shot language models recover highlight locations worse than a trivial lead (position) baseline, so we ask whether a model trained on the highlight corpus can beat that baseline. Using a pre-registered ladder of models and a by-document cluster bootstrap, we find a small but robust edge: a logistic ranker over sentence embeddings and positional/contextual features beats the lead baseline by +0.044 average precision (95% CI [+0.029, +0.058]; clears a pre-registered margin delta=0.03 in 97% of resamples, and stable across pipeline re-runs). Two unsupervised extractive baselines (centroid, LexRank-style centrality) lose to lead, and the trained model beats them by +0.108, so the edge is not recovered by generic unsupervised proxies -- it reflects learning from real reader marks. In product terms, precision@3 rises from 0.25 to 0.39 (+55% relative) and the model beats lead on 69% of documents. An ablation attributes the edge to the raw embedding (+0.014) and training augmentation (+0.010), each with a positive CI. The edge is not a temporal-generalization failure, and we find no evidence that content drift or near-duplicate leakage explains it. A standardized regression shows the advantage is governed mainly by document popularity (lower popularity, larger edge) and by label reliability. It nearly vanishes only on the most popular content; there it is the lead baseline that strengthens, not the model that weakens. Because our evaluation conditions on documents that eventually accumulated readers, these results are a retrospective cold-start simulation.
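
The evaluation recipe described above (per-document average precision against a lead baseline, with a bootstrap that resamples whole documents) can be sketched as follows. The percentile interval, the resample count, and the example numbers are assumptions for illustration, not the paper's pre-registered protocol.

```python
import random

def average_precision(scores, labels):
    """Average precision of a ranking induced by `scores` against binary `labels`."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i]:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0

def lead_scores(n_sentences):
    """Lead (position) baseline: earlier sentences get higher scores."""
    return [-i for i in range(n_sentences)]

def bootstrap_ci(per_doc_deltas, n_resamples=2000, seed=0):
    """Percentile 95% interval for the mean, resampling documents with replacement."""
    rng = random.Random(seed)
    n = len(per_doc_deltas)
    means = sorted(
        sum(rng.choice(per_doc_deltas) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    return means[int(0.025 * n_resamples)], means[int(0.975 * n_resamples) - 1]

# Hypothetical document whose only highlighted sentence is the second one.
labels = [0, 1, 0]
lead_ap = average_precision(lead_scores(3), labels)
model_ap = average_precision([0.2, 0.9, 0.1], labels)
```

Because each document contributes a single AP difference, resampling those differences keeps all sentences of a document together, which is what makes the bootstrap clustered by document.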

2606.11616 2026-06-11 cs.LG cs.IR new submission

DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

Jiale Deng, Yanyan Shen, Xiaogang Shi, Chai Junjun

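Affiliations: Shanghai Jiao Tong University; ByteDance Inc.; TikTok

AI summary: Proposes the DeMix framework, which uses influence vectors to capture the distinct patterns that different error types produce in model behavior, casts data debugging as multi-label classification, and introduces an intervention-based learning strategy; it significantly improves debugging F1-score and post-repair model performance across 11 tasks.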

Abstract

High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: this https URL.

2606.11613 2026-06-11 cs.IR cs.CL cs.HC cs.SI new submission

Factions Within, Uncertain Across: Within-Document Reader Sub-Groups in Social Highlighting

Kazuki Nakayashiki, Keisuke Watanabe

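AI summary: Using a margin-preserving curveball null model, finds that readers within a document form strong sub-groups whose agreement far exceeds what shared salience predicts, with most of the excess coming from fine-grained reader-specific agreement; cross-document stability remains unresolved.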

Comments
11 pages, 3 figures, 3 tables
Abstract

When many people highlight the same document, is the crowd a single consensus, or is it internally structured into reader sub-groups that mark different things -- and is that structure a stable property of a reader or of the document? Building on prior work showing an individual's within-document highlighting signal is a whisper while individuality lives in selection, we ask the group-level question on a co-readership platform using a margin-preserving curveball null. Experiment 1: within a document, readers form strong sub-groups -- pairs agree far beyond what shared salience, mark density, and sentence popularity predict (nearest-neighbour agreement z=+6.3, significant in 88% of documents). Under an eight-block region-preserving null, shared engagement with the same coarse regions of the document accounts for about 40% of this excess; the majority survives as finer reader-specific agreement (z=+3.6, 77% significant). So the within-document crowd is, in a descriptive sense, factional. Experiment 2: is that grouping a stable reader trait? Here we are honest about power. The cross-document split-half reproducibility of a pair's agreement is near zero pooled (+0.078 and 0.000 in two separately drawn samples), and a power calibration shows the test is informative only for pairs that co-read many documents. In the only informative high-overlap subset (k>=4), point estimates are positive but small-sample, imprecise across the separately drawn samples, never significant, and attenuate under the region-preserving null. We therefore leave cross-document stability unresolved: the data is consistent with anything from situational grouping to a weak-to-moderate stable reader trait. The crowd is factional within a document; whether its factions follow the reader across documents is, honestly, beyond our reach.
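
A margin-preserving curveball null, as named above, randomizes a binary reader-by-sentence matrix while keeping every reader's number of marks and every sentence's popularity fixed. The sketch below shows the basic pairwise-trade procedure on a toy matrix; the number of trades and the data are hypothetical, and the eight-block region-preserving variant is not implemented.

```python
import random
from collections import Counter

def curveball(rows, n_trades, seed=0):
    """Randomize reader-to-sentence marks while preserving row and column totals.

    `rows` is a list of sets: rows[r] holds the sentence ids reader r marked.
    """
    rng = random.Random(seed)
    rows = [set(r) for r in rows]
    for _ in range(n_trades):
        i, j = rng.sample(range(len(rows)), 2)
        shared = rows[i] & rows[j]
        only_i = sorted(rows[i] - shared)
        only_j = sorted(rows[j] - shared)
        pool = only_i + only_j  # marks that can be traded between the two readers
        rng.shuffle(pool)
        rows[i] = shared | set(pool[:len(only_i)])
        rows[j] = shared | set(pool[len(only_i):])
    return rows

def column_totals(rows):
    """How many readers marked each sentence."""
    return Counter(s for r in rows for s in r)

# Hypothetical marks from four readers over six sentences.
observed = [{0, 1, 2}, {2, 3}, {0, 4, 5}, {1, 5}]
shuffled = curveball(observed, n_trades=50, seed=3)
```

Comparing a pairwise agreement statistic on `observed` against its distribution over many such shuffles gives z-scores of the kind the abstract reports, since the shuffles retain mark density and sentence popularity but destroy reader-specific structure.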

2606.11361 2026-06-11 cs.IR cs.CL 新提交

A PubMed-Scale Dataset of Structured Biomedical Abstracts

一个PubMed规模的生物医学结构化摘要数据集

Chia-Hsuan Chang, Haerin Song, Brian Ondov, Hua Xu

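AI summary: To address the large share of unstructured PubMed abstracts that hinders downstream text processing, the authors build a structured-abstract corpus of 23.2 million records, 5.9 million parsed from official XML and 17.2 million labeled automatically by a large language model, all unified under a five-section format.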

详情
Comments
Data and code for this work are available at this https URL and this https URL, respectively
AI中文摘要

结构化摘要对于生物医学文献处理至关重要,它有助于信息检索、文本挖掘和知识综合。然而,PubMed中索引的绝大部分摘要仍然是非结构化的,这给下游文本处理工作流程和应用带来了重大瓶颈。为解决这一限制,我们引入了Structured PubMed,这是一个从完整PubMed数据库编译而来的全面语料库,包含超过2320万条研究文章记录,每条记录都带有节标签。该语料库分为两个不同的子集:一个包含590万条作者结构化摘要的集合,这些摘要从官方XML文件中解析而来;另一个包含1720万条原本非结构化摘要的自动标注集合,这些摘要通过逐字提取的大语言模型流水线进行结构化。每条记录都统一在统一的五节模式下,并映射到其原始PubMed标识符、出版类型和出版日期。该数据集可用于训练句子分类模型、基准测试文本分割架构,并在前所未有的PubMed范围内进行大规模、特定节的信息提取。

英文摘要

Structured abstracts are important for biomedical literature processing because they facilitate information retrieval, text mining, and knowledge synthesis. However, a vast portion of abstracts indexed in PubMed remain unstructured, presenting a significant bottleneck for downstream text-processing workflows and applications. To resolve this limitation, we introduce Structured PubMed, a comprehensive corpus of section-labeled biomedical abstracts compiled from the complete PubMed database, encompassing over 23.2 million research-article records. The corpus is divided into two distinct subsets: a collection of 5.9 million author-structured abstracts parsed from official XML files, and an automatically labeled collection of 17.2 million originally unstructured abstracts structured via a verbatim-extraction Large Language Model pipeline. Every record is harmonized under a unified five-section schema and mapped to its original PubMed identifier, publication type, and publication date. This dataset can be used to train sentence-classification models, benchmark text-segmentation architectures, and perform large-scale, section-specific information extraction at an unprecedented PubMed-wide scale.

2606.11350 2026-06-11 cs.CL cs.IR 新提交

When More Documents Hurt RAG: Mitigating Vector Search Dilution with Domain-Scoped, Model-Agnostic Retrieval

当更多文档损害RAG:利用领域限定、模型无关的检索缓解向量搜索稀释

Nabaraj Subedi, Ahmed Abdelaty, Shivanand Venkanna Sheshappanavar

Affiliations: Dept. of Electrical Engineering & Computer Science, University of Wyoming; Dept. of Civil, Architectural Engineering & Construction Management, University of Wyoming

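AI summary: To counter the performance drop that vector search dilution causes when retrieval-augmented generation runs over heterogeneous document collections, the authors propose MASDR-RAG, a domain-scoping method based on organizational metadata. It raises P@10 to 0.86 and exposes a precision-faithfulness paradox in multi-agent orchestration.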

详情
Comments
24 pages, 8 figures, 30 tables. Preprint under review
AI中文摘要

当检索增强生成扩展到大规模、异构的文档集合时,其性能会下降,因为密集相似性失去了区分能力,top-k检索越来越多地返回语义相似但上下文不正确的块。我们将这种失败模式称为向量搜索稀释。即使使用混合密集+稀疏检索,我们在部署的怀俄明州交通部语料库中直接观察到了这一点:当文档从54篇扩展到1128篇(88907个块)时,准确率从75%下降到40%以下。为了解决这种稀释问题,我们提出了MASDR-RAG(用于RAG的多智能体领域限定检索),并在200个专家验证的查询上进行了评估,涉及五个LLM骨干、六个语料库和两个索引栈。我们的结果表明,使用组织元数据进行领域限定是关键修复,显著将P@10从0.77提高到0.86(p < 0.05)。此外,我们对多智能体编排的研究揭示,高度配置依赖会导致我们所谓的精度-忠实度悖论。基于这些不同的结果,我们的实用建议很简单:先限定领域,然后执行一次合成调用,将完整的多智能体编排保留给真正多领域的语料库,并配合原生工具调用骨干。代码和数据将在接收后公开。

英文摘要

Retrieval-augmented generation degrades when scaled to large, heterogeneous document collections, where dense similarity loses discriminative power and top-k retrieval increasingly returns semantically similar but contextually incorrect chunks. We refer to this failure mode as vector search dilution. Even when using hybrid dense+sparse retrieval, we observed this firsthand in a deployed Wyoming Department of Transportation corpus, where scaling from 54 to 1,128 documents (88,907 chunks) reduced accuracy from 75% to below 40%. To address this dilution, we propose MASDR-RAG (Multi-Agent Scoped Domain Retrieval for RAG) and evaluate it on 200 expert-validated queries across five LLM backbones, six corpora, and two index stacks. Our results indicate that domain scoping using organizational metadata is the key fix, significantly improving P@10 from 0.77 to 0.86 (p < 0.05). Furthermore, our investigation of multi-agent orchestration revealed a high degree of configuration dependence, creating what we call the precision-faithfulness paradox. Based on these varied outcomes, our practical recommendation is simple: scope first, then perform a single synthesis call, reserving full multi-agent orchestration for genuinely multi-domain corpora paired with native-tool-call backbones. Code and data will be made public upon acceptance.
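
A minimal sketch of the "scope first" idea, assuming each chunk carries a single `domain` metadata tag and that scoping is an exact-match filter applied before similarity ranking. The field names, the cosine scorer, and the helper functions are illustrative assumptions, not the MASDR-RAG implementation, whose routing details are not given in the abstract. `precision_at_k` shows how a P@10-style figure is computed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def scoped_top_k(query_vec, query_domain, chunks, k=10):
    """Keep only chunks from the query's domain, then rank by similarity.

    chunks: list of dicts with hypothetical keys 'id', 'domain', 'vec'.
    """
    candidates = [c for c in chunks if c["domain"] == query_domain]
    ranked = sorted(candidates, key=lambda c: cosine(query_vec, c["vec"]), reverse=True)
    return [c["id"] for c in ranked[:k]]


def precision_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of the top-k retrieved ids that are relevant."""
    return sum(1 for i in retrieved_ids[:k] if i in relevant_ids) / k
```

Filtering before ranking removes chunks that are semantically close to the query but belong to the wrong part of the organization, which is the dilution case the abstract describes.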

2605.02411 2026-06-11 cs.AI cs.IR cs.LG cs.MA 版本更新

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

FitText: 通过模因检索演化智能体工具生态

Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang

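AI summary: To bridge the semantic gap between users' task descriptions and tool documentation, the authors propose FitText, a framework that embeds retrieval in the reasoning loop and improves tool retrieval through iterative refinement of natural-language pseudo-tool descriptions and memetic evolutionary selection.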

详情
AI中文摘要

用户描述任务的方式与工具文档之间存在语义鸿沟。随着API生态扩展到数万个端点,仅凭初始查询的静态检索无法弥合这一鸿沟:智能体对其所需工具的理解在执行过程中不断演变,但其工具集却保持不变。我们指出,这种检索接口(而非规划)是端到端智能体性能的约束瓶颈,并引入FitText——一个无需训练的框架,通过将检索直接嵌入智能体的推理循环中,使其动态化。FitText将检索视为测试时假设的演化:智能体生成自然语言的伪工具描述(关于所需工具的可修正信念),利用检索反馈迭代优化,并通过随机生成探索多样化的替代方案。模因检索在候选描述上施加进化选择压力,并由避免冗余搜索的工具记忆引导。在ToolRet(三个领域)上,FitText的重构策略在所有基模型上相比静态查询检索将NDCG@5提升了2.7至10.6个点;在StableToolBench(16,464个API)上使用GPT-5.4-mini时,模因检索达到了84.3%的合并通过率,相比静态查询检索绝对提升了26.7个点。

英文摘要

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
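
For readers unfamiliar with the metric reported above, here is a minimal sketch of NDCG@k with the common log2 discount. It assumes graded relevance labels are available for the ranked tools and for every relevant tool in the collection. The function names are illustrative, and the benchmark's own gain and discount conventions may differ.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances, k=5):
    """DCG of the ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance of each returned item, in rank order.
    all_relevances: relevance of every relevant item in the collection,
    used to build the ideal ordering.
    """
    ideal = dcg_at_k(sorted(all_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0
```

A ranking that places the single relevant tool first scores 1.0, and moving it to second place drops the score to 1/log2(3), about 0.63.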

2606.11204 2026-06-11 cs.CL cs.IR 新提交

Benchmarking Large Language Models for Safety Data Extraction

大型语言模型在安全数据提取中的基准测试

Jonas Grill, Thomas Bayer, Sören Berlinger

Affiliations: SAP SE; Institute for Digital Transformation, Ravensburg-Weingarten University

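AI summary: Given the heterogeneous formats of Safety Data Sheets (SDS), this study benchmarks the extraction performance of four large language models (LLMs) under text-based and multimodal processing. Text-based Gemini 1.5 Pro with chain-of-thought prompting is the most accurate (84%), but no model reaches the 90% threshold for reliable deployment.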

详情
Comments
18 pages, 8 figures, submitted to Applied Intelligence
AI中文摘要

从安全数据表(SDS)中准确提取结构化信息在工业安全中仍具挑战性,原因在于文档格式异构以及传统基于规则的方法的局限性。本研究对最先进的大型语言模型(LLM)在自动化SDS数据提取方面进行了基准测试,比较了基于文本和多模态处理流水线。我们系统评估了四种模型:Gemini 1.5 Pro、GPT-4o、Claude 3.7 Sonnet和Llama 3.1-70B,采用三种提示策略:零样本、少样本和思维链。评估框架在超过50,000个提取数据字段上评估了准确性、延迟和成本。结果显示,基于文本的提取在所有指标上始终优于多模态处理。结合思维链提示的Gemini 1.5 Pro达到了最高准确率(84%),优于GPT-4o(81%)和Claude 3.7 Sonnet(79%)。然而,没有模型超过可靠实际部署通常所需的90%准确率阈值。这些发现表明,通用LLM在无监督工业使用中尚不够稳健,尽管性能表明通过任务特定微调具有强大潜力。未来研究应关注领域自适应训练、模型校准以及集成人在回路验证,以确保安全关键可靠性。

英文摘要

Accurate extraction of structured information from Safety Data Sheets (SDS) remains challenging in industrial safety due to heterogeneous document formats and the limitations of traditional rule-based methods. This study benchmarks state-of-the-art Large Language Models (LLMs) for automated SDS data extraction, comparing text-based and multimodal processing pipelines. We systematically evaluate four models: Gemini 1.5 Pro, GPT-4o, Claude 3.7 Sonnet, and Llama 3.1-70B, across three prompting strategies: zero-shot, few-shot, and chain-of-thought. The evaluation framework assessed accuracy, latency, and cost across more than 50,000 extracted data fields. Results show that text-based extraction consistently outperforms multimodal processing across all metrics. Gemini 1.5 Pro combined with a Chain-of-Thought prompt achieved the highest accuracy (84%), outperforming GPT-4o (81%) and Claude 3.7 Sonnet (79%). However, no model surpassed the 90% accuracy threshold commonly required for reliable real-world deployment. These findings indicate that general-purpose LLMs are not yet robust enough for unsupervised industrial use, though performance suggests strong potential with task-specific fine-tuning. Future research should focus on domain-adapted training, model calibration, and the integration of Human-in-the-Loop verification to ensure safety-critical reliability.

2606.11199 2026-06-11 cs.CL cs.AI cs.IR cs.LG 新提交

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

NightFeats @ MMU-RAGent NeurIPS 2025: 面向文本到文本轨道的上下文优化多智能体RAG系统

Quentin Fever, Naziha Aslam

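AI summary: Presents NightFeats, a structured multi-agent RAG system that decomposes knowledge synthesis into three phases (retrieval, curation, and composition) and introduces temporal-semantic reranking, contradiction reconciliation, and a citation-preserving architecture. It surpasses proprietary baselines in the MMU-RAGent competition.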

详情
Comments
5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025
AI中文摘要

我们提出NightFeats,一个结构化的多智能体检索增强生成(RAG)系统,提交至NeurIPS 2025的MMU-RAGent竞赛,并在文本到文本轨道中获得最佳动态评估奖。本文并非以基准最大化为目标,而是提出一个原则性流水线,将知识合成分解为三个协调阶段:检索、策展和组合,每个阶段由显式的中间表示和交接契约控制。受智能体上下文工程(ACE)启发,该系统引入时序语义重排序、有界矛盾协调和保留引用的组合作为核心架构原语。竞赛结果表明,NightFeats在LLM-as-a-Judge和人类Likert评估中超越了包括Claude-SonnetV2和Nova-Pro在内的商业基线,证实了架构透明性和可验证证据基础比单纯优化自动相似度指标的系统更符合人类偏好。

英文摘要

We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.

2606.10120 2026-06-11 cs.IR cs.AI cs.HC 版本更新

MetaPlate: Counterfactual-Guided RAG-LLM Tool for Personalized Food Recommendation and Hyperglycemia Prevention

MetaPlate: 反事实引导的RAG-LLM工具用于个性化食物推荐和高血糖预防

Asiful Arefeen, Carol Johnston, Hassan Ghasemzadeh

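AI summary: Proposes MetaPlate, a framework that combines counterfactual explanations, machine-learning prediction, and a RAG-LLM to generate personalized meal recommendations for preventing postprandial hyperglycemia. An evaluation with registered dietitians supports its feasibility and effectiveness.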

详情
AI中文摘要

餐后高血糖是代谢紊乱的关键风险因素;然而,现有的饮食指导通常是静态的、不切实际的且个性化不足,提供的建议难以遵循或效果不佳。尽管最近的进展利用连续血糖监测(CGM)和机器学习来预测血糖反应,但这些方法主要是预测性的,缺乏可操作的指导。此外,推荐系统常常与用户目标不一致,且需要大量输入。我们提出了MetaPlate,一个反事实解释(CF)引导的、上下文感知的决策支持框架,用于生成个性化膳食建议,以减轻健康成年人的餐后血糖波动。MetaPlate整合了多模态数据,包括来自25名个体的CGM读数、可穿戴设备衍生的生理信号以及用户提供的膳食输入,以建模餐前上下文。一个机器学习模型预测血糖反应,而CF优化模块通过调整膳食组成(修改宏量营养素数量)来维持血糖水平在目标范围内(≤140 mg/dL)。基于LLM的检索增强生成(RAG)层通过使用USDA食品数据库的约束搜索生成人类可读的建议,增强了可解释性。我们通过结构化的专家在环评估,与注册营养师(RDs)一起评估MetaPlate,比较提示优化前后的性能。结果显示,在膳食真实性、份量适宜性和推荐可能性方面有所改进,专家反馈表明从临床不可行的输出转向了可操作、上下文适宜的建议。我们的发现强调了领域知识和结构化约束在LLM驱动系统中的重要性,并突出了MetaPlate作为实时个性化膳食决策支持工具的潜力。

英文摘要

Postprandial hyperglycemia is a key risk factor for metabolic disorders; however, existing dietary guidance is often static, impractical, and insufficiently personalized, providing recommendations that are difficult to follow or not impactful. While recent advances leverage continuous glucose monitoring (CGM) and machine learning to predict glycemic responses, these approaches are largely predictive and lack actionable guidance. Moreover, recommendation systems are often misaligned with user goals and require extensive input. We present MetaPlate, a counterfactual explanation (CF) guided, context-aware decision-support framework that generates personalized meal recommendations to mitigate postprandial glucose excursions in healthy adults. MetaPlate integrates multimodal data, including CGM readings, wearable-derived physiological signals, and user-provided meal inputs from 25 individuals, to model pre-meal context. A machine learning model predicts glucose response, while a CF optimization module adjusts meal composition by modifying macronutrient amounts to maintain glucose levels within a target range (≤140 mg/dL). An LLM-based retrieval-augmented generation (RAG) layer enhances interpretability by producing human-readable recommendations using constrained search of the USDA food database. We evaluate MetaPlate via a structured expert-in-the-loop assessment with registered dietitians (RDs), comparing performance before and after prompt refinement. Results show improvements in meal realism, portion suitability, and recommendation likelihood, with expert feedback indicating a shift from clinically implausible outputs to actionable, contextually appropriate recommendations. Our findings emphasize the importance of domain knowledge and structured constraints in LLM-driven systems and highlight the potential of MetaPlate as a real-time personalized dietary decision-support tool.

2606.07075 2026-06-11 cs.IR 版本更新

Beyond Matching: Category-Guided Latent Intent Reasoning for Generative Retrieval in E-Commerce

超越匹配:类别引导的潜在意图推理在电商生成式检索中的应用

Fuwei Zhang, Xiaoyu Liu, Jiajie Jin, Jiale Mao, Wei Chen, Dongbo Xi, Yifan Yang, Peng Yan, Zichao Hao, Zhao Zhang, Fuzhen Zhuang

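AI summary: Proposes CaLIR, a framework that uses category-guided latent intent reasoning to bridge the representation gap between queries and item identifiers in generative retrieval, balancing retrieval effectiveness and inference efficiency.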

详情
AI中文摘要

生成式检索通过将用户查询直接映射到产品语义标识符(SID),为电商搜索提供了新范式。然而,电商查询通常简短、嘈杂、属性密集,并与多个类别一致的产品相关联,导致自然语言购物意图与人工构建的商品SID之间存在显著的表示差距。显式的思维链(CoT)推理有助于弥合这一差距,但其额外的生成成本难以与电商在线系统的低延迟要求相协调。为应对这一挑战,我们提出了CaLIR(类别引导的潜在意图推理),一种用于电商生成式检索的类别引导潜在意图推理框架。CaLIR不生成显式的文本理由,而是在SID解码之前学习连续的潜在意图状态,并利用产品类别层次结构作为从粗到细的意图推理的自然支架。具体来说,我们引入层次化语义推理以将潜在状态与类别级购物意图对齐,以及查询级推理增强以建模多正例查询下的多样意图路径。CaLIR进一步将查询特定的动态前缀字典树(由预索引的类别级字典树组装而成)与推理感知的约束解码相结合。在多语言电商搜索数据集上的实验表明,CaLIR在检索效果和推理效率之间取得了比现有方法更好的平衡,同时在诱导层次结构和不同生成骨干上展现出可迁移性和鲁棒性。

英文摘要

Generative retrieval offers a new paradigm for e-commerce search by mapping user queries directly to product Semantic Identifiers (SIDs). However, e-commerce queries are often short, noisy, attribute-heavy, and associated with multiple category-consistent products, creating a substantial representation gap between natural-language shopping intent and artificially constructed item SIDs. Explicit Chain-of-Thought (CoT) reasoning can help bridge this gap, but its extra generation cost is difficult to reconcile with the low-latency requirements of online e-commerce systems. To address this challenge, we propose CaLIR (Category-guided Latent Intent Reasoning), a framework for e-commerce generative retrieval. Rather than generating explicit textual rationales, CaLIR learns continuous latent intent states before SID decoding and uses product category hierarchies as a natural scaffold for coarse-to-fine intent reasoning. Specifically, we introduce hierarchical semantic reasoning to align latent states with category-level shopping intent, and query-wise reasoning enhancement to model diverse intent paths under multi-positive queries. CaLIR further combines a query-specific dynamic prefix trie, assembled from pre-indexed category-level tries, with reasoning-aware constrained decoding. Experiments on multilingual e-commerce search datasets show that CaLIR achieves a better balance between retrieval effectiveness and inference efficiency than existing methods, while also demonstrating transferability and robustness across induced hierarchies and different generative backbones.
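
A minimal sketch of the trie mechanics mentioned above, assuming each SID is a tuple of integer tokens and each category has a pre-built trie stored as nested dicts. It shows how per-category tries could be merged into one query-specific trie and how that trie restricts the next decodable token. The helper names and the nested-dict layout are illustrative assumptions. How CaLIR selects categories or couples the trie with latent reasoning is not described in the abstract.

```python
def build_trie(sids):
    """Build a nested-dict prefix trie from an iterable of SID token tuples."""
    root = {}
    for sid in sids:
        node = root
        for token in sid:
            node = node.setdefault(token, {})
    return root


def merge_tries(tries):
    """Merge several category-level tries into one new trie (inputs untouched)."""
    merged = {}

    def _merge(dst, src):
        for token, child in src.items():
            _merge(dst.setdefault(token, {}), child)

    for trie in tries:
        _merge(merged, trie)
    return merged


def allowed_next(trie, prefix):
    """Tokens that may follow `prefix`; an empty set means the prefix is invalid or complete."""
    node = trie
    for token in prefix:
        if token not in node:
            return set()
        node = node[token]
    return set(node)
```

During decoding, logits for tokens outside `allowed_next(trie, prefix)` would be masked, so only SIDs from the selected categories can be generated.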

2606.05907 2026-06-11 cs.IR cs.LG 版本更新

Knowledge Manifold: A Riemannian Geometric Framework for Semantic Mapping and Geodesic Analysis of Scientific Literature

知识流形:用于科学文献语义映射和测地线分析的黎曼几何框架

Tomonaga Okabe, Kazuhiko Komatsu

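AI summary: Proposes the knowledge manifold framework, which uses character n-gram TF-IDF, SPH interpolation, Gaussian process regression, and Riemannian geodesic paths for semantic mapping of literature, virtual knowledge generation, and discovery of conceptual bridges.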

详情
AI中文摘要

我们提出了知识流形:一个黎曼几何空间,其中文档语料库根据从字符n-gram TF-IDF表示中导出的语义位置关系进行排列。该框架包含五个紧密耦合的阶段。首先,每篇文档被转换为字符级n-gram TF-IDF向量(4-7克,最多250,000个特征,L2归一化),并通过带有排斥、方差和中心正则化项的约束应力最小化嵌入到二维知识地图中。其次,通过使用三次样条核的平滑粒子流体动力学(SPH)插值估计任意查询点的知识,得到可进行语言表征的插值TF-IDF特征向量。第三,从SPH插值图计算0、45和90度方向的知识梯度,并通过内积和余弦相似度量化成对方向相似性。第四,一个高斯过程回归(GPR)模型,使用在10维SVD投影上拟合的Constant × RBF + White核,提供查询点的贝叶斯后验均值、不确定性估计和每篇文档的贡献率。第五,通过最小化由SPH诱导度量张量导出的离散黎曼路径能量,使用L-BFGS-B算法和七个确定性初始路径候选,获得知识空间中的测地线。我们将该公式应用于20篇纤维增强复合材料与航空航天结构力学论文的语料库,表明语义地图恢复了有意义的研究聚类,测地线路径揭示了遥远主题之间的自然概念桥梁,并且SPH/GPR插值能够生成虚拟知识:描述未研究但几何预测的研究方向的假设论文摘要。

英文摘要

We present the knowledge manifold: a Riemannian geometric space in which a corpus of documents is arranged according to semantic positional relationships derived from character n-gram TF-IDF representations. The framework proceeds in five tightly coupled stages. First, each document is converted to a character-level n-gram TF-IDF vector (4-7 grams, up to 250,000 features, L2-normalized) and embedded in a two-dimensional knowledge map via constrained stress minimization with repulsion, variance, and centering regularizers. Second, knowledge at an arbitrary query point is estimated through Smoothed Particle Hydrodynamics (SPH) interpolation using a cubic-spline kernel, yielding an interpolated TF-IDF feature vector that can be linguistically characterized. Third, directional knowledge gradients at 0, 45, and 90 degrees are computed from the SPH interpolation map, and pairwise directional similarity is quantified via inner product and cosine similarity. Fourth, a Gaussian Process Regression (GPR) model, with a Constant x RBF + White kernel fitted on a 10-dimensional SVD projection, provides a Bayesian posterior mean, uncertainty estimate, and per-document contribution rate at the query point. Fifth, geodesics in the knowledge space are obtained by minimizing a discrete Riemannian path energy derived from the SPH-induced metric tensor, using L-BFGS-B with seven deterministic initial-path candidates. We apply the formulation to a corpus of 20 papers in fiber-reinforced composite materials and aerospace structural mechanics, showing that the semantic map recovers meaningful research clusters, geodesic paths reveal natural conceptual bridges between distant topics, and SPH/GPR interpolation enables the generation of virtual knowledge: hypothetical paper abstracts describing unstudied but geometrically predicted research directions.
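
A minimal sketch of the second stage described above, assuming the standard two-dimensional cubic-spline SPH kernel and a Shepard-style normalization (dividing by the sum of kernel weights). Both choices, along with the function names and the smoothing length `h`, are assumptions made for illustration; the abstract does not state the exact kernel normalization or interpolation formula the authors use.

```python
import math


def cubic_spline_2d(r, h):
    """Standard 2-D cubic-spline SPH kernel with support radius 2h."""
    q = r / h
    sigma = 10.0 / (7.0 * math.pi * h * h)
    if q < 1.0:
        return sigma * (1.0 - 1.5 * q * q + 0.75 * q ** 3)
    if q < 2.0:
        return sigma * 0.25 * (2.0 - q) ** 3
    return 0.0


def sph_interpolate(query, points, features, h):
    """Kernel-weighted average of document feature vectors at `query`.

    points: 2-D map positions of the documents.
    features: one feature vector (e.g. TF-IDF weights) per document.
    Returns a zero vector when no document lies within 2h of the query.
    """
    weights = [cubic_spline_2d(math.dist(query, p), h) for p in points]
    total = sum(weights)
    dim = len(features[0])
    if total == 0.0:
        return [0.0] * dim
    return [sum(w * f[j] for w, f in zip(weights, features)) / total for j in range(dim)]
```

A query point midway between two documents receives the average of their feature vectors, which is what allows an interpolated vector to be read off as a blend of neighbouring papers' vocabulary.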

2605.31506 2026-06-11 cs.IR cs.CL 版本更新

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

评估多源RAG中的事实密度:医学AI准确性研究

Michael R. DeMarco

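AI summary: To address standard RAG pipelines overlooking high-density factual evidence because of the Expert Blindness Effect, the paper proposes Factual Density (FD*) as a retrieval optimization signal, uses probabilistic factuality analysis for preprocessing and Z-score normalization to remove length bias, and achieves 100% systematic review coverage on the HealthFC benchmark.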

详情
Comments
16 pages, 8 tables. Includes Experiment 3 results (n=11, Wilcoxon p=0.0619). Preliminary findings; powered Experiment 3 and Graph RAG extension identified as future work. Updated from v1
AI中文摘要

检索增强生成(RAG)是当前将AI锚定于现实世界事实的行业标准。传统检索方法依赖关键词匹配和主题接近度,根据内容与用户查询的相似程度进行排序。但它们并未衡量内容实际包含多少经过验证的事实。这种结构性差距被称为专家盲视效应,导致标准RAG管道持续将高密度事实证据埋没,而偏向于同一主题的词汇主导文本。为解决这一差距,本文引入事实密度(FD*),一种新颖的检索优化信号,衡量经过验证的原子声明相对于总标记数的比例。使用NexusAgentics Ghost Audit预处理管道,通过概率事实性分析对原始文本进行事实特异性评分,在语料库摄入前过滤内容。初始公式引入了严重的文档长度混杂因素(Pearson R = -0.8636,p = 2.27e-07)。在长度区间内实施Z-score归一化解决了这一偏差,验证了FD*作为长度无关的密度信号(p = 0.0749)。在HealthFC基准(由医学专家标记为支持、反驳或无证据的750个健康声明)上评估,FD*优化的检索是唯一在top-5结果中实现100%系统综述饱和度的条件,使标准余弦相似度排名前十之外的Cochrane证据浮现。真实验证确认了跨越七个HealthFC支持声明的25个映射。由于语料库-基准对齐的限制,n=50个查询的完整统计验证仍是未来工作,但这些发现确立了事实密度重排序作为一种低成本、高影响力的干预措施,用于提高健康RAG架构的事实精度。

英文摘要

Retrieval-Augmented Generation (RAG) is the current industry standard for grounding AI in real-world facts. Traditional retrieval methods rely on keyword matching and topic proximity, ranking content based on how closely it sounds like the user's query. What they do not measure is how many verified facts the content actually contains. This structural gap, termed the Expert Blindness Effect, causes standard RAG pipelines to consistently bury high-density factual evidence in favor of lexically dominant text on the same topic. To address this gap, this paper introduces Factual Density (FD*), a novel retrieval optimization signal that measures the proportion of verified atomic claims relative to total token count. Using the NexusAgentics Ghost Audit preprocessing pipeline, raw text is scored for factual specificity using probabilistic factuality analysis to filter content before corpus ingestion. An initial formulation introduced a severe document-length confound (Pearson R = -0.8636, p = 2.27e-07). Implementing Z-score normalization within length bins resolved this bias, validating FD* as a length-independent density signal (p = 0.0749). Evaluated against the HealthFC benchmark (750 health claims labeled Supported, Refuted, or No Evidence by medical experts), FD*-optimized retrieval was the only condition to achieve 100% systematic review saturation in top-5 results, surfacing Cochrane evidence that standard cosine similarity ranked outside the top ten. Ground truth verification confirmed 25 mappings across seven HealthFC-supported claims. While full statistical validation across n=50 queries remains future work due to constraints on corpus-benchmark alignment, these findings establish factual density reranking as a low-cost, high-impact intervention for improving factual precision in health RAG architectures.
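
A minimal sketch of the length correction described above, assuming factual density is the count of verified claims divided by the token count and that documents are grouped into token-length bins before z-scoring. The bin edges, field names, and use of the population standard deviation are illustrative assumptions; the paper's FD* definition and binning scheme are not given in the abstract.

```python
from statistics import mean, pstdev


def factual_density(verified_claims, total_tokens):
    """Verified atomic claims per token (0.0 for empty documents)."""
    return verified_claims / total_tokens if total_tokens else 0.0


def z_normalize_within_bins(docs, bin_edges):
    """Z-score each document's density against documents of similar length.

    docs: list of dicts with hypothetical keys 'id', 'claims', 'tokens'.
    bin_edges: ascending upper bounds on token count; longer documents
    fall into a final overflow bin.
    """
    def bin_of(tokens):
        for index, edge in enumerate(bin_edges):
            if tokens <= edge:
                return index
        return len(bin_edges)

    bins = {}
    for doc in docs:
        bins.setdefault(bin_of(doc["tokens"]), []).append(doc)

    scores = {}
    for members in bins.values():
        raw = [factual_density(d["claims"], d["tokens"]) for d in members]
        mu, sd = mean(raw), pstdev(raw)
        for doc, density in zip(members, raw):
            scores[doc["id"]] = (density - mu) / sd if sd > 0 else 0.0
    return scores
```

Because each score is relative to documents of comparable length, a long review is no longer penalized simply for having more tokens per claim than a short snippet.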

2603.19225 2026-06-11 cs.CE cs.AI cs.CL cs.IR q-fin.CP 版本更新

FinTradeBench: A Financial Reasoning Benchmark for LLMs

FinTradeBench: 面向LLM的金融推理基准

Yogesh Agrawal, Aniruddha Dutta, Md Mahadi Hasan, Santu Karmaker, Aritra Dutta

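AI summary: Proposes FinTradeBench, a benchmark that combines company fundamentals with trading signals to evaluate large language models on financial reasoning, and finds that retrieval augmentation offers limited help for numerical and time-series reasoning.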

详情
Comments
9 pages main text, 31 pages total (including references and appendix). 5 figures, 16 tables. Preprint under review. Code and data will be made available upon publication
AI中文摘要

现实世界的金融决策是一个具有挑战性的问题,需要对异构信号进行推理,包括从监管文件中提取的公司基本面和从价格动态计算出的交易信号。最近,随着大语言模型(LLM)的进步,金融分析师开始将它们用于金融决策任务。然而,现有的用于测试这些模型的金融问答基准主要关注公司资产负债表数据,很少评估关于公司股票如何在市场中交易或它们与基本面相互作用的推理。为了利用这两种方法的优势,我们引入了FinTradeBench,这是一个评估金融推理的基准,它整合了公司基本面和交易信号。FinTradeBench包含1400个问题,这些问题基于纳斯达克-100公司十年历史窗口的数据。该基准分为三个推理类别:基本面聚焦、交易信号聚焦以及需要跨信号推理的混合问题。为了确保大规模可靠性,我们采用了一个校准然后扩展的框架,该框架结合了专家种子问题、多模型响应生成、模型内自过滤、数值审计以及人类-LLM判断对齐。我们在零样本提示和检索增强设置下评估了14个LLM,并观察到了明显的性能差距。检索显著改善了对文本基本面的推理,但对交易信号推理的益处有限。这些发现突显了当前LLM在数值和时间序列推理方面的根本性挑战,并激励了未来在金融智能方面的研究。

英文摘要

Real-world financial decision-making is a challenging problem that requires reasoning over heterogeneous signals, including company fundamentals derived from regulatory filings and trading signals computed from price dynamics. Recently, with advances in Large Language Models (LLMs), financial analysts have begun to use them for financial decision-making tasks. However, existing financial question-answering benchmarks for testing these models primarily focus on company balance sheet data and rarely evaluate reasoning about how company stocks trade in the market or their interactions with fundamentals. To leverage the strengths of both approaches, we introduce FinTradeBench, a benchmark for evaluating financial reasoning that integrates company fundamentals and trading signals. FinTradeBench contains 1,400 questions grounded in NASDAQ-100 companies over a ten-year historical window. The benchmark is organized into three reasoning categories: fundamentals-focused, trading-signal-focused, and hybrid questions requiring cross-signal reasoning. To ensure reliability at scale, we adopt a calibration-then-scaling framework that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human-LLM judge alignment. We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap. Retrieval substantially improves reasoning over textual fundamentals but provides limited benefit for trading-signal reasoning. These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

2506.22141 2026-06-11 cs.CL cs.IR

DAPFAM: A Domain-Aware Family-level Dataset to benchmark cross domain patent retrieval

Iliass Ayaou, Denis Cavallucci, Hicham Chibane

详情
英文摘要

Patent prior-art retrieval becomes especially challenging when relevant disclosures cross technological boundaries. Existing benchmarks lack explicit domain partitions, making it difficult to assess how retrieval systems cope with such shifts. We introduce DAPFAM, a family-level benchmark with explicit IN-domain and OUT-domain partitions defined by a new IPC3 overlap scheme. The dataset contains 1,247 query families and 45,336 target families aggregated at the family level to reduce international redundancy, with citation-based relevance judgments. We conduct 249 controlled experiments spanning lexical (BM25) and dense (transformer) backends, document- and passage-level retrieval, multiple query and document representations, aggregation strategies, and hybrid fusion via Reciprocal Rank Fusion (RRF). Results reveal a pronounced domain gap: OUT-domain performance remains roughly five times lower than IN-domain across all configurations. Passage-level retrieval consistently outperforms document-level, and dense methods provide modest gains over BM25, but none close the OUT-domain gap. Document-level RRF yields strong effectiveness-efficiency trade-offs with minimal overhead. By exposing the persistent challenge of cross-domain retrieval, DAPFAM provides a reproducible, compute-aware testbed for developing more robust patent IR systems. The dataset is publicly available on Hugging Face at https://huggingface.co/datasets/datalyes/DAPFAM_patent.
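
A minimal sketch of Reciprocal Rank Fusion as it is usually defined: each document's fused score is the sum of 1/(k + rank) over the input rankings. The constant k=60 is a commonly used default and is an assumption here, as are the function name and the alphabetical tie-break; the abstract does not state the settings used in the DAPFAM experiments.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    rankings: list of lists, each ordered from best to worst.
    Documents missing from a list simply contribute nothing for it.
    Ties are broken alphabetically by id to keep the output deterministic.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

For example, fusing a BM25 list `["p1", "p2", "p3"]` with a dense list `["p2", "p4", "p1"]` ranks `p2` first, because it sits near the top of both lists.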

2512.24787 2026-06-11 cs.IR cs.AI 版本更新

HiGR: Industrial-Scale Hierarchical Generative Slate Recommendation Framework in Tencent

HiGR:腾讯工业级层次化生成式推荐框架

Yunsheng Pang, Zijian Liu, Yudong Li, Shaojie Zhu, Zijian Luo, Chenyun Yu, Sikai Wu, Shichen Shen, Cong Xu, Bin Wang, Kai Jiang, Chengxiang Zhuo, Zang Li

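AI summary: Proposes HiGR, a framework that uses structured semantic IDs and a hierarchical decoder to address planning efficiency and slate-quality alignment for generative recommendation at industrial scale, improving offline quality by more than 10% and speeding up inference 5×.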

详情
AI中文摘要

Slate推荐(在单个展示中向用户呈现排序项目列表)在主流在线平台中无处不在。虽然最近的生成式推荐方法在利用语义ID建模项目序列方面显示出强大潜力,但直接将其应用于工业规模的slate推荐面临根本性脱节:纠缠的SID空间混淆了高级列表规划,长序列上的细粒度自回归解码限制了语义规划效率,而令牌级目标与整体slate质量不一致。在本文中,我们提出HiGR,一个工业规模的层次化生成式slate推荐框架,通过协同设计的流水线弥合这一脱节。首先,HiGR通过前缀对比残差量化VAE(PCRQ-VAE)学习结构化SID。通过强制高级前缀捕获共享语义,PCRQ-VAE创建了一个可控的离散空间,作为高效规划的前提。利用这一结构化空间,我们的层次化Slate解码器(HSD)将自回归建模从纠缠的令牌级解码转变为粗粒度偏好嵌入。该设计显著降低了推理延迟,同时允许显式的全局slate结构规划。最后,这一稳定的规划空间使得基于ORPO的列表级对齐机制能够优化三重目标隐式反馈——排序保真度、真实用户兴趣和多样性。广泛的离线实验表明,HiGR在离线推荐质量上优于最先进的基线超过10%,同时实现了5倍的推理加速。腾讯平台上的在线A/B测试进一步将观看时间提高了1.22%,视频播放量提高了1.73%。HiGR已在多个腾讯平台表面部署,服务数亿用户,证明了其工业规模的适用性。

英文摘要

Slate recommendation, which presents users with a ranked item list in a single display, is ubiquitous across mainstream online platforms. While recent generative recommendation methods have shown strong potential in modeling item sequences with semantic IDs, directly applying them to industrial-scale slate recommendation faces a fundamental disconnect: entangled SID spaces confound high-level list planning, fine-grained autoregressive decoding over long sequences limits semantic planning efficiency, and token-level objectives misalign with holistic slate quality. In this paper, we propose HiGR, an industrial-scale hierarchical generative framework for slate recommendation that bridges this disconnect through a co-designed pipeline. First, HiGR learns structured SIDs via a Prefix-Contrastive Residual Quantized VAE (PCRQ-VAE). By enforcing high-level prefixes to capture shared semantics, PCRQ-VAE creates a controllable discrete space that acts as a prerequisite for efficient planning. Leveraging this structured space, our Hierarchical Slate Decoder (HSD) shifts autoregressive modeling from entangled token-level decoding to coarse-grained preference embeddings. This design significantly reduces inference latency while allowing explicit global slate structure planning. Finally, this stable planning space enables an ORPO-based listwise alignment mechanism to optimize triple-objective implicit feedback: ranking fidelity, genuine user interest, and diversity. Extensive offline experiments show that HiGR outperforms state-of-the-art baselines by over 10% in offline recommendation quality while achieving a 5× inference speedup. Online A/B tests on Tencent platforms further improve watch time by 1.22% and video plays by 1.73%. HiGR has been deployed on multiple Tencent platform surfaces, serving hundreds of millions of users and proving its industrial-scale applicability.

2602.11719 2026-06-11 cs.IR 版本更新

Uncertainty-aware Generative Recommendation

不确定性感知的生成式推荐

Chenxiao Fan, Chongming Gao, Yaxin Gong, Haoyan Liu, Fuli Feng, Xiangnan He

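AI summary: Proposes UGR, an uncertainty-aware generative recommendation framework that uses uncertainty-weighted rewards, difficulty-aware optimization, and explicit confidence alignment to address the unstable training and decision risk that arise when generative recommendation ignores model confidence.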

详情
Comments
Accepted by KDD 2026
AI中文摘要

生成式推荐已成为一种变革性范式,将推荐重新构建为端到端的自回归序列生成任务。尽管前景广阔,现有的偏好优化方法通常依赖于二元结果正确性,遭受我们称之为不确定性盲点的系统性限制。这一问题表现为忽视模型内在的生成置信度、样本学习难度的变化以及缺乏显式的置信度表达,直接导致训练动态不稳定和不可量化的决策风险。在本文中,我们提出不确定性感知生成式推荐(UGR),一个利用不确定性作为自适应优化关键信号的统一框架。UGR协同三种机制:(1)不确定性加权奖励以惩罚置信错误;(2)难度感知优化动态以防止过早收敛;(3)显式置信度对齐以赋予模型置信度表达能力。大量实验表明,UGR不仅产生优越的推荐性能,而且从根本上稳定训练,防止标准方法中常见的性能退化。此外,学习到的置信度能够实现可靠的下游风险感知应用。我们的项目仓库位于:this https URL。

英文摘要

Generative Recommendation has emerged as a transformative paradigm, reformulating recommendation as an end-to-end autoregressive sequence generation task. Despite its promise, existing preference optimization methods typically rely on binary outcome correctness, suffering from a systemic limitation we term uncertainty blindness. This issue manifests in the neglect of the model's intrinsic generation confidence, the variation in sample learning difficulty, and the lack of explicit confidence expression, directly leading to unstable training dynamics and unquantifiable decision risks. In this paper, we propose Uncertainty-aware Generative Recommendation (UGR), a unified framework that leverages uncertainty as a critical signal for adaptive optimization. UGR synergizes three mechanisms: (1) an uncertainty-weighted reward to penalize confident errors; (2) difficulty-aware optimization dynamics to prevent premature convergence; and (3) explicit confidence alignment to empower the model with confidence expression capabilities. Extensive experiments demonstrate that UGR not only yields superior recommendation performance but also fundamentally stabilizes training, preventing the performance degradation often observed in standard methods. Furthermore, the learned confidence enables reliable downstream risk-aware applications. Our project repository is available at: this https URL.

2602.07840 2026-06-11 cs.IR cs.AI 版本更新

SAGE: Scalable AI Governance & Evaluation

SAGE: 可扩展的人工智能治理与评估

Benjamin Le, Xueying Lu, Nick Stern, Wenqiong Liu, Igor Lapchuk, Xiang Li, Baofen Zheng, Kevin Rosenberg, Jiewen Huang, Zhe Zhang, Abraham Cabangbang, Satej Milind Wagle, Jianqiang Shen, Raghavan Muthuregunathan, Abhinav Gupta, Mathew Teoh, Andrew Kirk, Thomas Kwan, Jingwei Wu, Wenjing Zhang

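AI summary: Proposes SAGE, a framework that converts high-quality human product judgment into a scalable evaluation signal through a bidirectional calibration loop, addressing the governance gap in relevance evaluation for large-scale search systems and enabling model iteration and policy oversight at 92× lower cost.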

详情
AI中文摘要

在大规模搜索系统中评估相关性本质上受到人类监督与生产系统高吞吐要求之间的治理差距的限制。传统方法依赖于参与代理或稀疏手动审查,但这些方法往往无法捕捉高影响的相关性失败的全部范围。我们提出了SAGE(可扩展的人工智能治理与评估)框架,该框架将高质量的人类产品判断作为可扩展的评估信号。SAGE的核心是一个双向校准循环,其中自然语言政策、精心编写的先例和一个LLM替代法官共同进化。SAGE系统性地解决语义模糊和不一致,将主观的相关性判断转化为可执行的多维标准,具有接近人类水平的一致性。为了弥合前沿模型推理与工业级推理之间的差距,我们应用教师-学生蒸馏技术,将高保真判断转移到紧凑的学生替代体,成本降低92倍。SAGE部署在LinkedIn搜索生态系统中,通过模拟驱动开发指导模型迭代,蒸馏出符合政策的模型用于在线服务,并实现快速的离线评估。在生产环境中,它推动了政策监督,测量了升级的模型变体并检测到无法被参与指标检测到的回归。集体上,这些措施推动了LinkedIn每日活跃用户的0.25%提升。

英文摘要

Evaluating relevance in large-scale search systems is fundamentally constrained by the governance gap between nuanced, resource-constrained human oversight and the high-throughput requirements of production systems. While traditional approaches rely on engagement proxies or sparse manual review, these methods often fail to capture the full scope of high-impact relevance failures. We present SAGE (Scalable AI Governance & Evaluation), a framework that operationalizes high-quality human product judgment as a scalable evaluation signal. At the core of SAGE is a bidirectional calibration loop where natural-language Policy, curated Precedent, and an LLM Surrogate Judge co-evolve. SAGE systematically resolves semantic ambiguities and misalignments, transforming subjective relevance judgment into an executable, multi-dimensional rubric with near human-level agreement. To bridge the gap between frontier model reasoning and industrial-scale inference, we apply teacher-student distillation to transfer high-fidelity judgments into compact student surrogates at 92× lower cost. Deployed within LinkedIn Search ecosystems, SAGE guided model iteration through simulation-driven development, distilling policy-aligned models for online serving and enabling rapid offline evaluation. In production, it powered policy oversight that measured ramped model variants and detected regressions invisible to engagement metrics. Collectively, these drove a 0.25% lift in LinkedIn daily active users.

2601.22025 2026-06-11 cs.CL cs.AI cs.IR cs.SE Updated version

When Generic Prompt Improvements Hurt: Evaluation-Driven Iteration for LLM Applications

当通用提示改进有害:LLM应用的评估驱动迭代

Daniel Commey

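AI summary: Proposes the Minimum Viable Evaluation Suite (MVES); using a structured evaluation framework and reproducible local experiments, it finds that generic prompt additions do not yield monotonic improvements and argues for evaluation-driven prompt iteration.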

Details
Comments
Technical report. 42 pages, 3 figures. Code, test suites, and result logs: this https URL
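AI-generated abstract (translated from Chinese)

Evaluating large language model (LLM) applications differs from traditional software testing because outputs are probabilistic, semantically variable, and sensitive to changes in prompts and models. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence, covering general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness that covers structured extraction, RAG citation/content compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions in expanded ablations of 30 cases per suite. The results show that, under the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, whereas RAG citation/content compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5, where RAG falls from 26/30 to 9/30 when generic rules are appended to the user prompt. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and the scripts needed to reproduce the reported local ablations.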

English abstract

Evaluating Large Language Model (LLM) applications differs from conventional software testing because outputs are probabilistic, semantically variable, and sensitive to prompt and model changes. This technical report proposes the Minimum Viable Evaluation Suite (MVES), an audit-oriented structure for application-level LLM evaluation. MVES links application categories to failure modes, metrics, required artifacts, and validation evidence across general LLM applications, retrieval-augmented systems, and agentic workflows. We pair the framework with a reproducible local evaluation harness covering structured extraction, RAG citation/content-compliance, and instruction-following checks. Using Ollama with Llama 3 8B Instruct and Qwen 2.5 7B Instruct, we evaluate five prompt conditions over expanded 30-case-per-suite ablations. The results show that, in the tested local conditions, generic prompt additions do not produce monotonic improvements: stronger output-contract prompts improve strict extraction for both models, while RAG citation/content-compliance declines under some generic-rule conditions. The largest observed decline occurs for Qwen 2.5 on RAG when generic rules are appended to the user prompt, from 26/30 to 9/30. These findings support evaluation-driven prompt iteration: prompt changes should be treated as potential regression risks and tested against task-specific suites before deployment. The accompanying repository contains the test suites, prompt variants, evaluation harness, raw result logs, and scripts needed to reproduce the reported local ablations.
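
The idea of treating a prompt change as a potential regression can be illustrated with a minimal Python sketch. This is not the harness from the accompanying repository: the names `SuiteResult` and `find_regressions`, the condition labels, and the 5-point tolerance are all hypothetical. The only figures taken from the abstract are the 26/30 and 9/30 pass counts reported for Qwen 2.5 on the RAG suite.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SuiteResult:
    """Pass count for one prompt condition on one task suite (hypothetical type)."""
    condition: str
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


def find_regressions(reference, candidates, max_drop=0.05):
    """Return candidates whose pass rate is more than max_drop below the reference.

    max_drop is an assumed tolerance, not a value from the report.
    """
    return [c for c in candidates if reference.pass_rate - c.pass_rate > max_drop]


# Pass counts reported in the abstract for Qwen 2.5 on the RAG suite.
# The condition labels are illustrative, not the report's own names.
reference = SuiteResult("reference_prompt", passed=26, total=30)
generic_rules = SuiteResult("generic_rules_in_user_prompt", passed=9, total=30)

regressions = find_regressions(reference, [generic_rules])
for r in regressions:
    print(f"{r.condition}: {r.passed}/{r.total} "
          f"(reference {reference.passed}/{reference.total})")
```

A check of this shape could run before each prompt change is deployed, so that a drop on a task-specific suite blocks the change rather than surfacing later in production.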