arXivDaily arXiv每日学术速递 周一至周五更新
重置

1. 大语言模型与基础模型 4 篇

2602.05992 2026-06-18 cs.CL 版本更新

DSB: Dynamic Sliding Block Scheduling for Diffusion LLMs

DSB: 扩散语言模型的动态滑动块调度

Lizhuo Luo, Shenggui Li, Yonggang Wen, Tianwei Zhang

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 针对扩散语言模型固定块调度忽视语义难度的问题,提出无训练的动态滑动块方法DSB及配套KV缓存机制DSB Cache,显著提升生成质量和推理效率。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

扩散大语言模型(dLLMs)已成为文本生成的一种有前景的替代方案,其特点在于原生支持并行解码。在实践中,块推理对于避免全局双向解码中的顺序错乱以及提高输出质量至关重要。然而,广泛使用的固定、预定义块(朴素)调度忽略了语义难度,使其在质量和效率上均非最优策略:它可能迫使模型对不确定的位置过早做出承诺,同时延迟块边界附近的简单位置。在这项工作中,我们分析了朴素块调度的局限性,并揭示了根据语义难度动态调整调度对于可靠高效推理的重要性。受此启发,我们提出了动态滑动块(DSB),一种无训练的块调度方法,它使用动态大小的滑动块来克服朴素块的刚性。为了进一步提高效率,我们引入了DSB Cache,一种针对DSB量身定制的无训练KV缓存机制。跨多个模型和基准的大量实验表明,DSB与DSB Cache一起,持续提升了dLLMs的生成质量和推理效率。代码已发布在 https://this https URL。

英文摘要

Diffusion large language models (dLLMs) have emerged as a promising alternative for text generation, distinguished by their native support for parallel decoding. In practice, block inference is crucial for avoiding order misalignment in global bidirectional decoding and improving output quality. However, the widely-used fixed, predefined block (naive) schedule is agnostic to semantic difficulty, making it a suboptimal strategy for both quality and efficiency: it can force premature commitments to uncertain positions while delaying easy positions near block boundaries. In this work, we analyze the limitations of naive block scheduling and disclose the importance of dynamically adapting the schedule to semantic difficulty for reliable and efficient inference. Motivated by this, we propose Dynamic Sliding Block (DSB), a training-free block scheduling method that uses a sliding block with a dynamic size to overcome the rigidity of the naive block. To further improve efficiency, we introduce DSB Cache, a training-free KV-cache mechanism tailored to DSB. Extensive experiments across multiple models and benchmarks demonstrate that DSB, together with DSB Cache, consistently improves both generation quality and inference efficiency for dLLMs. Code is released at https://github.com/lizhuo-luo/DSB.

2602.06470 2026-06-18 cs.CL cs.AI 版本更新

Improve Large Language Model Systems with User Logs

通过用户日志改进大型语言模型系统

Changyue Wang, Weihang Su, Qingyao Ai, Xingzhao Yue, Rui Zhang, Xiaojia Chang, Yiqun Liu

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 本文提出UNO框架,通过用户日志提炼规则和偏好对,利用查询反馈驱动聚类处理数据异质性,量化模型知识与日志数据间的认知差距,提升LLM系统性能。

详情
AI中文摘要

扩大训练数据和模型参数规模长期以来推动了大型语言模型(LLMs)的发展,但这一范式日益受到高质量数据稀缺和计算成本上升导致的边际效益递减的限制。因此,近期研究更加关注从真实世界部署中持续学习,其中用户交互日志提供了丰富的真人类反馈和过程知识。然而,从用户日志学习具有挑战性,因为它们是无结构和嘈杂的。传统的LLM系统往往难以区分有用的反馈信号与嘈杂的用户行为,且用户日志收集与模型优化之间的差异(例如,非策略优化问题)进一步加剧了这一问题。为此,我们提出UNO(用户日志驱动的优化),一个统一的框架,用于通过用户日志改进LLM系统(LLMsys)。UNO首先将日志提炼为半结构化的规则和偏好对,然后利用查询和反馈驱动的聚类来管理数据异质性,最后量化模型先验知识与日志数据之间的认知差距。这一评估指导LLMsys自适应地过滤掉嘈杂的反馈并构建不同模块,以处理从用户日志中提取的初级和反思性经验,从而提升未来的响应。广泛的实验表明,UNO在效果和效率上均达到最先进的水平,显著优于检索增强生成(RAG)和基于记忆的基线方法。我们已开源代码至https://github.com/bebr2/UNO。

英文摘要

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .

2603.26557 2026-06-18 cs.CL 版本更新

MemBoost: A Memory-Boosted Framework for Cost-Aware LLM Inference

MemBoost:一种面向成本感知的LLM推理的内存增强框架

Joris Köster, Zixuan Liu, Siavash Khajavi, Zizhan Zheng

发表机构 * University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出MemBoost框架,通过轻量模型重用历史答案和检索支持信息,并选择性将困难查询路由到强模型,以降低LLM推理成本,同时保持回答质量。

Comments ICML MemFM 2026 Workshop

详情
AI中文摘要

大型语言模型(LLM)在现实服务中表现出色,但在跨用户和会话的重复或近似重复查询工作负载下,推理成本高昂。本文提出MemBoost,一种内存增强的LLM服务框架,使轻量模型能够重用先前生成的答案并检索相关支持信息以实现低成本推理,同时选择性地将困难或不确定的查询升级到更强的模型。与主要基于单一响应的标准检索增强生成不同,MemBoost通过支持答案重用、持续内存增长和成本感知路由,专为交互式场景设计。在模拟工作负载下跨多个模型的实验表明,MemBoost显著减少了昂贵的大模型调用和总体推理成本,同时保持了与强模型基线相当的高答案质量。

英文摘要

Large Language Models (LLMs) deliver strong performance but incur high inference cost in real-world services, especially under workloads with repeated or near-duplicate queries across users and sessions. In this work, we propose MemBoost, a memory-boosted LLM serving framework that enables a lightweight model to reuse previously generated answers and retrieve relevant supporting information for cheap inference, while selectively escalating difficult or uncertain queries to a stronger model. Unlike standard retrieval-augmented generation, which primarily grounds a single response, MemBoost is designed for interactive settings by supporting answer reuse, continual memory growth, and cost-aware routing. Experiments across multiple models under simulated workloads show that MemBoost substantially reduces expensive large-model invocations and overall inference cost, while maintaining high answer quality comparable to the strong model baseline.

2410.15595 2026-06-18 cs.AI cs.CL cs.LG 版本更新

A Comprehensive Survey of Direct Preference Optimization: Datasets, Theories, Variants, and Applications

直接偏好优化综述:数据集、理论、变体及应用

Wenyi Xiao, Zechuan Wang, Leilei Gan, Shuai Zhao, Zongrui Li, Ruirui Lei, Wanggui He, Luu Anh Tuan, Long Chen, Hao Jiang, Zhou Zhao, Fei Wu

发表机构 * Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学) Alibaba Group(阿里巴巴集团)

AI总结 综述直接偏好优化(DPO)在理论、变体、数据集和应用方面的进展,指出其作为RL-free替代方案的潜力与局限,并提出未来研究方向。

Comments Accepted by TPAMI 2026. Project page: https://github.com/Mr-Loevan/DPO-Survey

详情
AI中文摘要

随着大语言模型(LLMs)的快速发展,将策略模型与人类偏好对齐变得日益关键。直接偏好优化(DPO)作为一种有前景的对齐方法,作为从人类反馈中强化学习(RLHF)的无RL替代方案而出现。尽管DPO取得了各种进展并存在固有局限性,但文献中目前缺乏对这些方面的深入综述。在这项工作中,我们对DPO中的挑战和机遇进行了全面回顾,涵盖理论分析、变体、相关偏好数据集和应用。具体而言,我们基于关键研究问题对近期DPO研究进行分类,以提供对DPO当前格局的透彻理解。此外,我们提出了几个未来研究方向,为研究社区提供模型对齐的见解。相关论文的更新合集可在此https URL找到。

英文摘要

With the rapid advancement of large language models (LLMs), aligning policy models with human preferences has become increasingly critical. Direct Preference Optimization (DPO) has emerged as a promising approach for alignment, acting as an RL-free alternative to Reinforcement Learning from Human Feedback (RLHF). Despite DPO's various advancements and inherent limitations, an in-depth review of these aspects is currently lacking in the literature. In this work, we present a comprehensive review of the challenges and opportunities in DPO, covering theoretical analyses, variants, relevant preference datasets, and applications. Specifically, we categorize recent studies on DPO based on key research questions to provide a thorough understanding of DPO's current landscape. Additionally, we propose several future research directions to offer insights on model alignment for the research community. An updated collection of relevant papers can be found on https://github.com/Mr-Loevan/DPO-Survey.

2. 机器翻译与跨语言处理 1 篇

2510.15551 2026-06-18 cs.CL cs.AI cs.LG 版本更新

Rethinking Cross-lingual Gaps from a Statistical Viewpoint

从统计视角重新思考跨语言差距

Vihari Piratla, Purvam Jain, Darshan Singh, Trevor Cohn, Preethi Jyothi, Partha Talukdar

发表机构 * Google DeepMind(谷歌深Mind)

AI总结 提出跨语言差距源于目标语言响应方差,通过形式化偏差和无偏误差,并采用推理时集成方法降低方差,使跨语言迁移得分提升8%-50%以上。

Comments 30 pages

详情
AI中文摘要

任何知识片段通常以一种或少数几种自然语言表达在网页或大型语料库中。大型语言模型(LLMs)通过从源语言获取知识,并在使用目标语言查询时使其可访问,从而充当桥梁。跨语言差距是指使用目标语言而非源语言查询知识时准确率的下降。现有研究侧重于导致跨语言差距的建模或训练失败。在这项工作中,我们采取另一种视角来表征跨语言错误的性质,并假设目标语言中响应的方差是造成这一差距的关键原因。我们首次将跨语言差距形式化为有偏误差和无偏误差。通过多种控制方差并减少跨语言差距的推理时干预,我们实证验证了我们的假设。我们展示了几种测试时集成方法,这些方法降低了响应方差,从而将源-目标迁移得分提高了多达12个绝对百分点,在各种LLMs上实现了8%到超过50%的相对提升。

英文摘要

Any piece of knowledge is usually expressed in one or a handful of natural languages on the web or in any large corpus. Large Language Models (LLMs) act as a bridge by acquiring knowledge from a source language and making it accessible when queried using target languages. A cross-lingual gap is a drop in accuracy incurred when querying knowledge in a target language rather than the source language. Existing research focused on modeling or training failures leading to cross-lingual gaps. In this work, we take an alternative view to characterize the nature of cross-lingual error, and hypothesize that the variance of responses in the target language is a key cause of this gap. For the first time, we formalize the cross-lingual gap in terms of biased and unbiased errors. We empirically validate our hypothesis through multiple inference-time interventions that control variance and reduce the cross-lingual gap. We demonstrate a few test-time ensemble methods that reduce response variance, and thereby improve source-target transfer scores by up to 12 absolute points yielding relative gains of 8% to over 50% across various LLMs.

3. 信息抽取、检索与问答 2 篇

2603.29247 2026-06-18 cs.CL cs.AI cs.LG 版本更新

MemRerank: Preference Memory for Personalized Product Reranking

MemRerank:用于个性化产品重排序的偏好记忆

Zhiyuan Peng, Xuyang Wu, Huaixiao Tou, Yi Fang, Yu Gong

发表机构 * Santa Clara University(圣克拉拉大学) Independent Researcher(独立研究者)

AI总结 提出MemRerank框架,通过强化学习将用户购买历史提炼为查询无关的偏好记忆,用于LLM购物代理的个性化重排序,在1-in-5选择任务中准确率提升高达10.61个百分点。

Comments correct author name in metadata

详情
AI中文摘要

基于LLM的购物代理越来越依赖长购买历史和多轮交互来实现个性化,然而,由于噪声、长度和相关性不匹配,将原始历史简单地附加到提示中通常效果不佳。我们提出MemRerank,一个偏好记忆框架,将用户购买历史提炼为简洁、查询无关的信号,用于个性化产品重排序。为了研究这个问题,我们构建了一个端到端的基准测试和评估框架,围绕基于LLM的\ extbf{1-in-5}选择任务,该任务同时衡量记忆质量和下游重排序效用。我们进一步使用强化学习(RL)训练记忆提取器,以下游重排序性能作为监督。使用两个基于LLM的重排序器进行的实验表明,MemRerank始终优于无记忆、原始历史和现成记忆基线,在1-in-5准确率上提高了高达\ extbf{+10.61}个绝对百分点。这些结果表明,显式偏好记忆是代理型电子商务系统中个性化的一种实用且有效的构建模块。

英文摘要

LLM-based shopping agents increasingly rely on long purchase histories and multi-turn interactions for personalization, yet naively appending raw history to prompts is often ineffective due to noise, length, and relevance mismatch. We propose MemRerank, a preference memory framework that distills user purchase history into concise, query-independent signals for personalized product reranking. To study this problem, we build an end-to-end benchmark and evaluation framework centered on an LLM-based \textbf{1-in-5} selection task, which measures both memory quality and downstream reranking utility. We further train the memory extractor with reinforcement learning (RL), using downstream reranking performance as supervision. Experiments with two LLM-based rerankers show that MemRerank consistently outperforms no-memory, raw-history, and off-the-shelf memory baselines, yielding up to \textbf{+10.61} absolute points in 1-in-5 accuracy. These results suggest that explicit preference memory is a practical and effective building block for personalization in agentic e-commerce systems.

2606.01697 2026-06-18 cs.CL 版本更新

RCEM: Robust Conversational Search EMbedder in Distributional Shift

RCEM:配备查询重写技能的嵌入器,用于分布偏移下的鲁棒对话搜索

Kilho Son, Paul Hsu, Cha Zhang, Dinei Florencio

发表机构 * Microsoft(微软)

AI总结 提出RCEM模型,通过将LLM的查询重写能力蒸馏到嵌入模型中,实现无需显式重写的上下文感知检索,在分布偏移下提升鲁棒性。

详情
AI中文摘要

对话搜索在检索增强生成(RAG)系统中变得越来越重要,用户通过包含上下文相关查询的多轮对话与AI助手交互。我们提出RCEM,一种对话式稠密检索模型,它将LLM的查询改写能力蒸馏到嵌入模型中,从而在推理时无需显式查询改写即可实现上下文感知检索。与先前学习直接对话到文档匹配的对话式稠密检索方法不同,RCEM将对话查询嵌入与改写后的查询嵌入对齐,提高了在分布偏移下的鲁棒性。RCEM不需要用于训练的对话查询到文档的相关性映射,这些映射通常昂贵且难以获得高质量。在QReCC、TopiOCQA和TREC CAsT上的大量实验表明,RCEM始终优于强对话检索基线,在分布偏移下取得了特别大的增益,包括Recall@10提升高达20%。RCEM进一步扩展了基础嵌入模型,使其具备对话查询改写能力,同时保留了原有的检索功能,允许单个模型对独立查询和对话查询进行编码,并针对现有文档索引进行搜索,而无需重建检索数据库。

英文摘要

We propose RCEM, a Robust Conversational search EMbedder that is additionally equipped with LLM's query reformulation capability without losing base model's generalization. Unlike prior conversational dense retrieval approaches that learn direct conversation-to-passage matching, RCEM aligns conversations, prepended by special token, to LLM-rewritten queries, while preserving the original embedding space. The unchanged embedding space automatically maps the rewritten-query to the relevant passages. As a result, RCEM (1) reduces overfitting by simplifying the alignment task from long passages to shorter rewritten queries, (2) eliminates the need for conversation-to-passage relevance labels for training, and (3) maintains its original embedding space that allows conversational queries against indexes built by original embedder without rebuilding them. Extensive experiments show that RCEM consistently outperforms prior approaches, achieving up to 30% improvement under distributional shift.

4. 对话系统与智能体 4 篇

2508.04086 2026-06-18 cs.CL 版本更新

ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"

ToolGrad:利用文本“梯度”高效生成工具使用数据集

Zhongyi Zhou, Kohei Uehara, Haoyu Zhang, Jingtao Zhou, Lin Gu, Ruofei Du, Zheng Xu, Tatsuya Harada

发表机构 * Google(谷歌) The University of Tokyo(东京大学) RIKEN AIP(日本学术振兴会AIP) Tohoku University(东北大学)

AI总结 提出ToolGrad框架,通过文本“梯度”引导的迭代过程先构建有效工具使用链再合成用户查询,实现低成本、高成功率的数据生成,训练模型性能超越基线。

Comments ACL 2026 Findings. Source code: https://github.com/zhongyi-zhou/toolgrad

详情
AI中文摘要

先前的工作通过首先生成用户查询,然后进行复杂的工具使用注释(如深度优先搜索)来合成工具使用LLM数据集。这导致不可避免的注释失败和数据生成效率低下。我们引入了ToolGrad,一个反转这种范式的智能体框架。ToolGrad首先通过由文本“梯度”引导的迭代过程构建有效的工具使用链,然后合成相应的用户查询。这种“答案优先”的方法产生了ToolGrad-500,一个以更复杂的工具使用、更低的成本和几乎100%的通过率生成的数据集。实验表明,ToolGrad模型优于在昂贵的基线数据集和专有LLM上训练的模型。ToolGrad源代码、数据集和模型可在https://this URL获取。

英文摘要

Prior work synthesizes tool-use LLM datasets by first generating a user query, followed by complex tool-use annotations like depth-first search (DFS). This leads to inevitable annotation failures and low efficiency in data generation. We introduce ToolGrad, an agentic framework that inverts this paradigm. ToolGrad first constructs valid tool-use chains through an iterative process guided by textual "gradients", and then synthesizes corresponding user queries. This "answer-first" approach led to ToolGrad-500, a dataset generated with more complex tool use, lower cost, and almost 100% pass rate. Experiments show that ToolGrad models outperform those trained on expensive baseline datasets and proprietary LLMs. The ToolGrad source code, dataset, and models are available at https://github.com/zhongyi-zhou/toolgrad.

2603.00026 2026-06-18 cs.CL cs.AI cs.IR 版本更新

ActMem: Bridging the Gap Between Memory Retrieval and Reasoning in LLM Agents

ActMem:弥合LLM代理中记忆检索与推理之间的差距

Xiaohui Zhang, Zequn Sun, Chengyuan Yang, Yaqin Jin, Yazhong Zhang, Wei Hu

发表机构 * State Key Laboratory for Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) Alibaba Group, Hangzhou, China(阿里巴巴集团,杭州,中国) National Institute of Healthcare Data Science, Nanjing University, China(南京大学健康数据科学国家研究院)

AI总结 提出ActMem框架,通过将非结构化对话历史转化为结构化因果语义图,结合反事实推理和常识补全,实现主动因果推理,显著提升LLM代理在复杂记忆依赖任务中的表现。

详情
AI中文摘要

记忆管理对于长期交互中的LLM代理至关重要。当前的记忆框架通常将代理视为被动的“记录器”,并在不理解其深层含义的情况下检索信息。它们可能在需要推理和复杂决策的场景中失败。为了弥合这一关键差距,我们提出了一种新颖的可操作记忆框架ActMem,它将记忆检索与主动因果推理相结合。ActMem将非结构化对话历史转化为结构化的因果语义图。通过利用反事实推理和常识补全,它使代理能够推断隐含约束并解决过去状态与当前意图之间的潜在冲突。此外,我们引入了一个全面的数据集ActMemEval,用于评估代理在逻辑驱动场景中的推理能力,超越了现有记忆基准测试中事实检索的焦点。实验表明,ActMem在处理复杂的、依赖记忆的任务时显著优于基线,为更一致和可靠的智能助手铺平了道路。

英文摘要

Memory management is essential for LLM agents in long-term interactions. Current memory frameworks typically treat agents as passive ``recorders'' and retrieve information without understanding its deeper implications. They may fail in scenarios requiring reasoning and complex decision-making. To bridge this critical gap, we propose a novel actionable memory framework called ActMem that integrates memory retrieval with active causal reasoning. ActMem transforms unstructured dialogue history into a structured causal and semantic graph. By leveraging counterfactual reasoning and commonsense completion, it enables agents to deduce implicit constraints and resolve potential conflicts between past states and current intentions. Furthermore, we introduce a comprehensive dataset ActMemEval to evaluate agent reasoning capabilities in logic-driven scenarios, moving beyond the fact-retrieval focus of existing memory benchmarks. Experiments demonstrate that ActMem significantly outperforms baselines in handling complex, memory-dependent tasks, paving the way for more consistent and reliable intelligent assistants.

2605.30880 2026-06-18 cs.CL cs.AI 版本更新

PatchWorld: Gradient-Free Optimization of Executable World Models

PatchWorld:可执行世界模型的免梯度优化

Jiaxin Bai, Yue Guo, Yifei Dong, Jiaxuan Xiong, Tianshi Zheng, Yixia Li, Tianqing Fang, Yufei Li, Yisen Gao, Haoyu Huang, Zhongwei Xie, Hong Ting Tsang, Zihao Wang, Lihui Liu, Jeff Z. Pan, Yangqiu Song

发表机构 * Hong Kong Baptist University(香港 Baptist 大学) Independent Researcher(独立研究员) HKUST(香港科技大学) Beijing Institute of Technology(北京理工大学) Southern University of Science and Technology(南方科技大学) Wayne State University(韦恩州立大学) University of Edinburgh(爱丁堡大学)

AI总结 提出 PatchWorld 框架,通过反例引导的代码修复将离线轨迹转化为可执行的 Python 世界模型,实现无需梯度优化的符号信念状态程序,在 AgentGym 环境中达到 76.4% 的宏观成功率。

Comments 40 pages

详情
AI中文摘要

文本智能体环境通常被建模为部分可观察马尔可夫决策过程(POMDP),假设模拟器的潜在状态和转移动态对智能体隐藏。然而,很少有工作研究是否可以通过归纳可执行代码来作为部分可观察性下的预测和规划的世界模型。我们引入了 PatchWorld,一个免梯度框架,通过反例引导的代码修复将离线轨迹转化为可执行的 Python 世界模型。PatchWorld 不是用黑盒模型预测下一个观察,而是归纳出符号信念状态程序,其动作更新可以被检查、重放和局部修补。在七个 AgentGym 环境中,PatchWorld-Simple 在评估方法中取得了最高的基于代码的规划分数,在实时一步前瞻中达到 76.4% 的宏观成功率,同时在世界模型预测模块本身内不调用任何 LLM。我们进一步发现,人类指定的残差记忆偏差提高了表面观察保真度,但削弱了决策效用。这暴露了可执行世界模型中的权衡,因为提高观察保真度可能以牺牲动作判别动态为代价,反之亦然。代码可在 https://github.com/HKBU-KnowComp/PatchWorld 获取。

英文摘要

Text-agent environments are typically modeled as partially observable Markov decision processes (POMDPs), assuming that the simulator's latent state and transition dynamics are hidden from the agent. Yet little work has examined whether executable code can be induced to serve as a world model for prediction and planning under partial observability. We introduce PatchWorld, a gradient-free framework that turns offline trajectories into executable Python world models through counterexample-guided code repair. Instead of predicting the next observation with a black-box model, PatchWorld induces symbolic belief-state programs whose action updates can be inspected, replayed, and locally patched. Across seven AgentGym environments, PatchWorld-Simple achieves the highest code-based planning score among evaluated methods, reaching 76.4\% macro success in live one-step lookahead while invoking no LLM calls inside the world-model prediction module itself. We further find that a human-specified residual-memory bias improves surface observation fidelity but weakens decision utility. This exposes a tradeoff in executable world models, since improving observation fidelity can come at the expense of action-discriminative dynamics, and vice versa. Code is available at https://github.com/HKBU-KnowComp/PatchWorld.

2412.15557 2026-06-18 cs.SE cs.CL 版本更新

MORTAR: Multi-turn Metamorphic Testing for LLM-based Dialogue Systems

MORTAR:基于LLM的对话系统的多轮蜕变测试

Aaron Guoxiang Guo, Aldeida Aleti, Neelofar Neelofar, Chakkrit Tantithamthavorn, Yuanyuan Qi, Tsong Yueh Chen

发表机构 * Faculty of Information Technology, Monash University(墨尔本大学信息科技学院) School of Computing Technologies, RMIT University(皇家墨尔本理工大学计算技术学院) School of Science, Computing and Emerging Technologies, Swinburne University of Technology(斯威本理工大学科学、计算与新兴技术学院)

AI总结 提出MORTAR方法,通过多轮蜕变关系自动化生成测试用例,解决LLM对话系统多轮测试中的预言问题,相比单轮测试每个用例发现更多且更高质量的缺陷。

Comments Accepted for publication in IEEE Transactions on Software Engineering (TSE)

详情
AI中文摘要

随着基于LLM的对话系统在日常生活中的广泛应用,质量保证变得比以往更加重要。最近的研究成功引入了在单轮测试场景中识别意外行为的方法。然而,多轮交互是对话系统常见的实际使用方式,但针对此类交互的测试方法仍未得到充分探索。这主要是由于多轮测试中的预言问题,它仍然是对话系统开发人员和研究人员面临的重大挑战。在本文中,我们提出了MORTAR,一种蜕变式多轮对话测试方法,它缓解了测试基于LLM的对话系统时的测试预言问题。MORTAR形式化了对话系统的多轮测试,并自动生成问答对话测试用例,其中包含多种对话级扰动和蜕变关系(MRs)。自动化的MR匹配机制使MORTAR在蜕变测试中具有更高的灵活性和效率。所提出的方法完全自动化,无需依赖LLM评判。在测试六个流行的基于LLM的对话系统时,与单轮蜕变测试基线相比,MORTAR每个测试用例发现的错误数量增加了150%以上,效果显著更好。在错误质量方面,MORTAR在多样性、精确性和唯一性方面揭示了更高质量的错误。MORTAR有望激发更多的多轮测试方法,并帮助开发人员在有限的测试资源和预算下更全面地评估对话系统性能。

英文摘要

With the widespread application of LLM-based dialogue systems in daily life, quality assurance has become more important than ever. Recent research has successfully introduced methods to identify unexpected behaviour in single-turn testing scenarios. However, multi-turn interaction is the common real-world usage of dialogue systems, yet testing methods for such interactions remain underexplored. This is largely due to the oracle problem in multi-turn testing, which continues to pose a significant challenge for dialogue system developers and researchers. In this paper, we propose MORTAR, a metamorphic multi-turn dialogue testing approach, which mitigates the test oracle problem in testing LLM-based dialogue systems. MORTAR formalises the multi-turn testing for dialogue systems, and automates the generation of question-answer dialogue test cases with multiple dialogue-level perturbations and metamorphic relations (MRs). The automated MR matching mechanism allows MORTAR more flexibility and efficiency in metamorphic testing. The proposed approach is fully automated without reliance on LLM judges. In testing six popular LLM-based dialogue systems, MORTAR reaches significantly better effectiveness with over 150\% more bugs revealed per test case when compared to the single-turn metamorphic testing baseline. Regarding the quality of bugs, MORTAR reveals higher-quality bugs in terms of diversity, precision and uniqueness. MORTAR is expected to inspire more multi-turn testing approaches, and assist developers in evaluating the dialogue system performance more comprehensively with constrained test resources and budget.

5. 文本生成、摘要与编辑 2 篇

2601.17226 2026-06-18 cs.CL cs.AI 版本更新

Retell, Reward, Repeat: Reinforcement Learning for Narrative Theory-Informed Story Retelling

复述、奖励、重复:面向叙事理论启发的故事复述的强化学习

David Y. Liu, Xanthe Muston, Dipankar Srirag, Aditya Joshi, Sebastian Sequoiah-Grayson

发表机构 * University of New South Wales(新南威尔士大学)

AI总结 提出RRR强化学习框架,结合结构主义叙事学与标量叙事性,通过d-RLAIF从文本特征中获取训练信号,无需参考输出,提升LLM故事复述的逻辑性、合理性和完整性。

Comments 8 Pages, 7 figures

详情
AI中文摘要

反事实故事复述暴露了LLM在受限叙事解空间中的缺陷,此时它们无法依赖回忆记忆的训练数据。基于真实值的后训练(如SFT)无法教会LLM生成逻辑合理的叙事事件。本文提出Retell, Reward, Repeat (RRR),一个基于强化学习的流水线,将结构主义叙事学与标量叙事性相结合,以教授故事结构。我们扩展了TimeTravel数据集,加入人工标注的叙事平衡阶段,以评估奖励模型。通过d-RLAIF,RRR从文本特征的叙事性中推导训练信号,无需参考输出。评估表明,RRR训练的LLM在逻辑性、合理性和完整性上优于少样本和SFT基线,输出质量通过盲人偏好验证。RRR仅依赖小型查询数据集,为故事讲述——一个目前缺乏有效后训练方法的领域——提供了一种基于语言学、成本效益高的后训练机制。RRR强调了将既定语言学理论整合到当代NLP中的持续相关性。

英文摘要

Counterfactual story retelling exposes LLM shortcomings in constrained narrative solution spaces where they can no longer rely on recalling memorised training data. Ground-truth-based post-training, such as SFT, fails to teach LLMs how to generate logical and rational narrative events. In this paper, we introduce Retell, Reward, Repeat (RRR), an RL-based pipeline synthesising Structuralist Narratology with scalar narrativity to teach storytelling structure. We extend the TimeTravel dataset with human-annotated stages of narrative equilibrium to evaluate reward models. By using d-RLAIF, RRR derives training signals from the narrativity of textual features without the need for reference outputs. Evaluations demonstrate that RRR-trained LLMs outperform few-shot and SFT baselines in logic, rationality, and completeness, with output quality additionally validated by blind human preference. Relying on a small, query-only dataset, RRR provides a linguistically grounded, cost-effective post-training mechanism for storytelling--a domain currently lacking effective post-training methods. RRR highlights the continued relevance of integrating established linguistic theories into contemporary NLP.

2602.15851 2026-06-18 cs.CL cs.AI 版本更新

Narrative Theory-Driven LLM Methods for Automatic Story Generation and Understanding: A Survey

叙事理论驱动的LLM方法在自动故事生成与理解中的应用:综述

David Y. Liu, Aditya Joshi, Paul Dawson

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院) School of Arts and Media(艺术与媒体学院) University of New South Wales (UNSW)(新南威尔士大学)

AI总结 综述叙事理论驱动的大语言模型方法在自动故事生成与理解中的应用,分析现状并指出生成任务在理论应用、后训练方法、非虚构叙事及叙事层次等方面落后于理解任务,提出未来方向。

Comments 31 pages

详情
AI中文摘要

使用大语言模型(LLM)的叙事理论应用在自动故事生成和理解任务中提供了有前景的方法。本综述考察了自然语言处理(NLP)研究如何利用LLM方法处理叙事研究中的不同概念。我们使用叙事学中的既定区分来分类当前工作,并发现以下内容:(a) 叙事文本来源多样,不仅限于文学;(b) 理论综合与验证是潜在成果;(c) 生成任务在多个方面落后于理解任务:理论应用、后训练方法、探索非虚构叙事以及处理超出故事与话语层面的叙事层次。对于未来方向,我们相信,与其追求单一的、通用的“叙事质量”基准,进步可以受益于以下方面的努力:定义和改进针对单个叙事属性的基于理论的度量;继续开展大规模、理论驱动的文学/社会/文化分析;在情境化上下文中生成叙事;以及继续进行实验,其输出可用于验证或完善叙事理论。本文通过概述当前研究工作和更广泛的叙事研究领域,为NLP中更系统、更具理论依据的叙事研究提供了背景基础。

英文摘要

Applications of narrative theories using large language models (LLMs) deliver promising methods in automatic story generation and understanding tasks. Our survey examines how natural language processing (NLP) research uses LLM methods to engage with diverse concepts from narrative studies. We use established distinctions from narratology to categorise ongoing efforts and discover the following: \redtext{(a) narrative texts come from diverse sources beyond just literature, (b) theoretical synthesis and validation are potential outcomes, (c) generation tasks lag behind understanding in several ways: theoretical application, post-training methods, exploring non-fiction narratives and addressing narrative levels beyond fabula and discourse.} For future directions, instead of the pursuit of a single, generalised benchmark for `narrative quality', we believe that progress can benefit from efforts that focus on the following: defining and improving theory-based metrics for individual narrative attributes; continue conducting large-scale, theory-driven literary/social/cultural analysis; generating narratives in situated contexts; and continuing experiments where outputs can be used to validate or refine narrative theories. This work provides a contextual foundation for more systematic and theoretically informed narrative research in NLP by providing an overview to ongoing research efforts and the broader narrative studies landscape.

6. 语义、语法与语言学分析 1 篇

2510.04120 2026-06-18 cs.CL cs.AI 版本更新

Probing Semantic Alignment, Lexical Invariance, and Syntactic Influence in LLM Metaphor Processing

探究大语言模型隐喻处理中的语义对齐、词汇不变性和句法影响

Fengying Ye, Shanshan Wang, Lidia S. Chao, Derek F. Wong

发表机构 * NLP 2 CT Lab, Department of Computer and Information Science, University of Macau(自然语言处理2CT实验室,计算机与信息科学系,澳门大学)

AI总结 通过几何探测、上下文替换和句法扰动三种方法,分析LLM在隐喻处理中的语义漂移、词汇稳定性及句法敏感性,揭示强行为表现可能源于异质信号。

Comments Accepted to ACL 2026

详情
AI中文摘要

大语言模型(LLM)在隐喻检测和解释任务上表现出色,但尚不清楚这种行为成功揭示了隐喻处理的哪些方面。我们通过探测三个互补维度:语义属性对齐、词汇不变性和句法敏感性,对行为证据的局限性进行诊断分析。使用几何探测,我们评估模型生成的解释是否与参考语义属性对齐;通过上下文变化替换,分析隐喻和字面表达之间词汇关联的稳定性;通过受控句法扰动,检查隐喻检测的敏感性。我们的分析表明,LLM生成的解释可能相对于参考属性出现语义漂移;稳定的词汇锚点在不同上下文条件下持续存在,可能支持常规隐喻,同时使需要上下文整合的新奇隐喻产生偏差;检测性能对句法不规则性敏感。这些发现表明,强行为表现可能反映了异质的潜在信号,强调在将隐喻基准解释为稳健、集成语义理解的证据时需要谨慎。

英文摘要

Large language models (LLMs) achieve strong performance on metaphor detection and interpretation tasks, yet it remains unclear what such behavioral success reveals about metaphor processing. We present a diagnostic analysis that examines the limits of behavioral evidence by probing three complementary dimensions: semantic attribute alignment, lexical invariance, and syntactic sensitivity. Using geometric probing, we assess whether model-generated interpretations align with reference semantic attributes; through context-varying substitution, we analyze the stability of lexical associations between metaphorical and literal expressions; and via controlled syntactic perturbations, we examine sensitivity in metaphor detection. Our analysis reveals that LLM-generated interpretations can exhibit semantic drift relative to reference attributes; stable lexical anchors persist across contextual conditions, potentially supporting conventional metaphors while biasing novel metaphors requiring contextual integration; and detection performance is sensitive to syntactic irregularities. These findings suggest that strong behavioral performance may reflect heterogeneous underlying signals, highlighting the need for caution when interpreting metaphor benchmarks as evidence of robust, integrated semantic understanding.

7. 多模态语言处理 5 篇

2509.18588 2026-06-18 cs.CL 版本更新

UniECG: Understanding and Generating ECG in One Unified Model

UniECG: 在一个统一模型中理解与生成心电图

Jiarui Jin, Haoyu Wang, Xiang Lan, Jun Li, Hongyan Li, Shenda Hong

发表机构 * Peking University(北京大学) Shanghai Ocean University(上海海洋大学) the Second Hospital of Tianjin Medical University(天津医科大学第二医院) National University of Singapore(新加坡国立大学)

AI总结 提出UniECG模型,通过两阶段设计实现心电图信号/图像生成解释性文本及根据文本目标生成对应心电图信号,支持交互式心电图教育。

详情
AI中文摘要

心电图解读是医学教育中的基本技能,但学生往往需要更多静态示例来将波形证据与诊断推理联系起来。本文提出UniECG作为迈向交互式心电图教育的一步。UniECG支持两种互补的学习交互:给定心电图信号或图像,它生成基于证据的解释;给定文本学习目标,它生成对应的心电图信号示例用于案例学习。该模型采用两阶段设计。首先,它从心电图信号-图像-文本数据中学习基于证据的心电图解释。其次,它引入特殊的心电图生成标记,并将其隐藏表示与预训练的文本条件心电图扩散模型对齐,实现可控的信号级心电图生成。我们通过基于证据的心电图解释和面向生成的定性分析来评估UniECG,考察其支持解释和案例学习的潜力。UniECG旨在作为教育辅助工具和迈向交互式AI辅助心电图学习的研究步骤,而非临床验证的诊断系统。

英文摘要

Electrocardiogram (ECG) interpretation is a fundamental skill in medical education, yet students often need more than static examples to connect waveform evidence with diagnostic reasoning. This paper presents UniECG as a step toward interactive ECG education. UniECG supports two complementary learning interactions: given an ECG signal or image, it generates an evidence-based explanation; given a textual learning objective, it generates a corresponding ECG signal example for case-based learning. The model follows a two-stage design. First, it learns grounded ECG explanation from ECG signal--image--text data. Second, it introduces special ECG generation tokens and aligns their hidden representations with a pretrained text-conditioned ECG diffusion model, enabling controllable signal-level ECG generation. We evaluate UniECG through grounded ECG explanation and generation-oriented qualitative analysis, examining its potential to support explanation and case-based learning. UniECG is intended as an educational aid and a research step toward interactive AI-assisted ECG learning, rather than a clinically validated diagnostic system.

2601.19792 2026-06-18 cs.CL cs.AI cs.HC 版本更新

LVLMs and Humans Ground Differently in Referential Communication

LVLMs与人类在指称交流中的基础不同

Peter Zeng, Weiling Li, Amie J. Paige, Zhengxiang Wang, Panagiotis Kaliosis, Dimitris Samaras, Gregory Zelinsky, Susan E. Brennan, Owen Rambow

AI总结 通过人类与AI配对的多轮指称交流实验,发现LVLMs无法像人类一样利用共同基础生成和解析指称表达,导致交流不畅。

Comments 27 pages, 16 figures

详情
AI中文摘要

对于生成式AI代理与人类用户有效合作,准确预测人类意图的能力至关重要。但这种协作能力仍然受到一个关键缺陷的限制:无法建模共同基础。我们提出了一个因子设计的指称交流实验,涉及指导者-匹配者配对(人类-人类、人类-AI、AI-人类和AI-AI),他们在多轮重复回合中交互,以匹配与任何明显词汇化标签无关的物体图片。我们表明,LVLMs无法以促进顺畅交流的方式交互式生成和解析指称表达,而这是人类语言使用的基础技能。我们发布了包含356个对话(89对,每对4轮)的语料库,以及用于数据收集的在线流程和用于分析准确性、效率和词汇重叠的工具。

英文摘要

For generative AI agents to partner effectively with human users, the ability to accurately predict human intent is critical. But this ability to collaborate remains limited by a critical deficit: an inability to model common ground. We present a referential communication experiment with a factorial design involving director-matcher pairs (human-human, human-AI, AI-human, and AI-AI) that interact with multiple turns in repeated rounds to match pictures of objects not associated with any obvious lexicalized labels. We show that LVLMs cannot interactively generate and resolve referring expressions in a way that enables smooth communication, a crucial skill that underlies human language use. We release our corpus of 356 dialogues (89 pairs over 4 rounds each) along with the online pipeline for data collection and the tools for analyzing accuracy, efficiency, and lexical overlap.

2606.17372 2026-06-18 cs.CL cs.AI 版本更新

Implicit vs. Explicit Prompting Strategies for LVLMs in Referential Communication

LVLMs在指称通信中的隐式与显式提示策略

Peter Zeng, Amie J. Paige, Weiling Li, Susan E. Brennan, Owen Rambow, Cameron R. Jones

发表机构 * Stony Brook University(石溪大学)

AI总结 本研究通过控制任务差异,比较显式与隐式提示对LVLM生成高效指称表达的影响,发现显式提示下模型能协调高效表达,而隐式提示则失败,揭示了人机通信的关键差异。

详情
AI中文摘要

两项近期研究(Jones等人,2026;Zeng等人,2026)关于LVLM能否协调高效指称表达得出了明显矛盾的结论。我们在控制研究间任务差异的同时,直接比较了它们的提示风格。我们复现了当显式提示时模型可以协调高效指称表达的发现,表明其他任务差异并非导致结果分歧的原因。然而,我们也发现相同的模型无法从更隐式的提示中推断出通信效率的需求,凸显了人类与AI系统通信方式的关键差异。

英文摘要

Two recent studies (Jones et al. (2026); Zeng et al. (2026)) reach apparently contradictory conclusions about whether LVLMs can coordinate on efficient referring expressions. We control for task differences between the studies while directly comparing their prompting styles. We replicate the finding that models can coordinate efficient referring expressions when explicitly prompted to do so, suggesting that other task differences are not responsible for divergent results. However, we also find that the same models fail to infer the need for communicative efficiency from a more implicit prompt, highlighting critical differences between how humans and AI systems communicate.

2606.05409 2026-06-18 cs.CV cs.CL 版本更新

Would you still call this Dax? Novel Visual References in VLMs and Humans

你还会称它为Dax吗?VLM与人类中的新颖视觉参照

Ada Defne Tür, Gaurav Kamath, Joyce Chai, Siva Reddy, Benno Krojer

发表机构 * McGill University(麦吉尔大学) Mila Quebec AI Institute(魁北克人工智能研究所) University of Michigan - Ann Arbor(密歇根大学安娜堡分校) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

AI总结 提出新颖视觉参照数据集(NVRD),通过对比VLM和人类对新颖视觉概念的泛化能力,发现模型在矛盾先验知识时难以习得新概念,且过度泛化。

详情
AI中文摘要

视觉语言模型(VLM)像人类学习者一样,经常接触新的视觉概念,但它们在接触后如何将新颖的视觉参照映射到语言上仍未被充分探索,特别是当这些参照与预训练的先验知识相矛盾时。为了研究这一点,我们提出了新颖视觉参照数据集(NVRD):包含跨越90个视觉概念的19,176张图像,这些概念具有不同层次的新颖性,每个概念最多有20个原始对象的逐渐扰动版本以测试泛化能力。与之前关于熟悉概念视觉增强的工作不同,NVRD包含完全新颖、开放式的刺激,从头构建,模拟人类遇到真正新概念的方式。我们评估了3个开源和2个闭源模型以及2,400个人类判断,以进行直接的人机比较,发现(i)当新概念与先验知识矛盾时,模型难以在上下文中习得它们,以及(ii)虽然模型和人类对视觉扰动表现出相关的敏感性,但模型显著过度泛化,将学到的标签扩展到人类拒绝的刺激上。我们贡献了NVRD作为人类和机器视觉概念学习研究的语料库和基准。

英文摘要

Vision-language models (VLMs), like human learners, are frequently exposed to new visual concepts, but how they map novel visual references to language after exposure remains largely underexplored, particularly when those references contradict prior knowledge from pre-training. To study this, we present the Novel Visual References Dataset (NVRD): 19,176 images spanning 90 visual concepts across different levels of visual novelty, each with up to 20 increasingly perturbed versions of the original object to probe generalization. Unlike prior work on visual augmentations of familiar concepts, NVRD comprises entirely novel, open-ended stimuli constructed from scratch, mirroring how humans encounter genuinely new concepts. We evaluate 3 open- and 2 closed-source models alongside 2,400 human judgments for direct human-model comparison, and find that (i) models struggle to acquire novel concepts in-context when they contradict prior knowledge, and (ii) while models and humans show correlated sensitivity to visual perturbations, models significantly overgeneralize, extending learned labels to stimuli that humans reject. We contribute NVRD as a corpus and benchmark for research on visual concept learning in both humans and machines.

2606.15088 2026-06-18 cs.SD cs.CL eess.AS 版本更新

When the Same Musical Knowledge Forgets Differently: A Clean Probe of Pathway-Dependent Forgetting

当相同的音乐知识以不同方式遗忘:路径依赖遗忘的干净探测

Yu Liu, Zhiwei Yang, Wenxiao Zhang, Cong Cao, Fangfang Yuan, Kun Peng, Haimei Qin, Lei Jiang, Jin B. Hong, Hao Peng, Yanbing Liu

发表机构 * Institute of Information Engineering, CAS(中国科学院信息工程研究所) School of Cyber Security, UCAS(中国科学院大学网络空间安全学院) The University of Western Australia(西澳大利亚大学) Beihang University(北京航空航天大学)

AI总结 提出配对路径控制协议(PPCP),发现多模态模型中通过文本路径获取的知识比音频路径更易遗忘,且该效应不受架构深度影响,主要源于输入表示差异。

详情
AI中文摘要

一个模型可以通过听音频或阅读文本描述来学习钢琴曲《致爱丽丝》是平静而沉思的,但当这些知识后来面临遗忘风险时,获取路径是否重要?多模态模型中的遗忘研究衡量了在适应过程中丢失了哪些知识,但尚未探究获取路径是否影响知识被遗忘的难易程度。我们将这个未经检验的前提称为路径不变假设。音乐理解提供了一个干净的测试,因为一段音乐剪辑和一段规范的文本描述可以对齐到相同的感知内容,使得相同的知识单元可以通过听或读进入模型,而目标保持不变。在多个架构不同的音频-语言模型中,我们观察到一致的不对称性:在相同的适应压力下,文本路径知识比匹配的音频路径知识更容易被遗忘。为了将这种效应归因于路径而非混淆因素,我们引入了配对路径控制协议(PPCP),这是一个三阶段设计,建立匹配的路径基线,在相同的知识池上以对称监督激活两条路径,并对两条路径施加相同的遗忘压力。这种差距在模型间和增益控制分析中稳定存在,当矛盾覆盖被替换为正确标签的跨域学习时仍然存在,在单模态压力下仍然存在,并且不会被轻量级重放消除。两个独立的路径深度控制证实,该效应不能由架构深度解释,表明输入表示是主导因素。在PPCP下,我们的结果表明遗忘高度依赖于路径,将获取路径确立为遗忘研究和多模态系统设计的一个新的分析维度。

英文摘要

A model can learn that the piano piece Für Elise is calm and reflective by listening to the audio or by reading a text description, but does it matter which route that knowledge took when it is later at risk of being forgotten? Forgetting research in multimodal models measures what knowledge is lost under adaptation, yet has not asked whether acquisition route affects how easily that knowledge is forgotten. We call this untested premise the Pathway-Invariant Assumption. Music understanding enables a clean test because a music clip and a canonical text description can be aligned to the same perceptual content, allowing the same knowledge unit to enter a model through listening or reading while the target remains fixed. Across multiple architecturally distinct audio-language models, we observe a consistent asymmetry: text-pathway knowledge is forgotten more than matched audio-pathway knowledge under identical adaptation pressure. To attribute this effect to route rather than confounds, we introduce the Paired Pathway Controlled Protocol (PPCP), a three-phase design that establishes matched pathway baselines, activates both pathways under symmetric supervision on the same knowledge pool, and applies identical forgetting pressure to both pathways. The gap is stable across models and gain-controlled analyses, persists when contradictory overwrite is replaced by correct-label cross-domain learning, remains under single-modality pressure, and is not removed by lightweight replay. Two independent routing-depth controls confirm that the effect is not explained by architectural depth, pointing to input representation as the dominant factor. Under PPCP, our results demonstrate that forgetting is highly route-dependent, establishing acquisition route as a new analytical dimension for forgetting research and multimodal system design.

8. 语音语言联合与音频文本 4 篇

2506.12311 2026-06-18 cs.CL cs.SD eess.AS 版本更新

Phonikud: Overcoming Phonetic Underspecification for Hebrew Text-To-Speech

Phonikud:克服希伯来语文本转语音中的语音欠指定问题

Yakov Kolani, Maxim Melichov, Cobi Calev, Morris Alper

发表机构 * Independent Researcher(独立研究者) Reichman University(雷赫曼大学) Tel Aviv University(特拉维夫大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出Phonikud框架,通过开源G2P系统、语料库、基准和评估模型,解决希伯来语TTS中重音等语音特征欠指定问题,实现更准确的音素预测。

Comments Accepted to Interspeech 2026. Project page: https://phonikud.github.io

详情
AI中文摘要

现代希伯来语的文本转语音(TTS)受到该语言正字法复杂性的挑战,现有解决方案忽略了诸如重音等欠指定的语音特征。我们提出了一个更准确的希伯来语TTS框架,包含四个贡献:(1)Phonikud,一个开源的希伯来语字素到音素(G2P)系统,输出完全指定的国际音标(IPA)转录,通过增强基础注音器设计而成。(2)ILSpeech语料库,包含配对的希伯来语音频、文本和专家IPA标注。(3)针对先前未测量的希伯来语G2P转换任务的基准。(4)希伯来语音频到IPA模型,捕获先前忽略的语音细节,用于自动TTS评估。我们的结果表明,Phonikud比先前方法更准确地预测希伯来语音素,并且使用Phonikud语音输入的小型本地TTS模型接近大型专有系统。我们在以下网址发布代码、数据和模型:this https URL。

英文摘要

Text-to-speech (TTS) for Modern Hebrew is challenged by the language's orthographic complexity, with existing solutions ignoring underspecified phonetic features such as stress. We present a framework for more phonetically accurate Hebrew TTS with four contributions: (1) Phonikud, an open-source Hebrew grapheme-to-phoneme (G2P) system that outputs fully-specified International Phonetic Alphabet (IPA) transcriptions, designed by augmenting a base diacritizer. (2) The ILSpeech corpus of paired Hebrew audio, text, and expert IPA annotations. (3) A benchmark for the previously unmeasured task of Hebrew G2P conversion. (4) Hebrew audio-to-IPA models capturing previously disregarded phonetic details for automatic TTS evaluation. Our results show that Phonikud more accurately predicts Hebrew phonemes than prior methods, and that small, local TTS models with phonetic input from Phonikud approach large proprietary systems. We release our code, data, and models at https://phonikud.github.io.

2508.07375 2026-06-18 cs.CL cs.SD eess.AS 版本更新

TurnGuide: Enhancing Meaningful Full Duplex Spoken Interactions via Dynamic Turn-Level Text-Speech Interleaving

TurnGuide: 通过动态轮次级文本-语音交错增强有意义的全双工口语交互

Wenqian Cui, Lei Zhu, Xiaohui Li, Zhihan Guo, Haoli Bai, Lu Hou, Irwin King

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Technologies(华为技术)

AI总结 提出TurnGuide方法,通过动态分割助手语音为对话轮次并交错生成轮次级文本和语音,解决全双工语音语言模型在连续双通道音频中集成离散文本令牌导致的时间对齐问题,显著提升语义连贯性和轮次交互性能。

Comments Interspeech 2026 Long Paper Track

详情
AI中文摘要

全双工语音语言模型(FD-SLMs)是专门的基础模型,旨在通过建模复杂的对话轮次(如打断、反馈和重叠语音)来实现自然的实时口语交互。端到端(e2e)FD-SLMs利用真实世界的双通道对话数据捕捉细微的双说话者对话模式以实现类人交互,但由于语音序列过长和高质量口语对话数据有限,其对话能力往往比纯文本对话有所下降。尽管交错文本-语音生成可以缓解这种退化,但将离散文本令牌集成到连续双通道音频流中可能会破坏流畅交互所需的时间对齐。为了解决这个问题,我们提出了TurnGuide,一种用于e2e FD-SLMs的新型文本-语音交错生成方法,该方法动态地将助手语音分割成对话轮次,并交错生成轮次级文本和语音。这种方法使FD-SLMs能够整合LLMs的语义智能,同时不损害自然的声学流畅性。大量实验表明,TurnGuide不仅显著提升了e2e FD-SLMs生成语义有意义且连贯语音的能力,而且在各种轮次事件上达到了最先进的性能。演示请访问此https URL。代码请访问此https URL。

英文摘要

Full-Duplex Speech Language Models (FD-SLMs) are specialized foundation models designed to enable natural, real-time spoken interactions by modeling complex conversational turn-taking such as interruptions, backchannels, and overlapping speech. End-to-end (e2e) FD-SLMs leverage real-world double-channel conversational data to capture nuanced two-speaker dialogue patterns for human-like interactions, but their conversational abilities often degrade compared to pure-text conversation due to prolonged speech sequences and limited high-quality spoken dialogue data. Although interleaved text-speech generation could mitigate this degradation, integrating discrete text tokens into continuous double-channel audio streams could disrupt the precise time alignment required for fluid interaction. To address this, we propose TurnGuide, a novel text-speech interleaved generation approach for e2e FD-SLMs that dynamically segments assistant speech into dialogue turns and interleaves turn-level text and speech generation. This approach allows FD-SLMs to integrate the semantic intelligence of LLMs without compromising the natural acoustic flow. Extensive experiments show that TurnGuide not only significantly improves e2e FD-SLMs to produce semantically meaningful, coherent speech but also achieves state-of-the-art performance on various turn-taking events. Demos are available at https://dreamtheater123.github.io/TurnGuide-Demo/. Code is available at https://github.com/dreamtheater123/TurnGuide.

2509.14653 2026-06-18 cs.CL 版本更新

UMA-Split: unimodal aggregation for both English and Mandarin non-autoregressive speech recognition

UMA-Split:面向英语和普通话的非自回归语音识别的单峰聚合

Ying Fang, Xiaofei Li

发表机构 * Zhejiang University(浙江大学) School of Engineering, Westlake University(西湖大学工程学院) Institute of Advanced Technology, Westlake Institute for Advanced Study(西湖先进研究学院技术研究所)

AI总结 针对UMA在英语等语言中因token粒度不匹配导致性能下降的问题,提出UMA-Split,通过分割模块使每个聚合帧映射到多个token,提升非自回归语音识别的跨语言性能。

Comments Accepted by ICASSP 2026. Code:https://github.com/FnoY0723/uma_split

详情
AI中文摘要

本文提出了一种基于单峰聚合(UMA)的非自回归模型,用于英语和普通话语音识别。原始的UMA显式地分割并聚合相同文本标记的声学帧(使用先单调递增后递减的单峰权重),以学习比常规连接主义时间分类(CTC)更好的表示。然而,它仅在普通话中表现良好。在其他语言(如英语)中,单个音节可能被分词为多个细粒度标记,或者一个标记跨越少于3个声学帧而无法形成单峰权重,导致其难以处理。为解决此问题,我们提出允许每个UMA聚合帧映射到多个标记,通过一个简单的分割模块,在计算CTC损失之前从每个聚合帧生成两个标记。

英文摘要

This paper proposes a unimodal aggregation (UMA) based nonautoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates acoustic frames (with unimodal weights that first monotonically increase and then decrease) of the same text token to learn better representations than regular connectionist temporal classification (CTC). However, it only works well in Mandarin. It struggles with other languages, such as English, for which a single syllable may be tokenized into multiple fine-grained tokens, or a token spans fewer than 3 acoustic frames and fails to form unimodal weights. To address this problem, we propose allowing each UMA-aggregated frame map to multiple tokens, via a simple split module that generates two tokens from each aggregated frame before computing the CTC loss.

2509.09631 2026-06-18 cs.SD cs.CL cs.CV 版本更新

DiFlow-TTS: Compact and Low-Latency Zero-Shot Text-to-Speech with Discrete Flow Matching

DiFlow-TTS: 基于离散流匹配的紧凑低延迟零样本文本转语音

Ngoc-Son Nguyen, Thanh V. T. Tran, Hieu-Nghia Huynh-Nguyen, Truong-Son Hy, Van Nguyen

AI总结 提出DiFlow-TTS框架,通过离散流匹配和分解离散流去噪器,在零样本TTS中实现高质量与低延迟的平衡。

Comments Accepted at Interspeech 2026 (Long Paper Track)

详情
AI中文摘要

零样本文本转语音(TTS)在复制未见过的声音方面取得了显著进展,但平衡生成质量和推理效率仍然具有挑战性。自回归模型存在高延迟问题,而基于扩散的方法受限于训练时的配置。此外,大多数基于流的方法在连续空间中运行,由于连续令牌空间本质上比离散空间更复杂,这引入了优化挑战。为了解决这些限制,我们提出了DiFlow-TTS,一种基于离散流匹配的新型零样本TTS框架。该模型由一个用于语言建模的确定性音素-内容映射器和一个同时生成韵律和声学令牌流的分解离散流去噪器组成。实验结果表明了我们的方法在多个评估指标上的有效性。

英文摘要

Zero-shot text-to-speech (TTS) has made significant progress in replicating unseen voices, yet balancing generation quality and inference efficiency remains challenging. Autoregressive models suffer from high latency, while diffusion-based approaches are constrained by training-time configurations. Moreover, most flow-based methods operate in continuous space, which introduces optimization challenges because continuous token spaces are inherently more complex than discrete ones. To address these limitations, we propose DiFlow-TTS, a novel zero-shot TTS framework based on discrete flow matching. The model consists of a deterministic Phoneme-Content Mapper for linguistic modeling and a Factorized Discrete Flow Denoiser that simultaneously generates prosody and acoustic token streams. Experimental results demonstrate the effectiveness of our approach across multiple evaluation metrics.

9. 评测、数据集与基准 16 篇

2505.23851 2026-06-18 cs.CL cs.AI cs.SC 版本更新

ASyMOB: Algebraic Symbolic Mathematical Operations Benchmark

ASyMOB:代数符号数学运算基准

Michael Shalyt, Rotem Elimelech, Ido Kaminer

发表机构 * MIT(麻省理工学院) Technion - Israel Institute of Technology(技术学院-以色列理工学院)

AI总结 提出ASyMOB基准,包含35,368个符号数学问题,通过扰动测试揭示大模型在符号数学推理中的鲁棒性不足,并发现LLM与CAS的互补潜力。

Comments Published in ICML2026: https://icml.cc/virtual/2026/poster/63549 Code repository: https://github.com/RamanujanMachine/ASyMOB Complete benchmark dataset: https://huggingface.co/datasets/Shalyt/ASyMOB-Algebraic_Symbolic_Mathematical_Operations_Benchmark

详情
AI中文摘要

大型语言模型(LLM)越来越多地应用于符号数学,然而现有评估常常混淆模式记忆与真正推理。为弥补这一空白,我们提出\textbf{ASyMOB},一个包含\textit{35,368}个经过验证的符号数学问题的高分辨率数据集,涵盖积分、极限、微分方程、级数和超几何函数。与以往基准不同,\textbf{ASyMOB}通过符号、数值和等价保持变换系统地扰动每个种子问题,从而实现对泛化能力的细粒度评估。我们的评估揭示了三个关键发现:(1)大多数模型的性能在微小扰动下崩溃,而顶级系统表现出明显的鲁棒性\textit{机制转变};(2)集成代码工具稳定了性能,尤其对较弱模型;(3)我们识别出计算机代数系统(CAS)失败而LLM成功的例子,以及仅通过LLM-CAS混合方法解决的问题,突显了有前景的集成前沿。\textbf{ASyMOB}作为一个原则性诊断工具,用于衡量和加速构建可验证、可信赖的AI以促进科学发现。

英文摘要

Large language models (LLMs) are increasingly applied to symbolic mathematics, yet existing evaluations often conflate pattern memorization with genuine reasoning. To address this gap, we present ASyMOB, a high-resolution dataset of 35,368 validated symbolic math problems spanning integration, limits, differential equations, series, and hypergeometrics. Unlike prior benchmarks, ASyMOB systematically perturbs each seed problem using symbolic, numeric, and equivalence-preserving transformations, enabling a fine-grained assessment of generalization. Our evaluation reveals three key findings: (1) most models' performance collapses under minor perturbations, while top systems exhibit an apparent regime shift in robustness; (2) integrated code tools stabilize performance, particularly for weaker models; and (3) we identify examples where Computer Algebra Systems (CAS) fail while LLMs succeed, as well as problems solved only via a hybrid LLM-CAS approach, highlighting a promising integration frontier. ASyMOB serves as a principled diagnostic tool for measuring and accelerating progress toward building verifiable, trustworthy AI for scientific discovery.

2601.13836 2026-06-18 cs.CL cs.CV cs.MM 版本更新

FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs

FutureOmni:从全模态上下文中评估多模态大语言模型的未来预测能力

Qian Chen, Jinlan Fu, Changsong Li, Min Zhang, See-Kiong Ng, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳分校) National University of Singapore(新加坡国立大学)

AI总结 提出FutureOmni基准,评估多模态大模型从音视频线索预测未来的能力,发现现有模型在语音密集场景下表现差,并设计OFF训练策略提升性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管多模态大语言模型(MLLMs)展现出强大的全模态感知能力,但它们从音视频线索预测未来事件的能力仍未被充分探索,因为现有基准主要关注回顾性理解。为弥补这一差距,我们引入了FutureOmni,这是第一个旨在从音视频环境中评估全模态未来预测的基准。评估模型需要执行跨模态因果和时间推理,并有效利用内部知识预测未来事件。FutureOmni通过可扩展的LLM辅助、人在回路流水线构建,包含8个主要领域的919个视频和1034个多项选择问答对。对13个全模态和7个仅视频模型的评估表明,当前系统在音视频未来预测方面存在困难,尤其是在语音密集场景中,Gemini 3 Flash达到最佳准确率64.8%。为缓解这一局限,我们整理了一个7K样本的指令微调数据集,并提出全模态未来预测(OFF)训练策略。在FutureOmni以及流行的音视频和仅视频基准上的评估表明,OFF增强了未来预测和泛化能力。我们公开发布所有代码(此 https URL )和数据集(此 https URL )。

英文摘要

Although Multimodal Large Language Models (MLLMs) demonstrate strong omni-modal perception, their ability to forecast future events from audio-visual cues remains largely unexplored, as existing benchmarks focus mainly on retrospective understanding. To bridge this gap, we introduce FutureOmni, the first benchmark designed to evaluate omni-modal future forecasting from audio-visual environments. The evaluated models are required to perform cross-modal causal and temporal reasoning, as well as effectively leverage internal knowledge to predict future events. FutureOmni is constructed via a scalable LLM-assisted, human-in-the-loop pipeline and contains 919 videos and 1,034 multiple-choice QA pairs across 8 primary domains. Evaluations on 13 omni-modal and 7 video-only models show that current systems struggle with audio-visual future prediction, particularly in speech-heavy scenarios, with the best accuracy of 64.8% achieved by Gemini 3 Flash. To mitigate this limitation, we curate a 7K-sample instruction-tuning dataset and propose an Omni-Modal Future Forecasting (OFF) training strategy. Evaluations on FutureOmni and popular audio-visual and video-only benchmarks demonstrate that OFF enhances future forecasting and generalization. We publicly release all code (https://github.com/OpenMOSS/FutureOmni) and datasets (https://huggingface.co/datasets/OpenMOSS-Team/FutureOmni).

2604.13899 2026-06-18 cs.CL cs.AI 版本更新

Do We Still Need Humans in the Loop? Comparing Human and LLM Annotation in Active Learning for Hostility Detection

我们是否仍然需要人在回路中?比较主动学习中用于敌意检测的人类与LLM标注

Ahmad Dawar Hakimi, Lea Hirlimann, Isabelle Augenstein, Hinrich Schütze

AI总结 研究比较了LLM与人类在主动学习中的标注效果,发现LLM标注成本更低且性能更优,但主动学习在LLM标注下无优势。

详情
AI中文摘要

指令微调的LLM可以低成本标注数千个实例。这为主动学习(AL)提出了两个问题:LLM标签能否替代AL回路中的人类标签?当整个语料库可以廉价标注时,AL是否仍然必要?我们在一个新的包含277,902条德国政治TikTok评论(25,974条LLM标注,5,000条人工标注)的数据集上进行了研究,比较了LLM和人类标注在七种条件、四种编码器和10个随机种子下的表现。在模仿人类标注任务的双问题界面下,大规模LLM标注的性能优于人类监督分类器,成本约为其十分之一(GPT-5.2 Batch API为28美元,Prolific为316美元)。这一优势对于闭源(GPT-5.2)和开源(Qwen3.5-122B-10B)LLM均成立,在软标签评估下具有鲁棒性,并且是通过双问题分解实现的;整体单提示基线仅与人类监督持平。在任一LLM标注器下,主动学习相比随机采样没有可靠优势。然而,错误结构差异显著:只有GPT-5.2在双问题界面下产生的分类器具有接近人类的FP/FN平衡,而其他LLM变体过度标记了边境管制和经济竞争话语。我们发布了数据集和代码。

英文摘要

Instruction-tuned LLMs can annotate thousands of instances at low cost. This raises two questions for active learning (AL): can LLM labels replace human labels within the AL loop, and does AL remain necessary when entire corpora can be cheaply labeled? We investigate both on a new dataset of 277,902 German political TikTok comments (25,974 LLM-labeled, 5,000 human-annotated), comparing LLM and human annotation across seven conditions, four encoders, and 10 random seeds. Under a two-question interface that mirrors the human annotation task, LLM annotation at scale outperforms human-supervised classifiers at roughly one-tenth the cost (\$28 for GPT-5.2 Batch API vs. \$316 for Prolific). The advantage holds for both a closed-source (GPT-5.2) and an open-weight (Qwen3.5-122B-10B) LLM, is robust under soft-label evaluation, and is unlocked specifically by the two-question decomposition; a holistic single-prompt baseline only ties with human supervision. AL provides no reliable advantage over random sampling under either LLM annotator. However, error structure varies sharply: only GPT-5.2 under the two-question interface produces classifiers with near-human FP/FN balance, while other LLM variants over-flag border-control and economic competition discourse. We release the dataset and code.

2604.18109 2026-06-18 cs.CL cs.SD 版本更新

FLiP: Towards understanding and interpreting multimodal multilingual sentence embeddings

FLiP:理解和解释多模态多语句子嵌入

Santosh Kesiraju, Bolaji Yusuf, Šimon Sedláček, Oldřich Plchot, Petr Schwarz

发表机构 * Brno University of Technology(布拉格技术大学)

AI总结 提出因子化线性投影(FLiP)模型,从多语言、多模态句子嵌入中恢复词汇内容,揭示编码器的模态和语言偏差。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

本文提出了因子化线性投影(FLiP)模型,用于理解预训练句子嵌入空间。我们训练FLiP模型从多语言(LaBSE)、多模态(SONAR)和基于API(Gemini)的句子嵌入空间中恢复多种高资源和中等资源语言的词汇内容。我们表明,FLiP可以从嵌入中召回超过75%的词汇内容,显著优于现有的非因子化基线。使用此作为诊断工具,我们揭示了所选句子编码器的模态和语言偏差,并为从业者提供了关于编码器的内在见解,而无需依赖传统的下游评估任务。我们的实现已公开,链接见此:https://this URL。

英文摘要

This paper presents factorized linear projection (FLiP) models for understanding pretrained sentence embedding spaces. We train FLiP models to recover the lexical content from multilingual (LaBSE), multimodal (SONAR) and API-based (Gemini) sentence embedding spaces in several high- and mid-resource languages. We show that FLiP can recall more than 75% of lexical content from the embeddings, significantly outperforming existing non-factorized baselines. Using this as a diagnostic tool, we uncover the modality and language biases across the selected sentence encoders and provide practitioners with intrinsic insights about the encoders without relying on conventional downstream evaluation tasks. Our implementation is public https://github.com/BUTSpeechFIT/FLiP.

2604.28076 2026-06-18 cs.CL cs.AI cs.LG 版本更新

TopBench: A Benchmark for Implicit Predictive Reasoning in Tabular Question Answering

TopBench:表格问答中隐式预测推理的基准

An-Yang Ji, Jun-Peng Jiang, De-Chuan Zhan, Han-Jia Ye

发表机构 * School of Artificial Intelligence, Nanjing University, China(人工智能学院,南京大学,中国) National Key Laboratory for Novel Software Technology, Nanjing University, China(新型软件技术国家重点实验室,南京大学,中国)

AI总结 提出TopBench基准,包含779个样本和四个子任务,评估大语言模型在表格问答中识别隐式预测意图并进行可靠推理的能力,发现当前模型在意图识别上存在困难。

详情
AI中文摘要

大型语言模型(LLM)推动了表格问答的发展,其中大多数查询可以通过提取信息或简单聚合来回答。然而,一类常见的现实世界查询是隐式预测性的,需要从历史模式中推断未观察到的答案,而不仅仅是检索。这些查询带来了两个挑战:识别潜在意图和对大规模表格进行可靠的预测推理。为了评估LLM在带有隐式预测任务的表格问答中的表现,我们引入了TopBench,一个包含779个样本的基准,涵盖四个子任务,从单点预测到决策制定、处理效应分析和复杂过滤,要求模型生成涵盖推理文本和结构化表格的输出。我们在基于文本和代理工作流下评估了多种模型。实验表明,当前模型通常在意图识别上存在困难,默认进行查找。更深入的分析发现,准确的意图消歧是引导这些预测行为的前提。此外,提升预测精度的上限需要整合更复杂的建模或推理能力。

英文摘要

Large Language Models (LLMs) have advanced Table Question Answering, where most queries can be answered by extracting information or simple aggregation. However, a common class of real-world queries is implicitly predictive, requiring the inference of unobserved answers from historical patterns rather than mere retrieval. These queries introduce two challenges: recognizing latent intent and reliable predictive reasoning over massive tables. To assess LLMs in such Tabular questiOn answering with implicit Prediction tasks, we introduce TopBench, a benchmark consisting of 779 samples across four sub-tasks, ranging from single-point prediction to decision making, treatment effect analysis, and complex filtering, requiring models to generate outputs spanning reasoning text and structured tables. We evaluate diverse models under both text-based and agentic workflows. Experiments reveal that current models often struggle with intent recognition, defaulting to just lookups. Deeper analysis identifies that accurate intent disambiguation serves as the prerequisite for leading these predictive behaviors. Furthermore, elevating the upper bound of prediction precision requires the integration of more sophisticated modeling or reasoning capabilities.

2606.12837 2026-06-18 cs.CL 版本更新

LoHoSearch: Benchmarking Long-Horizon Search Agents Beyond the Human Difficulty Ceiling

LoHoSearch: 超越人类难度上限的长时域搜索代理基准测试

Jiarui Zhao, Rongzhi Zhang, Lingchuan Liu, Hao Yang, Xunliang Cai, Xi Su

发表机构 * Meituan(美团)

AI总结 提出LoHoSearch基准,基于700万维基实体知识图谱自动构建544个复杂问题,评估显示最强模型仅34.74%准确率,远超人类难度上限。

详情
AI中文摘要

以BrowseComp为代表的搜索代理基准在过去一年中迅速饱和,最强模型已超过90%准确率。由于这些基准主要由人类编写,标注者缺乏对实体统计的全局视角,无法系统性地最大化搜索空间大小和结构复杂性,这造成了难以突破的难度上限。为解决这一问题,我们引入了LoHoSearch(长时域搜索代理),一个包含544个人工验证问题、覆盖11个领域的挑战性基准。LoHoSearch通过基于覆盖超过700万维基百科实体的知识图谱的自动化流水线构建,该流水线选择具有大搜索空间的关系,并将其组装成结构复杂且具有知识图谱验证的唯一答案的问题。我们的评估表明,即使是最强模型也仅达到34.74%的准确率,且现有的上下文管理策略(最佳提升+6.8%)带来的增益远小于先前基准。LoHoSearch为评估搜索代理中的长时域推理和上下文管理提供了更高要求的标准。

英文摘要

Search agent benchmarks exemplified by BrowseComp have rapidly saturated over the past year, with the strongest models surpassing 90% accuracy. Since these benchmarks are predominantly human-authored, annotators lack a global perspective on entity statistics and cannot systematically maximize search space size and structural complexity. This creates a difficulty ceiling that is hard to break. To address this, we introduce LoHoSearch (Long-Horizon Search Agents), a challenging benchmark comprising 544 human-verified questions across 11 domains. LoHoSearch is constructed via an automated pipeline built upon a knowledge graph covering over 7 million Wikipedia entities, which selects relations with large search spaces and assembles them into structurally complex questions with KG-verified unique answers. Our evaluation demonstrates that even the strongest model achieves only 34.74% accuracy, and existing context management strategies (best +6.8%) yield far smaller gains than on prior benchmarks. LoHoSearch provides a more demanding standard for evaluating long-horizon reasoning and context management in search agents.

2606.13681 2026-06-18 cs.CL 版本更新

EvoArena: Tracking Memory Evolution for Robust LLM Agents in Dynamic Environments

EvoArena: 追踪记忆演化以构建动态环境中的鲁棒LLM智能体

Jundong Xu, Qingchuan Li, Jiaying Wu, Yihuai Lan, Shuyue Stella Li, Huichi Zhou, Bowen Jiang, Lei Wang, Jun Wang, Anh Tuan Luu, Caiming Xiong, Hae Won Park, Bryan Hooi, Zhiyuan Hu

发表机构 * National University of Singapore(新加坡国立大学) Singapore Management University(新加坡管理大学) University of Washington(华盛顿大学) University College London(伦敦大学学院) University of Pennsylvania(宾夕法尼亚大学) Nanyang Technological University(南洋理工大学) Recursive Massachusetts Institute of Technology(麻省理工学院)

AI总结 提出EvoArena基准套件模拟终端、软件和社交领域的渐进环境变化,并设计基于补丁的记忆范式EvoMem记录结构化更新历史,使智能体能通过记忆变化推理环境演化,实验表明当前智能体在动态环境中表现不佳,EvoMem可稳定提升性能。

详情
AI中文摘要

大型语言模型(LLM)智能体在广泛基准测试中取得了强劲性能,但大多数评估假设静态环境。相比之下,实际部署本质上是动态的,要求智能体持续将其知识、技能和行为与不断变化的环境及更新的任务条件对齐。为弥补这一差距,我们引入了EvoArena,一个基准套件,将环境变化建模为终端、软件和社交领域的渐进更新序列。我们进一步提出EvoMem,一种基于补丁的记忆范式,将记忆演化记录为结构化的更新历史,使智能体能够通过记忆中的变化推理环境演化。实验表明,当前智能体在EvoArena上表现不佳,在演化的终端、软件和社交偏好领域平均准确率仅为39.6%。EvoMem持续提升性能,在EvoArena上平均提升1.5%,并在GAIA和LoCoMo等标准基准上分别提升6.1%和4.8%。除单个任务外,EvoMem在EvoArena上还将链级准确率提升3.7%,其中成功需要完成一系列连续的相关演化子任务。机制分析表明,EvoMem改善了记忆中的证据捕获,表明更完整地保留了演化的环境状态。我们的结果强调了在评估和记忆中对演化进行建模对于可靠智能体部署的重要性。

英文摘要

Large language model (LLM) agents have achieved strong performance on a wide range of benchmarks, yet most evaluations assume static environments. In contrast, real-world deployment is inherently dynamic, requiring agents to continually align their knowledge, skills, and behavior with changing environments and updated task conditions. To address this gap, we introduce EvoArena, a benchmark suite that models environment changes as sequences of progressive updates across terminal, software, and social domains. We further propose EvoMem, a patch-based memory paradigm that records memory evolution as structured update histories, enabling agents to reason about environmental evolution through changes in their memory. Experiments show that current agents struggle on EvoArena, achieving an average accuracy of 39.6% across evolving terminal, software, and social-preference domains. EvoMem consistently improves performance, yielding an average gain of 1.5% on EvoArena and also improving standard benchmarks such as GAIA and LoCoMo by 6.1% and 4.8%. Beyond individual tasks, EvoMem further improves chain-level accuracy by 3.7% on EvoArena, where success requires completing a consecutive sequence of related evolutionary subtasks. Mechanistic analysis shows that EvoMem improves evidence capture in the memory, indicating better preservation of complete evolving environment states. Our results highlight the importance of modeling evolution in both evaluation and memory for reliable agent deployment.

2606.15345 2026-06-18 cs.CL cs.IR 版本更新

Beyond Monolingual Deep Research: Evaluating Agents and Retrievers with Cross-Lingual BrowseComp-Plus

超越单语言深度研究:用跨语言 BrowseComp-Plus 评估智能体和检索器

Yuheng Lu, Qingcheng Zeng, Heli Qi, Puxuan Yu, Fuheng Zhao, Rui Yang, Hitomi Yanaka, Naoto Yokoya, Weihao Xuan

发表机构 * Waseda University(早稻田大学) Northwestern University(西北大学) RIKEN AIP(理化学研究所革新智能研究中心) Snowflake Inc.(Snowflake公司) University of Utah(犹他大学) Duke-NUS Medical School(杜克-新加坡国立大学医学院) The University of Tokyo(东京大学)

AI总结 提出跨语言基准 XBCP,评估深度研究智能体在证据语言与查询不同时的表现,发现检索和智能体端均存在显著性能下降。

Comments Preprint

详情
AI中文摘要

深度研究智能体越来越被评估其搜索证据、推理检索来源和生成有依据答案的能力。然而,现有的浏览基准大多假设用户查询和支持证据使用同一种语言,因此当相关证据出现在另一种语言时,智能体搜索系统能否运行尚不清楚。我们引入了 XBCP(跨语言 BrowseComp-Plus),这是一个受控基准,它保留了 BrowseComp-Plus 的英文问答空间,但改变了支持文档的语言。XBCP 实例化了两个互补的设置:在跨语言设置中,每个查询与单一指定语言的证据配对。在多语言设置中,完整的证据语料库在 12 种语言(涵盖高资源和低资源语言)中均匀随机分布。我们使用稀疏和密集的多语言检索器评估了四个深度研究智能体,测量了答案准确性、证据召回率、搜索行为、校准度、引用忠实度和 oracle 检索。结果显示,当证据被翻译时,性能显著下降。即使是强大的密集检索器也会丢失证据召回率,智能体变得不那么校准,且引用证据的可靠性降低。值得注意的是,即使直接提供所有黄金证据,准确性仍然较低。这些发现表明,跨语言深度研究暴露了检索失败和智能体端在整合语言不匹配证据方面的独立困难。

英文摘要

Deep research agents are increasingly evaluated on their ability to search for evidence, reason over retrieved sources, and produce grounded answers. Existing browsing benchmarks, however, largely assume that the user's query and the supporting evidence are written in the same language, leaving open whether agentic search systems can operate when relevant evidence appears in another language. We introduce XBCP (Cross-lingual BrowseComp-Plus), a controlled benchmark that preserves the English question-and-answer space of BrowseComp-Plus but varies the languages of the supporting documents. XBCP instantiates two complementary settings: in the cross-lingual setting, each query is paired with evidence in a single assigned language. In the multilingual setting, the full evidence corpus is distributed equally and randomly across 12 languages spanning high-resource and low-resource regimes. We evaluate four deep research agents using sparse and dense multilingual retrievers, measuring answer accuracy, evidence recall, search behavior, calibration, citation fidelity, and oracle retrieval. Results reveal substantial degradation when evidence is translated. Even strong, dense retrievers lose evidence recall, and agents become less calibrated and cite evidence less reliably. Notably, accuracy remains lower even when all gold evidence is supplied directly. These findings suggest that cross-lingual deep research exposes both retrieval failures and an independent, agent-side difficulty in integrating language-mismatched evidence.

2606.16000 2026-06-18 cs.CL cs.LG 版本更新

GRACE-DS: a Guarded Reward-guided Agent Correction Environment in Data Science

GRACE-DS:数据科学中的受保护奖励引导智能体修正环境

Aleksandr Tsymbalov, Danis Zaripov, Artem Epifanov, Anastasiya Palienko

发表机构 * ITMO University(ITMO大学) HSE University(高等经济学院)

AI总结 提出GRACE-DS,一个用于评估LLM驱动的AutoML智能体在部署前性能的隔离环境,通过隐藏的可执行验证器衡量预测性能、泄漏避免、可重复性等指标,实验证明其灵活迭代交互模式优于基线方法。

详情
AI中文摘要

我们介绍了GRACE-DS,一个数据科学中的受保护奖励引导智能体修正环境,用于对LLM驱动的AutoML智能体进行部署前评估。GRACE-DS是一组在隔离环境中的评估指标,可应用于特定组织的表格ML任务。它将智能体暴露于现实的工作流阶段,从规划和数据检查到特征工程、模型开发、验证、代码修复直至最终提交,同时隐藏的可执行验证器不仅衡量最终预测性能,还衡量泄漏避免、可重复性、协议有效性、修正行为和奖励对齐。最强的结构化机制——灵活迭代交互(我们的方法)——实现了比单次生成、非结构化交互和基于重启的基线更高的端到端归一化隐藏测试质量,同时提高了协议有效完成率。经过7000多个回合的验证,这些结果确立了GRACE-DS作为评估基于LLM的AutoML智能体在生产类条件下按照组织特定要求执行机器学习工作流能力的稳健平台。

英文摘要

We introduce GRACE-DS, a Guarded Reward-guided Agent Correction Environment in Data Science for pre-deployment evaluation of LLM-powered AutoML agents. GRACE-DS is a set of evaluation metrics in an isolated environment that can be applied to tabular ML tasks specific to a particular organization. It exposes agents to realistic workflow stages, from planning and data inspection through feature engineering, model development, validation, and code repair to final submission, while hidden executable validators measure not only final predictive performance but also leakage avoidance, reproducibility, protocol validity, correction behavior, and reward alignment. The strongest structured regime, flexible iterative interaction (our approach), achieves higher end-to-end normalized hidden-test quality than single-shot generation, unstructured interaction, and restart-based baselines, while also improving protocol-valid completion. Validated across more than 7,000 episodes, these results establish GRACE-DS as a robust platform for assessing the capacity of LLM-based AutoML agents to execute machine learning workflows under production-like conditions and in accordance with organization-specific requirements.

2502.02904 2026-06-18 cs.HC cs.CL q-bio.NC 版本更新

ScholaWrite: A Dataset of End-to-End Scholarly Writing Process

ScholaWrite: 端到端学术写作过程数据集

Khanh Chi Le, Linghe Wang, Minhwa Lee, Ross Volkov, Luan Tuyen Chau, Dongyeop Kang

发表机构 * University of Minnesota(明尼苏达大学)

AI总结 提出ScholaWrite数据集,通过Chrome扩展记录Overleaf上的按键,捕捉从初稿到终稿的多月写作过程,包含5篇计算机科学预印本的近6.2万次文本修改及认知写作意图标注,揭示人类写作与LLM辅助之间的差距。

Comments Equal contribution: Khanh Chi Le, Linghe Wang, Minhwa Lee | project page: https://minnesotanlp.github.io/scholawrite/

详情
AI中文摘要

写作是一项认知要求高的活动,需要持续决策、高度依赖工作记忆,并在不同目标的任务之间频繁切换。为了构建与作者认知真正一致的写作助手,我们必须捕捉并解码作者将想法转化为最终文本背后的完整思维过程。我们提出了ScholaWrite,这是第一个端到端学术写作数据集,追踪从初稿到最终手稿的多月历程。我们贡献了三个关键进展:(1)一个Chrome扩展,可无干扰地记录Overleaf上的按键,从而能够收集真实、现场写作数据;(2)一个新颖的完整学术手稿语料库,附有认知写作意图的细粒度标注。该数据集包含基于LaTeX的五篇计算机科学预印本的编辑,捕捉了四个月内近6.2万次文本更改;(3)对学术写作微观动态的分析和见解,突出了人类写作过程与大型语言模型(LLM)在提供有意义帮助方面的当前能力之间的差距。ScholaWrite强调了捕获端到端写作数据以开发未来写作助手的重要性,这些助手支持而非取代科学家的认知工作。

英文摘要

Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process behind how writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We contribute three key advances: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions. The dataset includes \LaTeX-based edits from five computer science preprints, capturing nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.

2511.00802 2026-06-18 cs.SE cs.CL cs.LG 版本更新

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

GrowthHacker: 使用代码修改型LLM代理的自动离线策略评估优化

Jie JW Wu, Ayanda Patrick Herlihy, Ahmad Saleem Mirza, Ali Afoud, Fatemeh Fard

发表机构 * Michigan Technological University, Houghton(密歇根技术大学) Birmingham City University(伯明翰城市大学) University of British Columbia, Kelowna(不列颠哥伦比亚大学, 肯洛纳)

AI总结 提出GrowthHacker基准,利用LLM代理自动迭代修改代码以优化离线策略评估(OPE)实现,在Open Bandit Pipeline和Scope-RL上评估多种框架,证明基于LLM的代理可作为自动增长黑客持续改进OPE系统。

Comments Accepted for publication in ACM Transactions on Software Engineering and Methodology (TOSEM), 2026

详情
AI中文摘要

随着数据驱动开发的广泛采用,在线A/B测试已成为衡量新技术效果的既定方法。然而,部署在线实验需要设计、实现和部署资源,并可能对用户产生负面影响(例如,不安全或不道德的结果),同时需要数周的数据收集。为了解决这一问题,离线策略评估(OPE)或离线A/B测试这一日益增长的研究领域,使用先前收集的日志数据离线评估新技术。OPE也是强化学习中的一个基本问题,在在线测试昂贵或风险高的领域(如医疗保健、推荐系统、教育和机器人技术)中非常重要。尽管代码生成大语言模型(LLM)和代理工作流取得了进展,但关于LLM和基于LLM的代理是否以及如何自动优化OPE实现,我们知之甚少。我们提出了GrowthHacker,这是一个基准测试,用于在大规模公共数据集上评估基线LLM和基于LLM的代理。GrowthHacker自主迭代修改代码,运行OPE,并使用指标指导后续优化。我们在Open Bandit Pipeline(OBP)和Scope-RL上评估方法,并开发了一个双代理框架,该框架解决了现有框架的局限性,同时降低了复杂性。在两个库中,双代理显示出最高的可靠性(98.1%-100%成功率)和正向结果率(78%),正向结果的中位改进为4.4%;CrewAI实现了最高的平均改进(37.9%),并且是唯一没有极端值失败的框架。AutoGen和Default各达到65%的正向结果率。这些结果证明了使用基于LLM的代理作为自动“增长黑客”持续改进OPE系统的可行性,对在手动优化成本高昂的情况下扩展数据驱动决策具有重要意义。

英文摘要

With data-driven development now widely adopted, online A/B testing is an established method for measuring the effects of new technologies. However, deploying online experiments demands resources for design, implementation, and deployment, and may negatively impact users (e.g., unsafe or unethical outcomes) while requiring weeks of data collection. To address this, the growing research area of off-policy evaluation (OPE), or offline A/B testing, assesses new technologies offline using previously collected logged data. OPE is also a fundamental problem in reinforcement learning and is important where online testing is expensive or risky, such as healthcare, recommender systems, education, and robotics. Despite advances in code-generation large language models (LLMs) and agentic workflows, little is known about whether and how LLMs and LLM-based agents can automatically optimize OPE implementations. We propose GrowthHacker, a benchmark that evaluates baseline LLMs and LLM-based agents on large-scale public datasets. GrowthHacker autonomously and iteratively modifies code, runs OPE, and uses the metrics to guide subsequent optimization. We evaluate methods on Open Bandit Pipeline (OBP) and Scope-RL, and develop a two_agent framework that addresses limitations of existing frameworks while reducing complexity. Across both libraries, two_agent shows the highest reliability (98.1%-100% success rate) and positive-outcome rate (78%), with a median improvement of 4.4% among positive outcomes; CrewAI achieves the highest average improvement (37.9%) and is the only framework with zero extreme-value failures. AutoGen and Default each reach 65% positive-outcome rates. These results establish the feasibility of using LLM-based agents as automated "growth hackers" to continuously improve OPE systems, with implications for scaling data-driven decision-making where manual optimization is expensive.

2601.12805 2026-06-18 q-bio.GN cs.AI cs.CL 版本更新

SciHorizon-GENE: Benchmarking LLM for Life Sciences Inference from Gene Knowledge to Functional Understanding

SciHorizon-GENE:从基因知识到功能理解的生命科学推理基准测试

Xiaohan Huang, Meng Xiao, Chuan Qin, Qingqing Long, Jinmiao Chen, Yuanchun Zhou, Hengshu Zhu

发表机构 * Computer Network Information Center, Chinese Academy of Sciences(中国科学院计算机网络信息中心) University of the Chinese Academy of Sciences(中国科学院大学) DUKE-NUS Medical School, National University of Singapore(新加坡国立大学杜克-新加坡医学学校) Singapore Immunology Network, Agency for Science, Technology and Research(新加坡免疫网络,科技研究局)

AI总结 针对大语言模型在基因级推理能力上的不足,构建了包含超过19万个人类基因和54万问题的基准SciHorizon-GENE,从研究关注敏感性、幻觉倾向、答案完整性和文献影响力四个生物学关键维度评估模型,揭示了模型在生成忠实、完整且基于文献的功能解释方面的持续挑战。

Comments Accepted by SIGKDD 2026. 12 pages

详情
AI中文摘要

大型语言模型(LLMs)在生物医学研究中展现出日益增长的潜力,尤其是在知识驱动的解释任务中。然而,它们从基因知识到功能理解的可靠推理能力——这是知识增强型细胞图谱解释的核心要求——仍然在很大程度上未被探索。为了填补这一空白,我们引入了SciHorizon-GENE,这是一个基于权威生物数据库构建的大规模基因中心基准。该基准整合了超过19万个人类基因的 curated 知识,包含超过54万个问题,涵盖了与细胞类型注释、功能解释和机制导向分析相关的多种基因到功能推理场景。受初步检查中观察到的行为模式启发,SciHorizon-GENE从四个生物学关键角度评估LLMs:研究关注敏感性、幻觉倾向、答案完整性和文献影响力,明确针对限制LLMs在生物解释管道中安全采用的失败模式。我们系统评估了多种最先进的通用和生物医学LLMs,揭示了基因级推理能力的显著异质性,以及在生成忠实、完整且基于文献的功能解释方面的持续挑战。我们的基准为在基因尺度上分析LLM行为建立了系统基础,并为模型选择和发展提供了见解,与知识增强型生物解释直接相关。

英文摘要

Large language models (LLMs) have shown growing promise in biomedical research, particularly for knowledge-driven interpretation tasks. However, their ability to reliably reason from gene-level knowledge to functional understanding, a core requirement for knowledge-enhanced cell atlas interpretation, remains largely underexplored. To address this gap, we introduce SciHorizon-GENE, a large-scale gene-centric benchmark constructed from authoritative biological databases. The benchmark integrates curated knowledge for over 190K human genes and comprises more than 540K questions covering diverse gene-to-function reasoning scenarios relevant to cell type annotation, functional interpretation, and mechanism-oriented analysis. Motivated by behavioral patterns observed in preliminary examinations, SciHorizon-GENE evaluates LLMs along four biologically critical perspectives: research attention sensitivity, hallucination tendency, answer completeness, and literature influence, explicitly targeting failure modes that limit the safe adoption of LLMs in biological interpretation pipelines. We systematically evaluate a wide range of state-of-the-art general-purpose and biomedical LLMs, revealing substantial heterogeneity in gene-level reasoning capabilities and persistent challenges in generating faithful, complete, and literature-grounded functional interpretations. Our benchmark establishes a systematic foundation for analyzing LLM behavior at the gene scale and offers insights for model selection and development, with direct relevance to knowledge-enhanced biological interpretation.

2605.29676 2026-06-18 cs.AI cs.CL 版本更新

Notation Matters: A Benchmark Study of Token-Optimized Formats in Agentic AI Systems

符号至关重要:智能体AI系统中令牌优化格式的基准研究

Lorenz Kutschka, Bernhard Geiger

发表机构 * Know Center Research GmbH(知中心研究有限公司) Graz University of Technology(格拉茨技术大学) Graz Center for Machine Learning(格拉茨机器学习中心)

AI总结 本研究在四个智能体基准上评估了两种令牌优化格式TOON和TRON,发现TRON在保持准确率的同时最多减少27%的令牌,而TOON虽减少18%但存在多轮解析失败和并行工具调用输出崩溃的问题。

Comments 16 pages, 6 figures, 4 tables

详情
AI中文摘要

智能体AI系统中的大型语言模型消耗工具模式和执行结果,并发出结构化数据的工具调用。这种交换的默认语言JSON是为应用间交换而非令牌效率设计的,因此其结构元素带来大量令牌开销。最近的工作提出了令牌优化替代方案,如TOON(令牌导向对象表示法)和TRON(令牌减少对象表示法)作为更紧凑的替代,但这些格式仅在孤立的理解或生成任务上进行了评估。它们在端到端智能体循环中是否保持令牌减少仍是一个开放问题。我们在四个智能体基准(BFCL、MCPToolBenchPP、MCP-Universe、StableToolBench)和五个开放权重LLM上评估了TOON和TRON,将输入压缩与输出压缩解耦,以独立测量理解和生成。TRON最多减少27%的令牌,准确率在JSON基线的14个百分点内。TOON实现了最多18%的减少,准确率成本类似为9个百分点,但在多轮解析失败上额外级联,并且对于大多数模型导致并行工具调用输出崩溃。

英文摘要

Large language models in Agentic AI systems consume tool schemas and execution results and emit tool invocations as structured data. The default language for that exchange, JSON, was designed for application-to-application interchange rather than token efficiency, so its structural elements impose substantial token overhead. Recent work proposes token-optimized alternatives such as TOON (Token-Oriented Object Notation) and TRON (Token Reduced Object Notation) as more compact replacements, but these formats have been evaluated only on isolated comprehension or generation tasks. Whether their token reductions hold inside end-to-end agentic loops therefore remains an open question. We evaluate TOON and TRON on four agentic benchmarks (BFCL, MCPToolBenchPP, MCP-Universe, StableToolBench) and five open-weight LLMs, decoupling input compression from output compression to measure comprehension and generation independently. TRON reduces tokens by up to 27% with accuracy within 14pp of the JSON baseline. TOON achieves up to 18% reduction at a similar 9pp accuracy cost, but additionally cascades on multi-turn parsing failures and collapses parallel tool-call output for most models. The code is available at: https://github.com/lkutschka/notation-matters

2606.07591 2026-06-18 cs.LG cs.AI cs.CL 版本更新

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Koutian Wu, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出ResearchClawBench基准,包含10个领域40个任务,通过多模态评分标准评估自主科研能力,最强智能体仅得21.5分,揭示当前系统在实验协议、证据匹配和科学核心方面的不足。

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2606.17188 2026-06-18 cs.CV cs.CL 版本更新

Not Truly Multilingual: Script Consistency as a Missing Dimension in VLM Evaluation

并非真正的多语言:脚本一致性作为VLM评估中缺失的维度

Prabhjot Singh, Bhushan Pawar, Madhu Reddiboina, Rajvee Sheth

发表机构 * RediMinds Inc.(RediMinds公司) The University of Texas at Austin(德克萨斯大学奥斯汀分校) Independent Researcher(独立研究员)

AI总结 提出PuMVR基准,评估10个VLM在旁遮普语三种文字上的表现,发现显著的脚本差距,并提出脚本一致性率(SCR)作为必要评估指标。

详情
AI中文摘要

当前视觉语言模型(VLM)的多语言评估假设语言与正字法一一对应,忽略了使用多种文字语言的数十亿用户。我们引入了PuMVR(旁遮普多模态视觉推理),这是一个包含1000个严格平行图像-文本实例的基准,覆盖旁遮普语的三种活跃文字:古木基文、沙穆基文和罗马文。评估10个最先进的VLM,我们暴露了一个显著且系统的脚本差距。模型经常在一种文字上解决视觉任务,而在另一种文字上失败,准确率差异高达16%。关键的是,视觉输入均匀地提升了绝对性能,但并未缩小正字法差距。此外,跨文字的上下文迁移非常脆弱,揭示了脚本锁定的知识表示。通过所有文字对的McNemar检验支持,我们的发现表明当前的“多语言”VLM并非真正的多文字。我们提出脚本一致性率(SCR),在我们的基准上低至24.8%,作为脚本无关评估的强制性指标,以确保公平的AI访问。数据和代码可在以下网址获取:this https URL。

英文摘要

Current multilingual evaluations for Vision-Language Models (VLMs) assume a one-to-one mapping between language and orthography, overlooking billions of users of multi-script languages. We introduce PuMVR (Punjabi Multimodal Visual Reasoning), a benchmark of 1,000 strictly parallel image-text instances across Punjabi's three active scripts: Gurmukhi, Shahmukhi, and Roman. Evaluating 10 state-of-the-art VLMs, we expose a substantial and systematic Script Gap. Models frequently solve visual tasks in one script while failing identical tasks in another, with accuracy deltas reaching 16%. Crucially, visual input boosts absolute performance uniformly yet does not close the orthographic gap. Furthermore, cross-script in-context transfer is highly brittle, exposing script-locked knowledge representation. Supported by McNemar tests across all script pairs, our findings demonstrate that current "multilingual" VLMs are not truly multi-script. We propose the Script Consistency Rate (SCR), which falls as low as 24.8% on our benchmark, as a mandatory metric for script-agnostic evaluation to ensure equitable AI access. Data and code are available at: https://github.com/prabhjotschugh/Not-Truly-Multilingual-PuMVR.

2606.18142 2026-06-18 cs.AI cs.CL cs.CY 版本更新

Your AI Travel Agent Would Book You a Bullfight: An Agentic Benchmark for Implicit Animal Welfare in Frontier AI Models

你的AI旅行代理会为你预订斗牛:前沿AI模型中隐含动物福利的代理基准

Jasmine Brazilek, Joel Christoph, Miles Tidmarsh, Carol Kline, Oliver Tullio, Arturs Kanepajs

发表机构 * Compassion Aligned Machine Learning(同情对齐机器学习) Sentient Futures(感知未来) Harvard Kennedy School(哈佛肯尼迪学院) Appalachian State University Department of Management(阿巴拉契亚州立大学管理系)

AI总结 提出首个代理基准TAC,测试AI代理在为用户执行旅行预订等操作时是否避免涉及动物剥削的选项。评估七个前沿模型,所有模型得分低于随机水平64%,最佳模型仅53%。

详情
AI中文摘要

AI代理正从顾问转变为行动者,代表用户预订旅行、规划菜单和管理采购。现有的AI与动物福利基准评估模型对问答提示的文本响应,但未检验这些响应中的福利推理是否迁移到代理部署中(模型必须使用工具采取行动)。我们引入TAC(旅行代理同情心),这是首个衡量AI代理在代表用户行动时是否避免涉及动物剥削选项的代理基准。TAC向AI代理提供十二个手工编写的旅行预订场景,涵盖六类动物剥削,并扩展至四十八个样本以控制价格、评分和位置混淆因素。我们评估了来自四个实验室的七个前沿模型。每个模型得分均低于随机水平64%,最佳表现者(Claude Opus 4.7)为53%。系统提示中的单一福利意识句子在Claude和GPT-5.5中带来47至63个百分点的提升,在GPT-5.2中提升26个百分点,在DeepSeek和Gemini中提升不足12个百分点。一项辅助的Inspect Scout审计(使用Gemini 2.5 Flash Lite作为评判者,对前两名模型的288个基础条件转录进行审计)未标记任何评估意识转录,表明低于随机水平的比率并非源于模型识别出评估。我们讨论了跨文化领域的类别级变化、文本响应福利基准的局限性以及欧盟通用AI实践准则系统性风险框架的影响。

英文摘要

AI agents are moving from advisors to actors, booking travel, planning menus, and running procurement on behalf of users. Existing benchmarks for AI and animal welfare evaluate model text responses to question-answer prompts, leaving open whether the welfare reasoning surfaced in those responses transfers to agentic deployment where the model must take actions with tools. We introduce TAC (Travel Agent Compassion), the first agentic benchmark measuring whether AI agents avoid options involving animal exploitation when acting on behalf of users. TAC presents an AI agent with twelve hand-authored travel booking scenarios across six categories of animal exploitation, augmented to forty-eight samples to control for price, rating, and position confounds. We evaluate seven frontier models from four labs. Every model scores below the chance level of sixty-four percent, with the best performer (Claude Opus 4.7) at fifty-three percent. A single welfare-aware sentence in the system prompt yields gains of forty-seven to sixty-three percentage points in Claude and GPT-5.5, twenty-six points in GPT-5.2, and under twelve points in DeepSeek and Gemini. An auxiliary Inspect Scout audit of 288 base-condition transcripts from the top two performers, using Gemini 2.5 Flash Lite as judge, flags zero transcripts for evaluation awareness, suggesting the below-chance rates do not stem from the models recognising the evaluation. We discuss implications for category-level variation across cultural domains, the limits of text-response welfare benchmarks, and the EU General-Purpose AI Code of Practice systemic risk framework.

10. 安全、隐私、公平与可解释NLP 4 篇

2503.04989 2026-06-18 cs.CL 版本更新

Application of integrated gradients explainability to sociopsychological semantic markers

集成梯度可解释性在社会心理语义标记中的应用

Ali Aghababaei, Jan Nikadon, Magdalena Formanowicz, Maria Laura Bettinsoli, Carmen Cervone, Caterina Suitner, Tomaso Erseghe

发表机构 * Department of Information Engineering, University of Padova(帕多瓦大学信息工程系) Center for Research on Social Relations, University of Social Sciences and Humanities (SWPS)(社会科学与人文大学社会关系研究中心) Department of Cognitive Science, Nicolaus Copernicus University in Toruń(托伦尼古拉·哥白尼大学认知科学系) Interdisciplinary Centre for Modern Technologies, Nicolaus Copernicus University in Toruń(托伦尼古拉·哥白尼大学现代技术跨学科中心) Department of Developmental Psychology and Socialization, University of Padova(帕多瓦大学发展心理学与社会化系)

AI总结 本文利用集成梯度方法在词级别解释文本分类输出,聚焦社会心理标记(如能动性),通过测试BERTAgent等模型,验证了该方法在有限标注数据下识别关键词语的有效性。

Comments Submitted to IEEE Trans. on Computational Social Systems

详情
AI中文摘要

基于情感或更细微的社会心理标记(如能动性)的文本数据分类,现在是一种常用的句子级方法。在本文中,我们利用集成梯度(IG)方法在词级别捕获分类输出,揭示哪些词语实际贡献于分类过程。该方法提高了可解释性,并提供了对文本的深入洞察。我们关注超越情感的社会心理标记,并研究如何有效训练IG在能动性上——这是目前少数拥有经过验证的深度学习分类器BERTAgent的标记之一。我们仔细测试了性能和系统参数,评估了IG方法的替代方案,并在相关应用场景中验证了结果的实用性。该方法还应用于仅拥有少量标注数据集的场景,旨在利用IG识别有助于构建与相关社会心理标记相关的不同类别的显著词语。为此,采用了一种鼓励过拟合的非同寻常的训练程序,以增强每个类别的独特性。通过社会心理学的视角分析结果,提供了有价值的见解。

英文摘要

Classification of textual data in terms of sentiment, or more nuanced sociopsychological markers (e.g., agency), is now a popular approach commonly applied at the sentence level. In this paper, we exploit the integrated gradient (IG) method to capture the classification output at the word level, revealing which words actually contribute to the classification process. This approach improves explainability and provides in-depth insights into the text. We focus on sociopsychological markers beyond sentiment and investigate how to effectively train IG in agency, one of the very few markers for which a verified deep learning classifier, BERTAgent, is currently available. Performance and system parameters are carefully tested, alternatives to the IG approach are evaluated, and the usefulness of the result is verified in a relevant application scenario. The method is also applied in a scenario where only a small labeled dataset is available, with the aim of exploiting IG to identify the salient words that contribute to building the different classes that relate to relevant sociopsychological markers. To achieve this, an uncommon training procedure that encourages overfitting is employed to enhance the distinctiveness of each class. The results are analyzed through the lens of social psychology, offering valuable insights.

2505.20045 2026-06-18 cs.CL 版本更新

Efficient Hallucination Detection for LLMs Using Uncertainty-Aware Attention Heads

基于不确定性感知注意力头的高效大语言模型幻觉检测

Artem Vazhentsev, Lyudmila Rvanova, Gleb Kuzmin, Ekaterina Fadeeva, Ivan Lazichny, Alexander Panchenko, Maxim Panov, Mrinmaya Sachan, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(莫扎德人工智能大学) ETH Zurich(苏黎世联邦理工学院) Independent Researcher(独立研究者) Applied AI Institute(应用人工智能研究所)

AI总结 提出RAUQ框架,利用不确定性感知注意力头与令牌级置信度,通过单次前向传递实现无监督、高效的序列级幻觉检测,在12个数据集上优于现有方法且额外计算少于1%。

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML), Seoul, South Korea, 2026
AI中文摘要

尽管大型语言模型(LLM)已经变得非常强大,但它们仍然容易出现事实性错误,通常称为“幻觉”。不确定性量化(UQ)为缓解这一问题提供了一种有前景的方法,但大多数现有方法计算量大且/或需要监督。在这项工作中,我们提出了基于循环注意力的不确定性量化(RAUQ),这是一种无监督且高效的幻觉识别框架。该方法利用了Transformer注意力行为的一个观察:当生成错误信息时,某些“不确定性感知”注意力头倾向于减少对前驱令牌的关注。RAUQ自动检测这些注意力头,并以循环方式将其激活模式与令牌级置信度度量相结合,仅通过一次前向传递即可生成序列级不确定性估计。通过在涵盖问答、摘要和翻译的十二个数据集上对九个不同LLM进行的实验,我们表明RAUQ始终优于最先进的UQ基线。重要的是,它产生的开销极小,所需的额外计算不到1%。由于它既不需要标记数据也不需要广泛的参数调整,RAUQ可作为白盒LLM中实时幻觉检测的轻量级即插即用解决方案。

英文摘要

While large language models (LLMs) have become highly capable, they remain prone to factual inaccuracies, commonly referred to as "hallucinations." Uncertainty quantification (UQ) offers a promising way to mitigate this issue, but most existing methods are computationally intensive and/or require supervision. In this work, we propose Recurrent Attention-based Uncertainty Quantification (RAUQ), an unsupervised and efficient framework for identifying hallucinations. The method leverages an observation about transformer attention behavior: when incorrect information is generated, certain "uncertainty-aware" attention heads tend to reduce their focus on preceding tokens. RAUQ automatically detects these attention heads and combines their activation patterns with token-level confidence measures in a recurrent scheme, producing a sequence-level uncertainty estimate in just a single forward pass. Through experiments on twelve datasets spanning question answering, summarization, and translation across nine different LLMs, we show that RAUQ consistently outperforms state-of-the-art UQ baselines. Importantly, it incurs minimal overhead, requiring less than 1\% additional computation. Since it requires neither labeled data nor extensive parameter tuning, RAUQ serves as a lightweight, plug-and-play solution for real-time hallucination detection in white-box LLMs.

2604.23130 2026-06-18 cs.CL cs.AI 版本更新

From Concept-Aligned Tokens to Vulnerable Features: Mechanistic Localization of Jailbreaks

从概念对齐的Token到脆弱特征:越狱的机制定位

Nilanjana Das, Mathew Dawit, Aman Chadha, Manas Gaur

发表机构 * UMBC(马里兰大学伯克利分校) Apple(苹果公司)

AI总结 提出一种基于Token的机制流水线,通过稀疏自编码器特征子组定位越狱漏洞,发现单个有害Token足以定位脆弱特征,且这些特征集中在中后期层。

详情
AI中文摘要

越狱攻击揭示了安全对齐的大语言模型中一种持续的失败模式:模型可以被推向有害行为,但促成这种转变的内部表示仍未被很好地定位。最近的机制安全性研究通常通过广泛的表示对象来解释这种行为,包括全局拒绝方向、激活引导向量和与拒绝相关的SAE特征。我们转而询问越狱脆弱性是否可以追溯到更细粒度的、基于提示的SAE特征子组。我们引入了一个基于Token的机制流水线,将Gemma-2-2B的残差流分解为稀疏自编码器(SAE)特征,并识别与不安全行为相关的特征子组。使用BeaverTails中的单类别不安全示例以减少跨类别干扰,我们从对抗性响应中提取有害概念,并通过子空间相似性将其与概念相关的提示Token对齐。然后,我们应用三种特征分组策略:基于聚类的、层次链接的和单Token驱动的,以识别所有26层中的SAE特征子组。最后,我们放大每个子组中的顶级特征,并使用标准的有害性评判器评估生成的输出。单Token驱动的分组实现了与完整基于聚类的分组相当的有害性,表明单个有害提示Token足以定位与脆弱性相关的SAE特征子组,而无需依赖更广泛的聚类级聚合。这些子组出现在早期和中后期层,且更集中在中后期层,其中目标引导暴露了特定的模型脆弱性。总体而言,我们的结果表明越狱敏感性可以追溯到稀疏的、基于Token定位的SAE特征子组,补充了先前基于广泛对抗、拒绝或引导方向的解释。

英文摘要

Jailbreak attacks expose a persistent failure mode in safety-aligned LLMs: models can be pushed into harmful behavior, but the internal representations enabling this shift remain poorly localized. Recent mechanistic safety studies often explain such behavior through broad representational objects, including global refusal directions, activation steering vectors, and refusal-related SAE features. We instead ask whether jailbreak vulnerability can be traced to finer-grained, prompt-conditioned SAE feature subgroups. We introduce a token-driven mechanistic pipeline that decomposes the residual stream of Gemma-2-2B into Sparse Autoencoder (SAE) features and identifies feature subgroups associated with unsafe behavior. Using single-category unsafe examples from BeaverTails to reduce cross-category interference, we extract harmful concepts from adversarial responses and align them with concept-relevant prompt tokens through subspace similarity. We then apply three feature-grouping strategies: cluster-based, hierarchical-linkage, and single-token-driven, to identify SAE feature subgroups across all 26 layers. Finally, we amplify the top features in each subgroup and evaluate the resulting generations with a standardized harmfulness judge. Single-token-driven grouping achieves harmfulness comparable to full cluster-based grouping, showing that individual harmful prompt tokens are sufficient to localize vulnerability-relevant SAE feature subgroups without relying on broader cluster-level aggregation. These subgroups appear across early and mid-to-late layers, with stronger concentration in mid-to-late layers, where targeted steering exposes specific model vulnerabilities. Overall, our results suggest that jailbreak susceptibility can be traced to sparse, token-localized SAE feature subgroups, complementing prior accounts based on broad adversarial, refusal, or steering directions.

2510.09905 2026-06-18 cs.AI cs.CL 版本更新

The Personalization Trap: How User Memory Alters Emotional Reasoning in LLMs

个性化陷阱:用户记忆如何改变大语言模型的情感推理

Xi Fang, Weijie Xu, Yuchong Zhang, Stephanie Eckman, Scott Nickleach, Chandan K. Reddy

发表机构 * Amazon(亚马逊)

AI总结 研究用户记忆如何导致大语言模型在情感推理中产生系统性偏差,发现高绩效模型对优势背景用户的情感解读更准确,个性化机制可能嵌入社会等级。

Comments 19 pages 5 figures

详情
AI中文摘要

当AI助手记住Sarah是一位打两份工的单亲母亲时,它对她压力的解读是否与她是富有的高管时不同?随着个性化AI系统越来越多地融入长期用户记忆,理解这种记忆如何塑造情感推理至关重要。我们通过在人验证的情感智能测试上评估15个模型,研究用户记忆如何影响大语言模型(LLMs)的情感智能。我们发现,相同的场景搭配不同的用户画像会产生系统性不同的情感解读。在经验证的独立于用户的情感场景和多样化的用户画像中,几个高性能LLM出现了系统性偏差,其中优势背景的用户画像获得了更准确的情感解读。此外,LLM在情感推理和支持性推荐任务中表现出跨人口统计因素的显著差异,表明个性化机制可以将社会等级嵌入模型的情感推理中。这些结果凸显了记忆增强AI的一个关键挑战:为个性化设计的系统可能会强化社会不平等。为缓解这些差异,我们整理了一个通用偏好数据集,旨在减少人口统计画像对情感理解的影响。

英文摘要

When an AI assistant remembers that Sarah is a single mother working two jobs, does it interpret her stress differently than if she were a wealthy executive? As personalized AI systems increasingly incorporate long-term user memory, understanding how this memory shapes emotional reasoning is critical. We investigate how user memory affects emotional intelligence in large language models (LLMs) by evaluating 15 models on human-validated emotional intelligence tests. We find that identical scenarios paired with different user profiles produce systematically divergent emotional interpretations. Across validated user-independent emotional scenarios and diverse user profiles, systematic biases emerged in several high-performing LLMs where advantaged profiles received more accurate emotional interpretations. Moreover, LLMs demonstrate significant disparities across demographic factors in emotion reasoning and supportive recommendations tasks, indicating that personalization mechanisms can embed social hierarchies into models' emotional reasoning. These results highlight a key challenge for memory-enhanced AI: systems designed for personalization may reinforce social inequalities. To mitigate these disparities, we curate a general-purpose preference dataset designed to reduce demographic profiles' influence on emotional understanding.

11. 低资源、领域适配与高效训练 3 篇

2602.00161 2026-06-18 cs.LG cs.AI cs.CL quant-ph 版本更新

LLM Compression by Block Removal with Constrained Binary Optimization

通过带约束二进制优化的块移除进行LLM压缩

David Jansen, Roman Rausch, Ali Hashemi, David Montero, Román Orús

发表机构 * Multiverse Computing(多维计算公司) Donostia International Physics Center(多斯蒂亚国际物理中心) Ikerbasque Foundation for Science(伊克尔巴斯克科学基金会)

AI总结 提出将大语言模型块移除压缩问题建模为约束二进制优化,映射到Ising玻璃系统,实现高效排序和高质量非连续块移除,在50%压缩时MMLU提升近23个百分点,且计算高效、通用性强。

Comments 16 pages, 3 figures

详情
AI中文摘要

在本文中,我们将通过最优删除Transformer块(“块移除”)来压缩大语言模型(LLM)的问题,表述为一个约束二进制优化(CBO)问题,该问题可以映射到物理系统(Ising玻璃),其能量是下游模型性能的强代理。这种表述使得能够高效地对大量候选块移除配置进行排序,产生许多高质量、非平凡的解决方案,而不仅仅是移除连续区域。我们的方法在深度压缩场景中表现强劲,例如在Llama-3.3-70B-Instruct的50%压缩中,与其他最先进的块移除方法相比,我们在MMLU基准上取得了近23个百分点的提升。对于较轻的压缩,它在多个基准上与这些方法表现相当,适用于Llama-3.1-8B-Instruct、Qwen3-14B(重训练前后)以及Llama-3.3-70B-Instruct。该方法计算效率高,仅需在校准数据集上对少数活跃参数进行前向和反向传播。此外,我们证明,当无法精确求解CBO问题时,使用良好的启发式求解器可以在可忽略的运行时间内提供在下游任务上表现良好的解决方案。该方法可以轻松应用于任何架构。我们在最近的NVIDIA-Nemotron-3-Nano-30B-A3B-FP8模型上展示了这种通用性,该模型具有高度不均匀且具有挑战性的块结构,并且在移除2个注意力层或3个混合专家层时,我们在AIME25和GPQA上超越了最先进水平。

英文摘要

In this paper, we formulate the compression of large language models (LLMs) by optimally deleting transformer blocks (``block removal'') as a constrained binary optimization (CBO) problem that can be mapped to a physical system (Ising glass), whose energies are a strong proxy for downstream model performance. This formulation enables an efficient ranking of a large number of candidate block-removal configurations yielding many high-quality, non-trivial solutions beyond those only removing consecutive regions. Our method performs strongly in the deep compression regime, such as for 50% compression of Llama-3.3-70B-Instruct, where we achieve an almost 23 percentage point increase on the MMLU benchmark compared to other state-of-the-art (SOTA) block-removal methods. For lighter compression, it performs on par with those methods across several benchmarks for Llama-3.1-8B-Instruct, Qwen3-14B (both before and after retraining), as well as Llama-3.3-70B-Instruct. The approach is computationally efficient and requires only forward and backward passes on a calibration dataset for a few active parameters. Additionally, we demonstrate that using good heuristic solvers for the CBO problem provides solutions that perform well on downstream tasks in negligible runtime when it is unfeasible to solve the problem exactly. The method can be readily applied to any architecture. We illustrate this generality on the recent NVIDIA-Nemotron-3-Nano-30B-A3B-FP8 model, which exhibits a highly inhomogeneous and challenging block structure, and where we outperform SOTA for AIME25 and GPQA when removing either 2 attention layers or 3 mixture-of-experts layers.

2603.06310 2026-06-18 eess.AS cs.CL cs.SD 版本更新

Continual Adaptation for Pacific Indigenous Speech Recognition

太平洋土著语音识别的持续适应

Yang Xiao, Aso Mahmudi, Nick Thieberger, Eliathamby Ambikairajah, Eun-Jung Holden, Ting Dang

发表机构 * The University of Melbourne(墨尔本大学) UNSW Sydney(新南威尔士大学悉尼分校)

AI总结 针对太平洋土著语言数据稀缺和灾难性遗忘问题,研究语音基础模型的适应策略,发现LoRA在顺序学习中会灾难性遗忘,需定制鲁棒适应方法。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

语音基础模型在处理资源匮乏的太平洋土著语言时面临严重的数据稀缺问题。此外,完全微调存在灾难性遗忘的风险。为弥补这一空白,我们提出了一项实证研究,将模型适应到真实的太平洋数据集。我们研究了数据量、适应策略和表征漂移对多种太平洋语言语音基础模型的影响。此外,我们分析了一个用于顺序语言习得的持续学习框架。跨三种不同的太平洋土著语言的实证结果表明,适应这些语言距离较远的语言会引发严重的内部表征漂移。因此,这些模型面临严格的可塑性与稳定性困境。虽然LoRA初始适应良好,但在顺序学习过程中会出现灾难性遗忘。最终,本研究强调了为代表性不足的语言定制鲁棒适应策略的迫切需求。

英文摘要

Speech foundation models struggle with low-resource Pacific Indigenous languages because of severe data scarcity. Furthermore, full fine-tuning risks catastrophic forgetting. To address this gap, we present an empirical study adapting models to real-world Pacific datasets. We investigate the impact of data volume, adaptation strategies, and representational drift on speech foundation models for various Pacific languages. Additionally, we analyze a continual learning framework for sequential language acquisition. Empirical results across three distinct Pacific Indigenous languages demonstrate that adapting to these linguistically distant languages induces severe internal representational drift. Consequently, these models face a strict plasticity and stability dilemma. While LoRA adapts well initially, it suffers from catastrophic forgetting during sequential learning. Ultimately, this study highlights the urgent need for robust adaptation strategies tailored to underrepresented languages.

2606.01249 2026-06-18 cs.LG cs.CL 版本更新

Trust Region On-Policy Distillation

信任区域在线策略蒸馏

Xingrun Xing, Haoqing Wang, Boyan Gao, Ziheng Li, Yehui Tang

发表机构 * Samsung Research(三星研究院) University of Oxford(牛津大学) Peking University(北京大学)

AI总结 提出信任区域在线策略蒸馏(TrOPD),通过信用分配策略和信任区域学习解决师生分布差异导致的训练不稳定问题,在数学推理、代码生成和通用基准上超越现有方法。

详情
AI中文摘要

在线策略蒸馏(OPD)是大型语言模型(LLM)高效后训练的基本技术,在智能体学习、多任务增强和模型压缩中具有广泛应用。然而,当教师和学生分布差异较大时,OPD训练变得不稳定,因为教师对学生生成token的监督可能产生不可靠的策略梯度,甚至导致优化失败。本文通过信用分配策略解决可靠的在线策略token级监督问题,并提出信任区域在线策略蒸馏(TrOPD)。它具有以下特点:1)信任区域在线策略学习:TrOPD仅在教师提供可靠监督的区域进行OPD,缓解了分布不匹配下K1反向KL估计的优化困难。2)异常值估计:对于异常区域,我们探索梯度裁剪、掩码和前向KL估计,以减少不可靠监督的不利影响。3)离策略引导:学生从教师前缀继续生成,并使用前向KL模仿离策略引导,鼓励向可靠区域进行在线策略探索。实验表明,TrOPD在数学推理、代码生成和通用领域基准上始终优于最先进的OPD基线,包括OPD、EOPD和REOPOLD。

英文摘要

On-Policy Distillation (OPD) is a fundamental technique for efficient post-training of large language models (LLMs), with broad applications in agent learning, multi-task enhancement, and model compression. However, OPD training becomes unstable when the teacher and student distributions differ substantially, as teacher supervision on student-generated tokens may yield unreliable policy gradients and even cause optimization failure. This work addresses reliable on-policy token-level supervision through credit assignment strategies, and proposes Trust Region On-Policy Distillation, TrOPD. It features the following characteristics: 1) Trust-Region On-Policy Learning: TrOPD performs OPD only in regions where the teacher provides reliable supervision, mitigating the optimization difficulty of the K1 reverse-KL estimator under distribution mismatch. 2) Outlier Estimation: For outlier regions, we explore gradient clipping, masking, and forward-KL estimation to reduce the adverse effects of unreliable supervision. 3) Off-Policy Guidance: The student continues generation from teacher prefixes and uses forward KL to imitate off-policy guidance, encouraging on-policy exploration toward reliable regions. Experiments show that TrOPD consistently outperforms SoTA OPD baselines, including OPD, EOPD, and REOPOLD, across mathematical reasoning, code generation, and general-domain benchmarks.

12. 其他/综合NLP 8 篇

2503.01805 2026-06-18 cs.LG cs.AI cs.CL 版本更新

Depth-Width tradeoffs in Algorithmic Reasoning of Graph Tasks with Transformers

图任务算法推理中Transformer的深度-宽度权衡

Gilad Yehudai, Clayton Sanford, Maya Bechler-Speicher, Orr Fischer, Ran Gilad-Bachrach, Amir Globerson

发表机构 * Courant Institute of Mathematical Sciences, New York University(纽约大学应用数学科学研究所) Google Research(谷歌研究) Meta AI Bar-Ilan University(巴伊兰大学) Department of Bio-Medical Engineering, Edmond J. Safra Center for Bioinformatics, Tel-Aviv University(生物医学工程系,埃德蒙·J·萨法中心,特拉维夫大学) Tel Aviv University(特拉维夫大学)

AI总结 研究Transformer在图算法任务中深度与宽度的权衡,发现线性宽度下常数深度足以解决许多图问题,而某些问题需要二次宽度,实验验证了宽模型在保持精度的同时训练和推理更快。

Comments Updated ISF grant number

详情
AI中文摘要

Transformer已经彻底改变了机器学习领域。特别是,它们可用于解决复杂的算法问题,包括基于图的任务。在此类算法任务中,一个关键问题是能够实现该任务的Transformer的最小尺寸是多少。最近的工作开始探索图任务的这个问题,表明对于次线性嵌入维度(即模型宽度),对数深度就足够了。然而,我们在这里解决的一个开放问题是,如果允许宽度线性增长而深度保持固定,会发生什么。我们分析了这种情况,并得出了一个令人惊讶的结果:在线性宽度下,常数深度足以解决一系列基于图的问题。这表明宽度的适度增加可以允许更浅的模型,这在推理和训练时间方面是有利的。对于其他问题,我们表明需要二次宽度。我们的结果展示了Transformer实现图算法的复杂而有趣的格局。我们通过实验研究了深度和宽度相对能力之间的这些权衡,并发现宽模型在具有与深模型相同准确度的任务中,由于可并行化的硬件,训练和推理时间更快。

英文摘要

Transformers have revolutionized the field of machine learning. In particular, they can be used to solve complex algorithmic problems, including graph-based tasks. In such algorithmic tasks a key question is what is the minimal size of a transformer that can implement the task. Recent work has begun to explore this problem for graph-based tasks, showing that for sub-linear embedding dimension (i.e., model width) logarithmic depth suffices. However, an open question, which we address here, is what happens if width is allowed to grow linearly, while depth is kept fixed. Here we analyze this setting, and provide the surprising result that with linear width, constant depth suffices for solving a host of graph-based problems. This suggests that a moderate increase in width can allow much shallower models, which are advantageous in terms of inference and train time. For other problems, we show that quadratic width is required. Our results demonstrate the complex and intriguing landscape of transformer implementations of graph-based algorithms. We empirically investigate these trade-offs between the relative powers of depth and width and find tasks where wider models have the same accuracy as deep models, while having much faster train and inference time due to parallelizable hardware.

2605.16385 2026-06-18 cs.CV cs.AI cs.CL 版本更新

Hilbert-Geo: Solving Solid Geometric Problems by Neural-Symbolic Reasoning

Hilbert-Geo:通过神经符号推理解决立体几何问题

Ruoran Xu, Haoyu Cheng, Bin Dong, Qiufeng Wang

发表机构 * Xi’an Jiaotong-Liverpool University(西安交通大学利物浦大学) Ricoh Software Research Center Beijing Co.,Ltd(Ricoh 软件研究中心北京有限公司)

AI总结 提出Hilbert-Geo框架和Parse2Reason方法,利用条件描述语言和定理库实现立体几何问题的严格推理,在SolidFGeo2k和MathVerse-Solid上达到SOTA性能。

Comments Computer Vision and Pattern Recognition (CVPR), 2026

详情
AI中文摘要

几何问题求解作为一种典型的多模态推理问题,近年来受到广泛关注并取得了很大进展,然而大多数工作集中于平面几何,由于三维空间图和复杂推理,通常在立体几何中失败。为弥补这一差距,我们引入了Hilbert-Geo,这是第一个用于立体几何的统一形式语言框架,包括一个广泛的谓词库和一个专用的定理库。基于该框架,我们提出了一种Parse2Reason方法,包含先解析后推理两个步骤。在解析步骤中,我们利用条件描述语言(CDL),一种由专门用于构建几何条件的谓词组成的形式化语言,来表示问题描述(自然文本)和立体图(视觉图像)。在推理步骤中,我们利用这些形式化CDL和定理库进行关系推理和代数计算,生成严格正确、可验证且人类可读的推理过程。值得注意的是,我们提出的Hilbert-Geo也适用于平面几何。为推进几何推理,我们策划了两个专家标注的数据集SolidFGeo2k和PlaneFGeo3k,它们配备了几何形式语言标注、解答和答案。大量实验表明,我们提出的方法在SolidFGeo2k上达到77.3%的最先进性能,在MathVerse-Solid(MathVerse中专用于立体几何的一个小子集)上达到84.1%,显著优于领先的多模态大语言模型,如Gemini-2.5-pro(在SolidFGeo2k上为54.2%)和GPT-5(在MathVerse-Solid上为62.9%)。此外,我们的方法在PlaneFGeo3k上达到80.2%的SOTA准确率,展示了Hilbert-Geo在几何推理中的通用性。我们的代码和数据集将公开提供。

英文摘要

Geometric problem solving, as a typical multimodal reasoning problem, has attracted much attention and made great progress recently, however most of works focus on plane geometry while usually fail in solid geometry due to 3D spatial diagrams and complex reasoning. To bridge this gap, we introduce Hilbert-Geo, the first unified formal language framework for solid geometry, including an extensive predicate library and a dedicated theorem bank. Based on this framework, we propose a Parse2Reason method containing two steps of first parsing then reasoning. In the parsing step, we utilize conditional description language (CDL), a formalized language composed of predicates specifically designed to construct geometric conditions, to represent both problem description (natural text) and solid diagrams (visual image). In the reasoning step, we leverage those formal CDL and the theorem bank to perform relational inference and algebraic computation, generating strictly correct, verifiable, and human-readable reasoning processes. Notably, our proposed Hilbert-Geo is also applicable to plane geometry. To advance geometric reasoning, we curate two expert-annotated dataset SolidFGeo2k and PlaneFGeo3k, which are furnished with geometric formal language annotations, solutions and answers. Extensive experiments show that our proposed method achieves the state-of-the-art (SOTA) performance 77.3% in SolidFGeo2k and 84.1% in MathVerse-Solid (one small subset in MathVerse dedicated to solid geometry), substantially outperforming leading MLLMs, such as Gemini-2.5-pro (54.2% on SolidFGeo2k) and GPT-5 (62.9% on MathVerse-Solid). In addition, our method achieves the SOTA accuracy 80.2% in PlaneFGeo3k, demonstrating the generality of the Hilbert-Geo in geometric reasoning. Our code and datasets are released at https://github.com/PremiLab-Math/Hilbert-Geo.

2602.20135 2026-06-18 cs.CL cs.AI cs.IR 版本更新

KNIGHT: Knowledge Graph-Driven Multiple-Choice Question Generation with Adaptive Hardness Calibration

Mohammad Amanlou, Erfan Shafiee Moghaddam, Yasaman Amou Jafari, Mahdi Noori, Farhan Farsi, Behnam Bahrak

发表机构 * University of Tehran(塔里班大学) Independent Researcher(独立研究员) Amirkabir University of Technology(阿米尔卡比尔技术大学) TEIAS Institute(TEIAS研究所)

Comments Accepted at the Third Conference on Parsimony and Learning (CPAL 2026). 36 pages, 12 figures. (Equal contribution: Yasaman Amou Jafari and Mahdi Noori.)

详情
Journal ref
Conference on Parsimony and Learning, Proceedings of Machine Learning Research, 328:989-1024, 2026
英文摘要

With the rise of large language models (LLMs), they have become instrumental in applications such as Retrieval-Augmented Generation (RAG). Yet evaluating these systems remains bottlenecked by the time and cost of building specialized assessment datasets. We introduce KNIGHT, an LLM-based, knowledge-graph-driven framework for generating multiple-choice question (MCQ) datasets from external sources. KNIGHT constructs a topic-specific knowledge graph, a structured and parsimonious summary of entities and relations, that can be reused to generate instructor-controlled difficulty levels, including multi-hop questions, without repeatedly re-feeding the full source text. This knowledge graph acts as a compressed, reusable state, making question generation a cheap read over the graph. We instantiate KNIGHT on Wikipedia/Wikidata while keeping the framework domain- and ontology-agnostic. As a case study, KNIGHT produces six MCQ datasets in History, Biology, and Mathematics. We evaluate quality on five criteria: fluency, unambiguity (single correct answer), topic relevance, option uniqueness, and answerability given the provided sources (as a proxy for hallucination). Results show that KNIGHT enables token- and cost-efficient generation from a reusable graph representation, achieves high quality across these criteria, and yields model rankings aligned with MMLU-style benchmarks, while supporting topic-specific and difficulty-controlled evaluation.

2508.20275 2026-06-18 cs.LG cs.CL q-bio.QM 版本更新

A Systematic Review on the Generative AI Applications in Human Medical Genomics

Anton Changalidis, Yury Barbitoff, Yulia Nasykhova, Andrey Glotov

发表机构 * Dpt. of Genomic Medicine(基因组医学系) D.O. Ott Research Institute of Obstetrics, Gynaecology, and Reproductology(D.O. Ott妇产科与生殖医学研究所)

Comments 31 pages, 5 figures

详情
Journal ref
Frontiers in Genetics 16 (2026) 1694070
英文摘要

Although traditional statistical techniques and machine learning methods have contributed significantly to genetics and, in particular, inherited disease diagnosis, they often struggle with complex, high-dimensional data, a challenge now addressed by state-of-the-art deep learning models. Large language models (LLMs), based on transformer architectures, have excelled in tasks requiring contextual comprehension of unstructured medical data. This systematic review examines the role of LLMs in the genetic research and diagnostics of both rare and common diseases. Automated keyword-based search in PubMed, bioRxiv, medRxiv, and arXiv was conducted, targeting studies on LLM applications in diagnostics and education within genetics and removing irrelevant or outdated models. A total of 172 studies were analyzed, highlighting applications in genomic variant identification, annotation, and interpretation, as well as medical imaging advancements through vision transformers. Key findings indicate that while transformer-based models significantly advance disease and risk stratification, variant interpretation, medical imaging analysis, and report generation, major challenges persist in integrating multimodal data (genomic sequences, imaging, and clinical records) into unified and clinically robust pipelines, facing limitations in generalizability and practical implementation in clinical settings. This review provides a comprehensive classification and assessment of the current capabilities and limitations of LLMs in transforming hereditary disease diagnostics and supporting genetic education, serving as a guide to navigate this rapidly evolving field.

2503.01163 2026-06-18 cs.AI cs.CL cs.HC cs.LG cs.NE 版本更新

Bandit-Based Prompt Design Strategy Selection Improves Prompt Optimizers

Rin Ashizawa, Yoichi Hirose, Nozomu Yoshinari, Kento Uchida, Shinichi Shirakawa

发表机构 * Yokohama National University(横滨国立大学)

Comments Accepted to ACL 2025 Findings

详情
英文摘要

Prompt optimization aims to search for effective prompts that enhance the performance of large language models (LLMs). Although existing prompt optimization methods have discovered effective prompts, they often differ from sophisticated prompts carefully designed by human experts. Prompt design strategies, representing best practices for improving prompt performance, can be key to improving prompt optimization. Recently, a method termed the Autonomous Prompt Engineering Toolbox (APET) has incorporated various prompt design strategies into the prompt optimization process. In APET, the LLM is needed to implicitly select and apply the appropriate strategies because prompt design strategies can have negative effects. This implicit selection may be suboptimal due to the limited optimization capabilities of LLMs. This paper introduces Optimizing Prompts with sTrategy Selection (OPTS), which implements explicit selection mechanisms for prompt design. We propose three mechanisms, including a Thompson sampling-based approach, and integrate them into EvoPrompt, a well-known prompt optimizer. Experiments optimizing prompts for two LLMs, Llama-3-8B-Instruct and GPT-4o mini, were conducted using BIG-Bench Hard. Our results show that the selection of prompt design strategies improves the performance of EvoPrompt, and the Thompson sampling-based mechanism achieves the best overall results. Our experimental code is provided at https://github.com/shiralab/OPTS .

2504.12347 2026-06-18 cs.CL cs.AI cs.CY 版本更新

Assessment of Evolving Large Language Models in Upper Secondary Mathematics

Mika Setälä, Pieta Sikström, Ville Heilala, Tommi Kärkkäinen

发表机构 * Faculty of Information Technology(信息科技学院) University of Jyväskylä(于韦斯屈莱大学) Faculty of Humanities and Social Sciences(人文与社会科学学院)

详情
英文摘要

Large language models (LLMs) have shown increasing promise in educational settings, yet their mathematical reasoning has been considered evolving. This study evaluates the mathematical capabilities of various LLMs using the Finnish matriculation examination, a high-stakes digital test for upper secondary education. Initial tests yielded moderate performance corresponding to mid-range grades, but later evaluations demonstrated substantial improvements as the language models evolved. Remarkably, some models achieved near-perfect or perfect scores, matching top student performance and qualifying for university admission. Our findings highlight the rapid advances in the mathematical proficiency of LLMs and illustrate their potential as underlying tools to support learning and teaching in a variety of ways.

2410.03151 2026-06-18 cs.CL cs.SI 版本更新

Media Framing through the Lens of Event-Centric Narratives

Rohan Das, Aditya Chandra, I-Ta Lee, Maria Leonor Pacheco

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校) Independent Researcher(独立研究者)

Comments Accepted to the 6th Workshop on Narrative Understanding, co-located with EMNLP 2024

详情
英文摘要

From a communications perspective, a frame defines the packaging of the language used in such a way as to encourage certain interpretations and to discourage others. For example, a news article can frame immigration as either a boost or a drain on the economy, and thus communicate very different interpretations of the same phenomenon. In this work, we argue that to explain framing devices we have to look at the way narratives are constructed. As a first step in this direction, we propose a framework that extracts events and their relations to other events, and groups them into high-level narratives that help explain frames in news articles. We show that our framework can be used to analyze framing in U.S. news for two different domains: immigration and gun control.

2206.05018 2026-06-18 cs.SD cs.CL eess.AS 版本更新

Going Beyond the Cookie Theft Picture Test: Detecting Cognitive Impairments using Acoustic Features

Franziska Braun, Andreas Erzigkeit, Hartmut Lehfeld, Thomas Hillemacher, Korbinian Riedhammer, Sebastian P. Bayerl

Comments Accepted at the 25th International Conference on Text, Speech and Dialogue (TSD 2022)

详情
Journal ref
Proceedings of the 25th International Conference on Text, Speech, and Dialogue (TSD 2022)
英文摘要

Standardized tests play a crucial role in the detection of cognitive impairment. Previous work demonstrated that automatic detection of cognitive impairment is possible using audio data from a standardized picture description task. The presented study goes beyond that, evaluating our methods on data taken from two standardized neuropsychological tests, namely the German SKT and a German version of the CERAD-NB, and a semi-structured clinical interview between a patient and a psychologist. For the tests, we focus on speech recordings of three sub-tests: reading numbers (SKT 3), interference (SKT 7), and verbal fluency (CERAD-NB 1). We show that acoustic features from standardized tests can be used to reliably discriminate cognitively impaired individuals from non-impaired ones. Furthermore, we provide evidence that even features extracted from random speech samples of the interview can be a discriminator of cognitive impairment. In our baseline experiments, we use OpenSMILE features and Support Vector Machine classifiers. In an improved setup, we show that using wav2vec 2.0 features instead, we can achieve an accuracy of up to 85%.