arXivDaily: arXiv daily academic digest, updated Monday through Friday
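2605.12260 2026-05-25 cs.CL (updated version)

PRISM: Pareto-Efficient Retrieval over Intent-Aware Structured Memory for Long-Horizon Agents

Jingyi Peng, Zhongwei Wan, Weiting Liu, Qiuzhuang Sun

Affiliations: Singapore Management University; The Ohio State University; Fudan University

AI summary: PRISM is a training-free retrieval framework for long-horizon agents that addresses memory management when conversation history outgrows the context window. It treats long-horizon memory as a joint retrieval-and-compression problem over graph-structured memory and combines four orthogonal inference-time components. Experiments on LoCoMo show that PRISM reaches higher LLM-judge accuracy than same-protocol baselines at an order-of-magnitude smaller context budget.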

Comments: Preprint

Abstract

Long-horizon language agents accumulate conversation history far faster than any fixed context window can hold, making memory management critical to both answer accuracy and serving cost. Existing approaches either expand the context window without addressing what is retrieved, perform heavy ingestion-time fact extraction at substantial token cost, or rely on heuristic graph traversal that leaves both accuracy and efficiency on the table. We present PRISM, a training-free retrieval-side framework that treats long-horizon memory as a joint retrieval-and-compression problem over a graph-structured memory. PRISM combines four orthogonal inference-time components: Hierarchical Bundle Search over typed relation paths, Query-Sensitive Edge Costing that aligns traversal with detected query intent, Evidence Compression that compresses the candidate bundle into a compact answer-side context, and Adaptive Intent Routing that routes most queries through zero-LLM tiers. By formulating retrieval as min-cost selection over typed path templates and pairing it with an LLM-side compression step, PRISM surfaces the right evidence under a strict context budget without any fine-tuning or modification to the upstream ingestion pipeline. Experiments on the LoCoMo benchmark show that PRISM delivers substantially higher LLM-judge accuracy than every same-protocol baseline at an order-of-magnitude smaller context budget, occupying a previously empty corner of the accuracy-context-cost frontier and demonstrating a superior balance between answer quality and retrieval efficiency.

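2605.23897 2026-05-25 cs.CV cs.AI cs.CL (updated version)

ETCHR: Editing To Clarify and Harness Reasoning

Beichen Zhang, Yuhong Liu, Jinsong Li, Yuhang Zang, Jiaqi Wang, Dahua Lin

Affiliations: The Chinese University of Hong Kong; Shanghai AI Laboratory; Shanghai Jiao Tong University; Shanghai Innovation Institute

AI summary: Multimodal large language models have advanced visual reasoning, but purely textual chains of thought remain a bottleneck for questions that need fine-grained focus or view transformations. ETCHR is a question-conditioned image editor decoupled from the understanding model; a two-stage training recipe targets the language-side and generation-side gaps in turn, improving edit correctness and downstream reasoning. Across five task families it raises average Pass@1 by 4.61 to 5.47 points, and it plugs into both open- and closed-source models without further training.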

Comments: Code, model and data are open-sourced at https://github.com/InternLM/ETCHR

Abstract

Multimodal Large Language Models have advanced visual reasoning, yet a purely textual chain of thought remains a bottleneck for questions that require fine-grained focus or view transformations. The "think with images" paradigm narrows this gap, but existing approaches are either constrained by fixed predefined toolkits or produce noisy intermediate images from unified multimodal methods. We pursue a third option: using a dedicated image editing model and decoupling it from the understanding model. However, off-the-shelf image editors fail as reasoning assistants because of two complementary gaps: a language-side gap, where editors trained as passive instruction-followers cannot map an abstract question to an appropriate visual transformation, and a generation-side gap, where edit correctness degrades as reasoning depth grows. Guided by this analysis, we introduce ETCHR (Editing To Clarify and Harness Reasoning), a question-conditioned, reasoning-aware image editor decoupled from the downstream understanding model and trained with a two-stage recipe targeted at the two gaps: Reasoning Imitation via supervised fine-tuning on edit trajectories, followed by Reasoning Enhancement with VLM-derived rewards for edit correctness and downstream reasoning accuracy. Since the editor is decoupled, ETCHR plugs into different open- and closed-source MLLMs in a training-free manner. Across five task families (fine-grained perception, chart understanding, logic reasoning, jigsaw restoration, and 3D understanding), ETCHR raises average Pass@1 from 55.95 to 60.77 (+4.82) with Qwen3-VL-8B, from 65.08 to 70.55 (+5.47) with Gemini-3.1-Flash-Lite, and from 76.55 to 81.16 (+4.61) with the 1T-parameter MoE model Kimi K2.5.

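2605.23885 2026-05-25 cs.CL (updated version)

Multilingual Knowledge Transfer under Data Constraints via Lexical Interventions

Anastasiia Sedova, Natalie Schluter, Skyler Seto, Maartje ter Hoeve

Affiliations: Apple

AI summary: This paper studies multilingual knowledge transfer under limited data through word-level interventions. The proposed method, LINK, replaces words in the high-resource (English) portion of the pretraining data with translations from a bilingual vocabulary, improving transfer to low-resource languages. It needs no extra training stage or parallel corpus, only a bilingual vocabulary, and it improves downstream performance across eight languages and five model sizes.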

Abstract

Cross-lingual knowledge transfer is critical for building high-performing multilingual language models for languages with insufficient training data. When target language data is scarce, the knowledge required for many downstream tasks involving scientific reasoning, commonsense inference, and world knowledge must be acquired primarily from the high-resource language, making effective knowledge transfer essential. Existing methods for improving such cross-lingual knowledge transfer require large amounts of parallel data, translation systems, auxiliary models, or additional training stages that are largely unavailable for many languages. We propose LINK, a data-level intervention method that improves knowledge transfer during model pretraining through lexical substitutions in the high-resource part of the pretraining data using bilingual vocabularies. For a given replacement ratio, randomly selected words in a portion of the high-resource (English) training corpus are swapped with their word-level translations, requiring no additional model training and only a bilingual vocabulary, which can be obtained at near-zero cost for virtually any language. Evaluation on eight languages across five model sizes shows notable improvements on downstream tasks in the target language, with up to a 2x speedup in training to reach equivalent performance.
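
The word-level substitution described in this abstract can be pictured with a small sketch. This is not the authors' implementation: the function name `substitute_words`, the toy English-German lexicon, and the choice to apply the replacement ratio only to tokens that have a lexicon entry are all assumptions made for illustration.

```python
import random

def substitute_words(tokens, lexicon, ratio, seed=0):
    """Swap roughly `ratio` of the translatable tokens for their lexicon entry."""
    rng = random.Random(seed)
    # Only tokens that appear in the bilingual lexicon can be replaced.
    candidates = [i for i, tok in enumerate(tokens) if tok.lower() in lexicon]
    n_swap = round(ratio * len(candidates))
    chosen = set(rng.sample(candidates, n_swap))
    return [lexicon[tok.lower()] if i in chosen else tok
            for i, tok in enumerate(tokens)]

# Hypothetical toy lexicon (English -> German), for illustration only.
toy_lexicon = {"water": "Wasser", "freezes": "gefriert", "cold": "kalt"}
sentence = "Water freezes when it is cold".split()
print(substitute_words(sentence, toy_lexicon, ratio=0.5))
```

With `ratio=0.0` the sentence is returned unchanged, and with `ratio=1.0` every token that has a lexicon entry is replaced. How the paper samples words and which portion of the corpus it touches may differ from this sketch.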

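2605.23857 2026-05-25 cs.LG cs.CL (updated version)

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu, Zhuang Liu

Affiliations: Princeton University

AI summary: This paper challenges the common assumption in knowledge distillation that a stronger teacher yields a better student, examining how different teacher-student relationships affect distillation in LLM pretraining. By varying model size and training token budget, the authors construct strong-to-weak, same-level, and weak-to-strong pairs, and find that even small or undertrained teachers improve larger students when the language-modeling and distillation losses are mixed properly. A stronger teacher is not always better: scaling the teacher further can saturate or even reverse the gains, and distillation improves generalization more readily than in-domain fit.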

Abstract

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
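
The "mixing of the language modeling and knowledge distillation losses" mentioned in this abstract is commonly written as a weighted sum of a cross-entropy term and a KL term. The sketch below shows that generic form for a single token position; the weight name `alpha`, the KL direction (teacher to student), and the absence of a temperature are assumptions, since the abstract does not specify the paper's exact objective.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def mixed_loss(student_logits, teacher_logits, target, alpha):
    """(1 - alpha) * cross-entropy on the true token + alpha * KL(teacher || student)."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)
    lm_loss = -math.log(p_student[target])
    kd_loss = sum(t * math.log(t / s)
                  for t, s in zip(p_teacher, p_student) if t > 0)
    return (1 - alpha) * lm_loss + alpha * kd_loss

# alpha = 0 is plain language modeling; alpha = 1 is pure distillation.
print(mixed_loss([0.2, 1.5, -0.3], [0.0, 2.0, 0.1], target=1, alpha=0.5))
```

Setting `alpha` between 0 and 1 interpolates between the two objectives; the abstract's claim is that this mix matters when the teacher is weak.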

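2605.23826 2026-05-25 cs.CV cs.CL (updated version)

Decomposing Queries into Tool Calls for Long-Video Keyframe Retrieval

Michal Shlapentokh-Rothman, Prachi Garg, Yu-Xiong Wang, Derek Hoiem

Affiliations: University of Illinois at Urbana-Champaign

AI summary: This paper studies keyframe retrieval from long videos in support of question answering and proposes ToolMerge, a retrieval method based on decomposing queries into tool calls and merging their results. An LLM planner decomposes the query into tool calls and merges the per-tool rankings with boolean operators to localize relevant frames. On the newly constructed Molmo-2 Moments (M2M) benchmark, ToolMerge is competitive with prior keyframe selectors and outperforms other methods by 5% on caption retrieval.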

Abstract

Keyframe selection is a direct way to provide verifiable visual evidence for long-video question answering (QA). Queries differ in what they require, and finding the right frames depends on knowing what to look for. Existing keyframe selectors either score every frame against a single query, or decompose the query into a fixed schema evaluated by a single visual tool. We propose ToolMerge, a keyframe retrieval method based on decomposition and merging: a Large Language Model (LLM)-based planner decomposes the query into tool calls and specifies how their per-tool rankings are merged using boolean operators. To evaluate retrieval directly, we construct Molmo-2 Moments (M2M), a benchmark in which every question is anchored to a specific time interval by construction. Across QA, question retrieval, and caption retrieval, ToolMerge is competitive with prior keyframe selectors, most notably on caption retrieval, outperforming other methods by 5%. Code and data can be found at https://github.com/michalsr/ToolMerge .
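
One way to picture merging per-tool rankings with boolean operators is fuzzy logic over per-frame scores. The sketch below is an assumption-laden illustration, not ToolMerge itself: the plan format (nested tuples), the tool names, and the choice of min for AND, max for OR, and 1 - s for NOT are all hypothetical.

```python
def merge(plan, tool_scores):
    """Evaluate a boolean plan over per-frame scores in [0, 1]."""
    if isinstance(plan, str):            # leaf: the name of one tool call
        return tool_scores[plan]
    op, *args = plan
    merged = [merge(arg, tool_scores) for arg in args]
    if op == "AND":
        return [min(vals) for vals in zip(*merged)]
    if op == "OR":
        return [max(vals) for vals in zip(*merged)]
    if op == "NOT":
        return [1.0 - v for v in merged[0]]
    raise ValueError(f"unknown operator: {op}")

def top_k(scores, k):
    """Indices of the k highest-scoring frames, best first."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]

# Hypothetical per-frame scores from three tool calls over four frames.
scores = {
    "detect:dog": [0.9, 0.8, 0.1, 0.7],
    "detect:ball": [0.2, 0.9, 0.8, 0.6],
    "ocr:scoreboard": [0.0, 0.1, 0.0, 0.9],
}
plan = ("AND", "detect:dog", ("OR", "detect:ball", "ocr:scoreboard"))
print(top_k(merge(plan, scores), k=2))
```

A frame ranks highly here only if the dog detector fires and at least one of the other two tools agrees, which is the kind of query structure a single similarity score cannot express.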

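2605.23821 2026-05-25 cs.CL cs.LG (updated version)

Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

Andres Nava, Matthieu Wyart

Affiliations: Johns Hopkins University; EPFL

AI summary: This paper studies how hypernymy (the "is-a" relation) is encoded geometrically in language models through word co-occurrence. Starting from the empirical observation that words closer on the WordNet hypernym graph co-occur more often, the authors analyze the spectrum of the Gram matrix of word2vec embeddings and prove that the leading eigenvectors separate taxonomic branches from coarse to fine, yielding a hierarchical splitting geometry that mirrors the tree. The signature appears in word2vec embeddings and extends to Gemma 2B unembeddings, indicating that hierarchical concept geometry can emerge from the spectral structure of pairwise word statistics without a hierarchy-specific mechanism.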

Comments: 34 pages, 12 figures, including appendices

Abstract

We propose a distributional theory of how hypernymy -- the "is-a" relation between general and specific concepts -- is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.

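2605.23721 2026-05-25 cs.CL (updated version)

Is a Document Educational or Just Wikipedia-Style? -- Pitfalls of Classifier-Based Quality Filtering

Mateusz Klimaszewski, Piotr Andruszkiewicz

Affiliations: Warsaw University of Technology; IDEAS Research Institute

AI summary: This paper examines a weakness of classifier-based quality filtering for building pretraining corpora: a simple Wikipedia-style reformatting can substantially change the classifier's quality score and let low-quality content pass the filter. For the FineWeb-Edu CQF model, the filtering decision flips for about 7% of the evaluated documents, admitting content that would otherwise have been excluded.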

Comments: Accepted to ACL 2026

Abstract

Classifier-based Quality Filtering has recently emerged as a fundamental technique in constructing pre-training corpora. The ability to deploy a single model that can replace or supplement a set of heuristics has proven effective across numerous Large Language Models. In this work, we expose a critical vulnerability in this approach by demonstrating how a straightforward Wikipedia-style reformatting operation can substantially alter a model's quality assessment and enable low-quality content to surpass filtering thresholds. Our analysis reveals that the FineWeb-Edu CQF model would reverse its filtering decision for approximately 7% of evaluated documents, thereby admitting content into the pre-training corpus that would otherwise have been excluded.

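2605.23715 2026-05-25 cs.CL (updated version)

NLG Evaluation: Past, Present, Future

Ehud Reiter

Affiliations: Department of Computing Science, University of Aberdeen

AI summary: This paper reviews the history of natural language generation (NLG) evaluation, from 1990, when NLG was closely tied to linguistics, to today, when it is closely linked to machine learning, and looks ahead to how evaluation will change. Experimental evaluation has gone from rare to expected, with LLM-as-Judge among the most recent techniques, and the author expects impact, qualitative, and safety evaluation to grow in importance as NLG technology comes into routine use.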

Comments: Will appear in Proceedings of RetroEval 2026

Abstract

Natural Language Generation (NLG) evaluation has changed dramatically since 1990, and will continue to evolve in the future. In 1990, when NLG had close ties to linguistics, there was very little formal experimental evaluation in the modern sense. In 2026, when NLG is closely linked to machine learning, experimental evaluation is expected and indeed fundamental to research. Many evaluation techniques were developed over this period, including most recently LLM-as-Judge. I expect NLG evaluation will continue to evolve in the future. In particular, impact, qualitative, and safety evaluation will become more important as large numbers of people routinely use NLG technology.

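2605.23710 2026-05-25 cs.CL (updated version)

A graph-based analysis of semantic types and coercion in contextualized word embeddings

Long Chen, Deniz Ekin Yavas

Affiliations: Heinrich Heine University Düsseldorf

AI summary: This paper studies how semantic type mismatch shows up in contextualized word embeddings and proposes a graph-based method for analyzing lexical and contextual type information. Graphs are built from BERT and sense-enhanced embeddings and analyzed with two metrics, Neighbor Type Probability (NTP) and Neighbor Type Entropy (NTE). Graphs built from sense-enhanced embeddings reflect semantic type information better, and the metrics distinguish matching from mismatched sentences.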

Abstract

Semantic type mismatch between a noun and its context is central to coercion phenomena. This paper introduces a graph-based method to examine how lexical and contextual type information is reflected in word embeddings. We select nouns from ten semantic types, annotate corpus instances for type matching (matching vs. coercion vs. other mismatch vs. unrestricted), and construct graphs using BERT and sense-enhanced embeddings. Two metrics -- Neighbor Type Probability (NTP) and Neighbor Type Entropy (NTE) -- are proposed to analyze neighborhood type distributions. Results show that graphs constructed with sense-enhanced embeddings reflect semantic type information better, and matching and mismatch sentences can be distinguished through the proposed metrics.
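
The abstract names the two metrics but does not define them. A plausible reading, offered here purely as an assumption, is that NTP is the share of a node's graph neighbors carrying the node's own semantic type, and NTE is the Shannon entropy of the type distribution among those neighbors. The function names and the toy type labels below are hypothetical.

```python
import math
from collections import Counter

def neighbor_type_probability(node_type, neighbor_types):
    """Share of a node's neighbors that carry the node's own semantic type."""
    if not neighbor_types:
        return 0.0
    return sum(t == node_type for t in neighbor_types) / len(neighbor_types)

def neighbor_type_entropy(neighbor_types):
    """Shannon entropy (in bits) of the type distribution among the neighbors."""
    counts = Counter(neighbor_types)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Hypothetical neighborhood of one noun occurrence typed as "food".
neighbors = ["food", "food", "event", "artifact"]
print(neighbor_type_probability("food", neighbors))
print(neighbor_type_entropy(neighbors))
```

Under this reading, a matching context would tend to give high NTP and low NTE, while a coerced or mismatched context would pull in neighbors of other types and raise the entropy.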

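2605.23701 2026-05-25 cs.CL (updated version)

Metadata Predictability Is Not Evidence Dependence: An Intervention-Based Audit for Weak-Label Benchmarks

Kan Shao

Affiliations: Jinglue Technology Development (Nanjing) Co., Ltd.

AI summary: This paper studies a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks ask whether outputs are predictable from metadata, which is a different question from evidence dependence. The author pairs a metadata statistic, MPDS, with an evidence-intervention statistic, ΔEvi, shows across several datasets and readers that the two can diverge, and recommends that benchmark audits report metadata-only screening, evidence intervention, and reader-strength calibration together.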

Comments: 5 pages, 1 figure, 1 table. Accepted at ICML 2026 Workshop on Hypothesis Testing

Abstract

We study a protocol-level test for weak-label benchmarks: whether benchmark outputs change when the provided evidence is intervened on. Metadata-only shortcut checks answer a different question, namely whether outputs are predictable from metadata priors. We therefore combine a metadata statistic, the Metadata Prior Dominance Score (MPDS), with an evidence-intervention statistic, ΔEvi, measuring sensitivity to evidence identity under cross-item shuffling. Synthetic HotpotQA gives a constructed counterexample to metadata-only screening: MPDS is only moderate (0.643), yet ΔEvi is zero. Stronger-reader reruns show why calibration belongs in the test procedure: SNLI shows a calibration reversal, reconstructed HotpotQA occupies a question-dominant warning region, and FEVER is a strongly evidence-sensitive positive control across four transformers. The practical lesson is simple: benchmark audits should report metadata-only screening, evidence intervention, and reader-strength calibration together.
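
An evidence-intervention statistic of this kind can be sketched as the accuracy a reader loses when each item is paired with another item's evidence. The code below is an illustrative assumption rather than the paper's definition of ΔEvi: it measures an accuracy drop, uses a cyclic shift as a deterministic stand-in for random cross-item shuffling, and the toy items and reader functions are hypothetical.

```python
def accuracy(reader, items, evidences):
    """Fraction of items where the reader's output equals the gold label."""
    hits = sum(reader(question, ev) == gold
               for (question, _, gold), ev in zip(items, evidences))
    return hits / len(items)

def delta_evi(reader, items):
    """Accuracy with each item's own evidence minus accuracy with swapped evidence."""
    own = [ev for _, ev, _ in items]
    # Cyclic shift: every item receives a different item's evidence.
    swapped = own[1:] + own[:1]
    return accuracy(reader, items, own) - accuracy(reader, items, swapped)

# Hypothetical (question, evidence, gold) triples.
items = [("q1", "doc says yes", "yes"),
         ("q2", "doc says no", "no"),
         ("q3", "doc says maybe", "maybe")]

def evidence_reader(question, evidence):
    return evidence.split()[-1]          # answers from the evidence text

def prior_reader(question, evidence):
    return {"q1": "yes", "q2": "no", "q3": "maybe"}[question]  # ignores evidence

print(delta_evi(evidence_reader, items), delta_evi(prior_reader, items))
```

The second reader is perfectly accurate yet scores zero, which mirrors the abstract's point: good outputs alone do not show that a benchmark depends on its evidence.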

2605.22643 2026-05-25 cs.CL (updated version)

Boiling the Frog: A Multi-Turn Benchmark for Agentic Safety

Piercosma Bisconti, Matteo Prandi, Federico Pierucci, Federico Sartore, Enrico Panai, Laura Caroli, Yue Zhu, Adam Leon Smith, Luca Nannini, Marcello Galisai, Susanna Cifani, Francesco Giarrusso, Marcantonio Bracale Syrnikov, Daniele Nardi

Affiliations: Icaro Foundation; Sapienza University of Rome; Sant’Anna School of Advanced Studies; Tongji University School of Law; AIQI Consortium; Università Cattolica del Sacro Cuore; Piccadilly Labs; VU Amsterdam; Independent

AI summary: This paper introduces Boiling the Frog, a multi-turn benchmark that evaluates the safety of language models deployed as tool-using agents in corporate and office settings, with a focus on incremental attacks. Each scenario simulates multi-turn work in a persistent workspace and tests whether the final artifact state becomes unsafe once a risk-bearing request is introduced. Attack success rates vary widely across the nine-model panel, from 20.5% to 92.9%.

Abstract

Background. Traditional safety benchmarks for language models evaluate generated text: whether a model outputs toxic language, reproduces bias, or follows harmful instructions. When models are deployed as agents, the safety-relevant object shifts from what the system says to what it does within an environment, and evaluating model responses under prompting is no longer sufficient to address the safety challenges posed by artificial intelligence. Recent developments have seen the rise of benchmarks that evaluate large language models as agents. We contribute to this strand of research. Approach. We introduce Boiling the Frog, a benchmark that evaluates whether tool-using AI models deployed in corporate and office settings are susceptible to incremental attacks. Each scenario begins with benign workspace edits and later introduces a risk-bearing request. The benchmark focuses on stateful multi-turn evaluation: chains expose a persistent workspace, place the risk-bearing payload at controlled positions in the turn sequence, and score whether the resulting artifact state becomes unsafe. Scenarios are organized through a three-level operational risk taxonomy grounded in the Boiling the Frog risks, the AI Act Annex I and Annex III high-risk contexts, and the EU AI Act's Code of Practice on General-Purpose AI (GPAI). Results. Across a nine-model panel, aggregate strict attack success rate (ASR) is 44.4%. Model-level ASR ranges from 20.5% for Claude Haiku 4.5 to 92.9% for Gemini 3.1 Flash Lite, with Seed 2.0 Lite also above 80%. Average chain category-level ASR reaches 93.3% for Code of Practice loss-of-control scenarios.

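2605.19069 2026-05-25 cs.CL cs.AI (updated version)

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

Sajjad Abdoli, Ghassan Al-Sumaidaee, Clayton W. Taylor, Ahmad ElShiekh, Ahmed Rashad

Affiliations: Perle AI

AI summary: This paper evaluates automatic speech recognition (ASR) systems on code-switching speech across four language pairs that combine Egyptian Arabic, Saudi Arabic, Persian, and German with English. A two-stage selection pipeline picks 300 samples per pair, and five commercial providers are scored with word error rate (WER) and BERTScore; the two metrics agree on system rankings for the Arabic and Persian pairs but differ in how large they make the quality gaps appear. The study also exposes performance gaps hidden by aggregate averages and releases the dataset publicly.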

Abstract

Code-switching -- the natural alternation between two languages within a single utterance -- remains one of the most challenging and under-studied conditions for automatic speech recognition (ASR). We present a benchmark evaluating five commercial ASR providers across four language pairs: Egyptian Arabic-English, Saudi Arabic (Najdi/Hijazi)-English, Persian (Farsi)-English, and German-English, comprising 300 samples per pair selected by a two-stage pipeline combining heuristic filtering with a GPT-4o and Gemini 1.5 Pro ensemble scorer, reducing LLM costs by ≈91%. We evaluate on both WER and BERTScore, showing that while both metrics agree on the ordinal ranking of systems for all Arabic and Persian pairs (τ = 1.0), WER inflates the magnitude of quality gaps by approximately 3× by penalising semantically correct transliteration choices. ElevenLabs Scribe v2 achieves the lowest WER (13.2% overall) and leads on BERTScore (0.936 overall). Difficulty-stratified analysis reveals performance gaps masked by aggregate averages, and BERT embedding projections confirm semantic proximity between reference and hypothesis despite surface-level script differences. The dataset is publicly available at https://huggingface.co/datasets/Perle-ai/ASR_Code_Switch.
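
To see why WER penalises a transcript that differs from the reference only in spelling or script, it helps to recall how the metric is computed: the word-level edit distance divided by the reference length. The sketch below is a standard textbook implementation, not the benchmark's evaluation code; it assumes whitespace tokenization and a non-empty reference, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (r != h)))    # substitution or match
        prev = curr
    return prev[-1] / len(ref)

# Invented German-English example: one spelling variant counts as a full error.
print(word_error_rate("wir haben das Meeting gecancelt",
                      "wir haben das Meeting gecancelled"))
```

A single variant spelling in a five-word utterance already costs 20% WER even though the meaning is intact, which is the kind of inflation the abstract attributes to transliteration choices.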

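2605.02443 2026-05-25 cs.CL (updated version)

HalluScan: A Systematic Benchmark for Detecting and Mitigating Hallucinations in Instruction-Following LLMs

Ahmed Cherif

Affiliations: Sofrecom Tunisia; Orange Innovation

AI summary: HalluScan is a systematic benchmark framework for detecting and mitigating hallucinations in instruction-following large language models. It introduces HalluScore, a composite evaluation metric, and Adaptive Detection Routing (ADR), a routing algorithm that reduces detection cost, and it decomposes errors by type to show substantial variation across domains. NLI Verification gives the best detection performance, with an overall AUROC of 0.88.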

Comments: 38 pages, 13 figures, 10 tables. Submitted to Neural Computing and Applications

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across diverse natural language processing tasks, yet they remain susceptible to hallucinations -- generating content that is factually incorrect, unfaithful to provided context, or misaligned with user instructions. We present HalluScan, a comprehensive benchmark framework that systematically evaluates hallucination detection and mitigation across 72 configurations spanning 6 detection methods, 4 open-weight model families, and 3 diverse domains. We introduce three key contributions: (1) HalluScore, a novel composite metric that achieves a Pearson correlation of r = 0.41 with human expert judgments; (2) Adaptive Detection Routing (ADR), an intelligent routing algorithm achieving 2.0x cost reduction with only 0.1% AUROC degradation; and (3) systematic error cascade decomposition revealing substantial variation in hallucination error types across domains. Our experiments reveal that NLI Verification achieves the highest overall AUROC of 0.88, while RAV achieves the second-highest AUROC of 0.66.

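2604.17134 2026-05-25 cs.CL (updated version)

RoIt-XMASA: Multi-Domain Multilingual Sentiment Analysis Dataset for Romanian and Italian

Andrei-Marius Avram, Aureliu Valentin Antonie, Cosmin-Mircea Croitoru, Vlad Andrei Muntean, Dumitru-Clementin Cercel

Affiliations: National University of Science and Technology POLITEHNICA Bucharest

AI summary: This paper presents RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis dataset to Italian and Romanian, with 36,000 labeled reviews and 202,141 unlabeled samples. To handle cross-lingual and cross-domain shift, it proposes a multi-target adversarial training framework that uses loss reversal with meta-learned coefficients to balance sentiment discrimination against domain and language invariance. With this approach XLM-R reaches 66.23% F1, 4.64% above the baseline, and few-shot results show a trade-off between prompting and task-specific fine-tuning.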

Comments: Accepted at the International AAAI Conference on Web and Social Media (ICWSM 2026)

Abstract

We present RoIt-XMASA, a multilingual dataset that extends the Cross-lingual Multi-domain Amazon Sentiment Analysis to Italian and Romanian, comprising 36,000 labeled reviews across three domains (books, movies, and music) and 202,141 unlabeled samples. To address cross-lingual and cross-domain challenges, we propose a multi-target adversarial training framework that employs loss reversal with meta-learned coefficients to dynamically balance sentiment discrimination with domain and language invariance. XLM-R achieves an F1-score of 66.23% with our approach, outperforming the baseline by 4.64%. Few-shot evaluation shows that Llama-3.1-8B achieves 58.43% F1-score, revealing a meaningful trade-off between the efficiency of prompting-based approaches and the higher performance of task-specific fine-tuning.

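2511.10404 2026-05-25 cs.CL (updated version)

DELICATE: Diachronic Entity LInking using Classes And Temporal Evidence

Cristian Santini, Sebastian Barzaghi, Paolo Sernani, Emanuele Frontoni, Mehwish Alam

Affiliations: University of Macerata; University of Bologna; Télécom Paris, Institut Polytechnique de Paris

AI summary: This paper proposes DELICATE, a neuro-symbolic method for entity linking in historical Italian texts that combines a BERT-based encoder with contextual information from Wikidata and selects knowledge-base entities by temporal plausibility and entity type consistency. It also builds ENEIDE, a multi-domain entity linking corpus of literary and political texts from the 19th and 20th centuries. DELICATE outperforms other entity linking models on historical Italian, and its confidence scores and feature sensitivity make its results more interpretable.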

Abstract

In spite of the remarkable advancements in the field of Natural Language Processing, the task of Entity Linking (EL) remains challenging in the humanities due to complex document typologies, a lack of domain-specific datasets and models, and long-tail entities, i.e., entities under-represented in Knowledge Bases (KBs). The goal of this paper is to address these issues with two main contributions. The first contribution is DELICATE, a novel neuro-symbolic method for EL on historical Italian which combines a BERT-based encoder with contextual information from Wikidata to select appropriate KB entities using temporal plausibility and entity type consistency. The second contribution is ENEIDE, a multi-domain EL corpus in historical Italian semi-automatically extracted from two annotated editions spanning from the 19th to the 20th century and including literary and political texts. Results show that DELICATE outperforms other EL models on historical Italian, even when compared with larger architectures with billions of parameters. Moreover, further analyses reveal that DELICATE's confidence scores and feature sensitivity provide results that are more explainable and interpretable than those of purely neural methods.

2509.19858 2026-05-25 cs.CL (updated version)

Benchmarking Gaslighting Attacks Against Speech Large Language Models

Jinyang Wu, Bin Zhu, Xiandong Zou, Qiquan Zhang, Xu Fang, Pan Zhou

Affiliations: Singapore Management University; Tongyi Speech Lab; Dalian University of Technology

AI summary: As Speech Large Language Models (Speech LLMs) spread through voice-based applications, their robustness to manipulative or adversarial input becomes critical. This paper introduces gaslighting attacks, carefully crafted prompts that mislead model reasoning, and builds five manipulation strategies to test Speech LLM robustness across tasks. The five attacks lower accuracy by 24.3% on average, pointing to significant behavioral vulnerability in current speech AI systems.

Comments: 5 pages, 2 figures, 3 tables

Abstract

As Speech Large Language Models (Speech LLMs) become increasingly integrated into voice-based applications, ensuring their robustness against manipulative or adversarial input becomes critical. Although prior work has studied adversarial attacks in text-based LLMs and vision-language models, the unique cognitive and perceptual challenges of speech-based interaction remain underexplored. In contrast, speech presents inherent ambiguity, continuity, and perceptual diversity, which make adversarial attacks more difficult to detect. In this paper, we introduce gaslighting attacks, strategically crafted prompts designed to mislead, override, or distort model reasoning as a means to evaluate the vulnerability of Speech LLMs. Specifically, we construct five manipulation strategies: Anger, Cognitive Disruption, Sarcasm, Implicit, and Professional Negation, designed to test model robustness across varied tasks. It is worth noting that our framework captures both performance degradation and behavioral responses, including unsolicited apologies and refusals, to diagnose different dimensions of susceptibility. Moreover, acoustic perturbation experiments are conducted to assess multi-modal robustness. To quantify model vulnerability, comprehensive evaluation across 5 Speech and multi-modal LLMs on over 10,000 test samples from 5 diverse datasets reveals an average accuracy drop of 24.3% under the five gaslighting attacks, indicating significant behavioral vulnerability. These findings highlight the need for more resilient and trustworthy speech-based AI systems.

2412.14642 2026-05-25 cs.CL (updated version)

Speak-to-Structure: Evaluating LLMs in Open-domain Natural Language-Driven Molecule Generation

Speak-to-Structure:评估大语言模型在开放域自然语言驱动的分子生成中的表现

Jiatong Li, Junxian Li, Weida Wang, Yunqing Liu, Changmeng Zheng, Yatao Bian, Dongzhan Zhou, Xiao-yong Wei, Qing Li

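Affiliations: Hong Kong Polytechnic University; Shanghai Jiao Tong University; Shanghai AI Lab; National University of Singapore

AI summary: Large language models (LLMs) have recently shown great potential in natural language-driven molecule discovery, but existing datasets and benchmarks are mostly built on one-to-one mappings: they test whether a model can retrieve a single predefined answer and overlook its ability to generate diverse, valid molecular candidates. The authors propose S²-Bench, the first benchmark for evaluating LLMs on open-domain natural language-driven molecule generation. It is designed around one-to-many relationships and challenges models to show genuine molecular understanding and open-ended generation. The work also introduces OpenMolIns, a large-scale instruction-tuning dataset that enables Llama3.1-8B to surpass strong LLMs such as GPT-4o and Claude-3.5 on the benchmark, and a comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design.

Comments: Accepted by KDD 2026. Code and datasets are available at https://github.com/phenixace/S2-TOMG-Bench and https://huggingface.co/datasets/phenixace/S2-TOMG-Bench

Details
AI Chinese abstract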

近期,大语言模型(LLMs)在自然语言驱动的分子发现中展现出巨大潜力。然而,现有的分子-文本对齐数据集和基准主要基于一对一映射,衡量LLMs检索单一预定义答案的能力,而非其生成多样且同样有效的候选分子的创造性潜力。为填补这一关键空白,我们提出Speak-to-Structure(S^2-Bench),这是首个评估LLMs在开放域自然语言驱动分子生成中的基准。S^2-Bench专为一对多关系设计,挑战LLMs展现真正的分子理解和开放生成能力。我们的基准包括三个关键任务:分子编辑(MolEdit)、分子优化(MolOpt)和定制分子生成(MolCustom),每个任务探索分子发现的不同方面。我们还引入OpenMolIns,一个大规模指令微调数据集,使Llama3.1-8B在S^2-Bench上超越最强大的LLMs如GPT-4o和Claude-3.5。我们对31个LLMs的全面评估将焦点从简单的模式回忆转向现实的分子设计,为自然语言驱动分子发现中更强大的LLMs铺平道路。我们的代码和数据集完全可通过GitHub仓库(https://github.com/phenixace/S2-TOMG-Bench)和Huggingface数据集(https://huggingface.co/datasets/phenixace/S2-TOMG-Bench)获取。

English abstract

Recently, Large Language Models (LLMs) have demonstrated great potential in natural language-driven molecule discovery. However, existing datasets and benchmarks for molecule-text alignment are predominantly built on one-to-one mappings, measuring LLMs' ability to retrieve a single, pre-defined answer rather than their creative potential to generate diverse yet equally valid molecular candidates. To address this critical gap, we propose Speak-to-Structure (S^2-Bench), the first benchmark to evaluate LLMs in open-domain natural language-driven molecule generation. S^2-Bench is specifically designed for one-to-many relationships, challenging LLMs to exhibit genuine molecular understanding and open-ended generation capabilities. Our benchmark includes three key tasks: molecule editing (MolEdit), molecule optimization (MolOpt), and customized molecule generation (MolCustom), each probing a different aspect of molecule discovery. We also introduce OpenMolIns, a large-scale instruction tuning dataset that enables Llama3.1-8B to surpass the most powerful LLMs like GPT-4o and Claude-3.5 on S^2-Bench. Our comprehensive evaluation of 31 LLMs shifts the focus from simple pattern recall to realistic molecular design, paving the way for more capable LLMs in natural language-driven molecule discovery. Our code and datasets are fully accessible through the GitHub repository (https://github.com/phenixace/S2-TOMG-Bench) and Hugging Face Datasets (https://huggingface.co/datasets/phenixace/S2-TOMG-Bench).

2605.23668 2026-05-25 cs.CL cs.AI (updated version)

OnePred: Next-Query Prediction via Recursive Intent Memory in Multi-Turn Conversations

OnePred: 多轮对话中基于递归意图记忆的下一个查询预测

Jiangwang Chen, Bowen Zhang, Zixin Song, Jiazheng Kang, Xiao Yang, Da Zhu, Guanjun Jiang

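Affiliations: Tsinghua University; Qwen Applications Business Group of Alibaba

AI summary: This work proposes OnePred, a model that predicts the user's next query in multi-turn conversations, with the aim of making conversational systems more proactive. Its core idea is a recursively updated intent memory that tracks how user intent evolves, allowing efficient and accurate prediction without the full dialogue history. The model is trained with two-stage reinforcement learning that first teaches what to predict and then what to compress. The authors also release the NQP-Bench benchmark; experiments show that OnePred cuts per-turn token consumption by up to 22× relative to full-history inputs while exceeding all baselines in prediction quality.

Details
AI Chinese abstract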

尽管大语言模型(LLM)对话系统每天处理数百万次多轮对话,但它们本质上仍是被动的:仅在用户输入查询后才响应。迈向主动交互的关键一步是下一个查询预测,即仅根据之前的对话预测用户后续的查询。该任务的进展受到缺乏专用基准以及基本效率-质量权衡的阻碍:简单拼接完整对话历史会导致线性增长的token消耗,而截断至最新一轮则会丢弃关键的跨轮上下文。我们的关键见解是,准确预测不需要重新阅读原始历史;只需跟踪用户跨主题、未解决需求和兴趣转移的不断演变的意图轨迹即可。我们提出OnePred,它维护一个递归更新的记忆作为唯一的跨轮上下文,将每轮成本限制为与对话长度无关。我们通过两阶段强化学习流程训练模型,首先教导预测什么,然后教导压缩什么,将记忆塑造成面向预测的意图链。为了建立严格的测试平台,我们引入了NQP-Bench,涵盖三个不同的子集。实验表明,与完整历史输入相比,OnePred将每轮token消耗减少高达22倍,同时在预测质量上持续超过所有基线,在较长对话中增益更大。我们的代码可在https://github.com/ZBWpro/OnePred公开获取。

English abstract

Although large language model (LLM) conversational systems process millions of multi-turn dialogues daily, they remain fundamentally reactive: they respond only after the user types a query. A key step toward proactive interaction is next-query prediction, which anticipates the user's subsequent query based solely on the preceding dialogue. Progress on this task is hindered by the lack of dedicated benchmarks and a fundamental efficiency-quality trade-off: naively concatenating full dialogue history incurs linearly growing token consumption, while truncating to the latest turn discards crucial cross-turn context. Our key insight is that accurate prediction does not require re-reading raw history; it suffices to track the user's evolving intent trajectory across topics, unresolved needs, and interest shifts. We propose OnePred, which maintains a recursively updated memory as its sole cross-turn context, bounding the per-turn cost independently of conversation length. We train the model via a two-stage reinforcement learning pipeline that first teaches what to predict, then what to compress, shaping the memory into a prediction-oriented intent chain. To establish a rigorous testbed, we introduce NQP-Bench, spanning three diverse subsets. Experiments demonstrate that OnePred reduces per-turn token consumption by up to 22× compared to full-history inputs while consistently exceeding all baselines in prediction quality, with larger gains on longer conversations. Our code is publicly available at https://github.com/ZBWpro/OnePred.
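
The efficiency argument in this abstract can be illustrated with a toy cost model. This is a minimal sketch and not OnePred's implementation: the function names and the fixed `memory_budget` are hypothetical, and the model only counts input tokens per turn, using made-up turn lengths.

```python
def full_history_cost(turn_lengths):
    """Input tokens read at each turn when all turns so far are concatenated."""
    costs, running = [], 0
    for length in turn_lengths:
        running += length          # history grows with every turn
        costs.append(running)
    return costs


def recursive_memory_cost(turn_lengths, memory_budget):
    """Upper bound on input tokens per turn when only a fixed-size
    memory (hypothetical budget) plus the current turn is read."""
    return [memory_budget + length for length in turn_lengths]


# Toy example: four turns of 100 tokens each, memory capped at 50 tokens.
turns = [100, 100, 100, 100]
history_costs = full_history_cost(turns)
memory_costs = recursive_memory_cost(turns, memory_budget=50)
```

Under these assumptions the full-history cost grows linearly with the number of turns, while the memory-based cost stays flat, which is the property the abstract describes as bounding per-turn cost independently of conversation length.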

2605.23618 2026-05-25 cs.CL (updated version)

Benchmarking Google Embeddings 2 against Open-Source Models for Multilingual Dense Retrieval and RAG Systems

Google Embeddings 2 与开源模型在多语言稠密检索和 RAG 系统中的基准测试

Stefano Cirillo, Domenico Desiato, Giuseppe Polese, Giandomenico Solimando

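Affiliations: University of Salerno; University of Bari

AI summary: This paper compares Google Embeddings 2 (GE2) with five open-source models for multilingual dense retrieval and RAG systems. GE2 ranks first on every task but has high latency; mE5-L retains nearly the same retrieval quality on Italian at much lower latency, which suits latency-sensitive applications. The experiments also show that all models saturate at 32-token chunks, and that semantic chunking yields a measurable gain only at 16 tokens.

Comments: 9 pages, 2 figures, 5 tables. Text and evaluation code available at https://github.com/cciro94/GoogleEmbeddings2-benchmark

Details
AI Chinese abstract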

我们对 Google Embeddings (GE2) 进行了基准测试,这是一个由 Vertex-AI 托管的双编码器,具有 2048 令牌上下文和显式任务类型条件,与五个开源替代方案:BGE-M3、E5-large、Multilingual-E5-large (mE5-L)、LaBSE 和 Paraphrase-Multilingual-MPNet (mMPNet)。评估涵盖四个 BEIR 子集、一个合成意大利语 RAG 语料库、考虑三种策略下 5 种令牌大小的分块消融实验,以及在商品 CPU 硬件上的每查询延迟。GE2 在每个任务上排名第一,达到 BEIR 平均 nDCG@10 = 0.638 和 IT-RAG-Bench nDCG@10 = 0.282,但中位延迟为 231.6 毫秒,比最快的本地模型慢约 14 倍。mE5-L 在意大利语上以 31 毫秒的延迟达到与 GE2 相差 0.003 nDCG 的性能,使其成为在子 100 毫秒 SLA 下更优的选择。一个更惊人的发现涉及 LaBSE,尽管广泛部署于多语言场景,其在 BEIR 上的平均 nDCG@10 仅为 0.188,低于包括 mMPNet 在内的所有专用检索模型。分块实验表明,所有六个模型在我们的语料库上在 32 令牌分块时性能饱和,语义分块仅在 16 令牌时提供可衡量的增益。

English abstract

We benchmark Google Embeddings 2 (GE2), a Vertex-AI-hosted bi-encoder with 2,048-token context and explicit task-type conditioning, against five open-source alternatives: BGE-M3, E5-large, Multilingual-E5-large (mE5-L), LaBSE, and Paraphrase-Multilingual-MPNet (mMPNet). Evaluation covers four BEIR subsets, a synthetic Italian RAG corpus, a chunking ablation over five chunk sizes (in tokens) and three strategies, and per-query latency on commodity CPU hardware. GE2 ranks first on every task, achieving a BEIR average nDCG@10 of 0.638 and an IT-RAG-Bench nDCG@10 of 0.282, but at 231.6 ms median latency it is roughly 14× slower than the fastest local models. mE5-L reaches within 0.003 nDCG of GE2 on Italian at 31 ms, making it the preferred option when sub-100 ms SLAs matter. A more striking finding concerns LaBSE, which, despite widespread multilingual deployment, scores 0.188 average nDCG@10 on BEIR, below every dedicated retrieval model including mMPNet. Chunking experiments show that all six models saturate at 32-token chunks on our corpus, with semantic chunking providing measurable gains only at 16 tokens.
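
For readers unfamiliar with the headline metric, the following is a minimal sketch of nDCG@10. It is not the paper's evaluation code: the function names are illustrative, it uses linear gain (some implementations use 2^rel − 1 instead), and when `all_relevances` is omitted it assumes every relevant document already appears in the ranked list.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the first k graded relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, all_relevances=None, k=10):
    """nDCG@k: DCG of the system ranking divided by DCG of the ideal ranking.

    ranked_relevances: relevance labels in the order the system returned them.
    all_relevances: labels of every judged document for the query (optional).
    """
    pool = ranked_relevances if all_relevances is None else all_relevances
    ideal = dcg_at_k(sorted(pool, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / ideal if ideal > 0 else 0.0


# Toy query: the single relevant document is returned at rank 2.
score = ndcg_at_k([0, 1, 0])
```

A benchmark-level figure such as the BEIR average is then the mean of this per-query score over all queries and, in turn, over the evaluated subsets.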

2605.23605 2026-05-25 cs.LG cs.AI cs.CL (updated version)

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

DiLaDiff: 蒸馏潜在增强扩散用于语言建模

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

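Affiliations: NVIDIA

AI summary: DiLaDiff is an improved diffusion language model that targets the tension between sampling quality and generation speed in conventional diffusion language models. It introduces a continuous semantic latent space and uses an autoencoder and consistency distillation to improve generation efficiency and quality. Experiments show that DiLaDiff already outperforms the masked diffusion baseline without distillation, and that distillation further reduces the cost of latent generation.

Details
AI Chinese abstract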

扩散语言模型本质上无法捕捉解码令牌之间的相关性,导致采样质量与吞吐量之间存在严峻的权衡。为了解决这个问题,我们提出了DiLaDiff,一种掩码扩散语言模型的变体,包含三个组件:(1)具有语义能力的连续潜在空间,通过从现有掩码扩散语言模型微调的自编码器学习;(2)学习编码器分布先验的潜在扩散模型;(3)将学习到的先验蒸馏为少步潜在生成模型的一致性模型。我们表明,即使没有蒸馏,我们的潜在引导扩散模型在显著加速推理的同时也优于掩码扩散基线。一致性蒸馏进一步降低了连续扩散的计算开销,使得潜在生成的时间相对于离散解码可以忽略不计。

English abstract

Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.

2605.23597 2026-05-25 cs.CL cs.LG (updated version)

Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts

结构引导的实体解析:微调大语言模型以实现复杂语言上下文中的鲁棒姓名匹配

Shivam Chourasia, Hitesh Kapoor, Nilesh Patil

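Affiliations: Dream Sports

AI summary: This paper studies entity resolution for person names in linguistically and culturally complex settings. It proposes Structure-Guided Entity Resolution (SGER), a framework that uses two-phase curriculum fine-tuning to strengthen an LLM's understanding of name structure and semantics and thereby improve matching accuracy. On Indian identity data, a real-world setting with high linguistic diversity and noise, the method reaches 99.02% accuracy and has been deployed in production, supporting its effectiveness and robustness in large-scale multilingual systems.

Comments: Accepted to ACL 2026. 8 pages, 1 figure, 2 tables

Details
AI Chinese abstract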

跨异构记录匹配人名是实体解析的核心挑战,尤其是在语言和文化复杂的环境中。命名惯例的差异、跨文字的不一致音译以及频繁的数据录入错误使得统一用户身份变得困难,而这对于了解你的客户(KYC)合规至关重要。虽然大语言模型在理解自然语言方面显示出潜力,但它们往往难以处理此类特定领域设置中存在的结构化歧义。本文介绍了结构引导实体解析(SGER),一种新颖的框架,通过两阶段课程微调大语言模型。模型首先被训练解析人名的语法和语义结构,然后针对二元实体匹配的下游任务进行优化。我们在印度身份数据的挑战性背景下评估SGER,这是全球语言最多样化和噪声最大的环境之一。SGER在包含50,000个真实世界对的保留测试集上达到了99.02%的准确率和0.994的F1分数,优于GPT-4o少样本提示和单阶段微调基线。该系统已完全部署在全球最大的梦幻体育平台Dream11的生产环境中,服务超过2.5亿用户。我们的结果表明,课程引导的训练能够在现实世界的多语言系统中实现大规模、高精度的实体解析。

English abstract

Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world's largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

2605.23497 2026-05-25 cs.CL (updated version)

Asking For An Old Friend: Diagnosing and Mitigating Temporal Failure Modes in LLM-based Statutory Question Answering

询问老朋友:诊断和缓解基于LLM的法定问答中的时间故障模式

Max Prior, Andreas Schultz, Matthias Grabmair

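Affiliations: Technical University of Munich

AI summary: This study examines two failure modes of LLM-based legal question answering on time-sensitive statutes: staleness after a statute is amended, and a bias toward newer provisions. The authors build a benchmark of 312 expert-validated German statutory QA pairs and evaluate several LLMs under different inference settings. Retrieval-augmented approaches substantially improve performance with respect to temporal validity, whereas relying on web search alone is unstable and shows recency bias. The study stresses that legal QA must treat temporal validity as a hard constraint.

Details
AI Chinese abstract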

大型语言模型越来越多地用于法律研究,但其固定的训练截止日期和对静态参数知识的依赖与成文法的演变性质相矛盾。我们研究了两种时间故障模式:截止后过时(模型在立法修正后应用被取代的规则)和近因偏差(即使历史版本支配事实模式,模型也偏好较新的规定)。为此,我们提出了一个包含312个专家验证、时间敏感的德国法定问答对的基准,涵盖三个类别:截止后修正问题、修正前问题和多条款修正前问题。我们评估了来自OpenAI、Anthropic和DeepSeek的五个LLM,在四种推理设置下:普通、网络搜索和两种检索增强变体(通过事实日期提取和版本过滤强制执行时间有效性)。使用经过人类专家评分验证的LLM作为评判,我们发现普通设置在截止后设置中性能严重下降。两种RAG方法在所有问题类型上均显著提高了性能,而网络搜索则产生不稳定的收益,并在历史锚定任务上表现出明显的近因偏差。我们的结果表明,可靠的法律问答需要将时间有效性视为硬约束。

English abstract

Large language models are increasingly used for legal research, yet their fixed training cutoffs and reliance on static parametric knowledge are at odds with the evolving nature of statutory law. We study two temporal failure modes: post-cutoff staleness, where models apply superseded rules after legislative amendments, and recency bias, where models prefer newer provisions even when a historical version governs the fact pattern. To this end, we present a benchmark of 312 expert-validated, time-sensitive German statutory QA pairs spanning three categories: Post-Cutoff Amendment Questions, Pre-Amendment Questions, and Multi-Provision Pre-Amendment Questions. We evaluate five LLMs from OpenAI, Anthropic, and DeepSeek under four inference settings: Vanilla, Web-search, and two retrieval-augmented variants that enforce temporal validity via fact-date extraction and version filtering. Using an LLM-as-a-judge validated against human expert ratings, we find severe degradation in the Vanilla post-cutoff setting. Both RAG approaches substantially improve performance across all question types, while web search yields unstable gains and exhibits a marked recency bias on historically anchored tasks. Our results indicate that reliable legal QA requires treating temporal validity as a hard constraint.
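
The idea of enforcing temporal validity through version filtering can be sketched as follows. This is an illustrative assumption, not the authors' pipeline: `ProvisionVersion`, `version_in_force`, and the validity-interval fields are hypothetical names, the provision identifier and dates are made up, and the fact date is assumed to have been extracted from the question already.

```python
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional


@dataclass(frozen=True)
class ProvisionVersion:
    provision_id: str          # hypothetical identifier for a statutory provision
    text: str
    valid_from: date           # first day this wording is in force
    valid_to: Optional[date]   # first day it is no longer in force; None = current


def version_in_force(versions: Iterable[ProvisionVersion],
                     provision_id: str,
                     fact_date: date) -> Optional[ProvisionVersion]:
    """Return the version of a provision that governed on fact_date, if any."""
    for v in versions:
        if v.provision_id != provision_id:
            continue
        if v.valid_from <= fact_date and (v.valid_to is None or fact_date < v.valid_to):
            return v
    return None


# Made-up provision with one amendment taking effect on 2022-01-01.
corpus = [
    ProvisionVersion("example-1", "old wording", date(2015, 1, 1), date(2022, 1, 1)),
    ProvisionVersion("example-1", "new wording", date(2022, 1, 1), None),
]
historical = version_in_force(corpus, "example-1", date(2019, 6, 30))
current = version_in_force(corpus, "example-1", date(2023, 3, 1))
```

Filtering on the fact date rather than on the query date is what keeps a retriever from handing the model the newest wording when a historical version governs, which is the recency-bias failure the abstract describes.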

2605.23420 2026-05-25 cs.CL (updated version)

Naturalistic measure of social norms alignment

社会规范一致性的自然主义度量

Yevhen Kostiuk, Kenneth Enevoldsen, Peter Bjerregaard Vahlstrup, Márton Kardos, Kristoffer Nielbo

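Affiliations: Aarhus University

AI summary: This study addresses naturalistic measurement of social norms alignment. Existing approaches mostly rely on artificial closed-form evaluations; the paper instead proposes a framework based on free-form solution matching that measures how well the responses of different parties (such as humans or LLMs) to a social dilemma agree. It introduces two evaluation metrics and builds a Danish dataset of 3,000 non-trivial social dilemmas, each with reference solutions derived from three culturally grounded panelists. Experiments show that the metrics produce consistent model rankings and reveal variation in agreement across dilemma types, with higher agreement on topics such as neighbor conflicts and shared living situations.

Details
AI Chinese abstract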

社会规范反映了对可接受行为的共同期望。测量社会规范一致性仍然具有挑战性,现有方法通常依赖于人为的封闭式评估,如多项选择问卷或衡量与预定义陈述的一致性。在本工作中,社会规范一致性指的是衡量解决方案在应对社会问题或困境时的一致性。我们提出了一个框架,通过解决方案匹配在自然主义、自由形式的设置中测量社会规范一致性。该框架使我们能够衡量任意两个困境响应之间的一致性,例如LLM与人类、LLM与LLM或人类与人类之间。我们引入了两个指标:陈述一致性和显式一致性准确性,并构建了一个包含3000个非平凡社会困境的丹麦语数据集。所有困境都分配了来自三位小组成员的参考解决方案,他们作为文化背景法官。我们在类似于自然用户模型对话的交互设置中评估了几个LLM和人类响应的一致性。我们的结果表明,所提出的指标产生一致的模型排名,并揭示了不同类型困境之间一致性的变化,其中在邻里冲突和共享生活情境等主题上观察到更高的一致性。总体而言,我们的工作引入了一个数据集和评估框架,用于研究自然主义开放式对话中基于文化背景的社会推理。

English abstract

Social norms reflect shared expectations about acceptable behavior. Measuring social norms alignment remains challenging, with existing approaches typically relying on artificial closed-form evaluations such as multiple-choice questionnaires or measuring agreement with predefined statements. In this work, social norms alignment refers to the degree of agreement between solutions to a social problem or dilemma. We propose a framework for measuring social norms alignment in naturalistic, free-form settings through solution matching. The framework enables us to measure alignment between any two dilemma responses, e.g., LLM to human, LLM to LLM, or human to human. We introduce two metrics, stated and explicit agreement accuracy, and construct a dataset of 3k non-trivial social dilemmas in Danish. All dilemmas are assigned reference solutions derived from three panelists, who serve as culturally grounded judges. We evaluate the agreement of several LLMs and human responses in an interaction setup that resembles natural user-model conversations. Our results show that the proposed metrics produce consistent model rankings and reveal variation in agreement across different types of dilemmas, with higher agreement observed for topics such as neighbor conflicts and shared living situations. Overall, our work introduces a dataset and evaluation framework for studying culturally grounded social reasoning in naturalistic open-ended conversations.

2605.23416 2026-05-25 cs.CL cs.SD (updated version)

Articulatory strategy as a source of variation in acoustic vowel dynamics

发音策略作为声学元音动态变异的一个来源

Patrycja Strycharczuk, Justin J. H. Lo, Sam Kirkham

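Affiliations: Linguistics and English Language, University of Manchester, United Kingdom; Linguistics and English Language, Lancaster University, United Kingdom

AI summary: This study investigates how articulatory strategy affects acoustic vowel dynamics, linking individual articulatory habits to the shape of formant transitions. Analyzing ultrasound tongue imaging data from 36 speakers of Northern-Anglo English, it finds that tongue shape in the vowel /i/ is a significant predictor of formant dynamics in diphthongs with a palatal offglide. Greater displacement of the tongue root and dorsum leads to earlier and steeper formant transitions, offering a new perspective on individual differences in speech.

Details
Journal ref: Journal of the Acoustical Society of America (2026) 159(5): 4068-4078
AI Chinese abstract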

声学元音动态具有一些说话者识别特征,这些特征被归因于发音策略的个体特性:共振峰过渡具有特定形状,因为说话者使用特定且熟练的动作移动发音器官。然而,现有证据很少表明不同的发音策略会系统性地影响共振峰动态。本研究证实了二者之间的联系。使用来自36位北盎格鲁英语说话者的超声舌成像数据,识别出腭元音/i/产生的不同发音策略。发现/i/中的舌形是腭滑音双元音中共振峰动态的重要预测因子。观察到的关系可以通过声道形状调节的发音运动特征来解释。舌根和/或舌背的更大发音位移会产生腭元音中与平均舌形的更大偏差,并且还需要更高的发音速度,导致相对更早且更陡的共振峰过渡。结果通过阐明发音补偿的规律性和个体性方面,有助于对言语个体性的概念理解。

English abstract

Acoustic vowel dynamics have some speaker-identifying characteristics, which have been ascribed to individual properties of articulatory strategies: formant transitions have a particular shape because speakers move their articulators, using specific and practised movements. However, there is little existing evidence that different articulatory strategies systematically affect formant dynamics. The present study corroborates the link between the two. Ultrasound tongue imaging data from 36 speakers of Northern-Anglo English are used to identify distinct articulatory strategies for the production of palatal vowel /i/. Tongue shape in /i/ is found to be a significant predictor of formant dynamics in diphthongs with a palatal offglide. The observed relationships can be explained by the characteristics of articulatory movement conditioned by vocal tract shape. Greater articulatory displacement of tongue root and/or dorsum produces greater distortion from the mean tongue shape in palatal vowels, and it also requires higher articulatory velocities, resulting in relatively earlier and steeper formant transitions. The results contribute to the conceptual understanding of individuality in speech, by illuminating the regularising and individual aspects of articulatory compensation.

2605.23412 2026-05-25 cs.CL (updated version)

EquiSumm: A Gender Bias-Aware Framework for Inclusive Tweet Summarization

EquiSumm:面向包容性推文摘要的性别偏见感知框架

Chaitanya Wanjari, Jessica Kamal, Riddhi Jain, Samruddhi Kurhe, Roshni Chakraborty

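Affiliations: ABV IIITM Gwalior

AI summary: During news events, social media platforms such as Twitter provide a channel for large-scale opinion sharing, but manually processing the volume of content to extract key viewpoints is impractical. Automatic summarization techniques have been proposed for this purpose, but most do not consider demographic fairness, gender bias in particular. This paper proposes EquiSumm, a gender bias-aware framework for inclusive tweet summarization that takes the gender aspect of shared opinions into account to produce fairer summaries; experiments on two major datasets indicate that it performs better than existing methods.

Comments: Accepted at AI for Social Good Workshop, Pattern Recognition and Machine Intelligence (PReMI 2025), IIT Delhi. 6 pages, 2 figures

Details
AI Chinese abstract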

虽然Twitter等社交媒体平台为新闻事件期间的大规模意见分享提供了媒介,但个人或媒体机构手动处理大量内容以识别关键观点是不可能的。为了解决这个问题,已经提出了几种自动摘要技术,将大量推文压缩成简洁且信息丰富的摘要。然而,这些算法没有明确考虑人口统计公平性。现有的若干研究工作已经开发了自动摘要方法,可以提供社交媒体平台上与新闻事件相关的关键方面和主要意见的整体概述。然而,这些方法没有明确考虑不同形式的人口统计代表性,例如性别,这可能导致有偏见的摘要表示。在本文中,我们提出了EquiSumm,它考虑了共享意见的性别方面来生成摘要,我们在两个主要数据集上的实验分析表明了相对于现有研究工作的性能有效性。

English abstract

While social media platforms such as Twitter provide a medium for large-scale opinion sharing during news events, it is impossible for individuals or media agencies to manually process the vast volume of content to identify key viewpoints. To address this, several automatic summarization techniques have been proposed that condense large collections of tweets into concise, informative summaries and give a holistic overview of the key aspects and major opinions shared about a news event. However, these approaches do not explicitly consider demographic fairness or different forms of demographic representation, such as gender, which can lead to biased summaries. In this paper, we propose EquiSumm, which considers the gender aspect of the shared opinions when generating a summary. Our experimental analysis on two major datasets indicates its effectiveness relative to existing research works.

2605.23384 2026-05-25 cs.CL cs.AI (updated version)

Metacognition as Reward: Reinforcing LLM Reasoning via Knowledge and Regulation Signals

元认知作为奖励:通过知识和调节信号强化LLM推理

Sirui Chen, Lei Xu, Yuying Zhao, Yutian Chen, Yu Wang, Beier Zhu, Hanwang Zhang, Shengjie Zhao, Chaochao Lu

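Affiliations: Tongji University; Shanghai AI Laboratory; Nanyang Technological University; University of Science and Technology of China; EPFL; Wuhan University

AI summary: This paper proposes MaR, a metacognition-based reinforcement learning framework for improving LLM reasoning. MaR supplies reward signals along two dimensions: metacognitive knowledge, which identifies task-relevant information, and metacognitive regulation, which plans and adjusts the reasoning process, going beyond reward designs that depend only on the final answer. Experiments show that MaR consistently improves model performance across many benchmarks and surpasses stronger models on some of them.

Details
AI Chinese abstract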

最近的强化学习方法显著提高了LLM的推理能力。现有的奖励设计主要遵循两种范式:(1) 基于可验证奖励的强化学习(RLVR)从可执行检查或真实答案中获取结果信号,但对中间推理行为的指导有限。(2) 基于评分标准的奖励(RaR)通过使用自然语言评分标准来评估推理质量和任务合规性,超越了最终答案检查,但通常需要实例特定的评分标准和大量设计工作。为解决这些问题,我们引入了元认知奖励(MaR),一种受元认知启发的RL框架,通过两个通用过程维度指导LLM推理:i) 元认知知识,无需手工制作的实例特定评分标准即可识别任务相关信息;ii) 元认知调节,规划和调整推理过程,以提供超越最终答案结果的奖励指导。MaR将模型轨迹分解为显式的元认知组件,并通过任务知识覆盖度、调节保真度和最终答案正确性的轨迹级奖励进行优化。通过这种方式,MaR将奖励反馈扩展到推理轨迹,同时将奖励信号锚定在通用的元认知维度上。在22个基准上的实验表明,MaR持续提升模型性能,相比基础模型最高提升7.7%,相比原始DAPO最高提升11.0%。值得注意的是,Qwen3.5-9B + MaR缩小了与前沿模型的差距,在整体平均上超越GPT-OSS-120B,并在多个单独基准上超越更强模型。过程级分析进一步显示推理过程质量显著提升。MaR还能泛化到域外数据集,MaR训练的模型在平均性能上优于对应的基础模型。

English abstract

Recent RL methods have substantially improved the reasoning abilities of LLMs. Existing reward designs mainly follow two paradigms: (1) Reinforcement learning with verifiable rewards (RLVR) derives outcome signals from executable checks or ground-truth answers, but provides limited guidance for intermediate reasoning behaviors. (2) Rubrics-as-reward (RaR) goes beyond final-answer checking by using natural-language rubrics to assess reasoning quality and task compliance, but often requires instance-specific rubrics and substantial design effort. To address these issues, we introduce Metacognition-as-Reward (MaR), a metacognition-inspired RL framework that guides LLM reasoning through two general process dimensions: i) metacognitive knowledge, which identifies task-relevant information without hand-crafted instance-specific rubrics, and ii) metacognitive regulation, which plans and adjusts the reasoning process to provide reward guidance beyond final-answer outcomes. MaR scaffolds model rollouts into explicit metacognitive components and optimizes them with a trajectory-level reward over task knowledge coverage, regulation fidelity, and final-answer correctness. In this way, MaR extends reward feedback to reasoning trajectories while grounding the reward signals in general metacognitive dimensions. Experiments on 22 benchmarks show that MaR consistently improves model performance, achieving up to a 7.7% gain over the base model and up to an 11.0% gain over vanilla DAPO. Notably, Qwen3.5-9B + MaR narrows the gap to frontier models, surpassing GPT-OSS-120B on overall average and outperforming stronger models on several individual benchmarks. Process-level analysis further shows substantial improvements in reasoning process quality. MaR also generalizes to out-of-domain datasets, where MaR-trained models improve over their corresponding base models on average.

2605.23382 2026-05-25 cs.CL (updated version)

From Correctness to Preference: A Framework for Personalized Agentic Reinforcement Learning

从正确性到偏好:个性化智能体强化学习框架

Ranxu Zhang, Zeyang Li, Jiacheng Huang, Rui Zhang, Xiaozhou Xu, Sun Zhe, Yanyong Zhang, Chao Wang

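Affiliations: University of Science and Technology of China; Alibaba Group

AI summary: This paper proposes a reinforcement learning framework for personalized agents, addressing the fact that in real-world settings different users may need different planning and tool-use strategies. At its core, it decouples generic task rewards from personalized preference rewards and uses user-specific anchors to stabilize learning. Combined with a two-stage preference-disentangled reward model and a preference-aligned skill evolution graph memory, it forms a closed loop of preference identification, policy optimization, and skill accumulation. Experiments show that the framework outperforms existing reinforcement learning and memory methods on several benchmarks.

Comments: 34 pages, 7 figures, Under Review

Details
AI Chinese abstract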

智能体强化学习(Agentic RL)在具有明确成功信号的任务中取得了显著进展。然而,许多现实世界的智能体应用需要用户条件行为:同一查询可能在不同用户之间需要不同的规划策略和工具使用决策。这种设置带来了关键挑战:通用奖励无法捕捉异构用户偏好,观察到的行为与从众效应纠缠在一起,扁平记忆无法支持个性化技能检索。为此,我们提出了一个统一的个性化智能体RL框架,将个性化嵌入训练时优化。其核心是个性化锚点奖励解耦策略优化(PARPO),它将通用任务质量奖励与个性化偏好奖励解耦,并使用用户特定锚点在异构奖励尺度下稳定学习。我们进一步引入了两阶段偏好解耦奖励模型和偏好对齐技能演化图记忆(PSGM),用于个性化监督和偏好对齐技能检索。它们共同构成了偏好识别、策略优化和结构化技能积累的闭环。在ETAPP、ETAPP-Hard和SJAgent上的实验表明,我们的框架始终优于强记忆和RL基线。代码和数据包含在补充材料中。

English abstract

Agentic reinforcement learning (Agentic RL) has achieved strong progress in tasks with clear success signals. However, many real-world agent applications require user-conditioned behavior: the same query may call for different planning strategies and tool-use decisions across users. This setting raises key challenges: generic rewards cannot capture heterogeneous user preferences, observed behaviors are entangled with conformity effects, and flat memories cannot support personalized skill retrieval. To this end, we propose a unified personalized Agentic RL framework that embeds personalization into training-time optimization. At its core is Personalized Anchor Reward-Decoupled Policy Optimization (PARPO), which decouples generic task-quality rewards from personalized preference rewards and uses user-specific anchors to stabilize learning under heterogeneous reward scales. We further introduce a two-stage preference-disentangled reward model and Preference-Aligned Skill Evolution Graph Memory (PSGM) for personalized supervision and preference-aligned skill retrieval. Together, they form a closed loop of preference identification, policy optimization, and structured skill accumulation. Experiments on ETAPP, ETAPP-Hard, and SJAgent show that our framework consistently outperforms strong memory and RL baselines. Code and data are included in the supplementary materials.

2605.23332 2026-05-25 cs.CL (updated version)

Cultural Adaptation in Large Language Models for Political Discourse

大型语言模型在政治话语中的文化适应

Wajdi Zaghouani

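Affiliations: Northwestern University in Qatar

AI summary: This paper examines the cultural adaptation needed when deploying large language models for political discourse analysis. It argues that current models, shaped by English-dominant data and limited multilingual coverage, adapt poorly to different cultural and institutional contexts and so produce systematic errors. The paper formalizes cultural adaptation at the translation, discourse, and ontology levels, proposes an evaluation matrix grounded in cultural fidelity, calibration, and democratic safety, and outlines methods such as participatory dataset development and culturally aware transfer learning that make cultural adaptation measurable, in support of the democratic legitimacy of political NLP.

Details
AI Chinese abstract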

大型语言模型融入政治话语分析为比较研究、政策分析和公民技术创造了新机遇,同时也为民主问责带来了实质性风险。本文认为,文化适应是在不同语言和制度背景下将大型语言模型可靠地部署于政治传播的先决条件。当前系统仍受英语主导数据、不均衡的多语言覆盖以及基于狭窄范围的政治制度和话语惯例的假设所影响,导致跨文化应用时产生系统性错误。我们在翻译、话语和本体层面形式化文化适应,识别政治自然语言处理中反复出现的文化失败模式,并提出一个基于文化保真度、校准度和民主安全性的操作性评估矩阵。基于政治文本分析、社会技术审计和跨文化语用学,我们概述了方法论路径,包括参与式数据集开发、文化感知迁移学习以及使文化适应可经验测量的基准设计。最后,我们阐明了适应性政治自然语言处理能够支持民主合法性的治理约束和适用范围条件。

English abstract

The integration of large language models into political discourse analysis creates new opportunities for comparative research, policy analysis, and civic technology, while introducing material risks for democratic accountability. This paper argues that cultural adaptation is a prerequisite for trustworthy deployment of large language models in political communication across diverse linguistic and institutional contexts. Current systems remain shaped by English-dominant data, uneven multilingual coverage, and assumptions grounded in a narrow range of political institutions and discourse conventions, producing systematic errors when applied across cultures. We formalize cultural adaptation across translation, discourse, and ontology levels, identify recurring cultural failure modes in political NLP, and propose an operational evaluation matrix grounded in cultural fidelity, calibration, and democratic safety. Building on political text analysis, sociotechnical auditing, and cross-cultural pragmatics, we outline methodological pathways including participatory dataset development, culturally aware transfer learning, and benchmark design that makes cultural adaptation empirically measurable. We conclude by clarifying governance constraints and scope conditions under which culturally adaptive political NLP can support democratic legitimacy.

2605.23328 2026-05-25 cs.CL (updated version)

Emotion Recognition in Sign Language Conversation

手语对话中的情感识别

Yusong Wang, Keyu Mao, Takao Obi, Minghao Shao, Kotaro Funakoshi

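Affiliations: Institute of Science Tokyo; New York University

AI summary: This paper studies emotion recognition in sign language conversation. Because existing sign language emotion datasets focus on isolated sentences and lack conversational context, it introduces the task of emotion recognition in conversation (ERC) for sign language and builds the eJSL Dialog dataset of 1,920 video samples. A systematic evaluation of a range of models on this dataset reveals a performance gap when generic multimodal conversational emotion recognition models are applied to sign language, underscoring the need for context-aware visual extractors specific to sign language and for larger conversational datasets to support large-scale pre-training.

Details
AI Chinese abstract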

对话中的情感识别是情感计算的核心组成部分,而当前手语情感数据集资源主要关注孤立句子,缺乏对话上下文。仅在这些孤立话语上训练的模型在现实场景中表现下降,因为它们无法利用历史对话流。为了解决这一结构性限制,我们将ERC任务引入手语视频分析,并提出了eJSL Dialog数据集。该数据集使用STUDIES语料库的脚本构建,包含1,920个视频样本,组织成480个独特的对话。我们使用从孤立视觉网络到多模态对话架构的模型对该数据集进行了系统基准测试。结果揭示了将通用多模态对话情感识别模型应用于手语时存在的领域差距。这些发现表明,明确需要针对手语的上下文感知视觉提取器,并指出扩大对话数据集规模以支持大规模预训练是未来研究的必要下一步。

English abstract

Emotion Recognition in Conversation (ERC) is a core component of affective computing, but current sign language emotion datasets primarily focus on isolated sentences and lack conversational context. Models trained exclusively on these isolated utterances show degraded performance in real-world scenarios because they cannot use the historical dialogue flow. To address this structural limitation, we introduce the ERC task to sign language video analysis and propose the eJSL Dialog dataset. Constructed using the scripts from the STUDIES corpus, the dataset contains 1,920 video samples organized into 480 unique dialogues. We conduct systematic benchmarking on this dataset using models ranging from isolated visual networks to multimodal conversational architectures. The results reveal a domain gap when applying generic multimodal conversational emotion recognition models to sign language. These findings demonstrate the explicit need for context-aware visual extractors specific to sign language and indicate that expanding the scale of conversational datasets to support large-scale pre-training is a necessary next step for future research.

2605.23326 2026-05-25 cs.CL (updated version)

ClimateChat-300K: A Multi-Modal Facebook Dataset for Understanding Diverse Perspectives in Climate Communication

ClimateChat-300K:一个用于理解气候传播中不同观点的多模态Facebook数据集

Wajdi Zaghouani, Md. Rafiul Biswas, Mabrouka Bessghaier, Shimaa Ibrahim, George Mikros

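Affiliations: Northwestern University in Qatar; Hamad Bin Khalifa University

AI summary: This paper presents ClimateChat-300K, a multi-modal dataset of 299,329 public Facebook posts about climate change from May 2020 to May 2024. The dataset has 41 metadata features, including content, engagement metrics, and page attributes, and supports comprehensive analysis of public discourse in climate communication. Using topic modeling and sentiment analysis, the study identifies ten main themes across five domains, shows how emotional tone, post format, and page identity affect audience engagement, and traces how online discussion evolved around major events such as international climate summits and the COVID-19 pandemic. The dataset is an open resource for interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse.

Details
AI Chinese abstract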

我们提出了ClimateChat-300K,这是一个大规模数据集,包含通过CrowdTangle平台收集的2020年5月至2024年5月间关于气候变化的299,329条Facebook公开帖子。该数据集包含41个元数据特征,包括帖子内容、参与度指标和页面属性,覆盖来自超过26,000个全球页面的材料。每条帖子包含丰富的上下文信息,如语言、时间戳、页面类别和互动次数,从而能够对气候传播中的公共话语进行全面分析。通过主题建模和情感分析,我们识别出十个主要主题,这些主题被归为五个领域:政策、行动主义、合作、科学和保护。结果显示,情感基调、帖子格式和页面身份强烈影响受众参与度,视觉丰富且情感强烈的内容获得最高水平的互动。该数据集还展示了在线讨论如何响应重大事件(如国际气候峰会和COVID-19疫情时期)而演变。ClimateChat-300K为关于极化、错误信息和数字气候话语动态的可重复跨学科研究提供了开放资源。通过发布该数据集,我们旨在支持透明、数据驱动的研究,并促进对公众如何随时间、地理和制度背景参与气候问题的更深入理解。

English abstract

We present ClimateChat-300K, a large-scale dataset of 299,329 public Facebook posts about climate change collected between May 2020 and May 2024 through the CrowdTangle platform. The dataset contains 41 metadata features including post content, engagement metrics, and page attributes, covering material from more than 26,000 global pages. Each post includes rich contextual information such as language, timestamp, page category, and interaction counts, enabling comprehensive analyses of public discourse around climate communication. Using topic modeling and sentiment analysis, we identify ten main themes grouped into five domains: policy, activism, cooperation, science, and conservation. The results reveal that emotional tone, post format, and page identity strongly influence audience engagement, with visually rich and emotionally charged content receiving the highest levels of interaction. The dataset also demonstrates how online discussions evolved in response to major events such as international climate summits and the COVID-19 pandemic period. ClimateChat-300K provides an open resource for reproducible and interdisciplinary research on polarization, misinformation, and the dynamics of digital climate discourse. By releasing this dataset, we aim to support transparent, data-driven research and contribute to a deeper understanding of how public engagement with climate issues develops across time, geography, and institutional contexts.

2605.23325 2026-05-25 cs.CL 版本更新

AraHopeCorpus: Annotation Guidelines and Dataset for Hope Speech in Arabic Social Media Crisis Discourse

AraHopeCorpus:阿拉伯语社交媒体危机话语中的希望言语标注指南与数据集

Esra'a Sharqawi, Wajdi Zaghouani

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) Northwestern University in Qatar(卡塔尔西北大学)

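AI summary: This paper introduces AraHopeCorpus, the first annotated dataset for studying hope speech in Arabic social media crisis discourse, built from ten thousand YouTube comments about the war on Gaza posted between 2023 and 2024. A detailed annotation framework sorts comments into three classes: hope speech, no hope speech, and neutral or unclear content. More than 64 percent of comments express hope, mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. The dataset is a resource for studying constructive discourse, crisis communication, and resilience in Arabic social media.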

详情
AI中文摘要

社交媒体已成为武装冲突期间塑造公众叙事的关键舞台,为有害和建设性交流提供了空间。虽然仇恨言论和虚假信息已被广泛研究,但促进韧性、团结和乐观的表达仍未被充分探索,尤其是在阿拉伯语语境中。本文介绍了AraHopeCorpus,这是首个从2023年至2024年与加沙战争相关的一万条YouTube评论中收集的阿拉伯语希望言语标注数据集。使用详细的标注框架,评论被分为三类:希望言语、无希望言语和中性或模糊话语。数据集显示,希望语言占主导地位,占所有评论的64%以上。这些希望表达主要表现为宗教鼓励、集体团结以及对忍耐和正义的乐观。无希望言语约占13%,反映了绝望和幻灭,而其余评论包含中性或混合内容。标注者间一致性达到显著水平(Cohen's Kappa = 0.71),尽管方言差异、讽刺和隐含意义带来了标注挑战。人类标注者与ChatGPT之间的比较分析表明,大型语言模型可以支持标注,但在处理方言和文化嵌入表达方面仍存在局限性。AraHopeCorpus将在开放和非商业许可下发布用于研究目的。它为研究建设性数字话语提供了宝贵资源,有助于进一步研究阿拉伯语社交媒体中的希望言语检测、危机沟通和韧性。

英文摘要

Social media has become a crucial arena for shaping public narratives during armed conflicts, providing space for both harmful and constructive communication. While hate speech and misinformation have been widely studied, expressions that promote resilience, solidarity, and optimism remain underexplored, particularly in Arabic contexts. This paper introduces AraHopeCorpus, the first annotated dataset of Arabic hope speech collected from ten thousand YouTube comments related to the war on Gaza between 2023 and 2024. Using a detailed annotation framework, comments were classified into three categories: hope speech, no hope speech, and neutral or unclear discourse. The dataset shows that hopeful language dominates, accounting for more than 64 percent of all comments. These expressions of hope appear mainly as religious encouragement, collective solidarity, and optimism for endurance and justice. No hope speech, representing about 13 percent, reflects despair and disillusionment, while the rest of the comments contain neutral or mixed content. Inter-annotator agreement reached substantial levels (Cohen's kappa = 0.71), though dialectal variation, sarcasm, and implicit meaning posed annotation challenges. A comparative analysis between human annotators and ChatGPT revealed that large language models can support annotation but remain limited in handling dialectal and culturally embedded expressions. AraHopeCorpus will be released for research purposes under an open, non-commercial license. It provides a valuable resource for studying constructive digital discourse, enabling further research on hope speech detection, crisis communication, and resilience in Arabic social media.
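The agreement statistic quoted in the abstract can be illustrated with a minimal sketch of Cohen's kappa for two annotators. The label names and the toy annotations below are hypothetical and are not drawn from AraHopeCorpus; the function is a plain textbook implementation, not the authors' evaluation code.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (p_o - p_e) / (1.0 - p_e)

# Hypothetical toy annotations (assumption, for illustration only).
ann_1 = ["hope", "hope", "no_hope", "neutral", "hope", "neutral"]
ann_2 = ["hope", "hope", "no_hope", "hope", "hope", "neutral"]
kappa = cohens_kappa(ann_1, ann_2)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters here because one class (hope speech) dominates the label distribution.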

2605.23315 2026-05-25 cs.CL cs.AI 版本更新

Convergence Without Understanding: When Language Models Agree on Representations but Disagree on Reasoning

没有理解的趋同:当语言模型在表示上一致但在推理上分歧时

Muhammad Usama, Dong Eui Chang

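AI summary: This study asks whether the convergence of internal representations seen in large language models trained under different objectives and architectures also extends to their reasoning processes. Analyzing 16 language models from 8 families on 800 reasoning problems, it finds that models converge at the representation level but diverge in reasoning strategy, which suggests that representational convergence reflects shared input-processing constraints rather than a shared approach to reasoning. The finding bears on ensemble design, interpretability transfer, and the evaluation of model similarity.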

详情
AI中文摘要

在不同目标和架构下训练的大型语言模型已被证明会发展出越来越相似的内部表示,这一观察被形式化为柏拉图式表示假说。这种表示趋同是否延伸到对共享表示进行操作的推理过程仍未得到检验。我们在800个涵盖数学、科学、常识和真实性的推理问题上,评估了来自8个家族(1.5B到72B参数)的16个语言模型的表示相似性,并按问题难度、计算阶段和因果相关性进行分层。我们的分析揭示了三种分离:难度反转,模型在它们共同失败的问题上趋同更多(中心核对齐[CKA] = 0.897),而在它们解决的问题上趋同较少(CKA = 0.830);生成差距,决策前表示对齐(CKA = 0.875),而决策后表示分歧(CKA = 0.274);以及附带正确性,共享信息可在模型间解码(66%的迁移准确率),但对预测的因果影响极小(在不同消融协议下翻转率为1.5%到5.5%)。这些结果表明,语言模型中的表示趋同反映了共享的输入处理约束而非共享的推理策略,对集成设计、可解释性迁移和模型相似性评估有直接影响。代码可在 https://github.com/Usama1002/convergence-without-understanding 获取。

英文摘要

Large language models trained under diverse objectives and architectures have been shown to develop increasingly similar internal representations, an observation formalized as the Platonic Representation Hypothesis. Whether this representational convergence extends to the reasoning processes that operate over shared representations remains untested. We evaluate representational similarity across 16 language models from 8 families (1.5B to 72B parameters) on 800 reasoning problems spanning mathematics, science, commonsense, and truthfulness, stratifying by problem difficulty, computational stage, and causal relevance. Our analysis reveals three dissociations: a difficulty inversion, where models converge more on problems they collectively fail (Centered Kernel Alignment [CKA] = 0.897) than on those they solve (CKA = 0.830); a generation gap, where pre-decision representations align (CKA = 0.875) while post-decision representations diverge (CKA = 0.274); and epiphenomenal correctness, where shared information is decodable across models (66% transfer accuracy) but exerts minimal causal influence on predictions (1.5% to 5.5% flip rate across ablation protocols). These results indicate that representational convergence in language models reflects shared input processing constraints rather than shared reasoning strategies, with direct implications for ensemble design, interpretability transfer, and evaluations of model similarity. Code is available at https://github.com/Usama1002/convergence-without-understanding.
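For readers unfamiliar with the similarity score used above, here is a minimal sketch of linear Centered Kernel Alignment between two representation matrices whose rows are the same inputs. The abstract does not say which CKA variant or kernel the authors use, so the linear form is an assumption, and the toy matrices are hypothetical.

```python
import math

def center_columns(m):
    """Subtract the column mean from every entry of a row-major matrix."""
    n, d = len(m), len(m[0])
    means = [sum(row[j] for row in m) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in m]

def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b for matrices a (n x p) and b (n x q)."""
    total = 0.0
    for i in range(len(a[0])):
        for j in range(len(b[0])):
            dot = sum(a[k][i] * b[k][j] for k in range(len(a)))
            total += dot * dot
    return total

def linear_cka(x, y):
    """Linear CKA: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F) on centered data."""
    x, y = center_columns(x), center_columns(y)
    den = math.sqrt(cross_sq_norm(x, x) * cross_sq_norm(y, y))
    if den == 0.0:
        raise ValueError("a representation has zero variance")
    return cross_sq_norm(x, y) / den

# Hypothetical activations: 4 inputs, 2 features per model.
reps_a = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
reps_b = [[r[1], -r[0]] for r in reps_a]  # same geometry, rotated by 90 degrees
score = linear_cka(reps_a, reps_b)
```

Linear CKA is invariant to rotations and uniform scaling of the feature space, so two models can score near 1 without sharing individual neurons, which is why it suits cross-architecture comparisons.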

2605.23259 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Multi-Gate Residuals

多门残差

Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou

发表机构 * Shanghai Yichuang Information Technology Co.,Ltd.(上海亿创信息技术有限公司) Fudan University(复旦大学)

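AI summary: This paper proposes Multi-Gate Residuals (MGR), which addresses unbounded activation growth in deep residual networks without adding communication overhead. MGR maintains multi-stream context through a simple scoring and gating mechanism and uses Attention Pooling to extract hidden states from the stream states, keeping activation scales stable. Experiments indicate that MGR is practical for large-scale training and deployment and improves on existing architectures.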

详情
AI中文摘要

虽然注意力残差在解决深度残差层中普遍存在的激活值无界增长问题方面显示出一定效果,但它不可避免地引入了显著的通信开销。为了规避这一瓶颈,我们提出了多门残差(MGR),它在不增加通信负担的情况下稳定激活尺度。它利用简单的评分和门控机制来维护多流上下文,并结合注意力池化从流状态中提取隐藏状态。实证实验表明,MGR对于大规模训练和部署是实用的,相比现有架构提供了切实的性能提升。

英文摘要

While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.

2605.23215 2026-05-25 cs.LG cs.AI cs.CL 版本更新

FastKernels: Benchmarking GPU Kernel Generation in Production

FastKernels:生产中GPU内核生成的基准测试

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

发表机构 * Snowflake AI Research(Snowflake AI研究院) CMU(卡内基梅隆大学) UCSD(加州大学圣地亚哥分校) Independent Researcher(独立研究者)

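AI summary: Benchmarks for LLM-based GPU kernel generation agents are poorly aligned with production inference environments. This work introduces FastKernels, a benchmark built around 46 representative architectures spanning 8 categories, whose kernels subsume those of 96.2% of HuggingFace Transformers architectures, and which doubles as a production-grade inference framework. State-of-the-art kernel generation agents reach at most 0.94x aggregate speedup over production baselines on FastKernels, which points to benchmark-production misalignment as a key bottleneck.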

详情
AI中文摘要

基于LLM的GPU内核生成代理正在快速发展,但其进展从根本上受到所优化基准的限制。现有基准与生产推理框架严重脱节:它们在单GPU上使用合成输入评估内核,忽略周围的编译栈,并奖励复制已知优化而非发现新优化。由此产生的奖励信号具有误导性:代理学会生成在沙箱中得分高但在集成到实际系统时引入接口不兼容、编译栈冲突和静默正确性下降的内核。我们引入FastKernels,一个基于最小化46个代表性架构(涵盖8个类别)的内核基准,这些内核共同涵盖了96.2%(409/425)的HuggingFace Transformers架构。FastKernels同时作为一个简约的生产级推理框架,在主流LLM服务上与vLLM和SGLang等成熟系统运行性能相当,并在服务不足的架构上显著超过上游参考;每个任务的接口镜像其架构家族中最先进库的相应模块,使得优化后的内核能够直接部署到生产代码库中。在FastKernels上评估最先进的内核代理,我们发现即使最强的代理也仅实现0.94倍于生产基线的总加速,而较弱的代理分别为0.78倍和0.53倍——证实基准-生产错位是该领域的关键瓶颈。我们发布FastKernels,作为迈向基准收益直接转化为生产吞吐量改进的内核代理的垫脚石。代码可在https://github.com/Snowflake-AI-Research/fastkernels获取。

英文摘要

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only $0.94\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$, confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels
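An aggregate speedup below 1 means the generated kernels are, on balance, slower than the production baseline. The abstract does not state how per-task speedups are aggregated; the sketch below assumes a geometric mean, a common choice for ratios, and uses hypothetical timings.

```python
import math

def speedup(baseline_ms, candidate_ms):
    """Speedup of a candidate kernel over the baseline (>1 means faster)."""
    return baseline_ms / candidate_ms

def geometric_mean(values):
    """Geometric mean, computed in log space for numerical stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical (baseline_ms, candidate_ms) timings for three tasks.
timings = [(10.0, 8.0), (5.0, 10.0), (4.0, 4.0)]
per_task = [speedup(b, c) for b, c in timings]  # one faster, one slower, one tied
aggregate = geometric_mean(per_task)
```

With a geometric mean, a 2x slowdown on one task exactly cancels a 2x speedup on another, so a single large win cannot hide regressions elsewhere.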

2605.23193 2026-05-25 cs.HC cs.CL cs.CY cs.MA 版本更新

CultivAgents: Cultivating Relationship-Centered Multi-Agent Systems for Personalized Gardening

CultivAgents:培育以关系为中心的多智能体系统以实现个性化园艺

Yiyang Wang, Moeiini Reilly, Britney Johnson, Kefei Yan, Alex Cabral, Josiah Hester

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Massachusetts Institute of Technology(麻省理工学院)

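AI summary: CultivAgents is a relationship-centered multi-agent system that provides socio-culturally grounded support for personalized gardening. It coordinates specialized agents (an Experience Agent, an Environmental Agent, and an Ethnobotanical Agent) to give advice adapted to the user's skill level, local ecological conditions, and cultural context. A three-phase mixed-methods evaluation suggests that CultivAgents raised gardeners' confidence, motivation, and trust in acting on AI advice, while revealing room for improvement in cultural specificity, ecological grounding, and agent coordination.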

Comments Preprint, 9 pages. Website: https://hello-diana.github.io/CultivAgents/

详情
AI中文摘要

园艺对于支持福祉、文化连续性和食物自主至关重要,然而现有的数字工具通常提供忽视园丁技能、当地生态、季节和文化背景的通用建议。我们介绍了 CultivAgents,一个以关系为中心的多智能体系统,用于个性化、社会文化背景化的园艺支持。基于关怀伦理,CultivAgents 协调多个专业智能体:经验智能体根据用户技能水平调整指导,环境智能体基于当地和季节条件提供建议,民族植物学智能体将植物与文化知识和历史联系起来。我们通过一项三阶段混合方法研究评估了 CultivAgents,参与者包括领域专家(n=3)、人机交互研究人员(n=7)和社区园丁(n=5),分析了专家反馈、前后调查和参与式设计活动。结果表明,CultivAgents 帮助园丁将兴趣转化为情境行动:社区园丁报告信心(从3.00到3.60)、动机(从4.00到4.40)和信任AI建议(从3.20到4.00)均有提升。参与者重视超本地生态指导和互补的智能体视角,同时也指出了文化特异性、生态基础以及智能体协调方面的局限性。这项工作推进了以关系为中心的人工智能,为支持食物主权、社区韧性和文化保护的多智能体系统提供了设计启示。

英文摘要

Gardening is critical to support well-being, cultural continuity, and food autonomy, yet existing digital tools often provide generic advice that overlooks gardeners' skills, local ecologies, seasons, and cultural contexts. We introduce CultivAgents, a relationship-centered multi-agent system for personalized, socio-culturally grounded gardening support. Grounded in ethics of care, CultivAgents coordinates multiple specialized agents: an Experience Agent that adapts guidance to users' skill levels, an Environmental Agent that grounds advice in local and seasonal conditions, and an Ethnobotanical Agent that connects plants to cultural knowledge and histories. We evaluated CultivAgents through a three-phase mixed-methods study with domain experts (n=3), HCI researchers (n=7), and community gardeners (n=5), analyzing expert feedback, pre/post surveys, and participatory design activities. Results suggest that CultivAgents helped gardeners translate interest into situated action: community gardeners reported increased confidence (3.00 to 3.60), motivation (4.00 to 4.40), and trust in acting on AI advice (3.20 to 4.00). Participants valued hyperlocal ecological guidance and complementary agent perspectives, while also identifying limits in cultural specificity, ecological grounding, and agent coordination. The work advances relationship-centered AI, offering design implications for multi-agent systems that support food sovereignty, community resilience, and cultural preservation.

2605.23190 2026-05-25 cs.CL 版本更新

Hidden Human-Like Nature of Machine-Generated Texts: Theory and Detection Enhancement

机器生成文本的隐藏类人本质:理论与检测增强

Chenwang Wu, Yiu-ming Cheung, Bo Han, Defu Lian

发表机构 * Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系) School of Computer Science and Technology, University of Science and Technology of China(中国科学技术大学计算机科学与技术学院) State Key Laboratory of Cognitive Intelligence(认知智能国家重点实验室)

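AI summary: As LLM-generated text becomes common across applications, its potential misuse creates a pressing need for detection. Existing methods usually treat generated text as entirely machine-like and overlook spans that closely resemble human writing. This paper reveals the existence of such hidden human-like spans, analyzes how they hinder detection, and proposes a model-agnostic stacked enhancement framework that improves detectors through iterative filtering and refinement. Experiments show the approach is effective across a range of scenarios and can also run training-free.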

详情
AI中文摘要

由大型语言模型(LLMs)生成的机器生成文本(MGTs)在各种应用中越来越普遍,但其在虚假新闻传播和网络钓鱼中的潜在滥用引发了严重担忧,凸显了MGT检测的必要性。现有的段落级检测方法通常将MGT视为完全机器化的,忽略了机器生成文本的隐藏类人本质:即使是完全机器生成的文本也可能包含与人类写作高度一致的片段。为此,我们首先揭示了这种隐藏类人片段的存在,然后从理论上分析了它们对检测的影响。我们的分析表明,这些片段增加了检测的句子复杂度,从而使MGT检测本质上更加困难。基于这一发现,我们提出了一种模型无关的堆叠增强框架,通过减少隐藏类人片段的影响来改进现有检测器。具体来说,我们将片段级别的保留决策建模为潜在变量问题,并使用硬EM启发式过程实例化优化,其中检测器迭代地过滤置信度高的类人子序列,并在剩余文本上自我优化。在各种LLM和实际场景中的大量实验表明,所提出的框架能够持续增强现有检测器。值得注意的是,该框架还可以以无需训练的方式工作,为实际部署提供了灵活性和可扩展性。

英文摘要

Machine-generated texts (MGTs) produced by large language models (LLMs) are increasingly prevalent across various applications, while their potential misuse in fake news propagation and phishing has raised serious concerns, highlighting the need for MGT detection. Existing paragraph-level detection methods commonly treat MGTs as entirely machine-like, overlooking the hidden human-like nature of machine-generated texts: even fully machine-generated texts may contain spans that are highly consistent with human writing. To this end, we first reveal the existence of such hidden human-like spans, and then theoretically analyze their impact on detection. Our analysis shows that these spans increase the sentence complexity for detection, thereby making MGT detection intrinsically harder. Based on this finding, we propose a model-agnostic stacked enhancement framework that improves existing detectors by reducing the influence of hidden human-like spans. Specifically, we model span-level retention decisions as a latent-variable problem and instantiate the optimization with a hard-EM-inspired procedure, where the detector iteratively filters confidently human-like subsequences and refines itself on the remaining text. Extensive experiments across various LLMs and practical scenarios demonstrate that the proposed framework consistently enhances existing detectors. Notably, the framework can also work in a training-free manner, offering flexibility and scalability for practical deployment.

2605.23180 2026-05-25 cs.CL cs.LG 版本更新

Self-Improving In-Context Learning

自我改进的上下文学习

Baturay Saglam, Dionysis Kalogerias

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系)

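AI summary: This paper improves in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The log-probabilities a model assigns to its demonstrated outputs serve as a signal of how well it has inferred the task; the authors turn this into a self-supervised confidence proxy that needs no extra data and maximize it by zeroth-order optimization over the prompt embeddings. The method requires no finetuning, no token generation, and no predefined label set, applies to both classification and free-form generation, and matches or improves on the base model across a suite of ICL tasks.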

详情
AI中文摘要

我们提出通过优化测试时固定少样本提示的连续嵌入来改进上下文学习(ICL)。关键观察是,模型对其演示输出分配的对数概率——可在单次前向传播中获得,无需生成任何令牌——为模型从演示中推断任务提供了有意义的信号。我们将此信号形式化为一个有界的、自监督的置信度代理,并通过在提示嵌入上进行零阶优化来最大化它,从而得到一种测试时校准程序。该方法不需要微调、令牌生成、预定义标签集或外部数据,因此同样适用于分类和自由生成任务。在一系列全面的ICL任务中,所提出的校准方法始终匹配或改进基础模型,并在大多数任务上优于特定于分类的基线。代理改进与下游准确率提升之间的统计显著相关性证实了所提出的代理编码了用于上下文学习的可靠优化信号。

英文摘要

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs, available from a single forward pass without generating any tokens, provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
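Zeroth-order optimization maximizes an objective using only function evaluations, with no backpropagation. The sketch below shows one common scheme, a two-point Gaussian-direction estimator, applied to a toy bounded objective. Both the estimator and `toy_proxy` are assumptions for illustration: the abstract does not specify the authors' estimator, and the real proxy is built from model log-probabilities rather than a closed-form function.

```python
import random

def zeroth_order_ascent(f, x, steps=500, mu=1e-2, lr=0.1, seed=0):
    """Maximize f using only function evaluations (no gradients)."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        # Random probe direction in parameter space.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        x_plus = [xi + mu * ui for xi, ui in zip(x, u)]
        x_minus = [xi - mu * ui for xi, ui in zip(x, u)]
        # Central difference estimates the directional derivative along u.
        g = (f(x_plus) - f(x_minus)) / (2.0 * mu)
        # Step along u, scaled by the estimated slope.
        x = [xi + lr * g * ui for xi, ui in zip(x, u)]
    return x

def toy_proxy(x):
    """Hypothetical stand-in for a bounded confidence proxy, in (0, 1]."""
    target = [1.0, -2.0, 0.5]
    return 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(x, target)))

x0 = [0.0, 0.0, 0.0]
x_final = zeroth_order_ascent(toy_proxy, x0)
```

Each step costs two evaluations of the objective, which in the paper's setting would correspond to forward passes only; that is what makes the approach usable without finetuning.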

2605.23175 2026-05-25 cs.CR cs.CL 版本更新

Robust LLM Watermarking with Minimal Semantic Distortion for IP Protection

用于知识产权保护的具有最小语义失真的鲁棒LLM水印

Kieu Dang, Phung Lai, NhatHai Phan, Yelong Shen, Ruoming Jin

发表机构 * State University of New York at Albany(纽约州立大学阿尔巴尼分校) New Jersey Institute of Technology(新泽西理工学院) Microsoft(微软) Kent State University(肯特州立大学)

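AI summary: This paper studies how to protect the intellectual property of large language models while keeping semantic distortion minimal. The authors propose SAFESEAL, a key-conditioned watermarking framework that substitutes context-aware synonyms while preserving semantic fidelity and factual consistency, achieving strong detectability. On the detection side, a key-conditioned contrastive detector supports provider-specific verification in cross-provider and multi-user settings. In experiments SAFESEAL outperforms baselines in utility, detectability, and robustness.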

详情
AI中文摘要

专有大语言模型面临知识产权侵犯的风险,因为对手可以通过收集输入-输出对来训练替代模型,从而复制LLM,造成财务损失。水印提供了一种有前景的防御手段来验证所有权,但现有方法常常面临语义失真、事实不一致和对抗攻击的问题。此外,用于特定提供商检测的密钥条件水印,特别是在跨提供商和多用户场景中,仍然在很大程度上未被探索。为了解决这些挑战,我们提出了SAFESEAL,一种新颖的密钥条件水印框架,在最小化对模型实用性的影响下实现强可检测性,有效平衡可检测性、实用性和鲁棒性。SAFESEAL通过密钥条件锦标赛采样机制,在替换语言术语为上下文感知同义词的同时保留命名实体,保持语义保真度和事实一致性。在检测方面,我们引入了一种密钥条件对比检测器,该检测器联合编码文本和密钥,实现特定提供商和鲁棒的水印验证。我们推导了实用性-可检测性权衡的理论界限,并通过轻量级模型、批处理和并行化显著降低了延迟。大量实验表明,SAFESEAL在实用性、可检测性和鲁棒性方面优于基线,实现了0.983的BERTScore、0.963的实体相似度、98.2%的检测率,以及文本质量和内容保留的最高人类评分,延迟与最快的基线相当。为了促进透明度和社区驱动的进展,我们发布了第一个公共水印排行榜和一个交互式演示。

英文摘要

Proprietary large language models (LLMs) face risks of intellectual property (IP) violation, as adversaries can replicate an LLM by collecting input-output pairs to train a surrogate model, causing financial setbacks. Watermarks offer a promising defense to verify ownership, but existing methods often struggle with semantic distortion, factual inconsistency, and adversarial attacks. In addition, key-conditioned watermarks for provider-specific detection, especially in cross-provider and multi-user scenarios, remain largely underexplored. To address these challenges, we propose SAFESEAL, a novel key-conditioned watermarking framework that achieves strong detectability with minimal impact on model utility, effectively balancing detectability, utility, and robustness. SAFESEAL preserves named entities while substituting linguistic terms with context-aware synonyms through a key-conditioned Tournament sampling mechanism, maintaining semantic fidelity and factual consistency. For detection, we introduce a key-conditioned contrastive detector that jointly encodes the text and key, enabling provider-specific and robust watermark verification. We derive theoretical bounds on the utility-detectability trade-off and significantly reduce latency through lightweight models, batching, and parallelism. Extensive experiments show that SAFESEAL outperforms baselines in utility, detectability, and robustness, achieving a BERTScore of 0.983, entity similarity of 0.963, a 98.2% detection rate, and the highest human ratings for text quality and content preservation, with latency comparable to the fastest baseline. To promote transparency and community-driven progress, we release the first public watermark leaderboard and an interactive demo.

2605.23170 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

长上下文LLM中的位置失败:推理基准测试中的盲点

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

发表机构 * Beijing Jiaotong University(北京交通大学) Central South University of Forestry and Technology(中南林业科技大学)

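AI summary: This study argues that mainstream reasoning benchmarks for long-context LLMs do not control where the target task sits in the context, so they cannot measure position-dependent performance. The authors propose Context Rot Evaluation (CRE), a framework that systematically varies task position, filler content, and context length. When the target task moves from the end of the context to the middle, accuracy can drop sharply, and for vulnerable models the drop worsens as context grows. Appending a copy of the task at the end largely removes the drop, which exposes a structural blind spot in current benchmark design.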

Comments 20 pages, 1 figure, 23 tables

详情
AI中文摘要

位置控制评估是检索任务(如Needle-in-a-Haystack和RULER)的标准做法,但主流推理基准测试并未控制目标任务在长上下文中的位置。我们审计了11个长上下文基准测试,发现没有一个同时控制任务位置、填充内容和上下文长度进行推理。对四个旗舰长上下文发布的审计发现,NIAH、RULER或LongBench系列基准测试的主要结果表中没有条目,而智能体和编码基准测试在所有四个发布的主要结果表中均有出现。我们提出了上下文旋转评估(CRE),一个控制所有三个因素的框架,并在两轮中评估了九个LLM在GSM8K和ARC-Challenge上的表现:初始五个模型集和四个较新的供应商发布。当目标任务从末尾移动到中间时,模型性能可能急剧下降,且对于易受影响的模型,这种下降随着上下文长度增加而恶化。MiMo-v2-Flash在64K下使用with_solutions填充时下降88个百分点(中间准确率8%)。较新的发布显示出较小的下降:在64K下,四个模型中有三个的末尾位置准确率波动在+/-6个百分点内;MiMo-V2.5-Pro将MiMo-v2-Flash的88个百分点下降缩小到32个百分点。在questions_only_v2填充下,所有四个模型在中间位置的下降仍然存在(在8K、32K、64K下范围-16到-56个百分点)。在8K下,一个诊断探针在末尾添加目标任务副本,使所有九个模型的中间准确率与末尾基线相差在+/-4个百分点内,这与位置解释一致。在初始五个模型集中,76%的中间位置错误与周围填充文本匹配,而末尾位置仅为22%,这与填充-答案干扰作为主要错误模式一致。这些结果暴露了当前推理基准测试设计和供应商评估实践中的结构性评估差距:当任务位置不受控制时,无法测量随上下文长度增长而恶化的位置脆弱性。

英文摘要

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
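The three controlled factors (task position, filler content, context length) can be sketched as a small prompt builder. This is a hypothetical illustration, not the CRE implementation: the function names, the character-based length budget, and the filler strings are all assumptions.

```python
def pad_fillers(pool, budget_chars):
    """Cycle through filler items from `pool` until a character budget is met."""
    if not pool or any(len(item) == 0 for item in pool):
        raise ValueError("pool must contain non-empty filler items")
    out, used, i = [], 0, 0
    while used < budget_chars:
        item = pool[i % len(pool)]
        out.append(item)
        used += len(item)
        i += 1
    return out

def build_context(target, fillers, position):
    """Place `target` at the start, middle, or end of the filler items."""
    slots = {"start": 0, "middle": len(fillers) // 2, "end": len(fillers)}
    if position not in slots:
        raise ValueError(f"unknown position: {position}")
    items = list(fillers)
    items.insert(slots[position], target)
    return "\n\n".join(items)

# Hypothetical filler pool and target task.
fillers = pad_fillers(["filler A", "filler B"], budget_chars=40)
prompts = {pos: build_context("TARGET", fillers, pos)
           for pos in ("start", "middle", "end")}
```

Holding the filler list fixed while changing only `position` isolates the positional effect; changing `budget_chars` or the pool varies the other two factors independently.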

2605.23165 2026-05-25 cs.RO cs.AI cs.CL 版本更新

Autonomous Frontier-Based Exploration with VLM Guidance

VLM引导的基于前沿的自主探索

Aarush Aitha, Avideh Zakhor

发表机构 * EECS Department, University of California(加州大学EECS系)

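AI summary: This paper proposes VLM-guided autonomous frontier-based exploration for robots in unknown and hazardous environments. A Vision-Language Model makes the high-level strategic decisions that guide a conventional low-level control stack: the robot builds a multimodal prompt from its current map and images of candidate paths, and the VLM selects the most promising frontier. In simulation across six indoor environments the method improves map coverage, and it is lightweight, training-free, and easy to transfer.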

Comments 8 pages, 10 figures, CVPR 2026: 2nd Workshop on 3D-LLM/VLA: Bridging Language, Vision and Action in 3D Environments

详情
AI中文摘要

自主机器人在未知和危险环境中的探索是一个长期挑战,通过利用视觉语言模型的高级推理能力可以显著改进。我们提出了一种新颖的探索流程,其中VLM执行高层战略决策,引导传统的低级机器人控制栈。在决策点,机器人生成包含当前地图和潜在路径(即前沿)视觉图像的多模态提示。VLM分析该提示以选择最有希望的前沿,用上下文空间推理替代简单的几何启发式。该方法在六个室内环境的模拟中得到了验证,与现有方法相比,地图覆盖率提高了高达24%。我们的流程轻量级、无需训练,并且可以轻松迁移到任何配备标准传感器和互联网连接的机器人上。

英文摘要

Autonomous robotic exploration of unknown and hazardous environments, a long-standing challenge, can be significantly improved by leveraging the advanced reasoning of Vision-Language Models (VLMs). We introduce a novel exploration pipeline where a VLM performs high-level strategic decision-making, guiding a conventional low-level robotics control stack. At decision points, the robot generates a multimodal prompt with its current map and visual imagery of potential paths, or frontiers. The VLM analyzes this prompt to select the most promising frontier, replacing simple geometric heuristics with contextual spatial reasoning. This approach, validated in simulation across six indoor environments, improves map coverage by up to 24% over existing methods. Our pipeline is lightweight, training-free, and easily transferable to any robot with standard sensors and an internet connection.
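In frontier-based exploration, a frontier is conventionally a known-free cell that borders unexplored space. The sketch below detects frontier cells on a small occupancy grid and picks the nearest one, which stands in for the kind of simple geometric heuristic the VLM is said to replace. The grid encoding, the function names, and the nearest-frontier rule are assumptions for illustration, not the authors' pipeline.

```python
FREE, OCCUPIED, UNKNOWN = 0, 1, -1

def find_frontier_cells(grid):
    """Return free cells that have at least one 4-connected unknown neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break
    return frontiers

def nearest_frontier(frontiers, robot):
    """Baseline geometric heuristic: smallest Manhattan distance to the robot."""
    return min(frontiers, key=lambda f: abs(f[0] - robot[0]) + abs(f[1] - robot[1]))

# Hypothetical 3x3 map: the right column is mostly unexplored.
grid = [
    [FREE, FREE,     UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE,     FREE],
]
frontiers = find_frontier_cells(grid)
choice = nearest_frontier(frontiers, robot=(2, 0))
```

In a VLM-guided variant, `nearest_frontier` would be swapped for a call that shows the map and frontier imagery to the model and returns its pick, while frontier detection stays unchanged.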

2605.23158 2026-05-25 cs.CR cs.CL cs.LG 版本更新

What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference

服务器看到了什么?理解大语言模型在分割推理中的隐私泄露

Mingyuan Fan, Yu Liu, Fuyi Wang, Cen Chen

发表机构 * East China Normal University(华东师范大学) RMIT University(皇家墨尔本理工大学)

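AI summary: This paper studies how split inference for large language models can leak user privacy. The authors propose ActInv, which reconstructs the client's input by matching intermediate activations, exposing a privacy vulnerability in split inference. They also introduce the Perturbation Amplification Factor (PAF) to quantify each layer's resistance to reconstruction, and design the PriPert defense, which is evaluated on privacy, utility, and computational overhead.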

Comments Accepted to ACM CCS'26

详情
AI中文摘要

在资源受限设备上部署大语言模型(LLM)仍然具有挑战性,这激发了人们对分割推理的兴趣,即模型在客户端和服务器之间进行划分,通过仅传输中间激活来减少计算负担并增强隐私。然而,分割推理的隐私保护能力,特别是在LLM背景下,尚未得到彻底研究。为填补这一空白,我们引入了ActInv,它解决了一个中间激活匹配问题以重建客户端的输入。大量评估表明,即使在存在常见基于扰动的防御(如高斯噪声注入和激活稀疏化)的情况下,ActInv也能实现高保真重建。为了系统地理解这一漏洞,我们开发了扰动放大因子(PAF),一个用于量化层对重建固有抵抗力的指标。我们的分析揭示了隐私脆弱性在层间并不均匀,一些层高度易受泄露,而另一些层则提供自然抵抗力。此外,我们证明了通过校准扰动方向以在反向传播期间最大化重建误差,可以显著提高防御有效性。基于这些见解,我们设计了PriPert,并进行了全面评估,涵盖隐私、效用和计算开销,以证明其有效性。

英文摘要

The deployment of large language models (LLMs) on resource-constrained devices remains challenging, spurring interest in split inference, where models are partitioned between client and server to reduce computational burden and enhance privacy by transmitting only intermediate activations. However, the privacy-preserving capabilities of split inference, particularly in the context of LLMs, have not been exhaustively investigated. To fill this gap, we introduce ActInv, which solves an intermediate activation matching problem to reconstruct the client's input. Extensive evaluations demonstrate that ActInv achieves high-fidelity reconstructions, even in the presence of common perturbation-based defenses such as Gaussian noise injection and activation sparsification. To systematically understand this vulnerability, we develop Perturbation Amplification Factor (PAF), a metric for quantifying a layer's inherent resistance to reconstruction. Our analysis reveals that privacy vulnerability is not uniform across layers, with some layers being highly susceptible to leakage while others offer natural resistance. Furthermore, we demonstrate that defense effectiveness can be significantly improved by calibrating perturbation directions to maximize reconstruction error during backpropagation. Building on these insights, we design PriPert and conduct comprehensive evaluations, covering privacy, utility, and computational overhead, to demonstrate its effectiveness.
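The two perturbation-based defenses named above can be sketched on a single activation vector before it leaves the client. These are generic textbook forms with hypothetical parameter names; the abstract does not give the exact variants or settings evaluated, and top-k magnitude pruning is only one possible reading of "activation sparsification".

```python
import random

def add_gaussian_noise(activation, sigma, seed=0):
    """Add i.i.d. zero-mean Gaussian noise with standard deviation `sigma`."""
    rng = random.Random(seed)
    return [a + rng.gauss(0.0, sigma) for a in activation]

def sparsify_top_k(activation, k):
    """Keep the k largest-magnitude entries and zero out the rest."""
    if k <= 0:
        return [0.0] * len(activation)
    order = sorted(range(len(activation)),
                   key=lambda i: abs(activation[i]), reverse=True)
    keep = set(order[:k])
    return [a if i in keep else 0.0 for i, a in enumerate(activation)]

# Hypothetical intermediate activation sent from client to server.
h = [0.1, -3.0, 2.0, 0.5]
h_noisy = add_gaussian_noise(h, sigma=0.1)
h_sparse = sparsify_top_k(h, k=2)
```

Both defenses perturb every layer's output the same way, which is the kind of uniform treatment the paper's layer-wise analysis calls into question.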

2605.23157 2026-05-25 cs.CL 版本更新

Same Model, Different Weakness: How Language and Modality Reshape the Jailbreak Attack Surface in Frontier MLLMs

相同模型,不同弱点:语言与模态如何重塑前沿多模态大语言模型的越狱攻击面

Casey Ford, Madison Van Doren, Sicheng Jin, Emily Dix

发表机构 * Appen

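AI summary: This study examines how the jailbreak attack surface of multimodal large language models (MLLMs) changes with language and modality, and finds that language affects model safety non-uniformly. Comparing four frontier models under US English and Mexican Spanish, it shows that linguistic framing attacks become less effective in Spanish while visually explicit multimodal attacks become more effective, which indicates that linguistic and visual alignment failures operate through distinct mechanisms. Safety evaluation frameworks that treat language and modality as independent dimensions therefore misrepresent the real attack surface and need to be redesigned.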

详情
AI中文摘要

多模态大语言模型(MLLM)的攻击面具有语言依赖性,揭示了对齐失败的机制结构。我们首次进行系统的跨语言、多模态红队研究,比较了四种前沿MLLM(Claude Sonnet 4.5、GPT-5、Pixtral Large和Qwen Omni)在美国英语(en-US)和墨西哥西班牙语(es-MX)下的越狱漏洞。使用包含363个多样化提示场景的固定对抗基准,在纯文本和多模态条件下进行测试,从每组语言的九名母语标注员匹配小组收集了52,272个危害评级和二元攻击成功判断。我们的核心发现是,语言不会均匀地放大漏洞。贝叶斯混合效应分析显示,语言框架攻击(如角色扮演)在西班牙语提示下效果显著降低,而视觉显式多模态攻击效果增强,这直接指向提示-语言界面而非全局标注员宽松度。这种分离表明,语言和视觉对齐失败通过不同机制运作,切换语言足以暴露这种分离。实际后果是安全排名不跨语言保持。Qwen Omni在es-MX参与者中超越Pixtral Large成为最易受攻击的模型,这种排名反转是英语条件下分数的标量校正无法恢复的,并且绝对攻击成功率在模型代际间下降,但模型间差距未缩小。这些发现表明,将语言和模态视为独立维度的安全评估框架从根本上错误地指定了全球部署MLLM的攻击面,必须相应重新设计。

英文摘要

The attack surface of a multimodal large language model (MLLM) is language-dependent in ways that reveal the mechanistic structure of alignment failures. We present the first systematic cross-lingual, multimodal red-teaming study comparing jailbreak vulnerability in US English (en-US) and Mexican Spanish (es-MX) across four frontier MLLMs: Claude Sonnet 4.5, GPT-5, Pixtral Large, and Qwen Omni. Using a fixed adversarial benchmark of 363 diverse prompt scenarios administered in text-only and multimodal conditions, we collected 52,272 harm ratings and binary attack success judgements from matched panels of nine native-speaker annotators per language group. Our central finding is that language does not scale vulnerability uniformly. Bayesian mixed-effects analyses reveal that linguistic framing attacks such as role-play become substantially less effective under Spanish prompting, while visually explicit multimodal attacks become more effective, which directly implicates the prompt-language interface rather than global annotator leniency. This dissociation indicates that linguistic and visual alignment failures operate through distinct mechanisms, and that switching language is sufficient to expose that separation. The practical consequence is that safety rankings are not preserved across languages. Qwen Omni overtakes Pixtral Large as the most vulnerable model among es-MX participants, a rank reversal no scalar correction of English-condition scores could recover, and absolute attack success rates have declined across model generations without closing the gaps between them. These findings demonstrate that safety evaluation frameworks treating language and modality as independent dimensions fundamentally misspecify the attack surface of globally deployed MLLMs, and must be redesigned accordingly.

2605.23147 2026-05-25 cs.CL cs.AI 版本更新

As X, Do Y: How Persona and Task Combine in Instruction-Tuned LLMs

作为X,做Y:角色和任务如何在指令微调LLM中结合

Eric Xu

发表机构 * Independent Researcher(独立研究者)

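AI summary: This study examines how role prompts of the form "As X, do Y" combine persona and task information in instruction-tuned LLMs, and finds a clean linear decomposition at one specific site in the residual stream. Persona and task contribute through partially orthogonal additive directions, and this local additive structure supports interpretable steering of either contribution. Even so, a role prompt cannot be compressed into a single residual vector, because persona-conditioned behavior depends on a mechanism distributed across the whole prompt.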

Comments 12 pages, 1 figure. Code: https://github.com/xuy/localized-additive-composition

详情
AI中文摘要

形式为“作为X,做Y”的角色提示在残差流的一个特定位置——提示到答案的过渡(最后一个提示标记与前两个生成标记)——在早期/中层波段表现出清晰的线性分解。在那里,角色和任务通过部分正交的加性方向贡献。形成纯角色效应Δ_X、纯任务效应Δ_Y,并将h_BB + Δ_X + Δ_Y替换干净残差,在Gemma-2-2B-IT和Qwen-2.5-{1.5B, 3B}-Instruct上,跨越12个单元格的短网格和48个单元格的长角色网格,下游输出与干净输出的KL散度很小,并保留了角色特定的行为标记。从这种加性结构自然推断,角色提示可以压缩为单个缓存的残差向量。我们证明它不能。将缓存的加性预测——甚至oracle干净残差h_XY——注入到移除了角色文本的基线宿主提示中,无论是在一个位置还是在多个层,都无法接近干净的长角色目标。角色条件化的多标记生成通过注意力流回整个提示中的角色文本位置,这是任何单个位置的残差无法复现的。残差流中的局部加性性并不意味着提示可压缩。提示到答案过渡处的加性结构支持可解释性和对角色或任务贡献的细粒度控制;整个延续中的角色条件化行为依赖于分布式的提示/KV机制,局部激活算术无法取代。

英文摘要

Role prompts of the form "As X, do Y" admit a clean linear decomposition at one specific site in the residual stream: the prompt-to-answer transition (the last prompt token together with the first two generated tokens) in an early/mid layer band. There, persona and task contribute through partially orthogonal additive directions. Forming a pure persona effect $\Delta_X$, a pure task effect $\Delta_Y$, and substituting $h_{BB} + \Delta_X + \Delta_Y$ for the clean residual yields downstream output within a small KL of clean on Gemma-2-2B-IT and Qwen-2.5-{1.5B, 3B}-Instruct, across a 12-cell short grid and a 48-cell long-persona grid, with persona-specific behavioral markers preserved. The natural inference from this additive structure is that the role prompt can be compressed into a single cached residual vector. We show it cannot. Injecting the cached additive prediction, or even the oracle clean residual $h_{XY}$, into a baseline host prompt with the persona text removed does not approach the clean long-persona target, at one site or at many layers. Persona-conditioned multi-token generation flows through attention back to the persona-text positions throughout the prompt, which no residual at one site reproduces. Local additivity in the residual stream does not imply prompt compressibility. The additive structure at the prompt-to-answer transition supports interpretability and fine-grained steering of persona or task contributions; persona-conditioned behavior across the full continuation depends on a distributed prompt/KV mechanism that local activation arithmetic does not displace.
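The additive substitution can be made concrete with a few lines of vector arithmetic. The abstract does not define how the effects are formed; the sketch assumes each effect is the difference from the baseline residual, that is, Δ_X = h_XB − h_BB and Δ_Y = h_BY − h_BB, where h_XB and h_BY are hypothetical names for the persona-only and task-only residuals. The three-dimensional vectors are toy values, not model activations.

```python
import math

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity; 0 means orthogonal directions."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical residuals at the prompt-to-answer transition.
h_bb = [0.0, 0.0, 1.0]  # baseline persona, baseline task
h_xb = [1.0, 0.0, 1.0]  # persona X, baseline task (assumed definition)
h_by = [0.2, 1.0, 1.0]  # baseline persona, task Y (assumed definition)

delta_x = sub(h_xb, h_bb)                  # pure persona effect
delta_y = sub(h_by, h_bb)                  # pure task effect
h_pred = add(add(h_bb, delta_x), delta_y)  # additive prediction for h_XY
overlap = cosine(delta_x, delta_y)         # small but nonzero: partially orthogonal
```

The abstract's negative result is that a good `h_pred` at this one site is not enough: later tokens still attend back to the persona text, which a single substituted vector cannot supply.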

2605.23103 2026-05-25 cs.CL cs.AI cs.CY cs.DB Version update

A Fine-Tuned BERT Classifier for Personal-Letter Titles in Late-Ming and Early-Qing Collected Works

Queenie Luo

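Affiliations * Harvard University

AI summary This paper presents Lepton, a fine-tuned BERT classifier that identifies whether a title in the table of contents of a late-Ming or early-Qing collected work (wenji) is a personal letter, distinguishing letters in particular from easily confused prefaces such as farewell prefaces. The model is fine-tuned on 5438 hand-labeled wenji titles from thirty-three literati and is deployed on Hugging Face. It has been applied at the China Biographical Database (CBDB), where it identified approximately fifty-five thousand letters in support of the Ming Letter Platform.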

Details
Abstract

I present Lepton (Letter Prediction), a fine-tuned BERT classifier that predicts whether a title in a Classical Chinese wenji table of contents is a personal letter or a closely confusable preface (particularly the farewell-preface). Lepton fine-tunes bert-base-chinese on 5438 hand-labeled wenji titles from thirty-three late-Ming and early-Qing literati. I have deployed the model on Hugging Face, and it has been used at the China Biographical Database (CBDB) to identify approximately fifty-five thousand letters across mid-Ming through early-Qing wenji, populating the Ming Letter Platform.

2605.23093 2026-05-25 cs.CL cs.CY Version update

A Comparative Evaluation of Structural Topic Models and BERTopic for Short, Open-Ended Survey Responses

Yan Jiang, Sihong Liu, Philip A. Fisher

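Affiliations * Stanford Center on Early Childhood, Stanford University

AI summary This paper compares Structural Topic Models (STM) and the embedding-based BERTopic for analyzing short, open-ended survey responses. Both methods are evaluated under multiple parameter settings: BERTopic achieves higher topic coherence than STM, while STM is better suited to covariate analysis. The results indicate that each method has its own strengths and fits different research needs, and the paper offers practical guidance for choosing a topic modeling approach in applied social science research.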

Details
Abstract

Topic modeling in applied psychology increasingly spans two methodological traditions: probabilistic bag-of-words models and newer embedding-based approaches. Yet many evaluations of these methods rely on longer and cleaner benchmark corpora, leaving less guidance for short, open-ended survey responses. This paper compares Structural Topic Models (STM), a probabilistic topic model, and BERTopic, an embedding-based model, for analyzing open-ended survey responses. We evaluated three STM conditions and five BERTopic conditions, varying typographical correction, stemming, embedding choice, and contextual augmentation, a strategy we introduced to provide additional semantic context for very short responses. Results indicate that BERTopic consistently produced higher topic coherence than STM, with contextual augmentation yielding the strongest performance gains. In contrast, higher-dimensional embeddings alone did not improve coherence and were associated with greater data loss. Qualitative evaluation showed that BERTopic generated more interpretable and stable topics, while STM topics were often broader and more mixed. However, STM provides stronger support for inferential covariate analysis, whereas BERTopic covariate comparisons are primarily descriptive. These findings suggest that STM and BERTopic offer complementary strengths. We conclude with practical guidance for selecting and combining topic modeling approaches in applied social science research.

2605.23078 2026-05-25 cs.LG cs.CL Version update

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu

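Affiliations * University of Pittsburgh; University of Central Florida; University of Arizona; University of North Carolina at Chapel Hill

AI summary Mixture-of-Experts large language models (MoE-LLMs) perform strongly but incur substantial memory overhead because of their many expert parameters. To address this, the paper proposes GEMQ, a global expert-level mixed-precision quantization method that captures model-wide expert importance through a global linear-programming formulation and pairs it with efficient router fine-tuning to adapt routing to the quantized experts, yielding a better accuracy-memory trade-off. Experiments show that GEMQ significantly reduces memory use and accelerates inference while preserving accuracy.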

Comments ICML 2026

Details
Abstract

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .

2605.23071 2026-05-25 cs.CL Version update

The Efficiency Frontier: A Unified Framework for Cost-Performance Optimization in LLM Context Management

Binqi Shen, Lier Jin, Hanyu Cai, Lan Hu, Yuting Xin

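Affiliations * Northwestern University; Duke University; Carnegie Mellon University; University of Minnesota

AI summary As large language models rely more heavily on long-context processing, expanding the context window brings substantial computational and financial costs. This paper proposes The Efficiency Frontier, a unified framework for cost-performance optimization in context management that models context strategy selection as a deployment-aware optimization problem, jointly accounting for task performance, token cost, and preprocessing reuse. The framework reveals the operating conditions under which retrieval-based and preprocessing-based strategies are each preferable, and experiments show marked reductions in token usage and cost.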

Details
Abstract

Large language models (LLMs) increasingly rely on long-context processing, but expanding context windows introduces substantial computational and financial costs. Existing context reduction approaches, including retrieval and memory compression methods, are typically evaluated using performance and efficiency metrics independently, limiting systematic comparison and deployment-aware decision-making. This paper introduces The Efficiency Frontier, a unified framework for cost-performance optimization in LLM context management. The framework models context strategy selection as a deployment-aware optimization problem that jointly accounts for task performance, token cost, and preprocessing reuse through amortized cost modeling. Unlike existing evaluations that compare methods in isolation, the proposed framework enables decision-oriented analysis of when different context management strategies become preferable under varying operational conditions. Evaluated on 5,000 HotpotQA instances, the framework reveals distinct operational regimes and transition boundaries between retrieval-based and preprocessing-based strategies. Results show that deployment-aware optimization reduces effective token usage by approximately 25% at comparable performance ($F1 \approx 0.78$), while amortized memory compression achieves over 50% lower token cost relative to full-context prompting in higher-performance settings. Overall, the proposed framework provides a principled and practical foundation for evaluating and deploying scalable, efficient, and sustainable LLM systems.
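
The idea of amortized cost modeling with preprocessing reuse can be made concrete with a small calculation. This is a sketch under a deliberately simple assumption, namely that the effective cost per query is the one-time preprocessing cost divided by the number of queries that reuse it, plus the per-query token cost. The paper's actual cost model is not given in the abstract, and the function names and token counts are hypothetical.

```python
import math


def amortized_cost(preprocess_tokens, per_query_tokens, reuse_count):
    """Effective tokens per query when preprocessing is shared by reuse_count queries."""
    if reuse_count < 1:
        raise ValueError("reuse_count must be at least 1")
    return preprocess_tokens / reuse_count + per_query_tokens


def break_even_reuse(preprocess_tokens, per_query_tokens, baseline_tokens):
    """Smallest reuse count at which the preprocessed strategy costs no more
    than a baseline that spends baseline_tokens on every query.
    Returns None if the preprocessed strategy never catches up."""
    saving_per_query = baseline_tokens - per_query_tokens
    if saving_per_query <= 0:
        return None
    return math.ceil(preprocess_tokens / saving_per_query)


# Hypothetical numbers: full-context prompting uses 4000 tokens per query;
# a compressed memory costs 6000 tokens to build once, then 1500 per query.
n_star = break_even_reuse(6000, 1500, 4000)
cost_at_break_even = amortized_cost(6000, 1500, n_star)
```

Below the break-even reuse count the baseline is cheaper; above it the preprocessed strategy wins. Under this simplified model, that crossover is one way to read the transition boundaries between strategies that the abstract mentions.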

2605.23069 2026-05-25 cs.CL Version update

DFKI-MLT at SemEval-2026 TASK 7: Steering Multilingual Models Towards Cultural Knowledge

Yusser Al Ghussin, Daniil Gurgurov, Yasser Hamidullah, Josef van Genabith, Cristina España-Bonet, Simon Ostermann

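Affiliations * German Research Center for Artificial Intelligence (DFKI GmbH); Saarland Informatics Campus; Barcelona Supercomputing Center (BSC-CNS)

AI summary To address gaps in the cultural knowledge of multilingual large language models, this work applies activation steering: language vectors extracted from the parallel FLORES corpus are used to adapt multilingual models at inference time. The team participated in both the multiple-choice and short-answer tracks of SemEval-2026 Task 7; the multiple-choice submission reached 86.96% accuracy and ranked 7th. The analysis shows that the effect of activation steering varies across languages and layers, suggesting that prompt design and activation steering should be optimized jointly for culturally aware tasks.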

Comments Accepted to The 20th International Workshop on Semantic Evaluation at ACL 2026

Details
Abstract

Large language models (LLMs) are increasingly used across diverse linguistic and cultural contexts, yet their cultural knowledge remains uneven across regions and languages. We present the DFKI-MLT system for SemEval-2026 Task 7 on cultural awareness, where we apply activation steering to multilingual LLMs using language vectors extracted from parallel FLORES data. Our method performs inference-time adaptation by adding language-specific steering vectors to the residual stream at a selected transformer layer, without any parameter updates. We participated in both the short-answer (SAQ) and multiple-choice (MCQ) tracks; however, only our MCQ submission received an official score. In the official MCQ track, we achieved 86.96% accuracy, ranking 7th out of 17 teams. To better understand system behavior, we conduct post-hoc analyses on the shared-task MCQ and SAQ settings. These analyses show that activation steering yields modest and heterogeneous improvements on cultural reasoning: gains are strongly layer-sensitive; they vary substantially across language-region pairs, with some configurations even degrading performance; and they interact with prompt formulation (generic versus culturally conditioned prompts). Our findings suggest that prompt design and activation steering should be jointly optimized for culturally aware multilingual inference.
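
The steering step in this abstract, adding a language-specific vector to the residual stream at one layer, can be sketched on plain lists. This is a toy illustration rather than the system's implementation: it assumes the language vector is a difference of mean activations between a target language and a reference language on parallel sentences, and that steering is a scaled addition `h + strength * v`. The function names, the `strength` parameter, and the numbers are hypothetical.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length activation vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def language_vector(target_acts, reference_acts):
    """Assumed construction: mean target-language activation minus mean
    reference-language activation, collected at the same layer."""
    mean_target = mean_vector(target_acts)
    mean_reference = mean_vector(reference_acts)
    return [t - r for t, r in zip(mean_target, mean_reference)]


def steer(hidden, vector, strength=1.0):
    """Add the scaled steering vector to one residual-stream state.
    No model parameters are changed."""
    return [h + strength * v for h, v in zip(hidden, vector)]


# Hypothetical 2-dimensional activations for two parallel sentences.
target_acts = [[1.0, 2.0], [3.0, 4.0]]
reference_acts = [[0.0, 1.0], [2.0, 1.0]]

v_lang = language_vector(target_acts, reference_acts)
steered = steer([0.5, 0.5], v_lang, strength=2.0)
```

In a real model the same addition would be applied inside a forward hook at the chosen layer. The layer sensitivity reported in the abstract is a reason to treat both the layer index and `strength` as tuned quantities.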

2605.23067 2026-05-25 cs.CL Version update

What Training Data Teaches RL Memory Agents: An Empirical Study of Curriculum Effects in Memory-Augmented QA

Xinjie He, Zhiyuan Lin, Su Liu, Jialun Wu, Qiyang Xie, Weikai Zhou, Shuai Xiao

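Affiliations * Columbia University; Independent Researcher; Johns Hopkins University; Northeastern University

AI summary This study examines how training data shapes what reinforcement-learning memory agents learn in question answering. Holding the model architecture, algorithm, and hyperparameters fixed and varying only the composition of the training curriculum, it analyzes how different data mixtures affect performance. Experiments show that curriculum composition has a pronounced effect at the level of individual question types: the mixed curriculum performs best overall, while training data from a specific domain can strengthen a specific skill. The study also reports two practical lessons for training in a single-GPU setting, offering guidance for practical use.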

Comments 14 pages, 2 figures, 11 tables. Code, checkpoints, and evaluation artifacts available at https://github.com/EvaxHe/rl-memory-curriculum

Details
Abstract

Reinforcement learning (RL) has emerged as a viable recipe for training LLM agents to reason over external memory banks in multi-session dialogue. Existing work trains exclusively on a single benchmark, leaving open how the composition of training data shapes the skills a memory agent acquires. We present a controlled empirical study that holds architecture, RL algorithm, and all hyperparameters fixed and varies only the training curriculum across three conditions: in-domain (LoCoMo), mixed-benchmark (LoCoMo + LongMemEval), and out-of-domain (LongMemEval only). Across two benchmarks and ten question types, curriculum composition acts as a fine-grained lever on specialization rather than a uniform scaling factor on performance. The mixed curriculum yields the strongest overall F1 on both evaluation sets. Training on a narrow out-of-domain set transfers a targeted skill - temporal reasoning - despite weak aggregate performance. Per-type differences substantially exceed aggregate differences, indicating that single-number benchmark comparisons systematically underreport curriculum effects. We further report two practical lessons from adapting GRPO to a single-GPU regime: cross-benchmark mixing requires filtering format-specific noise from memory banks to preserve training signal, and binary exact-match reward produces no learning signal at the small group sizes (G = 4) required on one GPU, motivating continuous reward functions in this regime.
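
The remark that binary exact-match reward gives no learning signal at group size G = 4 follows from how group-relative advantages are computed. The sketch below assumes the commonly used GRPO normalization, in which each reward is standardized within its group: `(r - mean) / (std + eps)`. The authors' exact variant is not stated in the abstract, and the 5% success rate is a hypothetical figure chosen for illustration.

```python
import statistics


def group_advantages(rewards, eps=1e-6):
    """Standardize rewards within one sampled group (assumed GRPO-style)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def prob_no_signal(p_success, group_size):
    """Probability that a binary-reward group is all 0s or all 1s,
    assuming independent samples with success probability p_success."""
    return p_success ** group_size + (1 - p_success) ** group_size


# All four rollouts miss the exact match: every advantage is zero.
all_fail = group_advantages([0, 0, 0, 0])

# A continuous reward (for example a token-overlap score) still separates them.
graded = group_advantages([0.8, 0.1, 0.0, 0.3])

# With a hypothetical 5% exact-match rate, most groups of 4 carry no signal.
p_dead = prob_no_signal(0.05, 4)
```

When every rollout in a group receives the same reward, the advantages are all zero and the policy gradient for that prompt vanishes. Small groups with a sparse binary reward land in that case most of the time, which is the motivation the abstract gives for continuous rewards.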

2605.23057 2026-05-25 cs.LG cs.CL cs.PF Version update

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

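Affiliations * New York University Abu Dhabi; New York University; United States

AI summary ModeSwitch-LLM is a lightweight request-boundary controller that improves the efficiency of large language model inference on a single GPU by routing each request to an appropriate fixed inference mode. Using cheap workload-level features, it selects among modes such as FP16, quantized modes, and speculative decoding rather than relying on a single static configuration. Experiments show that the controller substantially reduces latency and energy use while maintaining inference quality, and that the rule-based controller holds up better than learned routers on efficiency and resource constraints.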

Comments 10 pages main text, 11 pages including references, 5 figures, 3 tables. Preprint

Details
Abstract

ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving configuration, the system selects among FP16, quantized modes, speculative decoding, and hybrid modes such as GPTQ plus prefix caching and INT8 plus continuous batching using cheap workload-level features. We evaluate ModeSwitch-LLM on Meta-Llama-3.1-8B-Instruct served on a single NVIDIA A100 GPU. On deployment-style synthetic workloads, the online controller achieves a 2.10x mean latency speedup over FP16 and a 0.48x mean energy ratio, corresponding to 51.7% lower energy per token. On automatic benchmarks used as a quality gate, accuracy remains close to FP16 with a mean delta of +0.17 percentage points. We also evaluate lightweight learned routers, but find that they do not clearly outperform the rule-based controller because they add routing overhead and more often select modes that violate quality, energy, or memory constraints. These results show that simple request-aware routing can recover substantial efficiency from existing inference modes without retraining the model or changing its architecture.

2605.23054 2026-05-25 cs.CL cs.AI cs.LG Version update

Model Collapse as Cultural Evolution

Dongxin Guo, Jikun Wu, Siu Ming Yiu

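Affiliations * The University of Hong Kong; Stellaris AI Limited

AI summary This paper studies model collapse, the progressive degradation in output quality of large language models (LLMs) trained on their own outputs. Drawing on iterated learning theory from cultural evolution, the authors derive five testable predictions and verify them in multilingual experiments. They find that compositional structure follows a non-monotonic trajectory under unfiltered self-training, a signature that is sustained only by task-grounded filtering. The study offers a linguistic account of model collapse and concrete principles for designing self-training pipelines.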

Comments Accepted at CoNLL 2026. 18 pages, 3 figures, 2 tables

Details
Abstract

Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguistic explanation for which structures degrade, in what order, and why. We show that iterated learning theory from cultural evolution fills this gap. We derive five falsifiable predictions, distinguish those uniquely discriminative for the theory from confirmatory ones, and test them by self-training LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish. The critical discriminative finding: compositionality follows a non-monotonic trajectory (initially rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale evidence for the compression-communication tradeoff. All predictions are confirmed with large effect sizes (Hedges' $g > 1.6$; $\mathrm{BF}_{10} > 100$), and LLM regularization gradients closely match human behavioral data ($R^2 = 0.94$). These results reframe model collapse as a cultural transmission phenomenon and yield concrete principles for self-training pipeline design.

2605.23052 2026-05-25 cs.CL cs.AI Version update

DreamerNLplus: Interpretable Modeling of Mental Health Dynamics from Social Media Timelines using Hybrid Rule-Based and RAG Methods

Maryia Zhyrko, Daisy Monika Lal, Erik van Mulligen, Lifeng Han

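Affiliations * Leiden Institute of Advanced Computer Science (LIACS), Leiden University; School of Computing and Communications (SCC), Lancaster University; Department of Medical Informatics, Erasmus University Medical Center Rotterdam; Biomedical Data Sciences, Leiden University Medical Center

AI summary This paper presents DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines, developed for the CLPsych 2026 shared task. The approach combines rule-based and retrieval-augmented generation (RAG) techniques across psychological state modeling, temporal change detection, and sequence-level summarization, and it ranks highly on several subtasks. The study identifies key challenges in modeling mental health dynamics, such as the mismatch between classification and regression performance and the difficulty of modeling temporal transitions, pointing to directions for future research.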

Comments Accepted by CLPsych2026. CLPsych 2026 will be held at ACL in San Diego July 4th, 2026

Details
Abstract

We present DreamerNLplus, a hybrid framework for modeling mental health dynamics from social media timelines in the CLPsych 2026 shared task. Our system addresses three tasks: psychological state modeling, temporal change detection, and sequence-level summarization. For Task 1, we combine LLM-based data augmentation, DeBERTa classification, and Random Forest regression for structured state prediction. For Task 2, we use few-shot prompting with a locally deployed Llama 3.1 model to detect Switch and Escalation events using short-term temporal context. For Task 3.1, we explore both a deterministic rule-based summarization pipeline and a few-shot LLM-based approach, ranking 2nd officially. Our RAG-based method achieves strong performance in Task 3.2, ranking 1st for Improvement and 3rd for Deterioration, demonstrating its ability to capture recurrent psychological change patterns across timelines. Our analysis reveals key challenges, including the mismatch between classification and regression performance, the difficulty of modeling temporal transitions, and the disagreement between semantic and similarity-based evaluation metrics. These findings highlight the complexity of modeling mental health dynamics and motivate future work on unified evaluation frameworks. We share our code and prompts at https://github.com/4dpicture/CLPsych2026

2605.23043 2026-05-25 cs.CL stat.ML Version update

HawkesLLM: Semantic Uncertainty Propagation in Agentic Text Simulation

Zewei Deng, Tinghan Ye, Liyan Xie

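Affiliations * Department of Industrial and Systems Engineering, University of Minnesota; H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary This paper proposes HawkesLLM, a framework that addresses how semantic uncertainty accumulates over time in agentic text-simulation systems. The method separates temporal influence modeling from text generation: a multivariate Hawkes process models activation among nodes, and a language model generates new content from the compact memory selected by the temporal model. In a GDELT news-cascade case study, HawkesLLM improves late-stage semantic alignment under a limited prompt-memory budget.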

Comments 10 pages, 4 figures, Accepted at the ICML 2026 Workshop on Statistical Frameworks for Uncertainty in Agentic Systems

Details
Abstract

Agentic text-simulation systems write in sequence, with each item becoming possible context for later steps. That makes uncertainty path-dependent: an early ambiguity can affect later outputs. This paper studies this problem with HawkesLLM, a framework that separates temporal influence modeling from text generation. We represent the cascade as a network whose nodes are text-generating agents. A multivariate Hawkes process models how these nodes activate over time and which earlier node outputs should influence later prompts. A language model then writes each new event from the compact memory selected by this temporal model. We evaluate the framework on a held-out Global Database of Events, Language, and Tone (GDELT) news-cascade case study. The diagnostics track semantic alignment with local held-out references and separate local drift from global drift. In this setting, HawkesLLM improves late-stage semantic alignment under a compact prompt-memory budget.
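
The temporal side of this framework can be sketched as a multivariate Hawkes intensity. The code assumes the standard exponential kernel, $\lambda_i(t) = \mu_i + \sum_{t_k < t} \alpha_{i, j_k} e^{-\beta (t - t_k)}$, where $j_k$ is the node that produced event $k$. The abstract does not state which kernel the paper uses, and the ranking rule in `top_influencers` is a hypothetical stand-in for selecting a compact prompt memory.

```python
import math


def intensity(node, t, events, mu, alpha, beta):
    """Hawkes intensity of `node` at time t (exponential kernel assumed).

    events: list of (time, source_node) pairs
    mu[i]: base rate of node i
    alpha[i][j]: how strongly an event at node j excites node i
    beta: decay rate of the excitation
    """
    rate = mu[node]
    for t_k, j in events:
        if t_k < t:  # only strictly earlier events contribute
            rate += alpha[node][j] * math.exp(-beta * (t - t_k))
    return rate


def top_influencers(node, t, events, alpha, beta, k):
    """Hypothetical memory selection: keep the k earlier events that
    currently contribute most to the intensity of `node`."""
    scored = [
        (alpha[node][j] * math.exp(-beta * (t - t_k)), t_k, j)
        for t_k, j in events
        if t_k < t
    ]
    scored.sort(reverse=True)
    return [(t_k, j) for _, t_k, j in scored[:k]]


# Two agents; node 1 wrote at t=0.0 and node 0 wrote at t=1.0.
mu = [0.1, 0.2]
alpha = [[0.0, 0.5], [0.3, 0.0]]
events = [(0.0, 1), (1.0, 0)]

rate_now = intensity(0, 2.0, events, mu, alpha, beta=1.0)
memory = top_influencers(0, 2.0, events, alpha, beta=1.0, k=1)
```

Here `memory` would name the earlier outputs to place in the next prompt, so the language model only sees the items the temporal model considers influential.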

2605.23039 2026-05-25 cs.CL cs.AI cs.LG Version update

Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs

Dongxin Guo, Jikun Wu, Siu Ming Yiu

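Affiliations * The University of Hong Kong; Stellaris AI Limited

AI summary This study investigates how language models acquire knowledge of what is unacceptable through distributional competition, and argues that statistical preemption is the key mechanism. Across four experiments, it finds that model surprisal on unconventional constructions correlates strongly with human acceptability judgments, and that the pattern is driven by the frequency of the competing form rather than overall verb frequency. The study also shows that preemption sensitivity scales as a power law with model size, and a controlled fine-tuning experiment establishes the causal effect of competing-form frequency on preemption behavior, lending computational support to Construction Grammar.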

Comments Accepted at CoNLL 2026. 21 pages (9 main body + appendices and references); 4 figures, 14 tables

Details
Abstract

How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*donated the library the books"). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb-construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ($r = 0.79$), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.
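
Surprisal, the quantity compared against human acceptability judgments here, is the negative log probability a model assigns to a string. The sketch below uses the standard definition in bits. The two probabilities are made-up values for the conventional and unattested forms of "donate", not numbers from the paper, and `surprisal_gap` is a hypothetical helper name.

```python
import math


def surprisal_bits(probability):
    """Surprisal in bits: -log2 of the probability the model assigns."""
    if not 0.0 < probability <= 1.0:
        raise ValueError("probability must be in (0, 1]")
    return -math.log2(probability)


def surprisal_gap(p_unattested, p_conventional):
    """How many more bits of surprisal the unattested form carries
    than its conventional competitor."""
    return surprisal_bits(p_unattested) - surprisal_bits(p_conventional)


# Hypothetical model probabilities for the two competing constructions:
# "donated the books to the library" vs "*donated the library the books".
p_conventional = 0.5
p_unattested = 0.125

gap = surprisal_gap(p_unattested, p_conventional)
```

A positive gap means the model finds the unattested form less expected than the conventional one. The study's question is whether that gap tracks the frequency of the competing form rather than how often the verb occurs overall.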

2605.23036 2026-05-25 cs.CL Version update

Multilingual Steering by Design: Multilingual Sparse Autoencoders and Principled Layer Selection

Yusser Al Ghussin, Daniil Gurgurov, Tanja Baeumel, Josef van Genabith, Patrick Schramowski, Simon Ostermann

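Affiliations * Saarland University; German Research Center for Artificial Intelligence (DFKI); TU Darmstadt; Centre for European Research in Trusted AI (CERTAIN)

AI summary This work addresses the unreliability of sparse autoencoder (SAE)-based language control in multilingual large language models and proposes a by-design approach to multilingual steering. Training SAEs on multilingual data strengthens cross-lingual representations, and an a priori layer-selection rule based on the intersection of multilingual alignment and language separability predicts effective intervention depths without a layer-by-layer search. Experiments on translation and cross-lingual summarization show an improved balance between language identification accuracy and generation quality, giving a principled and predictive account of multilingual SAE steering.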

Comments Accepted to TrustNLP Workshop at ACL 2026

Details
Abstract

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.

2605.23035 2026-05-25 cs.CL cs.AI q-bio.NC Version update

Sparse Autoencoders Map Brain-LLM Alignment onto Cortical Semantic Topography

Dongxin Guo, Jikun Wu, Siu Ming Yiu

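Affiliations * The University of Hong Kong; Stellaris AI Limited

AI summary This study examines the correspondence between the intermediate layers of large language models (LLMs) and human brain responses to language, and uses sparse autoencoders (SAEs) to explain it mechanistically. Combining SAEs with neural encoding models, the authors decompose GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer and show that semantic features dominate in predicting brain encoding performance. They further show that SAE-derived semantic features reproduce the semantic topography of the cortex and generalize across multiple languages.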

Comments Accepted at CoNLL 2026. 20 pages (9 main + 1 limitations/acknowledgments + 3 references + 7 appendix), 5 figures, 20 tables

Details
Abstract

Intermediate layers of large language models (LLMs) best predict human brain responses to language, one of the most robust findings in computational neurolinguistics, yet why remains mechanistically unexplained. We address this gap by bridging sparse autoencoders (SAEs) from mechanistic interpretability with neural encoding models, decomposing GPT-2 XL and Llama-3.1-8B into 16K-32K interpretable features per layer. A human-validated taxonomy ($\kappa \geq 0.74$) reveals that semantic features alone recover 94% of peak encoding performance ($r=0.285$), substantially exceeding variance-matched baselines ($p<0.001$, $d=1.31$). Beyond this aggregate dominance, we test a novel cortical topography prediction: five semantic subcategories derived a priori from three independent neuroscience programs should map onto distinct brain regions. A formal convergence test confirms this alignment (Spearman $\rho=0.72$, $p<0.001$; hypergeometric $p=0.007$), demonstrating that SAE-discovered features recapitulate known cortical semantic organization at a granularity inaccessible to prior methods. SAE features further predict human reading times beyond lexical controls ($\Delta\mathrm{logLik}=38.4$, $p<0.001$), and an exploratory prediction-error analysis provides preliminary evidence that the brain additionally encodes unexpected semantic content. Results generalize across English, Chinese, and French.

2605.23032 2026-05-25 cs.CL cs.AI q-bio.NC Version update

Brain-LLM Alignment Tracks Training Data, Not Typology

Dongxin Guo, Jikun Wu, Siu Ming Yiu

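Affiliations * The University of Hong Kong; Stellaris AI Limited

AI summary This study asks whether brain-LLM alignment patterns generalize across languages and finds that they are determined mainly by the dominant language of a model's training data rather than by properties of English itself. Comparing fMRI data in several languages against LLMs with different dominant training languages, it finds that a Chinese-dominant model aligns best with Chinese brains and worst with English brains. Typological distance, differing gradients in syntax-associated brain regions, and tokenization granularity also affect alignment significantly, indicating that the previously observed "English advantage" stems mainly from training data composition rather than from language structure itself.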

Comments Accepted to CoNLL 2026. 9 pages main content + 4 pages references + 6 pages appendix; 4 figures, 13 tables

Details
Abstract

Brain-LLM alignment is well established in English, yet the brain's language network is neuroanatomically universal across languages. Does alignment also generalize cross-linguistically, and what governs the variation? We test this using fMRI data from 112 participants across English, Chinese, and French (the Le Petit Prince corpus) and seven LLMs spanning English-dominant, Chinese-dominant, and multilingual architectures. Our central finding is that training-language dominance, not an inherent property of English, drives the alignment pattern: a Chinese-dominant model (Baichuan2-7B), architecture-matched to LLaMA-2-7B, reverses the gradient entirely, aligning best with Chinese brains and worst with English. Beyond training dominance, formal typological distance independently covaries with alignment degradation, syntax-associated brain regions (IFG) show $2.3\times$ steeper typological gradients than lexico-semantic regions (PTL), and tokenization fertility accounts for $\sim$60% of a cross-linguistic shift in optimal encoding layer. These results reveal that the apparent "English advantage" in brain-LLM alignment is an artifact of training data composition, while the remaining variation reflects genuine typological structure concentrated in syntactic processing.

2605.23028 2026-05-25 cs.LG cs.CL cs.CV 版本更新

RADAR: Relative Angular Divergence Across Representations

RADAR: 表示间的相对角度散度

Xavier Cadet, Mateusz Nowak, Peter Chin

发表机构 * Dartmouth College(达特茅斯学院)

AI总结 本文提出了一种名为 RADAR 的度量方法,用于评估基础模型在跨领域任务中的迁移能力。该方法基于几何原理,通过分析模型各层表示的角对齐和层间位移轨迹上的距离变化,比较域内与跨域动态的分布差异,从而估计领域间迁移的可行性。实验表明,RADAR 在多个模态任务中表现出色,尤其在领域过渡平滑或明确的情况下具有更强的预测能力,且其效果依赖于模型内部表示空间的几何结构。

Comments 27 pages; 8 figures; 10 tables

详情
AI中文摘要

机器学习方法依赖于数据。然而,由于可用性限制、成本或需要领域专业知识,收集合适的数据可能具有挑战性。用额外来源扩展数据集是对有限数据的常见回应,但这种做法并不总能提高下游性能,有时甚至会导致性能下降,即负迁移。我们提出RADAR,一种简单、基于几何的度量,用于估计基础模型中的跨域迁移性。RADAR通过测量沿层间位移轨迹的角度对齐和距离的相对变化,并比较域内和跨域动态的经验分布,来分析表示的逐层演化。我们假设域迁移性与这些轨迹分布之间的散度有关。我们在多种模态上评估该度量,包括使用文本嵌入模型的跨语言情感分类和使用基础视觉模型的跨域图像分类。在多种设置下,RADAR在几个视觉和文本基准上相对于现有迁移性度量提供了有竞争力的预测性能,特别是在域过渡平滑或清晰分离时。我们的消融实验进一步表明,迁移性估计的有效性取决于模型内部表示空间的几何结构,不同模态偏好不同的拓扑形式。

英文摘要

Machine learning methods rely on data. However, gathering suitable data can be challenging due to availability constraints, cost, or the need for domain expertise. Expanding datasets with additional sources is a common response to limited data, yet this practice does not always improve downstream performance and can sometimes lead to a loss of performance, known as negative transfer. We propose RADAR, a simple, geometrically grounded metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-wise evolution of representations by measuring angular alignments and relative changes in distance along layer-to-layer displacement trajectories, and by comparing empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric across multiple modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across several settings, RADAR provides competitive predictive performance relative to existing transferability metrics on several vision and text benchmarks, with particularly strong results when domain transitions are smooth or cleanly separated. Our ablations further suggest that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space, with different modalities favoring different topological formulations.
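The abstract describes RADAR only at a high level, so the following is a minimal sketch of the ingredients it names, not the authors' formula. It assumes a per-input array of layer representations, takes the cosine between consecutive layer-to-layer displacements as the "angular alignment", the relative change in displacement norm as the "relative change in distance", and a histogram-based Jensen-Shannon divergence as the distribution comparison. All function names and the choice of divergence are assumptions.

```python
import numpy as np


def trajectory_features(layers: np.ndarray) -> np.ndarray:
    """layers: (L, d) representations of one input at each of L layers.

    Returns an (L - 2, 2) array whose columns are
    [cosine between consecutive displacements, relative change in displacement norm].
    """
    disp = np.diff(layers, axis=0)          # (L - 1, d) layer-to-layer displacements
    norms = np.linalg.norm(disp, axis=1)    # (L - 1,)
    eps = 1e-12                             # avoids division by zero for static layers
    cos = np.sum(disp[:-1] * disp[1:], axis=1) / (norms[:-1] * norms[1:] + eps)
    rel = (norms[1:] - norms[:-1]) / (norms[:-1] + eps)
    return np.stack([cos, rel], axis=1)


def js_divergence(a, b, bins=10, value_range=(-1.0, 1.0)) -> float:
    """Jensen-Shannon divergence (base 2) between histograms of two samples."""
    p, _ = np.histogram(a, bins=bins, range=value_range)
    q, _ = np.histogram(b, bins=bins, range=value_range)
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(x, y):
        mask = x > 0
        return float(np.sum(x[mask] * np.log2(x[mask] / y[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)
```

In use, one would pool the cosine column over within-domain inputs and over cross-domain inputs, then compare the two pools with `js_divergence`; a larger value would hint at harder transfer under the abstract's hypothesis.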

2605.23024 2026-05-25 cs.AI cs.CC cs.CL cs.LG 版本更新

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

确定性视界:作为可信AI系统设计规范的不可行性结果

Dongxin Guo

AI总结 本文探讨了可信人工智能系统设计中由计算理论根本限制所带来的边界问题,提出将不可行性定理转化为系统设计规则的新方法。其核心结果证明了一个仅由架构决定的准确率上限:超过临界推理深度(即“确定性视界”)后,无论训练数据量、适配器秩或损失函数如何,训练都无法提高这一上限;该视界可在部署前由模型层数和嵌入宽度计算得到。研究还将同一论证应用于多个AI子领域,形成一套包含十六项设计规范的目录,为构建更可靠的人工智能系统提供了理论依据和设计指导。

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

详情
AI中文摘要

大型语言模型现在编写软件、起草法律文件并生成临床笔记,但从图灵、阿罗到没有免费午餐定理的基本极限,限定了计算所能做到的事情。本文将这些不可行性结果从奇闻转化为设计规则。其旗舰结果证明了仅由架构设定的准确率上限:超过关键推理深度后,无论适配器秩、样本大小或损失函数如何,训练都无法改变它。该确定性视界在部署前可从层数和嵌入宽度计算,在十二种Transformer架构中测量值介于19到31之间,而在最优长度轨迹上微调可恢复不到4个百分点。其机制是残差流的容量不变性,信息论转换得出超过视界后准确率超指数衰减。一个针对模幂的无条件电路复杂度下界(对抗常数深度素数模电路)补充了这一结果。同样的论证重新应用于多个子领域:任何错误指定模型下的偏好学习在样本复杂度上出现不连续跳跃;多阶段检索流水线至少需要与阶段数一样多的独立指标;标准的真实报价(truthful)拍卖对于具有提示相关估值的智能体失效;神经推理的零知识验证为每个非线性激活支付110到190倍的实测开销。这些共同构成了一个包含16条规范的目录,每条规范配对一个可计算边界、一个量化违反成本和一个建设性设计规则:两个组合已被证明,一个配对是诚实障碍,四个保持开放。本文为可信AI可能需要的生成式研究计划提供了不可行性规范方法论。AI的每一个基本极限也是一个设计规则。

英文摘要

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.22993 2026-05-25 cs.CL cs.AI 版本更新

A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism

一种主动式多智能体对话框架用于评估自闭症中的社交语言障碍特征

Chuanbo Hu, Minglei Yin, Bin Liu, Wenqi Li, Lynn K. Paul, Shuo Wang, Xin Li

发表机构 * Department of Computer Science(计算机科学系) University at Albany(奥尔巴尼大学) Department of Management Information System(管理信息系统系) West Virginia University(西弗吉尼亚大学) Department of Radiology(放射学系) Washington University in St. Louis(圣路易斯华盛顿大学) Humanities and Social Sciences(人文与社会科学)

AI总结 该研究提出了一种名为TPA的主动式多智能体对话框架,用于评估自闭症谱系障碍中的社交语言障碍(SLD)特征。该框架通过医生智能体主动选择针对性的提问策略,以系统性地揭示患者对话中潜在的语言障碍特征,从而提高诊断效率。实验表明,TPA在所有主要指标上优于六个基线方法,SLD特征覆盖率达到82.1%,每轮诊断效率也显著更高,为AI辅助临床筛查提供了重要支持。

详情
AI中文摘要

与自闭症谱系障碍中社交语言障碍(SLD)相关的特征性语言行为,包括回声性重复、代词位移和刻板媒体引用,在自发对话中基本不存在,仅在特定对话条件下出现。在结构化临床评估中,这种潜伏性意味着提问策略选择是决定对话产生多少诊断信息的关键但未被充分重视的因素。大型语言模型(LLMs)能否被引导主动选择系统地揭示这些潜在特征的提问策略,在很大程度上仍未探索。本文提出TPA(思考、计划、询问),一种应用于自闭症诊断观察量表模块4(ADOS-2)语言评估部分的主动式多智能体对话框架,其中医生智能体在选择有临床依据的策略并生成针对性问题之前,明确推理哪些特征尚未观察到。基于真实ADOS-2临床数据的患者智能体使得无需真实患者参与即可进行可重复评估,并通过三个独立实验验证,确认其对真实患者语言具有足够的保真度。在来自35名患者的484个片段上评估,TPA在所有主要指标上优于六个竞争性对话规划基线,实现了82.1%的SLD特征覆盖率,比训练有素的临床医生进行的真实临床对话自动回放(65.5%)高16.6个百分点,并且每轮诊断效率显著更高(AUCC:0.628 vs. 0.458,绝对增益+0.170)。这些结果表明,主动提问策略选择显著提高了自动化SLD特征评估的效率,对可扩展的AI辅助临床筛查具有直接意义。

英文摘要

Characteristic linguistic behaviors associated with Social Language Disorder (SLD) in autism spectrum disorder, including echoic repetition, pronoun displacement, and stereotyped media quoting, are largely absent from spontaneous conversation and only emerge under specific conversational conditions. In structured clinical assessments, this latency means that questioning strategy selection is a critical yet underappreciated determinant of how much diagnostic information a conversation yields. Whether large language models (LLMs) can be guided to proactively select questioning strategies that systematically surface these latent traits remains largely unexplored. Here we present TPA (Think, Plan, Ask), a proactive multi-agent dialogue framework applied to the language assessment component of the Autism Diagnostic Observation Schedule Module 4 (ADOS-2), in which a doctor agent explicitly reasons about which traits remain unobserved before selecting a clinically grounded strategy and generating a targeted question. A patient agent grounded in real ADOS-2 clinical data enables reproducible evaluation without real patient participation, validated across three independent experiments confirming adequate fidelity to real patient language. Evaluated on 484 episodes from 35 patients, TPA outperforms six competitive dialogue planning baselines across all primary metrics, achieving 82.1% SLD trait coverage, 16.6 percentage points higher than automated replay of real clinical dialogues conducted by trained clinicians (65.5%), with substantially greater per-turn diagnostic efficiency (AUCC: 0.628 vs. 0.458, absolute gain +0.170). These results demonstrate that proactive questioning strategy selection substantially improves the efficiency of automated SLD trait assessment, with direct implications for scalable AI-assisted clinical screening.
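The abstract reports trait coverage and a per-turn efficiency score (AUCC) without defining either. The sketch below shows one plausible reading, offered only as an assumption: coverage is the fraction of target traits surfaced so far, and the efficiency score is the mean height of the cumulative coverage curve over turns. The trait names and both function names are illustrative.

```python
from typing import Iterable, List, Sequence


def coverage_curve(observed_per_turn: Sequence[Iterable[str]],
                   all_traits: Sequence[str]) -> List[float]:
    """Cumulative fraction of target traits observed after each dialogue turn."""
    targets = set(all_traits)
    seen = set()
    curve = []
    for traits in observed_per_turn:
        seen |= set(traits) & targets       # ignore anything outside the target list
        curve.append(len(seen) / len(targets))
    return curve


def normalized_area(curve: Sequence[float]) -> float:
    """Mean height of the coverage curve; 1.0 means every trait appeared on turn one."""
    return sum(curve) / len(curve)


if __name__ == "__main__":
    traits = ["echoic_repetition", "pronoun_displacement", "media_quoting", "other"]
    turns = [["echoic_repetition"], [], ["pronoun_displacement", "media_quoting"], []]
    curve = coverage_curve(turns, traits)
    print(curve, normalized_area(curve))
```

Under this reading, two dialogues with the same final coverage can differ in area: the one that surfaces traits earlier scores higher, which matches the abstract's emphasis on per-turn efficiency.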

2605.22981 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Memorization Dynamics of Fill-in-the-Middle Pretraining

Fill-in-the-Middle 预训练的记忆动态

Tobias von Arx, Tanguy Dieudonné

发表机构 * Department of Computer Science, ETH Zurich(苏黎世联邦理工学院计算机科学系)

AI总结 本文研究了“填中”(FIM)预训练目标对语言模型逐字记忆能力的影响。通过在包含重复内容的语料库上训练匹配的Llama 3.2模型,发现FIM更倾向于恢复短或部分匹配的文本片段,而传统的从左到右(LTR)方法则更常对长段精确续写赋予高置信度。实验还表明,FIM训练下的逐字记忆能力随重复次数近似线性增长,并且后缀上下文不足以支持准确回忆,前缀上下文在其中起关键作用。研究强调了单一评估方式可能忽略记忆行为的复杂性。

Comments MemFM @ ICML 2026

详情
AI中文摘要

Fill-in-the-Middle (FIM) 是一种广泛用于赋予因果语言模型填充能力的预训练目标,但其对逐字记忆的影响尚未充分探索。我们在受控设置中研究 FIM 的记忆动态,通过在包含重复 Gutenberg 摘录的 FineWeb-Gutenberg 语料库上,使用 FIM 和标准从左到右 (LTR) 目标预训练匹配的 Llama 3.2 模型。基于前缀的探测表明,FIM 更常恢复短片段或部分匹配的跨度,而 LTR 更常对长精确延续赋予高置信度。我们观察到,在测试范围内,FIM 训练下的逐字提取随重复次数近似线性增长。评估原生 FIM 格式的探测显示,后缀上下文并不足够:FIM 训练下的逐字回忆仍然强烈锚定于前缀上下文。我们的结果还表明,仅评估一种跨度长度或探测格式可能会遗漏记忆行为中的重要细微差别。

英文摘要

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.
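For readers unfamiliar with the objective, the sketch below shows how a token sequence is typically rearranged for FIM training in the common prefix-suffix-middle order. The abstract does not state which variant or sentinel tokens the authors use, so the `<PRE>`, `<SUF>`, `<MID>` strings, the `fim_rate` parameter, and the uniform split are all assumptions.

```python
import random
from typing import List

# Hypothetical sentinel strings; real tokenizers reserve dedicated special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_fim(tokens: List[str], rng: random.Random, fim_rate: float = 0.5) -> List[str]:
    """Rearrange a document into prefix-suffix-middle order with probability fim_rate.

    Otherwise the document is returned unchanged, i.e. as a left-to-right example.
    """
    if len(tokens) < 3 or rng.random() >= fim_rate:
        return list(tokens)
    # Two distinct cut points in 1..len-1 keep prefix, middle and suffix non-empty.
    i, j = sorted(rng.sample(range(1, len(tokens)), 2))
    prefix, middle, suffix = tokens[:i], tokens[i:j], tokens[j:]
    return [PRE, *prefix, SUF, *suffix, MID, *middle]


if __name__ == "__main__":
    print(to_fim(list("abcdef"), random.Random(0), fim_rate=1.0))
```

The model is still trained with ordinary next-token prediction on the rearranged sequence, so the middle span is generated with both prefix and suffix in its context. That is why the abstract's finding, that recall stays anchored in the prefix, is not obvious a priori.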

2605.22971 2026-05-25 cs.CL cs.HC 版本更新

Can AI Guess What You Know? Performance Comparison of Large Language Models for Human Domain Knowledge Estimation From Communication Logs

AI 能猜出你知道什么吗?从通信日志中估计人类领域知识的大型语言模型性能比较

Ko Watanabe, Shoya Ishimaru

发表机构 * Osaka Metropolitan University(大阪公立大学)

AI总结 本文研究了大型语言模型(LLMs)是否能通过分析员工的Slack通信日志来估计其领域知识水平。通过对43名用户共27,188条消息的分析,比较了包括Gemini、Claude和GPT系列在内的七种模型在零样本设置下的估计性能,发现Gemini 2.5 Flash表现最佳,误差最低。研究还表明,估计准确性与消息数量的相关性较弱,强调了自动化专家映射的可行性及当前限制,并指出隐私保护和更丰富的知识表示形式的重要性。

详情
AI中文摘要

员工常常难以识别“谁知道什么”,导致组织生产力损失。我们研究大型语言模型(LLMs)是否能够直接从长期 Slack 日志中推断个人领域知识。通过分析来自 43 名用户的 27,188 条消息,我们评估了七个模型(包括 Gemini、Claude 和 GPT 系列),将其零样本估计与 27 名参与者的自我报告技能评分进行比较。Gemini 2.5 Flash 实现了最低误差(MAE 21.13%),而 GPT 模型显示出显著更大的差异。值得注意的是,估计精度仅弱依赖于消息量,表明更多的文本本身并不能保证更好的推断。这些发现证明了自动专业知识映射的可行性和当前局限性,强调了需要保护隐私的部署以及更丰富、结构感知的人类知识表示。

英文摘要

Employees often struggle to identify "who knows what," leading to organizational productivity losses. We investigate whether Large Language Models (LLMs) can infer individual domain knowledge directly from long-term Slack logs. Analyzing 27,188 messages from 43 users, we evaluated seven models (including Gemini, Claude, and GPT families) by comparing their zero-shot estimates against self-reported skill ratings from 27 participants. Gemini 2.5 Flash achieved the lowest error (MAE 21.13%), while GPT models showed significantly larger discrepancies. Notably, estimation accuracy depended only weakly on message volume, indicating that more text alone does not guarantee better inference. These findings demonstrate the feasibility and current limits of automated expertise mapping, highlighting the need for privacy-preserving deployments and richer, structure-aware representations of human knowledge.
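The headline number is a mean absolute error between model estimates and self-reported ratings. A minimal sketch of that comparison follows, assuming both are expressed on the same 0 to 100 scale (the abstract reports MAE as a percentage but does not give the rating scale); the example values are made up for illustration.

```python
from typing import Sequence


def mean_absolute_error(estimates: Sequence[float], self_reports: Sequence[float]) -> float:
    """Average absolute gap between model estimates and self-reported ratings."""
    if len(estimates) != len(self_reports) or len(estimates) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(e - s) for e, s in zip(estimates, self_reports)) / len(estimates)


if __name__ == "__main__":
    # Illustrative values only, not data from the study.
    llm_estimates = [50.0, 80.0, 20.0]
    self_ratings = [60.0, 60.0, 40.0]
    print(mean_absolute_error(llm_estimates, self_ratings))
```

Because self-reports are themselves noisy, an MAE of this kind measures agreement with self-perception rather than with ground-truth expertise.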

2605.22963 2026-05-25 cs.CL cs.AI 版本更新

Graph Alignment Topology as an Inductive Bias for Grounding Detection

图对齐拓扑作为事实依据(grounding)检测的归纳偏置

Paul Landes, Pranav Herur, Adam Cross, Jimeng Sun

发表机构 * Department of Pediatrics, University of Illinois College of Medicine Peoria(伊利诺伊大学皮奥里亚医学院儿科部) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Siebel School of Computing and Data Science, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校计算机与数据科学学院) Carle Illinois College of Medicine, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校卡莱医学院)

AI总结 本文研究了如何利用图对齐拓扑作为归纳偏置,以提升大语言模型(LLM)生成内容的事实准确性。作者构建了参考信息与模型输出之间的二分图,并通过图神经网络建模对齐结构,从而直接学习对齐拓扑特征。该方法在多个幻觉检测和问答数据集上取得了优于现有方法及基础LLM(如GPT-4o)的最先进结果,为提升模型输出的可解释性和事实可靠性提供了新思路。

详情
AI中文摘要

大型语言模型(LLM)被优化以产生分布上合理的延续,而不是明确验证生成的命题是否为源文档所蕴含。这种归纳偏置使得泛化成为可能,但它并不编码响应是否以参考内容为依据(grounded)。这些问题限制了LLM在严格事实正确性至关重要的领域(如临床决策支持)中的使用。现有的幻觉检测方法通过检索增强、自一致性或声明验证来提高事实性,但通常不直接学习对齐拓扑。为了利用对齐拓扑作为归纳偏置,我们在参考信息和LLM输出之间构建对齐二分图,并训练图神经网络(GNN)通过消息传递来建模对齐结构。该方法在四个不同的幻觉和问答数据集上取得了最先进的结果,优于所有比较的方法,包括GPT-4o等基础LLM。

英文摘要

Large Language Models (LLMs) are optimized to produce distributionally plausible continuations rather than to explicitly verify whether generated propositions are entailed by source documents. This inductive bias enables generalization, but it does not encode whether responses are grounded with respect to a reference. These issues limit the use of LLMs in domains where strict factual correctness is crucial, such as clinical decision support. Existing hallucination detection approaches improve factuality through retrieval augmentation, self-consistency, or claim verification, but generally do not learn directly over alignment topology. To leverage alignment topology as an inductive bias, we construct aligned bipartite graphs between reference information and LLM outputs and train a graph neural network (GNN) to model alignment structure using message passing. The method achieves state-of-the-art results on four diverse hallucination and question-answering datasets, outperforming all compared methods, including foundational LLMs such as GPT-4o.
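To make the idea of an aligned bipartite graph with message passing concrete, here is a minimal NumPy sketch. It assumes nodes are embedded text units, edges are drawn by thresholding cosine similarity, and one round of mean aggregation passes reference features to output nodes. The abstract does not specify how the authors build edges or which GNN they train, so every name and choice here is an assumption.

```python
import numpy as np


def alignment_adjacency(ref_emb: np.ndarray, out_emb: np.ndarray,
                        threshold: float = 0.5) -> np.ndarray:
    """Binary (n_ref, n_out) adjacency: edge when cosine similarity >= threshold."""
    ref = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    out = out_emb / np.linalg.norm(out_emb, axis=1, keepdims=True)
    return (ref @ out.T >= threshold).astype(float)


def message_pass(adj: np.ndarray, ref_feat: np.ndarray, out_feat: np.ndarray) -> np.ndarray:
    """One mean-aggregation round from reference nodes to output nodes.

    Each output node's own features are concatenated with the mean of its
    aligned reference features; an output node with no edges receives zeros.
    """
    degree = adj.sum(axis=0)[:, None]        # (n_out, 1) number of aligned references
    aggregated = adj.T @ ref_feat            # (n_out, d) summed reference features
    mean_msg = aggregated / np.maximum(degree, 1.0)
    return np.concatenate([out_feat, mean_msg], axis=1)
```

An output node that receives an all-zero message has no aligned reference, which is the kind of structural cue a trained classifier could pick up as a sign of an ungrounded statement.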

2605.22635 2026-05-25 cs.LG cs.CL cs.CV 版本更新

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

多任务放射学报告生成中的双重困境:梯度动力学分析与解决方案

Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) Xinjiang University(新疆大学) Information Security Engineering Technology Research Center(信息安全工程技术研究中心)

AI总结 在多任务医学影像报告生成中,现有的线性标量化策略难以有效平衡临床监督的严格约束与报告生成的平滑性需求。本文从梯度动力学角度分析了这一问题,揭示其本质是漂移项偏差与扩散项衰减的“双重困境”,并提出了一种与模型无关的优化器CAME-Grad,通过冲突规避方向校正和幅度增强能量注入,实现了几何有效性与局部最优解的规避,实验表明该方法在多个任务中均能显著提升临床效果。

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管基于多任务学习的自动放射学报告生成(RRG)被广泛采用以确保临床一致性,但大多数研究集中在架构设计上,仍局限于粗糙的线性标量化策略。这些策略无法有效平衡判别性临床监督的硬约束与报告生成的平滑性要求。为了解决这些问题,我们从梯度动力学的角度分析了线性标量化的失败机制,利用随机微分方程(SDE)框架将其表征为漂移项偏差和扩散项衰减的“双重困境”。基于此,我们提出了一种与骨干网络无关的优化器,名为冲突规避幅度增强梯度下降(CAME-Grad)。通过冲突规避的方向修正和幅度增强的能量注入,该算法不仅保证了几何有效性,还避免了局部最优解。然后,自适应梯度融合机制用于建立理论最优方向与任务特定归纳偏差之间的动态平衡。实验表明,作为一种通用的即插即用优化器,CAME-Grad在八种不同的RRG方法上带来了显著且一致的改进,在MIMIC-CXR上平均提升整体临床效能2.3%,在IU X-Ray上提升1.9%。我们的代码可在https://github.com/vpsg-research/CAME-Grad获取。

英文摘要

While multi-task learning-based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most studies focus on architectural design and remain limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity but also avoids local optima. An adaptive gradient fusion mechanism then establishes a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.
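The abstract contrasts linear scalarization with a conflict-averse correction of the update direction. The sketch below is not CAME-Grad; it swaps in the simpler, well-known PCGrad-style projection to illustrate what "removing the conflicting component" of a task gradient means. Function names are illustrative.

```python
import numpy as np


def project_conflict(g_task: np.ndarray, g_other: np.ndarray) -> np.ndarray:
    """Drop the component of g_task that opposes g_other (PCGrad-style projection)."""
    dot = float(g_task @ g_other)
    if dot >= 0.0:                      # no conflict: leave the gradient untouched
        return g_task.copy()
    return g_task - dot / float(g_other @ g_other) * g_other


def combine(g_a: np.ndarray, g_b: np.ndarray) -> np.ndarray:
    """Sum of the two task gradients after mutual conflict removal."""
    return project_conflict(g_a, g_b) + project_conflict(g_b, g_a)


if __name__ == "__main__":
    g_cls = np.array([1.0, 0.0])    # e.g. a classification-task gradient
    g_gen = np.array([-1.0, 1.0])   # e.g. a report-generation gradient
    print("linear scalarization:", g_cls + g_gen)
    print("after projection:   ", combine(g_cls, g_gen))
```

With plain summation the first coordinate cancels to zero, so one task's signal is lost; after projection each corrected gradient is orthogonal to the other task's gradient and the combined step still makes progress along both. The paper's optimizer additionally addresses gradient magnitude, which this sketch does not attempt.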

2605.22373 2026-05-25 cs.LG cs.CL 版本更新

Boundary-targeted Membership Inference Attacks on Safety Classifiers

针对安全分类器的边界目标成员推断攻击

Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah

发表机构 * University of Sheffield(谢菲尔德大学) Carnegie Mellon University(卡内基梅隆大学) MBZUAI

AI总结 该研究探讨了针对安全分类器的边界目标成员推断攻击问题,这类分类器常用于生成式AI系统中以过滤有害内容或识别高风险用户。研究提出了一种新的样本选择策略,通过识别分类器置信度最低的样本来放大成员身份信号,从而推断样本是否属于训练集。实验表明,在一个用于检测用户是否需要情感支持的微调分类器上,攻击者能在5%的假阳性率下恢复19%被标记为用户困扰的对话,是单独使用现有最先进成员推断攻击方法的3.5倍。研究还分析了边界样本的特性,指出基于内容的过滤难以有效防御此类攻击,而现有的加噪策略可以有效缓解。

详情
AI中文摘要

安全分类器是生成式AI系统中的重要保障,用于过滤有害内容或识别与大语言模型交互时处于风险中的用户。尽管这些模型是必要的,但它们是在包含自残和心理健康讨论等敏感数据集上训练的,这引发了重要但尚未充分理解的隐私问题。成员推断攻击(MIA)允许对手推断用于训练模型的示例的成员身份。在这项工作中,我们假设识别分类器最不自信的示例对于对手推断成员身份具有信息价值。这反映了局部泛化失败,其中模型依赖记忆来解决训练集中的歧义。为了研究这一点,我们引入了一种新的边界目标选择策略,该策略识别低置信度示例,从而放大训练集中示例成员身份的信号。我们的实验结果表明,在针对检测可能需要情感支持的用户的微调分类器上,对手可以以5%的假阳性率恢复安全分类器标记为指示用户困扰的对话中的19%。这是单独使用最先进的MIA方法攻击时的3.5倍。最后,我们描述了边界示例的特征,并表明基于内容的过滤对于保护无效,而现有的噪声策略可以有效减轻这些示例的易受攻击性。

英文摘要

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident is informative for an adversary inferring membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples, which amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is 3.5 times more than attacking with state-of-the-art MIA methods alone. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, whereas existing noise strategies can effectively mitigate the susceptibility of these examples.
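Two quantities in this abstract lend themselves to a short sketch: picking the examples nearest the decision boundary, and reporting attack success as the true-positive rate at a fixed false-positive rate. The code below assumes a binary classifier with a 0.5 decision threshold and a generic per-example attack score where higher means "more likely a training member". Neither the score nor the selection rule is taken from the paper.

```python
import numpy as np


def boundary_indices(probs, k: int) -> np.ndarray:
    """Indices of the k examples whose predicted probability is closest to 0.5."""
    margin = np.abs(np.asarray(probs, dtype=float) - 0.5)
    return np.argsort(margin, kind="stable")[:k]


def tpr_at_fpr(member_scores, nonmember_scores, max_fpr: float = 0.05) -> float:
    """Fraction of true members flagged when at most max_fpr of non-members are flagged."""
    non = np.sort(np.asarray(nonmember_scores, dtype=float))[::-1]   # descending
    allowed_fp = int(np.floor(max_fpr * len(non)))
    threshold = non[allowed_fp]          # flag only scores strictly above this value
    return float(np.mean(np.asarray(member_scores, dtype=float) > threshold))


if __name__ == "__main__":
    print(boundary_indices([0.9, 0.52, 0.1, 0.45], k=2))
    print(tpr_at_fpr([96, 97, 50, 10], list(range(100)), max_fpr=0.05))
```

Reporting the true-positive rate at a low false-positive rate, rather than average accuracy, reflects the fact that a privacy attack is damaging as soon as it confidently identifies even a small share of members.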

2605.21071 2026-05-25 cs.CL cs.AI 版本更新

Fine-grained Claim-level RAG Benchmark for Law

细粒度声明级法律RAG基准

Souvick Das, Sallam Abualhaija, Domenico Bianculli

发表机构 * University of Luxembourg(卢森堡大学)

AI总结 本文提出ClaimRAG-LAW,一个支持英法双语、面向法律专家与非专家用户的细粒度法律检索增强生成(RAG)基准数据集,涵盖多种真实场景的问答类型。研究通过细粒度评估框架分析当前先进法律RAG系统的检索、生成及主张级表现,揭示了其在法律领域中存在的局限性,为提升法律AI系统的可靠性提供了重要参考。

详情
AI中文摘要

大型语言模型(LLM)的快速进展正在将语义搜索转向问答范式,用户提出问题,LLM生成回答。在法律等高风险领域,检索增强生成(RAG)通常用于减轻生成回答中的幻觉。然而,先前的研究表明,无论是通用还是法律专用的RAG系统,仍然以不同速率产生幻觉,这使得细粒度评估变得至关重要。尽管有需求,现有的法律RAG系统评估框架缺乏分别对检索和生成性能进行详细分析所需的粒度。此外,当前的基准主要是英文且集中于法律专家查询,忽视了非专家需求。我们引入了ClaimRAG-LAW,一个全面的法律RAG数据集,支持法语和英语,面向专家和非专家,并包含反映现实场景的多样化问题类型。我们进一步应用细粒度评估框架对最先进的法律RAG系统进行评估,揭示了法律领域在检索、生成和声明级分析方面的局限性。

英文摘要

The rapid progress of large language models (LLMs) is shifting semantic search toward a question-answering paradigm, where users ask questions and LLMs generate responses. In high-stakes domains such as law, retrieval-augmented generation (RAG) is commonly used to mitigate hallucinations in generated responses. Nonetheless, prior work shows that RAG systems, whether general-purpose or legal-specific, still hallucinate at varying rates, making fine-grained evaluation essential. Despite the need, existing evaluation frameworks for legal RAG systems lack the granularity required to provide detailed analysis of retrieval and generation performance separately. Moreover, current benchmarks are largely English-only and centered on legal expert queries, overlooking non-expert needs. We introduce ClaimRAG-LAW, a comprehensive dataset for legal RAG that supports French and English, targets both experts and non-experts, and includes diverse question types reflecting realistic scenarios. We further apply a fine-grained evaluation framework to state-of-the-art legal RAG systems, revealing limitations in retrieval, generation, and claim-level analysis in the legal domain.
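To illustrate what evaluating retrieval and generation separately at the claim level can look like, here is a small sketch. It assumes claims have already been extracted and normalized into comparable strings; a real pipeline would match claims with an entailment model or an LLM judge rather than exact equality. The metric names are illustrative, not the benchmark's.

```python
from typing import Dict, Iterable


def _ratio(numerator: int, denominator: int) -> float:
    return numerator / denominator if denominator else 0.0


def claim_level_scores(response_claims: Iterable[str],
                       reference_claims: Iterable[str],
                       retrieved_claims: Iterable[str]) -> Dict[str, object]:
    """Separate retrieval quality from generation quality using sets of claims."""
    response, reference, retrieved = (set(response_claims), set(reference_claims),
                                      set(retrieved_claims))
    return {
        # Retrieval: how many gold claims does the retrieved context contain?
        "retrieval_recall": _ratio(len(reference & retrieved), len(reference)),
        # Generation: how many response claims are correct, and how many gold claims are covered?
        "claim_precision": _ratio(len(response & reference), len(response)),
        "claim_recall": _ratio(len(response & reference), len(reference)),
        # Response claims with no support in the retrieved context (candidate hallucinations).
        "unsupported": sorted(response - retrieved),
    }
```

Splitting the scores this way shows whether a wrong answer stems from missing evidence (low retrieval recall) or from the generator ignoring or inventing content (low precision, non-empty `unsupported`).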

2605.20558 2026-05-25 cs.CL 版本更新

When Irregularity Helps: A Subclass Analysis of Inductive Bias in Neural Morphology

当不规则性有帮助:神经形态学中归纳偏置的子类分析

Wen Zhang

AI总结 该研究分析了神经形态学生成系统在处理日语过去时动词屈折时的表现,发现模型错误不成比例地集中在一个结构特殊、占数据不足1%的不规则子类上。受控消融实验表明,移除这一小类比移除所有不规则动词更能提升模型泛化能力,说明不同不规则模式对模型稳定性的影响不同。研究指出,错误集中源于极低频形态模式与特定形态音系过程(尤其是促音化)的相互作用,强调形态评估应引入更细致的子类分析以揭示模型缺陷。

详情
AI中文摘要

神经形态生成系统通常在基准数据集上达到高总体准确率,但这种性能可能掩盖集中在罕见形态子类中的系统性错误。我们考察日语过去式动词屈折,发现一个非常小、结构特定的不规则子类(<1%的数据)占据了模型错误的不成比例份额。受控消融实验表明,移除该子类比移除所有不规则动词带来更大的泛化提升,表明并非所有不规则性对模型不稳定性贡献相同。这些发现表明,错误集中是由极端低频形态模式与特定形态音韵过程(特别是促音化)之间的交互驱动的。我们认为形态评估应纳入比标准变位类别更细粒度的子类分析。

英文摘要

Neural morphological generation systems often achieve high aggregate accuracy on benchmark datasets, yet such performance can conceal systematic errors concentrated in rare morphological subclasses. We examine Japanese past-tense verb inflection and show that a very small, structurally specific irregular subtype (<1% of data) accounts for a disproportionate share of model errors. Controlled ablation experiments demonstrate that removing this subtype yields larger improvements in generalization than removing all irregular verbs, indicating that not all irregularity contributes equally to model instability. These findings suggest that error concentration is driven by the interaction between extreme low-frequency morphological patterns and specific morphophonological processes, particularly gemination. We argue that morphological evaluation should incorporate finer-grained subclass analysis beyond standard conjugation categories.
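The core measurement here is a comparison between a subclass's share of the data and its share of the errors. A minimal sketch follows, assuming evaluation records of the form `(subclass, is_correct)`; the subclass labels and counts are invented for illustration and are not the paper's data.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def error_concentration(records: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[float, float]]:
    """Map each subclass to (share of all examples, share of all errors)."""
    records = list(records)
    totals = Counter(sub for sub, _ in records)
    errors = Counter(sub for sub, ok in records if not ok)
    n_total, n_errors = len(records), sum(errors.values())
    return {
        sub: (totals[sub] / n_total, errors[sub] / n_errors if n_errors else 0.0)
        for sub in totals
    }


if __name__ == "__main__":
    # Invented counts: 98 regular verbs with 2 errors, 2 rare irregulars that both fail.
    data = ([("regular", True)] * 96 + [("regular", False)] * 2
            + [("rare_irregular", False)] * 2)
    for sub, (data_share, err_share) in error_concentration(data).items():
        print(f"{sub}: {data_share:.2f} of data, {err_share:.2f} of errors")
```

In this toy case aggregate accuracy is 96%, yet a subclass covering 2% of the data produces half of the errors, which is the kind of pattern aggregate scores conceal.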

2605.20201 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

基于代理思维链调优的长上下文推理

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

AI总结 该研究针对大语言模型在长上下文复杂推理任务中表现不佳的问题,提出了一种名为ProxyCoT的新训练框架。该方法通过在短代理上下文中获取高质量的推理轨迹,并将其迁移到完整的长上下文中,从而提升模型的长上下文推理能力。实验表明,ProxyCoT在多个数据集上均优于现有方法,且计算开销更低,同时具备良好的跨领域泛化能力。

Comments Long paper, ACL 2026 (Main conference)

详情
AI中文摘要

近期的大语言模型支持高达1000万token的输入,但在需要复杂推理的长上下文任务上表现不佳。此类任务可以通过仅使用输入的一个子集(即代理上下文)而非完整序列来解决。尽管共享相同的底层推理过程,模型在代理上下文和完整上下文之间表现出显著的性能差异。为了改进长上下文推理,我们提出了ProxyCoT,一种新颖的训练框架,将推理能力从短代理上下文迁移到完整长上下文。具体来说,我们首先通过强化学习或从更大的教师模型蒸馏,在代理上下文中获得高质量的思维链推理轨迹,然后通过监督微调将这些生成的轨迹锚定到完整长上下文中。跨不同数据集的实验表明,ProxyCoT在减少计算开销的同时,始终优于强基线。此外,使用ProxyCoT训练的模型能够将其长上下文推理能力泛化到域外任务。

英文摘要

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input (a proxy context) rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

2605.20192 2026-05-25 cs.CL cs.CE cs.CR cs.CY q-fin.CP 版本更新

Leveraging Large Language Models for Sentiment Analysis: Multi-Modal Analysis of Decentraland's MANA Token

利用大语言模型进行情感分析:Decentraland的MANA代币多模态分析

Xintong Wu, Peiting Tsai, Jing Yuan, Michael Yu, Greg Sun, Luyao Zhang

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Microsoft(微软) Duke Kunshan University(杜克昆山大学)

AI总结 本文研究了如何利用大型语言模型分析Decentraland虚拟平台中Discord社区的情感,结合多模态金融数据提升对MANA代币价格的预测能力。研究采用基于BERT的模型进行情感分析,并构建了两种LSTM架构,分别基于历史价格和融合情感评分、交易量及市值的多模态特征。实验表明,多模态模型在预测准确性上显著优于仅使用价格数据的基线模型,揭示了社区情感信号在虚拟经济预测中的重要价值。

详情
AI中文摘要

Decentraland是一个在扩展的元宇宙生态系统中运行的去中心化虚拟现实平台,利用其原生MANA代币促进虚拟资产交易和治理。本研究探讨将Discord社区情感与多模态金融数据相结合,以增强虚拟世界经济中的加密货币价格预测。我们解决以下问题:(1) 识别Decentraland的Discord社区内的情感模式,以及(2) 评估多模态特征对代币回报预测的影响。使用基于BERT的大语言模型进行情感分析,我们开发了两种LSTM架构:一种包含历史价格的基线模型,另一种集成情感分数、交易量和市值的多模态变体。结果显示社区情感以中性为主,但存在正向偏斜。多模态模型在预测准确性上显著优于仅基于价格的基线模型。这些发现证明了社区衍生信号对虚拟经济预测的预测价值,并为未来在沉浸式虚拟环境、自然语言处理和加密货币市场分析交叉领域的研究奠定了基础。

英文摘要

Decentraland, a decentralized virtual reality platform operating within the expanding Metaverse ecosystem, utilizes its native MANA token to facilitate virtual asset transactions and governance. This study investigates the integration of Discord community sentiment with multi-modal financial data to enhance cryptocurrency price prediction within virtual world economies. We address: (1) identifying sentiment patterns within Decentraland's Discord community, and (2) evaluating the impact of multi-modal features on token return forecasting. Using a BERT-based large language model for sentiment analysis, we develop two LSTM architectures: a baseline incorporating historical prices and a multi-modal variant integrating sentiment scores, trading volume, and market capitalization. Results indicate predominantly neutral community sentiment with a positive skew. The multi-modal model significantly outperforms the price-only baseline in prediction accuracy. These findings demonstrate the predictive value of community-derived signals for virtual economy forecasting and establish a foundation for future research at the intersection of immersive virtual environments, natural language processing, and cryptocurrency market analysis.
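The difference between the two LSTM variants is the input feature set. The sketch below shows how daily features might be arranged into the `(samples, timesteps, features)` windows that sequence models expect, assuming the four columns are price, sentiment score, trading volume, and market capitalization. The column order, the `lookback` parameter, and the function name are assumptions; the abstract gives no preprocessing details.

```python
import numpy as np


def make_windows(features: np.ndarray, returns: np.ndarray, lookback: int):
    """Build sliding windows for sequence models.

    features: (T, F) daily feature matrix; returns: (T,) daily token returns.
    Sample t uses the lookback days before day t to predict returns[t].
    Requires T > lookback.
    """
    if len(features) <= lookback:
        raise ValueError("need more time steps than the lookback length")
    X = np.stack([features[t - lookback:t] for t in range(lookback, len(features))])
    y = returns[lookback:]
    return X, y


if __name__ == "__main__":
    T = 5
    # Columns (assumed): price, sentiment score, trading volume, market cap.
    feats = np.arange(T * 4, dtype=float).reshape(T, 4)
    rets = np.array([0.0, 0.1, 0.2, 0.3, 0.4])
    X_multi, y = make_windows(feats, rets, lookback=2)       # multi-modal input
    X_price, _ = make_windows(feats[:, :1], rets, lookback=2)  # price-only baseline
    print(X_multi.shape, X_price.shape, y.shape)
```

Keeping the windowing identical for both variants means any accuracy gap can be attributed to the extra feature columns rather than to a different amount of history.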

2605.20087 2026-05-25 cs.CL cs.AI 版本更新

ThoughtTrace: Understanding User Thoughts in Real-World LLM Interactions

ThoughtTrace: 理解真实世界LLM交互中的用户想法

Chuanyang Jin, Binze Li, Haopeng Xie, Cathy Mengying Fang, Tianjian Li, Shayne Longpre, Hongxiang Gu, Maximillian Chen, Tianmin Shu

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Massachusetts Institute of Technology(麻省理工学院) Google Research(谷歌研究)

AI总结 ThoughtTrace 是首个大规模数据集,记录了真实场景中用户与AI的多轮对话及其用户自我报告的思考内容,揭示了用户发送提示的原因和对AI回复的反应。该数据集包含1,058名用户、2,155次对话及10,174条思考注释,分析表明用户思考内容在语义上与对话消息不同,且难以被当前先进大模型准确推断。研究进一步展示了思考内容在行为预测和个性化助手训练中的应用价值,为理解用户潜在目标和需求提供了新的数据模态。

Comments 53 pages, 23 figures, 4 tables. Project website: https://thoughttrace-project.github.io/

详情
AI中文摘要

对话式AI现已服务数十亿用户,但现有数据集仅捕捉用户所说,而非所想。我们引入ThoughtTrace,首个大规模数据集,将真实世界多轮人机对话与用户自述想法配对:用户发送提示的原因以及对助手回复的反应。ThoughtTrace包含1,058名用户、2,155次对话、17,058轮次和10,174条想法标注,数据采集覆盖20个语言模型。我们的分析表明,ThoughtTrace捕捉了长期、主题多样的交互,且想法在语义上不同于消息,前沿LLM难以从上下文中推断,内容多样,并与对话阶段相关。我们进一步展示了想法在下游建模中的实用性。首先,想法作为推理时上下文改善了用户行为预测。其次,想法引导的重写为训练个性化助手提供了细粒度对齐信号。总之,ThoughtTrace将用户想法确立为研究人机交互背后认知动态的新数据模态,并为构建更好理解和适应用户潜在目标、偏好与需求的助手奠定了基础。

英文摘要

Conversational AI has now reached billions of users, yet existing datasets capture only what people say, not what they think. We introduce ThoughtTrace, the first large-scale dataset that pairs real-world multi-turn human-AI conversations with users' self-reported thoughts: their reasons for sending prompts and reactions to assistant responses. ThoughtTrace comprises 1,058 users, 2,155 conversations, 17,058 turns, and 10,174 thought annotations collected across 20 language models. Our analysis shows that ThoughtTrace captures long-horizon, topically diverse interactions, and that thoughts are semantically distinct from messages, difficult for frontier LLMs to infer from context, diverse in content, and tied to conversation stages. We further demonstrate the utility of thoughts for downstream modeling. First, thoughts improve user-behavior prediction as inference-time context. Second, thought-guided rewrites provide fine-grained alignment signals for training personalized assistants. Together, ThoughtTrace establishes user thoughts as a new data modality for studying the cognitive dynamics behind human-AI interaction and provides a foundation for building assistants that better understand and adapt to users' latent goals, preferences, and needs.

2605.20043 2026-05-25 cs.CL 版本更新

Mind Your Moras: Orthography-Aware Error Analysis of Neural Japanese Morphological Generation

注意你的莫拉:神经日语形态生成的拼写感知错误分析

Wen Zhang

AI总结 本文针对日语过去时态形态生成任务,提出了一种关注正字法的错误分析方法,将平假名视为编码形态音系差异的表征系统,而不仅仅是转录媒介。研究评估了两种字符级序列到序列模型,发现它们在整体准确率较高的情况下仍存在与平假名正字法特性相关的系统性错误,其中促音化(gemination)相关错误占剩余错误的75-80%。研究提出了包含七种主要错误模式的分类体系,并揭示了正字法表征、形态结构和数据频率在模型泛化中的紧密关联,强调了在形态复杂的语言中进行正字法感知评估的重要性。

详情
AI中文摘要

我们提出了一种拼写感知的日语过去时形态屈折错误分析,将平假名不仅视为转录媒介,而且视为编码形态音位区别的表征系统,这些区别可能影响模型泛化。我们使用根据SIGMORPHON 2020和2023共享任务约定格式化的数据集,评估了两种字符级序列到序列架构在过去时形成上的表现。尽管总体准确率较高,但模型表现出系统的、语言上可解释的错误,这些错误集中在平假名的特定拼写属性上。我们引入了一个简洁的错误分类法,捕获了七种主要失败模式,并提供了定量和定性分析。促音化相关错误主导了剩余失败,占错误的75-80%,特别是在词干以元音e结尾且需要在过去时后缀前促音化的动词中。错误模式在架构和随机种子之间高度一致,表明拼写表示、形态结构和数据频率效应在塑造模型泛化中存在稳健的交互。这些结果强调了在理解形态复杂语言的神经泛化时,拼写感知评估的必要性。

英文摘要

We present an orthography-aware error analysis of Japanese past-tense morphological inflection, treating hiragana not merely as a transcriptional medium, but as a representational system encoding morphophonological distinctions that may influence model generalization. We evaluate two character-level sequence-to-sequence architectures on past-tense formation using datasets formatted according to the SIGMORPHON 2020 and 2023 shared task conventions. Despite high aggregate accuracy, models exhibit systematic, linguistically interpretable errors that cluster around specific orthographic properties of hiragana. We introduce a concise error taxonomy capturing seven primary failure modes and provide both quantitative and qualitative analyses. Gemination-related errors dominate residual failures, accounting for 75-80% of errors, particularly in verbs whose stems end in the vowel e and require gemination before the past-tense suffix. Error patterns remain highly consistent across architectures and random seeds, suggesting a robust interaction between orthographic representation, morphological structure, and data frequency effects in shaping model generalization. These results underscore the necessity of orthography-aware evaluation for understanding neural generalization in morphologically complex languages.
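Gemination is written in hiragana with the small っ (sokuon), so one simple way to flag a gemination-related error is to check whether a prediction and its gold form differ only in the presence of that character. The sketch below is a hypothetical single-category check, not the paper's seven-way taxonomy, and it does not catch cases where っ is replaced by another character.

```python
SOKUON = "っ"  # small tsu, the hiragana marker of gemination


def is_gemination_error(prediction: str, gold: str) -> bool:
    """True when the two forms differ only by inserted or deleted っ characters."""
    if prediction == gold:
        return False
    return prediction.replace(SOKUON, "") == gold.replace(SOKUON, "")


if __name__ == "__main__":
    gold = "かった"  # past tense of かう (to buy)
    for pred in ["かた", "かった", "かいた"]:
        print(pred, is_gemination_error(pred, gold))
```

A check like this turns raw mismatches into an interpretable category, which is what allows a statement such as "gemination accounts for 75-80% of errors" to be computed over a test set.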

2605.15482 2026-05-25 cs.CL 版本更新

FINESSE-Bench: A Hierarchical Benchmark Suite for Financial Domain Knowledge and Technical Analysis in Large Language Models

FINESSE-Bench:面向大语言模型金融领域知识与技术分析的分层基准套件

Dmitry Stanishevskii, Nini Kamkia, Alexey Khoroshilov, Dmitry Zmitrovich, Denis Kokosinskii, Zhirayr Hayrapetyan, Andrei Kalmykov

发表机构 * Lime FinTech(Lime金融科技)

AI总结 FINESSE-Bench 是一个用于评估大型语言模型在金融领域知识和技术分析能力的分层基准测试套件,包含 3,993 道题目,涵盖从基础到专家级的多个难度层级。该基准结合了专业认证考试、实际交易任务和金融竞赛题目,旨在全面评估模型在金融领域的知识广度、计算能力及应对复杂问题的表现。此外,FINESSE-Bench 提供统一的评估协议和自动评分机制,为更深入、专业化的模型能力测评提供了有力工具。

Comments 21 pages, 10 tables, 2 figures

详情
AI中文摘要

大语言模型(LLMs)正越来越多地应用于金融分析、报告、投资决策支持、风险管理、合规和专业培训。然而,对其在金融领域专业能力的稳健评估仍不完整。广泛使用的开放基准如FinQA、ConvFinQA和TAT-QA在推动金融问答和数值推理方面发挥了重要作用,但它们主要关注财务报告上的问答,并未提供明确的专业难度层级。更广泛的资源,包括FinanceBench、PIXIU、FinBen和FLaME,扩展了金融任务的覆盖范围,但评估从基础知识到专家级金融推理的过渡问题仍然存在。在这项工作中,我们提出了FINESSE-Bench,这是一个由八个专业基准组成的套件,包含3993个问题,用于对LLMs的金融能力进行分层评估。FINESSE-Bench结合了受专业认证(类似CFA 1-3级、类似CMT 2级和类似CFTe 1级)启发的考试导向数据集、应用交易任务集以及一个俄语奥林匹克基准。这种设计能够评估领域广度、难度增加时的性能下降、解决计算任务的能力以及模型在专业金融领域中的行为。我们还描述了一个统一的评估协议,涵盖多项选择题、数值答案和简短开放式回答,以及基于LLM作为评判范式的自由形式答案自动评分方案。FINESSE-Bench旨在作为现有开放金融基准的补充,并作为对大语言模型中专业相关金融能力进行更实质性评估的工具。

英文摘要

Large language models (LLMs) are increasingly being applied to financial analysis, reporting, investment decision support, risk management, compliance, and professional training. However, robust evaluation of their domain competence in finance remains incomplete. Widely used open benchmarks such as FinQA, ConvFinQA, and TAT-QA have played an important role in advancing financial question answering and numerical reasoning, but they focus primarily on question answering over financial reports and do not provide an explicit hierarchy of professional difficulty. Broader resources, including FinanceBench, PIXIU, FinBen, and FLaME, expand the coverage of financial tasks, yet the problem of evaluating the transition from foundational knowledge to expert-level financial reasoning remains open. In this work, we present FINESSE-Bench, a suite of eight specialized benchmarks comprising 3,993 questions for hierarchical evaluation of financial competencies in LLMs. FINESSE-Bench combines exam-oriented datasets inspired by professional certifications (CFA-like Levels 1-3, CMT-like Level 2, and CFTe-like Level 1), applied trading task collections, and a Russian-language olympiad benchmark. This design enables evaluation of domain breadth, performance degradation as difficulty increases, the ability to solve computational tasks, and model behavior in specialized financial domains. We also describe a unified evaluation protocol covering multiple-choice questions, numerical answers, and short open-ended responses, together with an automated scoring scheme for freeform answers based on the LLM-as-judge paradigm. FINESSE-Bench is intended both as a complement to existing open financial benchmarks and as a tool for more substantive evaluation of professionally relevant financial competencies in large language models.

2605.10347 2026-05-25 cs.AI cs.CL 版本更新

How Mobile World Model Guides GUI Agents?

移动世界模型如何指导GUI代理?

Weikai Xu, Kun Huang, Yunren Feng, Jiaxing Li, Yuhan Chen, Yuxuan Liu, Zhizheng Jiang, Heng Qu, Pengzhi Gao, Wei Liu, Jian Luan, Xiaolin Hu, Bo An

发表机构 * Nanyang Technological University(南洋理工大学) MiLM Plus, Xiaomi Inc.(小米公司) Independent Researchers(独立研究人员) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学人工智能学院) Wuhan University(武汉大学) Xiamen University(厦门大学)

AI总结 本文研究了移动世界模型如何指导GUI代理进行有效交互,针对现有模型在预测动作后果方面的不足,提出了一种多模态世界模型,涵盖增量文本、完整文本、扩散图像和可渲染代码四种表示方式。实验表明,该模型在多个基准测试中达到最优性能,并揭示了代码重建在分布内精度和多模态监督上的优势,文本反馈在分布外执行中的鲁棒性,以及世界模型在训练过程中的辅助作用,而非作为通用的后验验证工具。

详情
AI中文摘要

视觉语言模型的最新进展使移动GUI代理能够感知视觉界面并执行用户指令,但对于长期和高风险交互,动作后果的可靠预测仍然至关重要。现有的移动世界模型提供基于文本或基于图像的未来状态,但尚不清楚哪种表示有用,生成的rollout是否可以替代真实环境,以及测试时指导如何帮助不同强度的代理。为了回答上述问题,我们筛选并标注了移动世界模型数据,然后训练了四种模态的世界模型:增量文本、完整文本、基于扩散的图像和可渲染代码。这些模型在MobileWorldBench和Code2WorldBench上均达到了最先进性能。此外,通过在AITZ、AndroidControl和AndroidWorld上评估其下游效用,我们得到三个发现。首先,可渲染代码重建实现了高分布内保真度,并为数据构建提供了有效的多模态监督,而基于文本的反馈对于在线分布外执行更鲁棒。其次,世界模型生成的轨迹可以在训练过程中提供可迁移的交互经验,并提高代理的端到端任务性能,尽管这些数据不保留原始分布。最后,对于动作熵低的过度自信移动代理,后验自省提供的收益有限,这表明世界模型作为先验感知或训练监督比作为通用事后验证器更有效。

英文摘要

Recent advances in vision-language models have enabled mobile GUI agents to perceive visual interfaces and execute user instructions, but reliable prediction of action consequences remains critical for long-horizon and high-risk interactions. Existing mobile world models provide either text-based or image-based future states, yet it remains unclear which representation is useful, whether generated rollouts can replace real environments, and how test-time guidance helps agents of different strengths. To answer the above questions, we filter and annotate mobile world-model data, then train world models across four modalities: delta text, full text, diffusion-based images, and renderable code. These models achieve SoTA performance on both MobileWorldBench and Code2WorldBench. Furthermore, by evaluating their downstream utility on AITZ, AndroidControl, and AndroidWorld, we obtain three findings. First, renderable code reconstruction achieves high in-distribution fidelity and provides effective multimodal supervision for data construction, while text-based feedback is more robust for online out-of-distribution (OOD) execution. Second, world-model-generated trajectories can provide transferable interaction experience in the training process and improve agents' end-to-end task performance, although these data do not preserve the original distribution. Last, for overconfident mobile agents with low action entropy, posterior self-reflection provides limited gains, suggesting that world models are more effective as prior perception or training supervision than as universal post-hoc verifiers.

2604.28048 2026-05-25 cs.CL cs.SI 版本更新

Stable Behavior, Limited Variation: Persona Validity in LLM Agents for Urban Sentiment Perception

稳定行为,有限变化:城市情感感知中LLM智能体的角色有效性

Neemias B da Silva, Rodrigo Minetto, Daniel Silver, Thiago H Silva

发表机构 * University of Toronto(多伦多大学)

AI总结 该研究探讨了在城市情感感知任务中,使用不同人格设定对多模态大语言模型(LLM)行为一致性与差异性的影响。通过设置包括性别、经济状况、政治立场和性格等维度的人格变量,研究发现同一人格设定下的模型表现出高度一致的行为,但不同人格之间的差异有限,仅经济状况和性格带来可检测但实际影响较小的变化。研究还指出,模型在细粒度情感判断上表现较差,且去除了人格设定后模型性能有时甚至更优,表明简单的人格标签提示可能对感知判断的注释价值有限。

Comments 8 pages, 8 figures. IEEE DCOSS - UrbCom

详情
Journal ref
IEEE DCOSS 2026
AI中文摘要

大型语言模型(LLM)越来越多地被用作城市分析中人类感知的代理,但尚不清楚角色提示是否会产生有意义且可重复的行为多样性。我们研究了不同角色是否影响多模态LLM生成的城市情感判断。使用涵盖性别、经济状况、政治取向和人格的角色因子集,我们为每个角色实例化多个智能体,以评估来自PerceptSent数据集的城市场景图像,并评估角色内一致性和角色间变化。结果显示,共享角色的智能体之间存在强收敛性,表明行为稳定且可重复。然而,角色间分化有限:经济状况和人格引起统计上可检测但实际变化不大的影响,而性别没有可测量的效果,政治取向的影响可忽略不计。智能体还表现出极端偏差,压缩了人类注释中常见的中间情感类别。因此,在粗粒度极性任务上表现强劲,但随着情感分辨率的提高而下降,表明简单的基于标签的角色提示无法捕捉细粒度的感知判断。为了隔离角色条件的作用,我们还评估了没有角色的相同模型。令人惊讶的是,无角色模型在所有任务变体上与人类标签的一致性有时达到或超过有角色条件,表明在这种设置下,简单的基于标签的角色提示可能增加有限的注释价值。

英文摘要

Large Language Models (LLMs) are increasingly used as proxies for human perception in urban analysis, yet it remains unclear whether persona prompting produces meaningful and reproducible behavioral diversity. We investigate whether distinct personas influence urban sentiment judgments generated by multimodal LLMs. Using a factorial set of personas spanning gender, economic status, political orientation, and personality, we instantiate multiple agents per persona to evaluate urban scene images from the PerceptSent dataset and assess both within-persona consistency and cross-persona variation. Results show strong convergence among agents sharing a persona, indicating stable and reproducible behavior. However, cross-persona differentiation is limited: economic status and personality induce statistically detectable but practically modest variation, while gender shows no measurable effect and political orientation only negligible impact. Agents also exhibit an extremity bias, collapsing intermediate sentiment categories common in human annotations. As a result, performance remains strong on coarse-grained polarity tasks but degrades as sentiment resolution increases, suggesting that simple label-based persona prompting does not capture fine-grained perceptual judgments. To isolate the contribution of persona conditioning, we additionally evaluate the same model without personas. Surprisingly, the no-persona model sometimes matches or exceeds persona-conditioned agreement with human labels across all task variants, suggesting that simple label-based persona prompting may add limited annotation value in this setting.

2604.27468 2026-05-25 cs.CL 版本更新

Syntactically-guided Information Maintenance in Sentence Comprehension

句子理解中的句法引导信息维护

Shinnosuke Isono, Kohei Kajikawa

发表机构 * NINJAL(日本国立国语研究所) Georgetown University(乔治城大学)

AI总结 本研究探讨了在句子理解过程中,如何根据句法结构选择性地维持对后续预测至关重要的信息。研究提出,信息维持的成本受到预测头数量和未完成依存关系数量的影响,并通过自然阅读时间数据验证了这两个因素在日语中对维持成本的不同作用。研究还发现,因信息维持而放慢阅读的读者往往更能从可预测性中获益,表明句法结构在语言理解中的重要作用,同时指出英语中未表现出相同模式,提示不同语言在句法引导信息维持方面可能存在差异。

详情
AI中文摘要

在成功的实时语言理解中,在上下文中维护信息至关重要,但维护在认知上代价高昂且可能减慢处理速度。我们假设理性语言使用者会选择性维护对未来预测至关重要的信息,并由句法结构引导。根据这一观点,两个因素影响维护成本:预测头的数量和未完成依赖的数量。尽管这些因素在文献中被视为竞争性假设,但我们的解释预测它们不可相互约简。我们在日语的自然阅读时间数据中证明了这一点,日语中这两个因素对比尤为清晰。我们进一步表明存在一种权衡,即因维护而减慢速度的读者往往从可预测性中获益更多,这为所提出的解释提供了额外支持。然而,这些模式在英语中并不明显,我们强调了一些有待解决的问题,以理解句法在各种语言记忆高效处理中的贡献。

英文摘要

Maintaining information in context is essential in successful real-time language comprehension, but maintenance is cognitively costly and can slow processing. We hypothesize that rational language users selectively maintain information that is crucial for future prediction, guided by syntactic structure. Under this view, two factors affect maintenance cost: the number of predicted heads and the number of incomplete dependencies. Although these factors have been treated as competing hypotheses in the literature, our account predicts that they are not reducible to one another. We show this is the case in a naturalistic reading time dataset in Japanese, a language in which the two factors contrast particularly clearly. We further show that there is a tradeoff such that readers that slow down for maintenance tend to benefit more from predictability, providing additional support for the proposed account. These patterns are not evident in English, however, and we highlight some issues to be resolved to understand the contribution of syntax in memory-efficient processing of various languages.

2604.21889 2026-05-25 cs.CL cs.AI cs.LG 版本更新

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TingIS:企业级规模下从嘈杂客户事件中实时发现风险事件

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

发表机构 * Ant Group(蚂蚁集团) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文介绍了TingIS,一个用于大规模企业环境中实时发现风险事件的端到端系统。针对客户事件数据中存在噪声大、语义复杂、吞吐量高的挑战,TingIS结合多阶段事件链接引擎与大型语言模型,实现了从少量用户描述中稳定提取有效事件的能力,并通过级联路由机制和多维降噪流程提升业务归因精度和信号质量。实验表明,TingIS在高优先级事件发现率和系统响应延迟方面表现优异,显著优于现有方法。

Comments Accepted to ACL 2026 Industry Track (oral presentation)

详情
AI中文摘要

实时检测和缓解技术异常对于大规模云原生服务至关重要,即使几分钟的停机也可能导致巨大的财务损失和用户信任度下降。虽然客户事件是发现监控遗漏风险的重要信号,但由于极端噪声、高吞吐量和不同业务线的语义复杂性,从这些数据中提取可操作情报仍然具有挑战性。在本文中,我们提出了TingIS,一个为企业级事件发现设计的端到端系统。TingIS的核心是一个多阶段事件链接引擎,该引擎将高效索引技术与大型语言模型(LLM)协同起来,对事件合并做出明智决策,从而仅从少量多样的用户描述中稳定提取可操作事件。该引擎辅以级联路由机制以实现精确的业务归属,以及一个集成领域知识、统计模式和行为过滤的多维降噪管道。TingIS部署在生产环境中,处理峰值吞吐量超过每分钟2,000条消息和每天300,000条消息,实现了P90告警延迟3.5分钟和高优先级事件95%的发现率。基于真实数据构建的基准测试表明,TingIS在路由准确性、聚类质量和信噪比方面显著优于基线方法。

英文摘要

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

2604.09349 2026-05-25 cs.CV cs.AI cs.CL 版本更新

Visually-Guided Policy Optimization for Multimodal Reasoning

视觉引导的多模态推理策略优化

Zengbin Wang, Feng Xiong, Liang Lin, Xuecai Hu, Yong Wang, Yanlin Wang, Man Zhang, Xiangxiang Chu

发表机构 * AMAP, Alibaba Group(阿里巴巴集团高德地图) SYSU(中山大学) BUPT(北京邮电大学)

AI总结 该研究针对视觉语言模型在多模态推理中视觉关注不足的问题,提出了一种名为Visually-Guided Policy Optimization(VGPO)的新框架,通过引入视觉注意力补偿机制和双粒度优势重加权策略,增强模型在推理过程中的视觉聚焦能力。实验表明,VGPO有效提升了模型在数学多模态推理和依赖视觉的任务中的表现,显著改善了视觉信息的利用效率。

Comments Accepted to ACL 2026, https://github.com/wzb-bupt/VGPO

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)显著提升了视觉语言模型(VLM)的推理能力。然而,VLM固有的文本主导特性常导致视觉忠实度不足,表现为对视觉标记的注意力激活稀疏。更重要的是,我们的实证分析揭示,推理步骤中的时序视觉遗忘加剧了这一缺陷。为弥补这一差距,我们提出视觉引导策略优化(VGPO),一种在策略优化期间强化视觉聚焦的新框架。具体而言,VGPO首先引入视觉注意力补偿机制,利用视觉相似性定位并放大视觉线索,同时在后续步骤中逐步提升视觉期望以对抗视觉遗忘。基于此机制,我们实施双粒度优势重加权策略:轨迹内层级突出显示具有相对较高视觉激活的标记,而轨迹间层级优先选择表现出优越视觉累积的轨迹。大量实验表明,VGPO在数学多模态推理和视觉依赖任务中实现了更好的视觉激活和优越性能。代码已发布于https://github.com/wzb-bupt/VGPO。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has significantly advanced the reasoning ability of vision-language models (VLMs). However, the inherent text-dominated nature of VLMs often leads to insufficient visual faithfulness, characterized by sparse attention activation to visual tokens. More importantly, our empirical analysis reveals that temporal visual forgetting along reasoning steps exacerbates this deficiency. To bridge this gap, we propose Visually-Guided Policy Optimization (VGPO), a novel framework to reinforce visual focus during policy optimization. Specifically, VGPO initially introduces a Visual Attention Compensation mechanism that leverages visual similarity to localize and amplify visual cues, while progressively elevating visual expectations in later steps to counteract visual forgetting. Building on this mechanism, we implement a dual-grained advantage re-weighting strategy: the intra-trajectory level highlights tokens exhibiting relatively high visual activation, while the inter-trajectory level prioritizes trajectories demonstrating superior visual accumulation. Extensive experiments demonstrate that VGPO achieves better visual activation and superior performance in mathematical multimodal reasoning and visual-dependent tasks. The code has been released at https://github.com/wzb-bupt/VGPO.

2604.00003 2026-05-25 cs.CL cs.AI cs.IR 版本更新

Tabular PDF Information Extraction with Local LLMs and Layout-Aware Parsing: A Reliability Evaluation

使用本地大语言模型和布局感知解析的表格PDF信息提取:可靠性评估

Muhammad Anis Al Hilmi, Neelansh Khare, Noel Framil Iglesias, Kurnia Adi Cahyanto, Azhar Al Afghani, Musfi Yuliadi

发表机构 * Faculty of Engineering, Universitas Swadaya Gunung Jati(Swadaya Gunung Jati大学工程学院) University of California, Irvine(加州大学尔湾分校) UNIR, La Rioja(拉里奥哈国际大学) Universitas Diponegoro(迪波内戈罗大学)

AI总结 该研究评估了从学术PDF文档中提取结构化信息的可靠性,以印度尼西亚高等教育课程注册表(KRS)为案例,比较了三种方法:纯大语言模型(LLM)、混合确定性-LLM(正则表达式与LLM结合)以及基于Camelot的流程并结合LLM作为后备。实验表明,混合方法在处理确定性元数据时效率更高,而基于Camelot的流程结合LLM后备在准确率和计算效率上表现最佳,尤其适合计算资源受限的环境。

Comments 9 pages, 5 figures, 3 tables

详情
AI中文摘要

从学术PDF文档中提取结构化信息并非易事:单页通常结合自由文本元数据和表格区域,版式因专业而异,并容易受到干扰下游解析的Unicode编码伪影的影响。本研究以印度尼西亚高等教育的学术课程注册文档(Kartu Rencana Studi或KRS)为案例,评估了表格PDF文档信息提取方法的可靠性。比较了三种策略:纯LLM、混合确定性-LLM(正则表达式和LLM)以及基于Camelot的管道(带LLM回退)。实验在140份文档(基于LLM的测试)和860份文档(基于Camelot的管道评估)上进行,涵盖四个专业,表格和元数据中的数据各不相同。使用Ollama和消费级CPU(无GPU)本地运行了三个12-14B的LLM模型(Gemma 3、Phi 4和Qwen 2.5)。评估使用了精确匹配(EM)和Levenshtein相似度(LS)指标,阈值为0.7。尽管并非适用于所有模型,但结果表明,与纯LLM相比,混合方法可以提高效率,尤其是对于确定性元数据。基于Camelot的管道(带LLM回退)在准确性(EM和LS高达0.99-1.00)和计算效率(大多数情况下每个PDF不到1秒)方面取得了最佳组合。Qwen 2.5:14b模型在所有场景中表现最一致。这些发现证实,在计算受限的环境中,将确定性和基于LLM的方法相结合是从基于文本的表格PDF文档中提取信息的可靠且高效的策略。

英文摘要

Extracting structured information from academic PDF documents is non-trivial: a single page typically combines free-text metadata with tabular regions, exhibits cross-program variation, and is susceptible to Unicode encoding artifacts that interfere with downstream parsing. This study evaluates the reliability of information extraction approaches for tabular PDF documents, using academic course registration documents (Kartu Rencana Studi or KRS) from Indonesian higher education as a case study. Three strategies are compared: LLM-only, Hybrid Deterministic-LLM (regex & LLM), and a Camelot-based pipeline with LLM fallback. Experiments were conducted on 140 documents for the LLM-based test and 860 documents for the Camelot-based pipeline evaluation, covering four study programs with varying data in tables and metadata. Three 12-14B LLM models (Gemma 3, Phi 4, and Qwen 2.5) were run locally using Ollama and a consumer-grade CPU without a GPU. Evaluations used exact match (EM) and Levenshtein similarity (LS) metrics with a threshold of 0.7. Although not applicable to all models, the results show that the hybrid approach can improve efficiency compared to LLM-only, especially for deterministic metadata. The Camelot-based pipeline with LLM fallback produced the best combination of accuracy (EM and LS up to 0.99-1.00) and computational efficiency (less than 1 second per PDF in most cases). The Qwen 2.5:14b model demonstrated the most consistent performance across all scenarios. These findings confirm that integrating deterministic and LLM-based methods is a reliable and efficient strategy for information extraction from tabular text-based PDF documents in computationally constrained environments.
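
As a rough illustration of the two metrics named in this abstract, the sketch below scores one extracted field with exact match and a normalized Levenshtein similarity. The normalization (one minus edit distance divided by the longer string's length) and the helper names `levenshtein_similarity` and `score_field` are assumptions made for illustration; the abstract does not state how the authors normalize LS or apply the 0.7 threshold.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def score_field(predicted: str, gold: str, threshold: float = 0.7) -> dict:
    """Hypothetical per-field scorer combining EM and thresholded LS."""
    ls = levenshtein_similarity(predicted, gold)
    return {"exact_match": predicted == gold,
            "similarity": ls,
            "passes": ls >= threshold}


print(score_field("kitten", "sitting"))
```

Under this normalization a field can fail exact match yet still pass the similarity threshold, which is one plausible reason to report both metrics side by side.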

2603.21437 2026-05-25 cs.CL cs.IR 版本更新

Pooling and Semantic Shift: The Fundamental Challenges in Long Text Embedding and Retrieval

池化与语义偏移:长文本嵌入与检索中的根本挑战

Hang Gao, Wujiang Xu, Kai Mei, Dimitris N. Metaxas

发表机构 * Rutgers University(罗格斯大学)

AI总结 本文研究了基于Transformer的嵌入模型在长文本表示与检索中面临的两个根本性挑战:池化操作导致的嵌入坍缩和语义偏移。作者提出,嵌入质量下降并非单纯由文本长度或注意力机制引起,而是源于池化操作与内部语义偏移的共同作用,并建立了统一的理论框架加以证明。受控实验表明,语义偏移是嵌入高度集中化的主要预测因子,且各向异性仅在由强语义偏移引起时才会损害检索性能,为理解长文本嵌入难题提供了理论依据。

详情
AI中文摘要

基于Transformer的嵌入模型经常表现出几何病态,例如各向异性和长度诱导的表示崩溃,这会降低下游检索性能。虽然先前的工作通常将这些归因于文本长度或注意力机制,但我们认为根本驱动因素反而是固有的池化操作与内部语义偏移。在本文中,我们建立了一个统一的理论框架,证明上下文池化本质上会导致嵌入崩溃。具体来说,我们从数学上证明,对语义多样的句子进行池化不可避免地会导致微观层面的语义稀释,并严格降低向量空间的平均成对距离,从而保证宏观层面的空间集中。基于这些几何洞察,我们正式定义了语义偏移,以捕捉文本内部的自然语义演变和分散。通过跨多种模型和语料库的精心控制实验,我们将文本长度与语义内容分离。我们证明语义偏移是严重嵌入集中的主要预测因子。关键的是,我们的检索评估揭示,各向异性仅在由强语义偏移诱导时才有害,从而调和了先前文献中的矛盾观察,并为现代嵌入模型面临的长上下文挑战提供了原则性解释。

英文摘要

Transformer-based embedding models frequently exhibit geometric pathologies, such as anisotropy and length-induced representation collapse, which can degrade downstream retrieval performance. While prior work often attributes these issues directly to text length or attention mechanisms, we argue that the fundamental drivers are instead the inherent pooling operations coupled with internal semantic shift. In this paper, we establish a unified theoretical framework proving that contextual pooling intrinsically causes embedding collapse. Specifically, we mathematically prove that pooling semantically diverse sentences inevitably leads to micro-level semantic dilution, and strictly reduces the Mean Pairwise Distance of the vector space, guaranteeing macro-level spatial concentration. Grounded in these geometric insights, we formally define semantic shift to capture the natural semantic evolution and dispersion within a text. Through carefully controlled experiments across diverse models and corpora, we disentangle text length from semantic content. We demonstrate that semantic shift is the primary predictor of severe embedding concentration. Crucially, our retrieval evaluations reveal that anisotropy is fundamentally harmful only when induced by strong semantic shifts, reconciling conflicting observations in prior literature and offering a principled explanation for the long-context challenges faced by modern embedding models.
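
The claim that pooling diverse sentences lowers Mean Pairwise Distance can be seen in a toy setting. The sketch below is only an illustration under strong assumptions: "sentence embeddings" are i.i.d. Gaussian vectors standing in for semantically diverse sentences, documents are formed by mean pooling fixed-size groups, and distance is Euclidean. It is not the paper's proof or experimental setup.

```python
import math
import random


def mean_pairwise_distance(vectors):
    """Average Euclidean distance over all unordered pairs."""
    total, pairs = 0.0, 0
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            total += math.dist(vectors[i], vectors[j])
            pairs += 1
    return total / pairs


def mean_pool(vectors):
    """Coordinate-wise mean of a group of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


rng = random.Random(0)  # fixed seed so the run is repeatable
dim, n_docs, sents_per_doc = 16, 12, 8

# Toy "sentence embeddings": independent standard-normal vectors.
sentences = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
             for _ in range(n_docs * sents_per_doc)]

# Each "document embedding" is the mean of its sentence vectors.
docs = [mean_pool(sentences[k * sents_per_doc:(k + 1) * sents_per_doc])
        for k in range(n_docs)]

mpd_sentences = mean_pairwise_distance(sentences)
mpd_docs = mean_pairwise_distance(docs)
print(f"sentence-level MPD:  {mpd_sentences:.3f}")
print(f"pooled-document MPD: {mpd_docs:.3f}")
```

In this toy case averaging eight independent vectors shrinks each coordinate's variance by a factor of eight, so the pooled vectors sit closer together than the sentences they were built from.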

2603.19167 2026-05-25 cs.CL 版本更新

Evaluating Counterfactual Strategic Reasoning in Large Language Models

评估大型语言模型中的反事实策略推理

Dimitrios Georgousis, Maria Lymperaiou, Angeliki Dimitriou, Giorgos Filandrianos, Giorgos Stamou

发表机构 * National Technical University of Athens(雅典国立技术大学)

AI总结 本研究评估了大型语言模型在重复博弈场景中的策略性能,以判断其表现是基于真正的推理能力还是对记忆模式的依赖。研究引入了对经典博弈(如囚徒困境和石头剪刀布)的反事实变体,改变收益结构和行动标签,从而打破原有的对称性和支配关系。通过多维度评估框架,研究揭示了大型语言模型在激励敏感性、结构泛化和反事实环境中的策略推理方面存在局限性。

Comments Accepted at GEM@ACL 2026

详情
AI中文摘要

我们在重复博弈论环境中评估大型语言模型(LLM),以判断策略表现是否反映了真正的推理还是依赖于记忆模式。我们考虑两个经典博弈:囚徒困境(PD)和石头剪刀布(RPS),并引入反事实变体,改变收益结构和行动标签,打破熟悉的对称性和支配关系。我们的多指标评估框架比较了默认和反事实实例,展示了LLM在反事实环境中的激励敏感性、结构泛化和策略推理方面的局限性。

英文摘要

We evaluate Large Language Models (LLMs) in repeated game-theoretic settings to assess whether strategic performance reflects genuine reasoning or reliance on memorized patterns. We consider two canonical games, Prisoner's Dilemma (PD) and Rock-Paper-Scissors (RPS), upon which we introduce counterfactual variants that alter payoff structures and action labels, breaking familiar symmetries and dominance relations. Our multi-metric evaluation framework compares default and counterfactual instantiations, showcasing LLM limitations in incentive sensitivity, structural generalization and strategic reasoning within counterfactual environments.
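
To make "breaking dominance relations" concrete, the sketch below checks whether the row player has a strictly dominant action in a standard Prisoner's Dilemma and in a relabelled variant with altered payoffs. The payoff numbers, the action labels `alpha` and `beta`, and the helper `strictly_dominant_action` are hypothetical choices for illustration, not the variants used in the paper.

```python
def strictly_dominant_action(payoffs, actions):
    """Return the row player's strictly dominant action, or None.

    payoffs[(row_action, col_action)] = (row_payoff, col_payoff)
    """
    for a in actions:
        others = [b for b in actions if b != a]
        if all(payoffs[(a, c)][0] > payoffs[(b, c)][0]
               for b in others for c in actions):
            return a
    return None


# Textbook Prisoner's Dilemma: defecting is better whatever the opponent does.
standard_pd = {
    ("cooperate", "cooperate"): (3, 3),
    ("cooperate", "defect"): (0, 5),
    ("defect", "cooperate"): (5, 0),
    ("defect", "defect"): (1, 1),
}

# Hypothetical counterfactual: neutral labels and payoffs where the best
# action depends on what the opponent plays, so no action dominates.
counterfactual = {
    ("alpha", "alpha"): (5, 5),
    ("alpha", "beta"): (0, 3),
    ("beta", "alpha"): (3, 0),
    ("beta", "beta"): (1, 1),
}

print(strictly_dominant_action(standard_pd, ["cooperate", "defect"]))
print(strictly_dominant_action(counterfactual, ["alpha", "beta"]))
```

A model that has only memorized "defect is dominant" would carry that answer into the second game, where it no longer holds.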

2603.14027 2026-05-25 cs.CL 版本更新

SemEval-2026 Task 6: CLARITY -- Unmasking Political Question Evasions

SemEval-2026 任务 6:CLARITY——揭露政治问题回避

Konstantinos Thomas, Giorgos Filandrianos, Maria Lymperaiou, Chrysoula Zerva, Giorgos Stamou

发表机构 * National Technical University of Athens(雅典国立技术大学) Instituto de Telecomunicações(电信研究所) Instituto Superior Técnico, Universidade de Lisboa(里斯本大学高等理工学院) ELLIS Unit Lisbon(ELLIS里斯本分部)

AI总结 SemEval-2026任务6(CLARITY)旨在识别政治发言中对问题的回避性回答,即发言者在保持表面回应性的同时回避直接作答的现象。该任务包含两个子任务:一是对回答清晰度进行分类,二是对九种具体回避策略进行细粒度识别。该基准数据集基于美国总统采访构建,采用专家定义的分类体系,结果显示大语言模型提示和分层利用分类体系是有效方法,且回避策略分类比清晰度分类更具挑战性(最佳宏F1分别为0.68和0.89)。

Comments SemEval 2026 (Task organizers)

详情
AI中文摘要

政治演讲者常常在保持回应表象的同时避免直接回答问题。尽管这对公共话语至关重要,但这种策略性回避在自然语言处理中仍未得到充分探索。我们介绍了 SemEval-2026 任务 6,CLARITY,一个关于政治问题回避的共享任务,包含两个子任务:(i) 清晰度级别分类,分为明确回答(Clear Reply)、模棱两可(Ambivalent)和明确不回答(Clear Non-Reply);(ii) 回避级别分类,分为九种细粒度回避策略。该基准来自美国总统访谈,并遵循基于专家的回应清晰度和回避分类体系。该任务吸引了 124 个注册团队,他们提交了 946 个有效运行用于清晰度级别分类,539 个用于回避级别分类。结果显示两个子任务之间存在显著的难度差距:最佳系统在清晰度分类上达到了 0.89 的宏 F1,大幅超过最强基线,而顶级回避系统达到了 0.68 的宏 F1,与最佳基线持平。总体而言,大语言模型提示和分类体系的层级利用成为最有效的策略,顶级系统始终优于那些独立处理两个子任务的系统。CLARITY 将政治回应回避确立为计算话语分析的一个具有挑战性的基准,并突显了建模政治语言中策略性模糊的难度。

英文摘要

Political speakers often avoid answering questions directly while maintaining the appearance of responsiveness. Despite its importance for public discourse, such strategic evasion remains underexplored in Natural Language Processing. We introduce SemEval-2026 Task 6, CLARITY, a shared task on political question evasion consisting of two subtasks: (i) clarity-level classification into Clear Reply, Ambivalent, and Clear Non-Reply, and (ii) evasion-level classification into nine fine-grained evasion strategies. The benchmark is constructed from U.S. presidential interviews and follows an expert-grounded taxonomy of response clarity and evasion. The task attracted 124 registered teams, who submitted 946 valid runs for clarity-level classification and 539 for evasion-level classification. Results show a substantial gap in difficulty between the two subtasks: the best system achieved 0.89 macro-F1 on clarity classification, surpassing the strongest baseline by a large margin, while the top evasion-level system reached 0.68 macro-F1, matching the best baseline. Overall, large language model prompting and hierarchical exploitation of the taxonomy emerged as the most effective strategies, with top systems consistently outperforming those that treated the two subtasks independently. CLARITY establishes political response evasion as a challenging benchmark for computational discourse analysis and highlights the difficulty of modeling strategic ambiguity in political language.
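
Both subtasks are scored with macro-F1, which averages per-class F1 so that rare classes count as much as frequent ones. The sketch below is a minimal pure-Python version on a made-up six-example clarity-level sample; the example labels and the convention of scoring a class with no true or predicted instances as 0 are assumptions, not the task's official scorer.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Made-up gold and predicted clarity labels for six responses.
gold = ["Clear Reply", "Clear Reply", "Ambivalent",
        "Ambivalent", "Clear Non-Reply", "Clear Non-Reply"]
pred = ["Clear Reply", "Ambivalent", "Ambivalent",
        "Ambivalent", "Clear Non-Reply", "Clear Reply"]

print(f"macro-F1 = {macro_f1(gold, pred):.4f}")
```

Because every class contributes equally, a system that ignores one of the nine evasion strategies is penalized more under macro-F1 than it would be under plain accuracy.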

2602.19174 2026-05-25 cs.CL 版本更新

TurkicNLP: An NLP Toolkit for Turkic Languages

TurkicNLP:突厥语言的自然语言处理工具包

Sherzod Hakimov

发表机构 * Computational Linguistics, University of Potsdam(波茨坦大学计算语言学系)

AI总结 本文介绍了TurkicNLP,一个面向突厥语系的开源自然语言处理工具包,旨在解决该语系语言处理工具和资源分散的问题。该工具包支持四种书写系统,提供统一的NLP流程,包括分词、形态分析、词性标注、依存句法分析等功能,并采用模块化架构整合规则和神经模型,实现自动脚本检测与转换。其输出遵循CoNLL-U标准,便于与其他系统兼容与扩展。

Comments The toolkit is available here: https://github.com/turkic-nlp/turkicnlp

详情
AI中文摘要

突厥语族由欧亚大陆超过2亿人使用,其自然语言处理仍然碎片化,大多数语言缺乏统一的工具和资源。我们提出TurkicNLP,一个开源的Python库,为四种文字体系(拉丁、西里尔、波斯-阿拉伯和古突厥如尼文)的突厥语言提供单一、一致的NLP流水线。该库通过一个语言无关的API覆盖分词、形态分析、词性标注、依存句法分析、命名实体识别、双向文字转写、跨语言句子嵌入和机器翻译。模块化多后端架构透明地集成了基于规则的有限状态转换器和神经模型,并具备自动文字检测和文字变体路由功能。输出遵循CoNLL-U标准,以实现完全互操作性和扩展性。代码和文档托管于https://github.com/turkic-nlp/turkicnlp。

英文摘要

Natural language processing for the Turkic language family, spoken by over 200 million people across Eurasia, remains fragmented, with most languages lacking unified tooling and resources. We present TurkicNLP, an open-source Python library providing a single, consistent NLP pipeline for Turkic languages across four script families: Latin, Cyrillic, Perso-Arabic, and Old Turkic Runic. The library covers tokenization, morphological analysis, part-of-speech tagging, dependency parsing, named entity recognition, bidirectional script transliteration, cross-lingual sentence embeddings, and machine translation through one language-agnostic API. A modular multi-backend architecture integrates rule-based finite-state transducers and neural models transparently, with automatic script detection and routing between script variants. Outputs follow the CoNLL-U standard for full interoperability and extension. Code and documentation are hosted at https://github.com/turkic-nlp/turkicnlp .

2602.18788 2026-05-25 cs.CL 版本更新

BURMESE-SAN: Burmese NLP Benchmark for Evaluating Large Language Models

BURMESE-SAN: 评估大语言模型的缅甸语NLP基准

Thura Aung, Jann Railey Montalan, Jian Gang Ngui, Peerat Limkonchotiwat

AI总结 本文介绍了BURMESE-SAN,这是首个系统评估大型语言模型在缅甸语自然语言理解、推理和生成能力的综合性基准。该基准包含七项子任务,涵盖问答、情感分析、因果推理等多个领域,并通过严格的母语者参与流程构建,确保语言自然性和文化真实性。研究发现,缅甸语模型性能更依赖于架构设计、语言表示和指令微调,而非模型规模,并指出区域微调和新一代模型能显著提升效果。

详情
AI中文摘要

我们引入了BURMESE-SAN,这是第一个系统性评估大语言模型(LLM)在缅甸语上三种核心NLP能力:理解(NLU)、推理(NLR)和生成(NLG)的全面基准。BURMESE-SAN整合了涵盖这些能力的七个子任务,包括问答、情感分析、毒性检测、因果推理、自然语言推理、抽象摘要和机器翻译,其中多个任务此前在缅甸语中不可用。该基准通过严格的母语者驱动流程构建,以确保语言自然性、流畅性和文化真实性,同时最小化翻译引起的伪影。我们对开源和商业LLM进行了大规模评估,以考察缅甸语建模中因预训练覆盖有限、丰富形态和句法变异带来的挑战。我们的结果表明,缅甸语性能更多地依赖于架构设计、语言表示和指令微调,而非仅模型规模。特别是,东南亚区域微调和更新的模型世代带来了显著提升。最后,我们发布BURMESE-SAN作为公共排行榜,以支持缅甸语及其他低资源语言的系统评估和持续进步。https://leaderboard.sea-lion.ai/detailed/MY

英文摘要

We introduce BURMESE-SAN, the first holistic benchmark that systematically evaluates large language models (LLMs) for Burmese across three core NLP competencies: understanding (NLU), reasoning (NLR), and generation (NLG). BURMESE-SAN consolidates seven subtasks spanning these competencies, including Question Answering, Sentiment Analysis, Toxicity Detection, Causal Reasoning, Natural Language Inference, Abstractive Summarization, and Machine Translation, several of which were previously unavailable for Burmese. The benchmark is constructed through a rigorous native-speaker-driven process to ensure linguistic naturalness, fluency, and cultural authenticity while minimizing translation-induced artifacts. We conduct a large-scale evaluation of both open-weight and commercial LLMs to examine challenges in Burmese modeling arising from limited pretraining coverage, rich morphology, and syntactic variation. Our results show that Burmese performance depends more on architectural design, language representation, and instruction tuning than on model scale alone. In particular, Southeast Asia regional fine-tuning and newer model generations yield substantial gains. Finally, we release BURMESE-SAN as a public leaderboard to support systematic evaluation and sustained progress in Burmese and other low-resource languages. https://leaderboard.sea-lion.ai/detailed/MY

2602.18176 2026-05-25 cs.CL 版本更新

Improving Sampling for Masked Diffusion Models via Information Gain

通过信息增益改进掩码扩散模型的采样

Kaisen Yang, Jayden Teoh, Kaicheng Yang, Yitong Zhang, Alex Lamb

发表机构 * Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Massachusetts Institute of Technology(麻省理工学院) Shanghai Jiao Tong University(上海交通大学) Beihang University(北航) College of AI, Tsinghua University(清华大学人工智能学院)

AI总结 该论文研究了如何改进掩码扩散模型(MDMs)的采样过程,指出现有采样方法过于贪心,仅关注局部确定性而忽视了后续影响,导致生成结果不确定性增加。为此,作者提出了一种无需训练的解码方法——信息增益采样器(Info-Gain Sampler),通过利用MDMs的双向结构,在当前不确定性和剩余位置的信息增益之间取得平衡。实验表明,该方法在推理、编码、创意写作和图像生成等任务中均优于现有方法,显著提升了生成质量。

Comments https://github.com/yks23/Information-Gain-Sampler Accepted by ICML 2026

详情
AI中文摘要

掩码扩散模型(MDMs)支持灵活的解码顺序,但现有采样器大多是贪婪的,仅选择局部确定的token而不考虑其下游影响。我们表明这种短视行为会增加累积不确定性并导致次优生成。为解决此问题,我们提出Info-Gain采样器,一种无需训练的解码方法,利用MDMs的双向结构平衡即时不确定性与剩余掩码位置获得的信息增益。在推理、编码、创意写作和图像生成任务中,Info-Gain采样器持续优于现有MDM采样器,平均推理准确率提升2.9-11.6个百分点,创意写作平均胜率达到62.8%。代码可在https://github.com/yks23/Information-Gain-Sampler获取。

英文摘要

Masked Diffusion Models (MDMs) enable flexible decoding orders, yet existing samplers remain largely greedy, selecting locally certain tokens without accounting for their downstream effects. We show that this myopia can increase cumulative uncertainty and lead to suboptimal generation. To address this, we propose the Info-Gain Sampler, a training-free decoding method that uses the bidirectional structure of MDMs to balance immediate uncertainty with the information gained over remaining masked positions. Across reasoning, coding, creative writing, and image generation tasks, Info-Gain Sampler consistently outperforms existing MDM samplers, improving average reasoning accuracy by 2.9-11.6 percentage points and achieving a 62.8% average win rate in creative writing. The code is available at https://github.com/yks23/Information-Gain-Sampler.

2602.17653 2026-05-25 cs.CL 版本更新

Differences in Typological Alignment in Language Models' Treatment of Differential Argument Marking

语言模型处理差异论元标记中的类型学对齐差异

Iskar Deng, Nathalia Xu, Shane Steinert-Threlkeld

发表机构 * University of Washington(华盛顿大学)

AI总结 本文研究语言模型在处理差异论元标记(DAM)时表现出的类型学对齐差异。通过在18个实现不同DAM系统的合成语料库上训练GPT-2模型,并使用最小对进行评估,研究发现模型在DAM的自然标记方向上表现出与人类语言相似的偏好,即更倾向于对语义不典型的论元进行显性标记,但未能复现人类语言中常见的宾语偏好(显性标记更多针对宾语而非主语)。这一结果表明,不同类型学倾向可能源于不同的底层机制。

Comments 16 pages, 8 figures, 7 tables. To appear at CoNLL 2026

详情
AI中文摘要

近期研究表明,在合成语料上训练的语言模型可以展现出类似人类语言跨语言规律的类型学偏好,特别是对于语序等句法现象。本文将此范式扩展到差异论元标记(DAM),一种形态标记取决于语义显著性的语义许可系统。使用受控合成学习方法,我们在18个实现不同DAM系统的语料上训练GPT-2模型,并通过最小对评估其泛化能力。结果揭示了DAM的两个类型学维度之间的分离。模型可靠地展现出对自然标记方向的人类偏好,倾向于那些显性标记针对语义非典型论元的系统。相比之下,模型并未复现人类语言中强烈的宾语偏好,即在DAM中显性标记更常针对宾语而非主语。这些发现表明,不同的类型学倾向可能源于不同的潜在来源。

英文摘要

Recent work has shown that language models (LMs) trained on synthetic corpora can exhibit typological preferences that resemble cross-linguistic regularities in human languages, particularly for syntactic phenomena such as word order. In this paper, we extend this paradigm to differential argument marking (DAM), a semantic licensing system in which morphological marking depends on semantic prominence. Using a controlled synthetic learning method, we train GPT-2 models on 18 corpora implementing distinct DAM systems and evaluate their generalization using minimal pairs. Our results reveal a dissociation between two typological dimensions of DAM. Models reliably exhibit human-like preferences for natural markedness direction, favoring systems in which overt marking targets semantically atypical arguments. In contrast, models do not reproduce the strong object preference in human languages, in which overt marking in DAM more often targets objects rather than subjects. These findings suggest that different typological tendencies may arise from distinct underlying sources.

2602.12316 2026-05-25 cs.AI cs.CL cs.CY cs.GT cs.MA 版本更新

GT-HarmBench: Benchmarking AI Safety Risks Through the Lens of Game Theory

GT-HarmBench:通过博弈论视角评估AI安全风险

Pepijn Cobben, Xuanqiang Angelo Huang, Thao Amelia Pham, Isabel Dahlgren, Terry Jingchen Zhang, Zhijing Jin

发表机构 * ETH Zürich(苏黎世联邦理工学院) Berea College(伯里亚学院) University of Toronto(多伦多大学) Vector Institute(向量研究所) Max Planck Institute for Intelligent Systems, Tübingen, Germany(德国图宾根马克斯·普朗克智能系统研究所)

AI总结 本文提出GT-HarmBench,一个用于评估前沿AI系统在多智能体高风险场景中安全性的基准测试,涵盖博弈论中的经典场景如囚徒困境、猎鹿博弈(Stag Hunt)和斗鸡博弈(Chicken)。研究发现,现有AI模型在38%的高风险情境中无法选择对社会有益的行动,揭示了多智能体环境下对齐问题的严重性。研究还表明,博弈论干预最多可将社会有益结果提升18%,并为多智能体AI安全研究提供了标准化测试平台。

详情
AI中文摘要

前沿AI系统能力日益增强,并部署在高风险的多智能体环境中。然而,现有的AI安全基准主要评估单一智能体,导致对协调失败和冲突等多智能体风险的理解不足。我们引入了GT-HarmBench,这是一个包含1535个高风险场景的基准,涵盖了囚徒困境、猎鹿博弈和斗鸡博弈等博弈论结构。场景来源于MIT AI风险库中的现实AI风险背景。在15个前沿模型中,智能体在38%的高风险案例中未能选择对社会有益的行为,例如军事升级、选举操纵和医疗事故。我们测量了对博弈论提示框架和顺序的敏感性,并分析了导致失败的推理模式。我们进一步表明,博弈论干预可将社会有益结果提升高达18%。我们的结果突出了显著的可靠性差距,并为研究多智能体环境中的对齐提供了一个广泛的标准化测试平台。该基准和代码可在https://github.com/causalNLP/gt-harmbench获取。

英文摘要

Frontier AI systems are increasingly capable and deployed in high-stakes multi-agent environments. However, existing AI safety benchmarks largely evaluate single agents, leaving multi-agent risks such as coordination failure and conflict poorly understood. We introduce GT-HarmBench, a benchmark of 1,535 high-stakes scenarios spanning game-theoretic structures such as the Prisoner's Dilemma, Stag Hunt and Chicken. Scenarios are drawn from realistic AI risk contexts in the MIT AI Risk Repository. Across 15 frontier models, agents fail to choose socially beneficial actions in 38% of high-stakes cases, such as military escalation, election manipulation, and medical malpractice. We measure sensitivity to game-theoretic prompt framing and ordering, and analyze reasoning patterns driving failures. We further show that game-theoretic interventions improve socially beneficial outcomes by up to 18%. Our results highlight substantial reliability gaps and provide a broad standardized testbed for studying alignment in multi-agent environments. The benchmark and code are available at https://github.com/causalNLP/gt-harmbench.
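
The three game structures named in the abstract differ in how individually rational play relates to the socially best outcome. The following is a minimal sketch of that gap, not the benchmark's own scoring code: the payoff numbers, the `GAMES` table, and the helper names are illustrative assumptions chosen only to satisfy the textbook ordering of each game.

```python
from itertools import product

# Illustrative payoffs (row player, column player); these numbers are assumptions,
# not values from GT-HarmBench. Action 0 = cooperate/yield, 1 = defect/escalate.
GAMES = {
    "prisoners_dilemma": {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)},
    "stag_hunt": {(0, 0): (4, 4), (0, 1): (0, 3), (1, 0): (3, 0), (1, 1): (3, 3)},
    "chicken": {(0, 0): (3, 3), (0, 1): (1, 5), (1, 0): (5, 1), (1, 1): (0, 0)},
}


def pure_nash(payoffs):
    """Return action pairs from which neither player gains by deviating alone."""
    equilibria = []
    for a, b in product((0, 1), repeat=2):
        row_ok = all(payoffs[(a, b)][0] >= payoffs[(alt, b)][0] for alt in (0, 1))
        col_ok = all(payoffs[(a, b)][1] >= payoffs[(a, alt)][1] for alt in (0, 1))
        if row_ok and col_ok:
            equilibria.append((a, b))
    return equilibria


def welfare_optimum(payoffs):
    """Return the first action pair that maximizes the sum of both payoffs."""
    return max(payoffs, key=lambda actions: sum(payoffs[actions]))


for name, payoffs in GAMES.items():
    print(name, "equilibria:", pure_nash(payoffs), "welfare optimum:", welfare_optimum(payoffs))
```

With these assumed payoffs, mutual cooperation maximizes total welfare in all three games, but it is a pure equilibrium only in the Stag Hunt, which is one way to see why a "socially beneficial action" can conflict with self-interested play.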

2602.11243 2026-05-25 cs.LG cs.CL 版本更新

Evaluating Memory Structure in LLM Agents

评估LLM智能体中的记忆结构

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

发表机构 * HSE University(莫斯科国立高等经济学院) Yandex YSDA New Economic School(新经济学院)

AI总结 本文研究了基于大语言模型(LLM)的智能体在长期记忆结构组织方面的能力,提出了一个名为 StructMemEval 的新基准,用于评估智能体组织长期记忆的结构化能力,而不仅仅是事实记忆或简单检索。该基准包含一系列需要结构化知识组织的任务,如事务账本、待办事项列表等。实验表明,普通检索增强型 LLM 在未明确提示下难以处理这些任务,而具备结构化记忆框架的智能体则能更有效地完成任务,突显了改进 LLM 训练和记忆架构的重要性。

Comments Preprint, work in progress

详情
AI中文摘要

现代基于LLM的智能体和聊天助手依赖长期记忆框架来存储可重用知识、回忆用户偏好并增强推理。随着研究人员创建更复杂的记忆架构,分析其能力并指导未来记忆设计变得越来越困难。大多数长期记忆基准侧重于简单事实保留、多跳回忆和基于时间的变化。虽然这些能力无疑很重要,但通常可以通过简单的检索增强LLM实现,并且不测试复杂的记忆层次。为了弥补这一差距,我们提出了StructMemEval——一个测试智能体组织其长期记忆能力(而不仅仅是事实回忆)的基准。我们收集了一系列人类通过以特定结构组织知识来解决的任务:交易账本、待办事项列表、树结构等。我们的初步实验表明,简单的检索增强LLM在这些任务上表现困难,而记忆智能体在提示如何组织记忆时可以可靠地解决它们。然而,我们还发现,现代LLM在未被提示时并不总是能识别记忆结构。这突显了未来在LLM训练和记忆框架改进方面的一个重要方向。

英文摘要

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval - a benchmark that tests the agent's ability to organize its long-term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them if prompted how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.

2602.08404 2026-05-25 cs.CL 版本更新

TEAM: Temporal-Spatial Consistency Guided Expert Activation for MoE Diffusion Language Model Acceleration

TEAM: 时空一致性引导的专家激活用于MoE扩散语言模型加速

Linye Wei, Zixiang Luo, Pingzhi Tang, Meng Li

发表机构 * Institute for Artificial Intelligence(人工智能研究院) School of Integrated Circuits(集成电路学院) Yuanpei College(元培学院) Beijing Advanced Innovation Center for Integrated Circuits(北京集成电路先进创新中心)

AI总结 该论文提出了一种名为TEAM的框架,旨在加速基于MoE架构的扩散语言模型(dLLMs)。研究发现,现有MoE dLLMs在扩散解码过程中激活大量专家,但最终仅接受少量token,导致推理开销大。TEAM通过利用专家激活在时间和空间上的一致性,采用三种互补策略,在保证性能的同时减少激活专家数量,从而实现高达2.2倍的加速效果。

Comments Accepted by ICML 2026

详情
AI中文摘要

扩散大语言模型(dLLM)因其天然支持并行解码而近期受到广泛关注。基于这一范式,具有自回归(AR)初始化的混合专家(MoE)dLLM进一步展示了与主流AR模型相媲美的强劲性能。然而,我们发现MoE架构与基于扩散的解码之间存在根本性不匹配。具体来说,每个去噪步骤中激活了大量专家,而最终只有一小部分token被接受,导致大量推理开销,限制了其在延迟敏感应用中的部署。在这项工作中,我们提出了TEAM,一个即插即用的框架,通过更少的激活专家实现更多被接受的token,从而加速MoE dLLM。TEAM的动机源于观察到专家路由决策在去噪层级间表现出强时间一致性,以及在token位置间表现出强空间一致性。利用这些特性,TEAM采用了三种互补的专家激活和解码策略,保守地选择解码和掩码token所需的专家,同时跨多个候选进行积极的推测性探索。实验结果表明,TEAM在性能下降可忽略的情况下,实现了相比原始MoE dLLM高达2.2倍的加速。代码已发布在https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM。

英文摘要

Diffusion large language models (dLLMs) have recently gained significant attention due to their inherent support for parallel decoding. Building on this paradigm, Mixture-of-Experts (MoE) dLLMs with autoregressive (AR) initialization have further demonstrated strong performance competitive with mainstream AR models. However, we identify a fundamental mismatch between MoE architectures and diffusion-based decoding. Specifically, a large number of experts are activated at each denoising step, while only a small subset of tokens is ultimately accepted, resulting in substantial inference overhead and limiting their deployment in latency-sensitive applications. In this work, we propose TEAM, a plug-and-play framework that accelerates MoE dLLMs by enabling more accepted tokens with fewer activated experts. TEAM is motivated by the observation that expert routing decisions exhibit strong temporal consistency across denoising levels as well as spatial consistency across token positions. Leveraging these properties, TEAM employs three complementary expert activation and decoding strategies, conservatively selecting necessary experts for decoded and masked tokens and simultaneously performing aggressive speculative exploration across multiple candidates. Experimental results demonstrate that TEAM achieves up to 2.2x speedup over vanilla MoE dLLM, with negligible performance degradation. Code is released at https://github.com/PKU-SEC-Lab/TEAM-MoE-dLLM.
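
The temporal consistency the abstract refers to can be made concrete by comparing which experts a router selects for the same token at two consecutive denoising steps. The sketch below is an illustration under stated assumptions, not TEAM's algorithm: the router logits are made-up numbers, and `top_k_experts` and `jaccard` are hypothetical helper names.

```python
import numpy as np


def top_k_experts(router_logits, k):
    """Return the set of the k highest-scoring expert indices for one token."""
    return set(np.argsort(router_logits)[-k:].tolist())


def jaccard(a, b):
    """Overlap between two expert sets: |a & b| / |a | b|."""
    return len(a & b) / len(a | b)


# Hypothetical router logits for one token position over 6 experts.
logits_step_t = np.array([2.0, 0.1, 1.5, -0.3, 0.9, 0.0])
logits_step_t1 = np.array([1.8, 0.2, 1.6, -0.1, 0.4, 0.7])

experts_t = top_k_experts(logits_step_t, k=3)    # {0, 2, 4}
experts_t1 = top_k_experts(logits_step_t1, k=3)  # {0, 2, 5}
print(jaccard(experts_t, experts_t1))            # 0.5
```

A high overlap across steps would suggest that expert selections from an earlier step can be reused or narrowed, which is the kind of redundancy the abstract says TEAM exploits.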

2602.00979 2026-05-25 cs.CR cs.AI cs.CL 版本更新

GradingAttack: Exposing Security Vulnerabilities in LLM Based Educational Grading Agents

GradingAttack: 揭示基于LLM的教育评分代理中的安全漏洞

Xueyi Li, Zhuoneng Zhou, Zitao Liu, Yongdong Wu

发表机构 * Guangdong Institute of Smart Education(广东智能教育研究院) Jinan University(暨南大学)

AI总结 随着大型语言模型(LLM)在自动短答案评分中的广泛应用,其安全性问题日益受到关注。本文提出GradingAttack,一种细粒度的对抗攻击框架,用于系统评估基于LLM的教育评分代理的安全漏洞。通过设计基于词元和提示的攻击策略,该方法在保持高隐蔽性的同时有效操控评分结果,揭示了当前系统在防御对抗攻击方面的不足,突显了构建安全可信教育代理系统的重要性。

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为教育代理,用于实际教育环境中的自动简答题评分(ASAG),显著提升了评估效率和可扩展性。然而,当这些评分代理“在野外”运行时,它们对对抗性操纵的脆弱性引发了对代理安全性和可信度的关键担忧。在本文中,我们介绍了GradingAttack,一个细粒度的对抗攻击框架,系统地评估基于LLM的教育评分代理的安全漏洞。具体来说,我们设计了token级和prompt级攻击策略,在保持高隐蔽性的同时操纵代理评分结果,揭示了当前代理部署中的根本弱点。在多个数据集上的实验表明,两种攻击策略都能有效破坏评分代理,其中prompt级攻击成功率更高,而token级攻击具有更优的隐蔽性。我们的发现表明,当前的基于LLM的教育代理缺乏针对对抗性攻击的鲁棒防御,突显了为关键教育应用开发安全可信的代理系统的紧迫性。

英文摘要

Large language models (LLMs) are increasingly deployed as educational agents for automatic short answer grading (ASAG) in real-world educational environments, significantly boosting assessment efficiency and scalability. However, when these grading agents operate "in the wild", their vulnerability to adversarial manipulation raises critical concerns about agent security and trustworthiness. In this paper, we introduce GradingAttack, a fine-grained adversarial attack framework that systematically evaluates the security vulnerabilities of LLM-based educational grading agents. Specifically, we design token-level and prompt-level attack strategies that manipulate agent grading outcomes while maintaining high stealth, exposing fundamental weaknesses in current agent deployments. Experiments on multiple datasets demonstrate that both attack strategies effectively compromise grading agents, with prompt-level attacks achieving higher success rates and token-level attacks exhibiting superior stealth capability. Our findings reveal that current LLM-based educational agents lack robust defenses against adversarial attacks, underscoring the urgent need for developing secure and trustworthy agent systems for critical educational applications.

2601.21766 2026-05-25 cs.CL cs.AI 版本更新

CoFrGeNet: Continued Fraction Architectures for Language Generation

CoFrGeNet:用于语言生成的连分式架构

Amit Dhurandhar, Vijil Chenthamarakshan, Dennis Wei, Tejaswini Pedapati, Karthikeyan Natesan Ramamurthy, Rahul Nair

发表机构 * IBM Research(IBM研究院)

AI总结 本文提出了一种基于连分数结构的新型生成模型架构CoFrGeNet,用于替代传统Transformer中的多头注意力和前馈网络模块,显著减少了参数量。该方法通过自定义梯度计算提升训练效率,并在多个大规模语言模型(如GPT2-xl和Llama3)上验证了其有效性,实验表明在保持甚至提升任务性能的同时,参数规模可减少至原模型的三分之二到二分之一,且预训练时间更短。

Comments Earlier version accepted to ICML 2026

详情
AI中文摘要

Transformer可以说是语言生成的首选架构。本文受连分式启发,引入了一种用于生成建模的新函数类。实现该函数类的架构族称为CoFrGeNets——连分式生成网络。我们基于该函数类设计了新颖的架构组件,可以替换Transformer块中的多头注意力和前馈网络,同时需要的参数少得多。我们推导了自定义梯度公式,以比使用标准PyTorch梯度更准确、更高效地优化所提出的组件。我们的组件是即插即用的替换,几乎不需要改变已为基于Transformer的模型建立的训练或推理过程,从而使我们的方法易于集成到大型工业工作流中。我们在两个非常不同的Transformer架构GPT2-xl(1.5B)和Llama3(3.2B)上进行了实验,前者我们在OpenWebText和GneissWeb上预训练,后者我们在docling数据混合(包含九个不同数据集)上预训练。结果表明,我们的模型在下游分类、问答、推理和文本理解任务上的性能与原始模型相当,有时甚至更优,而参数量仅为原始模型的2/3到1/2,预训练时间更短。我们相信,未来针对硬件定制的实现将进一步发挥我们架构的真正潜力。

英文摘要

Transformers are arguably the preferred architecture for language generation. In this paper, inspired by continued fractions, we introduce a new function class for generative modeling. The architecture family implementing this function class is named CoFrGeNets - Continued Fraction Generative Networks. We design novel architectural components based on this function class that can replace Multi-head Attention and Feed-Forward Networks in Transformer blocks while requiring far fewer parameters. We derive custom gradient formulations to optimize the proposed components more accurately and efficiently than using standard PyTorch-based gradients. Our components are a plug-in replacement requiring little change in the training or inference procedures already in place for Transformer-based models, making our approach easy to incorporate in large industrial workflows. We experiment on two very different Transformer architectures, GPT2-xl (1.5B) and Llama3 (3.2B): we pre-train the former on OpenWebText and GneissWeb and the latter on the docling data mix, which consists of nine different datasets. Results show that the performance of our models on downstream classification, Q&A, reasoning and text understanding tasks is competitive with, and sometimes even superior to, the original models with $\frac{2}{3}$ to $\frac{1}{2}$ the parameters and shorter pre-training time. We believe that future implementations customized to hardware will further bring out the true potential of our architectures.
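
For readers unfamiliar with the mathematical object behind the name, a finite continued fraction has the form a0 + 1/(a1 + 1/(a2 + ...)). The sketch below only evaluates such an expression with exact rational arithmetic; it is not the CoFrGeNet architecture, whose parameterization the abstract does not spell out, and `continued_fraction` is a hypothetical helper name.

```python
from fractions import Fraction


def continued_fraction(coeffs):
    """Evaluate a0 + 1/(a1 + 1/(a2 + ...)) from the innermost term outward."""
    value = Fraction(coeffs[-1])
    for a in reversed(coeffs[:-1]):
        value = a + 1 / value
    return value


# [1; 2, 2, 2] is a short truncation of the continued fraction of sqrt(2).
approx = continued_fraction([1, 2, 2, 2])
print(approx, float(approx))  # 17/12 1.4166666666666667
```

Evaluating from the innermost coefficient outward needs only one division per coefficient.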

2601.15224 2026-05-25 cs.CV cs.CL 版本更新

PROGRESSLM: Towards Progress Reasoning in Vision-Language Models

PROGRESSLM: 迈向视觉-语言模型中的进度推理

Jianshu Zhang, Chengxuan Qian, Haosen Sun, Haoran Lu, Dingcheng Wang, Letian Xue, Han Liu

发表机构 * Northwestern University(西北大学) University of California, Santa Barbara(加州大学圣巴巴拉分校)

AI总结 该论文提出ProgressLM,旨在解决视觉语言模型在任务进展推理方面的能力不足问题。研究引入了Progress-Bench基准,用于系统评估模型对任务进展的推理能力,并提出了一种受人类启发的两阶段推理范式,通过训练无关的提示和基于数据集ProgressLM-45K的训练方法进行探索。实验表明,大多数现有模型在任务进展估计上表现有限,而基于训练的ProgressLM-3B即使在小规模下也取得了稳定提升,显示出良好的泛化能力。

Comments ACL 2026 Camera Ready Version

详情
AI中文摘要

估计任务进度需要对长期动态进行推理,而非仅识别静态视觉内容。尽管现代视觉-语言模型(VLM)擅长描述可见内容,但它们能否从部分观测中推断任务进展程度仍不清楚。为此,我们引入了Progress-Bench,一个用于系统评估VLM进度推理能力的基准。除基准测试外,我们进一步探索了一种受人类启发的两阶段进度推理范式,包括无训练提示方法和基于训练的方法,后者基于精心策划的数据集ProgressLM-45K。对14个VLM的实验表明,大多数模型尚未准备好进行任务进度估计,对演示模态和视角变化敏感,且难以处理不可回答的情况。虽然强制结构化进度推理的无训练提示仅带来有限且依赖模型的改进,但基于训练的ProgressLM-3B即使在小型模型规模下也取得了一致的改进,尽管其训练任务集与评估任务完全不相交。进一步分析揭示了特征性错误模式,并阐明了进度推理何时以及为何成功或失败。网站:https://progresslm.github.io/ProgressLM/

英文摘要

Estimating task progress requires reasoning over long-horizon dynamics rather than recognizing static visual content. While modern Vision-Language Models (VLMs) excel at describing what is visible, it remains unclear whether they can infer how far a task has progressed from partial observations. To this end, we introduce Progress-Bench, a benchmark for systematically evaluating progress reasoning in VLMs. Beyond benchmarking, we further explore a human-inspired two-stage progress reasoning paradigm through both training-free prompting and a training-based approach built on the curated dataset ProgressLM-45K. Experiments on 14 VLMs show that most models are not yet ready for task progress estimation, exhibiting sensitivity to demonstration modality and viewpoint changes, as well as poor handling of unanswerable cases. While training-free prompting that enforces structured progress reasoning yields limited and model-dependent gains, the training-based ProgressLM-3B achieves consistent improvements even at a small model scale, despite being trained on a task set fully disjoint from the evaluation tasks. Further analyses reveal characteristic error patterns and clarify when and why progress reasoning succeeds or fails. Website: https://progresslm.github.io/ProgressLM/

2601.03260 2026-05-25 cs.CE cs.CL 版本更新

SciNet: Evaluating AI Agents in Relation-Aware Scientific Literature Retrieval

SciNet: 评估关系感知科学文献检索中的AI代理

Chenyang Shao, Fengli Xu, Yong Li

发表机构 * Department of Electronic Engineering, BNRist, Tsinghua University, Beijing, China(中国北京,清华大学电子工程系,BNRist) Zhongguancun Academy(中关村学院)

AI总结 该研究提出 SciNet,一个用于科学文献检索的首个关系感知数据集,旨在解决现有 AI 检索代理在理解科学论文间复杂关系方面的不足。SciNet 基于包含 2.69 亿篇论文的元数据库构建,包含 8,940 个精心设计的任务,涵盖从个体论文检索到科学演化路径重建的多层次关系理解。实验表明,现有检索代理在关系感知任务上的准确率普遍低于 20%,而结合 SciNet 的代理在文献综述任务中质量提升了 25.3%,凸显了关系感知检索对深化科学洞察的重要性。

详情
AI中文摘要

AI代理在科学研究的文献检索中已被广泛采用,催生了如Deep Research等工具。然而,现有的检索代理主要依赖基于关键词或嵌入的方法。虽然它们在捕捉内容级相似性方面有效,但难以理解科学论文之间的复杂关系网络,例如识别相互印证或冲突的研究以及追踪技术谱系。这一根本性限制常常导致知识结构碎片化、研究情感误解以及集体科学进展建模失效。为解决这一限制,我们引入了SciNet,这是首个用于信息检索代理的科学网络关系感知数据集。该数据集基于包含7个学科、2.69亿篇论文的元数据库,并包含8940个精心设计的任务,系统性地捕捉了三个层次的关系理解:以自我为中心检索具有新颖知识结构的论文、成对识别学术关系以及路径式重建科学演化。对三类检索代理的广泛评估表明,它们在关系感知任务上的准确率通常低于20%,凸显了当前检索范式的根本缺陷。重要的是,在下游文献综述应用中,配备SciNet的代理在综述质量上实现了25.3%的提升,突出了关系感知检索对深化科学见解的关键价值。我们在https://github.com/tsinghua-fib-lab/SciNet公开发布SciNet以支持未来研究。

英文摘要

AI agents have seen widespread adoption in information retrieval for scientific research, giving rise to tools such as Deep Research. However, existing retrieval agents mainly rely on keyword- or embedding-based methods. While effective at capturing content-level similarities, they struggle to understand complex relational networks among scientific papers, such as identifying corroborating or conflicting studies and tracing technological lineages. This fundamental limitation often results in fragmented knowledge structures, misinterpreted research sentiment, and ineffective modeling of collective scientific progress. To address this limitation, we introduce SciNet, the first Scientific Network relation-aware dataset for information retrieval agents. Built on a meta-database of 269 million papers across 7 disciplines and containing 8,940 carefully designed tasks, SciNet systematically captures three levels of relational understanding: ego-centric retrieval of papers with novel knowledge structures, pairwise identification of scholarly relationships, and path-wise reconstruction of scientific evolution. Extensive evaluation of three categories of retrieval agents shows that their accuracy on relation-aware tasks often falls below 20%, highlighting a fundamental shortcoming of current retrieval paradigms. Importantly, in a downstream literature review application, agents empowered with SciNet achieve a 25.3% improvement in review quality, highlighting the critical value of relation-aware retrieval for deepening scientific insights. We publicly release SciNet at https://github.com/tsinghua-fib-lab/SciNet to support future research.

2512.20298 2026-05-25 cs.CL cs.AI cs.CY cs.HC 版本更新

Patterns vs. Patients: Evaluating LLMs against Mental Health Professionals on Personality Disorder Diagnosis through First-Person Narratives

模式 vs. 患者:通过第一人称叙事评估大语言模型与心理健康专业人员的人格障碍诊断能力

Karolina Drożdż, Kacper Dudzic, Anna Sterna, Marcin Moskalewicz

发表机构 * IDEAS Research Institute(IDEAS研究所) Adam Mickiewicz University(亚当·密茨凯维奇大学) AMU Center for Artificial Intelligence(AMU人工智能中心) Poznań University of Medical Sciences(波兹南医学科学大学) Maria Curie-Skłodowska University(玛丽·居里-斯克洛多夫斯卡大学)

AI总结 该研究探讨了大型语言模型(LLMs)在基于第一人称叙述进行人格障碍诊断方面的能力,特别比较了其与心理健康专业人士在诊断边缘型(BPD)和自恋型(NPD)人格障碍时的表现。研究发现,尽管LLMs在识别BPD方面表现优异,但在诊断NPD时显著低估,反映出模型在处理价值判断性术语时可能存在偏见。研究还指出,LLMs倾向于基于模式和形式分类提供详细解释,而人类专家则更关注患者的自我认知和时间体验,整体诊断可靠性仍有待提升。

详情
AI中文摘要

对LLMs进行精神病学自我评估的日益依赖引发了对其解释定性患者叙事能力的质疑。这项深度而非广度的案例研究直接比较了最先进的LLMs和心理健康专业人员,基于波兰语第一人称自传叙事评估边缘型人格障碍(BPD)和自恋型人格障碍(NPD)。在我们的样本中,表现最佳的Gemini Pro模型的总体诊断得分(65.48%)比人类专业人员的平均得分(43.57%)高出21.91个百分点。虽然模型和人类专家在识别BPD方面都表现出色(F1分别为83.4和80.0),但模型严重漏诊NPD(F1=6.7 vs. 50.0),显示出对价值负载术语“自恋”的潜在回避。定性上,模型提供了自信、详尽的理由,侧重于模式和形式类别,而人类专家则保持简洁和谨慎,强调患者的自我感和时间体验。我们的研究结果表明,虽然LLMs可能擅长解释复杂的临床第一人称数据,但其输出仍然存在关键的可靠性和偏见问题。

英文摘要

Growing reliance on LLMs for psychiatric self-assessment raises questions about their ability to interpret qualitative patient narratives. This depth-over-breadth case study directly compares state-of-the-art LLMs and mental health professionals in assessing Borderline (BPD) and Narcissistic (NPD) Personality Disorders based on Polish-language first-person autobiographical accounts. Within our sample, the overall diagnostic scores of the top-performing Gemini Pro models (65.48%) were 21.91 percentage points higher than the average scores of the human professionals (43.57%). While both models and human experts excelled at identifying BPD (F1 = 83.4 and 80.0, respectively), models severely underdiagnosed NPD (F1 = 6.7 vs. 50.0), showing a potential reluctance toward the value-laden term "narcissism." Qualitatively, models provided confident, elaborate justifications focused on patterns and formal categories, while human experts remained concise and cautious, emphasizing the patients' sense of self and temporal experience. Our findings demonstrate that while LLMs might be competent at interpreting complex first-person clinical data, their outputs still carry critical reliability and bias issues.
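
The headline gap in the abstract is a difference in percentage points, and the NPD result is reported as an F1 score, which drops sharply when true cases are missed. The sketch below checks the reported arithmetic and shows the standard F1 definition; the confusion counts passed to `f1_score` are hypothetical and are not taken from the study.

```python
def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall from raw confusion counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Percentage-point gap between the two scores reported in the abstract.
gap = round(65.48 - 43.57, 2)
print(gap)  # 21.91

# Hypothetical counts: a rater that rarely assigns a label misses most true cases,
# so recall, and with it F1, stays low even when precision is perfect.
cautious = f1_score(tp=1, fp=0, fn=9)
balanced = f1_score(tp=6, fp=4, fn=4)
print(cautious < balanced)  # True
```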

2510.21270 2026-05-25 cs.CL cs.AI cs.CV 版本更新

Sparser Block-Sparse Attention via Token Permutation

通过令牌置换实现更稀疏的块稀疏注意力

Xinghao Wang, Pengyu Wang, Dong Zhang, Chenkun Tan, Shaojun Zhou, Zhaoxiang Liu, Shiguo Lian, Fangxu Liu, Kai Song, Xipeng Qiu

发表机构 * Fudan University(复旦大学) Shanghai Innovation Institute(上海创新研究院)

AI总结 随着大语言模型上下文长度的增加,计算成本显著上升,主要瓶颈来自自注意力机制的二次复杂度。为此,本文提出了一种名为Permuted Block-Sparse Attention(PBS-Attn)的新型稀疏注意力方法,通过重新排列token顺序以提升块级稀疏性,从而在保持模型精度的同时显著提高计算效率。实验表明,该方法在多个长上下文数据集上优于现有块稀疏注意力方法,并在端到端推理速度上实现了最高2.75倍的加速。

Comments ICML 2026

详情
AI中文摘要

扩展大语言模型(LLM)的上下文长度带来了显著的好处,但计算成本高昂。这种成本主要源于自注意力机制,其相对于序列长度的$O(N^2)$复杂度在内存和延迟方面构成了主要瓶颈。幸运的是,注意力矩阵通常是稀疏的,尤其是对于长序列,这为优化提供了机会。块稀疏注意力已成为一种有前景的解决方案,它将序列划分为块并跳过其中一部分块的计算。然而,该方法的有效性高度依赖于底层的注意力模式,这可能导致次优的块级稀疏性。例如,单个块内查询的重要键令牌可能分散在许多其他块中,导致计算冗余。在这项工作中,我们提出了置换块稀疏注意力(PBS-Attn),这是一种即插即用的方法,利用注意力的置换性质来增加块级稀疏性并提高LLM预填充的计算效率。我们在具有挑战性的真实世界长上下文数据集上进行了全面实验,结果表明PBS-Attn在模型精度上始终优于现有的块稀疏注意力方法,并紧密匹配全注意力基线。借助我们自定义的permuted-FlashAttention内核,PBS-Attn在长上下文预填充中实现了高达2.75倍的端到端加速,证实了其实用性。代码可在https://github.com/xinghaow99/pbs-attn获取。

英文摘要

Scaling the context length of large language models (LLMs) offers significant benefits but is computationally expensive. This expense stems primarily from the self-attention mechanism, whose $O(N^2)$ complexity with respect to sequence length presents a major bottleneck for both memory and latency. Fortunately, the attention matrix is often sparse, particularly for long sequences, suggesting an opportunity for optimization. Block-sparse attention has emerged as a promising solution that partitions sequences into blocks and skips computation for a subset of these blocks. However, the effectiveness of this method is highly dependent on the underlying attention patterns, which can lead to sub-optimal block-level sparsity. For instance, important key tokens for queries within a single block may be scattered across numerous other blocks, leading to computational redundancy. In this work, we propose Permuted Block-Sparse Attention (PBS-Attn), a plug-and-play method that leverages the permutation properties of attention to increase block-level sparsity and enhance the computational efficiency of LLM prefilling. We conduct comprehensive experiments on challenging real-world long-context datasets, demonstrating that PBS-Attn consistently outperforms existing block-sparse attention methods in model accuracy and closely matches the full attention baseline. Powered by our custom permuted-FlashAttention kernels, PBS-Attn achieves an end-to-end speedup of up to $2.75\times$ in long-context prefilling, confirming its practical viability. Code is available at https://github.com/xinghaow99/pbs-attn
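
Two facts make the permutation idea plausible: non-causal softmax attention is unchanged when keys and values are reordered together, and reordering can pack a query block's important keys into fewer key blocks. The NumPy sketch below checks both on toy data. It is an illustration under assumptions, not the PBS-Attn kernel: it ignores causal masking, the key positions are made up, and `attention` and `blocks_touched` are hypothetical helper names.

```python
import numpy as np


def attention(q, k, v):
    """Plain (non-causal) scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def blocks_touched(key_positions, block_size):
    """Number of distinct key blocks that contain at least one listed position."""
    return len({p // block_size for p in key_positions})


rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
perm = rng.permutation(8)

# Reordering keys and values with the same permutation leaves the output unchanged.
out = attention(q, k, v)
out_perm = attention(q, k[perm], v[perm])
print(np.allclose(out, out_perm))  # True

# Hypothetical important keys scattered over the sequence, block size 2.
print(blocks_touched([0, 3, 5, 6], block_size=2))  # 4 blocks must be computed
# After a permutation that moves those keys to positions 0..3:
print(blocks_touched([0, 1, 2, 3], block_size=2))  # 2 blocks must be computed
```

In this toy case the permutation halves the number of key blocks a block-sparse kernel would have to visit while the attention output stays the same.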

2510.00526 2026-05-25 cs.CL cs.LG 版本更新

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

超越对数似然:面向模型能力连续体的监督微调概率目标

Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文研究了监督微调(SFT)中超越负对数似然(NLL)的目标函数,针对大语言模型在不同能力水平下的表现差异,提出了一种基于概率的优化目标体系。通过大量实验和消融研究,发现模型能力水平是决定不同目标函数优劣的关键因素:在模型能力强时,优先考虑先验知识的目标(如$-p$、$-p^{10}$)表现更优;在模型能力弱时,NLL仍占优势;而中间阶段则无单一目标占优。该研究为根据模型能力选择合适的目标函数提供了理论依据和实践指导。

Comments ICML 2026

详情
AI中文摘要

监督微调(SFT)是后训练大型语言模型(LLM)的标准方法,但通常表现出有限的泛化能力。我们将此限制归因于其默认训练目标:负对数似然(NLL)。虽然NLL在从头训练时是经典意义上的最优目标,但后训练处于不同范式,可能违反其最优性假设,因为模型已编码任务相关先验,且监督可能冗长且有噪声。在这项工作中,我们系统研究了各种基于概率的目标,并刻画了不同目标在不同条件下何时以及为何成功或失败。通过在8个模型骨干、27个基准和7个领域上的全面实验和广泛消融研究,我们揭示了控制目标行为的关键维度:模型能力连续体。在模型强端附近,降低低概率令牌权重的先验倾向目标(例如,-p, -p^{10}, 阈值变体)一致优于NLL;在模型弱端,NLL占主导;在中间,没有单一目标普遍最优。我们的理论分析进一步阐明了目标如何在连续体上交换位置,为根据模型能力调整目标提供了原则性基础。代码可在 https://github.com/GaotangLi/Beyond-Log-Likelihood 获取。

英文摘要

Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm and could violate its optimality assumptions, where models already encode task-relevant priors and supervision can be long and noisy. In this work, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
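
One way to see why objectives such as $-p$ and $-p^{10}$ "downweight low-probability tokens" is to compare the gradient each objective sends to the target logit. For a softmax output with target probability p, dp/dz = p(1 - p), so the gradient magnitudes are (1 - p) for NLL, p(1 - p) for $-p$, and 10 p^10 (1 - p) for $-p^{10}$. The sketch below tabulates these closed forms; it is a worked derivation under the softmax assumption, not the authors' training code, and the function name is hypothetical.

```python
def target_logit_grad_magnitude(p, objective):
    """|dL/dz_target| for a softmax output with target-token probability p."""
    if objective == "nll":       # L = -log p   ->  (1 - p)
        return 1 - p
    if objective == "neg_p":     # L = -p       ->  p (1 - p)
        return p * (1 - p)
    if objective == "neg_p10":   # L = -p^10    ->  10 p^10 (1 - p)
        return 10 * p**10 * (1 - p)
    raise ValueError(f"unknown objective: {objective}")


for p in (0.01, 0.5, 0.9):
    row = {name: target_logit_grad_magnitude(p, name) for name in ("nll", "neg_p", "neg_p10")}
    print(p, row)
```

At p = 0.01 the NLL gradient is close to its maximum while the other two are nearly zero, so NLL keeps pushing hard on tokens the model considers unlikely and the prior-leaning objectives largely ignore them.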

2509.26383 2026-05-25 cs.CL cs.AI 版本更新

Efficient and Transferable Agentic Knowledge Graph RAG via Reinforcement Learning

基于强化学习的高效可迁移智能知识图谱检索增强生成

Junhong Lin, Shicheng Liu, Jinyeop Song, Song Wang, Julian Shun, Yada Zhu

发表机构 * MIT CSAIL(麻省理工学院CSAIL) University of Virginia(弗吉尼亚大学) IBM Research(IBM研究院)

AI总结 该研究提出了一种基于强化学习的高效且可迁移的智能体知识图谱检索增强生成框架KG-R1,旨在解决现有KG-RAG系统中固定流程导致的高推理成本和依赖特定图结构的问题。KG-R1通过单智能体与知识图谱环境交互,逐步学习信息检索与推理生成的统一过程,从而在减少生成token数量的同时提升回答准确性。实验表明,KG-R1在多个知识图谱问答基准上表现出优异的效率和跨图谱迁移能力,且无需重新训练即可保持对新知识图谱的准确推理,具有良好的实际应用前景。

详情
AI中文摘要

知识图谱检索增强生成(KG-RAG)将大型语言模型(LLMs)与结构化、可验证的知识图谱(KGs)相结合,以减少幻觉并提供推理轨迹。然而,当前的KG-RAG系统通常依赖于多个LLM模块(如规划、推理和响应)的固定流水线,这增加了推理成本,并将性能与特定图模式绑定。为了解决这个问题,我们引入了KG-R1,一个通过强化学习(RL)优化KG-RAG的智能体框架。与模块化工作流不同,KG-R1使用单个智能体,将KGs作为其环境进行交互,学习在每一步检索信息,并将其融入统一的推理和生成过程中。在知识图谱问答(KGQA)基准测试中,KG-R1展示了高效性和可迁移性——使用Qwen 2.5-3B,KG-R1以比先前使用更大基础或微调模型的多模块工作流方法更少的生成token提高了答案准确性。此外,KG-R1表现出强大的即插即用能力:训练后,无需重新训练即可在未见过的KGs上保持准确性。这些特性使KG-R1成为实际部署中很有前景的KG-RAG框架。我们的代码公开在github.com/junhongmit/KG-R1/。

英文摘要

Knowledge-graph retrieval-augmented generation (KG-RAG) couples large language models (LLMs) with structured, verifiable knowledge graphs (KGs) to reduce hallucination and provide reasoning traces. However, current KG-RAG systems often rely on fixed pipelines of multiple LLM modules (e.g., planning, reasoning, and responding), which inflate inference costs and tie performance to specific graph schemas. To address this, we introduce KG-R1, an agentic framework that optimizes KG-RAG through reinforcement learning (RL). Unlike modular workflows, KG-R1 uses a single agent that interacts with KGs as its environment, learning to retrieve information at each step and incorporating it into its reasoning and generation in a unified process. Across Knowledge-Graph Question Answering (KGQA) benchmarks, KG-R1 demonstrates both efficiency and transferability: using Qwen 2.5-3B, KG-R1 improves answer accuracy with fewer generation tokens than prior multi-module workflow methods that use much larger foundation or fine-tuned models. Furthermore, KG-R1 exhibits strong plug-and-play capability: after training, it maintains accuracy on unseen KGs without retraining. These properties make KG-R1 a promising KG-RAG framework for real-world deployment. Our code is publicly available at github.com/junhongmit/KG-R1/.

2508.10016 2026-05-25 cs.CL 版本更新

Training-Free Multimodal Large Language Model Orchestration

免训练的多模态大语言模型编排

Tianyu Xie, Yuexiao Ma, Yuhang Wu, Wang Chen, Jiayi Ji, Tat-Seng Chua, Xiawu Zheng, Rongrong Ji

发表机构 * Media Analytics and Computing Lab, Xiamen University, Xiamen, China(厦门大学媒体分析与计算实验室) Institute of Artificial Intelligence, Xiamen University, Xiamen, China(厦门大学人工智能研究院) School of Informatics, Xiamen University, Xiamen, China(厦门大学信息学院) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) Department of Computer Science, National University of Singapore, Singapore(新加坡国立大学计算机科学系)

AI总结 本文提出了一种无需训练的多模态大语言模型编排框架(LLM Orchestration),旨在解决构建交互式多模态助手时因端到端对齐带来的高昂数据和计算成本问题。该框架通过集成现成的模态专家模块,无需额外训练即可实现多模态输入输出的统一处理,包含意图识别控制器、跨模态记忆模块和统一交互层三个核心组件,有效提升了系统扩展性与运行效率。实验表明,该方法在多模态基准测试中表现出色,同时保持了较低的编排开销和模块化升级能力。

详情
Journal ref
ICML 2026
AI中文摘要

构建交互式全模态助手通常依赖于端到端的多模态对齐来融合异构模态,这会产生大量的数据和计算成本,并限制了可扩展性。我们提出了免训练的大语言模型编排(LLM Orchestration),这是一个免训练的编排框架,它将现成的模态专家集成到一个统一的多模态输入-输出系统中,而无需额外的基于梯度的训练来进行集成。LLM Orchestration包含三个组件:(1)一个LLM控制器,它推断用户意图并发出显式的控制令牌以进行专家选择和排序,实现协议约束和可审计的路由;(2)一个以文本为中心的跨模态记忆,它将多模态证据压缩为结构化记录,用于轻量级检索和重用,减少跨轮次中冗余的专家调用;(3)一个统一的交互层,它执行路由和记忆决策,以支持一致的模态转换、全双工流和可中断的对话。在多种多模态基准测试中,LLM Orchestration在标准评估约束下实现了强劲的性能,同时保持了低编排开销和模块化可升级性,为全模态系统提供了一种实用的替代方案,避免了昂贵的联合训练。

英文摘要

Building interactive omni-modal assistants often relies on end-to-end multimodal alignment to fuse heterogeneous modalities, which incurs substantial data and compute costs and limits extensibility. We present Training-Free Large Language Model Orchestration (LLM Orchestration), a training-free orchestration framework that integrates off-the-shelf modality experts into a unified multimodal input-output system without additional gradient-based training for integration. LLM Orchestration comprises three components: (1) an LLM controller that infers user intent and emits explicit control tokens for expert selection and sequencing, enabling protocol-constrained and auditable routing; (2) a text-centric cross-modal memory that compresses multimodal evidence into structured records for lightweight retrieval and reuse, reducing redundant expert invocations across turns; and (3) a unified interaction layer that executes routing and memory decisions to support consistent modality transitions, full-duplex streaming, and interruption-aware dialogue. Across diverse multimodal benchmarks, LLM Orchestration achieves strong performance under standard evaluation constraints while maintaining low orchestration overhead and modular upgradeability, providing a practical alternative to costly joint training for omni-modal systems.

2508.07849 2026-05-25 cs.CL 版本更新

Evaluating Customized vs. Generalist Transformer-based Models for Legal Contract Classification

评估定制化与通用型基于Transformer的模型在法律合同分类中的表现

Amrita Singh, H. Suhan Karaca, Aditya Joshi, Hye-young Paik, Jiaojiao Jiang

发表机构 * School of Computer Science and Engineering University of New South Wales (UNSW), Sydney(计算机科学与工程学院 新南威尔士大学(UNSW),悉尼)

AI总结 尽管在法律自然语言处理领域已有进展,但目前尚缺乏对专门用于法律任务的Transformer模型在合同分类任务上的全面评估。本文对13种法律专用Transformer模型在三个英文合同分类任务上的表现进行了评估,并与9种通用模型进行了对比,结果表明法律专用模型在需要细致法律理解的任务中表现更优,尤其在处理类别不平衡数据时能有效减少罕见类的误分类。研究还发现,Legal-BERT和Contracts-BERT的参数量比最佳通用模型少69%,却在三个任务中的两个上取得了新的最先进成果,突显了领域定制模型在法律应用中的重要性。

Comments Accepted to Customizable NLP at ACL 2026

详情
AI中文摘要

尽管法律NLP取得了进展,但目前尚无针对合同分类任务对法律任务定制的Transformer模型(本文称为“法律专用”模型)进行全面评估。为填补这一空白,我们对13个法律专用Transformer模型在三个英文合同分类任务上进行了评估,并与9个通用模型进行了比较。结果表明,法律专用模型持续优于通用模型,尤其是在需要精细法律理解的任务上。它们还有助于减少不平衡数据集中罕见类别的误分类。Legal-BERT和Contracts-BERT在三个任务中的两个上建立了新的最优结果,尽管其参数比最佳通用模型少69%。我们还确定了CaseLaw-BERT和LexLM作为合同分类的强有力额外基线。我们的结果凸显了通用模型的不足,强调了领域特定定制的必要性,尤其是在法律应用的背景下。

英文摘要

Despite advances in legal NLP, no comprehensive evaluation of Transformer-based models customized for legal tasks (referred to as 'legal-specific' models in this paper) exists for contract classification tasks. To address this gap, we present an evaluation of 13 legal-specific transformer-based models on 3 English-language contract classification tasks and compare them with 9 generalist models. The results show that legal-specific models consistently outperform generalist models, especially on tasks requiring nuanced legal understanding. They also help reduce misclassification of rare classes in imbalanced datasets. Legal-BERT and Contracts-BERT establish new SOTAs on two of the three tasks, despite having 69% fewer parameters than the best-performing generalist models. We also identify CaseLaw-BERT and LexLM as strong additional baselines for contract classification. Our results highlight the shortcomings of generalist models, emphasizing the need for domain-specific customization, particularly in the context of legal applications.

2506.03530 2026-05-25 cs.MM cs.CL cs.CV 版本更新

How Far Are We from Generating Missing Modalities with Foundation Models?

我们距离用基础模型生成缺失模态还有多远?

Guanzhou Ke, Bo Wang, Guoqing Chao, Weiming Hu, Shengfeng He

发表机构 * Institute of Data Science and Intelligent Decision Support, Beijing Jiaotong University(数据科学与智能决策支持研究所,北京交通大学) School of Computing and Information Systems, Singapore Management University(计算与信息系统学院,新加坡管理大学) State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统国家重点实验室,自动化研究所,中国科学院) School of Computer Science and Technology, Harbin Institute of Technology(计算机科学与技术学院,哈尔滨工业大学)

AI总结 该研究探讨了基础模型在生成缺失模态数据方面的潜力与局限,提出了三种缺失模态重建的范式,并对42种模型变体进行了系统评估。研究发现,当前基础模型在细粒度语义提取和生成模态的鲁棒验证方面存在不足,导致生成结果不够理想。为此,作者提出了一种智能代理框架,通过动态的模态感知挖掘策略和自优化机制,显著提升了缺失模态重建的质量,实验表明与基线相比,缺失图像重建的FID至少降低14%,缺失文本重建的MER至少降低10%。

Comments T-PAMI

详情
AI中文摘要

多模态基础模型在各种任务中展现了令人印象深刻的能力。然而,它们作为缺失模态重建的即插即用解决方案的潜力尚未被充分探索。为弥补这一差距,我们识别并形式化了三种可能的缺失模态重建范式,并跨这些范式进行了全面评估,覆盖了42个模型变体在重建准确性和下游任务适应性方面的表现。我们的分析表明,当前基础模型在两个关键方面往往表现不佳:(i) 从可用模态中提取细粒度语义,以及(ii) 对生成模态的稳健验证。这些限制导致了次优甚至有时不匹配的生成。为解决这些挑战,我们提出了一个专为缺失模态重建设计的智能框架。该框架根据输入上下文动态制定模态感知的挖掘策略,促进提取更丰富、更具判别性的语义特征。此外,我们引入了一种自精炼机制,通过内部反馈迭代验证和提升生成模态的质量。实验结果表明,与基线相比,我们的方法在缺失图像重建上FID降低了至少14%,在缺失文本重建上MER降低了至少10%。代码已发布在:https://github.com/Guanzhou-Ke/AFM2。

英文摘要

Multimodal foundation models have demonstrated impressive capabilities across diverse tasks. However, their potential as plug-and-play solutions for missing modality reconstruction remains underexplored. To bridge this gap, we identify and formalize three potential paradigms for missing modality reconstruction, and perform a comprehensive evaluation across these paradigms, covering 42 model variants in terms of reconstruction accuracy and adaptability to downstream tasks. Our analysis reveals that current foundation models often fall short in two critical aspects: (i) fine-grained semantic extraction from the available modalities, and (ii) robust validation of generated modalities. These limitations lead to suboptimal and, at times, misaligned generations. To address these challenges, we propose an agentic framework tailored for missing modality reconstruction. This framework dynamically formulates modality-aware mining strategies based on the input context, facilitating the extraction of richer and more discriminative semantic features. In addition, we introduce a self-refinement mechanism, which iteratively verifies and enhances the quality of generated modalities through internal feedback. Experimental results show that our method reduces FID for missing image reconstruction by at least 14% and MER for missing text reconstruction by at least 10% compared to baselines. Code is released at: https://github.com/Guanzhou-Ke/AFM2.

2505.17015 2026-05-25 cs.CV cs.CL 版本更新

Multi-SpatialMLLM: Multi-Frame Spatial Understanding with Multi-Modal Large Language Models

Multi-SpatialMLLM: 多模态大语言模型的多帧空间理解

Runsen Xu, Weiyao Wang, Hao Tang, Xingyu Chen, Xiaodong Wang, Fu-Jen Chu, Matt Feiszli, Kevin J. Liang

发表机构 * FAIR, Meta(FAIR,Meta) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出了一种名为Multi-SpatialMLLM的多模态大语言模型框架,旨在增强模型对多帧场景的时空理解能力。通过引入深度感知、视觉对应和动态感知等基本空间技能,并构建包含2700多万个样本的MultiSPA数据集,该方法显著提升了模型在多帧空间任务中的表现。实验表明,Multi-SpatialMLLM在多种空间任务上优于现有基线模型和商业系统,展示了其在复杂场景下的泛化能力和多任务学习优势,并可应用于机器人领域的多帧奖励标注。

Comments CVPR 2026 Camera Ready. 27 pages. Project page: https://runsenxu.com/projects/Multi-SpatialMLLM

详情
AI中文摘要

多模态大语言模型(MLLMs)在视觉任务中取得了快速进展,但其空间理解仍局限于单张图像,使其不适合需要多帧推理的物理世界应用。在本文中,我们提出一个框架,通过整合基本空间技能(包括深度感知、视觉对应和动态感知)来赋予MLLMs多帧空间理解能力。我们设计了一个新颖的数据管道,并收集了包含超过2700万个样本的MultiSPA数据集,涵盖多样的3D和4D场景,以支持训练。除了MultiSPA,我们还引入了一个全面的基准测试,在统一的度量标准下测试广泛的空间任务。我们的最终模型Multi-SpatialMLLM在基线和专有系统上取得了显著提升,展示了可扩展和可泛化的多帧感知能力。我们进一步观察到多任务收益和在挑战性场景中的新兴空间能力,并展示了我们的模型如何作为机器人学的多帧奖励标注器。

英文摘要

Multi-modal large language models (MLLMs) have rapidly advanced in visual tasks, yet their spatial understanding remains limited to single images, leaving them ill-suited for physical-world applications that require multi-frame reasoning. In this paper, we propose a framework to equip MLLMs with multi-frame spatial understanding by integrating fundamental spatial skills, including depth perception, visual correspondence, and dynamic perception. We design a novel data pipeline and collect the MultiSPA dataset of more than 27 million samples spanning diverse 3D and 4D scenes to enable training. Alongside MultiSPA, we introduce a comprehensive benchmark that tests a wide spectrum of spatial tasks under uniform metrics. Our resulting model, Multi-SpatialMLLM, achieves significant gains over baselines and proprietary systems, demonstrating scalable and generalizable multi-frame perception. We further observe multi-task benefits and emergent spatial capabilities in challenging scenarios, and showcase how our model can serve as a multi-frame reward annotator for robotics.

2505.13893 2026-05-25 cs.CL 版本更新

InfiGFusion: Graph-on-Logits Distillation via Efficient Gromov-Wasserstein for Model Fusion

InfiGFusion:通过高效Gromov-Wasserstein进行图对逻辑蒸馏的模型融合

Yuanyi Wang, Zhaoyi Yan, Yiming Zhang, Qi Zhou, Yanggan Gu, Fei Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University(香港理工大学) Zhejiang University(浙江大学) Daya Bay Technology and Innovation Research Institute(大亚湾技术与创新研究院)

AI总结 该论文提出了一种名为 InfiGFusion 的模型融合框架,旨在将不同异构开源模型的优势整合到统一系统中。其核心方法是通过引入基于图的对数概率蒸馏(GLD)损失,显式建模词汇维度间的语义依赖关系,并利用改进的格罗莫夫-瓦萨距离近似算法提升计算效率。实验表明,InfiGFusion 在多个基准测试中显著优于现有方法,尤其在复杂推理任务中表现出色。

详情
AI中文摘要

近期大语言模型(LLMs)的进展推动了将异构开源模型融合为统一系统的努力,以继承其互补优势。现有的基于逻辑的融合方法保持了推理效率,但独立处理词汇维度,忽略了跨维度交互编码的语义依赖。这些依赖反映了模型内部推理下词元类型的交互方式,对于对齐具有不同生成行为的模型至关重要。为显式建模这些依赖,我们提出InfiGFusion,首个结构感知融合框架,带有新颖的图对逻辑蒸馏(GLD)损失。具体地,我们保留每个输出的top-$k$逻辑值,并在序列位置上聚合它们的外积以形成全局共激活图,其中节点表示词汇通道,边量化其联合激活。为确保可扩展性和效率,我们设计了一种基于排序的闭式近似,将Gromov-Wasserstein距离的原始$O(n^4)$成本降至$O(n \log n)$,并具有可证明的近似保证。在多种融合设置下的实验表明,GLD持续提高了融合质量和稳定性。InfiGFusion在涵盖推理、编程和数学的11个基准上优于SOTA模型和融合基线。它在复杂推理任务中表现出特别优势,在Multistep Arithmetic上比SFT提高+35.6,在Causal Judgement上提高+37.06,展示了卓越的多步和关系推理能力。

英文摘要

Recent advances in large language models (LLMs) have intensified efforts to fuse heterogeneous open-source models into a unified system that inherits their complementary strengths. Existing logit-based fusion methods maintain inference efficiency but treat vocabulary dimensions independently, overlooking semantic dependencies encoded by cross-dimension interactions. These dependencies reflect how token types interact under a model's internal reasoning and are essential for aligning models with diverse generation behaviors. To explicitly model these dependencies, we propose InfiGFusion, the first structure-aware fusion framework with a novel Graph-on-Logits Distillation (GLD) loss. Specifically, we retain the top-$k$ logits per output and aggregate their outer products across sequence positions to form a global co-activation graph, where nodes represent vocabulary channels and edges quantify their joint activations. To ensure scalability and efficiency, we design a sorting-based closed-form approximation that reduces the original $O(n^4)$ cost of Gromov-Wasserstein distance to $O(n \log n)$, with provable approximation guarantees. Experiments across multiple fusion settings show that GLD consistently improves fusion quality and stability. InfiGFusion outperforms SOTA models and fusion baselines across 11 benchmarks spanning reasoning, coding, and mathematics. It shows particular strength in complex reasoning tasks, with +35.6 improvement on Multistep Arithmetic and +37.06 on Causal Judgement over SFT, demonstrating superior multi-step and relational inference.
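
The co-activation graph described in this abstract (keep the top-k logits per position, then sum their outer products over the sequence) can be illustrated with a small sketch. This is only an illustration under assumptions: the function names are hypothetical, raw logit values are used without any normalization, and the Gromov-Wasserstein approximation is not shown. The authors' actual implementation may differ on all of these points.

```python
from typing import Dict, List, Tuple

def top_k_sparse(logits: List[float], k: int) -> Dict[int, float]:
    """Keep the k largest logits of one position as {vocab_index: value}."""
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    return {i: logits[i] for i in order[:k]}

def co_activation_graph(seq_logits: List[List[float]], k: int) -> Dict[Tuple[int, int], float]:
    """Sum the outer products of the top-k logits over all sequence positions.

    Nodes are vocabulary indices; the weight of edge (i, j) accumulates
    logit_i * logit_j whenever both i and j are in a position's top-k set.
    """
    graph: Dict[Tuple[int, int], float] = {}
    for logits in seq_logits:
        kept = top_k_sparse(logits, k)
        for i, vi in kept.items():
            for j, vj in kept.items():
                graph[(i, j)] = graph.get((i, j), 0.0) + vi * vj
    return graph

# Two positions over a three-token vocabulary, keeping the top 2 per position.
graph = co_activation_graph([[1.0, 3.0, 2.0], [0.0, 2.0, 5.0]], k=2)
```

Storing the graph as a sparse dictionary keeps the cost proportional to k squared per position rather than to the squared vocabulary size, which is the practical reason for truncating to the top-k entries in this sketch.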

2504.01542 2026-05-25 cs.CL 版本更新

Register Always Matters: Analysis of LLM Pretraining Data Through the Lens of Language Variation

语域始终重要:通过语言变异视角分析LLM预训练数据

Amanda Myntti, Erik Henriksson, Veronika Laippala, Sampo Pyysalo

发表机构 * University of Turku(图尔库大学)

AI总结 本文研究了语言变体(如语域或文体)对大语言模型预训练数据质量及模型性能的影响。作者首次将语域分类这一语料语言学中的标准应用于预训练数据的筛选,并通过实验发现不同语域的文本对模型表现有显著影响。研究显示,使用新闻类文本预训练会导致性能下降,而意见类文本(如评论和博客)则对模型性能提升有积极作用,合理组合多种高表现语域可进一步优化模型效果。这些发现表明,语域是解释模型性能差异的重要因素,有助于未来更精准地选择预训练数据。

详情
AI中文摘要

预训练数据整理是大语言模型(LLM)开发的基石,导致对大型网络语料库质量过滤的研究日益增多。从统计质量标志到基于LLM的标注系统,数据集被划分为不同类别,通常简化为二元分类:通过过滤器的被视为有价值的样本,其余则被丢弃为无用或有害。然而,对不同类型文本对模型性能贡献的更详细理解仍然缺乏。在本文中,我们首次利用语域或体裁——语料库语言学中广泛使用的语言变异建模标准——来整理预训练数据集,并研究语域对LLM性能的影响。我们使用语域分类的数据训练小型生成模型,并通过标准基准进行评估,表明预训练数据的语域显著影响模型性能。我们揭示了预训练材料与最终模型之间令人惊讶的关系:使用“新闻”语域会导致性能不佳,相反,包含“观点”类(涵盖评论和观点博客等文本)则非常有益。虽然在整个未过滤数据集上训练的模型优于在单一语域数据集上训练的模型,但将表现良好的语域(如“操作指南”、“信息描述”和“观点”)组合起来会带来重大改进。此外,对单个基准结果的分析揭示了特定语域类作为预训练数据的优势和缺点的关键差异。这些发现表明,语域是模型变异的重要解释因素,并可以促进未来更谨慎的数据选择实践。

英文摘要

Pretraining data curation is a cornerstone in Large Language Model (LLM) development, leading to growing research on quality filtering of large web corpora. From statistical quality flags to LLM-based labelling systems, datasets are divided into categories, frequently reducing to a binary: those passing the filters are deemed as valuable examples, others are discarded as useless or detrimental. However, a more detailed understanding of the contribution of different kinds of texts to model performance is still largely lacking. In this article, we present the first study utilising registers or genres - a widely used standard in corpus linguistics to model linguistic variation - to curate pretraining datasets and investigate the effect of register on the performance of LLMs. We train small generative models with register classified data and evaluate them using standard benchmarks, and show that the register of pretraining data substantially affects model performance. We uncover surprising relationships between the pretraining material and the resulting models: using the News register results in subpar performance, and on the contrary, including the Opinion class, covering texts such as reviews and opinion blogs, is highly beneficial. While a model trained on the entire unfiltered dataset outperforms those trained on datasets limited to a single register, combining well-performing registers like How-to-Instructions, Informational Description, and Opinion leads to major improvements. Furthermore, analysis of individual benchmark results reveals key differences in the strengths and drawbacks of specific register classes as pretraining data. These findings show that register is an important explainer of model variation and can facilitate more deliberate future data selection practices.

2407.04573 2026-05-25 cs.IR cs.CL 版本更新

Vector Retrieval with Similarity and Diversity: How Hard Is It?

向量检索中的相似性与多样性:难度有多大?

Hang Gao, Dong Deng, Yongfeng Zhang

发表机构 * Rutgers University(罗格斯大学)

AI总结 本文研究了在稠密向量检索中如何同时平衡相似性与多样性这一关键问题,提出了名为VRSD的优化问题:最大化查询向量与所选候选向量之和之间的相似性。作者证明该问题是NP完全的,从而为其难度提供了严格的理论依据,并进一步提出了一种无参数的启发式算法,在多个数据集上验证了其有效性,优于MMR和k-DPP等现有主流方法。

详情
AI中文摘要

稠密向量检索是现代机器学习系统的重要构建模块,支撑从语义搜索到检索增强生成和知识密集型推理等应用。除了检索与查询单独相似的项外,许多应用还需要一组结果具有多样性、互补性和集体信息性。因此,平衡相似性和多样性是有效检索的核心,但在稳定且理论扎实的方式下优化仍然具有挑战性。最大边际相关性(MMR)是解决该问题的广泛采用的启发式方法,但其对手动调整参数的依赖导致优化波动和不可预测的检索结果。更广泛地说,现有方法对相似性和多样性如何在稠密向量空间中相互作用提供的理论见解有限,使得联合优化问题尚未被充分理解。为了解决这些挑战,本文引入了一种新方法,通过最大化查询向量与所选候选向量之和之间的相似性来同时刻画两个约束。我们正式定义了该优化问题——向量检索相似性与多样性(VRSD),并证明其为NP完全问题,从而建立了该双目标检索固有难度的严格理论界限。随后,我们提出了一种无参数启发式算法来求解VRSD。在多个数据集上的广泛评估,结合客观几何指标和LLM模拟的主观评估,表明我们的VRSD启发式方法始终优于包括MMR和行列式点过程(k-DPP)在内的已建立基线。

英文摘要

Dense vector retrieval is an important building block of modern machine learning systems, underlying applications ranging from semantic search to retrieval-augmented generation and knowledge-intensive reasoning. Beyond retrieving items that are individually similar to a query, many applications require a set of results that is also diverse, complementary, and collectively informative. Balancing similarity and diversity is therefore central to effective retrieval, but remains challenging to optimize in a stable and theoretically grounded way. Maximal Marginal Relevance (MMR) is a widely adopted heuristic for this problem, yet its reliance on a manually tuned parameter leads to optimization fluctuations and unpredictable retrieval results. More broadly, existing methods provide limited theoretical insight into how similarity and diversity interact in dense vector spaces, leaving the joint optimization problem insufficiently understood. To address these challenges, this paper introduces a novel approach that characterizes both constraints simultaneously by maximizing the similarity between the query vector and the sum of the selected candidate vectors. We formally define this optimization problem, Vector Retrieval with Similarity and Diversity (VRSD), and prove that it is NP-complete, establishing a rigorous theoretical bound on the inherent difficulty of this dual-objective retrieval. Subsequently, we present a parameter-free heuristic algorithm to solve VRSD. Extensive evaluations on multiple datasets, incorporating both objective geometric metrics and LLM-simulated subjective assessments, demonstrate that our VRSD heuristic consistently outperforms established baselines, including MMR and Determinantal Point Processes (k-DPP).
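
To make the VRSD objective concrete, the sketch below greedily picks candidates so that the cosine similarity between the query and the sum of the selected vectors is as large as possible at each step. The abstract does not specify the authors' parameter-free heuristic, so this greedy loop is an assumption chosen for illustration, and the function names are hypothetical.

```python
import math
from typing import List, Sequence

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def greedy_sum_similarity(query: Sequence[float],
                          candidates: List[Sequence[float]],
                          k: int) -> List[int]:
    """Greedily select k candidate indices maximizing cos(query, sum of selected)."""
    chosen: List[int] = []
    total = [0.0] * len(query)
    for _ in range(min(k, len(candidates))):
        best_idx, best_score = -1, -2.0  # cosine is never below -1
        for idx, vec in enumerate(candidates):
            if idx in chosen:
                continue
            trial = [t + v for t, v in zip(total, vec)]
            score = cosine(query, trial)
            if score > best_score:
                best_idx, best_score = idx, score
        chosen.append(best_idx)
        total = [t + v for t, v in zip(total, candidates[best_idx])]
    return chosen

# The second pick is the complementary vector [0, 1], not the near-duplicate [1, 0].
selected = greedy_sum_similarity([1.0, 1.0], [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], k=2)
```

In this toy case the sum-based objective rewards a second vector that pulls the running sum toward the query direction, which is how a single score can account for both similarity and diversity without a tunable trade-off parameter such as the one in MMR.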

2605.22939 2026-05-25 cs.CL cs.LG 版本更新

Learnability-Informed Fine-Tuning of Diffusion Language Models

扩散语言模型的可学习性感知微调

Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji

发表机构 * Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA(计算机科学与工程系,德克萨斯A&M大学,College Station, TX, USA) Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA(电气与计算机工程系,德克萨斯A&M大学,College Station, TX, USA)

AI总结 本文旨在提升扩散语言模型(DLMs)的推理能力。研究发现,传统的监督微调(SFT)在DLMs中应用时存在局限,忽视了学习的难易程度与时机,导致性能下降。为此,作者提出了一种新的微调方法LIFT,通过在不同扩散时间步根据上下文的丰富程度学习易学或难学的token,从而更有效地利用训练信息。实验表明,LIFT在六个推理基准测试中均优于现有SFT基线,在AIME'24和AIME'25上的相对增益最高达3倍。

详情
AI中文摘要

我们旨在提升扩散语言模型(DLM)的推理能力。虽然SFT是自回归模型常用的后训练方法,但其在DLM中的应用面临挑战,甚至可能损害性能,而根本原因尚未得到充分研究。我们的分析揭示,普通SFT忽略了可学习性,即学习什么以及何时学习。具体而言,当大部分输入被掩码时,稀有标记难以学习;而当大部分输入未被掩码时,学习常见标记则较为简单且价值不大。基于我们的分析,我们提出LIFT,一种高效的基于SFT的DLM后训练算法。LIFT在大部分输入被掩码时学习容易标记,在更多上下文可用时学习困难标记,从而使训练与不同扩散时间步的信息可用性对齐。我们的结果表明,LIFT在六个推理基准上优于现有SFT基线,在AIME'24和AIME'25上实现了高达3倍的相对增益。我们的代码已在https://github.com/divelab/LIFT公开。

英文摘要

We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.

2605.22937 2026-05-25 cs.CL 版本更新

RAS: Reflection-Augmented Scaling with In-Context Learning for Executable Cypher Query Generation

RAS: 基于上下文学习的反射增强缩放用于可执行Cypher查询生成

Minseok Jung, Abhas Ricky, Muhammad Rameez Chatni

发表机构 * Cloudera

AI总结 本文研究了在生成可执行Cypher查询任务中如何通过推理时的计算分配来减少错误。作者提出了一种基于上下文学习的反射增强缩放方法(RAS),通过利用数据库返回的错误信息作为反馈,指导模型生成更准确的查询语句。实验表明,与传统的独立缩放方法相比,RAS在多个数据集和语言模型上显著降低了查询执行错误率,证明了利用执行反馈进行推理优化的有效性。

详情
AI中文摘要

推理时缩放可以减少结构化查询生成中的错误,但为查询代码生成分配计算资源的方法仍未充分探索。我们研究Text2Cypher,其中语言模型生成针对属性图数据库执行的Cypher查询。不可执行查询构成了与语义不准确性不同的独特语法失败:语法错误会触发数据库生成的系统错误消息。这些错误消息通常在推理时被丢弃,而不是通过上下文学习(ICL)加以利用。我们比较了两种推理方法:独立缩放(IS),执行无记忆重采样;以及反射增强缩放(RAS),通过ICL使每次新尝试基于先前的执行反馈。在三个Neo4j数据集和五个代码专用语言模型上,RAS在n=5时将查询执行错误率降低了41-50%,优于IS的32-38%。执行错误不仅仅是需要丢弃的失败,而是可操作的反馈,围绕它们组织推理时计算是实现可执行性比缩放独立样本更高效的途径。

英文摘要

Inference-time scaling can reduce errors in structured query generation, but methods to allocate the compute for query code generation remain underexplored. We study Text2Cypher, where language models generate Cypher queries that execute against property graph databases. Non-executable queries constitute a distinct syntactic failure separate from semantic inaccuracy: a syntax error triggers a system-generated error message from the database. These error messages are typically discarded at inference time rather than leveraged through in-context learning (ICL). We compare two inference methods: Independent Scaling (IS), which performs memoryless resampling, and Reflection-Augmented Scaling (RAS), which conditions each new attempt on prior execution feedback via ICL. Across three Neo4j datasets and five code-specialized language models, RAS reduces the Query Execution Error Rate by 41-50% at n=5, outperforming IS at 32-38%. Execution errors are not merely failures to discard but actionable feedback, and structuring inference-time compute around them is a more efficient path to executability than scaling independent samples.
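
The contrast between the two inference methods in this abstract can be sketched as two short loops. Everything below is a hedged illustration: `generate` and `execute` are hypothetical callables standing in for a language model and a database, the toy stand-ins and their error string are invented for the demo, and no real Neo4j API is used.

```python
from typing import Callable, List, Optional, Tuple

Feedback = List[Tuple[str, str]]  # (failed query, database error message)

def reflection_augmented_scaling(
    question: str,
    generate: Callable[[str, Feedback], str],
    execute: Callable[[str], Optional[str]],
    n: int = 5,
) -> Optional[str]:
    """Return the first query that executes within n attempts, else None.

    Each new attempt is conditioned on all earlier (query, error) pairs.
    """
    feedback: Feedback = []
    for _ in range(n):
        query = generate(question, feedback)
        error = execute(query)  # None means the query executed
        if error is None:
            return query
        feedback.append((query, error))
    return None

def independent_scaling(
    question: str,
    generate: Callable[[str, Feedback], str],
    execute: Callable[[str], Optional[str]],
    n: int = 5,
) -> Optional[str]:
    """Memoryless baseline: every attempt sees an empty feedback list."""
    for _ in range(n):
        query = generate(question, [])
        if execute(query) is None:
            return query
    return None

# Toy stand-ins (hypothetical): a "model" that repairs its query only after seeing an error.
def toy_generate(question: str, feedback: Feedback) -> str:
    return "MATCH (n) RETURN n" if feedback else "MATCH (n RETURN n"

def toy_execute(query: str) -> Optional[str]:
    return None if "(n)" in query else "SyntaxError: missing ')'"

ras_result = reflection_augmented_scaling("all nodes", toy_generate, toy_execute)
is_result = independent_scaling("all nodes", toy_generate, toy_execute)
```

With these deterministic stand-ins the memoryless loop repeats the same failing query five times, while the feedback loop succeeds on its second attempt. A real model samples stochastically, so independent resampling would not fail this uniformly in practice.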

2605.22923 2026-05-25 cs.IR cs.CL 版本更新

AI-Friendly LaTeX: Using LaTeX Code as a Knowledge Source for Retrieval-Augmented Generation

AI友好的LaTeX:使用LaTeX代码作为检索增强生成的知识源

Tom Verhoeff

发表机构 * Department of Mathematics & Computer Science, Eindhoven University of Technology(数学与计算机科学系,埃因霍温理工大学)

AI总结 本文研究如何将LaTeX源代码作为知识源用于检索增强生成(RAG),以提升大型语言模型在处理数学和技术内容时的准确性。作者提出了一种针对性的预处理方法,将LaTeX源文件及其辅助文件转换为适合向量化数据库索引的Markdown和JSONL格式,从而更好地保留结构信息和语义内容。该方法有效解决了LaTeX源代码在AI应用中的兼容性问题,为技术文档的智能处理提供了新思路。

Comments 19 pages, 3 figures

详情
AI中文摘要

当大型语言模型的答案基于显式知识源时,它们可以更可靠地回答关于教科书、讲义和编程练习的问题。检索增强生成(RAG)是一种常见方法:在回答之前,检索文档的相关片段并将其插入模型上下文。对于数学和技术材料,原始的LaTeX源码可以是比PDF更好的起点,因为它包含结构信息、标签、章节命令、宏以及作者意图,这些在PDF提取中常常丢失或失真。然而,LaTeX源码并非天然对AI友好。必须解析交叉引用,解释自定义宏,识别练习和示例,并且可能需要作者提供的语义元数据。本文描述了一种有针对性的预处理方法,将LaTeX源码及其编译的辅助文件和可选的作者注释转换为适合在向量数据库中索引的Markdown和JSONL块。

英文摘要

Large language models can answer questions about textbooks, lecture notes, and programming exercises more reliably when their answers are grounded in an explicit knowledge source. Retrieval-augmented generation (RAG) is a common approach: relevant fragments of a document are retrieved and inserted into the model context before answering. For mathematical and technical material, the original LaTeX source can be a better starting point than a PDF, because it contains structural information, labels, sectioning commands, macros, and authorial intent that are often lost or distorted in PDF extraction. However, LaTeX source is not automatically AI-friendly. Cross-references must be resolved, custom macros must be interpreted, exercises and examples must be identified, and author-supplied semantic metadata may be needed. This article describes a focused preprocessing approach for turning LaTeX source, together with its compiled auxiliary files and optional author annotations, into Markdown and JSONL chunks suitable for indexing in a vector database.

2605.22905 2026-05-25 cs.AI cs.CL 版本更新

EVE-Agent: Evidence-Verifiable Self-Evolving Agents

EVE-Agent: 证据可验证的自我进化智能体

Yamato Arai, Yuma Ichikawa

发表机构 * Fujitsu Limited(富士通株式会社) The University of Tokyo(东京大学) RIKEN center for AIP(理化学研究所AIP研究中心)

AI总结 本文提出了一种名为EVE-Agent的证据可验证自进化智能体,旨在解决自进化搜索代理在缺乏可验证证据时可能生成不准确但流畅的训练样本的问题。该方法通过修改提议者-求解者框架,使每个生成的实例不仅包含答案,还包含可验证的来源片段,并通过证据验证器评估其对答案的贡献。实验表明,EVE-Agent显著提升了基于证据的正确性,且生成的训练样本具有可审计性,增强了系统的可信度。

Comments 23 pages, 2 figures

详情
AI中文摘要

自我进化智能体不应在其无法证明的示例上进行训练。无数据的自我进化搜索智能体提供了一种可扩展的途径,使系统能够生成自己的问题、回答问题,并从自身反馈中改进,而无需人工标注。然而,没有可验证的证据,这种循环可能会奖励流畅但无依据的示例,使自我生成的课程变成不透明且可能不可靠的训练信号。我们认为,证据可验证性是搜索智能体可信自我进化的先决条件:每个生成的实例不仅应包含答案,还应包含一个基于来源的文本片段,其对该答案的贡献可以被衡量。我们引入了EVE-Agent,一种证据可验证的自我进化智能体,通过对提议者-求解者框架的修改来实现这一原则。提议者生成一个问题、一个答案和一个逐字证据片段。然后,证据验证器根据提供证据时的边际准确率增益来奖励该片段。这产生了一个训练信号,倾向于真正有助于回答问题的证据,而不需要标准答案、人工标签或外部标注。EVE-Agent保持骨干模型、检索器、搜索工具和优化框架不变。实验表明,EVE-Agent在证据基础的准确性上显著优于先前的自我进化搜索智能体。由此产生的课程不仅是自我生成的,而且从结构上是可审计的:每个训练实例都带有一个可检查的来源片段,解释其为何值得信任。

英文摘要

Self-evolving agents should not train on examples they cannot justify. Data-free self-evolving search agents offer a scalable route to systems that generate their own questions, answer them, and improve from their own feedback without human annotations. Yet, without verifiable evidence, this loop can reward fluent but unsupported examples, turning the self-generated curriculum into an opaque and potentially unreliable training signal. We argue that evidence verifiability is a prerequisite for trustworthy self-evolution in search agents: each generated instance should include not only an answer but also a source-grounded span whose contribution to that answer can be measured. We introduce EVE-Agent, an Evidence-Verifiable Self-Evolving Agent that operationalizes this principle through a modification to the proposer--solver framework. The proposer generates a question, an answer, and a verbatim evidence span. An evidence verifier then rewards the span according to the marginal accuracy gain when the evidence is provided. This produces a training signal that favors evidence that genuinely helps answer the question, without requiring oracle answers, human labels, or external annotations. EVE-Agent leaves the backbone model, retriever, search tool, and optimization framework unchanged. Experiments show that EVE-Agent substantially improves evidence-grounded correctness over prior self-evolving search agents. The resulting curriculum is not merely self-generated but auditable by construction: each training example carries an inspectable source span that explains why it should be trusted.
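
The evidence reward described in this abstract, the marginal accuracy gain when the evidence span is provided, reduces to a difference of two accuracies. The sketch below assumes the solver is sampled several times with and without the span and that each sample is marked correct or not; the sampling setup and the function name are assumptions, not details from the abstract.

```python
from typing import List

def marginal_gain_reward(correct_with: List[bool], correct_without: List[bool]) -> float:
    """Reward = solver accuracy with the evidence span minus accuracy without it."""
    if not correct_with or not correct_without:
        raise ValueError("both sample lists must be non-empty")
    acc_with = sum(correct_with) / len(correct_with)
    acc_without = sum(correct_without) / len(correct_without)
    return acc_with - acc_without

# Four solver samples per condition: 3/4 correct with evidence, 1/4 without.
reward = marginal_gain_reward([True, True, True, False], [True, False, False, False])
```

A span that the solver does not need (accuracy is already high without it) or that does not help (accuracy stays low with it) earns a reward near zero, which matches the stated goal of favoring evidence that genuinely contributes to the answer.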

2605.22903 2026-05-25 cs.CV cs.AI cs.CL 版本更新

Seeing without Looking: Do Vision-Language Benchmarks Really Test Vision?

不看而见:视觉-语言基准测试真的测试视觉吗?

Zixuan Lan, Luzhe Sun, Matthew R. Walter, Jiawei Zhou

发表机构 * University of Chicago(芝加哥大学) Stony Brook University(石溪大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所)

AI总结 该研究质疑了当前视觉-语言模型(VLMs)基准测试是否真正评估了模型对视觉证据的依赖程度。通过系统分析多个开源模型的行为表现,研究发现尽管VLMs会利用视觉输入,但其预测对细粒度视觉信息的丢失并不敏感,这与标准准确率所暗示的情况存在明显偏差。研究还从表示层面揭示了视觉特征在深层逐渐趋同的现象,为这一现象提供了可能的解释,表明现有基准可能无法有效评估模型的细粒度视觉理解能力。

Comments Accepted to GRAIL-V: Grounded Retrieval and Agentic Intelligence for Vision-Language, CVPR 2026 Workshop. accepted version

详情
AI中文摘要

基准测试的准确性通常被隐含地视为反映了视觉-语言模型(VLM)中的基础视觉理解,但尚不清楚这些分数在多大程度上真正反映了对视觉证据的依赖。受一个令人惊讶的观察结果——在广泛使用的幻觉基准测试中,移除大量图像令牌仅轻微降低模型性能——的启发,我们在一组开源VLM中系统地研究了这种不匹配。我们的分析涵盖多个粒度级别,包括全局视觉退化、局部遮挡、问题重述、答案空间扩展以及超出标准准确率的决策级分析。我们进一步用视觉令牌几何的逐层分析补充这些行为结果。在整个实验中,我们发现尽管VLM确实整合了视觉输入,但其预测对细粒度视觉证据丢失的敏感性低于标准准确率所暗示的程度。即使最终预测保持不变,模型对正确答案的内部支持可能已经减弱。我们还补充了表示级分析,显示深层中视觉令牌之间的相似性增加,这为我们的发现提供了一个可能的解释。总之,这些结果表明,当前的基准测试不足以可靠地评估VLM中的细粒度视觉基础。

英文摘要

Benchmark accuracy is often implicitly assumed to reflect grounded visual understanding in vision-language models (VLMs), yet it remains unclear to what extent such scores truly reflect reliance on visual evidence. Motivated by the surprising observation that removing a substantial fraction of image tokens degrades model performance only slightly on a widely used hallucination benchmark, we systematically investigate this mismatch in a set of open-source VLMs. Our analysis spans multiple levels of granularity, including global visual degradation, localized occlusion, question reformulation, answer-space expansion, and decision-level analyses beyond standard accuracy. We further complement these behavioral results with a layer-wise analysis of vision-token geometry. Throughout the experiments, we find that although VLMs do incorporate visual input, their predictions are less sensitive to the loss of fine-grained visual evidence than standard accuracy would suggest. Even when the final prediction remains unchanged, the model's internal support for the correct answer may already be weakened. The representation-level analysis shows increasing similarity among visual tokens in deeper layers, providing a possible explanation for our findings. Together, these results suggest that current benchmarks are not sufficient to reliably evaluate fine-grained visual grounding in VLMs.

2605.22902 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Transcoders 追踪视觉语言模型中的视觉基础与幻觉

Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos

发表机构 * Institute of Language and Speech Processing(语言与语音处理研究所) Athena Research Center(雅典研究中心)

AI总结 该研究探讨了生成式视觉-语言模型(VLMs)中视觉输入如何转化为文本的问题,提出了基于Transcoders的函数中心解释框架,用于分解模型内部的计算路径,揭示图像块与文本生成之间的关联。相比传统的稀疏自编码器(SAEs),该方法在图像块缺失实验中表现出更强且更稳定的解释效果,并能更准确地对应语义相关的图像区域。此外,研究还通过结构分析揭示了模型生成幻觉的机制,并利用图特征构建分类器实现了对幻觉的预测。

详情
AI中文摘要

生成式视觉语言模型(VLM)在多模态推理上表现良好,但视觉输入如何转化为文本仍知之甚少。现有的VLM可解释性工作使用稀疏自编码器(SAE),其分解静态残差表示,忽略了驱动跨模态交互的功能更新。我们采用基于Transcoders的功能中心框架,Transcoders是MLP子层的稀疏近似,作为逐层计算的因果代理。应用于Gemma 3-4B-IT,该框架将模型分解为可解释的计算路径,连接图像块到文本生成中的方向。在补丁消融下,Transcoder归因对视觉基础标记产生比SAE归因更强且更稳定的效果,并与语义相关的图像区域更好对齐。假视觉基础反事实分析证实恢复的路径是视觉-语言交互特有的。最后,我们对幻觉生成进行结构分析,从Transcoder产生的电路痕迹中提取基于图的指标。基于这些机制图特征的逻辑分类器以AUC 0.68预测幻觉。这些结果表明,功能中心的电路分解为VLM中的多模态计算提供了可解释且可预测的描述。

英文摘要

Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction. Finally, we perform a structural analysis of hallucinated generations by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at AUC 0.68. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

2605.22885 2026-05-25 cs.AI cs.CL cs.LG cs.LO 版本更新

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

ImProver 2:用于神经符号证明优化的迭代自改进语言模型

Riyaz Ahuja, Tate Rowney, Jeremy Avigad, Sean Welleck

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 随着形式化数学库的快速增长,对验证证明的重构和神经证明器训练数据质量的提升需求日益迫切。为解决可扩展性证明优化中面临的异构目标、数据稀缺和高训练推理成本等问题,本文提出ImProver 2,一个用于Lean 4的神经符号框架,结合高效的数据专家迭代流程和形式化结构暴露的轻量非正式抽象框架,并引入一系列衡量证明结构特性的指标。实验表明,该框架能够使小型模型在多个指标上达到与更大模型相当甚至更优的性能,展示了证明优化作为可扩展学习任务的可行性。

详情
AI中文摘要

形式化数学库正在迅速扩展,这产生了对已验证证明进行重构以保持可维护性以及提高神经证明器训练数据质量的日益增长的需求。然而,可扩展的证明优化受到异构且启发式指定的目标、稀缺的数据以及高训练和推理成本的阻碍。为了克服这些挑战,我们引入了ImProver 2,这是一个用于在Lean 4中自动进行证明优化的神经符号框架。ImProver 2将数据高效的专家迭代流程与一个暴露形式结构并附带轻量级非正式抽象的脚手架相结合。我们进一步引入了一套捕捉证明结构属性的指标。使用ImProver 2,我们训练了一个7B参数的模型,该模型在相同模型系列中优于数量级更大的模型,并且在各项指标上与中端前沿模型具有竞争力。我们还证明,我们的神经符号脚手架显著提高了小型和前沿模型的性能。我们表明,通过适当的脚手架和训练,小型模型可以有效地在复杂且多样的指标上重构研究级证明,与更大的系统相匹配,并将证明优化确立为一项可扩展、可学习的任务。

英文摘要

Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.

2605.22880 2026-05-25 cs.CL cs.AI cs.CY 版本更新

How Far Will They Go? Red-Teaming Online Influence with Large Language Models

它们会走多远?使用大型语言模型对在线影响力进行红队测试

Daniel C. Ruiz, Anna Serbina, Ashwin Rao, Emilio Ferrara, Luca Luceri

发表机构 * Information Sciences Institute, University of Southern California(南加州大学信息科学研究所)

AI总结 随着基于大语言模型(LLM)的代理越来越多地参与在线讨论,评估其支持政治影响力活动的能力对维护信息完整性至关重要。本文提出了一种实证的“红队”框架,用于测量LLM的奥弗顿窗口(Overton Window,OW),即模型在争议性话题上可靠表达的政治观点范围,并量化自然语言越狱技术如何扩展这一范围。研究评估了来自10个模型家族、5个国家的30多个开源LLM,发现其在政治表达上存在系统性偏差,如更倾向于生成左翼内容,且模型规模越大,OW范围越小,不同地区模型表现差异显著。研究还揭示了越狱效果在不同模型家族间差异明显,为识别有效的越狱技术组合提供了参考。

Comments 30 pages, 8 figures, submitted to COLM 2026

详情
AI中文摘要

随着基于大型语言模型(LLM)的代理越来越多地参与在线讨论,对其支持政治影响力活动的能力进行红队测试对于信息完整性至关重要。为实现这一目标,我们专注于本地部署的开源LLM,而非前沿的仅API模型,因为前者更符合在社交媒体环境中部署的注重隐私的恶意行为者的操作约束。我们引入了一个经验性的红队测试框架,用于测量LLM的Overton窗口(OW),即模型在争议话题上能够可靠表达的政治观点范围,并量化简单的自然语言越狱如何扩展该范围。我们评估了来自10个模型家族和5个原产国的30多个LLM。我们发现政治表达存在系统性不对称:开源LLM通常更愿意生成左倾的社交媒体内容,OW往往与模型大小成反比,尽管开源生态系统中的代表性不均,但区域差异显著。越狱效果在不同模型家族之间也差异很大,这促使我们开发一个工作流程来识别有效的越狱技术组合。综合来看,我们的结果建立了一个实用的框架,用于审计开源LLM的政治可操控性,并帮助未来的研究人员设计更强的对策来对抗基于LLM的影响力活动。

英文摘要

As large language model (LLM)-based agents increasingly participate in online discourse, red-teaming their capacity to support political influence campaigns is critical for information integrity. In pursuit of this goal, we focus on locally deployed open-source LLMs, as opposed to frontier API-only models, given their superior alignment with the operational constraints of privacy-conscious malicious actors deployed in social media environments. We introduce an empirical red-teaming framework for measuring LLM Overton Windows (OWs), defined as the range of political opinions a model can reliably express on controversial topics, and for quantifying how simple natural-language jailbreaks expand that range. We evaluate more than 30 LLMs spanning 10 model families and five countries of origin. We find systematic asymmetries in political expressivity: open-source LLMs are typically more willing to generate left-leaning social media content, OWs tend to contract inversely to model size, and regional differences are substantial despite uneven representation in the open-source ecosystem. Jailbreak potency also varies sharply across model families, motivating a workflow for identifying effective combinations of jailbreak techniques. Taken together, our results establish a practical framework for auditing the political steerability of open-source LLMs and for helping future researchers design stronger countermeasures against LLM-enabled influence campaigns.

2605.22878 2026-05-25 cs.AI cs.CL cs.IR cs.LG 版本更新

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

SciAtlas:面向自动化科学研究的大规模知识图谱

Shuofei Qiao, Yunxiang Wei, Jiazheng Fan, Bin Wu, Busheng Zhang, Mengru Wang, Yuqi Zhu, Ningyu Zhang, Keyan Ding, Qiang Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) University College London(伦敦大学学院)

AI总结 随着全球学术产出的指数级增长,研究者和人工智能代理面临前所未有的“信息爆炸”挑战,碎片化和非结构化的知识组织阻碍了跨学科的深度融合。为解决这一问题,本文提出 SciAtlas,一个涵盖26个学科、包含4300万篇论文、1.57亿实体和30亿三元组的多学科异构学术知识图谱,旨在构建全景式的科学演进网络。SciAtlas 提供了结构化的拓扑认知基础,打破了学科壁垒,并通过神经符号检索算法实现了从语义匹配到确定性关联发现的转变,为自动化科研全流程提供了高效、低成本的“认知地图”。

Comments Ongoing Work

详情
AI中文摘要

全球学术产出的指数级增长使研究人员和AI代理面临前所未有的“信息爆炸”,其中碎片化和非结构化的知识组织阻碍了深层次的跨学科整合。当前的学术检索工具主要依赖浅层关键词匹配或向量空间语义检索,缺乏导航复杂逻辑连接所需的拓扑推理能力。基于代理的深度研究框架往往容易出现逻辑幻觉并消耗高推理成本。为弥补这一差距,本报告介绍了SciAtlas,一个大规模、多学科、异构的学术资源知识图谱,设计为全景科学演化网络。通过整合来自26个学科的超过4300万篇论文,总计1.57亿个实体和30亿个三元组,SciAtlas提供了一个结构化的拓扑认知基础,打破了学科壁垒,并为AI代理提供了全局视角。此外,我们开发了一种神经符号检索算法,具有三路径协同召回和图重排序,实现了从简单语义匹配到确定性关联发现的无缝过渡。我们还展示了SciAtlas的关键应用方向,包括文献综述、自动化研究趋势综合、想法定位和学术轨迹探索,以证明SciAtlas可以作为有效的“认知地图”,赋能自动化科学研究的全流程,同时显著降低推理成本。我们已在GitHub仓库中发布了知识图谱检索和各种下游任务的接口。

英文摘要

The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion," where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucinations and incur high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, and a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective "cognitive map" to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.

2605.22870 2026-05-25 cs.LG cs.AI cs.CL 版本更新

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

读出捷径:位置数字复制主导小语言模型中的算术思维链读出

Ming Liu

发表机构 * Amazon(亚马逊)

AI总结 该研究探讨了小型语言模型在进行算术推理时,思维链(CoT)提示的实际作用。研究发现,模型在输出答案时更倾向于复制位于答案分隔符前的最后一个数字,而非依赖中间推理过程。这一“位置捷径”现象显著影响了模型性能,表明当前的CoT方法可能更多依赖位置信息而非逻辑推理。实验还揭示了不同模型在复制行为上的差异,并指出这一机制可能与模型架构及任务类型相关。

Comments 18 pages (8 main + 10 appendix), 3 figures, 5 tables

详情
AI中文摘要

思维链提示对于小语言模型进行算术运算是必要的,然而打乱其步骤仍能保留大部分性能。如果思维链贡献的不是逻辑顺序,那是什么?在三个1-3B指令微调的语言模型上,针对GSM8K数据集,我们通过前缀补全隔离了答案读出阶段,并识别出一个位置捷径:模型复制占据答案分隔符前最后一个位置的数字,无论中间推理如何。正确答案的存在贡献了54-92个百分点的准确率(每个模型教师强制上限的89-92%);即使在错误项上,最终答案与思维链最后一个数字匹配的概率为95-96%。复制通道优先于保留上下文补全:用错误值替换最后一个数字会使准确率降至接近零,尽管中间步骤正确;但移除它后,准确率在该基线之上恢复5-32个百分点——当存在可复制的数字时,即使模型本可以执行的单步算术也被抑制。Qwen和Llama在87-95%的情况下复制新干扰项;Gemma则选择性门控。头部级消融实验揭示了特定于架构的头部集;该效应在GSM-Symbolic上复现。在非算术的BBH任务上,打乱保留率急剧下降;在7-8B规模时,出现了内容选择性门控。步骤级忠实度评估有风险将位置答案传输与真实计算混为一谈——这是基于思维链的监督的一个失败模式。

英文摘要

Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.
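The "final answer matches the last CoT number" statistic above can be illustrated with a minimal Python sketch. The record layout, the number regex, and the function names are assumptions made for this illustration, not the paper's evaluation code.

```python
import re

# Matches integers and decimals, optionally signed, with optional thousands separators.
NUMBER_RE = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def last_number(text):
    """Return the last number in `text` as a float, or None if there is none."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    return float(matches[-1].replace(",", ""))


def copy_rate(records):
    """Fraction of records whose final answer equals the trailing CoT number.

    `records` is an iterable of (cot_text, final_answer) pairs; this layout is
    a hypothetical one chosen for the illustration.
    """
    hits = total = 0
    for cot_text, final_answer in records:
        trailing = last_number(cot_text)
        if trailing is None:
            continue  # nothing to copy, so the item is skipped
        total += 1
        hits += int(abs(trailing - float(final_answer)) < 1e-9)
    return hits / total if total else 0.0
```

A high `copy_rate` on items the model answers incorrectly would be consistent with positional copying, although on its own it does not separate copying from genuine computation; the abstract's number-replacement and number-removal interventions serve that purpose.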

2605.22855 2026-05-25 cs.GT cs.AI cs.CL cs.LG 版本更新

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

PrefBench:评估隐藏偏好个性化定价谈判中的零样本LLM智能体

Yingjie Lei

发表机构 * University of Aberdeen(阿伯丁大学)

AI总结 本文提出了PrefBench,一个用于评估零样本大语言模型(LLM)代理在隐藏偏好个性化定价谈判中表现的基准测试平台。该平台通过模拟买家与固定车辆定制套餐的互动,要求卖家在仅能获取公开信息的情况下进行谈判,而买家的估值、耐心、还价行为等关键参数是隐藏的。实验表明,尽管LLM代理能够遵循协议并达成高比例的交易,但其利润表现较差,远不如简单的让步策略,突显了当前LLM在利润敏感型谈判中的不足。PrefBench为研究隐藏买家偏好下的定价代理行为提供了可控的评估环境。

Comments 24 pages, 3 figures, 5 tables. Code is available at https://github.com/ChaosTheProducer/PrefBench

详情
AI中文摘要

个性化定价谈判是LLM智能体的一个具有挑战性的测试平台,因为成功的互动并不能保证盈利的决策。当买方的支付意愿和谈判特征仍然隐藏时,卖方可能产生有效的行动并达成许多交易,但定价仍然很差。本文提出了PrefBench,一个基于模拟器的隐藏偏好个性化定价谈判基准。每个回合将一个模拟买家与一个固定的车辆定制捆绑包配对;卖方观察公开的人物描述符、捆绑包信息和谈判历史,而潜在的买方变量控制估值、耐心、还价行为和退出决策。PrefBench通过一个面向LLM的状态摘要协议来评估这一设置,该协议限制智能体在固定的隐藏信息边界下返回严格的JSON动作。我们在7500个回合中评估了零样本LLM卖家与启发式参考。测试的LLM可靠地遵循协议,实现了高于0.99的交易率,但它们的卖家利润结果仍然较弱:最佳LLM平均利润仅略高于随机基线,远低于同一回合流下的简单让步启发式。这些结果表明,结构化行动合规性和寻求协议的行为可以与弱利润敏感谈判共存。PrefBench为评估隐藏买方偏好下的定价智能体行为提供了一个受控基准。

英文摘要

Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.
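To make the distinction between protocol compliance and profit-seeking concrete, here is a minimal Python sketch of a strict JSON action check alongside a linear concession rule. The action names, the two-field schema, and the linear schedule are hypothetical; the abstract does not specify PrefBench's actual schema or heuristic.

```python
import json

ALLOWED_ACTIONS = {"offer", "accept", "walk_away"}  # hypothetical action set


def parse_action(raw):
    """Parse a seller reply, rejecting anything that is not a strict JSON action."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    # Hypothetical schema: exactly the keys "action" and "price".
    if not isinstance(data, dict) or set(data) != {"action", "price"}:
        return None
    if data["action"] not in ALLOWED_ACTIONS:
        return None
    price = data["price"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(price, bool) or not isinstance(price, (int, float)):
        return None
    return data


def concession_price(round_index, total_rounds, list_price, floor_price):
    """Linear concession: start at list_price and reach floor_price in the last round."""
    if total_rounds <= 1:
        return float(list_price)
    step = (list_price - floor_price) / (total_rounds - 1)
    return list_price - step * round_index
```

A reply can pass `parse_action` in every round and still price poorly, which is the gap the benchmark reports: validity is checked by the first function, while profit depends on the schedule in the second.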

2605.22843 2026-05-25 cs.CL cs.IR 版本更新

Knowledge Distillation for Low-Resource Open-source Text-to-SQL Model

面向低资源开源文本到SQL模型的知识蒸馏

Tianhao Qiu, Xiaojun Chen

发表机构 * Shenzhen University(深圳大学)

AI总结 本文研究了在低资源环境下提升开源文本到SQL模型性能的问题,针对领域特定数据库中高质量标注数据稀缺、模式定义不透明等挑战,提出了一种基于知识蒸馏的文本到SQL框架。该方法构建了包含模式语义、缩写、业务逻辑和查询模式的任务特定知识库,并将其注入训练和推理过程,生成符合上下文的合成数据并增强推理能力。实验表明,该方法在多个基准测试中显著提升了开源和闭源大模型在文本到SQL任务中的表现,特别是在低资源的领域特定场景中。

Comments 17 pages, 5 figures

详情
AI中文摘要

文本到SQL将自然语言问题转换为可执行的SQL查询,使非技术用户能够访问关系数据库进行分析和智能数据服务。在现实场景中,性能常受低资源设置的限制,其中高质量的标注<问题, SQL>对稀缺,尤其是针对特定领域的数据库。其他挑战包括不明确的模式定义、缩写以及未在模式中显式编码的隐含业务逻辑。现有的数据合成和提示技术提高了覆盖率,但往往无法生成与数据库约束一致的、任务特定且语义扎实的示例。为应对这些挑战,我们提出一个知识感知的文本到SQL框架,该框架构建任务特定的知识库,包括模式语义、缩写、业务逻辑和查询模式,并将其注入训练和推理中。该框架生成多样化、上下文扎实的合成训练数据,并通过目标知识检索增强推理。在涵盖通用和领域特定数据集的七个基准上的实验表明,我们的方法显著提升了开源和闭源大语言模型在文本到SQL任务中的性能,尤其是在低资源领域特定设置中,增强了泛化性、鲁棒性和适应性。

英文摘要

Text-to-SQL converts natural language questions into executable SQL queries, enabling non-technical users to access relational databases for analytics and intelligent data services. In real-world scenarios, performance is often constrained by low-resource settings, where high-quality annotated <question, SQL> pairs are scarce, particularly for domain-specific databases. Additional challenges include opaque schema definitions, abbreviations, and implicit business logic that are not explicitly encoded in the schema. Existing data synthesis and prompting techniques improve coverage but often fail to produce task-specific, semantically grounded examples aligned with database constraints. To address these challenges, we propose a knowledge-aware Text-to-SQL framework that constructs a task-specific knowledge base, including schema semantics, abbreviations, business logic, and query patterns, and injects it into both training and inference. This framework generates diverse, contextually grounded synthetic training data and enhances inference through targeted knowledge retrieval. Experiments on seven benchmarks, covering both general and domain-specific datasets, demonstrate that our approach substantially improves the performance of open-source and closed-source large language models in Text-to-SQL tasks, especially in low-resource domain-specific settings, enhancing generalization, robustness, and adaptability.

2605.22841 2026-05-25 physics.soc-ph cs.AI cs.CL cs.GT cs.MA econ.GN q-fin.EC 版本更新

Strategic Coercion Within Alliances: The Greenland Sovereignty Game as an AI Stress Test

联盟内的战略胁迫:格陵兰主权博弈作为人工智能压力测试

Rommin Adl, Peyton Williams

发表机构 * Grinnell College(格里纳尔学院)

AI总结 本文以2019-2026年美国试图从丹麦手中获得格陵兰主权的事件为案例,研究联盟内部强权对弱权的策略性施压问题,构建了多个博弈模型并通过八种前沿大语言模型进行多智能体模拟实验。研究揭示了在战略控制与联盟规范执行等集体行动难题下,不同模型在权力权重、行为策略和冲突升级等方面表现出显著差异,尤其指出中国来源模型在扮演美国角色时具有不同于西方模型的特征,并发现仅有少数模型能够实现和平的美国获取格陵兰的情景。

Comments 78 pages, 17 figures, 18 tables. Multi-agent LLM simulation recovering structural utility parameters across 8 frontier models in the Greenland sovereignty crisis. v3: typo pass, fixes phantom action names (REQUEST_MULTILATERAL, INDEPENDENT) and a Blunden date mismatch. v2 added Section V safety findings (legitimacy-laundered escalation, signal decoupling) and Appendix H

详情
AI中文摘要

当最强大的联盟成员在领土和战略控制问题上向较弱的成员施压时会发生什么?我们将格陵兰主权危机作为大语言模型地缘政治的压力测试,聚焦于2019-2026年美国推动从丹麦王国获取格陵兰的努力。该危机嵌套了两个集体行动问题:北极战略控制以及北约能否对主导成员执行联盟规范。我们开发了三个博弈(非对称胁迫;具有临界点转折的北约保证博弈;具有社会偏好的三元扩展式博弈),并通过多智能体模拟进行测试,其中八个前沿大语言模型扮演六个地缘政治角色(美国、丹麦、格陵兰、北约、俄罗斯、加拿大),共完成3604场博弈和108120个行动观测。利用逆向博弈论,我们恢复了每个模型的结构性效用参数(alpha、beta、gamma、delta、eta),分别对应物质自利、互惠、不平等厌恶、规范尊重和承诺一致性。三个发现突出:第一,所有八个模型在胁迫框架下更倾向于升级(四行动升级从10.7%上升至28.6%);第二,中国来源模型在扮演美国角色时显示出与西方来源模型系统性不同的权力权重分布;第三,和平的美国获取仅在1.9%的干净博弈中出现,且8个前沿模型中只有3个实现了这一点,最突出的是DeepSeek V3.2,它通过宗主国执行了稳定的五轮策略。强调强行法(jus cogens)和自决的提示在仅英语的确认样本中将升级降低回基线附近;多语言对比作为探索性敏感性检验报告。我们将此定位为大语言模型地缘政治行为的结构性基准,补充行动频率基准。

英文摘要

What happens when the strongest alliance member pressures a weaker member over territory and strategic control? We examine the Greenland sovereignty crisis as a stress test for LLM geopolitics, centered on the 2019-2026 U.S. push to acquire Greenland from the Kingdom of Denmark. The crisis nests two collective-action problems: Arctic strategic control and whether NATO can enforce alliance norms against the dominant member. We develop three games (asymmetric coercion; a NATO assurance game with a critical-mass tipping point; a triadic extensive-form game with social preferences) and test them with a multi-agent simulation in which eight frontier LLMs play six geopolitical roles (United States, Denmark, Greenland, NATO, Russia, Canada) across 3,604 completed games and 108,120 action observations. Using inverse game theory, we recover each model's structural utility parameters (alpha, beta, gamma, delta, eta) for material self-interest, reciprocity, inequality aversion, norm respect, and commitment consistency. Three findings stand out. First, all eight models become more escalatory under coercion framing (four-action escalation rises from 10.7% to 28.6%). Second, Chinese-origin models show systematically different power-weight profiles from Western-origin models when playing the U.S. role. Third, peaceful US acquisition emerges in only 1.9% of clean games and only 3 of 8 frontier models ever achieve it, most prominently DeepSeek V3.2, which executes a stable five-round playbook through the metropole. Prompts emphasizing jus cogens and self-determination reduce escalation back near baseline in the English-only confirmatory sample; multilingual contrasts are reported as exploratory sensitivity checks. We position this as a structural benchmark for LLM geopolitical behavior, complementing action-frequency benchmarks.

2605.22828 2026-05-25 cs.CL 版本更新

A Survey of Text and Speech Resources for Hausa and Fongbe: Availability, Quality, and Gaps for NLP Development

豪萨语和丰贝语文本与语音资源综述:可用性、质量及NLP发展差距

Mahounan Pericles Adjovi, Victor Olufemi, Roald Eiselen, Prasenjit Mitra

发表机构 * Carnegie Mellon University Africa(卡内基梅隆大学非洲分校) Centre for Text Technology, North-West University(西北大学文本技术中心)

AI总结 本文综述了用于豪萨语和丰贝语的公开文本和语音资源,分析了它们的可用性、质量及在自然语言处理发展中的不足。豪萨语作为拥有数千万使用者的阿非罗-亚细亚语系语言,拥有较为丰富的文本资源,而丰贝语虽资源较少,但近期在语音数据收集方面有所进展。研究指出了两种语言在资源多样性上的差异,并提出了针对命名实体识别和词性标注任务的改进方向,强调了建立更多领域多样化的丰贝语文本和豪萨语语音语料库的必要性。

Comments 8 pages, 7 tables; survey paper; to appear in IEEE SDS 2026

详情
AI中文摘要

本综述全面编目了两种西非语言的公开文本和语音资源:豪萨语(一种亚非语系语言,约有8000万至1亿使用者)和丰贝语(一种尼日尔-刚果语系语言,在贝宁约有200万人使用)。这两种语言代表了资源可用性谱系上的对比案例。我们探讨的问题是:豪萨语和丰贝语的公开NLP资源现状如何,还存在哪些差距?通过系统搜索学术库、数据平台和网络资源,我们编目了平行语料库、单语文本集、语音数据集、预训练模型和评估基准。对于每种资源,我们记录了规模、领域覆盖、格式、许可和可访问性。我们的发现表明,豪萨语受益于新闻、百科和教育领域更广泛的文本资源多样性。丰贝语虽然文本资源较为有限,但已成为近期学术语音数据收集项目的重点。两种语言在Masakhane的命名实体识别和词性标注基准中均有代表。我们提供了任务特定建议,并指出了优先差距,包括领域多样的丰贝语文本和专门的豪萨语语音语料库。

英文摘要

This survey provides a comprehensive catalog of publicly available text and speech resources for two West African languages: Hausa, an Afroasiatic language with approximately 80-100 million speakers, and Fongbe, a Niger-Congo language spoken by approximately 2 million people in Benin. These languages represent contrasting cases on the resource availability spectrum. We address the question: What is the current state of publicly available NLP resources for Hausa and Fongbe, and what gaps remain? Through systematic search of academic repositories, data platforms, and web sources, we catalog parallel corpora, monolingual text collections, speech datasets, pre-trained models, and evaluation benchmarks. For each resource, we document size, domain coverage, format, licensing, and accessibility. Our findings reveal that Hausa benefits from broader text resource diversity across news, encyclopedic, and educational domains. Fongbe, while having more limited text resources, has been the focus of recent academic speech data collection initiatives. Both languages are represented in Masakhane benchmarks for NER and POS tagging. We provide task-specific recommendations and identify priority gaps including domain-diverse Fongbe text and dedicated Hausa speech corpora.

2605.22826 2026-05-25 cs.CL cs.AI cs.GT cs.MA 版本更新

Evaluating Large Language Models in a Complex Hidden Role Game

评估大型语言模型在复杂隐藏角色游戏中的表现

Niklas Bauer

发表机构 * University of Göttingen(哥廷根大学)

AI总结 本文研究了大型语言模型(LLMs)在复杂隐藏角色游戏《Secret Hitler》中的推理、说服与欺骗能力,引入了角色识别准确率、欺骗保持率和游戏状态影响率等新型评估指标。通过与基于规则的算法和人类游戏进行对比,发现当前模型在策略深度上仍存在明显不足,且增强推理的技术如思维链提示和内部记忆并未提升模型表现,反而导致部分角色的胜率下降。研究结果表明,现有模型在复杂的多轮操控任务中仍表现欠佳,亟需进一步改进以实现更高级的对齐与安全控制。

Comments Master's thesis, University of Göttingen

详情
AI中文摘要

量化大型语言模型(LLMs)的欺骗潜力对于人工智能安全至关重要,但在非受控环境中难以实现。本文研究了LLMs在社交推理游戏《秘密希特勒》中的推理、说服和欺骗能力。我引入了一个开源框架和新的度量指标来衡量性能:角色识别准确率、欺骗保持率和游戏状态影响率。通过将模型与基于规则的算法和人类游戏进行基准测试,我识别出对话能力与战略深度之间的差距。研究还分析了推理增强技术对胜率和战略推理的影响。无论是思维链提示还是内部记忆,都没有带来性能提升,法西斯角色的胜率甚至下降了23.2%。虽然基于规则的智能体在86.7%的情况下与专家人类投票决策一致,但Llama 3.1 70B等模型仅达到59.7%的准确率。扮演法西斯角色的模型始终产生负面的影响分数,并且无法维持欺骗,导致游戏时间比人类短约40%。这些发现表明,当前的架构在复杂的多轮操纵中仍然无效。随着能力的提升,检测模型何时开始掌握这些欺骗行为至关重要。所开发的框架可作为未来对齐研究的可重复测试平台。

英文摘要

Quantifying the deceptive potential of Large Language Models (LLMs) is critical for AI safety, yet difficult to achieve in uncontrolled environments. This work investigates the reasoning, persuasion, and deceptive capabilities of LLMs within the social deduction game Secret Hitler. I introduce an open-source framework and novel metrics to measure performance: Role Identification Accuracy, Deception Retention Rate, and Game State Impact Rate. By benchmarking models against rule-based algorithms and human games, I identify a gap between conversational ability and strategic depth. The study also analyzes the impact of reasoning-enhancement techniques on win rates and strategic reasoning. Neither Chain-of-Thought prompting nor internal memory improves performance; win rates for Fascist roles are up to 23.2% worse. While rule-based agents align with expert human voting decisions 86.7% of the time, models like Llama 3.1 70B achieve only 59.7% accuracy. Models playing as Fascists consistently yield negative impact scores and fail to sustain deception, resulting in roughly 40% shorter games compared to humans. These findings suggest that current architectures remain ineffective at complex, multi-turn manipulation. As capabilities advance, detecting when models begin to master these deceptive behaviors is crucial. The developed framework serves as a reproducible testbed for future alignment research.

2605.11416 2026-05-25 cs.CL 版本更新

Freeze Deep, Train Shallow: Interpretable Layer Allocation for Continued Pre-Training

冻结深层,训练浅层:面向持续预训练的可解释层分配

Yu-Hang Wu, Qin-Yuan Liu, Qiu-Yang Zhao, Bo Jiang, Jiang-Feng Yang, Qing-Wei Cong

发表机构 * Nanhu Research Institute of China Electronic Science and Technology(中国电子科技南湖研究院) School of Electronic and Electrical Engineering, Shanghai University of Engineering Science(上海工程技术大学电子电气工程学院)

AI总结 本文研究了如何在低成本下进行大语言模型的持续预训练,提出了一种可解释的层分配策略。通过引入架构无关的诊断框架LayerTracer,揭示了各层表示的演变模式与稳定性,发现深层在任务执行中起关键作用且更具鲁棒性。实验表明,在持续预训练中保持深层冻结而浅层训练的策略优于全参数微调及其他分配方式,并在多个基准测试中表现更优,为资源受限团队提供了低成本、可解释的参数分配指导。

详情
AI中文摘要

选择性逐层更新对于大型语言模型(LLMs)的低成本持续预训练至关重要,但由于缺乏可解释的指导,确定哪些层应冻结或训练仍然是一个经验性的黑盒问题。为了解决这个问题,我们提出了LayerTracer,一个与架构无关的诊断框架,通过定位任务执行位置和量化层敏感性来揭示逐层表示和稳定性的演化模式。分析结果表明,深层作为任务执行的关键区域,并且对干扰更新保持高稳定性。基于这一发现,我们进行了三项受控的持续预训练试验,比较了不同的冻结-训练策略,证明在C-Eval和CMMLU基准测试中,训练浅层同时冻结深层始终优于全参数微调和相反的分配。我们进一步展示了一个混合模型案例研究,验证了将高质量预训练模块放置在深层可以有效保留模型的内在知识。这项工作为资源受限的团队提供了一种低成本且可解释的解决方案,为持续预训练和混合模型构建中的逐层参数分配提供了可操作的指导。

英文摘要

Selective layer-wise updates are essential for low-cost continued pre-training of Large Language Models (LLMs), yet determining which layers to freeze or train remains an empirical black-box problem due to the lack of interpretable guidance. To address this issue, we propose LayerTracer, an architecture-agnostic diagnostic framework that reveals the evolution patterns of layer-wise representations and stability by locating task execution positions and quantifying layer sensitivity. Analysis results reveal that deep layers act as critical regions for task execution and maintain high stability against disruptive updates. Guided by this finding, we conduct three controlled continued pre-training trials to compare diverse freeze-train strategies, demonstrating that training shallow layers while freezing deep layers consistently outperforms full-parameter fine-tuning and the opposite allocation on both C-Eval and CMMLU benchmarks. We further present a hybrid model case study, which validates that placing high-quality pre-trained modules in deep layers effectively preserves inherent knowledge of the model. This work delivers a low-cost and interpretable solution for resource-constrained teams, offering actionable guidance for layer-wise parameter allocation in continued pre-training and hybrid model construction.
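The "train shallow, freeze deep" allocation can be sketched in plain Python as a per-layer trainability plan. The split point, the `layers.` parameter-name prefix, and the choice to freeze everything outside the layer stack are assumptions for illustration; the abstract does not state how many layers are trained or how LayerTracer picks the boundary.

```python
def layer_plan(num_layers, num_trainable):
    """Mark the first `num_trainable` (shallow) layers trainable and freeze the rest."""
    if not 0 <= num_trainable <= num_layers:
        raise ValueError("num_trainable must lie between 0 and num_layers")
    return {i: i < num_trainable for i in range(num_layers)}


def is_trainable(param_name, plan, prefix="layers."):
    """Decide trainability from a parameter name such as 'layers.3.mlp.weight'.

    The name format is a hypothetical convention; real checkpoints differ.
    """
    if not param_name.startswith(prefix):
        return False  # embeddings, final norm, output head: frozen in this sketch
    index = int(param_name[len(prefix):].split(".", 1)[0])
    return plan[index]
```

With PyTorch, such a plan would typically be applied by looping over `model.named_parameters()` and setting `param.requires_grad = is_trainable(name, plan)`, with the prefix adapted to the model's actual parameter names.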

2601.14652 2026-05-25 cs.AI cs.CL cs.MA 版本更新

MAS-Orchestra: Understanding and Improving Multi-Agent Reasoning Through Holistic Orchestration and Controlled Benchmarks

MAS-Orchestra:通过整体编排和受控基准理解与改进多智能体推理

Zixuan Ke, Yifei Ming, Austin Xu, Ryan Chin, Xuan-Phi Nguyen, Prathyusha Jwalapuram, Jiayu Wang, Semih Yavuz, Caiming Xiong, Shafiq Joty

发表机构 * Salesforce Research(Salesforce研究院) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文提出MAS-Orchestra,一种通过整体编排和受控基准测试来理解和提升多智能体系统(MAS)推理能力的训练框架。该方法将MAS的编排建模为函数调用的强化学习问题,能够一次性生成完整的MAS系统,并通过抽象子智能体为可调用函数,实现对系统结构的全局推理。同时,研究引入MASBENCH基准,从五个维度刻画任务特性,揭示MAS优势依赖于任务结构、验证机制及智能体能力,而非普遍适用。实验表明,MAS-Orchestra在多个基准测试中取得显著提升,效率较现有方法提高十倍以上。

Comments ICML 2026

详情
AI中文摘要

虽然多智能体系统(MAS)通过智能体协调有望提升智能水平,但当前自动MAS设计的方法表现不佳。这些不足源于两个关键因素:(1)方法论复杂性——智能体编排通过顺序的代码级执行进行,限制了全局系统级整体推理,且随智能体复杂性扩展性差;(2)效能不确定性——MAS在未理解相比单智能体系统(SAS)是否有切实益处的情况下被部署。我们提出MAS-Orchestra,一个训练时框架,将MAS编排形式化为具有整体编排的函数调用强化学习问题,一次性生成整个MAS。在MAS-Orchestra中,复杂的、面向目标的子智能体被抽象为可调用函数,从而在隐藏内部执行细节的同时实现系统结构上的全局推理。为了严格研究MAS何时以及为何有益,我们引入了MASBENCH,一个受控基准,沿五个轴表征任务:深度、范围、广度、并行性和鲁棒性。我们的分析揭示,MAS的收益关键取决于任务结构、验证协议以及编排器和子智能体的能力,而非普遍成立。在这些洞察的指导下,MAS-Orchestra在数学推理、多跳问答和基于搜索的问答等公共基准上实现了一致的改进,同时相比强基线实现了超过10倍的效率提升。MAS-Orchestra和MASBENCH共同使得在追求多智能体智能的过程中能够更好地训练和理解MAS。

英文摘要

While multi-agent systems (MAS) promise elevated intelligence through coordination of agents, current approaches to automatic MAS design under-deliver. Such shortcomings stem from two key factors: (1) methodological complexity: agent orchestration is performed using sequential, code-level execution that limits global, system-level holistic reasoning and scales poorly with agent complexity; and (2) efficacy uncertainty: MAS are deployed without understanding whether they offer tangible benefits over single-agent systems (SAS). We propose MAS-Orchestra, a training-time framework that formulates MAS orchestration as a function-calling reinforcement learning problem with holistic orchestration, generating an entire MAS at once. In MAS-Orchestra, complex, goal-oriented subagents are abstracted as callable functions, enabling global reasoning over system structure while hiding internal execution details. To rigorously study when and why MAS are beneficial, we introduce MASBENCH, a controlled benchmark that characterizes tasks along five axes: Depth, Horizon, Breadth, Parallel, and Robustness. Our analysis reveals that MAS gains depend critically on task structure, verification protocols, and the capabilities of both orchestrator and subagents, rather than holding universally. Guided by these insights, MAS-Orchestra achieves consistent improvements on public benchmarks including mathematical reasoning, multi-hop QA, and search-based QA, while achieving more than 10x efficiency over strong baselines. Together, MAS-Orchestra and MASBENCH enable better training and understanding of MAS in the pursuit of multi-agent intelligence.

2311.01468 2026-05-25 cs.CL cs.LG 版本更新

Remember what you did so you know what to do next

记住你做了什么,以便知道下一步该做什么

Manuel R. Ciosici, Alex Hedges, Yash Kankanampati, Justin Martin, Marjorie Freedman, Ralph Weischedel

发表机构 * Information Sciences Institute, University of Southern California(信息科学研究所,南加州大学)

AI总结 本文研究了使用中等规模的大型语言模型(GPT-J,60亿参数)为模拟机器人在ScienceWorld平台中制定计划,以完成30类科学实验目标。实验表明,通过引入更多历史步骤信息,该模型的性能显著优于基于强化学习的方法,最高可达3.5倍。研究还指出任务类别间的性能差异较大,平均表现可能掩盖具体问题,并展示了在仅使用6.5%训练数据时仍能取得2.2倍的性能提升。

Comments Identical to EMNLP 2023 Findings

详情
AI中文摘要

我们探索使用中等规模的大语言模型(GPT-J,6B参数)为模拟机器人在ScienceWorld(一个用于基础科学实验的文本游戏模拟器)中制定计划,以实现30类目标。先前发表的实证工作声称,与强化学习相比,大语言模型(LLMs)不太适合(Wang等人,2022)。使用马尔可夫假设(仅前一步),LLM的性能是强化学习方法性能的1.4倍。当我们尽可能多地填充LLM的输入缓冲区(包含尽可能多的先前步骤)时,性能提升至3.5倍。即使仅使用6.5%的训练数据,我们观察到性能比强化学习方法提高了2.2倍。我们的实验表明,30类动作的性能差异很大,这表明对任务进行平均可能会掩盖显著的性能问题。在与我们同时期的工作中,Lin等人(2023)展示了一种两部分方法(SwiftSage),该方法使用一个小型LLM(T5-large)并辅以OpenAI的大规模LLM,在ScienceWorld中取得了出色结果。我们的6B参数单阶段GPT-J的性能,与SwiftSage的两阶段架构在后者结合GPT-3.5 turbo(其参数数量是GPT-J的29倍)时的性能相当。

英文摘要

We explore using a moderately sized large language model (GPT-J, 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6B-parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when the latter incorporates GPT-3.5 turbo, which has 29 times more parameters than GPT-J.
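The contrast between the Markov setting and the "fill the input buffer" setting can be sketched in Python as two ways of building the model input. The prompt layout, the `(action, observation)` history format, and the whitespace token counter are assumptions for illustration, not the paper's actual input encoding.

```python
def build_prompt(goal, history, budget, count_tokens=lambda s: len(s.split())):
    """Keep the most recent steps that fit in `budget` tokens, oldest first.

    `history` is a list of (action, observation) pairs. `count_tokens` defaults
    to whitespace word counting, a stand-in for a real tokenizer.
    """
    header = f"Goal: {goal}"
    used = count_tokens(header)
    kept = []
    # Walk backwards from the newest step and stop once the budget is exceeded.
    for action, observation in reversed(history):
        line = f"> {action}\n{observation}"
        cost = count_tokens(line)
        if used + cost > budget:
            break
        kept.append(line)
        used += cost
    return "\n".join([header] + kept[::-1])


def markov_prompt(goal, history):
    """Markov variant: condition on the single most recent step only."""
    return "\n".join([f"Goal: {goal}"] + [f"> {a}\n{o}" for a, o in history[-1:]])
```

Dropping the oldest steps first is one plausible truncation policy; it keeps the context closest to the next decision, at the cost of forgetting early steps in long episodes.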

2110.01552 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

或许PTLMs应该去上学——一项评估开卷和闭卷问答的任务

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

发表机构 * Information Sciences Institute, University of Southern California(信息科学研究所,南加州大学)

AI总结 本文提出了一项新的任务,旨在评估预训练语言模型(PTLMs)在开卷和闭卷场景下的问答能力,使用社会科学和人文学科的大学入门教材作为教学材料。研究基于教材作者编写的复习题设计判断题,并划分验证测试与盲测,发现PTLMs在闭卷条件下表现有限,表明其可能未真正理解教材内容;而在开卷条件下,允许模型检索相关段落进行回答时,性能有所提升(约60%)。该任务为评估PTLMs对教学文本的理解能力提供了新的基准。

Comments Identical to the EMNLP 2021 version

详情
AI中文摘要

我们的目标是提供一项新任务和排行榜,以刺激关于问答和预训练语言模型(PTLM)的研究,使其理解重要的教学文档,例如大学入门教科书或手册。PTLM在许多问答任务中取得了巨大成功,但需要大量监督训练,而在零样本设置中表现较差。我们提出了一项新任务,包括两本社会科学(《美国政府2e》)和人文科学(《美国历史》)的大学入门教材,数百个基于教材作者编写的复习题的真假陈述,基于教材前八章的验证/开发测试,基于剩余章节的盲测,以及基于最先进PTLM的基线结果。由于问题平衡,随机表现应为约50%。使用BoolQ微调的T5达到了相同的表现,表明教材内容未在PTLM中预表示。闭卷考试(即阅读教材,将教材添加到T5的预训练中)最多带来微小改进(56%),表明PTLM可能没有“理解”教材(或可能误解了问题)。开卷考试(即允许机器自动检索段落并用于回答问题)表现更好(约60%)。

英文摘要

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5 fine-tuned with BoolQ achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed-book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

2101.05400 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Machine-Assisted Script Curation

机器辅助脚本编纂

Manuel R. Ciosici, Joseph Cummings, Mitchell DeHaven, Alex Hedges, Yash Kankanampati, Dong-Ho Lee, Ralph Weischedel, Marjorie Freedman

发表机构 * Information Sciences Institute, University of Southern California(信息科学研究所,南加州大学)

AI总结 本文介绍了一种名为MASC的系统,用于实现人机协作的脚本创作。该系统能够自动生成事件类型、链接至维基数据、提示可能被遗漏的子事件,并记录参与多个子事件的实体及其时间顺序,从而辅助用户高效编写结构复杂的事件脚本。研究展示了MASC在实际案例中的应用效果,验证了其在脚本创作中的实用价值。

Comments Identical to the NAACL 2021 Demo version

详情
AI中文摘要

我们描述了机器辅助脚本编纂器(MASC),一个用于人机协作脚本创作的系统。使用MASC生成的脚本包括:(1)构成更大复杂事件的子事件的英文描述;(2)每个事件的类型;(3)预期参与多个子事件的实体记录;(4)子事件之间的时间顺序。MASC通过提供事件类型建议、维基数据链接以及可能被遗忘的子事件,自动化了脚本创作过程的部分环节。我们通过几个案例研究脚本展示了这些自动化功能对脚本作者的实用性。

英文摘要

We describe Machine-Aided Script Curator (MASC), a system for human-machine collaborative script authoring. Scripts produced with MASC include (1) English descriptions of sub-events that comprise a larger, complex event; (2) event types for each of those events; (3) a record of entities expected to participate in multiple sub-events; and (4) temporal sequencing between the sub-events. MASC automates portions of the script creation process with suggestions for event types, links to Wikidata, and sub-events that may have been forgotten. We illustrate how these automations are useful to the script writer with a few case-study scripts.
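The four script components listed above map naturally onto a small data structure. The following Python sketch is a hypothetical representation, not MASC's actual schema; the class names, field names, and event-type labels are invented for illustration.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter  # standard library, Python 3.9+


@dataclass
class SubEvent:
    description: str  # (1) English description of the sub-event
    event_type: str   # (2) event type; the label scheme here is hypothetical


@dataclass
class Script:
    name: str
    sub_events: list = field(default_factory=list)
    entities: dict = field(default_factory=dict)  # (3) entity -> sub-event indices
    before: list = field(default_factory=list)    # (4) (earlier, later) index pairs

    def shared_entities(self):
        """Entities that take part in more than one sub-event."""
        return sorted(e for e, idx in self.entities.items() if len(set(idx)) > 1)

    def ordered_descriptions(self):
        """One ordering of the sub-events consistent with the temporal pairs."""
        sorter = TopologicalSorter({i: set() for i in range(len(self.sub_events))})
        for earlier, later in self.before:
            sorter.add(later, earlier)  # `later` depends on `earlier`
        return [self.sub_events[i].description for i in sorter.static_order()]
```

Storing temporal sequencing as pairwise constraints rather than a fixed list allows partially ordered scripts; `static_order()` raises `graphlib.CycleError` if the constraints contradict each other.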