arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.22821 2026-05-22 cs.CL cs.LG Version update

Tokenisation via Convex Relaxations

Jan Tempus, Philip Whittington, Craig W. Schmidt, Dennis Komm, Tiago Pimentel

Institutions: ETH Zurich; Kensho Technologies

AI summary: This paper proposes ConvexTok, a tokenisation method based on convex relaxations. It formulates tokeniser construction as a linear program and solves it with convex optimisation tools, improving intrinsic tokenisation metrics and language-model bits-per-byte, as well as downstream task performance.

Abstract

Tokenisation is an integral part of the current NLP pipeline. Current tokenisation algorithms such as BPE and Unigram are greedy algorithms -- they make locally optimal decisions without considering the resulting vocabulary as a whole. We instead formulate tokeniser construction as a linear program and solve it using convex optimisation tools, yielding a new algorithm we call ConvexTok. We find ConvexTok consistently improves intrinsic tokenisation metrics and the bits-per-byte (BpB) achieved by language models; it also improves downstream task performance, but less consistently. Furthermore, ConvexTok allows the user to certify how far their tokeniser is from optimal, with respect to a certain objective, via a lower bound, and we empirically find it to be within 1% of optimal at common vocabulary sizes.
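The certification idea in this abstract can be illustrated with a small sketch. It assumes a minimisation objective (for example, the number of tokens needed to encode a corpus, which is an assumption here and not a detail stated in the abstract). Under that assumption, any lower bound obtained from a relaxation gives an upper bound on the relative gap to the optimum. The function name and the numbers are hypothetical.

```python
def relative_gap_bound(achieved: float, lower_bound: float) -> float:
    """Upper bound on (achieved - optimum) / optimum for a minimisation problem.

    Because lower_bound <= optimum <= achieved, the true relative gap is at
    most (achieved - lower_bound) / lower_bound.
    """
    if lower_bound <= 0 or achieved < lower_bound:
        raise ValueError("need 0 < lower_bound <= achieved")
    return (achieved - lower_bound) / lower_bound


# Hypothetical numbers: a tokeniser encodes a corpus in 1,005,000 tokens and a
# relaxation certifies that no vocabulary can do better than 1,000,000 tokens.
gap = relative_gap_bound(1_005_000, 1_000_000)
print(f"certified to be within {gap:.2%} of optimal")
```

The bound is conservative: the tokeniser may be closer to optimal than the certificate says, but never further away.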

2605.22817 2026-05-22 cs.LG cs.AI cs.CL cs.NE Version update

Vector Policy Optimization: Training for Diversity Improves Test-Time Search

Ryan Bahlous-Boldi, Isha Puri, Idan Shenfeld, Akarsh Kumar, Mehul Damani, Sebastian Risi, Omar Khattab, Zhang-Wei Hong, Pulkit Agrawal

Institutions: MIT; Improbable AI Lab; MIT-IBM Computing Research Lab; Sakana AI

AI summary: This paper proposes Vector Policy Optimization (VPO), which trains policies to anticipate diverse downstream reward functions and thereby produce diverse solutions, improving test-time search performance.

Comments: 24 pages

Abstract

Language models must now generalize out of the box to novel environments and work inside inference-scaling search procedures, such as AlphaEvolve, that select rollouts with a variety of task-specific reward functions. Unfortunately, the standard paradigm of LLM post-training optimizes a pre-specified scalar reward, often leading current LLMs to produce low-entropy response distributions and thus to struggle at displaying the diversity that inference-time search will require. We propose Vector Policy Optimization (VPO), an RL algorithm that explicitly trains policies to anticipate diverse downstream reward functions and to produce diverse solutions. VPO exploits that rewards are often vector-valued in practice, like per-test-case correctness in code generation or, say, multiple different user personas or reward models. VPO is essentially a drop-in replacement for the GRPO advantage estimator, but it trains the LLM to output a set of solutions where individual solutions specialize to different trade-offs in the vector reward space. Across four tasks, VPO matches or beats the strongest scalar RL baselines on test-time search (e.g. pass@k and best@k), with the gap widening as the search budget grows. For evolutionary search, VPO models unlock problems that GRPO models cannot solve at all. As test-time search becomes more standardized, optimizing for diversity may need to become the default post-training objective.
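For context, the scalar GRPO advantage estimator that VPO is described as replacing standardises each rollout's reward within its sampling group. The sketch below shows only that scalar baseline. The vector-valued estimator used by VPO is not specified in the abstract and is not reproduced here. The use of the population standard deviation and the small `eps` term are assumptions made for this illustration.

```python
from statistics import fmean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: standardise scalar rewards within one group.

    `eps` (an assumption of this sketch) avoids division by zero when every
    rollout in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Four rollouts for one prompt with binary scalar rewards (hypothetical values).
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
print(adv)

# If all rollouts score the same, every advantage is zero and the group
# contributes no learning signal.
flat = grpo_advantages([1.0, 1.0, 1.0])
print(flat)
```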

2605.22785 2026-05-22 cs.CL Version update

Evaluating Commercial AI Chatbots as News Intermediaries

Mirac Suzgun, Emily Shen, Federico Bianchi, Alexander Spangher, Thomas Icard, Daniel E. Ho, Dan Jurafsky, James Zou

Institutions: Stanford University; Independent Researcher; Together AI

AI summary: This study evaluates how accurately AI chatbots handle emerging facts across languages and regions. It finds that they perform well on multiple-choice questions but make substantial errors on free-response and complex questions, revealing regional inequity and dependence on retrieval infrastructure.

Comments: https://suzgunmirac.github.io/ai-news-preview/

Abstract

AI chatbots are rapidly shaping how people encounter the news, yet no prior study has systematically measured how accurately these systems, with their proprietary search integrations and retrieval-synthesis pipelines, handle emerging facts across languages and regions. We present a 14-day (February 9-22, 2026) evaluation of six AI chatbots (Gemini 3 Flash and Pro, Grok 4, Claude 4.5 Sonnet, GPT-5 and GPT-4o mini) on 2,100 factual questions derived from same-day BBC News reporting across six regional services (US & Canada, Arabic, Afrique, Hindi, Russian, Turkish). The best systems achieve over 90% multiple-choice accuracy on questions about events reported hours earlier. The same systems, however, lose 11-13% under free-response evaluation, and 16-17% across the cohort. We further characterize three failure patterns. First, every model achieves its lowest accuracy on Hindi (79% vs. 89-91% elsewhere) and citations indicate an Anglophone retrieval bias (e.g., models answering Hindi queries cite English Wikipedia more than any Hindi outlet). Second, retrieval, not reasoning, failures drive over 70% of all errors. When models retrieve a correct source, they often extract the correct answer; the problem is to land on the right source in the first place. Third, models achieving 88-96% accuracy on well-formed questions drop to 19-70% when questions contain subtle false premises, with the most vulnerable model accepting fabricated facts 64% of the time. We also identify a detection-accuracy paradox: the best false-premise detector ranks second in adversarial accuracy (abstention rate), while a weaker detector ranks first, showing that premise detection and answer recovery are partially independent capabilities. Overall, these suggest that high accuracy can mask systematic regional inequity, near-total dependence on retrieval infrastructure, and vulnerability to imperfect queries real users pose.

2605.22734 2026-05-22 cs.CL Version update

ChronoMedKG: A Temporally-Grounded Biomedical Knowledge Graph and Benchmark for Clinical Reasoning

Md Shamim Ahmed, Farzaneh Firoozbakht, Lukas Galke Poech, Jan Baumbach, Richard Röttger

Institutions: University of Southern Denmark; University of Hamburg

AI summary: This paper presents ChronoMedKG, a biomedical knowledge graph of 460,497 evidence-linked triples covering 13,431 diseases. It grounds clinical reasoning in time through temporal components such as onset window or progression stage, and introduces the ChronoTQA benchmark to validate its usefulness on temporal reasoning tasks.

Comments: 9 pages main text plus appendices, 8 figures. Dataset and benchmark paper. ChronoMedKG released under CC BY 4.0 and ChronoTQA/code under MIT (Zenodo: 10.5281/zenodo.19697542). Under review

Abstract

Biomedical knowledge graphs (KGs) treat disease associations as static facts, but temporal information is crucial for clinical reasoning, e.g., a symptom diagnostic of one disease at age 3 may imply a different disease at age 13. Existing KGs such as PrimeKG, Hetionet, and iKraph do not encode when a finding becomes clinically relevant over the course of a disease. This limits their usefulness for longitudinal clinical reasoning and retrieval augmentation. We introduce ChronoMedKG, a temporal biomedical knowledge graph that contains 460,497 evidence-linked triples (filtered from 13M raw extractions) covering 13,431 diseases. Each association is tied to temporal components like onset window or progression stage, which are backed by PMID-traceable evidence and a multi-signal credibility score. The graph is constructed through a disease-autonomous multi-agent pipeline in which multiple frontier LLMs independently extract knowledge from PubMed and PMC literature. Only relations that are supported by multi-model consensus and that survive both credibility filtering and ontology alignment are kept. ChronoMedKG scored 92.7% agreement against Orphadata and adds temporal grounding for 6,250 diseases absent from HPOA, Orphadata, and Phenopackets, including 1,657 Orphanet-coded rare diseases. We further introduce ChronoTQA, a benchmark of 3,341 questions across eight task types (six temporal plus two static controls), with a 12-question supplementary probe. Frontier LLMs lose roughly 30 points moving from static to temporal questions; ChronoMedKG retrieval rescues 47-65% of their long-tail failures, against 17-29% for HPOA-RAG. As such, ChronoMedKG provides a crucial temporal axis for retrieval-augmented clinical systems that was previously absent.

2605.22732 2026-05-22 cs.AI cs.CL cs.HC cs.SD eess.AS Version update

Beyond Acoustic Emotion Recognition: Multimodal Pathos Analysis in Political Speech Using LLM-Based and Acoustic Emotion Models

Juergen Dietrich

Institutions: Democracy Intelligence gGmbH

AI summary: This paper investigates whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak as a case study, it compares three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values come from a post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM that analyses the full speech audio together with its transcript in an open-ended, context-aware way; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations show that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). A systematic evaluation of the Berlin Database of Emotional Speech (EMO-DB) with Gemini in an open-ended annotation paradigm further shows that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. The results suggest that LLM-based multimodal analysis captures semantically defined political emotion better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend the approach to video analysis incorporating facial expression and gaze.

Comments: 13 pages, 1 figure

Abstract

We investigate whether acoustic emotion recognition models can serve as proxies for the Pathos dimension in political speech analysis, as operationalised by the TRUST multi-agent large language model (LLM) pipeline. Using a Bundestag plenary speech by Felix Banaszak (51 segments, 245 s) as a case study, we compare three analysis modalities: (1) emotion2vec_plus_large, an acoustic speech emotion recognition (SER) model whose continuous Arousal and Valence values are derived via post-hoc Russell Circumplex projection; (2) Gemini 2.5 Flash, an LLM analysing the full speech audio together with its transcript in an open-ended, context-aware fashion; and (3) TRUST-Pathos scores from a three-advocate LLM supervisor ensemble. Spearman rank correlations reveal that Gemini Valence correlates strongly with TRUST-Pathos (rho = +0.664, p < 0.001), whereas emotion2vec Valence does not (rho = +0.097, p = 0.499). We further demonstrate, via a systematic quality evaluation of the Berlin Database of Emotional Speech (EMO-DB) using Gemini in an open-ended annotation paradigm, that standard SER benchmark corpora suffer from acted speech, cultural bias, and category incompatibility. Our results suggest that LLM-based multimodal analysis captures semantically defined political emotion substantially better than acoustic models alone, while acoustic features remain informative for low-level Arousal estimation. Future work will extend this approach to video-based analysis incorporating facial expression and gaze.
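The Spearman rank correlation reported in this abstract is the Pearson correlation computed on ranks. The sketch below is a minimal pure-Python version with average ranks for ties. It does not compute p-values, and the per-segment scores are hypothetical values, not data from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # sorted positions i..j share this rank
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-segment scores, for illustration only.
valence = [0.10, 0.40, 0.35, 0.80, 0.60]
pathos = [1, 3, 2, 5, 4]
rho = spearman_rho(valence, pathos)
print(rho)
```

Because only ranks enter the computation, the statistic is insensitive to the different scales of the two score types, which is why it suits comparing an acoustic valence estimate with an LLM-derived score.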

2605.22675 2026-05-22 cs.CL Version update

Self-Policy Distillation via Capability-Selective Subspace Projection

Guangya Hao, Yitong Shang, Yunbo Long, Zhuokai Zhao, Hanxue Liang

Institutions: University of Cambridge; HKUST; University of Chicago

AI summary: This paper proposes Self-Policy Distillation (SPD), which extracts a low-dimensional capability subspace from the model's own gradients, projects key-value (KV) activations into that subspace, and fine-tunes with the standard next-token prediction loss, achieving generalizable, capability-selective self-distillation without any external signal.

Abstract

Self-distillation bootstraps large language models (LLMs) by training on their own generations. However, existing methods either rely on external signals to curate self-generated outputs (e.g., correctness filtering, execution feedback, and reward search), which are costly and unavailable for the best-performing frontier models, or skip curation entirely and train on all raw outputs, an approach that is often domain-specific and hard to generalize. Both also share a deeper weakness: self-generated outputs entangle task-relevant capability with other factors, such as stylistic patterns, formatting artifacts, and model-specific errors, diluting the signal for the specific capability one aims to improve. In this paper, we propose Self-Policy Distillation (SPD), which achieves generalizable, capability-selective self-distillation without any external signal. Specifically, SPD extracts a low-rank capability subspace from the model's own gradients on correctness-defining tokens, projects key-value (KV) activations into this subspace during self-generation, and fine-tunes on the resulting raw outputs with standard next-token prediction loss. Through extensive experiments across code generation, mathematical reasoning, and multiple-choice QA, we show that SPD achieves up to 13% improvement over state-of-the-art self-distillation methods without external signals and up to 16% improvement over pre-trained baselines. Notably, SPD demonstrates superior generalizability, achieving 15% better performance under out-of-domain generalization settings.

2605.22660 2026-05-22 cs.CL cs.AI Version update

Moral Semantics Survive Machine Translation: Cross-Lingual Evidence from Moral Foundations Corpora

Maciej Skorski

Institutions: University of Luxembourg

AI summary: This study examines whether LLM-based translation can bridge the gap left by language-specific annotated corpora in moral values classification. Using Polish as a case study, it shows that direct translation preserves subtle moral cues well, offering a viable path for moral values research in under-resourced languages.

Abstract

Moral language is subtle and culturally variable, making it difficult to translate faithfully across languages. Idiomatic expressions, slang, and cultural references introduce hard-to-avoid translation artifacts. Yet automated moral values classification depends on language-specific annotated corpora that exist almost exclusively in English. We investigate whether LLM-based translation can bridge this gap, taking Polish as a test case. Using ~50k morally-annotated social media posts from a diverse range of topics, we apply a principled four-method validation pipeline: LaBSE cross-lingual embedding similarity, Centered Kernel Alignment (CKA), LLM-as-judge evaluation, and deep learning classifier parity tests. We show that despite shortcomings in handling slang, vulgarity, and culturally-loaded expressions, direct translation preserves subtle moral cues well enough to be harvested by cross-lingual machine learning -- with mean cosine similarity of 0.86 and AUC gaps of 0.01-0.02 across all foundations, closing further under fine-tuning of language models. These results demonstrate that machine translation is a practical and cost-effective path to moral values research in languages currently under-resourced in this domain. We demonstrate this for Polish as a representative Slavic language, with expected generalisation to related languages.
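The mean cosine similarity mentioned in this abstract compares the embedding of each source post with the embedding of its translation and averages over pairs. The sketch below shows that computation on toy two-dimensional vectors. In practice the vectors would come from a cross-lingual encoder such as LaBSE; the toy values and function names here are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pairwise_cosine(source_embs, translated_embs):
    """Average cosine similarity over aligned (source, translation) pairs."""
    sims = [cosine_similarity(u, v) for u, v in zip(source_embs, translated_embs)]
    return sum(sims) / len(sims)


# Toy embeddings standing in for encoder outputs (illustrative values only).
english = [[1.0, 0.0], [0.0, 2.0]]
polish = [[1.0, 0.0], [1.0, 1.0]]
score = mean_pairwise_cosine(english, polish)
print(round(score, 4))
```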

2605.22654 2026-05-22 cs.CL cs.CV Version update

Seeing the Poem: Image-Semantic Detection of AI-Generated Modern Chinese Poetry with MLLMs

Shanshan Wang, Fengying Ye, Hanjia Lyu, Caiwen Gou, Junchao Wu, Jingming Yao, Chengzhong Xu, Jiebo Luo, Derek F. Wong

Institutions: Department of Computer and Information Science, University of Macau; University of Rochester; Sichuan University; Department of Portuguese, Faculty of Arts and Humanities, University of Macau

AI summary: This paper proposes an image-semantic guided poetry detection method that combines image content with the poem text to improve LLM performance in detecting AI-generated modern Chinese poetry. Experiments show that it outperforms traditional methods on multiple datasets.

Abstract

Previous detection studies have shown that LLMs cannot be effectively used as detectors, but these studies have not addressed modern Chinese poetry. Moreover, no relevant research has explored the performance of LLMs in detecting modern Chinese poetry. This paper evaluates and enhances the performance of LLMs as detectors for modern Chinese poetry, and proposes an image-semantic guided poetry detection method. Compared with traditional detection approaches, our method innovatively incorporates images that reflect the content of the poetry. Through example-driven approaches, our method effectively integrates information such as meaning, imagery, and feeling from the image, then forms a complementary judgment with the poem text. Experimental results demonstrate that the LLM detectors based on our method outperform baseline detectors based on plain text, and even surpass the best-performing traditional detector, RoBERTa. The Gemini detector using our method achieves a Macro-F1 score of 85.65%, reaching the state-of-the-art level. The performance improvements of different LLM detectors on multiple LLMs-generated data prove the effectiveness of our method.

2605.22650 2026-05-22 cs.CL Version update

Whose Voice Counts? Mapping Stakeholder Perspectives on AI Through Public Submissions to the U.S. Government

Alina Karakanta, Alex Christiansen, Tomás Dodds, Bissie Anderson, Matteo Fuoli, Marcus Perlman, Aletta G. Dorst

Institutions: Leiden University Centre for Linguistics, Leiden University; Department of Linguistics and Communication, University of Birmingham; School of Journalism and Mass Communication, University of Wisconsin-Madison

AI summary: This paper analyses letters submitted during the public consultation for the US government's AI Action Plan to examine how different stakeholders view AI. It finds that individuals are more concerned with AI's impact on life, whereas other groups focus on AI development, and that the AI Action Plan mainly reflects private-sector concerns.

Abstract

As artificial intelligence (AI) systems become more common in our daily lives, it is important to understand how different stakeholders comprehend and envisage the role that these technologies play in shaping social, political, and economic realities. In this paper, we investigate public perceptions of AI based on a corpus of letters submitted during the public consultation for the Trump Administration's US AI Action Plan. To this aim, we release a corpus cleaning pipeline and perform topic modelling and frequency analysis to explore predominant topics discussed by different subgroups (e.g., academia, individuals, private sector) and those appearing in the AI Action Plan. Our results show that individuals voice strong concerns related to the impact of AI on life, while other stakeholders are more concerned with AI development. Our comparison of topics suggests that the AI Action Plan reflects predominantly the concerns of the private sector on security, policies, and development, with individuals' concerns less represented.

2605.22620 2026-05-22 cs.LG cs.CL Version update

Two is better than one: A Collapse-free Multi-Reward RLIF Training Framework

Shourov Joarder, Diganta Sikdar, Ahsan Habib Akash, Binod Bhattarai, Prashnna Gyawali

Institutions: Bangladesh University of Engineering and Technology; West Virginia University; University of Aberdeen; Fogsphere (Redev.AI Ltd, UK); University College London

AI summary: This paper proposes a multi-reward RLIF framework that decomposes the training signal into an answer-level reward and a completion-level reward and combines GDPO-based normalization with KL-Cov regularization, improving stability and robustness while approaching the performance of supervised RLVR methods on mathematical reasoning and code-generation tasks.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has substantially improved the reasoning ability of LLMs, but often depends on external supervision from human annotations or gold-standard solutions. Reinforcement learning from internal feedback (RLIF) has recently emerged as a scalable unsupervised alternative, using signals extracted from the model itself. However, existing RLIF methods typically rely on a single internal reward, which can lead to reward hacking, entropy collapse, and degraded reasoning structure. We propose a multi-reward RLIF framework that decomposes the training signal into two complementary components: an answer-level reward based on cluster voting and a completion-level reward based on token-wise self-certainty. To combine these signals robustly, we apply GDPO-based normalization to reduce reward-scale imbalance. We further introduce KL-Cov regularization, which targets low-entropy token distributions responsible for disproportionate entropy reduction, preserving exploration and preventing late-stage collapse. Across mathematical reasoning and code-generation benchmarks, our method improves stability and robustness over prior unsupervised RL approaches, while achieving performance close to supervised RLVR methods. These results show that complementary internal rewards, combined with targeted regularization, can support stable long-horizon reasoning without relying on external ground-truth supervision. Code will be released soon.
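An answer-level reward based on voting can be sketched as follows. The abstract does not define how answers are clustered or how the reward is scaled, so this illustration assumes exact-match clusters after whitespace stripping and a binary reward for membership in the largest cluster. Both choices, and the function name, are assumptions and may differ from the paper's method.

```python
from collections import Counter


def cluster_vote_rewards(answers):
    """Give reward 1.0 to answers in the largest exact-match cluster, else 0.0.

    Assumption of this sketch: two answers belong to the same cluster only if
    they are identical after stripping surrounding whitespace.
    """
    normalised = [a.strip() for a in answers]
    majority, _count = Counter(normalised).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in normalised]


# Final answers extracted from five sampled completions (hypothetical values).
rewards = cluster_vote_rewards(["42", " 42", "41", "42", "7"])
print(rewards)
```

No ground-truth label is used: the reward comes only from agreement among the model's own samples, which is what makes this kind of signal usable without external supervision.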

2605.22608 2026-05-22 cs.CL cs.AI Version update

Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Asaf Yehudai, Lilach Eden, Michal Shmueli-Scheuer

Institutions: IBM Research

AI summary: This study proposes Agentic CLEAR, a framework that automates the evaluation of LLM agents through multi-level, fine-grained analysis, providing high-quality, data-driven feedback and predicting task success rate.

Comments: ACL

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited, focusing on observability with basic evaluation capabilities or imposing static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into the agent behavior on three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, enabling seamless integration and featuring an intuitive UI that makes agent evaluation highly accessible. In our experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.

2605.22579 2026-05-22 cs.CL cs.AI stat.ML Version update

Beyond Temperature: Hyperfitting as a Late-Stage Geometric Expansion

Meimingwei Li, Yuanhao Ding, Esteban Garces Arias, Christian Heumann

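Institutions: Department of Statistics, LMU Munich; Munich Center for Machine Learning (MCML); School of Computer and Information Engineering, Henan University

AI summary: This paper studies the hyperfitting phenomenon and finds that it differs from distribution sharpening. Experiments show that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism and that a "Terminal Expansion" in the final transformer block produces a geometric expansion of the feature space. The paper also proposes Late-Stage LoRA to improve generation quality.

Comments: Accepted at ICML 2026

Abstract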

Recent work has identified a counterintuitive phenomenon termed "Hyperfitting", where fine-tuning Large Language Models (LLMs) to near-zero training loss on small datasets surprisingly enhances open-ended generation quality and mitigates repetition in greedy decoding. While effective, the underlying mechanism remains poorly understood, with the extremely low-entropy output distributions suggesting a potential equivalence to simple temperature scaling. In this work, we demonstrate that this phenomenon is fundamentally distinct from distribution sharpening; entropy-matched control experiments reveal that temperature scaling fails to replicate the diversity gains of hyperfitting. Furthermore, we falsify the hypothesis of static vocabulary reweighting, showing through ablation studies that hyperfitting relies on a dynamic, context-dependent rank reordering mechanism. Layer-wise analysis localizes this effect to a "Terminal Expansion" in the final transformer block, where a substantial geometric expansion of the feature space (ΔDim ≈ +80.8) facilitates the promotion of deep-tail tokens. Additionally, we introduce Late-Stage LoRA, a targeted fine-tuning strategy that updates only the final 5 layers, yielding robust generation with minimal parameter updates.
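The distinction between temperature scaling and rank reordering can be made concrete. Dividing logits by a positive temperature lowers or raises the entropy of the softmax distribution, but it is a monotone transformation, so the order of tokens by probability never changes. The sketch below checks both properties on hypothetical logits; it illustrates temperature scaling only and is not the paper's experimental setup.

```python
import math


def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a positive temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def ranking(probs):
    """Token indices sorted from most to least probable."""
    return sorted(range(len(probs)), key=lambda i: -probs[i])


logits = [2.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
base = softmax(logits, temperature=1.0)
sharp = softmax(logits, temperature=0.25)

print(entropy(base), entropy(sharp))    # sharpening lowers the entropy
print(ranking(base) == ranking(sharp))  # but the token order is unchanged
```

Any mechanism that promotes a low-ranked token above a higher-ranked one therefore cannot be reproduced by choosing a different temperature.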

2605.22567 2026-05-22 cs.CL Version update

LANG: Reinforcement Learning for Multilingual Reasoning with Language-Adaptive Hint Guidance

Yuchun Fan, Bei Li, Peiguang Li, Yilin Wang, Yongyu Mu, Jian Yang, Xin Chen, Rongxiang Weng, Jingang Wang, Xunliang Cai, Jingbo Zhu, Tong Xiao

Institutions: NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; Meituan Inc.; NiuTrans Research, Shenyang, China

AI summary: This paper proposes LANG, a framework that uses language-conditioned hints to guide exploration in non-English reasoning tasks. It addresses the trade-off between input-language consistency and reasoning quality in multilingual reinforcement learning, improving reasoning performance without compromising language consistency.

Comments: Accepted to ACL 2026 (main conference)

Abstract

Reinforcement learning has proven effective for enhancing multi-step reasoning in large language models (LLMs), yet its benefits have not fully translated to multilingual contexts. Existing methods struggle with a fundamental trade-off: prioritizing input-language consistency severely hampers reasoning quality, while prioritizing reasoning often leads to unintended language drift toward English. We address this challenge with LANG, a novel framework that leverages language-conditioned hints to guide exploration in non-English reasoning tasks. Our method incorporates two key mechanisms to prevent dependency on these hints: a progressive decay schedule that gradually withdraws scaffolding, and a language-adaptive switch that tailors learning horizons to specific language difficulties. Empirical results on challenging multilingual mathematical benchmarks reveal that LANG substantially enhances reasoning performance without compromising language consistency. Moreover, we show that our framework generalizes beyond mathematics, fostering more consistent language alignment across model layers.

2605.22564 2026-05-22 cs.CL cs.LG cs.SE Version update

SynAE: A Framework for Measuring the Quality of Synthetic Data for Tool-Calling Agent Evaluations

Shuaiqi Wang, Aadyaa Maddi, Zinan Lin, Giulia Fanti

Institutions: Carnegie Mellon University; Microsoft Research

AI summary: This paper presents SynAE, a framework for assessing the quality of synthetic data for multi-turn, tool-calling agents. It evaluates the validity, fidelity, and diversity of synthetic data across four metric categories and shows that no single metric is sufficient to fully characterize synthetic data quality.

Abstract

Today, tool-calling agents are commonly evaluated or tested on static datasets of execution traces, including input commands, agent responses, and associated tool calls. However, internal production datasets are often insufficient or unusable for testing; for example, they may contain sensitive or proprietary data, or they may be too sparse to support comprehensive testing (especially pre-deployment). In these settings, practitioners are increasingly replacing or augmenting real datasets with synthetic ones for evaluation purposes. A key challenge is quantifying the relation between these synthetic datasets and the real data. We introduce SynAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate and augment the characteristics of real data trajectories. SynAE assesses the validity, fidelity, and diversity of synthetic data across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate SynAE using recent agent benchmarks and test common synthetic data failure modes via realistic and controlled generation schemes. SynAE detects fine-grained variations in data validity, fidelity and diversity, and shows that no single metric is sufficient to fully characterize synthetic data quality, motivating a multi-axis evaluation of synthetic data for agent testing. A demo of SynAE is available at https://synae-2026-synae-demo.static.hf.space/index.html, with code at https://github.com/wsqwsq/SynAE.

2605.22544 2026-05-22 cs.CL cs.IR 版本更新

One prompt is not enough: Instruction Sensitivity Undermines Embedding Model Evaluation

一个提示不够:指令敏感性削弱了嵌入模型评估

Yevhen Kostiuk, Kenneth Enevoldsen

发表机构 * Aarhus University(奥胡斯大学)

AI总结 本文研究了单提示评估在指令调优嵌入模型中的不足,发现默认提示可能系统性低估或高估性能,并指出排行榜对提示选择不鲁棒,建议通过多提示评估或报告敏感性来改进基准测试。

详情
AI中文摘要

指令嵌入模型已成为最先进模型中的常见选择,但通常仅使用单个提示进行评估。单点评估忽略了指令方法的主要问题,即对指令措辞的敏感性。我们对6个嵌入模型、11个数据集和每个数据集15个任务特定提示进行了实证研究,共990个案例。我们发现报告的分数无法代表在合理提示下的分数分布。默认提示既可能系统性低估也可能高估性能。此外,我们发现排行榜对提示选择不鲁棒:通过选择有利的提示,研究中的任何模型都可以被提升到首位。我们的发现表明,单提示评估不足以评估指令调优的嵌入模型,基准测试应纳入提示鲁棒性,通过多提示评估或报告敏感性来改进。

英文摘要

Instruction embedding models have become common among state-of-the-art models, yet they are evaluated using a single prompt per task. This single-point evaluation ignores a main problem of the instruction-based approach, namely sensitivity to the phrasing of the instruction. We present an empirical study of prompt sensitivity across 6 embedding models, 11 datasets, and 15 task-specific prompts per dataset, a total of 990 cases. We show that reported scores misrepresent the distribution of scores over plausible prompts. The default prompt can systematically either understate or overstate performance. Furthermore, we show that the leaderboard ranking is not robust to prompt selection: by choosing prompts favorably, any model in our study can be promoted to first place. Our findings suggest that single-prompt evaluation is insufficient for instruction-tuned embedding models and that benchmarks should incorporate prompt robustness, either by evaluating over multiple prompts or by reporting sensitivity alongside point estimates.
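The claim that prompt choice can reorder a leaderboard can be illustrated with a small sketch. The score table below is invented for illustration and is not taken from the paper; `score_spread` and `best_case_rank` are hypothetical helper names, and the "best prompt for one model, worst prompt for its rivals" reading of favorable prompt selection is an assumption.

```python
# Study size stated in the abstract: models x datasets x prompts per dataset.
N_MODELS, N_DATASETS, N_PROMPTS = 6, 11, 15
N_CASES = N_MODELS * N_DATASETS * N_PROMPTS  # 6 * 11 * 15 = 990


def score_spread(prompt_scores):
    """Range of scores a single model obtains across prompts."""
    return max(prompt_scores) - min(prompt_scores)


def best_case_rank(scores, model):
    """Rank of `model` when it gets its best prompt and rivals get their worst.

    This is one assumed reading of "choosing prompts favorably".
    """
    own_best = max(scores[model])
    rivals_worst = [min(v) for name, v in scores.items() if name != model]
    return 1 + sum(r > own_best for r in rivals_worst)


# Hypothetical per-prompt scores on one dataset (illustrative only).
scores = {
    "model_a": [0.61, 0.66, 0.70],
    "model_b": [0.63, 0.67, 0.69],
    "model_c": [0.60, 0.64, 0.71],
}

for name in scores:
    print(name, round(score_spread(scores[name]), 2), best_case_rank(scores, name))
```

With these made-up numbers every model can reach rank 1 under some prompt assignment, which is the kind of instability that reporting a spread alongside the point estimate would expose.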

2605.22536 2026-05-22 cs.CV cs.CL 版本更新

SpaceDG: Benchmarking Spatial Intelligence under Visual Degradation

SpaceDG: 在视觉退化下评估空间智能的基准测试

Xiaolong Zhou, Yifei Liu, Ziyang Gong, Jiarui Li, Qiyue Zhao, Muyao Niu, Yuanyuan Gao, Le Ma, Xue Yang, Hongjie Zhang, Zhihang Zhong

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Electronic Science and Technology of China(电子科技大学) Chongqing University(重庆大学) The University of Tokyo(东京大学) Beihang University(北京航空航天大学) Northwestern Polytechnical University(西北工业大学)

AI总结 本文提出SpaceDG,首个针对退化感知空间理解的大型数据集,通过物理基础的退化合成引擎生成9种退化类型,评估多模态大语言模型在视觉退化下的空间推理能力,并展示在退化条件下微调可提升模型鲁棒性。

详情
AI中文摘要

多模态大语言模型(MLLMs)在空间智能方面取得了快速进展,但现有空间推理基准大多假设纯净的视觉输入,忽略了现实部署中常见的退化现象,如运动模糊、低光照、恶劣天气、镜头畸变和压缩伪影。这提出了一个根本性问题:当前MLLMs在视觉观察不完美时的空间智能鲁棒性如何?为回答这个问题,我们引入SpaceDG,首个大规模退化感知空间理解数据集。它通过物理基础的退化合成引擎将退化形成过程嵌入3D高斯点散布(3DGS)渲染,能够真实模拟九种退化类型。所生成的数据集包含约100万对QA问题,来自近1000个室内场景。我们进一步引入SpaceDG-Bench,一个经人类验证的基准,包含11种推理类别和9种视觉退化类型的1102个问题,产生超过10000个VQA实例。评估25个开源和闭源MLLMs发现,视觉退化一致且显著损害空间推理能力,暴露出关键的鲁棒性差距。最后,我们展示在SpaceDG上微调可显著提高退化鲁棒性,并且在退化条件下甚至可以超越人类性能,而不会在清晰图像上造成性能下降,突显了退化感知训练在鲁棒空间智能方面的潜力。

英文摘要

Multimodal Large Language Models (MLLMs) have made rapid progress in spatial intelligence, yet existing spatial reasoning benchmarks largely assume pristine visual inputs and overlook the degradations that commonly occur in real-world deployment, such as motion blur, low light, adverse weather, lens distortion, and compression artifacts. This raises a fundamental question: how robust is the spatial intelligence of current MLLMs when visual observations are imperfect? To answer this question, we introduce SpaceDG, the first large-scale dataset for degradation-aware spatial understanding. It is constructed with a physically grounded degradation synthesis engine that embeds the degradation formation process into 3D Gaussian Splatting (3DGS) rendering, enabling realistic simulation of nine degradation types. The resulting dataset contains approximately 1M QA pairs from nearly 1,000 indoor scenes. We further introduce SpaceDG-Bench, a human-verified benchmark with 1,102 questions spanning 11 reasoning categories and 9 visual degradation types, yielding over 10K VQA instances. Evaluating 25 open- and closed-source MLLMs reveals that visual degradations consistently and substantially impair spatial reasoning, exposing a critical robustness gap. Finally, we show that finetuning on SpaceDG markedly improves degradation robustness and can even surpass human performance under degraded conditions without any performance drop on clean images, highlighting the promise of degradation-aware training for robust spatial intelligence.

2605.22501 2026-05-22 cs.CL cs.AI cs.IR 版本更新

BeLink: Biomedical Entity Linking Meets Generative Re-Ranking

BeLink: 生物医学实体链接结合生成性重新排序

Darya Shlyk, Stefano Montanelli, Lawrence Hunter

发表机构 * University of Milan(米兰大学) University of Chicago(芝加哥大学)

AI总结 本文提出了一种基于生成模型的重新排序方法,通过指令微调提高生物医学实体链接的效率和准确性,在多个基准测试中实现了3%-24%的链接准确率提升,同时减少了推理时间。

Comments Accepted to ACM SIGIR 2026

详情
AI中文摘要

尽管近年来取得了进展,但使用大语言模型(LLMs)的生物医学实体链接(BEL)仍然计算效率低下,难以在实际场景中部署。在本工作中,我们证明了在BEL流水线的重排序阶段对开源生成模型进行指令微调可以提供有效的解决方案。我们提出了一种集合式(set-wise)指令微调形式,使候选实体的选择既快速又准确。我们的方法在多个BEL基准上表现出色,与最先进的方法相比,链接准确率显著提升(3%-24%),同时减少了推理时间。我们将该生成式重排序器集成到BeLink中,这是一个模块化的端到端系统,面向实际的生物医学实体链接应用。

英文摘要

Despite recent progress, Biomedical Entity Linking (BEL) with large language models (LLMs) remains computationally inefficient and challenging to deploy in practical settings. In this work, we demonstrate that instruction-tuning of open-source generative models can offer an effective solution when applied at the re-ranking stage of the BEL pipeline. We propose a set-wise instruction-tuning formulation that enables fast and accurate candidate selection. Our method demonstrates strong performance on multiple BEL benchmarks, yielding significant improvements in linking accuracy (3%-24%) while reducing inference time compared to the state-of-the-art. We integrate our generative re-ranker into BeLink, a modular, end-to-end system designed for practical real-world BEL applications.

2605.22487 2026-05-22 cs.CL 版本更新

Polite on the Surface, Wrong in Practice: A Curated Dataset for Fixing Honorific Failures in Multilingual Bangla Generation

表面礼貌,实践错误:一个用于修复多语言孟加拉语生成中敬语失误的定制数据集

Md. Asaduzzaman Shuvo, Mahedi Hasan, Md. Tashin Parvez, Azizul Haque Noman, Md. Shafayet Hossain Ovi

发表机构 * United International University(国际大学)

AI总结 本文提出了一个定制数据集BLADE,用于改进多语言孟加拉语生成中敬语处理的准确性,通过系统微调和评估领先的开源架构,如DeepSeek-8B和LLaMA-3.2-3B,以提高结构忠实度和敬语对齐度。

详情
AI中文摘要

近年来,多语言大语言模型(MLLMs)的进展显著增强了跨语言对话能力,但建模文化细腻和上下文依赖的交流仍是一个关键瓶颈。具体而言,现有最先进的模型在处理低资源环境如孟加拉语中的结构变化、地区习语和敬语一致性时存在严重的语用差距。为了解决这一限制,我们引入了一个新的、文化对齐的指令微调数据集和基准框架,即BangLa Application and DialoguE生成-BLADE,包含4,196对精心编纂的交互对。我们利用这一资源系统地微调和评估领先的开源架构,包括DeepSeek-8B和LLaMA-3.2-3B,通过LoRA适配器在4位NormalFloat(NF4)量化框架下进行参数高效微调。我们的实证评估表明,使用我们数据集微调的模型在结构忠实度和敬语对齐方面有显著改进,为弥合低资源多语言文本生成中的语用差距提供了严格基准。代码和数据集:https://github.com/ashuvo25/Bangla_Application_LLM/tree/main

英文摘要

Recent advances in Multilingual Large Language Models (MLLMs) have significantly enhanced cross-lingual conversational capabilities, yet modeling culturally nuanced and context-dependent communication remains a critical bottleneck. Specifically, existing state-of-the-art models exhibit a severe pragmatic gap when handling structural variations, regional idioms, and honorific consistencies in low-resource contexts like Bangla. To address this limitation, we introduce a novel, culturally aligned instruction-tuning dataset and benchmarking framework for BangLa Application and DialoguE generation (BLADE), comprising 4,196 meticulously curated interaction pairs. We leverage this resource to systematically fine-tune and evaluate leading open-weight architectures, including DeepSeek-8B and LLaMA-3.2-3B, utilizing parameter-efficient fine-tuning via LoRA adapters in a 4-bit NormalFloat (NF4) quantization framework. Our empirical evaluations demonstrate that models fine-tuned on our dataset yield substantial improvements in structural fidelity and honorific alignment, providing a rigorous benchmark for bridging pragmatic disparities in low-resource multilingual text generation. Code and dataset: https://github.com/ashuvo25/Bangla_Application_LLM/tree/main

2605.22476 2026-05-22 cs.LG cs.CL 版本更新

Structured-Sparse Attention for Entity Tracking with Subquadratic Sequence Complexity

结构稀疏注意力用于具有次二次序列复杂度的实体跟踪

Hangyue Zhao, Paul Caillon, Erwan Fagnou, Alexandre Allauzen

发表机构 * ESPCI PSL(巴黎高等物理化工学院,巴黎文理研究大学) LAMSADE, Université Paris Dauphine - PSL(巴黎多菲纳大学LAMSADE实验室,巴黎文理研究大学)

AI总结 本文提出了一种结构稀疏注意力机制,用于在长序列中高效维护和更新实体和属性的潜在状态,通过减少计算复杂度提升实体跟踪的效率和准确性。

Comments 12 pages, 1 figure, 9 tables

详情
AI中文摘要

实体跟踪需要在长序列中维护和更新实体和属性的潜在状态。最近的任务专用注意力算子可以通过在单个层内进行多跳状态传播,将深层Transformer堆栈压缩成几层,但其稠密求值的开销仍然很高。我们表明,在这种设定下,学习到的注意力具有很强的结构性:大部分注意力质量集中在局部的块对角邻域,仅有少量跨块残余。利用这一点,我们推导出预解式(resolvent-style)算子的分块求值方法,使块内交互保持精确,并通过一个约化系统传递跨块交互。所得求值对序列长度是次二次的,复杂度为$O(n^{4/3}d)$(当$d\approx n$时为$O(n^{7/3})$)。在受控的跟踪基准上,我们的方法在标准化测量协议下与稠密算子的准确率持平,同时将墙钟时间减少12-29%,并且在精确匹配准确率相当的情况下,比紧凑的稠密Transformer快至多2.4倍。我们还给出了关于块大小和模型容量的消融实验,并指出一个局限:当同时演化的属性数量超过注意力头的数量时,性能会崩溃。

英文摘要

Entity tracking requires maintaining and updating latent states for entities and attributes over long sequences. Recent task-specific attention operators can compress deep Transformer stacks into a few layers by performing multi-hop state propagation within a single layer, but their dense evaluation remains expensive. We show that in this setting, learned attention is strongly structured: most mass concentrates in local block-diagonal neighborhoods with a light cross-block residue. Exploiting this, we derive a blockwise evaluation of a resolvent-style operator that keeps within-block interactions exact and routes cross-block interactions through a reduced system. The resulting evaluation is subquadratic in sequence length $O(n^{4/3}d)$ (and $O(n^{7/3})$ when $d\approx n$). On controlled tracking benchmarks, our method matches the dense operator's accuracy while reducing wall-clock time by $12-29\%$ under a standardized measurement protocol, and is up to $2.4 \times$ faster than a compact dense Transformer at comparable exact-match accuracy. We further provide ablations over block size and model capacity, and identify a limitation: performance collapses when the number of simultaneously evolving properties exceeds the number of attention heads.

2605.22465 2026-05-22 cs.CL 版本更新

In Silico Modeling of the RAMPHO Buffer: Dissociating Informational and Energetic Masking via Phonetic Entropy in Deep Neural Networks

RAMPHO缓冲区的计算机模拟:通过语音熵在深度神经网络中分离信息性和能量性遮蔽

Stefan Bleeck

发表机构 * Institute of Sound and Vibration Research (ISVR), University of Southampton(声学与振动研究所(ISVR),南安普顿大学)

AI总结 本文提出了一种基于wav2vec 2.0的自监督声学模型的计算机模拟,通过语音熵分离信息性和能量性遮蔽,揭示了认知-听觉帕累托优化问题。

详情
AI中文摘要

在多说话者环境中聆听的根本挑战是一个认知瓶颈,语言理解难易度(Ease of Language Understanding, ELU)模型将其定义为RAMPHO情景缓冲区内的失败。当前用于语音增强的深度神经网络仅针对物理声学进行优化,未能考虑信息性遮蔽带来的认知代价。本文利用自监督声学模型(wav2vec 2.0)的逐帧语音熵,对RAMPHO缓冲区进行了计算机模拟。通过在信噪比(SNR)扫描中对比语义完整的干扰源与相位去相关的干扰源(Concentration Shield),我们成功地将信息性干扰的认知代价与能量衰减的物理代价分离开来。该模拟揭示了一个认知-声学帕累托优化问题:破坏干扰源的语义内容可在高SNR下解除信息性遮蔽,但在低SNR下会从根本上削弱时间瞥见(glimpsing)线索。

英文摘要

The fundamental challenge of listening in multi-talker environments is a cognitive bottleneck, defined by the Ease of Language Understanding (ELU) model as a failure within the RAMPHO episodic buffer. Current deep neural networks for speech enhancement optimize purely for physical acoustics, failing to account for the cognitive penalty of informational masking. Here, we present an in silico simulation of the RAMPHO buffer using the frame-by-frame phonetic entropy of a self-supervised acoustic model (wav2vec 2.0). By contrasting a semantically intact distractor with a phase-decorrelated distractor (the Concentration Shield) across a signal-to-noise ratio (SNR) sweep, we successfully dissociate the cognitive penalty of informational distraction from the physical penalty of energetic decay. The simulation reveals a cognitive-acoustic Pareto optimization problem: destroying a distractor's semantic payload provides a release from informational masking at high SNRs, but fundamentally degrades temporal glimpsing cues at low SNRs.

2605.22462 2026-05-22 cs.CL cs.AI 版本更新

From Correlation to Cause: A Five-Stage Methodology for Feature Analysis in Transformer Language Models

从相关性到因果:一种五阶段方法用于Transformer语言模型中的特征分析

Caleb Munigety

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出了一种五阶段方法用于Transformer语言模型中的因果特征分析,并在GPT-2小型模型上端到端地展示了其在间接宾语识别任务中的应用,通过激活补丁恢复经典IOI电路,稀疏自编码器恢复特定名称的特征,因果验证发现这些特征具有特定但部分因果性,鲁棒性测试揭示了检测鲁棒性与因果鲁棒性之间的差距,部署评估显示了最优监控配置带来的成本节省。

详情
AI中文摘要

我们提出了一种五阶段方法用于Transformer语言模型中的因果特征分析(探针设计、特征提取、因果验证、鲁棒性测试和部署集成),并在GPT-2小型模型上端到端地执行了间接宾语识别(IOI)任务。激活补丁恢复了经典的IOI电路(第9层头9单独恢复+1.02)。稀疏自编码器恢复了每名称选择性特征,其效果大小为30到50个激活单元。因果验证发现这些特征具有特定但部分因果性:删除十五个特征后,模型在98%的提示上仍保持准确。两种受NLA启发的评估强化了这一观点:十五个选择性特征仅解释了激活方差的31%,而SAE的解释为99.7%,选择性比率与因果力呈负相关(r = -0.56)。三种分布偏移下的鲁棒性测试发现,电路能够顺利转移,但特征消融效果显著下降,揭示了检测鲁棒性与因果鲁棒性之间的差距。基于成本的部署评估(假设$50/FN,$0.42/FP,2%错误率)发现最优监控配置可使每1000次查询的成本降至$8.96,相比$1000的基准,节省了99.1%。最优组合策略随成本比和基础率变化。各阶段的结合产生了单一阶段无法产生的发现。

英文摘要

We propose a five-stage methodology for causal feature analysis in transformer language models (probe design, feature extraction, causal validation, robustness testing, and deployment integration) and demonstrate it end-to-end on GPT-2 small performing the Indirect Object Identification (IOI) task. Activation patching recovers the canonical IOI circuit (layer-9 head 9 alone gives recovery +1.02). A sparse autoencoder recovers per-name selective features with effect sizes of 30 to 50 activation units. Causal validation finds these features specifically but only partially causal: ablating fifteen of them leaves the model accurate on 98% of prompts. Two NLA-inspired evaluations strengthen this picture: the fifteen selective features explain only 31% of activation variance versus the SAE's 99.7%, and selectivity ratio anticorrelates with causal force (r = -0.56). Robustness testing under three distribution shifts finds that the circuit transfers cleanly but feature ablation effects degrade substantially, exposing a gap between detection robustness and causal robustness. A cost-based deployment evaluation (assumed $50/FN, $0.42/FP, 2% error rate) finds an optimal monitor configuration yielding $8.96 per 1000 queries against a $1000 baseline, a 99.1% saving. Optimal composition strategy varies with cost ratio and base rate. The conjunction of stages produces findings no single stage would.
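The cost figures in the deployment evaluation can be checked with a few lines of arithmetic. This is a sketch under one assumption not spelled out in the abstract: the $1000 baseline corresponds to no monitoring, so that every error in 1000 queries is a false negative. The `expected_cost` helper is a hypothetical name introduced here.

```python
def expected_cost(n_false_negatives, n_false_positives,
                  cost_fn=50.0, cost_fp=0.42):
    """Total cost given miss and false-alarm counts (costs from the abstract)."""
    return n_false_negatives * cost_fn + n_false_positives * cost_fp


QUERIES = 1000
ERROR_RATE = 0.02  # 2% of queries are errors

# Assumed baseline: no monitor, so all 20 errors are missed (false negatives).
baseline = expected_cost(QUERIES * ERROR_RATE, 0)

# Reported cost of the optimal monitor configuration per 1000 queries.
monitored = 8.96

saving = 1 - monitored / baseline
print(f"baseline=${baseline:.2f} saving={saving:.1%}")
# prints: baseline=$1000.00 saving=99.1%
```

Under that assumption the numbers are mutually consistent: 1000 × 0.02 × $50 = $1000, and $8.96 against $1000 is a 99.1% saving.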

2605.22447 2026-05-22 cs.CL 版本更新

Cohesion-6K: An Arabic Dataset for Analyzing Social Cohesion and Conflict in Online Discourse

Cohesion-6K: 一个用于分析在线讨论中社会凝聚力与冲突的阿拉伯语数据集

Aisha Ali Al-Athba, Wajdi Zaghouani

发表机构 * Hamad Bin Khalifa University(哈马德·本·哈利法大学) Northwestern University in Qatar(卡塔尔西北大学)

AI总结 本研究通过Cohesion-6K数据集探讨在线讨论中的社会凝聚力与冲突,采用五类话语分类揭示冲突与凝聚力的动态平衡,并通过公开资源支持未来计算社会科学、数字通信和阿拉伯语自然语言处理的研究。

详情
AI中文摘要

在线话语研究已成为理解社会极化的核心。尽管许多研究聚焦于检测显性毒性,但社会凝聚力的微妙动态,即分裂性叙事与团结性叙事之间的互动,在计算层面仍缺乏充分探索(Bail, 2021; Gonzalez-Bailon and Lelkes, 2023)。本文提出了Cohesion-6K,一个由人工标注并辅以ChatGPT标注的数据集,包含六千条与以色列占领巴勒斯坦相关的阿拉伯语公开Facebook帖子。每条帖子被归入五个话语类别之一,这些类别构成从冲突到凝聚的连续体:冲突、解决、社区参与、支持性互动和共同价值观。标注过程将专家人工判断与经受训标注者核验的模型辅助预标注相结合,取得了较高的标注者间一致性(Cohen's kappa = 0.85)。定量分析揭示了持续存在的参与度差距:冲突导向的帖子获得的用户互动是解决导向帖子的两到四倍(p < 0.01)。这一模式表明,分裂性话语在阿拉伯语社交媒体空间中往往获得不成比例的可见度。Cohesion-6K为研究在线凝聚力和极化提供了一个透明且可复现的资源。数据集、标注指南和预处理代码将以开放许可发布供研究使用,以支持计算社会科学、数字传播和阿拉伯语自然语言处理方面的后续工作。

英文摘要

The study of online discourse has become central to understanding societal polarization. While much research has focused on detecting overt toxicity, the subtle dynamics of social cohesion, meaning the interaction between divisive and unifying narratives, remain computationally underexplored (Bail, 2021; Gonzalez-Bailon and Lelkes, 2023). This paper presents Cohesion-6K, a dataset of six thousand Arabic public Facebook posts related to the Israeli Occupation of Palestine, annotated manually with ChatGPT assistance. Each post is assigned to one of five discourse categories that represent a continuum from conflict to cohesion: Conflict, Resolution, Community Engagement, Supportive Interactions, and Shared Values. The annotation process combines expert human judgment with model-assisted pre-labeling verified by trained annotators, achieving substantial inter-annotator agreement (Cohen's kappa = 0.85). Quantitative analysis reveals a consistent engagement gap, where conflict-oriented posts receive between two and four times more user interaction than resolution-oriented ones (p < 0.01). This pattern illustrates how divisive discourse tends to attract disproportionate visibility in Arabic social media spaces. Cohesion-6K provides a transparent and reproducible resource for the study of online cohesion and polarization. The dataset, annotation guidelines, and preprocessing code will be released for research use under an open license, supporting future work in computational social science, digital communication, and Arabic natural language processing.
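For readers unfamiliar with the agreement statistic quoted above, the following is a minimal sketch of Cohen's kappa for two annotators. The label sequences are invented for illustration and are not drawn from Cohesion-6K; the function name is hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: (observed agreement - chance agreement) / (1 - chance)."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each annotator's marginal label frequencies.
    expected = sum(counts_a[l] * counts_b[l] for l in counts_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used a single identical label
    return (observed - expected) / (1 - expected)


# Hypothetical annotations over four posts (illustrative only).
annotator_1 = ["Conflict", "Conflict", "Resolution", "Resolution"]
annotator_2 = ["Conflict", "Conflict", "Resolution", "Conflict"]
print(cohens_kappa(annotator_1, annotator_2))  # prints 0.5
```

Here the annotators agree on 3 of 4 posts (0.75), chance agreement is 0.5, and kappa is 0.5; a value of 0.85 therefore indicates agreement far above what the label frequencies alone would produce.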

2605.22435 2026-05-22 cs.CL 版本更新

Assisted Counterspeech Writing at the Crossroads of Hate Speech and Misinformation

在仇恨言论与虚假信息交汇处的辅助反制言论写作

Genoveffa Martone, Helena Bonaldi, Marco Guerini

发表机构 * Fondazione Bruno Kessler(布鲁诺·克塞勒基金会) Università Cattolica del Sacro Cuore(天主教圣心大学)

AI总结 本文研究了在仇恨言论和虚假信息并存的情境下,利用大型语言模型辅助专家撰写反制言论(counterspeech)的方法,比较了三种结合事实核查和非政府组织指南的知识驱动生成策略,发现专家编辑能显著提升生成文本的质量,且混合策略最为有效。

详情
AI中文摘要

仇恨言论和虚假信息经常在网上同时出现,加剧偏见和极化。鉴于其规模,使用大型语言模型(LLMs)辅助专家撰写反制言论(counterspeech, CS)受到关注,但以往的研究将这两种现象分开处理。我们通过研究仇恨与虚假信息并存情境下的反制言论生成来填补这一空白。我们测试了三种知识驱动的生成策略:第一,向LLM提供事实核查员的指南和事实核查文章;第二,提供非政府组织的指南和报告;第三,构建一种混合策略,结合来自两方面的指南和文档。23名专家对生成的反制言论进行修订,并通过人工和自动指标对其进行评估。尽管LLM在40%的情况下生成了合格的反制言论,但专家的编辑显著提高了自然性、全面性和对指南的遵循程度。基于编辑后的反制言论,混合策略在众包评估中被证明最为有效,兼具有力的事实纠正、刻板印象缓解和共情式互动。我们发布了一个数据集,包含仇恨性和含虚假信息的言论,并附有经专家验证的反制言论和支撑知识。

英文摘要

Hate speech and misinformation frequently co-occur online, amplifying prejudice and polarization. Given their scale, using Large Language Models (LLMs) to assist expert counterspeech (CS) writing has gained interest, yet prior work has addressed these phenomena separately. We bridge this gap by studying CS generation in contexts where both hate and misinformation co-occur. We test three knowledge-driven generation strategies: first we prompt an LLM with fact-checkers' guidelines and fact-checking articles; secondly, with NGOs' guidelines and reports; thirdly, we create a mixed strategy that combines guidelines and documents from both. 23 experts revise the generated CS, which are assessed via human and automatic metrics. While LLMs produce adequate CS in 40% of cases, expert edits substantially improve naturalness, exhaustiveness, and adherence to guidelines. Based on the post-edited CS, the mixed strategy proves to be the most effective in crowdsourcing evaluation, pairing strong factual correction with stereotype mitigation and empathetic engagement. We release a dataset of hateful and misinformed claims with expert-verified CS and supporting knowledge.

2605.22411 2026-05-22 cs.CL cs.AI cs.LG 版本更新

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

DeferMem: 通过强化学习进行长时记忆问答的查询时证据蒸馏

Jianing Yin, Tan Tang

发表机构 * State Key Lab of CAD&CG(计算机辅助设计与图形学国家重点实验室)

AI总结 本文提出DeferMem,一种长时记忆框架,通过分离问题为高召回候选检索和查询条件证据蒸馏,以提升长时记忆问答的准确性和效率。

Comments 31 pages, 3 figures

详情
AI中文摘要

大型语言模型(LLM)代理在长时记忆问答任务中仍面临挑战,因为答案支持的证据通常分散在长对话历史中并被大量无关内容掩盖。现有记忆系统通常在未来的查询确定之前处理记忆,然后根据相似性而非其对回答查询的效用来检索结果单元。这种工作流程使下游回答者不得不对检索的候选进行去噪并重建查询特定的证据。我们提出了DeferMem,一种长时记忆框架,将该问题分解为高召回候选检索和查询条件证据蒸馏。DeferMem使用轻量级的段链接结构来组织原始历史并在查询时检索广泛的候选。然后,它应用一个通过DistillPO训练的内存蒸馏器,DistillPO是我们用于将高召回但高度嘈杂的候选蒸馏成一组忠实、自包含且查询条件的证据的强化学习算法。DistillPO将检索后的证据蒸馏制定为一个结构化的动作,包括信息选择和证据重写。它通过分解和门控奖励管道和结构对齐优势分配来优化此动作,门控奖励组件从有效性到质量检查,同时在早期暴露任务级别的正确性反馈,并将每个奖励分配给其负责的输出片段。在LoCoMo和LongMemEval-S上,DeferMem在问答准确性和记忆系统效率上超过了强大的基线,在达到最高问答准确度的同时实现了最快的运行时间和零商业API令牌成本的记忆操作。

英文摘要

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

2605.22391 2026-05-22 cs.AI cs.CL cs.CY 版本更新

Epicure: Navigating the Emergent Geometry of Food Ingredient Embeddings

Epicure:探索食品成分嵌入的涌现几何

Jakub Radzikowski, Josef Chen

AI总结 本文提出Epicure,一组共三个同源的skip-gram食品成分嵌入模型,在多语言食谱语料库上从头训练,覆盖1790个标准成分;三个模型仅在随机游走方案上不同,分别侧重共现、化学或两者的混合。

详情
AI中文摘要

我们提出了Epicure,一种由三个兄弟skip-gram成分嵌入模型组成的家族,这些模型是从多语言食谱语料库中从头开始重新训练的。我们汇总了来自11个来源的414万条食谱,涵盖七种语言:英语、中文、俄语、越南语、西班牙语、土耳其语、印度尼西亚语、德语和印度英语,并通过一个增强语言模型的流程将原始成分字符串标准化为1790个标准条目。一个包含203,508条边的成分-成分NPMI图和一个包含80,019条边的带类型FlavorDB成分-化合物图,以及2,247个带类型化合物节点跨越15个类别,为三种共享架构和超参数但仅在随机游走方案上不同的Metapath2Vec变体提供了基础:Cooc仅在共现图上行走,Chem仅在带类型化合物元路径上行走,Core则通过注入的成分-成分行走进行混合,在可控混合下,将每个模型置于化学与食谱上下文的谱线上不同的位置。

英文摘要

We present Epicure, a family of three sibling skip-gram ingredient embeddings retrained from scratch on a multilingual recipe corpus. We aggregate 4.14M recipes from 11 sources spanning seven languages (English, Chinese, Russian, Vietnamese, Spanish, Turkish, Indonesian, German, and Indian-English) and normalise the raw ingredient strings to 1,790 canonical entries via an LLM-augmented pipeline. A 203,508-edge ingredient-ingredient NPMI graph and an 80,019-edge typed FlavorDB ingredient-compound graph (with 2,247 typed compound nodes across 15 categories) seed three Metapath2Vec variants that share architecture and hyperparameters and differ only in the random-walk schema: Cooc walks the co-occurrence graph only, Chem walks the typed compound metapaths only, and Core blends both via injected ingredient-ingredient walks at controlled mixing, placing each model at a distinct point on the chemistry-vs-recipe-context spectrum.
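The edge weights of the ingredient-ingredient graph are described as NPMI, which is read here as normalized pointwise mutual information (the abstract does not expand the acronym). A minimal sketch of that statistic over recipe co-occurrence counts follows; the function name, the count-based estimator, and the handling of edge cases are assumptions rather than the authors' pipeline.

```python
import math


def npmi(n_xy, n_x, n_y, n_total):
    """Normalized PMI in [-1, 1] from co-occurrence counts.

    n_xy: recipes containing both ingredients
    n_x, n_y: recipes containing each ingredient
    n_total: total number of recipes
    """
    if n_xy == 0:
        return -1.0  # the pair never co-occurs
    p_xy = n_xy / n_total
    if p_xy == 1.0:
        return 0.0  # 0/0 case; treated as 0 in this sketch
    p_x, p_y = n_x / n_total, n_y / n_total
    pmi = math.log(p_xy / (p_x * p_y))
    return pmi / -math.log(p_xy)


# Hypothetical counts: two ingredients that always appear together.
print(round(npmi(10, 10, 10, 100), 6))
# Hypothetical counts: two ingredients that co-occur at chance level.
print(npmi(25, 50, 50, 100))
```

Values near 1 mark pairs that almost always appear together, values near 0 mark pairs that co-occur about as often as chance predicts, and a graph would typically keep only edges above some threshold.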

2605.22389 2026-05-22 cs.CL 版本更新

Unified Data Selection for LLM Reasoning

统一的数据选择用于LLM推理

Xiaoyuan Li, Yubo Ma, Chengpeng Li, Fengbin Zhu, Yiyao Yu, Keqin Bao, Wenjie Wang, Fuli Feng, Dayiheng Liu

发表机构 * University of Science and Technology of China(中国科学技术大学) Alibaba Group(阿里巴巴集团) National University of Singapore(新加坡国立大学)

AI总结 本文提出了一种无需训练的高熵和(HES)指标,用于评估和选择高质量的推理样本,通过在三种主流训练范式(监督微调、拒绝微调和强化学习)中验证,证明了其在提高LLM推理性能和减少计算开销方面的有效性。

Comments Under Review

详情
AI中文摘要

有效训练大型语言模型(LLMs)进行复杂、长链推理通常受到需要大量高质量推理数据的限制。现有方法要么计算成本高,要么无法可靠地区分高质量和低质量的推理样本。为此,我们提出高熵和(HES),一种无需训练的度量标准,通过仅对每个推理样本中熵最高的(例如0.5%)高熵标记进行求和来量化推理质量。我们验证了HES在三种主流训练范式:监督微调(SFT)、拒绝微调(RFT)和强化学习(RL)中的效果,结果表明其效果一致且显著减少了计算开销。在SFT中,使用顶部20%的HES排名数据与完整数据集性能相当,而使用最低HES数据会使其下降。在RFT中,我们的HES基于训练方法显著优于基线方法。在RL中,HES选择的成功轨迹使模型能够学习到强大的推理模式,显著超越其他比较方法。我们的发现确立了HES作为一种稳健、无需训练的度量标准,使开发高级LLM推理能力成为统一、有效和高效的方法。

英文摘要

Effectively training Large Language Models (LLMs) for complex, long-CoT reasoning is often bottlenecked by the need for massive high-quality reasoning data. Existing methods are either computationally expensive or fail to reliably distinguish high- from low-quality reasoning samples. To address this, we propose High-Entropy Sum (HES), a training-free metric that quantifies reasoning quality by summing only the entropy of the top (e.g., 0.5%) highest-entropy tokens in each reasoning sample. We validate HES across three mainstream training paradigms: Supervised Fine-tuning (SFT), Rejection Fine-tuning (RFT), and Reinforcement Learning (RL), with extensive results demonstrating its consistent effectiveness and significantly reduced computational overhead. In SFT, training on the top 20% HES-ranked data matches full-dataset performance, while using the lowest-HES data degrades it. In RFT, our HES-based training approach significantly outperforms baseline methods. In RL, HES-selected successful trajectories enable the model to learn strong reasoning patterns, significantly surpassing other compared methods. Our findings establish HES as a robust, training-free metric that enables a unified, effective, and efficient method for developing advanced reasoning in LLMs.
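The HES definition above (sum the entropy of only the highest-entropy tokens) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: entropies use the natural logarithm, the top fraction is rounded up with at least one token kept, and the function names are hypothetical.

```python
import math


def token_entropy(probs):
    """Shannon entropy (nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def high_entropy_sum(entropies, top_frac=0.005):
    """Sum of the top `top_frac` fraction of per-token entropies."""
    k = max(1, math.ceil(top_frac * len(entropies)))  # assumed rounding rule
    return sum(sorted(entropies, reverse=True)[:k])


# Hypothetical per-token entropies for one short reasoning sample.
sample = [0.1, 0.5, 2.0, 1.0]
print(high_entropy_sum(sample, top_frac=0.5))  # prints 3.0
```

Ranking a dataset then amounts to computing `high_entropy_sum` per sample from a single forward pass and keeping the highest-scoring fraction, such as the top 20% used in the SFT experiment, which is why no additional training is required.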

2605.22380 2026-05-22 cs.CL cs.LG 版本更新

Multi-Stage Training for Abusive Comment Detection in Indic Languages

印度诸语言中辱骂评论检测的多阶段训练

Pranshu Rastogi, Madhav Mathur, Ramaneswaran S, Kshitij Mohan

发表机构 * Department of CSE, JIIT Noida(JIIT诺伊达校区计算机科学与工程系) Department of ICE, NSUT Delhi(NSUT德里ICE系) Department of IT, VIT Vellore(VIT韦洛尔校区信息技术系) Department of CSE, IIIT Delhi(IIIT德里计算机科学与工程系)

AI总结 本文提出了一种多阶段训练方法,通过基于语言的预处理和多个模型的集成,检测印度诸语言中的辱骂评论,并尽量降低误报率,以免损害言论自由。

Comments 4 pages, EAM2021 selected

详情
AI中文摘要

近年来,社交媒体已成为人们交流思想、分享观点和交换信息的重要工具。鉴于其普及性和广泛影响,社交媒体必须保持安全空间。生成在社交媒体上的内容可能具有攻击性,因此检测此类内容变得越来越重要。本文利用基于语言的预处理和多个模型的集成,分析其在辱骂评论检测中的性能。通过广泛实验,我们提出了一条管道,以最小化误报率(将非攻击性内容标记为攻击性),从而使这些系统能够在不损害言论自由的前提下检测攻击性评论。

英文摘要

In recent years, social media has become an increasingly popular tool for communication. People use it to share their ideas, exchange information, and discuss thoughts. Given its prevalence and widespread reach, social media must remain a safe space for people. Content generated on social media can be abusive, and it has become increasingly important to detect such content. In this paper, we use language-based preprocessing and an ensemble of several models, and analyze their performance on abusive comment detection. Through extensive experimentation, we propose a pipeline that minimizes the false-positive rate (marking non-abusive comments as abusive) so that these systems can detect abusive comments without undermining freedom of expression.

2605.22356 2026-05-22 cs.CL 版本更新

Modeling Pathology-Like Behavioral Patterns in Language Models Through Behavioral Fine-Tuning

通过行为微调建模病理样行为模式

Nicola Milano, Davide Marocco

发表机构 * University of Naples Federico II(那不勒斯费德里科二世大学) Natural and Artificial Cognition Laboratory(自然与人工认知实验室) Department of Humanities(人文学部)

AI总结 本文研究了通过行为微调在语言模型中建模病理样行为模式,采用结构化决策任务进行微调,发现模型在不同上下文中产生稳定的生成分布变化,表明行为优化能影响语言生成的分布特性。

详情
AI中文摘要

大型语言模型越来越多地被用作计算工具来建模人类样行为。我们引入了一个行为诱导框架,通过在结构化决策任务上进行微调来修改模型策略:使用受适应不良行为模式启发的合成数据集,包括抑郁和偏执,我们训练基于转换器的语言模型在多样化上下文中一致选择特定类别的动作。然后我们测试这种行为优化是否会在生成分布中产生系统性的变化。在两个架构中,微调后的模型显示出稳定的、上下文通用的下一个标记概率分布变化,包括在开放性语言任务中对负面和威胁相关解释的概率增加。这些效果超越了训练上下文,并在定性完成、心理测量风格的评估和定量分布度量如詹森-香农散度中可以检测到。诱导的行为配置文件也显示出部分特异性。为不同行为模式优化的模型在评估探针上表现出可区分的响应倾向,表明结构化行为训练产生的是差异化的策略层面偏差,而不是通用的分布偏斜。我们将这些发现解释为证据,表明在LLM中一致的行为优化可以生成与改变的潜在先验相关的稳定行为和分布模式,将动作选择和语言生成联系起来。更广泛地说,这些结果支持了LLM作为基于策略系统的观点,在其中行为约束塑造了涌现的表示结构,突显了它们作为研究行为、解释和生成语言之间关系的受控测试床的潜力。

英文摘要

Large language models are increasingly used as computational tools for modeling human-like behavior. We introduce a behavioral induction framework that modifies model policies through fine-tuning on structured decision-making tasks: using synthetic datasets inspired by maladaptive behavioral patterns, including depression and paranoia, we train transformer-based language models to consistently select specific classes of actions across diverse contexts. We then test whether this behavioral optimization produces systematic changes in generative distributions. Across two architectures, fine-tuned models show stable, context-general shifts in next-token probability distributions, including increased probability assigned to negative and threat-related interpretations in open-ended language tasks. These effects generalize beyond training contexts and are detectable in qualitative completions, psychometric-style evaluations, and quantitative distributional metrics such as Jensen-Shannon divergence. Induced behavioral profiles also show partial specificity. Models optimized for different behavioral patterns exhibit dissociable response tendencies across evaluation probes, suggesting that structured behavioral training produces differentiated policy-level biases rather than generic distributional skew. We interpret these findings as evidence that consistent behavioral optimization in LLMs can generate stable behavioral and distributional patterns consistent with altered latent priors, linking action selection and language generation. More broadly, the results support a view of LLMs as policy-based systems in which behavioral constraints shape emergent representational structure, highlighting their potential as controlled testbeds for studying the relationship between behavior, interpretation, and generative language in computational models of cognition.
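Jensen-Shannon divergence, named above as one of the distributional metrics, can be computed between the next-token distributions of a base and a fine-tuned model. The sketch below uses base-2 logarithms so the result lies in [0, 1]; the two toy distributions are invented for illustration and the choice of base is an assumption.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: mean KL of p and q to their midpoint."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical next-token distributions over a three-word vocabulary.
base_model = [0.7, 0.2, 0.1]
tuned_model = [0.3, 0.3, 0.4]
print(js_divergence(base_model, tuned_model))
print(js_divergence([1.0, 0.0], [0.0, 1.0]))  # prints 1.0
```

Unlike plain KL divergence, this measure is symmetric and stays finite when one model assigns zero probability to a token, which makes it convenient for comparing two models prompt by prompt.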

2605.22355 2026-05-22 cs.CL cs.AI cs.LG 版本更新

TransitLM: A Large-Scale Dataset and Benchmark for Map-Free Transit Route Generation

TransitLM: 一个大规模数据集和基准,用于无地图的公共交通路线生成

Hanyu Guo, Jiedong Yang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu

发表机构 * Alibaba Group(阿里巴巴集团) AMAP

AI总结 本文提出TransitLM,一个包含1300万条公共交通路线规划记录的数据集,用于无地图的公共交通路线生成,展示了通过数据训练模型生成有效路线的能力。

详情
AI中文摘要

公共交通路线规划传统上依赖于结构化的地图基础设施和复杂的路由引擎,而现有的数据集不支持训练模型绕过这种依赖。我们提出了TransitLM,一个包含来自四个中国城市的超过1300万条公共交通路线规划记录的数据集,覆盖120,845个车站和13,666条线路,作为持续预训练语料库和用于三个评估任务的基准数据。实验表明,使用TransitLM训练的LLM能够生成结构上有效的路线,精度高,并且能够隐式地将任意GPS坐标映射到合适的车站,而无需显式映射。这些结果表明,公共交通路线规划可以完全从数据中学习,从而实现端到端、无地图的路线生成,直接从起止点信息生成。数据集和基准可在https://huggingface.co/datasets/GD-ML/TransitLM获取,评估代码在https://github.com/HotTricker/TransitLM。

英文摘要

Public transit route planning traditionally depends on structured map infrastructure and complex routing engines, and no existing dataset supports training models to bypass this dependency. We present TransitLM, a large-scale dataset of over 13 million transit route planning records from four Chinese cities covering 120,845 stations and 13,666 lines, released as a continual pre-training corpus and benchmark data for three evaluation tasks with complementary metrics. Experiments show that an LLM trained on TransitLM produces structurally valid routes at high accuracy and implicitly grounds arbitrary GPS coordinates to appropriate stations without any explicit mapping. These results demonstrate that transit route planning can be learned entirely from data, enabling end-to-end, map-free route generation directly from origin-destination information. The dataset and benchmark are available at https://huggingface.co/datasets/GD-ML/TransitLM, with evaluation code at https://github.com/HotTricker/TransitLM.

2605.22310 2026-05-22 cs.CL 版本更新

Pattern-and-root inflectional morphology: the Arabic broken plural

词形和根的屈折形态:阿拉伯语的破碎复数

Alexis Amid Neme, Eric Laporte

发表机构 * LIGM, Université Paris-Est(巴黎东部大学LIGM实验室) DLL, Universidade Federal do Espírito Santo(圣埃斯皮里图联邦大学DLL系)

AI总结 本文提出了一种阿拉伯语名词屈折形态的描述模型,将传统闪米特语的“词根-模式”模型反转为“模式-词根”模型,涵盖破碎复数(BP),直接依据完全标注变音符号的词典进行形态分析而无需形态音位规则,并将带BP的名词划分为160个类别(计入仅影响单数的变化后为300个屈折类别),其编码方案已应用于3200个名词条目。

详情
Journal ref Language Sciences, 2013, 40, pp.221-250
AI中文摘要

我们提出了一个已大部分实现的阿拉伯语名词屈折形态描述模型,特别关注讲阿拉伯语的语言学家对词典和其他语言资源的管理。其突破在于将传统闪米特语的“词根-模式”模型反转为“模式-词根”模型,使模式优先于词根。我们的模型涵盖破碎复数(BP),即通过改变词干构成的复数。它基于闪米特语形态学中传统的词根和模式概念。然而,与传统阿拉伯语形态学相比,它将屈折的形式描述与派生和语义的描述分开。与传统阿拉伯语词典一样,这部可更新的词典按词目的词条组织,参考拼写完全标注变音符号。在我们的模型中,阿拉伯语文本的形态分析直接借助词形词典进行,无需形态音位规则。我们的名词屈折分类体系简单、有序且详细。我们将元音长度标记为v或vv并忽略元音音质,从而简化单数模式的分类。词根交替和正字法变体独立于模式、以如实记录的方式编码,不涉及深层词根,也不涉及形态音位或正字法规则。带三辅音BP的名词按22种模式分类,细分为90个类别;带四辅音BP的名词按3种模式分类,细分为70个类别。计入仅影响单数的屈折变化后,这160个类别变为300个屈折类别。我们提供了一种简明的编码方案,并已将其应用于3200个带BP的名词条目。

英文摘要

We present a substantially implemented model of description of the inflectional morphology of Arabic nouns, with special attention to the management of dictionaries and other language resources by Arabic-speaking linguists. The breakthrough lies in the reversal of the traditional root-and-pattern Semitic model into pattern-and-root, giving precedence to patterns over roots. Our model includes broken plurals (BPs), i.e. plurals formed by modifying the stem. It is based on the traditional notions of root and pattern of Semitic morphology. However, as compared to traditional Arabic morphology, it keeps the formal description of inflection separate from that of derivation and semantics. As in traditional Arabic dictionaries, the updatable dictionary is structured in lexical entries for lemmas, and the reference spelling is fully diacritized. In our model, morphological analysis of Arabic text is performed directly with a dictionary of words and without morphophonological rules. Our taxonomy for noun inflection is simple, orderly and detailed. We simplify the taxonomy of singular patterns by specifying vowel quantity as v or vv, and ignoring vowel quality. Root alternations and orthographical variations are encoded independently from patterns and in a factual way, without deep roots or morphophonological or orthographical rules. Nouns with a triliteral BP are classified according to 22 patterns subdivided into 90 classes, and nouns with a quadriliteral BP according to 3 patterns subdivided into 70 classes. These 160 classes become 300 inflectional classes when we take into account inflectional variations that affect only the singular. We provide a straightforward encoding scheme that we applied to 3,200 entries of BP nouns.

2605.22258 2026-05-22 cs.CL 版本更新

Harder to Defend: Towards Chinese Toxicity Attacks via Implicit Enhancement and Obfuscation Rewriting

更难防御:通过隐含增强和模糊重写实现中文毒性攻击

Jingyi Kang, Junyu Lu, Bo Xu, Hongbo Wang, Linlin Zong, Roy Ka-Wei Lee, Hongfei Lin

发表机构 * Dalian University of Technology(大连理工大学) The University of Tokyo(东京大学) Singapore University of Technology and Design(新加坡科技设计大学)

AI总结 本研究提出了一种针对中文毒性攻击的框架CITA,通过隐含增强和模糊重写技术生成攻击样本,揭示了现有检测器在识别隐含毒性内容时的不足,并展示了通过训练防御模型提升鲁棒性的效果。

Comments 16 pages, 5 figures

详情
AI中文摘要

大型语言模型(LLMs)需要超越显式措辞的鲁棒毒性评估。这一设定在中文中仍缺乏充分探索,而中文毒性可能同时结合语义上的间接性和表层模糊性。我们提出中文隐含毒性攻击(CITA),这是一个受控的红队评估与防御数据生成框架,而非可部署的规避工具。CITA包含三个阶段:(i)有害意图学习,(ii)隐含毒性增强,(iii)模糊变体重写,用于保持有害意图、提高隐含性并加入受控的表层变体。在CITA生成的评估样本上,七个被测检测器表现出显著的漏检风险,平均ASR达到69.48%;人工评估进一步证实样本保留了有害性,且隐含性和规避性有所提高。作为下游防御应用,我们使用CITA生成的红队数据微调了中文隐含毒性防御模型(CITD),表明此类数据可通过额外训练提升鲁棒性。

英文摘要

Large language models (LLMs) require robust toxicity evaluation beyond explicit wording. This setting remains underexplored in Chinese, where toxicity may combine semantic indirectness with surface obfuscation. We introduce Chinese Implicit Toxicity Attack (CITA), a controlled red-team evaluation and defense-data generation framework, not a deployable evasion tool. CITA uses three stages: (i) Harmful Intent Learning, (ii) Implicit Toxicity Enhancement, and (iii) Obfuscation Variant Rewriting, to preserve harmful intent, increase implicitness, and add controlled surface variants. On CITA-generated evaluation samples, the seven tested detectors exhibit substantial missed-detection risks, reaching an average ASR of 69.48%; human evaluation further confirms preserved harmfulness and increased implicitness/evasiveness. As a downstream defense application, we fine-tune a Chinese Implicit Toxicity Defense model (CITD) with CITA-generated red-team data, showing that such data can improve robustness through additional training.
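
The headline number in this abstract is an average ASR over seven detectors. As a minimal sketch, assuming ASR here denotes attack success rate, i.e. the share of harmful samples that a detector fails to flag, the per-detector and averaged figures could be computed as follows. The detector names and flag lists are hypothetical toy data, not results from the paper.

```python
from statistics import mean


def attack_success_rate(flags):
    """Share of harmful samples the detector missed.

    flags[i] is True when the detector flagged sample i as toxic.
    Assumption: every sample in the list is harmful by construction.
    """
    if not flags:
        raise ValueError("need at least one sample")
    missed = sum(1 for flagged in flags if not flagged)
    return missed / len(flags)


def average_asr(per_detector_flags):
    """Unweighted mean of the per-detector attack success rates."""
    return mean(attack_success_rate(f) for f in per_detector_flags.values())


# Hypothetical toy data: two detectors scored on the same four samples.
toy_flags = {
    "detector_a": [True, False, False, False],
    "detector_b": [True, True, False, False],
}
toy_average = average_asr(toy_flags)
```

An unweighted mean treats every detector equally; whether the paper weights detectors or samples differently is not stated in the abstract.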

2605.22247 2026-05-22 cs.CL 版本更新

IdioLink: Retrieving Meaning Beyond Words Across Idiomatic and Literal Expressions

IdioLink:在习语表达与字面表达之间检索超越词语的意义

Kai Golan Hashiloni, Daniel Fadlon, Lior Livyatan, Ofri Hefetz, Jiahuan Pei, Kfir Bar

发表机构 * Data Science Institute, Reichman University(雷赫曼大学数据科学学院) Efi Arazi School of Computer Science, Reichman University(雷赫曼大学埃菲·阿拉兹计算机科学学院) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学)

AI总结 本文提出IdioLink检索基准,旨在测试模型能否将习语表达与以字面或改写形式表达的概念等价意义联系起来,揭示当前模型在习语语义检索中的不足。

详情
AI中文摘要

习语对语言模型构成了根本性挑战,因为其含义无法仅凭表层形式推断。因此,理解此类表达需要超越词汇重叠的语义抽象。我们提出IdioLink,一个检索基准,用于测试模型能否将习语表达与以字面或改写形式表达的概念等价意义联系起来。IdioLink包含10,700个文档和2,140个查询,涵盖107个兼具字面用法和比喻用法的习语。每个文档和查询都标注了传达核心意义的片段。通过评估强大的嵌入基线(如BGE、E5、Contriever和Qwen),我们发现当前模型难以在差异较大的表层形式之间检索等价意义,而是依赖主题线索和浅层语义线索。IdioLink揭示了习语感知语义检索中的关键缺口,并为未来模型提供了具有挑战性的测试平台。

英文摘要

Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.

2605.22228 2026-05-22 cs.CL 版本更新

GHI: Graphormer over Conditioned Hypergraph Incidence for Aspect-Based Sentiment Analysis

GHI:基于条件超图关联的Graphormer,用于基于方面的情感分析

Yu Du, Wenlong Zhu, Xingze Li, Chenglong Cao, Jing Wang, Yukun Ma

发表机构 * Qiqihar University(齐齐哈尔大学)

AI总结 本文提出GHI框架,通过构建建立在二分拓扑之上的基于关联(incidence)的结构推理层,实现对基于方面的情感分析任务中不同结构信号的统一处理;实验表明GHI在SemEval领域优于所有基线,且在参数较少的情况下表现优异。

Comments 15 pages, 8 figures, 7 tables

详情
AI中文摘要

基于方面的情感分析(ABSA)要求模型将情感证据绑定到正确的方面,使其成为细粒度结构推理的天然试验场。我们提出GHI,一个基于条件超图关联的Graphormer框架(Graphormer-over-Conditioned-Hypergraph-Incidence),它被设计为建立在二分拓扑之上的基于关联(incidence)的结构推理层。GHI将多样化的语言和语义证据表示为token-超边关联关系,使不同的结构信号可以通过统一接口接入。在六个标准ABSA基准上的广泛实验表明,GHI在SemEval领域优于所有基线,多种子评估显示其相对强DeBERTa基线有稳定提升。进一步实验显示,仅使用247M参数,GHI在ISE基准上接近基于11B Flan-T5的方法的性能。此外,GHI在具有挑战性的ARTS数据集上表现出很强的鲁棒性,在传统模型性能退化的情况下仍保持高度竞争力。这些结果表明,对于细粒度任务,紧凑的结构推理仍是规模驱动方法之外有价值的替代方案。

英文摘要

Aspect-based sentiment analysis (ABSA) requires models to bind sentiment evidence to the correct aspect, making it a natural testbed for fine-grained structural reasoning. We introduce GHI, a Graphormer-over-Conditioned-Hypergraph-Incidence framework that is designed as an incidence-based structural reasoning layer built on a bipartite topology. GHI represents diverse linguistic and semantic evidence as token--hyperedge incidence relations, allowing different structural signals to be incorporated through a unified interface. Extensive experiments on six standard ABSA benchmarks show that GHI outperforms all baselines on the SemEval domains, and multi-seed evaluations show stable improvements over strong DeBERTa. Further experiments show that with only 247M parameters, GHI approaches the performance of 11B Flan-T5 based methods on the ISE benchmark. Moreover, it demonstrates strong robustness on the challenging ARTS datasets, maintaining highly competitive performance where traditional models degrade. These results demonstrate that compact structural reasoning remains a valuable alternative to scale-driven approaches for fine-grained tasks.

2605.22217 2026-05-22 cs.LG cs.CL 版本更新

Survive or Collapse: The Asymmetric Roles of Data Gating and Reward Grounding in Self-Play RL

生存或崩溃:自我博弈强化学习中数据门控与奖励基础的不对称作用

Sophia Xiao Pu, Zhaotian Weng, Chengzhi Liu, Jayanth Srinivasa, Gaowen Liu, William Yang Wang, Xin Eric Wang

发表机构 * University of California, Santa Barbara(加州大学圣巴巴拉分校) Cisco Research(思科研究)

AI总结 本文研究了自我博弈强化学习中数据门控和奖励信号的不对称作用,发现严格的数据门控足以维持稳定,而一旦移除门控,任何奖励变体都无法单独保证稳定性,并由此揭示了“有依据提出者悖论”(Grounded Proposer Paradox)。

详情
AI中文摘要

自我博弈强化学习让语言模型在自己生成的任务上训练,使提出者与求解者共同进化,无需人工标注。最近的系统报告了显著的推理提升,但崩溃和不稳定现象普遍存在且缺乏深入理解。主流做法将其视为奖励设计问题。我们则认为,自我博弈的稳定性由两个不同的杠杆决定:数据层面的门控,决定哪些由提出者生成的任务进入训练池;以及奖励信号,在已准入的任务上更新策略。通过在Python输出预测任务以及一个去除了预训练先验、输出歧义和执行器噪声的确定性DSL孪生任务上进行受控实验,我们发现这两个杠杆是不对称的。严格的门控在我们测试的每种奖励变体下都足以保证稳定,包括无法访问真值的自一致性奖励;而一旦移除门控,没有任何奖励变体足以保证稳定。这种不对称性揭示了一种反直觉的耦合,我们称之为“有依据提出者悖论”(Grounded Proposer Paradox):当与自一致性求解器配对时,能够访问真值的提出者会比没有真值依据的提出者更快地导致崩溃,因为它把训练集中在干净的任务上,而这些任务构成了通向虚假自一致吸引子的最快路径。将二元门控替换为连续的严格度参数ε后,进一步揭示了两阶段相变:训练侧指标在较低的ε处即发生解耦,而验证准确率要到ε高得多时才无法维持。数据层面的门控,而非奖励校准,才是自我博弈稳定性的紧约束。

英文摘要

Self-play reinforcement learning trains language models on their own generated tasks, co-evolving a proposer and solver without human labels. Recent systems report strong reasoning gains, but collapse and instability are widely observed and poorly understood. The dominant response treats this as a reward-design problem. We argue instead that self-play stability is governed by two distinct levers: a data-level gate that decides which proposer-generated tasks enter the training pool, and the reward signal that updates the policy on tasks already admitted. Through controlled experiments on a Python output-prediction task and a deterministic-DSL twin task that strips pretraining priors, output ambiguity, and executor noise, we find the two levers are asymmetric. A strict gate is sufficient for stability under every reward variant we test, including a self-consistency reward with no access to ground truth; while no reward variant is sufficient once the gate is removed. This asymmetry exposes a counter-intuitive coupling we call the Grounded Proposer Paradox: a proposer with ground-truth access accelerates collapse faster than an ungrounded one when paired with a self-consistency solver, by concentrating training on clean tasks that form the fastest path to a spurious self-consistent attractor. Replacing the binary gate with a continuous strictness parameter $\varepsilon$ further reveals a two-stage phase transition: training-side metrics decouple at low $\varepsilon$, while validation accuracy holds until $\varepsilon$ is much higher. Data-level gating, not reward calibration, is the binding constraint on self-play stability.

2605.22204 2026-05-22 cs.CL 版本更新

Audience Engagement with Arabic Women's Social Empowerment and Wellbeing: A Decadal Corpus

阿拉伯女性社会赋权与福祉的受众参与:一个十年语料库

Wajdi Zaghouani, Mabrouka Bessghaier, MD. Rafiul Biswas, Shimaa Amer Ibrahim

发表机构 * Northwestern University in Qatar(卡塔尔西北大学) Hamad bin Khalifa University(哈马德·本·哈利法大学)

AI总结 本文提出阿拉伯女性与社会语料库,包含2013至2024年间252,487条阿拉伯语Facebook公开帖子,涵盖女性赋权和社会福祉主题,通过自动化流程处理后,为阿拉伯方言的性别话语、社会改革和情感参与的大规模分析提供了数据支持。

详情
AI中文摘要

本文介绍了阿拉伯女性与社会语料库,这是一个跨度约十年的数据集,包含252,487条与女性赋权和社会福祉相关的阿拉伯语Facebook公开帖子。该语料库于2013年至2024年间从77个国家的51,660个页面收集,累计产生超过2.67亿次用户互动。每条帖子均包含分享、评论和情感反应等参与指标,为观察受众情绪和社会关注度提供了独特视角。数据通过自动化流程处理,包括语言识别、规范化和元数据清洗,以确保可靠性和可重复性。该语料库支持对阿拉伯语各方言中的性别话语、社会改革和情感参与进行大规模分析,并为阿拉伯语自然语言处理、计算社会科学和数字传播研究提供支持。数据集及配套文档将应申请提供,供研究使用。

英文摘要

This paper presents the Arabic Women and Society Corpus, a ten-year collection of 252,487 public Arabic Facebook posts related to women's empowerment and social wellbeing. The corpus was collected from 51,660 pages across 77 countries between 2013 and 2024, resulting in more than 267 million user interactions. Each post includes engagement metrics such as shares, comments, and emotional reactions, providing a unique view of audience sentiment and social attention. The data were processed using an automated pipeline with language identification, normalization, and metadata cleaning to ensure reliability and reproducibility. The corpus enables large-scale analysis of gender discourse, social reform, and emotional engagement across Arabic dialects. It supports research in Arabic natural language processing, computational social science, and digital communication studies. The dataset and accompanying documentation will be released upon request for research use.

2605.22203 2026-05-22 cs.CL 版本更新

Evaluation of Chunking Strategies for Effective Text Embedding in Low-Resource Language on Agricultural Documents

对低资源语言农业文档中有效文本嵌入的分块策略评估

Sovandara Chhoun, Pichdara Po, Sereiwathna Ros, Wan-Sup Cho, Saksonita Khoeurn

发表机构 * Department of Big Data, Chungbuk National University, Cheongju-si, South Korea(韩国清州市忠北国立大学大数据系) Department of Computer Science, Chungbuk National University, Cheongju-si, South Korea(韩国清州市忠北国立大学计算机科学系) BigDataLabs Co., Ltd. Department of Management Information Systems, Chungbuk National University, South Korea(BigDataLabs公司;韩国忠北国立大学管理信息系统系)

AI总结 本研究比较了四种文本分块方法在Khmer农业文档中的性能,通过检索增强生成(RAG)框架评估分块策略对密集检索优化的影响,发现基于字符的递归分块方法在低资源语言中表现最佳。

Comments 11 pages, 1 figure

详情
AI中文摘要

在本研究中,我们比较了四种文本分块方法:递归、Khmer-aware、基于句子和基于大语言模型(LLM)在检索增强生成(RAG)框架中应用于Khmer农业文档的性能。文档分块使用BGE-M3多语言嵌入模型进行编码,并使用FAISS库进行检索。性能通过四个指标评估:平均检索分数(L2距离)、答案相关性、Khmer覆盖率和Khmer交并比(IoU),均基于真实问题-答案对进行测量。在评估中,我们对18个问题-答案对进行了5折交叉验证。我们观察到基于字符的递归分块方法在分块大小为300字符时表现最佳,实现了最低的L2距离(0.4295 ± 0.0461)、最高的答案相关性(0.8663 ± 0.0199)和最高的Khmer IoU(0.6441 ± 0.0347)。配对t检验显示在L2距离上,与基于句子的分块方法相比有统计学显著改进(p = 0.0121)。这些结果突显了分块粒度和结构保持在优化形态复杂、低资源语言如Khmer的密集检索中的重要性。

英文摘要

In this study, we compare the performance of four text chunking approaches: Recursive, Khmer-Aware, Sentence-Based, and LLM-Based within a Retrieval-Augmented Generation (RAG) framework applied to Khmer agricultural documents. The document chunks are encoded using the BGE-M3 multilingual embedding model and retrieved using the FAISS library. Performance is evaluated using four metrics: Average Retrieval Score (L2 distance), Answer Relevance, Khmer Coverage, and Khmer Intersection over Union, all measured against ground-truth question-answer pairs. For evaluation, we perform 5-fold cross-validation over 18 question-answer pairs. We observe the best performance for the character-based Recursive chunking method with a chunk size of 300 characters, achieving the lowest L2 distance (0.4295 ± 0.0461), highest Answer Relevance (0.8663 ± 0.0199), and highest Khmer IoU (0.6441 ± 0.0347). A paired t-test shows a statistically significant improvement over the Sentence-Based chunking method in L2 distance (p = 0.0121). These results highlight the importance of segmentation granularity and structural preservation for optimizing dense retrieval in morphologically complex, low-resource languages such as Khmer.
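
To make the character-based Recursive strategy concrete, here is a minimal sketch of one common way to implement it: split on the coarsest separator first, pack pieces into chunks of at most `chunk_size` characters, and fall back to finer separators (and finally a hard cut) only for pieces that are still too long. The function name, the separator list and the packing rule are assumptions for illustration; the abstract specifies only the 300-character chunk size, not the implementation.

```python
def recursive_chunks(text, chunk_size=300, separators=("\n\n", "\n", " ", "")):
    """Split text into chunks of at most chunk_size characters.

    Separators are tried from coarsest to finest; the empty string means
    a hard cut every chunk_size characters.
    """
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: fixed-width slices.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # piece still fits into the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece alone is too long: recurse with finer separators.
            chunks.extend(recursive_chunks(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


# Toy input: two short paragraphs, the second ending in one overlong word.
sample = "a" * 250 + "\n\n" + "b" * 250 + " " + "c" * 400
sample_chunks = recursive_chunks(sample, chunk_size=300)
```

Khmer text is usually written without spaces between words, so with this separator list the hard-cut fallback would likely be reached more often than for English; a script-aware separator list would be a natural variation.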

2605.22202 2026-05-22 cs.CL 版本更新

Structure Retention in Embedding Spaces as a Predictor of Benchmark Performance

嵌入空间中的结构保留作为基准性能预测因子

Amanda Myntti, Jenna Kanerva, Veronika Laippala, Filip Ginter

发表机构 * TurkuNLP, University of Turku, Finland(芬兰图尔库大学TurkuNLP研究组) ELLIS Institute Finland(芬兰ELLIS研究所)

AI总结 本文研究了高表现嵌入模型在嵌入空间中的一致性组织方式,通过评估25种现代嵌入模型在五个MTEB任务上的表现,发现最近邻重叠和独立成分分析(ICA)中成对文本实例的幅度差异与任务性能高度相关,揭示了嵌入任务在线性度和局部信息保留依赖性方面的差异。

详情
AI中文摘要

在本文中,我们展示了高性能的嵌入模型在其嵌入空间中以一致的方式组织。我们评估了25种现代嵌入模型在五个MTEB任务上的表现,这些任务涵盖四个多样化的任务类别(检索、双语挖掘、对分类和摘要)在英语和多语言设置中。我们发现成对文本实例之间的最近邻重叠和独立成分分析(ICA)中的幅度差异与给定任务的性能高度相关(甚至达到0.97)。最终,我们展示了嵌入任务在不同程度上表现出线性和对局部信息保留的依赖性。我们的结果进一步加深了对嵌入的理解,揭示了嵌入与模型性能的关系,并为可能的未来训练目标和优化条件嵌入提供了启示。

英文摘要

In this paper, we show that high-performing embedding models organize their embedding spaces in a consistent way. We evaluate 25 contemporary embedding models on five MTEB tasks spanning four diverse task categories (retrieval, bitext mining, pair classification, and summarization) in both English and multilingual settings, and reveal that nearest-neighbor overlap and magnitude differences in independent component analysis (ICA) between paired text instances strongly correlate (even up to 0.97) with performance on the given task. Ultimately, we show that embedding tasks display varying degrees of linearity and reliance on retention of local information. Our results further the understanding of embeddings and their relation to model performance, and shed light on possible future training objectives and on optimizing conditional embeddings.
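
One of the two signals named in this abstract is nearest-neighbor overlap between paired text instances. The abstract does not spell out the computation, so the sketch below is only one plausible reading: for each pair (for example a sentence and its translation), take the k nearest neighbors of each member within its own side and measure how many neighbor indices the two sides share. Function names, the use of cosine similarity and the toy vectors are all assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def knn_indices(vectors, i, k):
    """Indices of the k vectors most similar to vectors[i], excluding i."""
    scored = [(cosine(vectors[i], v), j) for j, v in enumerate(vectors) if j != i]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))  # ties broken by index
    return {j for _, j in scored[:k]}


def neighbor_overlap(side_a, side_b, k):
    """Mean fraction of shared k-nearest-neighbor indices across paired items.

    side_a[i] and side_b[i] are the embeddings of the two members of pair i.
    """
    n = len(side_a)
    shared = (len(knn_indices(side_a, i, k) & knn_indices(side_b, i, k)) / k
              for i in range(n))
    return sum(shared) / n


# Toy 2-D embeddings: side_b_same preserves the neighborhood structure of
# side_a, side_b_shuffled does not.
side_a = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
side_b_same = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
side_b_shuffled = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
```

A score near 1 means the two sides arrange paired items in the same local neighborhoods; a score near 0 means the local structure is not retained across the pairing.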

2605.22177 2026-05-22 cs.LG cs.CL 版本更新

Maestro: Reinforcement Learning to Orchestrate Hierarchical Model-Skill Ensembles

Maestro:通过强化学习协调分层模型-技能集合

Jinyang Wu, Guocheng Zhai, Ruihan Jin, Yuhao Shen, Zhengxi Lu, Fan Zhang, Haoran Luo, Zheng Lian, Zhengqi Wen, Jianhua Tao

发表机构 * Tsinghua University(清华大学) Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学) Nanyang Technological University(南洋理工大学) Tongji University(同济大学)

AI总结 本文提出Maestro框架,通过强化学习协调多模态任务,利用分层模型-技能集合提升多模态任务性能,实现高效且通用的协调策略。

详情
AI中文摘要

大型语言模型(LLMs)和模块化技能的普及使自主代理具备了越来越强大的能力。现有框架通常依赖于单一的LLM和固定的逻辑来与这些技能交互。这导致了一个关键瓶颈:不同的LLMs在不同领域具有不同的优势,但当前框架未能利用模型和技能的互补优势,从而限制了其在下游任务上的性能。在本文中,我们提出了Maestro(多模态代理专家技能强化学习协调框架),这是一个由强化学习(RL)驱动的协调框架,将异构多模态任务重新框架化为一个在分层模型-技能注册表上的顺序决策过程。与将所有知识整合到单一模型中不同,Maestro训练了一个轻量级的策略,动态组合冻结的专家模型和一个双层技能库,决定在每一步是否调用外部专家,选择哪个模型-技能对,以及何时终止。该策略通过基于结果的强化学习进行优化,不需要步骤级监督。我们评估了Maestro在十个代表性的多模态基准上,涵盖数学推理、图表理解、高分辨率感知和领域特定分析。仅使用一个4B的协调器,Maestro实现了70.1%的平均准确率,超过了GPT-5(69.3%)和Gemini-2.5-Pro(68.7%)。关键的是,学习的协调策略能够泛化到未见过的模型和技能,无需重新训练:在注册表中添加非领域专家,使在四个具有挑战性的基准上平均达到59.5%,优于所有闭源基线。Maestro进一步保持了高计算效率和低延迟。源代码可在https://github.com/jinyangwu/Maestro上获得。

英文摘要

The proliferation of large language models (LLMs) and modular skills has endowed autonomous agents with increasingly powerful capabilities. Existing frameworks typically rely on monolithic LLMs and fixed logic to interface with these skills. This gives rise to a critical bottleneck: different LLMs offer distinct advantages across diverse domains, yet current frameworks fail to exploit the complementary strengths of models and skills, thereby limiting their performance on downstream tasks. In this paper, we present Maestro (Multimodal Agent for Expert-Skill Targeted Reinforced Orchestration), a Reinforcement Learning (RL)-driven orchestration framework that reframes heterogeneous multimodal tasks as a sequential decision-making process over a hierarchical model-skill registry. Rather than consolidating all knowledge into a single model, Maestro trains a lightweight policy to dynamically compose ensembles of frozen expert models and a two-tier skill library, deciding at each step whether to invoke an external expert, which model-skill pair to select, and when to terminate. The policy is optimized via outcome-based RL, requiring no step-level supervision. We evaluate Maestro across ten representative multimodal benchmarks spanning mathematical reasoning, chart understanding, high-resolution perception, and domain-specific analysis. With only a 4B orchestrator, Maestro achieves an average accuracy of 70.1%, surpassing both GPT-5 (69.3%) and Gemini-2.5-Pro (68.7%). Crucially, the learned coordination policy generalizes to unseen models and skills without retraining: augmenting the registry with out-of-domain experts yields a 59.5% average on four challenging benchmarks, outperforming all closed-source baselines. Maestro further maintains high computational efficiency with low latency. The source code is available at https://github.com/jinyangwu/Maestro.

2605.22170 2026-05-22 cs.CL 版本更新

Do Factual Recall Mechanisms Carry over from Text to Speech in Multimodal Language Models?

多模态语言模型中的事实回忆机制能否从文本延续到语音?

Luca Modica, Filip Landin, Mehrdad Farahani, Livia Qian, Gabriel Skantze, Richard Johansson

发表机构 * Zenseact Unbox AI Chalmers University of Technology(查尔姆斯理工大学) University of Gothenburg(哥德堡大学) KTH Royal Institute of Technology(瑞典皇家理工学院)

AI总结 研究探讨了多模态语言模型中事实回忆机制在文本和语音模态间的延续性,通过因果中介分析揭示了语音到文本与文本到文本在事实存储和回忆中的差异。

Comments In *SEM 2026, the 15th Joint Conference on Lexical and Computational Semantics

详情
AI中文摘要

近年来,几种将语音和书面文本联合表示的语音语言模型(SLMs)已被提出。然后出现的问题是,当模型在两种模态中运行时,内部机制的相似性和差异性如何。我们关注这些系统如何编码、存储和检索事实知识,这之前已经在文本模型中被研究过。为了调查SLMs中事实关联存储和回忆的机制,我们利用了因果中介分析,这是一种之前应用于基于文本的模型的技术。使用SpiritLM这一多模态模型,整合离散语音标记,初步结果揭示了文本到文本和语音到文本结果之间的差异,表明事实回忆的新兴机制仅部分从文本延续到语音模态。这些结果加深了我们对SLMs中内部机制如何编码事实关联的理解,并为改进语音启用的AI系统提供了见解。

英文摘要

In recent years, several Speech Language Models (SLMs) that represent speech and written text jointly have been presented. This raises the question of how model-internal mechanisms are similar or different when operating in the two modalities. We focus on how these systems encode, store, and retrieve factual knowledge, which has previously been investigated for text-only models. To investigate mechanisms behind the storage and recall of factual associations in SLMs, we leverage Causal Mediation Analysis, a technique previously applied to text-based models. Initial results using SpiritLM, a multimodal model integrating discrete speech tokens, reveal discrepancies between text-to-text and speech-to-text results, suggesting that the emergent mechanisms for factual recall are only partially carried over from the text to the speech modality. These results advance our understanding of how internal mechanisms encode factual associations in SLMs while contributing insights for improving speech-enabled AI systems.

2605.22148 2026-05-22 cs.AI cs.CL 版本更新

Ratchet: A Minimal Hygiene Recipe for Self-Evolving LLM Agents

Ratchet:面向自演化LLM代理的最小化卫生方案

Xing Zhang, Yanwei Cui, Guanghui Wang, Ziyuan Li, Wei Qiu, Bing Zhu, Peiyang He

发表机构 * AWS Generative AI Innovation Center(AWS 生成式人工智能创新中心) HSBC Holdings Plc.(汇丰控股有限公司) HSBC Technology Center, China(汇丰技术中心,中国)

AI总结 本文提出Ratchet,一种单代理循环,使冻结的LLM能够自行编写、检索、整理和淘汰其自然语言技能,通过整合四个卫生机制提升技能库的生命周期管理,从而在MBPP+ hard-100数据集上显著提升性能。

Comments 16 pages, 2 figures, 6 tables. Extends arXiv:2605.19576 with the SWE-bench Verified evaluation and a non-divergence analysis (Proposition 1)

详情
英文摘要

Self-evolving skill libraries, pioneered by Voyager, let frozen LLM agents accumulate reusable knowledge without weight updates, yet recent evaluation shows that LLM-authored skills deliver $+0.0$pp over no-skill baselines while human-curated ones deliver $+16.2$pp: the bottleneck is not skill authoring but lifecycle management. We introduce Ratchet, a single-agent loop in which a frozen LLM writes, retrieves, curates, and retires its own natural-language skills. Ratchet integrates four candidate hygiene mechanisms: outcome-driven retirement, a bounded active-cap, meta-skill authoring guidance, and pattern canonicalisation. On MBPP+ hard-100 with Claude Opus 4.7, Ratchet lifts held-out pass@1 from a $0.258 \pm 0.047$ baseline to a late-window rolling mean of $0.584$ (peak $0.658 \pm 0.042$) across 100 rounds and 3 seeds, a $+0.328 \pm 0.018$ rolling-mean gain where the no-skill control drifts at $+0.002 \pm 0.005$; the same recipe transfers to an agentic solver on SWE-bench Verified ($+0.22$ peak lift over 20 rounds). Eight ablations (A1--A8) reveal that the minimal working recipe is smaller than our design suggests: retirement and the meta-skill authoring prior are load-bearing, while explicit deduplication (canonicalisation, cover-guard) is subsumed by the meta-skill itself. A non-divergence proposition shows that bounded cap and retirement threshold together prevent expected performance from drifting below the no-skills floor.

2605.22140 2026-05-22 cs.CL 版本更新

Psy-Chronicle: A Structured Pipeline for Synthesizing Long-Horizon Campus Psychological Counseling Dialogues

Psy-Chronicle: 一个用于合成长周期校园心理辅导对话的结构化流水线

Chaogui Gou, Jiarui Liang

发表机构 * University of Science and Technology Beijing(北京科技大学)

AI总结 本文提出Psy-Chronicle,一种结构化数据生成框架,用于合成长周期校园心理辅导对话,通过生成学期跨度的时间压力事件图和学生与辅导员代理的交互模拟,构建了包含100个学生档案和9万条对话的CPCD数据集,并通过CPCD-Bench评估模型的长周期校园辅导能力,实验结果表明CPCD有效提升了模型的会话级响应生成和长周期记忆召回能力。

详情
AI中文摘要

近年来,大语言模型在心理支持任务中展现出显著潜力。然而,现有心理辅导数据大多依赖于单轮问答或短多轮对话,难以刻画大学生心理困扰在校园生活事件中如何积累、交互并逐渐演变的长期过程。为解决这一问题,本文提出Psy-Chronicle,一种结构化数据生成框架,用于合成长周期校园心理辅导对话。我们生成一个跨越学期的时间压力事件图,以建模校园压力事件之间的时序顺序和演变依赖关系。通过学生代理与辅导员代理之间的交互模拟,以及结构化记忆整合机制,Psy-Chronicle生成具有连续性且跨越多个辅导会话的长周期对话。基于Psy-Chronicle,我们构建并开源了CPCD,一个中文长周期对话数据集,包含100个学生档案和90,000条辅导对话。我们进一步构建CPCD-Bench,从三个维度评估模型的长周期校园辅导能力:会话级响应、长周期记忆召回和时间因果推理。实验结果表明,CPCD有效提升了具有相同基础架构的模型的会话级响应生成和长周期记忆召回能力。同时,时间因果推理的改进仍有限,表明事件链组织和因果解释是长周期心理辅导建模中的关键挑战。相关代码和数据可在:https://github.com/EdwinUSTB/Psy-Chronicle 获取。

英文摘要

In recent years, large language models have shown substantial potential in psychological support tasks. However, existing psychological counseling data mostly rely on single-turn question answering or short multi-turn dialogues, making it difficult to characterize how college students' psychological distress accumulates, interacts, and gradually evolves over long periods within campus life events. To address this issue, this paper proposes Psy-Chronicle, a structured data-generation framework for synthesizing long-horizon campus psychological counseling dialogues. We generate a semester-spanning temporal stress event graph to model the chronological order and evolutionary dependencies among campus stress events. Through interactive simulation between a student agent and a counselor agent, together with a structured memory integration mechanism, Psy-Chronicle generates long-horizon dialogues with continuity across counseling sessions. Based on Psy-Chronicle, we construct and open-source CPCD, a Chinese long-horizon dialogue dataset for college psychological counseling, containing 100 student profiles and 90,000 counseling dialogues. We further build CPCD-Bench to evaluate models' long-horizon campus counseling capabilities along three dimensions: session-level response, long-horizon memory recall, and temporal-causal reasoning. Experimental results show that CPCD effectively improves session-level response generation and long-horizon memory recall for models with the same base architecture. Meanwhile, improvements in temporal-causal reasoning remain limited, indicating that event-chain organization and causal explanation are key challenges in long-horizon psychological counseling modeling. The related code and data are available at: https://github.com/EdwinUSTB/Psy-Chronicle

2605.22138 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

Efficient Agentic Reasoning Through Self-Regulated Simulative Planning

通过自我调节模拟规划实现高效的代理推理

Mingkai Deng, Jinyu Hou, Lara Sá Neves, Varad Pimpalkhute, Taylor W. Killian, Zhengzhong Liu, Eric P. Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出通过分解决策过程为三个系统:模拟推理、自我调节和反应执行,来提升代理推理的效率,并展示了SR$^2$AM模型在不同任务中的表现。

Comments Code and model artifacts are available at https://github.com/sailing-lab/sr2am

详情
AI中文摘要

代理应当如何决定何时以及如何规划?主流方法把代理构建为带有自适应计算(例如思维链)的反应式策略,通过端到端训练期望规划能力隐式涌现。由于无法控制规划是否出现、规划的结构及其视野,这些系统会大幅增加推理长度,造成低效的token使用,却没有可靠的准确率提升。我们主张,高效的代理推理得益于将决策分解为三个系统:模拟推理(系统II)借助世界模型,将思考建立在对未来状态的预测之上;自我调节(系统III)通过学习得到的配置器决定何时规划以及规划多深;反应执行(系统I)负责细粒度的动作。模拟推理无需针对每个领域进行工程设计即可为多样化任务提供统一的规划,而自我调节确保只在需要时才调用规划器。为验证这一点,我们开发了SR$^2$AM(Self-Regulated Simulative Reasoning Agentic LLM),在LLM的思维链中将这两者实现为不同的阶段,并由LLM充当世界模型。我们探索了两种实现方式:记录由提示驱动的多模块系统的决策(v0.1),以及从预训练推理LLM的轨迹中重建结构化计划(v1.0),先后通过监督学习和强化学习(RL)进行训练。在数学、科学、表格分析和网络信息搜寻任务上,v0.1-8B和v1.0-30B的Pass@1分别可与120-355B和685B-1T参数的系统相竞争,同时v1.0-30B使用的推理token比同类代理LLM少25.8-95.3%。强化学习使平均规划视野增加22.8%,而规划频率仅增加2.0%,表明它学会的是规划得更远,而不是规划得更频繁。更广泛地说,学习得到的自我调节体现了一个原则,我们预计它可以从规划推广到代理如何管理自身的学习和适应。

英文摘要

How should an agent decide when and how to plan? A dominant approach builds agents as reactive policies with adaptive computation (e.g., chain-of-thought), trained end-to-end expecting planning to emerge implicitly. Without control over the presence, structure, or horizon of planning, these systems dramatically increase reasoning length, yielding inefficient token use without reliable accuracy gains. We argue efficient agentic reasoning benefits from decomposing decision-making into three systems: simulative reasoning (System II) grounding deliberation in future-state prediction via a world model; self-regulation (System III) deciding when and how deeply to plan via a learned configurator; and reactive execution (System I) handling fine-grained action. Simulative reasoning provides unified planning across diverse tasks without per-domain engineering, while self-regulation ensures the planner is invoked only when needed. To test this, we develop SR$^2$AM (Self-Regulated Simulative Reasoning Agentic LLM), realizing both as distinct stages within an LLM's chain-of-thought, with the LLM as world model. We explore two instantiations: recording decisions from a prompted multi-module system (v0.1) and reconstructing structured plans from traces of pretrained reasoning LLMs (v1.0), trained via supervised then reinforcement learning (RL). Across math, science, tabular analysis, and web information seeking, v0.1-8B and v1.0-30B achieve Pass@1 competitive with 120-355B and 685B-1T parameter systems respectively, while v1.0-30B uses 25.8-95.3% fewer reasoning tokens than comparable agentic LLMs. RL increases average planning horizon by 22.8% while planning frequency grows only 2.0%, showing it learns to plan further ahead rather than more often. More broadly, learned self-regulation instantiates a principle we expect to extend beyond planning to how agents govern their own learning and adaptation.

2605.22099 2026-05-22 cs.CL 版本更新

A Comparative Study of Language Models for Khmer Retrieval-Augmented Question Answering

面向高棉语检索增强问答的语言模型比较研究

Sereiwathna Ros, Phannet Pov, Ratanaktepi Chhor, Kimleang Ly, Wan-Sup Cho, Saksonita Khoeurn

发表机构 * Department of Computer Science, Chungbuk National University(忠北国立大学计算机科学系) Department of Big Data, Chungbuk National University(忠北国立大学大数据系) General Department of Information and Communication Technology, Ministry of Post and Telecommunications(邮电部信息和通信技术总局) Department of Management Information Systems, Chungbuk National University(忠北国立大学管理信息系统系) BigDatalabs Co., Ltd(BigDatalabs公司)

AI总结 本文针对低资源、非拉丁文字的高棉语,比较了多种语言模型在检索增强问答任务中的性能,发现检索器选择是影响效果的关键因素,而各生成器在不同指标上各有所长。

Comments 14 pages, 1 figure

详情
AI中文摘要

检索增强生成(RAG)是一种将大型语言模型(LLM)的输出建立在检索到的证据之上的范式,有望减少幻觉并提高事实准确性。然而,它在高棉语等低资源、非拉丁文字语言中的有效性仍鲜有研究。本文提出了一个面向高棉语电信领域文档的基于RAG的问答系统。我们进行了两阶段的比较评估。首先,我们对三种嵌入模型进行基准测试:BGE-M3(567M)、Jina-Embeddings-v3(570M)和Qwen3-Embedding(597M),用于高棉语文档的密集检索。BGE-M3始终表现最佳,Hit Rate@3为0.285,File Hit Rate@3为0.700,MRR@3为0.221,Precision@3为0.112,显著优于其他检索器。其次,以BGE-M3作为选定的检索器,我们在一个精心整理的、包含200个高棉语问答对的黄金数据集上评估了五个生成器后端:Qwen3(8B)、Qwen3.5(9B)、Sailor2-8B-Chat、SeaLLMs-v3-7B-Chat和Llama-SEA-LION-v2-8B-IT。为量化系统性能,我们采用了六个受RAGAS启发的指标:忠实度、答案相关性、上下文相关性、事实正确性、答案相似性和答案正确性。结果表明,没有任何单一模型在所有指标上占优:Qwen3.5-9B的忠实度(0.859)和上下文相关性(0.726)最高,Qwen3-8B的事实正确性(0.380)最高,而SeaLLMs-v3-7B-Chat在答案相关性(0.867)、答案相似性(0.836)和答案正确性(0.599)上表现最佳。这些发现表明,检索器选择仍是高棉语RAG的主要瓶颈,而生成器的优势则取决于优先考虑的是证据依托、事实精确度还是语义相似性。

英文摘要

Retrieval-Augmented Generation (RAG) has emerged as a promising paradigm for grounding large language model (LLM) outputs in retrieved evidence, thereby reducing hallucination and improving factual accuracy. Its efficacy, however, remains largely unexamined for low-resource, non-Latin-script languages such as Khmer. In this paper, we present a RAG-based question answering system for Khmer-language telecom-domain documents. We conduct a two-phase comparative evaluation. First, we benchmark three embedding models: BGE-M3 (567M), Jina-Embeddings-v3 (570M), and Qwen3-Embedding (597M), for dense retrieval over Khmer documents. BGE-M3 consistently performs best, achieving a Hit Rate@3 of 0.285, File Hit Rate@3 of 0.700, MRR@3 of 0.221, and Precision@3 of 0.112, substantially outperforming the other retrievers. Second, using BGE-M3 as the selected retriever, we evaluate five generator backends: Qwen3 (8B), Qwen3.5 (9B), Sailor2-8B-Chat, SeaLLMs-v3-7B-Chat, and Llama-SEA-LION-v2-8B-IT, on a curated golden dataset of 200 Khmer question-answer pairs. To quantify system performance, we apply six RAGAS-inspired metrics: faithfulness, answer relevance, context relevance, factual correctness, answer similarity, and answer correctness. The results show no single model dominates across all metrics: Qwen3.5-9B achieves the highest faithfulness (0.859) and context relevance (0.726), Qwen3-8B attains the highest factual correctness (0.380), and SeaLLMs-v3-7B-Chat performs best on answer relevance (0.867), answer similarity (0.836), and answer correctness (0.599). These findings highlight that retriever choice remains a major bottleneck for Khmer RAG, while generator strengths vary depending on whether the priority is grounding, factual precision, or semantic similarity.
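
The retriever comparison above is reported as Hit Rate@3, MRR@3 and Precision@3. The sketch below shows the standard textbook definitions of these three metrics over ranked result lists; the paper may define them slightly differently (for example at chunk versus file level), and the function names and toy document ids are hypothetical.

```python
def first_relevant_rank(ranked, relevant, k):
    """1-based rank of the first relevant id within the top k, or None."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if doc_id in relevant:
            return rank
    return None


def hit_rate_at_k(all_ranked, all_relevant, k=3):
    """Fraction of queries with at least one relevant id in the top k."""
    hits = sum(first_relevant_rank(r, rel, k) is not None
               for r, rel in zip(all_ranked, all_relevant))
    return hits / len(all_ranked)


def mrr_at_k(all_ranked, all_relevant, k=3):
    """Mean reciprocal rank of the first relevant id; 0 when none is in the top k."""
    total = 0.0
    for r, rel in zip(all_ranked, all_relevant):
        rank = first_relevant_rank(r, rel, k)
        if rank is not None:
            total += 1.0 / rank
    return total / len(all_ranked)


def precision_at_k(all_ranked, all_relevant, k=3):
    """Mean share of the top k results that are relevant."""
    total = 0.0
    for r, rel in zip(all_ranked, all_relevant):
        total += sum(doc_id in rel for doc_id in r[:k]) / k
    return total / len(all_ranked)


# Toy example: three queries, each with a ranked list of three document ids.
toy_ranked = [["d1", "d2", "d3"], ["d7", "d8", "d9"], ["d4", "d5", "d6"]]
toy_relevant = [{"d2"}, {"d7", "d9"}, {"d0"}]
```

On the toy data the first query has its only relevant document at rank 2, the second has two relevant documents (ranks 1 and 3), and the third has none, which makes the three metrics easy to verify by hand.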

2605.16984 2026-05-22 cs.CL 版本更新

Closing the Gap at CRAC 2026: Two-Stage Adaptation for LLM-Based Multilingual Coreference Resolution

在CRAC 2026上缩小差距:基于LLM的多语言核心指代解析的两阶段适应

Antoine Bourgois, Olga Seminck, Thierry Poibeau

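Affiliations * Lattice (CNRS UMR 8094 & ENS-PSL & Université Sorbonne Nouvelle)

AI summary This paper presents an LLM-based multilingual coreference resolution system trained with a two-stage adaptation strategy. In the CRAC 2026 shared task it reached an average CoNLL F1 of 74.32, ranking first in the LLM track and third overall.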

详情
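AI abstract (translated from Chinese)

We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. On the official test set, our system reached an average CoNLL F1 of 74.32, ranking first in the LLM track and third overall. It is built on the Gemma-3-27b model, fine-tuned in two stages: a multilingual base adapter, followed by dataset-specific adapters. Mention spans are represented by their headword in an XML-inspired format with local reindexing, and documents are annotated iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.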

英文摘要

We present our submission to the LLM track of the 2026 Computational Models of Reference, Anaphora and Coreference (CRAC 2026) shared task. With an average CoNLL F1 score of 74.32 on the official test set, our system ranked first in the LLM track, and third overall. Our system is based on the Gemma-3-27b model, fine-tuned using a two-stage strategy with a multilingual base adapter followed by dataset-specific adapters. We represent mention spans by their headword using an XML-inspired format with local reindexing and annotate documents iteratively. These design choices proved effective across languages, document lengths, and annotation guidelines.

2605.16545 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces

Symphony for Speech-to-Text: 支持实时医疗语音接口

Arne Nix, Robert James, Lasse Borgholt, Anna B. Ekner, Lana Krumm, Julius Severin, Dan Engel, Lars Maaløe, Jakob Havtorn

发表机构 * Corti

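AI summary This paper introduces Symphony for Speech-to-Text, a medical-grade real-time speech recognition system that splits transcription into specialized components for recognition, formatting, and contextual correction. The design optimizes medical term recall while producing clinically structured text in real time; the system substantially outperforms existing systems in clinical settings and matches or exceeds them in general-domain settings.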

Comments Updated with a correction and improvement to Symphony's performance in spoken punctuation evaluation (R_punct, P_punct)

详情
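AI abstract (translated from Chinese)

After decades of use in dictation and, more recently, ambient documentation, speech is becoming a primary way of interacting with technology and AI in healthcare. Medical speech recognition nevertheless remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are usually optimized either for general-purpose transcription or for narrow dictation workflows, which limits their reliability in safety-critical settings and their usefulness in broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes transcription into specialized components for recognition, formatting, and contextual correction, optimizing medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmarks and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.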

英文摘要

After decades of use in dictation and, more recently, ambient documentation, speech is emerging as a primary modality for interacting with technology and AI in healthcare. Yet medical speech recognition remains difficult: systems must capture specialized terminology, resolve contextual ambiguity, and render measurements, abbreviations, and clinical shorthand precisely. Existing solutions are typically optimized either for general-purpose transcription or narrow dictation workflows, limiting their reliability in safety-critical settings and their usefulness for broader clinical workflows. We introduce Symphony for Speech-to-Text, a medical-grade speech recognition system for real-time streaming and batch file-based clinical use. Symphony decomposes the transcription process into specialized components for recognition, formatting, and contextual correction to optimize medical term recall while producing clinically structured text in real time and adapting across use cases. Evaluations on public benchmark and medical speech datasets show that Symphony substantially outperforms state-of-the-art systems in clinical settings while matching or exceeding them in general-domain settings, suggesting robust generalization rather than overfitting. We release a clinical benchmark dataset to support reliable validation and further progress in medical speech recognition. Symphony is available through a production-grade API for live dictation, conversational transcription, and batch audio file processing.

2605.15040 2026-05-22 cs.AI cs.CL 版本更新

Orchard: An Open-Source Agentic Modeling Framework

Orchard:一个开源的智能体建模框架

Baolin Peng, Wenlin Yao, Qianhui Wu, Hao Cheng, Xiao Yu, Rui Yang, Tao Ge, Alessandro Sordoni, Xingdi Yuan, Yelong Shen, Pengcheng He, Tong Zhang, Zhou Yu, Jianfeng Gao

发表机构 * Microsoft Research(微软研究院) Columbia University(哥伦比亚大学) UIUC(伊利诺伊大学香槟分校)

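AI summary This paper presents Orchard, an open-source agentic modeling framework that combines a lightweight environment service with three agentic modeling recipes to enable reusable agent data, training, and evaluation across domains.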

详情
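AI abstract (translated from Chinese)

Agentic modeling aims to turn LLMs into autonomous agents that solve complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps: many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service that provides reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents: we distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from the productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE reaches 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent with only 0.4K distilled trajectories and 2.2K open-ended tasks, reaching success rates of 74.1%, 67.0%, and 64.0% on WebVoyager, Online-Mind2Web, and DeepShop, respectively; this makes it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents: trained on only 0.2K synthetic tasks, it reaches 59.6% pass@3 on Claw-Eval, and 73.9% when paired with the stronger ZeroClaw harness. Together, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.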

英文摘要

Agentic modeling aims to transform LLMs into autonomous agents capable of solving complex tasks through planning, reasoning, tool use, and multi-turn interaction with environments. Despite major investment, open research remains constrained by infrastructure and training gaps. Many high-performing systems rely on proprietary codebases, models, or services, while most open-source frameworks focus on orchestration and evaluation rather than scalable agent training. We present Orchard, an open-source framework for scalable agentic modeling. At its core is Orchard Env, a lightweight environment service providing reusable primitives for sandbox lifecycle management across task domains, agent harnesses, and pipeline stages. On top of Orchard Env, we build three agentic modeling recipes. Orchard-SWE targets coding agents. We distill 107K trajectories from MiniMax-M2.5 and Qwen3.5-397B, introduce credit-assignment SFT to learn from productive segments of unresolved trajectories, and apply Balanced Adaptive Rollout for RL. Starting from Qwen3-30B-A3B-Thinking, Orchard-SWE achieves 64.3% on SWE-bench Verified after SFT and 67.5% after SFT+RL, setting a new state of the art among open-source models of comparable size. Orchard-GUI trains a 4B vision-language computer-use agent using only 0.4K distilled trajectories and 2.2K open-ended tasks. It achieves 74.1%, 67.0%, and 64.0% success rates on WebVoyager, Online-Mind2Web, and DeepShop, respectively, making it the strongest open-source model while remaining competitive with proprietary systems. Orchard-Claw targets personal assistant agents. Trained with only 0.2K synthetic tasks, it achieves 59.6% pass@3 on Claw-Eval and 73.9% when paired with a stronger ZeroClaw harness. Collectively, these results show that a lightweight, open, harness-agnostic environment layer enables reusable agentic data, training recipes, and evaluations across domains.

2605.14257 2026-05-22 cs.CL 版本更新

Sakura at BEA 2026 Shared Task 1: What Makes Vocabulary Difficult?

BEA 2026 共同任务 1:什么使词汇困难?

Adam Nohejl, Xuanxin Wu, Yusuke Ide, Maria Angelica Riera Machin, Yi-Ning Chang, Hitomi Yanaka

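Affiliations * RIKEN, The University of Osaka, Nara Institute of Science and Technology, National Tsing Hua University, The University of Tokyo, Tohoku University

AI summary This paper describes two models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. The black-box model fine-tunes an LLM with a soft-target loss and reaches r > 0.91 on the rating task; the explainable model keeps a strong correlation (r > 0.77) while showing what drives the difficulty of each item. Further analysis shows that item difficulty in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or test-item construction, in addition to the genuine production difficulty of the words.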

Comments To be published in Proceedings of the 21st Workshop on Innovative Use of NLP for Building Educational Applications (BEA 2026)

详情
AI中文摘要

我们描述了两种用于词汇难度预测的模型类型:一种高精度的黑盒模型,其在公开赛道中取得了最佳共同任务结果;另一种可解释的模型,其优于微调编码器基线。作为黑盒模型,我们使用软目标损失函数微调了一个LLM,以在评分任务中实现有效的应用,达到了r > 0.91的精度。可解释模型在保持强相关性(r > 0.77)的同时,提供了关于影响每个项目难度因素的见解。我们进一步分析了结果,证明英国理事会知识型词汇列表(KVL)中词汇的难度往往受到拼写难度或测试项目构造的影响,而不仅仅是词汇本身的生产难度。我们的代码已在线上提供,网址为https://github.com/ynklab/vocabulary-difficulty.

英文摘要

We describe two types of models for vocabulary difficulty prediction: a high-accuracy black-box model, which achieved the top shared task result in the open track, and an explainable model, which outperforms a fine-tuned encoder baseline. As the black-box model, we fine-tuned an LLM using a soft-target loss function for effective application to the rating task, achieving r > 0.91. The explainable model provides insights into what impacts the difficulty of each item while maintaining a strong correlation (r > 0.77). We further analyze the results, demonstrating that the difficulty of items in the British Council's Knowledge-based Vocabulary Lists (KVL) is often affected by spelling difficulty or the construction of the test items, in addition to the genuine production difficulty of the words. We make our code available online at https://github.com/ynklab/vocabulary-difficulty .

2605.13989 2026-05-22 cs.CL 版本更新

VectraYX-Nano: A 42M-Parameter Spanish Cybersecurity Language Model with Curriculum Learning and Native Tool Use

VectraYX-Nano: 一个具有课程学习和原生工具使用的4200万参数西班牙语网络安全语言模型

Juan S. Santillana

发表机构 * Globant

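AI summary This paper presents VectraYX-Nano, a 42M-parameter Spanish-language cybersecurity language model trained from scratch with curriculum learning and native tool invocation via the Model Context Protocol (MCP).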

Comments 24 pages, 5 figures, 12 tables. v3: post-Chinchilla compute ablation (v8-v15), Globant affiliation finalized, EMNLP Findings 2026 submission. Released model: VectraYX-Nano v7 (42M params, GGUF Q4 ~20 MB, native MCP)

详情
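AI abstract (translated from Chinese)

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). It makes four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline for about $25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum: training with replay across the three phases gives a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74), the v2 bootstrap-ablation reference reaches a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both with N=4. First, a controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). Second, the B4 (tool-selection) floor of 0.000 is a corpus-density artifact rather than a capacity gate: rebalancing the SFT mixture to a tool-use ratio of 1:21 yields VectraYX-Nano v7, the released headline configuration, which reaches B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier model reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, achieves sub-second TTFT on commodity hardware under llama.cpp, and is, to the authors' knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.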

英文摘要

We present VectraYX-Nano, a 41.95M-parameter decoder-only language model trained from scratch in Spanish for cybersecurity, with a Latin-American regional focus and native tool invocation via the Model Context Protocol (MCP). The model has four contributions. (i) Corpus: VectraYX-Sec-ES, a 170M-token Spanish corpus assembled by an eight-VM distributed pipeline at ~$25 USD of cloud compute and split into three curriculum phases (conversational 42M, cybersecurity 118M, offensive tooling 10M). (ii) Architecture: a 42M Transformer decoder with GQA, QK-Norm, RMSNorm, SwiGLU, RoPE and z-loss, paired with a domain-balanced 16,384-token byte-fallback BPE. (iii) Curriculum with replay across the three phases yields a monotonic loss descent (9.80 -> 3.17 -> 3.00 -> 2.16); after SFT (loss 1.74) the v2 bootstrap-ablation reference attains a conversational gate of 0.775 +/- 0.043 on B5 over N=4 seeds, and a controlled Phase-2 replay sweep over {0,5,10,25,50}% saturates B5 at >=25% replay. (iv) Two empirical findings, both N=4. A controlled bootstrap-corpus ablation across v2 (OpenSubs), v4 (mC4-ES), and v6 (60/25/15 OpenSubs/mC4/Wiki) exposes a loss-versus-register inversion: lower-perplexity bootstraps yield measurably worse conversational behavior (v2 > v4 > v6 on B5 at every paired seed). The B4 (tool-selection) floor of 0.000 is a corpus-density artifact, not a capacity gate: rebalancing the SFT mixture to tool-use ratio 1:21 yields VectraYX-Nano v7, the released headline configuration, reaching B4 = 0.230 +/- 0.052 at 42M while retaining B1 = 0.332 +/- 0.005 and B5 = 0.725 +/- 0.130; a LoRA replication on a 260M from-scratch mid-tier reaches 0.445 +/- 0.201. The released GGUF is 96 MB in F16, runs sub-second TTFT on commodity hardware under llama.cpp, and is, to our knowledge, the first published Spanish-native cybersecurity LLM with end-to-end MCP integration.

2605.12456 2026-05-22 cs.CR cs.CL cs.LG 版本更新

TextSeal: A Localized LLM Watermark for Provenance & Distillation Protection

TextSeal: 一种用于溯源与蒸馏保护的本地化大语言模型水印

Tom Sander, Hongyan Chang, Tomáš Souček, Tuan Tran, Valeriu Lacatusu, Sylvestre-Alvise Rebuffi, Alexandre Mourachko, Surya Parimi, Christophe Ropers, Rashel Moritz, Vanessa Stark, Hady Elsahar, Pierre Fernandez

发表机构 * FAIR, Meta Superintelligence Labs(FAIR,Meta超智能实验室)

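AI summary This paper introduces TextSeal, a watermark for large language models. Building on Gumbel-max sampling, it adds dual-key generation to restore output diversity, plus entropy-weighted scoring and multi-region localization to improve detection. It supports serving optimizations such as speculative decoding and multi-token prediction without adding inference overhead. TextSeal strictly dominates baselines such as SynthID-text in detection strength and is robust to dilution, keeping confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free; reasoning benchmarks confirm that downstream performance is preserved, and a multilingual human evaluation (6000 A/B comparisons, 5 languages) finds no perceptible quality difference. TextSeal is also "radioactive": its watermark signal transfers through model distillation, which enables detection of unauthorized use.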

详情
AI中文摘要

我们介绍TextSeal,一种最先进的大语言模型水印。基于Gumbel-max采样,TextSeal引入双密钥生成以恢复输出多样性,同时结合熵加权评分和多区域定位以提升检测性能。它支持推测解码和多令牌预测等服务优化,并不增加任何推理开销。TextSeal在检测强度上严格优于基线方法如SynthID-text,并对稀释具有鲁棒性,即使在混合的人类/AI文档中也能保持自信的本地化检测。该方案在理论上是无失真的,经推理基准评估确认其保持下游性能;同时通过多语言人工评估(6000次A/B对比,5种语言)显示无明显质量差异。除了用于溯源检测外,TextSeal还具有'放射性'特性:其水印信号通过模型蒸馏传递,可检测未经授权的使用。

英文摘要

We introduce TextSeal, a state-of-the-art watermark for large language models. Building on Gumbel-max sampling, TextSeal introduces dual-key generation to restore output diversity, along with entropy-weighted scoring and multi-region localization for improved detection. It supports serving optimizations such as speculative decoding and multi-token prediction, and does not add any inference overhead. TextSeal strictly dominates baselines like SynthID-text in detection strength and is robust to dilution, maintaining confident localized detection even in heavily mixed human/AI documents. The scheme is theoretically distortion-free, and evaluation across reasoning benchmarks confirms that it preserves downstream performance, while a multilingual human evaluation (6000 A/B comparisons, 5 languages) shows no perceptible quality difference. Beyond its use for provenance detection, TextSeal is also "radioactive": its watermark signal transfers through model distillation, enabling detection of unauthorized use.
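The Gumbel-max sampling that the abstract builds on can be illustrated with a generic keyed sampler. This is only a sketch of the standard Gumbel-max trick with key-derived noise, not TextSeal's method: dual-key generation, entropy-weighted scoring, and localization are not shown, and the names `keyed_uniform` and `gumbel_max_sample`, the SHA-256 hashing scheme, and the string context are all assumptions made for illustration.

```python
import hashlib
import math


def keyed_uniform(key, context, token_id):
    """Pseudo-random value in (0, 1), fixed by the key, context, and token id."""
    digest = hashlib.sha256(f"{key}|{context}|{token_id}".encode()).digest()
    value = int.from_bytes(digest[:4], "big")  # 32-bit integer
    return (value + 1) / (2**32 + 1)  # strictly between 0 and 1


def gumbel_max_sample(probs, key, context):
    """Return argmax over tokens of log p + Gumbel noise derived from the key."""
    best_token, best_score = None, -math.inf
    for token_id, p in enumerate(probs):
        if p <= 0.0:
            continue  # zero-probability tokens can never be selected
        u = keyed_uniform(key, context, token_id)
        gumbel = -math.log(-math.log(u))
        score = math.log(p) + gumbel
        if score > best_score:
            best_token, best_score = token_id, score
    return best_token


# Hypothetical next-token distribution over a three-token vocabulary.
token = gumbel_max_sample([0.1, 0.6, 0.3], key="secret", context="previous tokens")
```

With truly uniform noise, the argmax is distributed exactly according to `probs`, which is why schemes of this family can be distortion-free; a detector holding the key can recompute the noise and test whether the chosen tokens correlate with it.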

2605.11739 2026-05-22 cs.CL 版本更新

Learning to Foresee: Unveiling the Unlocking Efficiency of On-Policy Distillation

学习预见:揭示在线蒸馏的解锁效率

Yuchen Cai, Ding Cao, Liang Lin, Chunxi Luo, Xin Xu, Kai Yang, Weijie Liu, Saiyong Yang, Tianxiang Zhao, Guangzhong Sun, Guiquan Liu, Junfeng Fang

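Affiliations * USTC (University of Science and Technology of China), Tencent, NUS (National University of Singapore), HKUST(GZ) (Hong Kong University of Science and Technology, Guangzhou), UCAS-IIE, SHU (Shanghai University)

AI summary This paper studies where the efficiency of on-policy distillation (OPD) comes from and proposes EffOPD, which accelerates OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD achieves an average 3x training speedup while maintaining comparable final performance.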

详情
AI中文摘要

在线蒸馏(OPD)已成为大型语言模型的一种高效的后训练范式。然而,现有研究大多将其优势归因于更密集和稳定的监督,而OPD效率背后的参数级机制仍不清晰。本文认为OPD的效率源于一种“预见”机制:它在训练早期就建立了指向最终模型的稳定更新轨迹。这种预见体现在两个方面。首先,在模块分配层面,OPD识别出边际效用低的区域,并将更新集中在对推理更关键的模块上。其次,在更新方向层面,OPD表现出更强的低秩集中,其主导子空间在训练早期就与最终更新子空间紧密对齐。基于这些发现,我们提出了EffOPD,一种即插即用的加速方法,通过自适应选择extrapolation步长并沿当前更新方向移动来加速OPD。EffOPD不需要额外可训练模块或复杂的超参数调优,实现了平均3倍的训练加速,同时保持可比的最终性能。整体而言,我们的发现为理解OPD的效率提供了参数动态视角,并为设计更高效的大型语言模型后训练方法提供了实用见解。

英文摘要

On-policy distillation (OPD) has emerged as an efficient post-training paradigm for large language models. However, existing studies largely attribute this advantage to denser and more stable supervision, while the parameter-level mechanisms underlying OPD's efficiency remain poorly understood. In this work, we argue that OPD's efficiency stems from a form of "foresight": it establishes a stable update trajectory toward the final model early in training. This foresight manifests in two aspects. First, at the Module-Allocation Level, OPD identifies regions with low marginal utility and concentrates updates on modules that are more critical to reasoning. Second, at the Update-Direction Level, OPD exhibits stronger low-rank concentration, with its dominant subspaces aligning closely with the final update subspace early in training. Building on these findings, we propose EffOPD, a plug-and-play acceleration method that speeds up OPD by adaptively selecting an extrapolation step size and moving along the current update direction. EffOPD requires no additional trainable modules or complex hyperparameter tuning, and achieves an average training acceleration of 3x while maintaining comparable final performance. Overall, our findings provide a parameter-dynamics perspective for understanding the efficiency of OPD and offer practical insights for designing more efficient post-training methods for large language models.

2605.07711 2026-05-22 cs.CL 版本更新

SimCT: Recovering Lost Supervision for Cross-Tokenizer On-Policy Distillation

SimCT: 通过跨分词器策略进行监督恢复

Jie Sun, Mao Zheng, Mingyang Song, Qiyong Zhong, Yilin Cheng, Bichuan Feng, Pengfei Liu, Junfeng Fang, Xiang Wang

发表机构 * University of Science and Technology of China(中国科学技术大学) Large Language Model Department, Tencent(腾讯大语言模型部门) Shanghai Innovation Institute(上海创新研究院) Zhongguancun Academy(中关村学院)

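AI summary This paper proposes SimCT, a cross-tokenizer on-policy distillation method that enlarges the supervision space to recover teacher signal lost to tokenization mismatch, improving results on mathematical reasoning and code-generation benchmarks.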

Comments 4 figures, 6 tables, 28 pages

详情
AI中文摘要

在线策略蒸馏(OPD)是一种标准工具,用于将教师行为转移到较小的学生模型中,但其隐含假设教师和学生预测在逐个token上是可比的,这一假设在两个模型对同一文本进行不同分词时会失效。在异构分词器情况下,精确共享token匹配会静默丢弃大量教师信号,特别是在词汇不一致的位置。我们提出了简单的跨分词器OPD(SimCT),通过扩展监督空间:在共享token之外,SimCT比较教师和学生在短多token延续上的表现,这些延续两者都能实现,从而保持OPD损失形式不变。我们证明这些单位是 finest 共同可分词的监督接口,并且更粗糙的替代方法会移除对在线学习有用的教师-学生区分。在三个异构教师-学生对上,SimCT在数学推理和代码生成基准上表现优于共享词汇OPD和代表性跨分词器基线,消融实验确认改进来自恢复精确共享token匹配所丢弃的监督。代码可在https://github.com/sunjie279/SimCT-获取。

英文摘要

On-policy distillation (OPD) is a standard tool for transferring teacher behavior to a smaller student, but it implicitly assumes that teacher and student predictions are comparable token by token, an assumption that fails whenever the two models tokenize the same text differently. Under heterogeneous tokenizers, exact shared-token matching silently discards a large fraction of the teacher signal at precisely the positions where vocabularies disagree. We propose Simple Cross-Tokenizer OPD (SimCT), which restores this signal by enlarging the supervision space: alongside shared tokens, SimCT compares teacher and student over short multi-token continuations that both tokenizers can realize, leaving the OPD loss form itself unchanged. We show that these units are the finest jointly tokenizable supervision interface, and that coarser alternatives remove teacher-student distinctions that are useful for on-policy learning. Across three heterogeneous teacher-student pairs on mathematical reasoning and code-generation benchmarks, SimCT shows consistent gains over shared-vocabulary OPD and representative cross-tokenizer baselines, with ablations confirming that the improvements come from recovering supervision discarded by exact shared-token matching. Code is available at https://github.com/sunjie279/SimCT-

2605.07243 2026-05-22 cs.CL 版本更新

SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

SpecBlock:带有动态树草案的块迭代推测解码

Weijie Shi, Qiang Xu, Fan Deng, Yaguang Wu, Jiarun Liu, Yehong Xu, Hao Chen, Jia Zhu, Jiajie Xu, Xiangjun Huang, Jian Yang, Xiaofang Zhou

发表机构 * Hong Kong University of Science and Technology(香港科技大学) MetaX Zhejiang Normal University(浙江师范大学) Soochow University(苏州大学)

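AI summary This work proposes SpecBlock, a block-iterative drafter for speculative decoding that combines path dependence with cheap drafting, using dynamic tree drafting and cost-aware adaptation from verifier feedback at deployment. It improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends the lead to 11-19%.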

详情
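AI abstract (translated from Chinese)

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in a single forward pass of the target model. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, so drafting takes a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting several future positions in one forward pass, but each position is predicted without seeing the others, which produces paths the verifier rejects. We propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward pass produces K dependent positions, called a block, and the draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within a block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block and inherits its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier position is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and that cost-aware adaptation extends this lead to 11-19%.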

英文摘要

Speculative decoding accelerates LLM inference by drafting a tree of candidate continuations and verifying it in one target forward. Existing drafters fall into two camps with opposite weaknesses. Autoregressive drafters such as EAGLE-3 preserve dependence along each draft path but call the drafter once per tree depth, making drafting a non-trivial share of per-iteration latency. Parallel drafters cut drafter calls by predicting multiple future positions in one forward, but each position is predicted without seeing the others, producing paths the verifier rejects. In this paper, we propose SpecBlock, a block-iterative drafter that combines path dependence with cheap drafting. Each drafter forward produces K dependent positions and we call this a block. The draft tree grows through repeated block expansions. Two mechanisms explicitly carry path dependence to keep later draft positions accurate. Within each block, a layer-wise shift carries the previous position's hidden state into every decoder layer. Across blocks, each new block can start from any position of the previous block, inheriting its hidden state to extend the path. To spend verifier budget where acceptance is likely, a co-trained rank head replaces the fixed top-k tree by allocating per-position branching during drafting. To avoid training the drafter on prefixes it never produces at inference, a valid-prefix mask drops the loss at later positions once an earlier one is wrong. Beyond static drafting, a cost-aware bandit at deployment uses free verifier feedback to update the drafter selectively, only when the expected throughput gain exceeds the update cost. Experiments show that SpecBlock improves mean speedup by 8-13% over EAGLE-3 at 44-52% of its drafting cost, and cost-aware adaptation extends this lead to 11-19%.
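The valid-prefix mask described above can be sketched in a few lines. This is a minimal Python sketch under one reading of the abstract: the loss at a position is kept only if every earlier position in the block was predicted correctly, so the first wrong position still contributes and everything after it is dropped. The function names and the toy token ids are hypothetical, and the paper's exact masking rule may differ.

```python
def valid_prefix_mask(predicted, target):
    """1.0 where all earlier positions were correct, 0.0 after the first error."""
    mask = []
    prefix_ok = True
    for pred_token, true_token in zip(predicted, target):
        mask.append(1.0 if prefix_ok else 0.0)
        if pred_token != true_token:
            prefix_ok = False  # every later position now has an invalid prefix
    return mask


def masked_mean_loss(losses, mask):
    """Average the per-position losses over the positions the mask keeps."""
    kept = sum(mask)
    if kept == 0:
        return 0.0
    return sum(loss * m for loss, m in zip(losses, mask)) / kept


# Hypothetical block of K=4 drafted tokens; the second prediction is wrong.
mask = valid_prefix_mask(predicted=[5, 8, 3, 1], target=[5, 9, 3, 1])
loss = masked_mean_loss([0.2, 1.0, 0.7, 0.4], mask)
```

Positions 3 and 4 are excluded even though their predictions match the target, because at inference the drafter would never reach them along a path the verifier accepts.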

2605.06597 2026-05-22 cs.CL cs.AI cs.LG 版本更新

UniSD: Towards a Unified Self-Distillation Framework for Large Language Models

UniSD:面向大语言模型的统一自蒸馏框架

Yiqiao Jin, Yiyang Wang, Lucheng Fu, Yijia Xiao, Yinyi Luo, Haoxin Liu, B. Aditya Prakash, Josiah Hester, Jindong Wang, Srijan Kumar

发表机构 * Georgia Institute of Technology(佐治亚理工学院) University of California, Los Angeles(加州大学洛杉矶分校) Carnegie Mellon University(卡内基梅隆大学) William & Mary(威廉与玛丽大学)

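AI summary This paper proposes UniSD, a unified framework for systematically studying self-distillation. It integrates mechanisms that address supervision reliability, representation alignment, and training stability, evaluates them across multiple benchmarks and models, and builds the UniSDfull pipeline, which achieves the strongest overall performance.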

Comments Website: https://unifiedsd.github.io/ Code: https://github.com/Ahren09/UniSD

详情
AI中文摘要

自蒸馏(SD)为在不依赖更强外部教师的情况下适应大语言模型(LLMs)提供了一条有前途的路径。然而,在自回归LLMs中,SD仍然具有挑战性,因为自生成轨迹是自由形式的,正确性依赖于任务,且合理的推理仍可能提供不稳定或不可靠的监督。现有方法主要考察孤立的设计选择,留下其有效性、作用和交互关系不清晰。在本文中,我们提出UniSD,一个统一的框架,系统地研究自蒸馏。UniSD整合了互补的机制,解决监督可靠性、表征对齐和训练稳定性问题,包括多教师一致、EMA教师稳定、token级对比学习、特征匹配和发散剪裁。在六个基准和六个模型(来自三个模型家族)上,UniSD揭示了自蒸馏何时优于静态模仿,哪些组件驱动了收益,以及这些组件在不同任务间的交互方式。基于这些见解,我们构建了UniSDfull,一个整合互补组件的流水线,实现了最强的整体性能,比基模型提高了+5.4点,比最强基线提高了+2.8点。广泛评估凸显了自蒸馏作为一种实用且可控的高效LLM适应方法,无需更强的外部教师。

英文摘要

Self-distillation (SD) offers a promising path for adapting large language models (LLMs) without relying on stronger external teachers. However, SD in autoregressive LLMs remains challenging because self-generated trajectories are free-form, correctness is task-dependent, and plausible rationales can still provide unstable or unreliable supervision. Existing methods mainly examine isolated design choices, leaving their effectiveness, roles, and interactions unclear. In this paper, we propose UniSD, a unified framework to systematically study self-distillation. UniSD integrates complementary mechanisms that address supervision reliability, representation alignment, and training stability, including multi-teacher agreement, EMA teacher stabilization, token-level contrastive learning, feature matching, and divergence clipping. Across six benchmarks and six models from three model families, UniSD reveals when self-distillation improves over static imitation, which components drive the gains, and how these components interact across tasks. Guided by these insights, we construct UniSDfull, an integrated pipeline that combines complementary components and achieves the strongest overall performance, improving over the base model by +5.4 points and the strongest baseline by +2.8 points. Extensive evaluation highlights self-distillation as a practical and steerable approach for efficient LLM adaptation without stronger external teachers.
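Of the components listed above, EMA teacher stabilization is the simplest to illustrate. The sketch below shows a standard exponential-moving-average update in plain Python, assuming parameters are stored as a name-to-float dictionary; the function name, the decay value, and the toy parameters are hypothetical, and UniSD's actual schedule and parameter handling are not specified in the abstract.

```python
def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter a small step toward the student's value."""
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Hypothetical single-parameter model: the teacher trails the student slowly.
teacher_params = {"w": 1.0}
student_params = {"w": 0.0}
for _ in range(3):
    teacher_params = ema_update(teacher_params, student_params, decay=0.9)
```

A decay close to 1 makes the teacher change slowly, which is the usual reason an EMA copy gives steadier supervision targets than the raw student.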

2605.04217 2026-05-22 cs.LG cs.CL 版本更新

Jordan-RoPE: Non-Semisimple Relative Positional Encoding via Complex Jordan Blocks

Jordan-RoPE: 通过复Jordan块实现非半单相对位置编码

Yaobo Zhang

发表机构 * School of Physics, Ningxia University(宁夏大学物理学院)

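AI summary This paper proposes Jordan-RoPE, a non-semisimple relative positional encoding in which a complex rotary eigenvalue and a nilpotent response share one defective Jordan block. This yields a distance-modulated phase basis with oscillatory-polynomial features such as e^{-γd}cos(ωd), e^{-γd}sin(ωd), d e^{-γd}cos(ωd), and d e^{-γd}sin(ωd), which the paper examines on synthetic tasks and a small language model.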

Comments 15 pages, 4 figures, 6 tables; code available at https://github.com/ybzhang-nxu/jordan_rope

详情
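AI abstract (translated from Chinese)

Relative positional encodings determine which functions of the query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as e^{-γd}cos(ωd), e^{-γd}sin(ωd), d e^{-γd}cos(ωd), and d e^{-γd}sin(ωd), for causal lag d=i-j≥0. The construction therefore realizes a distance-modulated phase basis d e^{iωd}, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants, whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.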

英文摘要

Relative positional encodings determine which functions of query-key lag can enter the primitive attention logit. RoPE supplies a rotary phase, while ALiBi supplies an additive distance bias. Motivated by group-theoretic views of linear translation-invariant positional encodings, we study a non-semisimple case in which a complex rotary eigenvalue and a nilpotent response live in the same defective Jordan block. The resulting relative operator generates oscillatory-polynomial features such as $e^{-γd}\cos(ωd)$, $e^{-γd}\sin(ωd)$, $d e^{-γd}\cos(ωd)$, and $d e^{-γd}\sin(ωd)$, for causal lag $d=i-j\geq 0$. Thus the construction realizes a distance-modulated phase basis $d e^{iωd}$, rather than merely adding a separate distance channel to RoPE. We formulate Exact Jordan-RoPE as a non-semisimple one-parameter representation, give its real block form, and specify the contragredient query action required by non-orthogonal positional maps. We also distinguish this exact representation from stabilized variants whose bounded shear improves numerical behavior but breaks the exact group law. Kernel-level diagnostics and a Jordan-friendly synthetic language-model task show that the coupled Jordan basis is useful when the target contains distance-modulated phase interactions. On a small WikiText-103 byte language model, a scaled-exact variant improves over RoPE and direct-sum baselines within the Jordan family, while RoPE+ALiBi remains strongest overall. The evidence is structural rather than a broad performance claim.
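The way a defective Jordan block produces the features listed above can be checked numerically. The sketch below assumes a 2x2 complex generator J = [[λ, 1], [0, λ]] with λ = -γ + iω, so that exp(J) = e^λ [[1, 1], [0, 1]] and its d-th power is e^{λd} [[1, d], [0, 1]]; this parameterization, the function names, and the sample values are assumptions for illustration, and the paper's real block form and scaling may differ.

```python
import cmath
import math


def matmul2(a, b):
    """Multiply two 2x2 matrices given as nested lists."""
    return [
        [a[i][0] * b[0][j] + a[i][1] * b[1][j] for j in range(2)]
        for i in range(2)
    ]


def jordan_operator(gamma, omega, d):
    """Apply the one-step operator exp(J) d times, for J = [[lam, 1], [0, lam]]."""
    lam = complex(-gamma, omega)
    s = cmath.exp(lam)
    step = [[s, s], [0j, s]]  # exp(J) = e^lam * [[1, 1], [0, 1]]
    result = [[1 + 0j, 0j], [0j, 1 + 0j]]
    for _ in range(d):
        result = matmul2(result, step)
    return result


def closed_form_features(gamma, omega, d):
    """The four oscillatory-polynomial features at lag d."""
    decay = math.exp(-gamma * d)
    return {
        "cos": decay * math.cos(omega * d),
        "sin": decay * math.sin(omega * d),
        "d_cos": d * decay * math.cos(omega * d),
        "d_sin": d * decay * math.sin(omega * d),
    }


op = jordan_operator(gamma=0.1, omega=0.7, d=5)
features = closed_form_features(gamma=0.1, omega=0.7, d=5)
```

The diagonal entry carries the plain damped rotation, while the off-diagonal entry carries the same rotation multiplied by the lag d, which is the coupling a direct sum of separate rotary and distance channels would not produce.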

2604.15774 2026-05-22 cs.CL 版本更新

MemEvoBench: Benchmarking Safety Risks from Memory Misevolution in LLM Agents

MemEvoBench: 评估LLM代理中内存误进化带来的安全风险

Weiwei Xie, Shaoxiong Guo, Fan Zhang, Tian Xia, Xue Yang, Lizhuang Ma, Junchi Yan, Qibing Ren

发表机构 * Shanghai Jiao Tong University(上海交通大学) East China Normal University(华东师范大学) Shandong University(山东大学) Duke University(杜克大学)

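AI summary This paper introduces MemEvoBench, the first benchmark for long-horizon memory safety in LLM agents, covering adversarial memory injection, noisy tool outputs, and biased feedback. It combines QA-style tasks across 7 domains and 36 risk types with workflow-style tasks adapted from 20 Agent-SafetyBench environments. Experiments show substantial safety degradation under biased memory updates and find static prompt-based defenses insufficient, underscoring the need to secure memory evolution in LLM agents.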

详情
AI中文摘要

为大型语言模型(LLMs)配备持久化内存可以增强交互连续性和个性化,但引入了新的安全风险。具体而言,受污染或偏见的内存积累可能触发异常代理行为。现有的评估方法尚未建立衡量内存误进化的标准化框架。这种现象是指由于反复接触误导信息而导致的行为漂移。为解决这一缺口,我们引入MemEvoBench,首个评估LLM代理长期内存安全性的基准,针对对抗性内存注入、噪声工具输出和偏见反馈。该框架包含7个领域36种风险类型的问答式任务,以及改编自20个Agent-SafetyBench环境的工作流任务,采用混合良性与误导性内存池在多轮交互中模拟内存进化。在代表性模型上的实验揭示了在偏见内存更新下显著的安全退化。我们的分析表明,内存进化是这些失败的重要原因。此外,静态提示基于防御证明不足,强调了在LLM代理中保障内存进化的安全性的紧迫性。

英文摘要

Equipping Large Language Models (LLMs) with persistent memory enhances interaction continuity and personalization but introduces new safety risks. Specifically, contaminated or biased memory accumulation can trigger abnormal agent behaviors. Existing evaluation methods have not yet established a standardized framework for measuring memory misevolution. This phenomenon refers to the gradual behavioral drift resulting from repeated exposure to misleading information. To address this gap, we introduce MemEvoBench, the first benchmark evaluating long-horizon memory safety in LLM agents against adversarial memory injection, noisy tool outputs, and biased feedback. The framework consists of QA-style tasks across 7 domains and 36 risk types, complemented by workflow-style tasks adapted from 20 Agent-SafetyBench environments with noisy tool returns. Both settings employ mixed benign and misleading memory pools within multi-round interactions to simulate memory evolution. Experiments on representative models reveal substantial safety degradation under biased memory updates. Our analysis suggests that memory evolution is a significant contributor to these failures. Furthermore, static prompt-based defenses prove insufficient, underscoring the urgency of securing memory evolution in LLM agents.

2604.08362 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Towards Real-world Human Behavior Simulation: Benchmarking Large Language Models on Long-horizon, Cross-scenario, Heterogeneous Behavior Traces

迈向真实世界的人类行为模拟:在长时间跨度、跨场景、异质行为轨迹上对大语言模型进行基准测试

Jiawei Chen, Ruoxi Xu, Boxi Cao, Ruotong Pan, Yunfei Zhang, Yifei Hu, Yong Du, Tingting Gao, Yaojie Lu, Yingfei Sun, Xianpei Han, Le Sun, Xiangyu Wu, Hongyu Lin

发表机构 * Chinese Information Processing Laboratory, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所信息处理实验室) University of Chinese Academy of Sciences(中国科学院大学) Kuaishou Technology(快手科技)

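AI summary This paper introduces OmniBehavior, a user simulation benchmark built entirely from real-world data that integrates long-horizon, cross-scenario, and heterogeneous behavior patterns. It shows that current models struggle to simulate complex human behavior and exhibit a structural bias: convergence toward a positive average person, with hyper-activity, persona homogenization, and a utopian bias. These findings point to directions for future high-fidelity simulation research.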

Comments Project page: https://OmniBehavior.github.io

详情
AI中文摘要

大语言模型(LLMs)的出现揭示了通用用户模拟的潜力。然而,现有基准测试仍局限于孤立场景、狭窄动作空间或合成数据,无法捕捉真实人类行为的整体性。为弥合这一差距,我们引入OmniBehavior,首个完全基于真实世界数据构建的用户模拟基准测试,将长周期、跨场景和异质行为模式整合到统一框架中。基于此基准测试,我们首先提供了实证证据,表明以往孤立场景的数据集存在隧道视野问题,而真实世界决策依赖于长期的跨场景因果链。对最新LLMs的广泛评估显示,当前模型在模拟这些复杂行为时表现不佳,即使扩展上下文窗口,性能也趋于平稳。关键的是,模拟行为与真实行为的系统性比较揭示了根本性的结构偏差:LLMs倾向于趋同于正向平均人,表现出超活跃、人格同质化和乌托邦偏见。这导致了个体差异和长尾行为的丧失,突显了未来高保真模拟研究的关键方向。

英文摘要

The emergence of Large Language Models (LLMs) has illuminated the potential for a general-purpose user simulator. However, existing benchmarks remain constrained to isolated scenarios, narrow action spaces, or synthetic data, failing to capture the holistic nature of authentic human behavior. To bridge this gap, we introduce OmniBehavior, the first user simulation benchmark constructed entirely from real-world data, integrating long-horizon, cross-scenario, and heterogeneous behavioral patterns into a unified framework. Based on this benchmark, we first provide empirical evidence that previous datasets with isolated scenarios suffer from tunnel vision, whereas real-world decision-making relies on long-term, cross-scenario causal chains. Extensive evaluations of state-of-the-art LLMs reveal that current models struggle to accurately simulate these complex behaviors, with performance plateauing even as context windows expand. Crucially, a systematic comparison between simulated and authentic behaviors uncovers a fundamental structural bias: LLMs tend to converge toward a positive average person, exhibiting hyper-activity, persona homogenization, and a utopian bias. This results in the loss of individual differences and long-tail behaviors, highlighting critical directions for future high-fidelity simulation research.

2603.27355 2026-05-22 cs.AI cs.CL cs.SE 版本更新

LLM Readiness Harness: Evaluation, Observability, and CI Gates for LLM/RAG Applications

LLM Readiness Harness: 评估、可观测性和持续集成门禁用于LLM/RAG应用

Alexandre Cristovão Maiorano

发表机构 * Lumytics

AI总结 本文提出了一种面向LLM和RAG应用的就绪度评估框架(readiness harness),在最小API契约下结合自动化基准测试、OpenTelemetry可观测性和持续集成质量门禁,将评估转化为部署决策流程,并把多项指标聚合为带帕累托前沿的场景加权就绪度分数,在工单路由工作流和BEIR接地任务上进行了评估。

Comments 19 pages, 4 figures, 15 tables

详情
AI中文摘要

我们提出了一种面向LLM和RAG应用的就绪度评估框架(readiness harness),将评估转化为部署决策流程。该系统在最小API契约下结合自动化基准测试、OpenTelemetry可观测性和持续集成质量门禁,并将工作流成功率、策略合规性、接地性(groundedness)、检索命中率、成本和p95延迟聚合为带帕累托前沿的场景加权就绪度分数。我们在工单路由工作流和BEIR接地任务(SciFact和FiQA)上评估了该框架,完整覆盖Azure评测矩阵(跨数据集、场景、检索深度、随机种子和模型的162/162个有效单元)。结果表明,就绪度不是单一指标:在FiQA的sla-first场景、k=5时,gpt-4.1-mini在就绪度和忠实度上领先,而gpt-5.2付出了显著的延迟代价;在SciFact上,各模型质量更为接近,但在运行层面仍可区分。工单路由回归门禁始终拒绝不安全的提示变体,表明该框架能够阻止有风险的发布,而不只是报告离线分数。最终得到的是一个可复现、立足实际运行的框架,用于判断LLM或RAG系统是否已准备好发布。

英文摘要

We present a readiness harness for LLM and RAG applications that turns evaluation into a deployment decision workflow. The system combines automated benchmarks, OpenTelemetry observability, and CI quality gates under a minimal API contract, then aggregates workflow success, policy compliance, groundedness, retrieval hit rate, cost, and p95 latency into scenario-weighted readiness scores with Pareto frontiers. We evaluate the harness on ticket-routing workflows and BEIR grounding tasks (SciFact and FiQA) with full Azure matrix coverage (162/162 valid cells across datasets, scenarios, retrieval depths, seeds, and models). Results show that readiness is not a single metric: on FiQA under sla-first at k=5, gpt-4.1-mini leads in readiness and faithfulness, while gpt-5.2 pays a substantial latency cost; on SciFact, models are closer in quality but still separable operationally. Ticket-routing regression gates consistently reject unsafe prompt variants, demonstrating that the harness can block risky releases instead of merely reporting offline scores. The result is a reproducible, operationally grounded framework for deciding whether an LLM or RAG system is ready to ship.
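The aggregation and gating described in this abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the weights, budgets, thresholds, and the names `CellMetrics`, `readiness`, and `ci_gate` are all hypothetical; only the metric list and the scenario name `sla-first` come from the abstract.

```python
from dataclasses import dataclass


@dataclass
class CellMetrics:
    """Metrics for one evaluation cell (one dataset/scenario/model combination)."""
    workflow_success: float    # fraction in [0, 1]
    policy_compliance: float   # fraction in [0, 1]
    groundedness: float        # fraction in [0, 1]
    retrieval_hit_rate: float  # fraction in [0, 1]
    cost_usd: float            # average cost per request
    p95_latency_s: float       # 95th-percentile latency in seconds


# Hypothetical weights, invented for illustration; each scenario sums to 1.0.
SCENARIO_WEIGHTS = {
    "sla-first": {"success": 0.2, "policy": 0.1, "grounded": 0.2,
                  "hit_rate": 0.1, "cost": 0.1, "latency": 0.3},
    "quality-first": {"success": 0.3, "policy": 0.2, "grounded": 0.3,
                      "hit_rate": 0.1, "cost": 0.05, "latency": 0.05},
}


def readiness(m, scenario, cost_budget_usd=0.01, latency_budget_s=5.0):
    """Weighted sum of quality metrics and budget-normalized cost/latency scores."""
    w = SCENARIO_WEIGHTS[scenario]
    # Lower cost and latency are better, so convert them to headroom in [0, 1].
    cost_score = max(0.0, 1.0 - m.cost_usd / cost_budget_usd)
    latency_score = max(0.0, 1.0 - m.p95_latency_s / latency_budget_s)
    return (w["success"] * m.workflow_success
            + w["policy"] * m.policy_compliance
            + w["grounded"] * m.groundedness
            + w["hit_rate"] * m.retrieval_hit_rate
            + w["cost"] * cost_score
            + w["latency"] * latency_score)


def ci_gate(m, scenario, min_readiness=0.7, min_policy=0.99):
    """Return True if a release may proceed; a policy violation is a hard block."""
    if m.policy_compliance < min_policy:
        return False
    return readiness(m, scenario) >= min_readiness
```

Treating policy compliance as a hard block rather than just another weighted term is one way to make sure an unsafe prompt variant cannot be rescued by good latency; whether the paper's gates work this way is not stated in the abstract.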

2603.20405 2026-05-22 cs.LG cs.CL cs.LO 版本更新

Putnam 2025 Problems in Rocq using Opus 4.6 and Rocq-MCP

使用 Opus 4.6 和 Rocq-MCP 在 Rocq 中求解 2025 年 Putnam 问题

Guillaume Baudart, Marc Lelarge, Tristan Stérin, Jules Viennot

发表机构 * IRIF, Université Paris Cité, Inria, CNRS(IRIF,巴黎西岱大学,法国国家信息与自动化研究所,法国国家科学研究中心) DI ENS, PSL University, Inria(巴黎高等师范学院计算机系,巴黎文理研究大学,法国国家信息与自动化研究所)

AI总结 研究探讨了使用 Opus 4.6 配合 Rocq-MCP 工具自主证明 2025 年 Putnam 数学竞赛中 12 个问题中的 10 个,展示了基于模型上下文协议 (MCP) 的自动证明方法及公开可用的证明过程。

详情
AI中文摘要

我们报告了一项实验:配备了一套面向 Rocq 证明助手的 Model Context Protocol (MCP) 工具的 Claude Opus 4.6,自主证明了 2025 年 Putnam 数学竞赛 12 个问题中的 10 个。这些 MCP 工具是与 Claude 共同设计的,设计时分析了先前 miniF2F-Rocq 实验的日志,编码了一种“先编译,后交互回退”的策略。该代理在无网络访问的隔离虚拟机上运行,部署了 141 个子代理,在 17.7 小时的活跃计算时间(51.6 小时墙钟时间)内消耗了约 19 亿个 token。所有证明均公开可用。

英文摘要

We report on an experiment in which Claude Opus 4.6, equipped with a suite of Model Context Protocol (MCP) tools for the Rocq proof assistant, autonomously proved 10 of 12 problems from the 2025 Putnam Mathematical Competition. The MCP tools, designed with Claude by analyzing logs from a prior experiment on miniF2F-Rocq, encode a "compile-first, interactive-fallback" strategy. Running on an isolated VM with no internet access, the agent deployed 141 subagents over 17.7 hours of active compute (51.6h wall-clock), consuming approximately 1.9 billion tokens. All proofs are publicly available.

2603.16672 2026-05-22 cs.AI cs.CL cs.CY 版本更新

CritiSense: Critical Digital Literacy and Resilience Against Misinformation

CritiSense: 批判性数字素养与对抗虚假信息的韧性

Firoj Alam, Fatema Ahmad, Ali Ezzat Shahroor, Mohamed Bayan Kmainasi, Elisa Sartori, Giovanni Da San Martino, Abul Hasnat, Raian Ali

发表机构 * Qatar Computing Research Institute(卡塔尔计算研究所) University of Padova(帕多瓦大学) Hamad Bin Khalifa University(哈马德·本·哈利法大学)

AI总结 本研究提出CritiSense,一个支持九种语言的模块化移动媒体素养应用,通过简短的互动挑战和即时反馈提升用户识别操纵手法的能力,既是多语言的预先辟谣(prebunking)平台,也是评估微学习效果的测试平台。

Comments resilience, disinformation, misinformation, fake news, propaganda

详情
AI中文摘要

社交媒体上的虚假信息破坏了知情决策和公众信任。预先辟谣(prebunking)帮助用户在现实中遇到操纵手法之前就学会识别它们,是一种主动的补充手段。我们介绍了CritiSense,一个移动媒体素养应用,通过简短的互动挑战和即时反馈来培养这些技能。它是首个多语言(支持九种语言)且模块化的平台,便于针对不同主题和领域快速更新。我们报告了一项有93名用户参与的可用性研究:83.9%的用户表示总体满意,90.1%的用户认为该应用易于使用。定性反馈表明,CritiSense有助于提高数字素养技能。总体而言,它提供了一个多语言预先辟谣平台,以及一个用于衡量微学习对抗虚假信息韧性影响的测试平台。六个月内,活跃用户已超过500名。该应用在Apple App Store(https://apps.apple.com/us/app/critisense/id6749675792)和Google Play Store(https://play.google.com/store/apps/details?id=com.critisense&hl=en)上免费向所有用户提供。

英文摘要

Misinformation on social media undermines informed decision-making and public trust. Prebunking offers a proactive complement by helping users recognize manipulation tactics before they encounter them in the wild. We present CritiSense, a mobile media-literacy app that builds these skills through short, interactive challenges with instant feedback. It is the first multilingual (supporting nine languages) and modular platform, designed for rapid updates across topics and domains. We report a usability study with 93 users: 83.9% expressed overall satisfaction and 90.1% rated the app as easy to use. Qualitative feedback indicates that CritiSense helps improve digital literacy skills. Overall, it provides a multilingual prebunking platform and a testbed for measuring the impact of microlearning on misinformation resilience. Over 6 months, we have reached 500+ active users. It is freely available to all users on the Apple App Store (https://apps.apple.com/us/app/critisense/id6749675792) and Google Play Store (https://play.google.com/store/apps/details?id=com.critisense&hl=en).

2602.08064 2026-05-22 cs.LG cs.AI cs.CL 版本更新

SiameseNorm: Breaking the Barrier to Reconciling Pre/Post-Norm

SiameseNorm: 打破调和 Pre-Norm 与 Post-Norm 的壁垒

Tianyu Li, Dongchen Han, Zixuan Cao, Haofeng Huang, Mengyu Zhou, Ming Chen, Erchao Zhao, Xiaoxi Jiang, Guanjun Jiang, Gao Huang

发表机构 * Leap Lab, Tsinghua University(清华大学 Leap 实验室) Qwen Large Model Application Team, Alibaba(阿里巴巴 Qwen 大模型应用团队) Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息学研究院)

AI总结 本文提出SiameseNorm,一种双流架构,通过共享残差块耦合类Pre-Norm流与类Post-Norm流,在保持训练稳定性的同时提升模型性能,适用于多种架构和模态。

Comments Accepted to ICML 2026; camera-ready version; revised presentation and added additional experimental results

详情
AI中文摘要

Pre-Norm与Post-Norm之间长期存在的矛盾仍是Transformer架构中的一个开放问题,反映了训练稳定性与表示能力之间的根本权衡。先前结合两者优势的尝试取得了一定进展,但在不同训练设置下往往鲁棒性有限,限制了其更广泛的应用。我们重新审视这一困境,表明单流架构难以兼顾Pre-Norm稳定的恒等梯度传播与Post-Norm对主残差路径的归一化。为解决这一结构性矛盾,我们提出SiameseNorm,一种简单而有效的双流架构,并与Pre-Norm训练配方保持兼容。SiameseNorm通过共享残差块耦合类Pre-Norm流与类Post-Norm流,使每个残差块都能从两条路径接收优化信号,且开销可以忽略。在400M和1.3B稠密语言模型、15B MoE模型、Vision Transformer以及Diffusion Transformer上的大量实验表明,SiameseNorm在各种架构和模态上都能持续提升性能,同时保持很强的训练稳定性。代码可在https://github.com/Qwen-Applications/SiameseNorm获取。

英文摘要

The long-standing tension between Pre- and Post-Norm remains an open problem in Transformer architecture, reflecting a fundamental trade-off between training stability and representational capacity. Prior attempts to combine their strengths have made progress, but often show limited robustness across training settings, restricting their broader applicability. We revisit this dilemma, showing that single-stream architectures struggle to reconcile Pre-Norm's stable identity-gradient propagation with Post-Norm's normalization of the main residual path. To address this structural tension, we propose SiameseNorm, a simple yet effective two-stream architecture that remains compatible with Pre-Norm training recipes. SiameseNorm couples Pre-Norm-like and Post-Norm-like streams through shared residual blocks, allowing each residual block to receive optimization signals from both pathways with negligible overhead. Extensive experiments on 400M and 1.3B dense language models, 15B MoE models, Vision Transformers, and Diffusion Transformers show that SiameseNorm consistently improves performance while maintaining strong training stability across architectures and modalities. Code is available at https://github.com/Qwen-Applications/SiameseNorm.
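The two residual layouts this abstract contrasts can be written down in a few lines. The sketch below shows only the standard Pre-Norm and Post-Norm update rules on plain Python lists; it does not implement SiameseNorm's two-stream coupling, and the function names are chosen here for illustration.

```python
import math


def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and (approximately) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def pre_norm_block(x, f):
    """Pre-Norm: y = x + f(norm(x)). The identity path x is left untouched."""
    return [xi + fi for xi, fi in zip(x, f(layer_norm(x)))]


def post_norm_block(x, f):
    """Post-Norm: y = norm(x + f(x)). The main residual path is normalized."""
    return layer_norm([xi + fi for xi, fi in zip(x, f(x))])
```

With a sublayer `f` that outputs zeros, `pre_norm_block` returns its input unchanged while `post_norm_block` still rescales it, which is the structural difference the abstract describes as identity-gradient propagation versus normalization of the main residual path.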

2602.05536 2026-05-22 cs.LG cs.AI cs.CL cs.CV 版本更新

When Shared Knowledge Hurts: Spectral Over-Accumulation in Model Merging

当共享知识有害:模型融合中的谱过积累

Yayuan Li, Ze Peng, Jian Zhang, Jintao Guo, Yue Duan, Yinghuan Shi

发表机构 * National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China.(新型软件技术国家重点实验室,南京大学,南京210023,中国。) Institute of Brain-Computer Interface, Nanjing University, Nanjing 210023, China.(脑机接口研究院,南京大学,南京210023,中国。)

AI总结 本文研究了模型融合中共享知识过积累的问题,提出SVC方法通过校准奇异值来恢复谱平衡,提升了模型融合和任务算术的性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

模型融合通过将多个微调模型的权重更新相加,把它们合并为单个模型,是重新训练之外的一种轻量级替代方案。现有方法主要着眼于解决任务更新之间的冲突,而没有处理共享知识被重复计入(过积累)这一失败模式。我们发现,当各任务共享对齐的谱方向(即重叠的奇异向量)时,简单的线性组合会反复累积这些方向,使奇异值膨胀,并使融合模型偏向共享子空间。为缓解此问题,我们提出Singular Value Calibration (SVC),一种无需训练、无需数据的后处理方法,它量化子空间重叠并重新缩放膨胀的奇异值,以恢复平衡的谱。在视觉和语言基准上,SVC持续改进了强融合基线,并取得了最先进的性能。此外,仅通过修改奇异值,SVC就将Task Arithmetic的性能提高了13.0%。代码可在https://github.com/lyymuwu/SVC获取。

英文摘要

Model merging combines multiple fine-tuned models into a single model by adding their weight updates, providing a lightweight alternative to retraining. Existing methods primarily target resolving conflicts between task updates, leaving the failure mode of over-counting shared knowledge unaddressed. We show that when tasks share aligned spectral directions (i.e., overlapping singular vectors), a simple linear combination repeatedly accumulates these directions, inflating the singular values and biasing the merged model toward shared subspaces. To mitigate this issue, we propose Singular Value Calibration (SVC), a training-free and data-free post-processing method that quantifies subspace overlap and rescales inflated singular values to restore a balanced spectrum. Across vision and language benchmarks, SVC consistently improves strong merging baselines and achieves state-of-the-art performance. Furthermore, by modifying only the singular values, SVC improves the performance of Task Arithmetic by 13.0%. Code is available at https://github.com/lyymuwu/SVC.
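The over-accumulation effect described here can be reproduced on a toy example. The sketch below is not SVC itself: it only shows that summing two updates that share a singular direction doubles the singular value along it, while updates with disjoint directions are not inflated. The 3x3 matrices and helper names are made up for illustration.

```python
import math


def rank_one(u, v, scale):
    """Return scale * u v^T as a nested list."""
    return [[scale * ui * vj for vj in v] for ui in u]


def add(a, b):
    """Element-wise matrix sum."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def top_singular_value(m, iters=100):
    """Largest singular value of m via power iteration on m^T m."""
    rows, cols = len(m), len(m[0])
    v = [1.0 / math.sqrt(cols)] * cols
    for _ in range(iters):
        mv = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(m[i][j] * mv[i] for i in range(rows)) for j in range(cols)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(m[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    return math.sqrt(sum(x * x for x in mv))


e1, e2, e3 = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]

# Two task updates that share the direction e1 e1^T and differ elsewhere.
task_a = add(rank_one(e1, e1, 1.0), rank_one(e2, e2, 0.5))
task_b = add(rank_one(e1, e1, 1.0), rank_one(e3, e3, 0.5))
merged_shared = add(task_a, task_b)

# Two task updates with no shared direction.
merged_disjoint = add(rank_one(e1, e1, 1.0), rank_one(e2, e2, 1.0))

shared_top = top_singular_value(merged_shared)      # shared direction counted twice
disjoint_top = top_singular_value(merged_disjoint)  # no inflation
```

Each individual update has a top singular value of 1.0; after merging, the shared case reaches about 2.0 while the disjoint case stays near 1.0. A calibration step in the spirit of the abstract would shrink the inflated value back toward the per-task scale, but the exact rule SVC uses is not given here.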

2602.03784 2026-05-22 cs.CL 版本更新

Fix the Structural Bottleneck: Context Compression via Explicit Information Transmission

修复结构瓶颈:通过显式信息传输进行上下文压缩

Jiangnan Ye, Hanqi Yan, Zhenyi Shen, Heng Chang, Ye Mao, Yulan He

发表机构 * King’s College London(伦敦国王学院) Tsinghua University(清华大学) Imperial College London(伦敦帝国学院) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 本文从结构角度重新审视上下文压缩,识别出标准LLM压缩方法中的两个关键瓶颈,并提出基于显式信息传输的ComprExIT框架;在12个数据集上的实验表明,其持续优于强软压缩基线,平均F1最高提升18.5%,仅增加约1%的可训练参数,压缩速度超过最快基线的2倍。

详情
AI中文摘要

长上下文LLM代理常常受困于不断增长的token、内存和延迟成本,因此高效的上下文压缩对实际部署至关重要。现有的以LLM作为压缩器的方法,其性能仍明显不如直接使用完整上下文。我们发现这一差距部分源于它们无法有效保留上下文信息。在本文中,我们从结构角度重新审视上下文压缩,并识别出标准的基于LLM的压缩器中的两个关键瓶颈:信息聚合过程中压缩token之间的协调有限,以及逐层稀释削弱了来自中间隐藏状态的有用信号。为了解决这些限制,我们提出了ComprExIT,一种基于显式信息传输的新上下文压缩框架。ComprExIT自适应地从冻结的LLM各层中选择特征,然后通过全局协调的传输方案将信息从锚点分配到压缩槽中。在12个数据集上的实验表明,ComprExIT始终优于强软压缩基线,平均F1最高提升18.5%,同时仅增加约1%的可训练参数,压缩速度超过最快基线的2倍。代码将在论文被接收后发布。

英文摘要

Long-context LLM agents often struggle with growing token, memory, and latency costs, making efficient context compression essential for practical deployment. Existing LLM-as-a-compressor methods remain noticeably inferior to using the full context. We find that this gap partly stems from their inability to preserve contextual information effectively. In this work, we revisit context compression from a structural perspective and identify two key bottlenecks in standard LLM-based compressors: limited coordination among compression tokens during information aggregation, and layerwise dilution that weakens useful signals from intermediate hidden states. To address these limitations, we propose ComprExIT, a new context compression framework based on explicit information transmission. ComprExIT adaptively selects features across frozen LLM layers, then allocates information from anchors to compression slots through a globally coordinated transport plan. Experiments on 12 datasets show that ComprExIT consistently outperforms strong soft-compression baselines, improving average F1 by up to 18.5%, while adding only ~1% trainable parameters and achieving more than 2x faster compression than the fastest baselines. The code will be released upon acceptance.

2602.02112 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Unifying Masked Diffusion Models with Various Generation Orders and Beyond

统一具有各种生成顺序的掩码扩散模型及其扩展

Chunsan Hong, Sanghyun Lee, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, South Korea(韩国科学技术院人工智能研究生院)

AI总结 本文提出Order-Expressive Masked Diffusion Model (OeMDM)和Learnable-Order Masked Diffusion Model (LoMDM),统一了不同生成顺序的扩散生成过程,并通过单目标学习生成顺序和扩散骨干,提升了文本生成性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

掩码扩散模型(MDM)是语言生成中自回归模型(ARM)的一种潜在替代方案,但其生成质量严重依赖于生成顺序。先前工作要么硬编码顺序(例如分块从左到右),要么为预训练的MDM学习顺序策略;后者会带来额外成本,并且由于两阶段优化可能得到次优解。受此启发,我们提出了order-expressive masked diffusion model (OeMDM),它涵盖一大类具有各种生成顺序的扩散生成过程,使MDM、ARM和块扩散能够在单一框架中得到解释。此外,基于OeMDM,我们引入了learnable-order masked diffusion model (LoMDM),它通过单一目标从头开始联合学习生成顺序和扩散骨干网络,使扩散模型能够以依赖上下文的顺序生成文本。实证上,我们证实LoMDM在多个语言建模基准上优于各种离散扩散模型。

英文摘要

Masked diffusion models (MDMs) are a potential alternative to autoregressive models (ARMs) for language generation, but generation quality depends critically on the generation order. Prior work either hard-codes an ordering (e.g., blockwise left-to-right) or learns an ordering policy for a pretrained MDM, which incurs extra cost and can yield suboptimal solutions due to the two-stage optimization. Motivated by this, we propose order-expressive masked diffusion model (OeMDM) for a broad class of diffusion generative processes with various generation orders, enabling the interpretation of MDM, ARM, and block diffusion in a single framework. Furthermore, building on OeMDM, we introduce learnable-order masked diffusion model (LoMDM), which jointly learns the generation ordering and diffusion backbone through a single objective from scratch, enabling the diffusion model to generate text in context-dependent ordering. Empirically, we confirm that LoMDM outperforms various discrete diffusion models across multiple language modeling benchmarks.

2601.20107 2026-05-22 cs.CV cs.CL cs.IR 版本更新

Structural Anchor Pruning: Training-Free Multi-Vector Compression for Visual Document Retrieval

结构锚点剪枝:用于视觉文档检索的无训练多向量压缩

Zhuchenyang Liu, Ziyu Hu, Yao Zhang, Yu Xiao

发表机构 * Aalto University(阿尔托大学)

AI总结 本文提出结构锚点剪枝(SAP),一种无需训练、与查询无关的多向量压缩方法,由得分保留(SR)诊断、SR引导的窗口选择和视觉入度中心性评分器三个组件构成,在无需逐模型调参的情况下剪除超过90%的视觉token,同时保留超过90%的NDCG@5。

Comments methodology revision and new title

详情
AI中文摘要

最近的视觉-语言模型(例如ColPali)能够实现细粒度的视觉文档检索(VDR),但带来了难以承受的多向量索引存储开销。现有的无训练剪枝方法要么依赖启发式的层选择,要么在激进压缩下性能急剧下降,这使得先前的工作认为有效的高压缩率剪枝需要依赖查询的训练。我们用结构锚点剪枝(SAP)挑战这一观点。SAP是一种自校准、无训练、与查询无关的索引阶段剪枝框架,包含三个组件:(i)得分保留(Score Retention, SR),一种白盒的逐层压缩诊断指标;(ii)SR引导的窗口选择,一种无需逐模型超参数、可为任意主干网络自动定位结构剪枝区域的程序;(iii)视觉入度中心性评分器,用于在所选窗口内识别锚点图像块。在ViDoRe v1/v2基准上,针对主干网络分别为18、28和36层的三种架构,SAP在剪除超过90%视觉token的同时保留了超过90%的NDCG@5,且无需任何逐模型的参数调整。我们的逐层SR分析揭示了一种对齐-聚合分歧(Alignment-Aggregation Divergence):文档的视觉结构在主干网络内部以稳定的“结构高原”形式得以保留,但最后几层会将这种表示重塑为稀疏的、与查询对齐的形式,不再适合剪枝。这正是SAP能够在基于最终层的方法失败之处取得成功的机制性原因。

英文摘要

Recent Vision-Language Models (e.g., ColPali) enable fine-grained Visual Document Retrieval (VDR) but incur prohibitive multi-vector index storage overhead. Existing training-free pruning methods either rely on heuristic layer choices or degrade sharply under aggressive compression, leading prior work to argue that effective high-compression pruning requires query-dependent training. We challenge this view with Structural Anchor Pruning (SAP), a self-calibrating, training-free, and query-agnostic index-time pruning framework with three components: (i) Score Retention (SR), a white-box per-layer compression diagnostic; (ii) SR-guided window selection, a procedure that automatically locates the structural pruning region for any backbone with no per-model hyperparameters; and (iii) a visual in-degree centrality scorer that identifies anchor patches within the selected window. On the ViDoRe v1/v2 benchmarks across three architectures spanning 18, 28, and 36 backbone layers, SAP retains over 90% of NDCG@5 while pruning more than 90% of visual tokens, without any per-model parameter tuning. Our layer-resolved SR analysis reveals an Alignment-Aggregation Divergence: the document's visual structure is preserved as a stable "Structural Plateau" within the backbone, but the final layers reshape this representation into a sparse, query-aligned form that is no longer suitable for pruning. This is the mechanistic reason SAP succeeds where final-layer methods fail.
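One plausible reading of the in-degree centrality scorer can be sketched as below. This is an assumption-laden illustration: the abstract does not define the score, so the sketch takes `attn[i][j]` to be the attention weight from patch `i` to patch `j`, scores each patch by the total attention it receives, and keeps the top fraction. The function names and `keep_ratio` parameter are hypothetical.

```python
def in_degree_scores(attn):
    """Score each patch j by the total attention it receives (column sum)."""
    n = len(attn[0])
    return [sum(row[j] for row in attn) for j in range(n)]


def select_anchors(attn, keep_ratio=0.1):
    """Return sorted indices of the highest-scoring patches to keep in the index."""
    scores = in_degree_scores(attn)
    k = max(1, round(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy 4-patch attention map; each row sums to 1.
attn = [
    [0.1, 0.7, 0.1, 0.1],
    [0.2, 0.6, 0.1, 0.1],
    [0.1, 0.5, 0.3, 0.1],
    [0.1, 0.6, 0.1, 0.2],
]
anchors = select_anchors(attn, keep_ratio=0.5)
```

Because the selection depends only on the document's own attention map, it can run once at index time without seeing any query, which matches the query-agnostic property claimed above. Which layer the attention map is taken from is the job of the SR-guided window selection and is not modeled here.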

2601.04537 2026-05-22 cs.LG cs.CL 版本更新

Linear Dynamics in the RLVR Training of Large Language Models

在大语言模型RLVR训练中的线性动力学

Tianle Wang, Jiayu Liu, Zhongyuan Wu, Shenghao Jin, Wei Chen, Hao Xu, Ning Miao

发表机构 * Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Hong Kong Institute of AI for Science, City University of Hong Kong(香港城市大学人工智能科学研究院) Li Auto Inc.(理想汽车) Beihang University(北京航空航天大学)

AI总结 本文研究了可验证奖励强化学习(RLVR)在大语言模型训练中的内部动态,发现RLVR在多种模型和训练配置下均进入线性区域,通过实验和理论分析表明这种线性源于训练信号的高方差和高噪声特性,并且可用于预测和加速训练。

Comments Major revision: substantially reorganized the manuscript and added a theoretical explanation section. The replacement is intended for the same arXiv paper; the core topic and contribution remain the same

详情
AI中文摘要

可验证奖励强化学习(RLVR)在面向推理的大语言模型(LLMs)中带来了显著的性能提升,但其内部训练动态在很大程度上仍是一个黑箱。在本文中,我们对RLVR进行了全面的轨迹级分析,并揭示出一个显著的规律:在各种模型家族、RL算法和训练配置下,RLVR始终会进入一个稳健的线性区域,其中参数权重和输出对数概率(通过教师强制评估严格测量)都以高度线性的方式演变(R²>0.7)。通过受控实验和理论分析,我们证明这种线性并非巧合,而是源于RLVR训练信号高方差、高噪声的特性,这一特性起到低通滤波器的作用,使优化集中在一条稳定的低维漂移方向上。此外,我们表明这种线性结构不仅具有描述性,而且具有很强的预测性和可操作性。具体而言,权重空间外推借助周期性的重新校准(re-grounding),在性能上与标准RL优化相当,同时实现了6.1倍的训练加速。同时,输出空间外推作为一种轻量级干预手段,有效绕过了训练后期的模型崩溃,在数学和编码基准上始终优于标准RL,平均性能提升4.2%。我们的代码可在https://github.com/Miaow-Lab/RLVR-Linearity获取。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has driven significant performance gains in reasoning-oriented large language models (LLMs), yet its internal training dynamics remain largely a black box. In this work, we perform a comprehensive trajectory-level analysis of RLVR and uncover a striking regularity: across various model families, RL algorithms, and training configurations, RLVR consistently enters a robust linear regime, where both parameter weights and output log-probabilities, measured rigorously via teacher-forced evaluation, evolve in a highly linear manner ($R^2 > 0.7$). Through controlled experiments and theoretical analysis, we demonstrate that this linearity is not a coincidence, but stems from the high-variance, noisy nature of RLVR training signals, which act as a low-pass filter to concentrate optimization along a stable, low-dimensional drift. Moreover, we show that this linear structure is not merely descriptive but powerfully predictive and actionable. Specifically, weight-space extrapolation matches the performance of standard RL optimization while achieving a 6.1x training speedup through periodic re-grounding. Meanwhile, output-space extrapolation serves as a lightweight intervention that effectively bypasses late-stage model collapse, consistently outperforming standard RL across mathematical and coding benchmarks, with an average performance improvement of 4.2%. Our code is available at https://github.com/Miaow-Lab/RLVR-Linearity.
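The idea of extrapolating along a linear trajectory can be illustrated for a single scalar weight. The sketch below is a simplification under stated assumptions: it fits an ordinary least-squares line to checkpoint values, and it uses the abstract's reported R² > 0.7 as a gate for when to trust the extrapolation, which is a choice made here rather than the paper's procedure. Periodic re-grounding is not modeled, and the function names are hypothetical.

```python
def fit_line(steps, values):
    """Ordinary least squares for values ~ slope * step + intercept; returns R^2 too."""
    n = len(steps)
    mean_x = sum(steps) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in steps)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(steps, values))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(steps, values))
    ss_tot = sum((y - mean_y) ** 2 for y in values)
    # A constant trajectory is perfectly (if trivially) linear.
    r2 = 1.0 if ss_tot == 0 else 1.0 - ss_res / ss_tot
    return slope, intercept, r2


def extrapolate(steps, values, target_step, min_r2=0.7):
    """Predict the value at target_step, or None if the trajectory is not linear enough."""
    slope, intercept, r2 = fit_line(steps, values)
    if r2 < min_r2:
        return None
    return slope * target_step + intercept
```

For example, checkpoints `[0.50, 0.52, 0.54, 0.56]` at steps `[0, 100, 200, 300]` extrapolate to roughly 0.62 at step 600, whereas an erratic trajectory such as `[0.5, 0.9, 0.1, 0.5]` falls below the R² gate and returns `None`. Applying the same fit independently to every parameter is the obvious extension, though the abstract does not say how the authors parameterize it.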

2511.04106 2026-05-22 physics.soc-ph cs.CL cs.CY stat.AP 版本更新

Sub-exponential Growth Dynamics in Complex Systems: A Piecewise Power-Law Model for the Diffusion of New Words and Names

复杂系统中的亚指数增长动力学:一种用于新词汇和名称扩散的分段幂律模型

Hayafumi Watanabe

发表机构 * Department of Economics, Seijo University, Setagaya-ku, Tokyo 157-8511, Japan(成城大学经济学部) The Institute of Statistical Mathematics, Tachikawa-shi, Tokyo 190-8562, Japan(统计数理研究所)

AI总结 本文提出了一种分段幂律模型,用于描述复杂增长曲线,通过分析大规模数据集发现亚指数增长是社会扩散的常见模式。

详情
Journal ref
Physical Review E (2026)
AI中文摘要

社会中思想和语言的扩散通常用S型模型(如逻辑斯蒂曲线)来描述。然而,亚指数增长(流行病学中已知的一种慢于指数增长的模式)在更广泛的社会现象中的作用在很大程度上被忽视了。本文提出了一种分段幂律模型,用少量参数刻画复杂的增长曲线。我们系统分析了一个大规模数据集,其中包含约十亿篇与维基百科词汇相关联的日语博客文章,并在网络搜索趋势数据(英语、西班牙语和日语)中观察到了一致的模式。我们分析了为保证估计可靠而筛选出的2963个条目(筛选条件如足够的持续时间/峰值、单调增长),结果显示其中1625个(55%)不含突变性水平跃迁的扩散模式可以用一个或两个分段充分描述。对于单段曲线,我们发现:(i)形状参数α的众数接近0.5,表明亚指数增长十分普遍;(ii)峰值扩散规模主要由增长率R决定,α和持续时间T的贡献较小;(iii)α倾向于随主题性质而变化,小众/本地主题的α较小,被广泛共享的主题的α较大。此外,一个关于外向(陌生人)与内向(社区)接触的微观行为模型表明,α可以解释为偏好外向型交流的指标。这些发现表明亚指数增长是社会扩散的常见模式,我们的模型为一致地描述、比较和解释复杂多样的增长曲线提供了一个实用框架。

英文摘要

The diffusion of ideas and language in society has conventionally been described by S-shaped models, such as the logistic curve. However, the role of sub-exponential growth -- a slower-than-exponential pattern known in epidemiology -- has been largely overlooked in broader social phenomena. Here, we present a piecewise power-law model to characterize complex growth curves with a few parameters. We systematically analyzed a large-scale dataset of approximately one billion Japanese blog articles linked to Wikipedia vocabulary, and observed consistent patterns in web search trend data (English, Spanish, and Japanese). Our analysis of 2,963 items, selected for reliable estimation (e.g., sufficient duration/peak, monotonic growth), reveals that 1,625 (55%) diffusion patterns without abrupt level shifts were adequately described by one or two segments. For single-segment curves, we found that (i) the mode of the shape parameter $α$ was near 0.5, indicating prevalent sub-exponential growth; (ii) the peak diffusion scale is primarily determined by the growth rate $R$, with minor contributions from $α$ or the duration $T$; and (iii) $α$ showed a tendency to vary with the nature of the topic, being smaller for niche/local topics and larger for widely shared ones. Furthermore, a micro-behavioral model of outward (stranger) vs. inward (community) contact suggests that $α$ can be interpreted as an index of the preference for outward-oriented communication. These findings suggest that sub-exponential growth is a common pattern of social diffusion, and our model provides a practical framework for consistently describing, comparing, and interpreting complex and diverse growth curves.

2510.23090 2026-05-22 cs.CL 版本更新

MAP4TS: A Multi-Aspect Prompting Framework for Time-Series Forecasting with Large Language Models

MAP4TS: 一个用于基于大语言模型的时间序列预测的多方面提示框架

Suchan Lee, Jihoon Choi, Sohyeon Lee, Minseok Song, Bong-Gyu Jang, Hwanjo Yu, Soyeon Caren Han

AI总结 本文提出MAP4TS框架,通过将经典时间序列分析融入提示设计,提升大语言模型在时间序列预测中的性能,实验表明其在多个数据集上均优于现有方法。

Comments There is a error in modeling. Thereafter, paper will be revised and re-uploaded

详情
AI中文摘要

近期的研究探索了通过将数值输入与LLM嵌入空间对齐,使用预训练的大语言模型(LLMs)进行时间序列预测。然而,现有的多模态方法往往忽视了时间序列数据所固有的独特统计特性和时间依赖性。为弥合这一差距,我们提出了MAP4TS,一种新颖的多方面提示框架,明确地将经典时间序列分析纳入提示设计。我们的框架引入了四个专门的提示组件:传达数据集级上下文的全局领域提示,编码近期趋势和序列特定行为的局部领域提示,以及一对统计提示和时间提示,后两者嵌入了由自相关(ACF)、偏自相关(PACF)和傅里叶分析得出的人工设计的洞察。多方面提示与原始时间序列嵌入相结合,经过跨模态对齐模块生成统一的表示,再由LLM处理并投影得到最终预测。在八个多样化数据集上的大量实验表明,MAP4TS始终优于最先进的基于LLM的方法。我们的消融研究进一步表明,提示感知的设计显著提升了性能稳定性,并且在与结构化提示配合时,GPT-2骨干网络在长期预测任务中优于LLaMA等更大的模型。

英文摘要

Recent advances have investigated the use of pretrained large language models (LLMs) for time-series forecasting by aligning numerical inputs with LLM embedding spaces. However, existing multimodal approaches often overlook the distinct statistical properties and temporal dependencies that are fundamental to time-series data. To bridge this gap, we propose MAP4TS, a novel Multi-Aspect Prompting Framework that explicitly incorporates classical time-series analysis into the prompt design. Our framework introduces four specialized prompt components: a Global Domain Prompt that conveys dataset-level context, a Local Domain Prompt that encodes recent trends and series-specific behaviors, and a pair of Statistical and Temporal Prompts that embed handcrafted insights derived from autocorrelation (ACF), partial autocorrelation (PACF), and Fourier analysis. Multi-Aspect Prompts are combined with raw time-series embeddings and passed through a cross-modality alignment module to produce unified representations, which are then processed by an LLM and projected for final forecasting. Extensive experiments across eight diverse datasets show that MAP4TS consistently outperforms state-of-the-art LLM-based methods. Our ablation studies further reveal that prompt-aware designs significantly enhance performance stability and that GPT-2 backbones, when paired with structured prompts, outperform larger models like LLaMA in long-term forecasting tasks.

2510.13910 2026-05-22 cs.CL 版本更新

RAGCap-Bench: Benchmarking Capabilities of LLMs in Agentic Retrieval Augmented Generation Systems

RAGCap-Bench: 评估代理检索增强生成系统中LLM能力的基准测试

Jingru Lin, Chen Zhang, Stephen Y. Liu, Haizhou Li

发表机构 * National University of Singapore(新加坡国立大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 本文提出RAGCap-Bench,用于评估代理检索增强生成系统中中间任务的细粒度能力,通过分析现有系统输出识别常见任务和核心能力,设计针对性评估问题,实验表明增强中间能力的模型能获得更好的整体性能。

详情
AI中文摘要

检索增强生成(RAG)通过动态检索外部信息缓解大型语言模型(LLMs)的关键限制,如事实错误、过时知识和幻觉。最近的研究通过代理RAG系统扩展了这一范式,其中LLMs作为代理迭代地计划、检索和推理复杂查询。然而,这些系统在处理具有挑战性的多跳问题时仍存在困难,且其中间推理能力仍缺乏深入研究。为此,我们提出了RAGCap-Bench,一个以能力为导向的基准测试,用于对代理RAG工作流程中的中间任务进行细粒度评估。我们分析了最先进系统的输出,以识别常见任务和执行所需的核心能力,然后构建了一个典型LLM错误的分类学,以设计针对性的评估问题。实验表明,具有更强RAGCap性能的“慢思考”模型在端到端结果上表现更好,这证明了该基准测试的有效性以及增强这些中间能力的重要性。

英文摘要

Retrieval-Augmented Generation (RAG) mitigates key limitations of Large Language Models (LLMs), such as factual errors, outdated knowledge, and hallucinations, by dynamically retrieving external information. Recent work extends this paradigm through agentic RAG systems, where LLMs act as agents to iteratively plan, retrieve, and reason over complex queries. However, these systems still struggle with challenging multi-hop questions, and their intermediate reasoning capabilities remain underexplored. To address this, we propose RAGCap-Bench, a capability-oriented benchmark for fine-grained evaluation of intermediate tasks in agentic RAG workflows. We analyze outputs from state-of-the-art systems to identify common tasks and the core capabilities required for their execution, then construct a taxonomy of typical LLM errors to design targeted evaluation questions. Experiments show that "slow-thinking" models with stronger RAGCap performance achieve better end-to-end results, underscoring the benchmark's validity and the importance of enhancing these intermediate capabilities.

2507.23773 2026-05-22 cs.AI cs.CL cs.LG cs.RO 版本更新

General Agentic Planning Through Simulative Reasoning with World Models

通过世界模型的模拟推理实现通用代理规划

Mingkai Deng, Jinyu Hou, Zhiting Hu, Eric Xing

发表机构 * Institute of Foundation Models (IFM)(基础模型研究所) Carnegie Mellon University(卡内基梅隆大学) UC San Diego(加州大学圣迭戈分校)

AI总结 本文提出通过模拟推理实现通用代理规划,利用世界模型进行未来状态预测,提升决策能力,通过SiRA架构在不同任务中取得更高任务完成率。

Comments Winner of Berkeley LLM Agents Hackathon (Fundamentals Track); code available at https://github.com/sailing-lab/sira

详情
AI中文摘要

规划意味着什么?当前的代理系统,无论是脚手架式工作流还是端到端策略,都依赖于反应式决策:通过固定流程选择下一步行动,至多辅以未加区分的自适应计算(例如思维链),缺乏对未来结果的显式建模。这限制了通用性,因为每个新任务都需要重新设计,而不是迁移共享的推理能力。相比之下,人类通过在内部世界模型中对候选动作的后果进行心理模拟来做规划,这种能力被称为模拟推理(System II),它支持在不同情境下灵活的、目标导向的行为。我们认为,基于世界模型的模拟推理为代理系统提供了一种通用的规划机制,它让决策建立在预测的未来状态之上,而不是模式匹配式的响应之上,从而优于反应式策略(System I)。为了验证这一点,我们引入了SiRA(Simulative Reasoning Architecture,模拟推理架构),一种目标导向的架构,它利用基于LLM的世界模型和自然语言信念状态来实现模拟推理,同时保持模型无关性。我们在网络浏览器环境中评估了三类性质不同的任务:受约束的导航、多跳信息聚合和通用指令遵循。在所有类别中,模拟推理的任务完成率比与之匹配的反应式基线最多高出124%,并且与一个有代表性的开放网络代理相比,将受约束导航的成功率从0%提高到32.2%。在不同任务类型上持续存在的优势表明,这种收益源于可泛化的反事实评估,而不是针对特定任务的调优。

英文摘要

What does it mean to plan? Current agentic systems, whether scaffolded workflows or end-to-end policies, rely on reactive decision-making: selecting the next action via a fixed procedure with at most undifferentiated adaptive computation (e.g., chain-of-thought) lacking explicit modeling of future outcomes. This limits generalizability, as each new task demands re-engineering rather than transfer of shared reasoning capacity. Humans, by contrast, plan by mentally simulating consequences of candidate actions within an internal world model, a capacity known as simulative reasoning (System II) that supports flexible, goal-directed behavior across diverse contexts. We argue that simulative reasoning through a world model provides a general-purpose planning mechanism for agentic systems, improving upon reactive policies (System I) by grounding decisions in predicted future states rather than pattern-matched responses. To verify this, we introduce SiRA (Simulative Reasoning Architecture), a goal-oriented architecture instantiating simulative reasoning using an LLM-based world model with natural-language belief states, while remaining model-agnostic. We evaluate across three qualitatively distinct task categories: constrained navigation, multi-hop information aggregation, and general instruction following, in a web-browser environment. Across all categories, simulative reasoning achieves up to 124% higher task completion rates than a matched reactive baseline, and increases constrained navigation success from 0% to 32.2% compared to a representative open-web agent. The persistent advantage across distinct task types suggests the benefit stems from generalizable counterfactual evaluation rather than task-specific tuning.

2507.05660 2026-05-22 cs.CR cs.AI cs.CL 版本更新

Optimus: A Robust Defense Framework for Mitigating Toxicity while Fine-Tuning Conversational AI

Optimus: 一种用于在微调对话AI时缓解毒性行为的稳健防御框架

Aravind Cheruvu, Shravya Kanchi, Sifat Muhammad Abdullah, Nicholas Ka-Shing Kong, Daphne Yao, Murtuza Jadliwala, Bimal Viswanath

发表机构 * University of Texas at San Antonio(德克萨斯大学圣安东尼奥分校)

AI总结 本研究提出Optimus框架,结合免训练的毒性分类方案和双重策略对齐过程,缓解在不可信数据上微调带来的毒性问题;即使毒性分类器存在严重偏差(召回率下降最高达85%)仍然有效,并优于现有最佳防御方法StarDSS。

Comments Accepted at ACM CODASPY 2026

详情
AI中文摘要

在不可信数据集上定制大型语言模型(LLMs)存在被注入毒性行为的严重风险。在本文中,我们引入Optimus,一种新的防御框架,旨在缓解微调带来的危害,同时保持对话实用性。与严重依赖精确毒性检测或限制性过滤的现有防御方法不同,Optimus要解决的关键挑战是:即使毒性分类器不完美或存在偏差,也要确保稳健的缓解效果。Optimus整合了一种免训练的毒性分类方案,该方案复用了现成的(commodity)LLMs已有的安全对齐能力;它还采用双重策略对齐过程,将合成的“治愈数据”(healing data)与直接偏好优化(DPO)相结合,以高效地引导模型走向安全。大量评估表明,即使依赖偏差极大的分类器(召回率下降最高达85%),Optimus仍能缓解毒性。Optimus优于现有最佳防御方法StarDSS,并对自适应对抗攻击和越狱攻击表现出很强的抵御能力。我们的源代码和数据集可在https://github.com/secml-lab-vt/Optimus获取。

英文摘要

Customizing Large Language Models (LLMs) on untrusted datasets poses severe risks of injecting toxic behaviors. In this work, we introduce Optimus, a novel defense framework designed to mitigate fine-tuning harms while preserving conversational utility. Unlike existing defenses that rely heavily on precise toxicity detection or restrictive filtering, Optimus addresses the critical challenge of ensuring robust mitigation even when toxicity classifiers are imperfect or biased. Optimus integrates a training-free toxicity classification scheme that repurposes the safety alignment of commodity LLMs, and employs a dual-strategy alignment process combining synthetic "healing data" with Direct Preference Optimization (DPO) to efficiently steer models toward safety. Extensive evaluations demonstrate that Optimus mitigates toxicity even when relying on extremely biased classifiers (with up to 85% degradation in Recall). Optimus outperforms the state-of-the-art defense StarDSS and exhibits strong resilience against adaptive adversarial and jailbreak attacks. Our source code and datasets are available at https://github.com/secml-lab-vt/Optimus

2506.22316 2026-05-22 cs.CL 版本更新

Evaluating Scoring Bias in LLM-as-a-Judge

评估LLM作为裁判的评分偏见

Qingquan Li, Shaoyu Dou, Kailai Shao, Chao Chen, Haixiang Hu

发表机构 * Ant Group(蚂蚁集团)

AI总结 本文研究了LLM作为裁判在评分任务中的偏见问题,提出了三种新的评分偏见类型,并开发了一个框架来量化这些偏见,以改进评分提示设计。

Comments Accepted by DASFAA 2026

详情
AI中文摘要

"LLM-as-a-Judge"范式通过大型语言模型(LLMs)作为自动评估者,是LLM发展中的关键部分,为复杂任务提供可扩展的反馈。然而,这些裁判的可靠性受到多种偏见的影响。现有研究主要集中在比较性评估中的偏见。相比之下,基于评分的评估(分配绝对分数,常用于工业应用)研究较少。为填补这一空白,我们进行了首次专门的评分偏见评估。我们从评分提示本身而非评估目标的偏见出发。我们正式定义了评分偏见,并识别了三种新的偏见类型:评分标准顺序偏见、评分ID偏见和参考答案评分偏见。我们提出了一种全面的框架来量化这些偏见,包含多方面的度量指标和自动数据合成管道来创建定制的评估语料库。我们的实验实证地证明了即使最先进的LLMs也受这些显著评分偏见的影响。我们的分析为设计更稳健的评分提示和缓解这些新发现的偏见提供了可行的见解。

英文摘要

The "LLM-as-a-Judge" paradigm, using Large Language Models (LLMs) as automated evaluators, is pivotal to LLM development, offering scalable feedback for complex tasks. However, the reliability of these judges is compromised by various biases. Existing research has heavily concentrated on biases in comparative evaluations. In contrast, scoring-based evaluations, which assign an absolute score and are often more practical in industrial applications, remain under-investigated. To address this gap, we undertake the first dedicated examination of scoring bias in LLM judges. We shift the focus from biases tied to the evaluation targets to those originating from the scoring prompt itself. We formally define scoring bias and identify three novel, previously unstudied types: rubric order bias, score ID bias, and reference answer score bias. We propose a comprehensive framework to quantify these biases, featuring a suite of multi-faceted metrics and an automatic data synthesis pipeline to create a tailored evaluation corpus. Our experiments empirically demonstrate that even the most advanced LLMs suffer from these substantial scoring biases. Our analysis yields actionable insights for designing more robust scoring prompts and mitigating these newly identified biases.

2506.04708 2026-05-22 cs.CL 版本更新

Accelerated Test-Time Scaling with Model-Free Speculative Sampling

加速测试时间缩放与模型无关的推测采样

Woomin Song, Saket Dingliwal, Sai Muralidhar Jayanthi, Bhavana Ganesh, Jinwoo Shin, Aram Galstyan, Sravan Babu Bodapati

Affiliations: KAIST; Amazon AGI; AirSignal

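AI summary: This paper proposes STAND, a model-free speculative decoding method that exploits redundancy in reasoning trajectories to speed up inference without sacrificing accuracy. Evaluated across multiple models and tasks, STAND reduces inference latency by 60-65% while maintaining accuracy.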

Comments EMNLP 2025 Oral

AI中文摘要

语言模型通过测试时间缩放技术如best-of-N采样和树搜索在推理任务中展现了显著的能力。然而,这些方法通常需要大量的计算资源,导致性能与效率之间的关键权衡。我们引入STAND(STochastic Adaptive N-gram Drafting),一种新颖的无模型推测解码方法,利用推理轨迹中的内在冗余性,实现显著的加速而不牺牲准确性。我们的分析显示,推理路径经常重复相似的推理模式,使高效的无模型令牌预测成为可能,而无需单独的草案模型。通过引入随机草案和通过高效日志几率基的n-gram模块保留概率信息,结合优化的Gumbel-Top-K采样和数据驱动的树构建,STAND显著提高了令牌接受率。在多个模型和推理任务(AIME-2024、GPQA-Diamond和LiveCodeBench)上的广泛评估表明,与标准自回归解码相比,STAND将推理延迟降低了60-65%,同时保持准确性。此外,STAND在各种推理模式下,包括单轨迹解码、批量解码和测试时间树搜索中,均优于最先进的推测解码方法。作为一种无模型方法,STAND可以应用于任何现有语言模型,无需额外训练,使其成为加速语言模型推理的强大即插即用解决方案。

英文摘要

Language models have demonstrated remarkable capabilities in reasoning tasks through test-time scaling techniques like best-of-N sampling and tree search. However, these approaches often demand substantial computational resources, creating a critical trade-off between performance and efficiency. We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach that exploits the inherent redundancy in reasoning trajectories to achieve significant acceleration without compromising accuracy. Our analysis shows that reasoning paths frequently reuse similar reasoning patterns, enabling efficient model-free token prediction without requiring separate draft models. By introducing stochastic drafting and preserving probabilistic information through a memory-efficient logit-based N-gram module, combined with optimized Gumbel-Top-K sampling and data-driven tree construction, STAND significantly improves token acceptance rates. Extensive evaluations across multiple models and reasoning tasks (AIME-2024, GPQA-Diamond, and LiveCodeBench) demonstrate that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding while maintaining accuracy. Furthermore, STAND consistently outperforms state-of-the-art speculative decoding methods across diverse inference patterns, including single-trajectory decoding, batch decoding, and test-time tree search. As a model-free approach, STAND can be applied to any existing language model without additional training, making it a powerful plug-and-play solution for accelerating language model reasoning.
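
Illustrative sketch (not from the paper): the abstract mentions Gumbel-Top-K sampling as one ingredient of STAND's stochastic drafting. The snippet below shows only the generic Gumbel-top-k trick, which draws k distinct indices without replacement by perturbing logits with Gumbel noise. The function name `gumbel_top_k` and its signature are hypothetical, and nothing here reflects STAND's optimized implementation or its N-gram module.

```python
import math
import random


def gumbel_top_k(logits, k, rng=None):
    """Sample k distinct indices without replacement via the Gumbel-top-k trick."""
    if rng is None:
        rng = random.Random()
    if not 0 <= k <= len(logits):
        raise ValueError("k must be between 0 and len(logits)")
    perturbed = []
    for index, logit in enumerate(logits):
        # Gumbel(0, 1) noise is -log(-log(U)) with U uniform on (0, 1).
        u = rng.random()
        while u == 0.0:  # random() can return 0.0, which would break log()
            u = rng.random()
        perturbed.append((logit - math.log(-math.log(u)), index))
    # The k largest perturbed logits form the sample.
    perturbed.sort(reverse=True)
    return [index for _, index in perturbed[:k]]


# Hypothetical draft step: pick 2 of 4 candidate tokens.
draft = gumbel_top_k([2.0, 1.0, 0.5, -1.0], k=2, rng=random.Random(0))
```

Higher-logit candidates are more likely to be drawn, but every candidate keeps a nonzero chance, which is what makes the draft stochastic rather than greedy.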

2505.17123 2026-05-22 cs.CL 版本更新

MTR-Bench: A Comprehensive Benchmark for Multi-Turn Reasoning Evaluation

MTR-Bench:多轮推理评估的综合性基准

Xiaoyuan Li, Keqin Bao, Yubo Ma, Moxin Li, Wenjie Wang, Rui Men, Yichang Zhang, Fuli Feng, Dayiheng Liu

Affiliations: University of Science and Technology of China; Alibaba Group; National University of Singapore

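AI summary: This paper proposes MTR-Bench, a comprehensive benchmark of 4 classes, 40 tasks, and 3,600 instances for evaluating the multi-turn reasoning of large language models. Its automated framework enables large-scale evaluation and reveals that current state-of-the-art reasoning models fall short on multi-turn interactive tasks.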

Comments ACL 2026 Main Conference

AI中文摘要

近年来,大型语言模型(LLMs)在复杂推理任务中展现出有前景的结果。然而,当前的评估主要集中在单轮推理场景,忽略了交互性任务。我们归因于缺乏全面的数据集和可扩展的自动评估协议。为了填补这些空白,我们提出了MTR-Bench用于LLM的多轮推理评估。MTR-Bench包含4类、40个任务和3600个实例,覆盖了多样的推理能力、细粒度难度层次以及需要与环境进行多轮交互的任务。此外,MTR-Bench具备完全自动化的框架,涵盖了数据集构建和模型评估,使大规模评估成为可能而无需人工干预。广泛实验表明,即使是最先进的推理模型在多轮交互推理任务中也显得不足。对这些结果的进一步分析为未来交互式人工智能系统的研究提供了有价值的见解。

英文摘要

Recent advances in Large Language Models (LLMs) have shown promising results in complex reasoning tasks. However, current evaluations predominantly focus on single-turn reasoning scenarios, leaving interactive tasks largely unexplored. We attribute this to the absence of comprehensive datasets and scalable automatic evaluation protocols. To fill these gaps, we present MTR-Bench for evaluating the multi-turn reasoning of LLMs. Comprising 4 classes, 40 tasks, and 3600 instances, MTR-Bench covers diverse reasoning capabilities and fine-grained difficulty levels, and requires multi-turn interaction with the environment. Moreover, MTR-Bench features a fully automated framework spanning both dataset construction and model evaluation, which enables scalable assessment without human intervention. Extensive experiments reveal that even cutting-edge reasoning models fall short on multi-turn, interactive reasoning tasks. Further analysis of these results yields valuable insights for future research on interactive AI systems.

2505.05406 2026-05-22 cs.CL 版本更新

Frame In, Frame Out: Measuring Framing Bias in LLM-Generated News Summaries

框入框出:衡量LLM生成新闻摘要中的框架偏差

Valeria Pastorino, Nafise Sadat Moosavi

Affiliations: Department of Computer Science, University of Sheffield (UK)

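AI summary: This paper proposes the FIFO benchmark for measuring framing presence in LLM-generated news summaries. It finds elevated framing rates in LLM-generated summaries on science and public health topics, indicating that framing is an overlooked but important dimension of summary quality.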

Comments Accepted to The 15th Joint Conference on Lexical and Computational Semantics (*SEM 2026) co-located with ACL 2026

AI中文摘要

新闻标题和摘要通过选择性强调和省略来影响事件的解读,这种现象通常称为框架。大型语言模型现在经常用于生成此类内容,但现有的评估框架大多忽略了这一维度。我们介绍了Frame In, Frame Out (FIFO),这是首个大规模基准测试,用于衡量LLM生成的新闻摘要中的框架存在性,基于广泛使用的XSum数据集。FIFO结合了15,499名陪审团标注的例子和320个专家标注的实例(κ=0.61)来验证和校准基于模型的标注。使用FIFO,我们分析了27个摘要模型的测量框架率。我们发现,LLM生成的摘要往往表现出比人类撰写的参考更高的校准框架率,不同主题和训练制度下存在显著差异,包括在科学和公共卫生摘要中出现较高的框架率。我们的结果确立了框架作为摘要质量的一个被忽视但重要的维度。

英文摘要

News headlines and summaries shape how events are interpreted through selective emphasis and omission, a phenomenon commonly referred to as framing. Large language models are now routinely used to generate such content, yet existing evaluation frameworks largely overlook this dimension. We introduce Frame In, Frame Out (FIFO), the first large-scale benchmark for measuring framing presence in LLM-generated news summaries, grounded in the widely used XSum dataset. FIFO combines 15,499 jury-annotated examples with 320 expert-labeled instances ($\kappa = 0.61$) to validate and calibrate model-based annotations. Using FIFO, we analyze measured framing rates across 27 summarization models. We find that LLM-generated summaries often exhibit higher calibrated framing rates than human-written references, with substantial variation across topics and training regimes, including elevated rates in scientific and public health summaries. Our results establish framing as an underexplored and consequential dimension of summarization quality.

2503.17599 2026-05-22 cs.CL cs.AI 版本更新

Evaluating Clinical Competencies of Large Language Models with a General Practice Benchmark

利用通用医疗基准评估大型语言模型的临床能力

Zheqing Li, Yiying Yang, Jiping Lang, Wenhao Jiang, Junrong Chen, Yuhang Zhao, Shuang Li, Dingqian Wang, Zhu Lin, Xuanna Li, Yuze Tang, Jiexian Qiu, Xiaolin Lu, Hongji Yu, Shuang Chen, Yuhua Bi, Xiaofei Zeng, Yixian Chen, Lin Yao

Affiliations: The Sixth Affiliated Hospital of Sun Yat-sen University; Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ); Xinyi People’s Hospital; The Fifth Affiliated Hospital of Sun Yat-sen University; School of Public Health of Sun Yat-sen University

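AI summary: This paper proposes a new evaluation framework and a general practice benchmark (GPBench) to assess the capabilities of large language models in general practice. It finds that current LLMs cannot be deployed autonomously in clinical general practice and require continuous human oversight.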

AI中文摘要

大型语言模型(LLMs)在一般医疗实践中展现出了相当大的潜力。然而,现有的基准测试和评估框架主要依赖于考试式或简化的问题-答案格式,缺乏与一般医疗实践中实际临床责任相匹配的基于能力的结构。因此,LLMs能否可靠地履行一般医生(GPs)职责的范围仍然不确定。在本工作中,我们提出了一种新的评估框架,用于评估LLMs作为GPs的能力。基于此框架,我们引入了一个通用医疗基准(GPBench),其数据由领域专家根据常规临床实践标准进行细致标注。我们评估了十种最先进的LLMs,并分析了它们的能力。我们的发现表明,当前的LLMs不适合在临床一般实践中自主部署,所有实际应用都需要持续的人类监督;进一步针对GPs日常职责进行的特定优化仍至关重要。

英文摘要

Large Language Models (LLMs) have demonstrated considerable potential in general practice. However, existing benchmarks and evaluation frameworks primarily depend on exam-style or simplified question-answer formats, lacking a competency-based structure aligned with the real-world clinical responsibilities encountered in general practice. Consequently, the extent to which LLMs can reliably fulfill the duties of general practitioners (GPs) remains uncertain. In this work, we propose a novel evaluation framework to assess the capability of LLMs to function as GPs. Based on this framework, we introduce a general practice benchmark (GPBench), whose data are meticulously annotated by domain experts in accordance with routine clinical practice standards. We evaluate ten state-of-the-art LLMs and analyze their competencies. Our findings indicate that current LLMs are not suitable for autonomous deployment in clinical general practice and that all realistic applications require continuous human oversight; further optimization specifically tailored to the daily responsibilities of GPs remains essential.

2503.02885 2026-05-22 cs.CY cs.CL cs.HC 版本更新

"Would You Want an AI Tutor?" Understanding Stakeholder Perceptions of LLM-based Systems in the Classroom

你希望有一个AI导师吗?理解基于大语言模型的系统在课堂中的利益相关者观点

Caterina Fuligni, Daniel Dominguez Figaredo, Armanda Lewis, Julia Stoyanovich

Affiliations: Khan Academy; New York City Department of Education; Learning Innovation Catalyst

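AI summary: This paper studies stakeholder perceptions of deploying LLM-based systems in the classroom and proposes a stakeholder-first framework to support more deliberate decision-making.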

AI中文摘要

大语言模型(LLM)在教育环境中获得了广泛的应用,通常被描述为虚拟导师或教学助手。在早期的怀疑和禁令之后,许多学校和大学已经开始将这些系统整合到课程中。然而,关于是否以及如何部署LLM-based工具的决策通常是缺乏系统性地与所有受影响的利益相关者互动。在本文中,我们主张理解课堂中基于LLM的系统利益相关者观点不仅仅是衡量批准或接受,而是识别哪些关注被提出,何时提出,以及这对负责任的设计和治理有何影响。我们介绍了面向教育中LLM采用的感知框架(Co-PALE),该框架以利益相关者为中心,连接教育环境、负责任的人工智能原则和感知类别,以支持关于LLM-based工具采用的更谨慎的决策。我们通过针对先前工作的分析来奠定Co-PALE,以诊断在研究利益相关者观点时反复出现的差距,并通过具有不同教育场景的上下文不同的教育场景来说明相同的技术如何对不同的利益相关者产生不同的关注。我们进一步探讨了大学教师和K-12家长如何理解该框架,通过焦点小组讨论,使用他们的反思来揭示紧张和不确定性。Co-PALE支持更系统地思考在教育中部署LLM-based工具是否、在哪里以及为谁而部署。

英文摘要

Large Language Models (LLMs) have gained traction in educational settings, often framed as virtual tutors or teaching assistants. Following early skepticism and bans, many schools and universities have begun integrating these systems into curricula. Yet decisions about whether and how to deploy LLM-based tools are frequently made without systematic engagement with the full range of stakeholders they affect. In this paper, we argue that understanding stakeholder perceptions of LLM-based systems in the classroom is not a matter of measuring approval or acceptance, but of identifying whose concerns are surfaced, in which contexts, and with what implications for responsible design and governance. We introduce Contextualized Perceptions for the Adoption of LLMs in Education (Co-PALE), a stakeholder-first framework that connects educational context, responsible AI principles, and categories of perception to support more deliberate decision-making about the adoption of LLM-based tools. We ground Co-PALE through a targeted analysis of prior work to diagnose recurring gaps in how stakeholder perceptions are studied, and through contextually distinct educational scenarios that illustrate how the same technology raises different concerns for different stakeholders. We further examine how university faculty and K-12 parents make sense of the framework through focus groups, using their reflections to surface tensions and uncertainties. Co-PALE supports more systematic reasoning about whether, where, and for whom LLM-based tools should be deployed in education.

2605.22081 2026-05-22 cs.CL 版本更新

ArabDiscrim: A Decade-Long Arabic Facebook Corpus on Racism and Discrimination

ArabDiscrim: 一个十年的阿拉伯语Facebook语料库,涉及种族主义和歧视

Wajdi Zaghouani, Shimaa Amer Ibrahim, Mabrouka Bessghaier, Houda Bouamor

Affiliations: Northwestern University in Qatar; Carnegie Mellon University in Qatar

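AI summary: This paper presents ArabDiscrim, a decade-long (2014-2024) lexical resource and corpus of 293,000 public Arabic Facebook posts for studying racism and discrimination. The corpus integrates platform-native engagement signals such as reactions, shares, comments, and page metadata, supporting joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) and 20 discrimination axes capturing identity-based grounds for unequal treatment, along with explicit attribution patterns. Released under a restricted research-use license for ethical compliance, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it lays a foundation for fairness-oriented, platform-aware Arabic NLP.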

Comments Accepted at LREC 2026 Main Conference

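AI abstract (translated from Chinese)

We present ArabDiscrim, a decade-long lexical resource and corpus of 293,000 public Arabic Facebook posts (2014-2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, which enables joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license that complies with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it lays a foundation for fairness-oriented, platform-aware Arabic NLP.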

英文摘要

We present ArabDiscrim, a decade-long lexical resource and corpus of 293K public Arabic Facebook posts (2014-2024) discussing racism and discrimination. Unlike existing Twitter-centric datasets, ArabDiscrim integrates platform-native engagement signals, including reactions, shares, comments, and page metadata, enabling joint analysis of language and audience response. The resource includes 200 curated terms (100 racism-related and 100 discrimination-related) with morphological regex families (13+ inflections per lemma), and 20 discrimination axes capturing identity-based grounds for unequal treatment. It also provides explicit attribution patterns. Released under a restricted research-use license for ethical compliance with platform terms, ArabDiscrim supports weak supervision, axis-aware sampling, and platform ecology research. By bridging lexical depth and ecological validity, it establishes a foundation for fairness-oriented, platform-aware Arabic NLP.

2605.22074 2026-05-22 cs.LG cs.AI cs.CL 版本更新

From Reasoning Chains to Verifiable Subproblems: Curriculum Reinforcement Learning Enables Credit Assignment for LLM Reasoning

从推理链到可验证子问题:课程强化学习使LLM推理能够进行信用分配

Xitai Jiang, Zihan Tang, Wenze Lin, Yang Yue, Shenzhi Wang, Gao Huang

Affiliations: LeapLab, Tsinghua University; Qiuzhen College, Tsinghua University

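AI summary: This study proposes the SCRL framework, which derives verifiable subproblems from reference reasoning chains to address credit assignment in LLM reasoning, improving performance on mathematical reasoning tasks.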

AI中文摘要

基于可验证奖励的强化学习(RLVR)在LLM推理中展现出强大潜力,但基于结果的RLVR在处理难题时效率低下,因为正确的最终答案 rollout 很少且样本层面的信用分配无法利用失败尝试中的部分进展。我们引入SCRL(子问题课程强化学习),一种课程强化学习框架,通过从参考推理链中推导出可验证子问题,并将最终子问题固定为原始问题。这将难题中的部分进展转化为可验证的学习信号。算法上,SCRL使用子问题层面的归一化,每个子问题位置独立归一化奖励,并将结果优势分配给相应的答案片段,使在没有外部评分标准或奖励模型的情况下实现更细粒度的信用分配。我们的分析表明,子问题课程将难题从梯度死亡区中拉出,随着原始问题难度增加,相对收益也更大。在七个数学推理基准测试中,SCRL超越了强大的课程学习基线,使Qwen3-4B-Base的平均准确率比GRPO提高+4.1点,Qwen3-14B-Base提高+1.9点。在AIME24、AIME25和IMO-Bench上,SCRL进一步提高Qwen3-4B-Base的pass@1由+3.7点,pass@64由+4.6点,表明在难题推理任务中探索能力更强。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) has shown strong promise for LLM reasoning, but outcome-based RLVR remains inefficient on hard problems because correct final-answer rollouts are rare and sample-level credit assignment cannot use partial progress in failed attempts. We introduce SCRL (Subproblem Curriculum Reinforcement Learning), a curriculum RL framework that derives verifiable subproblems from reference reasoning chains and fixes the final subproblem as the original problem. This turns partial progress on hard problems into verifiable learning signals. Algorithmically, SCRL uses subproblem-level normalization, which normalizes rewards independently at each subproblem position and assigns the resulting advantages to the corresponding answer spans, enabling finer-grained credit assignment without external rubrics or reward models. Our analysis shows that subproblem curricula lift hard problems out of gradient dead zones, with larger relative gains as the original problem becomes harder. Across seven mathematical reasoning benchmarks, SCRL outperforms strong curriculum-learning baselines, improving average accuracy over GRPO by +4.1 points on Qwen3-4B-Base and +1.9 points on Qwen3-14B-Base. On AIME24, AIME25, and IMO-Bench, SCRL further improves pass@1 by +3.7 points and pass@64 by +4.6 points on Qwen3-4B-Base, indicating better exploration on hard reasoning problems.
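
Illustrative sketch (not from the paper): the abstract says SCRL "normalizes rewards independently at each subproblem position". One way to picture this, assuming a GRPO-style mean/std normalization over a group of rollouts (an assumption, since the abstract does not give the formula), is to normalize each column of a rollout-by-subproblem reward table separately. The name `subproblem_advantages` and the `eps` parameter are hypothetical, and the mapping of advantages onto answer spans is not shown.

```python
from statistics import fmean, pstdev


def subproblem_advantages(rewards, eps=1e-6):
    """Normalize rewards column by column.

    rewards[i][j] is the verifiable reward of rollout i on subproblem j;
    every row is assumed to have the same length.
    """
    num_positions = len(rewards[0])
    advantages = [[0.0] * num_positions for _ in rewards]
    for j in range(num_positions):
        column = [row[j] for row in rewards]
        mean = fmean(column)
        std = pstdev(column)  # population std over the rollout group
        for i, reward in enumerate(column):
            # Assumed GRPO-style normalization, applied per subproblem position.
            advantages[i][j] = (reward - mean) / (std + eps)
    return advantages


# Four rollouts, two subproblems; the last subproblem stands for the original problem.
group = [[1, 0], [1, 0], [1, 1], [1, 0]]
adv = subproblem_advantages(group)
```

In this toy group every rollout solves the first subproblem, so that column carries no signal (all advantages are zero), while the single rollout that also solves the final problem receives a positive advantage at the second position.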

2605.22072 2026-05-22 cs.CL cs.CV 版本更新

Faithful-MR1: Faithful Multimodal Reasoning via Anchoring and Reinforcing Visual Attention

Faithful-MR1: 通过锚定和强化视觉注意力实现忠实的多模态推理

Changyuan Tian, Zhicong Lu, Huaxing Liu, Xiang Wang, Shuai Li, Yu Chen, Wenqian Lv, Zichuan Lin, Juncheng Diao, Deheng Ye

Affiliations: AMAP, Alibaba Group; University of Chinese Academy of Sciences; Tsinghua University; Nanyang Technological University

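AI summary: This paper proposes the Faithful-MR1 framework, which addresses faithfulness in multimodal reasoning by anchoring and reinforcing visual attention, improving model performance on multimodal benchmarks.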

Comments 20 pages, 7 figures, 3 tables. Preprint

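AI abstract (translated from Chinese)

Reinforcement learning with verifiable rewards (RLVR) has become a promising paradigm for advancing complex reasoning in large language models, and recent work extends it to multimodal large language models (MLLMs). This transfer raises a faithfulness challenge with two parts: faithfully perceiving task-relevant visual evidence, and faithfully using that evidence during reasoning. The result is unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision usually operates on textual descriptions rather than directly on image regions, and faithful use is largely overlooked, which exposes a perception-reasoning disconnect in which correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising the attention of a dedicated <Focus> token directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments show that Faithful-MR1 outperforms recent multimodal reasoning baselines on both the Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.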

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a promising paradigm for advancing complex reasoning in large language models, and recent work extends RLVR to multimodal large language models (MLLMs). This transfer, however, surfaces a faithfulness challenge: faithful perception of task-relevant visual evidence and faithful use of that evidence during reasoning, leading to unsatisfactory gains on multimodal benchmarks. Specifically, existing perception supervision often operates on textual descriptions rather than natively on image regions, and faithful use is largely overlooked, exposing the perception-reasoning disconnect where correctly perceived evidence is dropped or contradicted during reasoning. To close these gaps, we propose Faithful-MR1, a training framework that anchors and reinforces visual attention to address both halves of faithful multimodal reasoning. The Anchoring stage turns perception into an explicit pre-reasoning subtask, supervising a dedicated <Focus> token's attention directly against image regions rather than through textual descriptions. The Reinforcing stage exposes faithful use through counterfactual image intervention, rewarding answer-correct trajectories that concentrate visual attention where vision causally matters. Extensive experiments demonstrate that Faithful-MR1 outperforms recent multimodal reasoning baselines on both Qwen2.5-VL-Instruct 3B and 7B backbones while using substantially less training data.

2605.22057 2026-05-22 cs.CL 版本更新

FlyRoute: Self-Evolving Agent Profiling via Data Flywheel for Adaptive Task Routing

FlyRoute: 通过数据飞轮实现自进化代理配置以实现适应性任务路由

Rongjun Li, Ziyu Zhou, Yihang Wu

Affiliations: IT Innovation and Research Center, Huawei Technologies

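AI summary: This paper proposes FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic to improve adaptive task routing.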

Comments 13 pages, 5 figures, 5 tables

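AI abstract (translated from Chinese)

Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars up to date. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: it dispatches candidates, admits successful pairs that pass a quality gate into each agent's success store, periodically distills the evidence into learned capability descriptions, and injects those descriptions, together with BM25-retrieved successes, into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26 percentage points over zero-shot; +11.79 over cold start), with consistent gains across four expert domains, measured by standard routing accuracy on single-gold test queries.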

英文摘要

Enterprise routers assign queries to expert agents, yet deployed profiles stay static while agents evolve (prompts, tools, models), and developers rarely keep descriptions or exemplars current. We present FlyRoute, a self-evolving profiling framework that grows capability evidence from real traffic: dispatch candidates, quality-gate successful pairs into each agent's success store, periodically distill evidence into learned capability descriptions, and inject those descriptions together with BM25-retrieved successes into an LLM router. To make this flywheel data-efficient, FlyRoute introduces a targeted exploration policy that combines profile uncertainty, BM25 relevance, and lexical novelty, prioritizing under-profiled agents only for plausible queries and avoiding redundant evidence collection. In experiments on our proprietary enterprise developer-support dataset of real routed queries, FlyRoute improves a same-backbone zero-shot LLM router from 72.57% to 78.04% with only five seed queries per agent, showing that profile retrieval already strengthens cold-start routing. After streaming 7,211 labeled training queries through the flywheel, accuracy rises to 89.83% (+17.26pp over zero-shot; +11.79pp over cold start), with consistent gains across four expert domains under standard routing accuracy on single-gold test queries.

2605.22035 2026-05-22 cs.CV cs.CL 版本更新

HyLoVQA: Dynamic Hypernetwork-Generated Low-Rank Adaptation for Continual Visual Question Answering

HyLoVQA: 动态超网络生成低秩适应用于连续视觉问答

Yiran Wang, Chenyi Xiong, Ziyue Qin, Miao Zhang, Kui Xiao, Zhifei Li

Affiliations: School of Computer Science, Hubei University, Wuhan 430062, China; Hubei Key Laboratory of Big Data Intelligent Analysis and Application (Hubei University), Wuhan 430062, China; Key Laboratory of Intelligent Sensing System and Security (Hubei University), Ministry of Education, Wuhan 430062, China

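AI summary: HyLoVQA uses a hypernetwork to dynamically generate low-rank adapters, addressing task interference in continual visual question answering and improving adaptation to the current task and object.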

Comments Accepted by IJCAI 2026

AI中文摘要

连续视觉问答(VQA)需要在非稳态的视觉输入和问题流中学习,同时保持过去知识。大多数先前方法通过更新大量共享参数集来适应,这通常导致跨层任务干扰,阻碍对当前任务和对象的准确适应。为了解决这一限制,我们提出了HyLoVQA。它维护一个具有漂移鲁棒性的锚点记忆库。该库存储视觉对象的内容和文本任务的内容,并使用当前输入特征进行更新。基于检索到的锚点,超网络生成轻量级低秩适应(LoRA)适配器。这确保了参数效率,使模型能够动态适应每个任务和对象。此外,我们提出了一个对齐损失,将特征空间中的语义差异与参数空间中的功能变化对齐,从而约束LoRA适配器保持专注于当前任务和对象。在VQA v2和NExT-QA上广泛实验表明,HyLoVQA在标准和组合设置下优于先前最先进的方法。

英文摘要

Continual Visual Question Answering (VQA) requires learning from non-stationary streams of visual inputs and questions while preserving past knowledge. Most prior methods adapt by updating a largely shared parameter set. This often leads to cross-level task interference, hindering accurate adaptation to the current task and object. To address this limitation, we propose HyLoVQA. It maintains a drift-resilient memory bank of anchors. The bank stores the content of visual objects and textual tasks, and they are updated using current input features. Conditioned on retrieved anchors, a hypernetwork generates lightweight Low-Rank Adaptation (LoRA) adapters. This ensures parameter efficiency, allowing the model to adapt to each task and object dynamically. Additionally, we formulate an alignment loss that aligns semantic discrepancies in the feature space with functional changes in the parameter space, thereby constraining LoRA adapters to remain focused on the current task and object. Extensive experiments on VQA v2 and NExT-QA under both standard and compositional settings demonstrate the superiority of HyLoVQA over prior state-of-the-art methods.

2605.22012 2026-05-22 cs.CL cs.CV 版本更新

LatentOmni: Rethinking Omni-Modal Understanding via Unified Audio-Visual Latent Reasoning

LatentOmni: 通过统一的音频-视觉潜在推理重新思考多模态理解

Yifan Dai, Zhenhua Wu, Bohan Zeng, Daili Hua, Jialing Liu, Bozhou Li, Yuran Wang, Chengzhuo Tong, Hao Liang, Xiaochen Ma, Junbo Niu, Tianyu Guo, Yang Shi, Yue Ding, Yiyan Ji, Bingyin Mei, Yushuo Guan, Yuanxing Zhang, Pengfei Wan, Fangcheng Fu, Wentao Zhang

Affiliations: School of AI, Shanghai Jiao Tong University; Kling Team, Kuaishou Technology; Peking University; HKUST; CASIA; Nanjing University; Renmin University of China; Tsinghua University

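AI summary: This paper proposes LatentOmni, a framework that performs omnimodal reasoning in a unified audio-visual latent space, using feature-level supervision and Omni-Sync Position Embedding to maintain temporal consistency. It achieves the best performance among the evaluated open-source models on multiple audio-visual reasoning benchmarks.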

Comments 21 pages, 15 figures

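AI abstract (translated from Chinese)

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, which weakens temporal grounding and biases intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Building on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features, and uses Omni-Sync Position Embedding (OSPE) to keep latent audio and visual states temporally consistent. We further construct LatentOmni-Instruct-35K, a dataset of interleaved audio-visual reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation on multiple audio-visual reasoning benchmarks shows that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.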

英文摘要

Joint audio-visual reasoning is essential for omnimodal understanding, yet current multimodal large language models (MLLMs) still struggle when reasoning requires fine-grained evidence from both modalities. A central limitation is that explicit text-based chain-of-thought (CoT) compresses continuous audio-visual signals into discrete tokens, weakening temporal grounding and shifting intermediate reasoning toward language priors. We argue that a unified latent space is a better medium for such reasoning because it preserves dense sensory information while remaining compatible with autoregressive generation. Based on this insight, we propose LatentOmni, a cross-modal reasoning framework that interleaves textual reasoning with audio-visual latent states. LatentOmni introduces feature-level supervision to align latent reasoning states with task-relevant sensory features and uses Omni-Sync Position Embedding (OSPE) to maintain temporal consistency between latent audio and visual states. We further construct LatentOmni-Instruct-35K, a dataset of audio-visual interleaved reasoning trajectories for supervising latent-space reasoning. Comprehensive evaluation across multiple audio-visual reasoning benchmarks demonstrates that LatentOmni achieves the best performance among the evaluated open-source models and consistently outperforms the Explicit Text CoT baseline, supporting latent-space joint reasoning as a promising path toward stronger omnimodal understanding.

2605.22007 2026-05-22 cs.CL 版本更新

Hallucination as Commitment Failure: Larger LLMs Misfire Despite Knowing the Answer

幻觉作为承诺失败:更大的LLM在知道答案的情况下仍会出错

Jewon Yeom, Jaewon Sok, Heejun Kim, Seonghyeon Park, Jeongjae Park, Taesup Kim

Affiliations: Graduate School of Data Science; Department of Rural Systems Engineering; Electrical Engineering and Computer Science; Department of Aerospace Engineering

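AI summary: This paper studies hallucinations that occur in larger LLMs even when the model knows the correct answer. It finds that what determines hallucination is how the probability mass on the correct concept is distributed at generation time, not whether the correct concept is represented.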

AI中文摘要

幻觉通常被视为知识缺失的直接后果:当正确答案不在生成时的分布中,模型会错误回答;当正确答案存在时,模型会正确回答。我们通过引入一个语义上的答案可用性概念,聚合表达相同答案概念的token级变体,检验这一假设。在Qwen和Llama模型(0.8B至72B,包括Instruct和Base版本)中,16-47%的Instruct幻觉在模型承诺回答时已有显著概率质量在正确概念上,且随着规模增加,此比例单调上升。将此类失败与具有匹配语义支持的正确生成进行比较,发现区别不在于是否表示正确概念,而在于概率分布方式:正确生成将质量集中于单一表层形式,幻觉则将其分散到多个替代选项中。这种锐化不对称性在多token生成中也延伸,并在预生成隐藏状态中可检测到。这些结果识别出单一机制:指令微调通过规模锐化答案承诺,使有用性和自信幻觉成为同一底层倾向的两种结果。

英文摘要

Hallucination is often viewed as a direct consequence of missing knowledge: a model answers incorrectly when the correct answer is absent from its generation-time distribution, and correctly when it is present. We test this assumption by introducing a semantic notion of answer availability that aggregates token-level variants expressing the same answer concept, and asks whether the correct concept is already available at the moment the model commits to an answer. Across Qwen and Llama models from 0.8B to 72B in both Instruct and Base variants, 16-47% of Instruct hallucinations occur with substantial probability mass already on the correct concept, and the rate rises monotonically with scale. Comparing such failures against correct generations with matched semantic support, the distinguishing factor is not whether the correct concept is represented, but how its probability is distributed: correct generations concentrate mass on a single surface form, hallucinations disperse it across alternatives. The same sharpening asymmetry extends across multi-token generation and is detectable in pre-generation hidden states. Together, these results identify a single mechanism: instruction tuning sharpens answer commitment with scale, making helpfulness and confident hallucination two consequences of the same underlying disposition.
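
Illustrative sketch (not from the paper): the abstract describes a semantic notion of answer availability that aggregates token-level variants of the same answer concept, and contrasts mass concentrated on one surface form with mass dispersed across alternatives. A toy version, assuming we already have next-token probabilities and a hand-written list of variants, might look like the following. The function `answer_availability`, the "share of the top surface form" statistic, and the example numbers are all hypothetical; the paper's actual measures are not given in the abstract.

```python
def answer_availability(token_probs, concept_variants):
    """Return (total mass on the concept, share held by its top surface form)."""
    masses = [token_probs.get(variant, 0.0) for variant in concept_variants]
    total = sum(masses)
    # Concentration near 1.0 means one surface form dominates;
    # lower values mean the mass is spread across variants.
    concentration = max(masses) / total if total > 0 else 0.0
    return total, concentration


# Made-up next-token distribution for a question whose answer is Paris.
probs = {"Paris": 0.30, " Paris": 0.05, "paris": 0.05, "Lyon": 0.45}
mass, share = answer_availability(probs, ["Paris", " Paris", "paris"])
```

In this made-up case the correct concept holds about 0.40 of the probability mass in total, yet the single most likely token is the wrong answer, which is the kind of situation the abstract describes as a failure of commitment rather than of knowledge.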

2605.22003 2026-05-22 cs.CL cs.AI cs.IR cs.LG 版本更新

From TF-IDF to Transformers: A Comparative and Ensemble Approach to Sentiment Classification

从TF-IDF到Transformer:一种比较和集成的方法用于情感分类

Dip Biswas Shanto, Mitali Yadav, Prajwal Panth, Suresh Chandra Satapathy

Affiliations: School of Computer Engineering, KIIT Deemed to be University

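AI summary: This paper compares several models for sentiment classification of movie reviews, including Naive Bayes, Logistic Regression, SVM, LightGBM, LSTM, and the transformer-based RoBERTa and DistilBERT. RoBERTa achieves the best accuracy, and a soft voting ensemble of all models also improves classification performance.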

Comments 6 pages, 9 figures. This is the author's accepted manuscript, presented at the International Conference on Intelligent Computing, Networks and Security (IC-ICNS 2026), March 26-28, Bhubaneswar, India. Proceedings publication pending

AI中文摘要

情感分析,也称为观点挖掘,主要试图从任何基于文本的数据中提取观点。在电影评论和评论员的背景下,情感分析可以成为预测电影评论总体是积极还是消极的有用工具。对于ML模型来说,理解上下文或隐喻性情感可能具有挑战性,因为ML模型主要依赖统计词表示。本文的目标是检验并分类电影评论为积极或消极情感。为此考虑了多种机器学习模型,并运用自然语言处理(NLP)方法进行数据预处理和模型评估。使用IMDb数据集。具体来说,评估了Naive Bayes、逻辑回归、支持向量机(SVM)、LightGBM、LSTM以及基于Transformer的RoBERTa和DistilBERT等模型。经过大量测试,使用准确率、精确率、召回率、F1分数和ROC-AUC后,RoBERTa在所有其他模型之上表现更好,准确率为93.02%。一个结合所有模型的软投票集成方法也提高了分类性能,表明模型集成在情感分析中效果良好。

英文摘要

Sentiment analysis, also referred to as opinion mining, aims to extract opinions from text data. In the context of movie reviews and critics, sentiment analysis can help predict whether a movie review is generally positive or negative. ML models can struggle to capture context or metaphorical sentiment accurately, because they rely largely on statistical word representations. The objective of this paper is to classify movie reviews as positive or negative. Several machine learning models are considered, and Natural Language Processing (NLP) methods are used for data preprocessing and model assessment. The IMDb dataset is used. Specifically, Naive Bayes, Logistic Regression, Support Vector Machines (SVM), LightGBM, LSTM, and the transformer-based models RoBERTa and DistilBERT were evaluated. After extensive testing using accuracy, precision, recall, F1-score, and ROC-AUC, RoBERTa performed better than all the other models, with an accuracy of 93.02%. A soft voting ensemble that combined all the models also improved classification performance, showing that model ensembling works well for sentiment analysis.
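
Illustrative sketch (not from the paper): a soft voting ensemble averages the class probabilities predicted by each model and picks the class with the highest average, instead of counting one vote per model. The helper `soft_vote`, the optional `weights` argument, and the probabilities below are hypothetical; the paper's own ensemble configuration is not described in the abstract.

```python
def soft_vote(prob_rows, weights=None):
    """Average per-model class probabilities; return (winning class, averaged probs)."""
    if weights is None:
        weights = [1.0] * len(prob_rows)
    total_weight = sum(weights)
    num_classes = len(prob_rows[0])
    averaged = [
        sum(w * row[c] for w, row in zip(weights, prob_rows)) / total_weight
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=averaged.__getitem__)
    return best, averaged


# Three hypothetical models scoring one review as [P(negative), P(positive)].
label, avg = soft_vote([[0.90, 0.10], [0.45, 0.55], [0.45, 0.55]])
```

Here two of the three models lean slightly positive, so a hard majority vote would say "positive", but the one confident model pulls the averaged probabilities to roughly 0.60 negative, so soft voting returns class 0. This is the practical difference between the two schemes.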

2605.22001 2026-05-22 cs.CR cs.AI cs.CL 版本更新

Blind Spots in the Guard: How Domain-Camouflaged Injection Attacks Evade Detection in Multi-Agent LLM Systems

守卫中的盲区:如何域伪装注入攻击在多智能体大语言模型系统中逃避检测

Aaditya Pai

Affiliations: Data Science Institute, Columbia University

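AI summary: This paper studies how domain-camouflaged injection attacks evade detection in multi-agent LLM systems by mimicking the domain vocabulary and authority structures of the target document. It quantifies the difference in detection rate between static and camouflaged payloads (the Camouflage Detection Gap, CDG), and shows that multi-agent debate architectures amplify static injection attacks and that detector augmentation is only partially effective.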

Comments 8 pages, 3 figures, 2 tables. Submitted to EMNLP 2026 ARR cycle

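AI abstract (translated from Chinese)

Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that openly announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, which we call domain-camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant ($\chi^2 = 38.03$, $p < 0.001$ for Llama; $\chi^2 = 17.05$, $p < 0.001$ for Gemini), with no reverse discordant pairs in either case. We also evaluate Llama Guard 3, a production safety classifier, which detects none of the camouflaged payloads (a detection rate of 0.000), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting that for weaker models the vulnerability is architectural rather than incidental. Our framework, task bank, and payload generator are publicly released.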

英文摘要

Injection detectors deployed to protect LLM agents are calibrated on static, template-based payloads that announce themselves as override directives. We identify a systematic blind spot: when payloads are generated to mimic the domain vocabulary and authority structures of the target document, what we call domain-camouflaged injection, standard detectors fail to flag them, with detection rates dropping from 93.8% to 9.7% on Llama 3.1 8B and from 100% to 55.6% on Gemini 2.0 Flash. We formalize this as the Camouflage Detection Gap (CDG), the difference in injection detection rate between static and camouflaged payloads. Across 45 tasks spanning three domains and two model families, CDG is large and statistically significant ($\chi^2 = 38.03$, $p < 0.001$ for Llama; $\chi^2 = 17.05$, $p < 0.001$ for Gemini), with zero reverse discordant pairs in either case. We additionally evaluate Llama Guard 3, a production safety classifier, which detects zero camouflage payloads ($\mathrm{IDR}_{\mathrm{camouflage}} = 0.000$), confirming that the blind spot extends beyond few-shot detectors to dedicated safety classifiers. We further show that multi-agent debate architectures amplify static injection attacks by up to 9.9x on smaller models, while stronger models show collective resistance. Targeted detector augmentation provides only partial remediation (10.2% improvement on Llama, 78.7% on Gemini), suggesting the vulnerability is architectural rather than incidental for weaker models. Our framework, task bank, and payload generator are released publicly.
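
Worked example (a sketch under stated assumptions): the abstract defines CDG as the difference in injection detection rate between static and camouflaged payloads. Assuming the gap is a plain subtraction of the two reported rates, expressed in percentage points (the abstract does not state the unit or any normalization), the reported figures imply the gaps computed below. The function name `camouflage_detection_gap` is hypothetical.

```python
def camouflage_detection_gap(idr_static, idr_camouflage):
    """CDG = detection rate on static payloads minus rate on camouflaged payloads."""
    return idr_static - idr_camouflage


# Detection rates (in %) as reported in the abstract: (static, camouflaged).
reported = {
    "Llama 3.1 8B": (93.8, 9.7),
    "Gemini 2.0 Flash": (100.0, 55.6),
}

# Gap per model, rounded to one decimal place (assumed unit: percentage points).
gaps = {
    model: round(camouflage_detection_gap(static, camo), 1)
    for model, (static, camo) in reported.items()
}
```

Under that reading the gap is about 84.1 points for Llama 3.1 8B and about 44.4 points for Gemini 2.0 Flash.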

2605.21984 2026-05-22 cs.AI cs.CL 版本更新

Echo: Learning from Experience Data via User-Driven Refinement

Echo:通过用户驱动的细化学习经验数据

Hande Dong, Xiaoyun Liang, Jiarui Yu, Jiayi Lin, Changqing Ai, Feng Liu, Wenjun Zhang, Rongbi Wei, Chaofan Zhu, Linjie Che, Feng Wu, Xin Shen, Dexu Kong, Xiaotian Wang, Qiuyuan Chen, Bingxu An, Yueting Lei, Qiang Lin

发表机构 * Core Contributors(核心贡献者) Qiang Lin is the team leader(Qiang Lin 是团队负责人)

AI总结 本文提出Echo框架,通过用户驱动的细化过程将原始经验数据转化为可学习的知识,提升模型性能,实验表明其能将接受率从25.7%提升至35.7%。

详情
AI中文摘要

静态的'人类数据'面临固有局限:扩展成本高且受制于创造者知识。持续学习'经验数据'——智能体与其环境的交互——有望超越这些障碍。如今,AI智能体的广泛应用使我们能够以低成本获取大量真实世界经验数据。然而,原始交互日志本质上嘈杂,充满试错和低信息密度,使其不适合直接用于模型训练。我们引入Echo,一个通用框架,旨在将原始经验转化为可学习的知识,有效将环境反馈回训练循环以优化模型。在当今智能体生态系统中,用户细化是主要的反馈来源:出于对结果的责任感,用户严格地将缺陷智能体提案转化为已验证的解决方案。这些用户驱动的细化序列本质上将智能体的粗略尝试提炼为高质量的训练信号。Echo系统性地收集这些信号,持续使智能体与真实世界需求对齐。在大规模生产代码补全环境中的验证表明,Echo有效利用这一流程,打破静态性能上限,将接受率从25.7%提升至35.7%。

英文摘要

Static "human data" faces inherent limitations: it is expensive to scale and bounded by the knowledge of its creators. Continuous learning from "experience data" - interactions between agents and their environments - promises to transcend these barriers. Today, the widespread deployment of AI agents grants us low-cost access to massive streams of such real-world experience. However, raw interaction logs are inherently noisy, filled with trial-and-error and low information density, rendering them inefficient for direct model training. We introduce Echo, a generalized framework designed to operationalize the transition from raw experience to learnable knowledge, effectively "echoing" environmental feedback back into the training loop for model optimization. In today's agent ecosystem, user refinement serves as a primary source of such feedback: driven by responsibility for the outcome, users rigorously transform flawed agent proposals into verified solutions. These user-driven refinement sequences inherently distill agents' crude attempts into high-quality training signals. Echo systematically harvests these signals to continuously align the agent with real-world needs. Large-scale validation in a production code completion environment confirms that Echo effectively harnesses this pipeline, breaking the static performance ceiling by increasing the acceptance rate from 25.7% to 35.7%.

2605.21965 2026-05-22 cs.CL 版本更新

SpecHop: Continuous Speculation for Accelerating Multi-Hop Retrieval Agents

SpecHop:连续推测用于加速多跳检索代理

Mehrdad Saberi, Keivan Rezaei, Soheil Feizi

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 本文研究如何在不改变最终轨迹的情况下加速多跳工具使用过程,提出了一种连续推测框架SpecHop,通过维护多个推测线程和异步验证预测观测来减少延迟。

详情
AI中文摘要

大型语言模型越来越多地使用网络搜索、文档检索等外部工具来解决信息密集型任务。然而,复杂任务中的多跳工具使用会带来显著延迟,因为模型必须反复等待工具观测结果才能继续。我们研究如何在可以使用更快但可靠性较低的推测工具的前提下加速此类轨迹,同时不改变模型在未加速时本应得到的最终轨迹。我们为多跳工具使用场景中的无损推测建立了理论框架,刻画了可达到的最优延迟增益。我们提出SpecHop,一种连续推测框架:它维护多个推测线程,在目标工具输出到达时异步验证预测的观测,提交正确的分支并回滚错误的分支。这在保持准确性的同时降低了实际(挂钟)延迟。我们证明,只要活跃线程足够多,SpecHop可以逼近oracle延迟增益。在实验上,对于检索增强的多跳任务,SpecHop与理论预测高度吻合,并在某些设置下将延迟降低多达40%。代码:https://github.com/mehrdadsaberi/spechop

英文摘要

Large language models increasingly use external tools such as web search and document retrieval to solve information-intensive tasks. However, multi-hop tool use in complex tasks introduces substantial latency, since the model must repeatedly wait for tool observations before continuing. We study how to accelerate such trajectories without changing the final trajectory the model would have taken without acceleration, assuming access to faster but less reliable speculator tools. We develop a theoretical framework for lossless speculation in multi-hop tool-use settings, characterizing the optimal achievable latency gain. We propose SpecHop, a continuous speculation framework that maintains multiple speculative threads, verifies predicted observations asynchronously as target tool outputs arrive, commits correct branches, and rolls back incorrect ones. This preserves accuracy while reducing wall-clock latency. We show that SpecHop can approach oracle latency gains with enough active threads. Empirically, on retrieval-augmented multi-hop tasks, SpecHop closely matches theoretical predictions and reduces latency by up to 40% in some settings. Code: https://github.com/mehrdadsaberi/spechop
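
The commit-and-rollback idea can be made concrete with a toy latency model. This is a simplified sketch, not the SpecHop implementation: it assumes one tool call per hop, a speculator that is always faster than the target tool, enough threads that verification never blocks, and fixed per-step costs. All names and numbers are hypothetical.

```python
def sequential_latency(n_hops, tool_ms, gen_ms):
    """Baseline: every hop waits for the target tool, then the model generates."""
    return n_hops * (tool_ms + gen_ms)


def speculative_latency(spec_correct, tool_ms, spec_ms, gen_ms):
    """Latency when the model continues from speculated observations.

    spec_correct[i] says whether the speculated observation for hop i matched
    the target tool's output. A match lets the next step start after spec_ms;
    a mismatch rolls back and redoes the step from the verified observation.
    Assumes spec_ms < tool_ms.
    """
    action_time = 0  # time at which the tool call for the current hop is issued
    finish = 0
    for ok in spec_correct:
        verified_at = action_time + tool_ms  # target tool output arrives
        if ok:
            next_ready = action_time + spec_ms + gen_ms  # speculative step
            finish = max(next_ready, verified_at)  # commit waits for verification
        else:
            next_ready = verified_at + gen_ms  # rollback, regenerate the step
            finish = next_ready
        action_time = next_ready
    return finish


baseline = sequential_latency(3, tool_ms=1000, gen_ms=100)
all_hits = speculative_latency([True, True, True], 1000, 200, 100)
all_misses = speculative_latency([False, False, False], 1000, 200, 100)
mixed = speculative_latency([True, False, True], 1000, 200, 100)
print(baseline, all_hits, mixed, all_misses)
```

In this model a run where every speculation misses costs the same as the sequential baseline, which reflects the lossless property: the committed trajectory is always the one built from verified observations, and speculation only changes when work starts.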

2605.21958 2026-05-22 cs.CL 版本更新

Diagnosis Is Not Prescription: Linguistic Co-Adaptation Explains Patching Hazards in LLM Pipelines

诊断并非处方:语言共适应解释了LLM流水线中的修补危害

Yoon Jeonghun, Kim Dongchan

发表机构 * KAIST (Korea Advanced Institute of Science and Technology)(韩国科学技术院) NAVER Corp.(NAVER公司)

AI总结 本文研究了多模块LLM代理失败时,诊断与修补之间的矛盾,发现路由模块虽为瓶颈,但注入修正示例反而降低性能,而修正查询重写模块则更有效,提出了语言合同假说解释这种现象。

Comments Preprint. Under review at EMNLP 2026 (ARR)

详情
AI中文摘要

当多模块LLM代理失效时,最负责失败的模块未必是最佳干预地点。我们通过实证展示了这种诊断悖论:因果分析一致地将路由模块——选择下一步调用的工具——识别为三个独立代理家族中的主要瓶颈。然而,将提示级修正示例注入此模块会持续降低性能,有时甚至严重。相反,修补上游的查询重写模块则能可靠地改善结果。这种效果在两个代理家族中具有统计显著性,在第三个家族中表现出方向一致性;在路由模块的替代修补策略(指令重写、模型升级)则无明显影响,证实了危害仅特定于修正注入修补。我们通过语言合同假说解释这种不对称性:每个下游模块隐式地适应其上游的特征错误分布,因此修正瓶颈会打破这种隐式对齐,而上游修正不会。我们通过基于诊断的每代理共适应度度量来操作化这一概念,并展示其在所有代理家族中与修补危害一致相关:较高的共适应度与危害相关,较低的与安全性相关。这一趋势在所有三个代理家族中均成立,为假说提供了初步支持,超越了单代理观察。

英文摘要

When a multi-module LLM agent fails, the module most responsible for the failure is not necessarily the best place to intervene. We demonstrate this Diagnostic Paradox empirically: causal analysis consistently identifies the routing module -- which selects which tool to call next -- as the primary bottleneck across three independent agent families. Yet injecting prompt-level correction examples into this module consistently degrades performance, sometimes severely. Patching an upstream query-rewriting module instead reliably improves outcomes. The effect holds with statistical significance on two agent families and directional consistency on a third; alternative repair strategies at the routing module (instruction rewriting, model upgrade) are neutral, confirming that the harm is specific to correction-injection patching. We explain this asymmetry through the Linguistic Contract hypothesis: each downstream module implicitly adapts to its upstream's characteristic error distribution, so correcting the bottleneck breaks this implicit alignment in a way that upstream corrections do not. We operationalize this via a per-agent co-adaptation measure, derived from diagnosis alone, and show it is consistently associated with patching harm across agent families: higher co-adaptation co-occurs with harm, lower with safety. This trend holds across all three agent families, providing preliminary support for the hypothesis beyond a single-agent observation.

2605.21949 2026-05-22 cs.CL 版本更新

Claim-Selective Certification for High-Risk Medical Retrieval-Augmented Generation

面向高风险医疗检索增强生成的声明选择性认证

Shao Kan

发表机构 * Jinglue Technology Development (Nanjing) Co., Ltd.(Jinglue 技术发展(南京)有限公司)

AI总结 本文研究了高风险医疗问答场景中检索增强生成系统中声明选择性认证问题,通过将响应分解为可验证的声明并根据检索证据评分,结合意图感知选择器映射到{完整、部分、冲突、回避},在弱标签证书协议上实现了高准确率的认证结果。

Comments 22 pages, 7 figures, 11 tables

详情
AI中文摘要

在高风险问答场景中,医疗RAG系统通常通过单一的“回答或回避”决策来评估,但混合证据可能支持一个声明、要求为另一个声明附加条件,并与第三个声明相矛盾。我们研究声明选择性认证:每个回答被分解为可验证的声明,依据检索到的证据评分,并由意图感知选择器映射到{完整、部分、冲突、回避}。在主要的弱标签证书协议上(其仅含真实来源的开发/测试行覆盖了自然出现的非回避动作),完整系统在开发集(n=314)上取得UCCR=0.0000、PAU=1.0000、PAU Precision=0.9901、动作准确率=0.9204,在测试集(n=319)上取得UCCR=0.0000、PAU=0.9967、PAU Precision=0.9739、动作准确率=0.8997。UCCR衡量证书定义范围内的无支持声明风险,而来源缺失的反事实切片则评估空证据下的回避行为。捷径对照量化了可由来源和意图元数据解释的动作标签先验,而来源/证据新颖切片则刻画了迁移边界。由此得到的接口在混合证据下将动作标签预测与证据关联的声明选择区分开来。

英文摘要

Medical RAG systems in high-risk QA settings are often evaluated through a single answer-or-abstain decision, but mixed evidence may support one claim, require conditions for another, and contradict a third. We study claim-selective certification: each response is decomposed into verifiable claims, scored against retrieved evidence, and mapped by an intent-aware selector to {full, partial, conflict, abstain}. On the primary weak-label certificate protocol, whose real-source-only dev/test rows cover the naturally occurring non-abstain actions, the full system records UCCR=0.0000, PAU=1.0000, PAU Precision=0.9901, and action accuracy=0.9204 on dev (n=314), and UCCR=0.0000, PAU=0.9967, PAU Precision=0.9739, and action accuracy=0.8997 on test (n=319). UCCR measures unsupported-claim risk within the certificate definition, and a source-missing counterfactual slice evaluates abstain under empty evidence. Shortcut controls quantify the action-label prior explained by source and intent metadata, while source/evidence-novel slices characterize transfer boundaries. The resulting interface separates action-label prediction from evidence-linked claim selection under mixed evidence.
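
The mapping from per-claim verdicts to one of the four actions might look like the sketch below. The rule set is an assumption made for illustration: the abstract does not spell out the selector's logic, and the intent-aware part is omitted here. The `Verdict` labels and `select_action` are hypothetical names.

```python
from enum import Enum


class Verdict(Enum):
    """Outcome of scoring one claim against the retrieved evidence."""
    SUPPORTED = "supported"
    CONDITIONAL = "conditional"    # holds only under stated conditions
    CONTRADICTED = "contradicted"
    UNSUPPORTED = "unsupported"    # no evidence either way


def select_action(verdicts):
    """Map a response's claim verdicts to full / partial / conflict / abstain."""
    usable = [v for v in verdicts if v is not Verdict.UNSUPPORTED]
    if not usable:
        return "abstain"    # empty evidence or nothing certifiable
    if Verdict.CONTRADICTED in usable:
        return "conflict"   # surface the contradiction rather than hide it
    if all(v is Verdict.SUPPORTED for v in verdicts):
        return "full"       # every claim is backed by evidence
    return "partial"        # keep only the claims that can be certified


mixed = [Verdict.SUPPORTED, Verdict.CONDITIONAL, Verdict.CONTRADICTED]
print(select_action(mixed))
```

Returning "abstain" for an empty verdict list mirrors the source-missing case, where no evidence is available to certify any claim.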

2605.21902 2026-05-22 cs.AI cs.CL 版本更新

Planning in the LLM Era: Building for Reliability and Efficiency

在大语言模型时代进行规划:构建可靠性与效率

Michael Katz, Harsha Kokel, Kavitha Srinivas, Shirin Sohrabi

AI总结 本文探讨了在大语言模型时代规划领域的发展,重点介绍了通过生成可验证的符号求解器来提高规划的可靠性和效率的方法。

Comments Published at ICAPS 2026

详情
AI中文摘要

对智能体日益增长的关注,使其核心能力之一(规划)成为焦点。早期利用大语言模型(LLMs)进行规划的尝试依赖单次计划生成,随后出现了将LLMs与有限外部搜索相结合的混合方法。这些方法本质上既不可靠(unsound)也不完备(incomplete),往往消耗大量资源,却无法在未见过的问题上给出更好的解。随着LLMs的局限性日益清晰,近期工作转向在解决方案构建阶段使用LLMs:为一类问题生成符号求解器,这些求解器可以被验证,随后在推理时高效使用。这一趋势反映了对既可靠又资源高效的智能体日益增长的需求。它也提供了一条生成可维护规划器的路径,使推理时对语言模型的依赖降到最低。在本文中,我们认为这一转变反映了LLM时代规划领域更广泛的重新调整。我们考察了三大类规划器生成方法,讨论了它们当前的局限性,并概述了迈向更可靠、更高效的基于LLM的规划器生成的研究步骤。

英文摘要

Growing attention to intelligent agents has put a spotlight on one of their central capabilities: planning. Early attempts to leverage large language models (LLMs) for planning relied on single-shot plan generation, followed by hybrid approaches that coupled LLMs with limited external search. These methods, unsound and incomplete by their very nature, often require substantial resources without yielding better solutions on unseen problems. As the limitations of LLMs become clearer, recent work has shifted toward using them at solution construction time -- generating symbolic solvers for a family of problems that can be verified and then used efficiently at inference time. This trend reflects the growing need for agents that are both reliable and resource-efficient. It also offers a path towards generating maintainable planners with minimal dependence on language models at inference time. In this paper, we argue that this shift reflects a broader realignment of the planning field in the LLM era. We examine three major categories of planner-generation methods, discuss their current limitations, and outline research steps towards a more reliable and efficient LLM-based generation of planners.

2605.21858 2026-05-22 cs.CL 版本更新

Hypergraph as Language

超图作为语言

Mengqi Lei, Guohuan Xie, Shihui Ying, Shaoyi Du, Jun-Hai Yong, Siqi Li, Yue Gao

发表机构 * Tsinghua University(清华大学) Yangtze Delta Region Institute(长江三角洲研究院) Shanghai Institute of Applied Mathematics and Mechanics(上海应用数学和力学研究所) State Key Laboratory of Human-Machine Hybrid Augmented Intelligence(人机混合增强智能国家重点实验室) National Engineering Research Center for Visual Information and Applications(视觉信息与应用国家工程研究中心) Institute of Artificial Intelligence and Robotics(人工智能与机器人研究所)

AI总结 本文提出了一种基于超图的语言模型对齐框架Hyper-Align,通过将超图结构转换为可被大语言模型理解的超图令牌,以更有效地处理高阶关联关系,从而在结构建模任务中取得显著优势。

详情
AI中文摘要

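大型语言模型(LLMs)最近在建模关系结构方面展现出了强大的潜力。然而,现有方法本质上仍以图为中心:它们专注于将成对图结构处理为LLMs可以理解的令牌。相比之下,许多现实世界的关系模式并不自然地符合成对边的假设,更适合用超图中的高阶关联来建模。对于超图结构,现有方法往往无法保留“多个对象由同一高阶关系共同连接”这一原生语义,限制了其利用复杂结构的能力。为了解决这一限制,我们提出了“超图作为语言”(Hypergraph as Language)的视角,并提出Hyper-Align,一个面向大语言模型的超图原生对齐框架。Hyper-Align将以查询对象为中心的超图上下文编译为基础LLM可直接使用的超图令牌。具体而言,我们引入带概览的超图关联细节模板(HIDT-O),它将高阶关联结构序列化为固定形状的混合模板,结合局部关联细节与概览级摘要。随后我们设计了超图关联投影器(HIP),通过显式的语义-结构解耦以及顶点与超边之间的双向消息传递,将原生高阶关联结构映射到LLM令牌空间。我们进一步定义了具体的“超图作为语言”输入协议,将超图令牌与文本提示共同输入冻结的基础LLM,在统一的问答范式下同时支持顶点级和超边级任务。为了系统地评估不同方法的超图结构建模能力,我们引入了HyperAlign-Bench。大量实验表明,Hyper-Align在域内和零样本评估中均显著优于现有方法。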

英文摘要

Large language models (LLMs) have recently shown strong potential in modeling relational structures. However, existing approaches remain fundamentally graph-centric: they focus on processing pairwise graph structures into tokens that LLMs can understand. In contrast, many real-world relational patterns do not naturally conform to the pairwise-edge assumption, and are better modeled as high-order associations in hypergraphs. For hypergraph structures, existing methods often fail to preserve the native semantics that multiple objects are jointly connected by the same high-order relation, limiting their ability to exploit complex structures. To address this limitation, we put forth the "Hypergraph as Language" perspective and propose Hyper-Align, a hypergraph-native alignment framework for large language models. Hyper-Align compiles the query-object-centered hypergraph context into hypergraph tokens directly consumable by a base LLM. Specifically, we introduce Hypergraph Incidence Detail Template with Overview (HIDT-O), which serializes high-order association structures into a fixed-shape hybrid template combining local incidence details and overview-level summaries. We then design a Hypergraph Incidence Projector (HIP), which maps native high-order incidence structures into the LLM token space through explicit semantic-structural decoupling and bidirectional message passing between vertices and hyperedges. We further define a concrete Hypergraph-as-Language input protocol, which jointly feeds hypergraph tokens and textual prompts into a frozen base LLM, supporting both vertex-level and hyperedge-level tasks under a unified question-answering paradigm. To systematically evaluate different methods in hypergraph structural modeling, we introduce HyperAlign-Bench. Extensive experiments show that Hyper-Align significantly outperforms existing methods across in-domain and zero-shot evaluations.

2605.21849 2026-05-22 cs.LG cs.CL 版本更新

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

几何自适应解释器:面向分布偏移下忠实的基于字典的可解释性

Sungjun Lim, Heedong Kim, Andrew Lee, Kyungwoo Song

发表机构 * Yonsei University(延世大学) Harvard University(哈佛大学)

AI总结 本文提出了一种几何适应解释器(GAE),用于在分布偏移下提高基于字典的可解释性。通过重新对齐解释器的字典与偏移活跃子空间,同时保持原始特征结构,GAE在无监督的情况下减少了分布偏移下的忠实性差距。

详情
AI中文摘要

机制可解释性旨在通过识别具有因果作用的内部结构来解释模型的行为。稀疏自编码器、转码器等基于字典的解释器是主要工具,但它们在分布外(OOD)偏移下的忠实性很少受到系统性关注。我们证明,分布偏移会旋转模型实际使用的子空间,使在分布内(ID)激活上训练得到的解释器字典出现失配。我们将这种失配形式化为忠实性差距,即ID字典与OOD活跃子空间之间的几何距离,并证明它决定了OOD忠实性的退化程度。为缩小这一差距,我们提出几何自适应解释器(GAE),它在保持原始特征结构的同时,将解释器的字典与OOD活跃子空间重新对齐。这只需要无标注的OOD激活,不需要梯度更新。我们证明GAE优于未经适配的ID解释器,其超额损失以二阶矩偏移的二次函数为上界。在实验上,GAE在多个模型和OOD设置下的因果忠实性达到甚至超过了所有基于训练的基线。

英文摘要

Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.

2605.21845 2026-05-22 cs.CL cs.AI 版本更新

Comparing LLM and Fine-Tuned Model Performance on NVDRS Circumstance Extraction with Varying Prompt Complexity

对比LLM和微调模型在不同提示复杂度下的NVDRS情境提取性能

Geoffrey Martin, Xuan Zhong Feng, Yifan Peng

发表机构 * Department of Population Health Sciences, Weill Cornell Medicine(人群健康科学系,威尔·康奈尔医学学院) Systems Engineering, Cornell University(系统工程,康奈尔大学)

AI总结 本文比较了LLM与微调模型在不同提示复杂度下的NVDRS情境提取表现,提出了一种复杂度评分算法,并构建了按情境选择提示策略的混合方法;结果发现LLM在低流行率情境中表现更优,且该框架可在不同前沿LLM之间泛化。

Comments Accepted at IEEE ICHI 2026

详情
AI中文摘要

自杀是美国的主要死因之一,要理解自杀发生前的情境,需要从死亡调查叙述中提取结构化信息。其中许多情境需要超越简单关键词匹配的语义推断。我们开发了一种“复杂度评分”(Complexity Score)算法,通过分析编码手册的结构来预测:包含完整编码指南的详细提示何时优于仅含名称的提示。随后,我们构建了一种按情境选择提示策略的混合方法。我们在来自国家暴力死亡报告系统(NVDRS)的25个推断复杂的情境上,将大型语言模型(LLMs)与微调的RoBERTa进行了对比评估。我们发现,在训练数据不足的低流行率情境中,LLMs的表现显著更优。我们进一步展示了该框架可在前沿LLM之间泛化,GPT-5.2、Gemini 2.5 Pro和Llama-3 70B呈现出一致的表现模式。这些发现支持一种混合架构:由LLMs处理罕见且推断复杂的情境,由微调模型处理常见情境。

英文摘要

Suicide is a leading cause of death in the United States, and understanding the circumstances that precede it requires extracting structured information from death investigation narratives. Many of these circumstances require semantic inference beyond simple keyword matching. We develop a "Complexity Score" algorithm that analyzes coding manual structure to predict when detailed prompts with full coding guidelines improve over name-only prompts. We then construct a hybrid approach that selects prompt strategy per circumstance. We evaluate large language models (LLMs) against fine-tuned RoBERTa on 25 inferentially complex circumstances from the National Violent Death Reporting System (NVDRS). We found that LLMs substantially outperform on low-prevalence circumstances where training data is insufficient. We further demonstrate that our framework generalizes across frontier LLMs, with GPT-5.2, Gemini 2.5 Pro and Llama-3 70B showing consistent performance patterns. These findings support a hybrid architecture where LLMs handle rare, inferentially complex circumstances while fine-tuned models handle common ones.
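
The hybrid architecture described here amounts to a per-circumstance routing rule. The sketch below is only an illustration of that idea: the thresholds, the assumption that a higher Complexity Score favors the detailed prompt, and the circumstance names are all hypothetical and do not come from the paper.

```python
def route_circumstance(prevalence, complexity_score,
                       rare_below=0.05, detailed_at=0.5):
    """Pick a model and, for the LLM, a prompt strategy for one circumstance.

    prevalence: fraction of narratives in which the circumstance is coded.
    complexity_score: assumed to grow with the coding manual's complexity.
    Both thresholds are placeholders that would need tuning on held-out data.
    """
    if prevalence >= rare_below:
        # Common circumstance: enough training data for the fine-tuned model.
        return ("fine_tuned", None)
    prompt = "detailed" if complexity_score >= detailed_at else "name_only"
    return ("llm", prompt)


# Hypothetical circumstances as (prevalence, complexity score) pairs.
circumstances = {
    "circumstance_a": (0.30, 0.2),
    "circumstance_b": (0.02, 0.8),
    "circumstance_c": (0.01, 0.1),
}
routes = {name: route_circumstance(p, c) for name, (p, c) in circumstances.items()}
print(routes)
```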

2605.21842 2026-05-22 cs.LG cs.CL eess.SP 版本更新

Energy-Gated Attention: Spectral Salience as an Inductive Bias for Transformer Attention

能量门控注意力:频谱显著性作为Transformer注意力的归纳偏置

Athanasios Zeris

发表机构 * Independent Researcher, Athens, Greece(雅典,希腊独立研究者)

AI总结 本文提出能量门控注意力(EGA),通过频谱显著性作为归纳偏置来改进Transformer注意力机制,通过在键嵌入的频谱能量上进行门控,提高了信息密集位置的注意力权重,实验结果显示在多个数据集上均取得显著效果。

Comments 12 pages, 4 figures

详情
AI中文摘要

标准的Transformer注意力计算查询和键之间的成对相似性,将所有标记视为具有同等显著性,无论其内在信息含量如何。在湍流流体力学中,相干结构——在背景混沌中持续存在的能量主导、空间组织化的模式——承载了总能量的不成比例份额,并控制所有传输。我们提出,标记在Transformer注意力中扮演类似的角色:信息密集的位置(形态边界、语法头、话语标记)集中了频谱能量,并应比背景标记(功能词、重复模式、低信息填充词)获得更多的注意力。我们提出能量门控注意力(EGA):一种简单的修改,通过键标记嵌入的频谱能量来门控值聚合,该计算通过一个单个学习的线性投影完成,以发现嵌入场的主导频谱模式。在TinyShakespeare上,EGA仅使用12,480个额外参数(<0.26%的开销)和没有可测量的计算成本,就实现了+0.103的验证损失改进。结果在Penn Treebank上也一致(+0.101),证明了数据集的独立性。在三种小波家族(固定Morlet、Daubechies db2/db4和参数化Morlet)的系统消融研究中,发现固定结构基底是次优的——最优的能量方向是数据自适应的且非正弦的——同时识别出学习的小波包作为有前途的开放方向。学习的能量阈值收敛到tau ~ 0.35,无论初始化如何,对应于英语文本中携带高于平均频谱能量的约36%的标记比例,这是一个稳定的语言属性,与英语文本中内容词的比例一致。

英文摘要

Standard transformer attention computes pairwise similarity between queries and keys, treating all tokens as equally salient regardless of their intrinsic informational content. In turbulent fluid dynamics, coherent structures -- the energetically dominant, spatially organized patterns that persist amid background chaos -- carry a disproportionate fraction of total energy and govern all transport. We propose that tokens play an analogous role in transformer attention: informationally dense positions (morphological boundaries, syntactic heads, discourse markers) concentrate spectral energy and should attract proportionally more attention than background tokens (function words, repeated patterns, low-information filler). We propose Energy-Gated Attention (EGA): a simple modification that gates value aggregation by the spectral energy of key token embeddings, computed by a single learned linear projection that discovers the dominant spectral mode of the embedding field. On TinyShakespeare, EGA achieves +0.103 validation loss improvement with only 12,480 additional parameters (<0.26% overhead) and no measurable computational cost. The result is consistent on Penn Treebank (+0.101), demonstrating dataset independence. A systematic ablation across three wavelet families (fixed Morlet, Daubechies db2/db4, and a parametric Morlet) establishes that fixed structured bases are suboptimal -- the optimal energy direction is data-adaptive and non-sinusoidal -- while identifying learned wavelet packets as a promising open direction. The learned energy threshold converges to tau ~= 0.35 independently of initialization, corresponding to the fraction (~36%) of tokens carrying above-average spectral energy in English text, a stable linguistic property consistent with the fraction of content words in running English text.

2605.21827 2026-05-22 cs.CL cs.AI 版本更新

Does Slightly Mean Somewhat? Measuring Vague Intensity Words in LLM Numeric Actions

“稍微”等于“有点”吗?测量LLM数值动作中的模糊程度词

Daniel Tabach

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 本文研究了语言模型在必须生成数值行为时是否保留强度词的顺序意义,发现模型在数值输出中压缩了模糊强度词,其解释依赖于状态并接近操作边界时出现不连续性。

Comments 9 figures, 2 tables, 16 references

详情
AI中文摘要

语言模型在必须生成数值行为时是否保留强度词的顺序意义?我研究了一个由研究者构建的10个英语程度修饰词尺度,从稍微到剧烈,依据Quirk等人程度修饰词分类法,在受控资源分配环境中进行测试,其中Claude Haiku接收自然语言指令,生成数值分配,并由确定性后端转换为可测量结果。在测试中,唯一变化的变量是强度词或起始系统状态,从而隔离了它们对模型数值输出的影响。在6,620次运行中(T=0.0和T=0.7),出现了三种模式。首先,模型将10个强度词压缩为5个不同的中位数输出:四个低层级词都映射到相同值,而更强的词则进入更高层级(Spearman rho=0.845,p<0.001)。其次,当当前系统状态作为上下文提供时,Kruskal-Wallis检验显示按起始分配分组捕捉到的基于排名的方差远多于按词分组(epsilon-squared基线=0.782 vs. epsilon-squared词=0.079),并且当系统接近容量时,词汇区分度降为零。第三,接近可行性极限时,模型表现出三种行为模式:弱词通过小调整进行妥协,强词完全回避,而“剧烈”词则推至局部天花板。这些模式在温度变化下保持不变,随机采样扩展了分布但未恢复词之间的顺序差异。在该模型和领域中,模型对模糊强度词的数值解释是压缩的、依赖状态的,并且在操作边界附近出现不连续性。

英文摘要

Do language models preserve the ordinal meaning of intensity words when those words must produce numeric actions? I study a researcher-constructed scale of 10 English degree modifiers, from slightly to drastically, informed by the Quirk et al. degree-modifier taxonomy, in a controlled resource-allocation environment where Claude Haiku receives a natural-language instruction, produces a numeric allocation, and a deterministic backend converts that allocation into a measurable outcome. The only variable that changes between runs is the intensity word or the starting system state, isolating their effects on the model's numeric output. Across 6,620 runs at T=0.0 and T=0.7, three patterns emerge. First, the model compresses 10 intensity words into 5 distinct median outputs: four lower-tier words all map to the same value, while stronger words break into higher regimes (Spearman rho = 0.845, p < 0.001). Second, when the current system state is supplied as context, separate Kruskal-Wallis tests show that grouping by starting allocation captures far more rank-based variance than grouping by word (epsilon-squared baseline = 0.782 vs. epsilon-squared word = 0.079), and lexical differentiation collapses to zero as the system approaches capacity. Third, near feasibility limits the model exhibits three behavioral modes: weak words hedge with small adjustments, strong words abstain entirely, and the word drastically pushes to the local ceiling. These patterns persist across temperature, with stochastic sampling broadening distributions but not restoring ordinal distinctions between words. In this model and domain, the model's numeric interpretation of vague intensity words is compressed, state-dependent, and discontinuous near operational boundaries.
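
The epsilon-squared effect sizes quoted above can be computed from the Kruskal-Wallis H statistic. The sketch below assumes the common definition epsilon^2 = H / (n - 1) and omits the tie correction for brevity; the abstract does not say which variant was used, and the toy groups are hypothetical.

```python
def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of 1-indexed positions i..j
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def kruskal_wallis_h(groups):
    """Kruskal-Wallis H statistic, without tie correction."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    return 12.0 / (n * (n + 1)) * total - 3 * (n + 1)


def epsilon_squared(groups):
    """Share of rank-based variance explained by the grouping."""
    n = sum(len(g) for g in groups)
    return kruskal_wallis_h(groups) / (n - 1)


# Toy allocations (hypothetical), grouped by starting state.
by_state = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
eps_state = epsilon_squared(by_state)
print(round(eps_state, 3))
```

Running the same function once with outputs grouped by starting allocation and once grouped by intensity word gives the two numbers the abstract compares; a value near zero means the grouping explains almost none of the rank variance.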

2605.21807 2026-05-22 cs.CL 版本更新

When Cases Get Rare: A Retrieval Benchmark for Off-Guideline Clinical Question Answering

当案例变得稀少时:一个用于偏离指南临床问答的检索基准

Doeun Lee, Muge Zhang, Yi Yu, Ashish Manne, Stephen Koesters, Frank Wen, Brady Buchanan, Lynda Villagomez, Oluwatoba Moninuola, James Lim, Kathryn Tobin, Andrew Srisuwananukorn, Ping Zhang, Sachin Kumar

发表机构 * The Ohio State University(俄亥俄州立大学) The Ohio State University Wexner Medical Center(俄亥俄州立大学韦克斯纳医学中心) University of Chicago Medical Center(芝加哥大学医学中心)

AI总结 本文提出OGCaReBench基准,用于评估医疗问答中超越常规指南的开放性推理能力,通过检索医学文献提升模型在真实世界医疗场景中的表现。

Comments 34 pages, 20 figures

详情
AI中文摘要

在各医学专科中,临床实践都以循证指南为基础,这些指南将研究最充分的诊断和治疗路径规范化。但对于指南未覆盖的真实世界诊疗长尾,这些路径往往力不从心。然而,大多数医疗大型语言模型(LLMs)被训练为在参数中编码常见的、以指南为中心的医学知识。当前评估主要测试模型对这些记忆内容的回忆和推理能力,且通常采用多项选择形式。鉴于循证推理在医学中的根本重要性,在实践中依赖记忆既不可行也不可靠。为弥补这一空白,我们引入OGCaReBench,一个自由形式、以检索为重点的基准,用于评估LLMs回答需要超越典型指南的临床问题的能力。OGCaReBench提取自已发表的医学病例报告并经医学专家验证,包含需要自由文本作答的长篇临床问题,为评估罕见、基于病例的场景中的开放式医学推理提供了系统框架。我们的实验表明,即使是表现最好的基线(GPT-5.2)也只能正确回答基准中56%的问题,而专业模型仅达到42%。为模型补充检索到的医学文章可将性能提升至最高82%(使用GPT-5.2),凸显了证据支撑对真实世界医学推理任务的重要性。因此,本工作为评测并推动通用和医疗LLMs在具有挑战性的临床情境中给出可靠答案奠定了基础。

英文摘要

Across medical specialties, clinical practice is anchored in evidence-based guidelines that codify best studied diagnostic and treatment pathways. These pathways routinely fall short for the long tail of real-world care not covered by guidelines. Most medical large language models (LLMs), however, are trained to encode common, guideline-focused medical knowledge in their parameters. Current evaluations test models primarily on recalling and reasoning with this memorized content, often in multiple-choice settings. Given the fundamental importance of evidence-based reasoning in medicine, it is neither feasible nor reliable to depend on memorization in practice. To address this gap, we introduce OGCaReBench, a free-form retrieval-focused benchmark aimed at evaluating LLMs at answering clinical questions that require going beyond typical guidelines. Extracted from published medical case reports and validated by medical experts, OGCaReBench contains long-form clinical questions requiring free-text answers, providing a systematic framework for assessing open-ended medical reasoning in rare, case-based scenarios. Our experiments reveal that even the best-performing baseline (GPT-5.2) correctly answers only 56% of our benchmark with specialized models only reaching 42%. Augmenting models with retrieved medical articles improves this performance to up to 82% (using GPT-5.2) highlighting the importance of evidence-grounding for real-world medical reasoning tasks. This work thus establishes a foundation for benchmarking and advancing both general-purpose and medical LLMs to produce reliable answers in challenging clinical contexts.

2605.21801 2026-05-22 cs.LG cs.CL 版本更新

Why Semantic Entropy Fails: Geometry-Aware and Calibrated Uncertainty for Policy Optimization

为何语义熵失效:面向策略优化的几何感知与校准不确定性

Zheyuan Zhang, Kaiwen Shi, Han Bao, Zehong Wang, Tianyi Ma, Yanfang Ye

发表机构 * University of Notre Dame(圣母大学)

AI总结 本文提出了一种新的策略优化框架GCPO,通过几何感知措施捕捉语义分歧,并利用基于奖励的校准对齐不确定性与学习信号强度,从而更准确地跟踪梯度变化并提升训练后性能。

详情
AI中文摘要

后训练已成为提升大语言模型推理与对齐能力的核心手段,其中无评论家(critic-free)模型能够从模型生成的输出中进行可扩展的学习,但缺乏区分有信息信号与噪声信号的原则性机制。近期方法利用响应级度量作为不确定性信号,来调节GRPO等基于群组的优化方法。然而,它们的经验效果仍不稳定,对优化动态的影响方式也不清楚。在本文中,据我们所知,我们首次给出了一个原则性的表述,将不确定性信号解释为刻画并调节梯度方差和学习信号质量的机制。基于经验与理论分析,我们指出当前基于熵的估计器存在两个关键缺口:各向异性缺口和校准缺口。受此分析启发,我们提出几何感知校准策略优化(GCPO),这一新框架将用于捕捉语义分歧的几何感知度量与基于奖励的校准相结合,使不确定性与学习信号强度保持一致。在多个基准上的实验表明,我们的方法能更忠实地跟踪梯度的变化,并持续提升后训练性能。我们的结果凸显了设计与优化动态相一致的不确定性信号的重要性,为稳健的后训练提供了原则性视角。

英文摘要

Post-training has become central to improving reasoning and alignment in large language models, where critic-free models enable scalable learning from model-generated outputs but lack principled mechanisms to distinguish informative from noisy signals. Recent approaches leverage response-level measures as uncertainty signals to regulate group-based optimization methods such as GRPO. Yet their empirical success remains unstable and unclear in how they influence optimization dynamics. In this paper, we provide, to our knowledge, the first principled formulation that interprets uncertainty signals as mechanisms for characterizing and regulating gradient variance and learning signal quality. Based on both empirical and theoretical analysis, we identify two critical gaps of current entropy-based estimators: the anisotropic gap and the calibration gap. Motivated by this analysis, we propose Geometric-aware Calibrated Policy Optimization (GCPO), a novel framework integrating geometry-aware measures to capture semantic disagreement with reward-based calibration to align uncertainty with learning signal strength. Experiments on multiple benchmarks show that our approach more faithfully tracks gradient variability and consistently improves post-training performance. Our results highlight the importance of designing uncertainty signals that are aligned with optimization dynamics, offering a principled perspective for robust post-training.

2605.21796 2026-05-22 cs.CV cs.CL 版本更新

MM-Conv: A Multimodal Dataset and Benchmark for Context-Aware Grounding in 3D Dialogue

MM-Conv: 一种多模态数据集和基准,用于上下文感知的3D对话中指代解析

Anna Deichler, Jim O'Regan, Fethiye Irmak Dogan, Lubos Marcinek, Anna Klezovich, Iolanda Leite, Jonas Beskow

发表机构 * KTH Royal Institute of Technology(皇家理工学院)

AI总结 本文提出了一种多模态数据集和基准,用于在动态3D环境中实现上下文感知的指代解析,通过引入包含6.7小时第一人称VR交互的同步语音、动作、注视和3D场景几何数据的基准,以及一个两阶段的指代解析流水线,改进了对话中的指代解析性能。

Comments Extended version of the paper published at LREC 2026 (Palma de Mallorca, Spain), with expanded VLM baselines and inter-annotator agreement analysis

详情
Journal ref
Proceedings of the 15th Language Resources and Evaluation Conference (LREC 2026), Palma de Mallorca, Spain
AI中文摘要

在物理世界中将语言进行定位需要AI系统解释在对话中动态出现的参考。尽管当前的视觉语言模型(VLMs)在静态图像任务上表现出色,但在自发的多轮对话中解决歧义表达方面存在困难。我们通过引入(1)一个用于动态3D环境中的指代交流的基准,该基准基于6.7小时的第一人称VR交互,同步语音、动作、注视和3D场景几何数据,以及(2)一个两阶段的定位流水线,该流水线在视觉定位之前显式解决对话中的歧义,来填补这一空白。该基准包含超过4,200个经过人工验证的指代表达,涵盖完整、部分和代词类型。我们的上下文重写方法在平均上将定位性能提高了11-22个百分点,纯检测器(GroundingDINO)在重写后在代词上达到了56.7%的准确率,几乎是最佳端到端基线的两倍。结果表明,将语言推理与视觉感知解耦比端到端方法在对话定位中更有效。

英文摘要

Grounding language in the physical world requires AI systems to interpret references that emerge dynamically during conversation. While current vision-language models (VLMs) excel at static image tasks, they struggle to resolve ambiguous expressions in spontaneous, multi-turn dialogue. We address this gap by introducing (1) a benchmark for referential communication in dynamic 3D environments, built from 6.7 hours of egocentric VR interaction with synchronized speech, motion, gaze, and 3D scene geometry, and (2) a two-stage grounding pipeline that explicitly resolves conversational ambiguity before visual localization. The benchmark includes over 4,200 manually verified referring expressions spanning full, partitive, and pronominal types. Our contextual rewriting approach improves grounding performance by 11-22 percentage points on average, with a pure detector (GroundingDINO) reaching 56.7% on pronominals after rewriting, nearly double the best end-to-end baseline. Results demonstrate that decoupling linguistic reasoning from visual perception is more effective than end-to-end approaches for conversational grounding.

2605.21792 2026-05-22 cs.CL cs.AI cs.DB cs.LG 版本更新

Residual Skill Optimization for Text-to-SQL Ensembles

残差技能优化用于文本到SQL集成

Jiongli Zhu, Haoquan Guan, Parjanya Prajakta Prashant, Nikki Lijing Kuang, Seyedeh Baharan Khatami, Canwen Xu, Xiaodong Yu, Yingyu Lin, Zhewei Yao, Yuxiong He, Babak Salimi

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Snowflake AI Research(Snowflake人工智能研究)

AI总结 本文提出DivSkill-SQL,一种残差技能优化框架,通过在当前技能集成失败的示例上优化新技能,从而构建互补的文本到SQL集成,提升Pass@K性能,在Spider2-Lite上实现了显著的准确性提升,同时在不同方言和任务上表现出一致的改进。

详情
AI中文摘要

文本到SQL集成通过生成多个SQL候选并从中选择一个,优于单候选生成,但其效果以Pass@K为上界,即K个候选中至少有一个正确的概率。现有方法通过随机解码或提示变体启发式地获取多样性,导致候选集被相关的失败所主导。我们提出DivSkill-SQL,一种残差技能优化框架,无需模型微调即可构建互补的智能体式文本到SQL集成:每个新技能都在当前技能集成失败的示例上进行优化,可证明地针对其对Pass@K的边际贡献。在Spider2-Lite上,相比最强的集成基线,DivSkill-SQL将所选答案的准确率在Snowflake上最多提升11.1个点,在BigQuery上提升8.3个点,并在两个基础模型(Opus-4.6和GPT-5.4)上取得一致的增益。在单一方言上优化的技能无需重新训练即可迁移到其他方言(Snowflake、BigQuery、SQLite)以及不同的任务形式,例如BIRD-Critic(+2.6个点)。错误诊断显示,幻觉出的模式(schema)引用和函数调用最多降至原来的约三分之一,表明增益来自真正可靠的互补技能,而非表面形式的变化。

英文摘要

Text-to-SQL ensembles improve over single-candidate generation by drawing multiple SQL candidates and selecting one, but their effectiveness is bounded by Pass@K, the probability that at least one of K candidates is correct. Existing methods source diversity heuristically through stochastic decoding or prompt variants, leaving candidate sets dominated by correlated failures. We present DivSkill-SQL, a residual skill optimization framework that builds complementary agentic Text-to-SQL ensembles without model fine-tuning: each new skill is optimized on examples the current skill ensemble fails on, provably targeting its marginal contribution to Pass@K. On Spider2-Lite, DivSkill-SQL improves selected accuracy by up to +11.1 points on Snowflake and +8.3 on BigQuery over the strongest ensemble baseline, with consistent gains across two base models (Opus-4.6 and GPT-5.4). Skills optimized on a single dialect transfer without retraining across dialects (Snowflake, BigQuery, SQLite) and to a different task formulation, such as BIRD-Critic (+2.6 pts). Error diagnostics show up to 3x fewer hallucinated schema references and function calls, indicating that gains come from genuinely reliable complementary skills rather than surface-form variation.
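
The residual idea, where each new skill targets the examples the current ensemble still fails on, can be sketched with a greedy selection over a fixed pool. This is a stand-in for illustration only: DivSkill-SQL optimizes new skills rather than picking from a pool, and the skill names and solved-example sets below are hypothetical.

```python
def ensemble_coverage(solved_sets, n_examples):
    """Empirical Pass@K: share of examples solved by at least one skill."""
    covered = set().union(*solved_sets) if solved_sets else set()
    return len(covered) / n_examples


def build_residual_ensemble(candidates, n_examples, k):
    """Greedily add the skill that solves the most still-unsolved examples.

    candidates maps a skill name to the set of example ids it solves.
    Ties are broken by skill name so the result is deterministic.
    """
    remaining = dict(candidates)
    residual = set(range(n_examples))  # examples the ensemble still fails on
    ensemble = []
    while len(ensemble) < k and remaining and residual:
        best = max(sorted(remaining), key=lambda s: len(remaining[s] & residual))
        if not remaining[best] & residual:
            break  # no candidate adds marginal coverage
        ensemble.append(best)
        residual -= remaining.pop(best)
    return ensemble, residual


candidates = {"a": {0, 1, 2}, "b": {0, 1, 3}, "c": {3, 4}}
ensemble, unsolved = build_residual_ensemble(candidates, n_examples=5, k=2)
residual_cov = ensemble_coverage([candidates[s] for s in ensemble], 5)
top2_cov = ensemble_coverage([candidates["a"], candidates["b"]], 5)
print(ensemble, residual_cov, top2_cov)
```

In the toy pool, skills "a" and "b" are individually the strongest but fail on overlapping examples, so pairing "a" with the weaker but complementary "c" covers more of the set. That is the correlated-failure problem the abstract describes.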

2605.21781 2026-05-22 cs.CL 版本更新

Reflective Prompt Tuning through Language Model Function-Calling

通过语言模型功能调用实现反思式提示调优

Farima Fatahi Bayat, Moin Aminnaseri, Pouya Pezeshkpour, Estevam Hruschka

发表机构 * Megagon Labs(梅加穹实验室)

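AI summary: This paper proposes Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to mimic the iterative workflow of a human prompt engineer: a diagnostic function evaluates the target model over the full optimization set and returns a structured report, and the optimizer revises the prompt using that report together with a memory of earlier ones, improving both task performance and confidence calibration.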

Comments 17 pages, 6 figures

详情
AI中文摘要

大型语言模型(LLMs)在遵循指令和复杂推理方面的能力不断增强,使提示成为一种灵活的接口,用于在不更新参数的情况下调整模型。然而,提示设计仍然劳动密集且对格式、措辞和指令顺序高度敏感,这促使了自动提示优化方法的发展,以减少手动努力并保持推理时的灵活性。然而,现有方法通常在提示候选项上进行搜索或使用由单个示例或小批量驱动的固定批评-优化流水线,限制了它们捕捉系统性错误模式和基于失败历史进行针对性编辑的能力。我们提出Reflective Prompt Tuning (RPT),一个框架,利用语言模型功能调用模拟人类提示工程师的迭代工作流程。一个语言模型优化器调用一个诊断函数,该函数在完整的优化集上评估目标模型,总结反复出现的失败模式,并返回结构化的诊断报告。优化器利用此报告,以及累积的记忆中的先前报告,来修改下一个迭代的提示。RPT进一步通过在诊断反馈和最终提示选择中使用校准信号支持置信度-aware的优化。在三个推理任务上,RPT在初始提示上提高了高达12.9个点,与最先进方法保持竞争力,并提高了置信度校准。我们的分析显示,RPT在多跳和数学推理中特别有效,产生针对性的提示修改,与诊断的失败模式对齐,并导致任务性能和校准的提升。

英文摘要

Large language models (LLMs) have become increasingly capable of following instructions and complex reasoning, making prompting a flexible interface for adapting models without parameter updates. Yet prompt design remains labor-intensive and highly sensitive to formatting, phrasing, and instruction order, motivating automated prompt optimization methods that reduce manual effort while preserving inference-time flexibility. However, existing methods often search over prompt candidates or use fixed critique-refine pipelines driven by individual examples or small batches, limiting their ability to capture systematic error patterns and make targeted edits grounded in failure history. We propose Reflective Prompt Tuning (RPT), a framework that uses LLM function calling to simulate the iterative workflow of human prompt engineers. An LLM optimizer calls a diagnostic function that evaluates the target model over an entire optimization set, summarizes recurring failure modes, and returns a structured diagnostic report. The optimizer uses this report, together with an accumulated memory of prior reports, to revise the prompt for the next iteration. RPT further supports confidence-aware optimization by using calibration signals in diagnostic feedback and final prompt selection. Across three reasoning tasks, RPT improves over initial prompts by up to 12.9 points, remains competitive with state of the art, and improves confidence calibration. Our analyses show that RPT is especially effective on multi-hop and mathematical reasoning, producing targeted prompt revisions that align with diagnosed failure patterns and lead to gains in task performance and calibration.

2605.21776 2026-05-22 cs.CL 版本更新

PromptNCE: Pointwise Mutual Information Predictions Using Only LLMs and Contrastive Estimation Prompts

PromptNCE:仅使用LLM和对比估计提示进行点互信息预测

Juliette Woodrow, Chris Piech

发表机构 * Department of Computer Science Stanford University(计算机科学系 斯坦福大学)

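AI summary: This paper proposes PromptNCE, which frames conditional probability estimation as a contrastive task and adds an explicit OTHER category so that the true conditional probability is recovered, enabling zero-shot estimation of pointwise mutual information in low-data settings.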

详情
AI中文摘要

从文本中估计互信息通常需要训练特定任务的批评者,这限制了其在低数据设置中的应用。我们问大语言模型能否仅通过提示和提取的概率来零样本估计点互信息。我们引入了一个包含三个公开可用数据集的人类衍生真实PMI基准,并评估了五个信息论提示基于的估计器。我们的主要方法PromptNCE将条件概率估计框架为对比任务,并在候选集上引入显式的OTHER类别。我们理论证明,添加OTHER类别可以恢复真实的条件P(y | x),而不是仅仅在列出的候选者之间进行排名,将对比提示转化为通用的零样本概率估计器。PromptNCE在所有三个数据集上都是最佳的零样本方法,与人类衍生的PMI达到Spearman相关性高达0.82。我们还展示了在计算机科学教育中的案例研究,说明这些估计器如何在低数据情况下用于评分学生知识摘要。

英文摘要

Estimating mutual information from text usually requires training a task-specific critic, which limits its use in low-data settings. We ask whether large language models can instead estimate pointwise mutual information zero-shot, using only prompts and elicited probabilities. We introduce a benchmark with human-derived ground-truth PMI across three publicly available datasets, and evaluate five information-theoretic prompting-based estimators. Our main method, PromptNCE, frames conditional probability estimation as a contrastive task and augments the candidate set with an explicit OTHER category. We show theoretically that adding OTHER recovers the true conditional P(y | x) rather than just a ranking over listed candidates, turning a contrastive prompt into a general-purpose zero-shot probability estimator. PromptNCE is the best zero-shot method on all three datasets, reaching Spearman correlation up to 0.82 with human-derived PMI. We also present a case study in computer science education showing how these estimators can be used to score student knowledge summaries in a low-data setting.
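The claim that an OTHER category turns a ranking into a probability can be checked with a toy calculation. The sketch below is an idealized illustration, not the PromptNCE procedure itself: it assumes the elicited weights are proportional to a known true conditional, and the outcome labels and probabilities are hypothetical.

```python
import math
from typing import Dict, List

def normalize(weights: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative weights so they sum to 1."""
    total = sum(weights.values())
    return {key: value / total for key, value in weights.items()}

def closed_set_estimate(true_cond: Dict[str, float], candidates: List[str]) -> Dict[str, float]:
    """Normalize over the listed candidates only.

    This yields P(y | x, y in candidates), which is inflated relative to P(y | x).
    """
    return normalize({y: true_cond[y] for y in candidates})

def open_set_estimate(true_cond: Dict[str, float], candidates: List[str]) -> Dict[str, float]:
    """Add an OTHER bucket that absorbs all unlisted probability mass.

    With the bucket present, each listed candidate keeps its true P(y | x).
    """
    weights = {y: true_cond[y] for y in candidates}
    weights["OTHER"] = sum(p for y, p in true_cond.items() if y not in candidates)
    return normalize(weights)

def pmi(p_y_given_x: float, p_y: float) -> float:
    """Pointwise mutual information in nats: log P(y | x) / P(y)."""
    return math.log(p_y_given_x / p_y)

# Hypothetical true conditional P(y | x) over four outcomes.
TRUE_COND = {"a": 0.2, "b": 0.1, "c": 0.4, "d": 0.3}

if __name__ == "__main__":
    listed = ["a", "b"]
    print(closed_set_estimate(TRUE_COND, listed))
    print(open_set_estimate(TRUE_COND, listed))
```

Without OTHER, candidate "a" appears to have probability about 0.67; with OTHER it keeps its true value of 0.2, so a PMI computed from it is on the right scale.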

2605.21748 2026-05-22 cs.CL 版本更新

RankJudge: A Multi-Turn LLM-as-a-Judge Synthetic Benchmark Generator

RankJudge: 一个多轮LLM-as-a-Judge合成基准生成器

Zhenwei Tang, Zhaoyan Liu, Rasa Hosseinzadeh, Tongzi Wu, Keyvan Golestan, Jesse C. Cresswell

发表机构 * Layer 6 AI

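AI summary: This paper proposes RankJudge, a synthetic benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations. It builds conversation pairs in which one conversation has a single injected flaw, which permits a strict joint correctness criterion; across three domains and 21 frontier LLM judges, the resulting judge rankings prove stable.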

详情
AI中文摘要

随着交互式基于LLM的应用被创建和优化,模型开发者需要在多个可能的轴上评估生成文本的质量。对于更简单的系统,人工评估可能是可行的,但在复杂的系统如对话聊天机器人中,生成文本的数量可能会超出人类注释资源的承受能力。模型开发者已经开始依赖自动评估,其中LLM也被用来判断生成质量。然而,现有的LLM-as-a-judge基准主要集中在简单的问答任务上,而无法匹配多轮对话的复杂性。我们引入了RankJudge,一种用于评估LLM-as-a-judge在基于参考文档的多轮对话中的基准生成器。RankJudge生成对话对,其中一组对话在某一回合中注入了一个单一的缺陷。这种构造使得对话对可以被无歧义地标记为更好或更差,并且能够精确地将失败类别隔离到单个回合中,从而实现一个严格的联合正确性标准来评判。我们实现了RankJudge在机器学习、生物医学和金融领域,评估了21个前沿LLM评判者,并通过Bradley-Terry模型对这些评判者进行排名。我们的方法还允许对每个对话对进行难度评分,我们利用这些评分动态地整理评估切片以减少标签噪声,这已通过人工注释得到验证。我们发现,在部分可观测性、更粗略的正确性标准以及替代的随机游走评分算法下,评判排名是稳定的。

英文摘要

As interactive LLM-based applications are created and refined, model developers need to evaluate the quality of generated text along many possible axes. For simpler systems, human evaluation may be practical, but in complicated systems like conversational chatbots, the amount of generated text can overwhelm human annotation resources. Model developers have begun to rely heavily on auto-evaluation, where LLMs are also used to judge generation quality. However, existing LLM-as-a-judge benchmarks largely focus on simple Q&A tasks that do not match the complexity of multi-turn conversations. We introduce RankJudge, a benchmark generator for evaluating LLM-as-a-judge on multi-turn conversations grounded in reference documents. RankJudge creates pairs of conversations where one conversation has a single flaw injected into one turn. This construction allows paired conversations to be labeled unambiguously as better or worse, and precisely isolates failure categories to individual turns, enabling a strict joint correctness criterion for judging. We implement RankJudge across the domains of machine learning, biomedicine, and finance, evaluate 21 frontier LLM judges, and rank those judges via the Bradley-Terry model. Our formulation also allows ranking each conversation pair with difficulty ratings, which we use to dynamically curate the evaluation slice to reduce label noise, as confirmed via human annotation. We find that judge rankings are stable under partial observability, coarser correctness criteria, and an alternative random-walk rating algorithm.
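For readers unfamiliar with the Bradley-Terry model mentioned above, the following sketch fits strengths from pairwise outcomes with the standard minorization-maximization update. It is a generic illustration, not the paper's pipeline: how RankJudge turns judge verdicts into pairwise outcomes is not stated in the abstract, and the win counts below are hypothetical.

```python
from typing import List

def fit_bradley_terry(wins: List[List[int]], iters: int = 200) -> List[float]:
    """Fit Bradley-Terry strengths with the MM update.

    wins[i][j] is the number of times item i beat item j (diagonal is 0).
    Returns strengths normalized to sum to 1; a larger value means stronger.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        new_p = []
        for i in range(n):
            total_wins = sum(wins[i])
            # Sum over opponents of (number of comparisons) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n)
                if j != i
            )
            new_p.append(total_wins / denom if denom > 0 else p[i])
        norm = sum(new_p)
        p = [value / norm for value in new_p]
    return p

# Hypothetical outcomes among three judges, ten comparisons per pair.
WINS = [
    [0, 7, 9],
    [3, 0, 6],
    [1, 4, 0],
]

if __name__ == "__main__":
    strengths = fit_bradley_terry(WINS)
    ranking = sorted(range(len(strengths)), key=lambda i: -strengths[i])
    print(strengths, ranking)
```

Under the model, the probability that item i beats item j is p_i / (p_i + p_j), so the fitted strengths give both a ranking and calibrated pairwise win rates.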

2605.21728 2026-05-22 cs.CV cs.CL cs.LG 版本更新

BEiTScore: Reference-free Image Captioning Evaluation with an Efficient Cross-Encoder Model

BEiTScore: 一种基于高效交叉编码器的无参考图像描述评估方法

Gonçalo Gomes, Bruno Martins, Chrysoula Zerva

发表机构 * Instituto Superior Técnico(里斯本大学理工学院) INESC-ID Instituto de Telecomunicações(电信机构)

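AI summary: This paper proposes BEiTScore, a reference-free metric for image captioning evaluation built on a lightweight cross-encoder, which addresses the computational cost of LLM judges and the limited fine-grained sensitivity of CLIP-based metrics. It also introduces a new benchmark for detailed caption evaluation and reports state-of-the-art performance.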

详情
AI中文摘要

图像描述评估仍是一个重大挑战,因为视觉-语言模型朝着生成长形式和上下文丰富的描述等更具挑战性的能力发展。最先进的评估度量标准涉及使用大型语言模型(LLMs)作为评判者的大量计算成本,或者受到标准CLIP基于编码器的限制,例如严格的令牌限制、缺乏细粒度敏感性或缺乏组合泛化能力,因为将描述视为“词袋”。我们提出了一种新的学习度量标准,以解决上述挑战,基于一个轻量级交叉编码器,其初始化来自视觉问答模型检查点,平衡了强大的权重初始化与计算效率。我们的训练方案使用精心编排的数据混合进行监督学习,特征是对抗性的LLM基于数据增强,以增强模型对细粒度视觉-语言错误的敏感性。我们还引入了一个新的基准,用于在多种场景中评估详细的描述评估。实验结果表明,所提出的度量标准在保持大规模基准测试、质量感知解码或奖励指导所需的效率的同时,实现了最先进的性能。

英文摘要

Image captioning evaluation remains a significant challenge, as vision-language models evolve toward more challenging capabilities such as generating long-form and context-rich descriptions. State-of-the-art evaluation metrics involve extensive computational costs associated with the use of Large Language Models (LLMs) as judges, or instead suffer from the limitations of standard CLIP-based encoders, such as strict token limits, lack of fine-grained sensitivity, or lack of compositional generalization by treating captions as "bags-of-words." We propose a new learned metric that tackles the aforementioned challenges, based on a lightweight cross-encoder that is initialized from a visual question-answering model checkpoint, balancing a strong weight initialization with computational efficiency. Our training scheme uses a carefully assembled data mixture for supervised learning, featuring adversarial LLM-based data augmentations to enhance model sensitivity to fine-grained visual-linguistic errors. We also introduce a new benchmark designed to assess detailed captioning evaluation across diverse scenarios. Experimental results demonstrate that the proposed metric achieves state-of-the-art performance while maintaining the efficiency required for large-scale benchmarking, quality-aware decoding, or reward guidance.

2605.21726 2026-05-22 cs.CL cs.AI 版本更新

Probabilistic Attribution For Large Language Models

基于概率的大型语言模型归因

Shilpika Shilpika, Carlo Graziani, Bethany Lusch, Venkatram Vishwanath, Michael E. Papka

Affiliations: Argonne Leadership Computing Facility; Argonne National Laboratory; Mathematics and Computer Science Division; Department of Computer Science

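AI summary: This paper proposes a model-agnostic probabilistic token attribution measure that uses Bayes' rule to invert next-token log-probabilities, capturing the model's internal representation of the distribution over token sequences and improving the interpretability of large language models.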

Comments 29 pages, 13 figures

详情
AI中文摘要

大型语言模型(LLMs)生成性的特性体现在它们计算每个响应token的条件概率,以根据先前的token进行采样。这些概率编码了模型在训练中学习的分布结构,并在推理中加以利用。在本文中,我们利用这些概率将LLMs置于随机过程的数学理论框架中。我们使用此框架设计了一种模型无关的概率性token归因度量,通过贝叶斯法则反向计算下一个token的对数概率,以捕捉模型对token序列分布的内部表示。该表示独立于模型的计算结构。此表示给出了响应给提示的条件概率,以及在移除一个token后的响应给提示的条件概率。我们的归因分数是这两个概率比值的对数。我们进一步计算了单个提示token分布的熵,条件于剩余的上下文。熵与归因分数之间的相互作用揭示了LLM的行为。我们评估了8个模型在7个提示上的表现,并调查了异常、token敏感性、响应稳定性、模型稳定性以及训练收敛性,从而提高了可解释性,并引导用户关注生成中不确定或不稳定的部分。

英文摘要

The generative nature of Large Language Models (LLMs) is reflected in the conditional probabilities they compute to sample each response token given the previous tokens. These probabilities encode the distributional structure that the model learns in training and exploits in inference. In this work, we use these probabilities to situate LLMs within the mathematical theory of stochastic processes. We use this framework to design a model-agnostic probabilistic token attribution measure, using Bayes' rule to invert the next-token log-probabilities so as to capture the model's internal representation of the distribution over token sequences. The representation is independent of the model's computational structure. This representation yields the conditional probability of the response given the prompt, and of the response given the prompt with a token marginalized away. Our attribution score is the log of the ratio of these probabilities. We further compute the entropies of a single prompt's token distributions, conditioned on the remaining context. The interplay between entropy and attribution score sheds light on LLM behavior. We evaluate 8 models across 7 prompts and investigate anomalies, token sensitivity, response stability, model stability, and training convergence, thereby improving interpretability and guiding users to focus on uncertain or unstable parts of the generation.
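The two quantities described in this abstract can be written out as follows. The notation is an assumption introduced here for illustration: r is the response, x the prompt, x_i its i-th token, x_{\setminus i} the prompt with that token marginalized away, and \mathcal{V} the vocabulary. The abstract does not state which probability is the numerator of the ratio, nor whether the entropy also conditions on the response, so both choices below are guesses.

```latex
A_i \;=\; \log \frac{P(r \mid x)}{P(r \mid x_{\setminus i})},
\qquad
H_i \;=\; -\sum_{v \in \mathcal{V}} P(x_i = v \mid x_{\setminus i}) \,\log P(x_i = v \mid x_{\setminus i})
```

Under this sign convention, a positive A_i means the response is more probable with token i present than with it marginalized away.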

2605.21713 2026-05-22 cs.CL 版本更新

Sem-Detect: Semantic Level Detection of AI Generated Peer-Reviews

Sem-Detect:基于语义层面的AI生成同行评审检测

André V. Duarte, Brian Tufts, Aditya Oke, Fei Fang, Arlindo L. Oliveira, Lei Li

发表机构 * Language Technologies Institute, Carnegie Mellon University(卡内基梅隆大学语言技术研究所)

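AI summary: This paper proposes Sem-Detect, which combines textual features with claim-level semantic analysis to distinguish AI-generated peer reviews from human-written ones. In the binary setting it improves TPR at 0.1% FPR by 25.5% over the strongest baseline, and in the three-class setting fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.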

详情
AI中文摘要

如何区分同行评审是人类还是AI生成?我们主张在该设置中,不应仅根据文本特征来确定作者身份,还应考虑评审所表达的观点、判断和主张。为此,我们提出了Sem-Detect,一种用于同行评审的作者身份检测方法,通过结合文本特征和基于声明的语义分析来实现这一原则。Sem-Detect通过将目标评审与同一篇论文的多个AI生成评审进行比较,利用观察到的不同AI模型倾向于收敛到相似点,而人类评审者引入更多独特和多样化的观点这一现象。因此,Sem-Detect能够区分完全由AI生成的评审与真实的由人类撰写的评审,包括那些经过LLM优化但仍反映人类判断的评审。在包含超过20,000篇同行评审的ICLR和NeurIPS会议数据集中,Sem-Detect在二分类设置中将TPR@0.1% FPR比最强基线提高了25.5%。此外,在三分类场景中,我们实证表明LLM优化保留了人类评审的语义信号,这些信号仍与完全由AI生成的文本模式区分开来;因此,少于3.5%的LLM优化后的评审被错误分类为AI生成。

英文摘要

How can we distinguish whether a peer review was written by a human or generated by an AI model? We argue that, in this setting, authorship should not be attributed solely from the textual features of a review, but also from the ideas, judgments, and claims it expresses. To this end, we propose Sem-Detect, an authorship detection method for peer reviews that operationalizes this principle by combining textual features with claim-level semantic analysis. Sem-Detect compares a target review against multiple AI-generated reviews of the same paper, leveraging the observation that different AI models tend to converge on similar points, while human reviewers introduce more unique and diverse ones. As a result, Sem-Detect is able to distinguish fully AI reviews from authentic human-written ones, including those that have been refined using an LLM but still reflect human judgment. Across a dataset of over 20,000 peer reviews from ICLR and NeurIPS conferences, Sem-Detect improves over the strongest baseline by 25.5% in TPR@0.1% FPR in the binary setting. Moreover, in the three-class scenario, we empirically show that LLM refinement preserves the semantic signals of human reviews, which remain distinct from the patterns exhibited by fully AI-generated text; as a result, fewer than 3.5% of LLM-refined human reviews are misclassified as AI-generated.
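The headline metric here, TPR at a fixed FPR, can be computed as in the sketch below. It is a generic illustration of the metric, not the authors' evaluation code: the scores are hypothetical, and higher scores are assumed to mean "more likely AI-generated".

```python
from typing import List

def tpr_at_fpr(human_scores: List[float], ai_scores: List[float], max_fpr: float) -> float:
    """True-positive rate at the loosest threshold whose false-positive rate is <= max_fpr.

    A review is flagged as AI-generated when its score is strictly above the threshold.
    """
    negatives = sorted(human_scores, reverse=True)
    # Number of human reviews we are allowed to flag by mistake.
    allowed_fp = int(max_fpr * len(negatives))
    if allowed_fp < len(negatives):
        # At most `allowed_fp` human scores lie strictly above this value.
        threshold = negatives[allowed_fp]
    else:
        threshold = float("-inf")
    flagged = sum(1 for score in ai_scores if score > threshold)
    return flagged / len(ai_scores)

# Hypothetical detector scores.
HUMAN_SCORES = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40, 0.60, 0.90]
AI_SCORES = [0.30, 0.50, 0.70, 0.80, 0.95]

if __name__ == "__main__":
    for fpr in (0.0, 0.1):
        print(fpr, tpr_at_fpr(HUMAN_SCORES, AI_SCORES, fpr))
```

With this rounding, a budget of 0.1% FPR allows zero false positives unless there are at least 1,000 human reviews, which is one reason the metric is only meaningful on large datasets such as the one described above.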

2605.21712 2026-05-22 cs.CL 版本更新

Broadening Access to Transportation Safety Data with Generative AI: A Schema-Grounded Framework for Spatial Natural Language Queries

通过生成式AI拓宽交通安全管理数据的可及性:一种基于模式的时空自然语言查询框架

Mahdi Azhdari, Eric J. Gonzales

发表机构 * Department of Civil and Environmental Engineering, University of Massachusetts Amherst(麻省大学阿姆赫斯特分校土木与环境工程系)

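AI summary: This paper presents a schema-grounded natural language interface for transportation safety analysis. A large language model interprets user intent, while execution against a database of crash records, roadway attributes, and geospatial layers stays deterministic and reviewable, addressing uneven access to safety data across agencies and community stakeholders.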

Comments 30 pages, 5 figures

详情
AI中文摘要

交通安全管理分析需要通过基于GIS的工作流整合事故记录、道路属性和地理空间数据,但各机构和社区利益相关者之间的访问仍然不均。技术前提导致分析工具与能够使用它们的从业者之间存在差距。地方机构、学校委员会和居民可能有安全担忧,但缺乏检索、过滤、映射和分析相关数据的能力。生成式AI提供了一种缩小这一差距的方法,但其在公共部门的使用引发了关于可靠性和可复现性的问题。本文提出了一种基于模式的自然语言接口,利用大型语言模型(LLM)解释用户意图,同时保持确定性和可审查的执行,以权威数据库为基础。用户查询被翻译成结构化的语义框架,通过基于规则的层验证,编译成一个有类型的有向无环图,用于执行PostGIS数据库。这种设计将语言解释与确定性执行分离,保持结果可复现和基于模式,同时去除访问障碍。该框架使用整合了事故记录、道路属性和地理空间层(包括学校、公交站、过街天桥和市政边界)的州级马萨诸塞州交通安全管理数据库进行评估。所有查询均成功执行;验证层纠正了29%的评估查询中的错误,反映了灵活的自然语言与严格模式要求之间的差距。结果表明,结合自然语言的可访问性与确定性执行是扩大交通安全管理数据可及性的可行方向,对公共部门规划中的可信AI具有启示。

英文摘要

Transportation safety analysis requires integrating crash records, roadway attributes, and geospatial data through GIS-based workflows, but access remains uneven across agencies and community stakeholders. Technical prerequisites create a gap between analytical tools central to safety planning and the practitioners able to use them. Local agencies, school committees, and residents may have safety concerns but limited capacity to retrieve, filter, map, and analyze relevant data. Generative AI offers a way to narrow this divide, but its public-sector use raises questions about reliability, reproducibility, and governance. This paper presents a schema-grounded natural language interface for transportation safety analysis, using a large language model (LLM) to interpret user intent while preserving deterministic, reviewable execution against an authoritative database. User queries are translated into structured semantic frames, validated by a rule-based layer, compiled into a typed directed acyclic graph of spatial operations, and executed against a PostGIS database. This bounded design separates language interpretation from deterministic execution, keeping results reproducible and schema-grounded while removing access barriers. The framework is evaluated using a statewide Massachusetts transportation safety database integrating crash records, roadway attributes, and geospatial layers including schools, bus stops, crosswalks, and municipal boundaries. All queries executed successfully; the validation layer corrects errors in 29% of evaluation queries, reflecting the gap between flexible natural language and strict schema-grounded requirements. The results suggest that combining natural language accessibility with deterministic execution is a practical direction for broadening access to transportation safety data, with implications for trustworthy AI in public-sector planning.

2605.21699 2026-05-22 cs.LG cs.CL 版本更新

X-Token: Projection-Guided Cross-Tokenizer Knowledge Distillation

X-Token: 通过投影引导的跨分词器知识蒸馏

Sharath Turuvekere Sreenivas, Adithyakrishna Venkatesh Hanasoge, Mingyu Yang, Ali Taghibakhshi, Saurav Muralidharan, Ashwath Aithal, Pavlo Molchanov

发表机构 * NVIDIA

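AI summary: This paper proposes X-Token, a projection-guided method for cross-tokenizer knowledge distillation. Two complementary losses, P-KL and H-KL, share a sparse projection matrix and address the uncommon-token failure and the over-conservative matching of earlier full-distribution, logit-based methods.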

详情
AI中文摘要

跨分词器知识蒸馏允许学生模型从具有不兼容词汇表的教师模型中学习。先前工作基于隐藏状态或对数几率,后者更优,因为它不需要辅助组件。基于对数几率的方法要么只使用正确分词的概率,从而遗漏了教师分布中的全部'暗知识',要么基于完整的输出分布,依赖严格的分词划分和/或不严谨的启发式排序。我们发现完整分布、基于对数几率方法的两个关键缺点:(i) 不常见分词失败,其中关键分词落入未匹配子集(例如,在数字拆分Qwen监督下Llama的1100多数字),在训练中被抑制,导致GSM8k从12.89降至2.56,相较于使用相同分词器的KD;(ii) 过于保守的匹配,严格的一对一匹配排除了表面形式间的近等价分词。这些失败需要不同的解决办法:当关键分词对齐错误时消除划分,当对齐可靠时进行细化。我们提出X-Token,一种具有两个互补损失函数的方法,针对这些问题。P-KL通过稀疏投影矩阵W(从分词级别字符串规则初始化)消除划分,并通过将学生分布与教师分布对齐来解决不常见分词失败。H-KL保留混合形式,同时放松匹配,使每个学生分词与W下的最高排名教师映射对齐。两个目标共享W并自然扩展到多个教师。实验证明,在Llama-3.2-1B上,X-Token在Qwen3-4B教师下比当前最佳GOLD高出+3.82平均点,在Phi-4-Mini教师下高出+0.5。此外,双教师设置(Phi-4-mini + Llama-3B)在单教师蒸馏上提高了+1.3点。

英文摘要

Cross-tokenizer knowledge distillation allows a student model to learn from teachers with incompatible vocabularies. Prior work operates on hidden states or logits; the latter is preferred as a drop-in replacement requiring no auxiliary components. Logit-based methods either use only the correct-token probability, missing the full 'dark knowledge' in the teacher's distribution, or operate on the full output distribution, relying on strict token partitioning and/or unprincipled heuristic ranking. We identify two key shortcomings of full-distribution, logit-based methods: (i) an uncommon-token failure, where critical tokens fall into the unmatched subset (e.g., Llama's 1100 multi-digit numerals under digit-splitting Qwen supervision) and are suppressed during training, reducing GSM8k from 12.89 to 2.56 compared to same-tokenizer KD from a weaker teacher; and (ii) over-conservative matching, where strict 1-to-1 matching excludes near-equivalent tokens across surface forms. These failures require distinct remedies: eliminating the partition when critical tokens are misaligned, and refining it when alignment is reliable. We propose X-Token, an approach with two complementary loss formulations targeting these issues. P-KL removes partitioning and aligns the student's distribution with the teacher's via a sparse projection matrix W (initialized from tokenizer-level string rules) to address the uncommon-token failure. H-KL retains the hybrid form while relaxing matching to align each student token with its top-ranked teacher mapping under W. Both objectives share W and extend naturally to multiple teachers. Empirically, on Llama-3.2-1B, X-Token outperforms the current state of the art GOLD by +3.82 average points with a Qwen3-4B teacher and by +0.5 with a Phi-4-Mini teacher. Further, a two-teacher setup (Phi-4-mini + Llama-3B) improves over single-teacher distillation by +1.3 points.

2605.21654 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Value-Gradient Hypothesis of RL for LLMs

强化学习中大语言模型的价值-梯度假说

Arip Asadulaev, Daniil Ognev, Karim Salta, Martin Takac

Affiliations: MBZUAI (Mohamed bin Zayed University of Artificial Intelligence)

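AI summary: This paper develops a value-gradient perspective on why critic-free reinforcement learning works for LLM post-training. By analyzing the actor update and autodifferentiation through attention, it decomposes the impact of RL into a value-gradient signal and reachable reward headroom.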

详情
AI中文摘要

强化学习显著提升了预训练语言模型,但尚不清楚为何无评论方法如PPO和GRPO能发挥如此大的作用,以及何时能提供最大的收益。我们开发了一种无评论强化学习在大语言模型后训练中的价值-梯度视角。首先,在可微展开和加性噪声参数化下,我们证明在期望下actor更新是价值-梯度类似的:反向传播传播的costates的条件期望等于价值梯度。其次,对于离散transformer策略,我们证明通过注意力机制的自适应微分会产生经验性的costates,这些近似于该价值信号,其误差受采样间隙和策略熵的控制。这些结果促使将RL影响分解为价值梯度信号和可达奖励空间,从而得出RL在预训练轨迹上最有效的标准。

英文摘要

Reinforcement learning substantially improves pretrained language models, but it remains understudied why critic-free methods such as PPO and GRPO work as well as they do, and when they should provide the largest gains. We develop a value-gradient perspective of critic-free RL for LLM post-training. First, under a differentiable rollout and additive-noise parameterization, we show that the actor update is value-gradient-like in expectation: the backward pass propagates costates whose conditional expectation equals the value gradient. Second, for discrete transformer policies, we show that autodifferentiation through attention produces empirical costates that approximate this value signal, with an error controlled by the sampling gap and policy entropy. These results motivate a decomposition of RL impact into value gradient signal and reachable reward headroom, yielding a criterion for when RL should be most effective along a pretraining trajectory.

2605.21653 2026-05-22 cs.LG cs.AI cs.CL 版本更新

Amplifying, Not Learning: Fine-Tuned AI Text Detectors Amplify a Pretrained Direction

放大而非学习:微调的AI文本检测器放大了预训练的方向

Alexander Smirnov

发表机构 * University College London(伦敦大学学院)

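AI summary: This study argues that fine-tuned AI text detectors amplify a pretrained typicality axis rather than learn an AI-vs-human boundary. It finds that full fine-tuning can reduce discrimination below the raw encoder on RoBERTa-base, that the axis inverts on non-native ESL writing, and that a closed-form Jacobian predictor transfers to third-party RoBERTa detectors.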

详情
AI中文摘要

AI文本检测器放大了预训练的典型性轴;它们并不构建AI与人类的边界。在没有任何任务监督的原始编码器上,将投影到AI-中心(HC3)的中心可以实现NYT与HC3的AUROC分别为0.806/0.944/0.834,跨三种架构(86-106%的微调辨别上限:在RoBERTa-base上,原始投影超过微调);在RoBERTa-base上,完全微调在两种流畅正式人口测试中降低了辨别能力。相同的轴在非母语ESL写作中反转(AUROC 0.06-0.20)--这是典型性阅读独有的可验证预测。一个24例冻结探测器与完全微调(0.900 vs 0.895)一致。一个闭合形式雅可比预测器参数化轴操纵干预,R²=1.000通用,提升了ELECTRA-CE部署的TPR从0.000到0.904(FPR=1%),并在三个独立训练的第三方RoBERTa检测器上转移,达到16/16 oracle等价(在OpenAI检测器上57%的NYT-FPR减少)。范围:编码器家族;机制幅度HC3锚定;人口层面共享轴,不同架构中每文本机制有所变化。三种操作上不同的探测器--文本表面caps_rate残差化、几何符号epsilon消融、闭合形式文本对预测器--在三种架构中一致,cos 0.74/0.81/1.00,确认了观察者不变性。在匹配TPR-0.90评估下,已发表的干预动物园(CC、dealign-f2c)在27个单元格中校准等价(|Delta AUROC| <= 0.0081),并且ELECTRA上的LoRA->full-FT偏移差距的97%是校准偏移而非学习表示--这是核心主张的预测确认。

英文摘要

AI text detectors amplify a pretrained typicality axis; they do not construct an AI-vs-human boundary. On raw encoders before any task supervision, projecting onto centroid(AI)-centroid(HC3) achieves NYT-vs-HC3 AUROC 0.806/0.944/0.834 across three architectures (86-106% of the fine-tuned discrimination ceiling: on RoBERTa-base, raw projection exceeds fine-tuning); on RoBERTa-base, full fine-tuning reduces discrimination below raw on both fluent-formal populations tested. The same axis inverts on non-native ESL writing (AUROC 0.06-0.20) -- a falsifiable prediction unique to the typicality reading. A 24-example frozen probe matches full fine-tuning (0.900 vs 0.895). A closed-form Jacobian predictor parameterises axis-manipulating interventions with R^2 = 1.000 universal, lifts ELECTRA-CE deployment TPR from 0.000 to 0.904 at FPR = 1%, and transfers to three independently-trained third-party RoBERTa detectors at 16/16 oracle-equivalence (57% NYT-FPR reduction on the OpenAI detector). Scope: encoder family; mechanism magnitude HC3-anchored; population-level shared axis with per-text mechanisms varying across architectures. Three operationally distinct probes -- text-surface caps_rate residualisation, geometric signed-epsilon ablation, closed-form text-pair predictor -- agree at cos 0.74/0.81/1.00 across three architectures, confirming observer-invariance. Under matched-TPR-0.90 evaluation, the published intervention zoo (CC, dealign-f2c) is calibration-equivalent across 27 cells (|Delta AUROC| <= 0.0081), and >= 97% of the LoRA->full-FT bias gap on ELECTRA is calibration shift, not learned representation -- the central claim's prediction confirmed.
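The centroid-difference projection at the start of this abstract can be sketched in a few lines. This is a toy illustration under stated assumptions, not the paper's code: the 2-D vectors stand in for encoder embeddings and are hypothetical, and AUROC is computed by brute-force pairwise comparison.

```python
from typing import List, Sequence

Vector = Sequence[float]

def centroid(vectors: List[Vector]) -> List[float]:
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]

def centroid_axis(ai_vecs: List[Vector], human_vecs: List[Vector]) -> List[float]:
    """Direction from the human centroid to the AI centroid."""
    return [a - h for a, h in zip(centroid(ai_vecs), centroid(human_vecs))]

def project(v: Vector, direction: Vector) -> float:
    """Dot product of a vector with the axis; used as the detection score."""
    return sum(a * b for a, b in zip(v, direction))

def auroc(pos_scores: List[float], neg_scores: List[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    wins = 0.0
    for p in pos_scores:
        for q in neg_scores:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

# Hypothetical 2-D "embeddings" used to build and evaluate the axis.
AI_FIT = [[1.0, 0.2], [0.8, 0.0]]
HUMAN_FIT = [[0.0, 0.1], [0.2, 0.3]]
AI_EVAL = [[0.9, 0.1], [0.4, 0.5]]
HUMAN_EVAL = [[0.1, 0.0], [0.5, 0.2]]

if __name__ == "__main__":
    axis = centroid_axis(AI_FIT, HUMAN_FIT)
    ai = [project(v, axis) for v in AI_EVAL]
    human = [project(v, axis) for v in HUMAN_EVAL]
    print(auroc(ai, human))
```

An AUROC below 0.5, as reported for the ESL population, means the same axis scores human text as more AI-like than AI text; in this sketch that corresponds to swapping the two score lists.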

2605.21649 2026-05-22 cs.LG cs.CL 版本更新

EntmaxKV: Support-Aware Decoding for Entmax Attention

EntmaxKV: 基于支持的解码方法用于Entmax注意力

Gonçalo Duarte, Miguel Couceiro, Marcos V. Treviso

发表机构 * Instituto Superior Técnico, Universidade de Lisboa(里斯本大学理工学院) ELLIS Unit Lisbon(里斯本ELLIS单位) INESC-ID Instituto de Telecomunicações(电信研究所)

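AI summary: This paper proposes EntmaxKV, a support-aware sparse decoding framework that exploits the sparsity of entmax attention before KV pages are loaded. It combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention, dropping less probability mass and speeding up long-context decoding.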

详情
AI中文摘要

长上下文解码越来越受到KV缓存内存流量的限制,因为每个生成的标记都需在缓存上进行注意力运算,而缓存大小与上下文长度成线性增长。现有稀疏解码方法通过选择部分标记或页面来减少成本,但这些方法是为softmax注意力设计的,其密集尾部使得任何截断都会丢弃非零的概率质量。相比之下,α-entmax产生精确的零,将稀疏解码从密集尾部近似转变为支持恢复:如果所选候选包含entmax支持,稀疏解码仍保持精确。虽然最近的entmax内核实现了高效的训练,但它们并未解决自回归解码瓶颈,即密集推理仍需在稀疏性确定之前流式传输完整的KV缓存。在本文中,我们引入了EntmaxKV,一种基于entmax的稀疏解码框架,它在KV页面加载前利用稀疏性。EntmaxKV结合了查询感知的页面评分、支持感知的候选选择和稀疏entmax注意力。我们通过分析截断误差中的丢弃概率质量δ,证明输出误差由δ控制,并在恢复entmax支持时消失。我们进一步引入了一种高斯感知的entmax选择器,从轻量级页面统计中估计entmax阈值,使所选预算适应于分数分布。实验证明,EntmaxKV比基于softmax的稀疏解码在相同KV预算下丢弃更少的概率质量,保留更多支持标记,并实现更低的输出误差。在长上下文和语言建模基准上,它接近完整的缓存entmax,但使用KV缓存的少量比例,达到100万上下文长度时,比完整的注意力基线快3.36倍(softmax)和5.43倍(entmax)。代码可在:https://github.com/deep-spin/entmaxkv获取。

英文摘要

Long-context decoding is increasingly limited by KV-cache memory traffic since each generated token attends over a cache whose size grows linearly with context length. Existing sparse decoding methods reduce this cost by selecting subsets of tokens or pages, but are designed for softmax attention, whose dense tails make any truncation discard nonzero probability mass. In contrast, $α$-entmax produces exact zeros, turning sparse decoding from dense-tail approximation into support recovery: if the selected candidates contain the entmax support, sparse decoding remains exact. While recent entmax kernels enable efficient training, they do not address the autoregressive decoding bottleneck, where dense inference still streams the full KV cache before sparsity is known. In this work, we introduce EntmaxKV, an entmax-native sparse decoding framework that exploits sparsity before KV pages are loaded. EntmaxKV combines query-aware page scoring, support-aware candidate selection, and sparse entmax attention. We analyze truncation error through the dropped probability mass $δ$, showing that output error is controlled by $δ$ and vanishes when the entmax support is recovered. We further introduce a Gaussian-aware entmax selector that estimates the entmax threshold from lightweight page statistics, adapting the selected budget to the score distribution. Empirically, EntmaxKV drops less probability mass, retains more support tokens, and achieves lower output error than softmax-based sparse decoding at matched KV budgets. On long-context and language modeling benchmarks, it closely matches full-cache entmax while using a small fraction of the KV cache, achieving up to $3.36\times$ (softmax) and $5.43\times$ (entmax) speedup over full attention baselines at 1M context length. Code available at: https://github.com/deep-spin/entmaxkv.
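The support-recovery argument can be seen with sparsemax, the α = 2 member of the α-entmax family. The sketch below swaps in sparsemax because it has a simple closed form; it is not the paper's kernel or selector, and the scores are hypothetical. It shows that attention weights computed over a candidate subset are exact when the subset contains the support, and wrong when a support token is missed.

```python
from typing import List

def sparsemax(scores: List[float]) -> List[float]:
    """Sparsemax: Euclidean projection of scores onto the probability simplex."""
    z = sorted(scores, reverse=True)
    cumsum = 0.0
    k, k_sum = 0, 0.0
    for j, zj in enumerate(z, start=1):
        cumsum += zj
        # The j largest scores are in the support while this condition holds.
        if 1.0 + j * zj > cumsum:
            k, k_sum = j, cumsum
    tau = (k_sum - 1.0) / k  # threshold below which weights are exactly zero
    return [max(s - tau, 0.0) for s in scores]

def sparse_decode_weights(scores: List[float], candidate_ids: List[int]) -> List[float]:
    """Attention weights computed only over selected candidates; all others get 0."""
    sub = sparsemax([scores[i] for i in candidate_ids])
    out = [0.0] * len(scores)
    for i, w in zip(candidate_ids, sub):
        out[i] = w
    return out

# Hypothetical attention scores for four cached tokens.
SCORES = [1.5, 1.0, 0.1, -1.0]

if __name__ == "__main__":
    print(sparsemax(SCORES))                         # full-cache weights
    print(sparse_decode_weights(SCORES, [0, 1, 2]))  # candidates cover the support
    print(sparse_decode_weights(SCORES, [0, 2]))     # a support token is missing
```

A softmax over the same scores would give every token nonzero weight, so any truncation would discard probability mass; that contrast is what the abstract calls dense-tail approximation versus support recovery.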

2605.21625 2026-05-22 cs.CV cs.AI cs.CL 版本更新

Flat-Pack Bench: Evaluating Spatio-Temporal Understanding in Large Vision-Language Models through Furniture Assembly

Flat-Pack Bench: 通过家具组装评估大视觉-语言模型的时空理解

Aditya Chetan, Eric Cai, Peeyush Kushwaha, Bharath Raj Nagoor Kani, Utkarsh Mall, Qianqian Wang, Noah Snavely, Bharath Hariharan

Affiliations: Cornell University; Cornell Tech; MBZUAI (Mohamed bin Zayed University of Artificial Intelligence); UC Berkeley

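AI summary: This paper introduces Flat-Pack Bench, a benchmark built on furniture assembly videos for evaluating the spatio-temporal understanding of large vision-language models, and finds that current models struggle significantly with fine-grained spatio-temporal reasoning.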

Comments CVPR 2026

详情
AI中文摘要

大视觉-语言模型(LVLMs)的出现显著提升了视频理解能力。然而,现有基准主要集中在粗粒度任务,如动作分割、分类、描述和检索,且这些基准通常依赖于易于口头识别的实体,如家庭物品、动物、人类主体等,限制了其在复杂真实视频场景中的适用性。但许多应用,如家具组装、烹饪等,需要对视频进行逐步细粒度的时空理解,而当前基准并未充分评估。为解决这一差距,我们引入了Flat-Pack Bench,一个专注于家具组装任务的新基准。我们的基准评估LVLMs在细微任务上的表现,包括组装动作的时间顺序、组装状态的时间定位、理解部件配合和追踪,使用多选问题配以视觉提示突出相关部分作为参考,以回答细粒度问题。我们的实验表明,最先进的LVLMs在细粒度时空推理上表现显著不足,凸显了其在有效利用视频时间信息、跟踪能力和理解空间交互(如物理接触)方面的局限性。

英文摘要

The emergence of Large Vision-Language Models (LVLMs) has significantly advanced video understanding capabilities. However, existing benchmarks focus predominantly on coarse-grained tasks such as action segmentation, classification, captioning, and retrieval. Furthermore, these benchmarks often rely on entities that can be easily identified verbally, such as household objects, animals, and human subjects, limiting their applicability to complex, in-the-wild video scenarios. Yet many applications, such as furniture assembly and cooking, require step-by-step, fine-grained spatio-temporal understanding of the video, which is not sufficiently evaluated in current benchmarks. To address this gap, we introduce Flat-Pack Bench, a novel benchmark centered on furniture assembly tasks. Our benchmark evaluates LVLMs on nuanced tasks, including temporal ordering of assembly actions, temporal localization of assembly state, understanding part mating, and tracking, using multiple-choice questions paired with visual prompts highlighting relevant parts as references for fine-grained questions. Our experiments reveal that state-of-the-art LVLMs struggle significantly with fine-grained spatio-temporal reasoning, highlighting their limited ability to leverage temporal information from videos, to track parts, and to understand spatial interactions such as physical contact.

2605.21609 2026-05-22 cs.CL cs.AI cs.CY 版本更新

CR4T: Rewrite-Based Guardrails for Adolescent LLM Safety

CR4T:基于重写的青少年LLM安全机制

Heajun An, Qi Zhang, Vedanth Achanta, Jin-Hee Cho

发表机构 * Virginia Tech(弗吉尼亚理工大学)

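AI summary: This paper proposes CR4T, a framework that replaces refusal-oriented guardrails with selective response reconstruction, improving LLM safety in a way that better fits adolescents' developmental needs.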

详情
AI中文摘要

大型语言模型(LLMs)越来越多地嵌入青少年的数字环境,介导信息搜索、建议和情感敏感的互动。然而,现有安全机制仍主要基于成人中心的规范,并通过拒绝导向的压制来实现安全。尽管这些方法可能减少即时的政策违规,但它们也可能导致对话死胡同、限制建设性指导,并未能解决青少年与AI互动中固有的发展脆弱性。我们主张,青少年LLM安全不应仅被视为过滤问题,而应被视为一种社会技术、发展一致的转变问题。为实现这一视角,我们提出了Critique-and-Revise-for-Teenagers(CR4T),一种模型无关的安全保障框架,该框架可选择性地将不安全或拒绝式输出重构为适合年龄的指导性响应,同时保持善意意图。CR4T结合轻量级风险检测与领域条件重写,以去除风险放大内容,减少不必要的对话关闭,并引入适合发展的指导。实验结果表明,针对重写显著减少了不安全和拒绝导向的结果,同时避免了对可接受互动的不必要的干预。这些发现表明,选择性响应重构为青少年面向的LLM系统提供了一种更以人为本的替代方案,以替代以拒绝为中心的安全机制。

英文摘要

Large language models (LLMs) are increasingly embedded in adolescent digital environments, mediating information seeking, advice, and emotionally sensitive interactions. Yet existing safety mechanisms remain largely grounded in adult-centric norms and operationalize safety through refusal-oriented suppression. While such approaches may reduce immediate policy violations, they can also create conversational dead-ends, limit constructive guidance, and fail to address the developmental vulnerabilities inherent in adolescent-AI interactions. We argue that adolescent LLM safety should be framed not solely as a filtering problem, but as a socio-technical, developmentally aligned transformation problem. To operationalize this perspective, we propose Critique-and-Revise-for-Teenagers (CR4T), a model-agnostic safeguarding framework that selectively reconstructs unsafe or refusal-style outputs into age-appropriate, guidance-oriented responses while preserving benign intent. CR4T combines lightweight risk detection with domain-conditioned rewriting to remove risk-amplifying content, reduce unnecessary conversational shutdown, and introduce developmentally appropriate guidance. Experimental results show that targeted rewriting substantially reduces unsafe and refusal-oriented outcomes while avoiding unnecessary intervention on acceptable interactions. These findings suggest that selective response reconstruction offers a more human-centered alternative to refusal-centric guardrails for adolescent-facing LLM systems.

2605.21558 2026-05-22 cs.LG cs.CL 版本更新

From Parameters to Data: A Task-Parameter-Guided Fine-Tuning Pipeline for Efficient LLM Alignment

从参数到数据:一种任务参数引导的微调流水线用于高效的LLM对齐

Hao Chen, Qi Zhang, Liyao Li, Zhanming Shen, Wentao Ye, Lirong Gao, Ningtao Wang, Xing Fu, Xiaoyu Shen, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) Eastern Institute of Technology(东部技术研究所)

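AI summary: This study proposes a task-parameter-guided fine-tuning pipeline that uses task-sensitive attention heads as a dual guide for both sample mining and structural pruning, making LLM alignment more efficient.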

Comments Accepted@ICML26, 28 pages, 11 figures, 26 tables

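Details
AI-generated Chinese abstract (translated)

Adapting large language models (LLMs) to specialized domains usually incurs high data and computational overhead. Although prior efficiency work has mostly treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests that they may be intrinsically coupled. We propose the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that uses these task-sensitive attention heads as a dual compass for sample mining and structural pruning. To rigorously quantify the cost of the whole pipeline, we introduce the Alignment Efficiency Ratio (AER) metric, which accounts for both selection latency and training time. Mechanistically, P2D identifies critical heads through a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating only 10% of the attention heads on 10% of the data, P2D achieves an 8.3-percentage-point performance gain over strong baselines and a 7.0x end-to-end speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

English abstract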

Adapting Large Language Models (LLMs) to specialized domains typically incurs high data and computational overhead. While prior efficiency efforts have largely treated data selection and parameter-efficient fine-tuning as isolated processes, our empirical analysis suggests they may be intrinsically coupled. We posit the Strong Map Hypothesis: a sparse subset of attention heads plays a dominant role in task-specific adaptation, acting as keys that unlock specific data patterns. Building on this observation, we propose From Parameters to Data (P2D), a unified framework that leverages these task-sensitive attention heads as a dual compass for both sample mining and structural pruning. To rigorously quantify the total pipeline cost, we introduce the Alignment Efficiency Ratio (AER) metric for both selection latency and training time. Mechanistically, P2D identifies critical heads via a lightweight proxy and uses them as a functional filter to curate high-affinity data, establishing a synergistic pipeline. Empirically, by updating merely 10% of attention heads on 10% of the data, P2D achieves an 8.3 pp performance gain over strong baselines and delivers a 7.0x end-to-end time speedup. These results validate that precise parameter-data synchronization eliminates redundancy, offering a new paradigm for efficient alignment.

2605.21540 2026-05-22 cs.SI cs.AI cs.CL cs.CY Updated version

Detecting Synthetic Political Narratives in Cross-Platform Social Media Discourse

Despoina Antonakaki, Sotiris Ioannidis

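Affiliations * Institute of Computer Science, Foundation for Research and Technology; Technical University of Crete

AI summary This paper presents a cross-platform framework that combines four coordination signals (lexical diversity, temporal burstiness, rhetorical repetition, and semantic homogenization) into a Synthetic Narrative Coordination Score SNC(C) to detect synthetic political narratives. IntelSlava ranks first on four of the six event windows, while Rybar ranks last despite high semantic homogenization because of language differences.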

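Details
AI-generated Chinese abstract (translated)

The spread of large language models has introduced a new paradigm of synthetic political communication in which narratives may be generated, semantically coordinated, and strategically disseminated across multiple platforms at scale. We propose a cross-platform framework that uses four coordination signals, namely lexical diversity D(C), temporal burstiness B(C), rhetorical repetition R(C), and semantic homogenization H(C), combined into a Synthetic Narrative Coordination Score SNC(C), to detect synthetic political narratives. We analyze 353,223 records covering six geopolitical event windows, collected from six Telegram channels and nine Reddit communities (2023-2026). The results show that IntelSlava has the lowest lexical diversity (MATTR 0.52-0.54), the highest burstiness (B=+0.48 to +0.73), and the highest rhetorical overlap with peer channels (Jaccard 0.12), ranking first on four of the six event windows (SNC 0.45-0.60). Rybar ranks last on all windows despite its high semantic homogenization, because its Russian-language output yields high lexical diversity and near-zero rhetorical Jaccard with English-language channels, which shows that no single indicator suffices for coordination detection. The multi-dimensional SNC(C) score provides a more robust and interpretable signal than any single metric.

English abstract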

The proliferation of large language models has introduced a new paradigm of synthetic political communication in which narratives may be generated, semantically coordinated, and strategically disseminated across platforms at scale. We present a cross-platform framework for detecting synthetic political narratives using four coordination signals -- lexical diversity D(C), temporal burstiness B(C), rhetorical repetition R(C), and semantic homogenization H(C) -- combined into a Synthetic Narrative Coordination Score SNC(C). We apply the framework to a corpus of 353,223 records spanning six geopolitical event windows collected from six Telegram channels and nine Reddit communities (2023--2026). Results show that IntelSlava exhibits the lowest lexical diversity (MATTR 0.52--0.54), the highest burstiness (B=+0.48 to +0.73), and the highest rhetorical overlap with peer channels (Jaccard 0.12), ranking first in the composite SNC(C) on four of six event windows (SNC 0.45--0.60). Rybar ranks last on all windows despite its high semantic homogenization, because its Russian-language output yields high lexical diversity and near-zero rhetorical Jaccard with English-language channels -- demonstrating that no single indicator is sufficient for coordination detection. Multi-dimensional SNC(C) scoring provides a more robust and interpretable signal than any individual metric.
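Two of the signals named in the abstract, MATTR lexical diversity and Jaccard overlap, have standard definitions that can be sketched briefly. The snippet below is only an illustration under stated assumptions: the function names `mattr` and `jaccard`, the window size, and the whitespace tokenization are hypothetical, and the abstract does not say how the four signals are weighted into SNC(C), so no composite score is attempted.

```python
def mattr(tokens, window=50):
    """Moving-average type-token ratio: mean TTR over all sliding windows."""
    if not tokens:
        return 0.0
    if len(tokens) <= window:
        # Text shorter than one window: fall back to the plain type-token ratio.
        return len(set(tokens)) / len(tokens)
    ratios = [
        len(set(tokens[i:i + window])) / window
        for i in range(len(tokens) - window + 1)
    ]
    return sum(ratios) / len(ratios)


def jaccard(a, b):
    """Jaccard overlap |A & B| / |A | B| between two collections of phrases."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Hypothetical toy input; real use would tokenize channel posts per event window.
sample = "the front is stable the front is quiet".split()
print(round(mattr(sample, window=4), 3))
print(jaccard({"a", "b", "c"}, {"b", "c", "d"}))
```

A lower MATTR means more repeated vocabulary inside each window, and a higher Jaccard value means more shared phrasing between two channels, which matches the direction in which the abstract reads both signals.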

2605.21496 2026-05-22 cs.LG cs.AI cs.CL Updated version

HealthCraft: A Reinforcement Learning Safety Environment for Emergency Medicine

Brandon Dent

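Affiliations * GOATnote Inc.

AI summary This paper presents HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions. Using a FHIR R4 world state, 24 MCP tools, and a dual-layer rubric, it evaluates model safety and performance on emergency-medicine tasks and reveals safety failures in multi-step workflows.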

Comments 16 pages, 5 figures, 6 tables. Code, task suite, and Docker bundle: https://github.com/GOATnote-Inc/healthcraft

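Details
AI-generated Chinese abstract (translated)

Frontier language models are being deployed into clinical workflows faster than the infrastructure for evaluating their safety. Static medical question-answering benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that sets the reward to zero whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 of them safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows, the closest proxy to real emergency care, performance drops to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger", evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows that the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability that an evaluation harness can tolerate but a training reward cannot. We scaffold coupling to the Megatron+SGLang+GRPO loop of Corecraft Section 5.2 and leave training-reward ablations as future work. The environment, tasks, rubrics, and harness are released under Apache 2.0.

English abstract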

Frontier language models are being deployed into clinical workflows faster than the infrastructure to evaluate them safely. Static medical-QA benchmarks miss the failure modes that matter in emergency medicine: trajectory-level safety collapse, tool misuse, and capitulation under sustained clinical pressure. We present HealthCraft, the first public reinforcement-learning environment that rewards trajectory-level safety under realistic emergency-medicine conditions, adapted from Corecraft. It is built on a FHIR R4 world state with 14 entity types and 3,987 seed entities, exposes 24 MCP tools, and defines a dual-layer rubric that zeroes reward whenever any safety-critical criterion is violated. We release 195 tasks across six categories, graded against 2,255 binary criteria (515 safety-critical); a post-hoc 10-task negative-class slate extends this to 205 tasks and 2,337 criteria. V8 results on two frontier models show Claude Opus 4.6 at Pass@1 24.8% [21.5-28.4] and GPT-5.4 at 12.6% [10.2-15.6], with safety-failure rates of 27.5% and 34.0%. On multi-step workflows - the closest proxy to real emergency care - performance collapses to near zero (Claude 1.0%, GPT-5.4 0.0%) despite partial competence on individual steps. Six infrastructure bugs fixed between pilots v2 and v8 re-ordered which model "looks stronger," evidence that infrastructure fidelity is part of the measurement. A deterministic LLM-judge overlay bounds evaluator noise, and a 60-run negative-class smoke pilot shows the reward signal is not drop-in training-safe: restraint criteria pass at 0.929 prevalence, a gameability an eval harness can tolerate but a training reward cannot. We scaffold coupling to a Megatron+SGLang+GRPO loop per Corecraft Section 5.2 and leave training-reward ablations as future work. Environment, tasks, rubrics, and harness are released under Apache 2.0.
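The gating behavior of the dual-layer rubric, where any violated safety-critical criterion zeroes the reward, can be illustrated with a small sketch. This is not the HealthCraft implementation: the `Criterion` class, the `trajectory_reward` function, and the choice of "fraction of criteria passed" as the ungated score are all assumptions made for illustration, since the abstract only specifies the zeroing rule.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One binary rubric item; the field names are hypothetical."""
    name: str
    passed: bool
    safety_critical: bool = False


def trajectory_reward(criteria):
    """Fraction of criteria passed, zeroed if any safety-critical one fails."""
    if not criteria:
        return 0.0
    # Gate: a single safety-critical violation overrides all other credit.
    if any(c.safety_critical and not c.passed for c in criteria):
        return 0.0
    # Assumed base score: plain pass fraction over all binary criteria.
    return sum(c.passed for c in criteria) / len(criteria)


# Hypothetical rubric for one trajectory.
rubric = [
    Criterion("orders_ecg", True),
    Criterion("documents_allergies", False),
    Criterion("avoids_contraindicated_drug", True, safety_critical=True),
]
print(trajectory_reward(rubric))
```

The gate makes the reward non-compensatory: partial competence on ordinary criteria cannot offset a safety failure, which is consistent with the near-zero multi-step results the abstract reports.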

2605.21491 2026-05-22 cs.LG cs.AI cs.CL Updated version

Teaching Language Models to Forecast Research Success Through Comparative Idea Evaluation

Srujan P Mule, Aniketh Garikaparthi, Manasi Patwardhan

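Affiliations * IISER Pune

AI summary This study examines whether language models can forecast the empirical success of research ideas without running experiments. It builds a dataset of 11,488 idea pairs grounded in objective PapersWithCode outcomes, finds that reinforcement learning lifts model accuracy to 71.35%, and shows that small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.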

Comments ACL 2026 Findings

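Details
AI-generated Chinese abstract (translated)

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether language models can learn to forecast the empirical success of research ideas before any experiment is run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. Although off-the-shelf 8B-parameter models perform poorly (30% accuracy), SFT raises performance substantially to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task through Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, reaching 71.35% accuracy with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results indicate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

English abstract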

As language models accelerate scientific research by automating hypothesis generation and implementation, a new bottleneck emerges: evaluating and filtering hundreds of AI-generated ideas without exhaustive experimentation. We ask whether LMs can learn to forecast the empirical success of research ideas before any experiments are run. We study comparative empirical forecasting: given a benchmark-specific research goal and two candidate ideas, predict which will achieve better benchmark performance. We construct a dataset of 11,488 idea pairs grounded in objective outcomes from PapersWithCode. While off-the-shelf 8B-parameter models struggle (30% acc.), SFT dramatically boosts performance to 77.1%, outperforming GPT-5 (61.1%). By framing evaluation as a reasoning task via Reinforcement Learning with Verifiable Rewards (RLVR), we train models to discover latent reasoning paths, achieving 71.35% acc. with interpretable justifications. Through additional ablations and out-of-distribution tests, we show robustness to surface-level heuristics and transfer to both a cross-domain time-split test set and an independently constructed test set. Our results demonstrate that compute-efficient small language models can serve as effective, objective verifiers, offering a scalable path for autonomous scientific discovery.

2605.18721 2026-05-22 cs.LG cs.CL Updated version

General Preference Reinforcement Learning

Muhammad Umer, Muhammad Ahmed Mohsin, Ahsan Bilal, Arslan Chaudhry, Andreas Haupt, Sanmi Koyejo, Emily Fox, John M. Cioffi

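Affiliations * Stanford University; The University of Oklahoma

AI summary This paper proposes General Preference Reinforcement Learning (GPRL), which uses the General Preference Model (GPM) to bring continuous online exploration to open-ended tasks that conventional verifier-based reinforcement learning cannot reach, and improves model performance through multi-dimensional preference comparison.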

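Details
AI-generated Chinese abstract (translated)

Post-training has split large language model (LLM) alignment into two largely separate tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation but gives up the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is ill-suited to the task. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto the axis the score is most sensitive to. We instead adopt the General Preference Model (GPM), which embeds responses into k skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. On this basis we propose General Preference Reinforcement Learning (GPRL), which carries the k-dimensional structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each one so that no axis dominates, and aggregates them with context-dependent eigenvalues. The same structure drives a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of 56.51% on AlpacaEval 2.0 and outperforms SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking over extended training runs.

English abstract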

Post-training has split large language model (LLM) alignment into two largely disconnected tracks. Online reinforcement learning (RL) with verifiable rewards drives emergent reasoning on math and code but depends on a programmatic verifier that cannot reach open-ended tasks, while preference optimization handles open-ended generation yet forgoes the continuous exploration that powers online RL. Closing this gap requires a verifier for open-ended quality, but a scalar reward model is the wrong shape for the job. Quality is multi-dimensional, and any scalar score is an incomplete proxy that lets online RL collapse onto whichever axis the score is most sensitive to. We turn instead to the General Preference Model (GPM), which embeds responses into $k$ skew-symmetric subspaces and represents preference as a structured, intransitivity-aware comparison. Building on this, we propose General Preference Reinforcement Learning (GPRL), which carries the $k$-way structure through to the policy update. GPRL computes per-dimension group-relative advantages, normalizes each on its own scale so no axis can dominate, and aggregates them with context-dependent eigenvalues. The same structure powers a closed-loop drift monitor that detects single-axis exploitation and corrects it on the fly by reweighting dimensions and tightening the trust region. Starting from Llama-3-8B-Instruct, GPRL reaches a length-controlled win rate of $56.51\%$ on AlpacaEval 2.0 while also outperforming SimPO and SPPO on Arena-Hard, MT-Bench, and WildBench by resisting reward hacking across extended training runs.
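The step "compute per-dimension group-relative advantages, normalize each on its own scale, then aggregate" can be sketched as follows. This is a rough illustration rather than the GPRL algorithm itself: it assumes each response in a group already has one scalar score per dimension (in the paper these come from GPM comparisons), and the fixed `weights` argument merely stands in for the context-dependent eigenvalues. The name `aggregate_advantages` is hypothetical.

```python
from statistics import mean, pstdev


def aggregate_advantages(scores, weights, eps=1e-8):
    """scores[i][d] is the score of response i on dimension d.

    Returns one scalar advantage per response in the group.
    """
    k = len(weights)
    normalized = []
    for d in range(k):
        column = [row[d] for row in scores]
        mu, sigma = mean(column), pstdev(column)
        # Group-relative z-score, computed separately for each dimension so
        # that a large-magnitude axis cannot dominate the others.
        normalized.append([(x - mu) / (sigma + eps) for x in column])
    return [
        sum(weights[d] * normalized[d][i] for d in range(k))
        for i in range(len(scores))
    ]


# Two responses scored on two dimensions with very different raw scales.
group = [[1.0, 100.0], [3.0, 300.0]]
print(aggregate_advantages(group, weights=[0.5, 0.5]))
```

Because each dimension is standardized within the group before weighting, the second dimension contributes no more than the first here even though its raw values are a hundred times larger.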

2605.15588 2026-05-22 cs.CL cs.LG Updated version

Calibrating LLMs with Semantic-level Reward

Fengfei Yu, Ruijia Niu, Dongxia Wu, Yian Ma, Rose Yu

Affiliations * Department of Computer Science and Engineering, University of California San Diego, La Jolla, California, USA; Halıcıoğlu Data Science Institute, University of California San Diego, La Jolla, California, USA; Department of Statistics, Stanford University, Stanford, California, USA

AI summary This paper proposes CSR, a new calibration framework that calibrates language models directly in semantic space, avoiding the inconsistency caused by verbalized confidence in earlier methods. Experiments show that CSR lowers ECE and raises AUROC across multiple datasets.

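Details
AI-generated Chinese abstract (translated)

As large language models (LLMs) are applied in consequential domains such as medical question answering and legal reasoning, the ability to estimate whether their outputs are correct is essential for safe and reliable use, which requires well-calibrated models. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is insensitive to confidence and imposes no penalty on confident but wrong predictions, which degrades calibration. Recent work addresses this by training models to produce verbalized confidence scores and rewarding agreement with correctness. However, verbalized confidence is inconsistent across texts that differ in wording but share the same meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a new semantic calibration reward that encourages exploitation by promoting semantic agreement among correct rollouts and exploration by discouraging spurious consistency among incorrect ones. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR achieves lower ECE and higher AUROC than verbalized-confidence baselines in nearly all settings, reducing ECE by up to 40% and improving AUROC by up to 31%, with calibration behavior that generalizes robustly across all four evaluation settings.

English abstract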

As large language models (LLMs) are deployed in consequential settings such as medical question answering and legal reasoning, the ability to estimate when their outputs are likely to be correct is essential for safe and reliable use, requiring well-calibrated uncertainty. Standard reinforcement learning with verifiable rewards (RLVR) trains models with a binary correctness reward that is indifferent to confidence, providing no penalty for confident but wrong predictions and thereby degrading calibration. Recent work addresses this by training models to produce verbalized confidence scores alongside answers and rewarding agreement with correctness. However, verbalized confidence is calibrated at the token level and thus exhibits inconsistency across textual variations with the same semantic meaning. We propose Calibration with Semantic Reward (CSR), a framework that calibrates language models directly in semantic space without a verbalized confidence interface. CSR combines the correctness reward with a novel semantic calibration reward that encourages exploitation among correct rollouts by promoting semantic agreement, and exploration among incorrect ones by discouraging spurious consistency. Experiments across three model families on HotpotQA (in-distribution) and TriviaQA, MSMARCO, and NQ-Open (out-of-distribution) show that CSR consistently achieves lower ECE and higher AUROC than verbalized-confidence baselines across nearly all settings, reducing ECE by up to $40\%$ and improving AUROC by up to $31\%$, with calibration behavior generalizing robustly across all four evaluation settings.
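The ECE figure reported above is the expected calibration error, which in its common binned form is the sample-weighted average gap between accuracy and mean confidence per confidence bin. The sketch below shows that standard form only; the bin count and the function name are assumptions, and the abstract does not state which binning the authors used or how CSR derives its confidence values.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted mean |accuracy - confidence| over equal-width confidence bins."""
    total = len(confidences)
    if total == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # Confidence 1.0 is clamped into the last bin.
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


# Two answers given at 90% confidence, only one of them right.
print(expected_calibration_error([0.9, 0.9], [True, False]))
```

A model that is confident but wrong half the time, as in the toy call, shows a large gap, which is the failure the binary correctness reward leaves unpenalized.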

2605.12623 2026-05-22 cs.CL cs.CV cs.LG Updated version

DocAtlas: Multilingual Document Understanding Across 80+ Languages

Ahmed Heakl, Youssef Mohamed, Abdullah Sohail, Rania Elbadry, Ahmed Nassar, Peter W. J. Staar, Fahad Shahbaz Khan, Imran Razzak, Salman Khan

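Affiliations * MBZUAI; IBM Research

AI summary This paper proposes the DocAtlas framework, which builds high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks, uses dual pipelines to generate precise structural annotations, and shows that Direct Preference Optimization is effective for multilingual adaptation, improving both in-domain and out-of-domain accuracy.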

Comments Under submission

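Details
AI-generated Chinese abstract (translated)

Multilingual document understanding for low-resource languages is limited by scarce training data and by model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework for building high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce annotations in a unified DocTag format that encodes layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as the positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves on the strongest baseline by 1.7%. Code is available at https://github.com/ahmedheakl/DocAtlas.

English abstract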

Multilingual document understanding remains limited for low-resource languages due to scarce training data and model-based annotation pipelines that perpetuate existing biases. We introduce DocAtlas, a framework that constructs high-fidelity OCR datasets and benchmarks covering 82 languages and 9 evaluation tasks. Our dual pipelines, differential rendering of native DOCX documents and synthetic LaTeX-based generation for right-to-left scripts, produce precise structural annotations in a unified DocTag format encoding layout, text, and component types, without learned models for core annotation. Evaluating 16 state-of-the-art models reveals persistent gaps in low-resource scripts. We show that Direct Preference Optimization (DPO) using rendering-derived ground truth as positive signal achieves stable multilingual adaptation, improving both in-domain (+1.9%) and out-of-domain (+1.8%) accuracy without measurable base-language degradation, whereas supervised fine-tuning degrades out-of-domain performance by up to 21%. Our best variant, DocAtlas-DeepSeek, improves +1.7% over the strongest baseline. Code is available at https://github.com/ahmedheakl/DocAtlas.

2605.09252 2026-05-22 cs.CL Updated version

LLM Agents Already Know When to Call Tools -- Even Without Reasoning

Chung-En Sun, Linbo Liu, Ge Yan, Zimo Wang, Tsui-Wei Weng

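Affiliations * University of California, San Diego

AI summary This paper proposes the When2Tool benchmark, which uses 18 environments to study when tool calls are necessary. It finds that models can already identify when a tool is needed but fail to use this knowledge during generation, and it proposes the Probe&Prefill method, which substantially reduces tool calls.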

Details
English abstract

Tool-augmented LLM agents tend to call tools indiscriminately, even when the model can answer directly. Each unnecessary call wastes API fees and latency, yet no existing benchmark systematically studies when a tool call is actually needed. We propose When2Tool, a benchmark of 18 environments (15 single-hop, 3 multi-hop) spanning three categories of tool necessity -- computational scale, knowledge boundaries, and execution reliability -- each with controlled difficulty levels that create a clear decision boundary between tool-necessary and tool-unnecessary tasks. We evaluate two families of training-free baselines: Prompt-only (varying the prompt to discourage unnecessary calls) and Reason-then-Act (requiring the model to reason about tool necessity before acting). Both provide limited control: Prompt-only suppresses necessary calls alongside unnecessary ones, and Reason-then-Act still incurs a disproportionate accuracy cost on hard tasks. To understand why these baselines fail, we probe the models' hidden states and find that tool necessity is linearly decodable from the pre-generation representation with AUROC 0.89--0.96 across six models, substantially exceeding the model's own verbalized reasoning. This reveals that models already know when tools are needed, but fail to act on this knowledge during generation. Building on this finding, we propose Probe&Prefill, which uses a lightweight linear probe to read the hidden-state signal and prefills the model's response with a steering sentence. Across all models tested, Probe&Prefill reduces tool calls by 48% with only 1.7% accuracy loss, while the best baseline at comparable accuracy only reduces 6% of tool calls, or achieves a similar tool call reduction but incurs a 5$\times$ higher accuracy loss. Our code is available at https://github.com/Trustworthy-ML-Lab/when2tool

2604.08571 2026-05-22 cs.LG cs.AI cs.CL Updated version

Robust Reasoning Benchmark

Pavel Golikov, Evgenii Opryshko, Gennady Pekhimenko, Mark C. Jeffrey

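Affiliations * University of Toronto; Vector Institute

AI summary This study proposes the Robust Reasoning Benchmark (RRB), which evaluates 8 frontier models under 13 deterministic textual perturbations. It finds that Claude refuses many transformed prompts, while open-weights models show several failure modes under structural noise, such as cognitive thrashing, tokenization breakdown, and reasoning collapse, with average accuracy drops of up to 54%. The study also identifies attention dilution caused by the model's own chain-of-thought, introduces the concept of Intra-Query Attention Dilution, and argues that intermediate reasoning steps pollute standard dense attention, so future architectures need explicit contextual resets for reliable reasoning.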

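Details
AI-generated Chinese abstract (translated)

Although large language models (LLMs) perform well on standard mathematical benchmarks, their problem-solving ability depends on context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), which consists of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely robust, except that Claude refuses many transformed prompts. Open-weights reasoning models show a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with average accuracy drops of up to 54% across perturbations and drops of up to 100% on some. We further study one of these failure modes: attention dilution caused by the model's own chain-of-thought. By asking models to solve multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models from 7B to 120B parameters show declining accuracy on later problems, suggesting that intermediate reasoning steps pollute standard dense attention mechanisms. We argue that reliable reasoning requires future architectures to integrate explicit contextual resets within the model's own chain-of-thought, which raises open research questions about the optimal granularity of reasoning tasks.

English abstract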

While Large Language Models (LLMs) achieve high performance on standard mathematical benchmarks, their problem-solving abilities depend on the context and textual formatting. We introduce the Robust Reasoning Benchmark (RRB), a pipeline of 13 deterministic textual perturbations applied to AIME 2024 and AIME 2025. Evaluating 8 state-of-the-art models, we find that frontier models are largely resilient, with the notable exception of Claude, which categorically refuses many transformed prompts. Open-weights reasoning models exhibit a range of failure modes under structural noise (cognitive thrashing, tokenization breakdown, and reasoning collapse), with up to 54% average accuracy drops across perturbations and up to 100% on some. We further study one of these failure modes in isolation: attention dilution caused by the model's own chain-of-thought. By tasking models with solving multiple independent mathematical problems sequentially within a single context window, we identify Intra-Query Attention Dilution. Open-weights models ranging from 7B to 120B parameters exhibit accuracy decay on subsequent problems, suggesting that intermediate reasoning steps progressively pollute standard dense attention mechanisms. We argue that in order to achieve reliable reasoning, future architectures need to integrate explicit contextual resets within models' own chain-of-thought, leading to open research questions regarding the optimal granularity of reasoning tasks.

2603.14987 2026-05-22 cs.CL cs.DB Updated version

Beyond Benchmark Islands: Toward Representative Trustworthiness Evaluation for Agentic AI

Jinhu Qi, Yifan Li, Minghao Zhao, Wentao Zhang, Zijian Zhang, Yaoman Li, Irwin King

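Affiliations * The Chinese University of Hong Kong; Macao Polytechnic University; Jilin University

AI summary This paper proposes a five-property definition of agentic trustworthiness and introduces the Holographic Agent Assessment Framework (HAAF), which assesses the trustworthiness of agent systems in socio-technical scenarios through static policy analysis over a scenario manifold, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling. It reports a cross-family transfer experiment covering 13 systems from seven model families.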

Comments 9 pages, 3 figures, 8 tables. Submitted to the Agent4IR Workshop at KDD 2026

Details
English abstract

Agentic AI systems increasingly act through tool-augmented, multi-step workflows whose failures (unsafe tool use, unauthorised actions, social harm) carry deployment-level consequences. Evaluation practice remains fragmented across isolated benchmark slices, and "trustworthiness" is frequently invoked but rarely defined operationally. We argue the central limitation is twofold: (i) the absence of a measurable specification of what agent trustworthiness means, and (ii) the lack of a principled notion of representativeness allowing assessment over a socio-technical scenario distribution rather than disconnected benchmark instances. We address (i) by defining agentic trustworthiness as a five-property profile (Reliability, Robustness, Safety, Social-Ethical Alignment, Operational Integrity) grounded in current AI risk frameworks, and (ii) with the Holographic Agent Assessment Framework (HAAF), which measures this profile over a scenario manifold through static policy analysis, sandbox simulation, social-ethical alignment assessment, and distribution-aware sampling, connected through an iterative Trustworthy Optimization Factory that converts red-team diagnoses into blue-team interventions. Our contributions are: (1) an operational five-property definition of agentic trustworthiness; (2) a distribution-aware scenario-sampling framework that surfaces property-level trade-offs invisible to scalar leaderboards; and (3) a cross-family transfer experiment in which interventions designed from a single focal model generalise -- without per-model or per-scenario tuning -- to 13 systems from seven model families (Llama, Mistral, Kimi, GLM, Qwen, GPT, DeepSeek) on a 100-scenario suite, where all 13 systems improve and two reach a perfect risk-weighted profile, establishing HAAF's Factory as a model-agnostic deployment-readiness pipeline. Code: https://github.com/TonyQJH/haaf-pilot

2602.23200 2026-05-22 cs.LG cs.CL Updated version

InnerQ: Hardware-Aware Tuning-Free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

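Affiliations * Department of Electrical and Computer Engineering

AI summary This paper proposes InnerQ, a hardware-aware KV cache quantization method that reduces decode latency without harming evaluation performance. Its group-wise quantization strategy increases data reuse, and it improves few-shot evaluation scores on Llama and Mistral models.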

Comments 18 pages, 5 figures, 7 tables

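Details
AI-generated Chinese abstract (translated)

When transformer-based language models are used for text generation, most inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods that compress the KV cache while minimizing precision loss. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without sacrificing evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units. As a result, InnerQ reduces memory access and accelerates dequantization, achieving an average 1.3x speedup over prior KV cache quantization methods and 2.7x over the non-quantized baseline. To maintain fidelity under aggressive compression, InnerQ combines three techniques: (i) hybrid quantization, which chooses symmetric or asymmetric quantization for each group based on local statistics; (ii) high-precision windows for recent tokens and attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the model parameters to eliminate runtime overhead. Beyond reducing latency, experiments on Llama and Mistral models show that InnerQ also improves few-shot evaluation scores relative to prior KV cache quantization methods.

English abstract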

When transformer-based language models are deployed for text generation, most of the inference time is spent in the decoding stage, where output tokens are generated sequentially. Reducing the hardware cost of each decoding step is therefore critical for efficient long-context generation. A major bottleneck is the key-value (KV) cache, whose size grows with sequence length and often dominates the model's memory footprint. Prior work has proposed quantization methods to compress the KV cache while minimizing its loss of precision. We present InnerQ, a hardware-aware KV cache quantization scheme that reduces decode latency without compromising evaluation performance. InnerQ performs group-wise quantization by grouping cache matrices along their inner dimension. This grouping strategy aligns dequantization with vector-matrix multiplication and increases data reuse across GPU compute units. As a result, InnerQ reduces memory access and accelerates dequantization, achieving an average $1.3\times$ speedup over prior KV cache quantization methods and $2.7\times$ over the non-quantized baseline. To maintain fidelity under aggressive compression, InnerQ incorporates three techniques: (i) hybrid quantization, which chooses symmetric or asymmetric quantization for each group based on local statistics; (ii) high-precision windows for both recent tokens and attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the model parameters to eliminate runtime overhead. Beyond reducing latency, experiments on Llama and Mistral models show that InnerQ also improves few-shot evaluation scores relative to prior KV cache quantization methods.
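Group-wise quantization with a per-group choice between symmetric and asymmetric schemes can be sketched in plain Python. This is a generic illustration, not InnerQ's kernel: the abstract says only that the choice is "based on local statistics", so the rule used here (keep whichever scheme reconstructs the group with lower squared error) is an assumption, as are the function names and the treatment of a row as the inner dimension. The code returns dequantized ("fake-quantized") values so the error is easy to inspect, and it assumes `bits >= 2`.

```python
def quantize_group(values, bits=4):
    """Quantize one group both ways and keep the lower-error reconstruction."""
    levels = 2 ** bits - 1
    # Asymmetric: map [min, max] onto the integer range [0, levels].
    lo, hi = min(values), max(values)
    scale_a = (hi - lo) / levels or 1.0
    asym = [round((v - lo) / scale_a) * scale_a + lo for v in values]
    # Symmetric: map [-max|v|, +max|v|] onto [-half, +half], zero stays exact.
    half = levels // 2
    scale_s = max(abs(v) for v in values) / half or 1.0
    sym = [max(-half, min(half, round(v / scale_s))) * scale_s for v in values]

    def squared_error(recon):
        return sum((r - v) ** 2 for r, v in zip(recon, values))

    # Hypothetical selection rule; ties go to the cheaper symmetric scheme.
    if squared_error(sym) <= squared_error(asym):
        return "symmetric", sym
    return "asymmetric", asym


def quantize_row(row, group_size=4, bits=4):
    """Split a row into contiguous groups and quantize each independently."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


for mode, recon in quantize_row([0.0, 1.0, 2.0, 3.0, -7.0, 0.0, 7.0, 3.0]):
    print(mode, [round(x, 3) for x in recon])
```

Groups whose values are all shifted away from zero tend to favor the asymmetric scheme, since it spends its integer range on the interval actually occupied, while groups centered on zero lose nothing with the symmetric one.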

2602.16169 2026-05-22 cs.LG cs.CL Updated version

Discrete Stochastic Localization for Non-autoregressive Generation

Yunshu Wu, Jiayi Cheng, Longxuan Yu, Partha Thakuria, Rob Brekelmans, Evangelos E. Papalexakis, Greg Ver Steeg

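Affiliations * University of California Riverside; New York University

AI summary This paper proposes Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio, which improves distributional faithfulness in discrete sequence generation, and it demonstrates the approach on OpenWebText.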

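Details
AI-generated Chinese abstract (translated)

Continuous diffusion is a natural framework for non-autoregressive generation, but it has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. A single trained network can then support an entire family of SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from T=128 to T=1024, and the same checkpoint supports random-order autoregressive sampling as well as a hybrid continuous-then-discrete sampler using as few as T=48 total steps, without distillation or retraining.

English abstract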

Continuous diffusion is a natural framework for non-autoregressive generation but has generally lagged behind masked discrete diffusion models (MDMs) on discrete sequence generation. We argue that the bottleneck is not continuity itself, but a representation in which denoising depends on timestep-indexed noise regimes. We introduce Discrete Stochastic Localization (DSL), a continuous-state framework with unit-sphere token embeddings whose Bayes-optimal denoiser is invariant to the nominal signal-to-noise ratio (SNR) under the localization channel. One trained network then supports an entire family of per-token SNR paths, with endpoint masked-diffusion paths as a special case. Fine-tuning a pretrained MDLM checkpoint with DSL substantially improves distributional faithfulness (MAUVE) on OpenWebText across all step budgets from $T{=}128$ to $T{=}1024$, and the same checkpoint supports random-order autoregressive sampling, as well as a hybrid continuous-then-discrete sampler using as few as $T{=}48$ total steps -- without distillation or retraining.

2602.15338 2026-05-22 cs.LG cs.CL Updated version

Discovering Implicit Large Language Model Alignment Objectives

Edward Chen, Sanmi Koyejo, Carlos Guestrin

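Affiliations * Stanford University

AI summary This paper proposes the Obj-Disco framework, which automatically decomposes alignment reward signals into interpretable objectives to address the shortcomings of existing methods, validates the framework's robustness across tasks and models, and uncovers latent misaligned incentives.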

Comments ICML 2026

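Details
AI-generated Chinese abstract (translated)

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating risks of misalignment and reward hacking. Existing interpretation methods typically rely on predefined rubrics, which may miss "unknown unknowns", or fail to identify objectives that comprehensively cover and causally influence model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of interpretable natural-language objectives. Our method uses an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating the candidate objectives that best explain the residual reward signal. Extensive evaluation across diverse tasks, model sizes, and alignment algorithms demonstrates the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures more than 90% of reward behavior, a finding further corroborated by human evaluation. In addition, a case study on alignment with an open-source reward model shows that Obj-Disco can identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a key tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.

English abstract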

Large language model (LLM) alignment relies on complex reward signals that often obscure the specific behaviors being incentivized, creating critical risks of misalignment and reward hacking. Existing interpretation methods typically rely on pre-defined rubrics, risking the omission of "unknown unknowns", or fail to identify objectives that comprehensively cover and are causal to the model behavior. To address these limitations, we introduce Obj-Disco, a framework that automatically decomposes an alignment reward signal into a sparse, weighted combination of human-interpretable natural language objectives. Our approach utilizes an iterative greedy algorithm to analyze behavioral changes across training checkpoints, identifying and validating candidate objectives that best explain the residual reward signal. Extensive evaluations across diverse tasks, model sizes, and alignment algorithms demonstrate the framework's robustness. Experiments with popular open-source reward models show that the framework consistently captures > 90% of reward behavior, a finding further corroborated by human evaluation. Additionally, a case study on alignment with an open-source reward model reveals that Obj-Disco can successfully identify latent misaligned incentives that emerge alongside intended behaviors. Our work provides a crucial tool for uncovering the implicit objectives in LLM alignment, paving the way for more transparent and safer AI development.

2601.10348 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Training-Trajectory-Aware Token Selection

基于训练轨迹的token选择

Zhanming Shen, Jiaqi Hu, Zeyu Qin, Hao Chen, Wentao Ye, Zenan Huang, Yihong Zhuang, Guoshan Lu, Junlin Zhou, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) Hong Kong University of Science and Technology(香港科技大学)

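AI summary This paper proposes T3S, which reconstructs the training objective at the token level to clear the optimization path for yet-to-learn tokens, improving performance in continual distillation; experiments show significant gains in both AR and dLLM settings.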

Comments Accepted by ICML 2026

详情
AI中文摘要

高效的蒸馏是将昂贵的推理能力转化为可部署效率的关键途径,然而在前沿领域中,当学生模型已具备较强的推理能力时,朴素的连续蒸馏往往产生有限的收益甚至退化。我们观察到一种训练特征现象:即使损失单调下降,所有性能指标在几乎相同的瓶颈处会突然大幅下降,然后逐渐恢复。我们进一步揭示了token层面的机制:置信度会分裂成稳步增加的模仿锚点token,快速锚定优化,以及尚未学习的token,其置信度被抑制直到瓶颈之后。这两种类型token无法共存的特性是连续蒸馏失败的根本原因。为此,我们提出了基于训练轨迹的token选择(T3S)方法,以在token层面重建训练目标,清除未学习token的优化路径。T3S在AR和dLLM设置中均取得一致的收益:仅用数百个示例,Qwen3-8B在竞争性推理基准上超越DeepSeek-R1,Qwen3-32B接近Qwen3-235B,且T3训练的LLaDA-2.0-Mini超越其AR基线,达到所有16B级模型中的最先进性能。

英文摘要

Efficient distillation is a key pathway for converting expensive reasoning capability into deployable efficiency, yet in the frontier regime where the student already has strong reasoning ability, naive continual distillation often yields limited gains or even degradation. We observe a characteristic training phenomenon: even as loss decreases monotonically, all performance metrics can drop sharply at almost the same bottleneck, before gradually recovering. We further uncover a token-level mechanism: confidence bifurcates into steadily increasing Imitation-Anchor Tokens that quickly anchor optimization and other yet-to-learn tokens whose confidence is suppressed until after the bottleneck. The inability of these two types of tokens to coexist is the root cause of the failure in continual distillation. To this end, we propose Training-Trajectory-Aware Token Selection (T3S) to reconstruct the training objective at the token level, clearing the optimization path for yet-to-learn tokens. T3S yields consistent gains in both AR and dLLM settings: with only hundreds of examples, Qwen3-8B surpasses DeepSeek-R1 on competitive reasoning benchmarks, Qwen3-32B approaches Qwen3-235B, and T3-trained LLaDA-2.0-Mini exceeds its AR baseline, achieving state-of-the-art performance among all 16B-scale no-think models.

2511.08093 2026-05-22 eess.AS cs.CL cs.SD 版本更新

Quantizing Whisper-small: How design choices affect ASR performance

对Whisper-small的量化:设计选择如何影响语音识别性能

Arthur Söhler, Julian Irigoyen, Andreas Søeborg Kirkedal

发表机构 * Copenhagen Business School(哥本哈根商学院) Danske Bank(丹麦银行) Jabra (GN Group)(Jabra(GN集团))

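AI summary This paper studies how different quantization schemes affect Whisper-small, finds that dynamic int8 quantization offers the best balance between model compression and recognition accuracy, and shows that carefully chosen quantization methods can substantially reduce model size and inference cost, enabling efficient deployment on constrained hardware.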

Comments Accepted to SPEAKABLE workshop at LREC 2026

详情
AI中文摘要

大型语音识别模型如Whisper-small虽然能实现高精度,但其高计算需求使其难以在边缘设备上部署。为此,我们提出了一种统一的跨库评估,评估了Whisper-small上的后训练量化(PTQ)方法,以分离量化方案、方法、粒度和位宽的影响。我们的研究基于四个库:PyTorch、Optimum-Quanto、HQQ和bitsandbytes。在LibriSpeech测试清洁和测试其他数据集上的实验表明,动态int8量化结合Quanto提供了最佳的权衡,将模型大小减少57%,同时在基线的词错误率上有所提升。静态量化表现较差,可能由于Whisper的Transformer架构,而更激进的格式(如nf4、int3)在嘈杂条件下以牺牲准确性为代价实现了高达71%的压缩。总体而言,我们的结果表明,精心选择的PTQ方法可以在不重新训练的情况下显著减少模型大小和推理成本,从而在受限硬件上实现Whisper-small的高效部署。

英文摘要

Large speech recognition models like Whisper-small achieve high accuracy but are difficult to deploy on edge devices due to their high computational demand. To this end, we present a unified, cross-library evaluation of post-training quantization (PTQ) on Whisper-small that disentangles the impact of quantization scheme, method, granularity, and bit-width. Our study is based on four libraries: PyTorch, Optimum-Quanto, HQQ, and bitsandbytes. Experiments on LibriSpeech test-clean and test-other show that dynamic int8 quantization with Quanto offers the best trade-off, reducing model size by 57% while improving on the baseline's word error rate. Static quantization performed worse, likely due to Whisper's Transformer architecture, while more aggressive formats (e.g., nf4, int3) achieved up to 71% compression at the cost of accuracy in noisy conditions. Overall, our results demonstrate that carefully chosen PTQ methods can substantially reduce model size and inference cost without retraining, enabling efficient deployment of Whisper-small on constrained hardware.
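
The comparison above is reported in terms of word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch that computes WER as the word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the sample sentences are illustrative, and no text normalization is applied, which real ASR evaluations usually add.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion of a reference word
                    curr[j - 1] + 1,     # insertion of a hypothesis word
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: one substitution ("sat" -> "sit") and one deletion ("the").
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Comparing a quantized model against the baseline then amounts to averaging this quantity over the same test utterances for both models.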

2511.07885 2026-05-22 cs.DC cs.AI cs.CL cs.LG 版本更新

Intelligence per Watt: Measuring Intelligence Efficiency of Local AI

每瓦智能:衡量本地AI的智能效率

Jon Saad-Falcon, Avanika Narayan, Hakki Orhun Akengin, J. Wes Griffin, Herumb Shandilya, Adrian Gamarra Lafuente, Medhya Goel, Rebecca Joseph, Shlok Natarajan, Etash Kumar Guha, Shang Zhu, Ben Athiwaratkun, John Hennessy, Azalia Mirhoseini, Christopher Ré

发表机构 * Stanford University(斯坦福大学) Together AI

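AI summary This paper studies the energy efficiency and capability of local AI, proposes intelligence per watt (IPW) as a unified metric, shows that local inference can redistribute query demand away from centralized infrastructure, and reveals optimization headroom for local accelerators.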

详情
AI中文摘要

大型语言模型(LLM)查询主要由集中式云基础设施中的前沿模型处理。需求增长比提供商能够扩展的速度更快。两项进展创造了重新思考这一范式的机会:小型本地LM(<=20B活跃参数)在许多任务上能与前沿模型竞争性地表现,而本地加速器(如Apple M4 Max)可以以交互延迟支持这些模型。这引发了问题:本地推理能否在能源受限的设备上有效重新分配需求?这需要测量本地LM是否能准确回答现实查询以及是否在能源受限的设备上高效。我们提出了智能每瓦(IPW),即任务准确度每单位功率,作为衡量本地推理能力与效率的统一指标。我们评估了20多个最先进的本地LM、8种硬件加速器(本地和云)以及100万条现实单轮聊天和推理查询。对于每个查询,我们测量了准确性(本地LM对前沿模型的胜率)、能耗、延迟和功率。我们发现三个关键结果。首先,本地LM成功回答了88.7%的这些查询,准确性因领域而异。其次,2023-2025年的纵向分析显示IPW提高了5.3倍,由算法和加速器的改进驱动,本地可服务查询覆盖范围从23.2%增加到71.3%。第三,本地加速器在相同模型上实现的IPW至少比云加速器低1.4倍,揭示了本地加速器优化的巨大潜力。这些发现表明,本地推理可以对集中式基础设施的大量查询需求进行有意义的重新分配,IPW是跟踪这一转变的关键指标。

英文摘要

Large language model (LLM) queries are predominantly processed by frontier models in centralized cloud infrastructure. Demand growth strains this paradigm faster than providers can scale. Two advances create an opportunity to rethink it: small, local LMs (<=20B active parameters) now achieve competitive performance to frontier models on many tasks, and local accelerators (e.g., Apple M4 Max) can host these models at interactive latencies. This raises the question: can local inference viably redistribute demand from centralized infrastructure? This requires measuring both whether local LMs can accurately answer real-world queries and whether they can do so efficiently on power-constrained devices (e.g., laptops). We propose intelligence per watt (IPW), task accuracy per unit of power, as a unified metric for the capability and efficiency of local inference across model-accelerator configurations. We evaluate 20+ state-of-the-art local LMs, 8 hardware accelerators (local and cloud), and 1M real-world single-turn chat and reasoning queries. For each query, we measure accuracy (local LM win rate against frontier models), energy, latency, and power. We find three key results. First, local LMs successfully answer 88.7% of these queries, with accuracy varying by domain. Second, longitudinal analysis from 2023-2025 shows IPW improved 5.3x, driven by both algorithmic and accelerator advances, with locally-serviceable query coverage rising from 23.2% to 71.3%. Third, local accelerators achieve at least 1.4x lower IPW than cloud accelerators running identical models, revealing significant headroom for local accelerator optimization. These findings demonstrate that local inference can meaningfully redistribute demand from centralized infrastructure for a substantial subset of queries, with IPW serving as the critical metric for tracking this transition.
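
The abstract defines intelligence per watt (IPW) as task accuracy per unit of power. A minimal sketch of that ratio follows, assuming average power is derived from measured energy and latency. The function names and all numbers are hypothetical and are not taken from the paper.

```python
def average_power_watts(energy_joules: float, latency_seconds: float) -> float:
    """Average power drawn while answering a query (watts = joules / seconds)."""
    if latency_seconds <= 0:
        raise ValueError("latency must be positive")
    return energy_joules / latency_seconds


def intelligence_per_watt(accuracy: float, power_watts: float) -> float:
    """IPW as described in the abstract: task accuracy per unit of power."""
    if power_watts <= 0:
        raise ValueError("power must be positive")
    return accuracy / power_watts


# Hypothetical configurations with the same accuracy but different power draw.
local_power = average_power_watts(energy_joules=400.0, latency_seconds=10.0)   # 40 W
cloud_power = average_power_watts(energy_joules=1000.0, latency_seconds=5.0)   # 200 W
local_ipw = intelligence_per_watt(accuracy=0.8, power_watts=local_power)
cloud_ipw = intelligence_per_watt(accuracy=0.8, power_watts=cloud_power)
```

Because the metric is a ratio, a configuration can improve IPW either by raising accuracy at fixed power or by lowering power at fixed accuracy.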

2510.07962 2026-05-22 cs.CL cs.AI 版本更新

LightReasoner: Can Small Language Models Teach Large Language Models Reasoning?

LightReasoner: 小型语言模型能否教会大型语言模型推理?

Jingyuan Wang, Yankai Chen, Zhonghang Li, Chao Huang

发表机构 * University of Hong Kong(香港大学) University of Chicago(芝加哥大学)

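AI summary This paper proposes the LightReasoner framework, which uses the behavioral divergence between a stronger expert model and a weaker amateur model to identify high-value reasoning moments, improving the reasoning ability of large language models while reducing resource consumption.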

Comments Updated to ACL 2026 camera-ready version with improved method presentation, expanded related work discussion, additional analyses, and presentation refinements

详情
AI中文摘要

大型语言模型(LLMs)在推理任务上取得了显著进展,通常通过监督微调(SFT)实现。然而,SFT过程资源消耗大,依赖大规模定制数据集、拒绝采样演示和对所有token的统一优化,尽管只有少量token具有实际学习价值。在本工作中,我们探索了一个反直觉的想法:小型语言模型(SLMs)能否通过揭示高价值推理时刻来教会大型语言模型(LLMs)其独特优势?我们提出了LightReasoner,一种新的框架,利用强专家模型(LLM)与弱业余模型(SLM)之间的行为差异。LightReasoner分为两个阶段:(1)采样阶段通过专家-业余对比确定关键推理时刻,并构建捕捉专家优势的监督示例;(2)微调阶段将专家模型与这些提炼出的示例对齐,放大其推理优势。在七个数学基准测试中,LightReasoner将准确性提高了28.1%,同时将时间消耗减少了90%,采样问题减少了80%,调优token使用减少了99%,且不依赖真实标签。通过将弱SLMs转化为有效的教学信号,LightReasoner提供了一种可扩展且资源高效的提升LLM推理能力的方法。代码可在:https://github.com/HKUDS/LightReasoner获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable progress in reasoning, often through supervised fine-tuning (SFT). However, SFT is resource-intensive, relying on large curated datasets, rejection-sampled demonstrations, and uniform optimization across all tokens, even though only a fraction carry meaningful learning value. In this work, we explore a counterintuitive idea: can smaller language models (SLMs) teach larger language models (LLMs) by revealing high-value reasoning moments that reflect the latter's unique strength? We propose LightReasoner, a novel framework that leverages the behavioral divergence between a stronger expert model (LLM) and a weaker amateur model (SLM). LightReasoner operates in two stages: (1) a sampling stage that pinpoints critical reasoning moments and constructs supervision examples capturing the expert's advantage through expert-amateur contrast, and (2) a fine-tuning stage that aligns the expert model with these distilled examples, amplifying its reasoning strengths. Across seven mathematical benchmarks, LightReasoner improves accuracy by up to 28.1%, while reducing time consumption by 90%, sampled problems by 80%, and tuned token usage by 99%, all without relying on ground-truth labels. By turning weaker SLMs into effective teaching signals, LightReasoner offers a scalable and resource-efficient approach for advancing LLM reasoning. Code is available at: https://github.com/HKUDS/LightReasoner
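
The sampling stage is described as pinpointing critical reasoning moments through expert-amateur contrast. One way to picture this is to flag the generation steps where the two models' next-token distributions diverge most. The sketch below assumes KL divergence with a fixed threshold as the contrast measure; that choice, the function names, and the toy distributions are assumptions for illustration, since the abstract does not specify the criterion.

```python
import math
from typing import List, Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float], eps: float = 1e-12) -> float:
    """KL(p || q) between two next-token distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def select_critical_steps(
    expert_dists: List[Sequence[float]],
    amateur_dists: List[Sequence[float]],
    threshold: float,
) -> List[int]:
    """Return indices of steps where the expert diverges strongly from the amateur."""
    return [
        step
        for step, (p, q) in enumerate(zip(expert_dists, amateur_dists))
        if kl_divergence(p, q) > threshold
    ]


# Toy two-token vocabulary: the models agree at step 0 and disagree at step 1.
expert = [[0.5, 0.5], [0.9, 0.1]]
amateur = [[0.5, 0.5], [0.1, 0.9]]
critical = select_critical_steps(expert, amateur, threshold=0.5)
```

Only the flagged steps would then feed the fine-tuning stage, which is consistent with the abstract's claim that a small fraction of tokens carries most of the learning value.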

2508.03865 2026-05-22 cs.CL 版本更新

An Entity Linking Agent for Question Answering

用于问答任务的实体链接代理

Yajie Luo, Yihong Wu, Muzhi Li, Jia Ao Sun, Xinyu Wang, Liheng Ma, Yingxue Zhang, Jian-Yun Nie

发表机构 * The Chinese University of Hong Kong(香港中文大学) McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克AI研究院) Huawei Noah’s Ark Lab(华为诺亚实验室)

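AI summary This paper proposes an LLM-based entity linking agent for resolving entities in short, ambiguous user questions in question answering, and validates its effectiveness through two experiments.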

Comments 12 pages, 2 figures

详情
AI中文摘要

一些问答(QA)系统依赖知识库(KB)来提供准确答案。实体链接(EL)在将自然语言提及链接到KB条目中起着关键作用。然而,大多数现有的EL方法是为长上下文设计的,无法在问答任务中有效处理短且模糊的用户问题。我们提出了一种用于问答任务的实体链接代理,基于一个模拟人类认知流程的大语言模型。该代理主动识别实体提及、检索候选实体并做出决策。为了验证我们代理的有效性,我们进行了两项实验:基于工具的实体链接和问答任务评估。结果证实了我们代理的鲁棒性和有效性。

英文摘要

Some Question Answering (QA) systems rely on knowledge bases (KBs) to provide accurate answers. Entity Linking (EL) plays a critical role in linking natural language mentions to KB entries. However, most existing EL methods are designed for long contexts and do not perform well on short, ambiguous user questions in QA tasks. We propose an entity linking agent for QA, based on a Large Language Model that simulates human cognitive workflows. The agent actively identifies entity mentions, retrieves candidate entities, and makes decisions. To verify the effectiveness of our agent, we conduct two experiments: tool-based entity linking and QA task evaluation. The results confirm the robustness and effectiveness of our agent.

2507.03674 2026-05-22 cs.CL cs.AI 版本更新

STRUCTSENSE: A Task-Agnostic Agentic Framework for Structured Information Extraction with Human-In-The-Loop Evaluation and Benchmarking

STRUCTSENSE:一种任务无关的代理框架,用于结构化信息提取,具有人机协同评估和基准测试

Tek Raj Chhetri, Yibei Chen, Puja Trivedi, Dorota Jarecka, Saif Haobsh, Patrick Ray, Lydia Ng, Satrajit S. Ghosh

发表机构 * McGovern Institute for Brain Research, Massachusetts Institute of Technology, Cambridge, MA, USA(麦戈文脑科学研究所,麻省理工学院,马萨诸塞州剑桥市) Fylo Labs Inc., New York, NY, USA(Fylo实验室公司,纽约州纽约市) Allen Institute for Brain Science, Seattle, WA, USA(艾伦脑科学研究所,华盛顿州西雅图市)

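AI summary This paper proposes the STRUCTSENSE framework, which integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust structured information extraction, and demonstrates cross-task generalization across three domains.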

详情
AI中文摘要

从科学文献中提取结构化信息对于加速发现至关重要,但大型语言模型(LLMs)在需要专家知识的专门领域表现不佳,且在跨任务泛化方面表现差。我们引入STRUCTSENSE,一种模块化、任务无关、开源的框架,整合了本体引导的符号知识、代理自我评估细化和人机协同验证,以实现领域感知的稳健提取。我们在三个递增语义复杂度的任务上评估STRUCTSENSE:基于模式的评估工具提取(91-100%准确率)、从科学论文中提取元数据和资源(86-93%总体准确率)以及从神经科学文献中进行命名实体识别(NER)(58-75%标签准确率,共8,882个实体)。在两个生物医学NER基准(NCBI疾病和S800物种)上,系统实现了≥90%的宽松召回率和62.5-85.8%的严格召回率,同时提取了1,000-3,600个额外实体。本地概念映射服务在严格匹配下达到62-82%的Hits@1,在语义匹配下达到68-86%。这些结果在三个领域展示了STRUCTSENSE跨任务泛化的能力,同时保持了源地和可追溯性透明度。

英文摘要

Extracting structured information from scientific literature is critical for accelerating discovery, yet Large Language Models (LLMs) often struggle in specialized domains that require expert knowledge and generalize poorly across tasks. We introduce StructSense, a modular, task-agnostic, open-source framework that integrates ontology-guided symbolic knowledge, agentic self-evaluative refinement, and human-in-the-loop validation for robust domain-aware extraction. We evaluate StructSense on three tasks of increasing semantic complexity: schema-based extraction of assessment instruments (91-100% accuracy), metadata and resource extraction from scientific papers (86-93% overall), and named entity recognition (NER) from neuroscience literature (58-75% label accuracy across 8,882 entities). On two biomedical NER benchmarks (NCBI Disease and S800 Species), the system achieves ≥90% relaxed recall and 62.5-85.8% strict recall while extracting 1,000-3,600 additional entities beyond gold annotations. The local concept mapping service achieves Hits@1 of 62-82% under strict matching and 68-86% under semantic matching. These results across three domains demonstrate that StructSense generalizes across tasks while maintaining source grounding and provenance transparency.

2506.19500 2026-05-22 cs.AI cs.CL cs.LG 版本更新

NaviAgent: Graph-Driven Bilevel Planning for Scalable Tool Orchestration

NaviAgent: 一种基于图的双层规划用于可扩展的工具编排

Yan Jiang, Hao Zhou, Lizhong GU, Tianlong Li, Ruinan Jin, Wanqi Zhou, Ai Han

发表机构 * Department of Electrical and Computer Engineering, The Ohio State University, USA(电气与计算机工程系,俄亥俄州立大学,美国)

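AI summary This paper proposes NaviAgent, a graph-driven bilevel planning framework that decouples task planning from tool execution to improve the scalability and robustness of large-scale tool orchestration; experiments show improved task success rates on benchmarks and on real-world APIs.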

Comments Accepted to ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML), 2026
AI中文摘要

大型语言模型(LLMs)越来越多地作为功能调用代理,通过调用外部工具来处理超出其静态知识的任务。然而,它们通常逐个调用工具,缺乏对任务结构的整体视图。由于工具之间往往相互依赖,这导致了错误累积和可扩展性差,尤其是在扩展到数百或数千个工具时。为了解决这些限制,我们提出了NaviAgent,一种显式的双层架构,通过基于工具关系的图建模来解耦任务规划与工具执行。在规划层,基于LLM的代理决定是否直接回应、澄清意图或检索并执行独立于工具间复杂度的工具链。在执行层,工具世界导航模型(TWNM)编码工具之间的结构和行为关系,引导代理生成可扩展且鲁棒的调用序列。通过整合真实工具交互的反馈,NaviAgent实现了规划与执行之间的闭环对齐,使代理能够在大规模工具生态系统中实现自适应导航。在API-Bank和ToolBench上的评估显示,任务成功率(TSR)有持续改进,TWNM在复杂任务上平均提升13.1个百分点。进一步在50个真实API跨7个领域的测试中,展示了4.3-12.0个百分点的持续收益,步骤更少且延迟更低,证明了其在真实世界动态下的鲁棒泛化能力。

英文摘要

Large Language Models (LLMs) increasingly act as function-call agents that invoke external tools to tackle tasks beyond their static knowledge. However, they typically invoke tools one at a time without a global view of task structure. As tools often depend on one another, this leads to error accumulation and poor scalability, particularly when scaling to hundreds or thousands of tools. To address these limitations, we propose NaviAgent, an explicit bilevel architecture that decouples task planning from tool execution through graph-based modeling of tool relations. At the planning level, the LLM-based agent decides whether to respond directly, clarify intent, or retrieve and execute a toolchain independent of inter-tool complexity. At the execution level, a Tool World Navigation Model (TWNM) encodes structural and behavioral relations among tools, steering the agent to compose scalable and robust invocation sequences. Incorporating feedback from real tool interactions, NaviAgent achieves closed-loop alignment between planning and execution, enabling adaptive navigation in large-scale tool ecosystems. Evaluations on API-Bank and ToolBench show consistent improvements in task success rate (TSR), with TWNM yielding an average gain of 13.1 points on complex tasks. Further tests on 50 real APIs across 7 domains show consistent gains of 4.3-12.0 points, with fewer steps and lower latency, demonstrating robust generalization under real-world dynamics.

2502.09487 2026-05-22 cs.CL cs.AI cs.LG 版本更新

Internal narratives parameterise affective states

内部叙事参数化情感状态

Jakub Onysk, Quentin J. M. Huys

发表机构 * Applied Computational Psychiatry Lab(应用计算精神病学实验室) Max Planck UCL Centre for Computational Psychiatry and Ageing Research(马克斯·普朗克UCL计算精神病学与衰老研究中心) Queen Square Institute of Neurology and Mental Health(女王广场神经病学与心理健康研究所) Neuroscience Department(神经科学系) Division of Psychiatry(精神病学系)

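AI summary This paper studies the relationship between narrative and affective state by quantifying participants' internal narratives through large-language-model representations and their subspaces, finds that verbal descriptions of symptom-specific thoughts predict standardised depression scores, and stresses that preserving the covariance between symptoms is essential for construct validity.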

详情
AI中文摘要

描述我们如何用语言表达感受对于心理评估和干预至关重要,但叙事与情感状态之间的映射仍然理解不足。在两个大规模研究(n=1257)中,我们通过大语言模型表示及其子空间量化了参与者内部叙事的结构和动态,以参数化抑郁状态。在第一项研究中,我们发现对特定症状的描述性思维捕捉了预测标准化、自我报告抑郁评分的细粒度信息。关键的是,我们显示保持症状之间的特定协方差对于构效效度至关重要,这表明高维文本表示镜像了疾病的潜在几何结构。第二项研究探讨了这种关系的时间动态,当参与者与情感叙事互动时。我们发现量化内部叙事的变化导致自我报告的变化,而基线叙事严重性预测了后续情感变化的幅度。通过将情感视为计算状态,我们的结果强调了其核心、治疗相关功能:约束内部叙事的结构并整合上下文以塑造自我报告。

英文摘要

Characterising how we verbalise our feelings is central to psychological assessment and intervention, yet the mapping between narrative and affective state remains poorly understood. Across two large studies (n=1257), we parameterised the structure and dynamics of depressive states by quantifying participants' internal narratives through large-language-model representations and their subspaces. In Study 1, we found verbal descriptions of symptom-specific thoughts captured granular information predictive of standardised, self-reported depression scores. Critically, we show preserving the specific covariance between symptoms is essential for construct validity, suggesting high-dimensional text representations mirror the latent geometry of the disorder. Study 2 probed the temporal dynamics of this relationship as participants engaged with emotional narratives. We found quantified changes in internal narratives led to changes in self-report, while the baseline narrative severity predicted the magnitude of subsequent affective change. By framing affect as a computational state, our results highlight its core, therapeutically pertinent functions: constraining the structure of internal narratives and integrating context to shape self-report.

2410.04753 2026-05-22 cs.AI cs.CL cs.LG cs.LO 版本更新

ImProver: Agent-Based Automated Proof Optimization

ImProver:基于代理的自动证明优化

Riyaz Ahuja, Jeremy Avigad, Prasad Tetali, Sean Welleck

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

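AI summary This paper studies the problem of automated proof optimization and proposes ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary criteria such as length or readability; experiments show it can make proofs substantially shorter, more modular, and more readable.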

Comments Published as a conference paper at ICLR 2025

详情
AI中文摘要

大型语言模型(LLMs)已被用于在证明助手如Lean中生成数学定理的正式证明。然而,我们通常希望根据不同的下游用途优化正式证明,例如使其符合某种风格、易于阅读、简洁或模块化。适当优化的证明对于学习任务也非常重要,尤其是因为人工撰写的证明可能不适用于此目的。为此,我们研究了一个新的问题:自动证明优化,即重写证明以使其正确并优化任意标准,如长度或可读性。作为自动证明优化的一种初步方法,我们提出了ImProver,这是一个能够重写证明以优化任意用户定义指标的大型语言模型代理。我们发现直接应用LLMs进行证明优化效果有限,并在ImProver中引入了各种改进,例如新颖的链式状态技术中的符号Lean上下文使用,以及错误校正和检索。我们测试了ImProver在重写真实世界中的本科、竞赛和研究级数学定理方面的性能,发现ImProver能够重写证明使其显著更短、更模块化和更易读。

英文摘要

Large language models (LLMs) have been used to generate formal proofs of mathematical theorems in proof assistants such as Lean. However, we often want to optimize a formal proof with respect to various criteria, depending on its downstream use. For example, we may want a proof to adhere to a certain style, or to be readable, concise, or modularly structured. Having suitably optimized proofs is also important for learning tasks, especially since human-written proofs may not be optimal for that purpose. To this end, we study a new problem of automated proof optimization: rewriting a proof so that it is correct and optimizes for an arbitrary criterion, such as length or readability. As a first method for automated proof optimization, we present ImProver, a large-language-model agent that rewrites proofs to optimize arbitrary user-defined metrics in Lean. We find that naively applying LLMs to proof optimization falls short, and we incorporate various improvements into ImProver, such as the use of symbolic Lean context in a novel Chain-of-States technique, as well as error-correction and retrieval. We test ImProver on rewriting real-world undergraduate, competition, and research-level mathematics theorems, finding that ImProver is capable of rewriting proofs so that they are substantially shorter, more modular, and more readable.

2403.03920 2026-05-22 cs.AI cs.CL cs.HC 版本更新

Enhancing Instructional Quality: Leveraging Computer-Assisted Textual Analysis to Generate In-Depth Insights from Educational Artifacts

提升教学质量:利用计算机辅助文本分析从教育资料中生成深入见解

Zewei Tian, Min Sun, Alex Liu, Shawon Sarkar, Jing Liu

发表机构 * University of Washington(华盛顿大学) University of Maryland(马里兰大学)

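AI summary This paper explores the potential of computer-assisted textual analysis to improve instructional quality through in-depth analysis of educational artifacts. Using Richard Elmore's Instructional Core Framework, it examines how AI and machine learning methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement, and it identifies key advantages of AI/ML in teacher coaching, student support, and content development.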

详情
AI中文摘要

本文探讨了计算机辅助文本分析在通过教育资料的深入分析提升教学质量的变革潜力。我们整合Richard Elmore的Instructional Core Framework,以探讨人工智能(AI)和机器学习(ML)方法,特别是自然语言处理(NLP),如何分析教育内容、教师话语和学生回答,以促进教学改进。通过在Instructional Core Framework内的全面回顾和案例研究,我们识别出AI/ML整合在教师指导、学生支持和内容开发中的关键优势。我们揭示出模式,表明AI/ML不仅简化了行政任务,还引入了新的个性化学习路径,为教育工作者提供可操作的反馈,并有助于更深入地理解教学动态。本文强调了将AI/ML技术与教学目标对齐的重要性,以在教育环境中实现其全部潜力,倡导一种平衡的方法,考虑伦理问题、数据质量和人类专业知识的整合。

英文摘要

This paper explores the transformative potential of computer-assisted textual analysis in enhancing instructional quality through in-depth insights from educational artifacts. We integrate Richard Elmore's Instructional Core Framework to examine how artificial intelligence (AI) and machine learning (ML) methods, particularly natural language processing (NLP), can analyze educational content, teacher discourse, and student responses to foster instructional improvement. Through a comprehensive review and case studies within the Instructional Core Framework, we identify key areas where AI/ML integration offers significant advantages, including teacher coaching, student support, and content development. We unveil patterns that indicate AI/ML not only streamlines administrative tasks but also introduces novel pathways for personalized learning, providing actionable feedback for educators and contributing to a richer understanding of instructional dynamics. This paper emphasizes the importance of aligning AI/ML technologies with pedagogical goals to realize their full potential in educational settings, advocating for a balanced approach that considers ethical considerations, data quality, and the integration of human expertise.

2402.11621 2026-05-22 cs.CL 版本更新

Decoding News Narratives: A Critical Analysis of Large Language Models in Framing Detection

解码新闻叙述:对大型语言模型在框架检测中的关键分析

Valeria Pastorino, Jasivan A. Sivakumar, Nafise Sadat Moosavi

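AI summary This paper studies large language models for framing detection, evaluates several models under zero-shot, few-shot, and explanation-based prompting, shows that performance is sensitive to prompt design and prone to systematic errors on ambiguous cases, and introduces a new dataset of out-of-domain news headlines for more realistic evaluation.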

详情
Journal ref
Proceedings of the 3rd Workshop on Natural Language Processing for Political Sciences (PoliticalNLP 2026) @ LREC 2026, pages 17-28
AI中文摘要

随着新闻报道的复杂性和多样性增加,框架分析已成为计算社会科学中的关键但具有挑战性的任务。传统方法,包括手动标注和微调模型,仍然受到高标注成本、领域特定性和不一致泛化能力的限制。基于指令的大型语言模型(LLMs)提供了一个有前景的替代方案,但它们在框架分析中的可靠性尚不充分。本文系统评估了几个LLMs,包括GPT-3.5/4、FLAN-T5和Llama 3,在零样本、少样本和基于解释的提示设置下的表现。聚焦于领域转移和固有的标注模糊性,我们显示模型性能高度敏感于提示设计,并且在模糊案例中容易出现系统性错误。尽管LLMs,特别是GPT-4,表现出更强的跨领域泛化能力,但它们也显示出系统性偏见,最值得注意的是倾向于将情感语言与框架混淆。为了在现实世界的话题多样性下实现原则性评估,我们引入了一种新的跨领域新闻标题数据集。最后,通过分析多个模型在现有框架数据集上的一致性模式,我们证明了跨模型共识为识别争议标注提供了一个有用的信号,为低资源环境下的数据集审计提供了一种实用方法。

英文摘要

The growing complexity and diversity of news coverage have made framing analysis a crucial yet challenging task in computational social science. Traditional approaches, including manual annotation and fine-tuned models, remain limited by high annotation costs, domain specificity, and inconsistent generalisation. Instruction-based large language models (LLMs) offer a promising alternative, yet their reliability for framing analysis remains insufficiently understood. In this paper, we conduct a systematic evaluation of several LLMs, including GPT-3.5/4, FLAN-T5, and Llama 3, across zero-shot, few-shot, and explanation-based prompting settings. Focusing on domain shift and inherent annotation ambiguity, we show that model performance is highly sensitive to prompt design and prone to systematic errors on ambiguous cases. Although LLMs, particularly GPT-4, exhibit stronger cross-domain generalisation, they also display systematic biases, most notably a tendency to conflate emotional language with framing. To enable principled evaluation under real-world topic diversity, we introduce a new dataset of out-of-domain news headlines covering diverse subjects. Finally, by analysing agreement patterns across multiple models on existing framing datasets, we demonstrate that cross-model consensus provides a useful signal for identifying contested annotations, offering a practical approach to dataset auditing in low-resource settings.

2401.00139 2026-05-22 cs.AI cs.CL cs.LG stat.ME 版本更新

Enhancing Causal Reasoning in Large Language Models: A Causal Attribution Model for Precision Fine-Tuning

增强大语言模型中的因果推理:一种用于精确微调的因果归因模型

Hengrui Cai, Shengjie Liu, Rui Song

发表机构 * University of California, Irvine(加州大学尔湾分校) North Carolina State University(北卡罗来纳州立大学) Amazon(亚马逊公司)

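AI summary This paper proposes a causal attribution model that improves the interpretability and causal reasoning of large language models through precise fine-tuning, and demonstrates its effectiveness on causal discovery tasks across different domains.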

Comments A Python implementation of our proposed method is available at https://github.com/ncsulsj/Causal_LLM

详情
AI中文摘要

本文介绍了一种因果归因模型,旨在通过精确微调增强大语言模型(LLMs)的可解释性并提高其因果推理能力。尽管LLMs在多种任务中表现出色,但其推理过程往往仍是一个黑箱,限制了有针对性的增强。我们提出了一种新的因果归因模型,利用“do-运算符”构建干预场景,使我们能够系统地量化LLMs因果推理过程中不同组件的贡献。通过在各种领域中进行因果发现任务来评估所提出的归因分数,我们证明了LLMs在因果发现中的有效性严重依赖于提供的上下文和领域特定知识,但也可以利用数值数据进行有限的相关性推理,而非因果性。这促使了所提出的微调LLM用于成对因果发现,有效且正确地利用了知识和数值信息。

英文摘要

This paper introduces a causal attribution model to enhance the interpretability of large language models (LLMs) and improve their causal reasoning abilities via precise fine-tuning. Despite LLMs' proficiency in diverse tasks, their reasoning processes often remain a black box, which restricts targeted enhancement. We propose a novel causal attribution model that utilizes "do-operators" for constructing interventional scenarios, allowing us to quantify the contribution of different components in LLMs' causal reasoning process systematically. By assessing the proposed attribution scores through causal discovery tasks across various domains, we demonstrate that LLMs' effectiveness in causal discovery heavily relies on provided context and domain-specific knowledge but can also utilize numerical data with limited calculations in correlation, not causation. This motivates the proposed fine-tuned LLM for pairwise causal discovery, effectively and correctly leveraging both knowledge and numerical information.

1607.06330 2026-05-22 cs.CL 版本更新

La representación de la variación contextual mediante definiciones terminológicas flexibles

通过灵活的术语定义实现语境变化的表示

Antonio San Martín

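AI summary This thesis studies how flexible terminological definitions can reflect the contextual variation of specialized concepts in the environmental domain, proposes the flexible terminological definition, and analyzes the effects of contextual variation on terminological definitions through an empirical study.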

Comments PhD Thesis. in Spanish. University of Granada. 2016

详情
AI中文摘要

在本博士论文中,我们应用认知语言学的原理到术语定义,并提出了一种称为灵活术语定义的提案。这包括由一组同一概念的定义组成,包括一个一般定义(在此情况下,涵盖整个环境领域)以及额外的定义,从相关子领域的角度描述该概念。由于语境是构建词汇单位(包括术语)意义的关键因素,我们假设术语定义可以并且应该反映语境的影响,尽管定义传统上被视为与任何语境因素无关的意义表达。本论文的主要目标是分析语境变化对专业环境概念的影响,以它们在术语定义中的表示为视角。具体而言,我们专注于基于主题限制的语境变化。为了完成本博士论文的目标,我们进行了实证研究,包括对一组具有语境变化的概念进行分析,并为其中两个概念创建灵活的定义。在我们实证研究的第一部分,我们将我们的领域依赖性语境变化概念划分为三种不同的现象:调制、视角化和子概念化。这些现象是叠加的,即所有概念都经历调制,一些概念还经历视角化,最后,少数概念还经历子概念化。在第二部分,我们应用这些概念到术语定义,并提出了如何构建灵活定义的指南,从知识提取到定义的实际写作。

英文摘要

In this doctoral thesis, we apply premises of cognitive linguistics to terminological definitions and present a proposal called the flexible terminological definition. This consists of a set of definitions of the same concept made up of a general definition (in this case, one encompassing the entire environmental domain) along with additional definitions describing the concept from the perspective of the subdomains in which it is relevant. Since context is a determining factor in the construction of the meaning of lexical units (including terms), we assume that terminological definitions can, and should, reflect the effects of context, even though definitions have traditionally been treated as the expression of meaning void of any contextual effect. The main objective of this thesis is to analyze the effects of contextual variation on specialized environmental concepts with a view to their representation in terminological definitions. Specifically, we focused on contextual variation based on thematic restrictions. To accomplish the objectives of this doctoral thesis, we conducted an empirical study consisting of the analysis of a set of contextually variable concepts and the creation of a flexible definition for two of them. As a result of the first part of our empirical study, we divided our notion of domain-dependent contextual variation into three different phenomena: modulation, perspectivization and subconceptualization. These phenomena are additive in that all concepts experience modulation, some concepts also undergo perspectivization, and finally, a small number of concepts are additionally subjected to subconceptualization. In the second part, we applied these notions to terminological definitions and we presented guidelines on how to build flexible definitions, from the extraction of knowledge to the actual writing of the definition.